Tuesday, May 25, 2010

Keeping your skeletons in the closet when open sourcing code in git

In my day job (DNC Innovation Lab), my team and I received approval to open-source some code that was started well before I arrived there. It was all stored on an internal git server, so no one thought it would be that hard to do. You get cocky after slinging code around with git for awhile. That's what good tools do.

Unfortunately there were already passwords, API keys, and other things we couldn't release publicly in our git commit history.

No problem, we thought, we'll just take a snapshot of the code, remove the bits we can't open source, and then upload that to github as a new repo, devoid of all that messy history. Easy peasy.

Er, not so much. The problem is we wanted to maintain our internal branch, complete with git history, but open up development on the open source version (which comprised >99% of all the code) to github and thus outside collaborators. So we were going to be pushing and pulling to/from github, as well as merging into our internal branch. We couldn't let non-open-source code leak into github, but we also needed to merge the open sources changes into the internal version.

Git took a look at these two branches and decided they didn't have anything to do with each other because they had no common ancestry in the commit history. This was the appropriate response from git because we had purposefully removed the commit history of the github version.

I tried various methods of merging the two codebases, but git always generated conflicts left and right because it was attempting to merge two almost-but-not-quite identical codebases with no common ancestor commits.

Here's how I solved it (I hope). Let's say the internal branch is called "internal" and the open source branch is called "opensource". The commit history of the open source branch is one über-commit (X) followed by a couple small changes (Y & Z), which we'll represent as X <-- Y <-- Z. The commit history of the internal branch is pretty long, so we'll just abbreviate it as the three most recent commits, A <-- B <-- C. So here's how our two branches start out:

internal:   A <-- B <-- C
opensource: X <-- Y <-- Z


I decided to try merging the opensource branch with the internal branch using the "ours" merge strategy on the X commit. This merge strategy just discards the changes in the other branch and considers the two branches merged anyway. So I ran:


git checkout internal
git merge -s ours (sha1 of the X commit)


Then my commit history looked like this:


internal:   A <-- B <-- C <-- D
                             /
opensource:                 X <-- Y <-- Z


The new D commit was the "merge" between C and X, but a couple quick git diffs showed that my working tree was exactly the same as the C commit, and thus had discarded the changes from X. This is what I wanted because X represented almost the exact same codebase, except for the minor changes required to open source it.


Now I was in a position to merge new commits to the opensource branch into the internal branch. I ran this (still on the internal branch):

git merge opensource


Then my commit history looked like this:





internal:   A <-- B <-- C <-- D    <--    E
                             /           /
opensource:                 X <-- Y <-- Z

So now I can merge new commits to the opensource branch (coming from github or others on my internal team) into the internal branch, and changes that are internal only can be made directly to that branch.

I'll update this post if I run into any breakage due to this.