Wednesday, December 28, 2016

Migrating the extra large NetBeans Mercurial repository to Git

Note: This article is a living document and will be updated as I learn new useful information (last update 31st December 2016). I will move the helper scripts to a dedicated repository and copy part of this article into the Apache NetBeans wiki.

Introduction

The NetBeans source code has been stored in a Mercurial repository for almost a decade now.

But starting October 2016 NetBeans is preparing to become an Apache project.

And all incubating projects must store their source code on Apache Software Foundation infrastructure, which only provides Subversion or Git hosting.

So, NetBeans must migrate its Mercurial repository to Git.

Size concerns

The NetBeans Mercurial repository covers 17 years of history and has grown to over 3GB.

Apache projects are mirrored on GitHub, which has a repository size limit of 2GB or so. As such, every discussion about the migration started with ways of reducing the size, either by splitting up the repository or by removing some of the history.

Luckily, it turns out the NetBeans Mercurial server was just using a really old Mercurial version. When Gregory Szorc looked into it, we learned that with the format.generaldelta=true and format.aggressivemergedeltas=true settings the repository drops to about 1GB.
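
One way such settings could be applied to an existing clone is to re-clone it with the pull protocol, so that every revision is written again in the new storage format. This is only a sketch: the helper name reclone_generaldelta and the paths are made up for illustration, and it is not necessarily the procedure that was used on the NetBeans server.

```shell
# Hypothetical helper (not from the original article): re-clone a Mercurial
# repository so that the copy is stored with the generaldelta settings.
#   $1 = source repository, $2 = destination directory
reclone_generaldelta() {
    src=$1
    dst=$2
    # --pull makes hg re-add every changeset instead of copying the store
    # files as they are, so the new format settings take effect.
    hg --config format.generaldelta=true \
       --config format.aggressivemergedeltas=true \
       clone --pull "$src" "$dst"
}

# Usage (placeholder paths):
#   reclone_generaldelta ~/releases ~/releases.generaldelta
```

Comparing the size of the .hg folder in the two clones then shows how much the new format saves.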

This was great news but, of course, we still have to migrate to Git.

Under Git we have to make sure some compression is applied. This is done with the git gc command, which also brings the repository to under 1GB.
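
To check the result against the size limit, the packed size can be read from git count-objects. A minimal sketch, where the helper name packed_size_kib is my own:

```shell
# Hypothetical helper: print the disk space used by Git packs, in KiB.
# "git count-objects -v" reports it on the "size-pack:" line.
packed_size_kib() {
    git count-objects -v | awk '/^size-pack:/ { print $2 }'
}

# Usage, inside the converted repository:
#   git gc --aggressive --prune=now
#   packed_size_kib
```

A value below roughly 2000000 KiB would mean the repository fits under the 2GB or so mentioned above.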

With size out of the way, we can do a straight migration and preserve our mono repository and the whole history.

The case of the corrupted repository

The most important NetBeans repository is releases. This repository holds the release branches such as release82 as well as the current state in the default branch. The main-silver default branch is periodically pushed into the releases/ default branch.

A direct conversion of releases/ is impossible though because the repository is corrupt:

$ hg verify
checking changesets
checking manifests
crosschecking files in changesets and manifests
checking files
 applemenu/src/org/netbeans/modules/applemenu/layer.xml@?: rev 12 points to unexpected changeset 149753
 (expected 149755)
 defaults/src/org/netbeans/modules/defaults/Eclipse-keybindings-mac.xml@?: rev 0 points to unexpected changeset 149753
 (expected 149755)
 defaults/src/org/netbeans/modules/defaults/Eclipse-keybindings.xml@?: rev 25 points to unexpected changeset 149753
 (expected 149755)
 defaults/src/org/netbeans/modules/defaults/mf-layer.xml@?: rev 74 points to unexpected changeset 149753
 (expected 149755)
192754 files, 313961 changesets, 1122263 total revisions
4 warnings encountered!
4 integrity errors encountered!

Luckily, the corruption seems to be limited to the default branch, which is also available from main-silver.

So, we can get a valid releases/ repository by first making a main-silver clone and then pulling the missing changesets from releases:

mkdir releases.fixed
cd releases.fixed
hg init .
hg pull http://hg.netbeans.org/main-silver
hg pull http://hg.netbeans.org/releases
hg out http://hg.netbeans.org/releases
# nothing should be displayed here
hg verify

hg-fast-export all the way

Now that we have a valid repository we just follow the steps in the official documentation about migrating from Mercurial:

git clone http://repo.or.cz/r/fast-export.git /tmp/fast-export
git init ~/git-releases
cd ~/git-releases
/tmp/fast-export/hg-fast-export.sh -r ~/releases.fixed
git gc --aggressive --prune=now

and then wait 48 hours for it to finish!

... but first: removing the unnamed heads

Once you start hg-fast-export.sh you'll notice it fails early with:

Error: repository has at least one unnamed head: hg rXXXX

This happens because Git, unlike Mercurial, does not support unnamed branches.

It's not a big problem for the NetBeans repository because there are very few such commits and they are basically historical mistakes with no relevance.

I have just removed them altogether with hg strip.
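
A sketch of how the offending branches could be located before stripping. It assumes the error corresponds to a named branch having more than one topological head, and the helper name branches_with_extra_heads is made up for illustration:

```shell
# Hypothetical helper: print every branch name that has more than one
# topological head in the current Mercurial repository.
branches_with_extra_heads() {
    # One line per head holding its branch name; uniq -d keeps the names
    # that occur more than once.
    hg heads --topo --template '{branch}\n' | sort | uniq -d
}

# Usage (REV is a placeholder for the head you decide to remove):
#   branches_with_extra_heads
#   hg heads --topo default
#   hg --config extensions.strip= strip -r REV
```

The strip command lives in the strip extension, which is why it is enabled on the command line here. Stripping is destructive, so it is best done on a copy such as releases.fixed.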

Incremental push

Although we have world class internet speed in Romania, I happened to be on a slow connection when the conversion finished. And it is no fun to restart a git push after 400MB have already been uploaded and the connection dropped!

I fixed this by pushing the history incrementally, one month at a time:

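echo "Incremental git push"

for year in 2012 2013 2014 2015 2016; do
    for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
        # Last commit on master made before the 1st of this month
        SHA=$(git rev-list -1 --before="$year-$month-01 12:00" master)
        # Nothing to push if there is no commit before that date
        [ -n "$SHA" ] || continue
        echo "$SHA $year-$month"
        # The full ref name lets the push work even before master exists on the remote
        git push origin "$SHA:refs/heads/master"
    done
done

# Push whatever is left after the last monthly step
git push origin master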

Syncing with the old Mercurial repository

Right now my GitHub repositories are just experimental. The real work is still done in the Mercurial repository. As such, I still have to convert the new commits from Mercurial to Git.

hg-fast-export seems designed with this in mind. The -r parameter, which specifies the source repository, is only needed the first time. After that it may be skipped and hg-fast-export will incrementally convert the missing changesets.

So, it's just a matter of:

cd ~/releases.fixed
hg pull
cd ~/git-releases
/tmp/fast-export/hg-fast-export.sh
git push

Note that a pull on releases/ will bring back the stripped commits, so they have to be stripped again before running hg-fast-export.

Saving hg-fast-export state

I did run into a deadlock while incrementally converting with hg-fast-export.

The only solution seemed to be to redo the conversion.

At 48 hours per full repository this doesn't seem like fun, so I recommend periodically saving these files from the .git folder: hg2git-heads, hg2git-mapping, hg2git-marks and hg2git-state.
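
A minimal sketch of such a backup, copying the four hg2git-* state files into a timestamped folder. The helper name backup_hg2git_state and the backup location are assumptions of mine:

```shell
# Hypothetical helper: copy the hg-fast-export state files out of .git.
#   $1 = Git repository, $2 = folder that collects the backups
# Prints the folder that was created for this backup.
backup_hg2git_state() {
    repo=$1
    backup_root=$2
    dest="$backup_root/hg2git-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dest" || return 1
    for name in hg2git-heads hg2git-mapping hg2git-marks hg2git-state; do
        # -p keeps the modification times of the original files
        cp -p "$repo/.git/$name" "$dest/" || return 1
    done
    echo "$dest"
}

# Usage (placeholder paths), after each successful incremental run:
#   backup_hg2git_state ~/git-releases ~/hg2git-backups
```

I assume a saved state only makes sense together with a Git repository at the same point of the conversion, so it is worth noting the master commit alongside each backup.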

Make sure not to ignoreCase

I discovered that on macOS core.ignoreCase is true, which means that a changeset that only changes the case of a file name is converted into an incorrect Git commit.

So on macOS the option needs to be explicitly set to false:

git config core.ignorecase false



2 comments:

rcoacci said...

Seems like a good candidate for ESR's reposurgeon (http://www.catb.org/esr/reposurgeon/).

Ernest said...

Visualisation of the NetBeans repository is available here: https://youtu.be/KTX7-PrbeNc.
