Updating the first commit in a git repo

I started a git repo in a VM without configuring an author and wanted to update all the commits, including the first one, which is a bit elusive.

The StackOverflow post Change first commit of project with git? almost nails it. For my use-case, or git version, I found Step 4 to be slightly incorrect. It states to use <commit after changed>, but I got it to work with the same hash used in Step 2. For example, say you are on the master branch and you want to update the author information for the first commit:

$ git checkout -b rename
$ first_commit=$(git rev-list HEAD | tail -n 1)
$ git reset --hard ${first_commit}
$ git commit --amend --author "Roberto Aguilar " -C HEAD
$ git rebase --onto rename ${first_commit} master

Now that the first commit is correct, I use the trick outlined in my Bulk Updating Author in Git to finish up the job:

$ new_first_commit=$(git rev-list HEAD | tail -n 1)
$ GIT_EDITOR="sed -i -e 's/^pick/edit/'" git rebase -i ${new_first_commit}
$ while [ 1 ]; do git commit --amend -C HEAD --author='Roberto Aguilar <roberto@baremetal.io' && git rebase --continue || break; done;


Hash buckets, rsync, and xargs magic

At work we have a couple of directories that are organized as two-deep hash buckets, totaling 65536 directories [1]. This creates a ton of directories and traversing this, e.g. find . -type f, takes ages. This structure also causes rsync to take up a lot of memory.

One way to solve this is to work on a single directory at a time instead of all 256 directories (each containing 256 directories of their own). For example, this will run rsync once per directory which dramatically decreases rsync‘s work load and works pretty well:

for i in *; do rsync -a $i server:/path/to/dest/$i; done;

With xargs the serial process above can be parallelized. The following will continually process 8 directories until all 256 have been copied over:

ls | xargs -n 1 -P 8 -I% rsync -a % server:/path/to/dest/%

I tried with 32, 16, then 8 parallel processes. In my case a -P value more than 10 will cause xargs to explode trying to create that many rsync processes. I haven’t figured out why, but it really doesn’t matter. With 8 running in parallel, the disk and network should be pretty well saturated anyway.

[1] the base directory has 256 directories 00 – ff, which each have 256 00 – ff directories in them. 256^2 = 65536 directories.