Hash buckets, rsync, and xargs magic

At work we have a couple of directories organized as two-deep hash buckets, totaling 65536 directories [1]. That is a ton of directories, and traversing them, e.g. with find . -type f, takes ages. The structure also causes rsync to use a lot of memory.

One way to solve this is to work on one top-level directory at a time instead of all 256 at once (each contains 256 directories of its own). For example, the following runs rsync once per top-level directory, which dramatically decreases rsync's workload and works pretty well:

for i in *; do rsync -a "$i"/ server:/path/to/dest/"$i"/; done

With xargs the serial process above can be parallelized. The following keeps 8 rsync processes running at a time until all 256 directories have been copied over:

ls | xargs -P 8 -I% rsync -a %/ server:/path/to/dest/%/
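
If the directory names might ever contain spaces or other odd characters, parsing ls output gets fragile. Below is a minimal sketch of the same idea wrapped in a function that feeds xargs null-delimited names from find. The function name sync_buckets and its arguments are made up for illustration, and it assumes a find and xargs that support -print0, -0 and -P (GNU and BSD both do). The trailing slashes make rsync copy each bucket's contents into the same-named directory on the destination instead of nesting it one level deeper.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): sync every top-level
# bucket directory in SRC to DEST, running up to JOBS rsync processes at once.
# DEST must be an absolute local path or a remote spec such as
# server:/path/to/dest, because the function changes directory first.
sync_buckets() {
    src=$1 dest=$2 jobs=${3:-8}
    (
        cd "$src" || exit 1
        # -mindepth/-maxdepth 1 restricts find to the top-level buckets;
        # -print0 together with xargs -0 keeps unusual names intact.
        find . -mindepth 1 -maxdepth 1 -type d -print0 |
            xargs -0 -P "$jobs" -I% rsync -a %/ "$dest"/%/
    )
}
```

Calling it as sync_buckets /path/to/src server:/path/to/dest 8 would be the rough equivalent of the one-liner above.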

I tried 32, 16, then 8 parallel processes. In my case a -P value above 10 causes xargs to explode trying to create that many rsync processes. I haven’t figured out why, but it really doesn’t matter: with 8 running in parallel, the disk and network should be pretty well saturated anyway.
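
Since the failures at higher -P values are unexplained, it may help to record which directories did not copy so they can be retried later. The following is only a sketch: the function name sync_buckets_logged, its arguments, and the log file are all hypothetical. It wraps each rsync in sh -c so that a failing transfer appends the directory name to a log instead of getting lost in the parallel output.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): like the xargs one-liner,
# but appends the name of every bucket whose rsync fails to FAILED_LOG.
# DEST and FAILED_LOG should be absolute paths (DEST may also be a remote
# spec), because the function changes directory before running rsync.
sync_buckets_logged() {
    src=$1 dest=$2 jobs=$3 failed_log=$4
    : > "$failed_log"    # start with an empty log
    (
        cd "$src" || exit 1
        # DEST and LOG are passed through the environment so the inner
        # shell script needs no quoting tricks.
        ls | DEST=$dest LOG=$failed_log xargs -P "$jobs" -I% sh -c '
            rsync -a "$1"/ "$DEST/$1"/ || echo "$1" >> "$LOG"
        ' sh %
    )
}
```

After a run, an empty log means every bucket copied. Otherwise the names in it can be fed back through the same function or through a plain serial loop.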

[1] The base directory has 256 directories, 00 – ff, each of which contains 256 directories of its own, also named 00 – ff. 256^2 = 65536 directories.