Hash buckets, rsync, and xargs magic

At work we have a couple of directories that are organized as two-deep hash buckets, totaling 65536 directories [1]. Traversing a tree like this, e.g. with find . -type f, takes ages, and the sheer number of directories also makes rsync use a lot of memory while it builds its file list.
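If you want to experiment with the commands below without touching real data, a tree of the same shape is easy to generate locally. A minimal sketch (the /tmp/buckets path is just an example; the brace expansion needs bash):

mkdir -p /tmp/buckets && cd /tmp/buckets
for i in {{0..9},{a..f}}{{0..9},{a..f}}; do mkdir -p "$i"/{{0..9},{a..f}}{{0..9},{a..f}}; done

After this, ls | wc -l reports 256 and find . -type d | wc -l reports 65793 (the base directory plus 256 plus 65536).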

One way to solve this is to work on a single top-level directory at a time instead of all 256 directories (each containing 256 directories of their own). For example, this runs rsync once per directory, which dramatically decreases rsync's workload and works pretty well:

for i in *; do rsync -a "$i/" "server:/path/to/dest/$i/"; done

Note the trailing slash on the source: it tells rsync to copy the contents of $i into the destination directory, so re-running the loop stays idempotent instead of nesting a second copy of $i inside the first.
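One possible variation: record which buckets fail (for example because the connection drops) so they can be retried later. A sketch along the same lines; the failed.txt name is just an example:

for i in *; do rsync -a "$i/" "server:/path/to/dest/$i/" || echo "$i" >> failed.txt; done

Re-running rsync on only the directories listed in failed.txt is then cheap, since rsync skips files that already match on the destination.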

With xargs, the serial loop above can be parallelized. Since -I already runs one command per input line, the following keeps 8 rsync processes going until all 256 directories have been copied over:

ls | xargs -P 8 -I% rsync -a %/ server:/path/to/dest/%/

I tried with 32, 16, and then 8 parallel processes. In my case a -P value greater than 10 causes xargs to explode trying to create that many rsync processes. I haven't figured out why, but it really doesn't matter: with 8 running in parallel, the disk and network should be pretty well saturated anyway.
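If individual top-level buckets are still too big to be useful units of work, the same trick can be pushed one level down to the 65536 second-level directories. A sketch, assuming the same destination as above and using rsync's -R (--relative) so the two-level path is recreated on the receiver:

printf '%s\n' */* | xargs -P 8 -I% rsync -aR % server:/path/to/dest/

Because --relative preserves the 00/ab part of each source path, every job lands in the right place on the destination regardless of the order in which the jobs finish. (printf is a shell builtin, so the 65536-entry glob doesn't run into the kernel's argument-length limit.)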

[1] The base directory has 256 directories named 00 through ff, each of which contains 256 directories 00 through ff of its own: 256^2 = 65536 directories.
