Sampling Data from the Command Line

My command line fu ain’t that great, but the upside is that I often come across neat tools that elicit deep feelings of excitement and wonder. While I hate to set the bar that high here, recently I came across shuf, which promises to be a useful addition to my toolbelt.

Say you have a large file, and you want to take a peek at it. One way is to just call head or tail, but sometimes the data is ordered in such a way that the top and bottom aren’t particularly interesting and don’t give you a good picture of the data. Or maybe you want to perform some operations on this file in, say, R, but you don’t want to have to read the whole thing into memory only to then discard most of it.

This is where shuf comes in. Its basic operation is to randomly permute the lines of a file, which can be specified by name as an argument or fed in through standard input. You can let it return a permutation of all the lines, or you can select a random subset of the lines. The latter is what we’re interested in here. The operation goes like this:

shuf -n10 my_data.csv

Above, we select 10 random lines from my_data.csv.
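If you want to poke at the different modes without touching real data, here is a minimal sketch. The throwaway file and the variable names are my own inventions for illustration; only shuf and its -n flag come from the example above.

```shell
# Build a small throwaway file to experiment with (hypothetical data).
sample_file="$(mktemp)"
seq 1 1000 > "$sample_file"

# Sample 10 lines, naming the file as an argument.
from_file="$(shuf -n 10 "$sample_file")"

# Same idea, but feeding the data in through standard input.
from_stdin="$(shuf -n 10 < "$sample_file")"

# With no -n, shuf returns a permutation of every line.
permuted="$(shuf "$sample_file")"

# Show one of the samples; the lines will differ from run to run.
printf '%s\n' "$from_file"
```

The sample is drawn without replacement, so no input line shows up twice in the output.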

So how’s shuf work under the hood? Well, I’m not a C guy, but from my reading, if your file is small enough, shuf reads it into memory, selects random line numbers, then writes those lines out. But if you’re dealing with a large file and only asking for n lines, shuf uses a trick called reservoir sampling. It reads in the first n lines from the file to establish a “reservoir” of samples. Then each additional line replaces a randomly chosen line in the reservoir with a decreasing probability (in the textbook version of the algorithm, line i gets in with probability n/i). In the end though, every line has the same chance of being selected, so it’s a simple random sample.
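To make the reservoir idea concrete, here is a rough sketch of the textbook algorithm written with awk. To be clear, this is not shuf’s actual C code; the reservoir_sample function and its argument order are hypothetical, and I’m assuming an awk with the usual srand and rand built-ins.

```shell
# reservoir_sample N [FILE]: print N lines chosen uniformly at random.
# Hypothetical helper for illustration; not how shuf itself is written.
reservoir_sample() {
    n="$1"
    shift
    awk -v n="$n" '
        BEGIN { srand() }
        # Fill the reservoir with the first n lines.
        NR <= n { reservoir[NR] = $0; next }
        # Line NR lands in a random slot with probability n/NR.
        {
            j = int(rand() * NR) + 1
            if (j <= n) reservoir[j] = $0
        }
        # Print whatever the reservoir holds at end of input.
        END {
            count = (NR < n) ? NR : n
            for (i = 1; i <= count; i++) print reservoir[i]
        }
    ' "$@"
}

# Pick 10 of 100000 lines while holding only 10 in memory at a time.
seq 1 100000 | reservoir_sample 10
```

The nice part is that the input is read exactly once and never stored in full, which is why the approach suits large files. One caveat of this sketch: srand() seeds from the clock, so two runs within the same second may return the same sample.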

There’s one problem with my above example though: the header is lost. But it’s easy enough to re-attach it. If you’re just grabbing a few lines, you can echo the header and then the sampled lines:

echo "$(head -n1 my_data.csv; tail -n+2 my_data.csv | shuf -n 10)"

The tail call is there to ensure that we don’t sample the header, thus duplicating it in the body of the file.

If you’d rather write to a file, you can simply redirect the output of the above, or you can break the process into two parts and dispense with echo:

head -n1 my_data.csv > downsampled.csv
tail -n+2 my_data.csv | shuf -n 100000 >> downsampled.csv
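If you find yourself doing this a lot, the two steps can be wrapped in a small shell function. This is just a sketch: the sample_with_header name and the demo file are made up for illustration, and it assumes the header is exactly the first line and that each record sits on a single line.

```shell
# sample_with_header FILE N: print the header of FILE, then N random data rows.
# Hypothetical helper; assumes a one-line header and one record per line.
sample_with_header() {
    file="$1"
    n="$2"
    head -n 1 "$file"
    tail -n +2 "$file" | shuf -n "$n"
}

# Small demo file (made-up contents) so the sketch runs on its own.
demo_csv="$(mktemp)"
printf 'id,value\n' > "$demo_csv"
seq 1 500 | sed 's/$/,x/' >> "$demo_csv"

# Header plus 20 sampled rows, written to a second temporary file.
demo_out="$(mktemp)"
sample_with_header "$demo_csv" 20 > "$demo_out"
```

With the real file, the call would look like sample_with_header my_data.csv 100000 > downsampled.csv.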

Of course, if your data has no header, you can freely ignore the extra steps.