Unix command-line tools for data science

The Unix command line (including macOS) offers many small, standalone utilities that are useful for data-science work. The DataCamp podcast recently highlighted some of its favorite commands:

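  • grep: searches the contents of files for lines that match a regular expression
  • head: prints the first few lines of a file (10 by default)
  • find: searches a directory tree for files by name, type, size, and other criteria
  • sed: a stream editor that transforms text line by line, for example by substitution
  • uniq: filters out duplicate lines; it only compares adjacent lines, so it is usually run on sorted input
  • cat: concatenates files and prints the result to standard output
  • wc: counts lines, words, and bytes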
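
As a rough illustration of the commands above, here is a minimal sketch that runs each one against a tiny sample file. The file name cities.csv, its contents, and the variable names are made up for this example; the flags used are common ones, not the only way to call each tool.

```shell
# Hypothetical sample file in a scratch directory.
demo=$(mktemp -d)
printf 'id,city\n1,Oslo\n2,Lima\n2,Lima\n3,Oslo\n' > "$demo/cities.csv"

# grep: print lines matching a regular expression (-c counts them instead)
grep 'Oslo$' "$demo/cities.csv"
oslo_count=$(grep -c 'Oslo$' "$demo/cities.csv")

# head: print the first N lines
head -n 2 "$demo/cities.csv"

# find: locate files by name, type, size, and so on
find "$demo" -type f -name '*.csv'

# sed: edit the stream line by line; here, replace Lima with LIMA
sed 's/Lima/LIMA/' "$demo/cities.csv"

# uniq: collapses adjacent duplicate lines, so sort first
n_unique=$(sort "$demo/cities.csv" | uniq | wc -l)

# cat: concatenate files to standard output
cat "$demo/cities.csv" "$demo/cities.csv" > "$demo/doubled.csv"

# wc: count lines (-l), words (-w), or bytes (-c)
n_lines=$(wc -l < "$demo/cities.csv")
n_doubled=$(wc -l < "$demo/doubled.csv")

echo "Oslo rows: $oslo_count, unique lines: $n_unique, lines: $n_lines"
```

Note that uniq on its own would miss duplicates that are not next to each other, which is why it is paired with sort here.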

The Unix philosophy

The Unix philosophy favors small, standalone utilities that each do one task and do it well. These tools are designed to be combined, and the pipe is what connects them.

The Pipe (|)

DataCamp gives the example of a data-science task that you might do in the Unix shell as a preprocessing step:

  1. Find all .csv files in a directory matching a particular regex;
  2. Concatenate them;
  3. Remove all duplicate rows;
  4. Order the rows by a particular field;
  5. Write the top 5 rows to a new .csv file.
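
The five steps above can be sketched as a single pipeline. This is one possible version under stated assumptions, not DataCamp's own command: the file names, the sales_ prefix in the regex, and the two-column layout without header rows are all invented for illustration. The -regex test is not part of POSIX find, but both GNU and BSD find provide it.

```shell
# Hypothetical sample data in a scratch directory, so the sketch is self-contained.
workdir=$(mktemp -d)
printf 'carol,72\nalice,90\nbob,85\n' > "$workdir/sales_2021.csv"
printf 'bob,85\ndave,60\nerin,95\nfrank,40\n' > "$workdir/sales_2022.csv"
printf 'zed,100\n' > "$workdir/notes.csv"   # does not match the regex below

# 1. find the .csv files whose full path matches the regex
# 2. concatenate them with cat
# 3. remove duplicate rows (sort -u compares whole lines)
# 4. order rows by field 2, numerically, largest first
# 5. write the top 5 rows to a new .csv file
find "$workdir" -type f -regex '.*/sales_[0-9]*\.csv' -exec cat {} + \
  | sort -u \
  | sort -t, -k2,2nr \
  | head -n 5 > "$workdir/top5.csv"

cat "$workdir/top5.csv"
```

Two separate sort calls are used on purpose: combining -u with a key would drop rows that merely share the same value in field 2, rather than rows that are identical.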

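Chained together with pipes, these steps become a single command: the pipe (|) sends the output of one command directly to the input of the next, so no intermediate files are needed. You also don’t have to load the whole dataset into memory to operate on it. Most of these tools process a stream of text line by line, so a pipeline does not hit the same memory limits as, say, loading a huge dataset into a Pandas dataframe. (Sorting is the exception, since sort must read all of its input before it can emit anything; implementations such as GNU sort spill to temporary files when the data outgrows memory.) An additional benefit is a degree of parallelism for free: each stage of a pipeline runs as its own process, so the operating system can schedule the stages on different CPU cores at the same time. Finally, batch processing is straightforward: a shell loop repeats the same series of operations over many files.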
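
Batch processing can be sketched with a plain for loop. The directory, the .txt inputs, and the .dedup output suffix below are hypothetical names chosen for this example.

```shell
# Hypothetical input files in a scratch directory.
batch=$(mktemp -d)
printf 'b\na\nb\n' > "$batch/one.txt"
printf 'z\nz\ny\nx\n' > "$batch/two.txt"

# Apply the same operation (sort and drop duplicate lines) to every .txt file.
for f in "$batch"/*.txt; do
  out="${f%.txt}.dedup"          # one.txt -> one.dedup
  sort -u "$f" > "$out"
  printf '%s: %s unique lines\n' "$(basename "$f")" "$(wc -l < "$out")"
done
```

The body of the loop can be any pipeline, so a preprocessing step worked out on one file carries over to a whole directory with little extra effort.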

How to learn

For more information and educational resources on the Unix shell, check out the following:

  • Command line tricks for data scientists
  • Software Carpentry: The Unix shell
  • DataCamp: Intro to shell for data science
