Here’s a really great tour through some advanced Pandas features, by Kevin Markham of Data School.
Here are the tricks that he features:
- Show installed versions
- Create an example DataFrame
- Rename columns
- Reverse row order
- Reverse column order
- Select columns by data type
- Convert strings to numbers
- Reduce DataFrame size
- Build a DataFrame from multiple files (row-wise)
- Build a DataFrame from multiple files (column-wise)
- Create a DataFrame from the clipboard
- Split a DataFrame into two random subsets
- Filter a DataFrame by multiple categories
- Filter a DataFrame by largest categories
- Handle missing values
- Split a string into multiple columns
- Expand a Series of lists into a DataFrame
- Aggregate by multiple functions
- Combine the output of an aggregation with a DataFrame
- Select a slice of rows and columns
- Reshape a MultiIndexed Series
- Create a pivot table
- Convert continuous data into categorical data
- Change display options
- Style a DataFrame
- Bonus: Profile a DataFrame
My favorite tip is #25, on styling a DataFrame. The bonus tip on Pandas profiling is also pretty cool!
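To give a flavor of the list, here's a minimal sketch of one of the tricks above, reducing DataFrame size; the column names and data are made up for the example:

```python
import pandas as pd

# Hypothetical data: repeated strings and small integers stored wastefully
df = pd.DataFrame({
    "city": ["NY", "LA", "SF"] * 2000,   # repeated strings as object dtype
    "count": [1, 2, 3] * 2000,           # small ints stored as int64
})

before = df.memory_usage(deep=True).sum()

# Store repeated strings as a categorical and downcast the integers
df["city"] = df["city"].astype("category")
df["count"] = pd.to_numeric(df["count"], downcast="unsigned")

after = df.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

The categorical dtype stores each distinct string once, which pays off whenever a column has few unique values relative to its length.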
A Jupyter notebook with example usage is available on GitHub.
If you’re hungry for more best practices in Pandas, you can check out Kevin’s PyCon 2019 workshop presentation or his complete series of videos on YouTube.
Pandas is a widely used component of the scientific Python stack, and it is truly an indispensable part of the data scientist's toolkit. The name pandas is actually a portmanteau of panel and data. Of course, most of us are familiar with dataframes. But what's a panel?
Panel data are 3-dimensional. A very common example is a time series: imagine a dataset with the stock ticker (e.g., AAPL, MSFT, etc.) as the index defining the x axis and the price as the variable defining the y axis. A regular 2-dimensional dataframe works fine if you are only taking a cross-sectional snapshot of stock prices at one point in time. But the moment you want to look at patterns in price over the last few months, time becomes a new index defining the z axis. A panel is a data structure designed specifically to accommodate this.
Recently, the Pandas team announced the deprecation of the panel data structure (as of version 0.20.0). Instead, they encourage the use of dataframes with hierarchical indexing (MultiIndex). With a MultiIndex, you can easily work with 3-dimensional data in a dataframe; in fact, any number of dimensions becomes possible.
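As a quick sketch of the idea (the tickers, dates, and prices here are hypothetical):

```python
import pandas as pd

# Panel-shaped data: ticker x date, with price as the value
idx = pd.MultiIndex.from_product(
    [["AAPL", "MSFT"], pd.date_range("2019-01-01", periods=3)],
    names=["ticker", "date"],
)
prices = pd.DataFrame(
    {"price": [154.0, 155.2, 153.8, 101.1, 102.5, 101.9]}, index=idx
)

# Cross-sectional slice: every ticker on a single date
snapshot = prices.xs(pd.Timestamp("2019-01-02"), level="date")

# Time series for a single ticker
aapl = prices.loc["AAPL"]
```

The same `.xs()` and `.loc[]` slicing generalizes to any number of index levels, which is what makes the MultiIndex a full replacement for the old panel.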
MultiIndex is intuitive once you learn how to use it, but it can be tricky to wrap your head around it at first. Kevin Markham of the Data School released a great tutorial explaining how to use the MultiIndex in Pandas.
Read more about hierarchical indexing in the official Pandas documentation.
The QuantEcon tutorial site provides a “real-world” example that demonstrates the use of MultiIndex for analysis of 3-dimensional data.
Learn more about the use of large electronic datasets for pragmatic clinical trials and causal inference — expert epidemiologist and biostatistician Miguel Hernán from the Harvard School of Public Health shares his thoughts on this terrific podcast.
The Unix (and macOS) command line has tons of small standalone utilities that are really helpful for data-science work. The DataCamp podcast recently highlighted some of their favorite commands:
grep — search the contents of a file with regular expressions
head — print the first few lines of a file
find — search for files and directories, on steroids
sed — parse and transform streams of text
uniq — filter out adjacent duplicate lines (pair it with sort to deduplicate a whole file)
cat — concatenate files together
wc — count lines, words, and characters
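A few of these in action, using a throwaway file created just for the demo:

```shell
# Create a small throwaway file for the demo
printf 'error: disk full\ninfo: ok\nerror: disk full\n' > logs.txt

grep 'error' logs.txt        # lines matching a regular expression
head -n 2 logs.txt           # the first two lines
sort logs.txt | uniq         # uniq needs sorted input to deduplicate fully
wc -l logs.txt               # count the lines
```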
The Unix philosophy
The Unix philosophy advocates small, standalone utilities that do one task only and do it exceptionally well. These tools are designed to be used together, which is facilitated by the pipe.
The Pipe (|)
DataCamp gives the example of a data-science task that you might do in the Unix shell as a preprocessing step:
- Find all .csv files in a directory matching a particular regex;
- Concatenate them;
- Remove all duplicate rows;
- Order the rows by a particular field;
- Write the top 5 rows to a new file.
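Those steps can be sketched as a single pipeline. The file names and sort field below are made up, and `find -name` uses a glob here rather than a full regex:

```shell
# Hypothetical input: two small CSVs with an overlapping row
printf 'a,3\nb,1\na,3\n' > sales_1.csv
printf 'c,2\nd,5\n'      > sales_2.csv

# Find the CSVs, concatenate them, drop duplicate rows,
# sort descending by the second comma-separated field,
# and keep the top 5 rows
find . -name 'sales_*.csv' -print0 \
  | xargs -0 cat \
  | sort -u \
  | sort -t, -k2,2 -nr \
  | head -n 5 > top5.csv
```

Each stage reads and writes a stream, so nothing requires the whole dataset in memory at once.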
You don’t have to load a whole dataset into memory to operate on it. A Unix pipeline operates on streams of text, so it does not suffer from the same memory constraints as, say, loading a huge dataset into a Pandas dataframe. An additional benefit of streaming is that each stage of the pipeline runs in its own process, which gives you a degree of parallelization across CPU cores for free. Furthermore, it’s straightforward to do batch processing, where you repeat the same series of operations on many files.
How to learn
For more information and educational resources on the Unix shell, check out the following:
– Command line tricks for data scientists
– Software Carpentry: The Unix shell
– DataCamp: Intro to shell for data science
At the recent JupyterCon 2017 NYC, there were a few presentations that provided an update on development of JupyterLab.
The Next-Generation Jupyter Frontend
Brian Granger (Cal Poly San Luis Obispo), Chris Colbert (Project Jupyter), and Ian Rose (UC Berkeley) offer an overview of JupyterLab, which enables users to work with the core building blocks of the classic Jupyter Notebook in a more flexible and integrated manner.
Building a Powerful Data-Science IDE
JupyterLab provides a robust foundation for building flexible computational environments. Ali Marami explains how R-Brain leveraged the JupyterLab extension architecture to build a powerful IDE for data scientists, one of the few tools on the market that supports R and Python equally well for data science, with features such as IntelliSense, debugging, and environment and data views.
Ever wonder how to read, parse, and write CSV files in Python? This video tutorial from Corey Schafer has the answers.
I received a good recommendation on PythonistaCafe today to check out this video by Corey Schafer: CSV Module – How to Read, Parse, and Write CSV Files.
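A minimal sketch of what the video covers, using an in-memory string instead of a real file (the data are made up):

```python
import csv
import io

# Hypothetical CSV contents; in practice you'd use open("file.csv", newline="")
raw = "name,dept\nada,engineering\ngrace,research\n"

# Read rows as dictionaries keyed by the header row
rows = list(csv.DictReader(io.StringIO(raw)))

# Write the rows back out with csv.writer
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "dept"])
for row in rows:
    writer.writerow([row["name"], row["dept"]])
```

`DictReader` spares you from tracking column positions by hand, and `csv.writer` handles quoting and delimiters for you.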