Top 25 Pandas tricks

Here’s a really great tour through some advanced Pandas features, by Kevin Markham of Data School.

Here are the tricks that he features:

  1. Show installed versions
  2. Create an example DataFrame
  3. Rename columns
  4. Reverse row order
  5. Reverse column order
  6. Select columns by data type
  7. Convert strings to numbers
  8. Reduce DataFrame size
  9. Build a DataFrame from multiple files (row-wise)
  10. Build a DataFrame from multiple files (column-wise)
  11. Create a DataFrame from the clipboard
  12. Split a DataFrame into two random subsets
  13. Filter a DataFrame by multiple categories
  14. Filter a DataFrame by largest categories
  15. Handle missing values
  16. Split a string into multiple columns
  17. Expand a Series of lists into a DataFrame
  18. Aggregate by multiple functions
  19. Combine the output of an aggregation with a DataFrame
  20. Select a slice of rows and columns
  21. Reshape a MultiIndexed Series
  22. Create a pivot table
  23. Convert continuous data into categorical data
  24. Change display options
  25. Style a DataFrame
  26. Bonus: Profile a DataFrame
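To give a flavor of the list, here is a minimal sketch of tricks 4 through 7. It is not taken from Kevin's notebook; the DataFrame and its column names (`name`, `price`, `qty`) are made up for illustration.

```python
import pandas as pd

# Hypothetical example data; the column names are invented for this sketch.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "price": ["1.5", "2.0", "-"],   # numbers stored as strings, one invalid
    "qty": [10, 20, 30],
})

# Trick 4: reverse row order, then reset the index so it starts at 0 again.
reversed_rows = df.loc[::-1].reset_index(drop=True)

# Trick 5: reverse column order.
reversed_cols = df.loc[:, ::-1]

# Trick 6: select columns by data type (only "qty" is numeric at this point).
numeric_only = df.select_dtypes(include="number")

# Trick 7: convert strings to numbers; values that cannot be parsed become NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
```

Passing `errors="coerce"` is one reasonable choice here; the default would raise an error on the `"-"` entry instead.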

My favorite tip is #25, on styling a dataframe. The bonus tip on Pandas profiling is also pretty cool!

A Jupyter notebook with example usage is available on GitHub.

If you’re hungry for more best practices in Pandas, you can check out Kevin’s PyCon 2019 workshop presentation or his complete series of videos on YouTube.

Great explanation of MultiIndex in Pandas

Pandas is a widely used component of the scientific Python stack, and it is an indispensable part of the data scientist’s toolkit. The name pandas is derived from the term panel data. Of course, most of us are familiar with dataframes. But what’s a panel?

Panel data are 3-dimensional. A very common example is a time series: imagine a dataset with the stock (e.g., AAPL, MSFT) as the index along the x axis and the price as the variable along the y axis. A regular 2-dimensional dataframe works fine if you only need a cross-sectional snapshot of stock prices at one point in time. But the moment you want to look at patterns in price over the last few months, time becomes a new index along the z axis. A panel is a data structure designed to accommodate this.

Recently, the Pandas team announced the deprecation of the Panel data structure (as of version 0.20.0). Instead, they are encouraging the use of dataframes with hierarchical indexing (MultiIndex). Using a MultiIndex, one can process 3-dimensional data in a dataframe, and indeed any number of dimensions becomes possible.
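As a rough sketch of the idea, the stock example above can be stored in an ordinary dataframe whose index has two levels. The prices below are invented for illustration, and the column names (`date`, `symbol`, `open`, `close`) are assumptions, not part of any real dataset.

```python
import pandas as pd

# Hypothetical prices, made up for this sketch.
records = [
    ("2019-01-02", "AAPL", 100.0, 101.0),
    ("2019-01-02", "MSFT", 50.0, 50.5),
    ("2019-01-03", "AAPL", 101.0, 102.5),
    ("2019-01-03", "MSFT", 50.5, 49.0),
]
df = pd.DataFrame(records, columns=["date", "symbol", "open", "close"])
df["date"] = pd.to_datetime(df["date"])

# Two index levels (date, symbol) plus the columns give three dimensions.
prices = df.set_index(["date", "symbol"]).sort_index()

# Cross-sectional snapshot: every stock on a single date.
one_day = prices.loc[pd.Timestamp("2019-01-02")]

# Time series: a single stock across all dates.
aapl = prices.xs("AAPL", level="symbol")

# Pivot one variable into a wide date-by-symbol table.
wide = prices["close"].unstack("symbol")
```

Adding a further index level (say, an exchange) would extend the same structure to a fourth dimension without changing the approach.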

MultiIndex is intuitive once you learn how to use it, but it can be tricky to wrap your head around at first. Kevin Markham of Data School released a great tutorial explaining how to use the MultiIndex in Pandas.

Read more about hierarchical indexing in the official Pandas documentation.

The QuantEcon tutorial site provides a “real-world” example that demonstrates the use of MultiIndex for analysis of 3-dimensional data.