Update to the Python distribution landscape

Cristian Medina recently provided an update on the state of application packaging and distribution in 2019.

Application distribution goes well beyond simple distribution of a module or library via PyPI, Anaconda Cloud, or other such channels. It must account for cross-platform differences, the possibility that Python itself is not available on the destination computer, uncertainty about whether needed libraries can be downloaded from the internet, and non-Python interoperability concerns (e.g., importing C code as well as Python). Additionally, there are matters of convenience and experience to consider, as some end-users really want an executable file and icon on their desktop that they can double-click to launch your application.

In addition to his article, Cristian was also interviewed in Episode #245 of the TalkPython podcast.

Technologies that he discusses include the following:

Top 25 Pandas tricks

Here’s a really great tour through some advanced Pandas features, by Kevin Markham of Data School.

Here are the tricks that he features:

  1. Show installed versions
  2. Create an example DataFrame
  3. Rename columns
  4. Reverse row order
  5. Reverse column order
  6. Select columns by data type
  7. Convert strings to numbers
  8. Reduce DataFrame size
  9. Build a DataFrame from multiple files (row-wise)
  10. Build a DataFrame from multiple files (column-wise)
  11. Create a DataFrame from the clipboard
  12. Split a DataFrame into two random subsets
  13. Filter a DataFrame by multiple categories
  14. Filter a DataFrame by largest categories
  15. Handle missing values
  16. Split a string into multiple columns
  17. Expand a Series of lists into a DataFrame
  18. Aggregate by multiple functions
  19. Combine the output of an aggregation with a DataFrame
  20. Select a slice of rows and columns
  21. Reshape a MultiIndexed Series
  22. Create a pivot table
  23. Convert continuous data into categorical data
  24. Change display options
  25. Style a DataFrame
  26. Bonus: Profile a DataFrame

My favorite tip is #25, on styling a dataframe. The bonus tip on Pandas profiling is also pretty cool!

A Jupyter notebook with example usage is available on GitHub.

If you’re hungry for more best practices in Pandas, you can check out Kevin’s PyCon 2019 workshop presentation or his complete series of videos on YouTube.
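A few of the tricks listed above can be sketched in a handful of lines. This is my own minimal illustration rather than Kevin's notebook code; the DataFrame and its column names are made up for the example.

```python
import pandas as pd

# Hypothetical example data (not taken from the original notebook).
df = pd.DataFrame({"col one": [1, 2, 3, 4], "col two": ["a", "b", "c", "d"]})

# Rename columns: replace spaces with underscores.
df.columns = df.columns.str.replace(" ", "_")

# Reverse row order, then renumber the index from zero.
reversed_rows = df.loc[::-1].reset_index(drop=True)

# Reverse column order.
reversed_cols = df.loc[:, ::-1]

# Select columns by data type (numeric columns only).
numeric_only = df.select_dtypes(include="number")

# Split a DataFrame into two random subsets: sample 75% of the rows,
# then drop those rows to obtain the remaining 25%.
part_1 = df.sample(frac=0.75, random_state=0)
part_2 = df.drop(part_1.index)
```

Fixing random_state makes the random split reproducible, which is handy when you want the same subsets on every run.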

ACPA “Courage Bears” a great success at Duke Cleft & Craniofacial Center

The Duke Cleft & Craniofacial Center was recently chosen to partner with the American Cleft Palate-Craniofacial Association (ACPA) for distribution of GUND teddy bears to patients with cleft lip/palate and craniofacial anomalies. A distinguishing feature of the Cleft Courage Bears is stitching in the lip to represent a repaired cleft lip. This is designed to bring comfort to patients who may feel anxious about their facial differences.

You may read more about the Cleft Courage Bears on the Duke Surgery blog and on the ACPA community blog.

ACPA Cleft Courage Bears

Note: Patient photographs utilized with parental permission and HIPAA release.

Great explanation of MultiIndex in Pandas

Pandas is a hugely popular component of the scientific Python stack, and it is truly an indispensable part of the data scientist’s toolkit. The name pandas is actually a portmanteau created from panel and data. Of course, most of us are familiar with dataframes. But what’s a panel?

Panel data are 3-dimensional. A very common example is a set of time series: Imagine a dataset with the stock (e.g., AAPL, MSFT, etc.) as the index defining the x axis and the price as the variable defining the y axis. A regular 2-dimensional dataframe works fine if you are only taking a cross-sectional snapshot of stock prices at one point in time. But the moment you want to look at patterns in price over the last few months, time becomes a new index defining the z axis. A panel is a specific data structure designed to accommodate this.

Recently, the Pandas team announced the deprecation of the panel data structure (as of version 0.20.0). Rather, they are encouraging the use of dataframes with hierarchical indexing (MultiIndex). Using a MultiIndex, one may easily process 3-dimensional data in a dataframe — and indeed, any number of dimensions becomes possible.
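To make this concrete, here is a minimal sketch of the stock example stored in a dataframe with a two-level MultiIndex. The dates, prices, and volumes are invented for illustration, and the column names (close, volume) are my own choice.

```python
import pandas as pd

# Two index levels (date, ticker) plus the columns give three dimensions.
# All values below are hypothetical.
index = pd.MultiIndex.from_product(
    [["2019-01-02", "2019-01-03"], ["AAPL", "MSFT"]],
    names=["date", "ticker"],
)
prices = pd.DataFrame(
    {"close": [10.0, 20.0, 11.0, 19.0], "volume": [100, 200, 150, 250]},
    index=index,
)

# Cross-section: every ticker on a single date.
one_day = prices.xs("2019-01-02", level="date")

# Time series: a single ticker across all dates.
aapl = prices.xs("AAPL", level="ticker")

# Pivot one variable into a date-by-ticker table.
close_wide = prices["close"].unstack("ticker")
```

Adding a further index level (say, an exchange) would extend the same structure to four dimensions without changing the approach.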

MultiIndex is intuitive once you learn how to use it, but it can be tricky to wrap your head around it at first. Kevin Markham of the Data School released a great tutorial explaining how to use the MultiIndex in Pandas.

Read more about hierarchical indexing in the official Pandas documentation.

The QuantEcon tutorial site provides a “real-world” example that demonstrates the use of MultiIndex for analysis of 3-dimensional data.

Apple Books update in iOS 12: a “love letter to readers”

MacStories reports that Apple has created a major update for iBooks, rebranding the app simply “Books” (or Apple Books, akin to Apple Music).

I highly recommend reading the full story, which was nicely written by Ryan Christoffel.

While the redesign itself is lovely (hurrah for a touch of skeuomorphism in the book spines and for the elegant use of typography and whitespace), what I’m really happy about is twofold: (1) the much improved organization (in the store, as well as for your own collection); and (2) the addition of Goodreads-like features! Yes, now you can keep track of your reading, build wishlists, and get suggestions for future titles to read.

Nice job, Apple! This reader appreciates the hard work you put into Books.

(Now, if only you could work on the Podcasts app…. but that’s another story.)

Unix command-line tools for data science

The Unix (and macOS) command line has tons of small standalone utilities that are really helpful for data-science work. The DataCamp podcast recently highlighted some of their favorite commands:

  • grep — use regular expressions to search contents of a file
  • head — print the first few rows of a file
  • find — search, on steroids
  • sed — parses and transforms streams of text
  • uniq — filters out adjacent duplicate lines in a text file (usually run on sorted input)
  • cat — concatenates files together
  • wc — word count

The Unix philosophy

The Unix philosophy favors small, standalone utilities that each do one task and do it exceptionally well. These tools are designed to be used together, which is facilitated by the pipe.

The Pipe (|)

DataCamp gives the example of a data-science task that you might do in the Unix shell as a preprocessing step:

  1. Find all .csv files in a directory matching a particular regex;
  2. Concatenate them;
  3. Remove all duplicate rows;
  4. Order the rows by a particular field;
  5. Write the top 5 rows to a new .csv file.

You don’t have to load a whole dataset into memory to operate on it. A Unix pipeline operates on streams of text, so most stages need far less memory than, say, loading a huge dataset into a Pandas dataframe. (Stages that must see all of their input before emitting anything, such as sort, are the exception.) An additional benefit is that each stage of a pipeline runs as its own process, so the stages can run concurrently on separate CPU cores. Furthermore, batch processing is straightforward when you need to repeat the same series of operations on multiple files.
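For readers more at home in Python than in the shell, the five steps above can be mimicked with generators, which pass one line at a time from stage to stage much as a pipe does. This is only an analogy, not the shell commands from the podcast. The function names are hypothetical, and the sketch assumes headerless CSV files with no quoted commas, a numeric sort field, and ascending order.

```python
import heapq
import re
from pathlib import Path


def matching_files(directory, pattern):
    """Step 1: yield .csv paths whose file names match the regex."""
    regex = re.compile(pattern)
    for path in sorted(Path(directory).glob("*.csv")):
        if regex.search(path.name):
            yield path


def concatenated_lines(paths):
    """Step 2: yield every line of every file, one line at a time."""
    for path in paths:
        with open(path) as handle:
            for line in handle:
                yield line.rstrip("\n")


def unique_lines(lines):
    """Step 3: drop duplicate rows (only rows already seen are remembered)."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


def top_rows(lines, field, n=5):
    """Step 4: the n smallest rows by a numeric field, without a full sort."""
    return heapq.nsmallest(n, lines, key=lambda line: float(line.split(",")[field]))


def run_pipeline(directory, pattern, field, out_path, n=5):
    """Chain the stages and write the result (step 5)."""
    stream = unique_lines(concatenated_lines(matching_files(directory, pattern)))
    rows = top_rows(stream, field, n)
    Path(out_path).write_text("".join(row + "\n" for row in rows))
    return rows
```

A call such as run_pipeline("data", r"^sales_\d+", field=1, out_path="top5.csv") would process a hypothetical data directory. Unlike a real shell pipeline, these stages all run in a single process, so the concurrency benefit does not carry over.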

How to learn

For more information and educational resources on the Unix shell, check out the following:
Command line tricks for data scientists
Software Carpentry: The Unix shell
DataCamp: Intro to shell for data science

Update on JupyterLab

At the recent JupyterCon 2017 NYC, there were a few presentations that provided an update on development of JupyterLab.

The Next-Generation Jupyter Frontend

Brian Granger (Cal Poly San Luis Obispo), Chris Colbert (Project Jupyter), and Ian Rose (UC Berkeley) offer an overview of JupyterLab, which enables users to work with the core building blocks of the classic Jupyter Notebook in a more flexible and integrated manner.

Building a Powerful Data-Science IDE

JupyterLab provides a robust foundation for building flexible computational environments. Ali Marami explains how R-Brain leveraged the JupyterLab extension architecture to build a powerful IDE for data scientists, one of the few tools in the market that evenly supports R and Python in data science and includes features such as IntelliSense, debugging, and environment and data view.

Geospatial analysis comes to Jupyter notebooks!

“GeoNotebook: An extension to the Jupyter Notebook for exploratory geospatial analysis,” presented by Chris Kotfila at JupyterCon 2017 NYC.

Chris Kotfila offers an overview of the GeoNotebook extension to the Jupyter Notebook, which provides interactive visualization and analysis of geospatial data. Unlike other geospatial extensions to the Jupyter Notebook, GeoNotebook includes a fully integrated tile server providing easy visualization of vector and raster data formats.

Why I gave up on Manuscripts and Ulysses and went back to Scrivener

Writing is an arduous task that requires discipline and perseverance. Writers choose tools that make this process easier, more effective, and more efficient. In this post, I discuss my search for the perfect writing app — from Scrivener to Ulysses to Manuscripts and back again.

It’s a common theme for researchers and productivity enthusiasts: finding (or creating) the perfect writing workflow. As for many others, this search has led me to try many different writing tools along the way, and in the end come around full-circle.

Why not just Microsoft Word?

Most everyone starts out using Microsoft Word, of course — and for good reason: it works reasonably well; it is pervasive and cross-platform; .docx is the de facto standard for file formats (e.g., for journal submissions); and it does indeed have an excellent Track Changes mode for versioning and collaboration.

At a certain point, however, most writers hunger for something better. For me, the main stumbling block with Word is that it is designed for writing in a linear fashion. The WYSIWYG design metaphor upon which it was based was not limited to mere typography but also extended to page layout. Word’s digital documents were made to mirror, skeuomorphically, their hardcopy relatives: a document as an ordered stack of pages.

The problem with this design metaphor is that the process of writing is rarely linear. Rather, like an artist who begins working with a sketch, the writer may start with an outline, or even a jumble of notes, jumping back and forth between sections as thoughts come to mind. Sometimes, reference material gets cut-n-pasted into relevant sections, resulting in clutter and mass disarray.

Surely, there must be a better solution!

Enter Scrivener

I first saw Scrivener in 2006. I was working in a lab at the time. I didn’t own a Mac, but my labmates did, and I stared longingly at some of the apps they were using. Truth be told, none of them used Scrivener. (Some of them used OmniFocus — or rather the Kinkless kGTD scripts for OmniOutliner — since OmniFocus per se hadn’t been created yet.) But they sparked my interest in the Mac, and I started looking at the platform more seriously.

As I started to browse apps that were available on the Mac but not on Windows, the one that made me the most jealous was Scrivener.

I was awestruck: Here was a writing application that was structured the way writers write: It allowed for outlining. It allowed for collecting reference material in a dedicated section, keeping it separate from the organically evolving draft. It allowed for brainstorming via notes. It invited reorganization, via notecards on a corkboard and drag-n-drop of outline elements in the binder. It had a (rudimentary) track-changes functionality via snapshots. And it could export to practically any format you needed, including Word.

When I finally became a Mac convert in 2010, Scrivener was among the first apps that I purchased immediately.

Why not Scrivener?

Despite my high enthusiasm for Scrivener, there were a few friction points. First, I preferred writing in plain text and the (then-new) Markdown format rather than in rich text. While some Markdown compatibility was added later, Scrivener remained a decidedly rich-text editor.

Second, around this time (2010-2013), cloud computing became a more realistic and practical thing. Whereas in the past I typically wrote manuscripts on one computer, there was a growing need to be able to work on documents wherever I was. This required the ability to edit the same file from multiple computers (iMacs and MacBook Pros). Scrivener’s file format is a ‘bundle’ that contains multiple files, and this always posed some challenge with synchronization over cloud services.

Third, during this same time period, the iPhone and iPad were growing in functionality. When I looked at the wonderful success of the OmniGroup’s “iPad or Bust!” venture, I couldn’t help but long for Scrivener to follow suit. Assuredly, Literature & Latte (the company that makes Scrivener) tried to bring Scrivener to iOS; sadly, this effort met with setback after setback. We fans waited as long as we could stand, but eventually, with no release date in sight, many of us were forced to look elsewhere.

Enter Ulysses

And so I discovered Ulysses. I can’t remember where exactly I heard about it… perhaps from Brett Terpstra, who worked on their .textbundle file spec; perhaps from the Mac Power Users podcast episode contrasting Scrivener and Ulysses; perhaps from Jeff Taekman’s WIPPP blog where he described his writing workflow.

In any case, I stumbled across Ulysses, and it was alluring. Here was an app that was Markdown-based (no rich text). It allowed me to create projects with sheets within it. I could jump from section to section and reorganize things easily. And best of all, the macOS and iOS versions synchronized flawlessly. While I was less fond of the one-library-containing-all-your-projects framework of Ulysses (in contrast to the more traditional separate-file-per-project approach of Scrivener), I was willing to make that tradeoff. (In fact, over time, I saw that there were some definite benefits to this library approach because you never had to wonder where your stuff was: It was simply in the Ulysses library.) Another downside that Ulysses had was its limited ability to store multimedia reference material. (Scrivener excels in this.) One last downside was difficulty in getting Ulysses to play friendly with LaTeX mathematical notation that I could export and visualize in Marked 2. (To be fair, Scrivener is not great for equations either.)

These downsides were not showstoppers, however. Before I knew it, I had switched from Scrivener to Ulysses full-time.

Manuscripts: A momentary distraction

For reference management, I use Papers, a marvelous app. The original creators of this app went on to create Findings, a digital lab notebook, and Manuscripts, a writing app designed by and for scientific writing. Was this the writing solution that I had been searching for?

The concept behind Manuscripts was sound. It had a Scrivener-like outline view as a navigation pane. You could rearrange sections easily. Reference management, figure management, table editing, and equation editing were all first-class citizens. Figures and tables could be inserted in-line where they made sense as you wrote the paper, and then exported as separate files (as may be required during manuscript submission to journals). Speaking of journals, Manuscripts allowed the user to format and reformat documents according to journal templates, simplifying the manuscript preparation process.

The problem was that the program was extremely buggy (think beta software, even though it was out of beta). The sheer number of bugs meant that there was no way I could depend on Manuscripts as my main writing environment. Instead, I decided to keep testing Manuscripts from time to time to see if any of the interval updates improved the situation. Indeed, some early updates fixed critical bugs, but the app was still far from useful. Heartbreakingly, development on the product seems to have withered away. Despite repeated affirmations from the developers via email that the product is still alive and that a “big update” is in the works, the Manuscripts website, blog, and Twitter account have not been appreciably updated since ~2015.

Complete. radio. silence…

So, for the time being, I stuck with Ulysses for all my serious writing.

(UPDATE 2017-12-29: Matias Piipari of the Manuscripts team just announced that Manuscripts will become a free, open-source app in 2018. This creates a lot of potential for its future, but it will still be a long time before it is ready for prime time.)

Scrivener for iOS finally arrives

While I was happily writing with Ulysses and enjoying its seamless sync functionality, Scrivener for iOS was released on July 20, 2016. I purchased it immediately, if only to see how it finally turned out after so many setbacks, and perhaps also to support a developer who deserved it. I was pleasantly surprised at how elegant it was. It preserved much of the functionality of the macOS version of Scrivener, translated it well to a touch environment, and refined some concepts along the way.

Certainly, this was the iOS version that everyone had hoped for. But there were two hangups, at least with version 1.0: Sync was tied to Dropbox only (not iCloud, Box, etc.). And because of the ‘bundle’ file format, sync couldn’t happen instantaneously behind the scenes, but rather had to be triggered manually.

Certainly, Scrivener for iOS tempted me in more ways than one. I continued to be fond of Scrivener — there is lots to love. Yet I decided to keep things simple and stick with Ulysses.

A monkey wrench: App subscription models

I don’t mind paying for software. I would easily drop $50-$100 for a real, hardcore piece of software for productivity and academic work. The great irony is that the company that once championed and enabled crafting of great indie software — Apple — also created the iPhone and App Store that effectively commoditized apps. As simple games and text editors became pervasive at $0.99, consumers began equating all apps as being cheap, easy, and replaceable. They expected that all apps should cost that little (or be free) — including incredibly complex writing studios, task managers, and productivity suites. The difficulty in allowing for upgrade pricing in the App Store resulted in an expectation that all future updates should be provided for free. In sum, there was a deterioration of the public’s appreciation for the value of software and of the time and effort that was required to create it.

The problem with app pricing has pushed many software developers to switch to different revenue-generating models. Freemium software (with in-app purchases) is one example. Ad-supported software is another. But what seems to be a trend nowadays is subscriptions. Consider a few notable titles that converted to subscription pricing: TextExpander; Day One; 1Password.

The latest to convert to subscription pricing? Ulysses.

These decisions are not taken lightly. Indie software developers care a great deal about their customers, and they want to treat them fairly. They also know that subscription models are unpopular, and that this type of change can provoke a severe backlash. Case in point: Ulysses’ ratings on the iOS and macOS App Stores plummeted from 5 stars to 2 stars immediately following the announcement, and Twitter was rife with acerbic criticism.

Personally, I totally understand software developers’ rationale for this switch, and I commiserate with them. It’s quite a dilemma. However, I am also personally opposed to enrolling in numerous monthly subscriptions. One app or service (Netflix, perhaps?) is okay. But with so many apps going this route, it’s easy to lose track — not only of how many apps you’re paying for or how much each costs, but also how to cancel them when you’re tired of paying for them. Katie Floyd and David Sparks discussed this challenge at length on Mac Power Users, with Katie going as far as having an Excel spreadsheet to keep track of everything. Do I want to do this? No way. Too confusing and annoying. Also, while subscription pricing might make sense for a cloud-based service (such as Dropbox or AWS storage), it feels like a strange fit for a desktop-based or tablet-based application.

Call me old fashioned, but I’m a loyal proponent of the old software model: I value software. I’m happy to pay for it in exchange for a solid major release and the promise of patches/fixes and minor feature releases over subsequent months or years. And I appreciate an upgrade price when it comes time for the next major release.

With Ulysses going subscription-based, I decided I wasn’t going to take the plunge into subscription-land. While my existing Ulysses apps on macOS and iOS worked just fine (for now), there were no promises that they would continue to work perfectly with forthcoming changes in macOS High Sierra or iOS 11 or beyond. This left me with an unsettled feeling.

Return of Scrivener

And so (it may come as no surprise), I checked in on Scrivener to see how it was developing. Lo and behold, version 3 was just announced. Scrivener had been substantially rewritten and improved.

From lead developer Keith Blount’s post:

Featuring a refreshed UI and some major overhauls, our focus for Scrivener 3 has been not only on new features but also on consolidating and simplifying what’s already there. We’ve taken years of experience of writing in Scrivener, both our own and that of our users, and poured it into Scrivener 3. The result is the best version of Scrivener yet, and we can’t wait to get it into the hands of our users.

And so, I am patiently waiting for the release date later this Fall. I have no qualms or reservations about returning to Scrivener. Having used it before, I know it’s a great product. I know its iOS sync solution works effectively with Dropbox, and there are hints that iCloud may be coming down the pike eventually. Many features have been refined, including Markdown support, the Compile process (for exporting/publishing), style management, annotation, navigation within the document, and organization.

What am I doing while I wait? Well, I have one project in Ulysses that I am choosing to finish there because it’s close to completion. I also have other “incubator”-type projects that I’m not presently working on, the notes for which I’m choosing to keep in Ulysses until I’m ready to start them. But all my other projects, and new ones, have already been migrated to Scrivener (version 2) so that I can get back into the Scrivener mindset while I anxiously await the release of version 3.

Exciting times ahead! [1]

(UPDATE 2017-12-29: Scrivener 3 for macOS was released, and it was well worth the wait. It is so much more refined than version 2. Everything is cleaner and more efficient. And best of all, it syncs great with the iOS version using Dropbox. Highly recommended!)

  1. And a HUGE thank-you to Keith Blount and the rest of the Scrivener team!!!