data_algebra 0.7.0 What is New

This article was first published on python – Win Vector LLC, and kindly contributed to python-bloggers.

I’ve been tinkering a lot recently with the data_algebra, and just released version 0.7.0 to PyPI. In this note I’ll touch on what the data algebra is, what the new features are, and my plans going forward.

The data algebra

The data algebra is a modern realization of elements of Codd’s 1969 relational model for data wrangling (see also Codd’s 12 rules).

The idea is: most data manipulation tasks can usefully be broken down into a small number of fundamental data transforms plus composition. In Codd’s initial writeup, composition was expressed using standard mathematical operator notation. For “modern” realizations one wants to use a composition notation that is natural for the language you are working in. For Python the natural composition notation is method dispatch.
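
To make "composition by method dispatch" concrete, here is a minimal sketch in plain Python. The Table class and its three methods are hypothetical stand-ins written for this note; they are not the data_algebra API. The point is only that when each transform returns a new table, a pipeline reads left to right as a chain of method calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, float]


@dataclass(frozen=True)
class Table:
    """Hypothetical in-memory table: a tuple of row dictionaries."""

    rows: tuple

    def extend(self, name: str, fn: Callable[[Row], float]) -> "Table":
        """Add (or overwrite) a column computed from each row."""
        return Table(tuple({**r, name: fn(r)} for r in self.rows))

    def select_rows(self, pred: Callable[[Row], bool]) -> "Table":
        """Keep only the rows for which pred is true."""
        return Table(tuple(r for r in self.rows if pred(r)))

    def select_columns(self, names: List[str]) -> "Table":
        """Keep only the named columns, in the given order."""
        return Table(tuple({n: r[n] for n in names} for r in self.rows))


d = Table(({"x": 1, "y": 10}, {"x": 2, "y": 20}, {"x": 3, "y": 30}))

# Each method returns a new Table, so transforms compose left to right.
result = (
    d.extend("z", lambda r: r["x"] + r["y"])
    .select_rows(lambda r: r["z"] > 11)
    .select_columns(["x", "z"])
)
print(result.rows)
```

The chain is read in the order the steps run: add a column, filter rows, then narrow the columns.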

The problems with the relational model were twofold:

    • The name. The relational model was named after a now-abandoned feature: insisting that all tables have unique keys, and relating this idea to the concept of a mathematical relation. This data model was very different from the previously dominant data model: the hierarchical model (which itself is essentially pointers, or even what we now call a graph database).
    • The first dominant realization. The first dominant realization of the relational model evolved into what we now call SQL. SQL had the curse of early success. In hindsight SQL makes a complete mess of composition, as the original SQL notion of composition was right-side statement nesting. This turns out to be illegible, at least prior to the introduction of with / “common table expressions”, a SQL99 notation not available in some databases until 2005 (ref).
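
The difference between nested composition and with-style composition can be seen in a small sketch using Python's standard sqlite3 module. The table d, its columns, and the step names doubled and totals are made up for illustration. Both queries perform the same three steps (double a column, sum by group, filter), but the nested form has to be read inside-out while the with form lists the steps in the order they run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE d (g TEXT, x REAL)")
conn.executemany(
    "INSERT INTO d VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 5.0), ("b", 7.0)],
)

# Nested form: the first step is the innermost query, so it reads inside-out.
nested_sql = """
SELECT g, total
FROM (
    SELECT g, SUM(x2) AS total
    FROM (
        SELECT g, x * 2 AS x2
        FROM d
    ) AS doubled
    GROUP BY g
) AS totals
WHERE total > 10
ORDER BY g
"""

# CTE form: the same three steps, named and listed in the order they run.
cte_sql = """
WITH doubled AS (
    SELECT g, x * 2 AS x2 FROM d
),
totals AS (
    SELECT g, SUM(x2) AS total FROM doubled GROUP BY g
)
SELECT g, total
FROM totals
WHERE total > 10
ORDER BY g
"""

nested_rows = conn.execute(nested_sql).fetchall()
cte_rows = conn.execute(cte_sql).fetchall()
conn.close()
print(nested_rows == cte_rows, cte_rows)
```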

The data algebra implements the Codd transforms (using Codd’s names where practical) in Python. It can manipulate data in Pandas or SQL. Such a strategy is famously used in the dplyr / dbplyr R packages (which use a pipe operator for composition, as R’s native S3/S4 method dispatch again composes through somewhat illegible nesting).

Benefits

The benefits / purposes of the data algebra include:

  • Faster development. We find the compositional notation to be very fast to develop with. In fact the loss of such notation in moving from R to Python is a common complaint for multilingual data scientists. Data algebra uses method dispatch as its composition notation, making it a natural fit for Python (and eliminating any need for a so-called pipe operator). Pandas and SQL particularities can be worked around in the data algebra package.
  • More legible code. Data algebra pipelines read as a sequence of transforms on data. We find the “everything happens in the data frame” notation can be more legible than the common Pandas user pattern of “take column out, work on it somewhere else, and then put it back in the data frame.”
  • Future proofing / platform independence. The data algebra allows you to work in memory using Pandas or SQLite, and then use the exact same code in a large database such as BigQuery or PostgreSQL.
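
The platform independence idea can be sketched as follows: record the transforms once, then either apply them in memory or render them as SQL. Everything here (the Pipeline class, its methods, the step_0 / step_1 names) is a hypothetical toy written for this note, and it is not how the data algebra is implemented. It assumes new column names do not collide with existing ones and does no identifier quoting.

```python
import operator
import sqlite3

ARITH = {"+": operator.add, "-": operator.sub, "*": operator.mul}
COMPARE = {">": operator.gt, "<": operator.lt, "=": operator.eq}


class Pipeline:
    """Hypothetical record of transform steps, independent of any back end."""

    def __init__(self, table, steps=()):
        self.table, self.steps = table, tuple(steps)

    def extend(self, new, left, op, right):
        """Record: new column = left column <op> right column."""
        if op not in ARITH:
            raise ValueError(f"unsupported operator: {op}")
        return Pipeline(self.table, self.steps + (("extend", new, left, op, right),))

    def select_rows(self, col, cmp, value):
        """Record: keep rows where column <cmp> numeric constant."""
        if cmp not in COMPARE:
            raise ValueError(f"unsupported comparison: {cmp}")
        return Pipeline(self.table, self.steps + (("select_rows", col, cmp, value),))

    def run_in_memory(self, rows):
        """Apply the recorded steps to a list of row dictionaries."""
        for step in self.steps:
            if step[0] == "extend":
                _, new, left, op, right = step
                rows = [{**r, new: ARITH[op](r[left], r[right])} for r in rows]
            else:
                _, col, cmp, value = step
                rows = [r for r in rows if COMPARE[cmp](r[col], value)]
        return rows

    def to_sql(self):
        """Render the same steps as one SQL query, one CTE per step."""
        source, ctes = self.table, []
        for i, step in enumerate(self.steps):
            if step[0] == "extend":
                _, new, left, op, right = step
                body = f"SELECT *, {left} {op} {right} AS {new} FROM {source}"
            else:
                _, col, cmp, value = step
                # float() forces the constant to render as a plain number.
                body = f"SELECT * FROM {source} WHERE {col} {cmp} {float(value)!r}"
            source = f"step_{i}"
            ctes.append(f"{source} AS ({body})")
        prefix = "WITH " + ", ".join(ctes) + " " if ctes else ""
        return f"{prefix}SELECT * FROM {source}"


rows = [{"x": 1, "y": 10}, {"x": 2, "y": 20}, {"x": 3, "y": 30}]
ops = Pipeline("d").extend("z", "x", "+", "y").select_rows("z", ">", 11)

# Back end 1: plain Python data structures.
in_memory = [(r["x"], r["y"], r["z"]) for r in ops.run_in_memory(rows)]

# Back end 2: the same recorded steps, executed by SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE d (x INTEGER, y INTEGER)")
conn.executemany("INSERT INTO d VALUES (:x, :y)", rows)
in_database = conn.execute(ops.to_sql() + " ORDER BY x").fetchall()
conn.close()
print(in_memory == in_database)
```

The pipeline is data, not a computation already performed, which is what lets one description target more than one back end.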

Example

Here is a simple data algebra example (source here).

What is new in version 0.7.0?

Version 0.7.0 is a major upgrade. The improvements include:

  • Switching from a Python-eval based expression parser to a Lark-grammar based parser. This new parser is safer and allows more direct control of expression features.
  • Targeting and testing of Google BigQuery as a SQL back end. We have used the data algebra on PostgreSQL, MySQL, and Spark. Right now we are primarily testing on SQLite and BigQuery.
  • Moving away from Pandas .eval() and .query(). Previous versions of the data algebra tried to dispatch expression evaluation to Pandas through the .eval() and .query() interfaces. These interfaces have proven to be fairly limited, and are not how most users use Pandas. The data algebra now directly manages expression evaluation over Pandas columns.
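
As a rough illustration of why a grammar-restricted parser is safer than eval, and of what evaluating an expression directly over columns can look like, here is a dependency-free sketch. It uses Python's standard ast module rather than Lark, and plain lists rather than Pandas columns, so it is an analogy to the change described above and not the data algebra's actual parser or evaluator. The function name eval_over_columns is made up for this example.

```python
import ast
import operator

# Whitelist of binary operators the mini-language accepts.
BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def eval_over_columns(expr, columns):
    """Evaluate an arithmetic expression element-wise over equal-length columns."""
    n_rows = len(next(iter(columns.values())))

    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
            left, right = walk(node.left), walk(node.right)
            return [BIN_OPS[type(node.op)](a, b) for a, b in zip(left, right)]
        if isinstance(node, ast.Name) and node.id in columns:
            return list(columns[node.id])
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return [node.value] * n_rows
        # Anything outside the whitelist (calls, attributes, ...) is refused.
        raise ValueError(f"unsupported expression element: {ast.dump(node)}")

    # ast.parse only builds a syntax tree; nothing in expr is executed.
    return walk(ast.parse(expr, mode="eval").body)


columns = {"x": [1, 2, 3], "y": [10, 20, 30]}
z = eval_over_columns("(x + y) * 2", columns)

# An expression that eval() would happily run is rejected here.
try:
    eval_over_columns("__import__('os').getcwd()", columns)
    rejected = False
except ValueError:
    rejected = True
print(z, rejected)
```

Because only whitelisted node types are ever interpreted, the set of expression features is controlled explicitly rather than inherited from whatever eval allows.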

Conclusion

The data algebra is a great tool for Python data science projects. We are thrilled it has gotten to the point where we use it in client projects. What is missing is a “data algebra manual” and training, but with luck we hope to someday fill that gap.
