Data Algebra 0.9.0 Release

[This article was first published on python – Win Vector LLC, and kindly contributed to python-bloggers].

I am pleased to announce the 0.9.0 release of the data algebra.

The data algebra is a realization of the Codd relational algebra for data, written in terms of Python method chaining. It allows the concise, clear specification of useful data transforms. Some examples can be found here. Benefits include being able to specify a single data transformation that can then be translated and executed on many back ends, currently including Pandas, Google BigQuery, PostgreSQL, Spark, and SQLite. This lets you rehearse and debug your big data work in memory.

Some notable features of the 0.9.0 PyPI release include:

  • Improvements to the SQL generation pipeline. The conversion now runs in stages: data algebra (the data manipulation grammar), to near SQL (objects representing SQL steps), to lines, to a single text. This allows a lot of reuse and sharing between the different database dialects.
  • More use of SQL’s WITH clause (common table expressions) for better machine-generated SQL.
  • Simulation of RIGHT and FULL joins for SQLite. SQLite doesn’t include RIGHT and FULL joins. The data algebra’s SQLite adapter now converts RIGHT joins to LEFT joins, and FULL joins to larger pipelines. The use and methodology are described here. This allows more data pipelines to be rehearsed in SQLite before moving to another database.
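
To give a feel for the join simulation idea, here is a minimal sketch using only Python's standard `sqlite3` module. It is not the SQL the data algebra actually emits. The tables `a` and `b`, their columns, and the query text are all made up for illustration, and the FULL join rewrite assumes the join key `k` is never NULL. The RIGHT join is expressed as a LEFT join with the tables swapped, and the FULL join is built from two LEFT joins staged with WITH and combined by UNION ALL.

```python
import sqlite3

# Hypothetical example tables; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (k INTEGER, x TEXT);
    CREATE TABLE b (k INTEGER, y TEXT);
    INSERT INTO a VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO b VALUES (2, 'b2'), (3, 'b3');
""")

# "a RIGHT JOIN b" rewritten as "b LEFT JOIN a": every row of b is kept.
right_join_sql = """
SELECT b.k AS k, a.x AS x, b.y AS y
FROM b LEFT JOIN a ON a.k = b.k
ORDER BY k
"""

# "a FULL JOIN b" rewritten as a larger pipeline:
#   left_part  = all rows of a, with matches from b where they exist
#   right_only = rows of b with no match in a (assumes k is never NULL)
# The two staged results are then stacked with UNION ALL.
full_join_sql = """
WITH
  left_part AS (
    SELECT a.k AS k, a.x AS x, b.y AS y
    FROM a LEFT JOIN b ON a.k = b.k
  ),
  right_only AS (
    SELECT b.k AS k, NULL AS x, b.y AS y
    FROM b LEFT JOIN a ON a.k = b.k
    WHERE a.k IS NULL
  )
SELECT k, x, y FROM left_part
UNION ALL
SELECT k, x, y FROM right_only
ORDER BY k
"""

right_rows = conn.execute(right_join_sql).fetchall()
full_rows = conn.execute(full_join_sql).fetchall()
print(right_rows)
print(full_rows)
```

Naming each intermediate step in a WITH clause keeps the generated text close to a linear pipeline, which tends to be easier to read and debug than deeply nested subqueries.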

We’ve been using the data algebra to speed up development on both client and internal Python data science projects. I invite you to give it a try.
