I’ve been tinkering a lot recently with the data_algebra, and just released version 0.7.0 to PyPI. In this note I’ll touch on what the data algebra is, what the new features are, and my plans going forward.
The data algebra
The data algebra is a modern realization of elements of Codd’s 1969 relational model for data wrangling (see also Codd’s 12 rules).
The idea is: most data manipulation tasks can usefully be broken down into a small number of fundamental data transforms plus composition. In Codd’s initial writeup, composition was expressed using standard mathematical operator notation. For “modern” realizations one wants to use a composition notation that is natural for the language you are working in. For Python the natural composition notation is method dispatch.
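To make the composition idea concrete, here is a minimal, hypothetical sketch (not the data_algebra implementation) of composition via method dispatch: each method returns a new pipeline object, so steps read top to bottom instead of nesting inside out.
class Pipeline:
    # toy pipeline: each method returns a new Pipeline with one more recorded step
    def __init__(self, steps=()):
        self.steps = tuple(steps)
    def extend(self, defs):
        # record an "add derived columns" step
        return Pipeline(self.steps + (('extend', defs),))
    def project(self, defs, group_by=()):
        # record a "summarize rows by group" step
        return Pipeline(self.steps + (('project', defs, tuple(group_by)),))
# steps compose in the order they will be applied
ops = Pipeline().extend({'g': '...'}).project({'t': '...'}, group_by=['g'])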
The problems with the relational model were twofold:
- The name. The relational model was named after a now-abandoned feature: insisting all tables have unique keying, and relating this idea to the concept of a mathematical relation. This data model was very different from the prior dominant data model: the hierarchical model (which is essentially pointers, or even what we now call a graph database).
- The first dominant realization. The first dominant realization of the relational model evolved into what we now call SQL. SQL had the curse of early success. In hindsight SQL makes a complete mess of composition, as the original SQL notion of composition was right-side statement nesting. This turns out to be illegible (prior to the introduction of with/”common table expressions”, a SQL99 notation not available in some databases until 2005 (ref)).
The data algebra implements the Codd transforms (using Codd’s names where practical) in Python. It can manipulate data in Pandas or SQL. Such a strategy is famously used in the dplyr / dbplyr R packages (which use a pipe operator for composition, as R’s native S3/S4 method dispatch again leads to somewhat illegible nesting).
Benefits
The benefits / purposes of the data algebra include:
- Faster development. We find the compositional notation to be very fast to develop with. In fact, the loss of such notation in moving from R to Python is a common complaint among multi-lingual data scientists. The data algebra uses method dispatch as its composition notation, making it a natural fit for Python (and eliminating any need for a so-called operator pipe). Pandas and SQL particularities can be worked around in the data algebra package.
- More legible code. Data algebra pipelines read as a sequence of transforms on data. We find the “everything happens in the data frame” notation can be more legible than the common Pandas user pattern of “take column out, work on it somewhere else, and then put it back in the data frame.”
- Future proofing / platform independence. The data algebra allows you to work in memory using Pandas or SQLite, and then use the exact same code in a large database such as BigQuery or PostgreSQL.
Example
Here is a simple data algebra example (source here).
First let’s import our packages and set up an example data frame.
import pandas
from data_algebra.data_ops import *
d = pandas.DataFrame({
'c': ['c', 'c', 'b', 'a'],
'v': [1, 2, 3, 4],
})
d
| | c | v |
|---|---|---|
| 0 | c | 1 |
| 1 | c | 2 |
| 2 | b | 3 |
| 3 | a | 4 |
Now let’s define our data transform using the data algebra. New columns are defined by specifying a Python dictionary where the new column names are the keys and the source code for the operations are the values. We try to use Codd’s names for operators: adding columns is extend(), and summarizing data is project().
table_name = 'data-algebra-test.test_1.d'
operations = describe_table(d, table_name=table_name) .\
extend({
'g': '"prefix_" %+% c' # concatenate strings
}) .\
project({ # build per-group (g) totals of v
'group_total': 'v.sum()'
},
group_by=['g']
) .\
order_rows(['g']) # choose a presentation order of rows
We can then apply these operations to any data frame that has the columns specified in the table description (and appropriate column types).
res_pandas = operations.transform(d)
res_pandas
| | g | group_total |
|---|---|---|
| 0 | prefix_a | 4 |
| 1 | prefix_b | 3 |
| 2 | prefix_c | 3 |
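Because the pipeline is keyed only to the table description, it can be re-applied to any other frame with the same columns. For instance (d2 is a made-up frame, not from the original example):
# d2 is a hypothetical second frame with the required columns 'c' and 'v'
d2 = pandas.DataFrame({
    'c': ['a', 'a', 'b'],
    'v': [10, 20, 30],
})
res2 = operations.transform(d2)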
Applying the same operations in a database is quite simple. First we connect to our database. Here we are inserting the data as an example; in serious applications the source table would usually already be present.
import os
from google.cloud import bigquery
import data_algebra.BigQuery
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/Users/johnmount/big_query/big_query_jm.json"
bigquery_handle = data_algebra.BigQuery.BigQueryModel().db_handle(bigquery.Client())
bigquery_handle.insert_table(d, table_name=table_name, allow_overwrite=True)
Then we can generate SQL tailored to the specific database.
bigquery_sql = bigquery_handle.to_sql(operations, pretty=True)
print(bigquery_sql)
SELECT
    `g`,
    `group_total`
FROM (
    SELECT
        `g`,
        SUM(`v`) AS `group_total`
    FROM (
        SELECT
            ("prefix_" || `c`) AS `g`,
            `v`
        FROM
            `data-algebra-test.test_1.d`
    ) `extend_1`
    GROUP BY
        `g`
) `project_2`
ORDER BY
    `g`
Operations can be used to land results in the database (using CREATE TABLE AS, avoiding round-tripping data in and out of the database). Operations can also be used to return results as a Pandas data frame.
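A sketch of landing results in-database follows; it assumes the handle exposes a generic execute() method and uses a made-up destination table name, so treat it as illustrative rather than the package’s documented materialization API:
# hypothetical destination table name, for illustration only
result_table = 'data-algebra-test.test_1.d_totals'
# wrap the generated SQL in CREATE TABLE ... AS to materialize without round-tripping
bigquery_handle.execute(
    'CREATE TABLE `' + result_table + '` AS ' + bigquery_handle.to_sql(operations))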
res_bigquery = bigquery_handle.read_query(operations)
res_bigquery
| | g | group_total |
|---|---|---|
| 0 | prefix_a | 4 |
| 1 | prefix_b | 3 |
| 2 | prefix_c | 3 |
And we can check we get equivalent results in Pandas and from the database.
assert res_pandas.equals(res_bigquery)
bigquery_handle.close()
We can repeat the database example using another database, simply by building a different database handle.
import sqlite3
import data_algebra.SQLite
sqlite_handle = data_algebra.SQLite.SQLiteModel().db_handle(sqlite3.connect(":memory:"))
sqlite_sql = sqlite_handle.to_sql(operations, pretty=True)
print(sqlite_sql)
SELECT "g", "group_total" FROM (SELECT "g", SUM("v") AS "group_total" FROM (SELECT ('prefix_'|| "c") AS "g", "v" FROM "data-algebra-test.test_1.d") "extend_1" GROUP BY "g") "project_2" ORDER BY "g"
sqlite_handle.insert_table(d, table_name=table_name, allow_overwrite=True)
res_sqlite = sqlite_handle.read_query(operations)
res_sqlite
| | g | group_total |
|---|---|---|
| 0 | prefix_a | 4 |
| 1 | prefix_b | 3 |
| 2 | prefix_c | 3 |
assert res_sqlite.equals(res_bigquery)
Also, operations have a pretty good printing method.
operations
TableDescription(
    table_name='data-algebra-test.test_1.d',
    column_names=['c', 'v']) .\
    extend({
        'g': "'prefix_'.concat(c)"}) .\
    project({
        'group_total': 'v.sum()'},
        group_by=['g']) .\
    order_rows(['g'])
Notice the printed form has normalized the %+% string concatenation we wrote into the equivalent .concat() method form.
And that is a small demonstration of the data algebra.
What is new in version 0.7.0?
Version 0.7.0 is a major upgrade. The improvements include:
- Switching from a Python-eval based expression parser to a Lark-grammar based parser. This new parser is safer and allows more direct control of expression features.
- Targeting and testing of Google BigQuery as a SQL back end. We have used the data algebra on PostgreSQL, MySQL, and Spark. Right now we are primarily testing on SQLite and BigQuery.
- Moving away from Pandas .eval() and .query(). Previous versions of the data algebra tried to dispatch expression evaluation to Pandas through the .eval() and .query() interfaces. These interfaces have proven to be fairly limited, and are not how most users use Pandas. The data algebra now directly manages expression evaluation over Pandas columns (roughly as in the sketch below).
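As a rough illustration (this is not the package’s internal dispatch code), the example pipeline’s steps correspond to direct Pandas column operations like these:
import pandas
d = pandas.DataFrame({'c': ['c', 'c', 'b', 'a'], 'v': [1, 2, 3, 4]})
res = d.copy()
# extend: build the new column directly from an existing column
res['g'] = 'prefix_' + res['c']
# project: per-group aggregation of v
res = res.groupby('g', as_index=False).agg(group_total=('v', 'sum'))
# order_rows: choose a presentation order
res = res.sort_values('g', ignore_index=True)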
Conclusion
The data algebra is a great tool for Python data science projects. We are thrilled it has gotten to the point where we use it in client projects. What is missing is a “data algebra manual” and training, but with luck we hope to someday fill that gap.