
data_algebra/rquery as a Category Over Table Descriptions


Introduction

I would like to talk about some of the design principles underlying the data_algebra package (and also in its sibling rquery package).

The data_algebra package is a query generator that can act on either Pandas data frames or on SQL tables. This is discussed on the project site and in the examples directory. In this note we will set up some technical terminology that will allow us to discuss some of the underlying design decisions. These are the sort of decisions that, when done well, the user doesn’t have to think much about. Discussing such design decisions at length can obscure some of their charm, but we would like to point out a few features here.


(Note: if there are any blog-rendering problems the original source-workbook used to create this article can be found here.)

Background

We will introduce a few ideas before trying to synthesize our thesis.

The data_algebra

An introduction to the data_algebra package can be found here. In this note we will discuss some of the inspirations for the package: Codd’s relational algebra, experience working with dplyr at scale, sklearn Pipeline, and category theory.

sklearn Pipeline

A major influence on the data_algebra design is sklearn.pipeline.Pipeline. sklearn.pipeline.Pipeline itself presumably became public with Edouard Duchesnay’s Jul 27, 2010 commit: “add pipeline”.

sklearn.pipeline.Pipeline maintains a list of steps to be applied to data. What is interesting is that the steps are not functions. Steps are instead objects that implement both a .transform() and a .fit() method.

.transform() typically accepts a data-frame type structure and returns a modified version. Typical operations include adding a new derived column, selecting columns, selecting rows, and so on.

From a transform-alone point of view the steps compose like functions. For the list [s, t], transform([s, t], x) is defined as:

   transform([s, t], x) := 
      t.transform(s.transform(x))

(the glyph “:=” denoting “defined as”).
The fit-perspective is where things get interesting. obj.fit(x) changes the internal state of obj based on the value x and returns a reference to obj. I.e. it learns from the data and stores this result as a side-effect in the object itself. In sklearn it is common to assume a composite method called .fit_transform(), often defined as:

   obj.fit_transform(x) := obj.fit(x).transform(x)

(though for our own vtreat package, this is deliberately not the case).

Using .fit_transform() we can explain that in a sklearn Pipeline, .fit() is naturally thought of as:

   fit([s, t], x) := 
      t.fit(s.fit_transform(x))
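
To make these composition rules concrete, here is a minimal sketch in plain Python (not sklearn’s actual implementation; the CenterStep class and the helper names are hypothetical) of a list of steps composed under both the transform and fit views:

# A sketch of the list-of-steps composition rules above (not sklearn's code).
# CenterStep is a hypothetical step: it learns a mean in .fit() and
# subtracts it in .transform().
class CenterStep:
    def fit(self, x):
        self.mean_ = sum(x) / len(x)  # side-effect: store the fit state
        return self

    def transform(self, x):
        return [xi - self.mean_ for xi in x]

    def fit_transform(self, x):
        return self.fit(x).transform(x)

def transform_steps(steps, x):
    # transform([s, t], x) := t.transform(s.transform(x))
    for step in steps:
        x = step.transform(x)
    return x

def fit_steps(steps, x):
    # fit([s, t], x) := t.fit(s.fit_transform(x))
    for step in steps[:-1]:
        x = step.fit_transform(x)
    return steps[-1].fit(x)

pipeline = [CenterStep(), CenterStep()]
fit_steps(pipeline, [1.0, 2.0, 3.0])
print(transform_steps(pipeline, [1.0, 2.0, 3.0]))  # [-1.0, 0.0, 1.0]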

My point is: sklearn.pipeline.Pipeline generalizes function composition to something more powerful, the ability to both fit and transform. sklearn.pipeline.Pipeline is a natural way to store a sequence of parameterized or data-dependent data transform steps (such as centering, scaling, missing value imputation, and much more).

This gives us a concrete example of how a rigid mindset of “function composition is the only form of composition” would miss design opportunities.

First class citizens

We are going to try to design our tools to be “first class citizens” in the sense of Strachey:

First and second class objects. In ALGOL, a real number may appear in an expression or be assigned to a variable, and either of them may appear as an actual parameter in a procedure call. A procedure, on the other hand, may only appear in another procedure call either as the operator (the most common case) or as one of the actual parameters. There are no other expressions involving procedures or whose results are procedures. Thus in a sense procedures in ALGOL are second class citizens—they always have to appear in person and can never be represented by a variable or expression (except in the case of a formal parameter)…

Quoted in Wikipedia. Christopher Strachey, “Fundamental Concepts in Programming Languages” in Higher-Order and Symbolic Computation 13:11 (2000); though published in 2000, these are notes from lectures Strachey delivered in August, 1967

What we want to draw out is: if our data transform steps are “first class citizens” we should expect to be able to store them in variables, compose them, examine them, and much more. A function that we can only apply or re-use is not giving us as much as we expect from other types. Or alternately, if functions don’t give us everything we want, we may not want to use them as our only type or abstraction for data processing steps.

Composability

Most people first encounter the mathematical concept of “composability” in terms of functions. This can give the false impression that to work with composable design principles one must shoe-horn the objects of interest into functions, or some sort of augmented functions.

This Procrustean view loses a lot of design opportunities.

In mathematics composability is directly studied by the field called “Category Theory.” So it makes sense to see if category theory may have ideas, notations, tools, and results that may be of use.

Category Theory

A lot of the benefit of category theory is lost if, every time we try to apply category theory (or even just use some of its notation conventions), we attempt to explain all of category theory as a first step. So I will try to resist that urge, and only introduce the bits I am going to use here.

Category theory routinely studies what are called “arrows.” When treated abstractly an arrow has two associated objects called the “domain” and “co-domain.” The names are meant to be evocative of the “domain” (space of inputs) and “co-domains” (space of outputs) from the theory of functions.

Functions are commonly defined as having: a domain (the set of allowed inputs), a co-domain (the set of possible outputs), and a rule taking each element of the domain to an element of the co-domain.

Packages that use function composition typically collect functions in lists and define operator composition either through lambda-abstraction or through list concatenation.
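
As a small illustration (the helper names here are hypothetical, not from any particular package), the two strategies look like this:

# Two common representations of composed operations (names illustrative).

# 1) lambda-abstraction: composition builds a new function.
def compose(f, g):
    return lambda x: f(g(x))

# 2) list concatenation: a pipeline is a list, and a separate
#    interpreter walks the list left to right.
def apply_steps(steps, x):
    for f in steps:
        x = f(x)
    return x

inc = lambda v: v + 1
dbl = lambda v: 2 * v

print(compose(dbl, inc)(3))        # 8, i.e. dbl(inc(3))
print(apply_steps([inc, dbl], 3))  # 8, the same pipeline stored as a list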

Category theory differs from function theory in that category theory talks about arrows instead of functions. The theory is careful to keep separate the following two concepts: what arrows are and how arrows are composed.

When using arrows to model a system we expect to be able to specify, with some extra degrees of freedom: what the arrows are, what their domains and co-domains are, and how arrows compose.

An action is a mapping from arrows and items to items. I.e. action(arrow, item) = new_item. For categories the items may or may not be related to the domain and co-domain. Not all categories have actions, but when they do have actions the action must be compatible with arrow composition.


Functions have a very ready category theory interpretation as arrows. Given a function f with domain A and co-domain B, we can think of any triple (f, A', B') as an arrow in a category of functions if A' is contained in A and B is contained in B'. In this formulation we define the arrow composition of (f, A', B') and (g, C', D') as (f . g, C', B') (which makes sense when D' is contained in A'), where f . g is defined to be the function such that for all x in C' we have:
(f . g)(x) := f(g(x))

We will call the application of a function to a value an example of an “action.” A function f() “acts on its domain” and f(x) is the action of f on x. For functions we can define the action “apply” as:
apply(f, x) := f(x)
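
A minimal sketch of this, representing arrows as (function, domain, co-domain) triples with the extra bookkeeping made explicit (the helper names are illustrative only, and domains are modeled as Python sets):

# Sketch: arrows as (function, domain, co-domain) triples.
def compose_arrows(f_arrow, g_arrow):
    # (f, A', B') . (g, C', D') = (f . g, C', B'), requiring D' contained in A'
    f, f_dom, f_cod = f_arrow
    g, g_dom, g_cod = g_arrow
    if not g_cod.issubset(f_dom):
        raise ValueError("arrows do not compose: domain/co-domain mismatch")
    return (lambda x: f(g(x)), g_dom, f_cod)

def apply_arrow(arrow, x):
    # the "apply" action: apply(f, x) := f(x)
    f, dom, cod = arrow
    return f(x)

inc = (lambda v: v + 1, {0, 1, 2}, {1, 2, 3})
dbl = (lambda v: 2 * v, {1, 2, 3}, {2, 4, 6})

both = compose_arrows(dbl, inc)  # acts as dbl(inc(x))
print(apply_arrow(both, 2))      # 6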

The extra generalization power we get from moving away from functions to arbitrary arrows (that might not correspond to functions) comes from the following: we get to pick what the arrows are and how they compose, subject to a few laws.

To be a category a few conditions must be met, including: the composition must be associative and we must have some identity arrows. By “associative composition” we mean it must be the case that for arrows a, b, and c (with appropriate domains and co-domains):
(a . b) . c = a . (b . c)

Our action must also be compatible with arrow composition. That is, for values x we must have, for co-variant actions:
apply(a . b, x) = apply(a, apply(b, x))

Or for contra-variant actions:
apply(a . b, x) = apply(b, apply(a, x))

The idea is: the arrow a . b must have an action equal to the actions of a and b composed as functions. That is: arrow composition and actions can differ from function composition and function application, but they must be at least somewhat similar in that they remain associative.
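
As a quick sanity check of these two laws for ordinary functions (using the co-variant convention; the helper names are illustrative only):

# Quick check of associativity and action-compatibility for plain functions.
def compose(f, g):
    return lambda x: f(g(x))

def apply(f, x):
    return f(x)

a = lambda v: v + 1
b = lambda v: 2 * v
c = lambda v: v - 3

for x in range(5):
    # (a . b) . c == a . (b . c)
    assert compose(compose(a, b), c)(x) == compose(a, compose(b, c))(x)
    # apply(a . b, x) == apply(a, apply(b, x))
    assert apply(compose(a, b), x) == apply(a, apply(b, x))
print("associativity and action checks pass")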

We now have the background to see that category theory arrows differ from functions in that arrows are more general (we can pick more of their properties) and require a bit more explicit bookkeeping.

Back to sklearn.pipeline.Pipeline

We now have enough notation to attempt a crude category theory description of sklearn.pipeline.Pipeline.

Define our sklearn.pipeline.Pipeline category P as follows: the arrows are lists of pipeline steps, arrow composition is list concatenation, and the transform action applies each step’s .transform() to the data in order.

To see this is a category (and a category compatible action) we must check associativity of the composition (which in this case is list concatenation) and associativity of the action with respect to list concatenation.
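
A sketch of these checks (illustrative only; each step is stood in for by its .transform() behavior as a plain function):

# Sketch: composition as list concatenation, with the transform action
# applying each step left to right.
def transform_action(steps, x):
    for f in steps:
        x = f(x)
    return x

a = [lambda v: v + 1]
b = [lambda v: 2 * v]
c = [lambda v: v - 3]

# list concatenation is associative
assert (a + b) + c == a + (b + c)
# the action is compatible with concatenation (steps act left to right)
assert transform_action(a + b, 10) == transform_action(b, transform_action(a, 10))
print("pipeline category checks pass")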

We can also try to model the .fit_transform() methods. We will not try to model the side-effect that .fit_transform() changes the state of the arrows (to have the fit information in each step). But we can at least define an action (with side effects) as follows:

   fit_transform_action([s, t], x) := 
      t.fit_transform(s.fit_transform(x))

To confirm this is an action (ignoring the side-effects), we would want to check whether the following equality holds:
fit_transform_action(a . b, x) =
fit_transform_action(b, fit_transform_action(a, x))

The above should follow from brute-force pushing of notation (assuming we have defined fit_transform_action correctly, and have sufficiently characterized .fit_transform()).
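
A sketch of such a check, using a hypothetical AddMeanStep (again, not sklearn code) that only implements .fit_transform():

# Sketch: fit_transform_action over step lists, and the equality above.
class AddMeanStep:
    # learns the mean of the data, then adds it to every value
    def fit_transform(self, xs):
        self.mean_ = sum(xs) / len(xs)  # side-effect: record fit state
        return [x + self.mean_ for x in xs]

def fit_transform_action(steps, xs):
    for step in steps:
        xs = step.fit_transform(xs)
    return xs

a = [AddMeanStep()]
b = [AddMeanStep()]
data = [1.0, 2.0, 3.0]

lhs = fit_transform_action(a + b, data)
rhs = fit_transform_action(b, fit_transform_action(a, data))
print(lhs == rhs)  # True (ignoring the side-effects on the steps)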

Notice we didn’t directly define a “fit_action” action, as it isn’t obvious that it has a nice associative realization. This is an opportunity for theory to drive design: notation considerations hint that .fit_transform() may be more fundamental than, and thus preferred over, .fit().

The category theory concepts didn’t so much design sklearn.pipeline.Pipeline as give us a set of criteria to evaluate the sklearn.pipeline.Pipeline design. We trust the category theory point of view is useful as it emphasizes associativity (which is a great property to have), and is routinely found to be a set of choices that work in complicated systems. The feeling being: the design points category theory seems to suggest turn out to be the ones you want down the road.

The data_algebra

Now that we have some terminology, let’s get back to the data_algebra.

What is the data_algebra?

An introduction to the data_algebra can be found here.

We now have the terminology to concisely state a data_algebra design principle: use general concepts (such as category theory notation) to try and ensure data_algebra transforms are first class citizens (i.e. we can do a lot with them and to them).

The naive functional view

If we were to again take a mere functional view of the data_algebra we would say the data_algebra is a set of functions that operate on data. They translate data frames to new data frames using Codd-inspired operations. We could think of the data_algebra as acting on data on the right, and acting on data_algebra operators on the left.

However, this is not the right abstraction. data_algebra methods primarily map data transforms to data transforms. But even this is too functional a view. It makes sense to think of data_algebra operators as arrows, and the whole point of arrows is composition.

The categorical view

The data_algebra can be mapped to a nice category. The idea being: something that can be easily mapped to an orderly system is itself likely a somewhat orderly system.

Our data_algebra category D is defined as follows: the objects are single-table schemas (lists of column names, possibly with column types); the arrows are data_algebra operator pipelines annotated with an input schema (their domain) and an output schema (their co-domain); arrow composition combines two operator pipelines when the co-domain of the first matches the domain of the second; and the action is applying an operator pipeline to data (or translating it to SQL).

Some notes on the category theory interpretation of the data_algebra package can be found here.

Let’s demonstrate the above with Python code. The data_algebra allows for the specification of data transforms as first class objects.

First we import some modules and create some example data.

from data_algebra.data_ops import *
import pandas

d = pandas.DataFrame({
    'x': [1, 2, 3],
    'y': [3, 4, 4],
})

d
   x  y
0  1  3
1  2  4
2  3  4

To specify adding a new derived column z we would write code such as the following.

td = describe_table(d)

a = td.extend(
    { 'z': 'x.mean()' },
    partition_by=['y']
)

a

TableDescription( table_name='data_frame', column_names=[ 'x', 'y']) .
extend({ 'z': 'x.mean()'}, partition_by=['y'])

We can let this transform act on data.

a.transform(d)
   x  y    z
0  1  3  1.0
1  2  4  2.5
2  3  4  2.5
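
For comparison, the same partitioned mean can be computed directly in Pandas with a grouped transform (an equivalent computation for checking the result, not a claim about how data_algebra implements it):

import pandas

# the same derived column computed directly in Pandas
d_check = pandas.DataFrame({
    'x': [1, 2, 3],
    'y': [3, 4, 4],
})
d_check['z'] = d_check.groupby('y')['x'].transform('mean')

print(d_check)
#    x  y    z
# 0  1  3  1.0
# 1  2  4  2.5
# 2  3  4  2.5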

We can compose this transform with more operations to create a composite transform.

b = a.extend({
    'ratio': 'y / x'
})

b

TableDescription( table_name='data_frame', column_names=[ 'x', 'y']) .
extend({ 'z': 'x.mean()'}, partition_by=['y']) .
extend({ 'ratio': 'y / x'})

As a bonus we can also map the above transform to a SQL query representing the same action in databases.

from data_algebra.SQLite import SQLiteModel

print(
    b.to_sql(db_model=SQLiteModel(), pretty=True)
)

SELECT "x",
"y",
"z",
"y" / "x" AS "ratio"
FROM
(SELECT "x",
"y",
avg("x") OVER (PARTITION BY "y") AS "z"
FROM ("data_frame") "SQ_0") "SQ_1"


All of this is the convenient interface we expect users will want. However, if we ask that all operators specify their expected input schema (their domain), we have the category D. We don’t expect users to do this, but we have code supporting this style of notation to show that the data_algebra is in fact related to a nice category over schemas.

Let’s re-write the above queries as formal category arrows.

from data_algebra.arrow import *

a1 = DataOpArrow(a)

print(str(a1))

[ 'data_frame': [ x: <class 'numpy.int64'>, y: <class 'numpy.int64'> ] -> [ x, y, z ] ]

The above is rendering the arrow as just its domain and co-domain. The domain and co-domains are just single-table schemas: lists of column names (possibly with column types).

We can get a more detailed representation of the arrow as follows.

print(a1.__repr__())

DataOpArrow( TableDescription( table_name='data_frame', column_names=[ 'x', 'y']) .
extend({ 'z': 'x.mean()'}, partition_by=['y']), free_table_key='data_frame')

Or we can examine the domain and co-domain directly. Here we are using a common category theory trick: associating the object with the identity arrow of the object. So what we are showing as domain and co-domains are actually identity arrows instead of objects.

a1.dom()

DataOpArrow( TableDescription( table_name='', column_names=[ 'x', 'y']), free_table_key='')

a1.cod()

DataOpArrow( TableDescription( table_name='', column_names=[ 'x', 'y', 'z']), free_table_key='')

Now we can write our second transform step as an arrow as follows.

a2 = DataOpArrow(a1.cod_as_table().extend({
    'ratio': 'y / x'
}))

a2

DataOpArrow( TableDescription( table_name='', column_names=[ 'x', 'y', 'z']) .
extend({ 'ratio': 'y / x'}), free_table_key='')

We took extra steps, that most users will not want to take, to wrap the second-stage (a2) operations as an arrow. Being an arrow means that we have a domain and co-domain that can be used to check if operations are composable.

A typical user would not work with arrows directly, but instead work with the data algebra, which itself is a shorthand for the arrows. That is: the users may want the power of a category, but they don’t want to be the ones handling the extra bookkeeping. To add an extra operation a user would work directly with a and just write the following.

a.extend({
    'ratio': 'y / x'
})

TableDescription( table_name='data_frame', column_names=[ 'x', 'y']) .
extend({ 'z': 'x.mean()'}, partition_by=['y']) .
extend({ 'ratio': 'y / x'})

The above has substantial pre-condition checking and optimizations (as it is merely user facing shorthand for the arrows).

The more cumbersome arrow notation (that requires the specification of pre-conditions) has a payoff: managed arrow composition. That is: complex operator pipelines can be directly combined. We are not limited to extending one operation at a time.

If the co-domain of one arrow matches the domain of another arrow, we can compose them left to right as follows.

a1.cod() == a2.dom()

True

composite = a1 >> a2

composite

DataOpArrow( TableDescription( table_name='data_frame', column_names=[ 'x', 'y']) .
extend({ 'z': 'x.mean()'}, partition_by=['y']) .
extend({ 'ratio': 'y / x'}), free_table_key='data_frame')

And when this isn’t the case, composition is not allowed. This is exactly what we want as this means the preconditions (exactly which columns are present) for the second arrow are not supplied by the first arrow.

a2.cod() == a1.dom()

False

try:
    a2 >> a1
except ValueError as e:
    print("Caught: " + str(e))

Caught: extra incoming columns: {'ratio', 'z'}

An important point is: for this arrow notation composition is not mere list concatenation or function composition. Here is an example that makes this clear.

b1 = DataOpArrow(TableDescription(column_names=['x', 'y'], table_name=None). \
   extend({
    'x': 'x + 1',
    'y': 7
}))

b1

DataOpArrow( TableDescription( table_name='', column_names=[ 'x', 'y']) .
extend({ 'x': 'x + 1', 'y': '7'}), free_table_key='')

b2 = DataOpArrow(TableDescription(column_names=['x', 'y'], table_name=None). \
   extend({
    'y': 9
}))

Now watch what happens when we use “>>” to compose the arrows b1 and b2.

b1 >> b2

DataOpArrow( TableDescription( table_name='', column_names=[ 'x', 'y']) .
extend({ 'x': 'x + 1', 'y': '9'}), free_table_key='')

Notice in this special case the composition of b1 and b2 is a single extend node combining the operations and eliminating the dead-value 7. The idea is: the package has some freedom to define composition as long as it is associative. In this case we have an optimization at the compose step so the composition is not list concatenation or function composition.

As we have said, a typical user will not take the time to establish pre-conditions on steps. So they are not so much working with arrows as with operators that can be specialized to arrows. An actual user might build up the above pipeline as follows.

TableDescription(column_names=['x', 'y'], table_name=None). \
   extend({
    'x': 'x + 1',
    'y': 7
    }). \
   extend({
    'y': 9
    })
    

TableDescription( table_name='', column_names=[ 'x', 'y']) .
extend({ 'x': 'x + 1', 'y': '9'})

We recently demonstrated this sort of optimization in the R rquery package.

In the above example the user still benefits from the category theory design. As they composed left to right the system was able to add in the pre-conditions for them. The user only needs to set pre-conditions for non-trivial right-hand side pipelines.

Conclusion

The advantage the data_algebra package gets from category theory is: it lets us design the package action (how the package works on data) somewhat independently from operator composition. This gives us a lot more design room and power than a strict function composition or list concatenation theory would give us.
