We have a new improved version of the “how to design a cdata/data_algebra data transform” video up!
The original article, the Python example, and the R example have all been updated to use the new video.
Please check it out!
Nina Zumel and I have two new tutorials on fluid data wrangling/shaping. They are written in a parallel structure, with the R version of the tutorial being almost identical to the Python version of the tutorial.
This reflects our opinion on the question “which is better for data science: R or Python?” They both are great. So start with one, and expect to eventually work with both (if you are lucky).
Each of these tutorials includes a link to our new “design a fluid data transform in under 1 minute” instructional video.
The video is unlikely to make sense without reading the articles (and possibly some of the linked backing tutorials). But for the prepared mind this video can be an “Aha!” moment.
Once you get your head around the concept (which takes much longer than a minute!), you can see how we take an example input/output pair and annotate them to become the data transform specification. This can be groundbreaking, as it encourages you to spend all of your time thinking about the data. It is easy to copy/paste the specific detailed commands after you have the specification in place.
We’ve been experimenting with this for a while, and the next R vtreat package will have a back-port of the Python vtreat package sklearn pipe step interface (in addition to the standard R interface).
This means the user can easily express modeling intent by choosing between coder$fit_transform(train_data), coder$fit(train_data_cal)$transform(train_data_model), and coder$fit(application_data).
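The distinction between these intents can be sketched with a generic sklearn transformer (here StandardScaler stands in for the vtreat coder; this is an illustration of the fit/transform calling patterns, not the vtreat API):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0], [4.0]])
calibration, model_part = train[:2], train[2:]
application = np.array([[5.0], [6.0]])

# intent 1: learn the encoding and encode the same data in one step
scaler = StandardScaler()
prepared_train = scaler.fit_transform(train)

# intent 2: learn on a calibration split, encode a disjoint modeling split
prepared_model = StandardScaler().fit(calibration).transform(model_part)

# intent 3: encode new application data with an already-fit transform
prepared_app = scaler.transform(application)
print(prepared_train.shape, prepared_model.shape, prepared_app.shape)
```

Keeping these three call patterns visually distinct is what makes the pipe-step notation a good way to document modeling intent.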
We have also regenerated the current task-oriented vtreat documentation to demonstrate the new nested bias warning feature:

R regression example, Python regression example.
R classification example, Python classification example.
R unsupervised example, Python unsupervised example.
R multinomial classification example, Python multinomial classification example.

And we now have new versions of these documents showing the sklearn $fit_transform() style notation in R:

R $fit_transform() regression example.
R $fit_transform() classification example.
R $fit_transform() unsupervised example.
R $fit_transform() multinomial classification example.

The original R interface is going to remain the standard interface for vtreat. It is more idiomatic R, and is taught in chapter 8 of Zumel, Mount; Practical Data Science with R, 2nd Edition, Manning 2019.
In contrast, the $fit_transform()
notation will always just be an adaptor over the primary R interface. However, there is a lot to be learned from sklearn’s organization and ideas, so we felt we would make their naming convention available as a way of showing appreciation and giving credit. Some more of my notes on the grace of the sklearn interface as a good way to manage state and generative effects (see Brendan Fong, David I. Spivak; An Invitation to Applied Category Theory, Cambridge University Press, 2019) can be found here.
Data science is being used in many ways to improve healthcare and reduce costs. We have written a textbook, Introduction to Biomedical Data Science, to help healthcare professionals understand the topic and to work more effectively with data scientists. The textbook content and data exercises do not require programming skills or higher math. We introduce open source tools such as R and Python, as well as easy-to-use interfaces to them such as BlueSky Statistics, jamovi, R Commander, and Orange. Chapter exercises are based on healthcare data, and supplemental YouTube videos are available in most chapters.
For instructors, we provide PowerPoint slides for each chapter, exercises, quiz questions, and solutions. Instructors can download an electronic copy of the book, the Instructor Manual, and PowerPoints after first registering on the instructor page.
The book is available in print
and various electronic formats. Because it is self-published, we plan to update it more rapidly than would be
possible through traditional publishers.
Below you will find a detailed table of contents and a list
of the textbook authors.
OVERVIEW OF BIOMEDICAL DATA SCIENCE
SPREADSHEET TOOLS AND TIPS
BIOSTATISTICS PRIMER
DATA VISUALIZATION
INTRODUCTION TO DATABASES
BIG DATA
BIOINFORMATICS and PRECISION MEDICINE
PROGRAMMING LANGUAGES FOR DATA ANALYSIS
MACHINE LEARNING
ARTIFICIAL INTELLIGENCE
Brenda Griffith
Technical Writer
Data.World
Austin, TX
Robert Hoyt MD, FACP, ABPM-CI, FAMIA
Associate Clinical Professor
Department of Internal Medicine
Virginia Commonwealth University
Richmond, VA
David Hurwitz MD, FACP, ABPM-CI
Associate CMIO
Allscripts Healthcare Solutions
Chicago, IL
Madhurima Kaushal MS
Bioinformatics
Washington University at St. Louis, School of Medicine
St. Louis, MO
Robert Leviton MD, MPH, FACEP, ABPM-CI, FAMIA
Assistant Professor
New York Medical College
Department of Emergency Medicine
Valhalla, NY
Karen A. Monsen PhD, RN, FAMIA, FAAN
Professor
School of Nursing
University of Minnesota
Minneapolis, MN
Robert Muenchen MS, PSTAT
Manager, Research Computing Support
University of Tennessee
Knoxville, TN
Dallas Snider PhD
Chair, Department of Information Technology
University of West Florida
Pensacola, FL
A special thanks to Ann Yoshihashi MD for her help with the publication of this textbook.
MinIO is an object storage server compatible with the Amazon S3 API, released under Apache License v2. It is a very convenient tool for data scientists and machine learning engineers to easily collaborate and share data and machine learning models. As an object store, MinIO can store unstructured data such as photos, videos, log files, backups, and container images. The maximum size of an object is 5TB.
In this tutorial, I will show you how to build a simple machine learning model, connect to a MinIO server, and load and extract saved models. What this tutorial will not cover is installing MinIO, as the documentation on the website is well written.
https://min.io/download#/linux
from sklearn.datasets import load_iris
# load the iris data set
iris = load_iris()
type(iris)
sklearn.utils.Bunch
# define features and class labels
x = iris.data
y = iris.target
from sklearn.model_selection import train_test_split

# split the data into train and test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.5)
from sklearn import tree
# define a classifier
classifier = tree.DecisionTreeClassifier()

# fit the classifier
classifier.fit(x_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=None, splitter='best')
# perform predictions and save the predictions in an object
predictions = classifier.predict(x_test)
from sklearn.metrics import accuracy_score

# estimate the accuracy of the model
print(accuracy_score(y_test, predictions))
0.9466666666666667
# use the joblib library to do the exporting
from sklearn.externals import joblib
from joblib import dump
/usr/local/lib/python3.7/site-packages/sklearn/externals/joblib/__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
warnings.warn(msg, category=DeprecationWarning)
filename = 'decisionTree.sav'
joblib.dump(classifier, filename)
['decisionTree.sav']
from minio import Minio
from minio.error import ResponseError

# create a connection to the server
minioClient = Minio('192.168.1.1:8080',
                    access_key='test',
                    secret_key='test123',
                    secure=False)
Now we can export the model, decisionTree.sav, to a bucket called example. The file_data and file_stat arguments are required for the export.
import os

# open the file and put the object in the bucket called example
with open('decisionTree.sav', 'rb') as file_data:
    file_stat = os.stat('decisionTree.sav')
    minioClient.put_object('example', 'decisionTree.sav',
                           file_data, file_stat.st_size)
We can now list the objects in the bucket to check for the uploaded file. Below we can see the bucket is example, the file name is decisionTree.sav, and the time it was uploaded.
# list all object paths in the bucket
objects = minioClient.list_objects('example', recursive=True)
for obj in objects:
    print(obj.bucket_name, obj.object_name.encode('utf-8'),
          obj.last_modified, obj.etag, obj.size, obj.content_type)
example b'decisionTree.sav' 2020-01-13 16:26:35.982000+00:00 895f7dd35c0723a74338825e78a8d7d3-1 2022 None
# get the object from MinIO and save it as newfile
print(minioClient.fget_object('example', 'decisionTree.sav', "newfile"))
<Object: bucket_name: example object_name: b'decisionTree.sav' last_modified: time.struct_time(tm_year=2020, tm_mon=1, tm_mday=13, tm_hour=16, tm_min=26, tm_sec=35, tm_wday=0, tm_yday=13, tm_isdst=0) etag: 895f7dd35c0723a74338825e78a8d7d3-1 size: 2022 content_type: application/octet-stream, is_dir: False, metadata: {'Content-Type': 'application/octet-stream'}>
# use the downloaded object to do the predictions and print the result
filename = 'newfile'
loaded_model = joblib.load(filename)
result = loaded_model.score(x_test, y_test)
print(result)
0.9466666666666667
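As an aside, the temporary .sav file can be avoided by serializing into an in-memory buffer, which can then be handed to put_object along with its length. A minimal sketch using the standard-library pickle (which joblib wraps for most objects; joblib.dump also accepts file-like objects):

```python
import io
import pickle

# serialize an object into memory (a dict stands in for the fitted model)
payload = io.BytesIO()
pickle.dump({"demo": [1, 2, 3]}, payload)
size = payload.getbuffer().nbytes
payload.seek(0)

# the buffer and its size could then be passed to the server, e.g.:
# minioClient.put_object('example', 'model.pkl', payload, size)

# round-trip check: deserialize from the raw bytes
restored = pickle.load(io.BytesIO(payload.getvalue()))
print(restored["demo"])  # [1, 2, 3]
```

This avoids a write to local disk, which can matter when packaging models inside containers with read-only filesystems.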
If you are an R person, I have rewritten a package for MinIO. It’s based on aws.s3 from cloudyR.
For quite a while we have been teaching that estimating variable re-encodings on the exact same data later used to naively train a model leads to an undesirable nested model bias. The vtreat
package (both the R
version and the Python
version) incorporates a cross-frame method that allows one to use all the training data both to learn variable re-encodings and to correctly train a subsequent model (for an example please see our recent PyData LA talk).
The next version of vtreat
will warn the user if they have improperly used the same data for both vtreat
impact code inference and downstream modeling. So in addition to us warning you not to do this, the package now also checks and warns against this situation. vtreat
has had methods for avoiding nested model bias for a very long time; we are now adding new warnings to confirm users are using them.
This example is excerpted from some of our classification documentation.
One way to design variable treatments for binomial classification problems in vtreat
is to design a cross-frame experiment.
# For this example we want vtreat version 1.5.1 or newer
# remotes::install_github("WinVector/vtreat")
library(vtreat)
packageVersion("vtreat")
## [1] '1.5.1'
...
transform_design = vtreat::mkCrossFrameCExperiment(
# data to learn transform from
dframe = training_data,
# columns to transform
varlist = setdiff(colnames(training_data), c('y', 'yc')),
# outcome variable
outcomename = 'yc',
# outcome of interest
outcometarget = TRUE
)
Once we have that we can pull the data transform and correct cross-validated training frame off the returned object as follows.
transform <- transform_design$treatments
train_prepared <- transform_design$crossFrame
train_prepared
is prepared in the correct way to use the same training data for inferring the impact-coded variables, using the returned $crossFrame
from mkCrossFrameCExperiment()
.
We prepare new test or application data as follows.
test_prepared <- prepare(transform, test_data)
The issue is: for training data we should not call prepare()
, but instead use the cross-frame that is produced during transform design.
The point is we should not do the following:
train_prepared_wrong <- prepare(transform, training_data)
## Warning in prepare.treatmentplan(transform, training_data):
## possibly called prepare() on same data frame as designTreatments*()/
## mkCrossFrame*Experiment(), this can lead to over-fit. To avoid this, please use
## mkCrossFrame*Experiment$crossFrame.
Notice we now get a warning that we should not have done this, and that in doing so we may have introduced a nested model bias data leak.
And that is the new nested model bias warning feature.
The full R
example can be found here, and a full Python
example can be found here.
I’d like to share some new timings on a grouped in-place aggregation task. A client of mine was seeing some slow performance, so I decided to time a very simple abstraction of one of the steps of their workflow.
Roughly, the task was to add in some derived per-group aggregation columns to a few million row data set. In the application the groups tend to be small session logs from many users. So the groups are numerous and small.
We can create an abstract version of such data in R as follows.
set.seed(2020)
n <- 1000000
mk_data <- function(n) {
d <- data.frame(x = rnorm(n))
d$g <- sprintf("level_%09g",
sample.int(n, size = n, replace = TRUE))
return(d)
}
d <- mk_data(n)
The sampling with replacement yields an expected number of unique IDs of about n (1 - 1/e), or roughly 0.63 n (each ID is missed by all n draws with probability (1 - 1/n)^n, which is approximately 1/e). So we expect lots of small groups in such data.
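A quick standard-library simulation of the same sampling scheme confirms the small-group structure (a sketch; the exact count is random but concentrates tightly around the expectation):

```python
import random

random.seed(2020)
n = 100000
# n draws with replacement from n possible IDs
ids = [random.randrange(n) for _ in range(n)]
unique = len(set(ids))

# Expected distinct count: n * (1 - (1 - 1/n)**n) ~ n * (1 - 1/e) ~ 0.632 n
print(unique / n)  # close to 0.632
```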
Our task can be specified in rquery/rqdatatable notation as follows.
library(rqdatatable)
ops_rqdatatable <- local_td(d, name = 'd') %.>%
extend(.,
rn %:=% row_number(),
cs %:=% cumsum(x),
partitionby = 'g',
orderby = 'x') %.>%
order_rows(.,
c('g', 'x'))
The key step is the extend()
, which adds the new columns rn
and cs
in a per-g
group manner in a by-x
order. We feel the notation is learnable and expressive. (Note: normally we would use :=
for assignment, but as we are also running direct data.table examples we didn’t load this operator and instead used %:=%
to stay out of data.table’s way.)
We translated the same task into several different notations: data.table, dplyr, dtplyr, and data_algebra. The observed task times are given below.
Method | Interface Language | Data Engine | Mean run time in seconds |
---|---|---|---|
rqdatatable | R | data.table | 3.8 |
data.table | R | data.table | 2.1 |
dplyr | R | dplyr | 35.1 |
dtplyr | R | data.table | 5.1 |
data_algebra | Python | Pandas | 17.1 |
What is missing is a direct Pandas timing (to confirm whether the long Python run time comes from data_algebra overhead or from the underlying Pandas engine).
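For reference, a direct Pandas version of the task (a sketch for comparison, not the benchmarked code) might look like:

```python
import pandas as pd

# small stand-in for the benchmark data
d = pd.DataFrame({"x": [3.0, 1.0, 2.0, 5.0],
                  "g": ["a", "a", "a", "b"]})

# per-group row number and cumulative sum, computed in x order
res = d.sort_values(["g", "x"]).copy()
res["rn"] = res.groupby("g").cumcount() + 1
res["cs"] = res.groupby("g")["x"].cumsum()
print(res)
```

Timing this directly on the million-row data would separate the engine cost from any query-generator overhead.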
What stands out is how fast data.table, and even the data.table based methods, are compared to all other methods.
Details of the benchmark runs (methods, code, data, versions, and so on) can be found here.
I’ve been writing a lot about category theory interpretations of data-processing pipelines and some of the improvements we feel they are driving in both the data_algebra
and in rquery
/rqdatatable
.
I think I’ve found an even better category theory re-formulation of the package, which I will describe here.
In the earlier formalism our data transform pipelines were arrows over a category of sets of column names (sets of strings).
These pipelines acted on Pandas
tables or SQL
tables, with one table marked as special. Marking one table as special (or using a “pointed set” notation) lets us use a nice compositional notation, without having to appeal to something like operads. Treating one table as the table of interest is fairly compatible with data science, as when working with many tables one is often the primary model-frame and the rest are used to join in additional information.
The above formulation was really working well. But we have found a variation of the data_algebra
with an even neater formalism.
The data_algebra
objects have a very nice interpretation as arrows in a category whose objects are set families described by a set of required columns and a set of forbidden columns (the “at least” and “none of” sets in the arrow printouts below). The arrows a
and b
compose as a >> b
as long as:
all of the columns required by b
are produced by a
, and
none of the columns forbidden by b
are produced by a
.
.This is still an equality check of domains and co-domains, so as long as we maintain associativity we still have a nice category.
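The two composition conditions can be written as plain set checks. A deliberately simplified sketch of the rule (hypothetical function name, not the data_algebra implementation, and ignoring how columns are carried forward):

```python
# Check whether arrow a (producing a_produced) can be followed by arrow b.
def can_compose(a_produced, b_required, b_forbidden):
    # all columns required by b are produced by a
    ok_required = b_required <= a_produced
    # none of the columns forbidden by b are produced by a
    ok_forbidden = not (b_forbidden & a_produced)
    return ok_required and ok_forbidden

# values mirroring the example arrows a and b used below
print(can_compose({'a', 'b', 'c'}, {'a'}, {'x'}))  # True
print(can_compose({'a', 'b', 'x'}, {'a'}, {'x'}))  # False: 'x' is forbidden
```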
We can illustrate this below.
First we import our modules.
import sqlite3
import pandas
from data_algebra.data_ops import *
from data_algebra.arrow import fmt_as_arrow
import data_algebra.SQLite
We define our first arrow, which is a transform that creates a new column c
as the sum of the columns a
and b
.
a = TableDescription(table_name='table_a', column_names=['a', 'b']). \
extend({'c': 'a + b'})
a
TableDescription(
table_name='table_a',
column_names=[
'a', 'b']) .\
extend({
'c': 'a + b'})
print(fmt_as_arrow(a))
[
'table_a':
at least [ a, b ]
->
at least [ a, b, c ]
]
And we define our second arrow, b
, which renames the column a
to a new column name x
.
b = TableDescription(table_name='table_b', column_names=['a']). \
rename_columns({'x': 'a'})
b
TableDescription(
table_name='table_b',
column_names=[
'a']) .\
rename_columns({'x': 'a'})
print(fmt_as_arrow(b))
[
'table_b':
at least [ a ] , and none of [ x ]
->
at least [ x ]
]
The rules are met, so we can combine these two arrows.
ab = a >> b
ab
TableDescription(
table_name='table_a',
column_names=[
'a', 'b']) .\
extend({
'c': 'a + b'}) .\
rename_columns({'x': 'a'})
print(fmt_as_arrow(ab))
[
'table_a':
at least [ a, b ] , and none of [ x ]
->
at least [ b, c, x ]
]
Notice this produces a new arrow ab
with appropriate required and forbidden columns. By associativity (one of the primary properties needed to be a category) we get that the arrow ab
has an action on data frames the same as using the a
action followed by the b
action.
Let’s illustrate that here.
d = pandas.DataFrame({
'a': [1, 2],
'b': [30, 40]
})
d
a | b | |
---|---|---|
0 | 1 | 30 |
1 | 2 | 40 |
b.act_on(a.act_on(d))
x | b | c | |
---|---|---|---|
0 | 1 | 30 | 31 |
1 | 2 | 40 | 42 |
ab.act_on(d)
x | b | c | |
---|---|---|---|
0 | 1 | 30 | 31 |
1 | 2 | 40 | 42 |
.act_on()
copies forward all columns consistent with the transform specification and used at the output. Missing columns and excess columns are checked for at the start of a calculation.
excess_frame = pandas.DataFrame({
'a': [1],
'b': [2],
'd': [3],
'x': [4]})
try:
ab.act_on(excess_frame)
except ValueError as ve:
print("caught ValueError: " + str(ve))
caught ValueError: Table table_a has forbidden columns: {'x'}
The .transform()
method, on the other hand, copies forward only declared columns.
ab.transform(excess_frame)
x | b | c | |
---|---|---|---|
0 | 1 | 2 | 3 |
Notice in the above that the input x
did not interfere with the calculation, and d
was not copied forward. The idea is that behavior during composition is very close to behavior during action/application, so we find more issues during composition.
However, .transform()
does not associate with composition, or is not an action of this category, as we have b.transform(a.transform(d))
is not equal to ab.transform(d)
. .transform()
does associate with the arrows of the stricter identical column set category we demonstrated earlier, so it is an action of this category.
b.transform(a.transform(d))
x | |
---|---|
0 | 1 |
1 | 2 |
In both cases we still have result-oriented narrowing.
c = TableDescription(table_name='table_c', column_names=['a', 'b', 'c']). \
extend({'x': 'a + b'}). \
select_columns({'x'})
c
TableDescription(
 table_name='table_c',
 column_names=[
 'a', 'b', 'c']) .\
 extend({
 'x': 'a + b'}) .\
 select_columns(['x'])
print(fmt_as_arrow(c))
[
'table_c':
at least [ a, b, c ]
->
at least [ x ]
]
table_c = pandas.DataFrame({
'a': [1, 2],
'b': [30, 40],
'c': [500, 600],
'd': [7000, 8000]
})
table_c
a | b | c | d | |
---|---|---|---|---|
0 | 1 | 30 | 500 | 7000 |
1 | 2 | 40 | 600 | 8000 |
c.act_on(table_c)
x | |
---|---|
0 | 31 |
1 | 42 |
c.transform(table_c)
x | |
---|---|
0 | 31 |
1 | 42 |
.select_columns()
conditions are propagated back through the calculation.
Another useful operator is .drop_columns()
which drops columns if they are present, but does not raise an issue if the columns to be removed are already not present. .drop_columns()
can be used to guarantee forbidden columns are not present. We could use .act_on()
on excess_frame
using .drop_columns()
as follows.
tdr = describe_table(excess_frame).drop_columns(['x'])
tdr
TableDescription(
 table_name='data_frame',
 column_names=[
 'a', 'b', 'd', 'x']) .\
 drop_columns(['x'])
rab = tdr >> ab
rab
TableDescription(
 table_name='data_frame',
 column_names=[
 'a', 'b', 'd', 'x']) .\
 drop_columns(['x']) .\
 extend({
 'c': 'a + b'}) .\
 rename_columns({'x': 'a'})
The >>
notation is composing the arrows. tdr >> ab
is syntactic sugar for ab.apply_to(tdr)
. Both of these are the arrow composition operations.
rab.act_on(excess_frame)
x | b | d | c | |
---|---|---|---|---|
0 | 1 | 2 | 3 | 3 |
Remember, the original ab
operator rejects excess_frame
.
try:
ab.act_on(excess_frame)
except ValueError as ve:
print("caught ValueError: " + str(ve))
caught ValueError: Table table_a has forbidden columns: {'x'}
We can also adjust the input-specification by composing pipelines with table descriptions.
a
TableDescription(
 table_name='table_a',
 column_names=[
 'a', 'b']) .\
 extend({
 'c': 'a + b'})
bigger = TableDescription(table_name='bigger', column_names=['a', 'b', 'x', 'y', 'z'])
bigger
TableDescription(
 table_name='bigger',
 column_names=[
 'a', 'b', 'x', 'y', 'z'])
bigger_a = bigger >> a
bigger_a
TableDescription(
 table_name='bigger',
 column_names=[
 'a', 'b', 'x', 'y', 'z']) .\
 extend({
 'c': 'a + b'})
print(fmt_as_arrow(bigger_a))
[
'bigger':
at least [ a, b, x, y, z ]
->
at least [ a, b, c, x, y, z ]
]
Notice the new arrow (bigger_a
) has a wider input specification. Appropriate checking is performed during the composition.
As always, we can also translate any of our operators to SQL
.
db_model = data_algebra.SQLite.SQLiteModel()
print(bigger_a.to_sql(db_model=db_model, pretty=True))
SELECT "a" + "b" AS "c", "a", "x", "y", "b", "z" FROM "bigger"
The SQL
translation is similar to .transform()
in that it only refers to known columns by name. This means we are safe from extra columns in the source tables. This means if we did derive an action acting on SQL
or composition over SQL
it would not associate with the data_algebra
operator composition (just as .transform()
did not).
Notice we no longer have to use the arrow-adapter classes (except for formatting), the data_algebra
itself has been adjusted to a more direct categorical basis.
And that is some of how the data_algebra
works on our new set-oriented category. In this formulation much less annotation is required from the user, while still allowing very detailed record-keeping. The detailed record-keeping lets us find issues while assembling the pipelines, not later when working with potentially large/slow data.
In our recent note What is new for rquery
December 2019 we mentioned an ugly processing pipeline that translates into SQL
of varying size/quality depending on the query generator we use. In this note we try a near-relative of that query in the data_algebra
.
dplyr
translates the query to SQL
as:
SELECT 5.0 AS `x`, `sum23`
FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 4.0 AS `x`
FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 3.0 AS `x`
FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 2.0 AS `x`
FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 1.0 AS `x`
FROM (SELECT `col1`, `col2`, `col3`, `col2` + `col3` AS `sum23`
FROM `d`)))))
rquery
translates the query to SQL
as:
SELECT
"x",
"sum23"
FROM (
SELECT
"col2" + "col3" AS "sum23",
5 AS "x"
FROM (
SELECT
"col2",
"col3"
FROM
"example_table"
) tsql_28722584463189084716_0000000000
) tsql_28722584463189084716_0000000001
Notice the rquery
SQL
doesn’t copy the column col1
around, and also skips the dead-values assigned into x
. The query still has some waste: the inner and outer guard queries that are used to make SQL
look a bit more regular.
What I would like to add is our new note, showing what the data_algebra
translates a similar query into the following SQL
:
SELECT 5 AS "x",
"col2" + "col3" AS "sum23",
"col3"
FROM "d"
(The extra col3
is present because we asked for that column to be part of the result in the newer demonstration.) This new query has fewer unnecessary steps. The idea is one can code intent step-wise in a pipeline and still end up with a fairly compact and performant SQL
query in the end.
I think both rquery
and data_algebra
can save quite a lot of development resources and machine time in data wrangling tasks.
I would like to talk about some of the design principles underlying the data_algebra
package (and also in its sibling rquery
package).
The data_algebra
package is a query generator that can act on either Pandas
data frames or on SQL
tables. This is discussed on the project site and in the examples directory. In this note we will set up some technical terminology that will allow us to discuss some of the underlying design decisions. These are things that, when done well, the user doesn’t have to think much about. Discussing such design decisions at length can obscure some of their charm, but we would like to point out some features here.
(Note: if there are any blog-rendering problems the original source-workbook used to create this article can be found here.)
We will introduce a few ideas before trying to synthesize our thesis.
data_algebra
An introduction to the data_algebra
package can be found here. In this note we will discuss some of the inspirations for the package: Codd’s relational algebra, experience working with dplyr
at scale, sklearn Pipeline, and category theory.
A major influence on the data_algebra
design is sklearn.pipeline.Pipeline
. sklearn.pipeline.Pipeline
itself presumably became public with Edouard Duchesnay’s Jul 27, 2010 commit: “add pipeline”.
sklearn.pipeline.Pipeline
maintains a list of steps to be applied to data. What is interesting is the steps are not functions. Steps are instead objects that implement both a .transform()
and a .fit()
method.
.transform()
typically accepts a data-frame type structure and returns a modified version. Typical operations include adding a new derived column, selecting columns, selecting rows, and so on.
From a transform-alone point of view the steps compose like functions. For a list [s, t]
, transform(x)
is defined as:
transform([s, t], x) :=
t.transform(s.transform(x))
(the glyph “:=
” denoting “defined as”).
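A minimal sketch of this list-of-steps transform (with toy steps invented for illustration; the real Pipeline steps operate on data frames):

```python
# Apply each step's .transform() in list order: t.transform(s.transform(x)).
def transform(steps, x):
    for step in steps:
        x = step.transform(x)
    return x

# two toy transform-only steps
class AddOne:
    def transform(self, x):
        return x + 1

class Double:
    def transform(self, x):
        return 2 * x

print(transform([AddOne(), Double()], 5))  # (5 + 1) * 2 = 12
```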
The fit-perspective is where things get interesting. obj.fit(x)
changes the internal state of obj based on the value x
and returns a reference to obj
. I.e. it learns from the data and stores this result as a side-effect in the object itself. In sklearn
it is common to assume a composite method called .fit_transform()
often defined as:
obj.fit_transform(x) := obj.fit(x).transform(x)
(though for our own
vtreat
package, this is deliberately not the case).
Using .fit_transform()
we can explain that in a sklearn Pipeline
.fit()
is naturally thought of as:
fit([s, t], x) :=
t.fit(s.fit_transform(x))
My point is: sklearn.pipeline.Pipeline
generalizes function composition to something more powerful: the ability to both fit and to transform. sklearn.pipeline.Pipeline
is a natural way to store a sequence of parameterized or data-dependent data transform steps (such as centering, scaling, missing value imputation, and much more).
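A toy stateful step makes the fit-perspective concrete. This is a sketch (not the sklearn implementation): a centering step learns a mean in .fit() and subtracts it in .transform():

```python
class Center:
    """Toy data-dependent step: learns the mean, then subtracts it."""
    def fit(self, xs):
        self.mean = sum(xs) / len(xs)  # state stored as a side effect
        return self                     # returns a reference to itself

    def transform(self, xs):
        return [x - self.mean for x in xs]

    def fit_transform(self, xs):
        # obj.fit_transform(x) := obj.fit(x).transform(x)
        return self.fit(xs).transform(xs)

c = Center()
print(c.fit_transform([1.0, 2.0, 3.0]))  # [-1.0, 0.0, 1.0]

# fit([s, t], x) := t.fit(s.fit_transform(x)) over a two-step list
s, t = Center(), Center()
t.fit(s.fit_transform([1.0, 2.0, 3.0]))
print(t.mean)  # prints 0.0: the second step sees already-centered data
```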
This gives us a concrete example of where a rigid mindset that “function composition is the only form of composition” would miss design opportunities.
We are going to try to design our tools to be “first class citizens” in the sense of Strachey:
First and second class objects. In
ALGOL
, a real number may appear in an expression or be assigned to a variable, and either of them may appear as an actual parameter in a procedure call. A procedure, on the other hand, may only appear in another procedure call either as the operator (the most common case) or as one of the actual parameters. There are no other expressions involving procedures or whose results are procedures. Thus in a sense procedures in ALGOL
are second class citizens—they always have to appear in person and can never be represented by a variable or expression (except in the case of a formal parameter)…
What we will draw out is this: if our data transform steps are “first class citizens” we should expect to be able to store them in variables, compose them, examine them, and so on. A function that we can only call (or at best re-use) is not giving us as much as we expect from other types. Or alternately, if functions don't give us everything we want, we may not want to use them as our only type or abstraction of data processing steps.
Most people first encounter the mathematical concept of “composability” in terms of functions. This can give the false impression that to work with composable design principles, one must shoe-horn the object of interest to be functions or some sort of augmented functions.
This Procrustean view loses a lot of design opportunities.
In mathematics composability is directly studied by the field called “Category Theory.” So it makes sense to see if category theory may have ideas, notations, tools, and results that may be of use.
A lot of the benefit of category theory is lost if every time we try to apply category theory (or even just use some of the notation conventions) we attempt to explain all of category theory as a first step. So I will try to resist that urge here. I will introduce the bits I am going to use here.
Category theory routinely studies what are called “arrows.” When treated abstractly an arrow has two associated objects called the “domain” and “co-domain.” The names are meant to be evocative of the “domain” (space of inputs) and “co-domains” (space of outputs) from the theory of functions.
Functions are commonly defined as having a domain (the set of valid inputs) and a co-domain (a set containing all possible outputs). For functions f
and g
: if co-domain(f)
is contained in domain(g)
, then g . f
is defined as the function such that (g . f)(x) = g(f(x))
for all values in the domain of f. Mathematical functions are usually thought of as a specialization of binary relations, and considered to be uniquely determined by their evaluations (by the axiom of extensionality).
Packages that use function composition typically collect functions in lists and define operator composition either through lambda-abstraction or through list concatenation.
Category theory differs from function theory in that category theory talks about arrows instead of functions. The theory is careful to keep separate the following two concepts: what arrows are and how arrows are composed.
When using arrows to model a system we expect to be able to specify both what the arrows are and how the arrows compose, with some extra degrees of freedom in specifying each. For arrows a
and b
with co-domain(b) = domain(a)
: a . b
denotes the composition in the category, and is itself a new arrow in the same category. Composition is not allowed (or defined) when co-domain(b) != domain(a)
.
An action is a mapping from arrows and items to items. I.e. action(arrow, item) = new_item
. For categories the items may or may not be related to the domain and co-domain. Not all categories have actions, but when they do have actions the action must be compatible with arrow composition.
Good general references on category theory include:
Functions have a very ready category theory interpretation as arrows. Given a function f
with domain A
and co-domain B
, we can think of any triple (f, A', B')
as an arrow in a category of functions if A' is contained in A
and B is contained in B'
. In this formulation we define the arrow composition of (f, A', B')
and (g, C', D')
as (f . g, C', B')
where f . g
is defined to be the function such that for all x
in the domain of g
we have:
(f . g)(x) := f(g(x))
We will call the application of a function to a value as an example of an “action.” A function f()
“acts on its domain” and f(x)
is the action of f
on x
. For functions we can define the action “apply
” as:
apply(f, x) := f(x)
The extra generalization power we get from moving away from functions to arbitrary arrows (that might not correspond to functions) comes from the following:
To be a category a few conditions must be met, including: the composition must be associative and we must have some identity arrows. By “associative composition” we mean it must be the case that for arrows a, b, and c (with appropriate domains and co-domains):

(a . b) . c = a . (b . c)
Our action must also associate with arrow composition. That is, for values x we must have for co-variant actions:

apply(a . b, x) = apply(a, apply(b, x))

Or for contra-variant actions:

apply(a . b, x) = apply(b, apply(a, x))
The idea is: the arrow a . b
must have an action equal to the actions of a and b composed as functions. That is: arrow composition and actions can differ from function composition and function application, but they must be at least somewhat similar in that they remain associative.
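A quick concrete instance of a contra-variant action (an example of ours, not from the text above): ordinary functions can act on predicates by pre-composition, and then the order of composition reverses exactly as in the contra-variant equation.

```python
# Illustrative sketch: a contra-variant action.
# Functions act on predicates by pre-composition:
#   apply(f, p) := the predicate x -> p(f(x))

def compose(a, b):
    # ordinary (co-variant) function composition: (a . b)(x) = a(b(x))
    return lambda x: a(b(x))

def apply_contra(f, p):
    return lambda x: p(f(x))

double = lambda x: 2 * x
inc = lambda x: x + 1
is_even = lambda x: x % 2 == 0

# contra-variant compatibility: apply(a . b, p) = apply(b, apply(a, p))
lhs = apply_contra(compose(double, inc), is_even)
rhs = apply_contra(inc, apply_contra(double, is_even))
all(lhs(x) == rhs(x) for x in range(10))  # the two predicates agree
```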
We now have the background to see that category theory arrows differ from functions in that arrows are more general (we can pick more of their properties) and require a bit more explicit bookkeeping.
sklearn.pipeline.Pipeline
We now have enough notation to attempt a crude category theory description of sklearn.pipeline.Pipeline.

Define our sklearn.pipeline.Pipeline category P as follows:

- The category has a single object, which we will call 0. All arrows will have domain and co-domain equal to 0, i.e.: we are not doing any interesting pre-condition checking in this category. This sort of single-object category is called a “monoid.”
- The arrows are lists of Python objects that define .transform(), .fit(), and .fit_transform() methods.
- The composition a1 . a2 is defined as the list concatenation a2 + a1, “+” being Python's list concatenate in this case, and the order set to match the sklearn.pipeline.Pipeline list order convention.
- The action “transform_action” is defined as:

transform_action([step1, step2, ..., stepk], x) :=
    stepk.transform(... step2.transform(step1.transform(x)) )
To see this is a category (with a category-compatible action) we must check associativity of the composition (which in this case is list concatenation) and associativity of the action with respect to list concatenation.
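The check can be made concrete with a small sketch. The two “steps” below are trivial stand-ins invented for this example (not real sklearn transformers); the composition and action follow the definitions above.

```python
# Illustrative stand-in "pipeline steps" (not real sklearn transformers).
class AddOne:
    def transform(self, x):
        return [xi + 1 for xi in x]

class Scale:
    def __init__(self, s):
        self.s = s
    def transform(self, x):
        return [self.s * xi for xi in x]

def compose(a1, a2):
    # arrow composition a1 . a2 is the list concatenation a2 + a1
    return a2 + a1

def transform_action(steps, x):
    # stepk.transform(... step2.transform(step1.transform(x)) )
    for step in steps:
        x = step.transform(x)
    return x

a = [AddOne()]
b = [Scale(10)]
x = [1, 2]
# co-variant compatibility:
#   transform_action(a . b, x) = transform_action(a, transform_action(b, x))
lhs = transform_action(compose(a, b), x)
rhs = transform_action(a, transform_action(b, x))
# both sides are [11, 21]
```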
We can also try to model the .fit_transform() methods. We will not try to model the side effect that .fit_transform() changes the state of the arrows (to have the fit information in each step). But we can at least define an action (with side effects), “fit_transform_action”, as follows:

fit_transform_action([step1, step2, ..., stepk], x) :=
    stepk.fit_transform(... step2.fit_transform(step1.fit_transform(x)) )
To confirm this is an action (ignoring the side effects), we would want to check if the following equality holds:

fit_transform_action(a . b, x) =
    fit_transform_action(a, fit_transform_action(b, x))
The above should follow from brute-force pushing of the notation (assuming we have defined fit_transform_action correctly, and sufficiently characterized .fit_transform()).
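As a sketch of that check, here is a toy stateful step (a mean-centering step invented for this example, not a real sklearn class) run through both sides of the equality:

```python
# Toy stateful step invented for this example (not a real sklearn class):
# fit_transform records the mean (a side effect) and centers the data.
class Center:
    def fit_transform(self, x):
        self.mean = sum(x) / len(x)
        return [xi - self.mean for xi in x]

def compose(a1, a2):
    # arrow composition a1 . a2 as the list concatenation a2 + a1
    return a2 + a1

def fit_transform_action(steps, x):
    for step in steps:
        x = step.fit_transform(x)
    return x

x = [1.0, 2.0, 3.0]
# fresh step instances on each side so the fit side effects don't leak
lhs = fit_transform_action(compose([Center()], [Center()]), x)
rhs = fit_transform_action([Center()], fit_transform_action([Center()], x))
# both sides are [-1.0, 0.0, 1.0]
```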
Notice we didn't directly define a “fit_action” action, as it isn't obvious that it has a nice associative realization. This is an opportunity for theory to drive design; notation considerations hint that fit_transform() may be more fundamental than, and thus preferred over, fit().
The category theory concepts didn't so much design sklearn.pipeline.Pipeline as give us a set of criteria to evaluate the sklearn.pipeline.Pipeline design. We trust the category theory point of view is useful as it emphasizes associativity (which is a great property to have), and is routinely found to be a set of choices that work in complicated systems. The feeling being: the design points category theory seems to suggest turn out to be the ones you want down the road.
data_algebra
Now that we have some terminology, let's get back to the data_algebra.

What is the data_algebra? The data_algebra is a package for building up complex data manipulation queries. data_algebra queries are first class citizens in the Strachey sense (they can be: passed as an argument, returned from a function, modified, assigned to a variable, printed, inspected, and traversed as a data structure).
These queries can be realized as SQL (targeting PostgreSQL, Spark, and other implementations), or as acting on Pandas data (we are hoping to extend this to modin, RAPIDS, and others).
The data_algebra has a sibling R package group (rquery/rqdatatable) similar to dplyr. An introduction to the data_algebra can be found here.
We now have the terminology to concisely state a data_algebra design principle: use general concepts (such as category theory notation) to try to ensure data_algebra transforms are first class citizens (i.e. we can do a lot with them and to them).
If we were to again take a mere functional view of the data_algebra
we would say the data_algebra
is a set of functions that operate on data. They translate data frames to new data frames using Codd-inspired operations. We could think of the data_algebra
as acting on data on the right, and acting on data_algebra
operators on the left.
However, this is not the right abstraction. data_algebra methods primarily map data transforms to data transforms. But even this is too functional a view. It makes sense to think of data_algebra operators as arrows, and the whole point of arrows is composition.
The data_algebra can be mapped to a nice category. The idea being: something that can be easily mapped to an orderly system is itself likely a somewhat orderly system.
Good references on the application of category theory to concrete systems (including databases) include:
Our data_algebra category D is defined as follows.

- The objects of D are single-table schemas (lists of column names, possibly with column types).
- The arrows of D are data_algebra operator chains, with the domain being the expected input schema and the co-domain being the schema of the result.

Some notes on the category theory interpretation of the data_algebra package can be found here.
Let's demonstrate the above with Python
code. The data_algebra
allows for the specification of data transforms as first class objects.
First we import some modules and create some example data.
from data_algebra.data_ops import *
import pandas
d = pandas.DataFrame({
'x': [1, 2, 3],
'y': [3, 4, 4],
})
d
| | x | y |
|---|---|---|
| 0 | 1 | 3 |
| 1 | 2 | 4 |
| 2 | 3 | 4 |
To specify adding a new derived column z
we would write code such as the following.
td = describe_table(d)
a = td.extend(
{ 'z': 'x.mean()' },
partition_by=['y']
)
a
TableDescription( table_name='data_frame', column_names=[ 'x', 'y']) .
extend({ 'z': 'x.mean()'}, partition_by=['y'])
We can let this transform act on data.
a.transform(d)
| | x | y | z |
|---|---|---|---|
| 0 | 1 | 3 | 1.0 |
| 1 | 2 | 4 | 2.5 |
| 2 | 3 | 4 | 2.5 |
We can compose this transform with more operations to create a composite transform.
b = a.extend({
'ratio': 'y / x'
})
b
TableDescription( table_name='data_frame', column_names=[ 'x', 'y']) .
extend({ 'z': 'x.mean()'}, partition_by=['y']) .
extend({ 'ratio': 'y / x'})
As a bonus we can also map the above transform to a SQL
query representing the same action in databases.
from data_algebra.SQLite import SQLiteModel
print(
b.to_sql(db_model=SQLiteModel(), pretty=True)
)
SELECT "x",
"y",
"z",
"y" / "x" AS "ratio"
FROM
(SELECT "x",
"y",
avg("x") OVER (PARTITION BY "y") AS "z"
FROM ("data_frame") "SQ_0") "SQ_1"
All of this is the convenient interface we expect users will want. However, if we ask that all operators specify their expected input schema (that is, their domain), we get the category D. We don't expect users to do this, but we have code supporting this style of notation to show that the data_algebra is in fact related to a nice category over schemas.
Let's re-write the above queries as formal category arrows.
from data_algebra.arrow import *
a1 = DataOpArrow(a)
print(str(a1))
[ 'data_frame': [ x: <class 'numpy.int64'>, y: <class 'numpy.int64'> ] -> [ x, y, z ] ]
The above is rendering the arrow as just its domain and co-domain. The domain and co-domains are just single-table schemas: lists of column names (possibly with column types).
We can get a more detailed representation of the arrow as follows.
print(a1.__repr__())
DataOpArrow( TableDescription( table_name='data_frame', column_names=[ 'x', 'y']) .
extend({ 'z': 'x.mean()'}, partition_by=['y']), free_table_key='data_frame')
Or we can examine the domain and co-domain directly. Here we are using a common category theory trick: associating the object with the identity arrow of the object. So what we are showing as domain and co-domains are actually identity arrows instead of objects.
a1.dom()
DataOpArrow( TableDescription( table_name='', column_names=[ 'x', 'y']), free_table_key='')
a1.cod()
DataOpArrow( TableDescription( table_name='', column_names=[ 'x', 'y', 'z']), free_table_key='')
Now we can write our second transform step as an arrow as follows.
a2 = DataOpArrow(a1.cod_as_table().extend({
'ratio': 'y / x'
}))
a2
DataOpArrow( TableDescription( table_name='', column_names=[ 'x', 'y', 'z']) .
extend({ 'ratio': 'y / x'}), free_table_key='')
We took extra steps, which most users will not want to take, to wrap the second-stage (a2) operations as an arrow. Being an arrow means that we have a domain and co-domain that can be used to check if operations are composable.

A typical user would not work with arrows directly, but instead work with the data algebra, which is itself a shorthand for the arrows. That is: users may want the power of a category, but they don't want to be the ones handling the extra bookkeeping. To add an extra operation a user would work directly with a and just write the following.
a.extend({
'ratio': 'y / x'
})
TableDescription( table_name='data_frame', column_names=[ 'x', 'y']) .
extend({ 'z': 'x.mean()'}, partition_by=['y']) .
extend({ 'ratio': 'y / x'})
The above has substantial pre-condition checking and optimizations (as it is merely user facing shorthand for the arrows).
The more cumbersome arrow notation (that requires the specification of pre-conditions) has a payoff: managed arrow composition. That is: complex operator pipelines can be directly combined. We are not limited to extending one operation at a time.
If the co-domain of arrow matches the domain of another arrow we can compose them left to right as follows.
a1.cod() == a2.dom()
True
composite = a1 >> a2
composite
DataOpArrow( TableDescription( table_name='data_frame', column_names=[ 'x', 'y']) .
extend({ 'z': 'x.mean()'}, partition_by=['y']) .
extend({ 'ratio': 'y / x'}), free_table_key='data_frame')
And when this isn't the case, composition is not allowed. This is exactly what we want as this means the preconditions (exactly which columns are present) for the second arrow are not supplied by the first arrow.
a2.cod() == a1.dom()
False
try:
a2 >> a1
except ValueError as e:
print("Caught: " + str(e))
Caught: extra incoming columns: {'ratio', 'z'}
An important point is: for this arrow notation composition is not mere list concatenation or function composition. Here is an example that makes this clear.
b1 = DataOpArrow(TableDescription(column_names=['x', 'y'], table_name=None). \
extend({
'x': 'x + 1',
'y': 7
}))
b1
DataOpArrow( TableDescription( table_name='', column_names=[ 'x', 'y']) .
extend({ 'x': 'x + 1', 'y': '7'}), free_table_key='')
b2 = DataOpArrow(TableDescription(column_names=['x', 'y'], table_name=None). \
extend({
'y': 9
}))
Now watch what happens when we use “>>” to compose the arrows b1 and b2.
b1 >> b2
DataOpArrow( TableDescription( table_name='', column_names=[ 'x', 'y']) .
extend({ 'x': 'x + 1', 'y': '9'}), free_table_key='')
Notice in this special case the composition of b1 and b2 is a single extend node combining the operations and eliminating the dead value 7. The idea is: the package has some freedom to define composition as long as it is associative. In this case we have an optimization at the compose step, so the composition is not list concatenation or function composition.
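The flavor of such an optimizing, yet still associative, composition can be sketched in a simplified setting of our own devising: treat a step as a dict of column assignments, and merge the dicts on composition. (This toy merge ignores the real package's handling of expressions that reference earlier derived columns; it only illustrates that an optimizing composition can remain associative.)

```python
# Toy model of an optimizing composition: a "step" is a dict of column
# assignments, and composing steps merges the dicts, with later
# assignments overwriting earlier (dead) ones. Dict merge is associative,
# so this is still a legal category composition.

def compose(first, second):
    merged = dict(first)
    merged.update(second)  # the later step's assignments win
    return merged

b1 = {'x': 'x + 1', 'y': '7'}
b2 = {'y': '9'}
merged = compose(b1, b2)  # {'x': 'x + 1', 'y': '9'} -- dead value 7 eliminated
```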
As we have said, a typical user will not take the time to establish pre-conditions on steps. So they are not so much working with arrows as with operators that can be specialized to arrows. An actual user might build up the above pipeline as follows.
TableDescription(column_names=['x', 'y'], table_name=None). \
extend({
'x': 'x + 1',
'y': 7
}). \
extend({
'y': 9
})
TableDescription( table_name='', column_names=[ 'x', 'y']) .
extend({ 'x': 'x + 1', 'y': '9'})
We recently demonstrated this sort of optimization in the R
rquery
package.
In the above example the user still benefits from the category theory design. As they composed left to right the system was able to add in the pre-conditions for them. The user only needs to set pre-conditions for non-trivial right-hand side pipelines.
The advantage the data_algebra
package gets from category theory is: it lets us design the package action (how the package works on data) somewhat independently from operator composition. This gives us a lot more design room and power than a strict function composition or list concatenation theory would give us.