Schemas for Python Data Frames
A common missing component of the Python data science ecosystem is a general “Pythonic” mechanism for data schema definition, documentation, and invariant enforcement.
It turns out it is quite simple to add such functionality using Python decorators. This isn’t particularly useful for general functions (such as pd.merge()), where the function is supposed to support arbitrary data schemas. However, it can be very useful in adding checks and safety to specific applications and analysis workflows built on top of such generic functions. In fact, it is a good way to copy schema details from external data sources such as databases or CSV files into enforced application invariants. Application code that transforms fixed tables into expected exported results can benefit greatly from such schema documentation and enforcement.
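For example, here is one way to transcribe an external CSV source’s inferred column types into a specification dictionary. This is only a sketch: the file name events.csv and the dtype-kind mapping are illustrative assumptions, not part of any library.

# sketch: transcribe a CSV's inferred column types into a schema
# specification ('events.csv' and the dtype-kind mapping are
# illustrative assumptions)
import pandas as pd

d = pd.read_csv('events.csv')
kind_to_type = {'i': int, 'f': float, 'b': bool, 'O': str}
spec = {col: kind_to_type.get(dtype.kind, object)
        for col, dtype in d.dtypes.items()}
# spec can now be pasted into a schema-checking decoration
print(spec)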
I propose the following simple check criteria for function arguments and data frames, applied to both inputs and outputs:
- Data must have at least the set of argument names or column names specified.
- Each column must have no more types (for non-null values) than the types specified.
In this note I will demonstrate how to add such schema documentation and enforcement to Python functions working over data frames, using Python decorators.
Let’s import our modules.
# import modules
from pprint import pprint
import numpy as np
import pandas as pd
import polars as pl
import data_algebra as da
from data_algebra.data_schema import SchemaCheckSwitch
We want to confirm each data frame has:
- At least the columns we expect.
- No types we don’t expect in those columns.
These two covariant constraints are what we need to ensure we can write the operations over columns (which we need to know exist), and to not get unexpected results (from unexpected types). Instead of getting down-stream signalling or, worse, non-signalling errors during column operations, we get useful exceptions on columns and values. This can be particularly useful for data science code near external data sources such as databases or CSV (comma separated value) files. Many of these sources themselves have data schemas and schema documentation that one can copy into the application.
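To make these two constraints concrete, here is a minimal hand-rolled version of such a check. This is a sketch only: the helper check_frame_schema is illustrative and not the data_algebra implementation, and for brevity it compares raw values directly against Python types.

# minimal hand-rolled illustration of the two checks
# (a sketch; not the data_algebra implementation)
def check_frame_schema(d, spec, *, name='data frame'):
    """Raise TypeError if d lacks a specified column, or a column
    holds a non-null value of a type outside the allowed set."""
    issues = []
    for col, allowed in spec.items():
        allowed_types = tuple(allowed) if isinstance(allowed, set) else (allowed,)
        if col not in d.columns:
            issues.append(f"missing required column {col!r}")
            continue
        for v in d[col]:
            # a full implementation would also map backend scalar types
            # (e.g. numpy.int64) back to the corresponding Python types
            if v is not None and not isinstance(v, allowed_types):
                issues.append(f"column {col!r} has disallowed type {type(v).__name__}")
                break
    if issues:
        raise TypeError(f"{name}, issues: " + "; ".join(issues))

check_frame_schema(pd.DataFrame({'x': [3.0]}), {'x': int}, name='arg c')
# -> TypeError: arg c, issues: column 'x' has disallowed type float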
We also want to be able to turn enforcement on or off across an entire code base easily. To do this we define an indirect importer called schema_check.py. Its code looks like the following:
from data_schema import SchemaCheckSwitch

# from data_schema import SchemaMock as SchemaCheck
from data_schema import SchemaRaises as SchemaCheck

SchemaCheckSwitch().on()
Isolating these lines in a shared import lets all other code switch behavior by only editing this file.
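As an aside, a process-wide switch like SchemaCheckSwitch can be built as a small singleton. The following is only a sketch of that pattern, not the actual data_schema internals:

# sketch of a process-wide on/off switch (pattern illustration,
# not the actual data_schema implementation)
class _CheckSwitch:
    _instance = None

    def __new__(cls):
        # always hand back the same shared instance
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.enabled = True
        return cls._instance

    def on(self):
        self.enabled = True

    def off(self):
        self.enabled = False

# every decorator consults the same shared instance before checking
assert _CheckSwitch() is _CheckSwitch()
_CheckSwitch().off()
assert _CheckSwitch().enabled is False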
Let’s go ahead and import that code.
# use an indirect import, so the entire package behavior
# can be changed globally all at once
import schema_check
# standard definition of a function
def fn(a, /, b, *, c, d=None):
"""doc"""
return d
Now we redefine the same function, this time with a SchemaCheck decoration. The details of this decorator are documented here.
# same function definition, now with schema decorator
@schema_check.SchemaCheck({
'a': int,
'b': {int, float},
'c': {'x': int},
},
return_spec={'z': float})
def fn(a, /, b, *, c, d=None):
"""doc"""
return d
We are deliberately concentrating on data frames, and not the inspection of arbitrary composite Python types. This is because we want to enforce data frame or table schemas, and not inflict an arbitrary runtime type system on Python. Schemas over tables of atomic types remain a sweet spot for data definitions.
Our decorator documentation declares that fn() expects at least:

- an argument a of type int.
- an argument b of type int or float.
- an argument c that is a data frame (implied by the dictionary argument), and that data frame contains a column x that has no non-null elements of type other than int.
- to return a data frame (indicated by the dictionary argument) that has at least a column z that contains no non-null elements of type other than float.
This gives us some enforceable invariants that can improve our code.
We can see this repeated back in the decorator-altered help().
# show altered help text
help(fn)
Help on function fn in module __main__:

fn(a, /, b, *, c, d=None)
    arg specifications
    {'a': <class 'int'>,
     'b': {<class 'float'>, <class 'int'>},
     'c': {'x': <class 'int'>}}
    return specification:
    {'z': <class 'float'>}
    doc
Let’s see it catch an error. We show what happens if we call fn() with none of the expected arguments.
# catch schema mismatch
threw = False
try:
fn()
except TypeError as e:
print(e)
threw = True
assert threw
function fn(), issues:
 expected arg a missing
 expected arg b missing
 expected arg c missing
# catch schema mismatch
threw = False
try:
fn(1, 2, c=3)
except TypeError as e:
print(e)
threw = True
assert threw
function fn(), issues: arg c expected a Pandas or Polars data frame, had int
# catch schema mismatch
threw = False
try:
fn(1, 2, c=pd.DataFrame({'z': [7]}))
except TypeError as e:
print(e)
threw = True
assert threw
function fn(), issues: arg c missing required column 'x'
# catch schema mismatch
threw = False
try:
fn(1, 2, c=pd.DataFrame({'x': [3.0]}))
except TypeError as e:
print(e)
threw = True
assert threw
function fn(), issues: arg c column 'x' expected type int, found type float
# catch schema mismatch
rv = None
threw = False
try:
fn(
1,
2,
c=pd.DataFrame({'x': [30], "z": [17.2]}),
d=pd.DataFrame({'q': [7.0]}))
except TypeError as e:
print(e.args[0])
rv = e.args[1]
threw = True
assert threw
# the return value is available for inspection
rv
fn() return value: missing required column 'z'
|   | q   |
|---|-----|
| 0 | 7.0 |
Note the offending return value is attached to the thrown TypeError to help with diagnosis and debugging.

Again, these sorts of checks are not for generic utility methods (such as pd.merge()), which are designed to work over a larger variety of schemas. However, they are very useful near client interfaces, APIs, and database tables. This technique and data algebra processing may naturally live near data sources. There is an under-appreciated design principle that package code should be generic, and application code should be specific (even in the same project).
Let’s show a successful call.
fn(
1,
b=2,
c=pd.DataFrame({'x': [3]}),
d=pd.DataFrame({'z': [7.0]}))
|   | z   |
|---|-----|
| 0 | 7.0 |
# turn off checking globally
SchemaCheckSwitch().off()
# show wrong return value is now allowed
fn(
1,
2,
c=pd.DataFrame({'x': [30], "z": [17.2]}),
d=pd.DataFrame({'q': [7.0]}))
|   | q   |
|---|-----|
| 0 | 7.0 |
Note the above return value does not have the declared z column, but with checks off the function is not interfered with.

When checks are on, failures are detected much closer to their causes, making debugging and diagnosis much easier. Also, the decorations are an easy way to document, in human-readable form, the basics of the expected input and output schemas.
And the input and output schemas are attached to the function as objects.
# show argument schema specifications
pprint(fn.data_schema.arg_specs)
{'a': <class 'int'>,
 'b': {<class 'float'>, <class 'int'>},
 'c': {'x': <class 'int'>}}
# show return value schema
pprint(fn.data_schema.return_spec)
{'z': <class 'float'>}
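Because these specifications are plain Python objects, application code can reuse them. For example, here is a small sketch (assuming only the arg_specs dictionary shown above) that pre-checks a candidate data frame against the columns declared for argument c before calling fn():

# reuse the attached schema objects: list which declared columns a
# candidate data frame is missing for the data-frame argument 'c'
c_spec = fn.data_schema.arg_specs['c']   # {'x': <class 'int'>}
candidate = pd.DataFrame({'y': [1]})
missing = [col for col in c_spec if col not in candidate.columns]
print(missing)  # ['x']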
A downside is that the technique can run into what I call “the first rule of meta-programming”: meta-programming only works as long as it doesn’t run into other meta-programming (also called the “it’s only funny when I do it” theorem). That being said, I feel these decorators can be very valuable in Python data science projects.
This documentation and demo can be found here.
# turn back on checking globally
SchemaCheckSwitch().on()
# failing example in Polars
threw = False
try:
fn(1, 2, c=pl.DataFrame({'z': [7]}))
except TypeError as e:
print(e)
threw = True
assert threw
function fn(), issues: arg c missing required column 'x'
# failing example in Polars
rv = None
threw = False
try:
fn(
1,
2,
c=pl.DataFrame({'x': [30], "z": [17.2]}),
d=pl.DataFrame({'q': [7.0]}))
except TypeError as e:
print(e.args[0])
rv = e.args[1]
threw = True
assert threw
# the return value is available for inspection
rv
fn() return value: missing required column 'z'
shape: (1, 1)

| q   |
|-----|
| f64 |
| 7.0 |
# good example in Polars
fn(
1,
b=2,
c=pl.DataFrame({'x': [3]}),
d=pl.DataFrame({'z': [7.0]}))
shape: (1, 1)

| z   |
|-----|
| f64 |
| 7.0 |
The SchemaCheck decoration is a simple and effective tool to add schema documentation and enforcement to your analytics projects.
# show some relevant versions
pprint({
'pd': pd.__version__,
'pl': pl.__version__,
'np': np.__version__,
'da': da.__version__})
{'da': '1.6.10', 'np': '1.25.2', 'pd': '2.0.3', 'pl': '0.19.2'}