Schemas for Python Data Frames
A common missing component of the Python data science ecosystem is a general “Pythonic” mechanism for data schema definition, documentation, and invariant enforcement.
It turns out it is quite simple to add such functionality using Python decorators. This isn’t particularly useful for general functions (such as pd.merge()), where the function is supposed to support arbitrary data schemas. However, it can be very useful in adding checks and safety to specific applications and analysis workflows built on top of such generic functions. In fact, it is a good way to copy schema details from external data sources such as databases or CSV files into enforced application invariants. Application code that transforms fixed tables into expected exported results can benefit greatly from such schema documentation and enforcement.
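For example, here is one way to transcribe an external CSV source’s inferred column types into a specification dictionary. This is only a sketch: the file name events.csv and the dtype-kind mapping are illustrative assumptions, not part of any library.

# sketch: transcribe a CSV's inferred column types into a schema
# specification ('events.csv' and the dtype-kind mapping are
# illustrative assumptions)
import pandas as pd

d = pd.read_csv('events.csv')
kind_to_type = {'i': int, 'f': float, 'b': bool, 'O': str}
spec = {col: kind_to_type.get(dtype.kind, object)
        for col, dtype in d.dtypes.items()}
# spec can now be pasted into a schema-checking decoration
print(spec)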
I propose the following simple check criteria for function arguments and data frames, applied to both inputs and outputs:
- Data must have at least the set of argument names or column names specified.
- Each column must have no more types (for non-null values) than the types specified.
In this note I will demonstrate how to add such schema documentation and enforcement to Python functions working over data frames, using Python decorators.
Let’s import our modules.
# import modules
from pprint import pprint
import numpy as np
import pandas as pd
import polars as pl
import data_algebra as da
from data_algebra.data_schema import SchemaCheckSwitch
We want to confirm each data frame has:
- At least the columns we expect.
- No types we don’t expect in those columns.
These two covariant constraints are what we need to ensure we can write the operations over columns (which we need to know exist), and to not get unexpected results (from unexpected types). Instead of getting down-stream signalling or, worse, non-signalling errors during column operations, we get useful exceptions on columns and values. This can be particularly useful for data science code near external data sources such as databases or CSV (comma separated value) files. Many of these sources themselves have data schemas and schema documentation that one can copy into the application.
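To make these two constraints concrete, here is a minimal hand-rolled version of such a check. This is a sketch only: the helper check_frame_schema is illustrative and not the data_algebra implementation, and for brevity it compares raw values directly against Python types.

# minimal hand-rolled illustration of the two checks
# (a sketch; not the data_algebra implementation)
def check_frame_schema(d, spec, *, name='data frame'):
    """Raise TypeError if d lacks a specified column, or a column
    holds a non-null value of a type outside the allowed set."""
    issues = []
    for col, allowed in spec.items():
        allowed_types = tuple(allowed) if isinstance(allowed, set) else (allowed,)
        if col not in d.columns:
            issues.append(f"missing required column {col!r}")
            continue
        for v in d[col]:
            # a full implementation would also map backend scalar types
            # (e.g. numpy.int64) back to the corresponding Python types
            if v is not None and not isinstance(v, allowed_types):
                issues.append(f"column {col!r} has disallowed type {type(v).__name__}")
                break
    if issues:
        raise TypeError(f"{name}, issues: " + "; ".join(issues))

check_frame_schema(pd.DataFrame({'x': [3.0]}), {'x': int}, name='arg c')
# -> TypeError: arg c, issues: column 'x' has disallowed type float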
We also want to be able to turn enforcement on or off across an entire code base easily. To do this we define an indirect importer called schema_check.py. Its code looks like the following:
from data_schema import SchemaCheckSwitch

# from data_schema import SchemaMock as SchemaCheck
from data_schema import SchemaRaises as SchemaCheck

SchemaCheckSwitch().on()
Isolating these lines in a shared import lets all other code switch behavior by only editing this file.
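As an aside, a process-wide switch like SchemaCheckSwitch can be built as a small singleton. The following is only a sketch of that pattern, not the actual data_schema internals:

# sketch of a process-wide on/off switch (pattern illustration,
# not the actual data_schema implementation)
class _CheckSwitch:
    _instance = None

    def __new__(cls):
        # always hand back the same shared instance
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.enabled = True
        return cls._instance

    def on(self):
        self.enabled = True

    def off(self):
        self.enabled = False

# every decorator consults the same shared instance before checking
assert _CheckSwitch() is _CheckSwitch()
_CheckSwitch().off()
assert _CheckSwitch().enabled is False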
Let’s go ahead and import that code.
# use an indirect import, so the entire package behavior
# can be changed globally all at once
import schema_check
# standard definition of a function
def fn(a, /, b, *, c, d=None):
"""doc"""
return d
Now we redefine the same function, this time with a SchemaCheck decoration. The details of this decorator are documented here.
# same function definition, now with schema decorator
@schema_check.SchemaCheck({
'a': int,
'b': {int, float},
'c': {'x': int},
},
return_spec={'z': float})
def fn(a, /, b, *, c, d=None):
"""doc"""
return d
We are deliberately concentrating on data frames, and not the inspection of arbitrary composite Python types. This is because we want to enforce data frame or table schemas, and not inflict an arbitrary runtime type system on Python. Schemas over tables of atomic types remain a sweet spot for data definitions.
Our decorator documentation declares that fn() expects at least:

- an argument a of type int.
- an argument b of type int or float.
- an argument c that is a data frame (implied by the dictionary argument), and that data frame contains a column x that has no non-null elements of type other than int.
- to return a data frame (indicated by the dictionary argument) that has at least a column z that contains no non-null elements of type other than float.
This gives us some enforceable invariants that can improve our code.
We can see this repeated back in the decorator-altered help().
# show altered help text
help(fn)
Help on function fn in module __main__:

fn(a, /, b, *, c, d=None)
    arg specifications
    {'a': <class 'int'>,
     'b': {<class 'float'>, <class 'int'>},
     'c': {'x': <class 'int'>}}
    return specification:
    {'z': <class 'float'>}
    doc
Let’s see it catch an error. We show what happens if we call fn() with none of the expected arguments.
# catch schema mismatch
threw = False
try:
fn()
except TypeError as e:
print(e)
threw = True
assert threw
function fn(), issues:
 expected arg a missing
 expected arg b missing
 expected arg c missing
# catch schema mismatch
threw = False
try:
fn(1, 2, c=3)
except TypeError as e:
print(e)
threw = True
assert threw
function fn(), issues: arg c expected a Pandas or Polars data frame, had int
# catch schema mismatch
threw = False
try:
fn(1, 2, c=pd.DataFrame({'z': [7]}))
except TypeError as e:
print(e)
threw = True
assert threw
function fn(), issues: arg c missing required column 'x'
# catch schema mismatch
threw = False
try:
fn(1, 2, c=pd.DataFrame({'x': [3.0]}))
except TypeError as e:
print(e)
threw = True
assert threw
function fn(), issues: arg c column 'x' expected type int, found type float
# catch schema mismatch
rv = None
threw = False
try:
fn(
1,
2,
c=pd.DataFrame({'x': [30], "z": [17.2]}),
d=pd.DataFrame({'q': [7.0]}))
except TypeError as e:
print(e.args[0])
rv = e.args[1]
threw = True
assert threw
# the return value is available for inspection
rv
fn() return value: missing required column 'z'
|   | q   |
|---|-----|
| 0 | 7.0 |
Note the offending return value is attached to the thrown TypeError to help with diagnosis and debugging.

Again, these sorts of checks are not for generic utility methods (such as pd.merge()), which are designed to work over a larger variety of schemas. However, they are very useful near client interfaces, APIs, and database tables. This technique and data algebra processing may naturally live near data sources. There is an under-appreciated design principle that package code should be generic, and application code should be specific (even in the same project).
Let’s show a successful call.
fn(
1,
b=2,
c=pd.DataFrame({'x': [3]}),
d=pd.DataFrame({'z': [7.0]}))
|   | z   |
|---|-----|
| 0 | 7.0 |
# turn off checking globally
SchemaCheckSwitch().off()
# show wrong return value is now allowed
fn(
1,
2,
c=pd.DataFrame({'x': [30], "z": [17.2]}),
d=pd.DataFrame({'q': [7.0]}))
|   | q   |
|---|-----|
| 0 | 7.0 |
Note the above return value does not have the declared z column, but with checks off the function is not interfered with.

When checks are on, failures are detected much closer to their causes, making debugging and diagnosis much easier. Also, the decorations are an easy way to document, in human-readable form, the basics of the expected input and output schemas.
And the input and output schemas are attached to the function as objects.
# show argument schema specifications
pprint(fn.data_schema.arg_specs)
{'a': <class 'int'>,
 'b': {<class 'float'>, <class 'int'>},
 'c': {'x': <class 'int'>}}
# show return value schema
pprint(fn.data_schema.return_spec)
{'z': <class 'float'>}
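Because these specifications are plain Python objects, application code can reuse them. For example, here is a small sketch (assuming only the arg_specs dictionary shown above) that pre-checks a candidate data frame against the columns declared for argument c before calling fn():

# reuse the attached schema objects: list which declared columns a
# candidate data frame is missing for the data-frame argument 'c'
c_spec = fn.data_schema.arg_specs['c']   # {'x': <class 'int'>}
candidate = pd.DataFrame({'y': [1]})
missing = [col for col in c_spec if col not in candidate.columns]
print(missing)  # ['x']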
A downside is that the technique can run into what I call “the first rule of meta-programming”: meta-programming only works as long as it doesn’t run into other meta-programming (also called the “it’s only funny when I do it” theorem). That being said, I feel these decorators can be very valuable in Python data science projects.
This documentation and demo can be found here.
# turn back on checking globally
SchemaCheckSwitch().on()
# failing example in Polars
threw = False
try:
fn(1, 2, c=pl.DataFrame({'z': [7]}))
except TypeError as e:
print(e)
threw = True
assert threw
function fn(), issues: arg c missing required column 'x'
# failing example in Polars
rv = None
threw = False
try:
fn(
1,
2,
c=pl.DataFrame({'x': [30], "z": [17.2]}),
d=pl.DataFrame({'q': [7.0]}))
except TypeError as e:
print(e.args[0])
rv = e.args[1]
threw = True
assert threw
# the return value is available for inspection
rv
fn() return value: missing required column 'z'
shape: (1, 1)

| q   |
|-----|
| f64 |
| 7.0 |
# good example in Polars
fn(
1,
b=2,
c=pl.DataFrame({'x': [3]}),
d=pl.DataFrame({'z': [7.0]}))
shape: (1, 1)

| z   |
|-----|
| f64 |
| 7.0 |
The SchemaCheck decoration is a simple and effective tool to add schema documentation and enforcement to your analytics projects.
# show some relevant versions
pprint({
'pd': pd.__version__,
'pl': pl.__version__,
'np': np.__version__,
'da': da.__version__})
{'da': '1.6.10', 'np': '1.25.2', 'pd': '2.0.3', 'pl': '0.19.2'}