I’ve thought of Pandas as an in-memory, column-oriented data structure with reasonable performance. If I need high performance or scale, I can move to a database.
Now I kind of wonder what Pandas is, or what it wants to be.
The version 1.3.0 package seems to be marking natural ways to work with a data frame as “low performance” and issuing warnings (in some situations over and over again).
It is now considered rude to insert a column into a Pandas data frame. I find this odd; if I wanted a structure that was hard to add columns to, I’d stick to numpy.
Let’s work an example.
# import our packages
import numpy
import pandas
import timeit
# confirm our Pandas version
pandas.__version__
'1.3.0'
# define some experiment parameters
nrow = 100   # number of rows to generate
ncol = 100   # number of columns to generate
nreps = 100  # number of repetitions in timing
First we try an example that simulates what might happen in the case of a data scientist working with a data frame. Some columns get added, in this case all at once in one place, as this is just simulation code.
# define our first function: adding columns as user might
def f_insert():
    d = pandas.DataFrame({
        'y': numpy.zeros(nrow)
    })
    for i in range(ncol):
        d['var_' + str(i).zfill(4)] = numpy.zeros(nrow)
    return d
# time the friendly version
timeit.timeit('f_insert()', number=nreps, globals=globals())
<ipython-input-4-aedae30a984f>:7: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider using pd.concat instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  d['var_' + str(i).zfill(4)] = numpy.zeros(nrow)

2.707611405
The above warning only occurred once in this context. In other applications I have seen it repeat very many times, overwhelming the worksheet. I guess I could add %%capture to each and every cell to try and work around this.
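If one just wants the messages gone, a lighter-weight sketch than per-cell %%capture magics might be to filter the warning class once. This is my own workaround, not something the Pandas documentation prescribes, and it is blunt: it silences all PerformanceWarnings, not just this one.

# suppress Pandas PerformanceWarning messages globally (a blunt, one-statement workaround)
import warnings
import pandas

warnings.simplefilter(action='ignore', category=pandas.errors.PerformanceWarning)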
The demanded alternative is something like the following.
# switch to Pandas's insisted upon alternative
def f_concat():
    d = pandas.DataFrame({
        'y': numpy.zeros(nrow)
    })
    return pandas.concat(
        [d] +
            [pandas.DataFrame({'var_' + str(i).zfill(4): numpy.zeros(nrow)})
                for i in range(ncol)],
        axis=1
    )
# time the baroque version
timeit.timeit('f_concat()', number=nreps, globals=globals())
1.4440117099999998
Yes, concat is faster, but it is only natural in artificial cases such as the above, where I am adding all the columns in a single place. So is any sequence of inserts in Pandas now a ticking time bomb that will spill out warnings once some threshold is crossed?
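To illustrate the worry, here is a hypothetical sketch (the function add_block and the staging are my own invention, not from any real pipeline): each stage inserts columns in a perfectly ordinary way, yet the combined inserts can still cross whatever internal fragmentation threshold triggers the warning.

# hypothetical illustration: column inserts spread across separate "stages"
def add_block(d, prefix, k):
    # insert k new columns one at a time, as separate feature-engineering steps might
    for i in range(k):
        d[prefix + '_' + str(i).zfill(4)] = numpy.zeros(nrow)

d = pandas.DataFrame({'y': numpy.zeros(nrow)})
add_block(d, 'a', 60)   # one stage of a pipeline, unremarkable on its own
add_block(d, 'b', 60)   # a later, unrelated stage; the combined inserts may now trip the warning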
I guess one could keep a dictionary mapping column names to numpy 1-d arrays and work with that, if one wants a column-oriented data structure.
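For what it's worth, a minimal sketch of that idea (the name f_dict is mine): accumulate the columns in a plain dict and construct the data frame only once at the end, which avoids repeated inserts entirely.

# sketch: accumulate columns in a dict, build the DataFrame once at the end
def f_dict():
    cols = {'y': numpy.zeros(nrow)}
    for i in range(ncol):
        cols['var_' + str(i).zfill(4)] = numpy.zeros(nrow)
    return pandas.DataFrame(cols)

# time the dictionary version the same way as the others
timeit.timeit('f_dict()', number=nreps, globals=globals())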