I think Pandas may have “lost the plot.”

[This article was first published on python – Win Vector LLC, and kindly contributed to python-bloggers].

I’ve thought of Pandas as an in-memory, column-oriented data structure with reasonable performance. If I need high performance or scale, I can move to a database.

Now I kind of wonder what Pandas is, or what it wants to be.

Not sure if I’ve lost the plot or he has lost the plot

The version 1.3.0 package seems to be marking natural ways to work with a data frame as “low performance” and issuing warnings (in some situations over and over again).

It is now considered rude to insert a column into a Pandas data frame. I find this odd: if I wanted a structure that was hard to add columns to, I’d stick to numpy.

Let’s work an example.

# import our packages
import numpy
import pandas
import timeit
# confirm our Pandas version
print(pandas.__version__)
# define some experiment parameters
nrow = 100   # number of rows to generate
ncol = 100   # number of columns to generate
nreps = 100  # number of repetitions in timing

First we try an example that simulates what might happen when a data scientist works with a data frame: some columns get added. Here they are all added at once, in one place, as this is just simulation code.

# define our first function: adding columns as user might
def f_insert():
    d = pandas.DataFrame({
        'y': numpy.zeros(nrow)
    })
    # insert the columns one at a time
    for i in range(ncol):
        d['var_' + str(i).zfill(4)] = numpy.zeros(nrow)
    return d
# time the friendly version
timeit.timeit('f_insert()', number=nreps, globals=globals())
<ipython-input-4-aedae30a984f>:7: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider using pd.concat instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  d['var_' + str(i).zfill(4)] = numpy.zeros(nrow)


The above warning only occurred once in this context. In other applications I have seen it repeat a great many times, overwhelming the worksheet. I guess I could add %%capture to each and every cell to try to work around this.
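As an alternative to %%capture, one could silence just this class of warning with Python's standard warnings module. The following is a minimal sketch, assuming the warning is issued as pandas.errors.PerformanceWarning (as the message above indicates); the function name f_insert_quiet is made up for illustration.

```python
import warnings

import numpy
import pandas


def f_insert_quiet(nrow=100, ncol=150):
    """Insert columns one at a time, ignoring PerformanceWarning locally.

    f_insert_quiet is a hypothetical name used only for this sketch.
    """
    d = pandas.DataFrame({'y': numpy.zeros(nrow)})
    # catch_warnings() restores the previous warning filters on exit,
    # so the suppression does not leak into the rest of the session
    with warnings.catch_warnings():
        warnings.simplefilter(
            'ignore', category=pandas.errors.PerformanceWarning)
        for i in range(ncol):
            d['var_' + str(i).zfill(4)] = numpy.zeros(nrow)
    return d


d_quiet = f_insert_quiet()
```

This only hides the message; the frame is built exactly as before, so any performance cost of the inserts remains.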

The demanded alternative is something like the following.

# switch to Pandas's insisted upon alternative
def f_concat():
    d = pandas.DataFrame({
        'y': numpy.zeros(nrow)
    })
    # build one single-column frame per new column, then join them all column-wise
    return pandas.concat(
        [d] + 
        [pandas.DataFrame({'var_' + str(i).zfill(4): numpy.zeros(nrow)}) for i in range(ncol)],
        axis=1
    )
# time the baroque version
timeit.timeit('f_concat()', number=nreps, globals=globals())

Yes, concat is faster, but it is only natural in artificial cases such as the above, where I am adding all the columns in a single place. So is any sequence of inserts in Pandas now a ticking time bomb that will spill warnings once some threshold is crossed?
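The warning text itself suggests frame.copy() as a way to get a de-fragmented frame. One possible middle ground, sketched below, is to keep inserting columns but copy the frame every so often. This assumes the warning is driven by accumulated fragmentation that copy() resets; the helper name add_columns_with_defrag and its defrag_every parameter are hypothetical, not part of Pandas.

```python
import numpy
import pandas


def add_columns_with_defrag(d, new_columns, defrag_every=50):
    """Insert columns one at a time, copying the frame periodically.

    add_columns_with_defrag and defrag_every are hypothetical names for
    this sketch; defrag_every is a guess, not a documented threshold.
    """
    d = d.copy()  # work on a copy so the caller's frame is not modified
    for k, (name, values) in enumerate(new_columns.items(), start=1):
        d[name] = values
        if k % defrag_every == 0:
            # per the warning text, copy() returns a de-fragmented frame
            d = d.copy()
    return d


base = pandas.DataFrame({'y': numpy.zeros(100)})
extra = {'var_' + str(i).zfill(4): numpy.zeros(100) for i in range(250)}
d_defrag = add_columns_with_defrag(base, extra)
```

The cost is extra copying, and the user code still has to know about the issue, which is rather the complaint.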

I guess one could keep a dictionary map of column names to numpy 1-d arrays and work with that if one wants a column oriented data structure.
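A minimal sketch of that idea follows, assuming every array has the same length. The column names are illustrative only, and the data frame is built once at the end, when Pandas functionality is actually needed.

```python
import numpy
import pandas

nrow = 100

# a plain dict mapping column names to 1-d numpy arrays
columns = {'y': numpy.zeros(nrow)}

# adding a "column" is just a dict assignment
for i in range(100):
    columns['var_' + str(i).zfill(4)] = numpy.zeros(nrow)

# derived columns work on the arrays directly
columns['y_plus_one'] = columns['y'] + 1.0

# convert to a data frame in a single step at the end
d_from_dict = pandas.DataFrame(columns)
```

Nothing here checks that the arrays share a length or an index until the final conversion, so that bookkeeping falls to the user.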
