# I think Pandas may have “lost the plot.”

This article was first published on **python – Win Vector LLC**, and kindly contributed to python-bloggers.

I’ve thought of Pandas as an in-memory, column-oriented data structure with reasonable performance. If I need high performance or scale, I can move to a database.

Now I kind of wonder what Pandas is, or what it wants to be.

The version `1.3.0` package seems to be marking natural ways to work with a data frame as “low performance” and issuing warnings (in some situations over and over again).

It is now considered rude to insert a column into a Pandas data frame. I find this off: if I wanted a structure that was hard to add columns to, I’d stick to numpy.

Let’s work an example.

```python
# import our packages
import numpy
import pandas
import timeit
```

```python
# confirm our Pandas version
pandas.__version__
```

```
'1.3.0'
```

```python
# define some experiment parameters
nrow = 100   # number of rows to generate
ncol = 100   # number of columns to generate
nreps = 100  # number of repetitions in timing
```

First we try an example that simulates what might happen when a data scientist works with a data frame: some columns get added. Here they are all added at once, in one place, as this is just simulation code.

```python
# define our first function: adding columns as user might
def f_insert():
    d = pandas.DataFrame({
        'y': numpy.zeros(nrow)
    })
    for i in range(ncol):
        d['var_' + str(i).zfill(4)] = numpy.zeros(nrow)
    return d
```

```python
# time the friendly version
timeit.timeit('f_insert()', number=nreps, globals=globals())
```

```
<ipython-input-4-aedae30a984f>:7: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider using pd.concat instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  d['var_' + str(i).zfill(4)] = numpy.zeros(nrow)

2.707611405
```

The above warning only occurred once in this context. In other applications I have seen it repeat very many times, overwhelming the worksheet. I guess I could add `%%capture` to each and every cell to try and work around this.
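A less blunt workaround than capturing whole cells might be to filter just this one warning category. The following is a minimal sketch, not something from the Pandas advice itself: it assumes the standard-library `warnings` module and the `pandas.errors.PerformanceWarning` class, and `f_insert_quiet` is a hypothetical name introduced only for illustration.

```python
import warnings

import numpy
import pandas


def f_insert_quiet(nrow=100, ncol=100):
    """Insert columns one at a time, silencing only PerformanceWarning.

    Hypothetical helper for illustration; it mirrors the insert loop above.
    """
    d = pandas.DataFrame({'y': numpy.zeros(nrow)})
    # catch_warnings restores the previous warning filters on exit,
    # so the suppression is limited to this block
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', category=pandas.errors.PerformanceWarning)
        for i in range(ncol):
            d['var_' + str(i).zfill(4)] = numpy.zeros(nrow)
    return d


d = f_insert_quiet()
print(d.shape)
```

For a whole notebook, a single `warnings.filterwarnings('ignore', category=pandas.errors.PerformanceWarning)` near the top should have a similar effect, at the cost of also hiding any other Pandas performance warnings.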

The demanded alternative is something like the following.

```python
# switch to Pandas's insisted upon alternative
def f_concat():
    d = pandas.DataFrame({
        'y': numpy.zeros(nrow)
    })
    return pandas.concat(
        [d] + [pandas.DataFrame({'var_' + str(i).zfill(4): numpy.zeros(nrow)}) for i in range(ncol)],
        axis=1
    )
```

```python
# time the baroque version
timeit.timeit('f_concat()', number=nreps, globals=globals())
```

```
1.4440117099999998
```

Yes, concat is faster, but it is only natural in artificial cases such as the above, where I am adding all the columns in a single place. So is any sequence of inserts in Pandas now a ticking time bomb that will spill warnings out once some threshold is crossed?
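When the new columns are produced at scattered points in an analysis, one way to stay on the concat path might be to stage them in a dictionary and attach them in a single step. This is only a sketch under that assumption; `add_columns` and `pending` are hypothetical names, not part of Pandas.

```python
import numpy
import pandas


def add_columns(d, new_cols):
    """Attach many columns with one concat.

    new_cols maps column name -> 1-d array with one entry per row of d.
    Hypothetical helper for illustration.
    """
    block = pandas.DataFrame(new_cols, index=d.index)
    return pandas.concat([d, block], axis=1)


nrow, ncol = 100, 100
d = pandas.DataFrame({'y': numpy.zeros(nrow)})

pending = {}  # columns computed at different points, not yet in the frame
for i in range(ncol):
    pending['var_' + str(i).zfill(4)] = numpy.zeros(nrow)

d = add_columns(d, pending)
print(d.shape)
```

The cost is the one I am complaining about: the staged columns are not visible in the data frame until the concat happens, so the code has to be organized around the data structure rather than the analysis.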

I guess one could keep a dictionary mapping column names to numpy 1-d arrays and work with that if one wants a column-oriented data structure.
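A rough sketch of what that could look like follows. The column names and values are made up for illustration, and the conversion to a data frame at the end is optional.

```python
import numpy
import pandas

nrow = 100

# the "data frame": a plain dict from column name to 1-d numpy array
cols = {'y': numpy.arange(nrow)}
for i in range(100):
    cols['var_' + str(i).zfill(4)] = numpy.full(nrow, float(i))

# adding a derived column is just another dict assignment
cols['total'] = cols['var_0001'] + cols['var_0002']

# a row filter has to be applied to every column by hand
mask = cols['y'] < 10
subset = {name: values[mask] for name, values in cols.items()}

# build a Pandas data frame once, only when one is actually needed
frame = pandas.DataFrame(cols)
print(frame.shape)
```

Inserting a column is trivial here, but everything else Pandas supplies (indexing, joins, grouping) would have to be rebuilt by hand.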
