I’ve thought of Pandas as an in-memory, column-oriented data structure with reasonable performance. If I need high performance or scale, I can move to a database.
Now I kind of wonder what Pandas is, or what it wants to be.
The version 1.3.0 package seems to be marking natural ways to work with a data frame as “low performance” and issuing warnings (in some situations over and over again).
It is now considered rude to insert a column into a Pandas data frame. I find this odd; if I wanted a structure that was hard to add columns to, I’d stick to numpy.
Let’s work an example.
# import our packages
import numpy
import pandas
import timeit
# confirm our Pandas version
pandas.__version__
'1.3.0'
# define some experiment parameters
nrow = 100   # number of rows to generate
ncol = 100   # number of columns to generate
nreps = 100  # number of repetitions in timing
First we try an example that simulates what might happen in the case of a data scientist working with a data frame. Some columns get added, in this case all at once in one place, as this is just simulation code.
# define our first function: adding columns as user might
def f_insert():
    d = pandas.DataFrame({
        'y': numpy.zeros(nrow)
    })
    for i in range(ncol):
        d['var_' + str(i).zfill(4)] = numpy.zeros(nrow)
    return d
# time the friendly version
timeit.timeit('f_insert()', number=nreps, globals=globals())
<ipython-input-4-aedae30a984f>:7: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider using pd.concat instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  d['var_' + str(i).zfill(4)] = numpy.zeros(nrow)

2.707611405
The above warning only occurred once in this context. In other applications I have seen it repeat very many times, overwhelming the worksheet. I guess I could add %%capture to each and every cell to try and work around this.
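If one just wants the messages gone, a lighter-weight sketch than per-cell %%capture magics might be to filter the warning class once. This is my own workaround, not something the Pandas documentation prescribes, and it is blunt: it silences all PerformanceWarnings, not just this one.

# suppress Pandas PerformanceWarning messages globally (a blunt, one-statement workaround)
import warnings
import pandas

warnings.simplefilter(action='ignore', category=pandas.errors.PerformanceWarning)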
The demanded alternative is something like the following.
# switch to Pandas's insisted upon alternative
def f_concat():
    d = pandas.DataFrame({
        'y': numpy.zeros(nrow)
    })
    return pandas.concat(
        [d] +
            [pandas.DataFrame({'var_' + str(i).zfill(4): numpy.zeros(nrow)})
                for i in range(ncol)],
        axis=1
    )
# time the baroque version
timeit.timeit('f_concat()', number=nreps, globals=globals())
1.4440117099999998
Yes, concat is faster, but it is only natural in artificial cases such as the above, where I am adding all the columns in a single place. So is any sequence of inserts in Pandas now a ticking time bomb that will spill out warnings once some threshold is crossed?
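To illustrate the worry, here is a hypothetical sketch (the function add_block and the staging are my own invention, not from any real pipeline): each stage inserts columns in a perfectly ordinary way, yet the combined inserts can still cross whatever internal fragmentation threshold triggers the warning.

# hypothetical illustration: column inserts spread across separate "stages"
def add_block(d, prefix, k):
    # insert k new columns one at a time, as separate feature-engineering steps might
    for i in range(k):
        d[prefix + '_' + str(i).zfill(4)] = numpy.zeros(nrow)

d = pandas.DataFrame({'y': numpy.zeros(nrow)})
add_block(d, 'a', 60)   # one stage of a pipeline, unremarkable on its own
add_block(d, 'b', 60)   # a later, unrelated stage; the combined inserts may now trip the warning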
I guess one could keep a dictionary mapping column names to numpy 1-d arrays and work with that, if one wants a column-oriented data structure.
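For what it's worth, a minimal sketch of that idea (the name f_dict is mine): accumulate the columns in a plain dict and construct the data frame only once at the end, which avoids repeated inserts entirely.

# sketch: accumulate columns in a dict, build the DataFrame once at the end
def f_dict():
    cols = {'y': numpy.zeros(nrow)}
    for i in range(ncol):
        cols['var_' + str(i).zfill(4)] = numpy.zeros(nrow)
    return pandas.DataFrame(cols)

# time the dictionary version the same way as the others
timeit.timeit('f_dict()', number=nreps, globals=globals())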