Biodiversity.
We’d like more of it.
More of each thing, and more different types of thing.
And more of the things that help make more of the different types of thing.
But can you have too many things?
In Data Science we are often working with rectangular data structures – databases, spreadsheets,
data-frames. Within Python alone, there are multiple ways to work with this type of data, and your
choice is constrained by data volume, storage, fluency and so on. For datasets that could readily be
held in memory on a single computer, the standard Python tool for rectangling is
Pandas,
which became an open-source project in 2009. Many other tools now exist though.
In particular, the
Polars library has become extremely popular in Python over recent years.
But when Pandas works, is well-supported, and is the standard tool in your team or your domain,
and you are primarily working with in-memory datasets, is there value in learning a new
data-wrangling tool? Of course there is.
But this is a blog post, not a course, so what we’ll do here is compare the Pandas and Polars syntax
for some standard data-manipulation code. We will also introduce a new bit of syntax that Pandas
3.0 will be introducing soon.
Let’s talk about pollinators.
There’s a nice dataset about pollinators and plants found in areas of the UK available
on the
UK Centre for Ecology and Hydrology (UKCEH) website.
See the full citation below. Briefly, the dataset contains counts of different types of pollinators
in a range of 1 km² grids across the UK. With it, we can see trends over time in pollinator
numbers.
Installation
We will use separate ‘uv’-based projects to analyse the UKCEH dataset,
by installing Polars, Pandas 2 and Pandas 3 into different virtual environments. See our recent
summary of
2025 trends in Python for more information
about ‘uv’.
Let’s install some bears inside a snake and analyse some bees:
# Make separate environments for pandas2, pandas3, polars:
uv init pandas2
cd pandas2
uv add "pandas==2.3.3"
uv run python -c "import pandas; print(pandas.__version__)"
# 2.3.3
cd ..
For Pandas 3, we are going to install a development version of the package. One way to do this with uv is via uv pip install:
uv init pandas3
cd pandas3
uv venv  # explicitly initialise the virtual env
uv pip install --pre \
    --extra-index-url https://pypi.anaconda.org/scientific-python-nightly-wheels/simple \
    pandas
# Resolved 5 packages in 2.35s
# Installed 5 packages in 31ms
#  + numpy==2.4.0.dev0
#  + pandas==3.0.0.dev0+2562.ga329dc353a
#  + python-dateutil==2.9.0.post0
#  + six==1.17.0
#  + tzdata==2025.2
# (Note this venv isn't managed by uv...)
uv run python -c "import pandas; print(pandas.__version__)"
# 3.0.0.dev0+2562.ga329dc353a
cd ..
Finally, we’ll install polars into a separate project.
I’ve called this project polars-proj.
If the project had been called polars, we couldn’t have installed
the polars package within it.
# We can't call this project 'polars',
# as we'll be installing the 'polars' package inside it
uv init polars-proj
cd polars-proj
uv add "polars==1.34.0"
cd ..
So we now have three different projects (‘pandas2’, ‘pandas3’, and ‘polars-proj’).
Download the data
Data was downloaded from
ceh.ac.uk
and stored in ./data/ukpoms_1kmpantrapdata_2017-2022_insects.csv
See the citation below if you wish to work with this data.
As of the start of November 2025, this dataset has been downloaded 29 times.
Data processing
Pandas 2
The Pandas 2 syntax for data-frame work is well known to Python users.
From the pandas2 project, we can open a Jupyter notebook based on the pandas2 virtual environment:
# [bash]
uv run --with jupyter jupyter lab
We will read in the data and then make some summaries, to produce an output table:
import pandas as pd

pollinators = pd.read_csv(
    "../data/ukpoms_1kmpantrapdata_2017-2022_insects.csv",
    encoding="ISO-8859-1"
)
Bees, and related species, are from the order “Hymenoptera”:
bees = pollinators[pollinators["order"] == "Hymenoptera"]
# 9245 rows, 16 columns
Within bees we find a range of interestingly-named insects: nomad bees, small shaggy bees, the
impunctate mini-miner, a few Buffish mining bees and a clutch of heather girdled Colletes, amongst
others. So I’m wondering how many bees and how many different species are observed in a given
sector.
bees["english_name"].unique()
# array(['Common Yellow-face Bee', 'Red-tailed Bumblebee',
#        'Common Carder Bee', 'Bloomed Furrow Bee', ...
We have a sample_id and an occurrence_id column. There may be multiple rows with the same
sample_id, but each row has a unique occurrence_id. The sample_id defines the
1 km² sector in which a given pollinator count was performed – there are multiple rows, because there are
typically multiple pollinators present in a sector. Any given sample_id is present for only one
year (not shown).
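These properties of the ID columns are easy to sanity-check. A minimal sketch on toy data (the column names mirror the dataset, but the values are made up):

```python
import pandas as pd

# Toy stand-in for the bees data-frame, using the same ID columns
bees = pd.DataFrame({
    "occurrence_id": [1, 2, 3, 4],
    "sample_id":     [10, 10, 11, 11],
    "year":          [2019, 2019, 2020, 2020],
})

# Each row has a unique occurrence_id...
assert bees["occurrence_id"].is_unique
# ...while a sample_id can repeat (one row per pollinator record in a sector)...
assert bees["sample_id"].duplicated().any()
# ...and any given sample_id is present for only one year.
assert (bees.groupby("sample_id")["year"].nunique() == 1).all()
```

Checks like these are cheap to run on the real data before trusting a grouping key.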
So what we want to do is group the dataset by sample_id and count up the bees within that sector.
We will store the year along with the sample_id.
We can count up the observations in each sector as follows. Here we are summing the number of
observed insects (aggregating the ‘count’ column using the ‘sum’ function) and counting the number
of distinct taxa in the sector (the length of the unique entries in the taxon_standardised column).
bee_counts = (
bees
.groupby(["sample_id", "year"])
.agg({
"count": "sum",
"taxon_standardised": lambda x: len(x.unique())
})
.rename(columns={
"count": "n_insects",
"taxon_standardised": "n_species"
})
)
With that, we can view the sectors that had the most bees overall:
bee_counts.sort_values("n_insects", ascending=False).head()
# n_insects n_species
# sample_id year
# 14940524 2021 28 3
# 15465304 2021 28 5
# 6810184 2019 28 1
And that had the most bee diversity:
bee_counts.sort_values("n_species", ascending=False).head()
# n_insects n_species
# sample_id year
# 11873611 2020 25 11
# 4440178 2018 20 11
# 11745253 2020 24 11
You could do considerably more advanced analysis if you had time.
Polars
We will repeat the above, but using syntax typical for the Polars package.
The syntax for subsetting the rows of a data-frame is different in Polars.
Passing a Boolean data-mask, pollinators[pollinators["order"] == "Hymenoptera"], doesn’t work
in Polars; the error message recommends using the .filter() method instead:
bees = (
pollinators
.filter(pl.col("order") == "Hymenoptera")
)
Inside a data-frame method (like .filter()) we can refer to a column using
pl.col("column_name"). In Pandas, pollinators["order"] == "Hymenoptera" returns a Series of
Boolean values (a “data-mask”) that can be used to index into the rows of a concrete data-frame.
With pl.col() we don’t have to precompute such a mask: the expression implicitly refers to a
column in the current state of the data-frame, so we can chain filtering steps together.
The syntax for grouping and summarising data is similar to the Pandas syntax but, again, we can
refer to columns using pl.col(). By providing named arguments to .agg(), we define the names of
the output columns in a single step.
bee_counts = (
bees
.group_by(["sample_id", "year"])
.agg(
n_insects = pl.col("count").sum(),
n_species = pl.col("taxon_standardised").unique().len()
)
)
Pandas 3
Pandas 3.0 is introducing a
new syntax
that can be used for filtering rows or adding new columns.
It is closely related to the Polars pl.col() syntax. For example, filtering to keep only the
“Hymenoptera” in the pollinators dataset can be performed using the following code:
# Pandas 3.0
import pandas as pd
pollinators = pd.read_csv(....)
bees = pollinators.loc[pd.col("order") == "Hymenoptera"]
The new part of this syntax is pd.col(); the .loc[] accessor is already available in
Pandas 2, where we would use an anonymous function to select the required rows:
# Pandas 2 or 3.0
bees = pollinators.loc[lambda x: x["order"] == "Hymenoptera"]
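The lambda form has one advantage over a precomputed mask: the function receives the data-frame at that point in a chain, so filters can be stacked, much as in Polars. A minimal sketch on toy data (works in Pandas 2):

```python
import pandas as pd

pollinators = pd.DataFrame({
    "order": ["Hymenoptera", "Diptera", "Hymenoptera"],
    "count": [3, 1, 0],
})

# Each lambda is called with the data-frame as it stands at that step
bees_seen = (
    pollinators
    .loc[lambda df: df["order"] == "Hymenoptera"]
    .loc[lambda df: df["count"] > 0]
)
```

Only the Hymenoptera row with a non-zero count survives both filters.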
Summary
In this blog post we have shown the similarities and differences between Pandas and Polars syntax
for typical data-manipulation tasks. There are some fundamental differences between Pandas and
Polars that go deeper than the syntactic things covered here (and we’ve really only scratched the
surface of those differences). Polars is implemented in Rust, whereas Pandas is written in Python
on top of NumPy’s C code base (with performance-critical parts of Pandas written in Cython). The
speed of Polars and Pandas can differ on the same tasks as a result of the different
implementations, and processing speed is occasionally a good reason to choose one package over
another. But if you are considering migrating from Pandas to Polars, you have to accept that your
whole team will need onboarding to the Polars syntax. From what we’ve seen here, the contrasts
between Pandas and Polars syntax aren’t that great; the methods have similar names, for example.
In fact, from discussing the two packages with data scientists, we have found that it is the
Polars syntax, rather than its speed, that has led some to migrate away from Pandas.
Data Citation
UK Pollinator Monitoring Scheme (2025). Pan trap survey data from the UK Pollinator Monitoring
Scheme, 2017-2022. NERC EDS Environmental Information Data Centre.
https://doi.org/10.5285/4a565007-d3a1-468d-9f84-70ec7594fafe
The UK Pollinator Monitoring Scheme (UK PoMS) is a partnership funded jointly by the UK Centre for
Ecology & Hydrology (UKCEH) and Joint Nature Conservation Committee (JNCC) (through funding from the
Department for Environment, Food & Rural Affairs, Scottish Government, Welsh Government and
Department of Agriculture, Environment and Rural Affairs for Northern Ireland). UKCEH’s contribution
is part-funded by the Natural Environment Research Council formerly as part of the UK-SCAPE
programme (award NE/R016429/1) and now as part of the NC-UK programme (award NE/Y006208/1)
delivering National Capability. Between 2017 and 2021, PoMS was funded by UKCEH and Defra (England),
Welsh Government, Scottish Government, DAERA (Northern Ireland), and JNCC. PoMS is indebted to the
many volunteers who carry out surveys and contribute data to the scheme.
For updates and revisions to this article, see the original post
