Importing Data with Python
Importing data is a key step in the data science workflow, and it
carries a lot of responsibility. How you import (or connect to) a
dataset has consequences for how you work with that data throughout a
project, because a Pandas DataFrame (say) requires Pandas-specific code.
Similarly, your data constrains your code – if it can fit in memory on a
single computer, the constraints are different than if your data is so
large that its storage must be distributed. Data-import is a key place
where a data-project can go wrong. If you import a dataset without
validating that it contains sensible values, a nasty surprise may await
you…
Python has wonderful libraries for data manipulation and analysis, and
you can readily work with data sources of many kinds. For example, data:
- that are too big to hold in memory;
- that are distributed across a network;
- that update rapidly;
- or that don’t easily conform to a tabular, relational form.
The Python data stack is sufficiently mature that there are multiple
libraries for all of these settings. There are some moves to introduce a
standardised syntax for some
data-frame functionality across libraries. At Jumping Rivers, we have a
number of
Python training courses,
where we teach Pandas, PySpark and NumPy, and how to work with
databases via SQL from Python.
Importing into memory
First, we will compare two in-memory data-frame packages:
Pandas and Polars. We
will work in a virtual environment (so that the packages installed are
handled independently of the system-wide Python; see our Barbie-themed
blog post
on virtual environments).
```bash
# Create & activate a new virtual environment
python -m venv .venv
source .venv/bin/activate

# Install pandas and polars into the environment
pip install pandas polars
```
We will generate a simple dataset to import with the two packages.
PyPI download numbers for a specific Python package
can be obtained using the
pypistats
package. After
installing it, we will pull out the number of downloads for the package
pytest
– see our recent
blog posts for an
introduction to this testing library.
```bash
# Install pypistats
pip install pypistats

# Obtain download statistics for `pytest` in tab-separated format
pypistats python_minor -f tsv pytest > data/pytest-downloads.tsv
```
The structure of that file is straightforward. It records both the
number and the percentage of pytest downloads across each minor
version of Python (“3.8”, “3.9” and so on) for the last 180 days (the
default time-span).
```bash
head data/pytest-downloads.tsv
```
## "category" "percent" "downloads" ## "3.11" "23.40%" 282,343,944 ## "3.9" "18.78%" 226,548,604 ## "3.10" "18.74%" 226,155,405 ## "3.12" "15.48%" 186,819,921 ## "null" "8.41%" 101,489,156 ## "3.8" "6.13%" 73,965,471 ## "3.13" "4.91%" 59,253,846 ## "3.7" "3.36%" 40,551,618 ## "3.6" "0.54%" 6,546,017
So it should be trivial to import it into Python using either Pandas or
Polars.
```python
files = {"pytest_data": "data/pytest-downloads.tsv"}
```
```python
import pandas as pd

downloads_pd = pd.read_csv(files["pytest_data"], sep="\t")
downloads_pd.head()
```
```
##   category percent    downloads
## 0     3.11  23.40%  282,343,944
## 1      3.9  18.78%  226,548,604
## 2     3.10  18.74%  226,155,405
## 3     3.12  15.48%  186,819,921
## 4      NaN   8.41%  101,489,156
```
```python
import polars as pl

downloads_pl = pl.read_csv(files["pytest_data"], separator="\t")
downloads_pl.head()
```
```
## shape: (5, 3)
## ┌──────────┬─────────┬─────────────┐
## │ category ┆ percent ┆ downloads   │
## │ ---      ┆ ---     ┆ ---         │
## │ str      ┆ str     ┆ str         │
## ╞══════════╪═════════╪═════════════╡
## │ 3.11     ┆ 23.40%  ┆ 282,343,944 │
## │ 3.9      ┆ 18.78%  ┆ 226,548,604 │
## │ 3.10     ┆ 18.74%  ┆ 226,155,405 │
## │ 3.12     ┆ 15.48%  ┆ 186,819,921 │
## │ null     ┆ 8.41%   ┆ 101,489,156 │
## └──────────┴─────────┴─────────────┘
```
We can see that pytest was downloaded ~280 million times on Python
3.11, and this accounted for about 23% of downloads.
The syntax for importing the dataset using the two libraries is almost
identical. The only difference for the read_csv()
function is that you
use sep=...
in Pandas and separator=...
in Polars.
Polars is
more memory efficient
than Pandas in many settings. One of the memory efficiencies that Polars
allows is filtering by rows and columns during import. To take
advantage of this, you can use the scan_csv()
function in Polars. This
creates a lazy data-frame that can be directly manipulated by Polars
DataFrame methods. These method calls are applied at the point when the
data is loaded, rather than on an in-memory data-frame. The Polars
website has
far more details about the
lazy API, and the benefits it can bring to your work.
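As a brief illustration, here is a minimal sketch of the lazy approach
(our own example, reusing the file from above): scan_csv() returns a
LazyFrame, the column selection and row filter are pushed down into the
scan, and nothing is read into memory until collect() is called.

```python
import polars as pl

# Build a lazy query: the selection and filter become part of the scan,
# so Polars never materialises the columns and rows we don't need.
lazy_downloads = (
    pl.scan_csv("data/pytest-downloads.tsv", separator="\t")
    .select(["category", "downloads"])
    .filter(pl.col("category").is_not_null())
)

# Only now is the file read, and only the filtered result is loaded.
downloads = lazy_downloads.collect()
```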
Here, we will be working with eagerly-loaded, in-memory data-frames – a
good choice during exploratory work.
Validating a dataset
Have we loaded what we wanted?
The Pandas/Polars .dtypes
attribute gives you information about the
data-types in each column of a data-frame:
```python
print(downloads_pd.dtypes)
```
```
## category     object
## percent      object
## downloads    object
## dtype: object
```
```python
print(downloads_pl.dtypes)
```
```
## [String, String, String]
```
Again, the formatting of the output looks a little different between the
two packages, but the results are broadly similar: all of our data – the
version-strings in ‘category’, the percentage values and the download
counts – have been read as character strings. Pandas tells us they are
’object’s, but we know that typically means they are strings:
```python
type(downloads_pd["category"][0])
```
```
## <class 'str'>
```
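For completeness, the equivalent check on the Polars side (a quick aside
that is not in the original output) uses the data-frame’s schema
attribute; the exact repr depends on your Polars version:

```python
# The Polars schema maps column names to Polars data-types; here every
# column has been read in as a string.
downloads_pl.schema
## Schema({'category': String, 'percent': String, 'downloads': String})
```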
So Python has imported our data incorrectly. In a real project we would
want to know about this, so we might validate the data in our datasets.
We would also want to prevent data-import mistakes as far as possible
(see later) by being more explicit about how each column is imported and
converted.
Python has a range of packages for validating both the schema for, and
the values in, a dataset.
For a general class, if you want to check the data-types stored in its
fields, you could use
Pydantic:
```bash
pip install pydantic
```
```python
from pydantic import BaseModel, NonNegativeInt

class Person(BaseModel):
    name: str
    age: NonNegativeInt
```
Correct data-types cause no fuss:
Person(name="Russ", age=47)
```
## Person(name='Russ', age=47)
```
But incorrect data (here a negative age) throws an error:
Person(name="Buddy", age=-1)
```
## pydantic_core._pydantic_core.ValidationError: 1 validation error for Person
## age
##   Input should be greater than or equal to 0 [type=greater_than_equal, input_value=-1, input_type=int]
##     For further information visit https://errors.pydantic.dev/2.11/v/greater_than_equal
```
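Pydantic validates one record at a time, so to check tabular data this
way you would loop over the rows. A minimal sketch (our own example, not
from the original post):

```python
from pydantic import ValidationError

rows = [
    {"name": "Russ", "age": 47},
    {"name": "Buddy", "age": -1},
]

# Collect every row that fails validation, rather than stopping at the first.
errors = []
for row in rows:
    try:
        Person(**row)
    except ValidationError as err:
        errors.append((row, err))

print(len(errors))  # 1: only Buddy's row is invalid
```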
There are extensions of Pydantic that work with data-frames. For
example, Pandera
can work with Pandas, Polars and several other data-frame libraries. You
would need to install a different Pandera extension depending on the
data-frame library you are working with (“pandera[pandas]” for Pandas,
“pandera[polars]” for Polars, etc.).
```bash
pip install "pandera[pandas]"
```
```python
import pandera.pandas as pa

schema = pa.DataFrameSchema({
    "category": pa.Column(str, nullable=True),
    "percent": pa.Column(float, checks=pa.Check.in_range(0, 100)),
    "downloads": pa.Column(int)
})
```
```python
schema.validate(downloads_pd)
```
```
## pandera.errors.SchemaError: non-nullable series 'percent' contains null values:
## 17    NaN
## 18    NaN
## Name: percent, dtype: object
```
Validating the data has identified an issue: there is a missing value in
the “percent” column towards the end of the dataset. There are other
issues – the two numeric columns are currently strings – but let’s check
out the issue that Pandera has identified first.
This is the end of the dataset:
```python
downloads_pd.tail()
```
```
##                                category percent      downloads
## 14                                  3.3   0.00%          2,259
## 15                                  2.6   0.00%             87
## 16                                  3.2   0.00%             68
## 17                                Total     NaN  1,206,596,056
## 18  Date range: 2024-12-31 - 2025-07-07     NaN            NaN
```
The final couple of lines of the file contain metadata that should be
ignored at import. Let’s just skip those lines by limiting the number of
rows we read (this isn’t a very robust solution, but it is fine for now
– see the aside after the output below).
```python
downloads_pd = pd.read_csv(files["pytest_data"], sep="\t", nrows=17)
downloads_pd.tail()
```
```
##    category percent downloads
## 12     3.40   0.00%    12,777
## 13     3.15   0.00%     2,882
## 14     3.30   0.00%     2,259
## 15     2.60   0.00%        87
## 16     3.20   0.00%        68
```
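As an aside, if you would rather not hard-code the number of data rows,
Pandas’ skipfooter argument drops a fixed number of trailing lines
instead. A sketch (the downloads_pd_alt name is ours; skipfooter
requires the slower Python parsing engine):

```python
# Drop the two trailing metadata lines ("Total" and "Date range: ...")
# without needing to know how many data rows the file contains.
downloads_pd_alt = pd.read_csv(
    files["pytest_data"],
    sep="\t",
    skipfooter=2,
    engine="python",  # skipfooter is not supported by the default C engine
)
```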
Now if we validate the Pandas data-frame, another issue is identified.
```python
schema.validate(downloads_pd)
```
```
## pandera.errors.SchemaError: expected series 'category' to have type str:
## failure cases:
##     index failure_case
## 0       0         3.11
## 1       1         3.90
## 2       2         3.10
## 3       3         3.12
## 4       5         3.80
## 5       6         3.13
## 6       7         3.70
## 7       8         3.60
## 8       9         2.70
## 9      10         3.14
## 10     11         3.50
## 11     12         3.40
## 12     13         3.15
## 13     14         3.30
## 14     15         2.60
## 15     16         3.20
```
This is a more substantial issue – the Python version-strings (3.2, 3.3
and so on) have been converted into floating-point numbers during
import. So now Python version “3.2” is Python 3.20 in the data-frame.
Rich Iannone has written a useful blog post comparing various
data-validation libraries for Polars at the “Posit-dev”
blog.
The tools mentioned in his post can check more substantial matters than
the missing-data, data-type and data-range issues that we mentioned
above. In particular, the tool
“pointblank” can create
data-validation summary reports that can be used to report back to
data-collection teams or to analysts. Python already has data-reporting
tools like “great
expectations”
and “test-driven data analysis”.
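To give a flavour, here is a rough sketch of a pointblank validation.
The usage below is an assumption based on pointblank’s documented
pattern of building a Validate object, chaining checks and calling
interrogate(); check the package documentation for the exact method
names in your version:

```python
import pointblank as pb

# Assumed API: chain column checks onto a Validate object, then
# interrogate() runs them and builds a summary report.
validation = (
    pb.Validate(data=downloads_pd)
    .col_vals_between(columns="percent", left=0, right=100)
    .col_vals_not_null(columns="downloads")
    .interrogate()
)
validation  # displays a tabular validation report
```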
Importing with data-type constraints
I knew the structure of the tab-separated file before attempting to load
it with Pandas and Polars. That is, I knew that the percents looked like
“12.34%” and that the download counts looked like “123,456,789” (with
commas separating the thousands and millions). Neither package can
automatically convert these number formats into the format that they
require without a bit of help. Even if we explained what the post-import
data-type should be for each column, the two libraries wouldn’t be able
to parse the input data directly.
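To see why, note that the conversions we need are simple in plain
Python; the libraries just need to be told to apply them (a small
illustration):

```python
# Strip the "%" suffix, then parse as a float.
float("23.40%".strip("%"))             # -> 23.4

# Remove the thousands separators, then parse as an integer.
int("282,343,944".replace(",", ""))    # -> 282343944
```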
Both Polars and Pandas allow you to provide a pre-defined schema when
importing data. For Pandas, you provide a dtype
dictionary which
specifies what the output column data-type should be. For Polars, you
provide a schema
argument, where the data-types are specified in a
Polars-specific format (because the data is stored in Rust, the Python
int
and str
data-types don’t work for Polars).
If we import our data using a schema, Pandas and Polars will complain:
```python
downloads_pd = pd.read_csv(
    files["pytest_data"],
    sep="\t",
    nrows=17,
    dtype={"category": str, "percent": float, "downloads": int}
)
```
```
## ValueError: could not convert string to float: '23.40%'
```
```python
downloads_pl = pl.read_csv(
    files["pytest_data"],
    separator="\t",
    n_rows=17,
    schema={"category": pl.Utf8, "percent": pl.Float64, "downloads": pl.Int64}
)
```
````
## polars.exceptions.ComputeError: could not parse `"23.40%"` as dtype `f64` at column 'percent' (column number 2)
##
## The current offset in the file is 7 bytes.
##
## You might want to try:
## - increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
## - specifying correct dtype with the `schema_overrides` argument
## - setting `ignore_errors` to `True`,
## - adding `"23.40%"` to the `null_values` list.
##
## Original error: ```remaining bytes non-empty```
````
The errors arise because we also need to specify how to convert our data
into the expected data-types when it isn’t obvious. This is done using
‘converters’ in Pandas:
```python
downloads_pd = pd.read_csv(
    files["pytest_data"],
    sep="\t",
    nrows=17,
    dtype={"category": str},
    converters={
        # 12.34% -> 12.34
        "percent": lambda x: float(x.strip("%")),
        # 123,456,789 -> 123456789
        "downloads": lambda x: int(x.replace(",", ""))
    }
)
downloads_pd.head()
```
```
##   category  percent  downloads
## 0     3.11    23.40  282343944
## 1      3.9    18.78  226548604
## 2     3.10    18.74  226155405
## 3     3.12    15.48  186819921
## 4      NaN     8.41  101489156
```
In Polars, the with_columns()
method allows conversion to the expected
data-types.
```python
downloads_pl = pl.read_csv(
    files["pytest_data"],
    separator="\t",
    n_rows=17
).with_columns(
    # 12.34% -> 12.34
    pl.col("percent").str.replace("%", "").cast(pl.Float64),
    # 123,456,789 -> 123456789
    pl.col("downloads").str.replace_all(",", "").cast(pl.Int64)
)
downloads_pl.head()
```
```
## shape: (5, 3)
## ┌──────────┬─────────┬───────────┐
## │ category ┆ percent ┆ downloads │
## │ ---      ┆ ---     ┆ ---       │
## │ str      ┆ f64     ┆ i64       │
## ╞══════════╪═════════╪═══════════╡
## │ 3.11     ┆ 23.4    ┆ 282343944 │
## │ 3.9      ┆ 18.78   ┆ 226548604 │
## │ 3.10     ┆ 18.74   ┆ 226155405 │
## │ 3.12     ┆ 15.48   ┆ 186819921 │
## │ null     ┆ 8.41    ┆ 101489156 │
## └──────────┴─────────┴───────────┘
```
Now if we validate our datasets against the Pandera schema, we should
have a little more success:
```python
schema.validate(downloads_pd)
```
```
##    category  percent  downloads
## 0      3.11    23.40  282343944
## 1       3.9    18.78  226548604
## 2      3.10    18.74  226155405
## 3      3.12    15.48  186819921
## 4       NaN     8.41  101489156
## 5       3.8     6.13   73965471
## 6      3.13     4.91   59253846
## 7       3.7     3.36   40551618
## 8       3.6     0.54    6546017
## 9       2.7     0.20    2371860
## 10     3.14     0.02     300816
## 11      3.5     0.02     231325
## 12      3.4     0.00      12777
## 13     3.15     0.00       2882
## 14      3.3     0.00       2259
## 15      2.6     0.00         87
## 16      3.2     0.00         68
```
Nice.
Summary
In this post we have looked at importing data with both Pandas and
Polars. We have seen that the import functions are similar
(read_csv(...)) but that there are some subtle differences in how
you specify things (the column-separator argument, the data-schema
and so on). Explicit conversion of the imported data is also performed
differently between the two packages.
Once your data has been imported, you should check that it contains what
it is supposed to: are all the columns you need present, do they store
the correct data-types, and do the values within those columns sit within
the expected ranges? Data-validation libraries, like Pointblank and
Pandera, are really useful for checking data-schemas and data-values
before you do anything critical with your dataset.