Converting CSV to Parquet
Conversion from CSV to Parquet in streaming mode? No problem for the two powerhouses Polars and DuckDB. We can even throw in some data preprocessing steps in between, like column selection, data filters, or sorts.
pip install polars

pip install duckdb

Run times are on a normal laptop, dedicating 8 threads to the crunching.
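If you want to reproduce the timings, one way is to pin the Polars thread pool via an environment variable and time each snippet manually. This is a minimal sketch, not part of the original post; POLARS_MAX_THREADS must be set before polars is imported.

import os
os.environ["POLARS_MAX_THREADS"] = "8"  # read by Polars at import time

import time
import polars as pl

start = time.perf_counter()
# ... place one of the conversion snippets below here ...
print(f"elapsed: {time.perf_counter() - start:.1f} s")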
Let’s generate a 2 GB CSV file first
import duckdb
import numpy as np
import polars as pl

n = 100_000_000
rng = np.random.default_rng(42)

df = pl.DataFrame(
    {
        "X": rng.choice(["a", "b", "c"], n),
        "Y": rng.uniform(0, 1, n),
        "Z": rng.choice([1, 2, 3, 4, 5], n),
    }
)
df.write_csv("data.csv")
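A quick sanity check of the resulting file size, using only the standard library (this check is not part of the original code):

import os

print(f"{os.path.getsize('data.csv') / 1e9:.1f} GB")  # roughly 2 GB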
Polars
Let’s use Polars in Lazy mode to connect to the CSV, apply some data operations, and stream the result into a Parquet file.
# Native API with POLARS_MAX_THREADS = 8
(
    pl.scan_csv("data.csv")
    .filter(pl.col("X") == "a")
    .drop("X")
    .sort(["Y", "Z"])
    .sink_parquet("data.parquet", row_group_size=100_000)  # "zstd" compression
)
# 3.5 s

In case you prefer to write SQL code, you can alternatively use the SQL API of Polars. Curiously, the run time is substantially longer:
# Via SQL API (slower!?)
(
    pl.scan_csv("data.csv")
    .sql("SELECT Y, Z FROM self WHERE X == 'a' ORDER BY Y, Z")
    .sink_parquet("data.parquet", row_group_size=100_000)
)
# 6.8 s

In both cases, the result looks as expected, and the resulting Parquet file is about 170 MB in size.
pl.scan_parquet("data.parquet").head(5).collect()
# Output
shape: (5, 2)
┌───────────┬─────┐
│ Y         ┆ Z   │
│ ---       ┆ --- │
│ f64       ┆ i64 │
╞═══════════╪═════╡
│ 3.7796e-8 ┆ 4   │
│ 5.0273e-8 ┆ 5   │
│ 5.7652e-8 ┆ 4   │
│ 8.0578e-8 ┆ 3   │
│ 8.1598e-8 ┆ 4   │
└───────────┴─────┘

DuckDB
As an alternative, we use DuckDB. Thread pool size and RAM limit can be set on the fly. Setting a low memory limit (e.g., 500 MB) will lead to longer run times, but it works.
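For illustration, the low-memory setup mentioned above differs only in the config passed to the connection. This is a sketch (the variable name is hypothetical); the example below sticks with a more comfortable 4 GB:

# Hypothetical low-memory variant: same conversion, just a tighter RAM budget
con_small = duckdb.connect(config={"threads": 8, "memory_limit": "500MB"})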
con = duckdb.connect(config={"threads": 8, "memory_limit": "4GB"})
con.sql(
    """
    COPY (
        SELECT Y, Z
        FROM 'data.csv'
        WHERE X == 'a'
        ORDER BY Y, Z
    ) TO 'data.parquet' (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 100_000)
    """
)
# 3.9 s

Again, the output looks as expected, and the Parquet file is again about 170 MB, thanks to using the same compression (“zstd”) as with Polars.
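If you want to verify the codec and row-group size directly, the Parquet metadata can be inspected, for example with pyarrow (a sketch; pyarrow is not used elsewhere in this post):

import pyarrow.parquet as pq

meta = pq.ParquetFile("data.parquet").metadata
print(meta.num_row_groups)                      # number of row groups
print(meta.row_group(0).column(0).compression)  # e.g. "ZSTD"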
con.sql("SELECT * FROM 'data.parquet' LIMIT 5")
# Output
┌────────────────────────┬───────┐
│ Y │ Z │
│ double │ int64 │
├────────────────────────┼───────┤
│ 3.779571322581887e-08 │ 4 │
│ 5.0273087692787044e-08 │ 5 │
│ 5.76523543349694e-08 │ 4 │
│ 8.057776434977626e-08 │ 3 │
│ 8.159834352650108e-08 │ 4 │
└────────────────────────┴───────┘

Final words
- With Polars or DuckDB, conversion of CSVs to Parquet is easy and fast, even in larger-than-RAM situations.
- We can apply filters, selects, sorts, etc. on the fly.