Converting CSV to Parquet
Conversion from CSV to Parquet in streaming mode? No problem for the two powerhouses Polars and DuckDB. We can even throw in some data preprocessing steps in between, such as column selection, data filters, or sorts.
pip install polars
pip install duckdb
Run times were measured on a normal laptop, with 8 threads dedicated to the crunching.
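As a side note on pinning the thread count: for Polars, this is done via an environment variable that must be set before polars is imported for the first time. Here is a minimal sketch (assuming a recent Polars version where pl.thread_pool_size() is available); for DuckDB, the thread count is passed in the connection config further below.

import os

# Must be set before the first "import polars" in the session
os.environ["POLARS_MAX_THREADS"] = "8"

import polars as pl

print(pl.thread_pool_size())  # -> 8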
Let’s first generate a 2 GB CSV file:
import duckdb
import numpy as np
import polars as pl

n = 100_000_000

rng = np.random.default_rng(42)
df = pl.DataFrame(
    {
        "X": rng.choice(["a", "b", "c"], n),
        "Y": rng.uniform(0, 1, n),
        "Z": rng.choice([1, 2, 3, 4, 5], n),
    }
)
df.write_csv("data.csv")

Polars
Let’s use Polars in Lazy mode to connect to the CSV, apply some data operations, and stream the result into a Parquet file.
# Native API with POLARS_MAX_THREADS = 8
(
    pl.scan_csv("data.csv")
    .filter(pl.col("X") == "a")
    .drop("X")
    .sort(["Y", "Z"])
    .sink_parquet("data.parquet", row_group_size=100_000)  # "zstd" compression
)

# 3.5 s
In case you prefer to write SQL code, you can alternatively use the SQL API of Polars. Curiously, run time is substantially longer:
# Via SQL API (slower!?)
(
    pl.scan_csv("data.csv")
    .sql("SELECT Y, Z FROM self WHERE X == 'a' ORDER BY Y, Z")
    .sink_parquet("data.parquet", row_group_size=100_000)
)

# 6.8 s
In both cases, the result looks as expected, and the resulting Parquet file is about 170 MB in size.
pl.scan_parquet("data.parquet").head(5).collect()

# Output
Y          Z
f64        i64
3.7796e-8  4
5.0273e-8  5
5.7652e-8  4
8.0578e-8  3
8.1598e-8  4
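To double-check the file sizes mentioned above, here is a quick sketch comparing the two files on disk (assuming both live in the current working directory):

import os

for f in ["data.csv", "data.parquet"]:
    print(f"{f}: {os.path.getsize(f) / 1024**2:.0f} MB")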
DuckDB
As an alternative, we use DuckDB. Thread pool size and RAM limit can be set on the fly. Setting a low memory limit (e.g., 500 MB) will lead to longer run times, but it works.
con = duckdb.connect(config={"threads": 8, "memory_limit": "4GB"})

con.sql(
    """
    COPY (
        SELECT Y, Z
        FROM 'data.csv'
        WHERE X == 'a'
        ORDER BY Y, Z
    ) TO 'data.parquet' (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 100_000)
    """
)

# 3.9 s
Again, the output looks as expected. The Parquet file is also about 170 MB in size, thanks to using the same compression ("zstd") as with Polars.
con.sql("SELECT * FROM 'data.parquet' LIMIT 5")

# Output
┌────────────────────────┬───────┐
│           Y            │   Z   │
│         double         │ int64 │
├────────────────────────┼───────┤
│  3.779571322581887e-08 │     4 │
│ 5.0273087692787044e-08 │     5 │
│   5.76523543349694e-08 │     4 │
│  8.057776434977626e-08 │     3 │
│  8.159834352650108e-08 │     4 │
└────────────────────────┴───────┘
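To illustrate the "on the fly" part mentioned above: both settings can also be changed on an existing connection via SET statements. A sketch for the 500 MB scenario (the values are just examples; expect a longer run time under the tighter memory budget):

# Tighten the resource limits on the existing connection
con.sql("SET threads = 8")
con.sql("SET memory_limit = '500MB'")

# Re-run the same conversion under the smaller memory budget
con.sql(
    """
    COPY (
        SELECT Y, Z FROM 'data.csv' WHERE X == 'a' ORDER BY Y, Z
    ) TO 'data.parquet' (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 100_000)
    """
)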
Final words
- With Polars or DuckDB, conversion of CSVs to Parquet is easy and fast, even in larger-than-RAM situations.
- We can apply filters, column selections, sorts, etc. on the fly.