Handling Larger-than-Memory Datasets with Polars LazyFrame
Want to share your content on python-bloggers? click here.
Pandas is the de facto standard for working with tabular datasets in Python, but it runs into significant challenges on large datasets. The core issue stems from Pandas’ architecture, which relies entirely on in-memory processing. This design requires loading the entire dataset into memory, imposing a hard limit on the size of datasets Pandas can handle. When data volumes exceed this threshold, performance degrades dramatically, often ending in memory exhaustion errors or outright crashes.
In addition, Pandas operates via single-threaded execution, processing data sequentially rather than leveraging the parallel processing capabilities of multi-core systems. This creates performance bottlenecks that become increasingly pronounced as the size of data grows.
In a previous post, I demonstrated how Polars can be used as an (almost) drop-in Pandas replacement for routine data processing tasks. Here we explore the Polars LazyFrame, which has no analog in Pandas. A LazyFrame builds up a sequence of operations without immediately executing them. Instead, it constructs a deferred query plan that is only run when explicitly requested, usually by calling the .collect() method. Deferred execution allows Polars to optimize the entire pipeline, apply techniques like predicate and projection pushdown, and process data using out-of-core techniques when necessary.
In the examples that follow, the Flight Prices dataset is used. It is a 31GB CSV file containing one-way flight itineraries found on Expedia between 2022-04-16 and 2022-10-05 for airport codes ATL, DFW, DEN, ORD, LAX, CLT, MIA, JFK, EWR, SFO, DTW, BOS, PHL, LGA, IAD and OAK. The file is larger than the client’s available memory, so it isn’t possible to analyze the full Flight Prices dataset with Pandas, which requires the entire file to be loaded into memory upfront. Given that the 31GB footprint is about 2x the RAM of the client I’m working from, it should be a perfect setting to demonstrate LazyFrame functionality.
In Polars, .scan_csv creates a LazyFrame by scanning the metadata and schema of a CSV file without loading the full dataset. In contrast, .read_csv immediately reads the entire file into memory as a DataFrame, performing eager evaluation. .scan_csv supports lazy, memory-efficient workflows, while .read_csv is eager and requires the dataset to fit in RAM.
We can preview the first 5 rows of the Flight Prices dataset by calling scan_csv followed by .head(5).collect() (fetch has been deprecated):
import polars as pl

pl.Config(tbl_rows=30)
pl.Config(float_precision=4)
pl.Config(tbl_cols=None)

# Create LazyFrame based on itineraries.csv. Dataset is available here:
# https://www.kaggle.com/datasets/dilwong/flightprices
lf = pl.scan_csv("itineraries.csv")

# Display the first 5 rows.
first5 = lf.head(5).collect()
first5
shape: (5, 27)
legId | searchDate | flightDate | startingAirport | destinationAirport | fareBasisCode | travelDuration | elapsedDays | isBasicEconomy | isRefundable | isNonStop | baseFare | totalFare | seatsRemaining | totalTravelDistance | segmentsDepartureTimeEpochSeconds | segmentsDepartureTimeRaw | segmentsArrivalTimeEpochSeconds | segmentsArrivalTimeRaw | segmentsArrivalAirportCode | segmentsDepartureAirportCode | segmentsAirlineName | segmentsAirlineCode | segmentsEquipmentDescription | segmentsDurationInSeconds | segmentsDistance | segmentsCabinCode |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | str | str | str | str | str | str | i64 | bool | bool | bool | f64 | f64 | i64 | i64 | str | str | str | str | str | str | str | str | str | str | str | str |
“9ca0e81111c683bec1012473feefd2… | “2022-04-16” | “2022-04-17” | “ATL” | “BOS” | “LA0NX0MC” | “PT2H29M” | 0 | false | false | true | 217.6700 | 248.6000 | 9 | 947 | “1650214620” | “2022-04-17T12:57:00.000-04:00” | “1650223560” | “2022-04-17T15:26:00.000-04:00” | “BOS” | “ATL” | “Delta” | “DL” | “Airbus A321” | “8940” | “947” | “coach” |
“98685953630e772a098941b7190659… | “2022-04-16” | “2022-04-17” | “ATL” | “BOS” | “LA0NX0MC” | “PT2H30M” | 0 | false | false | true | 217.6700 | 248.6000 | 4 | 947 | “1650191400” | “2022-04-17T06:30:00.000-04:00” | “1650200400” | “2022-04-17T09:00:00.000-04:00” | “BOS” | “ATL” | “Delta” | “DL” | “Airbus A321” | “9000” | “947” | “coach” |
“98d90cbc32bfbb05c2fc32897c7c10… | “2022-04-16” | “2022-04-17” | “ATL” | “BOS” | “LA0NX0MC” | “PT2H30M” | 0 | false | false | true | 217.6700 | 248.6000 | 9 | 947 | “1650209700” | “2022-04-17T11:35:00.000-04:00” | “1650218700” | “2022-04-17T14:05:00.000-04:00” | “BOS” | “ATL” | “Delta” | “DL” | “Boeing 757-200” | “9000” | “947” | “coach” |
“969a269d38eae583f455486fa90877… | “2022-04-16” | “2022-04-17” | “ATL” | “BOS” | “LA0NX0MC” | “PT2H32M” | 0 | false | false | true | 217.6700 | 248.6000 | 8 | 947 | “1650218340” | “2022-04-17T13:59:00.000-04:00” | “1650227460” | “2022-04-17T16:31:00.000-04:00” | “BOS” | “ATL” | “Delta” | “DL” | “Airbus A321” | “9120” | “947” | “coach” |
“980370cf27c89b40d2833a1d5afc97… | “2022-04-16” | “2022-04-17” | “ATL” | “BOS” | “LA0NX0MC” | “PT2H34M” | 0 | false | false | true | 217.6700 | 248.6000 | 9 | 947 | “1650203940” | “2022-04-17T09:59:00.000-04:00” | “1650213180” | “2022-04-17T12:33:00.000-04:00” | “BOS” | “ATL” | “Delta” | “DL” | “Airbus A321” | “9240” | “947” | “coach” |
To demonstrate how LazyFrames work, we’ll execute a simple (contrived) filter/aggregate/average pipeline to compute the average ticket price by origin airport and airline, but only for non-stop flights departing in July and August. This requires filtering, grouping, and aggregation operations that benefit from lazy evaluation and query optimization:
- Filter rows where the flightDate is in July or August.
- Group by startingAirport and segmentsAirlineName.
- Calculate the average totalFare for each group.
- Sort the results from highest to lowest average fare.
result = (
    lf
    .filter(
        pl.col("flightDate").str.strptime(pl.Date, "%Y-%m-%d").dt.month().is_in([7, 8])
        & pl.col("isNonStop")
    )
    .group_by(["startingAirport", "segmentsAirlineName"])
    .agg(
        pl.col("totalFare").mean().alias("avg_price")
    )
    .sort("avg_price", descending=True)
)
Notice that we didn’t include .collect() in the result pipeline. Until .collect() is called, no computation is performed: Polars just builds a query plan.
It is possible to view the query plan by calling .explain(optimized=True). This will display the optimized logical plan that Polars will execute when .collect() is called:
print(result.explain(optimized=True))
SORT BY [col("avg_price")]
  AGGREGATE
    [col("totalFare").mean().alias("avg_price")] BY [col("startingAirport"), col("segmentsAirlineName")]
    FROM
    simple π 5/5 ["totalFare", "startingAirport", ... 3 other columns]
      Csv SCAN [itineraries.csv]
      PROJECT 5/27 COLUMNS
      SELECTION: [(col("isNonStop")) & (col("flightDate").str.strptime([String(raise)]).dt.month().is_in([Series]))]
See this article for more information on interpreting query plans. You can optionally install Graphviz to have the query plan rendered graphically using lf.show_graph().
When calling .collect(), we include streaming=True. This ensures Polars processes the data in a streaming (out-of-core) fashion: the dataset is processed in small chunks rather than loaded into memory all at once.
# Execute result pipeline in streaming mode.
result = result.collect(streaming=True)

print(f"type(result): {type(result)}")
print(f"result.shape: {result.shape}")

result.head(10)
type(result): <class 'polars.dataframe.frame.DataFrame'>
result.shape: (95, 3)
shape: (10, 3)
startingAirport | segmentsAirlineName | avg_price |
---|---|---|
str | str | f64 |
“BOS” | “Alaska Airlines” | 502.8843 |
“SFO” | “American Airlines” | 437.5257 |
“IAD” | “Alaska Airlines” | 433.5487 |
“LAX” | “American Airlines” | 425.0462 |
“ORD” | “Alaska Airlines” | 407.9898 |
“SFO” | “JetBlue Airways” | 402.2174 |
“ATL” | “Alaska Airlines” | 383.4524 |
“LAX” | “Delta” | 374.1640 |
“JFK” | “United” | 369.2713 |
“LAX” | “JetBlue Airways” | 368.5215 |
Once .collect(streaming=True) is called, result materializes into a Polars DataFrame. The final output is a 95×3 DataFrame with average fare price by origin airport and airline. Executing the result pipeline took about 45 seconds.
Alaska Airlines appears to have higher average fares than other airlines. This is likely because Alaska Airlines flights cover greater distances than those of other carriers, which pushes up the average fare. A fairer comparison is the average total fare per mile. The result2 pipeline in the next cell performs this calculation. We need to convert segmentsDistance to a numeric type, since it was interpreted as a text column when initially read:
# Calculate the average total fare per mile for non-stop flights in July and August.
result2 = (
    lf
    .filter(
        pl.col("flightDate").str.strptime(pl.Date, "%Y-%m-%d").dt.month().is_in([7, 8])
        & pl.col("isNonStop")
    )
    .with_columns(
        # Cast segmentsDistance to Float64. Use strict=False to convert errors to null.
        pl.col("segmentsDistance").cast(pl.Float64, strict=False).alias("miles")
    )
    .filter(
        # Keep only rows where miles is non-null and positive.
        pl.col("miles").is_not_null() & (pl.col("miles") > 0)
    )
    .group_by(["startingAirport", "segmentsAirlineName"])
    .agg(
        # Average of totalFare / miles within each group.
        (pl.col("totalFare") / pl.col("miles")).mean().alias("avg_price_per_mile")
    )
    .sort("avg_price_per_mile", descending=True)
).collect(streaming=True)

result2.head(20)
shape: (20, 3)
startingAirport | segmentsAirlineName | avg_price_per_mile |
---|---|---|
str | str | f64 |
“PHL” | “United” | 0.7913 |
“CLT” | “Delta” | 0.6019 |
“IAD” | “Delta” | 0.5592 |
“DTW” | “United” | 0.5220 |
“DTW” | “American Airlines” | 0.5048 |
“PHL” | “Delta” | 0.4816 |
“BOS” | “Delta” | 0.4673 |
“IAD” | “United” | 0.4650 |
“EWR” | “United” | 0.4603 |
“CLT” | “United” | 0.4481 |
“ATL” | “American Airlines” | 0.4441 |
“BOS” | “American Airlines” | 0.4434 |
“DTW” | “Delta” | 0.4387 |
“IAD” | “American Airlines” | 0.4381 |
“CLT” | “American Airlines” | 0.4291 |
“LGA” | “United” | 0.4277 |
“EWR” | “Delta” | 0.4215 |
“JFK” | “Delta” | 0.4106 |
“JFK” | “American Airlines” | 0.4095 |
“BOS” | “United” | 0.4083 |
Polars LazyFrames unlock a new level of efficiency and scalability for data analysis in Python, enabling the handling of massive datasets with ease and speed without having to maintain a dedicated compute cluster. LazyFrames overcome many of the limitations of traditional in-memory tools like Pandas by leveraging deferred execution, query optimization and out-of-core computation. For additional information on LazyFrames and streaming mode, check out the Polars documentation.