Handling Larger-than-Memory Datasets with Polars LazyFrame
Want to share your content on python-bloggers? click here.
Pandas is the de facto standard for working with tabular datasets in Python, but it runs into significant challenges on large datasets. The core issue stems from Pandas’ architecture, which relies entirely on in-memory processing. This design requires loading the entire dataset into memory, imposing a hard limit on the size of datasets Pandas can handle. When data volumes exceed this threshold, performance degrades dramatically, often ending in memory exhaustion errors or outright crashes.
In addition, Pandas operates via single-threaded execution, processing data sequentially rather than leveraging the parallel processing capabilities of multi-core systems. This creates performance bottlenecks that become increasingly pronounced as the size of data grows.
In a previous post, I demonstrated how Polars can be used as an (almost) drop-in Pandas replacement for routine data processing tasks. Here we explore the Polars LazyFrame, which has no analog in Pandas. A LazyFrame builds up a sequence of operations without immediately executing them. Instead, it constructs a deferred query plan that is only run when explicitly requested, usually by calling the .collect() method. Deferred execution allows Polars to optimize the entire pipeline, apply techniques like predicate and projection pushdown, and process data using out-of-core techniques when necessary.
In the examples that follow, the Flight Prices dataset is used. It is a 31GB CSV file containing one-way flight itineraries found on Expedia between 2022-04-16 and 2022-10-05 for airport codes ATL, DFW, DEN, ORD, LAX, CLT, MIA, JFK, EWR, SFO, DTW, BOS, PHL, LGA, IAD and OAK. The file is larger than the client’s available memory, so it isn’t possible to analyze the full Flight Prices dataset with Pandas, which requires the entire file to be loaded into memory upfront. Given that the 31GB footprint is about 2x the RAM of the client I’m working from, it should be a perfect setting to demonstrate LazyFrame functionality.
In Polars, .scan_csv creates a LazyFrame by scanning the metadata and schema of a CSV file without loading the full dataset. In contrast, .read_csv immediately reads the entire file into memory as a DataFrame, performing eager evaluation. .scan_csv supports lazy, memory-efficient workflows, while .read_csv is eager and requires the dataset to fit in RAM.
We can preview the first 5 rows of the Flight Prices dataset by calling scan_csv followed by .head(5).collect() (fetch has been deprecated):
import polars as pl

pl.Config(tbl_rows=30)
pl.Config(float_precision=4)
pl.Config(tbl_cols=None)

# Create LazyFrame based on itineraries.csv. Dataset is available here:
# https://www.kaggle.com/datasets/dilwong/flightprices
lf = pl.scan_csv("itineraries.csv")

# Display the first 5 rows.
first5 = lf.head(5).collect()
first5
shape: (5, 27)
legId | searchDate | flightDate | startingAirport | destinationAirport | fareBasisCode | travelDuration | elapsedDays | isBasicEconomy | isRefundable | isNonStop | baseFare | totalFare | seatsRemaining | totalTravelDistance | segmentsDepartureTimeEpochSeconds | segmentsDepartureTimeRaw | segmentsArrivalTimeEpochSeconds | segmentsArrivalTimeRaw | segmentsArrivalAirportCode | segmentsDepartureAirportCode | segmentsAirlineName | segmentsAirlineCode | segmentsEquipmentDescription | segmentsDurationInSeconds | segmentsDistance | segmentsCabinCode |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | str | str | str | str | str | str | i64 | bool | bool | bool | f64 | f64 | i64 | i64 | str | str | str | str | str | str | str | str | str | str | str | str |
“9ca0e81111c683bec1012473feefd2… | “2022-04-16” | “2022-04-17” | “ATL” | “BOS” | “LA0NX0MC” | “PT2H29M” | 0 | false | false | true | 217.6700 | 248.6000 | 9 | 947 | “1650214620” | “2022-04-17T12:57:00.000-04:00” | “1650223560” | “2022-04-17T15:26:00.000-04:00” | “BOS” | “ATL” | “Delta” | “DL” | “Airbus A321” | “8940” | “947” | “coach” |
“98685953630e772a098941b7190659… | “2022-04-16” | “2022-04-17” | “ATL” | “BOS” | “LA0NX0MC” | “PT2H30M” | 0 | false | false | true | 217.6700 | 248.6000 | 4 | 947 | “1650191400” | “2022-04-17T06:30:00.000-04:00” | “1650200400” | “2022-04-17T09:00:00.000-04:00” | “BOS” | “ATL” | “Delta” | “DL” | “Airbus A321” | “9000” | “947” | “coach” |
“98d90cbc32bfbb05c2fc32897c7c10… | “2022-04-16” | “2022-04-17” | “ATL” | “BOS” | “LA0NX0MC” | “PT2H30M” | 0 | false | false | true | 217.6700 | 248.6000 | 9 | 947 | “1650209700” | “2022-04-17T11:35:00.000-04:00” | “1650218700” | “2022-04-17T14:05:00.000-04:00” | “BOS” | “ATL” | “Delta” | “DL” | “Boeing 757-200” | “9000” | “947” | “coach” |
“969a269d38eae583f455486fa90877… | “2022-04-16” | “2022-04-17” | “ATL” | “BOS” | “LA0NX0MC” | “PT2H32M” | 0 | false | false | true | 217.6700 | 248.6000 | 8 | 947 | “1650218340” | “2022-04-17T13:59:00.000-04:00” | “1650227460” | “2022-04-17T16:31:00.000-04:00” | “BOS” | “ATL” | “Delta” | “DL” | “Airbus A321” | “9120” | “947” | “coach” |
“980370cf27c89b40d2833a1d5afc97… | “2022-04-16” | “2022-04-17” | “ATL” | “BOS” | “LA0NX0MC” | “PT2H34M” | 0 | false | false | true | 217.6700 | 248.6000 | 9 | 947 | “1650203940” | “2022-04-17T09:59:00.000-04:00” | “1650213180” | “2022-04-17T12:33:00.000-04:00” | “BOS” | “ATL” | “Delta” | “DL” | “Airbus A321” | “9240” | “947” | “coach” |
To demonstrate how LazyFrames work, we’ll execute a simple (contrived) filter/aggregate/average pipeline to compute the average ticket price by origin airport and airline, but only for non-stop flights departing in July and August. This requires filtering, grouping, and aggregation operations that benefit from lazy evaluation and query optimization:
- Filter rows where the flightDate is in July or August.
- Group by startingAirport and segmentsAirlineName.
- Calculate the average totalFare for each group.
- Sort the results from highest to lowest average fare.
result = (
    lf
    .filter(
        pl.col("flightDate").str.strptime(pl.Date, "%Y-%m-%d").dt.month().is_in([7, 8])
        & pl.col("isNonStop")
    )
    .group_by(["startingAirport", "segmentsAirlineName"])
    .agg(
        pl.col("totalFare").mean().alias("avg_price")
    )
    .sort("avg_price", descending=True)
)
Notice that we didn’t include .collect() in the result pipeline. Until .collect() is called, no computation is performed: Polars just builds a query plan.
It is possible to view the query plan by calling .explain(optimized=True). This will display the optimized logical plan that Polars will execute when .collect() is called:
print(result.explain(optimized=True))
SORT BY [col("avg_price")]
  AGGREGATE
    [col("totalFare").mean().alias("avg_price")] BY [col("startingAirport"), col("segmentsAirlineName")]
    FROM
    simple π 5/5 ["totalFare", "startingAirport", ... 3 other columns]
      Csv SCAN [itineraries.csv]
      PROJECT 5/27 COLUMNS
      SELECTION: [(col("isNonStop")) & (col("flightDate").str.strptime([String(raise)]).dt.month().is_in([Series]))]
See this article for more information on interpreting query plans. You can optionally install Graphviz to have the query plan rendered graphically using lf.show_graph().
When calling .collect(), we include streaming=True. This ensures Polars processes the data in a streaming (out-of-core) fashion: the dataset is processed in small chunks rather than loaded into memory all at once.
# Execute result pipeline in streaming mode.
result = result.collect(streaming=True)

print(f"type(result): {type(result)}")
print(f"result.shape: {result.shape}")

result.head(10)
type(result): <class 'polars.dataframe.frame.DataFrame'>
result.shape: (95, 3)
shape: (10, 3)
startingAirport | segmentsAirlineName | avg_price |
---|---|---|
str | str | f64 |
“BOS” | “Alaska Airlines” | 502.8843 |
“SFO” | “American Airlines” | 437.5257 |
“IAD” | “Alaska Airlines” | 433.5487 |
“LAX” | “American Airlines” | 425.0462 |
“ORD” | “Alaska Airlines” | 407.9898 |
“SFO” | “JetBlue Airways” | 402.2174 |
“ATL” | “Alaska Airlines” | 383.4524 |
“LAX” | “Delta” | 374.1640 |
“JFK” | “United” | 369.2713 |
“LAX” | “JetBlue Airways” | 368.5215 |
Once .collect(streaming=True) is called, result materializes into a Polars DataFrame. The final output is a 95×3 DataFrame with average fare price by origin airport and airline. Executing the result pipeline took about 45 seconds.
Alaska Airlines appears to have higher average fares than other airlines. This is likely because Alaska Airlines flights cover greater distances than those of other carriers, which pushes up the average fare. A fairer comparison is the average total fare per mile. The result2 pipeline in the next cell performs this calculation. We need to convert segmentsDistance to a numeric type, since it was interpreted as a text column when initially read:
# Calculate the average total fare per mile for non-stop flights in July and August.
result2 = (
    lf
    .filter(
        pl.col("flightDate").str.strptime(pl.Date, "%Y-%m-%d").dt.month().is_in([7, 8])
        & pl.col("isNonStop")
    )
    .with_columns(
        # Cast segmentsDistance to Float64. Use strict=False to convert errors to null.
        pl.col("segmentsDistance").cast(pl.Float64, strict=False).alias("miles")
    )
    .filter(
        # Keep only rows where miles is non-null and positive.
        pl.col("miles").is_not_null() & (pl.col("miles") > 0)
    )
    .group_by(["startingAirport", "segmentsAirlineName"])
    .agg(
        # Average of totalFare / miles within each group.
        (pl.col("totalFare") / pl.col("miles")).mean().alias("avg_price_per_mile")
    )
    .sort("avg_price_per_mile", descending=True)
).collect(streaming=True)

result2.head(20)
shape: (20, 3)
startingAirport | segmentsAirlineName | avg_price_per_mile |
---|---|---|
str | str | f64 |
“PHL” | “United” | 0.7913 |
“CLT” | “Delta” | 0.6019 |
“IAD” | “Delta” | 0.5592 |
“DTW” | “United” | 0.5220 |
“DTW” | “American Airlines” | 0.5048 |
“PHL” | “Delta” | 0.4816 |
“BOS” | “Delta” | 0.4673 |
“IAD” | “United” | 0.4650 |
“EWR” | “United” | 0.4603 |
“CLT” | “United” | 0.4481 |
“ATL” | “American Airlines” | 0.4441 |
“BOS” | “American Airlines” | 0.4434 |
“DTW” | “Delta” | 0.4387 |
“IAD” | “American Airlines” | 0.4381 |
“CLT” | “American Airlines” | 0.4291 |
“LGA” | “United” | 0.4277 |
“EWR” | “Delta” | 0.4215 |
“JFK” | “Delta” | 0.4106 |
“JFK” | “American Airlines” | 0.4095 |
“BOS” | “United” | 0.4083 |
Polars LazyFrames unlock a new level of efficiency and scalability for data analysis in Python, enabling the handling of massive datasets with ease and speed without having to maintain a dedicated compute cluster. LazyFrames overcome many of the limitations of traditional in-memory tools like Pandas by leveraging deferred execution, query optimization and out-of-core computation. For additional information on LazyFrames and streaming mode, check out the Polars documentation.