GPU Acceleration with Polars LazyFrames
In a previous post, I walked through how Polars can be used to process larger-than-memory datasets without needing to set up and maintain a dedicated compute cluster. This was accomplished by taking advantage of the Polars LazyFrame. Unlike a regular DataFrame, which executes operations immediately, a LazyFrame builds up a logical query plan and defers execution until you explicitly call a method like .collect(). This allows Polars to optimize the entire query before running it, for example by applying predicate and projection pushdown or reordering operations for efficiency. The benefit is that you can chain many transformations together without incurring intermediate computation costs, and when the query finally runs, Polars can execute it in a highly optimized fashion.
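To make the deferred-execution model concrete, here is a minimal sketch (the file name and column names are placeholders, not part of the 311 analysis below):
import polars as pl

# Nothing is read or computed yet; scan_csv only records the source.
lf = pl.scan_csv("example.csv")

plan = (
    lf
    .filter(pl.col("amount") > 0)
    .select(["category", "amount"])
    .group_by("category")
    .agg(pl.col("amount").sum())
)

# Inspect the optimized logical plan; the filter and column selection
# are pushed down into the CSV scan.
print(plan.explain())

# Execution only happens here.
df = plan.collect()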

Recent Polars releases have introduced GPU acceleration as a capability for scaling analytical workloads. From the user’s perspective, all that needs to be done is to pass engine="gpu" to .collect(), and existing queries will be executed on NVIDIA GPUs through cuDF and the RAPIDS ecosystem, resulting in significant speed-ups for many common DataFrame operations.
I encountered a bit of difficulty getting my environment configured to take advantage of GPU acceleration (installing NVIDIA drivers, the CUDA toolkit, etc.), and ultimately settled on using the NVIDIA RAPIDS Docker image. The NVIDIA RAPIDS Docker images are pre-built containers that bundle the RAPIDS AI libraries (cuDF, cuML, cuGraph) together with CUDA, Python, and system dependencies. They're designed so you can quickly run RAPIDS on any machine with a compatible NVIDIA GPU without having to install and configure all the pieces manually. I opted for the RAPIDS notebook image, which starts a JupyterLab notebook server by default. I can't recommend this approach enough. After pulling the image, the container can be started with:
$ docker run --rm -it --gpus all -p 8888:8888 -e JUPYTER_TOKEN=rapids \
    nvcr.io/nvidia/rapidsai/notebooks:25.08-cuda12.9-py3.13
To ensure that the container environment can access the GPU(s) available on the host, run:
import cupy as cp
print("GPU available:", cp.cuda.runtime.getDeviceCount())
# Should be >= 1.

Alternatively, verify that the device is recognized from within the container using nvidia-smi:
!nvidia-smi
Thu Oct 30 03:16:07 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01 Driver Version: 550.163.01 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 20C P8 9W / 300W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

The output indicates the system has 23GB of VRAM available.
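If you would rather check available memory from Python instead of reading the nvidia-smi table, CuPy exposes it directly (a small sketch; cupy ships with the RAPIDS image):
import cupy as cp

# (free, total) device memory in bytes for GPU 0.
free_b, total_b = cp.cuda.Device(0).mem_info
print(f"GPU 0 VRAM: {free_b / 1e9:.1f} GB free of {total_b / 1e9:.1f} GB total")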
It is also necessary to install cudf-polars-cu12 to enable GPU acceleration in Polars since it provides the bridge between the Polars DataFrame engine and RAPIDS cuDF, which is NVIDIA's GPU-accelerated dataframe library built on CUDA 12. The standard Polars package runs entirely on CPU, but with cudf-polars-cu12 installed, Polars can transparently offload supported query operations to the GPU using the RAPIDS runtime. This package contains the CUDA-specific bindings, GPU kernels, and dependencies required for the Polars engine="gpu" option.
!pip install watermark cudf-polars-cu12
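Once the install completes, a quick way to confirm the bridge is working is to collect a trivial lazy query on the GPU (a minimal smoke test with arbitrary toy data, separate from the benchmark below):
import polars as pl

# Tiny in-memory LazyFrame with throwaway data.
lf_test = pl.LazyFrame({"x": [1, 2, 3], "y": ["a", "b", "a"]})

# If cudf-polars-cu12 is installed correctly, this executes on the GPU without error.
print(lf_test.group_by("y").agg(pl.col("x").sum()).collect(engine="gpu"))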
To benchmark Polars CPU vs GPU performance, the New York City 311 Service Requests dataset (via NYC Open Data) will be used as the starting point. It is an ~16GB CSV file representing a city-wide log of non-emergency service requests submitted by residents of New York City since 2010. Each row corresponds to a single request. For example, a noise complaint, trash pickup issue, illegal dumping report or street-light outage, etc. It also includes key attributes such as when the request was created and when it was resolved, the type of complaint, the responding agency, location (latitude/longitude or borough/zip), and status. It is available for download on Kaggle.
%load_ext watermark

import polars as pl

pl.Config(tbl_rows=30)
pl.Config(float_precision=4)
pl.Config(tbl_cols=None)

%watermark -v -m -p numpy,pandas,polars
Python implementation: CPython
Python version       : 3.12.9
IPython version      : 8.37.0

numpy : 1.26.4
pandas: 2.3.1
polars: 1.32.3

Compiler    : GCC 13.3.0
OS          : Linux
Release     : 5.10.244-240.970.amzn2.x86_64
Machine     : x86_64
Processor   : x86_64
CPU cores   : 8
Architecture: 64bit
We start by creating a LazyFrame based on the service requests dataset, and display the first 5 records:
lf = pl.scan_csv("311-service-requests.csv")
# Display the first 5 rows.
first5 = lf.head(5).collect()
first5

shape: (5, 7)
| year | month | Borough | Complaint Type | Latitude | Longitude | resp_hours |
|---|---|---|---|---|---|---|
| i64 | i64 | str | str | f64 | f64 | i64 |
| 2019 | 12 | "MANHATTAN" | "Street Condition" | 40.7457 | -73.9877 | null |
| 2019 | 12 | "BROOKLYN" | "Noise - Commercial" | 40.5965 | -73.9777 | null |
| 2019 | 12 | "BROOKLYN" | "Noise - Residential" | 40.6606 | -73.8835 | null |
| 2019 | 12 | "QUEENS" | "Noise - Residential" | 40.7600 | -73.8069 | null |
| 2019 | 12 | "QUEENS" | "Illegal Parking" | 40.7295 | -73.7300 | null |
We also obtain a count of the number of rows in the dataset:
n = lf.select(pl.len()).collect().item()
print(f"311-service-requests.csv: {n:,}")311-service-requests.csv: 21,960,000
Next a query is created to perform a set of transformations on the dataset. The example in the next cell aggregates service requests by type, year, month and quantized longitude and latitude using 0.005 degree bins. Since we’re using a LazyFrame, no action will be taken until .collect() is called.
query = (
lf
.select([
pl.col("Borough").alias("borough"),
pl.col("Complaint Type").alias("type"),
pl.col("Latitude").alias("lat"),
pl.col("Longitude").alias("lon"),
pl.col("year").cast(pl.Int16),
pl.col("month").cast(pl.Int8),
])
.filter(
pl.col("borough").is_not_null() &
pl.col("type").is_not_null() &
pl.col("lat").is_not_null() &
pl.col("lon").is_not_null() &
(pl.col("lat").abs() > 0) &
(pl.col("lon").abs() > 0)
)
# Quantize to 0.005 degree bins.
.with_columns([
(pl.col("lat") * 200).floor().cast(pl.Int32).alias("lat_bin"),
(pl.col("lon") * 200).floor().cast(pl.Int32).alias("lon_bin"),
])
.group_by(["year", "month", "lat_bin", "lon_bin", "type"])
.agg([
pl.len().alias("n"),
])
.sort("n", descending=True)
)

First the standard Polars CPU engine is benchmarked:
import time
t_init = time.time()
df = query.collect()
t_total = time.time() - t_init
print(f"Total runtime using CPU: {t_total:,.2f} seconds.\n")
print(f"df.shape: {df.shape}.")
df.head(3)

Total runtime using CPU: 6.72 seconds.

df.shape: (71133, 6).
shape: (3, 6)
| year | month | lat_bin | lon_bin | type | n |
|---|---|---|---|---|---|
| i16 | i8 | i32 | i32 | str | u32 |
| 2019 | 9 | 8138 | -14786 | "Noise - Residential" | 81966 |
| 2019 | 9 | 8144 | -14762 | "Illegal Parking" | 69024 |
| 2019 | 9 | 8144 | -14762 | "Noise - Vehicle" | 69024 |
Next the query is executed against the GPU engine. The only difference from the previous cell is engine="gpu" is passed into .collect():
# Execute query with GPU engine.
t_init = time.time()
df2 = query.collect(engine="gpu")
t_total = time.time() - t_init
print(f"Total runtime using GPU engine: {t_total:,.2f} seconds.\n")
print(f"df2.shape: {df2.shape}.")
df2.head(3)

Total runtime using GPU engine: 0.34 seconds.

df2.shape: (71133, 6).
shape: (3, 6)
| year | month | lat_bin | lon_bin | type | n |
|---|---|---|---|---|---|
| i16 | i8 | i32 | i32 | str | u32 |
| 2019 | 9 | 8138 | -14786 | "Noise - Residential" | 81966 |
| 2019 | 9 | 8144 | -14762 | "Illegal Parking" | 69024 |
| 2019 | 9 | 8144 | -14762 | "Noise - Vehicle" | 69024 |
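Before comparing timings, it's worth confirming that both engines return the same result. Rows that tie on n can come back in a different order, so sorting on the full group key first makes the comparison deterministic (a small sanity-check sketch using polars.testing; it was not part of the original benchmark):
from polars.testing import assert_frame_equal

# Sort both results on the group key so row order is deterministic, then compare.
key = ["year", "month", "lat_bin", "lon_bin", "type"]
assert_frame_equal(df.sort(key), df2.sort(key))
print("CPU and GPU results match.")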
Using the CPU engine took 6.72 seconds, vs. 0.34 seconds for the GPU engine, or a ~20x speedup. This is an incredible performance gain that required no code changes.
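A single timed run can be noisy, and the first GPU collect may include some one-time initialization overhead, so for a fairer comparison you could time several runs of each engine and keep the best (a simple sketch; this is not how the numbers above were produced):
import time

def best_of(n_runs, collect_fn):
    """Return the fastest wall-clock time over n_runs calls to collect_fn."""
    times = []
    for _ in range(n_runs):
        t0 = time.time()
        collect_fn()
        times.append(time.time() - t0)
    return min(times)

cpu_best = best_of(3, lambda: query.collect())
gpu_best = best_of(3, lambda: query.collect(engine="gpu"))
print(f"CPU best of 3: {cpu_best:.2f}s | GPU best of 3: {gpu_best:.2f}s")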
One thing to mention: when engine="gpu" is specified and an operation is not supported on the GPU, the query will silently fall back to CPU execution, which can make benchmarking tricky. To have Polars fail loudly if part of a query cannot be executed on the GPU, we can pass a GPUEngine object in place of "gpu" in the call to .collect(). If raise_on_fail is set to True, any operation without GPU support will cause the entire pipeline to fail. The next cell shows what this looks like (not the failure, but creating and using a GPUEngine object):
# Fail loudly if the query can't execute on the GPU.
gpu_engine = pl.GPUEngine(device=0, raise_on_fail=True)

df2 = query.collect(engine=gpu_engine)
df2.head(3)
shape: (3, 6)
| year | month | lat_bin | lon_bin | type | n |
|---|---|---|---|---|---|
| i16 | i8 | i32 | i32 | str | u32 |
| 2019 | 9 | 8138 | -14786 | "Noise - Residential" | 81966 |
| 2019 | 9 | 8144 | -14762 | "Illegal Parking" | 69024 |
| 2019 | 9 | 8144 | -14762 | "Noise - Vehicle" | 69024 |
Since all the operations in our query are GPU supported, no error is thrown.
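If you want to keep working when a query isn't fully GPU-supported but still know when the fallback happens, one option is to attempt GPU execution with raise_on_fail=True and fall back to the CPU engine yourself (a sketch; the broad except is deliberate, since the exact exception type raised isn't pinned down here):
# Try the GPU first; fall back to the CPU engine explicitly if it can't run there.
gpu_engine = pl.GPUEngine(device=0, raise_on_fail=True)

try:
    result = query.collect(engine=gpu_engine)
    print("Executed on GPU.")
except Exception as exc:  # exact exception type intentionally left broad
    print(f"GPU execution failed ({exc!r}); falling back to CPU.")
    result = query.collect()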
Another tip for monitoring GPU usage during longer-running jobs: from the terminal, run nvidia-smi -l 1, which refreshes the nvidia-smi output every second. The expectation is that if the GPU is being utilized, the amount of VRAM in use will change over time.
GPU-enabled Polars pushes analytical performance to a new level. For large, computation-heavy workloads (joins, group-bys, and sorts across tens or hundreds of millions of rows), the GPU engine can deliver dramatic speedups without changing a single line of code. That said, not everything is supported on GPU yet. Operations like complex datetime handling, regex, or window functions may fall back to CPU, and for smaller or I/O-heavy jobs the extra GPU overhead can actually slow things down. But as datasets continue to grow and GPU coverage expands, the ability to execute entire analytical pipelines directly in VRAM opens the door to running truly large-scale analytics on a single machine, turning what used to require distributed infrastructure into something you can do from your laptop.