GPU Acceleration with Polars LazyFrames

This article was first published on The Pleasure of Finding Things Out: A blog by James Triveri, and kindly contributed to python-bloggers.

In a previous post, I walked through how Polars can be used to process larger-than-memory datasets without needing to set up and maintain a dedicated compute cluster. This was accomplished by taking advantage of the Polars LazyFrame. Unlike a regular DataFrame, which executes operations immediately, a LazyFrame builds up a logical query plan and defers execution until you explicitly call a method like .collect(). This allows Polars to optimize the entire query before running it, for example through predicate and projection pushdown or by reordering operations for efficiency. The benefit is that you can chain many transformations together without incurring intermediate computation costs, and when the query finally runs, Polars can execute it in a highly optimized fashion.

Recent Polars releases have introduced GPU acceleration as a capability for scaling analytical workloads. From the user’s perspective, all that needs to be done is to pass engine="gpu" to .collect(), and existing queries will be executed on NVIDIA GPUs through cuDF and the RAPIDS ecosystem, resulting in significant speed-ups for many common DataFrame operations.

I encountered a bit of difficulty getting my environment configured to take advantage of GPU acceleration (installing NVIDIA drivers, the CUDA toolkit, etc.), so I ultimately settled on using the NVIDIA RAPIDS Docker image. The RAPIDS Docker images are pre-built containers that bundle the RAPIDS AI libraries (cuDF, cuML, cuGraph) together with CUDA, Python, and system dependencies. They are designed so you can quickly run RAPIDS on any machine with a compatible NVIDIA GPU without having to install and configure all the pieces manually. I opted for the RAPIDS notebook image, which starts a JupyterLab server by default. I can’t recommend this approach enough. After pulling the image, the container can be started with:

$ docker run --rm -it --gpus all -p 8888:8888 -e JUPYTER_TOKEN=rapids \
  nvcr.io/nvidia/rapidsai/notebooks:25.08-cuda12.9-py3.13

To ensure that the container environment can access the GPU(s) available on the host, run:

import cupy as cp
print("GPU available:", cp.cuda.runtime.getDeviceCount())
# Should be >=1.

Alternatively, verify that the device is recognized from within the container using nvidia-smi:

!nvidia-smi
Thu Oct 30 03:16:07 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   20C    P8              9W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The output shows a single NVIDIA A10G with 23,028 MiB (roughly 22.5 GiB) of VRAM, almost none of which is currently in use.
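
The same check can be scripted. The sketch below is one possible approach, not part of the original setup: it calls nvidia-smi with its CSV query flags and parses the device name and total memory. The helper names parse_gpu_csv and list_gpus are hypothetical, and the function returns an empty list when nvidia-smi is missing or fails.

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse lines such as 'NVIDIA A10G, 23028' into (name, memory_mib) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        name, mem = line.rsplit(",", 1)
        try:
            gpus.append((name.strip(), int(mem.strip())))
        except ValueError:
            # Memory may be reported as a non-numeric value on some devices.
            gpus.append((name.strip(), None))
    return gpus


def list_gpus():
    """Return the GPUs visible to nvidia-smi, or [] if the tool is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    proc = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_csv(proc.stdout) if proc.returncode == 0 else []


for name, mem_mib in list_gpus():
    print(f"{name}: {mem_mib} MiB")
```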

It is also necessary to install cudf-polars-cu12 to enable GPU acceleration in Polars. It provides the bridge between the Polars query engine and RAPIDS cuDF, NVIDIA’s GPU-accelerated dataframe library; the cu12 suffix indicates a build for CUDA 12. The standard Polars package runs entirely on CPU, but with cudf-polars-cu12 installed, Polars can transparently offload supported query operations to the GPU using the RAPIDS runtime. This package contains the CUDA-specific bindings, GPU kernels, and dependencies required for the Polars engine="gpu" option.

!pip install watermark cudf-polars-cu12
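
A quick way to confirm the install took effect is to check that the cudf_polars module can be found without importing it. This is only a sketch; the helper name gpu_engine_installed is hypothetical:

```python
import importlib.util


def gpu_engine_installed():
    """Return True if the cudf_polars module is importable in this environment."""
    return importlib.util.find_spec("cudf_polars") is not None


print("cudf_polars found:", gpu_engine_installed())
```

A result of True only shows that the package is present; it does not prove that a GPU is visible to the process.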

To benchmark Polars CPU vs. GPU performance, the New York City 311 Service Requests dataset (via NYC Open Data) will be used as the starting point. It is a ~16GB CSV file representing a city-wide log of non-emergency service requests submitted by residents of New York City since 2010. Each row corresponds to a single request, such as a noise complaint, trash pickup issue, illegal dumping report, or street-light outage. It also includes key attributes such as when the request was created and when it was resolved, the type of complaint, the responding agency, location (latitude/longitude or borough/zip), and status. It is available for download on Kaggle.

%load_ext watermark

import polars as pl 

pl.Config(tbl_rows=30)
pl.Config(float_precision=4)
pl.Config(tbl_cols=None)

%watermark -v -m -p numpy,pandas,polars
Python implementation: CPython
Python version       : 3.12.9
IPython version      : 8.37.0

numpy : 1.26.4
pandas: 2.3.1
polars: 1.32.3

Compiler    : GCC 13.3.0
OS          : Linux
Release     : 5.10.244-240.970.amzn2.x86_64
Machine     : x86_64
Processor   : x86_64
CPU cores   : 8
Architecture: 64bit

We start by creating a LazyFrame based on the service requests dataset, and display the first 5 records:

lf = pl.scan_csv("311-service-requests.csv")

# Display the first 5 rows.
first5 = lf.head(5).collect()

first5

shape: (5, 7)

year  month  Borough      Complaint Type         Latitude  Longitude  resp_hours
i64   i64    str          str                    f64       f64        i64
2019  12     "MANHATTAN"  "Street Condition"     40.7457   -73.9877   null
2019  12     "BROOKLYN"   "Noise - Commercial"   40.5965   -73.9777   null
2019  12     "BROOKLYN"   "Noise - Residential"  40.6606   -73.8835   null
2019  12     "QUEENS"     "Noise - Residential"  40.7600   -73.8069   null
2019  12     "QUEENS"     "Illegal Parking"      40.7295   -73.7300   null

We also obtain a count of the number of rows in the dataset:

n = lf.select(pl.len()).collect().item()

print(f"311-service-requests.csv: {n:,}")
311-service-requests.csv: 21,960,000

Next, a query is created to perform a set of transformations on the dataset. The example in the next cell aggregates service requests by complaint type, year, month, and quantized latitude and longitude using 0.005 degree bins. Since we’re using a LazyFrame, no action will be taken until .collect() is called.

query = (
    lf
    .select([
        pl.col("Borough").alias("borough"),
        pl.col("Complaint Type").alias("type"),
        pl.col("Latitude").alias("lat"),
        pl.col("Longitude").alias("lon"),
        pl.col("year").cast(pl.Int16),
        pl.col("month").cast(pl.Int8),
    ])
    .filter(
        pl.col("borough").is_not_null() &
        pl.col("type").is_not_null() &
        pl.col("lat").is_not_null() &
        pl.col("lon").is_not_null() &
        (pl.col("lat").abs() > 0) &
        (pl.col("lon").abs() > 0)
    )
    # Quantize to 0.005 degree bins.
    .with_columns([
        (pl.col("lat") * 200).floor().cast(pl.Int32).alias("lat_bin"),
        (pl.col("lon") * 200).floor().cast(pl.Int32).alias("lon_bin"),
    ])
    .group_by(["year", "month", "lat_bin", "lon_bin", "type"])
    .agg([
        pl.len().alias("n"),
    ])
    .sort("n", descending=True)
)
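
The 0.005 degree binning in the query can be checked by hand. Multiplying by 200 (that is, 1 / 0.005) and taking the floor gives an integer bin id, and dividing the id by 200 recovers the lower edge of the bin. The sketch below uses plain Python on the coordinates from the first row displayed earlier; the helper names to_bin and bin_lower_edge are hypothetical.

```python
import math

BINS_PER_DEGREE = 200  # 1 / 0.005, matching the `* 200` in the query


def to_bin(coord):
    """Map a coordinate in degrees to its integer 0.005 degree bin id."""
    return math.floor(coord * BINS_PER_DEGREE)


def bin_lower_edge(bin_id):
    """Recover the lower edge (in degrees) of a bin id."""
    return bin_id / BINS_PER_DEGREE


lat, lon = 40.7457, -73.9877  # coordinates from the first displayed row
lat_bin, lon_bin = to_bin(lat), to_bin(lon)

print(lat_bin, bin_lower_edge(lat_bin))
print(lon_bin, bin_lower_edge(lon_bin))
```

Here the latitude falls in bin 8149, which covers 40.745 up to (but not including) 40.750. Because floor rounds toward negative infinity, the negative longitude falls in bin -14798, whose lower edge is -73.990.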

First, the standard Polars CPU engine is benchmarked:

import time

t_init = time.time()
df = query.collect()
t_total = time.time() - t_init

print(f"Total runtime using CPU: {t_total:,.2f} seconds.\n")
print(f"df.shape: {df.shape}.")
df.head(3)
Total runtime using CPU: 6.72 seconds.

df.shape: (71133, 6).

shape: (3, 6)

year  month  lat_bin  lon_bin  type                   n
i16   i8     i32      i32      str                    u32
2019  9      8138     -14786   "Noise - Residential"  81966
2019  9      8144     -14762   "Illegal Parking"      69024
2019  9      8144     -14762   "Noise - Vehicle"      69024

Next, the query is executed with the GPU engine. The only difference from the previous cell is that engine="gpu" is passed to .collect():

# Execute query with GPU engine.
t_init = time.time()
df2 = query.collect(engine="gpu")
t_total = time.time() - t_init

print(f"Total runtime using GPU engine: {t_total:,.2f} seconds.\n")
print(f"df2.shape: {df2.shape}.")
df2.head(3)
Total runtime using GPU engine: 0.34 seconds.

df2.shape: (71133, 6).

shape: (3, 6)

year  month  lat_bin  lon_bin  type                   n
i16   i8     i32      i32      str                    u32
2019  9      8138     -14786   "Noise - Residential"  81966
2019  9      8144     -14762   "Illegal Parking"      69024
2019  9      8144     -14762   "Noise - Vehicle"      69024

Using the CPU engine took 6.72 seconds vs. 0.34 seconds for the GPU engine, a roughly 20x speedup. This is an incredible performance gain for a change that amounts to passing a single argument to .collect().

One thing to mention: when engine="gpu" is specified and an operation is not supported on the GPU, the query will silently fall back to CPU execution, which can make benchmarking tricky. To have Polars fail loudly if part of a query cannot be executed on the GPU, we can pass a GPUEngine object in place of "gpu" in the call to .collect(). If raise_on_fail is set to True, any operation without GPU support will cause the entire query to fail. The next cell shows what this looks like (not the failure, but creating a GPUEngine object):

# Fail loudly if can't execute on the GPU.
gpu_engine = pl.GPUEngine(device=0, raise_on_fail=True)
df2 = query.collect(engine=gpu_engine)
df2.head(3)

shape: (3, 6)

year  month  lat_bin  lon_bin  type                   n
i16   i8     i32      i32      str                    u32
2019  9      8138     -14786   "Noise - Residential"  81966
2019  9      8144     -14762   "Illegal Parking"      69024
2019  9      8144     -14762   "Noise - Vehicle"      69024

Since all the operations in our query are GPU supported, no error is thrown.

Another tip for monitoring GPU usage during longer-running jobs: from the terminal, run nvidia-smi -l 1, which refreshes the nvidia-smi output every second. If the GPU is being utilized, the amount of VRAM in use and the GPU-Util percentage should change over time.

GPU-enabled Polars pushes analytical performance to a new level. For large, computation-heavy workloads (joins, group-bys, and sorts across tens or hundreds of millions of rows), the GPU engine can deliver dramatic speedups with nothing more than a change to the engine argument. That said, not everything is supported on GPU yet. Operations like complex datetime handling, regex, or window functions may fall back to CPU, and for smaller or I/O-heavy jobs the extra GPU overhead can actually slow things down. But as datasets continue to grow and GPU coverage expands, the ability to execute entire analytical pipelines directly in VRAM opens the door to running truly large-scale analytics on a single machine, turning what used to require distributed infrastructure into something you can do from your laptop.
