Pandas vs Polars: where Polars actually wins

Polars is faster. But not always 100x faster, and migration has real costs. Here’s where it actually matters.

Environment

Python 3.12, polars 0.20, pandas 2.1, M2 MacBook Pro 16GB

The benchmark that matters: groupby on 10M rows

import polars as pl
import pandas as pd
import time

# Polars — time only the aggregation; CSV parse is excluded for both libraries
df_pl = pl.read_csv("events_10m.csv")
t0 = time.perf_counter()
result = df_pl.group_by("user_id").agg(pl.col("amount").sum())
print(f"Polars: {time.perf_counter() - t0:.2f}s")

# Pandas — same aggregation, same timing boundary
df_pd = pd.read_csv("events_10m.csv")
t0 = time.perf_counter()
result = df_pd.groupby("user_id")["amount"].sum()
print(f"Pandas: {time.perf_counter() - t0:.2f}s")

Results: Polars 0.4s, Pandas 3.1s. That’s real.

Lazy evaluation: the real advantage

# This doesn't execute until .collect()
result = (
    pl.scan_csv("huge_file.csv")
    .filter(pl.col("status") == "active")
    .group_by("region")
    .agg(pl.col("revenue").sum())
    .collect()
)

Polars pushes the filter down before reading the full file. On a 50GB CSV this is the difference between 2 minutes and 40 seconds.

When to stay with Pandas

  • Existing codebase with heavy sklearn/statsmodels integration
  • Team familiarity > performance gains
  • Files under ~1M rows (difference is milliseconds)

What went wrong

Tried to run a custom Python UDF inside a LazyFrame pipeline and it fell over: Python UDFs execute row by row under the GIL, so Polars can't parallelize them and the optimizer can't see through them. Keep transforms in native Polars expressions.

Checklist

  • Use scan_csv / scan_parquet instead of read_csv / read_parquet for large files
  • Prefer native expressions over .map_elements (Python UDFs)
  • Check .schema early — Polars is strict about types