Migrating from Pandas to Polars: A Concise Tutorial

If you’re a data scientist or analyst familiar with Pandas and are considering switching to Polars for its performance benefits, this guide will help you make the transition smoothly. I’ll cover the key differences and provide examples to get you up and running with Polars.

Why Polars?

Polars is a DataFrame library designed for high-performance data manipulation and analysis. It leverages Rust’s speed and efficiency, offering significant performance improvements over Pandas, especially with large datasets.

Installation

First, install Polars using pip:

pip install polars

Basic DataFrame Operations

Creating DataFrames

Pandas:

import pandas as pd

data = {'a': [1, 2, 3], 'b': [4, 5, 6]}
df = pd.DataFrame(data)

Polars:

import polars as pl

data = {'a': [1, 2, 3], 'b': [4, 5, 6]}
df = pl.DataFrame(data)

Viewing Data

Pandas:

print(df.head())

Polars:

print(df.head())

Selecting Columns

Pandas:

df['a']

Polars:

df.select('a')

Filtering Rows

Pandas:

df[df['a'] > 1]

Polars:

df.filter(pl.col('a') > 1)

Adding New Columns

Pandas:

df['c'] = df['a'] + df['b']

Polars:

df = df.with_columns((pl.col('a') + pl.col('b')).alias('c'))

Group By and Aggregation

Pandas:

df.groupby('a').sum()

Polars:

df.group_by('a').agg(pl.col('b').sum())

Advanced Operations

Joining DataFrames

Pandas:

df1 = pd.DataFrame({'key': [1, 2, 3], 'value1': ['a', 'b', 'c']})
df2 = pd.DataFrame({'key': [1, 2, 4], 'value2': ['x', 'y', 'z']})
df_merged = pd.merge(df1, df2, on='key', how='inner')

Polars:

df1 = pl.DataFrame({'key': [1, 2, 3], 'value1': ['a', 'b', 'c']})
df2 = pl.DataFrame({'key': [1, 2, 4], 'value2': ['x', 'y', 'z']})
df_merged = df1.join(df2, on='key', how='inner')

Applying Functions

Pandas:

df['a'].apply(lambda x: x * 2)

Polars:

# map_elements is the current name of the older Expr.apply
df = df.with_columns(pl.col('a').map_elements(lambda x: x * 2, return_dtype=pl.Int64).alias('a'))

Pivot Tables

Pandas:

df.pivot_table(index='a', columns='b', values='c', aggfunc='sum')

Polars:

# Polars 1.0+; earlier releases named the first argument columns= instead of on=
df.pivot(on='b', index='a', values='c', aggregate_function='sum')

Performance Considerations

Polars is designed to be faster and more memory-efficient than Pandas, especially with larger datasets. It achieves this through:

  1. Columnar Storage: Polars uses Arrow’s columnar format, allowing efficient data access and manipulation.
  2. Parallel Execution: Polars performs many operations in parallel, taking advantage of multi-core processors.
  3. Lazy Evaluation: Polars supports lazy evaluation, meaning a query is first described as a whole, then optimized and executed only when you ask for the result.

Example of Lazy Evaluation

Pandas:

df = df[df['a'] > 1].copy()  # copy() avoids a SettingWithCopyWarning on the next line
df['d'] = df['b'] * 2

Polars:

lazy_df = df.lazy()
result = lazy_df.filter(pl.col('a') > 1).with_columns((pl.col('b') * 2).alias('d')).collect()

Conclusion

Transitioning from Pandas to Polars can significantly boost your data processing performance. While the syntax and concepts are similar, Polars offers additional features like lazy evaluation and parallel execution that let it handle larger datasets more efficiently. By understanding these differences and adapting your code accordingly, you can leverage the full power of Polars.

Happy coding! Feel free to reach out to me at matheus.pestana at fgv.br