Migrating from Pandas to Polars: A Concise Tutorial
If you’re a data scientist or analyst familiar with Pandas and are considering switching to Polars for its performance benefits, this guide will help you make the transition smoothly. I’ll cover the key differences and provide examples to get you up and running with Polars.
Why Polars?
Polars is a DataFrame library designed for high-performance data manipulation and analysis. It leverages Rust’s speed and efficiency, offering significant performance improvements over Pandas, especially with large datasets.
Installation
First, install Polars using pip:
pip install polars
Basic DataFrame Operations
Creating DataFrames
Pandas:
import pandas as pd
data = {'a': [1, 2, 3], 'b': [4, 5, 6]}
df = pd.DataFrame(data)
Polars:
import polars as pl
data = {'a': [1, 2, 3], 'b': [4, 5, 6]}
df = pl.DataFrame(data)
Viewing Data
Pandas:
print(df.head())
Polars:
print(df.head())
Selecting Columns
Pandas:
df['a']
Polars:
df.select('a')
Filtering Rows
Pandas:
df[df['a'] > 1]
Polars:
df.filter(pl.col('a') > 1)
Adding New Columns
Pandas:
df['c'] = df['a'] + df['b']
Polars:
df = df.with_columns((pl.col('a') + pl.col('b')).alias('c'))
Group By and Aggregation
Pandas:
df.groupby('a').sum()
Polars:
# group_by is the current spelling; releases before 0.19 used groupby
df.group_by('a').agg(pl.sum('b'))
Advanced Operations
Joining DataFrames
Pandas:
df1 = pd.DataFrame({'key': [1, 2, 3], 'value1': ['a', 'b', 'c']})
df2 = pd.DataFrame({'key': [1, 2, 4], 'value2': ['x', 'y', 'z']})
df_merged = pd.merge(df1, df2, on='key', how='inner')
Polars:
df1 = pl.DataFrame({'key': [1, 2, 3], 'value1': ['a', 'b', 'c']})
df2 = pl.DataFrame({'key': [1, 2, 4], 'value2': ['x', 'y', 'z']})
df_merged = df1.join(df2, on='key', how='inner')
Applying Functions
Pandas:
df['a'].apply(lambda x: x * 2)
Polars:
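# map_elements is the current name for the element-wise apply (renamed in Polars 0.19)
df = df.with_columns(pl.col('a').map_elements(lambda x: x * 2, return_dtype=pl.Int64).alias('a'))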
Pivot Tables
Pandas:
df.pivot_table(index='a', columns='b', values='c', aggfunc='sum')
Polars:
# Polars 1.0+ signature; earlier releases used columns= instead of on=
df.pivot(on='b', index='a', values='c', aggregate_function='sum')
Performance Considerations
Polars is designed to be faster and more memory-efficient than Pandas, especially with larger datasets. It achieves this through:
- Columnar Storage: Polars uses Arrow’s columnar format, allowing efficient data access and manipulation.
- Parallel Execution: Polars performs many operations in parallel, taking advantage of multi-core processors.
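- Lazy Evaluation: Polars supports lazy evaluation, meaning a chain of operations is first recorded as a query plan, optimized as a whole (for example by pushing filters closer to the data source), and executed only when you call collect().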
Example of Lazy Evaluation
Pandas:
df = df[df['a'] > 1].copy()  # copy() avoids SettingWithCopyWarning on the next line
df['d'] = df['b'] * 2
Polars:
lazy_df = df.lazy()
result = lazy_df.filter(pl.col('a') > 1).with_columns((pl.col('b') * 2).alias('d')).collect()
Conclusion
Transitioning from Pandas to Polars can significantly boost your data processing performance. While the syntax and concepts are similar, Polars offers additional features like lazy evaluation and parallel execution that can handle larger datasets more efficiently. By understanding these differences and adapting your code accordingly, you can leverage the full power of Polars.
Happy coding! Feel free to reach out to me at matheus.pestana at fgv.br