
Selecting and manipulating data with Polars

by Igor Miske
import polars as pl
import sklearn.datasets

Alternative datasets include the California housing dataset and the Ames housing dataset. The California housing dataset can be loaded with:

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

and the Ames housing dataset with:

from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True)

ames_data = sklearn.datasets.fetch_openml("house_prices", as_frame=True)
ames_data.keys()
dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])
cal_data = sklearn.datasets.fetch_california_housing()
cal_data.keys()
dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])
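# Build a Polars DataFrame from the feature matrix, naming the columns after the features
df = pl.from_numpy(cal_data["data"], schema=cal_data["feature_names"])
# Append the target values as an extra column named after the target
df = df.with_columns(
    pl.Series(cal_data["target"]).alias(cal_data["target_names"][0]),
)
df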

Data selection

Columns

df.select()
df.with_columns()

Rows

df.slice(2000)
df.filter()

Combination

Manipulating data

Addendum

DataFrame or LazyFrame? Lazy operations?

Using square brackets

Expressions are the recommended way to select and slice data.

Square brackets can still select rows and columns by name or position, but they do not accept expressions, as the final example below shows:

df["MedInc"]
df[["MedInc", "Population"]]
df[0:2, ["MedInc", "Population"]]
df[pl.col("Population") > 350]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/miniforge3/envs/datascience/lib/python3.13/site-packages/polars/_utils/getitem.py:167, in get_df_item_by_key(df, key)
    166 try:
--> 167     return _select_rows(df, key)  # type: ignore[arg-type]
    168 except TypeError:

File ~/miniforge3/envs/datascience/lib/python3.13/site-packages/polars/_utils/getitem.py:328, in _select_rows(df, key)
    327 msg = f"cannot select rows using key of type {qualified_type_name(key)!r}: {key!r}"
--> 328 raise TypeError(msg)

TypeError: cannot select rows using key of type 'Expr': <Expr ['[(col("Population")) > (dyn in…'] at 0x706C9B088C50>

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In[17], line 1
----> 1 df[pl.col("Population") > 350]

File ~/miniforge3/envs/datascience/lib/python3.13/site-packages/polars/dataframe/frame.py:1395, in DataFrame.__getitem__(self, key)
   1258 def __getitem__(
   1259     self,
   1260     key: (
   (...)   1269     ),
   1270 ) -> DataFrame | Series | Any:
   1271     """
   1272     Get part of the DataFrame as a new DataFrame, Series, or scalar.
   1273 
   (...)   1393     └─────┴─────┴─────┘
   1394     """
-> 1395     return get_df_item_by_key(self, key)

File ~/miniforge3/envs/datascience/lib/python3.13/site-packages/polars/_utils/getitem.py:169, in get_df_item_by_key(df, key)
    167     return _select_rows(df, key)  # type: ignore[arg-type]
    168 except TypeError:
--> 169     return _select_columns(df, key)

File ~/miniforge3/envs/datascience/lib/python3.13/site-packages/polars/_utils/getitem.py:260, in _select_columns(df, key)
    255         raise TypeError(msg)
    257 msg = (
    258     f"cannot select columns using key of type {qualified_type_name(key)!r}: {key!r}"
    259 )
--> 260 raise TypeError(msg)

TypeError: cannot select columns using key of type 'Expr': <Expr ['[(col("Population")) > (dyn in…'] at 0x706C9B088C50>