Skip to article frontmatterSkip to article content

Selecting and slicing data

Introduction

Coming from R, selecting parts of a data frame with Pandas was confusing for me because there seemed to be so many different ways to do the same task.

The Pandas cheat sheet is a good reminder to all the possible ways to select data. This article gives a little context to these methods as I will go through some common cases you will face in a day-to-day data-sciency application.

I will assume that you already know that there are two basic data structures in Pandas, the Series and the DataFrame. The first one is a one-dimensional array of values of the same data type, the second one is a two-dimensional array consisting of multiple Series of the same length. There are three main operators / methods to access parts of a DataFrame:

  • square brackets my_df[...]
  • method my_df.loc[...]
  • method my_df.iloc[...]

And then there is the query method which is also very handy.

In the following we will look at all these methods in detail. But first we load some example data from the scikit-learn library and convert it to a Pandas dataframe.

import pandas as pd

from sklearn import datasets
boston_data = datasets.load_boston()
boston_data.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
df = pd.DataFrame(boston_data['data'], columns=boston_data['feature_names'])
df['MEDV'] = boston_data['target']
df
Loading...

Because the difference between the index values and positional values of rows is important (as we will see later on) we change the index values:

df = df.set_index('i_' + df.index.map(str))
df
Loading...

square brackets

The square brackets operator that is also used in Python for accessing parts of a list, a string or a dict can be used to select both a single or multiple columns and rows of a DataFrame.

To select a single column you have to refer to it by its name:

df['CRIM']
i_0 0.00632 i_1 0.02731 i_2 0.02729 i_3 0.03237 i_4 0.06905 ... i_501 0.06263 i_502 0.04527 i_503 0.06076 i_504 0.10959 i_505 0.04741 Name: CRIM, Length: 506, dtype: float64

Notice that the result is a Series. We can check this by typing

type(df['CRIM'])
pandas.core.series.Series

To select multiple columns you refer to them by a list of their names:

df[['CRIM', 'INDUS']]
Loading...

The result is a DataFrame, even if the list contains only one column name.

We can select rows using square brackets in the same way you can select elements of a list by refering to their positional values. E.g. to select the first three rows, we write:

df[0:3]
Loading...

This also works by refering to row names, like this:

df['i_0':'i_2']
Loading...

Notice the slightly different logic (the last element we reference i_2 is included) compared to the slicing method using integers.

We can also leave one side of the colon empty to refer to the first or last row, like this:

df[:'i_2']
Loading...

Another way to select multiple rows is using a condition. We have to provide a Series or list of type bool with the same length as the number of rows. This is often used to condition on the values of a column in the same DataFrame, like this:

df[df['INDUS'] > 27]
Loading...

As we can see whether we select columns or rows depends on the type of input. Because of this the square brackets operator can be confusing at the beginning. Furthermore, we can’t select specific columns and rows at once, as we might be expect coming from R.

loc method

The loc method expects references by name both to rows and columns.

df.loc['i_2', 'CRIM']
0.02729

To select multiple rows and columns, we put their respective names in a list:

df.loc[['i_0', 'i_2'], ['CRIM', 'INDUS']]
Loading...

We can refer to a range of rows or columns using a colon, like this:

df.loc['i_4':'i_5', 'CRIM':'INDUS']
Loading...

Or mix both methods:

df.loc[['i_0','i_2'], 'CRIM':'INDUS']
Loading...

If we select only a single row or column the result is a Series:

df.loc['i_5', 'CRIM':'INDUS']
CRIM 0.02985 ZN 0.00000 INDUS 2.18000 Name: i_5, dtype: float64

If we want to select all rows or columns we leave both sides of the colon empty:

df.loc['i_5', :]
CRIM 0.02985 ZN 0.00000 INDUS 2.18000 CHAS 0.00000 NOX 0.45800 RM 6.43000 AGE 58.70000 DIS 6.06220 RAD 3.00000 TAX 222.00000 PTRATIO 18.70000 B 394.12000 LSTAT 5.21000 MEDV 28.70000 Name: i_5, dtype: float64

With the loc method we can also also restrict the selection of rows using a condition:

df.loc[df['INDUS'] > 27, 'CRIM':'CHAS']
Loading...

iloc method

Using the iloc method we can refer to rows and columns by their positional value. To get the third row we type

df.iloc[2]
CRIM 0.02729 ZN 0.00000 INDUS 7.07000 CHAS 0.00000 NOX 0.46900 RM 7.18500 AGE 61.10000 DIS 4.96710 RAD 2.00000 TAX 242.00000 PTRATIO 17.80000 B 392.83000 LSTAT 4.03000 MEDV 34.70000 Name: i_2, dtype: float64

The result is a Series. Multiple rows can be selected using a list

df.iloc[[2,4]]
Loading...

or, for a range of rows, using the familiar slicing syntax:

df.iloc[2:4]
Loading...

The result in both cases is a data frame (even if only one row is selected).

This can be combined with a selection of columns using the respective positional values. To select only the second column we type

df.iloc[2:4, 1]
i_2 0.0 i_3 0.0 Name: ZN, dtype: float64

To select multiple columns we can use a list or the slicing syntax:

df.iloc[2:4, 1:3]
Loading...

To select all rows or columns we leave sides of the colon empty:

df.iloc[:, 1:3]
Loading...

query method

The query method allows for some very handy selection using an expression like this:

df.query("AGE < 7")
Loading...

We can easily combine multiple conditions,

df.query("AGE < 7 and INDUS == 10.81")
Loading...

do calculations,

df.query("LSTAT*15 < MEDV")
Loading...

and even crazy things like this:

df.query("INDUS < RM/1.5 < DIS > RAD*8")
Loading...

To select all rows where a column is equal to several values we can put a list into the expression:

df.query("RAD == [1., 2., 3.]")
Loading...

This gives us all rows where RAD is either 1, 2, or 3.

We can also refer to a list from the environment using the symbol @:

several_values = [1, 2, 3]
df.query("RAD == @several_values")
Loading...

Lit! 🔥

Miscellaneous

filter method

The filter method allows for a nifty selection of columns or rows (if we set axis=0) based on a regular expression. E.g. to select columns beginning with the letter C we type

df.filter(regex='^C')
Loading...
df.filter(regex='^i_50', axis=0)
Loading...

Select a column using .

Instead of df['column_name'] we can also select a column typing df.column_name. However, this only works for columns whose names do not coincide with another DataFrame method. Thus, I would discourage selecting columns this way.

df.CRIM[0:5]
i_0 0.00632 i_1 0.02731 i_2 0.02729 i_3 0.03237 i_4 0.06905 Name: CRIM, dtype: float64

at / iat methods

For the special case, we want to access a single value there are the methods at and iat, which are supposed to be faster than loc and iloc.

df.iat[2, 2]
7.07
df.at['i_2', 'INDUS']
7.07

ix method

If you are googling for a Pandas-related question you may find solutions using the method my_df.ix(...) which is deprecated since Pandas 0.20 and not working anymore in 2020.