
Introduction¶
Coming from R, selecting parts of a data frame with Pandas was confusing for me because there seemed to be so many different ways to do the same task.
The Pandas cheat sheet is a good reminder to all the possible ways to select data. This article gives a little context to these methods as I will go through some common cases you will face in a day-to-day data-sciency application.
I will assume that you already know that there are two basic data structures in Pandas,
the Series
and the DataFrame.
The first one is a one-dimensional array of values of the same data type, the second one is a two-dimensional array consisting of multiple Series of the same length.
There are three main operators / methods to access parts of a DataFrame:
- square brackets
my_df[...] - method
my_df.loc[...] - method
my_df.iloc[...]
And then there is the query method which is also very handy.
In the following we will look at all these methods in detail. But first we load some example data from the scikit-learn library and convert it to a Pandas dataframe.
import pandas as pd
from sklearn import datasets
boston_data = datasets.load_boston()
boston_data.keys()dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])df = pd.DataFrame(boston_data['data'], columns=boston_data['feature_names'])
df['MEDV'] = boston_data['target']
dfBecause the difference between the index values and positional values of rows is important (as we will see later on) we change the index values:
df = df.set_index('i_' + df.index.map(str))
dfsquare brackets¶
The square brackets operator that is also used in Python for accessing parts of a list, a string or a dict can be used to select both a single or multiple columns and rows of a DataFrame.
To select a single column you have to refer to it by its name:
df['CRIM']i_0 0.00632
i_1 0.02731
i_2 0.02729
i_3 0.03237
i_4 0.06905
...
i_501 0.06263
i_502 0.04527
i_503 0.06076
i_504 0.10959
i_505 0.04741
Name: CRIM, Length: 506, dtype: float64Notice that the result is a Series. We can check this by typing
type(df['CRIM'])pandas.core.series.SeriesTo select multiple columns you refer to them by a list of their names:
df[['CRIM', 'INDUS']]The result is a DataFrame, even if the list contains only one column name.
We can select rows using square brackets in the same way you can select elements of a list by refering to their positional values. E.g. to select the first three rows, we write:
df[0:3]This also works by refering to row names, like this:
df['i_0':'i_2']Notice the slightly different logic (the last element we reference i_2 is included) compared to the slicing method using integers.
We can also leave one side of the colon empty to refer to the first or last row, like this:
df[:'i_2']Another way to select multiple rows is using a condition. We have to provide a Series or list of type bool with the same length as the number of rows. This is often used to condition on the values of a column in the same DataFrame, like this:
df[df['INDUS'] > 27]As we can see whether we select columns or rows depends on the type of input. Because of this the square brackets operator can be confusing at the beginning. Furthermore, we can’t select specific columns and rows at once, as we might be expect coming from R.
loc method¶
The loc method expects references by name both to rows and columns.
df.loc['i_2', 'CRIM']0.02729To select multiple rows and columns, we put their respective names in a list:
df.loc[['i_0', 'i_2'], ['CRIM', 'INDUS']]We can refer to a range of rows or columns using a colon, like this:
df.loc['i_4':'i_5', 'CRIM':'INDUS']Or mix both methods:
df.loc[['i_0','i_2'], 'CRIM':'INDUS']If we select only a single row or column the result is a Series:
df.loc['i_5', 'CRIM':'INDUS']CRIM 0.02985
ZN 0.00000
INDUS 2.18000
Name: i_5, dtype: float64If we want to select all rows or columns we leave both sides of the colon empty:
df.loc['i_5', :]CRIM 0.02985
ZN 0.00000
INDUS 2.18000
CHAS 0.00000
NOX 0.45800
RM 6.43000
AGE 58.70000
DIS 6.06220
RAD 3.00000
TAX 222.00000
PTRATIO 18.70000
B 394.12000
LSTAT 5.21000
MEDV 28.70000
Name: i_5, dtype: float64With the loc method we can also also restrict the selection of rows using a condition:
df.loc[df['INDUS'] > 27, 'CRIM':'CHAS']iloc method¶
Using the iloc method we can refer to rows and columns by their positional value. To get the third row we type
df.iloc[2]CRIM 0.02729
ZN 0.00000
INDUS 7.07000
CHAS 0.00000
NOX 0.46900
RM 7.18500
AGE 61.10000
DIS 4.96710
RAD 2.00000
TAX 242.00000
PTRATIO 17.80000
B 392.83000
LSTAT 4.03000
MEDV 34.70000
Name: i_2, dtype: float64The result is a Series. Multiple rows can be selected using a list
df.iloc[[2,4]]or, for a range of rows, using the familiar slicing syntax:
df.iloc[2:4]The result in both cases is a data frame (even if only one row is selected).
This can be combined with a selection of columns using the respective positional values. To select only the second column we type
df.iloc[2:4, 1]i_2 0.0
i_3 0.0
Name: ZN, dtype: float64To select multiple columns we can use a list or the slicing syntax:
df.iloc[2:4, 1:3]To select all rows or columns we leave sides of the colon empty:
df.iloc[:, 1:3]query method¶
The query method allows for some very handy selection using an expression like this:
df.query("AGE < 7")We can easily combine multiple conditions,
df.query("AGE < 7 and INDUS == 10.81")do calculations,
df.query("LSTAT*15 < MEDV")and even crazy things like this:
df.query("INDUS < RM/1.5 < DIS > RAD*8")To select all rows where a column is equal to several values we can put a list into the expression:
df.query("RAD == [1., 2., 3.]")This gives us all rows where RAD is either 1, 2, or 3.
We can also refer to a list from the environment using the symbol @:
several_values = [1, 2, 3]
df.query("RAD == @several_values")Lit! 🔥
Miscellaneous¶
filter method¶
The filter method allows for a nifty selection of columns or rows (if we set axis=0) based on a regular expression.
E.g. to select columns beginning with the letter C we type
df.filter(regex='^C')df.filter(regex='^i_50', axis=0)Select a column using .¶
Instead of df['column_name'] we can also select a column typing df.column_name.
However, this only works for columns whose names do not coincide with another DataFrame method. Thus, I would discourage selecting columns this way.
df.CRIM[0:5]i_0 0.00632
i_1 0.02731
i_2 0.02729
i_3 0.03237
i_4 0.06905
Name: CRIM, dtype: float64at / iat methods¶
For the special case, we want to access a single value there are the methods at and iat, which are supposed to be faster than loc and iloc.
df.iat[2, 2]7.07df.at['i_2', 'INDUS']7.07ix method¶
If you are googling for a Pandas-related question you may find solutions using the method
my_df.ix(...) which is deprecated since Pandas 0.20 and not working anymore in 2020.