Slicing / selecting data with Pandas
We go over all the different ways to select parts of a Pandas dataframe
Table of Contents
Introduction
Coming from R, selecting parts of a data frame with Pandas was confusing for me because there seemed to be so many different ways to do the same task.
The Pandas cheat sheet is a good reminder to all the possible ways to select data. This article gives a little context to all of them. I will go through the most common cases you will face in a day-to-day data-sciency application.
I will assume that you already know that there are two basic data types in Pandas,
the Series
and the DataFrame.
The first one is a one-dimensional array of values of the same data type, the second one is a two-dimensional array consisting of multiple Series
of the same length.
There are three main operators / methods to access parts of a DataFrame
:
- square brackets
my_df[...]
- method
my_df.loc[...]
- method
my_df.iloc[...]
And then there is the query
method which is also very handy.
In the following we will look at all these methods in detail. Before we do that we load some example data from the scikit-learn library and convert it to a Pandas dataframe.
import pandas as pd
from sklearn import datasets
boston_data = datasets.load_boston()
boston_data.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
df = pd.DataFrame(boston_data['data'], columns=boston_data['feature_names'])
df['MEDV'] = boston_data['target']
df
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |
506 rows × 14 columns
Because the difference between the index values and positional values of rows is important (as we will see later on) we change the index values:
df = df.set_index('i_' + df.index.map(str))
df
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i_0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
i_1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
i_2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
i_3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
i_4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
i_501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
i_502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
i_503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
i_504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
i_505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |
506 rows × 14 columns
square brackets
The square brackets operator that is also used in Python for accessing parts of a list
, a string
or a dict
can be used to select both a single or multiple columns and rows of a DataFrame
.
To select a single column you have to refer to it by its name:
df['CRIM']
i_0 0.00632
i_1 0.02731
i_2 0.02729
i_3 0.03237
i_4 0.06905
...
i_501 0.06263
i_502 0.04527
i_503 0.06076
i_504 0.10959
i_505 0.04741
Name: CRIM, Length: 506, dtype: float64
Notice that the result is a Series
. We can check this by typing
type(df['CRIM'])
pandas.core.series.Series
To select multiple columns you refer to them by a list of their names:
df[['CRIM', 'INDUS']]
CRIM | INDUS | |
---|---|---|
i_0 | 0.00632 | 2.31 |
i_1 | 0.02731 | 7.07 |
i_2 | 0.02729 | 7.07 |
i_3 | 0.03237 | 2.18 |
i_4 | 0.06905 | 2.18 |
... | ... | ... |
i_501 | 0.06263 | 11.93 |
i_502 | 0.04527 | 11.93 |
i_503 | 0.06076 | 11.93 |
i_504 | 0.10959 | 11.93 |
i_505 | 0.04741 | 11.93 |
506 rows × 2 columns
The result is a DataFrame
, even if the list contains only one column name.
We can select rows using square brackets in the same way you can select elements of a list
by refering to their positional values. E.g. to select the first three rows, we write:
df[0:3]
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i_0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
i_1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
i_2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
This also works by refering to row names, like this:
df['i_0':'i_2']
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i_0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
i_1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
i_2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
Notice the slightly different logic (the last element we reference i_2
is included) compared to the slicing method using integers.
We can also leave one side of the colon empty to refer to the first or last row, like this:
df[:'i_2']
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i_0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
i_1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
i_2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
Another way to select multiple rows is using a condition. We have to provide a Series
or list
of type bool
with the same length as the number of rows. This is often used to condition on the values of a column in the same DataFrame
, like this:
df[df['INDUS'] > 27]
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i_488 | 0.15086 | 0.0 | 27.74 | 0.0 | 0.609 | 5.454 | 92.7 | 1.8209 | 4.0 | 711.0 | 20.1 | 395.09 | 18.06 | 15.2 |
i_489 | 0.18337 | 0.0 | 27.74 | 0.0 | 0.609 | 5.414 | 98.3 | 1.7554 | 4.0 | 711.0 | 20.1 | 344.05 | 23.97 | 7.0 |
i_490 | 0.20746 | 0.0 | 27.74 | 0.0 | 0.609 | 5.093 | 98.0 | 1.8226 | 4.0 | 711.0 | 20.1 | 318.43 | 29.68 | 8.1 |
i_491 | 0.10574 | 0.0 | 27.74 | 0.0 | 0.609 | 5.983 | 98.8 | 1.8681 | 4.0 | 711.0 | 20.1 | 390.11 | 18.07 | 13.6 |
i_492 | 0.11132 | 0.0 | 27.74 | 0.0 | 0.609 | 5.983 | 83.5 | 2.1099 | 4.0 | 711.0 | 20.1 | 396.90 | 13.35 | 20.1 |
As we can see whether we select columns or rows depends on the type of input. Because of this the square brackets operator can be confusing at the beginning. Furthermore, we can’t select specific columns and rows at once, as we might be expect coming from R.
loc
method
The loc
method expects references by name both to rows and columns.
df.loc['i_2', 'CRIM']
0.02729
To select multiple rows and columns, we put their respective names in a list
:
df.loc[['i_0', 'i_2'], ['CRIM', 'INDUS']]
CRIM | INDUS | |
---|---|---|
i_0 | 0.00632 | 2.31 |
i_2 | 0.02729 | 7.07 |
We can refer to a range of rows or columns using a colon, like this:
df.loc['i_4':'i_5', 'CRIM':'INDUS']
CRIM | ZN | INDUS | |
---|---|---|---|
i_4 | 0.06905 | 0.0 | 2.18 |
i_5 | 0.02985 | 0.0 | 2.18 |
Or mix both methods:
df.loc[['i_0','i_2'], 'CRIM':'INDUS']
CRIM | ZN | INDUS | |
---|---|---|---|
i_0 | 0.00632 | 18.0 | 2.31 |
i_2 | 0.02729 | 0.0 | 7.07 |
If we select only a single row or column the result is a Series
:
df.loc['i_5', 'CRIM':'INDUS']
CRIM 0.02985
ZN 0.00000
INDUS 2.18000
Name: i_5, dtype: float64
If we want to select all rows or columns we leave both sides of the colon empty:
df.loc['i_5', :]
CRIM 0.02985
ZN 0.00000
INDUS 2.18000
CHAS 0.00000
NOX 0.45800
RM 6.43000
AGE 58.70000
DIS 6.06220
RAD 3.00000
TAX 222.00000
PTRATIO 18.70000
B 394.12000
LSTAT 5.21000
MEDV 28.70000
Name: i_5, dtype: float64
With the loc
method we can also also restrict the selection of rows using a condition:
df.loc[df['INDUS'] > 27, 'CRIM':'CHAS']
CRIM | ZN | INDUS | CHAS | |
---|---|---|---|---|
i_488 | 0.15086 | 0.0 | 27.74 | 0.0 |
i_489 | 0.18337 | 0.0 | 27.74 | 0.0 |
i_490 | 0.20746 | 0.0 | 27.74 | 0.0 |
i_491 | 0.10574 | 0.0 | 27.74 | 0.0 |
i_492 | 0.11132 | 0.0 | 27.74 | 0.0 |
iloc
method
Using the iloc
method we can refer to rows and columns by their positional value. To get the third row we type
df.iloc[2]
CRIM 0.02729
ZN 0.00000
INDUS 7.07000
CHAS 0.00000
NOX 0.46900
RM 7.18500
AGE 61.10000
DIS 4.96710
RAD 2.00000
TAX 242.00000
PTRATIO 17.80000
B 392.83000
LSTAT 4.03000
MEDV 34.70000
Name: i_2, dtype: float64
The result is a Series
. Multiple rows can be selected using a list
df.iloc[[2,4]]
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i_2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
i_4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
or, for a range of rows, using the familiar slicing syntax:
df.iloc[2:4]
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i_2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
i_3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
The result in both cases is a data frame (even if only one row is selected).
This can be combined with a selection of columns using the respective positional values. To select only the second column we type
df.iloc[2:4, 1]
i_2 0.0
i_3 0.0
Name: ZN, dtype: float64
To select multiple columns we can use a list
or the slicing syntax:
df.iloc[2:4, 1:3]
ZN | INDUS | |
---|---|---|
i_2 | 0.0 | 7.07 |
i_3 | 0.0 | 2.18 |
To select all rows or columns we leave sides of the colon empty:
df.iloc[:, 1:3]
ZN | INDUS | |
---|---|---|
i_0 | 18.0 | 2.31 |
i_1 | 0.0 | 7.07 |
i_2 | 0.0 | 7.07 |
i_3 | 0.0 | 2.18 |
i_4 | 0.0 | 2.18 |
... | ... | ... |
i_501 | 0.0 | 11.93 |
i_502 | 0.0 | 11.93 |
i_503 | 0.0 | 11.93 |
i_504 | 0.0 | 11.93 |
i_505 | 0.0 | 11.93 |
506 rows × 2 columns
query
method
The query
method allows for some very handy selection using an expression like this:
df.query("AGE < 7")
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i_41 | 0.12744 | 0.0 | 6.91 | 0.0 | 0.448 | 6.770 | 2.9 | 5.7209 | 3.0 | 233.0 | 17.9 | 385.41 | 4.84 | 26.6 |
i_42 | 0.14150 | 0.0 | 6.91 | 0.0 | 0.448 | 6.169 | 6.6 | 5.7209 | 3.0 | 233.0 | 17.9 | 383.37 | 5.81 | 25.3 |
i_43 | 0.15936 | 0.0 | 6.91 | 0.0 | 0.448 | 6.211 | 6.5 | 5.7209 | 3.0 | 233.0 | 17.9 | 394.46 | 7.44 | 24.7 |
i_70 | 0.08826 | 0.0 | 10.81 | 0.0 | 0.413 | 6.417 | 6.6 | 5.2873 | 4.0 | 305.0 | 19.2 | 383.73 | 6.72 | 24.2 |
i_73 | 0.19539 | 0.0 | 10.81 | 0.0 | 0.413 | 6.245 | 6.2 | 5.2873 | 4.0 | 305.0 | 19.2 | 377.17 | 7.54 | 23.4 |
i_74 | 0.07896 | 0.0 | 12.83 | 0.0 | 0.437 | 6.273 | 6.0 | 4.2515 | 5.0 | 398.0 | 18.7 | 394.92 | 6.78 | 24.1 |
i_252 | 0.08221 | 22.0 | 5.86 | 0.0 | 0.431 | 6.957 | 6.8 | 8.9067 | 7.0 | 330.0 | 19.1 | 386.09 | 3.53 | 29.6 |
We can easily combine multiple conditions,
df.query("AGE < 7 and INDUS == 10.81")
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i_70 | 0.08826 | 0.0 | 10.81 | 0.0 | 0.413 | 6.417 | 6.6 | 5.2873 | 4.0 | 305.0 | 19.2 | 383.73 | 6.72 | 24.2 |
i_73 | 0.19539 | 0.0 | 10.81 | 0.0 | 0.413 | 6.245 | 6.2 | 5.2873 | 4.0 | 305.0 | 19.2 | 377.17 | 7.54 | 23.4 |
do calculations,
df.query("LSTAT*15 < MEDV")
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i_40 | 0.03359 | 75.0 | 2.95 | 0.0 | 0.4280 | 7.024 | 15.8 | 5.4011 | 3.0 | 252.0 | 18.3 | 395.62 | 1.98 | 34.9 |
i_161 | 1.46336 | 0.0 | 19.58 | 0.0 | 0.6050 | 7.489 | 90.8 | 1.9709 | 5.0 | 403.0 | 14.7 | 374.43 | 1.73 | 50.0 |
i_162 | 1.83377 | 0.0 | 19.58 | 1.0 | 0.6050 | 7.802 | 98.2 | 2.0407 | 5.0 | 403.0 | 14.7 | 389.61 | 1.92 | 50.0 |
i_163 | 1.51902 | 0.0 | 19.58 | 1.0 | 0.6050 | 8.375 | 93.9 | 2.1620 | 5.0 | 403.0 | 14.7 | 388.45 | 3.32 | 50.0 |
i_195 | 0.01381 | 80.0 | 0.46 | 0.0 | 0.4220 | 7.875 | 32.0 | 5.6484 | 4.0 | 255.0 | 14.4 | 394.23 | 2.97 | 50.0 |
i_204 | 0.02009 | 95.0 | 2.68 | 0.0 | 0.4161 | 8.034 | 31.9 | 5.1180 | 4.0 | 224.0 | 14.7 | 390.55 | 2.88 | 50.0 |
i_232 | 0.57529 | 0.0 | 6.20 | 0.0 | 0.5070 | 8.337 | 73.3 | 3.8384 | 8.0 | 307.0 | 17.4 | 385.91 | 2.47 | 41.7 |
i_282 | 0.06129 | 20.0 | 3.33 | 1.0 | 0.4429 | 7.645 | 49.7 | 5.2119 | 5.0 | 216.0 | 14.9 | 377.07 | 3.01 | 46.0 |
i_283 | 0.01501 | 90.0 | 1.21 | 1.0 | 0.4010 | 7.923 | 24.8 | 5.8850 | 1.0 | 198.0 | 13.6 | 395.52 | 3.16 | 50.0 |
i_368 | 4.89822 | 0.0 | 18.10 | 0.0 | 0.6310 | 4.970 | 100.0 | 1.3325 | 24.0 | 666.0 | 20.2 | 375.52 | 3.26 | 50.0 |
i_370 | 6.53876 | 0.0 | 18.10 | 1.0 | 0.6310 | 7.016 | 97.5 | 1.2024 | 24.0 | 666.0 | 20.2 | 392.05 | 2.96 | 50.0 |
and even crazy things like this:
df.query("INDUS < RM/1.5 < DIS > RAD*8")
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i_254 | 0.04819 | 80.0 | 3.64 | 0.0 | 0.392 | 6.108 | 32.0 | 9.2203 | 1.0 | 315.0 | 16.4 | 392.89 | 6.57 | 21.9 |
i_255 | 0.03548 | 80.0 | 3.64 | 0.0 | 0.392 | 5.876 | 19.1 | 9.2203 | 1.0 | 315.0 | 16.4 | 395.18 | 9.25 | 20.9 |
i_286 | 0.01965 | 80.0 | 1.76 | 0.0 | 0.385 | 6.230 | 31.5 | 9.0892 | 1.0 | 241.0 | 18.2 | 341.60 | 12.93 | 20.1 |
i_349 | 0.02899 | 40.0 | 1.25 | 0.0 | 0.429 | 6.939 | 34.5 | 8.7921 | 1.0 | 335.0 | 19.7 | 389.85 | 5.89 | 26.6 |
i_350 | 0.06211 | 40.0 | 1.25 | 0.0 | 0.429 | 6.490 | 44.4 | 8.7921 | 1.0 | 335.0 | 19.7 | 396.90 | 5.98 | 22.9 |
To select all rows where a column is equal to several values we can put a list into the expression:
df.query("RAD == [1., 2., 3.]")
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i_0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
i_1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
i_2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
i_3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
i_4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
i_501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
i_502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
i_503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
i_504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
i_505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |
82 rows × 14 columns
This gives us all rows where RAD
is either 1, 2, or 3.
We can also refer to a list from the environment using the symbol @
:
several_values = [1, 2, 3]
df.query("RAD == @several_values")
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i_0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
i_1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
i_2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
i_3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
i_4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
i_501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
i_502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
i_503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
i_504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
i_505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |
82 rows × 14 columns
Lit! 🔥
Miscellaneous
filter
method
The filter method allows for a nifty selection of columns or rows (if we set axis=0
) based on a regular expression.
E.g. to select columns beginning with the letter C we type
df.filter(regex='^C')
CRIM | CHAS | |
---|---|---|
i_0 | 0.00632 | 0.0 |
i_1 | 0.02731 | 0.0 |
i_2 | 0.02729 | 0.0 |
i_3 | 0.03237 | 0.0 |
i_4 | 0.06905 | 0.0 |
... | ... | ... |
i_501 | 0.06263 | 0.0 |
i_502 | 0.04527 | 0.0 |
i_503 | 0.06076 | 0.0 |
i_504 | 0.10959 | 0.0 |
i_505 | 0.04741 | 0.0 |
506 rows × 2 columns
df.filter(regex='^i_50', axis=0)
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i_50 | 0.08873 | 21.0 | 5.64 | 0.0 | 0.439 | 5.963 | 45.7 | 6.8147 | 4.0 | 243.0 | 16.8 | 395.56 | 13.45 | 19.7 |
i_500 | 0.22438 | 0.0 | 9.69 | 0.0 | 0.585 | 6.027 | 79.7 | 2.4982 | 6.0 | 391.0 | 19.2 | 396.90 | 14.33 | 16.8 |
i_501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
i_502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
i_503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
i_504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
i_505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |
Select a column using .
Instead of df['column_name']
we can also select a column typing df.column_name
.
However, this only works for columns whose names do not coincide with another DataFrame
method. Thus, I would discourage selecting columns this way.
df.CRIM[0:5]
i_0 0.00632
i_1 0.02731
i_2 0.02729
i_3 0.03237
i_4 0.06905
Name: CRIM, dtype: float64
at / iat methods
For the special case, we want to access a single value there are the methods at
and iat
, which are supposed to be faster than loc
and iloc
.
df.iat[2, 2]
7.07
df.at['i_2', 'INDUS']
7.07
ix method
If you are googling for a Pandas-related question you may find solutions using the method
my_df.ix(...)
which is deprecated since Pandas 0.20 and not working anymore in 2020.