Slicing / selecting data with Pandas

We go over all the different ways to select parts of a Pandas dataframe

by Max Delsid
Table of Contents

Introduction

Coming from R, selecting parts of a data frame with Pandas was confusing for me because there seemed to be so many different ways to do the same task.

The Pandas cheat sheet is a good reminder to all the possible ways to select data. This article gives a little context to all of them. I will go through the most common cases you will face in a day-to-day data-sciency application.

I will assume that you already know that there are two basic data types in Pandas, the Series and the DataFrame. The first one is a one-dimensional array of values of the same data type, the second one is a two-dimensional array consisting of multiple Series of the same length.

There are three main operators / methods to access parts of a DataFrame:

  • square brackets my_df[...]
  • method my_df.loc[...]
  • method my_df.iloc[...]

And then there is the query method which is also very handy.

In the following we will look at all these methods in detail. Before we do that we load some example data from the scikit-learn library and convert it to a Pandas dataframe.

import pandas as pd

from sklearn import datasets
boston_data = datasets.load_boston()
boston_data.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
df = pd.DataFrame(boston_data['data'], columns=boston_data['feature_names'])
df['MEDV'] = boston_data['target']
df

CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.9824.0
10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.1421.6
20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.0334.7
30.032370.02.180.00.4586.99845.86.06223.0222.018.7394.632.9433.4
40.069050.02.180.00.4587.14754.26.06223.0222.018.7396.905.3336.2
.............................................
5010.062630.011.930.00.5736.59369.12.47861.0273.021.0391.999.6722.4
5020.045270.011.930.00.5736.12076.72.28751.0273.021.0396.909.0820.6
5030.060760.011.930.00.5736.97691.02.16751.0273.021.0396.905.6423.9
5040.109590.011.930.00.5736.79489.32.38891.0273.021.0393.456.4822.0
5050.047410.011.930.00.5736.03080.82.50501.0273.021.0396.907.8811.9

506 rows × 14 columns

Because the difference between the index values and positional values of rows is important (as we will see later on) we change the index values:

df = df.set_index('i_' + df.index.map(str))
df

CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
i_00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.9824.0
i_10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.1421.6
i_20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.0334.7
i_30.032370.02.180.00.4586.99845.86.06223.0222.018.7394.632.9433.4
i_40.069050.02.180.00.4587.14754.26.06223.0222.018.7396.905.3336.2
.............................................
i_5010.062630.011.930.00.5736.59369.12.47861.0273.021.0391.999.6722.4
i_5020.045270.011.930.00.5736.12076.72.28751.0273.021.0396.909.0820.6
i_5030.060760.011.930.00.5736.97691.02.16751.0273.021.0396.905.6423.9
i_5040.109590.011.930.00.5736.79489.32.38891.0273.021.0393.456.4822.0
i_5050.047410.011.930.00.5736.03080.82.50501.0273.021.0396.907.8811.9

506 rows × 14 columns

square brackets

The square brackets operator that is also used in Python for accessing parts of a list, a string or a dict can be used to select both a single or multiple columns and rows of a DataFrame.

To select a single column you have to refer to it by its name:

df['CRIM']
i_0      0.00632
i_1      0.02731
i_2      0.02729
i_3      0.03237
i_4      0.06905
          ...
i_501    0.06263
i_502    0.04527
i_503    0.06076
i_504    0.10959
i_505    0.04741
Name: CRIM, Length: 506, dtype: float64

Notice that the result is a Series. We can check this by typing

type(df['CRIM'])
pandas.core.series.Series

To select multiple columns you refer to them by a list of their names:

df[['CRIM', 'INDUS']]

CRIMINDUS
i_00.006322.31
i_10.027317.07
i_20.027297.07
i_30.032372.18
i_40.069052.18
.........
i_5010.0626311.93
i_5020.0452711.93
i_5030.0607611.93
i_5040.1095911.93
i_5050.0474111.93

506 rows × 2 columns

The result is a DataFrame, even if the list contains only one column name.

We can select rows using square brackets in the same way you can select elements of a list by refering to their positional values. E.g. to select the first three rows, we write:

df[0:3]

CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
i_00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.9824.0
i_10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.1421.6
i_20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.0334.7

This also works by refering to row names, like this:

df['i_0':'i_2']

CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
i_00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.9824.0
i_10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.1421.6
i_20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.0334.7

Notice the slightly different logic (the last element we reference i_2 is included) compared to the slicing method using integers.

We can also leave one side of the colon empty to refer to the first or last row, like this:

df[:'i_2']

CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
i_00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.9824.0
i_10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.1421.6
i_20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.0334.7

Another way to select multiple rows is using a condition. We have to provide a Series or list of type bool with the same length as the number of rows. This is often used to condition on the values of a column in the same DataFrame, like this:

df[df['INDUS'] > 27]

CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
i_4880.150860.027.740.00.6095.45492.71.82094.0711.020.1395.0918.0615.2
i_4890.183370.027.740.00.6095.41498.31.75544.0711.020.1344.0523.977.0
i_4900.207460.027.740.00.6095.09398.01.82264.0711.020.1318.4329.688.1
i_4910.105740.027.740.00.6095.98398.81.86814.0711.020.1390.1118.0713.6
i_4920.111320.027.740.00.6095.98383.52.10994.0711.020.1396.9013.3520.1

As we can see whether we select columns or rows depends on the type of input. Because of this the square brackets operator can be confusing at the beginning. Furthermore, we can’t select specific columns and rows at once, as we might be expect coming from R.

loc method

The loc method expects references by name both to rows and columns.

df.loc['i_2', 'CRIM']
0.02729

To select multiple rows and columns, we put their respective names in a list:

df.loc[['i_0', 'i_2'], ['CRIM', 'INDUS']]

CRIMINDUS
i_00.006322.31
i_20.027297.07

We can refer to a range of rows or columns using a colon, like this:

df.loc['i_4':'i_5', 'CRIM':'INDUS']

CRIMZNINDUS
i_40.069050.02.18
i_50.029850.02.18

Or mix both methods:

df.loc[['i_0','i_2'], 'CRIM':'INDUS']

CRIMZNINDUS
i_00.0063218.02.31
i_20.027290.07.07

If we select only a single row or column the result is a Series:

df.loc['i_5', 'CRIM':'INDUS']
CRIM     0.02985
ZN       0.00000
INDUS    2.18000
Name: i_5, dtype: float64

If we want to select all rows or columns we leave both sides of the colon empty:

df.loc['i_5', :]
CRIM         0.02985
ZN           0.00000
INDUS        2.18000
CHAS         0.00000
NOX          0.45800
RM           6.43000
AGE         58.70000
DIS          6.06220
RAD          3.00000
TAX        222.00000
PTRATIO     18.70000
B          394.12000
LSTAT        5.21000
MEDV        28.70000
Name: i_5, dtype: float64

With the loc method we can also also restrict the selection of rows using a condition:

df.loc[df['INDUS'] > 27, 'CRIM':'CHAS']

CRIMZNINDUSCHAS
i_4880.150860.027.740.0
i_4890.183370.027.740.0
i_4900.207460.027.740.0
i_4910.105740.027.740.0
i_4920.111320.027.740.0

iloc method

Using the iloc method we can refer to rows and columns by their positional value. To get the third row we type

df.iloc[2]
CRIM         0.02729
ZN           0.00000
INDUS        7.07000
CHAS         0.00000
NOX          0.46900
RM           7.18500
AGE         61.10000
DIS          4.96710
RAD          2.00000
TAX        242.00000
PTRATIO     17.80000
B          392.83000
LSTAT        4.03000
MEDV        34.70000
Name: i_2, dtype: float64

The result is a Series. Multiple rows can be selected using a list

df.iloc[[2,4]]

CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
i_20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.0334.7
i_40.069050.02.180.00.4587.14754.26.06223.0222.018.7396.905.3336.2

or, for a range of rows, using the familiar slicing syntax:

df.iloc[2:4]

CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
i_20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.0334.7
i_30.032370.02.180.00.4586.99845.86.06223.0222.018.7394.632.9433.4

The result in both cases is a data frame (even if only one row is selected).

This can be combined with a selection of columns using the respective positional values. To select only the second column we type

df.iloc[2:4, 1]
i_2    0.0
i_3    0.0
Name: ZN, dtype: float64

To select multiple columns we can use a list or the slicing syntax:

df.iloc[2:4, 1:3]

ZNINDUS
i_20.07.07
i_30.02.18

To select all rows or columns we leave sides of the colon empty:

df.iloc[:, 1:3]

ZNINDUS
i_018.02.31
i_10.07.07
i_20.07.07
i_30.02.18
i_40.02.18
.........
i_5010.011.93
i_5020.011.93
i_5030.011.93
i_5040.011.93
i_5050.011.93

506 rows × 2 columns

query method

The query method allows for some very handy selection using an expression like this:

df.query("AGE < 7")

CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
i_410.127440.06.910.00.4486.7702.95.72093.0233.017.9385.414.8426.6
i_420.141500.06.910.00.4486.1696.65.72093.0233.017.9383.375.8125.3
i_430.159360.06.910.00.4486.2116.55.72093.0233.017.9394.467.4424.7
i_700.088260.010.810.00.4136.4176.65.28734.0305.019.2383.736.7224.2
i_730.195390.010.810.00.4136.2456.25.28734.0305.019.2377.177.5423.4
i_740.078960.012.830.00.4376.2736.04.25155.0398.018.7394.926.7824.1
i_2520.0822122.05.860.00.4316.9576.88.90677.0330.019.1386.093.5329.6

We can easily combine multiple conditions,

df.query("AGE < 7 and INDUS == 10.81")

CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
i_700.088260.010.810.00.4136.4176.65.28734.0305.019.2383.736.7224.2
i_730.195390.010.810.00.4136.2456.25.28734.0305.019.2377.177.5423.4

do calculations,

df.query("LSTAT*15 < MEDV")

CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
i_400.0335975.02.950.00.42807.02415.85.40113.0252.018.3395.621.9834.9
i_1611.463360.019.580.00.60507.48990.81.97095.0403.014.7374.431.7350.0
i_1621.833770.019.581.00.60507.80298.22.04075.0403.014.7389.611.9250.0
i_1631.519020.019.581.00.60508.37593.92.16205.0403.014.7388.453.3250.0
i_1950.0138180.00.460.00.42207.87532.05.64844.0255.014.4394.232.9750.0
i_2040.0200995.02.680.00.41618.03431.95.11804.0224.014.7390.552.8850.0
i_2320.575290.06.200.00.50708.33773.33.83848.0307.017.4385.912.4741.7
i_2820.0612920.03.331.00.44297.64549.75.21195.0216.014.9377.073.0146.0
i_2830.0150190.01.211.00.40107.92324.85.88501.0198.013.6395.523.1650.0
i_3684.898220.018.100.00.63104.970100.01.332524.0666.020.2375.523.2650.0
i_3706.538760.018.101.00.63107.01697.51.202424.0666.020.2392.052.9650.0

and even crazy things like this:

df.query("INDUS < RM/1.5 < DIS > RAD*8")

CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
i_2540.0481980.03.640.00.3926.10832.09.22031.0315.016.4392.896.5721.9
i_2550.0354880.03.640.00.3925.87619.19.22031.0315.016.4395.189.2520.9
i_2860.0196580.01.760.00.3856.23031.59.08921.0241.018.2341.6012.9320.1
i_3490.0289940.01.250.00.4296.93934.58.79211.0335.019.7389.855.8926.6
i_3500.0621140.01.250.00.4296.49044.48.79211.0335.019.7396.905.9822.9

To select all rows where a column is equal to several values we can put a list into the expression:

df.query("RAD == [1., 2., 3.]")

CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
i_00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.9824.0
i_10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.1421.6
i_20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.0334.7
i_30.032370.02.180.00.4586.99845.86.06223.0222.018.7394.632.9433.4
i_40.069050.02.180.00.4587.14754.26.06223.0222.018.7396.905.3336.2
.............................................
i_5010.062630.011.930.00.5736.59369.12.47861.0273.021.0391.999.6722.4
i_5020.045270.011.930.00.5736.12076.72.28751.0273.021.0396.909.0820.6
i_5030.060760.011.930.00.5736.97691.02.16751.0273.021.0396.905.6423.9
i_5040.109590.011.930.00.5736.79489.32.38891.0273.021.0393.456.4822.0
i_5050.047410.011.930.00.5736.03080.82.50501.0273.021.0396.907.8811.9

82 rows × 14 columns

This gives us all rows where RAD is either 1, 2, or 3.

We can also refer to a list from the environment using the symbol @:

several_values = [1, 2, 3]
df.query("RAD == @several_values")

CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
i_00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.9824.0
i_10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.1421.6
i_20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.0334.7
i_30.032370.02.180.00.4586.99845.86.06223.0222.018.7394.632.9433.4
i_40.069050.02.180.00.4587.14754.26.06223.0222.018.7396.905.3336.2
.............................................
i_5010.062630.011.930.00.5736.59369.12.47861.0273.021.0391.999.6722.4
i_5020.045270.011.930.00.5736.12076.72.28751.0273.021.0396.909.0820.6
i_5030.060760.011.930.00.5736.97691.02.16751.0273.021.0396.905.6423.9
i_5040.109590.011.930.00.5736.79489.32.38891.0273.021.0393.456.4822.0
i_5050.047410.011.930.00.5736.03080.82.50501.0273.021.0396.907.8811.9

82 rows × 14 columns

Lit! 🔥

Miscellaneous

filter method

The filter method allows for a nifty selection of columns or rows (if we set axis=0) based on a regular expression. E.g. to select columns beginning with the letter C we type

df.filter(regex='^C')

CRIMCHAS
i_00.006320.0
i_10.027310.0
i_20.027290.0
i_30.032370.0
i_40.069050.0
.........
i_5010.062630.0
i_5020.045270.0
i_5030.060760.0
i_5040.109590.0
i_5050.047410.0

506 rows × 2 columns

df.filter(regex='^i_50', axis=0)

CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
i_500.0887321.05.640.00.4395.96345.76.81474.0243.016.8395.5613.4519.7
i_5000.224380.09.690.00.5856.02779.72.49826.0391.019.2396.9014.3316.8
i_5010.062630.011.930.00.5736.59369.12.47861.0273.021.0391.999.6722.4
i_5020.045270.011.930.00.5736.12076.72.28751.0273.021.0396.909.0820.6
i_5030.060760.011.930.00.5736.97691.02.16751.0273.021.0396.905.6423.9
i_5040.109590.011.930.00.5736.79489.32.38891.0273.021.0393.456.4822.0
i_5050.047410.011.930.00.5736.03080.82.50501.0273.021.0396.907.8811.9

Select a column using .

Instead of df['column_name'] we can also select a column typing df.column_name. However, this only works for columns whose names do not coincide with another DataFrame method. Thus, I would discourage selecting columns this way.

df.CRIM[0:5]
i_0    0.00632
i_1    0.02731
i_2    0.02729
i_3    0.03237
i_4    0.06905
Name: CRIM, dtype: float64

at / iat methods

For the special case, we want to access a single value there are the methods at and iat, which are supposed to be faster than loc and iloc.

df.iat[2, 2]
7.07
df.at['i_2', 'INDUS']
7.07

ix method

If you are googling for a Pandas-related question you may find solutions using the method my_df.ix(...) which is deprecated since Pandas 0.20 and not working anymore in 2020.

Resources

Bodo Burger
Bodo Burger
Data Scientist at Mücke Roth & Company

I am a statistician based in Munich, Germany. I hold a Master’s degree in statistics and a Bachelor’s degree in economics.

Related