MENU

Python Pandas Pro – Session Two – Selection on Data Frames

This article was first published on Python – Hutsons-hacks , and kindly contributed to python-bloggers. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

In the previous post we worked with a custom data set we had created for the purposes of demonstration. This time we are going to work with the gapminder dataset.

Viewing the gapminder dataset

To initialise the gapminder dataset in my project I will use the below import statements to prepare the project ready for the demonstration:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
from gapminder import gapminder as gp
import pandas as pd
import numpy as np
# Take a copy of the data frame
df = gp.copy()
print(df.head(10))
from gapminder import gapminder as gp import pandas as pd import numpy as np # Take a copy of the data frame df = gp.copy() print(df.head(10))
from gapminder import gapminder as gp
import pandas as pd
import numpy as np
# Take a copy of the data frame
df = gp.copy()
print(df.head(10))

This code uses the gapminder package using from and imports only the relevant information i.e. the gapminder data and this is aliased as gp.

Then, as in the previous example we import pandas and numpy for working in the project.

The final step is to take a copy of the data and then we print the head of the dataset for inspection:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106
5 Afghanistan Asia 1977 38.438 14880372 786.113360
6 Afghanistan Asia 1982 39.854 12881816 978.011439
7 Afghanistan Asia 1987 40.822 13867957 852.395945
8 Afghanistan Asia 1992 41.674 16317921 649.341395
9 Afghanistan Asia 1997 41.763 22227415 635.341351
country continent year lifeExp pop gdpPercap 0 Afghanistan Asia 1952 28.801 8425333 779.445314 1 Afghanistan Asia 1957 30.332 9240934 820.853030 2 Afghanistan Asia 1962 31.997 10267083 853.100710 3 Afghanistan Asia 1967 34.020 11537966 836.197138 4 Afghanistan Asia 1972 36.088 13079460 739.981106 5 Afghanistan Asia 1977 38.438 14880372 786.113360 6 Afghanistan Asia 1982 39.854 12881816 978.011439 7 Afghanistan Asia 1987 40.822 13867957 852.395945 8 Afghanistan Asia 1992 41.674 16317921 649.341395 9 Afghanistan Asia 1997 41.763 22227415 635.341351
       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106
5  Afghanistan      Asia  1977   38.438  14880372  786.113360
6  Afghanistan      Asia  1982   39.854  12881816  978.011439
7  Afghanistan      Asia  1987   40.822  13867957  852.395945
8  Afghanistan      Asia  1992   41.674  16317921  649.341395
9  Afghanistan      Asia  1997   41.763  22227415  635.341351

The dataset contains the population movements against life expectancy and the gross domestic product as a percentage of the population. To sort the data by the highest population the below functionality could be added:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
#Sort by value
print(df.sort_values(by='pop', ascending=False))
#Sort by value print(df.sort_values(by='pop', ascending=False))
#Sort by value
print(df.sort_values(by='pop', ascending=False))

To sort by multiple columns this can simply be extended by using a list in Python, these always have square brackets:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
#Sort by value
print(df.sort_values(by=['pop', 'year'], ascending=False))
#Sort by value print(df.sort_values(by=['pop', 'year'], ascending=False))
#Sort by value
print(df.sort_values(by=['pop', 'year'], ascending=False))

The output from the model shows it has been successful in sorting by the sort values specified:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
country continent year lifeExp pop gdpPercap
299 China Asia 2007 72.961 1318683096 4959.114854
298 China Asia 2002 72.028 1280400000 3119.280896
297 China Asia 1997 70.426 1230075000 2289.234136
296 China Asia 1992 68.690 1164970000 1655.784158
707 India Asia 2007 64.698 1110396331 2452.210407
... ... ... ... ... ... ...
1299 Sao Tome and Principe Africa 1967 54.425 70787 1384.840593
1298 Sao Tome and Principe Africa 1962 51.893 65345 1071.551119
420 Djibouti Africa 1952 34.812 63149 2669.529475
1297 Sao Tome and Principe Africa 1957 48.945 61325 860.736903
1296 Sao Tome and Principe Africa 1952 46.471 60011 879.583586
country continent year lifeExp pop gdpPercap 299 China Asia 2007 72.961 1318683096 4959.114854 298 China Asia 2002 72.028 1280400000 3119.280896 297 China Asia 1997 70.426 1230075000 2289.234136 296 China Asia 1992 68.690 1164970000 1655.784158 707 India Asia 2007 64.698 1110396331 2452.210407 ... ... ... ... ... ... ... 1299 Sao Tome and Principe Africa 1967 54.425 70787 1384.840593 1298 Sao Tome and Principe Africa 1962 51.893 65345 1071.551119 420 Djibouti Africa 1952 34.812 63149 2669.529475 1297 Sao Tome and Principe Africa 1957 48.945 61325 860.736903 1296 Sao Tome and Principe Africa 1952 46.471 60011 879.583586
                    country continent  year  lifeExp         pop    gdpPercap
299                   China      Asia  2007   72.961  1318683096  4959.114854
298                   China      Asia  2002   72.028  1280400000  3119.280896
297                   China      Asia  1997   70.426  1230075000  2289.234136
296                   China      Asia  1992   68.690  1164970000  1655.784158
707                   India      Asia  2007   64.698  1110396331  2452.210407
...                     ...       ...   ...      ...         ...          ...
1299  Sao Tome and Principe    Africa  1967   54.425       70787  1384.840593
1298  Sao Tome and Principe    Africa  1962   51.893       65345  1071.551119
420                Djibouti    Africa  1952   34.812       63149  2669.529475
1297  Sao Tome and Principe    Africa  1957   48.945       61325   860.736903
1296  Sao Tome and Principe    Africa  1952   46.471       60011   879.583586

Selection – getting stuff from data frames

The below two subsections will show how to get items from a data frame and then how you can also set items.

Selecting by column

To select a specific column(s) you can use the following syntax:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
# Selection by 1 x column
print(df['pop'])
# Selection by more than 1 column
print(df[['country', 'pop']])
# Selection by 1 x column print(df['pop']) # Selection by more than 1 column print(df[['country', 'pop']])
# Selection by 1 x column
print(df['pop'])
# Selection by more than 1 column
print(df[['country', 'pop']])

The first line will print out the pop column, whereas the double brackets will select both country and population:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
country pop
0 Afghanistan 8425333
1 Afghanistan 9240934
2 Afghanistan 10267083
3 Afghanistan 11537966
4 Afghanistan 13079460
... ... ...
1699 Zimbabwe 9216418
1700 Zimbabwe 10704340
1701 Zimbabwe 11404948
1702 Zimbabwe 11926563
1703 Zimbabwe 12311143
country pop 0 Afghanistan 8425333 1 Afghanistan 9240934 2 Afghanistan 10267083 3 Afghanistan 11537966 4 Afghanistan 13079460 ... ... ... 1699 Zimbabwe 9216418 1700 Zimbabwe 10704340 1701 Zimbabwe 11404948 1702 Zimbabwe 11926563 1703 Zimbabwe 12311143
          country       pop
0     Afghanistan   8425333
1     Afghanistan   9240934
2     Afghanistan  10267083
3     Afghanistan  11537966
4     Afghanistan  13079460
...           ...       ...
1699     Zimbabwe   9216418
1700     Zimbabwe  10704340
1701     Zimbabwe  11404948
1702     Zimbabwe  11926563
1703     Zimbabwe  12311143

Slicing rows

To slice rows from the data frame it is as simple as the below:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
# Slicing rows
print(df[0:10])
# Slicing rows print(df[0:10])
# Slicing rows
print(df[0:10])

The results in the first 10 rows being sliced, as Python has a zero based indexing system, like most languages:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106
5 Afghanistan Asia 1977 38.438 14880372 786.113360
6 Afghanistan Asia 1982 39.854 12881816 978.011439
7 Afghanistan Asia 1987 40.822 13867957 852.395945
8 Afghanistan Asia 1992 41.674 16317921 649.341395
9 Afghanistan Asia 1997 41.763 22227415 635.341351
country continent year lifeExp pop gdpPercap 0 Afghanistan Asia 1952 28.801 8425333 779.445314 1 Afghanistan Asia 1957 30.332 9240934 820.853030 2 Afghanistan Asia 1962 31.997 10267083 853.100710 3 Afghanistan Asia 1967 34.020 11537966 836.197138 4 Afghanistan Asia 1972 36.088 13079460 739.981106 5 Afghanistan Asia 1977 38.438 14880372 786.113360 6 Afghanistan Asia 1982 39.854 12881816 978.011439 7 Afghanistan Asia 1987 40.822 13867957 852.395945 8 Afghanistan Asia 1992 41.674 16317921 649.341395 9 Afghanistan Asia 1997 41.763 22227415 635.341351
       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106
5  Afghanistan      Asia  1977   38.438  14880372  786.113360
6  Afghanistan      Asia  1982   39.854  12881816  978.011439
7  Afghanistan      Asia  1987   40.822  13867957  852.395945
8  Afghanistan      Asia  1992   41.674  16317921  649.341395
9  Afghanistan      Asia  1997   41.763  22227415  635.341351

Selecting by label

The driving function here is to use the loc keyword to select by location in the data frame. This implementation shows how to return all rows and only matching columns:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
# Selection by label
print(df.loc[:, ['lifeExp', 'pop']])
# Selection by label print(df.loc[:, ['lifeExp', 'pop']])
# Selection by label
print(df.loc[:, ['lifeExp', 'pop']])

Outputting:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
lifeExp pop
0 28.801 8425333
1 30.332 9240934
2 31.997 10267083
3 34.020 11537966
4 36.088 13079460
... ... ...
1699 62.351 9216418
1700 60.377 10704340
1701 46.809 11404948
1702 39.989 11926563
1703 43.487 12311143
lifeExp pop 0 28.801 8425333 1 30.332 9240934 2 31.997 10267083 3 34.020 11537966 4 36.088 13079460 ... ... ... 1699 62.351 9216418 1700 60.377 10704340 1701 46.809 11404948 1702 39.989 11926563 1703 43.487 12311143
      lifeExp       pop
0      28.801   8425333
1      30.332   9240934
2      31.997  10267083
3      34.020  11537966
4      36.088  13079460
...       ...       ...
1699   62.351   9216418
1700   60.377  10704340
1701   46.809  11404948
1702   39.989  11926563
1703   43.487  12311143

The : operator says select all rows and only those matching columns in the df.loc command.

To specify both rows and columns to return, you can adapt the statement slightly to easily achieve that:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
print(df.loc[0:10, ['lifeExp', 'pop']])
print(df.loc[0:10, ['lifeExp', 'pop']])
print(df.loc[0:10, ['lifeExp', 'pop']])

This outputs:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
lifeExp pop
0 28.801 8425333
1 30.332 9240934
2 31.997 10267083
3 34.020 11537966
4 36.088 13079460
5 38.438 14880372
6 39.854 12881816
7 40.822 13867957
8 41.674 16317921
9 41.763 22227415
10 42.129 25268405
lifeExp pop 0 28.801 8425333 1 30.332 9240934 2 31.997 10267083 3 34.020 11537966 4 36.088 13079460 5 38.438 14880372 6 39.854 12881816 7 40.822 13867957 8 41.674 16317921 9 41.763 22227415 10 42.129 25268405
    lifeExp       pop
0    28.801   8425333
1    30.332   9240934
2    31.997  10267083
3    34.020  11537966
4    36.088  13079460
5    38.438  14880372
6    39.854  12881816
7    40.822  13867957
8    41.674  16317921
9    41.763  22227415
10   42.129  25268405

For getting a scalar variable, you would use:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
# For getting a scalar value
print(df.loc[dates[0], 'pop'])
# For getting a scalar value print(df.loc[dates[0], 'pop'])
# For getting a scalar value
print(df.loc[dates[0], 'pop'])

That would get the first date index on the row.

Selection by position

To select by position you can use the iloc keyword.

Select single column

This below code block will show how to slice by index position:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
# Selection by position
print(df.iloc[3])
# Selection by position print(df.iloc[3])
# Selection by position
print(df.iloc[3])

This outputs:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
country Colombia
continent Americas
year 1952
lifeExp 50.643
pop 12350771
gdpPercap 2144.12
Name: 300, dtype: object
country Colombia continent Americas year 1952 lifeExp 50.643 pop 12350771 gdpPercap 2144.12 Name: 300, dtype: object
country      Colombia
continent    Americas
year             1952
lifeExp        50.643
pop          12350771
gdpPercap     2144.12
Name: 300, dtype: object

Select multiple elements – rows and columns

This method uses integer slicing to select rows and columns:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
#Integer slicing
print(df.iloc[3:5, 0:2])
#Integer slicing print(df.iloc[3:5, 0:2])
#Integer slicing
print(df.iloc[3:5, 0:2]) 

Outputting:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
country continent
3 Afghanistan Asia
4 Afghanistan Asia
country continent 3 Afghanistan Asia 4 Afghanistan Asia
       country continent
3  Afghanistan      Asia
4  Afghanistan      Asia

You can achieve a similar result using numpy lists:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
print(df.iloc[[1, 2, 4], [0, 2]])
print(df.iloc[[1, 2, 4], [0, 2]])
print(df.iloc[[1, 2, 4], [0, 2]])

Slicing rows and columns explicitly

To achieve this, we make use of the magic : operator in Python:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
print(df.iloc[1:9, :]
print(df.iloc[1:9, :]
print(df.iloc[1:9, :]

Hey presto:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
country continent year lifeExp pop gdpPercap
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106
5 Afghanistan Asia 1977 38.438 14880372 786.113360
6 Afghanistan Asia 1982 39.854 12881816 978.011439
7 Afghanistan Asia 1987 40.822 13867957 852.395945
8 Afghanistan Asia 1992 41.674 16317921 649.341395
country continent year lifeExp pop gdpPercap 1 Afghanistan Asia 1957 30.332 9240934 820.853030 2 Afghanistan Asia 1962 31.997 10267083 853.100710 3 Afghanistan Asia 1967 34.020 11537966 836.197138 4 Afghanistan Asia 1972 36.088 13079460 739.981106 5 Afghanistan Asia 1977 38.438 14880372 786.113360 6 Afghanistan Asia 1982 39.854 12881816 978.011439 7 Afghanistan Asia 1987 40.822 13867957 852.395945 8 Afghanistan Asia 1992 41.674 16317921 649.341395
       country continent  year  lifeExp       pop   gdpPercap
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106
5  Afghanistan      Asia  1977   38.438  14880372  786.113360
6  Afghanistan      Asia  1982   39.854  12881816  978.011439
7  Afghanistan      Asia  1987   40.822  13867957  852.395945
8  Afghanistan      Asia  1992   41.674  16317921  649.341395

To perform this for columns we just reverse the position in the list:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
print(df.iloc[:, 1:3]) #Slicing columns explicitly
print(df.iloc[:, 1:3]) #Slicing columns explicitly
print(df.iloc[:, 1:3]) #Slicing columns explicitly

This then selects just the first 3 rows you require:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
continent year lifeExp
0 Asia 1952 28.801
1 Asia 1957 30.332
2 Asia 1962 31.997
3 Asia 1967 34.020
4 Asia 1972 36.088
... ... ... ...
1699 Africa 1987 62.351
1700 Africa 1992 60.377
1701 Africa 1997 46.809
1702 Africa 2002 39.989
1703 Africa 2007 43.487
continent year lifeExp 0 Asia 1952 28.801 1 Asia 1957 30.332 2 Asia 1962 31.997 3 Asia 1967 34.020 4 Asia 1972 36.088 ... ... ... ... 1699 Africa 1987 62.351 1700 Africa 1992 60.377 1701 Africa 1997 46.809 1702 Africa 2002 39.989 1703 Africa 2007 43.487
     continent  year  lifeExp
0         Asia  1952   28.801
1         Asia  1957   30.332
2         Asia  1962   31.997
3         Asia  1967   34.020
4         Asia  1972   36.088
...        ...   ...      ...
1699    Africa  1987   62.351
1700    Africa  1992   60.377
1701    Africa  1997   46.809
1702    Africa  2002   39.989
1703    Africa  2007   43.487

Select a value explicitly

To do this you need to specify a single row and column index:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
print(df.iloc[1, 1]) #Getting a value explicitly
print(df.iloc[1, 1]) #Getting a value explicitly
print(df.iloc[1, 1]) #Getting a value explicitly

This returns just one record:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
Asia
Asia
Asia

To undertake this, giving faster reload, you can use the iat function:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
print(df.iat[1, 1]) #Fast access to a scalar explicitly
print(df.iat[1, 1]) #Fast access to a scalar explicitly
print(df.iat[1, 1]) #Fast access to a scalar explicitly

Boolean indexing

To filter the populations greater than a certain value, we can use boolean indexing:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
print(df[df['pop'] > 30000000]) #Using operators to subset
print(df[df['pop'] > 30000000]) #Using operators to subset
print(df[df['pop'] > 30000000]) #Using operators to subset

This gives you all the records that are greater than the selected value. To extend this you could use or and and operators. This outputs:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
country continent year lifeExp pop gdpPercap
11 Afghanistan Asia 2007 43.828 31889923 974.580338
34 Algeria Africa 2002 70.994 31287142 5288.040382
35 Algeria Africa 2007 72.301 33333216 6223.367465
55 Argentina Americas 1987 70.774 31620918 9139.671389
56 Argentina Americas 1992 71.868 33958947 9308.418710
... ... ... ... ... ... ...
1651 Vietnam Asia 1987 62.820 62826491 820.799445
1652 Vietnam Asia 1992 67.662 69940728 989.023149
1653 Vietnam Asia 1997 70.672 76048996 1385.896769
1654 Vietnam Asia 2002 73.017 80908147 1764.456677
1655 Vietnam Asia 2007 74.249 85262356 2441.576404
country continent year lifeExp pop gdpPercap 11 Afghanistan Asia 2007 43.828 31889923 974.580338 34 Algeria Africa 2002 70.994 31287142 5288.040382 35 Algeria Africa 2007 72.301 33333216 6223.367465 55 Argentina Americas 1987 70.774 31620918 9139.671389 56 Argentina Americas 1992 71.868 33958947 9308.418710 ... ... ... ... ... ... ... 1651 Vietnam Asia 1987 62.820 62826491 820.799445 1652 Vietnam Asia 1992 67.662 69940728 989.023149 1653 Vietnam Asia 1997 70.672 76048996 1385.896769 1654 Vietnam Asia 2002 73.017 80908147 1764.456677 1655 Vietnam Asia 2007 74.249 85262356 2441.576404
          country continent  year  lifeExp       pop    gdpPercap
11    Afghanistan      Asia  2007   43.828  31889923   974.580338
34        Algeria    Africa  2002   70.994  31287142  5288.040382
35        Algeria    Africa  2007   72.301  33333216  6223.367465
55      Argentina  Americas  1987   70.774  31620918  9139.671389
56      Argentina  Americas  1992   71.868  33958947  9308.418710
...           ...       ...   ...      ...       ...          ...
1651      Vietnam      Asia  1987   62.820  62826491   820.799445
1652      Vietnam      Asia  1992   67.662  69940728   989.023149
1653      Vietnam      Asia  1997   70.672  76048996  1385.896769
1654      Vietnam      Asia  2002   73.017  80908147  1764.456677
1655      Vietnam      Asia  2007   74.249  85262356  2441.576404

To view more subsetting examples, along with setting values a data frame, come back for session three.

I hope you are enjoying these tutorials and the code can be found by clicking the Github button.

To leave a comment for the author, please follow the link and comment on their blog: Python – Hutsons-hacks .

Want to share your content on python-bloggers? click here.