Python Pandas Pro – Session Two – Selection on Data Frames

[This article was first published on Python – Hutsons-hacks, and kindly contributed to python-bloggers]. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

In the previous post we worked with a custom data set we had created for the purposes of demonstration. This time we are going to work with the gapminder dataset.

Viewing the gapminder dataset

To initialise the gapminder dataset in my project I will use the below import statements to prepare the project ready for the demonstration:

from gapminder import gapminder as gp
import pandas as pd
import numpy as np
# Take a copy of the data frame
df = gp.copy()
print(df.head(10))

This code uses the gapminder package using from and imports only the relevant information i.e. the gapminder data and this is aliased as gp.

Then, as in the previous example we import pandas and numpy for working in the project.

The final step is to take a copy of the data and then we print the head of the dataset for inspection:

       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106
5  Afghanistan      Asia  1977   38.438  14880372  786.113360
6  Afghanistan      Asia  1982   39.854  12881816  978.011439
7  Afghanistan      Asia  1987   40.822  13867957  852.395945
8  Afghanistan      Asia  1992   41.674  16317921  649.341395
9  Afghanistan      Asia  1997   41.763  22227415  635.341351

The dataset contains the population movements against life expectancy and the gross domestic product as a percentage of the population. To sort the data by the highest population the below functionality could be added:

#Sort by value
print(df.sort_values(by='pop', ascending=False))

To sort by multiple columns this can simply be extended by using a list in Python, these always have square brackets:

#Sort by value
print(df.sort_values(by=['pop', 'year'], ascending=False))

The output from the model shows it has been successful in sorting by the sort values specified:

                    country continent  year  lifeExp         pop    gdpPercap
299                   China      Asia  2007   72.961  1318683096  4959.114854
298                   China      Asia  2002   72.028  1280400000  3119.280896
297                   China      Asia  1997   70.426  1230075000  2289.234136
296                   China      Asia  1992   68.690  1164970000  1655.784158
707                   India      Asia  2007   64.698  1110396331  2452.210407
...                     ...       ...   ...      ...         ...          ...
1299  Sao Tome and Principe    Africa  1967   54.425       70787  1384.840593
1298  Sao Tome and Principe    Africa  1962   51.893       65345  1071.551119
420                Djibouti    Africa  1952   34.812       63149  2669.529475
1297  Sao Tome and Principe    Africa  1957   48.945       61325   860.736903
1296  Sao Tome and Principe    Africa  1952   46.471       60011   879.583586

Selection – getting stuff from data frames

The below two subsections will show how to get items from a data frame and then how you can also set items.

Selecting by column

To select a specific column(s) you can use the following syntax:

# Selection by 1 x column
print(df['pop'])
# Selection by more than 1 column
print(df[['country', 'pop']])

The first line will print out the pop column, whereas the double brackets will select both country and population:

          country       pop
0     Afghanistan   8425333
1     Afghanistan   9240934
2     Afghanistan  10267083
3     Afghanistan  11537966
4     Afghanistan  13079460
...           ...       ...
1699     Zimbabwe   9216418
1700     Zimbabwe  10704340
1701     Zimbabwe  11404948
1702     Zimbabwe  11926563
1703     Zimbabwe  12311143

Slicing rows

To slice rows from the data frame it is as simple as the below:

# Slicing rows
print(df[0:10])

The results in the first 10 rows being sliced, as Python has a zero based indexing system, like most languages:

       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106
5  Afghanistan      Asia  1977   38.438  14880372  786.113360
6  Afghanistan      Asia  1982   39.854  12881816  978.011439
7  Afghanistan      Asia  1987   40.822  13867957  852.395945
8  Afghanistan      Asia  1992   41.674  16317921  649.341395
9  Afghanistan      Asia  1997   41.763  22227415  635.341351

Selecting by label

The driving function here is to use the loc keyword to select by location in the data frame. This implementation shows how to return all rows and only matching columns:

# Selection by label
print(df.loc[:, ['lifeExp', 'pop']])

Outputting:

      lifeExp       pop
0      28.801   8425333
1      30.332   9240934
2      31.997  10267083
3      34.020  11537966
4      36.088  13079460
...       ...       ...
1699   62.351   9216418
1700   60.377  10704340
1701   46.809  11404948
1702   39.989  11926563
1703   43.487  12311143

The : operator says select all rows and only those matching columns in the df.loc command.

To specify both rows and columns to return, you can adapt the statement slightly to easily achieve that:

print(df.loc[0:10, ['lifeExp', 'pop']])

This outputs:

    lifeExp       pop
0    28.801   8425333
1    30.332   9240934
2    31.997  10267083
3    34.020  11537966
4    36.088  13079460
5    38.438  14880372
6    39.854  12881816
7    40.822  13867957
8    41.674  16317921
9    41.763  22227415
10   42.129  25268405

For getting a scalar variable, you would use:

# For getting a scalar value
print(df.loc[dates[0], 'pop'])

That would get the first date index on the row.

Selection by position

To select by position you can use the iloc keyword.

Select single column

This below code block will show how to slice by index position:

# Selection by position
print(df.iloc[3])

This outputs:

country      Colombia
continent    Americas
year             1952
lifeExp        50.643
pop          12350771
gdpPercap     2144.12
Name: 300, dtype: object

Select multiple elements – rows and columns

This method uses integer slicing to select rows and columns:

#Integer slicing
print(df.iloc[3:5, 0:2]) 

Outputting:

       country continent
3  Afghanistan      Asia
4  Afghanistan      Asia

You can achieve a similar result using numpy lists:

print(df.iloc[[1, 2, 4], [0, 2]])

Slicing rows and columns explicitly

To achieve this, we make use of the magic : operator in Python:

print(df.iloc[1:9, :]

Hey presto:

       country continent  year  lifeExp       pop   gdpPercap
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106
5  Afghanistan      Asia  1977   38.438  14880372  786.113360
6  Afghanistan      Asia  1982   39.854  12881816  978.011439
7  Afghanistan      Asia  1987   40.822  13867957  852.395945
8  Afghanistan      Asia  1992   41.674  16317921  649.341395

To perform this for columns we just reverse the position in the list:

print(df.iloc[:, 1:3]) #Slicing columns explicitly

This then selects just the first 3 rows you require:

     continent  year  lifeExp
0         Asia  1952   28.801
1         Asia  1957   30.332
2         Asia  1962   31.997
3         Asia  1967   34.020
4         Asia  1972   36.088
...        ...   ...      ...
1699    Africa  1987   62.351
1700    Africa  1992   60.377
1701    Africa  1997   46.809
1702    Africa  2002   39.989
1703    Africa  2007   43.487

Select a value explicitly

To do this you need to specify a single row and column index:

print(df.iloc[1, 1]) #Getting a value explicitly

This returns just one record:

Asia

To undertake this, giving faster reload, you can use the iat function:

print(df.iat[1, 1]) #Fast access to a scalar explicitly

Boolean indexing

To filter the populations greater than a certain value, we can use boolean indexing:

print(df[df['pop'] > 30000000]) #Using operators to subset

This gives you all the records that are greater than the selected value. To extend this you could use or and and operators. This outputs:

          country continent  year  lifeExp       pop    gdpPercap
11    Afghanistan      Asia  2007   43.828  31889923   974.580338
34        Algeria    Africa  2002   70.994  31287142  5288.040382
35        Algeria    Africa  2007   72.301  33333216  6223.367465
55      Argentina  Americas  1987   70.774  31620918  9139.671389
56      Argentina  Americas  1992   71.868  33958947  9308.418710
...           ...       ...   ...      ...       ...          ...
1651      Vietnam      Asia  1987   62.820  62826491   820.799445
1652      Vietnam      Asia  1992   67.662  69940728   989.023149
1653      Vietnam      Asia  1997   70.672  76048996  1385.896769
1654      Vietnam      Asia  2002   73.017  80908147  1764.456677
1655      Vietnam      Asia  2007   74.249  85262356  2441.576404

To view more subsetting examples, along with setting values a data frame, come back for session three.

I hope you are enjoying these tutorials and the code can be found by clicking the Github button.

To leave a comment for the author, please follow the link and comment on their blog: Python – Hutsons-hacks.

Want to share your content on python-bloggers? click here.