Python Pandas Pro – Session One – Creation of Pandas objects and basic data frame operations

Posted on September 28, 2020 by Gary Hutson in Data science | 0 Comments

This article was first published on Python – Hutsons-hacks, and kindly contributed to python-bloggers.

I first started using Python a couple of years ago, when the limitations of some of the Deep Learning libraries in R became apparent.

What I want to do is create a series of tutorials that allow you to get up to speed with Pandas for data frames in Python quickly.

The first session will focus on the basics of Pandas and how to create your own Pandas objects in memory very quickly.

Pandas structures

Creating a series

A Series in Pandas is a one-dimensional labelled array, broadly comparable to a vector in R. It allows you to store a sequence of values, each with an index label:

import numpy as np
import pandas as pd

# Create a series by passing a list of values (np.nan marks a missing value)
series = pd.Series([1, 2, 3, np.nan, 6, 8, np.nan, 10, 11])
print(series)
print(series.size)

The above code snippet uses the alias pd to refer to pandas. If the alias were omitted, you would need to type pandas.Series every time. Aliasing with the as keyword will be familiar to those who use SQL, and it makes it much easier to refer to a library when pulling out the relevant functions.

The output of the above shows:

0     1.0
1     2.0
2     3.0
3     NaN
4     6.0
5     8.0
6     NaN
7    10.0
8    11.0
dtype: float64
9

This prints the values of the series alongside their index labels, the data type they are stored as, and the size of the series.
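Because the series contains missing values, it can help to see how size relates to the number of usable values. The snippet below is a minimal sketch using the same series; the variable names total, non_missing and missing are my own and are only there for illustration.

```python
import numpy as np
import pandas as pd

series = pd.Series([1, 2, 3, np.nan, 6, 8, np.nan, 10, 11])

total = series.size            # every element, including the NaN entries
non_missing = series.count()   # only the elements that are not NaN
missing = series.isna().sum()  # only the NaN entries

print(total, non_missing, missing)  # 9 7 2
```

So size tells you the length of the series, whereas count() tells you how many values are actually present.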

That is all there is to series. The next demo will show how to create a datetime index and labelled columns with a NumPy array.

Create Data Frame by passing a NumPy array with a date time index

The first step, which extends the code we wrote previously, is to create a range of dates and then use it as the index of the Pandas data frame:

dates = pd.date_range('20200101', periods=series.size)
print(dates)

This creates a date time index using the date_range function, with one date per element of the series, and prints it to the console:

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09'],
              dtype='datetime64[ns]', freq='D')

The date time index output shows the dates, the dtype (data type) and the frequency of the interval between the dates, i.e. freq='D' for daily.
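The date_range function can also be given an end date instead of a number of periods, and the freq argument controls the gap between dates. The following is a small sketch of both options; the names by_periods, by_end and weekly are just illustrative choices of mine.

```python
import pandas as pd

# Nine daily dates, defined by a start date and a number of periods
by_periods = pd.date_range('20200101', periods=9)

# The same nine dates, defined by a start date and an end date
by_end = pd.date_range(start='2020-01-01', end='2020-01-09')
same = bool((by_periods == by_end).all())

# Three dates spaced seven days apart rather than one
weekly = pd.date_range('20200101', periods=3, freq='7D')

print(same)
print(weekly)
```

Either form should give you the same index here, so pick whichever reads more naturally for your data.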

Next, we will create a data frame filled with random values from NumPy's randn() function, using a list to specify the column names and the date index we created as the row index:

df = pd.DataFrame(np.random.randn(series.size,4),
   index=dates, columns = list('1234'))
print(df)

The output is a data frame of the following structure (the values are random, so yours will differ):

                   1         2         3         4
2020-01-01 -1.001336  0.312421  0.213412  0.399853
2020-01-02 -1.160968  0.600596  0.612449  0.106262
2020-01-03  0.539773  0.457708  0.818120 -0.496321
2020-01-04  0.821966 -0.849103 -1.125686  0.816331
2020-01-05 -0.362707  1.449582  1.485910 -1.284188
2020-01-06  0.168309 -1.627923 -0.900661 -0.185069
2020-01-07  1.736149  0.820594 -0.840311  2.941485
2020-01-08 -0.560419 -0.332010 -1.256690 -1.128578
2020-01-09  0.142882 -1.151348  0.998045  1.472304
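If you would like the random values to be the same on every run, one option is to seed the random number generator. This is a sketch that swaps randn() for NumPy's newer default_rng interface, which is my substitution and not what the code above uses; the seed of 42 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20200101', periods=9)

# A seeded generator produces the same draws each time it is created
rng = np.random.default_rng(seed=42)
df = pd.DataFrame(rng.standard_normal((9, 4)),
                  index=dates, columns=list('1234'))

# Re-creating the generator with the same seed reproduces the data frame
rng_again = np.random.default_rng(seed=42)
df_again = pd.DataFrame(rng_again.standard_normal((9, 4)),
                        index=dates, columns=list('1234'))

print(df.shape)
print(df.equals(df_again))
```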

Dictionaries and Data Frames

Dictionaries are a very useful type in Python: they map keys to values.

When a dictionary is passed to pd.DataFrame, each key becomes a column name and each value becomes the contents of that column:

df2 = pd.DataFrame(
    {'A': 1.,
    'B': pd.Timestamp('20200102'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'Gary'}
    )
print(df2)

Dictionaries are created by using curly braces {} and the above uses:

A simple float value
The Pandas Timestamp function
A Pandas Series
A NumPy array
The Pandas Categorical data type, to store categorical values
A string literal

Printing df2 shows the data frame constructed from these various Python structures and data types. Note that the single values (A, B and F) are repeated on every row:

     A          B    C  D      E     F
0  1.0 2020-01-02  1.0  3   test  Gary
1  1.0 2020-01-02  1.0  3  train  Gary
2  1.0 2020-01-02  1.0  3   test  Gary
3  1.0 2020-01-02  1.0  3  train  Gary
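To make the repetition of single values clearer, here is a smaller sketch with a hypothetical data frame called df_small. It also shows what I would expect to happen when two list-like columns have different lengths, namely a ValueError.

```python
import pandas as pd

df_small = pd.DataFrame({
    'label': 'Gary',        # a single value is repeated on every row
    'value': [10, 20, 30],  # the list sets the number of rows to three
})
print(df_small)

# Columns built from lists must all be the same length
try:
    pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2]})
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)
```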

Viewing data types of all data frame objects

To check the data types of all the columns in the data frame, you can use the below syntax:

print(df2.dtypes)

This outputs the data type of each column in the data frame:

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

This shows the underlying data type of each column in the data frame.
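Once you know the data types, you can filter or convert columns based on them. The below is a rough sketch using select_dtypes and astype on the same df2; the names numeric and converted are mine.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {'A': 1.,
     'B': pd.Timestamp('20200102'),
     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
     'D': np.array([3] * 4, dtype='int32'),
     'E': pd.Categorical(["test", "train", "test", "train"]),
     'F': 'Gary'}
)

# Keep only the numeric columns (floats and integers)
numeric = df2.select_dtypes(include='number')
print(list(numeric.columns))

# Convert a single column to a different data type
converted = df2['D'].astype('int64')
print(converted.dtype)
```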

Data Frame operations

Data frame operations are methods and attributes of the data frame object, so they are accessed by placing a period after the object name, as in df.head().

Viewing the top and bottom of data frames

To view the top of a large data frame, you can use the head method:

print(df2.head(2))

Simply replace the number inside the head call to specify how many rows to show from the top. The code in the statement outputs:

     A          B    C  D      E     F
0  1.0 2020-01-02  1.0  3   test  Gary
1  1.0 2020-01-02  1.0  3  train  Gary

To perform the same for the bottom rows, use the tail method with the same syntax as head.
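Here is a quick sketch of head and tail side by side, using a hypothetical ten-row data frame called numbers so that the difference is easier to see. When no number is passed, both methods default to five rows.

```python
import pandas as pd

# A hypothetical data frame with ten rows, indexed 0 to 9
numbers = pd.DataFrame({'n': range(10)})

top_default = numbers.head()   # no argument: the first five rows
bottom_three = numbers.tail(3) # the last three rows

print(top_default)
print(bottom_three)
```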

Obtaining descriptive statistics of a data frame

To obtain the descriptive statistics of a data frame, use the describe() method:

# Get descriptive statistics
print(df2.describe(include='all'))  # include='all' summarises columns of every data type

The output of this is shown below; the rows reported for the Timestamp column can differ between Pandas versions:

          A                    B    C    D      E     F
count   4.0                    4  4.0  4.0      4     4
unique  NaN                    1  NaN  NaN      2     1
top     NaN  2020-01-02 00:00:00  NaN  NaN  train  Gary
freq    NaN                    4  NaN  NaN      2     4
first   NaN  2020-01-02 00:00:00  NaN  NaN    NaN   NaN
last    NaN  2020-01-02 00:00:00  NaN  NaN    NaN   NaN
mean    1.0                  NaN  1.0  3.0    NaN   NaN
std     0.0                  NaN  0.0  0.0    NaN   NaN
min     1.0                  NaN  1.0  3.0    NaN   NaN
25%     1.0                  NaN  1.0  3.0    NaN   NaN
50%     1.0                  NaN  1.0  3.0    NaN   NaN
75%     1.0                  NaN  1.0  3.0    NaN   NaN
max     1.0                  NaN  1.0  3.0    NaN   NaN
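The include argument can also be narrowed to particular data types, which avoids the NaN padding you see when mixing numeric and non-numeric columns. This is a sketch on the same df2; numeric_stats and category_stats are names I have made up for the example.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {'A': 1.,
     'B': pd.Timestamp('20200102'),
     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
     'D': np.array([3] * 4, dtype='int32'),
     'E': pd.Categorical(["test", "train", "test", "train"]),
     'F': 'Gary'}
)

# Summarise only the numeric columns: count, mean, std, min, quartiles, max
numeric_stats = df2.describe(include=[np.number])
print(numeric_stats)

# Summarise only the categorical column: count, unique, top, freq
category_stats = df2.describe(include=['category'])
print(category_stats)
```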

Displaying column names

The command to display the column headings is very simple:

print(df2.columns)

This produces:

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
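The columns attribute has a couple of close relatives that are worth knowing about. The sketch below, again on df2, shows the row index and the shape of the data frame next to the column names; column_names and row_labels are illustrative names of my own.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {'A': 1.,
     'B': pd.Timestamp('20200102'),
     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
     'D': np.array([3] * 4, dtype='int32'),
     'E': pd.Categorical(["test", "train", "test", "train"]),
     'F': 'Gary'}
)

column_names = list(df2.columns)  # the column headings as a plain list
row_labels = list(df2.index)      # the row index labels as a plain list
n_rows, n_cols = df2.shape        # (number of rows, number of columns)

print(column_names)
print(row_labels)
print(n_rows, n_cols)
```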

Transposing a data frame

The way to transpose a data frame is by using the T attribute (note that it takes no parentheses):

#Transpose the data
print(df2.T)

This flips the data frame around:

                     0                    1                    2                    3
A                    1                    1                    1                    1
B  2020-01-02 00:00:00  2020-01-02 00:00:00  2020-01-02 00:00:00  2020-01-02 00:00:00
C                    1                    1                    1                    1
D                    3                    3                    3                    3
E                 test                train                 test                train
F                 Gary                 Gary                 Gary                 Gary
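To check what the transpose has done, you can compare the shape and labels before and after. This is a short sketch on the same df2, where flipped is just a name I have chosen for the transposed data frame.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {'A': 1.,
     'B': pd.Timestamp('20200102'),
     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
     'D': np.array([3] * 4, dtype='int32'),
     'E': pd.Categorical(["test", "train", "test", "train"]),
     'F': 'Gary'}
)

flipped = df2.T

print(df2.shape)            # rows by columns before transposing
print(flipped.shape)        # the two dimensions swap places
print(list(flipped.index))  # the old column names become the row labels
print(flipped.T.shape)      # transposing twice restores the original shape
```

Bear in mind that each transposed column now mixes several kinds of value, so I would not rely on the original column data types surviving the round trip.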

What’s next?

The next in the series is Sorting, Indexing and Slicing data frames in Python with Pandas.

Stay tuned for more tutorials on how to use Pandas.
