T-Test with Pingouin

This article was first published on python – educational research techniques, and kindly contributed to python-bloggers.

In this post, we will look at how to use the Pingouin package to calculate both t-test and ANOVA results. The focus is on how to run these tests in Python, so the statistics behind them are not explained here.

We will be using the Duncan dataset from the pydataset package. In the code below, we are loading the needed libraries and we are also printing a portion of the Duncan dataset.

import pandas as pd
import pingouin
from pydataset import data
df=data("Duncan")
df.head()

The Duncan dataset is simple. It has stats on various jobs, including the type of job, income, education, and prestige. We want to see whether income differs by job type. First, we will compare professional jobs (prof) with white-collar jobs (wc) and see if there is a difference. After that, we will compare all three job types (bc, wc, prof) using ANOVA.

T-Test

In the code below, we need to subset our data so that the professionals and white-collar workers are separate.

df_prof=df[df['type']=='prof']
df_wc=df[df['type']=='wc']
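Before testing, it can help to check how many rows each job type has and what the group means look like. The sketch below does this with a pandas groupby. The small DataFrame is made-up stand-in data that only borrows the column names type and income; it does not contain the Duncan values.

```python
import pandas as pd

# Made-up stand-in data (NOT the Duncan values), same column names as above
toy = pd.DataFrame({
    "type": ["prof", "prof", "prof", "wc", "wc", "bc", "bc"],
    "income": [60, 70, 80, 40, 50, 20, 30],
})

# Number of rows and mean income for each job type
summary = toy.groupby("type")["income"].agg(["count", "mean"])
print(summary)
```

With the real data, the same groupby call can be run on df instead of toy.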

Now that this is complete, the code below conducts the t-test. We are comparing professional income with white-collar income. The t-test is two-sided, which means we are looking for a difference in either direction.

pingouin.ttest(x=df_prof['income'],y=df_wc['income'],alternative="two-sided")

According to the p-value, there is no statistically significant difference between the incomes of professionals and white-collar workers. We will now move to ANOVA.

ANOVA

A t-test only allows the user to compare two groups. ANOVA allows the user to compare more than two groups. We have three types of workers, not just two, and with ANOVA we can compare all three at once. In addition, unlike the t-test, no data preparation is needed in this example.

The code below is relatively simple. We are using the anova function from Pingouin. The data argument takes the DataFrame, the dv argument indicates the dependent variable, and the between argument indicates the independent variable. Below is the code.

pingouin.anova(data=df,dv="income",between="type")

The value we are focused on is p-unc, the uncorrected p-value. The results are significant. In other words, at least one of the job types differs from the others. ANOVA does not tell us which one, so we need to do pairwise comparisons. Below are two different pairwise comparisons, one without an adjustment and one with an adjustment.

Pairwise Comparison No Adjustment

The first pairwise comparison is without an adjustment. The code below is mostly the same as for ANOVA. The main difference is that we are using the pairwise_tests function, and there is an additional argument called padjust, which is set to 'none'. Below is the code.

pingouin.pairwise_tests(data=df,dv="income",between="type",padjust='none')

Focusing on the p-values (p-unc) again, we can see that there is a difference between blue-collar workers and professionals, and another difference between blue-collar workers and white-collar workers. However, there is no significant difference between professional and white-collar workers, which matches what we already saw in the t-test results.

Pairwise Comparison with Adjustment

In the code below, we have the same code but with a Bonferroni p-value adjustment. Adjustments become important when you have a large number of groups. The details of this are beyond the scope of this post. However, it is important to make this adjustment because otherwise, you could get false positives which could skew your results and interpretation. Below is the code and output.

pingouin.pairwise_tests(data=df,dv="income",between="type",padjust='bonf')

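You may have noticed that the p-unc values are the same as before. That is expected, because p-unc is always the unadjusted p-value. Pingouin reports the Bonferroni-adjusted values in a separate column (p-corr). In our example we have only three groups and therefore only three comparisons, so the adjustment is small for the data we are using.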
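For reference, a Bonferroni adjustment multiplies each uncorrected p-value by the number of comparisons and caps the result at 1. The sketch below shows the arithmetic in plain Python. The bonferroni helper and the p-values are made up for illustration and are not the results from the Duncan data.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(p * m, 1.0) for p in p_values]

# Made-up uncorrected p-values for three pairwise comparisons
raw = [0.001, 0.02, 0.6]
adjusted = bonferroni(raw)
print(adjusted)
```

With these made-up numbers, the second comparison is below 0.05 before the adjustment and above it afterwards, which is the kind of false positive the adjustment is meant to guard against.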

Conclusion

The main purpose here was to show what the Pingouin package can do when it comes to t-tests and ANOVA. We could have calculated means for each group and other statistics. However, that was not the focus. Now you know some of the tools that are available in the Pingouin library.
