How To Run Logistic Regression On Aggregate Data In Python

[This article was first published on Python – Predictive Hacks, and kindly contributed to python-bloggers.]

Following up on our post about Logistic Regression on Aggregated Data in R, we will show you how to handle grouped data when you want to fit a logistic regression in Python. Let us first create some dummy data.

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# 200 random observations: two categorical predictors and a binary response
df = pd.DataFrame(
    {
        'Gender': np.random.choice(["m", "f"], 200, p=[0.6, 0.4]),
        'Age': np.random.choice(["[<30]", "[30-65]", "[65+]"], 200, p=[0.3, 0.6, 0.1]),
        'Response': np.random.binomial(1, size=200, p=0.2),
    }
)

df.head()
  Gender      Age  Response
0      f  [30-65]         0
1      m  [30-65]         0
2      m    [<30]         0
3      f  [30-65]         1
4      f    [65+]         0
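
The data above are drawn without a fixed seed, so every run produces different numbers, and outputs printed in separate sessions will not necessarily line up with each other. As a minimal sketch, assuming you want a reproducible version of the same dummy data, NumPy's `default_rng` can be seeded; the seed value `42` and the name `df_seeded` are arbitrary choices for illustration and are not part of the original post.

```python
import numpy as np
import pandas as pd

# Seeded generator: the seed 42 is an arbitrary illustrative choice
rng = np.random.default_rng(42)
n = 200

# Same structure and sampling probabilities as the dummy data above
df_seeded = pd.DataFrame(
    {
        "Gender": rng.choice(["m", "f"], size=n, p=[0.6, 0.4]),
        "Age": rng.choice(["[<30]", "[30-65]", "[65+]"], size=n, p=[0.3, 0.6, 0.1]),
        "Response": rng.binomial(1, 0.2, size=n),
    }
)

print(df_seeded.head())
```

Re-running this block yields the same rows each time, which makes it easier to compare the models in the sections that follow.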

Logistic Regression on Non-Aggregate Data

First, we fit a logistic regression on the non-aggregated data, where each row is a single observation. We use statsmodels because it is also the library we will use for the aggregated data, which makes the models easy to compare. In addition, statsmodels prints a model summary in the classic statistical style familiar from R.

Tip: If you don't want to convert your categorical variables into binary (dummy) columns by hand before fitting a logistic regression, you can use the statsmodels formula interface instead of scikit-learn.

model=smf.logit('Response~Gender+Age',data=df)
result = model.fit()
print(result.summary())
Logit Regression Results                           
==============================================================================
Dep. Variable:               Response   No. Observations:                  200
Model:                          Logit   Df Residuals:                      196
Method:                           MLE   Df Model:                            3
Date:                Mon, 22 Feb 2021   Pseudo R-squ.:                 0.02765
Time:                        18:09:11   Log-Likelihood:                -85.502
converged:                       True   LL-Null:                       -87.934
Covariance Type:            nonrobust   LLR p-value:                    0.1821
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -2.1741      0.396     -5.494      0.000      -2.950      -1.399
Gender[T.m]      0.8042      0.439      1.831      0.067      -0.057       1.665
Age[T.[65+]]    -0.7301      0.786     -0.929      0.353      -2.270       0.810
Age[T.[<30]]     0.1541      0.432      0.357      0.721      -0.693       1.001
================================================================================
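
The coefficients in the summary are on the log-odds scale. A small sketch of how they could be turned into odds ratios follows; the values are copied from the table above, and the names `coefs`, `odds_ratios` and `baseline_prob` are introduced here for illustration. With a fitted model, `np.exp(result.params)` would give the same kind of output.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the printed summary above (log-odds scale)
coefs = pd.Series(
    {
        "Intercept": -2.1741,
        "Gender[T.m]": 0.8042,
        "Age[T.[65+]]": -0.7301,
        "Age[T.[<30]]": 0.1541,
    }
)

# Exponentiating a coefficient gives the odds ratio relative to the reference level
odds_ratios = np.exp(coefs)

# The intercept corresponds to the reference group (Gender = f, Age = [30-65]);
# the inverse logit converts its log-odds into a probability
baseline_prob = 1.0 / (1.0 + np.exp(-coefs["Intercept"]))

print(odds_ratios.round(3))
print(round(baseline_prob, 3))
```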

Logistic Regression on Aggregate Data

Below are three methods for fitting the same model when the data are aggregated.


1. Logistic Regressions using Responders and Non-Responders

In the following code, we group the data and create one column with the number of responders (Yes) and one with the number of non-responders (No) in each group.

# Count responders and total impressions per Gender/Age group
grouped = (
    df.groupby(['Gender', 'Age'])['Response']
    .agg(Yes='sum', Impressions='count')
    .eval('No = Impressions - Yes')
    .reset_index()
)
grouped
  Gender      Age  Yes  Impressions  No
0      f  [30-65]    9           38  29
1      f    [65+]    2            7   5
2      f    [<30]    8           25  17
3      m  [30-65]   17           79  62
4      m    [65+]    2           12  10
5      m    [<30]    9           39  30

# Two-column response: (successes, failures) for each group
glm_binom = smf.glm('Yes + No ~ Age + Gender', data=grouped, family=sm.families.Binomial())
result_grouped=glm_binom.fit()
print(result_grouped.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:          ['Yes', 'No']   No. Observations:                    6
Model:                            GLM   Df Residuals:                        2
Model Family:                Binomial   Df Model:                            3
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -8.9211
Date:                Mon, 22 Feb 2021   Deviance:                       1.2641
Time:                        18:15:15   Pearson chi2:                    0.929
No. Iterations:                     5                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -2.1741      0.396     -5.494      0.000      -2.950      -1.399
Age[T.[65+]]    -0.7301      0.786     -0.929      0.353      -2.270       0.810
Age[T.[<30]]     0.1541      0.432      0.357      0.721      -0.693       1.001
Gender[T.m]      0.8042      0.439      1.831      0.067      -0.057       1.665
================================================================================
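
The reason the grouped fit reproduces the coefficients and standard errors of the row-level model can be sketched as follows. The notation is introduced here for illustration: $g$ indexes a Gender/Age group, $n_g$ is its number of impressions, $y_g$ its number of responders, and $p_g(\beta)$ the response probability shared by everyone in the group.

```latex
% Row-level (Bernoulli) log-likelihood, with rows collected by group g
\ell_{\text{rows}}(\beta)
  = \sum_{g} \Big[ y_g \log p_g(\beta) + (n_g - y_g) \log\big(1 - p_g(\beta)\big) \Big],
\qquad
p_g(\beta) = \frac{1}{1 + e^{-x_g^{\top}\beta}}

% Grouped (binomial) log-likelihood: same terms plus a constant free of beta
\ell_{\text{grouped}}(\beta)
  = \sum_{g} \log \binom{n_g}{y_g} + \ell_{\text{rows}}(\beta)
```

Since the two log-likelihoods differ only by a term that does not involve $\beta$, they share the same maximizer and the same curvature, which is why the estimates and standard errors agree while the reported Log-Likelihood values differ.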

2. Logistic Regression with Weights

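For this method, we compute the response rate of each group, use it as the dependent variable, and pass the group sizes as frequency weights so that each row counts as many times as it has impressions.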

# Response rate per group
grouped['RR'] = grouped['Yes'] / grouped['Impressions']

# Each group is weighted by its number of impressions
glm = smf.glm(
    'RR ~ Age + Gender',
    data=grouped,
    family=sm.families.Binomial(),
    freq_weights=np.asarray(grouped['Impressions']),
)
result_grouped2 = glm.fit()
print(result_grouped2.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                     RR   No. Observations:                    6
Model:                            GLM   Df Residuals:                      196
Model Family:                Binomial   Df Model:                            3
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -59.807
Date:                Mon, 22 Feb 2021   Deviance:                       1.2641
Time:                        18:18:16   Pearson chi2:                    0.929
No. Iterations:                     5                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -2.1741      0.396     -5.494      0.000      -2.950      -1.399
Age[T.[65+]]    -0.7301      0.786     -0.929      0.353      -2.270       0.810
Age[T.[<30]]     0.1541      0.432      0.357      0.721      -0.693       1.001
Gender[T.m]      0.8042      0.439      1.831      0.067      -0.057       1.665
================================================================================

3. Expand the Aggregate Data

Lastly, we can "ungroup" the data, turning the counts back into one row per observation with a binary dependent variable, so that we can fit a logistic regression as usual.

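# Turn each count into a list of that many 0s / 1s (this overwrites the count columns)
grouped['No'] = grouped['No'].apply(lambda x: [0] * x)
grouped['Yes'] = grouped['Yes'].apply(lambda x: [1] * x)
# Concatenate the two lists row by row: one list of outcomes per group
grouped['Response'] = grouped['Yes'] + grouped['No']

# One row per list element, i.e. one row per original observation
expanded = grouped.explode("Response")[['Gender', 'Age', 'Response']]
expanded['Response'] = expanded['Response'].astype(int)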

expanded.head()
  Gender      Age Response
0      f  [30-65]        1
0      f  [30-65]        1
0      f  [30-65]        1
0      f  [30-65]        1
0      f  [30-65]        1

model = smf.logit('Response ~ Gender + Age', data=expanded)
result = model.fit()
print(result.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:               Response   No. Observations:                  200
Model:                          Logit   Df Residuals:                      196
Method:                           MLE   Df Model:                            3
Date:                Mon, 22 Feb 2021   Pseudo R-squ.:                 0.02765
Time:                        18:29:33   Log-Likelihood:                -85.502
converged:                       True   LL-Null:                       -87.934
Covariance Type:            nonrobust   LLR p-value:                    0.1821
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -2.1741      0.396     -5.494      0.000      -2.950      -1.399
Gender[T.m]      0.8042      0.439      1.831      0.067      -0.057       1.665
Age[T.[65+]]    -0.7301      0.786     -0.929      0.353      -2.270       0.810
Age[T.[<30]]     0.1541      0.432      0.357      0.721      -0.693       1.001
================================================================================
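
The expansion step can also be written without building Python lists. The sketch below is an alternative, not the approach used above: it reshapes the counts to long form with `melt` and repeats each row with `Index.repeat`. The names `counts`, `long_counts` and `expanded_alt` are hypothetical, and the counts are copied from the grouped table printed earlier.

```python
import pandas as pd

# Grouped counts copied from the table shown earlier in the post
counts = pd.DataFrame(
    {
        "Gender": ["f", "f", "f", "m", "m", "m"],
        "Age": ["[30-65]", "[65+]", "[<30]"] * 2,
        "Yes": [9, 2, 8, 17, 2, 9],
        "No": [29, 5, 17, 62, 10, 30],
    }
)

# Long form: one row per (group, outcome) pair with its count n
long_counts = counts.melt(
    id_vars=["Gender", "Age"], value_vars=["Yes", "No"],
    var_name="Outcome", value_name="n",
)

# Repeat each row n times, then encode the outcome as 1 (Yes) or 0 (No)
expanded_alt = long_counts.loc[long_counts.index.repeat(long_counts["n"])].reset_index(drop=True)
expanded_alt["Response"] = (expanded_alt["Outcome"] == "Yes").astype(int)
expanded_alt = expanded_alt[["Gender", "Age", "Response"]]

print(expanded_alt.shape)
```

This version leaves the count columns untouched, so the grouped table can still be reused afterwards.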
