How To Run Logistic Regression On Aggregate Data In Python
Following up on our post about Logistic Regression on Aggregated Data in R, we will show you how to deal with grouped data when you want to run a logistic regression in Python. Let us first create some dummy data. No random seed is set, so your values will differ from the outputs shown here.
    import pandas as pd
    import numpy as np
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # 200 individuals with a random gender, age band and binary response
    df = pd.DataFrame(
        {
            'Gender': np.random.choice(["m", "f"], 200, p=[0.6, 0.4]),
            'Age': np.random.choice(["[<30]", "[30-65]", "[65+]"], 200, p=[0.3, 0.6, 0.1]),
            'Response': np.random.binomial(1, size=200, p=0.2),
        }
    )
    df.head()
      Gender      Age  Response
    0      f  [30-65]         0
    1      m  [30-65]         0
    2      m    [<30]         0
    3      f  [30-65]         1
    4      f    [65+]         0
Logistic Regression on Non-Aggregate Data
First, we will run a logistic regression on the non-aggregated data. We use statsmodels because it is the library we will also use for the aggregated data, which makes the models easier to compare. statsmodels also prints a model summary in the classic statistical style familiar from R.
Tip: If you don't want to convert your categorical data into binary columns before fitting a logistic regression, you can use the statsmodels formula interface instead of scikit-learn.
    model = smf.logit('Response ~ Gender + Age', data=df)
    result = model.fit()
    print(result.summary())
                               Logit Regression Results
    ==============================================================================
    Dep. Variable:               Response   No. Observations:                  200
    Model:                          Logit   Df Residuals:                      196
    Method:                           MLE   Df Model:                            3
    Date:                Mon, 22 Feb 2021   Pseudo R-squ.:                 0.02765
    Time:                        18:09:11   Log-Likelihood:                -85.502
    converged:                       True   LL-Null:                       -87.934
    Covariance Type:            nonrobust   LLR p-value:                    0.1821
    ================================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
    --------------------------------------------------------------------------------
    Intercept       -2.1741      0.396     -5.494      0.000      -2.950      -1.399
    Gender[T.m]      0.8042      0.439      1.831      0.067      -0.057       1.665
    Age[T.[65+]]    -0.7301      0.786     -0.929      0.353      -2.270       0.810
    Age[T.[<30]]     0.1541      0.432      0.357      0.721      -0.693       1.001
    ================================================================================
Logistic Regression on Aggregate Data
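Below are three methods for dealing with aggregated data.
1. Logistic Regression Using Responders and Non-Responders
In the following code, we group the data and create columns for the responders (Yes) and non-responders (No). Note that the counts printed below come from a different random draw than the model summaries in this post (the table has 47 responders, whereas an LL-Null of -87.934 corresponds to 32 responders out of 200), so they will not reproduce those coefficients exactly.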
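    # Count responders (Yes) and total rows (Impressions) per Gender/Age cell.
    # The string 'sum' is used instead of the builtin so the column is reliably named 'sum'.
    grouped = (
        df.groupby(['Gender', 'Age'])
        .agg({'Response': ['sum', 'count']})
        .droplevel(0, axis=1)
        .rename(columns={'sum': 'Yes', 'count': 'Impressions'})
        .eval('No = Impressions - Yes')
    )
    grouped.reset_index(inplace=True)
    grouped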
      Gender      Age  Yes  Impressions  No
    0      f  [30-65]    9           38  29
    1      f    [65+]    2            7   5
    2      f    [<30]    8           25  17
    3      m  [30-65]   17           79  62
    4      m    [65+]    2           12  10
    5      m    [<30]    9           39  30
    # A two-column response (successes + failures) tells the binomial GLM the group sizes
    glm_binom = smf.glm('Yes + No ~ Age + Gender', grouped, family=sm.families.Binomial())
    result_grouped = glm_binom.fit()
    print(result_grouped.summary())
                     Generalized Linear Model Regression Results
    ==============================================================================
    Dep. Variable:          ['Yes', 'No']   No. Observations:                    6
    Model:                            GLM   Df Residuals:                        2
    Model Family:                Binomial   Df Model:                            3
    Link Function:                  logit   Scale:                          1.0000
    Method:                          IRLS   Log-Likelihood:                -8.9211
    Date:                Mon, 22 Feb 2021   Deviance:                       1.2641
    Time:                        18:15:15   Pearson chi2:                    0.929
    No. Iterations:                     5
    Covariance Type:            nonrobust
    ================================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
    --------------------------------------------------------------------------------
    Intercept       -2.1741      0.396     -5.494      0.000      -2.950      -1.399
    Age[T.[65+]]    -0.7301      0.786     -0.929      0.353      -2.270       0.810
    Age[T.[<30]]     0.1541      0.432      0.357      0.721      -0.693       1.001
    Gender[T.m]      0.8042      0.439      1.831      0.067      -0.057       1.665
    ================================================================================
2. Logistic Regression with Weights
For this method, we create a new column with the response rate of every group and pass the group sizes (Impressions) to the model as frequency weights.
    # Response rate per group
    grouped['RR'] = grouped['Yes'] / grouped['Impressions']

    # Each group counts as many times as it has impressions
    glm = smf.glm(
        'RR ~ Age + Gender',
        data=grouped,
        family=sm.families.Binomial(),
        freq_weights=np.asarray(grouped['Impressions']),
    )
    result_grouped2 = glm.fit()
    print(result_grouped2.summary())
                     Generalized Linear Model Regression Results
    ==============================================================================
    Dep. Variable:                     RR   No. Observations:                    6
    Model:                            GLM   Df Residuals:                      196
    Model Family:                Binomial   Df Model:                            3
    Link Function:                  logit   Scale:                          1.0000
    Method:                          IRLS   Log-Likelihood:                -59.807
    Date:                Mon, 22 Feb 2021   Deviance:                       1.2641
    Time:                        18:18:16   Pearson chi2:                    0.929
    No. Iterations:                     5
    Covariance Type:            nonrobust
    ================================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
    --------------------------------------------------------------------------------
    Intercept       -2.1741      0.396     -5.494      0.000      -2.950      -1.399
    Age[T.[65+]]    -0.7301      0.786     -0.929      0.353      -2.270       0.810
    Age[T.[<30]]     0.1541      0.432      0.357      0.721      -0.693       1.001
    Gender[T.m]      0.8042      0.439      1.831      0.067      -0.057       1.665
    ================================================================================
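Why do the counts-based fit and the weighted fit return the same coefficients as the row-level model? A short sketch of the log-likelihoods makes it visible. The notation is ours, not taken from the code: g indexes the groups, n_g is Impressions, y_g is Yes, r_g = y_g / n_g is RR, and x_g is the dummy-coded covariate vector of the group.

```latex
% Success probability of group g under the logit link
\[ p_g = \frac{1}{1 + e^{-x_g^\top \beta}} \]

% Row-level data: one Bernoulli term per individual, collected by group
\[ \ell_{\text{rows}}(\beta) = \sum_g \bigl[\, y_g \log p_g + (n_g - y_g) \log(1 - p_g) \,\bigr] \]

% Responders / non-responders: binomial counts add a term that is free of \beta
\[ \ell_{\text{counts}}(\beta) = \sum_g \log \binom{n_g}{y_g} + \ell_{\text{rows}}(\beta) \]

% Response rate with weights n_g: the weights turn rates back into counts
\[ \sum_g n_g \bigl[\, r_g \log p_g + (1 - r_g) \log(1 - p_g) \,\bigr] = \ell_{\text{rows}}(\beta) \]
```

The expressions differ only by terms that do not depend on β (the software may add its own constant to each), so they share the same maximiser and the same curvature. That is why the coefficients and standard errors agree across the summaries while the reported Log-Likelihood values (-85.502, -8.9211 and -59.807) do not.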
3. Expand the Aggregate Data
Lastly, we can “ungroup” the data so that the dependent variable is binary again, and then run a logistic regression as usual.
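    # Turn each count into a list of that many 1s (Yes) or 0s (No)
    grouped['No'] = grouped['No'].apply(lambda x: [0] * x)
    grouped['Yes'] = grouped['Yes'].apply(lambda x: [1] * x)
    # Join the two lists per group, then give every element its own row
    grouped['Response'] = grouped['Yes'] + grouped['No']
    expanded = grouped.explode("Response")[['Gender', 'Age', 'Response']]
    expanded['Response'] = expanded['Response'].astype(int)
    expanded.head()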
      Gender      Age  Response
    0      f  [30-65]         1
    0      f  [30-65]         1
    0      f  [30-65]         1
    0      f  [30-65]         1
    0      f  [30-65]         1
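The expansion above builds Python lists and then explodes them. As an alternative sketch (a different technique, not the one used in this post), the same result can be reached with `melt` and `Index.repeat`, which avoids storing lists inside the DataFrame and leaves the original Yes and No columns untouched. The counts are again typed in by hand from the aggregated table, and `agg`, `long_counts` and `expanded_alt` are hypothetical names.

```python
import pandas as pd

# Aggregated counts, typed in by hand for a deterministic example.
agg = pd.DataFrame({
    "Gender": ["f", "f", "f", "m", "m", "m"],
    "Age": ["[30-65]", "[65+]", "[<30]"] * 2,
    "Yes": [9, 2, 8, 17, 2, 9],
    "No": [29, 5, 17, 62, 10, 30],
})

# Long format: one row per (group, outcome) with its count n.
long_counts = agg.melt(
    id_vars=["Gender", "Age"],
    value_vars=["Yes", "No"],
    var_name="Outcome",
    value_name="n",
)

# Repeat every (group, outcome) row n times, then code the outcome as 0/1.
expanded_alt = long_counts.loc[long_counts.index.repeat(long_counts["n"])].copy()
expanded_alt["Response"] = (expanded_alt["Outcome"] == "Yes").astype(int)
expanded_alt = expanded_alt[["Gender", "Age", "Response"]].reset_index(drop=True)

# Sanity check: aggregating again should give back the original Yes counts.
print(expanded_alt.groupby(["Gender", "Age"])["Response"].agg(["sum", "count"]))
```

The resulting frame has one row per individual and can be passed to `smf.logit` in the same way as `expanded`.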
    model = smf.logit('Response ~ Gender + Age', data=expanded)
    result = model.fit()
    print(result.summary())
                               Logit Regression Results
    ==============================================================================
    Dep. Variable:               Response   No. Observations:                  200
    Model:                          Logit   Df Residuals:                      196
    Method:                           MLE   Df Model:                            3
    Date:                Mon, 22 Feb 2021   Pseudo R-squ.:                 0.02765
    Time:                        18:29:33   Log-Likelihood:                -85.502
    converged:                       True   LL-Null:                       -87.934
    Covariance Type:            nonrobust   LLR p-value:                    0.1821
    ================================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
    --------------------------------------------------------------------------------
    Intercept       -2.1741      0.396     -5.494      0.000      -2.950      -1.399
    Gender[T.m]      0.8042      0.439      1.831      0.067      -0.057       1.665
    Age[T.[65+]]    -0.7301      0.786     -0.929      0.353      -2.270       0.810
    Age[T.[<30]]     0.1541      0.432      0.357      0.721      -0.693       1.001
    ================================================================================