Composite Estimators in scikit-learn


To build a composite estimator in scikit-learn, transformers are usually combined with other transformers and/or predictors (such as classifiers or regressors). The most common tool for composing estimators is a Pipeline, often used in combination with ColumnTransformer or FeatureUnion, which concatenate the outputs of multiple transformers into a single composite feature space.
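As a quick illustration of how FeatureUnion concatenates transformer outputs (it isn't used in the demo below, and the toy array here is made up for illustration), two transformers fit side by side produce a stacked feature matrix:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(5, 4))

# Two principal components concatenated with the four standardized originals.
union = FeatureUnion([("pca", PCA(n_components=2)), ("scaled", StandardScaler())])
print(union.fit_transform(X).shape)   # (5, 6)

ColumnTransformer, used below, works the same way, except that it routes each transformer to a named subset of columns rather than to the full input.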

In this notebook, I demonstrate how to create a composite estimator using a synthetic dataset.

"""
Create synthetic dataset for composite estimator demo.
"""
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

np.set_printoptions(suppress=True, precision=8)
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

rng = np.random.default_rng(516)

n = 1000

df = pd.DataFrame({
    "A": rng.gamma(shape=2, scale=50000, size=n),
    "B": rng.normal(loc=1000, scale=250, size=n),
    "C": rng.choice(["red", "green", "blue"], p=[.7, .2, .1], size=n),
    "D": rng.choice(["left", "right", None], p=[.475, .475, .05], size=n),
    "E": rng.poisson(17, size=n),
    "target": rng.choice([0., 1.], p=[.8, .2], size=n)
})

# Set randomly selected samples to NaN in columns A, B and E. Note that
# rng.choice samples with replacement, so a draw can hit the same row twice.
df.loc[rng.choice(n, size=10), "A"] = np.nan
df.loc[rng.choice(n, size=17), "B"] = np.nan
df.loc[rng.choice(n, size=5), "E"] = np.nan

# Create train-validation split. 
y = df["target"]
dftrain, dfvalid, ytrain, yvalid = train_test_split(df, y, test_size=.05, stratify=y)

print(f"dftrain.shape: {dftrain.shape}")
print(f"dfvalid.shape: {dfvalid.shape}")
print(f"prop. ytrain : {ytrain.sum() / dftrain.shape[0]:.4f}")
print(f"prop. yvalid : {yvalid.sum() / dfvalid.shape[0]:.4f}")
dftrain.shape: (950, 6)
dfvalid.shape: (50, 6)
prop. ytrain : 0.2389
prop. yvalid : 0.2400
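As a quick sanity check that the NaNs landed where intended (since rng.choice samples indices with replacement by default, the per-column counts can come in slightly under 10, 17 and 5):

# Count missing values per column. D's missingness comes from the None
# level sampled above, not from the NaN assignments.
print(df.isna().sum())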

For this dataset, we'll use ColumnTransformer to create separate pre-processing pipelines for continuous and categorical features. For continuous features, we impute missing values and standardize each feature so that all are on the same scale. For categorical features, we impute missing values and one-hot encode, creating k-1 features for a variable with k distinct levels. As the last step, a LogisticRegression classifier with an elastic net penalty is included.
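Before assembling the full pipeline, it may help to see the k-1 encoding in isolation. A minimal sketch (the demo frame here is made up for illustration): with three levels, drop="first" yields two indicator columns, the alphabetically first level ("blue") serving as the dropped baseline.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

demo = pd.DataFrame({"C": ["red", "green", "blue", "red"]})
enc = OneHotEncoder(drop="first", sparse_output=False)

# 4 rows x 2 columns: "blue" is the baseline, encoded as [0, 0].
print(enc.fit_transform(demo))
print(enc.get_feature_names_out())   # ['C_green' 'C_red']

The code to accomplish the full composite estimator is given below: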

from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables IterativeImputer
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression


# LogisticRegression with an elastic net penalty. saga is the only solver
# that supports elasticnet, and l1_ratio must be set when penalty="elasticnet"
# (0.5 weights the L1 and L2 terms equally).
lr = LogisticRegression(
    penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000
    )

# Identify continuous and categorical features. 
continuous = ["A", "B", "E"]
categorical = ["C", "D"]

continuous_transformer = Pipeline(steps=[
    ("imputer", IterativeImputer()),
    ("scaler" , StandardScaler())
    ])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(drop="first", sparse_output=False, handle_unknown="error"))
    ])

preprocessor = ColumnTransformer(transformers=[
    ("continuous" , continuous_transformer, continuous),  
    ("categorical", categorical_transformer, categorical)
    ], remainder="drop"
    )

pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", lr)
    ]).set_output(transform="pandas")

pipeline
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('continuous',
                                                  Pipeline(steps=[('imputer',
                                                                   IterativeImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['A', 'B', 'E']),
                                                 ('categorical',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder(drop='first',
                                                                                 sparse_output=False))]),
                                                  ['C', 'D'])])),
                ('classifier',
                 LogisticRegression(l1_ratio=0.5, max_iter=5000,
                                    penalty='elasticnet', solver='saga'))])

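From here, the composite estimator behaves like any other scikit-learn estimator. A minimal usage sketch (the printed accuracy will vary with the random split):

# Fit the full pipeline on the training split. ColumnTransformer selects only
# A, B, E, C and D, so the extra "target" column in dftrain is dropped.
pipeline.fit(dftrain, ytrain)

# Names of the post-transformation features feeding the classifier.
print(pipeline[:-1].get_feature_names_out())

print(f"validation accuracy: {pipeline.score(dfvalid, yvalid):.4f}")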