Composite Estimators in scikit-learn
Want to share your content on python-bloggers? click here.
To build a composite estimator in scikit-learn, transformers are usually combined with other transformers and/or predictors (such as classifiers or regressors). The most common tool for composing estimators is a Pipeline. The Pipeline is often used in combination with ColumnTransformer or FeatureUnion, which concatenate the output of transformers into a composite feature space.
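FeatureUnion is not used in the rest of this post, so here is a minimal sketch of what "concatenating the output of transformers" looks like. The toy array and the choice of StandardScaler and PCA are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 100 samples, 4 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# Each transformer sees the full input; the outputs are stacked column-wise.
union = FeatureUnion(transformer_list=[
    ("scaled", StandardScaler()),   # contributes 4 columns
    ("pca", PCA(n_components=2)),   # contributes 2 columns
])

Xt = union.fit_transform(X)
print(Xt.shape)  # (100, 6)
```

The difference from ColumnTransformer is that FeatureUnion passes every column to every transformer, whereas ColumnTransformer routes selected columns to each one.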
In this notebook, I demonstrate how to create a composite estimator based on a synthetic dataset.

    """
    Create synthetic dataset for composite estimator demo.
    """
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    np.set_printoptions(suppress=True, precision=8)
    pd.options.mode.chained_assignment = None
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', None)

    rng = np.random.default_rng(516)

    n = 1000

    df = pd.DataFrame({
        "A": rng.gamma(shape=2, scale=50000, size=n),
        "B": rng.normal(loc=1000, scale=250, size=n),
        "C": rng.choice(["red", "green", "blue"], p=[.7, .2, .1], size=n),
        "D": rng.choice(["left", "right", None], p=[.475, .475, .05], size=n),
        # Cast to float so NaN can be assigned below without changing the dtype.
        "E": rng.poisson(17, size=n).astype(float),
        "target": rng.choice([0., 1.], p=[.8, .2], size=n)
    })

    # Set selected samples to NaN in A, B and E.
    # np.nan is used because the np.NaN alias was removed in NumPy 2.0.
    df.loc[rng.choice(n, size=10), "A"] = np.nan
    df.loc[rng.choice(n, size=17), "B"] = np.nan
    df.loc[rng.choice(n, size=5), "E"] = np.nan

    # Create train-validation split.
    y = df["target"]
    dftrain, dfvalid, ytrain, yvalid = train_test_split(df, y, test_size=.05, stratify=y)

    print(f"dftrain.shape: {dftrain.shape}")
    print(f"dfvalid.shape: {dfvalid.shape}")
    print(f"prop. ytrain : {ytrain.sum() / dftrain.shape[0]:.4f}")
    print(f"prop. yvalid : {yvalid.sum() / dfvalid.shape[0]:.4f}")
    dftrain.shape: (950, 6)
    dfvalid.shape: (50, 6)
    prop. ytrain : 0.2389
    prop. yvalid : 0.2400
For this dataset, we’ll use ColumnTransformer to create separate pre-processing pipelines for continuous and categorical features. For continuous features, we impute missing values and standardize each feature to be on the same scale. For categorical features, we one-hot encode, creating k-1 features for a variable with k distinct levels. The categorical pipeline has no imputation step, so the missing values in D are left to OneHotEncoder, which treats a missing value as an additional category. As the last step, a LogisticRegression classifier with an elastic net penalty is included. The code to accomplish this is given below:
    from sklearn.compose import ColumnTransformer
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before importing IterativeImputer)
    from sklearn.impute import IterativeImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.linear_model import LogisticRegression

    # LogisticRegression model with elastic net penalty.
    lr = LogisticRegression(
        penalty="elasticnet",
        solver="saga",
        max_iter=5000
    )

    # Identify continuous and categorical features.
    continuous = ["A", "B", "E"]
    categorical = ["C", "D"]

    # Continuous features: impute, then standardize.
    continuous_transformer = Pipeline(steps=[
        ("imputer", IterativeImputer()),
        ("scaler" , StandardScaler())
    ])

    # Categorical features: one-hot encode, dropping the first level of each.
    categorical_transformer = Pipeline(steps=[
        ("onehot", OneHotEncoder(drop="first", sparse_output=False, handle_unknown="error"))
    ])

    preprocessor = ColumnTransformer(transformers=[
        ("continuous" , continuous_transformer, continuous),
        ("categorical", categorical_transformer, categorical)
    ], remainder="drop"
    )

    pipeline = Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("classifier", lr)
    ]).set_output(transform="pandas")

    pipeline
Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('continuous', Pipeline(steps=[('imputer', IterativeImputer()), ('scaler', StandardScaler())]), ['A', 'B', 'E']), ('categorical', Pipeline(steps=[('onehot', OneHotEncoder(drop='first', sparse_output=False))]), ['C', 'D'])])), ('classifier', LogisticRegression(max_iter=5000, penalty='elasticnet', solver='saga'))])
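In the next cell, RandomizedSearchCV is run against two hyperparameters: l1_ratio and C. Notice that we only call fit once, on the search object mdl that wraps the pipeline. Within each of the k cross-validation splits, the pre-processing steps are fit on that split's training folds only and then applied to the held-out fold, so the transformers never see the samples they are scored on.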
    from sklearn.model_selection import RandomizedSearchCV
    from scipy.stats import uniform

    # Hyperparameters to search over.
    param_grid = {
        "classifier__l1_ratio": uniform(loc=0, scale=1),
        "classifier__C": uniform(loc=0, scale=10)
    }

    mdl = RandomizedSearchCV(
        pipeline, param_grid, scoring="accuracy", cv=5, verbose=2,
        n_iter=3, random_state=516
    )

    mdl.fit(dftrain.drop("target", axis=1), ytrain)

    print(f"\nbest parameters: {mdl.best_params_}")
    Fitting 5 folds for each of 3 candidates, totalling 15 fits
    [CV] END classifier__C=8.115660497752215, classifier__l1_ratio=0.7084090612742915; total time= 0.0s
    [CV] END classifier__C=8.115660497752215, classifier__l1_ratio=0.7084090612742915; total time= 0.0s
    [CV] END classifier__C=8.115660497752215, classifier__l1_ratio=0.7084090612742915; total time= 0.0s
    [CV] END classifier__C=8.115660497752215, classifier__l1_ratio=0.7084090612742915; total time= 0.0s
    [CV] END classifier__C=8.115660497752215, classifier__l1_ratio=0.7084090612742915; total time= 0.0s
    [CV] END classifier__C=1.115284252761577, classifier__l1_ratio=0.5667878644753359; total time= 0.0s
    [CV] END classifier__C=1.115284252761577, classifier__l1_ratio=0.5667878644753359; total time= 0.0s
    [CV] END classifier__C=1.115284252761577, classifier__l1_ratio=0.5667878644753359; total time= 0.0s
    [CV] END classifier__C=1.115284252761577, classifier__l1_ratio=0.5667878644753359; total time= 0.0s
    [CV] END classifier__C=1.115284252761577, classifier__l1_ratio=0.5667878644753359; total time= 0.0s
    [CV] END classifier__C=7.927782545875722, classifier__l1_ratio=0.8376069301429002; total time= 0.0s
    [CV] END classifier__C=7.927782545875722, classifier__l1_ratio=0.8376069301429002; total time= 0.0s
    [CV] END classifier__C=7.927782545875722, classifier__l1_ratio=0.8376069301429002; total time= 0.0s
    [CV] END classifier__C=7.927782545875722, classifier__l1_ratio=0.8376069301429002; total time= 0.0s
    [CV] END classifier__C=7.927782545875722, classifier__l1_ratio=0.8376069301429002; total time= 0.0s

    best parameters: {'classifier__C': 8.115660497752215, 'classifier__l1_ratio': 0.7084090612742915}
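The verbose log only shows timings. To compare how the sampled candidates actually scored, the cv_results_ attribute of the fitted search object can be loaded into a DataFrame. The sketch below is self-contained and uses a hypothetical dataset from make_classification with a simplified pipeline, so the numbers will not match the run above:

```python
import pandas as pd
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data; the post's own dataset is not reused here.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(penalty="elasticnet", solver="saga", max_iter=5000)),
])

search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "classifier__l1_ratio": uniform(loc=0, scale=1),
        # loc=0.01 keeps C strictly positive (an assumption of this sketch).
        "classifier__C": uniform(loc=0.01, scale=10),
    },
    scoring="accuracy", cv=5, n_iter=3, random_state=516,
)
search.fit(X, y)

# One row per sampled candidate, best-ranked first.
cols = ["param_classifier__C", "param_classifier__l1_ratio",
        "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[cols].sort_values("rank_test_score")
print(results)
```

With only three candidates the table is short, but the same pattern scales to larger searches and makes it easy to see whether the best score is meaningfully better than the rest.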
When a pipeline is tuned with RandomizedSearchCV and refit=True (the default), the pipeline is refit on the full training set using the best parameters found during the search. The best_estimator_ attribute of the RandomizedSearchCV object holds this refit pipeline, with the classifier's hyperparameters set to the values that scored best on the scoring measure:

    mdl.best_estimator_
Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('continuous', Pipeline(steps=[('imputer', IterativeImputer()), ('scaler', StandardScaler())]), ['A', 'B', 'E']), ('categorical', Pipeline(steps=[('onehot', OneHotEncoder(drop='first', sparse_output=False))]), ['C', 'D'])])), ('classifier', LogisticRegression(C=8.115660497752215, l1_ratio=0.7084090612742915, max_iter=5000, penalty='elasticnet', solver='saga'))])
Once the optimal model has been determined, we can pass our validation/test data into the pipeline to generate predicted probabilities for unseen data:
    # Assessing model performance on unseen data.
    ypred = mdl.predict_proba(dfvalid)[:,1]
    ypred
array([0.23803061, 0.23987571, 0.22497394, 0.2360284 , 0.21692351, 0.24979123, 0.22930123, 0.23805811, 0.18848299, 0.2269307 , 0.18739627, 0.21963412, 0.24601412, 0.24592807, 0.26313459, 0.19509853, 0.22403892, 0.2644474 , 0.25217899, 0.25114582, 0.25275472, 0.25602435, 0.23526247, 0.22682578, 0.21364797, 0.31097165, 0.25706994, 0.26917858, 0.21912074, 0.14953379, 0.2521859 , 0.19803027, 0.23446292, 0.20239688, 0.22329016, 0.23452063, 0.19225738, 0.1971433 , 0.32557197, 0.2366244 , 0.21352434, 0.27294373, 0.25589429, 0.23278834, 0.24858346, 0.2058699 , 0.17559173, 0.24556249, 0.22534097, 0.22728177])
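Predicted probabilities on their own do not say how good the model is. The target in this post is drawn independently of the features, which is consistent with the probabilities above sitting close to the base rate. A minimal sketch of scoring held-out probabilities with roc_auc_score and log_loss follows. It uses a hypothetical make_classification dataset and a plain LogisticRegression so that it runs on its own:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data (roughly 80/20), standing in for dftrain/dfvalid.
X, y = make_classification(
    n_samples=500, n_features=5, weights=[0.8, 0.2], random_state=0
)
Xtr, Xva, ytr, yva = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)

# Probability of the positive class for each held-out sample.
proba = clf.predict_proba(Xva)[:, 1]

auc = roc_auc_score(yva, proba)   # ranking quality, 0.5 is chance level
ll = log_loss(yva, proba)         # penalizes confident wrong probabilities
print(f"ROC AUC : {auc:.4f}")
print(f"log loss: {ll:.4f}")
```

Applied to this post, the analogous calls would take yvalid and ypred as arguments.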
In some cases, we may want to pickle our model to share with a third party for some downstream task. This is straightforward:
    import pickle

    with open("my-model.pkl", "wb") as fpkl:
        pickle.dump(mdl, fpkl, protocol=pickle.HIGHEST_PROTOCOL)
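On the receiving end, the model is restored with pickle.load. The round trip below is a sketch on a small hypothetical pipeline, written to a temporary directory so it leaves no files behind, and it checks that the restored object reproduces the original predictions:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data and model, standing in for mdl.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (X[:, 0] > 0).astype(int)

model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "my-model.pkl")

    # Serialize the fitted pipeline.
    with open(path, "wb") as fpkl:
        pickle.dump(model, fpkl, protocol=pickle.HIGHEST_PROTOCOL)

    # Restore it, as the recipient would.
    with open(path, "rb") as fpkl:
        restored = pickle.load(fpkl)

# The restored pipeline should give identical probabilities.
same = np.allclose(model.predict_proba(X), restored.predict_proba(X))
print(same)  # True
```

Two caveats apply: a pickle should only be loaded from a trusted source, since unpickling can execute arbitrary code, and the recipient should use the same scikit-learn version that produced the file.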