Composite Estimators in scikit-learn

To build a composite estimator in scikit-learn, transformers are usually combined with other transformers and/or predictors (such as classifiers or regressors). The most common tool used for composing estimators is a Pipeline. The Pipeline is often used in combination with ColumnTransformer or FeatureUnion which concatenate the output of transformers into a composite feature space.

In this notebook, I demonstrate how to create a composite estimator based on a synthetic dataset.

"""
Create synthetic dataset for composite estimator demo.
"""
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

np.set_printoptions(suppress=True, precision=8)
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

rng = np.random.default_rng(516)

n = 1000

df = pd.DataFrame({
    "A": rng.gamma(shape=2, scale=50000, size=n),
    "B": rng.normal(loc=1000, scale=250, size=n),
    "C": rng.choice(["red", "green", "blue"], p=[.7, .2, .1], size=n),
    "D": rng.choice(["left", "right", None], p=[.475, .475, .05], size=n),
    "E": rng.poisson(17, size=n),
    "target": rng.choice([0., 1.], p=[.8, .2], size=n)
})

# Set a selected samples to NaN in A, B and C. 
df.loc[rng.choice(n, size=10),"A"] = np.NaN
df.loc[rng.choice(n, size=17),"B"] = np.NaN
df.loc[rng.choice(n, size=5),"E"] = np.NaN

# Create train-validation split. 
y = df["target"]
dftrain, dfvalid, ytrain, yvalid = train_test_split(df, y, test_size=.05, stratify=y)

print(f"dftrain.shape: {dftrain.shape}")
print(f"dfvalid.shape: {dfvalid.shape}")
print(f"prop. ytrain : {ytrain.sum() / dftrain.shape[0]:.4f}")
print(f"prop. yvalid : {yvalid.sum() / dfvalid.shape[0]:.4f}")

dftrain.shape: (950, 6)
dfvalid.shape: (50, 6)
prop. ytrain : 0.2389
prop. yvalid : 0.2400

For this dataset, we’ll use ColumnTransformer to create separate pre-processing pipelines for continuous and categorical features. For continuous features, we impute missing values and standardize each to be on the same scale. For categorical features, we impute missing values and one-hot encode, creating k-1 features for a variable with k distinct levels. As the last step a LogisticRegression classifier is included with elastic net penatly. The code to accomplish this is given below:

Python-bloggers

Data science news and tutorials - contributed by Python bloggers

Composite Estimators in scikit-learn

Related