  • 0 – Import packages that will be used in the demo
  • 1 – Data-wrangling (using the querier)
  • 2 – Modeling/Hyperparameter tuning (using mlsauce and GPopt)
  • 3 – Explain model’s decisions (using the-teller)

0 – Import packages

!pip install querier # A query language for Python Data Frames (part of Techtonique)
!pip install mlsauce # Miscellaneous Statistical/Machine Learning stuff (part of Techtonique)
!pip install GPopt # Bayesian optimization using Gaussian Process Regression (part of Techtonique)
!pip install the-teller # Model-agnostic Statistical/Machine Learning explainability (part of Techtonique)
!pip install scikit-learn
!pip install SQLAlchemy
!pip install matplotlib==3.1.3 # this version is required
import numpy as np
import matplotlib.pyplot as plt
import sqlite3
import pandas as pd
import sqlalchemy
import matplotlib.style as style
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, RepeatedKFold
from sklearn.metrics import classification_report, confusion_matrix
from time import time

import querier as qr 
import GPopt as gp 
import mlsauce as ms
import teller as tr 

1 – Data-wrangling (using the querier)

Remark: the querier is experimental; so far, some of its verbs have only been tested on macOS and Linux.

breast_cancer = load_breast_cancer(as_frame=True)

Create a data frame breast_cancer_df whose column names contain no spaces, so that the querier can use them:

breast_cancer_df = breast_cancer.frame
breast_cancer_df_columns = breast_cancer_df.columns
breast_cancer_df.columns = ["_".join(elt.split()) for elt in breast_cancer_df_columns]
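The renaming matters because the querier receives column names inside a query string, where a name such as "mean radius" would presumably be hard to tell apart from two separate tokens. A minimal sketch of the same transformation in plain Python, on a handful of the dataset's column names:

```python
# A few column names as scikit-learn provides them (subset, for illustration)
original_names = ["mean radius", "mean concave points", "worst area", "target"]

# Same expression as above: split on whitespace, then join the pieces with "_"
clean_names = ["_".join(name.split()) for name in original_names]

print(clean_names)
# ['mean_radius', 'mean_concave_points', 'worst_area', 'target']
```

Names without spaces (here, target) pass through unchanged.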

Querying the data frame with the querier:


qr.select(breast_cancer_df, "mean_radius, mean_texture, mean_perimeter, mean_area, target", 
          limit=4, random=True)


qr.filtr(breast_cancer_df, "(target == 1) & (mean_radius >= 10)")


breast_cancer_df['target'] = breast_cancer_df['target'].astype(object)
qrobj = qr.Querier(df=breast_cancer_df)

request_1 = qrobj.select("mean_radius,\
                            group_by = "target")            
   avg_mean_radius  avg_mean_concave_points  target
0        17.462830                 0.087990       0
1        12.146524                 0.025717       1

2 – Modeling/Hyperparameter tuning (using mlsauce and GPopt)

X = breast_cancer.data
y = breast_cancer.target
# split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, 
                                                    test_size=0.2, random_state=123)

The chosen model is LSBoost (mlsauce's LSBoostClassifier).
Hyperparameter tuning:

def lsboost_cv(X_train, y_train, 
    estimator = ms.LSBoostClassifier(n_estimators=n_estimators, 
                                     seed=seed, verbose=0)

    cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=123)
    return -cross_val_score(estimator, X_train, y_train,
                          scoring='accuracy', cv=cv, n_jobs=4).mean()

def optimize_lsboost(X_train, y_train):

    def crossval_objective(x):
        return lsboost_cv(            
    gp_opt = gp.GPOpt(objective_func=crossval_objective, 
                      lower_bound = np.array([  50, -6,   2, -2, 0.5, 0.5,   0, -6]), 
                      upper_bound = np.array([1000, -1, 250,  5,   1,   1, 0.7, -1]),
                      n_init=10, n_iter=90, seed=123)    
    return {'parameters': gp_opt.optimize(verbose=2), 'opt_object':  gp_opt}
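The bounds above have eight entries, one per tuned hyperparameter, and three of the ranges are negative (-6 to -1, -2 to 5, -6 to -1), which suggests that those entries are base-10 exponents. The sketch below shows how a point of the search space could be mapped to keyword arguments. The helper decode is hypothetical; n_estimators, n_hidden_features and tolerance appear in this post, while the other names and the ordering are assumptions made for illustration.

```python
def decode(x):
    """Map a point of the search space to keyword arguments (hypothetical helper).

    The hyperparameter names and their positions in x are assumptions.
    """
    return {
        "n_estimators": int(x[0]),       # searched in [50, 1000]
        "learning_rate": 10 ** x[1],     # exponent searched in [-6, -1]
        "n_hidden_features": int(x[2]),  # searched in [2, 250]
        "reg_lambda": 10 ** x[3],        # exponent searched in [-2, 5]
        "row_sample": x[4],              # searched in [0.5, 1]
        "col_sample": x[5],              # searched in [0.5, 1]
        "dropout": x[6],                 # searched in [0, 0.7]
        "tolerance": 10 ** x[7],         # exponent searched in [-6, -1]
    }

# Made-up point inside the bounds, for illustration only
params = decode([385.7, -2.0, 40.2, 1.0, 0.8, 0.9, 0.1, -5.0])
print(params["n_estimators"], params["n_hidden_features"])
```

Searching over exponents rather than raw values lets the optimizer cover several orders of magnitude evenly. Note also that lsboost_cv returns the negative of the mean accuracy, which is consistent with an optimizer that minimizes its objective.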

res_optimize_lsboost = optimize_lsboost(X_train, y_train)
best_parameters = res_optimize_lsboost['parameters'][0]
start = time()

estimator_breast_cancer = ms.LSBoostClassifier(n_estimators=int(best_parameters[0]),  
                                               seed=123, verbose=0).fit(X_train, y_train)

print(f"\n\n Test set accuracy: {estimator_breast_cancer.score(X_test, y_test)}")
print(f"\n Elapsed: {time() - start}")
 Test set accuracy: 0.9824561403508771

 Elapsed: 3.462388038635254
y_pred = estimator_breast_cancer.predict(X_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       1.00      0.95      0.98        42
           1       0.97      1.00      0.99        72

    accuracy                           0.98       114
   macro avg       0.99      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114
print(confusion_matrix(y_test, y_pred))
[[40  2]
 [ 0 72]]
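The figures in the classification report can be checked by hand from the confusion matrix. The sketch below recomputes them in plain Python, assuming scikit-learn's convention that rows are true classes and columns are predicted classes:

```python
# Confusion matrix printed above: rows = true classes, columns = predictions
cm = [[40, 2],
      [0, 72]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

report = {}
for k in (0, 1):
    predicted_k = cm[0][k] + cm[1][k]  # column sum: observations predicted as k
    actual_k = sum(cm[k])              # row sum: observations truly in class k
    precision = cm[k][k] / predicted_k
    recall = cm[k][k] / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    report[k] = (precision, recall, f1, actual_k)
    print(f"class {k}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={actual_k}")

print(f"accuracy={accuracy:.4f}")
```

Since scikit-learn encodes this dataset as 0 = malignant and 1 = benign, the two errors in the first row are malignant tumors predicted as benign.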

3 – Explain model’s decisions (using the-teller)

# creating the explainer for class = 1 (probability of a benign tumor: scikit-learn encodes 0 = malignant, 1 = benign)
expr = tr.Explainer(obj=estimator_breast_cancer, y_class=1, normalize=False) 
# fitting the explainer on the test set
expr.fit(X_test.values, y_test.values, X_names=list(breast_cancer.feature_names))
Calculating the effects...
30/30 [██████████████████████████████] - 3s 86ms/step

                                n_estimators=385, n_hidden_features=40,
                                tolerance=2.2258898141256302e-05, verbose=0),
# summary of results for the model (must use matplotlib==3.1.3)

[Figure: Average effects]

# Heterogeneity of effects (must use matplotlib==3.1.3)

[Figure: Distribution of effects]

# summary of results for the model
Heterogeneity of marginal effects: 
                             mean       std    median       min       max
fractal dimension error  1.082723  0.266851  0.868091  0.819966  1.456801
mean fractal dimension   0.652445  0.087281  0.586653  0.556320  0.782740
compactness error        0.310099  0.035665  0.283370  0.269509  0.360864
concavity error          0.097867  0.023285  0.079271  0.071594  0.129780
symmetry error           0.047409  0.058531 -0.000141 -0.031695  0.128003
mean compactness         0.021578  0.007013  0.016079  0.011121  0.032218
texture error            0.001695  0.000844  0.001001  0.000533  0.002907
worst area              -0.000008  0.000001 -0.000009 -0.000010 -0.000006
mean area               -0.000012  0.000002 -0.000013 -0.000014 -0.000009
area error              -0.000015  0.000016 -0.000027 -0.000032  0.000007
worst perimeter         -0.000197  0.000016 -0.000206 -0.000222 -0.000162
mean perimeter          -0.000231  0.000019 -0.000243 -0.000261 -0.000191
worst texture           -0.001210  0.000034 -0.001216 -0.001327 -0.001085
mean texture            -0.001278  0.000052 -0.001297 -0.001438 -0.001125
perimeter error         -0.001409  0.000302 -0.001624 -0.001784 -0.000937
worst radius            -0.001675  0.000083 -0.001717 -0.001825 -0.001448
mean radius             -0.001735  0.000126 -0.001813 -0.001941 -0.001450
worst compactness       -0.010538  0.001996 -0.011886 -0.014363 -0.006346
radius error            -0.018356  0.002165 -0.019816 -0.021018 -0.014330
worst concavity         -0.035444  0.001509 -0.036161 -0.038979 -0.031021
mean smoothness         -0.071665  0.011880 -0.078204 -0.105191 -0.033539
mean concavity          -0.073131  0.004785 -0.075833 -0.081392 -0.061772
mean symmetry           -0.111694  0.005669 -0.113490 -0.131086 -0.092818
worst symmetry          -0.140455  0.002756 -0.140564 -0.150495 -0.129108
worst concave points    -0.149019  0.003037 -0.149390 -0.158989 -0.135773
worst fractal dimension -0.177296  0.018744 -0.188971 -0.212246 -0.141802
mean concave points     -0.208505  0.004745 -0.208881 -0.222427 -0.188558
worst smoothness        -0.321451  0.006868 -0.321642 -0.345280 -0.295260
smoothness error        -0.645636  0.181327 -0.781757 -0.871939 -0.381126
concave points error    -0.766840  0.024979 -0.772391 -0.845261 -0.674872
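Each row of the table summarizes how the model's response changes when one feature moves slightly, across the observations of the test set. This post does not say how the-teller computes these effects; one common approach, assumed here purely for illustration, is a central finite difference evaluated at every observation and then summarized. The model and data below are made up.

```python
from statistics import mean, median, stdev

def toy_model(x):
    """Hypothetical stand-in for a fitted model's predicted score."""
    return 2.0 * x[0] - 3.0 * x[1] + x[0] * x[1]

def marginal_effect(f, x, j, h=1e-4):
    """Central finite-difference approximation of df/dx_j at point x."""
    up = list(x)
    up[j] += h
    down = list(x)
    down[j] -= h
    return (f(up) - f(down)) / (2 * h)

# Made-up observations with two features each
X = [[0.0, 1.0], [1.0, 2.0], [2.0, 0.5], [3.0, 1.5]]

effects_by_feature = {}
for j, name in enumerate(["feature_0", "feature_1"]):
    effects = [marginal_effect(toy_model, x, j) for x in X]
    effects_by_feature[name] = effects
    # Same summary columns as the table: mean, std, median, min, max
    print(name, round(mean(effects), 3), round(stdev(effects), 3),
          round(median(effects), 3), round(min(effects), 3), round(max(effects), 3))
```

A large standard deviation relative to the mean, as for symmetry error in the table, indicates that the effect of the feature varies a lot from one observation to another.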

The notebook for reproducing this workflow can be found here.

