mlsauce version 0.8.10: Statistical/Machine Learning with Python and R
Want to share your content on python-bloggers? click here.
This week, among other things, I’ve been working on updating mlsauce for both Python and R (that’s version 0.8.10 of the package).
mlsauce is a package for Statistical/Machine Learning that contains in particular:
- AdaOpt, a probabilistic classifier which uses nearest neighbors to obtain predictions. Interestingly, with AdaOpt, one neighbor can suffice to obtain a high accuracy.
- LSBoost
, a gradient boosting algorithm based on randomized nnetworks (similar to XGBoost, LightGBM or Catboost, but not using Gradient Boosted Decision Trees a.k.a GBDT).
Not a lot of GitHub stars for mlsauce’s repository but someday, to my surprise, I noticed that mlsauce.LSBoost’s 2020 “paper” had more than 2000 reads on ResearchGate. Well, people, starring the repository on GitHub is pretty cool too.
Then, I had a ResearchGate recommendation on that same mlsauce.LSBoost’s “paper”, and I told to myself: ‘I’ve probably been missing something in this work for 3 years’. Yes I know I designed it from beginning to end, but some people can be using it better than I did so far!
Indeed, I’ve never obtained great results with mlsauce.LSBoost IN THE PAST. Eventually, as of today, my feelings are: mlsauce is fast, thanks to Cython (which is not easy to package though, IMHO), and quite competitive when well-tuned; as you’ll see below.
In this post, I revisit mlsauce, with examples of use of AdaOpt and LSBoostclassifier.AdaOpt is used for digits recognition (and seems to be doing well on this type of tasks, more on this in the future). LSBoostclassifier is used on toy examples from scikit-learn as done in the paper, but with better hyperparameters’ tuning. For both models, AdaOpt and LSBoostclassifier, a distribution of test set accuracy is presented.
Contents
- Install and import Python packages
AdaOptPython — with test set accuracy’s distributionLSBoostclassifierPython — with test set accuracy’s distribution- R example
A notebook can also be found here: https://github.com/Techtonique/mlsauce/blob/master/mlsauce/demo/thierrymoudiki_051123_GPopt_mlsauce_classification.ipynb.
1 – Install and import Python packages
!pip install mlsauce
!pip install GPopt # a package that implements Bayesian optimization, used here for hyperparameters' tuning
import GPopt as gp import mlsauce as ms import numpy as np from sklearn.datasets import load_breast_cancer, load_wine, load_digits from sklearn.model_selection import train_test_split, cross_val_score from sklearn.metrics import classification_report from time import time
2 – AdaOpt Python – with test set accuracy’s distribution
import numpy as np
from sklearn.datasets import load_digits # a dataset for digits recognition
from sklearn.model_selection import train_test_split, cross_val_score
from time import time
digits = load_digits()
Z = digits.data
t = digits.target
np.random.seed(13239)
X_train, X_test, y_train, y_test = train_test_split(Z, t,
test_size=0.2)
obj = ms.AdaOpt(n_iterations=50,
learning_rate=0.3,
reg_lambda=0.1,
reg_alpha=0.5,
eta=0.01,
gamma=0.01,
tolerance=1e-4,
row_sample=1,
k=1,
n_jobs=3, type_dist="euclidean", verbose=1)
start = time()
obj.fit(X_train, y_train)
print(f"\n\n Elapsed train: {time()-start} \n")
start = time()
print(f"\n\n Accuracy: {obj.score(X_test, y_test)}")
print(f"\n Elapsed predict: {time()-start}")
100%|██████████| 360/360 [00:00<00:00, 1979.13it/s] Elapsed train: 0.01917862892150879 Accuracy: 0.9916666666666667 Elapsed predict: 0.19308829307556152
Obtaining test set accuracy distribution with the same hyperparameters
from collections import namedtuple from sklearn.metrics import classification_report from tqdm import tqdm from scipy import stats
def eval_adaopt(k=1, B=250):
res_metric = []
training_times = []
testing_times = []
DescribeResult = namedtuple('DescribeResult', ('accuracy',
'training_time',
'testing_time'))
obj = ms.AdaOpt(n_iterations=50,
learning_rate=0.3,
reg_lambda=0.1,
reg_alpha=0.5,
eta=0.01,
gamma=0.01,
tolerance=1e-4,
row_sample=1,
k=k,
n_jobs=-1, type_dist="euclidean", verbose=0)
for i in tqdm(range(B)):
np.random.seed(10*i+100)
X_train, X_test, y_train, y_test = train_test_split(Z, t,
test_size=0.2)
start = time()
obj.fit(X_train, y_train)
training_times.append(time()-start)
start = time()
res_metric.append(obj.score(X_test, y_test))
testing_times.append(time()-start)
return DescribeResult(res_metric, training_times, testing_times), stats.describe(res_metric), stats.describe(training_times), stats.describe(testing_times)
res_k1_B250 = eval_adaopt(k=1, B=250) res_k2_B250 = eval_adaopt(k=2, B=250) res_k3_B250 = eval_adaopt(k=3, B=250) res_k4_B250 = eval_adaopt(k=4, B=250) res_k5_B250 = eval_adaopt(k=5, B=250)
100%|██████████| 250/250 [00:50<00:00, 4.96it/s] 100%|██████████| 250/250 [00:50<00:00, 4.94it/s] 100%|██████████| 250/250 [00:50<00:00, 4.96it/s] 100%|██████████| 250/250 [00:51<00:00, 4.90it/s] 100%|██████████| 250/250 [00:51<00:00, 4.90it/s]
display(res_k1_B250[1]) display(res_k2_B250[1]) display(res_k3_B250[1]) display(res_k4_B250[1]) display(res_k5_B250[1])
DescribeResult(nobs=250, minmax=(0.9722222222222222, 1.0), mean=0.9872888888888888, variance=2.5628935495066882e-05, skewness=-0.13898324248427138, kurtosis=0.22445816198359791) DescribeResult(nobs=250, minmax=(0.9666666666666667, 0.9972222222222222), mean=0.9846888888888888, variance=3.354355694382497e-05, skewness=-0.2014633213050366, kurtosis=-0.16851847469456605) DescribeResult(nobs=250, minmax=(0.9611111111111111, 0.9972222222222222), mean=0.9836666666666666, variance=3.45951708066838e-05, skewness=-0.3714590259216959, kurtosis=0.264762318251484) DescribeResult(nobs=250, minmax=(0.9555555555555556, 1.0), mean=0.9793777777777778, variance=4.80023798899302e-05, skewness=-0.24910751075977636, kurtosis=0.4395617044106124) DescribeResult(nobs=250, minmax=(0.9555555555555556, 0.9972222222222222), mean=0.9770444444444444, variance=5.1334225792057076e-05, skewness=-0.12883539300214827, kurtosis=0.1411098033435696)
Obtaining a distribution of training timings
display(res_k1_B250[2]) display(res_k2_B250[2]) display(res_k3_B250[2]) display(res_k4_B250[2]) display(res_k5_B250[2])
DescribeResult(nobs=250, minmax=(0.00498199462890625, 0.021169185638427734), mean=0.007840995788574218, variance=4.368068123193988e-06, skewness=2.175594596266775, kurtosis=7.499194342725625) DescribeResult(nobs=250, minmax=(0.005329132080078125, 0.016299962997436523), mean=0.007670882225036621, variance=3.612048206608975e-06, skewness=1.7118375802873183, kurtosis=3.358366931595608) DescribeResult(nobs=250, minmax=(0.0053746700286865234, 0.015506505966186523), mean=0.007794314384460449, variance=2.920214088930605e-06, skewness=1.6360801483869196, kurtosis=3.2315493234819064) DescribeResult(nobs=250, minmax=(0.005369901657104492, 0.02190709114074707), mean=0.007874348640441894, variance=4.55353231021138e-06, skewness=2.3223174208412916, kurtosis=8.922678944294534) DescribeResult(nobs=250, minmax=(0.005362033843994141, 0.017331361770629883), mean=0.00786894702911377, variance=4.207144846754069e-06, skewness=1.8494401442014954, kurtosis=3.8446086533270085)
Obtaining a distribution of testing timings
display(res_k1_B250[3]) display(res_k2_B250[3]) display(res_k3_B250[3]) display(res_k4_B250[3]) display(res_k5_B250[3])
DescribeResult(nobs=250, minmax=(0.1675705909729004, 0.3001070022583008), mean=0.19125074195861816, variance=0.0003424395337048105, skewness=2.2500063799757677, kurtosis=5.9722526245151375) DescribeResult(nobs=250, minmax=(0.16643667221069336, 0.31163525581359863), mean=0.1923248109817505, variance=0.0003310783018211768, skewness=2.476834016032642, kurtosis=8.109087286878708) DescribeResult(nobs=250, minmax=(0.17519187927246094, 0.37604689598083496), mean=0.1916730365753174, variance=0.0003895799321858523, skewness=4.280046315900402, kurtosis=30.357835694940057) DescribeResult(nobs=250, minmax=(0.17512750625610352, 0.3540067672729492), mean=0.19378959369659424, variance=0.00035161275596300016, skewness=3.595469226517824, kurtosis=21.271489103625353) DescribeResult(nobs=250, minmax=(0.17573857307434082, 0.2584831714630127), mean=0.19390375328063963, variance=0.0002475867594812809, skewness=2.0323201018310013, kurtosis=3.343216700759352)
Graph: distribution of test set accuracy for different numbers of neighbors (1 to 4)
# library & dataset
import pandas as pd
import seaborn as sns
df = pd.DataFrame(np.column_stack((res_k1_B250[0][0],
res_k2_B250[0][0],
res_k3_B250[0][0],
res_k4_B250[0][0])),
columns=['k1', 'k2', 'k3', 'k4'])
# Plot the histogram thanks to the distplot function sns.distplot(a=df["k1"], hist=True, kde=True, rug=True) sns.distplot(a=df["k2"], hist=True, kde=True, rug=True) sns.distplot(a=df["k3"], hist=True, kde=True, rug=True) sns.distplot(a=df["k4"], hist=True, kde=True, rug=True)

Graph: distribution of training timings for different numbers of neighbors (1 to 4)
df = pd.DataFrame(np.column_stack((res_k1_B250[0][1],
res_k2_B250[0][1],
res_k3_B250[0][1],
res_k4_B250[0][1])),
columns=['k1', 'k2', 'k3', 'k4'])
# Plot the histogram thanks to the distplot function
sns.distplot(a=df["k1"], hist=True, kde=True, rug=True)
sns.distplot(a=df["k2"], hist=True, kde=True, rug=True)
sns.distplot(a=df["k3"], hist=True, kde=True, rug=True)
sns.distplot(a=df["k4"], hist=True, kde=True, rug=True)

Graph: distribution of testing timings for different numbers of neighbors (1 to 4)
df = pd.DataFrame(np.column_stack((res_k1_B250[0][2],
res_k2_B250[0][2],
res_k3_B250[0][2],
res_k4_B250[0][2])),
columns=['k1', 'k2', 'k3', 'k4'])
# Plot the histogram thanks to the distplot function
sns.distplot(a=df["k1"], hist=True, kde=True, rug=True)
sns.distplot(a=df["k2"], hist=True, kde=True, rug=True)
sns.distplot(a=df["k3"], hist=True, kde=True, rug=True)
sns.distplot(a=df["k4"], hist=True, kde=True, rug=True)

3 – LSBoostClassifier Python – with test set accuracy’s distribution
3 – 1 Classification of Breast Cancer dataset
data = load_breast_cancer() X = data.data y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=13)
def lsboost_cv(X_train, y_train,
n_estimators=100,
learning_rate=0.1,
n_hidden_features=5,
reg_lambda=0.1,
dropout=0,
tolerance=1e-4,
seed=123):
estimator = ms.LSBoostClassifier(n_estimators=int(n_estimators),
learning_rate=learning_rate,
n_hidden_features=int(n_hidden_features),
reg_lambda=reg_lambda,
dropout=dropout,
tolerance=tolerance,
seed=seed, verbose=0)
return -cross_val_score(estimator, X_train, y_train,
scoring='accuracy', cv=5, n_jobs=-1).mean()
def optimize_lsboost(X_train, y_train):
# objective function for hyperparams tuning
def crossval_objective(x):
return lsboost_cv(
X_train=X_train,
y_train=y_train,
n_estimators=int(x[0]),
learning_rate=x[1],
n_hidden_features=int(x[2]),
reg_lambda=x[3],
dropout=x[4],
tolerance=x[5])
gp_opt = gp.GPOpt(objective_func=crossval_objective,
lower_bound = np.array([10, 0.001, 5, 1e-2, 0, 0]),
upper_bound = np.array([100, 0.4, 250, 1e4, 0.7, 1e-1]),
n_init=10, n_iter=190, seed=123)
return {'parameters': gp_opt.optimize(verbose=2, abs_tol=1e-2), 'opt_object': gp_opt}
# hyperparams tuning
res1 = optimize_lsboost(X_train, y_train)
print(res1)
parameters = res1["parameters"]
start = time()
estimator = ms.LSBoostClassifier(n_estimators=int(parameters[0][0]),
learning_rate=parameters[0][1],
n_hidden_features=int(parameters[0][2]),
reg_lambda=parameters[0][3],
dropout=parameters[0][4],
tolerance=parameters[0][5],
seed=123, verbose=1).fit(X_train, y_train)
print(f"\n\n Test set accuracy: {estimator.score(X_test, y_test)}")
print(f"\n Elapsed: {time() - start}")
Test set accuracy: 0.9912280701754386 Elapsed: 0.11275959014892578
from collections import namedtuple from sklearn.metrics import classification_report from tqdm import tqdm from scipy import stats
Distribution of test set accuracy of LSBoost on Breast Cancer dataset
def eval_lsboost(B=250):
res_metric = []
training_times = []
testing_times = []
DescribeResult = namedtuple('DescribeResult', ('accuracy',
'training_time',
'testing_time'))
for i in tqdm(range(B)):
np.random.seed(10*i+100)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2)
#try:
start = time()
obj = ms.LSBoostClassifier(n_estimators=int(parameters[0][0]),
learning_rate=parameters[0][1],
n_hidden_features=int(parameters[0][2]),
reg_lambda=parameters[0][3],
dropout=parameters[0][4],
tolerance=parameters[0][5],
seed=123, verbose=0).fit(X_train, y_train)
training_times.append(time()-start)
start = time()
res_metric.append(obj.score(X_test, y_test))
testing_times.append(time()-start)
return DescribeResult(res_metric, training_times, testing_times), stats.describe(res_metric), stats.describe(training_times), stats.describe(testing_times)
res_lsboost_B250 = eval_lsboost(B=250)
100%|██████████| 250/250 [00:11<00:00, 21.07it/s]
# library & dataset
import pandas as pd
import seaborn as sns
df = pd.DataFrame(res_lsboost_B250[0][0],
columns=["accuracy"])
# Plot the histogram thanks to the distplot function sns.distplot(a=df["accuracy"], hist=True, kde=True, rug=True)

3 – 2 Classification of Wine dataset
data = load_wine() X = data.data y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=13)
res2 = optimize_lsboost(X_train, y_train)
print(res2)
parameters = res2["parameters"]
start = time()
estimator = ms.LSBoostClassifier(n_estimators=int(parameters[0][0]),
learning_rate=parameters[0][1],
n_hidden_features=int(parameters[0][2]),
reg_lambda=parameters[0][3],
dropout=parameters[0][4],
tolerance=parameters[0][5],
seed=123, verbose=1).fit(X_train, y_train)
print(f"\n\n Test set accuracy: {estimator.score(X_test, y_test)}")
print(f"\n Elapsed: {time() - start}")
Test set accuracy: 1.0 Elapsed: 0.6752924919128418
test set accuracy’s distribution
def eval_lsboost2(B=250):
res_metric = []
training_times = []
testing_times = []
DescribeResult = namedtuple('DescribeResult', ('accuracy',
'training_time',
'testing_time'))
for i in tqdm(range(B)):
np.random.seed(10*i+100)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2)
start = time()
obj = ms.LSBoostClassifier(n_estimators=int(parameters[0][0]),
learning_rate=parameters[0][1],
n_hidden_features=int(parameters[0][2]),
reg_lambda=parameters[0][3],
dropout=parameters[0][4],
tolerance=parameters[0][5],
seed=123, verbose=0).fit(X_train, y_train)
training_times.append(time()-start)
start = time()
res_metric.append(obj.score(X_test, y_test))
testing_times.append(time()-start)
return DescribeResult(res_metric, training_times, testing_times), stats.describe(res_metric), stats.describe(training_times), stats.describe(testing_times)
res_lsboost2_B250 = eval_lsboost2(B=250)
100%|██████████| 250/250 [01:23<00:00, 3.01it/s]
# library & dataset
import pandas as pd
import seaborn as sns
df = pd.DataFrame(res_lsboost2_B250[0][0],
columns=["accuracy"])
# Plot the histogram thanks to the distplot function sns.distplot(a=df["accuracy"], hist=True, kde=True, rug=True)

4 – R example
install.packages("remotes")
remotes::install_github("Techtonique/mlsauce/R-package")
library(datasets)
X <- as.matrix(iris[, 1:4])
y <- as.integer(iris[, 5]) - 1L
n <- dim(X)[1]
p <- dim(X)[2]
set.seed(21341)
train_index <- sample(x = 1:n, size = floor(0.8*n), replace = TRUE)
test_index <- -train_index
X_train <- as.matrix(iris[train_index, 1:4])
y_train <- as.integer(iris[train_index, 5]) - 1L
X_test <- as.matrix(iris[test_index, 1:4])
y_test <- as.integer(iris[test_index, 5]) - 1L
obj <- mlsauce::AdaOpt()
print(obj$get_params())
obj$fit(X_train, y_train)
# Accuracy (\~ 97\%)
print(obj$score(X_test, y_test))
Want to share your content on python-bloggers? click here.