An Open Source Journey with Scikit-Learn

Posted on November 24, 2023 by Christian Lorentzen in Data science | 0 Comments

This article was first published on Python – Michael's and Christian's Blog, and kindly contributed to python-bloggers.

In this post, I’d like to tell the story of my journey into the open source world of Python with a focus on scikit-learn. My hope is that it encourages others to start or to keep contributing, and to have the endurance for bigger-picture changes.

How it all started

Back in 2015/2016, I was working as a non-life pricing actuary. The standard vendor desktop applications we used for generalized linear models (GLM) suffered from system discontinuities, manual, error-prone steps, and a lack of modern machine learning capabilities (not even out-of-sample model comparison).

Python was then on the rise for data science. NumPy, SciPy and pandas had laid the foundations; then came the deep learning (a.k.a. neural network) frameworks, leading to TensorFlow and PyTorch. XGBoost was also a game changer, as was visible on Kaggle competition leaderboards. All those projects came as open source with thriving communities and possibilities to contribute.

While the R base package always comes with splendid dataframes (I guess they invented them) and battle-proven GLMs out of the box, the Python side for GLMs was not that well developed. So I started with GLMs in statsmodels and generalized linear mixed models (a.k.a. hierarchical or multilevel models) in pymc (then called pymc3). My first open source contributions in the Python world were small issues in statsmodels and, a little later, the bug report pymc#2640 about memory alignment issues, which were caused by joblib#563.

To my great surprise, the famous machine learning library scikit-learn did not have GLMs: it offered only penalized linear models and logistic regression, but no Poisson or Gamma GLMs, which are essential in non-life insurance pricing. Fortunately, I was not the first one to notice this gap. There was already an open issue scikit-learn#5975 with many people asking for this feature. It was just that nobody had contributed a pull request (PR) yet.

That’s when I said to myself: It should not fail just because no one implements it. I really like open source and gained some programming experience during my PhD in particle physics, mainly C++. Eventually, I boldly (because I was still a newbie) opened the PR scikit-learn#9405 in summer 2017.

Becoming a scikit-learn core developer

This PR turned out to be essential for the development of GLMs and for becoming a scikit-learn core developer. I dare say that I almost went crazy trying to convince the core developers that GLMs are really that useful for supervised machine learning and that GLMs should land in scikit-learn. In retrospect, this was the hardest part, and it took me almost 2 years of patience and repeating my arguments. Some example comments are given below:

comment example 1

“I can only repeat myself: I’d prefer to have this functionality in scikit-learn for several reasons (your review, opinion and ideas, very official/trustworthy library, more efficient maintainance, effort to release this pr as its own library, …).
To be more explicit for the moment: If it takes longer than the end of 2019 (+-), I’ll consider to release it as separate library.” link

comment example 2

“I see it a bit different. Scikit-Learn like R glm and glmnet is trusted world-wide and can be used in many companies, whereas it might be difficult to get any of the existing GLM libraries on pypi (h2o excluded) into production (no offense intended). That being said, I’d like to return the question and ask you: What exactly has to be fulfilled in order for a GLM PR to be merged into scikit-learn? Once that is clarified, I’ll think about starting a collaboration for this.” link

comment example 3

…

guidance – maintenance
As a GLM user on a fairly regular basis, I’d be happy to help as good as I can. Feel free to reach out to me. As to maintenance, I think a unified framework would even lower the burden. I can also imagine to give some support for maintenance.

miscellaneous

…
For GBMs to rely on the same loss and link functions would make sense …

…

further steps
Besides further commits to this PR, let me know how I can help you best.

[link]

As I wanted to demonstrate the full utility of GLMs, this PR had become much too large for review and inclusion: +4000 lines of code with several solvers, penalty matrices, 3 examples, a lot of documentation and good test coverage (and a lot of things I would do differently today).

The conclusion was to carve out a minimal GLM implementation using the L-BFGS solver of scipy. This way, I met Roman Yurchak, with whom it was a pleasure to work. It took a little Swiss chocolate incentive to finally get scikit-learn#14300 (still +2900 lines of code) reviewed and merged in spring 2020. Almost 3 years after opening my original PR, it was released in scikit-learn version 0.23!
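
For readers who have not used the result of this work, here is a minimal sketch of fitting the Poisson GLM that shipped with scikit-learn 0.23. The synthetic data, the seed and the small penalty alpha are assumptions made purely for illustration; they are not taken from the PR.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Synthetic count data (assumption for illustration):
# y ~ Poisson(exp(0.5 + 0.8 * x)), e.g. claim counts per policy.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 1))
y = rng.poisson(np.exp(0.5 + 0.8 * X[:, 0]))

# Poisson deviance with log link; alpha is the L2 penalty strength.
# The default solver is "lbfgs", i.e. scipy's L-BFGS.
glm = PoissonRegressor(alpha=1e-4, max_iter=300)
glm.fit(X, y)

print(glm.intercept_, glm.coef_)  # should be close to 0.5 and 0.8
print(glm.predict([[0.0]]))       # predicted mean count, always positive
```

Because of the log link, predictions are strictly positive, which is one of the reasons Poisson and Gamma GLMs are preferred over plain least squares for claim counts and claim sizes.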

I guess it was mainly this work and perseverance around GLMs that caught the attention of the core developers and that motivated them to vote for me: In summer 2020, I was invited to become a scikit-learn core developer and gladly accepted.

Summary as core developer

Further directions

My work on GLMs was easily extensible to other estimators in the form of loss functions. Again, to my surprise, loss functions, a core element of supervised learning, were re-implemented again and again within scikit-learn. So, based on Roman’s idea in #15123, I started a project to unify them and, along the way, to extend several tree-based estimator classes with Poisson and Gamma losses (and to make existing ones more stable and faster).

As loss functions are such important core components, they have basically 2 major requirements: be numerically stable and fast. That’s why I went with Cython (the preferred way for fast code in scikit-learn) in scikit-learn#20567, and guess which loop it closed? Again, I met segfault errors caused by joblib#563. This time, it motivated another core developer to invest considerable effort in fixing it in joblib#1254.
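
To make the stability requirement concrete, here is a small pure-Python sketch of the binary log loss written as a function of the raw prediction (the linear predictor before the logistic link). It is an illustration only and not the Cython code of scikit-learn; the function names are hypothetical.

```python
import math

def naive_log_loss(y_true, raw):
    """Binary log loss via the probability; fails for large |raw|."""
    p = 1.0 / (1.0 + math.exp(-raw))
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def stable_log_loss(y_true, raw):
    """Same loss, rewritten as log(1 + exp(raw)) - y_true * raw."""
    # log(1 + exp(raw)) = max(raw, 0) + log1p(exp(-|raw|)),
    # so exp() is only ever called on non-positive numbers.
    return max(raw, 0.0) + math.log1p(math.exp(-abs(raw))) - y_true * raw

# Both versions agree for moderate raw predictions.
print(naive_log_loss(1, 0.3), stable_log_loss(1, 0.3))

# For a confident but wrong prediction, the naive version rounds p to 1.0
# and then tries to evaluate log(0); the stable version stays finite.
print(stable_log_loss(0, 40.0))
try:
    naive_log_loss(0, 40.0)
except ValueError as err:
    print("naive version failed:", err)
```

The rewritten form avoids computing the probability at all, which is why working directly on raw predictions is attractive for a shared loss module.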

Another branch of the story is the dedicated GLM Python library glum. The authors took my original, way-too-long GLM PR as a starting point and developed one of the most feature-rich and fastest GLM implementations out there. This is almost like a dream come true.

A summary of my contributions over those 3 intensive years as scikit-learn core developer is best given in several categories.

Pull requests

A summary of my contributions in terms of code may be:

Unified loss module, unified naming of losses, poisson and gamma losses for GLMs and decision tree based models
LinearModelLoss and NewtonSolver (newton-cholesky) for GLMs like LogisticRegression and PoissonRegressor as well as further solver improvements
QuantileRegressor (linear quantile regression) and quantile/pinball loss for HistGradientBoostingRegressor (HGBT). BTW, linear quantile regression is much harder than GLM solvers!
SplineTransformer
Interaction constraints and feature subsampling for HGBT

From the release notes and the GitHub PRs (where one would miss a few), here is a more detailed list of important PRs:

Sample weights for ElasticNet v0.23 (Major Feature)
Minimal Generalized linear models implementation v0.23 (Major Feature)
ENH Poisson loss for HistGradientBoostingRegressor v0.23
ENH add Poisson splitting criterion for single trees v0.24
Common Private Loss Module with tempita v1.1
- RFC Consistent options/names for loss and criterion v1.0 and v1.1
- ENH Replace loss module HGBT v1.1
- FEA add quantile HGBT v1.1 (Major Feature)
- ENH Loss module LogisticRegression v1.1
- ENH migrate GLMs / TweedieRegressor to linear loss v1.1
- FEA Add Gamma deviance as loss function to HGBT v1.3
- ENH replace loss module Gradient boosting future v1.4
Add quantile regression together with David Dale v1.0 (Major Feature)
FEA Add SplineTransformer v1.0
ENH FEA add interaction constraints to HGBT v1.2 (Major Feature)
FEA add (single) Cholesky Newton solver to GLMs v1.2
ENH add newton-cholesky solver to LogisticRegression v1.2
TST tight tests for GLMs v1.2
ENH scaling of LogisticRegression loss as 1/n * LinearModelLoss future v1.4
ENH add feature subsampling per split for HGBT future v1.4

Reviewing and steering

Among the biggest changes in recent scikit-learn history are two scikit-learn enhancement proposals (SLEPs):

SLEP018: Pandas Output for Transformers with set_output
championed by Thomas Fan, implemented in PR#23734 v1.2, further developments like PR#27315 for polars in future v1.4
SLEP006: Metadata Routing
championed by Adrin Jalali, base implementation PR#22083

For both, I did one of the 2 obligatory reviews. Then, maybe the technically most challenging review I can remember was on:

ENH Add Categorical support for HistGradientBoosting from Thomas Fan, v0.24 (Major Feature)

Keep in mind that review(er)s are by far the scarcest resource of scikit-learn.

I would also like to mention PR#25753, which changed the governance to be more inclusive, in particular with regard to voting rights.

Lessons Learned

Just before the end, a few critical words must be allowed.

Scikit-learn is focused a lot on stability. For some items of my wish list to land in scikit-learn, it would have again taken years. This time, I decided to release my own library model-diagnostics and I enjoy the freedom to use cutting-edge components like polars.
As a part-time statistician, I consider certain design choices a bit poor, for example classifiers’ predict implicitly using a 50% threshold instead of returning a predicted probability (which is what predict_proba does). Hard to change!!! At least, PR#26120 might improve that to some extent.
I ponder a lot on the pipeline concept. At first, it was an eye-opener for me to think of feature preprocessing as part of the estimator. The scikit-learn API is built around the pipeline design with fit, transform and predict. But modern model classes like gradient boosted trees (XGBoost, LightGBM, HGBT) no longer need a preprocessing pipeline, e.g., they can natively deal with categorical features and missing values. Yet it is hard to pass the information about which features to treat as categorical through a pipeline, see scikit-learn#18894.
It is still a very painful experience to specify design matrices of linear models, in particular interaction terms, see scikit-learn#15263, #19533 and #25412. Doing that in a pipeline with a ColumnTransformer is just very complicated and prohibits a lot of optimizations (mostly for categoricals), which is one of the reasons glum is faster.
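
The remark above about the implicit 50% threshold of predict can be checked in a few lines. This is a minimal sketch on a made-up, imbalanced binary dataset; the alternative threshold of 0.2 is an arbitrary assumption, and the manual thresholding shown here is not the mechanism proposed in PR#26120.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced binary problem: roughly 10% positives (assumption).
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)[:, 1]  # predicted probability of class 1
hard = clf.predict(X)               # hard class labels 0/1

# For a binary classifier, predict is just predict_proba cut at 50%.
same = np.array_equal(hard, (proba > 0.5).astype(int))
print(same)

# A user-chosen threshold has to be applied by hand on the probabilities.
custom = (proba > 0.2).astype(int)
print(hard.sum(), custom.sum())  # a lower threshold flags at least as many positives
```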

One of the greatest rewards of this journey was that I learned a lot, about Python, machine learning, rigorous reviews, CI/CD, open source communities, endurance. But even more so, I had the pleasure to meet and work with some kind, brilliant and gifted people like Roman Yurchak, Alexandre Gramfort, Olivier Grisel, Thomas Fan, Nicolas Hug, Adrin Jalali and many more. I am really grateful to be a part of something bigger than its parts.
