Python-bloggers

An Open Source Journey with Scikit-Learn

This article was first published on Python – Michael's and Christian's Blog , and kindly contributed to python-bloggers. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

In this post, I’d like to tell the story of my journey into the open source world of Python with a focus on scikit-learn. My hope is that it encourages others to start or to keep contributing and have endurance for bigger picture changes.

How it all started

Back in 2015/2016, I was working as a non-life pricing actuary. The standard vendor desktop applications we used for generalized linear models (GLM) had problems of system discontinuities, manual error prone steps and the lack of modern machine learning capabilities (not even out-of-sample model comparison).

Python was then on the rise for data science. Numpy, scipy and pandas had laid the foundations, then came deep learning alias neural net frameworks leading to tensorflow and pytorch. XGBoost was also a game changer visible in Kaggle competition leaderboards. All those projects came as open source with thriving communities and possibilities to contribute.

While the R base package always comes with splendid dataframes (I guess they invented it) and battle proven GLMs out of the box, the Python site for GLMs was not that well developed. So I started with GLMs in statsmodels and generalized linear mixed models (a.k.a. hierarchical or multilevel models) in pymc (then called pymc3). My first open source contributions in the Python world were small issues in statsmodels and a little later the bug report pymc#2640 about memory alignment issues which was caused by joblib#563.

To my great surprise the famous machine learning library scikit-learn did not have GLMs, only penalized linear models and logistic regression, but no Poisson or Gamma GLMs which are essential in non-life insurance pricing. Fortunately, I was not the first one to notice this lack. There was already an open issue scikit-learn#5975 with many people asking for this feature. Just nobody had contributed a pull request (PR) yet.

That’s when I said to myself: It should not fail just because no one implements it. I really like open source and gained some programming experience during my PhD in particle physics, mainly C++. Eventually, I boldly (because I was still a newbie) opened the PR scikit-learn#9405 in summer 2017.

Becoming a scikit-learn core developer

This PR turned out to be essential for the development of GLMs and for becoming a scikit-learn core developer. I dare say that I almost got crazy trying to convince the core developers that GLMs are really that useful for supervised machine learning and that GLMs should land in scikit-learn. In retrospective, this was the hardest part and it took me almost 2 years of patience and repeating my arguments, some examples comments are given below:

comment example 1

comment example 2

comment example 3

As I wanted to demonstrate the full utility of GLMs, this PR had become much too large for review and inclusion: +4000 lines of code with several solvers, penalty matrices, 3 examples, a lot of documentation and good test coverage (and a lot of things I would do differently today).

The conclusion was to carve out a minimal GLM implementation using the L-BFGS solver of scipy. This way, I met Roman Yurchak with whom it was a pleasure to work with. It took a little Swiss chocolate incentive to finally get scikit-learn#14300 (still +2900 loc) reviewed and merged in spring 2020. Almost 3 years after opening my original PR, it was released in scikit-learn version 0.23!

I guess it was mainly this work and perseverance around GLMs that catched the attention of the core developers and that motivated them to vote for me: In summer 2020, I was invited to become a scikit-learn core developer and gladly accepted.

Summary as core developer

Further directions

My work on GLMs was easily extensible to other estimators in the form of loss functions. Again, to my surprise, loss functions, a core element for supervised learning, were re-implemented again and again within scikit-learn. So, based on Roman’s idea in #15123, I started a project to unify them, and by unifying also extending several tree estimator classes with poisson and gamma losses (and making existing ones more stable and faster).

As loss functions are such important core components, they have basically 2 major requirements: be numerically stable and fast. That’s why I went with Cython (preferred way for fast code in scikit-learn) in scikit-learn#20567 and guess which loop it closed? Again, I met segfault errors caused by joblib#563. This time, it motivated another core developer to quite an investment in fixing it in joblib#1254.

Another story branch is the dedicated GLM Python library glum. The authors took my original way too long GLM PR as a starting point and developed one of the most feature rich and fastest GLM implementations out there. This is almost like a dream come true.

A summary of my contributions over those 3 intensive years as scikit-learn core developer are best given in several categories.

Pull requests

A summary of my contributions in terms of code may be:

Reviewing and steering

Among the biggest changes in newer scikit-learn history are two scikit-learn enhancement proposals (SLEP)

For both, I did one of the 2 obligatory reviews. Then, maybe the technically most challenging review I can remember was on:

Keep in mind that review(er)s are by far the scarcest resource of scikit-learn.

I also like to mention PR#25753 which changed to government to be more inclusive, in particular with voting rights.

Lessons Learned

Just before the end, a few critical words must be allowed.

One of the greatest rewards of this journey was that I learned a lot, about Python, machine learning, rigorous reviews, CI/CD, open source communities, endurance. But even more so, I had the pleasure to meet and work with some kind, brilliant and gifted people like Roman Yurchak, Alexandre Gramfort, Olivier Grisel, Thomas Fan, Nicolas Hug, Adrin Jalali and many more. I am really grateful to be a part of something bigger than its parts.

To leave a comment for the author, please follow the link and comment on their blog: Python – Michael's and Christian's Blog .

Want to share your content on python-bloggers? click here.
Exit mobile version