January 2020

Data re-Shaping in R and in Python

January 28, 2020

Nina Zumel and I have two new tutorials on fluid data wrangling/shaping. They are written in a parallel structure, with the R version of the tutorial being almost identical to the Python version of the tutorial. This reflects our opinion on the "which is better for data science, R or Python?" question. They both are

sklearn Pipe Step Interface for vtreat

January 14, 2020

We've been experimenting with this for a while, and the next R vtreat package will have a back-port of the Python vtreat package sklearn pipe step interface (in addition to the standard R interface). This means the user can easily express modeling intent by choosing between coder$fit_transform(train_data), coder$fit(train_data_cal)$transform(train_data_model), and coder$fit(application_data). We have also regenerated
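The calling patterns mentioned above can be illustrated with a toy sklearn-style step. This is only a sketch: `MeanImputer` is a hypothetical stand-in written for this example, not the vtreat coder, and the data values are invented. The last call follows the usual sklearn convention of applying an already fitted step to new data with `transform()`, which is an assumption here rather than something taken from the post.

```python
class MeanImputer:
    """Hypothetical pipe step: replaces None with the mean learned in fit()."""

    def fit(self, column):
        observed = [v for v in column if v is not None]
        self.mean_ = sum(observed) / len(observed)
        return self  # returning self allows fit(...).transform(...) chaining

    def transform(self, column):
        return [self.mean_ if v is None else v for v in column]

    def fit_transform(self, column):
        return self.fit(column).transform(column)


# Invented example data.
train_data = [1.0, None, 3.0]
train_data_cal = [2.0, 4.0]
train_data_model = [None, 10.0]
application_data = [None, 7.0]

# Pattern 1: fit and transform the same training data in one step.
coder = MeanImputer()
same_data = coder.fit_transform(train_data)

# Pattern 2: fit on a calibration split, transform a separate modeling split.
split = MeanImputer().fit(train_data_cal).transform(train_data_model)

# Later use (sklearn convention): apply the fitted step to new data.
applied = coder.transform(application_data)
```

Keeping the three calls distinct is what lets the code state which data was used to learn the encoding and which data merely received it.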

Biomedical Data Science Textbook Available

January 14, 2020

By Bob Hoyt & Bob Muenchen

Data science is being used in many ways to improve healthcare and reduce costs. We have written a textbook, Introduction to Biomedical Data Science, to help healthcare professionals understand the topic and to work

MinIO for Machine Learning Model Storage using Python

January 13, 2020

MinIO is an object storage server that speaks the Amazon S3 API. It is a very convenient tool for data scientists or machine learning engineers who want to collaborate and share data and machine learning models easily. MinIO is a cloud storage server compatible with Amazon S3, released under Apache License v2. As an object store, MinIO can...
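A minimal sketch of storing and retrieving a model as a single object is shown below. It is not taken from the post: the helper names `save_model` and `load_model`, the bucket and object names, and the `InMemoryClient` stand-in are all hypothetical. The helpers only assume a client that offers `put_object(bucket, name, stream, length)` and `get_object(bucket, name)`, which is the shape of the MinIO Python client's calls; the stand-in lets the example run without a server.

```python
import io
import pickle


def save_model(client, bucket, name, model):
    """Pickle a model and upload it as one object."""
    payload = pickle.dumps(model)
    # put_object expects a readable stream plus its length in bytes.
    client.put_object(bucket, name, io.BytesIO(payload), len(payload))


def load_model(client, bucket, name):
    """Download an object and unpickle it (only unpickle data you trust)."""
    response = client.get_object(bucket, name)
    try:
        return pickle.loads(response.read())
    finally:
        response.close()
        response.release_conn()


class _Response(io.BytesIO):
    """Hypothetical response object: a byte stream with release_conn()."""

    def release_conn(self):
        pass


class InMemoryClient:
    """Hypothetical stand-in offering the two methods the helpers use."""

    def __init__(self):
        self._objects = {}

    def put_object(self, bucket_name, object_name, data, length):
        self._objects[(bucket_name, object_name)] = data.read(length)

    def get_object(self, bucket_name, object_name):
        return _Response(self._objects[(bucket_name, object_name)])


client = InMemoryClient()
save_model(client, "models", "demo.pkl", {"weights": [0.1, 0.2]})
restored = load_model(client, "models", "demo.pkl")
```

Against a real server, the stand-in would be replaced by a `Minio` client from the `minio` package, constructed with your own endpoint and credentials, and the bucket would need to exist before the upload.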

New vtreat Feature: Nested Model Bias Warning

January 11, 2020

For quite a while we have been teaching that estimating variable re-encodings on the exact same data that is later naively used to train a model leads to an undesirable nested model bias. The vtreat package (both the R version and Python version) incorporates a cross-frame method that allows one to use all the
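The idea behind a cross-frame can be sketched with a simple out-of-fold encoding. This is an illustration under stated assumptions, not vtreat's implementation: the function names `impact_code` and `cross_frame`, the fold assignment by row index, and the example data are all invented for this sketch.

```python
from collections import defaultdict


def impact_code(levels, y, prior):
    """Map each level to (mean outcome for that level) minus the prior."""
    sums, counts = defaultdict(float), defaultdict(int)
    for lvl, yi in zip(levels, y):
        sums[lvl] += yi
        counts[lvl] += 1
    return {lvl: sums[lvl] / counts[lvl] - prior for lvl in sums}


def cross_frame(levels, y, n_folds=3):
    """Encode each row using a code estimated on the other folds only.

    Requires n_folds >= 2 so that every row has other folds to learn from.
    """
    n = len(levels)
    encoded = [0.0] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        rest = [i for i in range(n) if i % n_folds != fold]
        prior = sum(y[i] for i in rest) / len(rest)
        code = impact_code([levels[i] for i in rest], [y[i] for i in rest], prior)
        for i in held_out:
            # Levels unseen in the other folds get the neutral value 0.0.
            encoded[i] = code.get(levels[i], 0.0)
    return encoded


# Invented example data: one categorical variable and a numeric outcome.
levels = ["a", "a", "a", "b", "b", "b"]
y = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]

# Naive encoding: every row's code is estimated from data that includes the row itself.
naive_code = impact_code(levels, y, prior=sum(y) / len(y))
naive = [naive_code[lvl] for lvl in levels]

# Cross-frame encoding: a row's own outcome never influences its encoded value.
crossed = cross_frame(levels, y)
```

Because each row's encoded value is computed without that row's outcome, a downstream model trained on `crossed` cannot pick up the row's own outcome through the encoding, which is the leak the naive version allows.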

New Timings for a Grouped In-Place Aggregation Task

January 2, 2020

I'd like to share some new timings on a grouped in-place aggregation task. A client of mine was seeing some slow performance, so I decided to time a very simple abstraction of one of the steps of their workflow. Roughly, the task was to add in some derived per-group aggregation columns to a few million
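For readers unfamiliar with the task shape, here is a dependency-free sketch of adding per-group aggregation columns to existing rows. The function name `add_group_aggregates`, the choice of sum and max as the aggregates, and the example rows are assumptions made for illustration; the excerpt does not say which aggregations or tools were timed, and at the scale described a data frame library would normally be used instead.

```python
from collections import defaultdict


def add_group_aggregates(rows, group_key, value_key):
    """Add per-group sum and max of value_key to every row, in place."""
    totals = defaultdict(float)
    maxima = {}
    for row in rows:  # first pass: accumulate per-group statistics
        g, v = row[group_key], row[value_key]
        totals[g] += v
        maxima[g] = v if g not in maxima else max(maxima[g], v)
    for row in rows:  # second pass: write the statistics back onto each row
        g = row[group_key]
        row[value_key + "_group_sum"] = totals[g]
        row[value_key + "_group_max"] = maxima[g]
    return rows


# Invented example rows.
rows = [
    {"g": "a", "x": 1.0},
    {"g": "b", "x": 5.0},
    {"g": "a", "x": 3.0},
]
add_group_aggregates(rows, "g", "x")
```

The row count is unchanged: unlike a plain group-by summary, every original row is kept and simply gains the new columns.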