New improved cdata instructional video

February 8, 2020

We have a new, improved version of the "how to design a cdata/data_algebra data transform" video up! The original article, the Python example, and the R example have all been updated to use the new video. Please check it out!

Data re-Shaping in R and in Python

January 28, 2020

Nina Zumel and I have two new tutorials on fluid data wrangling/shaping. They are written in a parallel structure, with the R version of the tutorial being almost identical to the Python version. This reflects our opinion on the "which is better for data science, R or Python?" question. They both are

sklearn Pipe Step Interface for vtreat

January 14, 2020

We've been experimenting with this for a while, and the next R vtreat package will have a back-port of the Python vtreat package's sklearn pipe step interface (in addition to the standard R interface). This means the user can easily express modeling intent by choosing between coder$fit_transform(train_data), coder$fit(train_data_cal)$transform(train_data_model), and coder$fit(application_data). We have also regenerated

Biomedical Data Science Textbook Available

January 14, 2020

By Bob Hoyt & Bob Muenchen. Data science is being used in many ways to improve healthcare and reduce costs. We have written a textbook, Introduction to Biomedical Data Science, to help healthcare professionals understand the topic and to work

MinIO for Machine Learning Model Storage using Python

January 13, 2020

MinIO is an object storage database which uses S3 (from Amazon). This is a very convenient tool for data scientists or machine learning engineers to easily collaborate and share data and machine learning models. MinIO is a cloud storage server compatible with Amazon S3, released under Apache License v2. As an object store, MinIO can...

New vtreat Feature: Nested Model Bias Warning

January 11, 2020

For quite a while we have been teaching that estimating variable re-encodings on the exact same data that is later naively used to train a model leads to an undesirable nested model bias. The vtreat package (both the R version and the Python version) incorporates a cross-frame method that allows one to use all the
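To make the nested model bias idea concrete, here is a minimal sketch in plain Python. It is not vtreat's implementation: the function names (`fit_level_means`, `naive_encode`, `cross_frame_encode`), the interleaved fold assignment, and the toy data are all assumptions made for illustration. It re-encodes a categorical variable as the mean outcome per level, once naively on the same rows and once so that each row is encoded only from rows in other folds.

```python
from collections import defaultdict


def fit_level_means(levels, y):
    """Map each categorical level to the mean outcome observed for it."""
    sums, counts = defaultdict(float), defaultdict(int)
    for lev, yi in zip(levels, y):
        sums[lev] += yi
        counts[lev] += 1
    return {lev: sums[lev] / counts[lev] for lev in sums}


def naive_encode(levels, y):
    """Encode every row from means estimated on those same rows (leaks y)."""
    means = fit_level_means(levels, y)
    return [means[lev] for lev in levels]


def cross_frame_encode(levels, y, n_folds=3):
    """Encode every row from means estimated only on the other folds."""
    n = len(levels)
    encoded = [None] * n
    for fold in range(n_folds):
        # Hypothetical fold plan: row i belongs to fold i % n_folds.
        held_out = [i for i in range(n) if i % n_folds == fold]
        kept = [i for i in range(n) if i % n_folds != fold]
        kept_y = [y[i] for i in kept]
        means = fit_level_means([levels[i] for i in kept], kept_y)
        fallback = sum(kept_y) / len(kept_y)
        for i in held_out:
            # Levels never seen in the kept rows get the kept-row mean.
            encoded[i] = means.get(levels[i], fallback)
    return encoded


# Toy data: every level is unique, so the variable carries no real signal.
levels = ["u0", "u1", "u2", "u3", "u4", "u5"]
y = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]

naive = naive_encode(levels, y)
cross = cross_frame_encode(levels, y, n_folds=3)
```

In this toy case the naive encoding simply copies `y` back out, so a model trained on it would look perfect on the training rows, while the cross-frame style encoding shows the variable is uninformative.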

CodeWars: Learn programming through test-driven development

January 8, 2020

After I wrote about Project Euler and CodingGame, someone recommended CodeWars to me. CodeWars offers free online learning exercises to develop your programming skills through fun daily challenges. In line with Project Euler, you are tasked with solving increasingly complex programming challenges. At CodeWars, these little problems you need to solve with code are called

New Timings for a Grouped In-Place Aggregation Task

January 2, 2020

I'd like to share some new timings on a grouped in-place aggregation task. A client of mine was seeing some slow performance, so I decided to time a very simple abstraction of one of the steps of their workflow. Roughly, the task was to add in some derived per-group aggregation columns to a few million
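As a rough illustration of what such a task looks like (not the timed code from the post), the following plain Python sketch adds per-group sum and mean columns to each row in place. The function name `add_group_aggregates`, the column naming scheme, and the sample rows are assumptions for this example.

```python
from collections import defaultdict


def add_group_aggregates(rows, group_key, value_key):
    """Add per-group sum and mean of value_key to each row, in place."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    # First pass: accumulate the per-group totals and counts.
    for row in rows:
        totals[row[group_key]] += row[value_key]
        counts[row[group_key]] += 1
    # Second pass: write the group aggregates back onto every row.
    for row in rows:
        g = row[group_key]
        row[value_key + "_group_sum"] = totals[g]
        row[value_key + "_group_mean"] = totals[g] / counts[g]
    return rows


rows = [
    {"g": "a", "x": 1.0},
    {"g": "a", "x": 3.0},
    {"g": "b", "x": 10.0},
]
add_group_aggregates(rows, "g", "x")
```

The row count is unchanged: each row keeps its original values and gains the aggregates of its group, which is what distinguishes this from a summarizing group-by.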

Python Web Scraping: WordPress Visitor Statistics

December 29, 2019

I've had this WordPress domain for several years now, and in the beginning it was very convenient. WordPress enabled me to set up a fully functional blog in a matter of hours. Everything from HTML markup and external content embedding to databases and simple analytics was already conveniently set up. However, after a while, I wanted to
