New improved cdata instructional video

February 8, 2020 | 0 Comments

We have a new, improved version of the “how to design a cdata/data_algebra data transform” video up! The original article, the Python example, and the R example have all been updated to use the new video. Please check it out! [...Read more...]

Data re-Shaping in R and in Python

January 28, 2020 | 0 Comments

Nina Zumel and I have two new tutorials on fluid data wrangling/shaping. They are written in a parallel structure, with the R version of the tutorial being almost identical to the Python version of the tutorial. This reflects our opinion on the question “which is better for data science, R or Python?” They both are … Continue reading Data re-Shaping in R and in Python [...Read more...]

sklearn Pipe Step Interface for vtreat

January 14, 2020 | 0 Comments

We’ve been experimenting with this for a while, and the next R vtreat package will have a back-port of the Python vtreat package sklearn pipe step interface (in addition to the standard R interface). This means the user can easily express modeling intent by choosing between coder$fit_transform(train_data), coder$fit(train_data_cal)$transform(train_data_model), and coder$fit(application_data). We have also regenerated … Continue reading sklearn Pipe Step Interface for vtreat [...Read more...]
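To make the distinction between these calls concrete, here is a minimal Python sketch of a pipe-step style coder. It is not the vtreat API: `ToyMeanCoder` and its leave-one-out rule are hypothetical, chosen only to illustrate why `fit_transform` on training data can deliberately differ from `fit` followed by `transform` on other data.

```python
class ToyMeanCoder:
    """Hypothetical stand-in for a pipe-step coder (not vtreat itself).

    Encodes a categorical column as the mean outcome per level.
    """

    def __init__(self):
        self.sums_ = {}
        self.counts_ = {}
        self.default_ = 0.0

    def fit(self, levels, y):
        # Learn per-level outcome sums and counts, plus a global fallback.
        self.sums_, self.counts_ = {}, {}
        for lev, yi in zip(levels, y):
            self.sums_[lev] = self.sums_.get(lev, 0.0) + yi
            self.counts_[lev] = self.counts_.get(lev, 0) + 1
        self.default_ = sum(y) / len(y)
        return self

    def transform(self, levels):
        # Apply the fitted encoding; unseen levels get the global mean.
        return [
            self.sums_[lev] / self.counts_[lev] if lev in self.counts_ else self.default_
            for lev in levels
        ]

    def fit_transform(self, levels, y):
        # Fit, then encode each training row without its own outcome
        # (leave-one-out), so a row never sees its own answer.
        self.fit(levels, y)
        encoded = []
        for lev, yi in zip(levels, y):
            n_other = self.counts_[lev] - 1
            if n_other > 0:
                encoded.append((self.sums_[lev] - yi) / n_other)
            else:
                encoded.append(self.default_)
        return encoded


train_levels = ["a", "a", "b"]
train_y = [1.0, 3.0, 5.0]

coder = ToyMeanCoder()
train_encoded = coder.fit_transform(train_levels, train_y)
new_encoded = coder.transform(["a", "b", "c"])
```

In this toy version the training rows receive encodings that exclude their own outcome, while new data receives the plain fitted means.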

Biomedical Data Science Textbook Available

January 14, 2020 | 0 Comments

By Bob Hoyt & Bob Muenchen. Data science is being used in many ways to improve healthcare and reduce costs. We have written a textbook, Introduction to Biomedical Data Science, to help healthcare professionals understand the topic and to work … Continue reading → [...Read more...]

MinIO for Machine Learning Model Storage using Python

January 13, 2020 | 0 Comments

MinIO is an object store that uses the S3 API (from Amazon). It is a very convenient tool for data scientists or machine learning engineers to easily collaborate and share data and machine learning models. MinIO is a cloud storage server compatible with Amazon S3, released under Apache License v2. As an object store, MinIO can... Continue Reading → [...Read more...]

New vtreat Feature: Nested Model Bias Warning

January 11, 2020 | 0 Comments

For quite a while we have been teaching that estimating variable re-encodings on the exact same data that is later naively used to train a model leads to an undesirable nested model bias. The vtreat package (both the R version and Python version) incorporates a cross-frame method that allows one to use all the … Continue reading New vtreat Feature: Nested Model Bias Warning [...Read more...]

New Timings for a Grouped In-Place Aggregation Task

January 2, 2020 | 0 Comments

I’d like to share some new timings on a grouped in-place aggregation task. A client of mine was seeing some slow performance, so I decided to time a very simple abstraction of one of the steps of their workflow. Roughly, the task was to add in some derived per-group aggregation columns to a few million … Continue reading New Timings for a Grouped In-Place Aggregation Task [...Read more...]
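As a rough illustration of the kind of task described, the following Python sketch adds derived per-group aggregation columns to every row. It is an assumption-laden toy, not the timed code from the post: the function name `add_group_aggregates`, the column names, and the choice of sum and mean are all hypothetical.

```python
from collections import defaultdict


def add_group_aggregates(rows, group_key, value_key):
    """Return copies of rows with per-group sum and mean columns added."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    # First pass: accumulate per-group totals and row counts.
    for row in rows:
        totals[row[group_key]] += row[value_key]
        counts[row[group_key]] += 1
    # Second pass: attach the group-level values to each row.
    out = []
    for row in rows:
        g = row[group_key]
        new_row = dict(row)
        new_row[value_key + "_group_sum"] = totals[g]
        new_row[value_key + "_group_mean"] = totals[g] / counts[g]
        out.append(new_row)
    return out


rows = [
    {"g": "a", "x": 1.0},
    {"g": "a", "x": 3.0},
    {"g": "b", "x": 10.0},
]
result = add_group_aggregates(rows, "g", "x")
```

Every input row is kept, and each one gains the aggregates of its own group, which is what distinguishes this from a plain group-and-summarize step.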

A Richer Category for Data Wrangling

December 22, 2019 | 0 Comments

I’ve been writing a lot about category theory interpretations of data-processing pipelines and some of the improvements we feel they are driving in both data_algebra and rquery/rqdatatable. I think I’ve found an even better category theory re-formulation of the package, which I will describe here. In the earlier formalism our data transform … Continue reading A Richer Category for Data Wrangling [...Read more...]

Better SQL Generation via the data_algebra

December 18, 2019 | 0 Comments

In our recent note “What is new for rquery December 2019” we mentioned an ugly processing pipeline that translates into SQL of varying size/quality depending on the query generator we use. In this note we try a near-relative of that query in the data_algebra. dplyr translates the query to SQL as: SELECT 5.0 AS `x`, … Continue reading Better SQL Generation via the data_algebra [...Read more...]

data_algebra/rquery as a Category Over Table Descriptions

December 14, 2019 | 0 Comments

Introduction: I would like to talk about some of the design principles underlying the data_algebra package (and its sibling rquery package). The data_algebra package is a query generator that can act on either Pandas data frames or on SQL tables. This is discussed on the project site and in the examples directory. In this … Continue reading data_algebra/rquery as a Category Over Table Descriptions [...Read more...]