Articles by John Mount

I think Pandas may have “lost the plot.”

August 4, 2021 | John Mount

I’ve thought of Pandas as an in-memory column-oriented data structure with reasonable performance. If I need high performance or scale, I can move to a database. Now I kind of wonder what Pandas is, or what it wants to be. The version 1.3.0 package seems to be marking natural ways […]

Using WITH For Neater SQL

June 21, 2021 | John Mount

I’d like to work an example of using SQL WITH Common Table Expressions to produce more legible SQL. A major complaint with SQL is that it composes statements by rightward nesting. That is: a sequence of operations A -> B -> C is represented as SELECT C FROM SELECT […]
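To make the nesting complaint concrete, here is a minimal sketch (not taken from the article itself) that runs the same three-step computation twice: once as nested sub-selects and once with WITH. The table `d`, the column `x`, and the step names `step_a` and `step_b` are hypothetical, and Python's built-in `sqlite3` module is assumed only as a convenient way to run the SQL.

```python
import sqlite3

# Hypothetical example data: a single table d with one integer column x.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE d (x INTEGER)")
conn.executemany("INSERT INTO d (x) VALUES (?)", [(1,), (2,), (3,)])

# Nested form: the last step (C) is written first and the first step (A)
# ends up innermost, so the query reads inside-out.
nested_sql = """
SELECT y + 1 AS z                -- step C
FROM (
    SELECT x2 * 2 AS y           -- step B
    FROM (
        SELECT x * x AS x2       -- step A
        FROM d
    ) AS step_a
) AS step_b
ORDER BY z
"""

# WITH form: each step is named and listed in the order it is applied.
with_sql = """
WITH
    step_a AS (SELECT x * x AS x2 FROM d),       -- step A
    step_b AS (SELECT x2 * 2 AS y FROM step_a)   -- step B
SELECT y + 1 AS z FROM step_b                    -- step C
ORDER BY z
"""

nested_rows = conn.execute(nested_sql).fetchall()
with_rows = conn.execute(with_sql).fetchall()

# Both forms describe the same computation and return the same rows.
assert nested_rows == with_rows
print(with_rows)
```

The two queries are equivalent; the WITH version simply lets the steps be read top to bottom in the order A, B, C.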

data_algebra 0.7.0 What is New

June 7, 2021 | John Mount

I’ve been tinkering a lot recently with the data_algebra, and just released version 0.7.0 to PyPI. In this note I’ll touch on what the data algebra is, what the new features are, and my plans going forward. The data algebra The data algebra is a modern realization of […]

New improved cdata instructional video

February 8, 2020 | John Mount

We have a new improved version of the “how to design a cdata/data_algebra data transform” up! The original article, the Python example, and the R example have all been updated to use the new video. Please check it out!

Data re-Shaping in R and in Python

January 28, 2020 | John Mount

Nina Zumel and I have two new tutorials on fluid data wrangling/shaping. They are written in a parallel structure, with the R version of the tutorial being almost identical to the Python version of the tutorial. This reflects our opinion on the “which is better for data science ...

sklearn Pipe Step Interface for vtreat

January 14, 2020 | John Mount

We’ve been experimenting with this for a while, and the next R vtreat package will have a back-port of the Python vtreat package sklearn pipe step interface (in addition to the standard R interface). This means the user can easily express modeling intent by choosing between coder$fit_...

New vtreat Feature: Nested Model Bias Warning

January 11, 2020 | John Mount

For quite a while we have been teaching that estimating variable re-encodings on the exact same data that is later naively used to train a model leads to an undesirable nested model bias. The vtreat package (both the R version and the Python version) incorporates a cross-frame method that allows ...

New Timings for a Grouped In-Place Aggregation Task

January 2, 2020 | John Mount

I’d like to share some new timings on a grouped in-place aggregation task. A client of mine was seeing some slow performance, so I decided to time a very simple abstraction of one of the steps of their workflow. Roughly, the task was to add in some derived per-group ...

A Richer Category for Data Wrangling

December 22, 2019 | John Mount

I’ve been writing a lot about a category theory interpretation of data-processing pipelines and some of the improvements we feel it is driving in both the data_algebra and in rquery/rqdatatable. I think I’ve found an even better category theory re-formulation of the package, which I will ...

Better SQL Generation via the data_algebra

December 18, 2019 | John Mount

In our recent note What is new for rquery December 2019 we mentioned an ugly processing pipeline that translates into SQL of varying size/quality depending on the query generator we use. In this note we try a near-relative of that query in the data_algebra. dplyr translates the query to ...

Python-bloggers

Data science news and tutorials - contributed by Python bloggers