I am pleased to announce the 0.9.0 release of the data algebra. The data algebra is a realization of the Codd relational algebra for data, written in terms of Python method chaining. It allows the concise, clear specification of useful data transforms. Some examples can be found here. Benefits include […]
I have a new intermediate introduction on the data algebra up here: Using the data algebra for Statistics and Data Science. The data algebra is a tool for data processing in Python which is implemented on top of any of Pandas, Google BigQuery, PostgreSQL, MySQL, Spark, and SQLite. It allows […]
I’ve thought of Pandas as an in-memory, column-oriented data structure with reasonable performance. If I need high performance or scale, I can move to a database. Now I kind of wonder what Pandas is, or what it wants to be. The version 1.3.0 package seems to be marking natural ways […]
I’d like to work an example of using SQL WITH Common Table Expressions to produce more legible SQL. A major complaint with SQL is that it composes statements by rightward nesting. That is: a sequence of operations A -> B -> C is represented as SELECT C FROM SELECT […]
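To make the nesting complaint concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table `d`, its column `x`, and the three steps (filter, scale, sum) are made up for illustration and are not taken from the original post. The same pipeline is written once with nested sub-queries and once with WITH, so the two can be compared side by side.

```python
import sqlite3

# Hypothetical example data: a one-column table named d.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE d (x INTEGER)")
conn.executemany("INSERT INTO d (x) VALUES (?)", [(1,), (2,), (3,), (4,)])

# Nested form: the last step is outermost, so the query reads inside-out.
nested_sql = """
SELECT SUM(y) AS total
FROM (
    SELECT x * 10 AS y
    FROM (
        SELECT x FROM d WHERE x > 1
    ) AS a
) AS b
"""

# CTE form: each step gets a name and the steps read top to bottom.
cte_sql = """
WITH
  a AS (SELECT x FROM d WHERE x > 1),
  b AS (SELECT x * 10 AS y FROM a)
SELECT SUM(y) AS total FROM b
"""

nested_total = conn.execute(nested_sql).fetchone()[0]
cte_total = conn.execute(cte_sql).fetchone()[0]
print(nested_total, cte_total)
```

Both queries compute the same result; the CTE version simply lists the steps in the order they are applied.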
I’ve been tinkering a lot recently with the data_algebra, and just released version 0.7.0 to PyPI. In this note I’ll touch on what the data algebra is, what the new features are, and my plans going forward.

The data algebra

The data algebra is a modern realization of […]
We have a new, improved version of the “how to design a cdata/data_algebra data transform” video up!
The original article, the Python example, and the R example have all been updated to use the new video.
Please check it out!
Nina Zumel and I have two new tutorials on fluid data wrangling/shaping. They are written in a parallel structure, with the R version of the tutorial being almost identical to the Python version of the tutorial. This reflects our opinion on the “which is better for data science ...
We’ve been experimenting with this for a while, and the next R vtreat package will have a back-port of the Python vtreat package sklearn pipe step interface (in addition to the standard R interface). This means the user can easily express modeling intent by choosing between coder$fit_...
For quite a while we have been teaching that estimating variable re-encodings on the exact same data that is later naively used to train a model leads to an undesirable nested model bias. The vtreat package (both the R version and the Python version) incorporates a cross-frame method that allows ...
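The general idea behind avoiding nested model bias can be sketched in plain Python. This is not vtreat's implementation or API: the function names `mean_encode` and `cross_frame_encode`, the fold assignment by index, and the toy data are all assumptions made for illustration. Each row's encoding is estimated only from rows in other folds, so a row's own outcome never leaks into its own re-encoded value.

```python
def mean_encode(fit_levels, fit_y, apply_levels, default):
    """Encode each level as the mean outcome seen for that level in the fit rows."""
    sums, counts = {}, {}
    for lev, y in zip(fit_levels, fit_y):
        sums[lev] = sums.get(lev, 0.0) + y
        counts[lev] = counts.get(lev, 0) + 1
    # Levels never seen in the fit rows fall back to the default value.
    return [sums[lev] / counts[lev] if lev in counts else default
            for lev in apply_levels]


def cross_frame_encode(levels, y, n_folds=3):
    """Out-of-fold encoding: row i is encoded using only rows from other folds."""
    n = len(y)
    encoded = [None] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        rest = [i for i in range(n) if i % n_folds != fold]
        default = sum(y[i] for i in rest) / len(rest)
        codes = mean_encode([levels[i] for i in rest],
                            [y[i] for i in rest],
                            [levels[i] for i in held_out],
                            default)
        for i, code in zip(held_out, codes):
            encoded[i] = code
    return encoded


# Toy data, made up for this sketch.
levels = ["a", "b", "a", "b", "a", "b"]
y = [1, 0, 1, 1, 0, 0]

# Naive encoding: fit and apply on the same rows, so each row sees its own outcome.
naive = mean_encode(levels, y, levels, default=sum(y) / len(y))
# Cross-frame style encoding: each row is encoded without using its own outcome.
crossed = cross_frame_encode(levels, y)
print(naive)
print(crossed)
```

Changing the outcome of a single row alters that row's naive encoding but leaves its out-of-fold encoding untouched, which is the property that keeps the downstream model from over-trusting the re-encoded variable.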