Better SQL Generation via the data_algebra
This article was first published on python – Win-Vector Blog , and kindly contributed to python-bloggers. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.
Want to share your content on python-bloggers? click here.
In our recent note What is new for rqueryrquery
December 2019 we mentioned an ugly processing pipeline that translates into
SQL
SQL
of varying size/quality depending on the query generator we use. In this note we try a near-relative of that query in the data_algebra
data_algebra
.dplyrdplyr
translates the query to
SQL
SQL
as:
SELECT 5.0 AS `x`, `sum23`<br> FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 4.0 AS `x`<br> FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 3.0 AS `x`<br> FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 2.0 AS `x`<br> FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 1.0 AS `x`<br> FROM (SELECT `col1`, `col2`, `col3`, `col2` + `col3` AS `sum23`<br> FROM `d`)))))
SELECT 5.0 AS `x`, `sum23`
FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 4.0 AS `x`
FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 3.0 AS `x`
FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 2.0 AS `x`
FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 1.0 AS `x`
FROM (SELECT `col1`, `col2`, `col3`, `col2` + `col3` AS `sum23`
FROM `d`)))))
rqueryrquery
translates the query to
SQL
SQL
as:
SELECT<br> "x",<br> "sum23"<br> FROM (<br> SELECT<br> "col2" + "col3" AS "sum23",<br> 5 AS "x"<br> FROM (<br> SELECT<br> "col2",<br> "col3"<br> FROM<br> "example_table"<br> ) tsql_28722584463189084716_0000000000<br> ) tsql_28722584463189084716_0000000001
SELECT
"x",
"sum23"
FROM (
SELECT
"col2" + "col3" AS "sum23",
5 AS "x"
FROM (
SELECT
"col2",
"col3"
FROM
"example_table"
) tsql_28722584463189084716_0000000000
) tsql_28722584463189084716_0000000001
Notice the
rquery
rquery
SQL
SQL
doesn’t copy the column col1
col1
around, and also skips the dead-values assigned into x
x
. The query still has some waste: the inner and outer guard queries that are used to make SQL
SQL
look a bit more regular.What I would like to add is our new note, showing what the data_algebradata_algebra
translates a similar query into the following
SQL
SQL
:
SELECT 5 AS "x",<br> "col2" + "col3" AS "sum23",<br> "col3"<br> FROM "d"
SELECT 5 AS "x",
"col2" + "col3" AS "sum23",
"col3"
FROM "d"
(The extra
col3
col3
as we asked for that column to be part of the result in the newer demonstration.) This new query has fewer unnecessary steps. The idea is one can code intent step-wise in a pipeline and still end up with a fairly compact and performant SQL
SQL
query in the end.I think both rqueryrquery
and data_algebradata_algebra
can save quite a lot of development resources and machine time in data wrangling tasks.
To leave a comment for the author, please follow the link and comment on their blog: python – Win-Vector Blog .
Want to share your content on python-bloggers? click here.