This note is a simple data wrangling example worked using both the Python data_algebra package and the R cdata package. Both of these packages make data wrangling easy through he use of coordinatized data concepts (relying heavily on Codd’s “rule of access”).

The advantages of data_algebra and cdata are:

• The user specifies their desired transform declaratively by example and in data. What one does is: work an example, and then write down what you want (we have a tutorial on this here).
• The transform systems can print what a transform is going to do. This makes reasoning about data transforms much easier.
• The transforms, as they themselves are written as data, can be easily shared between systems (such as R and Python).

Let’s re-work a small R cdata example, using the Python package data_algebra.

## An Example

First we import some modules and packages, and import some notional data.

In :
```# https://pandas.pydata.org
import pandas

# PyYAML from https://pyyaml.org
import yaml

# pip install https://github.com/WinVector/data_algebra/raw/master/dist/data_algebra-0.1.3.tar.gz
import data_algebra.cdata
import data_algebra.cdata_impl
import data_algebra.data_ops
import data_algebra.yaml
import data_algebra.SQLite

# ask YAML to write simpler structures
data_algebra.yaml.fix_ordered_dict_yaml_rep()

# print it out
iris
```
Out:
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpeciesid
05.13.51.40.2setosa0
14.93.01.40.2setosa1
24.73.21.30.2setosa2
Our goal is to move from this de-normalized or wide-form (or “model matrix”/”data matrix” form, where each record is exactly one row) into a tall form where records may span more than one row.

Or, more concretely, we want our data to look like the following.

In :
```answer = pandas.read_csv("answer.csv")
```
Out:
idSpeciesPartMeasureValue
00setosaPetalLength1.4
10setosaPetalWidth0.2
20setosaSepalLength5.1
30setosaSepalWidth3.5
41setosaPetalLength1.4
51setosaPetalWidth0.2
61setosaSepalLength4.9
71setosaSepalWidth3.0
82setosaPetalLength1.3
92setosaPetalWidth0.2
102setosaSepalLength4.7
112setosaSepalWidth3.2
Notice each row of the original data set is now four rows of the derived one. This “tall form” is often useful for plotting.

This sort of conversion can be called an anti-pivot. In Python these sorts of transforms are specified with pandas.DataFrame.pivot, pandas.pivot_table, pandas.melt,
and others. Many R packages supply related operators: reshape, reshape2, data.table, tidyr, cdata, and more. We feel the data_algebra and cdata methods (which only perform data-reshaping, and not aggregation) offer a number of significant advantages (some of which we have already mentioned, and a few more of which we will see demonstrated in this note).

Back to our example.

To specify our desired transform, we build a structure describing what a data record looks like. We can crib this specification from the answer as follows.

In :
```control_table = answer.loc[answer.id == 0, ['Part', 'Measure']]
control_table = control_table.reset_index(inplace=False, drop=True)
control_table["Value"] = [control_table['Part'][i] + '.' + control_table['Measure'][i] for
i in range(control_table.shape)]
control_table
```
Out:
PartMeasureValue
0PetalLengthPetal.Length
1PetalWidthPetal.Width
2SepalLengthSepal.Length
3SepalWidthSepal.Width
We can derive the control table from the answer, as we did here, or just type one in directly. The idea is: we can use any method we want to derive the prototype record shape, we are not limited to a sub-query language or methodology from any one package.

For each record we take care to identify what keys identify records (the `record_keys`) and want parts identify rows within the record (the `control_table_keys`). We suppress the `record_key` when writing the control table, as these are exactly the values that do not vary inside each record.

We combine all of these specification into a `RecordSpecification` as follows:

In :
```record_spec = data_algebra.cdata.RecordSpecification(
control_table,
control_table_keys = ['Part', 'Measure'],
record_keys = ['id', 'Species']
)
record_spec
```
Out:
```RecordSpecification
record_keys: ['id', 'Species']
control_table_keys: ['Part', 'Measure']
control_table:
Part Measure         Value
0  Petal  Length  Petal.Length
1  Petal   Width   Petal.Width
2  Sepal  Length  Sepal.Length
3  Sepal   Width   Sepal.Width```
In this notation any cells of the control table from columns that are not `control_table_keys` are “value stand ins”. During data transfrom these cells will be replaced by values coming from the columns named by these cells.

Now we can transform our original row-record oriented data into general block records. To do this we specify a `RecordMap` using our record specification to describe the outgoing record structure. The incoming record structure is implicitly assumed to be single-row records, unless we specify otherwise (using the `blocks_in` argument).

In :
```mp_to_blocks = data_algebra.cdata_impl.RecordMap(blocks_out=record_spec)
print(str(mp_to_blocks))
```
```Transform row records of the form:
record_keys: ['id', 'Species']
['id', 'Species', 'Petal.Length', 'Petal.Width', 'Sepal.Length', 'Sepal.Width']
to block records of structure:
RecordSpecification
record_keys: ['id', 'Species']
control_table_keys: ['Part', 'Measure']
control_table:
Part Measure         Value
0  Petal  Length  Petal.Length
1  Petal   Width   Petal.Width
2  Sepal  Length  Sepal.Length
3  Sepal   Width   Sepal.Width

```
And we are ready to apply our specified transform.

In :
```arranged_blocks = mp_to_blocks.transform(iris)
arranged_blocks
```
Out:
idSpeciesPartMeasureValue
00setosaPetalLength1.4
10setosaPetalWidth0.2
20setosaSepalLength5.1
30setosaSepalWidth3.5
41setosaPetalLength1.4
51setosaPetalWidth0.2
61setosaSepalLength4.9
71setosaSepalWidth3.0
82setosaPetalLength1.3
92setosaPetalWidth0.2
102setosaSepalLength4.7
112setosaSepalWidth3.2
We see the operation has been performed for us. Notice we specify the transform declaratively with data structures carrying deceptions of what we want, instead of having to build a sequence of verbs that realize the transformation.

### Inverting the transform

An inverse transform is simply expressed by reversing the roles of the `blocks_out` and `blocks_in` arguments. In this case the output is row-records, as we didn’t specify an outgoing block structure with `blocks_out`.

In :
```mp_to_rows = data_algebra.cdata_impl.RecordMap(blocks_in=record_spec)
print(str(mp_to_rows))
```
```Transform block records of structure:
RecordSpecification
record_keys: ['id', 'Species']
control_table_keys: ['Part', 'Measure']
control_table:
Part Measure         Value
0  Petal  Length  Petal.Length
1  Petal   Width   Petal.Width
2  Sepal  Length  Sepal.Length
3  Sepal   Width   Sepal.Width
to row records of the form:
record_keys: ['id', 'Species']
['id', 'Species', 'Petal.Length', 'Petal.Width', 'Sepal.Length', 'Sepal.Width']

```
In :
```arranged_rows = mp_to_rows.transform(arranged_blocks)
arranged_rows
```
Out:
idSpeciesPetal.LengthPetal.WidthSepal.LengthSepal.Width
00setosa1.40.25.13.5
11setosa1.40.24.93.0
22setosa1.30.24.73.2

## Arbitrary transforms

Arbitrary record to record transforms can be specified by setting both `blocks_in` (to describe incoming structure) and `blocks_out` (to describe outgoing structure) at the same time.

## Transforms in databases

`data_algebra` also implements all the transform steps in databases using `SQL` (via `row_recs_to_blocks_query()` and `blocks_to_row_recs_query()`).

These queries can be seen below.

In :
```db_model = data_algebra.SQLite.SQLiteModel()
```
In :
```print(db_model.row_recs_to_blocks_query(
source_view=data_algebra.data_ops.describe_pandas_table(
iris, 'iris'),
record_spec=record_spec,
record_view=data_algebra.data_ops.describe_pandas_table(
record_spec.control_table, "control_table")
))
```
```SELECT
a."id" AS "id",
a."Species" AS "Species",
b."Part" AS "Part",
b."Measure" AS "Measure",
CASE
WHEN b."Value" = 'Petal.Length' THEN a."Petal.Length"
WHEN b."Value" = 'Petal.Width' THEN a."Petal.Width"
WHEN b."Value" = 'Sepal.Length' THEN a."Sepal.Length"
WHEN b."Value" = 'Sepal.Width' THEN a."Sepal.Width"
ELSE NULL END AS "Value"
FROM (
"iris" ) a
CROSS JOIN (
"control_table" ) b
ORDER BY a."id", a."Species", b."Part", b."Measure"
```
In :
```print(db_model.blocks_to_row_recs_query(
source_view=data_algebra.data_ops.describe_pandas_table(
iris, 'iris'),
record_spec=record_spec
))
```
```SELECT
"id" AS "id",
"Species" AS "Species",
MAX(CASE WHEN  ( "Part" = 'Petal' )  AND  ( "Measure" = 'Length' )  THEN "Value" ELSE NULL END) AS "Petal.Length",
MAX(CASE WHEN  ( "Part" = 'Petal' )  AND  ( "Measure" = 'Width' )  THEN "Value" ELSE NULL END) AS "Petal.Width",
MAX(CASE WHEN  ( "Part" = 'Sepal' )  AND  ( "Measure" = 'Length' )  THEN "Value" ELSE NULL END) AS "Sepal.Length",
MAX(CASE WHEN  ( "Part" = 'Sepal' )  AND  ( "Measure" = 'Width' )  THEN "Value" ELSE NULL END) AS "Sepal.Width"
FROM (
"iris"
)
GROUP BY "id", "Species"
ORDER BY "id", "Species"
```
The use case for `SQL` queries is: the queries can be used to with “`CREATE TABLE table_name AS`” to materialize transform results in a database, without round-tripping the data in and out of the database.

As complicated as the queries look, they actually expose some deep truths:

• The `row_recs_to_blocks_query()` is essentially a cross-join of the data to the record description. Each combination of data row and record description row builds a new result row.
• The `blocks_to_row_recs_query()` is an aggregation. Each set of rows corresponding to a given data record is aggregated into a single result row.
• Just about any arbitrary record shape to arbitrary record shape can be written as a transform from the first record shape to row-records (record sets that have exactly one row per record), followed by a transform from the row-records to the new format. This transform can preserve column types as in the intermediate form each different record entry has its own column. This is an advantage of using a “thin” intermediate form such as RDF triples.

This leads us to believe that transforming to and from single-row records are in fact fundemental operations, and not just implementation details.

## The R `cdata` version

The `data_algebra` had been previously implemented in
R in the `cdata`, `rquery`, and `rqdatatable` packages.
This packages support both in-memory operations and translation of the transforms to SQL.

We would perform the above transforms in R as follows.

First we write out a copy of the transform.

In :
```# convert objects to a YAML string
xform_rep = yaml.dump(mp_to_blocks.to_simple_obj())
# write to file
with open("xform_yaml.txt", "wt") as f:
print(xform_rep, file=f)
```
Now we we start up our R environment and read in the transform specification and data.

In :
```%load_ext rpy2.ipython
```
In :
```%%R

# install.packages("cdata")
library(cdata)
library(yaml)
# https://github.com/WinVector/data_algebra/blob/master/Examples/cdata/cdata_yaml.R
source("cdata_yaml.R")

print(iris)
```
```  Sepal.Length Sepal.Width Petal.Length Petal.Width Species id
1          5.1         3.5          1.4         0.2  setosa  0
2          4.9         3.0          1.4         0.2  setosa  1
3          4.7         3.2          1.3         0.2  setosa  2
```
In :
```%%R

transform <- convert_yaml_to_cdata_spec(r_yaml)

print(transform)
```
```{
row_record <- wrapr::qchar_frame(
"id"  , "Species", "Petal.Length", "Petal.Width", "Sepal.Length", "Sepal.Width" |
.   , .        , Petal.Length  , Petal.Width  , Sepal.Length  , Sepal.Width   )
row_keys <- c('id', 'Species')

# becomes

block_record <- wrapr::qchar_frame(
"id"  , "Species", "Part" , "Measure", "Value"      |
.   , .        , "Petal", "Length" , Petal.Length |
.   , .        , "Petal", "Width"  , Petal.Width  |
.   , .        , "Sepal", "Length" , Sepal.Length |
.   , .        , "Sepal", "Width"  , Sepal.Width  )
block_keys <- c('id', 'Species', 'Part', 'Measure')

# args: c(checkNames = TRUE, checkKeys = FALSE, strict = FALSE, allow_rqdatatable = TRUE)
}

```
Now that we have recovered the transform, we can use it in R.

In :
```%%R

iris %.>% transform
```
```   id Species  Part Measure Value
1   0  setosa Petal  Length   1.4
2   0  setosa Petal   Width   0.2
3   0  setosa Sepal  Length   5.1
4   0  setosa Sepal   Width   3.5
5   1  setosa Petal  Length   1.4
6   1  setosa Petal   Width   0.2
7   1  setosa Sepal  Length   4.9
8   1  setosa Sepal   Width   3.0
9   2  setosa Petal  Length   1.3
10  2  setosa Petal   Width   0.2
11  2  setosa Sepal  Length   4.7
12  2  setosa Sepal   Width   3.2
```

### Cross-language work

As the record transform specifications, both in Python `data_algebra` and R `cata` are simple data structures (just the control table, and a few lists of key column names), they can be moved from one language to another by `YAML` (as we also demonstrated in the logistic scoring example.

`data_algebra` supplies a write method, so cross-language interoperation is just a matter of adding additional read/write methods.

In :
```print(yaml.dump(mp_to_blocks.to_simple_obj()))
```
```type: data_algebra.cdata_impl.RecordMap
blocks_out:
type: data_algebra.cdata.RecordSpecification
record_keys:
- id
- Species
control_table_keys:
- Part
- Measure
control_table:
Part:
- Petal
- Petal
- Sepal
- Sepal
Measure:
- Length
- Width
- Length
- Width
Value:
- Petal.Length
- Petal.Width
- Sepal.Length
- Sepal.Width

```

## Conclusion

The `cdata` and `data_algebra` systems yield powerful implementations, and deep understanding of the nature of record transformations. They allow one to reshape data quickly and conveniently either in R. Python/Pandas, or even SQL.