Reproducible reports with Jupyter

Posted on September 21, 2023 by The Jumping Rivers Blog in Data science | 0 Comments

This article was first published on The Jumping Rivers Blog, and kindly contributed to python-bloggers.

Jupyter notebooks are a useful tool for Python users of all levels. They allow
us to mix together plain text (formatted as Markdown) with Python code. This
is beneficial for beginners and experienced data scientists alike:

- Beginners that are learning Python for the first time can use Markdown cells
  to annotate code and record notes.
- By splitting up their code into chunks, developers can write and test their
  code in a modular manner.
- Jupyter notebooks are open-source and a convenient format for developers to
  share reports containing live code, equations, visualisations and narrative
  text with colleagues.

In this post, we will go deeper with these ideas and show you how to create
reproducible HTML and PDF reports with Jupyter. This blog is a follow-up to
Quarto for the Python user,
which explained how to generate reproducible reports from plain text files with
Quarto.

What is Quarto?

Quarto is free-to-use, open-source software based on Pandoc that enables
users to convert plain text files into a range of formats, including PDF, HTML
and PowerPoint presentations. These documents can contain a mixture of
narrative text, Python code, and figures that are dynamically generated by the
embedded code.

This has many use-cases:

- Your company may have a weekly board meeting to go over the latest sales
  figures. By having a Quarto presentation that pulls in the latest company
  sales data, you can regenerate the presentation slides each week at the click
  of a button.
- As a researcher you may be preparing a report for publication. By having the
  code that generates data tables and figures embedded within the report,
  regenerating the draft as the experimental data floods in is a breeze!

In our recent blog post,
Quarto for the Python user,
we used Quarto to render dynamic reports that mix together Python code and
narrative text. We used Quarto’s standard workflow, which starts from plain
text .qmd files. In this post we will extend these ideas to Jupyter
Notebooks.

Starting with .ipynb notebook files, the Quarto workflow is:

A flow chart of the Quarto rendering workflow: The ipynb file is first converted to Markdown, with Jupyter used to interpret the code cells. The Markdown file can then be converted to a variety of formats, including HTML, DOCX and PDF, using Pandoc.

- A Jupyter kernel is used to interpret the Python code cells and Quarto
  generates a Markdown document.
- The Markdown document includes the text, code, and any figures or results
  that were generated by the code.
- This is then converted into the desired output format (PDF, HTML, etc) using
  Pandoc.

Prerequisites

We will be using VS Code to edit and
render our Jupyter notebook (the only other IDE with support for both Jupyter
and Quarto is
JupyterLab).
Before you can work with Jupyter in VS Code, you will need to install the
Jupyter extension. This can be located in VS Code by clicking “Settings” ->
“Extensions” then typing “jupyter” into the extensions search bar. Select the
“Jupyter” extension by Microsoft and click “Install”.

You will also need to install Quarto.
You can then find the Quarto extension in VS Code by typing “quarto” into the
extensions search bar. Select the “Quarto” extension and click “Install”.

Finally, to reproduce the examples covered in this post, you will need to
install the Python dependencies by running the following command from your
terminal:

python3 -m pip install ipykernel nbclient nbformat pandas papermill plotly statsmodels

These dependencies are required for creating an interactive Plotly figure in
Jupyter and rendering the notebook from the command line.
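
Before moving on, it can be worth confirming that these packages are importable from the Python installation you intend to use. The following is a minimal sketch using only the standard library. Note that missing_packages is a hypothetical helper name chosen for illustration, not part of any of the packages above.

```python
from importlib.util import find_spec

# Import names of the packages installed by the pip command above.
REQUIRED = [
    "ipykernel", "nbclient", "nbformat", "pandas",
    "papermill", "plotly", "statsmodels",
]


def missing_packages(names):
    """Return the names that cannot be found by the current interpreter."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies are installed.")
```

If any names are reported as missing, re-run the pip command with the same interpreter that your Jupyter kernel uses.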

Setting up a virtual environment

In case you’d like to follow along with these examples using a virtual
environment, we will provide brief instructions for registering it as a
Jupyter kernel. If you’re happy to just use your system Python installation
then you can move on to the next section.

To create a virtual environment, run the following command from your command
terminal:

python3 -m venv venv

This will create a folder called “venv” containing the virtual environment
(you can call it whatever you like). To activate it on Linux or macOS, run:

source venv/bin/activate

Now install the Python dependencies into your environment by running the pip
command shared above. You can now add this environment to your list of Jupyter
kernels by running:

ipython kernel install --user --name=venv

This will add a kernel called “venv”. Next time you open a Jupyter notebook,
you should now be able to select this kernel from the list of options.
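
To check that a notebook is really running on the new kernel, you could run something like the following in a Python cell. This is a small sketch using only the standard library. The function name in_virtual_environment is our own choice for illustration.

```python
import sys


def in_virtual_environment():
    """Return True when Python is running from a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment folder while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Virtual environment active:", in_virtual_environment())
```

When the “venv” kernel is selected, the interpreter path should point inside the “venv” folder.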

Rendering a report

We will generate a report about Mario Kart 64 world records. Please refer to
our previous post
for a recap of the YAML header, Markdown syntax and code chunk options (we will
only briefly cover these topics here).

Setting up Jupyter

Within VS Code, create a Jupyter notebook by clicking “File” -> “New File…”
-> “Jupyter Notebook (.ipynb support)”. Within the notebook, you can select the
kernel by clicking “Select Kernel” and choosing an option from the available
list (for example, your system Python installation or a virtual environment).
For this post, we used Python 3.10.

Header settings

The first code cell should be changed to a Raw NB Convert cell. In VS Code, the
cell type can be changed by clicking the text in the bottom-right corner of the
cell (this will read “Python” for a Python code cell). To select a raw cell,
type “raw” in the search bar and click the option that appears.

The raw NB convert cell acts as the YAML header of the Quarto report. This is
where we include settings such as the title and default output format. Our
example is given below:

---
title: "Reporting on Mario Kart 64 World Records"
author: "Parisa Gregg & Myles Mitchell"
date: "1 Aug 2023"
format: html
execute:
    eval: true
jupyter: python3
---

This sets the default output format to HTML, ensures that the code cells are
evaluated when the notebook is executed, and selects the python3 Jupyter
kernel. Remember to include the fencing (---) at the top and bottom of the
YAML header.
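
To make the role of the raw cell more concrete, the sketch below builds a bare notebook as a Python dictionary, with a trimmed copy of the header as its first cell, and serialises it to the JSON text that a .ipynb file holds. It assumes the version 4 notebook layout, and new_notebook_with_header is a hypothetical helper. In practice VS Code writes this structure for you.

```python
import json

# A trimmed copy of the YAML header above, as it would be typed in the raw cell.
YAML_HEADER = """---
title: "Reporting on Mario Kart 64 World Records"
format: html
execute:
    eval: true
jupyter: python3
---"""


def new_notebook_with_header(header):
    """Build a minimal notebook dictionary whose first cell is a raw cell."""
    return {
        "cells": [
            {"cell_type": "raw", "metadata": {}, "source": header},
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


notebook = new_notebook_with_header(YAML_HEADER)
# Serialise to the JSON text that would be saved as a .ipynb file.
notebook_json = json.dumps(notebook, indent=1)
print(notebook["cells"][0]["cell_type"])
```

The point to take away is that the header is stored as an ordinary cell of type “raw”, which is why it must be the first cell in the notebook.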

Adding text and code

The remainder of the report will be built from a mixture of Markdown and Python
code cells:

- Markdown cells are used for narrative text in the report.
- Python cells are used for displaying Python code and generating dynamic
  content (e.g., figures, tables and inline results).

Try copying the following into a Markdown cell. This adds the Abstract,
Introduction and the beginning of the Methods section:

## Abstract

Investigating how the world record for Rainbow Road in Mario Kart 64
developed over time.

## Introduction

Mario Kart 64 is a racing video game developed and published by
[Nintendo](https://en.wikipedia.org/wiki/Nintendo) for the
[Nintendo 64](https://en.wikipedia.org/wiki/Nintendo_64).

Players can choose from eight characters to race as, including:

- Mario
- Toad
- Princess Peach

The game consists of 16 tracks to race around. World records can be
set for either one lap or a full race (three laps) of the course. As
players have competed for faster times, several track shortcuts have
been discovered. There are separate world records for both _with_ and
_without_ the use of a shortcut.

## Methods

We loaded a dataset of [Mario Kart 64](https://mkwrs.com/) world
records. This data is from [tidytuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-05-25/readme.md)
with credit to [Benedikt Claus](https://github.com/benediktclaus).

For this investigation we are interested in the world records for
Rainbow Road over a three-lap course. The dataset was loaded and
filtered using pandas:

By running the Markdown cell, the text will be rendered so it includes
subheadings, bullet points, italic text formatting and hyperlinks.

Next we may wish to display the code used for loading and filtering the data.
Try copying this code into a Python cell:

import pandas as pd

# Load the records data
records = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-05-25/records.csv"
)
# Filter the data
rainbow_road = records.loc[
        (records["track"] == "Rainbow Road") &
        (records["type"] == "Three Lap")
].reset_index()
# View the data
rainbow_road.head()

Running this should produce the expected Pandas output, including the first five
rows of the rainbow_road data.

Let’s now include some results, starting with a Markdown cell to add the
Results section header and opening text:

## Results

The figure below shows the development of world records for the Rainbow Road
track on Mario Kart 64 from 1997 to 2021.

We could insert the figure as a PNG or PDF image. But to make this report
reproducible, let’s dynamically generate the figure using a Python code cell:

#| echo: false
#| fig-cap: "Progress of Rainbow Road world records, with and without allowing shortcuts."
#| fig-width: 8
#| label: wr-plot
import plotly.express as px

px.line(
    rainbow_road,
    x="date",
    y="time",
    color="shortcut",
    title="Progress of Rainbow Road N64 World Records",
    line_shape="hv",
    markers=True
)

The code chunk options at the top of this cell will make the code invisible in
the rendered document and set the figure caption, width, and label to
our liking. Plotly is used to visualise the world record for Rainbow Road over
time. Try running this code within your notebook to check that it generates a
figure like the one below:

Image of the plot generated by the Plotly code above. The three-lap world record time is plotted against date from 1997 to 2021. Two coloured lines are shown: red for world records with a shortcut, and blue for without a shortcut.

Finally, let’s quote the longest time a world record was held for using inline
code. Copy this code into a Python cell:

#| echo: false
from IPython.display import display, Markdown

max_duration = rainbow_road.record_duration.max()
display(Markdown(
f"""
The longest a 3 lap world record was held 
for on Rainbow Road is {max_duration} days
({round(max_duration/365,1)} years).
"""
))

Running this should add the sentence “The longest a 3 lap world record was held
for on Rainbow Road is 2214 days (6.1 years).”, where the numbers 2214 and 6.1
have been calculated by Python. If more data is added, these numbers can be
updated automatically by re-rendering the notebook.
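
If you would like to check the sentence-building logic outside of the notebook, it can be pulled into a small function. This is only a sketch: longest_record_sentence is a hypothetical helper, and the value 2214 is the maximum duration quoted above rather than something computed here.

```python
def longest_record_sentence(max_duration):
    """Build the summary sentence from a record duration given in days."""
    years = round(max_duration / 365, 1)
    return (
        "The longest a 3 lap world record was held "
        f"for on Rainbow Road is {max_duration} days ({years} years)."
    )


print(longest_record_sentence(2214))
```

Inside the notebook, the returned string could be passed to display(Markdown(...)) in the same way as the f-string above.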

Rendering your notebook

You should now have a complete notebook with a YAML header, Markdown text and
Python code cells. To see how it should look, you can view our notebook here.

To render the report from the command line:

- quarto render <notebook>.ipynb --to html will render the document as HTML.
- quarto preview <notebook>.ipynb will generate a live preview which
  can be viewed as you edit the notebook.
- quarto render <notebook>.ipynb --execute will execute the code cells as the
  output is generated. Without this, you will need to ensure that you have run
  the code cells in the notebook manually, before quarto is used to render
  it.
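
If you prefer to trigger the render from a Python script, one option is to call the quarto command through the standard library. The sketch below assumes quarto is on your PATH and only uses the flags described above. The function names render_command and render are our own.

```python
import shutil
import subprocess


def render_command(notebook, output_format="html", execute=True):
    """Build the argument list for a `quarto render` call."""
    command = ["quarto", "render", notebook, "--to", output_format]
    if execute:
        command.append("--execute")
    return command


def render(notebook, **options):
    """Run `quarto render`, raising an error if Quarto is missing or fails."""
    if shutil.which("quarto") is None:
        raise RuntimeError("The quarto executable was not found on PATH.")
    subprocess.run(render_command(notebook, **options), check=True)


# Show the command that would be run for a notebook called rainbow_road.ipynb.
print(" ".join(render_command("rainbow_road.ipynb")))
```

Building the command as a list keeps the logic easy to test without actually invoking Quarto.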

Upon rendering, an HTML document like the one here should be created.

It’s also possible to render the notebook with the VS Code UI. Provided you
have the Quarto extension installed, there should be options to “Render”,
“Render All”, “Render HTML”, “Render PDF”, and “Render DOCX”:

Screenshot displaying the render options in the VS Code UI. The options are accessed by clicking on the symbol with three dots found in the tool bar. The rendering options include “Render”, “Render All”, “Render DOCX”, “Render HTML” and “Render PDF”.

Note that the HTML plot generated by Plotly cannot be displayed in a DOCX or
PDF document. Instead we would have to use a static image format like PNG or
PDF.

Cell embedding

In Quarto 1.3 a new feature was added that enables you to embed external
Jupyter notebook cells in a Quarto document. This is particularly useful if you
have results from different notebooks that you want to extract into a report.

As well as investigating the world records set on Rainbow Road, we have also
been looking at those set on Choco Mountain. The results for Choco Mountain are
in a separate choco_mountain.ipynb notebook. We might now want to summarise
our various Mario Kart results in a single .qmd report (see our previous post
for a guide to .qmd reports).

Rather than having to replicate our plotting code, we can embed the relevant
cells from our rainbow_road.ipynb and choco_mountain.ipynb notebooks
directly into the .qmd report:

---
title: "Reporting on Mario Kart 64 World Records"
author: "Myles Mitchell & Parisa Gregg"
date: "14 June 2023"
format: html
---

## Rainbow Road

The figure below shows the development of world records for the
Rainbow Road track on Mario Kart 64 from 1997 to 2021.

{{< embed rainbow_road.ipynb#wr-plot >}}


## Choco Mountain

The figure below shows the development of world records for the
Choco Mountain track on Mario Kart 64 from 1997 to 2021.

{{< embed choco_mountain.ipynb#wr-plot >}}

Here we have used the “wr-plot” label to reference the code cells that produce
the Plotly figures in the Rainbow Road and Choco Mountain reports. These code
cells are now embedded in the .qmd report and the figures will be visible
in the rendered document (as can be seen here).
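
Since the embed relies on the label being present in the source notebook, it can be handy to list the labels a notebook defines before referencing them. The following is a rough sketch that reads the notebook as plain JSON and looks for "#| label:" lines in code cells. The helper names cell_labels and labels_in_file are hypothetical, and the in-memory example simply stands in for rainbow_road.ipynb.

```python
import json


def cell_labels(notebook):
    """Collect `#| label:` values from the code cells of a notebook dictionary."""
    labels = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format stores source as a string or a list of lines.
        text = source if isinstance(source, str) else "".join(source)
        for line in text.splitlines():
            stripped = line.strip()
            if stripped.startswith("#| label:"):
                labels.append(stripped[len("#| label:"):].strip())
    return labels


def labels_in_file(path):
    """Read a .ipynb file from disk and return the labels it defines."""
    with open(path, encoding="utf-8") as handle:
        return cell_labels(json.load(handle))


# A tiny in-memory notebook standing in for rainbow_road.ipynb.
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["## Results\n"]},
        {
            "cell_type": "code",
            "source": [
                "#| echo: false\n",
                "#| label: wr-plot\n",
                "import plotly.express as px\n",
            ],
        },
    ]
}
print(cell_labels(example))
```

Calling labels_in_file("rainbow_road.ipynb") on the real notebook should include “wr-plot” if the plotting cell has been labelled as shown earlier.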

Parameterised Reports

Above we produced a report for the Rainbow Road world records on Mario Kart 64.
There are 16 tracks in total in the game. What if we wanted to replicate this
report for each track? With Quarto and Jupyter notebooks we can define a set of
parameters to easily create different variations of a report.

To parameterise a Jupyter notebook we need to create a cell with a “parameters”
tag. To add a parameters tag to a Python cell in VS Code, click on “…” (More
Actions) in the cell tool bar and select “Add Cell Tag”:

Screenshot depicting how to add a tag to a notebook cell. The cell actions are expanded by clicking on the symbol with three dots in the cell tool bar. The “Add Cell Tag” option is visible in the dropdown list.

To add a parameters tag we then just type “parameters” into the pop up box:

Screenshot showing the pop up box that appears after selecting the “Add Cell Tag” option. A parameters tag is added by typing “parameters” into the box and pressing Enter.

The cell should now have a “parameters” tag:

Screenshot showing a code cell after it has been assigned a parameters tag. A “parameters” label is now visible at the lower-left corner of the cell, with an option to add another tag to the right of it.

If we want to have the track as a parameter in the report, we can define a
track variable in the tagged cell (as above):

track = "Rainbow Road"

We can then use this variable in the remainder of our notebook. For example, it can be used to set the track filter in the data-loading code:

# Load the records data
records = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-05-25/records.csv"
)
# Filter the data
course_records = records.loc[
        (records["track"] == track) &
        (records["type"] == "Three Lap")
].reset_index()

The full code for our parameterised mario_kart.ipynb notebook can be found
here.
In this example we have used "Rainbow Road" as the
default value for our track parameter. Running the following will therefore
generate a report for Rainbow Road:

quarto render mario_kart.ipynb --execute

If we want to report on the "Moo Moo Farm" world records instead, we can pass
this to the track parameter on the command line using the -P flag:

quarto render mario_kart.ipynb -P track:"Moo Moo Farm" --execute

You may have noticed that running the above command actually inserts a new
cell into mario_kart.ipynb, defining the track variable as “Moo Moo Farm”:

# Injected Parameters
track = "Moo Moo Farm"
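
To produce a report for several tracks in one go, the same command can be generated in a loop. The sketch below is one possible approach rather than a recommended workflow. It assumes quarto is on your PATH and that your Quarto version accepts the --output option for naming the rendered file (check quarto render --help). The track list is a short illustrative subset, and parameterised_command and render_all are hypothetical helper names.

```python
import shlex
import subprocess

# A few track names for illustration; the game has 16 tracks in total.
TRACKS = ["Rainbow Road", "Moo Moo Farm", "Choco Mountain"]


def parameterised_command(track, notebook="mario_kart.ipynb"):
    """Build a `quarto render` call that overrides the track parameter."""
    # Assumption: --output names the rendered file, so reports do not
    # overwrite one another.
    output_name = track.lower().replace(" ", "_") + ".html"
    return [
        "quarto", "render", notebook,
        # Passed as its own list element, the track name needs no shell quoting.
        "-P", f"track:{track}",
        "--output", output_name,
        "--execute",
    ]


def render_all(tracks):
    """Render one report per track, stopping at the first failure."""
    for track in tracks:
        subprocess.run(parameterised_command(track), check=True)


# Print the commands that render_all(TRACKS) would run.
for track in TRACKS:
    print(shlex.join(parameterised_command(track)))
```

Calling render_all(TRACKS) would then render one HTML file per track, each using a different value for the track parameter.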
