Creating a Reproducible Example

Posted on May 31, 2022 by The Jumping Rivers Blog in Data science

This article was first published on The Jumping Rivers Blog, and kindly contributed to python-bloggers.

Maintaining training materials

Over the last few years, we have increased both the number and types of
training courses we offer. In addition to our usual R courses in {dplyr}
and {shiny}, we also offer training on Docker, Python, Stan, TensorFlow,
and others.

As the number of courses we offer increased, so did the maintenance
burden of our associated training materials (lecture notes, slides,
exercises, and more). To ease this burden, and to assist in ensuring
that our training materials build consistently, we developed an R
package called {jrNotes2}. Amongst other things, this package ensures
that all courses:

have identical “template files”: .gitlab-ci.yml, .gitignore,
Makefiles, index.Rmd, …;
have the same directory structure, and
pass a set of quality-assurance checks.

To make a change to course content, a team member must push their
suggestions to a branch on GitLab. This action launches a CI job, which
runs a Docker container that performs a set of checks. The templated
.gitlab-ci.yml file ensures that every course undergoes the same build
process and quality-assurance checks. If the content passes these
checks, and an eligible approver approves the changes, then the
changes are merged into the main branch.

Cartoon showing arrows from Data scientist to GitLab to Docker container to Continuous Integration

This means course content in a main branch should never fail our checks.
Well, not quite…

Why we can’t freeze all dependencies

When teaching a course, we want to teach with the exact same packages
an attendee would get via an install.packages() or pip install
command. This means we must always use the latest versions of packages
available on CRAN and PyPI. However, always using the latest available
packages has its dangers: a change to a package used by a course can
suddenly cause our teaching materials to begin failing our build checks.

To pre-empt package changes breaking our training materials, we
use scheduled CI runs. That is, at regular intervals a CI job
automatically runs our tests and checks against a course’s training
materials. If a course’s materials fail these checks, we are notified
via a message in a Slack channel. Around early January, we started
getting notifications about our Introduction to Python course:

Screenshot of slack notification showing the failed pipeline, where failed job is notes-build.

The problem

Unfortunately, the traceback given by the CI wasn’t the most
enlightening:

segfault traceback screenshot

Strangely, the course materials

built successfully on Colin’s laptop;
failed to build on Jack’s laptop, and
failed to build on the CI runner.

As far as we could see, everything appeared roughly the same on all
three systems: all three were running the same operating system, the
same R version, and the same package versions.
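
One way to make "roughly the same" more precise on the Python side is to dump a snapshot of each environment and diff the results. The following is a minimal sketch, not the check we ran at the time: the function names snapshot_environment and diff_snapshots are hypothetical, and the snapshot only covers the Python interpreter and its installed packages, so it says nothing about R or about system libraries.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment():
    """Collect details that commonly differ between 'identical' machines."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that report no name
            packages[name.lower()] = dist.version
    return {
        "platform": platform.platform(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
        "packages": packages,
    }


def diff_snapshots(left, right):
    """Return {key: (left_value, right_value)} for every entry that differs."""
    differences = {}
    for key in ("platform", "machine", "python"):
        if left.get(key) != right.get(key):
            differences[key] = (left.get(key), right.get(key))
    left_pkgs = left.get("packages", {})
    right_pkgs = right.get("packages", {})
    for name in sorted(set(left_pkgs) | set(right_pkgs)):
        if left_pkgs.get(name) != right_pkgs.get(name):
            differences["package:" + name] = (
                left_pkgs.get(name),
                right_pkgs.get(name),
            )
    return differences


if __name__ == "__main__":
    # Save this output on each machine, then load the files and compare them.
    print(json.dumps(snapshot_environment(), indent=2, sort_keys=True))
```

An empty diff does not prove two systems behave identically. It only rules out Python package versions as the difference, which narrows the search to whatever the snapshot does not capture.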

Whilst we could reproduce the error in a Docker container, the error was
difficult to debug as

the container used a large number of internal Jumping Rivers R
packages;
the materials build process involved a set of non-trivial Rmd files,
and
the error wasn’t encountered until around eight minutes into the
build and test process.

In short, whilst we had a reproducible example of the error, it was only
reproducible by a Jumping Rivers employee, and it was far from a
minimal example.

Simplifying the problem

To make progress, we had to simplify the Docker container. We asked
ourselves the following questions:

Can we remove all unnecessary files, such as presentation slides? Yes.
Can we simplify the course notes? Yes: we were able to find a single Python code chunk that caused the issue.
Can we remove all of our custom Rmd styling? Yes: a simpler Rmd file with the same chunk gave the same error.
Can we reproduce the issue without R Markdown? Yes: a simple R script can reproduce the same error.
Does the Dockerfile need to be complex? No: we can remove most of the unnecessary Python, Debian and R related packages.
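
The search for a single failing chunk can be partly automated by running each chunk in its own fresh process and recording which ones exit abnormally. The sketch below is an illustration under stated assumptions rather than the procedure we followed: find_failing_chunks and its runner parameter are hypothetical names, and it assumes each chunk can run on its own without state from earlier chunks.

```python
import subprocess
import sys


def find_failing_chunks(chunks, runner=(sys.executable, "-c"), timeout=120):
    """Run each chunk in a fresh process; return [(index, returncode)] for failures.

    `runner` is the command prefix that receives the chunk source as its
    final argument. The default runs the chunk as plain Python.
    """
    failures = []
    for index, source in enumerate(chunks):
        result = subprocess.run(
            [*runner, source],
            capture_output=True,
            timeout=timeout,
        )
        # On POSIX systems a negative return code means the process was
        # killed by a signal, for example -11 for a segmentation fault.
        if result.returncode != 0:
            failures.append((index, result.returncode))
    return failures
```

With the default runner the chunks execute in plain Python. An error that only appears when Python is embedded in another process would need a different runner, for example a prefix such as ("Rscript", "-e") with each chunk written as an R expression.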

A minimal reproducible example

After all of our simplifications, we arrived at a minimal reproducible
example with the Dockerfile:

FROM rocker/r-ver:latest
RUN apt update && apt install -y python3 python3-dev python3-venv
RUN install2.r --error reticulate
COPY test.R /root/

and associated R script:

reticulate::virtualenv_create(
  envname = "./venv",
  packages = "matplotlib"
)
reticulate::use_virtualenv("./venv")
reticulate::py_run_string("import matplotlib.pyplot as plt; plt.plot([1, 2, 3], [1, 2, 3])")

By simplifying the problem, we were now in a position to ask for help
from others.

As this appeared to be a bug (it used to work, but now it doesn’t), we
raised an issue against the {reticulate} repository.

A (partial) solution

Soon after posting, we received a response from one of the {reticulate}
developers. Their response revealed that
matplotlib was nothing but an innocent bystander in our issue, and that
the real culprits were the incompatible BLAS (Basic Linear Algebra
Subprograms) libraries being used by R and numpy!
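
On Linux, one way to see which BLAS libraries a process has actually loaded is to read its memory map. The sketch below is an assumption-laden illustration, not the diagnosis from the issue thread: the function name blas_libraries is hypothetical, it relies on the Linux-only /proc filesystem, and it simply matches file names containing "blas" or "lapack".

```python
import re
from pathlib import Path

# Matches file names such as libblas.so.3 or libopenblas...so
BLAS_PATTERN = re.compile(r"blas|lapack", re.IGNORECASE)


def blas_libraries(maps_text):
    """Return sorted, unique library paths in a /proc/<pid>/maps dump
    whose file name mentions BLAS or LAPACK."""
    found = set()
    for line in maps_text.splitlines():
        # Columns: address, permissions, offset, device, inode, pathname
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if path.startswith("/") and BLAS_PATTERN.search(Path(path).name):
                found.add(path)
    return sorted(found)


if __name__ == "__main__":
    try:
        import numpy  # noqa: F401  (imported only so its BLAS gets loaded)
    except ImportError:
        pass
    maps_file = Path("/proc/self/maps")
    if maps_file.exists():  # Linux only
        for path in blas_libraries(maps_file.read_text()):
            print(path)
```

Run in a plain Python session, this lists only the library numpy brings along. To see a conflict of the kind described above, the same function would have to be called inside the process that also hosts R, for instance through {reticulate}.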

The suggested solution was to compile the numpy package from source
within Docker. However, compiling numpy at container runtime added
around 3 minutes to the CI checks every time they ran. As such, we
opted to build the numpy package from source at image build-time,
effectively caching the package build, and avoiding re-compiling numpy
every time our build tests ran against our training materials.
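
As a rough sketch of what "building from source" means in pip terms: pip's --no-binary option tells it to ignore pre-built wheels for the named package and compile the source distribution instead. The helper names below are hypothetical and the exact command used in our image is not shown in this post, so treat this as one plausible form of the step rather than a record of it.

```python
import subprocess
import sys


def source_install_command(package="numpy", python=sys.executable):
    """Build the pip command that compiles `package` from its source distribution."""
    return [
        python, "-m", "pip", "install",
        "--no-binary", package,  # refuse pre-built wheels for this package only
        "--force-reinstall",     # reinstall even if a copy is already present
        package,
    ]


def install_from_source(package="numpy", python=sys.executable):
    """Run the build; needs a compiler and the Python headers to be available."""
    subprocess.run(source_install_command(package, python), check=True)
```

Calling install_from_source() from a script executed in a Dockerfile RUN step, with python pointing at the virtual environment's interpreter, would perform the compilation once when the image is built. The resulting layer is then reused by every later CI run.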

Although compiling numpy from source did fix our issue, it is currently
more of a workaround than a long-term solution. Hopefully, a future
change to the BLAS libraries used by either the rocker image series or
numpy will allow the two to be friends again. Here’s to hoping!

Take-aways

Using scheduled CI jobs allowed us to catch this issue early, and
gave us plenty of time to fix it before the next time the course
ran.
Having CI ensured we had an (internally) reproducible example, as
the CI is based on a Docker container.
In order to get help, it was crucial to simplify the problem.
Debugging is hard, and it’s okay to ask for help!

References

https://github.com/rstudio/reticulate/issues/1133
