Caching

Reproducible software pipelines

  • Why do we use notebooks? To save time by not re-executing everything
  • But then you change a function definition, forget to re-run all the cells, and end up with inconsistent results

  • A better workflow is to re-execute the whole pipeline on each change

Caching

  • Break the pipeline down into small steps
  • Save the output of each step
  • Re-execute a step only if its inputs (or its code) changed; a minimal sketch follows
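
The idea can be sketched in a few lines of Python. This is a toy illustration of the principle, not the implementation of any of the tools below; run_step and the cache layout are made up:

import hashlib
import inspect
import pickle
from pathlib import Path

CACHE_DIR = Path(".cache")
CACHE_DIR.mkdir(exist_ok=True)

def run_step(func, *inputs):
    # Key the cache on the step's source code and its inputs:
    # if either changes, the key changes and the step is re-executed
    key = hashlib.sha256(
        (inspect.getsource(func) + repr(inputs)).encode()
    ).hexdigest()
    cache_file = CACHE_DIR / key
    if cache_file.exists():
        # Inputs and code unchanged: reuse the saved output
        return pickle.loads(cache_file.read_bytes())
    result = func(*inputs)
    cache_file.write_bytes(pickle.dumps(result))
    return result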

A long history

The make program is mainly used to build C/C++ programs

The first version was created in 1976 (!!!); C itself was created in 1972. For example:

blah: blah.o
    cc blah.o -o blah # Runs third

blah.o: blah.c
    cc -c blah.c -o blah.o # Runs second

blah.c:
    echo "int main() { return 0; }" > blah.c # Runs first

  • Each block is a step (a rule)
  • The order of the rules in the file does not matter
  • A step is re-executed when its inputs are newer than its output
  • Steps communicate through files
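
Assuming the rules above are saved in a Makefile, only the steps whose inputs changed are re-run:

$ make          # first run: creates blah.c, then blah.o, then blah
$ make          # nothing changed: make reports that blah is up to date
$ touch blah.c  # pretend the source changed
$ make          # re-runs only the compile and link steps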

Many modern descendants

  • Snakemake (multi-language)
  • joblib (Python)
  • targets (R)

Tagging along

  • As an example application, we will use motif discovery in time series
  • We will use the pyattimo library
  • Datasets are available here: https://is.gd/O3CVLe

Snakemake

  • The pipeline is defined in a file (a Snakefile) with its own custom syntax
  • File-based communication between stages
  • Filename patterns (wildcards) encode parameters; see the sketch below
  • Multi-language: R, Python, Julia, Rust…
  • Very widespread: about 45k references on GitHub
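
A minimal Snakefile sketch; the dataset name, the results layout, and find_motifs.py are placeholders, not the actual pipeline of the running example:

rule all:
    input:
        "results/ecg.motifs"

# The {dataset} wildcard turns the file name into a parameter:
# requesting results/ecg.motifs runs this rule with dataset=ecg
rule motifs:
    input:
        "data/{dataset}.csv"
    output:
        "results/{dataset}.motifs"
    shell:
        "python find_motifs.py {input} {output}"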

Python’s joblib

  • joblib is a Python library
  • You write plain Python scripts
  • Steps of the pipeline are defined as functions
  • A function is cached by decorating it with the cache method of a joblib Memory object; see the sketch below
  • Whenever the code of a function changes, its cache is invalidated
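
A minimal sketch using joblib's Memory; load_dataset and the file path are placeholders:

from joblib import Memory

# Cached results are stored on disk in this directory
memory = Memory(".joblib_cache", verbose=0)

@memory.cache
def load_dataset(path):
    # Expensive step: executed once per (code, arguments) combination
    with open(path) as f:
        return [float(line) for line in f]

data = load_dataset("data/ecg.csv")  # first call computes and stores the result
data = load_dataset("data/ecg.csv")  # second call loads it from the cache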

R’s targets

  • targets is an R library
  • You write R scripts
  • The pipeline is explicitly structured as a list of targets in the _targets.R file; see the sketch below
  • Whenever the code of a function changes, its cache is invalidated, along with that of its dependents
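
A minimal _targets.R sketch; the input file name and fit_model are placeholders:

library(targets)

list(
  tar_target(raw_file, "data/ecg.csv", format = "file"),  # track the input file
  tar_target(data, read.csv(raw_file)),                    # step 1: load the data
  tar_target(model, fit_model(data))                       # step 2: depends on data
)

Running tar_make() executes the pipeline, skipping every target whose code and upstream results are unchanged.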