Caching

Reproducible software pipelines

  • Why do we use notebooks? To save time by not re-executing everything
  • But then you change a function definition, forget to re-run all the cells, and end up with inconsistent results

  • A better workflow is to re-execute the whole pipeline on each change

Caching

  • Break the pipeline down into small steps
  • Save the output of each step
  • Re-execute a step only if its inputs (or its code) changed; a minimal sketch follows
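
The idea can be sketched in a few lines of Python. This is a toy illustration of the principle, not the implementation of any of the tools below; run_step and the cache layout are made up:

import hashlib
import inspect
import pickle
from pathlib import Path

CACHE_DIR = Path(".cache")
CACHE_DIR.mkdir(exist_ok=True)

def run_step(func, *inputs):
    # Key the cache on the step's source code and its inputs:
    # if either changes, the key changes and the step is re-executed
    key = hashlib.sha256(
        (inspect.getsource(func) + repr(inputs)).encode()
    ).hexdigest()
    cache_file = CACHE_DIR / key
    if cache_file.exists():
        # Inputs and code unchanged: reuse the saved output
        return pickle.loads(cache_file.read_bytes())
    result = func(*inputs)
    cache_file.write_bytes(pickle.dumps(result))
    return result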

A long history

The make program is mainly used to build C/C++ programs

The first version was created in 1976 (!!!); C itself was created in 1972. For example:

blah: blah.o
    cc blah.o -o blah # Runs third

blah.o: blah.c
    cc -c blah.c -o blah.o # Runs second

blah.c:
    echo "int main() { return 0; }" > blah.c # Runs first

  • Each block is a step (a rule)
  • The order of the rules in the file does not matter
  • A step is re-executed when its inputs are newer than its output
  • Steps communicate through files
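
Assuming the rules above are saved in a Makefile, only the steps whose inputs changed are re-run:

$ make          # first run: creates blah.c, then blah.o, then blah
$ make          # nothing changed: make reports that blah is up to date
$ touch blah.c  # pretend the source changed
$ make          # re-runs only the compile and link steps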

Many modern descendants

  • Snakemake (multi-language)
  • joblib (Python)
  • targets (R)

Tagging along

  • As an example application, we will use motif discovery in time series
  • We will use the pyattimo library
  • Datasets are available here: https://is.gd/O3CVLe

Snakemake

  • The pipeline is defined in a file (a Snakefile) with its own custom syntax
  • File-based communication between stages
  • Filename patterns (wildcards) encode parameters; see the sketch below
  • Multi-language: R, Python, Julia, Rust…
  • Very widespread: about 45k references on GitHub
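
A minimal Snakefile sketch; the dataset name, the results layout, and find_motifs.py are placeholders, not the actual pipeline of the running example:

rule all:
    input:
        "results/ecg.motifs"

# The {dataset} wildcard turns the file name into a parameter:
# requesting results/ecg.motifs runs this rule with dataset=ecg
rule motifs:
    input:
        "data/{dataset}.csv"
    output:
        "results/{dataset}.motifs"
    shell:
        "python find_motifs.py {input} {output}"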

Python’s joblib

  • joblib is a Python library
  • You write plain Python scripts
  • Steps of the pipeline are defined as functions
  • A function is cached by decorating it with the cache method of a joblib Memory object; see the sketch below
  • Whenever the code of a function changes, its cache is invalidated
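
A minimal sketch using joblib's Memory; load_dataset and the file path are placeholders:

from joblib import Memory

# Cached results are stored on disk in this directory
memory = Memory(".joblib_cache", verbose=0)

@memory.cache
def load_dataset(path):
    # Expensive step: executed once per (code, arguments) combination
    with open(path) as f:
        return [float(line) for line in f]

data = load_dataset("data/ecg.csv")  # first call computes and stores the result
data = load_dataset("data/ecg.csv")  # second call loads it from the cache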

R’s targets

  • targets is an R library
  • You write R scripts
  • The pipeline is explicitly structured as a list of targets in the _targets.R file; see the sketch below
  • Whenever the code of a function changes, its cache is invalidated, along with that of its dependents
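
A minimal _targets.R sketch; the input file name and fit_model are placeholders:

library(targets)

list(
  tar_target(raw_file, "data/ecg.csv", format = "file"),  # track the input file
  tar_target(data, read.csv(raw_file)),                    # step 1: load the data
  tar_target(model, fit_model(data))                       # step 2: depends on data
)

Running tar_make() executes the pipeline, skipping every target whose code and upstream results are unchanged.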