Reproducible software pipelines
Notebooks, especially Jupyter notebooks, are a very common tool in data science pipelines.
Use with caution!
In this case it is rather simple to reconstruct the execution flow, but with long notebooks it becomes nearly impossible, so much so that in many cases one dreads restarting the notebook.
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "4cc139ba-137c-4ddc-b966-19f6ac9267c2",
"metadata": {},
"outputs": [],
"source": [
"x = 1"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "66366e84-ab75-405f-b518-125651e814f9",
"metadata": {},
"outputs": [],
"source": [
"x = 2"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6a70ba6d-875f-4354-877e-f7cbcfda48af",
"metadata": {},
"outputs": [],
"source": [
"print(x)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "70b3bd0e-f5ee-46ef-bdc0-712390b0c12c",
"metadata": {},
"outputs": [],
"source": [
"x = 3"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
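The execution flow mentioned above can be partly recovered from the file itself, because each code cell stores an "execution_count". The following is a minimal sketch, assuming only the Python standard library; the function name execution_order and the trimmed-down example notebook are illustrative, not part of any Jupyter API.

```python
import json


def execution_order(notebook_json: str) -> list[tuple[int, str]]:
    """Return (execution_count, source) pairs for executed code cells, in run order."""
    nb = json.loads(notebook_json)
    executed = [
        (cell["execution_count"], "".join(cell["source"]))
        for cell in nb["cells"]
        # Cells that were never run have execution_count set to null (None).
        if cell["cell_type"] == "code" and cell.get("execution_count") is not None
    ]
    return sorted(executed)


# Trimmed-down version of the notebook above (ids and metadata omitted).
example = json.dumps({
    "cells": [
        {"cell_type": "code", "execution_count": 1, "source": ["x = 1"]},
        {"cell_type": "code", "execution_count": None, "source": ["x = 2"]},
        {"cell_type": "code", "execution_count": None, "source": ["print(x)"]},
        {"cell_type": "code", "execution_count": 2, "source": ["x = 3"]},
    ]
})

for count, source in execution_order(example):
    print(f"[{count}] {source}")
```

Note that the file keeps only the most recent execution count of each cell, so any earlier runs of a re-executed cell are lost and the reconstruction is necessarily partial.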
JSON format → terrible diffs → horrible merge conflicts
5c5
< "execution_count": 1,
---
> "execution_count": 6,
10c10
< "x = 1"
---
> "x = 5"
15c15
< "execution_count": null,
---
> "execution_count": 8,
25c25
< "execution_count": null,
---
> "execution_count": 7,
28c28,36
< "outputs": [],
---
> "outputs": [
> {
> "name": "stdout",
> "output_type": "stream",
> "text": [
> "9\n"
> ]
> }
> ],
30c38
< "print(x)"
---
> "print(x + 4)"
35c43
< "execution_count": 2,
---
> "execution_count": 3,
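Most of the diff above comes from execution counts and outputs rather than from the code itself. One way to reduce that noise is to blank those fields before committing. The following is a hand-rolled sketch using only the standard library; strip_notebook is a hypothetical helper name, and this is not meant to represent any particular existing tool.

```python
import json


def strip_notebook(notebook_json: str) -> str:
    """Return the notebook JSON with outputs and execution counts removed."""
    nb = json.loads(notebook_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    # Stable key order and indentation keep the serialized form deterministic.
    return json.dumps(nb, indent=1, sort_keys=True) + "\n"


# The same cell, once after being run and once untouched.
executed = json.dumps({
    "cells": [{
        "cell_type": "code",
        "execution_count": 7,
        "metadata": {},
        "outputs": [{"name": "stdout", "output_type": "stream", "text": ["9\n"]}],
        "source": ["print(x + 4)"],
    }],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})
clean = json.dumps({
    "cells": [{
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": ["print(x + 4)"],
    }],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})

# After stripping, running the cell no longer changes the file.
print(strip_notebook(executed) == strip_notebook(clean))  # True
```

With this kind of normalization, only genuine source edits such as "x = 1" becoming "x = 5" would remain in the diff.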
Many times one has to manage a notebook with git. This tool: