Data management

Reproducible Software Pipelines

Types of input data

Raw data, over which you have no direct control
Preprocessed data, which you can manage

Dealing with raw data

Is the data source reasonably perpetual/long-lived?
It might be a good idea to store a copy of the raw data
Several possibilities for perpetual storage:
- figshare.com
- zenodo.org
Git is not good at managing large files, so avoid committing raw data to the repository
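A stored copy is only useful if you can tell it still matches what you originally downloaded. One possible check, sketched here as an assumption rather than a prescribed step, is to record a SHA-256 checksum next to the archived file and compare it on later runs. The function name sha256_of_file is hypothetical.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of the file at path (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large raw files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Store the returned digest alongside the archived copy; a different digest on a later run means the file has changed.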

Example Data

https://is.gd/O3CVLe

Getting raw data

Automate the downloading of raw data!
with Python

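import urllib.request

url = "https://is.gd/O3CVLe"  # the example data link
local_path = "raw_data"  # hypothetical local file name
urllib.request.urlretrieve(url, filename=local_path)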

with R

download.file(url, local_path, method="wget")
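The download step can be made safe to rerun by skipping it when a local copy already exists. The following is a minimal sketch under that assumption; the function name fetch_raw_data and the skip-if-present policy are illustrative choices, not part of the original material.

```python
import urllib.request
from pathlib import Path


def fetch_raw_data(url: str, local_path: str) -> Path:
    """Download url to local_path unless a local copy already exists."""
    target = Path(local_path)
    if target.exists():
        # Reuse the stored copy so reruns do not depend on the remote source.
        return target
    # Create the destination directory if it is missing.
    target.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, filename=str(target))
    return target
```

Calling fetch_raw_data(url, local_path) at the start of a pipeline downloads the file once and reuses it afterwards. Delete the local file to force a fresh download.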

Dealing with preprocessed data

Always preprocess your data with a program/script
Commit the program/script to source control
If your raw data comes in Excel files, import it programmatically:
- Python: pandas.read_excel() or polars.read_excel()
- R: readxl::read_excel()

Never edit the data by hand
Always use automated processing
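As an illustration of scripted preprocessing, here is a minimal sketch that reads a raw CSV file, trims whitespace, drops empty rows, and writes the result to a new file. The CSV format, the cleaning rules, and the function name preprocess are assumptions chosen for the example; a real script would apply whatever cleaning your data needs.

```python
import csv


def preprocess(raw_path: str, clean_path: str) -> int:
    """Write a cleaned copy of raw_path to clean_path.

    Returns the number of rows written, header included.
    """
    kept = 0
    with open(raw_path, newline="") as src, \
            open(clean_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            # Trim stray whitespace around every cell.
            cells = [cell.strip() for cell in row]
            if not any(cells):
                continue  # skip rows that are empty after trimming
            writer.writerow(cells)
            kept += 1
    return kept
```

The raw file is never modified: the script only reads it and writes a separate preprocessed file. Committing the script to source control makes every cleaning step reviewable and repeatable.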