Data management

Reproducible Software Pipelines

Types of input data

  • Raw data, over which you have no direct control

  • Preprocessed data, which you can manage

Dealing with raw data

  • Is the data source reasonably perpetual/long-lived?

  • It might be a good idea to store a copy of the raw data

  • Several possibilities for perpetual storage:

  • Git is not good at managing large files

Example Data

https://is.gd/O3CVLe

Getting raw data

  • Automate the downloading of raw data!

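  • With Python:

import urllib.request

# url is the address of the raw data, local_path the file to save it to
urllib.request.urlretrieve(url, filename=local_path)

  • With R:

# method="wget" requires the wget program to be installed
download.file(url, local_path, method="wget")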
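
The download step above can be wrapped in a small reusable function. This is only a minimal sketch: the name fetch_raw_data, the expected_sha256 parameter, and the choice to skip files that already exist are illustrative assumptions, not part of the original slides. It keeps a local copy of the raw data and optionally checks that the copy is the file you expected.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch_raw_data(url, local_path, expected_sha256=None):
    """Download url to local_path unless a copy already exists.

    Returns the SHA-256 hex digest of the local file.
    (Hypothetical helper, shown for illustration only.)
    """
    local_path = Path(local_path)
    if not local_path.exists():
        # Create the target directory, then download the file once.
        local_path.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, filename=local_path)
    # Hash the local copy so later runs can detect a changed or corrupted file.
    digest = hashlib.sha256(local_path.read_bytes()).hexdigest()
    if expected_sha256 is not None and digest != expected_sha256:
        raise ValueError(f"checksum mismatch for {local_path}: {digest}")
    return digest
```

Calling fetch_raw_data(url, local_path) a second time reuses the stored copy instead of downloading again, which helps if the data source turns out not to be long-lived.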

Dealing with preprocessed data

  • Always preprocess your data with a program/script

  • Commit the program/script to source control

  • If your raw data comes in Excel files, import it:

    • Python: pandas.read_excel() or polars.read_excel()
    • R: readxl::read_excel()

  • Never modify the data by hand

  • Always use automated processing
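
The rules above could be followed with a script along these lines. It is a sketch under stated assumptions: the function name preprocess, the CSV input format, and the two cleaning steps (stripping whitespace, dropping rows with an empty field) are hypothetical examples, and the input is assumed to have the same number of fields in every row. The raw file is only read, never modified, and the result goes to a separate file.

```python
import csv
from pathlib import Path


def preprocess(raw_path, out_path):
    """Write a cleaned copy of the raw CSV; return the number of rows kept.

    (Hypothetical cleaning steps, shown for illustration only.)
    """
    with open(raw_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        cleaned = []
        for row in reader:
            # Strip surrounding whitespace; missing fields become "".
            row = {key: (value or "").strip() for key, value in row.items()}
            # Keep only rows where every field has a value.
            if all(row.values()):
                cleaned.append(row)
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(cleaned)
    return len(cleaned)
```

Because every step lives in the script, committing it to source control records exactly how the preprocessed data was produced, and rerunning it on the same raw file regenerates the same output.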