Reproducible Software Pipelines
Raw data, over which you have no direct control
Preprocessed data, which you can manage
Is the data source reasonably permanent or long-lived?
It might be a good idea to store a copy of the raw data
Several possibilities for perpetual storage:
git (not good at managing large files)
Automate the downloading of raw data!
with Python
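A minimal sketch of an automated download, using only the Python standard library. The function name download_raw, the ".part" temporary suffix, and the URL and path in the usage note are illustrative assumptions, not part of the original material. The file is fetched only if no local copy exists, and a SHA-256 digest is returned so the stored copy can be checked later.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def download_raw(url, dest):
    """Download url to dest unless dest already exists; return its SHA-256 digest.

    Hypothetical helper: the name and behaviour are illustrative only.
    """
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temporary name first so an interrupted download
        # never leaves a truncated file under the final name.
        tmp = dest.with_name(dest.name + ".part")
        with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
            shutil.copyfileobj(response, out)
        tmp.replace(dest)
    # Hash the local copy so later runs can detect accidental changes.
    return hashlib.sha256(dest.read_bytes()).hexdigest()
```

Usage might look like download_raw("https://example.org/data.csv", "data/raw/data.csv"), where both the URL and the path are placeholders. Recording the returned digest next to the script in source control is one way to notice when the raw file has changed.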
Always preprocess your data with a program/script
Commit the program/script to source control
If your raw data comes in Excel files, import it:
Python: pandas.read_excel() or polars.read_excel()
R: readxl::read_excel()
Never perform manual operations on the data
Always use automated processing
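The points above can be sketched as a small preprocessing script that would be committed to source control. It assumes pandas is installed together with an Excel engine such as openpyxl. The function names, the directory layout, and the specific cleaning steps (dropping empty rows, normalising column names) are illustrative assumptions, not steps prescribed by the original material.

```python
from pathlib import Path


def csv_path_for(xlsx_path, out_dir):
    """Return the output CSV path for a given Excel file (hypothetical layout)."""
    return Path(out_dir) / (Path(xlsx_path).stem + ".csv")


def excel_to_csv(xlsx_path, out_dir, sheet_name=0):
    """Read one sheet of an Excel file, clean it, and write it out as CSV."""
    # Imported here so that the path helper above works without pandas.
    import pandas as pd

    out_path = csv_path_for(xlsx_path, out_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)

    df = pd.read_excel(xlsx_path, sheet_name=sheet_name)
    # Example cleaning steps, applied the same way on every run:
    df = df.dropna(how="all")  # drop rows that are completely empty
    df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]

    df.to_csv(out_path, index=False)
    return out_path
```

A call such as excel_to_csv("data/raw/survey.xlsx", "data/processed") would write data/processed/survey.csv; both paths are placeholders. Because every transformation lives in the script, the processed file can be rebuilt from the raw file at any time without touching the spreadsheet by hand.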