What is a Data Pipeline Manifest?
- I looked for good examples, standards, recommendations…
- Manifest (or similar) files appear in many other programming contexts:
- web extensions
- Chrome extensions
- package.json in NPM (JS) packages
- Typically, they define all the things needed to build something.
So, what is a Data Pipeline Manifest to me?
- Like a build manifest, but somewhat in reverse
- everything I would need to know to rebuild the exact same dataset
- born out of a need to explain or verify specific data points in datasets I had created
Reproducibility is key to fixing/understanding things
- very common in software development / bug fixing
- the same goes for data: if you can’t reproduce it, you can’t explain it
How does a manifest file help?
- helps answer questions that guide your investigation
- is the data unmodified?
- is everything there?
- what were the input files?
- what was used to build it?
- what happened in the black box?
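As a concrete sketch, a manifest could be a small JSON document with one section per question above. The field names and paths here are my own illustration, not a standard; the hash values are placeholders:

```python
import json

# Hypothetical manifest structure -- field names and paths are
# illustrative, not a standard. Each section answers one of the
# investigation questions above.
manifest = {
    "outputs": [            # is everything there? is it unmodified?
        {
            "path": "out/users.parquet",
            "format": "parquet",
            "size_bytes": 10485760,
            "sha256": "<hash placeholder>",
        }
    ],
    "inputs": [             # what were the input files?
        {"path": "raw/users.csv", "sha256": "<hash placeholder>"},
    ],
    "dependencies": {       # what was used to build it?
        "pandas": "2.2.0",
    },
    "code": {               # what happened in the black box?
        "repo": "git@example.com:pipeline.git",
        "sha": "abc1234",
        "dirty": False,
    },
}

print(json.dumps(manifest, indent=2))
```

Writing this file next to the dataset at build time is the whole trick: each later slide fills in one of these sections.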
Output files
- are all the files there?
- Filenames
- Directory structure
- do the files contain what is expected?
- File extension
- File format
- Schema
- Encoding
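The "are all the files there?" check reduces to comparing the paths recorded in the manifest against what is actually on disk. A minimal sketch (the function name and return shape are my own):

```python
from pathlib import Path

def check_outputs(out_dir: Path, expected: set[str]) -> dict:
    """Compare the files actually on disk against the manifest's
    expected relative paths; report what is missing or unexpected."""
    actual = {
        str(p.relative_to(out_dir))
        for p in out_dir.rglob("*")
        if p.is_file()
    }
    return {
        "missing": expected - actual,      # in manifest, not on disk
        "unexpected": actual - expected,   # on disk, not in manifest
    }
```

Format, schema, and encoding checks would hang off the same loop, but they are format-specific (e.g. reading a Parquet footer vs. sniffing a CSV).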
Output files continued
- have the files changed?
- Timestamp (created, last modified, ?)
- File size
- File hash
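Those three fields can be captured per file in a few lines; a sketch using SHA-256 (any stable hash works, the choice here is mine):

```python
import hashlib
from pathlib import Path

def fingerprint(path: Path) -> dict:
    """Record the change-detection fields for one output file:
    size, last-modified timestamp, and a SHA-256 content hash."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Hash in chunks so large dataset files don't load into memory.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    stat = path.stat()
    return {
        "size": stat.st_size,
        "mtime": stat.st_mtime,
        "sha256": h.hexdigest(),
    }
```

The hash is the strong signal; size and timestamp are cheap early warnings (timestamps can change on copy without the content changing).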
Dependencies
- what are all the exact dependencies that were used?
- Packages
- Versions
- Dev or modified versions?
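In Python, the installed package set can be snapshotted from the environment itself (via the standard library's `importlib.metadata`); note the caveat in the comment about modified installs:

```python
from importlib import metadata

def installed_versions() -> dict[str, str]:
    """Snapshot the exact versions of every installed distribution.
    Caveat: a locally-modified or editable install reports its
    declared version string -- the modification itself only shows
    up in a code-level record (e.g. a git sha), not here."""
    return {
        dist.metadata["Name"]: dist.version
        for dist in metadata.distributions()
    }
```

Storing this dict in the manifest answers "what are all the exact dependencies?" for that one build, independent of any requirements file that may have drifted since.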
Code
- what is the exact version of the code that was used?
- Repo
- Version
- Git sha
- Git status clean?
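The code section can be filled in by shelling out to git at build time; a sketch assuming the pipeline runs from inside a git checkout (function name is mine):

```python
import subprocess

def git_state(repo_dir: str = ".") -> dict:
    """Record the exact code version: the commit sha, and whether
    the working tree is clean (no uncommitted changes)."""
    def run(*args: str) -> str:
        return subprocess.check_output(
            ["git", "-C", repo_dir, *args], text=True
        ).strip()

    return {
        "sha": run("rev-parse", "HEAD"),
        # `status --porcelain` prints nothing iff the tree is clean.
        "clean": run("status", "--porcelain") == "",
    }
```

`clean: false` in a manifest is the red flag: the recorded sha does not fully describe the code that actually ran.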
Questions?
Thank you for your time! 🚀