Data Pipeline Manifest

CJ Yetman

What is a Data Pipeline Manifest?

I looked for good examples, standards, recommendations…
- I didn’t find any! 😂
There are many uses of manifest (or similar) files in other programming contexts:
1. web extensions
2. Chrome extensions
3. package.json in npm (JS) packages
Typically, they define all the things needed to build something.

So, what is a Data Pipeline Manifest to me?

Like a build manifest, but somewhat in reverse
- everything I would need to know to re-build the same exact dataset
born out of a need to explain or verify specific data points in datasets I had created

Reproducibility is key to fixing/understanding things

very common in software development / bug fixing
same goes for data, if you can’t do it you can’t explain it

How does a manifest file help?

helps answer questions to guide your investigation
- is the data unmodified?
- is everything there?
- what were the input files?
- what was used to build it?
- what happened in the black box?

Output files

are all the files there?
1. Filenames
2. Directory structure
do the files contain what is expected?
1. File extension
2. File format
3. Schema
4. Encoding
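
A minimal sketch of the "are all the files there?" check, assuming the outputs live under a single directory. The function names `list_output_files` and `missing_files` are made up for illustration; schema and encoding checks depend on the file format and are not covered here.

```python
from pathlib import Path, PurePosixPath


def list_output_files(root):
    """Return one record per file under root, sorted by relative path."""
    root = Path(root)
    # Relative POSIX-style paths capture both filenames and directory structure.
    relative_paths = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    return [
        {"path": rel, "extension": PurePosixPath(rel).suffix}
        for rel in relative_paths
    ]


def missing_files(expected, root):
    """Return the expected relative paths that are not present under root."""
    found = {record["path"] for record in list_output_files(root)}
    return sorted(set(expected) - found)
```

Comparing the recorded list against a later listing of the same directory shows whether a file was dropped, renamed, or moved.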

Output files continued

have the files changed?
1. Timestamp (created, last modified, ?)
2. File size
3. File hash
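
One way to record these three properties for a file, as a sketch. SHA-256 is an assumption here (any stable hash would do), and `file_fingerprint` is a hypothetical name.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def file_fingerprint(path, chunk_size=1024 * 1024):
    """Return size, last-modified time (UTC, ISO 8601), and SHA-256 of a file."""
    path = Path(path)
    digest = hashlib.sha256()
    # Read in chunks so large datasets do not need to fit in memory.
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    stat = path.stat()
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "size_bytes": stat.st_size,
        "last_modified": modified.isoformat(),
        "sha256": digest.hexdigest(),
    }
```

Timestamps and sizes are cheap hints; the hash is the strongest evidence of the three that the contents are unmodified.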

Input data

where (exactly) did the data come from, or how can I get it?
1. Publisher/Source
2. Version
3. URL/link to the data
4. URL that describes and/or links to the data
5. Timestamp of download
6. Location of local archive
7. All the same output file metadata from before
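
The provenance fields above could be captured in a small record like this sketch. The class name, field names, and the helper `describe_input` are assumptions made for illustration, not an established schema.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class InputSource:
    """Provenance of one input dataset (hypothetical field names)."""
    publisher: str
    version: str
    data_url: str          # direct link to the data
    landing_page_url: str  # page that describes and/or links to the data
    downloaded_at: str     # ISO 8601 timestamp of the download
    local_archive: str     # where the archived copy is kept


def describe_input(publisher, version, data_url, landing_page_url,
                   local_archive, downloaded_at=None):
    """Build a plain dict for the manifest; defaults the timestamp to now (UTC)."""
    if downloaded_at is None:
        downloaded_at = datetime.now(timezone.utc)
    source = InputSource(
        publisher=publisher,
        version=version,
        data_url=data_url,
        landing_page_url=landing_page_url,
        downloaded_at=downloaded_at.isoformat(),
        local_archive=local_archive,
    )
    return asdict(source)
```

The same file metadata recorded for outputs (size, hash, timestamps) can be stored alongside each of these records.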

Input environment

what was the exact environment the data pipeline ran in?
1. Platform/architecture
2. OS and version
3. Environment variables
4. Language and version
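
A sketch of capturing the environment from inside the pipeline, shown in Python. `describe_environment` is a hypothetical name, and the default list of environment variables is only an example; recording an explicit allow-list rather than the whole environment avoids writing secrets into the manifest.

```python
import os
import platform


def describe_environment(env_var_names=("TZ", "LANG")):
    """Return platform, OS, language, and selected environment variables."""
    return {
        "architecture": platform.machine(),
        "os": platform.system(),
        "os_version": platform.release(),
        "language": "Python",
        "language_version": platform.python_version(),
        # Unset variables are recorded as None so their absence is visible.
        "environment_variables": {
            name: os.environ.get(name) for name in env_var_names
        },
    }
```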

Dependencies

what are all the exact dependencies that were used?
1. Packages
2. Versions
3. Dev or modified versions?
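
A sketch of listing installed packages and their versions in a Python environment, using the standard library's `importlib.metadata`. The dev-version check is a rough heuristic assumed for illustration (it flags version strings containing "dev" or a "+" local-version suffix) and will not catch every modified install.

```python
from importlib import metadata


def installed_packages():
    """Return {package name: version} for every installed distribution."""
    packages = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        version = meta.get("Version") if meta is not None else None
        if name and version:  # skip distributions with broken metadata
            packages[name] = version
    return dict(sorted(packages.items()))


def looks_like_dev_version(version):
    """Heuristic: flag development or locally modified version strings."""
    lowered = version.lower()
    return "dev" in lowered or "+" in lowered
```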

Code

what is the exact version of the code that was used?
1. Repo
2. Version
3. Git SHA
4. Git status clean?

Practical example

PACTA’s Data Preparation

https://github.com/RMI-PACTA/pacta.data.preparation/blob/main/R/write_manifest.R
https://github.com/RMI-PACTA/workflow.data.preparation/blob/main/config.yml

What does it look like?

Questions?

Thank you for your time! 🚀