Data Pipeline Manifest

CJ Yetman

What is a Data Pipeline Manifest?

  • I looked for good examples, standards, recommendations…
    • I didn’t find any! 😂
  • Manifest (or similar) files are already common in other programming contexts:
    1. web extensions
    2. Chrome extensions
    3. package.json in npm (JS) packages
  • Typically, they define all the things needed to build something.

So, what is a Data Pipeline Manifest to me?

  • Like a build manifest, but somewhat in reverse
    • everything I would need to know to re-build the same exact dataset
  • born out of a need to explain or verify specific data points in datasets I had created

Reproducibility is key to fixing/understanding things

  • very common in software development / bug fixing
  • same goes for data: if you can’t reproduce it, you can’t explain it

How does a manifest file help?

  • helps answer questions to guide your investigation
    • is the data unmodified?
    • is everything there?
    • what were the input files?
    • what was used to build it?
    • what happened in the black box?

Output files

  • are all the files there?
    1. Filenames
    2. Directory structure
  • do the files contain what is expected?
    1. File extension
    2. File format
    3. Schema
    4. Encoding
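The "are all the files there?" check can be sketched in Python as a comparison between the files found on disk and a list of expected relative paths. This is only an illustration: the function name `compare_outputs` and the returned keys are assumptions, not part of any existing tool.

```python
from pathlib import Path


def compare_outputs(output_dir, expected_relative_paths):
    """Compare files under output_dir against a list of expected relative paths."""
    root = Path(output_dir)
    # Collect every file below the root as a POSIX-style relative path,
    # so directory structure is checked along with filenames.
    present = {
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file()
    }
    expected = set(expected_relative_paths)
    return {
        "missing": sorted(expected - present),
        "unexpected": sorted(present - expected),
    }
```

Both lists being empty suggests the output directory matches the expectation. Format, schema, and encoding checks would need format-specific readers and are left out here.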

Output files continued

  • have the files changed?
    1. Timestamp (created, last modified, ?)
    2. File size
    3. File hash
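As a rough sketch, the three "has it changed?" fields above can be collected with the Python standard library. The function name `describe_file`, the field names, and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def describe_file(path):
    """Return name, size, last-modified time, and SHA-256 hash for one file."""
    p = Path(path)
    stat = p.stat()
    digest = hashlib.sha256()
    with p.open("rb") as fh:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "file_name": p.name,
        "file_extension": p.suffix,
        "file_size_bytes": stat.st_size,
        "last_modified_utc": modified.isoformat(),
        "sha256": digest.hexdigest(),
    }
```

Of the three, the hash is the strongest signal: timestamps change when a file is copied, and two different files can have the same size, but a matching hash means the bytes are almost certainly identical.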

Input data

  • where (exactly) did the data come from, or how can I get it?
    1. Publisher/Source
    2. Version
    3. URL/link to the data
    4. URL that describes and/or links to the data
    5. Timestamp of download
    6. Location of local archive
    7. All the same output file metadata from before
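One possible shape for an input-data record, sketched in Python. Every field name here is an assumption, and the example.org values are placeholders rather than a real data source.

```python
from datetime import datetime, timezone


def describe_input(publisher, version, data_url, info_url,
                   archive_path, downloaded_at=None):
    """Build a provenance record for one input dataset."""
    if downloaded_at is None:
        # Default to "now" when the record is written at download time.
        downloaded_at = datetime.now(timezone.utc)
    return {
        "publisher": publisher,
        "version": version,
        "data_url": data_url,
        "info_url": info_url,
        "downloaded_at_utc": downloaded_at.astimezone(timezone.utc).isoformat(),
        "local_archive": str(archive_path),
    }


# Hypothetical usage with placeholder values.
example_record = describe_input(
    publisher="Example Publisher",
    version="2024Q1",
    data_url="https://example.org/data/2024Q1.zip",
    info_url="https://example.org/data/",
    archive_path="archive/inputs/2024Q1.zip",
)
```

The file-level metadata (size, hash, timestamps) for the archived copy would be stored alongside these fields.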

Input environment

  • what was the exact environment the data pipeline ran in?
    1. Platform/architecture
    2. OS and version
    3. Environment variables
    4. Language and version
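For a pipeline written in Python, the environment details above could be captured roughly like this. The function name, the field names, and the default list of environment variables are assumptions.

```python
import os
import platform


def describe_environment(env_var_names=("LANG", "TZ")):
    """Capture platform, OS, language version, and selected environment variables."""
    return {
        "platform": platform.platform(),
        "architecture": platform.machine(),
        "os": platform.system(),
        "os_release": platform.release(),
        "language": "Python",
        "language_version": platform.python_version(),
        # Only record an explicit allowlist: dumping all of os.environ
        # risks writing secrets (tokens, passwords) into the manifest.
        "environment_variables": {
            name: os.environ.get(name) for name in env_var_names
        },
    }
```

Variables that are not set are recorded as `None`, which is itself useful information when comparing two runs.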

Dependencies

  • what are all the exact dependencies that were used?
    1. Packages
    2. Versions
    3. Dev or modified versions?

Code

  • what is the exact version of the code that was used?
    1. Repo
    2. Version
    3. Git sha
    4. Git status clean?
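The four code-related fields can be read from git through its command line. This is a minimal sketch that assumes the pipeline code lives in a git checkout with a remote named `origin`; the helper and field names are illustrative.

```python
import subprocess


def _git(repo_dir, *args):
    """Run a git command and return its trimmed output, or None on failure."""
    try:
        result = subprocess.run(
            ["git", *args], cwd=repo_dir, capture_output=True, text=True
        )
    except OSError:
        # git is not installed, or repo_dir does not exist.
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def describe_code(repo_dir="."):
    """Record repo URL, version label, commit SHA, and whether the tree is clean."""
    status = _git(repo_dir, "status", "--porcelain")
    return {
        "repo": _git(repo_dir, "config", "--get", "remote.origin.url"),
        "version": _git(repo_dir, "describe", "--tags", "--always"),
        "git_sha": _git(repo_dir, "rev-parse", "HEAD"),
        # Empty porcelain output means no uncommitted or untracked changes.
        "git_status_clean": None if status is None else status == "",
    }
```

The "clean?" flag matters because a SHA alone says nothing about uncommitted edits: a dirty working tree means the recorded commit is not exactly the code that ran.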

Practical example

PACTA’s Data Preparation

https://github.com/RMI-PACTA/pacta.data.preparation/blob/main/R/write_manifest.R

https://github.com/RMI-PACTA/workflow.data.preparation/blob/main/config.yml

What does it look like?

manifest.json
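To give a feel for how the pieces could come together in one file, here is a hypothetical Python sketch that writes a manifest.json. The layout and key names are assumptions chosen for illustration; this is not the format produced by PACTA's write_manifest.R.

```python
import json
from pathlib import Path


def write_manifest(path, outputs, inputs, environment, dependencies, code):
    """Write the collected sections to a JSON manifest and return the dict."""
    manifest = {
        "outputs": outputs,            # per-file names, sizes, hashes
        "inputs": inputs,              # source, version, URLs, download time
        "environment": environment,    # platform, OS, language version
        "dependencies": dependencies,  # package names and exact versions
        "code": code,                  # repo, version, git SHA, clean flag
    }
    # Sorted keys and fixed indentation keep the file stable between runs,
    # so two manifests can be compared with an ordinary text diff.
    text = json.dumps(manifest, indent=2, sort_keys=True, ensure_ascii=False)
    Path(path).write_text(text + "\n", encoding="utf-8")
    return manifest
```

Writing the manifest next to the output files means the explanation travels with the dataset whenever it is copied or archived.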

Questions?

Thank you for your time! 🚀