Build Systems and Data Pipelines

Alex Axthelm

A 20-second make refresher

output1: input1 input2
  command

input2: sourceA sourceB
  write_input2.sh

Agenda

  • Build Systems
  • Data Pipelines
  • Implications for our work
  • Discussion

There are only two hard things in Computer Science: cache invalidation and naming things.

– Phil Karlton

Build system

A build system is a tool (or library/package) for:

  • Declaring build “targets” (outputs)
  • Defining the steps required to create those targets
  • Executing those steps

It usually also includes a caching mechanism:

  • Identifying the inputs to a step (e.g. code files)
  • Determining if the outputs are stale (outdated)
  • Determining what steps need to be re-run based on staleness
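The staleness check above can be sketched in a few lines. This is a toy illustration of make-style timestamp invalidation, not any particular tool's implementation:

```python
import os

def is_stale(target: str, inputs: list[str]) -> bool:
    """Make-style check: a target is stale if it does not exist
    or if any input has a newer modification time."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(i) > target_mtime for i in inputs)
```

A build system walks the dependency graph and re-runs a step only when its target is stale; everything else is served from cache (i.e. left on disk).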

Examples of build systems

  • make/cmake (generic)
  • docker build (docker)
  • vite (JS/TS)
  • uv (python)

Data Pipelines

A data pipeline system is a tool (or library/package) for:

  • Declaring build “targets” (outputs)
  • Defining the steps required to create those targets
  • Executing those steps

It usually also includes a caching mechanism:

  • Identifying the inputs to a step (e.g. data, code files)
  • Determining if the outputs are stale (outdated)
  • Determining what steps need to be re-run based on staleness

Examples of data pipeline tools

  • targets (R)
  • Airflow, Luigi (Python)
  • RMarkdown/Quarto (DataSci languages)
  • Microsoft Excel

Task runner

A task runner is a tool that runs tasks without caching. It may or may not support dependency resolution.

Examples of task runners:

  • just
  • npm run
  • Shell scripts
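The distinction is easy to see in code. Below is a toy task runner that resolves dependencies depth-first but never caches: every task always runs, whether or not anything changed. The task names and actions are hypothetical.

```python
log = []  # records execution order, for illustration

# Each task: (dependencies, action). All names here are made up.
tasks = {
    "clean":   ([], lambda: log.append("clean")),
    "compile": (["clean"], lambda: log.append("compile")),
    "test":    (["compile"], lambda: log.append("test")),
}

def run(name: str, done=None):
    """Run a task after its dependencies. Unlike a build system,
    nothing is skipped as 'up to date' -- there is no cache."""
    done = set() if done is None else done
    if name in done:
        return
    deps, action = tasks[name]
    for dep in deps:
        run(dep, done)
    action()
    done.add(name)
```

Calling `run("test")` executes clean, then compile, then test, every time it is invoked.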

A build system that ignores its caches can be used as a task runner

Same idea, different names

Build systems           Data pipelines
--------------------------------------------------------------
Target / artifact       Table / dataset / object
Rule                    Script / transform / job
Dependency              Upstream input
Incremental rebuild     Incremental refresh / backfill
Cache                   Materialization / snapshot
Hermetic build          Reproducible run (pinned code + inputs)
Test                    Data quality check / invariant

Same graph, different choices

  • Invalidation
    • Timestamps (make)
    • Content hashes (Docker)
    • Freshness windows (data pipelines)
  • Scheduling & execution
    • Local
    • CI runners
    • Distributed schedulers + compute engines
  • Determinism / hermeticity
    • Pinned toolchains + inputs
    • External systems
    • Mutable/Immutable data
  • What does “success” mean?
    • Build passes tests
    • Data passes invariants, freshness, and anomaly checks
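Timestamp and content-hash invalidation differ in what counts as "changed". A sketch of the hash-based approach (the style Docker uses for layers), with a hypothetical recorded-hash manifest:

```python
import hashlib

def file_hash(path: str) -> str:
    """SHA-256 of a file's bytes, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def is_stale(inputs: list[str], recorded: dict[str, str]) -> bool:
    """Stale iff any input's current hash differs from the hash
    recorded at the last successful build."""
    return any(file_hash(p) != recorded.get(p) for p in inputs)
```

Unlike timestamps, touching a file without changing its bytes does not trigger a rebuild; conversely, restoring an old file with a new mtime correctly reads as unchanged.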

Enabling latest

If you treat both data and code as dependencies, you can talk about the latest version of a data object (or dataset) in the same way you talk about the latest version of an app or package.

  • Frequently (for us), code is what changes.
    • How often does relevant data update vs. codebase? Yearly? Weekly? Immediate?
  • This does not need to be a literal latest tag

Continuous Publication

  • If everything is a build, then everything can be a deploy
  • If it can be deployed, it could be deployed to PROD

The CI/CD pipeline is our Continuous Publication mechanism. It is possible to track latest and have up-to-date information, while still checking against tests, model expectations, invariants, and data quality checks.

It is also possible to mark old versions as inaccurate or deprecated

⚠️ The information or analysis in this report are no longer sound. See an updated version at foo.rmi.org/1234

Caching and archiving

  • Identifying what to cache/store
    • What is expensive to calculate? (Cache)
    • What can we not re-create? (Archive)
    • Do we have concerns about an upstream dependency changing history/going offline?
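For the "expensive to calculate" case, here is a minimal sketch of caching a computed result to disk, keyed by a hash of its inputs. The cache directory and key format are assumptions for illustration:

```python
import hashlib
import json
import os

CACHE_DIR = ".cache"  # hypothetical cache location

def cached(key_parts, compute):
    """Return the cached result for these inputs if present;
    otherwise compute it, store it, and return it."""
    key = hashlib.sha256(
        json.dumps(key_parts, sort_keys=True).encode()
    ).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = compute()
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "w") as f:
        json.dump(result, f)
    return result
```

Archiving is the inverse concern: the cache can always be rebuilt from inputs, while an archive holds things (raw source data, upstream snapshots) that cannot be re-created if lost.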

Tech Stack

  • How you want to handle dependencies determines what is the right tool
  • What tool you’re using determines how you handle dependencies

These affect system design (what is in or out of a system boundary), sometimes invisibly

A closer look at make

Make is universal glue, and as such is worth examining both as a toy model and as a mature, widely used tool.

Syntax overview/reminder

site.tar: src/*
    npm run build

report.pdf: report.qmd data.csv
    quarto render report.qmd -o report.pdf

%.o : %.c
    $(CC) -c $(CFLAGS) $< -o $@

reports/sector_%.pdf: sector_report_template.qmd %.csv
    quarto render $< -o $@

See also: Stitch Makefile

Language-specific pipelines

Using language-specific pipelines (targets) gives an easy way to track both code objects and data as “inputs”, and add them to the dependency graph.

It also offers better syntax (familiar to devs, and often more expressive and easier to use than make).

Baby’s first data pipeline

https://github.com/IndianaCHE/Detailed-SSP-Reports/blob/master/run.R