Model/docs/adr/0003-strict-ingestion-modelling-separation.md
Khalim Conn-Kowlessar 8291f29721 docs(ara): composable stage-orchestrator design (ADR-0011 + ADR-0003 amend + CONTEXT)
Records the grill-with-docs outcomes for the ara_first_run rebuild: three
composable stage orchestrators (Ingestion/Baseline/Modelling), one lambda per
use case chaining them through repos (not in-memory), and the Fetcher-vs-Repo
data-source taxonomy. Amends ADR-0003's chaining rule to generalise beyond
RefreshOrchestrator. Adds the pipeline-composition + First Run vocabulary to
CONTEXT.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 16:28:48 +00:00

Strict separation between Ingestion and Modelling

Status: Accepted, refined by ADR-0011. The one-way flow below stands. ADR-0011 generalises the chaining rule from "only a RefreshOrchestrator may chain" to "only a top-level use-case pipeline orchestrator (e.g. FirstRunPipeline) may chain across the Ingestion→Modelling seam; the stage orchestrators communicate through repos and never call across it."

Data flows one way only: Ingestion → Repos → Modelling. Modelling services never make external HTTP calls; Ingestion services never run business logic. If Modelling finds a stale record in a repo, it reports the staleness and returns; the caller (a pipeline orchestrator such as a refresh orchestrator, or the front end) decides whether to ingest first. We considered allowing modelling services to call fetchers directly on a cache miss, which would be convenient, and rejected it.
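The one-way flow above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: `EpcRecord`, `InMemoryEpcRepo`, `ingest`, `model`, `first_run_pipeline` and the 30-day staleness threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional


@dataclass(frozen=True)
class EpcRecord:
    property_id: str
    rating: str
    fetched_at: datetime


class InMemoryEpcRepo:
    """Hypothetical repo: the only channel between the two stages."""

    def __init__(self) -> None:
        self._rows: dict[str, EpcRecord] = {}

    def get(self, property_id: str) -> Optional[EpcRecord]:
        return self._rows.get(property_id)

    def save(self, record: EpcRecord) -> None:
        self._rows[record.property_id] = record


@dataclass(frozen=True)
class ModellingResult:
    status: str  # "ok" or "stale"
    rating: Optional[str] = None


def ingest(repo: InMemoryEpcRepo, property_id: str,
           fetch: Callable[[str], str], now: datetime) -> None:
    """Ingestion stage: one external call plus a repo write, no business logic."""
    repo.save(EpcRecord(property_id, fetch(property_id), now))


def model(repo: InMemoryEpcRepo, property_id: str, now: datetime,
          max_age: timedelta = timedelta(days=30)) -> ModellingResult:
    """Modelling stage: reads the repo only; reports staleness, never refetches."""
    record = repo.get(property_id)
    if record is None or now - record.fetched_at > max_age:
        return ModellingResult(status="stale")
    return ModellingResult(status="ok", rating=record.rating)


def first_run_pipeline(repo: InMemoryEpcRepo, property_id: str,
                       fetch: Callable[[str], str], now: datetime) -> ModellingResult:
    """Top-level caller: the only place that chains ingestion and modelling."""
    result = model(repo, property_id, now)
    if result.status == "stale":
        # The caller, not the modelling stage, decides to ingest and retry.
        ingest(repo, property_id, fetch, now)
        result = model(repo, property_id, now)
    return result
```

In this sketch `model` has no way to reach `fetch`, so the rejected "call the fetcher on cache miss" shortcut is impossible at the modelling layer and has to be an explicit decision by the caller.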

The trade-off is that modelling cannot "self-heal" by going to the gov EPC API when it finds stale data. The benefit is that modelling becomes a deterministic function of repository state: same Property in the repos, same modelling output. That determinism is what makes modelling unit-testable against fakes (no DB, no network, no ML lambda), reproducible, and debuggable. It also enables a per-property UI flow where fetched data is shown to the user for review and possible override before modelling runs.
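To make the testability claim concrete, here is a small sketch of a modelling service exercised against a fake repo. Everything in it is assumed for illustration: the `Property` fields, the `PropertyRepo` protocol, the `HeatDemandService` name, and the per-rating factors (placeholders, not real modelling values).

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Property:
    property_id: str
    floor_area_m2: float
    epc_rating: str


class PropertyRepo(Protocol):
    """The only kind of dependency a modelling service takes."""

    def get(self, property_id: str) -> Property: ...


class FakePropertyRepo:
    """In-memory stand-in for tests: no DB, no network."""

    def __init__(self, rows: dict[str, Property]) -> None:
        self._rows = dict(rows)

    def get(self, property_id: str) -> Property:
        return self._rows[property_id]


class HeatDemandService:
    """Hypothetical modelling service: output depends on repo state alone."""

    # Placeholder factors for illustration only.
    _KWH_PER_M2 = {"A": 40.0, "B": 60.0, "C": 90.0, "D": 130.0}

    def __init__(self, repo: PropertyRepo) -> None:
        self._repo = repo

    def annual_kwh(self, property_id: str) -> float:
        prop = self._repo.get(property_id)
        return prop.floor_area_m2 * self._KWH_PER_M2[prop.epc_rating]


def test_same_repo_state_gives_same_output() -> None:
    rows = {"p1": Property("p1", 80.0, "C")}
    first = HeatDemandService(FakePropertyRepo(rows)).annual_kwh("p1")
    second = HeatDemandService(FakePropertyRepo(rows)).annual_kwh("p1")
    assert first == second
```

Because the service only ever sees a `PropertyRepo`, the test needs no patching of HTTP clients: building the fake with the desired rows is the whole setup.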

Under the rushed timeline this constraint is more valuable, not less. Mixing fetchers into services is the easy thing to do when shipping fast; once it's done it's hard to extract.

Consequences

  • Every modelling service depends only on Repos (and other Services / domain logic). No HTTP libraries in the modelling import graph.
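  • A top-level use-case pipeline orchestrator (originally only a RefreshOrchestrator; generalised by ADR-0011, e.g. FirstRunPipeline) is the only thing that calls Ingestion then Modelling in sequence; nothing else may.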
  • "Modelling is stale, refetch in-line" is a forbidden pattern — surface staleness, do not silently repair it.
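One way the "no HTTP libraries in the modelling import graph" consequence could be enforced is a test that scans modelling modules for forbidden imports. The sketch below is an assumption about how such a check might look, not an existing tool in the repo: the deny-list and the `forbidden_imports` helper are hypothetical, and it inspects only the direct imports of one source file, not the transitive graph.

```python
import ast

# Assumed deny-list; extend it to whatever HTTP clients the codebase uses.
FORBIDDEN_ROOTS = frozenset({"requests", "httpx", "urllib3", "aiohttp", "http"})


def forbidden_imports(source: str,
                      forbidden: frozenset[str] = FORBIDDEN_ROOTS) -> list[str]:
    """Return the forbidden top-level packages imported directly by `source`."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 means an absolute import; relative imports are skipped.
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in forbidden:
                found.add(root)
    return sorted(found)
```

A test could read each `.py` file under the modelling package (the path is project-specific) and assert that `forbidden_imports` returns an empty list for every one of them.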