docs(adr): record nearby-postcode broadening (0034) + share HTTP retry primitive

Closes out the cohort-broadening work with its decision record and consolidates
the retry plumbing.

ADR-0034 documents broadening the EPC-Prediction cohort to the real unit
postcodes nearest the target (via postcodes.io) when its own postcode holds no
same-type comparable — extending ADR-0031 decision 5. Records why postcodes.io
was chosen over council[] (whole-LA, no property_type in rows), a bulk Code-Point
Open / ONSPD dataset, and the OS Places radius API, and the lazy / nearest-first
early-stop / soft-fail policy. Broadening-specific docstrings now cite 0034.

Retry consolidation: extract the EPC client's call_with_retry into a shared
infrastructure/http_retry.py keyed off a generic TransientHttpError marker, so
the mechanism (exponential backoff, Retry-After) is shared while each client
keeps its own transient policy. EpcRateLimitError now subclasses TransientHttpError
(still an EpcApiError); PostcodesIoClient routes through the same helper, raising
TransientHttpError on 429/5xx and soft-failing to the seed once exhausted (the EPC
client propagates instead). Direct tests for the shared helper; EPC + postcodes.io
suites repointed at the shared sleep.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
Khalim Conn-Kowlessar 2026-06-23 16:54:06 +00:00
parent 0bd2db4f03
commit de7fb94ff7
12 changed files with 331 additions and 97 deletions

View file

@ -231,7 +231,7 @@ def handler(body: dict[str, Any], context: Any) -> None:
def _broaden(target: PredictionTarget) -> list[ComparableProperty]:
"""The nearby-postcode cohort for a gated-out target — the real unit
postcodes nearest it, walked until enough same-type comparables surface
(ADR-0031). Memoised per (postcode, property_type) so co-located
(ADR-0034). Memoised per (postcode, property_type) so co-located
same-type misses share one walk."""
key = (target.postcode, target.property_type)
if key not in _nearby_cohort_cache:

View file

@ -0,0 +1,90 @@
# EPC Prediction cohort broadens to nearby postcodes
ADR-0029 sizes an EPC-less Property from the **Comparable Properties** in its
postcode; ADR-0031 decision 5 made `property_type` the **hard** cohort filter — a
flat is never sized from houses — and gated out any Property with no same-type
comparable. This records how we **broaden** the cohort when a Property's own
postcode is too sparse to fill, rather than gating it out. Resolved against a
real failure on branch `feature/e2e-runs`: property 718580, the only flat lodged
in postcode `BR6 6BS` (the other 8 certs were houses/bungalows), gated out and
failed its modelling subtask.
## Status
Accepted. Extends ADR-0029 / ADR-0031 decision 5 (the postcode cohort).
## Context
ADR-0031 decision 5 anticipated this — "an Ordnance Survey `postcode_search`
source can supply property type more broadly … wiring it is a later enhancement
that widens the eligible population." The gating it describes is correct (never
size a flat from a mixed-type cohort), but it makes prediction only as good as
the *own* postcode's lodged stock. A genuinely-isolated type (one flat among
houses) is unpredictable with no recourse, and a no-EPC + can't-predict Property
fails its subtask — which, over SQS, retries and dead-letters even though the
input data is permanently insufficient.
The legacy engine handled this by **trimming the postcode** (`BR6 6BS``BR6 6B`
`BR6` …) and re-querying, because the old EPC API accepted partial postcodes.
The current gov EPC API (`api.get-energy-performance-data.communities.gov.uk`)
does **not**: its domestic search accepts only a *full* real postcode, `uprn`,
`council[]`, `constituency[]`, or `address` — confirmed against its OpenAPI spec.
There is no outcode/prefix, radius, or lat/long search.
## Decisions
### 1. Broaden to the real unit postcodes physically nearest the target
When the own-postcode cohort yields no same-type member, the cohort is broadened
to the real unit postcodes around the target, resolved from its coordinates via
**postcodes.io**'s keyless `nearest` endpoint (already a trusted dependency in
`scripts/fetch_epc_prediction_dense_corpus.py`). Each nearby postcode is then
searched exactly as the own postcode is — so `property_type` stays the same hard
filter and ADR-0031 decision 5 is upheld; only the *reach* of the candidate pool
widens, never the selection rule. The hard gate still fires when even the
broadened cohort has no same-type comparable: broadening widens the eligible
population, it does not force a prediction.
### 2. postcodes.io over `council[]` and a bulk centroid dataset
- **`council[]`** (the only "wider than postcode" gov-API filter) returns the
*whole* local authority — tens of thousands of certs, many pages — and its
result rows carry **no `property_type`**, so narrowing to the target's type
would mean fetching every certificate. Rejected: cost is unbounded by distance
and dominated by per-cert fetches.
- **OS Code-Point Open / ONSPD → S3** would give an offline unit-postcode→
coordinate table, but is a ~1.7M-row asset to ingest and host for a path that
fires rarely. Rejected as disproportionate; reconsider if postcodes.io's
availability becomes a liability.
- **OS Places radius API** needs an `OS_API_KEY` the modelling path does not
otherwise carry. Rejected to avoid a new secret.
postcodes.io is keyless, bounded by radius (a tight neighbourhood, not an
authority), and reuses the existing per-postcode search unchanged.
### 3. Lazy, nearest-first with early-stop, and soft-failing
Broadening runs **only on gate-out** (a no-EPC Property whose own postcode lacks
a same-type comparable) — the common cases never pay for it. Nearby postcodes are
walked **nearest first** and the walk **stops early** once enough same-type
comparables surface, so a dense area resolves in one or two searches instead of
the whole radius. Results are memoised per `(postcode, property_type)`. The
postcodes.io call is **soft-failing**: any error (after the shared transient
retry) degrades to the seed postcode alone, so a flaky or unavailable
postcodes.io can never break prediction — it just declines to broaden.
## Consequences
- A new external dependency, postcodes.io, sits in the modelling/Lambda path. It
is keyless and isolated behind `PostcodesIoClient` (shared transient-retry +
soft-fail), but its availability and rate limits are now a (bounded) operational
concern. ADR-0031 decision 5's `postcode_search` lever remains an alternative.
- The `ComparablePropertiesRepository` gains `candidates_near` alongside
`candidates_for`; the predictor/handler try the own postcode first and broaden
only on an empty cohort.
- Prediction reaches across postcode boundaries, so the geo-distance weighting
ADR-0029 already applies (a comparable's vote decays with haversine distance)
becomes load-bearing for broadened cohorts, not just a refinement.
- `property_type` gating is unchanged: a Property with no same-type comparable
*anywhere nearby* is still gated out — now a genuine "no comparable stock"
signal rather than an artefact of a sparse single postcode.

View file

@ -1,35 +0,0 @@
import time
from typing import Callable, Optional, TypeVar
import httpx
from infrastructure.epc_client.exceptions import EpcRateLimitError
T = TypeVar("T")
def call_with_retry(
fn: Callable[[], T],
max_retries: int = 5,
backoff_base: float = 1.0,
backoff_multiplier: float = 2.0,
max_backoff: float = 60.0,
) -> T:
"""Retry `fn` on transient EPC-API failures: HTTP 429 rate limits and
transport errors (read/connect timeouts, connection resets). A 429 honours
the server's `Retry-After`; transport errors back off exponentially. Non-
transient failures (other 4xx/5xx, mapping errors) propagate immediately."""
last_exc: Optional[Exception] = None
for attempt in range(max_retries + 1):
try:
return fn()
except (EpcRateLimitError, httpx.TransportError) as exc:
last_exc = exc
if attempt < max_retries:
if isinstance(exc, EpcRateLimitError) and exc.retry_after is not None:
delay = exc.retry_after
else:
delay = backoff_base * (backoff_multiplier**attempt)
time.sleep(min(delay, max_backoff))
assert last_exc is not None
raise last_exc

View file

@ -10,7 +10,7 @@ from infrastructure.epc_client.exceptions import (
EpcNotFoundError,
EpcRateLimitError,
)
from infrastructure.epc_client._retry import call_with_retry
from infrastructure.http_retry import call_with_retry
from datatypes.epc.domain.epc_property_data import EpcPropertyData
from datatypes.epc.domain.mapper import EpcPropertyDataMapper
from datatypes.epc.search import EpcSearchResult

View file

@ -1,4 +1,4 @@
from typing import Optional
from infrastructure.http_retry import TransientHttpError
class EpcApiError(Exception):
@ -9,9 +9,7 @@ class EpcNotFoundError(EpcApiError):
"""Raised when the API returns 404."""
class EpcRateLimitError(EpcApiError):
"""Raised when the API returns 429 and all retries are exhausted."""
def __init__(self, message: str, retry_after: Optional[float] = None) -> None:
super().__init__(message)
self.retry_after = retry_after
class EpcRateLimitError(EpcApiError, TransientHttpError):
"""Raised when the API returns 429. A ``TransientHttpError`` so the shared
``call_with_retry`` retries it (honouring ``Retry-After``), while remaining an
``EpcApiError`` for callers that catch the EPC hierarchy."""

View file

@ -0,0 +1,60 @@
"""Shared transient-failure retry for the HTTP source clients.
The retry *mechanism* is generic; each client owns the *policy* of what counts as
transient. A caller signals "retry this" by raising ``TransientHttpError``
(carrying any server-advised ``retry_after``); transport-level errors (read /
connect timeouts, connection resets) are always treated as transient.
``call_with_retry`` backs off exponentially between attempts honouring
``retry_after`` when present and re-raises the last error once attempts are
exhausted, leaving the caller to decide how to surface it (the EPC client lets it
propagate; postcodes.io soft-fails to the seed postcode).
"""
from __future__ import annotations
import time
from typing import Callable, Optional, TypeVar
import httpx
T = TypeVar("T")
class TransientHttpError(Exception):
"""A failure worth retrying. ``retry_after`` is the server-advised delay (a
429's ``Retry-After``), used in place of the computed backoff when present."""
def __init__(self, message: str, retry_after: Optional[float] = None) -> None:
super().__init__(message)
self.retry_after = retry_after
def call_with_retry(
fn: Callable[[], T],
max_retries: int = 5,
backoff_base: float = 1.0,
backoff_multiplier: float = 2.0,
max_backoff: float = 60.0,
) -> T:
"""Retry ``fn`` on transient failures — ``TransientHttpError`` (e.g. a 429)
and ``httpx.TransportError`` (read/connect timeouts, connection resets)
backing off exponentially, or by the error's ``retry_after`` when it carries
one. Non-transient failures propagate immediately; the last transient error
is re-raised once ``max_retries`` is exhausted."""
last_exc: Optional[Exception] = None
for attempt in range(max_retries + 1):
try:
return fn()
except (TransientHttpError, httpx.TransportError) as exc:
last_exc = exc
if attempt < max_retries:
retry_after = (
exc.retry_after if isinstance(exc, TransientHttpError) else None
)
if retry_after is not None:
delay = retry_after
else:
delay = backoff_base * (backoff_multiplier**attempt)
time.sleep(min(delay, max_backoff))
assert last_exc is not None
raise last_exc

View file

@ -15,12 +15,12 @@ broadening" rather than breaking prediction.
from __future__ import annotations
import time
from typing import Any, Optional
import httpx
from domain.geospatial.coordinates import Coordinates
from infrastructure.http_retry import TransientHttpError, call_with_retry
class PostcodesIoClient:
@ -91,50 +91,44 @@ class PostcodesIoClient:
return [row for row in payload if isinstance(row, dict)]
def _call(self, path: str, params: Optional[dict[str, Any]]) -> Any:
"""One GET against postcodes.io, retrying transient failures (transport
errors, 429s, 5xx) with exponential backoff. Returns the parsed
``result`` payload, or None on a non-transient failure (e.g. an unknown
postcode's 404) or once retries are exhausted — broadening then falls
back to the seed alone."""
for attempt in range(self.MAX_RETRIES + 1):
try:
response = httpx.get(
f"{self.BASE_URL}{path}",
params=params,
timeout=self.REQUEST_TIMEOUT,
)
except httpx.TransportError:
if not self._sleep_before_retry(attempt, retry_after=None):
return None
continue
except httpx.HTTPError:
return None # non-transient client-side error (e.g. bad URL)
if self._is_transient(response.status_code):
if not self._sleep_before_retry(
attempt, retry_after=self._retry_after(response)
):
return None
continue
if not response.is_success:
return None
try:
body: Any = response.json()
except ValueError:
return None
return body.get("result") if isinstance(body, dict) else None
return None
"""The parsed ``result`` payload for a postcodes.io GET, retried on
transient failures via the shared ``call_with_retry``, or None on a
non-transient failure (e.g. an unknown postcode's 404) or once retries are
exhausted broadening then falls back to the seed alone. The soft-fail is
the difference from the EPC client, which lets the error propagate."""
try:
return call_with_retry(
lambda: self._fetch(path, params),
max_retries=self.MAX_RETRIES,
backoff_base=self.BACKOFF_BASE,
backoff_multiplier=self.BACKOFF_MULTIPLIER,
max_backoff=self.MAX_BACKOFF,
)
except (TransientHttpError, httpx.HTTPError):
return None
def _sleep_before_retry(self, attempt: int, retry_after: Optional[float]) -> bool:
"""Sleep before the next attempt and report whether one remains; on the
final attempt, return False so the caller soft-fails instead of looping."""
if attempt >= self.MAX_RETRIES:
return False
if retry_after is not None:
delay = retry_after
else:
delay = self.BACKOFF_BASE * (self.BACKOFF_MULTIPLIER**attempt)
time.sleep(min(delay, self.MAX_BACKOFF))
return True
def _fetch(self, path: str, params: Optional[dict[str, Any]]) -> Any:
"""One GET. Raises ``TransientHttpError`` on a 429/5xx (and lets
``httpx.TransportError`` propagate) so ``call_with_retry`` retries it;
returns None for a non-transient non-success (e.g. 404) or unparseable
body."""
response = httpx.get(
f"{self.BASE_URL}{path}",
params=params,
timeout=self.REQUEST_TIMEOUT,
)
if self._is_transient(response.status_code):
raise TransientHttpError(
f"postcodes.io {response.status_code} on {path}",
retry_after=self._retry_after(response),
)
if not response.is_success:
return None
try:
body: Any = response.json()
except ValueError:
return None
return body.get("result") if isinstance(body, dict) else None
@staticmethod
def _is_transient(status_code: int) -> bool:

View file

@ -46,7 +46,7 @@ class CohortGeospatial(Protocol):
class NearbyPostcodes(Protocol):
"""Resolves the real unit postcodes physically near a seed postcode (e.g.
`PostcodesIoClient`). The gov EPC API cannot search by radius, so this is how
the cohort reaches beyond the target's own postcode (ADR-0031)."""
the cohort reaches beyond the target's own postcode (ADR-0034)."""
def nearby(
self, postcode: str, coordinates: Optional[Coordinates] = None
@ -83,7 +83,7 @@ class EpcComparablePropertiesRepository(ComparablePropertiesRepository):
minimum: int = _DEFAULT_MINIMUM_COHORT,
) -> list[ComparableProperty]:
"""The broadened cohort: candidates drawn from the real unit postcodes
nearest ``postcode`` (ADR-0031), for when the target's own postcode holds
nearest ``postcode`` (ADR-0034), for when the target's own postcode holds
no same-type comparables. Postcodes are visited nearest first and each
candidate is deduped by certificate number across them.

View file

@ -709,7 +709,7 @@ def main() -> None:
def broaden(target: PredictionTarget) -> list[ComparableProperty]:
# Broadened cohort for a gated-out target: the nearest real postcodes,
# walked until enough same-type comparables surface (ADR-0031). Memoised
# walked until enough same-type comparables surface (ADR-0034). Memoised
# per (postcode, property_type).
key = (target.postcode, target.property_type)
if key not in _nearby_cohort_cache:

View file

@ -78,7 +78,7 @@ def test_429_retry_after_header_drives_sleep_duration(
_mock_response(200, cert_response),
]
with patch("httpx.get", side_effect=responses), patch(
"infrastructure.epc_client._retry.time.sleep"
"infrastructure.http_retry.time.sleep"
) as mock_sleep:
epc_service.get_by_certificate_number("CERT-001")
@ -100,7 +100,7 @@ def test_429_without_retry_after_uses_exponential_backoff(
_mock_response(200, cert_response),
]
with patch("httpx.get", side_effect=responses), patch(
"infrastructure.epc_client._retry.time.sleep"
"infrastructure.http_retry.time.sleep"
) as mock_sleep:
epc_service.get_by_certificate_number("CERT-001")
@ -121,7 +121,7 @@ def test_429_malformed_retry_after_falls_back_to_backoff(
_mock_response(200, cert_response),
]
with patch("httpx.get", side_effect=responses), patch(
"infrastructure.epc_client._retry.time.sleep"
"infrastructure.http_retry.time.sleep"
) as mock_sleep:
epc_service.get_by_certificate_number("CERT-001")
@ -140,7 +140,7 @@ def test_429_retry_after_capped_by_max_backoff(epc_service, rdsap_21_0_1_cert):
_mock_response(200, cert_response),
]
with patch("httpx.get", side_effect=responses), patch(
"infrastructure.epc_client._retry.time.sleep"
"infrastructure.http_retry.time.sleep"
) as mock_sleep:
epc_service.get_by_certificate_number("CERT-001")

View file

@ -18,8 +18,9 @@ _MODULE = "infrastructure.postcodes_io.postcodes_io_client"
@pytest.fixture(autouse=True)
def _no_sleep() -> Iterator[MagicMock]:
"""Never actually sleep during backoff — just record the calls."""
with patch(f"{_MODULE}.time.sleep") as sleep:
"""Never actually sleep during backoff (the shared retry owns the sleep) —
just record the calls."""
with patch("infrastructure.http_retry.time.sleep") as sleep:
yield sleep

View file

@ -0,0 +1,126 @@
"""call_with_retry — the shared transient-failure retry both HTTP source clients
use. Generic mechanism (exponential backoff, Retry-After), per-client policy
(what gets raised as transient)."""
from __future__ import annotations
from typing import Iterator
from unittest.mock import MagicMock, call, patch
import httpx
import pytest
from infrastructure.http_retry import TransientHttpError, call_with_retry
@pytest.fixture(autouse=True)
def _no_sleep() -> Iterator[MagicMock]:
with patch("infrastructure.http_retry.time.sleep") as sleep:
yield sleep
def test_returns_immediately_on_success(_no_sleep: MagicMock) -> None:
# Act
result = call_with_retry(lambda: 42)
# Assert
assert result == 42
_no_sleep.assert_not_called()
def test_retries_a_transient_error_then_succeeds(_no_sleep: MagicMock) -> None:
# Arrange — fail twice transiently, then succeed.
calls = {"n": 0}
def fn() -> str:
calls["n"] += 1
if calls["n"] < 3:
raise TransientHttpError("429")
return "ok"
# Act
result = call_with_retry(fn)
# Assert — exponential backoff between the three attempts.
assert result == "ok"
assert _no_sleep.call_args_list == [call(1.0), call(2.0)]
def test_honours_retry_after_over_the_computed_backoff(_no_sleep: MagicMock) -> None:
# Arrange
raised = {"done": False}
def fn() -> str:
if not raised["done"]:
raised["done"] = True
raise TransientHttpError("429", retry_after=7.0)
return "ok"
# Act
call_with_retry(fn)
# Assert
_no_sleep.assert_called_once_with(7.0)
def test_retries_transport_errors(_no_sleep: MagicMock) -> None:
# Arrange
attempts = {"n": 0}
def fn() -> str:
attempts["n"] += 1
if attempts["n"] == 1:
raise httpx.ReadTimeout("slow")
return "ok"
# Act
result = call_with_retry(fn)
# Assert
assert result == "ok"
assert attempts["n"] == 2
def test_reraises_the_last_transient_error_once_exhausted(
_no_sleep: MagicMock,
) -> None:
# Arrange — always transient.
def fn() -> str:
raise TransientHttpError("persistent 429")
# Act / Assert
with pytest.raises(TransientHttpError, match="persistent 429"):
call_with_retry(fn, max_retries=2)
assert _no_sleep.call_count == 2
def test_non_transient_error_propagates_without_retry(_no_sleep: MagicMock) -> None:
# Arrange
attempts = {"n": 0}
def fn() -> str:
attempts["n"] += 1
raise ValueError("not transient")
# Act / Assert
with pytest.raises(ValueError, match="not transient"):
call_with_retry(fn)
assert attempts["n"] == 1
_no_sleep.assert_not_called()
def test_retry_after_is_capped_by_max_backoff(_no_sleep: MagicMock) -> None:
# Arrange
raised = {"done": False}
def fn() -> str:
if not raised["done"]:
raised["done"] = True
raise TransientHttpError("429", retry_after=999.0)
return "ok"
# Act
call_with_retry(fn, max_backoff=60.0)
# Assert
_no_sleep.assert_called_once_with(60.0)