Model/services/ml_training_data/tests/unit
Khalim Conn-Kowlessar 6072d8795a slice 16i: MAE + RMSE in metrics; sample_weight_fn + low_sap_tail_weight
train_baseline now returns mae + rmse alongside mape/smape/r2.  MAE is the
user-facing metric ("predicted SAP within N points"); RMSE the quadratic
counterpart.  Both come straight from sklearn.
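
For reference, the two new metrics can be sketched as follows. The commit takes both from sklearn; this sketch uses plain NumPy only to make the definitions explicit. The helper name mae_rmse and the example values are assumptions for illustration, not part of train_baseline.

```python
import numpy as np


def mae_rmse(y_true, y_pred):
    """Return (MAE, RMSE) for two equal-length sequences.

    Hypothetical helper: illustrates the definitions only, not the
    actual code path inside train_baseline.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = float(np.mean(np.abs(err)))         # mean absolute error, in SAP points
    rmse = float(np.sqrt(np.mean(err ** 2)))  # root mean squared error
    return mae, rmse


# Made-up SAP scores: one large miss (50 predicted as 40) moves RMSE
# much more than it moves MAE.
mae, rmse = mae_rmse([60, 70, 80, 50], [62, 69, 80, 40])
```

RMSE is never smaller than MAE on the same errors, so a wide gap between the two suggests a few large misses rather than uniformly mediocre predictions.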

New sample_weight_fn parameter: callable(y_train) -> per-row weights.
Threads into LGBMRegressor.fit's sample_weight argument.  Default None
preserves existing behaviour.

Default tail strategy exposed as low_sap_tail_weight(y, threshold=58,
weight=3): 3x weight where SAP < 58.  Threshold picked from slice 16h's
per-decile residuals: decile 0 (SAP 1-58) carries 17% MAPE vs <5% for the
body of the distribution.
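
One way the tail weighting could look, given the signature above. The body is an assumption: only the name, the defaults and the "3x where SAP < 58" rule come from this commit message.

```python
import numpy as np


def low_sap_tail_weight(y, threshold=58, weight=3):
    """Return per-row weights: `weight` where y < threshold, else 1.

    Sketch only; the strict `<` follows the wording of the commit
    message and the real implementation may treat the boundary
    differently.
    """
    y = np.asarray(y, dtype=float)
    return np.where(y < threshold, float(weight), 1.0)


# Rows with SAP below 58 get triple weight; SAP == 58 is left at 1.
weights = low_sap_tail_weight([10, 57, 58, 90])
```

Because the function takes only the training targets, it fits the callable(y_train) shape expected by sample_weight_fn.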

Three TDD tracers, all AAA (Arrange-Act-Assert).
2026-05-17 14:48:00 +00:00
..
__init__.py slice 14a: ml_training_data pkg + sample.py (CSV filter + random sample) 2026-05-16 17:39:43 +00:00
test_build_features.py slice 14h: handle real bulk-JSON shape (NDJSON wrappers + document payload) 2026-05-16 19:45:52 +00:00
test_bulk_zip_reader.py slice 14h: handle real bulk-JSON shape (NDJSON wrappers + document payload) 2026-05-16 19:45:52 +00:00
test_sample.py slice 14a: ml_training_data pkg + sample.py (CSV filter + random sample) 2026-05-16 17:39:43 +00:00
test_storage.py slice 14c: BulkZipReader streams certs from gov bulk JSON ZIP 2026-05-16 18:27:24 +00:00
test_train_baseline.py slice 16i: MAE + RMSE in metrics; sample_weight_fn + low_sap_tail_weight 2026-05-17 14:48:00 +00:00
test_write_parquet.py slice 14e: write_training_dataset emits parquet + schema.json + manifest.json 2026-05-16 18:43:31 +00:00