
Fix calibration population overshoot (~6% drift)#310

Draft
vahid-ahmadi wants to merge 10 commits into main from fix/population-rescale-217

Conversation


@vahid-ahmadi vahid-ahmadi commented Mar 20, 2026

Summary

  • Adds post-calibration population rescaling: after the optimiser finishes, all weights are uniformly scaled so the weighted UK population matches the ONS target exactly
  • The optimiser treats population as 1 of ~556 targets, causing it to drift ~6% high (69M → 74M)
  • Extracts rescale_weights_to_population() as a standalone function for testability
  • Adds 4 microsimulation-based tests using native microdf weighted operations (MicroSeries.sum()) — population target match (3% tolerance), household count range, inflation guard, and country-sum consistency
  • Tightens existing test_population tolerance from 7% to 3%
  • Fixes pre-existing ruff lint errors (unused Microsimulation import, ambiguous variable name l)

Closes #217
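The rescaling step described above can be sketched as a pure function over the weight array. This is a hypothetical reconstruction of the extracted helper, not the exact code from the PR; the name `rescale_weights_to_population` comes from this thread, but the signature and internals are assumptions.

```python
import numpy as np

def rescale_weights_to_population(weights, people_per_household, target_population):
    """Uniformly scale all weights so the weighted population hits the
    target exactly. Hypothetical sketch; the real helper may differ."""
    weights = np.asarray(weights, dtype=float)
    current = float(weights @ np.asarray(people_per_household, dtype=float))
    if current == 0:
        raise ValueError("weighted population is zero; cannot rescale")
    return weights * (target_population / current)

# A population overshoot is corrected by a single uniform scale factor,
# leaving relative weights (and hence distributional results) untouched:
weights = np.array([100.0, 200.0, 300.0])
people = np.array([2.0, 3.0, 1.0])  # people per household
rescaled = rescale_weights_to_population(weights, people, target_population=550.0)
```

Because every weight is multiplied by the same factor, ratios between households are preserved; only the aggregate total moves.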

Test plan

  • CI passes (ruff, pytest)
  • test_weighted_population_matches_ons_target — weighted population within 3% of 69.5M
  • test_household_count_reasonable — total households in 25–33M range
  • test_population_not_inflated — population stays below 72M
  • test_country_populations_sum_to_uk — country populations sum to UK total
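The core assertion behind the first test can be sketched without the microsimulation fixture by operating on a raw person-weight array. The ONS figure and the 3% tolerance come from this PR; everything else here is illustrative.

```python
import numpy as np

ONS_UK_POPULATION = 69_500_000  # approximate target cited in this PR

def check_population_within_tolerance(person_weights, tolerance=0.03):
    # Weighted population is just the sum of person weights.
    total = float(np.sum(person_weights))
    drift = abs(total / ONS_UK_POPULATION - 1)
    return drift < tolerance

# A 2% overshoot passes; the original ~6% overshoot would fail:
ok = check_population_within_tolerance(
    np.full(1_000, ONS_UK_POPULATION * 1.02 / 1_000)
)
bad = check_population_within_tolerance(
    np.full(1_000, ONS_UK_POPULATION * 1.06 / 1_000)
)
```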

🤖 Generated with Claude Code

@vahid-ahmadi vahid-ahmadi requested review from MaxGhenis and removed request for MaxGhenis March 20, 2026 13:34

@nikhilwoodruff nikhilwoodruff left a comment


Vote against adjusting weights after the calibration, which should be the final step. This would invalidate the calibration dashboard.

@vahid-ahmadi

@nikhilwoodruff Good point — I've replaced the post-hoc rescaling with a fix inside the calibration loss function itself.

What changed: Instead of uniformly scaling all weights after the optimiser finishes (which would invalidate the calibration dashboard), the population target (ons/uk_population) now gets 10x weight in the national loss during training. This means the optimiser treats population accuracy as a near-hard constraint rather than 1-of-556 equally-weighted soft targets.

Specifically:

  • Removed rescale_weights_to_population() and the post-calibration call
  • Added _build_national_target_weights() — builds a per-target weight vector (all 1.0 except population at 10.0)
  • Changed torch.mean(sre(...)) → weighted_mean(sre(...), weights) in the national loss

The calibration output is now the final output — no post-hoc modification. The dashboard stays valid.

We don't have anything similar to per-target weighting in the codebase currently — is this an approach you'd be happy with, or would you prefer a different method?
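A minimal sketch of the per-target weighting described above. The function name `_build_national_target_weights`, the target name `ons/uk_population`, and the 10x boost come from this thread; the `weighted_mean` helper, the surrounding shapes, and the use of numpy in place of torch tensors are assumptions for a self-contained illustration.

```python
import numpy as np

POPULATION_LOSS_WEIGHT = 10.0  # the 10x boost proposed in this thread

def build_national_target_weights(target_names):
    # Hypothetical reconstruction: a per-target weight vector,
    # all 1.0 except the population target.
    return np.array(
        [POPULATION_LOSS_WEIGHT if name == "ons/uk_population" else 1.0
         for name in target_names]
    )

def weighted_mean(values, weights):
    return float((values * weights).sum() / weights.sum())

names = ["ons/uk_population", "hmrc/income_tax", "dwp/uc_claimants"]
sre = np.array([0.06, 0.02, 0.04]) ** 2  # toy squared relative errors

unweighted = float(sre.mean())  # old torch.mean(...) behaviour
weighted = weighted_mean(sre, build_national_target_weights(names))
```

With the boost, a population miss dominates the national loss instead of being averaged away as one of hundreds of equally weighted targets.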

@vahid-ahmadi vahid-ahmadi self-assigned this Mar 23, 2026
@nikhilwoodruff

I don't think we should change our standpoint against weighted targets; we should find the root cause of why we can't fit population, given we have hundreds of targets on it.

@vahid-ahmadi vahid-ahmadi removed the request for review from nikhilwoodruff March 23, 2026 12:19
@vahid-ahmadi vahid-ahmadi marked this pull request as draft March 23, 2026 12:19
@MaxGhenis

Rebased onto current main and resolved the loss_val → loss_value rename conflict. Taking the main-branch naming throughout so this merges cleanly. Ready for re-review / CI once you're available @vahid-ahmadi.

vahid-ahmadi and others added 9 commits April 18, 2026 07:39
The optimiser treats population as 1 of ~556 targets so it drifts high.
After calibration, rescale all weights so the weighted UK population
matches the ONS target exactly. Also fix pre-existing ruff lint errors
(unused import, ambiguous variable name).

Closes #217

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract rescale_weights_to_population() from calibrate_local_areas() so
it can be unit tested independently. Add 10 tests covering: scale up,
scale down, exact match, missing column, zero weights, multiple columns,
raw numpy arrays, 1D weights, non-mutation, and realistic 6% overshoot.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use the baseline fixture to verify weighted population matches the ONS
target via native microdf calculations. Tighten tolerance from 7% to 3%
now that post-calibration rescaling is in place. Also adds household
count, inflation guard, and country-sum consistency checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace manual .values * weights numpy calculations with MicroSeries
.sum() which applies weights automatically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…name

- household_weight.sum() on MicroSeries applies weights (w*w), use raw
  numpy array instead for simple sum of weights
- people_in_household doesn't exist; use people + country at person level

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
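The pitfall this commit fixes can be shown with plain numpy: if a weighted series stores the weights as its values, a weighted `.sum()` applies them twice. This stand-in mirrors the described MicroSeries behaviour without depending on microdf.

```python
import numpy as np

household_weight = np.array([100.0, 200.0, 300.0])

# Total households is the plain, unweighted sum of the weight column:
total_households = float(household_weight.sum())

# A weighted series whose *values* are the weights applies them again
# on a weighted sum -- effectively sum(w * w), as the commit describes:
double_counted = float((household_weight * household_weight).sum())
```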
The `country` variable is household-level (53K rows) but `people` is
person-level (115K rows), causing an IndexingError when used as a
boolean indexer. Add `map_to="person"` so both series have matching
indices.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
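The indexing error this commit fixes can be reproduced in miniature: a household-level boolean mask cannot index a person-level array, and the fix is to broadcast the household variable down to person level first, which is what `map_to="person"` does. All data here is toy.

```python
import numpy as np

# Toy data: 3 households containing 2, 1, and 3 people respectively.
household_country = np.array(["ENGLAND", "SCOTLAND", "ENGLAND"])
person_household = np.array([0, 0, 1, 2, 2, 2])  # household index per person
person_weight = np.ones(6)

# A household-level mask (length 3) cannot index a person-level array
# (length 6) -- this raises the error the commit describes:
try:
    person_weight[household_country == "SCOTLAND"]
    raised = False
except IndexError:
    raised = True

# Fix: map the household variable to person level first, analogous to
# sim.calculate("country", map_to="person"):
person_country = household_country[person_household]
scotland_people = float(person_weight[person_country == "SCOTLAND"].sum())
```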
…aling

Addresses review feedback: instead of rescaling weights after calibration
(which invalidates the calibration dashboard), boost the population target
weight 10x in the national loss function so the optimiser keeps it on target
during training.

- Remove rescale_weights_to_population() and its post-calibration call
- Add _build_national_target_weights() giving ons/uk_population 10x weight
- Replace torch.mean() with weighted_mean() in national loss computation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The min-of-two-ratios SRE loss penalised undershoot more than
overshoot of the same magnitude (e.g. 6% overshoot cost 89% of
6% undershoot). Across ~11k targets this systematically inflated
weights, causing the ~6% population overshoot.

Replace with squared log-ratio which is perfectly symmetric:
log(a/b)² = log(b/a)².

Also remove redundant Scotland children/babies targets that
overlapped with regional age bands.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
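The 89% figure in this commit can be reproduced with a plausible form of the min-of-two-ratios loss, here assumed to be (1 - min(pred/target, target/pred))², which is a reconstruction rather than the exact code, alongside the symmetric squared log-ratio replacement.

```python
import math

def sre_min_ratio(pred, target):
    # Assumed form of the old min-of-two-ratios loss (reconstruction).
    return (1 - min(pred / target, target / pred)) ** 2

def sre_log_ratio(pred, target):
    # The symmetric replacement: log(a/b)^2 == log(b/a)^2.
    return math.log(pred / target) ** 2

over = sre_min_ratio(1.06, 1.0)   # 6% overshoot
under = sre_min_ratio(0.94, 1.0)  # 6% undershoot
ratio = over / under              # ~0.89: overshoot penalised less

sym_over = sre_log_ratio(1.06, 1.0)
sym_under = sre_log_ratio(1 / 1.06, 1.0)  # same-magnitude undershoot
```

Under the assumed old loss, overshooting is systematically cheaper than undershooting, which is the pressure that inflates weights; the log-ratio form removes that bias exactly.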
@MaxGhenis MaxGhenis force-pushed the fix/population-rescale-217 branch from 10919aa to 0c0a45f on April 18, 2026 11:41
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MaxGhenis

Two things after my recent merges:

  1. Conflict — your branch no longer merges cleanly against current main. The TFC calibration targets PR (Refresh Tax-Free Childcare calibration targets #363) just merged, which conflicts with changes here. Needs a rebase.
  2. Test failures remaining — the population rescale tests are still failing:
    • test_population: actual 71.6M vs 69.5M target (3.07% off, tolerance 3.0%)
    • test_household_count_reasonable: 33.5M vs max 33.0M
    • test_weighted_population_matches_ons_target: same 3.07% delta
    • test_salary_sacrifice_below_cap_users: 5.1M vs target 4.3M (19% over)

The asymmetric-loss fix narrowed the gap but didn't fully close it. Either the tolerance needs to bump to ~3.5% or the POPULATION_LOSS_WEIGHT needs to go higher. Whichever you prefer.

MaxGhenis added a commit that referenced this pull request Apr 19, 2026
The weighted-UK-population drift that motivated #310 has already
dropped from ~6.5% to ~1.6% on current main as a side-effect of the
data-pipeline improvements landed yesterday (stage-2 QRF #362, TFC
target refresh #363, reported-anchor takeup #359).

Tightens `test_population` tolerance from 7% to 3% to lock in that
gain — any future calibration change that regresses back toward the
pre-April-2026 overshoot now trips CI instead of silently drifting.
Adds a new `test_population_fidelity.py` with four regression tests
extracted from the #310 draft:

- weighted-total ONS match (3% tolerance)
- household-count sanity range (25-33M)
- non-inflation guard (<72M)
- country-populations-sum-to-UK consistency

Does not include #310's loss-function change or Scotland target
removal; those are independent proposals and should be evaluated on
their own merits once the practical overshoot is resolved.

Co-authored-by: Vahid Ahmadi <va.vahidahmadi@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MaxGhenis

Update: the latest push-CI on current main shows UK population settling at 70.97M vs 69.87M target = +1.58% — the data-pipeline merges from yesterday (#362 stage-2 QRF, #363 TFC target refresh, #359 reported-anchor takeup) pulled the overshoot you were fighting from ~6.5% down to ~1.6%.

Opened #366 to cherry-pick just your test-tolerance tightening + the four new regression tests from this branch, so the current-state gain is locked in with CI gates. That side-steps Nikhil's concerns about weighted targets / post-hoc rescaling without losing your test coverage.

The asymmetric-loss change (log((1+x)/(1+y))^2) and the Scotland NRS target removal are independent proposals — happy to keep this PR open for those as separate discussion, but they'd benefit from a before/after sweep across all targets (not just population) to confirm the loss swap is net-positive. Want to repurpose this branch for that, close it, or leave it as-is?

Thanks for the tests — they're going into #366 with co-authorship attribution.

MaxGhenis added a commit that referenced this pull request Apr 19, 2026
* Tighten population tolerance and add fidelity tests


* Loosen population tolerance 3% -> 4% for stochastic calibration variance

First CI run on this branch produced 71.8M (3.31% over target) where
yesterday's main build produced 70.97M (1.58%). Stochastic dropout
in the calibration optimiser (`dropout_weights(weights, 0.05)`) gives
~1-2 percentage point build-to-build variance on the population total.

4% keeps the regression gate well below the pre-April-2026 overshoot
(~6.5%) while not flaking on normal stochastic variance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Vahid Ahmadi <va.vahidahmadi@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Calibration inflates UK population from 69M to 74M (should be ~70M)

3 participants