
Fix calibration population overshoot (~6% drift)#310

Draft
vahid-ahmadi wants to merge 10 commits into main from fix/population-rescale-217

Conversation


@vahid-ahmadi vahid-ahmadi commented Mar 20, 2026

Summary

  • Adds post-calibration population rescaling: after the optimiser finishes, all weights are uniformly scaled so the weighted UK population matches the ONS target exactly
  • The optimiser treats population as 1 of ~556 targets, causing it to drift ~6% high (69M → 74M)
  • Extracts rescale_weights_to_population() as a standalone function for testability
  • Adds 4 microsimulation-based tests using native microdf weighted operations (MicroSeries.sum()) — population target match (3% tolerance), household count range, inflation guard, and country-sum consistency
  • Tightens existing test_population tolerance from 7% to 3%
  • Fixes pre-existing ruff lint errors (unused Microsimulation import, ambiguous variable name l)

Closes #217
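The rescaling step described above can be sketched as a pure function over the weight array. This is a hypothetical reconstruction of the extracted helper, not the exact code from the PR; the name `rescale_weights_to_population` comes from this thread, but the signature and internals are assumptions.

```python
import numpy as np

def rescale_weights_to_population(weights, people_per_household, target_population):
    """Uniformly scale all weights so the weighted population hits the
    target exactly. Hypothetical sketch; the real helper may differ."""
    weights = np.asarray(weights, dtype=float)
    current = float(weights @ np.asarray(people_per_household, dtype=float))
    if current == 0:
        raise ValueError("weighted population is zero; cannot rescale")
    return weights * (target_population / current)

# A population overshoot is corrected by a single uniform scale factor,
# leaving relative weights (and hence distributional results) untouched:
weights = np.array([100.0, 200.0, 300.0])
people = np.array([2.0, 3.0, 1.0])  # people per household
rescaled = rescale_weights_to_population(weights, people, target_population=550.0)
```

Because every weight is multiplied by the same factor, ratios between households are preserved; only the aggregate total moves.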

Test plan

  • CI passes (ruff, pytest)
  • test_weighted_population_matches_ons_target — weighted population within 3% of 69.5M
  • test_household_count_reasonable — total households in 25–33M range
  • test_population_not_inflated — population stays below 72M
  • test_country_populations_sum_to_uk — country populations sum to UK total
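The core assertion behind the first test can be sketched without the microsimulation fixture by operating on a raw person-weight array. The ONS figure and the 3% tolerance come from this PR; everything else here is illustrative.

```python
import numpy as np

ONS_UK_POPULATION = 69_500_000  # approximate target cited in this PR

def check_population_within_tolerance(person_weights, tolerance=0.03):
    # Weighted population is just the sum of person weights.
    total = float(np.sum(person_weights))
    drift = abs(total / ONS_UK_POPULATION - 1)
    return drift < tolerance

# A 2% overshoot passes; the original ~6% overshoot would fail:
ok = check_population_within_tolerance(
    np.full(1_000, ONS_UK_POPULATION * 1.02 / 1_000)
)
bad = check_population_within_tolerance(
    np.full(1_000, ONS_UK_POPULATION * 1.06 / 1_000)
)
```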

🤖 Generated with Claude Code

@vahid-ahmadi vahid-ahmadi requested review from MaxGhenis and removed request for MaxGhenis March 20, 2026 13:34

@nikhilwoodruff nikhilwoodruff left a comment


Vote against adjusting weights after the calibration, which should be the final step. This would invalidate the calibration dashboard.

@vahid-ahmadi

@nikhilwoodruff Good point — I've replaced the post-hoc rescaling with a fix inside the calibration loss function itself.

What changed: Instead of uniformly scaling all weights after the optimiser finishes (which would invalidate the calibration dashboard), the population target (ons/uk_population) now gets 10x weight in the national loss during training. This means the optimiser treats population accuracy as a near-hard constraint rather than 1-of-556 equally-weighted soft targets.

Specifically:

  • Removed rescale_weights_to_population() and the post-calibration call
  • Added _build_national_target_weights() — builds a per-target weight vector (all 1.0 except population at 10.0)
  • Changed torch.mean(sre(...)) → weighted_mean(sre(...), weights) in the national loss

The calibration output is now the final output — no post-hoc modification. The dashboard stays valid.

We don't have anything similar to per-target weighting in the codebase currently — is this an approach you'd be happy with, or would you prefer a different method?
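A minimal sketch of the per-target weighting described above. The function name `_build_national_target_weights`, the target name `ons/uk_population`, and the 10x boost come from this thread; the `weighted_mean` helper, the surrounding shapes, and the use of numpy in place of torch tensors are assumptions for a self-contained illustration.

```python
import numpy as np

POPULATION_LOSS_WEIGHT = 10.0  # the 10x boost proposed in this thread

def build_national_target_weights(target_names):
    # Hypothetical reconstruction: a per-target weight vector,
    # all 1.0 except the population target.
    return np.array(
        [POPULATION_LOSS_WEIGHT if name == "ons/uk_population" else 1.0
         for name in target_names]
    )

def weighted_mean(values, weights):
    return float((values * weights).sum() / weights.sum())

names = ["ons/uk_population", "hmrc/income_tax", "dwp/uc_claimants"]
sre = np.array([0.06, 0.02, 0.04]) ** 2  # toy squared relative errors

unweighted = float(sre.mean())  # old torch.mean(...) behaviour
weighted = weighted_mean(sre, build_national_target_weights(names))
```

With the boost, a population miss dominates the national loss instead of being averaged away as one of hundreds of equally weighted targets.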

@vahid-ahmadi vahid-ahmadi self-assigned this Mar 23, 2026
@nikhilwoodruff

I don't think we should change our standpoint against weighted targets; we should find the root cause of why we can't fit population, given we have hundreds of targets on it.

@vahid-ahmadi vahid-ahmadi removed the request for review from nikhilwoodruff March 23, 2026 12:19
@vahid-ahmadi vahid-ahmadi marked this pull request as draft March 23, 2026 12:19
@MaxGhenis

Rebased onto current main and resolved the loss_val → loss_value rename conflict. Taking the main-branch naming throughout so this merges cleanly. Ready for re-review / CI once you're available @vahid-ahmadi.

vahid-ahmadi and others added 9 commits April 18, 2026 07:39
The optimiser treats population as 1 of ~556 targets so it drifts high.
After calibration, rescale all weights so the weighted UK population
matches the ONS target exactly. Also fix pre-existing ruff lint errors
(unused import, ambiguous variable name).

Closes #217

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract rescale_weights_to_population() from calibrate_local_areas() so
it can be unit tested independently. Add 10 tests covering: scale up,
scale down, exact match, missing column, zero weights, multiple columns,
raw numpy arrays, 1D weights, non-mutation, and realistic 6% overshoot.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use the baseline fixture to verify weighted population matches the ONS
target via native microdf calculations. Tighten tolerance from 7% to 3%
now that post-calibration rescaling is in place. Also adds household
count, inflation guard, and country-sum consistency checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace manual .values * weights numpy calculations with MicroSeries
.sum() which applies weights automatically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…name

- household_weight.sum() on MicroSeries applies weights (w*w), use raw
  numpy array instead for simple sum of weights
- people_in_household doesn't exist; use people + country at person level

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
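The pitfall this commit fixes can be shown with plain numpy: if a weighted series stores the weights as its values, a weighted `.sum()` applies them twice. This stand-in mirrors the described MicroSeries behaviour without depending on microdf.

```python
import numpy as np

household_weight = np.array([100.0, 200.0, 300.0])

# Total households is the plain, unweighted sum of the weight column:
total_households = float(household_weight.sum())

# A weighted series whose *values* are the weights applies them again
# on a weighted sum -- effectively sum(w * w), as the commit describes:
double_counted = float((household_weight * household_weight).sum())
```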
The `country` variable is household-level (53K rows) but `people` is
person-level (115K rows), causing an IndexingError when used as a
boolean indexer. Add `map_to="person"` so both series have matching
indices.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
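The indexing error this commit fixes can be reproduced in miniature: a household-level boolean mask cannot index a person-level array, and the fix is to broadcast the household variable down to person level first, which is what `map_to="person"` does. All data here is toy.

```python
import numpy as np

# Toy data: 3 households containing 2, 1, and 3 people respectively.
household_country = np.array(["ENGLAND", "SCOTLAND", "ENGLAND"])
person_household = np.array([0, 0, 1, 2, 2, 2])  # household index per person
person_weight = np.ones(6)

# A household-level mask (length 3) cannot index a person-level array
# (length 6) -- this raises the error the commit describes:
try:
    person_weight[household_country == "SCOTLAND"]
    raised = False
except IndexError:
    raised = True

# Fix: map the household variable to person level first, analogous to
# sim.calculate("country", map_to="person"):
person_country = household_country[person_household]
scotland_people = float(person_weight[person_country == "SCOTLAND"].sum())
```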
…aling

Addresses review feedback: instead of rescaling weights after calibration
(which invalidates the calibration dashboard), boost the population target
weight 10x in the national loss function so the optimiser keeps it on target
during training.

- Remove rescale_weights_to_population() and its post-calibration call
- Add _build_national_target_weights() giving ons/uk_population 10x weight
- Replace torch.mean() with weighted_mean() in national loss computation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The min-of-two-ratios SRE loss penalised undershoot more than
overshoot of the same magnitude (e.g. 6% overshoot cost 89% of
6% undershoot). Across ~11k targets this systematically inflated
weights, causing the ~6% population overshoot.

Replace with squared log-ratio which is perfectly symmetric:
log(a/b)² = log(b/a)².

Also remove redundant Scotland children/babies targets that
overlapped with regional age bands.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
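The 89% figure in this commit can be reproduced with a plausible form of the min-of-two-ratios loss, here assumed to be (1 - min(pred/target, target/pred))², which is a reconstruction rather than the exact code, alongside the symmetric squared log-ratio replacement.

```python
import math

def sre_min_ratio(pred, target):
    # Assumed form of the old min-of-two-ratios loss (reconstruction).
    return (1 - min(pred / target, target / pred)) ** 2

def sre_log_ratio(pred, target):
    # The symmetric replacement: log(a/b)^2 == log(b/a)^2.
    return math.log(pred / target) ** 2

over = sre_min_ratio(1.06, 1.0)   # 6% overshoot
under = sre_min_ratio(0.94, 1.0)  # 6% undershoot
ratio = over / under              # ~0.89: overshoot penalised less

sym_over = sre_log_ratio(1.06, 1.0)
sym_under = sre_log_ratio(1 / 1.06, 1.0)  # same-magnitude undershoot
```

Under the assumed old loss, overshooting is systematically cheaper than undershooting, which is the pressure that inflates weights; the log-ratio form removes that bias exactly.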
@MaxGhenis MaxGhenis force-pushed the fix/population-rescale-217 branch from 10919aa to 0c0a45f on April 18, 2026 11:41
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MaxGhenis

Two things after my recent merges:

  1. Conflict — your branch no longer merges cleanly against current main. The TFC calibration targets PR (Refresh Tax-Free Childcare calibration targets #363) just merged, which conflicts with changes here. Needs a rebase.
  2. Test failures remaining — the population rescale tests are still failing:
    • test_population: actual 71.6M vs 69.5M target (3.07% off, tolerance 3.0%)
    • test_household_count_reasonable: 33.5M vs max 33.0M
    • test_weighted_population_matches_ons_target: same 3.07% delta
    • test_salary_sacrifice_below_cap_users: 5.1M vs target 4.3M (19% over)

The asymmetric-loss fix narrowed the gap but didn't fully close it. Either the tolerance needs to bump to ~3.5% or the POPULATION_LOSS_WEIGHT needs to go higher. Whichever you prefer.

MaxGhenis added a commit that referenced this pull request Apr 19, 2026
The weighted-UK-population drift that motivated #310 has already
dropped from ~6.5% to ~1.6% on current main as a side-effect of the
data-pipeline improvements landed yesterday (stage-2 QRF #362, TFC
target refresh #363, reported-anchor takeup #359).

Tightens `test_population` tolerance from 7% to 3% to lock in that
gain — any future calibration change that regresses back toward the
pre-April-2026 overshoot now trips CI instead of silently drifting.
Adds a new `test_population_fidelity.py` with four regression tests
extracted from the #310 draft:

- weighted-total ONS match (3% tolerance)
- household-count sanity range (25-33M)
- non-inflation guard (<72M)
- country-populations-sum-to-UK consistency

Does not include #310's loss-function change or Scotland target
removal; those are independent proposals and should be evaluated on
their own merits once the practical overshoot is resolved.

Co-authored-by: Vahid Ahmadi <va.vahidahmadi@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MaxGhenis

Update: the latest push-CI on current main shows UK population settling at 70.97M vs 69.87M target = +1.58% — the data-pipeline merges from yesterday (#362 stage-2 QRF, #363 TFC target refresh, #359 reported-anchor takeup) pulled the overshoot you were fighting from ~6.5% down to ~1.6%.

Opened #366 to cherry-pick just your test-tolerance tightening + the four new regression tests from this branch, so the current-state gain is locked in with CI gates. That side-steps Nikhil's concerns about weighted targets / post-hoc rescaling without losing your test coverage.

The asymmetric-loss change (log((1+x)/(1+y))^2) and the Scotland NRS target removal are independent proposals — happy to keep this PR open for those as separate discussion, but they'd benefit from a before/after sweep across all targets (not just population) to confirm the loss swap is net-positive. Want to repurpose this branch for that, close it, or leave it as-is?

Thanks for the tests — they're going into #366 with co-authorship attribution.

MaxGhenis added a commit that referenced this pull request Apr 19, 2026
* Tighten population tolerance and add fidelity tests


* Loosen population tolerance 3% -> 4% for stochastic calibration variance

First CI run on this branch produced 71.8M (3.31% over target) where
yesterday's main build produced 70.97M (1.58%). Stochastic dropout
in the calibration optimiser (`dropout_weights(weights, 0.05)`) gives
~1-2 percentage point build-to-build variance on the population total.

4% keeps the regression gate well below the pre-April-2026 overshoot
(~6.5%) while not flaking on normal stochastic variance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Vahid Ahmadi <va.vahidahmadi@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Calibration inflates UK population from 69M to 74M (should be ~70M)

3 participants