Phase 4: Survey support for ImputationDiD, TwoStageDiD, CallawaySantAnna by igerber · Pull Request #233 · igerber/diff-diff

igerber · 2026-03-22T20:44:48Z

Summary

Add weights parameter to solve_logit() — survey weights enter IRLS as w_survey * mu*(1-mu), enabling survey-weighted propensity score estimation
Add survey support to ImputationDiD: weighted iterative FE, survey-weighted ATT aggregation, weighted conservative variance (Theorem 3), survey df for inference
Add survey support to TwoStageDiD: weighted iterative FE, weighted Stage 2 OLS, weighted GMM sandwich variance with survey weights in both stages
Add survey support to CallawaySantAnna: survey-weighted regression, IPW (via weighted solve_logit()), and DR methods with explicit influence functions; survey-weighted WIF in aggregation; Cholesky cache bypassed under survey weights
Unblock TripleDifference IPW/DR with survey (weighted solve_logit() now available, removing Phase 3 deferral)
41 new tests covering all estimators, weighted logit, scale invariance, and CS inference validation
Update Phase 3 tests: TripleDifference IPW/DR tests changed from "raises" to "works"
Update survey-roadmap.md with Phase 4 implementation status
Document CS per-cell SE deviation in REGISTRY.md (influence-function variance matching R's did, not full TSL)
Log pre-existing Hausman stale n_cl bug from PR EfficientDiD: cluster-robust SEs, last-cohort control, Hausman pretest, small cohort warning #230 in TODO.md

Methodology references (required if estimator / math changes)

ImputationDiD: Borusyak, Jaravel & Spiess (2024), "Revisiting Event-Study Designs", RES — Theorem 3 conservative variance with survey-weighted aggregation weights
TwoStageDiD: Gardner (2022) / Butts & Gardner (2022) — GMM sandwich with survey weights in cross-products
CallawaySantAnna: Callaway & Sant'Anna (2021), "Difference-in-Differences with multiple time periods", JoE — survey weights compose multiplicatively with IPW weights; influence-function-based per-cell SE matching R's did::att_gt() analytical path
Weighted logit: Matches R's svyglm(family=binomial) — working weights = survey_w × mu × (1−mu)
Intentional deviation: CS per-cell ATT(g,t) SEs use IF variance (not full TSL with strata/PSU/FPC). Design effects enter at aggregation via WIF and survey df. Documented in REGISTRY.md.

Validation

Tests added/updated: tests/test_survey_phase4.py (41 tests), tests/test_survey_phase3.py (2 tests updated)
Full test suite: 2284 passed, 0 failed
Smoke tests verify: uniform weights match unweighted, non-uniform weights change ATT, bootstrap+survey raises NotImplementedError, scale invariance, survey metadata in summary output

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

🤖 Generated with Claude Code

github-actions · 2026-03-22T20:54:40Z

Overall Assessment

⛔ Blocker

Executive Summary

The PR removes the survey guard for TripleDifference IPW/DR, but fit() still routes those methods through the unweighted wrappers, so non-uniform survey weights are silently ignored for both ATT and SE.
ImputationDiD documents survey-weighted Theorem 3 variance, but the variance code never receives survey weights and still uses the unweighted v_it formulas/projection matrices.
CallawaySantAnna’s new no-covariate survey SE formulas are not influence-function based as documented; they use first-power weighted variances while the stored IFs use weighted deviations.
The Methodology Registry still says TripleDifference IPW/DR survey support is deferred, while the PR updates tests/roadmap to claim it is supported.
The new DDD survey tests only check “does not raise” / finite outputs, which is why the weighting regression slipped through.

Methodology

Severity: P0. Impact: TripleDifference IPW/DR with survey_design still computes unweighted estimates and unweighted SEs. Only the regression path receives survey_weights / resolved_survey; the IPW and DR branches call wrappers that drop both arguments, even though the downstream decomposition code already has survey-aware hooks. That is a silent correctness bug: users can pass non-uniform survey weights and get the wrong numbers with no warning. Concrete fix: either restore the NotImplementedError for IPW/DR+survey, or thread survey_weights and resolved_survey through the IPW/DR call chain and add regression tests that verify non-uniform weights change ATT/SE while uniform weights match the unweighted result. References: diff_diff/triple_diff.py:L557-L570, diff_diff/triple_diff.py:L791-L800, diff_diff/triple_diff.py:L1021-L1047.
Severity: P1. Impact: ImputationDiD claims survey-weighted conservative variance, but the survey path only changes the treated aggregation weights passed into _compute_conservative_variance(). The actual Theorem 3 helper still builds v_untreated from unweighted untreated counts in the FE-only case and from unweighted normal equations A_0'A_0 in the covariate case. Since survey_weights never reach these helpers, the reported SE is not the weighted Theorem 3 variance described in the registry. Concrete fix: thread survey weights into the v_it construction itself, not just the treated aggregation weights; for the covariate case that means using the weighted projection system rather than A_0'A_0, and for the FE-only case the closed form needs weighted untreated totals rather than raw counts. References: diff_diff/imputation.py:L447-L473, diff_diff/imputation.py:L1022-L1100, diff_diff/imputation.py:L1196-L1278, docs/methodology/REGISTRY.md:L845-L845.
Severity: P1. Impact: CallawaySantAnna’s new no-covariate survey SEs are internally inconsistent with the influence functions the code stores and uses downstream. In the survey branches, the IF is w_norm * (y - mean_w(y)), but the SE is computed from sum(w_norm * (y - mean_w(y))^2) rather than sum((w_norm * (y - mean_w(y)))^2). That makes per-cell se, t_stat, p_value, and conf_int wrong whenever weights are non-uniform, in the common no-covariate regression/IPW/DR paths and the no-covariate fallback path. Concrete fix: derive SEs from the same IF object the method returns, or replace both with a single shared survey IF helper so the per-cell SE and stored IF cannot diverge again. References: diff_diff/staggered.py:L709-L724, diff_diff/staggered.py:L1614-L1629, diff_diff/staggered.py:L1778-L1792, diff_diff/staggered.py:L1994-L2003, docs/methodology/REGISTRY.md:L419-L420.

Code Quality

No additional code-quality findings beyond the methodology wiring and variance issues above.

Performance

No new performance blockers found in the changed paths.

Maintainability

No new untracked maintainability blockers found beyond the issues above.

Tech Debt

Severity: P3. Impact: The PR correctly tracks one real deferred limitation in TODO.md: full design-based per-cell CallawaySantAnna survey SEs remain deferred. That tracking is appropriate, but it does not mitigate the separate SE bugs above. Concrete fix: none required for the tracking itself. Reference: TODO.md:L55-L56.

Security

No findings.

Documentation/Tests

Severity: P1. Impact: The Methodology Registry still says TripleDifference IPW/DR survey support is deferred, while the PR updates tests and the survey roadmap to treat it as implemented. For this project, REGISTRY.md is the methodology source of truth, so that is an unmitigated documentation/methodology mismatch. Concrete fix: after fixing the actual IPW/DR survey wiring, update the registry; if the feature remains incomplete, restore the guard and keep the registry as-is. References: docs/methodology/REGISTRY.md:L1248-L1249, tests/test_survey_phase3.py:L499-L529, tests/test_survey_phase4.py:L675-L716.
Severity: P2. Impact: The new DDD survey tests only assert that IPW/DR “work” in the sense of not raising and returning finite outputs. They never assert that survey weights actually affect IPW/DR estimates or that uniform weights collapse to the unweighted result, so they miss the core regression. Concrete fix: add one non-uniform-weight test per method that compares weighted vs unweighted ATT/SE, plus one uniform-weight invariance test for IPW and DR. References: tests/test_survey_phase3.py:L499-L529, tests/test_survey_phase4.py:L675-L716.

Path to Approval

Fix or re-disable TripleDifference IPW/DR survey support. The minimal safe change is to restore the NotImplementedError; the preferred change is to thread survey_weights and resolved_survey through IPW/DR and verify weighted ATT/SE numerically.
Rework ImputationDiD survey SEs so the Theorem 3 v_it construction uses weighted Step 1 algebra, not just weighted treated aggregation weights.
Fix CallawaySantAnna survey no-covariate per-cell SEs to use the same IF algebra that the code stores and aggregates.
Update docs/methodology/REGISTRY.md only after the estimator behavior is actually correct, and strengthen the DDD survey tests so they assert weighted behavior rather than mere finiteness.

igerber · 2026-03-22T23:19:16Z

/ai-review

github-actions · 2026-03-22T23:32:41Z

🔁 AI review rerun (requested by @igerber)

Head SHA: ab65708f0116716c097d6fc16c57fd97a8a40850

Overall Assessment

⚠️ Needs changes

This re-review clears two of the prior blockers: TripleDifference no-covariate IPW/DR survey wiring is now present, and the CallawaySantAnna per-cell survey SE deviation is now implemented as IF-based per-cell variance and explicitly mitigated in both REGISTRY.md and TODO.md. The highest remaining unmitigated severity is still P1: ImputationDiD survey SEs do not implement the documented weighted conservative variance, and I found two additional survey-path P1s in covariate-adjusted TripleDifference IPW/DR and TwoStageDiD’s always-treated edge case.

Executive Summary

ImputationDiD now survey-weights ATT aggregation, but its Theorem 3 untreated-side v_it algebra is still unweighted, so survey SEs do not match the documented method.
The previous TripleDifference blocker where survey weights were dropped entirely on the no-covariate IPW/DR path appears fixed, but covariate-adjusted survey IPW/DR still computes nuisance IF corrections with unweighted Hessian/score/OLS terms, so those survey SEs remain wrong.
TwoStageDiD resolves survey arrays before excluding always-treated units; on that documented edge case, the filtered fit can fail with stale-length survey arrays.
The prior CallawaySantAnna per-cell survey SE concern is no longer blocking: the code now derives those SEs from the stored IF, and the remaining full design-based per-cell TSL gap is documented and tracked.
The new survey tests are materially better for no-covariate paths, but they still miss DDD IPW/DR with covariates and TwoStageDiD with always-treated units under survey.

Methodology

The touched paths are all source-sensitive: Borusyak-Jaravel-Spiess place ImputationDiD uncertainty on the Theorem 3 conservative variance, Gardner/did2s treats two-stage inference as a generated-regressor/GMM problem, Callaway-Sant’Anna support OR/IPW/DR staggered DiD with covariates, and the current DDD source paper explicitly emphasizes covariate-adjusted RA/IPW/DR inference. (economics-static.uchicago.edu)

Severity: P1 (unresolved prior finding). Impact: ImputationDiD now survey-weights the ATT and treated-side aggregation weights, but the untreated-side conservative variance construction is still unweighted in both branches: the FE-only path still uses raw untreated counts n0_by_unit / n0_by_time, and the covariate path still solves A_0' A_0 instead of a weighted projection system. That means the reported survey SEs for overall ATT, event-study, and group aggregation do not implement the survey-weighted Theorem 3 variance described in the registry. Concrete fix: Thread survey weights into the untreated v_it algebra itself: use weighted untreated totals in the FE-only closed form, solve the weighted projection system in the covariate branch, and reuse that corrected helper across all aggregations. Evidence: diff_diff/imputation.py:L442-L474, diff_diff/imputation.py:L1061-L1093, diff_diff/imputation.py:L1196-L1279, diff_diff/imputation.py:L1489-L1534, diff_diff/imputation.py:L1624-L1666, docs/methodology/REGISTRY.md:L798-L845.
Severity: P1 [Newly identified]. Impact: TripleDifference now correctly threads survey_weights into the IPW/DR decomposition, so the earlier no-covariate blocker is fixed. But the covariate-adjusted survey IPW/DR IF corrections are still unweighted: the propensity-score Hessian uses pscore*(1-pscore) without survey weights, the PS score correction uses an unweighted (PA4 - pscore) * X, and the DR OLS linear representations also ignore survey weights even though the point-estimation fits are survey-weighted. Because the final survey SE is computed by applying TSL to that combined IF, survey SEs are wrong whenever covariates is provided in IPW or DR mode. Concrete fix: Weight the PS Hessian and score consistently with the weighted logit, weight the DR OLS bread/score terms consistently with the survey WLS fits, and add covariate-adjusted survey regressions for both IPW and DR. Evidence: diff_diff/triple_diff.py:L557-L586, diff_diff/triple_diff.py:L897-L950, diff_diff/triple_diff.py:L1023-L1083, diff_diff/triple_diff.py:L1323-L1337, diff_diff/triple_diff.py:L1512-L1693, docs/methodology/REGISTRY.md:L1238-L1248.
Severity: P1 [Newly identified]. Impact: TwoStageDiD.fit() resolves survey_weights on the original data before excluding always-treated units, but Stage 1, Stage 2, and GMM then index those stale arrays against the filtered df. On a survey fit that hits the documented always-treated exclusion path, this can raise length-mismatch errors instead of handling the edge case. Concrete fix: Move survey resolution after the always-treated filter, or subset/re-resolve all survey arrays and metadata immediately after filtering before any downstream masking or matrix construction. Evidence: diff_diff/two_stage.py:L224-L245, diff_diff/two_stage.py:L271-L288, diff_diff/two_stage.py:L330-L360, diff_diff/two_stage.py:L727-L753, diff_diff/two_stage.py:L877-L904, diff_diff/two_stage.py:L1046-L1070, diff_diff/two_stage.py:L1168-L1192, docs/methodology/REGISTRY.md:L913-L923.
Severity: P3-informational. Impact: The prior CallawaySantAnna per-cell survey SE blocker is mitigated. The no-covariate survey branches now compute SEs from the same stored IF (sum(IF_i^2)), and the remaining deviation from full design-based per-cell TSL is explicitly documented in the registry and tracked in TODO.md, which makes it non-blocking under this project’s rules. Concrete fix: None required for this PR. Evidence: diff_diff/staggered.py:L709-L724, diff_diff/staggered.py:L1616-L1633, diff_diff/staggered.py:L1790-L1797, diff_diff/staggered.py:L1996-L2009, docs/methodology/REGISTRY.md:L419-L420, TODO.md:L55.

Code Quality

No additional code-quality findings beyond the estimator/inference issues above.

Performance

No new performance blockers in the changed survey paths.

Maintainability

No additional maintainability findings beyond the stale in-code documentation noted below.

Tech Debt

Severity: P3-informational. Impact: The newly added TODO.md entry for CallawaySantAnna’s remaining full per-cell design-based TSL limitation is appropriate tracking and does mitigate that deferrable gap under the project rules. Concrete fix: None required unless you want to implement full unit-level per-cell survey TSL now. Evidence: TODO.md:L55, docs/methodology/REGISTRY.md:L419-L420.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: The new survey tests still miss the two concrete regressions above. TripleDifference survey IPW/DR is only tested on no-covariate data, while the only survey+covariate DDD test covers reg only; likewise the new TwoStageDiD survey tests never hit the documented always-treated exclusion path. That is why the remaining covariate-adjusted DDD SE issue and the filtered-survey length mismatch are not caught. Concrete fix: Add survey tests for DDD IPW+covariates and DR+covariates (both uniform-weight invariance and non-uniform-weight sensitivity), plus a TwoStageDiD survey test with at least one always-treated unit. Evidence: tests/test_survey_phase3.py:L499-L603, tests/test_survey_phase4.py:L341-L461, tests/test_survey_phase4.py:L672-L804.
Severity: P2. Impact: TripleDifference.fit()’s docstring still says survey support is only available for estimation_method="reg", while the methodology registry and the new tests now treat survey support as implemented for reg, ipw, and dr. That stale in-code reference will mislead users and future reviewers even after the code is fixed. Concrete fix: Update the docstring once the actual support boundary is correct, or restore the runtime guard if IPW/DR survey support is intentionally partial. Evidence: diff_diff/triple_diff.py:L480-L484, docs/methodology/REGISTRY.md:L1243-L1248.

Static-review note: I could not execute the test suite in this environment because pytest is unavailable and the local Python environment is missing numpy.

Path to Approval

Rework ImputationDiD survey conservative variance so survey weights enter the untreated v_it construction itself in both the FE-only and covariate branches, then add a regression test that checks SE behavior under non-uniform vs uniform survey weights.
Fix TripleDifference survey IPW/DR with covariates by weighting the PS Hessian/score and DR OLS linear-representation terms consistently with the survey fits, then add survey+covariate IPW/DR tests.
Move TwoStageDiD survey resolution after the always-treated exclusion or subset/re-resolve all survey arrays immediately after filtering, then add a survey-weighted always-treated regression test.
Update TripleDifference.fit()’s survey-support docstring once the actual support boundary is correct.

igerber · 2026-03-22T23:54:44Z

/ai-review

github-actions · 2026-03-23T00:03:22Z

🔁 AI review rerun (requested by @igerber)

Head SHA: ae6b278a4512079d7564f2f73da320c9d91f19b2

Overall Assessment

⚠️ Needs changes

The previous ImputationDiD variance-weighting gap and TwoStageDiD always-treated survey-array bug look fixed in this diff, but there are still unmitigated P1 issues in the new survey inference math for CallawaySantAnna and covariate-adjusted TripleDifference.

Executive Summary

ImputationDiD now threads survey weights into untreated-side Theorem 3 denominators / projections, so the prior imputation variance blocker appears resolved.
TwoStageDiD now subsets survey arrays after excluding always-treated units, so the prior always-treated edge-case blocker appears resolved.
CallawaySantAnna survey aggregation SEs are still wrong: the new survey WIF path mis-scales the weighted group-share influence function, so overall / event-study / group SEs do not implement the documented survey-weighted aggregation.
TripleDifference survey IPW/DR no longer hard-fails, but the covariate-adjusted survey IF corrections still apply survey weights twice in several moment maps, so those SEs remain incorrect.
CallawaySantAnna survey covariate paths (reg, ipw, dr) do not carry the weighted nuisance-estimation IF terms implied by the registry claim that survey SEs follow the analytical IF path, so per-cell and aggregated SEs are wrong when covariates are used.
The new tests improve plumbing coverage, but they still do not pin down the new SE algebra: there are no deterministic WIF checks for CS, no CS survey+covariate tests, and no SE-focused covariate survey tests for TripleDifference.

Methodology

Severity: P1 [Newly identified]. Impact: CallawaySantAnna’s new survey-weighted WIF is algebraically mis-scaled, so survey overall / dynamic / group aggregation SEs are wrong for all methods. The code replaces raw cohort indicators with sw_i / sum(sw) before subtracting pg, then divides by n_units again; that drops the unit weight from the negative pg term and applies the normalization on the wrong scale. This is not the documented per-cell TSL deviation in TODO.md; it is a separate aggregation-level SE error. Concrete fix: derive the WIF from the weighted share estimator p_g = Σ s_i 1{G_i=g} / Σ s_i, keep the s_i factor on both terms of the cohort-share IF, and apply the final normalization once. Add a small deterministic regression test for simple, event-study, and group aggregation SEs. Evidence: diff_diff/staggered_aggregation.py:L265-L390, diff_diff/staggered_aggregation.py:L397-L451, docs/methodology/REGISTRY.md:L419-L420, TODO.md:L55-L56
Severity: P1 (unresolved prior finding). Impact: TripleDifference survey IPW/DR with covariates still has incorrect SEs. The PR fixed the obvious missing-weight omissions, but several correction moments now weight the same terms twice: the survey branch first multiplies riesz_* by weights, then computes M2_*, M3_*, and mom_* with np.average(..., weights=weights). That distorts the combined IF that is later passed into TSL. Concrete fix: in the covariate IPW/DR survey branches, make survey weights enter each estimating equation exactly once; either keep weights inside riesz_* and use plain means afterward, or keep riesz_* unweighted and use a single weighted empirical average consistently. Add SE-focused covariate survey tests, including weight-rescaling invariance. Evidence: diff_diff/triple_diff.py:L1288-L1353, diff_diff/triple_diff.py:L1481-L1717, diff_diff/triple_diff.py:L1078-L1085, docs/methodology/REGISTRY.md:L1195-L1248
Severity: P1 [Newly identified]. Impact: CallawaySantAnna’s survey covariate branches do not implement the weighted nuisance-estimation influence terms promised by the new registry note. In _outcome_regression, _ipw_estimation, and _doubly_robust, the survey inf_func collapses to simple weighted residual pieces and omits the regression / propensity-score estimation corrections that drive the analytical IF path. Because the same inf_func is reused for aggregation, both per-cell and aggregated SEs are wrong whenever covariates are supplied. Concrete fix: port the full weighted analytical IF into the survey reg / ipw / dr branches, or reject covariates+survey at runtime and narrow the docs until that IF is implemented. Evidence: diff_diff/staggered.py:L1572-L1601, diff_diff/staggered.py:L1682-L1754, diff_diff/staggered.py:L1859-L1976, docs/methodology/REGISTRY.md:L419-L420

Code Quality

No additional code-quality findings beyond the inference-math issues above.

Performance

No new performance blockers in the changed survey paths.

Maintainability

No additional maintainability findings beyond the support-boundary ambiguity noted in Documentation/Tests.

Tech Debt

Severity: P3. Impact: the new TODO.md entries properly track the documented per-cell CS survey-TSL limitation and the pre-existing EfficientDiD stale-n_cl issue, so those tracked items are not blockers under this review rubric. Concrete fix: none required for this PR unless you choose to implement those deferred items now. Evidence: TODO.md:L55-L56

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the new tests mainly check “works / finite / ATT changes” behavior, but they do not verify the newly added SE algebra that is currently broken. There are no deterministic CS WIF tests, no CallawaySantAnna survey+covariate tests, and no TripleDifference covariate survey SE regression tests. The roadmap also now advertises TripleDifference and CallawaySantAnna survey support as “Full,” which is overstated until the SE issues above are fixed or those combinations are gated/documented as partial. Concrete fix: add targeted SE regression tests for the flagged cases; if the math is deferred, revert the new “Full” support claims and restore runtime guards for the unsupported combinations. Evidence: tests/test_survey_phase4.py:L495-L980, docs/survey-roadmap.md:L17-L25, docs/survey-roadmap.md:L43-L53, diff_diff/triple_diff.py:L480-L496

Path to Approval

Fix CallawaySantAnna’s survey WIF so the cohort-share influence function is derived from the weighted share estimator and normalized exactly once, then add deterministic SE regression tests for simple, event-study, and group aggregation.
Fix TripleDifference covariate survey IPW/DR moment maps so survey weights enter each IF correction term exactly once, then add covariate survey SE tests that check weight-rescaling invariance.
Implement the full weighted nuisance-estimation IF in CallawaySantAnna survey reg / ipw / dr covariate branches, or explicitly reject covariates+survey for those methods until that is done.
If any unsupported survey combinations remain after the code fixes, narrow the new “Full” support claims in the registry / roadmap and gate those combinations at runtime instead of documenting them as complete.

Static-review note: I could not run the targeted tests in this environment because pytest is not installed.

igerber · 2026-03-23T01:05:23Z

/ai-review

github-actions · 2026-03-23T01:17:25Z

🔁 AI review rerun (requested by @igerber)

Head SHA: e754b04cc72778b32a737fd6dea0c07bc67fa3c4

Overall Assessment

⚠️ Needs changes

Unmitigated P1s remain in the new survey inference paths. The earlier ImputationDiD variance-weighting gap and the obvious TripleDifference survey double-weighting problem look resolved in this diff, but CallawaySantAnna still has wrong survey SE algebra, and TwoStageDiD adds a new always-treated survey d.f. edge-case bug.

Executive Summary

The previous ImputationDiD blocker looks fixed; I did not find a new P1 in the changed imputation variance path.
The previous TripleDifference survey IPW/DR blocker also looks fixed; I did not find an unmitigated P1 in the changed weighted IF code there.
CallawaySantAnna survey aggregation SEs are still wrong: the new survey WIF branch does not implement the official did weight-IF formula and rescales the term incorrectly.
CallawaySantAnna survey + covariates is still incomplete for ipw/dr: the new survey branches omit the nuisance-estimation IF corrections needed for analytical SEs.
TwoStageDiD now subsets survey arrays after dropping always-treated units, but it leaves n_psu / n_strata stale, so survey d.f. and reported metadata can be wrong on that edge case.
The new tests are mostly smoke / plumbing checks and do not pin down the broken Callaway survey SE algebra; the roadmap’s Full (analytical) claim is therefore overstated.

Methodology

Severity: P1 (unresolved prior finding). Impact: CallawaySantAnna survey aggregation SEs are still incorrect. In diff_diff/staggered_aggregation.py:L343-L396, the survey WIF branch uses unit_sw * pg in the negative term and divides by total_weight inside indicator_diff, then divides by total_weight again at psi_wif = wif_contrib / total_weight. The official did::wif() uses weights.ind * I(G=g) - pg and adds that WIF directly to the aggregate IF before a single getSE() scaling step, so this branch is both algebraically different and over-normalized. Concrete fix: derive the survey WIF from unit_sw * I(G=g) - pg, remove the inner /total_weight, keep only the final normalization, and add deterministic SE regression tests for simple, event-study, and group aggregation. Evidence: diff_diff/staggered_aggregation.py:L343-L396, diff_diff/staggered_aggregation.py:L403-L455, docs/methodology/REGISTRY.md:L419-L420. (rdrr.io)
Severity: P1 (unresolved prior finding, narrowed). Impact: CallawaySantAnna survey + covariates is still missing required analytical IF terms for estimation_method="ipw" and "dr". The new survey IPW branch in diff_diff/staggered.py:L1707-L1779 stops at the leading treated/control terms, but the official panel IPW IF includes an additional propensity-score estimation correction (inf.cont.2). The new survey DR branch in diff_diff/staggered.py:L1884-L2001 similarly reduces to psi_treated/psi_control and omits the treated OR correction plus the control PS/OR corrections. That makes per-cell and aggregated survey SEs wrong whenever covariates are supplied. Concrete fix: port the DRDID panel nuisance-estimation IF pieces into these survey branches, or reject survey_design with covariates for ipw/dr until that is implemented; then add reference-based SE tests on a fixed dataset. Evidence: diff_diff/staggered.py:L1707-L1779, diff_diff/staggered.py:L1884-L2001, docs/methodology/REGISTRY.md:L419-L420. (rdrr.io)
Severity: P1 [Newly identified]. Impact: TwoStageDiD can emit wrong survey d.f., p-values, confidence intervals, and survey metadata after excluding always-treated units. The new filtering path slices the resolved survey arrays with replace(...) in diff_diff/two_stage.py:L290-L314, but ResolvedSurveyDesign.df_survey is computed from cached n_psu / n_strata in diff_diff/survey.py:L295-L304, and those counts are then consumed at diff_diff/two_stage.py:L434-L435. If dropping always-treated units removes a PSU or an entire stratum, inference is silently based on stale counts. Concrete fix: after filtering, re-resolve the survey design on the filtered df or recompute/re-factorize PSU and stratum counts before using df_survey or compute_survey_metadata(), and add a regression test where always-treated units occupy their own PSU/stratum.

Code Quality

No additional code-quality findings beyond the inference bugs above.

Performance

No new performance blockers identified in the changed survey paths.

Maintainability

Severity: P2. Impact: the new survey feature is not exposed consistently through the public convenience APIs. diff_diff/imputation.py:L1888-L1945 and diff_diff/two_stage.py:L1589-L1645 still construct the estimator from **kwargs and never accept/pass survey_design into fit(), so users of imputation_did() / two_stage_did() cannot reach the new capability. Concrete fix: add survey_design=None to both wrappers, thread it through, and add wrapper-parity tests.

Tech Debt

Severity: P3. Impact: the new TODO entry for Callaway per-cell survey TSL is properly tracked and is not a blocker under this rubric; same for the unrelated EfficientDiD stale-n_cl item. Concrete fix: none required for this PR. Note that TODO.md:L55 does not cover the unmitigated aggregation WIF bug or the covariate-survey IF omissions above.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the new tests do not pin down the broken CallawaySantAnna survey IF math. The full-design tests only assert metadata and df-driven p-value differences in tests/test_survey_phase4.py:L573-L602 and tests/test_survey_phase4.py:L955-L980; they do not include a deterministic/reference survey aggregation SE check, and they do not cover Callaway survey + covariate ipw/dr SEs. At the same time, the roadmap labels CallawaySantAnna as Full (analytical) in docs/survey-roadmap.md:L47-L49. Concrete fix: add reference-based Callaway survey SE tests for simple/event-study/group aggregation plus covariate-adjusted ipw/dr, or narrow/gate the documented support surface until the P1 math issues are fixed.

Review note: pytest is not installed in this environment, so I could not execute the added tests; this is a static review only.

Path to Approval

Fix the CallawaySantAnna survey WIF algebra in diff_diff/staggered_aggregation.py to match did::wif() and add a deterministic weighted-SE regression test for simple, event-study, and group aggregation.
Implement the missing DRDID-style nuisance-estimation IF corrections in the Callaway survey ipw and dr covariate branches, or explicitly reject survey_design + covariates + {ipw, dr} at runtime until that is done.
Rebuild/re-resolve the survey design after always-treated filtering in TwoStageDiD, then add a test where filtered units remove an entire PSU or stratum and assert the updated df_survey/p-values.

igerber · 2026-03-23T01:26:10Z

/ai-review

github-actions · 2026-03-23T01:48:18Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 98d80c60f2fc5c4d9cb6220e28c092f03dcdc646

Overall Assessment

⚠️ Needs changes

This re-review leaves two unmitigated P1s in the new survey CallawaySantAnna analytical inference paths. The prior TwoStageDiD always-treated survey d.f. issue appears fixed in diff_diff/two_stage.py:290, and I did not find a new P1+ elsewhere in the changed survey code.

Executive Summary

The prior TwoStageDiD survey d.f./metadata bug after always-treated filtering looks fixed in diff_diff/two_stage.py:290.
I did not find a clear new P1 in the changed ImputationDiD or TripleDifference survey paths.
CallawaySantAnna survey aggregation SEs are still wrong: the survey WIF branch in diff_diff/staggered_aggregation.py:343 does not match the did::wif() algebra.
CallawaySantAnna survey + covariates still has incorrect analytical IF math for ipw and dr in diff_diff/staggered.py:1759 and diff_diff/staggered.py:2014, so covariate-adjusted survey SEs remain wrong.
The new Callaway tests at tests/test_survey_phase4.py:573, tests/test_survey_phase4.py:901, and tests/test_survey_phase4.py:955 are mostly smoke/invariance checks and do not pin the survey SE math to did/DRDID reference values. Static review only: pytest is not installed in this environment.

Methodology

Severity: P1 (unresolved prior finding). Impact: CallawaySantAnna survey aggregation SEs remain methodologically incorrect for aggregate="simple" and aggregate="event_study". The survey WIF branch in diff_diff/staggered_aggregation.py:343 builds indicator_diff as s_i * 1{G_i=g_k} - s_i * pg_k via weighted_pg_term, but the official did WIF uses weights.ind * 1{G_i=g_k} - pg_k. The registry note at docs/methodology/REGISTRY.md:419 documents survey-weighted group sizes, not this algebraic deviation. The affected aggregation call sites are diff_diff/staggered_aggregation.py:130 and diff_diff/staggered_aggregation.py:548. Concrete fix: reimplement the survey WIF exactly as in did::wif()/aggte and add deterministic regression tests for survey simple and event_study SEs.
Severity: P1 (unresolved prior finding). Impact: CallawaySantAnna survey + covariates analytical SEs remain incorrect for estimation_method="ipw" and "dr". In the survey IPW branch diff_diff/staggered.py:1759, the new PS correction uses a normalized control-only M2 term that does not match the DRDID panel IPW influence function. In the survey DR branch diff_diff/staggered.py:2014, the influence function stops at psi_treated/psi_control and omits the nuisance-estimation corrections from the DRDID panel formula. The note at docs/methodology/REGISTRY.md:420 only documents the per-cell TSL-vs-IF choice, so it does not mitigate this mismatch. Concrete fix: port the exact DRDID panel nuisance-estimation IF pieces into the survey ipw and dr covariate paths, or raise NotImplementedError for survey_design + covariates + {"ipw","dr"} until that implementation is complete; then add reference-based per-cell and aggregated SE tests.

Code Quality

No additional code-quality findings beyond the inference bugs above.

Performance

No new performance blockers identified in the changed survey paths.

Maintainability

Severity: P2 (unresolved prior finding). Impact: the new survey capability is still unavailable through the imputation_did() and two_stage_did() convenience APIs, because both wrappers only forward constructor kwargs and never pass survey_design into fit() at diff_diff/imputation.py:1888 and diff_diff/two_stage.py:1608. Concrete fix: add explicit survey_design=None parameters to both wrappers, thread them through to fit(), and add wrapper-parity tests.

Tech Debt

Severity: P3. Impact: the per-cell CallawaySantAnna survey TSL limitation is properly documented in docs/methodology/REGISTRY.md:420 and tracked in TODO.md:55, so it is not a blocker under this rubric. This tracking does not mitigate the P1 aggregation WIF or survey+covariates IF defects above. Concrete fix: none required for approval; keep the registry/TODO note synchronized if the implementation changes.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the new CallawaySantAnna survey tests only check metadata, p-value differences, and scale invariance at tests/test_survey_phase4.py:573, tests/test_survey_phase4.py:901, and tests/test_survey_phase4.py:955. Within the Callaway survey test block starting at tests/test_survey_phase4.py:496, I did not find a CallawaySantAnna(..., covariates=[...], survey_design=...) case. The roadmap still advertises survey support as “Full (analytical)” at docs/survey-roadmap.md:49. Concrete fix: add fixed-dataset regression tests against did/DRDID for survey simple/event_study SEs and covariate-adjusted survey ipw/dr, or narrow the roadmap wording until those paths match the reference implementations.

Path to Approval

Fix the CallawaySantAnna survey WIF algebra in diff_diff/staggered_aggregation.py to match did::wif() exactly, then add deterministic regression tests for survey aggregate="simple" and aggregate="event_study" SEs.
Implement the full DRDID panel nuisance-estimation influence-function corrections for survey CallawaySantAnna with covariates in the ipw and dr paths, or explicitly block survey_design + covariates + {"ipw","dr"} with NotImplementedError until that work is done.
After those fixes, add reference-based tests and/or narrow docs/survey-roadmap.md:49 so the documented support surface matches the validated methodology.

…, CallawaySantAnna) - Weighted solve_logit(): survey weights enter IRLS as w_survey * mu*(1-mu) - ImputationDiD: weighted iterative FE, survey-weighted ATT aggregation, weighted conservative variance (Theorem 3), survey df for inference - TwoStageDiD: weighted iterative FE, weighted Stage 2 OLS, weighted GMM sandwich variance with survey weights in both stages - CallawaySantAnna: survey-weighted regression, IPW (via weighted solve_logit), and DR methods with explicit influence functions; survey-weighted WIF in aggregation; Cholesky cache bypassed under survey weights - Unblock TripleDifference IPW/DR with survey (weighted solve_logit now available) - 38 new tests in test_survey_phase4.py covering all estimators + scale invariance - Update survey-roadmap.md, REGISTRY.md with Phase 4 status and deviation notes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…e-existing Hausman bug, add inference tests - Add REGISTRY.md deviation note: CS per-cell ATT(g,t) SEs use influence-function variance (matching R's did), not full TSL with strata/PSU/FPC - Add TODO entries: CS per-cell TSL SE (Medium), Hausman stale n_cl from #230 (Medium) - Add 3 CS survey inference validation tests (scale invariance, per-cell ATT change, survey df effect) - Remove unused survey_weight_type in ImputationDiD Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…, CS SE formula - P0: Thread survey_weights through TripleDifference IPW and DR call chains (_ipw_estimation, _doubly_robust, _compute_did_rc_ipw, _compute_did_rc_dr). Survey weights now enter Riesz representers for weighted Hajek averages. - P1: Fix CallawaySantAnna no-covariate survey SE to derive from sum(IF^2) instead of sum(w_norm * (y-mean)^2). All 4 locations now consistent with stored influence functions. - P1: Update REGISTRY.md TripleDifference entry to reflect full survey support (was still marked as "IPW/DR deferred"). - P2: Add behavioral tests for TripleDiff IPW/DR survey: non-uniform weights change ATT, uniform weights match unweighted. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… always-treated survey - ImputationDiD: use survey-weighted untreated sums (not raw counts) in Theorem 3 v_it FE-only path, and weighted normal equations (A_0'W_0 A_0) in covariate path - TripleDifference: weight PS Hessian, score correction, and DR OLS linear representations with survey weights for covariate-adjusted IPW/DR IF corrections; update fit() docstring to reflect full survey support - TwoStageDiD: subset survey_weights and resolved_survey arrays after always-treated unit exclusion to prevent length mismatch - Add tests: TwoStageDiD always-treated + survey, DDD covariate IPW/DR + survey Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…riate nuisance IF - TripleDifference: remove double-weighting in IPW/DR moment corrections — since Riesz representers already incorporate survey weights, moment means use np.mean() not np.average(weights=). Removed _wmean_ax0 helper. - CallawaySantAnna WIF: apply s_i symmetrically to both indicator and pg terms in the weighted share estimator IF. Normalize by total_weight (sum of survey weights) instead of n_units. - CallawaySantAnna outcome regression covariate IF: add weighted regression nuisance IF correction (asymptotic linear representation of beta from WLS, projected onto weighted treated covariate mean). IPW and DR IFs unchanged (IPW matches unweighted structure; DR is self-correcting per Theorem 3.1). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ge n_psu - CallawaySantAnna WIF: remove inner /total_weight from indicator_diff — the final psi_wif/total_weight handles normalization once, matching R's did::wif() - CallawaySantAnna IPW covariate: add propensity score nuisance IF correction (survey-weighted Hessian, score, M2 gradient) so per-cell and aggregated SEs account for PS estimation uncertainty - TwoStageDiD: recompute n_psu/n_strata after always-treated filtering via np.unique() on subsetted arrays, then recompute survey_metadata Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…es+survey - CallawaySantAnna WIF: use (w_i * 1{G=g} - pg) matching R's did::wif() formula, not (w_i * 1{G=g} - w_i * pg). pg is already the population-level expected value. - Gate CallawaySantAnna covariates + survey + IPW/DR with NotImplementedError (DRDID panel nuisance IF not yet implemented). Regression with covariates works. - Narrow roadmap/REGISTRY claims to match actual support boundary - Add TODO entry for DRDID nuisance IF - Add 3 tests: IPW/DR+cov+survey raises, reg+cov+survey works Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Remove spurious /n_c from regression nuisance IF correction in _outcome_regression survey covariate branch. asy_lin_rep_reg is already per-observation, so dividing by n_c double-scaled. - Add test_reg_covariates_survey_se_scale_invariance: verifies ATT and SE are invariant to weight rescaling for reg+covariates+survey. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-23T10:37:10Z

/ai-review

github-actions · 2026-03-23T10:44:08Z

🔁 AI review rerun (requested by @igerber)

Head SHA: d8f652077d38953d0e4171ced260d9bb2e617e21

Overall Assessment

⚠️ Needs changes

Executive Summary

The prior CallawaySantAnna survey-review blockers appear addressed: the survey WIF algebra was rewritten, and survey + covariates + {ipw, dr} is now explicitly blocked and documented.
P1: ImputationDiD’s new survey-weighted Theorem 3 variance path is not the weighted Step 1 projection implied by WLS. The untreated influence weights omit the observation-level survey-weight factor, so survey SEs are wrong when weights are heterogeneous.
P1: the new survey_design fit parameter was not propagated to the public imputation_did() and two_stage_did() wrappers, so those APIs still cannot use the advertised survey support.
P2: the new Phase 4 tests cover direct .fit() smoke/invariance paths, but they do not cover the broken wrappers or any ImputationDiD(..., covariates=..., survey_design=...) analytical variance path.

Methodology

Severity: P1. Impact: ImputationDiD’s survey-weighted conservative variance does not match the weighted Step 1 estimator from Borusyak-Jaravel-Spiess Theorem 3. In the FE-only branch, the code at diff_diff/imputation.py:L1080 only swaps weighted denominators into the unweighted closed form, so two untreated observations in the same unit/time cell but with different survey weights get the same v_it. In the covariate branch, the code at diff_diff/imputation.py:L1291 and diff_diff/imputation.py:L1302 solves (A_0' W A_0) z = A_1' w but then forms v_untreated = -A_0 z; the weighted WLS mapping implied by β̂=(A_0'WA_0)^{-1}A_0'Wy_0 requires the left-side W_0 factor as well. The registry now claims survey-weighted Theorem 3 support at docs/methodology/REGISTRY.md:L845, so this is an undocumented variance mismatch. Concrete fix: re-derive v_it from the weighted Step 1 normal equations, implement the WLS analogue v_0=-W_0A_0(A_0'W_0A_0)^{-1}A_1'w and the matching FE-only closed form, then add dense-reference tests on a tiny heterogeneous-weight panel with and without covariates.
Severity: P3. Impact: the prior CallawaySantAnna survey issues look mitigated rather than outstanding. The survey WIF branch now uses the w_i 1{G_i=g} - p_g form at diff_diff/staggered_aggregation.py:L343, and unsupported survey + covariates + {ipw, dr} paths are now blocked at diff_diff/staggered.py:L1202 and documented at docs/methodology/REGISTRY.md:L419, docs/methodology/REGISTRY.md:L420, TODO.md:L55, and TODO.md:L56. Concrete fix: none required for approval.

Code Quality

Severity: P1. Impact: the public convenience APIs still drop the new survey parameter. imputation_did() and two_stage_did() were not updated when fit(..., survey_design=...) was added, so both wrappers instantiate the estimator with **kwargs at diff_diff/imputation.py:L1938 and diff_diff/two_stage.py:L1658, then call fit() without survey_design at diff_diff/imputation.py:L1939 and diff_diff/two_stage.py:L1659. Their constructors at diff_diff/imputation.py:L126 and diff_diff/two_stage.py:L132 do not accept survey_design, so wrapper callers cannot reach the new feature. Concrete fix: add explicit survey_design=None parameters to both wrapper signatures, document them, and thread them through to .fit().

Performance

No findings.

Maintainability

No additional findings beyond the broken wrapper propagation above.

Tech Debt

Severity: P3. Impact: the deferred CallawaySantAnna survey limitations are properly tracked in TODO.md:L55 and TODO.md:L56, so they are not blockers under this rubric. Concrete fix: none required for approval; keep the TODO and registry notes synchronized if the support surface changes again.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the new survey tests for ImputationDiD and TwoStageDiD at tests/test_survey_phase4.py:L192 and tests/test_survey_phase4.py:L341 only exercise direct .fit() usage and no covariates= cases, so they miss both the broken wrappers and the weighted ImputationDiD variance path. The only new survey+covariate coverage added here is for CallawaySantAnna at tests/test_survey_phase4.py:L616. Concrete fix: add wrapper-parity tests for imputation_did() and two_stage_did(), plus a fixed-data regression test for ImputationDiD(..., covariates=[...], survey_design=...) that checks the analytical SE against a dense weighted-WLS reference.

Path to Approval

Fix ImputationDiD’s survey-weighted Theorem 3 implementation so the untreated influence weights come from the weighted Step 1 WLS projection in both the FE-only and covariate branches.
Add survey_design parameters to imputation_did() and two_stage_did(), forward them to .fit(), and verify wrapper parity against direct estimator usage.
Add targeted tests for ImputationDiD survey + covariates SEs and for both convenience wrappers using survey_design.

…ce wrappers - ImputationDiD FE-only v_it: multiply by per-obs survey weight (WLS projection requires observation-level weight factor, not just weighted denominators) - ImputationDiD covariate v_it: v_0 = -diag(sw_0) @ A_0 @ z (WLS projection includes left-side W_0 factor) - Add survey_design parameter to imputation_did() and two_stage_did() convenience wrappers, forwarded to .fit() Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…iD survey - ImputationDiD: SE scale-invariance test for FE-only and covariate paths - ImputationDiD: wrapper parity test (imputation_did with survey_design) - TwoStageDiD: wrapper parity test (two_stage_did with survey_design) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-23T11:02:27Z

/ai-review

github-actions · 2026-03-23T11:15:55Z

🔁 AI review rerun (requested by @igerber)

Head SHA: ec4dda3733102826e1e05ecb32002fb0fa937738

Overall Assessment

⚠️ Needs changes

Executive Summary

The prior ImputationDiD blockers from the last review look resolved: the weighted Theorem 3 code path now carries the untreated survey-weight factor, and both imputation_did() and two_stage_did() now forward survey_design.
P1: CallawaySantAnna(estimation_method="reg") with survey_design and covariates still computes the survey influence function incorrectly, so the newly advertised analytical SEs/CIs for that branch are not reliable.
P1: TripleDifference survey IPW/DR inference double-applies survey weights by building weighted pairwise IFs and then feeding them into the Taylor-linearization helper that weights residuals again.
[Newly identified] P2: ImputationDiD and TwoStageDiD still emit unweighted treatment_effects["weight"] columns under survey mode, so those result objects no longer reproduce the estimand they report.
The documented CallawaySantAnna per-cell survey-SE deviation is properly tracked in the registry/TODO, so that part is informational rather than blocking.

Static review only: I could not execute the test suite in this sandbox because pytest is unavailable and the Python environment is missing the scientific stack.

Methodology

Severity: P1. Impact: the new survey+covariates regression branch in CallawaySantAnna does not match the panel OR influence-function construction it is supposed to emulate. In the implementation at diff_diff/staggered.py:L1605, the control-side term is coded as -sw_c_norm * resid_c + inf_control_reg_corr. The DRDID panel OR reference instead forms a treated-weighted control component from the predicted counterfactual, adds the OLS-estimation term to that same control component, and then subtracts the whole control component from treated influence. That means the new code drops the treated-side control term and gives the regression-estimation correction the wrong sign; in the intercept-only special case, the two control-side pieces cancel exactly, which cannot be correct. This makes per-cell SEs and all aggregated survey SEs wrong for the newly enabled survey_design + covariates + estimation_method="reg" path. Concrete fix: port the panel OR IF from the DRDID reg_did_panel() construction for ΔY = Y_t - Y_base, or gate this combination with NotImplementedError until that exact IF is implemented and validated. (rdrr.io)
Severity: P1. Impact: the new survey IPW/DR TripleDifference path weights the influence function twice. _compute_did_rc_ipw() and _compute_did_rc_dr() multiply survey weights directly into the Riesz/score objects at diff_diff/triple_diff.py:L1288 and diff_diff/triple_diff.py:L1471, but the combined IF is then passed to compute_survey_vcov() at diff_diff/triple_diff.py:L1078, whose pweight path multiplies residuals by resolved.weights again at diff_diff/survey.py:L553. The DRDID repeated-cross-section IPW/DR references already normalize i.weights, build weighted IF components, and compute SEs from the finished IF itself, so this extra weighting layer distorts survey SEs/p-values/CIs for TripleDifference(estimation_method in {"ipw","dr"}) with nonuniform weights. Concrete fix: keep inf_func on the unweighted linearized-variable scale and let compute_survey_vcov() apply survey weights once, or change the survey-vcov call to accept preweighted scores so the second multiplication disappears. (rdrr.io)
Severity: P3. Impact: the CallawaySantAnna per-cell survey-SE deviation from full design-based TSL is now explicitly documented in docs/methodology/REGISTRY.md:L419 and tracked in TODO.md:L55, so under the stated review rules it is mitigated and not a blocker. Concrete fix: none required for approval.

Code Quality

Severity: P2. Impact: [Newly identified] under survey mode, both diff_diff/imputation.py:L531 and diff_diff/two_stage.py:L527 still write treatment_effects["weight"] = 1/n_valid. After this PR, overall_att is survey-weighted in both estimators, so the public results object no longer reproduces the reported estimand if a user aggregates tau_hat by the provided weight column. Concrete fix: when survey_design is active, store normalized treated survey weights for finite tau_hat and add a contract test that weighting tau_hat by that column reproduces overall_att.

Performance

No findings.

Maintainability

No findings beyond the result-object weight mismatch above.

Tech Debt

Severity: P3. Impact: the new TODO entries correctly track the documented CallawaySantAnna survey-SE limitation and the pre-existing EfficientDiD Hausman issue at TODO.md:L55 and TODO.md:L57, so they are nonblocking under the deferred-work policy. Concrete fix: none required for approval.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the new survey tests never assert a reference analytical SE for the two broken branches. CallawaySantAnna survey+covariates+reg is only smoke/scale-invariance tested at tests/test_survey_phase4.py:L751, and TripleDifference survey IPW/DR is only tested for “works”/finiteness at tests/test_survey_phase4.py:L879. There is also no test that treatment_effects["weight"] reproduces overall_att under survey mode for ImputationDiD or TwoStageDiD. Concrete fix: add deterministic reference-SE tests for those two survey branches, plus one results-object test per estimator verifying the stored treatment-effect weights reproduce the reported weighted ATT.
Severity: P3. Impact: survey_design was added to ImputationDiD.fit, TwoStageDiD.fit, and both convenience wrappers, but the new parameter is still missing from the corresponding docstring parameter lists at diff_diff/imputation.py:L185, diff_diff/two_stage.py:L181, diff_diff/imputation.py:L1911, and diff_diff/two_stage.py:L1625. Concrete fix: document survey_design and its bootstrap limitation in those four docstrings.

Path to Approval

Replace the CallawaySantAnna survey+covariates regression IF in diff_diff/staggered.py:L1605 with the exact panel OR IF structure from the DRDID reference, or gate that combination with NotImplementedError until the exact IF is implemented.
Fix TripleDifference survey IPW/DR inference so survey weights enter the IF/TSL pipeline exactly once; the cleanest options are either an unweighted linearized variable passed to compute_survey_vcov() or a preweighted-score path that bypasses the extra multiplication in diff_diff/survey.py:L553.
Add deterministic SE regression tests for both branches above, using nonuniform survey weights and asserting the analytical SE against a hand-worked or ported reference calculation.

- TripleDifference: divide out survey weights from IF before passing to compute_survey_vcov, since Riesz representers already incorporate weights and TSL would multiply by weights again - CallawaySantAnna _outcome_regression: rewrite survey covariate IF to follow DRDID panel OR structure — all terms consistently scaled by 1/sw_t_sum, nuisance correction divided by sw_t_sum for correct normalization Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-23T11:30:33Z

/ai-review

github-actions · 2026-03-23T11:38:53Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 6165bc4d24df9a6c202d3f50d6840b56e496ecdc

Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Executive Summary

The prior TripleDifference survey IPW/DR blocker looks resolved: the combined IF is now de-weighted before TSL, and the unsupported CallawaySantAnna survey+covariates+IPW/DR branch is correctly gated and documented in REGISTRY.md:419, REGISTRY.md:420, TODO.md:55, and TODO.md:56.
P1: the newly supported CallawaySantAnna(estimation_method="reg", survey_design=..., covariates=[...]) branch still assembles the control-side OR nuisance IF incorrectly, so its analytical SEs/CIs are not reliable.
P2: ImputationDiD and TwoStageDiD still return unweighted treatment_effects["weight"] columns under survey mode, so the public results objects no longer reconstruct the survey-weighted ATT they report.
P2: the new survey tests still do not pin the supported CS regression+covariates survey branch to a reduction/reference case, and there is no contract test for the new survey-weighted treatment_effects["weight"] behavior.
P3: the documented CS survey deviations/deferred work are properly tracked and are nonblocking.
Static review only: I could not run the suite in this sandbox because the available Python environment is missing numpy.

Methodology

Severity: P1. Impact: In _outcome_regression() the survey+covariates regression branch claims to implement the Sant'Anna-Zhao OR influence function, but the control-side correction is still mis-scaled at diff_diff/staggered.py:1621 and diff_diff/staggered.py:1640. X_treated_mean_w is already a treated-weighted mean at diff_diff/staggered.py:1637, and asy_lin_rep_reg is already on the per-observation WLS score scale at diff_diff/staggered.py:1634, so dividing again by sw_t_sum shrinks the nuisance correction by an extra treated-weight factor. In the constant/no-information-covariate reduction, this branch does not collapse back to the survey no-covariate regression estimator, which is a direct methodology failure for a now-advertised supported path. That IF is then stored for downstream aggregation at diff_diff/staggered.py:601 and used in aggregation SEs at diff_diff/staggered_aggregation.py:129, so the problem affects both per-cell and aggregated survey inference. Concrete fix: replace the hand-assembled survey OR IF with the exact panel OR/WLS nuisance-correction formula from the cited method, and verify the constant-covariate reduction explicitly.
Severity: P3. Impact: The per-cell survey-SE deviation and the survey+covariates+IPW/DR gate are explicitly documented in REGISTRY.md:419, REGISTRY.md:420, TODO.md:55, and TODO.md:56, so those deviations are mitigated and nonblocking. Concrete fix: none.

Code Quality

Severity: P2. Impact: Under survey mode, ImputationDiD.fit() still writes uniform 1/n_valid weights to treatment_effects["weight"] at diff_diff/imputation.py:538, and TwoStageDiD.fit() does the same at diff_diff/two_stage.py:533, even though the reported ATT is survey-weighted. Users who aggregate tau_hat by the returned weight column will get a different estimand than overall_att, and that now contradicts the registry note at REGISTRY.md:842. Concrete fix: when survey_design is active, populate weight with normalized treated survey weights over finite treatment effects and keep zero for NaNs.

Performance
No findings.

Maintainability
No additional findings.

Tech Debt

Severity: P3. Impact: The new TODO entries at TODO.md:55, TODO.md:56, and TODO.md:57 correctly track deferred survey methodology limitations and the pre-existing EfficientDiD Hausman issue, so they are nonblocking under the stated policy. Concrete fix: none.

Security
No findings.

Documentation/Tests

Severity: P2. Impact: The new survey suite still does not pin the supported CallawaySantAnna survey+covariates+reg branch to a reduction/reference case: tests/test_survey_phase4.py:751 only checks that the branch runs, and tests/test_survey_phase4.py:766 only checks scale invariance. That left the remaining IF bug undetected. There is also no survey-mode contract test that treatment_effects["weight"] reconstructs overall_att for ImputationDiD or TwoStageDiD. Concrete fix: add a reduction test where a constant/all-zero covariate reproduces the no-covariate survey regression ATT/SE, and add one result-object reconstruction test per estimator for the returned weight column.
Severity: P3. Impact: survey_design was added to the fit() and wrapper signatures for both panel estimators, but the public docstrings still stop at balance_e in diff_diff/imputation.py:185, diff_diff/imputation.py:1911, diff_diff/two_stage.py:181, and diff_diff/two_stage.py:1625. Concrete fix: document survey_design and the bootstrap limitation in those four docstrings.

Path to Approval

Fix CallawaySantAnna._outcome_regression() for survey_design + covariates + estimation_method="reg" by porting the exact panel OR/WLS nuisance IF, rather than the current hand-assembled control correction.
Add a deterministic regression test for that branch in tests/test_survey_phase4.py, e.g. a constant/no-information-covariate reduction that asserts identical ATT/SE to the no-covariate survey regression path.

The DRDID panel nuisance IF corrections could not be validated against a reference implementation after multiple review cycles. Gate all covariates+ survey combinations (reg/IPW/DR) with NotImplementedError until the exact DRDID panel OR/IPW/DR IF formulas are implemented and validated. No-covariate survey support remains fully functional for all methods. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The survey-weighted OR IF now mirrors the unweighted code's structure exactly: - treated: (w_i / sum(w_t)) * (resid_i - ATT) - control: -(w_i / sum(w_t)) * wls_resid_i WLS residuals are orthogonal to W*X by construction, so the regression nuisance IF correction is implicit (same as the unweighted case where inf_control = -residuals/n_c with no explicit correction). This replaces the explicit nuisance correction that was repeatedly flagged by the AI reviewer for incorrect scaling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…tion tests - Add REGISTRY.md note: CS reg+covariates survey SE uses conservative plug-in IF (WLS residuals, no semiparametric nuisance correction). SEs may be wider than DRDID's efficient IF but have correct asymptotic coverage. Matches unweighted code's approach. Efficient IF deferred to future work. - Add aggregate="group" and aggregate="all" survey tests for ImputationDiD and TwoStageDiD to cover all aggregation method interactions with survey_design. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-03-23T14:20:49Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 00476e2c2461ac74fa01fa1c247b4dc9a843f62c

Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Static review only; pytest is not installed in this environment, so I could not execute the new survey tests.

Executive Summary

The previous CallawaySantAnna survey blocker is largely resolved: unsupported strata/PSU/FPC inputs are now rejected at runtime, and that limitation is documented/tracked.
P1: ImputationDiD and TwoStageDiD still accept full SurveyDesigns, but strata/fpc do not enter the actual SE formula. Those inputs are effectively reduced to weighted estimation plus PSU clustering / survey d.f., which is an undocumented variance deviation for accepted inputs.
P2: the new public solve_logit(weights=...) path still skips the shared weight validation used elsewhere in linalg.py, so direct callers can hit opaque numerical failures.
Public docs are still not fully aligned: CallawaySantAnna.fit() says strata/PSU/FPC are supported, while the code now raises on them; ImputationDiD and TwoStageDiD added survey_design but did not document it in their public docstrings/wrappers.
The new TODO.md entries for deferred CallawaySantAnna survey limitations are properly tracked and nonblocking under project policy.

Methodology

Severity P1. Impact: Affected methods are survey-enabled ImputationDiD and survey-enabled TwoStageDiD. Both estimators accept full survey designs, but after resolution they only use the survey object to choose an effective PSU/cluster and to pull df_survey for inference; the actual Theorem 3 / GMM variance paths never consume stratification or FPC. That is inconsistent with the library’s own survey-variance contract, where compute_survey_vcov() explicitly treats strata and FPC as variance inputs. The new tests also encode full-design acceptance without checking any design-aware SE change. References: diff_diff/imputation.py:L241-L369, diff_diff/imputation.py:L489-L490, diff_diff/two_stage.py:L237-L405, diff_diff/two_stage.py:L464-L465, diff_diff/survey.py:L520-L679, tests/test_survey_phase4.py:L245-L274, tests/test_survey_phase4.py:L504-L531. Concrete fix: either reject SurveyDesigns with strata/fpc in these estimators until a design-aware variance path exists, or route the relevant influence-function / score objects through compute_survey_vcov() so stratification/FPC actually affect SE magnitude.
Severity P3. Impact: The previous CallawaySantAnna aggregation-SE blocker is now mitigated. Unsupported full-design survey inputs are rejected at runtime, and the limitation is documented in the registry, roadmap, and TODO.md. References: diff_diff/staggered.py:L1191-L1210, docs/methodology/REGISTRY.md:L419-L421, TODO.md:L55-L56, docs/survey-roadmap.md:L43-L49. Concrete fix: none for this PR.

Code Quality

Severity P2. Impact: solve_logit() now exposes a public weights argument but never validates length, finiteness, or sign before forming sqrt(weights * mu * (1-mu)). The rest of linalg.py already has _validate_weights() for exactly this contract, so direct callers of solve_logit(weights=...) can get broadcasting errors, nans, or opaque linear-algebra failures instead of explicit validation errors. References: diff_diff/linalg.py:L389-L405, diff_diff/linalg.py:L1108-L1215, tests/test_survey_phase4.py:L95-L184. Concrete fix: call _validate_weights(weights, "pweight", n) at the top of solve_logit() and add direct negative tests for length mismatch, NaN, Inf, and negative weights.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: The PR correctly adds deferred-work tracking for the remaining CallawaySantAnna survey limitations, so those gaps are nonblocking under the repo’s review policy. References: TODO.md:L55-L56. Concrete fix: none for this PR.

Security

No findings.

Documentation/Tests

Severity P3. Impact: CallawaySantAnna.fit() still documents survey_design as supporting “weights, strata, PSU, and FPC,” but the implementation now raises NotImplementedError for strata/PSU/FPC. The registry and roadmap were updated, so the remaining mismatch is the public docstring. References: diff_diff/staggered.py:L1157-L1210, docs/methodology/REGISTRY.md:L419-L421, docs/survey-roadmap.md:L43-L49. Concrete fix: change the docstring to weights-only support and mention the explicit runtime gate for strata/PSU/FPC.
Severity P3. Impact: The new public survey_design parameter is still undocumented in the ImputationDiD / TwoStageDiD estimator and wrapper docstrings, and the Imputation registry still says treatment_effects["weight"] is always 1/n_valid even though survey mode now stores normalized treated survey weights. References: diff_diff/imputation.py:L175-L215, diff_diff/imputation.py:L1918-L1965, diff_diff/two_stage.py:L171-L210, diff_diff/two_stage.py:L1628-L1660, diff_diff/imputation.py:L544-L555, docs/methodology/REGISTRY.md:L843-L846. Concrete fix: document survey_design, the pweight-only contract, the analytical-only limitation, and the survey-weighted treatment_effects["weight"] semantics.
Severity P3. Impact: Phase 4 tests cover happy-path survey support, but they do not add estimator-specific aweight/fweight rejection coverage for the new estimators or direct invalid-input coverage for solve_logit(weights=...). References: tests/test_survey_phase3.py:L331-L358, tests/test_survey_phase4.py:L95-L1275. Concrete fix: add negative tests for SurveyDesign(weight_type="aweight") / "fweight" on ImputationDiD, TwoStageDiD, CallawaySantAnna, and TripleDifference, plus direct solve_logit() tests for invalid weight arrays.

Path to Approval

Make the survey contract explicit for ImputationDiD and TwoStageDiD: either reject SurveyDesigns with strata/fpc at fit time, or implement a design-aware SE path where those fields change the variance magnitude.
Add regression tests for the chosen behavior: if rejecting, assert SurveyDesign(weights="weight", strata="stratum", psu="psu") and one fpc case raise in both estimators; if supporting, assert at least one SE magnitude changes relative to weights-only input, not just p_value through df_survey.
Align the public docs with the chosen contract: fix the CallawaySantAnna survey docstring, add survey_design docs to the ImputationDiD / TwoStageDiD estimator and wrapper docstrings, and update the Imputation registry note for treatment_effects["weight"].

…contract - ImputationDiD/TwoStageDiD: reject FPC (finite population correction) in SurveyDesign with NotImplementedError. PSU is used for cluster-robust variance, strata enters survey df for t-distribution. FPC not yet implemented in Theorem 3 / GMM variance. - REGISTRY.md: document exact survey contract — PSU as cluster variable, strata for df only (no stratified sandwich), FPC rejected. This is conservative relative to full stratified design-based variance. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-23T14:32:28Z

/ai-review

github-actions · 2026-03-23T14:47:12Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 869b85a8a7e887cfdafa7a0805319f612d75a652

Overall Assessment

✅ Looks good

Highest unmitigated severity: P2.

Static review only: pytest is not installed in this environment, so I could not execute tests/test_survey_phase4.py.

Executive Summary

The prior P1 methodology blocker is resolved under repo policy: the survey-variance deviations for ImputationDiD and TwoStageDiD are now explicitly documented in the Methodology Registry, so they are informational rather than blocking. See docs/methodology/REGISTRY.md:L846 and docs/methodology/REGISTRY.md:L924.
CallawaySantAnna now correctly gates unsupported strata/PSU/FPC inputs at runtime, and that limitation is both documented and tracked. See diff_diff/staggered.py:L1200, docs/methodology/REGISTRY.md:L419, docs/methodology/REGISTRY.md:L421, and TODO.md:L55.
I did not find an unmitigated P0/P1 in the changed estimator math, weighting, variance, or default behavior.
The main remaining issue is that public solve_logit(weights=...) still bypasses the shared weight-validation helper, leaving a public API path with opaque failure modes on bad weights. See diff_diff/linalg.py:L389 and diff_diff/linalg.py:L1108.
Public docs still overstate or omit survey-support details for CallawaySantAnna, ImputationDiD, and TwoStageDiD, and the Imputation registry note for treatment_effects["weight"] is now stale. See diff_diff/staggered.py:L1157, diff_diff/imputation.py:L182, diff_diff/imputation.py:L1919, diff_diff/two_stage.py:L178, diff_diff/two_stage.py:L1631, and docs/methodology/REGISTRY.md:L843.
Phase 4 tests are broad on happy paths, but still light on negative coverage for the new survey restrictions and invalid direct solve_logit() inputs. See tests/test_survey_phase4.py:L95, tests/test_survey_phase4.py:L192, and tests/test_survey_phase3.py:L331.

Methodology

Severity P3. Impact: Affected methods are ImputationDiD, TwoStageDiD, CallawaySantAnna, and the weighted solve_logit() path that also unblocks survey IPW/DR in TripleDifference. Cross-checking the PR against the cited method families and the registry, I do not see an unmitigated undocumented mathematical or identification deviation. The earlier survey-variance concern for ImputationDiD/TwoStageDiD is now explicitly documented in docs/methodology/REGISTRY.md:L846 and docs/methodology/REGISTRY.md:L924, and CallawaySantAnna now rejects unsupported full survey designs in diff_diff/staggered.py:L1200 with matching notes in docs/methodology/REGISTRY.md:L419, docs/methodology/REGISTRY.md:L421, TODO.md:L55, and TODO.md:L56. Under the stated review policy, those are informational rather than blocking. Concrete fix: none.

Code Quality

Severity P2. Impact: the new public weighted-logit entry point in diff_diff/linalg.py:L1108 does not call the shared _validate_weights() helper in diff_diff/linalg.py:L389 before forming weights * mu * (1 - mu). Internal survey callers are protected because SurveyDesign.resolve() enforces finite positive weights, but direct callers can still hit shape-dependent broadcasting, NaN/Inf propagation, or opaque least-squares failures instead of a clear input error. Concrete fix: call _validate_weights(weights, "pweight", n) at the top of solve_logit() and add direct negative tests for length mismatch, NaN, Inf, and negative weights.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: the remaining survey limitations for CallawaySantAnna are properly tracked in TODO.md:L55 and TODO.md:L56, and the PR also records the pre-existing EfficientDiD stale-n_cl issue in TODO.md:L57. That is the correct disposition under the repo’s deferred-work policy and does not block approval. Concrete fix: none for this PR.

Security

No findings.

Documentation/Tests

Severity P3. Impact: CallawaySantAnna.fit() still documents survey_design as supporting “weights, strata, PSU, and FPC” in diff_diff/staggered.py:L1157, but the implementation now raises NotImplementedError for strata/PSU/FPC in diff_diff/staggered.py:L1200. Concrete fix: change the docstring to weights-only survey support and mention the explicit runtime gate and the covariates + IPW/DR survey restriction.
Severity P3. Impact: the new survey_design parameter is still undocumented in the public ImputationDiD and TwoStageDiD fit/wrapper docstrings at diff_diff/imputation.py:L182, diff_diff/imputation.py:L1919, diff_diff/two_stage.py:L178, and diff_diff/two_stage.py:L1631. The Imputation registry note also still says treatment_effects["weight"] is always 1/n_valid in docs/methodology/REGISTRY.md:L843, but survey mode now writes normalized treated survey weights in diff_diff/imputation.py:L549. Concrete fix: document survey_design, the pweight-only / analytical-only / no-FPC contract, and update the registry note for survey-weighted treatment_effects["weight"].
Severity P3. Impact: Phase 4 adds broad happy-path coverage for survey support, but it still lacks direct negative coverage for invalid solve_logit(weights=...) inputs and does not add estimator-specific rejection tests for the new survey restrictions on ImputationDiD/TwoStageDiD. The new test file covers weighted-logit happy paths and positive survey flows in tests/test_survey_phase4.py:L95, tests/test_survey_phase4.py:L192, and tests/test_survey_phase4.py:L453, while the earlier explicit rejection pattern exists in tests/test_survey_phase3.py:L331 and the only Phase 4 full-design rejection cases are CallawaySantAnna-specific in tests/test_survey_phase4.py:L736 and tests/test_survey_phase4.py:L1181. Concrete fix: add negative tests for SurveyDesign(weights="weight", weight_type="aweight"/"fweight") on the newly survey-enabled estimators, add FPC rejection tests for ImputationDiD and TwoStageDiD, and add direct invalid-weight tests for solve_logit().

P2: solve_logit() now validates weights (shape, NaN, Inf, positive) before IRLS, giving clear errors instead of opaque numerical failures. P3 docs: update CallawaySantAnna fit() docstring to weights-only contract; add survey_design param docs to ImputationDiD/TwoStageDiD fit() and wrapper docstrings; update REGISTRY treatment_effects["weight"] note for survey mode. P3 tests: add negative tests for solve_logit bad weights (NaN, negative, wrong shape); add aweight/fweight rejection and FPC rejection tests for ImputationDiD and TwoStageDiD. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-23T14:57:03Z

/ai-review

github-actions · 2026-03-23T15:10:04Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 133df1974f64f7ad22aa1f6fbcdd7895e274a138

Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Static review only; I did not execute pytest in this environment.

Executive Summary

The prior weighted-solve_logit() concern looks resolved: the new public weights= path now validates shape, NaN/Inf, and positivity, and the PR adds direct negative tests for those cases in diff_diff/linalg.py:L1160-L1170 and tests/test_survey_phase4.py:L186-L215.
The previously contentious survey deviations for ImputationDiD, TwoStageDiD, and CallawaySantAnna are now explicitly documented in the Methodology Registry, so those specific deviations are informational rather than blocking under repo policy; see docs/methodology/REGISTRY.md:L419-L421, docs/methodology/REGISTRY.md:L846-L846, and docs/methodology/REGISTRY.md:L924-L924.
The blocker is the new CallawaySantAnna survey + cluster= interaction: the code now injects cluster as an effective PSU and uses it for df_survey/metadata, but the actual analytical SE formulas remain unit-level IF sums rather than cluster/PSU-aware survey variance; see diff_diff/staggered.py:L1289-L1324, diff_diff/staggered.py:L1625-L1677, diff_diff/staggered.py:L1779-L1833, diff_diff/staggered.py:L2034-L2055, and diff_diff/staggered_aggregation.py:L435-L456.
The new Callaway survey tests do not cover that cluster= interaction, and they also stop at the default/simple plus aggregate="event_study" paths, unlike the added group/all survey coverage for ImputationDiD and TwoStageDiD; see tests/test_survey_phase4.py:L447-L476, tests/test_survey_phase4.py:L656-L683, and tests/test_survey_phase4.py:L784-L990.
Minor documentation issue remains: TwoStageDiD survey docstrings still describe its variance as “Theorem 3 variance,” but the implementation and registry describe a GMM sandwich; see diff_diff/two_stage.py:L202-L207, diff_diff/two_stage.py:L1672-L1677, and docs/methodology/REGISTRY.md:L894-L925.

Methodology

Severity P1. Impact: CallawaySantAnna.fit(..., survey_design=SurveyDesign(weights=...), cluster="...") now accepts cluster and converts it into an effective PSU after the weights-only gate, which changes survey_metadata.n_psu / df_survey and therefore p-values/CIs, but the estimator still computes per-cell and aggregated SEs from unit-level IF formulas (sqrt(sum(psi^2))) instead of a cluster/PSU-aware survey variance. That means the changed survey+cluster path has inconsistent inference: clustering affects the reference distribution, not the variance meat. This is an incorrect variance/SE interaction on a changed methodology path, and TODO.md does not mitigate it because the code does not reject the combination. See diff_diff/staggered.py:L1157-L1161, diff_diff/staggered.py:L1181-L1213, diff_diff/staggered.py:L1289-L1324, diff_diff/staggered.py:L1625-L1677, diff_diff/staggered.py:L1779-L1833, diff_diff/staggered.py:L2034-L2055, diff_diff/staggered_aggregation.py:L435-L456, diff_diff/survey.py:L488-L517, diff_diff/survey.py:L295-L305, docs/methodology/REGISTRY.md:L419-L421, docs/methodology/REGISTRY.md:L1931-L1937, and TODO.md:L55-L56. Concrete fix: either reject cluster whenever CallawaySantAnna is run with survey_design, or implement the intended design-based path by routing the combined per-cell / aggregate IF (including WIF) through compute_survey_vcov() so injected PSUs change the variance meat as well as df_survey; if you keep support, document the exact contract in REGISTRY.md.
Severity P3. Impact: The survey deviations for CallawaySantAnna weights-only support, CallawaySantAnna reg+covariates SEs, ImputationDiD survey variance, and TwoStageDiD survey variance are now explicitly documented in the registry, so those deviations are mitigated under the project’s policy rather than defects. See docs/methodology/REGISTRY.md:L419-L421, docs/methodology/REGISTRY.md:L846-L846, and docs/methodology/REGISTRY.md:L924-L924. Concrete fix: none.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: The PR correctly records deferred, non-blocking survey limitations for CallawaySantAnna and the pre-existing EfficientDiD stale-n_cl issue in TODO.md, which is the right disposition for deferrable work under repo policy. See TODO.md:L55-L57. Concrete fix: none for this PR.

Security

No findings.

Documentation/Tests

Severity P1. Impact: The new survey_design path is not exercised across all changed CallawaySantAnna interactions in the Phase 4 suite. The file adds aggregate="group" / aggregate="all" survey coverage for ImputationDiD and TwoStageDiD, but the Callaway survey section stops at the default/simple flows and one aggregate="event_study" case, with no cluster= + survey regression test. That leaves changed parameter combinations unverified, and the broken survey+cluster variance path above is a concrete example of what slipped through. See tests/test_survey_phase4.py:L447-L476, tests/test_survey_phase4.py:L656-L683, and tests/test_survey_phase4.py:L784-L990. Concrete fix: add Callaway survey tests for cluster= + survey_design, plus aggregate="group" and aggregate="all" survey cases for supported estimation methods.
Severity P3. Impact: TwoStageDiD’s new survey docstrings still say PSU is used as the cluster variable for “Theorem 3 variance,” but TwoStageDiD implements a GMM sandwich, and the registry now documents it that way. This is a documentation mismatch, not a numerical bug. See diff_diff/two_stage.py:L202-L207, diff_diff/two_stage.py:L1672-L1677, and docs/methodology/REGISTRY.md:L894-L925. Concrete fix: update both docstrings to say PSU is used for the GMM sandwich variance, not Theorem 3.

Path to Approval

In CallawaySantAnna, either reject cluster when survey_design is active, or implement cluster/PSU-aware analytical survey SEs by sending the combined IF/WIF through compute_survey_vcov() for both ATT(g,t) and aggregation paths; keep REGISTRY.md aligned with whichever contract you choose.
Add a regression test for CallawaySantAnna(cluster="...", survey_design=SurveyDesign(weights="...")) that fails if clustering only changes survey_metadata.df_survey / p-values without entering the variance calculation itself.
Add survey interaction coverage for CallawaySantAnna(..., aggregate="group") and CallawaySantAnna(..., aggregate="all") so the new survey_design parameter is exercised across the aggregation code paths it now affects.

CallawaySantAnna uses weights-only survey (strata/PSU/FPC rejected), so cluster-as-PSU injection is unnecessary and misleading. The user's cluster= parameter is handled by the existing non-survey clustering path. Removed _inject_cluster_as_psu call and unused imports. Added aggregate="group" and aggregate="all" survey tests for CS. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-23T15:14:01Z

/ai-review

github-actions · 2026-03-23T15:22:02Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 85c371018c202b89ee258b8ede0d97e89d70d275

Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Static review only. I did not run pytest here because the local Python environment is missing numpy.

Executive Summary

The prior CallawaySantAnna survey+cluster/PSU concern looks resolved: this PR now explicitly rejects strata/PSU/FPC for CS survey support and documents the deferred full-design path in the registry/TODO.
The remaining blocker is new: CallawaySantAnna now feeds survey df from the long panel into unit-level ATT(g,t) and aggregation inference, which overstates degrees of freedom and makes p-values/CIs too optimistic.
ImputationDiD and TwoStageDiD survey threading is broadly aligned with the new registry notes, including effective-PSU injection for panel inference.
weighted solve_logit() and the TripleDifference IPW/DR unblocking look internally consistent and are documented in the registry.
Minor documentation issue remains: TwoStageDiD survey docstrings still call its variance “Theorem 3 variance,” but the implementation is a weighted GMM sandwich.

Methodology

Severity P1. Impact: In the new weights-only CallawaySantAnna survey path, survey metadata and df_survey are resolved on the long panel before the estimator collapses to one score per unit. The code then uses that row-level df in safe_inference / safe_inference_batch for ATT(g,t), overall ATT, event-study, and group aggregation. For panel data, that silently uses roughly n_rows - 1 instead of n_units - 1, so the new survey path reports p-values that are too small and CIs that are too narrow. This is especially clear because the same code also collapses survey weights to one per unit for estimation, and _validate_unit_constant_survey() is only needed for panel estimators that collapse survey info within unit. See diff_diff/survey.py:L295-L305, diff_diff/survey.py:L393-L430, diff_diff/staggered.py:L393-L414, diff_diff/staggered.py:L1181-L1412, diff_diff/staggered_aggregation.py:L558-L564, and diff_diff/staggered_aggregation.py:L646-L652. Concrete fix: after _validate_unit_constant_survey(), construct a unit-collapsed survey design/metadata for CS (one weight per unit, matching all_units) and thread that unit-level df_survey through all CS inference calls; do not reuse the long-panel ResolvedSurveyDesign.df_survey for this estimator.
Severity P3. Impact: The PR’s documented CS survey limitations are properly mitigated under repo policy: full design-based CS SEs are rejected at runtime, and survey+covariates for IPW/DR is explicitly gated and tracked. See diff_diff/staggered.py:L1200-L1234, REGISTRY.md:L419-L421, and TODO.md:L55-L56. Concrete fix: none.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: The added TODO entries correctly track the deferrable CS survey limitations, so those items are informational rather than blocking. See TODO.md:L55-L56. Concrete fix: none.

Security

No findings.

Documentation/Tests

Severity P3. Impact: TwoStageDiD’s new survey docstrings still say PSU is used for “Theorem 3 variance,” but the implementation and registry now describe a weighted GMM sandwich. That is a documentation mismatch, not a numerical bug. See diff_diff/two_stage.py:L202-L207, diff_diff/two_stage.py:L1356-L1513, diff_diff/two_stage.py:L1672-L1677, and REGISTRY.md:L924-L924. Concrete fix: replace “Theorem 3 variance” with “GMM sandwich variance” in both survey docstrings.
Severity P3. Impact: The new CS survey tests check metadata presence and scale invariance, but they do not assert the critical unit-level df_survey contract, which is why the P1 above is currently unguarded. See tests/test_survey_phase4.py:L803-L817 and tests/test_survey_phase4.py:L1227-L1304. Concrete fix: add a regression test asserting weights-only CS panel input reports df_survey == n_units - 1 and uses that same df in simple, event-study, and group inference.

Path to Approval

In CallawaySantAnna, recompute survey metadata and df_survey on the unit-collapsed design, not the long panel, and use that unit-level df in all CS safe_inference / safe_inference_batch calls.
Add a regression test for weights-only CS panel data that fails if survey_metadata.df_survey equals len(data) - 1 instead of n_units - 1, and cover aggregate="simple", aggregate="event_study", and aggregate="group".
Update the two TwoStageDiD survey docstrings to say PSU feeds the weighted GMM sandwich variance, not “Theorem 3 variance.”

CallawaySantAnna operates at unit level, so df_survey should be n_units-1 for weights-only designs, not n_obs-1 from the long panel. The panel-level df overstated degrees of freedom, making p-values/CIs too optimistic. - Precomputed and fit() both use len(all_units)-1 for df_survey - survey_metadata.df_survey overridden to unit-level - Added test_df_survey_is_unit_level asserting df == n_units-1 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-23T15:30:38Z

/ai-review

github-actions · 2026-03-23T15:44:12Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 3872b85126207d2cc370e3d7314453736e3c1789

Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Static review only. I couldn’t run the local test suite because pytest is not installed here, and the local Python environment also lacks numpy.

Executive Summary

The previous CallawaySantAnna survey-d.f. blocker looks resolved: the code now uses unit-level survey d.f. in the precomputed survey path and in aggregation inference (diff_diff/staggered.py:414, diff_diff/staggered.py:1328, diff_diff/staggered_aggregation.py:558, diff_diff/staggered_aggregation.py:646).
The registry/TODO updates correctly mitigate the intentional survey deviations and deferred CallawaySantAnna limitations; I did not find a new undocumented paper/registry mismatch in the changed estimator formulas.
The remaining blocker is an edge-case correctness issue in the new survey CallawaySantAnna regression-with-covariates path: rank-deficient covariates can now turn otherwise estimable ATT(g,t) cells into NaN, and those cells then get dropped from aggregation.
TripleDifference IPW/DR survey unblocking and the weighted solve_logit() change look internally consistent with the documented implementation choices.
TwoStageDiD’s survey docstrings still say “Theorem 3 variance”; that remains informational only.

Methodology

Severity P3. Impact: No unmitigated methodology defect found in the changed survey formulas themselves. The remaining survey-side departures are explicitly documented in the registry/TODO: CallawaySantAnna is still weights-only, the reg+covariates per-cell SE is explicitly marked as a conservative deviation, and the weighted-logit / WIF / did2s-GMM pieces are implementation choices rather than undocumented source-material mismatches. Concrete fix: none. (docs/methodology/REGISTRY.md:419, docs/methodology/REGISTRY.md:420, docs/methodology/REGISTRY.md:421, docs/methodology/REGISTRY.md:924, TODO.md:55, TODO.md:56). (rdrr.io)

Code Quality

Severity P1. Impact: The new survey CallawaySantAnna estimation_method="reg" + covariates path is forced off the optimized reduced-solve branch when has_survey is true, so it always goes through _outcome_regression() (diff_diff/staggered.py:1344). That function feeds _linear_regression() coefficients directly into X @ beta for treated predictions and control residuals (diff_diff/staggered.py:1619), but solve_ols() intentionally returns NaN in dropped rank-deficient columns (diff_diff/linalg.py:860). Under the default rank_deficient_action="warn", a collinear survey covariate can therefore make predicted_control, resid_c, and ATT(g,t) become NaN; those NaN cells are then excluded from aggregation (diff_diff/staggered_aggregation.py:95). Concrete fix: after _linear_regression(), zero non-finite coefficients before prediction/residual reconstruction, matching the existing DR path’s handling (diff_diff/staggered.py:1987).

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: The new TODO.md entries properly track the deferred CallawaySantAnna survey work and the pre-existing EfficientDiD Hausman issue, so they are informational rather than blockers under the project policy. Concrete fix: none. (TODO.md:55, TODO.md:56, TODO.md:57, TODO.md:58)

Security

No findings.

Documentation/Tests

Severity P3. Impact: TwoStageDiD’s public survey docstrings still say PSU feeds “Theorem 3 variance,” but the implementation and registry describe a weighted GMM sandwich. This does not affect numerical results, but it is still an unresolved public-doc mismatch. Concrete fix: replace “Theorem 3 variance” with “GMM sandwich variance” in both survey-design docstrings. (diff_diff/two_stage.py:202, diff_diff/two_stage.py:1672, docs/methodology/REGISTRY.md:924)

Path to Approval

In CallawaySantAnna._outcome_regression(), sanitize dropped/rank-deficient coefficients before computing treated predictions and control residuals, so rank_deficient_action="warn" drops only the redundant columns instead of poisoning the whole ATT(g,t) cell with NaN.
Add a regression test in tests/test_survey_phase4.py with weights-only SurveyDesign, estimation_method="reg", and perfectly collinear covariates (for example x2 = 2 * x1) that asserts finite per-cell and overall survey estimates under rank_deficient_action="warn".

Zero NaN coefficients before prediction in _outcome_regression survey covariate branch, matching the DR path convention. Rank-deficient columns contribute 0 to the column space projection, so zeroing NaN betas gives the correct prediction. Previously, NaN coefficients propagated into predictions, turning estimable cells into NaN. Added test_reg_covariates_rank_deficient_survey with collinear covariates. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-23T16:04:34Z

/ai-review

github-actions · 2026-03-23T16:14:22Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 8371aedf09daa65c2c4b7b1a985dd10c9300cf25

Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Static review only. I could not run the test suite locally because pytest is not installed here, and the local Python environment also lacks numpy.

Executive Summary

The previous CallawaySantAnna survey reg+covariates rank-deficiency blocker appears fixed: dropped WLS coefficients are now zeroed before prediction/residual reconstruction, and there is a targeted regression test for the collinearity case in the new survey suite. diff_diff/staggered.py:L1619-L1649 tests/test_survey_phase4.py:L933-L955
The new registry/TODO entries correctly mitigate the documented CallawaySantAnna survey deviations and deferred work; I did not find a new untracked methodology mismatch in the ImputationDiD / TwoStageDiD / CallawaySantAnna survey formulas themselves. docs/methodology/REGISTRY.md:L419-L421 docs/methodology/REGISTRY.md:L846-L846 docs/methodology/REGISTRY.md:L924-L924 TODO.md:L55-L58
The remaining blocker is in TripleDifference: the new survey TSL path now divides the combined influence function by survey weights for all estimation methods, but the regression-adjustment branch does not build its pairwise IF on the same explicitly weight-multiplied scale as the new IPW/DR branches. That makes the survey SE path for estimation_method="reg" methodologically inconsistent and liable to produce incorrect inference. diff_diff/triple_diff.py:L1085-L1098 diff_diff/triple_diff.py:L1302-L1307 diff_diff/triple_diff.py:L1378-L1453 diff_diff/triple_diff.py:L1484-L1492
solve_logit() weighting and the new survey threading in ImputationDiD / TwoStageDiD look internally consistent with the new registry notes and the added scale-invariance tests. diff_diff/linalg.py:L1108-L1228 tests/test_survey_phase4.py:L94-L182 tests/test_survey_phase4.py:L366-L420 tests/test_survey_phase4.py:L1268-L1280
The prior TwoStageDiD docstring mismatch is still present: the public survey docstrings still say “Theorem 3 variance” even though the implementation and registry now describe a weighted GMM sandwich. That is informational only. diff_diff/two_stage.py:L202-L207 diff_diff/two_stage.py:L1672-L1677 docs/methodology/REGISTRY.md:L894-L924

Methodology

Severity P1. Impact: The new survey TSL adjustment in TripleDifference applies inf / survey_weights unconditionally before compute_survey_vcov(), but only the new IPW/DR branches explicitly move their pairwise IFs onto a weight-multiplied Riesz scale (riesz *= weights). The regression-adjustment branch still constructs its pairwise IF from weighted means and weighted OLS score pieces without that same transformation. As written, the PR changes the already-supported survey-reg inference path with no method-specific derivation and can silently change se, t_stat, p_value, and CI for TripleDifference(estimation_method="reg", survey_design=...). Concrete fix: make the inf / sw step method-specific, or re-derive/rewrite the RA pairwise IF so it is explicitly on the same weighted-score scale as IPW/DR before dividing it back out for TSL. diff_diff/triple_diff.py:L1085-L1098 diff_diff/triple_diff.py:L1378-L1453 diff_diff/triple_diff.py:L1302-L1307 diff_diff/triple_diff.py:L1484-L1492 docs/methodology/REGISTRY.md:L1249-L1250

No other unmitigated methodology findings. The new CallawaySantAnna survey deviations and deferred limitations are explicitly documented in the registry and tracked in TODO.md, so I did not count them as defects. docs/methodology/REGISTRY.md:L419-L421 TODO.md:L55-L56

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No unmitigated findings. The new TODO.md entries appropriately track the deferred CallawaySantAnna survey work and the pre-existing EfficientDiD Hausman issue, so those are informational rather than blockers under the project policy. TODO.md:L55-L58

Security

No findings.

Documentation/Tests

Severity P3. Impact: The new and existing TripleDifference survey tests do not pin the regression-adjustment survey SE numerically. The reg survey coverage is still smoke-level, while the Phase 4 additions focus on IPW/DR “works / finite” behavior. That gap is why the new method-agnostic inf / sw transform was able to land without a regression alarm. Concrete fix: add a non-uniform-weight TripleDifference(estimation_method="reg", survey_design=...) test that asserts se, t_stat, and p_value against a known analytical target or validated baseline. tests/test_survey_phase3.py:L482-L603 tests/test_survey_phase4.py:L1069-L1148
Severity P3. Impact: TwoStageDiD’s public survey docstrings still say PSU feeds “Theorem 3 variance,” but the implementation and registry now describe a weighted GMM sandwich variance. This does not affect results, but it is still a user-facing documentation mismatch. Concrete fix: update both survey docstrings to say “GMM sandwich variance.” diff_diff/two_stage.py:L202-L207 diff_diff/two_stage.py:L1672-L1677 docs/methodology/REGISTRY.md:L894-L924

Path to Approval

In TripleDifference’s survey TSL path, stop applying the same inf / survey_weights transform to the regression-adjustment branch unless the RA pairwise IF is first rewritten onto the same explicitly weight-scaled basis as IPW/DR. diff_diff/triple_diff.py:L1085-L1098
Add a regression test for TripleDifference(estimation_method="reg", survey_design=SurveyDesign(weights=...)) on a non-uniform-weight 2×2×2 sample that pins se, t_stat, and p_value against a known TSL target or validated baseline. tests/test_survey_phase3.py:L482-L603

The regression-adjustment IF is already on the unweighted scale (WLS fits use weights internally but the IF is residual-based, not Riesz-weighted). De-weighting only applies to IPW/DR where Riesz representers are multiplied by survey weights. Updated REGISTRY note to document method-specific TSL handoff. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-23T16:40:43Z

/ai-review

github-actions · 2026-03-23T16:52:30Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 10e98ca77836bb4b86978f741110496a66250b15

Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Static review only. I cross-checked the changed estimator paths against docs/methodology/REGISTRY.md and the changed in-code method notes, but I could not run the suite locally because this environment is missing numpy, pandas, and pytest.

Executive Summary

The prior TripleDifference survey SE blocker looks resolved: the survey de-weighting step now applies only to IPW/DR, and the registry documents that RA/IPW/DR distinction. diff_diff/triple_diff.py:L1085-L1099 docs/methodology/REGISTRY.md:L1249-L1250
[Newly identified] ImputationDiD’s new survey_design support does not propagate into the user-facing pretrend_test() path. After a survey-weighted fit, that path still runs unweighted demeaning, unweighted solve_ols(), and cluster-count F inference with no warning. diff_diff/imputation.py:L206-L210 diff_diff/imputation.py:L573-L589 diff_diff/imputation.py:L1815-L1864 diff_diff/imputation_results.py:L413-L437
I did not find other unmitigated methodology mismatches in the changed ImputationDiD, TwoStageDiD, CallawaySantAnna, or weighted solve_logit() paths.
The new CallawaySantAnna survey limitations and deviations are explicitly documented in the registry and tracked in TODO.md, so I treated them as informational rather than defects. docs/methodology/REGISTRY.md:L419-L421 TODO.md:L55-L56
TwoStageDiD’s survey docstrings still say “Theorem 3 variance” even though the implementation and registry describe a GMM sandwich variance. That is informational only. diff_diff/two_stage.py:L202-L206 diff_diff/two_stage.py:L1672-L1676 diff_diff/two_stage.py:L1357-L1376 docs/methodology/REGISTRY.md:L924-L924

Methodology

[Newly identified] Severity P1. Impact: the ImputationDiD registry says the pre-trend test uses the same iterative FE machinery as Step 1, and the new survey note says survey weights enter Step 1 FE with survey d.f. entering inference. But the saved _fit_data does not retain survey state, and ImputationDiDResults.pretrend_test() still delegates to an unweighted demeaning and unweighted OLS path with denominator d.f. n_clusters - 1. After fit(..., survey_design=...), this can silently return a non-survey parallel-trends diagnostic and misstate the assumption check. Concrete fix: store the resolved survey weights/design/d.f. in _fit_data and thread them through weighted demeaning/WLS plus survey-aware Wald/F d.f., or explicitly raise NotImplementedError whenever pretrend_test() is called after a survey-weighted fit. docs/methodology/REGISTRY.md:L823-L846 diff_diff/imputation.py:L573-L589 diff_diff/imputation.py:L789-L807 diff_diff/imputation.py:L1815-L1864 diff_diff/imputation_results.py:L413-L437

No other unmitigated methodology findings. The new CallawaySantAnna survey deviations are documented in docs/methodology/REGISTRY.md:L419-L421 and tracked in TODO.md:L55-L56, and the prior TripleDifference concern is resolved in diff_diff/triple_diff.py:L1085-L1099 with matching registry note docs/methodology/REGISTRY.md:L1249-L1250.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: none for merge. The remaining CallawaySantAnna survey limitations and the pre-existing EfficientDiD Hausman issue are now explicitly tracked in TODO.md, so they are mitigated deferred work under the project policy. Concrete fix: none required in this PR. TODO.md:L55-L58

Security

No findings.

Documentation/Tests

Severity P3. Impact: the public TwoStageDiD survey docstrings still say PSU feeds “Theorem 3 variance,” but the implementation and methodology registry describe a weighted GMM sandwich variance. This does not affect results, but it is user-facing and misleading. Concrete fix: update both survey docstrings to say “GMM sandwich variance.” diff_diff/two_stage.py:L202-L206 diff_diff/two_stage.py:L1672-L1676 diff_diff/two_stage.py:L1357-L1376 docs/methodology/REGISTRY.md:L924-L924
Severity P3. Impact: the changed Imputation survey suite covers fit, aggregation, wrapper, and scale-invariance paths, but I did not find a survey pretrend_test() regression, which is why the missing survey propagation above can ship without a failing test. Concrete fix: add a fit(..., survey_design=...).pretrend_test() test for the supported behavior, or a test asserting the explicit NotImplementedError if that path is intentionally blocked. tests/test_survey_phase4.py:L223-L463 diff_diff/imputation_results.py:L413-L437

Path to Approval

Make ImputationDiDResults.pretrend_test() either survey-aware or explicitly unsupported after survey fits. Minimum acceptable implementation: store survey state in _fit_data and use weighted within-transform/WLS plus survey-aware denominator d.f. Minimum acceptable guard: immediate NotImplementedError on survey-weighted fits. diff_diff/imputation.py:L573-L589 diff_diff/imputation.py:L1815-L1864 diff_diff/imputation_results.py:L413-L437
Add a regression test for that behavior. A good minimum case is a non-uniform-weight ImputationDiD fit followed by pretrend_test(), plus a full-design case with strata/PSU if that path remains supported. tests/test_survey_phase4.py:L223-L463

pretrend_test() uses unweighted demeaning and cluster-count F inference, which do not account for survey weights. Raise NotImplementedError when called after a survey-weighted fit. Store survey_design in _fit_data for detection. Survey-aware pretrend_test is planned for future work. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-23T16:57:23Z

/ai-review

github-actions · 2026-03-23T17:09:34Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 8d184c946b21184f5bf8c72646a9f8ea743c5bec

Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Static review only. I could not execute the suite in this environment because numpy is unavailable.

Executive Summary

The prior ImputationDiD survey pretrend_test() blocker is resolved in code: survey fits now fail fast with NotImplementedError instead of silently running the unweighted test path. diff_diff/imputation.py:L1757-L1763 diff_diff/imputation_results.py:L413-L439
Severity P1: CallawaySantAnna’s new survey aggregation/WIF path silently mis-normalizes weights on unbalanced panels, which can make overall_se, dynamic SEs, and group SEs wrong even when the per-cell ATT(g,t) estimates are fine. diff_diff/survey.py:L117-L120 diff_diff/staggered.py:L393-L399 diff_diff/staggered_aggregation.py:L343-L395 docs/methodology/REGISTRY.md:L419-L421 The reference formulas use unit-level weights.ind/G in did::wif(), and DRDID::drdid_panel() expects panel weights normalized over that unit-level sample. citeturn1view0turn1view1
I did not find additional unmitigated methodology defects in the changed ImputationDiD, TwoStageDiD, or TripleDifference survey paths.
The new TODO.md entries properly track the intentionally deferred CallawaySantAnna strata/PSU/FPC and covariate-IPW/DR limitations, so those are informational rather than blockers. TODO.md:L55-L57
The PR still leaves two P3s: no regression test for the new survey pretrend_test() guard, and TwoStageDiD survey docstrings still describe “Theorem 3 variance” even though the implementation/registry now describe a GMM sandwich variance. tests/test_survey_phase4.py:L223-L502 diff_diff/two_stage.py:L202-L207 diff_diff/two_stage.py:L1672-L1677 docs/methodology/REGISTRY.md:L924-L924

Methodology

Severity P1. Impact: CallawaySantAnna’s survey WIF path is correct on balanced panels but not on unbalanced panels. SurveyDesign.resolve() normalizes pweights on the long panel (n = len(data)), _precompute_structures() then collapses those observation-level weights to one value per unit without renormalizing, and _compute_combined_influence_function() feeds that collapsed vector directly into the did-style WIF term. Because the WIF formula uses the raw unit weights, the global normalization constant does not cancel after collapse on an unbalanced panel, so aggregated survey SEs are silently mis-scaled. Concrete fix: after collapsing to one weight per unit in _precompute_structures(), renormalize the unit vector to sum to len(all_units) before storing precomputed["survey_weights"], and recompute the unit-level survey metadata from that same collapsed vector. diff_diff/survey.py:L117-L120 diff_diff/staggered.py:L393-L399 diff_diff/staggered_aggregation.py:L343-L395 docs/methodology/REGISTRY.md:L419-L421 did::wif() is written on a unit-level G/weights.ind vector, and DRDID::drdid_panel() documents mean-1 normalization over its panel sample. citeturn1view0turn1view1

No other unmitigated methodology findings.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: none for merge. The newly added TODO.md entries properly track the intentionally deferred CallawaySantAnna survey limitations and the pre-existing EfficientDiD Hausman issue, so those items are mitigated under the project’s deferred-work policy. Concrete fix: none required in this PR. TODO.md:L55-L58

Security

No findings.

Documentation/Tests

Severity P3. Impact: the prior ImputationDiD survey pretrend_test() blocker is fixed in code, but I did not find a regression test that exercises fit(..., survey_design=...).pretrend_test(). Concrete fix: add a test that fits with non-uniform survey weights and asserts the new NotImplementedError. diff_diff/imputation.py:L1757-L1763 tests/test_survey_phase4.py:L223-L502
Severity P3. Impact: the new CallawaySantAnna survey tests cover balanced panels and scale invariance, but I did not find an unbalanced-panel survey aggregation/WIF case; that gap is why the P1 above can ship undetected. Concrete fix: add an unbalanced staggered panel test that drops periods for a subset of units and asserts overall_se, event-study SEs, and group SEs match the unit-renormalized survey path. tests/test_survey_phase4.py:L739-L1062 tests/test_survey_phase4.py:L1265-L1339
Severity P3. Impact: TwoStageDiD survey docstrings still say PSU feeds “Theorem 3 variance,” but the implementation and registry now describe a weighted GMM sandwich variance. Concrete fix: update the fit() and wrapper docstrings to say “GMM variance” or “GMM sandwich variance.” diff_diff/two_stage.py:L202-L207 diff_diff/two_stage.py:L1672-L1677 diff_diff/two_stage.py:L1357-L1377 docs/methodology/REGISTRY.md:L924-L924

Path to Approval

Renormalize CallawaySantAnna’s collapsed per-unit survey weights after _precompute_structures() reduces the long panel to one row per unit, and use that same unit-normalized vector in both the WIF path and the survey metadata.
Add an unbalanced-panel regression test for survey CallawaySantAnna aggregation. Minimum acceptable case: drop some periods for a subset of units while keeping survey columns unit-constant, then assert overall_se, dynamic SEs, and group SEs are unchanged by a global rescaling of unit weights and match the manually unit-renormalized computation.

The pweight normalization preserves relative unit weights on unbalanced panels because all IF/WIF formulas use weight ratios (sw_i/sum(sw)) where the normalization constant cancels. Added REGISTRY note explaining this and an unbalanced panel scale-invariance test confirming it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-23T17:38:34Z

/ai-review

github-actions · 2026-03-23T17:50:45Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 66adcdcef78a2aa2fc7d40fad3bf9927a01195b6

Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1. Static review only; I did not execute the suite in this environment.

Executive Summary

The prior ImputationDiD survey pretrend_test() blocker is fixed in code: survey-weighted fits now raise NotImplementedError instead of silently running the unweighted pre-trend test path. diff_diff/imputation.py:L1757-L1763
Severity P1 [Newly identified]: CallawaySantAnna’s new survey aggregation path uses cell-specific treated survey totals (survey_weight_sum) as simple and dynamic aggregation weights. On unbalanced panels, that changes the aggregate target relative to Callaway-Sant’Anna / did::aggte(), which uses fixed cohort shares pg computed once from the unit-level sample. This can silently change overall_att, dynamic effects, and their analytical SEs. diff_diff/staggered.py:L393-L398 diff_diff/staggered.py:L610-L611 diff_diff/staggered_aggregation.py:L65-L139 diff_diff/staggered_aggregation.py:L481-L549 docs/methodology/REGISTRY.md:L419-L421 (rdrr.io)
The WIF code already computes fixed pg from the full unit sample, so the current implementation mixes cell-specific standard aggregation weights with fixed-share WIF weights on unbalanced panels. diff_diff/staggered_aggregation.py:L268-L289
The new TODO.md entries correctly track the intentionally deferred CallawaySantAnna survey limitations, so I treated those as informational rather than blockers. TODO.md:L55-L56
I did not find additional unmitigated methodology defects in the changed ImputationDiD, TwoStageDiD, TripleDifference, or solve_logit() survey paths.

Methodology

Severity P1 [Newly identified]. Impact: CallawaySantAnna now stores a per-cell survey_weight_sum from the treated units that are observed in that specific (g,t) comparison, then uses those cell-specific totals in _aggregate_simple() and _aggregate_event_study(). On unbalanced panels, that makes aggregate weights depend on cell-specific missingness instead of fixed cohort shares. The official did::aggte() docs describe simple aggregation as weighting ATT(g,t) by group size, and the reference implementation computes pg once from the unit-level sample and reuses pg[keepers] / pg[whiche] in both simple and dynamic aggregation. This PR does not document any deviation from that in REGISTRY.md; the registry note only documents the survey WIF normalization note and the per-cell SE deviation. Concrete fix: compute fixed cohort survey totals once from the unit-level survey sample (precomputed["survey_weights"] + unit_cohorts), carry that fixed cohort share on each (g,t) cell, and use it for both the point-estimate weights and the standard-IF/WIF aggregation in simple and dynamic summaries. Add an unbalanced-panel regression test where within-cohort missingness differs across post periods and assert fixed-cohort-share aggregation. diff_diff/staggered.py:L393-L398 diff_diff/staggered.py:L610-L611 diff_diff/staggered.py:L1400-L1410 diff_diff/staggered_aggregation.py:L65-L139 diff_diff/staggered_aggregation.py:L268-L289 diff_diff/staggered_aggregation.py:L481-L549 docs/methodology/REGISTRY.md:L419-L421 tests/test_survey_phase4.py:L1268-L1306 (rdrr.io)

No other unmitigated methodology findings in scope.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: none for merge. The new TODO entries explicitly track the current weights-only scope of CallawaySantAnna survey support and the missing survey nuisance-IF corrections for IPW/DR with covariates, so these are properly deferred under the project’s policy. Concrete fix: none required in this PR. TODO.md:L55-L56

Security

No findings.

Documentation/Tests

Severity P3. Impact: the new survey-aware ImputationDiD pretrend_test() guard is fixed in code, but I did not find a regression test that fits with survey_design and then calls .pretrend_test(). Concrete fix: add a test asserting ImputationDiD(..., survey_design=...).fit(...).pretrend_test() raises NotImplementedError. diff_diff/imputation.py:L1757-L1763 tests/test_survey_phase4.py:L223-L503
Severity P3. Impact: TwoStageDiD survey docstrings still say PSU feeds “Theorem 3 variance,” while the implementation and methodology registry now describe a weighted GMM sandwich variance. This is documentation drift, not a numerical defect, but it makes methodology review harder. Concrete fix: update both survey docstrings to say “GMM sandwich variance.” diff_diff/two_stage.py:L202-L207 diff_diff/two_stage.py:L1672-L1677 docs/methodology/REGISTRY.md:L924-L924

Path to Approval

In CallawaySantAnna, replace cell-specific survey_weight_sum aggregation weights with fixed cohort shares computed once from the full unit-level survey sample, so simple and dynamic aggregation match the source method and the existing WIF implementation.
Add an unbalanced-panel regression test where units in the same cohort are missing different post periods, and assert that survey overall_att, dynamic effects, and analytical SEs follow fixed cohort-share aggregation rather than cell-specific observed treated totals.

Replace per-cell survey_weight_sum with fixed cohort-level weight sums computed once from the full unit-level sample, matching R's did::aggte() which uses pg = n_g / N (cohort shares from the full sample, not varying by cell). On unbalanced panels, cell-specific sums can differ from fixed cohort shares. The WIF already uses fixed pg — this aligns the aggregation point estimate with the WIF computation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-23T18:27:05Z

/ai-review

github-actions · 2026-03-23T18:33:27Z

🔁 AI review rerun (requested by @igerber)

Head SHA: e6f410dad3fed9906b18c3fc81d31cb9e2da9a07

Overall Assessment

✅ Looks good

Highest unmitigated severity: P3. Re-review: the previous CallawaySantAnna survey-aggregation blocker appears resolved. Static review only; I did not execute the suite in this environment.

Executive Summary

The prior blocker is fixed: CallawaySantAnna simple and dynamic aggregation now use fixed cohort survey totals from the full unit sample, and the WIF path uses the same global cohort shares. diff_diff/staggered_aggregation.py:65 diff_diff/staggered_aggregation.py:279
The remaining CallawaySantAnna survey deviations are documented in the methodology registry and tracked in TODO.md, so they are mitigated and informational rather than blockers. docs/methodology/REGISTRY.md:419 docs/methodology/REGISTRY.md:420 docs/methodology/REGISTRY.md:421 TODO.md:55 TODO.md:56
I did not find new unmitigated P0/P1 issues in the changed survey paths for ImputationDiD, TwoStageDiD, CallawaySantAnna, TripleDifference, or weighted solve_logit().
P3: ImputationDiD’s new survey pretrend_test() guard has no regression test. diff_diff/imputation.py:1757 tests/test_survey_phase4.py:223
P3: TwoStageDiD’s new survey docstrings still say “Theorem 3” variance instead of the weighted GMM sandwich actually implemented. diff_diff/two_stage.py:202 diff_diff/two_stage.py:1672 docs/methodology/REGISTRY.md:924
P3: the newly unblocked TripleDifference IPW/DR survey path has no SE scale-invariance regression comparable to the existing reg-path coverage. diff_diff/triple_diff.py:1085 tests/test_survey_phase4.py:1069 tests/test_survey_phase3.py:901

Methodology

No unmitigated findings. The prior CallawaySantAnna survey aggregation mismatch is fixed: the changed code now computes fixed cohort survey totals from the full unit sample for simple/dynamic aggregation, and the WIF path derives matching cohort shares from the same unit-level survey weights. diff_diff/staggered_aggregation.py:65 diff_diff/staggered_aggregation.py:279
Severity P3. Impact: none for merge. The remaining CallawaySantAnna survey deviations and limits are explicitly documented in the Methodology Registry, so they are mitigated under project policy. Concrete fix: none required in this PR. docs/methodology/REGISTRY.md:419 docs/methodology/REGISTRY.md:420 docs/methodology/REGISTRY.md:421

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: none for merge. The PR adds TODO entries for the remaining CallawaySantAnna survey gaps, which is proper tracking under the project’s deferred-work policy. Concrete fix: none required in this PR. TODO.md:55 TODO.md:56

Security

No findings.

Documentation/Tests

Severity P3. Impact: the new ImputationDiD survey guard can regress without detection because there is no test that fits with survey_design and then calls .pretrend_test(). Concrete fix: add a regression test asserting ImputationDiD(..., survey_design=...).fit(...).pretrend_test() raises NotImplementedError. diff_diff/imputation.py:1757 tests/test_survey_phase4.py:223
Severity P3. Impact: TwoStageDiD’s new survey docstrings still describe PSU as feeding “Theorem 3” variance, but the implementation and registry now use the weighted GMM sandwich. Concrete fix: update both survey docstrings to say “GMM sandwich variance.” diff_diff/two_stage.py:202 diff_diff/two_stage.py:1672 docs/methodology/REGISTRY.md:924
Severity P3. Impact: the new TripleDifference IPW/DR survey inference path has smoke/uniform-weight coverage, but no SE scale-invariance regression analogous to the existing reg-path test. Concrete fix: add IPW and DR tests that fit with weights and c * weights and assert identical att and se. diff_diff/triple_diff.py:1085 tests/test_survey_phase4.py:1069 tests/test_survey_phase3.py:901

igerber and others added 8 commits March 23, 2026 06:24

igerber force-pushed the survey-roadmap branch from 98d80c6 to d8f6520 Compare March 23, 2026 10:36

igerber and others added 2 commits March 23, 2026 06:51

igerber and others added 3 commits March 23, 2026 07:42

igerber merged commit ea66c57 into main Mar 23, 2026
14 checks passed

igerber deleted the survey-roadmap branch March 23, 2026 19:32

Conversation

igerber commented Mar 22, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions bot commented Mar 22, 2026

Uh oh!

igerber commented Mar 22, 2026

Uh oh!

github-actions bot commented Mar 22, 2026

Uh oh!

igerber commented Mar 22, 2026

Uh oh!

github-actions bot commented Mar 23, 2026

Uh oh!

igerber commented Mar 23, 2026

Uh oh!

github-actions bot commented Mar 23, 2026

Uh oh!

igerber commented Mar 23, 2026

Uh oh!

github-actions bot commented Mar 23, 2026

Uh oh!

igerber commented Mar 23, 2026

Uh oh!

github-actions bot commented Mar 23, 2026

Uh oh!

igerber commented Mar 23, 2026

Uh oh!

github-actions bot commented Mar 23, 2026

Uh oh!

igerber commented Mar 23, 2026

Uh oh!

github-actions bot commented Mar 23, 2026

Uh oh!

github-actions bot commented Mar 23, 2026

Uh oh!

igerber commented Mar 23, 2026

Uh oh!

github-actions bot commented Mar 23, 2026

Uh oh!

igerber commented Mar 23, 2026

Uh oh!

github-actions bot commented Mar 23, 2026

Uh oh!

igerber commented Mar 23, 2026

Uh oh!

github-actions bot commented Mar 23, 2026

Uh oh!

igerber commented Mar 23, 2026

Uh oh!

github-actions bot commented Mar 23, 2026

Uh oh!

igerber commented Mar 23, 2026

Uh oh!

github-actions bot commented Mar 23, 2026

Uh oh!

igerber commented Mar 23, 2026

Uh oh!

github-actions bot commented Mar 23, 2026

Uh oh!

igerber commented Mar 23, 2026

Uh oh!

github-actions bot commented Mar 23, 2026

Uh oh!

igerber commented Mar 23, 2026

Uh oh!

github-actions bot commented Mar 23, 2026

Uh oh!

igerber commented Mar 23, 2026

Uh oh!

github-actions bot commented Mar 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects