Phase 4: Survey support for ImputationDiD, TwoStageDiD, CallawaySantAnna#233
Phase 4: Survey support for ImputationDiD, TwoStageDiD, CallawaySantAnna#233
Conversation
|
Overall Assessment ⛔ Blocker Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment This re-review clears two of the prior blockers: Executive Summary
Methodology The touched paths are all source-sensitive: Borusyak-Jaravel-Spiess place ImputationDiD uncertainty on the Theorem 3 conservative variance, Gardner/did2s treats two-stage inference as a generated-regressor/GMM problem, Callaway-Sant’Anna support OR/IPW/DR staggered DiD with covariates, and the current DDD source paper explicitly emphasizes covariate-adjusted RA/IPW/DR inference. (economics-static.uchicago.edu)
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Static-review note: I could not execute the test suite in this environment because Path to Approval
|
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment The previous Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
Static-review note: I could not run the targeted tests in this environment because |
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Unmitigated P1s remain in the new survey inference paths. The earlier ImputationDiD variance-weighting gap and the obvious TripleDifference survey double-weighting problem look resolved in this diff, but Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Review note: Path to Approval
|
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment This re-review leaves two unmitigated P1s in the new survey Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
…, CallawaySantAnna) - Weighted solve_logit(): survey weights enter IRLS as w_survey * mu*(1-mu) - ImputationDiD: weighted iterative FE, survey-weighted ATT aggregation, weighted conservative variance (Theorem 3), survey df for inference - TwoStageDiD: weighted iterative FE, weighted Stage 2 OLS, weighted GMM sandwich variance with survey weights in both stages - CallawaySantAnna: survey-weighted regression, IPW (via weighted solve_logit), and DR methods with explicit influence functions; survey-weighted WIF in aggregation; Cholesky cache bypassed under survey weights - Unblock TripleDifference IPW/DR with survey (weighted solve_logit now available) - 38 new tests in test_survey_phase4.py covering all estimators + scale invariance - Update survey-roadmap.md, REGISTRY.md with Phase 4 status and deviation notes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e-existing Hausman bug, add inference tests - Add REGISTRY.md deviation note: CS per-cell ATT(g,t) SEs use influence-function variance (matching R's did), not full TSL with strata/PSU/FPC - Add TODO entries: CS per-cell TSL SE (Medium), Hausman stale n_cl from #230 (Medium) - Add 3 CS survey inference validation tests (scale invariance, per-cell ATT change, survey df effect) - Remove unused survey_weight_type in ImputationDiD Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…, CS SE formula - P0: Thread survey_weights through TripleDifference IPW and DR call chains (_ipw_estimation, _doubly_robust, _compute_did_rc_ipw, _compute_did_rc_dr). Survey weights now enter Riesz representers for weighted Hajek averages. - P1: Fix CallawaySantAnna no-covariate survey SE to derive from sum(IF^2) instead of sum(w_norm * (y-mean)^2). All 4 locations now consistent with stored influence functions. - P1: Update REGISTRY.md TripleDifference entry to reflect full survey support (was still marked as "IPW/DR deferred"). - P2: Add behavioral tests for TripleDiff IPW/DR survey: non-uniform weights change ATT, uniform weights match unweighted. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… always-treated survey - ImputationDiD: use survey-weighted untreated sums (not raw counts) in Theorem 3 v_it FE-only path, and weighted normal equations (A_0'W_0 A_0) in covariate path - TripleDifference: weight PS Hessian, score correction, and DR OLS linear representations with survey weights for covariate-adjusted IPW/DR IF corrections; update fit() docstring to reflect full survey support - TwoStageDiD: subset survey_weights and resolved_survey arrays after always-treated unit exclusion to prevent length mismatch - Add tests: TwoStageDiD always-treated + survey, DDD covariate IPW/DR + survey Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…riate nuisance IF - TripleDifference: remove double-weighting in IPW/DR moment corrections — since Riesz representers already incorporate survey weights, moment means use np.mean() not np.average(weights=). Removed _wmean_ax0 helper. - CallawaySantAnna WIF: apply s_i symmetrically to both indicator and pg terms in the weighted share estimator IF. Normalize by total_weight (sum of survey weights) instead of n_units. - CallawaySantAnna outcome regression covariate IF: add weighted regression nuisance IF correction (asymptotic linear representation of beta from WLS, projected onto weighted treated covariate mean). IPW and DR IFs unchanged (IPW matches unweighted structure; DR is self-correcting per Theorem 3.1). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ge n_psu - CallawaySantAnna WIF: remove inner /total_weight from indicator_diff — the final psi_wif/total_weight handles normalization once, matching R's did::wif() - CallawaySantAnna IPW covariate: add propensity score nuisance IF correction (survey-weighted Hessian, score, M2 gradient) so per-cell and aggregated SEs account for PS estimation uncertainty - TwoStageDiD: recompute n_psu/n_strata after always-treated filtering via np.unique() on subsetted arrays, then recompute survey_metadata Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…es+survey
- CallawaySantAnna WIF: use (w_i * 1{G=g} - pg) matching R's did::wif() formula,
not (w_i * 1{G=g} - w_i * pg). pg is already the population-level expected value.
- Gate CallawaySantAnna covariates + survey + IPW/DR with NotImplementedError
(DRDID panel nuisance IF not yet implemented). Regression with covariates works.
- Narrow roadmap/REGISTRY claims to match actual support boundary
- Add TODO entry for DRDID nuisance IF
- Add 3 tests: IPW/DR+cov+survey raises, reg+cov+survey works
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove spurious /n_c from regression nuisance IF correction in _outcome_regression survey covariate branch. asy_lin_rep_reg is already per-observation, so dividing by n_c double-scaled. - Add test_reg_covariates_survey_se_scale_invariance: verifies ATT and SE are invariant to weight rescaling for reg+covariates+survey. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
Methodology
Code Quality
Performance No findings. Maintainability No additional findings beyond the broken wrapper propagation above. Tech Debt
Security No findings. Documentation/Tests
Path to Approval
|
…ce wrappers - ImputationDiD FE-only v_it: multiply by per-obs survey weight (WLS projection requires observation-level weight factor, not just weighted denominators) - ImputationDiD covariate v_it: v_0 = -diag(sw_0) @ A_0 @ z (WLS projection includes left-side W_0 factor) - Add survey_design parameter to imputation_did() and two_stage_did() convenience wrappers, forwarded to .fit() Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…iD survey - ImputationDiD: SE scale-invariance test for FE-only and covariate paths - ImputationDiD: wrapper parity test (imputation_did with survey_design) - TwoStageDiD: wrapper parity test (two_stage_did with survey_design) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
Static review only: I could not execute the test suite in this sandbox because Methodology
Code Quality
Performance No findings. Maintainability No findings beyond the result-object weight mismatch above. Tech Debt
Security No findings. Documentation/Tests
Path to Approval
|
- TripleDifference: divide out survey weights from IF before passing to compute_survey_vcov, since Riesz representers already incorporate weights and TSL would multiply by weights again - CallawaySantAnna _outcome_regression: rewrite survey covariate IF to follow DRDID panel OR structure — all terms consistently scaled by 1/sw_t_sum, nuisance correction divided by sw_t_sum for correct normalization Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Highest unmitigated severity: P1. Executive Summary
Methodology
Code Quality
Performance Maintainability Tech Debt
Security Documentation/Tests
Path to Approval
|
The DRDID panel nuisance IF corrections could not be validated against a reference implementation after multiple review cycles. Gate all covariates+ survey combinations (reg/IPW/DR) with NotImplementedError until the exact DRDID panel OR/IPW/DR IF formulas are implemented and validated. No-covariate survey support remains fully functional for all methods. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The survey-weighted OR IF now mirrors the unweighted code's structure exactly: - treated: (w_i / sum(w_t)) * (resid_i - ATT) - control: -(w_i / sum(w_t)) * wls_resid_i WLS residuals are orthogonal to W*X by construction, so the regression nuisance IF correction is implicit (same as the unweighted case where inf_control = -residuals/n_c with no explicit correction). This replaces the explicit nuisance correction that was repeatedly flagged by the AI reviewer for incorrect scaling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion tests - Add REGISTRY.md note: CS reg+covariates survey SE uses conservative plug-in IF (WLS residuals, no semiparametric nuisance correction). SEs may be wider than DRDID's efficient IF but have correct asymptotic coverage. Matches unweighted code's approach. Efficient IF deferred to future work. - Add aggregate="group" and aggregate="all" survey tests for ImputationDiD and TwoStageDiD to cover all aggregation method interactions with survey_design. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Highest unmitigated severity: Static review only; Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
…contract - ImputationDiD/TwoStageDiD: reject FPC (finite population correction) in SurveyDesign with NotImplementedError. PSU is used for cluster-robust variance, strata enters survey df for t-distribution. FPC not yet implemented in Theorem 3 / GMM variance. - REGISTRY.md: document exact survey contract — PSU as cluster variable, strata for df only (no stratified sandwich), FPC rejected. This is conservative relative to full stratified design-based variance. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Highest unmitigated severity: Static review only: Executive Summary
Methodology
Code Quality
Performance No findings. Maintainability No findings. Tech Debt
Security No findings. Documentation/Tests
|
P2: solve_logit() now validates weights (shape, NaN, Inf, positive) before IRLS, giving clear errors instead of opaque numerical failures. P3 docs: update CallawaySantAnna fit() docstring to weights-only contract; add survey_design param docs to ImputationDiD/TwoStageDiD fit() and wrapper docstrings; update REGISTRY treatment_effects["weight"] note for survey mode. P3 tests: add negative tests for solve_logit bad weights (NaN, negative, wrong shape); add aweight/fweight rejection and FPC rejection tests for ImputationDiD and TwoStageDiD. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Highest unmitigated severity: Static review only; I did not execute Executive Summary
Methodology
Code Quality No findings. Performance No findings. Maintainability No findings. Tech Debt
Security No findings. Documentation/Tests
Path to Approval
|
CallawaySantAnna uses weights-only survey (strata/PSU/FPC rejected), so cluster-as-PSU injection is unnecessary and misleading. The user's cluster= parameter is handled by the existing non-survey clustering path. Removed _inject_cluster_as_psu call and unused imports. Added aggregate="group" and aggregate="all" survey tests for CS. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Highest unmitigated severity: Static review only. I did not run Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
CallawaySantAnna operates at unit level, so df_survey should be n_units-1 for weights-only designs, not n_obs-1 from the long panel. The panel-level df overstated degrees of freedom, making p-values/CIs too optimistic. - Precomputed and fit() both use len(all_units)-1 for df_survey - survey_metadata.df_survey overridden to unit-level - Added test_df_survey_is_unit_level asserting df == n_units-1 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Highest unmitigated severity: Static review only. I couldn’t run the local test suite because Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
Zero NaN coefficients before prediction in _outcome_regression survey covariate branch, matching the DR path convention. Rank-deficient columns contribute 0 to the column space projection, so zeroing NaN betas gives the correct prediction. Previously, NaN coefficients propagated into predictions, turning estimable cells into NaN. Added test_reg_covariates_rank_deficient_survey with collinear covariates. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Highest unmitigated severity: Static review only. I could not run the test suite locally because Executive Summary
Methodology
No other unmitigated methodology findings. The new Code Quality No findings. Performance No findings. Maintainability No findings. Tech Debt No unmitigated findings. The new Security No findings. Documentation/Tests
Path to Approval
|
The regression-adjustment IF is already on the unweighted scale (WLS fits use weights internally but the IF is residual-based, not Riesz-weighted). De-weighting only applies to IPW/DR where Riesz representers are multiplied by survey weights. Updated REGISTRY note to document method-specific TSL handoff. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Highest unmitigated severity: Static review only. I cross-checked the changed estimator paths against Executive Summary
Methodology
No other unmitigated methodology findings. The new Code Quality No findings. Performance No findings. Maintainability No findings. Tech Debt
Security No findings. Documentation/Tests
Path to Approval
|
pretrend_test() uses unweighted demeaning and cluster-count F inference, which do not account for survey weights. Raise NotImplementedError when called after a survey-weighted fit. Store survey_design in _fit_data for detection. Survey-aware pretrend_test is planned for future work. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Highest unmitigated severity: Static review only. I could not execute the suite in this environment because Executive Summary
Methodology
No other unmitigated methodology findings. Code Quality No findings. Performance No findings. Maintainability No findings. Tech Debt
Security No findings. Documentation/Tests
Path to Approval
|
The pweight normalization preserves relative unit weights on unbalanced panels because all IF/WIF formulas use weight ratios (sw_i/sum(sw)) where the normalization constant cancels. Added REGISTRY note explaining this and an unbalanced panel scale-invariance test confirming it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Highest unmitigated severity: Executive Summary
Methodology
No other unmitigated methodology findings in scope. Code Quality No findings. Performance No findings. Maintainability No findings. Tech Debt
Security No findings. Documentation/Tests
Path to Approval
|
Replace per-cell survey_weight_sum with fixed cohort-level weight sums computed once from the full unit-level sample, matching R's did::aggte() which uses pg = n_g / N (cohort shares from the full sample, not varying by cell). On unbalanced panels, cell-specific sums can differ from fixed cohort shares. The WIF already uses fixed pg — this aligns the aggregation point estimate with the WIF computation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Highest unmitigated severity: Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
Summary
weightsparameter tosolve_logit()— survey weights enter IRLS asw_survey * mu*(1-mu), enabling survey-weighted propensity score estimationsolve_logit()), and DR methods with explicit influence functions; survey-weighted WIF in aggregation; Cholesky cache bypassed under survey weightssolve_logit()now available, removing Phase 3 deferral)did, not full TSL)n_clbug from PR EfficientDiD: cluster-robust SEs, last-cohort control, Hausman pretest, small cohort warning #230 in TODO.mdMethodology references (required if estimator / math changes)
did::att_gt()analytical pathsvyglm(family=binomial)— working weights = survey_w × mu × (1−mu)Validation
tests/test_survey_phase4.py(41 tests),tests/test_survey_phase3.py(2 tests updated)Security / privacy
Generated with Claude Code
🤖 Generated with Claude Code