ci: bump high-stats NTHREADS to 128 and pass --noHessian to ZMassDilepton externalPostfit step #693

Draft
bendavid wants to merge 1 commit into WMass:main from bendavid:ci-noHessian-nthreads

@bendavid
Collaborator

Summary

Two independent CI fixes for the high-stats workflow.

1. NTHREADS 64 → 128 in the high-stats setenv branch

The setup 1:1 data:mc events branch (the high-stats path, triggered by workflow_dispatch or by any schedule other than 30 5 * * 1-5) was timing out in some steps at NTHREADS=64. Bump it to 128 so those steps fit comfortably under the per-step time budget. PR and reference runs (the other two branches in setenv) are unaffected.
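
The trigger condition above can be sketched as a small predicate. This is a hypothetical Python illustration, not the workflow's actual setenv logic: the function name is made up, and it assumes the event name and cron string are available as plain strings. The thread counts for the other two branches are not stated here, so they are left out.

```python
from typing import Optional

# Cron expression named in the description; any other schedule is high-stats.
REFERENCE_CRON = "30 5 * * 1-5"


def is_high_stats(event_name: str, schedule: Optional[str] = None) -> bool:
    """Return True when a run would take the high-stats path (hypothetical helper)."""
    if event_name == "workflow_dispatch":
        return True
    if event_name == "schedule":
        return schedule != REFERENCE_CRON
    return False  # e.g. pull_request runs take one of the other two branches
```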

2. --noHessian on the dilepton ptll from wlike rabbit_fit step

The high-stats CI reproducibly fails dilepton-plotting at the step that invokes rabbit_fit.py with --externalPostfit ZMassWLike_eta_pt_charge/fitresults_uncorr.hdf5 --externalPostfitResult uncorr --noPostfitProfileBB. The fit results file is written successfully, then rabbit_fit.py:302 unconditionally calls edmval_cov(grad, hess) on the local loss Hessian at the externally supplied (non-stationary) point, and Cholesky fails with "4-th leading minor of the array is not positive definite". Away from the minimum the local Hessian is generically indefinite, so this is not a numerical bug.
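
To illustrate why a Cholesky factorization breaks away from the minimum, here is a minimal self-contained sketch. It does not reproduce edmval_cov, whose internals are not shown here. The toy loss f(x, y) = x**4 - x**2 + y**2 and the pure-Python cholesky helper are assumptions chosen only for illustration.

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric matrix (list of lists).

    Raises ValueError naming the first leading minor that is not positive
    definite.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError(f"{j + 1}-th leading minor is not positive definite")
        l[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            l[i][j] = (a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
    return l


def hessian(x, y):
    """Hessian of the toy loss f(x, y) = x**4 - x**2 + y**2."""
    return [[12.0 * x**2 - 2.0, 0.0], [0.0, 2.0]]


# At the minimum (x = 1/sqrt(2), y = 0) the Hessian is positive definite.
at_minimum = cholesky(hessian(1.0 / math.sqrt(2.0), 0.0))

# At an off-minimum point the curvature in x is negative and Cholesky fails.
try:
    cholesky(hessian(0.1, 0.0))
    off_minimum_error = None
except ValueError as err:
    off_minimum_error = str(err)
```

In the toy case the factorization succeeds at the stationary point and raises at x = 0.1, where the second derivative in x is 12 * 0.01 - 2 = -1.88.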

rabbit_fit.py already has --noHessian (line 298 gates exactly this block). Passing it skips the EDM and covariance computation while still saving all the histograms and writing fitresults_from_ZMassWLike_eta_pt_charge.hdf5. The downstream rabbit_plot_hists.py step reads that file and the externally loaded postfit's own covariance, neither of which depends on the indefinite local Hessian.

Reproduced in the most recent high-stats run (https://github.com/WMass/WRemnants/actions/runs/26352296494) and in the preceding one (muon_reweight_test on 2026-05-23, ID 26327021423). Both show the same 4-th leading minor failure, which pre-exists the muon_reweight branch changes.

Test plan

  • CI on this PR (PR-mode path is unaffected by either change).
  • Trigger high-stats workflow_dispatch on this branch and confirm dilepton-plotting passes and the previously timed-out steps finish.

🤖 Generated with Claude Code

…pton externalPostfit step

The high-stats workflow (``setup 1:1 data:mc events`` branch of
``setenv``) was timing out at NTHREADS=64; bump to 128 to keep
under the per-step time budget.

The ``dilepton ptll from wlike`` step invokes ``rabbit_fit.py`` with
``--externalPostfit`` (loading the wlike fit's ``uncorr`` postfit
values) and no local re-minimisation. ``rabbit_fit.py`` then
unconditionally tries to compute an EDM + covariance from the local
Hessian at that externally-supplied point, which is generically
indefinite off-minimum — Cholesky failed at the 4th leading minor
and the step exited non-zero, even though the fit-results file had
already been written. Add ``--noHessian`` (already wired in
``rabbit_fit.py`` at line 298) to skip that local Hessian / EDM
computation; the downstream plotting step reads the saved
``fitresults_from_ZMassWLike_eta_pt_charge.hdf5`` and the
externally-loaded postfit covariance, neither of which depended on
the indefinite local Hessian.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@bendavid
Collaborator Author

These changes do fix the CI run, but increasing the number of threads to 128 leads to even worse oversubscription of the CPU cores on the node, so we should reconsider/improve how this is handled.

A better immediate alternative might be to just increase the timeout from 6 to 12 hours and keep the number of threads at 64.
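
One possible way to guard against oversubscription, sketched in Python under stated assumptions: the helper names are hypothetical, the node's core count is not given in this thread, and nothing here reflects how the workflow currently sets NTHREADS. The idea is simply to clamp the requested thread count to the cores the process can actually use.

```python
import os
from typing import Optional


def available_cores() -> int:
    """Number of cores this process may use (falls back to 1 if unknown)."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def effective_threads(requested: int, cores: Optional[int] = None) -> int:
    """Clamp a requested thread count to the usable core count."""
    if cores is None:
        cores = available_cores()
    return max(1, min(requested, cores))
```

For example, effective_threads(128, 64) gives 64, so a request for 128 threads on a 64-core allocation would not oversubscribe.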

For the rabbit plotting failure (trying to compute the covariance matrix with external fit values) this arguably can/should be fixed on the rabbit side by skipping the covariance computation automatically in this case.
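
A rough sketch of what such an automatic skip could look like. This is a hypothetical helper, not code from rabbit; the argument names only mirror the --noHessian and --externalPostfit flags mentioned above.

```python
from typing import Optional


def should_compute_hessian(no_hessian: bool, external_postfit: Optional[str]) -> bool:
    """Decide whether to run the local EDM + covariance step (hypothetical helper)."""
    if no_hessian:
        return False  # explicit opt-out, as with --noHessian
    if external_postfit is not None:
        return False  # parameters were loaded, not minimized locally
    return True
```

With a rule like this, the CI step would not need to pass --noHessian explicitly whenever --externalPostfit is given.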

bendavid marked this pull request as draft on May 25, 2026, 12:47