ci: bump high-stats NTHREADS to 128 and pass --noHessian to ZMassDilepton externalPostfit step #693
Draft
bendavid wants to merge 1 commit into
…pton externalPostfit step

The high-stats workflow (``setup 1:1 data:mc events`` branch of ``setenv``) was timing out at NTHREADS=64; bump to 128 to keep under the per-step time budget.

The ``dilepton ptll from wlike`` step invokes ``rabbit_fit.py`` with ``--externalPostfit`` (loading the wlike fit's ``uncorr`` postfit values) and no local re-minimisation. ``rabbit_fit.py`` then unconditionally tries to compute an EDM + covariance from the local Hessian at that externally-supplied point, which is generically indefinite off-minimum: Cholesky failed at the 4th leading minor and the step exited non-zero, even though the fit-results file had already been written.

Add ``--noHessian`` (already wired in ``rabbit_fit.py`` at line 298) to skip that local Hessian / EDM computation; the downstream plotting step reads the saved ``fitresults_from_ZMassWLike_eta_pt_charge.hdf5`` and the externally-loaded postfit covariance, neither of which depended on the indefinite local Hessian.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
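To illustrate why a Cholesky factorisation can fail away from a minimum, here is a minimal pure-Python sketch. It is not taken from ``rabbit_fit.py``: the toy loss, the ``cholesky`` helper, and the ``hessian`` function are all hypothetical, and the error text merely imitates the message quoted in this PR.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T == a.

    Raises ValueError naming the first leading minor that is not
    positive definite (wording imitates the message quoted in the PR).
    """
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Pivot for column j; it must be strictly positive.
        d = a[j][j] - sum(low[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError(
                f"{j + 1}-th leading minor of the array is not positive definite"
            )
        low[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def hessian(y, x):
    """Hessian of the toy loss f(y, x) = y^2 + (x^2 - 1)^2 (hypothetical)."""
    return [[2.0, 0.0], [0.0, 12.0 * x * x - 4.0]]


def is_positive_definite(h):
    try:
        cholesky(h)
        return True
    except ValueError:
        return False


# At the minimum (y=0, x=1) the Hessian is positive definite.
at_minimum = is_positive_definite(hessian(0.0, 1.0))
# Away from the minimum (x=0.2) the curvature along x is negative.
off_minimum = is_positive_definite(hessian(0.0, 0.2))
```

The point of the sketch is only that an indefinite Hessian at a non-stationary point is expected behaviour of the loss surface, so skipping the factorisation is a reasonable response when no local minimisation was run.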
Collaborator
Author
These changes do fix the CI run, but increasing the number of threads to 128 leads to even worse oversubscription of the CPU cores on the node, so we should reconsider/improve how this is handled. A better immediate alternative might be to just increase the timeout from 6 to 12 hours and keep the number of threads at 64.

For the rabbit plotting failure (trying to compute the covariance matrix with external fit values) this arguably can/should be fixed on the rabbit side by skipping the covariance computation automatically in this case.
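One possible shape for that rabbit-side fix, as a hedged sketch only: the flag names ``--externalPostfit`` and ``--noHessian`` come from this PR, but ``build_parser`` and ``should_compute_hessian`` are hypothetical helpers and do not reflect how ``rabbit_fit.py`` is actually organised. The sketch also assumes that supplying an external postfit always means no local re-minimisation happens.

```python
import argparse


def build_parser():
    """Hypothetical parser exposing only the two flags discussed here."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--externalPostfit", default=None)
    parser.add_argument("--noHessian", action="store_true")
    return parser


def should_compute_hessian(args):
    """Decide whether the local EDM + covariance step should run.

    Assumption: with an external postfit the parameters are not at a
    local minimum of this fit, so the local Hessian is not meaningful.
    """
    if args.noHessian:
        return False
    if args.externalPostfit is not None:
        return False
    return True


external = build_parser().parse_args(["--externalPostfit", "fitresults_uncorr.hdf5"])
plain = build_parser().parse_args([])
skip_for_external = not should_compute_hessian(external)
run_for_plain = should_compute_hessian(plain)
```

With a guard like this the CI step would not need to pass ``--noHessian`` explicitly, although the explicit flag remains the smaller change for this PR.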
Summary
Two independent CI fixes for the high-stats workflow.

1. ``NTHREADS`` 64 → 128 in the high-stats ``setenv`` branch

The ``setup 1:1 data:mc events`` branch (high-stats path, triggered by ``workflow_dispatch`` or any non-``30 5 * * 1-5`` schedule) was timing out steps at ``NTHREADS=64``. Bump to 128 to fit comfortably under the per-step time budget. PR/reference runs (the other two branches in ``setenv``) are unaffected.

2. ``--noHessian`` on the ``dilepton ptll from wlike`` rabbit_fit step

The high-stats CI repeatably fails ``dilepton-plotting`` at the step that invokes ``rabbit_fit.py`` with ``--externalPostfit ZMassWLike_eta_pt_charge/fitresults_uncorr.hdf5 --externalPostfitResult uncorr --noPostfitProfileBB``. The fit results file is written successfully, then ``rabbit_fit.py:302`` unconditionally calls ``edmval_cov(grad, hess)`` on the local loss Hessian at the externally-supplied (non-stationary) point, and Cholesky fails with ``4-th leading minor of the array is not positive definite``. Off-minimum, the local Hessian is generically indefinite; this isn't a numerical bug.

``rabbit_fit.py`` already has ``--noHessian`` (line 298 gates exactly this block); passing it skips the EDM + covariance computation while still saving all the histograms and writing ``fitresults_from_ZMassWLike_eta_pt_charge.hdf5``. The downstream ``rabbit_plot_hists.py`` step reads from there and the externally-loaded postfit's own covariance, neither of which depends on the indefinite local Hessian.

Reproduced in the most recent high-stats run (https://github.com/WMass/WRemnants/actions/runs/26352296494) and the preceding one (``muon_reweight_test`` on 2026-05-23, ID 26327021423): same ``4-th leading minor`` failure, both predating the muon_reweight branch changes.

Test plan
``dilepton-plotting`` passes and the previously timed-out steps finish.

🤖 Generated with Claude Code