feat(text-metrics): split qa_accuracy #645
davidberenstein1957 wants to merge 2 commits into feat/vlm-pr-2-infrastructure
Conversation
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 15db155.
```diff
             "between prompts and generated images via VQA-style questions."
         ),
-        metrics=["clip_score"],  # §3.2: Mask2Former; not in Pruna
+        metrics=["qa_accuracy", "clip_score"],  # strict QA + CLIP score
```
GenEval benchmark uses lenient mean aggregation
Medium Severity
The GenEval benchmark's qa_accuracy metric is intended for "strict QA" (all-or-nothing aggregation). However, Task.from_benchmark doesn't pass the all_or_nothing kwarg, causing QAAccuracyMetric to default to mean aggregation. This leads to inflated scores that are not comparable to the paper's reference.
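For context, a minimal sketch of the difference between the two modes in plain Python; the data and variable names are illustrative, not Pruna's implementation:

```python
from statistics import mean

# Per-prompt VQA results: each inner list is pass/fail per question.
results = [[True, True, True], [True, False, True], [True, True, False]]

# Lenient mean aggregation: every question counts independently (7/9 ≈ 0.78).
lenient = mean(q for prompt in results for q in prompt)

# Strict all-or-nothing: a prompt scores only if *every* question passes (1/3 ≈ 0.33).
strict = mean(all(prompt) for prompt in results)
```

On GenEval-style data the strict score is typically much lower than the lenient mean, which is why defaulting to mean aggregation inflates results relative to the paper.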
```python
raise ValueError(
    f"qa_accuracy aggregation must be one of {{'mean', 'all_or_nothing'}}. Got: {self.aggregation!r}."
)
self.metric_units = type(self).metric_units
```
Redundant metric_units self-assignment in __init__
Low Severity
Setting self.metric_units = type(self).metric_units is a no-op because metric_units is already declared as a class attribute and nothing earlier in __init__ (neither super().__init__ nor the surrounding code) overrides it. The line just shadows the class attribute with the same value and adds maintenance noise.
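A minimal illustration of why the assignment is a no-op (toy class, not Pruna's code): attribute lookup on an instance already falls back to the class attribute, so copying it onto self just re-binds the same value.

```python
class Metric:
    metric_units = "accuracy"  # class attribute, shared by all instances

    def __init__(self):
        # Redundant: lookup on self would resolve to the class attribute anyway;
        # this only shadows it with the very same object.
        self.metric_units = type(self).metric_units

m = Metric()
print(m.metric_units)       # "accuracy" -- with or without the __init__ line
print(Metric.metric_units)  # "accuracy"
```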
Force-pushed from 21212de to 7054e53
Force-pushed from 15db155 to 04ab2e5
Isolates qa_accuracy metric implementation and GenEval benchmark wiring so it can be reviewed independently before stacking the remaining text metrics.
Made-with: Cursor
Co-authored-by: Cursor <cursoragent@cursor.com>
Force-pushed from 04ab2e5 to 161223e


Summary
Splits `qa_accuracy` into its own stacked PR, adds `QAAccuracyMetric`, and wires GenEval to `qa_accuracy` + `clip_score`.

Stack Position
- `feat/vlm-pr-2-infrastructure` (base of this PR)
- `feat/vlm-pr-3b-oneig-alignment`
- `feat/vlm-pr-5-e2e-tests`
- `feat/metrics-vlm-support`

Files
- `src/pruna/evaluation/metrics/metric_qa_accuracy.py`
- `src/pruna/evaluation/benchmarks.py`

Test Plan

Review Focus
- Aggregation semantics (`mean` vs. `all_or_nothing`), as sketched below

Review Flow (Order)
Review the stack in this exact order: this PR is 3/10 in the flow.
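To anchor the aggregation review focus, a hypothetical usage sketch. The `all_or_nothing` keyword comes from Bugbot's comment and the `{'mean', 'all_or_nothing'}` values from the ValueError shown earlier; the real constructor signature should be verified against `metric_qa_accuracy.py` before relying on this.

```python
# Hypothetical usage; keyword name inferred from the review comments, not verified.
from pruna.evaluation.metrics.metric_qa_accuracy import QAAccuracyMetric

strict = QAAccuracyMetric(all_or_nothing=True)  # GenEval-style strict QA
lenient = QAAccuracyMetric()                    # current default: mean aggregation
```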