
feat(text-metrics): split oneig_reasoning #648

Open

davidberenstein1957 wants to merge 1 commit into feat/vlm-pr-3c-text-score-pair from feat/vlm-pr-3d-oneig-reasoning

Conversation

@davidberenstein1957 (Member) commented Apr 28, 2026

Summary

Splits oneig_reasoning into its own stacked PR, adds OneIGReasoningMetric, and introduces lazy MetricRegistry loading required by this metric family.
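For reviewers unfamiliar with the lazy-loading pattern, here is a minimal sketch of what lazy registry resolution typically looks like. This is an illustration, not the code in src/pruna/evaluation/metrics/registry.py; the `register_lazy`/`get` names and the module path in the usage example are assumptions.

```python
import importlib


class LazyMetricRegistry:
    """Illustrative registry that defers importing heavy metric modules."""

    def __init__(self) -> None:
        self._lazy: dict[str, tuple[str, str]] = {}   # name -> (module path, class name)
        self._resolved: dict[str, type] = {}          # name -> imported class

    def register_lazy(self, name: str, module_path: str, class_name: str) -> None:
        # Record where the metric lives; nothing is imported yet.
        self._lazy[name] = (module_path, class_name)

    def get(self, name: str) -> type:
        # Import the metric's module on first access and cache the class.
        if name not in self._resolved:
            module_path, class_name = self._lazy[name]
            module = importlib.import_module(module_path)
            self._resolved[name] = getattr(module, class_name)
        return self._resolved[name]


# Hypothetical registration; the real name and path may differ.
registry = LazyMetricRegistry()
registry.register_lazy(
    "oneig_reasoning",
    "pruna.evaluation.metrics.metric_oneig_reasoning",
    "OneIGReasoningMetric",
)
```

The point of the pattern is that registering every metric stays cheap; the multi-gigabyte model dependencies behind a metric are only imported when that metric is actually requested.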

Stack Position

This PR is branch 6 of 10 in the stack; see the Review Flow section below.

Files

  • src/pruna/evaluation/metrics/metric_oneig_reasoning.py
  • src/pruna/evaluation/metrics/registry.py
  • src/pruna/evaluation/benchmarks.py
  • tests/evaluation/test_text_metrics.py

Test Plan

uv run pytest tests/evaluation/test_text_metrics.py -k oneig_reasoning

Review Focus

  • Reasoning score extraction and averaging (see the sketch after this list)
  • Lazy metric loading behavior in registry
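For the first bullet, a minimal sketch of the accumulate-then-average shape such metrics usually take, assuming a torchmetrics-style update()/compute() split; all names here are illustrative, not the PR's actual OneIGReasoningMetric code:

```python
class ReasoningScoreAverager:
    """Illustrative accumulator: collect per-sample scores, average at the end."""

    def __init__(self) -> None:
        self._total = 0.0
        self._count = 0

    def update(self, batch_scores: list[float]) -> None:
        # Each batch contributes its per-sample reasoning scores.
        self._total += sum(batch_scores)
        self._count += len(batch_scores)

    def compute(self) -> float:
        # Mean over all samples seen; 0.0 if nothing was scored.
        return self._total / self._count if self._count else 0.0
```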

Review Flow (Order)

Review the stack in this exact order:

  1. feat(vendor): add LLM2Vec embedding model #637 vendor
  2. feat(infrastructure): add VLM base classes and utilities #638 infrastructure
  3. feat(text-metrics): split qa_accuracy #645 qa_accuracy
  4. feat(text-metrics): split oneig_alignment #646 oneig_alignment
  5. feat(text-metrics): split text_score pair #647 text_score pair
  6. feat(text-metrics): split oneig_reasoning #648 oneig_reasoning
  7. feat(vision-metrics): split vqa #649 vqa
  8. feat(vision-metrics): split vie_score #650 vie_score
  9. feat(vision-metrics): split img_edit_score #651 img_edit_score
  10. feat(e2e-tests): stacked e2e after split metrics #641 e2e tests

This PR in the flow (6/10)


@cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.

Reviewed by Cursor Bugbot for commit e586366.

```python
model_name=self.model_name,
llm_model_name=self.llm_model_name,
device=self.device,
)
```
Scorer not cached, models reload every batch

High Severity

_get_scorer creates a brand new _LLM2CLIPScorer instance on every update() call when no mock scorer is injected (the production path). Because _LLM2CLIPScorer._load_models lazily loads models by checking self._clip_model is not None, and each new instance starts with self._clip_model = None, the multi-gigabyte LLM2CLIP checkpoints are re-loaded from disk on every batch. The method needs to cache the newly created scorer in self._scorer before returning it.
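A minimal sketch of the suggested fix, assuming the metric keeps a `self._scorer` slot initialized to None (as the mock-injection path implies); this mirrors, rather than reproduces, the PR's code:

```python
def _get_scorer(self) -> "_LLM2CLIPScorer":
    # Reuse the cached scorer (mock-injected or previously built) so the
    # multi-gigabyte LLM2CLIP checkpoints are loaded at most once.
    if self._scorer is None:
        self._scorer = _LLM2CLIPScorer(
            model_name=self.model_name,
            llm_model_name=self.llm_model_name,
            device=self.device,
        )
    return self._scorer
```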


```python
self.model_name,
self.llm_model_name,
)
dtype = torch.bfloat16 if self.device == "cuda" else torch.float32
```

Device check misses "cuda:N" device strings

Medium Severity

The dtype selection (self.device == "cuda") and the autocast guard in score both use exact string equality, so they fail to match indexed device strings like "cuda:0". When the framework's device resolution produces "cuda:0" (which set_to_best_available_device("cuda") does via _resolve_cuda_device), the model loads in float32 instead of bfloat16 and autocast is skipped. The nearby attn_impl line correctly uses startswith("cuda:") but these two checks do not.

Additional Locations (1)
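One way to express the fix: a small predicate that matches both bare and indexed CUDA device strings, used wherever the dtype is chosen. The helper name is ours, not the PR's:

```python
import torch


def is_cuda_device(device: str) -> bool:
    """True for "cuda" and indexed forms such as "cuda:0" or "cuda:1"."""
    return device == "cuda" or device.startswith("cuda:")


device = "cuda:0"  # e.g. what set_to_best_available_device("cuda") resolves to
dtype = torch.bfloat16 if is_cuda_device(device) else torch.float32
```

The same predicate would also gate the torch.autocast call in score, fixing both reported locations at once.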

```python
input_pixels = self._processor(images=pil_images, return_tensors="pt").pixel_values.to(self.device)
captions = [text_prompt]
text_features = self._l2v.encode(captions, convert_to_tensor=True, device=self.device).to(self.device)
text_features = self._clip_model.get_text_features(text_features)
```

Text features computed outside torch.no_grad() context

Medium Severity

In _LLM2CLIPScorer.score, self._clip_model.get_text_features(text_features) runs outside torch.no_grad(), while the analogous get_image_features call is correctly wrapped. The model's parameters still have requires_grad=True (only train(mode=False) was called, not requires_grad_(False)), so PyTorch builds a full computation graph for the text projection forward pass that is never used for backprop. This wastes GPU memory and is inconsistent with the image feature computation on the very next line.
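The fix is to widen the existing no-grad region (or add one) so the text projection runs without autograd bookkeeping; a sketch using the names from the excerpt above:

```python
with torch.no_grad():
    # Both encode steps are inference-only; skipping graph construction
    # matches the image-feature path and saves GPU memory.
    text_features = self._l2v.encode(captions, convert_to_tensor=True, device=self.device).to(self.device)
    text_features = self._clip_model.get_text_features(text_features)
```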


Adds oneig_reasoning metric, lazy registry loading, and focused benchmark/test wiring as the final text-metric stack branch.

Made-with: Cursor
davidberenstein1957 force-pushed the feat/vlm-pr-3c-text-score-pair branch from 3cdc2bb to 946b068 on May 8, 2026 at 09:01
davidberenstein1957 force-pushed the feat/vlm-pr-3d-oneig-reasoning branch from e586366 to c853dd2 on May 8, 2026 at 09:01