
feat(text-metrics): split oneig_alignment #646

Open
davidberenstein1957 wants to merge 2 commits into feat/vlm-pr-3a-qa-accuracy from feat/vlm-pr-3b-oneig-alignment

Conversation

@davidberenstein1957 (Member) commented Apr 28, 2026

Summary

Splits oneig_alignment into its own stacked PR, adds OneIGAlignmentMetric, and wires OneIG alignment subset benchmarks.

Stack Position

Files

  • src/pruna/evaluation/metrics/metric_oneig_alignment.py
  • src/pruna/evaluation/benchmarks.py
  • tests/evaluation/test_text_metrics.py

Test Plan

uv run pytest tests/evaluation/test_text_metrics.py -k oneig_alignment

Review Focus

  • Dependency-mask semantics
  • OneIG subset benchmark mapping

Review Flow (Order)

Review the stack in this exact order:

  1. feat(vendor): add LLM2Vec embedding model #637 vendor
  2. feat(infrastructure): add VLM base classes and utilities #638 infrastructure
  3. feat(text-metrics): split qa_accuracy #645 qa_accuracy
  4. feat(text-metrics): split oneig_alignment #646 oneig_alignment
  5. feat(text-metrics): split text_score pair #647 text_score pair
  6. feat(text-metrics): split oneig_reasoning #648 oneig_reasoning
  7. feat(vision-metrics): split vqa #649 vqa
  8. feat(vision-metrics): split vie_score #650 vie_score
  9. feat(vision-metrics): split img_edit_score #651 img_edit_score
  10. feat(e2e-tests): stacked e2e after split metrics #641 e2e tests

This PR in the flow (4/10)


@cursor (Bot) left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.


Reviewed by Cursor Bugbot for commit 2627d78.

    Benchmark(
        name="OneIG Portrait",
        description="OneIG subset: people and portraits.",
        metrics=["oneig_alignment"],
OneIG subset benchmarks fail to register at import

High Severity

The new OneIG benchmarks create lookup keys (e.g., OneIGAnimeStylization) that are missing from base_datasets. This causes BenchmarkRegistry._register to raise a ValueError, failing module import for pruna.evaluation.benchmarks and its dependents. The original OneIG entry was also removed.
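The failure mode described above can be sketched as follows. This is a minimal reconstruction, not the actual pruna code: the contents of `base_datasets` and the exact `_register` signature are assumptions, but it shows why a missing lookup key turns into an import-time failure rather than a runtime one.

```python
# base_datasets maps dataset keys known to the registry; the new OneIG subset
# keys (e.g. "OneIGAnimeStylization") were never added here.
base_datasets = {"GenEval": "geneval_prompts"}  # hypothetical contents


class BenchmarkRegistry:
    _benchmarks = {}

    @classmethod
    def _register(cls, name, dataset_key):
        # Eager validation: an unknown key raises ValueError while the module
        # body is still executing, so importing pruna.evaluation.benchmarks
        # (and anything that imports it) fails outright.
        if dataset_key not in base_datasets:
            raise ValueError(f"unknown base dataset: {dataset_key}")
        cls._benchmarks[name] = dataset_key


# A registered key succeeds; the new OneIG subset key raises.
BenchmarkRegistry._register("GenEval", "GenEval")
try:
    BenchmarkRegistry._register("OneIG Anime Stylization", "OneIGAnimeStylization")
except ValueError as exc:
    print(exc)  # prints "unknown base dataset: OneIGAnimeStylization"
```

Because registration runs at module import, the fix needs either the missing `base_datasets` entries or deferred validation; merely catching the error at call sites would not help dependents that import the module.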


    assert "21" not in record["questions"]
    assert "21" not in record["dependencies"]
    assert record["questions"] == {"1": "Is there a cat?"}
    assert record["dependencies"] == {"1": [0]}
OneIG record test calls function with wrong signature

Medium Severity

The test_to_oneig_record_strips_null_questions_and_dependencies test calls _to_oneig_record with an incorrect number of arguments, causing a TypeError. The test also asserts that null-valued questions and dependencies are stripped, a behavior not implemented in _to_oneig_record.
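The stripping behavior the test expects (but `_to_oneig_record` does not implement) can be sketched as a small helper. The function name and standalone shape are hypothetical; it only encodes the semantics implied by the test's assertions: drop questions whose text is null, along with their dependency entries.

```python
def strip_null_qd(questions, dependencies):
    """Drop questions with None text and their orphaned dependency entries.

    Hypothetical helper mirroring what the test asserts _to_oneig_record
    should do, not the current implementation.
    """
    kept = {qid: text for qid, text in questions.items() if text is not None}
    deps = {qid: dep for qid, dep in dependencies.items() if qid in kept}
    return kept, deps


questions, dependencies = strip_null_qd(
    {"1": "Is there a cat?", "21": None},
    {"1": [0], "21": [1]},
)
# questions  -> {"1": "Is there a cat?"}
# dependencies -> {"1": [0]}
```

With semantics like these in `_to_oneig_record` (and the test's call site fixed to match the real signature), the quoted assertions would pass.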


        metrics=["oneig_alignment"],
        task_type="text_to_image",
        reference="https://arxiv.org/abs/2506.07977",
    ),

OneIG Multilingualism subset has no alignment questions

Medium Severity

The OneIG Multilingualism benchmark wires the oneig_alignment metric, but _CATEGORY_TO_QD in prompt.py only maps Anime_Stylization, Portrait, and General_Object to Q_D files. Multilingualism rows therefore receive an empty questions dict in _to_oneig_record, and OneIGAlignmentMetric.update raises ValueError whenever questions is empty. Running this benchmark would error out on every sample rather than produce alignment scores.
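The interaction between the category mapping and the metric's guard can be sketched like this. File names and function shapes are hypothetical stand-ins for `_CATEGORY_TO_QD`, `_to_oneig_record`, and `OneIGAlignmentMetric.update`; the point is only that an unmapped category yields an empty questions dict, which the update guard then rejects on every sample.

```python
# Hypothetical mirror of prompt.py's mapping: only three categories have
# question/dependency (Q_D) files; Multilingualism is absent.
_CATEGORY_TO_QD = {
    "Anime_Stylization": "anime_qd.json",
    "Portrait": "portrait_qd.json",
    "General_Object": "general_object_qd.json",
}


def questions_for(category, qd_tables):
    # Mirrors the described _to_oneig_record behavior: unmapped categories
    # silently produce an empty questions dict.
    return qd_tables.get(_CATEGORY_TO_QD.get(category), {})


def metric_update(questions):
    # Mirrors the described OneIGAlignmentMetric.update guard.
    if not questions:
        raise ValueError("oneig_alignment requires a non-empty questions dict")
    return len(questions)


qd_tables = {"portrait_qd.json": {"1": "Is the face visible?"}}
questions_for("Portrait", qd_tables)         # non-empty, update succeeds
questions_for("Multilingualism", qd_tables)  # {}, so update raises per sample
```

Either the Multilingualism subset needs its own Q_D mapping, or it should not wire `oneig_alignment` at all; as wired, every sample errors instead of scoring.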


Isolates qa_accuracy metric implementation and GenEval benchmark wiring so it can be reviewed independently before stacking the remaining text metrics.

Made-with: Cursor
Adds oneig_alignment metric implementation, its focused tests, and benchmark subset wiring while keeping reasoning and text-rendering metrics for later stacked PRs.

Made-with: Cursor
davidberenstein1957 force-pushed the feat/vlm-pr-3a-qa-accuracy branch from 15db155 to 04ab2e5 on May 8, 2026 at 09:01
davidberenstein1957 force-pushed the feat/vlm-pr-3b-oneig-alignment branch from 2627d78 to c51653e on May 8, 2026 at 09:01
davidberenstein1957 force-pushed the feat/vlm-pr-3a-qa-accuracy branch from 04ab2e5 to 161223e on May 8, 2026 at 09:44