
evals metadata in builtin JUnit reporter #22

Open
giovanni-guidini wants to merge 1 commit into main from gio/explore-junit-report-annotations

Conversation

@giovanni-guidini

These changes annotate the `testTask` in a way that the built-in JUnit reporter can expose the evals metadata, similar to what is done for the JSON reporter.

The formatting is discussed at length [here](https://www.notion.so/sentry/Evals-Schema-for-JUnit-XML-2098b10e4b5d80609d62f5beffc3de26?source=copy_link). JUnit is a very pervasive format, well established in the industry.

As an example, generating the JUnit report for the tests in this repo, we get the following for one of the tests:

```xml
<testcase classname="src/ai-sdk-integration.test.ts" name="Tool Argument Validation &gt; What&apos;s the weather in Seattle in Celsius?" time="0.004360291">
  <properties>
    <property name="evals.scores.score_0.value" value="1" />
    <property name="evals.scores.score_0.type" value="float" />
    <property name="evals.scores.score_0.metadata.rationale" value="All expected tools were called" />
    <property name="evals.toolCalls.0.id" value="call_9999" />
    <property name="evals.toolCalls.0.name" value="getWeather" />
    <property name="evals.toolCalls.0.arguments.location" value="Seattle" />
    <property name="evals.toolCalls.0.arguments.units" value="celsius" />
    <property name="evals.toolCalls.0.result.temperature" value="18" />
    <property name="evals.toolCalls.0.result.condition" value="partly cloudy" />
    <property name="evals.toolCalls.0.status" value="completed" />
    <property name="evals.toolCalls.0.type" value="function" />
  </properties>
</testcase>
```
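The dotted property names above come from flattening the nested evals data (scores, tool calls) into `evals.`-prefixed paths, since JUnit properties are flat string key/value pairs. Here is a minimal sketch of that flattening in TypeScript; the helper name `flattenEvals` is illustrative, not the PR's actual code:

```ts
// Flatten nested evals data into the dotted "evals.*" keys shown above.
// Arrays contribute numeric segments (e.g. "toolCalls.0"); leaves are stringified.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

function flattenEvals(value: Json, prefix = "evals"): Record<string, string> {
  const out: Record<string, string> = {};
  if (value !== null && typeof value === "object") {
    const entries: Array<readonly [string, Json]> = Array.isArray(value)
      ? value.map((v, i) => [String(i), v] as const)
      : Object.entries(value);
    for (const [key, child] of entries) {
      Object.assign(out, flattenEvals(child, `${prefix}.${key}`));
    }
  } else {
    out[prefix] = String(value);
  }
  return out;
}

// Yields { "evals.toolCalls.0.name": "getWeather",
//          "evals.toolCalls.0.arguments.location": "Seattle", ... }
flattenEvals({
  toolCalls: [
    { name: "getWeather", arguments: { location: "Seattle", units: "celsius" } },
  ],
});
```

Note that the score keys in the example read `score_0` rather than a bare array index, so scores are presumably keyed by name before flattening.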


codecov Bot commented Jul 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.43%. Comparing base (c0945ef) to head (22b06c7).
⚠️ Report is 19 commits behind head on main.

Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main      #22      +/-   ##
==========================================
+ Coverage   84.50%   86.43%   +1.93%
==========================================
  Files           4        4
  Lines         400      457      +57
  Branches      115      134      +19
==========================================
+ Hits          338      395      +57
  Misses         62       62
```
| Flag | Coverage Δ |
| --- | --- |
| unittests | 86.43% <100.00%> (+1.93%) ⬆️ |

Flags with carried forward coverage won't be shown.



giovanni-guidini requested a review from dcramer on Jul 18, 2025 at 15:43
```ts
  name: "Factuality",
  score: 0.9,
  metadata: {
    llm_judge: "gemini_2.5pro",
```
A member commented:

should we consider just using the term `model` here?

aside: I think metadata might be entirely arbitrary in scorers

