Legal NLP tasks on Swiss data by rolshoven · Pull Request #1032 · huggingface/lighteval

rolshoven · 2025-10-31T17:54:24Z

This PR introduces a set of Swiss legal eval tasks, initially created by @JoelNiklaus and extended by me for the SLDS dataset. I also had to adapt the other included tasks because of recent changes in lighteval, but I think everything should be working again now.

NathanHB · 2025-11-03T13:56:59Z

hey ! thanks for your PR, don't hesitate to ping when you want a review :)

HuggingFaceDocBuilderDev · 2025-11-03T13:59:22Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

rolshoven · 2025-11-04T05:22:24Z

> hey ! thanks for your PR, don't hesitate to ping when you want a review :)

Hi! Thanks for your reply, a review would be nice :)

It is quite a large file since I just extended it; we could also split it into two distinct task sets. I have a version of the individual community task in another repository as a preview of what it could look like:

https://github.com/rolshoven/slds-eval/blob/main/custom_task/slds.py

NathanHB · 2025-11-04T10:29:26Z

yes so what you could do is make a folder named after your task and split your file into separate modules.

- prompts.py: for the judge prompt and task prompts
- metrics.py: for your metric defs
- main.py: for your task defs

this should be easier to review and maintain!
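A minimal sketch of the suggested layout, for illustration only. The package name `demo_legal`, the function names, and the plain-dict task entries are hypothetical stand-ins; a real community task would use lighteval's own task and metric classes, and the name `TASKS_TABLE` is an assumption about what `main.py` exposes. The script writes the three modules into a temporary directory and imports `main.py` to show that it is the only module that needs to export the task table.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical file contents; real modules would use lighteval's classes.
FILES = {
    "__init__.py": "",
    "prompts.py": '''
        JUDGE_PROMPT = "Score the summary from 0 to 10."

        def summarization_prompt(line):
            return "Summarize the decision: " + line["text"]
    ''',
    "metrics.py": '''
        def exact_match(prediction, reference):
            return float(prediction.strip() == reference.strip())
    ''',
    "main.py": '''
        from . import metrics, prompts

        # Only main.py exposes the task table; it pulls in the other modules.
        TASKS_TABLE = [
            {
                "name": "demo_legal:summarization",
                "prompt_function": prompts.summarization_prompt,
                "metrics": [metrics.exact_match],
            }
        ]
    ''',
}


def build_package(root: Path, name: str = "demo_legal") -> None:
    """Write the sketch package to disk so it can be imported normally."""
    package_dir = root / name
    package_dir.mkdir()
    for filename, source in FILES.items():
        (package_dir / filename).write_text(textwrap.dedent(source))


root = Path(tempfile.mkdtemp())
build_package(root)
sys.path.insert(0, str(root))
importlib.invalidate_caches()

main = importlib.import_module("demo_legal.main")
task = main.TASKS_TABLE[0]
prompt = task["prompt_function"]({"text": "example text"})
score = task["metrics"][0](" yes ", "yes")
```

Keeping prompts and metrics free of task definitions means each file can be reviewed on its own, which is the point of the suggestion above.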

NathanHB · 2025-11-17T12:15:02Z

hey @rolshoven ! do you need any help on this ? :)

rolshoven · 2025-11-17T12:45:23Z

@NathanHB Sorry, I was away for three weeks but now I'm back :) I hope I find some time soon to make the changes. But does that mean that the community task itself is a directory, which contains the three suggested files, and then only the main.py includes the DATASETS and TASK_TABLE variables?

- Introduced functions to load COMET and GEMBA metrics, providing clear error messages for missing dependencies.
- Disabled specific COMET metrics due to numpy version conflicts, with warnings logged for skipped metrics.
- Updated GPU metrics list to reflect the disabled COMET metrics.
- Improved code organization and clarity in metric processing functions.
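The loading pattern described in this commit message can be sketched roughly as follows. This is a generic illustration, not the code from the PR: `load_optional` and `available_metrics` are hypothetical helper names, and the module names in the usage lines are stand-ins rather than the real COMET or GEMBA packages.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def load_optional(module_name: str, install_hint: str):
    """Import a module, or raise an ImportError that says how to install it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"'{module_name}' is required for this metric. "
            f"Install it with: {install_hint}"
        ) from err


def available_metrics(candidates: dict[str, str], disabled: set[str]) -> list[str]:
    """Return the metric names that can be used; log a warning for the rest."""
    usable = []
    for metric_name, module_name in candidates.items():
        if metric_name in disabled:
            logger.warning("Skipping %s: explicitly disabled", metric_name)
            continue
        try:
            load_optional(module_name, f"pip install {module_name}")
        except ImportError as err:
            logger.warning("Skipping %s: %s", metric_name, err)
            continue
        usable.append(metric_name)
    return usable


# Stand-in names: "json" and "math" exist everywhere, the third module does not.
usable = available_metrics(
    {"metric_a": "json", "metric_b": "no_such_metric_backend_xyz", "metric_c": "math"},
    disabled={"metric_c"},
)
```

Raising at load time with an install hint, and only warning when a metric is skipped, keeps a missing optional dependency from failing the whole evaluation run.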

rolshoven · 2026-03-17T21:17:47Z

@NathanHB A little bit late, but I implemented the proposed changes and fixed some bugs. Could you review the changes and give feedback? Thank you :)

rolshoven · 2026-04-06T14:45:43Z

@NathanHB Friendly ping -- just wanted to check if you had a chance to look at this 🙂

NathanHB

Sorry for the delay on this! Approving the PR so you can merge asap, only a few nits to fix / be aware of.

Default `rescale_with_baseline` to False in BertScoreMultilingual. With near-1.0 baselines (e.g. German xlm-roberta-large layer-24 ≈ 0.98), the (score - baseline) / (1 - baseline) formula amplifies deviations ~50x, and the subsequent x100 scaling compounds it — empty/weak predictions could yield scores like -5000. Make rescaling opt-in, warn when enabled, and skip the x100 scaling in that case since rescaled scores are not bounded to [0, 1].
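The arithmetic behind this change can be checked with a short sketch. The functions below are illustrative and are not the code from `metrics.py`; the baseline of 0.98 is the value quoted above, and a raw score of 0.0 for an empty prediction is an assumption made for the example.

```python
def rescale(score: float, baseline: float) -> float:
    """Baseline rescaling as quoted above: (score - baseline) / (1 - baseline)."""
    return (score - baseline) / (1.0 - baseline)


def report(raw_f1: float, baseline: float, rescale_with_baseline: bool = False) -> float:
    """Hypothetical reporting step: rescaling is opt-in and skips the x100 scaling."""
    if rescale_with_baseline:
        # Rescaled scores are not bounded to [0, 1], so they are returned as-is.
        return rescale(raw_f1, baseline)
    return raw_f1 * 100.0


baseline = 0.98

# Every deviation from the baseline is divided by (1 - baseline) = 0.02,
# which is an amplification factor of about 50.
amplification = 1.0 / (1.0 - baseline)

# Rescaling followed by x100, for a raw score of 0.0: about -4900.
rescaled_then_scaled = rescale(0.0, baseline) * 100.0

# With rescaling off (the new default), the same raw score is reported as 0.0.
default_report = report(0.0, baseline)
```

A raw score of 0.0 lands near -4900 once both steps are applied, which matches the order of magnitude of the -5000 mentioned in the commit message.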

rolshoven · 2026-05-20T12:11:05Z

> Sorry for the delay on this! Approving the PR so you can merge asap, only a few nits to fix / be aware of.

@NathanHB Thank you for having a look! I made the suggested changes and I also set rescale_with_baseline to False, since rescaling may lead to large negative scores for specific outputs (such as empty responses):

https://github.com/rolshoven/lighteval/blob/df00dc169970849cced0a8ae2be88fbc5f277297/src/lighteval/tasks/multilingual/tasks/swiss_legal/metrics.py#L102-L110

If there is anything else that you think should change, let me know! Otherwise I'm looking forward to having the changes integrated :-)

rolshoven force-pushed the community_task_slds branch from 481bc4f to f3a626d on October 31, 2025 18:31

rolshoven added 10 commits March 17, 2026 22:13

Legal NLP tasks on Swiss data

83c8a07

refactor: split Swiss legal multilingual tasks into modular package

8d19996

Move Swiss legal task definitions into prompts/metrics/main modules and keep backward compatibility via swiss_legal_evals re-export.

refactor: update higher_is_better type in MetricGrouping

4032ce4

Changed the type of higher_is_better from a dictionary of callables to a dictionary of booleans for improved clarity and type safety. This is also how it has been used so far.
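As a rough illustration of why a dictionary of booleans is convenient here, the sketch below uses a hypothetical `MetricGroupingSketch` dataclass. It is not lighteval's `MetricGrouping` class, and `best_run` is an invented helper; the scores are made-up example values.

```python
from dataclasses import dataclass


@dataclass
class MetricGroupingSketch:
    """Hypothetical stand-in: one direction flag per metric name."""

    metric_names: list[str]
    higher_is_better: dict[str, bool]


def best_run(grouping: MetricGroupingSketch, metric: str, runs: dict[str, dict[str, float]]) -> str:
    """Pick the run with the best value for `metric`, respecting its direction."""
    pick = max if grouping.higher_is_better[metric] else min
    return pick(runs, key=lambda run_name: runs[run_name][metric])


grouping = MetricGroupingSketch(
    metric_names=["bleu", "ter"],
    # BLEU is better when higher; TER is an error rate, so lower is better.
    higher_is_better={"bleu": True, "ter": False},
)

# Made-up scores for two runs.
runs = {
    "model_a": {"bleu": 30.0, "ter": 55.0},
    "model_b": {"bleu": 25.0, "ter": 48.0},
}

best_bleu = best_run(grouping, "bleu", runs)
best_ter = best_run(grouping, "ter", runs)
```

With plain booleans, the direction of each metric can be read and type-checked directly instead of being hidden behind a callable.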

refactor: Updated prompts and implementation to match the latest SwiL…

ca7e834

…Tra-Bench code

Add Gemba dependency for Swiss legal evaluations and remove suite p…

8c449c2

…arameter from TranslationTask constructor.

Fix batched metric aggregation for grouped metric names

9113d53

Fixed missing system prompt

6d3fdf3

Judge models now are used through OpenRouter

bba0e91

fix reasoning model token handling when max_tokens is unset

cc22ecb

rolshoven force-pushed the community_task_slds branch from ac784b9 to cc22ecb on March 17, 2026 21:14

chore: trigger PR update

a4e0ba1

Merge branch 'main' into community_task_slds

406953f

NathanHB reviewed May 20, 2026

Comment thread src/lighteval/tasks/multilingual/tasks/swiss_legal/metrics.py Outdated

Comment thread src/lighteval/tasks/multilingual/tasks/swiss_legal/metrics.py Outdated

Comment thread src/lighteval/tasks/multilingual/tasks/swiss_legal/metrics.py

NathanHB approved these changes May 20, 2026

rolshoven added 4 commits May 20, 2026 13:59

fix: return raw score for BLEU, CHRF, and TER metrics instead of scal…

052586a

…ed values

fix: replaced accidental default value assignment with intended type …

fa341ac

…hint

fix: add error handling for unsupported languages in Swiss Landmark D…

011c404

…ecision Summarization judge

JoelNiklaus merged commit 8d29839 into huggingface:main May 20, 2026
4 checks passed

JoelNiklaus merged 16 commits into huggingface:main from rolshoven:community_task_slds

4 participants
