Legal NLP tasks on Swiss data#1032

Merged
JoelNiklaus merged 16 commits into huggingface:main from rolshoven:community_task_slds
May 20, 2026

Conversation

@rolshoven
Contributor

This PR introduces a set of Swiss legal eval tasks, initially created by @JoelNiklaus and extended by me for the SLDS dataset. I had to make some adaptations to the other included tasks as well due to recent changes in lighteval, but I think it should be working again now.

@rolshoven rolshoven force-pushed the community_task_slds branch from 481bc4f to f3a626d Compare October 31, 2025 18:31
@NathanHB
Member

NathanHB commented Nov 3, 2025

hey ! thanks for your PR, don't hesitate to ping when you want a review :)

@HuggingFaceDocBuilderDev
Collaborator

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@rolshoven
Contributor Author

> hey ! thanks for your PR, don't hesitate to ping when you want a review :)

Hi! Thanks for your reply, a review would be nice :)

It is quite a large file since I just extended it; we could also split it into two distinct task sets. I have a version of the individual community task in another repository as a preview of what it could look like:

https://github.com/rolshoven/slds-eval/blob/main/custom_task/slds.py

@NathanHB
Member

NathanHB commented Nov 4, 2025

yes so what you could do is make a folder named after your task and split your file into separate modules.

  • prompts.py: for the judge prompt and tasks prompts
  • metrics.py: for your metric defs
  • main.py: for your task defs

this should be easier to review and maintain!

@NathanHB
Member

hey @rolshoven ! do you need any help on this ? :)

@rolshoven
Contributor Author

@NathanHB Sorry, I was away for three weeks but now I'm back :) I hope that I find some time soon to make the changes. But does that mean that the community task itself is a directory, which contains the three suggested files, and then only the main.py includes the DATASETS and TASK_TABLE variables?

Move Swiss legal task definitions into prompts/metrics/main modules and keep backward compatibility via swiss_legal_evals re-export.
Changed the type of higher_is_better from a dictionary of callables to a dictionary of booleans for improved clarity and type safety. This is also how it has been used so far.
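
As a hypothetical illustration of the commit message above (the metric names and the `pick_best` helper are assumptions, not code from this PR), a boolean map lets a consumer read the direction of a metric directly instead of calling a function:

```python
# Hypothetical metric names; True means a larger value is better.
higher_is_better: dict[str, bool] = {
    "example_similarity_score": True,
    "example_error_rate": False,
}


def pick_best(metric_name: str, scores: list[float]) -> float:
    """Return the best score for a metric, using the boolean direction flag."""
    if higher_is_better[metric_name]:
        return max(scores)
    return min(scores)


print(pick_best("example_similarity_score", [0.2, 0.7]))
print(pick_best("example_error_rate", [0.2, 0.7]))
```
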
- Introduced functions to load COMET and GEMBA metrics, providing clear error messages for missing dependencies.
- Disabled specific COMET metrics due to numpy version conflicts, with warnings logged for skipped metrics.
- Updated GPU metrics list to reflect the disabled COMET metrics.
- Improved code organization and clarity in metric processing functions.
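
A minimal sketch of the optional-dependency pattern the bullets above describe, assuming a loader that fails with a readable message and a filter that logs a warning for each skipped metric. The function names, the install hint, and the metric identifiers are illustrative assumptions and do not reproduce the code in this PR:

```python
import importlib
import importlib.util
import logging

logger = logging.getLogger(__name__)


def load_optional_module(module_name: str, install_hint: str):
    """Import a top-level module, or raise ImportError with a clear message."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"Metric dependency '{module_name}' is not installed. {install_hint}"
        )
    return importlib.import_module(module_name)


def filter_disabled_metrics(requested: list[str], disabled: set[str]) -> list[str]:
    """Drop disabled metrics and log a warning for each one that is skipped."""
    kept = []
    for name in requested:
        if name in disabled:
            logger.warning("Skipping disabled metric: %s", name)
            continue
        kept.append(name)
    return kept


# Hypothetical metric identifiers, used only for illustration.
active = filter_disabled_metrics(
    ["example_bert_score", "example_comet_metric"],
    disabled={"example_comet_metric"},
)
print(active)
```

Checking with `find_spec` before importing keeps the error message specific to the missing package rather than surfacing a bare `ModuleNotFoundError` from deep inside a metric.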
…arameter from TranslationTask constructor.
@rolshoven rolshoven force-pushed the community_task_slds branch from ac784b9 to cc22ecb Compare March 17, 2026 21:14
@rolshoven
Contributor Author

@NathanHB A little bit late, but I implemented the proposed changes and fixed some bugs. Could you review the changes and give feedback? Thank you :)

@rolshoven
Contributor Author

@NathanHB Friendly ping -- just wanted to check if you had a chance to look at this 🙂

Member

@NathanHB NathanHB left a comment

Sorry for the delay on this! Approving the PR so you can merge asap, only a few nits to fix / be aware of.

Comment thread src/lighteval/tasks/multilingual/tasks/swiss_legal/metrics.py Outdated
Comment thread src/lighteval/tasks/multilingual/tasks/swiss_legal/metrics.py Outdated
Comment thread src/lighteval/tasks/multilingual/tasks/swiss_legal/metrics.py
NathanHB approved these changes May 20, 2026
rolshoven added 4 commits May 20, 2026 13:59
Default `rescale_with_baseline` to False in BertScoreMultilingual. With
near-1.0 baselines (e.g. German xlm-roberta-large layer-24 ≈ 0.98), the
(score - baseline) / (1 - baseline) formula amplifies deviations ~50x,
and the subsequent x100 scaling compounds it — empty/weak predictions
could yield scores like -5000.
Make rescaling opt-in, warn when enabled, and skip the x100 scaling in
that case since rescaled scores are not bounded to [0, 1].
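
The arithmetic in the commit message above can be checked with a short sketch. The formula and the 0.98 baseline come from the message; the raw scores are hypothetical inputs chosen for illustration, and the function is a stand-in rather than the `BertScoreMultilingual` implementation:

```python
def rescale_with_baseline(score: float, baseline: float) -> float:
    """Apply the rescaling formula quoted above: (score - baseline) / (1 - baseline)."""
    return (score - baseline) / (1.0 - baseline)


baseline = 0.98  # approximate baseline quoted in the commit message
amplification = 1.0 / (1.0 - baseline)  # roughly 50 for this baseline

# Hypothetical raw scores: a strong prediction and an empty or weak one.
for raw in (0.99, 0.0):
    rescaled = rescale_with_baseline(raw, baseline)
    # The x100 step mirrors the scaling the commit message says is now skipped
    # when rescaling is enabled.
    print(f"raw={raw:.2f} rescaled={rescaled:.2f} scaled_x100={rescaled * 100:.0f}")
```

With a baseline this close to 1.0, a raw score of 0.0 rescales to about -49, and the extra x100 step pushes it to about -4900, which is in line with the "-5000" figure in the message.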
@rolshoven
Contributor Author

> Sorry for the delay on this! Approving the PR so you can merge asap, only a few nits to fix / be aware of.

@NathanHB Thank you for having a look! I made the suggested changes and I also set rescale_with_baseline to False, since it may lead to large negative scores for specific outputs (such as empty responses):

https://github.com/rolshoven/lighteval/blob/df00dc169970849cced0a8ae2be88fbc5f277297/src/lighteval/tasks/multilingual/tasks/swiss_legal/metrics.py#L102-L110

If there is anything else that you think should change, let me know! Otherwise I'm looking forward to having the changes integrated :-)

@JoelNiklaus JoelNiklaus merged commit 8d29839 into huggingface:main May 20, 2026
4 checks passed

4 participants