Legal NLP tasks on Swiss data #1032
Force-pushed from 481bc4f to f3a626d.
Hey! Thanks for your PR. Don't hesitate to ping me when you want a review :)
Hi! Thanks for your reply, a review would be nice :) It is quite a large file since I just extended it; we could also split it into two distinct task sets. I have a version of the individual community task in another repository as a preview of how it could look: https://github.com/rolshoven/slds-eval/blob/main/custom_task/slds.py
Yes, so what you could do is make a folder named after your task and split your file into separate modules. This should be easier to review and maintain!
Hey @rolshoven! Do you need any help on this? :)
@NathanHB Sorry, I was away for three weeks, but now I'm back :) I hope that I find some time soon to make the changes. But does that mean that the community task itself is a directory, which contains the three suggested files, and then only the
Move Swiss legal task definitions into prompts/metrics/main modules and keep backward compatibility via swiss_legal_evals re-export.
Changed the type of higher_is_better from a dictionary of callables to a dictionary of booleans for improved clarity and type safety. This is also how it has been used so far.
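A minimal sketch of what a boolean-valued `higher_is_better` mapping can look like. The metric names and the `is_improvement` helper are hypothetical, chosen for illustration, and are not taken from this PR:

```python
# Hypothetical metric names, for illustration only.
higher_is_better: dict[str, bool] = {
    "bert_score": True,  # similarity score: larger is better
    "ter": False,        # translation edit rate: smaller is better
}


def is_improvement(metric: str, old: float, new: float) -> bool:
    """Return True if `new` is strictly better than `old` for `metric`."""
    if higher_is_better[metric]:
        return new > old
    return new < old
```

With plain booleans, the direction of each metric can be read directly from the mapping and checked by a type checker, without calling anything.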
- Introduced functions to load COMET and GEMBA metrics, providing clear error messages for missing dependencies.
- Disabled specific COMET metrics due to numpy version conflicts, with warnings logged for skipped metrics.
- Updated GPU metrics list to reflect the disabled COMET metrics.
- Improved code organization and clarity in metric processing functions.
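The loading pattern described in this commit message could look roughly like the sketch below. It is an illustration under assumptions, not the code from this PR: `load_optional`, `select_metrics`, and the metric names in the usage note are hypothetical.

```python
import importlib
import logging
from types import ModuleType

logger = logging.getLogger(__name__)


def load_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency, or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Metric dependency '{module_name}' is not installed. {install_hint}"
        ) from exc


def select_metrics(requested: list[str], disabled: set[str]) -> list[str]:
    """Drop disabled metrics and log a warning for each one skipped."""
    kept = []
    for name in requested:
        if name in disabled:
            logger.warning("Skipping disabled metric: %s", name)
        else:
            kept.append(name)
    return kept
```

For example, `select_metrics(["metric_a", "metric_b"], disabled={"metric_b"})` keeps only `"metric_a"` and logs one warning, so a version conflict in one metric does not block the rest of the evaluation.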
…arameter from TranslationTask constructor.
Force-pushed from ac784b9 to cc22ecb.
@NathanHB A little bit late, but I implemented the proposed changes and fixed some bugs. Could you review the changes and give feedback? Thank you :)
@NathanHB Friendly ping -- just wanted to check if you had a chance to look at this 🙂
NathanHB left a comment:
Sorry for the delay on this! Approving the PR so you can merge asap, only a few nits to fix / be aware of.
…ecision Summarization judge
Default `rescale_with_baseline` to False in BertScoreMultilingual. With near-1.0 baselines (e.g. German xlm-roberta-large layer-24 ≈ 0.98), the (score - baseline) / (1 - baseline) formula amplifies deviations ~50x, and the subsequent x100 scaling compounds it — empty/weak predictions could yield scores like -5000. Make rescaling opt-in, warn when enabled, and skip the x100 scaling in that case since rescaled scores are not bounded to [0, 1].
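The arithmetic behind this commit message can be checked with a short sketch. The formula and the 0.98 baseline come from the message above; the `rescale` and `finalize_score` helpers are hypothetical names used for illustration and are not the actual `BertScoreMultilingual` implementation.

```python
import warnings


def rescale(score: float, baseline: float) -> float:
    """Baseline rescaling: (score - baseline) / (1 - baseline)."""
    return (score - baseline) / (1.0 - baseline)


def finalize_score(
    raw: float, baseline: float, rescale_with_baseline: bool = False
) -> float:
    """Hypothetical post-processing: rescaling is opt-in and skips the x100 step."""
    if rescale_with_baseline:
        warnings.warn(
            "Baseline rescaling is enabled; scores are not bounded to [0, 1].",
            stacklevel=2,
        )
        return rescale(raw, baseline)
    return raw * 100.0


baseline = 0.98
# Each unit of deviation from the baseline is multiplied by 1 / (1 - baseline).
amplification = 1.0 / (1.0 - baseline)
print(f"amplification factor: {amplification:.1f}")

for raw in (0.99, 0.95, 0.0):
    rescaled = rescale(raw, baseline)
    print(f"raw={raw:.2f} rescaled={rescaled:.2f} rescaled_x100={rescaled * 100:.0f}")
```

With a baseline of 0.98, a raw score of 0.0 (for example, from an empty prediction) rescales to about -49, and multiplying by 100 on top of that gives about -4900, which is the order of magnitude mentioned in the commit message.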
@NathanHB Thank you for having a look! I made the suggested changes and I also set If there is anything else that you think should change, let me know! Otherwise I'm looking forward to having the changes integrated :-)
This PR introduces a set of Swiss legal eval tasks, initially created by @JoelNiklaus and extended by me for the SLDS dataset. I also had to make some adaptations to the other included tasks due to recent changes in lighteval, but I think everything should be working again now.