Fix/issue#503: FunctionGemma now usable #508
Closed
lennartvoelz wants to merge 5 commits into cactus-compute:main
Conversation
…mple, fix tool response wrapping)
Remove hardcoded “When you decide to call…” guidance and arg example from developer turn
Fix FunctionGemma trigger string (no trailing period) and append tools declarations directly
Wrap tool responses in a developer turn and allow stacking multiple tool responses
Stop wrapping tool outputs in value:; pass through {...} payload as-is
Close pending tool-response developer turn before next user/model turn to avoid malformed prompts
Signed-off-by: Lennart <lvoelz@outlook.de>
…erwise model breaks. TODO: fuse ops, reduce memory footprint. Signed-off-by: Lennart <lvoelz@outlook.de>
Signed-off-by: Lennart <lvoelz@outlook.de>
…tly micro optimizations, but shaves 20-30 ms off because allocations are done once per thread and then reused). Signed-off-by: Lennart <lvoelz@outlook.de>
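A rough sketch of the per-thread buffer reuse mentioned in the commit above (hypothetical code, not the actual Cactus implementation; the helper names are invented):

```cpp
// Hypothetical illustration (not Cactus code): each worker thread keeps a
// thread_local scratch buffer that is allocated once and only grows, instead
// of allocating a fresh FP32 buffer on every call.
#include <cstddef>
#include <vector>

static std::vector<float>& thread_scratch(std::size_t needed) {
    thread_local std::vector<float> scratch;
    if (scratch.size() < needed)
        scratch.resize(needed);              // grows once, then gets reused
    return scratch;
}

void upcast_logits(const _Float16* logits, std::size_t vocab_size) {
    std::vector<float>& buf = thread_scratch(vocab_size);
    for (std::size_t i = 0; i < vocab_size; ++i)
        buf[i] = (float)logits[i];           // upcast into the reused buffer
    // ... downstream FP32 math works on buf without per-call allocations ...
}
```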
Contributor (Author):
With all the latest fixes applied (see #510), the fine-tuned model now matches the outputs from the Transformers library's reference implementation. This allows all Gemma-family models to be fine-tuned and deployed in Cactus :) Who needs cloud fallback now...
Collaborator:
@lennartvoelz thanks, you wanna resolve the conflicts?
Contributor (Author):
Replaced with #510


This PR fixes issue #503
After trying overflow prevention in the residual stream, I realized that the issue lies in the logit sampling. Sampling directly in FP16 causes numerical breakdown: the softmax and probability normalization lose too much precision across the 256K+ vocabulary, leading to degenerate token selection (repetition, gibberish, wrong/hallucinated tool calls).
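To make the failure mode concrete, here is a small stand-alone demonstration of how an FP16 running sum breaks down at this vocabulary size (illustrative only, not Cactus code; it assumes a compiler with native `_Float16` support, such as clang on Apple Silicon):

```cpp
// Summing ~256K softmax terms in FP16 stalls once the running sum exceeds
// 2048, because the increment falls below one ulp. The FP32 sum stays exact.
#include <cstdio>

int main() {
    const int vocab_size = 262144;           // roughly a 256K vocabulary
    _Float16 sum_f16 = (_Float16)0.0f;
    float    sum_f32 = 0.0f;
    for (int i = 0; i < vocab_size; ++i) {
        sum_f16 = sum_f16 + (_Float16)1.0f;  // exp(logit - max) ~ 1, worst case
        sum_f32 = sum_f32 + 1.0f;
    }
    std::printf("fp16 softmax denominator: %f\n", (float)sum_f16); // ~2048
    std::printf("fp32 softmax denominator: %f\n", sum_f32);        // 262144
    return 0;
}
```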
I introduced a new sampling kernel that reads FP16 logits and performs all internal math in FP32. (I previously tried to use the existing cactus_sample_f32 kernel, but the upcasting introduced too much overhead through the additional FP32 buffer allocations.) This kernel is only used when a flag (fp32_accumulate in OpParams) is set, so other models that do not need the higher precision in the sampling math are not affected.
This is the rough strategy:
-> Remaining work on the really small list
The kernel is a bit slower, but its overhead versus the pure FP16 kernel is dominated by memory bandwidth in the full-vocabulary pass, so it is basically unavoidable (at least I think so, correct me if I'm wrong).
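For reference, a minimal sketch of what the fp32-accumulate path looks like conceptually (this is not the actual Cactus kernel; the function name and signature are made up, and top-k/top-p filtering is omitted):

```cpp
// Sketch: read FP16 logits from memory, but do max/softmax/normalization in
// FP32 scalars so the denominator over 256K+ entries does not degrade.
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

int sample_f16_logits_fp32_math(const _Float16* logits, int vocab_size,
                                float temperature, std::mt19937& rng) {
    // 1) Max in FP32 so the subtraction below stays well conditioned.
    float max_logit = -INFINITY;
    for (int i = 0; i < vocab_size; ++i)
        max_logit = std::max(max_logit, (float)logits[i]);

    // 2) exp + running sum in FP32; an FP16 accumulator stalls/overflows here.
    std::vector<float> probs(vocab_size);
    float denom = 0.0f;
    for (int i = 0; i < vocab_size; ++i) {
        float p = std::exp(((float)logits[i] - max_logit) / temperature);
        probs[i] = p;
        denom += p;
    }

    // 3) Sample from the (implicitly normalized) distribution.
    std::uniform_real_distribution<float> uni(0.0f, denom);
    float r = uni(rng), acc = 0.0f;
    for (int i = 0; i < vocab_size; ++i) {
        acc += probs[i];
        if (r <= acc) return i;
    }
    return vocab_size - 1;
}
```

The important part is step 2: the running denominator and per-token probabilities live in FP32 even though the logits stay in FP16 memory, which is where the extra bandwidth cost comes from.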
The result is a much better-working model with roughly 370 ms average latency (MacBook M3) on the Hackathon dataset. Before these changes, fine-tuned models were unusable (as described in #503). The new kernel also makes the base-weight model much better.
Here are the results from the "benchmark":




Fine-tuned & new sampling:
Fine-tuned & old sampling:
Base weights & new sampling:
Base weights & old sampling:
Please note that this is not a real benchmark of model latency, because the hackathon benchmark setup returns 0 ms on model failure, so the times for the old sampling are heavily skewed towards smaller execution times.
I did not have time to properly benchmark the execution times. If you want me to do that, let me know.