Fix/issue#503: FunctionGemma now usable #508

Closed
lennartvoelz wants to merge 5 commits into cactus-compute:main from lennartvoelz:fix/issue#503

Conversation

@lennartvoelz
Contributor

This PR fixes issue #503.

After trying overflow prevention in the residual stream, I realized that the issue is the logit sampling. Sampling directly in FP16 causes numerical breakdown: the softmax and probability normalization lose too much precision across the 256K+ vocabulary, leading to degenerate token selection (repetition, gibberish, wrong/hallucinated tool calls).

I introduced a new sampling kernel that reads FP16 logits and performs all internal math in fp32. (I previously tried to use the existing cactus_sample_f32 kernel, but the upcast required additional fp32 buffer allocations and introduced too much overhead.) The new kernel is only used when a flag (fp32_accumulate in OpParams) is set, so models that do not need the higher precision in the sampling math are unaffected.
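To make the failure mode concrete, here is a tiny standalone sketch (my illustration, not code from this PR): accumulating 256K uniform softmax terms in an FP16 accumulator stalls orders of magnitude short of the true sum, while an fp32 accumulator lands where it should.

```cpp
// Standalone illustration (not Cactus code) of why FP16 accumulation breaks
// softmax normalization over a 256K-entry vocabulary. _Float16 needs a
// recent Clang/GCC; on AArch64 it maps to the hardware half type.
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t vocab = 262144;             // 256K tokens
    const _Float16 term = _Float16(1.0f / vocab); // uniform softmax terms
    _Float16 sum16 = 0;                           // FP16 accumulator (old path)
    float    sum32 = 0.0f;                        // fp32 accumulator (new path)
    for (std::size_t i = 0; i < vocab; ++i) {
        sum16 = _Float16(sum16 + term); // no-ops once term <= half an ulp of sum16
        sum32 += float(term);
    }
    // Both sums should be ~1.0. The FP16 sum stalls roughly two orders of
    // magnitude short, so probabilities normalized by it are garbage and
    // token selection degenerates.
    std::printf("fp16 sum = %f\nfp32 sum = %f\n", double(sum16), double(sum32));
}
```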

This is the rough strategy (a sketch follows the list):

  • Fused widening convert with scaling and a running max accumulator
  • Reduce to a smaller active-token list to shrink the working set; the remaining work runs on that small list
  • Allocations: all buffers are static thread_local and reused across tokens
  • RNG: a static thread_local std::mt19937, seeded once per thread
The kernel is a bit slower, but the overhead versus the pure FP16 kernel is dominated by memory bandwidth in the full-vocabulary pass, so it is basically unavoidable (at least I think so; correct me if I'm wrong).

The result is a much better working model, with roughly 370 ms average latency (MacBook M3) on the Hackathon dataset. Before the changes, fine-tuned models were unusable (described in #503). The new kernel also makes the base-weights model much better.

Here are the results from the "benchmark":
Fine-tuned & new sampling: [screenshot]
Fine-tuned & old sampling: [screenshot]
Base weights & new sampling: [screenshot]
Base weights & old sampling: [screenshot]

Please note that this is not a real benchmark of model latency: the hackathon benchmark setup returns 0 ms on model failure, so the times for the old sampling are heavily skewed toward smaller execution times.
I did not have time to properly benchmark the execution times; if you want me to do that, let me know.

…mple, fix tool response wrapping)

Remove hardcoded “When you decide to call…” guidance and arg example from developer turn

Fix FunctionGemma trigger string (no trailing period) and append tools declarations directly

Wrap tool responses in a developer turn and allow stacking multiple tool responses

Stop wrapping tool outputs in value:; pass through {...} payload as-is

Close pending tool-response developer turn before next user/model turn to avoid malformed prompts

Signed-off-by: Lennart <lvoelz@outlook.de>
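For illustration, the tool-response wrapping these commits describe would look roughly like this; `<start_of_turn>`/`<end_of_turn>` are Gemma's standard turn delimiters, but the role name, payloads, and exact layout here are my assumptions, not verified against FunctionGemma's template:

```
<start_of_turn>developer
{"temperature": 21.5}
{"condition": "sunny"}
<end_of_turn>
```

Two tool responses are stacked inside a single developer turn, and each `{...}` payload passes through as-is rather than being wrapped in value:.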
…erwise model breaks

TODO: fuse ops, reduce memory footprint

Signed-off-by: Lennart <lvoelz@outlook.de>
…tly micro optimizations, but shaves 20-30ms off because allocations are done once for each thread and then reused)

Signed-off-by: Lennart <lvoelz@outlook.de>
@lennartvoelz
Contributor Author

With all the latest fixes applied (see #510), the fine-tuned model now matches the outputs from the Transformers library's reference implementation. This allows all Gemma family models to be fine-tuned and deployed in Cactus :)
This was not possible before the chain of fixes.
Benchmark now with the fine-tuned model: [screenshot]
Benchmark now with the base FunctionGemma model: [screenshot]

Who needs cloud fallback now...

@HenryNdubuaku
Collaborator

@lennartvoelz thanks, you wanna resolve the conflicts?

@lennartvoelz
Contributor Author

Replaced with #510
