Fix/issue#503: FunctionGemma now usable #508
Closed
lennartvoelz wants to merge 5 commits into cactus-compute:main
Conversation
…mple, fix tool response wrapping)
Remove hardcoded “When you decide to call…” guidance and arg example from developer turn
Fix FunctionGemma trigger string (no trailing period) and append tools declarations directly
Wrap tool responses in a developer turn and allow stacking multiple tool responses
Stop wrapping tool outputs in value:; pass through {...} payload as-is
Close pending tool-response developer turn before next user/model turn to avoid malformed prompts
Signed-off-by: Lennart <lvoelz@outlook.de>
…erwise model breaks. TODO: fuse ops, reduce memory footprint. Signed-off-by: Lennart <lvoelz@outlook.de>
Signed-off-by: Lennart <lvoelz@outlook.de>
…tly micro optimizations, but shaves 20-30 ms off because allocations are done once per thread and then reused). Signed-off-by: Lennart <lvoelz@outlook.de>
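A rough sketch of the per-thread buffer reuse mentioned in the commit above (hypothetical code, not the actual Cactus implementation; the helper names are invented):

```cpp
// Hypothetical illustration (not Cactus code): each worker thread keeps a
// thread_local scratch buffer that is allocated once and only grows, instead
// of allocating a fresh FP32 buffer on every call.
#include <cstddef>
#include <vector>

static std::vector<float>& thread_scratch(std::size_t needed) {
    thread_local std::vector<float> scratch;
    if (scratch.size() < needed)
        scratch.resize(needed);              // grows once, then gets reused
    return scratch;
}

void upcast_logits(const _Float16* logits, std::size_t vocab_size) {
    std::vector<float>& buf = thread_scratch(vocab_size);
    for (std::size_t i = 0; i < vocab_size; ++i)
        buf[i] = (float)logits[i];           // upcast into the reused buffer
    // ... downstream FP32 math works on buf without per-call allocations ...
}
```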
Contributor (Author):
With all the latest fixes applied (see #510), the fine-tuned model now matches the outputs from the Transformers library's reference implementation. This allows all Gemma-family models to be fine-tuned and deployed in Cactus :) Who needs cloud fallback now...
Collaborator:
@lennartvoelz thanks, you wanna resolve the conflicts?
Contributor (Author):
Replaced with #510


This PR fixes issue #503
After trying overflow prevention in the residual stream, I realized that the issue lies in the logit sampling. Sampling directly in FP16 causes numerical breakdown: the softmax and probability normalization lose too much precision across the 256K+ vocabulary, leading to degenerate token selection (repetition, gibberish, wrong/hallucinated tool calls).
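To make the failure mode concrete, here is a small stand-alone demonstration of how an FP16 running sum breaks down at this vocabulary size (illustrative only, not Cactus code; it assumes a compiler with native `_Float16` support, such as clang on Apple Silicon):

```cpp
// Summing ~256K softmax terms in FP16 stalls once the running sum exceeds
// 2048, because the increment falls below one ulp. The FP32 sum stays exact.
#include <cstdio>

int main() {
    const int vocab_size = 262144;           // roughly a 256K vocabulary
    _Float16 sum_f16 = (_Float16)0.0f;
    float    sum_f32 = 0.0f;
    for (int i = 0; i < vocab_size; ++i) {
        sum_f16 = sum_f16 + (_Float16)1.0f;  // exp(logit - max) ~ 1, worst case
        sum_f32 = sum_f32 + 1.0f;
    }
    std::printf("fp16 softmax denominator: %f\n", (float)sum_f16); // ~2048
    std::printf("fp32 softmax denominator: %f\n", sum_f32);        // 262144
    return 0;
}
```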
I introduced a new sampling kernel that reads FP16 logits and performs all internal math in FP32. (I previously tried to use the existing cactus_sample_f32 kernel, but the upcasting introduced too much overhead through the additional FP32 buffer allocations.) This kernel is only used when a flag (fp32_accumulate in OpParams) is set, so other models that do not need the higher precision in the sampling math are not affected.
This is the rough strategy:
-> Remaining work on the really small list
The kernel is a bit slower, but its overhead versus the pure FP16 kernel is dominated by memory bandwidth in the full-vocabulary pass, so it is basically unavoidable (at least I think so, correct me if I'm wrong).
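For reference, a minimal sketch of what the fp32-accumulate path looks like conceptually (this is not the actual Cactus kernel; the function name and signature are made up, and top-k/top-p filtering is omitted):

```cpp
// Sketch: read FP16 logits from memory, but do max/softmax/normalization in
// FP32 scalars so the denominator over 256K+ entries does not degrade.
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

int sample_f16_logits_fp32_math(const _Float16* logits, int vocab_size,
                                float temperature, std::mt19937& rng) {
    // 1) Max in FP32 so the subtraction below stays well conditioned.
    float max_logit = -INFINITY;
    for (int i = 0; i < vocab_size; ++i)
        max_logit = std::max(max_logit, (float)logits[i]);

    // 2) exp + running sum in FP32; an FP16 accumulator stalls/overflows here.
    std::vector<float> probs(vocab_size);
    float denom = 0.0f;
    for (int i = 0; i < vocab_size; ++i) {
        float p = std::exp(((float)logits[i] - max_logit) / temperature);
        probs[i] = p;
        denom += p;
    }

    // 3) Sample from the (implicitly normalized) distribution.
    std::uniform_real_distribution<float> uni(0.0f, denom);
    float r = uni(rng), acc = 0.0f;
    for (int i = 0; i < vocab_size; ++i) {
        acc += probs[i];
        if (r <= acc) return i;
    }
    return vocab_size - 1;
}
```

The important part is step 2: the running denominator and per-token probabilities live in FP32 even though the logits stay in FP16 memory, which is where the extra bandwidth cost comes from.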
The result is a much better-working model with roughly 370 ms average latency (MacBook M3) on the Hackathon dataset. Before these changes, fine-tuned models were unusable (as described in #503). The new kernel also makes the base-weight model much better.
Here are the results from the "benchmark":




Fine-tuned & new sampling:
Fine-tuned & old sampling:
Base weights & new sampling:
Base weights & old sampling:
Please note that this is not a real benchmark of model latency, because the hackathon benchmark setup returns 0 ms on model failure, so the times for the old sampling are heavily skewed towards smaller execution times.
I did not have time to properly benchmark the execution times. If you want me to do that, let me know.