
Switch desktop backend from Gemini 3 Pro to Flash #6312

Open
atlas-agent-omi[bot] wants to merge 1 commit into main from atlas/gemini-pro-to-flash

Conversation

@atlas-agent-omi

One-line change: gemini-3-pro-preview → gemini-3-flash-preview in the desktop Rust backend LLM client.

Why: Gemini 3 Pro is the #1 GCP cost at $12.4K/month for 144 desktop users ($86/user). Flash handles the same conversation processing tasks (structure, action items, memories, KG extraction) at ~92% lower cost.

Savings: ~$11.4K/month

Risk: Minimal — Flash is highly capable for structured extraction tasks. The rate limiter already downgrades some Pro requests to Flash (rate_limit.rs:200), so Flash is already proven in this pipeline.
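The cost figures above can be sanity-checked with a quick sketch (all numbers come from this description; nothing here is independently verified):

```rust
// Consistency check of the cost figures quoted in the PR description.
const MONTHLY_COST_PRO: f64 = 12_400.0; // $12.4K/month on Gemini 3 Pro
const DESKTOP_USERS: f64 = 144.0;
const FLASH_DISCOUNT: f64 = 0.92; // Flash quoted at ~92% lower cost

/// Per-user monthly cost: 12_400 / 144 ≈ $86/user.
fn per_user_cost() -> f64 {
    MONTHLY_COST_PRO / DESKTOP_USERS
}

/// Estimated monthly savings: 12_400 * 0.92 ≈ $11.4K.
fn monthly_savings() -> f64 {
    MONTHLY_COST_PRO * FLASH_DISCOUNT
}
```

The three headline numbers ($86/user, ~92% discount, ~$11.4K savings) are mutually consistent.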


Co-authored-by: Aarav Garg <aaravgarg@users.noreply.github.com>
@atlas-agent-omi

@greptileai

greptile-apps bot commented Apr 4, 2026

Greptile Summary

This PR swaps the default Gemini model in the Rust backend's LlmClient from gemini-3-pro-preview to gemini-3-flash-preview — a one-line change aimed at reducing GCP LLM costs by ~92% for the desktop's server-side conversation processing pipeline (structure extraction, memories, action items, and knowledge graph extraction).

Key points:

  • The change is minimal and correct — the model string is used directly in the Gemini API URL at client.rs:247, so the swap is immediately effective.
  • The LlmClient calls the Gemini API directly (not through the proxy route at /v1/proxy/gemini/*), so this change is entirely independent of the rate-limiter degradation logic in rate_limit.rs.
  • The PR description's claim that "Flash is already proven in this pipeline via the rate limiter" is not quite accurate: the rate limiter's maybe_rewrite_model_path only matches gemini-pro-latest (the Swift client's model over the proxy path), not gemini-3-pro-preview (the Rust backend's direct API calls). Flash has not previously handled the Rust backend structured-extraction tasks through that degradation path — though it is still a reasonable model choice for these tasks.
  • No tests, benchmarks, or quality comparisons for the structured extraction tasks are included; post-deploy monitoring of extraction quality would be prudent.
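For readers unfamiliar with the code path, the constructor-plus-URL pattern described in the first bullet might look roughly like this (a sketch only; the struct fields and method names are assumptions, not the actual client.rs source):

```rust
// Hypothetical sketch of the LlmClient described above; field and
// method names are assumptions, not the actual client.rs source.
pub struct LlmClient {
    api_key: String,
    model: String,
}

impl LlmClient {
    pub fn new(api_key: String) -> Self {
        Self {
            api_key,
            // The PR's one-line change lives here:
            // "gemini-3-pro-preview" -> "gemini-3-flash-preview"
            model: "gemini-3-flash-preview".to_string(),
        }
    }

    // The model string is interpolated directly into the request URL,
    // so changing the constructor default is immediately effective.
    pub fn generate_content_url(&self) -> String {
        format!(
            "https://generativelanguage.googleapis.com/v1beta/models/{}:generateContent?key={}",
            self.model, self.api_key
        )
    }
}
```

Because the model name is never overridden elsewhere on this path, swapping the constructor default changes every direct Gemini call the Rust backend makes.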

Confidence Score: 4/5

Safe to merge with low technical risk; primary concern is unvalidated quality impact on structured extraction tasks, not correctness of the code change itself.

The code change is a trivially correct one-liner with no logic errors. The main risk is output quality degradation on structured tasks (memories, action items, KG extraction) since Flash is less capable than Pro, and no quality benchmarks are provided. The PR description's validation argument (rate limiter already uses Flash) doesn't fully apply to this code path, so there is no prior evidence of Flash's performance for these specific Rust backend tasks.

No files require special attention beyond desktop/Backend-Rust/src/llm/client.rs — the change is isolated to the model name in the LlmClient constructor.

Important Files Changed

Filename Overview
desktop/Backend-Rust/src/llm/client.rs Single-line model swap from gemini-3-pro-preview to gemini-3-flash-preview in the LlmClient constructor; no logic changes. Note that this code path calls the Gemini API directly and is independent of the proxy rate-limiter degradation path.

Sequence Diagram

sequenceDiagram
    participant SwiftClient as Swift Desktop Client
    participant RustProxy as Rust Backend (proxy.rs)
    participant RateLimiter as GeminiRateLimiter (rate_limit.rs)
    participant RustLLM as Rust LlmClient (client.rs)
    participant GeminiAPI as Gemini API

    Note over SwiftClient,GeminiAPI: Path 1 — Swift client requests (NOT changed by this PR)
    SwiftClient->>RustProxy: POST /v1/proxy/gemini/models/gemini-pro-latest:generateContent
    RustProxy->>RateLimiter: check_and_record(uid)
    RateLimiter-->>RustProxy: Allow / DegradeToFlash / Reject
    Note over RustProxy: maybe_rewrite_model_path()<br/>gemini-pro-latest → gemini-3-flash-preview<br/>(only when DegradeToFlash)
    RustProxy->>GeminiAPI: Forward request (possibly rewritten model)
    GeminiAPI-->>RustProxy: Response
    RustProxy-->>SwiftClient: Response

    Note over RustLLM,GeminiAPI: Path 2 — Rust backend direct calls (CHANGED by this PR)
    RustLLM->>GeminiAPI: POST .../models/gemini-3-flash-preview:generateContent
    Note over RustLLM: Direct call — bypasses proxy and rate limiter<br/>Used for: structure extraction, memories,<br/>action items, KG extraction
    GeminiAPI-->>RustLLM: JSON response


  client: Client::new(),
  api_key,
- model: "gemini-3-pro-preview".to_string(),
+ model: "gemini-3-flash-preview".to_string(),


P2 Rate-limiter "Flash already proven" justification doesn't apply here

The PR description states Flash is already validated in this pipeline because rate_limit.rs:200 downgrades Pro → Flash. However, that rewrite only matches models/gemini-pro-latest: (used by the Swift client going through /v1/proxy/gemini/*), not models/gemini-3-pro-preview: (used by this LlmClient, which calls the Gemini API directly and never goes through the proxy).

These are two independent request paths:

  • Proxy path (proxy.rs): Swift client → gemini-pro-latest → rate limiter may rewrite to gemini-3-flash-preview
  • Direct path (client.rs): Rust backend → gemini-3-pro-preview → Gemini API (no proxy, no rate limiter)

So gemini-3-flash-preview has not previously run the Rust backend's structured extraction tasks (memories, action items, KG extraction) via the rate limiter degradation path. Flash is still a reasonable choice for these tasks, but the validation argument in the description is for a different flow. Worth keeping in mind when evaluating output quality after rollout.
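The mismatch described above can be made concrete with a small sketch (the function name maybe_rewrite_model_path comes from the review; its body and the RateDecision enum here are assumptions about rate_limit.rs, not its actual source):

```rust
// Sketch of the proxy-path degradation rewrite; enum and body are
// assumptions, not the actual rate_limit.rs source.
pub enum RateDecision {
    Allow,
    DegradeToFlash,
    Reject,
}

pub fn maybe_rewrite_model_path(path: &str, decision: &RateDecision) -> String {
    match decision {
        // Only the Swift client's proxy model segment is matched.
        // The Rust backend's direct calls use "gemini-3-pro-preview"
        // and never pass through this function at all.
        RateDecision::DegradeToFlash => path.replace(
            "models/gemini-pro-latest:",
            "models/gemini-3-flash-preview:",
        ),
        _ => path.to_string(),
    }
}
```

Even if a gemini-3-pro-preview request were somehow routed through this rewrite, the string match would not fire, which is the core of the review's point.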

