feat: add Flux, LTX-Video, LTX-2/2.3, speculative decoding, and Windows CUDA support #45
icryo wants to merge 24 commits into evilsocket:main from
Conversation
…eline
- LTX-2 (19B): Gemma-3 12B encoder, dual-stream DiT transformer, VAE, vocoder
- Flux: T5-XXL + CLIP text encoders, DiT transformer, VAE
- LTX-Video (0.9.x): T5-XXL encoder, DiT transformer, 3D VAE
- Video generation: VideoGenerator trait, VideoMaster, AVI muxer
- Speculative decoding support (--draft-model, --spec-tokens)
- GGUF quantization utilities
- Model stubs: LLaVA (VLM), Mixtral (MoE), HunyuanVideo
- Direct path resolution for Windows workers (no HF cache required)
- Bug fixes: video position midpoint averaging, Gemma padding mask, connector divisibility check

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds detailed timing logs to identify the 450s/step bottleneck:
- Unpack time, input shapes, dtype, device
- Setup phase (proj_in + adaln + caption + RoPE)
- Per-8-block cumulative timing with forced GPU sync
- Total forward pass time

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The LTX-2 transformer is ~35GB in BF16 — too large for a single 32GB GPU. This splits it into block ranges that can be distributed:
- Worker (5090, 32GB): blocks 0-23 (~17GB)
- Master (4090, 24GB): blocks 24-47 + connector + VAE (~20GB)

Changes:
- LTXModel: add new_block_range() to load only blocks N-M
- LTXModel: split forward into forward_setup/forward_blocks/forward_finalize
- Ltx2Transformer: parse "ltx2-transformer.N-M" layer names
- Ltx2: orchestrate split pipeline (setup → remote blocks → local blocks → finalize)
- Topology: use "ltx2-transformer.0-23" instead of "ltx2-transformer"
- find_weight_files: properly handle 8-shard model files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
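The "ltx2-transformer.N-M" layer-name parsing described in this commit can be sketched roughly as follows; the function name and error handling are illustrative, not the PR's actual code.

```rust
// Hypothetical sketch: extract a block range (start, end) from a topology
// layer name like "ltx2-transformer.0-23". A bare "ltx2-transformer" with no
// range yields None.
fn parse_block_range(layer: &str) -> Option<(usize, usize)> {
    // Split on the last '.' so the model name itself may contain dashes.
    let (_, range) = layer.rsplit_once('.')?;
    let (start, end) = range.split_once('-')?;
    Some((start.parse().ok()?, end.parse().ok()?))
}

fn main() {
    assert_eq!(parse_block_range("ltx2-transformer.0-23"), Some((0, 23)));
    assert_eq!(parse_block_range("ltx2-transformer.24-47"), Some((24, 47)));
    assert_eq!(parse_block_range("ltx2-transformer"), None);
    println!("ok");
}
```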
Load latents_mean/latents_std from VAE safetensors instead of defaulting to identity values. Match Python LTX2Pipeline behavior: skip initial noise normalization for txt2vid, only denormalize before VAE decode. Remove per-block GPU debug logging that doubled step time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…on, and 4-block VAE

LTX-2.3 extends LTX-2 with:
- Gated attention and prompt modulation in transformer blocks
- Cross-attention AdaLN conditioning
- 8-layer connector with 32 heads (4096 dim) and feature_extractor
- 4-block VAE decoder with per-block strides for asymmetric upsampling
- prompt_temb wired through distributed protocol (8th packed tensor)
- Gemma-3 encoder loading with HF_TOKEN support
- Conversion script for monolithic checkpoint to diffusers format

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix gated attention: use 2*sigmoid (not sigmoid), matching Python; the bug was halving all attention outputs
- Fix final normalization: use LayerNorm (not RMSNorm) in forward_finalize, matching nn.LayerNorm
- Fix attention mask sign: use -1e9 (not +1e9) for masked positions
- Fix prompt modulation: modulate cross-attention context (key/value) instead of post-attention residual
- Fix Gemma-3 sliding window: layer_idx % pattern != 0 (not (layer_idx+1) % pattern > 0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
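The gated-attention fix can be illustrated with scalars: with an untrained gate of g = 0, plain sigmoid scales every attention output by 0.5, while 2*sigmoid restores the identity. Names here are illustrative, not the PR's code.

```rust
// Minimal illustration of the 2*sigmoid gate. The 2* factor makes a zero
// gate act as identity instead of halving the attention output.
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

fn gate_attn(attn_out: f32, g: f32) -> f32 {
    2.0 * sigmoid(g) * attn_out
}

fn main() {
    // g = 0: fixed gate passes the output through unchanged ...
    assert!((gate_attn(1.0, 0.0) - 1.0).abs() < 1e-6);
    // ... while plain sigmoid(0) = 0.5 would halve it (the original bug).
    assert!((sigmoid(0.0) - 0.5).abs() < 1e-6);
    println!("ok");
}
```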
- Skip self-attention at block 28 (V passthrough) for the STG perturbation pass
- Guidance formula: cond + (cfg-1)*(cond-uncond) + stg*(cond-perturbed)
- Rescale: lerp(1.0, cond.std()/pred.std(), rescale_scale) to prevent oversaturation
- CLI args: --ltx-stg-scale (default 1.0), --ltx-stg-block (default 28), --ltx-rescale (default 0.7)
- STG blocks propagated through the network protocol for distributed workers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both encode() and encode_from_tokens() now use pack_text_embeds_v2, which applies per-token RMS normalization instead of per-batch min/max. This preserves token-level variation that is critical for CFG differentiation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Check timesteps.dim(2) > 6 in addition to self.adaln_params > 6 to prevent a narrow() out-of-bounds when the config and temb shapes disagree. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When STG is active (block_idx=2) but prompt_temb is absent, the STG blocks tensor at index 7 was misinterpreted as prompt_temb. Fix by always treating the last tensor as stg_blocks when block_idx==2. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Image models like LTX-2 use diffusers format without a root config.json. Their forwarders handle HF resolution internally, so the generic download path should only be used for text models. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When --model is a HuggingFace repo ID (not a local directory), the forwarders should use the default HF cache (~/.cache/huggingface/hub) instead of constructing a path from the repo ID. This fixes Windows worker startup failure with "Access is denied" errors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When --model is a repo ID like "Lightricks/LTX-2", the relative path creates a partial cache with broken symlinks. Check that model_dir is an actual existing directory before using the model-local cache. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
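A minimal sketch of the fix described here, with hypothetical names: treat --model as a model-local cache root only if it is an actual existing directory, and otherwise fall back to the default HF cache, since a repo ID like "Lightricks/LTX-2" interpreted as a relative path produces a broken partial cache.

```rust
use std::path::Path;

// Pick the cache root: a real local directory wins; a repo ID falls back to
// the default HF cache directory. Names are illustrative, not the PR's code.
fn cache_root(model_arg: &str, default_hf_cache: &str) -> String {
    if Path::new(model_arg).is_dir() {
        model_arg.to_string()          // existing local model directory
    } else {
        default_hf_cache.to_string()   // repo ID: use the HF cache instead
    }
}

fn main() {
    // "." always exists as a directory.
    assert_eq!(cache_root(".", "hub"), ".");
    // A repo ID is (almost certainly) not a local directory here.
    assert_eq!(cache_root("Lightricks/LTX-2", "hub"), "hub");
    println!("ok");
}
```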
For sharded models, the HF API only downloads files explicitly requested. Parse the index.json to find shard filenames and download each one before trying to mmap them. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Preserves all Python test/debug scripts and Rust diagnostic logging in git history before removing them for the upstream PR. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove per-block CA/FF diff static Mutex tracking from transformer_block.rs
- Remove pre-flight per-block diagnostic and Python CA reference test from ltx2.rs
- Remove per-step CFG/STG/velocity/latent verbose logging
- Remove Gemma per-layer hidden state stats and tensor stat logging
- Remove caption_projection output stats from model.rs
- Remove unused blocks()/block_start() accessors from model.rs
- Remove unused rms_norm import, normalize_latents import
- Fix unused variable warnings (t_q, ctx in trait impls)
- Suppress dead_code warnings for worker-only functions
- Delete 18 Python test/debug scripts
- Delete operational files (RUNBOOK, topology ymls, setup script)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Strip stub modules that bail at runtime to keep the PR shipping only working, tested features. The stubs remain in git history for future implementation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Load the Gemma-3 12B text encoder from GGUF (Q4_K_M) for GPU inference, achieving a ~26x speedup over CPU safetensors (3s vs 80s per encoding). Falls back to CPU safetensors when --ltx-gemma-gguf is not provided.

Key optimizations:
- Share RoPE tables across layers via Arc (saves ~6.4GB)
- Cap RoPE to 1024 tokens (the encoder max, not 131072)
- Dequantize embeddings to F16 instead of F32 (saves ~2GB)
- Cache unconditional embeddings to disk (keyed by GGUF path)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Going to review the merge conflicts and reopen. Stable in testing
Merge upstream changes (GPTQ support, ChatML auto-detect, qwen3_5_moe, sliding window cache, EOS token handling) with our additions (GGUF, speculative decoding, video models, KvCache optimization). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merge upstream additions:
- FLUX.2-klein support (take upstream's implementation over ours)
- Flash attention refactored into utils::flash_attn module
- GPTQ quantization support
- Qwen3.5 MoE model support

Retain our additions:
- LTX-Video, LTX-2/2.3 video generation
- GPU Gemma-3 via GGUF quantization
- Speculative decoding
- GGUF model loading
- Video infrastructure (VideoMaster, AVI muxer, video API)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fixed the latest conflicts
oh wow this is amazing @icryo! I was working on FLUX integration myself but am stuck on some artifacts in the output.
Did you run image and video generation (even as a single master node) and check the output?
Also, why so many changes to core files and stuff that is not related to the new models?
Looks like a VAE issue; Flux and Video are working on my end. Let me run a few tests and share some results.
PR is cleaned up but having a few headaches with WAN support.
…support
- Wan DiT transformer: verified correct vs Python (max_diff=0.0007 per step)
- UMT5-XXL text encoder: matches Python to 6 significant figures
- VAE decoder: chunked temporal decoding with feature cache
- GGUF quantized loading: Q4/Q5/Q8, all-F32 intermediates
- Diffusers format support: key remapping for transformer + VAE
- Distributed topology for 2-GPU setup
- Also: LLaVA VLM, Mixtral MoE, HunyuanVideo stubs
hey @icryo did you rebase with main? Because there have been a lot of core changes lately
Don't merge at this time, still troubleshooting some issues with supporting WAN. I'll rebase & re-open this when it's ready. |

Summary

- Speculative decoding (--draft-model, --spec-tokens)
- VideoGenerator trait, VideoMaster, AVI muxer, video API endpoint
- Worker layer types (ltx2-gemma, ltx2-transformer) for non-LLM model distribution

Performance (LTX-2, 768x512, 41 frames, 2 GPUs: 4090 + 5090)
Test plan
cargo test --features cuda — all 135 tests pass

🤖 Generated with Claude Code