feat: add Flux, LTX-Video, LTX-2/2.3, speculative decoding, and Windows CUDA support #45
icryo wants to merge 24 commits into evilsocket:main from
Conversation
…eline
- LTX-2 (19B): Gemma-3 12B encoder, dual-stream DiT transformer, VAE, vocoder
- Flux: T5-XXL + CLIP text encoders, DiT transformer, VAE
- LTX-Video (0.9.x): T5-XXL encoder, DiT transformer, 3D VAE
- Video generation: VideoGenerator trait, VideoMaster, AVI muxer
- Speculative decoding support (--draft-model, --spec-tokens)
- GGUF quantization utilities
- Model stubs: LLaVA (VLM), Mixtral (MoE), HunyuanVideo
- Direct path resolution for Windows workers (no HF cache required)
- Bug fixes: video position midpoint averaging, Gemma padding mask, connector divisibility check

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds detailed timing logs to identify the 450s/step bottleneck:
- Unpack time, input shapes, dtype, device
- Setup phase (proj_in + adaln + caption + RoPE)
- Per-8-block cumulative timing with forced GPU sync
- Total forward pass time

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The LTX-2 transformer is ~35GB in BF16 — too large for a single 32GB GPU. This splits it into block ranges that can be distributed:
- Worker (5090, 32GB): blocks 0-23 (~17GB)
- Master (4090, 24GB): blocks 24-47 + connector + VAE (~20GB)

Changes:
- LTXModel: add new_block_range() to load only blocks N-M
- LTXModel: split forward into forward_setup/forward_blocks/forward_finalize
- Ltx2Transformer: parse "ltx2-transformer.N-M" layer names
- Ltx2: orchestrate split pipeline (setup → remote blocks → local blocks → finalize)
- Topology: use "ltx2-transformer.0-23" instead of "ltx2-transformer"
- find_weight_files: properly handle 8-shard model files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
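The "ltx2-transformer.N-M" layer-name parsing described in this commit can be sketched roughly as follows; the function name and error handling are illustrative, not the PR's actual code.

```rust
// Hypothetical sketch: extract a block range (start, end) from a topology
// layer name like "ltx2-transformer.0-23". A bare "ltx2-transformer" with no
// range yields None.
fn parse_block_range(layer: &str) -> Option<(usize, usize)> {
    // Split on the last '.' so the model name itself may contain dashes.
    let (_, range) = layer.rsplit_once('.')?;
    let (start, end) = range.split_once('-')?;
    Some((start.parse().ok()?, end.parse().ok()?))
}

fn main() {
    assert_eq!(parse_block_range("ltx2-transformer.0-23"), Some((0, 23)));
    assert_eq!(parse_block_range("ltx2-transformer.24-47"), Some((24, 47)));
    assert_eq!(parse_block_range("ltx2-transformer"), None);
    println!("ok");
}
```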
Load latents_mean/latents_std from VAE safetensors instead of defaulting to identity values. Match Python LTX2Pipeline behavior: skip initial noise normalization for txt2vid, only denormalize before VAE decode. Remove per-block GPU debug logging that doubled step time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…on, and 4-block VAE

LTX-2.3 extends LTX-2 with:
- Gated attention and prompt modulation in transformer blocks
- Cross-attention AdaLN conditioning
- 8-layer connector with 32 heads (4096 dim) and feature_extractor
- 4-block VAE decoder with per-block strides for asymmetric upsampling
- prompt_temb wired through distributed protocol (8th packed tensor)
- Gemma-3 encoder loading with HF_TOKEN support
- Conversion script for monolithic checkpoint to diffusers format

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix gated attention: use 2*sigmoid (not sigmoid), matching Python; the bug was halving all attention outputs
- Fix final normalization: use LayerNorm (not RMSNorm) in forward_finalize, matching nn.LayerNorm
- Fix attention mask sign: use -1e9 (not +1e9) for masked positions
- Fix prompt modulation: modulate cross-attention context (key/value) instead of post-attention residual
- Fix Gemma-3 sliding window: layer_idx % pattern != 0 (not (layer_idx+1) % pattern > 0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
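The gated-attention fix can be illustrated with scalars: with an untrained gate of g = 0, plain sigmoid scales every attention output by 0.5, while 2*sigmoid restores the identity. Names here are illustrative, not the PR's code.

```rust
// Minimal illustration of the 2*sigmoid gate. The 2* factor makes a zero
// gate act as identity instead of halving the attention output.
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

fn gate_attn(attn_out: f32, g: f32) -> f32 {
    2.0 * sigmoid(g) * attn_out
}

fn main() {
    // g = 0: fixed gate passes the output through unchanged ...
    assert!((gate_attn(1.0, 0.0) - 1.0).abs() < 1e-6);
    // ... while plain sigmoid(0) = 0.5 would halve it (the original bug).
    assert!((sigmoid(0.0) - 0.5).abs() < 1e-6);
    println!("ok");
}
```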
- Skip self-attention at block 28 (V passthrough) for the STG perturbation pass
- Guidance formula: cond + (cfg-1)*(cond-uncond) + stg*(cond-perturbed)
- Rescale: lerp(1.0, cond.std()/pred.std(), rescale_scale) to prevent oversaturation
- CLI args: --ltx-stg-scale (default 1.0), --ltx-stg-block (default 28), --ltx-rescale (default 0.7)
- STG blocks propagated through the network protocol for distributed workers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both encode() and encode_from_tokens() now use pack_text_embeds_v2, which applies per-token RMS normalization instead of per-batch min/max. This preserves token-level variation that is critical for CFG differentiation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Check timesteps.dim(2) > 6 in addition to self.adaln_params > 6 to prevent a narrow() out-of-bounds when the config and temb shapes disagree. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When STG is active (block_idx=2) but prompt_temb is absent, the STG blocks tensor at index 7 was misinterpreted as prompt_temb. Fix by always treating the last tensor as stg_blocks when block_idx==2. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Image models like LTX-2 use diffusers format without a root config.json. Their forwarders handle HF resolution internally, so the generic download path should only be used for text models. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When --model is a HuggingFace repo ID (not a local directory), the forwarders should use the default HF cache (~/.cache/huggingface/hub) instead of constructing a path from the repo ID. This fixes Windows worker startup failure with "Access is denied" errors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When --model is a repo ID like "Lightricks/LTX-2", the relative path creates a partial cache with broken symlinks. Check that model_dir is an actual existing directory before using the model-local cache. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
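A minimal sketch of the fix described here, with hypothetical names: treat --model as a model-local cache root only if it is an actual existing directory, and otherwise fall back to the default HF cache, since a repo ID like "Lightricks/LTX-2" interpreted as a relative path produces a broken partial cache.

```rust
use std::path::Path;

// Pick the cache root: a real local directory wins; a repo ID falls back to
// the default HF cache directory. Names are illustrative, not the PR's code.
fn cache_root(model_arg: &str, default_hf_cache: &str) -> String {
    if Path::new(model_arg).is_dir() {
        model_arg.to_string()          // existing local model directory
    } else {
        default_hf_cache.to_string()   // repo ID: use the HF cache instead
    }
}

fn main() {
    // "." always exists as a directory.
    assert_eq!(cache_root(".", "hub"), ".");
    // A repo ID is (almost certainly) not a local directory here.
    assert_eq!(cache_root("Lightricks/LTX-2", "hub"), "hub");
    println!("ok");
}
```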
For sharded models, the HF API only downloads files explicitly requested. Parse the index.json to find shard filenames and download each one before trying to mmap them. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Preserves all Python test/debug scripts and Rust diagnostic logging in git history before removing them for the upstream PR. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove per-block CA/FF diff static Mutex tracking from transformer_block.rs
- Remove pre-flight per-block diagnostic and Python CA reference test from ltx2.rs
- Remove per-step CFG/STG/velocity/latent verbose logging
- Remove Gemma per-layer hidden state stats and tensor stat logging
- Remove caption_projection output stats from model.rs
- Remove unused blocks()/block_start() accessors from model.rs
- Remove unused rms_norm import, normalize_latents import
- Fix unused variable warnings (t_q, ctx in trait impls)
- Suppress dead_code warnings for worker-only functions
- Delete 18 Python test/debug scripts
- Delete operational files (RUNBOOK, topology ymls, setup script)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Strip stub modules that bail at runtime to keep the PR shipping only working, tested features. The stubs remain in git history for future implementation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Load the Gemma-3 12B text encoder from GGUF (Q4_K_M) for GPU inference, achieving a ~26x speedup over CPU safetensors (3s vs 80s per encoding). Falls back to CPU safetensors when --ltx-gemma-gguf is not provided.

Key optimizations:
- Share RoPE tables across layers via Arc (saves ~6.4GB)
- Cap RoPE to 1024 tokens (the encoder max, not 131072)
- Dequantize embeddings to F16 instead of F32 (saves ~2GB)
- Cache unconditional embeddings to disk (keyed by GGUF path)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Going to review the merge conflicts and reopen. Stable in testing
Merge upstream changes (GPTQ support, ChatML auto-detect, qwen3_5_moe, sliding window cache, EOS token handling) with our additions (GGUF, speculative decoding, video models, KvCache optimization). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merge upstream additions:
- FLUX.2-klein support (take upstream's implementation over ours)
- Flash attention refactored into utils::flash_attn module
- GPTQ quantization support
- Qwen3.5 MoE model support

Retain our additions:
- LTX-Video, LTX-2/2.3 video generation
- GPU Gemma-3 via GGUF quantization
- Speculative decoding
- GGUF model loading
- Video infrastructure (VideoMaster, AVI muxer, video API)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fixed the latest conflicts
oh wow this is amazing @icryo! I was working on FLUX integration myself but am stuck on some artifacts in the output.
Did you run image and video generation (even as a single master node) and check the output?
Also, why so many changes to core files and stuff that is not related to the new models?
Looks like a VAE issue; Flux and Video are working on my end. Let me run a few tests and share some results.
PR is cleaned up but having a few headaches with WAN support.
…support
- Wan DiT transformer: verified correct vs Python (max_diff=0.0007 per step)
- UMT5-XXL text encoder: matches Python to 6 significant figures
- VAE decoder: chunked temporal decoding with feature cache
- GGUF quantized loading: Q4/Q5/Q8, all-F32 intermediates
- Diffusers format support: key remapping for transformer + VAE
- Distributed topology for 2-GPU setup
- Also: LLaVA VLM, Mixtral MoE, HunyuanVideo stubs
hey @icryo did you rebase with main? Because there have been a lot of core changes lately
Don't merge at this time, still troubleshooting some issues with supporting WAN. I'll rebase & re-open this when it's ready. |

Summary

- Speculative decoding (--draft-model, --spec-tokens)
- VideoGenerator trait, VideoMaster, AVI muxer, video API endpoint
- Worker layer types (ltx2-gemma, ltx2-transformer) for non-LLM model distribution

Performance (LTX-2, 768x512, 41 frames, 2 GPUs: 4090 + 5090)
Test plan
cargo test --features cuda — all 135 tests pass

🤖 Generated with Claude Code