
Model support: Qwen3-next and Qwen3.5#1233

Open
shihaobai wants to merge 31 commits into main from qwen3.5_clean

Conversation

@shihaobai
Collaborator

No description provided.

sufubao and others added 24 commits March 2, 2026 07:47
Implement comprehensive support for Qwen3Next model with linear attention mechanism:

Model Features:
- Implement linear attention with MTP (Multi-Token Prediction) capability
- Add custom Triton kernels for gated delta networks (GDN) operations
- Support chunked operations for efficient attention computation
- Add specialized buffer pool and memory managers for linear attention

Triton Kernels:
- Add causal_conv1d for efficient convolution operations
- Implement chunk-based operations (chunk_o, chunk_delta_h, chunk_scaled_dot_kkt)
- Add gated delta network kernels (fused_gdn_gating, gdn_decode_mtp)
- Implement fused normalization (gemma_rmsnorm, gated_rmsnorm)

Infrastructure:
- Add hybrid radix cache for efficient memory management
- Implement mamba cache manager for state management
- Add allocator utilities for buffer management
- Add parameter weight abstraction for flexible weight handling
- Update model registration and API endpoints

Performance Optimizations:
- Add H200 autotune configurations for all Triton kernels
- Optimize memory allocation with custom kernels
- Support chunked prefill and decode backends

This implementation enables efficient inference for models with linear attention
mechanisms, providing significant speedup for long sequence lengths.
- Reduce Triton kernels from 6 (1D/2D/3D × p2p/broadcast) to 2 (1D only)
  by flattening contiguous trailing dimensions via tensor view
- Wire up MambaCacheManager to use the Triton kernels instead of
  PyTorch advanced indexing with Python for-loops
- Cast strides to int64 in kernels to prevent pointer arithmetic overflow
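The kernel-count reduction above works because contiguous trailing dimensions can be collapsed into one flat axis via a tensor view, so a single 1D copy kernel covers the 2D/3D cases. A minimal numpy sketch of the same trick, on a simplified (slot, *dims) buffer (names here are illustrative, not the real kernel API):

```python
import numpy as np

def copy_slots_flat(buf: np.ndarray, src_idx, dst_idx):
    """Copy cache slots src_idx -> dst_idx along axis 0.

    Rather than one kernel per ndim, view the contiguous trailing
    dimensions as a single flat axis so one 1D-style indexed copy
    handles any trailing shape (mirrors the tensor-view flattening
    described above).
    """
    assert buf.flags["C_CONTIGUOUS"]
    flat = buf.reshape(buf.shape[0], -1)  # view: (slots, prod(trailing))
    flat[dst_idx] = flat[src_idx]         # writes through to buf
    return buf

buf = np.arange(4 * 2 * 3, dtype=np.float32).reshape(4, 2, 3)
copy_slots_flat(buf, src_idx=[0], dst_idx=[3])
```

Because `reshape` on a contiguous array returns a view, the indexed assignment mutates the original 3D buffer without any per-ndim specialization.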
- Add Qwen3.5 multimodal vision-language model support
- Add mamba_cache_ratio parameter (default 0.5)
- Change mamba_cache_size default from 3000 to None
- Implement automatic memory allocation based on ratio
- Add clear error messages with solutions when memory insufficient
- Maintain backward compatibility with explicit mamba_cache_size

Ratio formula: mamba_memory = total_available * ratio / (1 + ratio)
- ratio=0.5 -> 33% mamba, 67% KV
- ratio=1.0 -> 50% mamba, 50% KV
- ratio=2.0 -> 67% mamba, 33% KV
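The ratio formula above can be sketched as a small helper (hypothetical name; the real allocator wires this into memory-manager setup):

```python
def split_by_mamba_kv_ratio(total_bytes: float, ratio: float):
    """Old semantics: ratio = mamba / kv.

    Solving mamba = ratio * kv with mamba + kv = total gives
    mamba = total * ratio / (1 + ratio).
    """
    mamba = total_bytes * ratio / (1 + ratio)
    return mamba, total_bytes - mamba

# ratio=0.5 -> 1/3 mamba, ratio=1.0 -> 1/2, ratio=2.0 -> 2/3
```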
Change ratio meaning from complex formula to simple percentage:
- Old: ratio = mamba / kv, mamba = total * ratio / (1+ratio)
- New: ratio = mamba / total, mamba = total * ratio

This makes the ratio more intuitive:
- 0.3 → 30% mamba, 70% KV
- 0.5 → 50% mamba, 50% KV (default)
- 0.7 → 70% mamba, 30% KV

Also simplifies error message recommendation formula.
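Under the new semantics the split reduces to a plain percentage (again a hypothetical helper name, sketching the commit's formula):

```python
def split_by_mamba_fraction(total_bytes: float, ratio: float = 0.5):
    """New semantics: ratio = mamba / total, so mamba = total * ratio."""
    mamba = total_bytes * ratio
    return mamba, total_bytes - mamba
```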
- Replace get_radix_cache_class() classmethod with radix_cache_class
  class attribute in TpPartBaseModel and Qwen3NextTpPartModel
- Move RadixCache/HybridRadixCache imports to module top-level
- Update base_backend.py to access radix_cache_class directly
- Replace alloc_buffer_for_req_triton with simpler indexed PyTorch assignment
- Remove now-unused alloc_buffer_kernel.py Triton kernel
- Revert LOADWORKER default to 1 and remove language_model. prefix stripping
The sgl_kernel.fwd.default API requires attention_chunk before softcap.
This file was missed when the parameter was added in commit a4ab210.

Also update sgl-kernel from 0.3.7.post1 to 0.3.21 which supports this API.
- Rename copy_buffer_p2p → copy_mamba_buffer (indexed 1:1 slot copy)
- Rename copy_buffer_broadcast → fork_mamba_buffer (1:N MTP fork)
- Unify chunk offset param name (pair_idx_offset/copy_idx_offset → chunk_offset)
- Rename stride_index → stride_slot to reflect the slot/cache dimension
- Rename src_idx_in_batch → src_chunk_idx in fork kernel
- Extract _MAX_GRID_DIM = 65535 module constant (was duplicated inline)
- Add divisibility assertion before implicit // in fork autotuned wrapper
- Update autotuner cache keys to match new names
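`_MAX_GRID_DIM = 65535` reflects the CUDA limit on a single grid dimension, so a wrapper has to launch in chunks when the logical grid is larger, and the divisibility assertion guards the implicit floor division mentioned above. A pure-Python sketch of that launch pattern (illustrative names only, not the actual wrapper API):

```python
_MAX_GRID_DIM = 65535  # CUDA cap on a single grid dimension

def launch_chunks(total_programs: int, programs_per_copy: int):
    """Yield (chunk_offset, grid_size) pairs covering all copies.

    Asserts that the implicit // is exact, then splits the logical
    grid into launches no larger than _MAX_GRID_DIM.
    """
    assert total_programs % programs_per_copy == 0, (
        "total_programs must be divisible by programs_per_copy"
    )
    n_copies = total_programs // programs_per_copy
    for chunk_offset in range(0, n_copies, _MAX_GRID_DIM):
        yield chunk_offset, min(_MAX_GRID_DIM, n_copies - chunk_offset)
```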
Pre-commit hook formatting changes.
Configs for Qwen3.5-35B (layer_num=30) and 397B (layer_num=48):
- SSM state (float32): d_size=262144/393216
- Conv state (bf16): d_size=12288/15360
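The d_size figures above are element counts per layer; assuming they decompose as H·K·V with K=V=128 (read off the autotune config keys, so an inference rather than a documented fact), the per-slot byte cost follows from dtype width and layer count:

```python
# SSM state element counts match H * K * V with K = V = 128
assert 16 * 128 * 128 == 262144   # paired with layer_num=30 in the configs
assert 24 * 128 * 128 == 393216   # paired with layer_num=48 in the configs

def state_bytes_per_slot(d_size: int, dtype_bytes: int, layer_num: int) -> int:
    """Cache bytes one request slot needs across all linear-attn layers."""
    return d_size * dtype_bytes * layer_num

ssm_bytes  = state_bytes_per_slot(262144, 4, 30)  # float32 SSM state
conv_bytes = state_bytes_per_slot(12288, 2, 30)   # bf16 conv state
```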
- copy_buffer_p2p -> copy_state_buffers (removes misleading "p2p")
- copy_buffer_broadcast -> fork_state_buffers (aligns with fork_mamba_buffer kernel)
- copy_ssm_buffer_broadcast -> fork_ssm_buffers (consistent naming)
- Remove redundant docstrings in mamba_buffer_copy.py
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands LightLLM's model compatibility by integrating Qwen3-next and Qwen3.5, which utilize a novel hybrid attention architecture. The changes introduce a sophisticated memory management system tailored for linear attention states and leverage highly optimized Triton kernels to accelerate both prefill and multi-token prediction decoding. These updates enhance the framework's ability to efficiently handle complex, multimodal models with advanced architectural features.

Highlights

  • Model Support: Added comprehensive support for Qwen3-next and Qwen3.5 models, including both dense and MoE variants, and their multimodal capabilities.
  • Hybrid Attention Architecture: Implemented a hybrid attention mechanism for Qwen3-next and Qwen3.5, combining traditional full attention layers with Gated Delta Networks (GDN) for linear attention layers.
  • Multi-Token Prediction (MTP) Enhancements: Introduced advanced Multi-Token Prediction (MTP) support with specialized memory management for linear attention states and optimized Triton kernels for GDN decode.
  • New Triton Kernels: Developed and integrated several new Triton kernels for efficient GDN operations, including causal convolution, fused gating, chunked delta rule, and optimized state copying.
  • Memory Management for Hybrid Models: Created a TokenAllocator base class and MambaCacheManager to handle the unique state management requirements of linear attention, integrated into a Qwen3NextHybridMemManager.
  • Tool Call Parser: Added a Qwen3CoderDetector for parsing Qwen3-Coder's XML-style function calls, enhancing tool-use capabilities.
  • Multimodal Integration: Qwen3.5 models now fully integrate multimodal features from Qwen3VL, including mrope position encoding for image and video tokens, and optimized vision processing.
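The GDN layers highlighted above run a gated delta-rule recurrence at decode time. A single-step numpy reference, assuming the common formulation S_t = α_t(S_{t−1} − β_t (S_{t−1} k_t) k_tᵀ) + β_t v_t k_tᵀ with output o_t = S_t q_t (the real Triton kernels work on chunked, batched, per-head state and may differ in gating details):

```python
import numpy as np

def gdn_decode_step(S, q, k, v, alpha, beta):
    """One gated-delta-rule step (reference math, not the kernel).

    S: (d_v, d_k) linear-attention state mapping keys to values;
    alpha: decay gate in [0, 1]; beta: write strength in [0, 1].
    Returns (o, S_new).
    """
    S_new = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    return S_new @ q, S_new

d_k = d_v = 4
S0 = np.zeros((d_v, d_k))
k = q = np.eye(d_k)[0]                  # unit key; query same direction
v = np.array([1.0, 2.0, 3.0, 4.0])
o, S1 = gdn_decode_step(S0, q, k, v, alpha=1.0, beta=1.0)
# from S0 = 0 with beta = 1, the step stores v at key k, so o recovers v
```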


Changelog
  • lightllm/common/allocator_utils.py
    • Added TokenAllocator class for managing token memory.
  • lightllm/common/basemodel/attention_vit/fa3/fp.py
    • Modified _vit_att_fwd to include attention_chunk and softcap parameters.
  • lightllm/common/basemodel/basemodel.py
    • Imported RadixCache and added radix_cache_class attribute.
  • lightllm/common/basemodel/layer_infer/template/transformer_layer_infer_template.py
    • Refactored attention forward methods to separate attention logic from FFN.
  • lightllm/common/basemodel/layer_weights/meta_weights/__init__.py
    • Imported new norm weight classes (GEMMANormWeight, QKGEMMANormWeight) and parameter weight classes (ParameterWeight, TpParameterWeight).
  • lightllm/common/basemodel/layer_weights/meta_weights/norm_weight.py
    • Added GEMMANormWeight and QKGEMMANormWeight classes, including specific load_hf_weights and _triton_forward implementations for Gemma-style normalization.
  • lightllm/common/basemodel/layer_weights/meta_weights/parameter_weight.py
    • Added ParameterWeight and TpParameterWeight classes for handling general model parameters, including tensor parallelism splitting.
  • lightllm/common/basemodel/triton_kernel/mamba_buffer_copy.py
    • Added new Triton kernels (_copy_buffer_kernel, _fork_buffer_kernel) and Python wrappers (copy_mamba_buffer, fork_mamba_buffer) for efficient copying and forking of Mamba-style state buffers.
  • lightllm/common/basemodel/triton_kernel/norm/qk_norm.py
    • Modified _qk_rms_norm_fused_kernel and qk_rmsnorm_fused_forward to include FP32_MULTIPLY parameter for Gemma-specific normalization behavior.
  • lightllm/common/kv_cache_mem_manager/mem_manager.py
    • Refactored MemoryManager to inherit from TokenAllocator and removed duplicated memory allocation/free logic, also updated shared memory name.
  • lightllm/common/mamba_cache_mem_manager/cache_manager.py
    • Added MambaCacheManager and ReadOnlyStaticsMambaCacheManager for managing Mamba-style linear attention states, including LayerCache and buffer copy/fork functionalities.
  • lightllm/common/req_manager.py
    • Added ReqManagerForMamba which extends ReqManager to handle Mamba-specific buffer allocation, freeing, and copying.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/chunk_fwd_o/{BT=16,H=12,K=128,V=128}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/chunk_fwd_o/{BT=16,H=16,K=128,V=128}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/chunk_fwd_o/{BT=16,H=8,K=128,V=128}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/chunk_fwd_o/{BT=32,H=12,K=128,V=128}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/chunk_fwd_o/{BT=32,H=16,K=128,V=128}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/chunk_fwd_o/{BT=32,H=8,K=128,V=128}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/chunk_fwd_o/{BT=64,H=12,K=128,V=128}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/chunk_fwd_o/{BT=64,H=16,K=128,V=128}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/chunk_fwd_o/{BT=64,H=8,K=128,V=128}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/chunk_gated_delta_rule_fwd_h/{BT=64,H=12,K=128,V=128}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/chunk_gated_delta_rule_fwd_h/{BT=64,H=16,K=128,V=128}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/chunk_gated_delta_rule_fwd_h/{BT=64,H=8,K=128,V=128}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/chunk_local_cumsum_scalar/{B=1,BT=64,H=12,IS_VARLEN=true,REVERSE=false}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/chunk_local_cumsum_scalar/{B=1,BT=64,H=16,IS_VARLEN=true,REVERSE=false}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/chunk_local_cumsum_scalar/{B=1,BT=64,H=8,IS_VARLEN=true,REVERSE=false}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/chunk_scaled_dot_kkt_fwd/{BT=64,H=12,IS_VARLEN=true,K=128}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/chunk_scaled_dot_kkt_fwd/{BT=64,H=16,IS_VARLEN=true,K=128}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/chunk_scaled_dot_kkt_fwd/{BT=64,H=8,IS_VARLEN=true,K=128}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/fused_gdn_gating:v1/{NUM_HEADS=12,a_dtype=torch.bfloat16}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/fused_gdn_gating:v1/{NUM_HEADS=16,a_dtype=torch.bfloat16}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/fused_gdn_gating:v1/{NUM_HEADS=8,a_dtype=torch.bfloat16}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/gated_rmsnorm_forward:v1/{N=128,has_bias=false,weight_dtype=torch.bfloat16,x_dtype=torch.bfloat16}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/gemma_rmsnorm_forward:v1/{N=2048,weight_dtype=torch.bfloat16,x_dtype=torch.bfloat16}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/gemma_rmsnorm_forward:v1/{N=256,weight_dtype=torch.bfloat16,x_dtype=torch.bfloat16}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/gemma_rmsnorm_forward:v1/{N=3072,weight_dtype=torch.bfloat16,x_dtype=torch.bfloat16}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/gemma_rmsnorm_forward:v1/{N=5120,weight_dtype=torch.bfloat16,x_dtype=torch.bfloat16}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/grouped_matmul:v1/{K=128,N=2048,expert_num=256,mul_routed_weight=true,out_dtype=torch.bfloat16,topk_num=1,use_fp8_w8a8=false}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/grouped_matmul:v1/{K=128,N=3072,expert_num=256,mul_routed_weight=true,out_dtype=torch.bfloat16,topk_num=1,use_fp8_w8a8=false}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/grouped_matmul:v1/{K=2048,N=256,expert_num=256,mul_routed_weight=false,out_dtype=torch.bfloat16,topk_num=8,use_fp8_w8a8=false}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/grouped_matmul:v1/{K=2048,N=512,expert_num=256,mul_routed_weight=false,out_dtype=torch.bfloat16,topk_num=8,use_fp8_w8a8=false}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/grouped_matmul:v1/{K=256,N=2048,expert_num=256,mul_routed_weight=true,out_dtype=torch.bfloat16,topk_num=1,use_fp8_w8a8=false}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/grouped_matmul:v1/{K=3072,N=256,expert_num=256,mul_routed_weight=false,out_dtype=torch.bfloat16,topk_num=8,use_fp8_w8a8=false}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/moe_align_fused:v1/{topk_num=8}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/moe_sum_reduce:v1/{hidden_dim=2048,out_dtype=torch.bfloat16,topk_num=8}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/moe_sum_reduce:v1/{hidden_dim=3072,out_dtype=torch.bfloat16,topk_num=8}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/mrope_triton_fused:v1/{HEAD_DIM=64,K_HEAD_NUM=1,Q_HEAD_NUM=4,dtype=torch.bfloat16}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/mrope_triton_fused:v1/{HEAD_DIM=64,K_HEAD_NUM=1,Q_HEAD_NUM=6,dtype=torch.bfloat16}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/mrope_triton_fused:v1/{HEAD_DIM=64,K_HEAD_NUM=1,Q_HEAD_NUM=8,dtype=torch.bfloat16}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/silu_and_mul_fwd:v1/{N=128,out_dtype=torch.bfloat16}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H100_80GB_HBM3/silu_and_mul_fwd:v1/{N=256,out_dtype=torch.bfloat16}_NVIDIA_H100_80GB_HBM3.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/chunk_fwd_o/{BT=16,H=16,K=128,V=128}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/chunk_fwd_o/{BT=16,H=24,K=128,V=128}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/chunk_fwd_o/{BT=16,H=8,K=128,V=128}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/chunk_fwd_o/{BT=32,H=16,K=128,V=128}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/chunk_fwd_o/{BT=32,H=24,K=128,V=128}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/chunk_fwd_o/{BT=32,H=8,K=128,V=128}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/chunk_fwd_o/{BT=64,H=16,K=128,V=128}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/chunk_fwd_o/{BT=64,H=24,K=128,V=128}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/chunk_fwd_o/{BT=64,H=8,K=128,V=128}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/chunk_gated_delta_rule_fwd_h/{BT=64,H=16,K=128,V=128}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/chunk_gated_delta_rule_fwd_h/{BT=64,H=24,K=128,V=128}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/chunk_gated_delta_rule_fwd_h/{BT=64,H=8,K=128,V=128}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/chunk_local_cumsum_scalar/{B=1,BT=64,H=16,IS_VARLEN=true,REVERSE=false}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/chunk_local_cumsum_scalar/{B=1,BT=64,H=24,IS_VARLEN=true,REVERSE=false}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/chunk_local_cumsum_scalar/{B=1,BT=64,H=8,IS_VARLEN=true,REVERSE=false}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/chunk_scaled_dot_kkt_fwd/{BT=64,H=16,IS_VARLEN=true,K=128}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/chunk_scaled_dot_kkt_fwd/{BT=64,H=24,IS_VARLEN=true,K=128}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/chunk_scaled_dot_kkt_fwd/{BT=64,H=8,IS_VARLEN=true,K=128}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/fused_gdn_gating:v1/{NUM_HEADS=16,a_dtype=torch.bfloat16}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/fused_gdn_gating:v1/{NUM_HEADS=24,a_dtype=torch.bfloat16}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/fused_gdn_gating:v1/{NUM_HEADS=8,a_dtype=torch.bfloat16}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/gated_rmsnorm_forward:v1/{N=128,has_bias=false,weight_dtype=torch.bfloat16,x_dtype=torch.bfloat16}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/gemma_rmsnorm_forward:v1/{N=2048,weight_dtype=torch.bfloat16,x_dtype=torch.bfloat16}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/gemma_rmsnorm_forward:v1/{N=256,weight_dtype=torch.bfloat16,x_dtype=torch.bfloat16}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/gemma_rmsnorm_forward:v1/{N=4096,weight_dtype=torch.bfloat16,x_dtype=torch.bfloat16}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/gemma_rmsnorm_forward:v1/{N=5120,weight_dtype=torch.bfloat16,x_dtype=torch.bfloat16}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/grouped_matmul:v1/{K=128,N=2048,expert_num=512,mul_routed_weight=true,out_dtype=torch.bfloat16,topk_num=1,use_fp8_w8a8=false}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/grouped_matmul:v1/{K=128,N=4096,expert_num=512,mul_routed_weight=true,out_dtype=torch.bfloat16,topk_num=1,use_fp8_w8a8=false}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/grouped_matmul:v1/{K=2048,N=256,expert_num=512,mul_routed_weight=false,out_dtype=torch.bfloat16,topk_num=10,use_fp8_w8a8=false}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/grouped_matmul:v1/{K=2048,N=512,expert_num=256,mul_routed_weight=false,out_dtype=torch.bfloat16,topk_num=8,use_fp8_w8a8=false}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/grouped_matmul:v1/{K=2048,N=512,expert_num=512,mul_routed_weight=false,out_dtype=torch.bfloat16,topk_num=10,use_fp8_w8a8=false}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/grouped_matmul:v1/{K=256,N=2048,expert_num=256,mul_routed_weight=true,out_dtype=torch.bfloat16,topk_num=1,use_fp8_w8a8=false}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/grouped_matmul:v1/{K=256,N=2048,expert_num=512,mul_routed_weight=true,out_dtype=torch.bfloat16,topk_num=1,use_fp8_w8a8=false}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/grouped_matmul:v1/{K=4096,N=256,expert_num=512,mul_routed_weight=false,out_dtype=torch.bfloat16,topk_num=10,use_fp8_w8a8=false}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/mamba_buffer_copy_1d:v1/{d_size=12288,dtype=torch.bfloat16,layer_num=30,ndim=3}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/mamba_buffer_copy_1d:v1/{d_size=15360,dtype=torch.bfloat16,layer_num=48,ndim=3}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/mamba_buffer_copy_1d:v1/{d_size=262144,dtype=torch.float32,layer_num=30,ndim=3}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/mamba_buffer_copy_1d:v1/{d_size=393216,dtype=torch.float32,layer_num=48,ndim=3}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/mamba_buffer_fork_1d:v1/{d_size=12288,dtype=torch.bfloat16,layer_num=30,ndim=3}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/mamba_buffer_fork_1d:v1/{d_size=15360,dtype=torch.bfloat16,layer_num=48,ndim=3}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/mamba_buffer_fork_1d:v1/{d_size=262144,dtype=torch.float32,layer_num=30,ndim=3}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/mamba_buffer_fork_1d:v1/{d_size=393216,dtype=torch.float32,layer_num=48,ndim=3}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/moe_align_fused:v1/{topk_num=10}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/moe_sum_reduce:v1/{hidden_dim=2048,out_dtype=torch.bfloat16,topk_num=10}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/moe_sum_reduce:v1/{hidden_dim=4096,out_dtype=torch.bfloat16,topk_num=10}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/mrope_triton_fused:v1/{HEAD_DIM=256,K_HEAD_NUM=1,Q_HEAD_NUM=4,dtype=torch.bfloat16}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/mrope_triton_fused:v1/{HEAD_DIM=64,K_HEAD_NUM=1,Q_HEAD_NUM=4,dtype=torch.bfloat16}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/mrope_triton_fused:v1/{HEAD_DIM=64,K_HEAD_NUM=1,Q_HEAD_NUM=8,dtype=torch.bfloat16}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/mrope_triton_fused:v1/{HEAD_DIM=64,K_HEAD_NUM=2,Q_HEAD_NUM=12,dtype=torch.bfloat16}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/silu_and_mul_fwd:v1/{N=128,out_dtype=torch.bfloat16}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/silu_and_mul_fwd:v1/{N=256,out_dtype=torch.bfloat16}_NVIDIA_H200.json
    • Added autotuned kernel configuration.
  • lightllm/models/__init__.py
    • Imported Qwen3NextTpPartModel, Qwen3NextMTPModel, Qwen3_5TpPartModel, and Qwen3_5MOETpPartModel into the model registry.
  • lightllm/models/llama/layer_weights/transformer_layer_weight.py
    • Modified _parse_config to use .get for intermediate_size with a default value.
  • lightllm/models/qwen2_5_vl/qwen2_5_visual.py
    • Modified rot_pos_emb to correctly handle t (temporal dimension) in grid_thw for position IDs.
  • lightllm/models/qwen2_vl/qwen2_visual.py
    • Optimized VisionPatchEmbed by converting proj.weight.data to channels_last_3d and ensuring hidden_states are contiguous in channels_last_3d format for Conv3d operations.
  • lightllm/models/qwen2_vl/triton_kernel/mrope.py
    • Modified mrope_triton_fused to accept partial_rotary_factor for head_dim calculation.
  • lightllm/models/qwen2_vl/vision_process.py
    • Modified _preprocess_bydevice to create a copy of the NumPy array before converting to a PyTorch tensor, preventing a warning about read-only arrays.
    • Added debug logging for image processing.
  • lightllm/models/qwen3_5/__init__.py
    • Added Qwen3_5TpPartModel and QWen3_5Tokenizer to the module.
  • lightllm/models/qwen3_5/infer_struct.py
    • Added Qwen35InferStateInfo inheriting from Qwen2VLInferStateInfo, incorporating Qwen3Next features like output gating, MTP-aware batching, and hybrid attention buffer management, along with mrope position encoding.
  • lightllm/models/qwen3_5/layer_infer/transformer_layer_infer.py
    • Added Qwen35FullAttentionTransformerLayerInfer and Qwen35GatedDeltaNetTransformerLayerInfer classes, inheriting from Qwen3Next versions and adapting _get_qkv for Qwen3.5-specific mrope and output gating.
  • lightllm/models/qwen3_5/layer_weights/transformer_layer_weight.py
    • Added Qwen35TransformerLayerWeight inheriting from Qwen3NextTransformerLayerWeight, adjusting weight names and _init_gdn_weight to handle grouped QKVZBA weights for linear attention.
  • lightllm/models/qwen3_5/model.py
    • Added Qwen3_5TpPartModel and QWen3_5Tokenizer, combining hybrid attention, multimodal support, and dense MLP layers.
    • Overrode _init_config and _init_infer_layer for Qwen3.5 specifics.
  • lightllm/models/qwen3_5_moe/layer_weights/transformer_layer_weight.py
    • Added Qwen35MOETransformerLayerWeight and split_fused_expert_weights function to handle fused expert weights for MoE models.
  • lightllm/models/qwen3_5_moe/model.py
    • Added Qwen3_5MOETpPartModel inheriting from Qwen3_5TpPartModel with MoE-specific transformer layer weights.
  • lightllm/models/qwen3_moe/layer_infer/transformer_layer_infer.py
    • Modified Qwen3MOETransformerLayerInfer to use .get with default values for network config parameters, making it more robust to missing keys.
  • lightllm/models/qwen3_moe/layer_weights/transformer_layer_weight.py
    • Modified Qwen3MOETransformerLayerWeight to use .get with default values for network config parameters.
  • lightllm/models/qwen3_moe/model.py
    • Modified _init_custom to conditionally initialize DeepEP group only if num_experts is present and greater than 0 in the config.
  • lightllm/models/qwen3_omni_moe_thinker/qwen3_omni_visual.py
    • Optimized VisionPatchEmbed by converting proj.weight.data to channels_last_3d and ensuring hidden_states are contiguous in channels_last_3d format for Conv3d operations.
  • lightllm/models/qwen3_vl/qwen3_visual.py
    • Added logger.
    • Optimized VisionPatchEmbed by converting proj.weight.data to channels_last_3d and ensuring hidden_states are contiguous in channels_last_3d format for Conv3d operations.
    • Added debug logging for image processing.
  • lightllm/models/qwen3next/__init__.py
    • Added Qwen3NextTpPartModel to the module.
  • lightllm/models/qwen3next/infer_struct.py
    • Added Qwen3NextInferStateInfo inheriting from LlamaInferStateInfo, introducing gate_value for output gating and MTP-aware batching attributes.
  • lightllm/models/qwen3next/layer_infer/transformer_layer_infer.py
    • Added Qwen3NextFullAttentionBaseLayerInfer, Qwen3NextFullAttentionTransformerLayerInfer, and Qwen3NextGatedDeltaNetTransformerLayerInfer classes.
    • Implemented hybrid attention, MoE FFN, output gating, and GDN-specific forward passes with MTP optimizations.
  • lightllm/models/qwen3next/layer_weights/pre_and_post_layer_weight.py
    • Added Qwen3NextPreAndPostLayerWeight for Qwen3Next models, using GEMMANormWeight for normalization.
  • lightllm/models/qwen3next/layer_weights/transformer_layer_weight.py
    • Added Qwen3NextTransformerLayerWeight for Qwen3Next models, implementing hybrid attention weight initialization, Gemma-style norms, and specific weight preprocessing for GDN.
  • lightllm/models/qwen3next/mem_manager.py
    • Added Qwen3NextHybridMemManager to manage both KV cache for full attention layers and Mamba-style states for linear attention layers.
  • lightllm/models/qwen3next/model.py
    • Added Qwen3NextTpPartModel, implementing hybrid attention, MTP support, and custom memory management for GDN states.
    • Included autotuning for layers and set the Triton allocator.
  • lightllm/models/qwen3next/triton_kernel/causal_conv1d.py
    • Added causal_conv1d_fn and causal_conv1d_update for 1D causal convolution operations, adapted from SGLang.
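The causal convolution underlying `causal_conv1d_fn` is simple to state: each output position depends only on the current and past inputs within the kernel window, per channel. A minimal pure-Python reference for one channel (a sketch of the assumed semantics, not the Triton kernel's actual signature):

```python
def causal_conv1d_ref(x, w):
    """Reference for a per-channel causal 1D convolution.

    x: one channel's sequence as a list of floats; w: kernel weights,
    where w[-1] multiplies the current timestep and earlier entries
    multiply progressively older inputs (zero-padded past).
    """
    k = len(w)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j in range(k):
            idx = t - (k - 1 - j)  # j == k-1 -> current step
            if idx >= 0:
                acc += w[j] * x[idx]
        out.append(acc)
    return out

# Kernel [0, 0, 1] reproduces the input; [1, 0, 0] delays it by two steps.
print(causal_conv1d_ref([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0]))  # [1.0, 2.0, 3.0, 4.0]
```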
  • lightllm/models/qwen3next/triton_kernel/fla/__init__.py
    • Added __init__.py for the fla package, indicating adapted code from vLLM/flash-linear-attention.
  • lightllm/models/qwen3next/triton_kernel/fla/ops/__init__.py
    • Added __init__.py for fla.ops, exposing chunk_gated_delta_rule and fused_recurrent_gated_delta_rule.
  • lightllm/models/qwen3next/triton_kernel/fla/ops/chunk.py
    • Added chunk_gated_delta_rule_fwd and chunk_gated_delta_rule for chunked gated delta rule attention, including L2 normalization.
  • lightllm/models/qwen3next/triton_kernel/fla/ops/chunk_delta_h.py
    • Added chunk_gated_delta_rule_fwd_kernel_h_blockdim64 and chunk_gated_delta_rule_fwd_h for computing the h state in chunked GDN.
  • lightllm/models/qwen3next/triton_kernel/fla/ops/chunk_o.py
    • Added chunk_fwd_kernel_o and chunk_fwd_o for computing the output o in chunked GDN.
  • lightllm/models/qwen3next/triton_kernel/fla/ops/chunk_scaled_dot_kkt.py
    • Added chunk_scaled_dot_kkt_fwd_kernel and chunk_scaled_dot_kkt_fwd for computing beta * K * K^T in chunked GDN.
  • lightllm/models/qwen3next/triton_kernel/fla/ops/cumsum.py
    • Added chunk_local_cumsum_scalar_kernel, chunk_local_cumsum_vector_kernel, and chunk_local_cumsum for local cumulative sum operations.
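The "local" in `chunk_local_cumsum` means the cumulative sum of the gating values restarts at every chunk boundary, so decay is accumulated within a chunk only. A pure-Python sketch of that behavior (assumed semantics, not the kernel's API):

```python
def chunk_local_cumsum_ref(g, chunk_size):
    """Cumulative sum of gating values, restarted at each chunk boundary."""
    out = []
    for start in range(0, len(g), chunk_size):
        acc = 0.0
        for x in g[start:start + chunk_size]:
            acc += x
            out.append(acc)
    return out

# With uniform gates and chunk_size=3 the running sum resets every 3 steps.
print(chunk_local_cumsum_ref([1, 1, 1, 1, 1, 1], 3))  # [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
```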
  • lightllm/models/qwen3next/triton_kernel/fla/ops/fused_recurrent.py
    • Added fused_recurrent_gated_delta_rule_fwd_kernel and fused_recurrent_gated_delta_rule for fused recurrent gated delta rule attention, including fused gating.
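For orientation, the recurrence these kernels compute can be sketched in plain Python. The formulation below is a common statement of the gated delta rule (an assumption for illustration, not code lifted from the kernels): with per-step decay a_t = exp(g_t), the state update corrects the stored value for the current key rather than blindly accumulating it.

```python
import math

def gated_delta_rule_ref(qs, ks, vs, gs, betas):
    """Recurrent gated delta rule for a single head.

    State S is a d_v x d_k matrix, updated as
        S_t = a_t * S_{t-1} + beta_t * (v_t - a_t * S_{t-1} @ k_t) * k_t^T
    with a_t = exp(g_t); the output is o_t = S_t @ q_t.
    """
    d_k, d_v = len(ks[0]), len(vs[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    outs = []
    for q, k, v, g, beta in zip(qs, ks, vs, gs, betas):
        a = math.exp(g)
        Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
        for i in range(d_v):
            err = v[i] - a * Sk[i]  # delta: new value minus decayed prediction
            for j in range(d_k):
                S[i][j] = a * S[i][j] + beta * err * k[j]
        outs.append([sum(S[i][j] * q[j] for j in range(d_k)) for i in range(d_v)])
    return outs
```

A useful property to check: with no decay (g=0) and beta=1, writing a second value under the same unit-norm key overwrites the first rather than summing with it.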
  • lightllm/models/qwen3next/triton_kernel/fla/ops/index.py
    • Added helper functions prepare_lens, prepare_chunk_indices, and prepare_chunk_offsets for handling variable-length sequences in chunked operations.
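The chunk-offset bookkeeping for variable-length batches reduces to an exclusive prefix sum of per-sequence chunk counts. A sketch of what `prepare_chunk_offsets` needs to produce (assumed shape, not the actual helper's signature):

```python
def prepare_chunk_offsets_ref(seq_lens, chunk_size):
    """Exclusive prefix of per-sequence chunk counts for a varlen batch:
    offsets[i] is the global index of sequence i's first chunk."""
    offsets = [0]
    for n in seq_lens:
        offsets.append(offsets[-1] + -(-n // chunk_size))  # ceil-division
    return offsets

# Sequences of length 5, 3, 8 with chunk_size 4 occupy 2, 1, 2 chunks.
print(prepare_chunk_offsets_ref([5, 3, 8], 4))  # [0, 2, 3, 5]
```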
  • lightllm/models/qwen3next/triton_kernel/fla/ops/l2norm.py
    • Added l2norm_fwd_kernel1, l2norm_fwd_kernel, l2norm_fwd_kernel2, and l2norm_fwd for L2 normalization.
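The L2 normalization applied to q/k before the delta rule is the usual row-wise form; a minimal reference (eps placement is an assumption):

```python
import math

def l2norm_ref(x, eps=1e-6):
    """Row-wise L2 normalization: x / sqrt(sum(x^2) + eps)."""
    denom = math.sqrt(sum(v * v for v in x) + eps)
    return [v / denom for v in x]
```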
  • lightllm/models/qwen3next/triton_kernel/fla/ops/op.py
    • Added Triton utility functions like safe_exp, gather, and make_tensor_descriptor (with fallbacks).
  • lightllm/models/qwen3next/triton_kernel/fla/ops/solve_tril.py
    • Added solve_tril_16x16_kernel, merge_16x16_to_32x32_inverse_kernel, merge_16x16_to_64x64_inverse_kernel, and solve_tril for solving lower triangular systems.
  • lightllm/models/qwen3next/triton_kernel/fla/ops/utils.py
    • Added utility functions for Triton kernels, including tensor_cache, input_guard, platform checks, and shared memory checks.
  • lightllm/models/qwen3next/triton_kernel/fla/ops/wy_fast.py
    • Added recompute_w_u_fwd_kernel and recompute_w_u_fwd for recomputing w and u in GDN.
  • lightllm/models/qwen3next/triton_kernel/fused_add_gemma_rmsnorm.py
    • Added _fused_add_gemma_rmsnorm_kernel and fused_add_gemma_rmsnorm for fused residual add and Gemma RMSNorm.
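The "Gemma" in the norm name refers to the Gemma convention of scaling by (1 + weight) rather than weight directly, so a zero-initialized weight yields an identity scale. A per-token reference for the fused residual-add plus norm (a sketch under that assumption; the fused kernel also returns the post-add residual for the next layer):

```python
import math

def fused_add_gemma_rmsnorm_ref(x, residual, weight, eps=1e-6):
    """Fused residual add + Gemma-style RMSNorm for one token."""
    h = [a + b for a, b in zip(x, residual)]  # residual add, kept for next layer
    rms = math.sqrt(sum(v * v for v in h) / len(h) + eps)
    out = [v / rms * (1.0 + w) for v, w in zip(h, weight)]
    return out, h
```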
  • lightllm/models/qwen3next/triton_kernel/fused_gdn_gating.py
    • Added fused_gdn_gating_kernel and fused_gdn_gating for fused computation of GDN gating parameters (g and beta).
  • lightllm/models/qwen3next/triton_kernel/fused_qkv_gating.py
    • Added _fused_gdn_gating_only_kernel and fused_gdn_gating_v2 for fused QKV projection and GDN gating computation.
  • lightllm/models/qwen3next/triton_kernel/gated_rmsnorm.py
    • Added gated_rmsnorm_forward_kernel and gated_rmsnorm_forward for gated RMSNorm.
  • lightllm/models/qwen3next/triton_kernel/gdn_decode_mtp.py
    • Added various Triton kernels and Python wrappers for optimized GDN decode with MTP support, including data reorganization and state copying.
  • lightllm/models/qwen3next_mtp/__init__.py
    • Added Qwen3NextMTPModel to the module.
  • lightllm/models/qwen3next_mtp/layer_infer/pre_layer_infer.py
    • Added Qwen3NextMTPPreLayerInfer for MTP pre-layer inference, handling normalization and projection of embeddings.
  • lightllm/models/qwen3next_mtp/layer_infer/transformer_layer_infer.py
    • Added Qwen3NextMTPTransformerLayerInfer for MTP transformer layers, using full attention with MoE FFN and shared expert.
  • lightllm/models/qwen3next_mtp/layer_weights/pre_and_post_layer_weight.py
    • Added Qwen3NextMTPPreAndPostLayerWeight for MTP pre/post layer weights, using Gemma-style normalization.
  • lightllm/models/qwen3next_mtp/layer_weights/transformer_layer_weight.py
    • Added Qwen3NextMTPTransformerLayerWeight for MTP transformer layer weights, overriding QKV initialization and adding shared expert weights.
  • lightllm/models/qwen3next_mtp/model.py
    • Added Qwen3NextMTPModel, which is a specialized MTP model that shares memory and components with the main model.
  • lightllm/models/vit/triton_kernel/flashattention_nopad.py
    • Modified flash_attention_v3_fwd to include attention_chunk and softcap parameters.
  • lightllm/server/api_cli.py
    • Added qwen3_coder to tool_call_parser choices and qwen3next_vanilla, qwen3next_eagle to mtp_mode choices.
    • Added new arguments --mamba_cache_size, --mamba_cache_ratio, and --mamba_ssm_data_type for hybrid attention models.
  • lightllm/server/api_openai.py
    • Added support for file:// prefix in image URLs for multimodal requests.
  • lightllm/server/api_start.py
    • Modified MTP initialization logic to default mtp_draft_model_dir to model_dir if not specified, and to ensure mtp_step is greater than 0.
  • lightllm/server/build_prompt.py
    • Added logic to convert tool_calls function arguments from JSON string to dictionary for Jinja template compatibility in Qwen's chat template.
    • Made tools parameter optional.
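The tool_calls conversion exists because Qwen's chat template iterates over call arguments as a mapping, while OpenAI-style clients send them JSON-encoded. A hypothetical helper mirroring that normalization (the in-place decode is an assumption about the change, not the exact build_prompt code):

```python
import json

def normalize_tool_calls(messages):
    """Decode JSON-string tool-call arguments into dicts, in place."""
    for msg in messages:
        for call in msg.get("tool_calls") or []:
            args = call.get("function", {}).get("arguments")
            if isinstance(args, str):
                call["function"]["arguments"] = json.loads(args)
    return messages
```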
  • lightllm/server/core/objs/sampling_params.py
    • Refactored __init__ and load_generation_cfg methods to handle None values for sampling parameters more robustly, ensuring default numeric values are used.
  • lightllm/server/core/objs/start_args_type.py
    • Added qwen3_coder to tool_call_parser choices and qwen3next_vanilla, qwen3next_eagle to mtp_mode choices.
    • Changed default running_max_req_size and graph_max_batch_size.
    • Added mamba_cache_size, mamba_cache_ratio, and mamba_ssm_data_type fields.
  • lightllm/server/function_call_parser.py
    • Added Qwen3CoderDetector for parsing Qwen3-Coder XML-style function calls and registered it in FunctionCallParser.
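Qwen3-Coder emits tool calls as XML-style tags rather than JSON. A minimal extraction sketch — the `<function=...>`/`<parameter=...>` tag layout is assumed here for illustration and is not taken verbatim from the Qwen3CoderDetector implementation:

```python
import re

def parse_qwen3_coder_call(text):
    """Extract one XML-style function call: name plus parameter dict."""
    m = re.search(r"<function=([^>]+)>(.*?)</function>", text, re.DOTALL)
    if not m:
        return None
    name, body = m.group(1), m.group(2)
    params = dict(re.findall(r"<parameter=([^>]+)>(.*?)</parameter>", body, re.DOTALL))
    return {"name": name, "arguments": params}
```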
  • lightllm/server/httpserver/manager.py
    • Added debug logging for multimodal requests.
  • lightllm/server/router/dynamic_prompt/hybrid_radix_cache.py
    • Added HybridRadixCache inheriting from RadixCache, introducing buffer_mem_manager and evict_buffer_set for managing Mamba-style states alongside KV cache.
    • Implemented specialized eviction and insertion logic.
  • lightllm/server/router/dynamic_prompt/radix_cache.py
    • Modified TreeNode to include buffer_idx and buffer_time for hybrid attention models.
    • Updated RadixCache constructor to accept kv_cache_mem_manager.
    • Modified _try_merge to consider buffer_idx when merging nodes.
  • lightllm/server/router/model_infer/infer_batch.py
    • Added has_recurrent_state property to InferenceContext.
    • Modified register to assert HybridRadixCache for recurrent state models.
    • Added _alloc_and_copy_req_buffers for Mamba-specific buffer management.
    • Refactored free_a_req_mem into free_a_req_mem_for_mamba and _free_req_mem_and_buffers to handle Mamba buffers.
    • Updated _match_radix_cache to incorporate Mamba prefill block size and min insert length.
  • lightllm/server/router/model_infer/mode_backend/base_backend.py
    • Updated init_model to use radix_cache_class from the model and pass kv_cache_mem_manager.
    • Modified init_mtp_draft_model to support qwen3next_vanilla and qwen3next_eagle MTP modes, and to dynamically calculate mem_layer_start and disable CUDA graphs for draft models.
  • lightllm/server/router/model_infer/mode_backend/chunked_prefill/impl.py
    • Added _maybe_insert_hybrid_radix_cache to handle hybrid radix cache insertion during prefill.
    • Modified prefill_normal and prefill_mtp to call this new method.
    • Added logic to copy accepted buffer states back to buffer[0] for MTP in decode_mtp.
  • lightllm/server/router/model_infer/mode_backend/dp_backend/impl.py
    • Added logic to copy accepted buffer states back to buffer[0] for MTP in decode_mtp and decode_overlap_mtp.
  • lightllm/server/tokenizer.py
    • Added QWen3_5Tokenizer for Qwen3.5 models with multimodal support.
  • lightllm/server/visualserver/model_infer/model_rpc.py
    • Extended Qwen3VisionTransformerPretrainedModel support to qwen3_5 and qwen3_5_moe model types.
  • lightllm/utils/config_utils.py
    • Added logic to retrieve eos_token_id from tokenizer for Qwen3.5 models, prioritizing it over config.
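The precedence change amounts to "tokenizer first, config fallback". A hypothetical sketch of that resolution (names are placeholders, and normalizing to a list is an assumption):

```python
def resolve_eos_token_id(tokenizer_eos, config_eos):
    """Prefer the tokenizer's eos_token_id; fall back to the model config."""
    eos = tokenizer_eos if tokenizer_eos is not None else config_eos
    if eos is None:
        return []
    return eos if isinstance(eos, list) else [eos]
```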
  • requirements.txt
    • Updated sgl-kernel version from 0.3.7.post1 to 0.3.21.
  • test_gsmk.py
    • Added a new test file for GSM8K benchmark, adapted from SGLang, to evaluate model accuracy and latency using LightLLM API.
@gemini-code-assist bot left a comment
Code Review

This pull request introduces significant new features to support Qwen3-next and Qwen3.5 models, which utilize hybrid attention mechanisms (mixing full attention and linear attention) and multi-token prediction (MTP). The changes are extensive, touching upon memory management, inference logic, Triton kernels, and API layers.

Key additions include a new TokenAllocator for managing token memory, a MambaCacheManager for handling recurrent states, and a HybridRadixCache to support prompt caching for these new models. The inference and weight-loading logic has been substantially refactored to accommodate the unique architectures of these models.

While the implementation is comprehensive, I've identified a critical security vulnerability related to arbitrary file access, a high-severity issue with the use of assert for error handling, and several medium-severity concerns regarding performance and code maintainability. Addressing these points will be crucial for ensuring the stability, security, and long-term health of the codebase.

Comment on lines +179 to +185:

```python
elif img.startswith("file://"):
    # Local file path with file:// prefix
    file_path = img[7:]  # Remove "file://" prefix
    with open(file_path, "rb") as f:
        multimodal_params_dict["images"].append(
            {"type": "base64", "data": base64.b64encode(f.read()).decode("utf-8")}
        )
```
critical

Accepting arbitrary file paths via file:// URIs introduces a severe security vulnerability known as Path Traversal. A malicious user could craft a request to access sensitive files on the server (e.g., file:///etc/passwd). All user-provided paths must be strictly validated to ensure they are confined to a designated, safe directory. Without proper validation, this could lead to unauthorized file access and information disclosure.
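One way to confine file:// paths, as a sketch of the validation the reviewer asks for: resolve the path and reject anything that escapes a configured root directory. Here `allowed_root` is a placeholder, not an existing lightllm setting.

```python
import os

def safe_local_image_path(url, allowed_root="/data/images"):
    """Resolve a file:// URL and reject paths outside allowed_root."""
    assert url.startswith("file://")
    path = os.path.realpath(url[len("file://"):])  # resolves ".." and symlinks
    root = os.path.realpath(allowed_root)
    if os.path.commonpath([path, root]) != root:
        raise PermissionError(f"path {path} escapes {root}")
    return path
```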

```python
def alloc(self, need_size) -> torch.Tensor:
    if need_size > self.mark_end - self.mark_start:
        logger.error(f"warn no enough cache need_size {need_size} left_size {self.can_use_mem_size}")
        assert False, "error alloc state"
```
high

Using assert False for error handling is not robust, as assertions can be disabled in production environments (e.g., when running Python with the -O flag). This would lead to silent failures when the allocator runs out of memory. It's better to raise a specific exception, like MemoryError, to ensure that out-of-memory conditions are always handled correctly.

Suggested change:

```diff
-        assert False, "error alloc state"
+        raise MemoryError(f"error alloc state: not enough cache, need {need_size}, but only {self.mark_end - self.mark_start} is available.")
```

Comment on lines +54 to +58:

```python
"""_summary_

Args:
    free_index (torch.Tensor): _description_
"""
```
medium

The docstring contains placeholder text. Please provide a meaningful description of the function's purpose, arguments, and behavior to improve code maintainability.

```python
num_buffers_per_req = self.mtp_step + 1
buffer_indexes = self.buffer_mem_manager.alloc(num_reqs * num_buffers_per_req)
if not buffer_indexes.is_cuda:
    buffer_indexes = buffer_indexes.cuda()
```
medium

The call to .cuda() is a synchronous host-to-device copy, which can block the CPU and become a performance bottleneck. Consider using an asynchronous copy with non_blocking=True to allow the CPU to continue with other tasks while the data transfer occurs in the background. This can improve overall throughput, especially if there are other CPU-bound operations that can be overlapped with the transfer.

Suggested change:

```diff
-    buffer_indexes = buffer_indexes.cuda()
+    buffer_indexes = buffer_indexes.cuda(non_blocking=True)
```

Comment on lines +314 to +319:

```python
model_type = mtp_model_cfg.get("model_type", "")
if model_type == "qwen3_next":
    # Qwen3Next has integrated MTP with 1 layer per module
    mtp_layers_per_module = 1
else:
    mtp_layers_per_module = mtp_model_cfg["num_hidden_layers"]
```
medium

The logic for determining mtp_layers_per_module relies on a hardcoded check of model_type. This approach is not easily extensible and can become a maintenance burden as more MTP-capable models are added. Consider refactoring this to a more robust mechanism, such as having the model class itself provide this information through a class attribute or a dedicated method. This would make the MTP initialization more modular and easier to maintain.
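The refactor the reviewer suggests could look like the sketch below: the model class declares its own MTP layout via a class attribute, so the backend no longer special-cases model_type strings. Class and attribute names here are hypothetical.

```python
class TpPartBaseModel:
    # None -> fall back to num_hidden_layers from the draft model config
    mtp_layers_per_module = None

class Qwen3NextTpPartModel(TpPartBaseModel):
    # Qwen3Next ships an integrated MTP module with a single layer
    mtp_layers_per_module = 1

def get_mtp_layers_per_module(model_cls, mtp_model_cfg):
    """Resolve MTP layers per module from the model class, not model_type."""
    if model_cls.mtp_layers_per_module is not None:
        return model_cls.mtp_layers_per_module
    return mtp_model_cfg["num_hidden_layers"]
```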
