Conversation
Implement comprehensive support for the Qwen3Next model with a linear attention mechanism.

Model features:
- Linear attention with MTP (Multi-Token Prediction) capability
- Custom Triton kernels for gated delta network (GDN) operations
- Chunked operations for efficient attention computation
- Specialized buffer pool and memory managers for linear attention

Triton kernels:
- causal_conv1d for efficient convolution operations
- Chunk-based operations (chunk_o, chunk_delta_h, chunk_scaled_dot_kkt)
- Gated delta network kernels (fused_gdn_gating, gdn_decode_mtp)
- Fused normalization (gemma_rmsnorm, gated_rmsnorm)

Infrastructure:
- Hybrid radix cache for efficient memory management
- Mamba cache manager for state management
- Allocator utilities for buffer management
- Parameter weight abstraction for flexible weight handling
- Updated model registration and API endpoints

Performance optimizations:
- H200 autotune configurations for all Triton kernels
- Optimized memory allocation with custom kernels
- Chunked prefill and decode backends

This implementation enables efficient inference for models with linear attention mechanisms, providing significant speedups for long sequence lengths.
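The gated delta network kernels above fuse a gated variant of the delta-rule state update. As a hedged reference (pure NumPy, sequential per-token loop — not the chunked Triton implementation, and following one common formulation of the gated delta rule), the recurrence the kernels are expected to match can be sketched as:

```python
import numpy as np

def gated_delta_rule(q, k, v, alpha, beta):
    """Sequential reference for a gated delta-rule recurrence.

    q, k: (T, d_k); v: (T, d_v); alpha, beta: (T,) gates in (0, 1].
    The fused kernels process this in chunks; this loop is the
    per-token definition they should agree with.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))      # recurrent state matrix
    out = np.zeros((T, d_v))
    for t in range(T):
        S = alpha[t] * S          # decay the old state
        pred = k[t] @ S           # what the state currently predicts for k_t
        # beta-scaled delta-rule correction toward the new value v_t
        S = S + beta[t] * np.outer(k[t], v[t] - pred)
        out[t] = q[t] @ S         # read out with the query
    return out
```

With `alpha = beta = 1` and a single token, the state reduces to `outer(k, v)`, so querying with `q = k` (unit norm) returns `v` exactly — a convenient sanity check for any chunked implementation.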
- Reduce Triton kernels from 6 (1D/2D/3D × p2p/broadcast) to 2 (1D only) by flattening contiguous trailing dimensions via a tensor view
- Wire up MambaCacheManager to use the Triton kernels instead of PyTorch advanced indexing with Python for-loops
- Cast strides to int64 in kernels to prevent pointer-arithmetic overflow
- Add Qwen3.5 multimodal vision-language model support
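The kernel-count reduction works because a slot copy over any number of contiguous trailing dimensions is equivalent to a 1D copy over a flattened view. A small NumPy sketch of the idea (the real code does the equivalent with a torch `.view` before launching the 1D Triton kernel; names here are illustrative):

```python
import numpy as np

def copy_slots_flat(dst, src, dst_idx, src_idx):
    """Copy src[src_idx[i]] -> dst[dst_idx[i]], treating all trailing
    dimensions as one contiguous 1D payload per slot."""
    assert dst.flags["C_CONTIGUOUS"] and src.flags["C_CONTIGUOUS"]
    # Flatten (slots, d1, d2, ...) -> (slots, d1*d2*...): a view, no copy,
    # so writes through dst2d land in dst. One code path covers 1D/2D/3D.
    dst2d = dst.reshape(dst.shape[0], -1)
    src2d = src.reshape(src.shape[0], -1)
    dst2d[dst_idx] = src2d[src_idx]

# 3D cache: 2 slots, each a (3, 4) state block
cache = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)
new = np.zeros_like(cache)
copy_slots_flat(new, cache, dst_idx=[1], src_idx=[0])
```

The contiguity assertion matters: the flatten-to-view trick only holds when the trailing dimensions are actually contiguous in memory, which is the same precondition the 1D kernel relies on.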
- Add mamba_cache_ratio parameter (default 0.5)
- Change mamba_cache_size default from 3000 to None
- Implement automatic memory allocation based on the ratio
- Add clear error messages with suggested fixes when memory is insufficient
- Maintain backward compatibility with an explicit mamba_cache_size

Ratio formula: mamba_memory = total_available * ratio / (1 + ratio)
- ratio=0.5 -> 33% mamba, 67% KV
- ratio=1.0 -> 50% mamba, 50% KV
- ratio=2.0 -> 67% mamba, 33% KV
Change the ratio semantics from a derived fraction to a simple percentage:
- Old: ratio = mamba / kv, so mamba = total * ratio / (1 + ratio)
- New: ratio = mamba / total, so mamba = total * ratio

This makes the ratio more intuitive:
- 0.3 → 30% mamba, 70% KV
- 0.5 → 50% mamba, 50% KV (default)
- 0.7 → 70% mamba, 30% KV

Also simplifies the recommendation formula in the error message.
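The two semantics can be compared side by side; a tiny sketch (function names are illustrative, not the actual config code):

```python
def mamba_memory_old(total, ratio):
    # Old semantics: ratio = mamba / kv, so mamba = total * ratio / (1 + ratio).
    # ratio=0.5 yields only a third of the total for mamba states.
    return total * ratio / (1 + ratio)

def mamba_memory_new(total, ratio):
    # New semantics: ratio = mamba / total, a direct percentage.
    # ratio=0.5 yields an even split, which matches intuition.
    return total * ratio
```

Under the old formula the default of 0.5 gave a 33/67 split; under the new one the same default gives 50/50, so existing deployments relying on the default see the mamba share grow.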
- Replace the get_radix_cache_class() classmethod with a radix_cache_class class attribute in TpPartBaseModel and Qwen3NextTpPartModel
- Move RadixCache/HybridRadixCache imports to module top level
- Update base_backend.py to access radix_cache_class directly
- Replace alloc_buffer_for_req_triton with a simpler indexed PyTorch assignment
- Remove the now-unused alloc_buffer_kernel.py Triton kernel
- Revert the LOADWORKER default to 1 and remove language_model. prefix stripping
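The classmethod-to-attribute change removes a layer of indirection: subclasses simply override a class attribute, and callers read it directly. A hedged sketch of the pattern (class names follow the commit message; the bodies are placeholders, not the real implementations):

```python
class RadixCache:
    """Placeholder for the default prompt cache."""
    pass

class HybridRadixCache(RadixCache):
    """Placeholder for the hybrid cache used by linear-attention models."""
    pass

class TpPartBaseModel:
    # Subclasses override this attribute instead of defining
    # a get_radix_cache_class() classmethod.
    radix_cache_class = RadixCache

class Qwen3NextTpPartModel(TpPartBaseModel):
    radix_cache_class = HybridRadixCache

def build_radix_cache(model_cls, *args, **kwargs):
    # base_backend.py can now read the attribute directly.
    return model_cls.radix_cache_class(*args, **kwargs)
```

The attribute form is also friendlier to static inspection: the chosen cache class is visible in the class body rather than buried in method logic.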
The sgl_kernel.fwd.default API requires attention_chunk before softcap. This file was missed when the parameter was added in commit a4ab210. Also update sgl-kernel from 0.3.7.post1 to 0.3.21, which supports this API.
- Rename copy_buffer_p2p → copy_mamba_buffer (indexed 1:1 slot copy)
- Rename copy_buffer_broadcast → fork_mamba_buffer (1:N MTP fork)
- Unify the chunk-offset parameter name (pair_idx_offset/copy_idx_offset → chunk_offset)
- Rename stride_index → stride_slot to reflect the slot/cache dimension
- Rename src_idx_in_batch → src_chunk_idx in the fork kernel
- Extract a _MAX_GRID_DIM = 65535 module constant (was duplicated inline)
- Add a divisibility assertion before the implicit // in the fork autotuned wrapper
- Update autotuner cache keys to match the new names
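The divisibility assertion guards the implicit floor division that the fork wrapper performs when splitting work across the launch grid: without it, a non-aligned size would silently drop elements. A hedged pure-Python sketch of the wrapper-side arithmetic (the 65535 cap mirrors the CUDA per-axis grid limit the `_MAX_GRID_DIM` constant captures; `fork_grid` and its folding strategy are illustrative, not the actual wrapper):

```python
_MAX_GRID_DIM = 65535  # CUDA caps the y/z grid axes at 65535 blocks

def fork_grid(total_elems, elems_per_chunk):
    """Split total_elems into (grid_size, chunks_per_block).

    The explicit assertion makes the implicit // safe: if total_elems
    were not chunk-aligned, floor division would silently lose the tail.
    """
    assert total_elems % elems_per_chunk == 0, (
        f"{total_elems} not divisible by chunk size {elems_per_chunk}"
    )
    num_chunks = total_elems // elems_per_chunk
    # Fold chunks together (powers of two) until the grid axis fits the cap.
    chunks_per_block = 1
    while (num_chunks + chunks_per_block - 1) // chunks_per_block > _MAX_GRID_DIM:
        chunks_per_block *= 2
    grid = (num_chunks + chunks_per_block - 1) // chunks_per_block
    return grid, chunks_per_block
```

Failing loudly at the wrapper boundary is cheaper than debugging a kernel that quietly copies a truncated state buffer.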
Pre-commit hook formatting changes.
Configs for Qwen3.5-35B (layer_num=30) and 397B (layer_num=48):
- SSM state (float32): d_size=262144/393216
- Conv state (bf16): d_size=12288/15360
- copy_buffer_p2p -> copy_state_buffers (removes the misleading "p2p")
- copy_buffer_broadcast -> fork_state_buffers (aligns with the fork_mamba_buffer kernel)
- copy_ssm_buffer_broadcast -> fork_ssm_buffers (consistent naming)
- Remove redundant docstrings in mamba_buffer_copy.py
Summary of Changes

This pull request significantly expands LightLLM's model compatibility by integrating Qwen3-next and Qwen3.5, which utilize a novel hybrid attention architecture. The changes introduce a sophisticated memory management system tailored for linear attention states and leverage highly optimized Triton kernels to accelerate both prefill and multi-token prediction decoding. These updates enhance the framework's ability to efficiently handle complex, multimodal models with advanced architectural features.
Code Review
This pull request introduces significant new features to support Qwen3-next and Qwen3.5 models, which utilize hybrid attention mechanisms (mixing full attention and linear attention) and multi-token prediction (MTP). The changes are extensive, touching upon memory management, inference logic, Triton kernels, and API layers.
Key additions include a new TokenAllocator for managing token memory, a MambaCacheManager for handling recurrent states, and a HybridRadixCache to support prompt caching for these new models. The inference and weight-loading logic has been substantially refactored to accommodate the unique architectures of these models.
While the implementation is comprehensive, I've identified a critical security vulnerability related to arbitrary file access, a high-severity issue with the use of assert for error handling, and several medium-severity concerns regarding performance and code maintainability. Addressing these points will be crucial for ensuring the stability, security, and long-term health of the codebase.
```python
elif img.startswith("file://"):
    # Local file path with file:// prefix
    file_path = img[7:]  # Remove "file://" prefix
    with open(file_path, "rb") as f:
        multimodal_params_dict["images"].append(
            {"type": "base64", "data": base64.b64encode(f.read()).decode("utf-8")}
        )
```
Accepting arbitrary file paths via file:// URIs introduces a severe security vulnerability known as Path Traversal. A malicious user could craft a request to access sensitive files on the server (e.g., file:///etc/passwd). All user-provided paths must be strictly validated to ensure they are confined to a designated, safe directory. Without proper validation, this could lead to unauthorized file access and information disclosure.
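One standard mitigation is to resolve the user-supplied path against a configured base directory and reject anything that escapes it. A minimal sketch of that check (`resolve_within` and the base-directory convention are hypothetical, not part of the reviewed code; `os.path.realpath` collapses `..` segments and follows symlinks before the containment test):

```python
import os

def resolve_within(base_dir: str, user_path: str) -> str:
    """Resolve a user-supplied path and refuse anything outside base_dir."""
    base = os.path.realpath(base_dir)
    # Normalize first, then check containment: comparing raw strings
    # before realpath would miss "../" tricks and symlink escapes.
    candidate = os.path.realpath(os.path.join(base, user_path.lstrip("/")))
    if os.path.commonpath([base, candidate]) != base:
        raise PermissionError(f"path escapes allowed directory: {user_path}")
    return candidate
```

With this in place, `file:///etc/passwd` (or `file://../../etc/passwd`) resolves outside the allowed root and is rejected before any `open()` call.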
```python
def alloc(self, need_size) -> torch.Tensor:
    if need_size > self.mark_end - self.mark_start:
        logger.error(f"warn no enough cache need_size {need_size} left_size {self.can_use_mem_size}")
        assert False, "error alloc state"
```
Using assert False for error handling is not robust, as assertions can be disabled in production environments (e.g., when running Python with the -O flag). This would lead to silent failures when the allocator runs out of memory. It's better to raise a specific exception, like MemoryError, to ensure that out-of-memory conditions are always handled correctly.
```diff
-        assert False, "error alloc state"
+        raise MemoryError(f"error alloc state: not enough cache, need {need_size}, but only {self.mark_end - self.mark_start} is available.")
```
| """_summary_ | ||
|
|
||
| Args: | ||
| free_index (torch.Tensor): _description_ | ||
| """ |
```python
num_buffers_per_req = self.mtp_step + 1
buffer_indexes = self.buffer_mem_manager.alloc(num_reqs * num_buffers_per_req)
if not buffer_indexes.is_cuda:
    buffer_indexes = buffer_indexes.cuda()
```
The call to .cuda() is a synchronous host-to-device copy, which can block the CPU and become a performance bottleneck. Consider using an asynchronous copy with non_blocking=True to allow the CPU to continue with other tasks while the data transfer occurs in the background. This can improve overall throughput, especially if there are other CPU-bound operations that can be overlapped with the transfer.
```diff
-    buffer_indexes = buffer_indexes.cuda()
+    buffer_indexes = buffer_indexes.cuda(non_blocking=True)
```
```python
model_type = mtp_model_cfg.get("model_type", "")
if model_type == "qwen3_next":
    # Qwen3Next has integrated MTP with 1 layer per module
    mtp_layers_per_module = 1
else:
    mtp_layers_per_module = mtp_model_cfg["num_hidden_layers"]
```
The logic for determining mtp_layers_per_module relies on a hardcoded check of model_type. This approach is not easily extensible and can become a maintenance burden as more MTP-capable models are added. Consider refactoring this to a more robust mechanism, such as having the model class itself provide this information through a class attribute or a dedicated method. This would make the MTP initialization more modular and easier to maintain.
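One way to realize the reviewer's suggestion is to let each model class declare its own MTP layout, with a `None` default meaning "read it from the config". A hedged sketch (class and function names are hypothetical, chosen to mirror the reviewed logic):

```python
class BaseMtpModel:
    # Per-model override; None means "derive from num_hidden_layers".
    mtp_layers_per_module = None

class Qwen3NextMtpModel(BaseMtpModel):
    # Qwen3Next integrates MTP with one layer per module.
    mtp_layers_per_module = 1

def resolve_mtp_layers(model_cls, mtp_model_cfg):
    """Replace the hardcoded model_type check with an attribute lookup."""
    if model_cls.mtp_layers_per_module is not None:
        return model_cls.mtp_layers_per_module
    return mtp_model_cfg["num_hidden_layers"]
```

New MTP-capable models then opt in by setting one attribute in their own class body, with no edits to the shared initialization path.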