Contrib: Add Qwen3.6-27B (post-training update of Qwen3.5-27B) #140

Open
jimburtoft wants to merge 2 commits into aws-neuron:main from jimburtoft:contrib/qwen3.6-27b


@jimburtoft
Contributor

Summary

  • Adds an NxDI contrib implementation of Qwen3.6-27B, a 27B-parameter dense model with a hybrid DeltaNet + GQA attention architecture
  • Qwen3.6-27B is a post-training update of Qwen3.5-27B (PR #128, "Contrib: Add Qwen3.5-27B with hybrid DeltaNet + GQA architecture") with an identical architecture (qwen3_5 model_type). Only the weights differ; the update improves agentic coding and thinking preservation
  • Same NxDI implementation as Qwen3.5-27B, with updated documentation, Qwen3.6-27B benchmarks, quality validation, and cross-references between the two contribs

Relationship to PR #128 (Qwen3.5-27B)

This contrib uses the same Qwen35* classes and modeling_qwen35*.py filenames as the Qwen3.5-27B contrib (PR #128). The code is identical -- both models share the qwen3_5 model_type. Only the HuggingFace model ID and weights differ.

Config Compatibility

Qwen3.6-27B adds output_gate_type="swish" to text_config. Investigation confirmed this field is completely unused by HF transformers (zero references across v4.57.6, v5.6.0, and GitHub main) and by this NxDI code. No code changes required.

Test Results

Unit Tests (42/42 PASS, CPU only)

| Module | Tests |
|---|---|
| test_config.py | 26/26 |
| test_weight_conversion.py | 16/16 |

These are architecture-level tests; the results are identical to Qwen3.5-27B.

Quality Validation (7/7 PASS, trn2.3xlarge, TP=4, SDK 2.29)

| Test | Result |
|---|---|
| Speed of light | PASS |
| 17 * 23 = 391 | PASS |
| 60mph * 2.5h = 150 miles | PASS |
| is_prime function | PASS |
| French translation | PASS |
| Capital of Japan | PASS |
| sqrt(144) = 12 | PASS |

Performance (trn2.3xlarge, TP=4, LNC=2, BF16, SDK 2.29)

| Metric | Qwen3.6-27B | Qwen3.5-27B | Delta |
|---|---|---|---|
| TPOT (P50) | 54.2 ms | 53 ms | +2.3% |
| Throughput | 18.5 tok/s | 18.9 tok/s | -2.1% |
| TTFT (P50) | 306 ms | 576 ms | * |

* The TTFT difference is due to the compilation config (256-token vs 128-token bucket), not the model. Architectural performance is equivalent.
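The TPOT and throughput rows can be cross-checked against each other. The sketch below assumes the throughput figures are single-stream decode throughput, i.e. the reciprocal of TPOT, which the PR does not state explicitly; the helper names are illustrative only.

```python
def tokens_per_second(tpot_ms: float) -> float:
    """Decode throughput implied by a time-per-output-token in milliseconds."""
    return 1000.0 / tpot_ms


def percent_delta(new: float, base: float) -> float:
    """Relative change of `new` with respect to `base`, in percent."""
    return (new - base) / base * 100.0


# Values taken from the performance table above.
qwen36_tpot_ms, qwen35_tpot_ms = 54.2, 53.0
qwen36_tok_s, qwen35_tok_s = 18.5, 18.9

print(round(tokens_per_second(qwen36_tpot_ms), 1))            # 18.5
print(round(tokens_per_second(qwen35_tpot_ms), 1))            # 18.9
print(round(percent_delta(qwen36_tpot_ms, qwen35_tpot_ms), 1))  # 2.3
print(round(percent_delta(qwen36_tok_s, qwen35_tok_s), 1))      # -2.1
```

Under that assumption the reported throughput and delta columns are consistent with the TPOT measurements.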

Files (15 files, ~6600 lines)

contrib/models/Qwen3.6-27B/
├── README.md
├── src/
│   ├── __init__.py
│   ├── modeling_qwen35.py              (text decoder)
│   ├── modeling_qwen35_vision.py       (vision encoder)
│   ├── modeling_qwen35_vl.py           (VL pipeline)
│   └── nki_kernels/
│       ├── __init__.py
│       ├── nki_deltanet.py             (recurrent kernel)
│       ├── nki_deltanet_chunked.py     (per-chunk kernel)
│       └── nki_deltanet_fused.py       (fused chunked kernel)
└── test/
    ├── unit/
    │   ├── test_config.py              (26 tests)
    │   └── test_weight_conversion.py   (16 tests)
    └── integration/
        └── test_model.py               (8 tests)

Checklist

  • Contrib-only (no changes to NxDI src/)
  • Unit tests (42/42 pass)
  • Quality validation (7/7 pass on trn2.3xlarge, SDK 2.29)
  • Benchmarks (TPOT=54.2ms, 18.5 tok/s)
  • README with architecture details, benchmarks, cross-reference to Qwen3.5-27B, and config compatibility notes
  • Apache 2.0 license headers
  • SDK 2.29+ / NKI 0.3.0 required

Qwen3.6-27B shares identical architecture with Qwen3.5-27B (qwen3_5
model_type, hybrid DeltaNet + GQA). Same NxDI implementation as PR aws-neuron#128
with updated documentation, Qwen3.6-27B benchmarks, and cross-references.

Validated on trn2.3xlarge (TP=4, SDK 2.29): 7/7 quality tests passed,
TPOT=54.2ms, 18.5 tok/s, TTFT=306ms. TPOT and throughput are within about
2.3% of Qwen3.5-27B.

- Replace fused NKI kernel with v14: 8-block (16x16) forward substitution
  with 4-round Neumann per block. Fixes catastrophic cancellation in fp32
  that the previous full-128x128 Neumann approach had for large decay values.

- Fix concurrent BS>1 DeltaNet state: use seq_ids-based index_select (read)
  and scatter (write) for both conv_state_buffer and recurrent_state_buffer.
  Enables correct multi-slot concurrent serving via vLLM-neuron.

- Add vLLM compatibility: add_derived_config() promotes text_config fields
  and extracts rope_theta from nested rope_parameters dict when loaded via
  vLLM's AutoConfig (multimodal-style config layout).

- Default CTE to fused NKI kernel (USE_NKI_FUSED=1 by default). The v14
  kernel is numerically stable and avoids PyTorch chunk_forward.

- Replace manual conv1d loop with F.conv1d(..., groups=conv_dim).

- Clean up nkilib loading: remove fork path hack, use installed package
  with graceful fallback.

- Fix test_capital_of_france: increase max_new_tokens for thinking mode,
  strip <think> tags before assertion.

Tested: 42/42 unit tests PASS, 8/8 integration tests PASS on trn2.3xlarge
(SDK 2.29.1, NKI 0.3.0, TP=4).
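
The first item above (16x16 blocks with a 4-round Neumann product, combined by forward substitution) can be illustrated in NumPy. This is a sketch of the numerical idea only, not the NKI kernel: it assumes the matrix being inverted has the form I - N with N strictly lower triangular, and the function names are hypothetical. Four rounds are enough per block because a 16x16 strictly lower-triangular N satisfies N^16 = 0, so (I + N)(I + N^2)(I + N^4)(I + N^8) is the exact inverse of I - N. A full 128x128 product would need seven rounds and powers up to N^64.

```python
import numpy as np


def neumann_inverse(n_blk: np.ndarray, rounds: int = 4) -> np.ndarray:
    """Inverse of (I - n_blk) for a strictly lower-triangular block.

    Uses the product form (I + N)(I + N^2)(I + N^4)...; `rounds` factors
    are exact when the block size is at most 2**rounds.
    """
    inv = np.eye(n_blk.shape[0], dtype=n_blk.dtype) + n_blk
    power = n_blk
    for _ in range(rounds - 1):
        power = power @ power      # N^2, N^4, N^8, ...
        inv = inv + inv @ power    # multiply the running product by (I + power)
    return inv


def blockwise_inverse(n: np.ndarray, block: int = 16, rounds: int = 4) -> np.ndarray:
    """Inverse of (I - n) via block forward substitution.

    Diagonal blocks are inverted with the short Neumann product; each
    block row below the diagonal is then filled in from the rows above it.
    """
    size = n.shape[0]
    x = np.zeros_like(n)
    for start in range(0, size, block):
        stop = start + block
        d_inv = neumann_inverse(n[start:stop, start:stop], rounds)
        x[start:stop, start:stop] = d_inv
        if start > 0:
            # X_ij = D_i^{-1} * sum_k N_ik X_kj for the already-solved rows k < i
            x[start:stop, :start] = d_inv @ (n[start:stop, :start] @ x[:start, :start])
    return x


rng = np.random.default_rng(0)
n = np.tril(rng.standard_normal((128, 128)), k=-1) * 0.01
x = blockwise_inverse(n)
residual = np.abs((np.eye(128) - n) @ x - np.eye(128)).max()
print(f"max residual: {residual:.2e}")
```

The blockwise form only ever raises a 16x16 block to the 8th power, which is the property the commit message credits for avoiding the cancellation seen with large decay values.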
jimburtoft added a commit to jimburtoft/neuronx-distributed-inference that referenced this pull request May 21, 2026
- Fix concurrent BS>1 DeltaNet state: replace BS=1 shortcut with proper
  seq_ids-based index_select (read) and scatter (write) for both
  conv_state_buffer and recurrent_state_buffer. Enables correct multi-slot
  concurrent serving via vLLM-neuron.

- Add vLLM compatibility: add_derived_config() promotes text_config fields
  and extracts rope_theta from nested rope_parameters dict when loaded via
  vLLM's AutoConfig (multimodal-style config layout).
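
The config promotion described in the item above might look roughly like the following. This is a hypothetical sketch, not the PR's add_derived_config(): the helper name, the use of a plain dict for text_config, and the field values are all placeholders chosen for illustration.

```python
from types import SimpleNamespace


def promote_text_config(config, fields):
    """Copy selected text_config entries to the top level of `config`.

    Also lifts rope_theta out of a nested rope_parameters dict when the
    top-level config does not already define it.
    """
    text_config = getattr(config, "text_config", None)
    if text_config is None:
        return config  # already a flat, text-only config

    for name in fields:
        if not hasattr(config, name) and name in text_config:
            setattr(config, name, text_config[name])

    rope_parameters = text_config.get("rope_parameters") or {}
    if not hasattr(config, "rope_theta") and "rope_theta" in rope_parameters:
        config.rope_theta = rope_parameters["rope_theta"]
    return config


# Placeholder values, not the real Qwen3.6-27B configuration.
nested = SimpleNamespace(
    model_type="qwen3_5",
    text_config={
        "hidden_size": 1024,
        "num_hidden_layers": 2,
        "rope_parameters": {"rope_theta": 10000.0},
    },
)
flat = promote_text_config(nested, ["hidden_size", "num_hidden_layers"])
print(flat.hidden_size, flat.num_hidden_layers, flat.rope_theta)
```

Existing top-level attributes are left untouched, so the same helper is safe to call on a config that was already flat.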

Consistent with Qwen3.5-27B (PR aws-neuron#128) and Qwen3.6-27B (PR aws-neuron#140).
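
The seq_ids-based read and write of per-slot state can be sketched with NumPy standing in for the tensor ops named in the commit message (index_select for the read, scatter for the write). Buffer shapes and helper names here are assumptions for illustration; the real conv_state_buffer and recurrent_state_buffer layouts are not shown in this PR text.

```python
import numpy as np


def read_state(state_buffer: np.ndarray, seq_ids: np.ndarray) -> np.ndarray:
    """Gather the state rows of the active requests (index_select analogue)."""
    return np.take(state_buffer, seq_ids, axis=0)


def write_state(state_buffer: np.ndarray, seq_ids: np.ndarray,
                new_state: np.ndarray) -> np.ndarray:
    """Return a buffer with the active rows replaced (scatter analogue)."""
    updated = state_buffer.copy()
    updated[seq_ids] = new_state
    return updated


num_slots, state_dim = 4, 3
buffer = np.zeros((num_slots, state_dim), dtype=np.float32)

# Two concurrent requests occupy slots 2 and 0, in that batch order.
seq_ids = np.array([2, 0])
current = read_state(buffer, seq_ids)
step_update = np.array([[1.0], [2.0]], dtype=np.float32)  # per-request increment
buffer = write_state(buffer, seq_ids, current + step_update)
print(buffer)
```

Indexing by seq_ids rather than by batch position keeps each request's state in its own slot even when requests arrive in a different order or some slots are idle.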