Contrib: Add Qwen3.6-27B (post-training update of Qwen3.5-27B) by jimburtoft · Pull Request #140 · aws-neuron/neuronx-distributed-inference

jimburtoft · 2026-04-23T23:28:05Z

Summary

Adds NxDI contrib implementation of Qwen3.6-27B, a 27B parameter dense model with hybrid DeltaNet + GQA attention architecture
Qwen3.6-27B is a post-training update of Qwen3.5-27B (PR #128, "Contrib: Add Qwen3.5-27B with hybrid DeltaNet + GQA architecture") with an identical architecture (qwen3_5 model_type). It improves agentic coding and thinking preservation; only the weights differ
Same NxDI implementation as Qwen3.5-27B with updated documentation, Qwen3.6-27B benchmarks, quality validation, and cross-references between the two contribs

Relationship to PR #128 (Qwen3.5-27B)

This contrib uses the same Qwen35* classes and modeling_qwen35*.py filenames as the Qwen3.5-27B contrib (PR #128). The code is identical -- both models share the qwen3_5 model_type. Only the HuggingFace model ID and weights differ.

Config Compatibility

Qwen3.6-27B adds output_gate_type="swish" to text_config. Investigation confirmed this field is completely unused by HF transformers (zero references across v4.57.6, v5.6.0, and GitHub main) and by this NxDI code. No code changes required.

Test Results

Unit Tests (42/42 PASS, CPU only)

Module	Tests
test_config.py	26/26
test_weight_conversion.py	16/16

These are architecture-level tests, and the results are identical to those for Qwen3.5-27B.

Quality Validation (7/7 PASS, trn2.3xlarge, TP=4, SDK 2.29)

Test	Result
Speed of light	PASS
17 * 23 = 391	PASS
60mph * 2.5h = 150 miles	PASS
is_prime function	PASS
French translation	PASS
Capital of Japan	PASS
sqrt(144) = 12	PASS
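
The commit notes further down mention stripping <think> tags before asserting on a generated answer. A minimal sketch of what such a check could look like, assuming the model wraps its reasoning in a <think>...</think> block; the helper names `strip_think` and `answer_contains` are hypothetical and are not taken from test_model.py:

```python
import re

# Matches one complete <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think(text):
    """Remove complete <think> blocks and surrounding whitespace (hypothetical helper)."""
    return THINK_BLOCK.sub("", text).strip()


def answer_contains(generated, expected):
    """Case-insensitive check that runs only on the text outside the reasoning block."""
    return expected.lower() in strip_think(generated).lower()
```

An unclosed <think> block (for example, when generation is cut off by max_new_tokens) is left untouched by this pattern, which is one reason a test like this may also need a larger token budget.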

Performance (trn2.3xlarge, TP=4, LNC=2, BF16, SDK 2.29)

Metric	Qwen3.6-27B	Qwen3.5-27B	Delta
TPOT (P50)	54.2 ms	53 ms	+2.3%
Throughput	18.5 tok/s	18.9 tok/s	-2.1%
TTFT (P50)	306 ms	576 ms	*

* The TTFT difference comes from the compilation config (256-token vs 128-token bucket), not from the model. Architectural performance is equivalent.
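
The Delta column can be reproduced from the reported values. The sketch below also checks that throughput is close to the reciprocal of TPOT, which assumes a single decode stream; that assumption is not stated in the PR, although the numbers are consistent with it.

```python
def percent_delta(new, baseline):
    """Relative change of `new` against `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


def tokens_per_second(tpot_ms):
    """Throughput implied by a per-token latency, assuming one decode stream."""
    return 1000.0 / tpot_ms


# Values copied from the table above (Qwen3.6-27B vs Qwen3.5-27B).
tpot_delta = percent_delta(54.2, 53.0)
throughput_delta = percent_delta(18.5, 18.9)

print(f"TPOT delta: {tpot_delta:+.1f}%")
print(f"Throughput delta: {throughput_delta:+.1f}%")
print(f"Implied throughput at 54.2 ms: {tokens_per_second(54.2):.1f} tok/s")
```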

Files (15 files, ~6600 lines)

contrib/models/Qwen3.6-27B/
├── README.md
├── src/
│   ├── __init__.py
│   ├── modeling_qwen35.py              (text decoder)
│   ├── modeling_qwen35_vision.py       (vision encoder)
│   ├── modeling_qwen35_vl.py           (VL pipeline)
│   └── nki_kernels/
│       ├── __init__.py
│       ├── nki_deltanet.py             (recurrent kernel)
│       ├── nki_deltanet_chunked.py     (per-chunk kernel)
│       └── nki_deltanet_fused.py       (fused chunked kernel)
└── test/
    ├── unit/
    │   ├── test_config.py              (26 tests)
    │   └── test_weight_conversion.py   (16 tests)
    └── integration/
        └── test_model.py               (8 tests)

Checklist

Contrib-only (no changes to NxDI src/)
Unit tests (42/42 pass)
Quality validation (7/7 pass on trn2.3xlarge, SDK 2.29)
Benchmarks (TPOT=54.2ms, 18.5 tok/s)
README with architecture details, benchmarks, cross-reference to Qwen3.5-27B, and config compatibility notes
Apache 2.0 license headers
SDK 2.29+ / NKI 0.3.0 required

Qwen3.6-27B shares an identical architecture with Qwen3.5-27B (qwen3_5 model_type, hybrid DeltaNet + GQA). This is the same NxDI implementation as PR aws-neuron#128, with updated documentation, Qwen3.6-27B benchmarks, and cross-references. Validated on trn2.3xlarge (TP=4, SDK 2.29): 7/7 quality tests passed, TPOT=54.2ms, 18.5 tok/s, TTFT=306ms. TPOT and throughput are within about 2.3% of Qwen3.5-27B.

- Replace fused NKI kernel with v14: 8-block (16x16) forward substitution with 4-round Neumann per block. Fixes catastrophic cancellation in fp32 that the previous full-128x128 Neumann approach had for large decay values.
- Fix concurrent BS>1 DeltaNet state: use seq_ids-based index_select (read) and scatter (write) for both conv_state_buffer and recurrent_state_buffer. Enables correct multi-slot concurrent serving via vLLM-neuron.
- Add vLLM compatibility: add_derived_config() promotes text_config fields and extracts rope_theta from nested rope_parameters dict when loaded via vLLM's AutoConfig (multimodal-style config layout).
- Default CTE to fused NKI kernel (USE_NKI_FUSED=1 by default). The v14 kernel is numerically stable and avoids PyTorch chunk_forward.
- Replace manual conv1d loop with F.conv1d(..., groups=conv_dim).
- Clean up nkilib loading: remove fork path hack, use installed package with graceful fallback.
- Fix test_capital_of_france: increase max_new_tokens for thinking mode, strip <think> tags before assertion.

Tested: 42/42 unit tests PASS, 8/8 integration tests PASS on trn2.3xlarge (SDK 2.29.1, NKI 0.3.0, TP=4).
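
One way to see why four Neumann rounds can be enough for a 16x16 block: if the block to invert has the form I - A with A strictly lower triangular, then A^16 = 0, and the doubling product (I + A)(I + A^2)(I + A^4)(I + A^8) equals the full series I + A + ... + A^15. The pure-Python sketch below illustrates that identity only. It assumes the doubling form and the unit-triangular structure, neither of which is confirmed by the commit message, and it is not the NKI kernel.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def add(a, b):
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def neumann_inverse(a, rounds):
    """Approximate (I - a)^-1 as the product of (I + a^(2^k)) for k < rounds."""
    n = len(a)
    result = identity(n)
    power = a
    for _ in range(rounds):
        result = matmul(result, add(identity(n), power))
        power = matmul(power, power)  # a -> a^2 -> a^4 -> a^8
    return result


def residual(a, rounds):
    """Largest entry of |(I - a) @ neumann_inverse(a, rounds) - I|."""
    n = len(a)
    eye = identity(n)
    i_minus_a = [[eye[i][j] - a[i][j] for j in range(n)] for i in range(n)]
    prod = matmul(i_minus_a, neumann_inverse(a, rounds))
    return max(abs(prod[i][j] - eye[i][j]) for i in range(n) for j in range(n))


# Illustrative 16x16 strictly lower-triangular block (placeholder values).
N = 16
block = [[0.5 if j < i else 0.0 for j in range(N)] for i in range(N)]
```

With this block, `residual(block, 4)` is at rounding level while `residual(block, 3)` is not, because three rounds only cover powers up to A^7. Keeping blocks small bounds how many powers are needed, which is consistent with the commit's motivation for moving away from a single 128x128 Neumann expansion.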

- Fix concurrent BS>1 DeltaNet state: replace BS=1 shortcut with proper seq_ids-based index_select (read) and scatter (write) for both conv_state_buffer and recurrent_state_buffer. Enables correct multi-slot concurrent serving via vLLM-neuron.
- Add vLLM compatibility: add_derived_config() promotes text_config fields and extracts rope_theta from nested rope_parameters dict when loaded via vLLM's AutoConfig (multimodal-style config layout).

Consistent with Qwen3.5-27B (PR aws-neuron#128) and Qwen3.6-27B (PR aws-neuron#140).
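
The config promotion described above can be sketched in a few lines. This is an illustration under assumptions, not the contrib's add_derived_config(): the function name `promote_text_config` is hypothetical, the config is modelled as a plain namespace, and the field values in the usage example are placeholders rather than real Qwen3.6-27B settings.

```python
from types import SimpleNamespace


def promote_text_config(config):
    """Copy text_config fields to the top level without overwriting existing ones,
    then pull rope_theta out of a nested rope_parameters dict if needed.
    Hypothetical helper; the real add_derived_config() may behave differently."""
    text_config = getattr(config, "text_config", None)
    if text_config is not None:
        fields = text_config if isinstance(text_config, dict) else vars(text_config)
        for name, value in fields.items():
            if not hasattr(config, name):
                setattr(config, name, value)

    rope_parameters = getattr(config, "rope_parameters", None)
    if (
        isinstance(rope_parameters, dict)
        and "rope_theta" in rope_parameters
        and not hasattr(config, "rope_theta")
    ):
        config.rope_theta = rope_parameters["rope_theta"]
    return config


# Placeholder values for illustration only.
nested = SimpleNamespace(
    model_type="qwen3_5",
    text_config={"hidden_size": 1024, "rope_parameters": {"rope_theta": 10000.0}},
)
flat = promote_text_config(nested)
```

Leaving existing top-level attributes untouched is a design choice in this sketch, so that a config already in the flat layout passes through unchanged.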

m-deepankar-singh mentioned this pull request Apr 30, 2026: [contrib] Add Qwen3.5 4B and 9B hybrid DeltaNet contrib models #152 (Open, 14 tasks)

m-deepankar-singh mentioned this pull request May 13, 2026: Add Qwen3.6-27B contrib model with vLLM APC baseline #164 (Open, 10 tasks)

jimburtoft wants to merge 2 commits into aws-neuron:main from jimburtoft:contrib/qwen3.6-27b
