Add Qwen3.5-2B contrib model by jimburtoft · Pull Request #141 · aws-neuron/neuronx-distributed-inference

jimburtoft · 2026-04-24T03:03:59Z

Note: The below template includes items meant for model contributions only. For other contributions such as bug fixes, features, etc., only fill out the relevant portions of the form.

Description

Adds Qwen3.5-2B, a 2B parameter dense hybrid DeltaNet + GQA decoder from Alibaba Cloud, to the contrib directory. This model features 18 DeltaNet linear recurrent attention layers and 6 standard GQA layers in a [3 DeltaNet + 1 GQA] x 6 pattern, requiring custom NKI kernels for the DeltaNet forward passes on Neuron.
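The layer layout described above can be sketched as a small helper. This is only an illustration of the [3 DeltaNet + 1 GQA] x 6 arithmetic; the labels `"deltanet"` and `"gqa"` and the function name are hypothetical and are not taken from the PR's source.

```python
# Hypothetical labels; the real config may name these layer types differently.
DELTANET = "deltanet"
GQA = "gqa"


def build_layer_pattern(num_groups: int = 6, deltanet_per_group: int = 3) -> list:
    """Return the per-layer attention type for a [N DeltaNet + 1 GQA] x G stack."""
    pattern = []
    for _ in range(num_groups):
        pattern.extend([DELTANET] * deltanet_per_group)
        pattern.append(GQA)
    return pattern


layers = build_layer_pattern()
gqa_indices = [i for i, kind in enumerate(layers) if kind == GQA]
print(len(layers), layers.count(DELTANET), layers.count(GQA))  # 24 18 6
print(gqa_indices)  # [3, 7, 11, 15, 19, 23]
```

Under this layout every fourth layer (0-based indices 3, 7, ...) would be a GQA layer, which matches the 18 + 6 split stated in the description.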

Key implementation details:

- Fused NKI kernel for DeltaNet context encoding (CTE): single-kernel chunked forward
- Per-token NKI recurrent kernel for DeltaNet token generation (TKG)
- Standard GQA attention for the 6 non-DeltaNet layers
- Tied embeddings support (tie_word_embeddings=true)
- Partial RoPE position encoding (25% of head_dim)
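To make the "25% of head_dim" item concrete, here is a minimal pure-Python sketch of partial RoPE on a single head vector. It assumes the common "rotate half" pairing and a frequency base of 10000; both are assumptions for illustration, as are the function name and the toy head_dim of 16. The PR's actual implementation may differ.

```python
import math


def apply_partial_rope(vec, position, rotary_fraction=0.25, base=10000.0):
    """Rotate only the first `rotary_fraction` of a head vector; pass the rest through."""
    head_dim = len(vec)
    rot_dim = int(head_dim * rotary_fraction)
    rot_dim -= rot_dim % 2  # rotation acts on pairs, so keep the width even
    half = rot_dim // 2
    out = list(vec)
    for i in range(half):
        theta = position * base ** (-2.0 * i / rot_dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        # "rotate half" pairing (assumed): element i is paired with element i + half
        a, b = vec[i], vec[i + half]
        out[i] = a * cos_t - b * sin_t
        out[i + half] = b * cos_t + a * sin_t
    return out


q = [float(i) for i in range(16)]      # toy head_dim of 16, so 4 dims are rotated
rotated = apply_partial_rope(q, position=5)
print(rotated[4:] == q[4:])            # the untouched 75% passes through unchanged
```

The rotation preserves the norm of the rotated slice, and position 0 leaves the vector unchanged.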

Model Information

Model Name: Qwen3.5-2B

Model Architecture: Decoder-only hybrid DeltaNet/GQA transformer (24 layers: 18 DeltaNet + 6 GQA), dense SwiGLU MLP, 2048 hidden size, 248K vocabulary

Purpose: Text generation (chat model with <|im_start|>/<|im_end|> format)

Checklist

Please ensure your PR includes the following items. Refer to the contrib/CONTRIBUTING.md for detailed guidelines.

Required Components

Accuracy Test (ex. test/integration/test_model.py)
- 9 integration tests validating model accuracy on Neuron
- First-token logit validation against pre-computed CPU BF16 reference logits (cosine similarity 0.9156, top-1 match, top-5 overlap 4/5)
- Multi-prompt coherence tests (factual Q&A, code generation, knowledge, list generation)
- Test can compile and run the model on Neuron
README.md with the following sections:
- Usage Example: Clear code example showing how to use the model
- Compatibility Matrix: Table showing tested Neuron SDK versions and instance types
- Example Checkpoints: Links to compatible model checkpoints (HuggingFace Hub)
- Testing Instructions: Commands to run unit and integration test suites, including CPU reference logit generation
Source Code (src/)
- modeling_qwen35.py — Main text decoder with NKI DeltaNet kernels
- modeling_qwen35_vision.py — Vision encoder (for future VL support)
- modeling_qwen35_vl.py — VL orchestrator (for future VL support)
- nki_kernels/ — DeltaNet NKI kernel implementations (fused CTE, per-token TKG, chunked)

Optional Components

Unit Tests (CPU or Neuron-based)
- 42 unit tests for config parsing, weight conversion, and architecture validation
- Located in test/unit/ directory

Folder Structure

Confirm your contribution follows this structure:

/contrib/models/Qwen3.5-2B/
  README.md
  /src
    __init__.py
    modeling_qwen35.py
    modeling_qwen35_vision.py
    modeling_qwen35_vl.py
    /nki_kernels
      __init__.py
      nki_deltanet.py
      nki_deltanet_chunked.py
      nki_deltanet_fused.py
  /test
    __init__.py
    /unit
      __init__.py
      test_config.py
      test_weight_conversion.py
    /integration
      __init__.py
      test_model.py

Testing

How did you test this change?

All tests were run on a trn2.3xlarge instance with TP=4, LNC=2, SDK 2.29 (NKI 0.3.0, PyTorch 2.9, neuronx-distributed-inference). The model was compiled from Qwen/Qwen3.5-2B HuggingFace weights.

Test Results:

42 unit tests PASSED (CPU)
9/9 integration tests PASSED (Neuron, trn2.3xlarge TP=4):
  - test_model_loads: PASS
  - test_model_generates: PASS (20 tokens generated)
  - test_output_coherence: PASS
  - test_top_token_valid: PASS
  - test_capital_of_france: PASS ("Paris" in output)
  - test_logit_accuracy: PASS (cosine=0.9156, top-1 match, top-5 overlap 4/5)
  - test_performance_ttft: PASS (157.8 ms)
  - test_performance_throughput: PASS (114.5 tok/s)
  - test_multi_prompt_generation: PASS (4/4 prompts coherent)

Benchmark results (BF16, seq_len=128):

Batch Size	TTFT (ms)	Throughput (tok/s)
1	157.8	114.5
2	72.0	233.1
4	104.4	329.6
8	185.6	409.5

Note on logit validation approach: DeltaNet layers (18 of 24) use NKI linear recurrent kernels that produce higher BF16 numerical divergence than standard GQA. As a result, the CPU and Neuron autoregressive sequences diverge after the first generated token, which makes multi-token logit_validation() inapplicable. Only the first-token logits are validated, because that is the one position where CPU and Neuron process identical input prefixes. The model outputs TP-sharded logits (vocab/tp_degree entries) because ModelWrapper does not call _gather_along_dim, so the comparison uses the TP shard 0 slice.
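The first-token comparison described in this note could look roughly like the sketch below. It assumes shard 0 is the first contiguous vocab // tp_degree slice of the full CPU logits and that the vocabulary divides evenly; the function names are hypothetical and the toy numbers are not the PR's data.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_ids(logits, k):
    """Indices of the k largest logits, best first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def compare_first_token_logits(cpu_full_logits, neuron_shard0_logits, tp_degree, k=5):
    # Assumption: shard 0 holds the first vocab // tp_degree entries.
    shard_size = len(cpu_full_logits) // tp_degree
    if len(neuron_shard0_logits) != shard_size:
        raise ValueError("Neuron shard length does not match vocab // tp_degree")
    cpu_shard0 = cpu_full_logits[:shard_size]
    cpu_top = top_k_ids(cpu_shard0, k)
    neuron_top = top_k_ids(neuron_shard0_logits, k)
    return {
        "cosine": cosine_similarity(cpu_shard0, neuron_shard0_logits),
        "top1_match": cpu_top[0] == neuron_top[0],
        "topk_overlap": len(set(cpu_top) & set(neuron_top)),
    }


# Toy data: a 32-entry "vocabulary" with TP=4, so shard 0 has 8 entries.
cpu_full = [math.sin(i) for i in range(32)]
neuron_shard0 = [v + 0.01 * ((i % 3) - 1) for i, v in enumerate(cpu_full[:8])]
print(compare_first_token_logits(cpu_full, neuron_shard0, tp_degree=4))
```

One limitation of comparing only shard 0: a top-1 match is only meaningful if the reference top-1 token falls inside that slice.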

Compatibility

Tested with:

Neuron SDK Version(s): 2.29 (neuronx-cc 2.24, NKI 0.3.0)
Instance Type(s): trn2.3xlarge (TP=4, LNC=2)
PyTorch Version: 2.9.0
Python Version: 3.12

Additional Information

- SDK 2.29+ is required due to NKI 0.3.0 API requirements for the DeltaNet kernels
- The PyTorch _chunk_forward path creates 5D tensors that trigger a neuronx-cc codegen crash (NCC_INLA001); the fused NKI kernel is the default and required CTE path
- No mini model test is possible because DeltaNet layers require NKI kernels that only execute on Neuron devices
- Qwen3.5-2B is a chat model; raw text prompts produce echoey output, so tokenizer.apply_chat_template() is required for quality output
- The qwen3_5 model type requires transformers>=5.0 for CPU reference generation, but the NxDI-pinned transformers==4.57.* works for Neuron inference since the model is loaded via manual config.json parsing
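For readers unfamiliar with the chat format mentioned above, the sketch below approximates what a ChatML-style prompt looks like using the <|im_start|>/<|im_end|> markers. It is a simplified, hypothetical stand-in: the model's real template may add system prompts or other tokens, so tokenizer.apply_chat_template() remains the path to use in practice.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Simplified ChatML-style formatting (an approximation, not the model's real template)."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so generation continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt([{"role": "user", "content": "What is the capital of France?"}])
print(prompt)
```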

Related Issues

N/A

vLLM Integration

This model/feature is intended for use with vLLM
Documentation includes vLLM registration instructions

By submitting this PR, I confirm that:

I have read and followed the contributing guidelines
This is a community contribution and may have limited testing compared to officially-supported models
The code follows best practices and is well-documented
All required components listed above are included

tejasamx-aws

Left overall comments; the blocking issue is the hardcoded paths.

tejasamx-aws · 2026-05-10T19:45:49Z

+
+        import importlib
+
+        _fork_path = "/home/ubuntu/nki-library-fork/nkilib_src"


Hardcoded path, please update.
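One way to address this is to resolve the module from whatever is installed in the active environment rather than from a fixed filesystem path. The sketch below is a minimal illustration using only the standard library; the helper name is hypothetical, and the module name `nkilib` is an assumption inferred from the path and log message in this diff.

```python
import importlib
import importlib.util


def load_installed_module(module_name):
    """Import a top-level module from the active environment, or return None if absent."""
    if importlib.util.find_spec(module_name) is None:
        return None
    return importlib.import_module(module_name)


# "nkilib" is an assumed module name; on a machine without it this is simply None.
nkilib = load_installed_module("nkilib")
if nkilib is None:
    print("nkilib not installed; falling back to the default attention path")
```

With this shape the caller decides what to do when the optional dependency is missing, and nothing depends on a particular user's home directory.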

tejasamx-aws · 2026-05-10T19:58:08Z

+            _nkilib_flash_attn, log_level=logging.INFO
+        )
+        logger.info("Option B: nkilib flash attention loaded for head_dim > 128")
+    except Exception as e:


generic exception catching

tejasamx-aws · 2026-05-10T19:58:31Z

+            fn()
+            print(f"  PASS")
+            passed += 1
+        except Exception as e:


Same as above: generic exception catching, which degrades debugging capabilities.

tejasamx-aws · 2026-05-10T19:59:06Z

+                vision_mask = torch.zeros((0,), dtype=torch.int32)
+
+            padded = list(bucket_inputs)
+            while len(padded) < 21:


magic number, please explain where this comes from
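A sketch of how the literal could be lifted into a named, documented constant. The name `_NXDI_BASE_FORWARD_ARGS` is the one used in the follow-up commit message; its value is copied from the literal under review, and where that number comes from is not shown in this thread. The helper `pad_forward_args` and its `make_placeholder` argument are hypothetical.

```python
# Value copied from the literal in the diff; its derivation is not documented in this thread.
_NXDI_BASE_FORWARD_ARGS = 21


def pad_forward_args(bucket_inputs, make_placeholder, expected=_NXDI_BASE_FORWARD_ARGS):
    """Pad a list of forward() inputs with placeholders up to the expected arity."""
    padded = list(bucket_inputs)
    if len(padded) > expected:
        raise ValueError(f"got {len(padded)} inputs, expected at most {expected}")
    while len(padded) < expected:
        padded.append(make_placeholder())
    return padded


padded = pad_forward_args(["input_ids", "attention_mask"], make_placeholder=lambda: None)
print(len(padded))  # 21
```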

tejasamx-aws · 2026-05-10T20:00:31Z

+
+            w = self.conv1d.weight.squeeze(1)
+            conv_out = torch.zeros_like(mixed)
+            for k in range(4):


This manual convolution loop creates a new tensor per iteration. The range(4) is also a magic number. We should be using F.conv1d, or at minimum self.conv_kernel_size.
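The suggestion can be sketched as follows. This assumes `mixed` is laid out as (batch, channels, seq_len), that the convolution is causal (left-padded by kernel_size - 1), and that the weight is depthwise with shape (channels, kernel_size) after the squeeze; none of these are confirmed by the diff excerpt. The loop version is included only as a reference to check equivalence.

```python
import torch
import torch.nn.functional as F


def causal_depthwise_conv_loop(x, weight):
    """Reference: shift-and-accumulate over kernel taps. x: (B, C, S), weight: (C, K)."""
    kernel_size = weight.shape[-1]
    seq_len = x.shape[-1]
    out = torch.zeros_like(x)
    for k in range(kernel_size):
        shift = kernel_size - 1 - k
        out[..., shift:] += weight[:, k, None] * x[..., : max(seq_len - shift, 0)]
    return out


def causal_depthwise_conv(x, weight):
    """Same computation as one F.conv1d call with groups equal to the channel count."""
    kernel_size = weight.shape[-1]
    padded = F.pad(x, (kernel_size - 1, 0))  # left-pad so position t only sees inputs <= t
    return F.conv1d(padded, weight.unsqueeze(1), groups=x.shape[1])


torch.manual_seed(0)
x = torch.randn(2, 6, 10)
w = torch.randn(6, 4)
print(torch.allclose(causal_depthwise_conv_loop(x, w), causal_depthwise_conv(x, w), atol=1e-5))
```

Reading the kernel size from the weight shape (or from self.conv_kernel_size) also removes the hard-coded 4.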

- Remove hardcoded nkilib path; load from installed package
- Replace generic 'except Exception' with specific exception types
- Replace magic number 21 with _NXDI_BASE_FORWARD_ARGS constant
- Replace manual conv1d loop with F.conv1d(groups=conv_dim)
- Fix vision_mask sentinel: use n_active_tokens-1 in input_generator (large sentinel caused scatter/gather OOB in compiled NEFF)
- Keep _VISION_MASK_PAD_SENTINEL in pad_inputs with clamp for runtime
- Remove unused 'import sys'

Validated: 8/8 integration tests pass on trn2.3xlarge (SDK 2.28, TP=4)
TTFT: 165ms, throughput: 73.3 tok/s

Implements Qwen3.5-2B for NxD Inference with:
- DeltaNet linear attention via fused NKI kernel (v14: numerically stable 8-block Neumann inversion, handles uniform-key edge cases without NaN)
- Vision encoder with SDPA maskless mode and chunked processing for 2K×2K and 4K×4K image resolution (splits patches into 8192-token chunks)
- Multimodal Rotary Position Embeddings (mRoPE) for vision-language
- SDK 2.29.1 compatible (NKI 0.3.0 GA)
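The chunked processing mentioned in this commit message amounts to splitting a long patch sequence into fixed-size windows. A minimal sketch of the index arithmetic follows; the function name is hypothetical, and how the PR actually slices and recombines patches is not shown here.

```python
def chunk_bounds(num_tokens, chunk_size=8192):
    """Return (start, end) index pairs covering num_tokens in chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, num_tokens))
        for start in range(0, num_tokens, chunk_size)
    ]


print(chunk_bounds(20000))  # [(0, 8192), (8192, 16384), (16384, 20000)]
```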

…ubstitution

The full 128x128 Neumann power-doubling suffers catastrophic cancellation in fp32 for uniform-key inputs (solid-color images). Intermediate matrices reach 10^57 magnitude but must cancel to ~1.0, which is impossible with 7 significant digits.

Solution: partition into 8 blocks of 16x16 rows with 4-round Neumann per block. Max intermediate ~2300, error < 2e-4. Sequential forward solve handles cross-block coupling.

Validated on trn2.3xlarge: 150.7ms TTFT, correct text output.
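For context on why four rounds suffice for a 16x16 block, here is a pure-Python sketch of Neumann power doubling. It assumes the matrix being inverted has the form I + N with N strictly lower-triangular, so N^16 = 0 and the product (I + M)(I + M^2)(I + M^4)(I + M^8) with M = -N equals the exact inverse. That structure is an assumption about the DeltaNet chunk matrix, not something stated in the commit. The sketch runs in float64 on well-scaled toy data, so it does not reproduce the fp32 cancellation issue, and it omits the cross-block forward solve.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)] for row in a]


def matadd(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def neumann_inverse_unit_lower(strict_lower, rounds=4):
    """Invert (I + N) for strictly lower-triangular N via power doubling."""
    n = len(strict_lower)
    power = [[-x for x in row] for row in strict_lower]  # M = -N
    result = identity(n)
    for _ in range(rounds):
        result = matadd(result, matmul(result, power))   # result <- result @ (I + M^(2^r))
        power = matmul(power, power)                     # M^(2^r) -> M^(2^(r+1))
    return result


BLOCK = 16
N = [[0.1 * ((7 * i + 3 * j) % 5 - 2) if j < i else 0.0 for j in range(BLOCK)]
     for i in range(BLOCK)]
A = matadd(identity(BLOCK), N)
residual = matmul(A, neumann_inverse_unit_lower(N, rounds=4))
max_err = max(abs(residual[i][j] - (1.0 if i == j else 0.0))
              for i in range(BLOCK) for j in range(BLOCK))
print(max_err)
```

By the same counting, a full 128x128 block would need seven rounds, with correspondingly higher intermediate powers, which is consistent with the motivation given for splitting into smaller blocks.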

m-deepankar-singh mentioned this pull request Apr 30, 2026

[contrib] Add Qwen3.5 4B and 9B hybrid DeltaNet contrib models #152

Open

14 tasks

tejasamx-aws requested changes May 10, 2026


jimburtoft force-pushed the contrib/qwen3.5-2b-pr branch from 006a5b5 to b456669 Compare May 18, 2026 13:17

jimburtoft wants to merge 2 commits into aws-neuron:main from jimburtoft:contrib/qwen3.5-2b-pr

2 participants

