
Add Qwen3.5-2B contrib model #141

Open
jimburtoft wants to merge 2 commits into aws-neuron:main from jimburtoft:contrib/qwen3.5-2b-pr


@jimburtoft
Contributor

Note: The below template includes items meant for model contributions only. For other contributions such as bug fixes, features, etc., only fill out the relevant portions of the form.

Description

Adds Qwen3.5-2B, a 2B parameter dense hybrid DeltaNet + GQA decoder from Alibaba Cloud, to the contrib directory. This model features 18 DeltaNet linear recurrent attention layers and 6 standard GQA layers in a [3 DeltaNet + 1 GQA] x 6 pattern, requiring custom NKI kernels for the DeltaNet forward passes on Neuron.
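The layer layout described above can be sketched in a few lines. This is only an illustration of the [3 DeltaNet + 1 GQA] x 6 pattern; the helper name `build_layer_types` and the string labels are hypothetical and are not taken from the PR's source.

```python
# Hypothetical helper: expands the "[3 DeltaNet + 1 GQA] x 6" pattern
# into a per-layer list. The labels are illustrative, not the PR's identifiers.
def build_layer_types(num_groups=6, deltanet_per_group=3, gqa_per_group=1):
    group = ["deltanet"] * deltanet_per_group + ["gqa"] * gqa_per_group
    return group * num_groups


layer_types = build_layer_types()
# Zero-based indices of the layers that use standard GQA attention.
gqa_indices = [i for i, kind in enumerate(layer_types) if kind == "gqa"]

print(len(layer_types), layer_types.count("deltanet"), layer_types.count("gqa"))
print(gqa_indices)
```

With the default arguments this yields 24 layers, 18 of them DeltaNet and 6 GQA, with every fourth layer being GQA.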

Key implementation details:

  • Fused NKI kernel for DeltaNet context encoding (CTE) — single-kernel chunked forward
  • Per-token NKI recurrent kernel for DeltaNet token generation (TKG)
  • Standard GQA attention for the 6 non-DeltaNet layers
  • Tied embeddings support (tie_word_embeddings=true)
  • Partial RoPE position encoding (25% of head_dim)
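For readers unfamiliar with DeltaNet, the per-token recurrence that a token-generation kernel has to evaluate can be sketched with the textbook delta rule. This is a plain-Python illustration under stated assumptions, not the NKI kernel from this PR: the function name `delta_rule_step`, the scalar `decay` gate, and the use of nested lists instead of tensors are all hypothetical.

```python
def delta_rule_step(state, q, k, v, beta, decay=1.0):
    """One token of a delta-rule recurrence (textbook form, not the PR's kernel).

    state: d_v x d_k memory matrix as a list of lists.
    q, k:  query and key vectors of length d_k.
    v:     value vector of length d_v.
    beta:  write strength in [0, 1].
    decay: assumed optional scalar gate on the old memory (1.0 disables it).
    Returns (new_state, output).
    """
    d_v, d_k = len(state), len(k)
    decayed = [[decay * state[i][j] for j in range(d_k)] for i in range(d_v)]
    # What the memory currently returns for key k.
    pred = [sum(decayed[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Move the memory toward v along direction k (rank-1 correction).
    new_state = [
        [decayed[i][j] + beta * (v[i] - pred[i]) * k[j] for j in range(d_k)]
        for i in range(d_v)
    ]
    # Read the updated memory with the query.
    out = [sum(new_state[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return new_state, out


state = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
state, out_first = delta_rule_step(state, q=key, k=key, v=[3.0, 4.0], beta=1.0)
state, out_second = delta_rule_step(state, q=key, k=key, v=[5.0, 6.0], beta=1.0)
```

With a unit-norm key and `beta=1.0`, reading back with the same key returns the value just written, and a second write under the same key replaces the first. Because the state is a fixed-size matrix, the cost per generated token does not grow with sequence length.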

Model Information

Model Name: Qwen3.5-2B

Model Architecture: Decoder-only hybrid DeltaNet/GQA transformer (24 layers: 18 DeltaNet + 6 GQA), dense SwiGLU MLP, 2048 hidden size, 248K vocabulary

Purpose: Text generation (chat model with <|im_start|>/<|im_end|> format)

Checklist

Please ensure your PR includes the following items. Refer to the contrib/CONTRIBUTING.md for detailed guidelines.

Required Components

  • Accuracy Test (ex. test/integration/test_model.py)

    • 9 integration tests validating model accuracy on Neuron
    • First-token logit validation against pre-computed CPU BF16 reference logits (cosine similarity 0.9156, top-1 match, top-5 overlap 4/5)
    • Multi-prompt coherence tests (factual Q&A, code generation, knowledge, list generation)
    • Test can compile and run the model on Neuron
  • README.md with the following sections:

    • Usage Example: Clear code example showing how to use the model
    • Compatibility Matrix: Table showing tested Neuron SDK versions and instance types
    • Example Checkpoints: Links to compatible model checkpoints (HuggingFace Hub)
    • Testing Instructions: Commands to run unit and integration test suites, including CPU reference logit generation
  • Source Code (src/)

    • modeling_qwen35.py — Main text decoder with NKI DeltaNet kernels
    • modeling_qwen35_vision.py — Vision encoder (for future VL support)
    • modeling_qwen35_vl.py — VL orchestrator (for future VL support)
    • nki_kernels/ — DeltaNet NKI kernel implementations (fused CTE, per-token TKG, chunked)

Optional Components

  • Unit Tests (CPU or Neuron-based)
    • 42 unit tests for config parsing, weight conversion, and architecture validation
    • Located in test/unit/ directory

Folder Structure

Confirm your contribution follows this structure:

/contrib/models/Qwen3.5-2B/
  README.md
  /src
    __init__.py
    modeling_qwen35.py
    modeling_qwen35_vision.py
    modeling_qwen35_vl.py
    /nki_kernels
      __init__.py
      nki_deltanet.py
      nki_deltanet_chunked.py
      nki_deltanet_fused.py
  /test
    __init__.py
    /unit
      __init__.py
      test_config.py
      test_weight_conversion.py
    /integration
      __init__.py
      test_model.py

Testing

How did you test this change?

All tests were run on a trn2.3xlarge instance with TP=4, LNC=2, SDK 2.29 (NKI 0.3.0, PyTorch 2.9, neuronx-distributed-inference). The model was compiled from Qwen/Qwen3.5-2B HuggingFace weights.

Test Results:

42 unit tests PASSED (CPU)
9/9 integration tests PASSED (Neuron, trn2.3xlarge TP=4):
  - test_model_loads: PASS
  - test_model_generates: PASS (20 tokens generated)
  - test_output_coherence: PASS
  - test_top_token_valid: PASS
  - test_capital_of_france: PASS ("Paris" in output)
  - test_logit_accuracy: PASS (cosine=0.9156, top-1 match, top-5 overlap 4/5)
  - test_performance_ttft: PASS (157.8 ms)
  - test_performance_throughput: PASS (114.5 tok/s)
  - test_multi_prompt_generation: PASS (4/4 prompts coherent)

Benchmark results (BF16, seq_len=128):

| Batch Size | TTFT (ms) | Throughput (tok/s) |
|------------|-----------|--------------------|
| 1          | 157.8     | 114.5              |
| 2          | 72.0      | 233.1              |
| 4          | 104.4     | 329.6              |
| 8          | 185.6     | 409.5              |

Note on logit validation approach: DeltaNet layers (18 of 24) use NKI linear recurrent kernels that produce higher BF16 numerical divergence than standard GQA. Autoregressive sequences diverge after the first generated token, which makes multi-token logit_validation() inapplicable. Only the first-token logits are validated, because that is the one step where CPU and Neuron process identical input prefixes. The model outputs TP-sharded logits (vocab/tp_degree) because ModelWrapper does not call _gather_along_dim, so the comparison uses the TP shard 0 slice.
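The three metrics quoted in the test results (cosine similarity, top-1 match, top-k overlap) can be sketched as follows. This is a minimal illustration, not the code in test_model.py: the function names are hypothetical, and it assumes that TP shard 0 holds the first `vocab / tp_degree` vocabulary ids contiguously.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_indices(logits, k):
    # Indices of the k largest logits, largest first.
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def compare_first_token_logits(ref_logits, neuron_shard, k=5):
    """Compare a full-vocab CPU reference against one TP shard of Neuron logits.

    Assumption: the shard covers vocab ids [0, len(neuron_shard)).
    """
    ref_slice = ref_logits[: len(neuron_shard)]
    ref_top = top_k_indices(ref_slice, k)
    neuron_top = top_k_indices(neuron_shard, k)
    return {
        "cosine": cosine_similarity(ref_slice, neuron_shard),
        "top1_match": ref_top[0] == neuron_top[0],
        "topk_overlap": len(set(ref_top) & set(neuron_top)),
    }


# Toy data: vocab of 8, tp_degree of 2, so shard 0 holds 4 logits.
reference = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, 3.0, 9.0]
shard_zero = [0.2, 1.9, -0.8, 0.4]
result = compare_first_token_logits(reference, shard_zero, k=2)
print(result)
```

One consequence of comparing a single shard is that the top-1 check only covers tokens whose ids fall inside that shard, so the reference is sliced before ranking rather than after.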

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.29 (neuronx-cc 2.24, NKI 0.3.0)
  • Instance Type(s): trn2.3xlarge (TP=4, LNC=2)
  • PyTorch Version: 2.9.0
  • Python Version: 3.12

Additional Information

  • SDK 2.29+ is required due to NKI 0.3.0 API requirements for the DeltaNet kernels
  • The PyTorch _chunk_forward path creates 5D tensors that trigger a neuronx-cc codegen crash (NCC_INLA001); the fused NKI kernel is the default and required CTE path
  • No mini model test is possible because DeltaNet layers require NKI kernels that only execute on Neuron devices
  • Qwen3.5-2B is a chat model; raw text prompts produce echoey output — tokenizer.apply_chat_template() is required for quality output
  • The qwen3_5 model type requires transformers>=5.0 for CPU reference generation, but the NxDI-pinned transformers==4.57.* works for Neuron inference since the model is loaded via manual config.json parsing
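To make the chat-format point concrete, the sketch below builds a ChatML-style prompt by hand using the `<|im_start|>`/`<|im_end|>` markers named in this PR. It is an assumption that the model's template reduces to plain ChatML for a single user turn; the real template may add further fields, so `tokenizer.apply_chat_template()` remains the recommended path. The helper name `format_chatml` is hypothetical.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Build a ChatML-style prompt string (illustrative, not the official template)."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = format_chatml([{"role": "user", "content": "What is the capital of France?"}])
print(prompt)
```

Feeding the bare question without these markers is the "raw text prompt" case that the note above describes as producing echoey output.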

Related Issues

N/A

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included


@tejasamx-aws left a comment


Left overall comments; the blocking issue is the hardcoded paths.


import importlib

_fork_path = "/home/ubuntu/nki-library-fork/nkilib_src"

Hardcoded path; please update.
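One way to avoid a machine-specific path is to ask the import system where a package is installed, with an optional environment-variable override for development checkouts. The sketch below is an illustration only: `resolve_package_dir` and the override variable name are hypothetical, and the standard-library `json` package stands in for the real dependency.

```python
import importlib.util
import os
from pathlib import Path


def resolve_package_dir(package_name, env_var=None):
    """Return the directory of an installed top-level package, or None.

    env_var is an assumed, optional override for local development checkouts.
    """
    if env_var and os.environ.get(env_var):
        return Path(os.environ[env_var])
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        # Not installed, or a namespace package with no single location.
        return None
    return Path(spec.origin).parent


# "json" is used here only because it is always available.
found = resolve_package_dir("json")
missing = resolve_package_dir("definitely_not_installed_pkg_xyz")
print(found, missing)
```

Returning `None` instead of raising lets the caller decide whether a missing optional dependency is an error or simply disables a fast path.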

_nkilib_flash_attn, log_level=logging.INFO
)
logger.info("Option B: nkilib flash attention loaded for head_dim > 128")
except Exception as e:

generic exception catching

fn()
print(f" PASS")
passed += 1
except Exception as e:

Same as above: generic exception catching, which degrades debuggability.
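A minimal sketch of the narrower pattern for a hand-rolled test loop is shown below. It assumes that checks signal failure with `AssertionError`; the runner name `run_checks` and the sample checks are hypothetical. Any other exception type is left to propagate so that the original traceback is not hidden.

```python
import traceback


def run_checks(checks):
    """Run (name, callable) pairs; count passes and record assertion failures."""
    passed = 0
    failed = []
    for name, fn in checks:
        try:
            fn()
        except AssertionError:
            # Expected failure mode of a check: keep the full traceback.
            failed.append((name, traceback.format_exc()))
        else:
            passed += 1
    return passed, failed


def check_ok():
    assert 1 + 1 == 2


def check_bad():
    assert 1 + 1 == 3, "arithmetic is broken"


passed, failed = run_checks([("ok", check_ok), ("bad", check_bad)])
print(passed, [name for name, _ in failed])
```

A programming error inside a check, such as a `KeyError`, now stops the run immediately instead of being reported as an ordinary test failure.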

vision_mask = torch.zeros((0,), dtype=torch.int32)

padded = list(bucket_inputs)
while len(padded) < 21:

Magic number; please explain where this comes from.


w = self.conv1d.weight.squeeze(1)
conv_out = torch.zeros_like(mixed)
for k in range(4):

This manual convolution loop creates a new tensor per iteration. The range(4) is also a magic number. We should use F.conv1d or, at minimum, self.conv_kernel_size.
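For reference, the operation that `F.conv1d` with `groups` equal to the channel count computes on a left-padded input is a depthwise causal convolution. The plain-Python sketch below spells that out, reading the kernel size from the weight shape instead of a literal. It assumes the layer's short convolution is causal and depthwise; the function name is hypothetical, and nested lists stand in for tensors.

```python
def depthwise_causal_conv1d(x, weight):
    """Reference depthwise causal conv (cross-correlation, zero left padding).

    x:      channels x time, list of lists.
    weight: channels x kernel_size, one filter per channel.
    out[c][t] = sum_k weight[c][k] * x[c][t - (kernel_size - 1) + k]
    """
    kernel_size = len(weight[0])  # derived from the weights, not hardcoded
    out = []
    for channel, row in enumerate(x):
        padded = [0.0] * (kernel_size - 1) + list(row)
        out.append(
            [
                sum(weight[channel][k] * padded[t + k] for k in range(kernel_size))
                for t in range(len(row))
            ]
        )
    return out


signal = [[1.0, 2.0, 3.0, 4.0, 5.0]]
running_sum = depthwise_causal_conv1d(signal, [[1.0, 1.0, 1.0, 1.0]])
identity_tap = depthwise_causal_conv1d(signal, [[0.0, 0.0, 0.0, 1.0]])
print(running_sum, identity_tap)
```

The last tap of each filter multiplies the current time step, so an output at time `t` never depends on inputs later than `t`.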

jimburtoft added a commit to jimburtoft/neuronx-distributed-inference that referenced this pull request May 13, 2026
- Remove hardcoded nkilib path; load from installed package
- Replace generic 'except Exception' with specific exception types
- Replace magic number 21 with _NXDI_BASE_FORWARD_ARGS constant
- Replace manual conv1d loop with F.conv1d(groups=conv_dim)
- Fix vision_mask sentinel: use n_active_tokens-1 in input_generator
  (large sentinel caused scatter/gather OOB in compiled NEFF)
- Keep _VISION_MASK_PAD_SENTINEL in pad_inputs with clamp for runtime
- Remove unused 'import sys'

Validated: 8/8 integration tests pass on trn2.3xlarge (SDK 2.28, TP=4)
TTFT: 165ms, throughput: 73.3 tok/s

Implements Qwen3.5-2B for NxD Inference with:
- DeltaNet linear attention via fused NKI kernel (v14: numerically stable
  8-block Neumann inversion, handles uniform-key edge cases without NaN)
- Vision encoder with SDPA maskless mode and chunked processing for
  2K×2K and 4K×4K image resolution (splits patches into 8192-token chunks)
- Multimodal Rotary Position Embeddings (mRoPE) for vision-language
- SDK 2.29.1 compatible (NKI 0.3.0 GA)
@jimburtoft force-pushed the contrib/qwen3.5-2b-pr branch from 006a5b5 to b456669 on May 18, 2026 13:17
…ubstitution

The full 128x128 Neumann power-doubling suffers catastrophic cancellation in
fp32 for uniform-key inputs (solid-color images). Intermediate matrices reach
10^57 magnitude but must cancel to ~1.0 — impossible with 7 significant digits.

Solution: partition into 8 blocks of 16x16 rows with 4-round Neumann per block.
Max intermediate ~2300, error < 2e-4. Sequential forward solve handles cross-block
coupling. Validated on trn2.3xlarge: 150.7ms TTFT, correct text output.
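The blocked scheme described in this commit message can be sketched at reduced size. The code below is a plain-Python illustration of the general technique, not the NKI kernel: each diagonal block of a unit lower-triangular matrix is inverted with a Neumann series evaluated by power doubling, and a sequential forward solve carries the coupling between blocks. All helper names are hypothetical, the demo uses an 8x8 matrix with 4x4 blocks instead of 128x128 with 16x16 blocks, and Python floats are double precision, so it does not reproduce the fp32 cancellation issue.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def neumann_unit_lower_inverse(block):
    """Invert a unit lower-triangular block via (I+m)(I+m^2)(I+m^4)..."""
    n = len(block)
    # block = I - m with m strictly lower triangular, so m^n is zero.
    m = [[-block[i][j] if j < i else 0.0 for j in range(n)] for i in range(n)]
    inv = identity(n)
    power = m
    covered = 1  # inv equals the sum of m^k for k < covered
    while covered < n:
        inv = matmul(inv, add(identity(n), power))
        power = matmul(power, power)
        covered *= 2  # a 16x16 block needs 4 rounds, a 128x128 matrix 7
    return inv


def block_forward_solve(lower, rhs, block_size):
    """Solve lower @ x = rhs block row by block row."""
    n, cols = len(lower), len(rhs[0])
    x = [[0.0] * cols for _ in range(n)]
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        diag = [row[start:end] for row in lower[start:end]]
        # Subtract the contribution of block rows that are already solved.
        local = [
            [rhs[i][c] - sum(lower[i][j] * x[j][c] for j in range(start))
             for c in range(cols)]
            for i in range(start, end)
        ]
        solved = matmul(neumann_unit_lower_inverse(diag), local)
        for offset, row in enumerate(solved):
            x[start + offset] = row
    return x


size = 8
# Uniform off-diagonal entries, loosely analogous to the uniform-key case.
lower = [[1.0 if i == j else (0.5 if j < i else 0.0) for j in range(size)]
         for i in range(size)]
inverse = block_forward_solve(lower, identity(size), block_size=4)
product = matmul(lower, inverse)
max_err = max(abs(product[i][j] - (1.0 if i == j else 0.0))
              for i in range(size) for j in range(size))
print(max_err)
```

Restricting the series to small diagonal blocks keeps the matrix powers short (at most `block_size - 1` nonzero terms), which is the property the commit message relies on to bound intermediate magnitudes.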