Add InternVL3-8B-Instruct contrib model by jimburtoft · Pull Request #153 · aws-neuron/neuronx-distributed-inference

jimburtoft · 2026-04-30T23:31:23Z

Note: The below template includes items meant for model contributions only. For other contributions such as bug fixes, features, etc., only fill out the relevant portions of the form.

Description

InternVL3-8B-Instruct VLM (vision-language model) for Neuron inference via the NeuronBaseForImageToText framework. Architecture: InternViT-300M vision encoder + pixel shuffle MLP projector + Qwen2.5-7B text backbone (~8B total parameters, BF16).

Validated on trn2.3xlarge (LNC=2, TP=4) with logit_validation passing, 75.1 tok/s text generation, and end-to-end multimodal inference.

Model Information

Model Name: InternVL3-8B-Instruct

Model Architecture: Vision-language model (InternViT-300M encoder + Qwen2.5-7B decoder)

Purpose: Multimodal text generation (image-to-text, visual question answering, text-only chat)

Checklist

Please ensure your PR includes the following items. Refer to the contrib/CONTRIBUTING.md for detailed guidelines.

Required Components

Accuracy Test (ex. test/integration/test_model.py)
- Integration test validates model accuracy using torch_neuronx.testing.validation.logit_validation
- Generates CPU FP32 reference logits via generate_expected_logits, compares against Neuron BF16 output
- Handles TP vocab padding (151674 → 151676) by truncating to TP-aligned boundary
- BF16-appropriate tolerance map (top-5: 0.05, top-50: 0.06, all: 0.08)
- 2/2 tests pass on trn2.3xlarge
README.md with the following sections:
- Usage Example: Compile and run examples for text-only and multimodal inference
- Compatibility Matrix: Validated on trn2.3xlarge (LNC=2, TP=4) with SDK 2.29
- Example Checkpoints: OpenGVLab/InternVL3-8B-Instruct
- Testing Instructions: pytest test/integration/test_model.py -v --tb=short
Source Code (src/)
- modeling_internvl3.py: Top-level VLM (NeuronBaseForImageToText)
- modeling_internvl3_text.py: Text backbone (Qwen2.5-7B with vision embedding injection)
- modeling_internvl3_vision.py: Vision encoder (InternViT-300M, torch_neuronx.trace)

Optional Components

Unit Tests (CPU or Neuron-based)
- test/unit/__init__.py exists (placeholder for future tests)

Folder Structure

Confirm your contribution follows this structure:

/contrib/models/InternVL3-8B-Instruct/
  README.md
  compile_internvl3_vlm.py
  /src
    __init__.py
    modeling_internvl3.py
    modeling_internvl3_text.py
    modeling_internvl3_vision.py
  /test
    __init__.py
    /unit
      __init__.py
    /integration
      __init__.py
      test_model.py

Testing

How did you test this change?

All tests run on trn2.3xlarge (LNC=2, TP=4) with Neuron SDK 2.29.

Model compiled using compile_internvl3_vlm.py (text CTE+TKG + vision encoder NEFFs)
Integration test test_model.py runs:
- test_config: Validates InternVL3 config matches expected Qwen2.5-7B architecture (7 assertions)
- test_text_logit_validation: CPU FP32 reference logits (16 tokens) compared against Neuron BF16 via logit_validation() with per-tier tolerances

Test Results:

test/integration/test_model.py::TestInternVL3Integration::test_config PASSED
test/integration/test_model.py::TestInternVL3Integration::test_text_logit_validation PASSED

Summary: Max divergence difference = 0
  Top k = 5  max error = 0.036 (tol: 0.05)
  Top k = 50 max error = 0.048 (tol: 0.06)
  Top k = 1000 max error = 0.050 (tol: 0.06)
  Top k = None max error = 0.063 (tol: 0.08)

======================= 2 passed in 26.82s ========================
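
As a quick reading aid for the summary above, the following sketch computes how much headroom each tier has against its tolerance. The numbers are copied from the log; the `headroom` helper and the tier labels are illustrative names, not part of `torch_neuronx` or the test file.

```python
# Values copied from the "Summary" lines of the test log above.
# Each entry maps a tier label to (reported max error, tolerance).
reported = {
    "top-5": (0.036, 0.05),
    "top-50": (0.048, 0.06),
    "top-1000": (0.050, 0.06),
    "all": (0.063, 0.08),
}


def headroom(results):
    """Return tolerance minus max error per tier (positive means the tier passes)."""
    return {tier: round(tol - err, 3) for tier, (err, tol) in results.items()}


margins = headroom(reported)
for tier, margin in margins.items():
    status = "PASS" if margin > 0 else "FAIL"
    print(f"{tier:>8}: margin {margin:.3f} {status}")

# The tier with the least headroom is the first one to watch if tolerances tighten.
tightest = min(margins, key=margins.get)
print("tightest tier:", tightest)
```

By this reading, every tier passes, and the top-1000 tier sits closest to its limit.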

Compatibility

Tested with:

Neuron SDK Version(s): 2.29 (NxDI 0.9.17334, neuronx-cc 2.24.5133.0)
Instance Type(s): trn2.3xlarge (LNC=2, TP=4)
PyTorch Version: 2.9
Python Version: 3.12.3

Additional Information

Performance: 75.1 tok/s (BS=1, seq_len=2048), TTFT 138ms. 1.85x faster than L40S GPU.
Vision encoder: Compiled via torch_neuronx.trace() with --auto-cast=matmult -O1. 34.5ms per 448x448 tile.
TP vocab padding: InternVL3 vocab_size (151674) is not divisible by TP=4, so lm_head pads to 151676. The test truncates logits to the TP-aligned boundary (151672) to avoid false failures from padding artifacts.
Maintainer: Jim Burtoft (@jimburtoft)
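
The vocab padding arithmetic described above can be checked with a short sketch. The helper names are hypothetical and the list stands in for a real logits tensor; only the numbers (151674, TP=4, 151676, 151672) come from this PR.

```python
def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple (how the padded lm_head size is reached)."""
    return -(-n // multiple) * multiple


def floor_to_multiple(n: int, multiple: int) -> int:
    """Round n down to the nearest multiple (the TP-aligned truncation boundary)."""
    return (n // multiple) * multiple


VOCAB_SIZE = 151674  # InternVL3 vocab_size, from the PR description
TP_DEGREE = 4        # tensor parallel degree used for validation

padded = pad_to_multiple(VOCAB_SIZE, TP_DEGREE)
boundary = floor_to_multiple(VOCAB_SIZE, TP_DEGREE)
print(VOCAB_SIZE % TP_DEGREE, padded, boundary)

# Stand-in for one row of Neuron logits with the padded width.
neuron_logits = [0.0] * padded
# Truncate to the TP-aligned boundary before comparing against the CPU reference.
comparable = neuron_logits[:boundary]
print(len(comparable))
```

The same slice would be applied along the vocab dimension of both the reference and the Neuron logits so that the two sides compare equal-width rows.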

Related Issues

N/A

vLLM Integration

This model/feature is intended for use with vLLM
Documentation includes vLLM registration instructions

vLLM integration requires patches to vllm-neuron's model loader to register the InternVL3 architecture. See README for details.

By submitting this PR, I confirm that:

I have read and followed the contributing guidelines
This is a community contribution and may have limited testing compared to officially-supported models
The code follows best practices and is well-documented
All required components listed above are included

InternVL3-8B-Instruct VLM (vision-language model) for Neuron inference via NeuronBaseForImageToText framework. Architecture: InternViT-300M vision encoder + pixel shuffle MLP projector + Qwen2.5-7B text backbone (~8B total parameters).

Validated on trn2.3xlarge (LNC=2, TP=4) with SDK 2.29:
- logit_validation passes with BF16-appropriate tolerances
- 75.1 tok/s (BS=1, seq_len=2048), 1.85x vs L40S GPU
- Text-only and multimodal inference supported

Includes compile script, src/ modeling code (3 files), and integration tests using torch_neuronx logit_validation.

tejasamx-aws · 2026-05-10T20:09:02Z

@@ -0,0 +1,242 @@
+#!/usr/bin/env python3


I could be missing something, but AFAIK the current test only validates text-only inference. For a VLM, we should add at least one test that passes actual pixel_values through the vision encoder and validates the output. For example: describe a known image and check for expected keywords, or do logit validation on a multimodal prompt. The compile script already does a multimodal smoke test; we can adapt it into a proper pytest integration test.
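
One possible shape for the keyword-based check, as a minimal sketch. The `describe` callable stands in for whatever wrapper the test would build around the compiled model; `fake_describe`, the image path, and the keywords are placeholders, not values from this PR.

```python
from typing import Callable, Iterable, List


def missing_keywords(text: str, expected: Iterable[str]) -> List[str]:
    """Return the expected keywords that do not appear in text (case-insensitive)."""
    lowered = text.lower()
    return [kw for kw in expected if kw.lower() not in lowered]


def check_image_description(
    describe: Callable[[str, str], str],
    image_path: str,
    prompt: str,
    expected: Iterable[str],
) -> str:
    """Run the model on one image and fail if any expected keyword is absent."""
    output = describe(image_path, prompt)
    missing = missing_keywords(output, expected)
    assert not missing, f"missing keywords {missing} in output: {output!r}"
    return output


# Hypothetical stand-in so the sketch runs without Neuron hardware.
# A real test would pass pixel_values through the vision encoder instead.
def fake_describe(image_path: str, prompt: str) -> str:
    return "A red apple on a wooden table."


result = check_image_description(
    fake_describe, "apple.jpg", "Describe this image.", ["apple", "table"]
)
print(result)
```

Keyword checks are coarse, so logit validation on a multimodal prompt would still be the stricter option of the two.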

tejasamx-aws · 2026-05-10T20:11:33Z

+from modeling_internvl3 import InternVL3InferenceConfig, NeuronInternVL3ForCausalLM
+from neuronx_distributed_inference.models.config import NeuronConfig
+
+MODEL_PATH = "/mnt/models/InternVL3-8B-Instruct/"


Hardcoded paths. Additionally, `import argparse` is present but unused. We should either add proper argument parsing or use `os.environ.get()` with these as defaults.
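
A minimal sketch of combining both options: the environment variable supplies the default and a command-line flag overrides it. The variable name `INTERNVL3_MODEL_PATH` and the `--model-path` flag are assumptions for illustration; only the default path comes from the diff.

```python
import argparse
import os

# Current hardcoded value from the compile script, kept as the last-resort default.
DEFAULT_MODEL_PATH = "/mnt/models/InternVL3-8B-Instruct/"


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose default comes from the environment, then the constant."""
    parser = argparse.ArgumentParser(description="Compile InternVL3 (sketch)")
    parser.add_argument(
        "--model-path",
        # INTERNVL3_MODEL_PATH is a hypothetical variable name.
        default=os.environ.get("INTERNVL3_MODEL_PATH", DEFAULT_MODEL_PATH),
        help="Directory containing the Hugging Face checkpoint",
    )
    return parser


# An explicit empty list keeps this sketch independent of the real command line;
# the script itself would call parse_args() with no arguments.
args = build_parser().parse_args([])
print(args.model_path)
```

Precedence is then flag, environment variable, built-in default. Any other hardcoded paths in the script could follow the same pattern.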

tejasamx-aws · 2026-05-10T20:13:51Z

+)
+
+# InternVL3 vision constants
+VISION_HIDDEN_SIZE = 1024


Where are these coming from? HF? Ideally we should read them from config.vision_config, unless there is a specific reason for hardcoding them.
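
A sketch of what reading the value from the config could look like. The `SimpleNamespace` object stands in for the loaded Hugging Face config; the assumption that `vision_config` exposes `hidden_size` (as either an attribute or a dict key) should be verified against the actual checkpoint config.

```python
from types import SimpleNamespace


def get_vision_attr(config, name: str):
    """Read one field from config.vision_config, accepting an object or a dict."""
    vision_config = getattr(config, "vision_config", None)
    if vision_config is None:
        raise ValueError("config has no vision_config section")
    if isinstance(vision_config, dict):
        return vision_config[name]
    return getattr(vision_config, name)


# Stand-in config for illustration; 1024 mirrors the constant in the diff.
config = SimpleNamespace(vision_config=SimpleNamespace(hidden_size=1024))

VISION_HIDDEN_SIZE = get_vision_attr(config, "hidden_size")
print(VISION_HIDDEN_SIZE)
```

A missing field then raises at load time instead of silently diverging from the checkpoint.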

tejasamx-aws

PR looks good overall. Blocking comments: one about the hardcoded paths and another about reading from the HF config rather than hardcoding.

tejasamx-aws requested changes May 10, 2026

jimburtoft wants to merge 1 commit into aws-neuron:main from jimburtoft:contrib/internvl3-8b-instruct
