Add InternVL3-8B-Instruct contrib model #153
InternVL3-8B-Instruct VLM (vision-language model) for Neuron inference via the NeuronBaseForImageToText framework. Architecture: InternViT-300M vision encoder + pixel shuffle MLP projector + Qwen2.5-7B text backbone (~8B total parameters).
Validated on trn2.3xlarge (LNC=2, TP=4) with SDK 2.29:
- logit_validation passes with BF16-appropriate tolerances
- 75.1 tok/s (BS=1, seq_len=2048), 1.85x vs L40S GPU
- Text-only and multimodal inference supported
Includes compile script, src/ modeling code (3 files), and integration tests using torch_neuronx logit_validation.
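For readers who want the headline numbers in other units, the short sketch below converts the reported 75.1 tok/s into a per-token decode latency and derives the GPU throughput implied by the 1.85x figure. The derived baseline is not stated in the PR; it assumes the 1.85x ratio compares tok/s under the same batch size and sequence length.

```python
def per_token_latency_ms(tokens_per_second: float) -> float:
    """Average time per generated token, in milliseconds."""
    return 1000.0 / tokens_per_second


def implied_baseline_tps(tokens_per_second: float, speedup: float) -> float:
    """Throughput of the baseline implied by a reported speedup ratio."""
    return tokens_per_second / speedup


NEURON_TPS = 75.1       # reported: BS=1, seq_len=2048 on trn2.3xlarge
SPEEDUP_VS_L40S = 1.85  # reported ratio; assumed to be a tok/s ratio

latency_ms = per_token_latency_ms(NEURON_TPS)
baseline_tps = implied_baseline_tps(NEURON_TPS, SPEEDUP_VS_L40S)
print(f"{latency_ms:.1f} ms/token")              # 13.3 ms/token
print(f"{baseline_tps:.1f} tok/s implied L40S")  # 40.6 tok/s implied L40S
```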
| @@ -0,0 +1,242 @@ | |||
| #!/usr/bin/env python3 | |||
I could be missing something, but AFAIK the current test only validates text-only inference. For a VLM, we should add at least one test that passes actual pixel_values through the vision encoder and validates the output. For example: describe a known image and check for expected keywords, or run logit validation on a multimodal prompt. The compile script already does a multimodal smoke test, so we can adapt it into a proper pytest integration test.
| from modeling_internvl3 import InternVL3InferenceConfig, NeuronInternVL3ForCausalLM | ||
| from neuronx_distributed_inference.models.config import NeuronConfig | ||
|
|
||
| MODEL_PATH = "/mnt/models/InternVL3-8B-Instruct/" |
Hardcoded paths. Additionally, `import argparse` is present but unused. We should either add proper argument parsing or use `os.environ.get()` with these as defaults.
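One possible shape for this, combining both suggestions: the command-line flag wins, then an environment variable, then the current hardcoded value as the fallback. The environment variable name `INTERNVL3_MODEL_PATH` and the `--model-path` flag are assumptions made for illustration; only the default path comes from the diff.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence

# Current hardcoded value from the script, kept as the last-resort default.
DEFAULT_MODEL_PATH = "/mnt/models/InternVL3-8B-Instruct/"


def parse_args(
    argv: Optional[Sequence[str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> argparse.Namespace:
    """Resolve the model path: CLI flag > environment variable > default."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Compile or test InternVL3 on Neuron")
    parser.add_argument(
        "--model-path",
        # INTERNVL3_MODEL_PATH is a hypothetical variable name.
        default=env.get("INTERNVL3_MODEL_PATH", DEFAULT_MODEL_PATH),
        help="Directory containing the Hugging Face checkpoint",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"Using model path: {args.model_path}")
```

Passing `argv` and `env` explicitly keeps the function easy to unit test without touching the real process environment.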
| ) | ||
|
|
||
| # InternVL3 vision constants | ||
| VISION_HIDDEN_SIZE = 1024 |
Where are these coming from? HF? Ideally we should read them from config.vision_config, unless there is a specific reason for hardcoding them.
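A sketch of what reading these from the config could look like, using a plain dict as a stand-in for the parsed config.json. The key names (`hidden_size`, `image_size`, `patch_size`) and the patch size of 14 are assumptions about the checkpoint's `vision_config` and should be checked against the actual file; the 1024 hidden size and 448 tile size match values that appear in this PR.

```python
import json
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class VisionConstants:
    hidden_size: int
    image_size: int
    patch_size: int

    @property
    def patches_per_tile(self) -> int:
        """Number of ViT patches in one square tile."""
        side = self.image_size // self.patch_size
        return side * side


def vision_constants_from_config(config: Mapping[str, Any]) -> VisionConstants:
    """Pull vision constants from a parsed config instead of hardcoding them.

    Raises KeyError if an expected key is absent, which surfaces a config
    mismatch early rather than silently using a stale constant.
    """
    vision = config["vision_config"]
    return VisionConstants(
        hidden_size=int(vision["hidden_size"]),
        image_size=int(vision["image_size"]),
        patch_size=int(vision["patch_size"]),
    )


# Illustrative fragment; in practice this would come from json.load() on the
# checkpoint's config.json.
sample_config = json.loads(
    '{"vision_config": {"hidden_size": 1024, "image_size": 448, "patch_size": 14}}'
)
constants = vision_constants_from_config(sample_config)
print(constants.hidden_size, constants.patches_per_tile)  # 1024 1024
```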
tejasamx-aws
left a comment
PR looks good overall. Two blocking comments: one about the hardcoded paths, and another about reading the vision constants from the HF config rather than hardcoding them.
Note: The below template includes items meant for model contributions only. For other contributions such as bug fixes, features, etc., only fill out the relevant portions of the form.
Description
InternVL3-8B-Instruct VLM (vision-language model) for Neuron inference via the NeuronBaseForImageToText framework. Architecture: InternViT-300M vision encoder + pixel shuffle MLP projector + Qwen2.5-7B text backbone (~8B total parameters, BF16).
Validated on trn2.3xlarge (LNC=2, TP=4) with logit_validation passing, 75.1 tok/s text generation, and end-to-end multimodal inference.
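To illustrate the idea behind tolerance-based logit checks against a higher-precision reference, here is a plain-Python sketch that compares logits at the reference's top-k token ids, with a separate (atol, rtol) pair per tier. This is not the torch_neuronx implementation or its API; the function names, the `tol_map` layout, and all numeric values are made up for the example.

```python
from typing import Dict, List, Sequence, Tuple


def top_k_indices(logits: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest logits, highest first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def tiered_logit_check(
    expected: Sequence[float],
    actual: Sequence[float],
    tol_map: Dict[int, Tuple[float, float]],
) -> Dict[int, bool]:
    """Check each top-k tier with its own absolute and relative tolerance.

    For every tier k, the comparison covers the k highest-ranked token ids of
    the reference and passes if |actual - expected| <= atol + rtol * |expected|.
    """
    results = {}
    for k, (atol, rtol) in tol_map.items():
        idx = top_k_indices(expected, k)
        results[k] = all(
            abs(actual[i] - expected[i]) <= atol + rtol * abs(expected[i])
            for i in idx
        )
    return results


# Toy vocabulary of 4 tokens: tight tolerance on the top 2, looser on all 4.
reference = [10.0, 9.0, 1.0, -5.0]   # stands in for CPU FP32 logits
candidate = [10.02, 8.97, 1.3, -5.6]  # stands in for Neuron BF16 logits
print(tiered_logit_check(reference, candidate, {2: (0.05, 0.0), 4: (0.5, 0.05)}))
# {2: True, 4: True}
```

Tight tolerances on the highest-ranked tokens protect the tokens that actually get sampled, while looser tolerances on the tail absorb reduced-precision rounding noise.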
Model Information
Model Name: InternVL3-8B-Instruct
Model Architecture: Vision-language model (InternViT-300M encoder + Qwen2.5-7B decoder)
Purpose: Multimodal text generation (image-to-text, visual question answering, text-only chat)
Checklist
Please ensure your PR includes the following items. Refer to the contrib/CONTRIBUTING.md for detailed guidelines.
Required Components
- Accuracy Test (ex. test/integration/test_model.py)
  - torch_neuronx.testing.validation.logit_validation
  - generate_expected_logits, compares against Neuron BF16 output
- README.md with the following sections:
  - pytest test/integration/test_model.py -v --tb=short
- Source Code (src/)
  - modeling_internvl3.py: Top-level VLM (NeuronBaseForImageToText)
  - modeling_internvl3_text.py: Text backbone (Qwen2.5-7B with vision embedding injection)
  - modeling_internvl3_vision.py: Vision encoder (InternViT-300M, torch_neuronx.trace)
Optional Components
- test/unit/__init__.py exists (placeholder for future tests)
Folder Structure
Confirm your contribution follows this structure:
Testing
How did you test this change?
All tests run on trn2.3xlarge (LNC=2, TP=4) with Neuron SDK 2.29.
- compile_internvl3_vlm.py (text CTE+TKG + vision encoder NEFFs)
- test_model.py runs:
  - test_config: Validates InternVL3 config matches expected Qwen2.5-7B architecture (7 assertions)
  - test_text_logit_validation: CPU FP32 reference logits (16 tokens) compared against Neuron BF16 via logit_validation() with per-tier tolerances
Test Results:
Compatibility
Tested with:
Additional Information
torch_neuronx.trace() with --auto-cast=matmult -O1. 34.5 ms per 448x448 tile.
Related Issues
N/A
vLLM Integration
vLLM integration requires patches to vllm-neuron's model loader to register the InternVL3 architecture. See README for details.
By submitting this PR, I confirm that: