
fix: V1 protocol support for multi-instance requests on unbatched models #1

Draft

larmoreg wants to merge 1 commit into main from fix/v1-unbatched-multi-instance


Conversation

@larmoreg
Owner

Summary

Adds full KServe V1 protocol support to Triton's HTTP server, fixing multi-instance request handling for models with max_batch_size=0 (e.g., ensembles).

Problem

Vertex AI Batch Prediction sends requests in V1 instances format ({"instances": [...]}). When multiple instances are batched together and sent to an ensemble model with max_batch_size=0, the V1 handler constructs the input tensor shape incorrectly: it prepends a batch dimension, producing shape [N, 1], instead of concatenating the instances along the first dimension to produce shape [N].

This breaks Vertex AI Batch Prediction with batchSize > 1 for any ensemble pipeline.
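To make the shape handling concrete, here is a minimal numpy sketch of the two behaviors. It is illustrative only: the actual fix lives in the HTTP server's request construction, and the tensor values here are made up.

```python
import numpy as np

# Two V1 instances, each carrying a single element of shape [1].
instances = [np.array(["hello"]), np.array(["world"])]

# Buggy handling: prepend a new batch dimension -> shape (2, 1).
# A model with max_batch_size=0 declares its input rank explicitly
# (e.g. [-1]), so the extra dimension breaks shape validation.
wrong = np.stack(instances)        # shape (2, 1)

# Fixed handling: concatenate along the first dimension -> shape (2,).
right = np.concatenate(instances)  # shape (2,)

print(wrong.shape, right.shape)    # (2, 1) (2,)
```

The distinction only matters for max_batch_size=0 models; batched models expect the extra batch dimension, so their path is unchanged.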

Changes

  • V1 predict endpoint (/v1/models/{model}:predict): Parse V1 instances format, convert to Triton inference requests, and return V1 predictions format responses (see the client sketch after this list).

  • Base64 binary encoding: Support {"b64": "..."} encoding for TYPE_BYTES inputs/outputs per TF Serving V1 protocol. Enables binary data (images, serialized tensors) through JSON.

  • Multi-instance shape fix: For max_batch_size=0 models, concatenate instance data along the first dimension instead of prepending a new batch dimension.

  • Vertex AI integration: Route /v1/models/* requests through the Vertex AI server handler.

  • Tests: V1 protocol integration tests for the Vertex AI endpoint.
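As a usage illustration, here is a hedged Python sketch of a V1 predict call with two instances and base64-wrapped binary input. The endpoint path follows this PR; the host, model name, and input tensor name (image_bytes) are hypothetical, and the payload bytes are placeholders.

```python
import base64
import requests

TRITON = "http://localhost:8000"  # hypothetical local Triton endpoint
MODEL = "my_ensemble"             # hypothetical ensemble model name

def b64(value: bytes) -> dict:
    # TYPE_BYTES values are wrapped as {"b64": "..."} per TF Serving V1.
    return {"b64": base64.b64encode(value).decode("ascii")}

# Two instances: with the fix, a max_batch_size=0 ensemble receives
# these concatenated along the first dimension (shape [2]).
payload = {
    "instances": [
        {"image_bytes": b64(b"\x89PNG...first image bytes...")},
        {"image_bytes": b64(b"\x89PNG...second image bytes...")},
    ]
}

resp = requests.post(f"{TRITON}/v1/models/{MODEL}:predict", json=payload)
resp.raise_for_status()

# V1 responses mirror the request: one entry per instance.
print(resp.json()["predictions"])
```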

Tested

  • Single-instance V1 predict with binary (base64) and numeric inputs
  • Multi-instance V1 predict (batchSize=2) through Vertex AI Batch Prediction
  • Ensemble models with max_batch_size=0 containing sub-models with max_batch_size >= N

larmoreg force-pushed the fix/v1-unbatched-multi-instance branch from 7219754 to 680cd03 on May 11, 2026 at 16:53
Add full KServe V1 (predict/explain) protocol support to the HTTP server,
fixing multi-instance request handling for models with max_batch_size=0
(e.g. ensembles).

Key changes:

- V1 predict endpoint (/v1/models/{model}:predict): Parse V1 instances
  format, convert to Triton inference requests, and return V1 predictions
  format responses.

- Base64 binary encoding: Support {"b64": "..."} encoding for
  TYPE_BYTES inputs/outputs per TF Serving V1 protocol. This enables
  binary data (images, serialized tensors) to be passed through JSON.

- Multi-instance shape fix: For models with max_batch_size=0, concatenate
  instance data along the first dimension instead of prepending a new
  batch dimension. Shape [1] x N becomes [N], not [N, 1].

- Vertex AI integration: Route /v1/models/* requests through the Vertex
  AI server handler for predict/health routes.

- Tests: Add V1 protocol integration tests for the Vertex AI endpoint.

This fixes a blocker for Vertex AI Batch Prediction, which sends requests
in V1 instances format and requires max_batch_size=0 ensembles to handle
multi-instance batches correctly.
larmoreg force-pushed the fix/v1-unbatched-multi-instance branch from 680cd03 to 80f5220 on May 11, 2026 at 17:22