
fix: V1 protocol support for multi-instance requests on unbatched models #1

Draft

larmoreg wants to merge 1 commit into main from fix/v1-unbatched-multi-instance


Conversation

@larmoreg
Owner

Summary

Adds full KServe V1 protocol support to Triton's HTTP server, fixing multi-instance request handling for models with max_batch_size=0 (e.g., ensembles).

Problem

Vertex AI Batch Prediction sends requests in V1 instances format ({"instances": [...]}). When multiple instances are batched together and sent to an ensemble model with max_batch_size=0, the V1 handler constructs the input tensor shape incorrectly: it prepends a batch dimension, producing shape [N, 1], instead of concatenating the instances along the first dimension to produce shape [N].

This breaks Vertex AI Batch Prediction with batchSize > 1 for any ensemble pipeline.
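To make the shape handling concrete, here is a minimal numpy sketch of the two behaviors. It is illustrative only: the actual fix lives in the HTTP server's request construction, and the tensor values here are made up.

```python
import numpy as np

# Two V1 instances, each carrying a single element of shape [1].
instances = [np.array(["hello"]), np.array(["world"])]

# Buggy handling: prepend a new batch dimension -> shape (2, 1).
# A model with max_batch_size=0 declares its input rank explicitly
# (e.g. [-1]), so the extra dimension breaks shape validation.
wrong = np.stack(instances)        # shape (2, 1)

# Fixed handling: concatenate along the first dimension -> shape (2,).
right = np.concatenate(instances)  # shape (2,)

print(wrong.shape, right.shape)    # (2, 1) (2,)
```

The distinction only matters for max_batch_size=0 models; batched models expect the extra batch dimension, so their path is unchanged.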

Changes

  • V1 predict endpoint (/v1/models/{model}:predict): Parse V1 instances format, convert to Triton inference requests, and return V1 predictions format responses (see the client sketch after this list).

  • Base64 binary encoding: Support {"b64": "..."} encoding for TYPE_BYTES inputs/outputs per TF Serving V1 protocol. Enables binary data (images, serialized tensors) through JSON.

  • Multi-instance shape fix: For max_batch_size=0 models, concatenate instance data along the first dimension instead of prepending a new batch dimension.

  • Vertex AI integration: Route /v1/models/* requests through the Vertex AI server handler.

  • Tests: V1 protocol integration tests for the Vertex AI endpoint.
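As a usage illustration, here is a hedged Python sketch of a V1 predict call with two instances and base64-wrapped binary input. The endpoint path follows this PR; the host, model name, and input tensor name (image_bytes) are hypothetical, and the payload bytes are placeholders.

```python
import base64
import requests

TRITON = "http://localhost:8000"  # hypothetical local Triton endpoint
MODEL = "my_ensemble"             # hypothetical ensemble model name

def b64(value: bytes) -> dict:
    # TYPE_BYTES values are wrapped as {"b64": "..."} per TF Serving V1.
    return {"b64": base64.b64encode(value).decode("ascii")}

# Two instances: with the fix, a max_batch_size=0 ensemble receives
# these concatenated along the first dimension (shape [2]).
payload = {
    "instances": [
        {"image_bytes": b64(b"\x89PNG...first image bytes...")},
        {"image_bytes": b64(b"\x89PNG...second image bytes...")},
    ]
}

resp = requests.post(f"{TRITON}/v1/models/{MODEL}:predict", json=payload)
resp.raise_for_status()

# V1 responses mirror the request: one entry per instance.
print(resp.json()["predictions"])
```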

Tested

  • Single-instance V1 predict with binary (base64) and numeric inputs
  • Multi-instance V1 predict (batchSize=2) through Vertex AI Batch Prediction
  • Ensemble models with max_batch_size=0 containing sub-models with max_batch_size >= N

larmoreg force-pushed the fix/v1-unbatched-multi-instance branch from 7219754 to 680cd03 on May 11, 2026 at 16:53
Add full KServe V1 (predict/explain) protocol support to the HTTP server,
fixing multi-instance request handling for models with max_batch_size=0
(e.g. ensembles).

Key changes:

- V1 predict endpoint (/v1/models/{model}:predict): Parse V1 instances
  format, convert to Triton inference requests, and return V1 predictions
  format responses.

- Base64 binary encoding: Support {"b64": "..."} encoding for
  TYPE_BYTES inputs/outputs per TF Serving V1 protocol. This enables
  binary data (images, serialized tensors) to be passed through JSON.

- Multi-instance shape fix: For models with max_batch_size=0, concatenate
  instance data along the first dimension instead of prepending a new
  batch dimension. Shape [1] x N becomes [N], not [N, 1].

- Vertex AI integration: Route /v1/models/* requests through the Vertex
  AI server handler for predict/health routes.

- Tests: Add V1 protocol integration tests for the Vertex AI endpoint.

This fixes a blocker for Vertex AI Batch Prediction, which sends requests
in V1 instances format and requires max_batch_size=0 ensembles to handle
multi-instance batches correctly.
larmoreg force-pushed the fix/v1-unbatched-multi-instance branch from 680cd03 to 80f5220 on May 11, 2026 at 17:22