fix: V1 protocol support for multi-instance requests on unbatched models #1
Draft · larmoreg wants to merge 1 commit
Add full KServe V1 (predict/explain) protocol support to the HTTP server,
fixing multi-instance request handling for models with max_batch_size=0
(e.g. ensembles).
Key changes:
- V1 predict endpoint (/v1/models/{model}:predict): Parse V1 instances
format, convert to Triton inference requests, and return V1 predictions
format responses.
- Base64 binary encoding: Support {"b64": "..."} encoding for
TYPE_BYTES inputs/outputs per TF Serving V1 protocol. This enables
binary data (images, serialized tensors) to be passed through JSON
(see the encoding sketch after this message).
- Multi-instance shape fix: For models with max_batch_size=0, concatenate
instance data along the first dimension instead of prepending a new
batch dimension. Shape [1] x N becomes [N], not [N, 1].
- Vertex AI integration: Route /v1/models/* requests through the Vertex
AI server handler for predict/health routes.
- Tests: Add V1 protocol integration tests for the Vertex AI endpoint.
This fixes a blocker for Vertex AI Batch Prediction, which sends requests
in V1 instances format and requires max_batch_size=0 ensembles to handle
multi-instance batches correctly.
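As a hedged illustration of the {"b64": "..."} convention (client-side Python, not the server's C++ implementation):

```python
import base64
import json

# Client side: wrap raw bytes (e.g. an image) as {"b64": "..."} for a
# TYPE_BYTES input, following the TF Serving V1 convention.
raw = b"\x89PNG\r\n\x1a\n..."  # placeholder image bytes
body = json.dumps(
    {"instances": [{"b64": base64.b64encode(raw).decode("ascii")}]}
)

# Server side: any {"b64": "..."} object is decoded back to raw bytes
# before the TYPE_BYTES tensor is built.
decoded = base64.b64decode(json.loads(body)["instances"][0]["b64"])
assert decoded == raw
```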
Summary
Adds full KServe V1 protocol support to Triton's HTTP server, fixing multi-instance request handling for models with max_batch_size=0 (ensembles).

Problem
Vertex AI Batch Prediction sends requests in V1 instances format ({"instances": [...]}). When multiple instances are batched together and sent to an ensemble model with max_batch_size=0, the V1 handler incorrectly constructs input tensor shapes, prepending a batch dimension [N, 1] instead of concatenating along the first dimension [N]. This breaks Vertex AI Batch Prediction with batchSize > 1 for any ensemble pipeline.
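A minimal NumPy sketch of the fixed assembly logic (illustrative Python, not the server's C++ code; assemble_input is a hypothetical name):

```python
import numpy as np

def assemble_input(instances, max_batch_size):
    """Combine N per-instance tensors into a single model input."""
    if max_batch_size == 0:
        # Unbatched model (e.g. an ensemble): concatenate along the
        # first dimension, so N instances of shape [1] become [N].
        return np.concatenate(instances, axis=0)
    # Batched model: stack to prepend a batch dimension,
    # so N instances of shape [1] become [N, 1].
    return np.stack(instances, axis=0)

parts = [np.array([float(i)]) for i in range(3)]  # N = 3, each shape [1]
assert assemble_input(parts, 0).shape == (3,)     # fixed unbatched behavior
assert assemble_input(parts, 8).shape == (3, 1)   # batched models unchanged
```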
Changes

- V1 predict endpoint (/v1/models/{model}:predict): Parse V1 instances format, convert to Triton inference requests, return V1 predictions format responses (example call after this list).
- Base64 binary encoding: Support {"b64": "..."} encoding for TYPE_BYTES inputs/outputs per TF Serving V1 protocol. Enables binary data (images, serialized tensors) through JSON.
- Multi-instance shape fix: For max_batch_size=0 models, concatenate instance data along the first dimension instead of prepending a new batch dimension.
- Vertex AI integration: Route /v1/models/* requests through the Vertex AI server handler.
- Tests: V1 protocol integration tests for the Vertex AI endpoint.
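A sketch of calling the new endpoint from a client, assuming a locally running server (host, port, and model name are placeholders):

```python
import requests

# Placeholder address and model name; the route shape follows the
# V1 predict endpoint described above.
resp = requests.post(
    "http://localhost:8000/v1/models/my_ensemble:predict",
    json={"instances": [[0.1], [0.2], [0.3]]},
)
resp.raise_for_status()
print(resp.json())  # expected form: {"predictions": [...]}
```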
Tested

- Ensemble models with max_batch_size=0 containing sub-models with max_batch_size >= N
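An integration-style check along the lines of the added tests might look like this (a hypothetical sketch; the server address and model name are placeholders):

```python
import requests

def test_v1_multi_instance_predict():
    # N = 4 instances against an unbatched (max_batch_size=0) ensemble.
    body = {"instances": [[1.0], [2.0], [3.0], [4.0]]}
    resp = requests.post(
        "http://localhost:8000/v1/models/my_ensemble:predict", json=body
    )
    assert resp.status_code == 200
    assert len(resp.json()["predictions"]) == 4  # one prediction per instance
```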