Add live audio transcription streaming support to Foundry Local Python SDK#612
Conversation
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds end-to-end live audio (PCM chunk) streaming transcription to the Foundry Local Python SDK, including session lifecycle management, native interop support for binary payloads, and tests/samples to validate Windows DLL loading and Nemotron ASR streaming.
Changes:
- Introduces `LiveAudioTranscriptionSession` + supporting response/options/error types for streaming microphone-style PCM input.
- Extends `CoreInterop` with a `StreamingRequestBuffer` and `execute_command_with_binary()` to push raw audio to native core.
- Adds unit + E2E coverage and a sample app, including Windows DLL preload workarounds for brotli/LoadLibrary behavior.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| sdk/python/src/openai/live_audio_transcription_client.py | Implements the streaming session (start/append/stream/stop) and background push loop. |
| sdk/python/src/openai/live_audio_transcription_types.py | Adds response/options/error DTOs and JSON parsing helpers. |
| sdk/python/src/detail/core_interop.py | Adds binary-command execution path and Windows DLL loading hardening for ORT/GenAI. |
| sdk/python/src/openai/audio_client.py | Adds factory method to create the live transcription session. |
| sdk/python/src/openai/__init__.py | Exports new session and types from the openai package surface. |
| sdk/python/test/openai/test_live_audio_transcription.py | Unit tests for parsing/options/state guards and mocked streaming behavior. |
| sdk/python/test/openai/test_live_audio_transcription_e2e.py | Windows-only E2E test exercising real native DLLs and nemotron model pipeline. |
| sdk/python/test/openai/conftest.py | Preloads ORT/GenAI DLLs for E2E to avoid brotli-related DLL search changes. |
| sdk/python/test/conftest.py | Preloads ORT/GenAI DLLs early in all tests to avoid Windows DLL search conflicts. |
| samples/python/live-audio-transcription/src/app.py | Demonstration app using PyAudio to stream microphone PCM into the session. |
    # execute_command_with_binary is required for audio streaming but may
    # not be present in older Core builds. Register it if available;
    # the method will raise AttributeError at call-time if missing.
The FL Core version has been updated, so this should always exist.
    self._started = True
    self._stopped = False

    # Start the push loop in a daemon thread
Do we need to use a daemon thread? Should we make this non-daemon and require stop() + join() instead? One of the features of FL Core that we have advertised is that it does not require a daemon.
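To make the alternative concrete, here is a minimal sketch of a non-daemon push loop that is shut down with `stop()` + `join()`. It is not the code in this PR: the `PushLoop` class, the `push_fn` callback, and the `max_queued_chunks` parameter are hypothetical names chosen for illustration, and the bounded `queue.Queue` is assumed from the PR description.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the push loop to exit


class PushLoop:
    """Hypothetical stand-in for the session's background audio push loop."""

    def __init__(self, push_fn, max_queued_chunks=100):
        self._push_fn = push_fn  # e.g. a wrapper around the native push call
        # Bounded queue: append() blocks when full, which gives backpressure.
        self._queue = queue.Queue(maxsize=max_queued_chunks)
        self._thread = None

    def start(self):
        # daemon=False: the interpreter waits for this thread at exit,
        # so callers are required to call stop().
        self._thread = threading.Thread(
            target=self._run, name="audio-push", daemon=False
        )
        self._thread.start()

    def append(self, pcm):
        # bytes() snapshots mutable buffers (e.g. bytearray), so the caller
        # may reuse its buffer right after this call returns.
        self._queue.put(bytes(pcm))

    def stop(self, timeout=None):
        if self._thread is None:
            return
        self._queue.put(_STOP)  # queued after all pending audio
        self._thread.join(timeout)
        if self._thread.is_alive():
            raise TimeoutError("push loop did not drain in time")

    def _run(self):
        while True:
            item = self._queue.get()
            if item is _STOP:
                return
            self._push_fn(item)
```

Because the sentinel is enqueued behind any pending audio, `stop()` drains everything that was appended before it returns. The trade-off is that a caller who forgets `stop()` will keep the process alive, which is the behavior a daemon thread hides.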
    """Locate the e2e-test-pkgs directory."""
    current = Path(__file__).resolve().parent
    while True:
        candidate = current / "samples" / "python" / "e2e-test-pkgs"
This looks like a similar function to what is inside conftest.py. Can either this or conftest.py be removed then?
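If the two copies are merged, a single shared helper could look roughly like the sketch below. The function name `find_e2e_test_pkgs` and its `start` parameter are hypothetical; only the `samples/python/e2e-test-pkgs` path and the error message are taken from the diff.

```python
from pathlib import Path


def find_e2e_test_pkgs(start: Path) -> Path:
    """Walk up from `start` until <ancestor>/samples/python/e2e-test-pkgs exists."""
    current = start.resolve()
    # Check the starting directory first, then each parent up to the root.
    for base in (current, *current.parents):
        candidate = base / "samples" / "python" / "e2e-test-pkgs"
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError("e2e-test-pkgs not found")
```

Both `conftest.py` and the E2E test could then call `find_e2e_test_pkgs(Path(__file__).parent)` instead of each carrying its own loop.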
        raise FileNotFoundError("e2e-test-pkgs not found")


    if sys.platform.startswith("win"):
Given there already exists a conftest.py file, can we move any modifications to that file and remove this one?
    def _preload_and_init():
        """Pre-load DLLs from e2e-test-pkgs and initialize the SDK.
Can we move this function into conftest.py instead so that the E2E unit tests are cleaner to look at?
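As a rough idea of what the preload half of such a shared helper might look like once it lives in `conftest.py`, here is a sketch using only `ctypes`. The function name `preload_native_dlls` and the default DLL file names are assumptions for illustration, not taken from this PR; the `LOAD_WITH_ALTERED_SEARCH_PATH` flag is the one mentioned in the PR description.

```python
import ctypes
import sys
from pathlib import Path

# Win32 LoadLibraryEx flag: with an absolute path, the DLL's own directory
# is searched first when resolving its dependencies.
LOAD_WITH_ALTERED_SEARCH_PATH = 0x00000008


def preload_native_dlls(pkg_dir, dll_names=("onnxruntime.dll", "onnxruntime-genai.dll")):
    """Load each DLL found in pkg_dir by absolute path and return the handles.

    Returns an empty list on non-Windows platforms or when no DLL is present.
    The default dll_names are assumed file names, not confirmed by the PR.
    """
    if not sys.platform.startswith("win"):
        return []
    handles = []
    for name in dll_names:
        path = Path(pkg_dir).resolve() / name
        if path.is_file():
            # winmode is forwarded as the LoadLibraryEx flags (Python 3.8+).
            handles.append(
                ctypes.WinDLL(str(path), winmode=LOAD_WITH_ALTERED_SEARCH_PATH)
            )
    return handles
```

Keeping the returned handles alive for the test session would ensure that later imports resolve against the already-loaded copies rather than stale system-level ones.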
Description
Adds real-time audio streaming support to the Foundry Local Python SDK, enabling live microphone-to-text transcription via ONNX Runtime GenAI's StreamingProcessor API (Nemotron ASR).
This is the Python port of C# PR #485 with full feature parity. The existing `AudioClient` only supports file-based transcription. This PR introduces `LiveAudioTranscriptionSession` that accepts continuous PCM audio chunks (e.g., from a microphone) and returns partial/final transcription results as a synchronous generator.
What's included
New files
- `src/openai/live_audio_transcription_client.py`: Streaming session with `start()`, `append()`, `get_transcription_stream()`, `stop()`
- `src/openai/live_audio_transcription_types.py`: `LiveAudioTranscriptionResponse` (ConversationItem-shaped), `LiveAudioTranscriptionOptions`, `CoreErrorResponse`, `TranscriptionContentPart`
- `test/openai/test_live_audio_transcription.py`: 22 unit tests for deserialization, settings, state guards, streaming pipeline
- `test/openai/test_live_audio_transcription_e2e.py`: E2E test with real native DLLs and nemotron model
- `test/openai/conftest.py`: DLL preload for E2E tests
- `samples/python/live-audio-transcription/src/app.py`: Live microphone transcription demo
Modified files
- `src/openai/audio_client.py`: Added `create_live_transcription_session()` factory method
- `src/detail/core_interop.py`: Added `StreamingRequestBuffer` struct, `execute_command_with_binary()`, `start_audio_stream`, `push_audio_data`, `stop_audio_stream` methods, and `_load_dll_win()` for robust DLL loading on Windows
- `src/openai/__init__.py`: Exported new live transcription types
- `test/conftest.py`: Pre-load ORT/GenAI DLLs before brotli import to avoid Windows DLL search conflicts
API surface
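The call sequence implied by the method names in this PR (`start()`, `append()`, `get_transcription_stream()`, `stop()`) can be sketched as follows. `FakeLiveSession` and `transcribe` are hypothetical stand-ins written for this example, not SDK classes: the fake "transcribes" each chunk to a byte-count string so the flow can run without a model or native DLLs, and the real session's behavior may differ.

```python
import queue


class FakeLiveSession:
    """Hypothetical stand-in exposing the method names listed in this PR.

    It is NOT the SDK class; results are fabricated from chunk lengths.
    """

    def __init__(self):
        self._results = queue.Queue()
        self._started = False

    def start(self):
        self._started = True

    def append(self, pcm: bytes):
        if not self._started:
            raise RuntimeError("call start() before append()")
        self._results.put(f"<{len(pcm)} bytes>")

    def stop(self):
        self._started = False
        self._results.put(None)  # end-of-stream marker

    def get_transcription_stream(self):
        # Synchronous generator: blocks until a result or the end marker arrives.
        while (item := self._results.get()) is not None:
            yield item


def transcribe(session, chunks):
    """Drive a session: start, append every PCM chunk, stop, collect results."""
    session.start()
    for chunk in chunks:
        session.append(chunk)
    session.stop()
    return list(session.get_transcription_stream())
```

For example, `transcribe(FakeLiveSession(), [b"\x00" * 320])` returns `["<320 bytes>"]`. In a real application the stream would more likely be consumed on one thread while a microphone callback calls `append()` on another.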
C# parity
| C# | Python |
|---|---|
| `CreateLiveTranscriptionSession()` | `create_live_transcription_session()` |
| `StartAsync(ct)` | `start()` |
| `AppendAsync(ReadOnlyMemory<byte>, ct)` | `append(bytes)` |
| `GetTranscriptionStream()` | `get_transcription_stream()` |
| `StopAsync(ct)` | `stop()` |
| `IAsyncDisposable` | Context manager (`with`) |
| `LiveAudioTranscriptionOptions` | `LiveAudioTranscriptionOptions` |
| `LiveAudioTranscriptionResponse` | `LiveAudioTranscriptionResponse` |
Design highlights
- `LiveAudioTranscriptionResponse` uses the OpenAI Realtime `ConversationItem` shape (`content[0].text`/`transcript`) for forward compatibility
- `queue.Queue` serializes audio pushes from any thread (safe for mic callbacks) with backpressure
- Settings are frozen at `start()` and immutable during the session
- `append()` copies input data to avoid issues with callers reusing buffers (e.g., PyAudio)
- `start_audio_stream` and `stop_audio_stream` route through `execute_command`; `push_audio_data` routes through `execute_command_with_binary` (no new native entry points required)
- `LoadLibraryExW` with `LOAD_WITH_ALTERED_SEARCH_PATH` on Windows to prevent conflicts with stale system-level ORT DLLs
Verified working