Parallelize chunked Parakeet batch transcription #507
Open

hamzaq2000 wants to merge 1 commit into FluidInference:main
Why is this change needed?
This PR speeds up Parakeet batch transcription for long audio by roughly 2.2-2.8x by parallelizing the existing stateless chunked path. It does not change the streaming/live transcription path.

It adds a configurable `parallelChunkConcurrency` setting to `ASRConfig`, lets `AsrManager` create worker clones from already-loaded `AsrModels`, and updates `ChunkProcessor` to send independent chunks across that worker pool before merging the results with the existing merge logic. The important part is that the decoding behavior for each chunk stays the same: the patch is really about scheduling chunk work in parallel so the runtime can keep more hardware busy and improve throughput on longer files.
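The scheduling idea can be sketched in a few lines. This is an illustrative Python sketch of the pattern, not the project's Swift code; `transcribe_chunk` and the constant name are hypothetical stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor

PARALLEL_CHUNK_CONCURRENCY = 4  # mirrors the proposed default; name is illustrative


def transcribe_chunk(chunk: str) -> str:
    """Hypothetical stand-in for decoding one chunk; per-chunk behavior is unchanged."""
    return chunk.upper()


def transcribe_chunks(chunks, concurrency=PARALLEL_CHUNK_CONCURRENCY):
    # Chunks are independent on the stateless path, so they can be decoded
    # on a bounded worker pool. pool.map yields results in input order, so
    # the existing merge step sees the same ordering as the sequential path.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(transcribe_chunk, chunks))
    return " ".join(results)  # stand-in for the real merge logic
```

Because results come back in input order, the merge logic does not need to change; only the scheduling of chunk work does.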
Validation
Benchmarked on an Apple M3, using a 16 kHz, 16-bit mono WAV file downloaded from this video (~1 hour duration), with 5 runs each for the current upstream vs. the PR branch.
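The choice of default concurrency (discussed below) came from timing the same workload at several concurrency levels. A sweep like that can be sketched as follows; this is an illustrative Python sketch, not the actual benchmark harness, and `worker` is a hypothetical stand-in for the per-chunk transcription call:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def sweep_concurrency(work_items, worker, candidates=(1, 2, 4, 8)):
    # Time the same batch of independent work items at each concurrency
    # level; the smallest level past which wall-clock time stops improving
    # is a reasonable default.
    items = list(work_items)  # materialize once so every run sees the same batch
    timings = {}
    for n in candidates:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            list(pool.map(worker, items))  # drain the pool before stopping the clock
        timings[n] = time.perf_counter() - start
    return timings
```

In practice one would also average over several runs per level, as was done for the numbers reported here.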
I compared the resulting transcripts and word timings before and after this change for v2, v3, and `tdt-ctc-110m`, and found no differences. So, based on this one test file at least, the optimization appears safe.

Peak memory footprint was measured with macOS `/usr/bin/time -lp`. While it does increase, the measured increase is modest relative to the speedup, so I think it's reasonable to keep `parallelChunkConcurrency` set to `4` by default rather than make it opt-in.

parallelChunkConcurrency Optimal Value

A default value of `4` for the chunk parallelism was chosen because higher values yielded little to no extra speedup and lower values still left speed on the table, at least on the two devices I tested: an iPhone SE 3 and an M3 MacBook Air.

AI Disclosure
OpenAI Codex was used to write the code for this patch.