
Parallelize chunked Parakeet batch transcription #507

Open

hamzaq2000 wants to merge 1 commit into FluidInference:main from hamzaq2000:main

Conversation


@hamzaq2000 hamzaq2000 commented Apr 9, 2026

Why is this change needed?

This PR speeds up Parakeet batch transcription for long audio by roughly 2.2-2.8x by parallelizing the existing stateless chunked path. It doesn't change the streaming/live transcription path.

It adds a configurable parallelChunkConcurrency setting to ASRConfig, lets AsrManager create worker clones from already-loaded AsrModels, and updates ChunkProcessor to send independent chunks across that worker pool before merging the results with the existing merge logic.

The important part is that the decoding behavior for each chunk stays the same. The patch is really about scheduling chunk work in parallel so the runtime can keep more hardware busy and improve throughput on longer files.
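The scheduling idea can be sketched roughly as follows. This is an illustration in Python, not the PR's Swift implementation; `transcribe_chunk` and `merge` are hypothetical stand-ins for the existing per-chunk decode and merge logic:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the existing per-chunk decoder: each chunk
# is decoded independently, so chunks can run in any order as long as
# results are merged back in chunk order.
def transcribe_chunk(chunk):
    return f"<text for {chunk}>"

# Hypothetical stand-in for the existing merge logic.
def merge(results):
    return " ".join(results)

def transcribe_chunks_parallel(chunks, parallel_chunk_concurrency=4):
    # Worker pool bounded by the configured concurrency. executor.map
    # yields results in submission order, so the merge step sees the
    # same sequence the serial path would have produced.
    with ThreadPoolExecutor(max_workers=parallel_chunk_concurrency) as pool:
        results = list(pool.map(transcribe_chunk, chunks))
    return merge(results)
```

Because only the scheduling changes and the merge still consumes results in chunk order, the output should be byte-identical to the serial path.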

Validation

Benchmarked on an Apple M3, using a 16 kHz 16-bit mono WAV file downloaded from this video (~1 hour duration), with 5 runs each for current upstream vs. the PR branch.

| Model | Upstream Avg Time | PR Branch Avg Time | Speedup | Upstream Avg Peak Mem | PR Branch Avg Peak Mem | Delta |
| --- | --- | --- | --- | --- | --- | --- |
| Parakeet v2 | 31.84 s | 11.25 s | 2.83x | 515.9 MiB | 537.4 MiB | +21.4 MiB |
| Parakeet v3 | 31.37 s | 12.75 s | 2.46x | 496.0 MiB | 527.0 MiB | +31.0 MiB |
| Parakeet tdt-ctc-110m | 19.89 s | 9.08 s | 2.19x | 489.6 MiB | 509.2 MiB | +19.7 MiB |

I compared the resulting transcripts and word timings before and after this change for v2, v3, and tdt-ctc-110m, and found no differences. So based on this one test file at least, the optimization appears safe.

Peak memory footprint was measured with macOS /usr/bin/time -lp. While peak memory does increase, the increase is modest relative to the speedup, so I think it's reasonable to keep parallelChunkConcurrency at a default of 4 rather than make parallelism opt-in.

Optimal parallelChunkConcurrency Value

A default value of 4 for the chunk parallelism was chosen because higher values yielded little to no extra speedup, while lower values still left speed on the table, at least on the two devices I tested: an iPhone SE 3 and an M3 MacBook Air.
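One common way to keep a fixed default like 4 sane on smaller devices is to clamp it to the available cores at runtime. This is a hypothetical guard, not something the PR necessarily does; `effective_concurrency` is an illustrative name:

```python
import os

def effective_concurrency(configured=4):
    # Clamp the configured chunk concurrency to the number of available
    # CPU cores, and never drop below 1 so the chunked path still runs.
    cores = os.cpu_count() or 1
    return max(1, min(configured, cores))
```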

AI Disclosure

OpenAI Codex was used to write the code for this patch.



Contributor

@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.


