
Parallelize chunked Parakeet batch transcription #507

Open

hamzaq2000 wants to merge 1 commit into FluidInference:main from hamzaq2000:main

Conversation


@hamzaq2000 hamzaq2000 commented Apr 9, 2026

Why is this change needed?

This PR speeds up Parakeet batch transcription for long audio by roughly 2.2-2.8x by parallelizing the existing stateless chunked path. It doesn't change the streaming/live transcription path.

It adds a configurable parallelChunkConcurrency setting to ASRConfig, lets AsrManager create worker clones from already-loaded AsrModels, and updates ChunkProcessor to send independent chunks across that worker pool before merging the results with the existing merge logic.

The important part is that the decoding behavior for each chunk stays the same. The patch is really about scheduling chunk work in parallel so the runtime can keep more hardware busy and improve throughput on longer files.
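The scheduling idea can be sketched roughly as follows. This is an illustration in Python, not the PR's Swift implementation; `transcribe_chunk` and `merge` are hypothetical stand-ins for the existing per-chunk decode and merge logic:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the existing per-chunk decoder: each chunk
# is decoded independently, so chunks can run in any order as long as
# results are merged back in chunk order.
def transcribe_chunk(chunk):
    return f"<text for {chunk}>"

# Hypothetical stand-in for the existing merge logic.
def merge(results):
    return " ".join(results)

def transcribe_chunks_parallel(chunks, parallel_chunk_concurrency=4):
    # Worker pool bounded by the configured concurrency. executor.map
    # yields results in submission order, so the merge step sees the
    # same sequence the serial path would have produced.
    with ThreadPoolExecutor(max_workers=parallel_chunk_concurrency) as pool:
        results = list(pool.map(transcribe_chunk, chunks))
    return merge(results)
```

Because only the scheduling changes and the merge still consumes results in chunk order, the output should be byte-identical to the serial path.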

Validation

Benchmarked on an Apple M3, using a 16 kHz 16-bit mono WAV file downloaded from this video (~1 hour duration), with 5 runs each for current upstream vs. the PR branch.

| Model | Upstream Avg Time | PR Branch Avg Time | Speedup | Upstream Avg Peak Mem | PR Branch Avg Peak Mem | Delta |
| --- | --- | --- | --- | --- | --- | --- |
| Parakeet v2 | 31.84 s | 11.25 s | 2.83x | 515.9 MiB | 537.4 MiB | +21.4 MiB |
| Parakeet v3 | 31.37 s | 12.75 s | 2.46x | 496.0 MiB | 527.0 MiB | +31.0 MiB |
| Parakeet tdt-ctc-110m | 19.89 s | 9.08 s | 2.19x | 489.6 MiB | 509.2 MiB | +19.7 MiB |

I compared the resulting transcripts and word timings before and after this change for v2, v3, and tdt-ctc-110m, and found no differences. So based on this one test file at least, the optimization appears safe.

Peak memory footprint was measured with macOS /usr/bin/time -lp. While peak memory does increase, the increase is modest relative to the speedup, so I think it's reasonable to keep parallelChunkConcurrency at a default of 4 rather than make parallelism opt-in.

Optimal parallelChunkConcurrency Value

A default value of 4 for the chunk parallelism was chosen because higher values yielded little to no extra speedup, while lower values still left speed on the table, at least on the two devices I tested: an iPhone SE 3 and an M3 MacBook Air.
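One common way to keep a fixed default like 4 sane on smaller devices is to clamp it to the available cores at runtime. This is a hypothetical guard, not something the PR necessarily does; `effective_concurrency` is an illustrative name:

```python
import os

def effective_concurrency(configured=4):
    # Clamp the configured chunk concurrency to the number of available
    # CPU cores, and never drop below 1 so the chunked path still runs.
    cores = os.cpu_count() or 1
    return max(1, min(configured, cores))
```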

AI Disclosure

OpenAI Codex was used to write the code for this patch.



Contributor

@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.


