Auto-Caption Generation: Whisper + FFmpeg in a Node.js Worker

By Sylvester Das
April 2, 2026
In today's digital landscape, video consumption habits have shifted dramatically. A staggering 85% of social media videos are watched without sound. If your video content isn't captioned, you're not just missing out on accessibility; you're losing a vast segment of your audience and delivering an incomplete experience. Auto-captioning is no longer a luxury; it's a necessity.
This post dives into building a robust, automated caption generation pipeline using cutting-edge tools: OpenAI's Whisper for highly accurate speech-to-text transcription, and FFmpeg for all video and audio manipulation. To handle the computational demands efficiently, we'll orchestrate these operations within a Node.js worker process, ensuring your main application thread remains responsive.
Audio Extraction with FFmpeg
The first step in transcribing any video is to isolate its audio track. FFmpeg is the go-to tool for this, offering powerful and flexible media processing capabilities. We can extract the audio into a common format like WAV or MP3, which Whisper can then process.
ffmpeg -i input.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 output.wav
This command extracts the audio from input.mp4, removes the video stream (-vn), sets the audio codec to uncompressed PCM (-acodec pcm_s16le), resamples to 16kHz (-ar 16000), and converts it to mono (-ac 1), which are optimal parameters for Whisper.
Transcription with Whisper
Once we have the audio, OpenAI's Whisper model takes center stage. Whisper is a general-purpose speech recognition model trained on a vast dataset of diverse audio, making it exceptionally good at handling various languages, accents, and technical jargon. It can directly output timed transcripts in formats like VTT or SRT, which are crucial for displaying captions.
Integrating Whisper into a Node.js environment typically involves using a pre-compiled binary or a wrapper library. For local execution or within a worker, you might call the whisper.cpp executable directly or use a Node.js binding.
./main -m models/ggml-base.en.bin -l en -ovtt -of output -f output.wav
This command transcribes output.wav in English (-l en) using the specified model (-m) and writes a VTT file (-ovtt) named output.vtt (-of output sets the output base name). The VTT file contains the text along with precise start and end timestamps for each caption segment. Accuracy is key here, and Whisper often outperforms many cloud-based alternatives, especially for nuanced speech.
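Back in Node.js, it is often useful to read the generated cues, for example to let users review or edit caption text before burning it in. A minimal VTT cue parser (an illustrative sketch, assuming plain cues without styling or NOTE blocks):

```javascript
// Parse a WebVTT string into { start, end, text } cue objects.
function parseVtt(vtt) {
  const cues = [];
  // Cues are separated by blank lines.
  for (const block of vtt.split(/\n\n+/)) {
    const lines = block.trim().split('\n');
    const timeIdx = lines.findIndex((line) => line.includes('-->'));
    if (timeIdx === -1) continue; // skip the WEBVTT header and non-cue blocks
    const [start, end] = lines[timeIdx].split('-->').map((s) => s.trim());
    cues.push({ start, end, text: lines.slice(timeIdx + 1).join(' ') });
  }
  return cues;
}
```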
Burning Captions with FFmpeg
Having a VTT file is great for soft subtitles, but for many platforms, especially short-form video, burning the captions directly into the video stream (hardsubbing) is preferred to ensure universal display. FFmpeg can also handle this using its subtitles filter.
ffmpeg -i input.mp4 -vf "subtitles=output.vtt:force_style='Fontname=Arial,FontSize=24,PrimaryColour=&H00FFFFFF,OutlineColour=&H00000000,BorderStyle=1,Outline=1,Shadow=0'" -c:v libx264 -crf 23 -preset medium -c:a copy output_captioned.mp4
This command takes input.mp4, applies the subtitles from output.vtt (-vf "subtitles=..."), and allows for extensive styling. You can customize font, size, color, outline, and background to match your brand. The video is re-encoded with libx264 (-c:v libx264) at a sensible quality/size trade-off (-crf 23), and the audio is simply copied (-c:a copy) to avoid re-encoding loss.
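The force_style string is easy to get wrong by hand, so in a Node.js pipeline it can help to assemble the filter expression from an options object. A small sketch (the helper name is illustrative, not from FFmpeg itself):

```javascript
// Assemble the subtitles filter expression used in the ffmpeg command above.
// `style` keys are ASS/libass style fields such as Fontname, FontSize, PrimaryColour.
function buildSubtitlesFilter(vttPath, style) {
  const styleStr = Object.entries(style)
    .map(([key, value]) => `${key}=${value}`)
    .join(',');
  return `subtitles=${vttPath}:force_style='${styleStr}'`;
}
```

The returned string is passed as the value of FFmpeg's `-vf` option.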
Node.js Worker for Scalability
Both FFmpeg and Whisper are CPU-intensive operations. Running them directly on the main thread of a Node.js application would block the event loop, leading to unresponsiveness. This is where Node.js Worker Threads become invaluable. By offloading these tasks to a dedicated worker, the main thread remains free to handle incoming requests and other I/O operations.
A worker thread can execute the FFmpeg and Whisper commands in a child process (using child_process.exec or spawn) and then communicate the results back to the main thread once done. This pattern is crucial for building scalable media processing services.
Real-world Application
This exact stack forms the backbone of the captioning pipeline used by Clipspeed.ai. By leveraging Whisper's accuracy and FFmpeg's versatility within a Node.js worker architecture, Clipspeed.ai efficiently generates and burns high-quality captions into user-uploaded videos, enhancing accessibility and engagement for its users. The user workflow involves uploading a video, the system processes it in the background via workers, and then presents the captioned video, often with options for further editing.
Considerations and Trade-offs
While powerful, this approach has considerations. Whisper models come in various sizes (tiny, base, small, medium, large), impacting both accuracy and processing speed. Larger models offer better accuracy but require more computational resources and time. FFmpeg re-encoding can also be time-consuming, depending on video length and desired quality. Careful selection of codecs and presets is necessary to balance quality and file size.
Conclusion
Implementing an auto-captioning pipeline with Whisper and FFmpeg in a Node.js worker thread is a robust solution for modern video content. It addresses the critical need for accessibility and engagement in a sound-off world, providing a scalable and efficient way to enhance your video offerings. By embracing this technology, you ensure your content reaches and resonates with the widest possible audience.