AI in practice: Generating video subtitles
Build • Diego Coy • 01/12/23 • 5 min read
Scaleway is a French company with an international vision, so it is imperative that we provide information to our 550+ employees in both English and French, to ensure clear understanding and information flow. We create a diverse set of training videos for internal usage, with some being originally voiced in English, and others in French. In all cases they should include subtitles for both languages.
Creating subtitles by hand is a time-consuming process that we quickly realized would not scale. Fortunately, we were able to harness the power of AI for this exact task. With the help of OpenAI’s Whisper, the University of Helsinki’s Opus-MT, and a bit of code, we were able to transcribe and, when required, translate our internal videos. We could also generate subtitles in the srt format, which we can simply import into video editing software or feed to a video player.
OpenAI’s Whisper
Whisper is an Open Source model created by OpenAI. It is a general-purpose speech recognition model that is able to identify and transcribe a wide variety of spoken languages. It is one of the most popular models around today and is released under the MIT license.
OpenAI provides a Python package for working with the model, which comes in several “flavors” that trade speed and memory for accuracy: tiny, base, small, medium, and large. Larger flavors have more parameters, which makes them bigger in size and more resource-hungry: the tiny version of the model requires about 1 GB of VRAM (Video RAM), and the large version requires around 10 GB.
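As a rough illustration of that trade-off, the sketch below picks the largest flavor that fits a given amount of VRAM. The 1 GB and 10 GB figures come from the text above; the values for base, small, and medium are approximations assumed from the Whisper README, and the helper name `largest_model_for` is hypothetical.

```python
from typing import Optional

# Approximate VRAM needs in GB, smallest flavor first.
# tiny and large are quoted in the article; the other values are
# assumptions based on the Whisper README and may change over time.
APPROX_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}


def largest_model_for(vram_gb: float) -> Optional[str]:
    """Return the largest flavor whose approximate need fits, or None."""
    best = None
    for name, needed in APPROX_VRAM_GB.items():  # dicts keep insertion order
        if needed <= vram_gb:
            best = name
    return best


if __name__ == "__main__":
    print(largest_model_for(80))  # an 80 GB GPU fits every flavor
    print(largest_model_for(4))   # a 4 GB GPU stops at a smaller flavor
```

The returned name could then be passed to `whisper.load_model`. Bigger is not always better: a smaller flavor may be accurate enough and will transcribe faster.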
Helsinki-NLP’s Opus-MT
The University of Helsinki made its own Open Source text translation models available based on the Marian-MT framework used by Microsoft Translator. Opus-MT models are provided as language pairs: translation source, and translation target, meaning that the model Helsinki-NLP/opus-mt-fr-en will translate text in French (fr) to English (en), and the other way around with Helsinki-NLP/opus-mt-en-fr.
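The naming convention for language pairs can be sketched as follows. The helper names `opus_mt_model_name` and `translate` are hypothetical, and the translation step assumes the Hugging Face Transformers `pipeline` API with its `"translation"` task; it is imported lazily so that the naming helper works without the library installed.

```python
def opus_mt_model_name(source: str, target: str) -> str:
    """Build the Hugging Face model ID for a source/target language pair."""
    return f"Helsinki-NLP/opus-mt-{source}-{target}"


def translate(text: str, source: str = "fr", target: str = "en") -> str:
    """Translate text with the Opus-MT model for the given pair.

    Assumes the `transformers` package (plus a backend such as PyTorch)
    is installed; the model is downloaded on first use.
    """
    from transformers import pipeline  # lazy import: optional dependency

    translator = pipeline("translation", model=opus_mt_model_name(source, target))
    return translator(text)[0]["translation_text"]


if __name__ == "__main__":
    print(opus_mt_model_name("fr", "en"))  # Helsinki-NLP/opus-mt-fr-en
    print(opus_mt_model_name("en", "fr"))  # Helsinki-NLP/opus-mt-en-fr
```

Not every language pair has a published model, so in practice it is worth checking that the resulting ID exists before relying on it. For repeated calls, building the pipeline once and reusing it avoids reloading the model each time.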
Opus-MT can be used via the Transformers Python library from Hugging Face or using Docker. It is an Open Source project released under the MIT License and requires you to cite the OPUS-MT paper in your implementations:
    @InProceedings{TiedemannThottingal:EAMT2020,
      author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
      title = {{OPUS-MT} — {B}uilding open translation services for the {W}orld},
      booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
      year = {2020},
      address = {Lisbon, Portugal}
    }

Generating subtitles
Combining these two models into a subtitle-generating service is only a matter of adding some code to “glue” them together. But before diving into the code, let’s review our requirements:
First, we need to create a Virtual Machine capable of running AI models without a hitch, and the NVIDIA H100-1-80G GPU instance is a great choice.
With the type of instance clear, we can now focus on the functional requirements. We want to pass in a video file as input to Whisper to get a transcript. The second step will be to translate that transcript using OPUS-MT from a specific source language to a target language. Finally, we want to create a subtitle file in the target language that is in sync with the audio.
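One way to keep the translated subtitles in sync with the audio is to translate each time-coded piece of the transcript separately and carry its timing over unchanged. The sketch below assumes segments shaped like Whisper's output (dictionaries with `start`, `end`, and `text` keys) and takes the translator as a plain function, so it can be tried without loading any model. The name `translate_segments` and the sample data are hypothetical.

```python
from typing import Callable, Dict, List


def translate_segments(
    segments: List[Dict], translate: Callable[[str], str]
) -> List[Dict]:
    """Return new segments with translated text and unchanged timing."""
    translated = []
    for segment in segments:
        text = segment["text"].strip()
        translated.append(
            {
                "start": segment["start"],  # seconds, copied as-is
                "end": segment["end"],
                "text": translate(text) if text else "",
            }
        )
    return translated


# Hand-written stand-in for a transcription result, for illustration only.
demo_segments = [
    {"start": 0.0, "end": 2.5, "text": " Bonjour."},
    {"start": 2.5, "end": 5.0, "text": " Bienvenue."},
]

# str.upper stands in for a real translation function here.
demo = translate_segments(demo_segments, str.upper)
```

Translating segment by segment keeps the timing trivial to preserve, at the cost of giving the translation model less context than it would have with the full transcript.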
Setting up Whisper
You will find the latest information about setting it up on their GitHub repository, but in general, you can install the Python library using pip:

    pip install -U openai-whisper

Whisper relies heavily on the FFmpeg project for manipulating multimedia files. FFmpeg can be installed via APT:

    sudo apt install ffmpeg -y

The code
1. A simple text transcription
This basic example is the most straightforward way to transcribe audio into text. After importing the Whisper library, you load a flavor of the model by passing a string with its name to the load_model function. In this case, the base model is accurate enough, but some use cases may require larger or smaller model flavors.
After loading the model, you load the audio source by passing the file path. Notice that you can use both audio and video files, and in general, any file type with audio that is supported by FFmpeg.
Finally, you make use of the transcribe method of the model by passing it the loaded audio. As a result, you get a dictionary that amongst other items, contains the whole transcription text.
    # main.py
    import whisper

    model = whisper.load_model("base")
    audio = whisper.load_audio("input_file.mp4")
    result = model.transcribe(audio)
    print(result["text"])

This basic example gives you the main tools needed for the rest of the project: loading a model, loading an input audio file, and transcribing the audio using the model. This is already a big step forward and puts us closer to our goal of generating a subtitle file. However, you may have noticed that the resulting text doesn’t include any time references; it’s only text. Syncing this transcribed text with the audio by hand would require a large amount of manual work, but fortunately, Whisper’s transcription process also outputs segments that are time-coded.
2. Segments
Having time-coded segments means you can pin them to their specific start and end times in the clip. For instance, if the first speech segment in the clip is “We're no strangers” and it starts at 00:17:50 and ends at 00:18:30, you will get that information in the segment dictionary, giving you all you need to create an srt subtitle file. All that is left is to format it so that it conforms to the appropriate syntax.
    # Getting the transcription segments
    from datetime import timedelta  # For formatting the segment times
    import os  # For creating the srt file in the filesystem
    import whisper

    model = whisper.load_model("base")
    audio = whisper.load_audio("input_file.mp4")
    result = model.transcribe(audio)
    segments = result["segments"]  # A list of segments

    for segment in segments:
        # Each segment carries its start and end times (in seconds) and its text
        print(segment["start"], segment["end"], segment["text"])
        # ...

3. An srt subtitle file
Subtitle files in the srt format are divided into numbered sequences. Each sequence includes the start and end timecodes (separated by the “ --> ” string), followed by the caption text ending in a line break. Here’s an example:

    1
    00:01:26,612 --> 00:01:29,376
    Took you…
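Putting the format together with time-coded segments could look like the minimal sketch below. It assumes segments shaped like Whisper's output (`start` and `end` in seconds, plus `text`) and uses plain integer arithmetic for the timecodes instead of `timedelta`. The function names and the sample caption are hypothetical, and this is not necessarily how the original project formats its files.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an srt timecode: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Turn a list of segments into the text of an srt file."""
    blocks = []
    for index, segment in enumerate(segments, start=1):  # srt counts from 1
        start = srt_timestamp(segment["start"])
        end = srt_timestamp(segment["end"])
        text = segment["text"].strip()
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)  # a blank line separates sequences


if __name__ == "__main__":
    # Hand-written sample segment, for illustration only.
    sample = [{"start": 86.612, "end": 89.376, "text": " Example caption"}]
    print(segments_to_srt(sample))
```

The resulting string can be written to a file with a `.srt` extension, for example with `open("output.srt", "w", encoding="utf-8")`, and then loaded in a video player or editor.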