In this guide, we walk through a complete implementation of WhisperX, covering transcription, alignment, and word-level timestamps. We’ll set up the environment, load and preprocess audio, and then run the full pipeline, from transcription through alignment and analysis, with attention to memory efficiency and batch processing. Along the way, we’ll visualize results, export them in multiple formats, and extract keywords to gain deeper insights from the audio content.
**Setup and Configuration**
We commence by installing WhisperX along with essential libraries, such as pandas, matplotlib, and seaborn. We then configure our setup, detecting whether CUDA is available, selecting the compute type, and setting parameters like batch size, model size, and language to prepare for transcription.
```python
!pip install -q git+https://github.com/m-bain/whisperX.git
!pip install -q pandas matplotlib seaborn

import whisperx
import torch
import gc
import os
import json
import pandas as pd
from pathlib import Path
from IPython.display import Audio, display, HTML
import warnings
warnings.filterwarnings('ignore')

CONFIG = {
    "device": "cuda" if torch.cuda.is_available() else "cpu",
    "compute_type": "float16" if torch.cuda.is_available() else "int8",
    "batch_size": 16,
    "model_size": "base",
    "language": None,  # None lets WhisperX auto-detect the language
}
```
**Audio Processing and Transcription**
We begin by downloading a sample audio file for testing and loading it for analysis. We then transcribe the audio using WhisperX, setting up batched inference with our chosen model size and configuration. We output key details such as language, number of segments, and total text length.
```python
def download_sample_audio():
    """Download a sample audio file for testing"""
    !wget -q -O sample.mp3 https://github.com/mozilla-extensions/speaktome/raw/master/content/cv-valid-dev/sample-000000.mp3
    print("Sample audio downloaded")
    return "sample.mp3"

def load_and_analyze_audio(audio_path):
    """Load audio and display basic info"""
    audio = whisperx.load_audio(audio_path)  # 16 kHz mono float32 waveform
    duration = len(audio) / 16000
    print(f"Audio: {Path(audio_path).name}")
    print(f"Duration: {duration:.2f} seconds")
    print(f"Sample rate: 16000 Hz")
    display(Audio(audio_path))
    return audio, duration

def transcribe_audio(audio, model_size=CONFIG["model_size"], language=None):
    """Transcribe audio using WhisperX (batched inference)"""
    print("\nSTEP 1: Transcribing audio…")
    # … (rest of the function)
```
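The body of transcribe_audio is elided above. For orientation, a minimal version of this step could look like the sketch below, built on the public WhisperX API (whisperx.load_model and the batched transcribe call); the exact prints, return shape, and cleanup order here are our own illustrative choices, not the original code.

```python
def transcribe_audio_sketch(audio, model_size=CONFIG["model_size"], language=None):
    """Illustrative sketch of the transcription step (not the original body)."""
    # Load the batched faster-whisper backend with our configured precision
    model = whisperx.load_model(
        model_size,
        CONFIG["device"],
        compute_type=CONFIG["compute_type"],
        language=language,
    )
    result = model.transcribe(audio, batch_size=CONFIG["batch_size"])
    print(f"Language: {result['language']}")
    print(f"Segments: {len(result['segments'])}")
    print(f"Text length: {sum(len(s['text']) for s in result['segments'])} chars")
    # Free the ASR model before the alignment stage to conserve GPU memory
    del model
    gc.collect()
    if CONFIG["device"] == "cuda":
        torch.cuda.empty_cache()
    return result
```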
**Alignment and Word-Level Timestamps**
Next, we align the transcription to generate precise word-level timestamps. By loading the alignment model and applying it to the audio, we refine timing accuracy and report the total aligned words while ensuring memory is cleared for efficient processing.
```python
def align_transcription(segments, audio, language_code):
    """Align transcription for accurate word-level timestamps"""
    print("\nSTEP 2: Aligning for word-level timestamps…")
    # … (rest of the function)
```
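Again the body is elided; a minimal sketch of the alignment step, assuming the standard whisperx.load_align_model / whisperx.align calls, might read as follows. The word-count report and memory cleanup are our additions.

```python
def align_transcription_sketch(segments, audio, language_code):
    """Illustrative sketch of the alignment step (not the original body)."""
    # Load a language-specific phoneme alignment model
    align_model, metadata = whisperx.load_align_model(
        language_code=language_code, device=CONFIG["device"]
    )
    aligned = whisperx.align(
        segments, align_model, metadata, audio,
        CONFIG["device"], return_char_alignments=False,
    )
    total_words = sum(len(seg.get("words", [])) for seg in aligned["segments"])
    print(f"Aligned words: {total_words}")
    # Release the alignment model once the timestamps are refined
    del align_model
    gc.collect()
    if CONFIG["device"] == "cuda":
        torch.cuda.empty_cache()
    return aligned
```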
**Transcription Analysis**
We analyze the transcription by generating detailed statistics such as total duration, segment count, word count, and character count. We also calculate words per minute, pauses between segments, and average word duration to better understand the pacing and flow of the audio.
```python
def analyze_transcription(result):
    """Generate statistics about the transcription"""
    print("\nTRANSCRIPTION STATISTICS")
    print("=" * 70)
    # … (rest of the function)
```
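As a rough guide, the statistics described above can be derived directly from the aligned result dictionary. The sketch below assumes aligned segments (each carrying a words list) and computes duration, counts, words per minute, inter-segment pauses, and average word duration; the exact metrics and formatting in the original may differ.

```python
def analyze_transcription_sketch(result):
    """Illustrative sketch of the statistics step (not the original body)."""
    segments = result["segments"]
    words = [w for seg in segments for w in seg.get("words", [])]
    text = " ".join(seg["text"].strip() for seg in segments)
    total_duration = max(seg["end"] for seg in segments)

    wpm = len(words) / (total_duration / 60) if total_duration else 0.0
    # Pauses: gap between the end of one segment and the start of the next
    pauses = [segments[i + 1]["start"] - segments[i]["end"]
              for i in range(len(segments) - 1)]
    timed = [w for w in words if "start" in w and "end" in w]
    avg_word = sum(w["end"] - w["start"] for w in timed) / len(timed) if timed else 0.0

    print(f"Duration:         {total_duration:.2f} s")
    print(f"Segments:         {len(segments)}")
    print(f"Words:            {len(words)}")
    print(f"Characters:       {len(text)}")
    print(f"Words per minute: {wpm:.1f}")
    if pauses:
        print(f"Avg pause:        {sum(pauses) / len(pauses):.2f} s")
    print(f"Avg word length:  {avg_word:.3f} s")
```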
**Results Visualization and Export**
We format results into clean tables, export transcripts to JSON/SRT/VTT/TXT/CSV formats, and maintain precise timestamps with helper formatters. We also batch-process multiple audio files end-to-end and extract top keywords, enabling us to quickly turn raw transcriptions into analysis-ready artifacts.
```python
def display_results(result, show_words=False, max_rows=50):
    """Display transcription results in formatted table"""
    # … (rest of the function)

def export_results(result, output_dir="output", filename="transcript"):
    """Export results in multiple formats"""
    # … (rest of the function)

def batch_process_files(audio_files, output_dir="batch_output"):
    """Process multiple audio files in batch"""
    # … (rest of the function)

def extract_keywords(result, top_n=10):
    """Extract most common words from transcription"""
    # … (rest of the function)
```
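The export and keyword helpers are elided above. To make the timestamp handling concrete, here is a minimal sketch of an SRT exporter (one of the five formats listed) plus a simple frequency-based keyword extractor. The helper names and the short-token heuristic are our own, and the original export function also covers JSON, VTT, TXT, and CSV.

```python
from collections import Counter
import re

def format_timestamp_srt(seconds):
    """Render seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def export_srt_sketch(result, output_dir="output", filename="transcript"):
    """Illustrative SRT export (the original also writes JSON/VTT/TXT/CSV)."""
    os.makedirs(output_dir, exist_ok=True)
    path = Path(output_dir) / f"{filename}.srt"
    with open(path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], start=1):
            f.write(f"{i}\n")
            f.write(f"{format_timestamp_srt(seg['start'])} --> "
                    f"{format_timestamp_srt(seg['end'])}\n")
            f.write(seg["text"].strip() + "\n\n")
    print(f"Saved {path}")
    return path

def extract_keywords_sketch(result, top_n=10):
    """Count the most frequent words, skipping very short tokens."""
    text = " ".join(seg["text"] for seg in result["segments"]).lower()
    tokens = [t for t in re.findall(r"[a-z']+", text) if len(t) > 3]
    return Counter(tokens).most_common(top_n)
```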
**Full Pipeline Execution**
Finally, we run the full WhisperX pipeline end-to-end, loading the audio, transcribing it, and aligning it for word-level timestamps. When enabled, we analyze stats, extract keywords, render a clean results table, and export everything to multiple formats, ready for real use.
```python
def process_audio_file(audio_path, show_output=True, analyze=True):
    """Complete WhisperX pipeline"""
    # … (rest of the function)

print("\nSetup complete! Uncomment examples above to run.")
```
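For completeness, here is how the elided pipeline function might wire the earlier sketches together: load the audio, transcribe, align using the detected language, then optionally analyze, extract keywords, and export. The ordering mirrors the steps described above; the concrete calls are our illustrative versions, not the original body.

```python
def process_audio_file_sketch(audio_path, analyze=True):
    """Illustrative end-to-end pipeline wiring the sketches above together."""
    audio, duration = load_and_analyze_audio(audio_path)
    result = transcribe_audio_sketch(audio)               # step 1: transcribe
    result = align_transcription_sketch(                  # step 2: align
        result["segments"], audio, result["language"]
    )
    if analyze:
        analyze_transcription_sketch(result)
        print("Top keywords:", extract_keywords_sketch(result))
    export_srt_sketch(result)
    return result

# Example usage (uncomment to run):
# process_audio_file_sketch(download_sample_audio())
```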
In conclusion, we’ve built a complete WhisperX pipeline that not only transcribes audio but also aligns it with precise word-level timestamps. We export the results in multiple formats, process files in batches, and analyze patterns to make the output more meaningful. With this, we now have a flexible, ready-to-use workflow for transcription and audio analysis, and we’re ready to extend it further into real-world projects.