Ok I want to do a few things.
Maybe non-obvious you will need:
I won't go over setup. It's hacks and jank.
Convert to what whisper expects: a mono, 16-bit, 16000 Hz sample rate wav file.
ffmpeg -i input-video.mkv -ac 1 -ar 16000 input-audio.wav
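If you have several videos, the same conversion can be looped. This is a minimal sketch: `convert_all` is a made-up helper name, and it assumes you are fine with each wav landing next to its source file under the same base name.

```shell
# Hypothetical helper: convert every file given as an argument to a
# mono, 16000 Hz wav next to the original (ffmpeg defaults to 16-bit PCM for wav).
convert_all() {
    local f
    for f in "$@"; do
        # "${f%.*}.wav" swaps the original extension for .wav
        ffmpeg -i "$f" -ac 1 -ar 16000 "${f%.*}.wav"
    done
}

# Usage: convert_all *.mkv
```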
This one is annoying because you may have to try a few things before it works.
whisperx --model large-v3-turbo --model_dir ~/.cache/whisper --output_dir out/ --task transcribe --diarize --highlight_words True --verbose True --output_format all --print_progress True --compute_type int8 --threads 24 --hf_token your-token-here input-audio.wav
I found little difference between large-v2, large-v3 and large-v3-turbo. Turbo is a bit faster and newer, so I'm going with that.
You might not want highlight_words as it can be distracting.
This generates, among other formats, a .txt transcript plus .srt and .vtt subtitles (those two are nearly equivalent).
NOTE: try with a single small wav file before you do *.wav!!! It will batch process, and you want one good run first to catch errors, such as ones from huggingface. Otherwise you will spend 10 hours in whisper, then fail and have to redo it all.
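One cheap way to get that small test file is to cut the first minute off the real audio. A rough sketch: `make_test_clip` and the `_test.wav` suffix are names I made up, and 60 seconds is an arbitrary default.

```shell
# Hypothetical helper: cut the first N seconds of a wav into a small test clip.
make_test_clip() {
    local input="$1"
    local seconds="${2:-60}"              # default to 60 seconds
    local output="${input%.*}_test.wav"   # e.g. input-audio_test.wav
    # -t limits the duration; -c copy keeps the PCM audio as-is, no re-encode
    ffmpeg -i "$input" -t "$seconds" -c copy "$output"
}

# Usage: make_test_clip input-audio.wav 60
```

Run the full whisperx command on the clip first, and only start the long batch once that run finishes cleanly.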
Now, this results in text labeled with SPEAKER_00 and SPEAKER_01. First I wanted to convert to SubStation Alpha (ASS/SSA) to do colors. Then I got bored and did .srt/.vtt directly. It's only a bit cursed.
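Before picking colors it can help to see how many speakers the diarization actually found. A small sketch, assuming the labels appear in the subtitle text as `[SPEAKER_00]` and so on; `count_speakers` is a made-up name and the path in the usage line is only an example.

```shell
# Hypothetical helper: count how many labeled cue lines each speaker has.
# Assumes labels look like "[SPEAKER_00]" in the subtitle text.
count_speakers() {
    # -o prints only the matched label, one per line; uniq -c counts each one
    grep -oE '\[SPEAKER_[0-9]+\]' "$1" | sort | uniq -c
}

# Usage: count_speakers out/input-audio.srt
```

If this lists far more speakers than were really talking, the diarization probably split someone in two, and it is worth knowing that before coloring anything.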
Here is a bash script to take in the subtitle and add color tags.
NOTE: use font tags, not span as some sources say. Also, I did the SRT and not the VTT since there seem to be some issues with tags in VTT, and limits to what is supported? Idk much man.
#!/bin/bash
# Define an array of colors, one per speaker
COLORS=("yellow" "cyan" "green" "magenta" "orange" "red" "blue" "white")
# Input file (VTT or SRT)
INPUT_FILE="$1"
# Output file (with colors added)
OUTPUT_FILE="${INPUT_FILE%.*}_colored.${INPUT_FILE##*.}"
# Check if input file exists
if [[ ! -f "$INPUT_FILE" ]]; then
    echo "Error: Input file '$INPUT_FILE' not found." >&2
    exit 1
fi
# Process the file
awk -v colors="${COLORS[*]}" '
BEGIN {
    # Split the colors into an array; n is the number of colors
    n = split(colors, colorArray, " ")
}
{
    # Check if the line contains a speaker label
    if (match($0, /\[SPEAKER_[0-9]+\]:/)) {
        # Extract the speaker number: skip "[SPEAKER_" (9 chars), drop "]:" (2 chars)
        speaker_num = substr($0, RSTART + 9, RLENGTH - 11)
        # Get the corresponding color, wrapping around if speakers outnumber colors
        color = colorArray[(speaker_num % n) + 1]
        # Wrap the line in a font tag with the color
        $0 = "<font color=\"" color "\">" $0 "</font>"
    }
    # Print the modified line
    print
}
' "$INPUT_FILE" > "$OUTPUT_FILE"
echo "Colors added successfully! Output saved to '$OUTPUT_FILE'."
At this point you have an SRT file. You can embed it into the video as a subtitle track in, say, an MKV. That's how I would personally do it.
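For the subtitle track route, a plain remux is enough. A sketch under assumptions: `mux_subs` is a made-up helper name, and the file names in the usage line are examples.

```shell
# Hypothetical helper: add an SRT file as a subtitle track without re-encoding.
mux_subs() {
    local video="$1" subs="$2" output="$3"
    # -map 0 keeps every stream from the video, -map 1 adds the subtitle file;
    # -c copy muxes everything as-is, so this takes seconds rather than hours
    ffmpeg -i "$video" -i "$subs" -map 0 -map 1 -c copy "$output"
}

# Usage: mux_subs input.mkv input_subtitles_colored.srt output.mkv
```

Whether the font color tags actually show up then depends on the player.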
But sometimes you're sending to people and don't want them to deal with subtitle tracks.
Then just burn it in and do hardsubs.
ffmpeg -i input.mkv -vf "subtitles=input_subtitles_colored.srt:force_style='OutlineColour=&H00000000,BorderStyle=3,Outline=1,Shadow=0,MarginV=20,FontSize=28'" -c:v libx264 -crf 23 -c:a aac -b:a 128k output.mp4
This uses our color tags and draws the subs at font size 28 with a black box background and outline. Adjust the values to your preference.
Note: I might hate h264 and the default aac encoder sounds like ass but sometimes you need a dumb mp4 file. Use AV1 and opus if you can.
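If AV1 and opus are an option, the same burn-in could look roughly like this. A sketch, assuming your ffmpeg build includes libsvtav1 and libopus; `burn_subs_av1` is a made-up name, and the CRF and bitrate are starting guesses, not tuned values.

```shell
# Hypothetical helper: burn in subtitles, encoding video as AV1 and audio as opus.
# Assumes an ffmpeg build with libsvtav1 and libopus enabled.
burn_subs_av1() {
    local video="$1" subs="$2" output="$3"
    # Same force_style as the h264 command; -crf 30 and 96k are untuned guesses
    ffmpeg -i "$video" \
        -vf "subtitles=${subs}:force_style='OutlineColour=&H00000000,BorderStyle=3,Outline=1,Shadow=0,MarginV=20,FontSize=28'" \
        -c:v libsvtav1 -crf 30 -c:a libopus -b:a 96k "$output"
}

# Usage: burn_subs_av1 input.mkv input_subtitles_colored.srt output.mkv
```

Subtitle file names with colons, quotes or other special characters need escaping inside the filter string, so keeping the file name simple is the easy way out.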