Vosk Speech to Text

(November 2021)

The objective of this exercise was to automatically generate text transcriptions of podcasts and other audio recordings, without resorting to proprietary off-site cloud services.

Transcription

We chose Vosk, an offline, open-source speech recognition toolkit. It is not as accurate as cloud-based commercial services such as Otter (used by Zoom teleconferencing), but it is nevertheless quite effective.

The Vosk conversion process is single-threaded, so we use ffmpeg to split the audio file into segments that can be transcribed in parallel. First, install ffmpeg:

sudo add-apt-repository ppa:savoury1/ffmpeg4
sudo apt update
sudo apt install ffmpeg

We will use the Python interface to Vosk:

sudo python3 -m pip install vosk
git clone https://github.com/alphacep/vosk-api
cd vosk-api/python/example
wget https://alphacephei.com/vosk/models/vosk-model-en-us-0.22.zip  # 1.8GB fairly accurate English model
unzip vosk-model-en-us-0.22.zip
mv vosk-model-en-us-0.22 model
python3 ./test_simple.py test.wav
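
For reference, the recognition loop behind these example scripts is only a few lines of Python. The following is a minimal sketch of the Vosk API (our own illustration, not a copy of test_simple.py, whose exact contents may differ), assuming a 16 kHz, 16-bit mono WAV input and the unpacked model directory named model as above:

#!/usr/bin/env python3
# Read a 16 kHz, 16-bit mono PCM WAV and print the recognised text
import sys
import wave
import json
from vosk import Model, KaldiRecognizer

wf = wave.open(sys.argv[1], "rb")                 # e.g. test.wav
model = Model("model")                            # unpacked model directory
rec = KaldiRecognizer(model, wf.getframerate())

parts = []
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):                  # end of an utterance
        parts.append(json.loads(rec.Result())["text"])
parts.append(json.loads(rec.FinalResult())["text"])
print(" ".join(parts))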

Our shell script to transcribe a single audio file:

#!/bin/bash
# Usage: /vosk-api/python/example/par_vosk.sh testrecording.mp3
# Multi-core Vosk: segment the audio file and pass the segments through the Vosk speech-to-text engine
# 2021/Nov/22 Initial Vsn

if [ ! -f "$1" ]; then echo "Unable to find $1"; exit 1; fi

audiofile="$1"
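# Number of parallel transcriptions; tune to the CPU threads and RAM available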
cores=9

# Calculate segment duration (rounded up) so the work is spread equally across all cores
length=$(ffprobe -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "$audiofile" 2>/dev/null)

segdur=$(awk -v v="$length" -v c="$cores" 'BEGIN { val=v / c; printf("%.0f", (val == int(val)) ? val : int(val)+1) }')

# Split into 16 kHz, 16-bit mono PCM segments (the format required by Vosk)
ffmpeg -loglevel fatal -i "$audiofile" -f segment -segment_time "$segdur" -acodec pcm_s16le -ac 1 -ar 16000 out%1d.wav

mcore=$((cores - 1))
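# Transcribe each segment in its own background process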
for i in $(seq 0 $mcore); do
    python3 /vosk-api/python/example/test_text.py "out$i.wav" > "out$i.txt" 2>"out$i.log" &
done
wait
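# Combine the per-segment transcripts in order and clean up ({0..8} must match cores=9 above)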
cat out{0..8}.txt > transcription.txt
rm out{0..8}.txt
rm out{0..8}.wav
rm out{0..8}.log

Tested in November 2021 on an Intel Core i5-10400 (6 cores, 12 threads) with 64 GB RAM, running Ubuntu 20.04.

A 42-minute, 40 MB MP3 BBC Radio 4 talk programme was transcribed in 3 minutes, producing a 7,452-word, 40 KB text file. Without parallelisation the transcription took 11.5 minutes, so splitting the audio across cores gave roughly a 4× speed-up.

Note: because we used the large model, the process is memory-hungry: the nine simultaneous transcriptions consumed 44 GB of RAM. The less accurate 40 MB small English model uses only 3 GB of RAM and completes in seconds.

Punctuation

The transcribed text is quite hard to read because it has no punctuation. We can improve it using https://github.com/xashru/punctuation-restoration. Our compute server does not have an Nvidia graphics card with CUDA capability, so we used the CPU-only builds of PyTorch.

sudo python3 -m pip install torch==1.10.0+cpu torchvision==0.11.1+cpu torchaudio==0.10.0+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html

git clone https://github.com/xashru/punctuation-restoration.git
cd punctuation-restoration
sudo python3 -m pip install -r requirements.txt

To use the CPU instead of CUDA, we must edit line 41 of src/inference.py so that the saved weights are loaded onto the CPU:
deep_punctuation.load_state_dict(torch.load(model_save_path, map_location=torch.device('cpu')))

Download the roberta-large English pre-trained model (roberta-large-en.pt):
https://drive.google.com/file/d/17BPcnHVhpQlsOTC8LEayIFFJ7WkL00cr/view?usp=sharing

Run the punctuation process:
python3 src/inference.py --pretrained-model=roberta-large --weight-path=roberta-large-en.pt --language=en --in-file=data/test_en.txt --out-file=data/test_en_out.txt

This step is multi-threaded (it used 6 cores and 4 GB of RAM on our system) and took only a few seconds to punctuate our sample text file.

Having inserted punctuation, we can further improve readability by capitalising i, i’d, i’m and i’ve, and by adding a line break after each . and ?, with the following line starting with a capital letter:
sed "s/ i\([ \']\)/ I\1/g; s/\([.?]\) \(.\)/\1\n\U\2/g" test_en_out.txt > test_en_out_breaks.txt

The final result is far from perfect, but as a completely automated process it’s adequate for many purposes.