
Research on ASR LLM fine-tuning, speaker diarization, and model comparison
Author: Linchuan Du
Affiliation: Department of Mathematics, The University of British Columbia
Date: August 2023
Abstract
Automatic Speech Recognition (ASR), also known as Speech-to-Text (STT), uses deep learning to transcribe speech audio into text. In deep learning and artificial intelligence, Large Language Models (LLMs) mimic how the human brain processes words and phrases, and are able to understand and generate text data. LLMs usually contain millions of weights and are pre-trained on a wide variety of datasets. An ASR LLM additionally converts audio inputs into the required input format through feature extraction and tokenization.
To customize an ASR LLM with the desired performance, fine-tuning procedures for Whisper, an ASR LLM developed by OpenAI, were first tested on Google Colaboratory. Larger models were then deployed in a GPU-equipped Windows environment to speed up training and alleviate the GPU availability and quota limits on Colab and macOS. Audio data were assessed for reliability based on information such as audio quality and transcript accuracy. Models were then improved and optimized through data preprocessing and hyperparameter tuning. When GPU memory issues could not be resolved with regular fine-tuning, Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) was used to freeze most parameters and reduce memory allocation without sacrificing much performance. Results were visualized along with loss curves to verify that the fine-tuning runs were well fitted and optimized.
The possibility of multi-speaker support in Whisper was explored using neural speaker diarization. Integration with Pyannote was implemented through its pipeline, and with WhisperX, a project built on similar ideas that adds word-level timestamps and Voice Activity Detection (VAD). WhisperX was tested on long-form transcription with batching as well as diarization.
Besides Whisper, other models with ASR functionality were installed and compared against the Whisper baseline, including Massively Multilingual Speech (MMS) by Meta AI Research, PaddleSpeech by PaddlePaddle, SpeechBrain, and ESPnet. Chinese datasets were used to compare these models on the Character Error Rate (CER) metric. In addition, Custom Speech in Azure AI, which supports real-time STT, was introduced to compare performance (mainly on Mandarin Chinese). A choice can then be made between trained Azure models and loadable models such as Whisper for deployment.
Overview
Topics covered in this research:
- Preparing Environment - Google Colab, Anaconda, VS Code, CUDA GPU
- Audio Data Source - Hugging Face, OpenSLR datasets
- Whisper Fine-tuning - Fine-tuning, PEFT with LoRA, Results
- Speaker Diarization - Pyannote.audio, WhisperX
- Other Models - Meta MMS, PaddleSpeech, SpeechBrain, ESPnet
- Azure Speech Studio - Custom Speech training and deployment
1. Preparing Environment
a. Google Colaboratory
Google Colaboratory is a hosted Jupyter Notebook service that offers limited free GPU and TPU computing resources. In Google Colaboratory, Python scripts are edited and executed in notebooks with the .ipynb extension.
Log in to Google Colab with a Google account, share notebooks via the "Share" button at the top right corner of the page, and optionally authorize Colab with a GitHub account.
How to set up environments on Colab:
- Select Runtime → Change runtime type and enable a GPU
- Use pip or other package installers to install necessary dependencies
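A minimal sketch of a setup cell; the packages listed here are the ones used later in this report, and the leading "!" runs a shell command inside the notebook:

```python
!pip install --quiet transformers datasets evaluate

# confirm that the GPU runtime is active
import torch
print(torch.cuda.is_available())  # True when a GPU runtime is selected
```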
b. Anaconda
Besides Colab, environments can also be prepared on a local PC. Anaconda is a well-known distribution platform for the data science field, covering data analysis and building machine learning models in Python. It includes Conda, an environment and package manager that helps manage open-source Python packages and libraries.
How to set up environments with Anaconda:
- Install Anaconda from Free Download | Anaconda and add it to the PATH environment variable
- Open Command Prompt and enter the base environment, e.g. (Windows):
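One way to do this (assuming Anaconda was added to PATH, so the prompt then shows (base)):

```
conda activate base
```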
- Create a new Conda environment with a new name:
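A sketch; the environment name whisper-env and the Python version are placeholders:

```
conda create -n whisper-env python=3.10
```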
- Activate a specific Conda environment whenever it is needed, or return to the base environment with deactivate:
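For example, with the environment name used above:

```
conda activate whisper-env
conda deactivate
```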
- Install dependencies through PyPI or Conda package manager:
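A sketch with illustrative package names:

```
pip install transformers datasets
conda install numpy
```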
Tip: Other useful Conda commands: https://conda.io/projects/conda/en/latest/commands
c. Visual Studio Code
Visual Studio Code, or VS Code, is a powerful source-code editor for Windows, macOS and Linux that supports editing in many programming languages. It offers debugging, execution in integrated terminals, extended functionality through extensions, and version control through embedded Git.
How to set up environments in VS Code:
- Open folders on the left side under EXPLORER and create files inside them
- At the bottom right, select the environment needed. Execute Python scripts either in the interactive window at the top right (with the IPython kernel installed) or by running Python files from the terminal:
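For example, with a placeholder file name:

```
python my_script.py
```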
- An alternative way is to use the ipynb extension (Jupyter Notebook)
- The Git icon in the left panel provides source control
Tip: Reload VS Code after packages in the environment are updated
d. CUDA GPU
Compute Unified Device Architecture (CUDA) is a parallel computing platform and Application Programming Interface (API) developed by NVIDIA. It allows developers to use NVIDIA Graphics Processing Units (GPUs) for multiple computing tasks.
How to use CUDA GPU:
- Install the CUDA Toolkit, which includes necessary libraries, tools, and drivers for developing and running CUDA applications
- Check relevant information in Command Prompt with the command:
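Two common checks: nvidia-smi reports the driver version and GPU status, and nvcc --version reports the installed CUDA Toolkit version:

```
nvidia-smi
nvcc --version
```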

After setting up the CUDA Toolkit, install a GPU-compatible PyTorch build from the PyTorch website.
Tip: When a previous PyTorch version is needed, check the corresponding commands on the Previous PyTorch Versions page to avoid compatibility issues.
Version check can be performed directly through Python:
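A quick check that PyTorch sees the GPU:

```python
import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # True if a compatible GPU is detected
```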
2. Audio Data Source
a. Hugging Face
Hugging Face is a company and an open-source platform dedicated to Natural Language Processing (NLP) and Artificial Intelligence.

Create a Hugging Face account to utilize published models or upload customized models. Personal READ and WRITE tokens can be created on https://huggingface.co/settings/tokens.
Common ASR LLMs and their relevant information:
| Model | Variant / # Params | Languages | Task | Structure |
|---|---|---|---|---|
| OpenAI Whisper | large-v2 1550M | Most languages | Multitasks | Transformer encoder-decoder Regularized |
| OpenAI Whisper | large 1550M | Most languages | Multitasks | Transformer encoder-decoder |
| OpenAI Whisper | medium 769M | Most languages | Multitasks | Transformer encoder-decoder |
| OpenAI Whisper | small 244M | Most languages | Multitasks | Transformer encoder-decoder |
| guillaumekln faster-whisper | large-v2 | Most languages | Multitasks | CTranslate2 |
| facebook wav2vec2 | large-960h-lv60-self | English | transcription | Wav2Vec2CTC decoder |
| facebook wav2vec2 | base-960h 94.4M | English | transcription | Wav2Vec2CTC decoder |
| facebook mms | 1b-all 965M | Most languages | Multitasks | Wav2Vec2CTC decoder |
Common audio datasets:
| Dataset | # hours / Size | Languages |
|---|---|---|
| mozilla-foundation common_voice_13_0 | 17689 validated hrs | 108 languages |
| google fleurs | ~12 hrs per language | 102 languages |
| LIUM tedlium | 118 to 452 hrs for 3 releases | English |
| librispeech_asr | ~1000 hrs | English |
| speechcolab gigaspeech | 10000 hrs | English |
| PolyAI minds14 | 8.17k rows | 14 languages |
Warning: PolyAI/minds14 is primarily for intent detection task, and not ideal for ASR purpose
b. OpenSLR
OpenSLR is another useful website that hosts speech and language resources as compressed archives. Various audio datasets are listed, along with brief summaries, under the Resources tab.
Chinese audio datasets for ASR purposes:
| Dataset | # hours (size) | # speakers | Transcript accuracy |
|---|---|---|---|
| Aishell-1 (SLR33) | 178 hrs | 400 | 95+% |
| Free ST (SLR38) | 100+ hrs | 855 | / |
| aidatatang_200zh (SLR62) | 200 hrs | 600 | 98+% |
| MAGICDATA (SLR68) | 755 hrs | 1080 | 98+% |
3. Whisper Model Fine-tuning
Whisper is an ASR (Automatic Speech Recognition) system released by OpenAI in September 2022. It was trained on 680,000 hours of multilingual and multitask supervised data, enabling transcription and translation across many languages. The architecture is an encoder-decoder Transformer.
Input audio is chunked into 30-second segments and converted into a log-Mel spectrogram, which maps frequencies onto the Mel scale; the spectrogram is then passed into the encoder.
Resources: Introducing Whisper (OpenAI, 2022), Radford et al. (2022), and the Hugging Face fine-tuning tutorial (Gandhi, 2022)
a. Fine-tuning on Colab
Step 1: Log in with a Hugging Face token to enable dataset downloads
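A minimal sketch using the huggingface_hub helper; the token is pasted interactively:

```python
from huggingface_hub import notebook_login

notebook_login()  # paste a READ or WRITE token from huggingface.co/settings/tokens
```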
Step 2: Load desired dataset(s) through load_dataset in datasets
Tip: Some datasets on Hugging Face are gated and require requesting access before they can be downloaded
Step 3: Preprocess datasets to feed data into Whisper:
- Manipulate columns, e.g. remove_columns, cast_column
- Normalize transcripts, e.g. upper/lowercase, punctuation, special tokens
- Change sampling rate to 16k using Audio in Datasets library
- Load pre-trained feature extractor and tokenizer from transformers library
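A sketch assuming the openai/whisper-small checkpoint used in the results below:

```python
from transformers import WhisperFeatureExtractor, WhisperTokenizer

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")
```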
Tip: AutoProcessor detects processor type automatically
In the tokenizer (or processor), the target language and task are usually specified:
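For example, wrapping both components in a WhisperProcessor and fixing the language and task (Hindi transcription here is illustrative):

```python
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="Hindi", task="transcribe"
)
```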
Step 4: Define a sequence-to-sequence data collator with label padding
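A sketch following the data collator in the Hugging Face fine-tuning tutorial (Gandhi, 2022); it pads the audio features and the labels separately and masks the label padding with -100 so it is ignored by the loss:

```python
from dataclasses import dataclass
from typing import Any, Dict, List
import torch

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        # pad the log-Mel input features to a fixed-size batch tensor
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # pad the tokenized labels and replace padding with -100
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # cut the bos token here if it was already added during tokenization
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
```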
Step 5: Import evaluation metrics (WER)
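A sketch using the evaluate library:

```python
import evaluate

metric = evaluate.load("wer")
```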
Tip: When using English or most European languages, WER (Word Error Rate) is a common evaluation metric for transcription accuracy.
WER Formula: WER = (Substitutions + Deletions + Insertions) / Total Words in Reference
Step 6: Define the metrics computation function
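A sketch of the computation, again following the tutorial pattern; it decodes predictions and labels and reports WER as a percentage:

```python
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # restore the padding token so the labels decode correctly
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}
```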
Step 7: Load the model for conditional generation and configure it
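A sketch with the checkpoint name as a placeholder; the forced decoder ids are cleared because language and task are already handled by the processor:

```python
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
```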
Step 8: Define hyperparameters in Seq2SeqTrainingArguments
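A sketch with illustrative values; the hyperparameters actually used in this report are listed in the results tables below:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-finetuned",  # hypothetical output path
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=50,
    max_steps=500,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    eval_steps=100,
    save_steps=100,
    logging_steps=25,
    predict_with_generate=True,
    generation_max_length=225,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
)
```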
Step 9: Start training with trainer.train()
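A sketch assuming a preprocessed dataset dictionary named common_voice:

```python
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

trainer.train()
```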
Handling CUDA Out of Memory (OOM) Errors:
- First priority: reduce the batch size, trading longer training time for lower memory usage; combine it with gradient accumulation to keep the effective batch size.
- Gradient checkpointing: Trades a small increase in computation time for significant reductions in memory usage.
- Mixed precision training: Reduces memory footprint significantly while maintaining training stability.
- Clear GPU cache:
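For example:

```python
import gc
import torch

gc.collect()               # free unreferenced Python objects
torch.cuda.empty_cache()   # release cached GPU memory back to the driver
```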
Tip: If all of these methods fail, switching to a smaller model size is the last resort; a less complex model requires less GPU memory.
b. Data Preprocessing
Hugging Face Dataset
Load the dataset using the load_dataset function:
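A sketch using the Common Voice Hindi subset as an example; gated datasets may also require passing an authentication token:

```python
from datasets import load_dataset

common_voice = load_dataset(
    "mozilla-foundation/common_voice_13_0", "hi", split="train+validation"
)
```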
Tip: Use streaming=True when disk space is limited or when downloading the whole dataset is unnecessary.
Change the sampling rate to 16 kHz (required by the Whisper architecture):
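For example:

```python
from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))
```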
Transcript Cleaning
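A minimal sketch; the column name sentence and the punctuation set are dataset-specific assumptions:

```python
import re

def clean_transcript(batch):
    text = batch["sentence"].lower()                # normalize case
    text = re.sub(r'[\,\?\.\!\-\;\:"]', "", text)   # strip common punctuation
    batch["sentence"] = text
    return batch

common_voice = common_voice.map(clean_transcript)
```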
c. Fine-tuned Results
Abbreviations:
- lr = learning rate, wd = weight decay, ws = warmup steps
- ms = max steps, #e = number of epochs
- es = evaluation strategy, ml = max length
- tbz = train batch size, ebz = eval batch size
- #ts = train sample size, #es = eval sample size
| Dataset/Size/Split | Model/Lang/Task | Hyperparameters | Result |
|---|---|---|---|
| common_voice_11_0 #ts=100, #es=100 train/test | Whisper small Hindi Transcribe | lr=1e-5, wd=0, ws=5, ms=40, es=steps, ml=225, tbz=4, ebz=8 | WER: 67.442% |
| common_voice_11_0 #ts=500, #es=500 train+validation/test | Whisper small Hindi Transcribe | lr=1e-5, wd=0, ws=0, ms=60, es=steps, ml=50, tbz=16, ebz=8 | WER: 62.207% |
| common_voice #ts=3500, #es=500 train+validated/validation | Whisper small Japanese Transcribe | lr=1e-6, wd=0, ws=50, ms=3500, es=steps, ml=200, tbz=16, ebz=8 | WER: 2.4% |
| librispeech_asr #ts=750, #es=250 train.100/validation | Whisper medium English Transcribe | lr=1e-5, wd=0.01, ws=10, ms=750, es=steps, ml=80, tbz=1, ebz=1 | WER: 13.095% |
Note: As Japanese is character-based, a more suitable evaluation metric is Character Error Rate (CER).
d. PEFT with LoRA
Parameter-Efficient Fine-tuning (PEFT) approaches only fine-tune a small number of model parameters while freezing most parameters of the pre-trained LLMs, greatly decreasing computational and storage costs.
LoRA (Low-Rank Adaptation) freezes the pre-trained weights and represents the weight updates as products of low-rank matrices, significantly reducing the number of parameters that need to be fine-tuned.
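A sketch following the PEFT examples (Mangrulkar & Paul, 2023; Srivastav, 2023); the rank and target modules are illustrative:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,                                 # rank of the low-rank update matrices
    lora_alpha=64,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the weights remain trainable
```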
PEFT Training Arguments:
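A sketch of the arguments that differ from regular fine-tuning; the values are illustrative:

```python
from transformers import Seq2SeqTrainingArguments

peft_training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-lora",  # hypothetical output path
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=1e-3,              # LoRA typically tolerates a larger learning rate
    warmup_steps=50,
    num_train_epochs=3,
    evaluation_strategy="steps",
    fp16=True,
    generation_max_length=128,
    remove_unused_columns=False,     # keep the columns the data collator needs
    label_names=["labels"],          # must be explicit because PEFT wraps the model
)
```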
PEFT Results:
| Dataset/Size/Split | Model/Lang/Task | Hyperparameters | Result |
|---|---|---|---|
| common_voice_13_0 #ts=1000, #es=100 train+validation/test | Whisper medium Japanese Transcribe | lr=1e-3, wd=0, ws=50, #e=3, es=steps, ml=128, tbz=8, ebz=8 | WER: 73%, NormWER: 70.186% |
| common_voice_13_0 #ts=100, #es=30 train+validation/test | Whisper large-v2 Vietnamese Transcribe | lr=1e-4, wd=0.01, ws=0, #e=3, es=steps, ml=150, tbz=8, ebz=8 | WER: 26.577%, NormWER: 22.523% |
Tip: Resources for PEFT: Parameter-Efficient Fine-Tuning Using PEFT (Mangrulkar & Paul, 2023) and the fast-whisper-finetuning repository (Srivastav, 2023)
e. Loss Curves Visualization
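A sketch that plots the losses logged by the trainer; it assumes logging_steps and eval_steps were set during training:

```python
import matplotlib.pyplot as plt

# trainer.state.log_history stores the logged training and evaluation losses
history = trainer.state.log_history
train_points = [(h["step"], h["loss"]) for h in history if "loss" in h]
eval_points = [(h["step"], h["eval_loss"]) for h in history if "eval_loss" in h]

plt.plot(*zip(*train_points), label="training loss")
plt.plot(*zip(*eval_points), label="validation loss")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.show()
```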
Key patterns to identify:
- Overfitting: Low training loss but high validation loss
- Underfitting: High training and validation loss
- Smoothness: Smooth curves indicate well-behaved training
- Loss Plateau: Model struggles to learn further from available data
f. Baseline Results
| Dataset/Split/Size | Model/Task | Result |
|---|---|---|
| distil-whisper/tedlium-long-form test | Whisper medium baseline en→en | WER: 28.418% |
| distil-whisper/tedlium-long-form validation | Whisper large-v2 baseline en→en | WER: 26.671% |
| librispeech_asr clean test | Whisper large-v2 baseline en→en | WER: 4.746% |
| Aishell S0770 test #353 | Whisper large-v2 baseline zh-CN→zh-CN | CER: 8.595% |
| Aishell S0768 test #367 | Whisper large-v2 baseline zh-CN→zh-CN | CER: 12.379% |
| MagicData 38_5837 test #585 | Whisper large-v2 baseline zh-CN→zh-CN | CER: 21.750% |
4. Speaker Diarization
Speaker Diarization involves segmenting speech audio into distinct segments corresponding to different speakers. The goal is to identify and differentiate individual speakers in an audio stream.
a. Pyannote.audio
Pyannote.audio is an open-source toolkit for speaker diarization, voice activity detection, and speech turn segmentation.
How to use Pyannote.audio with Whisper:
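A minimal sketch of one way to combine the two; the audio file name and Hugging Face token are placeholders, and the midpoint-matching merge is a simplification:

```python
import whisper
from pyannote.audio import Pipeline

# Whisper produces the transcript with segment-level timestamps
asr_model = whisper.load_model("medium")
asr_result = asr_model.transcribe("meeting.wav")

# pyannote assigns speaker labels to time regions (requires a Hugging Face token)
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token="hf_xxx")
diarization = pipeline("meeting.wav")

# naive merge: tag each Whisper segment with the speaker active at its midpoint
for segment in asr_result["segments"]:
    midpoint = (segment["start"] + segment["end"]) / 2
    speaker = next(
        (label for turn, _, label in diarization.itertracks(yield_label=True)
         if turn.start <= midpoint <= turn.end),
        "UNKNOWN",
    )
    print(f'{speaker}: {segment["text"]}')
```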
b. WhisperX
WhisperX integrates Whisper, a phoneme-based model (wav2vec 2.0), and Pyannote.audio. It reports up to 70x real-time transcription speed with Whisper large-v2, and adds word-level timestamps and speaker diarization with VAD preprocessing.
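A sketch following the usage pattern in the WhisperX README; the file name and token are placeholders:

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("meeting.wav")

# 1. transcribe with batched Whisper (faster-whisper backend)
model = whisperx.load_model("large-v2", device, compute_type="int8")
result = model.transcribe(audio, batch_size=16)

# 2. align the output with a phoneme model (wav2vec 2.0) for word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. assign speaker labels with the pyannote-based diarization pipeline
diarize_model = whisperx.DiarizationPipeline(use_auth_token="hf_xxx", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
print(result["segments"])
```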
Tip: Advantages:
- WhisperX: Multi-speaker scenario, VAD, Extra Phoneme model, Easier for local audios
- Whisper Pipeline: More languages, Flexible chunk length (≤30s), Easier for HF datasets
WhisperX Results:
| Dataset | Model/Task/Compute Type | Result |
|---|---|---|
| TED LIUM 1st release SLR7 test | WhisperX medium en→en int8 | WER: 37.041% |
| TED LIUM 1st release SLR7 test | WhisperX large-v2 en→en int8 | WER: 36.917% |
| distil-whisper/tedlium-long-form validation | WhisperX large-v2 en→en int8 batch_size=1 | WER: 24.651% |
| distil-whisper/tedlium-long-form validation | WhisperX medium en→en int8 batch_size=1 | WER: 24.353% |
| AISHELL-4 selected audio file | WhisperX manual check | CER: 15.6%~24.658% |
5. Other Models
a. Meta MMS
The Massively Multilingual Speech (MMS) project by Meta expands speech technology from around 100 languages to more than 1,100 languages.
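A sketch following the Hugging Face MMS documentation; audio_array stands for a 16 kHz waveform loaded elsewhere, and the language code is illustrative (codes for Mandarin and other languages are listed on the model card):

```python
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

model_id = "facebook/mms-1b-all"
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# switch the tokenizer and the language adapter to the target language
processor.tokenizer.set_target_lang("eng")
model.load_adapter("eng")

# audio_array: a 16 kHz waveform loaded elsewhere (e.g. via the datasets or librosa library)
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

ids = torch.argmax(logits, dim=-1)[0]
print(processor.decode(ids))
```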
b. PaddleSpeech
PaddleSpeech is a Chinese open-source toolkit on the PaddlePaddle platform. Available architectures include DeepSpeech2, Conformer, and U2 (Unified Streaming and Non-streaming). See the feature list for details.
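A sketch of the Python usage shown in the PaddleSpeech README; the audio file name is a placeholder:

```python
# CLI equivalent: paddlespeech asr --lang zh --input input_16k.wav
from paddlespeech.cli.asr.infer import ASRExecutor

asr = ASRExecutor()
result = asr(audio_file="input_16k.wav", lang="zh", sample_rate=16000)
print(result)
```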
Tip: ASR training tutorial on Linux: asr1
c. SpeechBrain
SpeechBrain is an open-source conversational AI toolkit developed by the University of Montreal.
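A sketch using a pretrained AISHELL model from SpeechBrain's Hugging Face hub; the audio file name is a placeholder:

```python
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-wav2vec2-transformer-aishell",
    savedir="pretrained_models/asr-wav2vec2-transformer-aishell",
)
print(asr_model.transcribe_file("example_mandarin.wav"))
```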
d. ESPnet
ESPnet is an end-to-end speech processing toolkit covering speech recognition, text-to-speech, speech translation, and speaker diarization.
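A sketch following the espnet_model_zoo usage pattern; the model name must be picked from the zoo's table and is left as a placeholder:

```python
import soundfile
from espnet2.bin.asr_inference import Speech2Text
from espnet_model_zoo.downloader import ModelDownloader

d = ModelDownloader()
speech2text = Speech2Text(**d.download_and_unpack("<model-name-from-the-zoo>"))

speech, rate = soundfile.read("example_16k.wav")  # hypothetical 16 kHz audio file
nbests = speech2text(speech)
text, *_ = nbests[0]
print(text)
```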
e. Baseline Results Comparison
English:
| Dataset | Model/Method | WER |
|---|---|---|
| librispeech_asr clean | Meta MMS mms-1b-all | 4.331% |
| common_voice_13_0 #1000 | Meta MMS mms-1b-all | 23.963% |
Chinese:
| Dataset | Model/Method | CER |
|---|---|---|
| Aishell S0770 #353 | PaddleSpeech Default (conformer_u2pp_online_wenetspeech) | 4.062% |
| Aishell S0768 #367 | SpeechBrain wav2vec2-transformer-aishell | 8.436% |
| Aishell S0768 #367 | Meta MMS mms-1b-all | 34.241% |
| MagicData 4 speakers #2372 | PaddleSpeech conformer-wenetspeech | 9.79% |
| MagicData 4 speakers #2372 | SpeechBrain wav2vec2-ctc-aishell | 15.911% |
| MagicData 4 speakers #2372 | Whisper large-v2 baseline | 24.747% |
Key Finding: For Chinese inference, PaddleSpeech had better performance compared to Whisper, while Meta MMS Chinese transcription results were worse than Whisper.
6. Azure Speech Studio
Azure AI Speech Services is a collection of cloud-based speech-related services offered by Microsoft Azure. Custom Speech Projects in Speech Studio can be created in different languages.
a. Upload Datasets
Three methods for uploading training and testing datasets:
- Speech Studio (direct upload)
- REST API
- CLI usage
Azure Blob Storage:
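A sketch using the azure-storage-blob SDK to upload a zipped dataset; the connection string, container, and file names are placeholders:

```python
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="speech-datasets", blob="training_data.zip")

with open("training_data.zip", "rb") as f:
    blob.upload_blob(f, overwrite=True)
# the blob URL (with a SAS token) can then be referenced when creating the dataset in Speech Studio
```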
Audio format requirements:
- Format: WAV
- Sampling rate: 8k Hz or 16k Hz
- Channels: Single channel (mono)
- Archive: ZIP format, under 2 GB and at most 10,000 files
b. Train and Deploy Models
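Training is launched in Speech Studio (or through the REST API) by selecting the uploaded datasets; the trained model is then deployed to a custom endpoint. A sketch of calling a deployed endpoint with the Speech SDK, where the key, region, and endpoint id are placeholders:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="<speech-key>", region="<region>")
speech_config.endpoint_id = "<custom-speech-endpoint-id>"  # id of the deployed custom model
speech_config.speech_recognition_language = "zh-CN"

audio_config = speechsdk.audio.AudioConfig(filename="test_16k.wav")  # hypothetical test file
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

result = recognizer.recognize_once()
print(result.text)
```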
c. Azure Results
| Test Dataset | Train Datasets | Error Rate (Custom / Baseline) |
|---|---|---|
| MagicData 9452 11:27:39s | Aishell 12+ hrs | 4.69% / 4.24% |
| MagicData 9452 11:27:39s | Aishell+Minds14 32+ hrs: 1+ hr | 4.67% / 4.23% |
| MagicData+Aishell+CV13 8721 11:45:52s | Aishell+CV13 8+ hrs: 7+ hrs | 2.51% / 3.70% |
| MagicData+Aishell+CV13 8721 11:45:52s | Aishell+CV13+Fleurs 8+ hrs: 7+ hrs: 9+ hrs | 2.48% / 3.70% |
Note: The best Azure model was trained with AISHELL-1, mozilla-foundation/common_voice_13_0 and google/fleurs, resulting in 2.48% error rate.
7. Prospect
Key findings and future directions:
Data sources: Chinese sources with high transcript quality are much less available than English sources.
Hardware limitations: Multi-GPU training or more advanced GPUs (NVIDIA 40 series) could help achieve better results with larger models.
LoRA configurations: Effects of different LoRA parameters on PEFT model performance could be explored further.
Speaker Diarization: While Pyannote.audio with Whisper integration shows potential, current diarizing ability in multi-speaker meeting scenarios is still not sufficient.
Azure Speech Services: maintain good audio quality and word-level accuracy in transcripts; filtering out low-quality training audio files can further improve model performance.
8. References
- Anaconda, Inc. (2017). Command reference - conda documentation. conda.io/projects/conda/en/latest/commands
- OpenAI (2022, September 21). Introducing Whisper. openai.com/research/whisper
- Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision.
- Gandhi, S. (2022, November 3). Fine-Tune Whisper for Multilingual ASR with Transformers. huggingface.co/blog/fine-tune-whisper
- The Linux Foundation (2023). Previous PyTorch Versions. pytorch.org/get-started/previous-versions
- Hugging Face, Inc. (2023). Hugging Face Documentations. huggingface.co/docs
- Srivastav, V. (2023). fast-whisper-finetuning. github.com/Vaibhavs10/fast-whisper-finetuning
- Mangrulkar, S., & Paul, S. (2023). Parameter-Efficient Fine-Tuning Using PEFT. huggingface.co/blog/peft
- Bredin, H., et al. (2020). pyannote.audio: neural building blocks for speaker diarization. ICASSP 2020.
- Bain, M., Huh, J., Han, T., & Zisserman, A. (2023). WhisperX: Time-Accurate Speech Transcription of Long-Form Audio. INTERSPEECH 2023.
- Meta AI (2023, May 22). Introducing speech-to-text, text-to-speech, and more for 1,100+ languages. ai.meta.com/blog/multilingual-model-speech-recognition
- Pratap, V., et al. (2023). Scaling Speech Technology to 1,000+ Languages. arXiv.
- Zhang, H. L. (2022). PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit. NAACL 2022.
- Ravanelli, M., et al. (2021). SpeechBrain: A General-Purpose Speech Toolkit.
- Gao, D., et al. (2022). EURO: ESPnet Unsupervised ASR Open-source Toolkit. arXiv:2211.17196.
- ESPnet (2021). espnet_model_zoo. github.com/espnet/espnet_model_zoo
- Microsoft (2023). Custom Speech overview - Azure AI Services. learn.microsoft.com/en-us/azure/ai-services/speech-service/custom-speech-overview
- Microsoft (2023). Speech service documentation. learn.microsoft.com/en-us/azure/ai-services/speech-service/


