
Fine-tuning ASR Models for Japanese and Vietnamese

Vikas Reddy - University of Maryland · 6 min read

Findings from fine-tuning Vietnamese and Japanese speech recognition with OpenAI Whisper and Azure Speech Studio.

Abstract

Against the backdrop of rapidly evolving communication technology, recent breakthroughs such as OpenAI's Whisper model have markedly improved the accuracy and accessibility of multilingual speech-to-text. Recognition accuracy still leaves room for improvement, however. This work focuses on optimizing automatic speech recognition (ASR) performance for Vietnamese and Japanese.

We evaluate with the standard metrics: word error rate (WER) for Vietnamese and character error rate (CER) for Japanese.
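
For reference, WER = (S + D + I) / N, where S, D, and I are the substituted, deleted, and inserted words relative to a reference of N words; CER applies the same formula at the character level, which suits Japanese text that has no explicit word boundaries.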

Key results:

  • Vietnamese (FOSD + Common Voice + Google Fleurs + Vivos): 9.46% WER
  • Japanese (ReazonSpeech + Common Voice + Google Fleurs): 8.15% CER

Table of Contents

  1. Background
  2. Environment Setup
  3. Loading the Datasets
  4. Data Preprocessing
  5. Training
  6. Parameter-Efficient Fine-tuning
  7. Results
  8. Evaluation
  9. Azure Speech Studio
  10. Conclusion

1. Background

Communication and technology are indispensable today, yet accessibility, inclusiveness, and efficient knowledge sharing still face many challenges. Advances in automatic speech recognition (ASR) are simplifying human-machine interaction, most visibly in online meetings.

ASR is the task of converting a speech signal into its corresponding text. With large speech corpora and their transcriptions becoming widely available in recent years, many companies have turned their attention to this field.

The OpenAI Whisper architecture

OpenAI Whisper is a Transformer-based encoder-decoder model with a sequence-to-sequence design. It takes audio spectrogram features as input and maps them to a sequence of text tokens. The process, illustrated by the short sketch after the list below, is:

  1. A feature extractor converts the raw audio into a log-Mel spectrogram
  2. The Transformer encoder produces a sequence of encoder hidden states
  3. The decoder predicts text tokens via cross-attention over the encoder states
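
As a minimal illustration of this flow (a sketch using the small openai/whisper-small checkpoint and a silent placeholder clip, not the fine-tuned models discussed later):

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Placeholder: 5 seconds of silence at 16 kHz stands in for real speech
waveform = torch.zeros(80000).numpy()

# 1) feature extractor -> log-Mel spectrogram features
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
# 2) + 3) encoder-decoder generation of text token ids, decoded back to a string
predicted_ids = model.generate(inputs.input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))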

2. Environment Setup

Whisper can be fine-tuned in two ways: in Google Colab, or by running the code on a local PC.

Required packages

python -m pip install -U pip
pip install evaluate pandas numpy huggingface_hub pydub tqdm spacy ginza ja_ginza audiomentations
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install "datasets>=2.6.1"
pip install git+https://github.com/huggingface/transformers
pip install librosa
pip install "evaluate>=0.30"
pip install jiwer
pip install gradio
pip install -q bitsandbytes datasets accelerate loralib
pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git@main

Note: the fine-tuning experiments in this article ran on Windows 11 Pro with an AMD Ryzen 7 3700X 8-core CPU, 80 GB of RAM, and a GeForce RTX 3090 GPU.

3. Loading the Datasets

Option 1: loading from Hugging Face

from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()
common_voice["train"] = load_dataset(
    "mozilla-foundation/common_voice_11_0", "ja",
    split="train+validation", use_auth_token=True
)
common_voice["test"] = load_dataset(
    "mozilla-foundation/common_voice_11_0", "ja",
    split="test", use_auth_token=True
)
common_voice = common_voice.remove_columns([
    "accent", "age", "client_id", "down_votes",
    "gender", "locale", "path", "segment", "up_votes"
])

Option 2: preparing the dataset manually

import os, csv, codecs

def text_change_csv(input_path, output_path):
    # Convert a pipe-delimited transcript file into a CSV saved next to the input file
    file_csv = os.path.splitext(output_path)[0] + ".csv"
    output_dir = os.path.dirname(input_path)
    output_file = os.path.join(output_dir, file_csv)
    encodings = ["utf-8", "latin-1"]

    # Try each encoding in turn; fall through on decode errors
    for encoding in encodings:
        try:
            with open(input_path, 'r', encoding=encoding) as rf:
                with codecs.open(output_file, 'w', encoding=encoding, errors='replace') as wf:
                    writer = csv.writer(wf, delimiter=',')
                    for read_text in rf.readlines():
                        writer.writerow(read_text.split('|'))
            print(f"CSV has been created using encoding: {encoding}")
            return True
        except UnicodeDecodeError:
            continue
    return False
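
Once the metadata is in CSV form, it can be turned into a Hugging Face dataset. A minimal sketch, assuming hypothetical train.csv/test.csv files with an `audio` column holding file paths and a `transcription` column:

from datasets import load_dataset, Audio

dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
# Casting the path column to Audio makes the loader decode and resample the files on access
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))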

Datasets used

Dataset | Language | Source | Speech duration
Common Voice 13.0 | Vietnamese, Japanese | Hugging Face | 19 h (VN), 10 h (JP)
Google Fleurs | Vietnamese, Japanese | Hugging Face | 11 h (VN), 8 h (JP)
Vivos | Vietnamese | Hugging Face | 15 h
FPT Open Speech Dataset | Vietnamese | Download and extract | 30 h
VLSP2020 | Vietnamese | Download and extract | 100 h
ReazonSpeech | Japanese | Hugging Face | 5 h
JSUT | Japanese | Download and extract | 10 h
JVS | Japanese | Download and extract | 30 h
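
The experiments combine several of these corpora into a single training set. A sketch of merging two Hugging Face datasets, assuming both have already been reduced to matching `audio` and `transcription` columns (the `fleurs` variable is hypothetical):

from datasets import concatenate_datasets

# Both datasets must expose identical features (column names and types) before merging
combined_train = concatenate_datasets([common_voice["train"], fleurs["train"]])
combined_train = combined_train.shuffle(seed=42)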

4. Data Preprocessing

Data augmentation

from datasets import Audio
from audiomentations import Compose, AddGaussianNoise, TimeStretch, PitchShift

# Resample every clip to the 16 kHz rate Whisper expects
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

augment_waveform = Compose([
    AddGaussianNoise(min_amplitude=0.005, max_amplitude=0.015, p=0.2),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.2, leave_length_unchanged=False),
    PitchShift(min_semitones=-4, max_semitones=4, p=0.2)
])

def augment_dataset(batch):
    audio = batch["audio"]["array"]
    augmented_audio = augment_waveform(samples=audio, sample_rate=16000)
    batch["audio"]["array"] = augmented_audio
    return batch

common_voice['train'] = common_voice['train'].map(augment_dataset, keep_in_memory=True)

Transcript normalization

import string

def remove_punctuation(sentence):
    translator = str.maketrans('', '', string.punctuation)
    modified_sentence = sentence.translate(translator)
    return modified_sentence

def fix_sentence(sentence):
    transcription = sentence
    if transcription.startswith('"') and transcription.endswith('"'):
        transcription = transcription[1:-1]
    transcription = remove_punctuation(transcription)
    transcription = transcription.lower()
    return transcription
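
A quick check of the helpers above (note that string.punctuation only covers ASCII symbols, so Japanese punctuation such as 「」 or 。 is left untouched):

print(fix_sentence('"Xin chào, thế giới!"'))  # -> xin chào thế giới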

Preparing the Whisper inputs
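
The prepare_dataset function below relies on a Whisper processor (feature extractor plus tokenizer) created beforehand. A sketch, assuming the large-v2 checkpoint and Japanese as the target language (swap the language for the Vietnamese runs); note also that the transcript column is called `transcription` here, while Common Voice names it `sentence`:

from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-large-v2", language="Japanese", task="transcribe"
)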

def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["input_length"] = len(audio["array"]) / audio["sampling_rate"]

    transcription = fix_sentence(batch["transcription"])
    batch["labels"] = processor.tokenizer(
        transcription, max_length=225, truncation=True
    ).input_ids
    return batch

common_voice = common_voice.map(
    prepare_dataset,
    remove_columns=common_voice.column_names['train'],
    num_proc=1,
    keep_in_memory=True
)

5. Training

Data Collator

import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

Evaluation metric (Vietnamese: WER)

import evaluate

metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    # tokenizer is the Whisper tokenizer (processor.tokenizer) created earlier;
    # restore the -100 padding to real pad tokens before decoding
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}

Evaluation metric (Japanese: CER)

import evaluate
import spacy, ginza

metric = evaluate.load("cer")
nlp = spacy.load("ja_ginza")
ginza.set_split_mode(nlp, "C")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    # Segment the Japanese text with GiNZA before scoring
    pred_str = [" ".join([str(i) for i in nlp(j)]) for j in pred_str]
    label_str = [" ".join([str(i) for i in nlp(j)]) for j in label_str]

    cer = 100 * metric.compute(predictions=pred_str, references=label_str)
    # The key stays "wer" so it matches metric_for_best_model in the training arguments
    return {"wer": cer}

Training arguments

from transformers import Seq2SeqTrainingArguments

model.config.dropout = 0.05

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-fine-tuned",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-6,
    lr_scheduler_type='linear',
    optim="adamw_bnb_8bit",
    warmup_steps=200,
    num_train_epochs=5,
    gradient_checkpointing=True,
    evaluation_strategy="steps",
    fp16=True,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=255,
    eval_steps=500,
    logging_steps=500,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
    save_total_limit=1
)

Tip: key training parameters:

  • learning_rate: 1e-5 or 1e-6 works best
  • warmup_steps: roughly 10% of the total training steps is a good default
  • per_device_train_batch_size: set according to GPU memory (16 fits on an RTX 3090)
  • dropout: 0.05 or 0.10 to curb overfitting
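
For completeness, a sketch of wiring these pieces into the Hugging Face Seq2SeqTrainer (object names taken from the snippets above; this mirrors the standard fine-tuning recipe rather than the author's exact script):

from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)
trainer.train()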

6. Parameter-Efficient Fine-tuning (PEFT)

PEFT trains only **about 1% of all parameters** yet reaches performance comparable to full fine-tuning.

Full fine-tuning | Parameter-efficient fine-tuning
Faster training | Longer training time
Requires substantial compute | Modest compute requirements
Retrains the entire model | Updates only a small set of parameters
Prone to overfitting | Less prone to overfitting

LoRA configuration

from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

model = WhisperForConditionalGeneration.from_pretrained(
    model_name_or_path, load_in_8bit=True, device_map="auto"
)
model = prepare_model_for_int8_training(model)

def make_inputs_require_grad(module, input, output):
    output.requires_grad_(True)

model.model.encoder.conv1.register_forward_hook(make_inputs_require_grad)

config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none"
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
# Output: trainable params: 15728640 || all params: 1559033600 || trainable%: 1.01%
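
When the LoRA-wrapped model is trained with Seq2SeqTrainer, the PEFT examples additionally pass two flags in the training arguments so that the columns produced by prepare_dataset and the label tensors reach the model intact. A sketch showing only the PEFT-specific additions (the output path is hypothetical):

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-lora",
    remove_unused_columns=False,  # keep input_features/labels added by prepare_dataset
    label_names=["labels"],       # tell the Trainer which field holds the targets
    # ... remaining arguments as in section 5
)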

7. Results

Vietnamese results

[Figure: Vietnamese Whisper Large results]

The model fine-tuned on FOSD + Google Fleurs + Vivos + CV achieved the lowest WER, 9.46%.

Japanese results

[Figure: Japanese Whisper Large results]

The model fine-tuned on JSUT + ReazonSpeech + Google Xtreme + CV achieved the lowest CER, 8.15%.

Loss curves

[Figure: optimization loss curves]

8. Evaluation

Vietnamese evaluation

[Figure: Vietnamese evaluation results]

Across the evaluation datasets, the Google Fleurs + Common Voice + Vivos combination achieved the lowest CER of 7.84%, showing very high transcription accuracy.

Japanese evaluation

[Figure: Japanese evaluation results]

The ReazonSpeech + Google Xtreme + CV combination achieved the lowest CER of 7.44%.

Converting to Faster-Whisper

from ctranslate2.converters import TransformersConverter
from faster_whisper import WhisperModel

model_id = "./whisper-fine-tuned/checkpoint-5000"
output_dir = "whisper-ct2"

converter = TransformersConverter(model_id, load_as_float16=True)
converter.convert(output_dir, quantization="float16")
model = WhisperModel(output_dir, device="cuda", compute_type="float16")

Note: Faster-Whisper delivers roughly 40% faster inference than the standard fine-tuned Whisper while keeping the same accuracy.
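
Transcription with the converted model then goes through faster-whisper's own API. A brief usage sketch (the audio path and decoding options are placeholders):

segments, info = model.transcribe("sample.wav", beam_size=5, language="ja")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")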

9. Azure Speech Studio

Azure Speech Studio offers an alternative approach to fine-tuning an ASR model.

Transcribing with Azure

import os, evaluate
from azure.cognitiveservices.speech import SpeechConfig, SpeechRecognizer, AudioConfig

subscription_key = "your_subscription_key"
location = "japaneast"
endpoint = "your_endpoint"
wav_base_path = "path/to/wav/files"  # folder holding the evaluation audio

# A single config carries the region, language, and the custom-model endpoint
speech_config = SpeechConfig(
    subscription=subscription_key,
    region=location,
    speech_recognition_language="ja-JP"
)
speech_config.endpoint_id = endpoint

predictions = []
for root, _, files in os.walk(wav_base_path):
    for file_name in files:
        if file_name.endswith(".wav"):
            audio_file_path = os.path.join(root, file_name)
            audio_config = AudioConfig(filename=audio_file_path)
            speech_recognizer = SpeechRecognizer(
                speech_config=speech_config,
                audio_config=audio_config
            )
            result = speech_recognizer.recognize_once()
            if result.text:
                predictions.append(result.text)
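
The collected predictions can then be scored the same way as the Whisper outputs. A sketch assuming a `references` list that holds the ground-truth transcripts in the same order:

wer_metric = evaluate.load("wer")
wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}%")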

Azure results

Vietnamese: the model trained on Common Voice 14.0 reached a WER of 7.33%.

Japanese: the model trained on JSUT reached a CER of 6.97%.

Note: although Azure Speech Studio can reach a lower WER during training, Whisper tends to generalize better to diverse and complex unseen audio.

10. Conclusion

Fine-tuning the Whisper ASR model proves to be an effective technique for improving performance. The main findings:

  1. Fine-tuning delivers consistent gains (Vietnamese WER 7.33-12.15%, Japanese CER 8.15-17.93%)
  2. Data augmentation with the audiomentations library introduces valuable diversity into the training data
  3. Dataset quality is critical: the amount of data, audio clarity, and topical variety all affect the final performance
  4. Whisper holds up better in real-world scenarios, outperforming Azure on unseen data

References

  1. Radford, A., et al. (2022). Robust speech recognition via large-scale weak supervision. arXiv:2212.04356
  2. Ardila, R., et al. (2020). Common Voice: A Massively-Multilingual Speech Corpus. arXiv:1912.06670
  3. Conneau, A., et al. (2022). FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech. arXiv:2205.12446
  4. Gandhi, S. (2022). Fine-Tune Whisper for Multilingual ASR with Transformers. Hugging Face Blog
  5. Mangrulkar, S. & Paul, S. Parameter-Efficient Fine-Tuning Using PEFT. Hugging Face Blog