Evaluating Speaker Diarization Models: A Practical Comparison

Test Setup: 6 scenarios, 5 languages (EN, JA, KO, VI, ZH), 3 models Hardware: NVIDIA GeForce RTX 4090 Date: December 2025

TL;DR

We evaluated three speaker diarization models across six scenarios:

Model	Description	Avg DER	Avg RTF
NeMo Neural (MSDD)	Multi-Scale Diarization Decoder with neural refinement	0.081	0.020
NeMo Clustering	Clustering-only approach without MSDD	0.103	0.010
Pyannote 3.1	End-to-end diarization pipeline	0.181	0.027

Key Findings:

NeMo Neural provides best accuracy with fast processing
Japanese benefits from longer context: Performance improves on 30min+ audio
Multilingual without Japanese performs excellently (DER: 0.050)

1. Introduction

We needed to choose a diarization model for production. Our evaluation covers 6 scenarios representing real-world conditions:

Different audio lengths (10 minutes to 1 hour)
Varying speaker counts (4 to 14 speakers)
Different overlap levels (0% to 40%)
Multilingual audio mixing

2. Models Under Test

NeMo Neural (MSDD)

TitaNet-large for 192-dimensional speaker embeddings
Processes audio at 5 temporal scales (1.0s-3.0s windows)
MSDD neural network refines initial clustering results
Average RTF: ~0.015-0.032

NeMo Clustering (Pure)

Same embedding model (TitaNet-large)
Uses only spectral clustering without MSDD refinement
Significantly faster due to skipping neural refinement
Average RTF: ~0.014-0.028

Pyannote 3.1

End-to-end pipeline with VAD, segmentation, and clustering
Uses pyannote/segmentation-3.0 and wespeaker models
Average RTF: ~0.018-0.043

3. Evaluation Setup

3.1 Test Scenarios

Scenario	Duration	Speakers	Overlap	Purpose
Long Audio	10min	4-5	15%	Standard production
Very Long	30min	10-12	15%	Stress test
1-Hour Audio	60min	12-14	15%	Extreme duration
High Overlap	15min	8-10	40%	Worst-case overlap
Multilingual (5-lang)	15min	8	20%	EN+JA+KO+VI+ZH
Multilingual (4-lang)	15min	8	20%	EN+KO+VI+ZH (no JP)

3.2 Metrics

Accuracy Metrics:

DER Full (collar=0.0s): Strictest metric, no boundary tolerance
DER Fair (collar=0.25s): Primary metric with 250ms tolerance
DER Forgiving (collar=0.25s, overlap ignored): Most lenient

DER Components:

Miss Rate: Speech missed by the system
False Alarm Rate: Non-speech marked as speech
Confusion Rate: Speech assigned to wrong speaker

4. Overall Performance

4.1 Accuracy Comparison

Overall DER comparison across all scenarios

Key Observations:

NeMo Neural is ~55% more accurate than Pyannote (DER: 0.081 vs 0.181)
NeMo Clustering performs nearly as well as Neural (only 27% worse)
Pyannote has 3.4x higher confusion rate

4.2 Speed Comparison

Processing speed comparison (RTF - lower is faster)

NeMo Clustering is fastest (RTF 0.010)
NeMo Neural is very fast (RTF 0.020)
All models are much faster than real-time

4.3 Accuracy vs Speed Trade-off

Accuracy vs Speed trade-off visualization

Major Finding: NeMo Neural achieves best accuracy with fast speed, making it the clear winner for most use cases.

5. Results by Scenario

5.1 Long Audio (10 minutes)

NeMo Neural Results by Language:

EN: 0.019 (Excellent)
JA: 0.157 (8.3x harder than English)
KO: 0.046
VI: 0.037
ZH: 0.053
Average: 0.062

5.2 Very Long Audio (30 minutes)

Critical Discovery - Japanese Benefits from Longer Context:

10min audio: DER 0.157 (8.3x harder than English)
30min audio: DER 0.067 (2.9x harder than English)

Extended duration provides better acoustic context for pitch-accent language modeling.

5.3 High Overlap (40%)

NeMo Neural and Clustering perform virtually identically (DER: 0.114 vs 0.115)
Pyannote struggles more (DER: 0.202, ~77% worse than NeMo)
Japanese remains the hardest language (DER: 0.232)

6. Language-Specific Analysis

Overall language difficulty ranking

Key Observations:

Japanese is universally hardest (5.0x harder than English on average)
English is easiest (DER: 0.037)
Vietnamese is close second (only 1.1x harder than English)

Why Japanese is Difficult

Japanese performance across different audio lengths

Hypotheses:

Pitch-accent language: Pitch carries linguistic meaning, confusing speaker embeddings
Narrow phonetic inventory: ~100 mora vs thousands of English phonemes
Shorter syllable durations: Less temporal context per speaking turn

7. Neural vs Clustering

Neural vs Clustering performance comparison

Key Findings:

Clustering is only 3% worse on average
Clustering is 2x faster in processing
The speed/accuracy trade-off is minimal

Recommendation:

Use NeMo Neural for best accuracy
Use NeMo Clustering for maximum speed (2x faster, 3% worse)

8. Multilingual Performance

8.1 The Japanese Effect

Multilingual performance with and without Japanese

Key Insight: Japanese is the primary factor making multilingual diarization difficult.

Configuration	NeMo Neural DER
With Japanese (5-lang)	0.142
Without Japanese (4-lang)	0.050

8.2 Error Analysis

Error breakdown with vs without Japanese

Why 4-Language Multilingual Works Well:

More acoustic diversity helps VAD detect speech boundaries
Language changes provide natural segment boundaries
EN, KO, VI, ZH have compatible acoustic features
Japanese’s pitch-accent features cause cross-language speaker confusion

9. Conclusion

Key Takeaways

NeMo Neural is the clear winner:

Best accuracy: DER 0.081 average
Fast processing: RTF 0.020 (50x faster than real-time)
Excellent multilingual without Japanese: DER 0.050

Critical Findings:

Japanese benefits dramatically from longer context (30min optimal)
Multilingual with Japanese is challenging (DER 0.142) but manageable
MSDD neural refinement provides minimal benefit over clustering (27% better)
All models are fast and production-ready

Recommendations

Use Case	Model	Reason
Best accuracy	NeMo Neural	DER 0.081
Maximum speed	NeMo Clustering	2x faster
Long audio (30min-1h)	NeMo Neural	Handles complexity
Multilingual (no Japanese)	NeMo Neural	DER 0.050
Japanese (30min+)	NeMo Neural	Context helps

Default Choice: NeMo Neural - best accuracy with fast processing.