
How we improved vLLM inference throughput by 82% using AsyncLLMEngine and right-sized continuous batching
Recap: The Problem
In Part 1, we identified the bottleneck: our FastAPI service used multiprocessing workers with IPC queues to distribute translation tasks. This created:
- Queue serialization overhead
- GPU compute contention between worker processes
- Spiky GPU utilization pattern
Baseline: 2.2 RPS at 25 concurrent requests
The path forward: eliminate multiprocessing and use vLLM's batched inference.
Attempt 2: Static Batching
We implemented static batching within the existing worker processes.
Implementation
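In outline, it looked roughly like this. A minimal sketch, assuming each worker pulls (request_id, prompt) tuples from a multiprocessing task queue and owns its own synchronous vllm.LLM instance; the model name and sampling parameters are placeholders:

```python
# Minimal sketch of the static batching loop inside one worker process
# (illustrative, not our exact code). Assumes (request_id, prompt) tuples on the
# task queue; results go back as (request_id, translation) tuples.
import queue
import time

from vllm import LLM, SamplingParams

BATCH_SIZE = 16          # collect up to 16 requests per batch
BATCH_TIMEOUT_S = 0.050  # but never wait more than 50ms for a full batch

def worker_loop(task_queue, result_queue):
    llm = LLM(model="our-translation-model")      # placeholder model name
    sampling = SamplingParams(max_tokens=256)     # placeholder sampling params

    while True:
        batch = [task_queue.get()]                # block until at least one request
        deadline = time.monotonic() + BATCH_TIMEOUT_S
        while len(batch) < BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(task_queue.get(timeout=remaining))
            except queue.Empty:
                break                             # timeout hit: run a partial batch

        prompts = [prompt for _, prompt in batch]
        outputs = llm.generate(prompts, sampling)  # vLLM runs the sequences together
        for (request_id, _), output in zip(batch, outputs):
            result_queue.put((request_id, output.outputs[0].text))
```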
Key points:
- Batch size: 16 requests
- Timeout: 50ms (don't wait indefinitely for a full batch)
- vLLM processes multiple sequences together
- Still uses multiprocessing workers
Results

Throughput rose from 2.2 RPS to 5.9 RPS, nearly a 3x improvement. Per-request inference time: 452ms → 171ms.
Trade-offs
Pros:
- Massive throughput gains
- GPU better utilized
- Simple implementation
Cons:
- Head-of-line blocking: All requests wait for the slowest one
- With variable-length inputs, short translations wait for long ones
- Example: output lengths of [50 tokens, 50 tokens, 200 tokens] – the first two wait for the 200-token translation to finish
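Putting rough numbers on that example (the per-token decode time is an assumed constant, purely for illustration):

```python
# Toy illustration of head-of-line blocking: with static batching, every request
# in the batch waits for the longest output. Per-token decode time is assumed.
per_token_ms = 10
output_lengths = [50, 50, 200]  # tokens

static = [max(output_lengths) * per_token_ms for _ in output_lengths]
continuous = [n * per_token_ms for n in output_lengths]

print(static)      # [2000, 2000, 2000] ms - the two short requests wait ~4x longer
print(continuous)  # [500, 500, 2000] ms   - each request returns as soon as it's done
```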
This was good progress, but we wanted to eliminate the head-of-line blocking issue.
Attempt 3: Continuous Batching
The solution: vLLM’s AsyncLLMEngine with continuous batching.
What is Continuous Batching?
Unlike static batching, continuous batching composes batches dynamically:
- New requests join mid-generation
- Completed requests leave immediately (don’t wait for others)
- Batch composition updates every token
- vLLM’s AsyncLLMEngine handles this automatically
No head-of-line blocking. Short translations return as soon as they’re done.
Implementation
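A minimal sketch of the integration, with the engine created once and shared across all requests; the model name, request schema, and sampling parameters are placeholders:

```python
# Sketch of the AsyncLLMEngine + FastAPI integration (illustrative, simplified).
import uuid

from fastapi import FastAPI
from pydantic import BaseModel
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="our-translation-model")  # placeholder model name
)

class TranslationRequest(BaseModel):
    text: str

@app.post("/translate")
async def translate(req: TranslationRequest) -> dict:
    sampling = SamplingParams(max_tokens=256)       # placeholder sampling params
    request_id = str(uuid.uuid4())

    # engine.generate() returns an async generator; vLLM schedules this request
    # into the continuous batch alongside everything else in flight, and we
    # simply await the final output.
    final_output = None
    async for output in engine.generate(req.text, sampling, request_id):
        final_output = output

    return {"translation": final_output.outputs[0].text}
```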
Architecture change:
- AsyncLLMEngine used directly in FastAPI
- vLLM handles batching internally via continuous batching engine
- Pure async/await throughout
Testing Reality Check
Initial Results (Uniform Inputs)
We first tested with uniform-length inputs (all prompts of similar length):

15 RPS vs the 2.2 RPS baseline – nearly a 7x improvement. This looked great.
Variable-Length Inputs (Reality)
Then we tested with realistic variable-length inputs (10-200 tokens, mixed short and long):
Baseline re-run with variable inputs:
- Very heavy load: 1.1 RPS (vs 2.2 RPS with uniform inputs)
- Even the baseline performed worse with realistic data
Continuous batching with variable inputs (the same engine that gave us 15 RPS with uniform inputs, initially at max_num_seqs=64):
- Very heavy load: 3.5 RPS, even after tuning max_num_seqs down to 16
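For reference, a mixed-length test set along these lines can be generated in a few lines of Python; token counts are approximated by word counts here, and the vocabulary and prompt template are placeholders:

```python
# Illustrative generator for a mixed-length test set in the 10-200 token range.
# Real runs should sample from production-like text instead of a toy vocabulary.
import random

def make_variable_length_prompts(n: int, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    vocab = ["alpha", "beta", "gamma", "delta", "epsilon"]
    prompts = []
    for _ in range(n):
        length = rng.randint(10, 200)  # mixed short and long inputs
        body = " ".join(rng.choice(vocab) for _ in range(length))
        prompts.append(f"Translate to German: {body}")
    return prompts
```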

Configuration Tuning
The poor performance with max_num_seqs=64 led us to analyze vLLM’s internal metrics.
What We Found
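A rough sketch of pulling the relevant gauges from vLLM's Prometheus metrics, assuming the /metrics endpoint is exposed; metric names can vary between vLLM versions:

```python
# Rough sketch, not our exact tooling: polls the Prometheus /metrics endpoint
# (the URL is a placeholder) for the gauges we watched during the analysis.
import requests

WATCHED = (
    "vllm:num_requests_running",  # sequences currently in the batch
    "vllm:num_requests_waiting",  # requests queued behind them
    "vllm:gpu_cache_usage_perc",  # KV cache utilization
)

def snapshot(metrics_url: str = "http://localhost:8000/metrics") -> dict[str, float]:
    values = {}
    for line in requests.get(metrics_url, timeout=5).text.splitlines():
        if line.startswith(WATCHED):
            name, value = line.rsplit(" ", 1)
            values[name] = float(value)
    return values

if __name__ == "__main__":
    # Compare num_requests_running against max_num_seqs to see how many of the
    # scheduler's slots are actually being filled.
    print(snapshot())
```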
The issue:
- Actual workload: 2-20 concurrent requests per server (production peak ~20)
- Configuration: max_num_seqs=64
- Result: most of the 64 sequence slots sat empty, adding scheduler overhead
What happens with oversized config:
- KV cache pre-allocated for 64 sequences
- vLLM scheduler manages 64 slots but only uses 5-10
- Decode time per token increases
- Memory wasted on unused sequence slots
- Scheduler overhead for empty slots
Tuning Approach
Following the vLLM continuous batching tuning guide:
- Measure actual concurrent request distribution in production
- Start with max_num_seqs=1, gradually increase: 2 → 4 → 8 → 16 → 32
- Monitor decode time and tail latency at each step
- Stop when performance degrades (a minimal sweep harness is sketched after this list)
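A minimal harness for one step of that sweep might look like the following, run once per max_num_seqs value so GPU memory is released between runs; the model name, workload, and sampling parameters are placeholders:

```python
# Sketch under stated assumptions: measures throughput for a single max_num_seqs
# value per invocation (e.g. `python sweep.py 16`); the prompt set is a placeholder.
import asyncio
import sys
import time
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

async def run_one(engine: AsyncLLMEngine, prompt: str) -> None:
    # Drain the async generator; the last item holds the finished output.
    async for _ in engine.generate(prompt, SamplingParams(max_tokens=256), str(uuid.uuid4())):
        pass

async def main(max_num_seqs: int) -> None:
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="our-translation-model", max_num_seqs=max_num_seqs)
    )
    prompts = [f"Translate to German: sample input {i}" for i in range(100)]  # placeholder workload
    start = time.monotonic()
    await asyncio.gather(*(run_one(engine, p) for p in prompts))
    elapsed = time.monotonic() - start
    print(f"max_num_seqs={max_num_seqs}: {len(prompts) / elapsed:.2f} RPS")

if __name__ == "__main__":
    asyncio.run(main(int(sys.argv[1])))
```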
| max_num_seqs | Result |
|---|---|
| 8 | Good latency, but throughput limited |
| 16 | Best balance |
| 32 | Decode time increased, tail latency worse |
Final Configuration
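Expressed as engine arguments, the final settings come down to the three values explained below; the model name and anything not discussed in this post are placeholders:

```python
# Sketch of the final engine configuration (model name is a placeholder).
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="our-translation-model",    # placeholder model name
    max_num_seqs=16,                  # matched to ~20 peak concurrent requests per server
    max_num_batched_tokens=8192,      # reduced from 16384 for our sequence lengths
    gpu_memory_utilization=0.3,       # ~10GB of a 32GB RTX 5090 for weights + KV cache
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```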
Configuration Rationale
max_num_seqs=16:
- Production peak: ~20 concurrent requests per server
- Testing: Validated up to 25 concurrent
- Provides headroom without wasting resources
- Scheduler overhead matched to actual load
max_num_batched_tokens=8192:
- Reduced from default 16384
- Better suited for our average sequence lengths
- Reduces memory pressure
gpu_memory_utilization=0.3:
- Allocates ~10GB VRAM for model + KV cache on RTX 5090 (32GB)
- Tracked via vllm:gpu_cache_usage_perc
- Balanced for our configuration
The principle: match configuration to your actual workload, not theoretical limits.

Production Results
We deployed the optimized configuration to production (RTX 5090 GPUs).
Before vs After
| Metric | Before (Multiprocessing) | After (Optimized AsyncLLM) | Change |
|---|---|---|---|
| Throughput | 9.0 RPS | 16.4 RPS | +82% |
| GPU Utilization | Spiky (93% → 0% → 93%) | Consistent 90-95% | Stable |



The improvement held in production. From 9 RPS to 16.4 RPS under real traffic.
Summary
What Worked
vLLM’s continuous batching
- AsyncLLMEngine handles batching automatically
- No manual batch collection overhead
- Direct async/await integration with FastAPI
Right-sized configuration
- max_num_seqs=16 (matched actual workload per server)
- Not 64 (theoretical max that created overhead)
- gpu_memory_utilization=0.3 for 10GB allocation
Tested with realistic data
- Variable-length inputs exposed configuration issues
- Uniform test data gave a misleading 15 RPS
Monitored vLLM metrics
- KV cache usage
- Decode time per token
- Queue depth
- Guided configuration decisions
Complete Journey
| Approach | Throughput | vs Baseline | Notes |
|---|---|---|---|
| Baseline (multiprocessing) | 2.2 RPS | - | IPC overhead, GPU contention |
| Two workers | 2.0 RPS | -9% | Made it worse |
| Static batching | 5.9 RPS | +168% | Head-of-line blocking |
| Async (64, uniform) | 15.0 RPS | +582% | Misleading test data |
| Async (16, variable) | 3.5 RPS | +59% | Realistic, but tuning needed |
| Final optimized | 10.7 RPS | +386% | Staging validation |
| Production | 16.4 RPS | +82% (vs 9.0 RPS production baseline) | Real traffic, RTX 5090 |
Related: Read Part 1: The Bottleneck to Scale Our Translation vLLM Inference Servers

