Identifying architectural bottlenecks in FastAPI + multiprocessing setup preventing efficient GPU utilization
The Problem
We run a translation microservice using FastAPI and vLLM. Under heavy load, we hit server latency issues that didn’t match what our GPU utilization metrics suggested.
GPU utilization showed a stuttering pattern: spike to 93%, drop to 0%, spike again. Not the consistent high utilization we expected.
The question: if the GPU has idle periods, where’s the bottleneck?
This article covers how we identified the architectural issues in our FastAPI + multiprocessing setup that were preventing efficient GPU utilization.
System Context
Our translation service runs as multiple API servers behind a load balancer:
- Clients: Web, mobile, backend services
- Proxy: Routes requests based on language pairs and server health
- API Servers: Multiple FastAPI instances, each running vLLM
This article focuses on a single API server’s internal architecture and bottlenecks.
API Server Architecture
Here’s the internal structure of one API server:
Components
1. FastAPI Main Process
- Handles HTTP requests with async/await
- Single Python process, one event loop
- Non-blocking I/O for concurrent request handling
2. TranslationService
- Creates translation tasks
- Manages EventTask objects with asyncio.Event
- Bridges async/await with multiprocessing
3. TranslationWorker (Main Process)
- Queues created in main process (shared with workers)
- JoinableQueue for task distribution
- manager().dict() for shared task state
- Event queue for results
4. Worker Processes
- Spawned as separate processes (ctx.Process)
- Each loads its own vLLM model instance
- Pull from shared translation_queue
- Return via shared event_queue
5. EventTask (Async Synchronization)
- Bridges multiprocessing with async/await
- Each request gets an EventTask
- await event.wait() blocks the coroutine until the worker completes
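As a concrete reference, here is a minimal sketch of what such an EventTask can look like; the attribute names (task_id, result) and the timeout are illustrative assumptions, not our exact implementation.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class EventTask:
    """Bridges a worker-process result back into async/await.

    The request coroutine awaits the event; a consumer of the shared
    event_queue stores the result and sets the event when the worker finishes.
    """
    task_id: str
    event: asyncio.Event = field(default_factory=asyncio.Event)
    result: Optional[Any] = None

    async def wait_for_result(self, timeout: float = 30.0) -> Any:
        # Suspends only this coroutine; the event loop keeps serving other requests.
        await asyncio.wait_for(self.event.wait(), timeout=timeout)
        return self.result
```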
Request Flow
Here’s what happens for a single translation request:
Step by step:
1. Client POST /translate → FastAPI creates async coroutine
2. async translate() → TranslationService handles request
3. create_task() → Generate ID, create TranslationTask in shared dict
4. queue.put(key) → Serialize task key, send to workers (IPC overhead)
5. Worker: vllm.translate() → Worker processes translation
6. event_queue.put(result) → Serialize result, send back (IPC overhead)
7. event.set() → Update EventTask, wake coroutine
8. await event.wait() unblocked → Retrieve result
9. Return response → Send to client
Overhead points:
- Step 4: Serialization (pickle task key)
- Step 6: Serialization (pickle result)
- Step 8: Async waiting for multiprocessing result
- IPC coordination throughout
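A simplified sketch of the main-process side of this flow (steps 1-4 and 8-9), building on the EventTask sketch above. TranslationService, the payload fields, and the startup wiring here are hypothetical approximations of our service, not the exact code.

```python
import uuid

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class TranslateRequest(BaseModel):
    text: str
    source_lang: str
    target_lang: str


class TranslationService:
    def __init__(self, translation_queue, shared_tasks, pending_events):
        self.translation_queue = translation_queue  # multiprocessing JoinableQueue
        self.shared_tasks = shared_tasks            # Manager().dict(): task_id -> payload
        self.pending_events = pending_events        # local dict: task_id -> EventTask

    async def translate(self, req: TranslateRequest) -> str:
        # Steps 2-3: generate an ID and register the task.
        task_id = uuid.uuid4().hex
        self.shared_tasks[task_id] = req.dict()          # crosses into the manager process
        self.pending_events[task_id] = EventTask(task_id=task_id)

        # Step 4: only the task key goes onto the queue, but it is still pickled.
        self.translation_queue.put(task_id)

        # Step 8: suspend this coroutine until the event-queue consumer
        # fills in the result and sets the asyncio.Event.
        return await self.pending_events[task_id].wait_for_result()


# Wired up at application startup, where the queues and worker processes are created.
service: TranslationService


@app.post("/translate")
async def translate(req: TranslateRequest):
    # Steps 1 and 9: FastAPI schedules the coroutine and returns the response.
    return {"translation": await service.translate(req)}
```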
Baseline Performance
Before optimization attempts, load testing showed the following pattern:
- Response time grows roughly linearly with load (1.4s → 11.3s)
- Throughput decreases as load increases (3.3 → 2.2 RPS)
- Actual vLLM translation time per request: 300-450ms

Spiky pattern: the GPU alternates between busy and idle. This indicated the GPU was waiting for work rather than being compute-bound.
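The pattern is easy to observe with a small sampler. This sketch uses the pynvml bindings purely as one way to log utilization once per second; it is not necessarily the tooling we used.

```python
import time

import pynvml  # NVIDIA management library bindings (nvidia-ml-py)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        # A compute-bound service holds a steady value; under load our
        # service showed something closer to 93, 0, 93, 0, ...
        print(f"gpu={util.gpu}% mem={util.memory}%")
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()
```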
Attempt 1: Multiple Workers
First hypothesis: more workers = better parallelization.
We increased from 1 worker to 2 workers.
Configuration
- Worker 1: Models A+B
- Worker 2: Model C
- Both share the same GPU
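Roughly, the two-worker setup was wired like this. The worker loop, model names, and payload keys are illustrative assumptions, and the vLLM calls use the standard offline LLM API rather than our exact engine wrapper.

```python
import multiprocessing as mp

from vllm import LLM, SamplingParams


def worker_main(worker_id, model_names, translation_queue, event_queue, shared_tasks):
    # Each worker loads its own engine per assigned model: weights are duplicated
    # per process, so GPU memory has to be split between the workers.
    engines = {name: LLM(model=name, gpu_memory_utilization=0.3) for name in model_names}
    params = SamplingParams(temperature=0.0, max_tokens=512)
    while True:
        task_id = translation_queue.get()        # unpickled task key
        task = shared_tasks[task_id]             # payload from the manager-backed dict
        prompt = f"Translate {task['source_lang']} to {task['target_lang']}: {task['text']}"
        engine = next(iter(engines.values()))    # model routing omitted for brevity
        outputs = engine.generate([prompt], params)
        event_queue.put((task_id, outputs[0].outputs[0].text))  # pickled result, back to main
        translation_queue.task_done()


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    manager = ctx.Manager()

    translation_queue = ctx.JoinableQueue()
    event_queue = ctx.Queue()
    shared_tasks = manager.dict()

    # Worker 1 serves models A+B, worker 2 serves model C; both share one GPU.
    assignments = {1: ["model-a", "model-b"], 2: ["model-c"]}
    workers = [
        ctx.Process(
            target=worker_main,
            args=(wid, models, translation_queue, event_queue, shared_tasks),
            daemon=True,
        )
        for wid, models in assignments.items()
    ]
    for w in workers:
        w.start()
```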
Results

Performance dropped across all load levels, and median translation time degraded from 452ms to 2,239ms.
Why Multiple Workers Failed
This result makes sense when you understand GPU behavior and our architecture.
The Issue: Compute Contention
When one worker is processing a translation:
- It uses ~90% of GPU compute capacity
- Other workers can’t effectively utilize the remaining capacity in parallel
- Workers end up waiting for GPU availability
Why no parallel benefit:
- Worker 1 starts vLLM generation → uses ~90% GPU compute
- Worker 2 tries to start → only ~10% GPU compute available
- Worker 2 runs slowly or waits
- Effectively sequential execution despite separate processes
Additional overhead:
- Process spawning and management
- GPU memory split between workers (each loads model weights)
- IPC queue coordination
- Context switching between processes
The GPU can technically run multiple CUDA kernels simultaneously, but when one worker is actively using ~90% of compute capacity, there’s insufficient remaining capacity for another worker to run efficiently in parallel.
Additional Architectural Issues
With multiple workers competing for the same resources:
- Context switching overhead: OS switching between worker processes
- Doubled memory usage: Each worker loads full model weights
- No effective parallelism: Sequential GPU execution despite parallel architecture
The same queues serve all workers (translation_queue and event_queue are shared), so the IPC overhead per request stays constant. But the extra process management, context switching, and memory duplication, with no parallel GPU benefit to offset them, made overall performance worse.
Identified Bottlenecks
After this experiment, we identified the core issues:
1. IPC Serialization Overhead
- Every request: serialize task → worker, serialize result → main
- Python multiprocessing queue uses pickle
- Overhead on every request (a rough pickle-timing sketch follows this list)
2. Compute Contention
- One worker using ~90% GPU compute
- Other workers can’t run effectively in parallel
- Sequential execution despite multiprocessing
3. Async/Await + Multiprocessing Bridge
- asyncio.Event waiting for multiprocessing result
- Thread-based event queue consumer
- Coordination overhead between async and multiprocess models
4. Wasted GPU Cycles
- GPU idle while waiting for queue operations
- Spiky utilization (93% → 0% → 93%)
- Translation time ~400ms, total response time 11+ seconds
- Most time spent in queues, not computing
5. Architecture Complexity
- FastAPI (async/await)
- TranslationService (bridge)
- TranslationWorker (coordination)
- JoinableQueue (IPC)
- Worker processes (multiprocessing)
- Event queue (IPC)
- EventTask (async sync)
- vLLM (actual work)
Each layer added latency.
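To attach a rough number to bottleneck 1, a micro-benchmark like the one below measures just the serialization round-trip for a result-sized payload. The payload shape and sizes are invented for illustration, and queue wait time is not included.

```python
import pickle
import timeit

# Invented stand-in for a translation result crossing the event_queue.
result_payload = {
    "task_id": "a" * 32,
    "translation": "x" * 2000,  # a few KB of translated text
    "metadata": {"model": "model-a", "latency_ms": 412},
}

n = 10_000
seconds = timeit.timeit(lambda: pickle.loads(pickle.dumps(result_payload)), number=n)
print(f"pickle round-trip: {seconds / n * 1e6:.1f} µs per result")
# Serialization is only part of the story: queue wait, manager dict lookups,
# and cross-process scheduling add to the per-request cost.
```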
Key Insights
1. Async/Await + Multiprocessing = Overhead
Bridging these two concurrency models requires coordination:
- asyncio.Event for async waiting
- Thread pool for consuming event queue
- Serialization at process boundaries
This bridge has a cost.
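For illustration, the thread-based consumer side of that bridge looks roughly like this; the names (consume_event_queue, pending_events) are hypothetical. The key constraint is that asyncio.Event is not thread-safe, so results must be handed back onto the event loop.

```python
import asyncio
import threading


def consume_event_queue(event_queue, pending_events, loop: asyncio.AbstractEventLoop):
    # Runs in a plain thread: drain worker results and wake the waiting coroutines.
    while True:
        task_id, result = event_queue.get()  # blocks; unpickles the result

        def deliver(task_id=task_id, result=result):
            event_task = pending_events.pop(task_id, None)
            if event_task is not None:
                event_task.result = result
                event_task.event.set()       # unblocks `await event.wait()`

        # Hop back onto the event loop thread to touch asyncio objects safely.
        loop.call_soon_threadsafe(deliver)


def start_consumer(event_queue, pending_events):
    # Call from async startup code so the running loop can be captured.
    loop = asyncio.get_running_loop()
    threading.Thread(
        target=consume_event_queue,
        args=(event_queue, pending_events, loop),
        daemon=True,
    ).start()
```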
2. Multiple Processes ≠ GPU Parallelism
Adding worker processes doesn’t automatically improve GPU utilization when:
- One worker uses ~90% of GPU compute
- Insufficient remaining capacity for parallel work
- Sequential execution despite multiprocessing overhead
3. Queue Overhead Dominates
At 25 concurrent requests:
- vLLM translation time: ~400ms
- Total response time: 11,258ms
- Queue and coordination overhead: ~96% of total time ((11,258 - 400) / 11,258 ≈ 0.96)
The majority of time was spent in queues and coordination, not computing.
4. Spiky GPU = Architectural Issue
- Consistent, high GPU utilization (e.g., 90-95%) indicates a compute-bound workload
- A spiky pattern (93% → 0% → 93%) indicates the GPU is waiting for work; the bottleneck is elsewhere (in our case, queues and IPC)
Conclusion
The bottleneck wasn’t GPU capacity. It was our multiprocessing architecture:
Issues identified:
- IPC overhead from queue serialization
- GPU compute contention without effective parallelism
- Async/await + multiprocessing coordination overhead
- Most latency from queues, not vLLM processing
Symptoms:
- Spiky GPU utilization
- Response time dominated by queue wait
- Adding workers made performance worse
Note: In Part 2, we’ll cover the solution: eliminating multiprocessing, using vLLM’s AsyncLLMEngine directly, and achieving an 82% throughput improvement in production.
Preview:
- Remove multiprocessing architecture entirely
- Use vLLM’s AsyncLLMEngine with FastAPI directly
- Right-size continuous batching configuration
- Production result: Improved throughput (+82%)
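As a preview of what that direction looks like (details and measurements are in Part 2), a single-process setup can be sketched roughly as below. The model name and sampling settings are placeholders, and the exact engine API depends on the vLLM version.

```python
import uuid

from fastapi import FastAPI
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="your-translation-model"))


@app.post("/translate")
async def translate(payload: dict):
    request_id = uuid.uuid4().hex
    params = SamplingParams(temperature=0.0, max_tokens=512)

    # No queues, no pickling, no worker processes: the coroutine consumes
    # results directly from vLLM's continuous-batching scheduler in-process.
    final_output = None
    async for output in engine.generate(payload["prompt"], params, request_id):
        final_output = output
    return {"translation": final_output.outputs[0].text}
```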
Read next: Part 2: Scaling Translation Inference: +82% Throughput


