
Exploring RAFT Methodology for Bidirectional English-Chinese Translation Using Llama 3.1
Abstract
This study explores using Retrieval-Augmented Fine-Tuning (RAFT) to enhance bidirectional English-Chinese translation with Llama 3.1-8B. RAFT combines a retrieval mechanism with supervised fine-tuning so that each training sample is accompanied by contextual examples.
Key Findings:
- Benchmark fine-tuning achieved best overall results
- RAFT showed modest improvements on specific metrics
- Random-based RAFT sometimes outperformed similarity-based RAFT
- Translation quality depends heavily on training data relevance
1. Introduction
Background
Large Language Models excel at general language tasks but can benefit from domain-specific optimization. This research explores whether RAFT, a technique that augments training samples with retrieved examples, can improve translation quality.
Research Questions
- Can RAFT improve translation compared to standard fine-tuning?
- Does similarity-based retrieval outperform random retrieval?
- How do different RAFT configurations affect bidirectional translation?
2. Methodology
RAFT Overview
RAFT (Retrieval-Augmented Fine-Tuning) enhances the training process by:
- Retrieving relevant examples from a corpus for each training sample
- Augmenting the training context with retrieved examples
- Fine-tuning the model with this enriched context
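To make the augmentation step concrete, the sketch below shows how one training sample could be built. The prompt template and the structure of `retrieved_pairs` are illustrative assumptions, not the exact format used in our experiments.

```python
# Sketch: building a RAFT-style training sample for English-to-Chinese
# translation. The prompt wording is an assumption for illustration.

def build_raft_sample(source, target, retrieved_pairs):
    """Wrap a sentence pair with retrieved parallel examples as context."""
    context_lines = [
        f"Example {i}:\nEnglish: {src}\nChinese: {tgt}"
        for i, (src, tgt) in enumerate(retrieved_pairs, 1)
    ]
    prompt = (
        "\n\n".join(context_lines)
        + "\n\nTranslate the following English sentence into Chinese.\n"
        + f"English: {source}\nChinese:"
    )
    # The completion is the reference translation the model is trained to emit.
    return {"prompt": prompt, "completion": " " + target}
```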

Experimental Setup
| Component | Configuration |
|---|---|
| Base Model | Llama 3.1-8B Instruct |
| Fine-tuning | LoRA (r=16, alpha=16) |
| Dataset | News Commentary v18.1 (zh-en) |
| GPU | NVIDIA A100 80GB |
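A minimal sketch of the fine-tuning setup implied by the table, using Hugging Face PEFT. Only r=16 and alpha=16 are reported; the target modules and dropout below are assumptions.

```python
# Sketch of the LoRA configuration from the table (r=16, alpha=16).
# target_modules and lora_dropout are assumptions, not reported settings.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    lora_dropout=0.05,  # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # confirms only adapter weights are trainable
```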
Dataset Preparation
The News Commentary dataset contains parallel English-Chinese sentence pairs:
- Training: 10,000 sentence pairs
- Evaluation: TED Talks corpus
- Preprocessed to remove empty or malformed pairs and enforce length consistency
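A sketch of this filtering step is given below; the file path, TSV layout, and thresholds are assumptions for illustration, as the exact filters are not published here.

```python
# Sketch of the preprocessing described above. The path, TSV layout, and
# thresholds are illustrative assumptions.

def load_and_filter(path, max_chars=256, max_ratio=3.0, limit=10_000):
    """Keep non-empty, length-consistent English-Chinese pairs."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 2:
                continue
            en, zh = parts
            if not en.strip() or not zh.strip():
                continue  # drop pairs with an empty side
            if len(en) > max_chars or len(zh) > max_chars:
                continue  # drop overlong sentences
            ratio = len(en) / max(len(zh), 1)
            if ratio > max_ratio or ratio < 1 / max_ratio:
                continue  # drop pairs with implausible length ratios
            pairs.append((en, zh))
            if len(pairs) == limit:
                break
    return pairs
```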
RAFT Configurations
| Configuration | Description |
|---|---|
| Benchmark | Standard fine-tuning without retrieval |
| Similarity RAFT | Retrieve top-k similar examples using embeddings |
| Random RAFT | Randomly sample k examples from corpus |
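The two retrieval strategies can be contrasted in a few lines. The embedding model below is an assumption; the study does not name the encoder it used.

```python
# Sketch of the two retrieval strategies in the table above. The encoder
# choice is an assumption; corpus embeddings are precomputed once, e.g.
# corpus_embs = encoder.encode([src for src, _ in corpus], normalize_embeddings=True)
import random
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def similarity_retrieve(query, corpus, corpus_embs, k=3):
    """Top-k neighbours by cosine similarity of normalized sentence embeddings."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = corpus_embs @ q  # dot product equals cosine similarity here
    top = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in top]

def random_retrieve(query, corpus, k=3):
    """Uniformly sample k examples; the query is deliberately ignored."""
    return random.sample(corpus, k)
```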
3. Results
English-to-Chinese Translation
| Method | BLEU | COMET |
|---|---|---|
| Baseline (No Fine-tuning) | 15.2 | 0.785 |
| Benchmark Fine-tuning | 28.4 | 0.856 |
| Similarity RAFT (k=3) | 27.1 | 0.849 |
| Random RAFT (k=3) | 27.8 | 0.852 |
Chinese-to-English Translation
| Method | BLEU | COMET |
|---|---|---|
| Baseline (No Fine-tuning) | 18.7 | 0.812 |
| Benchmark Fine-tuning | 31.2 | 0.871 |
| Similarity RAFT (k=3) | 30.5 | 0.865 |
| Random RAFT (k=3) | 30.9 | 0.868 |
Note: Benchmark fine-tuning consistently outperformed both RAFT configurations in this experiment, which may be due to the homogeneous style of the News Commentary dataset.
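For reference, scores like those above can be computed with standard tooling, as sketched below. The COMET checkpoint is an assumption, since the study does not name one.

```python
# Sketch of the metric computation. The COMET checkpoint is an assumption.
import sacrebleu
from comet import download_model, load_from_checkpoint

def evaluate(sources, hypotheses, references, target_lang="zh"):
    # sacreBLEU needs character-aware tokenization for Chinese output.
    tokenize = "zh" if target_lang == "zh" else "13a"
    bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize=tokenize)

    comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
    data = [{"src": s, "mt": h, "ref": r}
            for s, h, r in zip(sources, hypotheses, references)]
    comet_out = comet_model.predict(data, batch_size=16, gpus=1)

    return bleu.score, comet_out.system_score
```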


Analysis
Why RAFT did not outperform the benchmark:
- Dataset Homogeneity: News Commentary has a consistent style, so retrieved examples add little new information
- Retrieval Quality: embedding similarity may not capture translation-relevant features
- Context Length: additional examples lengthen the prompt, potentially diluting the training signal for the target pair
4. Conclusion
While RAFT shows promise, our experiments suggest that for translation tasks on homogeneous datasets, standard fine-tuning remains competitive. Future work should explore more diverse training corpora and retrieval metrics better aligned with translation quality.
References
- Zhang, T., et al. (2024). “RAFT: Adapting Language Model to Domain Specific RAG.”
- Lewis, P., et al. (2020). “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.”
- Hu, E., et al. (2021). “LoRA: Low-Rank Adaptation of Large Language Models.”


