Enhancing English-Chinese Translation with Retrieval-Augmented Fine-Tuning (RAFT)

Kai-Teh Tzeng - Lehigh University

Exploring RAFT methodology for bidirectional English-Chinese translation using Llama 3.1

Abstract

This study explores using Retrieval-Augmented Fine-Tuning (RAFT) to enhance English-Chinese bidirectional translation with Llama 3.1-8B. RAFT combines retrieval mechanisms with fine-tuning to provide contextual examples during training.

Key Findings:

  • Benchmark fine-tuning achieved best overall results
  • RAFT configurations showed modest improvements on some metrics, but did not surpass standard fine-tuning
  • Random-based RAFT sometimes outperformed similarity-based RAFT
  • Translation quality depends heavily on training data relevance

1. Introduction

Background

Large Language Models excel at language tasks but can benefit from domain-specific optimization. This research explores whether RAFT—a technique that augments training with retrieved examples—can improve translation quality.

Research Questions

  1. Can RAFT improve translation compared to standard fine-tuning?
  2. Does similarity-based retrieval outperform random retrieval?
  3. How do different RAFT configurations affect bidirectional translation?

2. Methodology

RAFT Overview

RAFT (Retrieval-Augmented Fine-Tuning) enhances the training process by:

  1. Retrieving relevant examples from a corpus for each training sample
  2. Augmenting the training context with retrieved examples
  3. Fine-tuning the model with this enriched context
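
As a concrete illustration of these three steps, here is a minimal sketch of how a RAFT-style training sample could be assembled. The retriever interface, prompt wording, and function names are illustrative assumptions, not the exact setup used in this study.

    def build_raft_sample(src_sentence, tgt_sentence, corpus, retrieve_fn, k=3):
        """Assemble one RAFT training sample: retrieved examples + translation prompt."""
        # Step 1: retrieve k related (English, Chinese) pairs from the corpus.
        retrieved = retrieve_fn(src_sentence, corpus, k)

        # Step 2: augment the context with the retrieved examples.
        context = "\n\n".join(
            f"English: {en}\nChinese: {zh}" for en, zh in retrieved
        )
        prompt = (
            "Here are some example translations:\n\n"
            f"{context}\n\n"
            "Translate the following sentence from English to Chinese.\n"
            f"English: {src_sentence}\nChinese:"
        )

        # Step 3: the (prompt, completion) pair is what the model is fine-tuned on.
        return {"prompt": prompt, "completion": tgt_sentence}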

[Figure: RAFT methodology diagram]

Experimental Setup

Component        Configuration
Base Model       Llama 3.1-8B Instruct
Fine-tuning      LoRA (r=16, alpha=16)
Dataset          News Commentary v18.1 (zh-en)
GPU              NVIDIA A100 80GB
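
For readers who want to reproduce a similar setup, a minimal sketch of the LoRA configuration with Hugging Face peft is shown below. Only r=16 and alpha=16 come from the table above; the target modules, dropout, and model identifier are reasonable assumptions rather than reported settings.

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    # Base model per the experimental setup (identifier assumed to be the HF release).
    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

    # r and alpha match the table; target_modules and dropout are assumptions.
    lora_cfg = LoraConfig(
        r=16,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, lora_cfg)
    model.print_trainable_parameters()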

Dataset Preparation

The News Commentary dataset contains parallel English-Chinese sentence pairs:

  • Training: 10,000 sentence pairs
  • Evaluation: TED Talks corpus
  • Preprocessed for quality and length consistency
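
The preprocessing mentioned above can be as simple as dropping empty, very short, or very long pairs. The sketch below shows one way to do this; the thresholds are illustrative assumptions, not the exact values used here.

    def prepare_pairs(pairs, min_words=5, max_words=80, max_zh_chars=200):
        """Filter parallel (English, Chinese) pairs for basic quality and length.
        Thresholds are illustrative assumptions."""
        cleaned = []
        for en, zh in pairs:
            en, zh = en.strip(), zh.strip()
            if not en or not zh:
                continue  # drop pairs with an empty side
            if not (min_words <= len(en.split()) <= max_words):
                continue  # drop very short or very long English sentences
            if len(zh) > max_zh_chars:
                continue  # crude length check on the Chinese side
            cleaned.append((en, zh))
        return cleaned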

RAFT Configurations

Configuration      Description
Benchmark          Standard fine-tuning without retrieval
Similarity RAFT    Retrieve top-k similar examples using embeddings
Random RAFT        Randomly sample k examples from the corpus
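
The two RAFT variants differ only in how the k in-context examples are chosen. Below is a minimal sketch of both; the sentence-transformers encoder is an assumption, since the study does not state which embedding model was used.

    import random
    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Assumed multilingual encoder; any sentence-embedding model would work.
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    corpus = [
        "The economy grew faster than expected last year.",
        "Climate policy remains a contested issue.",
        "The central bank raised interest rates again.",
    ]
    corpus_emb = encoder.encode(corpus, normalize_embeddings=True)

    def similarity_retrieve(query, k=3):
        """Similarity RAFT: indices of the top-k most similar corpus sentences."""
        q = encoder.encode([query], normalize_embeddings=True)[0]
        scores = corpus_emb @ q  # cosine similarity, since embeddings are normalized
        return np.argsort(-scores)[:k].tolist()

    def random_retrieve(k=3):
        """Random RAFT: indices of k randomly sampled corpus sentences."""
        return random.sample(range(len(corpus)), k)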

3. Results

English-to-Chinese Translation

Method                        BLEU    COMET
Baseline (No Fine-tuning)     15.2    0.785
Benchmark Fine-tuning         28.4    0.856
Similarity RAFT (k=3)         27.1    0.849
Random RAFT (k=3)             27.8    0.852

Chinese-to-English Translation

Method                        BLEU    COMET
Baseline (No Fine-tuning)     18.7    0.812
Benchmark Fine-tuning         31.2    0.871
Similarity RAFT (k=3)         30.5    0.865
Random RAFT (k=3)             30.9    0.868

Note: Benchmark fine-tuning consistently outperformed RAFT configurations in this experiment. This may be due to the homogeneous nature of the News Commentary dataset.

[Figure: Training performance comparison]

[Figure: BLEU and COMET score comparison]
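
For reference, scores of this kind can be computed with sacrebleu and Unbabel's COMET package. The sketch below shows the English-to-Chinese direction with sacrebleu's Chinese tokenizer; the COMET checkpoint named here is a common choice and an assumption, not necessarily the one used in this study.

    import sacrebleu
    from comet import download_model, load_from_checkpoint

    sources = ["The meeting was postponed until Friday."]   # English inputs
    hypotheses = ["会议推迟到星期五。"]                       # model outputs
    references = ["会议被推迟到周五。"]                       # gold translations

    # BLEU with the Chinese tokenizer for the zh target side.
    bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="zh")
    print(f"BLEU: {bleu.score:.1f}")

    # Reference-based COMET; checkpoint choice is an assumption.
    comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
    data = [{"src": s, "mt": h, "ref": r}
            for s, h, r in zip(sources, hypotheses, references)]
    print("COMET:", comet_model.predict(data, batch_size=8, gpus=0).system_score)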

Analysis

Why RAFT didn’t outperform the benchmark:

  1. Dataset Homogeneity: News Commentary has consistent style
  2. Retrieval Quality: Similarity metrics may not capture translation-relevant features
  3. Context Length: Additional examples increase context, potentially diluting focus

4. Conclusion

While RAFT shows promise, our experiments suggest that for translation tasks on homogeneous datasets, standard fine-tuning remains competitive. Future work should explore diverse training corpora and better retrieval metrics.

References

  1. Zhang, T., et al. (2024). “RAFT: Adapting Language Model to Domain Specific RAG.”
  2. Lewis, P., et al. (2020). “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.”
  3. Hu, E., et al. (2021). “LoRA: Low-Rank Adaptation of Large Language Models.”