Improving In-Context Learning Performance in Machine Translation

1. Introduction

Large language models (LLMs) have shown impressive proficiency in downstream tasks through conditioning on input-label pairs. This inference mode is called in-context learning (Brown et al. 2020). GPT-4 can improve its translation capabilities without fine-tuning by providing specific task examples.

Figure 1: In-context learning for Chinese to English translation using few-shot examples

The effectiveness of in-context learning stems from Implicit Bayesian Inference (Xie et al. 2022). Random selection of examples cannot effectively help GPT-4 understand the prompt concept. The primary objective is strategic selection of suitable examples based on user input.

2. Proposed Method

This methodology assumes access to a dataset Ds containing translation pairs. A text retriever (Gao 2023) locates and selects the top K sentences with similar meaning to the user prompt.

The retriever has two components:

TF-IDF Matrix - measures term frequency and inverse document frequency
Cosine Similarity - measures similarity between TF-IDF vectors

TF-IDF Score

TF-IDF scores measure the significance of words within documents:

TF (Term Frequency): How often a word appears in a document
IDF (Inverse Document Frequency): Significance of a word across the corpus

Cosine Similarity

Cosine similarity assesses similarity between two vectors by considering the angle between their representations. Higher scores indicate greater similarity between the user prompt and dataset documents.

Figure 2: Using TF-IDF matrix and cosine similarity to select top-K examples from the dataset

3. Experimental Setup

3.1 Experimental Procedure

The experiment covers three scenarios:

No ICL: GPT-4 translation without in-context learning examples
Random ICL: Random selection of translation examples
Proposed Method: TF-IDF retriever selects top 4 examples based on similarity scores

Evaluation Metrics

BLEU Score: Compares translated segments with reference translations (Papineni et al. 2002)
COMET Score: Neural framework for multilingual MT evaluation achieving state-of-the-art correlation with human judgments (Rei et al. 2020)

3.2 Datasets

OPUS-100 (Zhang et al. 2020) was chosen because it:

Contains diverse translation language pairs (ZH-EN, JA-EN, VI-EN)
Covers diverse domains for effective example selection

Configuration:

10,000 training instances per language pair for Ds
First 100 sentences from test set for evaluation

3.3 Implementation

Using scikit-learn’s TfidfVectorizer and cosine_similarity functions:

Merge user prompt with Ds
Calculate cosine similarity scores between prompt and all sentences
Select top 4 examples based on similarity
Embed examples into GPT-4 prompt

Figure 3: Final prompt with four examples identified by the retriever

4. Results and Discussion

Table 1: Translation accuracy across three scenarios for all language pairs

Key Findings:

The proposed approach shows superior translation accuracy across all language pairs
A 1% improvement in BLEU score is significant in machine translation
Random ICL sometimes performs worse than no ICL at all
This highlights the importance of judicious example selection

Dataset Size Impact

Table 2: Translation accuracy with different dataset sizes

Testing with 1 million sentences confirmed that larger Ds datasets improve GPT-4 task learning effectiveness.

5. Conclusion and Next Steps

This paper introduces a method to enhance GPT-4 translation through in-context learning with TF-IDF retrieval. The approach:

Constructs a retriever using TF-IDF matrix and cosine similarity
Selects sentences that closely align with user prompts
Shows improvements in both BLEU and COMET scores

Future Research Directions:

Dataset Construction: Creating comprehensive, high-quality translation datasets across domains
Example Quantity: Investigating impact of using 5 or 10 examples instead of 4

6. References

Brown, T., et al. (2020). “Language models are few-shot learners.”
Xie, S. M., et al. (2022). “An Explanation of In-context Learning as Implicit Bayesian Inference.”
Bashir, D. (2023). “In-Context Learning, in Context.” The Gradient.
Das, R., et al. (2021). “Case-based reasoning for natural language queries over knowledge bases.”
Liu, J., et al. (2022). “What makes good in-context examples for GPT-3?”
Margatina, K., et al. (2023). “Active learning principles for in-context learning with large language models.”
Gao, L., et al. (2023). “Ambiguity-Aware In-Context Learning with Large Language Models.”
Papineni, K., et al. (2002). “BLEU: A method for automatic evaluation of machine translation.”
Rei, R., et al. (2020). “COMET: A Neural Framework for MT Evaluation.”
Zhang, B., et al. (2020). “Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation.”