Translation Agent: Agentic translation using reflection workflow

This is a Python demonstration of a reflection agentic workflow for machine translation. The main steps, sketched in code after the list, are:

  1. Prompt an LLM to translate a text from source_language to target_language;
  2. Have the LLM reflect on the translation to come up with constructive suggestions for improving it;
  3. Use the suggestions to improve the translation.
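
To make these steps concrete, here is a minimal sketch of the loop, assuming the official openai Python client; get_completion is a hypothetical helper, and the package's ta.translate wraps the real logic:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_completion(prompt, model="gpt-4-turbo"):
    # Hypothetical helper: send one prompt, return the model's reply text.
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def reflective_translate(source_text, source_lang, target_lang):
    # Step 1: initial translation.
    translation = get_completion(
        f"Translate the following text from {source_lang} to {target_lang}:\n\n{source_text}"
    )
    # Step 2: reflect on the translation and collect suggestions.
    suggestions = get_completion(
        f"Source ({source_lang}):\n{source_text}\n\n"
        f"Translation ({target_lang}):\n{translation}\n\n"
        "Give constructive criticism and helpful suggestions to improve the translation."
    )
    # Step 3: improve the translation using the suggestions.
    return get_completion(
        f"Source ({source_lang}):\n{source_text}\n\n"
        f"Draft translation:\n{translation}\n\n"
        f"Suggestions:\n{suggestions}\n\n"
        f"Rewrite the {target_lang} translation, applying the suggestions. "
        "Output only the improved translation."
    )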

Customizability

By using an LLM as the heart of the translation engine, this system is highly steerable. For example, by changing the prompts, this workflow makes it easier than with a traditional machine translation (MT) system to:

  • Modify the output's style, such as formal/informal.
  • Specify how to handle idioms and special terms like names, technical terms, and acronyms. For example, including a glossary in the prompt (see the sketch after this list) lets you make sure particular terms (such as open source, H100 or GPU) are translated consistently.
  • Specify regional usage of the language, or specific dialects, to serve a target audience. For example, Spanish spoken in Latin America is different from Spanish spoken in Spain; French spoken in Canada is different from how it is spoken in France.
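
For instance, a glossary can be rendered into the prompt as a simple term list. Here is a minimal sketch; the glossary contents and prompt wording are illustrative, not part of the package:

source_text = "Our open source project runs on a single H100 GPU."

glossary = {
    "open source": "código abierto",  # pick one rendering and use it throughout
    "GPU": "GPU",                     # acronym: keep as-is
    "H100": "H100",                   # product name: do not translate
}

# Render the glossary as lines the model can follow verbatim.
glossary_lines = "\n".join(f'- "{src}" -> "{tgt}"' for src, tgt in glossary.items())

prompt = (
    "Translate the following English text to Spanish.\n"
    "Translate these terms exactly as listed, consistently:\n"
    f"{glossary_lines}\n\n"
    f"Text:\n{source_text}"
)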

This is not mature software, and is the result of Andrew playing around with translations on weekends the past few months, plus collaborators (Joaquin Dominguez, Nedelina Teneva, John Santerre) helping refactor the code.

According to our evaluations using BLEU score on traditional translation datasets, this workflow is sometimes competitive with, but also sometimes worse than, leading commercial offerings. However, we’ve also occasionally gotten fantastic results (superior to commercial offerings) with this approach. We think this is just a starting point for agentic translation: a promising direction with significant headroom for further improvement, which is why we’re releasing this demonstration to encourage more discussion, experimentation, research, and open-source contributions.

If agentic translations can generate better results than traditional architectures (such as an end-to-end transformer that inputs a text and directly outputs a translation) -- which are often faster/cheaper to run than our approach here -- this also provides a mechanism to automatically generate training data (parallel text corpora) that can be used to further train and improve traditional algorithms. (See also this article in The Batch on using LLMs to generate training data.)

Comments and suggestions for how to improve this are very welcome!

Getting Started

To get started with translation-agent, follow these steps:

Installation:

  • The Poetry package manager is required for installation. See the Poetry installation docs; depending on your environment, this might work:

pip install poetry

  • A .env file with an OPENAI_API_KEY is required to run the workflow. See the .env.sample file as an example.

git clone https://github.com/andrewyng/translation-agent.git
cd translation-agent
poetry install
poetry shell # activates the virtual environment
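
The .env file itself needs only one line, for example (placeholder value, not a real key):

OPENAI_API_KEY=your-openai-api-key-here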

Usage:

import translation_agent as ta

source_lang, target_lang, country = "English", "Spanish", "Mexico"
source_text = "Some English text to translate."  # any source-language string
translation = ta.translate(source_lang, target_lang, source_text, country)
print(translation)

See examples/example_script.py for an example script to try out.

License

Translation Agent is released under the MIT License. You are free to use, modify, and distribute the code for both commercial and non-commercial purposes.

Ideas for extensions

Here are ideas we haven’t had time to experiment with but that we hope the open-source community will:

  • Try other LLMs. We prototyped this primarily using gpt-4-turbo. We would love for others to experiment with other LLMs as well as other hyperparameter choices and see if some do better than others for particular language pairs.
  • Glossary Creation. What’s the best way to efficiently build a glossary -- perhaps using an LLM -- of the most important terms that we want translated consistently? For example, many businesses use specialized terms that are not widely used on the internet and that LLMs thus don’t know about, and there are also many terms that can be translated in multiple ways. For example, “open source” in Spanish can be “código abierto” or “fuente abierta”; both are fine, but it’d be better to pick one and stick with it for a single document.
  • Glossary Usage and Implementation. Given a glossary, what’s the best way to include it in the prompt?
  • Evaluations on different languages. How does its performance vary across languages? Are there changes that make it work better for particular source or target languages? (Note that for very high levels of performance, which MT systems are approaching, we’re not sure if BLEU is a great metric.) Also, its performance on lower-resource languages needs further study.
  • Error analysis. We’ve found that specifying a language and a country/region (e.g., “Spanish as colloquially spoken in Mexico”) does a pretty good job for our applications. Where does the current approach fall short? We’re also particularly interested in understanding its performance on specialized topics (like law, medicine) or special types of text (like movie subtitles) to understand its limitations.
  • Better evals. Finally, we think better evaluations (evals) are a huge and important research topic. As with other LLM applications that generate free text, current evaluation metrics appear to fall short. For example, we found that even on documents where our agentic workflow captures context and terminology better, resulting in translations that our human raters prefer over current commercial offerings, evaluation at the sentence level (using the FLORES dataset) resulted in the agentic system scoring lower on BLEU. Can we design better metrics (perhaps using an LLM to evaluate translations?) that capture translation quality at the document level and correlate better with human preferences?

Related work

A few academic research groups are also starting to look at LLM-based and agentic translation. We think it’s early days for this field!

UPDATE

To improve both cost and speed, Ladi and I have updated the translation method. Previously, 3N calls to the OpenAI API were required (where N is the number of 'chunks' in the translatable text).

We've reconfigured utils so that only N calls are made to the API, while the same translation quality should be achieved; for example, a 10-chunk document now needs 10 calls instead of 30. It should be cheaper and faster now.

How it works

Previously, this was the call structure:

If the text is smaller than the chunk size (1,000 characters), single chunk:

  • one_chunk_initial_translation (1 call)
  • one_chunk_reflect_on_translation (1 call)
  • one_chunk_improve_translation (1 call)

Total: 3 calls to the OpenAI API.

If the text is larger than the chunk size, multi chunk:

  • multichunk_initial_translation (n calls)
  • multichunk_reflect_on_translation (n calls)
  • multichunk_improve_translation (n calls)

Total: 3n calls.

Now, we've combined all the agentic calls to the API into one. We ask the model to perform all the steps it performed before (initial translation, reflection, improvement) internally and to output only the final translation. Since the API bills output tokens only for the text the model actually emits, you pay for the final translation rather than for the intermediate drafts and critiques.

For example, here's the prompt now sent in the one_chunk_initial_translation call to the OpenAI API:

"This is an {source_lang} to {target_lang} translation, but you will be following a series of 
steps to arrive at a final translation. While all these steps will be performed by you, the only text you will output 
is the final {target_lang} translation you arrive at. 

Translate the following {source_text} from {source_lang} to {target_lang}.  
Here are the steps you must Follow when executing the translation: 
1. Translate {source_text} from {source_lang} to {target_lang}
2. Carefully read the source text and the translation you just made from {source_lang} to {target_lang}, and then give constructive criticism and helpful suggestions to improve your previous translation. \
3. Edit the translation using the suggestions created in the previous step. The final style and tone of the translation should match the style of {target_lang} colloquially spoken in {country}.
4. Repeat steps 2 and 3 until you cannot find any more suggestions to improve the translation. 
Perform all these steps, but only print out the final translated version.

Do NOT output anything other than the final translation of the indicated part of the text. 

And remember, Output only the final translation of the portion you are asked to translate, and nothing else. \
 
{source_lang}: {source_text}

{target_lang}:"

As you can see, we can essentially get more bang for our buck by compressing everything we want the API to do for us into one call.
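
In code, the whole workflow becomes a single chat-completion request. Here is a minimal sketch, assuming the official openai Python client, with the prompt abbreviated to its key instructions:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def one_call_translate(source_text, source_lang, target_lang, country,
                       model="gpt-4-turbo"):
    # One request replaces the three-call translate/reflect/improve loop:
    # the model is told to do all the steps itself and emit only the result,
    # so output tokens (and cost) cover just the final translation.
    prompt = (
        f"This is an {source_lang} to {target_lang} translation, but you will "
        f"follow a series of steps (translate, critique, improve) to arrive "
        f"at a final translation. The final style and tone should match "
        f"{target_lang} as colloquially spoken in {country}. "
        f"Output ONLY the final {target_lang} translation, nothing else.\n\n"
        f"{source_lang}: {source_text}\n\n{target_lang}:"
    )
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content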

There is still a lot of testing and translation-quality analysis to be done to confirm that the output really improves on a plain initial translation, but so far our early tests are promising.

Update: material may not be recognized and translated unless there are no breaks in the .txt file. I will work on this tomorrow.
