2026-05-18 · 6 minute read · translation · multilingual

When AI translation keeps the voice

By loom team

There is a sentence in a Murakami opening that does not survive a literal translation. It is not the meaning that breaks. It is the breathing — a comma where English would not put one, a verb held to the end because the paragraph needed it there. Any translator who has worked between English and Japanese has met that sentence. So has every machine that has tried to.

The new question is not whether large language models can translate. They can. The question is whether anything of the author makes it across.

Frontier MT, briefly

On WMT24, the standard machine-translation shoot-out, Claude 3.5 Sonnet placed first in 9 of 11 language pairs, including English-Japanese, ahead of dedicated MT systems that have been tuned for a decade. The most credible single review for the Japanese pair specifically notes Claude's handling of honorifics, omitted subjects, and implication: the parts of Japanese that do not have one-to-one English equivalents and break sentence-level translators. (Best LLM for Japanese-English translation, note.com)

The shift in method behind that result is small and important. Earlier neural MT translated one sentence at a time. Document-level prompting (give the model the whole paragraph, even the whole chapter, and ask once) outperformed sentence-by-sentence translation across 18 linguistically diverse pairs including Japanese. The context matters because Japanese often leaves its subject implicit; you frequently cannot recover it from a single line. (Document-level context for literary translation, arXiv 2304.03245)
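To make the difference concrete, here is a minimal sketch of the two prompting styles. The function names, the prompt wording, and the two Japanese sentences are our own illustration, not taken from the paper; no model is called, the code only builds the prompts.

```python
def sentence_level_prompts(sentences, src="Japanese", tgt="English"):
    """One prompt per sentence: the model never sees the neighbouring lines."""
    return [
        f"Translate this {src} sentence into {tgt}:\n{s}"
        for s in sentences
    ]


def document_level_prompt(sentences, src="Japanese", tgt="English"):
    """A single prompt carrying the whole passage, so an omitted subject
    can be resolved from the lines around it."""
    passage = "\n".join(sentences)
    return (
        f"Translate the following {src} passage into {tgt}. "
        "Translate it as one continuous text, not line by line.\n\n"
        f"{passage}"
    )


# Invented example passage (hypothetical, for illustration only).
passage = [
    "彼女は駅に着いた。",  # "She arrived at the station."
    "切符を買った。",      # "(She) bought a ticket." The subject is omitted.
]

per_sentence = sentence_level_prompts(passage)
whole = document_level_prompt(passage)
```

The second sentence-level prompt contains no trace of who bought the ticket; the document-level prompt carries the subject in the line before it.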

So we have gotten somewhere. We have not gotten where literary translators want to be.

What the recent literary-translation study actually found

The most cited 2024 evaluation of LLMs on literary translation, built on a corpus of 2,197 annotated segments and 13,346 sentences across German-English, English-German, German-Chinese, and English-Chinese, compared professional human translations with output from GPT-4o, DeepL, and Google Translate using Best-Worst Scaling. Professional evaluators preferred the human translations 94% of the time. GEMBA-MQM, the strongest automated metric, picked the human translation in only 9.6% of comparisons. The automated metrics could not tell the difference; the human evaluators could. (How Good Are LLMs for Literary Translation, Really? arXiv 2410.18697)
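For readers who have not met Best-Worst Scaling: an annotator sees a small set of candidate translations and marks only the best and the worst. A common way to turn those marks into a score is (times chosen best minus times chosen worst) divided by times shown. The sketch below assumes that counting scheme; the judgments are invented, and the study's exact aggregation may differ.

```python
from collections import Counter


def bws_scores(judgments):
    """Score each system from Best-Worst Scaling judgments.

    judgments: iterable of (items_shown, best, worst) tuples.
    Returns {item: (best_count - worst_count) / times_shown}, in [-1, 1].
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in judgments:
        seen.update(items)
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


# Invented judgments, for illustration only (not data from the study).
systems = ("human", "gpt4o", "deepl", "google")
judgments = [
    (systems, "human", "google"),
    (systems, "human", "deepl"),
    (systems, "gpt4o", "google"),
]

scores = bws_scores(judgments)
ranking = sorted(scores, key=scores.get, reverse=True)
```

The appeal of the method is that picking a best and a worst is an easier call for an annotator than assigning an absolute grade to every candidate.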

The texture of the failure is what matters. Human translators produced the lowest syntactic similarity to the source (0.21) and the lowest lexical overlap with other systems (18.9%). LLMs clustered around 0.27 syntactic similarity and reused vocabulary. The authors are blunt: "high syntactic similarity frequently sacrifices naturalness in the target language and hinders creativity in translations."
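One rough way to see the vocabulary reuse is to measure how many word types two translations share. The sketch below uses Jaccard overlap on lowercased word types, which is our assumption for illustration and not necessarily the measure the authors used; the three English sentences are invented.

```python
import re


def word_types(text):
    """Lowercased set of word types in an English string."""
    return set(re.findall(r"[a-z']+", text.lower()))


def lexical_overlap(a, b):
    """Jaccard overlap of word types: shared types / all types."""
    ta, tb = word_types(a), word_types(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Invented renderings of the same line (hypothetical).
mt_a = "He opened the door and went into the room."
mt_b = "He opened the door and entered the room."
human = "The door gave way and he was inside."

machine_vs_machine = lexical_overlap(mt_a, mt_b)
machine_vs_human = lexical_overlap(mt_a, human)
```

In this toy case the two machine-style renderings share most of their vocabulary, while the freer human-style line shares only the function words and the noun.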

The model is staying close to the source because the model is trained to stay close to the source. The translator is staying close to the reader, because that is the job.

Style transfer is not what literary translation is

Style-aware MT papers are improving — the SAMAS system treats style as a signal and assembles specialized agents per piece. (SAMAS: Spectrum-Guided Multi-Agent System for Style Fidelity, arXiv 2602.19840) The benchmarks tick up. But a literary voice is not a style preset. It is the cumulative weight of every word an author refused. Models can be told what style to aim for; they cannot be told which words to refuse, because they have no record of an author's refusals.

Which is why the most credible working translators do not delegate the line. The workflow Simon Willison documented from Tom Gally — keep the LLM as a sentence-level thesaurus, ask for ten alternatives instead of one, verify across multiple models, then read the result out loud — survives because it never asks the model to be the translator. It asks the model to widen the human translator's bench. (A professional workflow for translation using LLMs)
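The "widen the bench" step can be sketched as pooling candidates from several models and handing the merged list to the translator. Everything here is hypothetical: `pool_alternatives` is our own name, and the two "models" are stand-in functions returning canned strings, where a real setup would call whatever LLM APIs the translator uses.

```python
def pool_alternatives(sentence, models, n=10):
    """Ask every model for up to n alternatives and merge them,
    recording which models proposed each candidate."""
    pooled = {}
    for name, ask in models.items():
        for candidate in ask(sentence, n):
            key = candidate.strip()
            if not key:
                continue
            names = pooled.setdefault(key, [])
            if name not in names:
                names.append(name)
    # Candidates proposed by more models come first; the human still chooses.
    return sorted(pooled.items(), key=lambda kv: -len(kv[1]))


# Stand-in "models" with canned output (hypothetical, no API is called).
def model_a(sentence, n):
    return ["She bought a ticket.", "She got herself a ticket."][:n]


def model_b(sentence, n):
    return ["She bought a ticket.", "A ticket, bought."][:n]


candidates = pool_alternatives("切符を買った。", {"a": model_a, "b": model_b})
```

The function returns a list and stops there on purpose: ranking by agreement is only a convenience for the reader, and the choice of line stays with the translator.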

What we'd argue, working both directions

Translate the paragraph, not the sentence. Run the draft through two models, not one. Read the Japanese aloud; if a native ear hears prose, you are closer than the score will tell you. Treat any line where the model and the human translator disagree as the interesting line — most of the work is in those.
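Finding the lines where model and human disagree can be roughed out with the standard library. The sketch below assumes the two drafts are already aligned line by line and uses character-level similarity from `difflib`; the 0.6 threshold and the example lines are arbitrary choices of ours, not a recommendation.

```python
from difflib import SequenceMatcher


def flag_disagreements(human_lines, model_lines, threshold=0.6):
    """Return (index, human, model) for aligned lines whose character-level
    similarity falls below the threshold."""
    flagged = []
    for i, (h, m) in enumerate(zip(human_lines, model_lines)):
        ratio = SequenceMatcher(None, h.lower(), m.lower()).ratio()
        if ratio < threshold:
            flagged.append((i, h, m))
    return flagged


# Invented drafts (hypothetical), already aligned line by line.
human_draft = [
    "She bought a ticket.",
    "The platform was empty, as it always was.",
]
model_draft = [
    "She bought a ticket.",
    "Nobody was standing on the platform, as usual.",
]

to_review = flag_disagreements(human_draft, model_draft)
```

A low similarity score does not say which version is better. It only points at the line worth reading aloud.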

The literal output is now nearly free. The voice still costs the same thing it always cost: a person making the call.

Sources

How Good Are LLMs for Literary Translation, Really? (arXiv 2410.18697) — Professional evaluators preferred human literary translations 94% of the time over GPT-4o, DeepL, and Google Translate.
Document-level context for literary translation (arXiv 2304.03245) — Paragraph-level prompting outperforms sentence-by-sentence translation across 18 language pairs including Japanese.
SAMAS: Spectrum-Guided Multi-Agent System for Style Fidelity (arXiv 2602.19840) — Treats literary style as a signal and assembles specialized translation agents per piece.
Best LLM for Japanese-English Translation: Benchmarks and Practical Selection (note.com) — Reviews WMT24 results showing Claude 3.5 Sonnet first in 9 of 11 language pairs, with specific notes on honorifics and omitted subjects.
A professional workflow for translation using LLMs (Simon Willison, 2025) — Documents Tom Gally's sentence-level, multi-model approach with the human translator firmly in charge.