When AI translation keeps the voice
By loom team
There is a sentence in a Murakami opening that does not survive a literal translation. It is not the meaning that breaks. It is the breathing — a comma where English would not put one, a verb held to the end because the paragraph needed it there. Any translator who has worked between English and Japanese has met that sentence. So has every machine that has tried to.
The new question is not whether large language models can translate. They can. The question is whether anything of the author makes it across.
On WMT24, the standard machine-translation shoot-out, Claude 3.5 Sonnet placed first in 9 of 11 language pairs, including Japanese-English, ahead of dedicated MT systems that have been tuned for a decade. The most credible review of the Japanese pair singles out Claude's handling of honorifics, omitted subjects, and implication: the parts of Japanese that have no one-to-one English equivalent and that break sentence-level translators. (Best LLM for Japanese-English translation, note.com)
The architectural shift behind that result is small and important. Earlier neural MT translated one sentence at a time. Document-level prompting (give the model the whole paragraph, even the whole chapter, and ask once) outperformed sentence-by-sentence translation across 18 linguistically diverse pairs including Japanese. The context matters because Japanese so often leaves its subject unstated; you frequently cannot recover it from a single line. (Document-level context for literary translation, arXiv 2304.03245)
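The difference is easiest to see in the prompts themselves. The sketch below is illustrative only: the function names, the prompt wording, and the three-sentence passage are assumptions made for this example, not taken from the cited paper, and no model is actually called.

```python
def sentence_level_prompts(sentences, src="Japanese", tgt="English"):
    """One prompt per sentence: the model never sees the neighbouring lines."""
    return [
        f"Translate this {src} sentence into {tgt}:\n{s}"
        for s in sentences
    ]


def document_level_prompt(sentences, src="Japanese", tgt="English"):
    """A single prompt carrying the whole passage, so an unstated subject
    can be resolved from the sentences around it."""
    passage = "\n".join(sentences)
    return (
        f"Translate the following {src} passage into {tgt}. "
        "Read the whole passage first and translate it as one piece.\n\n"
        f"{passage}"
    )


# Hypothetical passage: only the first sentence names its subject (Taro).
passage = ["太郎は駅に着いた。", "電車はもう出ていた。", "仕方なく歩いて帰った。"]

per_sentence = sentence_level_prompts(passage)
whole = document_level_prompt(passage)
```

In the sentence-level version, the third prompt contains no trace of who walked home. In the document-level version, the name is two lines up in the same request.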
So we have gotten somewhere. We have not gotten where literary translators want to be.
The most cited 2024 evaluation of LLMs on literary translation — a corpus of 2,197 annotated segments and 13,346 sentences across German-English, English-German, German-Chinese, and English-Chinese — ran professional translators against GPT-4o, DeepL, and Google Translate using Best-Worst Scaling. Professional evaluators preferred human translations 94% of the time. Even GEMBA-MQM, the strongest automated metric, picked the human translation in only 9.6% of comparisons. Automated metrics could not tell the difference; humans could, instantly. (How Good Are LLMs for Literary Translation, Really? arXiv 2410.18697)
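For readers who have not met Best-Worst Scaling: annotators see several translations of the same passage and mark only the best and the worst. A common way to turn those marks into a score is to count, per system, best picks minus worst picks, divided by the number of times it was shown. The sketch below implements that counting score; the judgments are invented for illustration and the paper's exact aggregation may differ.

```python
from collections import Counter


def bws_scores(judgments):
    """judgments: iterable of (systems_shown, best, worst) tuples.
    Returns system -> (best picks - worst picks) / times shown, in [-1, 1]."""
    shown, best, worst = Counter(), Counter(), Counter()
    for systems, b, w in judgments:
        shown.update(systems)
        best[b] += 1
        worst[w] += 1
    return {s: (best[s] - worst[s]) / shown[s] for s in shown}


# Invented judgments, not data from the study.
systems = ("human", "gpt4o", "deepl", "google")
judgments = [
    (systems, "human", "google"),
    (systems, "human", "deepl"),
    (systems, "gpt4o", "google"),
    (systems, "human", "google"),
]

scores = bws_scores(judgments)
```

A score of 1 means a system was picked as best every time it appeared, and -1 means it was picked as worst every time.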
The texture of the failure is what matters. Human translators produced the lowest syntactic similarity to the source (0.21) and the lowest lexical overlap with other systems (18.9%). LLMs clustered around 0.27 syntactic similarity and reused vocabulary. The authors are blunt: "high syntactic similarity frequently sacrifices naturalness in the target language and hinders creativity in translations."
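A rough feel for what "lexical overlap" measures can be had with a few lines of code. This is a minimal sketch using Jaccard overlap on word sets, which is an assumption chosen for simplicity and not necessarily the metric the authors used; the three English renderings are invented.

```python
import re


def tokens(text):
    """Lowercased word set; punctuation is dropped."""
    return set(re.findall(r"[a-z']+", text.lower()))


def lexical_overlap(a, b):
    """Jaccard overlap of the two word sets: shared words / all words."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Invented renderings of the same two source sentences.
human = "He got off at the station and found the train long gone."
mt_a = "He arrived at the station. The train had already left."
mt_b = "He arrived at the station. The train had already departed."

machine_vs_machine = lexical_overlap(mt_a, mt_b)
human_vs_machine = lexical_overlap(human, mt_a)
```

In this toy case the two machine outputs share eight of ten distinct words, while the human rendering shares five of fifteen with either of them. That is the clustering the paper describes, in miniature.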
The model is staying close to the source because the model is trained to stay close to the source. The translator is staying close to the reader, because that is the job.
Style-aware MT is improving: the SAMAS system treats style as a signal and assembles specialized agents per piece. (SAMAS: Spectrum-Guided Multi-Agent System for Style Fidelity, arXiv 2602.19840) The benchmarks tick up. But a literary voice is not a style preset. It is the cumulative weight of every word an author refused. Models can be told what style to aim for; they cannot be told which words to refuse, because they have no record of an author's refusals.
Which is why the most credible working translators do not delegate the line. The workflow Simon Willison documented from Tom Gally — keep the LLM as a sentence-level thesaurus, ask for ten alternatives instead of one, verify across multiple models, then read the result out loud — survives because it never asks the model to be the translator. It asks the model to widen the human translator's bench. (A professional workflow for translation using LLMs)
Translate the paragraph, not the sentence. Run the draft through two models, not one. Read the translation aloud; if a native ear hears prose, you are closer than the score will tell you. Treat any line where the model and the human translator disagree as the interesting line; most of the work is in those.
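Finding the lines where two drafts disagree is the one step here that is purely mechanical. A minimal sketch, assuming the two drafts are already aligned line for line and using character-level similarity from Python's standard difflib; the 0.8 threshold and the sample drafts are arbitrary choices for illustration.

```python
from difflib import SequenceMatcher


def disagreements(draft_a, draft_b, threshold=0.8):
    """Compare two line-aligned drafts and return (index, a, b, similarity)
    for every pair whose similarity ratio falls below the threshold."""
    flagged = []
    for i, (a, b) in enumerate(zip(draft_a, draft_b)):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio < threshold:
            flagged.append((i, a, b, round(ratio, 2)))
    return flagged


# Invented drafts: a model's rendering and a human translator's.
model_draft = [
    "Taro reached the station.",
    "The train had already left.",
    "He had no choice but to walk home.",
]
human_draft = [
    "Taro reached the station.",
    "The train had already left.",
    "So he walked home, because what else was there to do.",
]

flagged = disagreements(model_draft, human_draft)
```

Only the third line is flagged. The script says nothing about which version is better; it only says where to look.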
The literal output is now nearly free. The voice still costs the same thing it always cost: a person making the call.