2026-05-08 · 6 minute read · editorial · writing

Voice is a feature, not a setting

By loom team

Open a clean draft. Paste your best paragraph beside the model's. Read both out loud. You already know which is yours, and you knew before the second sentence — because voice isn't a knob you turn up. It's the thing that survives the edit, or it doesn't.

The new orthodoxy treats voice like a preference pane. Pick a register, set a temperature, paste three examples and hope. But the research on what large models actually do to personal style is unforgiving — and worth reading before you accept the next "rewrite in my tone" suggestion.

The flattening is measurable

A September 2025 paper evaluated six frontier models (GPT-4o, GPT-4o-mini, Gemini-2.0-Flash, Gemma-3-27B, DeepSeek-V3, Llama-4-Maverick) with the obvious experiment. Give the model some of an author's writing, ask it to continue in that voice, then see whether a stylometric classifier still attributes the continuation to that author. On formal corpora (the CCAT50 news set, professional email), authorship verification stayed strong: 95–97% accuracy. On the writing most of us actually do, blog posts and forum threads, verification collapsed to 16–21% on blogs and 49–66% on forums. The conclusion the authors land on is plain: "LLMs still struggle to reproduce nuanced personal styles — especially in informal and stylistically diverse domains." More examples didn't help. The architecture, not the prompt, is the ceiling. (Catch Me If You Can?, Findings of EMNLP 2025)

A second line of evidence runs the experiment in reverse. If you can tell human writing from machine writing in ten sentences with up to 1.00 accuracy on balanced sets — the result a 2025 stylometry study reports for Wikipedia versus GPT-4 — then the difference is not in the topic. It is in the texture: word frequencies, sentence shapes, the specific way someone keeps deciding what to keep. (Stylometry recognizes human and LLM-generated texts)
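For a sense of what "texture" means in practice, here is a minimal Python sketch of a stylometric fingerprint. It is not the cited study's method. The ten-word FUNCTION_WORDS list, the style_vector and cosine_distance helpers, and the divide-by-40 scaling of sentence length are all assumptions made for illustration; real stylometry works with far larger feature sets.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny feature list (an assumption for this sketch).
FUNCTION_WORDS = ["the", "and", "but", "of", "to", "a", "i", "it", "that", "is"]


def style_vector(text):
    """Return function-word frequencies plus a scaled mean sentence length."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)
    total = max(len(words), 1)
    freqs = [counts[w] / total for w in FUNCTION_WORDS]
    mean_sentence_len = len(words) / max(len(sentences), 1)
    # Dividing by 40 is an arbitrary scaling so this feature does not swamp the rest.
    return freqs + [mean_sentence_len / 40.0]


def cosine_distance(u, v):
    """0.0 means the same direction in feature space; larger means further apart."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)
```

Run style_vector on a paragraph you wrote and on a model's rewrite of it, then compare the two with cosine_distance. A larger distance means more drift in these few features, and nothing more than that.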

What the model is actually optimizing for

The recent ICLR-adjacent work on creative homogeneity puts numbers on something writers have been muttering about for two years. Pooled across the Alternative Uses Test, Forward Flow, and the Divergent Association Task, human population variability sat at 0.738, 0.835, and 0.819 respectively. The same tasks across LLMs collapsed to 0.459, 0.534, and 0.665, with every gap statistically significant at p < 0.001. The headline the authors chose is the right one: "LLM responses are much more similar to other LLM responses than human responses are to each other." (We're Different, We're the Same, arXiv 2501.19361)
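The size of the gap is easier to see as a ratio. A small Python sketch, using only the six figures quoted above; the dictionary names and layout are ours, not the paper's:

```python
# Population variability figures as quoted in the paragraph above.
human = {
    "Alternative Uses Test": 0.738,
    "Forward Flow": 0.835,
    "Divergent Association Task": 0.819,
}
llm = {
    "Alternative Uses Test": 0.459,
    "Forward Flow": 0.534,
    "Divergent Association Task": 0.665,
}

# LLM variability as a share of human variability, task by task.
ratios = {task: llm[task] / human[task] for task in human}

for task, ratio in ratios.items():
    print(f"{task}: {ratio:.0%}")
```

On these numbers the models show roughly 62%, 64%, and 81% of the human spread, depending on the task.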

That isn't a temperature problem. Models are trained to find the median sentence the median reader will accept. Asking them to write in your voice is asking them to leave the median on purpose, and they do not leave the median easily.

Style detection is improving faster than style production

The Better Call Claude evaluation on the PAN 2024 and 2025 style-change datasets showed Claude 3.7 Sonnet hitting F1 scores of 0.86, 0.84, and 0.66 across easy, medium, and hard splits — nearly matching a fine-tuned transformer on the medium set, zero-shot. (Better Call Claude, arXiv 2508.00680)

Read that next to the imitation results and the asymmetry is awkward: frontier models are better at spotting where the voice in a text changes than at writing in a given voice. They are good critics. They are mediocre mimics.

What this means for the tool you choose

A draft assistant that quietly regresses every sentence toward the model's median is not saving you time. It is laundering your voice out, then handing back something fluent enough that you stop noticing. Tom Gally's translation workflow, which Simon Willison documented, asks for ten alternative phrasings instead of one, works at the sentence level rather than the paragraph level, and puts the result through multi-model review. It works because it refuses to let the model decide what the line is. (A professional workflow for translation using LLMs, Simon Willison)

Voice is a feature. Treat it like one. Anchor on three pieces you would defend. Ask for ten variants and keep one. Score the result against the lines that earned it. The byline is the bet you are making — that the way you keep deciding is worth the reader's time. A tool that flattens that bet is not on your side.

Sources

Catch Me If You Can? Not Yet: LLMs Still Struggle to Imitate the Implicit Writing Styles of Everyday Authors (arXiv 2509.14543). Frontier-model evaluation showing 16–21% authorship verification on blogs versus 95–97% on news.
Stylometry recognizes human and LLM-generated texts in short samples (arXiv 2507.00838). Stylometric classifiers reach up to 1.00 accuracy distinguishing Wikipedia from GPT-4 on 10-sentence samples.
We're Different, We're the Same: Creative Homogeneity Across LLMs (arXiv 2501.19361). Quantifies cross-model homogeneity: LLM population variability is roughly 62–81% of human variability on standard creativity tasks.
Better Call Claude: Can LLMs Detect Changes of Writing Style? (arXiv 2508.00680). Claude 3.7 Sonnet hits 0.86 F1 on PAN 2024 style-change detection, outperforming several fine-tuned baselines.
A professional workflow for translation using LLMs (Simon Willison, 2025). Tom Gally's sentence-level, multi-model practice for keeping a human translator in charge of voice.