2026-05-30 · 6 minute read · editorial · culture

The hidden cost of AI-generated copy

By loom team

There is a particular paragraph appearing in newsletters this year. Three sentence fragments, a transitional adverb, an em-dash, a clean kicker. You have read it twice this week without noticing. It was written by a model, then by another model, then by a person who used a model. The cost of AI-generated copy is not that any one paragraph is bad. The cost is that they have all started to be the same paragraph.

This is the part the quality discourse keeps missing. Output quality has been good enough to ship for a while. What gets lost in "good enough" is the variance — the reason an editorial voice exists at all.

The diversity gap is now measured, not vibed

The Alternative Uses Test, Forward Flow, and the Divergent Association Task are three standard psychology instruments for creative diversity. Run on human groups and on LLM populations, they show a gap that is large and consistent. Human population variability lands around 0.74, 0.84, and 0.82 on the three tasks. LLM populations across frontier models collapse to 0.46, 0.53, and 0.67. Each gap is significant at p < 0.001. The authors put it plainly: "LLM responses are much more similar to other LLM responses than human responses are to each other." Pooling outputs from multiple models does not fix it, because they cluster together. (We're Different, We're the Same, arXiv 2501.19361)
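To make the size of that gap concrete, here is a minimal sketch of the arithmetic using only the figures quoted above. It assumes the three numbers are listed in the same order as the three tasks, which the text implies but does not state; the dictionary keys and the function name are ours, for illustration only.

```python
# Variability figures as quoted in the paragraph above.
# Assumption: the values are listed in the same order as the tasks.
human = {
    "Alternative Uses Test": 0.74,
    "Forward Flow": 0.84,
    "Divergent Association Task": 0.82,
}
llm = {
    "Alternative Uses Test": 0.46,
    "Forward Flow": 0.53,
    "Divergent Association Task": 0.67,
}


def variability_ratios(human_scores, llm_scores):
    """Return LLM variability as a fraction of human variability, per task."""
    return {task: llm_scores[task] / human_scores[task] for task in human_scores}


ratios = variability_ratios(human, llm)
for task, ratio in ratios.items():
    print(f"{task}: {ratio:.2f}")
# Alternative Uses Test: 0.62
# Forward Flow: 0.63
# Divergent Association Task: 0.82
```

On these numbers, LLM populations retain roughly 60% of human variability on two tasks and roughly 80% on the third.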

The co-writing literature finds the same thing inside a single piece. Co-writing with InstructGPT increased inter-author similarity and significantly reduced lexical and content diversity. It was one of the first clean empirical demonstrations that the assistant pulls writers toward each other, not just toward itself. (Homogenization Effects of LLMs on Human Creative Ideation, C&C 2024)

And the temperature defense (turn the knob up, dial up creativity) does not survive the data. An empirical comparison of human and ChatGPT writing across three studies found that human-written essays increased the collective semantic diversity of an essay group "approximately two to eight times more" than base GPT-4 essays did. Modifying prompts and sampling parameters did not close the gap. (Homogenizing effect of LLMs on creative diversity, ScienceDirect)
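For readers who want a feel for what "collective diversity" means, here is a rough sketch of one way to score it: the mean pairwise distance between texts in a group, and how much a new text moves that mean. This is not the study's method. Word counts stand in for the semantic embeddings a real analysis would use, and every function name here is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Lowercased word counts: a crude stand-in for a semantic embedding."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 minus the cosine similarity of two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def mean_pairwise_distance(texts):
    """Average distance over all unordered pairs; higher means more varied."""
    vectors = [bag_of_words(t) for t in texts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


def diversity_gain(group, new_text):
    """How much adding new_text raises the group's mean pairwise distance."""
    return mean_pairwise_distance(group + [new_text]) - mean_pairwise_distance(group)


group = [
    "our new release is here",
    "our new release is live",
    "our new release is out",
]
same_again = diversity_gain(group, "our new release is ready")
different = diversity_gain(group, "the cat slept on the warm roof")
print(f"{same_again:.2f} {different:.2f}")  # 0.00 0.40
```

A fourth near-identical sentence adds nothing to the group; a sentence from somewhere else moves the mean. That difference in contribution is the kind of quantity the study compares between human and model essays.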

Why detection is winning, faster than imitation

The clean tell is stylometric. A 2025 paper on stylometry classifiers reached up to 1.00 accuracy distinguishing Wikipedia text from GPT-4 text on 10-sentence balanced sets, and a multiclass Matthews correlation coefficient of 0.87 separating seven LLM-and-human classes. Paraphrasing attacks barely dented it: recall stayed above 98% in most cases. (Stylometry recognizes human and LLM-generated texts, arXiv 2507.00838)
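Stylometric features are surface statistics that ignore what a text says and measure how it is built. The sketch below computes a handful of them. The feature list and the small function-word set are our own illustrative choices, not the paper's feature set, and no classifier is trained here; a real pipeline would feed vectors like these into one.

```python
import re
import statistics

# Illustrative only: a tiny function-word list, not the one used in the paper.
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "that", "it", "is"}

WORD_PATTERN = re.compile(r"[A-Za-z']+")


def stylometric_features(text):
    """Return a few surface-level style statistics for a piece of text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = WORD_PATTERN.findall(text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    sentence_lengths = [len(WORD_PATTERN.findall(s)) for s in sentences]
    return {
        # Average number of words per sentence.
        "mean_sentence_length": statistics.mean(sentence_lengths),
        # Spread of sentence lengths: a rough proxy for rhythm.
        "sentence_length_stdev": statistics.pstdev(sentence_lengths),
        # Distinct words divided by total words.
        "type_token_ratio": len(set(words)) / len(words),
        # Share of words drawn from the function-word list.
        "function_word_rate": sum(w in FUNCTION_WORDS for w in words) / len(words),
        # Comma density.
        "commas_per_word": text.count(",") / len(words),
    }


features = stylometric_features("The cat sat. The dog ran away, fast.")
print(features)
```

None of these numbers depends on topic, which is one plausible reason a paraphrase that changes the words can leave the signal largely intact.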

The Better Call Claude work shows the same asymmetry inside the model itself. Zero-shot, Claude 3.7 Sonnet hit an F1 of 0.86 on the easy and 0.66 on the hard PAN style-change tasks, nearly matching a fine-tuned transformer on the medium split. The frontier model can recognize a stylistic shift inside a paragraph. It still cannot reliably produce one. (Better Call Claude, arXiv 2508.00680)

The hidden cost, named

The cost of AI-generated copy compounds at the category level, not the page level. One launch announcement reads fine. The hundredth launch announcement that reads the same way does something else — it teaches the reader to skim. Then the brand that earned its readers by sounding like itself spends the next quarter trying to figure out why open rates are sliding.

The cost is paid by:

Brands that lose the variance that earned them an audience.
Categories that lose the difference between competitors. If three SaaS companies all run the model that finds the median sentence, the median sentence is no longer worth reading.
Readers, who stop paying attention because there is less to pay attention to.

The detection literature is now ahead of the production literature, which is to say: the median paragraph is becoming easier for a classifier to spot than for a model to vary. Six months ago you could call AI copy "fluent." Now the fluency itself is the signal.

What we'd ship instead

Use the model for the work it is shaped for — research, an outline, fifteen variants of a sentence you already wrote. Do not let it pick the line. Anchor every draft on three pieces with a voice you would defend, and check the output against them. Read the result aloud — the model cannot hear the difference between rhythm and median, and you can.

The fluent paragraph is now free. The voice that does not sound like everyone else's is the one thing the model cannot give you back.

Sources

We're Different, We're the Same: Creative Homogeneity Across LLMs (arXiv 2501.19361). Cross-model creativity-task variability is roughly 60% to 80% of human variability, depending on the task, with every gap significant at p < 0.001.
Homogenizing effect of LLMs on creative diversity: human vs ChatGPT (ScienceDirect). Human writing increased collective essay diversity two to eight times more than base GPT-4, and parameter tweaks did not close the gap.
Homogenization Effects of LLMs on Human Creative Ideation (C&C 2024). Co-writing with InstructGPT raises inter-author similarity and lowers lexical and content diversity.
Stylometry recognizes human and LLM-generated texts in short samples (arXiv 2507.00838). Tree-based stylometry hits up to 1.00 binary accuracy and 0.87 multiclass MCC distinguishing human from LLM text.
Better Call Claude: Can LLMs Detect Changes of Writing Style? (arXiv 2508.00680). Frontier LLMs are now strong style-change detectors, even as they struggle to produce stylistic variety themselves.