Why Dapple Grey Press Is Choosing “Human Certified.”
- Rae Lawrence

- Mar 21
- 7 min read
There is a serious problem with AI detectors, and too many people are pretending otherwise. I have been speaking out about this issue on social media for some time now, drawing on the domain experience in machine learning and AI that I've gained through my day job.
AI detectors are being marketed as if they can deliver a definitive answer to the question of whether a text was AI- or human-generated. They're being used as if they can "prove" misconduct, and trusted in situations where the cost of a wrongful accusation is reputational, academic, and personal. The only ones who stand to win are the lawyers.
That is a big claim, and the numbers do not back it up.
At Dapple Grey Press, we are not interested in outsourcing judgment to unreliable software. That is why, from this spring, we will begin adding our own Human Certified label to our covers, with earlier titles in our catalogue being updated retrospectively.
This is our editorial position.
Selling unearned confidence
AI detector companies sell a compelling idea: paste in a passage, click a button, and a few seconds later you get a RAG (red/amber/green) assessment telling you whether a human or a robot wrote it.
In my professional opinion, this is a fantasy. And incredibly irresponsible.
Even OpenAI withdrew its own AI text classifier, explicitly stating that it had a low rate of accuracy and that human-written text could be incorrectly labelled as AI-written. Yikes! When the Big-Daddy of the GPT world is backing away from this, you have to question the others schlepping these tools.
That should have been the moment the whole market sobered up. Instead, detector tools kept appearing, kept boasting, and kept being used by organisations desperate for a definitive answer. The vendors, meanwhile, laughed all the way to the bank.
"Now we see the violence inherent in the system!"
I've been dying to use that Monty Python line for ages. Help! Help! I'm being repressed.
Sorry - I got distracted. Back to the blog.
Large language models (LLMs) improve by absorbing vast quantities of human language: books, articles, websites, social media posts, reviews, essays, blogs, forum comments, news copy, and more. Detector models then try to identify “AI-like” writing by learning statistical patterns from that same language environment.
While that doesn't necessarily mean every detector is trained on the exact same dataset as every LLM, it does mean they are trying to police a line that is constantly being crossed and reshaped by that same body of data.
That is a systemic flaw. Put more bluntly: it is like grading your own homework.
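For the technically curious, here is a minimal sketch of the kind of perplexity-style statistic several public detectors are believed to lean on. The model (GPT-2 via the Hugging Face transformers library), the threshold, and the mapping to a score are my own illustrative choices, not any vendor's actual method.

```python
# Toy, illustrative perplexity-based "AI-likeness" score. The threshold and
# the conversion from perplexity to a score are invented for illustration.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Average per-token perplexity of the text under GPT-2."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

def crude_ai_score(text: str, threshold: float = 60.0) -> float:
    """Lower perplexity means 'more AI-like' under this toy heuristic."""
    return max(0.0, min(1.0, 1.0 - perplexity(text) / threshold))
```

The "AI-like" signal here is simply how predictable the text is to a language model, which is the very quantity the generators are trained to optimise. Plain, well-edited human prose is predictable too, and that is where the false positives creep in.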
As the generators improve by learning from more human writing, and from real-world feedback about what passes as natural, the endpoint the detectors are trying to predict becomes less stable.
That is not a sound basis for high-stakes judgment.
A ten-minute experiment is enough to expose the problem
Because I spend so much of my day job working with AI, my scientific curiosity took over and I started digging deeper and experimenting.
I tested several AI detectors on classic literature: Frankenstein, Pride and Prejudice, and Wuthering Heights.
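If you want to repeat the exercise, a ten-minute version looks roughly like the sketch below. It assumes you have local text copies of the novels (Project Gutenberg is the obvious source), and the detector call is deliberately left as a placeholder: the whole point is to paste the same passages into whichever commercial tool you want to put on trial.

```python
# Sketch of the ten-minute experiment. score_with_detector is a placeholder;
# wire it to whichever detector you want to test, or paste passages manually.
from pathlib import Path

def score_with_detector(passage: str) -> float:
    """Placeholder: return the tool's claimed '% AI-generated' for a passage."""
    raise NotImplementedError("Call the detector of your choice here.")

def passages(path: Path, words_per_passage: int = 400, n: int = 5):
    """Yield a handful of fixed-length passages from a novel."""
    words = path.read_text(encoding="utf-8").split()
    for i in range(n):
        chunk = words[i * words_per_passage:(i + 1) * words_per_passage]
        yield " ".join(chunk)

for novel in ("frankenstein.txt", "pride_and_prejudice.txt", "wuthering_heights.txt"):
    for passage in passages(Path(novel)):
        print(novel, score_with_detector(passage))
```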
The results were ludicrous. I alternated between saying WTF, laughing out loud, and sighing deeply. Organisations are paying for these detector models and trusting them as if they were gospel.
Depending on the detector and the passage, these undeniably human-written texts came back as anything from around 10% AI-generated to roughly 40% AI-generated.
Unless Mary Shelley, Jane Austen, and Emily Brontë were moonlighting as time-travelling prompt engineers, this is a failure of the highest order. Plain and simple. Do not pass GO. Do not collect £200.
If a detector looks at canonical literature and decides it might be partly machine-written, then the detector is a scam. It is unreliable at best, malicious at worst.
And this is exactly where the detector industry becomes hard to take seriously. It makes lofty claims about accuracy, but once you look at these systems as a modeller rather than as a customer, the weaknesses are evident. Shiny marketing means very little if the model produces false positives on legitimate writing, struggles with edge cases, and cannot be trusted in the conditions where people are actually using AI to generate writing.
In my own professional field, I would not be comfortable publishing a model with that kind of practical unreliability and low confidence as if it were fit for high-stakes decision-making. I would trust other experts to put the kibosh on it during peer review. I'd be mortified to hand such a poor model over to a customer; I would rather simply explain why the model wasn't good enough.
I certainly would not market it to worried institutions as a solution to an integrity crisis.
False positives are not a minor inconvenience
Let's not underestimate the danger of false positives.
A false positive here is not an abstract statistical inconvenience. It is an accusation against a real person: a student, a teacher, an author, a journalist, a researcher.
Research has already shown that AI detectors can be biased against non-native English writers, misclassifying their work as AI-generated at much higher rates. A 2024 paper on false positives argued directly that AI detection tools can unfairly accuse scholars of AI plagiarism and cause real harm. A 2026 paper on AI detectors in education (actual PDF you can read!) went further, arguing that these systems rely on unverifiable probabilistic estimates and should not be treated as if they can determine misconduct.
The outcomes of those studies should have been enough to end the fantasy that a detector score is proof.
A detector score is an output from a probabilistic model with known weaknesses, unclear applicability boundaries, and real-world failure modes that are nowhere near trivial. These limitations are brushed aside (or buried deep in the fine print) by the companies pushing these models.
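To see why this matters in practice, here is a back-of-envelope calculation. Every number in it is hypothetical, and deliberately generous to the detector.

```python
# Hypothetical, generous numbers; not measurements of any real detector.
prevalence = 0.10           # assume 10% of submitted texts really are AI-generated
true_positive_rate = 0.90   # assume the detector catches 90% of those
false_positive_rate = 0.05  # assume it wrongly flags only 5% of human writing

flagged_guilty = prevalence * true_positive_rate             # 0.090
flagged_innocent = (1 - prevalence) * false_positive_rate    # 0.045

share_innocent = flagged_innocent / (flagged_guilty + flagged_innocent)
print(f"{share_innocent:.0%} of flagged texts were written by humans")  # ~33%
```

Even on those flattering assumptions, roughly one flagged text in three was written by a human. With more realistic false-positive rates, or a lower share of genuinely AI-generated submissions, the picture gets worse.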
In any serious modelling discipline, that would be a scandal: retracted papers, reputational damage, blacklisting.
Frankly, I would not publish models like this
This is where professional experience matters.
In scientific modelling, I don't get to wave a flashy accuracy number around and hope nobody questions it. I have to consider the training and validation sets, failure modes, class balance, and the domain of applicability. And, very importantly, I consider the cost of being wrong. If I release a low-quality, low-confidence model, researchers will waste time and resources chasing red herrings. The ensuing annoyance comes back on my employer and me, and damages our reputations in our industry.
There are so many unanswered questions about AI detectors.
What happens on simple prose? What happens on edited prose? What happens on historical prose (I have hypotheses)? What happens on dyslexia-friendly prose? What happens on text written by someone whose English is correct but non-idiomatic? What happens when the text is partly human and partly machine-assisted?
These are not niche edge cases. These are exactly the scenarios a detector will meet in the real world.
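If a vendor wanted me to take their tool seriously, the minimum I would ask for is a stratified evaluation: the false-positive rate reported separately for each of those groups, on text known to be human-written. Here is a hypothetical sketch of that check; the sample data and the detector are placeholders, not anything a vendor actually publishes.

```python
# Hypothetical stratified check: false-positive rate per writer group, rather
# than one headline accuracy figure. samples and detector_score are placeholders.
from collections import defaultdict

GROUPS = [
    "simple prose", "heavily edited prose", "historical prose",
    "dyslexia-friendly prose", "correct but non-idiomatic English",
]
# Partly machine-assisted text needs its own, separate evaluation, since it is
# not purely human-written and a false-positive rate is not well defined for it.

def false_positive_rate_by_group(samples, detector_score, threshold=0.5):
    """samples: iterable of (text, group) pairs, all known to be human-written.
    Returns the share of each group wrongly flagged as AI-generated."""
    flagged, total = defaultdict(int), defaultdict(int)
    for text, group in samples:
        total[group] += 1
        if detector_score(text) >= threshold:
            flagged[group] += 1
    return {group: flagged[group] / total[group] for group in total}
```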
And yet the industry behaves as though a probabilistic guess with a fancy user interface is enough to justify accusations, suspicion, and reputational damage.
If I saw statistics like those from the commercial AI detectors in a model submission in my own domain, I would question why anyone thought it was ready for deployment. Then I would gripe about people wasting my time with garbage. Heads would roll.
Publishing is already in dangerous territory
This is not hypothetical.
In March 2026, The Guardian reported that Hachette pulled the horror novel Shy Girl after allegations that parts of the text(*) may have been AI-generated. The issue was not proof. It was suspicion, scrutiny, and reputational risk.
There are dragons here. Big ones. With sharp pointy teeth. And teams of lawyers greedily rubbing their hands together, ready for the windfall.
Once an accusation is made, the damage starts immediately. Proof often arrives too late, if it arrives at all. And when the underlying tools are unreliable and low-confidence, the whole situation becomes legally and ethically ugly.
Students can be falsely accused. I have high-school-aged daughters, and I've made my position on AI detectors clear to their school leadership.
A writer's reputation can be publicly tarnished. An organisation can be sued for libel or defamation, and even a junior statistician could introduce enough reasonable doubt to sway the judgment towards the plaintiff.
Ultimately, careers can and will be damaged on the back of software that is inherently flawed.
This recklessness keeps me up at night.
This gets worse as LLMs get better
The inconvenient truth is that the detectors are chasing their own tails.
As more human writing is published online and models continue to improve through training, fine-tuning, reinforcement, and wider exposure to real-world language, generated text will become smoother, more varied, and harder to distinguish from human writing on the page. OpenAI’s own move away from a text classifier and toward provenance-focused approaches reflects that challenge.
That means AI detectors are in a losing position.
Over time, they will drift towards two equally useless outcomes: flagging almost nothing, or flagging everything. Either they miss the content they are meant to catch, or they start accusing legitimate writers because the boundary between human- and machine-generated material has collapsed.
This is all bad.
Dapple Grey Press is choosing provenance over paranoia
We do not believe that slapping “AI-generated” labels on books is the right answer.
That approach assumes the detector behind the label is trustworthy enough to justify it. In most cases, it's not.
What matters far more is provenance.
Where did the manuscript come from? How did it develop? Can the author show drafts and sloppy copies, revision history, conversations, editorial exchanges, and the ordinary evidence of real creative work? Has the publisher actually done the job of reading critically and knowing its authors?
That is the standard we care about.
At Dapple Grey Press, we look at early drafts. We talk to our authors. We edit properly. We read with care. We assess the work itself. We use judgment. And yes, we also look for bad writing, because we're still allowed to notice when prose is weak, derivative, or suspiciously synthetic.
It is tedious. But it is responsible.
What “Human Certified” means to us
This spring, Dapple Grey Press will begin adding our own Human Certified label to our covers, with earlier books in our catalogue being updated retrospectively.
We don't believe in AI detector fairytales. We believe readers deserve confidence that the books they buy from us were created by human authors: people bringing imagination, effort, revision, craft, emotion, and care to the page.
We have a clear editorial commitment: we know our authors, we know our books, and we stand behind the provenance of what we publish.
About Rae Lawrence
Rae Lawrence, PhD, brings over twenty years of professional expertise in computational chemistry, QSAR, and machine learning to Dapple Grey Press’s approach to AI. She has published on artificial intelligence in drug discovery and co-authored work on machine learning for medicinal chemistry, including Artificial Intelligence in Drug Discovery and MedChemInformatics: An Introduction to Machine Learning for Drug Discovery. She is currently Product Manager at Optibrium, where her work sits at the intersection of scientific software, modelling, and AI-enabled drug discovery.
(*) I read some of the excerpts. I will not share an opinion on the origin of the text. But I will say that I don't think the writing is publication-ready. I expect a higher-quality product if I'm spending money to read it.




