
“We tell ourselves stories in order to live.” The line is Joan Didion’s, from The White Album. And she’s right. We tell stories about progress, about innovation, about the inexorable march of technology. Some of us tell a particularly seductive story about artificial intelligence.
A story of digital oracles and nascent superintelligence; a story that promises to solve the problems of the world. What we seldom discuss is the cost of the story, the quiet harvest required to feed the protagonist. We seldom discuss the books.
It begins with an appetite. A need. To build a large language model (LLM) like Meta’s Llama 3, one requires a vast and ready supply of high-quality writing. And so, a simple ethical question presented itself to the employees at Meta.
To acquire the necessary text legally, through licensing, was a possibility. But it was, in the calculus of the moment, an unappealing one. It was “unreasonably expensive.” It was, a senior manager added, “incredibly slow.” One potential data provider could take more than four weeks to deliver.
The other option was piracy.
The story of this decision, of this quiet predation, did not stay within the walls of the company. It surfaced in the language of court documents, in copyright-infringement lawsuits brought by authors who discovered their life’s work had been taken.
It galvanised writers like me in the United Kingdom, as our trade union, the Society of Authors (SoA), launched a petition not merely of protest, but of profound existential concern. A narrative was emerging, one of disquiet, a story about the unseen cost of telling ourselves stories about progress.
The calculus of acquisition
There is a certain kind of tremor that runs through a system when its foundational assumptions are questioned. For the creators of these new systems, the old ways were an impediment. Licensing was not just slow and expensive; it was a strategic dead end.
A Meta director of engineering noted a peculiar downside to the legal route: “The problem is that people don’t realise that if we license one single book, we won’t be able to lean into fair use strategy.” The logic is cold, circular, and revealing. To maintain the argument that taking is not theft, one must not pay.
And so attention turned elsewhere. It turned to the shadow libraries, to the grey zones of the internet where everything is available. It turned to Library Genesis, known as LibGen, one of the largest archives of pirated material in the world. The scale is difficult to comprehend.
LibGen contains more than 7.5 million books and 81 million research papers. For the team at Meta, this was the solution. A senior manager felt it was “really important for [Meta] to get books ASAP”, noting that “books are actually more important than web data”.
This was not a decision made in the lower echelons. According to the court filings, the team at Meta eventually received permission from “MZ”, an apparent reference to Meta CEO Mark Zuckerberg, to download and use the dataset. The sanction came from the top. This was not an isolated case. In a separate lawsuit, it was revealed that OpenAI, the creator of ChatGPT, had also used LibGen in the past.
This internal narrative stands in contrast to the company’s public-facing story. Officially, Meta has stated that Llama 3 was pre-trained on “over 15 trillion data tokens from publicly available sources.” They emphasise the use of extensive data quality filters and classifiers to curate this vast dataset.
It’s the polished story of progress, a narrative of scale and technical curation that omits the details emerging from the courts.
Mining the shadow library
To understand LibGen is to understand a particular modern paradox. Created around 2008 by scientists in Russia, its stated purpose was to serve those without access to academic journals in “Africa, India, Pakistan, Iran, Iraq, China, Russia and post-USSR.” It grew through peer-to-peer networks, a decentralised ghost that authorities found impossible to exorcise. Publishers have tried.
Elsevier won a $15 million judgment in 2015; Macmillan and others won a $30 million judgment in 2023. The fines went unpaid, the injunctions were ignored, and the library continued to grow.
Its appeal to an AI company is self-evident. The collection mirrors our entire literary and scientific culture. It holds recent works by Sally Rooney and Jonathan Haidt, classics by Dostoevsky, and academic papers from leading journals like Nature and The Lancet. Cujo is in there, as is The Gulag Archipelago. It is a sprawling, disorganised, but breathtakingly comprehensive snapshot of human knowledge. It is the perfect harvest.
Ethical fissures
There was no illusion about the undertaking. Internal communications at Meta show employees acknowledged a “medium-high legal risk” in training Llama on LibGen. They discussed “mitigations” to mask their activity.
The suggestions were telling. “Remove data clearly marked as pirated/stolen” and “do not externally cite the use of any training data including LibGen.”
One manager suggested fine-tuning the model to refuse queries like, “reproduce the first three pages of ‘Harry Potter and the Sorcerer’s Stone’.” But the centre only holds so long. A flicker of personal conscience surfaced in a message from one employee who stated that “torrenting from a corporate laptop doesn’t feel right.”
In June 2025, a U.S. federal judge dismissed a key lawsuit brought by authors, including Sarah Silverman, against Meta. However, the ruling was not a clear victory for the tech giant. The judge noted that the plaintiffs had made the wrong arguments and “failed to develop a record in support of the right one,” while explicitly stating that the decision “does not stand for the proposition that Meta’s use of copyrighted materials to train its language models is lawful.”
The battle continues, mired in legal nuance, with diverging court decisions leaving the core question of fair use unanswered.
The authors’ response
The reaction from authors was not one of surprise, but of anger. For creators across the UK, the discovery that their work was in the LibGen database, likely used to train a commercial product without their permission, was a “clear infringement of copyright law”.
The Society of Authors, which represents writers, illustrators, and translators, gave this anger a voice. It launched a petition demanding that the British government hold Meta to account. The central friction is one of power. An individual author feels “almost powerless given the enormous cost and complexities of pursuing litigation against corporate defendants with such deep pockets”.
This is what the petition calls the “unscrupulous behaviour exhibited by global tech companies which seemingly exploit copyright-protected material, safe in the knowledge that they will not be held to account”.
The demands are specific. SoA calls on the UK’s Secretary of State for Culture, Media and Sport to summon Meta executives to Parliament. They demand a detailed response, “unequivocal assurances” that copyright will be respected, and, crucially, that Meta “pay authors for all historic infringements.”
Failure to act, the petition states, “will unquestionably have a catastrophic and irreversible impact on all UK authors” as their rights are “systematically and repeatedly ignored.”
Broader implications
We are left with a deeper vertigo. It isn’t just about who owns what. It’s about how we know what we know. The AI chatbots that result from this harvest are presented as “oracles” but they rarely cite their sources, or they invent them.
This practice “decontextualises knowledge, prevents humans from collaborating, and makes it harder for writers and researchers to build a reputation and engage in healthy intellectual debate”.
The accessibility of shadow libraries like LibGen relies entirely on the initial labour: the time, expertise, and money of the people who created the knowledge in the first place. Now, the AI companies that have absorbed that labour aim to create products that “compete with the originals”. This leads to the central, unnerving question posed in The Atlantic:
Will these AI systems “be better for society than the human dialogue they are already starting to replace?”
A question of value
We are left staring at a delta between the stories we tell ourselves and the facts on the ground. The facts are of a quiet, systematic acquisition of the entire corpus of human culture, undertaken without permission because doing so legally was too slow, too expensive, and strategically inconvenient.
This is a fundamental reckoning with what we value. We are being shown, in the cold logic of internal memos and legal filings, the precise value assigned to the creative work that underpins our society: less than the cost of asking.
The unseen cost is the erosion of the idea that a writer’s work, a scientist’s research, an artist’s vision, has a value that must be honoured. The harvest is underway. And the looming question is what, if anything, will be left to sow when it is over.
Sources
Brittain, Blake. “Meta fends off authors’ US copyright lawsuit over AI.” Reuters, 25 June 2025, https://www.reuters.com/sustainability/boards-policy-regulation/meta-fends-off-authors-us-copyright-lawsuit-over-ai-2025-06-25/.
Didion, Joan. The White Album. 1979.
“Llama 3 Guide: Everything You Need to Know About Meta’s New Model and Its Data.” Kili Technology, https://kili-technology.com/large-language-models-llms/llama-3-guide-everything-you-need-to-know-about-meta-s-new-model-and-its-data.
O’Brien, Matt, and Barbara Ortutay. “Judge tosses authors’ AI training copyright lawsuit against Meta.” PBS NewsHour, 26 June 2025, https://www.pbs.org/newshour/arts/judge-tosses-authors-ai-training-copyright-lawsuit-against-meta.
Reisner, Alex. “The Unbelievable Scale of AI’s Pirated-Books Problem.” The Atlantic, 20 March 2025, https://www.theatlantic.com/technology/archive/2025/03/libgen-meta-openai/682093/.
The Society of Authors. “Protect authors’ livelihoods from the unlicensed use of their work in AI training.” Change.org, https://www.change.org/p/protect-authors-livelihoods-from-the-unlicensed-use-of-their-work-in-ai-training.
Hi, I'm Miriam - an independent AI ethicist, writer and strategic SEO consultant.
I break down the big topics in AI ethics, so we can all understand what's at stake. And as a consultant, I help businesses build resilient, human-first SEO strategies for the age of AI.
*This article was originally published in Ai-Ai-OH.