
Anthropic's Panama Project: 2M Books Scanned for AI

Explore Anthropic's Project Panama, a secret book-scanning effort that processed millions of volumes for AI training data. Unsealed documents reveal destructive methods, piracy allegations, and ongoing copyright lawsuits against companies like Meta and OpenAI, shedding light on ethical challenges in AI development.

Anthropic’s Secret Book-Scanning Project: A Deep Dive into AI’s Data Hunger

In the fast-evolving world of artificial intelligence, few stories capture the tension between innovation and ethics quite like the revelations surrounding Anthropic’s Project Panama. This ambitious initiative, aimed at digitizing vast libraries of books to fuel AI models, highlights the high-stakes race among tech giants to amass training data. What started as a covert operation in early 2024 has now come to light through unsealed court documents, exposing how far companies are willing to go to build smarter systems like the chatbot Claude. At its core, Project Panama wasn’t just about scanning pages—it was a calculated push to ingest human knowledge into machines, often skirting legal boundaries.

The project’s internal planning document laid it out plainly: “Project Panama is our effort to destructively scan all the books in the world.” The emphasis on secrecy was clear—“We don’t want it to be known that we are working on this.” Within a year, Anthropic had poured tens of millions of dollars into acquiring and processing millions of books. This involved slicing off spines, scanning pages at high speeds, and ultimately recycling the remnants. The goal? To enrich the datasets powering AI models, making them more capable of understanding and generating human-like text.

These details emerged from over 4,000 pages of documents in a copyright lawsuit filed by book authors against Anthropic, a company valued at $183 billion by investors. While Anthropic settled the case for $1.5 billion in August, a district judge’s decision to unseal records last week painted a fuller picture of the company’s aggressive data acquisition tactics. This isn’t an isolated incident; it’s part of a larger pattern where AI firms like Anthropic, Meta, Google, and OpenAI have pursued colossal troves of data through sometimes clandestine means.

Books, in particular, have emerged as a prized resource in this data race. Court records reveal why: they offer structured, high-quality narratives that can teach AI to “write well,” avoiding the pitfalls of “low quality internet speak,” as one Anthropic co-founder noted in a January 2023 document. A 2024 email from Meta echoed this, calling access to digital book troves “essential” for staying competitive. Yet, gaining permission from publishers and authors proved impractical for these companies, leading them to explore bulk acquisition methods that often bypassed consent—including downloading pirated copies from online shadow libraries.

The Value of Books in AI Training

To understand the frenzy around books, it’s helpful to step back and look at how AI models are built. Modern large language models, like those behind Claude or ChatGPT, rely on AI training data—massive datasets that the systems analyze to learn patterns in language, facts, and creativity. Internet scrapes provide volume, but books deliver depth. Fiction hones storytelling skills; nonfiction builds factual accuracy. Without diverse, high-fidelity sources, AI outputs can feel shallow or erratic.

Anthropic’s co-founder highlighted this in early 2023, arguing that book-based training could elevate AI from mimicking casual online chatter to producing polished prose. This isn’t hyperbole; studies in AI research have long shown that curated corpora, like literary works, improve model coherence and reduce biases from noisy web data. For companies racing to release cutting-edge products, books represent a shortcut to sophistication.

However, this pursuit has sparked ethical debates. Authors and creators argue that their intellectual property is being harvested without compensation, fueling AI tools that could one day compete directly with human works. The lawsuits underscore a core question: Can AI companies treat the world’s literature as a free-for-all resource, or does this cross into exploitation?

Unsealed Documents Reveal Project Panama’s Inner Workings

Project Panama’s execution was as meticulous as it was secretive. Launched in early 2024, the project quickly scaled up, with Anthropic investing heavily in logistics and technology. The company acquired physical books in bulk, often in lots of tens of thousands, from used book retailers like Better World Books and the U.K.-based World of Books. While exact figures remain redacted, a vendor proposal outlined the scope: converting 500,000 to two million books over six months using specialized equipment.
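For a sense of scale, the vendor proposal's range implies an industrial pace. A rough back-of-envelope calculation (treating six months as roughly 180 days of continuous operation, an assumption for illustration only) looks like this:

```python
# Back-of-envelope throughput implied by the vendor proposal's figures:
# 500,000 to 2,000,000 books converted over six months.
# The 180-day conversion and round-the-clock operation are assumptions.
LOW_BOOKS, HIGH_BOOKS = 500_000, 2_000_000
DAYS = 180  # assume six months ~ 180 days

low_per_day = LOW_BOOKS / DAYS
high_per_day = HIGH_BOOKS / DAYS
print(f"Implied pace: {low_per_day:,.0f} to {high_per_day:,.0f} books per day")
```

Even at the low end, that is thousands of books sliced, scanned, and recycled every day.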

The process was industrial and irreversible—what the documents call “destructive scanning.” Books were fed into hydraulic-powered cutting machines that sliced off spines neatly, freeing pages for high-speed scanners. These scanners captured images at production-level quality, turning text into digital files suitable for AI ingestion. Afterward, the mutilated books were scheduled for recycling, a step that raised eyebrows among critics for its wastefulness.

To lead this effort, Anthropic tapped Tom Turvey, a Silicon Valley veteran who had spearheaded Google’s controversial Google Books project two decades earlier. That initiative, which digitized millions of titles, faced its own legal battles over copyright but ultimately helped popularize large-scale book scanning. Turvey’s expertise was a natural fit, bringing proven methods to Anthropic’s table.

Initial plans considered sourcing from libraries or iconic bookstores. Documents from a March 2024 content acquisition meeting mention outreach to the Strand in New York City, famous for its “18 miles” of shelves, which expressed interest in selling used stock. Discussions also floated approaching underfunded U.S. libraries, including the New York Public Library, as potential partners. However, it’s unclear if these ideas panned out. The Strand confirmed no sales occurred, and the NYPL has not commented.

Instead, Anthropic leaned on established booksellers for efficiency. This physical approach contrasted with digital shortcuts but aligned with a strategy to build a clean, proprietary dataset. By buying and scanning legitimately purchased books, the company aimed to sidestep some piracy allegations—though not entirely, as later revelations showed.

Challenges and Costs of Destructive Scanning

Destructive scanning isn’t new; libraries and archives have used it for decades to preserve content digitally while managing physical decay. But at Anthropic’s scale, it posed unique hurdles. Logistical costs soared: transportation, storage, and specialized machinery demanded significant upfront investment. The vendor proposal estimated high operational expenses, including labor for handling delicate volumes and quality checks to ensure scans were accurate enough for OCR and downstream AI processing.

Environmentally, the project drew quiet criticism. Recycling millions of books at least diverts paper from landfills, but the energy-intensive scanning process, running around the clock on industrial equipment, adds to AI’s already hefty carbon footprint. Proponents argue the long-term benefits outweigh this: digitized knowledge becomes accessible indefinitely, potentially reducing the need for physical printing. Still, for authors whose works were destroyed in the process, it feels like a double loss: their creations fuel AI without consent, then vanish as waste.

An image from court filings depicts a sprawling book warehouse implicated in the project, stacks of volumes awaiting their fate. This visual underscores the sheer ambition—and controversy—of turning humanity’s literary heritage into machine-readable fuel.

The Shadow Side: Piracy and Clandestine Data Acquisition

While Project Panama focused on physical books, Anthropic’s data hunt extended to digital shadows. Court filings allege the company downloaded millions of pirated titles from shadow libraries—unauthorized online repositories like LibGen, a notorious site hosting books, articles, and more. In June 2021, co-founder Ben Mann spent 11 days pulling fiction and nonfiction from LibGen, using file-sharing software documented in browser screenshots.

A year later, in July 2022, Mann excitedly shared a link to the Pirate Library Mirror with colleagues, timing it with the site’s launch. This mirror boasted a massive database and openly admitted to violating copyright laws in most countries. Anthropic maintains it never trained revenue-generating models on this data and avoided using the mirror for full AI builds. Even so, the disclosures paint a picture of executives embracing risky sources to accelerate development.

Meta’s story mirrors this pattern, with internal messages revealing a mix of enthusiasm and unease. Employees downloaded millions of books via torrent platforms, which reward uploaders with faster access to files. One 2023 engineer message captured the discomfort: “Torrenting from a corporate laptop doesn’t feel right.” Another flagged potential legal issues with sharing pirated works during downloads.

Despite these concerns, the practice escalated. A December 2023 email noted approval after “escalation to MZ” (likely CEO Mark Zuckerberg) for using LibGen in training Llama 3, with mitigations for risks such as media coverage that could undermine regulatory negotiations. By April 2024, teams were torrenting via Amazon-rented servers to obscure traces back to Meta. The company denies distributing plaintiffs’ works through these methods.

OpenAI and Microsoft face similar accusations in a 2023 lawsuit. OpenAI, where Mann and Anthropic CEO Dario Amodei previously worked, admitted downloading LibGen but claimed deletion before ChatGPT’s release. Attorney Justin A. Nelson, representing authors, called OpenAI the “starting gun” for AI piracy, accusing it of “strip-mining” human expression.

Internal Debates: Ethics vs. Competition

These revelations highlight a recurring theme: internal pushback clashing with competitive pressures. At Meta, legal teams reviewed risks, yet higher-ups prioritized speed. Chat logs show employees questioning server choices to “avoid tracing back” activity, suggesting deliberate opacity.

This isn’t just about books; it’s symptomatic of broader AI data acquisition challenges. Early AI research thrived on open academic sharing, but commercialization amplified stakes. Companies invested billions in pipelines reliant on copyrighted material, creating a lock-in effect. As Cornell Tech’s James Grimmelmann notes, firms “talked themselves into a fallacy,” extending research norms to profit-driven models without adapting.

Anthropic’s pivot to physical scanning after its earlier piracy downloads proved prescient. Grimmelmann calls it a “smart call” for legal compliance, blending restraint with ambition.

Lawsuits and the Fair Use Defense

The wave of lawsuits against AI companies stems from creators—authors, artists, photographers, and news outlets—fighting for control over their work. Books are a focal point, with plaintiffs alleging infringement in training processes. Most cases remain ongoing, but early rulings offer clues.

Central to defenses is fair use, a U.S. copyright doctrine allowing limited use for purposes like criticism or research. In June, District Judge William Alsup ruled Anthropic’s book use for AI training transformative, likening it to teachers instructing students. He emphasized models don’t replicate books but create “something different,” calling it “quintessentially transformative.”

Similarly, in the Meta case, Judge Vince Chhabria dismissed harm-to-sales claims, finding no evidence AI outputs supplanted book purchases. Yet, acquisition methods remain vulnerable. Alsup greenlit class-action status for authors whose books appeared in shadow libraries Anthropic downloaded pre-Panama. The $1.5 billion settlement lets claimants seek about $3,000 per title, without admitting fault.
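The reported figures hint at the scale of the class. Dividing the settlement pool by the per-title payout (both numbers are approximations from the court coverage, not exact settlement terms) gives:

```python
# Rough scale implied by the reported settlement figures.
# Both numbers are approximations, not exact settlement terms.
SETTLEMENT_POOL = 1_500_000_000  # $1.5 billion total
PER_TITLE = 3_000                # ~$3,000 per claimed title

implied_titles = SETTLEMENT_POOL // PER_TITLE
print(f"Implied coverage: roughly {implied_titles:,} titles")
```

That puts the covered works in the hundreds of thousands of titles, consistent with the scale of the shadow-library downloads at issue.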

Meta faces ongoing claims of illegal distribution via torrents, with plaintiffs pushing to expand the class action. Anthropic’s deputy general counsel, Aparna Sridhar, affirmed that the June ruling stands: the settlement resolved how the books were acquired, not the legality of training on them.

Key Rulings and Their Implications

  • Authors vs. Anthropic (Anthropic) — Ruling highlights: training on books is fair use, transformative like education; piracy downloads may infringe. Outcome: $1.5B settlement; class action for shadow-library victims.
  • Authors vs. Meta (Meta) — Ruling highlights: no proven market harm from AI outputs; torrent distribution allegations proceed. Outcome: partial dismissal; class-action pursuit ongoing.
  • Authors vs. OpenAI/Microsoft (OpenAI, Microsoft) — Ruling highlights: allegations of LibGen use; deletion claimed pre-ChatGPT. Outcome: ongoing; no early rulings.
  • Authors/Illustrators vs. Google (Google) — Ruling highlights: publishers seek to join 2023 suit over book use. Outcome: pending expansion.

These decisions signal fair use could shield training, but not shady sourcing. Grimmelmann warns unsettled law leaves room for appeals, potentially reshaping copyright in AI.

“We urgently need a reset across the AI industry, such that creatives start being paid fairly for the vital contributions they make,” says Ed Newton-Rex, former AI executive and nonprofit leader advocating for creators’ rights.

Broader Industry Ramifications: Creators, Competition, and Compliance

The fallout extends beyond courtrooms. Authors feel undervalued, their lifetimes of work distilled into AI without royalties. Newton-Rex’s call for fair pay resonates amid AI’s rise, where tools now draft novels or summarize texts, blurring lines between inspiration and imitation.

For companies, the pressure mounts. Investors value Anthropic at $183 billion partly on Claude’s promise, but scandals erode trust. Settlements like the $1.5 billion hit—while not crippling—signal rising costs for data practices. Meta’s Llama models and OpenAI’s GPT series face scrutiny, with regulators eyeing negotiations.

Ethically, this pushes a reckoning. Early AI pioneers benefited from permissive data norms, but commercialization demands accountability. Grimmelmann points to the “fast-paced, high-stakes competition” locking firms into risky paths. Anthropic’s Panama shift exemplifies adaptation: By investing in physical scans, it mitigated piracy claims while securing quality data.

Looking ahead, alternatives emerge. Licensed datasets from publishers could foster partnerships, though at far greater cost than free shadow libraries. Blockchain-tracked content or opt-in creator programs might balance access and compensation. Yet, with cases like Google’s expanding suit, the industry braces for more turbulence.

Ongoing Cases and Potential Shifts

  • OpenAI/Microsoft: Focus on preemptive deletions; authors demand proof the models weren’t shaped by the deleted data.
  • Google: Publishers joining to amplify claims, targeting search-to-AI pipelines.
  • Meta Torrent Claims: Distribution allegations could set precedents for digital piracy in corporate settings.

These battles could redefine AI ethics, urging sustainable models over scorched-earth acquisition. As AI integrates deeper into daily life—from writing aids to virtual assistants—the question lingers: Will innovation respect its human roots?

Toward Ethical AI: Lessons from the Data Wars

Project Panama and its peers reveal AI’s double-edged sword: boundless potential shadowed by overreach. Companies chased knowledge aggressively, blending legitimate buys with gray-area downloads, all to outpace rivals. Yet, as rulings affirm fair use for training while punishing piracy, a hybrid path crystallizes—ethical sourcing paired with legal safeguards.

For creators, advocacy grows. Nonprofits like Newton-Rex’s push for industry-wide resets, envisioning royalties from AI outputs tied to training data. Authors in settlements recover modestly, but systemic change demands more: transparent audits, consent frameworks, and revenue shares.

Technologically, advancements like synthetic data generation offer hope, reducing reliance on real works. But books’ irreplaceable value—nuanced expression honed over centuries—ensures they’ll remain central.

Ultimately, this saga isn’t just about scanned pages or shadow downloads; it’s a pivot point for AI’s soul. Will it evolve as a collaborative force, honoring creators, or persist as an extractor? The courts, creators, and companies hold the answer, but one thing’s clear: The race for data is far from over, and its finish line demands fairness.