Court Filings Reveal Meta Employees Debated Using Copyrighted Content for AI Training
For years, internal discussions at Meta have touched on a controversial topic—whether to use copyrighted content obtained through questionable means to train the company’s AI models. Recently unsealed court documents shed new light on how Meta employees wrestled with the legal and ethical implications of this decision.
The filings are part of Kadrey v. Meta, one of several ongoing legal battles over AI and copyright in the U.S. courts. Meta maintains that training AI on copyrighted materials—particularly books—falls under “fair use.” However, the plaintiffs, including well-known authors like Sarah Silverman and Ta-Nehisi Coates, strongly disagree.
Earlier case materials suggested that Meta CEO Mark Zuckerberg had approved using copyrighted content for AI training and that Meta had walked away from negotiations with book publishers over licensing deals. But the latest documents provide the clearest glimpse yet into the company’s internal conversations, revealing how it may have decided to proceed despite the legal gray areas—especially for its Llama AI models.
In one internal chat, Meta employees, including Melanie Kambadur, a senior manager on the Llama research team, openly discussed using books they knew might be legally risky.
“My opinion would be—kind of like ‘ask for forgiveness, not permission’—we try to get the books and escalate the decision to the executives,” wrote Xavier Martinet, a Meta research engineer, in a February 2023 conversation. “That’s why they set up this Gen AI org—to take more risks.”
Martinet even suggested simply buying e-books at retail prices to create a training dataset rather than going through the lengthy process of negotiating licensing agreements with publishers. When another employee warned that using copyrighted content without permission could lead to legal trouble, Martinet remained unfazed.
“I mean, worst-case scenario? We find out later that it’s actually okay, while a ton of startups have already pirated books off BitTorrent,” he wrote, emphasizing the slow pace of licensing negotiations.
These conversations paint a striking picture of how AI training decisions are made behind closed doors—where innovation, legality, and corporate risk tolerance collide.
In a work chat, Kambadur mentioned that Meta was in discussions with Scribd, a document-hosting platform, and others about obtaining licenses. However, she pointed out that while using “publicly available data” to train AI models still required approvals, Meta’s legal team had become more flexible than before.
"Yeah, we definitely need to get licenses or approvals for publicly available data," Kambadur acknowledged, according to the filings. "The difference now is that we have more money, more lawyers, and better business development support. We can fast-track approvals, and our legal team is taking a less conservative approach."
The Libgen Discussion
In another internal conversation detailed in the filings, Kambadur and her colleagues explored the possibility of using Libgen, a platform that aggregates links to copyrighted books and academic papers, as a potential data source for Meta’s AI training.
Libgen has faced multiple lawsuits, has been ordered to shut down, and has been fined millions of dollars for copyright infringement. One of Kambadur’s colleagues responded to the discussion by sharing a screenshot of a Google search result stating, “No, Libgen is not legal.”
Despite this, some within Meta worried that avoiding Libgen could put the company at a disadvantage in the AI race. According to the filings, Sony Theakanath, Meta’s director of product management, emailed Joelle Pineau, VP of Meta AI, emphasizing that Libgen was “essential” for achieving top performance in AI benchmarks.
In the same email, Theakanath proposed strategies to minimize legal risks. He suggested filtering out files explicitly labeled as “pirated” or “stolen” and not publicly disclosing any use of Libgen data. “We would not disclose use of Libgen datasets used to train,” he wrote.
This strategy, according to the filings, involved scanning Libgen files for words like “stolen” or “pirated” and ensuring models avoided responding to risky prompts, such as requests to reproduce copyrighted material.
Other Data Sources and Legal Risks
The filings also suggest that Meta may have scraped data from Reddit for training its AI models, possibly by mimicking the behavior of a third-party app called Pushshift. In April 2023, Reddit announced plans to charge AI companies for access to its data, raising questions about whether Meta had already obtained that data before the new policy took effect.
In a March 2024 chat, Chaya Nayak, a director in Meta’s generative AI division, mentioned that leadership was reconsidering previous restrictions on data sources. She noted that past decisions to avoid using Quora content, licensed books, and scientific articles were being questioned, as Meta’s existing training data—largely Facebook and Instagram posts and other text from its own platforms—was insufficient. “[W]e need more data,” she wrote.
Legal Battle Intensifies
The lawsuit Kadrey v. Meta, originally filed in 2023 in the U.S. District Court for the Northern District of California, has been amended multiple times. The latest complaint alleges that Meta cross-referenced pirated books with legally available ones to determine whether it was worth pursuing a licensing agreement with publishers.
Recognizing the high legal stakes, Meta has brought in two Supreme Court litigators from the law firm Paul Weiss to bolster its defense.
Meta has not yet responded to requests for comment.