Categories Op-Ed

The News Media And AI: A New Front In Copyright Law

Yona Roerts Golding, Columbbia Jouralism Review, CJR, OpenAI, FTC, Congress, AI,Meta, Karla Ortiz, Large Language Models, ChatGPT, Copy Right Law
          Yona Roberts Golding

In the months since OpenAI and other companies began releasing generative AI tools to the public, the technology has become the basis for jokes, apocalyptic fears, business models—and it has also become a highly contested front in copyright law, including for the news media.

A step back: the thing we call generative artificial intelligence is, in fact, a kind of computing that looks for patterns in information and mimics them on command. To do this with any amount of coherence, the system must first be rigorously “trained”—fed absolutely enormous amounts of data to digest into a series of statistical probabilities. For an image-generating AI, this content includes millions of images. For a words-generating AI, it includes millions of words.

Several artists and authors have filed lawsuits over the use of their work in these training datasets. Karla Ortiz, a conceptual artist, is a plaintiff in ongoing cases against three image-generating services. “The creative economy only works when the basic tenets of consent, credit, compensation, and transparency are followed,” Ortiz said at a roundtable hosted by the Federal Trade Commission on October 4.

And, according to interviews with lawyers and media companies, as well as public statements from some of the key players, it may be an issue for the news media, too. We know for certain that at least some of the large language models (LLMs) behind generative AI tools—Open AI’s ChatGPT, Meta’s Llama, Google’s Bard—pulled from the copyrighted material of online publishers to train their systems.

The data on what material was used is ambiguous and fragmentary. But in addition to repositories of pirated books, research papers from Meta and OpenAI report that their models were trained in large part with data scraped by CommonCrawl, a nonprofit that essentially compiles data from public websites for use in research.

In 2021, the Allen Institute of AI ran an analysis based on a snapshot of CommonCrawl data. The Washington Post drew on this report to publish a searchable index of the websites scraped by CommonCrawl and used to train LLMs. The analysis revealed that half of the top ten Web sources were news outlets, including the New York Times and Los Angeles Times. (CJR was 13,420th on the list.)

What remains unclear is how the inclusion of this content could affect the news publishing industry—and what individual publishers will do about it. Tech companies maintain that their datasets are protected by the fair use doctrine, which allows for the repurposing of copyrighted work under certain conditions.

But Ortiz’s lawyer, Matthew Butterick, is not convinced by that argument. “Worst of all,” he said in an interview, “these language models are being held out commercially as replacing authors. There are instances already of AI-generated books being posted to Amazon, sometimes under real names, sometimes under fake names.”

In recent months rumors have circulated that several high-profile media companies might also bring the likes of OpenAI, one of the most prominent generative AI developers, to court.  During an interview at the Semafor Media Summit in April, Barry Diller, chairman of IAC, which owns the Daily Beast and People magazine, among many others, said that unless media companies are compensated for the use of copyrighted material, “all will be lost” for the industry. Diller argued that the practice is absolutely grounds for litigation, suggesting that companies should pursue legal action against OpenAI and others. (Later he seemed to soften that position.)

In June, the Wall Street Journal reported that companies including the New York Times, News Corp, and Diller’s IAC were considering forming a coalition to address the threats posed by AI. According to the article, one threat was that the ability of chatbots to respond directly to user queries could draw traffic away from publishers’ own websites, potentially diminishing ad and subscription revenue.

(In June, ChatGPT Plus users discovered that the chatbot’s internet search feature, Browse with Bing, could even bypass publisher paywalls directly when asked to reprint an article from a URL. The company temporarily disabled the feature “in order to do right by content owners,” which it said it did “out of an abundance of caution.” The feature was modified and made available again to paying users in September.)

According to Mehtab Khan, a lawyer and legal scholar at Harvard’s Berkman Klein Institute whose work focuses on copyright and internet law, filing a lawsuit at this stage would be a gamble. If the publishers lost, it could define training data for AI as fair use, which would close off an avenue for contesting the practice. So other options are also being explored.

The Times and a number of other outlets have indicated their desire to protect their copyrighted material with the addition of a line of code, robot.txt, meant to prevent Web scrapers from accessing their materials in the future (it does nothing to remove material that had previously been scraped).

In July, the Associated Press became the first news outlet to reach a licensing agreement with OpenAI over the use of some of its news content, effectively sidestepping court involvement and ensuring that the news agency would benefit from the use of its work.  Even as the industries work toward common ground, Congress and the FTC are involved in discussions around how copyright law applies in the training and use of AI. Their task, according to Khan, will be reframing existing ideas about how the fair use doctrine protects the public interest.

“There’s this fear that if text and data mining is restricted, if it’s not considered fair use, then there’d be too much power concentrated in the hands of large copyright holders,” she said. The key, she said, is to “find a balance that would allow the public to have access, but would also address some of the anxieties that artists and creative industries have about their words being used without consent, and without compensation.”

—————————

Yona Roberts Golding is a  Columbia Journalism Review  CJR fellow. 

Author

Avatar photo

Making law simple, clear, and useful for everyone.

More From Author

Leave a Reply

Your email address will not be published. Required fields are marked *

You May Also Like

Artificial intelligence, AI and human civilization, AI reshaping society, future of AI, AI and employment, AI replacing jobs, AI and culture, AI ethics, AI and human intelligence, machine learning impact, AI and education, AI technology advancement, AI and creativity, AI domination, superintelligent AI, AI and mental health, AI risks and benefits, automation and unemployment, AI and governance, AI and decision making, AI and social media, AI manipulation, human vs machine, AI and fine arts, AI and literature, AI controlled systems, AI policy, AI and economic inequality, future of work, AI and banking, AI and healthcare, AI existential threat, Nick Bostrom AI, Kai-Fu Lee AI, AI and skilled workers, AI and unskilled labor, AI and job transition, responsible AI, AI and spirituality, AI and quantum physics, AI governance, AI and youth, ethical AI development, AI Pandora's Box, AI friend or foe, AI and human creativity, AI and global challenges, AI and climate change, AI and manufacturing, AI and logistics

Minds We Made: How AI Is Reshaping Work, Culture, Ethics, And Human Intelligence

Besides other grave issues such as climate change, global warming, nuclear proliferation and scarcity of…

Shanza Shadab,Millennials vs Gen Z intelligence, Plato’s Cave digital metaphor, Gen Z attention span, Flynn Effect decline, digital literacy vs deep focus, algorithmic bias critical thinking, cognitive evolution, social media impact on learning, modern mind erosion, information processing speed.

Millennials vs. GenZ: Evolution or Erosion of the Modern Mind? From Plato’s cave to Instagram reels – the real truth behind GenZ’s intelligence

Whether today’s youth really becoming less intelligent than previous generations? Why do many teachers say…

Digital Pakistan 2026, Data Protection Bill, National AI Policy Pakistan, Privacy Rights Pakistan, Cyber Security Law, Khudayar Mohla Advocate, Freelancers Pakistan, PECA Law, Data Adequacy, 5G Pakistan, Biometric Security, Digital Trade Pakistan, IT Exports.

The Data We Don’t Own

It’s 2026, and if you listen to the digital discourse in Islamabad or Karachi, you…