Categories Op-Ed

The News Media And AI: A New Front In Copyright Law

Yona Roerts Golding, Columbbia Jouralism Review, CJR, OpenAI, FTC, Congress, AI,Meta, Karla Ortiz, Large Language Models, ChatGPT, Copy Right Law
          Yona Roberts Golding

In the months since OpenAI and other companies began releasing generative AI tools to the public, the technology has become the basis for jokes, apocalyptic fears, business models—and it has also become a highly contested front in copyright law, including for the news media.

A step back: the thing we call generative artificial intelligence is, in fact, a kind of computing that looks for patterns in information and mimics them on command. To do this with any amount of coherence, the system must first be rigorously “trained”—fed absolutely enormous amounts of data to digest into a series of statistical probabilities. For an image-generating AI, this content includes millions of images. For a words-generating AI, it includes millions of words.

Several artists and authors have filed lawsuits over the use of their work in these training datasets. Karla Ortiz, a conceptual artist, is a plaintiff in ongoing cases against three image-generating services. “The creative economy only works when the basic tenets of consent, credit, compensation, and transparency are followed,” Ortiz said at a roundtable hosted by the Federal Trade Commission on October 4.

And, according to interviews with lawyers and media companies, as well as public statements from some of the key players, it may be an issue for the news media, too. We know for certain that at least some of the large language models (LLMs) behind generative AI tools—Open AI’s ChatGPT, Meta’s Llama, Google’s Bard—pulled from the copyrighted material of online publishers to train their systems.

The data on what material was used is ambiguous and fragmentary. But in addition to repositories of pirated books, research papers from Meta and OpenAI report that their models were trained in large part with data scraped by CommonCrawl, a nonprofit that essentially compiles data from public websites for use in research.

In 2021, the Allen Institute of AI ran an analysis based on a snapshot of CommonCrawl data. The Washington Post drew on this report to publish a searchable index of the websites scraped by CommonCrawl and used to train LLMs. The analysis revealed that half of the top ten Web sources were news outlets, including the New York Times and Los Angeles Times. (CJR was 13,420th on the list.)

What remains unclear is how the inclusion of this content could affect the news publishing industry—and what individual publishers will do about it. Tech companies maintain that their datasets are protected by the fair use doctrine, which allows for the repurposing of copyrighted work under certain conditions.

But Ortiz’s lawyer, Matthew Butterick, is not convinced by that argument. “Worst of all,” he said in an interview, “these language models are being held out commercially as replacing authors. There are instances already of AI-generated books being posted to Amazon, sometimes under real names, sometimes under fake names.”

In recent months rumors have circulated that several high-profile media companies might also bring the likes of OpenAI, one of the most prominent generative AI developers, to court.  During an interview at the Semafor Media Summit in April, Barry Diller, chairman of IAC, which owns the Daily Beast and People magazine, among many others, said that unless media companies are compensated for the use of copyrighted material, “all will be lost” for the industry. Diller argued that the practice is absolutely grounds for litigation, suggesting that companies should pursue legal action against OpenAI and others. (Later he seemed to soften that position.)

In June, the Wall Street Journal reported that companies including the New York Times, News Corp, and Diller’s IAC were considering forming a coalition to address the threats posed by AI. According to the article, one threat was that the ability of chatbots to respond directly to user queries could draw traffic away from publishers’ own websites, potentially diminishing ad and subscription revenue.

(In June, ChatGPT Plus users discovered that the chatbot’s internet search feature, Browse with Bing, could even bypass publisher paywalls directly when asked to reprint an article from a URL. The company temporarily disabled the feature “in order to do right by content owners,” which it said it did “out of an abundance of caution.” The feature was modified and made available again to paying users in September.)

According to Mehtab Khan, a lawyer and legal scholar at Harvard’s Berkman Klein Institute whose work focuses on copyright and internet law, filing a lawsuit at this stage would be a gamble. If the publishers lost, it could define training data for AI as fair use, which would close off an avenue for contesting the practice. So other options are also being explored.

The Times and a number of other outlets have indicated their desire to protect their copyrighted material with the addition of a line of code, robot.txt, meant to prevent Web scrapers from accessing their materials in the future (it does nothing to remove material that had previously been scraped).

In July, the Associated Press became the first news outlet to reach a licensing agreement with OpenAI over the use of some of its news content, effectively sidestepping court involvement and ensuring that the news agency would benefit from the use of its work.  Even as the industries work toward common ground, Congress and the FTC are involved in discussions around how copyright law applies in the training and use of AI. Their task, according to Khan, will be reframing existing ideas about how the fair use doctrine protects the public interest.

“There’s this fear that if text and data mining is restricted, if it’s not considered fair use, then there’d be too much power concentrated in the hands of large copyright holders,” she said. The key, she said, is to “find a balance that would allow the public to have access, but would also address some of the anxieties that artists and creative industries have about their words being used without consent, and without compensation.”

—————————

Yona Roberts Golding is a  Columbia Journalism Review  CJR fellow. 

Author

Avatar photo

Making law simple, clear, and useful for everyone.

More From Author

Leave a Reply

Your email address will not be published. Required fields are marked *

You May Also Like

Pakistan population crisis, population growth in Pakistan, overpopulation in Pakistan, population control Pakistan, family planning Pakistan, Pakistan demographic challenge, child labor in Pakistan, underage children Pakistan, poverty and population growth, population and unemployment Pakistan, social issues in Pakistan, human development Pakistan, quality vs quantity population, population pressure on resources, Pakistan education crisis, healthcare burden Pakistan, crime and population growth, extremism and population Pakistan, psychological impact of overpopulation, unproductive population Pakistan, weak Pakistani passport, global image of Pakistan, governance failure Pakistan, ruling elite and population control, vote bank politics Pakistan, BISP subsidies population policy, conditional state benefits Pakistan, two child policy Pakistan, population legislation Pakistan, women education Pakistan, children rights Pakistan, Islamic perspective on family planning, population policy examples China Bangladesh Iran, national interest and population control, sustainable development Pakistan, future generations Pakistan, Islamabad opinion column, Pakistan social reform, public policy Pakistan, population awareness Pakistan

Growing Population: Pakistan’s Silent but Greatest Crisis

Whenever the issue of population growth is discussed in Pakistan, one argument is immediately thrown…

Khudayar Mohla,Ahmer Bilal Soofi, US withdrawal from international organisations, Trump foreign policy, international law analysis, UN Charter violation, non-intervention principle, use of force doctrine, Department of War renaming, NATO crisis, Greenland annexation, Venezuela intervention, US unilateralism, Russia China joint declaration, multilateral institutions, global governance vacuum, China Russia influence, developing countries blocs, G77 influence, Organisation of Islamic Cooperation, international legal order, law of the sea convention, International Seabed Authority, International Criminal Court Rome Statute, US opposition to ICC, norm against war, civilisational norms, global peace and security, US foreign policy critique, international law commentary, BBC international law interview

An Opportunity for States

The withdrawal of the US from 66 international organisations has not been noticed enough given…

CDA Amendment Bill 2025, Capital Development Authority, Islamabad land acquisition, property compensation Pakistan, land resettlement Islamabad, urban development Islamabad, displaced communities Pakistan, land-sharing compensation, rehabilitation and resettlement, President Asif Ali Zardari, Prime Minister Shehbaz Sharif, Article 254 Constitution Pakistan, equitable urban growth, property rights Pakistan, distributive justice Islamabad, Khudayar Mohla

CDA Amendment Bill 2025: Can Islamabad’s Urban Growth Serve Its People?

Urban development in Pakistan’s capital is about more than buildings and roads – it mirrors…