By Perla Khattar[1]


On September 19, 2023, George R.R. Martin and other professional fiction writers filed a class action lawsuit against OpenAI in the United States District Court for the Southern District of New York. The plaintiffs alleged that at the heart of Large Language Models (LLMs) exists “systematic theft on a mass scale.”[2] In their complaint, the plaintiffs explained that OpenAI, the maker of the LLM ChatGPT, copied their copyrighted works of fiction without permission and fed the data into LLMs that are carefully programmed to “output human-seeming text responses to users’ prompts and queries.”[3] The authors allege that OpenAI downloaded the manuscripts from pirated eBook repositories.[4]

The writers emphasize that ChatGPT not only threatens their livelihood by imitating their marketable creative literary expression, but also undermines the originality of their work by mimicking, summarizing, and paraphrasing their copyrighted manuscripts.[5] The plaintiffs add that OpenAI could have resorted to works in the public domain to train its computer algorithms, but instead allegedly chose to violate the Copyright Act by scraping the authors’ copyrighted works without paying a reasonable licensing fee.[6]

This lawsuit is far from the first time an artist or creative firm has sued a generative AI company for unlawfully scraping and processing copyrighted materials: in February 2023, Getty Images sued Stability AI in the United States District Court for the District of Delaware for allegedly “[copying] more than 12 million photographs” from the plaintiff’s collection to train its AI without permission or compensation.[7] Stability AI was also sued by a group of visual artists in the United States District Court for the Northern District of California for alleged vicarious copyright infringement and violation of the Digital Millennium Copyright Act.[8] These lawsuits raise important questions about the ways in which LLMs are trained and the legality of data scraping under U.S. law, namely Title 17 of the United States Code.[9]

LLMs, like ChatGPT, are the culmination of a multifaceted and rigorous process that involves data collection, preprocessing, architecture design, and extensive training. The genesis of ChatGPT, and models of its ilk, can be traced through the following five steps.[10]

First, Data Collection and Preprocessing: The foundation of ChatGPT’s existence lies in the vast corpus of text data it is trained on. This data is collected from diverse sources across the internet, encompassing books, articles, websites, and various forms of written content. Careful attention is paid to ensure that the dataset represents a wide spectrum of languages, topics, and writing styles, making it a comprehensive reflection of human language. Preprocessing is then undertaken to clean and format the data, removing any noise or inconsistencies that might hinder the model’s learning process.
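
To make the preprocessing step concrete, the snippet below is a minimal sketch of a cleaning pipeline, assuming only the Python standard library; it illustrates the kind of work involved, not OpenAI’s actual pipeline. It strips markup, normalizes whitespace, drops very short documents, and removes exact duplicates.

```python
import hashlib
import re

def clean_document(raw_html: str) -> str:
    """Strip HTML tags and collapse whitespace in one scraped document."""
    text = re.sub(r"<[^>]+>", " ", raw_html)   # drop markup left over from scraping
    return re.sub(r"\s+", " ", text).strip()   # normalize whitespace

def preprocess(raw_documents: list[str], min_chars: int = 200) -> list[str]:
    """Clean, filter, and deduplicate a corpus of scraped documents."""
    seen_hashes: set[str] = set()
    corpus: list[str] = []
    for raw in raw_documents:
        text = clean_document(raw)
        if len(text) < min_chars:               # discard fragments and noise
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:               # skip exact duplicates
            continue
        seen_hashes.add(digest)
        corpus.append(text)
    return corpus
```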

Second, Architecture Design: Once the data is collected and processed, the next step is the design of the neural network architecture that underlies ChatGPT and other LLMs. Researchers and engineers craft architectures that can handle the immense complexity of language, typically using a deep learning framework like the transformer architecture. These architectures consist of multiple layers of attention and feedforward mechanisms, allowing the model to capture intricate patterns and dependencies within the data.
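
For readers curious what “multiple layers of attention and feedforward mechanisms” look like in code, the following is a simplified sketch of a single decoder block in the transformer pattern, written with PyTorch. The dimensions are arbitrary illustrative choices, and the block is not a reproduction of any particular production architecture.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One simplified transformer decoder block: masked self-attention plus feedforward."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, d_ff: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position may attend only to itself and earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)            # residual connection + layer norm
        x = self.norm2(x + self.ff(x))          # feedforward sublayer
        return x
```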

Third, Training: The heart of the process involves training the model on the prepared data. ChatGPT, for example, undergoes extensive training through a process of self-supervised learning, in which the text itself supplies the training signal. During training, the model learns to predict the next word in a sentence based on the preceding context. It does this repeatedly, over millions of sentences, refining its understanding of language with each iteration. The training process requires immense computational power and can take several days or even weeks to complete.
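
That next-word objective can be illustrated with a toy training loop. The sketch below uses a deliberately tiny stand-in model and random token ids in place of real tokenized text, purely to show the mechanics of next-token prediction with a cross-entropy loss; real pretraining runs operate over vastly larger models and corpora.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, seq_len, batch_size = 1000, 64, 32, 8

# A deliberately tiny stand-in for an LLM: token embedding followed by a linear head.
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(100):
    # Random token ids stand in for a batch of tokenized training text.
    tokens = torch.randint(0, vocab_size, (batch_size, seq_len))
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # shift by one: predict the next token
    logits = model(inputs)                            # shape: (batch, seq_len - 1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```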

Fourth, Fine-tuning: Fine-tuning involves exposing the model to a narrower dataset that is carefully curated and generated with human reviewers’ input. These reviewers follow guidelines provided by the model developers to ensure the model’s behavior aligns with desired standards. This iterative feedback process helps shape the model’s responses and ensures it adheres to ethical and safety guidelines.
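
As a rough sketch, fine-tuning can be pictured as continuing the same training loop on a much smaller, human-curated dataset, typically with a lower learning rate and with the loss restricted to the tokens reviewers actually curated. The function below is hypothetical: `pretrained_model` and `curated_batches` are placeholders, not real OpenAI weights or data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fine_tune(pretrained_model: nn.Module, curated_batches, vocab_size: int, epochs: int = 3):
    """Continue training a pretrained model on reviewer-curated token batches.

    Each batch is assumed to be (inputs, targets, loss_mask), where loss_mask is a
    float tensor with 1.0 on curated response tokens and 0.0 everywhere else.
    """
    # A lower learning rate than pretraining, so the model adapts without forgetting.
    optimizer = torch.optim.AdamW(pretrained_model.parameters(), lr=1e-5)
    for _ in range(epochs):
        for inputs, targets, loss_mask in curated_batches:
            logits = pretrained_model(inputs)          # assumed shape: (batch, seq, vocab)
            loss = F.cross_entropy(
                logits.reshape(-1, vocab_size), targets.reshape(-1), reduction="none"
            )
            # Only the curated tokens contribute to the fine-tuning loss.
            loss = (loss * loss_mask.reshape(-1)).sum() / loss_mask.sum()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return pretrained_model
```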

Fifth, Deployment and Monitoring: Once the model reaches a satisfactory level of performance and safety, it is deployed for use in various applications, including chatbots, content generation, and natural language understanding tasks. However, deployment is accompanied by rigorous monitoring and oversight to detect and rectify any unintended biases, misinformation, or harmful behaviors that may arise during real-world interactions.
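
At its simplest, post-deployment monitoring can be implemented as a wrapper around the generation call that flags and logs problematic outputs. The sketch below assumes a hypothetical blocklist and a generic `generate_fn` callable; production systems rely on far more sophisticated classifiers and human review.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm_monitor")

# Placeholder terms; a real deployment would use trained safety classifiers instead.
BLOCKLIST = {"example banned phrase", "another banned phrase"}

def monitored_generate(generate_fn, prompt: str) -> str:
    """Call the deployed model, flag problematic output, and log it for human review."""
    response = generate_fn(prompt)
    if any(term in response.lower() for term in BLOCKLIST):
        logger.warning("Flagged response for prompt %r", prompt)
        return "Sorry, I can't help with that."    # safe fallback answer
    return response

# Usage with any callable that maps a prompt to text, e.g. an API client or a local model:
# reply = monitored_generate(my_model.generate, "Summarize this article.")
```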

The outcome of Authors Guild v. OpenAI remains uncertain, as the court’s decision has the potential to either significantly expand or restrict the fair use doctrine, with far-reaching consequences for U.S. copyright law.

U.S. copyright law identifies a copyright infringer as anyone who violates any of the exclusive rights of the copyright owner, or who unlawfully imports copies or phonorecords.[11] The plaintiffs in Authors Guild v. OpenAI are the rightful and lawful owners of the copyrights in and to their respective works,[12] have duly registered their works with the Copyright Office,[13] and are the legal owners of the right to reproduce their respective copyrighted works.[14] If OpenAI truly scraped the web and fed its LLMs copyrighted works sourced from pirated eBook repositories, then a court could find the AI firm liable under 17 U.S.C. § 501.

However, OpenAI could raise a fair use defense. Under the doctrine of fair use, natural persons and organizations are allowed limited use of copyrighted material without permission from the copyright owner for purposes such as criticism, commentary, news reporting, teaching, scholarship, and research.[15] Whether using copyrighted works to train a machine learning model qualifies as fair use is a matter of interpretation and can depend on factors such as the purpose and commercial character of the use, the amount of copyrighted material copied, and the effect of the use on the market for the original works.

If the court sides with the plaintiffs, then AI firms will no longer be able to scrape the web to create the large datasets on which they train their LLMs. Developers would need to seek permission from authors and publishers, which would entail negotiating licensing agreements for the proper use of copyrighted works. Restricting developers from web scraping could have adverse effects on innovation and the development of AI technology. Web scraping often serves as a vital source of real-world, diverse data that fuels the training and improvement of AI models, enabling them to better understand and interact with the world.[16] If such practices were heavily regulated or prohibited altogether, it could impede the progress of AI research and limit the ability to create AI systems that are more useful and adaptable, potentially hindering advancements in areas like natural language processing, computer vision, and recommendation systems.[17]

If the court sides with the defendants, AI firms will have the green light to freely scrape the web and feed their models the harvested data. With no limits in place, AI firms could use any and all data found online to tweak their algorithms. Such a policy, however, may engender challenges related to the unauthorized utilization of copyrighted materials, potentially eroding the intrinsic value of intellectual property and infringing upon authors’ ability to exercise control over the dissemination of their works. Such a permissive stance may undermine the customary safeguards enshrined in copyright statutes, potentially diminishing incentives for content creators to engage in the production of original works.

The authors are asking the court to prohibit OpenAI, and by extension any AI firm, from using their copyrighted works without express authorization. While both parties have solid arguments to present in court, it is essential to strike a balance between fostering AI innovation and safeguarding the rights of copyright owners. Winter may be coming for OpenAI.


[1] Perla Khattar is a JSD candidate at Notre Dame Law School, Class of 2026.

[2] Class Action Complaint at 1, Authors Guild v. OpenAI, Inc., No. 1:23-cv-08292 (S.D.N.Y. filed Sept. 19, 2023).

[3] Id.

[4] Id. at 97.

[5] Id. at 2.

[6] Id. at 4.

[7] Complaint at 1, Getty Images, Inc. v. Stability AI, No. 1:23-cv-00135 (D. Del. filed Feb. 3, 2023).

[8] Class Action Complaint, Andersen v. Stability AI Ltd., No. 3:23-cv-00201 (N.D. Cal. filed Jan. 13, 2023).

[9] 17 U.S.C. §§ 501, 1201–1205.

[10] Jordan Hoffmann et al., Training Compute-Optimal Large Language Models, arXiv:2203.15556 (Mar. 29, 2022).

[11] 17 U.S.C. § 501(a).

[12] Supra note 2, at 326.

[13] Id. at 328.

[14] Id. at 329.

[15] 17 U.S.C. § 107.

[16] Brad Mitchell, What Is Web Scraping and Why Is It Valuable?, CodingDojo (June 3, 2022), https://www.codingdojo.com/blog/what-is-web-scraping.

[17] Id.


Image Source: https://www.sol915.com.ar/george-r-r-martin-y-otros-autores-demandan-a-openai-por-copiar-obras-sin-permiso-ni-consideracion/