Google-Reddit AI Deal Heralds New Era in Social Media Licensing – Bloomberg Law

Reddit’s blockbuster deal with Google to train artificial intelligence products on the platform’s data is just the beginning of an anticipated licensing bonanza sweeping up content from individuals who have little recourse, thanks to user agreements’ catch-all copyright provisions.

Alphabet Inc.’s Google last month inked a licensing deal with Reddit Inc., reportedly worth $60 million annually, to train its large language models on the platform’s extensive user-generated content. It’s the first major public licensing deal between a US social media giant and an external AI company. But other agreements are expected to follow given the massive troves of data the platforms could provide to AI companies, and the critical importance of such diverse data for training their large language models, copyright and tech attorneys say.

Major AI companies already face a suite of copyright lawsuits over their training data, including from Pulitzer Prize-winning authors and major news organizations. But social media users who may feel uneasy about their posts being scraped and collected to train AI models wouldn’t have the same recourse as major brands or well-known individuals who have sued AI companies including OpenAI Inc. and Meta Platforms Inc. for infringement, even if they have copyright ownership of their posts, attorneys say.

That’s because terms of service generally allow users to retain intellectual property rights to their posts while giving the platforms wide-ranging, enduring licenses to use and distribute them broadly in any manner they see fit.

The reality is users have “already given those rights away” to the platforms, said Pillsbury Winthrop Shaw Pittman LLP Partner Edward Cavazos. “They can do anything with your content basically.”

Value of Content

Similar licensing agreements will likely continue to pop up until every large language model has deals in place with major social media platforms, Cavazos said. While recent partnerships with news publishers like Axel Springer SE, which owns POLITICO and Business Insider, provide authoritative information for training, AI companies want to train their tools on diverse data. Social media platforms’ vast data is valuable for training, even if it’s not “high quality” content like that of news organizations, Cavazos said.

Models “need to know how 15-year-olds on social media speak, and they’re not going to get that from POLITICO,” he said. “Ultimately, they just want data. The more data, the smarter their models are.”

In addition to OpenAI’s deals with Axel Springer and the Associated Press, publications including Time, CNN, and Fox have been reported to be in talks to license their content. Comedian-author Sarah Silverman and other creatives sued OpenAI about a week before the AP agreement was struck last July, and the New York Times lawsuit was filed several weeks after the Axel Springer deal was announced.

Those infringement lawsuits against AI companies made their move towards licensing predictable, said Jeanne Hamburg, an attorney at Norris McLaughlin PA.

“Until their hand was forced, these companies weren’t ready to bargain,” she said.

Other licensing targets could be user-review platforms like Yelp Inc. and TripAdvisor Inc., Columbia Law Professor Shyamkrishna Balganesh theorized, because of their rich user-generated content and agreements that individuals rarely read in full.

“I can assure you that even as a copyright professor, I don’t” pay attention to websites’ terms of service and copyright disclosures, he said.

As AI tools expand from text-generating services like ChatGPT to those that generate audio and video, Cavazos said he believes future licensing deals could also include platforms that similarly rely on user content like SoundCloud and Flickr.

“I would not be surprised if every kind of user-generated, content-driven site eventually provides their data to large language model operators,” he said.

Broad User Agreements

Discussion of Reddit’s deal quickly reached the platform. Some users said the move was expected after Reddit revised its API terms last April to prohibit the use of user content to train machine learning or AI models without express permission. API is short for application programming interface, the mechanism outside apps, including AI tools, use to connect to a platform and access its data. Google, in a blog post about the Reddit deal, said access to the API would allow it to “display, train on, and otherwise use” the data “in the most accurate and relevant ways.”

Other users had mixed reactions about their content being used to train models. “Is this content even theirs to sell?” one user asked. Another wrote, “I don’t mind AI training on my comments. As I put them on the internet myself, for all to see. I very much mind Reddit selling access [to] my comments.”

One user pointed out Reddit’s free-wheeling user agreement, which says users “retain any ownership rights” over their content but also “grant us a worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness provided in connection with Your Content in all media formats and channels now known or later developed anywhere in the world.”

Yelp, TripAdvisor, Instagram, and Facebook all have broad sublicense clauses in their terms of service, and SoundCloud’s provides that users “explicitly agree” that their content “may be used to inform, train, develop or serve as input to artificial intelligence or machine intelligence technologies or services as part of and for providing the services.”

Meta has already used public posts on its Instagram and Facebook platforms to train one of its own AI tools, Meta President of Global Affairs Nick Clegg told Reuters in September. Tumblr and WordPress.com are preparing to sell user material to MidJourney and OpenAI for model training, according to a recent report by 404 Media.

Tumblr’s terms of service are similar to Reddit’s, saying users retain ownership of any intellectual property rights to their posts but “grant Tumblr a non-exclusive, worldwide, royalty-free, sublicensable, transferable right and license to use, host, store, cache, reproduce, publish, display (publicly or otherwise), perform (publicly or otherwise), distribute, transmit, modify, adapt (including, without limitation, in order to conform it to the requirements of any networks, devices, services, or media through which the Services are available), and create derivative works of, such User Content.”

Social media platforms’ user agreements have to be so sweeping because they wouldn’t be able to reproduce and distribute people’s content—wouldn’t be able to provide their core services—without such licenses, Hamburg said.

Opting Out

Though platforms likely weren’t contemplating AI training when drafting their user agreements, Cavazos said their language is “so broad that it’s inconceivable that that’s not in the set of rights they get.” Even if users try to argue that training wasn’t part of the deal when they joined the platforms, he said, they’re going to have very little legal room to bring a lawsuit.

Platforms could offer users a way to opt out of having their data scraped, as some companies have already done, Hamburg said, but there are questions about how effective opt-out technology is. She pointed to 404 Media’s reporting, which included allegations that some platform data queried for AI training contained content that was supposed to be omitted, like private and deleted posts.

Reddit and Automattic Inc., which owns Tumblr and WordPress, didn’t respond to Bloomberg Law’s questions about whether the companies plan to update their terms of service or allow users to opt out of having their data used for AI training.

Cavazos said opt-outs may be tricky: if a user opts out after training has begun, it’s unclear whether the models can unlearn content they’ve already ingested.

“It’s like asking someone to unremember something,” he said. “It’s not that easy.”
