Is reddit violating the GDPR with their $60 million per year Google LLM deal?

As reddit prepares for its IPO, it is looking to prove that it is a profitable business with a good outlook. To this end, it has announced a deal with Google that gives the tech giant access to the huge number of posts written by its millions of users. The purpose: to train Google’s Large Language Model (LLM).
Users and former users have doubts about this move and are banding together to draw the attention of European data protection agencies (DPAs) to the deal. Is reddit even allowed to simply sell access to this data set for the purpose of training an LLM? Fediverse user AlteredStateBlob has doubts.

Is reddit violating the GDPR?

Fediverse user AlteredStateBlob posted a lengthy write-up on the website kbin.social, which is part of the so-called “fediverse”, a cluster of decentralized services and websites that interoperate through the ActivityPub protocol.

In their post, the user outlines the many reasons why they believe that selling the post data of reddit users might violate the GDPR. The main point concerns Article 5(1)(b), which requires that the processing of personal data be bound to a specific purpose. The passage reads:

"Personal data shall be [...] collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes; further processing for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes shall, in accordance with Article 89(1), not be considered to be incompatible with the initial purposes (‘purpose limitation’);"
GDPR Article 5(1)(b)

The user argues that the purpose for which the data is collected is to serve the posts to the public on the reddit website. Moving away from this purpose by allowing an external company to process the data might pose an issue for reddit, depending on how data protection agencies read this provision.

The post makes many more points, but the strongest might simply be this: the personal data may be used in a way that is incompatible with the original purpose for which it was collected.

Do reddit posts constitute personal data?

Some users question whether reddit posts constitute personal data to begin with. AlteredStateBlob argues that it is nearly impossible to ensure that free-text entries are truly sanitized of personal data. The personally identifiable data might not just be the link between post body and username. It might also be other users or people who are mentioned directly within the bodies of posts.

They refer to Recital 26 which states the following:

"The principles of data protection should apply to any information concerning an identified or identifiable natural person. Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person. To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments. The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable. This Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes."
GDPR Recital 26
(bold emphasis ours)

AlteredStateBlob argues that Google’s LLM might reveal its training data, which has happened with other LLMs, as Google’s own DeepMind team uncovered. Tracing such training data back to the original post on reddit is simple: search Google for the text of the post body together with the “site:reddit.com” operator.

"That is not quite correct. As long as it is possible to identify the user, it is personal data. True anonymization under GDPR is nearly impossible without destroying the data set. Reddit would have to fully delete it, otherwise simply searching Google with the exact text with site:reddit.com on any comment immediately reveals who the author is. It doesn't matter if the dataset in use allows for identification, as long as identification remains possible."
AlteredStateBlob
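
The re-identification argument can be illustrated with a toy sketch. The Python below is not taken from the post: the usernames, post texts and helper functions are all hypothetical, and the dictionary merely stands in for a public search index. It assumes a data set from which the usernames have been stripped while the verbatim text remains publicly findable, and shows that an exact text match is then enough to link each entry back to an author.

```python
# Toy illustration only: every name and post below is hypothetical.

def strip_usernames(posts):
    """Return a 'pseudonymised' copy of the posts with the author field removed."""
    return [{"body": post["body"]} for post in posts]


def build_public_index(posts):
    """Stand-in for a public search index: maps exact post text to its author."""
    return {post["body"]: post["author"] for post in posts}


def reidentify(stripped_posts, public_index):
    """Link each stripped post back to an author via an exact text match."""
    return [
        (post["body"], public_index.get(post["body"]))
        for post in stripped_posts
    ]


def exact_phrase_query(text, site="reddit.com"):
    """Build an exact-phrase, site-restricted query like the one the comment describes."""
    return f'"{text}" site:{site}'


public_posts = [
    {"author": "example_user_a", "body": "My neighbour's dog barks every night at 3am."},
    {"author": "example_user_b", "body": "I switched jobs last month and regret nothing."},
]

training_set = strip_usernames(public_posts)  # usernames removed
index = build_public_index(public_posts)      # but the text is still public

for body, author in reidentify(training_set, index):
    print(author, "<-", exact_phrase_query(body))
```

In this sketch, removing the author field changes nothing as long as the same text stays publicly searchable, which is the distinction Recital 26 draws between pseudonymised and anonymous data.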

Will European Data Protection Agencies take action against reddit?

Data protection agencies can open investigations into the conduct of companies on their own initiative. Several users have followed the call to action in the post and confirmed that they have filed a report with their respective data protection agency.

Whether or not any DPA follows up, or even agrees with AlteredStateBlob’s perspective, one thing is clear: online users are becoming more and more aware of their rights and of potential violations by big tech companies. Anyone dealing with personal data would do well to take every step required to comply with the GDPR.

Other users argue that a large company like reddit has obviously taken everything into account and consulted its lawyers before engaging with Google. Many more do not seem impressed by such arguments, given the past conduct of US companies and their disregard for established laws.

Given how slowly legislators and supervisory authorities act, reddit will likely move forward with its plans regardless of the legal reality of the situation. Ultimately, it is up to the data protection agencies of the EU to determine the next steps.

Helping you with GDPR compliance

We are currently building tools to help companies achieve better GDPR compliance, starting with a tool for keeping a record of processing activities (ROPA) under Article 30 of the GDPR.

If you’re among the many companies that deal with personal data of EU citizens, such as email addresses, usernames, emails, names and addresses, and still struggle with maintaining a comprehensive ROPA, feel free to shoot us a message over at [email protected] to get more information on our system.