Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI models
The current development of AI models – in particular generative AI – is prompting a great deal of comment and analysis, and has in particular pushed data protection authorities to make the topic a publication priority. In France, the CNIL was a little ahead of the curve: it launched its initial work back in 2017, subsequently set up a department specializing in AI, and more recently published numerous practical fact sheets on the development of artificial intelligence systems (see our previous articles: here and here). At the European level, the EDPB has just published its first position on the subject.
In this position, the EDPB addresses three issues: (i) the conditions under which AI models themselves can be considered anonymous; (ii) the use of the legitimate interest legal basis to develop or use AI models; and (iii) the impact of the unlawful development of an AI model on its subsequent use.
- An AI model may itself constitute processing of personal data
In principle, an AI model does not directly contain personal data. It is a mathematical construct comprising parameters that represent probabilistic relationships between items of information. It would therefore be tempting to consider, a priori, that an AI model can never, in itself (i.e. disregarding the training phase and its use), constitute the processing of personal data.
This is not, however, the position of the EDPB, which considers that an AI model cannot be considered “anonymous” – within the meaning of personal data protection regulations – in three cases.
Firstly, the EDPB rules out anonymity for AI models that have been specifically designed to provide personal data about the very persons whose data were used to train the model. This applies, for example, to generative AI models designed to reproduce the voices of the persons on whom they were trained. For the EDPB, these models intrinsically include personal data relating to the data subjects targeted by the training phase.
Secondly, the EDPB considers that, for an AI model to be regarded as anonymous, it must be verified that, taking into account “all the means reasonably likely to be used”, it is highly unlikely: (i) to extract personal data from the AI model itself; and (ii) to obtain output results that concern the same data subjects as those targeted by the training phase.
In order to conduct this analysis, data protection authorities are invited to refer to the relevant EDPB (or ex-WP29) guidance, foremost among which is Opinion 05/2014 on anonymisation techniques. Data protection authorities will also have to take into account “all the means reasonably likely to be used” by the controller or a third party to extract personal data from the model, or to obtain such data – concerning the individuals targeted by the training phase – in the outputs. This – very elastic – notion of “reasonable means” is central to personal data protection law: Recital 26 of the GDPR relies on it, in particular, to determine whether a data subject is directly or indirectly “identified or identifiable” from information of any kind. Lastly, data protection authorities must take account of the risk assessments previously carried out by data controllers on this point.
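To make this notion more concrete, the following is a minimal, purely illustrative Python sketch of one such “means”: a black-box probe that feeds adversarial prompts to a model and flags completions containing PII-like strings. The `generate` function is a hypothetical stand-in for whatever inference interface the model actually exposes, and the patterns are deliberately crude; real evaluations (e.g. membership inference or model inversion attacks) are considerably more sophisticated.

```python
import re
from typing import Callable

# Hypothetical stand-in for the model's inference interface: prompt in, text out.
Generate = Callable[[str], str]

# Deliberately crude patterns for directly identifying data.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # e-mail addresses
    re.compile(r"\+?\d[\d .\-]{7,}\d"),      # phone-like number sequences
]

def probe_for_regurgitation(generate: Generate, prompts: list[str]) -> list[tuple[str, str]]:
    """Send each probe prompt to the model and collect (prompt, match) pairs
    whenever the completion contains a PII-like string, for human review."""
    hits: list[tuple[str, str]] = []
    for prompt in prompts:
        completion = generate(prompt)
        for pattern in PII_PATTERNS:
            for match in pattern.findall(completion):
                hits.append((prompt, match))
    return hits
```

A model that returns no such strings under a battery of adversarial prompts is not thereby proven anonymous, but a documented probe of this kind is precisely the sort of evaluation the Opinion expects controllers to be able to show.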
The EDPB then provides a non-exhaustive list of more specific elements that should be verified:
- Measures taken when selecting data sources to ensure that they are relevant and adequate for the purposes pursued.
- How training data have been “prepared”: anonymization/pseudonymization, data minimization strategies, systematic filtering, etc. (a minimal pseudonymization sketch follows this list).
- The model development methods followed, in particular whether these methods help guarantee that the model will be sufficiently generic (i.e. not overfitted to the personal data used for training).
- Measures implemented to reduce the risk that personal data used during training can be obtained in the model’s outputs.
- Evaluations and audits carried out by data controllers to test the anonymity of models, including testing their resistance to external attacks.
- Documentation drawn up by data controllers: data protection impact assessment (DPIA), feedback from the DPO, description of the technical and organizational measures put in place to reduce the risks of identification (including through external attacks), etc.
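By way of illustration of the data-preparation measures mentioned above – the Opinion does not prescribe any particular technique – here is a minimal Python sketch that replaces e-mail addresses in training text with salted hash tokens. Note that this is pseudonymization rather than anonymization: so long as re-identification remains reasonably possible, the tokens remain personal data within the meaning of Recital 26 of the GDPR.

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def pseudonymize_emails(text: str, salt: str) -> str:
    """Replace each e-mail address with a salted-hash token, so that records
    can still be linked together without exposing the address itself."""
    def _token(match: re.Match) -> str:
        digest = hashlib.sha256((salt + match.group(0)).encode()).hexdigest()
        return f"<email:{digest[:12]}>"
    return EMAIL_RE.sub(_token, text)

# The token is stable for a given salt, which preserves record linkage:
print(pseudonymize_emails("Contact jane.doe@example.com for details.", salt="2024"))
```

Keeping the salt secret (or discarding it after training) raises the bar against re-identification, but the decisive question remains the one posed by the EDPB: whether, with all the means reasonably likely to be used, the data can still be traced back to an individual.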
In light of the evaluation criteria proposed by the EDPB, it now appears very important for companies developing AI models (including purely for in-house use) to conduct and document their own analysis of whether or not these models are anonymous. If a model cannot be considered anonymous, the GDPR will apply, together with all the obligations that this text entails: information of data subjects, legal basis, exercise of data subjects’ rights, etc.
- Using the legitimate interest legal basis in the development and deployment of AI models
In this second part, the EDPB provides some general considerations on the criteria that need to be taken into account in order to base processing for the development or deployment of AI models on the legal basis of legitimate interest.
The EDPB’s first contribution on this point is a confirmation – which nobody really doubted – that legitimate interest can, if the conditions are met, constitute the legal basis for such processing. The consent of data subjects – or the need to perform a contract between them and the controller – is therefore not required in all cases.
The EDPB then comments on each step of the three-step test that must be satisfied in order to base a processing operation on legitimate interest:
- The interest pursued by the controller or a third party must be legitimate:
The Board considers that, for example, the development of a conversational agent to assist users, a system to detect fraud or a system to improve the detection of cyber threats, constitute a priori legitimate interests.
- Processing must be necessary for the pursuit of the legitimate interests:
Classically, the EDPB recalls here that the processing must actually make it possible to achieve the purposes pursued, and that those purposes could not be achieved through less intrusive means. This second criterion is particularly important for AI models, whose performance is in many cases proportional to the amount of data (including personal data) ingested during training. Data controllers will have to be careful not to be excessively “data-hungry”, at a time when publicly accessible data sources have never been so numerous or so voluminous.
- Balancing the interests pursued against the fundamental rights and freedoms of the data subjects:
Firstly, the Board gives some examples of the interests and fundamental rights and freedoms of data subjects that need to be taken into account in the context of the development and deployment of AI models: self-determination and maintaining control over one’s own personal data, the risks that these models may present to private and family life, etc. The EDPB also points out that some models, particularly those based on large-scale collection of personal data, may present other risks, such as the surveillance of individuals. The Board notes, however, that account must also be taken of the benefits that data subjects may derive from the use of AI models (financial benefits, improved accessibility of services, help in identifying offensive content on the Internet, etc.).
It is then necessary to assess the impact of the processing on the interests, fundamental rights and freedoms of the data subjects. According to the Board, this depends in particular on the nature of the data processed (e.g., particularly private data, such as location data, should be considered as having a potentially serious impact on the data subjects) and on the context of the processing (e.g., does the processing involve combinations of different data sources? are the data subjects children or other vulnerable persons?).
The EDPB then describes the criteria to be checked to determine whether data subjects can reasonably expect their data to be used in the development and/or deployment of AI models, including, for example: the public nature of data sources, the relationship between data subjects and the controller, the type of service provided, the potential subsequent uses of the model, etc.
Finally, the Board lists examples of measures and safeguards that can be put in place to mitigate risks to individuals.
- Can the development of an AI model in violation of the GDPR impact the lawfulness of its use?
The EDPB’s answer to this question is very nuanced and will leave data controllers in doubt: it all depends on the parties involved (e.g. are the development and use of the model carried out by the same data controller?), the processing purposes (e.g. do the development and use of the model pursue the same purposes?), as well as on the seriousness of the breach of the GDPR committed in the development of the AI model.
Data controllers will therefore have to conduct a case-by-case analysis, and document it. For example, if the use of the AI model relies on the legal basis of legitimate interest, the fact that its development was conducted in violation of the GDPR should, in principle, weigh negatively in the balancing of interests (but the EDPB does not completely close the door, since this will depend in particular on the measures put in place to limit the impact of the use of the AI model on data subjects).
The only scenario in which the EDPB provides a clear answer is where the data controller that developed a model in breach of the GDPR then completely anonymizes that model. In that case, the lawfulness of the subsequent personal data processing derived from the use of the AI model will not be affected by the unlawfulness of its development.