Seven new draft recommendations on AI submitted for public consultation by the CNIL (2/2)
CNIL publication on its website
In our first article (https://www.lexology.com/library/detail.aspx?g=4decbc09-a91d-4416-9028-529e58c6fe19), we presented the first three CNIL draft recommendations on the design of AI systems, which the authority has put out to public consultation until September 1, 2024. This second article looks at the other four drafts, which deal with informing data subjects, the exercise of their rights, data annotation and security.
- Informing data subjects
One of the key points of this recommendation lies in the practical examples and good practice set out by the CNIL in the field of AI. For example, the CNIL proposes information templates for the indirect collection of data from third parties or publicly accessible sources, as in the case of web scraping. It also outlines the specific information to be provided when personal data used for training is stored by the model or can be reconstructed by it.
The CNIL then details the conditions under which the data controller could rely on two derogations provided for by the GDPR to avoid having to provide the information to data subjects. Firstly, the Regulation allows information not to be repeated when the data subjects have already obtained it (articles 13(4) and 14(5)(a) of the GDPR). However, the initial information must be sufficiently precise to exempt the second controller from completing it. For example, the mere mention that data may be reused by third parties in the general information section of a website is deemed insufficient by the CNIL.
The Commission then looks at the provision of the GDPR allowing individual information not to be provided where this “proves impossible or would involve a disproportionate effort” (Article 14(5)(b) of the GDPR). To rely on this derogation, the data controller must first weigh the effort that providing the information would require (e.g. absence of contact details for the data subjects, age of the data, number of data subjects) against the intrusion into privacy that the processing is likely to entail (e.g. particularly intrusive processing, sensitive data). The CNIL gives two web scraping examples. If only pseudonymized data is collected, the authority considers that it would be disproportionate to collect more identifying data simply in order to inform the data subjects individually. If, on the other hand, directly identifying data is collected, a case-by-case analysis will have to be carried out. In either case, the data controller should publish general information. The CNIL also recommends that additional measures be put in place, such as carrying out a DPIA, applying pseudonymization techniques or taking other security measures.
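By way of illustration only, one common pseudonymization technique is keyed hashing of direct identifiers. The short Python sketch below is an assumption on our part and is not taken from the CNIL draft, which does not prescribe a particular method; the function name `pseudonymize` and the key handling are hypothetical. Data pseudonymized in this way remains personal data under the GDPR, since anyone holding the key can recompute the mapping.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256).

    Hypothetical helper for illustration: the same identifier and key
    always give the same pseudonym, so records can still be linked,
    but the original value cannot be read from the output.
    """
    # Normalize so that "Alice@Example.com " and "alice@example.com" match.
    normalized = identifier.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


if __name__ == "__main__":
    # Assumption: in practice the key is stored separately from the dataset.
    key = b"example-key-kept-apart-from-the-data"
    print(pseudonymize("alice@example.com", key))
```

Keeping the key apart from the training dataset is what separates this from a plain hash, which could be reversed by hashing candidate identifiers.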
Finally, the CNIL lists a number of best practices for improving transparency in the development of AI models. These include publishing the DPIA carried out, or applying certain practices from the world of open licenses (e.g. publication of source code).
- Data subjects’ rights
The CNIL distinguishes between two situations: on the one hand, requests concerning training databases, and on the other, those concerning the AI model itself.
With regard to the first situation, the CNIL considers that, in many cases, the supplier does not need directly identifying data for AI training. As a result, the supplier will not always be able to retrieve the data concerning the person making the request. The CNIL accepts this, provided that individuals are informed of it. However, the authority envisages two cases in which the data controller could be required to respond to the request by actively searching for the personal data relating to the person making the request: (i) when that person provides additional information enabling them to be re-identified among the training data; and (ii) when the data controller has decided upstream to retain data enabling such re-identification, in order to provide an additional guarantee for the rights of individuals and thus be able to base the processing on legitimate interest.
The CNIL then reviews the various items of information that must be provided in response to an access request. In particular, it considers two difficulties for the data controller: the exhaustive identification of all recipients, and that of all data sources, especially when the data has been collected by web scraping. With regard to the first difficulty, the Commission recommends setting up authentication or API mechanisms to record the identity of third-party recipients. With regard to the second, it accepts that in some cases it may be impossible to identify all sources, but considers that the data controller should still try to provide all relevant information enabling the type of sources used to be understood.
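To make the first recommendation more concrete, the sketch below shows, in Python, what a minimal record of third-party recipients might look like when datasets are only released to holders of an API key. It is an illustrative assumption, not a mechanism described by the CNIL: the class `RecipientLog`, its methods and the sample identifiers are all hypothetical, and a real system would persist the log and use proper authentication.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RecipientLog:
    """Hypothetical in-memory log of which authenticated third parties
    received which dataset."""

    api_keys: dict[str, str]  # API key -> recipient identity
    entries: list[tuple[str, str, datetime]] = field(default_factory=list)

    def record_download(self, api_key: str, dataset_id: str) -> str:
        """Authenticate the caller and log the disclosure before release."""
        recipient = self.api_keys.get(api_key)
        if recipient is None:
            raise PermissionError("unknown API key")
        self.entries.append((recipient, dataset_id, datetime.now(timezone.utc)))
        return recipient

    def recipients_of(self, dataset_id: str) -> list[str]:
        """List the distinct recipients of a dataset, e.g. to answer an access request."""
        return sorted({r for r, d, _ in self.entries if d == dataset_id})


if __name__ == "__main__":
    log = RecipientLog(api_keys={"key-a": "Lab A", "key-b": "Vendor B"})
    log.record_download("key-a", "corpus-v1")
    log.record_download("key-b", "corpus-v1")
    print(log.recipients_of("corpus-v1"))
```

Because every release passes through `record_download`, the list of recipients needed to answer an access request can be read directly from the log.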
With regard to the exercise of rights over the models themselves, the CNIL first raises the difficulty of determining whether a model, in particular a generative AI model, is itself subject to the GDPR. This could be the case if the model has memorized personal data during training and this data can be regurgitated, as is or in modified form, during its use. This could also be the case if the model has been specifically designed to generate synthetic and fictitious information that could nevertheless incidentally concern a real individual. The CNIL therefore recommends that a case-by-case analysis be carried out, which will not be without technical difficulties, particularly for deployers who do not necessarily have visibility over the way in which the model has been built.
Even where a model is subject to the GDPR, this does not mean that the data controller can actually identify individuals within it. The same considerations as for the training database apply here (informing people that they will not be able to exercise their rights and/or the possibility of providing additional information to enable re-identification, etc.).
In this draft recommendation, the CNIL also points out that it is possible, in certain limited cases, to rely on a derogation provided for by the GDPR to refuse data subjects’ requests: excessive requests, specific derogations provided for by French or European law, or the possibility of asserting compelling legitimate grounds that override the right to object.
- Data annotation
Data annotation, which can be manual or (semi)automatic, is essential for developing a training-based AI model. The CNIL gives several examples of annotation, including speaker identification when training an AI model for person recognition, and annotation of medical images for an AI system to aid diagnosis.
When they concern natural persons, annotations must comply with the GDPR and first and foremost with its fundamental principles. The French authority insists on the principles of minimization and accuracy. Minimization of annotations presupposes that the annotated information is relevant to the intended functionality of the AI system. In particular, the CNIL considers that information is relevant when its link with model performance is proven or “sufficiently plausible.” Annotations must also be accurate, to prevent the AI system from reproducing errors about people, which could lead to degrading or even discriminatory outputs.
To ensure compliance with these principles, the CNIL recommends introducing a procedure for the continuous verification of annotations and involving a referent or an ethics committee, and it provides a few details on how these measures can be implemented.
The authority then points out that, since annotation is in itself a processing of personal data (when it relates to natural persons), data subjects must be informed and be able to exercise their rights. With regard to information, the CNIL recommends enhancing the transparency of processing by delivering information that goes partly beyond the list in Articles 13 and 14 of the GDPR: (i) the purpose of the annotation; (ii) the entity in charge of this annotation; (iii) the social responsibility criteria followed by the entity(ies) in charge of the annotation; and (iv) the security measures taken regarding the annotation phase.
Lastly, the recommendation addresses the situation where the annotation reveals sensitive data (e.g. ethnic origins, health, political opinions), including when the source data is not in itself sensitive data within the scope of Article 9 of the GDPR. In this context, the CNIL points out that the processing of such annotations will be prohibited unless it can be based on one of the exceptions of the GDPR or the French Data Protection Act. In particular, the CNIL cites the example of users who have manifestly made their political opinions public on the internet, and where this is used to annotate the publications concerned and train an AI model. The CNIL also makes a number of recommendations to be applied even when the processing of sensitive annotations can be legally implemented: use objective and factual criteria for annotation, limit annotation to the context of the data, reinforce annotation verification, increase security and pay particular attention to the risks of regurgitation and inference of sensitive data.
- Security
In this draft recommendation, the Commission details the points to be taken into account to ensure the security not only of the AI system’s environment (infrastructures, IT authorizations, physical security) but also of the system’s development and maintenance.
The authority stresses that the development of AI systems, most of which are recent, entails specific risks that need to be taken into account. In particular, it cites the example of external data sources that have not necessarily been subject to in-depth security assessment. The CNIL therefore recommends carrying out a data protection impact assessment (DPIA), even in cases where such an assessment is not mandatory under the GDPR.
Among the points addressed by the CNIL in this recommendation, the authority lists the factors it believes need to be considered when assessing the level of risk: (i) the nature of the data – e.g.: sensitive or non-sensitive data; (ii) the control over the data, models and tools used – e.g.: collaborative models may contain unidentified corrupted files; (iii) the methods of access to the system and the content of its outputs – e.g.: according to the CNIL, open-source dissemination of a model likely to store or infer personal data will increase the possibility of attacks; and (iv) the intended context of use for the AI system – e.g.: a system used in the healthcare field will require particular attention in terms of security.
In the last part of the recommendation, the CNIL outlines a series of security measures that can be considered in practice.
These seven comprehensive draft recommendations should help guide data controllers. However, it can be anticipated that the technical nature of the recommendations, combined with the technical nature and wide variety of AI models, will require detailed analysis on a case-by-case basis, and that the recommendations alone will not be enough to answer all compliance questions.