Synthetic data – a miracle cure or a data protection headache?

Author info

The authors of this article are participating in the following EU-funded projects:

FLUTE - Magdalena Kogut-Czarkowska

AISym4MED - Nayana Murali

1. What is synthetic data?

Synthetic data, a term lacking a precise legal definition, broadly refers to data artificially generated to resemble the characteristics of real data, including their structure and statistical distribution [1]. A more nuanced definition specifies that synthetic data is generated using a mathematical model or algorithm, with the aim of producing data that is statistically realistic yet inherently 'artificial' [2].

The generation of synthetic data can take various forms, including its production from real datasets or its creation "from scratch" by leveraging knowledge and expertise gathered by data analysts on specific dependencies. It can also result from a combination of these approaches, incorporating both real data and expert knowledge to create synthetic datasets [3]. 
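To make the first approach more concrete, the sketch below fits a very simple statistical model to a toy "real" dataset and samples new records from it. This is purely our own illustration, not a method endorsed by the cited sources: the values are invented, and the per-column normal model is a deliberate simplification of what real generators (which also aim to preserve correlations and richer dependencies) actually do.

```python
import random
import statistics

# Toy "real" dataset: (age, systolic blood pressure) records.
# Values are invented for illustration; no real personal data is involved.
real_records = [(34, 118), (45, 130), (51, 141), (29, 112), (62, 150), (47, 135)]

def fit_column(values):
    """Summarize a column's statistical profile as (mean, standard deviation)."""
    return statistics.mean(values), statistics.stdev(values)

def generate_synthetic(real, n, seed=0):
    """Sample n new records from per-column normal distributions fitted to
    the real data. Columns are treated as independent here, which is a big
    simplification: practical generators also try to preserve the
    correlations between attributes."""
    rng = random.Random(seed)
    profiles = [fit_column(col) for col in zip(*real)]
    return [tuple(rng.gauss(mu, sigma) for mu, sigma in profiles) for _ in range(n)]

# The synthetic records mimic the statistical distribution of the real ones
# without reproducing any individual record.
synthetic_records = generate_synthetic(real_records, n=100)
```

Even this toy version shows the core idea: what is carried over from the real data is a statistical summary, not the records themselves.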

The primary objective of synthetic data is to preserve the characteristics and properties of real data tailored to a specific use case [4]. Notably, the determination of which properties of the real data should be preserved hinges on the intended purpose of the data usage. For instance, distinct data qualities are required when assessing the storage capacities of an IT system compared to using the data for training an artificial intelligence (AI) model in cancer detection. 

In certain applications, data quality in the sense of close resemblance between synthetic and real data may even be nonessential. For example, when synthetic data is used to train self-driving vehicles, risky situations may need to occur more frequently in the dataset than they do in real-life driving conditions [5]. Hence, the case-dependency factor plays a crucial role in shaping the approach to generating synthetic data.

2. Why is synthetic data useful?

The progress and evolution of technology, particularly in the realm of AI, hinge on the availability of extensive datasets [6]. Synthetic data emerges as a crucial asset when real-life data is inaccessible or insufficient due to scarcity, lack of variability, or legal constraints such as the General Data Protection Regulation (GDPR) [7], intellectual property rights or trade secret protection. Synthetic data also assumes a pivotal role in overcoming the labour-intensive and costly nature of labelling real-life data [1].

In practical terms, since the data is generated, it can lower the costs and resources involved in collecting the required data [5]. Using "dummy" data for initial AI model training provides developers with a strategic advantage, yielding faster results before transitioning to real data. Numerous practical examples underscore the utility of synthetic data, particularly in training machine learning models and conducting data analysis. Amazon’s Alexa, for instance, reportedly undergoes training on synthetic data [8]. To witness the generation of synthetic data first-hand, one can explore the Random Face Generator at https://this-person-does-not-exist.com/en [9].

Synthetic data contributes to enriching virtual reality (VR) and augmented reality (AR) experiences by creating realistic virtual environments. In cybersecurity, the simulation of diverse cyber threats using synthetic data is crucial for training and testing defence mechanisms. Meteorology leverages synthetic data to enhance weather forecasting models, simulating a spectrum of atmospheric conditions for more accurate predictions. In autonomous vehicle development, synthetic data is used for simulating diverse road conditions and obstacles, aiding in the training of algorithms. 

One of the most promising applications of synthetic data lies in health research and innovation. It is being explored whether virtual, computer-generated patients can prove valuable in the development of medical drugs and devices, potentially providing a way to reduce reliance on human testing and shorten testing times [5].  

In another notable instance, synthetic data was employed to address the underrepresentation of diverse skin types in existing datasets [10]. Recognizing a bias towards predominantly light skin samples in data repositories, a more inclusive set of skin images was created using synthetic data. This initiative aimed to train detection models capable of effectively recognizing potentially malignant skin conditions, such as melanoma, across a spectrum of shades. 

In essence, synthetic data stands not merely as a solution to data challenges but as a transformative force, reshaping technology across diverse applications. Its seamless integration into various fields reflects its pivotal role in advancing and revolutionizing the capabilities of artificial intelligence and data-driven technologies.

3. Does the GDPR apply to synthetic data?

The relationship between synthetic data and the GDPR is a subject of debate, with most researchers agreeing that synthetic data is not automatically “private” [11] or placed outside of the realm of data protection laws. Legal considerations predominantly arise when creating synthetic data from real-life datasets containing personal data, as seen for example in medical datasets. In such cases, the process begins with collecting and preparing actual personal data for training AI models that generate synthetic data. From a GDPR perspective, creating synthetic data based on personal data requires processing of the latter [12]. 

This imposes several requirements on developers. For example, they need to implement the GDPR principle of data minimization (Article 5.1(c)) by pseudonymizing the input data and removing direct identifiers from it. Another crucial principle is ensuring the integrity and confidentiality of input personal data (Article 5.1(f)), particularly by incorporating technical and organizational security measures (Article 32) to safeguard it from unlawful disclosure. As with any personal data processing, a legal basis is needed for using input personal data for synthetic data generation.

Opinion 05/2014 of the Article 29 Working Party on Anonymisation Techniques [13] states that anonymisation, as an instance of further processing of personal data, can be compatible with the original purposes of the processing if the result is truly anonymous data. According to some authors, a similar argument can be made for synthetic data generation "provided that the data synthesis is carried out adequately and synthetic data is reliably produced" [1], or, with a higher standard, that the synthetic data is anonymous (non-personal).

This leads to the imminent question of whether synthetic data is ‘personal data’ governed by data protection law. On the face of it, one may argue that since the data is purposely disrupted and changed (there is no one-to-one mapping from synthetic records back to a person), it is automatically non-personal. However, studies [14] indicate that a sufficient level of anonymization is not achieved in all cases. Even if the data was generated from initially de-identified data (where direct identifiers, such as names, were removed), there remains a risk that an individual can be indirectly identified, either from the synthetic data itself or in combination with other available sources [15].

The potential risk becomes especially relevant in cases where a model is vulnerable to 'overfitting' [15]. In such instances, the model excessively focuses on the details of the training data, essentially memorizing examples from that data and reproducing them in synthetic data [12 and other sources cited there] [16]. Consequently, this phenomenon exposes a vulnerability in synthetic data, as it has “the capacity to leak information about the data it was derived from” [11], rendering it susceptible to privacy attacks.
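One simple quantitative signal used in this kind of leakage analysis is the distance to the closest record (DCR): a synthetic record that sits unusually close to, or exactly on top of, a training record may simply be a memorized copy. The sketch below is our own minimal illustration of the idea, not an established compliance test; real privacy assessments rely on far more sophisticated attacks and metrics.

```python
import math

def distance_to_closest_record(synthetic, real):
    """For each synthetic record, compute the Euclidean distance to its
    nearest real (training) record. Unusually small distances can signal
    that the generator memorized, and is now leaking, training records."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]

# Toy data: the second synthetic record is an exact copy of a real record,
# the kind of leak an overfitted generator can produce.
real = [(34.0, 118.0), (45.0, 130.0), (62.0, 150.0)]
synthetic = [(40.0, 125.0), (34.0, 118.0)]

dcr = distance_to_closest_record(synthetic, real)
flagged = [i for i, d in enumerate(dcr) if d == 0.0]  # exact memorization
```

A zero (or near-zero) distance flags the copied record; choosing a sensible threshold for "too close" is itself context-dependent and part of the assessment problem discussed below.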

As a result, conducting a thorough assessment of any synthetic data becomes imperative to ascertain its personal or non-personal status. Notably, the European Data Protection Supervisor (EDPS) has emphasized that this assessment should evaluate the extent to which data subjects can be identified in the synthetic data and the amount of new data about those subjects that would be revealed upon successful identification [17]. 

Nevertheless, such an assessment is not a straightforward process. From a legal perspective, the assessment of synthetic data under the GDPR is influenced by the ongoing debate on the limits of "personal data". This topic is very complex (see the recent rulings of the CJEU in Case C-319/22 and of the GC in Case T-557/20 [18]), resulting in a lack of agreed standards and a potentially expansive definition of 'personal data'. Essentially, debates concerning the risk of identification within the GDPR definition of personal data often centre on determining whose perspective should decide whether a piece of information qualifies as personal. Additionally, there is a need to establish a threshold of 'reasonable likelihood' of identification as a measure to assess the risk of re-identification. Another persistent issue associated with synthetic data involves the potential deduction of sensitive information about an individual, even in cases where the identifiability test fails to yield a positive outcome.

Even if the synthetic data falls short of the anonymity threshold, replacing collected personal data with artificially generated data offers an additional layer of security. The AEPD [4] and the ICO [19] consider synthetic data a privacy-enhancing technology (PET) which aims to weaken or break the connection between the original personal data and the individual it relates to. Some researchers propose combining synthetic data with other PETs, such as differential privacy, to enhance privacy protection while retaining utility [5].

4. Can synthetic data be regulated so its status is made clear?

The term “synthetic data” is making its way into EU regulations. In particular, recital 7 of the Data Governance Act states that “There are techniques enabling analyses on databases that contain personal data, such as anonymisation, differential privacy, generalisation, suppression and randomisation, the use of synthetic data or similar methods and other state-of-the-art privacy-preserving methods that could contribute to a more privacy-friendly processing of data” [20]. While the Data Governance Act acknowledges the value of synthetic data as a PET, it offers neither a legal definition nor a position regarding its status as personal or non-personal data.

As mentioned above, the current stance of certain data protection authorities and privacy professionals is that synthetic data must be evaluated within the framework of the GDPR, and the privacy implications of any synthetic dataset are highly contingent on the specific context [4]. This perspective is seen as a potential barrier to advancing the use of synthetic data in research. Concerns have been raised regarding the complex legal requirements and GDPR compliance processes that must be adhered to, which could impede technological progress and hinder the widespread adoption of synthetic data. It may be tempting to suggest that the complexities related to the qualification of synthetic data could be easily resolved by the establishment of a legal definition enacted by European Union legislators. Such hope was kindled by the Artificial Intelligence Act (AI Act) proposal [21] where in Article 54.1 (b) it is mentioned that:

“In the AI regulatory sandbox personal data lawfully collected for other purposes may be processed for the purposes of developing, testing and training of innovative AI systems in the sandbox under the following cumulative conditions:

a)     (…)

b)     the data processed are necessary for complying with one or more of the requirements referred to in Title III, Chapter 2 [namely those applicable to high-risk AI systems] where those requirements cannot be effectively fulfilled by processing anonymised, synthetic or other non-personal data;”

Attention has been paid to the part of the provision in which the categories of anonymised, synthetic or other non-personal data are mentioned together. As some argue [22], this wording suggests – by legal implication – that synthetic data is considered as a type of non-personal data. However, in our assessment, this conclusion appears somewhat premature. 

The origin of synthetic data is an important factor in assessing whether it qualifies as personal data. When synthetic data is created from original personal data, a crucial trade-off emerges where utility and anonymity are inherently interconnected. The more utility a synthetic dataset provides, the lower its anonymity (meaning the higher the risk of reidentification), and vice versa [23] [24]. Therefore, striking a balance between absolute anonymity and utility preservation is a nuanced task when synthetic data is generated from real personal data, and it is unlikely that a unanimous consensus will emerge asserting that synthetic data is unequivocally non-personal in all instances. Conversely, synthetic data generated from assumptions alone, without the direct processing of personal data, does not face these challenges.

In this regard, critical perspectives caution policymakers against assuming equal effectiveness across all forms of data synthesis. Experts also advise that context and practice will have a major influence on the risk of re-identification [15]. They argue that Data Protection Authorities (DPAs) and the community should arrive at "appropriate standards and approaches to assessing identifiability of specific synthetic data generation methods, utilizing quantitative metrics as far as possible" [15].

Time will tell whether these comments are taken on board in the final version of the AI Act. In the AI Act amendments adopted on 14 June 2023 by the European Parliament [25], a reference to synthetic data was added in Article 10.5, outlining conditions for processing special categories of data to detect negative biases in high-risk AI systems. One of the conditions is that “the bias detection and correction cannot be effectively fulfilled by processing synthetic or anonymised data”. Unlike Article 54.1(b), this addition does not imply that synthetic data is a category of “non-personal data”. Interestingly, the original proposal’s text in Article 54.1(b) remains unchanged. At the time of writing of this blog, the final text of the provisional agreement reached between the Council presidency and the European Parliament’s negotiators [26] has yet to be disclosed, and it therefore remains to be seen how (and whether) the final text tackles the status of synthetic data.

5. What should I do if I am planning to generate or use synthetic data?

Some of the good practices related to synthetic data include:

  • Establishing a clear legal basis: If the starting point for generating the synthetic data involves personal data, the processing of this personal data must comply with the GDPR. Accordingly, organizations should carefully assess the legal justification for processing personal input data, ensuring that it falls under an appropriate legal basis. 
  • Transparency and Accountability: Organizations must be transparent when collecting and processing personal data of individuals for the purpose of generating synthetic data. Furthermore, keeping detailed records of processing personal data for the purpose of synthetic data generation is crucial, demonstrating the organization's dedication to transparency and accountability. 
  • Striking a balance: As with data anonymization, producing synthetic data requires finding a balance between utility and anonymity. If the synthetic data resembles the actual data too closely, it may be valuable for researchers but can compromise the privacy of data subjects and remain within the realm of personal data, posing significant challenges in terms of data protection compliance. For example, because synthetic data is artificially generated and does not correspond to real-world information about specific individuals, ensuring data accuracy, addressing correction requests, and handling objections from individuals regarding their data would be challenging, if not impossible. 
  • Privacy Assurance Assessments: In order to ensure that synthetic data does not qualify as personal data, it is crucial to conduct privacy assurance assessments.  This involves evaluating the risk of re-identification, ensuring data minimization, and implementing appropriate safeguards to protect individual privacy.  Ongoing research is exploring methods and metrics to assess the probability of re-identification of synthetic datasets. 
  • Documentation and monitoring: As with any AI training, careful documentation of input data and the process of synthetic data creation is essential. Expert analysis and oversight, from both domain experts and data scientists, are important in the generation as well as evaluation stages of synthetic data. Organizations must ensure the level of data quality appropriate to the foreseen use case and incorporate the principle of data protection by design into the synthetic data generation life cycle.  

It's important to acknowledge that, since synthetic data is relatively new, the rules of its use and its legal implications in various domains are still unclear. Extra caution is advised in scenarios where the data is to be used for training and validation of AI models intended to be classified as medical devices. Notably, concerns have been raised about the use of synthetic data for clinical validation, underlining the absence of a foundation in the Medical Device Regulation [27] [28]. Amidst this evolving landscape, where standards for evaluating synthetic data quality (in terms of both completeness and accuracy) are still being refined and, as mentioned above, are highly context-dependent, organizations must tread cautiously.

The risks associated with synthetic data include the potential for inaccuracies arising from flawed input data or background information [29], as well as the risk of bias in data creation due to inadequately balanced input information. Additionally, concerns arise about users' ability to understand the underlying logic applied by machine learning in generating synthetic values, raising questions about the transparency and trustworthiness of the data. In this dynamic realm of synthetic data, where standards and risks are under ongoing scrutiny, ensuring compliance and responsible data management calls for careful consideration.

FLUTE and AISym4Med have received funding from the European Union’s Horizon 2020 and Horizon Europe research and innovation programmes. However, the content of this article reflects the opinion of its authors and does not in any way represent opinions of the European Union or the European Commission. The European Commission is not responsible for any use that may be made of the information the article contains.

References:

[1] López, C. A. F, ‘On synthetic data: a brief introduction for data protection law dummies’, European Law Blog, (September 2022). Accessible at: https://europeanlawblog.eu/2022/09/22/on-synthetic-data-a-brief-introduction-for-data-protection-law-dummies/

[2] Valerie Marshall, Charlie Markham, Pavle Avramovic, Paul Comerford, Carsten Maple, Lukasz Szpruch, FCA Official, ‘Research Paper: Exploring Synthetic Data Validation – Privacy, Utility and Fidelity’. Accessible at: https://cy.ico.org.uk/media/for-organisations/documents/4025484/sythetic-data-roundtable-202306.pdf

[3] K. El Emam, L. Mosquera, and R. Hoptroff, ‘Practical Synthetic Data Generation: Balancing Privacy and the Broad Availability of Data’. O'Reilly Media Inc, (May 2020). Accessible at: https://cdn.ttgtmedia.com/rms/pdf/Practical_Synthetic_Data_Generation.pdf

[4] Agencia Espanola Proteccion Datos, ‘Synthetic data and data protection’, (November 2023). Accessible at: https://www.aepd.es/en/prensa-y-comunicacion/blog/synthetic-data-and-data-protection

[5] Gal, M. S., & Lynskey, O, ‘Synthetic Data: Legal Implications of the Data-Generation Revolution’, 109 Iowa Law Review, Forthcoming, LSE Legal Studies Working Paper No. 6/2023, (January 2023).  Accessible at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4414385   

[6] Fontanillo López, C. A., & Elbi, A, ‘On the legal nature of synthetic data’, Center for IT and IP Law, KU Leuven, NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research. Accessible at: https://openreview.net/pdf?id=M0KMbGL2yr

[7] Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). Accessible at: https://eur-lex.europa.eu/eli/reg/2016/679/oj 

[8] Elise Devaux, ‘Types of synthetic data and 4 real-life examples’, (2022). Accessible at: https://www.statice.ai/post/types-synthetic-data-examples-real-life-examples 

[9] Random Face Generator. Accessible at: https://this-person-does-not-exist.com/en

[10] Timo Kohlberger & Yuan Liu, ‘Generating Diverse Synthetic Medical Image Data for Training Machine Learning Models’, (February 2020). Accessible at: https://blog.research.google/2020/02/generating-diverse-synthetic-medical.html?m=1

[11] Jordon, J., Szpruch, L., Houssiau, F., Bottarelli, M., Cherubin, G., Maple, C., Cohen, S. N., & Weller, ‘Synthetic Data - what, why and how?’ (May 2022). Accessible at: https://royalsociety.org/-/media/policy/projects/privacy-enhancing-technologies/Synthetic_Data_Survey-24.pdf

[12] Ganev, Georgi, ‘When Synthetic Data Met Regulation’, arXiv preprint arXiv:2307.00359, (July 2023). Accessible at: https://arxiv.org/pdf/2307.00359.pdf

[13] Opinion 05/2014 of the Article 29 Working Party on Anonymisation Techniques. Accessible at: https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf 

[14] Theresa Stadler, Bristena Oprisanu, Carmela Troncoso, ‘Synthetic Data -- Anonymisation Groundhog Day’, (November 2020). Accessible at: https://arxiv.org/abs/2011.07018

[15] Colin Mitchell and Elizabeth Redrup Hill, ‘Are synthetic health data 'personal data'?’. Accessible at: https://www.phgfoundation.org/report/are-synthetic-health-data-personal-data#:~:text=We%20found%20that%20regulators%20and,been%20reduced%20to%20remote%20levels.

[16] Julia Ive, ‘Leveraging the Potential of Synthetic Text for AI in Mental Healthcare’, Front. Digit. Health (October 2022). Accessible at: https://www.frontiersin.org/journals/digital-health/articles/10.3389/fdgth.2022.1010202/full

[17] European Data Protection Supervisor, Tech Champion: Robert Riemann, publication on ‘Synthetic Data’. Accessible at: https://edps.europa.eu/press-publications/publications/techsonar/synthetic-data_en

[18] Alexandre Lodie, European Law Blog, ‘Are personal data always personal? Case T-557/20 SRB v. EDPS or when the qualification of data depends on who holds them’, (November 2023). Accessible at: https://europeanlawblog.eu/2023/11/07/are-personal-data-always-personal-case-t-557-20-srb-v-edps-or-when-the-qualification-of-data-depends-on-who-holds-them/#more-9476

[19] Information Commissioner’s Office. ‘Draft anonymisation, pseudonymisation and privacy enhancing technologies guidance. Chapter 5: Privacy-enhancing technologies (PETs)’. (September 2022). Accessible at: https://ico.org.uk/media/about-the-ico/consultations/4021464/chapter-5-anonymisation-pets.pdf

[20] Regulation (EU) 2022/868 of the European Parliament and of the Council of May 30, 2022, on European data governance and amending Regulation (EU) 2018/1724 (Data Governance Act). Accessible at: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex%3A32022R0868 

[21] Proposal for a Regulation of the European Parliament and of the Council Laying Down Harmonised Rules On Artificial Intelligence (Artificial Intelligence Act) AND Amending Certain Union Legislative Acts, COM/2021/206 final. Accessible at: https://eur-lex.europa.eu/legal-content/EN/ALL/?uri=celex:52021PC0206

[22] Legal status of Synthetic Data, Lorenzo Cristofaro, (October 2023). Accessible at: https://www.linkedin.com/pulse/legal-status-synthetic-data-lorenzo-cristofaro

[23] Khaled El Emam, ‘Precaution, ethics and risk: Perspectives on regulating non-identifiable data’, IAPP, (May 2022). Accessible at: https://iapp.org/news/a/precaution-ethics-and-risk-perspectives-on-regulating-non-identifiable-data/

[24] López, Cesar Augusto Fontanillo, ‘On the legal nature of synthetic data’,  NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research, (2022). Accessible at: https://openreview.net/pdf?id=M0KMbGL2yr

[25] Amendments adopted by the European Parliament on 14 June 2023 on the proposal for a regulation of the European Parliament and of the Council on laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) and amending certain Union legislative acts (COM(2021)0206 – C9-0146/2021 – 2021/0106(COD), Accessible at: https://www.europarl.europa.eu/doceo/document/TA-9-2023-0236_EN.html

[26] Council of the European Union, press release of 9 December 2023 on the provisional agreement on the Artificial Intelligence Act. Accessible at: https://www.consilium.europa.eu/en/press/press-releases/2023/12/09/arti…

[27] Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices, amending Directive 2001/83/EC, Regulation (EC) No 178/2002 and Regulation (EC) No 1223/2009 and repealing Council Directives 90/385/EEC and 93/42/EEC. Accessible at: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32017R0745

[28] Jarosław Greser, ‘Synthetic Data and Medical AI – Where Do We Stand?’, (October 2023). Accessible at: https://lsts.research.vub.be/synthetic-data-and-medical-ai-where-do-we-stand

[29] Theresa Stadler, Bristena Oprisanu & Carmela Troncoso, ‘Synthetic Data – Anonymisation Groundhog Day’ (unpublished manuscript, January 2022). Accessible at: https://arxiv.org/pdf/2011.07018.pdf.