Synthesising artificial patient-level data for Open Science - an evaluation of five methods

Background Open science is a movement seeking to make scientific research accessible to all, including publication of code and data. Publishing patient-level data may, however, compromise the confidentiality of that data if there is any significant risk that data may later be associated with individuals. Use of synthetic data offers the potential to release data that may be used to evaluate methods or perform preliminary research without risk to patient confidentiality.
Methods We have tested five synthetic data methods: 1. A technique based on Principal Component Analysis (PCA), which samples data from distributions derived from the transformed data. 2. Synthetic Minority Oversampling Technique (SMOTE), which is based on interpolation between near neighbours. 3. Generative Adversarial Network (GAN), an artificial neural network approach with competing networks: a discriminator network trained to distinguish between synthetic and real data, and a generator network trained to produce data that can fool the discriminator network. 4. CT-GAN, a refinement of GANs specifically for the production of structured tabular synthetic data. 5. Variational Auto Encoders (VAE), a method of encoding data in a reduced number of dimensions, and sampling from distributions based on the encoded dimensions. Two data sets are used to evaluate the methods: 1. The Wisconsin Breast Cancer data set, a histology data set where all features are continuous variables. 2. A stroke thrombolysis pathway data set, describing characteristics of patients for whom a decision is made whether to treat with clot-busting medication. Features are mostly categorical, binary, or integers. Methods are evaluated in three ways: 1. The ability of synthetic data to train a logistic regression classification model. 2. A comparison of means and standard deviations between original and synthetic data. 3. A comparison of covariance between features in the original and synthetic data.
Results Using the Wisconsin Breast Cancer data set, the original data gave 98% accuracy in a logistic regression classification model. Synthetic data sets gave between 93% and 99% accuracy. Performance (best to worst) was SMOTE > PCA > GAN > CT-GAN = VAE. All methods reproduced the original data means and standard deviations with high accuracy (R-square > 0.96 for all methods and data classes). CT-GAN and VAE suffered a significant loss of covariance between features in the synthetic data sets. Using the Stroke Pathway data set, the original data gave 82% accuracy in a logistic regression classification model. Synthetic data sets gave between 66% and 82% accuracy. Performance (best to worst) was SMOTE > PCA > CT-GAN > GAN > VAE. CT-GAN and VAE suffered loss of covariance between features in the synthetic data sets, though less pronounced than with the Wisconsin Breast Cancer data set.
Conclusions The pilot work described here shows, as proof of concept, that synthetic data may be produced that is of sufficient quality to publish with open methodology, allowing people to better understand and test methodology. The quality of the synthetic data also gives promise of data sets that may be used for screening of ideas, or for research projects (perhaps especially in an education setting). More work is required to further refine and test methods across a broader range of patient-level data sets.


Introduction
Open science is a movement seeking to make scientific research accessible to all 1 . This includes not only open publication of scientific papers, but provision of code (using open source software), and underlying data. Publishing patient-level data may, however, compromise the confidentiality of that data if there is any significant risk that data may later be associated with individuals.
If we are to publish analysis or models, such as machine learning models, with data we therefore need a means of producing data that contains all the features of the original data but does not present any significant risk to patient confidentiality. We may take two approaches to solving this problem. Firstly we may add noise to the data to sufficiently protect anonymity; this approach is used in the differential privacy method 2 . A second approach is to try to produce synthetic data that is a reasonable facsimile of the original data, but that does not directly recreate original data points.
In this paper we present experiments with five different methods of producing synthetic data:
1. A method based on Principal Component Analysis, PCA, a classical statistical method for dimensionality reduction 3 . Data is transformed into k orthogonal dimensions. We use this approach to create synthetic data by sampling from distributions for each Principal Component dimension, and transforming these sampled data points back into the original data dimension space.
2. Synthetic Minority Oversampling Technique, SMOTE 4 . This is a method normally used to enhance data with extra points created by interpolation between near neighbours. Here we follow the same methodology used for data augmentation, but remove the original data points, leaving only the synthetic data points.
3. Generative Adversarial Network, GAN 5 . This method relies on two adversarial artificial neural networks. A discriminator network is trained to distinguish between synthetic and real data. A generator network is trained to produce data that can fool the discriminator network. The performance of each improves as the two networks are trained in contest with each other.
4. Conditional Tabular GAN, CT-GAN 6 . CT-GAN is a development of a general GAN with the aim of providing synthetic tabular data. A conditional GAN framework is used to help prevent mode collapse, a problem where a GAN may generate realistic synthetic data, but the population variance of the synthetic data is significantly reduced compared to the original data set.

5. Variational Auto Encoders, VAE 7 . An autoencoder is a type of artificial neural network that encodes data in a reduced dimension space 8 . The network is trained so that data is forced down through a layer (the latent space layer) with fewer dimensions than the original data. Decoding layers then expand back to the original number of dimensions, and the network is trained to minimise the loss between the decoded data and the original data. Variational Auto Encoders are an adaptation that allows this framework to be used for synthetic data production, using a specialised way of regularising the network to avoid over-fitting to the original data. The latent layer is framed as a distribution for each dimension, with the loss function for training the model incorporating a penalty for low variance distributions. Synthetic data is produced by sampling values for the latent layers using the distribution parameters obtained in training of the network.

Clinical data can take various forms. Here we investigate techniques using two different data sets.

medRxiv preprint doi: https://doi.org/10.1101/2020.10.09.20210138; this version posted October 13, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

All synthetic methods have been constructed to initially produce continuous variable outputs. Non-continuous outputs are generated as follows:
1. Binary: values of 0.5 or greater are set to 1, otherwise 0.
2. Integer: values are set as the rounded integer of the continuous variable. No clipping is applied.
3. Categorical: values are converted to one-hot encoding in the raw data. In the synthetic data the one-hot feature with the greatest value is set to 1, and all others are set to 0.
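The three conversion rules above can be sketched with NumPy as follows. This is a minimal illustration rather than the code used in the study: the function name `postprocess`, its arguments and the toy values are all hypothetical, and `np.rint` rounds exact halves to the nearest even integer, which may differ from the rounding rule actually used.

```python
import numpy as np


def postprocess(synthetic, binary_cols, integer_cols, categorical_groups):
    """Convert continuous synthetic outputs back to their original types.

    synthetic: 2D float array (one row per synthetic patient).
    binary_cols, integer_cols: lists of column indices.
    categorical_groups: list of lists; each inner list holds the column
        indices of one one-hot encoded categorical feature.
    """
    out = np.array(synthetic, dtype=float)  # work on a copy

    # Binary: values of 0.5 or greater become 1, otherwise 0
    out[:, binary_cols] = (out[:, binary_cols] >= 0.5).astype(float)

    # Integer: round to the nearest integer, with no clipping
    out[:, integer_cols] = np.rint(out[:, integer_cols])

    # Categorical: the one-hot column with the greatest value becomes 1,
    # all other columns in the same group become 0
    for group in categorical_groups:
        winners = np.argmax(out[:, group], axis=1)
        out[:, group] = 0.0
        out[np.arange(out.shape[0]), np.asarray(group)[winners]] = 1.0

    return out


# Toy input: column 0 binary, column 1 integer, columns 2-4 one categorical
raw = np.array([[0.7, 2.4, 0.2, 0.9, 0.1],
                [0.3, -1.6, 0.6, 0.5, 0.8]])
clean = postprocess(raw, binary_cols=[0], integer_cols=[1],
                    categorical_groups=[[2, 3, 4]])
print(clean)
```

Because no clipping is applied, the integer column can take values outside the range seen in the original data, as the negative value in the second toy row shows.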

Evaluation
Each synthetic data method is run separately on the negative-class and positive-class examples.
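As an illustration of running a method separately for each class, the sketch below applies a PCA-style generator to each class in turn. It is a sketch under stated assumptions, not the study's implementation: features are standardised before the decomposition, each principal component score is modelled as an independent normal distribution, and the function names (`pca_synthesise`, `synthesise_by_class`) and the toy data are hypothetical.

```python
import numpy as np


def pca_synthesise(X, n_samples, rng):
    """Sample synthetic rows in principal component space and map them
    back to the original feature space."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0                       # guard against constant features
    Z = (X - mean) / std                      # standardise (an assumption)

    # Rows of `components` are the orthogonal principal axes
    _, _, components = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ components.T                 # original data in PC space

    # Assumption: each PC score follows an independent normal distribution
    sampled = rng.normal(scores.mean(axis=0), scores.std(axis=0),
                         size=(n_samples, scores.shape[1]))

    # Transform back to the original dimensions and units
    return sampled @ components * std + mean


def synthesise_by_class(X, y, rng):
    """Fit and sample separately for each class label."""
    parts_X, parts_y = [], []
    for label in np.unique(y):
        X_class = X[y == label]
        parts_X.append(pca_synthesise(X_class, len(X_class), rng))
        parts_y.append(np.full(len(X_class), label))
    return np.vstack(parts_X), np.concatenate(parts_y)


# Toy stand-in data (not patient data): four features, two of them correlated
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
X[:, 1] += 0.8 * X[:, 0]
y = (X[:, 0] > 0).astype(int)

X_syn, y_syn = synthesise_by_class(X, y, rng)
print(X_syn.shape, np.bincount(y_syn))
```

Generating the same number of synthetic rows as original rows in each class keeps the class balance of the original data; other choices are possible.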
Methods are evaluated in three ways:
1. A logistic regression model (SciKit Learn 12 ) is trained using synthetic data. This is then tested against 25% of the original data (the remaining 75% of the original data is used to train another logistic regression model for comparison).
2. Means and standard deviations are compared between original and synthetic data. Coefficient of correlation of the comparison is described in the results section. Detailed plots are provided in the appendix.
3. Covariance between features is evaluated in the original and synthetic data. The pair-wise coefficient of correlation for each feature pair is compared between original and synthetic data, and an overall coefficient of correlation (of the pair-wise coefficients of correlation between original and synthetic data) is provided in the results. Detailed plots are provided in the appendix.

Table 1 shows the performance of a logistic regression classification model trained on original or synthetic data (when original data is used to train the model, the model is tested on the 25% of the data not used to train the model). Results are shown for accuracy (proportion of all cases identified correctly), sensitivity (proportion of positive cases identified correctly) and specificity (proportion of negative cases identified correctly). The experiment was repeated five times.

Table 2 shows a summary of correlations between original and synthetic means and standard deviations (see appendix for detailed charts). Further detailed results are available in the Jupyter Notebooks in the on-line GitHub repository.

Table 3 shows a summary of correlation coefficients between original and synthetic pair-wise feature correlation coefficients (see appendix for detailed charts).
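The logistic regression evaluation, with the accuracy, sensitivity and specificity definitions given above, can be sketched with scikit-learn as follows. Several details are assumptions made for illustration only: the features are standardised before fitting, the synthetic data is generated from the 75% training portion rather than from all of the original data, and a simple per-class Gaussian sampler (`gaussian_stand_in`, a hypothetical helper) stands in for a real synthetic data method. The data is simulated, not patient data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def evaluate(X_train, y_train, X_test, y_test):
    """Fit logistic regression; return accuracy, sensitivity, specificity."""
    scaler = StandardScaler().fit(X_train)
    model = LogisticRegression(max_iter=1000)
    model.fit(scaler.transform(X_train), y_train)
    predicted = model.predict(scaler.transform(X_test))

    tn, fp, fn, tp = confusion_matrix(y_test, predicted, labels=[0, 1]).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # positive cases identified correctly
        "specificity": tn / (tn + fp),   # negative cases identified correctly
    }


def gaussian_stand_in(X_class, rng):
    """Placeholder generator: independent normal per feature (hypothetical)."""
    return rng.normal(X_class.mean(axis=0), X_class.std(axis=0),
                      size=X_class.shape)


rng = np.random.default_rng(0)
# Simulated "original" data: two overlapping Gaussian classes
X = np.vstack([rng.normal(0.0, 1.0, (150, 5)), rng.normal(1.5, 1.0, (150, 5))])
y = np.repeat([0, 1], 150)

# 75% of the original data for training, 25% held back for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Synthetic data produced separately for each class
X_syn = np.vstack([gaussian_stand_in(X_tr[y_tr == c], rng) for c in (0, 1)])
y_syn = np.repeat([0, 1], [(y_tr == 0).sum(), (y_tr == 1).sum()])

baseline = evaluate(X_tr, y_tr, X_te, y_te)     # trained on original data
synthetic = evaluate(X_syn, y_syn, X_te, y_te)  # trained on synthetic data
print(baseline, synthetic)
```

Both models are tested on the same held-back original data, so any drop from `baseline` to `synthetic` reflects information lost by the synthetic data method.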

Comparison of means and standard deviations
Table 1: Performance of original and synthetic data sets when used to train a logistic regression model, which is tested against original data. Results show mean ± sem (n=5).

Discussion
In the data sets we have examined here, synthetic approaches based on PCA and SMOTE have the best performance overall: classification model performance is maintained, means and standard deviations of the synthetic data closely match the original data set, and covariance between features is well maintained. A standard GAN performed reasonably well in all categories. CT-GAN, however, performed more poorly in the classification model training for the stroke model and, while means and standard deviations closely matched the original data set, there was a significant loss in covariance between the features. The VAE performance was similar to the standard GAN, but also suffered from some loss in covariance between features, especially in the Wisconsin Breast Cancer data set.
PCA and SMOTE currently appear the best choices for synthesising tabular patient data, but testing on more data sets is required. PCA may struggle as feature sets become larger and computing the principal components becomes more demanding.
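The interpolation idea behind SMOTE, as used here with the original points discarded, can be sketched in a few lines of NumPy. This is an illustrative reimplementation under assumptions, not the library code used in the study: the function name `smote_synthesise`, the choice of Euclidean distance and the parameter `k` are hypothetical, and the brute-force distance matrix is only practical for small data sets. In line with the study design it would be applied to one class at a time.

```python
import numpy as np


def smote_synthesise(X, n_samples, k, rng):
    """Create synthetic rows by interpolating between a randomly chosen
    point and one of its k nearest neighbours. Only the interpolated
    points are returned; the original rows are not included."""
    # Pairwise squared Euclidean distances (brute force)
    diffs = X[:, None, :] - X[None, :, :]
    dist = (diffs ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest rows

    base = rng.integers(0, len(X), size=n_samples)            # starting points
    picked = neighbours[base, rng.integers(0, k, size=n_samples)]
    gap = rng.random((n_samples, 1))                          # position along the segment

    return X[base] + gap * (X[picked] - X[base])


# Toy stand-in for the examples of a single class (not patient data)
rng = np.random.default_rng(7)
X_class = rng.normal(size=(60, 3))
X_syn = smote_synthesise(X_class, n_samples=50, k=5, rng=rng)
print(X_syn.shape)
```

Every synthetic point lies on a line segment between two original points, so synthetic values never fall outside the range of the original data for any feature.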
GANs are a rapidly developing type of network. They are able to synthesise complex data including non-structured data such as images (see https://thispersondoesnotexist.com for an example of a GAN creating realistic images of people). Here we have used just a simple GAN and CT-GAN. There is potential in testing developments of the GAN approach, such as the Wasserstein GAN 19 , which improves stability of GANs and helps prevent mode collapse, where the synthetic data is realistic but the population of synthetic data is more limited in variance than that of the original data.
Instability in GANs is a well known phenomenon, hence the development of techniques such as Wasserstein GAN to improve stability. One practical approach may also be to train an ensemble of GANs, and choose the one that produces the highest quality synthetic data.
The relatively poor performance of the CT-GAN, especially the profound loss of the expected covariance between features in the synthetic data, was a surprise, as this method is targeted at replicating tabular data. From our results, this method should be used with caution where preservation of feature covariance is important.
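The two statistical comparisons used in this paper (means and standard deviations, and pair-wise feature correlations) can be sketched as below. The function names and the simulated data are hypothetical, and reporting the squared Pearson correlation as the summary value is an assumption. The example "synthetic" set is produced by shuffling each column independently, a deliberately poor generator chosen only to show that means and standard deviations can be reproduced exactly while covariance between features is lost.

```python
import numpy as np


def r_squared(a, b):
    """Squared Pearson correlation between two 1D arrays."""
    return np.corrcoef(a, b)[0, 1] ** 2


def compare(original, synthetic):
    """Summarise how closely synthetic data reproduces the original."""
    # Feature-wise means and standard deviations
    mean_r2 = r_squared(original.mean(axis=0), synthetic.mean(axis=0))
    std_r2 = r_squared(original.std(axis=0), synthetic.std(axis=0))

    # Pair-wise feature correlations (upper triangle, diagonal excluded)
    upper = np.triu_indices(original.shape[1], k=1)
    corr_orig = np.corrcoef(original, rowvar=False)[upper]
    corr_syn = np.corrcoef(synthetic, rowvar=False)[upper]

    return {"means": mean_r2, "stds": std_r2,
            "correlations": r_squared(corr_orig, corr_syn)}


# Simulated data with correlated features (not patient data)
rng = np.random.default_rng(1)
n_features = 10
latent = rng.normal(size=(500, 1))
loadings = rng.uniform(-1.0, 1.0, n_features)
scales = np.linspace(1.0, 5.0, n_features)
offsets = np.linspace(0.0, 20.0, n_features)
original = (latent * loadings + 0.5 * rng.normal(size=(500, n_features))) \
    * scales + offsets

# Shuffling each column independently keeps means and standard deviations
# but breaks the relationships between features
shuffled = np.column_stack([rng.permutation(col) for col in original.T])

same = compare(original, original)
broken = compare(original, shuffled)
print(same, broken)
```

A method can therefore score well on the means and standard deviations comparison while still failing the covariance comparison, which is why both checks are reported.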
The overall performance of the VAEs was similar to the GAN, except that there was some loss in covariance between features. Unlike PCA, there is no requirement for the encoded features in the reduced dimension layer to be orthogonal to each other, so it is perhaps not surprising that the VAE does not necessarily maintain feature covariance.
Whether performance of a synthetic method is sufficiently good, and which method is best, depends on the purpose of the synthetic data. Is the synthetic data to be used as an illustrative data set, or will detailed analysis be performed on it? For Open Science, the former will probably be most common: the synthetic data must resemble the original data closely enough that the methods and results being presented may be understood using the synthetic data. A next step up the synthetic data quality ladder is the use of synthetic data that may be made publicly available and that can be used to test ideas before an application is made for robust analysis of the original data. The final rung of the synthetic data quality ladder is when synthetic data may totally replace original data for research, with no need even to confirm results using the original data. The quality of patient-level synthetic data, from our pilot experiments, appears to be within this spectrum: easily good enough to be used to help people understand and test published methodology, and likely good enough to be used to screen ideas (e.g. in an educational research setting).

Limitations
Three key limitations of the work described here are:
1. We have so far used only two patient-level data sets. Though these data sets were chosen to represent different types of data, further evaluation is needed using alternative patient-level data sets.
2. We have used methods in their basic configuration. Further optimisation, or use of refined approaches, may improve on performance observed here.
3. In all our methods we trained the synthetic data engines on data from each data class separately. We have yet to evaluate the performance of machine learning classification trained on synthetic data where the class is treated as just one of the features in the data set (a single synthetic data engine would be trained and used, rather than class-based engines).

Further work
Further work will focus on the following areas:
1. Optimising methods and using refinements to methods (especially more advanced GAN techniques).
2. Testing on a broader range of patient-level data sets.
3. Testing the ability to produce synthetic data suitable for machine learning classification without the need to separately train methods on different classes of data.

4. Testing ensembles of artificial neural nets (GANs, VAEs), picking the best performing engine and testing against a separate held-back data set (for machine learning classification).

Conclusions
The pilot work described here shows, as proof of concept, that synthetic data may be produced that is of sufficient quality to publish with open methodology, allowing people to better understand and test methodology. The quality of the synthetic data also gives promise of data sets that may be used for screening of ideas, or for research projects (perhaps especially in an education setting).

References
Appendices

A Wisconsin Breast Cancer data set

B Stroke thrombolysis pathway data set
Figure 11: Comparison of mean and standard deviations of features between original and synthetic stroke thrombolysis pathway data, with synthetic data produced using a Principal Component based approach. Different colours represent five alternative model runs.

Figure 12: Comparison of correlation between all features in original and synthetic stroke thrombolysis pathway data, with synthetic data produced using a Principal Component based approach. Different colours represent five alternative model runs.

Figure 13: Comparison of mean and standard deviations of features between original and synthetic stroke thrombolysis pathway data, with synthetic data produced using SMOTE. Different colours represent five alternative model runs.

Figure 14: Comparison of correlation between all features in original and synthetic stroke thrombolysis pathway data, with synthetic data produced using SMOTE. Different colours represent five alternative model runs.

Figure 15: Comparison of mean and standard deviations of features between original and synthetic stroke thrombolysis pathway data, with synthetic data produced using a Generative Adversarial Network. Different colours represent five alternative model runs.

Figure 16: Comparison of correlation between all features in original and synthetic stroke thrombolysis pathway data, with synthetic data produced using a Generative Adversarial Network. Different colours represent five alternative model runs.

Figure 17: Comparison of mean and standard deviations of features between original and synthetic stroke thrombolysis pathway data, with synthetic data produced using CT-GAN. Different colours represent five alternative model runs.

Figure 18: Comparison of correlation between all features in original and synthetic stroke thrombolysis pathway data, with synthetic data produced using CT-GAN. Different colours represent five alternative model runs.

Figure 19: Comparison of mean and standard deviations of features between original and synthetic stroke thrombolysis pathway data, with synthetic data produced using a Variational Auto Encoder. Different colours represent five alternative model runs.
