Clinical data sharing using Generative Adversarial Networks
Abstract
Obtaining data is challenging for researchers, especially when it comes to medical data. Moreover, using medical data requires specific considerations because of concerns about privacy and confidentiality. Generative models aim to learn the data distribution via various statistical learning approaches. Among generative models, a machine learning-based approach named Generative Adversarial Networks (GANs) has proved its potential in the implicit density estimation of high-dimensional data. Therefore, we suggest an approach in which each healthcare organization, especially hospitals, could create and share its own GAN model, entitled Hospital-Based GANs (H-GANs).
MEDICAL DATA SHARING PROBLEM
Obtaining data is challenging for researchers, especially when it comes to medical data. Using medical data requires specific considerations because of concerns about privacy and confidentiality. At the same time, sharing these data is necessary to verify experiments and to extract more knowledge from the data[1]. One potential solution for sharing data while preserving privacy is de-identification. The main concern with this approach is that the process could be reversed, revealing the real patients’ identities. Another solution is to encourage patient populations to share their data by rewarding them or benefiting their communities[2]. While this may be feasible for small health ecosystems, its scalability is questionable: the many stakeholders, including each individual patient, could hold different viewpoints, so reaching a consensus might be challenging. In this paper, we propose a new solution to the medical data sharing problem. The main idea can be demonstrated by a simple example: assume that we want to share the heights of individuals without disclosing their identities. In this case, we could share the distribution of the heights instead (for a normal distribution, its mean and standard deviation). Having the parameters of this distribution enables others to generate new samples of heights and reuse them in place of the original records. The cornerstone of this approach is identifying the distribution of the data. Estimating the data distribution becomes a very complicated task for high-dimensional data such as medical images. A well-studied branch of machine learning, generative models, has emerged to address this problem.
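The height example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "private" heights are synthetic values generated for the sketch, the normality assumption is taken from the example itself, and the names private_heights, shared_mean, shared_std, and sample_heights are hypothetical.

```python
import random
import statistics

# Hypothetical private records: heights in centimetres.
# These are synthetic values created only for this illustration.
rng = random.Random(0)
private_heights = [rng.gauss(170.0, 8.0) for _ in range(1000)]

# The data holder shares two summary parameters instead of the records.
shared_mean = statistics.fmean(private_heights)
shared_std = statistics.stdev(private_heights)


def sample_heights(mean, std, n, seed=None):
    """Draw n new heights from a normal distribution with the shared parameters."""
    sampler = random.Random(seed)
    return [sampler.gauss(mean, std) for _ in range(n)]


# A recipient can regenerate as many samples as needed from the two parameters.
synthetic_heights = sample_heights(shared_mean, shared_std, n=5000, seed=1)

print("shared parameters:", round(shared_mean, 1), round(shared_std, 1))
print("mean of synthetic sample:", round(statistics.fmean(synthetic_heights), 1))
```

The recipient never sees an individual record, yet the synthetic sample has close to the same mean and spread as the original data. For images, two parameters are no longer enough, which is where a learned generative model comes in.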
GENERATIVE MODELS AS A SAFE WAY TO SHARE PRIVATE DATA
The underlying assumption in most machine learning tasks is that data samples are drawn from a single data-generating distribution[3]. Generative models aim to learn this distribution via various statistical learning approaches. Once we have an estimate of the data-generating distribution, we can generate new samples that are not necessarily the same as the input data. Hence, generative models can be viewed as a secure tool for sharing new data while preserving patients’ privacy. Generative models fall into two categories: explicit density estimation, which yields a density function for the data, and implicit density estimation, which only provides a way to draw samples[4]. Here, we are interested in generating new samples from the data distribution rather than in obtaining a parametric form of the distribution itself. Among generative models, Generative Adversarial Networks (GANs) have proved their potential in the implicit density estimation of high-dimensional data.
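The difference between the two categories can be illustrated with a toy one-dimensional sketch. It assumes a synthetic dataset, and the affine function implicit_generator is a hypothetical stand-in for a trained generator network, not an actual GAN.

```python
import random
from statistics import NormalDist

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # toy 1-D "dataset"

# Explicit density estimation: fit a parametric model that can report p(x).
explicit_model = NormalDist.from_samples(data)
density_at_zero = explicit_model.pdf(0.0)        # a density value is available
explicit_samples = explicit_model.samples(5, seed=1)


# Implicit density estimation: only a sampler exists (noise in, sample out).
# This affine map is a hypothetical placeholder for a trained generator.
def implicit_generator(z, scale=1.0, shift=0.0):
    return scale * z + shift


noise = random.Random(2)
implicit_samples = [implicit_generator(noise.gauss(0.0, 1.0)) for _ in range(5)]

print("explicit model density at 0:", round(density_at_zero, 3))
print("implicit model offers samples only:", len(implicit_samples), "drawn")
```

Both models can produce new samples, but only the explicit one can answer "how likely is this value?". For data sharing, drawing samples is the operation that matters, so the missing density function is not a limitation.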
STATE OF THE ART IN GENERATIVE MODELS: GANS
Recently, deep learning has outperformed traditional methods in different areas, including computer vision, natural language processing, and image processing. Deep learning models are powerful at learning highly nonlinear mappings. GANs can be viewed as the marriage of deep learning and generative models. A GAN is composed of two neural networks: a generator and a discriminator[5]. The generator tries to fool the discriminator by generating realistic data that are close to the distribution of the real data, while the discriminator tries to distinguish these so-called fake data from the real data. In other words, the training process is a minimax game. Note that once the GAN has been trained, only the generator network is required to generate new samples, and the discriminator can be discarded. The trained generator creates samples intended to follow the same distribution as the real data. GANs have been successfully used to generate samples by learning the data-generating distribution from a limited amount of data[6]. Currently, GANs are widely used to generate new texts and images for different purposes. One important application of GANs is to enhance the performance of classifiers trained on imbalanced datasets. An imbalanced dataset can severely affect the performance of a classifier, and such datasets are prevalent in medical applications. For example, in breast cancer datasets, mammography images with malignancy are far fewer than benign ones, which biases the classifier towards the benign class[4]. To solve this problem, GANs can be used to balance such datasets: we can train a GAN on the malignant images and then generate new samples of malignant cases.
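The minimax game mentioned above can be made concrete with the value function from the original GAN formulation[5], V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], which the discriminator tries to maximize and the generator tries to minimize. The sketch below only evaluates this quantity on hypothetical discriminator outputs; it does not train any network, and the name gan_value and the numbers passed to it are illustrative assumptions.

```python
import math


def gan_value(d_real, d_fake):
    """Empirical V(D, G) = mean(log D(x)) + mean(log(1 - D(G(z)))).

    d_real: discriminator outputs in (0, 1) on real samples x.
    d_fake: discriminator outputs in (0, 1) on generated samples G(z).
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


# Hypothetical outputs early in training: fakes are easy to spot.
confident = gan_value(d_real=[0.9, 0.95, 0.85], d_fake=[0.1, 0.05, 0.2])

# Hypothetical outputs when the discriminator cannot tell real from fake.
fooled = gan_value(d_real=[0.5, 0.5, 0.5], d_fake=[0.5, 0.5, 0.5])

print("value with a confident discriminator:", round(confident, 3))
print("value with a fooled discriminator:", round(fooled, 3))
```

When the discriminator outputs 0.5 for every input, the value equals -log 4, which is the level the generator is pushing towards in this formulation.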
INTRODUCING HOSPITAL-BASED GANS
We suggest an approach in which each healthcare organization, especially hospitals, creates and shares its own GAN, a Hospital-Based GAN (H-GAN), instead of sharing patients’ raw data. This solution provides a framework for sharing hospital data without violating patients’ privacy by providing a generator of data instead of the patients’ data records. In summary, this solution provides three major advantages. First and foremost, it preserves patients’ privacy. Second, it enables researchers to create an unlimited amount of data to train complex models that require huge amounts of data, such as deep learning classifiers, which also mitigates the imbalanced dataset issue. Third, it reduces the storage and bandwidth required for storing and transferring the data, because the models are shared instead of the whole images. For example, a dataset consisting of 5000 mammography images requires around 100 GB, while the GAN model created from this dataset is around 100 MB, which is roughly a 1000-fold reduction in size. At the next level, H-GANs could theoretically be combined to create multi-hospital, national, regional, and even global GANs, and these models could cover a comprehensive range of samples.
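The storage figures quoted above can be checked with a short calculation. The sizes are the approximate values given in the text; treating 1 GB as 1000 MB (decimal units) is an assumption of this sketch.

```python
# Approximate figures quoted in the text; decimal units (1 GB = 1000 MB) assumed.
num_images = 5000
dataset_size_mb = 100 * 1000   # about 100 GB
model_size_mb = 100            # about 100 MB

avg_image_size_mb = dataset_size_mb / num_images
size_reduction = dataset_size_mb / model_size_mb
images_per_model = model_size_mb / avg_image_size_mb

print(f"average image size: {avg_image_size_mb:.0f} MB")
print(f"dataset-to-model size ratio: {size_reduction:.0f}:1")
print(f"model size in average images: {images_per_model:.0f}")
```

Under these figures, the shared generator occupies about as much space as five average mammography images.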
DECLARATIONS
Authors’ contributions
Made substantial contributions to the conception and design of the study and performed data analysis, interpretation and data acquisition, as well as providing administrative, technical, and material support: Ayyoubzadeh SM (Seyed Mohammad Ayyoubzadeh), Ayyoubzadeh SM (Seyed Mehdi Ayyoubzadeh), Marzieh Esmaeili
Availability of data and materials
Not applicable.
Financial support and sponsorship
None.
Conflicts of interest
All authors declared that there are no conflicts of interest.
Ethical approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Copyright
© The Author(s) 2022.
REFERENCES
1. Bauchner H, Golub RM, Fontanarosa PB. Data sharing: an ethical and scientific imperative. JAMA 2016;315:1237-9.
2. McCoy MS, Joffe S, Emanuel EJ. Sharing patient data without exploiting patients. JAMA 2020;323:505-6.
3. Goodfellow I, Bengio Y, Courville A. Deep learning. Cambridge, MA: MIT Press; 2016. Available from: https://books.google.com.hk/books?hl=zh-CN&lr=&id=omivDQAAQBAJ&oi=fnd&pg=PR5&dq=Goodfellow,+I.,+Y.+Bengio,+and+A.+Courville,+Deep+learning.+2016:+MIT+press.&ots=MNS-dvnBPZ&sig=NJdjTCQPqdh_9MNYzT7igJdFhfE&redir_esc=y#v=onepage&q=Goodfellow%2C%20I.%2C%20Y.%20Bengio%2C%20and%20A.%20Courville%2C%20Deep%20learning.%202016%3A%20MIT%20press.&f=false [Last accessed on 25 Aug 2022].
4. Goodfellow I. NIPS 2016 tutorial: Generative Adversarial Networks. arXiv 2017; doi: 10.48550/arXiv.1701.00160.
5. Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. Adv Neural Inf Process Syst 2014;27:2672-80. Available from: https://proceedings.neurips.cc/paper/2014/hash/5ca3e9b122f61f8f06494c97b1afccf3 [Last accessed on 25 Aug 2022].
Cite This Article
Ayyoubzadeh, S. M.; Ayyoubzadeh, S. M.; Esmaeili, M. Clinical data sharing using Generative Adversarial Networks. Conn. Health. Telemed. 2022, 1, 98-100. http://dx.doi.org/10.20517/ch.2022.15