Transfer learning for non-image data in clinical research: A scoping review

Background Transfer learning is a form of machine learning where a pre-trained model trained on a specific task is reused as a starting point and tailored to another task in a different dataset. While transfer learning has garnered considerable attention in medical image analysis, its use for clinical non-image data is not well studied. Therefore, the objective of this scoping review was to explore the use of transfer learning for non-image data in the clinical literature. Methods and findings We systematically searched medical databases (PubMed, EMBASE, CINAHL) for peer-reviewed clinical studies that used transfer learning on human non-image data. We included 83 studies in the review. More than half of the studies (63%) were published within 12 months of the search. Transfer learning was most often applied to time series data (61%), followed by tabular data (18%), audio (12%) and text (8%). Thirty-three (40%) studies applied an image-based model to non-image data after transforming data into images (e.g. spectrograms). Twenty-nine (35%) studies did not have any authors with a health-related affiliation. Many studies used publicly available datasets (66%) and models (49%), but fewer shared their code (27%). Conclusions In this scoping review, we have described current trends in the use of transfer learning for non-image data in the clinical literature. We found that the use of transfer learning has grown rapidly within the last few years. We have identified studies and demonstrated the potential of transfer learning in clinical research in a wide range of medical specialties. More interdisciplinary collaborations and the wider adaption of reproducible research principles are needed to increase the impact of transfer learning in clinical research.


Abstract Background
Transfer learning is a form of machine learning where a pre-trained model trained on a specific task is reused as a starting point and tailored to another task in a different dataset. While transfer learning has garnered considerable attention in medical image analysis, its use for clinical non-image data is not well studied. Therefore, the objective of this scoping review was to explore the use of transfer learning for nonimage data in the clinical literature.

Methods and Findings
We systematically searched medical databases (PubMed, EMBASE, CINAHL) for peer-reviewed clinical studies that used transfer learning on human non-image data.
We included 83 studies in the review. More than half of the studies (63%) were published within 12 months of the search. Transfer learning was most often applied to time series data (61%), followed by tabular data (18%), audio (12%) and text (8%). Thirty-three (40%) studies applied an image-based model to non-image data after transforming data into images (e.g. spectrograms). Twenty-nine (35%) studies did not have any authors with a health-related affiliation. Many studies used publicly available datasets (66%) and models (49%), but fewer shared their code (27%).

Conclusions
In this scoping review, we have described current trends in the use of transfer learning for non-image data in the clinical literature. We found that the use of transfer learning has grown rapidly within the last few years. We have identified studies and

Introduction
There is no doubt that most clinicians will use technologies integrating artificial intelligence (AI) to automate routine clinical tasks in the future. In recent years, the U.S. Food and Drug Administration has been approving an increasing number of AIbased solutions, dominated by deep learning algorithms [1]. Examples include atrial fibrillation detection via smart watches, diagnosis of diabetic retinopathy based on fundus photographs, and other tasks involving pattern recognition [1]. In the past, the development of such algorithms would have taken an enormous effort, both regarding computational capacity and technical expertise. Nowadays, computational tools, including cloud computing, free software, and training materials are more easily accessible than ever before [2]. This means that more researchers, with more diverse backgrounds, have access to machine learning, and that they can focus more on the subject matter when developing AI-based solutions for clinical practice.
Despite this trend, machine learning, neural networks, transfer learning, and other elements of AI still seem to be surrounded by mystery in the clinical research community, and AI has yet to reach its potential in the clinic [1].
A neural network is a type of machine learning model, inspired by the structure of the human brain (Box 1). In the simplest scenario, the input data flows through layers of artificial neurons, known as hidden layers. Each hidden neuron takes the results from previous neurons, calculates a weighted sum before applying a nonlinear function, and feeding this value forward to the next layer. The final layer transforms the results according to the prediction task, for example to probabilities, when the task is to predict whether a lung nodule on a chest X-ray is malignant. Fitting or training a neural network means optimizing the weights or parameters against some performance metric. Deep learning means the use of neural networks with several hidden layers of neurons. In the last decade, several neural network architectures were designed with some consisting of more than 100 layers and tens of millions of weights [3]. In such a deep neural network, neurons at lower levels (i.e. closer to input layer) 'learn' to recognize some lower-level features in the data (e.g. circles, vertical and horizontal lines in images), which the higher level neurons combine into more complex features (e.g. a face, some text, etc. in an image) [4]. Neural networks are popular tools in machine learning as they can approximate any complex nonlinear association. This flexibility, however, comes with a price, as fitting complex neural networks requires very large datasets, which limits the spectrum of fields where their application is feasible.
Transfer learning circumvents the above-mentioned limitation and unlocks the potential of machine learning for smaller datasets by reusing a pre-trained neural network built for a specific task, typically on a very large dataset, on another dataset and potentially for a different task. The pre-trained model is also known as the source model, while the new dataset and task is referred to as the target data and target task. A common example is to take a computer vision model, trained to identify everyday objects in millions of images, and further train this model for grading diabetic retinopathy on only a few thousand fundus photographs, instead of training the model from scratch on this smaller dataset [5]. This example demonstrates a type of transfer learning known as fine-tuning, or weight or parameter transfer. Another type of transfer learning is feature-representation transfer, where the features from the hidden layers of a model are used as inputs for another model, but other forms of transfer learning exist [6][7][8].
Applications of transfer learning are common in computer vision, where large datasets are publicly available to train models that can then be adapted to different . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 3, 2021. ; https://doi.org/10.1101/2021.10.01.21264290 doi: medRxiv preprint domains. One of the most influential datasets is ImageNet, which includes more than a million images from everyday life [9]. Two recent scoping reviews focused on transfer learning for medical image analysis, one of which identified around a hundred articles applying ImageNet-based models for clinical prediction tasks [10,11]. The number of published articles using transfer learning for medical image analysis approximately doubled every year in the last decade, demonstrating an increasing interest in transfer learning [10,11]. To our knowledge, a comprehensive overview of the use of transfer learning for other non-image data types is lacking, despite that tabular and time series data seem to dominate in the clinical literature.
Therefore, the objective of this scoping review was to fill this knowledge gap by exploring and characterizing studies that used transfer learning for non-image data in clinical research.

Methods
Scoping reviews identify available evidence, examine research practices, and characterize attributes related to a concept (i.e. transfer learning in our case) [12].
This format fits better with our research objective than a traditional systematic review and meta-analysis, as the latter require a more well-defined research question.
During the process, we followed the 'PRISMA for Scoping Reviews' guidelines and the manual for conducting scoping reviews by the Joanna Briggs Institute [13,14].

Eligibility criteria
In accordance with the aim of this review, we only wanted to include studies using transfer learning for a clinical purpose. Similarly, we were only interested in articles indexed in medical databases, as opposed to in purely technical databases such as The Institute of Electrical and Electronics Engineers (IEEE) Xplore Digital Library, which most clinical researchers and practitioners might be unfamiliar with. Moreover, . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 3, 2021. ; https://doi.org/10.1101/2021.10.01.21264290 doi: medRxiv preprint such articles are often written in a technical language, which limits their utility for the clinical research community. By only including articles indexed in medical databases, we could gauge how exposed clinical researchers are to clinical use of transfer learning and could also provide a list of articles that can serve as inspiration to clinical researchers interested in transfer learning.
To be considered for inclusion, studies needed to 1) be published, peer-reviewed, written in English, and indexed in a medical database (defined as PubMed, EMBASE, or CINAHL) since database inception, 2) be a clinical study or focus on clinically relevant outcomes or measurements, 3) use data from human participants or synthetic data representing human participants as target data, 4) use transfer learning with either fine-tuning (parameter transfer) or feature-representation transfer, and 5) analyze non-image target data (text, time-series, tabular data, or audio). Accordingly, we excluded preprints and conference abstracts, basic research, cell studies, animal studies, and studies analyzing image data. Videos were considered as blends of audio and images; therefore, we did not include studies analyzing this type of target data. However, we did not exclude studies that converted non-image data into images (e.g. converting an audio file into an audiogram) and then analyzed them using models from computer vision. Finally, we also excluded studies which combined source data with the target data as an integral part of their analysis as opposed to reusing only a source model. Eligibility criteria were predefined and the review protocol is available online [15]. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Information sources and search strategy
The copyright holder for this this version posted October 3, 2021. ; https://doi.org/10.1101/2021.10.01.21264290 doi: medRxiv preprint online [15]. In brief, we searched for all studies that either specifically mentioned 'transfer learning' or included both a similar but less specific phrase (e.g. 'transfer of learning', 'transfer weights', 'connectionist network', etc.) and an AI-related keyword (e.g. 'artificial intelligence', 'neural network', 'NLP', etc.). The search strategy was supplemented by a call-out on Twitter by AH (@adamhulman) on May 25, 2021, and by scanning the references of relevant reviews found during the screening process.
As per our aim, we did not include grey literature.

Selection of sources of evidence
The search results from each database were imported to the reference program Endnote 20.1 (Clarivate Analytics, Philadelphia, PA, USA). Duplicate removal was done by AE and is documented online [15]. Hereafter, the records were transferred to Covidence (Veritas Health Innovation, Melbourne, Australia) for the screening process.
Abstracts and titles were screened against the predefined eligibility criteria by at least two independent authors (AE, MT, OEA, AH). If an abstract and title clearly did not meet the inclusion criteria, the record was excluded. In case of uncertainty, the record was included for full-text screening. Next, full-text versions of all reports were retrieved if possible and assessed for eligibility. If a report was excluded at this step of the screening process, a reason was documented. In any stage of the screening process, conflicts were resolved through discussion between the dissenting authors.
In case of no consensus, the last author (AH) made the final decision. The study selection process is presented using a PRISMA flow diagram [16] in Figure 1.

Data charting process
Research questions were pre-specified and published in the scoping review protocol [15]. Data of interest from each included study was extracted by two independent . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 3, 2021. ; https://doi.org/10.1101/2021.10.01.21264290 doi: medRxiv preprint researchers based on a pre-developed and tested data extraction form [15]. Disagreements were solved as described above. Data on study characteristics, like the study area within medicine, the affiliation of the authors (medical or technical departments), and the aim of the study, was extracted. Furthermore, extracted data included knowledge on model characteristics such as what method and origin the model being transferred is based on, type of transfer learning, type of source and target data, and the advantages/disadvantages if compared to a non-transfer learning method. Lastly, information on the reproducibility of the studies was registered, i.e. the public availability of the data, the reused model, and the code for the analysis, as well as the software used. The studies are listed by field within medicine and data type in Table 1, and the complete dataset is available in the electronic supplement.

Results
The search resulted in 4,902 records, of which 2,097 were duplicates ( Figure 1).
After screening the remaining 2,805 records, 2,528 were excluded as irrelevant. Of the 277 records included for full-text review, 194 were excluded, mainly because they focused on basic, animal, or other non-clinical research (n=107) or did not use . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 3, 2021. ; https://doi.org/10.1101/2021.10.01.21264290 doi: medRxiv preprint transfer learning (n=43). Another 64 records were identified in reviews or on Twitter, but none of the papers were eligible for inclusion. In total, 83 studies were included in this review (Table 1).
Only one of the identified studies [17]  The most common fields of studies were neurology (n=26) and cardiology (n=18) followed by genetics, infectious diseases and psychiatry (n=5 for each). In line with the medical specialties, analyses of electroencephalography (EEG) and electrocardiogram (ECG) data were common with 20 and 19 studies, respectively.
Study aims were most often described with the terms: 'prediction', 'detection', and 'classification'.
In total, 50 out of 83 (60%) studies included at least one author with a clinical affiliation and at least one with a technical affiliation. Studies with pure technical affiliations were more common (29 out of 83; 35%) than studies with pure clinical affiliations (4 out of 83; 5%).
Time series was the most common target data type (n=51; 61%), followed by tabular data (n=15; 18%), audio (n=10; 12%) and text (n=7; 8%). The reused model was is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 3, 2021. ; https://doi.org/10.1101/2021.10.01.21264290 doi: medRxiv preprint developed in the ImageNet [9] dataset (n=27). Figure 3 shows the different combinations of source and target data types.
Fine-tuning was approximately three times as common (58 out

Discussion
Applications of transfer learning were identified in a variety of clinical studies and demonstrated the potential in reusing models across different prediction tasks, data types, and even species. Improvements in predictive performance were especially striking when transfer learning was applied to smaller datasets, as compared to . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 3, 2021. ; https://doi.org/10.1101/2021.10.01.21264290 doi: medRxiv preprint training machine learning algorithms from scratch. Image-based models seemed to have a large impact outside of computer vision and were reused for the analysis of time series, audio, and tabular data. Many studies utilized publicly available datasets and models, but surprisingly few shared their code.
Transfer learning via fine-tuning and feature-representation learning has almost been unknown in the clinical literature until 2019, when the number of articles started to increase rapidly. This development is lagging a few years behind the trend seen in medical image analysis [11], which might be explained partly by the impact that computer vision models have on non-image applications too. Also, we only included peer-reviewed articles indexed in medical databases, while the latest developments in computer science are most likely to be first published as preprints or conference proceedings.
More than half of the studies were from the fields of neurology and cardiology, while the rest of the studies were more equally distributed across other fields. This could be due to the high-resolution nature of EEG and ECG data that suits well with the application of machine learning. Also, there are many publicly available datasets and data science competitions that attract the attention of the computer science community.
The majority of study aims were predictions of binary or categorical outcomes on the individual level e.g. detection of a disease, classification of disease stages, or risk estimation. Few studies focused on prediction of continuous outcomes (e.g. glucose levels [26,27]) and forecasts on the population level (e.g. COVID-19 trend [28] and dengue fever outbreaks [29]). Prediction models are abundant in the clinical literature, however most do not make it into routine clinical use [30]. For machine learning algorithms, one of the barriers is that clinical practitioners often consider . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 3, 2021. ; https://doi.org/10.1101/2021.10.01.21264290 doi: medRxiv preprint them as black boxes without intuitive interpretations, even though new solutions are constantly being developed to solve this problem [31]. Another issue is that machine learning algorithms are often used to analyze modestly sized tabular datasets and (not surprisingly) fail to outperform traditional statistical methods, contributing to even more skepticism. Transfer learning helps to overcome this problem by reusing models trained on large datasets and tailoring them to smaller ones. This provides the opportunity to capitalize on recent developments in machine learning research in new areas of clinical research. However, to guarantee the development of easily accessible and clinically relevant models that go beyond the 'proof-of-concept' level into clinical deployment, it is crucial that computer scientists and clinical researchers work together, which we observed in less than two-thirds of the studies. This proportion would probably have been even lower if we had considered studies indexed in non-clinical databases. Even the clinical studies using transfer learning, we identified, are still largely being published in rather specialized journals focusing on an audience with some technical background (e.g. IEEE and interdisciplinary journals) and have not yet reached mainstream clinical journals.
We were surprised by the high number of studies reusing a model that was developed in a dataset of a different data type than the target dataset. This also made it difficult to compare the size of source and target datasets quantitatively, as they might have been characterized in very different units (e.g. hours of audio recordings in the target dataset vs. number of images in the source dataset), but our observation was that the size of the target datasets was often much smaller than the source dataset. E.g. models developed in the ImageNet dataset (>1 million images) were used even in smaller clinical studies with <100 patients [24,32,33], where the use of machine learning models would otherwise not be recommended or feasible. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint Many studies compared long lists of different model architectures (e.g. ResNet [34], Inception [35], GoogLeNet [36], AlexNet [37], VGGNet [38], MobileNet [39] etc.) or even different methods. This, often combined with the use of several performance metrics (area under the ROC-curve, F1-score, accuracy, sensitivity, specificity, etc.), makes it difficult to summarize the advantages/disadvantages of transfer learning, even though many studies included comparisons with other deep learning solutions without transfer learning. We observed that the degree of performance improvements varied greatly. Some studies reported faster and more stable model fitting process with transfer learning due to better utilization of data. Some of these results are described in the sections below characterizing the different data types.
Even though half of the studies reused a publicly available model and even more used a publicly available dataset, principles of reproducible research were followed less often. One-third of the studies did not report at all which software they used and even fewer shared their code. Those who did, most often used Python, then MATLAB, while we did not find a single study using R for the transfer learning part.
This might be a consequence of the fact that commonly used libraries like PyTorch, TensorFlow, or fastai, are native in Python and only recently became available in R.
Those authors who shared their code did so almost exclusively on GitHub.
A discussion of data type-specific observations is organized in the following four sections.
Almost half of the studies transformed their data into images (e.g. spectrograms, scalograms), most often using Fourier (e.g. short-time or fast) or continuous wavelet transforms, to be able to utilize models from computer vision. A reason for this might be that publicly available time series datasets were rather small until recently, while in computer vision, ImageNet has been described and widely used as a large benchmark dataset for about a decade [9]. In 2020, PTB-XL [83]  where it would not be feasible to train models from scratch. This may have a positive impact on the study of rare diseases or minority groups, regardless of the type of data. Lopes et al. [69] used this strategy to detect a rare genetic heart disease based on ECG recordings, although they only had 155 recordings from patients. First, the authors developed a convolutional neural network for sex detection, which required a readily available outcome from their database including approximately a quarter of a million ECG recordings. Even though the model was initially trained for a task with is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 3, 2021. ; https://doi.org/10.1101/2021.10.01.21264290 doi: medRxiv preprint little clinical relevance, the model still learnt some useful features from the data. This model was then fine-tuned for the detection of a rare genetic heart disease using only 310 recordings from 155 patients and 155 matched controls. With this approach, the authors achieved major improvements in model discrimination not only compared to training the same architecture from scratch, but also outperformed various machine learning approaches and clinical experts [61].
In the previous example, an easily accessible outcome or label (i.e. sex) made it possible to train the base model, however even this can be avoided by using autoencoders [84]. An autoencoder is an unsupervised machine learning algorithm often used for denoising a dataset by first reducing its dimensions and only keeping relevant information (encoding), before reconstructing the original dataset (decoding). The feature representations learnt during the encoding step might be then transferred to new tasks. Jang et al. [59] developed an autoencoder based on 2.6 million unlabeled ECG recordings which was then reused as the base of an ECG rhythm classifier. The authors compared this approach to an image-based transfer learning classifier and a model trained from scratch. The autoencoder solution performed best, but the difference to the model trained from scratch was much smaller than in the other example. More interestingly, the authors compared the results when using 100%, 50% and 25% of the available data, and found a major drop in performance of the model trained from scratch when using only 25% of the data, while the autoencoder-based transfer learning solution still had an excellent performance as it still indirectly utilized data from >2 million ECG recordings. The image-based solution had a slightly lower F1-score (indicating worse performance) than other methods when using 100% of the data, but it did not change much when . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 3, 2021. ; https://doi.org/10.1101/2021.10.01.21264290 doi: medRxiv preprint using only 25% of the data, which led to a better performance than of the model trained from scratch. EEG data were most often used for sleep staging and epileptic seizure detection. We highlight the study by Raghu et al. [51] using pre-trained convolutional neural networks for seizure detection based on EEG-based spectrograms, because the authors described both fine-tuning and feature-representation learning solutions. In the latter approach, features were fed into a support vector machine classifier, a popular machine learning technique. The authors found that taking features from deeper layers, representing higher-level features, resulted in higher accuracy for most architectures. For almost all architectures, feature-representation transfer was found to be better than fine-tuning regarding accuracy. However, this can partly be a consequence of the fact that many more models were fitted including features from different layers. Further, once the authors found out which layer provided the most useful representation for a specific problem, the optimization process was shortened markedly compared to the fine-tuning process (~4 vs. 52 min with the Inception-v3 architecture).

Audio
Voice recognition and other audio-based applications of AI surround us in our everyday lives. In a clinical setting, doctors have traditionally used audio signals from stethoscopes to screen patients e.g. for heart and lung diseases. Additionally, vocal biomarkers processed with machine learning algorithms are getting more and more attention in a research setting, and are expected to aid diagnosis and monitoring of diseases in the future [85]. Despite the increasing interest and opportunities for relatively easy and cheap data collection, we only found 10 studies that used transfer learning on audio data. However, these studies covered a variety of fields in . CC-BY 4.0 International license It is made available under a perpetuity.
Recordings of speech contain different levels of information: linguistic features that can be derived from transcripts, and temporal and acoustic features that can be derived from raw audio recordings. Two studies applied BERT [94], an open source, pre-trained natural language model, to analyze transcripts of speech with the aim of diagnosing Alzheimer's disease [21,87]. Balagopalan et al. [87] also tested a model for the same task that included acoustic features, however, this model performed worse than the model only based on linguistic features. This finding highlights the possible complexity of audio data analysis and the importance of benchmarking different models against one another to get the best performing algorithm.
Machine learning algorithms pre-trained on labeled image-data, like ImageNet, can recognize features on non-image data like audio [95]. To use image models on audio data, the data has to be transformed, usually into a spectrogram image. We were surprised to find that half of the audio studies used pre-trained image models to analyze audio data, and only two studies used pre-trained audio models. This, again, highlights the impact of image models even outside of computer vision. Of interest, Koike et al. [88] aimed to predict heart disease from heart sounds with transfer learning and compared two models: one trained on audio data and another on is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 3, 2021. ; https://doi.org/10.1101/2021.10.01.21264290 doi: medRxiv preprint datasets exist (e.g. AudioSet [96], LibriSpeech [97]), and the results of Koike et. al highlight how models pre-trained on audio might perform better than image models.
With audio data easily obtainable in a clinical setting, there is a great opportunity to develop and improve transfer learning models that can aid clinicians to screen and diagnose patients in a cost-effective way.

Tabular data
Tabular data is probably the most used data type within clinical research. However, we only identified 15 studies using transfer learning on tabular data covering very different fields in medicine: two-thirds of them were from genetics [98][99][100][101][102], pathology [103][104][105], and intensive care [18,106], while the remaining five were from surgery [17], neonatology [107], infectious disease [108], pulmonology [109], and pharmacology [110]. Oncological applications like classification of cancer and prediction of cancer survival were common among the studies in genetics or pathology. As an example for zooming in from a broader disease category to a specific, rarer disease, one study developed a model using gene expression data from a broad variety of cancer types and then fine-tuned the model using a much smaller dataset to predict cancer survival in lung cancer patients [101].
Transformation of tabular data into images was rare compared to time series and audio data. We only identified two studies transforming gene expression data into images [98,101] and then applying computer vision models for a prediction task.
Both of these studies used open access gene expression datasets: AlShibili et al. [98] studied classification of cancer types in a dataset from cBioPortal [111], and López-García et al. [101] predicted cancer survival in a dataset from the UCSC Xena Browser [112]. Both studies reported better accuracy, sensitivity, and specificity . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 3, 2021. ; https://doi.org/10.1101/2021.10.01.21264290 doi: medRxiv preprint when using transfer learning compared to non-transfer learning approaches based on the original tabular data.
We also identified a study reusing a model across species, which highlights another potential of transfer learning. Seddiki et al. [105] reused a model developed on mass spectrometry data from animals to classify human mass spectrometry data. Here, the benefits of transfer learning are clear, as genetic data from animals are often easier to retrieve and share due to the lack of privacy issues. This can potentially provide new opportunities within translational clinical research by reusing scientific knowledge gained from animal studies.
Wardi et al. [108] used both a transfer learning approach and a non-transfer learning approach to predict septic shock in an emergency department setting based on data from electronic health records (EHR) like blood pressure, heart rate, temperature, saturation, and blood sample values. They showed that transfer learning outperformed the traditional machine learning model, especially when only a smaller fraction of data was used. Furthermore, the transfer learning model was externally validated with promising results, which supports its clinical utility. The study by Wardi et al. makes individual-specific predictions possible, which is a prerequisite before an AI-based tool can be implemented in everyday clinical practice to support decision making [1].
Another promising use of transfer learning was described in the study by Gao et al. [99]. The authors developed models for various prediction tasks using genetic data from a heterogeneous population with a focus on how to translate models that perform well on the ethnic majority group to ethnic minority groups. Transfer learning clearly improved performance as compared to other machine learning approaches developed from scratch separately for each ethnic minority group. This application . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 3, 2021. ; https://doi.org/10.1101/2021.10.01.21264290 doi: medRxiv preprint demonstrates how transfer learning can be used to tackle inequalities in health research.

Text
Our main finding regarding text and natural language processing (NLP) was how underutilized transfer learning was in clinical research. We identified only seven studies using transfer learning on text. Applications ranged from risk assessment of psychiatric stressors, diseases, and medication abuse [20,[113][114][115], to prediction of morbidity, mortality [116,117], and adverse incidents from oncological radiation [118]. It is somewhat surprising that we found so few studies, given the massive interest in NLP and the field's rapid development in recent years [119,120]. The relatively slow adoption of transfer learning in clinical research is presumably due to the many technical challenges in medical text analysis (e.g. ambiguous abbreviations, specialty-specific jargon, etc.) [121]. We can indirectly observe this in our review, where all seven articles were written by clinicians and technical authors together or by clinicians alone. This is in contrast with what we observed for time series, tabular data, and audio, where articles were more often written by technical authors only. This is not to say that technical authors are not interested in the use of transfer learning on clinical text, but rather a reflection that transfer learning in text analysis is still a relatively new method. Several articles have been written on transfer learning using radiological reports or EHR, but the studies often focus on purely technical aspects of NLP and are published in journals, which are not indexed in the databases that clinicians are familiar with [119,120]. While transfer learning in NLP has made considerable progress from a technical perspective, it seems that this trend has only just begun to appear in clinical research. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint Another important reason why we found so few articles on text is that there currently only exist a few large, annotated, and publicly available datasets with EHR [121].
Removing patient identifiable data from EHR is one of the key challenges currently limiting the sharing of large medical text datasets, though automatic tools have recently been developed to fasten this task [122]. Indeed, among the seven identified studies, three analyzed public social media posts to predict mental illnesses, two used relatively small institutional EHR datasets (<1000 patients), one used a large UK database with restricted access, and only a single study used a large, curated, and openly available dataset (the MIMIC-III critical care database [123]). A review on the use of all types of NLP on radiological reports found a similar pattern, where most studies used institutional EHR datasets [124], again highlighting the need for more large, annotated, and publicly available datasets.
As a final note on transfer learning in text, we would like to highlight the study by Si et al. [117]. In this paper, the authors proposed a new method to analyze medical records to predict mortality and identify obesity-related comorbidities. In brief, this method took the temporal aspect of each patient's documents into account, when predicting the patient's risk of mortality within the next year. It makes intuitive sense from a clinical perspective that a recent myocardial infarction could be more informative for mortality than one more than ten years ago, and this technical development could be important for future research in clinical text analysis.

Limitations
The identification of clinical studies turned out to be more challenging than we expected, as it is a concept that is difficult to define. We had long discussions on whether to include brain-computer interface studies [125], where transfer learning seems to be a promising method to speed up calibration, with the final aim of helping . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 3, 2021. ; https://doi.org/10.1101/2021.10.01.21264290 doi: medRxiv preprint people with a clinical condition. However, we concluded that the actual prediction models solve engineering problems rather than predict a clinical outcome. Similarly, we excluded NLP studies on biomedical named entity recognition, because even though the datasets might have been medically relevant documents, the task of identifying specific terms are only indirectly relevant from a clinical perspective e.g. by speeding up knowledge synthesis.
The type of the target data might also be ambiguous, depending on how we define raw data. Transcripts of audio recordings could easily be considered as text, but we tried to evaluate what the first dataset was that e.g. a medical device output without further processing. Using this approach, the recordings were considered as a dataset of audio type and transcribing them to text was considered as a data transformation.
Our scoping review did not include clinical studies if they were only published as preprints, or in proceedings of computer science conferences. However, we chose our inclusion criteria this way consciously, so that we could give an overview of the field from the perspective of clinical researchers. Therefore, our review does likely not include the latest technical developments within AI research and transfer learning, but instead include the techniques which have started to impact clinical research.
During data extraction, we planned to find the best proxy variables to answer our research questions, but some of these were challenging. As described previously, we decided not to extract quantitative data on the size of the source and the target datasets, as the unit of observation often differed. It was also difficult to characterize comparisons of transfer learning vs. non-transfer learning solutions because many studies reported many different models and performance metrics. We were curious . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 3, 2021. ; https://doi.org/10.1101/2021.10.01.21264290 doi: medRxiv preprint about the background of the authors (clinical vs. technical), but used only the affiliations as a proxy, as that was easily accessible information in most cases.

Conclusions
Our scoping review is a roadmap to transfer learning for non-image data in clinical research, providing clinical researchers an easily accessible resource to this relatively new technique in machine learning. The interest in transfer learning for non-image data in clinical research began to increase rapidly only recently, lagging a few years behind trends in its use for medical image analysis. Applications are unbalanced between different clinical research areas and data types. Neurology and cardiology seem to be among the 'first movers' with time series data, partly driven by the public availability of EEG and ECG datasets, which also suit machine learning due to their high-resolution nature. We found fewer classical epidemiological studies than expected despite transfer learning can help to overcome some of the big challenges of the field i.e. data collection on a large scale is often difficult and expensive, and data sharing can be hindered by privacy concerns. In the future, some of the largest epidemiological datasets (e.g. UK Biobank [126]) could serve as source data for the development of machine learning algorithms, that a wide range of smaller studies could build on by using transfer learning without access to the actual dataset. This is in line with the FAIR principles [127] supporting the reuse of digital assets in an environment with increased volume and complexity of datasets.
Moreover, the wider use of reproducible research principles and stronger interdisciplinary collaborations between clinical researchers and computer scientists are crucial for the development of clinically relevant prediction models that can be reused with transfer learning across studies, clinical specialties, or even species. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Conflicts of interest
There is no conflict of interest in this project.

Data and code availability
The data extracted from the identified articles and then presented in the Results section is available as an electronic supplement along with the code used for the analysis. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint   is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 3, 2021. ; https://doi.org/10.1101/2021.10.01.21264290 doi: medRxiv preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 3, 2021. ; https://doi.org/10.1101/2021.10.01.21264290 doi: medRxiv preprint Agrusa 2020 Some articles could have been assigned to more medical specialties, but the authors have chosen a primary one with consensus for the sake of simplicity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 3, 2021. ; https://doi.org/10.1101/2021.10.01.21264290 doi: medRxiv preprint Box 1 Glossary (Definitions were taken or adapted from Howard & Gugger [2] and the Machine Learning Glossary by Google [128])

Artificial intelligence (AI)
A non-human program or model that can solve sophisticated tasks.

Machine learning (ML)
The training of programs developed by allowing a computer to learn from its experience, rather than through manually coding individual steps.

Neural network
A neural network is a particular kind of ML model. A model that, taking inspiration from the brain, is composed of layers of neurons. An input layer (first layer) receives the input data. This is followed by hidden layer(s). Hidden neurons (yellow circles) typically take multiple input values and generate one output value, which is calculated by applying a nonlinear transformation (activation function) to a weighted sum of input values. An output layer (final layer) returns the results. Model fitting (training) aims to optimize the weights to get the best model according to a pre-defined performance metric (loss).

Deep learning
A type of machine learning that uses neural networks with multiple hidden layers.

Transfer learning
The use of a pre-trained model for a task different to what it was originally trained for. For example, to take a model that can recognize everyday objects and further train it on fundus photographs to grade diabetic retinopathy, instead of training a model from scratch on fundus photographs only.

Fine-tuning
A type of transfer learning that updates the parameters of a pre-trained model by training for a different task than it was originally trained for.

Feature-representation transfer
A transfer learning technique that passes input data through a pre-trained model and extracts featurerepresentations (values from hidden layers), which then become inputs for another model for a different task than the pre-trained model was trained for. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 3, 2021. ; https://doi.org/10.1101/2021.10.01.21264290 doi: medRxiv preprint