Harmonizing Healthy Cohorts to Support Multicenter Studies on Migraine Classification using Brain MRI Data

Multicenter and multi-scanner imaging studies might be needed to provide sample sizes large enough for developing accurate predictive models. However, multicenter studies, which likely include confounding factors due to subtle differences in research participant characteristics, MRI scanners, and imaging acquisition protocols, might not yield generalizable machine learning models, that is, models developed using one dataset may not be applicable to a different dataset. The generalizability of classification models is key for multi-scanner and multicenter studies, and for providing reproducible results. This study developed a data harmonization strategy to identify healthy controls with similar (homogenous) characteristics from multicenter studies to validate the generalization of machine-learning techniques for classifying individual migraine patients and healthy controls using brain MRI data. The Maximum Mean Discrepancy (MMD) was used to compare the two datasets represented in Geodesic Flow Kernel (GFK) space, capturing the data variabilities for identifying a “healthy core”. A set of homogeneous healthy controls can assist in overcoming some of the unwanted heterogeneity and allow for the development of classification models that have high accuracy when applied to new datasets. Extensive experimental results show the utilization of a healthy core. One dataset consists of 120 individuals (66 with migraine and 54 healthy controls) and another dataset consists of 76 (34 with migraine and 42 healthy controls) individuals. A homogeneous dataset derived from a cohort of healthy controls improves the performance of classification models by about 25% accuracy improvements for both episodic and chronic migraineurs.


Introduction
Neuroimaging studies have shown structural and functional alterations in brain regions involved in sensorydiscriminative, affective, and cognitive pain processing, pain modulation, and multisensory integration in patients with migraine [Eck, J. et al. 2011, Lo Buono et al. 2017, Meylakh, N. et al. 2018, Mu, J. et al. 2020, Wei H. et al. 2020, Scheepens, D. S. 2020, Ashina, S. 2021. Machine learning models trained to automatically classify individuals with migraine from healthy controls (HCs) using functional magnetic resonance imaging (fMRI) and structural MRI data have shown good utility for discriminating those with migraine from HCs. Yet, the generalization of the algorithms might be poor; that is, the classification model developed using one dataset may not perform as accurately on a separate dataset. The generalizability of classification models is key for multi-scanner and multicenter studies, and for providing reproducible results. Harmonization in clinical research has been focusing on defining, reviewing, and standardizing common data elements [Firnkorn, D et al. 2015, Yu, M. et al. (2018, Ma, D. et al. (2019), Chen, A. A. et al. (2022)]. This is known as data curation -a strategy to combine the data from different sources to improve the generalizability of classification models. To be specific, this harmonization strategy is the explicit removal of the site or cohort-related effects in data from multiple sources. This harmonization strategy is desirable for improving the generalizability of classification models. Yet, from a data-driven aspect, the data from different sources may inherently exhibit different characteristics due to several reasons, such as differences in patient populations from which research participants are identified and the methods by which data are acquired. In machine learning research, extensive efforts are dedicated to transforming the data from different sources into a common feature space for evaluating similarities from multiple sources or adapting them thoroughly. This is also known as Domain Adaptation, a well-studied field [Tzeng, et al. 2017, Csurka et al. 2017, Tang, et al. 2020].
Domain adaptation leverages labeled data in one or more related source domains, to build a machine learning model for unseen or unlabeled data in a target domain. These adaptation methods aim to discover a shared feature representation for minimizing distribution differences while preserving the important . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 28, 2023. ;https://doi.org/10.1101https://doi.org/10. /2023 properties of original datasets. There are several types of domain adaptation approaches, such as reweighting, parameter adaptation, feature augmentation, and feature transformation [Yao, Y et al. 2010, Gong, B 2012, Matasci, G 2015, Tzeng, et al. 2017, Csurka et al. 2017, Tang, et al. 2020. After adaptation, a standard machine learning approach is used that assumes the test data are drawn from a similar distribution as the training data. As noted, most domain adaptation methods are designed to transform the data (both HCs and patients) together into a common space. Careful consideration of this approach reveals a potential concern, that is, it ignores the fact that there may exist heterogeneity even within the HC cohorts. As a result, transforming and aligning the whole dataset may not work as well as expected. In this research, we propose a new concept termed "healthy core": that transforms and aligns imaging data from HCs using their brain MRI data. Specifically, this study aims to 1) identify HCs from different sources to construct a healthy core; 2) develop and test classification models based on medical imaging that utilizes the healthy core in model training, validation, and testing across the datasets. We hypothesized that, compared to classification models developed without the use of a healthy core, imaging-based classification models for migraine developed with the use of a healthy core would have higher classification accuracy when applied to an unseen dataset.

Approvals and Consent
All studies were approved by the Mayo Clinic and Washington University School of Medicine in St. Louis Institutional Review Boards. All participants provided written consent prior to study participation.
Individuals with migraine were identified from the headache clinics at both institutions. HCs and individuals with migraine were enrolled from a database of research volunteers, and via community outreach.

Study Participants
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 28, 2023. ;https://doi.org/10.1101https://doi.org/10. /2023 Diagnoses of episodic migraine (EM) and chronic migraine (CM) were assigned using the most recent version of the International Classification of Headache Disorders available at the time of enrollment [ICHD-3 beta 2013]. Migraine diagnoses were assigned by a headache specialist (TS). HCs were excluded if they had a history of migraine, but occasional tension-type headaches (< 3 tension-type headaches per month) were allowed. Participants were male or female adults between the ages of 18-65. Those with contraindications to MRI, acute pain conditions other than migraine, neurological disorders other than migraine, women who were pregnant, and individuals with abnormal brain MRI findings according to usual clinical interpretation were excluded. Participants were recruited for the study between 2012 and 2021.

Data Collection
Dataset . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 28, 2023. T1 images were used for ruling out gross structural abnormalities.

Evaluation of Dataset differences between DS1 and DS2
For both datasets, measurements of cortical thickness, cortical surface area, and cortical volume for 68 regions were derived resulting in 204 measures per subject. Image features of HCs on cortical thickness, cortical surface area, and volumes in DS1 and DS2 were compared separately to identify the potential cohort difference. The comparison was performed for the means of features, the variances of features, and feature correlation structures. The cohort-wise means/variances of each feature were compared using a two-sample . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 28, 2023. ; https://doi.org/10.1101/2023.06.26.23291909 doi: medRxiv preprint t-test/F-test. False Discovery Rate (FDR) was used to correct for multiple comparisons. The cohort-wise correlation matrices for the volume, area, and thickness features were compared using the two-sample Hotelling's T2 (the multivariate extension of the common two-group Student's t-test) and Multivariate Analysis of Covariance. In order to avoid the confounding of multiple cohort differences, the mean and variance differences were removed from each cohort to identify the remaining differences.

2.5.
Constructing the Healthy Core Differences in imaging acquisition protocols, scanners, and even small differences in populations from which research participants are enrolled can lead to cohort effects. Therefore, simply combining all HCs from different studies/sites may not be optimal due to cohort effects. In order to address this issue, here we propose a new design strategy to discover the homogenous HCs from different datasets as referenceable HCs (see Step 2 in Figure 1). Our proposed approach contains two key concepts: 1) measure the similarity between HCs from different datasets, and 2) explore feature representations of the dataset to assess the similarities. First, the proposed method utilizes the Maximum Mean Discrepancy (MMD) [Gretton, A et al. 2012, Cui, P et al. 2020] which is a kernel-based statistic used to determine whether two distributions are similar. MMD can be considered as a criterion to determine whether subset samples from multiple datasets are homogenous [Gretton, A et al. 2012, Long, et al. 2015, Yan, et al. 2017, Epstein, et al. 2019, Zhang, et al. 2020, Kirchler, et al 2020. However, the MMD is a point estimate-based statistic which may be too strict since the original datasets might inherently exhibit different characteristics. To capture the inherent variabilities, we hypothesize that Geodesic Flow Kernel (GFK), which utilizes geometric manifolds termed Grassmann, may help. GFK has been widely used in the domain adaptation research field because it does not focus only on the feature representations of the two datasets (snapshot of each dataset), it constructs the transformation over the Grassmann space of the two datasets with a scale parameter. As a result, it has the potential to identify the subset samples from each dataset, for example, using MMD, as the homogenous core.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 28, 2023. ; https://doi.org/10.1101/2023.06.26.23291909 doi: medRxiv preprint We hypothesize that using a healthy core, a homogenous subset of HCs from different sources, can help improve the generalization of the machine learning model for classifying migraine since it addresses the heterogeneity of the HCs within and across the datasets; and sets the bridges between the datasets. To comprehensively assess the clinical utility of the proposed method, we designed three sets of experiments: (1) evaluating the performance of migraine classification models within a single dataset; (2) evaluating the performance of migraine classification models developed using one dataset and tested on a second dataset; and (3) evaluating the performance of migraine classification models with and without using healthy core samples for classifying individuals with CM and those with EM from a different dataset. In the training of the classification models, cross-validation has been applied. To get reliable results for the relatively small sample sizes, experiments are repeated under five different random scenarios (cross-validated) to build classifiers and report average performances. Since our last experiment focuses on migraine only, we report the Area Under the Curve (AUC) for the first two experiments and the accuracy of identifying migraine in the last experiment.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 28, 2023. Unknown 0 1 0 0

Dataset (Cohort) Differences
First, the mean of area, thickness, and volume features are significantly different between the HCs in the two datasets with a p-value<0.01 before the False Discovery Rate (FDR) correction. Further, three imaging . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 28, 2023. ; https://doi.org/10.1101/2023.06.26.23291909 doi: medRxiv preprint features among those 41 significant features were identified as significantly different between two cohorts by the two-sample t-test and Kolmogorov-Smirnov test after FDR correction (Table 3).  Figure 2 illustrates the significant mean difference in each imaging feature between the two cohorts. This result confirms that there are mean differences between the two cohorts for each modality. Second, the variances of the features were tested. Table 4 shows that the variance for each feature is not significantly different between two cohorts under both tests after FDR correction. Furthermore, we investigated the possibility that significant group differences in sex distribution (p-value=0.01) contributed to the cohort difference.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 28, 2023.
We randomly selected the sex-balanced sub samples from both cohorts. After making the balanced groups, the mean and variance of features (volume, area, and thickness) were tested using the t-test and KS test for mean, F-test and Ansari-Bradley test for variance whether the dataset difference exists. Since we selected a random subset to test, this test procedure was repeated 10,000 times to avoid a sampling-dependent conclusion. It was concluded that there is a significant mean difference and a marginal variance difference in each feature (volume, area, and thickness), but sex distribution was not a significant factor in the dataset difference.    . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Experiment I: evaluating the performance of different classification models within a single dataset
The goal is to demonstrate the accuracy of machine learning models in classifying migraine within a single dataset as shown in table 6. In this experiment, we built four binary classifiers (regularized logistic regression, support vector machine (SVM), random forest, and XGBoost) to identify HC vs. CM, or HC vs EM. For each dataset (DS1 or DS2), we split the data into training (~60%), validation (~20%), and testing (~20%). is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 28, 2023.   Table 7B. We conclude that the machine learning models provide good performance during training, but that the performance deteriorates when using test data. This deterioration in accuracy is expected and the model performance may still be considered satisfactory.

Experiment II: evaluating the performance of the classification models developed on one dataset and tested on a second dataset
The goal is to assess the generalizability of machine learning models across datasets as shown in Table 8.
Like Experiment I, here we tested on HC vs. EM, or HC vs. CM. Since one dataset can be fully utilized in training and validating with testing to be conducted in the second dataset, we split the data into training (~80%) and validation (~20%) for DS1, or DS2 respectively. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 28, 2023.  is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 28, 2023. ;https://doi.org/10.1101https://doi.org/10. /2023 As seen in Table 9A for HC vs CM, all four trained classifiers did not perform well in this cross-dataset experiment. The average AUC on models trained on DS1 dropped from 0.7533 (testing on DS1) to 0.5245 (testing on DS2), and the average AUC on models trained on DS2 dropped from 0.6465 (testing on DS2) to 0.6230 (testing on DS1). We can also observe similar patterns for HC vs EM in the cross-dataset experiment as seen in Table 9B. The average AUC on models trained on DS1 dropped from 0.6965 (testing on DS1) to 0.5781 (testing on DS2), and the average AUC on models trained on DS2 dropped from 0.6135 (testing on DS2) to 0.5963 (testing on DS1). This result demonstrates the existence of dataset discrepancies that need to be cautiously handled when building classification models.

Experiment III: evaluating the performance of classification models with and without using Healthy Core
The goal is to assess the applicability of machine learning models developed in one dataset including participants with migraine and the healthy core to classify individuals with migraine from a separate dataset.
The overall sample numbers in Cross-dataset Experiments with Healthy Core are shown in Table 10. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 28, 2023. ; https://doi.org/10.1101/2023.06.26.23291909 doi: medRxiv preprint 0.4903 -about 25 out of 51). Unlike typical approaches, the proposed method using the healthy core samples to build the same four classifiers achieved significantly better classification accuracy (see Table   11). These results show that the use of a subset of the HCs from different datasets has the potential to support the generalization of classification models.

Discussions on Healthy Core
In Section 3.1, we conducted statistical analysis to confirm the dataset discrepancy between DS1 and DS2.
For illustration, here we adopted t-Distributed Stochastic Neighbor Embedding (t-SNE), a technique for dimensionality reduction (e.g., based on principal component analysis) that is particularly well-suited for the visualization of high-dimensional datasets to demonstrate the impact of a healthy core. For details of t-SNE, interested readers are referred to [Van der Maaten 2008].
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 28, 2023. ; https://doi.org/10.1101/2023.06.26.23291909 doi: medRxiv preprint As seen in Figure 3A and Figure 3C, combined HCs do not have separating (thus predictive) pattern compared to CM from either DS1 or DS2. On the other hand, the selected HCs cores have representative similar patterns, and healthy core (see Figure 3B and Figure 3D) can be well clustered in the t-SNE plot.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Discussion and Conclusion
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 28, 2023. ;https://doi.org/10.1101https://doi.org/10. /2023 The main finding of this study is that utilization of a healthy core, i.e. a homogeneous dataset derived from a cohort of healthy controls, improves the performance of classification models for EM and CM. The use of a healthy core helps to overcome the deterioration in model performance that is otherwise seen when applying the migraine classification model to independent datasets collected from patients enrolled at different institutions and imaged on different scanners. Depending on the modeling approach, the inclusion of a healthy core resulted in classification accuracy as high as 88% for EM and CM.
The use of independent training and testing datasets is a substantial strength of the studies reported herein.
Many prior publications (including some of our own) reported the accuracy of classification models for migraine train and test the model using the same dataset, with strategies such as the "leave-one-out" methodology [Schwedt, T.J. et al. 2015, Lee M.J. et al., 2019, Zhang, Q. et al., 2018, Yang, H. et al., 2018, Chong, C.D., 2021. Although this type of approach is valid and well-accepted, it likely leads to overestimating the performance of the model if it was used in a completely new and previously unseen dataset. Because independent training and testing datasets were used in the experiments described herein, the reported classification accuracies should be considered as more conservative estimates of model performance.
Useful classification models need to have high performance when tested on data that are collected from new patient populations and using collection techniques that might differ slightly from those used to collect the data that were included for model training (i.e. cross dataset accuracy). For example, a brain MRI-based classification model should still have high performance when tested on data from patients imaged at a different medical center and using different MR scanners. The experiments reported in this manuscript purposefully introduce this type of heterogeneity, with participants being enrolled from two different medical centers in two different regions of the United States. Results demonstrate the expected reduction in accuracy of the migraine classification models when they have been tested on independent datasets that include heterogeneity typical of multicenter imaging experiments. The experiments without the healthy core that are reported herein demonstrate this: EM and CM classification accuracy was higher in the single . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 28, 2023. ; https://doi.org/10. 1101/2023 dataset experiments (CM accuracy 65%-75%, EM accuracy 61%-70%) compared to the cross-dataset experiments (CM accuracy 57%-59%, EM accuracy 58%-60%). The introduction of a healthy core helped to overcome this deteriorating model performance and provided much higher classification accuracies for EM and CM.
When performing research that includes "healthy controls" there is an assumption that the healthy control cohort is homogeneous. However, even when stringent eligibility criteria are applied in an attempt to make the healthy control group as healthy as possible, there will always be identifiable and unidentifiable heterogeneity within the healthy control group due to differences in demographics, prior life experiences, physical and mental well-being, underlying genetic heterogeneity, the existence of diseases that have not yet manifest, and other sources of heterogeneity. When establishing "normal values" for certain types of tests, such as blood test results, for example, the use of very large sample sizes can mostly overcome this issue of heterogeneity within the healthy control cohort. However, available sample sizes are smaller for establishing normal values for brain MRI, a diagnostic test that is time and cost-intensive. The "healthy core" method described in this manuscript can help overcome this challenge. For example, with the use of the healthy core and XGBoost, the cross dataset classification accuracy improved to 84%-87% for CM and to 79%-83% for EM.
When using the healthy core, the classification accuracies for migraine are comparable, if not somewhat higher, than those reported previously in the literature. This is so even though independent training and testing sets were used. For example, prior studies using brain imaging structural data reported classification accuracies of 67%-86% for differentiating migraine from healthy controls [Schwedt, T.J. et al. 2015], while those using functional MRI data reported accuracies of 73%-86% [Chong, C. D. et al. 2017;Yang, H et al. 2018;Lee, M. J. et al. 2019]. Classification models including multimodality imaging data, combining structural and functional measures, have provided accuracies of 83%-84 [Zhang, Q. et al. 2016;Gaw, N. et al. 2018]. Classification accuracies for EM were generally lower than for CM in the experiments reported herein. This is consistent with expectations since CM is the more severe disease state, prior brain imaging . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 28, 2023. ; studies have demonstrated that headache frequency has a positive correlation with the amplitude of structural and functional brain changes, and prior classification models have had similar findings [Valfrè, W. et al. 2008;Schmitz, N. et al. 2008;Maleki, N. at al. 2012;Schwedt, T.J. et al. 2015;Gaw, N. et al. 2018;Magon, S. et al. 2019;Lai, K. L. et al. 2020;Mu, J. et al. 2020].
Study Limitations: We assumed heterogeneity amongst the healthy control cohort, and this assumption served as justification for developing a healthy core. However, a limitation of this study is that we are only partially able to describe the heterogeneity using the non-imaging data that were collected, such as the medical center from which a participant was enrolled, their sex, and age. Future studies should more deeply phenotype the healthy control subjects in search of sources of heterogeneity that might associate with differences in brain structure. Also, it should be noted that imaging data used in the analyses reported herein were also used in prior research on classification models for migraine [Schwedt, T.J. et al. 2015;Gaw, N. et al. 2018]. Future studies will continue to validate the migraine classification models reported herein and test the value of using a healthy core when developing new classification models.
In conclusion, the utilization of a healthy core can increase the accuracy and generalizability of brain imaging-based classification models for EM and CM. Inclusion of a healthy core addresses intrinsic heterogeneity that exists within a healthy control cohort and in multicenter studies, even when stringent participant eligibility criteria and methodology are used. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 28, 2023. ;https://doi.org/10.1101https://doi.org/10. /2023