Towards 3D Deep Learning for neuropsychiatry: predicting Autism diagnosis using an interpretable Deep Learning pipeline applied to minimally processed structural MRI data

By capitalizing on the power of multivariate analyses of large datasets, predictive modeling approaches are enabling progress toward robust and reproducible brain-based markers of neuropsychiatric conditions. While Deep Learning offers a particularly promising avenue to further advance progress, there are challenges related to implementation in 3D (best for MRI) and interpretability. Here, we address these challenges and describe an interpretable predictive pipeline for inferring Autism diagnosis using 3D Deep Learning applied to minimally processed structural MRI scans. We trained 3D Deep Learning models to predict Autism diagnosis using the openly available ABIDE I and II datasets (n = 1329, split into training, validation, and test sets). Importantly, we did not perform transformation to template space, to reduce bias and maximize sensitivity to structural alterations associated with Autism. Our models attained predictive accuracies equivalent to those of previous Machine Learning studies, while side-stepping the time- and resource-demanding requirement to first normalize data to a template, thus minimizing the time required to generate predictions. Further, our interpretation step, which identified brain regions that contributed most to accurate inference, revealed regional Autism-related alterations that were highly consistent with the literature, such as in a left-lateralized network of regions supporting language processing. We have openly shared our code and models to enable further progress towards remaining challenges, such as the clinical heterogeneity of Autism, and to enable the extension of our method to other neuropsychiatric conditions.

One challenge in the search for biomarkers and in the development of predictive models is the attainment of sample sizes that afford adequate statistical power. This challenge is exacerbated by clinical heterogeneity [30]. Multi-site collaborative studies yielding well-powered samples, such as ABIDE I and II [31][32], have gone some way to addressing this challenge, and analyses of these samples suggest a distributed pattern of Autism-related structural alterations [16-18, 20, 21, 24, 33-34]. The application of multivariate approaches, such as Machine and Deep Learning, offer another promising avenue for the search for brain-based biomarkers and the construction of predictive models.
These methods enable the simultaneous exploration of a very large set of features, offering much more powerful analytical capacity than univariate approaches. To date, such . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 20, 2022. ; https://doi.org/10.1101/2022.10.18.22281196 doi: medRxiv preprint approaches have had moderate success, with recently reported prediction accuracies (for Autism diagnosis) in the range of 65-70% for models built using both functional and structural MRI data [35][36][37][38]. In an effort to boost accuracy through competition, Traut et al. [39] held an international challenge in which competing teams predicted Autism diagnosis using a large multisite dataset comprising preprocessed anatomical and functional MRI data from > 2,000 individuals. Of the 589 models submitted, the 10 best were combined and evaluated using a subset of unseen data (from one of the sites included in the main dataset), as well as data from an additional, independent acquisition site. The blended model achieved an ROC AUC of~0.66 using features extracted from anatomical data only. One observation from this effort was the fact that prediction accuracy increased with increasing sample size. Another was that while prediction accuracy for the subset of unseen data was similar to validation accuracy, accuracy for the novel site was poorer, illustrating the challenge of generalization, particularly to new data collection sites.
Although recent gains in prediction accuracy are promising, Machine Learning studies conducted to date have two main limitations. The first is that preprocessing pipelines often have many steps, each of which can introduce biases to prediction models. In particular, preprocessing typically includes transformation to a template space, such as MNI152, which was created using anatomical scans acquired from neurotypical adults. Template normalization may therefore negatively impact the ability to detect Autism-related alterations in brain structure, introduce biases, and lead to poorer reproducibility [30]. A second limitation is that datasets used for prediction tend to be clinically heterogeneous, but this heterogeneity is not explicitly accounted for in the models, leading to inconsistent results between separate datasets [40]. Many Autistic participants have a secondary diagnosis, which is often another psychological condition such as ADHD or anxiety, or a neurological condition such as epilepsy or Fragile X syndrome [17,24,26]. Ignoring these comorbidities may introduce biases or lead to non-specific biomarkers [17], since in such analyses, the label "autism" is not well delimited.
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 20, 2022. ; https://doi.org/10.1101/2022.10. 18.22281196 doi: medRxiv preprint In the current study, we sought to develop a prediction pipeline that could overcome these challenges. To do this, we trained 3-Dimensional Deep Learning models to predict Autism diagnosis from minimally preprocessed structural MRI data, to avoid biases introduced by template normalization. To address the influence of clinical heterogeneity, we built our models using a large sample of 1329 patients (521 with autism) without comorbidities, following the classical framework of train-validate-test. To test if the patterns identified by the best models were robust to comorbidity, we tested the three best models on a second dataset comprising 270 patients (155 with autism) with comorbid diagnoses.
Deep Learning models can extract meaningful implicit features during the optimization process, which minimizes the preprocessing required and ultimately reduces prediction time.
While 2D Deep Learning models are increasingly popular, 3D Deep Learning is not widely used in Medical Imaging applications, in part because of the large number of parameters to optimize (greater than in 2D) and concerns related to interpretability. To address the challenge of extracting information about predictive features (i.e., interpretability), we leveraged recently developed methods to build an interpretation pipeline that identifies predictive brain areas while avoiding the requirement for template normalization.
In this paper, we described our novel pipeline for interpretable 3D Deep Learning prediction of Autism diagnosis from structural MRI data. In our proof-of-concept analyses, our models achieved the same prediction accuracy as is typical for Machine Learning models, while avoiding the potential biases introduced by template normalization. Our interpretation pipeline identified a set of regions that replicated well across datasets (including participants with comorbidities), and models, and which converged with previous structural imaging studies on Autism. To facilitate further development of our pipeline, we have openly shared all our code through GitHub (https://github.com/garciaml/Autism-3D-CNN-brain-sMRI). is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Data and Quality Control
We used T1-weighted structural MRI data from the ABIDE I (980 scans) and II (857 scans) datasets[31-32] and 140 scans from ADHD200 [41]. We performed quality control using BrainQCNet [42], retaining scans with a probability score below 60% as advised in [42]; 797 scans from ABIDE I, 704 from ABIDE II and 98 from ADHD200 remained after this step.
Our primary analysis focused on participants with a diagnosis of Autism but no reported comorbidity and comparison participants with no psychiatric diagnosis. Excluding participants with comorbidities resulted in a dataset of 1329 participants which were used for training, validating and testing the models.
All participants in the testing set (n = 65, 26 with Autism) were obtained from different (independent) data collection sites than participants in the training (n = 1074, 421 with Autism) and validation (n = 190, 74 with Autism) sets.
To examine the impact of comorbidities on prediction accuracy, we created a second evaluation set of participants who had at least other diagnosis in addition to Autism, such as ADHD, phobias, depression, and anxiety. This dataset (testing set 2) contained scans from 270 participants (155 with Autism diagnosis).
Further details on the datasets are provided in S1 -Detailed Data Description.
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Preprocessing
We employed a minimal preprocessing pipeline that did not apply transformation to template space, to avoid any impact of brain normalization on the detecting of Autism-related alterations in brain structure. Instead, we applied FSL's Brain Extraction Tool (BET; https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/BET) to remove non-brain tissue, followed by a number of minor non-deforming transformations, to prepare our data to be processed by the Deep Learning algorithm: • • Cropping or Padding: We cropped or padded each volume to obtain a uniform shape for all the volumes of 256*256*256. This shape was sufficiently large to fit the full brains and was also appropriate as an input shape to our deep learning models, in view of the filters applied all along each network (described in detail below). is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Classification Models in 3D
Comparing different types of algorithm enables the detection of overfitting and retention of the best type of algorithm for the given problem [43]. We compared two models: (1) DenseNet121 [44] and (2) Med3D-ResNet50 [45], well known Deep Learning algorithms with good 2D performance [44][45]. DenseNet121 is more compact and has fewer parameters than ResNet50 making it possible to train on 3D data, while Med3D-ResNet50 [45] is a version of ResNet50 that has been pre-trained on medical images, including brain sMRI scans. Logically, pre-trained models enable better convergence and performance on new data and tasks of the same context. We fine-tuned Med3d-ResNet50 to adapt it to our task by training the last convolutional layers (corresponding to the 4th convolutional block). We also appended the last classifier block, consisting of a global average pooling layer and a fully connected layer (see S2 -Model architectures) Like in [44] and in [45], we used the ReLU function as the activation function, the cross entropy loss, and the Adam optimizer with a fixed learning rate of 0.001.

Guided-Grad-CAM
In order to interpret and evaluate the reliability and relevance of our 3D Deep Learning models, we used Guided Grad-CAM [46], which combines guided backpropagation [47] and Grad-CAM [46]. This represents a good trade-off between the precision offered by feature maps produced by interpretability algorithms and the processing time required.
Mathematically, guided Grad-CAM [46] is an element-wise product of the results of the two algorithms. It returns a high resolution map of the fine-grained features that is also class-discriminative.
In the context of our study, for a given trained CNN model (either DenseNet121 or Med3DNet-ResNet50), we used guided Grad-CAM to generate one "attention map" for each is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint to 1. We used this mask to identify the brain regions that are the most important for the prediction of Autism across the sample and across algorithms.

HighRes3DNet
As noted above, a key feature of our preprocessing pipeline was our avoidance of normalization to a group template. This creates a significant challenge for the identification of the brain areas that were most predictive of diagnosis across participants. We solved this challenge by segmenting individual scans into anatomical units and combining this information with the mask created in the preceding step.
HighRes3DNet [48] is a Deep Learning algorithm that segments brain MRI scans following the GIF brain parcellation (V3, http://niftyweb.cs.ucl.ac.uk/program.php?p=GIF ; [49]). The GIF algorithm was especially built to be robust to brain morphological differences, especially those encountered in populations with atypical brain development like Autism [49].
We segmented each participant's brain with the HighRes3dNet algorithm (first homogenizing scans to voxel size 1mm*1mm*1mm using linear Interpolation). The resulting segmented images were resampled to 256*256*256 images of voxel size 1.5mm*1.5mm*1.5mm to match the resolution of the attention maps obtained from the guided Grad-CAM algorithm, while retaining the segmented voxel values. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint Thus, if we take the coordinates of a voxel in the mask obtained from guided ', ', ' Grad-CAM, we can obtain the corresponding voxel coordinates in the segmented , , image, and thus get the voxel value and the name of the area at ).

( , ,
Applying this procedure for every scan, we obtained a table containing, for every area of the HighRes3DNet atlas, a relative frequency corresponding to the number of voxels in the area with value = 1, divided by the total number of voxels in this area in the segmented image. This relative frequency corresponds to the proportion of the area that is considered important for the prediction by a CNN model, for that participant. These proportions were then used to compare different brain areas and to draw up a ranking of brain areas for each model, dataset (training, validation, testing sets), and type of prediction (True Positives, True Negatives, False Positives, False Negatives), to improve interpretability for our CNN models.

Machine and Code availability
We trained our model on a GPU Nvidia RTX 3090 (24 GB memory) with a batch size of 2. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint We openly shared the code of this project on GitHub, in the repository: https://github.com/garciaml/Autism-3D-CNN-brain-sMRI. The models are also shared so that they can be reused as pre-trained models for similar applications. For ResNet50, the best validation set accuracy was 62.6%, achieved at 42 epochs. For DenseNet121, 66.3% accuracy was achieved at 32 epochs and 67.4% was achieved at 70 epochs. Next, we compared the performance of these three best models (one ResNet50 model and two DenseNet121 models) for the prediction of diagnosis in the training, validation, and testing sets.

Prediction Performance: Autism diagnosis
For the prediction of Autism diagnosis, the three best models behaved differently, as shown by the Receiver Operating Characteristic curves in Fig 1. Med3d-ResNet50-42ep overfitted the data -the accuracy and ROC AUC scores were very high on the training set (94.2% and 99.9% respectively) but much lower on the validation (acc = 62.6% and AUC = 62.1%) and testing sets (acc = 53.8% and AUC=57.3%). DenseNet121-32ep appeared to be more stable is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint and validation (acc = 67.4% and AUC = 68.1%) sets than DenseNet121-32ep, but poorer performance on the testing set (acc = 40% and AUC = 38.1%). Table 1 displays the sensitivity and specificity of each model for each dataset.
DenseNet121-32ep exhibited high specificity on the training and validation sets, but low sensitivity. Paradoxically, it had high sensitivity but low specificity on the testing set.
DenseNet121-70ep behaved similarly on the testing set while on the training and validation sets, sensitivity and specificity were balanced and fairly high. Finally, for Med3d-ResNet50-42ep, sensitivity and specificity were very high on the training set, unbalanced on the validation set with low sensitivity and very high specificity, and balanced on the testing set, but with moderate values.
Sensitivity on the second testing set, which included participants with comorbidities, was low for all models. This demonstrates that when the training and testing sets include only participants without known comorbidities, predicting Autism diagnosis for participants with comorbidities is particularly challenging. Here, we found that this produces a large increase in False Negatives in particular. One potential explanation is that neuroimaging markers of Autism are less salient when individuals have another diagnosis involving similar or other neuroimaging markers. Another explanation is that more data is needed to adequately train DL algorithms on the whole spectrum of Autism in the context of comorbidities.
Further details and comments on the performance of the models are given in S3 -Performance of the models, and a comparison of the predicted scores with the scores of diagnosis are given in S4 -Analysis of ADI-R and ADOS scores, age, gender and full IQ. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Interpretability: True Positive discriminative ROIs
We segmented each participant's scan using HighRes3DNet (GIF parcellation), to extract a measure of "prediction importance" (the output of the guided Grad-CAM algorithm) for each of the three best models. We identified the regions that best contributed to True Positives is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint For every pair of model and dataset, we defined the "most predictive" regions as those with relative frequency values (see Section 3.6, above) greater than the 90% percentile. This is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint Looking at these data another way, and taking the regions that were most predictive across datasets and which replicated across the three models, we again obtained left hemisphere regions that are located in the frontal lobe -middle and inferior frontal gyrus (pars triangularis) and medial precentral gyrus -and in the limbic system and its associated structures -anterior cingulate gyrus, subgenual cingulate gyrus, parahippocampal gyrus (Fig   2b).

Effect of gender
Regions important for predicting True Positives for boys were different from those for girls.
Regions common to both genders were located in the left parietal lobe: parietal operculum, supramarginal gyrus, and superior parietal lobule (Fig 2c, d) is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint  S8 Tables 10 and 11.

Relationship with age
Autism has been associated with disrupted brain development across the lifespan. To assess whether there were any developmental trends in the most predictive areas, we created four age categories (5-10yrs, 10-15yrs, 15-20yrs, >20yrs) and identified the most predictive (True Positives) regions for each category, separately for boys and girls. S 9 -True Positives by Gender and Age shows these results.
Our results showed that the most discriminative regions varied with the age. In particular, left precentral gyrus, central operculum, and posterior orbital gyrus replicably predicted True Positives in boys aged 5-10yrs, while left inferior frontal gyrus (pars triangularis), subcallosal/subgenual cingulate cortex, and supramarginal gyrus, were most predictive for boys aged 10-15 years old.
In addition, we found that the replicability of each region decreased as age increased.
Indeed, we found that the left and inferior frontal (pars triangularis) gyrus, posterior orbital gyrus and putamen were most predictive for 15-20 years old, but only for participants without comorbidities. Left temporal areas -parahippocampal gyrus, superior temporal gyrus and temporal pole -were most predictive for males aged 20-64yrs without comorbidity.
Examining global prediction performance for these different age groups reveals other interesting trends, such as a decrease in the number of False Negatives and True Negatives with increasing age, for both boys and girls. This suggests that our prediction of Autism diagnosis tended to be more sensitive but less specific as age increased. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

True Negatives
We adopted the same approach described above to identify regions most predictive of True Negatives (i.e., absence of an Autism diagnosis). The results (see

Bad predictions -False Positives and False Negatives
We adopted the same approach described above to identify regions most predictive of False Positives (i.e., incorrectly predicted Autism diagnosis) and False Negatives (i.e., incorrectly failed to predict Autism diagnosis). is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Does image background contribute to model predictions?
As a final test, we examined whether image background (i.e., information outside the brain) contributed to predictions. For Med3d-ResNet50-42ep the relative frequency of the Background (RF) is the smallest (RF=0.97%) and the second smallest for DenseNet121-70ep (RF=0.28%), meaning that this area is not considered predictive for the models. For DenseNet121-32ep, it is among the last 4% informative areas of the model (RF=0.74%). These results confirm that the models use information from inside rather than outside the brain to make a prediction, supporting their validity.

Multi-site effect
We observed an inhomogeneous consistency of the distributions of probability scores between the different sites (see S10 -Multi-site effect). We displayed the accuracy scores for every site in the whole dataset (training+validation+testing sets) in S10 Table 20, and it also confirmed the multi-site effect.

Discussion:
This study outlines and demonstrates a novel approach for inferring Autism diagnosis from structural brain imaging data using 3D Deep Learning algorithms. To maximize the interpretability of the model outputs, we also used a second type of algorithm -guided is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 20, 2022. ; https://doi.org/10.1101/2022.10.18.22281196 doi: medRxiv preprint analysis, the brain structural features of which were most important for accurate inference of Autism diagnosis (i.e., True Positives), are highly consistent with the literature. Our predictive modeling framework has considerable potential to be extended to further datasets to identify and refine sensitive and specific brain biomarkers of Autism using MRI data.

3D Deep Learning applied to minimally processed data
To our knowledge, this is the first time that 3D-DL CNNs have been used to predict Autism diagnosis from 3D structural MRI scans. Our findings show that these algorithms are capable of inferring Autism diagnosis on the basis of structural MRIs with at least the same level of accuracy as traditional Machine Learning algorithms, while requiring a smaller number of training epochs. The average accuracy score (64.1%) and ROC AUC score (0.67) obtained for participants without comorbidities is consistent with previous Machine Learning models trained on sMRI data (e.g., [39]). The comparable accuracy we achieved should be viewed in the context of the speed of inference of Deep Learning models over Machine Learning approaches. While Machine Learning algorithms require inputs derived following extensive preprocessing of structural MRI data, including normalization to template space, our Deep Learning models used minimally preprocessed data. In particular, we avoided transformation to template space, a near-universal requirement of neuroimaging analyses that may negatively impact the ability to detect structural alterations associated with the diagnosis of interest. Although our pipeline included some minimal preprocessing steps to address the fact that a diversity of scanners and acquisition protocols was used across data collection sites, resulting in heterogeneous voxel spacing and signal intensities. Resolution homogenization and intensity normalization were applied to address these variations, and it is possible that these steps could bias the algorithm. Further, despite these steps, a clear effect of the data collection site was observable. Future studies will incorporate specific preprocessing steps like the ComBat algorithm[50] to integrate scan parameters during training and minimize site effects. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Interpretability
The outputs of Deep Learning models are not straightforwardly understandable, giving rise to the challenge of poor interpretability. This challenge arises because mathematically, Deep Learning models are composed of multiple functions. Each of these functions is nonlinear and is itself the sum of multiple functions. Further, models such as the 3D CNNs used in the current study have a large number of parameters that must be optimized. One of the goals of our study was to address this drawback by devising a pipeline that would allow for the extraction of predictive brain regions, providing interpretability. Guided Grad-CAM [46] was chosen for this purpose, due to its reasonable computation time and its ability to return fine-grained class-specific segmentations of important (predictive) voxels in the input images.
A challenge for our novel interpretability process was to identify brain areas that were predictive of Autism diagnosis across participants while avoiding the requirement for template normalization. To address this issue, we used a segmentation algorithm to partition individual volumes into established anatomical regions. We used HighRes3DNet [48] for this task because it was built to be pathology-agnostic, robust to brain morphology differences, and has reduced computation time compared to other algorithms (e.g., the GIF algorithm [49]). We performed a detailed analysis of the regions that were most relevant for inferring an Autism diagnosis, by examining true and false positives and negatives separately for each dataset and algorithm. We also identified regions that were reproducibly identified across algorithms and datasets. This detailed analysis is important because each model has biases, likely resulting in a differential weighting of anatomical features and brain areas. This analysis showed that regions of left prefrontal cortex (inferior and middle frontal gyrus, medial prefrontal gyrus, anterior and subgenual cingulate cortex), along with the parahippocampal gyrus were brain regions whose morphological features contributed most to the accurate inference of Autism across models and datasets without and with comorbidities. The areas highlighted are consistent with previous studies reporting . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 20, 2022. ; https://doi.org/10.1101/2022.10.18.22281196 doi: medRxiv preprint Autism-related disruptions to cortical development[24, [51][52][53] and gyrification processes [24,54] in these regions. Further, also consistent with the literature, we found that the most predictive regions varied according to both gender and age, as well as the presence of comorbidities [17,24,55]. This is consistent with observations that Autism is a complex condition, with patterns of neurological divergence that vary with age [17,24,[52][53] and sex [17,24,55].
Reproducibly predictive regions in the limbic system (left parahippocampal gyrus, anterior cingulate gyrus, and subcallosal area), dorsal medial frontal cortex, and precentral gyrus fit well with previous work on the role of atypical socio-emotional and motor circuitry in Autism [56][57][58][59][60]. Many of the left-hemisphere regions identified as contributing to accurate inference of Autism diagnosis fall within the canonical left-lateralized language network, including inferior prefrontal and inferior parietal regions, and the planum temporale in superior temporal gyrus [61][62][63]. Divergent structure and function in the language network is a robust and reproducible finding in Autism [64][65][66][67]. Since early language processing appears to be an important predictor of long-term outcomes in Autism [68][69][70] identification of early-emerging structural alterations in the underlying language network has the potential to yield a powerful marker of Autism or Autism subtypes, which could, in turn, direct individualized interventions and improve prognosis.
An important caveat is that while our novel interpretation step identified which regions of the brain had morphological features relevant to the model-based inference of Autism, it did not provide information on what these morphological features were. For example, features such as cortical thickness, the location of the gray-white boundary, surface area, and gyral/sulcal morphometry could all play a role in prediction of Autism [22,52,71]; and different morphological features may be relevant in different brain areas. While the precise nature of the Autism-related morphological features are not discernable from our analyses, our predictive modeling analyses can be followed up with in-depth, targeted, and . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 20, 2022. ; https://doi.org/10.1101/2022.10.18.22281196 doi: medRxiv preprint hypothesis-driven examinations of the areas highlighted in independent samples to uncover the nature of these features.

Limitations
Our pipeline for prediction of neuropsychiatric diagnosis (Autism) on the basis of minimally preprocessed T1 MRI scans advances progress toward interpretable 3D Deep Learning applications in biological psychiatry and toward the identification of reproducible brain biomarkers that will help refine diagnoses and treatment plans across conditions. Our study had several limitations, however, which may be addressed in further refinements of our pipeline.
First, we trained our models on 100 epochs, which is an acceptable number relative to other studies using 3D MRI scans [72], but which may have limited the convergence and optimization of the algorithms. Future work may train on a larger number of epochs or may employ earlystopping [73] to optimize training. Using the entire structural MRI scans (to explore prediction across the whole brain) may also have posed a challenge for convergence towards the "True" solution. Further, although we used a large dataset (1074 participants to train the models, 525 to validate and test the models), the amount of data available is still rather limited when we consider the clinical heterogeneity of Autism. This idea is supported by the poor prediction performance we observed for test set 2, which included participants with comorbid diagnoses (average accuracy = 46.3%, ROC AUC = 0.47 and average sensitivity = 15.7%). There are still questions in the literature about whether predicting a binary label, "Autism vs non-Autism" is a useful or appropriate endeavor, since Autism is a wide spectrum of behaviors and abilities which may encompass as many as four subtypes [23], and there is also considerable overlap of symptoms and neuromarkers across psychological conditions [17]. Future analyses will need to leverage even larger datasets to better address the clinical heterogeneity of Autism and to explore the prediction of categories beyond Autism and non-Autism. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint Another limitation is related to the segmentation algorithm we used in the interpretation step.
We used HighRes3DNet [48] to obtain rapid segmentation for each brain using the GIF algorithm [49], which was built to be robust on atypically developing brains. The segmentation produced is rather coarse, however -the algorithm outputs relatively large parcels, encompassing anatomically heterogeneous regions such as the anterior cingulate gyrus or superior parietal lobule. Further, as noted above, while our interpretation process localized regions that were important for prediction of Autism, it did not provide information on what the predictive morphological features of those regions were.

Future Directions
There is considerable scope to extend our interpretable Deep Learning pipeline to the prediction of other neurological or neuropsychiatric conditions or to other MRI modalities.
Traut et al. [39] reported that prediction of Autism was considerably improved (from AUC=0.66 using only anatomical MRI to AUC=0.79 using both anatomical and functional data) for a blended model that incorporated both functional and structural MRI data. Future work will examine whether functional MRI data can also improve our models. Other efforts to improve our model will include training the models on more epochs, exploring other architectures, integrating scanning parameters and other confounds such as gender and age, and using different and extended class labeling. We have shared all our code (https://github.com/garciaml/Autism-3D-CNN-brain-sMRI) to enable other researchers to apply, reuse, and further develop our models and approach.

Conclusion
In this paper, we described a novel methodology to build a predictive model to infer Autism diagnosis using 3D Deep Learning applied to structural MRI scans, coupled with an interpretation step in the form of a descriptive method that identified the brain regions that . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 20, 2022. ; https://doi.org/10.1101/2022.10.18.22281196 doi: medRxiv preprint were most important for accurate inference. Importantly, we applied our models to minimally preprocessed data -completely avoiding the template normalization step, which may obscure diagnosis-related alterations in brain structure. We found that the predictive performance of our models was equivalent to that of Machine Learning models reported in the literature, while requiring less time to generate predictions (due to minimal preprocessing). There is considerable scope to refine our method or to incorporate other modalities (e.g., fMRI) to further boost predictive performance.
Our method for interpreting the output of Deep Learning models revealed highly predictive brain regions that were consistent with the literature, demonstrating that 3D Deep Learning models produce biologically plausible results without a priori knowledge or the requirement for pre-computation of morphological derivatives (e.g., volumes, cortical thickness, surface area). Although challenges related to the clinical heterogeneity of Autism remain to be addressed, we have openly shared our code and models for others to build on and extend, and to further progress the field towards the identification of robust and reproducible brain biomarkers for neuropsychiatric conditions. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Acknowledgements
We have much gratitude and appreciation for the Irish Research Council and the Hypercube Institute (Paris) for their funding.
We also thank Pr. Louise Gallagher as well as Dr. Robert Whelan for their feedback and guidance.

Disclosure of competing interests
None.
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 20, 2022. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 20, 2022. ; https://doi.org/10.1101/2022.10.18.22281196 doi: medRxiv preprint