Abstract
Background Diverse large cohorts are necessary for dissecting subtypes of autism, and intellectual disability is one of the most robust endophenotypes for analysis. However, current cognitive assessment methods are not feasible at scale.
Methods We developed five commonly used machine learning models to predict cognitive impairment (FSIQ<80 and FSIQ<70) and FSIQ scores among 521 children with autism using parent-reported online surveys in SPARK, and evaluated them in an independent set (n=1346) with a missing data rate up to 70%. We assessed accuracy, sensitivity and specificity by comparing predicted cognitive level against clinical IQ data.
Results The elastic-net model has good performance (AUC=0.876, sensitivity=0.772, specificity=0.803) using 129 predictive features to impute cognitive impairment (FSIQ<80). Top ranked predictive features included parent-reported language and cognitive levels, age at autism diagnosis, and history of services. Prediction of FSIQ<70 and FSIQ scores also showed good prediction performance.
Conclusions We show cognitive levels can be imputed with high accuracy for children with autism, using commonly collected parent-reported data and standardized surveys. The current model offers a method for large scale autism studies seeking estimates of cognitive ability when standardized psychometric testing is not feasible.
Lay summary Children with autism who have more severe learning challenges or cognitive impairment have different needs that are important to consider in research studies. When children in our study were missing standardized cognitive testing scores, we were able to use machine learning with other information to correctly “guess” when they have cognitive impairment about 80% of the time. We can use this information in research in the future to develop more appropriate treatments for children with autism and cognitive impairment.
Introduction
There is significant phenotypic and genetic heterogeneity in ASD that complicates the ability to identify etiologies. The need for larger sample sizes for subtyping has led to the call for large-scale studies using remote phenotyping protocols (Warnell et al., 2015, Wang et al., 2016, Samocha et al., 2014). However, current clinical assessment methods are not feasible at scale.
IQ (intelligence quotient) is a key phenotype in autism research, for stratification of the population into subtypes, prediction of individual outcomes, and correlation with genes and genetic architecture (Munson et a., 2008; Mason et al., 2020; Anderson et al., 2014; Gradzinski et al., 2013; Allegrini et al., 2019; Benyamin et al., 2014, Davies et al., 2011, Steffanson et al., 2014; Robinson et al., 2017, Ronemus et al., 2014, Rauch et al., 2012; Iossofiv et al., 2014; De Rubeis et al., 2014; Sanders et al., 2015; Grove, 2019; Satterstrom, 2020). For children with autism, formal cognitive assessment requires expert in-person examination that is costly, not equally accessible to all, and not always feasible in children with the most severe forms of autism (Corona, et al., 2021). In lieu of full assessments, proxy measures of IQ, such as receptive vocabulary and single visual reasoning tasks are used (Hamburg et al., 2019; Frazier et al., 2004; Kriseleva et al., 2017, Dawson et al., 2007), and efforts are underway to develop web-based cognitive testing applications (for example, Hansen, 2016; Scott et al., 2020). However, these testing methods are not yet fully validated for remote in-home use either in children or in individuals with autism, as would be required for large studies, nor are they appropriate for very young children or severely impaired individuals who require significant support for online tasks. Sometimes the Vineland Adaptive Behavior Scales are used as a proxy for cognitive ability (Charman et al., 2011); however, its scores show non-linear associations with IQ due to discrepancies in daily functioning, particularly for those with higher intellectual ability (Bolte and Poustka, 2002; Freeman et al., 1998; Alvares et al., 2020). Clinicians and researchers have used parent report of estimated developmental ages and intelligence, but the accuracy of this method is inconsistent (Chandler et al., 2016). Ideally, clinical IQ test data could be abstracted from medical or school records, but the resources required to do this across a national cohort is challenging (Coleman et al., 2015; Maenner et al., 2020).
As an alternative, methods for estimation or imputation of cognitive level based on IQ may be used. When IQ scores are not available, cognitive ability is often imputed by alternative measures such as educational attainment in teens or adults using educational records (Cornish et al., 2015). This approach is not possible for many autistic individuals whose educational history follows a different path such as therapeutic or ungraded settings, especially in children. When only a small proportion of participants are missing IQ scores, multiple imputation can be used to impute IQ scores (Horwood et al., 2018; Reid et al., 2018). However, this method does not work well on data with significant missingness. Alternatively, missing cognitive test scores can be imputed by linear regression models based on scores from other related concurrent measures, but the performance of these models typically has not been examined in terms of sensitivity or specificity (for example, see Emerson et al., 2016).
Linear regression also has been used to predict IQ in typical children using available demographic information, such as age, sex, parent education, and family literacy level (Schoenberg et al. 2007). Traditionally, imputation with regression models typically involves a priori subjective judgments for a small number of predictors within large datasets. Its prediction performance may be limited due to missing potentially critical variables unknown to the investigators, and the subjective selection itself can lead to overfitting and less generalizability to other datasets. When dealing with complicated and large datasets, it may have limited performance when compared to other advanced methods such as classification and regression trees (Finch et al., 2011). Studies predicting psychological traits increasingly use machine learning models. Machine learning techniques do not require a priori selection of predictors, are able to uncover latent and complicated relationships between a large number of data points, and emphasize the evaluation of prediction accuracy. Most uses of machine learning for cognitive estimation have involved prediction of future cognitive status in other clinical groups, such as older adults, (Graham et al., 2020), and although machine learning and clustering models are being applied to autism, they are largely used for trait-based subtyping and detection (Lombardo, Meng-Chuan & Baron-Cohen, 2019; Veatch et al., 2013).
SPARK (SPARK Consortium, 2018; Feliciano et al., 2019) offers the opportunity to leverage a large, heterogeneous autism cohort with phenotypic data collected online, along with clinical IQ data available for a subset of participants. It enables us to develop and assess machine learning prediction models for predicting cognitive level, using commonly collected parent-reported information. We hypothesized that cognitive impairment and IQ scores in children with ASD can be predicted based on parent-reported information such as medical and developmental history and standardized questionnaires. We developed machine learning models to impute cognitive impairment using both a binary classification model with different IQ cutoffs and a continuous model, with comparison to traditional methods such as multiple imputation. The current study seeks to fill the gap in available methods to assess cognitive impairment in autism, for use in large scale autism and autism genetics research studies.
Methods
1. Sample overview
The SPARK cohort is currently the largest genetic study of autism. Participants are recruited through a network of US clinical sites and social media, and the study is conducted entirely online. As of early 2021, there were 84,161 children and dependents with a parent-reported professional ASD diagnosis (Figure 1). IQ scores were collected on 3,136 participants with ASD, and 2,545 participants have full scale IQ (Figure 1). We selected 521 participants with a complete dataset for initial model development, and another newly collected independent set (n=1,346) to further assess the model’s performance (Figure 1).
2. Measures
Surveys of demographics, diagnostic details, medical and developmental history, and standardized instruments including the Social Communication Questionnaire-Lifetime (SCQ-L; Rutter et al., 2003), Developmental Coordination Disorder Questionnaire (DCDQ; Wilson et al., 2007), Repetitive Behavior Scale-Revised (RBS-R; Bodfish et al, 2000), and Vineland 3rd Edition Parent-Caregiver Comprehensive rating form (Sparrow et al., 2016) were completed by parents online. Item level data and total scores were utilized for analyses. At enrollment, caregivers reported any history of ID and current functional language level (nonverbal/single words/ phrases/complex sentences).
Parents were asked questions regarding developmental milestones and regression (language and/or other skills), prior cognitive testing results, estimates of functioning, and special education services and therapies. The medical history collects data on major co-morbid medical, developmental and psychiatric conditions.
Estimated cognitive functioning level was reported by parents in several ways: overall IQ testing score if available, the mental age (age equivalent) they had been given by clinical or school providers, and general cognitive level (degree of delay) relative to chronological age. IQ test scores and mental age equivalents reported by parents could not be utilized due to missing data, which reflects parent uncertainty or lack of access to test results.
3. Key IQ indices collected at clinical sites
Full Scale IQ (FSIQ), Nonverbal IQ (NVIQ) and Verbal IQ (VIQ) data were collected by research staff from review of electronic and paper clinical and research records at 24 SPARK clinical sites. Staff were trained in data abstraction and required to complete and submit Case Report Forms to demonstrate inter-rater agreement with their site supervisor (psychologist), which was required to be 100% in order to proceed with data collection. Recent reports of inter-rater reliability in SPARK were high (ICC=0.98) for abstraction of IQ data using these methods (Fombonne et al., 2021). Participants age 15 months and older were prioritized, given more limited confidence in diagnosis and test results in infancy. FSIQ, NVIQ, VIQ, and developmental quotients in the case of young children, were accepted for a specific list of common, standardized, norm-referenced tests. Testing was the most recent evaluation on record. IQ assessment data were accepted if they were verified by site staff to be provided by a qualified professional (licensed M.A. or Ph.D.) in psychology or by a psychometrician supervised by a licensed psychologist, including clinical, school or research evaluations. To standardize collection of IQ data across sites, a training manual and online data entry form containing validators were provided. Any concerns regarding validity of the test results by the original test examiner of record also were entered as a flag, and such cases were excluded from analyses. All participants signed authorization within the SPARK informed consent permitting bidirectional data sharing of IQ and other information between SPARK staff and clinical sites.
IQ test data included but were not limited to the Wechsler Intelligence Scales for Children (Wechsler, 2014), Stanford-Binet Intelligence Test (Roid, 2003), Mullen Early Learning Composite (Mullen, 1995), Bayley Scales of Infant Development Mental Development Index and Cognitive Composite (Bayley, N., Aylward, G.P., 2019), and the Differential Ability Scales Global Cognitive Ability (Elliott, 2007). The indices accepted were measurements scaled to a mean standard score of 100 and standard deviation of 15. Not included were school readiness screeners, achievement tests, parent-reported developmental screeners or adaptive functioning interviews/surveys. Questionnaire measures were taken within a median of 16 months from the IQ test date among all participants.
4. Statistical association between IQ domain scores and between IQ scores and other covariates
IQ scores were coalesced across test types based upon their common scaling (Mean=100, Standard Deviation=15) as is common in autism research given the flexible assessment approaches required for valid comparisons (Bishop et al., 2015). Pearson correlations were calculated between FSIQ, NVIQ and VIQ scores. Pairwise scatterplots with a fitted linear model were used to assess whether the relationships are linear.
Similarly, we also assessed the relationship between IQ scores and age of IQ testing by Pearson correlation and scatterplots. Lowess smoothing was used to illustrate the relationship between IQ scores and age when IQ score was tested.
We also assessed the association between IQ scores and other variables available in SPARK, including those known to be strongly and consistently predictive of IQ across age in autism (Simonoff et al., 2019) - reported ID, estimated cognitive level (above, at or below chronological age level), language level at enrollment, history of any developmental delay and parent-report of motor.
5. Machine learning methods to predict cognitive impairment
FSIQ was used to define cognitive impairment since the largest number of records were available and it is highly correlated with NVIQ and VIQ. We defined cognitive impairment as FSIQ<80, in order to 1) increase the likelihood of identifying all individuals with subaverage intellectual functioning for genetic studies, 2) align better with current diagnostic criteria for ID which do not require an IQ cutoff of 70, and 3) prevent the inappropriate classification of average IQ individuals together with individuals with borderline intellectual functioning (Munson et al., 2008; Nouwens et al., 2017). The goal of the machine learning model is to distinguish participants with FSIQ <80 versus FSIQ ≥80. Because FSIQ<70 also is commonly used to define cognitive impairment, we applied the same model training, selection and testing procedures to develop an elastic net model to predict participants with FSIQ<70. Sensitivity analysis was conducted using FSIQ <70/FSIQ ≥ 70 and continuous prediction of FSIQ. There were 271 predictive measures included. Vineland scores were not included due to a high missing rate.The full list of final selected predictors is included in Supplementary Table 1.
Training, validation and testing sets
The data in the initial sample (n=521) were randomly split into 60% training (n=313), 20% validation (n=104) and 20% testing data (n=104) while keeping the proportion of lower IQ consistent across training, validation and testing sets (Table 1). The model training was conducted in the training set, the validation set was used to select the best performing model, and the testing set was used to decide the predictive probability cutoff and model evaluation. An evaluation was also conducted to assess the model performance in the presence of missing values (up to 70%) in an additional independent set (n=1,346; Figure 1, Table 1).
Model training, selection and evaluation
In the training set, we trained the following commonly used machine learning models for classification using the R package caret Version 6.0-88 (Kuhn, 2008): elastic-net regularized generalized linear models with no interaction terms (glmnet), support vector machines (svmRadial), random forest (rf), k-Nearest Neighbors (kNN) and gradient boosting (xgbTree). These five models are commonly used in solving classification problems (Friedman, 2001; Maglogiannis, 2007; Ogutu, 2011; Chen, 2016). The elastic-net model is a modified version of linear regression with a penalty on the number of covariates, and is easy to interpret. Support vector machines, random forest, k-nearest neighbors and gradient boosting trees can deal with high-dimensional datasets but are often hard to interpret. Ten-fold cross-validation was used in the training to minimize overfitting. Model parameters were chosen by a random grid of 50 possible combination of values and optimal parameters were chosen by the best receiver operating characteristic (ROC) among cross-validated resamples in the training set. As a sensitivity analysis for predicting continuous FSIQ, only elastic-net, support vector machines and k-Nearest Neighbors were suitable for continuous prediction and were used in the training.
In the validation set, we evaluated the prediction performance for the five trained models by receiver operating characteristic (ROC) curve and area under the curve (AUC). The model with the highest AUC was selected as the final model. As a sensitivity analysis for predicting continuous FSIQ, we used the Spearman correlation between true and predicted FSIQ, root mean square error (RMSE), Cohen’s kappa, sensitivity and specificity.
In the testing sets, we further evaluated the prediction performance of the final model by ROC curve, AUC, accuracy, Cohen’s kappa, sensitivity, specificity and positive predictive values (PPV) and negative predictive values (NPV). We determined the predictive probability cutoff for cognitive impairment in the initial testing set.
Variable importance ranking
The variable importance was obtained by elastic-net model using 100 bootstraps conducted with the R package caret. The variable importance was scaled from −100 to 100 with absolute value reflecting the relative importance and directionality reflecting the direction of predicting low IQ.
To compare the machine learning prediction performance with traditional methods, we applied the multiple imputation method to the same dataset of 521 participants and 271 measures. We kept the cognitive impairment status (FSIQ<80; FSIQ<70) in the training set available (n=313) and set the cognitive impairment status as missing in the initial testing set (n=104). We ran the multiple imputation algorithm (Van Buuren et al., 2011) to impute the missing cognitive impairment using the R package mice (Version 3.13.0). We assessed the accuracy, Cohen’s kappa, sensitivity, specificity and positive and negative predictive values (PPV, NPV) and compared the performance with the machine learning prediction.
6. Assess prediction performance in the additional independent dataset (n=1,346) with missingness
We used 1,346 newly collected participants with FSIQ, having no overlap with the original 521 cases and a missing rate of up to 70% among the 271 predictive features (Table 1). We first applied nonparametric missing value imputation methods (Stekhoven et al., 2012) to the dataset with all predictive features by R package missForest (Version 1.4). Then, we used our trained elastic net model to obtain the predicted probability of imputed cognitive impairment. We assessed the prediction performance by ROC curves with the predicted probability, and used the same probability cutoff as the initial testing set to assess the accuracy, Cohen’s kappa, sensitivity, specificity and positive and negative predictive values (PPV, NPV). We also assessed the prediction performance by different levels of missing rate among 271 features.
Results
1. IQ scores in SPARK are highly linearly inter-correlated and independent of age
Among 521 ASD participants with complete FSIQ and all 271 measures (Figure 1) in the initial set, 81.2% were male (Table 1) and the average age at enrollment was 7.9 years (Range 2 – 15.5 years). The racial composition is 72.4% white, 7.1% African American, 1.7% Asian, 0.2% Native American, 1.3% other, and 17.3% more than one race. Mean IQ scores are: FSIQ=84.0, NVIQ=94.9, VIQ=89.7. The training, validation and testing sets have similar demographic distribution (Table 1).
Based on the scatterplots (Supplementary Figure 1) and the pair-wise Pearson correlation, FSIQ, NVIQ and VIQ are highly positively linearly correlated (r > 0.7). We also assessed the relationship between IQ scores and age of IQ testing (Supplementary Figure 2). There is no correlation between the three types of IQ scores and age (−0.06<r<0.06).
2. Machine learning models can accurately predict cognitive impairment as defined by FSIQ<80
We conducted an initial assessment of several measures considered to be relevant to IQ (Table 2, Supplementary Figure 3). Parent-reported severity of cognitive delay and functional language level are strongly associated with FSIQ, NVIQ and VIQ, followed by any language delay or language disorder, severity of language regression, and parent-reported ID diagnosis (Table 2).
Among 150 participants with FSIQ<70, 25 participants (16.7%) were reported as ID by their parents. With FSIQ<80, 37 out of 216 participants (17.1%) were reported as ID.
Machine learning models showed good prediction of cognitive impairment as defined by FSIQ<80. In the training set, we trained five commonly used machine learning models and assessed their prediction performance in the validation set by receiver operating characteristic (ROC) curves (Figure 2). Elastic-net regularized generalized linear model has the best performance (AUC=0.900) followed by support vector machines (AUC=0.891), k-nearest neighbors (AUC=0.863), random forest (AUC=0.863) and gradient boosting (AUC=0.843). Based on the prediction performance in the validation set, we chose the elastic-net model as the final model using 129 predictive features (Supplementary table 1).
We calculated the variable importance with directionality for these 129 selected predictive features (Figure 3, Supplementary table 1). Parent-reported language and cognitive levels or diagnoses, age at ASD diagnosis, mood/anxiety disorder, specific items on standardized motor and ASD questionnaires and service history are significant IQ predictors.
When applying our final elastic net model to the initial independent testing set, the predictive performance is relatively high with an AUC of 0.888 (Figure 4). With a cutoff of 0.45 on the predictive probability by the elastic-net model, the sensitivity is 0.744 and the specificity is 0.885 for predicting cognitive impairment (Figure 4, Table 3).
We compared the model’s performance with an multiple imputation approach (Table 3). The overall accuracy, kappa, sensitivity and specificity are lower by multiple imputation compared to the elastic net model (Table 3; FSIQ<80: accuracy=0.692, sensitivity=0.558, specificity=0.787; FSIQ<70: accuracy=0.779, sensitivity=0.600, specificity=0.851).
3. Prediction performance for FSIQ<80 in an additional independent set (n=1,346) in SPARK with missingness in predictive features
In the larger newly collected independent set with missing rate up to 70% (n=1,346; Figure 1, Table 1):12.9% are nonverbal, 14.5% reported ID and the mean FSIQ is 77.89 (Table 1). Among all 271 features, rates of each level of missing values in the sample are as follows: no missing −197 (14.6%), <10% missing-528 (39.2%), 10-30% missing - 128 (9.5%), 30-50% missing - 366 (27.2%) and 50-70% missing - 127 (9.4%) participants (Table 1).
We imputed the missing values among predictive features by existing methods (Stekhoven et al., 2012) and then applied our trained elastic net model to obtain predicted probability of imputed cognitive impairment (FSIQ<80). By using a cutoff of 0.45 on the predicted probability, most prediction performance metrics in the additional independent set are comparable and just slightly lower than in the initial testing set (AUC=0.876, Figure 5; sensitivity=0.772, specificity=0.803, accuracy=0.787, Table 3). We also examined the prediction performance in subsets of participants with different missing rates within the additional independent set. The AUC’s among all subsets were comparable to the AUC in the initial testing set with no missing values (n=104) (Supplementary Figure 6a, Supplementary Table 4). Overall, the prediction performance is still comparable to the initial testing set even with some level of missingness among the predictors.
4. Sensitivity analysis when cognitive impairment is defined by FSIQ<70
The elastic net model for FSIQ<70 selected 128 features, and top ranked predictors are similar to the previous model predicting FSIQ<80. In both the initial testing set (n=104) and the additional independent set (n=1,346), most prediction performance metrics in the FSIQ<70 model are comparable and just slightly lower than in the FSIQ<80 model (initial testing set: AUC=0.881; additional independent set: AUC=0.865, Table 3, Supplementary Figure 4b). The prediction performance between the initial testing set and the additional independent set is comparable.
5. Predicting continuous FSIQ scores by machine learning methods
We trained the elastic net, support vector machines and k-nearest neighbors models to predict continuous FSIQ, and evaluated their performances compared to the binary prediction models in the initial testing set (n=104) and additional independent set (n=1,346; Supplementary Table 3, Supplementary figure 5). The elastic net model has relatively good prediction performance and similar variables of importance. The Pearson correlation with true FSIQ is 0.678 in the initial testing set (n=104) and 0.730 in the additional independent set (n=1,346), while the root mean square error (RMSE) is 17.4 in the initial testing set and 16.1 in the additional independent set (Supplementary Table 3). We also examined the RMSE in subsets of participants with different missing rate in the additional independent set (n=1,346): the RMSE among all subsets was comparable to the initial testing set with no missing values (n=104) (Supplementary Figure 6c, Supplementary Table 4). The support vector machines model also has similar prediction performance to the elastic net model in predicting continuous FSIQ (Supplementary Table 3,4).
We then compared the prediction performance of the continuous FSIQ model to our binary cognitive impairment model, using binary evaluation metrics. To do this, we categorized predicted continuous FSIQ according to cutoffs (<80 and <70, respectively) and compared it to true FSIQ classified by the same cutoffs (Supplementary Table 3). Overall, compared to the original binary elastic net prediction model, the prediction performance was comparable (initial testing set: FSIQ<80 AUC=0.864, FSIQ<70 AUC=0.812; additional independent set: FSIQ<80 AUC=0.871, FSIQ<70 AUC=0.875; Supplementary Table 3, Table 3). The prediction performance is also similar across subsets with varying missing rates in the additional independent set (Supplementary Table 4).
Because our main purpose in our research is to impute the presence or absence of cognitive impairment, we used the binary elastic net prediction as our final model.
Discussion
Using machine learning methods, we demonstrate it is possible to derive valid estimates of cognitive ability based on parent-reported data in a large research cohort of autistic individuals for whom formal IQ testing is not available. We are also able to show our model performance is relatively stable in the presence of missing data and our approach is generalizable to impute cognitive impairment in the broader population of children with ASD. Having a cognitive estimate permits more sophisticated phenotypic and genomic analyses with available registry data, matching of appropriate control groups, and stratification or covariation in analyses for subtyping or to control for influences beyond core autism. It also fills a critical gap to include more underserved populations and more severely affected individuals for whom cognitive testing may not be available.
We obtained the variable importance rankings by elastic-net model for all available measures in SPARK. Although the variable importance rankings are model dependent and data dependent, we found several interesting items that contribute to the prediction model. The assessment of predictive importance among all available measures in SPARK revealed that parent report on both standardized and non-standardized instruments contains valuable information regarding individuals’ cognitive level. Studies of other clinical conditions have included similar information as predictors of future cognitive status with good success, for example, demographic data and health history (Graham et al., 2020). In the present study, the strongest predictors were parent report of history of language delay and disorder, report of general cognitive delay, and younger age at ASD diagnosis.
We chose to focus on detection of IQ under 80, so as to sensitively capture borderline intellectual impairment for our research, beyond a more rigid and outdated definition of intellectual disability. We compared definitions of cognitive impairment as FSIQ under 80 or 70, and predictions using either definition are accurate, suggesting generalizability. However, surprisingly few parents reported ID in children with a tested IQ below 70 or 80. Parent-report of ID diagnosis was not as strongly associated with IQ scores and not as successful as other parent reported information in predicting lower IQ. These findings are consistent with prior studies showing parents tend to underreport developmental disabilities (Porter et al., 2011). We found that more concrete questions about day-to-day observations of language and cognitive tasks relative to age are more significantly associated with IQ scores and are more accurate predictors of cognitive impairment. The predictive importance of parent-reported ID diagnosis was low, and lower than many other reported characteristics, including autistic behaviors, early social-communication skills, and psychiatric co-morbidities. One of the most interesting autism characteristics that served to predict cognitive impairment was repetitive playing of a song or video segment. In terms of observable skills, after language and overall reasoning ability, good fine motor control with scissors was an indicator that cognitive impairment is unlikely. Other strong components were education service history: receipt of specific autism services, a one-on-one classroom aide and/or assignment to a full time self-contained ASD classroom all indicate likely cognitive impairment, while assignment to social skills group therapy at school indicates the likelihood that IQ is within normal limits.
There were many low-ranking variables with minimal to no predictive importance in the model that are notable. For example, a history of generic special education or early intervention, and several early signs (such as speech delay and absence of reciprocal smiling) had low yield. Similarly, age, maternal higher education, overall severity of repetitive behavior or number of signs of autism, sleep problems, motor delays, and history of a non-language regression were not strong predictors. On the standardized measures, while some specific repetitive and basic social behaviors are predictive, others are not. The reason these factors failed to be selected by the model is unclear: it is possible some are universal features of autism independent of cognitive ability, had low base rates in our sample, or were highly correlated with other variables that were alternatively selected by the model.
There are several potential implications of our findings. First, when surveying cognitive ability in large cohorts, it is important to anchor against chronological age expectations and to be specific about the skills queried. IQ and mental age data requested from families are often missing. Second, consistent with prior literature, language level remains the strongest correlate of IQ. Finally, the present model suggests a set of variables that can be easily administered as survey questions and used in an algorithm to predict cognitive ability for children in autism studies.
There are several limitations to the present study. First, the current model included only children under 18, and results could be different for adults. Second, the assumptions of the model require certain caveats. Not all individuals with poor past performance on a single IQ test have true global cognitive impairment or ID, nor do all individuals with limited language. Individuals must be appreciated for their abilities rather than their “disabilities,” and this is the purpose of expert, comprehensive, in-person assessment. In this way, all statistical models risk inaccuracy when applied to an individual. Caution is urged in estimating cognitive impairment in children below school age in particular.
The current model was trained on IQ testing data completed at any age or point in time; however, if planning to test this or a similar model, researchers may wish to exclude cases if IQ tests were completed a long time ago or prior to age 5.
The algorithm for imputing cognitive level may have more limited generalizability to groups outside of SPARK given the relatively high IQ and basic language ability of participants in the final dataset relative to population estimates in autism (Baio et al., 2018), and the high SES reported previously in SPARK (SPARK Consortium, 2017). The problem of discrepancies between NVIQ and VIQ in many children with autism also must be considered. Although NVIQ correlated highly with FSIQ in our sample, there is greater variability in its association with VIQ, and our analysis necessarily excluded children who had no FSIQ but only a NVIQ or derived ratio IQ available and thus may have been nonverbal. In applying an algorithm that leans quite heavily on language variables and is based upon FSIQ as the outcome, researchers must remain aware of the possibility the model could underestimate some nonverbal individuals. However, despite these concerns, it is promising that accuracy of the model for the subset of nonverbal individuals in our testing set was high (accuracy=97% among 164 non-verbal participants).
When interpreting the variables of greatest importance in the model, it is important to note the concept of importance is model-dependent and data dependent, and each characteristic on its own cannot be assumed as a one-to-one proxy for IQ classification or ID in an individual, but rather operate together to maximize predictive power. Further, adaptive behavior, the current standard for ID diagnosis, was not included in the analysis due to missing data. Regardless, taken together, the variables clearly have face validity and clinical relevance given our knowledge of children with greater impairments (Bishop et al., 2015; Munson et al., 2008).
In terms of methodology, despite our use of regularization and cross-validation to minimize overfitting, our ratio of data-to-features was relatively low in our model development, which could lead to overfitting and limit the model’s performance when imputing cognitive level in other ASD populations. However, we showed the model’s performance was comparably good in an additional independent set even in the presence of missing data.
Future work will re-train and replicate the predictive models with larger samples providing a greater data-to-features ratio and more complete cases with additional phenotyping data, and will evaluate the reliability of medical record data abstraction underlying the IQ data. Dissecting ASD subtypes by intellectual functioning level will allow us to more fully stratify and thereby understand similarities and differences in underlying biology across the autism spectrum. The model stands to provide a critical missing piece for future genomic and other analyses in the SPARK cohort and beyond.
Supplementary Figure 1: Full Scale IQ, Nonverbal IQ and Verbal IQ are linearly correlated
Supplementary Figure 2: Full Scale IQ, Nonverbal IQ and Verbal IQ are independent of age
Supplementary Figure 3: Full Scale IQ by cognitive and language measures by boxplot
Supplementary Figure 4: Prediction performance in initial testing set by elastic-net model for FSIQ<70. a. receiver operating characteristic (ROC) curves showing the area under the curve (AUC) is 0.881 in the testing set; b. Prediction probability cutoff by sensitivity and specificity. Final cutoff of 0.35 would provide a sensitivity of 0.767 and a specificity of 0.838
Supplementary Figure 5: Predicted FSIQ versus true FSIQ by elastic-net, support vector machines (SVM) and k-nearest neighbors (kNN) in the initial testing set (n=104) and in the additional independent set (n=1,346)
Supplementary Figure 6: Prediction performance by missing rate in the additional independent set (n=1,346). Panels a, b panel show the AUC by missing rate based on the elastic-net model trained to predict cognitive impairment defined by FSIQ<80 (a) and FSIQ<70 (b). The grey dashed line is the AUC from the testing set in the model development (AUC=0.888 for FSIQ<0.8 and AUC=0.881 for FSIQ<0.7). Panel c shows the root mean square error (RMSE) by missing rate in the additional independent set compared to the testing set in model development, for three models predicting continuous FSIQ
Conflict of Interest
All authors declare that there is no conflict of interest.
Data Availability
All data for the SPARK population dataset described in this study are available to researchers in the Simons Foundation Autism Research Initiative database, SFARI Base, at https://base.sfari.org.
Acknowledgements
We are grateful to all of the families in SPARK, the SPARK clinical sites and SPARK staff. We appreciate obtaining access to the SPARK phenotypic dataset on SFARI Base. Approved researchers can obtain the SPARK population dataset described in this study by applying at https://base.sfari.org. The SPARK consortium members at the time of this writing are: Leonard Abbeduto, Gabriella Aberbach, Shelley Aberle, John Acampado, Andy Ace, Kaitlyn Ahlers, Charles Albright, Michael Alessandri, Nicolas Alvarez, David Amaral, Alpha Amatya, Alicia Andrus, Claudine Anglo, Rob Annett, Eduardo Arzate, Irina Astrovskaya, Kelli Baalman, Melissa Baer, Gabriele Baraghoshi, Nicole Bardett, Sarah Barnes, Asif Bashar, Heidi Bates, Katie Beard, Juana Becerra, Malia Beckwith, Landon Beeson, Josh Beeson, Brandi Bell, Monica Belli, Dawn Bentley, Natalie Berger, Anna Berman, Raphael Bernier, Elizabeth BerryKravis, Mary Berwanger, Shelby Birdwell, Elizabeth Blank, Stephanie Booker, Aniela Bordofsky, Erin Bower, Catherine Bradley, Stephanie Brewster, Elizabeth Brooks, Aliso Brown, Melissa Brown, Jennylyn Brown, Cate Buescher, Martin Butler, Eric Butter, Wenteng CaI, Norma Calderon, Kristen Callahan, Alexies Camba, Claudia CampoSoria, Paul Carbone, Laura Carpenter, S. Carpenter, Lindsey Cartner, Myriam Casseus, Lucas Casten, Sullivan Catherine, Ashley Chappo, Tia Chen, Wubin Chin, Sharmista Chintalapalli, Daniel Cho, Dave Cho, YB Choi, Wendy Chung, Renee Clark, Cheryl Cohen, Kendra Coleman, Costanza Colombi, Joaquin Comitre, Sarah Conyers, Lindsey Cooper, Leigh Coppola, Lisa Cordiero, Jeanette Cordova, Dahriana Correa, Hannah Cottrell, Michelle Coughlin, Eric Courchesne, Dan Coury, Joseph Cubeis, Sean Cunningham, Mary Currin, Michele Cutri, Sophia D’Ambrosi, Amy Daniels, Sabrina Davis, Nickelle Decius, Jennifer Delaporte, Brandy Dennis, Kate Dent, Gabrielle Dichter, Katharine Diehl, Chris Diggins, Emily Dillon, Erin Doyle, Andrea Drayton, Megan DuBois, Gabrielle Duhon, Megan Dunlevy, Rachel Earl, Catherine Edmonson, Sara Eldred, Barbara Enright, Craig Erickson, Amy Esler, Anne Fanta, Carrie Fassler, Faris Fazal, Pam Feliciano, Angela Fish, Kate Fitzgerald, Chris Fleisch, Eric Fombonne, Emily Fox, Sunday Francis, Margot Frayne, Sandra Friedman, Laura Fuller, Virginia Galbraith, Swami Ganesan, Jennifer Gerdts, Mohammad Ghaziuddin, Haidar Ghina, David Giancarla, Erin Given, Jared Gong, Kelsey Gonring, Natalia Gonzalez, Antonio Gonzalez, Rachel Gordon, Catherine Greay, LeeAnne Green Snyder, Tunisia Greene, Ellen Grimes, Luke Grosvenor, Amanda Gulsrud, Abha Gupta, Jaclyn Gunderson, Chris Gunter, Anibal Gutierrez, Frampton Gwynette, Melissa Hale, Lauren K. Hall, Jake Hall, Kira Hamer, Bing Han, Nathan Hanna, Antonio Hardan, Eldric Harrell, Jill Harris, Nina Harris, Caitlin Hayes, Teryn Heckers, Kathryn Heerwagen, Susan Hepburn, Lynette Herbert, Clara Herrera, Brittani Hilscher, Kathy Hirst, Theodore Ho, Dabney Hofammann, Margaret Hojlo, Gregory Hooks, Dain Howes, Lark Huang-Storm, Samantha Hunter, Hanna Hutter, Teresa Ibanez, Dalia Istephanous, Suma Jacob, Andrea Jarratt, Stanley Jean, Anna Jelinek, Bill Jensen, Mya Jones, Mark Jones, Alissa Jorgenson, Roger Jou, Jessyca Judge, Taylor Kalmus, Stephen Kanne, Hannah Kaplan, Lauren Kasperson, Sophy Kim, Annes Kim, Cheryl Klaiman, Robin Kochel, Misia Kowanda, Melinda Koza, Sydney Kramer, Eva KurtzNelson, Hoa Lam, Elena Lamarche, Erica Lampert, Rebecca Landa, Alex Lash, Noah Lawson, J. Kiely Law, Holly Lechniak, CD Lehman, Bruce Leight, Laurie Lesher, Deana Li, Robin Libove, Natasha Lillie, Danica Limon, Desi Limpoco, Nathan Lo, Brandon Lobisi, Marilyn Lopez, Catherine Lord, Daniella Lucio, Addie Luo, Audrey Lyon, Natalie Madi, Malcolm Mallardi, Lacy Malloch, Anup Mankar, Lori Mann, Patricia Manning, Julie Manoharan, Olena Marchenko, Richard Marini, Christa Martin, Gabriela Marzano, Sarah Mastel, Sheena Mathai, Clara Maxim, Caitlin McCarthy, Nicole Mccoy, Julie McGalliard, Anne-Marie McIntyre, Brooke McKenna, Alexander McKenzie, Megan McTaggart, Sophia Melnyk, Alexandra Miceli, Sarah Michaels, Jacob Michaelson, Anna Milliken, Amanda Moftt Gunn, Sarah Mohiuddin, Jessie Montezuma, Amy Morales-Lara, Kelly Morgan, Hadley Morotti, Michael Morrier, Maria Munoz, Karla Murillo, Kailey Murray, Vincent Myers, Natalie Nagpal, Jason Neely, Katelyn Neely, Olivia Newman, Richard Nguyen, Victoria Nguyen, Amy Nicholson, Melanie Niederhauser, Megan Norris, Kaela O’Brien, Eirene O’Connor, Mitchell O’Meara, Molly O’Neil, Brian O’Roak, Edith Ocampo, Cesar Ochoa-Lubinof, Jessica Orobio, Elizabeth Orrick, Crissy Ortiz, Opal Ousley, Motunrayo Oyeyemi, Samiza Palmer, Katrina Pama, Juhi Pandey, Katherine Pawlowski, Micah Pepper, Diamond Phillips, Karen Pierce, Joseph Piven, Jose Polanco, Natalie Pott-Schmidt, Lisa Prock, Angela Rachubinski, Desiree Rambeck, Rishiraj Rana, Shelley Randall, Vaikunt Ranganathan, Ashley Raven, Madelyn Rayos, Kelli Real, Louis F. Reichardt, Richard Remington, Anna Rhea, Catherine Rice, Harper Richardson, Stacy Rife, Chris Rigby, Ben Right, Beverly Robertson, Erin Roby, Casey Roche, Nicki Rodriguez, Katherine Roeder, Daniela Rojas, Cordelia Rosenberg, Jacob Rosewater, Katelyn Rossow, Payton Runyan, Nicole Russo, Tara Rutter, Mahfuza Sabiha, Mustafa Sahin, Marina Sarris, Dustin Sarver, Madeline Savage, Jessica Scherr, Hayley Schools, Gregory Schoonover, Robert Schultz, Brady Schwind, Cheyanne Sebolt, Rebecca Shafer, Swapnil Shah, Neelay Shah, Roman Shikov, Mojeeb Shir, Amanda Shocklee, Clara Shrier, Lisa Shulman, Matt Siegel, Andrea Simon, Laura Simon, Kaitlyn Singer, Emily Singer, Vini Singh, Kaitlin Smith, Chris Smith, Ashlyn Smith, Latha Soorya, John Spiro, Diksha Srishyla, Danielle Stamps, Laura Stchur, Morgan Steele, Alexandra Stephens, Amy Swanson, Megan Sweeney, Anthony Sziklay, Maira Tafolla, Nicole Takahashi, Amber Tallbull, Nicole Targalia, Cora Taylor, Sydney Terroso, Angela Tesng, Samantha Thompson, Jennifer Tjernagel, Jaimie Toroney, Laina Townsend, Katherine Tsai, Ivy Tso, Maria Valicenti-Mcdermott, Bonnie VanMetre, Candace VanWade, Dennis Vasquez Montes, Alison Vehorn, Mary Verdi, Brianna Vernoia, Natalia Volfovsky, Lakshmi Vrittamani, Jermel Wallace, Corrie Walston, Audrey Ward, Zachary Warren, William Curtis Weaver, Sabrina White, L. Casey White-Lehman, Fiona Winoto, Ericka Wodka, Jessica Wright, Sabrina Xiao, Simon Xu, WhaJames Yang, Amy Yang, Meredith Yinger, Christopher Zaro, Hana Zaydens, Cindy Zha, Allyson Zick