A Survey of Copy Number Variants Associated with Neurodevelopmental Disorders in a Large-Scale, Multi-Ancestry Biobank

BACKGROUND: Past clinical genetic studies have identified rare, copy number variants (CNVs) as risk factors for multiple, neurodevelopmental disorders (NDD), including autism spectrum disorder and schizophrenia. However, the broad, clinical characterization of these NDD-CNVs in large population cohorts, especially of diverse ancestry, is relatively understudied. We characterized the clinical presentation of NDD-CNVs in the BioMe biobank, comprising ~25,000 individuals across diverse ancestry, medical and neuropsychiatric clinical presentation, with a mean age of 50.3 years. METHODS: Individuals within the BioMe biobank harboring NDD-CNVs were identified using a consensus of two CNV calling algorithms, based on whole-exome sequencing and genotype array data, followed by a series of novel, in-silico clinical assessments. RESULTS: The overall prevalence of a set of 64 NDD-CNVs was calculated at ~2.5%, with prevalence varying by locus, corroborating the presence of some relatively, highly-prevalent NDD-CNVs (i.e., 15q11.2 deletion/duplication, 2q13(NPHP1) deletion/duplication). An aggregate set of rare, NDD-CNVs were enriched for congenital disorders (OR=1.8, p-value=0.02) and major depressive disorders (OR=1.3, p-value=0.04) in multi-ancestry analyses, and major depressive-disorder in an African ancestry-stratified group (OR=1.8, p-value=0.01). In a meta-analysis of medical diagnoses (n=195 hierarchically-clustered diagnostic codes), an aggregated set of rare, NDD-CNVs was significantly associated with obstructive sleep apnea (Z-score=3.6 p=3.2x10-4). Further, an aggregated set of rare, NDD-CNVs was associated with increased body mass index (BMI) in a multi-ancestry analysis (Beta=0.14, p-value=0,04), and in Hispanic-stratified analyses (Beta=0.30, p-value=4.2x10-3). For 38 common serum laboratory tests, there was no identified association with the aggregate set of NDD-CNVs. CONCLUSION: The current analyses elucidated clinical features of individuals harboring NDD-CNVs, in a large-scale, multi-ancestry biobank, identifying enrichments for congenital disorders and major depressive disorder, as well as identifying associations with obesity-related phenotypes, obstructive sleep apnea and increased BMI. Future recall of individuals harboring NDD-CNVs will allow for further clinical assessments beyond the electronic health records (EHR) presently analyzed, including neurocognitive and neuroimaging outcomes.


INTRODUCTION
Clinical genomic investigations to date have identified rare copy number variants (CNVs, i.e. genomic microdeletions or microduplications >1kb) considered pathogenic for neurodevelopmental disorders (NDD), including schizophrenia (SCZ), autism spectrum disorder (ASD), and intellectual disability (ID). [1][2][3] Initial genomic studies of CNVs in NDD utilized chromosomal microarray analyses, while most recent discovery and validation efforts have leveraged genome-wide association analyses (GWAS), with CNVs called from genotyping arrays. [2][3][4][5][6] Among NDD-pathogenic CNVs, eight CNVs are reported to occur at significantly increased frequencies in schizophrenia compared to controls, including the 22q11.2 deletion underlying velocardiofacial syndrome, which is among the most well-studied and clinically characterized. 2,7 Nearly all NDD-CNVs have a greater effect on disease risk compared to common genetic variants. For example, the reported odds ratios (OR) of schizophreniaassociated-CNVs range from 3-60, whereas the OR conferred by most GWAS-significant single nucleotide polymorphisms (SNP) is less than 1.2. 2, 8 NDD-pathogenic CNVs have been characterized in clinical studies and population cohorts, to quantify clinical features of developmental delay, physical stigmata and neuropsychiatric effects, and have been established as both inherited and de novo risk factors. [9][10][11][12] . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 12, 2021. ; https://doi.org/10.1101/2021.06.09.21258554 doi: medRxiv preprint Despite considerable investigation of NDD-pathogenic CNVs to date, significant gaps in their clinical characterization remain. First, pleiotropy, or non-specificity of NDD-pathogenic CNVs remains unexplained, that is the same CNV may confer risk for multiple neurodevelopmental disorders. 13 Second, the variable penetrance and expressivity remain poorly understood, that is among NDD-CNV carriers, developmental or neuropsychiatric symptomatology ranges from unaffected to severely affected. 14 Some studies suggest a role of 'background' genetic risk, polygenic risk derived from common risk alleles, in interacting with NDD-pathogenic CNVs to influence penetrance; in other reports, there is evidence of a 'two-hit' model whereby a large NDD-CNV event exacerbated neurodevelopmental phenotypes, in association with other large deletions or duplications. [15][16][17][18] Most clinical reports of NDD-CNVs to date have focused on neuropsychiatric traits, absent overall medical co-morbidities. Lastly, most population cohorts/registries, or large-scale biobank studies to date of CNVs have been limited to European ancestry, limiting overall clinical generalizability. [19][20][21][22] The precise pathogenic factor within many NDD-pathogenic CNV regions remains unknown, whether gene dosage effect of protein-coding gene(s), non-coding species (i.e., microRNA) or other modifying, regulatory factors. 23,24 Further investigation of CNV carriers may elucidate not only underlying biological factors or mechanisms of pathogenesis, but also accelerate potential translation efforts and more precise therapeutic strategies.
Clinical investigation of CNV carriers is challenging however, given their rare frequency and requisite access to large-scale clinical and genetic resources, concurrently. Several research consortia focus on individual CNVs (i.e., 22q11 deletion, 3q29 deletion or 16p11.2), while other consortia focus on CNVs in aggregate, for example the Deciphering Developmental Disorders project or Enhancing Imaging Genetics through Meta-Analysis (ENIGMA-CNV), the latter investigating neuroimaging outcomes. [25][26][27] The study of multiple NDD-pathogenic CNVs is . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 12, 2021. ; https://doi.org/10.1101/2021.06.09.21258554 doi: medRxiv preprint advantageous since it permits inter-CNV comparison and assessment of composite clinical outcomes.
The current study leveraged a robust clinical genetics resource, the "BioMe" biobank, a genetics repository linked to the electronic health records (EHR) of approximately 25,000 individuals recruited across medical specialties and across ancestry and socio-economic background. 28,29 Notably, the BioMe biobank is multi-ancestry, enabling diverse query across ancestry (in contrast to most biobanks studies to date, which have focused primarily on European ancestry). In a forward genetics approach, individuals harboring NDD-pathogenic CNVs were identified without phenotypic ascertainment bias, followed by a series of novel in-silico clinical assessments. The genetics repository of the biobank enabled the identification of individuals whom harbor NDD-CNVs, while the EHR data enabled investigation of clinical indices, including diagnostic codes, laboratory serum tests, and clinical descriptive data contained within documented encounters.

"BioMe" Biobank
Participants were recruited throughout the Mount Sinai healthcare system, as per a protocol approved by the local Institutional Review Board (IRB), initiated in 2007. Participants were recruited across age and ancestry, and from clinics throughout the healthcare system of diverse medical and neuropsychiatric specialty (See Supplementary Table 1, Supplementary Methods).
In providing informed consent, Biobank participants authorized access to their de-identified healthcare records and also donated a blood sample for extraction of genetic material for research purposes. As per the approved protocol, no disclosure/feedback of genetic results would be provided, as the analyses were for research purposes. Participants were offered the option to consent to be re-contacted for future research studies.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

CNV Calling
The 64 CNVs called in the current analysis, herein termed "NDD-CNVs", were reported in previous biobank reports (i.e., UK Biobank and Geisinger DiscoverEHR), described as set of CNVs

Enrichment for Neuropsychiatric Disorders
For enrichment analyses, samples without phenotype data were excluded, as well as samples with less than two clinical encounters, to increase the reliability of phenotypic analyses, yielding n=22,279 individuals for the analyses of enrichment of neuropsychiatric disorders. A Fisher's exact test was applied to test for statistical significance of enrichment of an aggregate, set of rare, . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 12, 2021. ; https://doi.org/10.1101/2021.06.09.21258554 doi: medRxiv preprint NDD-CNVs (excluding the common 15q11.2 del/duplication) for neurodevelopmental and neuropsychiatric disease categories.

Phenome-Wide Association Analyses ("PheWAS")
To reduce potential confounding effects of relatedness, a random individual from each related pair with more than seconddegree relatedness (kinship coefficient>0.0885) was excluded, as estimated based on genotype data (See Supplementary Methods).
PheWAS was conducted using 'REGENIE', a machine-learning method, especially developed for rare variant analysis of binary (case-control) traits with unbalanced case-control ratios. 34 International Classification of Diseases codes (ICD9 and ICD10) were mapped to hierarchicallyclustered 'phecodes' and for each phecode, the number of cases and controls determined, longitudinally, with a case defined as at least 2 counts, a control as 0, while a count of 1 were set to missing for the phecode. 35,36 Phecodes were filtered for a minimum of 100 cases for each ancestry-stratified group, European, African, and Hispanic, to ensure robust analyses, so that only the more prevalent n=195 phecodes, of 1,736 phecodes across the biobank cohort, were tested in the current NDD-CNV PheWAS (See Supplementary Methods). Using REGENIE, each phecode was regressed onto CNV (binary) status with covariates of: age, sex, ancestry, principal components (PCs) 1-5 of the genotype data, and density of electronic health records (See Supplementary Methods). For each PheWAS conducted using REGENIE, Bonferroni-correction was applied to control for multiple testing. In addition to ancestry-stratified analyses, multiancestry meta-analysis, was conducted using METAL 37 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Association Analyses with Quantitative Outcome
Median BMI and serum lab values were inverse normal transformed, and regressed onto NDD-CNV status (binary variable), adjusting for covariates of age, sex, ancestry, PCs 1-5 of the genotype data, and density of electronic health records (See Supplementary Methods).

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 12, 2021. ; https://doi.org/10.1101/2021.06.09.21258554 doi: medRxiv preprint Overall, within the biobank cohort, the demographic variables of age, sex and ancestry, did not differ significantly for the subset of NDD-CNV carriers, compared to individuals without NDD-CNVs (Table 1). In addition to the multi-ancestry composition, a notable demographic feature of the NDD-CNV carriers is the mean of 50 years of age, similar to the overall BioMe cohort.
Therefore, in the current analysis, NDD-CNVs were not surveyed in a childhood or developmental context, specifically, but rather mostly in older adults. In assessing relatedness among NDD-CNV carriers, 99 pairs of relatives of at least second-degree were identified: 17 pairs in whom both relatives harbored NDD-CNVs, and 82 pairs in which one, but not both relatives, harbored a NDD- Table 3). Furthermore, among NDD-CNV carriers within the biobank, n=20 individuals harbored two NDD-CNVs, each.

Enrichment of Neuropsychiatric Disorders
Next, individuals harboring rare, NDD-CNVs were tested for enrichment of neurodevelopmental disorders (Table 3, Supplementary Table 4 and Supplementary Methods). The aggregate set of NDD-CNV carriers (n=376) were enriched for congenital disorders (Fisher's exact two-sided p-value=0.02, OR=1.8), but not schizophrenia (p=1) nor seizure disorder (p=1). In ancestry-stratified analyses of congenital disorders, NDD-CNVs were not enriched among the major three ancestry subgroups (European p=1, African p=0.16, Hispanic p=0.13) but significant for the 'other' ancestry subgroup (p-value=0.005, OR=6.9) but with a wide-ranging 95% confidence interval, owing to small group size (n=54 combined Asian, Native American and other).
Enrichment for ASD/ID trended towards significance (p=0.09) but with notably few cases represented in BioMe; within the BioMe cohort, the prevalence of childhood developmental disorders is markedly low (BioMe ASD/ID prevalence=0.13%) owing to few pediatric participants, compared to adult-onset disorders, which better approximate overall population prevalence (i.e. BioMe SCZ prevalence=1.1%).
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 12, 2021. ; https://doi.org/10.1101/2021.06.09.21258554 doi: medRxiv preprint Enrichment for adult-onset mood disorders was also queried, and NDD-CNVs were found to be enriched for major depressive disorder (MDD) (Fisher's exact two-sided p-value=0.04, OR=1. 3) with the enrichment for MDD driven by individuals of African ancestry (two-sided p-value=0.01, OR=1.8). In contrast, there was no NDD-CNV enrichment for bipolar disorder (two-sided p-value=0.83).

Association with ICD-Diagnosis Codes ('PheWAS')
To investigate phenotypic associations in the biobank broadly, beyond neuropsychiatric disorders, an agnostic phenome-wide association ("PheWAS") was next applied, to test NDD-CNV association with ICD codes, coded medical diagnoses across organ systems. 38 Overall, PheWAS across the NDD-CNV sets, yielded several significant associations (Figure 1).
In ancestry-meta-analysis of individual NDD-CNV loci, the 16p13.11 duplication was significant for association with 'congestive heart failure; non-hypertensive' (Z-score=4.1, p=3.38x10 -5 ) . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 12, 2021.

Association with Clinical Indices (BMI and Lab Values)
Further, we investigated the association of rare NDD-CNVs, in aggregate, with quantitative clinical outcomes, body mass index (BMI) and common serum lab values. The set of all rare, NDD-CNVs (excluding the more common 15q11.2 del/dup and 2q13 NPHP1 del/dup) was found to be associated with increased BMI (n=215 CNV carriers, Beta=0.14, p=0.04) (Supplementary Table   9). Interestingly, as per the ancestry-stratified analyses, for the Hispanic NDD-CNV carriers there was an association of increased BMI with NDD-CNV status (n=74, Beta=0.30, p=0.004), but not for NDD-CNV carriers of African-American (n=75, Beta=0.02, p=0.89) nor European ancestries (n=66, Beta=0.02, p=0.83). Of note, the rare, 16p11.2 deletion (associated with BMI in past reports) was found to be distributed among ancestries (n=3 African-American carriers, n=4 European carrier, n=2 Hispanic carriers). 40 For serum lab test associations, the aggregated rare, NDD-CNV set was tested for association with 38 common serum lab tests, given their widespread medical utility, including comprehensive metabolic panel, complete blood count and lipid profile tests, that are routinely performed across inpatient and outpatient clinics (Supplementary Table 10). Among the multi-ancestry NDD-CNV carriers, tested in aggregate, there were no significant associations after multiple testing correction.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 12, 2021.  19,20 The most highly-prevalent NDD-CNVs in BioMe, were also identified as most highly-prevalent in other reports (i.e. 15q11.2deletion/duplication, chr2q13(NPHP1) deletion/duplication). As compared to the Discover EHR cohort, the same NDD-CNVs that had no carriers in DiscoverEHR also had no carriers in BioMe.
In comparing the current analysis of NDD-CNVs in BioMe with previous biobank reports, the divergence in ancestry is most notable, as the UKBB and Geisinger DiscoverEHR cohorts are nearly exclusively of European ancestry (i.e. DiscoverEHR: 98% European ancestry), while BioMe is markedly multi-ancestry. Of further divergence, the UKBB was formed with an intention to recruit a 'healthy cohort', in contrast to BioMe, recruiting from within a medical system. Overall, the BioMe biobank is affected by ascertainment bias (i.e. ,~67% recruited from outpatient medicine clinics), as is any biobank, an important caveat in considering biobank comparisons.
Within BioMe there is a skew towards older adults with few pediatric participants. The advanced age range presented an opportunity for broad clinical assessment. Indeed, many investigations of NDD-CNVs to date, including their initial discovery, have focused on early-childhood . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 12, 2021. These challenges of biobank analyses, notwithstanding, the current BioMe analyses yielded interesting and notable findings. Among NDD-CNV carriers, there was a significant enrichment for congenital disorders, confirming previous reports. 5,6,19 Major depressive disorder (MDD) was also found to be significantly enriched, addressing in part, conflicting, past reports about the enrichment of NDD-CNVs in major depressive disorder. 41,42 In contrast to the UKBB finding of enrichment of MDD in the cohort of European ancestry, however, the current BioMe analysis yielded enrichment of MDD among individuals of African ancestry, but not Hispanic or European. 41 The current study used a filtering criteria of at least 2 ICD codes assigned to validate a diagnosis, and a minimum engagement of two clinical encounters, but overall the validity of ICD codes as a proxy for neuropsychiatric diagnosis warrants ongoing assessment in BioMe and other biobanks.
Future studies may incorporate phenotyping algorithms, computational tools to mine EHR clinical descriptive data, to identify affected NDD cases, as has been developed for mood disorders. 43 A previous UKBB analysis tested 58 phenotypes for individual NDD-CNV associations, identifying 46 associations (at FDR threshold of 0.1), including among the most common, obesity, hypertension, and renal failure. 20 The current BioMe analysis tested an expanded set of 195 phenotypes for association, replicating the UKBB analysis in part, identifying top-most, subthreshold enrichment for hypertension and renal failure for the aggregate, rare NDD-CNV set.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 12, 2021. ; https://doi.org/10.1101/2021.06.09.21258554 doi: medRxiv preprint Furthermore, the current analysis identified a significant association of the aggregate, rare NDD-CNV set with obstructive sleep apnea (a phenotype not tested in the UKBB analysis), for which obesity is the major risk factor. 44 The current analysis also identified association of 16p13.11 duplication with congestive heart failure (non-hypertensive), and indeed, case reports of 16p13.11 microduplication indicate increased incidence of congenital heart defects and heart disease. 45 Other ancestry-specific findings of the 15q13.3 (CHRNA7) duplication, eye disorders in individuals of European ancestry, and disease of the larynx and vocal cords in individuals of Hispanic ancestry, are phenotypes not tested in the previous UKBB analysis, nor have they been widely reported to date in 15q13.3 microduplication in pediatric cases, but may warrant further investigation. 46 For the aggregate set of NDD-CNVs, there was a significant association with increased BMI, in a multi-ancestry analysis, and further, and an ancestry-specific finding of BMI association among individuals of Hispanic ancestry. While the 16p11.2 deletion is a well-characterized risk factor for obesity, this further implicates the potential role of other NDD-CNVs in obesity. 47 For 38 common, serum lab tests, including blood count and blood chemistry, there was no identified significant association with the aggregate set of NDD-CNVs. 40 A limitation of the current analysis is that due to the rarity of NDD-CNVs, the analyses combined NDD-CNVs across loci to ensure well-powered associations, but pooling may dilute the heterogeneity of the effects of individual CNVs. Further, though each CNV was selected based on previous biobank reports as per 'pathogenicity' criteria defined by American College of Medical Genetics standards, the strength of evidence varies by NDD-CNV locus. 30,31 The associations with quantitative indices were limited to median BMI and serum lab values as outcome measures, but models using generalized linear mixed models may better fit longitudinal biobank data, which can extend over many years with repeated measurements. The current analyses employed select . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 12, 2021. ; https://doi.org/10.1101/2021.06.09.21258554 doi: medRxiv preprint strategies to control for some variables within the EHR (i.e., threshold of number of clinical encounters for inclusion, covarying for density of electronic health records, use of REGENIE method for case/control imbalance), but further methodological innovations and strategies specific to large-scale, biobank analyses may be incorporated as they continue to emerge. 48 Ongoing large-scale, clinical genetic investigations as herein described, based on genetic stratification (i.e., 'precision psychiatry'), may lead to translational opportunities, clinical insights of at-risk individuals, as well as novel therapeutic strategies targeting specific genetic variants.
The importance of diverse inclusion within biobanks and considering the effect of rare, genetic variants in a multi-ancestry context, is evident.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 12, 2021. ; https://doi.org/10.1101/2021.06.09.21258554 doi: medRxiv preprint . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 12, 2021.  . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 12, 2021.  . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 12, 2021. ; https://doi.org/10.1101/2021.06.09.21258554 doi: medRxiv preprint  Table 7 and Supplementary Table 8  is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 12, 2021. ; https://doi.org/10.1101/2021.06.09.21258554 doi: medRxiv preprint