De novo mutation hotspots in homologous protein domains identify function-altering mutations in neurodevelopmental disorders

Variant interpretation remains a major challenge in medical genetics. We developed Meta-Domain HotSpot (MDHS) to identify mutational hotspots across homologous protein domains. We applied MDHS to a dataset of 45,221 de novo mutations (DNMs) from 31,058 patients with neurodevelopmental disorders (NDDs) and identified three significantly enriched missense DNM hotspots in the ion transport protein domain family (PF00520). The 37 unique missense DNMs that drive enrichment affect 25 genes, 19 of which were previously associated with NDDs. 3D protein structure modelling supports function-altering effects of these mutations. Hotspot genes have a unique expression pattern in tissue, and we used this pattern alongside in silico predictors and population constraint information to identify candidate NDD-associated genes. We also propose a lenient version of our method, which identifies 32 hotspot positions across 16 different protein domains. These positions are enriched for likely pathogenic variation in clinical databases and DNMs in other genetic disorders.


23
Variant interpretation remains a major challenge in medical genetics. We developed Meta-24 Domain HotSpot (MDHS) to identify mutational hotspots across homologous protein 25 domains. We applied MDHS to a dataset of 45,221 de novo mutations (DNMs) from 31,058 26 patients with neurodevelopmental disorders (NDDs) and identified three significantly 27 enriched missense DNM hotspots in the ion transport protein domain family (PF00520). The 28 37 unique missense DNMs that drive enrichment affect 25 genes, 19 of which were 29 previously associated with NDDs. 3D protein structure modelling supports function-altering 30 effects of these mutations. Hotspot genes have a unique expression pattern in tissue, and we 31 used this pattern alongside in silico predictors and population constraint information to 32 identify candidate NDD-associated genes. We also propose a lenient version of our method, 33 which Introduction 41 The interpretation of sequence variation in the context of disease remains one of the biggest 42 challenges in genetics. De novo mutations (DNMs) in protein-coding genes are an established 43 cause of neurodevelopmental disorders (NDDs) 1 , and roughly ~45% of NDDs are caused by 44 a DNM in a protein-coding gene. 2,3 By modelling the probability of DNMs occurring in 45 specific genes, one can identify genes that are enriched for DNMs in patient cohorts, 46 provided that large enough cohorts are available. This statistical identification of NDD-47 associated genes requires ever-larger patient collections. 3-7 A recent study of DNMs in 48 31,058 patients with NDDs concluded that NDD-association of genes still is far from 49 saturated and that over a thousand NDD-associated genes are still to be identified. 7 50 51 Several studies of NDD patients have found that for specific genes, missense DNMs cluster 52 in functional regions, and that this fact can be used to identify disease-associated genes. 7-9 53 Conserved protein domains are of particular interest, because they harbour ~71% 10 of all 54 curated disease-causing missense variants in the Human Gene Mutation Database (HGMD) 11 55 and ClinVar 12 . Indeed, missense DNMs in NDD genes are almost three times more likely 56 located in protein domains. 7 Clustered missense DNMs in these genes may not act through 57 haploinsufficiency, but rather through dominant negative or gain-of-function effects. 8,9 The 58 detection of mutation clusters, or hotspots, can be a crucial step towards associating genes 59 with NDDs 13 and for gaining insight into underlying disease mechanisms. 9 60 61 Aggregation of variation across homologous domains can be a useful method to gain insight 62 into patterns of variation 10,14,15 and therefore can also increase statistical power to detect 63 hotspot positions. Methods such as mCluster 16

72
General description of the data and the processing steps 73 To identify hotspots of de novo mutations in homologous protein domains, we computed 74 MDHS (Equation 1) based on unique DNMs in NDD patients (Figure 1). This was done for 75 each variant type separately (missense, synonymous, nonsense). We first mapped 45,221 76 DNMs resulting from 31,058 patients with developmental disorders 7 onto gene transcripts 77 using MetaDome 14 (Methods). Then, we aggregated these to 12,389 meta-domain 10 positions 78 (Supplementary Data S1). In the 25 homology models we found that hotspot p.96 ( Figure 3A) and p.102 ( Figure 3B) 116 are part of the voltage-sensing helix that is important for the channel (in-)activation. 19 These 117 results are in line with functional studies that have been performed for missense mutations at 118 two of these hotspots. 20,21 Hotspot p.231 ( Figure 3C) is part of the channel gate at the end of 119 a transmembrane helix (Supplementary Data S8). In addition, we found that missense 120 mutations follow a specific pattern for each of these hotspots. Of the 13/16 missense DNMs 121 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 21, 2022. ; https://doi.org/10.1101/2022.03.19.22272594 doi: medRxiv preprint located at hotspot p.96 and 20/20 at p.102 change the positively charged wild-type residue to 122 lose the positive charge. Losing positive charges at these locations has previously been 123 described to trigger a function altering disease-mechanism ( Figure 3AB). 20 altering mechanism of disease ( Figure 3C). 22 Overall, this shows that missense mutations at 128 the identified hotspots are likely deleterious to domain function. 129 130 Stringent hotspot genes are constrained against missense and loss of function variation 131 Dominant NDD genes are characterized by population constraint against damaging genetic 132 variation. 23, 24 We compared observed counts of loss of function, missense, and synonymous 133 variants in control, NDD-associated, hotspot and proposed novel hotspot genes in gnomAD 134 v2 to expected counts based on a null mutational model. 25 Both hotspot and novel hotspot 135 genes are constrained against loss of function and missense variation ( Figure 4A). Novel 136 hotspot genes have lower constraint against loss of function variation than hotspot genes 137 (Supplementary Data S9). We also considered whether hotspots were found in regions of 138 particular constraint against missense variation within genes. In total, 2,700 genes have 139 statistical evidence of regional differences in missense constraint. 25 Of these, 16 are hotspot 140 genes, representing a significant enrichment compared to control genes (Fisher's exact test p 141 < 2.2 x 10 -16 ) and NDD-associated genes (Fisher's exact test p = 0.002, Supplementary 142 Figure 1). Three are proposed novel hotspot genes (KCNH5, CACNA1B, TPCN1). Using 143 regional missense constraint information, we show that PF00520 domains in hotspot genes 144 are significantly more constrained against missense variation than PF00520 domains in 145 control genes (p = 1.4 x 10 -7 , Wilcoxon rank-sum test; Figure 4B), but similarly constrained 146 compared to NDD-associated genes without a hotspot that also contain a PF00520 domain (p 147 = 0.65, Wilcoxon rank-sum test). 148 149 Brain-specific expression of stringent hotspot genes 150 We analysed the expression of the 19 known hotspot genes in approximately 948 donors 151 across 54 tissues from the GTEx v8 release. 26 We observed that NDD genes and control 152 genes have distinct gene expression patterns, with a higher proportion of NDD genes 153 constitutively expressed across all tissues (p < 2.2 x 10 -16 , Fisher's exact test; 154 Supplementary Table 5). Hotspot genes share a characteristic expression pattern compared 155 to these two groups ( Figure 5A), with a significantly higher proportion of hotspot genes 156 expressed in the brain compared to control genes and significantly lower proportion 157 expressed in all other tissues compared to NDD genes (in 40/42 non-brain tissues, 158 Supplementary Data S10). Given this tissue-specific expression pattern, we grouped GTEx 159 tissues into two tissue groups (brain and other tissues, Methods). The hotspot gene set is 160 significantly enriched for genes with higher expression in brain compared to control genes 161 (89.4% versus 19.8% expressed higher in brain, p = 2.985 x 10 -5 , Fisher's exact test) and 162 NDD genes (89.4% versus 31.3%, p = 0.002, Fisher's exact test) (Supplementary Figure 2). 163 Only two hotspot genes do not have higher expression in brain: SCN10A, which is 164 constitutively unexpressed across tissues in GTEx samples, and CACNA1C (median TPM in 165 brain = 2.94, median TPM in other tissues = 4.35). We further show that this expression 166 pattern is not characteristic of all genes containing an ion transport domain, but only the 167 subset of these genes statistically associated to NDDs (Supplementary Data S11, 168 Supplementary Figure 3-4). 169 170 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) We also compared the TPM distribution in brain and other tissues for control genes, NDD-171 associated genes, hotspot genes, and the six novel hotspot genes (Methods; Figure 5B). Both 172 hotspot and NDD-associated genes had significantly higher TPM in brain tissues than control 173 genes (p = 0.0039 and p < 2.2 x 10 -16 , Wilcoxon rank-sum test), and both hotspot genes and 174 control genes had significantly lower TPM in other tissues than NDD-associated genes (p < 175 2.2 x 10 -16 and p = 2.08 x 10 -8 , Wilcoxon-rank sum test; Supplementary Figure 5).

176
Modelling suggests that CACNA1B and KCNG1 belong to the hotspot gene distribution by 177 odds ratio (Supplementary Data S12). Additionally, we find 6 NDD-associated PF00520 178 domain-containing genes (HCN1, TRPV3, KCNQ5, KCNC1, KCNC3, TRPM3) that also 179 belong to the hotspot gene distribution by odds ratio. We hypothesize that missense mutations 180 at hotspot positions in these 6 genes may also cause NDDs. 181 182 Lenient DNM hotspots identified using MDHS 183 We also implemented a lenient version of MDHS that counts all missense variants at protein 184 consensus positions, even if they recur between individuals (see Methods, Supplementary 185 Figure 6). Counting recurrent missense variants gives us more power to detect hotspot 186 positions, but these hotspot positions may be driven by missense variation in a single domain. 187 We detected 32 lenient missense DNM hotspots across 16 Pfam protein domain families 188 using this method (Supplementary Table 2). We find a significant enrichment of genes 189 statistically associated to NDD (Fisher's exact p < 2.2 x 10 -16 ) and DDG2P genes (Fisher's 190 exact p < 2.2 x 10 -16 ) at lenient hotspot positions (Supplementary Figure 7). 191 192 We also find that missense variants at these positions are significantly more likely to be 193 pathogenic or likely pathogenic in clinical databases (VKGL, Figure 6A; ClinVar, Figure Table 6).

202
Lenient hotspots are enriched for missense variation in autism-spectrum disorders 203 We investigated if we could find further evidence for the identified lenient hotspots in a 204 combined cohort of publicly available de novo mutation datasets for autism-spectrum 205 disorders (ASD; 11,986 ASD probands, 35,584 individuals) and congenital heart defects 206 (CHD; 2,654 trios) alongside DNMs from unaffected individuals (1,740 ASD siblings, 1,548 207 population controls; see Methods). 208 209 We observe a significant enrichment of missense DNMs at hotspot positions in NDD and 210 ASD cohorts compared to unaffected individuals (Fisher's exact p = 3.5 x 10 -13 , NDD; 211 Fisher's exact p = 0.007, ASD; Fisher's exact p = 0.07, CHD; Figure 7A, Supplementary 212 Table 7). We observe no significant enrichment in synonymous DNMs at hotspot positions in 213 any cohort (Supplementary Table 8). However, we predict that some lenient hotspot 214 positions are driven by mutational processes or ascertainment bias in particular genes and not 215 necessarily by the cumulative effect of pathogenic mutations across several genes. To correct 216 for this, we also tested for an enrichment of missense variants unique to ASD and CHD 217 probands at lenient hotspots ( Figure 7B, Supplementary Table 9). We found a significant 218 enrichment of these unique missense variants in ASD probands (Fisher's exact p = 0.047) but 219 not in CHD probands (Fisher's exact p = 1). The majority (10/13, 77%) of the missense 220 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 21, 2022. ; https://doi.org/10.1101/2022.03.19.22272594 doi: medRxiv preprint variants driving this enrichment in ASD probands are in genes statistically associated to 221 NDDs. 7 222 223 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 21, 2022. observe that some PF00520 domain-containing NDD-associated genes have lower expression 242 in brain but have a similar level of tissue-specific expression in a non-brain tissue. SCN4A, 243 for example, is not expressed in brain and is predominantly expressed in skeletal muscle. 244 Although the expression pattern of SCN4A is different from the hotspot genes presented in 245 our analysis, hotspot positions in SCN4A are similarly constrained against missense variation 246 ( Figure 5B). We hypothesise that phenotypes associated with pathogenic mutations at 247 hotspot positions may vary depending on where the mutated gene is expressed.  constraint was compared across PF00520 domain-containing control genes (blue), PF00520 390 domain-containing NDD-associated genes (green) and hotspot genes (yellow). 391 392 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 21, 2022. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

521
Regional missense constraint 522 Genes with regions of differential missense constraint were identified as described by 523 Samocha et. al, 2017 in the ExAC 49 dataset. In brief, the fraction of observed missense 524 variation along a transcript was tested for uniformity using a likelihood ratio test. If the 525 distribution was not uniform, the transcript was considered to have evidence of regional 526 missense constraint. 2,700 genes showed evidence of at least two regions of distinct missense 527 constraint using this method. 528 529 Expression data 530 Pre-computed tissue expression values in transcripts per million (TPM) were taken from 531 GTEx v8 (GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz). 26,50 532 533 Gene sets for constraint and expression analysis 534 The set of 56,200 genes for which median TPM values were available was divided into one 535 of four sets: proposed novel hotspot genes, hotspot genes, NDD-associated genes, and control 536 genes. Genes containing a mutation hotspot were divided into two categories: hotspot genes, 537 n = 19, and proposed novel hotspot genes, n = 6. These categories were distinguished by 538 presence on the DD gene list (see Developmental disorder diagnostic gene lists); hotspot 539 genes are present on this list while proposed novel hotspot genes are not. The remaining 540 genes were divided into NDD-associated genes (n = 992, excluding hotspot genes) and 541 control genes not statistically associated with intellectual disability or developmental delay (n 542 = 55,183). For some analyses, control genes present in DDG2P (n = 1,250, accessed 22-04-543 2021) or OMIM (n = 3,402, accessed 31-08-2021) but not statistically associated to NDDs 544 were considered separate classes. Additionally, only some of these genes had population 545 constraint information available from gnomAD v2.1.1 (n = 19,658). Genes in all sets can be 546 found in Supplementary Data S15. 547 548 Of the 56,200 genes described above, 105 contained a PF00520 domain in MetaDome v1.0.1. 549 These 105 genes represented 19 hotspot genes, 6 proposed novel hotspot genes, 12 NDD-550 associated genes not containing a missense DNM at a hotspot position in our cohort, and 68 551 control genes (of which 6 are present in DDG2P and 26 in OMIM; Supplementary Data 552 S16). 553 554 Proportion of expressed genes across GTEx tissues 555 A fixed level of TPM > 1 was used to define expression in each tissue. NDD-associated and 556 control genes were randomly sampled 1000 times into sets of 19 genes, and the proportion of 557 expressed genes (number of genes with TPM > 1/total number of genes) was calculated for 558 each set. This generated a distribution of proportions across 54 GTEx tissues. The proportion 559 of expressed hotspot genes per tissue was computed without sampling. 560 561 TPM differences between tissue groups 562 In order to assess expression differences between brain and other tissues, GTEx tissues were 563 divided into two groups. We considered the amygdala, anterior cingulate cortex (BA24), 564 caudate (basal ganglia), cerebellar hemisphere, cerebellum, cortex, frontal cortex (BA9), 565 hippocampus, hypothalamus, nucleus accumbens (basal ganglia), putamen (basal ganglia), 566 spinal cord (cervical c-1), and substantia nigra brain tissues (n = 12). All non-brain tissues 567 were included in the 'other tissues' set (n = 42). The TPM value for each tissue set was 568 defined as the median TPM of all tissues in the set. Based on these differences, we modelled 569 the brain TPM distribution in hotspot genes and control genes and the other tissue TPM 570 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 21, 2022. ; https://doi.org/10.1101/2022.03.19.22272594 doi: medRxiv preprint distribution in hotspot genes and NDD-associated genes as normal distributions. For each set 571 of distributions, a likelihood ratio of belonging to each distribution was calculated. Proposed 572 novel hotspot genes were considered to have evidence for association with NDDs if they 573 were more likely to belong to the hotspot gene distribution across both tests. 574 575 Filtering and annotation of additional de novo mutation cohorts 576 We analysed the enrichment of missense and synonymous DNMs in lenient hotspot positions 577 across a total of three additional published DNM cohorts. We used an autism-spectrum 578 disorder ( Annotation of meta-domain protein consensus positions for these datasets was done as 586 previously CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 21, 2022. ; https://doi.org/10.1101/2022.03.19.22272594 doi: medRxiv preprint medRxiv Note 636 Our Supplementary Data S7 file (YASARA protein structures) cannot be made available 637 through medRxiv. Please contact the corresponding author for this data. 638 639 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 21, 2022. ; https://doi.org/10.1101/2022.03.19.22272594 doi: medRxiv preprint