Rare coding variants in five DNA damage repair genes associate with timing of natural menopause

The age of menopause is associated with fertility and disease risk, and its genetic control is of great interest. We used whole-exome sequences from 119,992 women in the UK Biobank to test for associations between rare damaging variants and age at natural menopause. Rare damaging variants in three genes significantly associated with menopause: CHEK2 (p = 6.2 x 10^-51) and DCLRE1A (p = 1.2 x 10^-12) with later menopause and TOP3A (p = 8.8 x 10^-8) with earlier menopause. Two additional genes were suggestive: RAD54L (p = 2.3 x 10^-6) with later menopause and HROB (p = 2.7 x 10^-6) with earlier menopause. In a follow-up analysis of repeated questionnaires in women who were initially pre-menopausal, CHEK2, TOP3A, and RAD54L genotype associated with subsequent menopause. Consistent with previous GWAS, all five genes are involved in the DNA-damage repair pathway. Phenome-wide scans across 363,977 men and women revealed that in addition to known associations with cancers and blood cell counts, rare variants in CHEK2 also associated with increased risk of uterine fibroids, polycystic ovary syndrome, and prostate hypertrophy; these associations are not shared with higher-penetrance breast cancer genes. Causal mediation analysis suggests that approximately 8% of the breast cancer risk conferred by CHEK2 pathogenic variants after menopause is mediated through delayed menopause.


The age at natural menopause (ANM) varies widely between women and understanding the biology of menopause timing is important because early menopause is associated with risk of cardiovascular disease and osteoporosis, and late menopause is associated with risk of breast cancer [1]. ANM is also strongly related to fertility, because natural fertility ends on average ten years before menopause [2].

Genetic and environmental factors both correlate with ANM. Lower socioeconomic status, fewer live births, not using oral contraceptives, and smoking have been consistently associated with earlier ANM [1]. Twin studies and family studies established that ANM is a highly heritable trait [3, 4]. Population-based genome-wide association studies (GWAS) have identified many common polymorphisms associating with ANM [5][6][7][8][9][10][11]. These ANM- [...] an association between common ANM-associated variants and breast cancer risk, but not conversely between common breast cancer-associated variants and ANM. This observation supports a causal relationship between variation in lifetime endogenous estrogen exposure (resulting from variation in the duration between menarche and menopause) and risk of breast cancer [11].

GWAS are limited in their ability to identify causal genes because the majority of associated common haplotypes contain no coding variants, and the closest gene to a noncoding variant is not reliably the causal gene in the absence of functional evidence. Indeed, the variants associated with ANM near tumor suppressor genes BRCA1 and CHEK2 have not been the same pathogenic variants implicated in breast cancer. To better identify causal genes for ANM by discovering rare coding variants strongly associated with ANM, we used exome sequencing data from 119,992 women in the UK Biobank.
After identifying [...]

Methods

UK Biobank study

The UK Biobank consists of approximately 500,000 volunteer participants, who were aged 40-69 when recruited between 2006 and 2010 [12, 13]. Both array genotyping and whole exome sequencing have been performed on the majority of these participants. Data from genotyping, sequencing, questionnaires, primary care data, hospitalization data, cancer registry data, and death registry data were obtained through application number 26041.

Variant calling and definition

The source of genetic data for the main analysis was exome sequencing data. DNA from whole blood was extracted and sequenced by the Regeneron Genetics Center (RGC) using protocols described elsewhere [14]. Of the variants called by RGC, additional quality control filters were applied: Hardy-Weinberg equilibrium (among the White subpopulation) p-value less than 10^-10 and rate of missing calls across individuals less than 2%. Variants were then annotated using ENSEMBL Variant Effect Predictor (VEP) v95 [15], using the LOFTEE plug-in to additionally identify high-confidence predicted protein-truncating variants (PTVs, also known as predicted loss-of-function, pLOF) [16]. Variants were also annotated with the Whole Genome Sequence Annotator (WGSA) [17] to add Combined Annotation Dependent Depletion (CADD) scores [18] to predict deleteriousness of missense variants. For single-variant analyses, a minor allele frequency filter was imposed such that only variants present in at least ten individuals with phenotype data were retained. For rare variant burden analyses, variants were defined as rare if their minor allele frequency was under 1%.
Variants were aggregated in each protein-coding gene as follows: PTV variants were defined as variants with their most severe consequence from VEP as "stop gained", "splice donor", "splice acceptor", or "frameshift," and their confidence from LOFTEE as "HC" (high confidence). Damaging missense variants were defined as variants with their most severe consequence from VEP as "missense" and a CADD score of 25 or greater. Genes were defined as [...]
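The aggregation rules above can be illustrated with a short Python sketch. It is not the authors' pipeline: the `Variant` record, the function names, the example variants, and the exact spelling of the consequence labels are assumptions made for illustration. Only the cutoffs (minor allele frequency under 1%, LOFTEE confidence "HC", CADD of 25 or greater) come from the text, and the combined "PTV+missense" set follows the two variant sets described in the results.

```python
from collections import defaultdict
from dataclasses import dataclass

# Consequence labels as quoted in the text; the exact strings emitted by a
# given VEP release may differ, so these spellings are assumptions.
PTV_CONSEQUENCES = {"stop gained", "splice donor", "splice acceptor", "frameshift"}


@dataclass(frozen=True)
class Variant:
    variant_id: str   # hypothetical identifier
    gene: str
    consequence: str  # most severe VEP consequence
    loftee: str       # LOFTEE confidence ("HC", "LC", or "" if not a PTV)
    cadd: float       # CADD score
    maf: float        # minor allele frequency


def classify(v, maf_cutoff=0.01, cadd_cutoff=25.0):
    """Return "PTV", "damaging_missense", or None for one variant."""
    if v.maf >= maf_cutoff:          # burden tests use rare variants only
        return None
    if v.consequence in PTV_CONSEQUENCES and v.loftee == "HC":
        return "PTV"
    if v.consequence == "missense" and v.cadd >= cadd_cutoff:
        return "damaging_missense"
    return None


def build_gene_sets(variants):
    """Group qualifying variants per gene into a PTV-only set and a PTV+missense set."""
    sets = defaultdict(lambda: {"PTV": [], "PTV+missense": []})
    for v in variants:
        label = classify(v)
        if label is None:
            continue
        if label == "PTV":
            sets[v.gene]["PTV"].append(v.variant_id)
        sets[v.gene]["PTV+missense"].append(v.variant_id)
    return dict(sets)


example = [
    Variant("var1", "GENE_A", "frameshift", "HC", 30.0, 0.002),
    Variant("var2", "GENE_A", "missense", "", 27.1, 0.0005),
    Variant("var3", "GENE_A", "missense", "", 12.0, 0.0005),      # CADD below 25
    Variant("var4", "GENE_A", "stop gained", "LC", 35.0, 0.001),  # not high confidence
    Variant("var5", "GENE_B", "missense", "", 29.0, 0.05),        # too common
]
gene_sets = build_gene_sets(example)
```

In this toy input only `var1` and `var2` qualify, so `GENE_A` receives a PTV set with one variant and a combined set with two, while `GENE_B` gets no variant set at all.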

medRxiv preprint doi: https://doi.org/10.1101/2021.04.18.21255506; this version posted April 20, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

An additional 1,893 women were excluded from the main analysis because they had an age of menopause less than 40, and an additional 249 women were excluded because they reported menopause after 60 or were premenopausal and over 60 when interviewed. These women were used for subsequent replication analysis of extreme phenotypes. These exclusions resulted in a set of 119,992 women for the main analysis (132,377 women for the confirmatory SAIGE-GENE analyses that included related individuals).

A more restrictive subset of menopause data was then constructed as a subset of the main analysis set ("neoplasm- and surgery-free") which did not rely on the relative timing of potentially menopause-inducing operations and which excluded anyone with any neoplasm preceding the time of interview. Women were excluded if they reported having had bilateral oophorectomy, hysterectomy, or doctor-diagnosed cancer at the time of their initial assessment, regardless of the age at which these had occurred, and if they had any cancer or neoplasm diagnosis in the national cancer registries in the year of the interview or earlier. Women were also excluded based on procedures found in ICD10 hospital diagnoses and OPCS4 operation codes in a calendar year the same as or earlier than the interview, using the ICD10 and OPCS4 codes listed previously and also adding the following: ICD10 code Z40.0 (prophylactic surgery for risk factors related to malignant neoplasms) and OPCS codes corresponding to all of chapter Q (operations on the upper female reproductive tract), except for codes purely relating to diagnostic examination: Q18, Q39, Q50, or Q55. This "neoplasm- and surgery-free" subset consisted of 92,387 women.
Time-to-event analysis

The main discovery analysis consisted of a time-to-event analysis of the association between carrier status of rare variants aggregated per gene and menopause age. For each variant set, a Cox proportional hazards model was constructed using the survival R package [23]. A right-censored survival object was created where the status indicator was zero for women who were premenopausal and one for women who were postmenopausal, and follow-up time was the age at interview for premenopausal (censored) women and age at menopause for postmenopausal women. Ties were handled using the default method (Efron approximation). With this survival object as the dependent variable, the independent variable in each model was the presence (coded as one) or absence (coded as zero) of at least one alternate allele from the variant set in each individual; multiple variants and zygosity were not considered for the burden test. Covariates were year of birth (to account for secular trends in menopause age [24]) and the first twelve genetic PCs (to account for associations arising purely from population stratification).

Time-to-event analysis was also performed for individual exome and array-typed variants; these tests were performed as above, except the independent variable was the genotype of the individual variant, coded as zero (homozygous reference), one (heterozygous), or two (homozygous alternate). When both exome and array data were available for a variant, the exome genotype data was used; when only array data was available for a variant, the association was only calculated for the subset of exome-sequenced individuals who also had array data available.

To determine the extent to which linkage disequilibrium was driving multiple associations seen in regional patterns of single-variant association results at the CHEK2 and RAD54L loci, associations in the region were tested again including the genotypes of top association signals as covariates in the regression (rs34001746 at CHEK2 and rs12142240 and rs12073998 at RAD54L).
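The coding of the survival outcome and of the genetic predictors described in this section can be sketched in Python as follows. The analysis itself used the survival R package; this fragment only shows how the inputs to such a model might be prepared. The function names and the example participants are hypothetical.

```python
def survival_record(postmenopausal, age_at_interview, age_at_menopause=None):
    """Return (follow_up_time, status) for a right-censored survival object.

    status 1 = menopause observed, time is age at menopause
    status 0 = censored (still premenopausal), time is age at interview
    """
    if postmenopausal:
        if age_at_menopause is None:
            raise ValueError("postmenopausal participants need an age at menopause")
        return (age_at_menopause, 1)
    return (age_at_interview, 0)


def burden_carrier(alt_allele_counts):
    """Burden-test predictor: 1 if any variant in the gene's set carries at
    least one alternate allele, else 0. The number of variants carried and
    zygosity are deliberately ignored."""
    return int(any(count > 0 for count in alt_allele_counts))


def additive_genotype(alt_allele_count):
    """Single-variant predictor: 0 (hom. reference), 1 (het.), 2 (hom. alternate)."""
    if alt_allele_count not in (0, 1, 2):
        raise ValueError("expected 0, 1, or 2 alternate alleles")
    return alt_allele_count


# Hypothetical participants: (postmenopausal, age at interview, age at menopause)
participants = [(True, 58, 51), (False, 47, None), (True, 62, 49)]
records = [survival_record(*p) for p in participants]

# Hypothetical alternate-allele counts across three variants of one gene set
carrier_flags = [burden_carrier(g) for g in ([0, 0, 0], [0, 1, 0], [2, 1, 0])]
```

A homozygous carrier and a carrier of two different variants both receive the same burden code of one, which is the simplification the text describes.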

Linear and logistic regression analysis
Secondary burden analyses were performed for menopause timing: menopause status at interview, as a case-control trait, was tested using logistic regression with age at interview, year of birth, and the first twelve genetic PCs as covariates; and age of menopause among postmenopausal women, as a quantitative trait, was tested using linear regression with year of birth and the first twelve genetic PCs as covariates. These regressions were performed using the glm function in R.

Because of known issues with high Type 1 error rate when performing association tests between rare variants and rare diagnoses, and because of potential confounding by population stratification even when excluding related individuals, SAIGE-GENE [25] was used for confirmation of some associations. SAIGE-GENE uses a generalized mixed model taking subject relatedness into account and uses a saddle point approximation to estimate p-values. SAIGE-GENE was performed across the entire White population, [...]

[...] data from the death registry was downloaded on June 12, 2020, and data from hospital diagnoses was downloaded on July 15, 2020. Regressions were performed using linear or logistic regression as appropriate using the glm function in R, with age, sex, country of recruitment (England, Scotland, or Wales), availability of GP data, and the first 12 PCs as covariates.

Mediation analysis was performed using the R package mediation [27]. Data from the cancer registry was used to limit the analysis to only test breast cancer occurring after 60 as cases and women who were breast cancer free and over 60 as controls. The analysis was also limited to women who reported an ANM between ages 40 and 60 at the initial interview, to ensure for simplicity of the model that breast cancer occurred after menopause.
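To make the idea of a mediated proportion concrete, here is a simplified Python sketch. It uses a linear product-of-coefficients decomposition with a percentile bootstrap, which is a stand-in and not the algorithm implemented in the R mediation package. The exposure `x` plays the role of carrier status, the mediator `m` the role of ANM, and the outcome `y` the role of breast cancer, treated here as a continuous quantity for simplicity. The toy data and all function names are assumptions.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def _cov(u, v):
    mu, mv = _mean(u), _mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def proportion_mediated(x, m, y, eps=1e-9):
    """Fit m ~ x and y ~ x + m by least squares (closed form) and return
    (indirect, total, indirect / total), or None if the data are degenerate."""
    vx, vm = _cov(x, x), _cov(m, m)
    cxm, cxy, cmy = _cov(x, m), _cov(x, y), _cov(m, y)
    det = vx * vm - cxm ** 2
    if vx < eps or abs(det) < eps:
        return None                       # no variation, or x and m collinear
    a = cxm / vx                          # exposure -> mediator
    b = (cmy * vx - cxy * cxm) / det      # mediator -> outcome, given exposure
    direct = (cxy * vm - cmy * cxm) / det # exposure -> outcome, given mediator
    indirect = a * b
    total = direct + indirect
    if abs(total) < eps:
        return None
    return indirect, total, indirect / total


def bootstrap_ci(x, m, y, n_boot=1000, seed=0, alpha=0.05):
    """Percentile bootstrap interval for the proportion mediated."""
    rng = random.Random(seed)
    n = len(x)
    props = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        res = proportion_mediated([x[i] for i in idx],
                                  [m[i] for i in idx],
                                  [y[i] for i in idx])
        if res is not None:               # skip degenerate resamples
            props.append(res[2])
    if not props:
        raise ValueError("all bootstrap resamples were degenerate")
    props.sort()
    lo = props[int((alpha / 2) * len(props))]
    hi = props[max(int((1 - alpha / 2) * len(props)) - 1, 0)]
    return lo, hi


# Toy data: carriers (x = 1) have a mediator shifted by 3 on average, and the
# outcome is built as y = 1.0 * x + 0.5 * m, so direct = 1.0 and indirect = 1.5.
x = [0, 0, 0, 0, 1, 1, 1, 1]
m = [49, 50, 51, 50, 52, 53, 54, 53]
y = [xi + 0.5 * mi for xi, mi in zip(x, m)]
indirect, total, prop = proportion_mediated(x, m, y)  # prop = 1.5 / 2.5 = 0.6
ci_low, ci_high = bootstrap_ci(x, m, y, n_boot=200, seed=1)
```

With a binary outcome such as breast cancer status, a logistic outcome model would be the more natural choice. The linear form is used here only to keep the arithmetic transparent.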

Among UK Biobank participants with exome sequencing data currently available, we identified a subset of unrelated White women for genetic analysis who had not experienced any treatments that would preclude natural menopause. Of these, 78,316 reported being postmenopausal and reported an age at menopause, while 41,676 reported being pre-menopausal at their initial interview (Figure S1). We defined two sets of rare exome variants to aggregate for each gene: one set consisting only of protein-truncating variants (PTVs), also known as predicted loss-of-function (pLOF) variants, and a second set including both PTVs and rare missense variants bioinformatically predicted to be deleterious. We analyzed variant sets in genes that had at least ten carriers (12,467 genes with enough PTV carriers and 16,078 genes with enough PTV and missense carriers). We performed three association tests for each gene: a burden test using current menopause status (pre- or post-) as a case-control trait in a logistic regression model, a burden test using ANM among post-menopausal women as a quantitative trait in a linear regression model, and a time-to-event (TTE) analysis using both groups jointly in a Cox proportional hazards model (Table S1; QQ plots and λGC calculations, Figure S2). In the TTE analyses, aggregated variants in three genes surpassed a conservative threshold Bonferroni corrected for the total [...] Table 1).
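The arithmetic behind a Bonferroni correction over these gene counts can be sketched in Python. The gene counts are taken from the paragraph above. The family-wise error rate of 0.05 and the choice to divide by the sum of the two scans are assumptions, since the exact threshold wording is not preserved in this text. The p-values are those quoted in the abstract.

```python
N_PTV_GENES = 12_467            # genes with enough PTV carriers
N_PTV_MISSENSE_GENES = 16_078   # genes with enough PTV and missense carriers
ALPHA = 0.05                    # assumed family-wise error rate


def bonferroni_threshold(n_tests, alpha=ALPHA):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def passes(p_value, threshold):
    return p_value < threshold


# Threshold for each scan on its own, and for both scans combined
per_scan = {
    "PTV": bonferroni_threshold(N_PTV_GENES),                    # about 4.0e-6
    "PTV+missense": bonferroni_threshold(N_PTV_MISSENSE_GENES),  # about 3.1e-6
}
combined = bonferroni_threshold(N_PTV_GENES + N_PTV_MISSENSE_GENES)  # about 1.75e-6

# Gene-level p-values as reported in the abstract
reported = {"CHEK2": 6.2e-51, "DCLRE1A": 1.2e-12, "TOP3A": 8.8e-8,
            "RAD54L": 2.3e-6, "HROB": 2.7e-6}
significant = sorted(g for g, p in reported.items() if passes(p, combined))
suggestive = sorted(g for g, p in reported.items()
                    if not passes(p, combined) and passes(p, min(per_scan.values())))
```

Under these assumptions CHEK2, DCLRE1A, and TOP3A clear the combined threshold, while RAD54L and HROB clear only the single-scan thresholds, which matches the paper's split into three significant and two suggestive genes.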

To account for potential inaccuracy in recalling the timing of hysterectomy and oophorectomy relative to menopause in the interview data, which could have resulted in some residual surgically-induced menopause events in our analysis, we created a more conservative set of ANM values that excluded women with any self-reported or registry-recorded cancer or gynecological operations before the date of the questionnaire. All five genes remained associated with menopause in this subset of women (Table S1).

Although our discovery analysis excluded closely related individuals, cryptic relatedness cannot be excluded as a source of population stratification. To account for relatedness, we repeated the quantitative trait and case-control analyses using the SAIGE-GENE algorithm [25], which accounts for genetic relatedness; because relatedness is explicitly modeled, we performed this analysis in an expanded set of 132,377 women which included related individuals. All of these associations remained significant by SAIGE-GENE (Table S2).

We sought additional evidence outside of the discovery dataset for these associations in two ways. First, we performed a TTE analysis in a subset of 4,701 women who were premenopausal at the initial interview and who participated in a follow-up interview 2 to 14 years later, 3,630 of whom were subsequently postmenopausal and had a known ANM at the time of their most recent follow-up interview. For all seven variant sets associated in the discovery analysis, the 95% confidence interval in the follow-up analysis was consistent with that of the discovery analysis. For CHEK2, RAD54L, and TOP3A, rare variants associated with menopause timing in the follow-up interview at p < 0.05 (Table S3; Figure S3).

Second, we looked at two extreme phenotypes we had excluded from the discovery analysis: 1,893 women with primary ovarian insufficiency (POI), defined as menopause before the age of forty, and 249 women with menopause after the age of sixty. At a threshold of p < 0.05, we detected a depletion of CHEK2 and DCLRE1A carriers and an enrichment of TOP3A carriers in women with POI, and an enrichment of CHEK2 carriers in women with menopause after sixty (Table S4).

To more thoroughly evaluate genetic mechanisms of association with menopause timing at these five loci, we tested all individual coding variants contributing to these associations with at least ten carriers (Table S5). The most significant association in CHEK2 was rs555607708, a frameshift variant (p.Thr367MetfsTer15; commonly known as c.1100del; p = 1.6 x 10^-27) well-studied as a pathogenic variant in breast and other cancers, and the second most significant association in CHEK2 was rs587780174, another pathogenic frameshift variant (p.Ser422ValfsTer15; commonly known as c.1263del; p = 1.7 x 10^-9); the strongest missense association was with rs28909982 (p.Arg117Gly; p = 1.3 x 10^-4). The most significant association in DCLRE1A was nonsense (premature stop) variant rs41292634 (p.Arg138Ter; p = 7.5 x 10^-11) and the strongest missense association in DCLRE1A was missense variant rs11196530 (p.Ile859Phe; p = 1.7 x 10^-3). The most significant association in TOP3A was missense variant rs34001746 (p.Leu584Arg; p = 1.6 x 10^-10). The most significant association in RAD54L was missense variant rs28363218 (p.Arg202Cys; p = 7.1 x 10^-5) and the most significant association in HROB was nonsense rs774881553-T (p.Gln341Ter; p = 2.7 x 10^-3). We also re-ran the tests of aggregated variants with a leave-one-out approach to test the extent to which these individual variants contributed to the gene-level associations. For all genes except TOP3A, the gene-level association remained significant when holding out the strongest single-variant associations; for TOP3A, missense rs34001746 appeared to explain the entire association (p = 0.50 for all other missense and PTVs).

We then broadened our analysis to 30,571 array-genotyped or exome-sequenced variants within each of the five gene bodies and 500 kilobases flanking, focusing on associations that would be considered significant in a GWAS (p < 5 x 10^-8; [...] HROB were associated with menopause timing. At TOP3A, four noncoding variants in addition to missense rs34001746 were associated, had the same MAF as rs34001746 (0.7%), and were in strong linkage disequilibrium (Figure S6). At both RAD54L and CHEK2, many common and rare variants were significantly associated, consistent with a previous GWAS of ANM by the ReproGen consortium [11] which identified noncoding associations at both of these loci and hypothesized RAD54L and CHEK2 as the causal genes.
We explored the independence of the associations with the multiple variants at these loci by building TTE models with genotypes of multiple variants as independent variables.

At RAD54L, conditional analysis suggested that the common GWAS variant rs12142240 (reported by ReproGen) and rs12073998 (the strongest common-variant association we find, which is in high linkage disequilibrium with rs12142240) were in high linkage disequilibrium with all of the other genome-wide significant associations at the locus, and rare missense rs28363218 remained nominally significant at p = 2.6 x 10^-3, suggesting its independence (Table S7). At CHEK2, conditioning on the genotype of the common GWAS variant rs34001746 reduced the number of additional genome-wide significant SNP associations from 34 to 7, while improving the significance of the association with rare frameshift rs587780174 (Table S8). In a TTE model incorporating both rs34001746 and rs587780174, both remain highly significant (p = 2.9 x 10^-10 and p = 9.2 x 10^-28 respectively), suggesting that the association identified by GWAS is independent of the predominant rare variant signal at the CHEK2 locus.

Other associations with menopause timing-associated genes

To better understand the full spectrum of consequences of rare coding variants in these five genes, we performed phenome-wide association studies (PheWAS) for association with 3,457 diagnosis codes and 99 quantitative traits. While DCLRE1A, RAD54L, TOP3A, and HROB had no associations reaching phenome-wide significance (p < 2.0 x 10^-6), CHEK2 associated with many diagnoses and traits (Table S9; Table 2).

Besides breast cancer and related diagnoses, rare damaging variants in CHEK2 were also associated with prostate cancer, which has been previously reported [28]. CHEK2 variants also associated with myeloid leukemia, consistent with reports that CHEK2 coding variants associate with myelodysplastic syndrome [29]. We also found associations with benign neoplasms: carcinoma in situ of the breast and leiomyoma of the uterus (uterine fibroids). In addition to these neoplasms, we also detected association between CHEK2 variants and diagnosis of polycystic ovary syndrome (PCOS), prostate hyperplasia, and seborrheic keratosis.
Because highly imbalanced case-control ratios can lead to false positives in association tests, we confirmed that these associations remained significant when using the SAIGE-GENE algorithm, which corrects for these potential false positives and also accounts for relatedness (Table S10) [...] [30] and found that the hematological associations and disease associations, with the exception of carcinoma in situ of the breast and prostate cancer, were specific to CHEK2 (Table 3; Table S11).

We further tested the non-neoplasm associations in two sub-analyses: an analysis of men who have never been diagnosed with a neoplasm, and an analysis of women who have never been diagnosed with a neoplasm which also conditioned on menopause status as a covariate. All associations except seborrheic keratosis were replicated in these secondary analyses, suggesting that these associations were not secondary to cancer treatment, diagnosed neoplasms, or menopause status, and that the hematological associations were not sex-specific (Table S12).

Because of the association between estrogen exposure and menopause timing, we also tested all of the ANM-associated variant sets against estradiol levels, limiting this analysis to 32,385 premenopausal women with measurements available. No association was detected (Table S13).

The association of CHEK2 variants with both breast cancer and delayed menopause raised the question of the extent to which this pleiotropy was mediated by the well-established association between delayed menopause and breast cancer. To distinguish between biological and mediated pleiotropy, we performed causal mediation analysis. For simplicity of the model, we used only the subset of women who were postmenopausal at the time of recruitment and who reported an ANM. To further ensure that breast cancer followed menopause and that genetic effects on ANM did not confound the association between the age covariate and follow-up time, we only considered breast cancer cases vs. breast cancer-free controls after the age of 60. In this subset, we detected significant associations of CHEK2 genotype with ANM, CHEK2 genotype with breast cancer, and ANM with breast cancer (while controlling for CHEK2 genotype), suggesting both a direct effect and indirect effect (via menopause delay) of CHEK2 genotype on breast cancer. Using PTVs only, which have the largest estimated effect on breast cancer, we tested the significance of the unstandardized indirect effect using 1,000 bootstrapped samples and estimated the bootstrapped indirect effect as 7.8% of the total effect (95% CI: 4.5% to 14.9%; p < 2 x 10^-16; Figure 3). The proportion was similar when considering both PTVs and missense variants (7.1% mediated; 95% CI: 4.1% to 15.5%; p < 2 x 10^-16; Figure S9).

The strong link between DNA damage repair genes and menopause timing is not surprising given the role of DNA damage repair and surveillance at every stage of the development of oocytes [11, 31]. The size of the initial oocyte pool at birth, along with the rate of atresia of oocytes through life, influences the age at which the oocyte pool is depleted to a number low enough to trigger amenorrhea (approximately 1,000 remaining oocytes). The meiosis that occurs in oocytes necessitates programmed double-stranded breaks (DSBs) which must be repaired through the homologous recombination pathway; oocytes that do [...]

Mendelian randomization analysis of menopause-delaying alleles supports a causal role of these variants in breast cancer risk, mediated through prolonged exposure to endogenous estrogen [11]. The plausibility of this causal mechanism is strongly supported by the fact that hormone-replacement therapy is a risk factor for breast cancer [33]. Although common variants at the CHEK2 and BRCA1 loci have been previously associated with ANM, the rare coding variants that associate with breast cancer at these loci have not been previously reported to associate with ANM at genome-wide significance. A recent preprint identified common-variant associations at CHEK2 and BRCA2 with ANM and followed up these associations by testing for association between CHEK2 and BRCA2 pathogenic variants and menopause timing among 9,619 exome-sequenced women; they found associations with later menopause (p = 1 x 10^-5) and earlier menopause (p = 0.03), respectively [34]. Although we find an association between BRCA2 PTVs and earlier menopause in our primary analysis (p = 3.2 x 10^-4), this association is largely abrogated when excluding any women with history of neoplasms or gynecological surgery (p = 0.08), suggesting the primary association could be confounded by prophylactic or therapeutic surgery in BRCA2 carriers. The association we observe between CHEK2 rare damaging variants and later menopause timing is strong and genome-wide significant even at the level of individual variants and is seen in additional data not used in the discovery analysis. Through causal mediation analysis, we show for the first time that while the predominant effect of CHEK2 pathogenic variants is directly on breast cancer risk, that risk would be slightly less if it were not for their menopause-delaying effect.
Novel CHEK2 biology

In addition to discovering associations between rare damaging variants in CHEK2 and menopause timing, and replicating known associations with breast cancer [30], prostate cancer [28], and myeloid leukemia [29], we find associations with many hematological measurements such as increased leukocyte counts and plateletcrit, and associations with several diagnoses: polycystic ovary syndrome (PCOS), uterine fibroids, prostate hyperplasia, and seborrheic keratosis. These hematological associations were first reported in an earlier analysis of the UK Biobank exome data and were hypothesized to be secondary to cancer treatment [14]; however, we find that they exist in individuals with no diagnosed neoplasm. They are unlikely to be secondary to menopause timing, as they persist in both men and women, and are seen among women even when correcting for menopause status at the time of the blood assay. It is more likely, therefore, that these hematological changes arise through the same mechanisms that predispose to myelodysplastic syndrome in CHEK2 mutation carriers [...]

[...] discovery analysis, and carriers who were premenopausal at their initial interview were more likely than non-carriers to be postmenopausal at follow-up interviews.

The remaining loci, RAD54L and HROB, are genome-wide significant in the individual genome-wide scans but do not surpass a strict Bonferroni-corrected threshold for having performed two genome-wide scans. Whether such a conservative threshold is justified is debatable given that variants were shared between the two scans, and they were therefore far from independent tests. We chose to follow up and report on these results in light of prior biological evidence, while noting that replication in a different cohort would be critical before considering these associations high confidence.

RAD54L (Rad54-like, named after its S. cerevisiae homolog) encodes a protein involved in homologous recombination that binds tightly to Holliday junctions [41]. Although RAD54L has been proposed as a gene altered in the germline in breast cancer [42, 43], it has not been associated in modern exome-wide studies of breast cancer [30], and consistent with this we do not observe an association with breast cancer in the UK Biobank. The ReproGen GWAS of ANM identified a noncoding association at the RAD54L locus; we identify an independent rare variant association of later ANM with missense rs28363218, which has not been reported as pathogenic to ClinVar. We find that the association of RAD54L with later ANM is confirmed by follow-up interviews of initially premenopausal women.

The problems inherent in studies of [...] on self-reported recollection of age at menopause have been previously reported [1, 46]. The accuracy of recall of age at menopause is known to decrease with duration between menopause and the interview. A bias towards ages divisible by five (40, 45, 50) is clear from the distribution of ages and suggests low accuracy of individual reports of ANM. Oral contraceptive use may mask the onset of menopause but is so widespread that excluding oral contraceptive users was infeasible for our study.

Except for the most common pathogenic CHEK2 variants which have been characterized in the context of breast cancer, the other variants we identify as associating with ANM remain to be tested functionally to validate bioinformatic predictions. Experimental work is needed to verify that these variants indeed disrupt protein function (in the case of missense variants) or result in nonsense-mediated decay of the principal message (in the case of PTVs).

While we have attempted to use follow-up interviews and extreme phenotypes to obtain additional evidence to confirm the initial discoveries, a true replication would involve study of a separate cohort that has both menopause information and rare variant information from exome or whole-genome sequencing. Such independent replication will be critical to confirming the weaker evidence seen for associations with RAD54L and HROB.

Our study of rare coding variants confirms findings from previous GWAS highlighting a key role of DNA damage repair proteins in genetic determination of menopause timing. In addition to identifying coding variant associations at the CHEK2 and RAD54L loci previously identified by GWAS, we identify novel associations at TOP3A, DCLRE1A, and HROB, and confirm that menopause timing is the sole phenome-wide significant association for rare variants in these genes. CHEK2 also appears to be highly pleiotropic beyond its known role in breast cancer and other cancer syndromes, affecting hematological traits as well as conferring risk for benign disorders involving hyperproliferation of tissue such as PCOS, prostate hyperplasia, and seborrheic keratosis.

Approval was received to use these data from the UK Biobank under application number 26041. The UK Biobank resource is an approved Research Tissue Bank and is registered with the Human Tissue Authority, which means that researchers who wish to use it do not need to seek separate ethics approval (unless re-contact with participants is required). Ethics oversight for the UK Biobank is provided by an Ethics and Governance Council which obtained written informed consent from all participants for use of the data in health-related research.

The data used in this study were obtained from the UKBB through application 26041. All phenotypic data and array genotypes are accessible through application to UK Biobank. Currently, exome sequencing data [...]

[...] for Association Studies. bioRxiv, 635706.

medRxiv preprint doi: https://doi.org/10.1101/2021.04.18.21255506; this version posted April 20, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Figure S3. Replication of menopause timing associations in a subset of women who were premenopausal at the initial interview and who underwent follow-up interviews.