ABSTRACT
Breast cancer is a polygenic disorder and is the leading cause of cancer related mortality among women. Early-onset breast cancer (EOBC) is diagnosed in women prior to 45 years-of-age and is associated with worse clinical outcomes, a more aggressive disease phenotype, and poor prognosis for disease-free survival. While substantial progress has been made in defining the genetics of breast cancer, EOBC remains less well understood. In the current study we perform a retrospective analysis of data derived from The Two Sister Study. The use of alternate strategies for handling age-at-diagnosis in conjunction with haplotype-based methods yielded novel findings that help to explain the heritability of EOBC. These findings are validated through comparison against discordant sibs from The Two Sister Study as well as using data derived The Cancer Genome Atlas (TCGA).
INTRODUCTION
Breast cancer is the most frequently diagnosed oncogenic malignancy and a leading cause of cancer-related mortality among women worldwide (1, 2). Early-onset breast cancer (EOBC) accounts for approximately 5-10% of all new female breast cancer cases and young age at diagnosis correlates with worse clinical outcomes (3, 4). Germline variants play a prominent role in the etiology of breast cancer and an estimated 10-15% of women who develop breast cancer report a familial history of the disease. Germline variants in BRCA1 or BRCA2 are observed in 15-20% of familial breast cancer cases (5). Direct evidence for genetic modifiers of breast and ovarian cancer risk for BRCA1 and BRCA2 mutation carriers has been provided through genome-wide association study (GWAS) (6). Patients affected by EOBC exhibit shared patterns of gene expression that differ from their older counterparts (7). These combined observations suggest a genetic component contributes to EOBC although only a fraction of the heritability of EOBC has been explained.
Deciphering the genetic basis for phenotypic heterogeneity in complex diseases remains a major challenge. Single marker association studies often lack sufficient statistical power to support the discovery of rare variants or epistatic interactions within a polygenic architecture. Haplotype-based analysis is thought to have greater power than single marker association tests in the study of complex disease (8-10). Haplotypes, which consist of a series of sequentially ordered single nucleotide variants (SNVs), are a potentially more informative format for association testing than single markers and may have improved sensitivity and specificity for discovery (11, 12). Haplotype-based analysis has been used to gain insight in a wide array of complex disease models including mood disorders, multiple sclerosis, orofacial clefting, and cancer (13-18). Moreover, haplotype-based analysis has been effectively applied to investigate age-of-onset in human disease although relatively few studies have specifically addressed EOBC (19-23).
Several approaches have been used to investigate the genetic regulation of breast cancer age-of-onset. The Two Sister Study made use of a familial case-control design with affected cases diagnosed ≤ 50 years-of-age and discordant sibs of EOBC patients defining a control population. Parental samples were included in The Two Sister Study to allow for the identification of maternally-mediated effects and Mendelian errors in transmission (24-26). Other studies have instead used categorical thresholding with diagnosis at 35, 40, 45, and 50-years-of-age to define EOBC populations contrasted against either unaffected familial controls or unrelated age-matched controls (3, 27-29). Age-of-onset has further been evaluated in terms of phenotypic extremes by comparing individuals diagnosed at ≤ 35 years-of-age against cancer-free controls at age ≥ 60 or against an age-specific cohort diagnosed with breast cancer at ≥ 65 years-of-age (30, 31). Still others have investigated breast cancer in terms of age-stratified risk or using quantitative trait analysis to support discovery. Internally consistent logic has justified the use of these and other study designs. Yet the genetic basis for EOBC remains poorly understood and more recent studies have turned towards meta-analyses aimed at achieving sufficient statistical power to identify rare variants with small effect size (32-34).
Design considerations for the study of complex polygenic disorders have been evaluated across a range of disease models. For example, Peyrot and colleagues have convincingly argued against familial trio designs when investigating complex disease traits with a polygenic architecture or a lifetime risk ≥ 1% (35). Reasons given included a potential for reduced statistical power, ascertainment bias, and a significant underestimation of SNV heritability. Additional considerations in sib pair study design include the potential for misclassification and/or overmatching (36). Misclassification of discordant sibs presents a challenge primarily in cases associated with pronounced variation in age-of-onset. Overmatching presents a more significant challenge in complex disease models where discordant sibs are likely to share an indeterminate number of disease-related alleles. As a result, allele-frequency differences between affected and unaffected sibs are generally underestimated relative to randomly selected affected and unaffected individuals (36). Recent investigation of polygenic risk in multiplex melanoma families indicated that familial controls may carry a significantly elevated polygenic load relative to unrelated cases or healthy controls and thereby introduce bias (37). Kerber and colleagues likewise argued that familial studies should be approached with caution, particularly when investigating complex diseases such as cancer where variable onset, incomplete genetic penetrance, gene-environment interactions, and environmental phenocopies have a dramatic potential to impact disease occurrence and phenotype (38). The authors further argued in favor of a case-only analysis for an initial scan followed by more comprehensive analysis of regions surrounding initial hits using both affected and unaffected study participants. In keeping with this reasoning, we speculated that a comparison of younger and older patients diagnosed with breast cancer might provide insight into the genetic architecture of breast cancer age-of-onset.
We performed a retrospective analysis of “The Two Sister Study: A Family-Based Study of Genes and Environment in Young-Onset Breast Cancer,” hereafter referred to as “The Two Sister Study.” The Two Sister Study is one of the longest standing and best characterized studies of early-onset breast cancer and hence was chosen to establish proof-of-principle. Initial screening was performed using a case-only design and haplotype-based association testing while treating age-at-diagnosis as a categorical variable. Candidate regions identified through this initial screen were subsequently evaluated against discordant sibs defined within The Two Sister Study by variance partition analysis and haplotype-trend regression. Findings were validated using data derived from phase III of the 1000 Genomes Project and mutation and The Cancer Genome Atlas (TCGA).
RESULTS
Our objective in this study was to investigate the genetic basis for EOBC. Access to The Two Sister Study was obtained through the Database of Genotypes and Phenotypes (dbGAP; accession phs000678.v1.p1). The demographics of the study population have been described elsewhere (25, 26, 39). The original study compared patients affected by young-onset breast cancer (age-at-diagnosis ≤ 50) with familial controls using a case-control design and affected status as a binary outcome. Pertinent populations for the purpose of this study included 1,456 cases affected by breast cancer and 525 discordant sibs.
For initial screening, the affected population was dichotomized by virtue of age-at-diagnosis using statistical modules within the %findcut SAS macro. The %findcut macro calculates thresholds for the dichotomization of continuous variables and plots a local linear regression (LOESS) curve which may be used to determine whether dichotomization is appropriate for the variable in question (40). While a continuous trait would be expected to produce a linear trend line with a slope ≅0, the LOESS curve generated while analyzing The Two Sister Study case population failed to meet the assumption of linearity (Supplemental Fig S1). The observed slope and steep bend in the LOESS curve exhibited characteristics of a categorical variable justifying dichotomization of the affected population based upon age-at-diagnosis. Theoretical cutpoints were calculated using the %findcut macro and the mean value was used to distinguish between younger (diagnosis ≤ 45 years-of-age; N = 735) and older (diagnosis > 45 years-of-age; N = 721) populations.
Candidate prioritization initially involved a comparison of the younger and older affected populations using haplotype-based association testing with an expectation-maximization (EM) algorithm and a dynamic window size of 10 kilobases (kb) (Fig 1a). Quantile-quantile plotting verified that the resulting data was normally distributed (Supplemental Fig S2). Several peaks were observed by Manhattan plot (Fig 1a) with 6 haplotypes located at chromosome 6: 111,936,275-111,964,664 surpassing the threshold for genome-wide significance (p ≤ 5 x 10−8). This preliminary analysis identified 762 haploblocks representing 4,126 haplotypes (Table 1). Upon filtering using p > 5 x 10−4 as a threshold for exclusion, 322 haplotypes within 282 haploblocks spanning 64 discrete autosomal regions were retained. Of these 64 regions, 15 were associated with a single haploblock and 49 included two or more adjacent haploblocks with block clusters ranging from 2-14 haploblocks in length.
Fine mapping of the aforementioned chromosomal regions was performed using haplotype-based association testing and sliding windows of 2-6 SNVs in length as previously described by Mathias et al. (41). Assuming a panel of 684,126 variants and 3,420,615 independent tests across all windows a Bonferroni corrected threshold for genome-wide significance was calculated as p ≤ 2.92 x 10−8 using the method described by Song et al (42). A single-locus mixed model analysis was performed for comparison using an identity-by-state kinship matrix and an Efficient Mixed-Model Association eXpedited (EMMAX) algorithm as implemented in Sequence Variation Suite software (Golden Helix, Bozeman, MO). To facilitate direct comparison of single marker and haplotype-based analyses, Manhattan plots were overlaid (Fig 1b). Only windows of 2, 4, and 6 SNVs in length were included within the composite plot for visual clarity. The composite image indicated haplotype-based testing generally outperformed single marker association testing and increasing haplotype window size generally correlated with improved statistical strength. Haplotypes located at chromosome 6: 111,936,275-111,964,664 including the rs17754910 marker consistently exceeded the threshold for genome-wide significance with the most significant haplotype (GACGAA; p ≤ 3.34 x 10−10) consisting of markers rs671271, rs17754910, rs490080, rs1327199, rs9487771, and rs585057. Composite windows representing all haplotypes of 2-6 SNVs in length were filtered selecting for haplotypes with a χ2 p value ≤ 5 x 10−5. This filter was applied as an incremental step towards achieving our objective which was to identify regions where increasing haplotype structure correlated with improved significance. In total 466 haplotypes consisting of 417 unique variants spanning 165 unique haploblocks remained (Table 1; Supplemental Table S1). Of the 165 haploblocks, ten were isolated and nonadjacent. Five of these blocks were excluded from further analysis because: 1) the component SNVs were represented in adjacent clusters (blocks 6611, 7202, 8926, and 9337); or, 2) the isolated block was weakly associated with an existing cluster (block 7489). The remaining isolated haploblocks (blocks 3738, 6951, 7182, 8759, and 9862) failed to exhibit a significant association with age-of-onset based on regression analysis and hence were excluded from further consideration. The remaining 155 haploblocks consisted of 264 unique SNVs and formed consecutive clusters defining 33 discrete chromosomal regions (Supplemental Table S2). The most significant haploblocks within each of the 33 chromosomal regions are defined in Supplemental Table S3.
Visualization of haplotype structure was performed using the “Graphical Assessment of Sliding P-values” (GrASP) excel macro (41). The GrASP macro concisely depicts haplotype windows of varying length with corresponding p values, providing an efficient means for screening regions of interest while visualizing haplotype substructure. Of the chromosomal regions examined using the GrASP macro, 13 exhibited improvement in statistical significance and incremental changes in haplotype structure with increasing window size (Table 1). Composite images reflecting sliding window p values and the relative position of functional elements within candidate regions are depicted in Fig 2. Regions of interest exhibiting improved significance with increasing window size were associated with TP73, LYPD6B, KIAA1109, ADAD1, IL2, a regulatory enriched region on chromosome 6, ARHGEF10, AGO2, CNNM1, LINC00941, PPFIBP1, and non-coding loci including AL160035.1 and the NEK4P1 pseudogene. As displayed in Fig 2 the region defined on chromosome 6 was the only region to exceed the threshold for genome-wide significance. The region defined on chromosome 6 is functionally enriched consisting of predicted regulatory elements including promoter and promoter flanking regions, multiple enhancers, CTCF binding sites, and putative transcription factor binding sites. The nearest sequence element was LINC02527 located within ∼11 kb of the defined region on chromosome 6. Other regions of particular note identified through this screen included ARHGEF10 and IL2 both of which are listed within the COSMIC census of known cancer drivers. Odds ratios and 95% confidence intervals for the aforementioned chromosomal regions are portrayed as a forest plot comparing younger and older breast cancer populations in Fig 3. Positive correlations between candidate haplotypes and younger breast cancer patients (diagnosis ≤ 45 years-of-age) relative to older breast cancer patients (diagnosis = 46-50 years-of-age) were associated with LYPD6B, the long arm of chromosome 6, AGO2, LINC00941, and PPFIBP1. The most striking comparison was associated with LYPD6B with an OR = 6.95 and a 95% CI of 2.47-19.51. Negative correlations between candidate haplotypes and younger breast cancer patients (diagnosis ≤ 45 years-of-age) relative to older breast cancer patients (diagnosis = 46-50 years-of-age) were associated with TP73, ADAD1, IL2, ARHGEF10, CNNM1, the long arm of chromosome 13, and the long arm of chromosome 21.
Comparison of haplotype frequencies between siblings, however, failed to fully address the potential for overmatching as previously described (36). Correction for hidden population stratification through the use of kinship matrices provides an important and essential control in genotypic analyses but may be more robustly controlled for through population-based haplotype frequency analysis drawing upon data outside of the discovery population. Hence, we evaluated haplotype frequencies observed within The Two Sister Study against phase III data from the 1,000 Genomes Project (Fig 4). Towards this end haplotype frequencies derived from The Two Sister Study were compared to haplotype frequencies observed in African (AFR), American (AMR), East Asian (EAS), and non-Finnish European (EUR) populations. Viewed within this context, haplotype frequencies within The Two Sister Study were elevated in comparison to AFR and/or AMR populations for TP73, the KIAA1109 promoter, ADAD1, IL2, ARHGEF10, CNNM1, and the long arms of chromosomes 6, 13, and 21. Eight haplotypes did not occur within the EAS population. Minimal variation was observed in the non-Finnish EUR population relative to The Two Sister Study.
Haplotype trend regression was used to analyze the aforementioned 33 autosomal regions of interest in both affected (1,456 breast cancer patients) and unaffected (525 discordant sibs) populations as originally defined within The Two Sister Study. Whereas visual representation of data using the GrASP macro provided an intuitive sense of evolving haplotype structure, haplotype trend regression provided robust measures of statistical significance. Full model permuted p values indicated 14 of the 33 regions investigated were significantly associated with EOBC within the affected population (Table 2). Conversely, haplotype trend regression failed to detect significant associations between the candidate regions and discordant sibs. Haplomaps summarizing marker distributions are presented in Supplemental Figure S3 and marker characteristics are described in Supplemental Table S4.
In an attempt to validate our findings, we first analyzed breast cancer expression data obtained through cbioportal (43). The expression data included the “mRNA expression z-scores relative to normal samples (log RNA Seq V2 RSEM)” file and included representing 994 donors. Variance partition analysis was performed to evaluate associations between gene expression and age-at-diagnosis. Summary findings indicated that expression of AGO2, KIAA1109, and PPFIBP1 was significantly associated with breast cancer age-at-diagnosis and explained 4.47% of age-related variance within the population (Table 3).
Subsequent analysis was performed in an attempt to correlate gene-specific mutation types with age-at-diagnosis. Due to the rarity of discrete mutation types the study population was expanded to include 30 studies across various tissues that were accessed through cbioportal. Pediatric studies were excluded from analysis and 20 years-of-age was applied as a cutoff to exclude minors. Subpopulations were defined by the affected gene and type of mutation (amplification, deletion, missense/truncating mutation). Significant associations linking age-at-diagnosis to (candidate x mutation type) were identified by two sample Z-test (Table 4). The mean age-at-diagnosis ± standard deviation for controls was 60.2 ± 13.13 years-of-age. Significant associations between age-at-diagnosis and gene-specific copy number variants involving amplifications were observed in ARHGEF10 (63 ± 10.86; p = 0.029); CNNM1 (54.2 ± 11.61; p = 0.004); LYPD6B (57.5 ± 13.39; p = 0.04); and TP73 (62.9 ± 10.38; p = 0.021). Significant associations between age-at-diagnosis and gene-specific copy number variants involving deletions were observed in ADAD1 (64.4 ± 10.06; p = 0.0047); AGO2 (64.6 ± 11.02; p = 0.022); CNNM1 (66.1 ± 10.42; p = 0.00011); IL2 (64.4 ± 10.06; p = 0.0047); KIAA1109 (64.6 ± 9.57; p = 0.0023); and LYPD6B (64.8 ± 10.36; p = 0.00091). Significant associations involving missense/truncating mutations followed a trend similar to that observed in association with deletions as might be expected in terms of functional consequence.
DISCUSSION
Our objective in this retrospective study was to gain insight into the genetics of EOBC using existing data sets. We proposed to do so through a subtle rephrasing of the initial hypothesis and by applying haplotype-based methods rather than single marker tests of association. The Two Sister Study data set was chosen for retrospective analysis because it is among the best characterized studies involving EOBC and because the data structure lends itself to formation of alternative hypotheses. We believe there is a need to explore alternatives in the study of complex disease in general because the greater portion of phenotypic heterogeneity in complex disease remains unexplained. By way of example the investigation of breast cancer has resulted in the identification of a handful of genetic drivers with large effect and more than 200 susceptibility loci with minor effect explaining less than half of breast cancer heritability. Known drivers associated with EOBC are less well defined. Yet EOBC accounts for an estimated 10% of all new breast cancer cases among women and an estimated 15% of breast cancer deaths result from breast cancers initially diagnosed prior to 45 years-of-age (3, 44).
The Two Sister Study made use of a familial study design to identify maternally-mediated affects and germline associations with EOBC by contrasting breast cancer patients diagnosed prior to the age of 50 against discordant siblings (25, 26). In the current study we addressed a different question and hypothesized that candidate associations with EOBC might more readily be identified by contrasting younger and older cases of breast cancer. This supposition is consistent with arguments presented by Kerber and colleagues (38), although the merits of treating age as a categorical variable remains a subject of debate (45-47).
Initial haplotype-based association studies to compare cases (age-at-diagnosis ≤ 45) and controls (age-at-diagnosis = 46-50) yielded normally distributed results as determined by QQ plot. Preliminary screening alone identified a single SNV exceeding genome-wide significance (rs17754910; p = 4.73 x 10−9, FDR = 0.0016). Haplotype-based association testing and sliding window analysis helped identify 33 chromosomal regions of interest, 13 of which exhibited increasing haplotype structure in conjunction with improved measures of significance. The qualitative observations resulting from sliding window analysis were subsequently corroborated by haplotype trend regression with 14 of the 33 candidate regions achieving a permuted p value ≤ 0.05. It should be noted that the only haplotypes to achieve genome-wide significance by means of haplotype-based association testing included the rs17754910 SNV on chromosome 6 in a region enriched with regulatory elements. The nearest sequence element approximately 11 kb upstream of the rs17754910 SNV is the non-coding LINC02527 RNA (chromosome 6: 111,900,306-111,909,395). We note that alternating methylation patterns are observed within the LINC02527 promoter in various cancers including breast cancer (48). Other candidates identified by haplotype trend regression included the known cancer drivers IL2 and ARHGEF10 (interleukin 2 and rho guanine nucleotide exchange factor 10, respectively) neither of which have previously been associated with an early-onset cancer phenotype. The remaining candidates identified by haplotype trend regression may be broadly categorized in terms of known involvement in cancer, metastasis, and age-of-onset in disease. AGO2 (argonaute 2), CNNM1 (cyclin and CBS domain divalent metal cation transport mediator 1), KIAA1109, and TP73 (tumor promoter 73) have been implicated in breast cancer, metastasis, and disease age-of-onset (49-57). The noncoding LINC00941 RNA, LYPD6B (LY6/PLAUR Domain Containing 6B), and PPFIBP1 (liprin-beta-1) have been implicated in cancers that may or may not include breast cancer, have been implicated in metastasis, but have no known association with disease age-of-onset (58-61). ADAD1 (adenosine deaminase domain containing 1) has no known association with cancer but has been associated with early-onset asthma and may have a role in childhood seizures. Last, the NEK4P1 pseudogene and the AL160035.1 sequence have no known associations with cancer, metastasis, or disease onset. Though not addressed within the body of this study, we note that Ingenuity Pathway Analysis associated AGO2, ARHGEF10, CNNM1, IL2, KIAA1109, LYPD6B, PPFIBP1, and TP73 with a single network centered around nodes formed by TP53 and the estrogen receptor. We note the obvious absence of BRCA1/BRCA2 within the network and mention it here as an anecdote worthy of speculation.
Haplotype frequency analysis within The Two Sister Study and phase III data from the 1000 Genomes Project yielded insight specifically with regard to the potential hazards of overmatching in study design. As mentioned overmatching presents a potentially significant challenge in complex disease models where discordant sibs are likely to share an indeterminate number of disease-related alleles (36). If, as suggested by the current literature, hundreds of discrete susceptibility loci control breast cancer occurrence and phenotypic expression, we must consider the possibility that familial controls carry a greater burden of polygenic risk alleles without necessarily experiencing disease occurrence. Because no single allele drives breast cancer occurrence, it logically follows that differences in allele or haplotype frequencies between discordant sibs may lack the capacity to distinguish between alleles associated with phenotypic heterogeneity in complex disease. It is known that familial controls may carry a significantly elevated polygenic load relative to unrelated cases or healthy controls creating an uncontrolled source of bias in discovery (37). By way of example we note that haplotype frequencies are elevated in discordant sibs relative to affected breast cancer patients for TP73, IL2, and ARHGEF10 (Fig 3). IL2 and ARHGEF10 are both listed within the COSMIC census of known cancer drivers and it does not require a stretch of the imagination to consider that TP73 might play a role in breast cancer. Based solely upon haplotype frequencies observed in discordant sibs, it would appear that all three haplotypes are negatively correlated with EOBC. Yet the observed haplotype frequencies for these three genes within The Two Sister Study are elevated when compared to the 1000 genomes phase III AMR population by 1.54-fold, 1.68-fold, and 1.82-fold, respectively. The undefined element on chromosome 21 is elevated by 18.86-fold relative to the AFR population although the very same haplotype is more abundant in discordant sib controls relative to breast cancer patients diagnosed at ≤ 50 years-of-age.
Because this study was retrospective a replication of our findings would be challenging without a prospective collection of new data, something which is beyond the scope of the current study. In the absence of replication, we have attempted to validate our findings with supporting evidence as a matter of due diligence. Towards this goal breast cancer gene expression data derived from the TCGA pan-cancer study was evaluated using regression modeling and variance partition analysis to identify correlations between gene expression and age-at-diagnosis. Earlier age-at-diagnosis was associated with higher expression of AGO2 (p = 1.19 x 10−4), KIAA1109 (p = 1.13 x 10−5), and PPFIBP1 (p = 1.07 x 10−3). These findings are consistent with prior studies involving AGO2 and KIAA1109 and provide new evidence suggesting a potential association of PPFIBP1 expression in EOBC (50, 54). Under the assumption of an additive model, expression of these three genes was calculated to explain a combined 4.47% of age-related variance within the study population.
Subsequent validation involved the evaluation of age-at-diagnosis as a function of gene-specific mutations drawing upon available data from 30 distinct cancer studies for statistical purposes. Of the gene-specific mutations the vast majority were observed to result in a significant increase in the mean age-at-diagnosis. We speculate that most of these mutations are unlikely to be causally associated with late-onset disease and instead reflect the global accumulation of damage as a secondary consequence of errors in DNA repair. The candidate genes CNNM1 and LYPD6B shared a unique feature, though, in that both exhibited bidirectionality of effect depending upon mutation status. Gene amplifications affecting CNNM1 and LYPD6B were associated with a significantly lower mean age-at-diagnosis (54.2 ± 11.61 and 57.5 ± 13.39 years-of-age, respectively). Deletions affecting CNNM1 and LYPD6B were conversely associated with a significant increase in mean age-at-diagnosis (66.1 ± 10.42 and 64.8 ± 10.36 years-of-age, respectively). This bidirectionality of effect, we believe, is sufficiently compelling to warrant further investigation of CNNM1 and LYPD6B as contributory factors in EOBC.
Complex disease phenotypes remain a major challenge in the genomic sciences. Frequentist strategies, based upon the assumption that more data will translate into more insight, are currently in vogue and serve a valuable purpose. The identification of rare variants associated with disease is a matter of sample size and ongoing efforts to integrate disparate data sets for meta-analysis is a monumental challenge. Our objective in the current study is less ambitious and merely asks if we can repurpose data to improve our understanding of complex disease. To that limited extent, we have achieved our goal. We have identified a new candidate of genome-wide significance with a potential role in EOBC. We have provided strong supporting evidence justifying the pursuit of a handful of priority candidates with a potential role in EOBC. We have identified two known cancer drivers with a potential involvement in disease onset. And, we have highlighted conditions where frequentist analysis may lead to questionable conclusions in the analysis of familial data. Data-mining, in this instance, suggests that there may be merit in re-examining existing data and the assumptions made during initial inquiry.
METHODS
Data: The Two Sister Study
Discovery was performed using data derived from The Two Sister Study: A Family-Based Study of Genes and Environment in Young-Onset Breast Cancer (accession phs000678.v1.p1). Study contents were accessed under a Data Use Certification (DUC) Agreement via the Database of Genotypes and Phenotypes (dbGAP). The dataset includes genotypic, phenotypic, and demographic data for 1,456 patients, 525 discordant sib controls, and an additional 1,359 controls. The demographics of the population have been described (25, 26, 39). The parent study described 1,458 patients. We believe two of these patients were erroneously excluded from the present study during filtering to eliminate duplicate samples. Quality control filtering of the corresponding genotypic data retained a total of 684,126 variants with a call rate ≥ 0.99, a minor allele frequency ≥ 0.05, and a Hardy-Weinberg p value ≥ 1 x 10−6 within the older “control” population.
Data: cbioportal
In order to validate initial findings clinical data spanning 30 studies representing 10,902 donors was accessed through cbioportal (43). A total of 220 donors were excluded due to cross-study differences in the definition of donor age. An additional 23 donors diagnosed prior to the age of 20 were excluded as minors. The studies were selected based on three criteria: 1) existing evidence of an early-onset cancer phenotype within the tissue; 2) the availability of data defining age-at-diagnosis; and 3) exclusion of pediatric studies. Composite data included the following studies: Acute Myeloid Leukemia (OHSU) (62), Breast Cancer (METABRIC) (63, 64), Breast Cancer (SMC) (65), Breast Fibroepithelial Tumors (Duke-NUS) (66), Breast Invasive Carcinoma (British Columbia) (67), Breast Invasive Carcinoma (Broad) (68), Breast Invasive Carcinoma (Sanger) (69), Breast Invasive Carcinoma (TCGA, PanCancer Atlas) (70), Cervical Squamous Cell Carcinoma (TCGA, PanCancer Atlas) (71), Clear Cell Renal Cell Carcinoma (DFCI) (72), Colorectal Adenocarcinoma (DFCI) (73), Colorectal Adenocarcinoma (TCGA, PanCancer Atlas) (71), Esophageal Adenocarcinoma (TCGA, PanCancer Atlas) (71), Esophageal Squamous Cell Carcinoma (ICGC) (74), Esophageal Squamous Cell Carcinoma (UCLA) (75), Kidney Chromophobe (TCGA, PanCancer Atlas) (71), Kidney Renal Clear Cell Carcinoma (BGI) (76), Kidney Renal Clear Cell Carcinoma (IRC) (77), Kidney Renal Clear Cell Carcinoma (TCGA, PanCancer Atlas), Kidney Renal Papillary Cell Carcinoma (TCGA, PanCancer Atlas) (71), Liver Hepatocellular Carcinoma (TCGA, PanCancer Atlas) (71), Lung Adenocarcinoma (OncoSG) (78), Lung Adenocarcinoma (TCGA, PanCancer Atlas) (71), Ovarian Serous Cystadenocarcinoma (TCGA, PanCancer Atlas) (71), Prostate Adenocarcinoma (Broad/Cornell) (79), Prostate Adenocarcinoma (Fred Hutchinson CRC) (80), Prostate Adenocarcinoma (TCGA, PanCancer Atlas) (71), Small Cell Carcinoma of the Ovary (MSKCC) (81), Uterine Carcinosarcoma (Johns Hopkins) (82), and Uterine Corpus Endometrial Carcinoma (TCGA, PanCancer Atlas) (71). Samples lacking mutation in any of the candidate genes were assembled into a control dataset (N = 7026). In order to validate initial findings, experimental data sets were assembled on a per gene basis and subcategorized according to mutation type (amplification, deep deletion, or missense/truncating mutations).
Data: The 1000 Genomes Project
Genotypic data from phase III of the 1000 Genomes Project was obtained through the Ensembl data portal Donor identification numbers were matched to genotypic data in order to assemble haplotypes. Quantification of haplotype frequencies was subsequently performed using the Haploview software package.
Cutpoint optimization
The %findcut SAS macro was used to calculate cutpoints as previously described (83). The mean value was used as the cutoff for dichotomizing the case population in all subsequent analyses.
Association testing
All association testing was performed using the Sequence Variation Suite software package (Golden Helix, Bozeman, MO) and a custom workstation with dual Xeon Gold 12-core processors and 192 Gb RAM (Thinkmate, Waltham, MA). Genotypic data sourced from The Two Sister Study was filtered to exclude variants with a call rate ≤ 0.99, a minor allele frequency < 0.05, or extremes in Hardy-Weinberg disequilibrium within the control population (p ≤ 1 x 10−6). In order to identify regions of interest, haplotype-based association testing was performed using an EM algorithm (50-iterations, convergence tolerance = 0.0001, frequency threshold = 0.01) and a dynamic window size of 10 kilobases (kb). Covariates in the analysis included race, family history of disease occurrence, and age-at-menopause. Regions of interest were identified using a threshold of p ≤ 5 x 10−4.
Sliding window haplotype-based association testing
Sliding window analysis was performed essentially as described by Mathias et al (41). Genotypic data from patients affected by breast cancer was filtered as described leaving a total of 684,126 variants for analysis. Sliding window analysis was subsequently performed using windows of varying size (2-6 SNPs) to evaluate unphased haplotypes. Analysis was performed using a case-control design and an EM algorithm (50-iterations, convergence tolerance = 0.0001, haplotype frequency threshold = 0.01). Applied test statistics consisted of a single test per sliding window and not the individual tests for each haplotype. For comparison, a single-locus Efficient Mixed-Model Association eXpedited (EMMAX) analysis was performed under an additive model using Sequence Variation Suite software (Golden Helix).
Manhattan plots
Manhattan plots were generated by plotting of observed versus expected -log[p] values. A single plot was constructed using output from haplotype-based association testing with a static window size of 10 kb for reference. A second composite plot was constructed by overlaying the output from single marker association tests and sliding window association tests using windows of 2, 4, and 6 SNVs in length.
Graphical assessment of p-values from sliding window haplotype tests (GrASP)
GrASP is a graphical tool for displaying p-values from sliding window tests (41). The Excel add-on produces a simple graphic that simultaneously depicts the width of the sliding windows while using user-specified color to specify varying levels of significance. GrASP allows the user to identify regions/blocks of interest, based jointly on the absolute p-value of the tests from these windows and the building of haplotypes of significance in the region. Graphical representations for regions of interest were assembled and trimmed to display regions of increasing significance while minimizing the length of flanking sequence falling below a threshold of p < 5 x 10−5. Assembled images were presented within the context of functional genomic elements as defined within the Ensembl human genome browser (GRCh38). GrASP is freely available for use at: http://research.nhgri.nih.gov/GrASP/.
Forest plot
Odds ratios (OR) and 95% confidence intervals (95% CI) were derived from a single overall test per sliding window, and not the individual tests of deviation for each haplotype. Weighting was performed using –(log[p]) as the weighting variable so that symbol size directly correlated with significance. A Forest plot depicting the OR and 95% CI was generated using the “DistillerSR Forest Plot Generator from Evidence Partners” web resource (https://www.evidencepartners.com/resources/forest-plot-generator/).
Haplotype trend regression
Haplotype trend regression was performed using the Sequence Variation Suite software package (Golden Helix). Analysis was performed using predefined blocks as described within the text and Supplemental Table S3. Stepwise regression was performed using backwards elimination and up to 50 EM iterations with a convergence tolerance of 0.0001 and frequency threshold of 0.01. Full versus reduced model regression was performed using age-at-diagnosis as a quantitative trait, with race, family history of disease, and menopause status as covariates. Correction for multiple testing was performed using Bonferroni adjusted p values and 1,000 full scan permutations.
Haplotype frequency analysis
EM frequencies representing the 1,456 cases defined in The Two Sister Study were contrasted against population-specific haplotype frequencies. Population-specific data was gathered from phase III of the 1000 Genomes Project and haplotype frequencies were determined using Haploview software (84).
Variance partition analysis
Data derived from the TCGA Pancancer Atlas Breast Invasive Carcinoma study (70) was evaluated in the R software environment using the VariancePartition application (85). Expression data consisted of z-score measures relative to normal samples obtained through cbioportal. Expression data was unavailable for ADAD1 and hence this candidate was excluded from the analysis. Age-at-diagnosis was correlated with expression data as previously described.
Pancancer mutation analysis
To evaluate the impact of mutation type on cancer age-of-onset meta-data representing 9 candidate genes from 30 studies was obtained as described. A total of 220 donors were excluded due to cross-study differences in the definition of donor age. Pertinent data included age-at-diagnosis/diagnosis age, gene-specific copy number variants, gene-specific coding variants (missense/truncating). Samples lacking mutation in any of the candidate genes were assembled into a control dataset (N = 7026). Experimental populations were defined on a per gene basis and were classified by mutation type (amplifications, deep deletions, or missense/truncating mutations). Distribution analysis was performed using a two sample Z-test and the equation where is the mean age-at-diagnosis for the control population, is the mean age-at-diagnosis for the case population, is the standard deviation for the control population divided by the square root of the number of data points, and is the standard deviation for the case population divided by the square root of the number of data points. Corresponding p values were calculated for each independent test statistic.
Data Availability
Pertinent data may be accessed through dbGAP. The Two Sister Study: A Family-Based Study of Genes and Environment in Young-Onset Breast Cancer (accession phs000678.v1.p1). All other data sources including but not limited to cbioportal, 1000 Genomes, and The Cancer Genome Atlas are within the public domain and may be freely accoessed using public data portals.
https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000678.v1.p1
ACKNOWLEDGMENTS
We acknowledge the Two Sister Study PI, Clarice R. Weinberg, Susan G. Komen for the Cure (grant FAS703856) and the Intramural Program of the National Institute of Environmental Health Sciences. Fei, Cl, DeRoo, L., Sandler, D.P. and Weinberg, C.R. Fertility drugs and young-onset breast cancer: Results from the Two Sister Study. Journal of the National Cancer Institute 104(13): 1021-7, 2012 (PMID 22773825).(86) Without the invaluable work of the aforementioned investigators, the present analysis would not be possible.
Footnotes
Email addresses: James.gilbert2{at}chp.edu
James.Cray{at}osumc.edu
joseph.losee{at}chp.edu
Funding: The work described within this study was funded through the Children’s Fund of Children’s Hospital of Pittsburgh of UPMC and through the Ross H. Musgrave Endowment (J.E.L).
Conflicts of Interest: There are no conflicts of interest to disclose.
Figure 2 images edited for scaling error.
BIBLIOGRAPHY
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.
- 10.↵
- 11.↵
- 12.↵
- 13.↵
- 14.
- 15.
- 16.
- 17.
- 18.↵
- 19.↵
- 20.
- 21.
- 22.
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.↵
- 28.
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.
- 52.
- 53.
- 54.↵
- 55.
- 56.
- 57.↵
- 58.↵
- 59.
- 60.
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.↵