Main

The discrete mutational processes that drive copy number change in human cancers are not readily identifiable from genome-wide sequencing data. This presents a major challenge for the development of precision medicine for cancers that are strongly dominated by copy number changes, including high-grade serous ovarian (HGSOC), esophageal, non-small-cell lung and triple-negative breast cancers1. These tumors have a low frequency of recurrent oncogenic mutations, few recurrent copy number alterations and highly complex genomic profiles2.

HGSOCs are carcinomas with poor prognosis that have ubiquitous TP53 mutations3. Despite efforts to discover new molecular subtypes and targeted therapies, overall survival has not improved over two decades4. Current genomic stratification is limited to defining homologous recombination-deficient (HRD) tumors5,6,7; approximately 20% HGSOC cases have a germline or somatic mutation in BRCA1 and/or BRCA2 with smaller contributions from mutations in or epigenetic silencing of other genes associated with homologous recombination8. Classification using gene expression predominantly reflects the tumor microenvironment and is reliable in only a subset of patients9,10,11. Detailed genomic analysis using whole-genome sequencing has shown frequent loss of RB1, NF1 and PTEN by gene breakage events12 and enrichment of amplification-associated fold-back inversions in non-HRD tumors13. However, none of these approaches has provided a broad mechanistic understanding of HGSOC, reflecting the challenges of detecting classifiers in extreme genomic complexity.

Recent algorithmic advances have enabled interpretation of complex genomic changes by identifying mutational signatures—genomic patterns that are the imprint of mutagenic processes accumulated over the lifetime of a cancer cell14. For example, ultraviolet-light exposure or mismatch repair defects induce distinct, detectable single-nucleotide variant (SNV) signatures14. The clinical utility of these signatures has recently been demonstrated through a combination of structural variant and SNV signatures to improve the prediction of HRD15. Notably, these studies show that tumor genomes are shaped by multiple mutational processes and that novel computational approaches are needed to identify coexistent signatures. We hypothesized that specific features of copy number abnormalities could represent the imprints of distinct mutational processes, and developed methods to identify signatures from copy number features in HGSOC.

Results

Experimental design and data collection

We generated absolute copy number profiles from 253 primary and relapsed HGSOC samples from 132 patients in the British Translational Research Ovarian Cancer Collaborative (BriTROC-1) cohort16 using low-cost shallow whole-genome sequencing (sWGS; 0.1×) and targeted amplicon sequencing of TP53 (Supplementary Fig. 1). These samples formed the basis of our copy number signature identification. A subset of 56 of these cases had deep whole-genome sequencing (dWGS) performed for mutation analysis and comparison with sWGS data. Independent datasets for validation included 112 dWGS HGSOC cases from the Pan-Cancer Analysis of Whole Genomes (PCAWG)17 and 415 HGSOC cases with SNP array and whole-exome sequencing data from The Cancer Genome Atlas (TCGA)8. Supplementary Figure 1a shows the Reporting recommendations for tumour Marker prognostic studies (REMARK) diagram for selection of BriTROC-1 patients. Supplementary Figure 1b outlines which samples were used in each analysis across the three cohorts. Clinical data for the BriTROC-1 cohort are summarized in Supplementary Table 1 and Supplementary Fig. 2.

Identification and validation of copy number signatures

To identify copy number signatures, we computed the genome-wide distributions of six fundamental copy number features for each sample: the breakpoint count per 10 Mb, the copy number of the segments, the difference in copy number between adjacent segments, the breakpoint count per chromosome arm, the lengths of oscillating copy number segment chains and the size of segments. These features were selected as hallmarks of previously reported genomic aberrations, including breakage–fusion–bridge cycles18, chromothripsis19 and tandem duplication20,21.

We applied mixture modeling to separate the copy number feature distributions from 91 BriTROC-1 samples with high-quality copy number profiles into mixtures of Poisson or Gaussian distributions. This resulted in a total of 36 mixture components (Fig. 1a). For each sample, the posterior probability of copy number events arising from these components was computed and summed. These sum-of-posterior vectors were then combined to form a sample-by-component sum-of-posteriors matrix. To identify copy number signatures, this matrix was subjected to non-negative matrix factorization (NMF)22, a method previously used for deriving SNV signatures14.

Fig. 1: Copy number signature identification from sWGS data and validation in independent cohorts.
figure 1

a, Step 1: absolute copy numbers are derived from sWGS data; step 2: genome-wide distributions of six fundamental copy number features are computed; step 3: Gaussian or Poisson mixture models (depending on data type) are fitted to each distribution and the optimal number of components is determined (ranging from 3 to 10); step 4: the data are represented as a matrix with 36 mixture component counts per tumor; step 5: non-negative matrix factorization is applied to the components-by-tumor matrix to derive the tumor-by-signature matrix and the signature-by-components matrix. b, Heat maps show component weights for copy number signatures in two independent cohorts of HGSOC samples profiled using WGS and SNP array. Correlation coefficients are provided in Supplementary Table 2.

NMF identified seven copy number signatures (Fig. 1a) as well as their defining features and exposures in each sample. The optimal number of signatures was chosen using a consensus from 1,000 initializations of the algorithm and 1,000 random permutations of the data combining four model selection measures (Supplementary Fig. 3). We found highly similar component weights for the signatures in the two independent cohorts (PCAWG-OV and TCGA), demonstrating the robustness of both the methodology and the copy number features (Fig. 1b, P < 9 ×10−5, median r = 0.86; Supplementary Table 2), despite a significant difference in exposures to copy number signatures 2, 3, 4 and 5 between the cohorts (P < 0.05, two-sided Wilcoxon rank-sum test, Supplementary Fig. 4).

Mutational processes underlying copy number signatures

The majority of cases that were analysed exhibited multiple signature exposures, suggesting that HGSOC genomes are shaped by more than one mutational process. As our signature analysis reduced this genomic complexity to its constituent components, we were able to link the individual copy number signatures to their underlying mutational processes. To do this, we used the component weights identified by NMF to determine which pattern of global or local copy number change defined each signature. For example, for copy number signature 1, the highest weights were observed for components that represent low numbers of breakpoints per 10 Mb, long genomic segments and two breaks occurring per chromosome arm (Fig. 2a, Supplementary Fig. 5). Two breaks per chromosome arm suggested that the mutational process underlying this signature might be breakage–fusion–bridge (BFB) events18.

Fig. 2: Linking copy number signatures with mutational processes.
figure 2

a, Component weights for copy number signature 1. Top, plots are grouped by copy number feature and show weights for each of the 36 components. Middle, the mixture model distributions, which are shaded by the component weight. Solid colors have a high weight whereas transparent colors have a low weight (contrasting colors are randomly assigned). Bottom, genome-wide distribution (histogram or density) of each copy number feature, across the BriTROC-1 cohort, with colored plots indicating important distributions (>0.1 component weight). (Note: similar plots for other copy number signatures are shown in Fig. 3 and Supplementary Fig. 5.) b, Associations between copy number signature exposures and other features. Purple, positive correlation; orange, negative correlation (see also Supplementary Fig. 6). Numbers on the right indicate the number of cases included in each analysis. Only significant correlations are shown (P < 0.05). c, Associations between copy number signature exposures and SNV signatures. Purple, positive correlation; orange, negative correlation (see also Supplementary Fig. 6). The number on the right indicates the number of cases included in the analysis. APOBEC; apolipoprotein B mRNA editing catalytic polypeptide-like. d,e, Difference in copy number signature exposures between cases with mutations in specific genes (d) and mutated or wild-type reactome pathways (e). The absolute difference in mean signature exposures was calculated for cases with and without mutations. Colors in filled circles indicate extent of difference. Only differences with false-discovery rate P < 0.05 (Mann–Whitney U-test) are shown (see also Supplementary Fig. 7). Numbers on the right indicate the number of cases with mutations (SNVs, amplifications or deletions) in each gene or pathway.

To test this hypothesis, we correlated copy number signature 1 exposures with mutation data, SNV signatures and other measures derived from dWGS and exome sequencing (Fig. 2b–e, Supplementary Figs. 69, Supplementary Tables 38). Copy number signature 1 was anti-correlated with sequencing estimates of telomere length (r = −0.32, P = 0.009), consistent with BFB events. In addition, copy number signature 1 was positively correlated with amplification-associated fold-back inversion structural variants (r = 0.36, P = 0.02), which have been strongly implicated in BFB events23 and have also been associated with reduced survival in HGSOC13. Copy number signature 1 was also enriched in cases with oncogenic RAS signaling, including NF1 loss and mutated KRAS (P = 5 × 10−6, Mann–Whitney U-test), which has previously been shown to induce chromosomal instability as a result of aberrant G2 and mitotic checkpoint controls and missegregation24,25. Taken together, these data provide independent evidence for BFB arising as a result of oncogenic RAS signaling and telomere shortening as the underlying mechanism for copy number signature 1.

We applied these approaches to the remaining signatures to identify statistically significant genomic associations using a false-discovery rate < 0.05 (Figs. 2b–e,3, Supplementary Figs. 59, Supplementary Tables 38).

Fig. 3: The seven copy number signatures in HGSOC.
figure 3

Description of the defining component weights, key associations and proposed mechanisms for the seven copy number signatures. Only the top three mutated genes for each of the pathways associated with copy number signatures 4, 6 and 7 are shown (the list of all significant genes is provided in Supplementary Tables 10 and 8).

Copy number signature 2 showed frequent breakpoints per 10 Mb, single changes in copy number (resulting in three copies), chains of oscillating copy number, and was significantly correlated with tandem duplication phenotype scores (r = 0.3, P = 0.004) and SNV signature 5 (r = 0.26, P = 0.02). In addition, this signature was enriched in patients with mutations in CDK12 (P = 0.02, Mann–Whitney U-test, Supplementary Table 6), in keeping with previous studies that have demonstrated large tandem duplication in cases with inactivating CDK12 mutations26.

Copy number signature 4 was characterized by high copy number states (4–8 copies) and predominant copy number change points of size 2. This pattern indicates a mutational process of late whole-genome duplication27. Significantly increased signature 4 exposure in cases with aberrant PI3K–AKT signaling provided further support for late whole-genome duplication, because oncogenic PIK3CA induces tolerance to genome doubling28 (P = 2 × 10−22, Mann–Whitney U-test, mutation of PIK3CA or amplification of AKT, EGFR, MET, FGFR3 and ERBB2). Signature 4 was also seen at higher levels in cases with mutations in genes that encode proteins from Toll-like receptor signaling cascades (P = 2 × 10−7), interleukin signaling pathways (P = 3 × 10−24) and CDK12 (P = 0.0009), as well as those with amplified CCNE1 (P = 2 × 10−10) and MYC (P = 9 × 10−12). It was also significantly correlated with telomere length (r = 0.46, P = 4 × 10−5).

Copy number signature 6 showed extremely high copy number states and high copy number change points for small segments interspersed among larger, lowercopy-number segments. This indicates a mutational process resulting in focal amplification. Increased signature 6 exposure was associated with mutations in genes that encode proteins across diverse pathways, including aberrant G1/S cell cycle checkpoint control (through amplification of CCNE1, CCND1, CDK2, CDK4 or MYC, deletion or inactivation of RB1 or mutations in CDK12), Toll-like receptor signaling cascades and PI3K–AKT signaling (P < 0.05). However, as many of these statistical associations are marked by gene amplification, it is difficult to determine whether the copy number states represent causal events or are simply a consequence of focal amplification. Exposure to copy number signature 6 was also positively correlated with age at diagnosis (r = 0.31, P = 6 × 10−12) and age-related SNV signature 114 (r = 0.43, P = 3 × 10−6).

Copy number signature 5 was significantly associated with predicted chromothriptic-like events using the Shatterproof algorithm29 (r = 0.44, P = 2 × 10−3). Chromothripsis is considered rare in HGSOC12,27,30. However, the key component of this signature—the presence of copy number change points centered at 0.5 copies—suggests that the events are subclonal. This indicates that chromothripsis may be an underestimated oncogenic mechanism in HGSOC that could reflect ongoing formation and rupture of micronuclei31.

Copy number signature 3 was characterized by an even distribution of breaks across all chromosomes, and copy number changes from diploid to single copy (loss of heterozygosity (LOH)). Copy number signature 3 was significantly enriched in cases with mutations in BRCA1 and BRCA2, as well as in other homologous recombination genes such as BARD1, PALB2 and ATR (P = 0.002, Mann–Whitney U-test). It was also correlated with the HRD-related SNV signature 3 (r = 0.32, P = 0.002) and anti-correlated with age at diagnosis and age-related SNV signature 1 (P < 0.05). Copy number signature 3 was also enriched in cases with loss-of-function mutations in PTEN (P = 0.002, Mann–Whitney U-test). Taken together, these data suggest that copy number signature 3 is driven by BRCA1 and/or BRCA2-related HRD mechanisms.

Copy number signature 7, like copy number signature 3, also demonstrated an even distribution of breaks across all chromosomes. In contrast to copy number signature 3, single copy number changes were observed from a tetraploid rather than a diploid state (Fig. 3). Even though a correlation with the HRD-related SNV signature 3 was found, there was no enrichment of mutations in BRCA1 or BRCA2, suggesting that alternative HRD mechanisms are the potential mutational processes.

We also investigated the relationships between the copy number signatures. BRCA1 dysfunction and CCNE1 amplification have been shown to be mutually exclusive in HGSOC32, and we observed that copy number signature 3 (BRCA1 and BRCA2 HRD) and copy number signature 6 (marked by aberrant G1/S cell cycle checkpoint control) showed mutually exclusive associations (Fig. 2b–e). Loss of BRCA1 and BRCA2 are early driver events in HGSOC, and to investigate acquisition of additional mutational processes, we studied four BriTROC-1 cases with deleterious germline BRCA2 mutations and confirmed somatic LOH at BRCA2 (Fig. 4). A diverse and variable number of copy number signatures was seen in these cases, including substantial exposures to copy number signature 1 (RAS signaling) in three of the four cases.

Fig. 4: Copy number signature exposures of four BriTROC-1 patients with germline BRCA2 mutations and somatic LOH.
figure 4

Stacked bar plots show copy number signature exposures for four BriTROC-1 cases with pathogenic germline BRCA2 mutations and confirmed somatic LOH at the BRCA2 locus.

Copy number signatures predict overall survival

We next explored the association between individual copy number signature exposures and overall survival using a combined dataset of 575 diagnostic samples with clinical outcomes. We trained a multivariate Cox proportional hazards model on 417 cases and tested this on the remaining 158 cases (Fig. 5, Supplementary Table 9). Copy number signature exposure was significantly predictive of survival (training: P = 0.002, log-rank test, stratified by age and cohort; test: P = 0.05, C index = 0.56, 95% confidence interval:0.50–0.62; entire cohort: P = 0.002, log-rank test, stratified by age and cohort). Across the entire cohort, poor outcome was significantly predicted by exposure to copy number signatures 1 (P = 0.0008) and 2 (P = 0.03), whereas good outcome was significantly predicted by exposure to copy number signatures 3 (P = 0.05) and 7 (P = 0.006).

Fig. 5: Association of survival with copy number signatures.
figure 5

Top, stacked bar plots show copy number signature exposures for each patient. Patients were ranked by risk of death estimated by a multivariate Cox proportional hazards model stratified by age and cohort, with copy number signature exposures as covariates. Middle, the matrix indicates the group for each patient assigned by unsupervised clustering of copy number signature 1, 2, 3 and 7 exposures (see also Supplementary Fig. 10). Bottom, linear fit of signature exposures ordered by risk predicted by the Cox proportional hazards model.

Unsupervised hierarchical clustering of samples by signature exposures identified three clusters (Fig. 5). Despite showing significant survival differences (P = 0.004, log-rank test, stratified by age and cohort), these clusters did not provide any prognostic information in addition to the information identified by the Cox proportional hazards model; cluster 2 was dominated by patients with high signature 1 exposures (poor prognosis), cluster 3 showed high signature 3 exposures (good prognosis) and cluster 1 had mixed signature exposures (Supplementary Fig. 10).

Copy number signatures indicate relapse following chemotherapy

Using a generalized linear model, we investigated whether copy number signatures could be used to predict outcome following chemotherapy across 36 patients from the BriTROC-1 study with paired diagnostic and relapse samples16. The model showed copy number signature 1 exposures at the time of diagnosis to be significantly predictive of platinum-resistant relapse (P = 0.02, z-test, Supplementary Table 10).

Using the same 36 sample pairs, we also investigated whether chemotherapy treatment changed copy number signature exposures. No significant effects on exposures were observed following chemotherapy treatment using a linear model that accounted for signature exposure at time of diagnosis, number of lines of chemotherapy and patient age (P > 0.05, F-test, Supplementary Table 10). The only variable that showed a significant association with exposure at relapse was signature exposure at diagnosis (P < 0.01, F-test, Supplementary Table 11).

Discussion

Copy number signatures provide a framework that is able to rederive the major defining elements of HGSOC genomes, including defective homologous recombination8, amplification of CCNE19 and amplification-associated fold-back inversions13. In addition, the copy number signatures show significant associations with known driver gene mutations in HGSOC and provide the ability to detect novel associations with gene mutations. We derived signatures using inexpensive shallow whole-genome sequencing of DNA from core biopsies. These approaches are rapid and cost-effective, thus providing a clear path to clinical implementation. Copy number signatures open new avenues for clinical trial design by highlighting contributions from underlying mutational processes that depend on oncogenic RAS and PI3K–AKT signaling.

We found that almost all patients with HGSOC demonstrated a mixture of signatures indicative of combinations of mutational processes. These results suggest that early TP53 mutation, the ubiquitous initiating event in HGSOC, may permit multiple mutational processes to coevolve, potentially simultaneously. Although further work is needed to define the precise timing of signature exposures, early driver events, such as mutations in BRCA2, still permit a diverse and variable number of copy number signatures in addition to an HRD signature (Fig. 4). These additional signature exposures may alter the risk of developing therapeutic resistance, particularly when only a single mutational process, such as HRD, is targeted.

High exposure to copy number signature 3, characterized by BRCA1–BRCA2-related HRD, is associated with improved overall survival, confirming previous data that showed that mutations in BRCA1 and BRCA2 are associated with long survival in HGSOC33,34. Conversely, high exposure to signature 1, which is characterized by oncogenic RAS signaling (including NF1, KRAS and NRAS mutations), predicts subsequent platinum-resistant relapse and poor survival. This suggests that powerful intrinsic resistance mechanisms are present at the time of diagnosis and can be readily identified using copy number signature analysis. This hypothesis is supported by the presence of exposure to copy number signature 1 in germline BRCA2-mutated cases (Fig. 4) as well as our previous work that showed the expansion of a resistant subclonal NF1-deleted population following chemotherapy treatment in HGSOC35 and poor outcomes in Nf1-deleted mouse models of HGSOC36. Our copy number signature analysis of BRCA2-mutated cases also concurs with PCAWG and ICGC data showing that over half (9 out of 16) of NF1-mutated cases also harboured mutations in BRCA1 or BRCA212. These data suggest a complex interplay between RAS signaling and HRD. Thus, RAS signaling may be an important target, especially in first-line treatment, to prevent emergence of platinum-resistant disease.

We found that copy number signature exposures were not significantly altered between diagnosis and disease relapse in 36 sample pairs with a median interval of 30.6 months16. This suggests that the underlying mutational processes in HGSOC are relatively stable and that genome-wide patterns of copy number change mainly reflect historic alterations to the genome that had been acquired during tumorigenesis37. Relative invariant genomic changes were also observed in the ARIEL2 trial, in which genome-wide LOH was used to predict HRD, and only 14.5% (17 out of 117) cases changed LOH status between diagnosis and relapse7.

Larger association studies will be required to further refine the definitions and interpretation of the copy number signatures. The application of our approach to other tumour types is likely to extend the set of signatures beyond the robust core set that was identified here. Basal-like breast cancers, squamous cell and small-cell lung carcinoma, which all have high rates of TP53 mutation and genomic instability2, are promising next targets. Although it is likely that the strong associations have identified the driver mutational processes for copy number signatures 1 and 3, functional studies will be required to establish causal links for the remaining signatures. For example, copy number signature 6 was significantly associated with multiple mutated pathways, and this association was primarily driven by amplification of target genes. As this signature represented focal amplification events, it is difficult to determine whether amplification of specific genes drives the underlying mutational process or the amplifications emerge as a consequence of strong selection of advantageous phenotypes. Our data do not provide timing information for exposures and there is the real possibility that one mutational process may well drive the emergence of other mutational processes. For example, the association between signature 6 and PI3K signaling is also shared with signature 4.

Other limitations of this work are technical: we integrated data from three sources, using three different preprocessing pipelines, and the ploidy determined by different pipelines can have a significant effect on the derived signatures. For example, high-ploidy copy number signature 4 was predominantly found in the sequenced samples that underwent careful manual curation to identify whole-genome duplication events. When extending to larger sample sets, a unified processing strategy with correct ploidy determination is likely to produce improved signature definitions. Another technical limitation is the resolution of copy number calling from sWGS (limited to 30-kb bins) and future application to large cohorts of deeply sequenced samples will be needed to improve the resolution of the copy number signatures.

Efforts to identify discrete, clinically relevant subtypes of disease have been successful in many cancer types38,39,40. However, HGSOC lacks clinically relevant patient stratification, which is reflected in continued poor survival. We show that HGSOC genomes are shaped by multiple mutational processes that preclude simple subtyping. Thus, our results suggest that HGSOC is a continuum of genomes. By dissecting the mutational forces shaping HGSOC genomes, our study creates an opportunity to understand extreme genomic complexity, as well as revealing the evolution of tumors as they relapse and acquire resistance to chemotherapy.

URLs

sWGS power calculator, https://gmacintyre.shinyapps.io/sWGS_power/; Cancer Genome Interpreter, https://www.cancergenomeinterpreter.org/home; copy number signatures code repository, https://bitbucket.org/britroc/cnsignatures.

Methods

Patients and samples

The BriTROC-1 study has been described previously16. Characteristics of the 142 patients included in this study are given in Supplementary Table 1. The study is sponsored by NHS Greater Glasgow and Clyde and ethics/IRB approval was given by Cambridge Central Research Ethics Committee (12/EE/0349). The study enrolled patients with recurrent ovarian high-grade serous or grade 3 endometrioid carcinoma who had relapsed following at least one line of platinum-based chemotherapy and whose disease was amenable either to image-guided biopsy or secondary debulking surgery. At study entry, patients were classified as having either platinum-sensitive relapse (that is, relapse six months or more following last platinum chemotherapy) or platinum-resistant relapse (that is, relapse less than six months following prior platinum chemotherapy) (Supplementary Fig. 2). All patients provided written informed consent. Access to archival diagnostic formalin-fixed tumor samples was also required. Survival was calculated from the date of enrolment to the date of death or the last clinical assessment, with data cutoff at 1 December 2016. At subsequent relapse or progression after chemotherapy following study entry, patients could optionally have a second biopsy under separate consent.

DNA was extracted from 300 samples of 142 patients—158 methanol-fixed relapse biopsies and 142 formalin-fixed paraffin-embedded (FFPE) archival diagnostic tissues. Germline DNA was extracted from blood samples of 137 patients.

Tagged-amplicon sequencing

Mutation screening of TP53, PTEN, EGFR, PIK3CA, KRAS and BRAF was performed on all 300 samples using tagged-amplicon sequencing as previously described16. DNA extracted from blood was analyzed by tagged-amplicon sequencing for BRCA1 and BRCA2 germline mutations.

sWGS

Libraries for sWGS were prepared from 100 ng DNA using modified TruSeq Nano DNA LT Sample Prep Kit (Illumina) protocol41. The quality and quantity of the libraries were assessed with DNA-7500 kit on 2100 Bioanalyzer (Agilent Technologies) and with Kapa Library Quantification kit (Kapa Biosystems) according to the manufacturer’s protocols. Sixteen to twenty barcoded libraries were pooled together in equimolar amounts and each pool was sequenced on HiSeq2500 in SE-50bp mode.

Before sequencing, we estimated the required sequencing depth by adapting calculations made in previous work that explored the relationship between sequencing depth (reads per sample) and copy number calling accuracy42. On the basis of these analyses, we devised a power calculator for sWGS copy number analysis (see URLs; described previously43). We estimated that with an average ploidy of 3 and purity of 0.65, a sequencing depth of at least 2.7 million reads is required to detect single, clonal copy number changes (minimum 60 kb) at 90% power and α = 0.05. After analysis, we determined that BritROC three-star samples had an average purity of 0.66, ploidy of 2.7, and were sequenced to an average depth of 8.6 million reads. This allowed us to detect single copy number changes with 90% power and α = 0.05 down to subclonal frequencies of 55%.

dWGS

dWGS was performed on 56 tumors with confirmed TP53 mutations and matched normal samples, of which 48 passed quality control. Libraries were constructed with around 350-bp insert length using the TruSeq Nano DNA Library prep kit (Illumina) and sequenced on an Illumina HiSeq X Ten System in paired-end 150-bp reads mode. The average depth was 60× (range 40–101×) in tumors and 40× (range 24–73×) in matched blood samples.

Variant calling

Read alignment and variant calling of tagged-amplicon sequencing data were processed as described41. dWGS samples were processed with bcbio-nextgen44 using Ensemble somatic variants called by two methods out of VarDict45, Varscan46 and FreeBayes47. Somatic SNV calls were further filtered based on mapping quality, base quality, position in read and strand bias as previously described40. In addition, the blacklisted SNVs from the Sanger Cancer Genomics Project pipeline derived from a panel of unmatched normal samples were used for filtering48.

Data download

PCAWG-OV: consensus SNVs and insertions and deletions (INDELs) (October 2016 release), consensus structural variants (v.1.6), consensus copy number calls (January 2017 release), donor clinical (August 2016 v.7-2) and donor histology information (August 2016 v.7) for 112 ovarian cancer samples were downloaded from the PCAWG data portal. ABSOLUTE49 copy number calls were used for analysis.

TCGA: ABSOLUTE49 copy number profiles from a previously published study27 for 415 ovarian cancer TCGA samples were downloaded from Synapse50. SNVs for these samples were downloaded from the Broad Institute TCGA Genome Data Analysis Center (Broad Institute TCGA Genome Data Analysis Center: Firehose stddata__2016_01_28 run; https://doi.org/10.7908/C11G0KM9, Broad Institute of MIT and Harvard). Donor clinical data were downloaded from the TCGA data portal.

Absolute copy number calling from sWGS

Segmentation

sWGS reads were aligned and relative copy numbers were called as previously described41. After inspection of the TP53 mutation status and relative copy number profiles of the 300 sequenced BriTROC-1 samples, 47 were excluded from downstream analysis for the following reasons: low purity (24), mislabeled (7), pathology re-review revealed sample was not HGSOC (3), no detectable TP53 mutation (13). Of the 253 BriTROC-1 samples analysed, 111 were FFPE-fixed. Of the 253 samples, 57 showed an oversegmentation artefact (likely owing to fixation). A more strict segmentation was subsequently applied to these samples to yield a usable copy number profile.

Absolute copy number

We combined relative copy number profiles generated by QDNA-seq42 with mutant allele frequency identified using tagged amplicon sequencing in a probabilistic graphical modeling approach to infer absolute copy number profiles. Using expectation-maximization, the model generated a posterior over a range of TP53 copy number states, using the TP53 mutant allele frequency to estimate purity for each state. The TP53 copy number state that provided the highest likelihood of generating a clonal absolute copy number profile was used to determine the final absolute copy number profile. To test the validity of this approach, we compared purity and ploidy estimates derived from sWGS to those derived from 60× WGS using the Battenberg algorithm for copy number calling51. Pearson correlation coefficients were computed for both ploidy and purity estimates using 34 three-star BriTROC-1 samples (see ‘Quality rating’) with matched sWGS and WGS (Supplementary Fig. 11).

Quality rating

Following absolute copy number fitting, samples were rated using a 1–3-star system. The one-star samples (n = 54) showed a noisy copy number profile and were considered likely to have incorrect segments and missing calls. These were excluded from further analysis. The two-star samples (n = 52) showed a reasonable copy number profile with only a small number of miscalled segments. These samples were used (with caution) for some subsequent analyses. The three-star samples (n = 147) showed a high-quality copy number profile that was used in all downstream analyses. The maximum star rating observed per patient was one-star in 15 patients, two-star in 26, and three-star in 91 patients. In total, 72 out of 111 FFPE-fixed samples (64%) were usable for signature analysis. This is consistent with typical sequencing success rates for archival material52.

Copy number signature identification

Preprocessing

The 91 three-star BriTROC-1 absolute copy number profiles were summarized using the genome-wide distribution of six different features (outlined in Fig. 1). (1) Segment size: the length of each genome segment; (2) Breakpoint count per 10 Mb: the number of genome breaks appearing in 10-Mb sliding windows across the genome; (3) Change point copy number: the absolute difference in copy number between adjacent segments across the genome; (4) Segment copy number: the observed absolute copy number state of each segment; (5) Breakpoint count per chromosome arm: the number of breaks occurring per chromosome arm; (6) Length of segments with oscillating copy number: a traversal of the genome counting the number of contiguous copy number segments alternating between two copy number states, rounded to the nearest integer copy number state.

Mixture modeling

For each of the featured density distributions, we applied mixture modeling to identify its distinct components. For distributions representing segment size, change point copy number and segment copy number, we used mixtures of Gaussian distributions. For distributions representing breakpoint count per 10 Mb, length of segments with oscillating copy number and breakpoint count per chromosome arm, we used mixtures of Poisson distributions. Mixture modeling was performed using the FlexMix V2 package in R53. The algorithm was run for each distribution with the number of components ranging from 2 to 10. The optimal number of components was selected as the run showing the lowest Bayesian information criterion, resulting in a total of 36 components (see Fig. 1 and Supplementary Table 3 for breakdown). Next, for each copy number event, we computed the posterior probability of belonging to a component. For each sample, these posterior event vectors were summed resulting in a sum-of-posterior probabilities vector. All sum-of-posterior vectors were combined in a patient-by-component sum-of-posterior probabilities matrix.

Signature identification

The NMF package in R54, with the Brunet algorithm specification55 was used to deconvolute the patient-by-component sum-of-posteriors matrix into a patient-by-signature matrix and a signature-by-component matrix. A signature search interval of 3–12 was used, running the NMF 1,000 times with different random seeds for each signature number. As provided by the NMF package54, the cophenetic, dispersion, silhouette and sparseness coefficients were computed for the signature-by-component matrix (basis), patient-by-signature matrix (coefficients) and connectivity matrix (consensus, representing patients clustered by their dominant signature across the 1,000 runs). The 1,000 random shuffles of the input matrix were performed to get a null estimate of each of the scores (Supplementary Fig. 3). We identified the minimum signature number that yielded stability in the cophenetic, dispersion and silhouette coefficients, and that yielded the maximum sparsity that could be achieved without exceeding the value that was observed in the randomly permuted matrices. As a result, seven signatures were deemed optimal under these constraints and were chosen for the remaining analysis.

Signature assignment

For the remaining 26 two-star patient samples, and the 82 secondary patient samples (from patients with two- or three-star profiles from additional tumor samples), the LCD function in the YAPSA package in Bioconductor56 was used to assign signature exposures.

Copy number signature validation

The signature identification procedure described above was applied to copy number profiles from two independent datasets: 112 whole-genome sequenced (approximately 40×) HGSOC samples processed as part of ICGC Pan-Cancer Analysis of Whole Genomes Project17 (denoted here as PCAWG-OV), and 415 SNParray profiling samples of HGSOC cases as part of TCGA27. The number of signatures was fixed at 7 for matrix decomposition with NMF. Pearson correlation was computed between the BriTROC-1 signature-by-component weight matrix and each of the PCAWG-OV and TCGA signature-by-component matrices, signature by signature (Supplementary Table 2).

Association of copy number signature exposures with other features

Association of signature exposures with other features was performed using one of two procedures: for a continuous association variable, correlation was performed; for a binary association variable, patients were divided into two groups and a Mann-Whitney U-test was performed to test for differences in signature exposure medians between the two groups. A more detailed explanation of each of these association calculations is given below. (Note: of the 48 deep WGS BriTROC-1 samples that passed quality control, only 44 had matched two- and three-star sWGS copy number profiles. Because signature exposures from sWGS were used for BriTROC-1 sample associations, only these 44 samples could be used.)

Age at diagnosis

Patient age at diagnosis for 112 PCAWG-OV samples and 415 TCGA samples was used to compute Pearson correlation with signature exposures.

Amplification-associated fold-back inversions

For 111 PCAWG-OV samples, the fraction of amplification-associated fold-back inversion events per sample was calculated as the proportion of head-to-head inversions (h2hINVs) within a 100-kb window amplified region (copy number ≥5) relative to the total number of structural variant calls per sample. In total, 94 samples had at least 1 h2hINV event out of which 58 had h2hINV events in amplified regions. On average, they accounted for 4% of structural variant calls. As these are rare events, only samples showing a non-zero fraction of fold-back inversions (n = 67) were used to compute Pearson correlation with signature exposures.

Telomere length

Telomere lengths of 44 dWGS tumor samples from the BriTROC-1 cohort were estimated using the Telomerecat algorithm57. Telomere length estimates ranged from 1.5 to 11 kb with an average of 4 kb. Correlation between telomere length and copy number signature exposures was calculated with age and tumor purity as covariates using the ppcor package in R58.

Chromothripsis

Copy number and translocation information from 111 PCAWG-OV samples were used to detect chromothripsis-like events using the Shatterproof software with default parameters29. Shatterproof, a state-of-the-art software, incorporates a wide range of hallmarks of chromothripsis in its detection algorithm as a precise definition of chromothripsis remains unknown. In a previous study, a threshold of 0.37 was recommended based on the observation that normal samples produced a low number of calls with low scores (maximum 0.37), whereas prostate, colorectal and small-cell lung cancer samples that were known to have chromothriptic events produced the highest scores29. Previous studies have reported a low incidence of chromothriptic events in HGSOC12,27,30. The number of calls per sample in the PCAWG-OV samples ranged from 5 to 47 with an average of 23. The score per call ranged from 0.15 to 0.62 with a median of 0.38. Therefore, a conservative threshold was set at the 95th percentile of our distribution of scores to minimize false positives and calls with scores greater than 0.48 were used to obtain a count of chromothriptic events per sample. As chromothriptic events are rare in HGSOC, only samples showing a non-zero number of events (n = 61) were used to compute Pearson correlations with signature exposures. Of 61 samples with scores above the threshold, 49 (80.3%) had 1–2 events, 11 samples (18%) had 3–6 events and 1 sample (1.6%) had 10 events.

Tandem duplicator phenotypes

Tandem duplicator phenotype (TDP) scores were calculated for 111 PCAWG-OV samples using the method described previously21. The number of duplication events per chromosome normalized to chromosome length per sample was used to calculate a score relative to the expected number of duplication events per chromosome per sample. The scores ranged from −1.11 to 0.53 with an average score of 0.02.

Mutational signatures

Motif matrices were extracted using the SomaticSignatures R package59 and the weights of all known COSMIC signatures were determined using the deconstructSigs R package60 for 44 deep WGS BriTROC-1 samples and 109 PCAWG-OV samples. SNV signatures showing an exposure >0 for at least one sample were retained. The rcorr function in the Hmisc R package61 was used to calculate the correlation matrix between the remaining SNV and copy number signature exposures.

The significance of all observed correlations was estimated from a t-distribution for which the null hypothesis was that the true correlation was 0. All reported P values have been adjusted for multiple testing with using the Benjamini–Hochberg method62. Comparison plots can be found in Supplementary Fig. 6.

Mutated pathways

A combined set of 479 samples (44 dWGS BriTROC-1, 112 PCAWG-OV and 323 TCGA) showing at least one driver mutation was used for mutated pathway enrichment analysis. We focused on 765 driver genes reported by the Cancer Genome Interpreter (see URLs)63. SNVs, INDELs, amplifications (copy number >5) or deletions (copy number <0.4) affecting these genes were considered bona fide driver mutations if the Cancer Genome Interpreter predicted them as TIER1 or TIER2 (Supplementary Tables 4 and 5, see URLs; run date: 2018-01-13). In total, 320 of the 765 genes were mutated in a least one case. These genes were used to test for enriched pathways in the Reactome database using the ReactomePA R package64 with a P-value cutoff of 0.05 and Q-value cutoff of 0.05. Pathways mutated in at least 5% of the cohort (n ≥ 24) were retained. For each pathway, patients were split into two groups: those with mutated genes in the pathways, and those with wild-type genes in the pathways. A one-sided Mann–Whitney U-test was carried out for each signature to determine whether the exposure was significantly higher in mutated cases versus wild-type cases. After multiple testing correction using the Benjamini–Hochberg method (threshold of P < 0.005 and the median difference in exposures ≥0.1), 186 pathways were significantly enriched. Visual inspection identified significant redundancy in the list and 9 representative pathways were manually selected as a final output (Supplementary Table 6).

Mutated genes

A combined set of 479 samples (44 deep WGS BriTROC-1, 112 PCAWG-OV and 323 TCGA) was used to test whether signature exposures were significantly higher in cases with mutated driver genes, including NF1, PTEN, BRCA1, BRCA2, PIK3CA, MYC and CDK12. Patients were split into two groups: those with the mutated gene and those with wild-type genes. A one-sided Mann-Whitney U-test was carried out for each signature to determine whether the exposure was significantly higher in mutated cases versus wild-type cases. After multiple testing correction using the Benjamini–Hochberg method (threshold of P < 0.05 and the median difference in exposures ≥0.08), 10 gene–signature combinations were significantly enriched (Supplementary Table 6).

Survival analysis

Censoring and truncation

Overall survival in BriTROC-1 patients was calculated from the date of enrolment to the date of death or the last documented clinical assessment, with data cutoff at 1 December 2016. As the BriTROC-1 study only enrolled patients with relapsed disease, left truncation was used in the survival analysis. In addition, cases in which the patient was not deceased were right censored. Survival data for the PCAWG-OV and TCGA cohorts were right censored as required (left truncation was not necessary). The combined samples were split into training (100% BriTROC-1, 70% PCAWG-OV and 70% TCGA = 417) and test (30% PCAWG-OV and 30% TCGA = 158) cohorts. All of the BriTROC-1 samples were used in the training set to avoid issues calculating prediction performance on left-truncated data.

Cox regression

As the signature exposures for a given sample summed to 1, it was necessary to select one normalizing signature to perform regression. Signature 5 was chosen as it showed the lowest variability across cohorts. To avoid division errors all 0 signature exposures were converted to 0.02. The remaining signature exposures were normalized taking the log ratio of their exposure to the exposure of signature 5. A Cox proportional hazards model was fitted on the training set, with the signature exposures as covariates, stratified by cohort (BriTROC-1, PCAWG-OV:AU, PCAWG-OV:US, TCGA) and age (<39; 40–44; 45–49; 50–54; 55–59; 60–64; 65–69; 70–74; 75–79; >80), using the survival package in Bioconductor65. After fitting, the model was used to predict risk in the test set and performance was assessed using the concordance index calculation in the survcomp package in Bioconductor47. A final Cox regression was performed using all data for reporting of hazard ratios and P values.

Unsupervised clustering of patients using signature exposures

Hierarchical clustering of the exposure vectors of the 575 samples used in the survival analysis was performed using the NbClust66 package in R. The optimal number of clusters was 3 as determined by a consensus voting approach across 23 metrics for choosing the optimal numbers of clusters. Out of 23 metrics, 12 reported 3 clusters as the optimal number. A Cox proportional hazards model was fitted using the cluster labels as covariates, stratified by cohort (BriTROC-1, Australian PCAWG-OV:AU, United States of America PCAWG-OV:US, TCGA) and age (< 39; 40–44; 45–49; 50–54; 55–59; 60–64; 65–69; 70–74; 75–79; >80), using the survival package in Bioconductor65.

Analysis of copy number signature changes during treatment

Thirty-six BriTROC-1 cases with matched diagnosis and relapse samples were used to investigate the effects of treatment on signature exposures. A linear model was fitted to test for treatment effects with exposure at relapse as the dependent variable and exposure at diagnosis, age at diagnosis, number of lines of chemotherapy and days between diagnosis and relapse as independent variables. Prior to fitting, age at diagnosis was centered and exposures transformed by log(x + 0.1) to ensure normality. Fitting was done using the lm() function in R.

To test whether signature exposures at diagnosis were predictive of platinum sensitivity, a generalized linear model with binomial error was fitted using type of relapse (platinum-sensitive or platinum-resistant) as the dependent variable and exposure at diagnosis and age at diagnosis as independent variables.

Reporting Summary

Further information on experimental design is available in the Nature Research Reporting Summary linked to this article.

Code availability

All code required to reproduce the analysis outlined in this manuscript can be found in the copy-number signatures code repository (see URLs).

Data availability

Sequence data that support the findings of this study have been deposited in the European Genome-phenome Archive with accession number EGAS00001002557.