Clinical-grade whole genome sequencing of colorectal cancer and 3-prime transcriptome analysis demonstrate targetable alterations in the majority of patients.

Introduction: Clinical grade whole genome sequencing (cWGS) has the potential to become standard of care within the clinic because of its breadth of coverage and lack of bias towards certain regions of the genome. Colorectal cancer presents a difficult treatment paradigm, with over 40% of patients presenting at diagnosis with metastatic disease. We hypothesised that cWGS coupled with 3-prime transcriptome analysis would give new insights into colorectal cancer. Methods: Patients underwent PCR-free whole genome sequencing and alignment and variant calling using a standardised pipeline to output SNVs, indels, SVs and CNAs. Additional insights into mutational signatures and tumour biology were gained by the use of 3-prime RNAseq. Results: Fifty-four patients were studied in total. Driver analysis identified the Wnt pathway gene APC as the only consistently mutated driver in colorectal cancer. Alterations in the PI3K/mTOR pathways were seen as previously observed in CRC. Multiple private CNAs, SVs and gene fusions were unique to individual tumours. Approximately 20% of patients had a tumour mutational burden of >10 mutations/Mb of DNA, suggesting suitability for immunotherapy. Conclusions: Clinical whole genome sequencing offers a potential avenue for identification of private genomic variation that may confer sensitivity to targeted agents and offer patients new options for targeted therapies.

medRxiv preprint, doi: https://doi.org/10.1101/2020.04.26.20080887; this version posted May 1, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

[...] load as determined by exome sequencing was associated both with tumour associated lymphocyte infiltration and overall survival. However, a key weakness of the TCGA and other studies has been the use of exome sequencing to demonstrate key oncogenic drivers. Exome sequencing, whether by the amplicon or hybridisation approach, may miss key oncogenic drivers due to allelic drop out or the biases inherent to targeting approaches (7).
Whole genome sequencing has a number of potential advantages. Firstly, it can increase overall variant calling accuracy, as exome sequencing techniques can suffer from probe dropout and poor coverage, especially at splice junctions and in "difficult" to sequence regions (8). Secondly, it can natively call fusions (9) and other structural variants (10) by detection of split reads. Finally, it can identify copy number variants (11) with higher accuracy than alternative techniques. Given the recent attention to tumour mutational burden (TMB) in selecting patients for anti-PD1 therapies such as pembrolizumab, it is also relevant that whole genome sequencing can accurately call mutation burden (12).
However, until very recently, studies of colorectal cancer using whole genome sequencing have been limited in number or scope. Shanmugan et al. (13) carried out whole genome sequencing to identify therapeutic targets in four patients with metastatic disease, finding several known mutations of interest that were potentially targetable. Ishaque et al. (14) carried out paired metastasis-primary tumour whole genome sequencing in colorectal cancer, finding novel non-coding oncogenic drivers and an elevated level of "BRCAness". The Pan-Cancer Analysis of Whole Genomes (PCAWG) consortium (15) presented 52 whole genome sequenced colorectal tumours (37 colon, 15 rectal) as part of the larger consortium effort, although at the time of writing no specific examination of the landscape of these tumours had been carried out, presumably because the previous TCGA colorectal cancer paper had already examined the exomes of 276 colorectal cancers (5).
The United Kingdom 100,000 Genomes project has set out to sequence tens of thousands of cancer genomes (16), across multiple tumour types, using a clinical grade sequencing pipeline and variant calling algorithm. Our study has carried out whole genome sequencing of 54 paired colorectal tumour-normal samples, utilising a [...]

Samples
Immediately after resection, specimens were conveyed to a histopathologist, who facilitated direct biopsy of tumour material and associated normal bowel (defined as the distal resection margin) by frozen section. Samples were immediately snap frozen in liquid nitrogen and stored at -80°C until needed.
Tumour content was verified by frozen section, with at least 60% tumour content required for inclusion in the study. DNA was extracted using a Qiagen DNeasy kit and RNA using a Qiagen RNeasy kit. Nucleic acid quantity and quality were assessed using a Qubit2 fluorimeter and TapeStation assay.
Library preparation

Sequencing metrics
In total, 54 tumour-normal pairs (30/54 male, 24/54 female) underwent whole genome sequencing, with a median read depth of 68x for tumour samples and 38x for normal samples. Median purity based on WGS data was 68% (range 29-100%).

Germline mutations
The germline genome of all patients was studied for mutations in genes associated with familial colorectal cancer syndromes (APC, MYH, MLH1, MSH2, MSH6, PMS2, POLE, POLD1, SMAD4 and BMPR1A). We found no germline SNVs or indels in these genes in this cohort of patients.

Most frequently mutated genes and identification of new drivers
A generic analysis of the ten most frequently mutated genes (both SNV and indel, not normalised by transcript length) demonstrated that these were (from most to [...]

Copy number aberrations
A pooled analysis of copy number variation across the cohort was performed (Figure 3). A consistent low-level pattern of both copy number gain and loss was observed.
When filtered by exonic regions across all samples, 6/354 losses and 2/30 gains were observed to be exonic. Gains were seen in all samples in the FOXI2 [...]

Structural Variants
Structural variants were filtered on the basis that the most functionally relevant ones were likely to be those involving known cancer driver genes. In total, 29 potential oncogenic gene fusions detected by WGS were seen in 16 samples (Table 2). None of the 29 potential gene fusions was recurrent. However, CCDC6-TMEM212AS1 and BRAF-DLG1 fusions were among those observed.
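The filtering logic described in this section (keep fusions involving a known driver gene, then check for recurrence across samples) can be sketched as follows. This is an illustrative sketch only, not the pipeline used in the study: the driver gene set, the sample identifiers and the third fusion call are hypothetical placeholders, and a real analysis would load a curated driver gene list.

```python
from typing import Iterable, List, Set, Tuple

Fusion = Tuple[str, str, str]  # (sample_id, gene_a, gene_b)

# Hypothetical driver gene list, for illustration only.
DRIVER_GENES = {"BRAF", "CCDC6", "ERBB2"}


def filter_driver_fusions(calls: Iterable[Fusion],
                          drivers: Set[str] = DRIVER_GENES) -> List[Fusion]:
    """Keep fusion calls in which at least one partner is a driver gene."""
    return [c for c in calls if c[1] in drivers or c[2] in drivers]


def recurrent_fusions(calls: Iterable[Fusion]) -> Set[frozenset]:
    """Return gene pairs (order-insensitive) seen in more than one sample."""
    samples_by_pair = {}
    for sample, gene_a, gene_b in calls:
        pair = frozenset((gene_a, gene_b))
        samples_by_pair.setdefault(pair, set()).add(sample)
    return {pair for pair, samples in samples_by_pair.items() if len(samples) > 1}


# Hypothetical calls; only the two named gene pairs come from the text.
calls = [
    ("S01", "CCDC6", "TMEM212AS1"),
    ("S02", "BRAF", "DLG1"),
    ("S03", "GENE_X", "GENE_Y"),  # neither partner is in the driver list
]
kept = filter_driver_fusions(calls)
recurrent = recurrent_fusions(kept)
print(kept, recurrent)
```

Treating a gene pair as order-insensitive is a design choice made here for simplicity; an analysis that distinguishes 5' and 3' partners would key on the ordered pair instead.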

Kataegis
The phenomenon of kataegis (localised somatic hypermutation) has been previously demonstrated in breast cancer (42). In our study, we found that it occurred in all 54 [...]
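Kataegis is usually identified from the spacing between adjacent somatic mutations. The text here does not state how kataegis was called in this study, so the following is only a sketch under stated assumptions: it flags runs of at least six mutations on one chromosome in which every gap between neighbours is at most 1,000 bp. Both thresholds are assumed parameters (a simplified variant of commonly used intermutation-distance criteria), and the positions are hypothetical.

```python
def kataegis_foci(positions, max_gap=1000, min_mutations=6):
    """Return (start, end, n_mutations) for each run of closely spaced
    mutations on a single chromosome.

    max_gap and min_mutations are assumed thresholds, not values taken
    from the study.
    """
    pos = sorted(positions)
    foci = []
    run = pos[:1]  # current run of closely spaced mutations
    for p in pos[1:]:
        if p - run[-1] <= max_gap:
            run.append(p)
        else:
            if len(run) >= min_mutations:
                foci.append((run[0], run[-1], len(run)))
            run = [p]
    if len(run) >= min_mutations:
        foci.append((run[0], run[-1], len(run)))
    return foci


# Hypothetical mutation positions (bp) on one chromosome.
example_positions = [100, 50_000, 50_200, 50_450, 50_900,
                     51_300, 51_350, 51_900, 900_000]
foci = kataegis_foci(example_positions)
print(foci)
```

In practice the function would be applied per chromosome and per sample, and the thresholds tuned to the mutation burden of the tumour.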

Telomere length
Because of the well-observed phenomenon of shorter telomere length in cancer, we studied telomere length as measured by whole genome sequencing, which has previously been shown to correlate well with older methods such as Southern blotting (26). Median telomere length in cancer was 5,028 bp and in normal germline blood was 6,294 bp (Mann-Whitney p<0.0001).
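The Mann-Whitney comparison reported above can be illustrated with a short sketch. The code below is not the authors' analysis: it computes the U statistic with average ranks for ties and a two-sided p-value from the normal approximation (without tie or continuity correction), and the telomere lengths are hypothetical values chosen only to make the example run. For very small samples an exact test would normally be preferred.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, no tie or
    continuity correction). Returns (U, p_value)."""
    n1, n2 = len(x), len(y)
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    # Assign 1-based ranks, averaging the ranks of tied values.
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u_x, n1 * n2 - u_x)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return u, p_value


# Hypothetical telomere lengths in bp (not data from the study).
tumour_bp = [4700, 4950, 5100, 5250, 5400]
normal_bp = [5900, 6150, 6300, 6450, 6700]
u_stat, p_value = mann_whitney_u(tumour_bp, normal_bp)
print(u_stat, p_value)
```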

Differential expression profiles
In order to understand whether there were any de novo transcriptional subgroups within the dataset, the top 250 genes by variance were extracted from the dataset. When comparing tumour/normal expression using clustering analysis, the number of clusters with the lowest Davies-Bouldin index (5 clusters, index 1.17) was used to set k for K-means clustering (Figure 5). Hierarchical clustering revealed separation between the five groups, and KEGG pathway analysis of each subgroup was performed (Supplementary table 6).
Three of the clusters contained only one or two samples. There was no distinction between these clusters in terms of anatomical location, stage or tumour mutational burden.
For subgroup one, an over-representation of pathways concerning inflammation and DNA repair was seen. For subgroups two and three no significant pathway overrepresentation was seen, possibly because these groups only had one sample within them. For subgroup four, multiple separate inflammatory pathways (mostly IL-17, Th1 and Th2 centric) were over-represented. Subgroup five had a number of interesting over-represented pathways, including reduced MHC presentation, Wnt/BMP signalling, TGFbeta signalling (via upregulated SMAD) and upregulated Hedgehog signalling.
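The cluster-number selection used in this section relies on the Davies-Bouldin index, which averages, over clusters, the worst-case ratio of within-cluster scatter to between-centroid distance; lower values indicate more compact, better separated clusters. The following is a minimal sketch of the index itself, assuming Euclidean distance and clusters supplied as lists of points. The toy points are hypothetical, and the software actually used in the study is not stated here.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points)."""
    k = len(clusters)
    if k < 2:
        raise ValueError("at least two clusters are required")
    cents = [centroid(c) for c in clusters]
    # Scatter: mean distance of each cluster's points to its centroid.
    scatter = [sum(math.dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k


# Hypothetical 2-D points: the same two clusters, far apart and close together.
well_separated = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
poorly_separated = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]
db_good = davies_bouldin(well_separated)
db_poor = davies_bouldin(poorly_separated)
print(db_good, db_poor)
```

To choose k, the index would be computed for the K-means partition at each candidate k and the k with the lowest value retained. Note that a single-sample cluster has zero scatter by construction, which is worth bearing in mind when interpreting an index computed on partitions that contain singleton clusters.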

Pathway analysis
Single-sample gene expression differences explain little of the context of disease processes, so we carried out a pathway-level gene expression analysis using KEGG pathways, comparing over-expressed genes with normal counts across the whole dataset. From this, we found a number of pathways of interest that were over-expressed in colorectal cancer in this cohort of patients: the p53 signalling pathway (hsa04115, p=2.24 × 10^-53, FDRp=1.06 × 10^-51), the NF-kappa-B signalling pathway (hsa04064, p=1.75 × 10^-47, FDRp=4.95 × 10^-46) and the 'colorectal cancer' pathway (hsa05210, p=2.06 × 10^-41, FDRp=5.41 × 10^-41).
A number of other pathways of interest (but not of direct relevance to colorectal cancer) were over-expressed, including platinum drug resistance (hsa01524), the cytosolic DNA-sensing pathway (a.k.a. cGAS-STING, hsa04623) and several involved in DNA repair (Fanconi anaemia pathway, hsa03460; DNA replication, hsa03030; nucleotide excision repair, hsa03420).
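The FDR-adjusted p-values quoted above imply a multiple-testing correction across pathways. The correction method is not stated in this text; the sketch below assumes the widely used Benjamini-Hochberg procedure and uses hypothetical raw p-values rather than values from the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that adjusted
    # values are monotonically non-decreasing in rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four pathways.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
print(adj_p)
```

Each adjusted value is the raw p-value scaled by the number of tests divided by its rank, capped so that a more significant pathway never receives a larger adjusted value than a less significant one.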

Cell deconvolution using RNAseq
Immune infiltration estimation using cell type deconvolution by CIBERSORT (31) (Table 3) [...] type was CD4+ memory (resting) T-cell, followed by M2 macrophages, CD8+ T-cells, M0 macrophages then activated mast cells. There did not seem to be any correlation with purity estimates of the samples as determined by WGS.

RNA signature for hypermutation
In order to see whether an RNA-based signature for hypermutation could be developed from RNA-seq data, gene-centric gene expression was processed using [...]

Correlation between drug mutations database and druggable mutations
In order to ascertain the possibility of actionable targets from the mutations observed in the dataset, we entered a list of protein coding mutations found in at least one [...] Also, 17/54 (32%) of patients exceeded the 10 mutations/Mb threshold for potential benefit from treatment with PD-1/PD-L1 therapy.

CONCLUSIONS
The use of clinical grade whole genome sequencing in this study has allowed us to identify known and novel driver mutations that are potentially druggable based on the current state of knowledge. Our study demonstrated the known driver mutations seen in colorectal cancer, such as those in APC, KRAS, BRAF and PIK3CA (5), but also more novel mutations that would potentially be targetable by molecular agents. For instance, we detected KIT mutations that could potentially be targeted by the tyrosine kinase inhibitor imatinib (46), offering a therapeutic option not otherwise available to these patients.
We also identified and validated several interesting potential driver mutations by frequency within our cohort. Recurrent mutations were seen in KMT2C, which codes [...]
Recurrent alterations in genome structure, in the form of structural variants, copy number aberrations or gene fusions, have also been highlighted as a potential target for therapy. For instance, the FGFR2/3 fusion seen in approximately 40% of cholangiocarcinoma is a target for the drug pemigatinib (52). Our study has shown several recurrent copy number variations or structural variations but also a number of unique "private" variations that may be targetable. For instance, we observed potential fusions between BRAF and DLG1 (which may be targetable by BRAF kinase inhibition (53)) and between ERBB2 and HAP1 (which may be targetable by lapatinib (54)).
Tumour immunotherapy, using anti-PD1 and/or anti-CTLA4 therapy, has been shown to have a survival benefit across multiple tumour types (55), especially when stratified to patients with high tumour mutational burden (TMB). TMB correlates directly with neoepitope production and thus immunovisibility of the tumour. A threshold of 10 mutations per megabase of sequence has been suggested as a cut-off for benefit from immunotherapy (56). Our study has shown that up to 20% of patients with colorectal cancer reach this threshold, which is higher than the 16% previously reported (5). This may be because whole genome sequencing provides a more comprehensive detection of mutations compared to other strategies, and also because of variations in how TMB is calculated.
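The TMB threshold discussed above reduces to a simple calculation: the number of somatic mutations divided by the size of the sequenced (callable) region in megabases. The sketch below illustrates this under stated assumptions: the callable genome size, patient identifiers and mutation counts are hypothetical, and which mutation classes are counted (for example coding only versus genome-wide) is a pipeline choice not specified here.

```python
def tumour_mutational_burden(n_somatic_mutations, callable_mb):
    """Mutations per megabase over the callable region."""
    if callable_mb <= 0:
        raise ValueError("callable_mb must be positive")
    return n_somatic_mutations / callable_mb


TMB_THRESHOLD = 10.0   # mutations/Mb, the cut-off discussed in the text
CALLABLE_MB = 2800.0   # assumed callable genome size, illustrative only

# Hypothetical somatic mutation counts per patient.
counts = {"P01": 14_000, "P02": 42_000, "P03": 5_600}

tmb = {pid: tumour_mutational_burden(n, CALLABLE_MB) for pid, n in counts.items()}
high_tmb = sorted(pid for pid, value in tmb.items() if value > TMB_THRESHOLD)
fraction_high = len(high_tmb) / len(tmb)
print(tmb, high_tmb, fraction_high)
```

Because the denominator and the set of counted mutations both vary between pipelines, the proportion of patients above the threshold can differ between studies even for similar tumours.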
We have carried out a variety of analyses of the RNA data derived from our samples. Surprisingly, the pathway analysis demonstrated findings of potential clinical utility, for instance the over-expression of KEGG pathway hsa01524 (platinum drug resistance). Oxaliplatin is commonly given as adjuvant chemotherapy in colorectal cancer and resistance remains a problem (57), especially given the toxicity that leads to peripheral neuropathy.
Interestingly, we have shown that the most frequent transcriptomic subtype within our dataset is CMS4, which is associated (44) with a worse prognosis (also seen in our dataset) and a more aggressive phenotype, mainly due to the presence of fibroblasts which act as "malignant stroma". The low number of samples that could be accurately classified may represent a weakness of 3' RNAseq (although we have previously used this technique without issue) or inherent weaknesses in the CMS classifier when a heterogeneous tumour sample with low tumour content undergoes sequencing (58). We have also demonstrated by cell deconvolution a rich and varied immune infiltration, with the predominant cell types being CD4+ memory and CD8+ cells; however, M2 macrophages are seen in most tumours. M2 macrophages are known as "repair" macrophages that decrease inflammation and promote tissue repair (59). If this is indeed the case, it highlights an intriguing future path of research in colorectal cancer.
The CIRC classifier, which we have previously used to highlight immunovisibility in cancer (30), demonstrates that a proportion of samples have immunovisibility beyond that expected from high TMB.
In an era of personalised medicine, we have attempted to utilise current drug databases (DGIdb (60) and Open Targets (61)) in order to identify targets for personalised therapy. All patients had mutations within their tumour that were potentially "druggable", allowing their recruitment into a current or planned clinical trial. This is an exciting finding, as it gives a potential route of treatment for patients with metastatic disease; however, the majority of these trials are phase one in nature, and the agents are thus not conclusively demonstrated to be active in colorectal cancer, or indeed against the targeted genomic alteration outside of pre-clinical models.
In conclusion, we have demonstrated the utility of standardised clinical grade WGS in providing both new biological insights into colorectal cancer and targets for therapy. WGS has the advantage of breadth and depth of coverage but comes at a financial cost; this is likely to drop significantly as technologies improve. A particular disadvantage in the clinical setting is the need for access to fresh-frozen tumour material in order to perform whole genome sequencing to the highest quality.
The use of 3' RNAseq offers a cost-effective way to further enrich the data returned by these assays and may be useful for future studies. The UK government has recently recommissioned Genomics England to sequence five million genomes over the next decade, and we suggest, based on our results, that whole genome sequencing should be considered standard of care for colorectal cancer. We additionally suggest that RNA sequencing should be utilised as standard of care due to the additional insights it gives into tumour biology.

Figure 3: Genome-wide copy number plot of all samples across the cohort (green = gain, red = loss). Height of bar is proportional to the number of samples with copy number variation.