Leveraging sequences missing from the human genome to diagnose cancer

Cancer diagnosis using cell-free DNA (cfDNA) can significantly improve treatment and survival but has several technical limitations. Here, we show that tumor-associated mutations create neomers, DNA sequences 11-18bp in length that are absent in the human genome, that can accurately detect cancer subtypes and features. We show that we can detect twenty-one different tumor-types with higher accuracy than state-of-the-art methods using a neomer-based classifier. Refinement of this classifier via supervised learning identified additional cancer features with even greater precision. We also demonstrate that neomers can precisely diagnose cancer from cfDNA in liquid biopsy samples. Finally, we show that neomers can be used to detect cancer-associated non-coding mutations affecting gene regulatory activity. Combined, our results identify a novel, sensitive, specific and straightforward cancer diagnostic tool.


INTRODUCTION
Cancer is the second leading cause of death worldwide (1,2), and for most cancer types survivability is significantly higher if the tumor is detected at an early stage (3,4). Currently mass population screening is applicable only for breast and cervical cancers and utilizes physical tests like mammography and cytology screens. Detection for other cancer types, done both en masse and in a low and affordable resource setting, still poses a major challenge for the scientific and clinical communities (5). In particular, a major hurdle is to identify reliable biomarkers for the detection of cancer development at a presymptomatic stage. Detection at such an early stage would allow for patient stratification and improvement of patients' outcome by providing personalized treatments.
Circulating cell-free DNA (cfDNA) is an emerging and promising resource for cancer diagnostics and prognostics (6,7). It has a short life span (16 minutes to 2.5 hours), which makes it a highly temporal indicator of various processes occurring in the subject's body. Due to advances in sequencing technologies, cfDNA can be rapidly analyzed at a relatively low cost. Analysis of circulating tumor DNA (ctDNA) has become a prospective minimally invasive tool to screen the population and to monitor patients already diagnosed with cancer. Current applications, including identification of tissue of origin and cancer type, detection of minimal residual disease and other technologies, rely on sequencing to resolve somatic mutations (8) and epigenetic marks, such as DNA methylation or histone modifications, that can determine the cancerous tissue (9,10). However, ctDNA still has many hurdles and caveats that need to be overcome (11). Some of the major hurdles include: 1) cfDNA is fragmented (180-360 base pairs), making its collection and extraction more challenging, and the tumor-derived DNA makes up only a small portion (estimated to be around 0.4%; (11)), warranting the need for extremely sensitive biomarkers that can easily detect the presence of cancerous cells; 2) prior knowledge of specific mutations or methylation marks is required for targeted screening, and consequently the main focus has been on coding mutations, which constitute only a small fraction of mutations; 3) cfDNA mutation and epigenetic diagnosis could be confounded by somatic alterations in white blood cells (12); 4) the diagnostic techniques used to detect methylation or histone marks are technologically complex and can have low sensitivity and specificity (6,13-15); and 5) to provide optimal treatment, cancer needs to be diagnosed at early stages when the tumor is small (~5 mm in diameter). At these stages, the tumor produces minute levels of ctDNA that are difficult to detect using current methods (6).
Nullomers are short DNA sequences (11-18 base pairs) that are absent from the human genome (16,17). While the absence of nullomers could be due to chance, we and others have shown that a significant proportion of them is under negative selection pressures (17,18), suggesting that they could have a deleterious effect on the genome. We have also shown that these sequences could be used as DNA 'fingerprints' to identify specific human populations (18). As nullomers do not exist in a human genome, their appearance due to mutagenesis followed by clonal expansion could be exploited as a diagnostic method for diseases associated with a mutational burden, such as cancer.
Here, we set out to test whether nullomers could be used as a diagnostic tool to detect cancer in general and also specific subtypes. Throughout this manuscript we refer to nullomers found in the tumor genome as neomers to distinguish them from the more general category. We first analyzed The Cancer Genome Atlas (TCGA; (19)) database finding recurrent neomers created by somatic mutations that could be used to detect not only cancer subtypes with higher accuracy than leading methods (20) but also additional cancer features. Further analyses of cfDNA whole-genome sequencing datasets found that these neomers can also be used to detect cancer subtypes in these data without the need for matched healthy control samples. Finally, we show that cancer-associated neomers can be used to detect cancer-associated mutations in gene regulatory regions and functional assays of prostate cancer associated neomers show that they have a functional effect on their regulatory activity. Combined, our results show that neomers can be used as a rapid, sensitive, specific and straightforward cancer diagnosis and also aid in the identification of gene regulatory mutations associated with cancer.
It is made available under a CC-BY 4.0 International license. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. This version posted August 17, 2021.

Annotation of mutations that lead to nullomers
As cancer causes DNA mutations, we investigated if they can result in the resurfacing of nullomers (Fig. 1A). Using our previously characterized human nullomers (18), we analyzed whole-genome sequencing results from 2,577 patients across 21 different cancer types from The Cancer Genome Atlas (TCGA; (21)) (Fig. S1A) for resurfacing nullomers. We focused on 16bp nullomers as it is the shortest length where we detect a sufficient number of nullomers per patient, with the human reference genome having only 37.24% of all possible 16mers. The majority of the 44,599,472 single nucleotide substitutions give rise to multiple nullomers, and we identified 213,164,038 resurfacing nullomers across all cancer types. Furthermore, we identified 2,470,091 nullomers resulting from short insertions and deletions (1-100 bp). The median number of nullomers created by each substitution was two and for indels four (Fig. S1B-C). On average, 58.29% of substitutions in a patient resulted in one or more nullomers, with only 2.1% of the nullomers derived from coding regions. The median number of nullomers found across cancer patients was 9,107 (Fig. 1B-C), and the number of nullomers was directly proportional to the number of mutations (Fig. S1D) and nullomer length (Fig. S1E). As mutations were identified by comparing to healthy tissues, mutations did not need to be filtered for common variants that could otherwise result in nullomers (18).
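To illustrate why a single substitution can give rise to several nullomers, the following is a minimal sketch (not the pipeline used in this study) that enumerates the k-mers overlapping a substituted base and keeps those present in a nullomer set. The function names, the toy reference and the toy nullomer set are hypothetical, and k=4 is used only for readability.

```python
def kmers_overlapping_substitution(reference, position, alt_base, k):
    """Return every k-mer that overlaps `position` once `alt_base` is substituted in."""
    if reference[position] == alt_base:
        raise ValueError("alt_base must differ from the reference base")
    mutated = reference[:position] + alt_base + reference[position + 1:]
    # A substitution at `position` touches at most k windows of length k.
    first_start = max(0, position - k + 1)
    last_start = min(position, len(mutated) - k)
    return [mutated[i:i + k] for i in range(first_start, last_start + 1)]


def resurfacing_nullomers(reference, position, alt_base, nullomers, k):
    """Keep only the mutated k-mers that belong to the nullomer set."""
    candidates = kmers_overlapping_substitution(reference, position, alt_base, k)
    return [kmer for kmer in candidates if kmer in nullomers]


# Toy example with k=4 (hypothetical data; the study uses k=16).
toy_reference = "ACGTACGTAC"
toy_nullomers = {"GTTC", "TTCG", "AAAA"}
created = resurfacing_nullomers(toy_reference, 4, "T", toy_nullomers, k=4)
print(created)
```

In this toy case one substitution yields two nullomers, mirroring the observation that a single mutation can create more than one.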
As we were interested in nullomers that could be used as cancer biomarkers, we focused on the subset of nullomers that are recurrent, i.e. those found in more than one patient for a specific cancer type, termed hereafter as neomers. The number of neomers was proportional to the total number of mutations (Fig. 1C) and as both the number of patients per cancer type and the mutational load varied, the median number of neomers for each tissue type ranged from 0 to 98. Analysis of the most frequent neomers revealed several previously known cancer-associated mutations (Table 1). For example, some of the most recurrent coding neomers were the result of either the Gly12Asp, Gly12Val or Gly12Cys missense mutation in the KRAS proto-oncogene GTPase (KRAS), which are known to make up 80% of cancer-associated KRAS mutations and lead to KRAS being constitutively active (22,23). Although KRAS has been associated with several cancers, 190/215 (88%) of these mutations were found in pancreatic cancers. Several frequently occurring coding neomers were also found in other known cancer-associated genes such as tumor protein p53 (TP53), B-Raf proto-oncogene, serine/threonine kinase (BRAF) and phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit alpha (PIK3CA). The mutation that most often resulted in a neomer was located in a noncoding region, within the telomerase reverse transcriptase (TERT) promoter, which is known to be associated with numerous cancer types (24). This mutation, called -124C>T or C228T, is extremely common in numerous cancer types (25) and thought to disrupt a G-quadruplex (26), leading to the binding of GABP (27), an ETS transcription factor, resulting in increased TERT expression. We found this mutation in 97 patients with the highest incidence in glioblastoma (51%), fitting with its high prevalence rate and diagnostic use for this cancer type (28). We also identified several neomers that are frequently created by different mutations (Table 2).
Interestingly, some of these frequently recurrent nullomers are created by different mutations, yet are predominantly found in one cancer. For example, GTTTTTCTCCTAGACC is found 40 times in skin cancer at 31 different loci while CTGGCAGTGAGCCACG is found 21 times in liver cancer across 18 loci. The majority (98%) of these frequent neomers reside in noncoding regions, and many of them reside in intronic regions (35%). For example, CGACGTTCTGCCCACT is found in 32 loci, primarily in pancreatic and stomach cancer. Of those loci, 21/32 (65.6%) were found in distal intergenic regions near pancreatic cancer associated genes, such as the C-C motif chemokine ligand 4 (CCL4) (29). This neomer is also found in an intron of POM121 transmembrane nucleoporin like 12 (POM121L12), which is commonly mutated in gastrointestinal cancers (30), and in the vicinity of the potassium voltage-gated channel modifier subfamily V member 1 (KCNV1) gene, where promoter hypermethylation has been associated with both pancreatic (31) and esophageal cancer (32).

Generation of a cancer subtype neomer classifier
Based on the observation that most neomers are predominantly found in one cancer type, we hypothesized that they can be used to distinguish between cancer types. We filtered neomers by keeping only those that appeared >=ri times in specific cancer type i (Table S1). Comparison of the set of neomers associated with each cancer type reveals a small overlap, as indicated by the Jaccard index which is <0.04, suggesting that each cancer type has a distinct neomer signature (Fig. 1D). The only exceptions were esophagus and stomach cancer that are known to have similar characteristics (33). We also counted the number of times neomers are found in each patient, finding that patients are strongly enriched for only one set of cancer specific neomers (Fig. 1E).
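The overlap between cancer-specific neomer sets can be quantified with the Jaccard index, i.e. the size of the intersection divided by the size of the union. The sketch below shows the computation on hypothetical toy sets of 4-mers; the set contents and function names are illustrative assumptions and not data from this study.

```python
from itertools import combinations


def jaccard_index(set_a, set_b):
    """Jaccard index |A & B| / |A | B|; defined here as 0.0 when both sets are empty."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def pairwise_jaccard(neomer_sets):
    """Compute the Jaccard index for every pair of cancer types."""
    return {
        (a, b): jaccard_index(neomer_sets[a], neomer_sets[b])
        for a, b in combinations(sorted(neomer_sets), 2)
    }


# Hypothetical toy neomer sets (4-mers for readability).
toy_sets = {
    "liver": {"AAAC", "CCCG", "GGGT"},
    "skin": {"GGGT", "TTTG"},
    "pancreas": {"ACGA", "TGCT"},
}
overlaps = pairwise_jaccard(toy_sets)
for pair, value in overlaps.items():
    print(pair, value)
```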

To test if neomers can classify tumor samples, we trained a support vector machine classifier to identify tumor type. The classifier takes as input a 21-dimensional vector indicating the number of neomers found for each cancer specific set. Evaluation using 10-fold cross-validation revealed that our classifier achieves both high sensitivity and specificity, with an F1 score of 0.92 and an accuracy of 0.99 (Fig. 2A,B). The performance was better than the deep learning model recently presented by Jiao et al (20) and also required fewer computational resources to train.
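As an illustration of the classifier input, the following sketch builds the per-patient count vector with one entry per cancer-specific neomer set. It uses three hypothetical cancer types and toy 4-mers instead of the 21 sets of 16-mers, and it stops at feature construction; training the support vector machine itself is not shown.

```python
from collections import Counter


def neomer_feature_vector(patient_kmers, neomer_sets, cancer_types):
    """Count, for each cancer type, how many of the patient's k-mers fall in that type's neomer set."""
    counts = Counter(patient_kmers)
    # Counter returns 0 for k-mers the patient does not carry.
    return [sum(counts[neomer] for neomer in neomer_sets[t]) for t in cancer_types]


# Hypothetical toy data: the order of `cancer_types` fixes the vector layout.
cancer_types = ["liver", "pancreas", "skin"]
neomer_sets = {
    "liver": {"AAAC", "CCCG", "GGGT"},
    "pancreas": {"ACGA", "TGCT"},
    "skin": {"GGGT", "TTTG"},
}
patient_kmers = ["AAAC", "CCCG", "TTTG", "ACCA"]
features = neomer_feature_vector(patient_kmers, neomer_sets, cancer_types)
print(features)
```

A vector of this form, extended to 21 entries, is what a linear classifier would receive for each patient.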

Neomers can distinguish additional cancer features
We next tested whether a supervised approach, i.e. using neomers that are thought to be informative based on prior biological knowledge, would improve performance. We first considered microsatellite unstable (MSI) and microsatellite stable (MSS) cancers. MSI is associated with better cancer prognosis, increased benefits from surgery and higher sensitivity to immunotherapy, but with a lack of efficacy from adjuvant treatment (34). Since MSI cancers are associated with a proliferation of polyA and polyT stretches, we hypothesized that neomers containing these motifs would be able to distinguish these two cancer types. We identified ten MSI samples from a cohort of 560 breast cancers (35) and compared to ten randomly selected MSS samples from the same cohort. We found that the polyA/T neomers were able to flawlessly separate the two categories (Fig. 2C).
We next applied a similar strategy to distinguish patients with DNA polymerase epsilon catalytic subunit (POLE) deficiency, as these tumors are known to respond more favorably to immune checkpoint inhibitors (36)(37)(38). We identified 25 patients from the TCGA dataset labelled as POLE deficient, and searched for neomers created through a TCT>TAT or TCG>TTG mutation, which are the most common types of mutations in this context (38). Comparing against POLE proficient tumours, we found that the number of neomers identified for each group have very little overlap (Fig. 2D), and the classifier achieved an accuracy of 96%.

Neomers are enriched in cfDNA
We next tested whether neomers could be used to diagnose cancer in cfDNA. We focused on prostate cancer for the following reasons: 1) it is the fifth leading cause of cancer death in men worldwide, the second most frequent cancer in males and is responsible for 3.8% of all cancer deaths (39); 2) the availability of both WGS datasets and cfDNA samples; 3) the current primary screen for this cancer, measuring levels of the prostate-specific antigen in the blood, has high false negative and false positive rates (40); 4) the need for more accurate screening for minimal residual disease after treatment or surgical interventions (41,42); 5) the number of neomers for this subtype (N=5,270, median per patient=29.5) is on the low end of all 21 tissues that we analyzed (Table S1), allowing us to assess our ability to diagnose cancer with a relatively small number of neomers; 6) localized prostate cancer has a low abundance of ctDNA, making it difficult to detect by ultra low pass WGS or targeted cfDNA sequencing (43) or via methylation (44) compared to metastatic (45), providing a challenging test for our neomer approach.
We analyzed three previously published cfDNA WGS samples from metastatic prostate cancer patients and twenty-three controls, sequenced at depths of 50-80x (46). We first excluded all neomers that could arise from germline variants that are not rare (18). For each prostate cancer associated neomer, we characterized all possible single nucleotide substitutions in the reference genome that could give rise to this neomer. By intersecting this list of neomer creating substitutions with known germline variants identified by the gnomAD project (47), we calculated the probability that each neomer will be present in an individual. We excluded all neomers that are found in the population with p>0.0005, leaving us with 3,193 prostate neomers.
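The text does not spell out how the per-neomer probability is computed from gnomAD allele frequencies. One possible way to do it, shown here purely as an assumption, is to treat the neomer-creating variants as independent, assume diploid individuals under Hardy-Weinberg proportions, and compute the probability that an individual carries at least one of them. The function names and allele frequencies below are hypothetical.

```python
import math


def carrier_probability(allele_frequencies):
    """Probability of carrying at least one neomer-creating allele.

    Assumption (not stated in the source): variants are independent and each
    individual has two alleles per locus, so P(absent) = product of (1 - AF)^2.
    """
    p_absent = 1.0
    for af in allele_frequencies:
        p_absent *= (1.0 - af) ** 2
    return 1.0 - p_absent


def keep_neomer(allele_frequencies, max_probability=0.0005):
    """Keep a neomer only if its germline probability does not exceed the cutoff."""
    return carrier_probability(allele_frequencies) <= max_probability


# Hypothetical allele frequencies for the variants that can create two neomers.
rare = [0.0001]
less_rare = [0.0002, 0.0002]
print(carrier_probability(rare), keep_neomer(rare))
print(carrier_probability(less_rare), keep_neomer(less_rare))
```

The 0.0005 cutoff is the one given in the text; everything else in this sketch is an illustrative modeling choice.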
Another source of neomers in cfDNA WGS data could be sequencing errors. To exclude neomers that were observed due to these technical artifacts, we developed a Poisson model (see Methods). Since sequencing errors are assumed to be distributed uniformly, neomers arising for this reason will have a profile that differs from neomers stemming from sequences that are present in the cfDNA, even at a low allele frequency. Moreover, these neomers will also differ from ones present due to germline variants which will be found at a higher frequency. After filtering our data for neomers likely to have arisen due to germline variants or sequencing errors, we compared the enrichment of reads containing neomers associated with prostate cancer as well as the number of significantly enriched neomers in cases versus controls. We found that the median number of prostate neomers detected in the patient samples is eight, while in the healthy controls we detect two. Similarly, the mean enrichment compared to the expected number of reads was 1.35 compared to 0.93 for the healthy controls (Fig. 3). Taken together, these results demonstrate that our prostate neomers classifier could serve as a sensitive and specific assay for identifying cancer in cfDNA samples.

Neomers alter promoter activity
Only a small number of mutations in gene regulatory elements that affect gene expression have been found to be associated with cancer (48,49). As the vast majority of our cancer-associated nullomers reside in noncoding sequences, we tested whether nullomers could identify cancer-associated gene regulatory mutations that have a functional effect. Of note, our top neomer mutation was in the TERT promoter (Table 1), which is associated with numerous cancers (24). Focusing on prostate cancer, we selected five neomers for luciferase reporter assays using the following criteria: i) neomers that reside in a promoter based on ENCODE annotations (50); ii) the gene regulated by the promoter is associated with prostate cancer. Our list included neomers in: 1) a promoter between two divergent genes, RPS2 and the lncRNA gene SNHG9 (Fig. 4A), both of which are overexpressed in prostate cancer (51); 2) a promoter between two divergent genes, TMEM127 and CIAO1 (Fig. 4B), with the former being downregulated in prostate cancer (52); 3) a promoter between two divergent genes, TTC23 and LRRC28, with the former showing aberrant splicing that relates to therapy resistance in prostate cancer cells (53); 4) the promoter of GNAI2, a protein that interacts with CXCR5, which positively correlates with prostate cancer progression (54); 5) a promoter between two divergent genes, PRICKLE4 and FRS3, with the latter thought to affect malignant but not benign prostate cells (55). We cloned the promoter sequence with and without the neomer into a luciferase promoter assay vector and compared their activity in androgen-sensitive human prostate adenocarcinoma cells (LNCaP). For two out of the five assayed promoters, we observed a significant effect on reporter activity (Fig. 4C). For the RPS2-SNHG9 promoter, the neomer led to significantly increased activity in line with this gene being overexpressed in cancer (51). For the TMEM127-CIAO1 promoter the neomer completely abolished activity, fitting with its observed downregulation in prostate cancer (52). Combined, our experimental results show that neomers could have a significant effect on promoter activity and could potentially be used to identify cancer associated cis-regulatory mutations.

DISCUSSION
Cancer is a DNA mutation causing disease. Here, we show that by analyzing cancer WGS datasets we can find mutations that lead to the generation of neomers, short sequences normally not present in the genome but present in multiple cancer genomes. Further analyses of these neomers show that they can be used to classify not only cancer tissue of origin but also additional cancer features, such as MSI or POLE deficiency, with high accuracy. Analysis of cfDNA WGS datasets finds that neomers could be used to distinguish patients from healthy individuals. Finally, using reporter assays, we demonstrate that neomers have a functional effect on regulatory sequences.
Our analyses used 2,577 patients with 21 different cancerous tissues to develop a cancer tissue of origin classifier. Overall, we observed different recurring neomers in each cancer type, other than esophagus and stomach, likely because these cancers share very similar characteristics. As the classifier is based on neomer detection in sequencing datasets, read mapping is not needed, requiring only a single pass over the data, thus making it extremely efficient from a computational standpoint. Additional genomes from tumor tissues, controls and cfDNA could improve this classifier. In general, the classifier has better performance for cancer types with more patients and high mutation burden (Figs. 1, 2). Thus, obtaining WGS datasets from tumor, matched control and cfDNA from myeloid, thyroid and prostate cancers would be extremely helpful in improving the ability of our neomer classifier to detect these cancer types. In addition, our analyses showed that other than tissue origin, neomers can also be used to detect other cancer features. It would be interesting to test whether neomers could diagnose additional tumor features and also detect other cancer characteristics such as chance of recurrence, drug response, mortality and others.
Neomers could also be combined with other cancer biomarkers and risk factors to improve the diagnostic positive predictive value. For example, it was recently shown that combining a blood test that detects both protein biomarkers and DNA mutations along with positron emission tomography-computed tomography (PET-CT) could detect multiple cancers (56). Adding neomers to known cancer-associated coding mutations in the screening of cfDNA could increase sensitivity and specificity. In summary, adding neomer-based diagnostics to existing cancer biomarkers and risk factors could improve the power to detect various cancer subtypes.
As nullomers/neomers do not exist in the human genome they could also be exceptional candidates for neoantigens, to be targeted via immunotherapy. Previous work has shown that minimal absent words, short sequences that are absent from a genome or proteome, could be used to identify phosphorylation sites of high confidence, some of which could be associated with cancer (57). Nullomers were also shown to be effective in identifying unique peptides that are exceedingly distant from human peptides that potentially could be used as antibodies against Trypanosoma cruzi (58) or SARS-CoV-2 (59). Analysis of the Immune Epitope Database of validated antigens (60) found that 13 of the recurrent coding neomers can create neoantigens with predicted strong binding levels that were subsequently validated (Table S2). From the 1,700 neoantigens with strong binding levels, only 1.72 (p-value<1e-8, hypergeometric test) is expected to correspond to a neomer, suggesting that missense mutations also resulting in neomers are 7-fold more likely to also generate strongly binding neoantigens.
We used a sequence enrichment based assay to detect nullomers in cfDNA from blood taken from prostate cancer cases and controls. Alternate assays could potentially be used for future rapid diagnosis of cancer via nullomers. These could include the use of CRISPR-based detection tools that utilize Cas12 or Cas13 (61). For example, recent use of Cas13 in a microwell array system allowed the rapid screening of over 4,500 targets for 169 human-associated viruses with high sensitivity and specificity (62). In addition, with nullomer-based diagnostics potentially not needing large amounts of starting material, cfDNA could be collected from urine or saliva, which were shown to be a viable but reduced source of cfDNA (63,64).
Neomers could be used as a novel tool to identify cancer-associated gene regulatory mutations. Amongst the 210 prostate cancer promoter neomers, we selected five promoters and found that for two of them the neomer significantly affected promoter activity. The difference in activity was in line with the gene's expression change in prostate cancer, with the RPS2-SNHG9 neomer increasing activity, fitting with its overexpression in prostate cancer (51), and the TMEM127-CIAO1 neomer abolishing activity, in line with its observed downregulation in prostate cancer (52). Although these promoters were selected based on their prostate cancer association, future high-throughput assays, such as massively parallel reporter assays (MPRAs; (65)) that can test thousands of sequences and variants for their regulatory activity, could be used to test the effect of neomers on gene regulation in an unbiased manner.
In summary, we show that neomers can provide a powerful tool for cancer diagnosis. As they can easily be detected via sequence or CRISPR-based tools, it should be straightforward to integrate them in current routine cancer diagnostic tests and their use could increase the sensitivity and specificity of these tests. Combining neomer-based screening with clinical characteristics and additional diagnostic tools/features could increase the positive predictive value. In addition, as cfDNA could also be isolated from urine and saliva and detection of these sequences only requires a relatively small amount of DNA, neomer-based diagnosis could be carried out in a non-invasive manner. Our work also suggests that neomers could be used to highlight cancer-associated gene regulatory mutations which have been difficult to identify. Further high-throughput characterization of these mutations could allow the detection of bona fide cancer-associated functional regulatory mutations that could be used for diagnosis and treatment.

Computational characterization of nullomers
The GRCh38 reference assembly of the human genome was used throughout the study. Nullomer extraction was performed for kmer lengths up to 17 base pairs using the algorithm described in (18). By definition, the reverse complement of a nullomer will also be a nullomer. Throughout this manuscript when counting nullomers, the reverse complement of nullomer i was also considered separately, unless i is a palindrome.
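For intuition, nullomers of a given length can be found by brute force as the set of all possible k-mers minus those observed on either strand. The sketch below does this for a toy sequence and is only practical for small k; it is not the algorithm of (18), and the function names are hypothetical. It also shows why reverse complements need care when counting: scanning both strands guarantees that the reverse complement of a nullomer is itself a nullomer, and a palindromic k-mer equals its own reverse complement.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return kmer.translate(COMPLEMENT)[::-1]


def is_palindrome(kmer):
    """A k-mer is palindromic when it equals its own reverse complement."""
    return kmer == reverse_complement(kmer)


def find_nullomers(sequence, k):
    """All k-mers absent from both strands of `sequence` (brute force, small k only)."""
    observed = set()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows containing N or other symbols
            observed.add(kmer)
            observed.add(reverse_complement(kmer))
    all_kmers = {"".join(p) for p in product("ACGT", repeat=k)}
    return all_kmers - observed


# Toy example (hypothetical sequence, k=2).
toy_nullomers = find_nullomers("AACGT", 2)
print(len(toy_nullomers), sorted(toy_nullomers))
```

Enumerating all 4^16 possible 16-mers in this way would be far too memory-intensive in plain Python, which is why a dedicated algorithm is needed at the scale of the human genome.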
Substitutions and indels identified from whole genome sequencing (WGS) of tumor samples from 2,577 individuals across 21 tissues were obtained from https://dcc.icgc.org/releases/PCAWG/ (21). Nullomer extraction was performed for kmer lengths up to 17 base pairs using the same algorithm described in (18). Recurrent nullomers (neomers) were annotated as those that resulted from substitutions or indels in at least ri patients within cancer type i, where ri is always two or more. When possible, ri was chosen to get ~10,000 neomers from each tissue, otherwise it was set to 2 (Table S1).

Classification of tumor tissue of origin using neomers
We trained a classifier to distinguish tissue of origin for a cancer sample based on observed neomers using the libSVM package to train a support vector machine classifier with a linear kernel. We used 10-fold cross-validation whereby the classifier was evaluated on a held-out fraction of the data. The set of neomers for each cancer type was recalculated for each round to only include the patients in the training set.
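Recalculating the neomer sets within each round matters because a neomer defined using a held-out patient would leak information into the evaluation. The sketch below illustrates the idea with a hypothetical fold assignment (interleaved and not stratified) and toy 4-mers; the SVM training step is omitted, and all names and data are illustrative assumptions.

```python
from collections import Counter, defaultdict


def recurrent_neomers(patients, min_patients=2):
    """Per cancer type, keep nullomers seen in at least `min_patients` patients."""
    counts = defaultdict(Counter)
    for cancer_type, nullomers in patients:
        counts[cancer_type].update(set(nullomers))  # count each patient once per nullomer
    return {
        cancer_type: {kmer for kmer, n in counter.items() if n >= min_patients}
        for cancer_type, counter in counts.items()
    }


def kfold_splits(n_samples, n_folds):
    """Yield (train_indices, held_out_indices) using a simple interleaved split."""
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_samples) if i not in held]
        yield train, held_out


# Hypothetical toy cohort: (cancer type, nullomers created by that patient's mutations).
patients = [
    ("liver", {"AAAC", "CCCG"}),
    ("liver", {"AAAC", "GGGT"}),
    ("liver", {"AAAC", "CCCG"}),
    ("skin", {"TTTG", "AAAC"}),
    ("skin", {"TTTG"}),
    ("skin", {"TTTG", "CCCA"}),
]

for train, held_out in kfold_splits(len(patients), 3):
    # Neomer sets are rebuilt from training patients only, then applied to held-out ones.
    fold_neomers = recurrent_neomers([patients[i] for i in train])
    print(held_out, fold_neomers)
```

In the first fold of this toy cohort, CCCG drops out of the liver set because one of its two carriers is held out, which is exactly the effect that recomputation is meant to capture.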

Supervised selection of neomers
The MMR status of each biopsy sample was derived from (Zou et al. 2021). The model was trained on neomers identified in MSI samples and the performance of the algorithm evaluated. For the MSI versus the MSS samples, we counted the number of neomers that contained either AAAAAAAA or TTTTTTTT repeats. The threshold for determining MSI or MSS was set as the harmonic mean of the maximum number of counts in the MSS set and the minimum number of counts in the MSI set. The POLE deficiency status of each biopsy sample was derived from (Zou et al. 2021) and we used a similar strategy to that of MMR status, but we instead counted neomers created through either a TCT>TAT or TCG>TTG mutation. Since the number of patients in each category was limited, we only used 5-fold cross validation.
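The thresholding rule described above can be written down in a few lines. The sketch below counts neomers containing an AAAAAAAA or TTTTTTTT run and places the decision threshold at the harmonic mean of the largest MSS count and the smallest MSI count. The per-sample counts are hypothetical, and the choice to call a sample MSI when its count is strictly above the threshold is an assumption.

```python
from statistics import harmonic_mean

HOMOPOLYMERS = ("AAAAAAAA", "TTTTTTTT")


def count_poly_at_neomers(neomers):
    """Number of neomers containing a run of eight A's or eight T's."""
    return sum(1 for neomer in neomers if any(run in neomer for run in HOMOPOLYMERS))


def msi_threshold(mss_counts, msi_counts):
    """Harmonic mean of the maximum MSS count and the minimum MSI count."""
    return harmonic_mean([max(mss_counts), min(msi_counts)])


def classify(count, threshold):
    """Assumed decision rule: counts above the threshold are called MSI."""
    return "MSI" if count > threshold else "MSS"


# Hypothetical per-sample counts of polyA/T neomers.
mss_counts = [3, 5, 10]
msi_counts = [40, 60]
threshold = msi_threshold(mss_counts, msi_counts)
print(threshold, classify(20, threshold), classify(12, threshold))
```

Because the harmonic mean is pulled towards the smaller value, the threshold sits closer to the MSS counts than the arithmetic mean would.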

Comparison to validated neoantigens
We downloaded a list of 1,967 validated neoantigens from http://biopharm.zju.edu.cn/download.neoantigen/iedb_validated.zip. Requiring both predicted strong binding and a positive validation left us with 1,700 neoantigens. To evaluate the enrichment of neoantigens corresponding to neomers, we assumed a hypergeometric distribution with 1,700 draws from an urn with 188,659 white balls (total number of neomers) and 186,067,892 black balls (number of nullomers found with lower recurrency than what was required for neomers).
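The expected number of neomer-derived neoantigens under this urn model follows directly from the mean of the hypergeometric distribution, which is the number of draws multiplied by the fraction of white balls. The short calculation below reproduces that expectation and the resulting fold enrichment relative to the 13 observed neomer-derived neoantigens; the variable names are illustrative, and the p-value itself is not recomputed here.

```python
def expected_white_draws(draws, white, black):
    """Mean of a hypergeometric distribution: draws * white / (white + black)."""
    return draws * white / (white + black)


draws = 1_700           # validated neoantigens with strong binding
neomers = 188_659       # white balls: total number of neomers
other = 186_067_892     # black balls: nullomers below the recurrence threshold
observed = 13           # recurrent coding neomers creating validated neoantigens

expected = expected_white_draws(draws, neomers, other)
fold_enrichment = observed / expected
print(round(expected, 2), round(fold_enrichment, 1))
```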

Neomer identification in ctDNA samples
ctDNA samples were derived from (46). To filter out common population variants, we obtained variant information from gnomAD v2 (47) and annotated all neomers that were generated due to these variants. Variants that were not single base pair substitutions were not considered. The FASTQ files were scanned for neomers by searching for exact matches to the 16-mers of interest. In addition, the quality scores of bases matching a neomer were evaluated to make sure that none of them fell below a threshold.
We developed a Poisson model where the expected number of neomers of type i is given by C·a·e·Ni/3, where C is the coverage, a is the allele frequency, e is the error rate, Ni is the number of loci where a substitution could result in the creation of neomer i, and the division by 3 is to account for the fact that only one of three substitutions will create the neomer. For the analysis of the WGS data we used e=0.015 and a=0.006 while C was calculated from the reads and Ni is derived from the reference genome. Moreover, we excluded neomers that were present at a level that would be expected if a>0.25 to further remove neomers detected due to germline variants or experimental artefacts.
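Taking the expression for the expected count as stated, a neomer's observed read count can be compared against a Poisson distribution with that mean. The sketch below computes the expectation and an upper-tail probability; using the upper tail as the enrichment test is an assumption, and the coverage and Ni values in the example are hypothetical (only e=0.015 and a=0.006 come from the text).

```python
import math


def expected_neomer_count(coverage, allele_frequency, error_rate, n_loci):
    """Expected count C * a * e * N_i / 3, as given in the text."""
    return coverage * allele_frequency * error_rate * n_loci / 3.0


def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected), by summing the lower terms."""
    cumulative = 0.0
    term = math.exp(-expected)  # P(X = 0)
    for j in range(observed):
        cumulative += term
        term *= expected / (j + 1)  # P(X = j + 1) from P(X = j)
    return max(0.0, 1.0 - cumulative)


# e and a are the values reported in the text; C=60 and N_i=48 are hypothetical.
lam = expected_neomer_count(coverage=60, allele_frequency=0.006, error_rate=0.015, n_loci=48)
p_value = poisson_upper_tail(observed=3, expected=lam)
print(lam, p_value)
```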
When processing the data, reads were not mapped to the genome, which means that the coverage is based on the total length of the sequenced reads rather than on the mappable reads. Each read is searched for neomers and neomers containing a base with quality <10 are excluded. The permissive threshold was chosen since otherwise all of the reads from the control samples would have been excluded.
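The mapping-free scan described above amounts to sliding a window of length k over each read and looking up each window in the neomer set, skipping windows that contain a low-quality base. The following sketch shows this for an in-memory FASTQ snippet; it assumes Phred+33 quality encoding and four-line FASTQ records, and the reads, neomers and function names are hypothetical (k=4 instead of 16 for readability).

```python
from collections import Counter
from itertools import islice


def read_fastq(lines):
    """Yield (sequence, quality string) pairs from four-line FASTQ records."""
    iterator = iter(lines)
    while True:
        record = list(islice(iterator, 4))
        if len(record) < 4:
            return
        yield record[1].strip(), record[3].strip()


def count_neomers(lines, neomers, k, min_quality=10, phred_offset=33):
    """Count exact neomer matches whose bases all have quality >= min_quality."""
    counts = Counter()
    for sequence, quality in read_fastq(lines):
        scores = [ord(char) - phred_offset for char in quality]
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            if kmer in neomers and min(scores[i:i + k]) >= min_quality:
                counts[kmer] += 1
    return counts


# Hypothetical reads: the second read has a low-quality base ('#', Q2) at index 3.
toy_fastq = """@read1
ACGTTGCA
+
IIIIIIII
@read2
GGTTGCAT
+
III#IIII
""".splitlines()

toy_counts = count_neomers(toy_fastq, {"GTTG", "TGCA"}, k=4)
print(dict(toy_counts))
```

Both neomers occur in each read, but the matches in the second read overlap the low-quality base and are therefore not counted.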

Promoter luciferase assays
Promoter sequences with and without the neomer (Table S3) were synthetically generated and cloned into the modified Promega promoter assay luciferase vector pGL4.11b (a gift from Dr. Rick Myers, HudsonAlpha) by BioMatik Inc and verified by Sanger sequencing. LNCaP cells were plated at an initial density of 2*10^5 cells/well in 24-well tissue culture plates and maintained in RPMI medium, 10% FBS supplemented with L-Glutamine and Penicillin/Streptomycin. Plasmids together with a Renilla expressing plasmid, pGL4.74 (Promega), at a ratio of 10:1 luciferase:Renilla were transfected using the X-tremeGENE™ HP DNA Transfection Reagent (Roche) using a 1:4 ratio of DNA (µg) to reagent (µl). 72 hours post transfection, luciferase and Renilla levels were measured using the Dual-Luciferase Reporter Assay System (Promega) following the manufacturer's protocol using a GloMax Explorer Multimode Microplate Reader (Promega). Luciferase activity was normalized to Renilla levels and presented as Relative Luciferase Units (RLU). Statistical analysis was performed using Prism version 9.0.2 (GraphPad). All values were reported as means (AVG) and standard errors (SE). p values < 0.05 were considered statistically significant.

Acknowledgments
We want to first and foremost thank the individuals who participated in this study. We also want to thank Dr. Felix Fang at UCSF for his advice and support. This work was supported in part by the Benioff Initiative for Prostate Cancer Research. MH was supported by core funding from the Wellcome Trust and core funding from the Evergrande Center.
medRxiv preprint doi: https://doi.org/10.1101/2021.08.15.21261805