Abstract
Purpose The detection of circulating tumor DNA, which allows non-invasive tumor molecular profiling and disease follow-up, promises optimal and individualized management of patients with cancer. However, detecting small fractions of tumor DNA released when the tumor burden is reduced remains a challenge.
Experimental Design We implemented a new highly sensitive strategy to detect base-pair resolution methylation patterns from plasma DNA and assessed the potential of hypomethylation of LINE-1 retrotransposons as a non-invasive multi-cancer detection biomarker. The DIAMOND (Detection of Long Interspersed Nuclear Element Altered Methylation ON plasma DNA) method targets 30-40,000 young L1 scattered throughout the genome, covering about 100,000 CpG sites and is based on a reference-free analysis pipeline.
Results Resulting machine learning-based classifiers showed powerful correct classification rates discriminating healthy and tumor plasmas from 6 types of cancers (colorectal, breast, lung, ovarian, gastric cancers and uveal melanoma including localized stages) in two independent cohorts (AUC = 88% to 100%, N = 747). DIAMOND can also be used to perform copy number alterations (CNA) analysis which improves cancer detection.
Conclusions This should lead to the development of more efficient non-invasive diagnostic tests adapted to all cancer patients, based on the universality of these factors.
Statement of significance The DIAMOND assay is a new highly sensitive strategy to detect base-pair resolution methylation patterns of LINE-1 retrotransposons (L1) from plasma DNA. It targets 30-40,000 young L1 scattered throughout the genome, covering about 100,000 CpG sites and is based on a reference-free analysis pipeline. This provided high coverage data using affordable sequencing depth, which is instrumental to achieve high sensitivity and work with minute amounts of cell-free DNA. Resulting machine learning-based classifiers showed powerful discrimination between healthy and tumor plasmas from 6 types of cancers (colorectal, breast, lung, ovarian, gastric cancers and uveal melanoma including localized stages) in two independent cohorts (AUC = 88% to 100%, N = 747). DIAMOND data can also be used to perform copy number alterations (CNA) analysis which improves cancer detection.
Introduction
Extensive research has shown that tumor genetic alterations can be detected from plasma DNA of patients with cancer 1–3. This paved the way for the use of molecular analyses performed from liquid biopsies to genotype tumors non-invasively 4,5 and demonstrated the potential of circulating tumor DNA (ctDNA) as a marker of cancer progression 6,7. It is also a powerful prognostic factor 8 enabling detection of tumor masses not perceptible clinically, after surgery or during treatment. These approaches promise optimal management of cancer patients and are currently playing an important role in oncology 9,10. However, several technological obstacles still limit their widespread application. Samples collected at early stages of tumor progression, or during and after treatment, may contain less than one mutant copy per milliliter of plasma 1,11.
This is below the detection limit of most used technologies, even when testing multiple genetic alterations simultaneously. Moreover, most methods are biased towards preselected recurrent mutations, which do not cover all tumors. We observed in our previous studies 12–15 that approximately 25% of patients affected with breast cancer do not display common mutations trackable in plasma DNA, even at advanced stages. Therefore, it is necessary to develop more sensitive and more informative detection tools.
Multiple studies have demonstrated the central role of epigenetic processes in the onset, progression, and treatment of cancer. Epigenetic alterations (i.e., changes in the pattern of chromatin modifications such as DNA methylation and histone modifications) are promising candidates for cancer detection, diagnosis and prognosis 16,17. These extended markers provide an additional level of information, overlooked by methods that only question genetic alterations 18. Aberrant DNA methylation is a hallmark of neoplastic cells 16, which combine hypermethylation of a wide range of tumor suppressor genes along with a global hypomethylation of the genome19. DNA methylation is a stable modification, which affects a large number of CpG sites per region and per genome and will be key to achieve increased detection sensitivity 20. Moreover, the concordance of the methylation status between multiple CpGs of the same region can help detect low frequency anomalies among a heterogeneous population of molecules 21,22. Finally, combining several genomic regions allows to capture a wide range of tumor alleles and cover the heterogeneous profiles of cancer patients 23.
Previous studies have shown that cellular DNA methylation patterns are conserved in cell-free DNA (cfDNA) and that detection of cancer-specific profiles at the genome-wide scale is feasible 24–27. Until now, most studies investigating plasma DNA methylation patterns have targeted a limited number of regions at high depth, using PCR-based methods 28–30, or explored genome-wide at low depth with high-throughput sequencing 24–26,31. Both approaches have limited sensitivity, as focusing on a few regions does not cover cancer-type and patient variability and low depth cannot detect small fractions of ctDNA. More recent studies, relying on the capture of regions of interest coupled with deep sequencing have investigated the performance of larger numbers of regions at high depth 21,32–40. These methods enabled sensitive detection and classification of cancer from plasma DNA. However, since they largely focus on cancer hypermethylation and unique sequences, it involves targeting specific regions for each cancer type. As a result, developing a cost-effective universal pan-cancer test remains a challenge.
Remarkably, cancer-related hypomethylation has been reported in almost all classes of repeated sequences 41, from dispersed retrotransposons to clustered satellite repeated DNA, and within multiple forms of cancers 42. In particular, this leads to the reactivation of retrotransposons, resulting in the acquisition of genomic instability, chromosomal rearrangements and the production of chimeric transcripts between the transposable element and its adjacent locus.
Hypomethylation of the internal promoter of Long-Interspersed Element-1 (L1) has been described as a hallmark of many human cancers 42,43, which can result in the reactivation of intact L1 elements 44 and the abnormal production of their transcripts and proteins. Transposition of these competent elements induces DNA double-strand breaks and damages the genome. A recent study identified 4 types of cancer (esophagus, head and neck, lung and colorectal) with a large amount of damage linked to retrotranspositions involving mostly L1s 45. Another study identified the transposition event responsible for initiating colorectal cancer by mutating the APC gene 46. To obtain a global representation of the hypomethylation occurring during carcinogenesis and to increase sensitivity, we chose to target primate-specific copies of L1 retrotransposons (L1PA). These elements have tens of thousands of copies per cell and are hypomethylated in multiple cancers 42. Two studies have explored L1 global methylation profiles from plasma 47,48 of lung and colorectal cancers, using qPCR-based methods, but reported a low detection sensitivity, below 70%. Indeed, repeats being inherently difficult to map, detecting their methylation profiles at the single base-pair resolution requires sophisticated downstream analysis.
To overcome this, we have developed a method to detect methylation patterns of primate specific L1 elements (L1PA) from cfDNA, which we named DIAMOND (for Detection of Long Interspersed Nuclear Element Altered Methylation ON plasma DNA). We implemented computational tools to accurately align sequencing data without a reference genome and applied prediction models, trained by machine learning algorithms, integrating patterns of methylation, overall and at the single molecule level. The aim of this study was to assess the potential of circulating DNA methylation changes at L1s as a universal tumor biomarker, and to develop new highly sensitive strategies to detect cancer-specific signatures in blood.
Results
Targeting primate-specific LINE-1 elements reveals plasma DNA-methylation patterns genome-wide
We developed a PCR-based targeted bisulfite method coupled to deep sequencing to detect methylation patterns of L1PA elements. We used sodium bisulfite chemical conversion to achieve base-pair resolution analysis and designed a multiplexed PCR based on 8 amplicons covering L1PAs (Fig. 1A, Table S1, Fig. S1A). We detected thousands of L1PA elements scattered throughout the genome as shown by the genomic hits obtained from a healthy plasma, an ovarian tumor, and a uveal melanoma tumor sequenced at high depth (Fig. 1B, Table S2). We observed similar profiles for the three samples, as well as for healthy and cancer plasmas with standard coverage (Fig. S1B-E). This demonstrated the robustness of the approach. Overall, the estimated number of L1PA targets is about 30-40,000 elements per genome including half of the human specific copies (L1HS) and many copies of the other L1PA subfamilies (Fig. 1C, Table S2). This represents an estimate of 87-120,000 CpG sites.
Following deep sequencing, reads are traditionally mapped back to the genome. However, the majority of sequencing reads from repetitive sequences are assigned randomly during mapping steps and are subsequently lost for classical differentially methylated region (DMR) calling 49. We, thus, developed a new computational pipeline to accurately align repetitive sequencing data without using a reference genome (Fig. S1F). To perform this, we clustered all good quality reads based on their similarity, extracted representative sequences from the largest clusters and used them for multiple sequence alignment. We then aligned all the reads back onto this custom database. Using such reference-free method, we preserved the majority of our data and could extract the informative CpG sites agnostically. We selected sites with a CG/TG content ≥ 20% including at least 5% of CG to ensure that the position of interest carries some DNA methylation marks. This selection was done on healthy samples to avoid biases related to cancer hypomethylation. We retrieved 35 CpG positions covered by our panel including two additional CpGs with respect to the L1HS consensus annotations, located within amplicon 2 (Fig. S2A-B).
As expected, the 5’ end of the L1 copies targeted is heavily methylated 42,50, particularly within the 2nd amplicon. We also observed quite high levels in both the 5th amplicon (69% in average, Fig. 1D), which covers part of the ORFI, and the last two CpGs of amplicon 8, which is located immediately upstream of the 3’UTR. Amplicon 3, which has the lowest methylation levels within the 5’ end, displayed sequencing data with atypical distributions and showed less robust performances (not shown). Hence, we further eliminated it from the rest of the study, resulting in a total of 30 CpG positions analyzed. Overall, this reference-free method retrieved methylated sites contained by the youngest LINE-1 elements present in the human genome allowing us to study their DNA-methylation levels and motifs from minute amount of DNA such as plasma cfDNA.
L1PA hypomethylation is detectable from plasma DNA in multiple forms of cancer
We first tested the DIAMOND approach on methylation controls, cancer cell lines and tissue samples. The overall methylation levels demonstrated an extensive L1PA hypomethylation specifically in cancer samples, including colorectal (CRC), ovarian (OVC), breast (BRC) and uveal melanoma (UVM) cancer cell lines as well as OVC, BRC and UVM tumors compared to healthy white blood cells and healthy tissues collected adjacent to ovarian tumors (Fig. 2A, Table S3). Next, we tested a cohort of 473 plasma samples including 123 healthy plasma controls and plasma samples from patients with 6 different types of cancer, covering metastatic (M+) and localized (M0) stages (Table S4. This includes colorectal and ovarian cancers in which a substantial rate of L1 hypomethylation has previously been reported 51,52. We detected a statistically significant L1PA hypomethylation in cfDNA of metastatic colorectal cancer (CRC M+), breast cancer (BRC M+) and uveal melanoma (UVM M+) samples as well as in locally advanced ovarian cancers (OVC M0, stages III) and localized gastric cancers (GAC M0) (Fig. 2B, Table S3). The global methylation was not significantly different in metastatic non-small cell lung cancers (LC M+) nor in localized stages of breast cancer (BRC M0). However, focusing strictly on global methylation levels provides only part of the information.
We further computed the levels of methylation at each CpG target (n=30) for these plasma samples and observed specific patterns of methylation along the L1 structure, which are robustly conserved among the 123 healthy donors (Fig. 2C). When considering all cancer samples together, we observed a clear difference with the methylation of healthy samples. We detected a steady hypomethylation through all CpG targets except for the two sites within amplicon 8 (Fig. 2D). This is also true for metastatic colorectal cancers (CRC M+), breast cancers (BRC M+) and uveal melanoma (UVM M+). Clear hypomethylation is also observable for localized gastric (GAC M0) and ovarian (OVC M0) cancers, in particular at amplicon #1, #4 and #6, while the differences are less striking for localized breast cancers (BRC M0) and metastatic non-small cell lung cancers (LC M+). The distinction between most cancers and healthy samples were dependent on multiple CpG positions belonging to different amplicons along L1s, as shown by PCA analysis (Fig. S2C). The least discriminating positions were located within amplicon 8, which is consistent with the metaplots shown in Fig. 2D.
Next, we analyzed the motifs of methylation at the molecule level, which provide a more detailed signal. These haplotypes correspond to true patterns of methylation of adjacent CpGs, detected for each amplified DNA molecule. This was achieved by the incorporation of unique molecular identifiers (UMIs) into the library (Fig. S1A). Based on the combination of the 30 CpG targets divided into their 7 amplicons, we extracted a total of 372 unique features (Fig. S2D). We observed highly robust representation profiles of haplotypes among the 123 healthy samples (Fig. 2E). For most amplicons, the fully methylated molecules were the most represented, as expected for healthy controls. However, we observed a high proportion of totally unmethylated haplotypes in amplicon #6 and #7. This can be explained by the fact that older L1 copies are often truncated in 5’ and less regulated by DNA methylation, leading to the capture of molecules with lower DNA methylation in 3’. Nevertheless, several intermediate patterns were also among the most important features and were found to be differentially represented in healthy and cancer samples (Table S5). Fully methylated haplotypes were significantly under-represented in most cancer subgroups and in most amplicons (Fig. 2F). On the contrary, fully unmethylated haplotypes were over-represented in most cancer subgroups and in most amplicons. This is also well illustrated by the PCA analysis shown in Fig. S2E, underlining the contribution of highly methylated haplotypes towards the healthy group versus the lowly methylated haplotypes separating cancer samples (middle panel). This separation involves haplotypes from all amplicons (right panel).
Next, we compared the methylation profiles of tumor and plasma paired samples (OVC = 10, Fig. S3A-F; BRC = 16, Fig. S3G-K) by calculating the correlation between their methylation differences relative to the mean methylation of healthy donor plasmas. We observed a better correlation with methylation haplotype portions than with single CG methylation features (Fig. S3B-C and Fig. S3H-I, Table S6). These results demonstrate that L1 hypomethylation can robustly be observed from cancer plasma DNA at the level of single CpG sites and more importantly at the level of methylation haplotypes.
L1PA hypomethylation-based classifiers recognize samples from multiple forms of cancer
We then trained classification models using random forests, with the 30 features corresponding to the levels of methylation at each CpG target or the 372 features corresponding to the proportions of haplotypes, or both, and assessed their performances to automatically separate healthy from tumor plasmas. By testing all cancer samples without cancer-type specification, the methylation of L1PA elements showed an extremely good ability to discriminate between healthy and tumor plasmas, with an overall area under the curve (AUC) of 94-95% for the 3 types of features (Fig. 3A, 3C, S4A-B). Next, we trained distinct models to estimate the performances for each cancer type and/or dissemination stage (M0 vs M+). These models were extremely performant in metastatic colorectal and breast cancers but also stage III ovarian cancers and localized gastric cancers, with nearly perfect classifications and AUCs between 98-100% (Fig. 3B-C, S4A-B). Additionally, we observed excellent performances for metastatic lung cancers and uveal melanoma and more importantly for localized stages of breast cancer (AUCBRC_M0 = 92-95%). These models provide very good sensitivities at 99% specificity (Fig. 3D), in particular for CRC M+, BRC M+, OVC M0, GAC M0 and BRC M0. The latter is one of the most difficult cancer to detect non invasively, as reported in previous liquid biopsy multi-cancer tests 11,39,40.
Overall, we observed similar results using single-CpG methylation levels, haplotype proportions or both features. This can be explained by the high correlation observed between the 2 types of features (Fig. S4C). We compared these performances for each subgroup with the classification rates extracted from the model ‘all’, testing all cancer samples together. We observed similar AUCs for most cancers, but overall, they were better classified with their ‘expert’ cancer specific models (Fig. S4D-E). This shows that certain specificities of cancer type may confuse the current ‘all’ model and affect the sensitivities at 99% specificities.
Subsequently, we evaluated the importance of the features used by our classifiers (Fig. 3E-F). CpG positions displayed different patterns in the various cancer subgroups that can be informative for distinct cancer types or stages (Fig. 3E). Nonetheless, we identified features which are common to many types of cancer such as most CpGs of amplicon 1 and the first CpG of amplicon 6. Other features seemed to be characteristic of specific subgroups, such as CG7-14 which are the most important features for sorting localized stages of breast cancer (BRC M0) or CG15-18, in particular CG17, which are part of the top features for metastatic breast cancers (BRC M+). Besides their dissemination status, the different important features detected in the BRC M0 and M+ subgroups may result from the fact that they were composed of different breast cancer subtypes, with 100% of hormone dependent (HR+ HER2-) cancers in the BRC M+ subgroup and a mixture of subtypes, including also triple negative (TNBC) and HER2+ cancers, in the BRC M0 subgroup (Table S7). Indeed, we observed that slightly different hypomethylation patterns in HR+ HER2-BRC versus TNBC (Fig. S4F-H). We observed that haplotype features showed consistent patterns with important CG features (Fig. 3F). However, haplotypes provide a more detailed view of the methylation patterns with a strong importance of the most methylated or non-methylated molecules. We still observed that some methylation intermediates are important for cancer detection (ex: in amplicon #1 in CRC M+ and GAC M0, #2 in BRC M0, #4 in BRC M+ and other subgroups, #5 in LC M+ and UVM M+, #7 in OVC, #8 in BRC M0 and LC M+). BRC M+ and M0 subgroups cluster together, showing that even if they are not highly similar, they are closer to each other than to other cancer types. These breast cancer specificities are worth exploring further in the future. Overall, this suggests that L1PA methylation alterations detected from cfDNA vary in different types and stages of cancer.
To estimate the ability of DIAMOND to detect cancer at early stages of the disease, we build classifiers for 3 stage classes gathering all cancer types: early stages (I/II, N=31), locally advanced stages (III, N=30) and metastatic stages (IV, N=281). Classifications were highly performant for all 3 stage categories (AUCEarly = 95%, AUCAdv. = 97%, AUCMeta = 95%; Fig. 3G-H, S4I, S4D) with a mean sensitivity of 70% for early stages, (SenEarly = 70%, SenAdv. = 90%, SenMeta = 69%; Fig. 3I, S4E).
Next, we analyzed whole genome bisulfite sequencing (WGBS) from healthy and cancer plasma from two recently published studies 53,54 to evaluate if we can retrieve L1PA hypomethylation signal with non-targeted methods (Fig. S5A, see methods). Consistent with our findings, we observed statistically significant hypomethylation in cancer samples (Fig. S5B-H). These whole genome approaches were nevertheless not as efficient as the DIAMOND targeted approach to automatically detect cancer from L1PA hypomethylation as they cover less well the regions of interest (Fig. S5A, S5F, S5I). Indeed, they involve mapping on a reference genome leading to loss of data. Moreover, L1PA methylation measured with DIAMOND largely outperforms methods based on the detection of mutations. In comparison, the identification of the same tumor samples via the detection of frequent recurrent mutations, which is commonly used in the clinic, does not exceed 57% for ovarian cancer (Table S8), 38% for colon cancer 55 and 52% for metastatic breast cancer 13,14 (Fig. 3J). We particularly achieved remarkable performance on the cohort of 27 localized gastric cancers with a detection rate of 95% of true positive as compared to 12% for mutation screening 56. This is mostly due to the fact that methylation changes occur in virtually all cancer patients, unlike recurrent mutations.
Multi-cancer classification performances are reproducible on an independent cohort
To validate the DIAMOND approach, we tested a second independent cohort consisting of 214 patients affected with the same types of cancers as in the first cohort, excluding uveal melanoma and non-metastatic gastric cancers, along with 60 healthy donors (Fig. 4A). First, we confirmed that the methylation patterns along the L1 structure were highly reproducible between healthy donors from cohorts 1 and 2, at the level of single-CpG targets (Fig. 4B) but also for haplotype proportions (Fig. 4C, Table S9). While methylation at single-CpG within cancer subgroups showed slightly more variability (Fig. S6A), global methylation levels were quite reproducible between the two cohorts, showing similar distributions and no statistical differences (Fig. 4D, Table S10), except for non-metastatic ovarian cancers. There was an important heterogeneity among the OVC M0 samples of cohort 2, which clustered into two distinct groups, while cohort 1 was more homogeneous (Fig. S6B). Notably, no correlation was found with available clinico-histopathological parameters (age, staging, CA125 level, mutational status, treatment or response to therapy). Differential haplotype proportions between healthy and cancer subgroups were also mostly conserved (Fig. S6C, Table S11). Overall, the method showed good reliability with the 7-amplicon panel used and good robustness in detecting L1 methylation levels and changes.
Since age-related changes in DNA methylation have been described 57,58 and that the healthy donors included in the study are younger overall than the cancer patients (Fig. S7A), we have investigated whether there was an effect on the methylation patterns we studied. We found a significant but very small effect which appeared much smaller that the effect of disease status (Fig. S7B-C). This small effect was tending towards an increase in methylation with age (Fig. S7D), and we observed similar patterns and differences between healthy and cancer samples when adjusting for the age (Fig. S7E-G, Table S12). Furthermore, we observed similar performances in age-matched and non-age-matched cohorts extracted from C2 (Fig. S7H-J), demonstrating that age is not a confounding factor.
To validate our classifiers, we trained models on the entire first cohort and evaluated them on the second set of independent samples. The results showed excellent classification performances with an overall AUC of 88% when testing all cancers together with no annotations of their histological types and AUC between 88%-100% for the cancer subgroup ‘expert’ models with sensitivities at 99% specificity between 49%-100% (Fig. 4E) including 55% for localized breast cancers. It was, however, lower for metastatic lung cancer with a mean sensitivity of 49%. We observed that haplotype models were more robust compared to single-CpG methylation rates (Fig. S6D-G). This could be explained by the fact that haplotypes capture true methylation patterns at the molecule level, enabling to discard noise caused by experimental variability for example. Next, we applied the same validation method, training on C1 and testing on C2, for the 3-stage ‘expert’ classifiers and observed great classification performances with a mean AUC of 99% and a mean sensitivity of 79% for early stages (Fig. 4F). When comparing the performances for each subgroup with the classification rates extracted from the model ‘all’, we observed similar AUCs for most cancers, but lower rates for the BRC M0 subgroup and the early-stage group (Fig. S8A), which is mostly composed of BRC M0 (Fig. S8B).
To further investigate the generalization of our marker, we have also tested the ability of DIAMOND to detect cancers of a type or subgroup for which the model is not trained. To do so, we have combined the data from cohorts C1 and C2 and trained on the whole data set, systematically removing one cancer subgroup or one cancer type and testing specifically for this subgroup or type (see Methods, Fig. 4G, S8C). We observed performances similar to when the subgroup is included in the train set, which demonstrates the universality of L1PA hypomethylation. Classification performance is noticeably lower for the BRC M0 subgroup when the model still performs well for OVC M0 and GAC M0. This may be because the BRC M0 subgroup contains the earliest stages of both cohorts (Fig. S8B) and the model is less well trained for these stages when removing it. To improve this, we have developed another integrated model that is trained for all cancer types and stages together but also incorporates the probabilities from the ‘expert’ models, defining profiles for various stages and dissemination status. This ‘stack’ model provides a unique prediction for cancer versus healthy status for each sample at once, independently of their cancer type / dissemination status, and allows to balance for the cancer characteristics used to train the model by including them all. We have then extracted the performances per cancer subgroups out of this integrated model, which reached better performances to classify BRC M0 and early-stages samples (see Methods, Fig. 4G, S8A).
This demonstrates the robustness of cancer detection by probing L1PA hypomethylation from plasma DNA with the DIAMOND assay and its ability to detect early stages. Finally, we evaluated if the extent of L1PA hypomethylation was associated with survival. In the validation cohort, we compared patients with high methylation to those with low methylation levels (above or below the median threshold - Fig. S8D) and observed that more pronounced hypomethylation is clearly associated with shorter survival (Fig. S8E). This effect is thought to mainly reflect the tumor burden, which directly impacts the fraction of tumor DNA circulating in the blood, but it remains an interesting non-invasive tumor marker.
DIAMOND data contain signal to infer the tumor burden, which improves cancer detection
When looking at the overall methylation rates, we detected significantly more hypomethylation for more advanced stages of the disease, in particular in metastatic stages compared to localized stages (Fig. 5A, Table S13). However, there was no significant differences between metastatic tumor tissues and primary tissues (Fig. 5B, Table S14), which confirmed that L1PA methylation alteration is an early event in carcinogenesis 59,60 and also affects early-stage cancers. The differences observed in the blood (Fig. 5A) reflect the ctDNA fraction, which is known to be directly influenced by the tumor burden and the stage of the disease 1,9. This demonstrates the quantitative potential of DIAMOND, which could help quantify the tumor burden and monitor the disease.
A recognized marker to non-invasively estimate the tumor burden and the fraction of ctDNA is the aneuploidy or copy number alterations (CNA) 61,62, a hallmark of cancer genomes 63. Given that DIAMOND hits are dispersed throughout the genome (Fig. 1B), we investigated the possibility to use our data to perform CNA analysis. The mFast-SeqS approach had previously used a PCR-based L1PA targeting as a prescreening tool to estimate the fraction of ctDNA 64,65. This was done on native DNA whereas our data resulted from bisulfite-treated DNA.
We first tested this approach on 15 breast cancer cell lines that were also assessed by CytoScan HD microarrays for aneuploidy. DIAMOND provided an average of 78 000 uniquely mappable reads per cell line, corresponding to around 10 000 L1PA copies precisely located in the genome. These L1PA hits homogeneously overlapped with regions covered by CytoScan probes along the genome (Fig. 5C). We computed z-scores, quantifying copy number alterations, at the level of chromosome arms as previously described (65,66, see methods) and obtained similar results to those found with CytoScan arrays (Fig. S9A). We observed low alteration scores for the normal-like breast cell line HTERT-HME1 and good correlations between the 2 methods for the majority of the cell lines (Fig. S9B).
Next, we computed genome-wide z-scores in healthy and cancer plasma samples and observed high alteration scores specifically in cancer samples (Fig. 5D). Cancer subgroups z-scores mirrored global hypomethylation profiles (Fig. 5E, 2B and 4D, Table S15), both reflecting tumor burden and ctDNA fractions available. However, global methylation rates and z-scores were only moderately anti-correlated (Fig. 5F), demonstrating that these are partially independent markers that can provide distinct signals (Fig. S9C).
To obtain a final classification labelling each sample as healthy or cancer, we used a 2-step categorization incorporating CNA analysis, which improved cancer detection. We used the probability of the cancer prediction provided by the methylation-based validation models (‘expert’ and ‘all’ models), applying a threshold identified on the discovery cohort, followed by a reclassification of samples which presented a z-score > 121, as cancer. This cut-off value was deduced from a cross-validation applied on C1 (see methods and Fig. S9C). This classifier achieved high sensitivities with specificities between 97-100% for the ‘expert’ models and 93% for the integrated ‘all’ model in 4 distinct cancer types (BRC, CRC, LC, OVC, Fig. 5G) and could be applied as is in the clinic. This was particularly promising for localized breast cancer with a sensitivity of 100% with both model types.
Discussion
In this study, we established a robust proof of concept that targeting hypomethylation of retrotransposons from cell-free DNA is a sensitive and specific biomarker to detect multiple forms of cancer non-invasively. Repetitive regions provide genome-wide information as they hold half of the CpG sites present in the human genome 67. Hypomethylation of L1 elements, which is a common feature of multiple forms of cancer, is a universal marker and helps cover the heterogeneous profiles of cancer patients in a single test. Previous methylation studies have left these regions aside as they are inherently difficult to map, and DMR analysis is commonly performed on mapped data. However, repeats were previously used to profile aneuploidy and one of the first reported liquid biopsy cancer detection test was based on serum DNA integrity calculated with the ratio of short over long Alu fragments 68.
Here, we developed a new pipeline to detect methylation profiles at repeats with a single base-pair resolution, without resorting to mapping on a reference genome. This allowed us to rescue unmappable sequences (such as polymorphic copies that are not annotated in the reference genomes) or sequences that are difficult to map (such as multi-mappers) and retain most of our data. This is instrumental in achieving high sensitivity. The DIAMOND assay demonstrated high performance in detecting cancer samples and we established its feasibility in six different cancer types, including three at localized stages. It outperforms mutation screening, as it covers virtually all patients and presents a promising marker for pan-cancer detection.
Previous cfDNA multi-cancer tests enabled sensitive detection and classification of multiple forms of cancers such as the CancerSeek test, based on the combination of mutations and proteins detection11,69 and more recently combined with aneuploidy profiling using representation of Alu sequences70 or the cfDNA methylation test Galleri from GRAIL, which targets 1,100,000 CpGs within unique regions 39,40. In comparison, DIAMOND targets about 100,000 CpG sites within 30-40,000 copies of LINE-1 using a unique set of probes for all these copies and the various cancer type tested. Very recently, several multi-cancer detection methods based on retrotransposons were published. This includes the detection of the circulating L1 protein ORF1p 60 as well as the probing of the representation of all type of repeats from Annapragada and colleagues 71. Another study demonstrated that cfDNA cleavage profiles at Alu sequences reflect their methylation status and can be used to detect liver cancer and nasopharyngeal carcinoma 72.
The DIAMOND assay interrogates specifically the methylation status of LINE-1 retrotransposons to obtain a global representation of the hypomethylation occurring during carcinogenesis and assess the potential of circulating DNA methylation changes at L1PA elements as a universal tumor biomarker. Our aim was to develop a new highly sensitive strategy to detect cancer-specific signatures in blood. Overall, DIAMOND reached great performances for multi-cancer detection using ‘expert’ models for specific cancer subgroups or integrated models including all types of cancers. Lower detection rate in metastatic lung cancer may be related to the fact that these samples seem to have a low tumor burden as indicated by their genome-wide z-scores (Fig. 5E). However, integrating other regions with cancer-specific methylation changes could help improve detecting this type of cancer. The DIAMOND assay provides methylation profiles from minute amount of cfDNA, down to a few nanograms, with high precision and high coverage using an affordable sequencing depth.
Stronger L1PA hypomethylation was associated with shorter overall survival demonstrating its prognostic value. We therefore anticipate that our method has the potential to be applied for the development of routine clinical tests. While our integrated models provide proof of concept for cancer detection with blood screening in asymptomatic patients, the ‘expert’ models could also be useful to help diagnosis when we suspect the cancer location or to perform disease follow up.
To push the DIAMOND assay towards a clinically applicable test, we also demonstrated that DIAMOND data can be used to perform copy number alterations analysis which improves cancer detection. We integrated this analysis in a classifier providing ‘healthy’ or ‘cancer’ labels for each sample and reached a detection of 91% of true positives for all cancers together and in particular a 100% sensitivity with 100% specificity for localized breast cancer with the BRC M0 expert model.
Further testing with a larger number of samples covering earlier stages, including an independent cohort of gastric non-metastatic cancers, more subtypes and different types of cancer will enable to consolidate and expand these findings. Moreover, this will strengthen the classification models, which will perform better with more samples for training and testing. In the current study, we did not control for ethnic origin of samples tested. The potential impact of L1 polymorphism linked to ancestry should be estimated in future studies. It will also be important to study the impact of other conditions, such as auto-immune diseases, which may lead to the detection of L1 hypomethylation in blood. The recent study on the detection of circulating L1 ORF1p in cancer by Taylor and colleagues60 demonstrated a high specificity and no sign of L1 reactivation in blood of patients with auto-immune disease, indicating that it might be a cancer-specific phenomenon. DIAMOND analysis could further be used to infer the tumor burden and monitor the disease to better detect minimal residual disease and the relapse early. However, the impact of treatments on methylation status should be investigated first.
Overall, we developed a turnkey analysis method that identifies tumor plasmas across multiple types of cancer with the same marker. This approach offers an optimized balance between the number of targeted regions and sequencing depth, which could extensively improve the sensitivity of ctDNA detection in a cost-effective manner and improve management of patients with cancer.
Methods
Materials
Cell lines
Cell lines screened in Fig. 2A are the following: CRC (HCT116); OVC (SKOV, Caov3, ES-2); BRC (MDA-MB453, SKBR3, MDA-MB361, HCC202, ZR75.1, HCC70, BT474, MDA-MB231, Cal51, MDA-MB157, BT20, MCF7, HCC1954, HCC1569, HCC38); UVM (MP38, MP41, MP46, MP65, MM28, Mel285, Mel270, 92.1, Mel202, omm2.5, Mel290, mm66, omm1).
Tissue and plasma samples
Archived tissue samples (ovarian adjacent tumor tissues, ovarian primary and metastatic tumors, breast tumors and uveal melanoma tissues) were retrieved from the Pathology department of Institut Curie. Healthy white blood cells and healthy plasma were collected from blood of healthy donors through the French blood establishment (agreement #16/EFS/031) under French and European ethical practices. Blood samples from patients treated at the Institut Curie (Paris, France) were collected, after written informed consent, as part of the following studies: resectable metastatic colorectal cancers from the Prodige14 trial (approved by a French Personal Protection Committee – “CPP-Comité de Protection des Personnes Sud Méditerranée IV” and registered in ClinicalTrials.gov under NCT01442935); non-small cell lung cancer and metastatic HR+ HER2-breast cancer from the ALCINA study (approved by a French Personal Protection Committee and registered in ClinicalTrials.gov under NCT02866149; treatment-naïve ovarian cancer or triple-negative breast cancer patients eligible for surgery or neoadjuvant chemotherapy from the SCANDARE study (approved by the French National Agency for the Safety of Medicines and Health Products “ANSM - Agence National de Sécurité du Médicament”, a French Personal Protection Committee and registered in ClinicalTrials.gov under NCT03017573); multiple-types of metastatic cancers from the SHIVA02 study (approved by the French National Agency for the Safety of Medicines and Health Products “ANSM - Agence National de Sécurité du Médicament”, a French Personal Protection Committee and registered in ClinicalTrials.gov under NCT03084757), non-metastatic operable gastric cancers and advanced uveal melanoma from CTC-CEC-ADN study (approved by a French Personal Protection Committee and registered in ClinicalTrials.gov under NCT02220556). Additional archived samples were also retrieved from the biobank of the Institut Curie, patients having provided informed consent for research use. All samples were obtained in accordance with the ethical guidelines, with the principles of Good Clinical Practice and the Declaration of Helsinki. This study was approved by the Internal Review Board and Clinical Research Committee of the Institut Curie. Blood samples were collected at the time of inclusion, before the start of the treatment, in EDTA tubes. Plasma was isolated within 4 h, to ensure a good quality of cfDNA, by centrifugation at 820 g for 10 min, followed by a second centrifugation of the supernatant at 16,000 g for 10 min and stored at −80°C until use.
Methods
Preparation of DNA from cell lines and tissues and cfDNA
Isolation of DNA from cell lines and healthy white blood cells (buffy coats) was performed using the QIAamp DNA Mini Kit or QIAamp DNA Blood Mini Kit (Qiagen) according to the manufacturer’s instructions. DNA from cryopreserved and formalin-fixed paraffin embedded (FFPE) tumor tissues was extracted using a classical phenol chloroform protocol and the NucleoSpin® FFPE DNA kit (Macherey Nagel), respectively. cfDNA was extracted from 2 ml of plasma using the automated QIAsymphony Circulating DNA kit (Qiagen), the Maxwell RSC ccfDNA LV plasma kit (Promega) or manual QIAamp circulating nucleic acid kit (Qiagen), according to the manufacturer’s instructions, and eluted in 60 μl, 75 μl or 36 μl, respectively. We verified that the extraction method did not impact our results (Fig. S10A-B). Isolated DNA was quantified by Qubit® 2.0 Fluorometer using dsDNA HS Assay Kit (Thermo Fisher Scientific) according to the manufacturer’s instructions and stored at −20°C until use.
Bisulfite conversion
We used sodium bisulfite-based chemical conversion to achieve base-pair resolution analysis, which is crucial to address methylation levels at single CpG dinucleotides and the co-methylation of multiple CpG sites to determine methylation haplotypes (methylation state of successive CpG sites). Bisulfite treatment of the isolated genomic DNA (up to 200 ng) from the cancer tissues, cancer cell lines and buffy coats was performed using an EZ DNA Methylation-Gold Kit™ (Zymo Research, CA, USA), following the manufacturer’s instructions. Bisulfite treatment of cfDNA (isolated from 2 ml of plasma) was performed using the Zymo EZ DNA Methylation-Lightning Kit™ (Zymo Research, CA, USA), according to the manufacturer’s instructions. Bisulfite-treated DNA was stored at −80°C and further used to build a sequencing library. We have compared the methylation profiles obtained with bisulfite conversion or enzymatic conversion (NEBNext® Enzymatic Methyl-seq Conversion Module, 73) followed by amplification with the DIAMOND targets and deep sequencing. We observed similar methylation profiles (Fig. S11). We have tested various starting quantities of DNA (100ng, 10ng, 5ng) of white blood cells extracted from healthy donors (buffy coats). We have also compared cfDNA (10ng) and DNA from MCF7 breast cancer cell lines (100ng) know to display hypomethylation at L1PA elements. We observed that the methylation profiles along the 30 CpG targets are similar when starting with 100ng (Fig. S11A) but, in our hands, the bisulfite conversion seems more robust with decreasing stating DNA quantities (Fig. S11B-C). We thus observe some differences with the cfDNA samples at 10ng (Fig. S11D). We could still detect very similar hypomethylation profiles in MCF7 breast cancer cell lines (Fig. S11E). This shows that EM-seq could be used to profile L1PA methylation changes but the same conversion should be used throughout the study.
Primer design
Eight primer pairs were designed using the LINE-1 Human Specific (L1HS) consensus sequence from Repbase (Fig. 1A). Although 5’UTR (promoter region) is CpG-rich and common target for methylation quantitation, L1PA copies are frequently 5’-truncated. Therefore, primers were also designed for ORFI and ORFII to target more L1PA elements and improve the sensitivity of our assay. All primers were designed for plus strand of bisulfite converted DNA, using the MethPrimer or PyroMark Softwares. Targeted regions contained 2-7 CpG targets and ranged from 101bp to 150bp, to better capture cfDNA fragments, which have a mean size of 167bp 74, (Table S1). Primers were methylation-independent, encompassing 0 to 2 CpGs (none toward the 5′ end), and were degenerated to target both the methylated and unmethylated states. They contained Fluidigm universal CS (common sequence) tags at their 5′ ends. We incorporated a 16 N (random nucleotides) as unique molecular identifiers (UMI) between the target-specific sequence and the CS2 in the reverse primers for signal deconvolution to detect true low frequency alterations and for reducing errors. As LINE-1 hold thousands of copies per genome, a high number of distinct UMIs is essential for unique barcoding of each target molecule. The 16 N stretch between the target-specific sequence and the CS1 in forward primers was used to increase diversity of sequencing libraries and improve sequencing quality. All primers were obtained from Eurogentec (RP-cartridge purification method).
Preparation of targeted bisulfite sequencing libraries
To limit batch effects, sequencing and library preparation batches contained both cancer samples and healthy donors. We also specifically processed the validation cohort C2 with equilibrated number of healthy and cancer samples distributed in 5 experimental batches. Sequencing libraries were prepared using three PCR steps (Fig. S1A): 1) target-specific linear amplification for UMI assignment, 2) target-specific exponential amplification and 3) barcoding PCR for sample identification. Each library was prepared in two individual reactions (due to the overlap of amplicon 2 with other primers), including: I) Multiplex PCR amplification of 7 probes (Amplicon 1, 3, 4, 5, 6, 7, 8), and II) Single PCR amplification of amplicon 2.
UMI assignment for multiplex reaction was performed using Platinum™ Multiplex PCR kit Master Mix (Thermofisher, Life Technologies SAS) in a 25 μL reaction containing 1x Platinum™ Multiplex PCR Master Mix, 0.01-0.06 µM mix of reverse primers and up to 5 ng bisulfite-converted DNA at the following thermocycling conditions: 95°C for 5 min followed by 1 cycle at 95°C for 30 s, 58°C for 90 s, 72°C for 40 s. UMI assignment for single reaction was performed using Hot Star Taq Plus DNA Polymerase (Qiagen) in a 25 μL reaction containing 1x Taq PCR Buffer, 0.65 U Hot Star Taq (5U/μL), 0.2 μM dNTPs, 1.5 mM MgCl2, 0.1 μM amplicon 2 reverse primer, up to 4 ng of bisulfite-converted DNA at the following thermocycling conditions: 95°C for 10 min followed by 1 cycle at 94°C for 60 s, 58°C for 30 s, 72°C for 40 s. To ensure complete removal of the reverse primers and dNTPs, each 25 μL reaction was treated with 50U of Exonuclease I and 10U of FastAP Thermosensitive Alkaline Phosphatase (Thermo Fisher Scientific) at 37°C for 1 h and heat-inactivated at 80°C for 15 min.
Target-specific exponential amplification for multiple reaction was performed using Platinum™ Multiplex PCR kit Master Mix in a 50 μL reaction containing 1x Platinum™ Multiplex PCR Master Mix, 0.01-0.06 µM mix of forward primers, 0.2 μM CS2 reverse primer and 20 μL of purified PCR product at the following thermocycling conditions: 95°C for 5 min followed by 28 cycles at 95°C for 30 s, 58°C for 90 s, 72°C for 30 s followed by a 10 min incubation at 72°C. Target-specific exponential amplification for single reaction was performed using Hot Star Taq Plus DNA Polymerase in a 25 μL reaction containing 1x Taq PCR Buffer, 0.65 U Hot Star Taq (5U/μL), 0.2 μM dNTPs, 1.5 mM MgCl2, 0.2 μM amplicon 2 forward primer, 0.2 μM CS2 reverse primer and 8 ul of purified PCR product at the following thermocycling conditions: 95°C for 10 min, 25 cycles at 94°C for 60 s, 58°C for 30 s, 72°C for 30 s and 10 min at 72°C.
PCR products of multiplex and single reaction were pooled together after quantification by qPCR and purified using Agencourt AMPure XP (Beckman Coulter) at 1.2x ratio according to the manufacturer’s protocol. Purified DNA was eluted in 30 ul of water. Barcoding PCR was performed using universal fluidigm primers. 25 μL of purified pooled PCR product, 1x Phusion HF Buffer, 1 U Phusion Hot Start II DNA Polymerase (Thermo Fisher Scientific), 0.2 μM fluidigm primer, and 0.2 mM dNTPs were mixed in the final volume of 50 μL and amplified with the following conditions: 98 °C for 2 min, followed by 20-25 cycles of 98 °C for 10 s, 62°C for 30 s, and 72 °C for 30 s followed by a 5 min incubation at 72°C. The amplified product was purified by two consecutive AMPure XP steps using 1) a low concentration of AMPure XP beads (0.6x – 0.7x ratio) where the beads containing the larger fragments are discarded and supernatant collected (reverse purification) and 2) higher beads concentration (1.1x – 1.2x ratio) where the beads containing fragments of interest were collected and purified according to the manufacturer’s protocol. Size-selected libraries were eluted in 15 μL of low-EDTA TE buffer. The libraries were quantified with Qubit HS DNA kit (Thermo Fisher Scientific), qualified with nano-electrophoresis (TapeStation, Agilent), and pooled equimolarly for sequencing. Sequencing was performed on Illumina HiSeq rapid run mode or NovaSeq (PE 30bp, 170bp).
Preprocessing of the reads
For each sample, FASTQ files containing raw sequences, composed by the following parts: CS1, forward UMI, forward primer, insert, reverse primer, reverse UMI, and CS2 (Fig. S1A) were first filtered for reads quality (average >Q20 per read) and then demultiplexed (i.e., cut using atropos v1.1.31) using forward and reverse primer sequences. FASTA files were created per primer-set, containing inserts and reverse UMIs for deduplication, as they are unique for each input DNA molecule. Inserts and reverse UMI were then filtered on expected sizes (with a tolerance of ± 5 bases for the inserts). Filtered inserts and UMIs sequences were concatenated and deduplicated using vsearch v2.15.2. Reverse UMIs were then trimmed and resulting inserts from all samples were aggregated into a single FASTA file per primer-set.
Clustering, extraction of representative sequences and global alignment
Using vsearch (with the following parameters: --cluster_fast <inputFasta> --notrunclabels -- fasta_width 0 --iddef 4 --id 0 --qmask none --clusterout_sort --consout <referenceFasta>), a clustering based on sequence identity was applied to each FASTA file, or a subset of 20 million reads randomly chosen if a given file comprised more. The 10 largest clusters’ representative sequences were isolated in separate files. Using mafft v7.508 (with the following parameters: -- globalpair --maxiterate 1000), the 10 representative sequences were aligned pairwise resulting in a reference database for each primer-set. Lastly, using mothur v1.48.0 (with the following parameters: #align.seqs(candidate=<inputFasta>, template=<referenceFasta>, align=needleman, match=1, mismatch=-1, gapopen=-1, gapextend=0)) on each primer-set FASTA file, all sequences from all samples were aligned to the corresponding reference.
CG calling, methylation levels and haplotypes extraction
To call CpG dinucleotides of interest, a sliding window of 2 bp was used on all aligned sequences to determine the distribution of dinucleotides along each amplicon target. A first threshold of ≥ 20% of CG/TG dinucleotides was used to select potential CpG sites. A second threshold was applied to eliminate dinucleotide with ≥95% TG and select position with at least 5% methylation rate. From the aligned sequences, the patterns of methylation were extracted and compiled into either average levels of methylation at each previously identified CpG sites, or proportions of methylation haplotypes for each sample.
Machine learning-based classification models
The resulting data (represented as average levels of methylation per CpG site or proportions of methylation haplotypes or both) were used to do supervised learning of statistical models using the random forest classifier algorithm 75 from Python package scikit-learn 76, with the following hyperparameters: n_estimators=300, criterion=’gini’, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=’sqrt’, max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None.
The rational for choosing random forest over other learning methods was driven by three main factors: 1) it is less prone to overfitting 75; 2) it shows excellent performance even when the quantitative relationship between features and observations is biased in favor of the former, such as when using methylation haplotypes data representation 77; 3) random forests also inherently return measures of variable importance 75, such as mean decrease in impurity, which greatly facilitate the interpretability of model decisions. The features used to train the models were the average levels of methylation per CG site (n=30), the proportions of methylation haplotypes (i.e., the combinatorial of all the possible methylation status of CG sites within a given amplicon, n=372) or both. No additional transformation nor feature selection was performed on the data.
Expert and All cancer models
Model classifications were run 5000 times in order to estimate variance and confidence intervals. For the discovery step, in each run, as many samples from each class were randomly drawn to construct a balanced subset of the data 78. The samples from these draws were stratified by class and split into 60% for training, 40% for evaluation. For the validation step, we trained the model on the entire cohort 1 and evaluated it on cohort 2. The true and false positive rates for all possible classification threshold were evaluated at each run, with interpolation to generate an average ROC curve with 95% confidence interval for the 5000 runs. 95% confidence interval has been calculated with the following formula: Μ±z*s/√n with M the average of the variable, z the confidence level (z∼1.96 for 95%CI), s the standard deviation, n the number of samples in the variable.
Blind models
We trained a random forest on haplotypes features, removing one cancer type or subgroup from the training set. The specific cancer type or subgroup is then assessed in the test set. We pooled together the discovery and validation cohorts, training on 2/3 of all the samples-- excluding the cancer type or subgroup to test for-- and testing on the remaining one-third of the samples. The only exception was metastatic gastric cancer: as they are made up of only 3 samples, they were systematically moved to the test set, consequently blinding the model towards GAC M+ samples. We also trained a stacked version with this setup (see below).
Stacked machine learning model
We developed a model referred to in the paper as “Stack”. This model uses one Random Forest model for each combination of cancer type and metastatic status, known as the “expert sub-model.” Each expert sub-model was trained on one-third of the healthy plasma samples and one-third of the samples matching the cancer subgroup of interest (cancer type and dissemination status). These expert sub-models were then combined into a Random Forest Stack model, which uses both the haplotype features and the probabilities output by each expert sub-model. The final Random Forest Stack model was trained on an additional one-third of the healthy and cancer plasma samples and tested on the remaining samples (which represent one third of the healthy samples and one third of each subgroup).
Mutation screening for ovarian cancer samples
Ovarian tumor genotyping was performed using the TIGER panel previously developed by Institut Curie79, which targets 78 genes or using ‘custom NGS’ with amplicons targeting Tp53 and TSC2, which are the 2 most frequently mutated genes in this group of patients, with TruSeq library constructions for low input material (dual strand technology). After mutations identification, ovarian cancer plasmas were genotyped using custom NGS or ddPCR as previously done14. Sequencing was performed on a MiSeq V3-150 (25M) with Paired End 75bd protocol. Results are presented in Table S8.
Whole genome bisulfite sequencing analysis
To see if we could retrieve cancer-associated L1PA hypomethylation in others plasma studies, we analyzed data sets from 2 recent studies profiling cell-free DNA methylation with Whole Genome Bisulfite Sequencing (WGBS) in healthy and cancer patients (Liu et al. 202453; Gao et al. 202254). Liu et al. analyzed methylation at 75 617 CpGs in the whole genome (estimated with Bis-SNP after mapping with Bismark) of 17 healthy donors and 31 cancer patients. We extracted the methylation levels at CpG residing within L1PA families hit by DIAMOND (mostly L1HS-L1PA10, see Fig. 1C). To do so, we identified CpG dinucleotides covered by L1HS-L1PA10 elements, and their position within L1 based on the L1HS consensus sequence, and we selected the DIAMOND copies using Bedtools80. Gao et al., provided FASTQ files data for 123 breast cancer and 40 healthy patients. In order to avoid breast-cancer subtype effect and age effect, we subsampled the data to generate an age matched cohort with 16 healthy and 15 HR+HER2-M+ patients. We first merged the paired-end reads using Fastp81 and mapped them on the hg38 reference genome using bismark. At this step, we followed two different approaches to obtain the methylation level at the 30 L1PA CpGs that we studied with DIAMOND. 1) Using bismark extractor, we retrieved the methylation levels of all CpGs covered and intersected it with the dinucleotides covered by DIAMOND using Bedtools. Results produced with this approach are subsequently called “Gao et al (2022) Bismark”. 2) Using mothur82, we aligned the reads that mapped on L1HS-L1PA10 elements on an aggregate of native and converted (CG converted to TG) reference sequences of L1HS. All read that mapped with a score lower than 51% of similarity (score computed by mothur) were rejected (this threshold was established from the percentile at 99% of similarity score distribution from 1000 random sequences 167 bp long).
Results generated by this approach are subsequently called “Gao et al (2022) Mothur”. Finally, for “Gao et al (2022) Bismark” and “Gao et al (2022) Mothur “, we refined the results to only take in account the copy of L1HS-L1PA10 targeted by DIAMOND. Further refinements were done to only consider L1HS-L1PA3 copies with the second approach and is referred as “L1HS-L1PA3”.
Survival analysis
Survival analysis has been performed with survival and survminer R packages - https://cran.r-project.org/web/packages/survival/citation.html, doi:10.32614/CRAN.package.survminer.
Copy number alterations analysis
Cytoscan HD microarrays: 250 ng of gDNA from 15 breast cell lines (1 normal-like: HTERT-HME1 and 14 cancer cell lines: MDA-MB231, MDA-MB453, HCC1569, BT20, HCC1954, HCC38, MDA-MB361, ZR 75.1, MDA-MB157, MCF7, SKBR3, HCC202, HCC70, BT474) were characterized using Affimetrix/Thermo Cytoscan HD microarrays at the Genomics facility of Institut Curie to profile aneuploidy. To compare with the z-score by chromosome arm, we calculated the mean of Weighted Log2 combining probes by chromosome arms.
DIAMOND CNA: 1) Z-score calculation: prepossessed reads were uniquely mapped on hg38 genome using Bismark (version 0.23.1). As in Belic et al. 2015 66, only the reads with an alignment score > 15 were kept. Resulting reads from all amplicons (excluding #2 and #3) were merged and normalized number of reads per chromosome arm (excluding sexual chromosomes X and Y) per sample were calculated with R. Next, the amplifications/deletions score was computed using the following formula: with i = a given chromosome arm, n = a given sample and controls = a set of reference samples (white blood cells from 10 healthy reference samples for the cell lines, 63 healthy plasma from C1 as a reference for cancer and healthy plasma samples). Genome-wide z-scores were computed by summing the squared z-scores of all chromosome arms. 2) Z-score threshold identification: to identify altered versus normal z-scores, we performed 5-fold cross validation of simple cutoff classification model on the discovery cohort (NHealthy = 60, NCancer = 350) using the genome-wide z-score and calculated the threshold that maximize the sensitivity at 100% specificity.
2-step classification for sample labelling
First, we selected the threshold for the probability of the cancer prediction (ProbaCancer) on the discovery cohort maximizing the sensitivity for a 99% specificity, per ‘cancer-type’ model. We applied this threshold on the ProbaCancer computed with the validation models and reclassified samples which presented a z-score > 121, as cancer (ProbaCancer ≤ Threshold C1 AND GZ-score ≤ 121: prediction = Healthy; ProbaCancer > Threshold C1 OR GZ-score > 121: prediction = Cancer) see Tables S16-22.
Data Availability
Data have been deposited as methylation matrices (CG% or haplotypes%) on the Zenodo database with the following accession code: https://zenodo.org/uploads/12206227 and as compressed FASTQ files at the European Genome-Phenome Archive at https://ega-archive.org/ under the accession code EGAD50000000646. WGBS sequencing data were downloaded from publicly available database at https://zenodo.org/records/7779198 and from the National Center for Biotechnology Information (https://www.ncbi.nlm.nih.gov) under the accession number PRJNA494975. The code used to analyze the data is available on github: https://github.com/ProudhonLab.
Funding
The NGS facility was supported by ANR-10-EQPX-03 (Equipex) and ANR-10-INBS-09-08 (France Génomique Consortium) grants and by the Cancéropôle Île-de-France.
This research was supported by grants, of which C.P. was recipient, from: The European Research Council (ERC-StG EpiDetect),
The Ligue contre le cancer (RS17-75-75),
The prematuration program of the Centre National pour la Recherche Scientifique (CNRS), The SiRIC 2 Curie program (INCa-DGOS-Inserm_12554),
The DEEP Strive funding (LABEX DEEP 11-LBX0044).
CAA research was supported in part by the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).
Author Contributions
MM, MH and CP designed the study
MM, MH, AM, CH, MG, VD, CB, MS, CR and DG performed the experiments MM, MH, AM, KDS, MG, KVG, CAA and CP analyzed the data.
MM, AM, KDS, MG and KVG performed the statistical analyses.
CH, AR, FCB, JYP, CL, MK, CLT, IB, MHS, OL, LC contributed with identification of clinical samples.
MM, MH, AM, KDS, MG, CAA and CP wrote the manuscript.
All authors participated in revising the manuscript and approved this final version.
Competing Interests statement
CP, MM, MH, and CAA have an ongoing patent application relating to circulating tumor DNA analysis.
Supplementary materials
Acknowledgments
We thank the members of the C.P.’s laboratory for critical reading of the manuscript. We are grateful to D. Bourc’his and her team for hosting us during part of this study. We thank the members of the ICGex NGS platform of the Institut Curie, especially S. Lameiras, V. Raynal and S. Baulande for advice and the non-profit organization “La Vannetaise” for financial support.
Footnotes
In this new version, we have now included: – Analysis of plasma and tumor paired samples demonstrating very good correlation using methylation haplotype proportions (new Fig.S3) – Comparison of methylation profiles of hormone dependent and triple negative breast cancers to explain the differences of the M0 and M+ subgroups (new Fig.S4F-H) – The performances per cancer-type and stage from the all cancer model (new Fig.S4D-E and Fig.S8A) – Analysis from publicly available plasma WGBS data which shows that we can retrieve L1PA hypomethylation in cancer compared to healthy from genome wide profiles (new Fig.S5) – Analysis comparing an age-matched cohort and a non-aged-matched cohort extracted from C2 demonstrating identical classification performances in both cohorts (new Fig.S7H-J). – Survival analysis on the validation cohort demonstrating that higher hypomethylation is associated with shorter survival (new Fig.S8D-E) – Controls demonstrating no effect of the cfDNA extraction method employed (new Fig.S10) – Bisulfite and enzymatic conversion comparison (new Fig.S11) – Data for the screening of ovarian cancer mutations (new Table S6) – New data presenting a new integrated model and tests where we exclude cancer subgroups or types from the train set and which are well recognized as cancer by our models to highlight the universality of our approach (new Fig.4G and Fig.S8A-C).
References
- 1.↵
- 2.
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.↵
- 10.↵
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.
- 26.↵
- 27.↵
- 28.↵
- 29.
- 30.↵
- 31.↵
- 32.↵
- 33.
- 34.
- 35.
- 36.
- 37.
- 38.
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵