Abstract
Pathogens are major drivers of cancer globally. Quantifying the relationship between infection and carcinogenesis is therefore crucial for developing preventative programs. The foodborne trematode Opisthorchis viverrini is a primary cause of biliary cancer (cholangiocarcinoma) and infects 12 million people in Southeast Asia. In tumours from patients exposed to O. viverrini we found that the earliest chromosomal amplification of driver genes occurred at 30 years old on average, two to four decades before cancer diagnosis, and disproportionately contained FGFR2, TP53 and PTEN genes. We then fitted transmission models to parasitological data from Thailand spanning 27 years (n = 11,517) finding that, for people born between 1960–1989, first exposure occurred at two years old and by 30 years individuals had been cumulatively infected with a median of 72 worms. Trematodes are long-lived and our analysis quantifies the average lifespan of O. viverrini as 13 years (90% credible interval [CrI] 6–23 years) within human hosts. Finally, we calculate the lifetime probability of diagnosis with cholangiocarcinoma as 1.2% (90% CrI 1.0–1.4%) given prior exposure to the parasite. Overall, our study demonstrates how pathogen exposure drives patterns of cancer within human populations.
Introduction
Infectious organisms are a major contributor to the burden of human cancers globally (1), thus challenging the classic distinction between ‘communicable’ and ‘non-communicable’ diseases. Of the eleven pathogens recognised as direct human carcinogens; three are parasitic trematodes*,†. Infection with the foodborne liver flukes Opisthorchis viverrini and Clonorchis sinensis cause cancer of the bile duct (cholangiocarcinoma), whilst the waterborne blood fluke Schistosoma haematobium causes cancer of the bladder (squamous cell carcinoma). The aetiologies resulting from infection with these parasites are classified as neglected tropical diseases, owing to the historical neglect of research into both the causative agents and the afflicted populations (2). The pathology arising from infection with parasitic worms (also known as helminths) is typically chronic making it difficult to quantify the impact of any single helminth species over decades of exposure and in populations with multiple co–infections (3, 4).
The liver fluke O. viverrini has a foodborne transmission route and is acquired from eating raw or lightly fermented freshwater fish, which is a traditional component of the diet in regions of Southeast Asia (5). The fluke has a complex lifecycle, which involves asexual reproduction within freshwater snails and encysts as a mammalian–infective stage (metacercariae) in cyprinid fish. Despite public health programs to control the parasite, an estimated 12.4 million people were infected with O. viverrini across Thailand, Laos, Cambodia and Vietnam in 2018 (6). While the prevalence has shown gradual declines in Thailand due to periodic parasite control programs during the second half of the twentieth century (7), progress has recently slowed (8, 9). In Cambodia and Laos, by contrast, there is evidence for increased transmission over the past two decades (6, 10, 11). Liver fluke endemic countries have the highest incidence of cholangiocarcinoma globally and cases of hepatic and biliary cancers in these regions are disproportionately attributable to cholangiocarcinoma, rather than hepatocellular carcinoma (12, 13). If an early diagnosis is made then curative surgical resection is possible in a minority of cases (14). Patients generally present late, however, and prognosis is poor with a median survival time of <1 year following diagnosis and the standard of care; systemic chemotherapy with cisplatin plus gemcitabine (15, 16). The mechanisms through which flukes induce cholangiocarcinoma is a combination of mechanical damage, inflammation of the biliary epithelium, and the secretion of proteins; in particular the peptide granulin (17).
Given the poor prognosis for cholangiocarcinoma and the preventable nature of parasite infection, there is a strong motivation to understand the link between liver fluke exposure and carcinogensis in humans (3, 18). Prior to the onset of driver mutations, anthelmintic treatment and reducing parasite exposure should be prioritised as public health interventions, whereas following the onset of irreversible malignancies the priority for interventions shifts to ultrasound screening for liver pathology and early referral for surgery (19).
Quantifying the epidemiological relationship between liver fluke infection and the resulting biliary pathology is challenging, however, due to i) the dynamic nature of parasite transmission; ii) exposure being cumulative over time, as humans do not generate protective immunity to reinfection; iii) the long lifespan of adult worms within the human host, which is unknown for O. viverrini; and iv) the decades–long temporal lag between initial parasite infection and cases of cholangiocarcinoma (3, 20).
In this study we infer the timing of the earliest driver mutations for fluke–induced cholangiocarcinoma using computational methods which characterise the evolution of tumours from a single biopsy (21, 22). We then define the age of first exposure to liver fluke by fitting dynamic transmission models to parasitological survey data spanning several decades. Finally, we estimate the lifetime probability of diagnosis with cholangiocarcinoma given infection with O. viverrini. By combining evolutionary cancer genomics with epidemiological analysis, we obtain unique insights into the relationship between pathogen exposure and tumourigenesis, with implications for evidence– based disease control (18).
Results
Cholangiocarcinoma tumour genomes
We obtained paired tumour and normal whole-genome sequences from cholangio-carcinoma patients who were previously infected with liver fluke and were treated at a large public hospital in Northeast Thailand (23). We used a bioinformatics pipeline to call somatic single nucleotide variants (SNVs), copy number alterations, and inferred the clonal status of these mutations (see Methods). The age of the patients at surgery ranged from 37– 79 years (median 57 years), 50% of patients were female, and all were born prior to 1980 (23). After mapping reads, variant calling, and filtering (see Methods) we obtained 2,349–27,821 (median 10,360) somatic SNVs and 268–14,230 (median 1,382) somatic indels per–tumour. The overall ploidy (chromosomal copies) per tumour ranged from 1.2–3.7 (median 2.0).
Evolution of cholangiocarcinoma tumours
We estimated the timing of driver mutations in 43 genes implicated in cholan-giocarcinoma development (24–26) using evolutionary models which time the emergence of somatic single nucleotide variants (SNVs) occurring on amplified sections of chromosome (21, 22). We first assessed whether any tumours had been subject to whole–genome duplication events based on the correlated timings of chromosomal amplifications across the genome (21) and concluded that this had occurred in three tumours (see Fig. 1A). Across all tumours, the copy number was disproportionately higher in chromosomes 7 and 17. In 17/22 tumours with sufficient ploidy and tumour purity, we inferred the timing of focal chromosomal amplifications in potential driver genes for cholangiocarcinoma (see Methods). Overall the majority of amplifications (166/287; 58%) occurred later in ‘chronological time’ (0.75 or later), with a smaller proportion (46/287; 16%) occurring earlier in the lifespan of the tumour (before 0.5), as shown in Fig. 1B.
Evolution of cholangiocarcinoma tumours. A) Subclonal lineage reconstruction for cholangiocarcinoma tumour CCA_TH19 using MutationTimeR (21). The upper plot shows the variant allele frequency (VAF) for somatic SNVs which are coloured by clonal state; clonal (early) = green, clonal (late) = purple, clonal (unknown) = blue, subclonal = red. The middle plot shows the inferred copy number frequency (ploidy) by chromosome. The lower plot shows the inferred timings of copy number gains. For this tumour, the timings of copy number gains are correlated across chromosomes, indicating a whole-genome duplication event at a ‘mutation time’ of 0.6. B) Histogram showing the timing of 287 clonal and subclonal amplification events for 44 driver mutations in 17 cholangiocarcinoma tumours inferred by AmplificationTimeR (22). C) The earliest amplified clonal and subclonal driver mutations in cholangiocarcinoma (CCA) tumours (22) multiplied by the patient age at surgery (23) to give the inferred age for each amplification event. Each point represents the first amplified driver gene per patient, labelled with the gene name, and the points are coloured by tumour anatomical subtype (intrahepatic or perihilar). The overlaid box and whisker plots shows the median and interquartile range. D) Estimates of the chronological time of amplification by gene. Results are shown from a generalised linear model applied to estimates from AmplificationTimeR (see Methods). The black points give the posterior median, the thick blue line gives the 50% credible interval and the thin black line the 95% credible interval.
Timing of amplified driver mutations
We determined the earliest amplified driver gene for each tumour sample, and classified these as either ‘clonal’ (occurring on the most recent common cell lineage of the tumour) or ‘subclonal’ (a subsequent clonal expansion within the tumour which has not risen to fixation) (27, 28). To calculate the age at which these amplifications occurred in patients, we multiplied the chronological time estimate with the age at surgery (see Methods). The earliest clonal amplification of driver genes occurred at a median age of 30 years and with an interquartile range (IQR) of 20–43 years (Fig. 1C). The earliest subclonal amplifications occurred at a median age of 33 years (IQR 29–41 years). A variety of genes were the earliest amplified, with the most frequently occurring being the tumour suppressor PTEN in the clonal lineages (earliest in two tumours and amplified in 7/17 tumours), while the signalling gene NRAS was the most common in subclonal lineages (earliest in two tumours and amplified in 7/17 tumours) see Fig. 1B.
To determine whether certain genes are disproportionately likely to be amplified early or late within the lifespan of the tumour, we applied a generalised linear model (see Methods) to the chronological age estimates for each of the amplification events, while controlling for host sex, tumour anatomical subtype (extrahepatic or perihilar) and clonality (clonal or subclonal). We excluded nine amplifications with fewer than eight somatic SNVs, giving 278 amplification events from 17 tumours for our analysis. The model estimates per gene are shown by chronological age in Fig. 1D. Overall, the driver genes which were disproportionately found to be amplified early were the fibroblast growth factor receptor FGFR2, PTEN and the tumour suppressor TP53 (these genes were amplified in 7 tumours), while the tumour suppressor genes BAP1 and PBRM1 (both amplified in 5 tumours), and the receptor gene FGFR1 (amplified in 4 tumours) were preferentially amplified late. Our findings add support to previous studies which have noted the early clonal amplification of TP53, in particular, as a driver across a range of cancer types (21).
Liver fluke transmission
As fluke–induced pathology of the biliary tract is chronic, there is a need to consider prior exposure to the parasite at the individual or population level which can be estimated from historical parasitological surveys (3). We therefore collated epidemiological surveys from Northeast Thailand where diagnostic observations of O. viverrini infection intensity (worm burdens or faecal egg counts) were available by host age. Our analysis uses data from 4,056 individuals obtained from seven surveys conducted between 1980–1989, prior to the onset of large-scale control programs against liver fluke, and 7,448 individuals from five surveys conducted between 1994–2017, which are summarised in Table 1. We fitted a mechanistic parasite transmission model to these data which considers that the parasite burden at a given age is a result of flukes infecting the host with an age-variable transmission rate, referred to as the ‘force of infection’, and parasites exit the host due the spontaneous death rate of adult worms (20). Full model details are provided in Methods.
Cross–sectional surveys analysed in this study investigating the relationship between host age and Opisthorchis viverrini worm burden, faecal egg counts, or prevalence by faecal egg diagnostic (Fig. 2A). Surveys were conducted in Northeast (N.E.) Thailand either before the onset of liver fluke control programs (S1–S7; pre–intervention), or following a national control program in the early 1990s (S8–S12; post–intervention). Age range of the participants is shown in years. 1Sample size of human participants in survey. 2Autopsy cases were from N.E. Thailand, with the majority from Khon Kaen province. 3Hospital–based study where patients from N.E. Thailand were recruited and classified as originating from either rural or urban communities.
Exposure to liver fluke in human populations
The average age at which a person born between 1960–1989 was first infected with a single O. viverrini fluke is 22 months (90% prediction interval [PI] 19–25 months). The average age of first infection for a person born after 1990 rises to 8 years (90% PI 6.5–9.5 years), though a smaller proportion of the population ultimately becomes exposed during their lifetime; 33% post–intervention compared with 88% pre–intervention. We define exposure to O. viverrini as the cumulative number of adult flukes acquired by a person over time, which is given by the area under the force of infection curve (Equation 6). In endemic regions of Thailand in the 1980s by ages 10, 20, and 30 years the median number of O. viverrini flukes acquired was 12 (90% PI 8–16), 39 (90% PI 27– 53), and 72 (90% PI 48–103) respectively (Fig. 2C). Following public health interventions in the 1990s (7), we estimate that the O. viverrini force of infection dropped substantially, with an almost 40–fold reduction in the age–dependent transmission rate, resulting in the majority of individuals born after 1990 remaining uninfected by age 30 (median worm burden is zero). The distribution of O. viverrini parasites among human hosts shows high variation, as is characteristic for helminths (29), and for the top decile of the most heavily infected hosts the cumulative burden is ≥ 693 worms by 30 years of age in the pre–intervention period (90% PI 523–931) and ≥ 14 worms in the post–intervention period (90% PI 10–21). These findings add weight to the reported declines in parasite transmission in Thailand following control programs (6–8), though our analysis is the first to quantify this effect in terms of worm burden, the age of first exposure, and the cumulative proportion exposed.
Epidemiological data on the carcinogenic liver fluke Opisthorchis viverrini in Thailand. A) Worm burdens of O. viverrini by participant age, inferred from multiple surveys conducted between 1980–1989 prior to large–scale interventions. The data used for model fitting were counts of i) adult worms obtained from autopsy, or ii) worm expulsion, or iii) parasite eggs in faeces (data shown from S1–S6, see Table 1). The fitted black line shows the simulated median worm burden by host age, with the shaded area giving the 90% prediction interval (model fitted to data from S1–S7; see Table 1). The y-axis is log transformed. B) Proportion of exposed population with at least one parasite in the pre–intervention (1980–1989; S1–S7) and post–intervention periods (1994–2017; S8–S12 see Table 1). Solid lines are central estimates and shaded areas give the 90% credible interval (CrI). Where the solid lines cross the dashed line, this gives the age at which half of the exposed population become first infected, which is 22 months (90% CrI 19–25 months) of age in the pre–intervention period and 8 years of age (90% CrI 7–9.5 years) in the post–intervention period. C) Cumulative exposure to O. viverrini by age (Equation 6). The solid line shows the simulated median for the pre–intervention period, while the dot–dash lines show the cumulative worm burden for the upper 90% population percentile. Shaded areas give the 90% prediction interval. The y-axis is log transformed. D) Average lifespan of adult O. viverrini worms in human hosts. The posterior probability density is shown here for log(2)/σ; where σ is the spontaneous death rate of adult worms (Equation 3). Dashed line gives the posterior median (13 years). The filled area gives the 90% CrI (6–23 years).
Lifespan of the adult worm
The lifespan of helminths in human infections is considered to be in the order of one year to three decades, although this is rarely quantified (30, 31). We estimate the lifespan of O. viverrini within our force of infection model as 12.9 years (90% CrI 6.2–23.1 years), which is defined as the average time for half of adult stage parasites to die within the human host in the absence of anthelmintic treatment (Equation 3, see Methods). Our posterior distribution for parasite mortality is long tailed (Fig. 2) with the top five percent of adult O. viverrini lifespans ≥ 25.7 years. This finding is consistent with case reports of the related liver fluke C. sinensis from travellers, which is known to persist in human hosts for 26 years (32). We note that the posterior distribution of the worm mortality parameter is correlated with a force of infection parameter, and therefore we encourage use of the interval for O. viverrini adult worm lifespan (6–23 years), rather than the point estimate, in future analyses.
Latent and induction periods of cholangiocarcinoma
Bringing together the evidence presented thus far, we hypothesise that people born between 1960–1989 in Northeast Thailand first became infected with liver fluke by age two and driver mutations for biliary cancer occurred around age 30. We sought to validate our hypothesis using cancer registry data (33). The age distribution of 10,737 cases of cholangiocarcinoma diagnosed in Northeast Thailand between 1985–2009 are shown in Figure 3A. The median age of diagnosis was 59 years between 1985–1997 and 63 years between 1997–2009. Using a time-to-event analysis, which accounts for interval censored data, we estimated the age of driver mutation and the subsequent age of cholangiocarcinoma as sequential gamma distributions (see Methods). The induction period (time from initial parasite infection to driver mutation) was estimated as 28 years (90% CrI 16–39 years) and the latent period (time from driver mutation to cancer diagnosis) was estimated as 32 years (90% CrI 21–44 years). These distributions are shown in Fig. 3B. The incidence data therefore supports our estimate of the time to first amplified mutation at thirty years, given exposure by age two. The latent period estimate from registry data is longer than the 22 years indicated by our earlier analysis of the tumour genomes (time between first amplified driver mutation and biliary surgery), which likely reflects an older average age of diagnosis in the registry data compared to our, much smaller, sample of patients with sequenced biliary tumours (23, 33). Prior research on the induction and latent periods for fluke–induced cholangiocarcinoma in humans has been limited, though American veterans from the Vietnam War had an elevated incidence of biliary cancer five or more decades after potential exposure to O. viverrini between 1965– 1971 (34). While the association between military service in Vietnam and fluke–induced cholangiocarcinoma is contentious (35), the reported time between parasite exposure and cancer for United States veterans (induction plus latent period) is consistent with our findings.
Incidence of cholangiocarcinoma (CCA) at the population level. A) The age distribution of 10,737 cases of cholangiocarcinoma from Northeast Thailand between 1985–2009 (33). B) Time–to–event distributions for the induction period (from parasite exposure to driver mutation) and the latent period (from driver mutation to cancer diagnosis). The plots show the posterior probability distributions, where the dashed line gives the median and the shaded orange area shows the 90% credible interval (CrI). The induction period was estimated as a median of 28 years and the latent period as 32 years. C) Annual incidence of cholangiocarcinoma per 100,000 people by age for Northeast (N.E.) Thailand, where the carcinogenic liver fluke O. viverrini is endemic, and Malaysia which is non-endemic for O. viverrini. D) Cumulative lifetime probability of diagnosis with cholangiocarcinoma for the two populations, N.E. Thailand and Malaysia, at age a, conditional on survival to age a. The model output is shown from a survival analysis (Equation 17, see Methods), where the solid black lines give the posterior median and the shaded areas the 90% CrI. Probability given as a percentage (%).
Lifetime probability of cholangiocarcinoma
We estimate the lifetime probability of acquiring cholangiocarcinoma by fitting a survival model (see Methods) to age-incidence cancer registry data from Northeast Thailand between 1998–2009 (33) and Malaysia between 2007–2009 (36), which neighbours Thailand but is non-endemic for liver fluke (37). The underlying data are shown in Fig. 3C as an annual population incidence per 100,000 population and the survival model predictions are shown in Fig. 3D. Our results show that by 75 years of age the cumulative probability of diagnosis with cholangiocarcinoma, conditional on the person surviving to that age, is 0.046% in Malaysia (90% CrI 0.039–0.054%) and 0.99% in Northeast Thailand (90% CrI 0.85–1.1%), which is twenty–fold higher. Under the assumptions that 90% of excess cancer cases in Northeast Thailand are attributable to infection with liver fluke (38) and the total proportion of the population exposed to the parasite between 1960–1989 was 88% (estimated above, see Fig. 2B.), we calculate the probability of developing cholangiocarcinoma, given prior infection with O. viverrini, as 1.0% (90% CrI 0.87– 1.2%) by 75 years of age and 1.2% (90% CrI 1.0–1.4%) by 85 years. While a previous report suggested a higher lifetime probability of 5% (39), the data and method used to produce this estimate are unclear. Calculating the lifetime probability of cancer given infection depends on the population exposed to O. viverrini. Prevalence surveys taken at a single moment in time will underestimate this proportion (38, 40), and hence overestimate the relative probability of cancer among infected people. The probability of developing cholangiocarcinoma covaries with the infection intensity of liver fluke (41). Given improved longitudinal data, the lifetime probability of biliary cancer should be stratified by the cumulative worm burden in future analyses.
Discussion
Our study uncovers many crucial, yet previously unknown, epidemiological processes on the pathway from parasite infection to carcinogenesis in humans. The aim of such research is to inform control programs to reduce both transmission of liver fluke and subsequent cases of cancer (18, 19). Given the findings presented here, that the age of first fluke infection is prior to ten years of age and the first amplified driver mutations for biliary cancer are on average at 30 years of age, we recommend that interventions should be targeted to younger individuals in at–risk regions, such as school–based deworming with the anthelmintic praziquantel (42). Relatively few studies have investigated the impact of anthelmintic treatment on preventing or reversing biliary damage in humans (43–45), though the current evidence suggests a beneficial effect of deworming even in the context of repeated reinfection. The long temporal lag of four to seven decades between first parasite exposure and diagnosis with cholangiocarcinoma (see Fig. 2B and Fig. 3) means that changes to O. viverrini transmission intensity will take many years to influence the incidence of cholangiocarcinoma at the population level (19). Future randomised clinical trials investigating the impact of deworming with praziquantel on the risk of biliary cancer could be considered unethical given the need to withhold or restrict anthelmintic treatment for controls (18). Dynamic simulations have a useful role to play, therefore, in estimating the magnitude of interventions over long time periods (46). Our results provide a platform for biomarker discovery by highlighting early driver genes and we argue that screening for biliary malignancies should start from thirty years of age. Furthermore, the long latent period provides a window for early therapeutic interventions (47).
While there are relatively few studies investigating the epidemiological relationship between infection and carcinogenesis, when compared against other pathogen–driven cancers our estimates of the induction and latent periods for fluke–induced cholangiocarcinoma appear to be longer. For human papillomavirus (HPV), the time from infection to high-grade cervical intrapithelial neoplasia (CIN2/3) has been estimated from registry data as three years, and the time from CIN2/3 to cancer 23.5 years (48). The time from infection with hepatitis B to hepatocellular carcinoma is reported as 25–30 years in a review (49), though we are unaware of any supporting evidence. Mechanisms for a potentially longer tumorigenesis from O. viverrini infection, compared with other carcinogenic pathogens, have not been investigated. We note that the impact of fluke infection on biliary pathology is dose–dependent (41), and so depends on the distribution of parasites within an endemic community (40). There may also be evolutionary reasons for a slower accumulation of pathology from helminths, given their long lifespans (30, 50).
As the data used here are primarily from Northeast Thailand our estimates of epidemiological parameters are most applicable to this region. Large–scale ultrasound screening programs for liver disease in Thailand have successfully diagnosed thousands of cholangiocarcinoma and pre–cancerous cases since 2015 (51). However, the enhanced screening will likely lead to short–term increases in the reported incidence of cholangiocarcinoma, thus further complicating our understanding of the relationship between parasite infection and cancer (19). Extending our analyses to Laos and Cambodia, where the burden of parasitic disease is greater (6, 10) and the capacity for healthcare systems to diagnose and treat biliary cancer more limited (52), remains a priority for future research.
Methods and Materials
Cholangiocarcinoma whole–genome sequences
We accessed paired tumour–normal whole–genome sequences from 22 individuals in Northeast Thailand with previous exposure to liver fluke infection. Tumour tissue was obtained from patients during surgical resection of the biliary tract at Srinagarind Hospital in Khon Kaen, Thailand (23). Normal somatic genomes were obtained from patient blood samples. Tumours are classified according to their anatomical location on the biliary tree; namely intrahepatic cholangiocarcinoma (within the hepatic ducts), perihilar cholangiocarcinoma (between second order bile ducts and the cystic duct insertion) or distal cholangiocarcinoma (below the cystic duct). Here our samples consisted of eight perihilar and 14 intrahepatic tumours. The age at surgery ranged from 37–79 years (median 57 years) and 11/22 (50%) of patients were female.
Cancer Genomics
Starting with fastq files, we trimmed adaptor sequences from 150bp illumina paired–end reads using TrimGalore v.0.4.4 (53) and mapped the reads against the human reference (GRCh37) using bwa mem v.0.7.17 with default parameters (54). The resulting BAM files were then sorted and indexed with samtools v.1.9. Duplicates were marked and removed using GATK v.4.1.4.1 (55) and base quality scores re-calibrated for tumour sequences with ICGC PCAWG consensus vcf files as known variants for single nucleotide variants (SNVs) and indels (56). Examining the depth of mapped reads, we took the output from samtools depth -a (including zeroes) and binned the mean depth within 1Mb segments (Fig. 1A). Overall we found that the mean sequencing depth averaged across the whole genomes ranged from 35x–73x per–sample (median 57x) for normal genomes and 45x–72x per–sample (median 54x) for tumour genomes.
We called somatic SNVs and indels using GATK Mutect2 with a panel of normals provided by the Broad Institute (57), and with additional filters to remove secondary and supplementary reads. Prior to filtering, the number of somatic SNVs ranged from 109,339–292,886 per sample (median 131,057). We calculated the fraction of reads in the normal with tumour contamination using the GATK tool CalculateContamination in combination with 4.7 million common germline alleles (MAF 0.01–0.20) derived from diverse Asian populations in Singapore (58). This revealed that the contamination in normal samples was low with <0.7% of reads coming from cross–sample contamination. Using the data on contamination, we filtered the Mutect2 variants calls, leaving 2,349–27,821 (median 10,360) somatic SNVs and 268–14,230 (median 1,382) somatic indels per–tumour. The number of somatic variants called in our samples are consistent with those observed for other biliary cancers (59).
Subclonal reconstruction
We estimated tumour copy number using the Battenberg algorithm (27), with reference data from the 1000 Genomes Project. The estimated fraction of tumour cells (rather than normal tissue) in our cholangiocarcinoma genomes, also known as the tumour purity or cellularity, varied substantially between samples (range 10–90%; median 60%). The overall ploidy per tumour ranged from 1.2–3.7 (median 2.0). We then phased somatic variants and assigned them to subclonal lineages (28) using dpclust3p and dpclust (60), implemented in R v.4.3.2. The number of clonal and subclonal lineages, which variants were assigned to, varied from 2–5 per tumour (median 3).
Timing of driver mutations
We applied algorithms to estimate the chronological timing of driver mutations in amplified regions; MutationTimeR (21) and AmplificationTimeR (22), which were implemented in R v.4.3.3. The MutationTimeR algorithm uses a panel of 371 known driver mutations identified by the PCAWG consortium (59). We used the temporal correlation of copy number gains from MutationTimeR to identify if tumours had undergone whole–genome duplication (21). For the AmplificationTimeR analysis, we focused on timing the amplification of 43 driver mutations which have been identified as important in early–stage cholangiocarcinoma (25, 26) or previously detected in these tumours (23). The estimates for the chronological time of amplifications were calibrated using C>T mutations at CpG sites, which have been established to have clock–like properties (61). The age in years at which driver gene amplifications occurred was calculated as the product of the bootstrapped chronological time estimates from AmplificationTimeR (22), which fall in the interval [0, 1], and the patient’s age at surgery (23).
Earliest and latest amplified genes
We used a generalised linear model to determine which genes were amplified disproportionately early or late within the lifespan of the tumour. We modelled the chronological time estimates for gene g (yt,g), estimated by AmplificationTimeR (22) with a minimum of 10 mutations, which fall in the interval [0,1] using a beta distribution parameterised by a mean µg and precision κ
where the mean is a transformed linear function of a gene– specific intercept (αg) plus covariates (x) and slopes (β). The three binary explanatory variables are the sex of the patient, whether the tumour is intrahepatic or perihilar, and whether the amplification is clonal or subclonal
The model was fitted in a Bayesian framework using the stan language v.2.34.1 (62) implemented with cmdstanr v.0.7.1 (63) in R v.4.3.3; see details on parameter estimation below.
Liver fluke surveys
We obtained data from epidemiological surveys on the liver fluke O. viverrini in Thailand between 1980–2017. The parasitological observations in these surveys consisted of i) adult worms obtained from liver dissection at autopsy, ii) adult worms recovered through expulsion following anthelminthic treatment, or iii) parasite egg counts obtained by faecal examination, see Table 1. We contacted the study authors to obtain parasitological data by host age. Where these were unavailable, we simulated individual-level data from summary tables which contained the sample size, mean, and variance for parasitological observations (worm burdens or faecal egg counts) aggregated by age group using a Pearson type I distribution.
Parasite transmission model
We developed a mechanistic model that incorporates key aspects of parasite ecology (20, 30) fitted to individual–level data on adult worm burdens or faecal egg counts (Table 1). Our process model characterises the mean worm burden at the population level (M) by host years of age (a) as an immigration–death process
where σ gives the spontaneous death rate of adult worms in the absence of anthelmintic treatment. The expected lifespan of adult O. viverrini parasites is therefore given as the time taken for half of adult worms to die in the absence of anthelmintic treatment; log(2)/σ. As there is evidence for age– dependent reinfection rates (73), we model the O. viverrini force of infection as a function of host age
The dynamic model (Equations 3 and 4) has the following analytical solution for worm burden by age a,
The cumulative parasite exposure (δ) by age a is given by the definite integral of the force of infection (Equation 4)
Observation model for survey data
The true number of O. viverrini adult worms per individual i of age group a (xi,a) follows the negative binomial distribution (NB), which takes the form of a gamma–Poisson mixture model parameterised with a mean worm burden M (a) (Equations 3 and 5) and age– dependent dispersion ka (40). Values of ka are estimated for each age–group and are themselves normally distributed with an overall mean of µk and a standard deviation of σk. During autopsy surveys, adult O. viverrini were carefully removed from cross–sections of liver and worm recovery is likely close to 100%, therefore we consider that the true worm count for each individual is equal to the recovered worms in autopsy studies (xi,a ≡ wi,a | r = 1),
In surveys where O. viverrini flukes were obtained by expulsion, participants were treated with the anthelmintic praziquantel followed by a saline purgative, which is known to result in imperfect subsequent recovery of from faeces (74). Therefore, we allow the true worm burden to be greater than, or equal to, the observed number of worms for individuals in expulsion studies (xi,a ≥ wi,a) and the probability of observing wi worms given a true count of xi is a binomial sampling process with probability of worm recovery r = 0.44 (40),
where Binom refers to a standard binomial probability distribution. In surveys where the outcome variables are eggs per gram of stool (y), if at least one egg is observed for individual i of age group a (yi,a ≥ 1), we relate this to the expected egg count for that individual using a negative binomial error distribution, where the mean is given as a density–dependent function of the true worm burden; π(x) = (Λx)γ and the dispersion is given by parameter h, which has previously been estimated for O. viverrini as 0.4 (40),
Where zero eggs are observed (yi,a = 0), we consider the individual diagnostic sensitivity as a saturating function of the worm burden, se(x) = x/(x + b), where the parameter b has been previously estimated for O. viverrini as 1.7 (40),
The population level sensitivity (S) for faecal egg diagnostics is a function of the worm burden distribution at the population level (40),
where p(M, k) indicates the true prevalence and is given by
Given an assumed diagnostic specificity of one, the observed prevalence p′ for for age group a is therefore related to the true prevalence with the following expression,
where sp gives the faecal egg diagnostic specificity, which is assumed here to be 1. For surveys where only the prevalence is given by faecal egg diagnostic, we represent this as individual– level positive or negative outcomes; zi,a ∈ {0, 1}. The probability for the binary diagnostic observations is therefore given by a Bernoulli distribution
Epidemiological parameter estimation
Model fitting was performed in a Bayesian framework using the stan language v.2.34.1 (62) implemented with cmdstanr v.0.7.1 (63) in R v.4.3.3. Parameters were assigned weakly or moderately informative prior distributions based on the results from a previous analysis (40) for the pre–intervention data (S1–S7, see Table 1). For the post–intervention analysis (S8–S12), several model parameter values were taken directly from the pre–intervention analysis posterior distributions; including the parameters relating worm burdens to egg counts (Λ, γ) and the worm mortality rate (σ), which was taken as the highest percentile of the posterior estimate (σ = 0.116), corresponding to a mean worm lifespan of 6 years, to account for higher parasite mortality resulting from periodic anthelmintic treatment. Each model was run with four parallel chains with a burn-in of 1700 iterations per chain and a total of 1000 iterations. The Gelman–Rubin diagnostic and effective sample size >500 were used to diagnose successful Markov chain convergence. Results are presented as credible intervals (CrI) of parameter posterior distribution or prediction intervals (PI) from simulations.
Induction and latent period
We performed a time–to–event analysis to validate the induction and latent periods estimated by the earlier evolutionary cancer analysis using population– level data. We obtained the age distribution of 10,737 cases of cholangiocarcinoma diagnosed at Srinagarind Hospital between 1985–2009 (33) which were grouped into age classes. We consider that for each cancer case at age a there is an unknown age at which a driver mutation was obtained m, where m < a. Values of m are drawn from a gamma distribution with mean µmut and shape parameter αmut. Following the mutation at age m there is a latent period before cancer diagnosis at age a, this latent period is also gamma distributed with mean µcca and shape αcca. The probability of developing cholangiocarcinoma at age a, ϕa is therefore
Cholangiocarcinoma cases are reported by age group with a lower bound of l and an upper bound of u in years. The likelihood ℒ for each observation is therefore optimised on the interval probability, given by
Probability of acquiring cholangiocarcinoma
We adopt a binomial regression framework with the probability of contracting cholangiocarcinoma at age a in years given by
where α is an age dependent intercept and β is a coefficient multiplied by a binary variable xf 0, 1 indicating whether the cases of cholangiocarcinoma come from a region endemic for liver fluke (xf = 1 for Thailand and xf = 0 for Malaysia). Values of α are drawn from a multivariate normal distribution with a covariance matrix which accounts for temporal autocorrelation between years.
Counts of cholangiocarcinoma cases c are reported by age group with a lower bound l and an upper bound of u in years and are therefore interval censored. Cases are drawn from an underlying population of size N. We can therefore express a binomial likelihood function for each interval,
where the interval probability of cancer
is related to the yearly probability
Data on cholangiocarcinoma cases in Thailand come from the cancer registry at Srinagarind Hospital in Khon Kaen (33). For this analysis we used counts of cholangiocarcinoma cases between 1998–2009 along with corresponding demographic data (N) for Khon Kaen in the same period (75). Cancer case and demographic data for Malaysia between 2007–2009 were obtained from the World Health Organization Mortality Database (36), using version 10 ICD codes C22 (malignant neoplasm of liver and intrahepatic bile ducts) and C24 (malignant neoplasm of other and unspecified parts of biliary tract). We assumed that half of these biliary cancers were attributable to cholangiocarcinoma (17) and, given the very low survival rate, that mortality is a valid approximation for incidence (15, 16).
Data Availability
Code and publicly available data to reproduce the analysis are available at https://github.com/tc13/ov-cca-models. Access to sequence data on biliary cancer genomes is controlled by the International Cancer Genome Consortium Data Access Compliance Office (ICGC DACO).
Ethical Statement
As an analysis of previously published and anonymous human data, this study met the criteria for exemption from ethical review at the Universities of Oxford and Glasgow.
Data and Code Accessibility
This study uses previously published datasets. Code to reproduce the analysis is available at https://github.com/tc13/ov-cca-models/s.
Authors’ Contributions
T.C. conceptualised the study, T.C. developed methodology, T.C. analysed epidemiological data, T.C and analysed cancer genomic data, P.S provided resources, C.B. provided supervision for cancer genomics, T.D.H. provided supervision for epidemiological modelling, T.C. wrote the manuscript.
Competing Interests
C.B. received honoraria as speaker (AstraZeneca, Incyte) and consultant (Incyte, Servier, Boehringer Ingelheim, AstraZeneca, Jazz Therapeutics, Tahio), received research funds (Avacta, Medannex, Servier) and her spouse is an employee of AstraZeneca. The other authors declare no competing interests.
Acknowledgements
We thank Dr. Stefan Dentro and Prof. Moritz Gerstung for providing additional code relating to the PCAWG consortium analysis to calculate mutation times of 2,778 tumour genomes. We appreciate the comments from Dr. Arporn Wangwiwatsin and Prof. Watcharin Loilome on a poster relating to this manuscript. We further thank Dr. Jessica Clark and Dr. Paul Johnson for discussions on the survival analysis. Dr. Claudio Nunes-Alves also provided valuable feedback on the manuscript.
Access to whole-genome paired cancer and normal sequence data was approved by the International Cancer Genome Consortium Data Access Compliance Office (reference: DACO-63). We acknowledge the important role of Prof. Apinya Jusakul, Prof. Patrick Tan and colleagues for originally generating this sequence data.
Access to the SG10K_Pilot data was approved by the National Precision Medicine Data Access Committee in Singapore (Application no. SG10KP00043). We thank the ‘SG10K_Pilot Investigators’ for providing the ‘SG10K_Pilot data’ (EGAD00001005337). The data from the ‘SG10K_Pilot Study’ reported here were obtained from EGA. This manuscript was not prepared in collaboration with the ‘SG10K_Pilot Study’ and does not necessarily reflect the opinions or views of the ‘SG10K_Pilot Study’.
We used data from the World Health Organization (WHO) Mortality Database for biliary cancer incidence. The analyses, interpretations and conclusions shown in this article are attributable solely to the authors and not to WHO, which is responsible only for the provision of the original information.
This research was funded in whole, or in part, by the Wellcome Trust [Grant number 215919/B/19/Z]. For the purpose of open access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.
C.B. acknowledges that this study was performed under the collaborative umbrella of the European Network for the Study of Cholangiocarcinoma (EURO-CHOLANGIO-NET) supported by COST action (CA) 18122 and the Precision-BTC network supported by CA22125. F.V. acknowledges funding from a short term grant from EURO-CHOLANGIO-NET CA18122.
P.S. acknowledges funding from the National Research Council of Thailand as part of the Fluke–Free Thailand program.
Footnotes
Minor textual changes have been made to the abstract, introduction and discussion sections, plus the acknowledgements.
↵* https://monographs.iarc.fr/wp-content/uploads/2018/06/mono61.pdf
↵† https://monographs.iarc.fr/wp-content/uploads/2018/06/mono100B.pdf