Effects of psoriasis and psoralen exposure on the somatic mutation landscape of the skin

Somatic mutations are hypothesised to play a role in many non-neoplastic diseases. These diseases may also shape the somatic mutation landscape of affected tissues after onset. We performed whole-exome sequencing of 1182 microbiopsies dissected from lesional and non-lesional epidermis from 111 patients with psoriasis, a chronic inflammatory disease of the skin, to search for evidence that somatic mutations in keratinocytes may influence the disease process and to characterise the effects of the disease on the mutation landscape of the epidermis. We show that psoriasis is associated with increased mutation burden of the cell-intrinsic signatures SBS1 and SBS5 but not of UV-light, which remains the dominant mutagen in psoriatic skin. Despite the hyperproliferation of keratinocytes that characterises psoriasis, lesional skin remains highly polyclonal, showing no evidence of spread of clones carrying potentially pathogenic mutations. We find that the selection forces operating in the epidermis remain mostly unchanged in psoriasis and the mutational landscape continues to be dominated by clones carrying mutations in genes recurrently mutated in normal squamous epithelia. There is evidence of positive selection in previously reported driver genes, NOTCH1, NOTCH2, TP53, FAT1 and PPM1D and we also identify four driver genes (GXYLT1, CHEK2, ZFP36L2 and EEF1A1), that have not been previously described in studies of normal skin but which we hypothesise are selected for in squamous epithelium irrespective of disease status. We describe the mutagenic effects of psoralens, a class of chemicals previously found in some sunscreens and which remain a part of a common photochemotherapy treatment for psoriasis (psoralens and UV-A, PUVA). Psoralens leave a distinct mutational signature in the genomes of exposed cells that is tightly linked with transcription, showing evidence of both transcription-coupled repair and transcription-coupled damage. These results suggest that somatic mutations in keratinocytes are unlikely to influence the pathogenesis of psoriasis and that while psoriasis has only modest effect on the mutation landscape of the skin, PUVA treatment has the potential to exert a unique and larger effects.

mutagen in psoriatic skin. Despite the hyperproliferation of keratinocytes that characterises 23 psoriasis, lesional skin remains highly polyclonal, showing no evidence of spread of clones 24 carrying potentially pathogenic mutations. We find that the selection forces operating in the 25 epidermis remain mostly unchanged in psoriasis and the mutational landscape continues to 26 be dominated by clones carrying mutations in genes recurrently mutated in normal 27 squamous epithelia. There is evidence of positive selection in previously reported driver 28 genes, NOTCH1, NOTCH2, TP53, FAT1 and PPM1D and we also identify four driver genes 29 (GXYLT1, CHEK2, ZFP36L2 and EEF1A1), that have not been previously described in 30 studies of normal skin but which we hypothesise are selected for in squamous epithelium 31 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) tissues, including blood 1 , oesophagus 2,3 , colon 4-7 , liver 4,8,9 , urothelium 10,11 , and more [12][13][14][15][16][17] . 48 These studies have described how diverse mutagenic processes create the substrate 49 for a ceaseless Darwinian competition between clones populating normal tissues. Clones 50 acquiring mutations that provide a competitive advantage over their neighbours will expand, 51 and may come to dominate. While there appears to be significant overlap between cancer 52 driver mutations and those that drive clonal expansions in normal tissues, mutant clone 53 expansions do not necessarily drive tumour formation. However, widespread replacement of 54 wildtype cells with mutant clones can have functional consequences for the tissue, 55 potentially contributing to common complex disease risk 18 or influencing the disease 56 progression or response to treatment 19,20 . 57 Exposure to external factors, such as drugs or environmental insults, may influence 58 somatic mutation landscapes and these exposures are sometimes associated with disease. 59 For example, azathioprine, which is used to treat multiple immune-related conditions, leaves 60 non-lesional -20%) ( CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 9, 2022. ; https://doi.org/10.1101/2022.07.04.22277086 doi: medRxiv preprint CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 9, 2022. The oligoclonal composition of the microbiopsies means their mutation counts are not 156 representative of the per-cell mutation counts in the skin. There is also a risk of counting the 157 same mutation event multiple times when a clone extends across multiple microbiopsies. We 158 performed Bayesian clustering of mutations based on their VAFs in different samples and 159 used the statistical pigeonhole principle to derive a phylogenetic relationship between the 160 clusters (Methods). Each tip of the phylogenetic trees represents a clone and we performed 161 all subsequent analyses at the level of clones rather than the level of microbiopsies. 162 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted July 9, 2022. To determine which mutational processes are active in psoriatic skin, we used a 164 Bayesian hierarchical Dirichlet process (Methods) to extract mutational signatures for each 165 mutation clone and compared these to the COSMIC reference signatures 31 (Figure 2a,  166   Supplementary Figures 2 and 3, Supplementary Table 3). Unsurprisingly, the most abundant 167 signature is SBS7b, which has been attributed to UV-exposure. This signature accounts for 168 80% of the mutations in the dataset. We also found the UV-related signature SBS7c, but it 169 only accounts for 0.14% of the mutations. In agreement with previous studies 16,22 , we 170 observed large variation in UV-associated mutation burden between cells 1-2mm apart in the 171 tissue. 172 The second most prevalent signature in the dataset is not listed in COSMIC (v3.2). It 173 is characterised by T>A, T>C and T>G mutations at TpA sites (Supplementary Table 5). 174 This is consistent with the known mutagenic effects of a psoralens 32,33 , a class of chemicals 175 that form a part of PUVA (psoralens and UV-A) phototreatment, which is used to treat 176 psoriasis and other diseases of the skin. The signature is present both in lesional and non-177 lesional skin and accounts for 10.7% of the mutations in the whole-exome dataset, and for 178 as much as 94% of the mutations in some individual clones ( processes. One is a mix of the clock-like SBS1 and SBS5 signatures, that are found 194 ubiquitously in normal tissues, including proliferative and post-mitotic tissues 17,34 . These 195 signatures tend to be correlated within a tissue and could not be separated by our model. 196 We also found a small number of samples exhibiting the mutational signatures of APOBEC 197 activation (SBS2 and SBS13), which have been occasionally found to be present in a range 198 of normal tissues 10,34,35 . While these signatures account for only 0.5% of the mutations in the 199 dataset as a whole, 22 clones had over 50 exonic mutations attributed SBS2/SBS13, 200 accounting for up to 28% of mutations in the exposed clones. 201

202
We used a linear mixed effects model to estimate the effects of age and disease 203 duration on the mutation burden after subtracting the mutations assigned to the psoralen 204 signature and controlling for the anatomical location of the sample (Methods and 205 Supplementary analysis 1). The mutation burden of the keratinocyte clones increased 206 linearly with age, with each clone accumulating 14.6 mutations per exome per year (14.6 207 (10.1-19.0, 95% CI, P=4.3 x 10 -9 , linear mixed-effects models, Supplementary analysis 1). 208 This is a much higher burden than that of other normal tissues, which range around 0.2-0.4 209 mutations per exome per year 1,2,5,8,10,34 . 210 Lacking information about the frequency or severity of psoriasis flare-ups, we next 211 tested if disease duration, which we have previously used as a proxy for inflammation 212 exposure in IBD 23 , is associated with mutation burden of keratinocyte clones (Methods). We 213 did not find a significant effect of disease duration on the total mutation burden (-0.60 214 mutations per year (-3.4-2.2, 95% CI, P=0.67, likelihood ratio test of linear mixed-effects 215 models)). However, the large variation in the burden of the UV-related SBS7b reduces 216 statistical power to detect a disease effect on the total mutation burden. In our previous work 217 on IBD-affected colonic mucosa, we found that the disease was associated with accelerated 218 mutagenesis by the cell-intrinsic signatures SBS1 and SBS5 23 . We fitted a linear mixed 219 effects model using only the mutation burden attributed to SBS1/5 and found that psoriasis 220 increases the mutation burden of these signatures by 0.16 mutations per exome per year of 221 disease duration (0.038-0.29, 95% CI, P=0.012 -Likelihood ratio of linear mixed effects 222 models, Supplementary analysis 1). We estimate the age effect of SBS1/5 to be 0.65 223 mutations per year (0.49-0.81, 95% CI, P=2.5x10 -12 , linear mixed effects model, 224 Supplementary analysis 1). Psoriasis is thus associated with ~25% increase in the rate of 225 these clock-like signatures in the skin. 226 227 228 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted July 9, 2022. ; https://doi.org/10.1101/2022.07.04.22277086 doi: medRxiv preprint CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted July 9, 2022. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted July 9, 2022. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted July 9, 2022. The intensity of the colours increases with the fraction of microbiopsies from each patient in 268 which the variant is called. 269 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted July 9, 2022. We found a strong effect of transcription on psoralen-related mutagenesis in the 301 WGS data, with 1.72 times more mutations occurring on the untranscribed strand than on 302 the transcribed (1.86, 1.71 and 1.57 for T>A, T>C and T>G mutations, respectively, Figure  303 3a). This is partially due to the effects of transcription coupled repair (TCR), which removes 304 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 9, 2022. ; https://doi.org/10.1101/2022.07.04.22277086 doi: medRxiv preprint mutations from the transcribed strand, but the asymmetry is further compounded by 305 transcription coupled damage (TCD) to the complementary non-transcribed strand ( Figure  306 3c, Supplementary analysis 2). Although both TCD and TCR occur, the impact of TCR is 307 about twice that of TCD, with the result that the mutation rate at TpA sites drops with 308 increasing gene expression (Figure 3d). In addition to its effect on mutagenesis, high burden of the psoralen signature is 316 associated with clonal expansions of keratinocytes in our dataset. Microbiopsies derived 317 from skin biopsies where at least one clone had >100 psoralen-related mutations have 318 higher VAFs than microbiopsies dissected from unexposed skin (P=2.9 x 10 -9 , Mann 319 Whitney U test, Figure 3e). There was also a greater level of clonal spread between different 320 microbiopsies, as indicated by the fraction of mutations shared between them. For each pair 321 of microbiopsies, we calculated the fraction of mutations shared between the pair and plotted 322 this as a function of the distance between the pair (Figure 3f). Microbiopsy pairs from 323 psoralen-exposed skin account for 14.4% of all pairs in the dataset but for 35.8% of pairs 324 that are more than 0.5mm apart and share more than 10% of their mutations (P=5.4x10 -10 , 325 Chi-squared test). This may be the result of psoralen exposure being correlated with age or 326 severity of psoriasis, but may also be a direct consequence of the chemical itself. Psoralens 327 are cytotoxic and may enable clonal expansions through the elimination of competitor 328 clones. Exposure may also result in a higher number of driver mutations and higher fitness. 329 A number of canonical hotspot mutations in known skin cancer genes are caused by T>A, 330 T>C or T>G mutations at TpA sites (Supplementary Table 6). There is evidence that UV-331 exposure affects the clonal dynamics in the skin and the spread of TP53 mutant clones in 332 particular 43 . We did not find evidence of stronger selection of mutations in any particular 333 gene among clones which had >100 mutations attributed to the psoralen signature 334 (Supplementary analysis 2). 335 Compared with a uniform mutational process, the psoralen-signature is 1.37 times 336 more likely to result in splicing mutations in genes reported to be recurrently mutated in 337 squamous cell carcinomas and normal skin. It would be especially likely to disrupt AT-AC 338 introns, which have 5'-AT and AC-3' boundaries, in contrast to the typical 5'-GT and AG-3' 339 splice sites. 340 341 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 9, 2022. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 9, 2022. ; https://doi.org/10.1101/2022.07.04.22277086 doi: medRxiv preprint the transcriptional start site of genes by expression level quintiles in sun-exposed skin in the 350 GTEx dataset. Mutation rate on the transcribed strand drops in the transcribed region of 351 genes, reflecting transcription coupled repair (TCR). In contrast, the mutation rate on the 352 untranscribed strand is increased in the transcribed region, compared with intergenic 353 regions, reflecting transcription coupled damage (TCD) of this strand. The effects become 354 more pronounced with increasing expression. d) The mutation rate in ten bins of ascending 355 gene expression. Higher expressed genes have lower mutation rates. e) The variant allele 356 fractions (VAFs) of microbiopsies dissected from psoralen-exposed and unexposed skin. f) 357 Pairwise comparison of microbiopsies. For any pair from the same skin biopsy, the plot 358 shows the fraction of mutations found in both microbiopsies as a function of the distance 359 between them (measured centre-to-centre). Psoralen exposed microbiopsies are enriched 360 among pairs that are distant but nevertheless share many mutations in common (grey 361 square). CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 9, 2022. We used the dNdScv software 45 to assess the ratio of non-synonymous mutations to 380 synonymous mutations after accounting for sequence context and regional differences in 381 mutation rate across the genome (Supplementary Table 7 We next carried out a pathway-level dN/dS analysis, searching for enrichment of 406 missense and truncating variants across 11 gene sets that were defined a priori because of 407 their relevance in either keratinocyte cancers or psoriasis pathology (Figure 4c, Methods). 408 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 9, 2022. ; https://doi.org/10.1101/2022.07.04.22277086 doi: medRxiv preprint We observed a strong enrichment of mutations in genes previously reported to be 409 recurrently mutated in normal skin or in squamous cell carcinomas. Genes reported to be 410 recurrently mutated in basal cell carcinomas showed a much weaker enrichment (after 411 excluding TP53, NOTCH1 and NOTCH2). No evidence was seen for enrichment of either 412 missense or nonsense mutations in any of the other pathways implicated in psoriasis 413 pathogenesis. Together with the gene-level analysis, this suggests that somatic mutations in 414 keratinocytes are unlikely to play a role in the pathogenesis of psoriasis. 415 Based on the VAFs of individual mutations and the volume of the microbiopsies in 416 which they were detected, we estimated for each patient the fraction of cells that carry a 417 mutation in each gene (Methods, Figure 4d). We tested if the fraction of mutant cells was 418 different between lesional and non-lesional skin using a Wilcoxon test for paired 419 measurements but found no differences that were significant after correction for multiple 420 testing. This further supports the hypothesis that the selection forces operating in psoriasis 421 are the same as those operating in the skin of people without psoriasis. We have used whole-exome sequencing of microbiopsies of epidermis derived from 111 442 patients with psoriasis vulgaris to study the effects of this chronic disease on the somatic 443 mutation landscape of the skin. Psoriasis is associated with modestly increased risk of 444 keratinocyte cancers 48,49 but our results support the view that this increase in risk may be 445 predominantly due to the effects of treatment rather than features of the disease per se. 446 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 9, 2022. ; https://doi.org/10.1101/2022.07.04.22277086 doi: medRxiv preprint Disease duration of psoriasis, our best proxy for inflammation exposure, is 447 associated with increased mutation burden of the clock-like signatures SBS1 and SBS5. We 448 estimate that psoriasis disease duration is associated with an additional 0.16 mutations per 449 exome per year, which is approximately a quarter of the age effect of these signatures. 450 While the P-value for this effect is modest, the increased burden of SBS1 and SBS5 in the 451 context of inflammation is in line with our previous finding of increased mutation burden of 452 SBS1 and SBS5 in colonic mucosa affected by inflammatory bowel disease 23 . However, the 453 increase in mutation burden associated with psoriasis appears to be smaller than in IBD, 454 where the disease duration effect is almost equal in size to the age effect 23  has been reported to be recurrently mutated in normal oesophagus 3 and selection of 482 mutations in CHEK2 seems unlikely to be a specific feature of psoriasis. Finally, EEF1A1 483 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 9, 2022. ; encodes the alpha subunit of the elongation factor-1 complex. Unlike GXYLT1, ZFP36L2 484 and CHEK2, it was included in an extended panel of genes tested for evidence of selection 485 in a subset of samples in a previous study 22 , but did not reach significance. EEF1A1 did 486 reach significance in our previous analysis of normal urothelium from bladder cancer 487 patients 10 , suggesting it too may be generally selected for in squamous epithelia. Mutations 488 across all nine genes that reached significance in the study appeared to be present at a 489 similar frequency across lesional and non-lesional skin from psoriasis patients. We have given a detailed description of a mutational signature that we believe to be 517 the result of psoralen exposure. While psoralens have been known to be mutagenic for 518 decades, we add considerable detail to our understanding of this mutagen in humans in vivo. 519 We show that the mutational signature has a strong transcriptional strand bias which results 520 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 9, 2022. ; https://doi.org/10.1101/2022.07.04.22277086 doi: medRxiv preprint both from transcription coupled repair and transcription coupled damage. Transcription 521 coupled damage has been most comprehensively described for T>C mutations in COSMIC 522 signature SBS16, which is found almost exclusively in the liver and may result from alcohol 523 consumption 53 . SBS16 is thought to result from the chemical modification of adenine 524 residues which likely occurs more frequently when the DNA is in unwound form during 525 transcription. Our data suggest that psoralen molecules similarly preferentially react with 526 thymine in unwound DNA. Psoralens are used in experimental systems to study 527 mutagenesis and the repair of interstrand cross-link lesions. It has been demonstrated that 528 the repair of psoralen-related interstrand cross links does not depend on the Fanconi 529 anaemia pathway but rather depends on the DNA glycosylase NEIL3, which facilitates the 530 removal of the lesion without the formation of a double-strand break 54,55 . In agreement with 531 this, we find no effect of psoralens on the burden of indels or structural variants, as might be 532 expected if the cross-links required double strand breaks to resolve. 533 One patient with extremely high burden of the psoralen signature also has history of 534 extensive PUVA-therapy, suggesting that PUVA therapy can result in a large number of 535 somatic mutations and may exert a much larger effect on the somatic mutation burden of the 536 skin than the disease itself. It is also associated with clonal expansions of clones in exposed 537 skin. This may occur through changes in the selection environment, although we were not 538 able to detect any gene-level differences on selection in the current analysis. However, a 539 limited number of rounds of PUVA treatment are generally considered safe and establishing 540 a dose-response curve for the PUVA process is an important direction of future work. 541 While psoralen exposure in the context of psoriasis is particularly likely to occur 542 during PUVA-therapy, other means of exposure include the use of sunscreen or tanning 543 lotions containing psoralens or even the consumption of furocoumarin-rich foods. It has been 544 hypothesised that orally ingested furocoumarins increase risk of skin cancers 56-59 , although 545 further validation and functional work is required to confirm the relationship. We propose that 546 testing for a relationship between furocoumarin consumption and the psoralen mutational 547 signature in sun exposed skin offers a mechanistic way to test this hypothesis. 548 549 In summary, our results suggest that chronic inflammation of the skin is associated with 550 increased mutation burden of cell-intrinsic signatures but has little effect on the clonal 551 competition within the epidermis. PUVA treatment, and likely psoralen exposure in general, 552 leaves a distinct mutational signature on the genomes of exposed cells and can result in a 553 very large number of somatic mutations without resulting in malignant transformation. In 554 recent years, our understanding of somatic evolution in normal tissues has been 555 transformed. We are now in a position to carry out comparative studies of non-malignant 556 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 9, 2022. Hooks, for histological processing and sectioning of the skin biopsies and to Deborah 564 Plowman and Sophie Leggett for practical assistance in sample and data management. We 565 also thank the German psoriasis patients who donated tissue samples for the study. This CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 9, 2022. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 9, 2022. The volume of the microbiopsies was determined by adding together the size estimates of 807 the cuts from the LCM software and multiplying this by the thickness of the sections (10 μ m). 808 The same histological features from serial sections were often cut into the same well to 809 increase the DNA-yield of the samples (a practice referred to as z-stacking). The surface 810 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 9, 2022. ; area of the samples was determined by measuring the width of the samples along the basal 811 membrane and multiplying this with the section thickness and the number of sections (z-812 stacks) separating the first and last sections dissected into the same well. All mutations passing these filters in any sample from a donor were next genotyped in all 874 samples from that donor. Mutations could thus be called in a microbiopsy if they failed the 875 criteria above assuming they had passed the filtering criteria in another microbiopsy from the 876 same individual (for example, a mutation in a clone that dominates in one microbiopsy and 877 extends into an adjacent microbiopsy but at lower levels could still be called in the latter 878 sample). We used the bam2R() function of the deepSNV package (v. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 9, 2022. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 9, 2022. excluded from this analysis either due to the inability of ASCAT to find an optimal solution or 950 if the goodness-of-fit for the optimal solution was <90%. We removed from the call list all 951 segments smaller than 1 Mb in size, segments on chromosome X, segments within the 952 extended MHC region (Mb 20-38 on chr6) and recurrent artifactual calls on Mb19 of chr14. 953 We further removed segments that were also called in the matched normal of the sample or 954 were called in microbiopsies from both lesional and non-lesional skin (which never showed a 955 clonal relationship in the SNV data) or were called in all samples except 0, 1 or 2 from the 956 same individual (these being likely germline in origin). Finally, we removed microbiopsies for 957 which over 10 structural variants were called and which also had purity of less than 0.4, as 958 we considered these likely problematic. 959 3.2 Mutation calling in whole-genomes 960 For the 16 microbiopsies that were whole-genome sequenced we used CaVEMan and 961 cgpPindel to call and filter small variants in the same way as described above. However, in 962 this case calling was performed against a matched normal sample chosen because it was 963 phylogenetically unrelated to the index sample. This was possible because samples were 964 whole-genome sequenced only after whole-exome sequencing and was preferable because 965 the low number of samples per patient (and the lower aggregate coverage) reduces the 966 accuracy of the binomial filtering of germline variants described above. 967 968 We used ascatNgs (internal version.4.5.0, https://github.com/cancerit/ascatNgs) 9 to estimate 969 the copy number and purity of the samples. Those estimates were used by CaVEMan, rather 970 than the copy number options being manually pre-set as described for the whole-exomes 971 above. 972 973 974 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted July 9, 2022. 975 Dirichlet processing 976 We implemented a nonparametric Bayesian hierarchical Dirichlet process (HDP) to cluster 977 autosomal single base substitutions with similar variant allele fractions (VAFs). The full 978 mathematical and implementation details of the model are described in a previous 979 publication 10 . Briefly, clones of cells are present across different microbiopsies and this 980 manifests as clusters of mutations that are found at similar VAFs. For every mutation, we 981

Identification of SNV clusters by hierarchical
have two vectors, one containing the number of reads reporting the alternate allele and 982 another containing the total sequencing depth at each microbiopsy. We assume that each 983 mutation can be assigned to exactly one cluster but the number of clusters is unknown. We 984 aim to estimate the number of clusters present across all the microbiopsies dissected from a 985 patient, the location of each cluster in the n-dimensional VAF hypercube and the allocation 986 of mutations to each cluster. 987 988 We model the data using an N-dimensional Dirichlet process (NDP) clustering model, where 989 the distribution of clone sizes and numbers follows a Dirichlet process. This has the 990 advantage that there is no need to pre-specify the number of clusters present. Instead, 991 mutations are moved around the clusters and in each sampling iteration, there is a defined 992 probability that a mutation will initiate a new cluster which was not present in previous 993 iterations. Clusters can also cease to exist if all member mutations are assigned to other 994 clusters. Thus, the number of clusters varies throughout the sampling chain. 995 996 We ran the Gibbs sampler for 25,000 iterations, dropping the first 15,000 as burn-in. We 997 used the ECR algorithm 11 , implemented in the R package label.switching, to resolve the 998 label-switching problem associated with mixture models. To avoid overly complex solutions, 999 we imposed an upper limit of 100 clusters per patient. We kept for downstream analysis only 1000 those clusters that were present at a minimum VAF of 0.05 in at least one microbiopsy and 1001 had a minimum of 10 unique mutations allocated to them. 1002 5. Inference of phylogenetic trees 1003 Each cluster of single-base-substitutions identified by the NDP algorithm represents a 1004 branch of the phylogenetic tree for that patient. We applied the statistical piegonhole 1005 principle to infer phylogenetic relationships between clusters. Given clusters A and B, if the 1006 combined mutant cell fraction (CF) of both is >100% (VAF > 0.5) within the same 1007 microdissections and B consistently shows a lower CF than A, then that is strong evidence 1008 that B is nested within A, that is mutation cluster B represents a sub-clone of clone A. If the 1009 combined mutant cell fraction is ≤ 100%, only weak evidence of nesting exists. If B is found 1010 at a higher VAF than A in some microdissections but at lower VAF in others, the clusters are 1011 interpreted as being independent clones without nesting. 1012 We treated each tip of the phylogenetic tree for each patient as a clone. The length of 1013 the branches from the root (germline) was used in the mutation burden calculations, as 1014 described below. 1015 1016 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted July 9, 2022. composition means the available reads are split between more clones. Any comparison of 1147 the mutation rate must take this technical variability into account. 1148 To adjust the mutation burden of each cluster for coverage and VAF, we estimated 1149 the sensitivity for calling a true mutation belonging to the cluster in each microbiopsy, S bi . 1150 The total sensitivity of the cluster, S c , is one minus the sensitivity of the cluster in each specifically developed for somatic mutation data. dN/dS ratios for each gene (or groups of