Medically relevant tandem repeats in nanopore sequencing of control cohorts

Research and diagnostics for medically relevant tandem repeats and repeat expansions are hampered by the lack of population-scale databases. We attempt to fill this gap using our pathSTR web tool, which leverages long-read sequencing of large cohorts to determine repeat length and sequence composition in the general population. The current version includes 878 individuals of the 1000 Genomes Project cohort sequenced on the Oxford Nanopore Technologies PromethION. A comprehensive set of medically relevant tandem repeats was genotyped using STRdust to determine the tandem repeat length and sequence composition. PathSTR provides rich visualizations of this dataset, as well as the option to upload one's own data for comparison alongside the control cohort. We demonstrate the use of this application with data from targeted nanopore sequencing of a patient with Myotonic Dystrophy type 1. This resource will empower the genetics community to get a more complete overview of normal variation in tandem repeat length and sequence composition, and enable a better assessment of the pathogenic impact of tandem repeats observed in patients. PathSTR is available at https://pathstr.bioinf.be.


Introduction
Tandem repeats are defined as head-to-tail direct repetitions of a DNA motif, which can be repeated exactly, with motif interruptions, or with an entirely different sequence composition for some alleles. Early methods involved (repeat-primed) PCR followed by Sanger sequencing or fragment length analysis and Southern blotting, which are low throughput, labor-intensive, and do not fully describe all repeat properties. Short-read sequencing technologies have difficulty in correctly determining the size of tandem repeats, especially when the repeat length exceeds the read length, but some specialized methods have enabled population-wide tandem repeat genotyping (Dashnow et al., 2021; Dolzhenko et al., 2019; Jam et al., 2023), allowing imputation and the identification of tandem repeats relevant for traits and diseases (Manigbas et al., 2024). Increased resolution, however, is offered by long-read sequencing technologies such as nanopore and PacBio sequencing, which enable direct observation of the repeat's length, sequence composition, and DNA methylation. The field has yet to fully mature, and multiple genotyping, benchmarking, and comparison tools have been developed recently (Dolzhenko et al., 2023; English et al., 2023; Jam et al., 2024).
In recent years, these technologies have led to multiple novel discoveries of repeats associated with human disease (Cortese et al., 2023; Rafehi et al., 2023; Tan et al., 2023). To date, 68 repeats are associated with human diseases and summarized in STRchive (Dashnow, 2023/2023), although not all of these are firmly established as pathogenic. This is the set of repeats that we consider 'medically relevant' in the remainder of this manuscript. However, we anticipate this is only the tip of the iceberg of pathogenic tandem repeats. Both the repeat length and its composition are crucial determinants of the pathogenic potential of a specific repeat allele. A database of tandem repeat genotypes helps to distinguish an expanded allele with pathogenic potential from a common population polymorphism. Recently, long-read sequencing technologies have matured sufficiently to apply to population-scale sequencing projects (Beyter et al., 2021; De Coster et al., 2021; Noyvert et al., 2023). The genomics community has a long-standing tradition of making data freely available, which greatly benefits the interpretation of variants identified in patients, especially in the context of rare diseases. In this work, we describe pathSTR, a web app for the visualization of repeat length and sequence composition of medically relevant tandem repeats, also equipped with options to compare genotypes from other individuals (e.g. patients of interest) with the control cohort. At the time of writing, the database consists of genotypes of 878 individuals from the 1000 Genomes Project (1000 Genomes Project Consortium et al., 2015), sequenced on the Oxford Nanopore Technologies PromethION (Noyvert et al., 2023). The repeats are genotyped using STRdust, an efficient tandem repeat variant caller for long-read sequencing data. We provide this dataset to the genomics community for the improved interpretation of tandem repeat alleles.

PathSTR visualization of repeat length and sequence composition
STRdust was used for genotyping the medically relevant tandem repeats from CRAM files over FTP using the curl feature of htslib (Bonfield et al., 2021). While an in-depth benchmark is outside of the scope of this work, we compared STRdust with TRGT (Dolzhenko et al., 2023) and LongTR (Jam et al., 2024) using the HG002 TR benchmark (English et al., 2023), demonstrating excellent concordance for tandem repeat lengths for the repeats of interest (Supplementary Fig. S1).
The pathSTR web app (https://pathstr.bioinf.be) shows the variation in tandem repeat length (Fig. 1) and sequence composition (Fig. 2) across a large cohort of control individuals. Length plots, showing either the total length of the repeat or the difference with the reference genome, can be changed to show the pathogenic length cut-off (as obtained from STRchive), perform a log-transformation of the repeat length, show a density plot, and split the individuals per 1000 Genomes super population and/or sex. The composition is visualized in one of three ways: 'raw', showing the frequency of each motif; 'collapsed', grouping samples with similar motif frequencies; or 'sequence', showing the per-individual sequence of frequently observed motifs across the repeat length. To maximize sensitivity, including at limited sequencing depth, the repeat genotyping required only two reads to support an allele. For each individual genotype, the web app provides detailed information for further evaluation, as well as the number of supporting reads and an IGV visualization of the alignment.

PathSTR to evaluate pathogenic repeats
STRchive includes some tandem repeats for which the evidence for an association between the repeat expansion and a disease is conflicting. One of these is a TTC repeat expansion in DMD linked to Duchenne Muscular Dystrophy, where it was suggested that >60 repeat units is pathogenic (Kekou et al., 2016). Investigation of this repeat using pathSTR (Fig. 3A) clearly supports the notion that the frequency of expanded alleles is too high to be causally linked to an early-onset condition, suggesting it is not a sufficient factor to explain the disease in the family in which the expansion was described.
PathSTR enables users to upload tandem repeat genotypes generated using STRdust from their own datasets, e.g. patients of interest, and will display those next to the control cohort. Fig. 3B shows an example of comparing Cas9-targeted resequencing data (see Methods) (Gilpatrick et al., 2020) for a patient with Myotonic Dystrophy type 1 (DM1) with a DMPK expansion. PathSTR shows that the DMPK repeat length in this patient is indeed above the pathogenic line and is a length outlier compared to the rest of the cohort (z-score=9.5).
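The outlier statement above rests on a z-score, which can be illustrated with a minimal sketch. This is not the pathSTR implementation: the function name, the use of the sample standard deviation, and the example lengths are all assumptions made for illustration.

```python
from statistics import mean, stdev


def length_z_score(sample_length: float, cohort_lengths: list[float]) -> float:
    """Return how many standard deviations a repeat length lies from the cohort mean.

    Assumption: the sample standard deviation is used; pathSTR may compute
    the spread differently.
    """
    sigma = stdev(cohort_lengths)
    if sigma == 0:
        raise ValueError("cohort has no length variation; z-score is undefined")
    return (sample_length - mean(cohort_lengths)) / sigma


# Hypothetical repeat lengths (in repeat units) for a small control cohort.
control_lengths = [10, 12, 14, 16, 18]
# Hypothetical uploaded allele, far above the control distribution.
patient_z = length_z_score(120, control_lengths)
```

A large positive value, as in the hypothetical example, marks the uploaded allele as much longer than what is seen in the controls. It complements the pathogenic cut-off and does not replace it.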

Figure 3: A: Repeat length in the DMD gene shows that many individuals in the general population have lengths above the proposed pathogenic length (red dotted line), suggesting it is not pathogenic. B: Length visualization of the DMPK repeat, showing the uploaded data of the DM1 patient obtained using Cas9-enrichment, with the pathogenic cut-off indicated with a red dotted line.

Discussion
We present the data of medically relevant tandem repeats generated with publicly available long-read sequencing data from a control population, in a highly informative web application with rich visualizations of tandem repeat length and sequence composition. We prioritized sensitivity to detect long alleles for the pathSTR database, as the samples of the 1000 Genomes Project used in this work were not uniformly sequenced to a high depth. We have demonstrated that STRdust is on par with other recently developed tandem repeat genotypers. However, an independent and more rigorous evaluation in this developing field is required. A first step towards this goal was already taken by compiling a tandem repeat catalog, assessing the tandem repeat variation for the HG002 Genome in a Bottle sample, and developing tools for comparison of methods (English et al., 2023).
While pathSTR should already greatly facilitate the interpretation of tandem repeats, we must stress that expert knowledge and caution are still required for the correct interpretation of results. The pathogenic length, as obtained from STRchive, must be evaluated in light of the available data, and may have to be reviewed when larger groups of patients and controls are sequenced. An additional reason for caution is that other repetitive sequences often flank tandem repeats, as is the case for HTT, where both repeats are often genotyped simultaneously (Höijer et al., 2018). This then complicates the assessment of the pathogenic character, as only the CAG/polyQ repeat is known to expand and cause disease. The flanking CCG/polyP repeat is stable but could confuse the genotyper, while simultaneously also influencing the polyQ pathogenicity (Urbanek et al., 2020). For these reasons, pathSTR will display a warning when the pathogenic length is added to the plots.
Sequence composition is another important determinant of pathogenicity. A clear example is the intronic pentamer repeat in the YEATS2 gene (Supplementary Fig. S5), one of the causes of familial adult myoclonic epilepsy (FAME), where only expanded repeats with an ATTTC motif are pathogenic. In this instance, evaluating patients on the overall repeat length alone is insufficient, as expansions of ATTTC are pathogenic, but expansions of the reference sequence ATTTT are seen in healthy individuals (Depienne & Mandel, 2021). We are likely still lacking the full picture of the sequence compositions of these tandem repeats, as long-read methods have only recently started probing the composition of expanded alleles. A more complete view of the sequence composition in expanded repeats of reference individuals and clinical cases will improve our understanding of what makes repeats pathogenic, and eventually lead to better diagnostics.
The current dataset does not provide information on the DNA methylation status, which can be determined from nanopore sequencing as native DNA is sequenced without amplification (Giesselmann et al., 2019). This would be a very relevant layer of information to incorporate in a later update, as especially long CG-rich repeats are known to be methylated and lead to epigenetic silencing in cis (Depienne & Mandel, 2021). This resource will continuously be expanded when new population sequencing efforts are made available online, as well as when additional tandem repeats are identified as relevant for human diseases.

Quality control
We used cramino for quality control and to determine library metrics such as library N50, yield, and normalized coverage per chromosome (De Coster & Rademakers, 2022).
Samples with an estimated coverage <10x (32 Gb) were removed (N=129, Supplementary Fig. S2), as well as a sample with unexpected normalized coverage on the sex chromosomes (Supplementary Fig. S3).
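The coverage threshold above can be written out as a short sketch. The genome size of 3.2 Gb is an assumption, inferred only from the stated equivalence of 10x and 32 Gb. The function names are hypothetical and do not correspond to cramino options or output fields.

```python
# Assumption: 10x coverage corresponding to 32 Gb implies a genome size of 3.2 Gb.
GENOME_SIZE_BP = 3_200_000_000


def estimated_coverage(yield_bp: int, genome_size_bp: int = GENOME_SIZE_BP) -> float:
    """Estimate the mean coverage from the total sequenced yield in bases."""
    return yield_bp / genome_size_bp


def passes_coverage_filter(yield_bp: int, min_coverage: float = 10.0) -> bool:
    """Keep a sample only if its estimated coverage reaches the threshold."""
    return estimated_coverage(yield_bp) >= min_coverage


# Hypothetical yields in bases for three samples.
yields = {"sample_a": 45_000_000_000, "sample_b": 32_000_000_000, "sample_c": 20_000_000_000}
kept = [name for name, bases in yields.items() if passes_coverage_filter(bases)]
```

In this hypothetical example, sample_c (about 6x) would be removed, while the other two samples are retained.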

Genotyping medically relevant tandem repeats
Tandem repeats for pathSTR were genotyped with STRdust, which is implemented in Rust and uses the rust-htslib and rust-bio crates (Köster, 2016). STRdust does not require alignment files (in CRAM or BAM format) to be available locally; they can instead be queried from a remote location (using FTP, HTTPS, or s3), which is relevant in the context of this application. Reads are aligned to an artificial reference sequence without the repeat sequence using the Rust bindings to minimap2 (Guhlin, 2022/2024; Li, 2021), after which a consensus of the repeat allele is generated using partial order alignment (POA) (Vaser et al., 2017), as implemented in rust-bio. STRdust performs pairwise alignment and hierarchical clustering to identify the reads that make up the two alleles and assign a heterozygous or homozygous genotype, excluding outlier reads, i.e. any read with a repeat sequence that does not result in an appropriate alignment with sequences from any of the other reads. The repeats with a role in human diseases selected for genotyping were taken from STRchive (Dashnow, 2023/2023), using the hg38 coordinates for genotyping, the motif length for kmer composition plots, and the provided cut-off for repeats to be considered pathogenic. STRdust implements a --pathogenic option to download tandem repeat coordinates from STRchive for ease of genotyping these medically relevant tandem repeats. Genotyping the 1000 Genomes samples is organized using the snakemake workflow manager (Köster & Rahmann, 2012).
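The idea of separating reads into two alleles by pairwise comparison and hierarchical clustering can be illustrated with a simplified sketch. It is not the STRdust implementation, which is written in Rust. Here, plain edit distance stands in for pairwise alignment, average linkage is an assumed clustering criterion, and outlier exclusion and homozygous calls are left out. All names and sequences are hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here as a stand-in for a pairwise alignment score."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cluster_into_alleles(seqs: list[str], n_clusters: int = 2) -> list[list[int]]:
    """Agglomerative clustering of read indices until n_clusters groups remain."""
    dist = {
        (i, j): edit_distance(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }

    def linkage(c1: list[int], c2: list[int]) -> float:
        # Average linkage (an assumption): mean distance over all cross-cluster pairs.
        total = sum(dist[min(i, j), max(i, j)] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [[i] for i in range(len(seqs))]
    while len(clusters) > n_clusters:
        # Merge the two closest clusters; a < b, so deleting b keeps index a valid.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical repeat sequences: three reads from a short and three from a long allele.
reads = ["CAG" * 10] * 3 + ["CAG" * 30] * 3
alleles = cluster_into_alleles(reads)
```

In a real caller, a consensus would then be built per cluster, and additional logic would decide whether the two clusters are different enough to report a heterozygous genotype.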
We compared STRdust (v0.5.0) to TRGT and LongTR (version 638942f-dirty) for the set of medically relevant tandem repeats, and evaluated the correlation of the obtained repeat allele lengths with a scatter plot matrix and by calculating the Pearson correlation.
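For reference, the Pearson correlation between the allele lengths reported by two genotypers can be computed as in the sketch below. The function name and the allele lengths are hypothetical and are not results from this comparison.

```python
from math import sqrt


def pearson_correlation(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equally long lists of values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / sqrt(var_x * var_y)


# Hypothetical allele lengths (bp) for the same alleles called by two tools.
lengths_tool_a = [60.0, 75.0, 90.0, 300.0, 45.0]
lengths_tool_b = [60.0, 78.0, 90.0, 297.0, 45.0]
r = pearson_correlation(lengths_tool_a, lengths_tool_b)
```

Because long alleles dominate this coefficient, a scatter plot of the same data remains useful for spotting disagreement among shorter alleles.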

PathSTR web app
The pathSTR web app is written in Python and built upon dash (Dash Documentation & User Guide | Plotly, n.d.) and additionally uses cyvcf2 to parse VCF files (Pedersen & Quinlan, 2017), pandas to manipulate data frames (McKinney, 2011), and modules from the Python standard library (Python Programming Language | USENIX, n.d.). The parsed data is saved into an hdf5 container for easier access and quick start-up times (The HDF Group, 2020/2024). For every repeat and individual, an IGV visualization is provided using igv.js (Robinson et al., 2023) as made available through dash-bio (Hossain, 2019). The pathSTR web app is hosted in-house, deployed using nginx and gunicorn.
The repeat composition is analyzed by counting kmers in the repeat sequence according to the forward strand of the reference genome, splitting sequences based on the known repeat motif length but unbiased to which motifs can be found. Each kmer is rotated (GCA-CAG-AGC) and represented by either the known unit (as defined by STRchive) or the .
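The kmer counting with rotation can be sketched as follows. This is an illustration and not the pathSTR code. The function names are hypothetical, trailing bases that do not fill a complete kmer are simply dropped, and the representation chosen for kmers that are not a rotation of the known unit (the lexicographically smallest rotation) is an assumption made only to keep the example complete.

```python
from collections import Counter


def rotations(motif: str) -> list[str]:
    """All cyclic rotations of a motif, e.g. CAG -> CAG, AGC, GCA."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def representative(kmer: str, known_unit: str) -> str:
    """Map a kmer to the known unit if it is one of its rotations.

    Assumption: any other kmer is represented by its lexicographically
    smallest rotation; the rule used by pathSTR is not specified here.
    """
    if kmer in rotations(known_unit):
        return known_unit
    return min(rotations(kmer))


def kmer_composition(repeat_seq: str, known_unit: str) -> Counter:
    """Count non-overlapping kmers of the known motif length across the repeat."""
    k = len(known_unit)
    kmers = [repeat_seq[i:i + k] for i in range(0, len(repeat_seq) - k + 1, k)]
    return Counter(representative(kmer, known_unit) for kmer in kmers)


# Hypothetical repeat with one interruption (CAA) in a CAG repeat.
composition = kmer_composition("CAGCAGCAACAG", "CAG")
```

Splitting by motif length without filtering on the expected motif lets interruptions and alternative motifs appear in the counts instead of being discarded.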

Figure 2:
Figure 1: pathSTR visualization of the repeat length. A: Scatter plot length visualization of the RFC1 repeat, comparing, per individual, the longer allele against the shorter. B: Violin plots showing the CACNA1A repeat length, split by the 1000 Genomes Project super population (AMR: Admixed Americans, EUR: Europeans, EAS: East Asians, AFR: Africans, SAS: South Asians) and sex, showing the pathogenic repeat length from STRchive with a red horizontal line.

STRdust does not attempt to split reads into two haplotypes for haploid sex chromosomes in male individuals. Command-line arguments are parsed with clap (Clap-Rs/Clap: A Full Featured, Fast Command Line Argument Parser for Rust, n.d.), and parallelization is achieved using rayon (Rayon-Rs/Rayon, 2014/2024). Binaries for STRdust are available at https://github.com/wdecoster/STRdust, with the source code available under the MIT license.