Comprehensive analysis of GBA using a novel algorithm for Illumina whole-genome sequence data or targeted Nanopore sequencing

GBA variants cause the autosomal recessive Gaucher disease, and carriers are at increased risk of Parkinson disease (PD) and Lewy body dementia (LBD). The presence of a highly homologous nearby pseudogene (GBAP1) predisposes to a range of structural variants arising from either gene conversion or reciprocal recombination, the latter resulting in copy number gains or losses, complicating genetic testing and analysis. To date, short-read sequencing has not been able to fully resolve these or other variants in the key homology region, and targeted long-read sequencing has not previously resolved reciprocal recombinants. We present and validate two independent methods to resolve recombinant alleles and other variants in GBA: Gauchian, a novel bioinformatics tool for short-read, whole-genome sequencing data analysis, and Oxford Nanopore long-read sequencing after enrichment with appropriate PCR. The methods were concordant for 42 samples including 30 with a range of recombinants and GBAP1-related mutations, and Gauchian outperforms the GATK Best Practices pipeline. Applying Gauchian to Illumina sequencing of over 10,000 individuals from publicly available cohorts shows that copy number variants (CNVs) spanning GBAP1 are relatively common in Africans. CNV frequencies in PD and LBD are similar to controls, but gains may coexist with other mutations in patients, and a modifying effect cannot be excluded. Gauchian detects a higher frequency of GBA variants in LBD than PD, especially severe ones. These findings highlight the importance of accurate GBA mutation detection in these patients, which is possible by either Gauchian analysis of short-read whole genome sequencing, or targeted long-read sequencing.


Introduction
The GBA gene encodes the lysosomal enzyme glucocerebrosidase, and biallelic mutations in GBA cause the autosomal recessive disorder Gaucher disease (GD [MIM: #230800, #230900 and #231000]) 1 . Around 500 disease-causing mutations have been reported, mostly missense changes arising from single nucleotide variants (SNVs) 2 . Heterozygous variants in GBA [MIM: *606463] are associated with an increased risk of developing Parkinson disease (PD) 3 , the second most common neurodegenerative disease, and the closely related Lewy body dementia (LBD) 4 . Identifying GBA mutations is difficult due to a pseudogene (GBAP1) located 6.9 kb downstream 5 which has an overall homology of 96% with GBA. This rises to 98% in the region from intron 8 to the 3'-UTR, where there are five identical segments >200 bp each 6 . The high homology predisposes to nonallelic homologous recombination between GBA and GBAP1, leading to a wide range of structural variants (SV) 7 . These can be non-reciprocal, also termed gene conversion, or reciprocal, resulting in copy number variants (CNV). Throughout this paper, we use the term copy number gain (CNG) for reciprocal duplication alleles where a 20.6 kb long region of DNA between the homology segments of GBA and GBAP1 is multiplied, and copy number loss (CNL) for reciprocal fusion alleles where the same region is deleted, creating GBA-GBAP1 fusions 7 (see Figure 1). SVs that disrupt the coding sequence by gene conversion or reciprocal recombination are expected to be pathogenic for GD and risk factors for PD. Conversely, SVs not affecting the coding sequence are not pathogenic, although a modifier effect cannot be excluded 7 . These include CNLs outside the coding region, and all CNGs, which consist of a partial duplication of pseudogene sequence merged with a variable part of the gene, often only the 3' UTR, with the resulting allele still containing a normal copy of the GBA coding region ( Figure 1D). The SV variability and population prevalence remain largely unknown. Pathogenic missense changes in the high homology exon 9-11 region such as the common p.L483P (NC_000001.11:g.155235252A>G, also known as p.L444P) may arise by gene conversion, rather than simple base substitutions, with pseudogene sequence incorporated into the gene 7 . We refer to variants corresponding to pseudogene bases in this region as "GBAP1-like".
Current sequencing approaches to characterize GBA have major pitfalls, and to date no single approach has fully resolved recombinants 6 . The correct alignment of short reads when there is a highly similar pseudogene is intrinsically problematic, and GBA is challenging in exome and wholegenome sequencing (WGS) [8][9][10] . Moreover, the reliability of the standard WGS secondary analysis pipelines such as the Genome Analysis Toolkit best practice workflow 11 has not been formally assessed. Targeted short-read sequencing approaches are also possible but may require forced alignment to GBA and visual inspection and Sanger validation to detect recombinant variants, and are not likely to provide copy number information 6,12 . We have already performed refinement of the Illumina WGS analysis for other difficult regions due to sequence homology, demonstrating reliable resolution of SVs in such regions on Illumina WGS data in the SMN1[MIM: *600354]/SMN2[MIM: *601627] genes in spinal muscular atrophy 13 14 . We also previously reported a method for GBA analysis using enrichment by long-range PCR, followed by sequencing on the Oxford Nanopore Technologies (ONT) MinION 15 , which reliably detected SNVs, including GBAP1-like variants, and could also detect non-reciprocal recombinants, but not reciprocal recombinants (Figure 1) 15 .
To overcome these limitations and improve the characterisation of GBA at scale, we have developed refined pipelines based on either targeted analysis of short-read (Illumina) WGS data or targeted long-read (ONT) single molecule sequencing. For Illumina data, we present and validate 'Gauchian', a novel algorithm for GBA locus analysis which can reliably resolve SVs and GBAP1-like variants. For ONT data, we have addressed the problem of reciprocal recombinants by using PCR primers designed to amplify CNGs and CNLs when they exist. We validated these methods and . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint then applied them to large PD, LBD, and population control samples. We demonstrate that complete resolution of all variant types in GBA is possible using either Gauchian analysis of Illumina WGS data or targeted ONT sequencing. Finally, we confirm that GBA variants are more common in LBD than PD, we report the frequency of CNVs in different populations, and suggest that a possible modifier role of CNG in PD and LBD merits further study. Both methods finally enable a precise characterisation of GBA at scale, thus driving the identification of causative variants forward.

Population cohorts and samples used
We downloaded WGS CRAM files from the 1000 Genomes Project (1kGP). These were generated by 2x150bp reads on Illumina NovaSeq 6000 instruments from PCR-free libraries sequenced to an average depth of at least 30x and aligned to the human reference, hs38DH, using BWA-MEM v0.7.15. We downloaded WGS CRAM files from PD 16 and DLB 17 cohorts and their controls from the AMP-PD knowledge portal. These were generated by sequencing 2x150bp reads to >25x coverage and processing against hs38DH using the Broad Institute's implementation of the Functional Equivalence Pipeline 18 . We also downloaded AMP-PD variant calls, generated using the Broad Institute's Joint Genotyping pipeline (referred to as BWA-GATK in this paper). Where samples had been recorded as European / Caucasian / white, or African / black in the original database, we refer to them as "European" or "African" for consistency and simplicity, despite the lack of scientific validity, in quotation marks as suggested 19 . is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Gauchian -a WGS-based GBA caller
Gauchian builds upon the strategies to solve closely related paralogs, as described in our previously developed SMN1/2 and CYP2D6 callers 13,14 . Gauchian calculates the total number of copies of GBA, GBAP1, and GBA/GBAP1 gene hybrids. Reciprocal recombinations across homologous regions lead to CNG and CNL of the 20.6 kb region between the homologous parts of the two genes. Since the breakpoint may vary in position, to detect CNVs, Gauchian uses the sequencing depth in the 10kb unique region between GBA and GBAP1 (chr1:155220429-155230539; hg38) ( Figure 2A). The number of reads aligned to this region is normalized and corrected for GC content, and the copy number is called from a Gaussian mixture model ( Figure   2B). A deviation of this copy number (CN) from the diploid expectation indicates the presence of a CNV, e.g. one copy indicates a CNL, and three or more copies indicate a CNG. Thus, this number plus two gives the total copies of GBA and GBAP1 combined, i.e., CN (GBA+GBAP1). Included in this CN calculation, in addition to GBA and GBAP1 genes, are gene hybrids where part of GBA and . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint GBAP1 are fused.
CNG always leaves an intact copy of GBA, while CNL can create pathogenic GBA-GBAP1 fusions if the deletion breakpoint falls within the GBA gene coding region. Next, Gauchian identifies the breakpoint of the CNV, following a similar approach as previously described 14 . To do this, we identified 82 reliable sites (Table S1)  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint approach that does not rely on accurate alignments between GBA and GBAP1. Based on the linking information of reads and read pairs covering the ten differentiating sites in either GBA or GBAP1, Gauchian phases all the haplotypes at these sites originating from either GBA or GBAP1 and identifies hybrid haplotypes (i.e., a mixture of GBA and GBAP1 bases on the same haplotype).
This allows us to identify CNL breakpoints as well as small and big gene conversion events.
To assess the relative abundance of the different haplotypes, Gauchian uses CN (GBA+GBAP1) and haplotype-supporting read counts at the differentiating bases to call CN of each haplotype.
Gauchian compares two scenarios: one copy of the wildtype GBA haplotype vs. two copies of the wildtype GBA haplotype. Gauchian determines which scenario is more likely given the number of supporting reads in the data. If we call only one copy of the wildtype GBA haplotype, this indicates that the individual is a carrier of the disease-causing variant. If an individual is a carrier of more than one variant haplotype and there is no haplotype that carries the GBA base at all variant sites of interest, Gauchian calls this sample compound heterozygous. Homozygous variants are called when the CN of the GBA base is called 0.
In addition to GBAP1-like variants in exons 9-11 homology region, Gauchian targets all known GBA pathogenic or likely pathogenic variants as classified by ClinVar (Table S2), including non-GBAP1like variants, and GBAP1-like variants outside the exons 9-11 homology region. For these, since variants don't correspond to GBAP1, or, if they do, the region between GBA and GBAP1 is not highly similar and alignments are accurate, Gauchian parses read alignments and calls the CN of variants based on the number of variant supporting reads as described for SMN/CYP2D6 callers.
Gauchian is a targeted caller for known variants and thus does not call novel variants.

ONT long-read sequencing with PCR enrichment
GBA enrichment was obtained via PCR (Table S3), with primers previously described 24 (henceforth primer pair 1), modified to carry the ONT barcode adapter sequence. The product was a 8.9 kb . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint amplicon containing the entire GBA coding region and introns (chr1:155232524-155241392; hg38). Samples were barcoded using the 96-sample barcoding kit (EXP-PBC096). Amplicons were purified with Agencourt AMPure XP magnetic beads at a ratio of 0.4x. Library preparation was As homopolymer regions are challenging for ONT 32 , and GBA has two coding poly-G stretches, we devised a method to detect variants within these ( Figure S1). This involves analysis of.bam files with Samtools depth to obtain depth of coverage across these A (chr1:155239990-155239995 and is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint five median absolute deviations from the median adjusted depth of coverage of the other samples in the run at that position, this was considered as evidence of a deletion (coverage lower than the mean for that position) or a SNV (coverage higher than the mean for that position). Variants detected with this method were validated with Sanger sequencing 33 .
To detect and amplify reciprocal recombination events, two additional sets of primers were used 24 . These primers were specifically designed to amplify the recombinant alleles: the set MTX1r/GBA-nf (henceforth referred to as primer pair 2) only amplifies recombinants with GBA gene sequence at the 5'UTR end and GBAP1 sequence at the 3'-UTR end (CNL), while the set of primers

ΨMTX1-r/ΨGBA-nf (primer pair 3) only amplifies recombinants with the 3'-UTR end and GBAP1
sequence at the 5'-UTR end (CNG, see Figure 1). Samples underwent PCR using these pairs of primers. If a product was detected on agarose electrophoresis, the sample was re-amplified with the same primer pair modified to carry the ONT barcode adapters, and the amplicons were barcoded and sequenced as described. The primer sequences and PCR conditions are given in Table S4. To define the breakpoint of CNL, the products of primer pair 2 were aligned to GBA ( to avoid alignment to GBAP1; chr1:155222384-155241249; hg38) using LAST (Version 1243) 34 . The resulting alignment was analysed with Clair to look for variants at positions where GBA and GBAP1 differ (Table S1). If a sample displayed an SNV in a certain position, it meant that the breakpoint must be upstream of that but downstream of the next sentinel position where no variant is detected ( Figure S2).

GBA enrichment with UNCALLED
To validate GBA SV without PCR enrichment, we used UNCALLED, which uses adaptive sampling to enable real-time enrichment or depletion on MinION runs via the MinKNOW API ReadUntil 35 .
UNCALLED analyses the signal generated by the DNA molecules passing through each pore of the device in real time and decides whether they align to a reference sequence provided. It can then . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. prematurely eject the molecule from the pore if not of interest, freeing up sequencing capacity for new reads and ultimately achieving purely computational enrichment. The target region for enrichment was chr1:155193567-155264811, with repetitive regions masked with the UCSC RepeatMasker.

Digital PCR
Digital PCR (dPCR) was performed by QIAGEN (Hilden, Germany) on the QIAcuity instrument.
Three probes were selected: DCH101-0776005A (chr1:155231010-155231209; hg38) and DCH101-0776012A (chr1:155232410-155232609; hg38) target the region affected by the recombination event, while DCH101-1260927A (chr1:155208699-155208804; hg38) is outside of this region and was used as a reference for analysis. Each sample was tested three times, and the result is the average of the three assays.

Illumina sequencing
WGS was performed on the Illumina NovaSeq instruments using Illumina TruSeq Nano DNA Library Prep 9 .

Statistical analysis
Analysis was carried out on R (version 4.0.5). Odds Ratios (OR) were calculated with logistic regression. To check for an additive effect on risk of CNGs on GBA variants carriers, a multivariate logistic regression was used, with disease status as the outcome variable, and GBA-carrier status and CNG-carrier status as independent variables.
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint

Cross-validation confirms both Gauchian and ONT methods
To select appropriate samples for validation of Gauchian with a broad range of mutations, we first obtained Gauchian results on 1kGP samples and the AMP-PD PPMI PD and control cohorts. We  Table 1.
To obtain further orthogonal validation, we used dPCR for copy number estimation of the 20.6 kb region involved in recombination in six samples with a range of copy numbers. These included four samples where Gauchian and ONT both detected a CNG (additional copy numbers 1, 3, 5, and 6), and two where we only had ONT data, one with a CNL, and one with no CNV. The results were fully concordant (Table S5). Finally, we applied ONT sequencing with PCR-free enrichment by is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint adaptive sampling (UNCALLED) 35 to four reciprocal recombinants, two CNG and two CNL (one pathogenic and one non-pathogenic). Inspection of the resulting alignments confirmed the presence of the SV and the breakpoints of CNL alleles ( Figure S3).

Detection of all classes of GBA variants with targeted ONT long-read sequencing
We analysed 397 samples from PD or GD patients, their relatives, and controls (Table S6) using all three pairs of PCR primers, followed by ONT amplicon sequencing. These included 95 individuals previously sequenced with ONT using only primer pair 1 15 . All results are shown in Table S7. We detected two c.1263del+RecTL alleles and one RecNciI allele arising from gene conversion.
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint

Comprehensive GBA analysis by Gauchian in short-read WGS population data
A total of 10623 samples were analysed with Gauchian, including 2504 samples from the 1kGP cohort, 2325 PD and 2598 LBD samples from the AMP-PD knowledge portal, and their respective controls. We identified 55 non-pathogenic CNLs and 146 CNGs (roughly correspond to DGV variant accessions dgv55e214 and esv3587619; Table 2). Additionally, we detected 97 GBAP1-like variants (including those generated by pathogenic CNL or gene conversion) in the exons 9-11 homology region of GBA in all three cohorts (Table 3 and Table S8). haplotypes containing GBA bases (see Figure 2C). Gauchian also detected other coding SNVs and indels that are not GBAP1-like in the three cohorts (see Table S9 for all variants). All these calls were concordant with BWA-GATK except in one sample where Gauchian called p.L483R, a rare pathogenic variant in the same codon as the common GBAP1-like p.L483P, but BWA-GATK did not. This variant is in the exon 9-11 homology region, and variant reads misaligned to GBAP1, causing the false-negative by BWA-GATK ( Figure S4). is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint

Prevalence of GBA recombinant and non-recombinant variants in healthy, PD, and LBD populations
Gauchian allowed us to provide the first large-scale analysis of all classes of GBA mutations in healthy, PD, and LBD populations. Non-pathogenic CNVs, where the breakpoints do not alter the GBA coding region, were ten times more frequent in "Africans" than "Europeans" (11.3% vs 1.1%, Table 2). This was primarily driven by a striking difference in the prevalence of CNGs (10.8% v 0.6% for controls from both cohorts; p<2.2e-16). Additionally, "Africans" also had more copies gained, with a median gain of three copies compared to one for "Europeans".
As non-pathogenic CNVs in GBA have not previously been considered as possible PD or LBD risk factors, we compared these across the combined disease cohorts to their controls. We detected no difference in the prevalence of all non-pathogenic CNVs (1.10% vs 1.25%), CNGs (0.67% vs 0.63%) or CNLs (0.43% vs 0.63%, Table 2). Addressing SNVs next, we noticed that p.N409S was found at a very high frequency in the PD cohort of AMP-PD (in both cases and controls, 5.5% and 12.6% respectively), because of the recruitment of a large number of individuals with Ashkenazi Jewish ancestry 16 , where it is very common. After excluding individuals carrying it from both cohorts for consistency, GBA variants were more common in each disease cohort than the respective controls as expected (Table 4) (PD 7.8% v 3.9%; LBD 11.7% v 3.5%). This was also true for severe GBA variants 36 (PD 1.7% vs 0.8%; LBD 3.1% vs 0.1%). The overall OR for mutations in each disease against its controls was higher in LBD than PD (3.68 v 2.07; p=0.0098), and this was even more striking for the severe mutations (30.83 v 2.12; p=0.0009).
We also noted that some individuals carried both a CNG and another GBA variant, mostly a GBAP1-like variant in the exon 9-11 homology region (Table 2), as also seen in the cohort analysed by ONT. There were no individuals with a CNG and another GBA variant in the 1kGP cohort. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint Considering all samples analysed by Gauchian, 7 out of 146 with a CNG also had a GBAP1-like variant, against 71 of 10,407 without a CNG (4.8% v 0.68%; p-value=9.77e-5). Three additional individuals carried a non-GBAP1-like variant and a CNG. In the PD and LBD cohorts and their controls, 9 of 10 individuals carrying a CNG as well as a coding variant in GBA were patients (four PD, five LBD). One healthy control carried a CNG and p.T408M, which is a mild PD risk allele but does not cause GD. As we did not find any healthy controls with a CNG and a pathogenic variant, we considered whether the combination of both is more detrimental than a coding variant alone (excluding again p.N409S carriers, one of whom who also had a CNG). We did not detect a is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint

Discussion
The recent dramatic improvements in sequencing techniques have allowed a much better understanding of human genetic variation, but several regions, including some key disease-related genes, have remained challenging. One example is GBA, responsible for the autosomal recessive lysosomal storage disorder GD 1 , and one of the most important genetic determinants of risk for PD and the closely related LBD 17 . Here we present and validate Gauchian, a novel GBA caller for Illumina WGS data, capable of detecting SVs and SNVs within GBA. Using ONT targeted sequencing, we demonstrate that in the cases of discrepant calls between Gauchian and BWA-GATK analysis, the Gauchian calls are correct. We also demonstrate that a refined ONT ampliconsequencing pipeline can detect reciprocal recombinants, and indels as well as mutations within homopolymers in coding exons. Importantly, both methods detect CNGs and CNLs arising from reciprocal recombination, and allow straightforward classification of CNLs into those that do and do not affect the coding region, previously a complex task 8,37 . We thus provide two complementary new tools for fully resolving the GBA gene, which will be helpful to the community. Illumina WGS data can now be analysed robustly, and ONT targeted sequencing can be applied in a cost-effective way where an analysis of GBA is sufficient.
To explore the potential of Gauchian in the population and in disease contexts, we applied it to a total of 10,623 samples from the 1000 Genomes Project and PD and LBD cohorts with their controls from the AMP-PD initiative. This allowed us to provide the first large scale data on CNVs , and to evaluate the frequency of all classes of GBA variants in PD and LBD with greater accuracy than before. Reciprocal recombinants in particular are likely missed in PD studies 15 , including a recent targeted short-read study which detected none in 3402 patients 8 , although one study using exome data with qPCR validation reported CNGs in 1.2% of PD and 0.7% of controls 37 . In non-diseased individuals, we noted that CNGs were more common in those with "African" . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint ancestry, with greater copy number variability. These results are consistent with the greater African genetic diversity, with recent evidence of "African" genomes demonstrating unexplored structural variation 38 and more variability in copy numbers of SMN1, SMN2 and CYP2D6 13,14 . They further highlight the need to study non-European genomes , which has yielded additional insights into Alzheimer's disease 39 , and is being expanded in PD 40 .
In the PD and LBD cohorts, Gauchian analysis almost doubled the pathogenic variants detected in the homology region compared to BWA-GATK (86 vs 44) and eliminated false positives. We also performed a direct comparison of GBA variant frequencies between PD and LBD, after excluding the common p.N409S variant due to selection bias in the PD cohort 16 . The prevalence of GBA pathogenic or PD-risk variants was significantly higher in LBD than PD (11.7% vs 7.8%), and this difference was even larger for severe pathogenic variants (3.1% v 1.7%). The OR for GBA mutations in LBD compared to controls was higher in our analysis than in the original report in this cohort 17  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint mutations is attracting a lot of attention, with lysosomal gene variants acting as genetic modifiers 45 . An effect of common intronic haplotypes was suggested 46 , but not seen by us in AMP-PD cohort and part of the RAPSODI cohort used here 28 . A possible influence of CNGs has not yet been investigated. Although these do not alter the GBA coding region, they could affect expression and function, for example by acting as a competing endogenous microRNA sponge 47 .
CNGs were not enriched in PD or LBD. There were, however, rare carriers of both a CNG and a pathogenic or PD-associated GBA variant. In the PD and LBD cohorts, nine out of ten of these were patients, and in the only control the coding variant was the mild PD risk allele p.T408M. This raises the possibility that CNGs are modifiers, increasing the penetrance of other GBA variants. We have not, however, phased the CNGs and other variants, and did not show a statistically significant increased risk for carriers of a CNG and mutation compared to mutation alone. Therefore, further population and mechanistic work is required.
In conclusion, we have demonstrated that SNV detection and complete resolution of all classes of SVs is possible using the novel Gauchian caller with Illumina WGS, which outperforms BWA-GATK analysis, or with targeted ONT sequencing. We also demonstrate that CNVs are relatively common, and suggest that these merit investigation as possible modifiers of PD or LBD risk. Given the importance of this gene, and the rapid progress to targeted clinical trials in PD 48 , we propose that the adoption of either workflow should be considered by research and diagnostic labs, based on local resource and data availability. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint

Description of supplemental data
Supplemental data include 4 figures and 9 tables.

Acknowledgments
We thank the New York Genome Center and the Coriell Institute for Medical Research for generating and releasing the 1kGP WGS data. We thank the AMP PD Knowledge Platform for hosting WGS data for patient and control cohorts. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Data and code availability
Gauchian will be a part of Version 3.10 of the Illumina DRAGEN (Dynamic Read Analysis for GENomics) Bio-IT platform.
ONT and UNCALLED scripts used will be downloadable at https://github.com/marcotoffoli. Individual-level genome sequence data for the PD patients, LBD patients, and neurologically healthy controls are available at AMP-PD (https://amp-pd.org).
The datasets of DNA from QSBB brain samples and NHGRI samples generated during this study is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint removed or masked) to minimize the amount of genetic information made available for public access.
The datasets of DNA from PPMI samples generated during this study (targeted ONT sequencing) ONT sequencing data on living individuals are not available due to consent / IRB restrictions. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint Not to scale. A. Wild-type allele. Only primer pair 1 will produce an amplicon. B. Non-reciprocal recombination (gene conversion). Similar to non-recombinant alleles, only primer pair 1 will produce an amplicon. C. Reciprocal crossover between gene and pseudogene resulting in a 20.6 kb deletion (CNL). Only primer pair 2 will produce an amplicon. D. Reciprocal crossover between gene and pseudogene resulting in a 20.6 kb duplication (CNG). Both primer pair 1 and primer pair 3 will produce amplicons. Note that the normal allele is present and that amplification with primer pair 3 will produce an amplicon independently of the number of copy number gains. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint A. Median mapping quality (red line) across 2504 1kGP samples plotted for each position in the GBA/GBAP1 region (hg38). A median filter is applied in a 50 bp window. The eleven exons of GBA are shown as orange boxes. GBAP1 and MTX1 exons are shown as green and purple boxes, respectively. The 4kb major homology region (98.1% sequence similarity, exons 9-11) between GBA and GBAP1 is shaded in pink and highlights an area of low mapping accuracy. The light blue box shows the 10kb unique region between the two genes in which copy number calling is performed in Gauchian. B. Distribution of normalized depth in the 10 kb CN calling region in 2504 1kGP samples, showing peaks at CN1 (CNL), 2 (no CNV), and 3-8 (CNG). C. Recombinant haplotypes in the exons 9-11 homology region, distinguished by GBA/GBAP1 differentiating bases (x-axis). Reference genome sequences are shaded in yellow. There is an error in hg38 where the first three sites of GBAP1 show GBA bases, which could lead to alignment errors. The GBA recombinant haplotypes are shown in the white background, including those where one or a few nearby sites are mutated to the corresponding GBAP1 base, resulting from either gene conversion or CNL. Gray bases indicate that the base can be either GBA or GBAP1 depending on the breakpoint position of the CNL/conversion. Shaded in purple are two example GBAP1 haplotypes, found by Gauchian, that have been partially converted to GBA and can cause false positive GBA variant calls by standard secondary analysis pipelines. For the first example, the reverse-p.L483P variant on GBAP1 directs aligners to align GBAP1 reads to GBA, causing the nearby p.A495P false-positive call. For the second example, the reverse-c.1263del variant inserts 55bp to GBAP1, driving GBAP1 reads to align to GBA, causing the nearby p.D448H false-positive call.
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint Samples with * were discordant with BWA-GATK. a Samples with IDs starting with NA-and HG-were obtained from NHGRI; samples with IDs starting with PP were obtained from PPMI; samples marked as brain were obtained from QSBB. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint *Severe and mild variants are defined in Table S8 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 13, 2021. ; https://doi.org/10.1101/2021.11.12.21266253 doi: medRxiv preprint