Genome-wide Analysis of Copy Number Variation in Latin American Parkinson's Disease Patients

Background: Parkinson's disease is the second most common neurodegenerative disorder and affects people from all ethnic backgrounds, yet little is known about the genetics of Parkinson's disease in non-European populations. In addition, the overall identification of copy number variants at a genome-wide level has been understudied in Parkinson's disease patients. Objectives: To understand the genome-wide burden of copy number variants in Latinos and its association with Parkinson's disease. Methods: We used genome-wide genotyping data from 747 Parkinson's disease patients and 632 ancestry-matched controls from the Latin American Research Consortium on the Genetics of Parkinson's Disease. Results: Genome-wide copy number burden analysis showed no difference between patients and controls, whereas patients were significantly enriched for copy number variants overlapping known Parkinson's disease genes compared to controls (Odds Ratio: 3.97 [1.69 - 10.5], P = 0.018). PARK2 showed the strongest copy number burden, with 20 copy number variant carriers. These patients presented an earlier age of disease onset compared to patients with other copy number variants (median age at onset: 31 years vs. 57 years, P = 7.46 x 10^-7). Conclusions: We found that Parkinson's disease patients are significantly enriched for copy number variants affecting known Parkinson's disease genes. We also identified that out of 250 patients with early-onset disease, 5.6% carried a copy number variant on PARK2 in our cohort. Our study is the first to analyze genome-wide copy number variant associations in Latino Parkinson's disease patients and provides insights about this complex disease in this understudied population.


Introduction:
Parkinson's disease (PD) is the second most common neurodegenerative disorder, and the fastest growing cause of disability due to a neurological disorder in the world 1,2 . PD is a multifactorial syndrome that is thought to be caused by the complex interaction of genetics, environmental factors, and aging 3 .
Evidence for the genetic basis of PD has increased substantially over the past decades 4-6 . The first causal gene for PD, SNCA, was discovered in 1997 7 , and its protein product (α-synuclein) was further shown to be a major component of Lewy bodies, the pathological hallmark of PD.
Dominant pathogenic single nucleotide variants (SNVs) in SNCA [8][9][10][11] , as well as copy number variants (CNVs), such as duplication or triplication of the entire gene with a clear dose effect, have been reported [12][13][14] . The discovery of SNCA was followed by that of PARK2 15 , in which both pathogenic SNVs and CNVs are usually associated with an autosomal recessive, early-onset form of the disease 16 . Genetic discoveries in PD have focused almost exclusively on SNVs, and studies on CNVs have been infrequent [17][18][19] . CNVs in PARK2, SNCA, PINK1, DJ1, and ATP13A2 (from more to less frequent) have been reported using a candidate gene approach [20][21][22] , while no CNVs have been reported for LRRK2. To date, only two studies have investigated the role of CNVs in PD at a genome-wide level, including exclusively European and Ashkenazi Jewish individuals 17,19 , with sample sizes of 1672 and 432, respectively.
PD is a global disease affecting all ethnicities.
Unfortunately, the majority of studies do not include individuals of non-European ancestry, creating a large gap in knowledge. This is especially true for Hispanics/Latinos (Box 1).
Despite the fact that they are the largest and fastest growing ethnic minority in the US 23 , Hispanics/Latinos are critically underrepresented in most genetic studies 24 . This is probably due to their complex admixed ancestry with influences primarily from European, Amerindian and African populations. In the US, the incidence and prevalence rates of PD among Hispanics are at least as high, if not higher than in non-Hispanic Whites, while the rates are lower for Asians and Blacks 25,26 . Yet, little is known about the genetics of PD in Hispanics/Latinos, especially the frequency and characteristics of CNVs. No genome-wide studies in this population have been performed to date.

medRxiv preprint doi: https://doi.org/10.1101/2020.05.29.20100859; this version posted June 1, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
To address the lack of diversity in PD genetic studies and to understand the genetic architecture of PD in Latinos, we created the Latin American Research Consortium on the Genetics of PD (LARGE-PD) 27 . For this study, we used genome-wide genotypes of 1,497 individuals from LARGE-PD. The aim of this study was to elucidate genomic structural changes, as well as assess the CNV burden in this cohort of Latino PD patients and controls.

Methods:
As part of our ongoing collaborative effort within LARGE-PD 27 , we examined data from a total of 1,497 individuals (807 PD patients and 690 controls) recruited from nine different sites across the following five different countries: Peru (n = 721), Colombia (n = 351), Brazil (n = 227), Uruguay (n = 191), and Chile (n = 13). All patients were evaluated by a movement disorder specialist at each of the sites and met the UK PD Society Brain Bank clinical diagnostic criteria 28 . Controls were selected from ancestry matched individuals that did not have symptoms compatible with neurodegenerative disorders. All PD patients and controls provided signed informed consent according to the local ethical requirements of each site. All individuals were genotyped on Illumina's Multi-Ethnic Global Array (MEGA) (Illumina, San Diego, CA, USA). A total of 1,779,819 markers were available before quality control (QC).
We performed an initial round of QC using PLINK 1.90 29 , based on single nucleotide polymorphism (SNP) genotype data for all samples and following established protocols described in Niestroj et al 30 . Samples with a call rate < 0.96 or a discordant sex status were excluded. We removed autosomal SNPs with a low genotyping rate (< 0.98), a case-control difference in minor-allele frequency (> 0.05), or a deviation from Hardy-Weinberg equilibrium (HWE, P-value ≤ 0.001) before pruning SNPs for linkage disequilibrium (--indep-pairwise 200 100 0.2) using PLINK 29 . The 1000 Genomes populations 31 were used as a reference for visual clustering in the Principal Component Analysis (PCA) to assess population stratification.
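As an illustration of the per-SNP thresholds described above, the following is a minimal Python sketch, not the pipeline used in the study (which relied on PLINK). The function names, the toy genotype counts, and the handling of values exactly at the call-rate threshold are assumptions. The HWE P value here uses a chi-square approximation with one degree of freedom, which is a simplification and may differ from the test applied by PLINK.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) P value for deviation from Hardy-Weinberg equilibrium.

    Simplified approximation used for illustration only.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic SNP: no deviation can be tested
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 degree of freedom
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_snp_qc(n_aa, n_ab, n_bb, n_missing, min_call_rate=0.98, hwe_p=0.001):
    """Return True if a SNP passes the call-rate and HWE thresholds."""
    called = n_aa + n_ab + n_bb
    if called == 0:
        return False
    call_rate = called / (called + n_missing)
    if call_rate < min_call_rate:  # low genotyping rate
        return False
    # SNPs with HWE P <= 0.001 are excluded, as stated in the text
    return hwe_chisq_pvalue(n_aa, n_ab, n_bb) > hwe_p


# Hypothetical genotype counts (AA, AB, BB, missing)
print(passes_snp_qc(250, 500, 250, 0))    # counts in exact HWE, full call rate
print(passes_snp_qc(500, 0, 500, 0))      # no heterozygotes at all
print(passes_snp_qc(250, 500, 250, 100))  # call rate of about 0.91
```

The case-control minor-allele frequency filter and linkage disequilibrium pruning are omitted from this sketch.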
For CNV calling, we focused only on autosomal CNVs due to the higher quality of CNV calls from non-sex chromosomes. A custom population B-allele frequency (BAF) file was generated as a reference before calling CNVs. Then, we created GC wave-adjusted Log R Ratio (LRR) intensity files for all samples and employed PennCNV 32 software to detect CNVs in our dataset.
We assessed cryptic relatedness using KING 33 software, and excluded individuals who were closely related (up to second degree) to another participant in our cohort by using the unrelated algorithm in KING. We performed an intensity-based QC to remove samples with low-quality data as previously described in Huang et al. 34  Quality scores ranged from 0 (lowest) to 1 (highest) for duplications and similarly from 0 to -1 for deletions, and CNVs with quality scores between -0.5 and 0.5 were filtered out. A subset of final QC-passed CNVs was also inspected visually by five different investigators with expertise in the interpretation of BAF and LRR plots. CNVs were annotated for gene content using Ensembl 36 , including gene name and the corresponding exonic coordinates in the hg19 assembly, using bedtools 2.27.0 37 .
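The quality-score filter described above can be sketched in a few lines of Python. This is an illustrative example only: the record layout, the identifiers, and the scores are hypothetical, and keeping calls whose score is exactly 0.5 or -0.5 is an assumption, since the text does not say how boundary values were handled.

```python
def keep_cnv(quality_score, threshold=0.5):
    """Keep a CNV call only if its quality score is far enough from zero.

    Duplications are scored from 0 to 1 and deletions from 0 to -1, so the
    absolute value is compared against the threshold. Keeping calls exactly
    at the threshold is an assumption of this sketch.
    """
    if not -1.0 <= quality_score <= 1.0:
        raise ValueError("quality score must lie between -1 and 1")
    return abs(quality_score) >= threshold


# Hypothetical CNV calls
calls = [
    {"id": "cnv1", "type": "dup", "score": 0.92},
    {"id": "cnv2", "type": "del", "score": -0.30},
    {"id": "cnv3", "type": "del", "score": -0.75},
    {"id": "cnv4", "type": "dup", "score": 0.10},
]

passed = [c["id"] for c in calls if keep_cnv(c["score"])]
print(passed)
```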
We calculated CNV burden for PD using different categories to evaluate their relative contribution to PD risk: (1) the carrier status of overall CNV burden, including CNVs in non-genic regions, (2) the carrier status of CNVs intersecting 'any gene' but none of the PD genes, (3) the carrier status of CNVs intersecting a list of "known PD genes", and (4) the carrier status of large CNVs (> 1 Mb in length). P values were adjusted with the false discovery rate (FDR) method to correct for multiple testing. For the overall CNV burden category, deletions and duplications were also analyzed separately. We selected 19 genes for the "known PD genes" category that were grouped as follows: six are well-established causal genes for PD (LRRK2, PARK2, PARK7,
To assess for the difference in CNV burden between PD patients and controls, we fitted a logistic regression model using the "glm" function of the stats package 40 in R 3.6.0 41 .
Cox proportional-hazards regression analyses and Kaplan-Meier curves were calculated using the survival package 42 . For all burden analyses, odds ratios (OR), 95% confidence intervals (CIs), and significance were calculated. ORs were calculated as the exponential of the logistic regression coefficient. For Cox proportional-hazards regression, hazard ratios (HR) were calculated to allow for censored observations. Age, sex, and the first five ancestry principal components were included as covariates in all regression models to account for potential confounding.
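To make the relationship between a logistic regression coefficient and the reported OR concrete, here is a minimal Python sketch. The coefficient and standard error are hypothetical values, not results from this study. The interval is a Wald interval, which is an assumption: the text does not state how the 95% CIs were computed, and other methods (for example, profile likelihood) give asymmetric intervals on the log scale.

```python
import math


def odds_ratio_with_wald_ci(beta, se, z=1.959963984540054):
    """Convert a logistic regression coefficient to an OR with a Wald CI.

    beta: estimated log-odds coefficient for CNV carrier status
    se:   standard error of beta
    z:    standard normal quantile (default gives a 95% interval)
    """
    odds_ratio = math.exp(beta)
    lower = math.exp(beta - z * se)
    upper = math.exp(beta + z * se)
    return odds_ratio, lower, upper


# Hypothetical coefficient and standard error
or_, lo, hi = odds_ratio_with_wald_ci(beta=math.log(2.0), se=0.25)
print(f"OR = {or_:.2f} [{lo:.2f} - {hi:.2f}]")
```

An OR above 1 with a lower bound above 1 would indicate a higher odds of PD among carriers at the chosen confidence level.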

Results:
We had available data from a total of 1,497 individuals in LARGE-PD. We excluded 39 individuals due to relatedness, and 79 due to failing our intensity-based QC steps. Thus, after QC, our final cohort included 1,379 individuals (747 PD patients and 632 controls) from Peru (n = 677), Colombia (n = 320), Brazil (n = 192), Uruguay (n = 177), and Chile (n = 13). There were more males among PD patients than among controls (53.2% vs 33.1%, P < 0.001). Sample demographics are shown in Table 1. To visualize the ethnic composition of our cohort, we performed PCA using 1000 Genomes populations 28
We applied logistic regression to compare the CNV burden in PD patients and controls on all categories defined earlier, adjusting P values for multiple testing (see Methods for details). We then explored CNVs in genomic regions that were previously associated with typical PD and other parkinsonian phenotypes, and we found that PD patients were significantly enriched for CNVs overlapping these genes (OR: 3.97 [1.69 - 10.5], P = 0.018) (Fig. 1). This finding was largely driven by CNVs on PARK2 in 20 patients, followed by two patients with a CNV on SNCA, compared to six controls carrying a CNV on PARK2 and none on SNCA. In addition, one control had a CNV on PLA2G6 (Fig. 1).
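The multiple-testing adjustment applied across burden categories can be illustrated with a short Python sketch of the Benjamini-Hochberg procedure. Using Benjamini-Hochberg is an assumption: the text only says "FDR method", although this is what R's p.adjust computes for method = "fdr". The P values below are hypothetical and are not the study's results.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    # Indices of the P values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P values, one per burden category
raw = [0.01, 0.04, 0.03, 0.005]
print(bh_adjust(raw))
```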
We assessed whether PD patients carrying a PARK2 CNV in our cohort had an earlier age at onset (AAO). Kaplan-Meier estimates of AAO showed that individuals carrying a CNV on a known PD gene had significantly earlier onset of symptoms compared to individuals with other or no CNVs (log-rank test, P < 0.001) (Fig. 2). Using a Cox proportional-hazards regression analysis with age, sex, and the first five ancestry principal components included as covariates, we found that the effect of carrying a CNV on a known PD gene on the hazard of AAO was highly significant (HR: 2.42 [1.57 - 3.71], P = 5.70 x 10^-5). We also assessed AAO in PD patients only, comparing carriers of a CNV on a known PD gene to PD patients with other or no CNVs, and found that carrying such a CNV was associated with earlier onset of symptoms (HR: 1.92 [1.22 - 3.02], P < 0.001) (Supp. Fig. 2).
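For readers unfamiliar with the Kaplan-Meier estimator used for the AAO curves, the following is a minimal Python sketch of the product-limit calculation. It is not the survival package implementation used in the study. The ages are hypothetical, and treating individuals without disease onset as censored at their last known age is an assumption about how censoring was set up.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the onset-free probability.

    times:  age at onset (event) or age at last observation (censored)
    events: 1 if onset was observed at that age, 0 if censored
    Returns a list of (time, survival) pairs at each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Group all individuals who share the same time
        while i < len(data) and data[i][0] == t:
            n_events += 1 if data[i][1] else 0
            n_removed += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        # Both events and censored individuals leave the risk set
        n_at_risk -= n_removed
    return curve


# Hypothetical ages; 1 = onset observed, 0 = censored
ages = [30, 40, 40, 50, 60]
onset = [1, 1, 0, 1, 0]
for age, s in kaplan_meier(ages, onset):
    print(age, round(s, 3))
```

Comparing two such curves (for example, carriers versus non-carriers) is what the log-rank test formalizes.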

Discussion:
Here, we present a genome-wide characterization of CNVs in a cohort of Latino PD patients and controls from LARGE-PD 27 . We genotyped 1,497 individuals on the same platform and analyzed all samples with the same CNV calling and quality control pipeline. We used ancestry-matched controls for the interpretation of CNVs detected in PD patients. This is particularly important considering that data on CNV frequencies in Latino populations are limited, especially in neurologically healthy adults 43,44 . We assessed the CNV burden for different categories and observed an increased burden of CNVs overlapping known PD genes in PD patients. We identified 22 patients who carried CNVs overlapping two established PD genes (PARK2 and SNCA), and found that 14 of these patients had an AAO < 50 years.
The median AAO for patients with a CNV overlapping PARK2 was almost 20 years earlier than that of other patients, in agreement with the literature [45][46][47] . PARK2 mutations are the most common genetic cause of early-onset PD (EOPD) 5 48 . In another study examining Mexican-mestizo EOPD patients (N = 63), the frequency of PARK2 CNVs was found to be 50%, and 18% of these patients were heterozygous 51 .
The role of homozygous and compound heterozygous variants, including CNVs on PARK2, is well known, especially in EOPD 5,15,45 . However, there is also increasing evidence that PARK2 heterozygosity is a risk factor for PD and is associated with a decreased AAO 45,46 . In our cohort, there was a significant association between the AAO and PARK2 carrier status. Still, the role of heterozygous PARK2 CNVs in altering PD susceptibility remains controversial 16,52 . In order to correctly characterize PD patients, an integrated SNV-CNV analysis is needed, given the importance of both allele types for comprehensive genetic diagnosis in PD 53 Table 2. This is very similar to the frequency reported in previous studies mentioned above 48,51 .

The large genetic variation in Latinos due to admixture from several populations (mostly European, Amerindian and African) creates a challenge when analyzing this population. For this reason, we established a workflow with rigorous quality control. We also constructed a Latino reference file from scratch for CNV calls, as publicly available reference files were all based on Europeans. In this study, we analyzed all samples together in order to boost statistical power.
However, separate calling of CNVs in subpopulations based on admixture analysis is likely to yield more refined results. Thus, larger sample sizes will be needed to make discoveries specific to subpopulations of Latinos. Admixture mapping to examine the chromosomal location of the CNVs could also provide more insights about the relationship between PD genetics and ethnicity 58,59 .
To our knowledge, this is the first study that focuses on genome-wide CNVs in PD patients from Latin America. We believe that expanding the diversity of genetic studies for PD is necessary to understand the genetic profiles of these individuals and that our work will enrich current scientific knowledge about CNVs in this underrepresented population.

Odds ratios (ORs) and P values were calculated using logistic regression for CNVs, adjusted for age, sex, and the first 5 principal components. P values were adjusted with FDR for multiple testing. An OR > 1 indicates an increased risk for PD per unit of CNV burden. (B) Table showing the number of CNV carriers in any of the 19 known PD genes. (C) Visualization of CNVs on PARK2.