Abstract
Renal transplantation is the method of choice for patients with end stage kidney failure. But transplanted allograft could be affected by viral and bacterial infections and immune rejections. The standard test for the diagnosis of acute pathologies in kidney transplants is the renal biopsy. However, noninvasive tests would be desirable. Various methods using different techniques have been developed by the transplantation community. But these methods expect improvements. We present here a cost-effective method based on estimating donor-specific DNA fraction in recipient urine based on sequencing of recipient urine DNA only. We hypothesized that in the no-pathology stage, the largest tissue types present in recipient urine are donor kidney cells and in case of rejection, a larger number of recipient immune cells would be observed. Extensive in-silico simulation was used to tune the sequencing parameters: number of variants and depth of coverage. Sequencing of DNA mixture from 2 healthy individuals showed the method high prediction accuracy (maximum error < 0.04). We then demonstrated the insignificant impact of familial relationship and ethnicity using an in-house and public database. Lastly, we performed recipient deep urine DNA sequencing in 32 samples representing two pathology groups: acute rejection (AR, 12 samples) and acute tubular injury (ATI, 11 samples) and 9 samples with no pathology. We found a significant association between the donor-specific DNA fraction in the two pathology groups compared to no pathology (P = 0.0064 for AR and P = 0.026 for ATI). We conclude that deep DNA sequencing of recipient urine offers a noninvasive means of diagnosing and prognosticating acute pathologies in the human kidney allograft.
Introduction
In 1933 surgeon Yurii Voronoy from Ukraine achieved the first human kidney transplantation [1]. Kidney transplantation is the final treatment option for patients with end-stage renal failure after dialysis. Today, renal transplantation plays an important role in clinical medicine and has become a relatively safe intervention. However, various pathologies can still affect the transplanted organ, including infections, disease recurrence and immune rejections. These rejections can be related to a range of donor- and recipient-specific factor risks [2,3]. Acute renal rejection affects 10 to 20% of transplants within three months after transplantation and chronic rejections occur in 4% of kidney transplants [3–5].
To diagnose allograft rejection, tissue biopsies are considered as the gold-standard method for detecting acute and chronic immune injury, as well as other pathologies associated that may eventually lead to allograft loss. However, biopsies are invasive, costly, in rare cases they can lead to organ loss, while the readout can potentially be erroneous if a non-affected part of the kidney is sampled by chance. Therefore, proceeding with biopsies in patients of low immunological risk is sometimes criticized [6]. There is hence a strong need for non-invasive assays to detect injury in transplanted kidneys. Several studies to develop suitable biomarkers for allograft rejection have been conducted. These studies include the quantification of specific messenger RNAs in urine [7], large-scale transcriptomics analyses of peripheral blood [8], proteomics analyses of biopsies [9] and urine [10,11], and metabolomics [12] and RNA sequencing [13] of urine pellet or supernatant. Nevertheless, all non-invasive methods developed to date still have important caveats and require further improvement.
The presence of donor-specific DNA in blood was first reported in women who had a kidney and a liver transplant [14]. The measurement of cell-free donor-specific DNA in blood for a differential diagnosis of kidney injury has been suggested recently [15–17]. These studies focused on females who received a kidney from male donors by identifying the presence of DNA coding for the testis specific protein Y-linked 1 (TSPY1) or the sex-determining region of the Y chromosome using quantitative polymerase chain reaction. With the improvement of next generation sequencing technologies, whole genome sequencing (WGS) [18,19] and targeted sequencing [20] were used for measuring donor-specific DNA for solid organ transplant rejection. However, these studies focused on heart transplants and measured cell-free donor-specific DNA in blood plasma. More importantly, these methods require the sequencing of both donor and receptor DNA which is more costly.
An algorithm for measuring donor-specific DNA in plasma of organ transplants without requiring donor or recipient genotyping was implemented by Gordon et al [21]. But this algorithm made the assumption that donor fraction is < 14%. More recently, Grskovic et al. used sequencing of 266 single nucleotide variants (SNVs) that discriminate best between two unrelated individuals to count reference and alternative allele frequency for estimating the donor-derived cell-free DNA fraction [22]. This method showed a high correlation between cell-free donor specific DNA levels in recipient blood and active rejection of the kidney allografts [23]. However, this method does not account for potential sequencing errors and requires a priori knowledge of the familial relationship between donor and recipient. Finally, a statistical method combining SNV array genotyping of donor and recipient before transplantation with recipient DNA sequencing was used to estimate recipient-derived DNA fraction in heart and lung transplants [24]. Nonetheless, this method requires SNV genotyping of donor and recipient DNA before transplantation. Most importantly, all of the previous study focused on DNA extract from blood.
The presence of donor-specific DNA in urine of kidney allograft recipients has been reported [25]. We have recently conducted a study based on RNA sequencing of tissue biopsies from kidney allograft transplants and found a correlation between the ratio of heterozygous to homozygous SNVs with the rejection phenotype [26]. Moreover, we have shown in another study that DNA methylation could be used to accurately estimate the tissue type composition in recipient urine samples. We found that the largest tissue types present in recipient urine were kidney cells and neutrophils and that donor-specific DNA fraction correlates with the kidney derived cell fraction [27]. However, we restricted the analysis on kidney recipients with urinary tract infection and BK-virus nephropathy only. Most recently, we have identified different gene signatures and pathways associated with two different type of kidney rejection using RNA-seq on transplant urine: acute T cell–mediated rejection and antibody-mediated rejection [13]. Deconvolution analysis showed a higher enrichment of immune cells in rejection stage comparing to no-rejection.
Based on the idea that the fraction of donor-specific DNA can be determined using DNA sequencing, we here hypothesize that the recipient-specific DNA fraction in urine correlates with the level of active rejection in the kidney allograft, assuming that recipient-specific DNA originates mostly from tissue-invading immune cells while donor-specific DNA stems from the allograft [28]. Inspired by methods to estimate DNA contamination in sequencing projects [29,30], we present a cost-effective method to determine the fraction of donor-specific DNA (denoted α hereafter) in urine by sequencing targeted regions. We estimate the dependence of the precision of this measure on sequencing depth and length of the targeted region. Most importantly, no prior knowledge of donor and recipient relation is required. To the best of our knowledge, this is the first method for estimating donor-specific DNA fraction in DNA mixture extracted from recipient urine. Our method provides an easy way to determine the donor-specific DNA fraction regardless of donor and recipient gender. We evaluate its applicability for the detection of kidney transplant rejection. Future applications could be routine tests of urine samples as a reference to adjust and optimize the dosage of immune suppressants in kidney transplant patients.
Results
In silico simulation of donor-recipient DNA mixtures
To determine the optimal sequencing parameters, we use numerical simulations. The simulation process is based on generating two different SNV-sets, merge the two sets with a predefined proportion of each set; α from set 1 and (1-α) from set 2, and then apply a likelihood function (Methods) to estimate this proportion (observed α). Two major parameters affect the estimation of the observed α: the number of sequenced SNVs (N) and the depth of sequencing coverage (M). For a range of parameters N={10, 50, 100, 500, 1,000} and M={10, 50, 100, 500, 1,000, 5,000, 10,000} and varying α from 0 to 0.5 in steps of 0.01, we repeated the simulation process for each N x M x α combination 1,000 times to obtain an empirical distribution of observed α (Fig S1).
We computed the maximum error (ε) for each combination N x M over all tested α. ε ranges between 0 (best case where observed α = tested α) and 0.5 (worst case where tested α = 0.5 and observed α = 0 or tested α = 0 and observed α = 0.5) (Fig 1). As expected, our simulations show that increasing both N and M improves the observed α estimation accuracy. Moreover, the estimation of the observed α is unstable when using a small number of SNVs (N < 100) or low coverage (M < 500). The prediction accuracy stabilizes above N > 500 and M >1,000.
Maximum absolute (A) and relative (B) errors are represented. A total of 35 scenarios combining five different numbers of SNVs N= {10, 50, 100, 500, 1,000} and seven depth of coverage M= {10, 50, 100, 500, 1,000, 5,000, 10,000} were simulated (Fig S1). Represented here are maximum error observed in 1,000 simulations for every tested α ranging from 0 to 0.5 in steps of 0.01.
Experimental estimation of α using a controlled mixture of urine from two individuals
To assess the accuracy of detecting observed α in a mixture of two real human urine samples, we performed a targeted sequencing of urine DNA from two healthy individuals originated from different populations; S1 a 35 years old healthy European woman and S2 a 34 years old healthy Arab woman. A total of 1,850 exonic regions from a panel targeting 93 genes known to be associated with risk of breast cancer were sequenced. These sequenced genomic regions cover 370,942 base pairs across 22 chromosomes (Table S1). A total of 51,893 bi-allelic SNVs falling in these genomic regions were present in the Exome Aggregation Consortium (ExAC) [31]. As the method works on bi-allelic SNV with different genotypes between donor and recipient, we computed for each SNV the probability of having different genotypes for two individuals (Table S1). Only 437 SNVs have a probability of having different genotypes for two individuals higher than 10%.
As a measure of quality control, we first checked the balance of reference and alternative alleles in heterozygous calls. The alternative allele frequency is expected to be around 50% in heterozygous genotypes. However, we observed the presence of SNVs with skewed alternative allele frequencies (Fig S2). We noticed the recurrence of such unbalance in every replicate of both samples (Fig S3 for examples). We investigated whether the amplification-based strategies for DNA target enrichment affect the allele dropout causing the skewed alternative allele distribution. We found that the SNVs with a skewed distribution all fall into the primer sequence regions. We therefore filtered out SNVs falling into these regions and kept the 1,000 most common SNVs in the general population. These 1,000 SNVs will be used as a SNVs panel for detecting DNA fraction in a combination of two DNA sources (observed α) in the rest of the study. The alternative allele frequency was balanced in these 1,000 SNVs (Fig S4). Moreover, the maximum error of estimating the observed α based on these 1,000 SNVs in all replicates was < 0.0034 (mean error = 0.0028 ± 0.00037).
We then mixed 90 % DNA from S1 and 10% DNA from S2 in three replicates. For each replicate, targeted DNA sequencing was performed and the observed α was estimated. The preparation of the mixture was based on total DNA content in the samples. However, the presence of bacterial DNA in urine samples can strongly skew the estimation of human DNA concentration measurement [32]. We assessed the actual DNA concentration of S1 and S2 in urine by considering the mean observed α over the three replicates to 0.053. This indicates that S1 DNA concentration is ∼19 times lower than S2 DNA concentration. Considering the estimated S1 and S2 DNA concentration, the maximum error of the observed α was < 3.5% in the three replicates (Fig 2). We extended the analysis to two levels of DNA mixture scenarios: (i) 70% DNA from S1 and 30% DNA from S2, (ii) 50% DNA from S1 and 50% DNA from S2. Each scenario was replicated three times and targeted DNA sequencing was performed for each replicate. The observed α was similar in the three replicates of all three scenarios (scenario i: mean observed α = 0.11 ± 0.036, scenario ii: mean observed α = 0.032 ± 0.00048). Considering the estimated S1 and S2 DNA concentration, the maximum error of the observed α was < 3.8% in all replicates of both scenarios (0.037 in scenario (i) and 0.018 in scenario (ii)) (Fig 2).
Five scenarios of DNA mixtures and three replicates for each scenario were performed. From left to right: 100% from individual S1 and 0 % from individual S2, 0% from individual S1 and 100% from individual 2, 50% from individual 1 and 50% from individual 2, 70% from individual S1 and 30% from individual S2, 90% from individual S1 and 10% from individual S2. The estimated fractions (estimated α) are represented by black dots. The expected fractions when DNA concentration in individual Sl was 19 times lower than DNA concentration in individual S2 are represented by red dots. The expected fractions before correction for DNA concentration are represented by blue dots.
Simulation of the effect of family relationship and ethnicity on the estimation of α
The most challenging scenario is that of one sibling donating a kidney to another, as they share 50% of their genome. The extreme case of mono-zygotic twins, where both genomes are identical, can of course not be addressed with our method. To numerically explore this “worst case” scenario, we used whole genome sequencing data from 91 siblings [33] and then generated 100 combinations of every two siblings. Self-reported relationship was confirmed using identity by state (Fig S5). For each pair of siblings, we simulated donor and recipient DNA sequences by varying α from 0 to 0.5 in steps of 0.01 and resampling the mean coverage at 5,000 reads. The maximum absolute error was observed when the expected α = 0.07: observed α = 0.034 (Fig 3).
Each dot represents in A) the maximum absolute error and in B) the maximum relative error for each expected (α) from 0 to 0.5 in steps of 0.01 over 100 pairs of siblings (red), 100 pairs of individuals belonging to the same population (green) and 100 pairs of individuals belonging to different populations (blue). Afr = Africans. Amr = Americans. Eas = East Asians. Eur = Europeans. Sas = South Asians.
Simulation of the effect of population origin on the estimation of α
As per comparison to siblings, we assessed the effect of donor and recipient ethnicity on our method. We applied our method on simulated pairs of individuals belonging to the same and to different populations of the 1,000 genomes project [34]: Africans, Americans, east Asians, Europeans and south Asians. The absolute error was < 0.04 in all scenarios (Fig 3). As expected, the absolute error was lower when the two DNA sources belong to different populations (mean maximum absolute error = 0.018 ± 0.005) than when they belong to the same population (mean maximum absolute error = 0.022 ± 0.007). Additionally, the maximum relative error was comparable in all scenarios together (mean maximum relative error = 0.836 ± 0.137) compared to when the two DNA sources belong to the same (mean maximum relative error = 0.764 ± 0.207) or different populations (mean maximum relative error = 0.856 ± 0.077). These results confirm the power of the method for detecting the DNA fraction in a combination of two DNA sources independently of the familial relationship or their ethnicity.
Application to urine samples from clinical kidney allograft patients
To test our method in a real-case scenario, we used DNA extracted from 32 urine samples of 26 kidney allograft recipients collected at the time a diagnostic biopsy was performed and divided the sample into three groups based on their Banff classification: “Acute Tubular Injury” (ATI, N = 12), “Acute Rejection” (AR, N = 11) and “No Observed Pathology” (N = 9) (Table 1). DNA was extracted from urine and deep targeted sequencing was performed for the 32 samples. Reflecting the effect of depth of coverage on the accuracy of detecting observed α in simulated data, we set the mean depth of coverage to ∼ 14,000 reads. After read alignment we applied our method to estimate the donor to total DNA fraction (Fig 4).
Box plots and individual data points of the estimated fraction (observed α) are estimated from deep urine DNA targeted sequencing. AR: Acute Rejection. ATI: Acute Tubular Injury. A statistically difference was observed between all the diagnosis phenotypes (P = 0.035, Kruskal-Wallis test). By Dunn’s test, difference in observed α between the two pathologies and no pathology group was statistically significant: ATI vs no-pathology: P = 0.0064 and AR vs no-pathology: P = 0.026. Pathologies pairwise comparison was not statistically significant (P > .05).
The difference of observed α between the diagnosis phenotypes was statistically significant (P = 0.035, Kruskal-Wallis test). We observed a significant difference when comparing the two transplant kidney pathologies ATI and AR to the No Pathology group (P = 0.0064 and P = 0.026, Dunn’s test for ATI vs no pathology and AR vs no pathology, respectively). However, no significant difference was observed in observed α when comparing the two pathologies ATI to AR (P = 0.31, Dunn’s test).
Inference of donor and recipient ethnic origin
In the absence of donor and recipient genomes, it is impossible to determine whether the observed α represents the donor or the recipient fraction of the total DNA. However, in cases were recipient and donor gender or ethnicity differ, this issue can be addressed. The urine DNA sequencing we performed here did not target genomic regions of the Y chromosome. Thus, detecting recipient and donor gender cannot be carried out using the actual data, but could be easily amended in future sequencing panels.
To predict donor and recipient ethnicity, an estimation of both recipient and donor genotypes is needed. For each of 1,000 SNVs, we computed the fraction of the alternative to total alleles. We then used the observed α to compute the nine expected fractions of the alternative allele (Table 2). We then used a cost function to estimate donor and recipient genotypes that minimizes the difference between the nine expected fractions and the observed fraction of the alternative to total alleles. Based on these estimated genotypes, we applied a supervised classification method to predict the recipient and the donor ethnicity as following: as donor-specific DNA fraction has been shown to be higher in the no-pathology group [28], we supposed the observed α to represent the donor-to-total DNA fraction and computed the probability of donor and recipient for belonging to one of the three populations: African, East Asian and European (see Methods). Both donor and recipient are assigned to the population showing the highest probability and then compared to the self-reported ethnicity. Seven recipients and eight donors were excluded from the prediction because they belong to a mixed self-reported population or the observed α was ∼ 0 so the prediction of donor genotypes was impossible. The prediction was inconclusive (Probability of prediction < 70%) for 5 recipients and 8 donors. In 16 over 20 recipients (80%) and 15 over 16 donors (94%), the probability of prediction was higher than 70%. However, only one AR sample and one ATI sample have donor and recipient ethnicity mismatch for whom the prediction was conclusive. In these two samples (European donor and African recipient for both samples), the prediction was in agreement with the self-reported ethnicity. Hence, due to the small number of self-reported ethnicity mismatches, it is impossible to confirm whether the observed α represents the donor or the recipient DNA fraction (as observed α < 0.5 by definition).
Discussion
Different omics technologies, including mRNA measurement by PCR [7], metabolomics [12] and RNA-sequencing [13] have been applied by our group and others to identify non-invasive biomarkers for kidney allograft rejection. Here, we present a new approach based on targeted deep-sequencing of DNA obtained from urine samples. We extended methods originally used for the assessment of DNA contamination to estimate the fraction of recipient DNA in a two-sources mixed DNA sample [29,30]. We used in silico simulations to obtain a suitable parameter range for the method to be sufficiently accurate in estimating the fraction of a two-sources DNA mixture. We then experimentally evaluated the accuracy of the estimation method using controlled mixtures of two DNA sources. Allele drop-out occurs in amplification-based target enrichment when a variant is located in a primer region and prevents primer hybridization, leading to failed amplification and allele bias [35]. Our method overcomes these unexpected artefacts due to DNA sequencing. Other algorithm for estimating the donor-specific DNA fraction requires the donor and recipient relationship information [22]. Here, we found that ethnicity and familial relationship between donor and recipient appear to have a lower impact as compared to previously presented methods
We tested our method on clinical samples from patients with and without kidney rejection events. We compared the α value obtained from urine DNA sequencing reads of kidney allograft recipients with kidney injury associated with AR and ATI. The alpha value was significantly different in patients with AR and ATI compared to those without kidney pathology. The calculation of alpha is based on the assumption that the DNA isolated form the urine is derived from the transplanted kidney where both the recipient and the donor DNA are present: Recipient DNA from the infiltrating immune cells and the donor DNA from the kidney parenchymal cells. Indeed, we have recently shown in kidney recipients with donor-recipient gender mismatch by counting Y chromosome-derived cell free DNA that donor-specific DNA fraction was lower in recipient with UTI as comparing to the no UTI and higher in recipients with BKVN comparing to the no BKVN [28]. Thus, our approach might be considered as a potential new diagnostic signature measured in urine specimens.
We were not able to assert whether the recipient urine DNA is mostly donor’s or recipient’s. Studies have shown that both AR and ATI are associated with allograft damage indicating that there will be some donor DNA in the urine. But AR is also associated with recipient immune cell infiltration while ATI is not [36]. Thus, the fraction of recipient to donor cells in the urine should be higher for AR compared to ATI and perhaps should be just the opposite with AR showing a fraction of donor to recipient of much lower than 0.5 and ATI showing a fraction of donor to recipient of much greater than 0.5. To address this, a future complementary analysis on a bigger sample having donor and recipient ethnicity and/or gender mismatches will be worthwhile.
Methods
Algorithm
The algorithm is inspired by the contamination estimation in DNA sequencing method [29,30]. We hypothesize that recipient urine contains a mix of recipient and donor DNA. Let N be the number of bi-allelic SNVs sequenced from recipient urine DNA and each SNV i is covered by Mi reads. Let giR and giD be the genotype of recipient and donor at the SNV i, respectively. Both giR and giD are unknown. Limiting the analysis on bi-allelic SNVs only leads to three possible genotypes for recipient and donor at each SNV i: giR (giD)= {0, 1, 2} where 0 = homozygous wild type, 1 = heterozygous and 2 = homozygous for the alternative allele. The likelihood of the donor-specific DNA fraction (α) will be:
Where bij represents the read j covering the SNV i and eij represents the sequencing error of SNV i at the read j : P(eij = 1) = 10−Qij/10 and P(eij = 0) = 1-P(eij = 1) and Qij represents the minimum between the base quality of the read j at the position of the variant i and the mapping quality of the read j. The probability of bij conditioned to the recipient (donor) genotype giR (giD) and the sequencing error eij is described in Table 3. Finally, we used the simulated annealing approach together with a grid search to find α that maximizes the likelihood function [37]. The method was implemented in a Python script.
As the likelihood function for estimating α requires a balance in alternative/reference allele distribution in heterozygous calls for both recipient giR and donor giD genotypes, very deep recipient urine DNA sequencing will provide this allele balance (Table 4).
A total of 10,000 simulations were performed for each proposed M.
Simulated SNVs based on general population structure
To assess the effect of number of SNVs (N) and mean depth of coverage (M) on the allele balance and thus the prediction accuracy of the likelihood function, we simulated 2 independent SNV-sets each set containing N common SNVs (minor allele frequency ≥ 5%) and is covered on average by M reads. We varied N and M in 35 scenarios where N={10, 50, 100, 500 and 1,000} and M={10, 50, 100, 500, 1,000, 5,000 and 10,000}. We merged α reads from set1 and 1-α from set2 randomly generating a combined SNV-set and applied the likelihood function on the combined SNV-set to estimate the observed α. Because L(α) = L(1-α), we restrict 0 ≤ α ≤ 0.5 in steps of 0.01 generating 51 scenarios. A thousand replicates for each scenario and for each α were performed to obtain an empirical distribution.
Sequencing of urine DNA from a pair of healthy individuals
We extracted DNA from urine of two healthy individuals; S1, a 30 years old European woman and S2 a 30 years old Arab woman using the Qiagen® Allprep Mini Kit. extraction kit. The DNA concentration was similar for the two individuals: 35ng/μl. We mixed DNA from S1 and S2 to achieve 5 scenarios: i) 100% from S1, ii) 100% from S2, iii) 90% from S1 and 10% from S2, iv) 70% from S1 and 30% from S2, v) 50% from S1 and 50% from S2 and each scenario was replicated three times. We performed deep targeted DNA sequencing on each replicate. GeneReadDNA Seq Targeted Pannels V2; Human Breast Cancer Panel (Qiagen, USA) was used to perform target enrichment by multiplex PCR. The breast cancer panel consists of four primer pools yielding 2,915 amplicons. Briefly, 40ng of each gDNAs was amplified using PCR reagents with 4 primer pool mixes following the manufacturer’s protocol. After the completion of the 4 PCR reactions, the 4 products were combined and the enriched DNA was purified using Agencourt AMPure XP beads (Beckman Coulter, USA). The concentration and the size of the purified amplicons were determined using Qubit 2.0 Fluerometer dsDNA BR assay kit (LifeTechnologies, USA) and Agilent BioAnalyzer 2100 High-Sensitivity DNA kit (Agilent Technologies, USA). A total amount of 80-160 ng of purified enriched DNA was used as template to generate NGS libraries. The NGS libraries were prepared using NEBNEXT Ultra II DNA Library Prep Kit (New England Biolabs, USA) and NEXTflex DNA Barcodes (Bio Scientific, USA). All library preparation steeps were performed according to the manufacturer’s protocol. The size and quality of the final libraries were analyzed using Agilent BioAnalyzer 2100 with 1000 DNA kit (Agilent Technologies, USA). The quantified libraries were then normalized, pooled, and spiked with 5% PhiX control library (Illumina, USA). Finally, thepooled libraries were sequenced on a singles lane of Illumina Hiseq 4000 (Illumina, USA) paired-end 150 bp run.
Obtained reads were aligned to the human genome reference hg19 using bwa [38]. A total of 51,893 bi-allelic SNVs from the Exac project are included in the targeted genomic regions. The method works only on SNVs with different genotype between donor and recipient. Under Hardy Weinberg assumption, we assessed the probability of having a different genotype for each SNV i as:
Where
and
represent the donor and recipient genotype, respectively.
and
represent the donor and recipient reference allele frequency, respectively.
and
represent the donor and recipient alternative allele frequency, respectively.
To avoid allele dropout due to primer annealing region, we filtered out 24,237 SNVs falling at the primer sequencing regions [35]. From the 27,656 remaining SNVs, we selected the 1,000 most common and applied the likelihood function after filtering out the reads carrying the variant at the last 20 base pairs [39].
Simulated SNVs in pairs of individuals from the same and different ethnicity
We used individuals from the 1,000 genomes project phase 3 representing five major populations: AFR, AMR, EAS, EUR and SAS [34]. We randomly selected two individuals and aimed to cover all possible situations; five cases where Individual1 and individual2 belong to the same population and ten cases where individual1 and individual2 belong to different populations. We extracted the 1,000 SNVs described previously from Individuals1 and Individual2 and then merged α reads from individual1 and 1-α from individual2 generating a combined SNV-set. We varied α from 0 to 0.5 in steps of 0.01 and fixed the mean depth of coverage at M = 5,000. We applied the likelihood function to assess α for each combined SNV-set generated. We repeated the individual selecting process 100 times to obtain an empirical distribution for each situation.
Simulated SNVs in pairs of real siblings
We performed WGS on Illumina HiSeq 2500 sequencer of 91 Qatari siblings from 27 nuclear families containing at least 2 siblings at up to 7 siblings [33]. Reads were aligned to the hg19 reference genome using bwa [38]. Sequence alignment files were filtered and genotypes were called using the Genome Analysis Tool Kit best practices pipeline and variants were called using HaplotypeCaller [40,41]. We used Plink identity by state to confirm the familial relationship [42] (Fig S5). We extracted the 1,000 SNVs described previously from each sibling and merged each couple generating 100 combined SNV-sets. We applied the likelihood function to assess α for each combined SNV-set generated by varying α from 0 to 0.5 with a step of 0.01 while the mean depth of coverage was set at M = 5,000.
Donor-specific DNA fraction in real kidney recipient urine DNA
We studied 32 biopsy matched urine specimens collected from 26 kidney allograft recipients who were enrolled in the IRB approved study protocol entitled “Use of urine PCR to evaluate renal allograft status” at Weill Cornell Medicine-New York Presbyterian Hospital. Kidney allograft biopsies were classified as acute rejection (n= 12), acute tubular necrosis (n=11) and normal histology (n=9) using the Banff 2017 schema [43] (Table 1). DNA was extracted from urinary cells and deep targeted DNA sequencing was performed on all samples. Briefly, 50cc of fresh urine was centrifuged at 2,000g for 30 minutes at room temperature and the urine cell pellet was harvested after removing the supernatant. After washing the urine cell pellet with 1ml PBS, the cells were lysed using 350ul of Buffer RLT from Qiagen® and DNA was isolated from the cell pellet using Allprep DNA/RNA/Protein Mini Kit from Qiagen®. Total DNA was quantified using the NanoDrop™ Spectrophotometer. DNA sequencing was performed as previously described for the pair of healthy individuals. Obtained reads were aligned to the human genome reference hg19 using bwa [38]. We filtered out low quality reads using an in-house python script. We applied the likelihood function on the 1,000 SNV-set to estimate the recipient-specific DNA fraction. Nonparametric Kruskal-Wallis test was applied to assess the correlation between observed α and all the diagnosis phenotypes. Dunn’s function was applied to test the pairwise association. R software was used for statistical tests and generating graphs [44].
Ethnicity estimation for donor and recipient
We combined the observed α in the kidney transplant patients with a cost function to predict the genotype of both recipient (giR) and donor (giD) at each SNV i. First, we computed the expected fraction of the alternative to the total allele (reference + alternative) expi to all 9 possible combinations of giR and giD (Table 2). The observed fraction of the alternative to total alleles (obsi) at the SNV i is defined by:
Then, we used the cost function to determine giR and giD that minimizes the difference between the 9 expected (expi) and the observed (obsi) fraction of the alternative to total alleles:
Once giR and giD estimated, we performed a partial least square analysis (PLS) using 3 subpopulations from the 1,000 genomes project African, east Asian and European populations using the mixOmics R package [45] and then predicted the ethnicity of donor and recipient in the real kidney transplant samples. The leave 2 out 1,000 fold cross validation showed the highest prediction accuracy = 81.6% reached when using the Yoruba in Ibadan in Nigeria, the Southern Han Chinese and the Toscani in Italia amongst the African, east Asian and European subpopulations (Fig S6). We excluded the American and the south Asian populations because using 1,000 SNVs only is too small to perform a reliable PLS on 5 populations where the highest cross validation prediction accuracy was too low on five populations: 54.8% (Fig S6). Additionally, none of the donors and recipients involved in the study belongs to the south Asian or the American populations.
Data Availability
Data cannot be shared publicly because of ethical restrictions. Data are available from the Weill Cornell Medicine Institutional Data Access / Ethics Committee (contact via irb{at}med.cornell.edu) for researchers who meet the criteria for access to confidential data.
Ethics Statement
Kidney transplant recipients reported herein provided written informed consent to participate in Weill Cornell Medicine IRB approved study protocol entitled “Use of urine PCR to evaluate renal allograft status” and the informed consent was obtained before collection of urine specimen and inclusion in the study protocol. The participants provided consent for storage of biospecimens and use of these specimens in future research. The clinical and research activities that we report here are consistent with the principles of the “Declaration of Istanbul on Organ Trafficking and Transplant Tourism” and the “World Medical Association Declaration of Helsinki on Ethical Principles for Medical Research Involving Human Subjects” [46,47].
Acknowledgments
This work was supported by the Biomedical Research Program at Weill Cornell Medicine in Qatar, a program funded by the Qatar Foundation. This work was fund, in part, by awards from the NIH to MS (NIH MERIT Award, R37AI051652), TM (K08DK087824), and Weill Cornell Medical College (Clinical and Translational Science Center Award UL1TR000457). Lastly, this work was supported, in part, by the Qatar National research Fund (National Priorities Research Program grant # NPRP12S-0227-190173).