Abstract
Despite decades of research and advancements in diagnostics and treatment, tuberculosis remains a major public health concern, particularly in low- and middle-income countries. New bioinformatics and computational methods are needed to interrogate the intersection of host- and bacterial genomes and identify novel targets for anti-tuberculosis drugs. Host genotype datum and paired infecting bacterial isolate information were analysed for associations using a multinomial logistic regression framework implemented in SNPTest. Two geographically distinct cohorts were evaluated: a cohort of 947 participants self-identifying as belonging to a five-way admixed South African population and a Ghanaian cohort consisting of 3 311 participants. We report potential associations between host genetic variants and multiple members of the Mycobacterium tuberculosis complex (MTBC). Although none of the variants analyzed in the South African cohort passed the GWAS cut-off for significance, 32 single nucleotide polymorphisms were identified in the Ghanaian cohort as being statistically significantly associated with risk for infection with strains of different members of the MTBC. Further analysis revealed that two of these SNPs were directly genotyped, and the rest were imputed using the 1000 Genomes Phase 3 reference panel. The availability of paired host-pathogen data is imperative for investigating strain-specific interactions between MTBC and its host. As demonstrated by this study, the implementation of a multinomial logistic regression using paired host-pathogen data may prove valuable for further research investigating the complex relationships driving infectious disease.
1. Introduction
Tuberculosis (TB), a disease primarily affecting the lungs, is caused by pathogenic members of the Mycobacterium tuberculosis complex (MTBC) such as M. africanum (M. africanum) and M. tuberculosis (M. tb). Infection alone, however, is not sufficient to cause disease (Bellamy, 1998; Kinnear et al., 2017; Möller and Hoal, 2010). Each branch of the MTBC comprises several clades of specific strains with variable virulence and disease-causing mechanisms (Brites and Gagneux, 2015; Gagneux, 2012; Hoal et al., 2017). While M. africanum is the main cause of TB in West-African countries including Ghana, M. tb is responsible for TB cases in most other parts of the world (Gagneux and Small, 2007). In addition to socio-economic- and environmental factors (Pratiwi, 2016; Seddon et al., 2013), predisposing diseases (Cheng et al., 2016; Lim et al., 2017), and the genetic make-up of the human host (Chimusa et al., 2014; Kinnear et al., 2017; Oki et al., 2011; Png et al., 2012; Salie et al., 2014; Thye et al., 2012, 2010) have been shown to play pivotal roles in determining susceptibility to the disease.
Several associations between genomic loci and susceptibility to infectious diseases such as TB (Bellamy, 1998; Herb et al., 2007; Thye et al., 2010), malaria (Rockett et al., 2014) and HIV (Pastinen et al., 1998) have been reported using candidate-gene association studies, linkage studies, and genome-wide association studies (GWAS). Using these approaches, a number of genes encoding proteins of the host immune system have been associated with susceptibility to TB, including human leukocyte antigens (HLAs), NRAMP1, mannose binding lectin (MBL), IFN-gamma (IFN-□), and Vitamin D Receptor (VDR) (Bellamy, 1998; Salie et al., 2014; Søborg et al., 2003; Yim and Selvaraj, 2010). While several studies have investigated the genetic association with TB (Chimusa et al., 2014; Hoal et al., 2017; Hong et al., 2017; Kinnear et al., 2017; Schurz et al., 2019a), few have investigated the association between gene variants and susceptibility to particular strains of the MTBC (McHenry et al., 2019; Omae et al., 2017).
Candidate gene studies compare allelic and genotyping frequencies of a specific genetic marker between a group of unrelated cases and controls. In a large cohort of 1 916 sputum- positive Ghanaian TB patients genotyped for the ALOX5 g.760G>A variant, individuals who were heterozygous for the polymorphism were found to be at increased risk for developing TB (Herb et al., 2007). Furthermore, patients harboring the exonic variant (g.760A) had a greater association (OR= 1.70; [95% CI: 1.2–2.6]) with infection caused by M. africanum West Africa-2 (Herb et al., 2007). Modelling a recessive mode of inheritance, a protective association (OR= 0.60; [95% CI: 0.4–0.9]) was identified between the occurrence of TB and the MBL2 G57E variant in another cohort of Ghanaian patients (Thye et al. 2011). TB patients belonging to the Ewe population were significantly more likely to be infected with M. africanum (OR= 3.02; [95% CI: 1.67–5.47]) and further stratification by lineage revealed that the association was strongly driven by infection with members of M. africanum West Africa-1 (Asante-Poku et al., 2015). HLA types are also known to be important in the immune response to pathogens.
In a South African candidate-gene association study of HLA alleles and the M. tb strain responsible for active TB, the HLA-B27 allele was found to decrease risk for an additional disease episode due to a Beijing strain, following multiple episodes of disease caused by a Beijing strain (Salie et al., 2014). In addition, specific HLA types were found to be associated with disease caused by the different strains investigated (Salie et al., 2014). Finally, Caws and colleagues were investigated the susceptibility of the human host to different M. tb strains. Using a candidate-gene approach, a cohort of 237 adult Vietnamese TB patients were analysed. The authors concluded that for this cohort, individuals carrying the C allele of the toll-like receptor-2 (TLR2) T597C polymorphism were significantly more likely to develop TB caused by mycobacteria belonging to the East-Asian/Beijing strain family (OR = 1.57 [95% CI 1.15-2.15]) (Caws et al., 2008).
A limitation of the candidate-gene study design however, is that it requires an a priori hypothesis regarding which genes to target in the association analysis. To address this limitation, GWAS have become a popular alternative for identifying genetic associations with disease. Through genotyping of many common genetic variants, GWA studies enable a global interrogation of a host’s genome for associations to disease, without the limitation of predefined candidate genes (Hirschhorn and Daly, 2005). While genome-wide associations between the human host and TB have been extensively studied in several populations, the susceptibility of the host to different members of the MTBC has only in recent years gained some attention.
The first GWAS to investigate genetic susceptibility to strains of different MTBC lineages aimed to identify genome-wide associations with TB onset, stratifying a Thai cohort by infecting MTBC lineage, and the age at onset (Omae et al., 2017). The study initially attempted to identify age-related associations between five MTBC lineages and two age-stratified groups of TB participants, namely 219 young cases (under the age of 45), and 467 old cases (over the age of 45). To reduce the complexity of the association tests, the MTBC lineages were tested as one lineage versus a collective of all other lineages. After applying Bonferroni corrections, none of the genotyped single nucleotide polymorphisms (SNPs) reached genome-wide significance for association with either of the age-groups for any of the five MTBC lineages. However, when reducing the five lineages to two groups consisting only of ‘Beijing’ and ‘non-Beijing’ cases, and testing for age-related association to TB, the authors identified a single SNP on chromosome 1p13, rs1418425, reported to have a significant association to non-Beijing infected cases classified in the “old” age category (P = 1.58 x10−7; OR = 1.62 [95% C.I.: 1.35-1.93]). The authors were able to replicate the SNP in two independent cohorts, further demonstrating the importance of performing GWAS with a specific focus on pathogen lineage (Omae et al., 2017).
Another study investigated the coevolution of M. tb and it’s human host. The article is based on the hypothesis that longstanding coexistence between the human genome and M. tb lineage may reduce the risk of progressing to active TB or minimize the severity of disease. The authors investigated TB severity (as measured by the TBScore) in two cohorts from Uganda with paired M. tb-human DNA available to determine if interactions between M. tb lineage and human genetic variants exist. Although no association was found between lineage and disease severity, an interaction between a single nucleotide polymorphism (SNP) in SLC11A1 and the L4-Ugandan lineage were identified in both cohorts. In addition several IL12B polymorphisms were associated with disease severity (McHenry et al., 2019).
In order to improve our understanding of the genetic susceptibility to the MTBC clades, this study leveraged genome-wide genotyping data from the host and pathogen data to perform a genome-wide screen for clade-specific genetic associations in cohorts originating from two distinct populations.
2. Results
2.1. Participant recruitment, sample collection, and SNP genotyping
All participants recruited for the South African and Ghanaian cohorts provided blood and sputum samples for SNP genotyping, and MTBC strain identification, respectively. The South African cohort comprised 947 participants successfully genotyped for 397 337 variants, while the Ghanaian cohort consisted of 3 311 participants successfully genotyped for 783 338 variants (Table 1). Of the Ghanaian samples included in this study, 1 359 were TB cases and 69% of the participants were male (Table 1). Principal Component Analysis of the Ghanaian cohort showed contributing ethnicities from the Akan, Ga-Adangbe, Exe, and several other ethnic groups from northern Ghana (Thye et al. 2012).
2.2. Defining MTBC clades and superclades
2.2.1. South Africa
The MTBC strains obtained in the South African cohort contained strains of eight of the 12 lineages, namely Beijing, CAS (represented as CAS1 in the infection database), Haarlem, Haarlem-like, Low-copy Clade (LCC), T, and Quebec (Fig. 2A). During the grouping strategy, Beijing and CAS were clustered to form the “BeijingCAS1” superclade (Fig. 2B). Similarly, Haarlem, Haarlem-like, and LCC cases were clustered to form the “HaarlemsLCC” superclade (Fig. 2B). A clade denoted as “Other” was also present in the South African cohort but does not appear on the phylogenetic tree (Fig. 1) and thus was kept as a distinct member during the grouping strategy. The T superclade was excluded from subsequent analysis due to low frequency in the cohort after clustering into superclades, leaving five superclades represented by this cohort (Fig. 2B). MTBC clade distributions were dominated by the LAM clade and closely followed by Beijing. Superclade distributions showed similar frequencies for LAM, BeijingCAS1, and HaarlemsLCC, while the Quebec and “Other” clades were the least abundant (Fig. 2B).
2.2.2. Ghana
The strains obtained in the Ghanaian cohort contained 12 clade annotations obtained from spoligotyping. The afri-181 and afri-438 clades were represented by M. africanum on the phylogenetic tree and were subsequently grouped with EAI at the point of divergence, and named as the EAI_afri superclade (Fig. 1). Beijing and CAS were merged, as were Haarlem and X, into the “BeijingCAS’, and “HaarlemX” superclades, respectively. T and U clades were also clustered (Fig. 1). The “Ghana-2” clade was kept as a distinct superclade, while LAM and CAM were grouped based on the similarity in their spoligotyping patterns as illustrated in Stucki et al. 2016. This cohort thus contained 12 clades and six superclades after clustering (Fig. 2C, and Fig. 2D, respectively).
After grouping the clades into superclades, the strains of the EAI_afri and LAM_CAM superclades occurred most frequently in the cohort with both superclades having a frequency greater than 400 in the dataset (Fig. 2D). The HaarlemX, and T_U superclades had a frequency of approximately 160, and 250 in the cohort, respectively, while the BeijingCAS and Ghana-2 superclades were in least abundance (Fig. 2D).
2.3. Genotype data quality control, haplotype phasing, and genotype imputation
For the South African cohort, analysis using Genotype Harmonizer yielded a dataset of 947 participants with 356 165 variants. After quality control filters were applied, 919 participants and 239 612 variants remained. For the Ghanaian cohort, similar analysis yielded a dataset of 3 311 participants with 713 223 variants, and genotype QC filters resulted in a further reduction to 617 409 variants. No additional QC steps were used for the IH protocol, while the results of the additional data preparation steps for both cohorts imputed using the MIS tool are detailed in Table 2.
2.4. Selection of high-quality imputed genotype data
Imputation results are presented for Chromosomes 1, and X for the South African cohort, while for the Ghanaian cohort, Chromosome X data was not available, and thus the imputation of two autosomes are reported Table 3.
SNP densities were calculated for the proportion of SNPs with a quality score metric greater than 0.45. For the South African cohort, the SIS workflow using the AGR resource imputed the highest proportion of SNPs, whereas for the Ghanaian cohort, the MIS workflow using the CAAPA resource imputed the greatest proportion of SNPs with a quality metric greater than 0.45 (Table 3). For both chromosomes 1 and X, imputation using either the 1000G or the CAAPA resource with the MIS performed the worst for the South African cohort (Schurz et al., 2019b) with the maximum median quality score only reaching 0.82 at a MAF of 50%. In comparison, the SIS-AGR workflow outperformed all other workflows, and the result correlated with the AGR imputing the highest SNP density for chromosome 1 and chromosome X (Schurz et al., 2019b).
For the South African cohort, genotype datum imputed with the AGR reference panel was selected as the dataset with the highest imputation quality across all reference panels assessed. After removal of monomorphic sites and filtering on an INFO score of 0.45, 28 566 283 SNPs for 919 samples remained. After removing the 136 related individuals identified prior to imputation, and filtering for SNP- and sample missingness, and MAF, a dataset of 7 145 406 variants for 783 participants remained. Of the 525 clade-matched samples, 445 were extracted from the dataset of samples passing QC and used in the association analysis.
For the Ghanaian cohort, despite the CAAPA resource imputing the greatest SNP density (Fig. 3D and Fig. 4D), the IH workflow using the 1000G reference panel imputed the highest quality of SNPs per MAF bin but was very closely followed by the other workflows and reference panels from the 20-30% MAF bin upwards (Fig. 5). Thus, the IH dataset, imputed with the 1000G reference panel was selected as the best dataset. After filtering monomorphic SNPs, INDELS, and variants not reaching the INFO score threshold, 25 968 622 SNPs remained for 3 239 samples. After filtering for MAF, SNP- and sample missingness, and removal of 93 related individuals identified pre-imputation, the dataset comprised of 5 275 890 variants for 1 273 clade-matched samples.
2.5. Covariable data
For the South African cohort, the covariables age, sex, and ancestry proportions were available for all samples. Ancestry proportions were for the European, African, San, South-Asian, and East-Asian ancestries. Of the 445 clade-matched samples that passed the post-imputation QC, ancestry proportions were available for 357 samples as generated previously using the Affymetrix genotype datum for this cohort and ADMIXTURE software (Daya et al., 2013). For the remaining 88 samples, ancestry proportions were calculated from genotype datum generated by the MEGA array using ADMIXTURE. The East-Asian ancestry, being the smallest contributing ancestry proportion, was not included as a covariable in the analysis. Variances were calculated for each of the four remaining ancestry proportions and determined to be 0.027 (San), 0.035 (African), 0.014 (European), and 0.009 (South-Asian). As the variances were greater than the minimum cut-off of 0.001, they were included as covariables in the analysis.
For the Ghanaian cohort, age, sex, and ethnicity were included as covariables. Further examination of admixture for the Ghanaian dataset revealed that the cohort was not highly admixed, and thus ethnicity in the form of principal components was used. One sample passing QC filters did not have one of the covariables and was excluded from the dataset leaving 1 272 samples for the association analysis. The variance in the PCs provided for the cohort was calculated to be 0.0002 (PC1), 0.0003 (PC2), and 0.0004 (PC3) and therefore determined to be insufficient for inclusion in the association analysis as covariables.
2.6. Genome-wide Association Analyses
2.6.1. South Africa
An MLR was conducted for the 445 superclade-matched South African samples using SNPTEST under an additive model. All results were reported using the LAM superclade as the baseline. Although none of the SNPs passed the GWAS cut-off for significance, 4 631 SNPs had an LRT p-value less than 0.0005 and eleven SNPs had an LRT p-value less than 1 x 10−6 (Table 4). Odds ratios (OR) are reported (Table 4) and standard errors of the odds ratios are shown in Fig. 6A. A single SNP, rs9389610, located on chromosome 6, had a p-value of 1.60 x 10−7. Individuals with the A allele of this SNP were twice as likely to be have TB due to infection with a member of the BeijingCAS1 superclade (OR: 2.19) than due to either the HaarlemsLCC (OR: 1.07) or LAM superclades. Individuals with the A allele were only slightly more at risk of being infected with a member of the ‘Other’ superclade (OR: 2.78), when compared to the BeijingCAS1 superclade (OR: 2.19), and were very unlikely to be infected with a member of the Quebec superclade (OR: 0.25).
For the four SNPs located on chromosome 5, individuals with the risk allele doubled the chances of being infected with the Quebec superclade (OR: ∼2) while halving the risk of being infected with a member of the “Other” superclade (OR: ∼0.5) (Table 4). Lastly, for the six SNPs located on chromosome 17, the risk allele was shown to double the risk of being infected with a member of the LAM superclade than with the HaarlemsLCC (OR: ∼0.50) superclade. Individuals with the risk allele of these six SNPs were also equally at risk of being infected with a member of the BeijingCAS1 or LAM superclade and were twice as likely to be infected with the member of the “Other” superclade (OR: ∼ 2) when compared to the BeijingCAS1 superclade (OR: ∼ 1) (Table 4).
The VEP tool was used to retrieve gene annotations for SNPs of interest. While these SNPs are unlikely to have a direct effect on the gene expression itself, the SNP may be in linkage disequilibrium with other nearby SNPs which do have a direct effect on the gene. For the South African cohort, the most significantly associated SNP was rs9389610 (g.139039029G>A), located on chromosome 6. This SNP is an imputed SNP and its two closest directly genotyped SNPs were rs4896385 (g.139011266G>T), and rs7742202 (g.139074280A>G). The rs4896385 SNP is located in NHSL-1, while the rs7742202 SNP is located in GVQW2. Neither of these genes have been previously shown to be involved in the pathogenesis of TB. The four SNPs located on chromosome 5 (Table 4) were annotated to the StAR Related Lipid Transfer Domain Containing 4 gene (STARD4) using the VEP tool, while the six SNPs located on chromosome 17 were annotated to the TANC2 gene.
2.6.2. Ghana
Using SNPTEST under an additive model, an MLR was also conducted for 1 272 superclade-matched Ghanaian samples. For this cohort, all association results were reported using the LAM_CAM superclade as the baseline. In summary, a total of 32 SNPs had an LRT p-value less than 1×10−6 (Table 5) and were significantly associated with the MTBC superclades.
Several imputed SNPs were shown to dramatically increase the risk of being infected by a particular superclade. For example, the risk allele of the SNP rs577800201 (g.20476046C>T) was shown to increase the risk of being infected with the EAI_afri superclade by 93 times, compared to the baseline LAM_CAM superclade, and was annotated by the VEP to map to ACSM2A. The risk allele of the SNP rs374315920 (g.38496435C>T) located on chromosome 17, was also found to increase an individual’s risk of being infected with the EAI_afri superclade by more than 500 times, as compared to the LAMCAM reference superclade, and the MLR specific for this SNP was highly significant with an LRT p-value of 1.68 x 10−255. Using the VEP tool, this SNP was annotated to lie within the retinoic acid receptor alpha (RARA) gene. Since these SNPs were imputed, their associations should be considered tentative, because they may have been incorrectly imputed into this study dataset.
Further analysis revealed that of the 32 SNPs identified here, two, namely the intergenic variant rs529920 on chromosome 6, and the intron variant rs41472447 on chromosome 12, were directly genotyped. The risk allele of the SNP rs529920 (g.153835125A>G) was found to half the risk of being infected with a member of the BeijingCAS and EAI_afri superclades when compared to the LAM_CAM baseline superclade. Individuals with the risk allele for this SNP were also equally at risk of being infected with a member of the Ghana2, HaarlemX, or T_U superclades (OR: 1.04-1.24) (Table 5). Additionally, this SNP has an MAF of 0.4721 in the study cohort, which is similar to the MAF observed in African populations in the 1000G (MAF: 0.571) and in gnomAD genome (MAF:0.582). However, this SNP has no known gene consequence to date.
Similarly, the risk allele of the intron variant rs41472447 (g. 41708761A>G) was found to nearly triple the risk of infection with a member of the BeijingCAS superclade (OR: 2.56, C.I.: (1.48–4.41)) or the Ghana2 superclade (OR:2.94, C.I.: (1.71–5.08)). In contrast, the risk allele halved the risk of infection with the T_U superclade (OR:0.59, C.I.: 0.44-0.79) and had no effect on the risk for infection with members of the EAI_afri or HaarlemX superclades (Table 5). This variant had an MAF of 0.2461 in the study dataset which is similar to the MAF observed in African populations in the 1000G (MAF: 0.200) and in the gnomAD genome (MAF:0.165). Furthermore, rs41472447 maps to the PDZRN4 gene in dbSNP, but has to date not been linked to susceptibility to tuberculosis.
3. Discussion
TB is a highly infectious disease affecting millions of people each year. The genetic susceptibility of the host to developing the disease has been extensively studied using linkage analysis, candidate gene studies, and GWAS. Furthermore, a number of selected genes have been investigated for their contribution to genetic susceptibility of the host to strains of different lineages of the MTBC. However, to date, no genome-wide analysis of genetic markers affecting susceptibility to strains of different MTBC lineages has been performed.
3.1. Imputation
While several reference panels exist to facilitate genotype imputation, most of these panels focus on representing populations of European ancestry, and little representation has been made for African populations. Therefore, the present study focused on evaluating the quality of imputation attainable for the five-way admixed South African population and the Ghanaian cohort using the 1000G, AGR, and CAAPA reference panels.The more admixed a population is, the greater the heterogeneity in its haplotype block structure. This genetic complexity requires large reference panels with suitable ancestry to facilitate accurate genotype imputation (Lin et al., n.d.). The five-way admixed South African population contains genetic contributions from Bantu-speaking Africans, Europeans, KhoeSan, and South- and East-Asians (Daya et al. 2013; De Wit et al. 2010). Imputation was previously performed for this population; however, it was done using the 1000G Phase 1 (The 1000 Genomes Project Consortium 2012) and the HapMap3 release 2 (The International HapMap 3 Consortium 2010) reference panels. The 1000G panel has been expanded substantially since whereas the HapMap3 reference panel, representing individuals mostly of European ancestry (Chimusa et al. 2014), has since been deprecated.
When evaluating the Ghanaian cohort, although imputation of the 1000G reference panel with the IH method performed the best, there was very little difference in the median quality scores for the different workflows seen for SNPs with an MAF of 10-50% (Fig. 5). For rare variants (MAF 0-5%) however, the IH method outperformed all others with a median quality score above 0.75, whereas both analyses with the MIS produced a median score below the cut-off of 0.45. Thus, for the Ghanaian cohort, all reference panels and methods tested could be considered viable options for imputing common variants with an MAF of 10-50% but should be considered carefully for variants with an MAF below 10%.
In contrast to the AGR which contains no individuals recruited from West-African countries, the CAAPA resource contains 88 individuals recruited from the West-African country of Nigeria (Mathias et al., 2016). Thus, it was unsurprising that the CAAPA resource performed well when imputing SNPs with a MAF above 10% (Fig. 5). From an MAF of 20-50%, the CAAPA resource performed similarly to the other three reference panels and may thus be considered suitable for imputing cohorts of West-African ancestry, such as the Ghanaian cohort used in this study.
3.2. Multi-phenotype GWAS
In this study, the MLR functionality within SNPTEST enabled the genome-wide investigation of genetic markers for association to a number of MTBC superclades. Although none of the SNPs passed the GWAS p-value cut-off, the most significant associations for the South African cohort imputed with the AGR reference panel were reported. This was likely due to the small sample size resulting in a subsequent reduction in statistical power. In contrast, 32 SNPs passed the GWAS cut-off for the Ghanaian cohort and may be considered as potential targets for further investigation of host-directed therapies suitable for individuals of West-African descent. None of the SNPs with an LRT p-value less than 0.0005 in either cohort were found in the other, demonstrating the population-specific association of SNPs with the strains of different MTBC superclades, which has been previously shown in the investigation of TLRs and their association with cases of TB in populations of different ethnicities (Schurz et al. 2015).
3.3. Potential drug targets
For the Ghanaian cohort, 32 SNPs with significant LRT p-values were identified as being associated with the MTBC superclades investigated (Table 5). Nine of the SNPs located on chromosome 12 mapped to the PDZRN4 gene. For these nine SNPs, the risk allele increased the chances of individuals being infected with the BeijingCAS superclades 2.5 times, and in the region of three times for the Ghana2 superclade, while the risk allele halved the chances of being infected with the T_U superclade. Due to the low frequencies of the BeijingCAS and Ghana2 superclades observed for this cohort (Fig. 2D), it is possible that these odds ratios were inflated because of small sample sizes. Notably, the two SNPs which were directly genotyped, and not imputed, were found to be significantly associated with infection with particular MTBC superclades. This provided crucial evidence that despite the vast amount of imputed genotype data included in the MLR, the association analysis was able to detect two directly genotyped SNPs with potentially significant associations with the MTBC superclades.
Few studies have described the direct influence of M. tb infection on the expression of the STARD4 and RARA genes, and no studies have investigated the outcomes of infection with strains belonging to different MTBC clades on these genes. The STARD4 encodes the StAR-related lipid transfer protein which plays a crucial role in the transmembrane trafficking of lipids (such as cholesterol) - an important source of energy for M. tb (Soccio et al. 2002). Infection of macrophages with pathogens such as M. tb stimulates the process of lipid droplet formation (Daniel et al. 2011). It has been hypothesised that M. tb initiates this process in order to secure a reliable source of carbon to fuel bacterial growth (Brzostek et al. 2009). Additionally, the accumulation of cholesterol in the bacterial cell wall drastically reduces the permeability of the cell wall, subsequently reducing the penetrating capability of the anti-TB drug Rifampicin (Brzostek et al. 2009). However, a recent study has contradicted the notion that lipid droplet formation is a bacteria-driven process. Instead, it was proposed that the formation of lipids is an immune system-activated process, and does not occur as a result of direct stimulation by M. tb, but rather via the IFN- □, H1F-alpha-dependent pathway of the host immune system (Knight et al. 2018).
All-trans retinoic acid (RARA), the active form of Vitamin A, plays an essential role in the normal functioning of the adaptive and innate immune systems. The oral administration of retinoic acid to rats resulted in inhibition of the M. tb growth, following in vitro infection (Yamada et al. 2007), thus making this gene a potential target for anti-TB therapies. The results of this study have highlighted several SNPs which possibly significantly increased the risk of individuals with Ghanaian ethnicity to being infected with the endemic TB strain of M. africanum. Given the burden of disease, and the dominance of M. africanum strains in Ghana, it may be a worthwhile exploring the functional effect of these SNPs on the biological processes described.
3.4. Recommended improvements for future multi-phenotype GWAS
Several limitations may have affected this study. Obtaining a suitable sample size is a problem inherent in GWA studies making use of logistic regression modelling. Furthermore, the many phenotypes being analysed in this study, demanded a sufficient number of cases for each class. The frequency of each MTBC clade however, is dependent not only on the host, but is also affected by the virulence of the bacterium. Thus, with all these considered, the sample sizes included in the MLR should be sufficient for inclusion in the analysis but will likely also reflect the distribution in the population. Another limitation of the study is that at the time of analysis, the AGR reference panel was not publicly available for download to a local machine. Thus, its use in this study could only be facilitated via the SIS, a freely accessible online imputation server. Through this, we were able to obtain high-quality imputed data for the South African dataset, but it was necessary to be mindful when drawing comparisons as the other workflows made use of different imputation software.
With the current trajectory of the TB epidemic, novel methods are needed to augment current therapies for TB and combat the disease. This study provides the groundwork for future GWAS wishing to investigate the relationship between the host and the many members of the MTBC causing disease. Furthermore, the SNPs identified in this study may be evaluated in functional studies to assess their viability as targets for host-directed therapies. This study would not have been possible were it not for the collection of paired samples of blood and sputum from study participants. Thus, future studies of this kind will require that both samples be collected from participants in order to perform this association analysis. Although it was necessary to exclude low frequency MTBC cases at the superclade level to prevent the reduction in statistical power of the association test, this may have brought in a weakness in interpreting the odds ratios derived from the model. Thus, the odds ratios derived may only be interpreted at the superclade level and does not provide further granularity to association with specific clades. Therefore, future studies employing this method should consider excluding low-frequency clades before clustering as well as Bayesian analyses that allow inclusion of prior probability distribution for strain prevalence. Additionally, incorporation of more diverse reference panels, such as the AGR, and new algorithms for imputation could improve association results.
4. Material and methods
4.1. Study design
To perform genome-wide association analyses between human host genotypes and the infecting member of the MTBC, host genotypes with paired MTBC isolate information were sourced for two geographically distinct cohorts. Paired host genotype and isolate information was available for a South African cohort, and a Ghanaian cohort. As these two cohorts are geographically distinct, and possess vastly different admixture profiles, we did not aim to replicate our findings within these two cohorts.
The South African cohort consisted of study participants recruited in the Western Cape Province of South Africa during the period of January 1993 through December 2004. Participants were recruited from suburbs where the TB incidence was high (28.9 % in 2005) and the prevalence of HIV was a low 2% (Kritzinger et al., 2009; Shisana et al., 2012). All study participants in this cohort self-identified as belonging to a five-way admixed South African population, were HIV-negative, and provided written informed consent. Blood samples were collected for SNP genotyping of the host and sputum samples were collected for bacterial culture on Loewenstein-Jensen (L-J) media. For the Ghanaian cohort, participants were enrolled between September 2001 and July 2004 at Korle Bu Teaching Hospital in Accra, Komfo Anokye Teaching Hospital in Kumasi, and at 15 additional hospitals and polyclinics in Accra and Kumasi, as well as regional district hospitals. All cases were HIV-negative and confirmed to have pulmonary TB by sputum microscopy, performing solid mycobacterial cultures using L-J media and also by two independent radiologists (Thye et al., 2012, 2010). All subsequent research was conducted in accordance with the principles expressed in the Declaration of Helsinki (WHO, 2001).
4.2. Host SNP genotyping
4.2.1. South Africa
SNP genotyping was performed using DNA extracted from blood samples. All participants in the South African cohort were genotyped using the GeneChip Human Mapping 500K SNP array which contains 500 000 SNP markers (Affymetrix, California, United States), while a subset of this cohort was also genotyped using the Infinium Multi-Ethnic Genotyping Array (MEGA), which is comprised of 1.7 million SNP markers (Illumina, California, United States). Genotype-calling was performed using the Affymetrix Power Tools pipeline (V1.10.0) as previously described (De Wit et al., 2010; Schurz et al., 2019a). Genotype datum was made available in PLINK format (Purcell and Chang, n.d.; Chang et al. 2015). Following standard genotyping quality control (QC), ancestry proportions for the South African cohort on both the Affymetrix and MEGA arrays were estimated (Daya et al. 2013, Schurz et al. 2018) using the unsupervised algorithm implemented in ADMIXTURE (Alexander et al. 2009).
4.2.2. Ghana
DNA extracted from blood samples was genotyped using the Affymetrix SNP 6.0 array at the Affymetrix Services Laboratory in California, and at ATLAS Biolabs GmbH in Berlin. Genotype datum was made available in PLINK format, genotypes were called using the Birdseed version 2 algorithm and ancestry proportions in the form of principal components were derived using the Eigenstrat software (Thye et al., 2012).
4.3. Bacterial SNP genotyping
For the South African cohort, MTBC isolates were genotyped using spoligotyping and IS6110 Restriction Fragment Length Polymorphism (RFLP) methods as previously described (van der Spuy et al. 2009). For the Ghanaian cohort, MTBC isolates extracted from sputum samples were cultured on L-J media at the Kumasi Centre for Collaborative Research (Owusu-Dabo et al., 2006) and strains were identified using IS6110 RFLP and spoligotyping (Supply et al., 2006). All MTBC isolate information were captured on manually-curated infection databases for archiving and made available for this study.
4.4. Defining MTBC clades and superclades
The term “superclades” was used to describe the grouping of clades using a SNP-based phylogenetic tree of the MTBC. Where a common progenitor was shared, MTBC clades were grouped into superclades near a point of divergence (Fig. 1) (Dippenaar, 2014). This was performed to reduce the number of clades of low frequency, as low frequency groups are known to induce an unfavourable collinearity effect on logistic regression models (Bergtold et al., 2011). Additionally, clustering of clades into superclades mitigated the class imbalance which would negatively affect the statistical models. Clades not amalgamated with other clades were also referred to as “superclades” after clustering and those not represented on the phylogenetic tree, were kept as distinct superclades and not clustered with any of the members on the existing tree, except when a suitable reference phylogeny was found. After clustering, superclades with a frequency less than ten in the study dataset were excluded from subsequent analyses.
4.5. Genotype data quality control, haplotype phasing, and genotype imputation
Quality control, haplotype phasing and genotype imputation were performed using the methods described in Schurz et al. (2019). Briefly, genotype data were iteratively filtered for 2% SNP genotype missingness, 10% sample missingness and 5% SNP minor allele frequency (MAF), until no additional samples or variants were removed. Additionally, variants lacking chromosome or base pair information were updated using the 1000 Genomes Phase 3 (1000G) reference panel and the dbSNP (Sherry et al., 2001) database. Imputation processes are strengthened by sample size. Thus, all available samples, regardless of relatedness or whether the sample had matching MTBC information, were included in the imputation. Related individuals were noted, but not removed in order to maximise the number of haplotypes available for the imputation process. The genotype QC procedure concluded with a sex concordance check as well as strand-alignment to the 1000G reference panel [human genome build 37 (Sudmant et al., 2015)] using the Genotype Harmonizer (version 1.4.15) tool (Deelen et al., 2014).
Following the initial QC, haplotype phasing was performed using the ShapeITv2 (Delaneau et al., 2008) software set at default parameters. Although some studies have reported that pre- phasing reduces imputation accuracy (Roshyara et al. 2016), it is known to significantly speed up the computationally intensive process of genotype imputation (Kanterakis et al. 2015; B. Howie et al. 2012). To maximise the number of variants tested in the association analysis, the cleaned genotype data were imputed using five protocols - and three different reference panels - to determine which panel best served the given dataset in imputing missing variants, as detailed in Schurz et al. (2019).
The In-House (IH) protocol made use of the 1000G reference panel with the IMPUTE2 imputation software (Howie et al., 2009). The Sanger Imputation Server (SIS) (McCarthy et al., 2016) makes use of the Positional Burrows-Wheeler Transformation (PBWT) algorithm (Durbin, 2014), with two options for the reference panel: the African Genome Resource (AGR) and the 1000G. Lastly, the Michigan Imputation Server (MIS) (Das et al., 2016) makes use of the Minimac3 algorithm (Howie et al., 2009) and provides access to imputation using the 1000G and the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA) (Mathias et al., 2016) reference panels. Of note, both the CAAPA and AGR reference panels were not publicly accessible for download at the time of this study and thus could only be accessed using these two online imputation server platforms. The key differences between the three protocols using the 1000G reference panel was the imputation software used, as well as additional strict QC filters imposed on the study dataset by the MIS and SIS methods.
4.6. Selection of high-quality imputed genotype data
The quality control procedure for the imputed data was implemented as described in Schurz et al. (2019). Briefly, following five imputation protocols, imputed data were filtered using a genotype calling threshold of 0.7, and the internal quality metric produced by the imputation process (Schurz et al., 2019b). SNPs with an INFO or R-squared (Rsq) value greater than 0.45 were prioritised for the association analysis and filtered iteratively for a maximum of 2% SNP genotype missingness, 10% sample missingness, and 5% SNP MAF using PLINK. Related individuals identified prior to imputation were removed followed by a second round of iterative filters for SNP- and sample missingness and MAF (Schurz et al., 2019b). Post-imputation QC concluded with extracting MTBC clade-matched samples from the remaining samples which had passed all QC filters.
4.7. Covariable data
All available covariables were obtained including sex, and age at time of active TB and subsequent recruitment into the study. To correct for differences in ethnicity amongst participants, either ancestry proportions or principal components were calculated and included as covariables.
4.8. Multi-phenotype GWAS
For the association analysis, a multinomial logistic regression (MLR) analysis using an additive genetic model was performed using SNPTEST v2.5.2 (Marchini, 2010). Two discrete variables, namely sex and superclade, as well as continuous variables, namely age at TB onset and ancestry proportions or principal components were included in the analysis. The phenotype tested was specified as the MTBC superclade. Thus, the MLR model specified was the occurrence of the MTBC superclade as a function of the baseline covariables given, as well as the host genotypes supplied.
SNPTEST is unable to include covariables when the variance in the values provided is “too small” as indicated (Marchini, n.d.). At the time of this study, SNPTEST was still under development, and it had not yet been established, or recorded in the software manual, to what degree of variance covariable data would not be accepted for inclusion in the logistic regression. For developing the method, it was established through trial and error that if the variance was below 0.001, these covariables could not be included.
The standard genome-wide significance cut-off of alpha = 5×10−8 was used when reporting significance of SNPs (Pe’er et al., 2008; The International HapMap Consortium, 2005). Odds ratios (OR) for the multiple phenotypes tested were calculated against a baseline phenotype by setting the odds of that phenotype occurring, given the genotype, to 1. For this study, the baseline phenotype was specified as the dominant superclade in the cohort, or a common superclade of intermediate frequency if more than one cohort was being studied. Thus, the LAM- and LAM_CAM superclades were used as the baseline phenotype for the association analyses of the South African, and Ghanaian cohorts, respectively.
SNPs with a Likelihood Ratio Threshold (LRT) p-value of less than 5×10−4 were selected and analysed in R (R Core Team, 2017) and odds ratios were calculated from the beta values generated by SNPTEST. SNPs with a standard error greater than 1.5 for their odds ratios were excluded and SNPs with an LRT p-value less than 1×10−6 were prioritised for further investigation. These thresholds were chosen pragmatically to facilitate the completion of method development. Finally, the Variant Effect Predictor (VEP) Tool (McLaren et al. 2016) was used to retrieve gene annotations for the SNPs of interest.
Data Availability
Summary statistics for the South African cohorts data can be made available to researchers who meet the criteria for access to confidential data after application to the Health Research Ethics Committee of Stellenbosch University. Requests may be sent to: Prof Craig Kinnear, E mail: gkin{at}sun.ac.za
Ethics approval and consent to participat
For the South African cohort, blood and sputum samples were collected from study participants as approved by the Health Research Ethics Committee of Stellenbosch University (Project numbers: S17/01/013 and 95/072). For the Ghanaian cohort, ethics for the study protocol was granted by the Committee on Human Research, Publications and Ethics, School of Medical Sciences, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana, and the Ethics Committee of the Ghana Health Service, Accra, Ghana (Thye et al., 2012). Venous blood samples were taken only after a detailed explanation of the aims of the study, and consent was obtained of individuals enrolled or their parents/guardians by signature or by thumbprint in case of illiteracy.
Availability of data and materials
Summary statistics for the South African cohort’s data can be made available to researchers who meet the criteria for access to confidential data after application to the Health Research Ethics Committee of Stellenbosch University. Requests may be sent to: Prof Craig Kinnear, E-mail: gkin{at}sun.ac.za.
Competing interests
The authors declare that they have no competing interests.
Author contributions
Stephanie J. Müller: Conceptualization, Methodology, Software, Investigation, Formal analysis, Writing - Original Draft, Visualization
Haiko Schurz: Software, Validation, Writing - Review & Editing
Gerard Tromp: Conceptualization, Supervision, Visualization, Writing - Review & Editing
Gian D. van der Spuy: Conceptualization, Supervision, Data Curation, Resources, Writing - Review & Editing
Eileen G. Hoal: Conceptualization, Supervision, Writing - Review & Editing
Paul D. van Helden: Conceptualization, Writing - Review & Editing
Ellis Owusu-Dabo: Data Curation, Writing - Review & Editing
Christian G. Meyer: Data Curation, Writing - Review & Editing
Thorsten Thye: Data Curation, Resources, Writing - Review & Editing
Stefan Niemann: Data Curation, Writing - Review & Editing
Robin M. Warren: Data Curation, Resources, Writing - Review & Editing
Elizabeth Streicher: Data Curation, Writing - Review & Editing
Marlo Möller: Conceptualization, Supervision, Writing - Review & Editing
Craig Kinnear: Conceptualization, Supervision, Writing - Review & Editing
Acknowledgements
The authors would like to acknowledge and thank the study participants for their contribution and participation. This research was partially funded by the South African government through the South African Medical Research Council. This work was also supported by the National Research Foundation of South Africa (grant number 93460) to E.H. and by a Strategic Health Innovation Partnership grant from the South African Medical Research Council and Department of Science and Innovation/South African Tuberculosis Bioinformatics Initiative to G.T. The content is solely the responsibility of the authors and does not necessarily represent the official views of the South African Medical Research Council.