Strong effect of demographic changes on Tuberculosis susceptibility in South Africa

South Africa is among the world’s top eight tuberculosis (TB) burden countries, and despite a focus on HIV-TB co-infection, most of the population living with TB are not HIV co-infected. The disease is endemic across the country, with 80–90% exposure by adulthood. We investigated epidemiological risk factors for TB in the Northern Cape Province, South Africa: an understudied TB endemic region with extreme TB incidence (926/100,000). We leveraged the population’s high TB incidence and community transmission to design a case-control study with similar mechanisms of exposure between the groups. We recruited 1,126 participants with suspected TB from 12 community health clinics and generated a cohort of 774 individuals (cases = 374, controls = 400) after implementing our enrollment criteria. All participants were GeneXpert Ultra tested for active TB by a local clinic. We assessed important risk factors for active TB using logistic regression and random forest modeling. We find that factors commonly identified in other global populations tend to replicate in our study, e.g., male gender and residence in a town had significant effects on TB risk (OR: 3.02 [95% CI: 2.30–4.71]; OR: 3.20 [95% CI: 2.26–4.55]). We also tested for demographic factors that may uniquely reflect historical changes in health conditions in South Africa. We find that socioeconomic status (SES) significantly interacts with an individual’s age (p = 0.0005), indicating that the protective effect of higher SES changed across age cohorts. We further find that being born in a rural area and moving to a town strongly increases TB risk, while town birthplace and current rural residence is protective. These interaction effects reflect rapid demographic changes, specifically in SES over recent generations and in mobility, in South Africa. Our models show that such risk factors combined explain 19–21% of the variance (r²) in TB case/control status.


Introduction
Tuberculosis (TB) is among the world's leading causes of death due to infectious disease, recently surpassed by COVID-19 (1). The TB causative agent, Mycobacterium tuberculosis (M.tb), is an obligate, exclusive Homo sapiens pathogen mainly infecting the lungs, and sometimes other organs (2,3). Determinants of active TB progression are multifaceted, including human host genetics, nutrition, social and economic conditions, behavior, and sex-specific biology (1,4,5). The extent of these determinants' effects varies across and within populations, necessitating epidemiological studies in differing contexts and communities (5). These factors have also been shown to vary between low and high/intermediate-incidence populations, with lower odds ratios in high/intermediate-incidence populations (6). Here, we characterize the TB epidemiology of a district in the Northern Cape Province, South Africa, a TB-endemic region with relatively low HIV prevalence.
South Africa is amongst the top 30 'high burden' countries, burdened by TB, TB/HIV coinfection, and multidrug-resistant or rifampicin-resistant TB (MDR/RR-TB). TB is South Africa's leading natural cause of death (7), with an extremely high prevalence (852/100,000 (8)), and accounts for 3.3% of all global TB cases (1). HIV is commonly identified as the leading risk factor for TB. In South Africa, 59% of TB patients on a TB programme (screened by a clinician and on TB medication) are co-infected with HIV. South Africa's first national TB prevalence survey (n=35,000), however, found only 28% of TB cases were co-infected with HIV (8). This discrepancy is partly explained by those with TB who go undetected, mainly symptomatic men not living with HIV who have limited clinical contact (8). Many individuals who are diagnostically TB+ may also go undetected because 78% of TB+ HIV- individuals exhibit only one or no classic TB symptoms (e.g., cough for two weeks, fever, night sweats, and weight loss; 61% have no symptoms) (8).
In case-control studies, controls should have similar disease exposure profiles to the cases. Population-based controls risk differential disease exposure, a concern in low-incidence populations that can bias statistical associations. However, in high-incidence populations with well-characterized disease burdens and transmission, population controls greatly improve statistical power (9). In South Africa, TB spreads more through the community than within households (10,11), and the prevalence of latent TB infection increases with age, reaching 80% by age 30 (12-15), an epidemiological scenario that ensures cases and controls have approximately similar disease exposure profiles.
Mycobacterium tuberculosis has a long coevolutionary history with different human populations, likely leading to population-specific genetic signatures (2,16). TB susceptibility phenotypes have a heritability of 11-92% (17), yet few critical genetic variants have replicated across genome-wide association studies (GWAS) (18), potentially reflecting these population-specific signatures. This result has spurred several studies to examine the relationship between genetic ancestry and TB risk (19-23). For instance, native Amerindian ancestry was shown to be a risk factor for TB progression in an admixed Amazonian population, and genetic variants in Peruvian populations have been associated with early active TB progression (19-21). A major challenge in identifying genetic risk factors associated with TB progression is decoupling the social and environmental effects that accompany ancestry. To this end, Asgari et al. controlled for environmental effects (e.g., sanitation, water supply, and socioeconomic status [SES]) and found indigenous Peruvian ancestry to be an independent, significant predictor of TB progression. Chimusa et al. also corrected for SES and demonstrated an association between Khoe-San ancestry and TB progression in South Africa (22).
In this study, we investigated ancestry proportions as well as several common TB epidemiological variables identified in earlier studies (24). Smoking, alcohol consumption, and intravenous drug use have independently been associated with TB. Meta-analyses have found alcohol use and smoking (25,26), and specifically heavy alcohol use (26-28), to increase TB risk, though not always consistently. Our study is part of the Northern Cape Tuberculosis Project (NCTB), investigating the human-host genetics of TB among admixed Khoe-San descent populations in rural or peri-urban communities. Characterizing the TB epidemiology in this region will identify non-genetic risk factors that can serve as control variables in future genetic studies of TB risk.

Research ethics statement
This study has been approved by the Health Research Ethics Committee (HREC) of Stellenbosch University (N11/07/210A) and the Northern Cape Department of Health (NC2015/008). All participants were adults (18 years and older) and provided written informed consent. Authors Justin W. Myrick, Jamie Saayman, Lena van der Westhuizen and Marlo Möller had access to identifiable information about participants as they were directly involved in data collection or database management. Access to these records commenced on 26 January 2016 and is still ongoing, as it is an integral part of the ongoing Northern Cape Tuberculosis Project (NCTB).

Study Design and Recruitment
Participants (18 years and older) provided written informed consent and were recruited from 12 community health clinics in the ZF Mgcawu district in the Northern Cape Province of South Africa from 26 January 2016 to 15 May 2017 and from 11 December 2018 to 11 March 2020. Community health clinics are the front line for TB screening and treatment, visited by 87% of people who seek TB care (8). TB nurses referred patients with suspected TB (with ≥2 TB symptoms: cough for ≥2 weeks, night sweats, weight loss, and fever ≥2 weeks or a TB contact) and TB patients to our on-site research assistants (RAs). All study participants took a clinic-administered sputum GeneXpert Ultra test for active TB at the time of the study interview and provided saliva for genotyping. Clinic medical charts were accessed by a staff research nurse to record GeneXpert test results and verify HIV status and TB history.

Case-Control Assignment
Cases and controls were assigned based on participants' medical charts and self-reported data (see Fig. 1). Cases are HIV-negative participants who have had active pulmonary TB at any point in their lifetime. They were identified through two tracks: 1) clinically confirmed active TB (n=343) and 2) self-reported past TB episode(s) (n=228). GeneXpert results, diagnostic test date, TB strain (drug resistance), and TB medication regimen were used to determine clinically confirmed progression to active TB. Past TB episodes are self-reported, mainly because older medical charts were unavailable, discarded, or difficult for clinic staff to locate.
Controls are patients with suspected TB who have a negative GeneXpert Ultra result and no history of active pulmonary TB at the time of study enrollment, and are largely assumed to be latently infected with M.tb (LTBI). A majority of the population in high TB burden South African suburbs are LTBI, 88% by ages 31-35 (12,13), and studies have consistently shown LTBI in South Africa to be above 75% by age 25, increasing across adulthood (14). Our population-control design relies on population-wide TB exposure, as traditional screening methods, the tuberculin skin test (TST) and interferon-gamma release assay (IGRA; e.g., QuantiFERON), are limited in both concordance and positive predictive value (29,30). IGRA and TST are used to infer M.tb infection, but cannot be used to determine previous exposure to the bacterium. Certain individuals living in populations with high M.tb exposure test persistently negative on these tests and do not develop active disease, but display M.tb-specific antibody titres. These individuals are known as "resisters" or "early clearers" (31,32).
Our exclusion criteria removed participants with unknown TB or HIV status, as well as individuals with dual HIV and TB infections.

Fig 1. Case-Control Decision Tree. Study participants were categorized as cases or controls based on medical record information and self-reported data. All participants were GeneXpert tested for active TB infection at the time of enrollment. Past TB episodes were self-reported and cross-referenced with medical records when available.

Study covariates
We collected demographic information that included date of birth, place of birth, current residence, self-identified gender, self-reported ethnic identity, and parental ethnic identities. Behavioral variables include smoking and alcohol consumption (see Supplementary Materials in S1 Text). In our analyses, we only used binary measures for smoking and alcohol ("Do you smoke?", "Do you drink alcohol?"). Residence and birthplace locations are categorized as rural (≤2,000 people) and town (>2,000 people). Population size was derived from the South African census; when census data were absent, e.g., for a farm, we used Google Earth (earth.google.com) to estimate population size based on the number of dwellings. Age was used as a continuous variable for all analyses and binned for calculating empirical odds (see Fig 2B). SES was operationalized as a participant's number of years of education, i.e., the highest completed level of education. McKenzie et al. have shown that education level, in this dataset, positively predicts body mass index in TB controls, tracking access to resources and food security (33).

Fig 2. Case-Control status shifts across Age groups. A) Overlapping density plots of age distribution stratified by TB status (n=878). At the oldest and youngest ages, most of our study participants are cases whilst at middle-age groups, the majority are controls. B) Empirical odds of active TB by age group. The x-axis bins our participants into 7 age groups and the y-axis shows the empirical odds of active TB. Empirical odds are calculated by dividing the number of cases by the number of controls in each age bin. The size of the dots corresponds to the sample size of the age group. Our data reveal a signal of survivor bias. Since age is a cumulative measure of exposure, the empirical odds of TB should increase with age. This pattern is observed from our youngest age group up to age 58. The empirical odds of TB progressively decrease after age 58. Older age groups are biased towards controls due to the mortality of TB.
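The empirical odds described for Fig 2B can be sketched in a few lines of Python. This is a minimal illustration, assuming the odds of active TB in each age bin are the number of cases divided by the number of controls; the function name, the bin edges, and the toy records are hypothetical and are not the study's data or code (the authors' analyses were run in R).

```python
from bisect import bisect_right


def empirical_odds_by_age_bin(records, lower_bounds):
    """Compute empirical odds of active TB (cases / controls) per age bin.

    records: iterable of (age, is_case) pairs.
    lower_bounds: sorted, inclusive lower edges of the age bins.
    Returns one dict per bin with counts, sample size, and odds.
    """
    counts = [[0, 0] for _ in lower_bounds]  # [cases, controls] per bin
    for age, is_case in records:
        idx = bisect_right(lower_bounds, age) - 1
        if idx < 0:
            continue  # younger than the first bin edge; not binned
        counts[idx][0 if is_case else 1] += 1

    table = []
    for lower, (cases, controls) in zip(lower_bounds, counts):
        odds = cases / controls if controls else None  # undefined without controls
        table.append({
            "from_age": lower,
            "cases": cases,
            "controls": controls,
            "n": cases + controls,  # would drive the dot size in a plot
            "odds": odds,
        })
    return table


# Hypothetical toy data: (age at enrollment, is_case)
toy_records = [
    (22, True), (25, False), (27, False),
    (41, True), (44, True), (47, False),
    (63, False), (66, False), (70, True),
]
toy_bins = [18, 39, 59]  # hypothetical bin edges, not the study's 7 bins
odds_table = empirical_odds_by_age_bin(toy_records, toy_bins)
```

In this toy example the odds rise in the middle bin and fall in the oldest bin, which is the shape the caption interprets as a possible survivor-bias signal.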

Data Analyses
Statistical analyses were performed in R (version 4.0.2). We calculated Pearson correlations with the R package ggcorrplot. All categorical variables were numerically coded to "0" and "1". Classification models for our binary, qualitative dependent variable ("case"/"control") included logistic regression and random forest, a machine learning classifier robust to non-linear associations and unknown variable interactions (34) (see Supplementary Materials in S1 Text). Random forest is a growing analytic tool in epidemiology (35-37). The coefficients of the logistic regression models were converted to odds ratios using the R package gtools (38), and marginal effects were plotted using the R package effects (39). Each model was Bonferroni corrected by dividing 0.05 by the number of variables in that model.
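The two arithmetic steps described above, the Bonferroni threshold and the conversion of a logistic regression coefficient to an odds ratio, can be sketched as follows. This is an illustrative Python sketch rather than the authors' R code; the function names are hypothetical, the Wald-type 95% interval is an assumption (the text does not state how confidence intervals were computed), and the example coefficient and standard error are made-up values.

```python
import math


def bonferroni_threshold(n_variables, alpha=0.05):
    """Per-variable significance threshold: alpha divided by the variable count."""
    return alpha / n_variables


def odds_ratio_with_wald_ci(coefficient, std_error, z=1.96):
    """Exponentiate a log-odds coefficient and its Wald interval (assumed method)."""
    return (
        math.exp(coefficient),
        math.exp(coefficient - z * std_error),
        math.exp(coefficient + z * std_error),
    )


# A model with seven variables gives a threshold of 0.05 / 7.
threshold_seven = bonferroni_threshold(7)

# Hypothetical coefficient and standard error, for illustration only.
example_or, example_low, example_high = odds_ratio_with_wald_ci(1.0, 0.2)
```

Under this threshold, a p-value of 0.03 would not count as significant in a seven-variable model, which is why some nominally small p-values in the results are reported as non-significant.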
Our first model, the common risk factor model (n=878), includes seven covariates known to be common risk factors for TB.

TB Status ~ gender + smoking + drinking + diabetes + residence + age + SES

Health disparities are one of the many consequences of Apartheid in South Africa (40,41). The end of Apartheid improved social mobility and educational access; however, health disparities in the Northern Cape remain (42). To capture the effect of lived experience vis-à-vis Apartheid on TB outcomes, we designed the "SES model" (n=878). This model extends the common risk factor model with an interaction between age and SES. Age is kept as a continuous variable because Apartheid was not a historically binary event.

TB Status ~ common risk factor model + age * SES
Residing in an urban or rural environment is an established risk factor for TB status. In the "residence model", we test the relationship between current residence and birthplace residence on TB status. Here, we build on the common risk factor model to include an interaction between current residence and birthplace. Setting this interaction allows us to examine four patterns, namely: rural birthplace to urban residence, urban birthplace to rural residence, lifetime rural residence, and lifetime urban residence.

TB Status ~ common risk factor model + residence * birthplace
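To make the four residence patterns concrete, the sketch below shows how a residence-by-birthplace interaction combines on the log-odds scale. It is a hedged illustration only: the 0/1 coding (town = 1, rural = 0) is an assumption, and the three coefficients are hypothetical round numbers chosen for readability, not estimates from this study.

```python
import math

# Hypothetical log-odds coefficients; NOT the study's estimates.
B_RESIDENCE = math.log(2.0)    # current residence: town (1) vs rural (0)
B_BIRTHPLACE = math.log(0.5)   # birthplace: town (1) vs rural (0)
B_INTERACTION = math.log(3.0)  # residence x birthplace product term


def odds_ratio_vs_lifelong_rural(residence_town, birthplace_town):
    """Odds ratio of a residence pattern relative to lifetime rural residence."""
    log_odds = (
        B_RESIDENCE * residence_town
        + B_BIRTHPLACE * birthplace_town
        + B_INTERACTION * residence_town * birthplace_town
    )
    return math.exp(log_odds)


# (current residence, birthplace) coded as town = 1, rural = 0 (assumed coding)
patterns = {
    "lifetime rural residence": (0, 0),
    "rural birthplace to town residence": (1, 0),
    "town birthplace to rural residence": (0, 1),
    "lifetime town residence": (1, 1),
}
pattern_odds_ratios = {
    name: odds_ratio_vs_lifelong_rural(*codes) for name, codes in patterns.items()
}
```

The interaction coefficient only enters for participants coded 1 on both variables, so it captures how much lifetime town residence departs from the product of the two main effects.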

Genetic Data Processing & Ancestry Estimation
Genetic data processing involved DNA extraction from saliva samples, genotyping for >2 million SNPs, common variant calling with GenomeStudio, rare variant calling with zCall, and further data cleaning using plink2 with specific parameters (Supplementary Methods in S1 Text). Prior to genetic ancestry estimation, SNPs out of Hardy-Weinberg equilibrium (--hwe 0.001) and rare alleles (--maf 0.01) were removed from the dataset. The dataset was also pruned for linkage disequilibrium (--indep-pairwise 200 25 0.4). Individuals from Luhya, Maasai, Himba, British, Palestinian, Chinese, Bangladeshi, Tamil, Ju|'hoansi San, Khomani San, and Nama populations were used as reference groups. Global ancestry estimates were calculated using ADMIXTURE v1.13 (43). This was done in groups of maximally unrelated individuals to avoid biasing the ancestry estimates. ADMIXTURE was run for k=5 in unsupervised mode for each of the running groups. After matching clusters, we merged ancestry estimates across all running groups, averaging individuals that appeared in multiple running groups using pong (44).
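The final merging step, averaging individuals that appear in more than one running group, can be illustrated with a short Python sketch. It assumes the clusters have already been matched so that every run reports the k=5 ancestry proportions in the same order; the function name, individual IDs, and proportions are hypothetical, and the sketch does not reproduce how pong itself performs the matching.

```python
from collections import defaultdict


def average_ancestry_across_runs(runs):
    """Average per-individual ancestry proportions over running groups.

    runs: list of dicts mapping individual ID -> list of cluster proportions,
    with clusters already matched to the same order in every run (assumed).
    """
    collected = defaultdict(list)
    for run in runs:
        for individual_id, proportions in run.items():
            collected[individual_id].append(proportions)

    averaged = {}
    for individual_id, vectors in collected.items():
        n_clusters = len(vectors[0])
        averaged[individual_id] = [
            sum(vector[i] for vector in vectors) / len(vectors)
            for i in range(n_clusters)
        ]
    return averaged


# Hypothetical toy proportions for k=5 clusters in two running groups.
toy_runs = [
    {"ind_A": [0.6, 0.2, 0.1, 0.1, 0.0], "ind_B": [0.5, 0.3, 0.1, 0.05, 0.05]},
    {"ind_A": [0.5, 0.3, 0.1, 0.1, 0.0], "ind_C": [0.7, 0.1, 0.1, 0.05, 0.05]},
]
merged_ancestry = average_ancestry_across_runs(toy_runs)
```

Because each input vector sums to one, the averaged vector also sums to one, so no renormalization is needed in this simple case.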

TB case-control classification
1,126 participants were partitioned into preliminary cases, preliminary controls, and unverified TB status (571, 504, and 51, respectively; Table C in S1 Text). After excluding participants with unverified TB status, preliminary cases with unverified HIV status, and participants co-infected with TB and HIV, 878 participants remained in the study (374 cases and 504 controls; Table A in S1 Text).
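The participant flow above can be checked with simple arithmetic. The short Python sketch below uses only the counts reported in this paragraph; the variable names are illustrative, and the split of excluded preliminary cases between unverified HIV status and TB/HIV co-infection is not given in the text, so only their combined number is derived.

```python
# Counts as reported in the text.
preliminary = {"cases": 571, "controls": 504, "unverified_tb_status": 51}
final = {"cases": 374, "controls": 504}

recruited = sum(preliminary.values())   # 1126 participants recruited
retained = sum(final.values())          # 878 participants retained
excluded_total = recruited - retained   # 248 participants excluded overall

# Preliminary cases removed for unverified HIV status or TB/HIV co-infection.
excluded_preliminary_cases = preliminary["cases"] - final["cases"]  # 197
```

The 248 exclusions are therefore made up of the 51 participants with unverified TB status and 197 preliminary cases, while all 504 preliminary controls were retained.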

Socio-behavioral covariates and demographics
Men and women were equally represented in the dataset (422:441, respectively, Table A in S1 Text). Men were more likely to drink alcohol (r = -0.14, p < 0.05; Fig I in S1 Text) and smoke (r = -0.22, p < 0.05; Fig I in S1 Text). Most of our participants smoked (66%) and 45% drank alcohol; smoking and drinking were moderately correlated with each other (r = 0.36, p < 0.05; Fig I in S1 Text). Women were more likely to have diabetes (r = 0.12, p < 0.05; Fig I in S1 Text) and, on average, had more education than men (female mean = 8.3 years, male mean = 7.7 years). Cases and controls had similar distributions for age (mean = 43.1, SD = 13.2 and mean = 42.4, SD = 15.2, respectively, Table A in S1 Text). "Age" is defined here as the age at the time of study enrollment, not the age at the TB episode; importantly, case status is a cumulative outcome that includes participants who currently and/or previously had TB. Age also captures the amount of time someone is exposed to TB. The empirical odds of active TB in our data reveal a signature of survivorship bias (Fig. 2B). We use the number of years of education as a proxy for "SES". The mean educational attainment is 8 years, equivalent to completing primary school, and similar between rural areas and towns (ANOVA, p > 0.1). In the ZF Mgcawu District census (45), 13% of people have not completed primary school compared to 25.3% of our participants. Age was moderately correlated with SES (r = -0.5, p < 0.05; Fig I in S1 Text) such that older participants tended to have lower SES.

Ethnicity and Khoe-San Ancestry
Genetic ancestry analyses were performed for 159 participants (see Supplementary Methods in S1 Text) from the Northern Cape Tuberculosis Project on host-genetic susceptibility to TB. To our knowledge, this is the first study to report ancestry proportions of a clinical population in the Northern Cape Province, South Africa. Khoe-San ancestry varied across clinic locations (Fig. 3A) but remained the majority ancestry at each site (mean = 56%), followed by Bantu-speaking African ancestry (mean = 21%), European ancestry (mean = 16%), South Asian ancestry (mean = 5%), and East Asian ancestry (mean = 2%) (Fig. 3B).

Fig 3. Khoe-San Ancestry is the Primary Genetic Ancestry in Clinics from the Northern Cape, South Africa. A subset of participants (n=159) was genotyped for preliminary ancestry analysis. A) The study population is admixed with 5 distinct ancestries, with the Southern African indigenous Khoe-San ancestry being the largest proportion of ancestry across all study sites. B) Although Khoe-San ancestry is the largest proportion of ancestry in our sample, it varies significantly across study sites.
Individuals were asked to self-identify their ethnicity without prompting. 88.4% of participants (both TB cases and controls) self-identify as coloured, followed by 4.6% as Tswana, 4.2% as a Khoe-San ethnicity (e.g., Nama, San), 1.3% as Xhosa, and 1.9% as "other". Whilst we acknowledge that in some contexts the term coloured has derogatory connotations, it is a recognized ethnicity and used culturally in South Africa. People who self-identify using this term have different ancestries of different geographic origins, including the indigenous Khoe-San groups (e.g., Khoekhoe, San), Bantu-speaking, European, Indian, Malaysian (Southeast Asian) slaves, or people of mixed ancestry and their descendants (46).

Logistic Regression Results
We designed three logistic regression models (47) to examine the risk factors' odds ratios for the binary dependent variable, TB case/control. The common risk factor model included age, SES, gender, residence, smoking, diabetes, and alcohol as covariates. The SES model extended the common risk factor model to include an interaction between age and SES. Lastly, the residence model extended the common risk factor model to include an interaction between birthplace and current residence. The SES model (AIC = 1099; pseudo-r² = 17%, Table 1) performed slightly better than the common risk factor model (AIC = 1108; pseudo-r² = 16%, Table 1). The residence model had a similar pseudo-r² (15%; Table B in S1 Text) to the other two; however, its AIC could not be compared with theirs because the models were fit to different sample sizes. All significance levels were Bonferroni corrected by dividing 0.05 by the number of variables used in the model.

Gender, Alcohol, Smoking, and Diabetes
Males have nearly three times the odds of active TB compared with females (OR = 2.85, p < 0.001; Table 1 and Fig. 4). All logistic regression models showed insufficient statistical evidence for an effect of smoking (common risk factor model: OR = 1.31, p = 0.11; Table 1, Table B in S1 Text), alcohol consumption (common risk factor model: OR = 1.05, p = 0.77; Table 1 and Table B in S1 Text), and diabetes (common risk factor model: OR = 1.36, p = 0.32; Table 1 and Table B in S1 Text) on TB risk. Despite the lack of significance, we note that smoking had an effect size in the expected direction (Fig. 4).

Fig 4. Effect Plots demonstrating the relationship between Active TB Status and A) Gender, B) Current Residence and C) Smoking. These plots are reported from the best-performing logistic regression model (SES model). Y-axes for all panels show the odds of active TB. We find that the odds of active TB are 3 times higher in Males. Individuals currently residing in Towns have about 2.5 times higher odds of active TB as compared to individuals currently residing in rural areas. Smoking slightly increases the odds of active TB but is not statistically significant.

Age Interacts with SES
In the common risk factor model, age (OR = 0.996 [0.99, 1.00], p = 0.55) and SES (OR = 0.947, p = 0.0324; see Table 1) have no significant effect on TB risk after Bonferroni correction. To examine this unexpected finding, we interacted age with years of education (proxy for SES). SES significantly affects TB status depending on age group (OR = 1.005, p = 0.004, Table 1). The effect takes on a U-shaped relationship across ages, such that higher SES at younger ages (18-39 years old) is protective against TB, and higher SES at older ages (>59 years) increases risk (Fig. 5). Middle-aged individuals (40-59 years old) show no relationship between SES and TB risk (Fig. 5).
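One way to read the interaction odds ratio reported above (OR = 1.005) is as the factor by which the odds ratio per year of education changes for each additional year of age. The Python sketch below works through that arithmetic under the assumption that age and education enter the model uncentered, in years. The education odds ratio at age zero is a hypothetical placeholder, since the main-effect estimate from the SES model is not quoted in the text.

```python
import math

REPORTED_INTERACTION_OR = 1.005     # age x SES interaction OR quoted in the text
HYPOTHETICAL_OR_AT_AGE_ZERO = 0.75  # placeholder main effect; NOT from the study


def education_or_at_age(or_at_age_zero, or_interaction, age):
    """Odds ratio per extra year of education for a person of the given age."""
    return or_at_age_zero * or_interaction ** age


def crossover_age(or_at_age_zero, or_interaction):
    """Age at which the per-year education odds ratio equals 1."""
    return -math.log(or_at_age_zero) / math.log(or_interaction)


# Relative change in the education OR between a 30- and a 70-year-old;
# this ratio depends only on the reported interaction term.
ratio_70_vs_30 = REPORTED_INTERACTION_OR ** (70 - 30)

# Illustration with the hypothetical main effect.
or_at_30 = education_or_at_age(HYPOTHETICAL_OR_AT_AGE_ZERO, REPORTED_INTERACTION_OR, 30)
or_at_70 = education_or_at_age(HYPOTHETICAL_OR_AT_AGE_ZERO, REPORTED_INTERACTION_OR, 70)
age_where_effect_flips = crossover_age(HYPOTHETICAL_OR_AT_AGE_ZERO, REPORTED_INTERACTION_OR)
```

With the placeholder main effect, education is protective at age 30, a risk factor at age 70, and neutral in the late fifties, which mirrors the qualitative pattern described for Fig. 5 without reproducing the fitted values.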

Fig 5. Logistic regression interaction plots.
A) The odds of active TB by education level vary across age groups (shown above by the different color lines). More years of education decrease the odds of active TB in younger age groups, but this pattern reverses in the oldest age groups. In middle-aged individuals, there is no relationship between years of education and the odds of active TB. B) Effect plot from the residence model visualizing an interaction term between birthplace residence and current residence. Regardless of birthplace, the odds of active TB are highest in individuals who currently reside in towns. Individuals born in towns and currently residing in rural areas have the lowest odds of active TB.

TB Risk is Highest in Towns
The odds of active TB were significantly higher for people residing in towns (common risk factor model: OR = 2.88 [2.07-4.03], p < 0.0001; Table 1 and Fig. 4). For the residence model, we analyzed the impact of moving between rural areas and towns during an individual's lifetime (birthplace by residence) on TB status. We expected to see a difference in odds for TB risk between life-long residents and those who have moved between locales. Under such a model, lifelong rural dwellers would have the lowest odds and lifelong town dwellers would have the highest odds. We set an interaction term between current residence and birthplace classified into town/rural; this interaction was marginally significant (OR = 3.05, p = 0.016; Table B in S1 Text). Our results show that regardless of birthplace, current residence in a town increases the risk of active TB (Fig. 5B). Interestingly, individuals who were born in a town and later moved to rural areas are even more protected than individuals born and currently residing in rural areas (Fig. 5B).

Random Forest Modeling
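Similar to logistic regression, random forest is a binary classifier, yet differs in that it is robust against non-linear associations and unknown interactions (34). Random forest utilizes a permutation-based approach to generate a ranked list of important variables but is unable to quantify the "significance" of the relationship between an independent and dependent variable. 5000 subsets of our dataset were used to grow 5000 classification trees using baseline variables as predictors for active TB status. The model assigned gender, current residence, and SES, respectively, as the overall top important independent variables (Fig II-A in S1 Text). Age, diabetes, alcohol, and smoking were classified as uninformative predictors for TB. The random forest model also stratified the variable importance by cases and controls. Gender was the top predictor for case status, followed by current residence and SES (Fig II-B in S1 Text). SES was the top predictor for control status, followed by gender and current residence (Fig II-C in S1 Text). Interestingly, age had some predictive relevance for case status but was the worst-performing predictor for controls (Fig II-C in S1 Text). The model had an overall "out-of-bag" misclassification rate of 38%, with misclassification lower in controls (controls = 30%, cases = 48%; Supplementary Materials in S1 Text).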
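The overall out-of-bag misclassification rate is a count-weighted average of the class-specific rates. The Python sketch below checks that the class-specific out-of-bag error rates reported for this model (controls = 30%, cases = 48%) are consistent with the overall 38%, assuming the forest was fit to all 374 cases and 504 controls; that sample-size assumption and the function name are ours, not stated for the random forest in the text.

```python
def overall_error_rate(class_counts, class_error_rates):
    """Count-weighted average of per-class misclassification rates."""
    total = sum(class_counts.values())
    misclassified = sum(
        class_counts[label] * class_error_rates[label] for label in class_counts
    )
    return misclassified / total


# Counts from the case-control classification; error rates as reported.
counts = {"case": 374, "control": 504}
error_rates = {"case": 0.48, "control": 0.30}

overall_oob_error = overall_error_rate(counts, error_rates)
```

Under these assumptions the weighted rate rounds to the reported 38%, and the gap between classes indicates that the model identifies controls more reliably than cases.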

Discussion
The present work represents the largest TB epidemiological study on a Northern Cape clinical population (n=878). In this study, we demonstrate the utility of population-based controls when disease exposure is known and transmission is community-spread (48), as seen in other studies in low-resource, high-burden countries (9,49). Logistic regression and random forest models both show gender and residence as significant and important TB risk predictors. Random forest assigned SES as an important variable, and SES was only significant when interacting with age in logistic regressions. Neither smoking, alcohol consumption, nor diabetes is associated with increased TB risk in any model. Two logistic regression models, interacting SES by age (SES model) and birthplace by residence (residence model), had similar explanatory power, improving on the common risk factor model. This study points to a historical context that may be unique to South Africa: the differential effects of Apartheid and the post-Apartheid period on sociodemographic and health outcomes.
Age and TB risk have a general inverted U-shape relationship. Risk is greatest in infancy, decreases through adolescence, increases again between 25 and 35 years of age, declines thereafter, and peaks once more after 65 years (50,51). In our study population (≥18 years), the empirical odds of active TB increase with age and peak in the 49-58-year-old age group, followed by a steady decline in empirical odds after age 58 until the oldest age group (Fig. 2B). This drop in empirical odds after age 58 is most likely due to the mortality of individuals with TB, potentially a signal of survivor bias (52). This interpretation is seen in the shifting proportions of cases and controls across age groups (Fig. 2A). From ages 21-58, most of the population are cases, and from ages 59-88 most of the population are controls (Fig. 2A). Age was neither a significant (logistic regression) nor an important (random forest) variable except when interacted with SES. SES's protective effect on TB risk is most prominent among 18-39 year-olds and becomes a risk factor among the eldest individuals (>65 years; Fig. 5A), that is, those who grew up and reached adulthood during Apartheid. Higher SES increasing TB risk is contrary to findings in populations in the United States and Mexico (51). This unique pattern may reflect South Africa's recent history of Apartheid and post-Apartheid societal and economic shifts. During Apartheid, individuals from historically marginalized backgrounds had limited career options, but some were able to become teachers, police officers, or nurses. Such occupations are associated with higher education requirements and would have facilitated access to larger salaries, transportation, and mobility.
Higher SES could result in apparently greater odds of TB because these individuals would have had better access to healthcare, facilitating both diagnosis and treatment. Universal access to education increased post-Apartheid, but given the wide variation in years of education among the youngest generations, education likely still covaries with SES. Given the unusual interaction here between age and years of education, future work should validate additional SES measures to resolve mechanisms of TB risk.
Consistent with previous research (53-56), we find TB risk is associated with living in larger towns. In our prior work, mobility in the Northern and Western Cape populations changed over the past 3 generations, with the highest levels of mobility in the grandparental generation (57). Therefore, we tested whether mobility (different birthplace and residence) affected TB risk. As expected, the protective effect of living rurally vanishes when someone moves to a larger town. Further, the individuals with the lowest TB risk are those who were born in a town and moved to a rural area. These findings are consistent with TB exposure nearing ~90% by 25-30 years old (13), with transmission occurring via community contacts during adolescence and adulthood. We hypothesize that those born in towns who later moved to a rural area benefit from both BCG vaccination and decreased adult exposure, thereby decreasing their overall odds of TB. BCG vaccination is standard for children in South Africa; however, children in rural areas may have lower vaccination rates (observation communicated by clinical staff in the study catchment). Future work should consider collecting birthplace in addition to current residence to better identify TB risk.
Invariably across studies, men are on average 1.7 times more likely to have TB (58-60). Sex biases like this are common in other infectious diseases (61,62) and are attributable to an intersection of sex (biological factors, e.g., immune function) and gender (social and behavioral factors, e.g., risk-taking behavior) (63). Despite smoking not being a significant TB risk, we found 75.5% of men smoke compared to 55.8% of women, indicating at least some gender differences in risky behaviors in the Northern Cape population.
Smoking and alcohol consumption have been shown to increase TB risk and mortality in the Northern Cape and at the national level (64-67). In our models, smoking had an effect on TB risk in the expected direction and alcohol consumption had no effect. Neither variable reached statistical significance in the regression models, and both failed to meet any level of importance in the random forest model. Self-reporting biases in observational studies like this one are a concern for variables like smoking, alcohol consumption, and SES measures (68). Our sample, however, reports much higher levels of smoking compared to large-scale national surveys (e.g., (69); men: 75.5% vs. 41%; women: 55.8% vs. 21%, respectively), suggesting minimal self-report bias in our study. It is possible that the weak effects of smoking and alcohol observed in our models are due to our method of binary classification. We collected fine-scale smoking and alcohol phenotypes (Supplementary Methods in S1 Text), but because of the high missingness of these phenotypes, we ultimately classified participants as Smokers/Non-smokers and Drinkers/Non-Drinkers. This stratification may mask the heterogeneity of drinking and smoking behaviors, such as casual and binge substance use or differences in the types of alcohol and smoking materials consumed. Further TB epidemiological studies in the Northern Cape should explore these smoking and alcohol phenotypes in more detail.
Active TB progression is a multifactorial process involving the environment, genetics, and their interaction (1,4). Our results from the NCTB cohort indicate that sociodemographic variables strongly impact active TB risk. Effects that are unique to the Northern Cape Province may reflect how changes from the Apartheid to the post-Apartheid environment modified social factors, such as SES and mobility, which in turn impacted lifetime TB risk. This work provides a baseline to design well-informed future studies, such as exploring host genetic correlates of active TB progression in this population (Supplementary Discussion in S1 Text).

Table 1: Odds ratios and p-values for the Demographic and Socio Behavioral Variables used in the Common Risk Factor Model and SES Model.