Abstract
Radiology may better define tuberculosis (TB) severity and guide duration of treatment. We aimed to systematically study baseline chest X-rays (CXR) and their association with TB treatment outcome using real-world data. We used logistic regression to associate TB treatment outcomes with CXR findings, including percent of lung involved in disease (PLI), cavitation, and Timika score, alone or in combination with other clinical characteristics, stratifying by drug resistance status and HIV (n = 2,809). We fine-tuned convolutional neural nets (CNN) to automate PLI measurement from the CXR DICOM images (n = 5,261). PLI is the only CXR finding associated with unfavorable outcome across drug resistance and HIV subgroups [Rifampicin-susceptible disease without HIV, adjusted odds ratio (aOR) 1·11 (1·01, 1·22), P-value 0·025]. The most informed model of baseline characteristics tested predicts outcome with a validation mean area under the curve (AUC) of 0·769. PLI and Timika (AUC 0·656 and 0·655 respectively) predict unfavorable outcomes better than cavitary information (best AUC 0·591). The addition of PLI improves prediction compared to sex and age alone (AUC 0·680 and 0·627, respectively). PLI>25% provides a better separation of favorable and unfavorable outcomes compared to PLI>50%. The best performing ensemble of CNNs has an AUC 0·850 for PLI>25% and mean absolute error of 11·7% for the PLI value. PLI is better than cavitation for predicting unfavorable treatment outcome in pulmonary TB in non-clinical trial settings and it can be accurately and automatically predicted with CNNs.
One Sentence Summary The percent of lung involved in disease improves prediction of unfavorable outcomes in pulmonary tuberculosis when added to clinical characteristics.
INTRODUCTION
Pulmonary tuberculosis (TB) has a wide spectrum of clinical presentation ranging from incidentally-found asymptomatic disease to severe lung destruction with cachexia and multisystem organ failure(1). The extent of pulmonary disease and its secondary effects on other organ systems is thought to influence short term prognosis, treatment response and long-term sequelae of TB(2, 3). A standardized and generalizable measure of baseline severity for TB disease can support the optimization of treatment regimens, guide care resources for treatment monitoring, and prognostication(4, 5). Such a tool can inform clinical trial design and stratified enrollment for TB across the drug resistance spectrum. There are now several tools to stratify patients based on severity profiles in clinical trials(5–8). In real-world settings, existing tools have shown promising results for predicting treatment outcome or culture conversion. However, research on these tools has relied often on small patient samples from a single site without external validation, has used only one modality of clinical data, or focused on high-cost limited access tools(9–11). Associations of chest X-ray (CXR) findings with unfavorable outcomes or severity have commonly focused on the presence of lung cavitation(4–8, 12, 13). The Timika or Ralph score sums the percent of lung involved in disease (PLI) on CXR with 40 points added if any cavities are present(11). Other approaches to assessing radiological severity have included a count of the number of zones affected by disease (0-6)(12, 13) and a dichotomization of PLI at a threshold of 50%(7, 14, 15). In addition to radiology, multiple clinical variables have been highlighted as associated with unfavorable treatment outcomes, including male sex(4), advanced age(4, 16), low BMI(4, 16), alcohol use(16), diabetes mellitus(17), malignancy(18), HIV co-infection(4), smear positivity or grade(4), and low adherence to treatment(4). 
It is not clear what the real-world value of CXR findings is for treatment response prediction or if radiological variables can improve treatment response prediction when combined with non-radiological variables. Here, we systematically study baseline CXR findings alone or in combination with other clinical variables and assess their association with TB treatment outcome. We used the TB-Portals database(19) that collects multimodal information from drug-susceptible and drug-resistant TB across geographically diverse real-world treatment settings. In conjunction, we aim to automate the measurement of the most predictive CXR findings using machine learning to facilitate access to severity assessment in high TB prevalence settings.
RESULTS
Subhead 1: Patient inclusion, baseline characteristics and treatment outcomes
At the time of access, TB-Portals included data on 11,282 care episodes (11,067 patients) from 13 countries between 2008-2023. The best-represented countries were Ukraine (n = 3,176), Georgia (n = 2,953), and Moldova (n = 1,280) (Table S1, Fig. S1). A total of 2,809 patient care episodes fit our inclusion criteria (Fig. 1). We stratified patients into three groups based on rifampicin susceptibility and HIV: (a) without HIV + rifampicin-susceptible TB: training-validation n = 566 (Rif-S1), test n = 285 (Rif-S2), (b) without HIV + rifampicin-resistant TB: training-validation n = 1,056 (Rif-R1), test n = 530 (Rif-R2), and (c) with HIV + any TB: n = 372. Compared with Rif-S1, the Rif-R1 and HIV subgroups had a higher frequency of prior TB, anemia, smoking, alcohol use, and other comorbidities (Table 1). People with HIV had a higher frequency of extrapulmonary disease, low BMI, smoking, alcohol use, and drug use than people without HIV. Pulmonary nodules (Rif-S1 81%) and cavities (Rif-S1 37%) were the most common CXR findings across all groups. The Rif-R1 group had the highest frequency of cavities (45%), and people with HIV the lowest (28%). Cavities were most commonly small (<3 cm) (all groups, Rif-S1 24%). Median PLI ranged from 18% to 26%, and Timika from 26 to 45 across the three groups (Table 1).
Subhead 2: Non-radiological features associated with unfavorable outcomes
We built logistic regression models of treatment outcomes using 13 demographic, clinical, microbiological, and regimen variables for the Rif-S1 (n = 566), Rif-R1 (n = 1,056) and HIV subgroups (n = 372) (complete models, Table 2, Table S2). For people with HIV, in addition to the 13 variables we included rifampicin resistance and antiretroviral therapy. We identified high smear grade (≥ 2+) compared with smear-negative disease [Rif-S1 aOR 3.84 (1.94, 7.59), p-value <0.001] to be associated with unfavorable outcome in all three groups. Other features associated with unfavorable outcome were low BMI, older age at onset of disease, prior TB, smoking, alcohol use, anemia, low smear grade (scanty, or 1+ vs. smear negative disease), rifampicin resistance, and the lack of an effective TB regimen (Table 2).
Subhead 3: Percent lung involved in disease (PLI) is associated with unfavorable TB treatment outcome
We studied ten radiological variables for association with unfavorable outcomes (Table 3). Figure 2 shows examples of CXR images with low and high PLI, with and without cavitation. We added each variable one-by-one to the complete logistic regression models and used the Wald test for hypothesis testing of the coefficient (Table 4). PLI was significantly associated with unfavorable outcomes in all three groups [Rif-R1 group aOR 1.21 (1.13, 1.30) per 10% increase, p-value <0.001]. Timika was associated with unfavorable outcome in the Rif-R1 and HIV subgroups [Rif-R1 aOR 1.14 (1.08, 1.20) per ten-point increase, p-value <0.001]. Four cavitation variables were associated with unfavorable outcome in the Rif-R1 group (Table 4). The cavitation variable with the largest effect size was large cavities [aOR 3.21 (1.93, 5.33)]. Cavitation also improved model fit when added to a PLI-containing model for the Rif-R1 group (LRT p-value 0.016) (Table S3).
Subhead 4: PLI improves treatment outcome prediction accuracy
We combined the Rif-S1 and Rif-R1 groups (n = 1,622) to boost statistical power. We trained logistic regression models on 75% of the data (n = 1,216) and assessed their generalizability to the remaining 25% (n = 406). We evaluated seven single-variable radiological models (Fig. S2A, D). PLI and Timika had the highest accuracy [AUC (PLI) 0.656 (0.595, 0.717)], and the former performed significantly better than cavitation [ΔAUC (PLI - Cavities (size)) 0.065 (0.000, 0.130), p-value 0.034]. The addition of cavitary disease to PLI did not improve accuracy [ΔAUC (PLI – PLI+Cavities (y/n)) −0.001 (−0.023, 0.021), p-value 0.590] indicating that the predictive accuracy of Timika is derived predominantly from the PLI component (Fig. S2B, D). The addition of PLI to a sex+age model significantly improved accuracy [ΔAUC (sex+age+PLI – sex+age) 0.052 (0.011, 0.093), p-value 0.012] but did not reach the performance of the complete 13 non-radiological variable model (Fig. S2C, D). The change in AUC resulting from addition of PLI was similar in magnitude by resistance group but was only statistically significant for the Rif-R1 group (Table S4). We repeated the analysis with people with HIV and observed similar improvements in prediction when PLI was added to the sex+age model but the increases were not statistically significant [ΔAUC (sex+age+PLI – sex+age) 0.050 (−0.029, 0.129), p-value 0.080] (Fig. S4).
Subhead 5: PLI improves prediction accuracy in independent data
We used a chronologically independent dataset of patients without HIV (Rif-S2: 2020-2023, Rif-R2: 2021-2023) to validate model accuracy (n = 815). This sample had a similar distribution of sex, prior TB, and rifampicin resistance as the training-validation data (Table 1) but was skewed geographically (85% from Ukraine) (Fig. S1C), and had a higher frequency of anemia, other comorbidities, smoking, alcohol use, and high smear grade. On this independent data, we observed a similar increase in model accuracy with the addition of Timika or PLI to sex+age as we observed in the training-validation set [ΔAUC (sex+age+PLI – sex+age) 0.054 (0.028, 0.080), p-value <0.001] (Fig. 3). The addition of PLI to a model with sex+age+SG also improved prediction accuracy in the test dataset [ΔAUC (sex+age+SG+PLI – sex+age+SG) 0.028 (0.006, 0.050), p-value 0.004] (Fig. S3).
Subhead 6: Impact of radiology on the stratification of risk
To understand the clinical implications of using radiology for baseline TB risk assessment, we tuned the probability threshold defining high vs. low risk to maintain sensitivity at >98% for predicting unfavorable outcome (Methods, Fig. 4A). This allows for a scenario in which the risk assessment focuses on ruling out unfavorable outcomes. We then tested the optimal threshold for each model on the independent dataset. Sex+age+PLI specificity increases to 20.0% from 8.6% vs. sex+age, and exceeds specificity of the complete model (15.1%), with comparable sensitivity (sex+age 99.5%, sex+age+PLI 97.2%, complete 99.1%) (Fig. 4B). In absolute numbers, the addition of PLI to sex+age increases the size of the low-risk group from 53 (6.5% of total, n = 1 unfavorable outcome) to 127 (15.6% of total, n = 6 unfavorable outcome) in the independent data (n = 815) (Table S5).
Subhead 7: Optimal threshold of PLI and Timika score dichotomization
A PLI cutoff of 50% was previously suggested as a predictor of unfavorable treatment outcome(14, 15). We studied the optimal threshold on PLI using Monte Carlo cross-validation. Using the training-validation set of people without HIV (n = 1,622), we identified the optimal threshold for PLI at 25%, and for Timika at 56/140 (Fig. 5A) to maximize the geometric mean of sensitivity and specificity. PLI at 25% had higher sensitivity compared with PLI at 50% (sensitivity-specificity of 59.7-65.3 vs. 25.6-88.9 respectively) and increased the size of the high-severity group by 171% (high-risk: 333 vs. 121 out of total n=815 respectively).
Subhead 8: PLI in external severity scores of unfavorable outcomes
We benchmarked two TB severity scores, A5414/SPECTRA-TB (protocol in development) and endTB-Q(6) (ClinicalTrials.gov NCT03896685), as they both incorporate radiological findings and are currently being investigated as a guide for shortening TB treatment in two randomized clinical trials. We assessed the accuracy of these scores in predicting treatment outcome in real-world TB care settings and assessed how the real-world accuracy of these scores changes with the use of PLI ≥ 25% or continuous PLI instead of cavitation. A5414/SPECTRA-TB scores severity in drug-susceptible TB based on data from S31/A5349(7) (ClinicalTrials.gov NCT02410772) assessing a four-month rifapentine-containing treatment regimen for drug-susceptible TB, and includes extent of disease at PLI ≥ 50% (SPECTRA50, manuscript under review). We compared this model to a modified SPECTRA25 model (PLI ≥ 25%), sex+age+SG+PLI and complete+PLI trained on Rif-S1 (n = 566) and tested on the pooled Rif-S data (n = 851) (Supplementary Methods). There was no statistically significant difference between the AUCs for SPECTRA50, SPECTRA25, and sex+age+SG+PLI, but the modified SPECTRA25 score had a slightly higher mean AUC than the SPECTRA50 score (AUC 0.689 and 0.678, respectively) (Fig. S5A). endTB-Q uses smear grade and cavity presence to predict severity in drug-resistant TB(6). We compared a logistic regression model based on endTB-Q to a modified endTB-Q_PLI (replacing cavity presence with PLI (0-100)), sex+age+SG+PLI and complete+PLI (Supplementary Methods) trained on Rif-R1 (n = 1,056) and tested on Rif-R (n = 1,586). endTB-Q_PLI and sex+age+SG+PLI (AUC 0.665 and 0.726, respectively) had higher AUCs than endTB-Q (AUC 0.579) (Fig. S5B, Table 2).
Subhead 9: Artificial intelligence to accurately predict PLI and Timika
Computer-assisted diagnosis (CAD) uses artificial intelligence (AI) to automate TB diagnosis from CXR and has gained rapid clinical adoption globally. CAD is trained to classify TB disease as present or absent(20, 21) and has recently been implemented for disease severity(22). We trained an AI model to classify TB disease severity from CXR focusing on PLI (both continuous and binarized at 25%) and Timika (continuous and binarized at 55). From TB-Portals, 5,261/7,213 chest X-ray DICOM images passed quality control for use in AI (Supplementary Methods). Of these, 2,893 were Rif-S and 2,368 were Rif-R. The ensemble CNN model (DenseNet121-res224-all) had the highest accuracy for predicting PLI and Timika score, independently and jointly for Rif-S and Rif-R data subsets [test MAE 11.7 (95%CI 10.6-12.8) and 15.8 (95%CI 14.6-17.0) respectively; test AUC 0.86 (95%CI 0.82-0.88) and 0.78 (95%CI 0.73-0.83) respectively] (Table S6).
DISCUSSION
We show that among ten CXR findings in pulmonary TB, PLI is most consistently associated with unfavorable treatment outcome. Cavitation improves model fit for the rifampicin-resistant group, but does not improve the prediction of unfavorable outcome when added to PLI. PLI improves prediction of unfavorable outcome over demographics with and without smear grade. PLI increases the number of low-risk patients compared to demographics alone and may help increase the number of patients successfully treated with a less intense or shorter regimen, when such regimens become available (e.g., ClinicalTrials.gov NCT02410772)(6, 7, 15, 23).
Our study is congruent with previous works showing that CXR findings alone are not sufficiently accurate for predicting treatment outcome(24). Defining high severity of disease at PLI≥25% achieves a higher sum of sensitivity and specificity than the most common current use with PLI≥50%, but clinical variables like sex, age, smear grade, and comorbidities are needed for higher accuracy. Even then, the best combined models perform at accuracy of ~0.68-0.75. We evaluated the A5414/SPECTRA-TB and endTB-Q clinical trial severity scores for predicting outcomes for real-world rifampicin-susceptible and rifampicin-resistant TB, respectively. The definitions and ascertainment of unfavorable outcomes differ between clinical trial and real-world settings. In the former, recurrence of disease and/or complex composite outcomes and adherence are typically captured but not in the latter. Despite these differences, the AUCs for outcome prediction are comparable across data from these two settings (Yu A et al, unpublished). The performance of A5414/SPECTRA-TB may be improved if PLI at 25% is used to replace PLI at 50% and that of endTB-Q is improved if PLI is used to replace cavitation. Validation on external data and specifically on data from clinical trials of TB shortening is recommended to confirm these findings and better assess their implications.
Pulmonary cavitation in TB is thought to result from the necrosis and expansion of TB granulomas or diseased lung(25). After their formation during or in the recovery phase of active disease, cavities often persist in the lung chronically and/or lifelong(3). The presence and size of cavitation have been previously linked to unfavorable treatment outcomes, and used to describe severity in clinical trial settings(4, 5, 7, 12). PLI describes the proportion of opacified lung parenchyma, a process expected to start earlier than cavitation and to subside with recovery and cure(11, 13). We observe a stronger association for baseline PLI and outcome than for cavitation and outcome, and we identify no added predictive role of cavitation over PLI alone. We speculate that previous associations of cavitation with unfavorable outcomes in drug-susceptible disease may have been related to a correlation between cavitation and PLI in the subacute setting, i.e. patients with more extensive parenchymal disease may be more likely to progress to cavitation. It is also possible that patients with a delayed presentation are more likely to have both extensive disease involvement and cavitation, as the latter takes more time to develop.
People living with HIV can have more subtle TB findings on CXR than people without HIV(26). This is believed to be due to ineffective recruitment of immune cells to the site of disease. We observed lower prevalence of cavitary disease and Timika score in the HIV group compared to the non-HIV groups. PLI on the other hand is associated with unfavorable outcomes in the HIV group with a similar effect size to that observed for the non-HIV group. This suggests that PLI is an appropriate universal measure of radiological TB severity.
As digital CXR technology is now readily available in most TB treatment settings, the use of AI can automate interpretation, potentially improve accuracy and reduce inter-reader variability. We were able to accurately automate PLI thresholding at 25% and further work should validate these models prospectively across different geographic settings and directly in risk stratification.
Our study had several limitations including its retrospective nature and lack of prospective evaluation of clinical characteristics and treatment. Because we synthesized data across several cohorts that may have different data quality and/or entry, we cannot rule out bias or mismeasurement. In the complete models of outcome, we could not account for adherence as these data are not collected by the programs or TB-Portals. Another limitation is the potential CXR inter-reader variability, especially given that not all readers were trained radiologists. Such limitations are expected in real-world data, and despite their presence in our study, we provide one of the largest evaluations of radiological predictors of treatment outcomes in a multicohort setting. Finally, the use of radiological features in severity scoring is dependent on the availability of imaging, and we acknowledge that access to imaging can be limited. However, digitalized imaging has been increasingly adopted as it becomes less expensive, and the use of automation has further reduced costs.
This work builds on a previous analysis of a smaller TB-Portals dataset where PLI was found to be associated with unfavorable treatment outcomes(27). We extended this analysis to systematically compare ten radiological findings and assessed their added value to clinical and microbiological data for predicting treatment outcomes. We provided a range of combined PLI and clinical severity models, evaluated the implications of using PLI and its optimal threshold in severity scores currently used in clinical trials, and lastly developed a new accurate AI model for automating PLI≥25%. Our work enables the improved use of CXR data in severity assessment in research and clinical trials for shortening treatment. Further, we hypothesize that baseline PLI measurement may also prove helpful in predicting long-term pulmonary sequelae of TB; further study is needed.
MATERIALS AND METHODS
Study population
We used the TB-Portals multi-cohort database curated by the National Institute of Allergy and Infectious Diseases (NIAID) (accessed on September 19th, 2023; see Data and Code Availability) (Table S1). Inclusion criteria are summarized in Figure 1. Table 3 summarizes the processing of radiological and treatment outcome variables. For each CXR, findings were coded by one clinician. Multiple clinicians provided these readings to avoid biasing the data toward a single observer’s image reading practices.
Data preprocessing
We split the data into three groups based on drug resistance and HIV co-infection: (a) without HIV + rifampicin-susceptible (Rif-S), (b) without HIV + rifampicin-resistant (Rif-R) and (c) with HIV (HIV). We split the Rif-S and Rif-R groups into training-validation (Rif-S1 and Rif-R1) and test (Rif-S2 and Rif-R2) datasets based on date of registration. For Rif-S, we assigned all cases between 2008-2019 and 2021-2023 to the training-validation and test datasets, respectively, and randomly assigned the cases from 2020 using the train_test_split function from the SciKit-Learn(28) model selection toolkit (v1.1.3), giving a 566:285 split. For Rif-R, we assigned all cases between 2008-2020 and 2022-2023 to the training-validation and test datasets, respectively, and randomly assigned the cases from 2021, giving a 1,056:530 split. Together, these produce the final 1,622:815 (approximately 67:33) data split. We used Rif-S1, Rif-R1 and HIV in parallel to build logistic regression models for association studies and model fit analyses. We used Rif-S1 and Rif-R1 with Monte Carlo cross-validation (75:25) to test the predictive accuracy of logistic regression models. We used Rif-S2+Rif-R2 with resampling with replacement to validate the predictive accuracy findings.
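The chronological split described above can be illustrated with a minimal sketch. It uses only the Python standard library in place of SciKit-Learn's train_test_split, and the function name, the record layout, and the `boundary_test_fraction` parameter are hypothetical; the fraction of boundary-year cases assigned to the test set is not stated in the text.

```python
import random

def chronological_split(records, boundary_year, boundary_test_fraction, seed=0):
    """Split records by registration year; the boundary year is divided at random.

    Records before `boundary_year` go to training-validation, records after it
    go to test, and records from the boundary year itself are shuffled and
    shared between the two sets (hypothetical fraction, not from the paper).
    """
    train, test, boundary = [], [], []
    for rec in records:
        if rec["year"] < boundary_year:
            train.append(rec)
        elif rec["year"] > boundary_year:
            test.append(rec)
        else:
            boundary.append(rec)
    rng = random.Random(seed)
    rng.shuffle(boundary)
    n_test = round(len(boundary) * boundary_test_fraction)
    test.extend(boundary[:n_test])
    train.extend(boundary[n_test:])
    return train, test

# Synthetic records: 10 care episodes per year from 2008 to 2023.
records = [{"id": i, "year": 2008 + i % 16} for i in range(160)]
train, test = chronological_split(records, boundary_year=2020,
                                  boundary_test_fraction=0.5)
```

Splitting by time rather than fully at random keeps the test set chronologically later than most of the training data, which is what makes it a check on temporal generalizability.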
Outcome definition
Treatment outcomes were concordant with the 2013 WHO criteria(29). Death (during the course of treatment), treatment failure (treatment termination or need for permanent regimen change of at least two drugs), and palliative care were considered unfavorable outcomes, while cure (treatment completion + bacteriological proof of conversion in three consecutive cultures at least 30 days apart) and treatment completion (treatment completion with no signs/symptoms of TB disease) were considered favorable outcomes (Table S2).
Regression
We fit univariable and multivariable logistic regression models using the Logit tool from the Statsmodels(30) Python library (v0.13.2) and the Newton-Raphson method. We built a complete non-radiological logistic regression model using available demographic, medical, social, microbiological, and treatment variables, selecting variables based on their suspected or known association with treatment outcomes in the literature(4, 16, 17). We built a reduced model composed of sex and age at onset of disease (sex+age) to model clinical scenarios in which other characteristics are unavailable, excluding features that are difficult (e.g. extrapulmonary disease) or impossible (e.g. effective treatment) to collect at baseline, and that may be missing (e.g. BMI or comorbidities). We tested a second version of the reduced model with smear grade (sex+age+SG) given its strong association with outcomes(4, 18). We used the same logistic regression approach for radiological models. We compared the goodness of fit of nested models using likelihood ratio tests (LRT) and performed hypothesis testing with a chi-squared test, false discovery rate (FDR)-correcting P-values for multiple testing. For training-validation predictive accuracy, we conducted Monte Carlo cross-validation with 1,000 iterations. Specifically, for each iteration, we trained the logistic regression models on 75% of the data, and predicted on the remaining 25%. For testing, we trained the models on Rif-S1+Rif-R1, predicted on Rif-S2+Rif-R2 and applied resampling with replacement at 1,000 iterations to generate AUC distributions.
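To make the fitting and nested-model comparison concrete, the following is a minimal sketch of logistic regression by Newton-Raphson with a one-degree-of-freedom likelihood ratio test. It is a standard-library stand-in for the Statsmodels Logit tool, not the study code; the helper names and the synthetic PLI data (including the assumed true coefficients) are hypothetical.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * m for a, m in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def predict(beta, xi):
    """Predicted probability of the outcome for one design-matrix row."""
    return 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))

def fit_logit(X, y, max_iter=50, tol=1e-10):
    """Newton-Raphson fit; each row of X must start with 1.0 for the intercept."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(max_iter):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for xi, yi in zip(X, y):
            mu = predict(beta, xi)
            w = mu * (1.0 - mu)
            for j in range(p):
                grad[j] += (yi - mu) * xi[j]
                for k in range(p):
                    hess[j][k] += w * xi[j] * xi[k]
        step = solve(hess, grad)  # Newton step: H^{-1} * gradient
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    loglik = sum(math.log(predict(beta, xi)) if yi == 1
                 else math.log(1.0 - predict(beta, xi))
                 for xi, yi in zip(X, y))
    return beta, loglik

def lrt_pvalue_df1(ll_null, ll_full):
    """Chi-squared (1 df) p-value for nested models differing by one parameter."""
    stat = max(0.0, 2.0 * (ll_full - ll_null))
    return math.erfc(math.sqrt(stat / 2.0))

# Synthetic example: outcome risk rises with PLI (assumed coefficients).
rng = random.Random(0)
pli = [rng.uniform(0, 100) for _ in range(500)]
y = [1 if rng.random() < 1.0 / (1.0 + math.exp(2.0 - 0.02 * x)) else 0
     for x in pli]

beta_null, ll_null = fit_logit([[1.0] for _ in pli], y)
beta_full, ll_full = fit_logit([[1.0, x] for x in pli], y)
or_per_10 = math.exp(10 * beta_full[1])  # odds ratio per 10% increase in PLI
p_lrt = lrt_pvalue_df1(ll_null, ll_full)
```

Exponentiating ten times the PLI coefficient gives an odds ratio on the "per 10% increase" scale used when reporting aORs for PLI in the Results.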
Statistical analysis for testing model prediction on independent data
We used bootstrapping to generate AUC distributions for each tested logistic regression model to test predictive accuracy. For the training-validation dataset, the bootstrapping took the form of Monte Carlo cross-validation, splitting the dataset 75:25 at each iteration for 1,000 iterations. For the test dataset, the bootstrapping was done through resampling with replacement for 1,000 iterations. At every iteration, we computed the difference between model AUCs (ΔAUC). To assess statistical significance, we used a one-tailed empirical p-value, dividing the number of iterations with ΔAUC ≤ 0 by the total number of iterations [p-value = #(ΔAUC ≤ 0)/1,000]. We corrected for multiple hypothesis testing by controlling the Benjamini-Hochberg false discovery rate to <0.05.
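The empirical p-value for ΔAUC on the test dataset can be sketched as follows, assuming resampling with replacement of patients and a rank-based (Mann-Whitney) AUC. The function names and the synthetic scores are hypothetical, and redrawing resamples that contain a single outcome class is an assumption of this sketch rather than a documented step.

```python
import random

def auc(y, scores):
    """AUC from the Mann-Whitney U statistic, using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    rank_sum = sum(r for r, yi in zip(ranks, y) if yi == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def empirical_p(y, score_a, score_b, n_iter=1000, seed=0):
    """One-tailed p-value: share of resamples where AUC(a) - AUC(b) <= 0."""
    n = len(y)
    if not 0 < sum(y) < n:
        raise ValueError("both outcome classes are required")
    rng = random.Random(seed)
    deltas = []
    while len(deltas) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        if not 0 < sum(yb) < n:
            continue  # assumption: redraw resamples with only one class
        a = auc(yb, [score_a[i] for i in idx])
        b = auc(yb, [score_b[i] for i in idx])
        deltas.append(a - b)
    p_value = sum(1 for d in deltas if d <= 0) / n_iter
    return p_value, deltas

# Synthetic example: model A is informative, model B is noise.
rng = random.Random(1)
y_demo = [1 if rng.random() < 0.3 else 0 for _ in range(200)]
score_a_demo = [yi + rng.gauss(0, 1) for yi in y_demo]
score_b_demo = [rng.gauss(0, 1) for _ in y_demo]
p_demo, deltas_demo = empirical_p(y_demo, score_a_demo, score_b_demo)
```

The same list of ΔAUC values can also supply a percentile interval, since both summaries come from one set of resamples.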
Rule-out risk assessment
We tuned the logit probability thresholds on training-validation data to predict unfavorable outcome with the maximal geometric mean of sensitivity and specificity while maintaining sensitivity ≥ 0.98. We then tested the specificity and true negative rate of these thresholds and models on the test dataset.
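A minimal sketch of this constrained threshold search is given below. It assumes that a patient is called high risk when the predicted probability is at or above the threshold and that candidate thresholds are the observed predicted probabilities; the function name and toy data are hypothetical.

```python
import math

def rule_out_threshold(y, prob, min_sensitivity=0.98):
    """Pick the threshold with the best geometric mean of sensitivity and
    specificity among thresholds that keep sensitivity >= min_sensitivity.

    y: 1 = unfavorable outcome, 0 = favorable outcome.
    prob: predicted probability of unfavorable outcome.
    Returns (threshold, sensitivity, specificity).
    """
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    best = None
    for t in sorted(set(prob)):
        tp = sum(1 for yi, p in zip(y, prob) if yi == 1 and p >= t)
        tn = sum(1 for yi, p in zip(y, prob) if yi == 0 and p < t)
        sens, spec = tp / n_pos, tn / n_neg
        if sens >= min_sensitivity:
            g = math.sqrt(sens * spec)
            if best is None or g > best[0]:
                best = (g, t, sens, spec)
    return best[1], best[2], best[3]

# Toy example: the threshold 0.5 keeps both unfavorable outcomes in the
# high-risk group while ruling out two of the three favorable outcomes.
y_toy = [0, 0, 0, 1, 1]
prob_toy = [0.1, 0.2, 0.6, 0.5, 0.9]
threshold, sensitivity, specificity = rule_out_threshold(y_toy, prob_toy)
```

Because the lowest observed probability always yields a sensitivity of 1, at least one candidate threshold satisfies the constraint whenever the data contain an unfavorable outcome.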
Optimal prediction threshold for PLI and Timika
Using Rif-S1 and Rif-R1, we built a logistic regression model using PLI or Timika dichotomized at every integer value between 5 and 95 with 1000x Monte-Carlo cross-validation (75:25). For each model, we calculated the sensitivity and specificity, and assigned the best threshold to the model that has the highest geometric mean of sensitivity and specificity. We then computed the median of the 1000 best thresholds for PLI and Timika to generate the final optimal threshold, and validated the model compared to 50% PLI on independent data.
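The cutoff search can be sketched as below, with two simplifications that are assumptions of this example rather than the study procedure: the dichotomized value is used directly as the classifier (no logistic regression is fitted), and sensitivity and specificity are computed on the 25% held-out portion of each Monte Carlo split. The function names and synthetic data are hypothetical.

```python
import math
import random
import statistics

def gmean_at_cutoff(values, y, cutoff):
    """Geometric mean of sensitivity and specificity when value >= cutoff
    is classified as high risk."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    if n_pos == 0 or n_neg == 0:
        return 0.0
    tp = sum(1 for v, yi in zip(values, y) if yi == 1 and v >= cutoff)
    tn = sum(1 for v, yi in zip(values, y) if yi == 0 and v < cutoff)
    return math.sqrt((tp / n_pos) * (tn / n_neg))

def optimal_cutoff(values, y, n_iter=1000, holdout=0.25, seed=0):
    """Median, over Monte Carlo splits, of the best integer cutoff in 5..95."""
    rng = random.Random(seed)
    n = len(y)
    n_hold = int(round(n * holdout))
    best_cutoffs = []
    for _ in range(n_iter):
        idx = rng.sample(range(n), n_hold)  # held-out 25% of this split
        v = [values[i] for i in idx]
        t = [y[i] for i in idx]
        # max() returns the smallest cutoff among ties
        best_cutoffs.append(
            max(range(5, 96), key=lambda c: gmean_at_cutoff(v, t, c)))
    return statistics.median(best_cutoffs)

# Synthetic example in which outcomes change exactly at a value of 40.
values_demo = [i % 100 for i in range(500)]
y_demo = [1 if v >= 40 else 0 for v in values_demo]
cutoff_demo = optimal_cutoff(values_demo, y_demo, n_iter=100)
```

Taking the median of the per-iteration best cutoffs, rather than the single best cutoff on the full data, makes the final threshold less sensitive to any one split.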
External severity scores used in this study
SPECTRA50 is a model that includes age, BMI, diabetes, smear grade and extent of disease on CXR (PLI ≥50% vs. <50%). This model is a version of the original model generated from the S31/A5349(7) clinical trial (manuscript in review) that was pretrained and modified to fit our data structure. endTB-Q(6) is a simple classifier that uses smear grade and cavity presence. We used pretrained models with pre-defined coefficients (SPECTRA50) or trained logistic regression models (endTB-Q, sex+age+SG+PLI, complete+PLI) on the training-validation dataset of interest (Rif-S1 or Rif-R1 separately). We also tested modified versions of these scores based on findings from our analysis (SPECTRA25 and endTB-Q_PLI). SPECTRA25 is identical to SPECTRA50 except that extent of disease is dichotomized at PLI ≥25% vs. <25%, and endTB-Q_PLI combines smear grade with extent of disease (PLI 0-100). We tested performance on the full TB-Portals dataset, divided by drug resistance (Rif-S or Rif-R separately). We compared these models based on their AUCs. We also tested endTB-Q and endTB-Q_PLI as simple classifiers to mimic the use of endTB-Q in the original manuscript (using PLI ≥25% as a cutoff for endTB-Q_PLI instead of cavity presence).
Convolutional Neural Networks
We used pretrained CNN models from the TorchXRayVision(31) Python library (https://github.com/mlmed/torchxrayvision/) to perform patient-level regression and classification on quality-controlled CXR DICOM data from TB-Portals. We used DenseNet121-based regression on the whole lung in concordance with recent work demonstrating this approach as more effective than applying regression on a pre-segmented image(22). We split the dataset 80:10:10 across training-validation-test sets. The CNNs were pretrained on one or multiple benchmark datasets available through TorchXRayVision. We used the training dataset to fine-tune the pretrained CNN models on the prediction of PLI and Timika. We chose the best performing model from the validation set for generalizability assessment on the test set. We computed distributions for the AUC and mean absolute error (MAE) with bootstrapping. Further details are available in the Supplementary Materials.
List of Supplementary Materials
Materials and Methods
Fig S1 to S5
Tables S1 to S6
References (32–36) are only found in the Supplementary Materials
Data Availability
All data produced are available online at: https://tbportals.niaid.nih.gov/
Funding
National Institute of Allergy and Infectious Diseases / National Institutes of Health grant R01AI155765 (MF and KRJ)
National Institute of Allergy and Infectious Diseases / National Institutes of Health grant UM1 AI068634 (LH)
This work was also supported in part with Federal funds from the NIAID, NIH, Department of Health and Human Services under BCBB Support Services Contract HHSN316201300006W/75N93022F00001 to Guidehouse, Inc, and under NIAID, NIH, Business and Science Data Analytics contract HHSN316201200018W to Deloitte Consulting LLP.
The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Author contributions
Conceptualization and design: MG, MRF
Methodology: MG, RS, YE, MRF
Model design and modification: MG, MRF, AYX, EY, LH, GEV, RMS
Data acquisition: AR, AG, DH, ZY, GR
Data analysis: MG, RS, YE, MRF
Result interpretation: MG, RS, YE, MRF, AYX, EY, LH, GEV, RMS, AR, AG, DH, ZY, GR, KRJ
Visualization: MG, MRF
Supervision: MRF, KRJ
Writing – original draft: MG, RS, YE, MRF
Writing – review & editing: AYX, EY, LH, GEV, RMS, AR, AG, DH, ZY, GR, KRJ
Competing interests
Authors declare that they have no competing interests.
Data and materials availability
The TB-Portals dataset is made readily available to external collaborators through NIAID after signing a data use agreement. More information on the raw data is available on the TB-Portals website (https://tbportals.niaid.nih.gov). We base this paper on all TB-Portals data available for download by September 19th, 2023. We wrote all the scripts for this project in Jupyter notebooks using Python 3.9.12 on the Harvard Medical School O2 cluster and made them available on GitHub at https://github.com/farhat-lab/tbp-severity-scoring.
Acknowledgments
We express our gratitude to Pranav Rajpurkar, PhD, and Emma Chen, MS, for their valuable advice on AI model selection and implementation relevant to chest X-ray datasets.