Accuracy of computer-aided chest X-ray screening in the Kenya National Tuberculosis Prevalence Survey

Background: Community-based screening for tuberculosis (TB) could improve detection but is resource intensive. We set out to evaluate the accuracy of computer-aided TB screening using digital chest X-ray (CXR) to determine if this approach met target product profiles (TPP) for community-based screening. Methods: CXR images from participants in the 2016 Kenya National TB Prevalence Survey were evaluated using CAD4TBv6 (Delft Imaging), giving a probabilistic score for pulmonary TB ranging from 0 (low probability) to 99 (high probability). We constructed a Bayesian latent class model to estimate the accuracy of CAD4TBv6 screening compared to bacteriologically-confirmed TB across CAD4TBv6 threshold cut-offs, incorporating data on Clinical Officer CXR interpretation, participant demographics (age, sex, TB symptoms, previous TB history), and sputum results. We compared model-estimated sensitivity and specificity of CAD4TBv6 to optimum and minimum TPPs. Results: Of 63,050 prevalence survey participants, 61,848 (98%) had analysable CXR images, and 8,966 (14.5%) underwent sputum bacteriological testing; 298 had bacteriologically-confirmed pulmonary TB. Median CAD4TBv6 scores for participants with bacteriologically-confirmed TB were significantly higher (72, IQR: 58-82.75) compared to participants with bacteriologically-negative sputum results (49, IQR: 44-57, p<0.0001). CAD4TBv6 met the optimum TPP; with the threshold set to achieve a mean sensitivity of 95% (optimum TPP), specificity was 83.3%, (95% credible interval [CrI]: 83.0%-83.7%, CAD4TBv6 threshold: 55). There was considerable variation in accuracy by participant characteristics, with older individuals and those with previous TB having lowest specificity. Conclusions: CAD4TBv6 met the optimal TPP for TB community screening. To optimise screening accuracy and efficiency of confirmatory sputum testing, we recommend that an adaptive approach to threshold setting is adopted based on participant characteristics.


Introduction
With over 95% of tuberculosis (TB) cases and deaths occurring in developing countries, there is a need for substantially improved case detection to find the "missing millions" and accelerate action to achieve the sustainable development goals to end TB by 2030. (1)(2)(3)(4) Chest radiography (CXR) with computer-aided detection (CAD) software for TB has been recommended for systematic screening for tuberculosis disease in the most recent WHO TB Screening Guidelines. (5) However, supporting data have predominantly come from clinical settings, and CAD diagnostic accuracy is likely to vary considerably across different screening strategies and populations. (6)(7)(8) CXRs were used extensively in TB screening and active case finding (ACF) programmes in the mid-20th century due to high sensitivity (94%, 95% CI 88-98%), potential for high throughput, and lower infectious risk to health workers (compared to sputum collection for all). (9)(10)(11)(12) In addition, CXR can detect infectious but asymptomatic TB patients; this is important because a substantial fraction of TB transmission is attributable to the often prolonged asymptomatic infectious period. (13) Barriers to widespread CXR use include limited access to high quality radiography equipment, a critical shortage of radiologists in low- and middle-income countries (LMICs), and inter- and intra-observer variation in interpretation. (11,12,14) CAD software that provides a probabilistic score for TB offers a potential solution to these limitations. (6,7,15,16) Previous evaluations of CAD software have mostly been conducted in triage testing use cases, with very little data available to evaluate accuracy in community-based TB screening interventions. (6,8,16,17) Our aim was to evaluate the accuracy of the Computer-Aided Detection for Tuberculosis version 6 (CAD4TBv6) system for TB screening using a large data set (n=61,848) from the 2016 Kenya National TB prevalence survey.
(18,19) To do this we used a Bayesian modelling approach to evaluate the accuracy of CAD4TBv6 and Clinical Officer CXR interpretation against the bacteriological reference standard used within the prevalence survey. We hypothesized that CAD4TBv6 diagnostic sensitivity and specificity would meet the target product profile (TPP) for a test to identify people suspected of having TB, but that accuracy would vary between population groups, implying that an adaptive approach to CAD screening would be required. (20) . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted October 30, 2021. ; https://doi.org/10.1101/2021.10.21.21265321 doi: medRxiv preprint

Study design
We conducted a retrospective analysis of cross-sectional individual-level participant data from adult community members who participated in the 2016 Kenya National TB Prevalence Survey.(19)

Study population and Kenya TB prevalence survey procedures
The 2016 Kenya prevalence survey was undertaken to determine the prevalence of bacteriologically confirmed pulmonary TB among adults aged 15 [...] of TB prevalence survey activities. In line with national guidelines, participants referred for TB treatment were offered HIV testing at referral facilities.

Study procedures and definitions
Analysis was conducted between January 2020 and October 2021. Anonymised, compressed DICOM CXR images were uploaded from the prevalence survey digital archive to the Delft Imaging CAD4TB cloud server and analysed using CAD4TBv6. (22) Results were provided as probabilistic scores, ranging from 0 to 99, with higher scores indicating a greater probability of TB. The reference standard for this analysis was bacteriologically-confirmed TB, defined as sputum Xpert and/or culture positive with MTB speciation. Analysis was conducted independently; the commercial provider (Delft Imaging) was not part of the study team and had no role in study design, data collection, analysis, or interpretation of results.

Statistical analysis
The characteristics of prevalence survey participants were summarized using means (with standard deviations), medians (with interquartile ranges), and percentages, and compared by Clinical Officer CXR interpretation. We used the Kruskal-Wallis test to investigate differences in CAD4TBv6 scores between Clinical Officer interpretation groups, and Chi-square and Fisher's exact tests for categorical participant characteristics. Distributions of CAD4TBv6 scores were summarized by medians and 95% highest density intervals (HDI) and compared by whether sputum was collected or not, and by sputum bacteriological status.
For our primary study outcome, we compared the accuracy (sensitivity, specificity, and area under the receiver operating characteristic curve [AUC]) of CAD4TBv6 with the bacteriological reference standard. As collection of sputum was conditional on either a participant reporting a cough of two weeks or longer or a Clinical Officer CXR classification of "abnormal, suggestive of TB", Bayesian latent class modelling was employed to infer disease prevalence within the portion of the study population without TB symptoms or CXR signs suggestive of TB, and to estimate the sensitivity, specificity, and AUC of CAD4TBv6 at thresholds ranging from 0 to 99. The model also outputs estimates of the underlying status of active pulmonary TB, and the sensitivity and specificity of Clinical Officer CXR interpretation as a screening tool and of sputum bacteriological results for the underlying true TB status. Full model details and diagnostics are reported in Supplemental Text 2. We placed informative priors on the overall prevalence of TB, inferred from the prevalence survey results, and weakly informative priors on other model parameters. To aid model convergence, we fixed the specificity of the combined bacteriological reference standard (Xpert or culture positive) at 99%. Models were fitted in Stan using the cmdstanr interface, with convergence assessed by inspecting trace plots across three sampling chains and Gelman-Rubin statistics. Inference was based on summarising 12,000 post-warm-up samples.
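As an illustration of the convergence check mentioned above, the following is a minimal sketch of the classic (non-split) Gelman-Rubin statistic for a single parameter sampled in several chains. It is not the study's code: the text does not state which variant of the statistic was used, and the function name `gelman_rubin` and the toy chains are assumptions made here for illustration only.

```python
from statistics import mean, variance


def gelman_rubin(chains):
    """Classic Gelman-Rubin potential scale reduction factor.

    `chains` is a list of equal-length lists of post-warm-up draws
    for ONE parameter (hypothetical input format).
    """
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    chain_means = [mean(c) for c in chains]
    # W: average of the within-chain sample variances
    within = mean([variance(c) for c in chains])
    # B: n times the sample variance of the chain means
    between = n * variance(chain_means)
    # Pooled estimate of the posterior variance
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


# Toy draws (made up): three chains exploring the same region,
# and three chains stuck in different regions.
mixed = [[1, 2, 3, 4], [2, 3, 4, 1], [3, 4, 1, 2]]
stuck = [[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23]]
print(gelman_rubin(mixed), gelman_rubin(stuck))
```

Values close to 1 suggest the chains agree; values well above 1, as for the `stuck` example, suggest they have not converged to a common distribution.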
We plotted model posterior summary estimates of sensitivity and specificity across CAD4TBv6 thresholds, and compared them to the optimum (sensitivity: 95%, specificity: 80%) and minimum (sensitivity: 90%, specificity: 70%) TPP for a community or referral test to identify people suspected of having TB. (20) In secondary analysis, we stratified model sensitivity and specificity estimates by participant age group, sex, chronic cough status and history of previous TB treatment, and summarised them as a function of CAD4TBv6 threshold, estimating the accuracy that would be achieved in each group by setting an overall screening CAD4TBv6 threshold to achieve the optimum TPP sensitivity cut-off (95%). We did not stratify by HIV status, as there were substantial missing data and HIV testing was not performed in the prevalence survey. All analysis was done using R v4.1.1 (R Foundation for Statistical Computing, Vienna).
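The comparison against the TPP can be sketched as a simple threshold sweep. This is only an empirical illustration on made-up data with a complete reference standard; the study's estimates come from a latent class model because sputum was collected from a subset of participants. The function names, the toy scores, and the convention that a score at or above the threshold counts as screen-positive are all assumptions.

```python
# TPP targets as stated in the text: (sensitivity, specificity)
TPP = {"optimum": (0.95, 0.80), "minimum": (0.90, 0.70)}


def accuracy_at_threshold(scores, has_tb, threshold):
    """Sensitivity and specificity when score >= threshold is screen-positive.

    Assumes at least one person with TB and one without in the data.
    """
    tp = sum(1 for s, d in zip(scores, has_tb) if d and s >= threshold)
    fn = sum(1 for s, d in zip(scores, has_tb) if d and s < threshold)
    tn = sum(1 for s, d in zip(scores, has_tb) if not d and s < threshold)
    fp = sum(1 for s, d in zip(scores, has_tb) if not d and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def meets_tpp(sensitivity, specificity, profile):
    """True if both accuracy measures reach the named TPP targets."""
    min_sens, min_spec = TPP[profile]
    return sensitivity >= min_sens and specificity >= min_spec


def thresholds_meeting_tpp(scores, has_tb, profile):
    """All thresholds from 0 to 99 at which the TPP is met."""
    return [
        t for t in range(100)
        if meets_tpp(*accuracy_at_threshold(scores, has_tb, t), profile)
    ]


# Toy data (hypothetical): four people with TB, six without.
toy_scores = [72, 80, 58, 90, 49, 44, 57, 40, 30, 60]
toy_has_tb = [True] * 4 + [False] * 6
print(thresholds_meeting_tpp(toy_scores, toy_has_tb, "optimum"))
```

With real data, the list of qualifying thresholds may be empty (no threshold meets the profile) or a range, in which case the choice within the range becomes a programme decision.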

Ethical considerations
This study was conducted as part of the Kenya Prevalence survey ethics approval reference number [...]

Participant characteristics and Field Reader chest X-ray interpretation
A total of 62,484 CXR images were uploaded for CAD4TBv6 processing. After exclusion of 636 images that were either not analysable by the CAD4TBv6 software or had missing clinical data, 61,848 (99.0%) were included for analysis (Figure 1).

Figure 1: Participant flowchart and results of prevalence survey investigations
Of the 61,848 participants whose images were analysed, 58.5% (36,187) were women and 70.7% (43,754) were aged <45 years (Table 1). Two thousand and eighty-four (3.4%) had previously been treated for TB, and 58 (0.1%) were currently being treated for TB. Overall, HIV positive status was self-reported by 1,577/31,495 (5.0%) of participants with data available.
Clinical Officers classified 50,045 (80.9%) CXRs as "normal", 5,045 (8.2%) as "abnormal, other", and 6,758 (10.9%) as "suggestive of TB" (Table 1). Compared to participants with CXRs classified by Clinical Officers as "normal" or "abnormal, other," participants with CXRs classified as "suggestive of TB" were more likely to be men, self-report positive HIV status, report TB symptoms including cough, prolonged cough, fever, weight loss and night sweats, and have been previously treated for TB.

Figure 3: Model-based sensitivity and specificity of CAD4TBv6 for bacteriologically-confirmed pulmonary TB at minimum and optimum target product profile thresholds
When model estimates were stratified by participant characteristics (age, sex, presence of cough of more than two weeks, history of previous TB), we found substantial variation in the sensitivity and specificity of CAD4TBv6 for bacteriologically-confirmed pulmonary TB (Figure 4). With the CAD4TBv6 threshold set to 55 to achieve overall sensitivity of 95% for the optimal TPP within the prevalence survey population, sensitivity was highest among participants aged 41 years and older, those who had previously been treated for TB, and those who had cough for more than two weeks. In contrast, specificity was lowest among participants previously treated for TB, and among older participants.
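The idea of applying one overall threshold and then reading off accuracy within subgroups can be sketched as follows. This is an empirical illustration on invented records, not the model-based stratification used in the study; the record fields (`score`, `tb`, `previous_tb`) and the function name are hypothetical.

```python
from collections import defaultdict


def stratified_accuracy(records, threshold, group_key):
    """Sensitivity and specificity per subgroup at one fixed threshold.

    A score >= threshold is treated as screen-positive (an assumption).
    Returns None for a measure when the subgroup has no relevant people.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for r in records:
        positive = r["score"] >= threshold
        if r["tb"]:
            cell = "tp" if positive else "fn"
        else:
            cell = "fp" if positive else "tn"
        counts[r[group_key]][cell] += 1

    result = {}
    for group, c in counts.items():
        cases = c["tp"] + c["fn"]
        non_cases = c["tn"] + c["fp"]
        result[group] = {
            "sensitivity": c["tp"] / cases if cases else None,
            "specificity": c["tn"] / non_cases if non_cases else None,
        }
    return result


# Invented records: people with previous TB tend to score higher
# even without active disease, which lowers specificity in that group.
toy_records = [
    {"score": 80, "tb": True, "previous_tb": True},
    {"score": 70, "tb": False, "previous_tb": True},
    {"score": 50, "tb": False, "previous_tb": True},
    {"score": 60, "tb": True, "previous_tb": False},
    {"score": 45, "tb": False, "previous_tb": False},
    {"score": 40, "tb": False, "previous_tb": False},
]
by_history = stratified_accuracy(toy_records, 55, "previous_tb")
print(by_history)
```

In this toy example the same threshold of 55 gives lower specificity in the previously treated group, which is the kind of pattern that motivates choosing thresholds by participant characteristics.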

Discussion
This is the first study, to the best of our knowledge, to evaluate the accuracy of computer-aided CXR screening for TB in a community-based prevalence survey. Highly specific Xpert and culture tests were used as the bacteriological reference standard, with Bayesian latent class modelling employed to infer disease prevalence within the portion of the study population without TB symptoms or CXR signs suggestive of TB. Overall in the screening population, CAD4TBv6 met both the minimum and optimum TPP for a community-based referral test for identifying people suspected of having TB. (20) Very high sensitivity was demonstrated in participants in older age groups (41 years or older), those with reported cough of more than two weeks, and participants with previous TB history. Conversely, participants in older age groups and those with previous TB history had lower specificity. Computer-aided CXR screening is an accurate tool that could be used to support community TB screening in high burden countries where access to radiologists and clinicians is limited. To optimise screening accuracy and efficiency of confirmatory sputum testing, we recommend that an adaptive approach to screening threshold definition is adopted based on participant characteristics.
Community-based active case finding (ACF) for TB is effective at reducing TB prevalence if delivered with sufficient and sustained intensity to high burden populations. (25,26) However, operationalisation of ACF in resource limited settings has been challenging due to substantial resourcing requirements and suboptimal TB screening and diagnostic tests. (12,27) The availability of portable/ultra-portable CXRs and CAD offers a potential solution for conducting community-based ACF for at risk groups in densely populated urban areas where TB transmission is now concentrated. (16) We have demonstrated that, overall within the prevalence survey population, CXR based screening in combination with CAD is highly accurate. CAD gives TB programs the additional flexibility to vary the threshold for sputum testing, with a saving of up to 50% of Xpert tests. (16) Given the limited resources available to National TB Programmes, by varying the CAD screening threshold, the number of TB cases deemed acceptable to be missed can be balanced against how much money is available to spend on expensive confirmatory sputum investigation. (16) By adopting an adaptive threshold within population groups, we believe that further gains in accuracy and programme efficiency can be achieved. CXR and CAD, as tools for community-based TB screening and ACF, additionally offer potential benefits for individuals and TB programmes, including: earlier diagnosis; identification of asymptomatic TB, potentially reducing transmission; reductions in false positive bacteriological tests with harm from prolonged unnecessary treatment; and reduction in catastrophic costs. (13,16,28)
The prevalence survey participants are representative of the general population as they were randomly selected with a high participation rate, though participation was higher amongst women than men. (18,21) The CAD accuracy finding in our study is therefore likely to be generalizable to countries in sub-Saharan Africa with a high burden of TB and HIV. Though our study focused on one software (CAD4TBv6), other comparable software that had the CE (Conformité Européenne) marking by January 2020 (Lunit Insight CXR, Lunit Insight; and qXR v2, Qure.ai) may perform similarly or better. (5,16) Rapid advances have been made in CAD software development, with a total of 12 software solutions identified in March 2021 and version updates occurring frequently. (16,29) Regular updating of WHO guidelines is therefore required to keep pace with these advances. As national TB programs adopt CAD technology into screening activities, implementation considerations in addition to performance include: cost effectiveness, compatibility of the X-ray systems, input image format, integration with any patient archiving systems, customer service and support, data protection, and ability to detect other non-TB conditions. (16,29,30) Conditions other than TB may be as prevalent as, or more prevalent than, TB in high TB prevalence settings, and require comprehensive approaches to ensure participants in TB screening programmes are linked to appropriate care. (31) TB screening programs should plan for this and take into consideration resource implications to ensure additional health benefits through the identification of populations at risk of diseases other than TB.
In addition to diagnostic accuracy, the clinical utility, acceptability and feasibility of using CAD should be assessed. (32) In our secondary analysis we found that accuracy varied considerably by participant characteristics, specifically age and previous TB history. Similar to a previous study in Bangladesh among adults attending primary health care in a triage setting, there was no significant difference in performance of CAD4TBv6 between men and women. (16) The lower specificity of CAD4TBv6 in the older age groups and those with prior history of TB is a finding similar to previous studies. (6,16,17) There are numerous anatomical and pathophysiological changes occurring in old age that could explain this lower age-related specificity, including age-related changes and sequelae of life-course accumulated lung damage. (33) People with prior TB have lung changes that could lead to difficulty distinguishing old from active disease, leading to low specificity. (6,16,17) Further algorithm training with images from older populations may result in refinement of CAD software with improvements in specificity.
In addition, two-stage screening of CAD with symptom screen followed by C-reactive protein or other novel screening tests in older populations and in participants with previous TB history could improve specificity, although this requires further investigation. In prevalence surveys, image classification criteria are set to a low threshold for referral for sputum testing, and non-expert readers like Clinical Officers are trained to interpret with higher sensitivity and lower specificity to avoid missing prevalent TB cases. (21) We found that, overall, the sensitivity of Clinical Officer CXR interpretation ("suggestive of TB") for the underlying true TB status was lower (44%) and specificity was higher (89%) than anticipated, (5) but that sensitivity was substantially higher (84%) and specificity lower (54%) among participants with previous TB, with no appreciable differences by other participant characteristics. This overall low sensitivity is not usually identified by other analyses that compare clinical CXR interpretation to a microbiological reference standard and that assume that sputum testing is 100% sensitive. From our latent class model, we can then infer that true TB cases that are bacteriologically-negative are likely to have minimal or no CXR abnormalities (unless previously treated for TB), and so are currently undetectable without new, more sensitive TB diagnostic tests. As in other studies, we have demonstrated that CAD at varied thresholds achieves higher sensitivity than human readers. (7,14,16) CAD has a high throughput and has been shown to reduce the time to treatment. (28,30)
Additionally, CAD has the benefit of flexibility of varying thresholds, with a higher threshold improving the positive predictive value and reducing the number of Xpert tests required to diagnose a patient by up to 50% while maintaining sensitivity above 90% (16). For TB prevalence surveys, we recommend that based on accuracy, a strategy including CAD should be considered, supported by formal health economic analyses to determine the health system feasibility of wide scale implementation. Mathematical modelling studies will likely be required to investigate potential impact on longer term trends of TB incidence, prevalence and mortality.
A major strength of our study is the use of a large population-based data set from a well-conducted, WHO-approved TB prevalence survey. (18,21) Analysis of the CXRs was blinded to bacteriological status, and we used a robust bacteriological reference standard. (34) Our model prevalence estimates are slightly higher than empirical estimates obtained from the prevalence survey [...] sputum negative TB (clinically diagnosed) or extra-pulmonary TB (pleural) is challenging to undertake. We also were not able to stratify performance by HIV status as testing was not systematically conducted during the prevalence survey. This would have been important for an in-depth subgroup analysis in a high TB-HIV prevalence setting. Kenya has an HIV prevalence of 4.9% with approximately 1.6 million people living with HIV and an estimated HIV-positive TB incidence of 70/100,000. (1,35) We therefore expected a lower CAD specificity in our setting as CXR is known to be less sensitive in immunocompromised patients with pulmonary TB. (36)(37)(38)(39) We recommend further evaluation of CAD software in high TB-HIV prevalence settings and further studies on accuracy within HIV-positive populations.
In conclusion, the End TB Strategy calls for concerted efforts to improve diagnosis of TB, including through new and effective approaches to systematic screening. We have demonstrated that CAD4TBv6 is an accurate tool for community-based TB screening in Kenya and met the TPP in this population. In resource limited settings where radiologists are scarce, an adaptive approach to setting screening thresholds could further improve screening accuracy and efficiency. The IMPALA consortium as a whole provided review and feedback on the progress of the work at consortium meetings or through study related advisory panels.

Declaration of interest
We declare no competing interests.

Data sharing
The Kenya National Tuberculosis, Leprosy and Lung Disease program is the custodian of the 2016 Kenya Tuberculosis Prevalence Survey data.

Acknowledgements
We are grateful to the 2016 Kenya Tuberculosis Prevalence Survey team and the Division of National Tuberculosis, Leprosy and Lung Disease Program whose data we used for this secondary study. We would like to thank Martin Githiomi, Information Technology specialist, who helped in review of the DELFT-NTLD-AFIDEP contracts, ensuring the data protection act was adhered to. We also acknowledge Wendy Nkirote, who played a role in review of the methodology part of the prevalence survey laboratory processes to ensure it was captured accurately.