The COMBAT-TB Workbench: Making powerful M. tuberculosis bioinformatics accessible

Whole Genome Sequencing (WGS) is a powerful method for detecting drug resistance, genetic diversity and transmission dynamics of Mycobacterium tuberculosis. Implementation of WGS in public health microbiology laboratories is impeded by a lack of user-friendly, automated and semi-automated pipelines. We present the COMBAT-TB Workbench, a modular, easy-to-install application that provides a web-based environment for Mycobacterium tuberculosis bioinformatics. The COMBAT-TB Workbench is built using two main software components: the IRIDA Platform, for its web-based user interface and data management capabilities, and the Galaxy bioinformatics workflow platform, for workflow execution. These components are combined into a single easy-to-install application using Docker container technology. We implemented two workflows, for M. tuberculosis sample analysis and phylogeny, in Galaxy. Building our workflows involved updating some Galaxy tools (Trimmomatic, snippy and snp-sites) and writing new Galaxy tools (snp-dists, TB-Profiler, tb_variant_filter and TB Variant Report). The irida-wf-ga2xml tool was updated to work with recent versions of Galaxy and was further developed into IRIDA plugins for both workflows. For the M. tuberculosis sample analysis workflow, an interface was added to update the metadata stored for each sequence sample with results gleaned from the Galaxy workflow output. Data can be loaded into the COMBAT-TB Workbench via the web interface or via the command-line IRIDA uploader tool. The COMBAT-TB Workbench application deploys IRIDA, the COMBAT-TB IRIDA plugins, the MariaDB database and Galaxy using Docker containers (https://github.com/COMBAT-TB/irida-galaxy-deploy).

This cost takes the form of equipment, consumables and expertise. The command-line tools whose use is prevalent in bioinformatics (4) require skills not readily accessible outside of specialist labs. On the other hand, web-based tools such as TB-Profiler (5) and the NIAID TB Portals (6) are restricted to the analyses provided by their authors and often lack features for bulk analysis. Finally, platforms like Galaxy (7), while customisable, do not provide a data management and analysis user interface specific to common M. tuberculosis analysis tasks.
To address these deficiencies, the computational bacterial analytical toolkit for Tuberculosis research (COMBAT-TB) was developed. The COMBAT-TB Workbench (downloadable from https://github.com/COMBAT-TB/irida-galaxy-deploy) represents a [...]
IRIDA stores sequence samples on disk and sample metadata in a MariaDB database (9). Sequence data is shared between IRIDA and the Galaxy analysis platform, which reduces data duplication.

Scientific workflows and IRIDA plugins
The Workbench uses Galaxy for its bioinformatics workflow composition and execution.
Two workflows, for M. tuberculosis sample analysis and phylogeny, were implemented in Galaxy. Building a workflow in Galaxy involves connecting Galaxy tools into an analysis pipeline, where each Galaxy tool (also known as a tool wrapper) connects a command-line bioinformatics tool to the Galaxy framework. Building our workflows involved updating some Galaxy tools (Trimmomatic (10), snippy (11) and snp-sites (12)) and writing new Galaxy tools (snp-dists (13), TB-Profiler (5), tb_variant_filter (14) and TB Variant Report (15)). In addition to the work on Galaxy tools, the command-line tb_variant_filter and TB Variant Report tools were created as part of the COMBAT-TB project.
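To make concrete what a pairwise SNP distance over an alignment means, the following is a minimal Python sketch of that calculation. It is an illustration only and not the snp-dists implementation: the function names and the toy alignment are hypothetical, and the choice to count only positions where both sequences carry an unambiguous A, C, G or T is an assumption of this sketch.

```python
from itertools import combinations
from typing import Dict, Tuple

# Only unambiguous bases are compared (an assumption of this sketch);
# gaps, N and other ambiguity codes are skipped.
VALID_BASES = set("ACGT")


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count aligned positions where both bases are unambiguous and differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in VALID_BASES and b in VALID_BASES and a != b
    )


def distance_matrix(alignment: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Return a symmetric mapping of (sample, sample) pairs to SNP distances."""
    distances = {(name, name): 0 for name in alignment}
    for first, second in combinations(alignment, 2):
        d = snp_distance(alignment[first], alignment[second])
        distances[(first, second)] = d
        distances[(second, first)] = d
    return distances


if __name__ == "__main__":
    # Hypothetical toy alignment, not data from this study.
    toy_alignment = {"s1": "ACGT", "s2": "ACGA", "s3": "TCGA"}
    for pair, d in sorted(distance_matrix(toy_alignment).items()):
        print(pair, d)
```

A distance matrix of this kind is typically used alongside a phylogeny to judge how closely related samples are.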
The irida-wf-ga2xml tool (16) was updated to work with recent versions of Galaxy and was used to build IRIDA plugin skeletons. These skeletons were further developed into IRIDA plugins for both workflows. For the M. tuberculosis sample analysis workflow, this involved adding an interface between the Galaxy workflow output and the metadata stored for each sequence sample. The M. tuberculosis Sample Report and M. tuberculosis Phylogeny plugins are hosted in GitHub repositories (https://github.com/COMBAT-TB/irida-plugin-tb-sample-report and [...]
At the time of writing we found two systems, Innuendo (19) and the IRIDA project (8), that were comparable to the COMBAT-TB Workbench.
Innuendo is a web interface to sequence storage and Nextflow (20) workflow execution aimed at the analysis of food-borne pathogens. It is oriented around common tasks in food-borne pathogen surveillance, such as molecular typing of pathogens. The platform is strongly tied to the workflows of a food-borne pathogen surveillance lab, and adding additional species to the analysis system requires modifying the underlying [...]
While IRIDA is more flexible than Innuendo, as it allows deployment of a wide variety of Galaxy workflows and is not species specific, it is complex to deploy: the installation guide assumes knowledge of deploying both Galaxy and the Tomcat Java Servlet system (21). The COMBAT-TB Workbench, by contrast, is straightforward to deploy with a single Linux command.

Use Case
For the purpose of these analyses we installed the COMBAT-TB Workbench on a virtual machine with 8 virtual CPUs, 32 GB RAM and 3000 GB hard disk space, running [...]
Examining the read mapping outputs shows that only 9.63% and 1.47% of reads from samples SRR12416824 and SRR12416842, respectively, mapped against the M. [...]
medRxiv preprint doi: https://doi.org/10.1101/2021.09.23.21263983; this version posted September 26, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
In parallel to the SnpEff annotation and variant filtering steps, the mapped reads are provided to TBProfiler (v. 2.8.4), which performs its own variant calling and lineage and drug resistance prediction.
Finally, the filtered, annotated variants and the TBProfiler results are fed to tb_vcf_report (v. 0.1.7), which produces a report, further annotated with information from the COMBAT-TB eXplorer database (28), in both text and HTML formats.
The final reports provided to the user are the variant reports from tb_vcf_report, text and JSON format reports from TBProfiler, variants in VCF format from SnpEff and mapping statistics from samtools flagstat. These reports include both human-readable output and raw data suitable for further downstream analysis. The metadata stored in the sample line list is updated with the percentage of mapped reads, M. tuberculosis lineage and spoligotype information, and drug resistance information.
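As an illustration of how a mapping percentage could be derived from samtools flagstat text output and written into a sample's metadata, here is a minimal Python sketch. It is not the COMBAT-TB plugin code: the function names, the metadata field names (`mapped_percent`, `low_mapping`) and the 50% threshold are hypothetical choices made for this example, and the read counts are invented toy values.

```python
import re
from typing import Dict

# flagstat reports one count per line as "<QC-passed> + <QC-failed> <label>".
_COUNT_LINE = re.compile(r"^(\d+) \+ (\d+) (.+)$")


def mapping_percentage(flagstat_text: str) -> float:
    """Return the percentage of QC-passed reads that are mapped."""
    total = None
    mapped = None
    for line in flagstat_text.splitlines():
        match = _COUNT_LINE.match(line.strip())
        if not match:
            continue
        passed = int(match.group(1))
        label = match.group(3)
        if label.startswith("in total"):
            total = passed
        elif label.startswith("mapped"):
            # "primary mapped" lines start with "primary" and are ignored.
            mapped = passed
    if total is None or mapped is None:
        raise ValueError("could not find total and mapped read counts")
    if total == 0:
        return 0.0
    return 100.0 * mapped / total


def update_line_list(metadata: Dict[str, object], flagstat_text: str,
                     threshold: float = 50.0) -> Dict[str, object]:
    """Return a copy of the metadata with mapping fields added.

    The field names and the threshold are hypothetical, chosen for this sketch.
    """
    updated = dict(metadata)
    percent = round(mapping_percentage(flagstat_text), 2)
    updated["mapped_percent"] = percent
    updated["low_mapping"] = percent < threshold
    return updated


if __name__ == "__main__":
    # Invented toy counts, not output from any sample in this study.
    toy_flagstat = (
        "2000 + 0 in total (QC-passed reads + QC-failed reads)\n"
        "0 + 0 secondary\n"
        "500 + 0 mapped (25.00% : N/A)\n"
    )
    print(update_line_list({"sample": "toy"}, toy_flagstat))
```

Recording a flag next to the percentage makes samples with unusually low mapping rates easy to spot in the line list, where they may warrant a check for contamination or misidentified species.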

Implementation of the M. tuberculosis Phylogeny pipeline
The workflow in the phylogeny module starts with quality filtering of samples using fastp (v. 0.19.5) (40) with default settings. The filtered reads are then aligned to the user-provided reference using snippy, and predicted variants are filtered with tb_variant_filter as described above. In addition, only single nucleotide variants (SNVs) are retained, as the phylogeny software used in the workflow cannot extract meaningful information from indels.
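The SNV-only restriction can be illustrated with a short Python sketch that keeps VCF records whose REF and ALT alleles are all single bases. This is a generic illustration and not the filtering code used in the workflow: the function names are hypothetical, and dropping a multi-allelic record entirely when any of its alternate alleles is not a single base is an assumption of this sketch.

```python
from typing import Iterable, Iterator

# Alleles accepted as single bases in this sketch.
SINGLE_BASES = {"A", "C", "G", "T", "N"}


def is_snv(ref: str, alt_field: str) -> bool:
    """True if REF and every comma-separated ALT allele is a single base."""
    if ref.upper() not in SINGLE_BASES:
        return False
    return all(alt.upper() in SINGLE_BASES for alt in alt_field.split(","))


def keep_snvs(vcf_lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines unchanged and only those records that are SNVs."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip malformed records
        # VCF columns: CHROM, POS, ID, REF, ALT, ...
        if is_snv(fields[3], fields[4]):
            yield line


if __name__ == "__main__":
    # Hypothetical toy records, not variants from this study.
    toy_vcf = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "Chromosome\t100\t.\tA\tG\t.\tPASS\t.\n",
        "Chromosome\t200\t.\tAT\tA\t.\tPASS\t.\n",
    ]
    for record in keep_snvs(toy_vcf):
        print(record, end="")
```

In the toy input, the deletion at position 200 is dropped while the substitution at position 100 and both header lines pass through unchanged.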