Elsevier

Human Immunology

Volume 77, Issue 3, March 2016, Pages 283-287

Bridging ImmunoGenomic Data Analysis Workflow Gaps (BIGDAWG): An integrated case-control analysis pipeline

https://doi.org/10.1016/j.humimm.2015.12.006

Abstract

Bridging ImmunoGenomic Data-Analysis Workflow Gaps (BIGDAWG) is an integrated data-analysis pipeline designed for the standardized analysis of highly polymorphic genetic data, specifically for the HLA and KIR genetic systems. Most modern genetic analysis programs are designed for the analysis of single nucleotide polymorphisms, but the highly polymorphic nature of HLA and KIR data requires specialized methods of data analysis. BIGDAWG performs case-control data analyses of highly polymorphic genotype data characteristic of the HLA and KIR loci. BIGDAWG performs tests for Hardy–Weinberg equilibrium, calculates allele frequencies and bins low-frequency alleles for k × 2 and 2 × 2 chi-squared tests, and calculates odds ratios, confidence intervals and p-values for each allele. When multi-locus genotype data are available, BIGDAWG estimates user-specified haplotypes and performs the same binning and statistical calculations for each haplotype. For the HLA loci, BIGDAWG performs the same analyses at the individual amino-acid level. Finally, BIGDAWG generates figures and tables for each of these comparisons. BIGDAWG obviates the error-prone reformatting needed to traffic data between multiple programs, and streamlines and standardizes the data-analysis process for case-control studies of highly polymorphic data. BIGDAWG has been implemented as the bigdawg R package and as a free web application at bigdawg.immunogenomics.org.
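The binning and k × 2 chi-squared steps described above can be illustrated with a minimal Python sketch. This is not the bigdawg implementation: the function names, the allele counts and the binning rule (collapsing alleles whose expected count in either group falls below 5) are assumptions made for illustration only.

```python
from collections import Counter

def bin_rare(case_counts, control_counts, min_expected=5.0):
    """Collapse alleles with a low expected count into one 'binned' row.

    Assumption: an allele is binned when its expected count in cases or
    controls is below min_expected (5 is a common rule of thumb).
    """
    n_case = sum(case_counts.values())
    n_ctrl = sum(control_counts.values())
    total = n_case + n_ctrl
    table = {}
    binned = [0, 0]
    for allele in sorted(set(case_counts) | set(control_counts)):
        row = (case_counts.get(allele, 0), control_counts.get(allele, 0))
        row_total = sum(row)
        # Expected counts under independence of allele and case status.
        expected = (row_total * n_case / total, row_total * n_ctrl / total)
        if min(expected) < min_expected:
            binned[0] += row[0]
            binned[1] += row[1]
        else:
            table[allele] = row
    if binned != [0, 0]:
        table["binned"] = tuple(binned)
    return table

def chi_squared_kx2(table):
    """Pearson chi-squared statistic and degrees of freedom for a k x 2 table."""
    rows = list(table.values())
    n_case = sum(r[0] for r in rows)
    n_ctrl = sum(r[1] for r in rows)
    total = n_case + n_ctrl
    stat = 0.0
    for obs_case, obs_ctrl in rows:
        row_total = obs_case + obs_ctrl
        for obs, col_total in ((obs_case, n_case), (obs_ctrl, n_ctrl)):
            exp = row_total * col_total / total
            stat += (obs - exp) ** 2 / exp
    return stat, len(rows) - 1

# Hypothetical allele counts (2N chromosomes) at a single locus.
cases = Counter({"01:01": 40, "03:01": 30, "15:01": 25, "04:05": 3, "13:02": 2})
controls = Counter({"01:01": 30, "03:01": 30, "15:01": 35, "04:05": 2, "13:02": 3})

table = bin_rare(cases, controls)
stat, df = chi_squared_kx2(table)
print(table, round(stat, 3), df)
```

Because binning leaves the column totals unchanged, the expected counts of the retained alleles are unaffected, so a single pass over the alleles is enough in this sketch.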

Introduction

The extensive polymorphism, linkage disequilibrium and genotyping ambiguity commonly associated with the HLA and KIR loci (described here collectively as immunogenomic loci) pose challenges for the consistent analysis of these data [1]. Modern genetic analysis programs are designed for use with bi-allelic single nucleotide polymorphisms (SNPs) or SNP haplotypes generated in genome-wide association studies (GWAS), but cannot be applied to highly polymorphic immunogenomic data. New tools are needed to leverage modern computational resources for the analysis of immunogenomic data, and to integrate the analysis of immunogenomic loci with genomic SNP/GWAS data. The few ad hoc tools designed to handle immunogenomic data, such as PyPop [2] and Arlequin [3], are limited to particular operating systems, are outdated with irregular maintenance cycles, and often require cumbersome data formatting.

A typical immunogenomic data analysis workflow involves the trafficking of data between several programs; this usually involves reformatting these data for each program, a process that is time-intensive and error-prone, and that limits reproducibility. Quite often, this data-trafficking involves the use of Microsoft Excel, which is a particularly poor choice for immunogenomic data management [1]. In addition, the management of data in a typical workflow is often idiosyncratic to the analyst, which further limits reproducibility across studies. The automated manipulation of immunogenomic data in a single analysis workflow will reduce errors and allow true analytical reproducibility.

We have developed Bridging ImmunoGenomic Data-Analysis Workflow Gaps (BIGDAWG), an automated software pipeline that performs a suite of common case-control analyses of multi-locus highly polymorphic genetic data [4], [5], [6]. Unlike SNP/GWAS case-control analysis tools, BIGDAWG is tailored for use with immunogenomic data. In addition, BIGDAWG can be applied to any highly polymorphic genetic data, including SNPs and SNP haplotypes. BIGDAWG is implemented as an R package (named bigdawg) and as a web application running at bigdawg.immunogenomics.org.
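One common case-control analysis of this kind is the per-allele 2 × 2 comparison (one allele versus all others), yielding an odds ratio, confidence interval and p-value. The following Python sketch is illustrative only and is not taken from the bigdawg package: the function name and counts are hypothetical, the interval is a Woolf (log-based) 95% interval, and the chi-squared test is uncorrected, both chosen here as assumptions.

```python
import math

def allele_vs_rest(case_allele, case_total, ctrl_allele, ctrl_total):
    """Odds ratio, 95% CI, chi-squared statistic and p-value for one allele.

    Assumptions: all four cells are non-zero, the CI is the Woolf
    (log odds ratio) interval, and no continuity correction is applied.
    """
    a, b = case_allele, case_total - case_allele   # cases: allele, other
    c, d = ctrl_allele, ctrl_total - ctrl_allele   # controls: allele, other

    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci_low = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    ci_high = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

    # Pearson chi-squared for a 2 x 2 table (1 degree of freedom).
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of chi-squared with 1 df, via the complementary
    # error function: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, (ci_low, ci_high), chi2, p_value

# Hypothetical counts: the allele is seen on 40 of 100 case chromosomes
# and on 30 of 100 control chromosomes.
odds_ratio, ci, chi2, p_value = allele_vs_rest(40, 100, 30, 100)
print(round(odds_ratio, 3), ci, round(chi2, 3), round(p_value, 3))
```

A confidence interval that spans 1 corresponds to an odds ratio that is not significantly different from 1 at the 5% level, which is the case for these hypothetical counts.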

Section snippets

Implementation

BIGDAWG has been developed in the framework of the R statistical environment (http://www.r-project.org). The bigdawg R package provides documentation of all BIGDAWG functions, and includes a vignette detailing package use along with a sample dataset. The bigdawg vignette is included here as Supplementary Material. BIGDAWG’s functionality depends on the epicalc [7] and haplo.stats [8] R packages, along with the R base package parallel. The R XML package [9] is required for updating the protein

Running bigdawg

In this section, we demonstrate running bigdawg on the built-in example data set (described in Section 2.6). The example set can be accessed by setting the ‘Data’ parameter to the value ‘HLA_data’ (case sensitive). The first two lines of the following code snippet specify all possible parameters that a user can change. The subsequent two lines will load bigdawg from the R library (step 1) and run the full analysis with all defaults using the built-in dataset (step 2).

># All possible user

Discussion

BIGDAWG is a standardized pipeline for the case-control analysis of immunogenomic data. Available as the bigdawg R package, and as BWA, the BIGDAWG web application, BIGDAWG has been designed for the analysis of highly-polymorphic HLA data, but can be applied to any genotype data, including genotype data derived from disparate genetic systems (e.g., HLA, KIR and SNPs) or from a variety of sources. BIGDAWG performs case-control analyses at the haplotype, locus and amino-acid levels, and also

Acknowledgments

The work described here was performed with the support of National Institutes of Health (NIH) Grants R01GM19030 (JAH, SJM, DP) awarded by the National Institute of General Medical Sciences (NIGMS) and U01AI067068 (JAH and SJM) awarded by the National Institute of Allergy and Infectious Diseases (NIAID). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH, NIGMS, NIAID or United States Government. We thank Hannah Ollila,

References (17)

  • J.A. Hollenbach et al. A community standard for immunogenomic data reporting and analysis: proposal for a STrengthening the REporting of immunogenomic studies statement. Tissue Antigens (2011)
  • A.K. Lancaster et al. PyPop update–a software pipeline for large-scale multilocus population genomics. Tissue Antigens (2007)
  • L. Excoffier et al. Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Mol. Ecol. Resour. (2010)
  • S.J. Mack et al. Analytical methods for immunogenetic population data. Methods Mol. Biol. (2012)
  • P.A. Gourraud et al. Standard methods for the management of immunogenetic data. Methods Mol. Biol. (2012)
  • J.A. Hollenbach et al. Analytical methods for disease association studies with immunogenetic data. Methods Mol. Biol. (2012)
  • V. Chongsuvivatwong, Epicalc: Epidemiological calculator,...
  • J. Sinnwell, J. Schaid, haplo.stats: Statistical Analysis of Haplotypes with Traits and Covariates when Linkage Phase...
There are more references available in the full text version of this article.


1 Contributed equally to manuscript.
