Bridging ImmunoGenomic Data Analysis Workflow Gaps (BIGDAWG): An integrated case-control analysis pipeline

doi:10.1016/j.humimm.2015.12.006

Human Immunology

Volume 77, Issue 3, March 2016, Pages 283-287

Abstract

Bridging ImmunoGenomic Data-Analysis Workflow Gaps (BIGDAWG) is an integrated data-analysis pipeline designed for the standardized analysis of highly polymorphic genetic data, specifically for the HLA and KIR genetic systems. Most modern genetic analysis programs are designed for the analysis of single nucleotide polymorphisms, but the highly polymorphic nature of HLA and KIR data requires specialized methods of data analysis. BIGDAWG performs case-control data analyses of highly polymorphic genotype data characteristic of the HLA and KIR loci. BIGDAWG performs tests for Hardy–Weinberg equilibrium, calculates allele frequencies and bins low-frequency alleles for k × 2 and 2 × 2 chi-squared tests, and calculates odds ratios, confidence intervals and p-values for each allele. When multi-locus genotype data are available, BIGDAWG estimates user-specified haplotypes and performs the same binning and statistical calculations for each haplotype. For the HLA loci, BIGDAWG performs the same analyses at the individual amino-acid level. Finally, BIGDAWG generates figures and tables for each of these comparisons. BIGDAWG obviates the error-prone reformatting needed to traffic data between multiple programs, and streamlines and standardizes the data-analysis process for case-control studies of highly polymorphic data. BIGDAWG has been implemented as the bigdawg R package and as a free web application at bigdawg.immunogenomics.org.
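
The binning and per-allele 2 × 2 calculations summarized in the abstract can be illustrated with a minimal sketch. It is written in Python rather than R and is not BIGDAWG's own code. The function names, the expected-count threshold of 5, the Woolf (log odds ratio) confidence interval, the uncorrected Pearson chi-squared test and the allele counts are all assumptions chosen for illustration.

```python
import math


def bin_rare_alleles(cases, controls, min_expected=5.0):
    """Pool alleles whose expected count in either group is below min_expected.

    The threshold of 5 is an assumed convention for chi-squared tests,
    not a value taken from the article.
    """
    n_cases = sum(cases.values())
    n_controls = sum(controls.values())
    total = n_cases + n_controls
    binned_cases, binned_controls = {}, {}
    for allele in sorted(set(cases) | set(controls)):
        a = cases.get(allele, 0)
        c = controls.get(allele, 0)
        row_total = a + c
        # Expected count = row total * column total / grand total.
        smallest_expected = row_total * min(n_cases, n_controls) / total
        key = "binned" if smallest_expected < min_expected else allele
        binned_cases[key] = binned_cases.get(key, 0) + a
        binned_controls[key] = binned_controls.get(key, 0) + c
    return binned_cases, binned_controls


def allele_2x2(allele, cases, controls, z=1.96):
    """Odds ratio, Woolf 95% CI and Pearson chi-squared p-value for one allele.

    Assumes all four cells of the 2 x 2 table are non-zero.
    """
    a = cases[allele]                  # cases carrying the allele
    b = sum(cases.values()) - a        # cases carrying any other allele
    c = controls[allele]               # controls carrying the allele
    d = sum(controls.values()) - c     # controls carrying any other allele
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the chi-squared tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return {"or": odds_ratio, "ci": (lower, upper), "chi2": chi2, "p": p_value}


# Hypothetical allele counts (100 chromosomes per group), for illustration only.
case_counts = {"A*01:01": 30, "A*02:01": 50, "A*03:01": 18, "A*11:01": 2}
control_counts = {"A*01:01": 20, "A*02:01": 55, "A*03:01": 22, "A*11:01": 3}

binned_cases, binned_controls = bin_rare_alleles(case_counts, control_counts)
result = allele_2x2("A*01:01", binned_cases, binned_controls)
```

In this sketch, each allele is tested against all other alleles combined, which is one common way to build the 2 × 2 table; the k × 2 test would instead use the full binned table at once.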

Introduction

The extensive polymorphism, linkage disequilibrium and genotyping ambiguity commonly associated with the HLA and KIR loci (described here collectively as immunogenomic loci) pose challenges for the consistent analysis of these data [1]. Modern genetic analysis programs are designed for use with bi-allelic single nucleotide polymorphisms (SNPs) or SNP haplotypes generated in genome-wide association studies (GWAS), but cannot be applied to highly polymorphic immunogenomic data. New tools are needed to leverage modern computational resources for the analysis of immunogenomic data, and to integrate the analysis of immunogenomic loci with genomic SNP/GWAS data. The few ad hoc tools designed to handle immunogenomic data, such as PyPop [2] and Arlequin [3], are limited to particular operating systems, are outdated and only sporadically maintained, and often require cumbersome data formatting.

A typical immunogenomic data analysis workflow involves the trafficking of data between several programs; this usually requires reformatting the data for each program, a process that is time intensive and error prone, and that limits reproducibility. Quite often, this data trafficking involves the use of Microsoft Excel, which is a particularly poor choice for immunogenomic data management [1]. In addition, the management of data in a typical workflow is often idiosyncratic to the analyst, which further limits reproducibility across studies. The automated manipulation of immunogenomic data in a single analysis workflow will reduce errors and allow true analytical reproducibility.

We have developed Bridging ImmunoGenomic Data-Analysis Workflow Gaps (BIGDAWG), an automated software pipeline that performs a suite of common case-control analyses of multi-locus highly polymorphic genetic data [4], [5], [6]. Unlike SNP/GWAS case-control analysis tools, BIGDAWG is tailored for use with immunogenomic data. In addition, BIGDAWG can be applied to any highly polymorphic genetic data, including SNPs and SNP haplotypes. BIGDAWG is implemented as an R package (named bigdawg) and as a web application running at bigdawg.immunogenomics.org.

Section snippets

Implementation

BIGDAWG has been developed in the framework of the R statistical environment (http://www.r-project.org). The bigdawg R package provides documentation of all BIGDAWG functions, and includes a vignette detailing package use along with a sample dataset. The bigdawg vignette is included here as Supplementary Material. BIGDAWG’s functionality depends on the epicalc [7] and haplo.stats [8] R packages, along with the R base package parallel. The R XML package [9] is required for updating the protein

Running bigdawg

In this section, we demonstrate running bigdawg on the built-in example data set (described in Section 2.6). The example set can be accessed by setting the ‘Data’ parameter to the value ‘HLA_data’ (case sensitive). The first two lines of the following code snippet specify all possible parameters that a user can change. The subsequent two lines will load bigdawg from the R library (step 1) and run the full analysis with all defaults using the built-in dataset (step 2).

># All possible user

Discussion

BIGDAWG is a standardized pipeline for the case-control analysis of immunogenomic data. Available as the bigdawg R package, and as BWA, the BIGDAWG web application, BIGDAWG has been designed for the analysis of highly-polymorphic HLA data, but can be applied to any genotype data, including genotype data derived from disparate genetic systems (e.g., HLA, KIR and SNPs) or from a variety of sources. BIGDAWG performs case-control analyses at the haplotype, locus and amino-acid levels, and also

Acknowledgments

The work described here was performed with the support of National Institutes of Health (NIH) Grants R01GM19030 (JAH, SJM, DP) awarded by the National Institute of General Medical Sciences (NIGMS) and U01AI067068 (JAH and SJM) awarded by the National Institute of Allergy and Infectious Diseases (NIAID). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH, NIGMS, NIAID or United States Government. We thank Hannah Ollila,

References (17)

[1] J.A. Hollenbach et al., A community standard for immunogenomic data reporting and analysis: proposal for a STrengthening the REporting of immunogenomic studies statement, Tissue Antigens (2011)
[2] A.K. Lancaster et al., PyPop update – a software pipeline for large-scale multilocus population genomics, Tissue Antigens (2007)
[3] L. Excoffier et al., Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows, Mol. Ecol. Resour. (2010)
[4] S.J. Mack et al., Analytical methods for immunogenetic population data, Methods Mol. Biol. (2012)
[5] P.A. Gourraud et al., Standard methods for the management of immunogenetic data, Methods Mol. Biol. (2012)
[6] J.A. Hollenbach et al., Analytical methods for disease association studies with immunogenetic data, Methods Mol. Biol. (2012)
[7] V. Chongsuvivatwong, Epicalc: Epidemiological calculator,...
[8] J. Sinnwell, J. Schaid, haplo.stats: Statistical Analysis of Haplotypes with Traits and Covariates when Linkage Phase...

There are more references available in the full text version of this article.

¹: Contributed equally to manuscript.
