Elsevier

Human Immunology

Volume 79, Issue 9, September 2018, Pages 678-684
Human Immunology

Hla-mapper: An application to optimize the mapping of HLA sequences produced by massively parallel sequencing procedures

https://doi.org/10.1016/j.humimm.2018.06.010Get rights and content

Abstract

A challenging task when more than one HLA gene is evaluated together by second-generation sequencing is to achieve a reliable read mapping. The polymorphic and repetitive nature of HLA genes might bias the read mapping process, usually underestimating variability at very polymorphic segments, or overestimating variability at some segments. To overcome this issue we developed hla-mapper, which takes into account HLA sequences derived from the IPD-IMGT/HLA database and unpublished HLA sequences to apply a scoring system. This comprehends the evaluation of each read pair, addressing them to the most likely HLA gene they were derived from. Hla-mapper provides a reliable map of HLA sequences, allowing accurate downstream analysis such as variant calling, haplotype inference, and allele typing. Moreover, hla-mapper supports whole genome, exome, and targeted sequencing data. To assess the software performance in comparison with traditional mapping algorithms, we used three different simulated datasets to compare the results obtained with hla-mapper, BWA MEM, and Bowtie2. Overall, hla-mapper presented a superior performance, mainly for the classical HLA class I genes, minimizing wrong mapping and cross-mapping that are typically observed when using BWA MEM or Bowtie2 with a single reference genome.

Introduction

The human leukocyte antigen (HLA) complex comprehends the most variable segment of the human genome. This complex plays a central role in the immune response since HLA molecules are key features for antigen presentation and immune response modulation [1]. HLA genes compatibility between recipient and donor influences graft acceptance, therefore, HLA variability is frequently evaluated for clinical purposes. Moreover, different HLA variants might be related to differential immune responses against pathogens, susceptibility to autoimmune diseases, or even to specific tumors [2]. The classical HLA class I genes, HLA-A, HLA-B, and HLA-C, are highly polymorphic loci and encode a key molecule for antigen presentation to T CD8+ lymphocytes. The non-classical HLA class I genes, HLA-G, HLA-E, and HLA-F, are conserved at the DNA and protein level and encode immune-modulatory molecules [3], [4], [5].

Much effort has been made to characterize HLA polymorphisms and complete sequences in worldwide populations. Most of the studies are restricted to the coding region, or even to exon segments only. However, with the advent of next-generation sequencing (NGS), or massively parallel sequencing, information regarding introns and regulatory segments has been obtained, as well as the characterization of many new allele variants, providing deeper insights regarding HLA worldwide variability and extended haplotypes. This is particularly evident for non-classical HLA class I genes [6], [7], [8]. In addition, NGS strategies do allow the evaluation of several genes all at once, and also allow the phase definition (haplotypes) among part of the variants detected. Nevertheless, when two or more HLA genes are sequenced at the same time using NGS methods, a challenging task is to achieve a reliable read mapping. Because of the polymorphic and repetitive nature of most of the HLA genes due to their paralogous origins, the following issues may arise when raw data (the reads) are mapped to a single reference genome.

First, the classical HLA class I genes are among the most variable genes in the human genome. The International Immunogenetics Database (IPD-IMGT/HLA database, version 3.31.0) describes more than 12,893 class I alleles [9]. Because of the polymorphic nature of classical HLA genes, the sequences (or reads) obtained by NGS methods usually present too many nucleotide differences when compared with the reference human genome. Thus, well-established aligners such as BWA (the Burrows-Wheeler Aligner) and Bowtie2 [10], [11] frequently do not map classical HLA class I sequences correctly. Usually, when these algorithms are used with default parameters, because of the high polymorphism, many reads do not find a match in the reference genome, leading to a mapping bias that underestimates HLA variability and overestimates reference allele frequencies [12], [13], [14]. Moreover, when a higher tolerance for mismatches is defined, many reads are incorrectly mapped as explained below.

Second, mostly due to the high sequence similarity among HLA genes, polymorphisms may lead a sequence to be more compatible with the reference sequence of another HLA gene than with the reference of the original gene. Thus, when the tolerance for mismatches is increased to avoid the aforementioned mapping bias, this second issue leads to a large number of reads mapping to more than one HLA gene into the reference genome, or simply mapping to the wrong gene. In this scenario, it is expected an overestimation of genetic diversity in segments presenting a large number of incorrectly mapped reads. Third, depending on the NGS method used, a large number of very short reads is produced and they could exacerbate the issues already presented.

The BWA developers do acknowledge the issues presented above. They created the bwa.kit to improve HLA read mapping based on alternative contigs and known HLA sequences from the IPD-IMGT/HLA. However, its use is not straightforward and the data obtained might bias downstream genotyping procedures as discussed later. In addition, other attempts to evaluate the efficiency of many mappers and variant call methods in complex regions have been made, but mapping improvement depends on the mapper and variant caller combination [15], [16].

All the issues introduced above could be circumvented if the reads of each HLA gene were tagged with different indices, but this strategy is cost ineffective. Usually, the aim is to sequence all HLA genes from a given individual in a single sequencing run, tagging each individual with specific indices, thus allowing the evaluation of many samples at the same time. Additionally, research groups could be interested in other non-HLA genes that may be sequenced together with HLA, or even in evaluating HLA from whole genome sequencing.

Although many companies have introduced different NGS HLA-typing kits and specific applications to call HLA alleles, these products are mainly focused on reporting pre-defined alleles from the IPD-IMGT/HLA database, for clinical purposes. In this context, they are usually restricted to the segments tracked by the aforementioned database, which does not include the complete upstream regulatory regions and the complete 3′ untranslated segments. In addition, these applications usually produce good results when the HLA typing kits from the same manufacturer are used. However, they may not be suitable when alternative approaches to characterize HLA genes are applied (e.g., when other non-HLA genes are included in the sequencing). These applications are usually not freely available.

Many publicly available typing tools have recently been developed to call HLA alleles from NGS data [14], [17], including HLAminer [18], HLA-VBseq [19], HLAreporter [20], OptiType [21], and others. In general, these tools report only the HLA typing and do not properly handle new HLA alleles. Moreover, they usually do not output aligned (BAM/SAM) files that can be further processed (for instance, to infer SNP genotypes).

To achieve a better evaluation of HLA genes when several HLA genes are sequenced together or when other genes outside the HLA complex are also included, we developed hla-mapper to optimize read mapping for HLA class I genes. hla-mapper uses a scoring system in order to address each read pair to the most likely gene, providing a reliable map of those sequences, as described in the methods section. Here we present the hla-mapper application, comparing its mapping accuracy with other aligners such as BWA and Bowtie2.

Section snippets

The hla-mapper software

The input for hla-mapper is composed of paired-end FASTQ files from amplicon, exon, or whole-genome sequencing. The software has a trimming algorithm to remove short reads (the default value is 70 nucleotides) and low-quality segments, identifying the largest sequence fragment for each read in which all nucleotides present a quality value higher than 97% (Q ≥ 15). This process assures that only high-quality sequences pass forward to the scoring stage (Fig. 1).

After the trimming process,

Hla-mapper performance at datasets 1 and 2

hla-mapper performance was assessed in three datasets, as described earlier. The results for dataset 1 and 2 (that do not include HLA pseudogenes) are shown in Table 1, Table 2. Here we assessed the hla-mapper performance when dealing with simulated samples that resemble an actual population sample with known alleles (dataset 1, Table 1) and when dealing with only new HLA sequences (dataset 2, Table 2). In these datasets, there were 8.4 million read pairs (2 alleles × 700 fragments × 6

Discussion

Much effort has been made in the development of typing tools to call HLA alleles from NGS data. Many commercial or freely available typing tools have recently been developed. Among the public ones, we can find HLAminer, HLAreporter, HLA-VBseq, OptiType, ATHLATES, PHLAT, HLAforest, and others. The major goal of these tools is HLA typing, i.e., the definition of the two HLA alleles at each locus. Among the commercial ones, we may cite NGSEngine (from GenDX) and HLA Twin (from Omixon). Minor

Acknowledgements

This work was supported by FAPESP/Brazil (Grant# 2013/17084-2). E.C.C and C.T.M.J. are supported by CNPq/Brazil (Grants# 304471/2013-5 and 309572/2014-2).

Conflict of interests

There is no financial conflict of interest.

References (25)

  • J. Robinson et al.

    The IPD and IMGT/HLA database: allele variant databases

    Nucl. Acids Res.

    (2015)
  • H. Li et al.

    Fast and accurate short read alignment with Burrows-Wheeler transform

    Bioinformatics

    (2009)
  • Cited by (33)

    • HLA associations, somatic loss of HLA expression, and clinical outcomes in immune aplastic anemia

      2021, Blood
      Citation Excerpt :

      HLA class I genes were enriched by locus-specific long-range PCR, as previously described.31 To precisely detect low-frequency mutations, we used hla-mapper (version 2.3),32 and used thresholds defined by base-position error rates in T-cell samples, as previously published for non-HLA genes.33,34 We inferred 6p LOH using read counts of allele-specific single-nucleotide polymorphisms in HLA genes.

    • Human leukocyte antigen (HLA)-F and -G gene polymorphisms and haplotypes are associated with malaria susceptibility in the Beninese Toffin children

      2021, Infection, Genetics and Evolution
      Citation Excerpt :

      Libraries were sequenced using the MiSeq system (Illumina, San Diego, CA) and the Illumina Reagent (V2 500 cycles) in a paired-end mode (2x250bp). The bioinformatics analyses of HLA-E/-F/−G massively parallel sequencing data, including alignment, variant calling, and haplotype calling, were performed as previously described (Castelli et al., 2018; Castelli et al., 2015; Lima et al., 2016). Considering that only the coding region haplotypes have an officially recognized nomenclature defined by the IPD-IMGT/HLA database (Robinson et al., 2015), the nomenclature of the promoter and 3’UTR segments, as well as the nomenclature of the extended haplotypes encompassing these three major regions, were assigned as previously defined for HLA-E (Ramalho et al., 2017), HLA-F (Lima et al., 2016), and HLA-G (Castelli et al., 2017).

    View all citing articles on Scopus
    View full text