Hla-mapper: An application to optimize the mapping of HLA sequences produced by massively parallel sequencing procedures

doi:10.1016/j.humimm.2018.06.010

Human Immunology

Volume 79, Issue 9, September 2018, Pages 678-684

https://doi.org/10.1016/j.humimm.2018.06.010 Get rights and content

Abstract

A challenging task when more than one HLA gene is evaluated together by second-generation sequencing is to achieve a reliable read mapping. The polymorphic and repetitive nature of HLA genes might bias the read mapping process, usually underestimating variability at very polymorphic segments, or overestimating variability at some segments. To overcome this issue we developed hla-mapper, which takes into account HLA sequences derived from the IPD-IMGT/HLA database and unpublished HLA sequences to apply a scoring system. This comprehends the evaluation of each read pair, addressing them to the most likely HLA gene they were derived from. Hla-mapper provides a reliable map of HLA sequences, allowing accurate downstream analysis such as variant calling, haplotype inference, and allele typing. Moreover, hla-mapper supports whole genome, exome, and targeted sequencing data. To assess the software performance in comparison with traditional mapping algorithms, we used three different simulated datasets to compare the results obtained with hla-mapper, BWA MEM, and Bowtie2. Overall, hla-mapper presented a superior performance, mainly for the classical HLA class I genes, minimizing wrong mapping and cross-mapping that are typically observed when using BWA MEM or Bowtie2 with a single reference genome.

Introduction

The human leukocyte antigen (HLA) complex comprehends the most variable segment of the human genome. This complex plays a central role in the immune response since HLA molecules are key features for antigen presentation and immune response modulation [1]. HLA genes compatibility between recipient and donor influences graft acceptance, therefore, HLA variability is frequently evaluated for clinical purposes. Moreover, different HLA variants might be related to differential immune responses against pathogens, susceptibility to autoimmune diseases, or even to specific tumors [2]. The classical HLA class I genes, HLA-A, HLA-B, and HLA-C, are highly polymorphic loci and encode a key molecule for antigen presentation to T CD8+ lymphocytes. The non-classical HLA class I genes, HLA-G, HLA-E, and HLA-F, are conserved at the DNA and protein level and encode immune-modulatory molecules [3], [4], [5].

Much effort has been made to characterize HLA polymorphisms and complete sequences in worldwide populations. Most of the studies are restricted to the coding region, or even to exon segments only. However, with the advent of next-generation sequencing (NGS), or massively parallel sequencing, information regarding introns and regulatory segments has been obtained, as well as the characterization of many new allele variants, providing deeper insights regarding HLA worldwide variability and extended haplotypes. This is particularly evident for non-classical HLA class I genes [6], [7], [8]. In addition, NGS strategies do allow the evaluation of several genes all at once, and also allow the phase definition (haplotypes) among part of the variants detected. Nevertheless, when two or more HLA genes are sequenced at the same time using NGS methods, a challenging task is to achieve a reliable read mapping. Because of the polymorphic and repetitive nature of most of the HLA genes due to their paralogous origins, the following issues may arise when raw data (the reads) are mapped to a single reference genome.

First, the classical HLA class I genes are among the most variable genes in the human genome. The International Immunogenetics Database (IPD-IMGT/HLA database, version 3.31.0) describes more than 12,893 class I alleles [9]. Because of the polymorphic nature of classical HLA genes, the sequences (or reads) obtained by NGS methods usually present too many nucleotide differences when compared with the reference human genome. Thus, well-established aligners such as BWA (the Burrows-Wheeler Aligner) and Bowtie2 [10], [11] frequently do not map classical HLA class I sequences correctly. Usually, when these algorithms are used with default parameters, because of the high polymorphism, many reads do not find a match in the reference genome, leading to a mapping bias that underestimates HLA variability and overestimates reference allele frequencies [12], [13], [14]. Moreover, when a higher tolerance for mismatches is defined, many reads are incorrectly mapped as explained below.

Second, mostly due to the high sequence similarity among HLA genes, polymorphisms may lead a sequence to be more compatible with the reference sequence of another HLA gene than with the reference of the original gene. Thus, when the tolerance for mismatches is increased to avoid the aforementioned mapping bias, this second issue leads to a large number of reads mapping to more than one HLA gene into the reference genome, or simply mapping to the wrong gene. In this scenario, it is expected an overestimation of genetic diversity in segments presenting a large number of incorrectly mapped reads. Third, depending on the NGS method used, a large number of very short reads is produced and they could exacerbate the issues already presented.

The BWA developers do acknowledge the issues presented above. They created the bwa.kit to improve HLA read mapping based on alternative contigs and known HLA sequences from the IPD-IMGT/HLA. However, its use is not straightforward and the data obtained might bias downstream genotyping procedures as discussed later. In addition, other attempts to evaluate the efficiency of many mappers and variant call methods in complex regions have been made, but mapping improvement depends on the mapper and variant caller combination [15], [16].

All the issues introduced above could be circumvented if the reads of each HLA gene were tagged with different indices, but this strategy is cost ineffective. Usually, the aim is to sequence all HLA genes from a given individual in a single sequencing run, tagging each individual with specific indices, thus allowing the evaluation of many samples at the same time. Additionally, research groups could be interested in other non-HLA genes that may be sequenced together with HLA, or even in evaluating HLA from whole genome sequencing.

Although many companies have introduced different NGS HLA-typing kits and specific applications to call HLA alleles, these products are mainly focused on reporting pre-defined alleles from the IPD-IMGT/HLA database, for clinical purposes. In this context, they are usually restricted to the segments tracked by the aforementioned database, which does not include the complete upstream regulatory regions and the complete 3′ untranslated segments. In addition, these applications usually produce good results when the HLA typing kits from the same manufacturer are used. However, they may not be suitable when alternative approaches to characterize HLA genes are applied (e.g., when other non-HLA genes are included in the sequencing). These applications are usually not freely available.

Many publicly available typing tools have recently been developed to call HLA alleles from NGS data [14], [17], including HLAminer [18], HLA-VBseq [19], HLAreporter [20], OptiType [21], and others. In general, these tools report only the HLA typing and do not properly handle new HLA alleles. Moreover, they usually do not output aligned (BAM/SAM) files that can be further processed (for instance, to infer SNP genotypes).

To achieve a better evaluation of HLA genes when several HLA genes are sequenced together or when other genes outside the HLA complex are also included, we developed hla-mapper to optimize read mapping for HLA class I genes. hla-mapper uses a scoring system in order to address each read pair to the most likely gene, providing a reliable map of those sequences, as described in the methods section. Here we present the hla-mapper application, comparing its mapping accuracy with other aligners such as BWA and Bowtie2.

Section snippets

The hla-mapper software

The input for hla-mapper is composed of paired-end FASTQ files from amplicon, exon, or whole-genome sequencing. The software has a trimming algorithm to remove short reads (the default value is 70 nucleotides) and low-quality segments, identifying the largest sequence fragment for each read in which all nucleotides present a quality value higher than 97% (Q ≥ 15). This process assures that only high-quality sequences pass forward to the scoring stage (Fig. 1).

After the trimming process,

Hla-mapper performance at datasets 1 and 2

hla-mapper performance was assessed in three datasets, as described earlier. The results for dataset 1 and 2 (that do not include HLA pseudogenes) are shown in Table 1, Table 2. Here we assessed the hla-mapper performance when dealing with simulated samples that resemble an actual population sample with known alleles (dataset 1, Table 1) and when dealing with only new HLA sequences (dataset 2, Table 2). In these datasets, there were 8.4 million read pairs (2 alleles × 700 fragments × 6

Discussion

Much effort has been made in the development of typing tools to call HLA alleles from NGS data. Many commercial or freely available typing tools have recently been developed. Among the public ones, we can find HLAminer, HLAreporter, HLA-VBseq, OptiType, ATHLATES, PHLAT, HLAforest, and others. The major goal of these tools is HLA typing, i.e., the definition of the two HLA alleles at each locus. Among the commercial ones, we may cite NGSEngine (from GenDX) and HLA Twin (from Omixon). Minor

Acknowledgements

This work was supported by FAPESP/Brazil (Grant# 2013/17084-2). E.C.C and C.T.M.J. are supported by CNPq/Brazil (Grants# 304471/2013-5 and 309572/2014-2).

Conflict of interests

There is no financial conflict of interest.

References (25)

E.C. Castelli et al.
HLA-G variability and haplotypes detected by massively parallel sequencing procedures in the geographicaly distinct population samples of Brazil and Cyprus
Mol. Immunol.
(2017)
T.H. Lima et al.
HLA-F coding and regulatory segments variability determined by massively parallel sequencing procedures in a Brazilian population sample
Hum. Immunol.
(2016)
J. Ramalho et al.
HLA-E regulatory and coding region variability and haplotypes in a Brazilian population sample
Mol. Immunol.
(2017)
Y. Huang et al.
HLA reporter: a tool for HLA typing from next generation sequencing data
Genome Med.
(2015)
E.C. Castelli et al.
HLA-E coding and 3' untranslated region variability determined by next-generation sequencing in two West-African population samples
Hum. Immunol.
(2015)
J. Klein et al.
The HLA system. First of two parts
N. Engl. J. Med.
(2000)
V. Matzaraki et al.
The MHC locus and genetic susceptibility to autoimmune and infectious diseases
Genome Biol.
(2017)
T.V. Hviid
HLA-G in human reproduction: aspects of genetics, function and pregnancy complications
Hum. Reprod. Update
(2006)
E.A. Donadi et al.
Implications of the polymorphism of HLA-G on its function, regulation, evolution and disease association
Cell Mol. Life Sci.
(2011)
B.K. Kaiser et al.
Structural basis for NKG2A/CD94 recognition of HLA-E
Proc. Natl. Acad. Sci. USA
(2008)

J. Robinson et al.

The IPD and IMGT/HLA database: allele variant databases

Nucl. Acids Res.

(2015)

H. Li et al.

Fast and accurate short read alignment with Burrows-Wheeler transform

Bioinformatics

(2009)

Cited by (33)

HLA associations, somatic loss of HLA expression, and clinical outcomes in immune aplastic anemia
2021, Blood
Citation Excerpt :
HLA class I genes were enriched by locus-specific long-range PCR, as previously described.31 To precisely detect low-frequency mutations, we used hla-mapper (version 2.3),32 and used thresholds defined by base-position error rates in T-cell samples, as previously published for non-HLA genes.33,34 We inferred 6p LOH using read counts of allele-specific single-nucleotide polymorphisms in HLA genes.
Immune aplastic anemia (AA) features somatic loss of HLA class I allele expression on bone marrow cells, consistent with a mechanism of escape from T-cell–mediated destruction of hematopoietic stem and progenitor cells. The clinical significance of HLA abnormalities has not been well characterized. We examined the somatic loss of HLA class I alleles and correlated HLA loss and mutation-associated HLA genotypes with clinical presentation and outcomes after immunosuppressive therapy in 544 AA patients. HLA class I allele loss was detected in 92 (22%) of the 412 patients tested, in whom there were 393 somatic HLA gene mutations and 40 instances of loss of heterozygosity. Most frequently affected was HLA-B*14:02, followed by HLA-A*02:01, HLA-B*40:02, HLA-B*08:01, and HLA-B*07:02. HLA-B*14:02, HLA-B*40:02, and HLA-B*07:02 were also overrepresented in AA. High-risk clonal evolution was correlated with HLA loss, HLA-B*14:02 genotype, and older age, which yielded a valid prediction model. In 2 patients, we traced monosomy 7 clonal evolution from preexisting clones harboring somatic mutations in HLA-A*02:01 and HLA-B*40:02. Loss of HLA-B*40:02 correlated with higher blood counts. HLA-B*07:02 and HLA-B*40:01 genotypes and their loss correlated with late-onset of AA. Our results suggest the presence of specific immune mechanisms of molecular pathogenesis with clinical implications. HLA genotyping and screening for HLA loss may be of value in the management of immune AA. This study was registered at clinicaltrials.gov as NCT00001964, NCT00061360, NCT00195624, NCT00260689, NCT00944749, NCT01193283, and NCT01623167.
Human leukocyte antigen (HLA)-F and -G gene polymorphisms and haplotypes are associated with malaria susceptibility in the Beninese Toffin children
2021, Infection, Genetics and Evolution
Citation Excerpt :
Libraries were sequenced using the MiSeq system (Illumina, San Diego, CA) and the Illumina Reagent (V2 500 cycles) in a paired-end mode (2x250bp). The bioinformatics analyses of HLA-E/-F/−G massively parallel sequencing data, including alignment, variant calling, and haplotype calling, were performed as previously described (Castelli et al., 2018; Castelli et al., 2015; Lima et al., 2016). Considering that only the coding region haplotypes have an officially recognized nomenclature defined by the IPD-IMGT/HLA database (Robinson et al., 2015), the nomenclature of the promoter and 3’UTR segments, as well as the nomenclature of the extended haplotypes encompassing these three major regions, were assigned as previously defined for HLA-E (Ramalho et al., 2017), HLA-F (Lima et al., 2016), and HLA-G (Castelli et al., 2017).
Little attention has been devoted to the role of the immunoregulatory HLA-E/-F/−G genes in malaria. We evaluated the entire HLA-E/-F/−G variability in Beninese children highly exposed to Plasmodium falciparum (P.f.) malaria.
154 unrelated children were followed-up for six months and evaluated for the presence and number of malaria episodes. HLA-E/-F/−G genes were genotyped using massively parallel sequencing. Anti P.f. antibodies were evaluated using ELISA.
Children carrying the G allele at HLA-F (−1499,rs183540921) showed increased P.f. asymptomatic/symptomatic ratio, suggesting that these children experienced more asymptomatic P.f. episodes than symptomatic one. Children carrying HLA-G-UTR-03 haplotype exhibited increased risk for symptomatic P.f. episodes and showed lower IgG2 response against P.f. GLURP-R2 when compared to the non-carriers. No associations were observed for the HLA-E gene.
HLA-F associations may be related to the differential expression profiles of the encoded immunomodulatory molecules, and the regulatory sites at the HLA-G 3’UTR may be associated to posttranscriptional regulation of HLA-G and to host humoral response against P.f.
18th International HLA and Immunogenetics Workshop: Report on the SNP-HLA Reference Consortium (SHLARC) component
2024, HLA
Optimal population-specific HLA imputation with dimension reduction
2024, HLA
HLA-F transcriptional and protein differential expression according to its genetic polymorphisms
2023, HLA
Immunogenetics of HLA-B: SNP, allele, and haplotype diversity in populations from different continents and ancestry backgrounds
2023, HLA

View all citing articles on Scopus

View full text

Hla-mapper: An application to optimize the mapping of HLA sequences produced by massively parallel sequencing procedures

Abstract

Introduction

Section snippets

The hla-mapper software

Hla-mapper performance at datasets 1 and 2

Discussion

Acknowledgements

Conflict of interests

Mol. Immunol.

Hum. Immunol.

Mol. Immunol.

Genome Med.

Hum. Immunol.

The HLA system. First of two parts

N. Engl. J. Med.

The MHC locus and genetic susceptibility to autoimmune and infectious diseases

Genome Biol.

HLA-G in human reproduction: aspects of genetics, function and pregnancy complications

Hum. Reprod. Update

Implications of the polymorphism of HLA-G on its function, regulation, evolution and disease association

Cell Mol. Life Sci.

Structural basis for NKG2A/CD94 recognition of HLA-E

Proc. Natl. Acad. Sci. USA

The IPD and IMGT/HLA database: allele variant databases

Nucl. Acids Res.

Fast and accurate short read alignment with Burrows-Wheeler transform

Bioinformatics