Elsevier

Methods in Enzymology

Volume 531, 2013, Pages 465-485
Chapter Twenty-One - Microbial Community Analysis Using MEGAN

https://doi.org/10.1016/B978-0-12-407863-5.00021-6

Abstract

Metagenomics, the study of microbes in the environment using DNA sequencing, depends upon dedicated software tools for processing and analyzing very large sequencing datasets. One such tool is MEGAN (MEtaGenome ANalyzer), which can be used to interactively analyze and compare metagenomic and metatranscriptomic data, both taxonomically and functionally.

To perform a taxonomic analysis, the program places the reads onto the NCBI taxonomy, while functional analysis is performed by mapping reads to the SEED, COG, and KEGG classifications. Samples can be compared taxonomically and functionally, using a wide range of charting and visualization techniques. PCoA and clustering methods allow high-level comparison of large numbers of samples. Different attributes of the samples can be captured and used within the analysis. The program supports various input formats for loading data and can export analysis results in different text-based and graphical formats. The program is designed to work with very large samples containing many millions of reads. It is written in Java, and installers for the three major computer operating systems are available from http://www-ab.informatik.uni-tuebingen.de.

Introduction

Environmental metagenomics involves the study of microbial organisms in their native environment using DNA sequencing (Handelsman, Rondon, Brady, Clardy, & Goodman, 1998). Metagenomic samples contain a large number of organisms; for example, 4 × 10^7 prokaryotic cells can be found in 1 g of forest soil (Richter & Markewitz, 1995). Research in this area has benefited from the rise of second-generation sequencing technologies and hopes to benefit further from the third generation of sequencing techniques. Sampling and sequencing different environmental niches can now be done very cheaply and efficiently.

Given a dataset of DNA sequencing reads obtained from an environmental sample, there are three initial computational challenges to address. The first task is to estimate the taxonomic content, that is, the qualitative and, if possible, quantitative distribution of organisms in the sample. The second problem is to determine the functional content of the sample. The third challenge is to compare different samples of interest. In many cases, the aim is to detect changes in taxonomic and/or functional composition that correlate with external parameters or properties of the samples.
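As a rough illustration of the first and third challenges, the sketch below converts per-taxon read counts into relative abundances and then compares two samples with the Bray-Curtis dissimilarity. The choice of Bray-Curtis is an assumption made for illustration; it is not stated here to be the measure MEGAN uses. The taxon names and counts are made-up values, not data from any real sample.

```python
def relative_abundance(counts):
    """Convert raw per-taxon read counts into fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: n / total for taxon, n in counts.items()}


def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two abundance profiles (0 = identical)."""
    taxa = set(profile_a) | set(profile_b)
    numerator = sum(abs(profile_a.get(t, 0.0) - profile_b.get(t, 0.0)) for t in taxa)
    denominator = sum(profile_a.get(t, 0.0) + profile_b.get(t, 0.0) for t in taxa)
    return numerator / denominator if denominator else 0.0


# Hypothetical read counts per taxon for two samples (illustrative only).
sample_a = relative_abundance({"Proteobacteria": 60, "Actinobacteria": 40})
sample_b = relative_abundance({"Proteobacteria": 30, "Firmicutes": 70})

dissimilarity = bray_curtis(sample_a, sample_b)
print(round(dissimilarity, 3))
```

Normalizing to relative abundances before comparing keeps samples of different sequencing depth on a common scale, which is one common (but not the only) way to prepare profiles for comparison.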

To address these challenges, the first step is often to align the set of sequencing reads against a database of known reference protein sequences such as NCBI-NR or RefSeq (Wheeler et al., 2008) using a pairwise alignment tool such as BLASTX (Altschul, Gish, Miller, Myers, & Lipman, 1990), RAPSearch2 (Zhao, Tang, & Ye, 2012), or PAUDA (Huson & Xie, 2013). A read is said to hit a given reference sequence if a significant alignment is found during this process. The comparison of the sequencing reads against a reference database is usually considered the computationally most expensive step of the analysis, and subsequent steps are based on the obtained alignments. However, new tools such as PAUDA may change this.
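The notion of a read hitting a reference sequence can be sketched as a filter over alignment records. The example below assumes the standard 12-column BLAST tabular layout (query ID, subject ID, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score). The function name, the thresholds, and the example records are hypothetical and chosen only for illustration; they are not MEGAN defaults.

```python
from collections import defaultdict


def significant_hits(lines, max_evalue=1e-5, min_bitscore=50.0):
    """Map each read ID to the (reference ID, bit score) pairs it hits.

    An alignment counts as a hit only if it passes both thresholds.
    The threshold values are illustrative assumptions.
    """
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        read_id, ref_id = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue <= max_evalue and bitscore >= min_bitscore:
            hits[read_id].append((ref_id, bitscore))
    return dict(hits)


# Made-up alignment records in 12-column tabular form.
example = [
    "read1\trefA\t95.0\t60\t3\t0\t1\t180\t10\t69\t1e-30\t120.0",
    "read1\trefB\t80.0\t55\t11\t0\t1\t165\t5\t59\t1e-08\t61.2",
    "read2\trefC\t40.0\t30\t18\t0\t1\t90\t2\t31\t2.5\t22.3",
]

result = significant_hits(example)
print(result)
```

In this sketch, read2 has no significant alignment and therefore does not appear in the result at all, which mirrors the situation of reads without hits in the reference database.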

As an alternative to reference-based methods, one can use alignment-free taxonomic predictors to estimate the taxonomic content or profile of a sample. Such tools often employ machine-learning techniques such as naïve Bayesian classifiers or support vector machines, based on k-mer counts (McHardy, Martin, Tsirigos, Hugenholtz, & Rigoutsos, 2007). The advantages of machine-learning techniques are their speed and their (limited) ability to classify reads even when alignments to sequences in the reference database do not exist. However, such methods are usually not able to assess functional content and they do not produce pairwise alignments, which have many uses beyond general profiling.
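Alignment-free predictors of this kind typically start from a k-mer count vector for each read. The sketch below computes normalized k-mer frequencies for a single DNA sequence; the function name and the choice of k are assumptions for illustration, and the classifier that would consume these features (for example, a naïve Bayesian classifier) is not shown.

```python
from collections import Counter
from itertools import product


def kmer_profile(sequence, k=4):
    """Return normalized k-mer frequencies in a fixed lexicographic k-mer order.

    Windows containing characters other than A, C, G, T are skipped.
    """
    sequence = sequence.upper()
    valid = set("ACGT")
    counts = Counter(
        sequence[i:i + k]
        for i in range(len(sequence) - k + 1)
        if set(sequence[i:i + k]) <= valid
    )
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts[kmer] / total if total else 0.0 for kmer in all_kmers]


# A short made-up sequence containing one ambiguous base (N).
profile = kmer_profile("ACGTACGTNACGT", k=2)
print(len(profile))
```

The fixed ordering of all 4^k possible k-mers gives every read a feature vector of the same length, which is what a downstream machine-learning method would need.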

Unfortunately, current databases only represent a small percentage of the microbial diversity encountered on earth and are strongly biased toward model organisms. It will be some time before projects such as GEBA (Wu et al., 2009) will have a significant impact on this problem.

Given the result of the alignment step or similarity search of a set of metagenomic reads against a reference database, an analysis program such as MEGAN is then required to explore and analyze the data. MEGAN is a tool for analyzing metagenomic sequence data, allowing the user to interactively explore the taxonomic and functional content of a sample. It also supports the comparison of multiple samples on taxonomic and functional levels. The program was originally published in Huson, Auch, Qi, and Schuster (2007), and version 4 is presented in Huson, Mitra, Weber, Ruscheweyh, and Schuster (2011). Here, we describe version 5 of the software. Written in Java, the program runs on all major operating systems. The program can be downloaded from http://www-ab.informatik.uni-tuebingen.de/software/megan.

We would like to emphasize that technical issues such as sample-preparation protocol, DNA extraction method, and sequencing technology all have a marked effect on the resulting metagenome data. It is current practice to keep all these variables constant within a given project, especially when sample comparison is of interest.

The basic input to MEGAN is a set of sequencing reads and the result of a pairwise alignment of the reads to a database of appropriate reference sequences. MEGAN supports a number of different input formats, such as BLAST (text, tabular, and XML) (Altschul et al., 1990), SAM (Li et al., 2009), RAPSearch2 (Zhao et al., 2012), RDP (Wang, Garrity, Tiedje, & Cole, 2007), NBC (Rosen, Reichenberger, & Rosenfeld, 2011), and QIIME (Caporaso et al., 2010), as well as a number of different CSV (comma-separated value) formats. Processed data are stored in a so-called RMA (read-match archive) file that contains all reads and matches in a compressed and indexed format. The results of analyses produced by MEGAN can be exported in a number of CSV formats, and all visualizations provided by the program can be exported in a wide range of graphics formats. The program provides search tools for locating taxa and genes of interest.

In this chapter, we discuss how to analyze and compare metagenomic samples and address some of the typical questions encountered along the way. Sampling, library preparation, sequencing, and quality control are beyond the scope of this work and are not discussed here. As sequence comparison software, we recommend use of RAPSearch2 or PAUDA.

We use a permafrost soil metagenome dataset that was published in Mackelprang et al. (2011) as a running example. This dataset was sequenced from multiple permafrost cores originating from Hess Creek, Alaska. The samples were acquired in two cores, each containing soil from two different layers representing the active and the permafrost layer. Samples were sequenced immediately after collection as well as 2 and 7 days after thawing of the cores at 5 °C. Hence, there are 12 sequenced samples in total, each associated with one of two cores, core 1 or core 2, one of two layers, active layer or permafrost, and one of three time points, frozen (time point 0), 2 days, or 7 days. MEGAN files for all 12 samples can be downloaded from http://ab.inf.uni-tuebingen.de/software/megan/.
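The sample design described above is a full cross of three attributes, which accounts for the total of 12 samples (2 cores × 2 layers × 3 time points). A minimal sketch of this design as a table of sample attributes follows; the dictionary keys and value labels are hypothetical names chosen for illustration, not identifiers used in the MEGAN files.

```python
from itertools import product

cores = ["core 1", "core 2"]
layers = ["active layer", "permafrost"]
time_points = ["frozen", "2 days", "7 days"]

# One record per sequenced sample: every combination of the three attributes.
samples = [
    {"core": core, "layer": layer, "time_point": time_point}
    for core, layer, time_point in product(cores, layers, time_points)
]

print(len(samples))
for sample in samples:
    print(sample)
```

Keeping the attributes in such a table makes it straightforward to group samples, for example all permafrost samples or all samples taken 7 days after thawing, when they are compared later.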

The first step in a MEGAN analysis is to parse the reads and sequence alignment (BLAST or similar) files, using the Import from Blast menu item. MEGAN stores the result of this step in an RMA file. The initial parsing of a dataset may take a number of hours and is often performed on a server using MEGAN in command-line mode. However, it takes MEGAN only seconds to open or reopen an RMA file.

Section snippets

Taxonomic Analysis

One popular approach to assess the taxonomic content of a sample is to focus on specific phylogenetic markers such as 16S rRNA. This type of analysis is supported by MEGAN and one can easily import the result of an analysis of such data obtained, for example, using the RDP classifier (Wang et al., 2007) or by performing a BLASTN comparison of the reads against the Silva database (Pruesse et al., 2007). The taxonomy used by MEGAN can be modified to suit the purposes of 16S rRNA, see Lanzén et

Functional Analysis

MEGAN currently supports functional analysis using three different functional classification schemes, namely, SEED, KEGG, and COG (eggNOG).

The SEED Viewer in MEGAN is based on the SEED classification (Overbeek et al., 2005) and the concept of a subsystem, which consists of a set of functional roles that implement a specific biological process or structural complex. To perform a SEED-based analysis, for each read in the input, MEGAN identifies the highest scoring hit to a reference sequence for

Sequence Alignment

As mentioned above, the main computational step is to determine all pairwise alignments between the set of DNA reads and all sequences in an appropriate reference database. Based on the result of this computation, it is possible to construct a reference-guided multiple sequence alignment between all reads that hit the same reference sequence. MEGAN provides access to such multiple sequence alignments in its Alignment Viewer. Once the user has selected a node in the Taxonomy, SEED, KEGG, or COG

Comparison of Samples

Metagenome projects usually comprise multiple samples taken from different environments, experimental settings, time points, or locations. The comparison of tens or hundreds of large samples is a challenging task. Depending on the project, the goal of a comparison can vary between the detection of simple changes in taxonomic composition to complex functional shifts. To facilitate the comparison of samples, MEGAN allows the user to open multiple samples simultaneously, showing each sample in a

Conclusion

MEGAN is an interactive program for analyzing the taxonomic and functional content of metagenomic (and metatranscriptomic) samples. With MEGAN, we hope to provide a versatile tool for analyzing single metagenomes or groups of metagenomes on a desktop computer, aimed at the biologist in the field (or lab) rather than the trained bioinformatician, and thus we try to keep the program as simple to use as possible. Input is a set of DNA reads and the result of comparing the reads against a reference database.

References (23)

  • S.F. Altschul et al.

    Basic local alignment search tool

    Journal of Molecular Biology

    (1990)
  • J.G. Caporaso et al.

    QIIME allows analysis of high-throughput community sequencing data

    Nature Methods

    (2010)
  • J. Handelsman et al.

    Molecular biological access to the chemistry of unknown soil microbes: A new frontier for natural products

    Chemistry and Biology

    (1998)
  • D.H. Huson et al.

    MEGAN analysis of metagenomic data

    Genome Research

    (2007)
  • D.H. Huson et al.

    Integrative analysis of environmental sequences using MEGAN4

    Genome Research

    (2011)
  • D.H. Huson et al.

    A poor man’s BLASTX—High-throughput metagenomic protein database search using PAUDA

    Bioinformatics

    (2013)
  • M. Kanehisa et al.

    KEGG: Kyoto encyclopedia of genes and genomes

    Nucleic Acids Research

    (2000)
  • A. Lanzén et al.

    CREST—Classification resources for environmental sequence tags

    PLoS One

    (2012)
  • H. Li et al.

    The sequence alignment/map (SAM) format and SAMtools

    Bioinformatics

    (2009)
  • R. Mackelprang et al.

    Metagenomic analysis of a permafrost microbial community reveals a rapid response to thaw

    Nature

    (2011)
  • A.C. McHardy et al.

    Accurate phylogenetic classification of variable-length DNA fragments

    Nature Methods

    (2007)