Chapter Twenty-One - Microbial Community Analysis Using MEGAN
Introduction
Environmental metagenomics involves the study of microbial organisms in their native environment using DNA sequencing (Handelsman, Rondon, Brady, Clardy, & Goodman, 1998). Metagenomic samples contain a large number of organisms; for example, 4 × 10^7 prokaryotic cells can be found in 1 g of forest soil (Richter & Markewitz, 1995). Research in this area has benefited from the rise of second-generation sequencing technologies and hopes to benefit further from the third generation of sequencing techniques. Sampling and sequencing different environmental niches can now be done very cheaply and efficiently.
Given a dataset of DNA sequencing reads obtained from an environmental sample, there are three initial computational challenges to address. The first task is to estimate the taxonomic content, that is, the qualitative and, if possible, quantitative distribution of organisms in the sample. The second problem is to determine the functional content of the sample. The third challenge is to compare different samples of interest. In many cases, the aim is to detect changes in taxonomic and/or functional composition that correlate with external parameters or properties of the samples.
To address these challenges, the first step is often to align the set of sequencing reads against a database of known reference protein sequences such as NCBI-NR or RefSeq (Wheeler et al., 2008) using a pairwise alignment tool such as BLASTX (Altschul, Gish, Miller, Myers, & Lipman, 1990), RAPSearch2 (Zhao, Tang, & Ye, 2012), or PAUDA (Huson & Xie, 2013). A read is said to hit a given reference sequence if a significant alignment is found during this process. The comparison of the sequencing reads against a reference database is usually considered the most computationally expensive step of the analysis, and subsequent steps are based on the obtained alignments. However, new tools such as PAUDA may change this.
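To make the notion of a hit concrete, the following minimal sketch collects, for each read, the reference sequences to which it has a significant alignment. It assumes the alignments are available in the default 12-column BLAST tabular format; the function name `significant_hits`, the E-value and bit-score cutoffs, and the three example records are illustrative assumptions, not values prescribed by MEGAN or by any of the alignment tools named above.

```python
from collections import defaultdict

# Default column order of BLAST tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def significant_hits(lines, max_evalue=1e-5, min_bitscore=50.0):
    """Map each read to the reference sequences it hits.

    The two thresholds are illustrative assumptions; suitable values
    depend on read length, database, and alignment tool.
    """
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(FIELDS, line.split("\t")))
        evalue = float(row["evalue"])
        bitscore = float(row["bitscore"])
        if evalue <= max_evalue and bitscore >= min_bitscore:
            hits[row["qseqid"]].append((row["sseqid"], bitscore))
    return dict(hits)


# Hypothetical alignment records, made up for illustration only.
example = [
    "read1\trefA\t92.0\t60\t5\t0\t1\t180\t10\t69\t1e-20\t110.0",
    "read1\trefB\t40.0\t30\t18\t0\t1\t90\t5\t34\t0.5\t25.0",
    "read2\trefC\t35.0\t25\t16\t0\t1\t75\t3\t27\t2.0\t20.0",
]
hits = significant_hits(example)
print(hits)
```

In this toy input, only the alignment of read1 to refA passes both cutoffs, so read2 is left without any hit and would remain unassigned in a purely alignment-based analysis.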
As an alternative to reference-based methods, one can use alignment-free taxonomic predictors to estimate the taxonomic content or profile of a sample. Such tools often employ machine-learning techniques such as naïve Bayesian classifiers or support vector machines, based on k-mer counts (McHardy, Martin, Tsirigos, Hugenholtz, & Rigoutsos, 2007). The advantages of machine-learning techniques are their speed and their (limited) ability to classify reads even when alignments to sequences in the reference database do not exist. However, such methods are usually not able to assess functional content and they do not produce pairwise alignments, which have many uses beyond general profiling.
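As a rough illustration of the k-mer counts on which such predictors are based, the sketch below turns a read into a fixed-length count vector that could serve as input to a classifier. It covers only the feature-extraction step and is not the method of any particular published tool; the function name `kmer_profile` and the choice of k are assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_profile(read, k=4):
    """Return the counts of all 4**k DNA k-mers in a read, in a fixed order.

    k-mers containing characters other than A, C, G, T (for example N)
    are not counted.
    """
    read = read.upper()
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    return [counts.get("".join(p), 0) for p in product("ACGT", repeat=k)]


# A short made-up read, profiled with k = 2 (16 possible 2-mers).
profile = kmer_profile("ACGTACGTAC", k=2)
print(profile)
```

Because the vector has the same length and ordering for every read, profiles from different reads can be compared directly or passed to a naïve Bayesian classifier or support vector machine.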
Unfortunately, current databases only represent a small percentage of the microbial diversity encountered on earth and are strongly biased toward model organisms. It will be some time before projects such as GEBA (Wu et al., 2009) will have a significant impact on this problem.
Given the result of the alignment step or similarity search of a set of metagenomic reads against a reference database, an analysis program such as MEGAN is then required to explore and analyze the data. MEGAN is a tool for analyzing metagenomic sequence data, allowing the user to interactively explore the taxonomic and functional content of a sample. It also supports the comparison of multiple samples on taxonomic and functional levels. The program was originally published in Huson, Auch, Qi, and Schuster (2007), and version 4 is presented in Huson, Mitra, Weber, Ruscheweyh, and Schuster (2011). Here, we describe version 5 of the software. Written in Java, the program runs on all major operating systems. The program can be downloaded from http://www-ab.informatik.uni-tuebingen.de/software/megan.
We would like to emphasize that technical factors such as the sample-preparation protocol, DNA extraction method, and sequencing technology all have a marked effect on the resulting metagenome data. It is current practice to keep all these variables constant within a given project, especially when sample comparison is of interest.
The basic input to MEGAN is a set of sequencing reads and the result of a pairwise alignment of the reads to a database of appropriate reference sequences. MEGAN supports a number of different input formats, such as BLAST (text, tabular, and XML) (Altschul et al., 1990), SAM (Li et al., 2009), RAPSearch2 (Zhao et al., 2012), RDP (Wang, Garrity, Tiedje, & Cole, 2007), NBC (Rosen, Reichenberger, & Rosenfeld, 2011), and QIIME (Caporaso et al., 2010), as well as a number of different CSV (comma-separated value) formats. Processed data are stored in a so-called RMA (read-match archive) file that contains all reads and matches in a compressed and indexed format. The results of analyses produced by MEGAN can be exported in a number of CSV formats, and all visualizations provided by the program can be exported in a wide range of graphics formats. The program provides search tools for locating taxa and genes of interest.
In this chapter, we discuss how to analyze and compare metagenomic samples and address some of the typical questions encountered along the way. Sampling, library preparation, sequencing, and quality control are beyond the scope of this work and are not discussed here. As sequence comparison software, we recommend use of RAPSearch2 or PAUDA.
We use a permafrost soil metagenome dataset that was published in Mackelprang et al. (2011) as a running example. This dataset was sequenced from multiple permafrost cores originating from Hess Creek, Alaska. The samples were acquired in two cores, each containing soil from two different layers representing the active and the permafrost layer. Samples were sequenced immediately after collection as well as 2 and 7 days after thawing of the cores at 5 °C. Hence, there are 12 sequenced samples in total, each associated with one of two cores, core 1 or core 2, one of two layers, active layer or permafrost, and one of three time points, frozen (time point 0), 2 days, or 7 days. MEGAN files for all 12 samples can be downloaded from http://ab.inf.uni-tuebingen.de/software/megan/.
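The sample count follows from the design: 2 cores × 2 layers × 3 time points = 12 samples. The short sketch below enumerates these combinations as a sample sheet; the dictionary keys and label strings are our own illustrative choices and are not identifiers from the published dataset or from the MEGAN files.

```python
from itertools import product

# Design factors as described in the text; the labels are illustrative.
cores = ["core 1", "core 2"]
layers = ["active layer", "permafrost"]
time_points = ["frozen (day 0)", "2 days", "7 days"]

# One entry per sequenced sample: every combination of the three factors.
samples = [
    {"core": core, "layer": layer, "time_point": time_point}
    for core, layer, time_point in product(cores, layers, time_points)
]

for sample in samples:
    print(sample)
print(len(samples))
```

Keeping the factors in such a table makes it straightforward to group samples by core, layer, or time point when they are compared later.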
The first step in a MEGAN analysis is to parse the reads and sequence alignment (BLAST or similar) files, using the Import from Blast menu item. MEGAN stores the result of this step in an RMA file. The initial parsing of a dataset may take a number of hours and is often performed on a server using MEGAN in command-line mode. However, it takes MEGAN only seconds to open or reopen an RMA file.
Taxonomic Analysis
One popular approach to assess the taxonomic content of a sample is to focus on specific phylogenetic markers such as 16S rRNA. This type of analysis is supported by MEGAN and one can easily import the result of an analysis of such data obtained, for example, using the RDP classifier (Wang et al., 2007) or by performing a BLASTN comparison of the reads against the Silva database (Pruesse et al., 2007). The taxonomy used by MEGAN can be modified to suit the purposes of 16S rRNA, see Lanzén et
Functional Analysis
MEGAN currently supports functional analysis using three different functional classification schemes, namely, SEED, KEGG, and COG (eggNOG).
The SEED Viewer in MEGAN is based on the SEED classification (Overbeek et al., 2005) and the concept of a subsystem, which consists of a set of functional roles that implement a specific biological process or structural complex. To perform a SEED-based analysis, for each read in the input, MEGAN identifies the highest scoring hit to a reference sequence for
Sequence Alignment
As mentioned above, the main computational step is to determine all pairwise alignments between the set of DNA reads and all sequences in an appropriate reference database. Based on the result of this computation, it is possible to construct a reference-guided multiple sequence alignment between all reads that hit the same reference sequence. MEGAN provides access to such multiple sequence alignments in its Alignment Viewer. Once the user has selected a node in the Taxonomy, SEED, KEGG, or COG
Comparison of Samples
Metagenome projects usually comprise multiple samples taken from different environments, experimental settings, time points, or locations. The comparison of tens or hundreds of large samples is a challenging task. Depending on the project, the goal of a comparison can vary between the detection of simple changes in taxonomic composition to complex functional shifts. To facilitate the comparison of samples, MEGAN allows the user to open multiple samples simultaneously, showing each sample in a
Conclusion
MEGAN is an interactive program for analyzing the taxonomic and functional content of metagenomic (and metatranscriptomic) samples. With MEGAN, we hope to provide a versatile tool for analyzing single metagenomes or groups of metagenomes on a desktop computer, aimed at the biologist in the field (or lab) rather than the trained bioinformatician, and thus we try to keep the program as simple to use as possible. Input is a set of DNA reads and the result of comparing the reads against a reference database.
References (23)
- Altschul et al. (1990). Basic local alignment search tool. Journal of Molecular Biology.
- Caporaso et al. (2010). QIIME allows analysis of high-throughput community sequencing data. Nature Methods.
- Handelsman et al. (1998). Molecular biological access to the chemistry of unknown soil microbes: A new frontier for natural products. Chemistry and Biology.
- Huson et al. (2007). MEGAN analysis of metagenomic data. Genome Research.
- Huson et al. (2011). Integrative analysis of environmental sequences using MEGAN4. Genome Research.
- Huson & Xie (2013). A poor man’s BLASTX—High-throughput metagenomic protein database search using PAUDA. Bioinformatics.
- Kanehisa & Goto (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research.
- Lanzén et al. (2012). CREST—Classification resources for environmental sequence tags. PLoS One.
- Li et al. (2009). The sequence alignment/map (SAM) format and SAMtools. Bioinformatics.
- Mackelprang et al. (2011). Metagenomic analysis of a permafrost microbial community reveals a rapid response to thaw. Nature.
- McHardy et al. (2007). Accurate phylogenetic classification of variable-length DNA fragments. Nature Methods.