Detecting natural selection in RNA virus populations using sequence summary statistics

https://doi.org/10.1016/j.meegid.2009.06.001

Abstract

At present, most analyses that aim to detect the action of natural selection upon viral gene sequences use phylogenetic estimates of the ratio of silent to replacement mutations. Such methods, however, are impractical to compute on large data sets comprising hundreds of complete viral genomes, which are becoming increasingly common due to advances in genome sequencing technology. Here we investigate the statistical performance of computationally efficient tests that are based on sequence summary statistics, and explore their applicability to RNA virus data sets in two ways. Firstly, we perform extensive simulations in order to measure the type I error of two well-known summary statistic methods – Tajima's D and the McDonald–Kreitman test – under a range of virus-like mutational and demographic scenarios. Secondly, we apply these methods to a compilation of ∼100 RNA virus alignments that represent natural RNA virus populations. In addition, we develop and introduce a new implementation of the McDonald–Kreitman test and show that it greatly improves the test's statistical reliability on typical viral data sets. Our results suggest that variants of the McDonald–Kreitman test could prove useful in the analysis of very large sets of highly diverse viral genetic data.

Introduction

One of the main goals of viral evolutionary genetics is to understand to what extent natural selection – as opposed to mutation and random genetic drift – determines the genetic variability and evolution of viruses. Various methods of gene sequence analysis have been developed to detect and measure natural selection, the most popular of which can be categorised as either dn/ds-based methods (e.g. Nei and Gojobori, 1986) or methods based on site-frequency summary statistics (e.g. Tajima, 1989, McDonald and Kreitman, 1991a). The former calculate the ratio of non-synonymous to synonymous genetic changes, which is typically denoted dn/ds or ω. A ratio greater than one indicates the action of positive selection, while a ratio of less than one can indicate purifying selection. In contrast, summary statistic methods depend on the frequency at which polymorphisms are found in a sample of sequences. These statistics may be computed from within-species polymorphisms (Tajima, 1989) or from both polymorphisms and among-species fixations (McDonald and Kreitman, 1991a).
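The comparison of polymorphisms and fixations that underlies the McDonald–Kreitman test can be sketched as a 2×2 contingency table analysed with Fisher's exact test. This is a minimal sketch of the standard form of the test only, not the new implementation developed in this paper. The function names `fisher_exact_two_sided` and `mk_test` and the counts used in the example are hypothetical and chosen purely for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def mk_test(dn, ds, pn, ps):
    """Return (neutrality index, p-value).

    dn, ds: replacement and silent fixed differences between populations.
    pn, ps: replacement and silent polymorphisms within a population.
    Assumes dn > 0 and ps > 0 so that the neutrality index is defined.
    """
    neutrality_index = (pn / ps) / (dn / ds)
    return neutrality_index, fisher_exact_two_sided(dn, ds, pn, ps)


# Hypothetical counts, for illustration only
ni, p_value = mk_test(dn=10, ds=20, pn=3, ps=40)
print(f"neutrality index = {ni:.2f}, p = {p_value:.4f}")
```

A neutrality index below one reflects an excess of replacement fixations relative to replacement polymorphisms, which is the pattern usually read as evidence of adaptive evolution. A value above one reflects an excess of replacement polymorphism.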

Currently, most studies of viral genetic data use phylogenetic dn/ds methods as a means to detect selection (e.g. Yang, 2007, Pond and Frost, 2005), which are based on statistical models of codon evolution (Goldman and Yang, 1994, Yang et al., 2000). Examples of this approach are too numerous to list here, but one of the most influential was Nielsen and Yang's (1998) investigation of positive selection in the HIV-1 env gene. Phylogenetic dn/ds methods do not require users to make specific assumptions about the sampled population and can therefore provide robust evidence for the directionality of selection. In addition, simulations show dn/ds methods to have good statistical power under models of both positive and negative selection (Zhai et al., 2009), although in practice such methods are likely more powerful in detecting recurrent or reciprocal selection than single, historical selective sweeps (Pybus and Shapiro, 2009). However, the interpretation of dn/ds can be potentially misleading when recombination has been operating (Wilson and McVean, 2006) and the application of phylogenetic dn/ds methods to within-population data sets has recently been criticised (Kryazhimskiy and Plotkin, 2008). Crucially, phylogenetic dn/ds methods can be time consuming or impractical to compute on large data sets. Recent developments in sequencing technology (Margulies et al., 2005) will make commonplace the publication of data sets containing hundreds or thousands of complete viral genomes, and therefore it is sensible to investigate the potential utility of alternative methods.

Site-frequency summary statistics, such as Tajima's D (Tajima, 1989), have occasionally been used to analyse viral data sets. For example, Edwards et al. (2006) and Shriner et al. (2004a, 2004b) applied versions of Tajima's D to HIV-1, and Tsompana et al. (2005) employed the test on Tomato spotted wilt virus. In addition, tests that consider patterns of both polymorphism and divergence, notably the McDonald–Kreitman (MK) test, have been applied to Bovine immunodeficiency virus (Cooper et al., 1999), beak and feather disease virus (Ritchie et al., 2003) and North American Powassan virus (Ebel et al., 2001). Most pertinent to virus evolution, Williamson (2003) demonstrated that the MK test can be applied to “serially-sampled” sequences that are obtained from the same population at different time points, thereby estimating the rate of viral adaptation through time. Summary statistic methods are computationally very efficient, can potentially be applied to very large whole genome data sets, and may be more robust to the effects of recombination than phylogenetic dn/ds methods. However, summary statistic methods typically assume that multiple mutations do not occur at the same nucleotide site, which may explain why they are rarely employed on rapidly evolving viral data sets, but commonly applied to species with relatively low evolutionary rates, such as Drosophila (McDonald and Kreitman, 1991a, Smith and Eyre-Walker, 2002, Andolfatto, 2005).

In this paper we investigate the utility and performance of two common summary statistic methods, Tajima's D statistic (Tajima, 1989) and the MK test (McDonald and Kreitman, 1991a), when applied to RNA virus sequences. First, we perform extensive simulations of virus-like alignments in order to measure the type I error of these tests (i.e. the probability of falsely rejecting the hypothesis of neutral evolution). Second, we apply the two tests to a collection of almost 100 RNA virus alignments that represent natural viral populations. Third, we develop and implement a new algorithm for computing the MK test that improves the performance of the test on data sets containing much genetic variation.

Tajima's D statistic

Tajima's D test is based on two different estimates of θ, the genetic diversity of a sequence alignment: (i) the mean number of pairwise differences (θ̂k) and (ii) the scaled number of segregating sites (θ̂s), otherwise known as the Watterson estimate (Watterson, 1975). The units of θ are substitutions per site. The premise of Tajima's D test is that under neutral evolution these two measures should be equal in expectation, hence the difference between them should be close to zero. For a neutrally evolving haploid
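The two estimates of θ described above, and the D statistic built from their difference, can be sketched as follows. This is a minimal illustration using the standard formulae from Tajima (1989), under the assumptions that the alignment contains no gaps or ambiguous characters and has at least three sequences. The function name `tajimas_d` and the toy alignment are hypothetical, and this is not the implementation used in the paper.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Return (theta_k per site, theta_s per site, D) for aligned sequences."""
    n, length = len(seqs), len(seqs[0])
    # S: number of segregating (polymorphic) alignment columns
    s = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    # k: mean number of differences over all pairs of sequences
    pairs = list(combinations(seqs, 2))
    k = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)

    # Constants from Tajima (1989), functions of the sample size only
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    theta_k = k / length         # pairwise-difference estimate, per site
    theta_s = s / a1 / length    # Watterson estimate, per site
    if s == 0:
        return theta_k, theta_s, float("nan")  # D is undefined without variation
    d = (k - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))
    return theta_k, theta_s, d


# Toy alignment (hypothetical): one singleton site and one doubleton site
toy = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAT"]
theta_k, theta_s, d = tajimas_d(toy)
print(theta_k, theta_s, d)
```

Because D is a normalised difference between the two estimates, an excess of low-frequency variants pushes it negative and an excess of intermediate-frequency variants pushes it positive.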

Investigating the performance of Tajima's D

To explore the reliability and type I error rate of Tajima's D statistic, we simulated alignments of neutrally evolving sequences under various scenarios. Simulation was a two-step process. First, for each scenario, 500 neutral coalescent trees with 50 taxa were simulated. Second, one alignment of sequences, 6000 nt in length, was simulated along each tree.
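The first step of this procedure, drawing neutral coalescent trees, can be sketched as below. This is a minimal sketch of the standard constant-size coalescent with time measured in coalescent units; it is not the software used in the paper, and it omits the second step of simulating sequences along each tree. The function name `simulate_coalescent` and the random seed are hypothetical.

```python
import random


def simulate_coalescent(n_taxa, rng):
    """Simulate one neutral coalescent genealogy for n_taxa sampled lineages.

    Returns a list of (node_id, left_child, right_child, height) merge
    events, ordered from the most recent coalescence to the root.
    """
    active = list(range(n_taxa))  # lineages that have not yet coalesced
    events, height, next_id = [], 0.0, n_taxa
    while len(active) > 1:
        k = len(active)
        # With k lineages, the waiting time to the next coalescence is
        # exponential with rate k(k-1)/2 in coalescent time units
        height += rng.expovariate(k * (k - 1) / 2)
        left, right = rng.sample(active, 2)  # a uniformly chosen pair merges
        active = [x for x in active if x not in (left, right)] + [next_id]
        events.append((next_id, left, right, height))
        next_id += 1
    return events


# 500 trees of 50 taxa each, matching the numbers quoted in the text
rng = random.Random(1)  # arbitrary seed, for reproducibility only
trees = [simulate_coalescent(50, rng) for _ in range(500)]
mean_root_height = sum(tree[-1][3] for tree in trees) / len(trees)
print(mean_root_height)
```

Under this model the expected root height is 2(1 − 1/n) coalescent units, so the printed mean offers a quick sanity check on the simulated trees.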

Neutral coalescent trees were simulated using standard approaches (e.g. Hudson, 1990) which were implemented in the Java Evolutionary Biology

Investigating the performance of Tajima's D

Fig. 3 shows the performance of Tajima's D test on neutral sequences simulated under different θ values and sampled from a constant-sized population. Fig. 3a shows the type I error of the test, Fig. 3b shows the average D value and Fig. 3c shows the mean values of θ̂k and θ̂s for each simulated value of θ. The statistical performance of the test depends greatly on θ. For explanatory convenience, we divide the range of θ into three regions.

  • REGION ONE (θ < 10⁻⁴): Alignments generated under

Discussion

It is widely acknowledged that Tajima's D is sensitive to changes in population size or the existence of population structure (e.g. Simonsen et al., 1995, Nielsen, 2005). A further concern with using Tajima's D test on viral populations is that their high evolutionary rates would invalidate the test's key assumption that each mutation occurs at a different site (the ‘infinite sites’ assumption). In our study we simulated sequences under an exhaustive range of θ values to assess the type I error

Acknowledgements

We thank Eddie Holmes for generating the original collection of RNA virus data sets. SB is funded by NERC UK. OGP is supported by the Royal Society.

References (43)

  • R. Egea et al. (2008). Standard and generalized McDonald–Kreitman test: a website to detect selection by comparing different classes of DNA sites. Nucleic Acids Research.
  • N. Goldman et al. (1994). A codon-based model of nucleotide substitution for protein-coding DNA sequences. Molecular Biology and Evolution.
  • D. Graur (1991). Neutral mutation hypothesis test. Nature.
  • R.R. Hudson (1990). Gene genealogies and the coalescent process. Oxford Surveys in Evolutionary Biology.
  • S. Kryazhimskiy et al. (2008). The population genetics of dN/dS. PLoS Genetics.
  • M.A. Larkin et al. (2007). ClustalW2 and ClustalX version 2. Bioinformatics.
  • M. Margulies et al. (2005). Genome sequencing in microfabricated high-density picolitre reactors. Nature.
  • J.H. McDonald et al. (1991). Adaptive protein evolution at the Adh locus in Drosophila. Nature.
  • J.H. McDonald et al. (1991). Neutral mutation hypothesis test. Nature.
  • M. Nei et al. (1986). Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Molecular Biology and Evolution.
  • R. Nielsen (2001). Statistical tests of selective neutrality in the age of genomics. Heredity.