Benchmark priors for Bayesian model averaging
Introduction
The issue of model uncertainty has permeated the econometrics and statistics literature for decades. An enormous volume of references can be cited (only a fraction of which is mentioned in this paper), and special issues of the Journal of Econometrics (1981, Vol. 16, No. 1) and Statistica Sinica (1997, Vol. 7, No. 2) are merely two examples of the amount of interest this topic has generated in the literature. From a Bayesian perspective, dealing with model uncertainty is conceptually straightforward: the model is treated as a further parameter which lies in the set of models entertained (the model space). A prior now needs to be specified for the parameters within each model as well as for the models themselves, and Bayesian inference can be conducted in the usual way, with one level (the prior on the model space) added to the hierarchy; see, e.g., Draper (1995) and the ensuing discussion. Unfortunately, the influence of the prior distribution, which is often straightforward to assess for inference given the model, is much harder to identify for posterior model probabilities. It is acknowledged (e.g., Kass and Raftery, 1995; George, 1999) that posterior model probabilities can be quite sensitive to the specification of the prior distribution.
In this paper, we consider a particular instance of model uncertainty, namely uncertainty about which variables should be included in a linear regression problem with k available regressors. A model here will be identified by the set of regressors that it includes and, thus, the model space consists of 2^k elements. Given the issue of sensitivity to the prior distribution alluded to above, the choice of prior is quite delicate, especially in the absence of substantial prior knowledge. Our aim here is to come up with a prior distribution that leads to sensible results, in the sense that data information dominates prior assumptions. Whereas we acknowledge the merits of using substantive prior information whenever available, we shall be concerned with providing the applied researcher with a ‘benchmark’ method for conducting inference in situations where incorporating such information into the analysis is deemed impossible, impractical or undesired. In addition, this provides a useful backdrop against which results arising from Bayesian analyses with informative priors could be contrasted.
We will focus on Bayesian model averaging (BMA), rather than on selecting a single model. BMA follows directly from the application of Bayes’ theorem in the hierarchical model described in the first paragraph, which implies mixing over models using the posterior model probabilities as weights. This is very reasonable as it allows for propagation of model uncertainty into the posterior distribution and leads to more sensible uncertainty bands. From a decision-theory point of view, Min and Zellner (1993) show that such mixing over models minimizes expected predictive squared error loss, provided the set of models under consideration is exhaustive. Raftery et al. (1997) state that BMA is optimal if predictive ability is measured by a logarithmic scoring rule. The latter result also follows from Bernardo (1979), who shows that the usual posterior distribution leads to maximal expected utility under a logarithmic proper utility function. Such a utility function was argued by Bernardo (1979) to be ‘often the more appropriate description for the preferences of a scientist facing an inference problem’. Thus, in the context of model uncertainty, the use of BMA follows from sensible utility considerations. This is the scenario that we will focus on. However, our results should also be useful under other utility structures that lead to decisions different from model averaging — e.g. model selection. This is because the posterior model probabilities will intervene in the evaluation of posterior expected utility. Thus, finding a prior distribution that leads to sensible results in the absence of substantive prior information is relevant in either setting.
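As a rough illustration of the mixing described above, the following minimal sketch turns log marginal likelihoods into posterior model probabilities and uses them as weights. The function names `posterior_model_probs` and `bma_mean` are hypothetical, and a uniform prior over models is assumed when none is supplied; neither comes from the paper.

```python
import numpy as np

def posterior_model_probs(log_marginals, prior_probs=None):
    """Posterior model probabilities from log marginal likelihoods.

    Assumption: a uniform prior over models when prior_probs is None.
    """
    log_marginals = np.asarray(log_marginals, dtype=float)
    if prior_probs is None:
        prior_probs = np.full(log_marginals.shape, 1.0 / log_marginals.size)
    log_post = log_marginals + np.log(np.asarray(prior_probs, dtype=float))
    # Subtract the maximum before exponentiating to avoid overflow.
    log_post = log_post - log_post.max()
    weights = np.exp(log_post)
    return weights / weights.sum()

def bma_mean(per_model_means, weights):
    """Model-averaged posterior mean: a weighted sum over models."""
    return float(np.dot(weights, per_model_means))
```

For example, two models whose marginal likelihoods are in the ratio 1:3 receive weights 0.25 and 0.75 under equal prior probabilities, and any model-specific posterior mean is averaged with those weights.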
Broadly speaking, we can distinguish three strands of related literature in the context of model uncertainty. Firstly, we mention the fundamentally oriented statistics and econometrics literature on prior elicitation and model selection or model averaging, such as exemplified in Box (1980), Zellner and Siow (1980), Draper (1995) and Phillips (1995) and the discussions of these papers. Secondly, there is the recent statistics literature on computational aspects. Markov chain Monte Carlo methods are proposed in George and McCulloch (1993), Madigan and York (1995), Geweke (1996) and Raftery et al. (1997), while Laplace approximations are found in Gelfand and Dey (1994) and Raftery (1996). Finally, there exists a large literature on information criteria, often in the context of time series, see, e.g., Hannan and Quinn (1979), Akaike (1981), Atkinson (1981), Chow (1981) and Foster and George (1994). This paper provides a unifying framework in which these three areas of research will be discussed.
In line with the bulk of the literature, the context of this paper will be normal linear regression with uncertainty in the choice of regressors. We abstract from any other issue of model specification. We present a prior structure that can reasonably be used in cases where we have (or wish to use) little prior information, partly based on improper priors for parameters that are common to all models, and partly on a g-prior structure as in Zellner (1986). The prior is not in the natural-conjugate class, but is such that marginal likelihoods can still be computed analytically. This allows for a simple treatment of potentially very large model spaces through Markov chain Monte Carlo model composition (MC^3) as introduced in Madigan and York (1995). In contrast to some of the priors proposed in the literature, the prior we propose leads to valid conditioning in the posterior distribution (i.e., the latter can be interpreted as a conditional distribution given the observables) as it avoids dependence on the values of the response variable. The only hyperparameter left to elicit in our prior is a scalar g_0j for each of the models considered. Theoretical properties, such as consistency of posterior model probabilities, are linked to functional dependencies of g_0j on sample size and the number of regressors in the corresponding model. In addition (and perhaps more importantly), we conduct an empirical investigation through simulation. This will allow us to suggest specific choices for g_0j to the applied user. As we have conducted a large simulation study, efficient coding was required. This code (in Fortran-77) has been made publicly available on the World Wide Web.
Section 2 introduces the Bayesian model and the practice of Bayesian model averaging. The prior structure is explained in detail in Section 3, where expressions for Bayes factors are also given. The setup of the empirical simulation experiment is described in Section 4, while results are provided in Section 5. Section 6 presents an illustrative example using the economic model of crime from Ehrlich (1973, 1975), and Section 7 gives some concluding remarks and practical recommendations. The appendix presents results about asymptotic behaviour of Bayes factors.
The model and Bayesian model averaging
We consider n independent replications from a linear regression model with an intercept, say α, and k possible regression coefficients grouped in a k-dimensional vector β. We denote by Z the corresponding n×k design matrix and we assume that r(ι_n : Z) = k+1, where r(·) indicates the rank of a matrix and ι_n is an n-dimensional vector of 1's.
This gives rise to 2^k possible sampling models, depending on whether we include or exclude each of the regressors. In line with the bulk of the literature in
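The setup can be checked numerically with a short sketch. It assumes the rank condition is that the intercept column together with Z has full column rank k+1, and it enumerates models as subsets of regressor indices. The helper names `has_full_column_rank` and `all_models` are hypothetical.

```python
import itertools
import numpy as np

def has_full_column_rank(Z):
    """Check that [iota_n : Z] has rank k + 1 (assumed form of the condition)."""
    Z = np.asarray(Z, dtype=float)
    n, k = Z.shape
    X = np.column_stack([np.ones(n), Z])  # prepend the intercept column
    return bool(np.linalg.matrix_rank(X) == k + 1)

def all_models(k):
    """Yield every subset of the k regressor indices as a tuple (2**k in total)."""
    for size in range(k + 1):
        yield from itertools.combinations(range(k), size)
```

Exhaustive enumeration of this kind is only practical for small k, since the number of subsets doubles with every additional regressor.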
Priors for model parameters and the corresponding Bayes factors
In this section, we present several priors, i.e., several choices for the density in (1.2), and derive the expressions of the resulting Bayes factors. In the remainder of the paper, we shall examine the properties (both finite-sample and asymptotic) of the Bayes factors.
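The exact Bayes factor expressions are not reproduced in this excerpt. As a hedged sketch only, the code below assumes a flat prior on the intercept, p(σ) ∝ 1/σ, and a g-prior β_j | σ ~ N(0, σ²(g Z_j'Z_j)^{-1}) on demeaned regressors. Under those assumptions the log marginal likelihood is, up to a constant shared by all models, (k_j/2) log(g/(g+1)) − ((n−1)/2) log(SSR_j/(g+1) + g·SST/(g+1)). The function name `log_marginal_gprior` is hypothetical.

```python
import numpy as np

def log_marginal_gprior(y, Z, included, g):
    """Log marginal likelihood of one model, up to a model-independent constant.

    Assumptions (not taken from the excerpt): flat prior on the intercept,
    p(sigma) proportional to 1/sigma, and prior precision g * Zj'Zj on the slopes.
    """
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    yc = y - y.mean()
    sst = float(yc @ yc)                      # total sum of squares
    cols = list(included)
    k_j = len(cols)
    if k_j == 0:
        ssr = sst                             # intercept-only model
    else:
        Zj = np.asarray(Z, dtype=float)[:, cols]
        Zj = Zj - Zj.mean(axis=0)             # demean the regressors
        coef, *_ = np.linalg.lstsq(Zj, yc, rcond=None)
        resid = yc - Zj @ coef
        ssr = float(resid @ resid)            # OLS residual sum of squares
    shrunk = ssr / (g + 1.0) + g * sst / (g + 1.0)
    return 0.5 * k_j * np.log(g / (g + 1.0)) - 0.5 * (n - 1) * np.log(shrunk)
```

The log Bayes factor between two models is then the difference of two such values, so the shared constant cancels.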
Convergence and implementation
The simulation study described in the previous section is implemented through the MC^3 methodology mentioned in Section 1. This Metropolis algorithm generates a new candidate model, say M_j, from a Uniform distribution over the subset of the model space consisting of the current state of the chain, say M_s, and all models containing either one regressor more or one regressor fewer than M_s. The chain moves to M_j with probability min(1, B_js), where B_js is the Bayes factor in (2.16).
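The sampler described above can be sketched as follows. This is a minimal illustration, not the authors' Fortran-77 code: the function name `mc3`, its arguments, and the choice to start from the empty model are assumptions, and equal prior model probabilities are assumed so that the acceptance ratio reduces to the Bayes factor.

```python
import math
import random

def mc3(log_marginal, k, n_iter, seed=0):
    """Metropolis walk over regressor subsets; returns visit frequencies.

    log_marginal: hypothetical callable mapping a frozenset of regressor
    indices to a log marginal likelihood (up to a common constant).
    """
    rng = random.Random(seed)
    current = frozenset()                     # assumed starting model
    current_score = log_marginal(current)
    visits = {}
    for _ in range(n_iter):
        # Uniform proposal over the current model and its k neighbours;
        # drawing index k means "propose the current model itself".
        j = rng.randrange(k + 1)
        if j < k:
            candidate = current ^ {j}         # add or drop regressor j
            cand_score = log_marginal(candidate)
            # Accept with probability min(1, Bayes factor).
            log_bf = cand_score - current_score
            if rng.random() < math.exp(min(0.0, log_bf)):
                current, current_score = candidate, cand_score
        visits[current] = visits.get(current, 0) + 1
    return {m: c / n_iter for m, c in visits.items()}
```

Because every model has exactly k neighbours, the proposal is symmetric and the visit frequencies approximate the posterior model probabilities. With unequal prior model probabilities, the acceptance ratio would also need the prior odds.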
In
An empirical example: Crime data
The literature on the economics of crime has been critically influenced by the seminal work of Becker (1968) and the empirical analysis of Ehrlich (1973, 1975). The underlying idea is that criminal activities are the outcome of some rational economic decision process, and, as a result, the probability of punishment should act as a deterrent. Raftery et al. (1997) have used the Ehrlich data set corrected by Vandaele (1978). These are aggregate data for 47 U.S. states in 1960, which will be
Conclusions
We consider the normal linear regression model with uncertainty regarding the choice of regressors. The prior structure we have proposed in Section 3 leads to a valid interpretation of the posterior distribution as a conditional and only requires the choice of one scalar hyperparameter, called g_0j. We make g_0j a possible function of the sample size, n, the number of regressors in the model under consideration, k_j, and the total number of available regressors, k. Theoretical results on
Acknowledgements
We thank Arnold Zellner, Dennis Lindley and two anonymous referees for their useful suggestions. Carmen Fernández gratefully acknowledges financial support from a Training and Mobility of Researchers grant awarded by the European Commission (ERBFMBICT # 961021). Carmen Fernández and Mark Steel were affiliated to CentER and the Department of Econometrics, Tilburg University, The Netherlands, and Eduardo Ley was at FEDEA, Madrid, Spain during the early stages of the work on this paper. Some of
References (64)
Likelihood of a model and information criteria. Journal of Econometrics (1981).
Likelihood ratios, posterior odds and information criteria. Journal of Econometrics (1981).
A comparison of the information and posterior probability criteria for model selection. Journal of Econometrics (1981).
et al. Bayesian and non-Bayesian methods for combining models and forecasts with applications to forecasting international growth rates. Journal of Econometrics (1993).
et al. Bayesian analysis of systems of seemingly unrelated regression equations under a recursive extended natural conjugate prior density. Journal of Econometrics (1988).
et al. Nonparametric regression using Bayesian variable selection. Journal of Econometrics (1996).
Advanced Econometrics (1985).
The pathology of the natural conjugate prior density in the regression model. Annales d'Economie et de Statistique (1991).
Crime and punishment: an economic approach. Journal of Political Economy (1968).
et al. The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association (1996).