Elsevier

Acta Psychologica

Volume 104, Issue 3, June 2000, Pages 371-398

Confidence in aggregation of expert opinions

https://doi.org/10.1016/S0001-6918(00)00037-8

Abstract

We investigate the case of a single decision maker (DM) who obtains probabilistic forecasts regarding the occurrence of a unique target event from J distinct, symmetric, and equally diagnostic expert advisors (judges). The paper begins with a mathematical model of DM's aggregation process of expert opinions, in which confidence in the final aggregate is shown to be inversely related to its perceived variance. As such, confidence is expected to vary as a function of factors such as the number of experts, the total number of cues, the fraction of cues available to each expert, the level of inter-expert overlap in information, and the range of experts' opinions. In the second part of the paper, we present results from two experiments that support the main (ordinal) predictions of the model.

Introduction

Decision makers (DMs) often turn to experts (also referred to as forecasters, judges or advisors) for advice. People seek advice on routine and mundane problems (e.g., tuning in to weather forecasters to determine what to wear on a given day) as well as on important, once-in-a-lifetime decisions (e.g., asking several brokers for advice on where and how to invest one’s life savings). The common elements of all these cases are:

  1. the need to make a decision by a certain time,

  2. the uncertainty about the possible outcomes and their relative likelihoods,

  3. the existence of relevant information that can reduce the uncertainty and facilitate the decision, and

  4. the possibility to consult qualified experts who have access to relevant information before choosing a course of action.

Although there are situations where one relies on advice from a single expert (your favorite weather forecaster), in most cases DMs solicit advice from multiple experts. For example, when trying to decide which investment to select for one’s life savings, it is easy to imagine consulting several brokers and financial advisors, reading books, and so forth. Similarly, when facing the possibility of a major medical procedure it is common practice (not to mention common sense) to seek advice from several specialists at different medical centers. This paper is concerned with various aspects of this process. Specifically, we examine how DMs aggregate the opinions of multiple experts and how confident they are in the process and its final outcome.

Fig. 1 provides a simplified schematic description of the situation. The multiple inputs flowing into the DM refer to a series of expert opinions. DM's task is to aggregate those opinions into a final decision and generate a response. The research on aggregation deals with a variety of distinct questions, but almost all of them are concerned with the “best” way to integrate advice from multiple sources and identifying the various factors that affect and drive the aggregation process. These factors can be roughly organized in four distinct categories associated with key characteristics of (a) the DM (a person, a group or a model, its level of experience, expertise, etc.), (b) the decision task (its context, its importance, the type and amount of information available, the presentation format, etc.), (c) the expert advisors (their accuracy, credibility, the inter-correlations among their forecasts, etc.), and (d) the information on which they base their advice (its reliability, validity, etc.). In fact, one could organize most of the empirical research on aggregation according to the type of factors being studied (Rantilla & Budescu, 1999).

The literature on aggregation of expert advice covers an extensive range of substantive areas, including weather forecasting (e.g., Clemen & Murphy, 1986, Clemen & Winkler, 1987), business and marketing (e.g., Ashton & Ashton, 1985, Ashton, 1986, Gonzalez, 1994, Maines, 1996, Fischer & Harvey, 1999, Harvey et al., 2000), and assorted prediction tasks (Fischer, 1981, Hogarth, 1989, Rantilla & Budescu, 1999). There are several excellent reviews of this literature, notably Clemen’s (1989) annotated review on the topic of aggregating forecasting models, Flores and White's (1988) review on combination of forecasts, and Armstrong’s (2000) handbook on forecast combination. The bulk of this work can be classified into (a) normative models, which generally start from a series of assumptions or axioms and prescriptively describe a model of optimum decision making (e.g., Hogarth, 1977, Hogarth, 1978, Clemen, 1987, Clemen & Winkler, 1987, Clemen & Winkler, 1999, Lock, 1987, Winkler, 1989, McNees, 1992, Wallsten et al., 1997, Munnich et al., 1999, Wallsten & Diederich, 2000), and (b) descriptive models, which generally start from observed human behavior and search for consistent processes that describe what DMs are doing (Sniezek & Henry, 1989, Sniezek & Henry, 1990, Sniezek & Buckley, 1995, Yaniv, 1997, Rantilla & Budescu, 1999, Soll, 1999).

We think that these two approaches complement each other and, taken together, offer insight into both what human DMs can do and what they should do when aggregating expert advice. The present paper exemplifies this combined approach. We present a mathematical model that seeks to describe DM’s intuitive aggregation process in full. While previous models (normative and descriptive alike) of aggregation have focused on the actual response (the final aggregate) generated by the DM, ours is the first model that is also concerned with DM's confidence. More specifically, we attempt to relate DM's response and his/her confidence to the same factors. A brief version of the model will be presented in conjunction with empirical work that was carried out to test some of its key predictions. Before describing the actual model, we briefly discuss the role of confidence in research on the aggregation of expert opinions.

Section snippets

Accuracy and confidence in decisions

A great deal of research has been devoted to identification and comparison of various aggregation rules (normative, statistical, heuristic or intuitive) used by DMs. These comparisons have focused primarily on the effectiveness and accuracy of the rules. A general conclusion from this work is that some form of averaging is almost always nearly optimal, and it also accounts quite well for the actual observed behavior (e.g., Clemen, 1989, Wallsten et al., 1997, Fischer & Harvey, 1999, Rantilla &

Modeling the determinants of confidence

Previous normative and descriptive work has shown that simple averaging of the various inputs generally yields accurate results. It is somewhat surprising (and frustrating to some decision theorists) that this holds in the normative context (see discussions by Clemen, 1987, and Clemen & Winkler, 1999, on the comparison of various mathematical models). Our primary interest is in the fact that when human DMs are asked to combine multiple opinions they tend to average them (e.g., Anderson, 1981,

The setup and notation

We are interested in the situation in which a single DM integrates probabilistic opinions (forecasts) from J distinct experts regarding a unique target probabilistic event. Once the opinions of the experts are communicated to the DM, he/she has to combine them and (a) generate his/her best estimate of the probability of the target event, and (b) express his/her confidence in this estimate.

In the process of deriving the model we make a series of assumptions. We will distinguish between

Model of confidence

The DM assumes that when the jth expert sees the ith cue, he/she forms an overall confidence in the occurrence of the target event, $X_{ij}$, that can be expressed as

$$X_{ij} = \pi_i + e_{ij},$$

where $\pi_i$ is the expected probability of the target event, conditional on the observed cue, and $e_{ij}$ is a random variable with a 0 mean and a finite S.D., $\sigma_{ij}$. The variance of this estimate is given by

$$\sigma^2_{X_{ij}} = \sigma^2_{\pi_i} + \sigma^2_{ij} + 2\sigma_{ij,\pi_i}.$$

The first major source of variability, $\sigma^2_{\pi_i}$, is due exclusively to the imperfect probabilistic
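The variance expression above is the standard identity for the variance of a sum of two possibly correlated quantities. A minimal numerical check of that identity is sketched below; the values in `pi_values` and `errors` and the helper names are made up for illustration and are not taken from the paper.

```python
from statistics import fmean


def pvar(values):
    """Population variance of a list of numbers."""
    m = fmean(values)
    return fmean([(v - m) ** 2 for v in values])


def pcov(xs, ys):
    """Population covariance of two equally long lists."""
    mx, my = fmean(xs), fmean(ys)
    return fmean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


# Hypothetical cue-conditional probabilities (pi) and zero-mean errors (e).
pi_values = [0.2, 0.5, 0.7, 0.9]
errors = [0.05, -0.10, 0.08, -0.03]

# X = pi + e, element by element.
x_values = [p + e for p, e in zip(pi_values, errors)]

# Var(X) should equal Var(pi) + Var(e) + 2 Cov(e, pi).
lhs = pvar(x_values)
rhs = pvar(pi_values) + pvar(errors) + 2 * pcov(pi_values, errors)
print(abs(lhs - rhs) < 1e-12)
```

The identity holds for any paired values, so the check only confirms the algebra; it says nothing about the sizes of the variance components in the actual model.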

Analysis of the model

The components of the model can be classified into structural and natural factors. The first class includes all those variables that define the structure and magnitude of the decision problem but are independent of its content and the nature and quality of the advice. They are the total number of cues, N, the number of experts, J, the fraction of cues seen by each expert, g, and the fraction of pairwise overlap in the cues presented to the experts, f. For a given decision problem, most DMs can

Relating the theoretical model and the experiments

All our derivations are based on properties of the variance of the aggregate (because it is mathematically tractable), but the hypotheses will be tested using direct measures of expressed confidence. We assume that the two variables are monotonically related, but we have no good theoretical or empirical basis for making more specific assumptions about the exact mathematical form of this relation. In fact, we do not necessarily assume, nor do we seriously believe, that subjects consider the
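To illustrate why the variance of an aggregate is a convenient quantity to reason about, the sketch below uses a textbook special case rather than the paper's full model: J forecasts with equal variance and a single common pairwise correlation, combined by simple averaging. The function name, the parameter `rho` (used here as a rough stand-in for shared information among experts), and the numbers are assumptions made for illustration only.

```python
def variance_of_mean(sigma2, rho, n_experts):
    """Variance of the simple average of n_experts forecasts.

    Assumes every forecast has variance sigma2 and every pair of
    forecasts has the same correlation rho (a textbook special case,
    not the model derived in the paper).
    """
    if n_experts < 1:
        raise ValueError("n_experts must be at least 1")
    return sigma2 * (1 + (n_experts - 1) * rho) / n_experts


# Variance of the average for several panel sizes and correlation levels.
for rho in (0.0, 0.5, 1.0):
    row = [round(variance_of_mean(1.0, rho, j), 3) for j in (1, 2, 4, 8)]
    print(rho, row)
```

In this special case the variance of the average shrinks as experts are added when their forecasts are not perfectly correlated, and it does not shrink at all when they are fully redundant. If confidence is monotonically (inversely) related to this variance, only the ordinal pattern matters, not the exact values.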

The experimental paradigm

Next we report results from two studies which were designed to empirically test some implications of this model. Although the studies differ in details, they share essential basic features. Participants, in the role of DMs, considered scenarios embedded in specific and familiar contexts (business, medicine, etc.) in which a set of J (hypothetical and anonymous) experts were said to have seen N distinct cues diagnostic of the occurrence of a certain, well-defined, target event. In each case

Method

Participants: 88 undergraduate students at the University of Illinois at Urbana–Champaign (UIUC), who participated in partial fulfillment of a class requirement. One participant was excluded from the analysis because of failure to follow the instructions.

Procedure: Participants took part in the study in groups of 8–20 at a time. They were handed packets that contained a description of a scenario, which provided the context for their decisions, and the problems (each on a separate sheet). The

Results

Actual decisions: As expected, participants' responses were very close to the mean of the J forecasts. In fact, the mean difference between the two across all participants and all 36 items was 0.002. The mean item-specific deviations were symmetrically distributed between −0.003 and +0.003, and the median difference was exactly 0 in 34 of 36 cases. This indicates almost universal use of a simple averaging heuristic. There were no significant differences between the four scenarios.
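The comparison reported above (a participant's response versus the mean of the J forecasts) can be sketched as follows. This is a minimal illustration with made-up numbers; the function names and the example forecasts are hypothetical and do not come from the study materials.

```python
from statistics import fmean


def average_forecast(forecasts):
    """Simple (unweighted) average of probability forecasts."""
    if not forecasts:
        raise ValueError("at least one forecast is required")
    if any(not 0.0 <= p <= 1.0 for p in forecasts):
        raise ValueError("forecasts must be probabilities in [0, 1]")
    return fmean(forecasts)


def signed_deviation(response, forecasts):
    """Response minus the mean forecast; 0 means pure averaging."""
    return response - average_forecast(forecasts)


# Hypothetical item: three experts and one participant's response.
expert_forecasts = [0.60, 0.70, 0.80]
dm_response = 0.75

print(round(average_forecast(expert_forecasts), 3))
print(round(signed_deviation(dm_response, expert_forecasts), 3))
```

Averaging such signed deviations over participants and items gives a summary of the kind reported in the text; values near zero are what a simple averaging heuristic would produce.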

Confidence

Discussion of Study 1

We examined aggregation processes for probabilistic information supplied by various numbers of equally qualified experts who based their forecasts on equal amounts of equally diagnostic information. As expected, when experts are equally qualified and informed (have the same amount and quality of information available to them), DMs simply average the forecasts. Many of the normative approaches also advocate what are, essentially, simple averaging rules (e.g., Clemen, 1989). Furthermore, there is

Method

Participants: 73 volunteers were recruited, most of whom were undergraduate students at the UIUC. They were recruited using postings in the Psychology Building, and were paid a flat fee of $7 for their participation in the study.

Procedure: Participants took part in the study in small groups ranging in size from 2 to 12. The procedure was identical in most respects to that used in the previous study. Participants were each given a copy of a general medical scenario (Appendix A), which provided

Results

The aggregates: As in Study 1, participants' responses were very close to the (manipulated) mean forecast. The mean difference across all participants and all 72 items was 0.0015, and the distribution of differences was symmetric around 0. Clearly, participants aggregated experts' opinions by averaging them, as our model assumes.

Confidence ratings: We standardized the 7-point confidence ratings within each participant, to eliminate effects of individual differences in the use of the scale. The
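Within-participant standardization of the kind described above can be sketched as a per-participant z-score. The text does not say which standard deviation was used, so the population S.D. below is an assumption, as are the participant IDs, the example ratings, and the function name.

```python
from statistics import fmean, pstdev


def standardize_within_participant(ratings_by_participant):
    """Convert each participant's ratings to z-scores using that
    participant's own mean and (population) standard deviation."""
    standardized = {}
    for participant, ratings in ratings_by_participant.items():
        mean = fmean(ratings)
        sd = pstdev(ratings)
        if sd == 0:
            # A participant who gave one constant rating carries no
            # within-person variation; map all ratings to 0.
            standardized[participant] = [0.0 for _ in ratings]
        else:
            standardized[participant] = [(r - mean) / sd for r in ratings]
    return standardized


# Two hypothetical participants who use the 7-point scale differently.
ratings = {"p01": [2, 4, 6], "p02": [5, 6, 7]}
z_scores = standardize_within_participant(ratings)
for participant, values in z_scores.items():
    print(participant, [round(v, 3) for v in values])
```

In this example the two participants end up with the same standardized profile even though one uses a wider and lower part of the scale, which is the individual difference the standardization is meant to remove.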

Discussion of Study 2

This study was a refined replication and extension of our first experiment on aggregation of probabilistic information from multiple overlapping sources. Participants received forecasts from multiple (equally credible and qualified) experts who based their opinions on equal numbers of (equally diagnostic) cues. The two major changes were (a) adding a new condition of partial overlap (in addition to the extreme cases of redundancy and complementarity), and (b) controlling, and informing the

General discussion

This paper had two basic goals. The first was to derive a general mathematical model describing the process of aggregation of expert opinions. We modeled the aggregate and its confidence in terms of the same factors. We relied on the almost universal empirical finding that people aggregate by averaging, and focused on the confidence they express in the aggregate. We believe that our formulation of perceived variance as a proxy for DM confidence is a valid and productive way to conceptualize the

Acknowledgements

The authors would like to thank Jack Soll, Thomas Wallsten, George Wu, an anonymous reviewer and the members of Janet Sniezek's research group for their useful comments on earlier versions of the manuscript.

References (50)

  • J.A. Sniezek. The role of variable labels in cue probability learning tasks. Organizational Behavior and Human Decision Processes (1986).
  • J.A. Sniezek. Groups under uncertainty: an examination of confidence in group decision making. Organizational Behavior and Human Decision Processes (1992).
  • J.A. Sniezek et al. Cueing and cognitive conflict in judge–advisor decision making. Organizational Behavior and Human Decision Processes (1995).
  • J.A. Sniezek et al. Accuracy and confidence in group judgment. Organizational Behavior and Human Decision Processes (1989).
  • J.A. Sniezek et al. Revision, weighting, and commitment in consensus group judgment. Organizational Behavior and Human Decision Processes (1990).
  • J.A. Sniezek et al. Cue measurement scale and functional hypothesis testing in cue probability learning. Organizational Behavior and Human Decision Processes (1978).
  • J.B. Soll. Intuitive theories of information: beliefs about the value of redundancy. Cognitive Psychology (1999).
  • G. Stasser. Information salience and the discovery of hidden profiles by decision-making groups: a thought experiment. Organizational Behavior and Human Decision Processes (1992).
  • D. Trafimow et al. Perceived expertise and its effect on confidence. Organizational Behavior and Human Decision Processes (1994).
  • R.L. Winkler. Combining forecasts: a philosophical basis and some current issues. International Journal of Forecasting (1989).
  • I. Yaniv. Weighting and trimming: heuristics for aggregating judgments under uncertainty. Organizational Behavior and Human Decision Processes (1997).
  • N.H. Anderson. Information integration theory (1981).
  • Armstrong, J. S. (2000). Combining forecasts. In J. S. Armstrong, Principles of forecasting: a handbook for researchers...
  • A.H. Ashton et al. Aggregating subjective forecasts: some empirical results. Management Science (1985).
  • D.V. Budescu et al. On the importance of random error in the study of probability judgment. Part I. New theoretical developments. Journal of Behavioral Decision Making (1997).

1 The work was supported by National Science Foundation Grant No. SBR-9632448.

2 The work was supported by a National Science Foundation Graduate Research Fellowship.