Introduction

The isolated population of Iceland has a well-characterized history starting 1100 years ago when the Icelandic population was founded by emigrants from Scandinavia, probably mostly from Norway but with some significant contributions from the British Isles.1 Its modest population size of 300 000 individuals, combined with very good historical records, makes Iceland a population of choice for the location of disease-causing genes using either linkage mapping or association mapping, the latter also termed linkage disequilibrium (LD) mapping.2 LD is defined as the statistical association between alleles at different loci in the genome. Such associations can be generated by some forms of epistatic selection or chance alone through the interplay of mutation and drift.3 Amounts of LD within an isolated population are modulated mainly by the intensity of genetic drift, through the effective size of the population, and the per generation recombination rate between loci.4, 5 Patterns of LD in a sample of linked loci can thus be used to estimate the absolute effective size of a population provided independent information on the recombination rates between loci is available.6, 7, 8 Effective population size, hereafter Ne, is a central parameter in population genetics as it governs the magnitude of drift and thus, among other things, LD patterns. In the context of association mapping, knowledge of Ne is needed to predict genome-wide patterns of LD, the density of single-nucleotide polymorphisms (SNPs) needed to cover a genomic region, and to model the power of LD-based association mapping methods to detect polymorphism-causing phenotypic variation for traits/disease of interest.5

Here we survey LD patterns in 12 genomic regions, spanning 15 cM each, using sets of linked microsatellites genotyped in a random sample of the contemporary Icelandic population. We characterize LD within each region and find that the loci exhibiting significant LD can be found as distant as 4 cM suggesting a reduced effective population size. We then use extensive simulations to compare observed LD patterns with LD patterns simulated under various demographic scenarios with varying population size and recombination rates. This allows us to obtain point estimates of (1) the scaled recombination rate and (2) the long-term Ne in the Icelandic population.

Materials and methods

Sampling of individuals and microsatellite genotyping

The study population consisted of 1753 individuals (962 males and 791 females) who had participated in the Icelandic Cancer Project as controls.2 Twelve genomic regions were selected based on the criteria that they were gene-rich, approximately 15 cM in size and of high sequence quality, that is, >90% of the sequence was in the finished state and contained no gaps. For convenience, the regions were named B, C, E, F, G, H, I, K, L, N, P and Q. Regions C, E and H were genotyped in a total of 1753 individuals, whereas the remaining regions were genotyped in a subset of 1400 individuals. For each of the genomic regions, 14–16 markers were selected, spaced at an interval as close to 1 cM as possible. The genetic markers were selected for high heterozygosity of dinucleotide repeats. When possible, available dinucleotide Marshfield markers of high heterozygosity were used. The heterozygosity detected in our samples ranged from 0.44 to 0.96, and was on average 0.77. A complete list of genetic regions, marker locations and marker heterozygosity is presented in Supplementary Table 1.

Multiplex PCR was performed on the samples using unlabelled, marker-specific primers with a common oligonucleotide tail sequence, followed by a second PCR with fluorescent-labelled primers that anneal to the tails. Each multiplex PCR reaction contained 2–4 markers. Three such reactions were pooled and run on a MegaBace 1000 automated DNA sequencer (Amersham Biosciences). MegaBace Genetic Profiler software was used for fragment sizing and genotype calling. Two researchers checked all genotyping scores individually and a third person scored discrepancies between the two. Median success per marker was 98% (84.7–99.6); 2.8% of genotypes were unreadable.

LD analysis

LD was analyzed using the composite measure of genotypic disequilibria devised by Weir9 (see Schaid10 for a recent account). This measure, hereafter denoted by Δ, is defined for a pair of loci with arbitrary number of alleles, is insensitive to deviations from Hardy–Weinberg equilibrium (hereafter HWE) and can be used with unphased data. Δ was calculated after discarding, at each locus, alleles with frequency lower than 5%. For each pair of loci, A and B, each with, respectively, nA and nB alleles with frequency greater than 5%, Δ was calculated using the counting scheme suggested by Schaid.10 Significance of the composite LD was calculated for each pair by computing the χ2 statistic from the counts of the nA*nB contingency table and assuming that, under the null hypothesis of absence of LD, χ2 follows a χ2 distribution with (nA−1)(nB−1) degrees of freedom. This approximation is very accurate provided that all cells in the contingency table have five observations or more, an assumption that was met in virtually all cases because of the large sample size (1753 individuals) and the 5% threshold imposed on allele frequencies. The counting scheme and significance testing were first implemented in Mathematica and then as a C routine (see below for the analysis of simulated data). The significance testing procedure was checked by analyzing the contingency tables obtained for 20 pairs of loci from all regions both in Mathematica and using the statistical software R. For these loci, we checked (1) that the P-values obtained in R under the χ2 approximation, using the chisq.test function, were identical to the ones computed in Mathematica and (2) that these P-values were accurate by comparing them with exact P-values computed under the null hypothesis of no-linkage disequilibrium using the function fisher.test implemented in R.

The individual P-values obtained for all 12 regions were pooled (n=1235 pairs) and the distribution of P-values was used to estimate the so-called q-value, the frequency of false positives among the pairs declared as ‘significant’ at a given individual threshold, using the method of Storey and Tibshirani,11 as implemented in the R routine ‘q-value’. Examination of the graph depicting the relation between individual P-values and corresponding q-values suggests that retaining pairs of loci with associated P-values of 1% will yields less than 10% of false positives over all pairs (data not shown).

A logistic regression was used to model the decrease in the proportion of pairs exhibiting significant LD with genetic distance within each region. We used the genetic distance (in cM) between microsatellites from the recombination map reported by Kong et al.12 We binned our data in seven classes of distance between loci: from 0 to 1, 1–2, 2–3, 3–4, 4–5 cM, from 5 to 10 cM and distances larger than 10 cM. For convenience, these seven categories are hereafter denoted in tables and figures as, respectively, ‘1 cM’, ‘2 cM’, ‘3 cM’, ‘4 cM’, ‘5 cM’, ‘>5 cM’ and ‘> 10cM’. The proportion of significant pairs (at the 5% level), pij, found within region i at distance Dj was modeled as Zij=μ+Ri+Dj, where Zij=log (pij/(1−pij)), μ is the intercept of the regression, Ri is the effect associated with region i and Dj the effect of distance between pairs of loci (in cM). The analysis was performed in R using the binomial family and a logit link function.

Simulated data sets

Data were simulated under scenarios with variable amounts of recombination and exponential population growth in the population history using a coalescence approach. Samples drawn from populations were simulated using the ancestral recombination graph approach implemented in the software CoaSim version 4.0,13 an extensible coalescence simulation tool (http://www.birc.dk/Software/CoaSim/).

The following parameters were varied in the coalescent simulations:

ρ=4NerL, the scaled recombination rate for a chromosomal region of length L nucleotides recombining at a rate r per nucleotide and per generation, varied in {500, 1000, 2000, 3000, 4000, 4500, 5000, 10000}.

θ=4Neμ, the scaled mutation rate of the microsatellite markers simulated on the chromosome region, was set to either 1.5 or 3. All microsatellites were assigned the same scaled mutation rate within a given simulation. To check the influence of the model of mutation on LD patterns, simulations were run under two alternative mutation models commonly used for modeling microsatellite: the K-alleles model and the stepwise mutation model. A K-alleles mutation model with K=50 different alleles was assumed throughout the study. The K-allele mutation model is directly implemented in CoaSim, but the stepwise mutation model was implemented using the scheme-based extension language of CoaSim.

Population growth was simulated through the parameter β=2Neb, where b is the (exponential) growth rate per generation and Ne is the effective population size. β was set to 0 (stable population) 10, and 100. For β>0, the value of the parameter θ was adjusted so that the expected number of alleles per locus would be the same as under β=0. This was carried out by first obtaining, through repeated simulation (104 independent simulations), the mean length of the whole genealogy of a sample drawn from a population with growth rate β relative to that of a population with growth rate β=0 (the latter being known analytically14).

For each combination of parameters (ρ, θ, β) and each mutation model, 100 independent data sets consisting of 10 regions of length ρ with 15 microsatellite loci per region were simulated. For all simulations, a sample size of 2600 haploid chromosomes (pooled at random to generate 1300 diploid individuals) was used. This sample size corresponds to the minimum of genotypic data available across all regions (Table 1). The rationale for choosing this sample size was to run stochastic simulations with the biggest sample size still compatible with the data observed in all regions genotyped.

Table 1 Number of pairs of loci exhibiting significant LD within each region

Inference of demography and evolutionary parameters from LD patterns

LD patterns in simulated data were summarized by computing the proportion of pairs of loci exhibiting significant LD at seven different classes of genetic distance (expressed as fractions of ρ) using the same binning scheme as the one adopted for the observed data. The distance between a single simulated data set and the observed data was calculated as

where piobs (and pisim) is the observed (and simulated) proportion of pairs of loci in significant LD within the class of genetic distance i. For each combination of parameters (ρ, θ, β), a mean distance was calculated by averaging over a hundred simulated data sets. The combination (ρ*, θ*, β*) yielding the lowest mean distance was retained as fitting the data best.

Results

Overall patterns of LD in the data

A total of 1753 individuals were genotyped at 179 loci. The mean amount of missing data per locus ranged from 1.7 (Region B) to 3.6% (Region C), yielding a minimum of 1285 individuals effectively genotyped for assessing LD within each region (Table 1). LD patterns between pairs of loci in each region are summarized in Table 1. Hereafter, we consider a pair of loci exhibits significant LD when the χ2 test statistic has an associated P-value that is lower than 5%. The proportion of pairs of loci exhibiting significant LD decreased with genetic distance, although there was considerable variation within a given distance (Figure 1). A logistic regression analysis confirmed that the genetic distance between loci (in cM) explained a significant amount of the variation in the proportion of pairs exhibiting significant LD at the 5% level (P<10−16). Regions E and G displayed significant heterogeneity in their patterns of LD relative to the remaining regions (Supplementary Figure 1). Region E (respectively G) had a higher (respectively lower) average proportion of pairs of loci exhibiting significant LD within the first 4 cM. When these two regions were discarded from the logistic regression, the decrease in the proportion of pairs exhibiting LD with genetic distance did not exhibit further heterogeneity between regions (P>0.05), whereas the effect of genetic distance was still highly significant (P<10−15).

Figure 1
figure 1

Decrease of LD with genetic distance (n=1235 pairs, five pairs with P-values lower than 10−20 are not shown in the graph but were used in the fitting procedure). X-axis: genetic distance between loci of a pair in centiMorgans (cM). Y-axis: LD measured for each pair as minus the natural logarithm of P-values. The horizontal dotted line denotes the 5% threshold for individual P-values (−Log(0.05)=3).

Region G exhibited the second highest ratio of genetic (cM) to physical (Mb) among the 12 regions studied (Table 2). However, the differences among regions revealed by the logistic regression cannot be explained by difference in sample size between regions E and G or by differences in the average cM per Mb among regions (not shown). Note that regions genotyped were precisely chosen to be roughly 1 cM per Mb (0.83–1.54 cM per Mb). Inspection of the fine scale variation in local scaled recombination rates estimated in these regions using the HapMap data recently released15 confirms that Region G has one the highest average cM per Mb among the regions we studied (Table 2) and may harbor a couple of hot spot in the middle of the region that may explain the faster decay of LD among pairs (Supplementary Figure S2). Inspection of the sequence properties of regions E and G in the genome browser available at Ensembl (http://www.ensembl.org/) did not reveal striking features for either GC content, gene density or SNP density (data not shown).

Table 2 Properties of the studied regions

Estimation of demographic and evolutionary parameters

We discarded data from Regions E and G and used all data from the 10 remaining genomic regions to estimate population parameters (ρ, β, θ). The rationale for doing so was to only pool data from regions where patterns of LD were as similar as possible to yield a meaningful average. Observed data from the remaining 10 genomic regions were pooled and patterns of LD were summarized by computing the proportion of pairs of loci exhibiting significant LD as a function of genetic distance (Figure 2). This summary of LD patterns was subsequently used, using equation (1), to compute the distance d between the observed data and a simulated data set obtained under a given (ρ, β, θ) combination. Given that we discarded a region of low LD and a region of high LD, doing so does not affect the overall patterns of decrease of LD as summarized by the proportion of pairs exhibiting significant LD within each class of distance. In fact, keeping these two regions in the analysis would result in virtually identical proportions of pairs with significant LD in the various classes of distance. For each parameter (ρ, β, θ) combination, the distance between the observed data and each of the 100 simulated data set was computed. Examples of the distribution of distances between observed and data simulated under four different scenarios are given Figure 3. In all cases examined, the distributions were roughly normal (Figure 3) and could be adequately characterized by their mean and variance (not shown). The mean distance between data simulated under a given scenario and the observed data was used as a measure of fit (Figure 4). Data sets simulated under scenarios involving sharp exponential growth (β=1000) yielded data sets with very few polymorphic pairs of loci (at the 5% threshold), making it difficult to compare the simulation to the observed data. Data sets simulated under the condition β=1000 had a large distance to the observed data (not shown). These are not examined further here. For a given (θ, β) value, the mean distance between simulation and data had a single minimum when varying ρ, the lowest mean distance being achieved for β=0 (Figure 4). The value of ρ* minimizing the distance between data and simulation was somewhat sensitive to θ and the mutation model chosen (compare panels of Figure 4). However, for a given θ, the value of ρ minimizing distance was nearly insensitive to β. Simulation of data under a constant effective size or moderate exponential growth (β varying from 0 to 10) suggests simulations obtained under the assumption that a region of 15 cM corresponds to a ρ of 2500–3500 best fit the data (Figure 4). A point estimate of ρ*=3000, together with the assumption of r=10−8 (1 cM equals 1 Mb), yields an effective population size of the Icelandic population of Ne=3000/(4*0.15)=5000 individuals.

Figure 2
figure 2

Proportion of pairs of loci exhibiting significant LD. Data from all regions (but excluding Regions E and G) were pooled and binned in seven classes of genetic distance (n=1039 pairs).

Figure 3
figure 3

Distribution of distance d between the observed data and simulated data sets obtained under four alternative scenarios (ρ, θ, β). All simulations were run under the K-alleles model (K=50). For each simulated data set, the distance d to the observed data was calculated using equation (1). Each empirical distribution is based on 100 simulated data sets and therefore comprises 100 counts.

Figure 4
figure 4

Mean distance between observed and simulated data as a function of ρ. Plots of mean distance are given for data simulated under scenarios with various degrees of exponential growth: β=0 (open circles), β=10 (gray circles) and β=100 (black circles). Each dot represent the distance averaged over 100 simulated data sets. Top panels: simulations assuming a stepwise mutation model with θ =1.5 (a) or θ=3 (b). Bottom panels: simulations run under the K-alleles model (K=50) with θ=1.5 (c) or θ=3 (d).

Expected extent of LD between SNP markers

Assuming that a region of 15 cM corresponds to ρ=3000, we simulated a set of 100 evenly spaced SNP markers with minor allele frequencies of 5% and investigated the LD between these. Clearly, SNP markers have lower power to detect significant LD. However, with the large sample sizes used here (1300 diploid individuals simulated), even SNPs may exhibit more LD than expected by chance alone at a range of 3–4 cM in the Icelandic population (Figure 5). The range of detectable LD (as measured by the range of distance where the proportion of pairs of SNPs with significant LD as higher than expected by chance alone) in our simulations extends further than previous simulation studies.16 This is easily explained by the lower effective size we simulated, the fact that other studies have used a different criteria such as the half-life of LD decay16 to characterize the range of LD, as well as the large sample size (1300 individuals) we used. Given the large sample size simulated here, many pairs of loci will display significant LD along a region albeit with values of r2 probably much lower than what is needed to ‘tag’ efficiently these regions.17

Figure 5
figure 5

Decrease of the proportion of pairs in significant LD with distance: comparing simulated microsatellites and SNP data. (a) Data comprises 10 genomic regions of length ρ and 100 SNPs per region (n=49 500 pairs in total). Simulations were based on the following parameters ρ=4000, β=0. (b) Data comprises 10 genomic regions of length ρ with 15 microsatellites per region (n=12,00 pairs). Simulations were based on the following parameters: K-alleles mutation model (K=50), ρ=4000, β=0, θ =1.5. (c) Microsatellite simulated data set. Simulations were based on the following parameters: K-alleles mutation model (K=50), ρ=4000, β=0 and θ=3.

Discussion

Patterns of observed LD and prospects for association mapping

We have studied a large data set consisting solely of unphased data. It is experimentally difficult and very time consuming to obtain phased data, especially for markers that are separated by more than 10 kb.18 Here, we show that, instead of inferring haplotypes and then analyzing these as if they were true haplotypes, the direct use of unphased measures of LD still allows us to extract substantial information about the patterns of LD and its decrease with distance. Our study validates recent predictions obtained through a Monte Carlo simulation study.19 Furthermore, the use of unphased measures of LD, such as Weir's Δ, allows making inferences that are robust to deviations from HWE within the population studied. The hypothesis of HWE is often needed for inferring haplotypes and can be violated even in seemingly homogeneous populations such as the Icelandic one that still may exhibit mild population substructure.20 Altogether, this a priori bodes well of the use of large quantity of unphased data to achieve genome-wide coverage.

Our study shows that the decay of LD with distance in Iceland can be approximated by a simple model of a panmictic population of constant size. Microsatellites reveal more significant LD than expected by chance alone over distances of 4 cM (on average 3 Mb), where 35% of pairs (164/471) exhibit significant LD. There is no apparent sign of a strong influence of population structure in the population sample used here, as we see virtually no cases (four among 1235 pairs examined, Table 1) of significant LD at ranges larger than 5 cM despite numerous comparisons at this range. A question that remains is whether the mere existence of significant LD is sufficient for successful LD mapping. Ultimately, the power to detect an unknown susceptibility variant through a marker is dependent on the phenotypic effect of the true variant times the amount of LD between the marker and the causative variant measured through the r2 measure of LD.14 As data on SNPs accumulate rapidly in the Icelandic population, a natural extension of this work will be to (1) contrast predicted patterns of LD between SNPs based on our very simple population model calibrated using our microsatellite data sets and (2) investigate which additional features need to be added to this model (demography, variation in recombination rates) to account for the features of the patterns of LD observed using SNPs. Calibrating a simulation model that remain parsimonious in the number of parameters, while capturing the crucial features of LD patterns, will be an important tool to conduct realistic simulations to predict the power of alternative association mapping strategies (see Schaffner et al5 for a recent example using the data from the Phase 1 of the HapMap project).

Estimation of scaled recombination rates and effective size in the Icelandic population

We have studied patterns of LD in a large sample of Icelandic individuals and used patterns of LD to estimate an effective size for the Icelandic population. We found that simulation of the patterns of LD under a simple model assuming a constant population size through time could fit the observed patterns of LD in our sample fairly well. Based on (1) the ρ*-values (2500–3500) that yielded the best fit to the data, and (2) assuming a rate of recombination of 1 cM per Mb – a reasonable assumption for the regions surveyed here when comparing physical and genetic maps (Table 2) – and therefore setting r to 10−8, we estimated an effective size of approximately 5000 individuals. Using our minimum distance scheme, we only obtain a point estimate of ρ. By inspecting the distribution of distances, we can bracket a set of parametric values of ρ that yields distances that are not significantly greater than the minimum observed. This leads to a rough confidence interval around ρ* of (2000–5000). Note that this confidence interval is somewhat sensitive to values assumed for θ and the underlying mutation model of microsatellites.

A number of studies using microsatellites for studying the patterns of LD in other isolated European populations, such as the Faroe Islands21 and various isolates within the Finnish22 and Swedish populations,23 are potentially available for comparison. Unfortunately, no estimates of the scaled recombination rate, ρ, were reported in those studies. Comparison across studies is then very difficult as the differences in the range of LD detected can be due to differences in the scaled recombination rate (which includes differences in true recombination rates as well as differences in the realized effective size of these isolates) but also to vast differences in the sample size used in these studies and ours.

Mean estimates of ρ averaged on 1 Mb windows, a scale comparable to our study, obtained using a composite likelihood estimator of ρ, ρCL (Hudson, 2001) and the SNP data available for three populations involved in the Hap Map project, were recently reported.24, 25 Our point estimate of ρ*=200 (130–330) per Mb yields an estimate of ρ*=0.0002 per bp (0.00013–0.00033). Our estimate falls directly in the range of mean estimates for ρ reported in the European–Americans (Utah residents with ancestry from northern and Western Europe, ρCL=0.000192 per bp), African–Americans (ρCL=0.000405) and Han Chinese (from the Los Angeles area, ρCL=0.000200). Interestingly, these estimates were based on studies using different markers, different estimation procedure and different strategies of genotyping (150 microsatellite markers on 1300 individuals in our case versus data from only 23 individuals but genotyped for 1 586 383 SNPs).

A number of cautionary remarks are needed before discussing the implications of this result. We have simulated data under simple demographic scenarios to investigate whether such simple models could match observed LD patterns. This by no means implies that the complex demographic history of Iceland should be viewed as constant through time or even adequately approximated by the simple exponential growth scenario we also used. The mere existence of an effective size adequately capturing the rate of coalescence of genes in a real population that has undergone fluctuating sizes in the past can be questioned.26 Austerlitz and Heyer,27 studying the isolated population of Saguenay-Lac Saint Jean (Québec), also emphasized that beyond the effects of variable population size and overlapping generations, the transmission of reproductive success from parents to offspring could dramatically lower effective size relative to the census size. In that regard, the Wright–Fisher model, upon which the whole concept of effective size is based, may not be adequately modeling genetic drift in the Icelandic population. However, we feel that fitting observed patterns of LD using a simple Wright–Fisher model allows us to obtain an effective size estimate that can in turn be used to make some testable predictions about patterns of LD in other genomic regions or different markers such as SNPs.

Previous estimates of effective size estimates in human population relied on different methods and were mostly based on levels of polymorphism. Levels of polymorphism at a single locus can be used to estimate the product of effective population size and mutation rate (at that locus). If one assumes a mutation rate for the sequence or the marker used, one can obtain an estimate of the (long-term) effective population size. Harpending et al28 reviewed these methods and, with the data available at that time, they suggested a small effective population size of the entire human population, on the order of 10 000 breeding individuals, due to reduced demographic population size in the Pleistocene period. Sherry et al29 used the observed distribution of sample frequencies of 13 dimorphic Alu elements. They used coalescence theory to compute the expected total genealogies branch lengths for monomorphic and dimorphic elements, leading to an estimate of human effective population size of 18 000 during the last one to two million years. In that respect, the estimate we obtain here is very large given that Iceland represents a rather small fraction of the contemporary human genetic diversity. One possibility is that estimates of long-term effective size based on LD are based on relatively recent coalescence time. Hayes et al6 obtained similar results using simulation with phased data and argued that one could actually obtain estimates of Ne at different times in the past by using pairs of loci with different amounts of recombination (the tighter the linkage between markers used, the older the time back in time). As a consequence, our estimation method yields estimates of Ne that are less affected by ancient demographic events and thus much higher than estimates based on levels of polymorphism alone.

In conclusion, we have shown that using LD measures requiring only unphased genotypic data from linked loci still convey substantial information about effective recombination rates and past Ne. We have illustrated this method by analyzing a large sample of individuals and estimating the scaled recombination rate in the Icelandic population. We found that the data were broadly consistent with an estimated scaled recombination rate of ρ200 per cM. This estimate will be of importance for predicting, through simulations, the power of future association mapping studies using the Icelandic population. Last, our approach for the estimation of ρ from LD patterns can be used in populations where deviations from HWE proportions occur due to either nonrandom mating or hidden population structure. As such, it should be widely applicable not only to analyze LD in a variety of human populations but also in other species, such as numerous plant species that reproduce through partial self fertilization, where the HWE assumption is untenable and phased data is still difficult and time consuming to obtain.