Main

Over the past 4 years, genome-wide association studies (GWA studies) have become a powerful tool for investigating the genetic basis of common diseases, and important new findings now emerge almost every week1,2,3. But so far there has been limited attention to health problems in Africa, such as the massive burden of infectious disease and the increasing prevalence of chronic diseases associated with changes in lifestyle4,5,6. By elucidating the molecular mechanisms that underlie resistance and susceptibility to disease, GWA studies in Africa might provide important insights into the development of more effective vaccines, therapeutics and public health interventions. As Africa is the ancestral home of all human populations, understanding the biology of disease in Africa could shed light on the genetic origins of common diseases worldwide7,8,9.

African populations are genetically more diverse than European and Asian populations10,11,12. According to the out-of-Africa hypothesis of human origins, this is because groups migrating out of Africa experienced severe population bottlenecks, resulting in a reduction of genetic diversity in descendant populations8,13. A reduction in nucleotide diversity outside Africa has been consistently observed in genotype and resequencing data8; similarly, levels of haplotype diversity tend to decrease and linkage disequilibrium (LD) tends to increase according to the geographic distance of a population from Africa11,12.

From a statistical genetic perspective, the high levels of haplotype diversity and low levels of LD in African populations have both advantages and disadvantages for genome-wide analysis. High levels of haplotype diversity are potentially a powerful tool for fine mapping the causal variants that underlie disease associations14,15,16. However, low levels of LD are disadvantageous when screening the genome for disease associations using current SNP-genotyping approaches, which essentially rely on the principle of LD mapping17,18. The fundamental question underlying this Review is how to develop an appropriate methodology for GWA analysis in Africa that overcomes the difficulties of genome-wide screening for association and exploits the potential for fine mapping causal variants.

There is a growing body of data to address this problem. The first GWA study from Africa has recently been published19, and others are close to completion. The International HapMap Project has generated a large catalogue of SNP allele frequencies and haplotypes in the Yoruba people of Nigeria15,20, and this effort has recently been expanded to include the Luhya and the Maasai people from Kenya. There has recently been remarkable progress in our understanding of genome sequence variation in Africa, starting with systematic resequencing of specific genomic regions21,22, followed by next-generation sequencing of the entire genome of an African individual23, leading on to the 1000 Genomes Project. This large international endeavour will include whole-genome sequence data for several hundred people in different parts of Africa, and data have already emerged for the Yoruba group.

Here, we examine the implications of these and other recent findings for the design and analysis of GWA studies in Africa. We begin by outlining the practical importance of conducting GWA studies in Africa, and then examine in some detail the analytical problems that can arise at each stage of a conventional GWA study as a result of low LD and population structure (Box 1). We go on to consider how new approaches in statistical imputation and large-scale genome sequencing will help to overcome these problems and to accelerate the identification of causal variants. Finally, we briefly discuss the practical requirements for effective GWA studies in Africa, the need for attention to the ethical issues that arise in a resource-poor setting, and what is being done to build local research capacity in this area.

Why GWA studies in Africa are needed

Most GWA studies are motivated by a desire to understand the underlying causes of disease at both the molecular and the environmental level. Although it is becoming apparent that the current generation of GWA studies will provide only a partial understanding of the genetic architecture of common diseases, they at least provide a foundation for systematic investigation of the problem, including the complex question of how disease risk is affected by gene–environment interactions24,25. In high-income countries, GWA studies have provided many new leads for medical research on common diseases: for example, the discovery that common variants of the FTO gene are associated with risk of obesity — one of the first major findings from a GWA study — has led to new insights into the determinants of eating behaviour26,27,28,29. It is important that Africa should not be excluded from this new research agenda. Here, we briefly outline two major public health problems that are of particular importance for GWA research in Africa: infectious diseases and the rising prevalence of chronic non-communicable diseases. In later sections we consider the potential importance of GWA studies in Africa for fine mapping the causal genetic variants that underlie common diseases found throughout the world.

Infectious disease. Over 10% of children in sub-Saharan Africa die before the age of 5 (compared with <1% in high-income countries), primarily due to infectious diseases, such as malaria, respiratory infections and diarrhoea4. AIDS is a major cause of death in young adults, and tuberculosis affects all age groups5. The difficulty of developing effective vaccines against malaria, AIDS and tuberculosis is a strong incentive for conducting GWA studies to discover natural mechanisms of resistance to infection, which has a significant genetic component30,31,32,33. A classic example of how human genetic discoveries may translate into leads for vaccine development is given by a series of discoveries concerning Plasmodium vivax, a species of malaria parasite that causes much morbidity in the tropics. P. vivax infection is remarkably rare in sub-Saharan Africa, and over 30 years ago it was discovered that this is because most Africans lack the Duffy blood group, which is essential for erythrocyte invasion by P. vivax34. This lack is now known to be caused by a regulatory SNP in the Duffy blood group, chemokine receptor (DARC) gene; this discovery led to the molecular characterization of a crucial parasite protein that binds to the erythrocyte Duffy receptor, which in turn led to the development of a candidate vaccine against P. vivax35. Similarly useful genetic discoveries for Plasmodium falciparum, which is responsible for most malaria deaths, could revolutionize malaria vaccine development. Two further observations support this approach: malaria has been a strong force for selection on the human genome, and only a small fraction of host genetic resistance to malaria is explained by known factors, such as sickle haemoglobin, which implies that many genetic factors remain to be discovered36. 
A global partnership of malaria researchers, the Malaria Genomic Epidemiology Network (MalariaGEN), has been established to conduct multi-centre genetic-association studies of resistance to malaria, and initial GWA data from The Gambia have been reported19,37. An equally strong case can be made for genetic studies of tuberculosis, HIV/AIDS, invasive bacterial disease and other major infections31,32,33. GWA studies of tuberculosis have recently been completed in The Gambia and Ghana (A. V. Hill and R. D. Horstmann, personal communication), and a GWA study of bacteraemia in Kenya is currently being conducted by the Wellcome Trust Case Control Consortium.

Gene–environment interactions and chronic diseases. Changes in lifestyle in Africa are causing a rapidly rising prevalence of chronic non-communicable diseases, such as hypertension and diabetes38,39. For example, the number of people in Africa with diabetes is estimated to rise from 10 million in 2006 to more than 18 million in 2025, and chronic diseases in general are expected to account for more than a quarter of deaths by 2015 (Ref. 39). The importance of genetic factors is highlighted by the observation that, among residents of high-income countries, there is higher prevalence of hypertension, diabetes and obesity in people of African ancestry than in those of European ancestry40,41. Importantly, the prevalence of these diseases in people of African ancestry is much higher for those who reside in high-income countries42,43.

One approach being used to investigate the genetic basis of these differences is admixture mapping in groups of mixed ancestry, for example, African-Americans44,45,46,47,48. Admixture mapping — which has the advantage of requiring few genetic markers and the concomitant disadvantage that it localizes the genetic signal imprecisely — has led to the discovery of a number of important genetic loci49,50. However, it is important that genetic studies are conducted in Africa itself, both to ensure relevance to the health problems that are encountered there and to take account of the great diversity of environments, which range from forest to urban, and of habitation, sanitation, diet, physical activity and other aspects of lifestyle. One of the most important environmental variables is the level of exposure to infection — for example, the prevalences of malaria, HIV/AIDS and helminthic infection vary widely across the continent.

One epidemiologically relevant change that is now occurring in Africa is a tendency for migration from rural to urban areas, which typically have a lower prevalence of parasitic diseases, such as malaria, but a higher prevalence of chronic non-communicable diseases, such as hypertension51. An important recent advance is a GWA study of hypertension in African-Americans, which identified several loci associated with systolic blood pressure that were replicated in a sample from West Africa52. There is a clear need to establish well-defined cohorts — such as the Africa America Diabetes Mellitus study, which involves five centres in Nigeria and Ghana — to investigate how genetic factors, and their interaction with the environment, contribute to the rapidly rising prevalence of hypertension, diabetes and other chronic diseases in Africa itself53,54,55,56.

GWA by LD mapping

Current methods for GWA analysis are based on the principle of LD mapping17,18. This relies on a sufficiently high level of LD to screen for genetic associations across the whole genome by typing a subset of variants15,20,23,57. GWA by LD mapping has three main stages of analysis, starting with genome-wide screening for associations, followed by replicating associations and then fine mapping of causal variants. In this section we summarize the different methodological issues that arise at each of these stages of GWA analysis in Africa.
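As a rough illustration of what "level of LD" means in practice, the sketch below computes r², a commonly used pairwise LD statistic, from haplotype and allele frequencies. The function name and the frequencies are invented for illustration and are not taken from any data set discussed in this Review.

```python
def ld_r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """Return r^2 between two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (locus 1) and allele B (locus 2)
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    return d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))


# Made-up frequencies: the two alleles always travel together (strong LD) ...
print(ld_r_squared(p_ab=0.30, p_a=0.30, p_b=0.30))
# ... versus the same allele frequencies with only weak co-occurrence (low LD).
print(ld_r_squared(p_ab=0.12, p_a=0.30, p_b=0.30))
```

When r² between a genotyped SNP and an untyped causal variant is close to 1, the genotyped SNP is a good proxy; when it is close to 0, the causal variant is effectively invisible to the screen.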

Stage 1: Genome-wide screening for associations. At the first stage of GWA analysis, the aim is to screen the genome for regions that are associated with the disease or phenotype of interest. For example, a landmark study conducted by the Wellcome Trust Case Control Consortium in the UK population involved 17,000 individuals (2,000 cases for each of 7 common diseases and 3,000 population controls), and 0.5 million SNPs were genotyped in each individual58. To reduce the number of false-positive associations arising from multiple testing, it is necessary to impose a rigorous threshold for statistical significance that takes account of the large number of SNPs that have been genotyped. There are different ways of arriving at this genome-wide significance threshold58,59,60,61,62, but typically it is set somewhere between p < 10⁻⁷ and p < 10⁻⁸. This threshold is more difficult to achieve in African populations than in European or Asian populations because of the lower levels of LD, and authentic loci with strong genetic effects may fail to reach genome-wide significance because of weak LD between causal variants and the SNPs that are genotyped19. In the following sections we consider how GWA signals might be boosted in African populations by improved SNP-genotyping platforms and by multipoint imputation from population-specific sequencing data.
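One simple way to see where a threshold of this order comes from is a Bonferroni correction, which divides the desired family-wise error rate by the number of tests. This is only one of the approaches alluded to above, and it assumes independent tests; the function name and the 0.05 error rate are illustrative assumptions.

```python
def bonferroni_threshold(n_tests: int, family_wise_alpha: float = 0.05) -> float:
    """Per-test p-value threshold under a Bonferroni correction.

    Assumes the n_tests are independent, which is conservative when
    neighbouring SNPs are correlated through LD.
    """
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return family_wise_alpha / n_tests


# Illustrative numbers of tested SNPs.
for n_snps in (500_000, 1_000_000, 5_000_000):
    print(f"{n_snps:>9,d} tests -> p < {bonferroni_threshold(n_snps):.1e}")
```

With half a million to a few million tests, the corrected threshold falls in the same range as the values quoted above.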

Stage 2: Replicating associations. The second stage of GWA analysis aims to exclude false-positive associations due to systematic biases in genotyping and sampling by replicating associations in independent studies59. In European and Asian populations there has been considerable success in replicating GWA signals in large multi-centre studies across different locations26,63,64, but in Africa there is a greater likelihood that authentic signals of association will fail to replicate across different locations because of high levels of population structure. A particular problem is that multi-centre replication is less likely to succeed when there is variation among locations in the level of LD between the causal variant and the SNPs that are genotyped19,65,66.

Below, we discuss in more detail the problem that this LD variation and population structure creates for GWA studies in Africa, and how it might be addressed by using imputation to identify potential causal variants before attempting to replicate associations across different locations.

Stage 3: Fine mapping of causal variants. The final stage of GWA analysis is to make a high-resolution genetic map of those regions of the genome with replicable signals of association, with the aim of localizing the causal variants. How to achieve this remains open to debate, because so far there has been very limited success in identifying the causal variants responsible for GWA signals in European and Asian populations. Broadly speaking, fine mapping involves systematic resequencing of the genomic region of interest to identify all common variants, which are then tested for disease association using the largest possible sample size67. After the completion of the first tranche of large GWA studies in 2007 revealed novel genomic regions of association for several common diseases in European populations, there were initial hopes that this would shortly be followed by the identification of the causal variants. However, despite considerable efforts by several large research groups and consortia, progress has been slow, the fundamental problem being that high levels of LD make it difficult to distinguish causal variants from neighbouring non-functional variants. This has led to growing interest in trans-ethnic studies, which aim to increase the resolution of fine mapping by enlarging the haplotypic diversity of the sample. Studies in Africa could be of particular value in fine mapping because of the low levels of LD found in individual populations and because different populations in Africa have different patterns of LD14,15,16,68,69.

In the following sections, we discuss in more detail the challenges of dealing with high levels of genome variation and population structure, and go on to consider how these challenges might be overcome as new GWA methodologies are developed, particularly those that are based on large-scale genome sequencing.

Dealing with high levels of genome variation

How many SNPs should be genotyped? What is the optimum number of SNPs to genotype at the first stage of a GWA study in Africa, and which are the best SNPs to include in this genotyping set? More specifically, what is the optimum genotyping set for a given population, and how much does this need to be enlarged to cover other ethnic groups and geographical locations, given the great genetic diversity of African populations10,11,12? Although there is no clear answer to these questions at present, data soon to emerge from the 1000 Genomes Project on different populations in Africa will greatly increase our understanding of this problem. Our current understanding relies primarily on HapMap data on 90 Yoruban individuals from Ibadan in Nigeria, who were genotyped for 3.4 million SNPs15,20. A crude estimate from the initial HapMap publication, based on the concept of tagging SNPs, was that a GWA study of 1.5 million SNPs in an African population would have approximately the same statistical power as a study of 0.6 million SNPs in a European population15. In practical terms, the current commercial SNP-genotyping platforms, most of which have been designed based on HapMap data, provide considerably lower levels of genome coverage in Africa than in Europe or Asia, and this translates to a lower power to detect a GWA signal that achieves the genome-wide significance threshold11,70,71,72.
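The link between coverage and power can be made concrete with a widely used rule of thumb that is not stated in this Review: if a study would need N individuals to detect an association when the causal variant is typed directly, it needs roughly N / r² individuals when the signal is observed only through a proxy SNP with correlation r² to that variant. The helper below is a hypothetical sketch of that approximation, and the sample sizes are invented.

```python
import math


def required_sample_size(n_if_causal_typed: int, r_squared: float) -> int:
    """Approximate sample size needed when testing a proxy SNP.

    Uses the rule of thumb N_proxy ~ N_causal / r^2 (an approximation,
    assumed here rather than taken from the Review).
    """
    if not 0.0 < r_squared <= 1.0:
        raise ValueError("r_squared must be in (0, 1]")
    return math.ceil(n_if_causal_typed / r_squared)


# Made-up example: 2,000 cases would suffice if the causal variant were typed.
for r2 in (1.0, 0.5, 0.25):
    print(f"best proxy r^2 = {r2:.2f} -> about {required_sample_size(2000, r2)} cases")
```

Under this approximation, a platform whose best proxy reaches only r² = 0.5 in a given population doubles the sample size required for the same power.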

The problem of ascertainment bias. A fundamental limitation of using HapMap data for designing genotyping platforms for Africa is that the data focus on SNPs that were discovered in a relatively small number of individuals, predominantly of European descent73. This is particularly problematic because African populations have more private SNPs than European or Asian populations74. Resequencing of specific genome regions in the ENCODE Project revealed that data from Phase 2 of the HapMap Project provide 81% coverage of common SNPs in the Yoruba people compared with 94% in Europeans75. Because the HapMap Project prioritized SNPs that were common on all three continents, this leads to over-representation of high-frequency SNPs and under-representation of SNPs that are common in Africa but rare or absent elsewhere22,76 (Fig. 1). This was highlighted in a targeted resequencing study, which found that 91% of low-frequency SNPs discovered in the Yoruba people were missing from the HapMap data, compared with 86% in a comparable European sample22.

Figure 1: African populations are subject to high levels of ascertainment bias in current SNP databases.

A study by Wall et al.76 sequenced 40 intergenic regions in 90 individuals from 6 different ethnic groups. Within these regions, they observed almost all of the SNPs in the HapMap Phase 2 database, as well as discovering many new SNPs. The figure shows the number of SNPs in the HapMap data (green) compared with the number of SNPs that were discovered by resequencing and that were not present in the HapMap data (orange), categorized by derived allele frequency. a | Data from all ethnic groups combined. b | SNPs discovered in an African group (Mandinka) compared with African data (Yoruba people in Ibadan, Nigeria (YRI)) from the HapMap Project. c | SNPs discovered in a European group (Basque) compared with European data (Utah residents with Northern and Western European ancestry from the CEPH collection (CEU)) from the HapMap Project. d | SNPs discovered in an East Asian group (Han Chinese) compared with SNPs from a similar group (Han Chinese in Beijing (CHB)) in the HapMap Project. It can be seen that the HapMap data have greater SNP ascertainment bias for African than for European or Asian populations. In particular, African populations have many low-frequency alleles that are not well represented in current SNP databases. The figure is modified, with permission, from Ref. 76 © (2008) CSHL Press.

It remains uncertain to what extent data from the Yoruba people can be extrapolated to other populations in Africa, given the high level of haplotypic diversity and population structure11. The practical implication is that a SNP-genotyping platform based on HapMap data might have decreased coverage in regions of the genome in which LD differs between the population under study and the HapMap populations65,66. Large-scale genotyping data sets are now being generated for other African populations: Phase 3 of the HapMap Project will include the Luhya and Maasai groups from Kenya in East Africa; the Human Genome Diversity Project includes individuals from eight African ethnic groups12,77; and GWA studies of malaria and tuberculosis involving thousands of individuals are ongoing in The Gambia, Ghana and Malawi19,37. Although these studies will provide valuable information, they all use commercial genotyping platforms for which SNP selection reflects the same biases as the HapMap data. Large-scale resequencing studies, such as the 1000 Genomes Project, are therefore required for the development of a comprehensive list of African variants and their LD structure across the continent.

Capturing structural variation. GWA studies must also take account of the importance of structural variation in the human genome. This includes copy-number variants (CNVs), such as insertions, deletions and duplications, as well as inversions and translocations. Structural variants seem to exhibit similar demographic patterns to SNPs (that is, a high proportion of common variants seem to be shared across continents), but there is evidence of greater structural variant diversity in Africa: among the HapMap populations, the Yoruba sample has more polymorphic CNVs than the European or Asian samples78,79,80. Structural variants can be genotyped with arrays that are specifically designed to interrogate known CNV regions, but there is potential ascertainment bias if these are based primarily on non-African reference data. Alternatively, they can be inferred from SNP-genotyping arrays, but the low levels of LD in Africa are an inherent limitation in LD-based strategies for CNV tagging. New approaches using next-generation sequencing technology will be particularly valuable in reducing ascertainment bias towards known, common variants.

Dealing with population structure

The potential confounding effect of population structure on genetic association studies in Africa is illustrated by the existence of more than 2,000 distinct language groups, most of which correspond to a specific ethnic group (see the Ethnologue website). There is growing evidence that these ethnic differences correlate with genetic differences and that levels of population structure are much greater within Africa than in other parts of the world10,19,77.

Here, we discuss the analytical implications in two parts. First, we consider the consequences of local population structure — that is, variation in allele frequencies and LD between different ethnic groups who reside at a single location. Second, we discuss the effects of population structure on multi-centre studies — that is, variation in allele frequencies and patterns of LD at different geographical locations.

Local population structure. Failure to account for population structure in a community with multiple ethnic groups can result in a high false-discovery rate and reduce the power of the study81,82. These confounding effects can be minimized by ethnic matching of cases and controls, but accurate matching can be difficult in communities in which there is substantial mixing between groups (Box 2). At the first stage of a GWA study, it is possible to correct for population structure using statistical approaches, such as genomic control and principal components analysis83,84. In The Gambia, which is a community of considerable ethnic diversity, quantile–quantile plots of GWA data indicate that methods based on principal components analysis are highly effective in minimizing false-positive associations caused by population artefacts19. However, such statistical methods are more difficult to apply at the second stage of a GWA study because they require a substantial fraction of the assayed genetic markers to be independent of the phenotype being studied, and are therefore of limited value in replication studies in which only candidate SNPs are genotyped. In this situation it may be necessary to rely on surrogate markers, such as language or location of residence, to correct for population structure, and it has been shown that, at least in some populations, this can be reasonably effective19. An alternative is to replicate candidate signals with the use of family-based association studies — for example, with family trios — as such designs are generally more robust to the confounding effects of population structure85.
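Genomic control, one of the statistical corrections mentioned above, can be sketched in a few lines: the inflation factor λ is the median of the observed 1-degree-of-freedom χ² statistics divided by the median expected under the null (about 0.455), and each statistic is then divided by λ. The function names and the test statistics below are made up for illustration; this is a minimal sketch, not the procedure used in any cited study.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation_factor(chi2_stats):
    """Lambda: observed median chi-square over its expected null median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control_adjust(chi2_stats):
    """Deflate every statistic by lambda (never inflate if lambda < 1)."""
    lam = max(1.0, genomic_inflation_factor(chi2_stats))
    return [x / lam for x in chi2_stats]


# Hypothetical association statistics from a structured sample.
observed = [0.2, 0.5, 0.9, 1.4, 2.1, 3.0, 25.0]
print("lambda:", round(genomic_inflation_factor(observed), 3))
print("adjusted:", [round(x, 3) for x in genomic_control_adjust(observed)])
```

The correction depends on most markers being unrelated to the phenotype, which is why it cannot be applied when only a handful of candidate SNPs are typed at the replication stage.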

Population structure in multi-centre studies. A fundamental problem for multi-centre replication studies in Africa is that allele frequencies and patterns of LD may vary among the different study sites8,66,86,87. Replication studies across European populations have been largely successful in reproducing the initial findings from GWA studies because the allele frequencies and patterns of LD are reasonably constant across Europe. This is not the case in Africa, and failure to replicate an association at different study sites may simply be due to varying patterns of LD between the causal variant and the SNPs that are genotyped. An analysis of the haemoglobin-β (HBB) gene region in different parts of West Africa provides a clear example of this: the SNP encoding 'sickle cell' haemoglobin (HbS) shows different patterns of LD in The Gambia compared with the HapMap Yoruba sample, and if data from both populations are combined using standard meta-analytic approaches, this tends to reduce, rather than improve, statistical power to detect signals of association unless the causal SNP itself is genotyped19,66 (Fig. 2).

Figure 2: Meta-analysis at a site with different associated haplotypes in two populations.

The 'sickle cell' variant of the haemoglobin-β (HBB) gene, which encodes haemoglobin S (HbS), is known to confer resistance to severe malaria. It is also known to exist on different haplotypes in different African populations. Here, we consider the major HbS haplotypes (green and blue horizontal bars) found in The Gambia and in the Yoruba people of Nigeria: the HbS-encoding variant (orange strip) is in linkage disequilibrium with different SNPs (cyan strips) in the two populations. The graphs represent fictitious case–control studies of severe malaria in the Gambian (a) and Yoruban (b) populations, showing the strength of association signal expected from the causal variant (orange star) and other SNPs (red circles). Part c shows the results expected if data from a and b were combined in a standard meta-analysis: the association signal of the causal variant is boosted, but that of other SNPs is reduced.

This presents a quandary for the design and analysis of GWA studies in Africa. The standard approach in Europe aims to confirm initial GWA findings in multi-centre studies before attempting to identify the causal variants by regional sequencing and fine mapping. However, in Africa there is a lower probability that association signals will replicate in multi-centre studies unless the causal variants are assayed directly. It has been proposed that the term 'transferability' is more appropriate than 'replication' when testing SNP associations across genetically different populations88.
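The effect of combining sites with different LD patterns can be illustrated with a fixed-effect, inverse-variance-weighted meta-analysis, taken here as an assumed example of a "standard" approach. All effect sizes and standard errors below are fictitious: the causal variant has the same effect at both sites, whereas a neighbouring SNP tags it at site A only.

```python
import math


def fixed_effect_meta(estimates):
    """Inverse-variance-weighted pooling.

    estimates: list of (beta, standard_error) pairs, one per study site.
    Returns (pooled_beta, pooled_se, z_score).
    """
    weights = [1.0 / (se * se) for _, se in estimates]
    total = sum(weights)
    pooled_beta = sum(w * beta for w, (beta, _) in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled_beta, pooled_se, pooled_beta / pooled_se


# Fictitious log-odds-ratio estimates (beta, se) at two study sites.
causal_variant = [(-1.0, 0.25), (-1.0, 0.25)]  # same effect at both sites
tag_snp = [(-0.8, 0.25), (0.0, 0.25)]          # in LD with the causal variant at site A only

print("causal z:", round(fixed_effect_meta(causal_variant)[2], 2))
print("tag SNP z:", round(fixed_effect_meta(tag_snp)[2], 2))
print("tag SNP z at site A alone:", round(-0.8 / 0.25, 2))
```

In this toy setting, pooling strengthens the signal at the causal variant but leaves the tag SNP with a weaker signal than it had at site A alone, which is the pattern sketched for the HBB region.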

In the next section we consider how new technologies for large-scale genome sequencing will help to overcome this problem in two ways: first, by starting to define the population genomic structure of African populations at the level of resolution needed to understand whether a particular multi-centre study is truly a test of replication as opposed to transferability; and second, by providing a method to refine the evidence of association in a GWA study, and to narrow this down to a shortlist of potential causal variants, before attempting to replicate putative causal variants across multiple populations.

Moving towards GWA by sequencing

GWA methodologies are about to be transformed by new sequencing technologies23,89,90,91,92,93,94. The cost of sequencing an individual human genome will probably fall to US$1,000 in the next few years, and eventually it will be possible to conduct GWA analysis by genome sequencing of all of the cases and controls. This will be particularly beneficial for studies in Africa: it will increase the strength of GWA signals, because causal variants will be directly tested, and replication studies will be more likely to succeed because they will include the causal variant. In this situation, weak LD and variable LD between populations will become an advantage, as they will help to distinguish causal variants.

GWA by sequencing will greatly enhance our ability to detect associations with variants that are population-specific, and to dissect the problem of allelic heterogeneity. For example, there are two distinct variants of the HBB gene that confer resistance to malaria in West Africa: one encodes HbS (a valine substitution at codon 6) and the other encodes HbC (a lysine substitution, also at codon 6). HbS is relatively widespread, whereas HbC has a more localized distribution — for example, among the Dogon people of Mali, who have a low frequency of HbS95,96,97. This example is well understood because haemoglobin has been intensively studied by geneticists for many years, but allelic heterogeneity of this sort might be extremely difficult to dissect by GWA analysis, unless it is based on genome sequencing.

The 1000 Genomes Project will improve imputation accuracy. It will be some years before GWA analysis by sequencing becomes a practical proposition, and this raises the question of how to perform effective GWA studies in Africa using current genotyping resources. Within the next 2 years, the 1000 Genomes Project proposes to generate whole-genome sequence data on at least 60 individuals from each of 5 different African populations: data are currently being generated on two HapMap groups, the Yoruba of Nigeria and the Luhya of Kenya, and plans are under way to include groups from The Gambia, Ghana and Malawi. As well as enabling the optimization of new SNP-genotyping platforms, these data will increase the value of existing SNP-genotyping platforms by increasing the accuracy of multipoint imputation. Imputation is a method of statistically inferring an individual's genotype at a variable position in the genome, based on that individual's known genotypes at nearby variable positions combined with reference data on genome variation in the general population98,99,100,101. The HapMap Project has provided an important reference panel for imputation in European populations, and it is now common for GWA studies to report association data at 3 million SNP positions, of which 1 million have been directly genotyped and the remainder imputed.

Accurate imputation requires the correct reference data. Imputation strategies are predicated on the assumption that the reference data accurately represent the haplotypes that exist in the GWA study population; if this is not the case, imputation can give misleading results. This is particularly problematic for African populations, in which lower imputation accuracy has been reported65,102. The problem is well illustrated by a GWA study of severe malaria in Gambian children, in which a detailed analysis was undertaken of a 110-kb region of the genome containing the known malaria resistance variant HbS19. All common variants were imputed across this region using two different sets of reference data on genome variation. The first reference data set was obtained by resequencing 62 Gambian individuals across this region of the genome, and the second used HapMap data from the Yoruba people, who live in a different part of West Africa. Imputation based on the Gambian reference data accurately identified HbS as being strongly associated with resistance to malaria, whereas imputation based on the HapMap Yoruba reference data showed no association with HbS (Fig. 3). This finding is not entirely surprising given that the HbS-encoding allele occurs on different haplotypic backgrounds in different parts of Africa103,104,105,106.
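The dependence on the reference panel can be shown with a deliberately simplified toy. Real imputation methods use probabilistic haplotype models; the sketch below instead takes a majority vote among reference haplotypes that match the individual's typed alleles, and it assumes phased haplotypes coded as strings of 0s and 1s. The function name, both panels and the haplotype are invented for illustration.

```python
from collections import Counter


def impute_allele(typed_alleles, reference_haplotypes, target_index):
    """Toy imputation by majority vote.

    typed_alleles: dict mapping position index -> observed allele ("0" or "1")
    reference_haplotypes: list of equal-length strings of "0"/"1"
    target_index: position of the untyped variant to infer
    Returns the most common target allele among matching reference
    haplotypes, or None if no reference haplotype matches.
    """
    votes = Counter(
        hap[target_index]
        for hap in reference_haplotypes
        if all(hap[pos] == allele for pos, allele in typed_alleles.items())
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# One study haplotype typed at positions 0 and 2; position 1 is untyped.
typed = {0: "1", 2: "0"}

# Hypothetical panels: the variant allele ("1" at position 1) sits on the
# 1_0 background in the local panel but on a different background elsewhere.
local_panel = ["110", "110", "000", "001"]
distant_panel = ["100", "100", "011", "000"]

print("local panel   ->", impute_allele(typed, local_panel, 1))
print("distant panel ->", impute_allele(typed, distant_panel, 1))
```

The same typed alleles lead to opposite inferences at the untyped position, depending only on which panel supplies the haplotype backgrounds.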

Figure 3: Imputation and the choice of haplotype reference panel.

Imputation is a process of statistical inference that estimates the most likely genotype of an individual at a given position in the genome, based on what is known about the genotype of that individual at nearby positions and on a reference data set of genome variation in the general population. The accuracy of imputation depends on the appropriateness of the reference data set. The figure shows signals of association with severe malaria from SNPs distributed across a 2.5-Mb region of chromosome 11 (Ref. 19). The vertical dashed line represents the position of rs334: this SNP is known to encode the haemoglobin S (HbS) variant of the haemoglobin-β (HBB) gene, which confers resistance to malaria. a | SNPs typed using the Affymetrix 500K genotyping platform (black circles). b | SNPs imputed using the HapMap Yoruba people in Ibadan, Nigeria (YRI) data as the reference (grey circles). The rs334 SNP is shown as a yellow diamond. c | SNPs imputed from regional sequencing data on 62 Gambian individuals (orange circles), including rs334 (yellow diamond). If we did not know that rs334 was the causal variant, imputation based on Gambian sequencing data would have been extremely useful, whereas imputation based on the HapMap YRI data would have been misleading. Parts a and c are modified, with permission, from Nature Genetics Ref. 19 © (2009) Macmillan Publishers Ltd. All rights reserved.

Imputation based on population-specific sequencing as an interim strategy. The above result represents a proof of principle that causal genetic variants can be fine mapped in Africa by imputation based on population-specific genome variation data. Fig. 3 illustrates three important points. First, imputation should be based on population-specific reference data. Second, multipoint imputation can be highly effective in boosting GWA signals when the genotyping array contains no individual SNP that is in strong LD with the causal variant; in the case of HbS, the imputed GWA signal was several orders of magnitude greater than that obtained by direct genotyping. Third, low levels of LD can be valuable in localizing a causal variant by fine mapping: in the case of HbS, after imputation had been performed with the appropriate reference panel, the causal SNP (rs334) could be clearly identified at the peak of the association signal. It remains to be determined whether other causal variants will be as amenable to fine mapping by imputation as the HbS variant. The success of this approach will depend on local patterns of genome variation, and the HbS-encoding locus has several features that may be particularly favourable, namely a strong phenotypic effect, an extended ancestral haplotype owing to recent positive selection, and a genomic region with generally weak LD19. But until it becomes possible to conduct GWA by sequencing all cases and controls, imputation based on population-specific reference data provides an interim strategy. It boosts GWA signals of association in Africa, and it enables multi-centre studies whose results can be usefully combined, because all common variants (including causal variants) have been imputed with a reasonable level of accuracy at each centre.
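One common way to combine per-centre results for a single variant is a fixed-effect, inverse-variance-weighted meta-analysis. The sketch below is a minimal illustration under that assumption; the text does not specify a combination method, and the function name `fixed_effect_meta` and the effect estimates are hypothetical. It also assumes that the same variant has been typed or accurately imputed at every centre, and it ignores heterogeneity between centres.

```python
import math
from typing import List, Tuple


def fixed_effect_meta(estimates: List[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Pool per-centre (beta, standard error) pairs by inverse-variance weighting.

    Returns the pooled effect, its standard error and the z-score.
    """
    weights = [1.0 / (se ** 2) for _, se in estimates]
    total_weight = sum(weights)
    pooled_beta = sum(w * beta for w, (beta, _) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se, pooled_beta / pooled_se


# Hypothetical log odds ratios and standard errors for one variant at two centres.
centre_estimates = [(-1.0, 0.2), (-0.6, 0.2)]

pooled_beta, pooled_se, z_score = fixed_effect_meta(centre_estimates)
print(f"pooled beta = {pooled_beta:.3f}, SE = {pooled_se:.3f}, z = {z_score:.2f}")
```

Because the weights are inversely proportional to the variance, a centre with poorly imputed genotypes (and therefore a larger standard error) contributes less to the pooled estimate. A random-effects model may be more appropriate if effects differ between populations.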

Conclusions

GWA research in Africa presents a wide range of practical, scientific and ethical challenges, which we have not attempted to cover comprehensively here. For example, it can be a huge undertaking to establish the clinical research infrastructure needed to recruit large numbers of patients and to ensure accurate phenotypic data when working in a resource-poor setting. There is a need for robust epidemiological platforms to investigate how variations in environment and lifestyle affect the results of genetic-association studies in different populations, and for statistical methodologies that can account for such interactions when combining data from different locations in multi-centre GWA studies. GWA studies in Africa require the building of local resources and the development of data-sharing networks that meet the needs of the African research community, and it is of crucial importance to pay attention to the ethical issues that arise in medical and genetic research in resource-poor settings (Box 3).

Our main purpose in this Review has been to consider the specific methodological roadblocks to GWA analysis that arise in Africa owing to the high levels of genome diversity and population structure. Many of these roadblocks will be removed when new sequencing technologies make it possible to conduct GWA analysis by genome sequencing. Because all variants will be directly observed, including causal variants, this will increase the strength of GWA signals and make it easier to perform meta-analyses across multiple study sites. An interim strategy is to conduct imputation of all common variants, and for GWA studies in Africa it is particularly important that this should be based on population-specific genome variation data. By providing a framework for accurate imputation in a number of different African populations, the 1000 Genomes Project will be an important first step towards reliable multi-centre GWA studies in Africa and the fine mapping of causal variants.