Academia.eduAcademia.edu

Genome-wide significance for dense SNP and resequencing data

2008, Genetic Epidemiology

The problem of multiple testing is an important aspect of genome-wide association studies, and will become more important as marker densities increase. The problem has been tackled with permutation and false discovery rate procedures and with Bayes factors, but each approach faces difficulties that we briefly review. In the current context of multiple studies on different genotyping platforms, we argue for the use of truly genome-wide significance thresholds, based on all polymorphisms whether or not typed in the study. We approximate genome-wide significance thresholds in contemporary West African, East Asian and European populations by simulating sequence data, based on all polymorphisms as well as for a range of single nucleotide polymorphism (SNP) selection criteria. Overall we find that significance thresholds vary by a factor of 420 over the SNP selection criteria and statistical tests that we consider and can be highly dependent on sample size. We compare our results for sequence data to those derived by the HapMap Consortium and find notable differences which may be due to the small sample sizes used in the HapMap estimate. Genet. Epidemiol. 32:179-185, 2008.

Genetic Epidemiology 32 : 179–185 (2008) Genome-Wide Significance for Dense SNP and Resequencing Data Clive J. Hoggart,1 Taane G. Clark,1,2 Maria De Iorio,1 John C. Whittaker,3 and David J. Balding1 2 1 Department of Epidemiology and Public Health, Imperial College London, Norfolk Place, London Current affiliation - Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive Oxford and Wellcome Trust Sanger Institute, Hinxton, Cambridge 3 Non-communicable Disease Epidemiology Unit, London School of Hygiene and Tropical Medicine, Keppel Street, London The problem of multiple testing is an important aspect of genome-wide association studies, and will become more important as marker densities increase. The problem has been tackled with permutation and false discovery rate procedures and with Bayes factors, but each approach faces difficulties that we briefly review. In the current context of multiple studies on different genotyping platforms, we argue for the use of truly genome-wide significance thresholds, based on all polymorphisms whether or not typed in the study. We approximate genome-wide significance thresholds in contemporary West African, East Asian and European populations by simulating sequence data, based on all polymorphisms as well as for a range of single nucleotide polymorphism (SNP) selection criteria. Overall we find that significance thresholds vary by a factor of 420 over the SNP selection criteria and statistical tests that we consider and can be highly dependent on sample size. We compare our results for sequence data to those derived by the HapMap Consortium and find notable differences which may be due to the small sample sizes used in the HapMap estimate. Genet. Epidemiol. 32:179–185, 2008. r 2007 Wiley-Liss, Inc. Key words: genome-wide association studies; multiple testing; statistical significance Contract grant sponsor: The UK Medical Research Council. Correspondence to: Clive J. Hoggart, Department of Epidemiology and Public Health, Imperial College London, Norfolk Place, London. E-mail: [email protected] Received 7 August 2007; Accepted 16 October 2007 Published online 28 December 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/gepi.20292 INTRODUCTION Given the large number of statistical tests (around 106) in genome-wide association (GWA) studies, and even larger numbers from future studies using whole-genome resequencing technology, it is important, but difficult, to accurately convey the significance of any reported associations. Frequently this problem is tackled by controlling the family-wise error rate (FWER), which is the probability of one or more significant results under the null hypothesis of no association. For a test applied at each of n single nucleotide polymorphisms (SNPs), the simplest way of approximating the per-test significance level a00 corresponding to a given FWER (a) is to apply a Bonferroni (a00 5 a/n) or a Šidák (a00 5 1(1a)1/n) correction [Šidák, 1968, 1971]. If the tests are mutually independent the Šidák correction is exact, but in practice the tests are dependent and both corrections can be very conservative. More accurate approximations to the genomewide significance level a00 are difficult to obtain, because the correlations between the tests depend on many factors, such as variations in linkage disequir 2007 Wiley-Liss, Inc. librium (LD) and SNP density, as well as the choice of test statistic(s) and sample size. The concept of an effective number of (independent) tests is appealing [Cheverud, 2001; Nyholt, 2004; The International HapMap Consortium, 2005], but this number is closely related to a00 and depends on the same factors [Salyakina et al., 2005; Dudbridge et al., 2006]. False discovery rate procedures [Benjamini and Hochberg, 1995; Storey and Tibshirani, 2003], which control the expected proportion of non-causal SNPs among those declared significant, have been proposed as an alternative to the FWER approach. These have been applied successfully to the analysis of genome-wide gene expression data, for which one typically expects many true positives. However, in the context of GWA studies, the smaller number of positives and the problem of LD between true positives and flanking SNPs, remain problematic [Dudbridge et al., 2006; Yang et al., 2005]. In principle, the use of Bayes factors to measure significance avoids the need for assessing genomewide error rates. A full Bayesian analysis involves the difficult challenge of identifying a realistic yet tractable alternative model. For example, the alternative model of the Wellcome Trust Case Control 180 Hoggart et al. Consortium [2007] assumes the same distribution of effect sizes for all minor allele frequency (MAF) values of the causal variant, which may be considered unrealistic. In practice, Bayes factors are typically applied to one SNP at a time, and are often treated as frequentist test statistics for which error rates must be assessed. Permutation procedures can yield the correct FWER even when the tests are dependent. They are computationally intensive in the setting of GWA studies, which has motivated approximations to reduce the computational effort [Dudbridge and Koeleman, 2004; Seaman and Muller-Myhsok, 2005; Kimmel and Shamir, 2006]. The results of permutation procedures apply only to the current genotyped dataset and must be recomputed when the dataset is altered, therefore they cannot provide truly genomewide significance thresholds unless the entire genome is sequenced, neither can they be used before genotyping to derive significance thresholds for power calculation to aid the design of association studies. For linkage analyses, universal thresholds based on the LOD score were established and have proven useful over many years [Lander and Kruglyak, 1995]. For association studies, a genome-wide 5% significance level was estimated [Risch and Merikangas, 1996] as 5  108. This result has been influential, but it was derived using a Bonferroni correction assuming 106 independent tests (105 genes and 10 independent tests in each gene), and its sensitivity to different genomic and study design features has not been explored. Using SNPs ascertained from the ENCODE Project Consortium [2004], which spanned 10 500-kb regions, the HapMap Consortium estimated the effective number of independent tests per 500 kb to be 350 in the West African population and 150 in European and East Asian populations when testing all SNPs with MAF 45%. Assuming the length of the human genome to be 3,300 Mb this is equivalent to an effective number of tests in the genome of 9.9  105 in Europeans and East Asians and 23.1  105 in West Africans, which, using a Šidák correction, gives a genome-wide 5% significance threshold of 2.2  108 for Africans and 5.2  108 for Europeans and East Asians. These estimates were based on SNPs ascertained by sequencing 48 individuals and genotyped in 209 unrelated individuals. Mock case-control samples were then generated by resampling-phased chromosomes of these individuals. The ENCODE data are not suitable to calculate thresholds for tests of SNPs with MAFo5% due to the small sample size used to ascertain SNPs. Furthermore, to generate sufficient sizes of mock case-control samples from the ENCODE data, one would have had to resample chromosomes many Genet. Epidemiol. DOI 10.1002/gepi times, resulting in less genetic diversity than in the actual population. METHODS To overcome the limitations of the approaches discussed above, we simulated sequence level data in large populations enabling all polymorphisms to be ascertained, and individuals to be sampled without replacement. The simulation was implemented using the FREGENE software [Hoggart et al., 2007], which simulates large genomic regions in large diploid populations, forward in time, under various demographic scenarios and evolutionary models, including variable recombination rates and gene conversion. In our simulation, we mimicked the modeling assumptions of [Schaffner et al., 2006] in which the demographic and evolutionary model of the simulation was chosen to approximate the history of populations from three continental regions: West Africa, East Asia and Europe. The genetic parameters of the simulation were tuned [Schaffner et al., 2006] to match (1) the allele frequency distribution; (2) the relationship between allele frequency and the probability that an allele is ancestral; (3) Fst, and two measures of the extent of LD; (4) the relationship of genetic distance with r2 and (5) the fraction of pairs of markers with D0 5 1 in each of the three populations [Schaffner et al., 2006]. Our simulation reproduced the reported results for these five criteria [Schaffner et al., 2006]. Further details of the simulation study are given in the Appendix. It is infeasible to simulate entire human chromosomes in such large populations, however with our computer memory constraints we were able to simulate regions of 5 Mb and hence we in effect approximated the human genome by 660 chromosomes each of 5 Mb. Genome-wide values were then inferred from the simulated chromosome using Šidák’s correction. Our checks indicate that the error in this approximation is negligible (see Appendix for justification). From the three simulated populations, we selected chromosome pairs corresponding to individuals for inclusion in the association studies. We then permuted the case-control labels 5  105 times, and recorded the minimum P-value in order to approximate the per-SNP significance level corresponding to a FWER of a 5 5, 10, 20, and 50%. We explored the effect of SNP ascertainment, test statistic and the case/control sample size. Specifically we considered four SNP-ascertainment strategies: 1. SNPs chosen randomly with an average spacing of 5 kb and an approximate uniform MAF distribution in Europeans approximating an Genome-Wide Significance Affymetrics 500 K GeneChip — we denote this as our ‘‘standard’’ study design. 2. All polymorphic sites. 3. All sites with MAF 40.5% in the relevant population. 4. All sites with MAF 45% in the relevant population. We considered three sample sizes: 100, 1,000 and 5,000 cases, with equal numbers of controls. In all analyses, a Cochran-Armitage trend test (sensitive to additive effects only) was applied. In addition Pearson’s 2 degree of freedom (df) test and separate tests for dominant and recessive effects were applied to the standard SNP ascertainment with 1,000 cases and controls. Furthermore, when these additional tests were applied the additive, dominant and recessive tests were combined at each SNP by taking the maximum of the three test statistics [MAX test; Freidlin et al., 2002]. The additive and the 2 df tests were combined similarly. See the Appendix for full details of the calculation of genome-wide significance thresholds. RESULTS Tables I, II, III show our estimates of the per-SNP significance level (a00 ) for the corresponding FWER (a) of 5, 20 and 50% for the simulated West African, East Asian and European populations. The results for a 5 10% are not shown; they can be approximated by a linear interpolation between the a 5 5 and 20% values. The effective number of tests was calculated as the number of independent tests that would give the calculated threshold a00 at the significance level a 5 20% using the Šidák correction. 181 For all of the simulated GWA studies, our estimate of a00 is larger than would be estimated from a genome-wide Šidák correction, because the effective number of tests is always less than the actual number of tests. For our standard GWA the lowest significance thresholds were observed in Europeans. This may appear surprising because lower LD in the West African population leads us to expect less dependence of the tests and so lower thresholds than among Europeans. However, the SNPs were ascertained on the basis of European allele frequencies which resulted in 145 and 258 of these SNPs being monomorphic in the West African and East Asian populations respectively, whereas all were polymorphic in Europeans. More generally, SNPs typically have a higher MAF in Europeans than in the other two populations and SNPs with low MAF are less likely to be significant. This is because for rare variants the null distribution of P-values is nonuniform, having less weight at small P-values. If the chromosomes are sequenced and all polymorphic sites tested, the ascertainment issue vanishes and, as expected, a00 is lowest in the West African population. With 5,000 cases and controls the ratios of a00 for sequence data to its value under the standard SNP map for a genome-wide significance level of 20% are 23.8, 24.2 and 12.6 in the West African, East Asian, and European populations respectively. The threshold a00 increases as the SNP density decreases from MAF 40.5 to 45%, but as before a00 is lowest in the African population. With 1,000 cases and controls, for all SNP ascertainments other than the standard, a00 is roughly twice as large in East Asian and European populations in comparison to West Africans, but the difference is less for 5,000 cases and controls when testing all polymorphisms and for MAF 40.5%. The TABLE I. West African population: significance thresholds and effective number of tests (each with 95% confidence interval) Per-test siginificance level a00  108 for family-wise error rate a 5 Standard 100 Cases/controls 1,000 Cases/controls 5,000 Cases/controls MAF45% 1,000 Cases/controls 5,000 Cases/controls MAF40.5% 1,000 Cases/controls 5,000 Cases/controls All polymorphisms 1,000 Cases/controls 5,000 Cases/controls No. of tests in genome  106 5% 20% 50% Effective Actual 34 (23–41) 14 (7.2–18) 16 (10–19) 130 (110–150) 61 (52–70) 62 (55–71) 390 (350–430) 210 (180–220) 180 (170–190) 0.17 (0.15–0.2) 0.37 (0.32–0.43) 0.36 (0.31–0.4) 0.66 0.26 0.56 0.55 1.6 (1–2) 1.5 (1–1.9) 6.7 (5.9–7.3) 6.0 (4.8–6.8) 21 (19–22) 18 (16–20) 3.3 (3–3.8) 3.7 (3.3–4.6) 9.8 0.34 0.38 0.98 (0.69–1.6) 0.66 (0.51–0.85) 4.9 (4.2–5.7) 2.6 (2.3–3) 15 (13–16) 8.0 (7.4–8.7) 4.5 (3.9–5.3) 8.7 (7.4–9.9) 21.8 0.21 0.40 0.92 (0.68–1.4) 0.65 (0.32–0.88) 4.0 (3.4–4.7) 2.6 (2.4–3) 13 (12–14) 7.4 (6.8–8) 5.6 (4.7–6.6) 8.5 (7.5–9.5) 120 Ratio 0.047 0.071 MAF, minor allele frequency. Genet. Epidemiol. DOI 10.1002/gepi 182 Hoggart et al. TABLE II. East Asian population: significance thresholds and effective number of tests (each with 95% confidence interval) No. of tests in genome  106 Per-test siginificance level a00  108 for family-wise error rate a 5 Standard 100 Cases/controls 1,000 Cases/controls 5,000 Cases/controls MAF45% 1,000 Cases/controls 5,000 Cases/controls MAF40.5% 1,000 Cases/controls 5,000 Cases/controls All polymorphisms 1,000 Cases/controls 5,000 Cases/controls 5% 20% 50% Effective 41 (26–58) 17 (14–23) 19 (14–24) 160 (140–180) 82 (69–95) 85 (71–97) 450 (420–480) 280 (250–310) 240 (220–260) 0.14 (0.12–0.16) 0.27 (0.23–0.33) 0.26 (0.23–0.32) 0.66 0.21 0.41 0.39 3.0 (2.1–4.3) 2.7 (2.1–3.6) 13 (10–14) 12 (9.5–13) 37 (33–41) 37 (33–41) 1.8 (1.6–2.1) 1.9 (1.7–2.3) 6.7 0.27 0.28 2.5 (1.8–3.1) 1.3 (0.79–1.9) 9.5 (8.2–10) 5.2 (4.4–5.9) 25 (22–26) 14 (13–15) 2.4 (2.1–2.7) 4.3 (3.8–5) 13.7 0.18 0.31 9.6 (8.5–11) 3.5 (3.1–4.1) 28 (26–31) 10 (9.2–12) 2.3 (2–2.6) 6.4 (5.4–7.2) 2.1 (1.4–3.2) 0.86 (0.6–1.3) Actual Ratio 116 0.02 0.055 MAF, minor allele frequency. TABLE III. European population: significance thresholds and effective number of tests (each with 95% confidence interval) Per-test siginificance level a00  108 for family-wise error rate a 5 Standard 100 Cases/controls 1,000 Cases/controls 5,000 Cases/controls MAF45% 1,000 Cases/controls 5,000 Cases/controls MAF40.5% 1,000 Cases/controls 5,000 Cases/controls All polymorphisms 1,000 Cases/controls 5,000 Cases/controls No. of tests in genome  106 5% 20% 50% Effective 29 (19–41) 15 (12–18) 11 (7.2–14) 120 (110–140) 62 (52–74) 44 (38–52) 330 (290–360) 180 (170–200) 150 (130–160) 0.19 (0.17–0.21) 0.36 (0.3–0.43) 0.51 (0.43–0.59) 0.66 0.29 0.55 0.77 2.1 (1.4–3.4) 3.1 (2.2–3.9) 11 (9.2–12) 10 (9–12) 34 (32–38) 31 (29–34) 2.1 (1.9–2.4) 2.2 (1.9–2.5) 7.4 0.28 0.3 1.4 (1.1–1.9) 1.3 (0.64–1.7) 7.4 (6.2–8.3) 5.2 (4.6–5.8) 23 (20–24) 14 (13–15) 3.0 (2.7–3.6) 4.3 (3.8–4.9) 14.2 0.21 0.3 7.1 (6–8.4) 3.5 (2.8–4) 25 (22–26) 9.8 (8.8–11) 3.1 (2.6–3.7) 6.5 (5.6–7.9) 1.8 (1.3–2.4) 0.69 (0.41–0.99) Actual 116 Ratio 0.027 0.056 MAF, minor allele frequency. smaller a00 in the West African population can be explained by two factors arising from its larger effective population size: lower average LD between SNPs, and hence tests at neighboring SNPs are less dependent, and a greater number of polymorphisms. We can see the effect of the greater LD in the East Asian and European populations in comparison with the West African population by the lower ratio of effective number of tests to actual number of tests. Typically a00 decreases as the sample size increases, for example an increase in sample size from 1,000 to 5,000 cases and controls for MAF 40.5% and all polymorphic sites ascertainment schemes, resulted in the required a00 decreasing by a factor of two to Genet. Epidemiol. DOI 10.1002/gepi three and the effective number of tests approximately doubling. This is because increased sample size changes the distribution of the test statistic at rare variants to allow smaller P-values. The exception is with SNP ascertainment MAF45% when increasing the sample size from 1,000 to 5,000; this has no effect on a00 because of the absence of rare alleles. Table IV compares the estimates of the per-SNP significance level (a00 ) for the corresponding FWER for a range of single SNP tests. We see that testing for a dominant effect gives approximately the same threshold as for an additive effect. However, when testing for recessive effects the threshold is higher, Genome-Wide Significance 183 TABLE IV. European population with 1,000 cases and controls: significance thresholds and effective number of tests for different test statistics (each with 95% confidence interval) Per-test siginificance level a00  108 for family-wise error rate a 5 5% Standard Additive test Dominant test Recessive test MAX test 2 df test Both additive and 2 df tests MAF420% Additive test Dominant test Recessive test MAX test 2 df test Both additive and 2 df tests 20% 50% 15 12 21 7.1 18 8.1 (12–18) (9.1–19) (14–29) (4.4–8.2) (12–25) (5.1–12) 62 61 93 27 75 40 (52–74) (51–70) (80–100) (23–32) (63–89) (34–47) 180 180 280 87 220 120 4.6 3.3 4.7 0.98 4.7 2.5 (2.5–6.5) (2.1–4.6) (3.4–5.9) (0.51–1.7) (3.4–6) (1.7–3.4) 21 18 21 6.6 18 12 (19–24) (15–21) (18–24) (5.6–7.5) (16–24) (10–14) 66 57 65 22 62 35 (170–200) (160–200) (260–300) (79–93) (210–240) (110–130) (61–72) (52–63) (60–68) (21–25) (56–67) (33–39) No. of tests in genome  106 Effective 0.36 0.37 0.24 0.83 0.30 0.55 (0.3–0.43) (0.32–0.44) (0.21–0.28) (0.69–0.99) (0.25–0.35) (0.48–0.66) 1 1.2 1 3.4 1.2 1.9 (0.92–1.1) (1–1.5) (0.93–1.3) (3–4) (0.93–1.4) (1.6–2.1) Actual Ratio 0.66 0.66 0.66 2 0.66 1.3 0.55 0.56 0.36 0.42 0.45 0.42 3.8 3.8 3.8 12 3.8 7.7 0.26 0.32 0.26 0.28 0.32 0.25 df, degree of freedom; MAF, minor allele frequency. because the counts of minor-allele homozygotes in cases and controls will be small for rare alleles. Thus, the resulting P-value cannot be as small as for the tests of trend or dominance. To check this argument, we evaluated the required thresholds for the three tests applied at all SNPs with MAF420%; they were approximately the same. These results indicate that the required threshold for a given FWER is dependent on the test employed only through the minimum genotype frequency. Thus, if the MAF and sample size are large enough such that genotype counts in cases and controls are sufficiently large (as a rule of thumb 45) different tests do not give rise to different thresholds. The effective number of tests for the MAX statistic is 8.3  105, which is only slightly less than the sum of the effective number of tests when the three statistics are tested individually (9.7  105). Although the three-test statistics are highly correlated, the genomic locations at which their minima occur are close to independent. Similarly, the additive and 2 df tests behave as if nearly independent when combined, even though the two test statistics are highly correlated. Thus, there is a substantial penalty in terms of type 1 error for applying multiple tests at a SNP. DISCUSSION We provide approximate significance thresholds for whole-genome association studies for West African, East Asian and European populations with various study designs and single-SNP analyses. These results provide experimenters with guidelines for declaring genome-wide significance in present studies and future studies with sequence data. With the standard SNP ascertainment and 1,000 cases and controls the effective number of tests is about half the actual number of tests, thus the Šidák correction would underestimate the required threshold by one half, with 5,000 cases and controls the Šidák correction is less conservative. These significance thresholds are also relevant to meta-analyses in which test statistics are derived by combining results from independent studies. SNP ascertainment has the largest effect among the factors considered. With 5,000 cases and controls results vary by a factor of up to 24 between testing all polymorphisms and our standard analysis which has an average SNP spacing of 5 kb. Increasing sample size increases the effective number of tests, and hence reduces the required significance threshold, when many rare variants are tested, because of increased ability to detect rare variants with larger sample sizes. The HapMap Consortium estimated the effective number of tests in the genome if all SNPs with MAF 45% were tested to be 9.9  105 in Europeans and East Asians and 23  105 in West Africans, but the sample size was not stated. Our equivalent estimates are 21  105 in Europeans, 18  105 in East Asians and 33  105 in West Africans, for a sample size of 1,000 cases and 1,000 controls. No standard deviation was reported for the HapMap estimates, and so we cannot assess whether our higher estimates are in fact significantly different, but if so the discrepancy could be due to the difference in numbers of cases and controls, or polymorphisms being missed by Genet. Epidemiol. DOI 10.1002/gepi 184 Hoggart et al. ENCODE due to the small sample sizes used to ascertain SNPs. If several experimenters perform analyses for the same phenotype on different genotyping platforms, each may choose to control their FWER separately, taking into account only the SNPs tested. However, for the wider public the relevant FWER is for the combined datasets of all relevant studies and not the number of tests performed by any one research group. Similarly, the Wellcome Trust Case Control Consortium (WTCCC) argue for controlling the type 1 error rate, but note that ‘‘one should not correct significance levels for the number of tests performed’’. We propose universal genome-wide significance thresholds based on all polymorphisms, whether or not typed, to achieve these goals without the challenging task of specifying a tractable yet realistic alternative model. The WTCCC used a significance threshold of 5 x 10-7, which from Table 3 corresponds to a genome-wide false positive rate of above 50%. If only common variants are tested the threshold calculated for MAF45% could be justified. This threshold can also be thought of as a lower bound when SNPs have been ascertained by tagging using HapMap data because there is little power with these SNPs to detect variants with MAF o5% [Zeggini et al., 2005]. However the threshold for all polymorphic sites would provide a more stringent threshold that will become increasingly appropriate as SNP densities increase and more low MAF SNPs are tested. ACKNOWLEDGMENTS C.H. and T.C. are funded by a grant from the UK Medical Research Council under the Link Applied Genomics scheme. The authors thank Toby Andrew, Frank Dudbridge and Paul O’Reilly for helpful comments. REFERENCES Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate – a practical and powerful approach to multiple testing. J Roy Stat Soc Ser B 57:289–300. Cheverud JM. 2001. A simple correction for multiple comparisons in interval mapping genome scans. Heredity 87:52–58. Dudbridge F, Koeleman BPC. 2004. Efficient Computation of Significance Levels for Multiple Associations in Large Studies of Correlated Data, Including Genome-wide Association Studies. Am J Hum Genet 75:424–435. Dudbridge F, Gusnanto A, Koeleman BPC. 2006. Detecting Multiple Associations in Genome-wide Studies. Hum Genom 2:310–317. Freidlin B, Zheng G, Li ZH, Gastwirth JL. 2002. Trend tests for case-control studies of genetic markers: Power, sample size and robustness. Hum Hered 53:146–152. Hoggart CJ, Chadeau-Hyam M, Clark TG, Lampariello R, Whittaker JC, De Iorio M, Balding DJ. 2007. Sequence-level Genet. Epidemiol. DOI 10.1002/gepi population simulations over large genomic regions. Genetics: in press; doi: 10.1534/genetics.106.069088. Kimmel G, Shamir R. 2006. A fast method for computing highsignificance disease association in large population-based studies. Am J Hum Genet 79:481–492. Kong A, Gudbjartsson DF, Sainz J, Jonsdottir GM, Gudjonsson SA, Richardsson B, Sigurdardottir S, Barnard J, Hallbeck B, Masson G, Schlein A, Palsson ST, Frigge ML, Thorgeirsson TE, Gulcher JR, Stefansson K. 2002. A high-resolution recombination map of the human genome. Nat Genet 31:241–247. Lander E, Kruglyak L. 1995. Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nat Genet 11:241–247. Nyholt DR. 2004. A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. Am J Hum Genet 74:765–769. Risch N, Merikangas K. 1996. The future of genetic studies of complex human diseases. Science 273:1516–1517. Salyakina D, Seaman SR, Browning BL, Dudbridge F, MullerMyhsok B. 2005. Evaluation of Nyholt’s procedure for multiple testing correction. Hum Hered 60:19–25. Seaman SR, Muller-Myhsok B. 2005. Rapid simulation of P values for product methods and multiple-testing adjustment in association studies. Am J Hum Genet 6:399–408. Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D. 2006. Calibrating a coalescent simulation of human genome sequence variation. Genome Res 15:1576–1583. Storey JD, Tibshirani R. 2003. Statistical significance for genomewide studies. Proc Natl Acad Sci USA 100:9440–9445. Šidák Z. 1968. On multivariate normal probabilities of rectangles: their dependence on correlations. Ann Math Statist 39: 1425–1434. Šidák Z. 1971. On probabilities of rectangle in multivariate normal Student distributions: their dependence on correlations. Ann Math Statist 41:169–175. The ENCODE Project Consortium. 2004. The ENCODE (Encyclopedia Of DNA Elements) Project. Science 306: 636–640. The International HapMap Consortium. 2005. A haplotype map of the human genome. Nature 437:1299–1320. The Wellcome Trust Case Control Consortium. 2007. Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447:661–678. Yang Q, Cui J, Chazaro I, Cupples LA, Demissie S. 2005. Power and type I error rate of false discovery rate approaches in genome-wide association studies. BMC Genet 6:(Suppl 1):S134. Zeggini E, Rayner W, Morris AP, Hattersley AT, Walker M, Hitman GA, Deloukas P, Cardon LR, McCarthy MI. 2005. An evaluation of HapMap sample size and tagging SNP performance in large-scale empirical and simulated data sets. Nat Genet 37:1320–1322. APPENDIX SIMULATION STUDIES Here we describe in more detail the simulation model implemented which was formulated by Schaffner et al. [2006]. In the simulation, a homogeneous population evolved before splitting in two, one of which represents a small population that migrated out of Africa. The two populations then continued to evolve for 3,500 generations before the Genome-Wide Significance non-African population split into two mimicking ancestral Asian and European populations. The three populations continued to evolve for a further 2,000 generations before undergoing a final expansion. Population bottlenecks were implemented immediately after the population splits in all subpopulations. Migration occurred between the African and European populations and the African and Asian populations. A constant mutation rate of 1.5  108 per site and per generation and a constant gene conversion rate of 4.5  109 per site and per generation with a tract length of 500 bases were used. The crossover rate was variable and the model for the variation in rates was hierarchical. An average regional rate was set based on the deCODE genetic map [Kong et al., 2002], local rates were then set stochastically and finally hotspots were sampled conditional on the local rate. The intensities and spacing of the hotspots were both stochastic, for further details of the genetic and demographic model see Schaffner et al. [2006]. CALCULATING GENOME-WIDE SIGNIFICANCE THRESHOLDS Under the null hypothesis that no SNP is associated with case-control status, the per-SNP significance level a00 that corresponds to an FWER of a satisfies a ¼ Prðminfpi goa00 Þ, where pi denotes the P-value from the ith SNP. Computing a00 directly is difficult because of the large number of SNPs and their complex correlation structure. Instead, we approximate a00 via simulation of 5-Mb chromosomes and assume the 3.3-Gb genome is made up of 660 independent 5-Mb chromosomes. Thus, for a FWER of a 5 5% for a genome-wide study, a Šidák correction, with 660 independent tests, requires a significance level of a0 5 0.0078% within each 5-Mb chromosome. Thus, we estimate a00 by the 0.0078% point of the distribution of the minimum P-value in a 5-Mb interval, using n permutations of the case- 185 control labels (here n 5 5  105). In other words, we use qna0 as an estimate of a00 where qi denotes the ith smallest P-value arising from the n permutations of case-control labels. An approximate confidence interval for a00 can be obtained by treating the true position of the a0 -percentile of q as a Binomial random variable with parameters n and a0 . Then, using the normal approximation to the Binomial, p weffiffiffiffiffiobtain ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffithe ffiffi 95% 0 0 ð1  a0 Þg  1:96 na and confidence plimits qfna ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi qfna0 þ 1:96 na0 ð1  a0 Þg. A genome of 660 5-Mb chromosomes will have overall less LD than the actual human genome, which may bias upwards the estimate of the number of independent tests. To investigate the effect of this assumption, we simulated a 20-Mb region using the same evolutionary parameters but a simpler demographic model in which there was no migration but a single homogeneous population of 10,000 individuals. We compared the estimated genome-wide significance level for this population using the entire 20-Mb region and also using 10-Mb and 5-Mb subintervals of the 20-Mb region. We found that while the point estimates derived from the 5-Mb regions were slightly lower than those derived from the 20-Mb regions, the 95% confidence intervals overlapped considerably. Furthermore, the 95% confidence intervals based on 10 and 20-Mb regions were practically indistinguishable suggesting that effect of region size has ‘‘flattened’’ off by this point. For computational efficiency, for each simulation model we permuted the case-control labels from just one sample from the population, rather than repeatedly re-sampling cases and controls from the population. The lack of sample replication is not a problem for regions as large as 5-Mb, because there is in effect internal replication within the region. To verify this, we took 10 independent case-control samples from the standard simulation. We found that estimates of a00 from each were not significantly different at the 5% level. Genet. Epidemiol. DOI 10.1002/gepi