Estimating the number and size of the main effects in genome-wide case-control association studies

Kuo, Po-Hsiu; Bukszár, József; van den Oord, Edwin JCG

doi:10.1186/1753-6561-1-S1-S143

Volume 1 Supplement 1

Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci

Proceedings
Open access
Published: 18 December 2007

Po-Hsiu Kuo^1,2,
József Bukszár³ &
Edwin JCG van den Oord^1,3

BMC Proceedings volume 1, Article number: S143 (2007)

Abstract

It has recently become possible to screen thousands of markers to detect genetic causes of common diseases. Along with this potential come analytical challenges, and it is important to develop new statistical tools to identify markers with causal effects and accurately estimate their effect sizes. Knowledge of the proportion of markers without true effects (p₀) and the effect sizes of markers with effects provides information to control for false discoveries and to design follow-up studies. We apply newly developed methods to simulated Genetic Analysis Workshop 15 genome-wide case-control data sets, including a maximum likelihood (ML) and a quasi-ML (QML) approach that incorporate the test statistic distribution and estimate effect size simultaneously with p₀, and two conservative estimators of p₀ that do not rely on the test statistic distribution under the alternative. Compared with four commonly used existing estimators of p₀, all of our estimators have favorable properties in terms of the standard deviation with which p₀ is estimated. On average, the ML method performed slightly better than the QML method; the conservative method performed well, was even slightly more precise than the ML estimators, and can be more robust in less optimal conditions (small sample sizes and small numbers of markers). Further improvements and extensions of the proposed methods are conceivable, such as estimating the distribution of effect sizes and taking population stratification into account when obtaining estimates of p₀ and effect size.

Background

Due to the rapid advances in genotyping technology, genome-wide association studies with hundreds of thousands of markers are now possible. These large scale genetic studies offer great promise to expedite the discovery of the common genetic variants affecting common diseases [1]. A first step in the analyses is to understand the properties of the massive data sets. Among the most fundamental properties are the proportion of markers without true effects (p₀) and the effect sizes (Δ) of markers with effects. Knowledge of these parameters provides information about how relevant the genotyped markers are for the disease outcome. In addition, these parameters play a role in a variety of applications. For example, estimates of p₀ are commonly used in methods for controlling false discoveries, which is important to prevent spending time and resources on leads that will eventually prove irrelevant. Another example is that knowledge of effect size Δ is important to design follow-up studies that have adequate power to replicate previous findings.

Multiple methods have been proposed to estimate p₀ [2–4]. These estimators tend to make general assumptions about the distribution of the test statistic under the alternative hypothesis, partly because many of them have been developed in the context of microarray research, where specific assumptions may be problematic. However, in genetics, good approximations for the test statistic distribution are often available. This information can be used to obtain more precise estimators in these applications. Once (a set of) markers have been identified as being associated with the disease, the next objective is typically to estimate the effect sizes. Currently, the most commonly used approach simply estimates the effect size of the significant markers in the same sample that has been used for testing. Due to the effects of sampling error and the presence of false positives, this approach overestimates the effect sizes considerably [5, 6]. Several methods have been proposed to obtain less biased effect size estimates, such as a simple split-sample method, cross-validation, and bootstrap resampling [7]. However, with all these methods that use the same sample to first declare significance and then estimate the effect sizes of the significant findings, it will remain difficult to obtain estimates that are both precise and unbiased. We therefore proposed a set of related methods (unpublished data) that estimate the average effect size Δ of the proportion 1 - p₀ of markers with effects. Because our estimates are not confined to only those markers that are declared significant, they do not suffer from the upward bias caused by sampling fluctuations producing large test statistics in this specific sample. Our methods do not assign effect size estimates to individual markers but estimate the average over all markers with effects. This does not hamper the design of subsequent replication studies: for any critical value chosen to declare significance, we can calculate the number of markers with effects among the significant markers, together with their average effect size.

Our methods include a maximum likelihood (ML) and a quasi-ML (QML) approach that incorporate the test statistic distribution and estimate Δ simultaneously with p₀. In addition, we propose a conservative estimator of p₀ (CON) and a variation of this conservative estimator that adaptively estimates a fine-tuning parameter (ADA CON). Neither CON nor ADA CON relies on the test statistic distribution under the alternative; both instead take advantage of the specific knowledge that in large-scale genetic studies p₀ must be very close to 1. Because these conservative estimators do not consider the distribution under the alternative hypothesis, they cannot estimate the average effect size directly. However, we can still take the point estimate of p₀ from these conservative methods and, in a second step, include it in an ML method to estimate the average effect size Δ. We apply our methods to the simulated rheumatoid arthritis (RA) case-control data with 10 k single-nucleotide polymorphisms (SNPs) in Genetic Analysis Workshop 15 (GAW15). We chose a case-control design with SNPs because this is one of the most important designs for mapping the genetic determinants of complex human diseases through genome-wide association studies. We also compared our estimators with four existing estimators. We found two studies comparing multiple and non-overlapping sets of estimators [2, 3]. In these studies, the lowest slope (LOW S) and location based estimator (LBE) showed the most favorable properties and were therefore included here. In addition, estimators developed by Storey (STO) [4] and Storey-Tibshirani (STO-TIB) [8] were included because they may be among the more commonly used estimators.

Methods

The maximum likelihood methods (ML and QML) and the conservative methods (CON and ADA CON) are briefly described below. All 100 replicates in GAW15 Problem 3 were analyzed. To create a case-control data set we selected the first sib from family-based data sets as independent cases (N = 1500), and used all individuals in control data sets.

Analyses were done with knowledge of the "answers", that is, the locations of the causal markers.

A single-value approximation for Pearson's statistic

SNPs are bi-allelic so that the initial statistical analysis will consist of calculating Pearson's statistic to test whether the frequency of the two alleles (A, a) or three SNP genotypes (AA, Aa, aa) differs between cases and controls. For Pearson's test we can define a single parameter Δ that can be interpreted as an (average) effect size. For 2 × 2 tables, for example,

Δ = \frac{\sqrt{γ δ} \sqrt{q_{1} (1 - q_{1})} (o - 1)}{\sqrt{((o - 1) (γ + δ q_{1}) + 1) ((o - 1) δ q_{1} + 1)}} .

where o is the odds ratio, γ and δ = 1 - γ the proportions of controls and cases, q₁ and 1 - q₁ the allele frequency in the controls and cases.
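As a minimal sketch, the 2 × 2 expression for Δ can be evaluated directly. The function and argument names are illustrative, the formula is transcribed as printed above, and the input values (odds ratio 1.3, equal proportions of controls and cases, allele frequency 0.3) are hypothetical.

```python
import math


def effect_size_delta(odds_ratio: float, gamma: float, q1: float) -> float:
    """Effect size Delta for a 2 x 2 table, transcribed from the printed formula."""
    delta_prop = 1.0 - gamma          # the proportion called delta in the text
    o1 = odds_ratio - 1.0             # the factor (o - 1)
    numerator = math.sqrt(gamma * delta_prop) * math.sqrt(q1 * (1.0 - q1)) * o1
    denominator = math.sqrt(
        (o1 * (gamma + delta_prop * q1) + 1.0) * (o1 * delta_prop * q1 + 1.0)
    )
    return numerator / denominator


# Hypothetical inputs, chosen only for illustration.
print(effect_size_delta(odds_ratio=1.3, gamma=0.5, q1=0.3))
```

Δ equals 0 when the odds ratio is 1 and takes the sign of o - 1.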

We can derive the following approximation [9] for the distribution of Pearson's statistic for 2 × ν contingency tables, which depends only on Δ:

χ_{ν - 2} + (1 - Δ^{2}) χ_{1} (\frac{n Δ^{2}}{1 - Δ^{2}}),

where $χ_{ν - 2}$ is a (central) chi-square random variable with ν - 2 degrees of freedom and $χ_{1} (\frac{n Δ^{2}}{1 - Δ^{2}})$ is a chi-square random variable with 1 degree of freedom and non-centrality parameter $\frac{n Δ^{2}}{1 - Δ^{2}}$. The fact that an approximation exists that depends on only a single parameter (this does not have to be the case, as the asymptotic equivalent depends on many parameters) is of great importance because it means that we only have to estimate a single parameter from the data to characterize the effect size. Note that if Δ = 0, the approximation reduces to a central chi-square random variable with ν - 1 degrees of freedom under the null hypothesis. In classic works on power analysis [10], categorical data analysis [11], and textbooks [12], the distribution of Pearson's statistic is often approximated with a non-central chi-square distribution with ν - 1 degrees of freedom and non-centrality parameter nΔ², which also depends on the single value Δ only. However, this approximation can be inaccurate [9].
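To illustrate, the approximation can be sampled directly and, for 2 × 2 tables (ν = 2), turned into a closed-form power calculation, because a non-central chi-square variable with 1 degree of freedom is the square of a shifted standard normal variable. The sketch below is not the authors' software; function names and the input values for n, Δ, and the significance level are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def sample_pearson_approx(n, delta, nu, rng):
    """Draw one value from chi2_{nu-2} + (1 - delta^2) * chi2_1(lambda)."""
    scale = 1.0 - delta ** 2
    lam = n * delta ** 2 / scale                      # non-centrality parameter
    central = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu - 2))
    noncentral = (rng.gauss(0.0, 1.0) + math.sqrt(lam)) ** 2
    return central + scale * noncentral


def power_2x2(n, delta, alpha):
    """P(statistic > critical value) for nu = 2, where the central term vanishes."""
    scale = 1.0 - delta ** 2
    mu = math.sqrt(n * delta ** 2 / scale)            # square root of lambda
    crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0) ** 2  # chi-square (1 df) critical value
    root = math.sqrt(crit / scale)
    return 1.0 - (STD_NORMAL.cdf(root - mu) - STD_NORMAL.cdf(-root - mu))


# Hypothetical inputs, chosen only for illustration.
rng = random.Random(1)
print([sample_pearson_approx(3000, 0.08, 2, rng) for _ in range(5)])
print(power_2x2(3000, 0.08, 0.05))
```

With Δ = 0 the power function returns the significance level itself, which matches the reduction to a central chi-square distribution noted above.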

The maximum likelihood estimators

The likelihood function on the m test statistics t₁,...,t_m is

L (m_{1}, Δ) = \frac{1}{\binom{m}{m_{1}}} \left( \prod_{i = 1}^{m} f_{0} (t_{i}) \right) \sum_{\{i_{1}, \ldots, i_{m_{1}}\} \subseteq \{1, \ldots, m\}} \frac{f_{Δ} (t_{i_{1}})}{f_{0} (t_{i_{1}})} \times \cdots \times \frac{f_{Δ} (t_{i_{m_{1}}})}{f_{0} (t_{i_{m_{1}}})},

where m₁ = m - m₀ is the number of markers with effects, m₀ is the number of markers without effects, f₀ is an approximating density function under the null, and f_Δ is an approximating density function under the alternative that depends on the average effect size Δ. The ML estimators of m₁ and the average effect size are the values $\hat{m}_{1}$ and $\hat{Δ}$ that maximize L.
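The sum in L runs over all subsets of m₁ markers, so it is the elementary symmetric polynomial of degree m₁ in the likelihood ratios f_Δ(tᵢ)/f₀(tᵢ). The sketch below evaluates it with the standard one-pass recursion for elementary symmetric polynomials and checks the result against brute-force enumeration. This recursion is an assumption chosen for illustration and is not claimed to be the authors' implementation; the ratio values are hypothetical.

```python
import math
from itertools import combinations


def elementary_symmetric(ratios, k):
    """Sum over all size-k subsets of the product of their entries."""
    e = [1.0] + [0.0] * k              # e[j] holds e_j of the ratios seen so far
    for r in ratios:
        for j in range(k, 0, -1):      # descend so each ratio is used at most once
            e[j] += r * e[j - 1]
    return e[k]


def subset_term(ratios, m1):
    """The subset sum in L(m1, Delta) divided by the binomial coefficient."""
    return elementary_symmetric(ratios, m1) / math.comb(len(ratios), m1)


def subset_term_brute_force(ratios, m1):
    total = sum(math.prod(subset) for subset in combinations(ratios, m1))
    return total / math.comb(len(ratios), m1)


# Hypothetical likelihood ratios f_Delta(t_i) / f_0(t_i) for six markers.
ratios = [0.8, 1.5, 0.3, 4.0, 1.1, 0.9]
assert math.isclose(subset_term(ratios, 3), subset_term_brute_force(ratios, 3))

# Number of subsets that brute force would visit for m = 100,000 and m1 = 5.
print(math.comb(100_000, 5))
```

The recursion needs on the order of m × m₁ multiplications instead of one product per subset. For realistic m the products should be accumulated on the log scale to avoid overflow, which this sketch omits.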

Due to the enormous number of terms in the sum, the likelihood cannot be evaluated directly. For example, with a total number of tests m = 100,000, of which m₁ = 5 markers have an effect, there are 8.33 × 10²² terms. Therefore, we developed an implementation that uses recursive series to calculate the likelihood. In addition, we developed a quasi-likelihood approach (QML) that is computationally much easier and faster. Here the log quasi-likelihood of the m test statistics t₁,...,t_m is

ℓ_{quasi} (p_{0}, Δ) = \sum_{i = 1}^{m} \log \{ p_{0} f_{0} (t_{i}) + (1 - p_{0}) f_{Δ} (t_{i}) \},

which is essentially the log-likelihood function of the mixture model.
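A minimal sketch of the quasi-likelihood for 2 × 2 tables follows. It takes f₀ to be the central chi-square density with 1 degree of freedom and f_Δ to be the density of (1 - Δ²) times a non-central chi-square variable with 1 degree of freedom, in line with the approximation given earlier. The grid search, the simulated data, and all names and parameter values are assumptions made for illustration; the source does not state how the maximization was carried out.

```python
import math
import random


def phi(z):
    """Standard normal density."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)


def f0(t):
    """Null density: central chi-square with 1 df."""
    return math.exp(-t / 2.0) / math.sqrt(2.0 * math.pi * t)


def f_delta(t, delta, n):
    """Density of (1 - delta^2) * chi2_1(lambda), lambda = n delta^2 / (1 - delta^2)."""
    scale = 1.0 - delta ** 2
    mu = math.sqrt(n * delta ** 2 / scale)
    root = math.sqrt(t / scale)
    return (phi(root - mu) + phi(root + mu)) / (2.0 * root * scale)


def quasi_loglik(stats, p0, delta, n):
    """Mixture log-likelihood summed over all test statistics."""
    return sum(math.log(p0 * f0(t) + (1.0 - p0) * f_delta(t, delta, n)) for t in stats)


def fit_qml(stats, n, p0_grid, delta_grid):
    """Grid search for the (p0, delta) pair with the largest quasi-likelihood."""
    pairs = ((p0, d) for p0 in p0_grid for d in delta_grid)
    return max(pairs, key=lambda pd: quasi_loglik(stats, pd[0], pd[1], n))


# Simulated data with hypothetical settings: 2000 markers, 40 of them with effects.
rng = random.Random(2007)
n, true_delta, m, m1 = 3000, 0.08, 2000, 40
scale = 1.0 - true_delta ** 2
mu = math.sqrt(n * true_delta ** 2 / scale)
stats = [rng.gauss(0.0, 1.0) ** 2 for _ in range(m - m1)]
stats += [scale * (rng.gauss(0.0, 1.0) + mu) ** 2 for _ in range(m1)]

p0_grid = [(190 + i) / 200 for i in range(11)]   # 0.950, 0.955, ..., 1.000
delta_grid = [0.02 * i for i in range(1, 8)]     # 0.02, 0.04, ..., 0.14
print(fit_qml(stats, n, p0_grid, delta_grid))
```

When p₀ = 1 the second mixture component drops out and the quasi-likelihood no longer depends on Δ, which is why the effect size cannot be estimated in that case.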

The conservative estimator

In addition to the ML estimator, we propose an estimate of p₀ that does not rely on the test statistic distribution under the alternative but capitalizes on the knowledge that in large-scale genetic studies p₀ is close to 1 (CON method). We calculate a cut-off value c in such a way that the probability that a non-causal marker has a test statistic value higher than c is k/m. If we denote the total number of markers whose test statistic value is higher than c as d, then this estimate of p₀ is

\hat{p}_{0} = 1 - \frac{d - k}{m} .

Note that the expected number of non-causal markers with test statistic value higher than the cut off c is km₀/m rather than k. This estimator can therefore be expected to be conservatively biased. However, because p₀ = m₀/m is close to 1, we would expect the bias to be small.
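The CON estimate can be sketched in a few lines. The sketch assumes 2 × 2 tables, so that the null distribution of each test statistic is central chi-square with 1 degree of freedom and the cut-off c follows from a normal quantile. The function name and the toy data are hypothetical.

```python
from statistics import NormalDist


def con_estimate(stats, k):
    """Conservative estimate of p0 from chi-square (1 df) test statistics."""
    m = len(stats)
    # P(chi2_1 > c) = k/m  is equivalent to  c = z^2 with Phi(z) = 1 - k/(2m).
    c = NormalDist().inv_cdf(1.0 - k / (2.0 * m)) ** 2
    d = sum(1 for t in stats if t > c)   # markers above the cut-off
    return 1.0 - (d - k) / m


# Toy data: 980 unremarkable statistics and 20 very large ones.
toy_stats = [0.5] * 980 + [30.0] * 20
print(con_estimate(toy_stats, k=10))
```

The sketch returns the raw value of the formula, which exceeds 1 whenever d < k; the source does not say whether such estimates are truncated at 1.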

A natural idea is to choose a value for the fine-tuning parameter k that minimizes the mean square error $MSE(k) = E[(\hat{p}_{0} - p_{0})^{2}]$, for which an analytical expression can be derived (not shown). A practical problem is that the value of k that minimizes the MSE depends on the unknown parameters p₀, the average effect size, and the covariances among the markers. Alternatively, we can estimate k from the data (ADA CON method). First, we estimate p₀ for a chosen value of k, e.g., k = 10. Second, using that point estimate, we obtain an estimate of the average effect size (e.g., by ML). Third, for the p₀ and effect size estimates, we calculate the optimal k. We repeat Steps 1 to 3 until there is no noticeable change in k. However, extensive simulations showed that this resulted in somewhat less precise estimates than just calculating a value of k using reasonable assumptions. The reason was that the conservative method appeared fairly robust against misspecification of k, so the additional sampling error associated with estimating k outweighed any gain from tuning it.

Results

We identified four markers on chromosome 6 with extremely low p-values and effect sizes that were five times larger than the average effect sizes of the other markers with effects (see Table 1). Because a complex statistical method is not needed to detect such effects, we excluded these four markers and analyzed the remaining set of markers (N = 9183). Table 1 displays results across the 100 replicates. Whereas our estimators and LOW S never estimated p₀ to be 1, LBE consistently estimated p₀ to be 1, and STO and STO-TIB were somewhere in between. The mean p₀ estimates were very close to each other in our four new methods but deviated from four existing methods. The only exception was the LOW S method, in which the mean p₀ estimate was closer to what we obtained from the new methods. The precision of p₀ estimates was also high in the new methods as the standard deviations were small.

Table 1 Estimating p₀ and average effect size with different methods using all 100 replicates

Based on the p₀ estimate in our new methods, the average number of total causal markers with main effects was 18. The average numbers of causal markers in the LOW S, STO-TIB, STO, and LBE were 21, 150, 256, and 0, respectively. Clearly, STO-TIB and STO overestimated the number of effects and LBE underestimated the number of effects. It is also important to note that standard errors of the estimates were about 100 times larger for STO-TIB and STO, implying that the number of markers was estimated very imprecisely.
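The relation between a p₀ estimate and the implied number of markers with effects is m₁ = m(1 - p₀). As a small worked example, the reported average counts can be converted back to the p₀ values they imply. This assumes that the counts refer to the m = 9183 analyzed markers; because the counts are rounded averages, the implied values need not match Table 1 exactly.

```python
M_ANALYZED = 9183  # markers left after excluding the four chromosome 6 markers

# Average numbers of causal markers as reported in the text.
reported_counts = {
    "new methods": 18,
    "LOW S": 21,
    "STO-TIB": 150,
    "STO": 256,
    "LBE": 0,
}


def implied_p0(m1, m=M_ANALYZED):
    """Proportion of markers without effects implied by m1 markers with effects."""
    return 1.0 - m1 / m


for method, m1 in reported_counts.items():
    print(f"{method}: m1 = {m1}, implied p0 = {implied_p0(m1):.4f}")
```

Even the largest count, 256 for STO, corresponds to a p₀ above 0.97, which illustrates how close to 1 all the estimates are on this scale.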

The second part of Table 1 shows results for the estimated average effect size Δ. The ML methods estimate Δ and p₀ simultaneously. The other estimators do not consider the distribution under the alternative hypothesis and therefore cannot estimate the average effect size directly. However, in these cases we can still use the point estimate $p_{0}^{*}$ obtained with these estimators and include it in a maximum likelihood method that finds $\hat{Δ}$ by maximizing $ℓ(m - m p_{0}^{*}, Δ)$. In cases where the point estimate $p_{0}^{*}$ equals 1, the effect size cannot be estimated. In these scenarios Δ was treated as "missing". Results showed that the estimated average effect size was 0.083 in all four new methods, among which the ML method was slightly more precise. The estimated average effect size was less precise and considerably lower with STO and STO-TIB, reflecting the downward bias and larger standard deviation in these p₀ estimates.

Discussion

Results illustrated that all of our four new estimators have favorable properties in terms of the standard deviation with which p₀ is estimated. The ML and QML estimators have the additional advantage that they provide a direct estimate of average effect size Δ. Because the point estimates of p₀ in both CON and ADA CON methods are very similar to that in the ML and QML methods, the average effect size is expected to be similar across methods. This is important because these two parameters are somewhat intertwined and the estimate of the average effect size helps the interpretation of the p₀ estimate. For example, without this effect size estimate, it is unclear whether the estimated numbers of causal markers have very small or large effects.

On average, the ML method performed slightly better than the QML method. Furthermore, we found in other simulations that the QML estimator can be unstable. In general, the ML method may therefore be the method of choice. Results also showed that the CON method performed well and was even slightly more precise than the ML estimators. One reason is that the CON method only estimates a single parameter, whereas the ML methods estimate two parameters. However, this observation is also consistent with previous simulations showing that in less optimal conditions (small sample sizes and small number of markers), the CON method can be more robust. Indeed, as another example of its relative robustness, the CON method performed equally well when the four markers with extremely large effects were included but the ML estimators became somewhat less precise.

Linkage disequilibrium causes test statistics between markers to be correlated. Extensive simulations were performed to examine the impact of such correlated tests on our estimates of p₀ and Δ (data not shown). Results demonstrated that correlated tests mainly increased the variance of these estimates but did not introduce bias. This makes intuitive sense and essentially mimics other scenarios where certain statistics (e.g., the mean) are estimated from correlated observations.

Further improvements and extensions of the proposed methods are conceivable. An example involves work we are currently doing to estimate the distribution of effect sizes. The extension essentially consists of conditioning on the number of markers with effects and then maximizing the likelihood L(Δ|m₁). Thus, we start with estimating the largest effect in the data set, then the second largest, continuing until the estimated effect sizes become (very) small. Another example is that in case-control studies, population stratification can cause spurious associations between marker alleles and disease status when both disease prevalence and allele frequencies differ among subgroups. Using the principle of genomic control [1, 13, 14], our estimators can be further adapted to obtain estimates of p₀ and Δ that take stratification into account.

References

1. Hirschhorn JN, Daly MJ: Genome-wide association studies for common diseases and complex traits. Nat Rev Genet. 2005, 6: 95-108. 10.1038/nrg1521.
2. Dalmasso C, Broet P, Moreau T: A simple procedure for estimating the false discovery rate. Bioinformatics. 2005, 21: 660-668. 10.1093/bioinformatics/bti063.
3. Hsueh HM, Chen JJ, Kodell RL: Comparison of methods for estimating the number of true null hypotheses in multiplicity testing. J Biopharm Stat. 2003, 13: 675-689. 10.1081/BIP-120024202.
4. Storey JD: A direct approach to false discovery rates. J R Stat Soc Ser B Stat Method. 2002, 64: 479-498. 10.1111/1467-9868.00346.
5. Goring HHH, Terwilliger JD, Blangero J: Large upward bias in estimation of locus-specific effects from genomewide scans. Am J Hum Genet. 2001, 69: 1357-1369. 10.1086/324471.
6. Ioannidis JPA, Ntzani EE, Trikalinos TA, Contopoulos-Ioannidis DG: Replication validity of genetic association studies. Nat Genet. 2001, 29: 306-309. 10.1038/ng749.
7. Sun L, Bull SB: Reduction of selection bias in genomewide studies by resampling. Genet Epidemiol. 2005, 28: 352-367. 10.1002/gepi.20068.
8. Storey JD, Tibshirani R: Statistical significance for genomewide studies. Proc Natl Acad Sci USA. 2003, 100: 9440-9445. 10.1073/pnas.1530509100.
9. Bukszar J, van den Oord E: Accurate and efficient power calculations for 2 × m tables in unmatched case-control designs. Stat Med. 2006, 25: 2632-2646. 10.1002/sim.2269.
10. Cohen J: Statistical Power Analysis for the Behavioral Sciences. 1988, Hillsdale: Erlbaum.
11. Agresti A: Categorical Data Analysis. 1990, New York: Wiley.
12. Weir BS: Genetic Data Analysis II. 1996, Sunderland: Sinauer Associates.
13. Devlin B, Jones BL, Bacanu SA, Roeder K: Mixture models for linkage analysis of affected sibling pairs and covariates. Genet Epidemiol. 2002, 22: 52-65. 10.1002/gepi.1043.
14. Devlin B, Roeder K: Genomic control for association studies. Biometrics. 1999, 55: 997-1004. 10.1111/j.0006-341X.1999.00997.x.

Acknowledgements

This work was partly supported by U.S. National Institutes of Health grant R01-AA-11408. Preparation of this manuscript was supported by a Young Investigator award from the National Alliance for Research on Schizophrenia and Depression to Po-Hsiu Kuo.

This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.

Author information

Authors and Affiliations

Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, 800 East Leigh Street, Biotech 1, VIPBG, Suite 1-130, Richmond, Virginia, 23219, USA
Po-Hsiu Kuo & Edwin JCG van den Oord
Institute of Clinical Medicine, College of Medicine, National Cheng Kung University, 138, Sheng-Li Road, Tainan, 704, Taiwan
Po-Hsiu Kuo
Center for Biomarker Research and Personalized Medicine, Virginia Commonwealth University, 410 North 12th Street, R Blackwell Smith Building, Richmond, Virginia, 23219, USA
József Bukszár & Edwin JCG van den Oord

Corresponding author

Correspondence to Po-Hsiu Kuo.

Additional information

Competing interests

The author(s) declare that they have no competing interests.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Kuo, PH., Bukszár, J. & van den Oord, E.J. Estimating the number and size of the main effects in genome-wide case-control association studies. BMC Proc 1 (Suppl 1), S143 (2007). https://doi.org/10.1186/1753-6561-1-S1-S143

