The maximum likelihood methods (ML and QML) and the conservative methods (CON and ADA CON) are briefly described below. All 100 replicates in GAW15 Problem 3 were analyzed. To create a case-control data set we selected the first sib from family-based data sets as independent cases (N = 1500), and used all individuals in control data sets.
Analyses were done with knowledge of the "answers" of causal markers locations.
A single-value approximation for Pearson's statistic
SNPs are bi-allelic so that the initial statistical analysis will consist of calculating Pearson's statistic to test whether the frequency of the two alleles (A, a) or three SNP genotypes (AA, Aa, aa) differs between cases and controls. For Pearson's test we can define a single parameter Δ that be interpreted as an (average) effect size. For 2 × 2 tables, for example,
where o is the odds ratio, γ and δ = 1 - γ the proportions of controls and cases, q1 and 1 - q1 the allele frequency in the controls and cases.
We can derive the following approximation [9] for the distribution of Pearson's statistic to analyze for 2 × ν contingency tables that depends on only Δ.
where χν-2 is a (central) chi-square random variable with ν - 2 degrees of freedom and is a chi-square random variable with 1 degree of freedom and non-centrality parameter . The fact that an approximation exist that depends on only a single parameter (this does not have to be the case as the asymptotic equivalent depends on many parameters) is of great importance because it means that we only have to estimate a single parameter from the data the characterize the effect size. Note that if Δ = 0, the approximation reduces to a central chi-square random variable with ν - 1 degrees of freedom under the null hypothesis. In classic works on power analysis [10], categorical data analysis [11], and text books [12], the distribution of Pearson's statistic is often approximated with a non-central chi-square distribution with ν - 1 degrees of freedom and non-centrality parameter nΔ2, which also depends on the single value Δ only. However, this approximation can be inaccurate [9].
The maximum likelihood estimators
The likelihood function on the m test statistics t1,...,tm is
where m1 = m - m0 the number of effects and m0 the number of markers without effect, f0 an approximating density function under the null, and fΔ an approximating density function under the alternative that depends on average effect size Δ. The ML estimator of m1 and the average effect size are the and that maximize function L.
Due to enormous number of terms in the sum, the likelihood cannot be evaluated directly. For example, with a total number of tests m = 100,000, of which m1 = 5 markers have an effect, there are 8.33 × 1022 terms. Therefore, we developed an implementation that uses recursive series to calculate the likelihood. In addition, we developed a quasi-likelihood approach (QML) that is computationally much easier and faster. Here the logarithm on the m test statistics t1,...,tm is
which is essentially the log-likelihood function of the mixture model.
The conservative estimator
In addition to the ML estimator, we propose an estimate of p0 that does not rely on the test statistic distribution under the alternative but capitalizes on the knowledge that in large-scale genetic studies p0 is close to 1 (CON method). We calculate a cut-off value c in such a way that the probability that a non-causal marker has test statistic value higher than c is k/m. If we denote the total number of markers whose test statistic value is higher than c as d, then this estimate of p0 is
Note that the expected number of non-causal markers with test statistic value higher than the cut off c is km0/m rather than k. This estimator can therefore be expected to be conservatively biased. However, because p0 = m0/m is close to 1, we would expect the bias to be small.
A natural idea is to choose a value for fine-tuning parameter k that minimizes the mean square error for which an analytical expression can be derived (not shown). A practical problem is that the value of k that minimizes the MSE depends on the unknown parameters p0, the average effect size, and the covariances among the markers. Alternatively, we can estimate k from the data (ADA CON method). That is, we first estimate p0 for a chosen value of k, e.g., k = 10. Second, using that point estimate, we obtain an estimate of the average effect size (e.g., by ML). Third, for the p0 and the effect size estimate, we calculate the optimal k. We repeat Steps 1 to 3 until there is no noticeable change in k. However, extensive simulation showed that this resulted in somewhat less precise estimates than just calculating a value of k using reasonable assumptions. The reason was that the conservative method appeared fairly robust against mis-specifications of k, which outweighed the additional sampling error associated with estimating k.