The maximum likelihood methods (ML and QML) and the conservative methods (CON and ADA CON) are briefly described below. All 100 replicates in GAW15 Problem 3 were analyzed. To create a case-control data set, we selected the first sib from each family in the family-based data sets as independent cases (*N* = 1500) and used all individuals in the control data sets.

Analyses were done with knowledge of the "answers", i.e., the locations of the causal markers.

### A single-value approximation for Pearson's statistic

SNPs are bi-allelic, so the initial statistical analysis consists of calculating Pearson's statistic to test whether the frequencies of the two alleles (A, a) or the three SNP genotypes (AA, Aa, aa) differ between cases and controls. For Pearson's test we can define a single parameter Δ that can be interpreted as an (average) effect size. For 2 × 2 tables, for example,

\Delta =\frac{\sqrt{\gamma \delta}\sqrt{{q}_{1}\left(1-{q}_{1}\right)}\left(o-1\right)}{\sqrt{\left(\left(o-1\right)\left(\gamma +\delta {q}_{1}\right)+1\right)\left(\left(o-1\right)\delta {q}_{1}+1\right)}},

where *o* is the odds ratio, *γ* and *δ* = 1 - *γ* are the proportions of controls and cases, and *q*_{1} and 1 - *q*_{1} are the frequencies of the two alleles in the controls.
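As a quick numerical illustration, the expression for Δ can be coded directly; the function and variable names below are ours, not part of the original method:

```python
from math import sqrt

def delta_2x2(o, gamma, q1):
    """Average effect size Delta for a 2 x 2 table, following the formula above.

    o:     odds ratio
    gamma: proportion of controls (delta = 1 - gamma is the proportion of cases)
    q1:    allele frequency in the controls
    """
    d = 1 - gamma
    num = sqrt(gamma * d) * sqrt(q1 * (1 - q1)) * (o - 1)
    den = sqrt(((o - 1) * (gamma + d * q1) + 1) * ((o - 1) * d * q1 + 1))
    return num / den
```

Note that an odds ratio of 1 (no association) gives Δ = 0, because of the factor (*o* - 1) in the numerator.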

We can derive the following approximation [9] for the distribution of Pearson's statistic for 2 × *ν* contingency tables, which depends only on Δ:

{\chi}_{\nu -2}+\left(1-{\Delta}^{2}\right){\chi}_{1}\left(\frac{n{\Delta}^{2}}{1-{\Delta}^{2}}\right),

where *χ*_{ν-2} is a (central) chi-square random variable with *ν* - 2 degrees of freedom and {\chi}_{1}\left({\scriptscriptstyle \frac{n{\Delta}^{2}}{1-{\Delta}^{2}}}\right) is a chi-square random variable with 1 degree of freedom and non-centrality parameter {\scriptscriptstyle \frac{n{\Delta}^{2}}{1-{\Delta}^{2}}}. The fact that an approximation exists that depends on only a single parameter (this does not have to be the case, as the asymptotic equivalent depends on many parameters) is of great importance, because it means that we only have to estimate a single parameter from the data to characterize the effect size. Note that if Δ = 0, the approximation reduces to a central chi-square random variable with *ν* - 1 degrees of freedom, as expected under the null hypothesis. In classic works on power analysis [10], categorical data analysis [11], and text books [12], the distribution of Pearson's statistic is often approximated with a non-central chi-square distribution with *ν* - 1 degrees of freedom and non-centrality parameter *n*Δ^{2}, which also depends only on the single value Δ. However, this approximation can be inaccurate [9].
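The approximation is straightforward to simulate. The sketch below draws from the sum of a central chi-square with *ν* - 2 degrees of freedom and the scaled non-central chi-square term, and checks the sample mean against the theoretical mean (*ν* - 2) + (1 - Δ²)(1 + λ) = (*ν* - 2) + (1 - Δ²) + *n*Δ²; the values of *ν*, *n*, and Δ are arbitrary illustrative choices:

```python
import numpy as np

# Monte Carlo sanity check of the single-value approximation above.
# nu, n, and d (= Delta) are made-up illustrative values.
rng = np.random.default_rng(0)
nu, n, d = 3, 1000, 0.10
lam = n * d**2 / (1 - d**2)              # non-centrality parameter

draws = rng.chisquare(nu - 2, 200_000) \
      + (1 - d**2) * rng.noncentral_chisquare(1, lam, 200_000)

# E[chi2_{nu-2}] = nu - 2 and E[(1-D^2) * chi2_1(lam)] = (1-D^2)(1 + lam)
#               = (1-D^2) + n*D^2, so the approximate mean is:
expected_mean = (nu - 2) + (1 - d**2) + n * d**2
```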

### The maximum likelihood estimators

The likelihood function based on the *m* test statistics *t*_{1},...,*t*_{m} is

L({m}_{1},\Delta )=\frac{1}{\left(\begin{array}{c}m\\ {m}_{1}\end{array}\right)}\left({\displaystyle \prod _{i=1}^{m}{f}_{0}({t}_{i})}\right){\displaystyle \sum _{\{{i}_{1},\mathrm{...},{i}_{{m}_{1}}\}\subset \{1,\mathrm{...},m\}}\frac{{f}_{\Delta}\left({t}_{{i}_{1}}\right)}{{f}_{0}\left({t}_{{i}_{1}}\right)}}\times \mathrm{...}\times \frac{{f}_{\Delta}\left({t}_{{i}_{{m}_{1}}}\right)}{{f}_{0}\left({t}_{{i}_{{m}_{1}}}\right)},

where *m*_{1} = *m* - *m*_{0} is the number of markers with an effect and *m*_{0} the number of markers without an effect, *f*_{0} is an approximating density function under the null, and *f*_{Δ} is an approximating density function under the alternative that depends on the average effect size Δ. The ML estimators of *m*_{1} and the average effect size are the {\widehat{m}}_{1} and \stackrel{\u2322}{\Delta} that maximize the function *L*.
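The sum in *L* runs over all subsets of size *m*_{1}, but it equals the elementary symmetric polynomial of the per-marker likelihood ratios *f*_{Δ}(*t*_{i})/*f*_{0}(*t*_{i}), which can be built up marker by marker. The sketch below illustrates this idea with made-up ratios; it is not necessarily the exact recursive-series implementation referred to in this section:

```python
from itertools import combinations
from math import prod

def esp(r, k):
    """Elementary symmetric polynomial e_k(r): the sum over all size-k
    subsets of r of the product of the selected elements."""
    e = [1.0] + [0.0] * k
    for ri in r:
        # update from the top so each ratio enters any product at most once
        for j in range(k, 0, -1):
            e[j] += ri * e[j - 1]
    return e[k]

# Brute-force check against direct enumeration of subsets.
r = [0.5, 2.0, 1.5, 3.0, 0.2, 1.1]   # made-up likelihood ratios f_D(t_i)/f_0(t_i)
direct = sum(prod(c) for c in combinations(r, 3))
```

The recursion costs O(*m* · *m*_{1}) operations instead of enumerating all C(*m*, *m*_{1}) subsets; in practice it would be run in log space or with rescaling to avoid underflow and overflow.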

Due to the enormous number of terms in the sum, the likelihood cannot be evaluated directly. For example, with a total number of tests *m* = 100,000, of which *m*_{1} = 5 markers have an effect, there are 8.33 × 10^{22} terms. Therefore, we developed an implementation that uses recursive series to calculate the likelihood. In addition, we developed a quasi-likelihood approach (QML) that is computationally much easier and faster. Here the quasi-log-likelihood based on the *m* test statistics *t*_{1},...,*t*_{m} is

{\ell}_{quasi}({p}_{0},\Delta )={\displaystyle \sum _{i=1}^{m}\mathrm{log}\{{p}_{0}{f}_{0}({t}_{i})+(1-{p}_{0}){f}_{\Delta}({t}_{i})\},}

which is essentially the log-likelihood function of the mixture model.

### The conservative estimator

In addition to the ML estimator, we propose an estimate of *p*_{0} that does not rely on the test statistic distribution under the alternative but capitalizes on the knowledge that in large-scale genetic studies *p*_{0} is close to 1 (CON method). We calculate a cut-off value *c* such that the probability that a non-causal marker has a test statistic value higher than *c* is *k*/*m*. If we denote the total number of markers whose test statistic value is higher than *c* as *d*, then this estimate of *p*_{0} is

{\stackrel{\u2322}{p}}_{0}=1-\frac{d-k}{m}.

Note that the expected number of non-causal markers with a test statistic value higher than the cut-off *c* is *km*_{0}/*m* rather than *k*. This estimator can therefore be expected to be conservatively biased. However, because *p*_{0} = *m*_{0}/*m* is close to 1, we expect the bias to be small.
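The CON estimator amounts to three lines of computation. The sketch below applies it to simulated 1-df statistics; *m*, *m*_{0}, the non-centrality of the causal markers, and *k* are all made-up values:

```python
import numpy as np
from scipy.stats import chi2

# Illustration of the conservative (CON) estimator on simulated data.
rng = np.random.default_rng(2)
m, m0, k = 10_000, 9_800, 10
t = np.concatenate([
    rng.chisquare(1, m0),                       # non-causal markers
    rng.noncentral_chisquare(1, 20.0, m - m0),  # causal markers
])

c = chi2.isf(k / m, df=1)    # cut-off: P(chi2_1 > c) = k/m for a non-causal marker
d = int(np.sum(t > c))       # number of markers exceeding the cut-off
p0_hat = 1 - (d - k) / m     # conservative estimate of p0
```

Because most of the causal markers exceed the cut-off while only about *k* null markers do, the estimate lands close to, and slightly above, the true *p*_{0}.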

A natural idea is to choose the value of the fine-tuning parameter *k* that minimizes the mean square error MSE\left(k\right)=E{({\stackrel{\u2322}{p}}_{0}-{p}_{0})}^{2}, for which an analytical expression can be derived (not shown). A practical problem is that the value of *k* that minimizes the MSE depends on the unknown parameters: *p*_{0}, the average effect size, and the covariances among the markers. Alternatively, we can estimate *k* from the data (ADA CON method). That is, we first estimate *p*_{0} for a chosen value of *k*, e.g., *k* = 10. Second, using that point estimate, we obtain an estimate of the average effect size (e.g., by ML). Third, for these estimates of *p*_{0} and the effect size, we calculate the optimal *k*. We repeat steps 1 to 3 until there is no noticeable change in *k*. However, extensive simulation showed that this resulted in somewhat less precise estimates than simply calculating a value of *k* under reasonable assumptions. The reason is that the conservative method appeared fairly robust against misspecification of *k*, which outweighed the additional sampling error associated with estimating *k*.