Let *n* denote the number of individuals in a study. Let *y*_{i} denote the measure of a quantitative phenotype *y* on individual *i*. The score of the genotype at a SNP is denoted by *g*. Assuming that the two alleles of the SNP are denoted by A and B, we set *g* = 0 for genotype AA, *g* = 1 for genotype AB, and *g* = 2 for genotype BB. The genotype score on individual *i* is denoted by *g*_{i}.

To identify SNPs associated with phenotype *y*, we consider a regression in which the response variable is the genotype score *g* and the independent variable is the phenotype *y*. The (partial) regression sum of squares for *g* equals:

$$\frac{\left[\sum_{i=1}^{n} (g_i - \bar{g})(y_i - \bar{y})\right]^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \tag{1}$$

where $\bar{g}$ and $\bar{y}$ are the sample means of *g*_{i} and *y*_{i}, respectively. Because the denominator in expression (1) remains constant in a genome scan, a natural measure of association would be:

$$\sum_{i=1}^{n} (g_i - \bar{g})(y_i - \bar{y}). \tag{2}$$

Note that in a case-control study, $\sum_{i=1}^{n} (g_i - \bar{g})(y_i - \bar{y})$ is proportional to the difference in the frequency of allele B between case subjects and control subjects. If there are *p* covariates *x*_{j}, *j* = 1, …, *p*, then phenotype *y*_{i} is replaced by the residual $y_i^*$ from the following linear regression:

$$y_i = \alpha_0 + \sum_{j=1}^{p} \alpha_j x_{ij} + \varepsilon_i, \tag{3}$$

where *x*_{ij} is the value of *x*_{j} for individual *i*. Similarly, genotype *g* is replaced by the residual *g** of the regression of *g* on *x*_{j}, *j* = 1, …, *p*.
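The covariate adjustment described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper name `residualize`, the inclusion of an intercept, and the toy data (six individuals, two covariates) are not from the original text.

```python
import numpy as np

def residualize(v, X):
    """Return residuals of v after least-squares regression on X plus an intercept.

    Hypothetical helper: v is a length-n vector (phenotype or genotype score),
    X is an n-by-p covariate matrix.
    """
    v = np.asarray(v, dtype=float)
    X = np.asarray(X, dtype=float)
    design = np.column_stack([np.ones(len(v)), X])  # intercept + p covariates
    coef, *_ = np.linalg.lstsq(design, v, rcond=None)
    return v - design @ coef

# Toy data, invented for illustration only.
X = np.array([[30, 0], [45, 1], [52, 0], [38, 1], [61, 1], [27, 0]], dtype=float)
y = np.array([1.2, 2.9, 2.4, 2.5, 3.8, 0.9])
g = np.array([0, 1, 2, 1, 2, 0], dtype=float)

y_star = residualize(y, X)  # adjusted phenotype
g_star = residualize(g, X)  # adjusted genotype score
```

Because the design matrix contains an intercept, both residual vectors have sample mean 0 and are orthogonal to every covariate. The sketch does not handle missing genotypes; in practice the genotype regression would presumably be restricted to subjects with an observed genotype.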

We use the statistic

$$S = \frac{1}{n^*} \sum_{i=1}^{n^*} g_i^* y_i^* \tag{4}$$

to measure the strength of association, where *n** is the number of subjects whose genotype is nonmissing at the SNP being investigated, and the sum runs over those subjects. The sample mean of $y_i^*$ is 0, so it is dropped from the definition of *S*. So is the sample mean of $g_i^*$. Dividing by *n** puts the statistic *S* on the same scale across SNPs, because *n** is expected to vary across the genome.
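As a rough illustration of how *n** enters the computation, the sketch below assumes that *S* is the sum of products of the two residuals over subjects with an observed genotype, divided by *n**. That form, the function name `association_score`, the use of NaN to code a missing genotype, and the toy values are all assumptions made for this example.

```python
import numpy as np

def association_score(g_star, y_star):
    """Assumed form of S: (1 / n*) * sum of g*_i * y*_i over nonmissing genotypes.

    Missing genotype residuals are coded as NaN (an assumption of this sketch).
    """
    g_star = np.asarray(g_star, dtype=float)
    y_star = np.asarray(y_star, dtype=float)
    observed = ~np.isnan(g_star)      # subjects with a genotype at this SNP
    n_star = int(observed.sum())      # n* varies from SNP to SNP
    return float(np.dot(g_star[observed], y_star[observed]) / n_star)

# Toy residuals, invented for illustration; the third subject has no genotype.
g_res = np.array([-0.5, 0.5, np.nan, 1.0, -1.0])
y_res = np.array([-0.2, 0.1, 0.4, 0.3, -0.6])
S = association_score(g_res, y_res)  # (0.1 + 0.05 + 0.3 + 0.6) / 4
```

Only a dot product and a count are needed per SNP, so no regression has to be refitted for each marker in a genome scan.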

In comparison, the usual method for detecting association would consider the following regression:

$$y_i^* = \beta_0 + \beta_1 g_i^* + \varepsilon_i. \tag{5}$$

The least-squares estimate of *β*_{1}, denoted $\hat{\beta}_1$, is:

$$\hat{\beta}_1 = \frac{\sum_{i} g_i^* y_i^*}{\sum_{i} (g_i^*)^2}. \tag{6}$$

The statistic for testing for association is:

$$T = \frac{\hat{\beta}_1}{\mathrm{SE}(\hat{\beta}_1)}, \tag{7}$$

where $\mathrm{SE}(\hat{\beta}_1)$ is the standard error of $\hat{\beta}_1$. This procedure is equivalent to testing whether the coefficient of *g* is 0 in a multiple linear regression that includes *x*_{1}, …, *x*_{p} as covariates and *y* as the response. Because the total sum of squares for $y_i^*$ is fixed in a genome scan, there is a monotonic relationship between the regression sum of squares for Eq. (5) and the magnitude of the statistic *T*.
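The monotonic relationship can be checked numerically. For a simple regression with an intercept, the standard identity is T² = (n − 2) · SS_reg / (SS_tot − SS_reg), which increases with SS_reg when SS_tot is fixed. The sketch below assumes n − 2 residual degrees of freedom (that is, it ignores the degrees of freedom used by the covariate adjustment); the function name and toy data are hypothetical.

```python
import numpy as np

def t_statistic(g_star, y_star):
    """Slope t statistic for the simple regression of y* on g*.

    Returns (T, regression sum of squares, total sum of squares).
    Uses n - 2 residual degrees of freedom, an assumption of this sketch.
    """
    g = np.asarray(g_star, dtype=float)
    y = np.asarray(y_star, dtype=float)
    n = len(y)
    gc = g - g.mean()
    yc = y - y.mean()
    sxx = gc @ gc
    beta1 = (gc @ yc) / sxx                 # least-squares slope
    resid = yc - beta1 * gc
    sigma2 = (resid @ resid) / (n - 2)      # residual variance estimate
    se = np.sqrt(sigma2 / sxx)              # standard error of the slope
    ss_reg = beta1 ** 2 * sxx
    ss_tot = yc @ yc
    return beta1 / se, ss_reg, ss_tot

# Toy residuals, invented for illustration only.
g_res = np.array([-1.0, 0.0, 1.0, -1.0, 0.0, 1.0])
y_res = np.array([-0.9, 0.2, 0.8, -1.1, -0.1, 1.1])
T, ss_reg, ss_tot = t_statistic(g_res, y_res)
identity_rhs = (len(y_res) - 2) * ss_reg / (ss_tot - ss_reg)  # should equal T**2
```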

We propose the following resampling procedure to evaluate the genome-wide significance of *S*: (1) Compute residuals $y_i^*$ and residuals $g_i^*$ from regression (3). (2) Randomly select a SNP from a set of SNPs that are under the null hypothesis. This null set can be determined by using a histogram of *p*-values [2]. (3) Permute residuals $y_i^*$ (or, equivalently, permute residuals $g_i^*$). (4) Compute the statistic *S* for the randomly selected marker. (5) Repeat steps 2–4 the desired number of times, say *K*. (6) Let *S*^{(k)} denote the value of statistic *S* in the *k*th iteration. Suppose that the value of statistic *S* is *s* for the observed data. Its *p*-value is computed as:

$$p = \frac{1}{K} \sum_{k=1}^{K} I\left(|S^{(k)}| \ge |s|\right), \tag{8}$$

where *I*(·) is the indicator function satisfying *I*(True) = 1 and *I*(False) = 0.

Because the SNP selected in each iteration is random, the significance obtained in this way is genome-wide.
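The resampling steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the residualized genotypes of the null SNPs are stored as columns of a matrix with NaN for missing values, that *S* is the sum of residual products divided by *n**, and that the comparison is two-sided (on absolute values). The function name, argument names, and simulated data are hypothetical.

```python
import numpy as np

def genomewide_p_value(s_obs, G_star_null, y_star, K=1000, seed=0):
    """Permutation p-value for an observed statistic s_obs.

    G_star_null: n-by-m matrix of residualized genotypes for m null SNPs
                 (NaN marks a missing genotype; an assumption of this sketch).
    y_star:      length-n vector of residualized phenotypes.
    """
    rng = np.random.default_rng(seed)
    G = np.asarray(G_star_null, dtype=float)
    y = np.asarray(y_star, dtype=float)
    exceed = 0
    for _ in range(K):                           # step 5: repeat K times
        j = rng.integers(G.shape[1])             # step 2: pick a null SNP at random
        y_perm = rng.permutation(y)              # step 3: permute phenotype residuals
        col = G[:, j]
        observed = ~np.isnan(col)
        s_k = np.dot(col[observed], y_perm[observed]) / observed.sum()  # step 4
        exceed += int(abs(s_k) >= abs(s_obs))    # step 6: indicator I(|S^(k)| >= |s|)
    return exceed / K

# Simulated residuals, for illustration only: 50 subjects, 20 null SNPs.
demo_rng = np.random.default_rng(1)
y_res = demo_rng.normal(size=50)
G_null = demo_rng.normal(size=(50, 20))
p_demo = genomewide_p_value(0.3, G_null, y_res, K=500)
```

Each iteration costs one permutation and one dot product, so *K* can be large. The smallest nonzero *p*-value this estimator can return is 1/*K*, which bounds the resolution of the genome-wide significance estimate.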