Treating phenotype as given: a simple resampling method for genome-wide association studies

Significance of genetic association to a marker has been traditionally evaluated through statistics that are standardized such that their null distributions conform to some known ones. Distributional assumptions are often required in this standardization procedure. Based on the observation that the phenotype remains the same regardless of the marker being investigated, we propose a simple statistic that does not need such standardization. We propose a resampling procedure to assess this statistic’s genome-wide significance. This method has been applied to replicate 2 of the Genetic Analysis Workshop 17 simulated data on unrelated individuals in an attempt to map phenotype Q2. However, none of the selected SNPs are in genes that are disease-causing. This may be due to the weak effect that each genetic factor has on Q2.


Background
The traditional approach to hypothesis testing is to construct a test statistic whose exact or approximate distribution under the null hypothesis is known. Alternative methods for assessing significance of a test statistic include resampling techniques, such as the bootstrap and permutation. These methods dominate modern genetic association studies.
Statistics used in these methods typically make use of a standardization factor, for instance, the standard error of the observed genetic effect size. Different model assumptions result in different standardization factors. For instance, in a case-control genetic association study, we are interested in testing whether the difference in the frequency of a reference allele between case subjects and control subjects is 0. The popular allelic test and the Cochran-Armitage test for trend are based on two different estimates of the variance of this difference. The estimate for the allelic test requires Hardy-Weinberg equilibrium in the combined case-control population, whereas the estimate for the Cochran-Armitage test does not. The properties of these two tests are rather different [1].
It is a brilliant idea to construct a test statistic through standardization. It makes it possible to assess the significance of the observed effect size with respect to a reference distribution. However, in genome-wide association studies there are typically tens of thousands of markers. Presumably most of these markers are under the null hypothesis (i.e., they are not associated). The presence of these null markers provides a natural reference distribution. Standardizing the effect size in this setting appears to be unnecessary and awkward.
We propose a novel resampling approach that aims to directly assess the significance of genetic effect size. The null distribution generated from this method is for a randomly selected single-nucleotide polymorphism (SNP). Hence it naturally addresses the genome-wide significance of genetic effect size at a SNP and obviates the issue of multiple testing. To define the genetic effect size, we observe that the phenotype remains the same in a genome scan. Only the testing locations are different. This observation leads to a simple representation of the genetic effect size. This approach is demonstrated by means of a genome-wide association study on phenotype Q2 in replicate 2 of the Genetic Analysis Workshop 17 (GAW17) simulated data on unrelated subjects.

Methods
Let n denote the number of individuals in a study. Let y i denote the measure of a quantitative phenotype y on individual i. The score of the genotype at a SNP is denoted by g. Assuming that the two alleles of the SNP are denoted by A and B, we set g = 0 for genotype AA, g = 1 for genotype AB, and g = 2 for genotype BB. The genotype score on individual i is denoted by g i .
To identify SNPs associated with phenotype y, we consider a regression in which the response variable is the genotype score g and the independent variable is the phenotype y. The (partial) regression sum of squares for g equals: (1) where g and y are the sample means of g i and y i , respectively. Because the denominator in expression (1) remains constant in a genome scan, a natural measure of association would be: Note that in a case-control study ( ) ( ) y y g g i i i − − ∑ is proportional to the difference in the frequency of allele B between case subjects and control subjects. If there are p covariates x j , j = 1, …, p, then phenotype y i is replaced by the residual y i * from the following linear regression: where x ij is the value of x j for individual i. Similarly, genotype g is replaced by the residual g* of the regression of g over x j , j = 1, …, p.
We use the statistic: to measure the strength of association, where n* is the number of subjects whose genotype is nonmissing at the SNP being investigated. The sample mean of The purpose of using n* is to make the statistic S on the same scale because n* is expected to vary across the genome.
In comparison, the usual method for detecting association would consider the following regression: The least-squares estimate of b 1 , denotedb 1 , is: The statistic for testing for association is: where SE( ) b 1 is the standard error ofb 1 . This procedure is equivalent to testing whether the coefficient of g is 0 in a multiple linear regression that includes x i , …, x p as covariates and y as the response. Because the total sum of squares for { , , , } * y i n i = 1  is fixed in a genome scan, there is a monotonic relationship between the regression sum of squares for Eq. (5) and the statistic T.
We propose the following resampling procedure to evaluate the genome-wide significance of S: where I(·) is the indicator function satisfying I(True) = 1 and I(False) = 0.
Because the SNP selected in each iteration is random, the significance obtained in this way is genome-wide.

Results
We analyze phenotype Q2 in replicate 2. There are 24,487 SNPs and 697 individuals. Residuals { } * y i are computed from regression (3) with regressors Sex, Age, and Smoke. The residuals { } same way. Genome-wide p-values from the t-distribution with 1 degree of freedom for statistic T are presented in Figure 1. There are 19 SNPs with a p-value less than 5 × 10 −4 ( Table 1).
The reference distribution for statistic S is obtained in the way described in the previous section. The value of K is set at 10,000,000. The histogram of p-values for parameter b 1 in regression (5) shows no apparent deviation from uniform distribution (data not shown). Thus all SNPs are used to determine the reference distribution for statistic S. P-values obtained from the resampling procedure are presented in Figure 2. Eighteen SNPs have a p-value that is less than 5 × 10 −4 . None of these SNPs are close to the 19 SNPs that have p-values less than 5 × 10 −4 for statistic T, except for C20S2223 on chromosome 20 (base-pair position 61666107; p = 2.59 × 10 −5 for statistic T; resampling p = 6.27 × 10 −2 for statistic S). C20S2223 is located in gene PRIC285 identified by statistic S.

Discussion and conclusions
Significance of genetic association at a SNP has been traditionally evaluated through a standardized test statistic such that its significance can be evaluated through a known distribution. For instance, in Eq. (7) the standard error ofb 1 serves as a standardization factor. Its p-value is assessed through a t-distribution or a standard normal distribution. This strategy works beautifully in general but is awkward in the context of genome-wide association studies. The presence of abundant unassociated SNPs provides a natural reference distribution that is pertinent to the data. We have proposed a simple statistic S and a resampling method for generating its reference distribution. Although a distributional assumption is necessary to compute the standard error ofb 1 , it is not required in our method. More important, the resampling procedure has a built-in mechanism for handling genome-wide significance because the reference distribution is the distribution of an arbitrary SNP under the null hypothesis that there is no association. Application to the GAW17 simulated data on unrelated individuals (replicate 2) revealed several susceptible genes for phenotype Q2 that can be used as candidates for further investigation.
The essence of the proposed method is the existence of a large number of similar features-SNPs in the current context. The basic principle is applicable to other situations that involve a large amount of similar features, for instance, gene expression levels in gene expression studies.
We have focused on a continuous trait. As alluded to earlier, for a dichotomous trait, if there are no covariates, then the statistic S can be defined as the difference in frequency of the reference allele between the two phenotype groups. More generally, one can use a proportional odds model with genotype as the response and phenotype as the predictor. A proportional odds model is appealing for dealing with covariates and both continuous and dichotomous traits. We are actively investigating this possibility.
The proposed method is resampling based and requires intensive resampling. However, computation for each sample is fast. The overall computation time for K = 10,000,000 resampling iterations took less than two days on a computer running the Windows Vista operating system. The iteration coverage is about 408     per SNP (≈ K/24,487 SNPs). We have also used K = 100,000; the simulated p-value remains almost the same. The resampling procedure needs to be done only once, regardless of the number of SNPs. We are investigating the utility of our approach in other association study settings, such as gene-gene interaction and population stratification.
We had no knowledge of the GAW17 disease-generating model when this analysis was conducted. It turns out that none of the genes in Table 1 and Table 2 are disease causing. This may be due to the weak effect that each genetic factor has on Q2.