Suppose that the number of SNPs in a genetic unit (a signaling pathway or a gene) is $P$. Let $Y_i$, $i = 1, 2, \ldots, N$, be the phenotype for individual $i$. We define $I_{ij}$ to be the number of minor alleles carried by individual $i$ at SNP $j$. Let $X_i$ be a genetic summary score, which we define as:

$$X_i = \sum_{j=1}^{P} w_j I_{ij},$$

where $w_j$ is a weight that is applied to the $j$th SNP.

We evaluate two weighting schemes. In the first scheme, $w_j$ is taken to be 1 for all $j$. Thus rare and common SNPs are treated in the same way, and $X_i$ is the simple sum of the number of minor alleles in the gene or pathway. This approach is similar to the method of Morris and Zeggini [15], who defined a genetic score as the proportion of sites within the gene or pathway that harbored mutations. Because this scheme does not differentially weight SNPs, we refer to it as unweighted.
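As a minimal sketch of the summary score, assuming genotypes are stored as an $N \times P$ array of minor-allele counts, the weighted sum can be written as a matrix-vector product. The function name `genetic_score` and the toy genotype matrix are hypothetical and serve only as illustration.

```python
import numpy as np

def genetic_score(minor_allele_counts, weights=None):
    """Return X_i = sum_j w_j * I_ij for every individual i.

    minor_allele_counts: (N, P) array of minor-allele counts I_ij (0, 1, or 2).
    weights: length-P array of w_j; None selects the unweighted scheme (w_j = 1).
    """
    I = np.asarray(minor_allele_counts, dtype=float)
    if weights is None:
        weights = np.ones(I.shape[1])
    return I @ np.asarray(weights, dtype=float)

# Toy data (hypothetical): 3 individuals, 4 SNPs.
I = np.array([[0, 1, 0, 2],
              [0, 0, 0, 0],
              [1, 1, 1, 0]])
scores = genetic_score(I)  # unweighted: row sums of the minor-allele counts
```

With `weights=None` the score reduces to the total number of minor alleles each individual carries in the gene or pathway.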

In the second scheme, we calculate the frequency of nonreference mutations among nonextreme individuals at position $j$ as:

$$p_j = \frac{\sum_{i=1}^{N} \delta(Y_i) I_{ij} + 1}{2 \sum_{i=1}^{N} \delta(Y_i) + 2},$$

where $\delta(Y_i) = 1$ when $Y_i$ is within one standard deviation of the mean and $\delta(Y_i) = 0$ otherwise. Adding a 1 to the numerator and a 2 to the denominator ensures that the frequency $p_j$ is nonzero, so that the weight $w_j$ used in the second scheme remains finite [12]. Note that, with this weight, SNPs that are rare among those individuals whose phenotypes lie within the center of the phenotype distribution are up-weighted and have a larger role in the genetic summary score $X_i$. We refer to this scheme simply as weighted.
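The following sketch illustrates how the nonextreme-individual frequency might be computed. Two points are assumptions rather than details given in the text: the standard deviation is taken as the sample standard deviation (`ddof=1`), and the weight is taken to be $1/\sqrt{p_j(1 - p_j)}$, a common inverse-frequency form that is used here only to show why a nonzero $p_j$ keeps the weight finite. All function names are hypothetical.

```python
import numpy as np

def nonextreme_indicator(y):
    """delta(Y_i): 1 if Y_i lies within one standard deviation of the mean, else 0.

    ASSUMPTION: the sample standard deviation (ddof=1) is used.
    """
    y = np.asarray(y, dtype=float)
    return (np.abs(y - y.mean()) <= y.std(ddof=1)).astype(float)

def nonextreme_frequency(I, y):
    """p_j among nonextreme individuals, with +1 in the numerator and +2 in the denominator."""
    I = np.asarray(I, dtype=float)
    delta = nonextreme_indicator(y)
    return (delta @ I + 1.0) / (2.0 * delta.sum() + 2.0)

def assumed_weight(p):
    """ASSUMPTION: w_j = 1 / sqrt(p_j (1 - p_j)); the exact form used in the text is not shown."""
    p = np.asarray(p, dtype=float)
    return 1.0 / np.sqrt(p * (1.0 - p))

# Toy data (hypothetical): 5 individuals, 2 SNPs; the last phenotype is extreme.
y = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
I = np.array([[0, 1],
              [0, 1],
              [0, 0],
              [0, 0],
              [2, 0]])
p = nonextreme_frequency(I, y)  # SNP 0 is unobserved among nonextreme individuals
w = assumed_weight(p)           # still finite because p_j > 0
```

In this toy example the first SNP is carried only by the extreme individual, so its frequency among nonextreme individuals is the smallest possible value and it receives the larger weight.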

For both approaches, once we have defined the genetic score $X_i$, we assume that it is related to $Y_i$ through the linear model:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,$$

where $\varepsilon_i$ is an unknown error term. A Wald statistic for $\beta_1$ is computed, with the variance estimated using a sandwich estimator [17]. Because the weighted approach uses phenotypic information in defining the weight, we use permutation to assess statistical significance. In this case, the weight is recomputed for each permuted data set. We use 1 million permutations throughout.
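A minimal sketch of the testing step is given below, under several assumptions not stated in the text: the Wald statistic is taken in its squared form for the slope of a simple linear regression, the sandwich variance is the basic heteroskedasticity-consistent (HC0) version, and the permutation $p$-value uses the add-one convention. The names `wald_statistic`, `permutation_pvalue`, and `unweighted_score` are hypothetical, and the data are simulated purely for illustration.

```python
import numpy as np

def wald_statistic(x, y):
    """Squared Wald statistic for the slope of y on x, using an HC0 sandwich variance."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    D = np.column_stack([np.ones_like(x), x])   # design matrix with rows [1, X_i]
    bread = np.linalg.inv(D.T @ D)
    beta = bread @ D.T @ y                      # ordinary least squares estimates
    resid = y - D @ beta
    meat = D.T @ (D * resid[:, None] ** 2)      # sum_i e_i^2 d_i d_i^T
    cov = bread @ meat @ bread                  # sandwich covariance of beta
    return beta[1] ** 2 / cov[1, 1]

def permutation_pvalue(I, y, score_fn, n_perm=1000, seed=0):
    """Permute phenotypes; score_fn(I, y) rebuilds the score (and any weights) each time."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    observed = wald_statistic(score_fn(I, y), y)
    exceed = 0
    for _ in range(n_perm):
        y_perm = rng.permutation(y)
        exceed += int(wald_statistic(score_fn(I, y_perm), y_perm) >= observed)
    # ASSUMPTION: add-one convention so the estimate is never exactly zero.
    return (exceed + 1) / (n_perm + 1)

def unweighted_score(I, y):
    """Unweighted scheme: ignores the phenotype and sums minor-allele counts."""
    return np.asarray(I, dtype=float).sum(axis=1)

# Simulated toy data (hypothetical): 200 individuals, 5 SNPs, a true effect of 0.5.
rng = np.random.default_rng(1)
I = rng.binomial(2, 0.1, size=(200, 5))
X = unweighted_score(I, None)
y = 0.5 * X + rng.normal(size=200)
p_value = permutation_pvalue(I, y, unweighted_score, n_perm=200)
```

Passing the phenotype to `score_fn` is the design choice that matters here: a phenotype-dependent weighting function recomputes its weights from each permuted phenotype vector, so the null distribution reflects the same data-driven weighting as the observed statistic. The sketch uses 200 permutations only to keep the example fast.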

We evaluate our approach using the simulated GAW17 data set. These data are described in detail elsewhere [18]. Although all 200 replicates are analyzed, for illustration purposes, we present results concerning replicate 1 in greater detail. Our analyses focus on one phenotype: quantitative trait Q1. Even though we had access to the answers for the underlying simulation model, our approach, including the characterization of signaling pathways, was developed without reference to these answers.

We characterize gene sets using information from two databases. The first, PharmGKB (https://www.pharmgkb.org/) [19], provides information on 1,400 signaling pathways. Unfortunately, the genes in the GAW17 data set are not well represented in PharmGKB: only 713 of the 3,205 genes sequenced in the GAW17 data are included, and these fall into 821 of the pathways. To compensate for this low coverage, we also classify genes by biological process using the Gene Ontology (GO) database (http://www.geneontology.org). Although it does not define signaling pathways, the GO biological process domain classifies genes by their involvement in biological processes and therefore presents an interesting unit over which to accumulate genetic variation. This approach allows us to classify 2,304 of the 3,205 genes into 3,009 biological processes. The GO data are contained in two files: a human genetic association file, dated September 15, 2010, revision 1.1433; and a genetic ontology file, dated September 6, 2010, revision 1.160. We note that in both of these classification schemes (PharmGKB and GO) one gene may be mapped to several pathways or biological processes. A pathway or biological process is taken to be significantly associated with the phenotype if its permutation $p$-value does not exceed the Bonferroni-corrected significance threshold $0.05/(821 + 3009) \approx 1.3055 \times 10^{-5}$. The entire analysis is repeated using both the weighted and unweighted schemes. Only nonsynonymous SNPs are considered throughout.
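The Bonferroni threshold is the nominal level of 0.05 divided by the total number of gene sets tested. The short check below reproduces that arithmetic; the variable names are illustrative only.

```python
n_pathways = 821     # PharmGKB pathways containing GAW17 genes
n_processes = 3009   # GO biological processes
alpha = 0.05         # nominal family-wise significance level

# Bonferroni correction: divide the nominal level by the number of tests.
threshold = alpha / (n_pathways + n_processes)
print(f"{threshold:.4e}")  # prints 1.3055e-05
```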