We studied the quantitative trait Q1 influenced by 39 variants in nine independent genes.

### Statistical association analysis of rare variants

We carry out the association test at the gene level. Assume that a gene *G* contains *J*_{G} variants denoted *SNP*^{j}, *j* = 1, …, *J*_{G}, and that *MAF*_{j} is the minor allele frequency of *SNP*^{j}. Let *Y* = (*y*_{1}, …, *y*_{N}) be the observations of the phenotype Q1 in *N* unrelated individuals, and let *X*_{iG} be the vector of genotypes of the SNPs in gene *G* for individual *i*. The genotypes are coded 0, 1, or 2, depending on the number of minor alleles.
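As an illustration of this coding, the following minimal sketch computes a sample minor allele frequency per SNP from an *N* × *J*_{G} matrix of 0/1/2 genotypes and selects the variants that fall under a threshold. The function names are hypothetical, the frequencies are estimated from the sample itself (an assumption; they could equally be supplied from an external source), and the inclusive comparison MAF ≤ *T*_{maf} is also an assumption.

```python
def minor_allele_frequencies(genotypes):
    """Return one sample MAF per SNP.

    genotypes: list of N rows, each a list of J minor-allele counts (0, 1, or 2).
    Each individual carries two alleles, hence the 2 * N denominator.
    """
    n_individuals = len(genotypes)
    n_snps = len(genotypes[0])
    return [
        sum(row[j] for row in genotypes) / (2 * n_individuals)
        for j in range(n_snps)
    ]


def rare_variant_indices(mafs, t_maf):
    """Indices of SNPs kept under the threshold (inclusive comparison assumed)."""
    return [j for j, maf in enumerate(mafs) if maf <= t_maf]


# Toy data: 4 individuals, 2 SNPs (hypothetical values).
genotypes = [[0, 1], [1, 0], [0, 0], [0, 2]]
mafs = minor_allele_frequencies(genotypes)
rare = rare_variant_indices(mafs, t_maf=0.2)
```

With these toy genotypes the first SNP has one minor allele among eight chromosomes and the second has three, so only the first SNP passes a 20% threshold.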

Let *T*_{maf} be a selection criterion on minor allele frequency (MAF) values. The association methods we investigated differ in the predefined *T*_{maf} value (i.e., less than 1%, less than 5%, or less than 50%) and in the number of collapsing groups. They are all based on a linear regression that models the relationship between the trait *Y* and the SNP data within a gene. We briefly review these methods here; more details are given by Dering et al. [8].

### Association testing in the unrelated individuals data set: univariate collapsing approaches

The univariate collapsing approaches use only a subset of variants that satisfy the constraint MAF ≤ *T*_{maf}, where *T*_{maf} is a predefined selection value.

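The first univariate collapsing approach is the collapsing and summation test (CAST). Let *X*_{iG}(maf) be the vector of genotype scores of the SNPs with MAF < *T*_{maf}, and let *J*_{G}(maf) be the length of the vector *X*_{iG}(maf). The variable *C* = *C*_{iG}(maf) (*i* = 1, …, *N*) denotes the collapsing variable, which is defined by one of the two collapsing strategies that we used: collapsing absence/presence (CA) and collapsing proportion (CP). For the CA strategy:

$$C_{iG}(\mathrm{maf}) = \begin{cases} 1 & \text{if individual } i \text{ carries at least one minor allele at any of the } J_{G}(\mathrm{maf}) \text{ rare variants} \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

For the CP strategy:

$$C_{iG}(\mathrm{maf}) = \frac{1}{J_{G}(\mathrm{maf})} \sum_{j=1}^{J_{G}(\mathrm{maf})} I\left(x_{ij} > 0\right) \qquad (2)$$

where *x*_{ij} is the *j*th element of *X*_{iG}(maf) and *I*(·) is the indicator function.

Equation (1) is based on the presence or absence of the minor allele at any rare variant in gene *G* within an individual [3]. Equation (2) is based on the proportion of rare variants with MAF ≤ *T*_{maf} at which an individual *i* carries at least one copy of the minor allele [5]. The model is *Y* = *Cβ* + *ε*, where *ε*_{i} ~ *N*(0, σ^{2}) independently and σ^{2} is the residual variance.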

The null hypothesis *β* = 0 can be tested with a likelihood ratio test, whose statistic follows a chi-square distribution with 1 degree of freedom (df) under the null hypothesis.
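The two collapsing strategies and the 1-df test can be sketched as follows. This is a minimal illustration, not the R code used in the study: the function names are hypothetical, an intercept is assumed in the regression, the inclusive comparison MAF ≤ *T*_{maf} is assumed, and the likelihood ratio statistic is taken as *N* log(RSS_{0}/RSS_{1}), the standard form for a Gaussian linear model with the variance estimated by maximum likelihood.

```python
import math


def collapse_ca(genotypes, mafs, t_maf):
    """CA: 1 if the individual carries a minor allele at any rare variant, else 0."""
    rare = [j for j, maf in enumerate(mafs) if maf <= t_maf]
    return [1.0 if any(row[j] > 0 for j in rare) else 0.0 for row in genotypes]


def collapse_cp(genotypes, mafs, t_maf):
    """CP: proportion of rare variants at which the individual carries a minor allele."""
    rare = [j for j, maf in enumerate(mafs) if maf <= t_maf]
    if not rare:
        return [0.0] * len(genotypes)
    return [sum(1 for j in rare if row[j] > 0) / len(rare) for row in genotypes]


def lrt_1df(y, c):
    """Likelihood ratio test of beta = 0 in y = mu + c * beta + eps (intercept assumed)."""
    n = len(y)
    y_mean = sum(y) / n
    c_mean = sum(c) / n
    sxx = sum((ci - c_mean) ** 2 for ci in c)
    sxy = sum((ci - c_mean) * (yi - y_mean) for ci, yi in zip(c, y))
    rss_null = sum((yi - y_mean) ** 2 for yi in y)
    if sxx == 0.0:
        return 0.0, 1.0  # collapsing variable is constant: no information
    rss_alt = rss_null - sxy * sxy / sxx
    if rss_alt <= 0.0:
        return math.inf, 0.0  # perfect fit
    stat = n * math.log(rss_null / rss_alt)
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Toy data (hypothetical): 4 individuals, 3 SNPs, threshold of 1%.
genotypes = [[0, 0, 1], [0, 0, 0], [1, 2, 0], [0, 1, 0]]
mafs = [0.005, 0.01, 0.2]
y = [1.0, 2.0, 3.0, 4.0]
ca = collapse_ca(genotypes, mafs, 0.01)
cp = collapse_cp(genotypes, mafs, 0.01)
stat, p_value = lrt_1df(y, ca)
```

The common third SNP is ignored at the 1% threshold, so the first individual is scored 0 under both strategies even though that individual carries a minor allele there.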

The second univariate collapsing method is the variable-threshold (VT) approach [2], which uses the CP approach to collapse rare SNPs with MAF < *T*_{maf} but maximizes the test statistic over *T*_{maf}. All *T*_{maf} values observed in the gene *G* are considered. For each *T*_{maf}, a regression *z*-score is computed. Let *z*_{max} be the maximum *z*-score across all *T*_{maf} values. The test of association is based on *z*_{max}, and its statistical significance is evaluated empirically by permutation.
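The maximization over thresholds and the permutation step can be sketched as below. This is an illustrative sketch rather than the published VT script: the *z*-score is assumed to be the regression slope divided by its standard error, each observed MAF is used as an inclusive threshold so that no threshold yields an empty variant set, the signed maximum is used as in the description above, and all function names are hypothetical.

```python
import math
import random


def cp_score(genotypes, mafs, t_maf):
    """CP collapsing variable for variants with MAF <= t_maf (inclusive, assumed)."""
    rare = [j for j, maf in enumerate(mafs) if maf <= t_maf]
    return [sum(1 for j in rare if row[j] > 0) / len(rare) for row in genotypes]


def slope_z(y, c):
    """Slope estimate divided by its standard error in y = mu + c * beta + eps."""
    n = len(y)
    y_mean = sum(y) / n
    c_mean = sum(c) / n
    sxx = sum((ci - c_mean) ** 2 for ci in c)
    if sxx == 0.0:
        return 0.0
    sxy = sum((ci - c_mean) * (yi - y_mean) for ci, yi in zip(c, y))
    rss = sum((yi - y_mean) ** 2 for yi in y) - sxy * sxy / sxx
    if rss <= 0.0:
        return math.copysign(math.inf, sxy)
    return (sxy / sxx) / math.sqrt(rss / (n - 2) / sxx)


def vt_statistic(y, genotypes, mafs):
    """z_max: the largest z-score over all MAF thresholds observed in the gene."""
    return max(slope_z(y, cp_score(genotypes, mafs, t)) for t in sorted(set(mafs)))


def vt_test(y, genotypes, mafs, n_perm=1000, seed=0):
    """Empirical P-value of z_max obtained by permuting the phenotypes."""
    rng = random.Random(seed)
    observed = vt_statistic(y, genotypes, mafs)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if vt_statistic(y_perm, genotypes, mafs) >= observed:
            hits += 1
    # Add-one correction keeps the empirical P-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): 8 individuals, 3 SNPs.
genotypes = [[1, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 1],
             [0, 0, 0], [0, 0, 1], [0, 0, 0], [0, 1, 1]]
mafs = [0.01, 0.02, 0.2]
y = [2.0, 0.1, 1.5, 0.3, -0.2, 0.4, 0.0, 1.2]
z_max, p_value = vt_test(y, genotypes, mafs, n_perm=200, seed=1)
```

Because the same maximization is repeated in every permutation, the empirical *P*-value accounts for the multiple thresholds that were tried.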

The last univariate collapsing method is the weighted-sum (WS) approach [2], which generalizes to quantitative traits the binary-trait weighted-sum approach proposed by Madsen and Browning [4]. Under this approach, *T*_{maf} = 0.5 (i.e., all variants in a gene *G* are used). The collapsing variable *C* for subject *i* in the WS approach is given by:

For each gene *G*, a genetic score is calculated as:

The significance of Z_{G} is assessed empirically by permutation.
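To make the WS procedure concrete, the sketch below shows one possible form. The weights and the score are assumptions, not necessarily the definitions used in the study: each variant is weighted by 1/√(*MAF*_{j}(1 − *MAF*_{j})) in the spirit of Madsen and Browning, the gene score is taken as the sum over individuals of *C*_{i}(*y*_{i} − ȳ), and significance is assessed two-sided by permuting the phenotypes. All function names are hypothetical.

```python
import math
import random


def ws_collapse(genotypes, mafs):
    """Weighted sum of minor-allele counts; rarer variants get larger weights.

    Assumed weight: 1 / sqrt(maf * (1 - maf)); requires 0 < maf < 1.
    """
    weights = [1.0 / math.sqrt(maf * (1.0 - maf)) for maf in mafs]
    return [sum(w * x for w, x in zip(weights, row)) for row in genotypes]


def ws_score(y, c):
    """Assumed gene score: sum over individuals of C_i * (y_i - mean(y))."""
    y_mean = sum(y) / len(y)
    return sum(ci * (yi - y_mean) for ci, yi in zip(c, y))


def ws_test(y, genotypes, mafs, n_perm=1000, seed=0):
    """Two-sided empirical P-value of the gene score by phenotype permutation."""
    rng = random.Random(seed)
    c = ws_collapse(genotypes, mafs)
    observed = ws_score(y, c)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(ws_score(y_perm, c)) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): 6 individuals, 2 SNPs.
genotypes = [[1, 0], [0, 1], [0, 0], [0, 0], [1, 1], [0, 0]]
mafs = [0.5, 0.1]
y = [1.2, 2.5, 0.1, -0.3, 3.0, 0.2]
c = ws_collapse(genotypes, mafs)
score, p_value = ws_test(y, genotypes, mafs, n_perm=200, seed=1)
```

The collapsing variable does not depend on the phenotype, so it is computed once and reused across permutations.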

### Association testing in the unrelated individuals data set: combined multivariate and collapsing approach

The combined multivariate and collapsing (CMC) method originally proposed by Li and Leal [3] uses a multiple regression model that contains the CA method’s collapsing variable of SNPs with MAF < *T*_{maf} = 1% and includes all *k* remaining SNPs, *X*_{j}, *j* = 1, …, *k*, individually.

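The multivariate model (denoted here as CMC3) is:

$$Y = C\beta_{0} + \sum_{j=1}^{k} X_{j}\beta_{j} + \varepsilon \qquad (6)$$

where *C* is the CA collapsing variable of the SNPs with MAF < 1%. Evidence of association (∃*j*, *β*_{j} ≠ 0, *j* = 0, …, *k*) is assessed with the likelihood ratio test, whose statistic follows a chi-square distribution with (*k* + 1) df.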

Using only the SNPs with MAF ≤ 5%, we extended this model in two ways. In both extensions the multivariate model contains the CA collapsing variable of SNPs with MAF < 1%. In the first variation of this model (denoted CMC1), the multivariate model also contains the CA collapsing variable of the other SNPs (i.e., 1% ≤ MAF ≤ 5%). In contrast, in the second extension (denoted CMC2), the other SNPs are included individually in the multivariate model.

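The CMC1 model is then written as:

$$Y = C_{1}\beta_{1} + C_{2}\beta_{2} + \varepsilon$$

where *C*_{1} and *C*_{2} are the CA collapsing variables of the SNPs with MAF < 1% and with 1% ≤ MAF ≤ 5%, respectively. The test of association is a likelihood ratio test with 2 df.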
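A 2-df test of this kind can be sketched as follows: one CA variable collapses the SNPs with MAF < 1%, a second collapses those with 1% ≤ MAF ≤ 5%, and both enter the regression together. The sketch assumes an intercept, solves the normal equations directly, and uses the statistic *N* log(RSS_{0}/RSS_{1}); the function names are hypothetical and this is not the code used in the study.

```python
import math


def collapse_ca(genotypes, mafs, keep):
    """CA variable over the SNPs whose MAF satisfies the predicate `keep`."""
    idx = [j for j, maf in enumerate(mafs) if keep(maf)]
    return [1.0 if any(row[j] > 0 for j in idx) else 0.0 for row in genotypes]


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def residual_ss(y, predictors):
    """Residual sum of squares of y regressed on an intercept plus predictors."""
    cols = [[1.0] * len(y)] + predictors
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve_linear(xtx, xty)
    return sum((yi - sum(b * col[i] for b, col in zip(beta, cols))) ** 2
               for i, yi in enumerate(y))


def cmc1_lrt(y, genotypes, mafs):
    """2-df likelihood ratio test with two CA collapsing variables."""
    c_rare = collapse_ca(genotypes, mafs, lambda maf: maf < 0.01)
    c_low = collapse_ca(genotypes, mafs, lambda maf: 0.01 <= maf <= 0.05)
    stat = len(y) * math.log(residual_ss(y, []) / residual_ss(y, [c_rare, c_low]))
    # Survival function of a chi-square with 2 df: P(X > x) = exp(-x / 2).
    return stat, math.exp(-stat / 2.0)


# Toy data (hypothetical): 6 individuals, 2 SNPs.
genotypes = [[1, 0], [0, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
mafs = [0.005, 0.03]
y = [2.1, 0.2, 1.0, -0.1, 3.2, 0.0]
stat, p_value = cmc1_lrt(y, genotypes, mafs)
```

The solver raises an error when a collapsing variable is constant or when the two variables are collinear, in which case the 2-df test is not defined for that gene.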

The CMC2 model has the same form as the CMC3 model in Eq. (6), where *k* is now the number of SNPs with 1% ≤ MAF ≤ 5%. Evidence of association is assessed with the likelihood ratio test with (*k* + 1) df.

### Association testing in the unrelated individuals data set: single-marker test

For comparison purposes, we also carried out a single-locus association test. For a gene *G*, association with each SNP was tested using the likelihood ratio test. For each gene *G*, we obtained *J*_{G} likelihood ratio test statistics, each with 1 df. The evidence of association at the gene level was based on the maximum of the *J*_{G} likelihood ratio test statistics.

Single-marker (SM) tests were conducted with PLINK, version 1.07 [9]. R, version 2.10.1, was used for all collapsing approaches except the VT and WS approaches. For these two approaches we used the R script available at http://genetics.bwh.harvard.edu/vt/dokuwiki/ [2], with the number of permutations set to 1,000.

### Association testing in the family data set

We used the measured genotype (MG) test [10], which is a linear mixed model:

$$y_{i} = \mu + C_{i}\beta + g_{i} + \varepsilon_{i}, \qquad \mathrm{Var}(y_{i}) = 2\Phi_{i}\sigma_{g}^{2} + I\sigma_{e}^{2}$$

where *y*_{i} and *C*_{i} are the vectors of phenotypes and collapsing variables in family *i*, σ_{g}^{2} and σ_{e}^{2} are the polygenic and the residual variances, respectively, and Φ_{i} is the kinship matrix in family *i*. The SNP data in relatives were collapsed as described under the CA, CP, and WS collapsing approaches. In these three approaches, the test of association is a likelihood ratio test with 1 df. In addition, we carried out the bivariate CMC1 approach using the likelihood ratio test with 2 df. We could not evaluate the VT approach because it maximizes the statistic over *T*_{maf}. We carried out the MG test using the QTDT software (http://www.sph.umich.edu/csg/abecasis/QTDT/).
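To illustrate how the kinship matrix enters such a model, the sketch below builds the phenotypic covariance matrix of one family under the usual variance-components parameterization, in which the covariance between two relatives is twice their kinship coefficient times the polygenic variance and the residual variance is added on the diagonal. This parameterization, the variance values, and the function name are assumptions made for illustration; they are not taken from the QTDT software.

```python
def family_covariance(kinship, var_g, var_e):
    """Covariance matrix 2 * Phi * var_g + I * var_e for one family.

    kinship: square matrix of kinship coefficients (0.5 on the diagonal for
    non-inbred individuals, 0.25 for parent-child and full-sibling pairs).
    """
    n = len(kinship)
    return [
        [2.0 * kinship[r][c] * var_g + (var_e if r == c else 0.0)
         for c in range(n)]
        for r in range(n)
    ]


# Nuclear family: father, mother, and two children (parents unrelated).
kinship = [
    [0.50, 0.00, 0.25, 0.25],
    [0.00, 0.50, 0.25, 0.25],
    [0.25, 0.25, 0.50, 0.25],
    [0.25, 0.25, 0.25, 0.50],
]
cov = family_covariance(kinship, var_g=0.4, var_e=0.6)
```

With these hypothetical variances each individual has total variance 1.0, the unrelated parents have covariance 0, and every parent-child or sibling pair has covariance 0.2, which is the dependence that the likelihood ratio test must account for in family data.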

### Type I error rate and power estimates

The empirical performance of each association approach was evaluated in unrelated individuals and in family data. Type I error rates and power were estimated by testing the association of Q1 with each of the seven noncausal genes and each of the nine causal genes, respectively, across the 200 replicates. Both were derived at a nominal level of *α* = 5%.

In the unrelated individuals data set, we evaluated association with Q1 using 10 approaches: CA1 and CA5 with *T*_{maf} = 1% and 5%, respectively; CP1 and CP5 with *T*_{maf} = 1% and 5%, respectively; and VT, WS, CMC1, CMC2, CMC3, and SM. For the WS and VT tests, we used empirical *P*-values. For all remaining association tests we used tabulated nominal *P*-values. In each replicate, we tested for association of Q1 with each of the 16 genes using each of the 10 approaches. For each gene and for each association procedure we computed the proportion of replicates having a *P*-value ≤ *α*. For the SM approach, we applied a Bonferroni correction to account for the multiple tests; we computed the proportion of replicates such that the lowest *P*-value out of the *J*_{G} SNPs was less than or equal to *α*/*J*_{G}.
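The two proportions described above can be computed as in the following minimal sketch, where the function names and the toy *P*-values are hypothetical. Applied to a noncausal gene the proportion estimates the type I error rate, and applied to a causal gene it estimates power.

```python
def rejection_rate(p_values, alpha=0.05):
    """Proportion of replicates with a gene-level P-value <= alpha."""
    return sum(1 for p in p_values if p <= alpha) / len(p_values)


def sm_rejection_rate(snp_p_values_by_replicate, alpha=0.05):
    """Single-marker version with a Bonferroni correction over the gene's SNPs.

    Each inner list holds the J_G single-SNP P-values of one replicate; the
    replicate counts as a rejection when the smallest one is <= alpha / J_G.
    """
    hits = 0
    for snp_p_values in snp_p_values_by_replicate:
        if min(snp_p_values) <= alpha / len(snp_p_values):
            hits += 1
    return hits / len(snp_p_values_by_replicate)


# Toy example (hypothetical P-values).
gene_rate = rejection_rate([0.01, 0.20, 0.05, 0.50])
sm_rate = sm_rejection_rate([[0.02, 0.50], [0.03, 0.90], [0.60, 0.70]])
```

In the toy example the second replicate has a smallest single-SNP *P*-value of 0.03, which is below 0.05 but above the corrected threshold of 0.025, so it does not count as a rejection.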

In the family data set, we similarly evaluated the following five approaches: CA1 and CA5 with *T*_{maf} = 1% and 5%, respectively; CP1 and CP5 with *T*_{maf} = 1% and 5%, respectively; and SM. We also evaluated the WS approach but used the tabulated *P*-value derived from a chi-square distribution with 1 df.