

Background
With the rapid development of sequencing technologies, more and more single-nucleotide polymorphisms (SNPs) have become available; in particular, most rare variants can now be identified using next-generation sequencing. However, detecting associated rare variants that contribute to phenotypic variation is still a huge challenge. Current approaches for testing rare variants include grouping the rare variants based on a threshold of the minor allele frequency (MAF) [1], summing the rare variants weighted by the allele frequencies in control subjects [2,3], and clustering rare haplotypes using family data [4]. Another approach is to use a penalized regression, which can avoid the singular design matrix that may result from rare variants by adding a penalty, such as the least absolute shrinkage and selection operator (LASSO) and ridge penalties [5,6]. In this analysis, we evaluated the LASSO regression, linear regression, and collapsing methods by comparing their power and false-positive rates. Based on the results, we recommend the LASSO approach for detecting rare SNPs.

Data checking
In the Genetic Analysis Workshop 17 (GAW17) simulated data set, there are no missing genotype data. Among all 24,487 SNPs, 91% have a MAF less than 0.1, 87% have a MAF less than 0.05, and 75% have a MAF less than 0.01. Moreover, 39% of the SNPs have a MAF less than 0.001, which corresponds to 9,433 SNPs being singletons among the 697 unrelated individuals. Owing to the rarity of the variants, we do not test for Hardy-Weinberg disequilibrium as a quality-control procedure in this study. Thus we include all SNPs and all individuals in the association analysis.
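As a minimal sketch of this kind of data check (a toy 0/1/2 genotype matrix, not the GAW17 data; `maf_summary` is our own helper name), folded minor allele frequencies and the singleton count can be computed as:

```python
import numpy as np

def maf_summary(G):
    """Minor-allele frequencies and singleton count from a genotype
    matrix G (individuals x SNPs) coded as 0, 1, or 2 allele copies."""
    G = np.asarray(G)
    n = G.shape[0]
    counts = G.sum(axis=0)                      # coded-allele counts per SNP
    minor = np.minimum(counts, 2 * n - counts)  # fold to the minor allele
    maf = minor / (2 * n)
    singletons = int(np.sum(minor == 1))        # minor allele seen exactly once
    return maf, singletons

G = np.array([[0, 1, 2],
              [0, 0, 2],
              [1, 0, 2]])
maf, s = maf_summary(G)
```

A singleton here is a SNP whose minor allele appears on exactly one chromosome in the sample, which with 697 individuals corresponds to a MAF of about 0.0007.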

LASSO regression
To deal with the singular matrix in linear regression caused by the rare variants, we adopt a statistical method that effectively shrinks the coefficients of unassociated SNPs and reduces the variance of the estimated regression coefficients. Here, we apply the LASSO penalty [7] to implement this regression analysis.
At the ith SNP site, we code the genotype as 0, 1, or 2 to represent 0, 1, or 2 copies of the minor allele; these coded genotypes form the ith column of the design matrix X. For the quantitative trait y, the regression can be written

$y = Xb + \varepsilon$,

where b is the vector of regression coefficients. In a LASSO regression, the elements of b are the estimates that minimize the loss

$\hat{b} = \arg\min_{b} \left[ \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - (Xb)_i \right)^2 + \lambda \sum_{j=1}^{L} |b_j| \right]$,

where n is the number of individuals, L is the number of SNP sites, and λ is the tuning parameter. The LASSO regression was implemented in the R package glmnet.
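The paper fits this model with glmnet in R. As a hedged Python analogue, scikit-learn's Lasso minimizes the same Gaussian objective, (1/(2n))‖y − Xb‖² + α‖b‖₁, with its `alpha` playing the role of λ; all sample sizes, MAFs, and effect values below are made up:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, L = 200, 30                    # individuals, SNP sites (toy sizes)
maf = np.linspace(0.01, 0.05, L)  # hypothetical minor allele frequencies
# genotype coded 0/1/2 = copies of the minor allele (column i of X)
X = rng.binomial(2, maf, size=(n, L)).astype(float)
b_true = np.zeros(L)
b_true[-3:] = 1.5                 # three causal SNPs (hypothetical effects)
y = X @ b_true + rng.normal(size=n)

# scikit-learn's Lasso minimizes (1/(2n))||y - Xb||^2 + alpha*||b||_1,
# the same form as the loss above, so most noise coefficients shrink to 0
fit = Lasso(alpha=0.05).fit(X, y)
nonzero = np.flatnonzero(fit.coef_)
```

Here λ (alpha) is simply fixed; in the analysis described below it is chosen by minimizing the extended AIC.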

Gene-level association tests
The association is tested at the gene level. Within a gene, the dependent variable is Q2 of the GAW17 data set, and the independent variables are the genotypes of all the SNPs in the gene. We use a model with a LASSO penalty in which no interactions are involved; this model is indexed as M1. To test for association between a gene and Q2, we use an F statistic to compare models M1 and M0, where M0 is the model under the null hypothesis that b is a vector of zeros. Let RSS(M1) and RSS(M0) be the residual sums of squares of models M1 and M0, respectively. To correct for selection bias, we use the generalized degrees of freedom (GDF) [8], denoted GDF(M), in the F test for model M1; the GDF is larger than the number of nonzero coefficients. The F statistic is constructed as

$F = \frac{\left( \mathrm{RSS}(M_0) - \mathrm{RSS}(M_1) \right) / \mathrm{GDF}(M_1)}{\mathrm{RSS}(M_1) / \left( n - \mathrm{GDF}(M_1) \right)}$.

In classical linear models, the number of covariates is fixed; therefore the number of degrees of freedom is equal to the number of covariates. The situation is different in a LASSO regression: the number of nonzero coefficients no longer accurately measures the model complexity. For a LASSO regression, which involves variable selection, the GDF was introduced [8] to correct for selection bias and to accurately measure the degrees of freedom of the fitted model. The GDF of a model is defined as the average sensitivity of the fitted values to a small change in the observed values. A parametric bootstrap is used to estimate the GDF [8,9].
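A small sketch of this F statistic, with GDF(M1) entering where a parameter count would in a classical F test; the numbers passed in are purely illustrative, not from the GAW17 analysis:

```python
def f_statistic(rss_m0: float, rss_m1: float, gdf_m1: float, n: int) -> float:
    """F statistic comparing the fitted model M1, with GDF(M1) effective
    degrees of freedom, to the null model M0 (all coefficients zero)."""
    numerator = (rss_m0 - rss_m1) / gdf_m1
    denominator = rss_m1 / (n - gdf_m1)
    return numerator / denominator

# illustrative numbers only: null RSS 250, fitted RSS 180, GDF 4.7, n = 697
F = f_statistic(rss_m0=250.0, rss_m1=180.0, gdf_m1=4.7, n=697)
```

Because GDF(M1) exceeds the count of nonzero coefficients, this statistic is smaller (more conservative) than the naive F test that charges the model only for its selected SNPs.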
Suppose that the observed value y_i, i = 1, …, n, is modeled as μ_i + ε, where μ_i is the expectation of y_i and ε is Gaussian white noise with variance σ². An estimate $\hat{\sigma}^2$ of σ² can be obtained by an ordinary regression. Given a modeling procedure M: y → $\hat{\mu}$, GDF(M), the GDF of the modeling procedure M, can be estimated as follows: (1) For t = 1, …, T, draw perturbations $\delta_t = (\delta_{t1}, \ldots, \delta_{tn})$ with each $\delta_{ti} \sim N(0, \tau^2)$, and compute the fitted values $\hat{\mu}(y + \delta_t)$ on the basis of the modeling procedure. (2) Calculate $\hat{h}_i^M$ as the regression slope from $\hat{\mu}_i(y + \delta_t) = \alpha_i + \hat{h}_i^M \delta_{ti}$, t = 1, …, T. (3) Finally, calculate $\mathrm{GDF}(M) = \sum_{i=1}^{n} \hat{h}_i^M$. Given GDF(M), the extended Akaike information criterion (AIC) is defined as $\mathrm{AIC}(M) = \mathrm{RSS}(M) + 2\hat{\sigma}^2 \, \mathrm{GDF}(M)$. Thus the tuning parameter λ is selected to be the one that minimizes the extended AIC value.
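The parametric-bootstrap estimate of the GDF can be sketched as follows, with scikit-learn's Lasso standing in for the modeling procedure M; the perturbation scale τ, the number of bootstrap draws T, and the toy data are all our assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

def estimate_gdf(X, y, alpha, tau=0.5, T=100, seed=1):
    """Parametric-bootstrap GDF estimate: perturb y, refit, and sum the
    per-observation sensitivities of the fitted values."""
    rng = np.random.default_rng(seed)
    n = len(y)
    deltas = rng.normal(0.0, tau, size=(T, n))   # step (1): perturbations
    mu_hat = np.empty((T, n))
    for t in range(T):
        fit = Lasso(alpha=alpha).fit(X, y + deltas[t])
        mu_hat[t] = fit.predict(X)               # fitted values on perturbed data
    # step (2): regression slope of mu_hat[:, i] on deltas[:, i] for each i
    d_c = deltas - deltas.mean(axis=0)
    m_c = mu_hat - mu_hat.mean(axis=0)
    h = (d_c * m_c).sum(axis=0) / (d_c ** 2).sum(axis=0)
    return float(h.sum())                        # step (3): GDF = sum of slopes

rng = np.random.default_rng(0)
X = rng.binomial(2, 0.3, size=(80, 10)).astype(float)  # toy genotypes
y = X[:, 0] + rng.normal(size=80)                      # one causal SNP
gdf = estimate_gdf(X, y, alpha=0.1)
```

For a procedure with variable selection the estimated GDF typically comes out somewhat larger than the number of nonzero coefficients of the fit on the unperturbed data, which is exactly the selection-bias correction the F test needs.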
Alternative methods: F_linear and the combined multivariate and collapsing method for quantitative traits

As a comparison, we also carry out the F test based on a general linear regression for each gene, which we call F_linear. A second alternative method is the combined multivariate and collapsing (CMC) method [1], a unified approach that combines collapsing and multivariate tests for a binary trait. We modify the CMC method for the quantitative trait: markers are divided into rare and common subgroups on the basis of a predefined allele frequency threshold (δ); within the rare subgroup, an individual is coded 1 if a rare allele is present at any of the variant sites and 0 otherwise. After this collapsing, we use the F test to test for association. We call this approach QCMC(δ) for convenience, and we consider δ = 0.01 and 0.05 in this paper.
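The collapsing step of QCMC can be sketched as follows (hypothetical genotypes; `qcmc_design` is our own name, not from [1]): SNPs with MAF below δ are replaced by a single presence/absence indicator, while common SNPs keep their own columns:

```python
import numpy as np

def qcmc_design(G, maf, delta=0.01):
    """Build a QCMC-style design matrix: keep common SNPs (MAF >= delta)
    as separate 0/1/2 columns; collapse rare SNPs into one 0/1 indicator."""
    G = np.asarray(G, dtype=float)
    maf = np.asarray(maf)
    rare = maf < delta
    cols = []
    if (~rare).any():
        cols.append(G[:, ~rare])
    if rare.any():
        # 1 if a rare allele is present at any rare variant site, else 0
        cols.append((G[:, rare].sum(axis=1) > 0).astype(float)[:, None])
    return np.hstack(cols)

G = np.array([[0, 0, 1, 2],
              [0, 1, 0, 1],
              [0, 0, 0, 0]])
maf = np.array([0.002, 0.004, 0.008, 0.20])  # only the last SNP is common
X = qcmc_design(G, maf, delta=0.01)
```

The F test for the gene is then computed on this reduced design matrix instead of on all individual SNP columns.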

Results
We evaluated the power and false-positive rates of the F_LASSO, F_linear, QCMC(0.01), and QCMC(0.05) tests based on the 200 replicates of the GAW17 data set. The significance level of the tests was first set to 1.6 × 10⁻⁵, which is the Bonferroni-corrected significance level of 0.05 adjusted by the number of genes, that is, 0.05/3,205. However, because of the small sample sizes in the GAW17 data set, the power of the association tests was poor and the four tests could not be compared at this level. Therefore we also used the weaker significance level of 0.01 for method comparison.
We examined the answers to the GAW17 simulation after our association analyses were completed. In the answers, Q2 is influenced by 72 SNPs in 13 genes, and the MAF and effect size (b_i, an element of b) are given for each causal SNP. Thus the variance contributed by each SNP to the phenotype can be calculated as $2 q_i (1 - q_i) b_i^2$ under the assumption of an additive model, where q_i is the MAF. Therefore we calculated the variance contribution of a gene as

$V = \sum_{i} 2 q_i (1 - q_i) b_i^2$,

where the sum is over the causal SNPs in the gene. As shown in Table 1, genes VNN3 and VNN1 each have a variance contribution of approximately 0.02; SREBF1, BCHE, VLDLR, SIRT1, PDGFD, LPL, and PLAT have variance contributions of approximately 0.01 each; and RARB, GCKR, VWF, and INSIG1 have variance contributions between 0.0002 and 0.005. The power depends on the variance attributed to the gene.
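The additive-model variance contribution can be computed directly; the MAFs and effect sizes below are hypothetical inputs, not values from the GAW17 answers:

```python
import numpy as np

def variance_contribution(mafs, betas):
    """Additive-model phenotypic variance contributed by a gene's causal
    SNPs: V = sum_i 2*q_i*(1 - q_i)*b_i**2, with q_i the MAF of SNP i."""
    q = np.asarray(mafs, dtype=float)
    b = np.asarray(betas, dtype=float)
    return float(np.sum(2.0 * q * (1.0 - q) * b ** 2))

# two hypothetical causal SNPs: one rare, one common
v = variance_contribution([0.006, 0.17], [0.5, 0.2])
```

Note how a common SNP with a modest effect can contribute more variance than a rare SNP with a larger effect, since 2q(1 − q) grows with the MAF.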
We evaluated the power of the four methods on the 13 causal genes using the 200 replicates (Figure 1). In general, the LASSO regression outperformed linear regression for all causal genes and gained more than 10% power on the first four genes, as shown in Figure 1. The QCMC(0.01) method performed better than the QCMC(0.05) method because 91.7% of the causal SNPs have MAFs less than 0.01. Except for the VNN1 and SREBF1 genes, the LASSO method was more powerful than the two QCMC methods. This is readily explained: the VNN1 gene has two causal SNPs, with MAFs of 0.006 and 0.17, and all the causal SNPs in the SREBF1 gene have MAFs less than 0.005. For this reason, both the QCMC(0.01) and the QCMC(0.05) tests are able to collapse the causal SNPs perfectly and thereby achieve higher power than the LASSO approach for these two genes.
In general, the power of all the tests increased as a gene's contribution to the phenotype variation increased. However, we observed some exceptions, possibly because power depends on many other factors, such as allele frequency and linkage disequilibrium among the SNPs within a gene. First, although their contributions to the phenotype variation were similar, we had more power to detect VNN1, which contains two causal SNPs, one of them common (MAF = 0.17), than VNN3, which contains seven rare causal SNPs. Second, for the GCKR gene, which has only one causal SNP, we also had reasonable power, despite its small contribution to the phenotype variation. The association for these two genes was concentrated in a small number of causal SNPs and hence was easier to detect. Third, the SIRT1 and VLDLR genes had similar numbers of SNPs, numbers of causal SNPs, MAFs, and variance contributions; however, SIRT1 gained much more power than VLDLR did. To understand why, we examined the linkage disequilibrium within each of these genes (using Haploview, http://www.broad.mit.edu/mpg/haploview) (Figure 2). SIRT1 includes a common SNP, C10S3059 (MAF = 0.167), that is in linkage disequilibrium with the causal rare SNP C10S3048 (MAF = 0.002). The four gametes formed by these two SNPs are CT (83.7%), CC (16.6%), GC (0.1%), and GT (0.1%); the D′ value is 0.5 (R² = 0.003). Among the 55 significant tests of the SIRT1 gene in the 200 replicates, 81.8%, 60%, and 50.9% of the LASSO models selected SNP C10S3059, SNP C10S3048, or both, respectively, in their M1 models. For VLDLR, although SNP C9S341 (MAF = 0.095) was also in linkage disequilibrium with the causal SNP C9S444 (MAF = 0.001; D′ = 0.384, R² = 0.002), it was not as common as C10S3059 and the linkage disequilibrium pattern was not the same as that for SIRT1.
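Pairwise D′ and R² values of the kind reported by Haploview can be computed from two-SNP haplotype (gamete) frequencies. The frequencies below are hypothetical, chosen only to mimic a common marker weakly tagging a rare variant; they are not the SIRT1 gamete counts:

```python
def ld_stats(p_AB, p_Ab, p_aB, p_ab):
    """|D'| and r^2 from the four haplotype frequencies of two biallelic
    SNPs with alleles A/a and B/b; frequencies must sum to 1."""
    assert abs(p_AB + p_Ab + p_aB + p_ab - 1.0) < 1e-9
    pA = p_AB + p_Ab
    pB = p_AB + p_aB
    D = p_AB - pA * pB
    if D >= 0:
        d_max = min(pA * (1 - pB), (1 - pA) * pB)
    else:
        d_max = min(pA * pB, (1 - pA) * (1 - pB))
    r2 = D ** 2 / (pA * (1 - pA) * pB * (1 - pB))
    return abs(D) / d_max, r2

# hypothetical gamete frequencies, in the spirit of the SIRT1 example
dprime, r2 = ld_stats(0.837, 0.161, 0.001, 0.001)
```

This reproduces the qualitative pattern discussed above: a rare variant can show a sizable |D′| with a common marker even when R² is tiny.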
Table 1 True variance contributions of 13 causal genes given in the GAW17 answers

We also investigated the false-positive rates by counting the frequency of the P-values that were not larger than a specific significance level over all of the 3,192 noncausal genes and the 200 replicates (Table 2). For some unknown reason, all four methods had inflated false-positive rates, and the inflation of the F_LASSO test was slightly larger than that of the other three tests, but not significantly so.

Discussion and conclusions
In this study, we used the LASSO regression and calculated the GDF for the F tests to avoid selection bias. This method requires a parametric bootstrap to obtain the GDF; therefore it is computationally not as fast as the linear regression and collapsing methods. In general, the F_LASSO test is more powerful than the other methods. Linear regression is the least powerful approach because of the large number of rare SNPs and because no reduction is made to the correspondingly large number of degrees of freedom. The collapsing test requires specifying a predefined allele frequency threshold for grouping rare SNPs, and it is difficult to choose this criterion optimally when in reality the true disease model is never known. As an extreme example, the QCMC(0.001) test was identical to the linear regression approach, and the QCMC(0.1) test had no power at all in these data. From this point of view, therefore, we recommend the LASSO approach for detecting rare SNPs.
Based on the power comparison of the SIRT1 and VLDLR genes, we observed some evidence that linkage disequilibrium plays a significant role in detecting rare causal SNPs: if a rare causal SNP is in strong linkage disequilibrium with a common marker in the same gene, the power to detect it is much higher. It would be of interest to further investigate the role of linkage disequilibrium between common noncausal markers and rare causal SNPs in the power to detect rare causal SNPs and hence to derive a more powerful test.