Gene-based partial least-squares approaches for detecting rare variant associations with complex traits
© Turkmen and Lin; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
Genome-wide association studies are largely based on single-nucleotide polymorphisms and rest on the common disease/common variants (single-nucleotide polymorphisms) hypothesis. However, it has been argued in the last few years and is well accepted now that rare variants are valuable for studying common diseases. Although current genome-wide association studies have successfully discovered many genetic variants that are associated with common diseases, detecting associated rare variants remains a great challenge. Here, we propose two partial least-squares approaches to aggregate the signals of many single-nucleotide polymorphisms (SNPs) within a gene to reveal possible genetic effects related to rare variants. The availability of the 1000 Genomes Project offers us the opportunity to evaluate the effectiveness of these two gene-based approaches. Compared to results from a SNP-based analysis, the proposed methods were able to identify some (rare) SNPs that were missed by the SNP-based analysis.
The past decade has seen a surge of interest in genome-wide association studies because of the availability of densely situated single-nucleotide polymorphisms (SNPs) throughout the genome and because of new exciting results that offer tremendous hope and optimism . However, such hope and optimism have been dampened by the limited success of reaping benefits from the discoveries, because the identified SNPs contribute little to the explanation of the underlying variability. Such realizations set off vigorous debates, including those in the popular media. However, second-generation sequencing technology has made it practically feasible to reliably genotype rare SNP variants.
In mapping genes that contribute to common diseases, a popular hypothesis is that causal variants are common in the population. However, it is now hypothesized that complex traits may be caused by the combined contribution of many rare variants (minor allele frequency [MAF] < 0.05); this contribution is one of the many potential explanations for the missing heritability . However, even though rare variants can now be genotyped efficiently and accurately using second-generation sequencing technology, detecting associations between rare variants and disease using SNP-based methods is frequently ineffective, especially when the effect size is small. Therefore there is a pressing need for statistical methods that can detect rare variants.
Here, we propose a gene-based partial least-squares (GBPLS) method for detecting associations with quantitative traits. By considering a gene as the fundamental unit in our modeling, we hope that this aggregation effect will help us uncover associations that are too weak to be detected for individual SNPs. To reduce the computational burden and noise caused by the inclusion of a large number of noncausal variants, we use a screening procedure that selects the top genes, to which we apply the proposed methods.
Our data analysis was performed with the knowledge of the answers, which contributed to our focus on gene-based strategies, although the proposed method does not make use of the specific simulation model. As such, we believe that the proposed methods would be suitable for the analysis of real data because SNPs within genes are often working together to regulate phenotypes.
The Genetic Analysis Workshop 17 (GAW17) data contain one family data set and one population data set. Two hundred replicates of the trait simulation were carried out in both data sets . In this study, we used the population data consisting of a collection of 697 unrelated individuals. Information for each individual in each replicate includes the genotypes of 24,487 SNPs in 3,205 genes for each individual, covariates Age, Sex, and Smoke, quantitative traits Q1, Q2, Q4, and qualitative trait Affected. Note that the SNP genotypes are held fixed for all 200 replicates. Of the 24,487 SNPs, almost 91% and 75% have a MAF less than 0.1 and 0.01, respectively. We carried out our proposed procedures for two quantitative traits: Q1 and Q2. The design matrix of the genotype X of size n x p is constructed by labeling genotypes AA, Aa, and aa as 0, 1, and 2, respectively, where, n = 697 and p = 24,487. All 200 replicates are used to learn about association. To control population structure, we also used in the analyses the non-SNP covariates Age, Sex, and Smoke and the first 10 principal components (PCs) P k (k = 1, 2, …, 10) of genotype data.
Data preprocessing: the screening stage
The screening stage is composed of two steps. In the first step, correlations between the 24,487 SNPs and the quantitative trait Q j (j = 1, 2), are calculated. Each gene is scored by the largest absolute value of the correlation between the trait and the SNPs within the corresponding gene. We then order the 3,205 genes based on these scores and retain only the top 1,000 to reduce the amount of computation in subsequent analyses. In the second step, we randomly assign all of the SNPs in the top 1,000 genes into the same set of 1,000 genes 500 times to control the different gene sizes. We recompute the gene scores for each randomization and calculate the p-values as the proportion of recalculated gene scores that are as extreme as or more extreme than the original gene scores. We rank the 1,000 genes based on the p-values (i.e., the smaller the p-value, the smaller the rank of the gene). The output of this stage produces the ordered list of the top 1,000 genes.
With the knowledge of the simulated model, we can see that this preliminary screening stage retains most of the genes involved in the trait and excludes many noncausal genes. We ran the analysis for values different from 1,000 as well and concluded that the top 1,000 yields the best results. Because of space limitations, we report here the results based on only the top 1,000 genes. We use to denote the lower dimensional design matrix X, where the columns of are constructed by the SNPs within the first t of the 1,000 ordered genes, where t = 1, 2, …, 1,000. Similarly, the submatrix that stores only the SNPs within the gene i is denoted for i = 1, 2, …, t. Although the initial screening stage helps to reduce the dimensionality of the design matrix greatly, there is still the curse of dimensionality problem because there are more SNPs than observations.
Gene-based partial least-squares approach
In the GBPLS1 algorithm, partial least-squares path modeling (PLSPM) is used to calculate the outer coefficients. PLSPM is a statistical method that was developed for the analysis of structural equation models with latent variables. A formal presentation of the partial least-squares approach to latent variable path models is given by Wold ; a more recent reference can be found in Tenenhaus et al. . In the PLSPM used for the GBPLS1 algorithm, genotype information from SNPs within the same gene are combined into a single LV (i.e., gene score) by constructing a linear function of the SNPs in , (i = 1, 2, …, 100). The resulting coefficients of the SNPs are the outer weights/coefficients.
for j = 1, 2, where Z i is the LV for gene i. We then order the absolute values of the inner model coefficients for the gene scores (i.e., β i , i = 1, 2, …, t) to identify the q most important ones in explaining the trait Q j , where q in the analysis is taken to be 25, 35, and 50. To determine the relative importance of the SNPs in the construction of the q most important gene scores, we ordered the absolute values of the outer weights and recorded the corresponding ranks (called SNP ranks) for each SNP.
Although the GBPLS1 approach can deal with a dimension corresponding to 100 genes, it has computational problems when we run it for more than 100 genes because the algorithm simply uses the least-squares estimator that would fail with large dimensionality. To remedy this problem, we propose a similar algorithm, GBPLS2, that incorporates a partial least-squares and penalized regression for calculating the outer and inner weights, respectively, to handle higher dimensions.
for t = 100, 250, 500, 750 and i =1, 2, …, t.
In the second stage, we apply the LASSO (least absolute shrinkage and selection operator) penalty  to implement regression analysis of traits on the gene scores T i and other non-SNP covariates and the first 10 PCs in which only gene scores were penalized. The penalty parameter was determined for each replicate by 10-fold cross-validation. The genes with nonzero inner coefficients and the rankings of the corresponding outer coefficients for the SNPs within these genes are the outcomes of the GBPLS2 algorithm.
We carried out our GBPLS analyses using R packages plspm (the factor scheme), glmnet, and plsgenomics, which were downloaded from [http://cran.r-project.org/].
False-positive and true-positive rates for the GBPLS1 and GBPLS2 methods for selected values of c, t, and q
We also include the results for a SNP-based approach to demonstrate the advantages of the gene-based approach. We applied the t-test by treating SNPs as the two group variables because there are no homozygous genotypes for the minor alleles. We calculated p-values for each replicate and calculated the median of the p-values for traits Q1 and Q2. This SNP-based analysis indicated that no variants (except C13S522 and C13S523 in FLT1) exhibited genome-wide significance (<2 × 10−6) with Bonferroni correction.
Summary of results for Q1
C1S6533, C1S6542, C1S6561
C1S6533, C1S6540, C1S6542
C13S431, C13S522, C13S523, C13S524
C13S431, C13S522, C13S523, C13S524
C4S1873, C4S1877, C4S1878, C4S1884, C4S1889
C4S1861, C4S1877, C4S1878, C4S1884, C4S1889
Discussion and conclusions
In this paper, we used the 1000 Genomes Project second-generation sequencing genotype data in conjunction with simulated phenotypic data made available through GAW17 to demonstrate our gene-based approach for detecting associations between common diseases and rare variants. Our results are encouraging because some of the causal SNPs in the simulation model were successfully detected, whereas they would have remained elusive with a SNP-based method. However, because the SNP genotypes were held fixed in the simulated replicates, further evaluation is warranted to fully assess the effectiveness of the methods under more general and more diverse settings.
It is also worth pointing out that, although population substructure is not part of the feature of the simulation model, our methods are capable of accommodating such a complex scenario. Our results indicate that statistical power was not greatly affected by the inclusion of factors to account for potential stratification caused by population substructure. However, this assessment is preliminary and, as such, further investigation is needed for more comprehensive evaluation. Moreover, the effectiveness of the methods for correctly accounting for population structure will need to be carefully evaluated.
Finally, we note that our analysis and summary of the results are with knowledge of the answers. Nevertheless, the proposed approach was not developed based on the specific simulation model; rather, it was designed to detect variants, including rare ones, that work together within the same genes for causing or regulating disease phenotypes. Thus we expect our method to be effective in real data analysis.
The Genetic Analysis Workshop is supported by National Institutes of Health grant R01 GM031575. We would like to thank the reviewers for their helpful comments, which have improved the paper.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
- Almasy LA, Dyer TD, Peralta JM, Kent JW, Charlesworth JC, Curran JE, Blangero J: Genetic Analysis Workshop 17 mini-exome simulation. BMC Proc. 2011, 5 (suppl 9): S2-10.1186/1753-6561-5-S9-S2.PubMed CentralView ArticlePubMedGoogle Scholar
- Chanock SJ, Hunter DJ: Genomics: when the smoke clears …. Nature. 2008, 452: 537-538. 10.1038/452537a.View ArticlePubMedGoogle Scholar
- Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al: Finding the missing heritability of complex diseases. Nature. 2009, 461: 747-753. 10.1038/nature08494.PubMed CentralView ArticlePubMedGoogle Scholar
- Wold H: Soft modeling: the basic design and some extensions. Systems Under Indirect Observation: Causality, Structure, Prediction. Edited by: KG Joreskog, H Wold. 1982, Amsterdam, North Holland, 2: 1-54.Google Scholar
- Tenenhaus M, Vinzi VE, Chatelin Y-M, Lauro C: PLS path modeling. Comput Stat Data Anal. 2005, 48: 159-205. 10.1016/j.csda.2004.03.005.View ArticleGoogle Scholar
- Wold H: Partial least squares. Encyclopedia of Statistical Sciences. Edited by: S Kotz, NL Johnson. 1985, New York, Wiley, 6: 581-591.Google Scholar
- Tibshirani R: Regression shrinkage and selection via the Lasso. J R Stat Soc B. 1996, 58: 267-288.Google Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.