Volume 5 Supplement 9
Genome-wide association analysis of GAW17 data using an empirical Bayes variable selection
© Pungpapong et al; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
Next-generation sequencing technologies enable us to explore rare functional variants. However, most current statistical techniques are too underpowered to capture signals of rare variants in genome-wide association studies. We propose a supervised coalescing of single-nucleotide polymorphisms to obtain gene-based markers that can stably reveal possible genetic effects related to rare alleles. We use a newly developed empirical Bayes variable selection algorithm to identify associations between studied traits and genetic markers. Using our novel method, we analyzed the three continuous phenotypes in the GAW17 data set across 200 replicates, with intriguing results.
With the advent of next-generation sequencing, rare variants such as single-nucleotide polymorphisms (SNPs) with a minor allele frequency (MAF) less than 5% are receiving more attention in genome-wide association studies (GWAS). Because of the small variance at a locus with a single rare allele, it is difficult to detect the allele’s association with the phenotype of interest. One approach to tackling this problem is to collapse multiple rare SNPs within a defined region and treat them as a single predictor in the model. Here, known genetic regions define the collapsing units, so that each region yields a gene-based marker, and penalized orthogonal-components regression (POCRE) [1] is used to perform the collapsing.
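To make the collapsing idea concrete, the following minimal sketch sums minor-allele counts over the rare SNPs of one region to form a single predictor. It is an unsupervised burden-style collapse shown only for illustration, not the supervised POCRE coalescing used in this study; the function name `collapse_rare_variants`, the toy genotype matrix, and the 0.05 cutoff are assumptions.

```python
import numpy as np

def collapse_rare_variants(genotypes, maf_cutoff=0.05):
    """Sum minor-allele counts over rare SNPs in one region.

    genotypes: (n_individuals, n_snps) array of minor-allele counts (0, 1, 2).
    Returns the per-individual collapsed score and the mask of rare SNPs.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    # MAF = total minor-allele count / number of chromosomes (2 per individual)
    maf = genotypes.mean(axis=0) / 2.0
    rare = maf < maf_cutoff
    return genotypes[:, rare].sum(axis=1), rare

# Toy region (hypothetical data): 20 individuals, 3 SNPs
G = np.zeros((20, 3), dtype=int)
G[0, 0] = 1    # SNP 0: one heterozygous carrier, MAF = 1/40
G[5, 1] = 1    # SNP 1: one heterozygous carrier, MAF = 1/40
G[:10, 2] = 1  # SNP 2: common, MAF = 10/40

burden, rare = collapse_rare_variants(G)
print(rare)    # which SNPs were treated as rare
print(burden)  # one collapsed predictor value per individual
```

Each rare SNP alone has almost no variance across individuals, whereas the collapsed score pools the carriers of all rare SNPs in the region into one column.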
Genome-wide association studies are challenged by the “curse of dimensionality”; that is, a large number of SNPs are genotyped (large p) from a small number of biological samples (small n). As a result, an increasing effort has been devoted to selecting variables in high-dimensional data. One strategy for dealing with variable selection is through the thresholding concept. Empirical Bayes thresholding [2, 3] was proposed to estimate sparse sequences observed in Gaussian white noise. Here, we use the empirical Bayes thresholding method to select variables in linear regressions with efficient implementation. Final models are obtained by entering gene-based markers and environmental factors possibly associated with the phenotype of interest. All analyses are based on three continuous phenotypes in the GAW17 data set across 200 replicates.
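The thresholding concept can be illustrated with a deliberately simple stand-in. The sketch below applies hard thresholding at the universal threshold σ√(2 log n) to a sparse sequence observed in noise; this is not the empirical Bayes thresholding of Johnstone and Silverman [2, 3], which chooses the threshold adaptively from the data, and the function name and toy sequence are assumptions.

```python
import numpy as np

def hard_threshold(z, sigma=1.0):
    """Keep observations whose magnitude exceeds sigma * sqrt(2 log n); zero the rest."""
    z = np.asarray(z, dtype=float)
    t = sigma * np.sqrt(2.0 * np.log(z.size))
    return np.where(np.abs(z) > t, z, 0.0), t

# Hypothetical noisy observations: two large signals among small noise values
z = np.array([0.3, -0.8, 5.2, 0.1, -4.7, 0.9, -0.2, 0.5])
estimate, threshold = hard_threshold(z)
print(threshold)
print(estimate)
```

Observations below the threshold are set exactly to zero, which is the property that turns a thresholding rule into a variable selection rule when it is applied to regression coefficients.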
The genome-wide association of the three continuous phenotypes (Q1, Q2, and Q4) in the GAW17 data set [4] was investigated. All analyses presented here are based on the genotypes of 697 unrelated individuals. The genotype data were recoded into counts of minor alleles using PLINK [5]. Three additional covariates (Age, Sex, and Smoke) were included in the model to account for environmental effects. The analyses were performed for all 200 replicates.
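The recoding step can be sketched for a single SNP as follows. This only illustrates what a minor-allele count is; it is not PLINK's implementation or file format, and the function name and the `A/G` string encoding are assumptions.

```python
from collections import Counter

def recode_minor_allele_counts(genotypes):
    """Recode one SNP from allele pairs such as 'A/G' into minor-allele counts.

    Returns (minor_allele, counts). A monomorphic SNP has no minor allele,
    so it yields (None, all zeros). Ties go to the allele seen first.
    """
    alleles = Counter(a for g in genotypes for a in g.split("/"))
    if len(alleles) < 2:
        return None, [0] * len(genotypes)
    minor = min(alleles, key=alleles.get)
    return minor, [g.split("/").count(minor) for g in genotypes]

# Hypothetical genotypes for five individuals at one SNP
snp = ["A/A", "A/G", "G/G", "A/A", "A/A"]
minor, counts = recode_minor_allele_counts(snp)
print(minor, counts)
```

Each individual ends up with a value of 0, 1, or 2, which is the numeric predictor entered into the regression models.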
Supervised coalescing of SNPs in a genetic region
where g_λ(γ) is a penalty function with a tuning parameter λ. Zhang and colleagues [1] used the empirical Bayes thresholding method proposed by Johnstone and Silverman [2, 3] to introduce a proper penalty function, which provides adaptive sparse loadings of orthogonal components.
POCRE is a supervised learning method that requires both genotype and phenotype information to build a model. In the GAW17 data set, the genotype is held fixed but the phenotype varies across 200 replicates. To reduce potential overfitting in the model-building process, we selected one replicate as a training set to obtain the sparse coefficients of SNPs in each genetic region, and we then applied the results from POCRE to data in another replicate. In practice, when only one data set is available, cross-validation can be performed to select a tuning parameter to alleviate overfitting.
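The train-on-one-replicate, apply-to-another scheme can be sketched as follows. POCRE itself is not implemented here: soft-thresholded marginal covariances serve as a simple stand-in for the sparse SNP coefficients, and the simulated genotypes, effect sizes, function names, and the value of `lam` are all assumptions made for illustration.

```python
import numpy as np

def fit_sparse_weights(G, y, lam):
    """Stand-in for sparse loadings: soft-thresholded SNP-phenotype covariances."""
    Gc = G - G.mean(axis=0)
    yc = y - y.mean()
    score = Gc.T @ yc / len(y)
    return np.sign(score) * np.maximum(np.abs(score) - lam, 0.0)

def gene_marker(G, w):
    """Coalesce the SNPs of one region into a single gene-based marker."""
    return G @ w

rng = np.random.default_rng(0)
n = 200
# Hypothetical region with four SNPs; genotypes are shared by all replicates
G = rng.binomial(2, [0.3, 0.02, 0.2, 0.01], size=(n, 4)).astype(float)
effect = np.array([0.0, 2.0, 0.0, 0.0])     # assumed: only SNP 1 is causal
y_train = G @ effect + rng.normal(size=n)   # phenotype in the training replicate
y_other = G @ effect + rng.normal(size=n)   # phenotype in a different replicate

w = fit_sparse_weights(G, y_train, lam=0.03)  # weights learned from the training replicate only
marker = gene_marker(G, w)                    # marker then analyzed against y_other
if marker.std() > 0:
    print(np.corrcoef(marker, y_other)[0, 1])
```

Because the weights never see the phenotype of the second replicate, any association between the marker and that phenotype is not inflated by the supervised construction of the marker.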
Empirical Bayes variable selection
Each coefficient β_j is assigned a mixture prior that places weight 1 − ω on a point mass at zero, δ_0, and weight ω on a continuous density γ, where δ_0(β_j) = 1 if β_j = 0 and δ_0(β_j) = 0 otherwise; and γ(β_j | a) = (a/2) exp(−a|β_j|), following Johnstone and Silverman [2, 3]. Data-driven optimal values for ω and a were obtained to achieve adaptivity to the sparseness and to the shape of the prior distribution of β_j, respectively. With current values of β and σ, the optimal values for ω and a are obtained as the values that maximize their full conditional distribution functions, P(ω|β,σ) and P(a|β,σ), respectively. β is then updated to its posterior median. The iterative procedure for updating β and the hyperparameters is carried out until convergence. With this mixture prior distribution, EBVS gives a sparse solution for β.
In addition, another 98 noncausal genes were identified. Each of these genes was identified in only one or two of the 200 replicates, which might be due to noise. Another causal gene, VEGFC, was also found and included in the final model in two replicates. However, after transforming gene-based results into SNP-based results, none of the true causal SNPs belonging to VEGFC were identified.
With next-generation sequencing technology, many rare variants or low-frequency SNPs can be detected. The customary MAF criterion in GWAS data preprocessing (i.e., MAF ≥ 0.05) is not appropriate in this situation. One possible solution is to lower the MAF cutoff point. Although this is easy to do, it is difficult to determine the optimal cutoff point. With too large a cutoff point, the majority of rare variants are discarded from the analyses and little is gained from the next-generation sequencing data. With too small a cutoff point, most SNPs are included in the model, which makes it statistically challenging to detect signals of rare variants.
We grouped both common and rare variants in the same genetic region into a gene-based marker using POCRE. POCRE has a variable selection property that assumes that not all SNPs in a genetic region contribute to a gene-based marker. Although this assumption is realistic, the variable selection property of POCRE might rule out true causal SNPs in the coalescing process. On the other hand, the coalescing process might include noncausal SNPs, resulting in false positives at the SNP level when the gene is identified by EBVS as having a nonzero effect. Better techniques for combining SNPs into gene-based markers need to be studied further to overcome the challenges posed by next-generation sequencing data.
Another challenge in analyzing the GAW17 data is signal detection for a trait with low heritability. It is well known that nonzero effects are difficult to identify in GWAS for a trait with low heritability. When the true causal variants are also rare, the situation worsens and the signals become even more difficult to detect. Better strategies need to be explored in GWAS to tackle the problem of a low-heritability trait with rare variants.
In this study, we proposed using POCRE to coalesce common and rare variants in the same gene into a gene-level marker and applied the newly developed empirical Bayes variable selection to detect the association between markers and three continuous phenotypes in the GAW17 data set: Q1, Q2, and Q4. With a large number of predictors, the newly developed empirical Bayes approach not only selects important variables into the model but also estimates the effect sizes of nonzero predictors simultaneously.
Our results show that combining both common and rare variants into gene-level markers can increase the power to detect their signals. In fact, many identified true causal SNPs have MAF = 0.000717, that is, the variant allele is observed in only one of the 697 individuals. Nevertheless, there are still a number of false negatives. Based on the GAW17 data, we notice that false negatives occur when only a few causal SNPs are present in the genetic region. Even when the number of causal SNPs in the gene region is moderate, it is still challenging to detect true signals when most of the causal SNPs are rare variants. As shown in our analysis, causal SNPs with higher MAFs are identified more frequently than causal SNPs with lower MAFs.
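The reported MAF of 0.000717 corresponds to a single copy of the variant allele among the 697 individuals analyzed. The short check below works through the arithmetic, assuming an autosomal diploid locus and one heterozygous carrier; the function name is hypothetical.

```python
def singleton_maf(n_individuals):
    """MAF of an allele carried once among n diploid individuals (2n chromosomes)."""
    return 1 / (2 * n_individuals)

maf = singleton_maf(697)   # 1 copy out of 1394 chromosomes
print(f"{maf:.6f}")
```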
Support from the National Institutes of Health (NIH) grant R01 GM031575, National Science Foundation CAREER Grant IIS-0844945, NIH grant U01CA128535, and the Cancer Care Engineering project at the Oncological Sciences Center of Purdue University is gratefully acknowledged.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
- Zhang D, Lin Y, Zhang M: Penalized orthogonal-components regression for large p small n data. Electron J Stat. 2009, 3: 781-796. 10.1214/09-EJS354.
- Johnstone IM, Silverman BW: Needles and straw in haystacks: empirical Bayes estimates of possibly sparse sequences. Ann Stat. 2004, 32: 1594-1649. 10.1214/009053604000000030.
- Johnstone IM, Silverman BW: EbayesThresh: R programs for empirical Bayes thresholding. J Stat Software. 2005, 12: 1-38.
- Almasy LA, Dyer TD, Peralta JM, Kent JW, Charlesworth JC, Curran JE, Blangero J: Genetic Analysis Workshop 17 mini-exome simulation. BMC Proc. 2011, 5 (suppl 9): S2. 10.1186/1753-6561-5-S9-S2.
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, et al: PLINK: a tool set for whole-genome association and population-based linkage analysis. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.