- Open Access
A Bayesian hierarchical gene model on latent genotypes for genome-wide association studies
© Johnston and Carvalho; licensee BioMed Central Ltd. 2014
Published: 17 June 2014
The primary goal of genome-wide association studies is to determine which genetic markers are associated with genetic traits, most commonly human diseases. As a result of the "large p, small n" nature of genome-wide association study data sets, and especially because of the collinearity due to linkage disequilibrium, multivariate regression results in an ill-posed problem. To overcome these obstacles, we propose preprocessing single-nucleotide polymorphisms to adjust for linkage disequilibrium, and a novel Bayesian statistical model that exploits a hierarchical structure between single-nucleotide polymorphisms and genes. We obtain posterior samples using a hybrid Metropolis-within-Gibbs sampler, and further conduct inference on single-nucleotide polymorphism and gene associations using centroid estimation. Finally, we illustrate the proposed model and estimation procedure and discuss results obtained on the data provided for the Genetic Analysis Workshop 18.
In genome-wide association studies (GWAS), we infer which single-nucleotide polymorphisms (SNPs) are associated with a trait. We cast this problem as variable selection; however, because the number of observations in a GWAS data set, n, is typically much smaller than the number of SNPs, p, this is a "large p, small n" problem . This problem is aggravated by the computational cost of trying to fit a complex statistical model involving hundreds of thousands of SNPs. As a result, few publications have incorporated interaction testing of GWAS data . Models that have been proposed include, but are not limited to, simple logistic regression models that only look for marginal effects , more complicated logistic regression models that allow for interactions , and nonlinear models . Bayesian models have also been explored as an effective way to reduce the curse of dimensionality (eg, Ref.  and references therein). Our objective is to supplement these models with one that accounts for correlation in the model specification and that can exploit SNP groupings within genes.
It is usual to assume that the genotype data X is known as observed data and to define the likelihood of the trait response y conditional on X. This can be problematic for inference because X depends on minor allele frequencies (MAFs), and elements of X can be highly correlated as a result of linkage disequilibrium (LD). It is possible to simulate genotypes by sampling and dichotomizing a random vector from a multivariate normal distribution with a zero mean vector and a covariance matrix that can be computed from the correlation between SNPs . We propose modeling X as though it was generated in this way; that is, we observe, in X, a correlated and categorized--by allele frequencies--version of the latent genotypes, which we denote Z. We model y using Z in place of X.
Instead of obtaining latent genotypes for each marker and individual, we settle with an approximation that allows us to fit a model with many SNPs. Denoting the continuous but correlated genotypes by U, we compute , and then, , where C is the correlation matrix. For now, C is estimated using the sample correlation matrix.
Hierarchical gene model
where , , is the number of genes that cover , and is the prior probability of a gene being active. To sample and from their posterior distributions, we adopt a Gibbs sampling procedure with Metropolis-Hastings steps to sample from the posterior distributions of , , and . After checking for convergence, we use the centroid estimator to estimate the posterior probability of association (PPA) of the jth SNP, , based on N samples from this procedure as , and similarly for , for each gene g. By increasing the "boost" parameter , we can place more weight on the information from the gene level. This regularizes the SNPs such that by tuning we may adjust the PPA level of separation between causal and noncausal SNPs.
An ubiquitous estimator in Bayesian inference is the maximum a posteriori (MAP) estimator, , but may correspond to a sharp peak in a multimodal and structured posterior space that does not gather much posterior mass around it. An estimator that is arguably better suited for complex spaces is the centroid estimator, , where is Hamming distance. For unconstrained spaces such as ours, it can be shown that is a consensus estimator; that is, . The centroid estimator can be shown to be closer to the mean than to a mode of the posterior space of SNP associations, and so offers a better summary of the posterior distribution of .
Top 5 SNPs for original raw (normal text) and latent genotypes (bold)
We presented a Bayesian variable selection approach that performs joint inference for quantitative trait association on collections of genetic markers while formally modeling gene effects through a hierarchical influence. In addition, we prescribe centroid estimators that are based on posterior probabilities of association and thus enable a direct interpretation of their values uniformly across studies without having to correct for multiple testing. We also proposed the novel use of latent genotypes as a way to account for SNP correlations caused by LD. We believe that this method offers a reasonably accurate and flexible assumption because genotypes are corrected directly in the model instead of considered in the estimation procedure, as, for example, as kernel weights in the sequence kernel association test (SKAT) . However, unfortunately, we were not able to find meaningful effects in the GAW18 data set when using latent genotypes that would point to interesting genes. This outcome can be explained by many factors, including a low sample size, an inaccurate representation of the correlation across markers, and a poor choice of SNP blocks, and thus warrants further investigation. Moreover, a more thorough prior sensitivity analysis would recommend a less stringent distribution for some hyperparameters, mainly , that would favor more genes to be active.
The authors would like to acknowledge the Linux Clusters for Genetic Analysis (Lin GA) computing resources at Boston University Medical Campus for their support in managing and storing GAW18 data and annotations. The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
- West M: Bayesian factor regression models in the "large p, small n" paradigm. Bayesian Stat. 2003, 7: 723-732.Google Scholar
- Cordell H: Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet. 2009, 392-404. 10Google Scholar
- Armitage P, Berry G, Matthews JNS: Statistical Methods in Medical Research. Chichester, Blackwell Science. 2002Google Scholar
- McCullagh P, Nelder JA: Generalized Linear Models. London, Chapman & Hall. 1989Google Scholar
- McKinney BA, Reif DM, Ritchie MD, Moore JH: Machine learning for detecting gene-gene interactions: a review. Appl Bioinformatics. 2006, 77-88. 5Google Scholar
- Guan Y, Stephens M: Bayesian variable selection regression for genome-wide association studies and other large-scale problems. Ann Appl Stat. 2011, 5: 1780-1815. 10.1214/11-AOAS455.View ArticleGoogle Scholar
- Montana G: HapSim:a simulation tool for generating haplotype data with pre-specified allele frequencies and LD coefficients. Bioinformatics. 2005, 21: 4309-4311. 10.1093/bioinformatics/bti689.View ArticlePubMedGoogle Scholar
- George E, McCulloch R: Variable selection via Gibbs sampling. JAMA. 1993, 88: 881-889.Google Scholar
- Carvalho L, Lawrence C: Centroid estimation in discrete high-dimensional spaces with applications in biology. Proc Natl Acad Sci U S A. 2008, 105: 3209-3214. 10.1073/pnas.0712329105.PubMed CentralView ArticlePubMedGoogle Scholar
- Protein page: DRD2 (human). [http://www.phosphosite.org/proteinAction.do?id=16214]
- Vallvé JC, Serra N, Zalba G, Fortuño A, Beloqui O, Ferre R, Ribalta J, Masana L: Two variants in the fibulin2 gene are associated with lower systolic blood pressure and decreased risk of hypertension. PLoS One. 2012, 7: e43051-10.1371/journal.pone.0043051.PubMed CentralView ArticlePubMedGoogle Scholar
- Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X: Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011, 89: 82-93. 10.1016/j.ajhg.2011.05.029.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.