 Proceedings
 Open Access
Bayesian hierarchical modeling of means and covariances of gene expression data within families
 Roger PiqueRegi^{1},
 John Morrison^{1} and
 Duncan C Thomas^{1}
https://doi.org/10.1186/175365611S1S111
© PiqueRegi et al; licensee BioMed Central Ltd. 2007
 Published: 18 December 2007
Abstract
We describe a hierarchical Bayes model for the influence of constitutional genotypes from a linkage scan on the expression of a large number of genes. The model comprises linear regression models for the means in relation to genotypes and for the covariances between pairs of related individuals in relation to their identitybydescent estimates. The matrices of regression coefficients for all possible pairs of singlenucleotide polymorphisms (SNPs) by all possible expressed genes are in turn modeled as a mixture of null values and a normal distribution of nonnull values, with probabilities and means given by a thirdlevel model of SNP and trait random effects and a spatial regression on the distance between the SNP and the expressed gene. The latter provides a way of testing for cis and trans effects. The method was applied to data on 116 SNPs and 189 genes on chromosome 11, for which Morley et al. (Nature 2004, 430: 743–747) had previously reported linkage. We were able to confirm the association of the expression of HSD17B12 with a SNP in the same region reported by Morley et al., and also detected a SNP that appeared to affect the expression of many genes on this chromosome. The approach appears to be a promising way to address the huge multiple comparisons problem for relating genomewide genotype × expression data.
Keywords
 Full Conditional Distribution
 MCMC Sample
 Generalize Estimate Equation Model
 Gene Expression Phenotype
 Multiple Comparison Problem
Background
Recent advances in genomic technology now allow genotyping of hundreds of thousands of singlenucleotide polymorphisms (SNPs) and measurement of the expression of tens of thousands of genes on single microarrays or chips at a manageable cost. Extensive literature on the analysis of gene expression data has evolved over the last five years, and since the advent of ultrahighvolume genotyping platforms, genomewide association and linkage scans using SNPs have also become feasible. The multiple comparisons problem is central to the analysis of either type of highvolume data. In 2001, Jansen and Nap [1] proposed combining the analysis of the two technologies in what he called "genetical genomics" to provide insight into the genetic regulation of gene expression.
However, only quite recently have attempts been made to relate the two technologies, first by Morley et al. [2] in a linkage scan for 3554 expressed genes in relation to 2756 autosomal SNP markers, and subsequently by the same group [3] in a genomewide association scan of 27 of the expressed genes with the highest linkage in the first study, in relation to >770,000 SNPs. (See also Schadt et al. [4] and Stranger et al. [5] for similar analyses.) Independently, Tsalenko et al. [6] proposed a biclustering method to visualize SNPs and the transcripts they regulate, using an approach that is more visual than statistical. The multiple comparisons problem in such analyses (2.7 billion comparisons in the association analysis) dwarfs those from either genomewide linkage or association analyses of single traits or supervised cluster analyses of expression data in relation to single outcomes.
Therefore, there is a need to develop new statistical methods to analyze all transcripts and genotypes together. Here, we describe a novel hierarchical Bayesian approach to the analysis of all possible pairs of associations and linkages between expressed genes and SNP markers. We demonstrate the results for chromosome 11 and we argue that the method can be extended to cover the entire genome and transcriptome.
Methods
Statistical model
Let Y_{ ij }^{ n }denote the expression of gene n in member j of family i and let G_{ ij }^{ m }be the corresponding SNP genotype at marker m at location x_{ m }. For the means and covariances of the expression traits, we adopted a generalized estimating equations model of the form used by Thomas et al. [7]
E(Y_{ ij }^{ n }) ≡ μ_{ ij }^{ n }= α_{0}^{ n }+ Σ_{ m }A^{ nm }G_{ ij }^{ m } (1)
E(C_{ ijk }^{ n }) ≡ χ_{ ijk }^{ n }= β_{0}^{ n }+ B^{ n }Z_{ ijk }(X^{ n }), (2)
where C_{ ijk }^{ n }= (Y_{ ij }^{ n } μ_{ ij }^{ n })(Y_{ ik }^{ n } μ_{ ik }^{ n }) and Z_{ ijk }(x) are the estimated E(IBD_{ ijk }(x)G_{ i }) at chromosomal location x for pairs (j, k) from nuclear family i, based on the complete multilocus marker data. X^{ n }is a latent variable for location of the unobserved causal locus linked to expression trait n. For j = k, V(Y_{ ij }^{ n }) = χ^{ n }models the gene expression variance in Eq. (2).
In Eq. (1), the regression coefficients A^{ nm }are modeled as a mixtures of null values with probabilities 1π^{ nm }and a normal distribution of nonnull values with means α^{ nm }expressed in terms of row and column effects:
A^{ nm }~ (1  π^{ nm }) δ (0) + π^{ nm }N(α^{ nm }, σ^{2}), (3)
where
α^{ nm }= γ_{0}^{ A }+ γ_{1}^{ A }I(x_{ m }∈ R_{ n }) + e_{ m }^{ A }+ h_{ n }^{ A } (4a)
logit(π^{ nm }) = γ_{0}^{ P }+ γ_{1}^{ P }I(x_{ m }∈ R_{ n }) + e_{ m }^{ P }+ h_{ n }^{ P }. (4b)
The parameter γ_{1} distinguishes between cis and trans effects, a cis interaction occurs when the chromosomal location x_{ m }of SNP m is within the interval R_{ n }, the alignment region for the gene expression probe n. The random effects e and h are distributed as
(e_{ m }^{ A }, e_{ m }^{ P }) ~ N_{2}(0, T) (5a)
(h_{ n }^{ A }, h_{ n }^{ P }) ~ N_{2}(0, W) (5b)
and the γs, T, W have uninformative normal and Wishart priors.
The regression coefficients B^{ n }in the covariance model in Eq. (2) are handled similarly, except that we assume each trait has at most one region linked to it. (This is not essential to the method, because Eq. (2) could be extended to a summation over multiple independent linkage regions, but it would not make sense to offer all marker locations simultaneously, since the IBD variables are highly correlated from one location to the next.) Thus, we assume
B^{ n }~ N [γ_{0}^{ B }+ γ_{1}^{ B }I(X^{ n }∈ R_{ n }), τ^{2}] (6)
The rationale behind this convergence monitoring procedure is described and justified by Gelman et al. [8].
Subjects, genotypes, and phenotypes
In order to keep the computation to a manageable level, we restricted this analysis to the SNP genotypes and expressed genes on chromosome 11, since previous analyses by Morley et al. [2] had found evidence of both cis and trans linkages at this chromosome. The final data set thus had 116 SNPs and 189 expressed genes. IBD status was estimated from the complete twogeneration pedigrees (excluding grandparents) by a program written by JM based on the LanderGreen algorithm [9]. All 378 sib pairs (110 individuals) from the available 14 families were included in the phenotype analysis.
Results
Topranking associations
Top SNPs used in the prediction  

Phenotype^{a}  Probe  R ^{2}  P(R^{2} > 0)  SNP1  π^{ nm }  SNP2  π^{ nm }  SNP3  π^{ nm }  SNP4  π^{ nm } 
HSD17B12  217869_at  0.25  0.988  rs1453389^{b}  1.00  rs916482  1.00  rs1425151  0.40  rs509628  0.28 
C11orf10  218213_s_at  0.12  0.986  rs916482  1.00  
AMPD3  207992_s_at  0.19  0.985  rs2029463^{b}  0.81  rs948215  0.80  rs1157659  0.21  rs1491846  0.17 
FEZ1  203562_at  0.12  0.984  rs2029463  1.00  rs2155076  0.20  rs948215^{b}  0.11  
ADM  202912_at  0.11  0.982  rs916482  1.00  
STIP1  213330_s_at  0.11  0.981  rs916482  0.99  rs1319730  0.33  
DDB1  208619_at  0.15  0.978  rs1530966  0.91  rs597345  0.54  rs1499511  0.10  
FADS1  208964_s_at  0.14  0.974  rs1216592  0.85  rs1605026  0.38  rs591804  0.35  
TPP1  200743_s_at  0.13  0.970  rs916482  0.94  rs1157659  0.14  rs902215^{b}  0.14  
RBM14  204178_s_at  0.10  0.966  rs916482  0.98  rs674237  0.10  
HMBS  203040_s_at  0.13  0.963  rs86392  0.49  rs916482  0.47  rs1319730^{b}  0.44  rs1945906  0.20 
PPME1  49077_at  0.11  0.958  rs916482  0.82  rs2155001  0.16  
CD44  204490_s_at  0.12  0.957  rs702738  0.34  rs916482  0.28  rs1319730  0.28  rs1453390^{b}  0.17 
NRGN  204081_at  0.10  0.946  rs2029463  0.93  rs961746  0.16  rs509628  0.15  
NDUFS8  203190_at  0.11  0.944  rs86392  0.68  rs1319730  0.33  rs1945906  0.32  
PSMD13  201232_s_at  0.09  0.923  rs916482  0.91  rs1319730  0.12 
Linkage of residual gene expression variation after association
Phenotype^{a}  R ^{2}  P(R^{2} > 0)  Samples @mode  Mode [X^{ n }]  Pr [X^{ n }= m] 

208964_s_at  0.014  0.912  1749  rs2226844 

202223_at  0.009  0.908  1526  rs1453390  
220964_s_at  0.007  0.862  1412  rs647837  
201432_at  0.005  0.821  1567  rs931811  
201477_s_at  0.008  0.802  1492  rs1941817  
204178_s_at  0.004  0.800  1548  rs2155076  
202076_at  0.004  0.772  1668  rs681267  
206067_s_at  0.002  0.749  1535  rs2029463  
210364_at  0.003  0.718  1586  rs470719  
203675_at  0.006  0.706  1659  rs1216592  
205412_at  0.001  0.683  1585  rs470982  
202418_at  0.001  0.679  1444  rs470719 
Discussion
We have introduced a novel hierarchical Bayes model for genetic control of gene expression. Our approach to dealing with the multiple comparisons problem is to represent the matrices of all possible SNP × expressed gene association or linkage coefficients in terms of row and column random effects, along with a spatial regression on the distance between the two. Although this allows inference on specific pairs, we have greater interest in the variances of the row and column effects, which reflect systematic tendencies for SNPs to affect variable numbers of phenotypes and for phenotypes to be differentially expressed. Our mixture model also supports the possibility that the vast majority of such associations or linkages would be truly null, and allows separate estimation of both the probability and magnitude of nonnull tests. So far we have not imposed any relationship between the parameters of the association (means) and linkage (covariance) models, but one might contemplate using the broad regions where linkage is seen for a particular phenotype as a prior for testing singleSNP associations with that phenotype.
The strongest geneexpression × SNP association reported by Cheung et al. [3] on chromosome 11 also appeared in our results as the most significant association, but with a SNP close to theirs (their reported SNP was not included in the data set). We also found evidence of at least one SNP that appears to be linked to a large number of expressed genes, suggesting the existence of master regulatory genes in that region.
We chose to restrict these analyses to a subset of genes and SNPs on a single chromosome to test the feasibility of the method. In principle the approach could be applied on a genomewide scale, since the computation time increases linearly with m, n, sample size, and number of MCMC samples. Generating 4000 MCMC samples required 6 hours on a 2.2 GHz singleprocessor machine. However, one outstanding methodological challenge that would have to be addressed before the approach could be applied to dense SNP associations would be how to deal with the multicollinearity problem; for this reason, we chose to restrict this analysis to only a subset of SNPs that were not in strong LD with each other.
Declarations
Acknowledgements
Supported in part by NIH grant R01GM58897 and U01ES 015090. Implemented Matlab software is available upon request.
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/17536561/1?issue=S1.
Authors’ Affiliations
References
 Jansen RC, Nap JP: Genetical genomics: the added value from segregation. Trends Genet. 2001, 17: 388391. 10.1016/S01689525(01)023101.View ArticlePubMedGoogle Scholar
 Morley M, Molony CM, Weber TM, Devlin JL, Ewens KG, Spielman RS, Cheung VG: Genetic analysis of genomewide variation in human gene expression. Nature. 2004, 430: 743747. 10.1038/nature02797.View ArticlePubMed CentralPubMedGoogle Scholar
 Cheung VG, Spielman RS, Ewens KG, Weber TM, Morley M, Burdick JT: Mapping determinants of human gene expression by regional and genomewide association. Nature. 2005, 437: 13651369. 10.1038/nature04244.View ArticlePubMed CentralPubMedGoogle Scholar
 Schadt EE, Monks SA, Drake TA, Lusis AJ, Che N, Colinayo V, Ruff TG, Milligan SB, Lamb JR, Cavet G, Linsley PS, Mao M, Stoughton RB, Friend SH: Genetics of gene expression surveyed in maize, mouse and man. Nature. 2003, 422: 297302. 10.1038/nature01434.View ArticlePubMedGoogle Scholar
 Stranger BE, Forrest MS, Clark AG, Minichiello MJ, Deutsch S, Lyle R, Hunt S, Kahl B, Antonarakis SE, Tavaré S, Deloukas P, Dermitzakis ET: Genomewide associations of gene expression variation in humans. PLoS Genet. 2005, 1: e7810.1371/journal.pgen.0010078.View ArticlePubMed CentralPubMedGoogle Scholar
 Tsalenko A, Sharan R, Edvardsen H, Kristensen V, BørresenDale AL, BenDor A, Yakhini Z: Analysis of SNPexpression association matrices. Proc IEEE Comput Syst Bioinform Conf. 2005, 135143.Google Scholar
 Thomas DC, Qian D, Gauderman WJ, Siegmund K, Morrison JL: A generalized estimating equations approach to linkage analysis in sibships in relation to multiple markers and exposure factors. Genet Epidemiol. 1999, 17 (Suppl 1): S733S734.Google Scholar
 Gelman A, Carlin JB, Stern HS, Rubin DB, (Eds): Bayesian Data Analysis. 2004, New York: Chapman & Hall/CRC, 2Google Scholar
 Lander ES, Green P: Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci USA. 1987, 84: 23632367. 10.1073/pnas.84.8.2363.View ArticlePubMed CentralPubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.