The multigroup ANOVA model
Let Yi, jbe the gene expression value for the jth gene on the ith individual, where j = {1, 2,..., M} and i = {1, 2,..., n}. Further, let G
i
denote the group membership of the ith individual, where G
i
={1, 2,..., g}, let I(·) denote an indicator function, and let n
k
= #{i: G
i
= k} denote the number of individuals belonging to group k, such that the total sample size is . Then the multigroup ANOVA model is given by
where θ
j
is the baseline effect for the jth gene, βk, jis the group-differential effect (comparing the jth gene in group k against a baseline group) and εi, jis the error term. The errors are assumed to be independently distributed with mean zero such that E[εi, j] = 0 and Var[εi, j] = σ
j
2. This means that every gene can have its own variance component. Aside from this, there are no distributional assumptions about the data.
This multigroup model can be restated as a linear model as described in Ishwaran and Rao [7] and the task of identifying differentially expressing genes amounts to finding non-zero group × gene interactions effects. A Bayesian variable selection technique described below can then be used to identify these parameters of interest. Despite relatively weak assumptions about the data generating mechanism, we will nevertheless require that a common variance model holds for the variable selection process, and it is therefore necessary to transform the data to stabilize the variances across genes. This is done using a weighted regression technique also described in Ishwaran and Rao [7], which has the advantage of avoiding problems associated with more typical global variance-stabilizing transformations (log-transformations, for example), but at the same time does not change the signal-to-noise ratio of the data for any given gene. Let Yi, j+ denote data produced by this transformation, and select a baseline group g. Then the data are further centered and scaled by
where N = (n - n
g
)M is the total sample size and is an unbiased estimator for the common variance, .
Spike and slab hierarchical model for Bayesian ANOVA of microarrays (BAM)
Following some additional pre-processing steps [7] to generate an orthogonal linear model design, Ishwaran and Rao's spike and slab hierarchical model [7] is given as:
where Γ
j
is the diagonal matrix with diagonal entries obtained from γ
j
= (γ1, j, γ2, j,..., γg-1, j)t, and γ= (γ1t, γ2t,..., γ
M
t)tis a M(g - 1) -dimensional hypervariance vector. Each design matrix X
j
is chosen for orthogonality such that X
j
tX
j
= N I. The prior distribution for the coefficient variances, π (dγ), is a continuous bimodal prior in which one spike component specifies small values, thus favoring small values of β
j
, and a right continuous tail specifies preferences for large non-zero values of β
j
.
Ishwaran and Rao [8] derive a blocked Gibbs sampling algorithm for posterior inference and show that the posterior mean for the β values is the optimal summary measure to use. Hard thresholding is done using a data-adaptive rule based on looking for coalescence of the posterior variance at a value of 1.0 with corresponding large posterior mean values (see Ishwaran and Rao [7] for details). This spike and slab model was also shown to possess a selective shrinkage property where shrinkage towards 0 (from classical least square estimates) occurs only for those parameters corresponding to truly non-differentially expressing genes with probability tending to 1 as the sample size increases. The model described above is implemented in a software package called BAMarray [9], and is available for download at http://www.bamarray.com.
It should be noted that the posterior mean estimator has been shown to be a weighted average of generalized ridge regression estimators, hence justifying the distribution-free assumptions made in the original ANOVA model. So in effect, the hierarchical model is a tool to generate a Bayesian test statistic (referred to as a Zcut value [7]), and corresponding thresholding rule, but does not reflect our belief about the data generating mechanism.
Robustness of spike and slab model
The model is robust to non-normality of the response variable, robust to clumpy dependence across genes [10] and allows unequal variances across genes. In addition, the spike and slab model was shown to be robust against correlation in gene expression measurements across individuals. Typically, when not accounted for, these correlations tend to produce underestimated variance estimates and can lead to detection of spurious effects. It was shown that the spike and slab model greatly mitigated this problem due to its selective shrinkage property, which shrinks only those coefficients corresponding to truly zero effects (see Theorem 1 and Remark 5 of Ishwaran and Rao [8]).
Determining gene expression profiles
For our purposes, a gene expression profile is a multi-gene pattern of expression values by which a given individual may be assigned membership to one of a set of mutually exclusive groups or categories. Our strategy for determining expression profiles is to ask whether gene expression measurements for each individual can be classified as up, down or equi-regulated with respect to a "median" family expression profile. Interestingly, this can be handled as a special case of the multi-group model in what is termed a no-baseline analysis, and by accruing this information across genes we obtain a novel profile from which to assign individuals to expression profile clusters across families. This process allows us to define expression profiles using all of the genes present rather than trying to do things on a gene-by-gene basis. This also means that a parameter vector of length equal to the number of genes on the chip must be estimated for each individual. In order to conduct such an analysis, we extended the data set with pseudo samples obtained as follows: for each possible pairing of individuals within each pedigree, we calculate the pair-wise difference vector and add it to the existing data set. The effect of taking all possible pair-wise differences between family members is to increase the sample size while at the same time centering the expression profiles on a per-family basis. Each vector difference is added as a pseudo sample to the existing set of data, and for a pedigree of K individuals, the total number of such additions is (K)(K-1). Note also that with the generation of the pseudo samples and by use of the no-baseline variation of the spike and slab BAM model, we are implicitly allowing each individual to act as a "group".
Remark
The generation of pseudo samples and the inherent familial structure of the pedigrees means that the expression values will be correlated. But as previously indicated, the spike and slab BAM model is robust to these correlations and gives correct inferences.
After estimating Zcut vectors from the BAM no baseline model for each individual, we clustered those vectors across individuals using agglomerative hierarchical clustering, wherein the optimal number of clusters is determined by Tibshirani's gap statistic [11]. In order to determine which genes were most influential in the clustering (i.e., generation of expression profiles) we examined each gene via a one-way ANOVA F statistic using the assigned latent cluster labels as group indicators. Corresponding p-values were determined and then adjusted for multiplicity by using the Benjamini-Hochberg version of false detection rate (FDR) control [12].
Relating the latent gene expression clusters to the SNP genotype information
Using the GAW15 data set, we sought to relate gene expression to SNP genotypes collected from pedigrees of healthy individuals. In essence, we were trying to determine the genetic determinants of the natural variation of gene expression in this data set. Given that we do not have explicit groupings of individuals into different phenotypic groups (e.g., case/control) our approach was to use the clustered expression profiles described above and relate these latent clusters to the SNP genotype information using the spike and slab multi-group methodology described above. Similar arguments about correlation between SNP measurements across individuals still hold but the correlations across individuals within the latent expression profile clusters are expected to be less strong than the expression correlations within same pedigree. We can apply the spike and slab BAM model to SNP data because the model does not make distributional assumptions about the data generating mechanism, and employs a limiting distribution result for its thresholding rule.