Box-Cox family of transformations
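Given the expression intensities $x_{1,j}, x_{2,j}, \ldots, x_{n,j}$ ($n = 194$) for the $j$th gene ($j = 1, 2, \ldots, 3554$), the Box-Cox family [3] transforming $x_j = (x_{1,j}, x_{2,j}, \ldots, x_{n,j})$ to $x_j^{(p_j)}$ with power parameter $p_j$ is:
$$x_{i,j}^{(p_j)} = \begin{cases} \left(x_{i,j}^{p_j} - 1\right)/p_j, & p_j \neq 0 \\ \log x_{i,j}, & p_j = 0 \end{cases} \tag{1}$$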
The expression intensities transformed here are the original observations rather than the log2 values reported in the data set. The 0.3-power transformation is the transformation that maximizes the probability plot correlation coefficient (PPC, see Filliben [10]) for the greatest number of genes.
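The selection of a power by the probability plot correlation coefficient can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the candidate grid, and the reading of "greatest number of genes" as a per-gene vote are assumptions, and the normal order-statistic medians use Filliben's standard approximation.

```python
import math
from statistics import NormalDist


def box_cox(x, p):
    """Box-Cox power transform without a scale parameter (log when p == 0)."""
    if p == 0:
        return [math.log(v) for v in x]
    return [(v ** p - 1.0) / p for v in x]


def filliben_medians(n):
    """Filliben's approximate medians of the uniform order statistics."""
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    m[-1] = 0.5 ** (1.0 / n)
    m[0] = 1.0 - m[-1]
    return m


def pearson(a, b):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def ppcc(x, p):
    """Correlation between sorted transformed data and normal order-statistic medians."""
    y = sorted(box_cox(x, p))
    q = [NormalDist().inv_cdf(u) for u in filliben_medians(len(y))]
    return pearson(y, q)


def best_power(x, grid):
    """Power in `grid` that maximizes the PPCC for one gene."""
    return max(grid, key=lambda p: ppcc(x, p))


def modal_best_power(genes, grid):
    """Power that is PPCC-optimal for the greatest number of genes (assumed reading)."""
    counts = {}
    for x in genes:
        p = best_power(x, grid)
        counts[p] = counts.get(p, 0) + 1
    return max(counts, key=counts.get)
```

Calling `modal_best_power(genes, [0.1 * k for k in range(16)])` on a list of per-gene intensity vectors would return the grid value chosen most often; the inputs must be positive, untransformed intensities.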
Mixture analysis using Gaussian mixture model
We extended the SKUMIX algorithm in three ways. First, we applied the Box-Cox family of power transformations without the scale parameter (see Eq. (1)). Second, when selecting the optimal power parameter, we considered the interval [0, 1.5], which is wider than the one recommended by Maclean et al. [2]. Third, as suggested by Ning et al. [11], we used 6.9 as the critical value at the 0.05 level for the likelihood ratio test (LRT) of "a single component distribution" vs. "a mixture distribution of two components."
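The likelihood ratio comparison can be illustrated with a small EM fit. The sketch below is not SKUMIX: it assumes the data are already transformed, uses a plain EM algorithm with a crude quartile-based start (both our assumptions), and all function names are hypothetical. Only the 6.9 critical value comes from the text.

```python
import math


def single_loglik(x):
    """Maximized log-likelihood of one Gaussian (MLE mean and variance)."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)


def mixture_loglik(x, n_iter=500, tol=1e-10):
    """Log-likelihood of a two-component, equal-variance Gaussian mixture fit by EM."""
    n = len(x)
    xs = sorted(x)
    mu1, mu2 = xs[n // 4], xs[(3 * n) // 4]  # assumed start: lower and upper quartiles
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    w = 0.5
    prev = -math.inf
    ll = prev
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each observation
        resp, ll = [], 0.0
        for v in x:
            a = w * math.exp(-0.5 * (v - mu1) ** 2 / var)
            b = (1.0 - w) * math.exp(-0.5 * (v - mu2) ** 2 / var)
            ll += math.log(a + b) - 0.5 * math.log(2 * math.pi * var)
            resp.append(a / (a + b))
        if ll - prev < tol:
            break
        prev = ll
        # M-step: update mixing proportion, means, and the shared variance
        s = sum(resp)
        mu1 = sum(r * v for r, v in zip(resp, x)) / s
        mu2 = sum((1.0 - r) * v for r, v in zip(resp, x)) / (n - s)
        var = sum(r * (v - mu1) ** 2 + (1.0 - r) * (v - mu2) ** 2
                  for r, v in zip(resp, x)) / n
        w = s / n
    return ll


def lrt_statistic(x):
    """Twice the log-likelihood gain of the mixture over a single Gaussian."""
    return max(0.0, 2.0 * (mixture_loglik(x) - single_loglik(x)))


def is_mixture(x, critical_value=6.9):
    """Flag a gene as a mixture when the LRT statistic exceeds the critical value."""
    return lrt_statistic(x) > critical_value
```

EM can stop at a local optimum, so a careful analysis would rerun the fit from several starting values.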
Partial correlation analysis
We calculate the Pearson product-moment correlation coefficients $r_{ij} = r(x_i, x_j)$, first-order partial correlation coefficients $r_{ij.k} = r(x_i, x_j \mid x_k)$, and second-order partial correlation coefficients $r_{ij.kl} = r(x_i, x_j \mid x_k, x_l)$ [12] for expression phenotype variables whose values are the 0.3-power Box-Cox transformed expressions. The partial correlation criteria are:
The last two inequalities in criterion Q reduce redundancy by removing quartets built on trios. We identify trios of expression phenotype variables $(x_i, x_j, x_k)$ that meet criterion T and quartets $(x_i, x_j, x_k, x_l)$ that meet criterion Q.
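The first- and second-order partial correlations can be computed from ordinary correlations with the standard recursion formula. The sketch below shows one way to do so; the function names are hypothetical, and the thresholds of criteria T and Q are deliberately not encoded here.

```python
import math


def pearson(a, b):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def partial_from_r(r_ij, r_ik, r_jk):
    """One recursion step: remove the linear effect of variable k from r_ij."""
    return (r_ij - r_ik * r_jk) / math.sqrt((1.0 - r_ik ** 2) * (1.0 - r_jk ** 2))


def first_order(xi, xj, xk):
    """r_ij.k: correlation of xi and xj given xk."""
    return partial_from_r(pearson(xi, xj), pearson(xi, xk), pearson(xj, xk))


def second_order(xi, xj, xk, xl):
    """r_ij.kl: apply the same recursion to first-order partials given xk."""
    return partial_from_r(first_order(xi, xj, xk),
                          first_order(xi, xl, xk),
                          first_order(xj, xl, xk))
```

The result does not depend on the order of the conditioning variables, so `second_order(xi, xj, xk, xl)` and `second_order(xi, xj, xl, xk)` agree up to rounding.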
Measure of common mixing mechanism
When a gene expression variable appeared to be a mixture, we fit a mixture of two Gaussian components with equal variance using MCLUST [13] and assigned each subject to the component with the larger Bayesian posterior probability [14]. We called the component with an estimated mixing probability below 0.5 the "uncommon component" and the other the "common component." The concordance rate (C) of a gene set is the proportion of subjects who fall into the uncommon component for every gene in the set or into the common component for every gene in the set. A value of C close to 1 suggests a common mixing mechanism. We selected genes in a trio or quartet with C ≥ 80% for the factor analysis. Fleiss' statistic κ [15] was used to assess agreement. A value of κ > 0.75 indicated excellent agreement, while κ < 0.40 indicated poor agreement [16].
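As a small illustration of the concordance rate, the sketch below takes per-gene posterior probabilities and mixing proportions as given (for example from a fitted two-component model) and counts the subjects whose component labels agree across all genes in a set. The function names and input layout are assumptions made for this example.

```python
def label_uncommon(posterior_comp1, mixing_prop_comp1):
    """Return True for each subject assigned to the gene's uncommon component.

    posterior_comp1: posterior probability of component 1 for each subject.
    mixing_prop_comp1: estimated mixing proportion of component 1.
    """
    comp1_is_uncommon = mixing_prop_comp1 < 0.5
    # A subject is in component 1 when its posterior exceeds 0.5.
    return [(p > 0.5) == comp1_is_uncommon for p in posterior_comp1]


def concordance_rate(labels_by_gene):
    """Proportion of subjects labelled uncommon for all genes or common for all genes."""
    n_subjects = len(labels_by_gene[0])
    agree = 0
    for s in range(n_subjects):
        column = [gene[s] for gene in labels_by_gene]
        if all(column) or not any(column):
            agree += 1
    return agree / n_subjects
```

For a trio, `labels_by_gene` would hold three lists of equal length, one per gene, and the set would pass the selection rule when the returned value is at least 0.8.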
Factor analysis
Each gene expression variable that appeared to be a mixture and was present in one or more trios or quartets was included in a factor analysis using varimax rotation.
Bayesian factor screening
We used BFS [7, 17] to identify SNPs significantly associated with expressions of the genes from the factor analysis. We only considered the regression model with second-order interactions:
$$y = \beta_0 + \sum_{j=1}^{S} \beta_j x_j + \sum_{i<j} \beta_{ij} x_i x_j + \varepsilon \tag{2}$$
where the values of $x_1, x_2, \ldots, x_S$ are recoded genotypes (1 for minor homozygotes, 2 for heterozygotes, 3 for major homozygotes, and -2 for missing data) of the $S = 2682$ consistent and informative SNPs that may have linear main effects and/or interaction effects on the gene expression variable $y$. Let $\gamma$ be the indicator vector such that $\gamma_j = 0$ if $\beta_j = 0$ and $\beta_{ij} = 0$ for all $i \neq j$, and $\gamma_j = 1$ otherwise. Then a model (an element of the model space) can be represented by a binary vector $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_S)$ that ranges from $\gamma^{(1)} = (0, 0, \ldots, 0)$ to $\gamma^{(2^S)} = (1, 1, \ldots, 1)$, with the model size defined as $\sum_{j=1}^{S} \gamma_j$. In our study, we set the model size $m = 6$, the chain length CL = 200,000, and the magnitude of the effect relative to the experimental noise $\lambda = 1.5$. We used the Java program developed by Yoon [17] to find the optimal model from the model subspace consisting of $\binom{2682}{6} \approx 5.14 \times 10^{17}$ elements. The output gives an estimate of each SNP's marginal posterior probability (MPP) of appearing in the 200,000 selected models. An MPP close to 1 suggests that the SNP is an important factor (either as a main effect or as one of two terms in an interaction) for the gene expression variable.
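Two quantities in this paragraph are easy to check numerically: the number of size-6 models that can be formed from 2682 SNPs, and the MPP estimate as the fraction of sampled models that contain a given SNP. The sketch below is illustrative only; the function names and the representation of a model as a tuple of SNP indices are assumptions and do not describe Yoon's Java program.

```python
import math


def subspace_size(n_snps, model_size):
    """Number of distinct models containing exactly `model_size` of `n_snps` SNPs."""
    return math.comb(n_snps, model_size)


def marginal_posterior_probability(sampled_models, n_snps):
    """Fraction of sampled models in which each SNP index appears."""
    counts = [0] * n_snps
    for model in sampled_models:
        for j in set(model):  # count each SNP at most once per model
            counts[j] += 1
    return [c / len(sampled_models) for c in counts]


# Size of the subspace searched for S = 2682 SNPs and model size m = 6.
print(f"{subspace_size(2682, 6):.2e}")
```

With a real chain, `sampled_models` would hold the 200,000 visited models, each with six SNP indices.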