### Box-Cox family of transformations

Given the expression intensities *x*_{1, j}, *x*_{2, j},..., *x*_{n, j}(n = 194) for the *j*^{th}(*j* = 1, 2,..., 3554) gene, the Box-Cox family [3] transforming *x*_{
j
}= (*x*_{1, j}, *x*_{2, j},..., *x*_{n, j}) to {x}_{j}^{({p}_{j})}=({x}_{1,j}^{({p}_{j})},{x}_{2,j}^{({p}_{j})},\mathrm{...},{x}_{n,j}^{({p}_{j})}) with power parameter *p*_{
j
}is:

{x}_{j}^{({p}_{j})}=\{\begin{array}{ll}({x}_{j}^{{p}_{j}}-1)/{p}_{j},\hfill & {p}_{j}\ne 0\hfill \\ ln({x}_{j}),\hfill & {p}_{j}=0\hfill \end{array}.

(1)

The expression intensities transformed here are the original observations rather than the log_{2} values reported in the data set. The 0.3-power transformation is the transformation that maximizes the probability plot correlation coefficient (PPC, see Filliben [10]) for the greatest number of genes.

### Mixture analysis using Gaussian mixture model

The SKUMIX algorithm is extended in our mixture analysis. First, we applied the Box-Cox family of power transformations without the scale parameter (see Eq. (1)). Second, we considered a wider interval [0, 1.5] than the one recommended by Maclean et al. [2] for selecting the optimal power parameter. Third, as suggested by Ning et al. [11], we used 6.9 as the 0.05 critical value for LRT of "a single component distribution" vs. "a mixture distribution of two components."

### Partial correlation analysis

We calculate the Pearson product moment correlation coefficients *r*_{
ij
}= *r*(*x*_{
i
}, *x*_{
j
}), first-order partial correlations *r*_{ij.k}= *r*(*x*_{
i
}, *x*_{
j
}|*x*_{
k
}) and second-order partial correlation coefficients *r*_{ij.kl}= *r*(*x*_{
i
}, *x*_{
j
}|*x*_{
k
}, *x*_{
l
}) [12] for expression phenotype variables whose values are the 0.3-power Box-Cox transformed expressions. The partial correlation criteria are:

T:|r({x}_{i},{x}_{j})|>0.8\phantom{\rule{0.5em}{0ex}}and\phantom{\rule{0.5em}{0ex}}{r}_{ij}^{2}-{r}_{ij.k}^{2}>0.63\phantom{\rule{0.5em}{0ex}}for\phantom{\rule{0.5em}{0ex}}i\ne j\ne k

Q:\left|{r}_{ij}\right|>0.8,\phantom{\rule{0.5em}{0ex}}{r}_{ij}^{2}-{r}_{ij.kl}^{2}>0.63,\phantom{\rule{0.5em}{0ex}}{r}_{ij}^{2}-{r}_{ij.k}^{2}<0.63,\phantom{\rule{0.5em}{0ex}}and\phantom{\rule{0.5em}{0ex}}{r}_{ij}^{2}-{r}_{ij.l}^{2}<0.63\phantom{\rule{0.5em}{0ex}}for\phantom{\rule{0.5em}{0ex}}i\ne j\ne k\ne l.

The last two inequalities in criterion Q reduce redundancy by removing quartets built on trios. We identify trios of expression phenotype variables (*x*_{
i
}, *x*_{
j
}, *x*_{
k
}) that meet criterion T and quartets (*x*_{
i
}, *x*_{
j
}, *x*_{
k
}, *x*_{
l
}) that meet criterion Q.

### Measure of common mixing mechanism

When a gene expression variable appeared to be a mixture, we fit a mixture of two Gaussian components with equal variance using MCLUST [13] and classified each subject into the component with the largest Bayesian posterior probability [14]. We called the component with estimated probability less than 0.5 the "uncommon component" and the other one the "common component." The *concordance rate* (*C*) in a gene set is the ratio of subjects that simultaneously fall into the uncommon or the common components for all the genes in the set. A value of *C* close to 1 suggests a common mixture mechanism. We selected genes in a trio or quartet with *C* ≥ 80% for the factor analysis. Fleiss' statistic *κ* [15] was used to assess agreement. A value of *κ* > 0.75 indicated excellent agreement, while *κ* < 0.40 indicated poor agreement [16].

### Factor analysis

Each gene expression variable that appeared to be a mixture and was present in one or more trios or quartets was included in a factor analysis using varimax rotation.

### Bayesian factor screening

We used BFS [7, 17] to identify SNPs significantly associated with expressions of the genes from the factor analysis. We only considered the regression model with second-order interactions:

y=\alpha +{\displaystyle {\sum}_{j=1}^{S}{\beta}_{j}{x}_{j}}+{\displaystyle {\sum}_{i<j}{\beta}_{ij}{x}_{i}{x}_{j}}+\epsilon ,

(2)

where the values of *x*_{1}, *x*_{2},..., *x*_{
S
}are recoded genotypes (1 for minor homozygotes, 2 for heterozygotes, 3 for major homozygotes, and -2 for missing data) of *S* (2682) consistent and informative SNPs that may have linear main effects and/or interaction effects on the gene expression variable *γ*. Let *γ* be the indicator vector such that *γ*_{
j
}= 0 if *β*_{
j
}= 0 and *β*_{
ij
}= 0 for all *i* ≠ *j*, and *γ*_{
j
}= 1 if otherwise. Then a model (or an element) in the model space can be represented by a binary vector *γ* = (*γ*_{1}, *γ*_{2},..., *γ*_{
S
}) that ranges from *γ*^{(1)} = (0, 0,..., 0) to {\gamma}^{({2}^{\text{S}})} = (1, 1,..., 1), with the model size defined as m={\displaystyle {\sum}_{j=1}^{S}{\gamma}_{j}}\in [0,S]. In our study, we set the model size *m* = 6, the chain length *CL* = 200,000, and the magnitude of the effect relative to the experimental noise *λ* = 1.5. We use the Java program developed by Yoon [17] to find the optimal model from the model subspace consisting of {C}_{6}^{2,682} = 5.14 × 10^{17} elements. The output gives an estimate of each SNP's marginal posterior probability (MPP) of appearing in the 200,000 selected models. An MPP close to 1 suggests that the SNP is an important factor (either as a main effect or as one of two terms in an interaction) for the gene expression variable.