Data
Data on 14 three-generation CEPH families, each consisting of four grandparents, two parents and seven to eight offspring, were provided by the Genetic Analysis Workshop 15. Gene expression levels in lymphoblastoid cells of these 194 subjects were obtained using the Affymetrix Human Focus Arrays that contain probes for 8500 transcripts. The data set contains 3554 of the 8500 genes, for which Morley et al. [2] found that the variation in expression level among individuals was higher than between replicates on the same individual. Furthermore, the genotypes of 2882 autosomal and X-linked single-nucleotide polymorphisms (SNPs) were provided.
Missing SNPs were imputed based on the SNP data of relatives or if incomplete, based on related and unrelated subjects. SNPs with a percentage of missing data exceeding 20% (187 monomorphic and 24 polymorphic), were discarded from further analysis, as were the 95 SNPs which had a mutation rate of less than 5% (homo- and heterozygote). In the SNPs with less than 5% homozygote mutations, the hetero- and homozygote mutants were combined into one category (958 SNPs). For each SNP variable two dummies were created, using the LogicFS library in R. The first dummy variable was coded 0 for wild type and 1 for the mutants. The second dummy variable was coded 0 for wild type and heterozygote mutant and 1 for homozygote mutant.
Penalized canonical correlation analysis
Consider the standardized n × p matrix Y, containing p (gene expression) variables, and the standardized n × q matrix X, containing q (DNA marker) variables, obtained from n subjects. Canonical correlation analysis (CCA) looks for a linear combination of all the variables in one data set that correlates maximally with a linear combination of all the variables in the other data set. These linear combinations are the so-called CCA components ξ and ω, such that ξ = Xv and ω= Yu, with the weight vectors uT= (u1,..., u
p
) and vT= (v1,..., v
q
). The correlation between the CCA components is also known as the canonical correlation.
Via the two-block Mode B of Wold's original partial least-squares algorithm, the weight vectors can be estimated [3]. This algorithm is an iterative process that begins by calculating an initial pair of CCA components (ξ and ω) based on an initial guess of the weight vectors (v and u). The objective is to maximize the correlation between these two CCA components, therefore ξ is regressed on Y and ω is regressed on X in order to obtain a new set of estimated weights. This process is repeated until the weights converge.
However, problems arise in the regression step due to multicollinearity and overfitting. Furthermore, all the original variables are contained in the CCA components, so interpretation is difficult. Zou and Hastie [1] proposed the elastic net to overcome these problems. It is based on the following properties 1) variable selection is built into the procedure, 2) the procedure is not limited by the fact that the number of variables greatly exceeds the number of subjects, and 3) strongly correlated variables are in or out of the model together. The elastic net combines the advantages of the lasso [4] (built in variable selection) and ridge regression (grouping effect). We adapted the elastic net to Wold's algorithm, obtaining the following algorithm:
-
1.
Set k ← 0.
-
2.
Assign arbitrary starting values and . For instance, set and . And normalize and .
-
3.
Estimate ξ, ω, v and u iteratively, as followed
-
a.
k ← k + 1.
-
b.
and .
-
c.
Compute and by performing multiple regression using the elastic net.
-
d.
Normalize and . Repeat until and have converged (a difference of less than 10-3 with the preceding step).
Hereafter, the residual matrices of Y and X are determined and the algorithm can be repeated to obtain the next sets of CCA-components. This process can be repeated until the residual matrices contain no further information.
The penalty parameters λ1 and λ2 come from the lasso and the ridge penalty, respectively. The (1 + λ2) scaling factor is built in, to prevent a double shrinkage. By introducing this scaling factor, the shrinkage effect of the ridge is eliminated while maintaining the grouping effect of the ridge and the shrinkage effect of the lasso. The algorithm is highly depending on the choices of λ1 and λ2. Methods to determine their optimal values are discussed in the following sections.
Univariate soft-thresholding
The optimal values of λ1 and λ2 are determined by cross-validation (see next section), but in cases in which the computation time is tremendously increased due to the high number of variables, as in the case of microarray data, it might be desirable to simplify the computations. As Zou and Hastie [1] proposed, we can set λ2X→ ∞ and λ2Y→ ∞, in order to simplify the calculation. This is called univariate soft-thresholding. Although it ignores the dependence between variables within the same set, the de-correlation effect from the ridge penalty is maintained. It has shown to be quite promising [1].
Performing univariate soft-thresholding, Step 3(c) reduces to computing and by
From this formula, we see that now only the optimal lasso penalty has to be chosen. Lasso shrinks weights to zero; the larger the lasso penalty, the smaller the number of weights that receive a non-zero value. The magnitude of the lasso penalty determines how many weights become non-zero, and thus how many variables are maintained in the CCA-components. Following Zou and Hastie [1], we turned this process around and determined the number of variables we would like to give a non-zero weight.
Penalty parameter optimization
The optimal number of variables within each CCA component is determined by k-fold cross-validation. The data set is divided into k subsets (based on subjects), of which k-1 subsets form the training set and the remaining subset forms the test set. The weight vectors u and v are estimated by the training set and the mean squared prediction error is determined by the test set. This is repeated k times, such that each subset has functioned as a test set.
The mean squared prediction error is defined by
with ρ-k(j) the canonical correlation, and and the weight vectors estimated by the training set in which subject j was deleted. The MSPE is determined for CCA components with differing numbers of variables. The CCA component pair with the lowest MSPE contains the optimal number of variables.