Data
Data on 14 three-generation CEPH families, each consisting of four grandparents, two parents, and seven to eight offspring, were provided by Genetic Analysis Workshop 15. Gene expression levels in lymphoblastoid cells of these 194 subjects were obtained using the Affymetrix Human Focus Arrays, which contain probes for 8500 transcripts. The data set contains 3554 of the 8500 genes, for which Morley et al. [2] found that the variation in expression level among individuals was higher than between replicates on the same individual. Furthermore, the genotypes of 2882 autosomal and X-linked single-nucleotide polymorphisms (SNPs) were provided.
Missing SNPs were imputed based on the SNP data of relatives or, if incomplete, based on related and unrelated subjects. SNPs with more than 20% missing data (187 monomorphic and 24 polymorphic) were discarded from further analysis, as were the 95 SNPs with a mutation rate (homo- and heterozygote combined) of less than 5%. For SNPs with less than 5% homozygote mutations, the hetero- and homozygote mutants were combined into one category (958 SNPs). For each SNP variable, two dummy variables were created using the LogicFS library in R. The first dummy variable was coded 0 for wild type and 1 for mutants. The second dummy variable was coded 0 for wild type and heterozygote mutant and 1 for homozygote mutant.
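The two-dummy coding described above can be sketched as follows. This is a minimal illustration in Python rather than the LogicFS R code actually used; the genotype coding (0 = wild type, 1 = heterozygote mutant, 2 = homozygote mutant) is an assumption for the example.

```python
import numpy as np

# Hypothetical genotype codes for one SNP across five subjects:
# 0 = wild type, 1 = heterozygote mutant, 2 = homozygote mutant.
genotypes = np.array([0, 1, 2, 0, 2])

# First dummy: 0 for wild type, 1 for any mutant (hetero- or homozygote).
dummy1 = (genotypes > 0).astype(int)

# Second dummy: 0 for wild type and heterozygote, 1 for homozygote mutant only.
dummy2 = (genotypes == 2).astype(int)

print(dummy1.tolist())  # [0, 1, 1, 0, 1]
print(dummy2.tolist())  # [0, 0, 1, 0, 1]
```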
Penalized canonical correlation analysis
Consider the standardized n × p matrix Y, containing p (gene expression) variables, and the standardized n × q matrix X, containing q (DNA marker) variables, obtained from n subjects. Canonical correlation analysis (CCA) looks for a linear combination of all the variables in one data set that correlates maximally with a linear combination of all the variables in the other data set. These linear combinations are the so-called CCA components ξ and ω, such that ξ = Xv and ω = Yu, with weight vectors u^{T} = (u_{1}, ..., u_{p}) and v^{T} = (v_{1}, ..., v_{q}). The correlation between the CCA components is also known as the canonical correlation.
The weight vectors can be estimated via the two-block Mode B of Wold's original partial least-squares algorithm [3]. This algorithm is an iterative process that begins by calculating an initial pair of CCA components (ξ and ω) from an initial guess of the weight vectors (v and u). The objective is to maximize the correlation between these two CCA components; therefore, ξ is regressed on Y and ω is regressed on X to obtain a new set of estimated weights. This process is repeated until the weights converge.
However, problems arise in the regression step due to multicollinearity and overfitting. Furthermore, all the original variables are contained in the CCA components, which makes interpretation difficult. Zou and Hastie [1] proposed the elastic net to overcome these problems. It has the following properties: 1) variable selection is built into the procedure, 2) the procedure is not limited by the number of variables greatly exceeding the number of subjects, and 3) strongly correlated variables are in or out of the model together. The elastic net combines the advantages of the lasso [4] (built-in variable selection) and ridge regression (grouping effect). We adapted the elastic net to Wold's algorithm, obtaining the following algorithm:

1.
Set k ← 0.

2.
Assign arbitrary starting values {\widehat{v}}^{(0)} and {\widehat{u}}^{(0)}. For instance, set {\widehat{v}}^{(0)}\leftarrow \frac{1}{q}1 and {\widehat{u}}^{(0)}\leftarrow \frac{1}{p}1. Then normalize {\widehat{v}}^{(0)} and {\widehat{u}}^{(0)}.

3.
Estimate ξ, ω, v, and u iteratively, as follows:

a.
k ← k + 1.

b.
{\widehat{\xi}}^{k}\leftarrow X{\widehat{v}}^{(k-1)} and {\widehat{\omega}}^{k}\leftarrow Y{\widehat{u}}^{(k-1)}.

c.
Compute {\widehat{v}}^{(k)} and {\widehat{u}}^{(k)} by performing multiple regression using the elastic net.
\begin{array}{c}{\widehat{v}}^{(k)}=\left(1+{\lambda}_{2X}\right)\mathrm{arg}\underset{{v}^{(k)}}{\mathrm{min}}{\left\Vert {\widehat{\omega}}^{k}-X{v}^{(k)}\right\Vert}^{2}+{\lambda}_{2X}{\displaystyle \sum _{j=1}^{q}{v}_{j}^{(k)2}}+{\lambda}_{1X}{\displaystyle \sum _{j=1}^{q}\left|{v}_{j}^{(k)}\right|}\\ {\widehat{u}}^{(k)}=\left(1+{\lambda}_{2Y}\right)\mathrm{arg}\underset{{u}^{(k)}}{\mathrm{min}}{\left\Vert {\widehat{\xi}}^{k}-Y{u}^{(k)}\right\Vert}^{2}+{\lambda}_{2Y}{\displaystyle \sum _{j=1}^{p}{u}_{j}^{(k)2}}+{\lambda}_{1Y}{\displaystyle \sum _{j=1}^{p}\left|{u}_{j}^{(k)}\right|}\end{array}

d.
Normalize {\widehat{v}}^{(k)} and {\widehat{u}}^{(k)}. Repeat until {\widehat{v}}^{(k)} and {\widehat{u}}^{(k)} have converged (a difference of less than 10^{-3} from the preceding step).
Hereafter, the residual matrices of Y and X are determined and the algorithm can be repeated to obtain the next sets of CCA components. This process can be repeated until the residual matrices contain no further information.
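The iteration in steps 1–3 can be sketched as follows. For clarity, this minimal sketch replaces the penalized regressions of step 3(c) with ordinary least squares (i.e., plain Mode B CCA without the elastic net), on small randomly generated standardized matrices; all names and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 50, 4, 3
Y = rng.standard_normal((n, p))   # n x p "gene expression" block
X = rng.standard_normal((n, q))   # n x q "DNA marker" block
Y = (Y - Y.mean(0)) / Y.std(0)    # standardize columns
X = (X - X.mean(0)) / X.std(0)

# Step 2: arbitrary starting weights, then normalize.
v = np.full(q, 1.0 / q); v /= np.linalg.norm(v)
u = np.full(p, 1.0 / p); u /= np.linalg.norm(u)

# Step 3: alternate component computation and regression until convergence.
for _ in range(200):
    xi = X @ v                    # CCA component of X
    omega = Y @ u                 # CCA component of Y
    # OLS stands in for the elastic-net regressions of step 3(c).
    v_new, *_ = np.linalg.lstsq(X, omega, rcond=None)
    u_new, *_ = np.linalg.lstsq(Y, xi, rcond=None)
    v_new /= np.linalg.norm(v_new)
    u_new /= np.linalg.norm(u_new)
    done = max(np.abs(v_new - v).max(), np.abs(u_new - u).max()) < 1e-3
    v, u = v_new, u_new
    if done:
        break

rho = np.corrcoef(X @ v, Y @ u)[0, 1]   # estimated canonical correlation
```

In the penalized version, the two `lstsq` calls are replaced by elastic-net fits with penalties (λ_{1X}, λ_{2X}) and (λ_{1Y}, λ_{2Y}).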
The penalty parameters λ_{1} and λ_{2} come from the lasso and the ridge penalty, respectively. The (1 + λ_{2}) scaling factor is built in to prevent double shrinkage: it eliminates the shrinkage effect of the ridge while maintaining the grouping effect of the ridge and the shrinkage effect of the lasso. The algorithm depends strongly on the choices of λ_{1} and λ_{2}; methods to determine their optimal values are discussed in the following sections.
Univariate soft-thresholding
The optimal values of λ_{1} and λ_{2} are determined by cross-validation (see next section), but when computation time grows tremendously due to the large number of variables, as in the case of microarray data, it may be desirable to simplify the computations. As Zou and Hastie [1] proposed, we can let λ_{2X}→ ∞ and λ_{2Y}→ ∞ to simplify the calculation. This is called univariate soft-thresholding. Although it ignores the dependence between variables within the same set, the decorrelation effect of the ridge penalty is maintained. It has been shown to be quite promising [1].
With univariate soft-thresholding, Step 3(c) reduces to computing {\widehat{v}}^{(k)} and {\widehat{u}}^{(k)} by
\begin{array}{cc}{\widehat{v}}_{i}^{(k)}={\left(\left|{\widehat{\omega}}^{kT}{X}_{i}\right|-\frac{{\lambda}_{1X}}{2}\right)}_{+}\mathrm{sign}\left({\widehat{\omega}}^{kT}{X}_{i}\right)& i=1,\dots ,q\\ {\widehat{u}}_{i}^{(k)}={\left(\left|{\widehat{\xi}}^{kT}{Y}_{i}\right|-\frac{{\lambda}_{1Y}}{2}\right)}_{+}\mathrm{sign}\left({\widehat{\xi}}^{kT}{Y}_{i}\right)& i=1,\dots ,p.\end{array}
From this formula, we see that only the optimal lasso penalty has to be chosen. The lasso shrinks weights to zero: the larger the penalty, the fewer weights receive a nonzero value, and thus the fewer variables are maintained in the CCA components. Following Zou and Hastie [1], we turned this process around and instead determined the number of variables we would like to receive a nonzero weight.
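The soft-thresholding update above can be sketched as a single vectorized operation. The function name `soft_threshold_weights` is hypothetical; the same function serves for both blocks (Z is X with component ω, or Y with component ξ).

```python
import numpy as np

def soft_threshold_weights(Z, comp, lam):
    """Univariate soft-thresholding: each column of Z gets a weight equal to
    its inner product with the component, shrunk toward zero by lam / 2."""
    corr = Z.T @ comp
    return np.maximum(np.abs(corr) - lam / 2, 0.0) * np.sign(corr)

# Toy example: with Z = I the inner products equal the component itself,
# so a penalty of 0.6 shrinks each entry by 0.3 and zeroes the smallest one.
w = soft_threshold_weights(np.eye(3), np.array([0.9, -0.4, 0.1]), 0.6)
# w is approximately [0.6, -0.1, 0.0]
```

Entries whose inner product falls below λ_{1}/2 in absolute value receive weight exactly zero, which is how the variable selection arises.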
Penalty parameter optimization
The optimal number of variables within each CCA component is determined by k-fold cross-validation. The data set is divided into k subsets (based on subjects), of which k − 1 subsets form the training set and the remaining subset forms the test set. The weight vectors u and v are estimated on the training set and the mean squared prediction error is determined on the test set. This is repeated k times, such that each subset serves as a test set once.
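The subject-based partition can be sketched as follows; the function name and the choice of k = 10 folds are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Randomly partition n subject indices into k roughly equal folds;
    each fold serves once as the test set."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

# 194 subjects (as in the data set), 10 hypothetical folds.
folds = kfold_indices(194, 10)
```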
The mean squared prediction error is defined by
MSPE=\frac{{\displaystyle \sum _{j=1}^{N}{\left({x}_{j}{\widehat{v}}^{k(j)}-{\rho}^{k(j)}{y}_{j}{\widehat{u}}^{k(j)}\right)}^{2}}}{N},
with ρ^{k(j)} the canonical correlation, and {\widehat{v}}^{k(j)} and {\widehat{u}}^{k(j)} the weight vectors estimated on the training set from which subject j was deleted. The MSPE is determined for CCA components with differing numbers of variables; the CCA component pair with the lowest MSPE contains the optimal number of variables.
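For a single fold, the MSPE contribution reduces to averaging the squared gap between the paired components on the held-out subjects. The sketch below assumes one fixed weight pair per fold (the function name `mspe` and the toy inputs are hypothetical).

```python
import numpy as np

def mspe(X_test, Y_test, v_hat, u_hat, rho):
    """Mean squared prediction error between the paired CCA components
    on held-out subjects, for one training-set estimate (v_hat, u_hat, rho)."""
    resid = X_test @ v_hat - rho * (Y_test @ u_hat)
    return float(np.mean(resid ** 2))

# Perfectly matched toy components give zero prediction error.
err = mspe(np.array([[1.0], [2.0]]), np.array([[1.0], [2.0]]),
           np.array([1.0]), np.array([1.0]), rho=1.0)
# err == 0.0
```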