Data
The data consist of microarray gene expression measurements that are treated as quantitative traits and a large number of genotypes for 14 Centre d'Etude du Polymorphisme Humain (CEPH) families from Utah. Each pedigree contains three generations with approximately eight offspring per sibship. There are 194 individuals, 56 of which are founders. Phenotypes are measured by microarray gene expression profiles obtained from lymphoblastoid cells using the Affymetrix Human Genome Focus Arrays. Morley et al. [2] selected 3554 genes among the available 8793 probes on the basis of higher variation among unrelated individuals than between replicate arrays for the same individual. In this paper, we used a preprocessed and normalized data provided for these genes. Additional phenotypic data obtained for CEPH families includes age and gender.
Genotypes are measured by genetic markers provided by The SNP Consortium and are available for 2882 autosomal and Xlinked SNPs. The physical map for SNP locations is also available.
The statistical model
In this study we are interested in identifying linear combinations of measures based on gene expressions and SNPbased measures that have the largest correlation. Canonical correlation analysis (CCA) establishes such relationships between the two types of variables [5]. Suppose that x is a random vector of the first type of variables and y is a random vector of the second type of variables. We are looking for vectors a and b which maximize the following correlation:
cor(\begin{array}{cc}{a}^{\prime}x,& {b}^{\prime}y\end{array})=\rho (\begin{array}{cc}a,& b\end{array})=\frac{{a}^{\prime}{\Sigma}_{12}b}{{({a}^{\prime}{\Sigma}_{11}a{b}^{\prime}{\Sigma}_{22}b)}^{1/2}}
where Σ_{11}, Σ_{22}, and Σ_{12} are the variance and covariance matrices. The solution is obtained from the singular value decomposition (SVD) of a matrix K given by:
K={\Sigma}_{11}^{1/2}{\Sigma}_{12}{\Sigma}_{22}^{1/2}=UD{V}^{\prime}=({\alpha}_{1},\dots ,{\alpha}_{k})D({\beta}_{1},\dots ,{\beta}_{k}{)}^{\prime}.
Here k, is the rank of K. The highest canonical correlation is obtained by considering the best rank 1 approximation to the matrix K [6], K = d_{1}α_{1}β_{1}', where d_{1} is the positive square root of the first eigenvalue of K'K or KK', and α_{1} and β_{1} are first singular vectors. Then the canonical vectors or weights in the linear combinations of the two sets of variables that have the largest correlation are given by
\begin{array}{ccc}a={\Sigma}_{11}^{1/2}{\alpha}_{1}& \text{and}& b={\Sigma}_{22}^{1/2}{\beta}_{1}.\end{array}
In conventional CCA, all variables are included in the fitted linear combinations. However, in microarray and genomewide data, the number of genes under consideration often exceeds tens of thousands. In these cases linear combinations of all features may not be easily interpreted. Sparse canonical correlation analysis (SCCA) enhances biological interpretability and provides sets of variables with sparse loadings. This is consistent with the belief that only a small proportion of genes are expressed under a certain set of conditions, and that expressed genes are regulated at a subset of genetic locations. We propose obtaining sparse linear combinations of features by considering a sparse singular value decomposition of K where singular vectors α and β in Eq. (3) have sparse loadings. We developed an iterative algorithm that alternately approximates the left and right singular vectors of the SVD using softthresholding for feature selection. This approach is related to the sparse principal component analysis method of Zou et al. [7] and partial least squares methods described by Wegelin [8].
Sparse canonical correlation analysis algorithm
Assume that variables in both data sets have been standardized to have zero means and unit variances and K is the matrix in Eq. (2). Our algorithm works as follows:

1.
Select sparseness parameters λ_{
α
}and λ_{
β
}for left and right singular vectors, respectively. Select initial values for singular vectors α^{0} and β^{0}, set i = 0.

2.
Update left singular vector α as follows:

a.
Set α^{i+1 }= Kβ^{i}and normalize it to have unit length

b.
Softthresholding: {\alpha}^{i+1}={\left(\left{\alpha}^{i+1}\right\frac{1}{2}{\lambda}_{\alpha}\right)}_{+}Sign({\alpha}^{i+1})

c.
Normalize α^{i+1 }again

3.
Update right singular vector β as follows:

a.
Set β^{i+1 }= Kα^{i+1 }and normalize it to have unit length

b.
Softthresholding: {\beta}^{i+1}={\left(\left{\beta}^{i+1}\right\frac{1}{2}{\lambda}_{\beta}\right)}_{+}Sign({\beta}^{i+1})

c.
Normalize β^{i+1 }again
4. i = i + 1, repeat steps 2 and 3 until convergence.
The algorithm converges to the first singular vectors α_{1} and β_{1} in Eqs. (2) and (3). Other canonical vectors can be obtained by considering the residual of the correlation matrix. This is a very computationally efficient algorithm because it usually converges in less than ten iterations and requires only matrix multiplication. Optimal combination of sparseness parameters for softthresholding steps can be selected using crossvalidation. A subset of the data is repeatedly used to identify linear combinations of variables for the specific pair of sparseness parameters and then the correlation between the obtained canonical vectors is calculated in the remainder of the data set. The optimal combination of λ_{
α
}and λ_{
β
}corresponds to the highest average correlation over the repetitions.
Analysis approach
In this study one type of variables is based on gene expression levels and the other type of information relates to SNP genotypes and pedigree structure. An immediate challenge in this context is how to define correlation between these two types of data. We adopted a measure of covariance of genetic similarity with phenotypic similarity as in the linkage analysis methodologies of Tritchler et al. [3] and Commenges [4]. Consider the offspring generation in all available pedigrees and take all possible sib pairs. Let y_{
ij
}and y_{
ik
}be the phenotypes for the siblings j and k in family i for a particular gene expression and let w_{
ijk
}represent identitybydescent (IBD) value for these siblings for some specific SNP. Then, for the given gene expression and SNP, the test statistic for linkage in Tritchler et al. [3] is
\sigma ={\displaystyle \sum _{i}{\displaystyle \sum _{j}{\displaystyle \sum _{k>j}\left\{{y}_{ij}E({y}_{ij})\right\}}}}\left\{{y}_{ik}E({y}_{ik})\right\}\left\{{w}_{ijk}E({w}_{ijk})\right\},
which is used for computation of covariance matrix between the phenotypic similarity and genotypic similarity. Note the similarity of the above expression to HasemanElston regression. In fact, Tritchler et al. [3] show that the correlation statistic subsumes both the original HasemanElston regression analysis and the later HasemanElston (revisited).
Phenotypic similarity
The phenotypes in this study are the gene expression values for siblings in the last generation of the pedigrees (i.e., the offspring generation). Previous studies have shown that there is variation in human gene expression according to age and gender [2]. Therefore, we limit the analysis to the last generation in all pedigrees and also correct for the effects of gender and age by fitting a linear model
y_{
ij
}= α + β_{
gender
}gender_{
ij
}+ β_{
age
}age_{
ij
}+ e_{
ij
}= E(y_{
ij
}) + e_{
ij
}.
Gender and age information were not available for all individuals in pedigree 1454 and for three individuals in pedigree 1340. Therefore, these data were excluded from the analysis. In the 13 remaining pedigrees, there were 344 distinct sib pairs with sibship size varying between 15 and 28. Although sib pairs are correlated within pedigrees, this does not affect the results because this is an exploratory study and no assumption of independence is made.
The phenotypic similarity for siblings j and k in pedigree i is computed as sibpair crossproduct of mean corrected gene expression phenotypes:
{y_{
ij
} E(y_{
ij
})}{y_{
ik
} E(y_{
ik
})}.
Genotypic similarity
For each sib pair, the probabilities of sharing 0, 1, and 2 alleles identically by descent (IBD) were estimated using MERLIN. This is an exploratory analysis of all SNPs and gene expressions, so great precision in the estimation of the IBD values is not required and the provided physical distance map of the SNP locations is a suitable approximation to the genetic distances required by MERLIN. Given the incomplete genetic marker information for some individuals, exact IBD values could not be computed. We estimated the number of alleles shared IBD by two siblings as a posterior expected value based on the probabilities estimated using MERLIN. Expected IBD values E(w_{
ijk
}) were computed as the sample means over all sib pairs.
Standardization
We standardize the phenotypic and genotypic similarity variables by subtracting the means and dividing by the standard deviations. Simulations show that after data standardization, the analysis can be simplified by replacing Σ_{11} and Σ_{22} in Eq. (2) with an identity matrix while yielding satisfactory results. Then K in Eq. (2) is the covariance between the two standardized data sets and the first canonical vectors in Eq. (3) are just α_{1} and β_{1}.
Evaluation
We evaluated the performance of SCCA using the leaveoneout crossvalidation (LOOCV) analysis, treating a pedigree as one unit. In this study, assessment of performance is based on the estimated test sample correlation. It shows correlation between linear combinations of identified loci and gene expressions in the independent sample. We used pedigree as the unit in LOOCV because it represents a statistically independent unit. Leaving out one whole pedigree preserves dependence structure in the familybased study and ensures independence between training and testing samples. We carried out an analogue of 13fold CV where foldsize was dictated by the complex structure of the data.