Data
The data consist of microarray gene expression measurements that are treated as quantitative traits and a large number of genotypes for 14 Centre d'Etude du Polymorphisme Humain (CEPH) families from Utah. Each pedigree contains three generations with approximately eight offspring per sibship. There are 194 individuals, 56 of which are founders. Phenotypes are measured by microarray gene expression profiles obtained from lymphoblastoid cells using the Affymetrix Human Genome Focus Arrays. Morley et al. [2] selected 3554 genes among the available 8793 probes on the basis of higher variation among unrelated individuals than between replicate arrays for the same individual. In this paper, we used a pre-processed and normalized data provided for these genes. Additional phenotypic data obtained for CEPH families includes age and gender.
Genotypes are measured by genetic markers provided by The SNP Consortium and are available for 2882 autosomal and X-linked SNPs. The physical map for SNP locations is also available.
The statistical model
In this study we are interested in identifying linear combinations of measures based on gene expressions and SNP-based measures that have the largest correlation. Canonical correlation analysis (CCA) establishes such relationships between the two types of variables [5]. Suppose that x is a random vector of the first type of variables and y is a random vector of the second type of variables. We are looking for vectors a and b which maximize the following correlation:
where Σ11, Σ22, and Σ12 are the variance and covariance matrices. The solution is obtained from the singular value decomposition (SVD) of a matrix K given by:
Here k, is the rank of K. The highest canonical correlation is obtained by considering the best rank 1 approximation to the matrix K [6], K = d1α1β1', where d1 is the positive square root of the first eigenvalue of K'K or KK', and α1 and β1 are first singular vectors. Then the canonical vectors or weights in the linear combinations of the two sets of variables that have the largest correlation are given by
In conventional CCA, all variables are included in the fitted linear combinations. However, in microarray and genome-wide data, the number of genes under consideration often exceeds tens of thousands. In these cases linear combinations of all features may not be easily interpreted. Sparse canonical correlation analysis (SCCA) enhances biological interpretability and provides sets of variables with sparse loadings. This is consistent with the belief that only a small proportion of genes are expressed under a certain set of conditions, and that expressed genes are regulated at a subset of genetic locations. We propose obtaining sparse linear combinations of features by considering a sparse singular value decomposition of K where singular vectors α and β in Eq. (3) have sparse loadings. We developed an iterative algorithm that alternately approximates the left and right singular vectors of the SVD using soft-thresholding for feature selection. This approach is related to the sparse principal component analysis method of Zou et al. [7] and partial least squares methods described by Wegelin [8].
Sparse canonical correlation analysis algorithm
Assume that variables in both data sets have been standardized to have zero means and unit variances and K is the matrix in Eq. (2). Our algorithm works as follows:
-
1.
Select sparseness parameters λ
α
and λ
β
for left and right singular vectors, respectively. Select initial values for singular vectors α0 and β0, set i = 0.
-
2.
Update left singular vector α as follows:
-
a.
Set αi+1 = Kβiand normalize it to have unit length
-
b.
Soft-thresholding:
-
c.
Normalize αi+1 again
-
3.
Update right singular vector β as follows:
-
a.
Set βi+1 = Kαi+1 and normalize it to have unit length
-
b.
Soft-thresholding:
-
c.
Normalize βi+1 again
4. i = i + 1, repeat steps 2 and 3 until convergence.
The algorithm converges to the first singular vectors α1 and β1 in Eqs. (2) and (3). Other canonical vectors can be obtained by considering the residual of the correlation matrix. This is a very computationally efficient algorithm because it usually converges in less than ten iterations and requires only matrix multiplication. Optimal combination of sparseness parameters for soft-thresholding steps can be selected using cross-validation. A subset of the data is repeatedly used to identify linear combinations of variables for the specific pair of sparseness parameters and then the correlation between the obtained canonical vectors is calculated in the remainder of the data set. The optimal combination of λ
α
and λ
β
corresponds to the highest average correlation over the repetitions.
Analysis approach
In this study one type of variables is based on gene expression levels and the other type of information relates to SNP genotypes and pedigree structure. An immediate challenge in this context is how to define correlation between these two types of data. We adopted a measure of covariance of genetic similarity with phenotypic similarity as in the linkage analysis methodologies of Tritchler et al. [3] and Commenges [4]. Consider the offspring generation in all available pedigrees and take all possible sib pairs. Let y
ij
and y
ik
be the phenotypes for the siblings j and k in family i for a particular gene expression and let w
ijk
represent identity-by-descent (IBD) value for these siblings for some specific SNP. Then, for the given gene expression and SNP, the test statistic for linkage in Tritchler et al. [3] is
which is used for computation of covariance matrix between the phenotypic similarity and genotypic similarity. Note the similarity of the above expression to Haseman-Elston regression. In fact, Tritchler et al. [3] show that the correlation statistic subsumes both the original Haseman-Elston regression analysis and the later Haseman-Elston (revisited).
Phenotypic similarity
The phenotypes in this study are the gene expression values for siblings in the last generation of the pedigrees (i.e., the offspring generation). Previous studies have shown that there is variation in human gene expression according to age and gender [2]. Therefore, we limit the analysis to the last generation in all pedigrees and also correct for the effects of gender and age by fitting a linear model
y
ij
= α + β
gender
gender
ij
+ β
age
age
ij
+ e
ij
= E(y
ij
) + e
ij
.
Gender and age information were not available for all individuals in pedigree 1454 and for three individuals in pedigree 1340. Therefore, these data were excluded from the analysis. In the 13 remaining pedigrees, there were 344 distinct sib pairs with sibship size varying between 15 and 28. Although sib pairs are correlated within pedigrees, this does not affect the results because this is an exploratory study and no assumption of independence is made.
The phenotypic similarity for siblings j and k in pedigree i is computed as sib-pair cross-product of mean corrected gene expression phenotypes:
{y
ij
- E(y
ij
)}{y
ik
- E(y
ik
)}.
Genotypic similarity
For each sib pair, the probabilities of sharing 0, 1, and 2 alleles identically by descent (IBD) were estimated using MERLIN. This is an exploratory analysis of all SNPs and gene expressions, so great precision in the estimation of the IBD values is not required and the provided physical distance map of the SNP locations is a suitable approximation to the genetic distances required by MERLIN. Given the incomplete genetic marker information for some individuals, exact IBD values could not be computed. We estimated the number of alleles shared IBD by two siblings as a posterior expected value based on the probabilities estimated using MERLIN. Expected IBD values E(w
ijk
) were computed as the sample means over all sib pairs.
Standardization
We standardize the phenotypic and genotypic similarity variables by subtracting the means and dividing by the standard deviations. Simulations show that after data standardization, the analysis can be simplified by replacing Σ11 and Σ22 in Eq. (2) with an identity matrix while yielding satisfactory results. Then K in Eq. (2) is the covariance between the two standardized data sets and the first canonical vectors in Eq. (3) are just α1 and β1.
Evaluation
We evaluated the performance of SCCA using the leave-one-out cross-validation (LOOCV) analysis, treating a pedigree as one unit. In this study, assessment of performance is based on the estimated test sample correlation. It shows correlation between linear combinations of identified loci and gene expressions in the independent sample. We used pedigree as the unit in LOOCV because it represents a statistically independent unit. Leaving out one whole pedigree preserves dependence structure in the family-based study and ensures independence between training and testing samples. We carried out an analogue of 13-fold CV where fold-size was dictated by the complex structure of the data.