All analyses were performed on the GAW15 Problem 1 human gene expression data. Clustering analysis was applied to the 29 gene expression phenotypes found to have significant linkage results on chromosome 14 [2]. Principal-component analysis based on heritability [5] was performed to combine the gene expression traits in each of the resulting clusters. A ridge-penalized principal-components approach based on heritability, proposed by Wang et al. [6], was applied to the 150 gene expression levels with the highest heritability. Multipoint linkage analysis was carried out on each of the 29 individual traits on chromosome 14 as well as on several combined traits.
Clustering analysis
We proposed a clustering method that uses all subjects in the data set and incorporates family structure by defining a distance measure that reflects the similarity of traits among family members. This distance measure is a sum of weighted family-specific mean trait differences, with weights calculated from the within-family trait sum-of-squares. When the trait values of subjects within a family are more similar, giving a smaller within-family sum-of-squares, the difference in their trait means is more important and therefore receives a larger weight. Specifically, let i index families and j index subjects, and let $n_i$ be the number of members in the ith family. Then the distance between trait x and trait y is defined as
where and . This distance measure resembles the F statistic in the ANOVA test. The proposed clustering, which uses all subjects, was compared with standard hierarchical clustering using founders only.
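The verbal description of the distance can be illustrated with a short sketch. The exact formula used in the analysis is not reproduced here, so the weight below (family size divided by the pooled within-family sum-of-squares of the two traits) is only one hypothetical choice consistent with the description, not the authors' definition. The function name `family_weighted_distance` and the small constant `eps` (which guards against a zero within-family sum-of-squares) are likewise assumptions made for illustration.

```python
import numpy as np

def family_weighted_distance(x, y, family, eps=1e-12):
    """Hypothetical family-weighted distance between two traits.

    x, y   : trait values, one entry per subject
    family : family label of each subject
    eps    : assumed guard against division by zero
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    family = np.asarray(family)
    dist = 0.0
    for f in np.unique(family):
        members = family == f
        xf, yf = x[members], y[members]
        # Pooled within-family sum-of-squares of the two traits.
        ss_within = ((xf - xf.mean()) ** 2).sum() + ((yf - yf.mean()) ** 2).sum()
        # Assumed weight: larger when family members are more alike.
        weight = xf.size / (ss_within + eps)
        # Weighted squared difference of the family-specific trait means.
        dist += weight * (xf.mean() - yf.mean()) ** 2
    return dist
```

Under this assumed form, two traits whose family means differ by the same amount are judged further apart when the members of each family agree closely, which mirrors the way an F statistic grows as within-group variation shrinks.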
Principal components of heritability
The principal-components approach based on heritability proposed by Ott and Rabinowitz [5] exploits family structure by defining principal components of heritability (PCH) as scores with maximal heritability, subject to the scores being orthogonal to each other. Specifically, a trait can be decomposed into a family-specific component and a subject-specific component. Instead of maximizing the total variation, as in standard principal-components analysis, the PCH maximizes the family-specific variation relative to the subject-specific variation. That is, the PCH is the solution to $\max_{a} \, a^{T} B a / a^{T} W a$, where a is the vector of trait loadings, B is the family-specific variation and W is the subject-specific variation. This criterion is equivalent to maximizing the heritability of a score (the ratio of the family-specific variation to the total variation), because the heritability is an increasing function of the ratio being maximized. We used the between-family sum-of-squares to estimate B and the within-family sum-of-squares to estimate W. The first three PCHs were computed in each of the clusters found in the previous section.
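As a minimal sketch of this computation, the ratio of family-specific to subject-specific variation of a score with loading vector a, that is a'Ba / a'Wa, can be maximized by solving a generalized eigenproblem. The code below is illustrative rather than the implementation used in the analysis: the function names, the use of family means around the grand mean to form B, and the Cholesky whitening step are all assumptions. It also assumes W is nonsingular.

```python
import numpy as np

def family_sums_of_squares(X, family):
    """Between-family (B) and within-family (W) sum-of-squares matrices.

    X      : (n_subjects, n_traits) array of trait values
    family : length n_subjects array of family labels
    """
    X = np.asarray(X, dtype=float)
    family = np.asarray(family)
    grand_mean = X.mean(axis=0)
    p = X.shape[1]
    B = np.zeros((p, p))
    W = np.zeros((p, p))
    for f in np.unique(family):
        Xf = X[family == f]
        fam_mean = Xf.mean(axis=0)
        d = (fam_mean - grand_mean)[:, None]
        B += Xf.shape[0] * (d @ d.T)   # family mean around the grand mean
        R = Xf - fam_mean
        W += R.T @ R                   # subjects around their family mean
    return B, W

def pch(B, W, n_components=3):
    """Loadings maximizing a'Ba / a'Wa (assumes W is positive definite)."""
    L = np.linalg.cholesky(W)          # W = L L'
    Linv = np.linalg.inv(L)
    M = Linv @ B @ Linv.T              # whitened, symmetric problem
    vals, vecs = np.linalg.eigh((M + M.T) / 2)
    order = np.argsort(vals)[::-1][:n_components]
    A = Linv.T @ vecs[:, order]        # back-transform: a = L^{-T} v
    return vals[order], A
```

The columns of `A` are the loading vectors, and `X @ A` gives the corresponding scores. In this sketch the successive loadings are orthogonal with respect to W, so the subject-specific parts of different scores are uncorrelated.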
Penalized principal-components of heritability
Without knowing which expression levels are regulated by a common gene, it may be desirable to apply the principal components of heritability approach to a large number of traits and evaluate which traits have significantly large loadings at linkage peaks. However, the method of Ott and Rabinowitz [5] is not applicable to high-dimensional traits for two reasons: first, it does not account for overfitting, which is a common problem with high-dimensional data; second, the sample within-family sum-of-squares (the estimate of W) may be singular and therefore cannot be inverted. Although a generalized inverse can be used, the results will be highly unstable. To screen a large number of traits, we used a penalized principal components of heritability [6], defined as the solution to $\max_{a} \, a^{T} B a / a^{T} (W + \lambda I) a$, to stabilize the PCH. Here, λ is the tuning parameter and I is the identity matrix. When λ is zero, the penalized PCH reduces to the PCH of Ott and Rabinowitz [5]; when λ approaches infinity, the penalized PCH approaches the score that maximizes the family-specific variation. In the latter case, the penalized PCH is close to the regular principal component applied to the founders. The value of λ was chosen by maximizing a cross-validated heritability [6]. We applied the penalized PCH to the 150 gene expression levels with the highest heritability.
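A minimal sketch of the ridge-type penalty follows, assuming the penalized criterion takes the form a'Ba / a'(W + λI)a, which matches the limiting behaviour described above. The function name `penalized_pch` and the rescaling of loadings to unit length are illustrative assumptions, and the cross-validation used to choose λ is not shown.

```python
import numpy as np

def penalized_pch(B, W, lam, n_components=1):
    """Loadings maximizing a'Ba / a'(W + lam*I)a (assumed penalty form).

    B, W : between- and within-family sum-of-squares matrices
    lam  : ridge tuning parameter; must be > 0 when W is singular
    """
    p = W.shape[0]
    W_pen = W + lam * np.eye(p)        # positive definite whenever lam > 0
    L = np.linalg.cholesky(W_pen)      # W_pen = L L'
    Linv = np.linalg.inv(L)
    M = Linv @ B @ Linv.T              # whitened, symmetric problem
    vals, vecs = np.linalg.eigh((M + M.T) / 2)
    order = np.argsort(vals)[::-1][:n_components]
    A = Linv.T @ vecs[:, order]        # back-transform: a = L^{-T} v
    A /= np.linalg.norm(A, axis=0)     # unit-length loadings (assumed scaling)
    return vals[order], A
```

Because the penalty adds λ to every eigenvalue of W, the denominator stays invertible even when there are more traits than within-family degrees of freedom. For very large λ the denominator is dominated by λ a'a, so the leading loading vector approaches the leading eigenvector of B.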
Linkage analysis
Prior to linkage analysis, genotype consistency was checked by PEDCHECK. SNPs with Mendelian genotyping errors were set to missing. Multipoint linkage analyses were performed by SIBPAL in S.A.G.E. The weighting method used for different sibling pairs was 'W4' [7]. The Rutgers genetic map provided by Sung et al. [8] was used. Linkage results from S.A.G.E. were summarized by t statistics and p-values.