Analysis of gene expression data from the CEPH families identified several regions of linkage. The highest LOD score (5.36) was observed on chromosome 1 for cluster 67, which comprises two genes, glutathione S-transferase M1 (GSTM1) and glutathione S-transferase M2 (GSTM2), both located on chromosome 1p13.3. The individual LOD scores in this region were 5.33 for GSTM1 and 4.53 for GSTM2, both of which were reported in a previous study [2]. To follow up these results, we performed a bivariate linkage analysis of GSTM1 and GSTM2 expression levels using SOLAR [10]; the resulting LOD score was similar to those obtained for each gene individually. We also estimated the genetic correlation between the two traits, which was significantly different from zero (p < 0.0001). This suggests pleiotropy, i.e., a single gene affecting both traits. The fact that the present method recovers a clear pleiotropic cluster that was previously identified with a different method suggests that the linkage results for the larger clusters may also be biologically meaningful. With large data sets, the reproducibility of the results could also be assessed, for example, with k-split methods.
The regions linked to clusters 82, 76, and 93 showed LOD scores greater than 3.36. Pathway analysis for these clusters is limited because the majority of their genes could not be assigned to identifiable pathways. Furthermore, a significant association between a pathway and a cluster may reflect a single gene from that pathway in the cluster (e.g., the inflammatory response pathway in cluster 76) or several genes sharing a pathway (e.g., six genes that are part of the calcium signaling pathway in cluster 93). At this point we have limited our analysis to the annotation provided in Problem 1 (i.e., GenMAPP and KEGG) and to canonical pathways built with IPA from a set of gene names supplied as input. While the pathways identified by the two approaches overlap (Table 3), the differences between them might be due to the source from which the pathway information is extracted. Because GenMAPP is a collection of voluntary contributions, some bias may be present, and the pathway information may be incomplete. IPA, on the other hand, draws on an extensive collection of published literature that is continuously updated. However, its pathway results may be difficult to interpret because contradictory findings are likely to be reported across multiple manuscripts. The researcher will have to proceed with caution when this occurs.
The present analyses have identified several clusters of multiple transcripts for which the first principal component shows strong evidence of linkage to at least one genomic region. Furthermore, because clustering is performed before linkage analysis, the clusters are defined by the data at hand and do not depend on a priori biological knowledge that their members constitute a pathway or are genetically co-regulated. Thus, the present approach may identify novel groups of co-regulated transcripts. Many of these clusters contained large numbers of genes (up to 113), and a fully parameterized multivariate analysis would require estimating a large number of variance components, which would be computationally difficult. By analyzing the first principal component, we reduce the transcription information from each cluster to a single variable that maximizes the amount of variance explained. In this exploratory study, we used only the first principal component (PC1) for each cluster; a more thorough analysis would include higher principal components, perhaps in conjunction with factor analysis or other multivariate techniques. However, in the present study, the second principal component (PC2) explained only 8% of the variance on average, so in most cases PC1 captured a large part of the information.
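The reduction of a cluster to its first principal component can be illustrated with a minimal sketch. It is not the software used in this study: the function names, the simulated data, and the use of power iteration on the correlation matrix are all assumptions made for illustration, and the sketch presumes that the transcripts within a cluster are positively correlated.

```python
import math
import random

def standardize(values):
    """Center one transcript to mean 0 and scale it to sample variance 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return [(x - mean) / sd for x in values]

def first_principal_component(rows, iterations=500):
    """rows: one list of expression levels per individual (columns = transcripts).

    Returns (scores, loadings, fraction of total variance explained by PC1).
    """
    n, p = len(rows), len(rows[0])
    cols = [standardize([row[j] for row in rows]) for j in range(p)]
    # Correlation matrix of the standardized transcripts.
    corr = [[sum(a * b for a, b in zip(cols[i], cols[j])) / (n - 1)
             for j in range(p)] for i in range(p)]
    # Power iteration converges to the leading eigenvector (the PC1 loadings).
    # The uniform start is an assumption that suits positively correlated clusters.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the leading eigenvalue; the trace of corr is p.
    eigenvalue = sum(v[i] * corr[i][j] * v[j] for i in range(p) for j in range(p))
    # PC1 score of each individual: the single variable passed on to linkage.
    scores = [sum(cols[j][k] * v[j] for j in range(p)) for k in range(n)]
    return scores, v, eigenvalue / p

# Simulated (hypothetical) cluster: 5 transcripts driven by one shared factor.
random.seed(0)
rows = []
for _ in range(60):
    shared = random.gauss(0, 1)
    rows.append([shared + random.gauss(0, 0.5) for _ in range(5)])

scores, loadings, explained = first_principal_component(rows)
print(f"PC1 explains {explained:.0%} of the cluster variance")
```

In this sketch, `scores` plays the role of the single quantitative trait per individual that would be carried forward to linkage analysis, and `explained` is the quantity that would be compared against the share captured by higher components.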
A variety of other methods are available for identifying clusters, including hierarchical and machine-learning methods. For the present analyses, we chose non-hierarchical techniques because they are rapid and can assign transcripts to distinct clusters without assumptions about the nature of the relationships among clusters. All clustering methods are influenced by the scale of the variables included, and the present approach employs normalization procedures to equalize the scale among variables. While this approach minimizes the influence of outliers, it may mask some real clusters or, perhaps, artifactually introduce some clustering into the data. More research is needed in this area. The present clustering method requires certain assumptions, including specification of the maximum number of clusters. In addition, as presently implemented, the method identifies only clusters in which genes are expressed in the same direction (all up-regulated or all down-regulated). Further work may be required to address these limitations. Nevertheless, the ability to analyze a large number of traits relatively rapidly and easily is an advantage of this method. The multi-stage approach followed here, cluster analysis first and linkage analysis second, is therefore another useful way to identify chromosomal locations of genes affecting the expression levels of multiple transcripts.
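The effect of normalization and the same-direction limitation can be illustrated with a small sketch. It uses k-means as a stand-in for a non-hierarchical method and is not the procedure used in this study; the transcript names, the toy expression values, the fixed number of clusters, and the farthest-first initialization are all assumptions made for illustration.

```python
import math

def zscore(profile):
    """Normalize one transcript across individuals to mean 0, sample variance 1."""
    n = len(profile)
    mean = sum(profile) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in profile) / (n - 1))
    return [(x - mean) / sd for x in profile]

def sqdist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=100):
    """Plain k-means with deterministic farthest-first initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from its nearest chosen centroid.
        farthest = max(points, key=lambda p: min(sqdist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each transcript joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sqdist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels

# Hypothetical transcripts measured in six individuals. T1 and T2 rise together
# on very different scales; T3 and T4 fall as T1 rises.
profiles = {
    "T1": [1, 2, 3, 4, 5, 6],
    "T2": [10, 20, 30, 40, 50, 60],
    "T3": [6, 5, 4, 3, 2, 1],
    "T4": [65, 50, 45, 30, 20, 10],
}
names = list(profiles)
labels = kmeans([zscore(profiles[name]) for name in names], k=2)
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

After normalization, T1 and T2 are grouped despite their different scales, whereas the inversely regulated T3 and T4 fall into a separate cluster even though they are strongly (negatively) correlated with T1. A distance based on the absolute correlation would be one way to group such transcripts together, at the cost of mixing directions within a cluster.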