Functional group-based linkage analysis of gene expression trait loci
© Li et al; licensee BioMed Central Ltd. 2007
Published: 18 December 2007
We explored approaches to using multiple related traits (gene expression levels) in linkage analysis. We first grouped mRNA transcripts according to their functions as annotated in the biological process category of the Gene Ontology (GO). We then compared the sample average, principal-components analysis (PCA), and linear discriminant analysis (LDA) as ways to derive a univariate composite trait. Our results showed that PCA generally yielded stronger evidence for linkage, though the LDA component had the highest heritability. We also developed an algorithm to search for clusters of linkage peaks from multiple traits in the same group and a heuristic method for calculating a p-value to evaluate the linkage peak clustering. Future research is needed to develop rigorous methods for mapping genes that affect the expression of a group of transcripts.
Our aim is to explore approaches to collecting information from multiple sub-clinical traits (e.g., gene expression, protein levels, metabolite measurements) in the search for loci responsible for complex diseases. Given a particular regulatory/metabolic pathway that is important in the development of a disease, we use the expression levels of genes in the pathway to map the loci that affect the pathway and thereby the clinical outcome. To avoid confusion between the genes we are trying to map and the genes whose expression levels are used as traits, we will call the expression levels "traits" or "transcripts", as we will talk about functions of the transcripts. Because these traits reflect more immediate effects of the gene in question, we may reduce the confounding effects of environmental factors and of genetic heterogeneity in the disease.
Because there is no clinical trait or covariate information in Problem 1, we defined a group of transcripts as those sharing some common biological functions, and explored whether we can gain more linkage information by combining multiple expression traits in a group.
Selection of gene functional groups
The 3554 transcripts were grouped based on the biological processes in which they are involved according to the Gene Ontology (GO) via Onto-Express. On the one hand, we would like to borrow information across many transcripts in a functional group; on the other hand, functional heterogeneity increases as a functional group becomes larger or less specific. To achieve a trade-off between functional specificity and group size, and to ensure that the results from several groups are comparable, we restricted our analysis to those groups with approximately 10 to 20 transcripts. We calculated pairwise correlations between traits in the same group, and the heritability of each of the 554 transcripts in 40 groups was estimated using SOLAR. The top 10 groups with the highest average heritability estimates were selected for subsequent linkage analysis because they were more likely to contain interesting linkage signals.
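The selection rule described above (keep groups of roughly 10 to 20 transcripts, then rank by average heritability) can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `select_groups`, the dictionary layout, and the toy values are assumptions, and the heritability estimates are taken as given (for example, exported from SOLAR).

```python
from statistics import mean


def select_groups(groups, heritability, min_size=10, max_size=20, top_n=10):
    """Return the names of the top_n size-eligible groups by mean heritability.

    groups:       dict, group name -> list of transcript IDs (hypothetical layout)
    heritability: dict, transcript ID -> heritability estimate
    """
    # Keep only groups whose size falls in the allowed range.
    eligible = {
        name: ids for name, ids in groups.items()
        if min_size <= len(ids) <= max_size
    }
    # Rank the remaining groups by the average heritability of their transcripts.
    ranked = sorted(
        eligible,
        key=lambda name: mean(heritability[t] for t in eligible[name]),
        reverse=True,
    )
    return ranked[:top_n]


# Toy data: three made-up groups, one of which is too large to be eligible.
groups = {
    "GO:A": [f"a{i}" for i in range(12)],
    "GO:B": [f"b{i}" for i in range(15)],
    "GO:C": [f"c{i}" for i in range(40)],
}
heritability = {}
heritability.update({t: 0.2 for t in groups["GO:A"]})
heritability.update({t: 0.5 for t in groups["GO:B"]})
heritability.update({t: 0.9 for t in groups["GO:C"]})

selected = select_groups(groups, heritability, top_n=2)
```

In the toy data, the oversized group is dropped before ranking even though its transcripts have the highest heritability, which mirrors the trade-off between group size and functional specificity.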
Linkage analysis with composite trait
All traits were standardized to have a sample mean of 0 and a sample variance of 1. We used three approaches to produce a univariate summary of the expression traits of individual genes in each group: a sample average, principal-components analysis (PCA), and linear discriminant analysis (LDA). All three approaches derive one or more univariate "composite" traits as linear combinations of the individual traits. The components in PCA are orthogonal linear combinations of the original data and are ordered by decreasing sample variance. In particular, the first component explains the largest proportion of sample variation. PCA was carried out using the function prcomp in R. PCA does not take into account the fact that the subjects came from several distinct families. As an alternative, we sought a linear combination of the original data that maximized the ratio of inter-family variance to within-family variance, for which we used LDA with the family as the class label. The LDA was carried out using the function lda in R.
Multipoint variance-component LOD scores for each transcript and each composite trait were calculated using Merlin.
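To make the first two composites concrete, the sketch below standardizes each trait, forms the sample average, and approximates the first principal component by power iteration on the sample covariance matrix. It is an illustration under stated assumptions, not the analysis code: the study used prcomp in R, whereas the helper names (`standardize`, `average_trait`, `first_pc`, `sample_var`), the power-iteration shortcut, and the toy data here are all hypothetical. LDA is not covered.

```python
import math


def standardize(column):
    """Scale one trait to sample mean 0 and sample variance 1."""
    n = len(column)
    m = sum(column) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in column) / (n - 1))
    return [(x - m) / sd for x in column]


def sample_var(values):
    """Sample variance with the n - 1 denominator."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)


def average_trait(traits):
    """Composite trait: per-subject mean over the (standardized) traits."""
    return [sum(row) / len(row) for row in zip(*traits)]


def first_pc(traits, iterations=500):
    """Approximate first-PC weights and scores by power iteration.

    traits: list of p standardized traits, each a list of n subject values.
    The starting vector must not be orthogonal to the leading eigenvector.
    """
    p, n = len(traits), len(traits[0])
    # Sample covariance matrix (traits are already mean-centered).
    cov = [[sum(a * b for a, b in zip(traits[i], traits[j])) / (n - 1)
            for j in range(p)] for i in range(p)]
    w = [1.0 + 0.01 * i for i in range(p)]  # slightly non-uniform start
    for _ in range(iterations):
        v = [sum(cov[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    scores = [sum(w[i] * traits[i][s] for i in range(p)) for s in range(n)]
    return w, scores


# Toy data: 3 traits measured on 5 subjects (made-up values).
raw = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 11], [5, 3, 4, 1, 2]]
traits = [standardize(col) for col in raw]
avg = average_trait(traits)
weights, pc1 = first_pc(traits)
```

The sign of a principal component is arbitrary, so `pc1` may be the negative of what another implementation returns; this has no effect on its variance.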
Combining linkage results from multiple traits
In linkage studies in which multiple related traits (such as obesity, diabetes, and hypertension) are analyzed, it is often of interest to see whether several of the traits have linkage signals around a common region. This is often done by simply visualizing the LOD scores along a chromosome. We developed a heuristic algorithm for identifying the clustering of linkage peaks:
1) Linkage "peaks" were defined as LOD scores greater than a particular threshold C (e.g., 2). The threshold was set to be relatively high so that the chance of type I error was low.
2) The peak locations were defined as the positions of the local maximum LOD scores.
3) Using a sliding window of width W (e.g., 10 cM), we defined a "cluster" as a window inside which more than one distinct transcript had one or more peaks.
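The three steps above can be sketched in a few lines. This is one possible reading of the heuristic, with several assumptions that the text does not specify: a single chromosome, LOD scores on a common grid of positions, windows anchored at each peak, and windows whose peaks all belong to the previously reported cluster being skipped. The function names and toy LOD curves are hypothetical.

```python
def find_peaks(positions, lod, threshold=2.0):
    """Positions of local maxima of a LOD curve that exceed the threshold C."""
    peaks = []
    for i, score in enumerate(lod):
        left = lod[i - 1] if i > 0 else float("-inf")
        right = lod[i + 1] if i < len(lod) - 1 else float("-inf")
        # On a plateau, only the last point counts as the peak.
        if score > threshold and score >= left and score > right:
            peaks.append(positions[i])
    return peaks


def find_clusters(peaks_by_trait, width=10.0):
    """Windows of the given width (cM) holding peaks from more than one trait."""
    tagged = sorted((pos, trait)
                    for trait, ps in peaks_by_trait.items() for pos in ps)
    clusters = []
    for start, _ in tagged:
        # Window anchored at this peak: [start, start + width].
        members = [(p, t) for p, t in tagged if start <= p <= start + width]
        if len({t for _, t in members}) > 1:
            # Skip windows that only repeat peaks of the previous cluster.
            if not clusters or not set(members) <= set(clusters[-1]):
                clusters.append(members)
    return clusters


# Toy LOD curves for three transcripts on a 5 cM grid (made-up values).
positions = list(range(0, 60, 5))
lod = {
    "a": [0.1, 0.5, 1.2, 2.6, 1.0, 0.3, 0.2, 0.1, 0.4, 2.2, 0.9, 0.2],
    "b": [0.2, 0.4, 0.9, 1.5, 3.1, 1.1, 0.3, 0.2, 0.1, 0.3, 0.5, 0.2],
    "c": [0.3, 0.8, 1.9, 1.0, 0.6, 0.4, 0.2, 0.1, 0.1, 0.2, 0.3, 0.1],
}
peaks_by_trait = {t: find_peaks(positions, curve) for t, curve in lod.items()}
clusters = find_clusters(peaks_by_trait, width=10.0)
```

In the toy data, transcripts "a" and "b" peak 5 cM apart and form one cluster, while the second peak of "a" stands alone and transcript "c" never exceeds the threshold.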
[Table: Clustering of LOD score peaks. Only fragments of the table survive: the column heading "Total no. of peaks" and the entries "1 (3 × 10^-5)" and "2 (7 × 10^-4)".]
Li et al. used the average expression profile of a transcription module as a quantitative trait in linkage mapping and found that it was more powerful than using individual expression traits. Lan et al. also used PCA and hierarchical clustering seeded by relevant disease traits for dimension reduction in expression quantitative trait locus mapping, but they did not emphasize its use for a functional group. Here, we compared the performance of several dimension reduction techniques and found that PCA generally outperformed the sample average when applied to GO-defined functional groups, at least in terms of capturing and enhancing the linkage signals obtained from single-trait analysis. The motivation behind using LDA was to maximize the heritability of the derived trait, because the heritability is approximately reflected by the ratio of inter-family to intra-family variance. However, the linkage result for the LDA-derived trait was actually the worst. Its LOD scores were generally smaller than those from single-trait analysis, and for Group 5 almost all of the LOD score peaks from single-trait analysis were gone, suggesting that it is not always desirable to search for traits with high heritability.
We selected the ten groups for linkage analysis based on heritability. The correlation structure differs substantially from group to group, as evidenced in Figure 2. It is not clear for what correlation structure such a combined analysis is likely to be the most successful. On the one hand, traits that are correlated are more likely to share a common regulatory gene. On the other hand, two highly correlated traits do not provide much additional information compared with a single trait.
Although using PCA does not necessarily yield higher LOD scores than a single expression trait, different thresholds for "significant" LOD scores are necessary because of multiple testing when multiple traits are analyzed individually. A dimension reduction approach necessarily reduces the number of tests conducted. We could, for example, use PCA to screen all GO groups for functional groups that are more likely to be co-regulated by a common gene.
The way we searched for clustering of linkage peaks among related traits was just a proof-of-concept exercise. In particular, information such as the width or height of a linkage peak was not used. Our p-value calculation was based on perhaps over-simplified assumptions, such as the independence of peak locations under the null hypothesis. Much further research is needed.
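The exact p-value calculation is not reproduced here. As a rough illustration of what the independence assumption implies, the sketch below estimates by Monte Carlo how often at least the observed number of clusters would arise if every peak fell independently and uniformly along a single linear map. The uniform null, the single-map simplification, the cluster-count statistic, the function names, and all numeric settings are assumptions made for this example, not the published procedure.

```python
import random


def count_clusters(peaks_by_trait, width):
    """Count windows of the given width that hold peaks from more than one trait."""
    tagged = sorted((pos, trait)
                    for trait, ps in peaks_by_trait.items() for pos in ps)
    count, previous = 0, set()
    for start, _ in tagged:
        members = {(p, t) for p, t in tagged if start <= p <= start + width}
        # Count a window only if it adds a peak to the last counted cluster.
        if len({t for _, t in members}) > 1 and not members <= previous:
            count += 1
            previous = members
    return count


def cluster_p_value(n_peaks_by_trait, observed, map_length, width=10.0,
                    n_sim=2000, seed=0):
    """Monte Carlo p-value for seeing >= observed clusters under the assumed null.

    n_peaks_by_trait: dict, trait -> number of peaks found for that trait
    map_length:       total map length in cM (treated as one linear segment)
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        # Assumed null: peak positions are independent and uniform on the map.
        simulated = {trait: [rng.uniform(0.0, map_length) for _ in range(k)]
                     for trait, k in n_peaks_by_trait.items()}
        if count_clusters(simulated, width) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_sim + 1)


# Toy setting: four peaks across three transcripts on a 3500 cM map (made up).
n_peaks = {"a": 2, "b": 1, "c": 1}
p_one = cluster_p_value(n_peaks, observed=1, map_length=3500.0)
p_zero = cluster_p_value(n_peaks, observed=0, map_length=3500.0)
```

A simulation of this kind ignores peak width and height, just as the heuristic does, so it inherits the same limitations.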
Finally, given an overwhelmingly large set of variables (e.g., gene expression levels), how to define functional groups will likely be the most critical part of the analysis. We recognize that membership in a GO functional group does not necessarily imply that the transcripts belong to the same metabolic pathway or are regulated by common genes. Without sufficient biological knowledge or clinical outcomes, we could not address the problem of how to select traits for joint analysis and opted to simply use GO information. Solutions to this difficult problem are highly context dependent, and close collaboration between statisticians and subject-area experts is needed.
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.
- Draghici S, Khatri P, Bhavsar P, Shah A, Krawetz SA, Tainsky MA: Onto-Tools, the toolkit of the modern biologist: Onto-Express, Onto-Compare, Onto-Design and Onto-Translate. Nucleic Acids Research. 2003, 31: 3775-3781. 10.1093/nar/gkg624.
- Almasy L, Blangero J: Multipoint quantitative-trait linkage analysis in general pedigrees. Am J Hum Genet. 1998, 62: 1198-1211. 10.1086/301844.
- Rencher AC: Multivariate Statistical Inference and Applications. 1998, New York: Wiley.
- Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin – rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002, 30: 97-101. 10.1038/ng786.
- Li H, Chen H, Bao L, Manly KF, Chesler EJ, Lu L, Wang J, Zhou M, Williams RW, Cui Y: Integrative genetic analysis of transcription modules: towards filling the gap between genetic loci and inherited traits. Hum Mol Genet. 2005, 15: 481-492. 10.1093/hmg/ddi462.
- Lan H, Stoehr JP, Nadler ST, Schueler KL, Yandell BS, Attie AD: Dimension reduction for mapping mRNA abundance as quantitative traits. Genetics. 2003, 164: 1607-1614.
- Almasy L, Dyer TD, Blangero J: Bivariate quantitative trait linkage analysis: pleiotropy versus co-incident linkages. Genet Epidemiol. 1997, 14: 953-958. 10.1002/(SICI)1098-2272(1997)14:6<953::AID-GEPI65>3.0.CO;2-K.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.