Selection of gene functional groups
The 3554 transcripts were grouped according to the biological processes in which they are involved, as annotated in the gene ontology (GO), using Onto-Express [1]. On the one hand, we wanted to borrow information across many transcripts in a functional group; on the other hand, functional heterogeneity increases as a group becomes larger or less specific. To achieve a trade-off between functional specificity and group size, and to ensure that the results from different groups were comparable, we restricted our analysis to groups with approximately 10 to 20 transcripts. We calculated pairwise correlations between traits in the same group and estimated the general heritability of each of the 554 transcripts in the 40 groups using SOLAR [2]. The 10 groups with the highest average heritability estimates were selected for subsequent linkage analysis because they were more likely to contain interesting linkage signals.
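The group-selection step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `select_groups`, the dictionary inputs, and the hard size bounds of 10 and 20 (the text says "approximately") are all assumptions made for the example, and the heritability values are taken as already estimated elsewhere (e.g., by SOLAR).

```python
from statistics import mean

def select_groups(groups, heritability, min_size=10, max_size=20, top_n=10):
    """Pick the functional groups with the highest average heritability.

    groups:       dict mapping a group name to a list of transcript ids
    heritability: dict mapping a transcript id to its heritability estimate
    The size bounds are hypothetical hard cut-offs used for illustration.
    """
    # Keep only groups whose size falls inside the target range.
    eligible = {g: ids for g, ids in groups.items()
                if min_size <= len(ids) <= max_size}
    # Average the per-transcript heritability estimates within each group.
    avg_h2 = {g: mean(heritability[t] for t in ids)
              for g, ids in eligible.items()}
    # Rank groups by decreasing average heritability and keep the top ones.
    ranked = sorted(avg_h2, key=avg_h2.get, reverse=True)
    return ranked[:top_n]
```

Ranking on the group average rather than on the single most heritable transcript matches the stated goal of borrowing information across all transcripts in a group.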
Linkage analysis with composite trait
All traits were standardized to have a sample mean of 0 and a sample variance of 1. We used three approaches to produce a univariate summary of the expression traits of the individual genes in each group: a sample average, principal-components analysis (PCA), and linear discriminant analysis (LDA). All three approaches derive one or more univariate "composite" traits as linear combinations of the individual traits. The components in PCA are orthogonal linear combinations of the original data and are ordered by decreasing sample variance [3]. In particular, the first component explains the largest proportion of the sample variation. PCA was carried out using the function prcomp in R. PCA does not take into account the fact that the subjects came from several distinct families. As an alternative, we sought a linear combination of the original data that maximized the ratio of between-family variance to within-family variance, for which we used LDA with the family as the class label. LDA was carried out using the function lda in R.
Multipoint variance-component LOD scores for each transcript and each composite trait were calculated using Merlin [4].
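To make the construction of composite traits concrete, the sketch below standardizes each trait and then forms two of the three summaries: the sample average and the first principal component. It is a dependency-free illustration only. The analysis described here used prcomp in R, whereas this example substitutes plain power iteration on the correlation matrix; the function names are hypothetical, and LDA is not shown.

```python
import math

def standardize(values):
    """Rescale one trait to sample mean 0 and sample variance 1."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    return [(v - m) / sd for v in values]

def average_composite(traits):
    """Composite trait = per-subject mean of the standardized traits.

    traits: list of per-transcript lists, each with one value per subject.
    """
    z = [standardize(t) for t in traits]
    n = len(z[0])
    return [sum(col[i] for col in z) / len(z) for i in range(n)]

def first_pc_composite(traits, iters=500):
    """First principal component via power iteration (an assumption of
    this sketch; it is not how prcomp computes components).

    Returns (weights, scores). Assumes the starting vector is not
    orthogonal to the leading eigenvector.
    """
    z = [standardize(t) for t in traits]
    p, n = len(z), len(z[0])
    # Sample covariance of standardized traits, i.e. the correlation matrix.
    cov = [[sum(z[a][i] * z[b][i] for i in range(n)) / (n - 1)
            for b in range(p)] for a in range(p)]
    w = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w_new = [sum(cov[a][b] * w[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w_new))
        w = [x / norm for x in w_new]
    # Project each subject onto the leading direction.
    scores = [sum(w[a] * z[a][i] for a in range(p)) for i in range(n)]
    return w, scores
```

The sample average is the special case in which every trait receives the same weight; the first principal component instead chooses the unit-length weight vector that maximizes the sample variance of the composite.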
Combining linkage results from multiple traits
In linkage studies in which multiple related traits (such as obesity, diabetes, and hypertension) are analyzed, it is often of interest to see whether several of the traits have linkage signals around a common region; this is often done by simply inspecting plots of the LOD scores along a chromosome. We developed a heuristic algorithm for identifying clusters of linkage peaks. 1) Linkage "peaks" were defined as LOD scores greater than a particular threshold C (e.g., 2). The threshold was set relatively high so that the chance of a type I error was low. 2) Peak locations were defined as the positions of the local maxima of the LOD score. 3) Using a sliding window of width W (e.g., 10 cM), we defined a "cluster" as a window inside which more than one distinct gene had one or more peaks.
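The three steps of the heuristic can be sketched as below. This is one possible reading rather than the authors' implementation: the text does not say how the window is moved, so the example assumes that a candidate window starts at each observed peak, and the function names and input layout are hypothetical.

```python
def find_peaks(positions, lods, threshold=2.0):
    """Steps 1 and 2: positions of local LOD maxima above the threshold C.

    positions: map positions in cM, in increasing order
    lods:      LOD score at each position
    """
    peaks = []
    n = len(lods)
    for i in range(n):
        left = lods[i - 1] if i > 0 else float("-inf")
        right = lods[i + 1] if i < n - 1 else float("-inf")
        # ">=" on the left and ">" on the right counts a plateau once.
        if lods[i] > threshold and lods[i] >= left and lods[i] > right:
            peaks.append(positions[i])
    return peaks

def find_clusters(peaks_by_gene, window=10.0):
    """Step 3: windows of width W holding peaks from more than one gene.

    peaks_by_gene: dict mapping a gene name to its list of peak positions.
    Assumption: each window starts at an observed peak, so overlapping
    windows may be reported separately.
    """
    events = sorted((pos, gene)
                    for gene, ps in peaks_by_gene.items() for pos in ps)
    clusters = []
    for i, (start, _) in enumerate(events):
        # Genes with a peak inside [start, start + window].
        genes = {g for p, g in events[i:] if p - start <= window}
        if len(genes) > 1:
            clusters.append((start, start + window, sorted(genes)))
    return clusters
```

Counting distinct genes rather than peaks means that several peaks from the same gene within one window do not form a cluster by themselves.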
To assess whether a cluster was due to chance, we calculated a simple p-value, assuming that linkage peaks were independent and uniformly distributed along a chromosome, conditional on the observed number of peaks for each gene. Let $L$ denote the total chromosome length. For gene $k$, conditional on observing $n_k$ peaks, the probability of observing at least one peak in a window is $p_k = 1 - (1 - W/L)^{n_k}$. We calculated the average probability as

$$\bar{p} = \frac{1}{K} \sum_{k=1}^{K} p_k.$$

For the total $K$ transcripts in a group, the probability that at least $K_0$ of them have peaks in a window is

$$P = \sum_{j=K_0}^{K} \binom{K}{j} \bar{p}^{\,j} (1 - \bar{p})^{K-j}.$$
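A small numerical sketch of this p-value follows. It rests on two assumptions that should be read as such: the per-gene probability is taken to be 1 - (1 - W/L)^n_k, which follows from independent, uniformly placed peaks, and the group-level probability is taken to be a binomial tail with the average probability as the success rate. The function names are hypothetical.

```python
from math import comb

def window_peak_prob(n_peaks, window, chrom_length):
    """Probability that a gene with n_peaks uniform peaks has at least
    one of them inside a window of width W on a chromosome of length L."""
    return 1.0 - (1.0 - window / chrom_length) ** n_peaks

def cluster_p_value(peak_counts, k0, window, chrom_length):
    """Probability that at least k0 of the K genes have a peak in a window.

    peak_counts: observed number of peaks n_k for each of the K genes.
    Assumption: binomial tail evaluated at the average per-gene probability.
    """
    K = len(peak_counts)
    p_bar = sum(window_peak_prob(n, window, chrom_length)
                for n in peak_counts) / K
    return sum(comb(K, j) * p_bar ** j * (1.0 - p_bar) ** (K - j)
               for j in range(k0, K + 1))
```

For example, with two genes that each have one peak, W = 10 cM and L = 100 cM, each gene has probability 0.1 of a peak in a given window, so the probability that both do is 0.01 under these assumptions.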