Tight clustering, first proposed by Tseng and Wong, is a method that produces tight and stable clusters without forcing all points into clusters. Tseng and Wong used a resampling-based algorithm. Briefly, in each resample, the K-means clustering algorithm is used to group the sampled genes into clusters. The similarity between two genes is defined as the number of resamples in which the two genes fall in the same cluster, and this similarity measure is then used to define a tight cluster. Here, to find a cluster of highly correlated genes, we propose a new tight clustering algorithm that is computationally simpler than Tseng and Wong's algorithm.
Suppose there are n individuals and each individual has m traits (expressions of m genes). Let $x_{ij}$ denote the trait value of the jth trait of the ith individual, and $x_j = (x_{1j}, \ldots, x_{nj})$ denote the values of the jth trait of all individuals. A set of traits is said to be a tight cluster if the average correlation coefficient between traits in this set is larger than a given threshold α, and the number of traits in this set is between m1 and m2. Briefly, for given parameters α0, m1 and m2, our "tight clustering" algorithm involves the following steps: 1) for a given α (≥ α0), find the largest tight cluster. If a tight cluster for the given α cannot be found, reduce the value of α. If a tight cluster still cannot be found when α = α0, stop the algorithm. 2) Remove the traits in the tight cluster from the data set and repeat Step 1.
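The tight-cluster definition above can be written as a short check. The following is a minimal sketch, not the authors' code: it assumes the data are held in an n × m NumPy array `X` (rows are individuals, columns are traits), that "correlation" means the signed Pearson correlation, and that the size bounds m1 and m2 are inclusive. The function names are hypothetical.

```python
import numpy as np

def average_pairwise_correlation(X):
    """Mean of the off-diagonal Pearson correlations between the columns of X."""
    k = X.shape[1]
    R = np.corrcoef(X, rowvar=False)  # k x k correlation matrix of the traits
    # Remove the diagonal (all ones) and average over the k*(k-1) ordered pairs.
    return (R.sum() - np.trace(R)) / (k * (k - 1))

def is_tight_cluster(X, traits, alpha, m1, m2):
    """Check the definition: size in [m1, m2] (assumed inclusive) and
    average pairwise correlation strictly larger than alpha."""
    traits = list(traits)
    if len(traits) < 2 or not (m1 <= len(traits) <= m2):
        return False
    return average_pairwise_correlation(X[:, traits]) > alpha
```

For example, `is_tight_cluster(X, range(5), alpha=0.8, m1=2, m2=10)` asks whether the first five traits form a tight cluster at threshold 0.8.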
For a given data set and the values of the parameters α0, m1 and m2, it is a computational challenge to search the whole state space to find the largest tight cluster. We propose to use a backward elimination approach to find an approximate solution. Let $\rho_j$ denote the average correlation coefficient between the jth trait and all the other traits. The backward elimination has the following three steps: 1) rank the traits by $\rho_j$ from the largest to the smallest and delete a proportion (5% in our implementation) of traits with the smallest values of $\rho_j$. 2) Recalculate $\rho_j$ in the new data set (after 5% of the traits have been deleted) and repeat Step 1. 3) Repeat Steps 1 and 2 until the average correlation coefficient among the remaining traits is larger than α and the number of traits is between m1 and m2.
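The backward elimination and the outer loop over α can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: correlations are signed Pearson correlations, at least one trait is dropped per round, the search at a given α is abandoned once fewer than m1 traits remain, no trait is constant, and after a cluster is removed the search continues at the current α rather than restarting from the largest value (the text leaves this open). The names `backward_elimination`, `tight_clustering`, and `drop_frac` are hypothetical.

```python
import numpy as np

def backward_elimination(X, alpha, m1, m2, drop_frac=0.05):
    """Return column indices of an approximate largest tight cluster, or None."""
    active = np.arange(X.shape[1])
    while active.size >= max(m1, 2):
        k = active.size
        R = np.corrcoef(X[:, active], rowvar=False)
        # rho[j]: average correlation of trait j with all other active traits.
        rho = (R.sum(axis=1) - np.diag(R)) / (k - 1)
        # The mean of rho equals the average pairwise correlation of the set.
        if rho.mean() > alpha and k <= m2:
            return active
        # Drop the traits with the smallest rho (assumed: at least one).
        n_drop = max(1, int(drop_frac * k))
        keep = np.sort(np.argsort(rho)[n_drop:])
        active = active[keep]
    return None  # fewer than m1 traits left: no tight cluster at this alpha

def tight_clustering(X, alphas, m1, m2, drop_frac=0.05):
    """Outer loop: alphas is a decreasing sequence ending at alpha0."""
    remaining = np.arange(X.shape[1])
    clusters = []
    i = 0
    while i < len(alphas) and remaining.size >= m1:
        found = backward_elimination(X[:, remaining], alphas[i], m1, m2, drop_frac)
        if found is None:
            i += 1  # no tight cluster at this alpha: reduce the threshold
        else:
            clusters.append((alphas[i], remaining[found]))
            remaining = np.delete(remaining, found)  # remove the cluster, repeat
    return clusters
```

Because the elimination starts from all remaining traits and stops the first time both conditions hold, the returned set is the largest tight cluster along the elimination path, which need not be the largest tight cluster overall.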
In our implementation, we used m1 = 20, m2 = 200, and α0 = 0.5 (see Discussion section for the effect of different parameter values). We first set α = 0.8 and then reduced α to 0.7, 0.6, and 0.5 in turn. We then computed the first PC of each cluster, which, because the traits within a cluster are highly correlated, should explain a large part of the total variation.
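Extracting the first PC of a cluster can be illustrated with a singular value decomposition. This is a sketch only: it assumes each trait is standardized to unit variance before the decomposition (the text does not say whether the covariance or the correlation matrix was used), and the function name is hypothetical.

```python
import numpy as np

def first_principal_component(X_cluster):
    """Return (scores, explained) for the first PC of the standardized traits.

    X_cluster: n x k array holding the k traits of one cluster.
    scores:    length-n vector, one PC value per individual.
    explained: fraction of the total variance captured by the first PC.
    """
    # Assumption: standardize each trait (PCA on the correlation matrix).
    Z = (X_cluster - X_cluster.mean(axis=0)) / X_cluster.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, 0] * s[0]
    explained = s[0] ** 2 / np.sum(s ** 2)
    return scores, explained
```

As a rough guide, if all k traits in a cluster shared a common pairwise correlation ρ > 0, the leading eigenvalue of their correlation matrix would be 1 + (k - 1)ρ, so the first PC would explain (1 + (k - 1)ρ)/k of the variance; for k = 20 and ρ = 0.8 this is 0.81.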
Linkage genome scan
We used the program Merlin-regress to perform the linkage genome scan for each quantitative trait. Merlin-regress implements a method proposed by Sham et al., which is based on a regression of estimated identity-by-descent (IBD) sharing between relative pairs on the squared sums and squared differences of trait values of the relative pairs. This program allows quick computation and has power similar to that of variance-component approaches.
Merlin-regress produces a p-value for each trait-marker pair. A linkage signal is declared if the p-value is less than a cut-off value. We propose to obtain the cut-off for individual p-values by controlling the false-discovery rate (FDR). Suppose we have M trait-marker pairs and denote the ordered p-values by p(1) ≤ ... ≤ p(M). To control the FDR at 5%, the cut-off for the individual p-values is chosen as described by Benjamini and Yekutieli: