A method dealing with a large number of correlated traits in a linkage genome scan

We propose a method to perform linkage genome scans for many correlated traits in the Genetic Analysis Workshop 15 (GAW15) data. The proposed method has two steps: first, we use a clustering method to find tight clusters of traits and use the first principal component (PC) of the traits in each cluster to represent that cluster; second, we perform a linkage scan for each cluster using its representative trait. The results of applying the method to the GAW15 Problem 1 data indicate that most of the traits in the same cluster have the same regulators, and that the representative trait measure, the first PC, can explain a large part of the total variation of all the traits in each cluster. Furthermore, considering one cluster of traits at a time may yield more linkage signals than considering traits individually.


Background
From yeast to humans, the expression levels of many genes show familial aggregation and a simple segregation pattern [1][2][3], suggesting an inherited contribution. Morley et al. [3] used microarrays to measure the baseline expression levels of genes in immortalized B cells from members of 14 Centre d'Etude du Polymorphisme Humain (CEPH) pedigrees. They mapped the quantitative trait loci (QTL) of many quantitative traits (gene expression phenotypes) to chromosomal locations using genome scans. The data set of Morley et al. was made available to us by Genetic Analysis Workshop 15 (GAW15). This data set provides genome scan data for 3554 quantitative traits, that is, the expression phenotypes of 3554 genes.
Morley et al. carried out a linkage genome scan for each of the 3554 traits individually and found that 142 and 984 of the 3554 traits have at least one significant linkage signal when the point-wise p-value thresholds 4.3 × 10⁻⁷ and 3.7 × 10⁻⁵ were used, respectively. Considering the correlated nature of the 3554 traits, we propose a method to find clusters of genes whose expression levels are highly correlated and to localize the genetic determinants of each cluster. The proposed method has two steps: 1) use a clustering method to find tight clusters of the traits, and in each cluster, use the first principal component (PC) of the traits to represent the cluster; and 2) carry out a linkage scan using the first PC trait of each cluster and take the linkage evidence for the first PC trait as the linkage evidence for all the traits in the cluster. Upon applying the method to the GAW15 Problem 1 data, we found that: 1) the traits in the same cluster have linkage signals in the same chromosomal regions; 2) the first PC can explain more than 53% of the total variation of all the traits in each cluster, which indicates that it is reasonable to represent a tight cluster of traits by its first PC; and 3) considering clusters of highly correlated traits may show more linkage signals than considering the traits individually.

Tight clustering
Tight clustering, first proposed by Tseng and Wong [4], is a method that produces tight and stable clusters without forcing all points into clusters. Tseng and Wong used a resampling-based algorithm. Briefly, in each resample, the K-means clustering algorithm is used to group the sampled genes into clusters. A similarity measure between two genes is defined as the number of resamples in which the two genes fall in the same cluster. This similarity measure is then used to define a tight cluster. Here, to find clusters of highly correlated genes, we propose a new tight clustering algorithm that is computationally simpler than Tseng and Wong's algorithm.
Suppose there are n individuals and each individual has m traits (expressions of m genes). Let x_ij denote the value of the jth trait of the ith individual, and let x_j = (x_1j, ..., x_nj) denote the values of the jth trait for all individuals. A set of traits is said to be a tight cluster if the average correlation coefficient between traits in the set is larger than a given threshold α and the number of traits in the set is between m_1 and m_2. Briefly, for given parameters α_0, m_1, and m_2, our "tight clustering" algorithm involves the following steps: 1) for a given α (≥ α_0), find the largest tight cluster. If a tight cluster cannot be found for the given α, reduce the value of α. If a tight cluster still cannot be found when α = α_0, stop the algorithm. 2) Remove the traits in the tight cluster from the data set and repeat Step 1.
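The tight-cluster criterion defined above can be written down directly. The following is a minimal sketch, not the authors' code: the function names are hypothetical, the correlation is assumed to be the Pearson coefficient, and "between m_1 and m_2" is assumed to be inclusive at both ends.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length trait vectors.

    Assumes neither vector is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_correlation(traits):
    """Average correlation coefficient over all pairs of traits in the set."""
    pairs = list(combinations(traits, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


def is_tight_cluster(traits, alpha, m1, m2):
    """True if the set size lies in [m1, m2] (inclusive bounds assumed)
    and the average pairwise correlation is larger than alpha."""
    return m1 <= len(traits) <= m2 and average_correlation(traits) > alpha
```

Here each element of `traits` is one vector x_j holding the values of a single trait across the n individuals.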
For a given data set and given values of the parameters α_0, m_1, and m_2, searching the whole state space for the largest tight cluster is computationally challenging. We propose a backward elimination approach to find an approximate solution. Let ρ_j denote the average correlation coefficient between the jth trait and all the other traits. The backward elimination has three steps.
In our implementation, we used m_1 = 20, m_2 = 200, and α_0 = 0.5 (see the Discussion section for the effect of different parameter values). We first chose α = 0.8 and then reduced α to 0.7, 0.6, and 0.5 in turn. We then computed the first PC of each cluster, which should explain a large part of the total variation because the traits within a cluster are highly correlated.
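The individual steps of the backward elimination are not spelled out in the text, so the sketch below shows only one plausible form of such a search. It is an assumption rather than the authors' procedure: starting from all remaining traits, it repeatedly drops the trait with the smallest ρ_j until the average correlation exceeds α and the set has at most m_2 traits, and it gives up once fewer than m_1 traits remain. The function names, the decision not to reset α after a cluster is found, and the input format (a precomputed correlation matrix as a list of lists) are all hypothetical.

```python
def backward_elimination(corr, alpha, m1, m2):
    """Greedy search for one tight cluster (assumed form, see lead-in).

    corr is a symmetric correlation matrix given as a list of lists.
    Returns a list of trait indices, or None if no tight cluster is found.
    """
    active = list(range(len(corr)))
    # row_sum[j]: sum of correlations between trait j and the other active
    # traits; dividing by (size - 1) would give rho_j.
    row_sum = {j: sum(corr[j][k] for k in active if k != j) for j in active}
    while len(active) >= max(m1, 2):
        size = len(active)
        average = sum(row_sum.values()) / (size * (size - 1))
        if average > alpha and size <= m2:
            return active
        # All rho_j share the same denominator, so comparing sums suffices.
        worst = min(active, key=lambda j: row_sum[j])
        active.remove(worst)
        del row_sum[worst]
        for j in active:
            row_sum[j] -= corr[j][worst]
    return None


def find_tight_clusters(corr, alphas=(0.8, 0.7, 0.6, 0.5), m1=20, m2=200):
    """Extract clusters for a decreasing sequence of thresholds.

    Found clusters are removed before searching again. Assumption: alpha is
    not reset to its largest value after a cluster has been removed.
    """
    remaining = list(range(len(corr)))
    clusters = []
    for alpha in alphas:
        while True:
            sub = [[corr[i][j] for j in remaining] for i in remaining]
            found = backward_elimination(sub, alpha, m1, m2)
            if found is None:
                break
            cluster = [remaining[i] for i in found]
            clusters.append((alpha, cluster))
            members = set(cluster)
            remaining = [i for i in remaining if i not in members]
    return clusters
```

The default arguments mirror the parameter values reported above (m_1 = 20, m_2 = 200, α from 0.8 down to α_0 = 0.5). A greedy search of this kind returns an approximate solution and is not guaranteed to find the largest tight cluster.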

Linkage genome scan
We used the program Merlin-regress [5] to perform the linkage genome scan for each quantitative trait. Merlin-regress implements a method proposed by Sham et al. [6], which is based on a regression of the estimated identity-by-descent (IBD) sharing between relative pairs on the squared sums and squared differences of the trait values of the relative pairs. This program allows quick computation and has power similar to that of variance-component approaches.
Merlin-regress produces a p-value for each trait-marker pair. A linkage signal is declared if the p-value is less than a cut-off value. We propose to obtain the cut-off value for individual p-values by controlling the false-discovery rate (FDR). Suppose we have M trait-marker pairs and denote the ordered p-values by p_(1), ..., p_(M). To control the FDR at 5%, the cut-off value for the individual p-values is determined by the procedure of Benjamini and Yekutieli [7].
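For illustration, one standard form of the step-up procedure from [7], valid under arbitrary dependence among the tests, takes as cut-off the largest ordered p-value p_(i) satisfying p_(i) ≤ i q / (M Σ 1/j), where the sum runs over j = 1, ..., M. The sketch below assumes this form; whether the authors used it or a variant without the harmonic-sum factor cannot be recovered from the text, and the function name is hypothetical.

```python
def by_cutoff(pvalues, q=0.05):
    """Cut-off p-value under the Benjamini-Yekutieli step-up rule.

    Returns the largest ordered p-value p_(i) with
    p_(i) <= i * q / (M * sum_{j=1..M} 1/j), or None if none qualifies.
    Assumed form of the procedure; see the lead-in.
    """
    m = len(pvalues)
    harmonic = sum(1.0 / j for j in range(1, m + 1))
    cutoff = None
    for i, p in enumerate(sorted(pvalues), start=1):
        if p <= i * q / (m * harmonic):
            cutoff = p  # keep the largest qualifying ordered p-value
    return cutoff
```

Under this rule, every trait-marker pair with a p-value at or below the returned cut-off is declared a linkage signal; a return value of `None` means no signal is declared.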

Results
We applied the proposed method to the GAW15 Problem 1 data set. The data set contained members of 14 CEPH Utah families. Each individual had 3554 traits and genotypes at 2882 SNPs across the genome. After removing SNPs whose genotypes showed Mendelian inconsistencies, 2761 SNPs remained for further analysis.
Applying the tight clustering algorithm to the 3554 traits, we found 18 clusters with an average correlation coefficient larger than 0.5. These 18 clusters contained 734 traits, ~20% of all traits. The details of the 18 clusters are given in Table 1. We obtained one cluster with an average correlation of ~0.8, four clusters with ~0.7, five clusters with ~0.6, and eight clusters with ~0.5. We then applied PC analysis to each cluster. The eigenvalues of clusters 1 and 10 are shown in Figure 1: the first eigenvalue is much larger than the other eigenvalues. The eigenvalues of the other clusters show a similar pattern. The ratio of the first eigenvalue to the sum of all eigenvalues is approximately equal to the average correlation coefficient of the cluster and is more than 50% in all clusters. That is, the first PC explains more than 50% of the total variation in every cluster.
Following Morley et al. [3], we consider the regions with linkage signals (p-value less than the cut-off value) to be regulators. The number of regulators of each of the 18 PC traits and the number of traits with at least one regulator in each cluster are summarized in Table 2. Table 2 shows that 6 out of 18 (33%) PC traits have at least one regulator, while 16 out of 743 (2.2%) individual traits have at least one regulator. We conclude that considering one cluster at a time may show more linkage signals than considering traits individually.
Figure 2. The top panel shows the p-value curve for the first PC trait of the cluster.
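The observation in the Results that the first eigenvalue's share of the total is approximately equal to the average correlation coefficient can be checked in an idealized case. As a sketch, assume that all m standardized traits in a cluster share the same pairwise correlation ρ ≥ 0. This is a simplification for illustration, not a property of the GAW15 clusters, and the symbols R, I_m, 1, and λ_k are introduced here only for this purpose.

```latex
\begin{align*}
R &= (1-\rho)\,I_m + \rho\,\mathbf{1}\mathbf{1}^{\top},
  \qquad \lambda_1 = 1 + (m-1)\rho,
  \qquad \lambda_2 = \dots = \lambda_m = 1-\rho, \\
\frac{\lambda_1}{\sum_{k=1}^{m}\lambda_k}
  &= \frac{1+(m-1)\rho}{m}
   = \rho + \frac{1-\rho}{m}.
\end{align*}
```

With m = 20 and ρ = 0.5, the ratio is 0.5 + 0.5/20 = 0.525, so under this assumption the first PC's share slightly exceeds ρ and approaches ρ as m grows.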

Discussion
To analyze the large number of correlated traits in the GAW15 Problem 1 data, we developed a method that first groups the traits into tight clusters and then finds a representative measure for each cluster. We then performed a linkage scan for each representative trait. The results of analyzing the GAW15 Problem 1 data set indicate that the traits in the same cluster have the same regulators, and that the first PC of a cluster is a good representative measure of that cluster. The results also show that performing a genome scan for one cluster at a time may show more linkage signals than performing genome scans of individual traits.
One remaining problem is how to choose the values of the parameters α_0, m_1, and m_2 in the tight clustering algorithm. We applied the algorithm to the GAW15 data with different parameter values. Our results indicate that varying m_2 from 100 to 1000 has almost no effect on the cluster assignment or the linkage analyses. When m_1 is small, say 2, there are more small clusters and more linkage signals. However, because two biologically unrelated genes may have highly correlated traits by chance, a small m_1 may lead the tight clustering algorithm to group biologically unrelated genes into one cluster. We suggest choosing m_1 between 10 and 20, and m_2 between 100 and 400. Our independent simulation studies indicate that if α_0 is too small, the FDR cannot be controlled and the first PC trait is not a good representative of the cluster, because the first PC can explain only a small part of the total variation. Thus, we suggest using α_0 ≥ 0.5.

Competing interests
The author(s) declare that they have no competing interests.