Data processing
For the association and linkage scans, we studied the 2819 autosomal SNPs and the 18 transcripts listed in Figure 1. Of these SNPs, 86% have fewer than 10% missing genotypes and only 1.3% have more than 20% missing genotypes. Missing genotypes were imputed using fastPHASE [9]. For SNPs in weak linkage disequilibrium (LD) with one another, the program is more likely to impute the most common genotype, which may reduce the efficiency of our approach.
Association scans
The qGTD (quantitative genotype-trait distortion) statistic was proposed for quantitative traits of unrelated individuals [10] as an extension of previously studied multilocus association scores for dichotomous phenotypes [6–8]. qGTD is defined on the ranks of the trait values of $n$ individuals, i.e., $\{R_1, \ldots, R_n\}$. Given a set of $k$ SNPs, there are $3^k$ possible multilocus genotypes. The association content of this set of SNPs with the quantitative phenotype is then measured by
$$\mathrm{qGTD} = \sum_{i=1}^{3^k} \left( S_i - \frac{n_i (n + 1)}{2} \right)^2,$$
where $S_i$ is the trait's rank sum over the $n_i$ individuals with genotype $i$, and $n_i (n + 1)/2$ is the expected value of $S_i$ under the null hypothesis that these SNPs are not associated with the phenotype.
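As an illustration, the score can be computed directly from ranks and multilocus genotype groups. The sketch below assumes that the statistic is the sum, over observed multilocus genotypes, of the squared deviation of each rank sum from its null expectation $n_i (n + 1)/2$; unobserved genotypes contribute zero. The function names, the 0/1/2 genotype coding, and the use of average ranks for tied trait values are illustrative assumptions, not details taken from [10].

```python
from collections import defaultdict


def average_ranks(values):
    """Rank values from 1..n; tied values share their average rank (an assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # mean of the 1-based positions start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg
        start = end + 1
    return ranks


def qgtd(genotypes, trait, snps):
    """Sum over observed multilocus genotypes of (S_i - n_i (n + 1) / 2) ** 2.

    genotypes: one list per individual, holding a genotype code per SNP
    trait:     one quantitative trait value per individual
    snps:      indices of the SNPs that form the multilocus genotype
    """
    n = len(trait)
    ranks = average_ranks(trait)
    rank_sum = defaultdict(float)  # S_i
    count = defaultdict(int)       # n_i
    for person, rank in zip(genotypes, ranks):
        key = tuple(person[s] for s in snps)
        rank_sum[key] += rank
        count[key] += 1
    return sum((rank_sum[g] - count[g] * (n + 1) / 2) ** 2 for g in rank_sum)


# Hypothetical toy data: 4 individuals, 2 SNPs coded 0/1/2.
toy_genotypes = [[0, 1], [0, 1], [2, 1], [2, 0]]
toy_trait = [1.0, 2.0, 3.0, 4.0]
print(qgtd(toy_genotypes, toy_trait, [0]))     # 8.0
print(qgtd(toy_genotypes, toy_trait, [0, 1]))  # 6.5
```

In this toy example, adding the second SNP splits one genotype group without separating the trait ranks any further, and the score drops from 8.0 to 6.5.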
qGTD captures the differences between the observed rank sums and those expected under the null hypothesis. The magnitude of the qGTD score reflects the level of association with the phenotype: the greater the value, the stronger the association [10]. Unassociated SNPs add dimensions to the multilocus genotypes and lower the value of qGTD. Therefore, a greedy screening algorithm is used to screen out SNPs that do not contribute to increasing the value of qGTD and to retain a cluster of SNPs that contribute important information to the score. As discussed in [6–8], such screening is not informative when applied to a large number of SNPs simultaneously, because of sparseness in high dimensions. A random subspace strategy is therefore employed, in which the greedy algorithm is repeated on a large number of random SNP subsets. SNPs are then ranked by the number of times they are retained by the screening algorithm (their return frequencies), which measures the overall importance of individual SNPs.
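The screening and random subspace steps can be sketched as follows. The text does not spell out the greedy rule, so this sketch assumes backward elimination: at each step, the SNP whose removal raises the score the most is dropped, and screening stops when no single removal increases the score. The names `greedy_screen` and `return_frequencies`, and the generic `score` callable (which would be the qGTD score of a SNP set), are hypothetical.

```python
import random
from collections import Counter


def greedy_screen(snps, score):
    """Backward elimination (assumed rule): drop the SNP whose removal helps most."""
    kept = list(snps)
    best = score(kept)
    while len(kept) > 1:
        trials = [(score([s for s in kept if s != drop]), drop) for drop in kept]
        new_best, drop = max(trials)
        if new_best <= best:
            break  # no single removal increases the score
        kept.remove(drop)
        best = new_best
    return kept, best


def return_frequencies(all_snps, score, subset_size, n_subsets, seed=0):
    """Repeat the screening on random SNP subsets and count retained SNPs."""
    rng = random.Random(seed)
    counts = Counter()  # return frequency of each SNP
    clusters = []       # (score, retained SNP set) for every subset
    for _ in range(n_subsets):
        subset = rng.sample(all_snps, subset_size)
        kept, value = greedy_screen(subset, score)
        counts.update(kept)
        clusters.append((value, tuple(sorted(kept))))
    return counts, clusters
```

Passing the score as a callable keeps the screening logic independent of how qGTD is computed. Calling `counts.most_common(30)` on the returned counter gives the 30 SNPs with the highest return frequencies.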
To evaluate the importance of the SNPs in gene × gene interactions, we further filtered the SNP clusters retained by the qGTD screening according to their qGTD scores and selected only the 1000 distinct clusters with the highest qGTD values. Using these 1000 clusters, we computed the qGTD return frequency of each SNP. As discussed above, a higher qGTD value indicates stronger joint effects of the SNPs on the quantitative phenotype. SNPs that appear more frequently in clusters with higher qGTD values play a more critical role in the gene × gene interactions that determine the variation of the phenotype.
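A minimal sketch of this filtering step follows. It assumes that two clusters count as the same when they contain the same set of SNPs and that each retained cluster is stored as a (qGTD value, SNP tuple) pair. The function name and the data layout are hypothetical.

```python
from collections import Counter


def qgtd_return_frequencies(clusters, top=1000):
    """Count how often each SNP appears among the `top` highest-scoring clusters.

    clusters: iterable of (qgtd_value, snp_tuple) pairs from the screening runs
    """
    best = {}
    for value, snps in clusters:
        key = tuple(sorted(snps))  # same SNP set -> same cluster (assumption)
        if key not in best or value > best[key]:
            best[key] = value
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:top]
    counts = Counter()
    for snps, _ in ranked:
        counts.update(snps)
    return counts, ranked


# Hypothetical screening output: (qGTD value, retained SNP ids).
demo = [(9.0, (1, 2)), (9.0, (2, 1)), (7.0, (2, 3)), (1.0, (4,))]
freq, kept = qgtd_return_frequencies(demo, top=2)
print(freq.most_common())
```

Here SNP 2 appears in both of the two top clusters, so it receives the highest qGTD return frequency, while SNP 4 is excluded because its cluster falls below the cutoff.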
In this paper, we apply the above association scan (repeated on 5 million random subsets) to the 56 unrelated grandparents in the 14 CEPH families. For each selected expression trait, we selected the 30 overall important SNPs with the highest return frequencies and the 30 important interaction SNPs with the highest qGTD return frequencies, which gives a number of identified loci comparable to that from the linkage scans.
Linkage scans
Linkage analysis was performed on all 194 members of the 14 CEPH families using the pedigree analysis package MERLIN [11]. The command pedwipe was first used to remove unlikely genotypes from the pedigree data. The regression-based linkage analysis for quantitative traits proposed by Sham et al. [12] was then applied to all 18 expression traits with estimated mean, variance, and heritability. The original data contain only a physical map; in our analysis, we used the linkage map provided by Sung et al. [13].
Clustering of transcripts based on identified regulatory loci
To summarize the inter-regulatory relationships between the transcripts revealed by the association scans, hierarchical clustering with average linkage [14] was conducted based on overall return frequencies, qGTD return frequencies, and common pairs of interacting loci. The dissimilarity measure for return frequencies (overall or qGTD) was one minus the correlation coefficient between two transcripts. For each transcript, we recorded the jointly returned SNPs in the 1000 qGTD-filtered clusters (see the section on association scans) and counted the number of times that SNPs belonging to a pair of loci (with loci defined as 5 cM bins on the genome) were returned in the same cluster. The dissimilarity based on these interacting loci pairs is
where $m_{ij}$ is the number of shared interacting loci pairs between transcripts $i$ and $j$, $m_i$ and $m_j$ are the total numbers of interacting loci pairs for $i$ and $j$, respectively, and $m$ is the total number of loci pairs on the genome. We also clustered the transcripts based on their gene expression values, with the dissimilarity being one minus the correlation.
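For the correlation-based dissimilarities, the clustering can be sketched in plain Python as below. The sketch assumes the Pearson correlation (the text says only "correlation coefficient") and takes average linkage to mean the mean pairwise dissimilarity between the members of two clusters. The function names and the toy profiles are hypothetical, and profiles with zero variance are not handled.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumed correlation type)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_linkage(profiles):
    """Agglomerative clustering with dissimilarity 1 - correlation.

    profiles: dict mapping a transcript name to its profile over the same SNPs
    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    names = list(profiles)
    d = {frozenset((a, b)): 1 - pearson(profiles[a], profiles[b])
         for a, b in combinations(names, 2)}

    def dist(c1, c2):
        # Average linkage: mean dissimilarity over all cross-cluster pairs.
        return sum(d[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: dist(*p))
        merges.append((c1, c2, dist(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical return-frequency profiles for three transcripts over four SNPs.
toy_profiles = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 9], "C": [4, 3, 2, 1]}
for left, right, distance in average_linkage(toy_profiles):
    print(left, right, round(distance, 3))
```

Transcripts A and B have nearly proportional profiles and merge first. Transcript C is negatively correlated with both and joins last, at a dissimilarity close to 2.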