Data
We used the Problem 1 data of GAW15 as described by Mosley et al. [2]. These data consisted of 196 individuals from 14 CEPH (Centre d'Etude du Polymorphisme Humain) Utah families with seven to eight offspring per family. For each individual, we have the expression levels of 3554 transcripts (the phenotypes) and 2882 autosomal and X-linked single-nucleotide polymorphisms (SNPs) (the genotypes). See the GAW15 data description for more information [4].
Pre-processing
Mendelian errors and double-recombinants were blanked according to mistyping probabilities estimated by SIMWALK2 [5] using its default options. We blanked 119 genotypes as Mendelian errors and 570 genotypes as double-recombinants. Genetic map positions were obtained using the SNP Mapping web application developed at the University College Dublin Conway Institute of Biomolecular and Biomedical Research, located at http://actin.ucd.ie/software.html. We estimated multipoint identity-by-descent (IBD) matrices using all of the genotyped SNPs with Merlin [6]. Given that linkage disequilibrium (LD) between pairs of SNPs was low (r² < 0.2), we ignored it when creating the IBD matrices.
Univariate linkage analysis by variance components
We performed variance components linkage analysis with SOLAR [7] for each of the 3554 phenotypes. We fitted models with additive genetic, QTL, and residual environmental variance components and we used sex as a covariate. We characterized a hot spot as a locus with more than four phenotypes having LOD scores greater than 3.4.
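The hot spot rule can be illustrated with a minimal Python sketch. It assumes the LOD scores are already available as a nested dictionary (phenotype, then locus, then LOD); the function name, the data layout, and the locus labels are hypothetical, and the sketch does not address how neighbouring map positions are grouped into one locus.

```python
from collections import defaultdict

LOD_THRESHOLD = 3.4   # threshold stated in the text
MIN_PHENOTYPES = 4    # a hot spot needs MORE than this many linked phenotypes


def find_hot_spots(lod_scores, lod_threshold=LOD_THRESHOLD,
                   min_phenotypes=MIN_PHENOTYPES):
    """Return {locus: sorted list of linked phenotypes} for every hot spot.

    lod_scores is a hypothetical layout: {phenotype: {locus: LOD score}}.
    """
    linked = defaultdict(set)
    for phenotype, by_locus in lod_scores.items():
        for locus, lod in by_locus.items():
            if lod > lod_threshold:          # strictly greater than 3.4
                linked[locus].add(phenotype)
    return {locus: sorted(phenotypes)
            for locus, phenotypes in linked.items()
            if len(phenotypes) > min_phenotypes}
```

With this layout, a locus reached by exactly four phenotypes is not reported, since the definition requires more than four.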
Linkage by feature selection regression
We present a regression method based on the Haseman-Elston approach [8]. Given a phenotype, the mean-corrected product of the sibs' trait values, (x₁ − μ)(x₂ − μ), was used as a measure of phenotypic distance, and the IBD estimates between pairs were used as measures of the genetic similarity at each genetic location. In the standard Haseman-Elston regression, the phenotypic distance is regressed on the genetic similarity for a set of locations along the genome (usually one every centimorgan). The result is a linkage test for each location. Here we took a different approach: given a phenotype, we sought the set of genetic locations that best explained the phenotypic distances. Instead of models that test linkage at only one location, we allowed models that combined locations. The linkage analysis result for a phenotype is the combination of genetic locations that best explains the phenotypic distances between sib pairs.
The selection of genetic locations related to a specific phenotype by exhaustive search is computationally impractical with current computer capabilities: the number of possible combinations is 2^n, where n is the number of locations. In such cases, sub-optimal algorithms can be used to find a selection. This sub-optimality carries the risk of converging to a local minimum rather than the global optimum.
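As an illustration of the standard single-location regression described above (not of the feature selection method itself), the following Python sketch computes the mean-corrected products and regresses them on the IBD estimates at each location. All names are hypothetical, and taking μ as the mean over all individuals passed in is an assumption; with the cross-product form, a positive slope is the direction expected under linkage.

```python
def cross_products(pairs, trait):
    """Mean-corrected product (x1 - mu)(x2 - mu) for each sib pair.

    pairs: list of (sib1_id, sib2_id); trait: {individual_id: trait value}.
    Assumption: mu is the mean over all individuals in `trait`.
    """
    mu = sum(trait.values()) / len(trait)
    return [(trait[a] - mu) * (trait[b] - mu) for a, b in pairs]


def ols_slope(x, y):
    """Least-squares slope of y regressed on x (x must not be constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def haseman_elston_scan(products, ibd_by_location):
    """Slope of the phenotypic products on IBD sharing, one per location.

    ibd_by_location: {location: list of IBD estimates, one per sib pair,
    in the same order as `products`}.
    """
    return {location: ols_slope(ibd, products)
            for location, ibd in ibd_by_location.items()}
```

A significance test on each slope would still be needed to turn the scan into a linkage test; the sketch stops at the regression step.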
The sequential backward selection (SBS) algorithm [9] begins with a model that contains the whole set of n features as the initial subset, and applies the following steps: 1) evaluate a quality criterion for the current model; 2) try dropping each of the features in the subset and compute the corresponding criterion; 3) select the best candidate model; 4) iterate until the quality criterion is not improved.
Because only one feature is dropped per iteration, SBS evaluates far fewer combinations than an exhaustive search. In the symmetric algorithm, sequential forward selection (SFS), the initial subset is empty and step 2 adds a feature instead of dropping one.
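The four SBS steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the quality criterion is any user-supplied function of a feature subset for which lower values are better, features are mutually comparable (for example integers or strings), and the function name is hypothetical. In this sketch the criterion is evaluated at most n(n+1)/2 times, compared with 2^n subsets for an exhaustive search.

```python
def sequential_backward_selection(features, criterion):
    """Greedy backward elimination; lower criterion values are better.

    criterion: callable taking a frozenset of features, returning a float.
    """
    current = frozenset(features)
    best_score = criterion(current)                 # step 1: score current model
    while len(current) > 1:
        # Step 2: score every model obtained by dropping one feature.
        candidates = [(criterion(current - {f}), f) for f in sorted(current)]
        score, dropped = min(candidates)            # step 3: best candidate
        if score >= best_score:                     # step 4: stop if no gain
            break
        current, best_score = current - {dropped}, score
    return current, best_score
```

A forward (SFS) variant would start from an empty set and score `current | {f}` for each feature not yet selected.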
Note that once a feature is dropped in SBS, it cannot be included in the subset again, even if it could later provide further information and improve the criterion. To avoid this nesting effect, a family of variants called "floating search methods" has been proposed [10]. In sequential forward-floating selection (SFFS), the algorithm begins with an empty feature set. At each step, the feature that most improves the quality criterion is added to the current feature subset (SFS step). If the criterion can then be improved by removing some of the features from the new subset, the algorithm performs a backward step (SBS). SFFS therefore dynamically increases and decreases the number of features until no step yields an improvement.
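A minimal Python sketch of the floating search as described in this paragraph is given below. It is a simplified reading of the procedure, not the reference formulation of [10]; the function name is hypothetical, lower criterion values are assumed to be better, and the empty model is assumed to score infinitely badly so that the first forward step is always accepted.

```python
def sffs(features, criterion):
    """Simplified sequential forward-floating selection (lower is better).

    criterion: callable taking a frozenset of features, returning a float.
    """
    features = list(features)
    selected = frozenset()
    best = float("inf")   # assumption: the empty model is infinitely bad
    while True:
        improved = False
        # Forward (SFS) step: add the single feature that helps most.
        remaining = [f for f in features if f not in selected]
        if remaining:
            score, added = min((criterion(selected | {f}), f) for f in remaining)
            if score < best:
                selected, best, improved = selected | {added}, score, True
        # Backward (SBS) steps: drop features while that improves the score.
        while len(selected) > 1:
            score, dropped = min((criterion(selected - {f}), f)
                                 for f in sorted(selected))
            if score >= best:
                break
            selected, best, improved = selected - {dropped}, score, True
        if not improved:
            return selected, best
```

Because every accepted step strictly lowers the criterion, no subset is visited twice as the current model and the loop terminates.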
We built an SFFS algorithm to determine the genetic locations related to a phenotype. The floating search minimized the root mean square error (RMSE) of a partial least squares (PLS) model predicting a given phenotypic distance from the complete set of loci. For each model constructed, the optimal number of PLS latent variables was obtained from the training set (one-third of the available pairs) by means of four-fold cross-validation. RMSE was computed on the validation set, covering the remaining two-thirds of the available pairs.
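The evaluation protocol for one candidate set of locations can be sketched as follows. The PLS fit itself is not implemented here: `fit` is a hypothetical callable (any regression routine with a tunable number of latent variables could be plugged in), and the random split, the seed, and the `max_components` bound are assumptions that do not come from the text. The sketch needs at least 12 pairs so that each of the four folds is non-empty.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))


def evaluate_subset(X, y, fit, max_components, seed=0):
    """Return (validation RMSE, chosen number of latent variables).

    X: one row of IBD estimates per sib pair, restricted to the candidate
    locations; y: phenotypic distance per pair.
    fit(X_train, y_train, n_components) -> predict(X_rows) is a
    HYPOTHETICAL interface standing in for a PLS implementation.
    """
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    n_train = len(idx) // 3                  # one-third for training
    train, valid = idx[:n_train], idx[n_train:]

    def take(rows):
        return [X[i] for i in rows], [y[i] for i in rows]

    def cv_error(n_components):
        # Four-fold cross-validation inside the training set only.
        errors = []
        for held_out in (train[i::4] for i in range(4)):
            fit_rows = [i for i in train if i not in held_out]
            predict = fit(*take(fit_rows), n_components)
            X_held, y_held = take(held_out)
            errors.append(rmse(y_held, predict(X_held)))
        return sum(errors) / len(errors)

    best_n = min(range(1, max_components + 1), key=cv_error)
    predict = fit(*take(train), best_n)      # refit on the full training set
    X_valid, y_valid = take(valid)
    return rmse(y_valid, predict(X_valid)), best_n
```

The returned RMSE is the quantity a floating search would minimize when comparing candidate sets of locations.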
We applied this algorithm to the 3554 phenotypes. Because the SFFS algorithm is computationally intensive, instead of using all the available genetic locations (one for every centimorgan), we used only 800 evenly spaced locations, one every 5 cM. Results are given as a vector of the locations (expressed in cM) selected for each phenotype.
Gene ontology analysis
We performed tests to evaluate statistical over-representation of gene ontology (GO) [3] categories in sets of genes controlled by the same hot spot, using the Biological Network Gene Ontology (BiNGO) tool [11]. This tool performs a hypergeometric test with a Benjamini & Hochberg false-discovery rate (FDR) multiple testing correction against each of the ontologies: biological process, molecular function, and cellular component. We used a significance threshold of 0.01.
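For illustration, the kind of test described above can be sketched in Python using only the standard library. This is not BiNGO's code; the function names and argument layout are hypothetical, and the sketch shows a one-sided hypergeometric over-representation test followed by the Benjamini-Hochberg adjustment.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated genes in a set of n.

    N: genes in the reference set; K: reference genes in the GO category;
    n: genes in the tested set; k: tested genes in the GO category.
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, upper + 1))
    return favourable / comb(N, n)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A category would then be reported as over-represented when its adjusted p-value falls below the 0.01 threshold used here.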