Data
The Genetic Analysis Workshop 15 (GAW15) Problem 1 Centre d'Etude du Polymorphisme Humain data consisted of 196 participants from 14 three-generation pedigrees, each with 14 individuals (4 grandparents, 2 parents, and 8 offspring). GAW15 provided 276 Affymetrix Human Focus Arrays with data on 3554 probe sets. These probe sets had been selected as those with the greatest inter-individual variability from a total of 8500 probe sets [2].
Feature-extraction and normalization methods
To assess the impact of data pre-processing on linkage analysis, we selected two feature-extraction and normalization methods, Affy and dChip. The Affy-processed data were provided by GAW15 and are the data set used by Morley et al. to identify significant linkage signals for genome-wide variation of gene expression [2]. These data were normalized by global scaling using Affymetrix Microarray Suite 5 with a target value of 500 [2]. The second feature extraction and normalization was conducted using dChip, a common approach found in microarray analytical packages that is described in detail by Li and Wong [5]. Briefly, we first normalized all 276 arrays against the array with the median probe intensity using the invariant set algorithm. We then calculated the model-based expression indexes (MBEI) based on the perfect match probes (PM-only model) [5]. For both normalizations, the gene expression values for individuals with technical duplicates were averaged, and all expression values were log2-transformed.
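The steps shared by both normalizations (global scaling to a target value, averaging technical duplicates, and the log2 transformation) can be illustrated with a minimal sketch. It is not the Microarray Suite 5 or dChip implementation. The function names, the input layout, the use of a plain mean for scaling, and the choice to average before taking logarithms are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from statistics import mean

def scale_to_target(intensities, target=500.0):
    """Globally scale one array so that its mean intensity equals `target`.

    Assumption: a plain mean is used here; the vendor software may trim
    extreme values before computing the scale factor.
    """
    factor = target / mean(intensities.values())
    return {probe: value * factor for probe, value in intensities.items()}

def average_duplicates_log2(arrays):
    """Average technical duplicates per individual, then log2-transform.

    `arrays` is a list of (individual_id, {probe_set: intensity}) pairs,
    one pair per array. Assumption: raw intensities are averaged first
    and the log2 transformation is applied to the average.
    """
    by_individual = defaultdict(list)
    for individual_id, intensities in arrays:
        by_individual[individual_id].append(intensities)

    result = {}
    for individual_id, replicates in by_individual.items():
        result[individual_id] = {
            probe: math.log2(mean(rep[probe] for rep in replicates))
            for probe in replicates[0]
        }
    return result

# Hypothetical toy data: "ind1" was hybridized twice, "ind2" once.
arrays = [
    ("ind1", {"probeA": 400.0, "probeB": 64.0}),
    ("ind1", {"probeA": 624.0, "probeB": 64.0}),
    ("ind2", {"probeA": 256.0, "probeB": 128.0}),
]
expr = average_duplicates_log2(arrays)
scaled = scale_to_target({"probeA": 100.0, "probeB": 300.0})
```

In this toy input the two "ind1" arrays collapse to a single expression profile, so each individual contributes one value per probe set to the downstream analysis.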
Selection of phenotype subsets
To enrich for informative phenotypes, we excluded probe sets that had little variation in expression (standard deviation ≤ 0.3) and low call rates (absent calls > 90%) across samples; 3306 phenotypes (probe sets) remained. We then narrowed this set to the phenotypes most likely to be under genetic control by estimating heritability (H²) under a polygenic model, using the S-Plus/R library multic [6], for both the dChip and Affy normalizations. A phenotype was retained if, for either normalization, H² > 0.60 or H² was significantly different from zero at α = 0.0001. This resulted in the inclusion of 45 phenotypes for linkage analysis.
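The two selection rules above can be written as simple predicates. The sketch below is illustrative only: the function names and data layout are hypothetical, the heritability estimates and p-values are taken as given (they would come from a polygenic model fit, which is not reproduced here), and excluding a probe set when either filter criterion is met is an assumption about how the two criteria were combined.

```python
from statistics import stdev

SD_MIN = 0.3        # from the text: standard deviation <= 0.3 is too little variation
ABSENT_MAX = 0.90   # from the text: more than 90% absent calls is a low call rate
H2_MIN = 0.60       # heritability cutpoint from the text
ALPHA = 0.0001      # significance level for H2 different from zero

def is_informative(log2_values, absent_calls):
    """Return True if a probe set passes both the variation and call-rate filters.

    Assumptions: the sample standard deviation is used, and a probe set
    is excluded when EITHER criterion fails.
    """
    absent_fraction = sum(absent_calls) / len(absent_calls)
    return stdev(log2_values) > SD_MIN and absent_fraction <= ABSENT_MAX

def is_selected(h2_by_norm, p_by_norm):
    """Return True if the phenotype meets the heritability rule for either normalization."""
    return any(
        h2_by_norm[norm] > H2_MIN or p_by_norm[norm] < ALPHA
        for norm in h2_by_norm
    )

# Hypothetical examples.
flat = is_informative([7.0, 7.1, 6.9, 7.0], [False, False, False, False])
variable = is_informative([6.0, 7.0, 8.0, 7.0], [False, False, True, False])
mostly_absent = is_informative([6.0, 7.0, 8.0, 7.0], [True, True, True, True])

by_h2 = is_selected({"affy": 0.45, "dchip": 0.65}, {"affy": 0.01, "dchip": 0.002})
by_p = is_selected({"affy": 0.40, "dchip": 0.35}, {"affy": 0.00005, "dchip": 0.01})
neither = is_selected({"affy": 0.40, "dchip": 0.35}, {"affy": 0.01, "dchip": 0.01})
```

Because the rule is applied to either normalization, a phenotype that qualifies under only one method is still carried forward and analyzed under both.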
Genetic data
For a subset of subjects, including founders, we observed a large number of missing genotypes. Because tightly linked markers in linkage disequilibrium can increase the number of false-positive linkage signals [7], we reduced the extent of linkage disequilibrium between SNPs by removing SNPs with r² > 0.30 using ldSelect [8]. We then removed 2205 genotypes with Mendelian inconsistencies (0.5% of matings/genotypes). Thus, 2272 SNPs from a total of 2882 were used in the linkage analysis. Multipoint identity-by-descent (MIBD) sharing among pairs of relatives was calculated using the SIMWALK2 program [9].
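To illustrate what an r² threshold of 0.30 means in practice, the sketch below computes a genotype-based r² and applies a simple greedy filter. It is a simplified stand-in and does not reproduce ldSelect. The function names, the 0/1/2 allele-count coding, the use of the squared Pearson correlation of allele counts as r², and the greedy keep-in-order strategy are all assumptions.

```python
def genotype_r2(g1, g2):
    """Squared Pearson correlation between two SNPs coded as 0/1/2 allele counts.

    Missing genotypes are given as None and are dropped pairwise.
    """
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean1 = sum(a for a, _ in pairs) / n
    mean2 = sum(b for _, b in pairs) / n
    s11 = sum((a - mean1) ** 2 for a, _ in pairs)
    s22 = sum((b - mean2) ** 2 for _, b in pairs)
    s12 = sum((a - mean1) * (b - mean2) for a, b in pairs)
    if s11 == 0 or s22 == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return (s12 * s12) / (s11 * s22)

def greedy_prune(genotypes, threshold=0.30):
    """Keep a SNP only if its r2 with every previously kept SNP is <= threshold.

    `genotypes` maps SNP name to a list of allele counts, in map order.
    """
    kept = []
    for snp, calls in genotypes.items():
        if all(genotype_r2(calls, genotypes[k]) <= threshold for k in kept):
            kept.append(snp)
    return kept

# Hypothetical toy genotypes for six individuals.
genotypes = {
    "snp1": [0, 1, 2, 1, 0, 2],
    "snp2": [0, 1, 2, 1, 0, 2],   # identical to snp1, so r2 = 1
    "snp3": [1, 0, 1, 2, 1, 1],   # uncorrelated with snp1, so r2 = 0
}
kept = greedy_prune(genotypes)
```

In pedigree data, r² would normally be estimated from unrelated individuals only; that restriction is left out of the sketch.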
Quantitative trait linkage analysis
The 90 phenotypes (45 expression phenotypes × 2 normalizations) were transformed to approximate normality using the van der Waerden rank transformation [10]. Variance-components linkage analyses were performed using the S-Plus library multic [6]. Sex was used as a covariate, consistent with Morley et al. [2]. We assessed evidence for linkage for each of the 90 phenotypes and considered a LOD score > 3.0 to be "strong" evidence of linkage, a threshold that assumes a genome-wide significance level of 0.05.
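The van der Waerden transformation replaces each observation by the standard normal quantile of r/(n + 1), where r is its rank among n observations. A minimal sketch follows, together with a conversion from a LOD score to a pointwise p-value. The function names are hypothetical, ties are assigned mid-ranks by assumption, and the p-value conversion assumes the usual 50:50 mixture of a point mass at zero and a one-degree-of-freedom chi-square distribution for a variance-components test; it is not a genome-wide correction.

```python
import math
from statistics import NormalDist

def van_der_waerden(values):
    """Map values to normal scores: the inverse normal CDF of rank / (n + 1).

    Assumption: tied values receive the average of their ranks.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mid_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = mid_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf(r / (n + 1)) for r in ranks]

def lod_to_pointwise_p(lod):
    """Pointwise p-value for a LOD score under a 50:50 mixture of 0 and chi-square(1)."""
    chi2 = 2.0 * math.log(10.0) * lod
    # For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    return 0.5 * math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical expression values for three individuals.
scores = van_der_waerden([5.2, 3.1, 9.7])
tied = van_der_waerden([1.0, 1.0, 2.0])
p_lod3 = lod_to_pointwise_p(3.0)
```

With three untied observations the ranks 2, 1, and 3 map to the 0.50, 0.25, and 0.75 normal quantiles, so the transformed values are symmetric about zero regardless of the original scale. Under the stated mixture assumption, a LOD score of 3.0 corresponds to a pointwise p-value of roughly 0.0001.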