Impact of normalization and filtering on linkage analysis of gene expression data.

Using the Problem 1 data set made available for Genetic Analysis Workshop 15, we assessed sensitivity of linkage results to a correlation-based feature extraction method as well as to different normalization procedures applied to the raw Affymetrix gene expression microarray data. The impact of these procedures on heritability estimates and on expression quantitative trait loci are investigated. The filtering algorithm we propose in this paper ranks genes based on the total absolute correlation of each gene with all other genes on the array and has the potential to extract features that may play role in functional pathways and gene networks. Our results showed that the normalization and filtering algorithms can have a profound influence on genetic analysis of gene expression data.


Background
Several recent studies have applied traditional quantitative trait linkage analysis to genome-wide gene expression data and have investigated the role of genetic variation in transcription [1]. These types of studies have the potential to uncover complicated transcriptional control. For example, Morley et al. and others identified several cis-and trans-acting genes and master regulators using linkage mapping on the gene expression phenotypes [1]. However, gene expression measurements are not generated in a uniform platform and, depending on the particular technology, most high-level statistical analyses using these data are preceded by a number of low-level pre-processing steps. For instance, Affymetrix GeneChip arrays have become a widely used microarray platform [2] and there are various algorithms for performing feature extraction and normalization on these high-density oligonucleotide gene expression arrays [3,4]. Ideally, all of the methods and algorithms should produce similar results. In practice, however, findings could be sensitive to variations in pre-processing approaches and may lead to different results. In particular, the impact that different algorithms have on expression quantitative trait loci (eQTL) analysis is not well understood. Understanding consequences of low-level pre-processing approaches is essential in interpreting findings from genome-wide linkage analyses of multivariate gene expression measurements. In this paper, we first propose a correlation-based filtering method. The motivation is to select genes that may participate in pathways. We then apply three normalization methods and the filtering algorithm to the Problem 1 data set made available for Genetic Analysis Workshop 15 (GAW15). In particular, we compare and contrast heritability estimates, concordance in top-ranked genes, and impact on the number of genes identified as cis-, trans-or multiple-regulators.

Materials and methods
Raw gene expression measurements obtained from lymphoblastoid cell lines from 14 Centre d'Etude du Polymorphisme Humain (CEPH) Utah families were made available for GAW15. Furthermore, a subset consisting of 3554 genes that Morley et al. analyzed for linkage was also made available [1]. The genotypes of 2882 autosomal and X-linked SNPs of members of these families were generated by The SNP Consortium http://snp.cshl.org/.
For the data in this problem, the Affymetrix Genome Focus Arrays were used. A unique feature of microarrays generated using the Affymetrix platform is the so-called MisMatch (MM) probes. Each of the probe pairs in a probe set has a Perfect Match (PM) and a MisMatch probe, with each probe having 25 bases. The PM probes are designed to bind perfectly to the gene of interest and the MM probes have a contrasting base at position 13 with the intention of measuring non-specific binding [2]. The Microarray Analysis Suite (MAS) algorithm from Affymetrix [2] incorporates information from the MM probes into the calculation of gene expression intensities. However, there is a continuing debate in the literature over the merit of MM probes as well as the impact of the various algorithms on downstream analysis. Several new algorithms have been proposed in recent years and most of them use only the PM signals to calculate gene expression. The algorithms include the robust multi-array average (RMA) method [3], Gene Chip RMA (GCRMA) [4], and the probe logarithmic error intensity estimate (PLIER) method from Affymetrix [2].
Morley et al. applied the MAS algorithm and selected 3554 genes using data from 94 grandparents such that the between-individual variance is larger than the withinindividual variance [1]. In this report, we used raw CEL files from 194 subjects and applied the RMA, GCRMA, and PLIER algorithms to normalize and calculate gene expression intensities. Like the MAS algorithm, the PLIER method also incorporates information from MM probes while both RMA and GCRMA use only PM signals. We then ranked genes by the total absolute correlation (TAC) that each gene i has with the other n -1 genes. The TAC for each gene is calculated as follows: where r ij is the correlation of gene i with gene j. For this filtering procedure, we used expression profiling data from 56 unrelated individuals. We extracted the top 3554 genes ranked by TAC from the largest ones to the smallest ones, so that we can make qualitative comparisons with what was reported by Morley et al. [1].
After the normalization and filtering steps, we carried out multipoint linkage analysis on each of the phenotypes using the MERLIN-REGRESS command in the statistical genetics software MERLIN [5]. The variance components (VC) option in MERLIN was used to estimate heritability. As in Morley et al., cis regulators were defined as those regulatory variants that mapped within 5 megabases (Mb) of the target gene [1]. All other significant linkages are categorized as trans regulators. Physical locations of probe sets were obtained from the Affymetrix annotation table http:/ /www.affymetrix.com. The Rutgers map was used to establish a correspondence between the physical map and the genetic map http://compgen.rutgers.edu/maps. Markers that could not be mapped using the Rutgers map, but that were located between physically anchored markers, were placed on the genetic map by interpolation.

Results
It appears that the normalization methods have a strong effect on the gene expression correlation structure. The selected genes from RMA may have lower correlation than those from GCRMA and PLIER. Depending on the method used, different genes were identified as the topranking genes. The overlap in probe sets extracted using   [1], in which they identified a total of 984 expression phenotypes with regulators. However, the GCRMA resulted in much fewer genes with trans-regulators while the PLIER method detected more genes with trans-regulators at the same p-value threshold.  A total of 67 expression phenotypes with strong linkage (p-value ≤ 3.7*10 -5 ) have been detected by all three methods. 19 of the 67 (28.4%) expression phenotypes with trans-regulators that have been identified in all the data sets normalized using the three different normalized methods have also been shown to exhibit copy number variation in healthy individuals [6]. This was determined by searching the copy number variation database availa-Box plot of LOD scores versus heritability

Boxplot of LOD vs Heritability
Heritability LOD (c) PLIER ble at http://projects.tcag.ca/variation/. These 19 expression phenotypes are listed in Table 3. The genes showing copy number variation and significant linkage signals may be of interest for further biological investigation.
We took the maximum LOD score for each trait and grouped them into different heritability categories. Figure  1 shows the overall trend of maximum values of LOD scores corresponding to different heritability intervals as shown in Table 1. No clear relationship between the heritability estimates and the maximum LOD score was seen. Some traits with high LOD scores have low heritability and other traits with high heritability have similar LOD scores as those with low heritability, as can be seen in Figure 1a and 1b.

Discussion
Recently, a number of groups have started to integrate data from gene expression studies with genetic linkage analysis, leading to a new synergy between the two approaches [1]. Understanding the genetic basis of gene expression might shed light on processes that connect genotypic information to cellular-and organism-level information and systems biology. We used classical quantitative trait loci methodology in which expression levels are treated as quantitative phenotypes and genetic variants that significantly influence gene expression are sought along the entire genome. Several studies have shown that mRNA levels for many genes are heritable and mapping efforts have led to the characterization of genetic regulation in 'cis', as well as in 'trans' [1].
We applied three different normalization methods to the Problem 1 data set provided for GAW15 and carried out expression quantitative trait linkage (eQTL) analysis for 3554 traits. We used a filtering algorithm that extracts features based on a large total absolute correlation criterion to come up with the set of genes for the linkage analyses. Our findings suggest that different normalization and filtering algorithms can have a profound influence on genetic analysis of gene expression data. This observation is in agreement with a recent brief report by Williams et al. [7]. Our linkage analysis treated each trait independently of the others, similar to most other published work in this area [1]. Unlike the variance-based filtering used by Morley et al. [1], our correlation-based filtering takes dependence among genes into account in the pre-processing phase. The results from our approach can be useful in interpreting linkage results and inferring which genes may participate in pathways. We have not investigated whether the gene expression measurements are normally distributed, and this may also influence the power of the linkage findings.
Our choice of Pearson correlation was arbitrary but not critical. This is supported by the following analysis: we evaluated the correlations between the TAC estimated by Pearson (TAC-Pearson) and that by Spearman (TAC-Spearman) for the data normalized by the three normalization methods. The correlations between these two measures were 0.92, 0.90, and 0.94, respectively, for expression values normalized by RMA, GCRMA, and PLIER, respectively. We also evaluated the similarity of TAC rank of the set of 3554 expression phenotypes estimated by Pearson and Spearman correlations. For the 2000 top-ranked genes, the proportion of overlapped set of genes based on Pearson and Spearman were 89.8%, 87.2%, and 92.2% for RMA, GCRMA and PLIER, respectively.

Competing interests
The author(s) declare that they have no competing interests.