Impact of normalization and filtering on linkage analysis of gene expression data
© Beyene et al; licensee BioMed Central Ltd. 2007
Published: 18 December 2007
Using the Problem 1 data set made available for Genetic Analysis Workshop 15, we assessed sensitivity of linkage results to a correlation-based feature extraction method as well as to different normalization procedures applied to the raw Affymetrix gene expression microarray data. The impact of these procedures on heritability estimates and on expression quantitative trait loci are investigated. The filtering algorithm we propose in this paper ranks genes based on the total absolute correlation of each gene with all other genes on the array and has the potential to extract features that may play role in functional pathways and gene networks. Our results showed that the normalization and filtering algorithms can have a profound influence on genetic analysis of gene expression data.
Several recent studies have applied traditional quantitative trait linkage analysis to genome-wide gene expression data and have investigated the role of genetic variation in transcription . These types of studies have the potential to uncover complicated transcriptional control. For example, Morley et al. and others identified several cis- and trans-acting genes and master regulators using linkage mapping on the gene expression phenotypes . However, gene expression measurements are not generated in a uniform platform and, depending on the particular technology, most high-level statistical analyses using these data are preceded by a number of low-level pre-processing steps. For instance, Affymetrix GeneChip arrays have become a widely used microarray platform  and there are various algorithms for performing feature extraction and normalization on these high-density oligonucleotide gene expression arrays [3, 4]. Ideally, all of the methods and algorithms should produce similar results. In practice, however, findings could be sensitive to variations in pre-processing approaches and may lead to different results. In particular, the impact that different algorithms have on expression quantitative trait loci (eQTL) analysis is not well understood. Understanding consequences of low-level pre-processing approaches is essential in interpreting findings from genome-wide linkage analyses of multivariate gene expression measurements. In this paper, we first propose a correlation-based filtering method. The motivation is to select genes that may participate in pathways. We then apply three normalization methods and the filtering algorithm to the Problem 1 data set made available for Genetic Analysis Workshop 15 (GAW15). In particular, we compare and contrast heritability estimates, concordance in top-ranked genes, and impact on the number of genes identified as cis-, trans- or multiple-regulators.
Materials and methods
Raw gene expression measurements obtained from lymphoblastoid cell lines from 14 Centre d'Etude du Polymorphisme Humain (CEPH) Utah families were made available for GAW15. Furthermore, a subset consisting of 3554 genes that Morley et al. analyzed for linkage was also made available . The genotypes of 2882 autosomal and X-linked SNPs of members of these families were generated by The SNP Consortium http://snp.cshl.org/.
For the data in this problem, the Affymetrix Genome Focus Arrays were used. A unique feature of microarrays generated using the Affymetrix platform is the so-called MisMatch (MM) probes. Each of the probe pairs in a probe set has a Perfect Match (PM) and a MisMatch probe, with each probe having 25 bases. The PM probes are designed to bind perfectly to the gene of interest and the MM probes have a contrasting base at position 13 with the intention of measuring non-specific binding . The Microarray Analysis Suite (MAS) algorithm from Affymetrix  incorporates information from the MM probes into the calculation of gene expression intensities. However, there is a continuing debate in the literature over the merit of MM probes as well as the impact of the various algorithms on downstream analysis. Several new algorithms have been proposed in recent years and most of them use only the PM signals to calculate gene expression. The algorithms include the robust multi-array average (RMA) method , Gene Chip RMA (GCRMA) , and the probe logarithmic error intensity estimate (PLIER) method from Affymetrix .
where r ij is the correlation of gene i with gene j. For this filtering procedure, we used expression profiling data from 56 unrelated individuals. We extracted the top 3554 genes ranked by TAC from the largest ones to the smallest ones, so that we can make qualitative comparisons with what was reported by Morley et al. .
After the normalization and filtering steps, we carried out multipoint linkage analysis on each of the phenotypes using the MERLIN-REGRESS command in the statistical genetics software MERLIN . The variance components (VC) option in MERLIN was used to estimate heritability. As in Morley et al., cis regulators were defined as those regulatory variants that mapped within 5 megabases (Mb) of the target gene . All other significant linkages are categorized as trans regulators. Physical locations of probe sets were obtained from the Affymetrix annotation table http://www.affymetrix.com. The Rutgers map was used to establish a correspondence between the physical map and the genetic map http://compgen.rutgers.edu/maps. Markers that could not be mapped using the Rutgers map, but that were located between physically anchored markers, were placed on the genetic map by interpolation.
It appears that the normalization methods have a strong effect on the gene expression correlation structure. The selected genes from RMA may have lower correlation than those from GCRMA and PLIER. Depending on the method used, different genes were identified as the top-ranking genes. The overlap in probe sets extracted using the three normalization approaches described earlier (RMA, GCRMA, and PLIER) along with our correlation-based feature extraction algorithm in comparison with the probe sets selected by Morley et al.  following their use of the Affymetrix MAS algorithm and a variance-based filtering varied from 10% to 43%. Specifically, the overlap between their results and the results from RMA, GCRMA, and PLIER were 1174/3554 (33%), 361/3554 (10%), and 1528/3554 (43%), respectively. Among these three methods, the rates of concordance were as follows: RMA versus GCRMA, 1670/3554 (47%); RMA versus PLIER, 1732/3554 (49%); and GCRMA versus PLIER, 1149/3554 (32%).
Estimated heritability values based on the raw gene expression data sets normalized by three different methods
0.00 – 0.10
0.11 – 0.20
0.21 – 0.30
0.31 – 0.40
0.41 – 0.50
0.51 – 0.60
0.61 – 0.70
0.71 – 0.80
0.81 – 0.90
0.91 – 1.00
Number of expression phenotypes having cis-/trans-/cis and trans-regulators at different p-value thresholds using data sets normalized by three methods (cis regulators are those that mapped within 5 Mb of the target gene)
Selected expression phenotypesa with significant evidence for linkage (LOD score ≥ 3.4) from genome scans identified in all data sets normalized by three different methods
Recently, a number of groups have started to integrate data from gene expression studies with genetic linkage analysis, leading to a new synergy between the two approaches . Understanding the genetic basis of gene expression might shed light on processes that connect genotypic information to cellular- and organism-level information and systems biology. We used classical quantitative trait loci methodology in which expression levels are treated as quantitative phenotypes and genetic variants that significantly influence gene expression are sought along the entire genome. Several studies have shown that mRNA levels for many genes are heritable and mapping efforts have led to the characterization of genetic regulation in 'cis', as well as in 'trans' .
We applied three different normalization methods to the Problem 1 data set provided for GAW15 and carried out expression quantitative trait linkage (eQTL) analysis for 3554 traits. We used a filtering algorithm that extracts features based on a large total absolute correlation criterion to come up with the set of genes for the linkage analyses. Our findings suggest that different normalization and filtering algorithms can have a profound influence on genetic analysis of gene expression data. This observation is in agreement with a recent brief report by Williams et al. . Our linkage analysis treated each trait independently of the others, similar to most other published work in this area . Unlike the variance-based filtering used by Morley et al. , our correlation-based filtering takes dependence among genes into account in the pre-processing phase. The results from our approach can be useful in interpreting linkage results and inferring which genes may participate in pathways. We have not investigated whether the gene expression measurements are normally distributed, and this may also influence the power of the linkage findings.
Our choice of Pearson correlation was arbitrary but not critical. This is supported by the following analysis: we evaluated the correlations between the TAC estimated by Pearson (TAC-Pearson) and that by Spearman (TAC-Spearman) for the data normalized by the three normalization methods. The correlations between these two measures were 0.92, 0.90, and 0.94, respectively, for expression values normalized by RMA, GCRMA, and PLIER, respectively. We also evaluated the similarity of TAC rank of the set of 3554 expression phenotypes estimated by Pearson and Spearman correlations. For the 2000 top-ranked genes, the proportion of overlapped set of genes based on Pearson and Spearman were 89.8%, 87.2%, and 92.2% for RMA, GCRMA and PLIER, respectively.
This research was supported in part by funding from the Natural Sciences and Engineering Research Council of Canada (NSERC), Canadian Institutes of Health Research (CIHR), and Genome Canada through the Ontario Genomics Institute. We also gratefully acknowledge insightful comments of two anonymous reviewers and the editor.
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.
- Morley M, Molony CM, Weber TM, Devlin JL, Ewens KG, Spielman RS, Cheung V: Genetic analysis of genome-wide variation in human gene expression. Nature. 2004, 430: 743-747. 10.1038/nature02797.View ArticlePubMed CentralPubMedGoogle Scholar
- Affymetrix: Microarray Suite User Guide. [http://www.affymetrix.com/support/technical/manuals.affxAffymetrix]
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4: 249-264. 10.1093/biostatistics/4.2.249.View ArticlePubMedGoogle Scholar
- Wu Z, Irizarry RA: Preprocessing of oligonucleotide array data. Nat Biotechnol. 2004, 22: 656-658. 10.1038/nbt0604-656b.View ArticlePubMedGoogle Scholar
- Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin-rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002, 30: 97-101. 10.1038/ng786.View ArticlePubMedGoogle Scholar
- Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, Gonzalez JR, Gratacos M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Zhang J, Armengol L, Conrad DF, Estivill X, Tyler-Smith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW, Hurles ME: Global variation in copy number in the human genome. Nature. 2006, 444: 444-454. 10.1038/nature05329.View ArticlePubMed CentralPubMedGoogle Scholar
- Williams RB, Cotsapas CJ, Cowley MJ, Chan E, Nott DJ, Little PF: Normalization procedures and detection of linkage signal in genetical-genomics experiments. Nat Genet. 2006, 38: 855-856. 10.1038/ng0806-855.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.