Normalization of microarray expression data using within-pedigree pool and its effect on linkage analysis
© Kim et al; licensee BioMed Central Ltd. 2007
Published: 18 December 2007
"Genetical genomics", the study of natural genetic variation combining data from genetic marker-based studies with gene expression analyses, has exploded with the recent development of advanced microarray technologies. To account for systematic variation known to exist in microarray data, it is critical to properly normalize gene expression traits before performing genetic linkage analyses. However, imposing equal means and variances across pedigrees can over-correct for the true biological variation by ignoring familial correlations in expression values. We applied the robust multiarray average (RMA) method to gene expression trait data from 14 Centre d'Etude du Polymorphisme Humain (CEPH) Utah pedigrees provided by GAW15 (Genetic Analysis Workshop 15). We compared the RMA normalization method using within-pedigree pools to RMA normalization using all individuals in a single pool, which ignores pedigree membership, and investigated the effects of these different methods on 18 gene expression traits previously found to be linked to regions containing the corresponding structural locus. Familial correlation coefficients of the expressed traits were stronger when traits were normalized within pedigrees. Surprisingly, the linkage plots for these traits were similar, suggesting that although heritability increases when traits are normalized within pedigrees, the strength of linkage evidence does not necessarily change substantially.
Genetical genomics integrates genome-wide expression profiles from microarray experiments with marker-based measures of genetic variation. It has recently become a central methodology in quantitative trait studies for identifying loci that regulate quantitative variation in RNA levels. As the cost of expression microarrays and marker genotyping has fallen, these studies have been extended to quantitative traits in linkage and association studies [2, 3]. In microarray experiments, it is necessary to determine which normalization method best adjusts for systematic experimental variation to produce unbiased data. Among the normalization methods used for the common Affymetrix GeneChip, the robust multiarray average (RMA) method and the statistical algorithm implemented in Affymetrix's Microarray Suite (MAS5) program are the gold standard for controlling systematic variation in samples of unrelated individuals [4, 5]. Investigation of the effects of various normalization methods in family data is needed.
Our aim is to preserve each family's own distribution of expression values during normalization. We address the problem of normalizing expression data within pedigrees by comparing two possible standard distributions for normalization: 1) applying RMA across all arrays as a single pool, which assumes that all individuals are independent and share the same distribution of trait values, and 2) applying RMA to the arrays within each pedigree, which assumes that the members of a pedigree share the same distribution.
Study subjects consisted of 194 individuals from 14 Centre d'Etude du Polymorphisme Humain (CEPH) Utah pedigrees; 2882 autosomal and X-linked single-nucleotide polymorphism (SNP) genotypes were available from The SNP Consortium. Quantitative phenotype data were generated from immortalized B cells, and 8793 gene expression values were available from the raw microarray CEL (cell intensity) files from the Affymetrix GeneChips and the Hgfocus CDF (chip description file). When a subject had more than one expression file, the first replicate was chosen so that one array per subject was used. To evaluate the two scenarios, two data sets were prepared: 1) RMA normalization using all individuals (arrays) as the normalization pool, and 2) RMA normalization applied to each within-pedigree pool, which allows a separate distribution for each pedigree. The 'affy' package v1.8 in R v2.3.1 was used to normalize the expression data.
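As an illustration of how the choice of pool changes the reference distribution, the following Python sketch applies quantile normalization, one of the steps of RMA, either to all arrays as one pool or separately within each pedigree. It is a minimal sketch and not the 'affy' implementation used in this study: background correction and probe-set summarization are omitted, ties are broken by position, and the function names, array labels, pedigree labels, and intensity values are all hypothetical.

```python
from collections import defaultdict


def quantile_normalize(arrays):
    """Quantile-normalize {array_id: [intensities]} against its own pool.

    Every array in the pool must have the same number of values.
    """
    ids = list(arrays)
    n = len(arrays[ids[0]])
    # Reference distribution: mean of the k-th smallest value across the pool.
    sorted_cols = [sorted(arrays[i]) for i in ids]
    reference = [sum(col[k] for col in sorted_cols) / len(ids) for k in range(n)]
    out = {}
    for i in ids:
        # Positions of this array's values, ordered from smallest to largest.
        order = sorted(range(n), key=lambda k: arrays[i][k])
        normalized = [0.0] * n
        for rank, k in enumerate(order):
            normalized[k] = reference[rank]
        out[i] = normalized
    return out


def normalize_within_pedigrees(arrays, pedigree_of):
    """Quantile-normalize each pedigree's arrays as a separate pool."""
    pools = defaultdict(dict)
    for array_id, values in arrays.items():
        pools[pedigree_of[array_id]][array_id] = values
    out = {}
    for pool in pools.values():
        out.update(quantile_normalize(pool))
    return out


# Hypothetical toy data: two pedigrees, two arrays each, three probes per array.
arrays = {
    "a1": [2.0, 4.0, 6.0],
    "a2": [4.0, 6.0, 8.0],
    "b1": [10.0, 30.0, 20.0],
    "b2": [20.0, 40.0, 30.0],
}
pedigree_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}

all_pool = quantile_normalize(arrays)                         # scenario 1
within_pool = normalize_within_pedigrees(arrays, pedigree_of)  # scenario 2
```

In this toy example the all-arrays pool forces the same three values onto every array, whereas the within-pedigree pools leave the two pedigrees with different distributions, which is the contrast examined in this study.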
To examine the effect of normalization pools composed of different types of individuals, we performed paired t-tests on the 8793 gene expression values of the four founders of one family. We compared the founders' values after normalization with the founders themselves as the pool to their paired values after normalization using 1) all family members, including the founders, as the pool, and 2) independent individuals (grandparents from other families) as the pool. The purpose of microarray normalization is to remove systematic variation while preserving biological variation. Therefore, if every pool removes only random error variation, the data normalized with the four founders as their own pool should not differ significantly from the data normalized with either of the other two pools, and the paired t-tests should yield non-significant p-values. We considered a t-test to indicate a significant difference between normalization methods if its p-value was below a conservative threshold of 0.001 (because we performed over 8,000 tests). We also evaluated the results using a significance threshold of 0.05.
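The comparison described above can be sketched in Python as follows. This is one possible reading, assumed here for illustration: a separate paired t-test for each gene, pairing each of the four founders' values under two normalization pools, which gives 3 degrees of freedom. The p-value uses the closed-form cumulative distribution of Student's t with 3 degrees of freedom, so it is valid only for four pairs. Function names, gene labels, and expression values are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t(x, y):
    """Return the paired t statistic and degrees of freedom for two samples."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    se = stdev(diffs) / math.sqrt(n)
    if se == 0:
        raise ValueError("paired differences have zero variance")
    return mean(diffs) / se, n - 1


def two_sided_p_df3(t):
    """Exact two-sided p-value for Student's t with 3 degrees of freedom."""
    u = abs(t) / math.sqrt(3.0)
    cdf = 0.5 + (u / (1.0 + u * u) + math.atan(u)) / math.pi
    return 2.0 * (1.0 - cdf)


def count_significant(own_pool, other_pool, threshold):
    """Count genes whose founder values differ between two normalizations."""
    hits = 0
    for gene, values in own_pool.items():
        t, df = paired_t(values, other_pool[gene])
        assert df == 3, "closed-form p-value assumes four founders"
        if two_sided_p_df3(t) < threshold:
            hits += 1
    return hits


# Hypothetical normalized values for the four founders (one list per gene).
own_pool = {"gene1": [5.0, 6.0, 7.0, 8.0], "gene2": [9.1, 9.4, 8.8, 9.0]}
family_pool = {"gene1": [4.0, 6.0, 6.0, 7.5], "gene2": [9.0, 9.5, 8.9, 9.0]}

n_strict = count_significant(own_pool, family_pool, 0.001)
n_nominal = count_significant(own_pool, family_pool, 0.05)
```

With real data, the same function would be called once for the family-member pool and once for the independent-individual pool, at both thresholds.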
To examine the effect of normalization methods on linkage results, we selected 18 expression phenotypes with previous evidence of cis-acting transcriptional regulation, i.e., linkage to the known location of each corresponding structural gene. Nonparametric quantitative linkage analysis was performed using Merlin v1.0 with the qtl option. We plotted both the negative logarithm of the p-values of the nonparametric linkage score and the allele-sharing LOD score of Kong and Cox. Because normalization changes the mean and variance of the values on each array, an individual's trait value can differ between normalization methods. FCOR in S.A.G.E. v5.1.0 was used to calculate familial correlations (e.g., parent-offspring, sibling, and grandparent), which were compared for each trait across the two normalization methods.
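To make the familial correlation comparison concrete, the following Python sketch computes an unweighted Pearson correlation over parent-offspring pairs for a single trait. It is an assumption-laden stand-in: the estimator and pair weighting used by FCOR are not reproduced, and the individual identifiers, relationship list, and trait values are hypothetical.

```python
import math


def pearson(pairs):
    """Pearson correlation for a list of (x, y) trait-value pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


def relative_pairs(trait, relations):
    """Collect trait values for each listed pair of relatives with data."""
    return [(trait[a], trait[b]) for a, b in relations if a in trait and b in trait]


# Hypothetical normalized values of one expression trait in three families.
trait = {
    "1_father": 6.0, "1_child": 6.2,
    "2_father": 7.0, "2_child": 6.9,
    "3_father": 8.0, "3_child": 8.3,
}
parent_offspring = [
    ("1_father", "1_child"),
    ("2_father", "2_child"),
    ("3_father", "3_child"),
]

r_parent_offspring = pearson(relative_pairs(trait, parent_offspring))
```

Running the same calculation on the trait values from each normalization method, and for each relationship type, gives the kind of side-by-side comparison described above.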
Results and discussion
Table: Familial correlation coefficients and maximum LOD score of non-parametric linkage analysis after RMA normalization. Recoverable column headings: Target gene loci; Familial correlation coefficient; All arrays pool; Within pedigree pool.
Because systematic variation is present when gene expression data are generated, proper normalization of raw values is required to separate true signals from background noise. Previous gene expression studies have typically focused on unrelated individuals. With the emergence of genetic linkage studies of gene expression traits, the use of pedigrees raises concerns about proper normalization. We investigated the need to account for pedigrees by normalizing within families. Our results for the 18 linked traits show strong familial correlations, which generally increased when traits were normalized within pedigrees. However, the maximum LOD score decreased for 13 of the 18 traits when they were normalized within pedigrees. This is not surprising because the increased correlation of a quantitative trait within a family may decrease evidence for linkage. It is also possible that linkage signals are inflated when normalizing using all individuals as the pool. Because we did not evaluate linkage of these 18 traits to all markers on other chromosomes in this data set, we cannot evaluate possible inflation of linkage signals. However, for 15 of the 18 genes, the pattern of the chromosomal linkage plot did not differ between the two normalization strategies. This lack of difference suggests that normalization within pedigrees may not be necessary, at least in studies with small pedigree sizes. However, future linkage studies on expression data with larger pedigrees and/or larger sample sizes may benefit from normalization by pedigree.
This work was partly supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) (KRF 2005-213-C00007) and in part by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health. Some of the results of this paper were obtained by using the program package S.A.G.E., which is supported by a U.S. Public Health Service Resource Grant (RR03655) from the National Center for Research Resources.
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.
References
1. Hubner N, Wallace CA, Zimdahl H, Petretto E, Schulz H, Maciver F, Mueller M, Hummel O, Monti J, Zidek V, Musilova A, Kren V, Causton H, Game L, Born G, Schmidt S, Müller A, Cook SA, Kurtz TW, Whittaker J, Pravenec M, Aitman TJ: Integrated transcriptional profiling and linkage analysis for identification of genes underlying disease. Nat Genet. 2005, 37: 243-253. 10.1038/ng1522.
2. Morley M, Molony CM, Weber TM, Devlin JL, Ewens KG, Spielman RS, Cheung VG: Genetic analysis of genome-wide variation in human gene expression. Nature. 2004, 430: 743-747. 10.1038/nature02797.
3. Cheung VG, Spielman RS, Ewens KG, Weber TM, Morley M, Burdick JT: Mapping determinants of human gene expression by regional and genome-wide association. Nature. 2005, 437: 1365-1369. 10.1038/nature04244.
4. Chesler EJ, Lu L, Shou S, Qu Y, Gu J, Wang J, Hsu HC, Mountz JD, Baldwin NE, Langston MA, Threadgill DW, Manly KF, Williams RW: Complex trait analysis of gene expression uncovers polygenic and pleiotropic networks that modulate nervous system function. Nat Genet. 2005, 37: 233-242. 10.1038/ng1518.
5. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4: 249-264. 10.1093/biostatistics/4.2.249.
6. Chesler EJ, Bystrykh L, de Haan G, Cooke MP, Su A, Manly KF, Williams RW: Reply to "Normalization procedures and detection of linkage signal in genetical-genomics experiments". Nat Genet. 2006, 38: 856-858. 10.1038/ng0806-856.
7. Gautier L, Cope L, Bolstad BM, Irizarry RA: affy – analysis of Affymetrix GeneChip data at the probe level. Bioinformatics. 2004, 20: 307-315. 10.1093/bioinformatics/btg405.
8. Whittemore AS, Halpern J: A class of tests for linkage using affected pedigree members. Biometrics. 1994, 50: 118-127. 10.2307/2533202.
9. Kong A, Cox NJ: Allele-sharing models: LOD scores and accurate linkage tests. Am J Hum Genet. 1997, 61: 1179-1188. 10.1086/301592.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.