Linkage analysis using principal components of gene expression data
BMC Proceedings volume 1, Article number: S79 (2007)
The goal of this paper is to investigate the effect of using principal components as a data reduction method for expression data in linkage analysis. We used 45 probes normalized using the Affymetrix Global Scaling that had evidence of high heritability to estimate the first 10 principal components (PC). A genome-wide linkage scan was performed on the 45 expression values and the 10 PCs using 2272 single-nucleotide polymorphisms. Our conclusions were: 1) PC analyses under-performed the single-probe analysis for known signals; 2) the PC that best reproduced the single-probe analysis was primarily composed of that probe; 3) no new signals were detected in the PC analysis; 4) no new pleiotropic effects were detected in the PC analysis.
There is great interest in understanding genetic factors related to variable expression of genes. Recently, several studies have shown the first evidence of heritability of mRNA between individuals [1–6]. By treating the expression phenotypes for each transcript (or probe) as a quantitative phenotype, a variance-components linkage analysis could be used [1–6]. The expectation is to detect linkage signals between the gene expression values and genomic regions. As pointed out by William et al. , several issues plague these studies, including the selection of informative expression values. Principal components (PC) is a dimension-reduction approach  and it has been shown to be a valuable tool in linkage analysis of correlated phenotypes . Multivariate linkage analysis has been shown to be useful in identifying genomic regions with pleiotropic effect . Given that the PC approach is another way to combine information from multiple phenotypes, it can be hypothesized that PC analysis might also be helpful in the identification of pleiotropic effects. Because of the large number of expression phenotypes in the Genetic Analysis Workshop 15 (GAW15) Problem 1 data set, we first selected the most genetically informative phenotypes based on those with high heritability estimates . In this paper, we examine whether combinations of correlated expression phenotypes improved the linkage signals using PC and whether the PC results suggest new pleiotropic effects.
The GAW15 Problem 1 CEPH (Centre d'Etude du Polymorphisme Humain) data consisted of 196 participants from 14 three-generation pedigrees with 14 individuals per family, 4 grandparents, 2 parents, and 8 offspring. Two hundred and seventy-six arrays including data on 3554 probe sets on the Affymetrix Human Focus Arrays were provided by GAW15. These probe sets had been selected as those with greatest inter-individual variability from a total of 8500 probe sets .
Selection of phenotype subsets
To increase the number of informative phenotypes, we excluded genes whose expression had little variation (standard deviation ≤0.3) and low call rates (absent calls >90%) across samples; 3306 phenotypes (probe sets) remained. We further reduced the number by identifying those that were most likely to be genetic based on heritability estimates from a polygenic model, resulting in 45 phenotypes. Additional details of the selection process can be found in de Andrade et al. .
Principal-components (PC) analysis is a data reduction technique in which each component is a linear combination of the phenotypes, each PC describing as much variability of the phenotypes as possible . Because the 45 phenotypes are on a common scale, the decompositions were made using the unscaled covariance matrix. The first ten PCs accounted for 84% of the variance in the 45 phenotypes; 14 components would have been required to explain 90% of the variance.
For a subset of subjects, including founders, we observed a large number of missing genotypes. Recognizing that missing data can impact identity-by-descent (IBD) estimation  when there is linkage disequilibrium, we reduced the extent of linkage disequilibrium between single-nucleotide polymorphisms (SNPs) by removing SNPs with r2 > 0.30 using ldSelect . Of the 2756 markers provided by GAW15, 2272 remained (mean spacing 1.2 cM). We then removed 2205 Mendelian inconsistencies (0.5% of matings/genotypes) primarily by removing the conflicting offspring genotypes. Multipoint IBD (MIBD) sharing among pairs of relatives was calculated using SIMWALK2 .
Quantitative trait linkage analysis
Prior to the linkage analysis, the 55 phenotypes (45 expression phenotype + 10 PCs) were normally transformed using the empirical normal quantile transformation , which has been shown to have increased power for variance-components analysis . Variance-components linkage analyses were performed using the S-Plus/R library multic . Sex was used as a covariate. We assessed the 55 phenotypes for evidence of linkage and considered "strong" linkage evidence as p < 10-9, which is comparable to Table 1 in Morley et al. , and "moderate" linkage evidence as p < 10-4 for comparison with the single probe analyses. Finally, for the 45 expression phenotype models, we used a screening tool proposed by de Andrade et al.  to estimate bivariate linkage results. For those phenotypes that suggested strong bivariate linkage using the screening tool, bivariate linkage analysis was also performed .
Figure 1 shows the relative weighting for the first five PCs. For each component, the bar height represents the relative influence of a particular probe. Also included is the gene and gene location associated with each expression phenotype. The first PC is dominated by probe 209480.at, which has a relative weighting of 0.979. This first probe was the only component to show any increased linkage signal, changing from a LOD of 9.0 to 10.7. One possible reason for this increase is that the first PC is composed of two phenotypes (209480.at, 204769.s.at) that are associated with the HLA region on chromosome 6. Figure 2 shows the linkage analysis for chromosome 6 using these two phenotypes. Separate lines are drawn for the two univariate analyses, the first PC, and the bivariate analysis using these probes.
Table 1 compares our results with those of Morley et al. . They used the Affymetrix normalization method and their multipoint genome-wide linkage analysis was done using SIBPAL in S.A.G.E. . Two of our six best linkage signals for the single phenotype analysis agreed with Morley's; four, including the HLA region identified using the PC approach, were not found by Morley. For the remaining nine top phenotypes identified by Morley et al., we were unable to compare the results because the specific phenotypes were not part of our final 45.
Additional review of all the bivariate estimates using the de Andrade screening approach showed only two new areas to investigate that would not have been previously flagged using a criteria of p < 10-9. The DDX17 gene signal, identified by Morley using probe 208151.x.at (22q13.1), was increased from p < 10-6 to p < 10-8 when used with probe 207598.x.at (XRCC2, 7q36) in a bivariate analysis. One area in chromosome 6 had a screening p-value of 10-8 using probes 220386.s.at and 320.at but actual bivariate analysis yielded a p-value similar to the stronger of the phenotypes (10-6). Neither of these potential associations were strongly grouped in the first 10 PCs. The Dead/H Box 17 (DDX17) is a member of the DEAD box (asp-glu-ala-asp/his) protein family of RNA helicases that are involved in diverse cellular functions including mRNA splicing, ribosome assembly, translation initiation, mRNA stability, and cell growth and division, and XRCC2 is essential for the efficient repair of DNA double-strand breaks by homologous recombination between sister chromatids.
We compared the total number of LOD scores greater than three across the genome for the first 10 PCs with the regions identified by each of the 45 phenotypes and found that in general, the PC analyses under-performed the single probe set analysis for known signals. The component that best reproduced the single probe set analysis (Component 1) was primarily composed of that probe set and also included another probe set that focused on that region. No new signals were detected using PCs despite the strong correlation between the probes. However, we observed a strong linkage signal on chromosome 6 in the HLA region when two probe sets from the HLA region were analyzed as bivariate traits. The two other increases in linkage signals from bivariate analysis that increased our list of interesting areas were not picked up using PCs.
We observed that although PC has been suggested as a potentially useful screening tool for identifying genes linked to a cluster of highly correlated variables/phenotypes, it was not helpful in identifying linkage signals in this data set. Based on this analysis, other data reduction techniques should be investigated.
Schadt EE, Monks SA, Drake TA, Lusis AJ, Che N, Colinayo V, Ruff TG, Milligan SB, Lamb JR, Cavet G, Linsley PS, Mao M, Stoughton RB, Friend SH: Genetics of gene expression surveyed in maize, mouse and man. Nature. 2003, 422: 297-302. 10.1038/nature01434.
Morley M, Molony CM, Weber TM, Devlin JL, Ewens KG, Spielman RS, Cheung VG: Genetic analysis of genome-wide variation in human gene expression. Nature. 2004, 430: 743-747. 10.1038/nature02797.
Cheung VG, Spielman RS, Ewens KG, Weber TM, Morley M, Burdick JT: Mapping determinants of human gene expression by regional and genome-wide association. Nature. 2005, 437: 1365-1369. 10.1038/nature04244.
Chesler EJ, Lu L, Shou S, Qu Y, Gu J, Wang J, Hsu HC, Mountz JD, Baldwin NE, Langston MA, Threadgill DW, Manly KF, Williams RW: Complex trait analysis of gene expression uncovers polygenic and pleiotropic networks that modulate nervous system function. Nat Genet. 2005, 37: 233-242. 10.1038/ng1518.
Hubner N, Wallace CA, Zimdahl H, Petretto E, Schulz H, Maciver F, Mueller M, Hummel O, Monti J, Zidek V, Musilova A, Kren V, Causton H, Game L, Born G, Schmidt S, Muller A, Cook SA, Kurtz TW, Whittaker J, Pravenec M, Aitman TJ: Integrated transcriptional profiling and linkage analysis for identification of genes underlying disease. Nat Genet. 2005, 37: 243-253. 10.1038/ng1522.
William RBH, Cotsapas CJ, Cowley MJ, Chan E, Nott DJ, Little PFR: Normalization procedures and detection of linkage signal in genetical-genomics experiments. Nat Genet. 2006, 38: 855-856. 10.1038/ng0806-855.
Anderson TW: An Introduction to Multivariate Statistical Analysis. 1984, New York: John Wiley & Sons, 2
Ott J, Rabinowitz D: A principal-components approach based on heritability for combining phenotype information. Hum Hered. 1999, 49: 106-111. 10.1159/000022854.
Almsay L, Blangero J: Multipoint quantitative-trait linkage analysis in general pedigrees. Am J Hum Genet. 1998, 62: 1198-1211. 10.1086/301844.
de Andrade M, Atkinson EJ, Fridley BL, Goode EL, McDonnell S, Liu-Mares W, Rabe KG, Sun Z, Slager SL: The genetics of gene expression: comparison of linkage scans using two phenotype normalization methods. BMC Proc. 2007, 1 (Suppl 1): S151-
Huang Q, Sanja S, Amos C: Ignoring linkage disequilibrium among tightly linked markers induces false-positive evidence of linkage for affected sib pair analysis. Am J Hum Genet. 2004, 75: 1106-1112. 10.1086/426000.
Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA: Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet. 2004, 74: 106-120. 10.1086/381000.
Sobel E, Sengul H, Weeks DE: Multipoint estimation of identity-by-descent probabilities at arbitrary positions among marker loci on general pedigrees. Hum Hered. 2001, 52: 121-131. 10.1159/000053366.
Lehman EL: Nonparametrics: Statistical Methods Based On Ranks. 1975, San Francisco: Holden-Day, Inc, 97-
Peng B, Yu RK, DeHoff KL, Amos CI: Normalizing a large number of quantitative traits using empirical normal quantile transformation. BMC Proc. 2007, 1 (Suppl 1): S156-
de Andrade M, Atkinson EJ, Lunde EM, Amos CI, Chen J: Estimating Genetic Components of Variance in Family Studies using the Multic Routines. Technical Report Series No. 78. 2006, Rochester, Minnesota: Department of Health Sciences Research, Mayo Clinic
de Andrade M, Olswold C, Slusser JP, Tordsen LA, Atkinson EJ, Rabe KG, Slager SL: Identification of genes involved in alcohol consumption and cigarette smoking. BMC Genetics. 2005, 6: S112-10.1186/1471-2156-6-S1-S112.
de Andrade M, Amos CI: Multivariate linkage analysis. Biostatistical Genetics and Genetic Epidemiology. Edited by: Elson R, Olson J, Palmer L. 2002, New York: John Wiley & Sons, 564-568.
This research was partially funded by NIH grants R01HL71917 and CA94919.
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.
The author(s) declare that they have no competing interests.
About this article
Cite this article
Atkinson, E.J., Fridley, B.L., Goode, E.L. et al. Linkage analysis using principal components of gene expression data. BMC Proc 1 (Suppl 1), S79 (2007). https://doi.org/10.1186/1753-6561-1-S1-S79