Association tests based on the principal-component analysis
© Oh and Park; licensee BioMed Central Ltd. 2007
Published: 18 December 2007
Haplotypes are specific combinations of alleles at several loci on the same chromosome. Because haplotypes incorporate linkage disequilibrium (LD) information from multiple loci, haplotype-based association analyses can provide greater power than single-marker analyses in association studies. However, when haplotypes are constructed from many markers simultaneously, the large number of haplotypes can lead to a sparseness problem. In this paper, we propose the principal-component (PC) association test as an alternative to the haplotype-based association test. We define PC scores from the LD blocks and perform the association test using logistic regression. The proposed PC test was applied to the analysis of the Genetic Analysis Workshop 15 simulated data set. Knowing the answers to Problem 3, we evaluated the performance of the PC test and the haplotype-based association test using the Akaike Information Criterion (AIC), power, and type I error. The PC test performed better than the haplotype-based association test in the sense that it tended to have smaller AIC values and slightly greater power.
Recently, several studies have shown that haplotypes may offer more powerful information on genetic association with traits than single-nucleotide polymorphisms (SNPs) [1, 2]. Haplotypes are specific combinations of allelic variants at a series of tightly linked markers on the same chromosome, and they incorporate linkage disequilibrium (LD) information from multiple loci. If the markers in an LD block are only weakly in LD with one another, a large number of distinct haplotypes arise. Several methods have been proposed to test whether haplotypes are associated with a disease trait. In these association studies, haplotypes are treated as covariates in logistic regression models [3–6]. However, the haplotype-based association test runs into difficulty when the number of haplotypes is large. If there are m biallelic markers, then the maximum number of haplotypes is $2^m$. When there are many haplotypes, parameter estimation is difficult because of the large number of parameters and the sparseness of the data.
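To make the growth in the number of parameters concrete, the following minimal Python sketch enumerates the possible haplotypes for m biallelic markers. The function name `enumerate_haplotypes` and the 0/1 allele labels are illustrative assumptions and are not part of the original analysis.

```python
from itertools import product

def enumerate_haplotypes(m):
    """Return all possible haplotypes for m biallelic markers (alleles labeled 0/1)."""
    return ["".join(alleles) for alleles in product("01", repeat=m)]

for m in (2, 4, 6):
    haplotypes = enumerate_haplotypes(m)
    # With one indicator per haplotype and one reference category,
    # a regression model needs up to 2**m - 1 haplotype coefficients.
    print(m, len(haplotypes), len(haplotypes) - 1)
```

In practice far fewer haplotypes are observed in a block with strong LD, but the upper bound shows why blocks with many weakly linked markers lead to sparse data.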
To address this problem, we propose the principal-component (PC) association test as an alternative to the haplotype-based association test. PC scores are derived from the LD blocks. The full set of PC scores carries the same amount of information as the haplotypes. In general, the first few PC scores tend to capture most of the information about an LD block. Thus, using the first few PC scores may produce results similar to those obtained with the full set of haplotypes, but with fewer parameters.
The proposed PC score test was applied to the analysis of the Genetic Analysis Workshop 15 simulated data set (Problem 3), which includes 100 replicates. Each replicate contains a random sample of 1500 families with an affected sibling pair (ASP), and a randomly selected member of the offspring generation from each of the 2000 unaffected control families. Knowing the answers to Problem 3, we evaluated the performance of the PC test and the haplotype-based association test using the Akaike Information Criterion (AIC) [7], power, and type I error.
Genotype data and sample
We used all 100 replicates of the chromosome 6 sparse SNP data set. We first performed the transmission/disequilibrium test (TDT) [8] and the Hardy-Weinberg equilibrium test on the family data sets. We excluded markers with minor allele frequencies < 0.01. We selected samples of unrelated individuals, comprising one sib from each ASP family (1500 individuals) and 500 controls.
We considered SNP markers in LD with D' > 0.7. We selected eight LD blocks, where Blocks 1 to 4 are known not to be associated with rheumatoid arthritis (RA) and Blocks 5 to 8 are known to be associated with RA. Each LD block contained two to six markers.
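As an illustration of the D' > 0.7 criterion, the sketch below computes Lewontin's D' for a pair of biallelic SNPs from a two-locus haplotype frequency and the two allele frequencies. It assumes these frequencies are already available (for example, from phased data or an EM estimate); the function name `d_prime` and the example frequencies are hypothetical.

```python
def d_prime(p_ab, p_a, p_b):
    """Absolute value of Lewontin's D' for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and allele B at locus 2
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return 0.0  # at least one locus is monomorphic, so D' is set to 0 here
    return abs(d) / d_max

# Hypothetical frequencies: D = 0.45 - 0.25 = 0.20, D_max = 0.25, so D' is about 0.8
print(d_prime(0.45, 0.5, 0.5) > 0.7)
```

Under this criterion the pair in the example would be placed in the same LD block.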
Haplotype-based association test
For the selected LD blocks, haplotypes and their frequencies were estimated by the expectation-maximization (EM) algorithm. We then performed the haplotype-association tests by fitting logistic regression models. In this association study, we pooled the minor haplotypes with frequencies less than 0.05. The effect of a haplotype can be assumed to be additive, dominant, or recessive. In our analysis, we assumed an additive effect of haplotypes and performed the test using haplo.glm.
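The pooling and additive coding steps described above can be sketched in Python as follows. This is not the haplo.glm implementation; it assumes that each individual's haplotype pair is known, whereas in practice phase is uncertain and is handled through the EM estimates. The function names, the `"rare"` label, and the example frequencies are hypothetical.

```python
from collections import Counter

def pool_rare_haplotypes(freqs, threshold=0.05, pooled_label="rare"):
    """Merge haplotypes with frequency below `threshold` into one pooled category."""
    pooled = {}
    for hap, f in freqs.items():
        key = hap if f >= threshold else pooled_label
        pooled[key] = pooled.get(key, 0.0) + f
    return pooled

def additive_coding(hap_pair, haplotypes):
    """Count the copies (0, 1, or 2) of each listed haplotype carried by one individual."""
    counts = Counter(hap_pair)
    return [counts.get(h, 0) for h in haplotypes]

# Hypothetical estimated frequencies for a three-SNP block
freqs = {"000": 0.52, "011": 0.30, "101": 0.12, "110": 0.04, "111": 0.02}
print(pool_rare_haplotypes(freqs))
# Covariates for an individual carrying haplotypes 000 and 011
print(additive_coding(("000", "011"), ["011", "101"]))
```

Each retained haplotype other than the reference then enters the logistic regression as one additive covariate.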
PC score association test
We first determined whether the effect of each SNP in an LD block is additive, dominant, or recessive. If the effect of the SNP is additive, the SNP is coded as 0, 1, or 2 according to the number of minor alleles. For a dominant or recessive effect, it is coded as 0 or 1. We then performed the PC analysis on each LD block and calculated the PC scores. For a given LD block, suppose there are k SNPs denoted by $s_1, s_2, \ldots, s_k$, where each $s_j$ is coded as 0, 1, or 2. Then, the PC scores are defined as follows:
$$\mathrm{PC}_i = e_i' S,$$
where $\mathrm{PC}_i$ is the ith PC score, $e_i$ is the corresponding eigenvector, and $S = [s_1, s_2, \ldots, s_k]$ is the score vector of SNPs. In our analysis, we assumed only the additive effect of SNPs. In each block, we retained the number of PC scores needed to account for at least 70% of the total variation, which ranged from one to three. We then fitted logistic regression models with these PC scores as covariates.
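The PC score construction can be sketched with NumPy as follows. The sketch assumes that the eigenvectors are taken from the sample covariance matrix of the additively coded genotypes and that the genotypes are mean-centered before projection; the text does not state either choice. Centering only shifts each score by a constant, which is absorbed by the intercept of a logistic regression. The function name `pc_scores` and the example genotypes are hypothetical.

```python
import numpy as np

def pc_scores(genotypes, var_threshold=0.70):
    """PC scores for one LD block.

    genotypes: (n_individuals, k_snps) array with additive coding 0/1/2;
               at least one SNP must be polymorphic.
    Returns (scores, eigenvectors, explained_ratio) for the retained PCs.
    """
    S = np.asarray(genotypes, dtype=float)
    centered = S - S.mean(axis=0)
    cov = np.atleast_2d(np.cov(centered, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = eigvals / eigvals.sum()
    # Smallest number of PCs whose cumulative variance reaches the threshold
    n_keep = int(np.searchsorted(np.cumsum(ratio), var_threshold)) + 1
    n_keep = min(n_keep, S.shape[1])
    E = eigvecs[:, :n_keep]
    return centered @ E, E, ratio[:n_keep]

# Hypothetical block of three SNPs genotyped in six individuals
block = [[0, 0, 1], [1, 1, 0], [2, 2, 1], [1, 1, 2], [0, 0, 0], [2, 1, 1]]
scores, E, ratio = pc_scores(block)
print(scores.shape, ratio)
```

Each column of `scores` would then serve as one covariate in the logistic regression for that block.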
Comparison of PC score and haplotype-based association tests
The association tests were performed using logistic regression with either PC scores or haplotypes as covariates. Using the Akaike Information Criterion (AIC), power, and type I error, we evaluated the performance of the PC test and the haplotype-based association test.
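The comparison criteria can be illustrated with a small NumPy sketch that fits a logistic regression by Newton-Raphson and computes the AIC and the likelihood ratio statistic for the global null hypothesis. It is a generic illustration rather than the software used in the analysis; the function names and the simulated covariate are hypothetical, and the AIC here counts the intercept as a parameter.

```python
import numpy as np

def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Fit a logistic regression with intercept; returns (beta, log_likelihood).

    X: (n, p) covariates without an intercept column, or None for the null model.
    y: (n,) disease status coded 0/1.
    """
    y = np.asarray(y, dtype=float)
    design = np.ones((len(y), 1))
    if X is not None:
        design = np.column_stack([design, np.asarray(X, dtype=float)])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        gradient = design.T @ (y - p)
        hessian = design.T @ (design * (p * (1 - p))[:, None])
        step = np.linalg.solve(hessian, gradient)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-design @ beta))
    loglik = float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))
    return beta, loglik

def aic(loglik, n_params):
    """Akaike Information Criterion: 2 * (number of parameters) - 2 * log-likelihood."""
    return 2 * n_params - 2 * loglik

def lr_statistic(loglik_full, loglik_null):
    """Likelihood ratio statistic for H0: all covariate coefficients are zero."""
    return 2 * (loglik_full - loglik_null)

# Simulated example: one covariate standing in for a PC score
rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = (rng.random(300) < 1 / (1 + np.exp(-0.8 * x))).astype(float)
beta, ll_full = fit_logistic(x.reshape(-1, 1), y)
_, ll_null = fit_logistic(None, y)
print(aic(ll_full, n_params=2), lr_statistic(ll_full, ll_null))
```

Under the null hypothesis the LR statistic is compared with a chi-square distribution whose degrees of freedom equal the number of covariates, so a model with fewer covariates has a test with fewer degrees of freedom.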
AIC of PC score and haplotype-based association tests
Table 1. Results of the PC score-based and haplotype-based tests using Replicate 48. Columns: LD block (SNP); number of PC scores; association test via logistic regression (H0: global β = 0). Rows are grouped into blocks not associated with RA and blocks associated with RA.
Using the likelihood ratio (LR) statistic, we tested the global hypothesis that all β = 0. The LR statistics in Table 1 show that, for both the PC score and haplotype-based methods, Blocks 1 to 4 are not significantly associated with RA and Blocks 5 to 8 are significantly associated with RA, except for Block 6. In addition, the degrees of freedom of the PC score tests are smaller than those of the haplotype-based association tests.
Table 1 also shows that the association tests based on PC scores and on haplotypes have almost the same AIC values, although the AIC values of the PC score tests are slightly smaller than those of the haplotype-based association tests.
Type I error and power
Table 2. Type I errors and powers of the PC score-based and haplotype-based tests from the 100 replicates. Type I error is reported for the blocks not associated with RA, and power for the blocks associated with RA.
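Empirical type I error and power over replicates are both rejection rates of the global test, computed on blocks without and with a true association, respectively. The sketch below shows this calculation; the significance level of 0.05 and the example p-values are assumptions made for illustration and are not values from the analysis.

```python
def rejection_rate(p_values, alpha=0.05):
    """Fraction of replicates in which the global test rejects H0 at level alpha."""
    if not p_values:
        raise ValueError("need at least one replicate")
    return sum(p < alpha for p in p_values) / len(p_values)

# Hypothetical p-values from four replicates of one block
null_block_p = [0.41, 0.03, 0.77, 0.52]            # block with no true association
assoc_block_p = [0.001, 0.020, 0.300, 0.004]       # block with a true association

print(rejection_rate(null_block_p))    # empirical type I error
print(rejection_rate(assoc_block_p))   # empirical power
```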
Discussion and conclusion
In this paper, we proposed using PC scores for the association test as an alternative to the haplotype-based test. Using PC scores reduces the number of parameters in the logistic regression, so the proposed method should be especially useful when the number of haplotypes is large. In our analysis, the PC score test had smaller AIC values than the haplotype-based test while using fewer parameters.
PC analysis has mainly been applied to quantitative variables. However, it has also been used successfully to analyze discrete SNP data, mainly for the selection of SNPs. For example, Horne and Camp [9] proposed a PCA method for identifying LD groups and selecting optimal SNP sets that capture sufficient intragenic genetic diversity. Lin and Altman [10] proposed using PCA to find haplotype tagging SNPs. Unlike these previous methods, our method focuses on association studies using PCA.
One drawback of the PC score test is that the interpretation of the scores is not straightforward. In particular, the biological meaning of the PC scores is not easily obtained. In our study, a significant result for the PC scores implies that some SNPs in the LD block are associated with the disease. Among the SNPs in the LD block, the SNP with the largest component in the eigenvector has the greatest impact on the disease.
The PC score test has several advantages. First, it achieves dimension reduction: it greatly reduces the number of parameters and, as a result, avoids the sparseness of data. Second, it can easily handle more complicated association studies, such as those involving gene × gene interactions. The haplotype-based test, on the other hand, cannot easily handle gene × gene interactions across different chromosomes. To do so, the haplotype-based approach needs to consider haplotype × haplotype interactions, which require a much larger number of parameters and cannot be handled easily.
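The difference in parameter counts for an interaction model can be illustrated with a short sketch. It assumes that interactions are modeled as pairwise products of the covariates from two blocks, which is one common choice and is not specified in the text; the function names and the example counts are hypothetical.

```python
from itertools import product

def interaction_terms(scores_a, scores_b):
    """Pairwise products of one individual's covariates from two LD blocks."""
    return [a * b for a, b in product(scores_a, scores_b)]

def interaction_param_count(n_a, n_b):
    """Main-effect plus interaction coefficients for blocks with n_a and n_b covariates."""
    return n_a + n_b + n_a * n_b

# Two PC scores per block versus five haplotype covariates per block
print(interaction_param_count(2, 2))   # 8 coefficients
print(interaction_param_count(5, 5))   # 35 coefficients
print(interaction_terms([1.0, 2.0], [3.0]))
```

Because the number of interaction terms grows with the product of the covariate counts, keeping only one to three PC scores per block limits the size of the interaction model.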
In summary, the proposed PC score method may be applied to classification analysis and to other interaction studies, such as those of gene × environment interactions. Furthermore, PC scores are summary measures of LD blocks. Thus, we recommend using these measures for the possible construction of gene regulatory networks, which we will investigate in the future.
The authors thank Soon Sun Kwon for many helpful comments. The work was supported by the National Research Laboratory Program of the Korea Science and Engineering Foundation (M10500000126).
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.
1. Akey J, Xiong M: Haplotypes vs single marker linkage disequilibrium tests: what do we gain? Eur J Hum Genet. 2001, 9: 291-300. 10.1038/sj.ejhg.5200619.
2. Morris RW, Kaplan NL: On the advantage of haplotype analysis in the presence of multiple disease susceptibility alleles. Genet Epidemiol. 2002, 23: 221-233. 10.1002/gepi.10200.
3. Zaykin DV, Westfall PH, Young SS, Karnoub MA, Wagner MJ, Ehm MG: Testing association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals. Hum Hered. 2002, 53: 79-91. 10.1159/000057986.
4. Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA: Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet. 2002, 70: 425-434. 10.1086/338688.
5. Lake SL, Lyon H, Tantisira K, Silverman EK, Weiss ST, Laird NM, Schaid DJ: Estimation and tests of haplotype-environment interaction when linkage phase is ambiguous. Hum Hered. 2003, 55: 56-65. 10.1159/000071811.
6. Epstein MP, Satten GA: Inference on haplotype effects in case-control studies using unphased genotype data. Am J Hum Genet. 2003, 73: 1316-1329. 10.1086/380204.
7. Akaike H: A new look at the statistical model identification. IEEE Trans Automatic Control. 1974, 19: 716-723. 10.1109/TAC.1974.1100705.
8. Spielman RS, McGinnis RE, Ewens WJ: Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993, 52: 506-516.
9. Horne BD, Camp NJ: Principal component analysis for selection of optimal SNP-sets that capture intragenic genetic variation. Genet Epidemiol. 2004, 26: 11-21. 10.1002/gepi.10292.
10. Lin Z, Altman RB: Finding haplotype tagging SNPs by use of principal component analysis. Am J Hum Genet. 2004, 75: 850-861. 10.1086/425587.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.