Genome-wide association study for multiple phenotype analysis
- Xuan Deng^{1},
- Biqi Wang^{1},
- Virginia Fisher^{1},
- Gina Peloso^{1},
- Adrienne Cupples^{1} and
- Ching-Ti Liu^{1}
- Published: 17 September 2018
Abstract
Genome-wide association studies often collect multiple phenotypes for complex diseases. Multivariate joint analyses have higher power to detect genetic variants compared with the marginal analysis of each phenotype and are also able to identify loci with pleiotropic effects. We extend the unified score-based association test to incorporate family structure, apply different approaches to analyze multiple traits in GAW20 real samples, and compare the results. Through simulation studies, we confirm that the Type I error rate of the pedigree-based unified score association test is appropriately controlled. In marginal analysis of triglyceride levels, we found 1 subgenome-wide significant variant on chromosome 6. Joint analyses identified several suggestive genome-wide significant signals, with the pedigree-based unified score association test yielding the greatest number of significant results.
Background
The increasing availability of high-density genomic data with thousands of samples enables the identification of single-nucleotide polymorphisms (SNPs) contributing to complex traits on a genome-wide scale. Research studies often collect data on multiple related phenotypes to better understand disease structure; however, genome-wide association studies (GWAS) commonly analyze each trait independently. For example, body mass index (BMI) and waist-to-hip ratio (WHR) are both proxy traits for obesity and are commonly collected in obesity-related studies. The standard approach analyzes each phenotype separately and reports the findings of each analysis, ignoring the dependency among traits. Approaches based on joint analysis have been proposed to handle multiple phenotypes. Yang and Wang [1] and Ott and Wang [2] described a number of these approaches in detail, including multivariate regression models, variable reduction methods such as principal component analysis, and canonical correlation analysis. However, no single approach is uniformly the most powerful across all situations. The sum of squared score (SSU) test does not explicitly incorporate trait correlation, and multivariate analysis of variance (MANOVA) could fail to detect pleiotropy when a strong trait correlation exists and the traits have the same direction of association [3]. Considered to be an optimally weighted combination of MANOVA and SSU, the unified score-based association test (USAT) by Ray et al. [3] may provide higher power, especially for detecting pleiotropy.
We aimed to study the performance of various approaches for jointly analyzing multiple phenotypes. We first reviewed existing methods. We then extended USAT to related samples as a pedigree-based USAT (pUSAT). Through simulations, we found that the Type I error rate of pUSAT was well preserved. Finally, we analyzed GAW20 real data using multiple phenotype methods and compared the results.
Methods
Assume K correlated phenotypes Y_{1},…, Y_{K} in N individuals. Let Y_{k} be the N × 1 vector of the k^{th} phenotype and Y be the N × K matrix of all phenotypes for all individuals. The test of interest is the association of a single variant with the K phenotypes. Suppose G_{i} is the genotype score (i.e., the count of the minor allele: 0, 1, or 2) for a SNP of interest i, and G is the N × 1 vector of genotypes for all individuals. Moreover, define C = (c_{1}, …, c_{q}) as the N × q matrix of q adjustment covariates for all samples.
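To make the notation concrete, the following is a minimal sketch of how a per-phenotype score vector for one SNP could be computed from Y and G, and how a sum-of-squared-scores statistic follows from it. It is an illustration under simplifying assumptions, not the implementation used in this study: individuals are treated as unrelated, the only covariate is an intercept (so adjustment reduces to mean-centering), and the scores are left unscaled. The function names `score_vector` and `ssu_statistic` are hypothetical.

```python
def center(values):
    """Subtract the mean, i.e. adjust for an intercept-only covariate matrix C."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def score_vector(Y, G):
    """Unscaled score for each of the K phenotypes at one SNP.

    Y: list of N rows, each a list of K phenotype values (the N x K matrix).
    G: list of N genotype scores (0, 1, or 2 copies of the minor allele).
    Assumes unrelated individuals; a pedigree-based analysis would also
    have to account for the kinship between relatives.
    """
    if len(Y) != len(G):
        raise ValueError("Y and G must describe the same N individuals")
    g_centered = center(G)
    n_traits = len(Y[0])
    scores = []
    for k in range(n_traits):
        y_centered = center([row[k] for row in Y])
        scores.append(sum(g * y for g, y in zip(g_centered, y_centered)))
    return scores


def ssu_statistic(scores):
    """Sum of squared scores; it does not use the correlation among traits."""
    return sum(u * u for u in scores)


# Toy data: N = 4 individuals, K = 2 phenotypes (values are made up).
Y = [[1.0, 2.0], [2.0, 1.0], [3.0, 0.0], [4.0, 3.0]]
G = [0, 1, 1, 2]
U = score_vector(Y, G)
print(U, ssu_statistic(U))
```

Because the statistic only sums squared scores, two strongly correlated traits contribute as if they were independent, which is the limitation of the SSU test noted in the Background.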
Marginal linear mixed model
SSU test
Multivariate linear mixed model
USAT and pUSAT
Phenotypic and genotypic data
GAW20 provides dense genome-wide SNPs for 821 pedigree-based individuals with measured triglyceride (TG) and high-density lipoprotein cholesterol (HDL-C) levels. We used the log-transformed average of the pretreatment values of TG and HDL-C levels at visits 1 and 2, and investigated the pleiotropic variants involved in blood lipids. The GAW20 data were genotyped using the Affymetrix Genome-wide Human SNP Array 6.0. SNPs were excluded if they had a call rate < 95%, a minor allele frequency < 5%, or failed the Hardy-Weinberg equilibrium test (p value < 10e-6), leaving a total of 587,358 variants. Individuals with more than 5% missing genotypes were also excluded from analysis.
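The SNP-level filters described above can be sketched as follows. This is an illustrative sketch rather than the pipeline used in the study: the text does not say which Hardy-Weinberg test was applied, so a 1-degree-of-freedom chi-square goodness-of-fit test is assumed here, missing genotypes are encoded as `None`, the Hardy-Weinberg cutoff is passed in as an argument rather than hard-coded, and all function names are hypothetical.

```python
import math


def minor_allele_frequency(called):
    """MAF from called genotype scores (0, 1, or 2 copies of one allele)."""
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def hwe_chisq_pvalue(called):
    """Chi-square (1 df) goodness-of-fit p value for Hardy-Weinberg equilibrium.

    Assumption: a chi-square test; an exact test is a common alternative.
    """
    n = len(called)
    p = sum(called) / (2 * n)
    q = 1 - p
    if p == 0 or p == 1:
        return 1.0  # monomorphic SNP: no departure can be tested
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    observed = [called.count(0), called.count(1), called.count(2)]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 df: P(Z^2 > stat).
    return math.erfc(math.sqrt(stat / 2))


def passes_qc(genotypes, hwe_p_min, min_call_rate=0.95, min_maf=0.05):
    """Return True if a SNP survives the call-rate, MAF, and HWE filters."""
    called = [g for g in genotypes if g is not None]
    if not called or len(called) / len(genotypes) < min_call_rate:
        return False
    if minor_allele_frequency(called) < min_maf:
        return False
    return hwe_chisq_pvalue(called) >= hwe_p_min


# Toy SNP in perfect Hardy-Weinberg proportions; the cutoff value is illustrative.
snp = [0] * 25 + [1] * 50 + [2] * 25
print(passes_qc(snp, hwe_p_min=1e-6))
```

In pedigree data the genotypes of relatives are not independent, so in practice the Hardy-Weinberg test is often restricted to founders or handled by dedicated software; the sketch ignores this.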
Results
Simulation study
Estimated Type I errors of pUSAT for K = 2 phenotypes (α = 0.0)
Method | ρ = 0 | ρ = 0.25 | ρ = 0.5 | ρ = 0.75 |
---|---|---|---|---|
pUSAT | 0.027 | 0.032 | 0.036 | 0.038 |
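For readers unfamiliar with how such entries are obtained, an empirical Type I error rate is the fraction of null-simulation replicates whose p value falls below the nominal level α. The sketch below shows that calculation together with a binomial Monte Carlo standard error. It is a generic illustration, not the simulation code of this study; the function name, the toy p values, and the α used in the example are assumptions.

```python
import math


def empirical_type1_error(p_values, alpha):
    """Fraction of null replicates rejected at level alpha, with its standard error.

    p_values: p values from replicates simulated under the null hypothesis.
    alpha: nominal significance level.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("at least one replicate is required")
    rate = sum(p < alpha for p in p_values) / n
    std_error = math.sqrt(rate * (1 - rate) / n)
    return rate, std_error


# Toy input: four made-up null p values and an illustrative alpha.
rate, se = empirical_type1_error([0.01, 0.20, 0.03, 0.50], alpha=0.05)
print(rate, se)
```

An estimate below the nominal level indicates a conservative test, and the standard error shows how many replicates are needed to distinguish the estimate from α.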
Real data analysis
Descriptive statistics of variables in the analysis
Variable | Category | Men (N = 407) | Women (N = 414) | Total (N = 821) |
---|---|---|---|---|
Log of TG^{a} | | 4.86 (0.59) | 4.70 (0.56) | 4.78 (0.58) |
Log of HDL-C^{a} | | 3.69 (0.23) | 3.92 (0.26) | 3.80 (0.27) |
Age^{a} | | 48.29 (15.93) | 48.38 (15.84) | 48.34 (15.87) |
Field center^{b} | Minnesota | 206 (50.6%) | 205 (49.5%) | 411 (50%) |
Field center^{b} | Utah | 201 (49.4%) | 209 (50.5%) | 410 (50%) |
Smoking status^{b} | Never Smoker | 268 (65.8%) | 298 (72.0%) | 566 (68.9%) |
Smoking status^{b} | Past Smoker | 106 (26.0%) | 82 (19.8%) | 188 (22.9%) |
Smoking status^{b} | Current Smoker | 33 (8.1%) | 34 (8.2%) | 67 (8.2%) |
SNPs with suggestive genome-wide significance (p < 5 × 10^{−6}) in univariate or joint analysis*
SNP | Chr:Pos | TG (univariate LMM) | HDL-C (univariate LMM) | SSU (joint) | mvLMM (joint) | USAT (joint) | pUSAT (joint) |
---|---|---|---|---|---|---|---|
rs90513 | 1:3189344 | 3.33E-02 | 1.20E-06 | 1.30E-05 | 7.18E-06 | 2.36E-05 | 9.88E-06 |
rs11940232 | 4:138953336 | 6.32E-05 | 1.98E-05 | 1.47E-06 | 8.56E-06 | 2.65E-06 | 2.58E-06 |
rs17058802 | 4:173880215 | 5.66E-07 | 4.56E-03 | 2.23E-06 | 3.39E-06 | 4.13E-06 | 2.60E-06 |
rs708010 | 6:37071350 | 1.86E-04 | 4.69E-06 | 1.01E-06 | 5.48E-06 | 2.19E-06 | 2.12E-06 |
rs17619780 | 6:40472303 | 7.58E-08 | 2.22E-01 | 4.95E-06 | 1.59E-07 | 9.60E-06 | 3.01E-07 |
rs12533593 | 7:147451966 | 6.69E-03 | 2.28E-06 | 7.72E-06 | 1.22E-05 | 1.48E-05 | 1.01E-05 |
rs7300117 | 12:130266575 | 2.24E-05 | 8.66E-01 | 5.60E-04 | 3.92E-06 | 9.99E-04 | 8.58E-06 |
rs2880301 | 13:18998534 | 9.66E-01 | 7.29E-02 | 1.95E-01 | 1.19E-01 | 1.69E-13 | 2.22E-01 |
rs17464499 | 22:26221715 | 3.20E-02 | 4.48E-06 | 3.02E-05 | 2.67E-05 | 6.30E-05 | 3.29E-05 |
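As a reading aid for the table above, the sketch below tallies, for each analysis, how many of the listed SNPs fall below the suggestive threshold of 5 × 10^{−6}. The p values are copied from the table; the data layout and the function name `count_suggestive` are assumptions made for illustration. The tally covers only the SNPs listed here, not the full genome-wide scan.

```python
METHODS = ["TG", "HDL-C", "SSU", "mvLMM", "USAT", "pUSAT"]

# p values per SNP, in the column order given by METHODS (copied from the table).
P_VALUES = {
    "rs90513":    [3.33e-02, 1.20e-06, 1.30e-05, 7.18e-06, 2.36e-05, 9.88e-06],
    "rs11940232": [6.32e-05, 1.98e-05, 1.47e-06, 8.56e-06, 2.65e-06, 2.58e-06],
    "rs17058802": [5.66e-07, 4.56e-03, 2.23e-06, 3.39e-06, 4.13e-06, 2.60e-06],
    "rs708010":   [1.86e-04, 4.69e-06, 1.01e-06, 5.48e-06, 2.19e-06, 2.12e-06],
    "rs17619780": [7.58e-08, 2.22e-01, 4.95e-06, 1.59e-07, 9.60e-06, 3.01e-07],
    "rs12533593": [6.69e-03, 2.28e-06, 7.72e-06, 1.22e-05, 1.48e-05, 1.01e-05],
    "rs7300117":  [2.24e-05, 8.66e-01, 5.60e-04, 3.92e-06, 9.99e-04, 8.58e-06],
    "rs2880301":  [9.66e-01, 7.29e-02, 1.95e-01, 1.19e-01, 1.69e-13, 2.22e-01],
    "rs17464499": [3.20e-02, 4.48e-06, 3.02e-05, 2.67e-05, 6.30e-05, 3.29e-05],
}


def count_suggestive(p_values, methods, threshold=5e-6):
    """Count, per method, the SNPs whose p value is below the threshold."""
    counts = {method: 0 for method in methods}
    for snp_p in p_values.values():
        for method, p in zip(methods, snp_p):
            if p < threshold:
                counts[method] += 1
    return counts


def snps_below(p_values, methods, method, threshold=5e-6):
    """List the SNPs that pass the threshold for one chosen method."""
    column = methods.index(method)
    return [snp for snp, snp_p in p_values.items() if snp_p[column] < threshold]


print(count_suggestive(P_VALUES, METHODS))
print(snps_below(P_VALUES, METHODS, "pUSAT"))
```

The same helpers can be pointed at a stricter threshold to check which signals would survive a conventional genome-wide cutoff.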
Discussion and conclusions
The explosion in data collection and the increasing evidence that some loci affect multiple traits call for more complex statistical models to better understand the properties of association. Here, we reviewed several methods for analyzing multiple phenotypes in GWAS, and extended the USAT approach to related samples as pUSAT. The proposed method can provide insight into the underlying associations and help researchers identify pleiotropic loci, especially when prior information is unavailable. The simulation studies demonstrate that the Type I error rate of pUSAT is conservative under different trait correlations. We also applied the various methods to the GAW20 data with TG and HDL-C as the phenotypes. One suspicious locus was identified as GWA-significant by the regular USAT, which assumes independent individuals, whereas the other multivariate analyses missed this locus. Several suggestive GWA loci were detected by the joint multivariate analyses; pUSAT in particular highlights the importance of joint analysis for multiple phenotypes and yields smaller p values for most SNPs.
Declarations
Funding
Publication of this article was supported by NIH R01 GM031575.
Availability of data and materials
The data that support the findings of this study are available from the Genetic Analysis Workshop (GAW), but restrictions apply to the availability of these data, which were used under license for the current study. Qualified researchers may request these data directly from GAW.
About this supplement
This article has been published as part of BMC Proceedings Volume 12 Supplement 9, 2018: Genetic Analysis Workshop 20: envisioning the future of statistical genetics by exploring methods for epigenetic and pharmacogenomic data. The full contents of the supplement are available online at https://bmcproc.biomedcentral.com/articles/supplements/volume-12-supplement-9.
Authors’ contributions
All authors contributed to the overall study. XD, BW and VF conducted all analyses and XD drafted the manuscript. GMP, LAC and CTL provided constructive advice and revised the manuscript critically. All authors approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
- Yang Q, Wang Y. Methods for analyzing multivariate phenotypes in genetic association studies. J Probab Stat. 2012;2012:652569.
- Ott J, Wang J. Multiple phenotypes in genome-wide genetic mapping studies. Protein Cell. 2011;2(7):519–22.
- Ray D, Pankow JS, Basu S. USAT: a unified score-based association test for multiple phenotype-genotype analysis. Genet Epidemiol. 2016;40:15.
- Zhou X, Stephens M. Genome-wide efficient mixed-model analysis for association studies. Nat Genet. 2012;44:5.
- Pan W. Asymptotic tests of association with multiple SNPs in linkage disequilibrium. Genet Epidemiol. 2009;33:11.
- Liu H, Tang Y, Zhang HH. A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables. Comput Stat Data Anal. 2009;53:4.
- Muller KE, Peterson BL. Practical methods for computing power in testing the multivariate general linear hypothesis. Comput Stat Data Anal. 1984;2(2):143–58.
- Zhou X, Stephens M. Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nat Methods. 2014;11(4):407–9.
- Comuzzie AG, Cole SA, Laston SL, Voruganti VS, Haack K, Gibbs RA, Butte NF. Novel genetic loci identified for the pathophysiology of childhood obesity in the Hispanic population. PLoS One. 2012;7:e51954.
- Thevenon J, Souchay C, Seabold GK, Dygai-Cochet I, Callier P, Gay S, Corbin L, Duplomb L, Thauvin-Robinet C, Masurel-Paulet A. Heterozygous deletion of the LRFN2 gene is associated with working memory deficits. Eur J Hum Genet. 2016;24(6):911–8.