Gene-based multiple trait analysis for exome sequencing data

The common genetic variants identified through genome-wide association studies explain only a small proportion of the genetic risk for complex diseases. The advancement of next-generation sequencing technologies has enabled the detection of rare variants that are expected to contribute significantly to the missing heritability. Some genetic association studies provide multiple correlated traits for analysis. Multiple trait analysis has the potential to improve the power to detect pleiotropic genetic variants that influence multiple traits. We propose a gene-level association test for multiple traits that accounts for correlation among the traits. Gene- or region-level testing for association involves both common and rare variants. Statistical tests for common variants may have limited power for individual rare variants because of their low frequency and multiple testing issues. To address these concerns, we use the weighted-sum pooling method to test the joint association of multiple rare and common variants within a gene. The proposed method is applied to the Genetic Association Workshop 17 (GAW17) simulated mini-exome data to analyze multiple traits. Because of the nature of the GAW17 simulation model, increased power was not observed for multiple-trait analysis compared to single-trait analysis. However, multiple-trait analysis did not result in a substantial loss of power because of the testing of multiple traits. We conclude that this method would be useful for identifying pleiotropic genes.


Background
The common disease/common variant hypothesis states that common variants contribute substantially to common diseases [1,2]. Following this hypothesis, genomewide association studies have successfully detected associations with common variants. However, such common variants explain only a small proportion of the phenotypic variation. Many of the as yet undetected common variants may have small effect sizes; therefore they are not expected to contribute significantly to the missing heritability. An alternative theory, the common disease/ rare variant hypothesis, argues that a large number of rare variations with moderate to high penetrances account for genetic susceptibility to common disease [1]. Recently, deep-resequencing studies of candidate genes have provided some evidence supporting the common disease/rare variant hypothesis [3]. Although various statistical methods have been developed to detect associations with common variants for common diseases, these methods are inefficient for rare variants because of the small number of observations for each single rare variant. One feasible method for rare variant analysis is to pool multiple rare variants within a gene or region and to test their joint effect. This category of methods has been reviewed by Dering et al. [4].
Some genetic association studies examine a qualitative trait, such as the case-control status and some additional correlated quantitative traits. For example, a genetic study of diabetes may examine the diabetic status and other related phenotypes, such as body mass index and other lipid profiles. Similarly, a glaucoma study may explore the related endophenotypes, such as central corneal thickness, intraocular pressure, and maximum vertical cup-to-disc ratio. One way to analyze these data is to perform single-trait analyses separately. An alternative way is to perform a multiple-trait analysis, which potentially has improved power to identify the pleiotropic variants for these traits [5,6].
Univariate test statistics or p-values of multiple traits may not be independent because of the environmental or genetic correlations among multiple traits. Hence some classical methods of combining independent p-values, such as Fisher's method, are not directly applicable to the analysis of multiple correlated traits. Our purpose in this paper is to develop a test statistic for combining these correlated univariate statistics by considering the correlation structure among multiple traits. Motivated by a recently proposed approach [7], we developed a gene association test to test the joint effect of multiple variants within a gene on multiple correlated traits. The proposed method considers genes as basic units and uses the weighted sum [8] to combine the effects of multiple variants. The test statistic of multiple traits is the linear or quadratic combination of the univariate test statistics. It is likely that some rare variants may contribute to only a subset of available traits. Therefore we also conduct an alternative test on a preselected subset of multiple traits.

Methods
Let Y = (Y 1 , …, Y m ) T denote the m available traits. Assume that the gene k has L genotyped single-nucleotide polymorphisms (SNPs), including both common and rare ones. In the first step, the genetic score S j of the gene k for an individual j is calculated using the weighted sum of all SNPs within the gene. Second, a univariate test is performed to establish the association of genetic scores with all the traits separately. Then, a gene-level association test using the linear or quadratic combination of single-trait univariate statistics is constructed for multiple traits. Finally, the optimal subset of traits is selected for multipletrait analysis. The details of the various steps are described in what follows.
Gene score using weighted sum The weighted-sum gene score assigns different weights to each variant based on the estimated allele frequencies [8]. The score for gene k for individual j is given by: where I ij is the number of minor alleles for SNP i in individual j, and m i is the total number of minor alleles for SNP i in all n individuals. In the original study [8], the allele frequencies were estimated only for the control subjects. Because multiple-trait analysis needs to analyze multiple quantitative traits as well as the disease status, in the present study we estimate the allele frequencies using all individuals.

Association test for multiple traits
Let S = S 1 , S 2 , …, S n ) T denote the scores for gene k for n individuals. The test statistic for testing the association of S with each trait Y l (l = 1, …, m) is denoted Z l , and it is assumed to asymptotically follow N(0, 1) under the null hypothesis. The choice for Z l is discussed in the Results section. The test statistic for the combined multiple traits (CMT) method is a quadratic or linear combination of m univariate test statistics Z 1 , …, Z m .
The CMT method is motivated by a recently proposed approach [7] to test the joint effect of multiple correlated SNPs. Because the test statistics Z 1 , …, Z m are assumed to asymptotically follow N(0, 1), the joint distribution of the random vector Z = (Z 1 , …, Z m ) T is asymptotically multivariate normal N m (0, Σ), where Σ is the correlation matrix of Z. The quadratic combination statistic T CMT Q is given by: and it is distributed asymptotically as a chi-square distribution with m degrees of freedom if Σ is full rank [9]. Because Σ is unknown, it is estimated by permutations. The phenotypes are permuted, and the m univariate test statistics are computed under each permutation. The estimation of Σ is given by: under N such permutations. The p-value of the quadratic statistic ( P Q CMT ) can be approximated by the chisquare quantile.
An alternative statistic for multiple correlated traits is the linear combination statistic T CMT L : which is distributed asymptotically as N(0, 1) [9]. Here also the covariance matrix Σ is replaced by its estimatê Association test for the optimal subset of multiple traits It is likely that some of the relevant rare variants are associated with only a subset of traits. In this case, combining all the test statistics using the CMT method may result in a loss of power because of the high degree of freedom in the chi-square distribution. Therefore we propose an association test for optimally combined multiple traits (OCMT) using a preselected subset of these traits. To select the optimal subset, we use the CMT method to calculate the p-values for all possible subsets with at least two traits. The subset with the minimum p-value is selected as the optimal one, denoted A*. For any subset with at least two traits A, the quadratic combination statistic T CMT-A Q is given by: where Z A and Σ A are the subvector and submatrix of Z and Σ, respectively. The p-value is any possible subset with at leas = min{ : t t two traits}.
To control type I error, the p-value of the OCMT method is obtained using a permutation procedure that is based on the permutation of phenotypes. For each permutation π, the subset with the minimum p-value is selected as the optimal subset, denoted A*(π). The p-value of OCMT ( P OCMT Q ) is defined as the proportion of permutations with P P

Results
We applied the CMT and OCMT methods to the GAW17 simulated mini-exome data sets [10]. The results are reported in two parts. In the first part, the power of the proposed method is compared to the power of the single-trait analysis. Initially, the CMT and OCMT methods were proposed and applied to the GAW17 data set without the knowledge of the simulation model. These original results were presented at the GAW17 meeting. Because the simulation model used to create the GAW17 data was discussed at the workshop, we reran the analysis with the knowledge of the simulation model. Here we present the results of the revised analysis based on the knowledge of the simulation model. In the second part, we present some insights into the false-positive rate of the CMT method.
For each gene, we tested each trait separately using the t-test statistic of the b coefficient corresponding to the gene score in the logistic regression (for the disease status D) or the linear regression (for quantitative traits Q1, Q2, and Q4) adjusted for three covariates (Age, Sex, and Smoking status). The test statistic Z l (l = 1, …, m) was obtained from the inverse normal distribution transformation of the t-test statistic [7] and was assumed to have a standard normal distribution.
The association tests for multiple traits were performed by using the CMT method with the quadratic combining statistic. The p-value was approximated using the theoretical quantiles of the chi-square distribution. The correlation matrices among the test statistics Z 1 , …, Z m were estimated by 1,000 permutations. In addition, the association test using the OCMT method was performed, and its p-value was obtained from another set of 1,000 permutations. Given a predefined significance level a, the power of any association test was defined as the proportion of 200 replicates that returned a p-value less than or equal to a.
The samples in the GAW17 data set were collected from six cohorts, but population stratification was not considered in the simulation model. We performed the association tests with and without corrections for stratification. To correct for stratification, we corrected genotypes and phenotypes using the first 10 principal component scores derived using Eigenstrat [11].
Power of single-and multiple-trait analyses Table 1 presents the power to detect the Q1 causal genes at a significance level of 0.05. Without adjusting for stratification, the univariate test for Q1 has a power greater than or equal to 0.3 for eight genes. With the adjustment, only four genes (FLT1, KDR, VEGFA, and VEGFC) still have a power greater than or equal to 0.3. For Q2 and Q4, most of the genes have a power less than 0.1. Some of the genes also are associated with the disease status (power ≥ 0.3).
We used the CMT method with the quadratic statistic to perform the multiple-trait analysis for Q1 and D (Q1 +D). For all the Q1 causal genes, the power of the Q1 +D analysis was less than or comparable to the power of the univariate test for Q1. The results show that all Q1 causal genes may have small or no pleiotropic effects that are insufficient to compensate for the increase of the critical value from c 0 05 1 Checking the GAW17 simulation model revealed that this result is reasonable and consistent with the simulation model. Among all the Q1 causal genes, under the simulation model, only ELAVL4 was assumed to have pleiotropic effects on Q1 and the latent liability. Moreover, only two rare variants in this gene with MAF = 0.000717 (C1S3181 and C1S3182) contributed to the latent liability. Therefore it could be difficult to detect the association of ELAVL4 with the latent liability. The other eight causal genes on Q1 were assumed to have no pleiotropic effects. Therefore multiple-trait analysis did not improve the power for Q1 causal genes. However, multiple-trait analysis did elucidate the genetic correlation between the disease status and related quantitative traits and aided in the identification of the pleiotropic genes.
The p-value of the OCMT method is the minimum of the p-values of all subsets with at least two traits. Multiple-trait analysis was performed for all subsets of Q1, Q2, Q4, and D using the CMT method. The optimally combined multiple traits were selected on the basis of their p-values. The power of the OCMT method was comparable with that of the analysis for Q1+Q2+Q4+D. The last column of Table 1 summarizes the subset with the highest power among all the possible subsets with at least two traits and their powers. Most of the genes had the highest power when Q1 and D were combined. This finding shows that the OCMT method provides a way to select the best combination of traits, because the disease status is derived on the basis of multiple traits and Q1 has the biggest effect size. Table 2 summarizes the power to detect the causal genes on Q2, given a significance level of 0.05. In general, the univariate test for Q2 has the largest power, and the combination of Q2 and D has relatively good performance compared with the other subsets of multiple traits. Table 3 summarizes the power to detect the causal genes on the latent liability. After adjusting for population stratification, all the genes had a power less than 0.3 for both single-and multiple-trait analyses. The power of the analysis for Q1+Q2+Q4+D was less than or comparable to that of the single-trait analysis for D. This result is not surprising, because only ELAVL4 has a small pleiotropic effect on Q1; the other genes have no effects on Q1, Q2, or Q4.

False-positive rates of single-and multiple-trait analyses
On the basis of the results adjusted for population stratification, we calculated the false-positive rates of single-and multiple-trait analyses. Genes with a power greater than or equal to 0.3 were considered the associated findings. Luedtke et al. [12] reported that 695 genes were spuriously associated with the disease status. Excluding these 695 genes, we identified 10 causal genes (4 on Q1, 6 on Q2, and 3 on D) and 77 falsepositive genes (62 on Q1, 14 on Q2, 0 on Q4, and 6 on D) in the single-trait analysis for Q1, Q2, Q4, and D. The false-positive rate of the single-trait analysis was equal to 0.031. The multiple-trait analysis Q1+Q2 +Q4+D detected six causal genes and 58 false-positive genes. The false-positive rate of the Q1+Q2+Q4+D analysis was equal to 0.023. This result shows that, compared with the single-trait analysis, the CMT method combining multiple traits does not increase the false-positive rate.

Discussion
In some genetic association studies, multiple correlated traits are available that can be used to identify genes We report the powers without (upper) and with (lower) the adjustment of stratification for the single-trait analyses (Q1, Q2, Q4, D), the CMT method for Q1 and D (Q1+D), the CMT method for all the four traits (Q1+Q2+Q4+D), and the OCMT method (OCMT). The last column presents the subset with the highest power among all the subsets with at least two traits and its power (in parentheses).
responsible for multiple traits. In single-trait analysis, some common associated genes may be found across different traits. These overlapping associations may be caused by pleiotropic genes and/or the correlation structure among traits. Multiple-trait analysis has the potential to improve the power to detect pleiotropic genes. The multiple-trait analysis proposed here did not suffer a significant loss of power, even though true models, such as the GAW17 simulated model, had small or no pleiotropic effects. The proposed method considers the correlation matrix, thereby ensuring that the false-positive rate is not inflated by the correlation among multiple traits. In next-generation sequencing data sets, huge numbers of rare variants are genotyped across the whole genome. Because of the small number of observations for each rare variant, statistical tests for common variants are inefficient at identifying the associations. Hence the proposed method performs a gene-level test using the weighted sum to test the joint effect of multiple variants within a gene. The weighted sum is a feasible choice for the gene-based score [8], but it is not the only choice; other methods are available for pooling multiple rare variants [4].
From Tables 1 and 3, it can be seen that the correction for population stratification has a large effect on the power to detect some associations with Q1, D, and Q1+Q2+Q4+D. This phenomenon also was observed from the quantile-quantile plots in replicate 1 (data not shown). Although we did not consider the population structure in the simulation model, the principal component scores may influence the phenotypes, because the population structures would be similar to those of the true causal genes in this data set [13]. We conclude that the rare variant associations unadjusted for population stratifications should be interpreted with caution.

Conclusions
With the advent of next-generation sequencing technologies, the identification of rare variants has become realistic. Statistical methods for common variants are not applicable in rare variant analysis because of the small number of observations and the huge number of rare variants across the whole genome. Thus efficient statistical methods are needed. When multiple related traits are available, it is expected that multiple-trait analysis has the improved power to detect pleiotropic genes. We proposed the CMT and OCMT methods to examine the We report the powers without (upper) and with (lower) the adjustment of stratification for the single-trait analyses (Q1, Q2, Q4, D), the CMT method for Q2 and D (Q2+D), the CMT method for all the four traits (Q1+Q2+Q4+D), and the OCMT method (OCMT). The last column presents the subset with the highest power among all the subsets with at least two traits and its power (in parentheses).
joint effect of multiple variants on multiple traits. The proposed method uses the quadratic or linear combination of univariate test statistics and thus considers the correlation structure among multiple correlated traits. The CMT and OCMT methods were applied to the GAW17 mini-exome data. The results show that the method is suitable for multiple trait analysis.  We report the powers without (upper) and with (lower) the adjustment of stratification for the single-trait analyses (Q1, Q2, Q4, D), the CMT method for all the four traits (Q1+Q2+Q4+D), and the OCMT method (OCMT). The last column presents the subset with the highest power among all the subsets with at least two traits and its power (in parentheses).