Powerful association test combining rare variant and gene expression using family data from Genetic Analysis Workshop 19

Background Genetic association studies aim to test for disease or trait association with genetic variants, either throughout the human genome or in regions of interest. However, for most diseases and traits, the combined effects of associated genetic variants explain only a small proportion of the genetic variation. This “missing heritability” may be a result of the small effects of common variants considered in the genetic association studies. Rare variants may also play an important role in understanding the missing heritability of complex traits. Method We propose a novel weight-adjustment approach to combine gene expression into rare variant analysis. Results from previous simulation studies suggested that incorporating gene expression information can lead to substantial gain in statistical power. Results Using the family data set provided through the Genetic Analysis Workshop 19, we identified susceptible genes associated with blood pressure regulation. Conclusions These findings provide valuable information for further functional studies for blood pressure control and mechanism.


Background
In the past decade, genome-wide association studies (GWAS) have been successful in identifying susceptible genetic loci for many complex traits [1]. However, the study by Eichler et al. reported that the amount of genetic variations explained by the findings from GWAS for a given disease or complex trait is often notably less than the estimated heritability of the traits [2]. One explanation is that the common variants examined by GWAS often have smaller effects, and the rare variants with larger genetic effects are often excluded in a GWAS analysis.
Rare variants may play an important role in explaining the "missing heritability" of complex traits. As a result of recent advances in high-throughput sequencing technology, it is becoming financially feasible to assay rare genetic variations in thousands of individuals. The rare variants here are defined as genetic variants with a minor allele frequency of less than 1 %. Hence the typical GWAS strategy of analyzing one variant at a time is oftentimes underpowered for rare variant detection unless the effect size of the variant or the sample size is very large. A number of methods have been developed to analyze multiple rare variants jointly [3][4][5]. In this paper, we consider the Seq-aSum-VS approach developed by Basu and Pan [3]. This approach uses dimension-reduction and dataadaptive variable selection strategies to identify the nonnull variants from a group of genetic variants and uses a score test to test for association between the group of variants and the disease of interest.
It often still requires a relatively large sample size for rare variant analyses. To boost the statistical power of genetic association analysis, one research direction is to integrate various genomic information, such as gene expression, rare variants, copy number variation, methylation, transcriptional regulation, and protein abundance. With the availability of both rare variant genotype and gene expression information in the family data set through Genetic Analysis Workshop 19 (GAW19), we proposed a novel approach to incorporate gene expression into rare variant association analysis in this paper.
Genovese et al. [6] introduced the idea of using prior knowledge to weight p values from a genome-wide association study and provided theoretical proof for controlling family-wise error rate (FWER). In this paper, our contribution is to provide a formal mechanism to construct weights using available information about rare variants and gene expression. Studies by Li et al. [7] and Ho et al. [8] demonstrate that by incorporating additional genomic information, the weight-adjustment procedure can increase statistical power drastically compared to the traditional genetic association analysis.

Methods
There are 259 participants with full genotype, expression, and blood pressure data available in the GAW19 family data set. Our analysis focuses on systolic blood pressure (SBP) and diastolic blood pressure (DBP) as outcomes of interest. The average SBP and DBP from all visits were used as the summary measurement respectively for every participant. To adjust for pedigree structure in this data, we estimated identity-by-descent (IBD) matrix (Σ) using genome-wide single nucleotide polymorphism (SNP) marker data obtained using the Illumina platform provided through GAW19. To incorporate the dependence information embedded in the family structure, we transformed the average blood pressure measurements: YeMVN μ; Σ ð Þ; , so that individuals from the same family are independent in the transformed phenotypic value (Y * ). In addition, to account for the population stratifications that exist in the Mexican American population, we performed multidimensional scaling and calculated the first 3 principal components. In the following analysis, the residual from the transformed phenotypic value adjusting for the first 3 principal components were used. We considered genetic variant with a minor allele frequency of less than 0.01 as rare variant, and performed rare variants analysis for the genes reported by hg19 build on the odd-number autosomes that have less than 50 rare variants and more than 1 rare variant.

Sequential sum test
Consider k rare variants in a gene and that SNP i indicates the number of rare variant alleles in variant i in a general regression equation: where s i is 1, −1, or 0, indicating whether the effect of rare variant i is positive or negative or excluded from the equation, and v i is a weight assigned to rare variant i. In our analysis, we assumed v i = 1. In addition, β c represents the common odds ratio between the trait and the rare variants in the gene. We performed the Seq-aSum-VS approach described in Basu and Pan [3] and obtained p value for each gene with 500 permutations.

Constructing weights using expression measurements
After obtaining the p value for each gene from the Seq-aSum-VS test, we used gene expression information to construct weight for each gene. In Roeder et al. [9,10], the authors suggested to use a weight (w i > 0) to adjust p value (p i ) and to reject the null hypothesis if it belongs to the set of all gene i for which p i /w i ≤ α. The weight adjustment procedure maintains the proper FWER control as long as w i > 0 and w i ¼ 1.
Building on the theoretical findings, we developed a novel weight-adjustment approach for rare variant association analysis. After weight adjustment, genes that have strong contributions to phenotype-associated gene expression will be assigned weights greater than 1, hence achieving smaller weight-adjusted p values. The weighting mechanism is as follows: we assign a weight w i to each gene and the weight is the product of 2 parts: w g i E j and w E j P . The first term; w g i E j ;indicates the effect of gene g i on the j th gene expression measurement, E j . The second term, w E j P ; describes whether gene expression measurement (E j ) is associated with the phenotypic outcome (P). Eq. (1) is applied to obtain w g i E j : and In equation 1, g i is the number of total rare variants in gene i calculated by collapsing the genotypes across rare variant loci. A second equation The benefit of taking the product of 2 weights is that if either w g i E j or w E j P is zero, then the resulting product will be zero. On the other hand, if both w g i E j and w E j P are substantially large, then taking the product of the two parts will result in an amplified overall weight. In other words, if rare variants in the gene under consideration provide a strong contribution to outcome P through E j , then w g i E j Â w E j P will be a large value. A crude weight for gene i is set to be the maximum of the products among all gene expression measurements: To ensure that w Ã ¼ 1 , we divide crude weights ( w MP i Þ by their average ( w MP ) as required by Roeder and Wasserman [10] w MP i is larger than the average, then w Ã MP i will be greater than 1 after dividing by the average. We calculate adjusted p value for the ith gene as: adjusted p value for gene i ¼ p value of gene i reported by Seq-aSum-VS test w Ã MP i . If after adjustment, p value becomes greater than 1, then it is set to 1.
For the genes with adjusted p value of less than 0.05, we performed gene set enrichment analysis using biological process categories defined in gene ontology (GO). To account for the hierarchical structure in GO terms, we implemented conditional hypergeometric test [11].

Results
Of the 13,711 genes on the odd-numbered autosomes based on the hg19 build, we considered 6118 genes with less than 50 rare variants and more than 1 rare variant in the analysis. We identified 153 genes with weightadjusted p values of less than 0.05 for SBP or DBP; the top 20 genes are listed in Table 1. The genes with strong contribution to phenotype-associated gene expression levels are assigned weights greater than 1. In Table 1, 17 genes have weights for SBP greater than 1 and 18 genes have weights for DBP greater than 1, indicating that these genes contribute to alterations of phenotype-associated gene expression levels.
We performed gene-set enrichment analysis for these 153 significant genes using GO biological process categories. The top 15 enriched gene sets with more than ten genes are reported in Table 2 with p value of less than 0.05. The results suggest that these reported 153 genes are involved in the regulation of blood pressure, and blood vessel size pathways (p value <0.03). Interestingly, these 153 blood pressure-associated genes are also significantly involved in sensory perception of sound. Hypertension has been clinically observed to be correlated with hearing loss [12]. The result could suggest genetic basis for the correlation between hypertension and hearing function. However, further study is needed to validate the findings in this study.

Discussion
Many of the genes reported in Table 1 achieve smaller p values after the weight adjustment procedure in this analysis. In Table 1, we listed genes with weight-adjusted p values of less than 0.05. Multiple comparisons, such as Bonferroni threshold, could be adapted and applied using the weight-adjusted p values. In this analysis, the threshold Chr chromosome, p s * weight adjusted p value for SBP, p value s p value for SBP, # RVs number of rare variants identified in the gene, W s weight for SBP Subscript D represents statistics for DBP is 0.05/6118 ≅ 8 × 10 − 6 and none of the genes reported in Table 1 exceeded this stringent threshold value. In addition, we chose the Seq-aSum-VS test to obtain a p value for each gene in this paper; other approaches, such as the sequence kernel association test (SKAT) [5], can also be used to obtain the p value for each gene and the weighting scheme proposed in this study can then be used to calculated weight-adjusted p values. To obtain w g i E j , we summed the total number of rare variants across all the loci in a gene in equation (1) by assuming all the rare variants have similar effects on E j . An alternative approach to calculate w g i E j is to replace equation (1) by the test statistic reported Seq-aSum-VS test while treating gene expression as the outcome. The alternative approach does not assume that all the rare variants have the same effects on E j ; however, the alternative approach might be more computationally intensive.
In this analysis, we did not consider genes with more than 50 rare variants. In a gene with a large number of rare variants, many of the variants might be nullvariants, which will cause the estimated effects for genes with large numbers of rare variants to be diluted. For gene with a large number of rare variants, we suggest to use a moving window approach and only consider a feasible number of rare variants in a window.
Furthermore, for data collected in a case-control design, our proposed approach is easily modified by logistic regression and applied. The weighting scheme proposed in this study is also usable when only a subset of the individuals have gene expression measurements available. It is also modifiable for when SNP, gene expression data and gene expression, phenotype data are from two different sets of cohorts, instead of paired gene expression and GWAS data from the same cohort. However, paired gene expression and GWAS data from the same cohort might be preferable, as the data will have increased power to detect the causative relationship (SNP→E→P) but not the reactive relationship (SNP→P→E) based on the simulation study described in [8].
In the data analysis, we observed genes with small p values without evidence of gene expression association. It is biologically possible that genetic variants could cause phenotypic changes without altering gene expression level. Thus in practice, we suggest to pursue genes with either (a) small p values from Seq-aSum-VS test or (b) small weight-adjusted p values.