To date, genome-wide association studies (GWAS) have been successful in unveiling many common single-nucleotide polymorphisms (SNPs) associated with common diseases, including type 1 and type 2 diabetes, rheumatoid arthritis, Crohn’s disease, and coronary heart disease [1–3]. However, the results from recent GWAS account for a relatively small proportion of the heritability of those diseases. One possible explanation of this limitation is that GWAS have focused mainly on variants that are common (minor allele frequency [MAF] > 5%), whereas many disease-causing variants may be rare and therefore difficult to tag using common variants.
The advent of next-generation sequencing technology has offered great opportunities for discovering novel rare variants in the human genome, associating these rare variants with diseases, and increasing our biological knowledge of disease etiology. In particular, as pointed out by Choi et al., protein-coding regions harbor 85% of the mutations with large effects on disease-associated traits. As a result, whole-exome sequencing technology has emerged as a powerful paradigm for the identification of rare variants associated with diseases. This technology was used in the pilot 3 study of the 1000 Genomes Project, from which the Genetic Analysis Workshop 17 (GAW17) mini-exome data were generated.
In the GAW17 mini-exome data set, most of the SNPs are rare (MAF < 5% for 21,355 out of 24,487 SNPs), so multimarker association tests are more desirable than single-marker tests, such as the chi-square test, because they can gain power by combining multiple signals in a region. However, multimarker association tests may themselves lose power because their test statistics have more degrees of freedom. To overcome this problem, investigators have recently proposed several multimarker association tests for which the test statistics have smaller degrees of freedom. In this paper, we consider two types of such association test procedures. The first approach collapses multiple markers within a chromosomal region to generate a reduced set of genetic predictors [7–9]; the second approach uses a kernel function to measure genetic similarity among individuals across a set of markers and relates that similarity to their phenotypic similarity [10–13]. We describe these methods in the Methods section.
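As a rough illustration of the two ideas, the following minimal sketch collapses the rare SNPs of one region into a single carrier indicator and computes a pairwise genetic similarity matrix. It is not the implementation used in this paper: the function names, the toy genotype matrix, the 0/1/2 minor-allele coding, and the choice of a linear kernel are all assumptions made for illustration.

```python
import numpy as np

def minor_allele_freq(genotypes):
    """Per-SNP minor allele frequency.

    genotypes: (n_individuals, n_snps) array of minor-allele counts (0, 1, or 2).
    This coding is an assumption of the sketch.
    """
    return genotypes.mean(axis=0) / 2.0

def collapse_rare_variants(genotypes, maf_threshold=0.05):
    """Collapse all rare SNPs in a region into one indicator per individual.

    Returns 1 if the individual carries at least one minor allele at any SNP
    with MAF below the threshold, and 0 otherwise.
    """
    rare = minor_allele_freq(genotypes) < maf_threshold
    return (genotypes[:, rare].sum(axis=1) > 0).astype(int)

def linear_kernel(genotypes):
    """Pairwise genetic similarity as the inner product of genotype vectors.

    A linear kernel is only one possible choice (hypothetical here).
    """
    return genotypes @ genotypes.T

# Toy region: 20 individuals, 3 SNPs (hypothetical data).
genotypes = np.zeros((20, 3), dtype=int)
genotypes[0, 0] = 1      # SNP 0: one minor allele, MAF = 1/40 (rare)
genotypes[3, 1] = 1      # SNP 1: one minor allele, MAF = 1/40 (rare)
genotypes[:10, 2] = 1    # SNP 2: ten minor alleles, MAF = 10/40 (common)

burden = collapse_rare_variants(genotypes)  # single predictor for the region
K = linear_kernel(genotypes)                # 20 x 20 similarity matrix
```

In this toy example only individuals 0 and 3 carry a rare allele, so the three SNPs reduce to one binary predictor for the region, while `K` holds the genetic similarity of every pair of individuals that a kernel-based test would relate to phenotypic similarity.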
We apply these methods to each of the genes in the GAW17 unrelated individuals data set to identify genes associated with the given traits (Affected, Q1, Q2, and Q4), adjusting for the effects of environmental covariates (Smoke, Age, Sex, and Population), and we compare the results across methods. In addition, for each given trait, we use a Bayesian mixed-effects model to estimate the phenotypic variance that can be explained by the given environmental and genotypic data and to infer an individual-specific genetic effect to use directly in single-gene association tests.