Identifying rare variants associated with hypertension using the C-alpha test
© Faino et al.; licensee BioMed Central Ltd. 2014
Published: 17 June 2014
Important rare variants may be near significantly associated common variants based on genetic distance. For this reason, we conducted an analysis of rare variants informed by tests of single-marker association at loci with common variants. We identified highly significant common variants within chromosome 3, as well as rare variants around these locations. Based on a predetermined window size, we then analyzed these rare variants with the C-alpha test to determine significant associations with hypertension. We found significant rare variants around common variants; however, the C-alpha test was sensitive to the specified window size. When comparing markers in genes to markers not in genes, we found that markers not in genes had more significant C-alpha test p values than markers in genes.
Whole genome sequencing provides geneticists and statisticians with the genome data necessary to link genetic variants to specific phenotypes such as high cholesterol, cancer, and diabetes. Many genome-wide association studies (GWAS) link phenotypes to genetic variants through logistic regression analyses. These single-marker association tests perform well for common variants (CVs). However, for rare variants (RVs), defined here as having minor allele frequencies (MAFs) of less than 0.05, single-marker association tests lack the power to detect significant associations.
Complex phenotypes have been found to be poorly explained by CVs. A hypothesis has emerged that RVs may contribute more to disease heritability than CVs. However, how to study these RVs has not been clear. Researchers have created various methods to statistically analyze RVs based on the idea of pooling many RVs together to increase statistical power. For many of these methods, subjects are coded either as carrying at least 1 RV versus none, or by a count of the RVs they carry. Statistical analyses such as the cumulative minor-allele test (CMAT) and kernel-based adaptive clustering (KBAC) are powerful at detecting associations; however, their power diminishes when both protective and harmful RVs are present [3–5]. The C-alpha test provides a computationally simple method for testing the significance of a set of RVs that can be protective, harmful, or neutral. In particular, the C-alpha test assesses the following hypotheses:
H0: p_i = p_0 for every RV i
Ha: p_i follows a mixture distribution, with some variants detrimental (p_i > p_0), some neutral, and some protective (p_i < p_0)
where p_i is the proportion of the rare allele at the ith RV occurring in cases versus controls, and p_0 is equal to the proportion of cases among all subjects, so that a similar proportion of the rare alleles at the RVs is expected to occur at random in the cases and the controls. A small p value indicates that the distribution of the rare alleles is not random.
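As an illustration of these hypotheses, the sketch below computes a C-alpha statistic in the form described by Neale et al.: for each RV it compares the squared deviation of the observed case count from its binomial expectation with the binomial variance, and standardizes the sum. The function name, argument names, and example counts are hypothetical, and the p value here uses the large-sample normal approximation rather than the permutation-based p values used in this analysis.

```python
from collections import Counter
from math import comb, erfc, sqrt


def c_alpha(case_counts, total_counts, p0):
    """Return (T, Z, one-sided p) for a set of rare variants.

    case_counts[i]  : copies of the rare allele at variant i seen in cases (y_i)
    total_counts[i] : copies seen in cases and controls combined (n_i)
    p0              : proportion of cases among all subjects
    """
    # T adds up, per variant, the excess of the observed squared deviation
    # over the binomial variance expected under H0.
    t_stat = sum((y - n * p0) ** 2 - n * p0 * (1 - p0)
                 for y, n in zip(case_counts, total_counts))

    # Variance of T under H0: for each distinct n (seen m times), enumerate
    # every possible case count u and weight it by its binomial probability.
    variance = 0.0
    for n, m in Counter(total_counts).items():
        for u in range(n + 1):
            prob = comb(n, u) * p0 ** u * (1 - p0) ** (n - u)
            variance += m * ((u - n * p0) ** 2 - n * p0 * (1 - p0)) ** 2 * prob

    z = t_stat / sqrt(variance)
    p_value = 0.5 * erfc(z / sqrt(2))  # upper tail of the standard normal
    return t_stat, z, p_value


# Hypothetical counts: one variant seen only in cases, one only in controls.
t_stat, z, p_value = c_alpha([4, 0], [4, 4], p0=0.5)
```

Because a harmful and a protective variant both add positive terms to the sum, the statistic grows when the two types are mixed instead of cancelling out, which is the property that motivates the test.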
A combination of protective, harmful, and neutral variants is likely associated with hypertension. Longitudinal hypertension data and whole genome sequence data were provided to the authors as part of the Genetic Analysis Workshop 18 (GAW18). The data set is from the Type 2 Diabetes Genetic Exploration by Next-generation sequencing in Ethnic Samples (T2D-GENES) Project 2, which was designed to identify RVs associated with hypertension and provided an opportunity to test a novel method for RV analysis.
The question of interest was whether significant RV associations occurred near CVs, and whether these associations were affected by the CVs occurring in genes versus not in genes. The RV analysis was therefore based on single-marker association tests. Highly significant CVs were identified within chromosome 3, as were RVs around these locations. Based on a predetermined window size, these RVs were then analyzed with the C-alpha test to determine significant associations with hypertension. Markers in genes were compared to markers not in genes.
Based on the hypothesis of interest, the analysis consisted of the following steps: data clean-up, single-marker association tests, and the RV analysis. The following sections provide details for each of these methods.
From the original GAW18 sequenced chromosome 3 file, only columns with the single-nucleotide polymorphism (SNP) IDs and genotype information for each subject were retained. Base-pair locations that appeared to have 3 or more alleles were excluded from further analyses. Only fully sequenced data from unrelated individuals was included in this analysis (n = 103). The real data rather than the simulated GAW18 data was used in order to not bias the results toward more significance in genes versus not in genes.
If a subject had hypertension listed for one or more of the 4 doctor visits, the subject was coded as 2 for affected. Otherwise, the subject was coded as 1 for unaffected. The covariates of interest were gender (M/F), age at first visit, smoking (Yes/No), and blood pressure medication (Yes/No). Smoking and blood pressure medication were coded in the same way as hypertension: a subject was coded as 2 for smoking if listed as smoking at any of the 4 doctor visits, and as 2 for medication if listed as using blood pressure medication at any of the 4 visits. Markers that deviated from Hardy-Weinberg equilibrium (p <0.01) were excluded from the analysis.
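The "ever recorded" coding rule described above can be sketched as follows. The function name, the record layout, and the treatment of a missing visit as "not recorded" are assumptions made for illustration; they are not taken from the GAW18 files.

```python
def code_ever(visits):
    """Return 2 if the trait was recorded at any visit, otherwise 1.

    visits: one entry per doctor visit, True / False, or None when the
    visit is missing (treated here as "not recorded", an assumption).
    """
    return 2 if any(v is True for v in visits) else 1


# Hypothetical subject with 4 visits.
subject = {
    "hypertension": [False, False, True, None],
    "smoking": [False, False, False, False],
    "bp_medication": [None, True, True, True],
}
coded = {trait: code_ever(visits) for trait, visits in subject.items()}
```

The 1/2 coding matches the unaffected/affected convention PLINK uses for case-control phenotypes.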
Single-marker association tests
Single-marker association tests were performed on CVs along chromosome 3. A total of 103 unrelated individuals were included in this analysis. The logistic models were adjusted for the following covariates: smoking, blood pressure medication, age at first visit, and gender. Because of large p values from the association tests, neither a significance threshold for the p value nor a correction for multiple testing was considered.
Rare variant analysis
The top 10 significant CVs were chosen along chromosome 3 for further analysis. Five of these top 10 markers were located in genes. A SNP was considered to be within a gene if it was located anywhere within the 5′ and 3′ untranslated region (UTR) of the gene. Gene locations were based on information from GeneCruiser.
Window sizes of 1 kilobase (kb), 5 kb, and 25 kb were examined around these CVs for RVs. RVs were defined as having a MAF <0.05. These RVs were then extracted and a C-alpha test was calculated for each window size to assess how sensitive the results were to the arbitrary choice of window size. Singletons were removed from the analysis. The biased urn method was used to obtain C-alpha test p values that accounted for population stratification in permutations of case and/or control status. From the GWAS simulated odd-numbered chromosomes, a reduced set of 39,883 SNPs with pairwise r² ≤0.01 and no missing alleles was obtained. A total of 94 subjects were included in the reduced set of SNPs, of which 60 were cases and 34 were controls. The first 5 eigenvectors from a principal components analysis, along with the same covariates from above, were used to generate 1000 biased urn samples based on Fisher's noncentral hypergeometric distribution.
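The window-based selection of RVs can be sketched as below. This is a minimal illustration, not the pipeline used in the analysis: the function name and data layout are hypothetical, and the window is assumed to extend by the stated distance on each side of the CV, which the text does not specify.

```python
def rare_variants_in_window(markers, center_bp, flank_bp, maf_cutoff=0.05):
    """Return positions of non-singleton rare variants near a common variant.

    markers   : list of (position_bp, genotypes), where genotypes holds one
                allele count (0, 1, or 2) per subject, or None if missing
    center_bp : base-pair position of the common variant
    flank_bp  : distance searched on EACH side of center_bp (an assumption)
    """
    kept = []
    for position, genotypes in markers:
        if abs(position - center_bp) > flank_bp:
            continue  # outside the window
        called = [g for g in genotypes if g is not None]
        if not called:
            continue  # no usable genotype calls
        allele_count = sum(called)
        minor_count = min(allele_count, 2 * len(called) - allele_count)
        maf = minor_count / (2 * len(called))
        # Keep rare variants only; drop monomorphic sites and singletons.
        if minor_count >= 2 and maf < maf_cutoff:
            kept.append(position)
    return kept


# Hypothetical markers for 50 subjects around a common variant at 1,400 bp.
markers = [
    (1000, [1, 1] + [0] * 48),      # rare, seen twice: kept
    (1200, [1] + [0] * 49),         # singleton: dropped
    (1500, [1] * 20 + [0] * 30),    # common (MAF 0.20): dropped
    (9000, [2] + [0] * 49),         # rare, but far from the common variant
]
near = rare_variants_in_window(markers, center_bp=1400, flank_bp=1000)
wide = rare_variants_in_window(markers, center_bp=1400, flank_bp=10000)
```

Widening the window changes which RVs enter the test, which is one way the choice of window size can alter the resulting C-alpha p value.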
Data analyses were performed with PLINK version 1.07 and the R packages AssotesteR and Epstein et al.'s modified BiasedUrn package. All data cleaning was performed with JMP version 10 and SAS version 9.2.
Table: C-alpha test results for markers in genes versus markers not in genes. Column groups: markers in genes and markers not in genes, each with the number of RVs tested. Gene IDs listed for the markers in genes: 51185, 2272, 55689, 2272, 8626. [Table values not preserved in this copy.]
On average, markers not in genes were more significant than markers in genes. Fisher's exact tests comparing significance for the 1-kb, 5-kb, and 25-kb windows were not significant (p = 0.4667, p = 1, and p = 0.5238, respectively).
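For readers unfamiliar with the comparison above, the following sketch shows how a two-sided Fisher's exact test is computed for a 2 × 2 table of markers (in genes versus not in genes) by C-alpha result (significant versus not significant). The function name is hypothetical and the counts are invented for illustration; they are not the counts behind the p values reported here.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c

    def weight(x):
        # Unnormalized hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = weight(a)
    # Sum every table that is no more probable than the observed one.
    extreme = sum(weight(x) for x in range(lowest, highest + 1)
                  if weight(x) <= observed)
    return extreme / comb(row1 + row2, col1)


# Hypothetical counts: 0 of 5 in-gene markers significant, 3 of 5 not-in-gene.
p_value = fisher_exact_two_sided(0, 5, 3, 2)
```

With only 5 markers per group, even a split as uneven as this one gives p = 1/6, so nonsignificant results are expected at this sample size.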
Our hypothesis examined whether or not important RVs are near significantly associated CVs based on genetic distance. Our RV analysis was therefore based on single-marker association tests. Highly significant CVs were identified within chromosome 3, as were RVs around these locations. Based on a predetermined window size, these RVs were then analyzed with the C-alpha test to determine significant associations with hypertension.
Based on our results, we found significant RVs around highly significant CVs. However, we also found that the C-alpha test was sensitive to the specified window size. Future research can examine more closely how the specified window size affects the C-alpha test.
When comparing markers in genes to markers not in genes, we found that markers not in genes had more significant C-alpha test p values than markers in genes. These differences were not significant with a Fisher's exact test. Very few CVs that were analyzed occurred in genes (approximately 1%), and this may have biased the results. In addition, the single-marker association tests were underpowered (the smallest p value was 0.00064). However, the p values from the single-marker association tests were comparable in magnitude in genes versus not in genes. Also, among the top 10 most significant SNPs, 5 were located in genes and 5 were located not in genes, which minimized bias in the results. Future research can investigate in more depth whether significant RVs occur more often in genes than outside genes.
The authors would like to thank the Division of Biostatistics and Bioinformatics at National Jewish Health for assistance in the analysis of this data. The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
- Kryukov GV, Shpunt A, Stamatoyannopoulos JA, Sunyaev SR: Power of deep, all-exon resequencing for discovery of human trait genes. Proc Natl Acad Sci U S A. 2009, 106: 3871-3876. 10.1073/pnas.0812824106.
- Bodmer W, Bonilla C: Common and rare variants in multifactorial susceptibility to common diseases. Nat Genet. 2008, 40: 695-701. 10.1038/ng.f.136.
- Basu S, Pan W: Comparison of statistical tests for disease association with rare variants. Genet Epidemiol. 2011, 35: 606-619. 10.1002/gepi.20609.
- Liu DJ, Leal SM: A novel adaptive method for the analysis of next-generation sequencing data to detect complex trait associations with rare variants due to gene main effects and interactions. PLoS Genet. 2010, 5: e1000384-10.1371/journal.pgen.1001156.
- Zawistowski M, Gopalakrishnan S, Ding J, Li Y, Grimm S, Zöllner S: Extending rare-variant testing strategies: analysis of noncoding sequence and imputed genotypes. Am J Hum Genet. 2010, 87: 604-617. 10.1016/j.ajhg.2010.10.012.
- Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, Kathiresan S, Purcell SM, Roeder K, Daly MJ: Testing for unusual distribution of rare variants. PLoS Genet. 2011, 7: e1001322-10.1371/journal.pgen.1001322.
- Liefeld T, Reich M, Gould J, Zhang P, Tamayo P, Mesirov JP: GeneCruiser: a web service for the annotation of microarray data. Bioinformatics. 2005, 21: 3681-3682.
- Epstein MP, Duncan R, Jiang Y, Conneely KN, Allen AS, Satten GA: A permutation procedure to correct for confounders in case-control studies, including tests of rare variation. Am J Hum Genet. 2012, 91: 215-223. 10.1016/j.ajhg.2012.06.004.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.