- Open Access
A comparative analysis of family-based and population-based association tests using whole genome sequence data
© Zhou et al.; licensee BioMed Central Ltd. 2014
- Published: 17 June 2014
The revolution in next-generation sequencing has made it feasible to obtain high-quality common and rare sequence variants across the entire genome. Because researchers now face the analytical challenge of handling a massive amount of genetic variant information from sequencing studies, numerous methods have been developed to assess the impact of both common and rare variants on disease traits. In this report, whole genome sequencing data from Genetic Analysis Workshop 18 were used to compare the power of several methods, under both family-based and population-based designs, to detect association between blood pressure and variants in the MAP4 gene region and on chromosome 3. To prioritize variants across the genome for testing, variants were first functionally assessed using prediction algorithms and expression quantitative trait loci (eQTLs) data. Four set-based tests in the family-based association tests (FBAT) framework--FBAT-v, FBAT-lmm, FBAT-m, and FBAT-l--were used to analyze 20 pedigrees, and 2 variance component tests, sequence kernel association test (SKAT) and genome-wide complex trait analysis (GCTA), were used with 142 unrelated individuals in the sample. Both set-based and variance-component-based tests had high power and an adequate type I error rate. Of the various FBATs, FBAT-l demonstrated superior performance, indicating its potential for use in rare-variant analysis. The updated FBAT package is available at: http://www.hsph.harvard.edu/fbat/.
- Genetic Analysis Workshop
- Semiparametric Regression
- Expression Quantitative Trait Locus
- Sequence Kernel Association Test
- GAW18 Data
Both existing and novel methods incorporating family-based and population-based designs were compared in this report. All the methods we compare use a single test for a set of multiple single-nucleotide polymorphisms (SNPs) in a region (a gene, in our setting). This approach avoids the need for the large samples that testing rare variants individually would require.
Table 1. Heritability and coheritability
We chose to first test the methods on MAP4, a gene that was simulated to be associated with blood pressure in the GAW18 data. Then, the most powerful tests that maintained adequate type I error were used on a whole chromosome scan of chromosome 3. Because many of the tests we considered are unable to provide results when using all SNPs, our analysis strategy starts with reducing the number of SNPs based on functional assessment.
Variants were filtered based on their predicted function. For coding variants, SnpEff (http://snpEff.sourceforge.net) was used to predict nonsynonymous, splice, and stop variants. Nonsynonymous variants were further classified using PolyPhen-2 [1]. Lymphoblastoid cell line (LCL) expression quantitative trait loci (eQTLs) from Caucasian (CEU) International Haplotype Map Project (HapMap) samples were used to highlight SNPs affecting the transcription of MAP4 [2]. Nonsynonymous variants with PolyPhen-2 scores above 0.5 were included in our analysis together with splice and stop variants. An arbitrary cutoff of 3.4 on the -log10 p value from the eQTL analysis was used for eQTL filtering.
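The filtering rules above can be sketched as a small predicate. This is a minimal illustration, not the authors' pipeline: the `Variant` record, its field names, and the effect labels are hypothetical, and treating the eQTL cutoff as inclusive (at least 3.4) is an assumption, since the text does not say which side of the cutoff is kept.

```python
import math
from dataclasses import dataclass
from typing import Optional

POLYPHEN_MIN = 0.5           # from the text: scores above 0.5 are kept
EQTL_MIN_NEG_LOG10_P = 3.4   # from the text: cutoff on -log10(p)
KEPT_EFFECTS = {"splice", "stop"}  # hypothetical labels for predicted effects


@dataclass
class Variant:
    name: str
    effect: str                       # e.g. "nonsynonymous", "splice", "stop", "other"
    polyphen: Optional[float] = None  # PolyPhen-2 score, if the variant is nonsynonymous
    eqtl_p: Optional[float] = None    # eQTL p value, if the variant was tested


def keep_variant(v: Variant) -> bool:
    """Return True if the variant passes any of the three functional filters."""
    if v.effect in KEPT_EFFECTS:
        return True
    if v.effect == "nonsynonymous" and v.polyphen is not None and v.polyphen > POLYPHEN_MIN:
        return True
    # Assumption: the eQTL cutoff is inclusive.
    if v.eqtl_p is not None and v.eqtl_p > 0:
        return -math.log10(v.eqtl_p) >= EQTL_MIN_NEG_LOG10_P
    return False
```

A variant is kept if it satisfies any one rule, so a noncoding SNP can enter the analysis on eQTL evidence alone.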
FBAT-v [3] and FBAT-lmm (JJ Zhou, NM Laird, personal communications, 2013) are 2 newly developed gene-based rare-variant tests. FBAT-v is analogous to gene-based burden tests developed for case-control studies. FBAT-lmm is a variance component test. Although FBAT-lmm is also a transmission disequilibrium-based test, the trait is modeled through a linear mixed model (LMM), in which a random genetic component is introduced and tested. This allows genetic effects within the region to be both protective and deleterious. P values are determined using 1000 permutations. FBAT-m [4] and FBAT-l [5] are part of the preexisting FBAT suite of tests that were designed for common variants but can be used with multiple SNPs. FBAT-m is a multivariate test with degrees of freedom equal to the number of linearly independent SNPs. The linear combination test (FBAT-l) uses the noninformative families to estimate the optimal weights for the linear combination of SNPs.
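To make the transmission-based idea behind these tests concrete, here is a minimal sketch of the general FBAT statistic for a single biallelic variant in parent-child trios with additive coding. It is a deliberate simplification under stated assumptions: both parents are genotyped, each family contributes one child, and no attempt is made to combine several SNPs as FBAT-v, FBAT-lmm, FBAT-m, and FBAT-l do. The function name and the tuple layout are hypothetical.

```python
import math
from typing import List, Tuple

# (father genotype, mother genotype, child genotype, child trait residual);
# genotypes count copies of the tested allele (0, 1, or 2).
Trio = Tuple[int, int, int, float]


def fbat_z(trios: List[Trio], offset: float = 0.0) -> float:
    """Z = U / sqrt(Var(U)), with U = sum of T_i * (X_i - E[X_i | parents])."""
    u = 0.0
    var_u = 0.0
    for g_father, g_mother, g_child, trait in trios:
        t = trait - offset
        # Under Mendelian transmission the child's expected genotype is the parental mean.
        expected = (g_father + g_mother) / 2.0
        # Each heterozygous parent transmits the allele with probability 1/2 (variance 1/4).
        var_x = ((g_father == 1) + (g_mother == 1)) / 4.0
        u += t * (g_child - expected)
        var_u += t * t * var_x
    if var_u == 0.0:
        raise ValueError("no informative trios: no parent is heterozygous")
    return u / math.sqrt(var_u)
```

Families with two homozygous parents add nothing to either sum, which is why such families are called noninformative for the test itself.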
The sequence kernel association test (SKAT) has been proposed for population designs as a test for association between both common and rare genetic variants in a region and either continuous or dichotomous traits [6, 7]. Under the semiparametric regression model, a local relationship (similarity), or "kernel," matrix is computed from the genotypes in a testing region; examples include the identical by state (IBS) kernel and the Gaussian kernel for nonlinear effects. As described by Yang et al, genome-wide complex trait analysis (GCTA) is a toolkit designed to estimate heritability from genome-wide association study (GWAS) data on unrelated individuals, based on an LMM under a polygenic assumption [8, 9]. We have adapted the GCTA approach to test only the SNPs in a gene or region, and, as such, it is comparable to the SKAT approach; indeed, LMM and semiparametric regression share many theoretical connections [10].
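As an illustration of the kernel idea, the sketch below builds an IBS similarity matrix from genotypes and evaluates the quadratic form r'Kr on trait residuals r, which is the general shape of a kernel-machine score statistic. It is an assumption-laden toy: variants are unweighted, the function names are hypothetical, and no p value is computed because the null distribution of the statistic is not derived here.

```python
from typing import List


def ibs_kernel(genotypes: List[List[int]]) -> List[List[float]]:
    """IBS similarity between individuals; rows are individuals, columns are variants.

    Each variant contributes 2 - |g_i - g_j| shared alleles, scaled so K[i][i] == 1.
    """
    n = len(genotypes)
    p = len(genotypes[0])
    return [
        [
            sum(2 - abs(a - b) for a, b in zip(genotypes[i], genotypes[j])) / (2.0 * p)
            for j in range(n)
        ]
        for i in range(n)
    ]


def kernel_score(residuals: List[float], kernel: List[List[float]]) -> float:
    """Quadratic form r' K r for covariate-adjusted trait residuals r."""
    n = len(residuals)
    return sum(
        residuals[i] * kernel[i][j] * residuals[j]
        for i in range(n)
        for j in range(n)
    )
```

The statistic is large when individuals with similar genotypes in the region also have similar residuals, regardless of the direction of each variant's effect.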
We used the complete set of 200 replicates to assess type I error and power, using an alpha of 0.05 to determine statistical significance. In our analyses, we focused on 2 continuous phenotypes: systolic blood pressure (SBP) and diastolic blood pressure (DBP). Heritability estimates for SBP and DBP were both in the range of 20% to 30% (see Table 1). Coheritabilities for the 2 traits (i.e., the proportion of phenotypic covariance explained by common genetic covariance) ranged from 30% to 70% across the 3 exams (see Table 1). The analyses were adjusted for age, sex, age*sex, and BPmeds (i.e., current use of antihypertensive medications) at each exam by generating standardized residuals. We also analyzed the average of the residuals over the 3 exams. For the Q1 phenotype, we adjusted for age and sex only.
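Empirical type I error and power are both the fraction of replicates in which a test rejects at the chosen alpha; only the phenotype differs (one with no simulated effect for type I error, one with a simulated effect for power). The helper below is a minimal sketch; its name is hypothetical, and using a strict inequality (p < alpha) is an assumption. It also returns the binomial standard error, which shows how coarse an estimate from 200 replicates is.

```python
import math
from typing import Sequence, Tuple


def rejection_rate(p_values: Sequence[float], alpha: float = 0.05) -> Tuple[float, float]:
    """Return (fraction of replicates with p < alpha, binomial standard error)."""
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one replicate")
    rate = sum(p < alpha for p in p_values) / n
    se = math.sqrt(rate * (1.0 - rate) / n)
    return rate, se
```

With 200 replicates and a true rejection rate of 0.05, the standard error is about 0.015, so empirical type I error rates a few percentage points away from 0.05 are within sampling noise.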
Functional assessment for screening
Table: Summary statistics of MAP4 gene (number of SNPs; names and MAF of the 28 SNPs that remain for all analyses)
Table: Type I error and power comparison based on family studies (n = 849); rows include FBAT -v -e
Table: Type I error and power comparison based on population study (n = 142)
Chromosome 3 scan
Both family-based and population-based analyses of whole genome sequencing data were evaluated for their power to detect associations between a simulated phenotype and variants in the MAP4 gene and on chromosome 3. This approach used functional prediction information to filter variants, as would typically be done in applied studies. Both SKAT and GCTA had high power and an adequate type I error rate. Of the various FBAT tests, FBAT-l demonstrated superior performance, indicating its potential for use in rare-variant analysis. The lack of population substructure and availability of potential phenotypes contribute to the high performance of FBAT-l. Absent these conditions, the performance degrades. The relatively poor performance of FBAT-lmm could be a result of the small sample size and the concordant direction of effects across SNPs. However, FBAT-lmm shows promise for cases in which effects within a test region differ in sign, with some variants protective and others deleterious. FBAT-lmm does not currently have the capability to analyze extended pedigrees.
We also note that when analyzing extended pedigree data with traits that are highly correlated between relatives, the empirical variance estimator (-e) should be used to achieve the correct type I error. However, its use decreases the effective sample size so that it is closer to the number of independent pedigrees. Finally, our analysis demonstrates that using the average phenotype over 3 time points gives higher power than single-time-point phenotype analysis. This suggests that combining the phenotypes from different time points, or even combining SBP and DBP, may achieve higher power.
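One way to see why averaging can help is a toy variance calculation. Assume, purely for illustration, that each exam's residual is a stable genetic component plus exam-specific noise that is independent across exams; neither this model nor the function name comes from the analysis itself. Averaging over n exams then divides the noise variance by n and leaves the genetic variance unchanged.

```python
def genetic_fraction(var_genetic: float, var_noise: float, n_exams: int = 1) -> float:
    """Share of trait variance that is genetic after averaging n_exams measurements.

    Assumes noise is independent across exams, so its variance shrinks as 1 / n_exams.
    """
    if n_exams < 1:
        raise ValueError("n_exams must be at least 1")
    return var_genetic / (var_genetic + var_noise / n_exams)
```

For example, with a genetic share of 0.25 at a single exam, `genetic_fraction(0.25, 0.75, 3)` gives 0.5. Noise that is correlated across exams would shrink more slowly, so the real gain is likely smaller than this idealized figure.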
In this paper, we compared various FBAT region-based tests with one another and compared family-based tests with population-based tests. Our results show that FBAT -l outperformed FBAT -v0 when testing MAP4, possibly because some of the causal MAP4 variants included in the analysis are common. Our comparison with population-based tests suggests that, in the absence of population substructure, the population-based association tests are more powerful.
The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
1. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR: A method and server for predicting damaging missense mutations. Nat Methods. 2010, 7: 248-249. 10.1038/nmeth0410-248.
2. Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, Dermitzakis ET: Transcriptome genetics using second generation sequencing in a Caucasian population. Nature. 2010, 464: 773-777. 10.1038/nature08903.
3. De G, Yip W-K, Ionita-Laza I, Laird N: Rare variant analysis for family-based design. PLoS One. 2013, 8: e48495. 10.1371/journal.pone.0048495.
4. Rakovski CS, Xu X, Lazarus R, Blacker D, Laird NM: A new multimarker test for family-based association studies. Genet Epidemiol. 2007, 31: 9-17. 10.1002/gepi.20186.
5. Xu X, Rakovski C, Laird N: An efficient family-based association test using multiple markers. Genet Epidemiol. 2006, 30: 620-626. 10.1002/gepi.20174.
6. Lee S, Emond MJ, Bamshad MJ, Barnes KC, Rieder MJ, Nickerson DA, Christiani DC, Wurfel MM, Lin X, NHLBI GO Exome Sequencing Project--ESP Lung Project Team: Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am J Hum Genet. 2012, 91: 224-237. 10.1016/j.ajhg.2012.06.007.
7. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X: Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011, 89: 82-93. 10.1016/j.ajhg.2011.05.029.
8. Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, et al: Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010, 42: 565-569. 10.1038/ng.608.
9. Yang J, Lee SH, Goddard ME, Visscher PM: GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 2011, 88: 76-82. 10.1016/j.ajhg.2010.11.011.
10. Liu D, Lin X, Ghosh D: Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics. 2007, 63: 1079-1088. 10.1111/j.1541-0420.2007.00799.x.
11. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006, 38: 904-909. 10.1038/ng1847.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.