A novel statistical method for rare-variant association studies in general pedigrees
- Huanhuan Zhu^{1},
- Zhenchuan Wang^{1},
- Xuexia Wang^{2} and
- Qiuying Sha^{1}
- Published: 18 October 2016
Abstract
Both population-based and family-based designs are commonly used in genetic association studies to identify rare variants that underlie complex diseases. For any study design, statistical power improves if rare variants are enriched in the sample. Family-based designs with ascertainment based on phenotype may enrich the sample for causal rare variants and can therefore be more powerful than population-based designs. It is thus important to develop family-based statistical methods that can account for ascertainment. In this paper, we develop a novel statistical method for rare-variant association studies of quantitative traits in general pedigrees. The method takes a retrospective view that treats the traits as fixed and the genotypes as random, which allows it to account for complex and undefined ascertainment of families. We apply the method to the Genetic Analysis Workshop 19 data set and compare its power with that of two other methods for general pedigrees. The results show that the proposed method is more powerful than the other two methods in most of the cases we consider.
Keywords
- Rare Variant
- Genetic Analysis Workshop
- General Pedigree
- Sequence Kernel Association Test
- GAW19 Data
Background
There is increasing interest in detecting associations between rare variants and complex traits. Although statistical methods to detect common variant associations are well developed, these variant-by-variant methods may not be optimal for detecting associations with rare variants as a result of allelic heterogeneity as well as the extreme rarity of individual variants [1]. Recently, several statistical methods for detecting associations of rare variants were developed for population-based designs, including the cohort allelic sums test [2], the combined multivariate and collapsing method [1], the weighted sum statistic [3], the variable minor allele frequency threshold method [4], the adaptive sum test [5], the step-up method [6], the sequence kernel association test [7], and the test for optimally weighted combination of variants [8].
Meanwhile, quite a few statistical methods for rare-variant association studies have been developed for family-based designs. For any study design, statistical power improves if rare variants are enriched in the sample. If one parent carries a copy of a rare allele, half of the offspring are expected to carry it; hence, variants that are rare in the general population can be very common in certain families [9]. Family-based designs may therefore play an important role in rare-variant association studies, and several family-based rare-variant association methods have been developed for quantitative traits [10–12] and for qualitative traits [13–15]. However, most of these methods were developed under the assumption of random ascertainment, and family-based designs with random ascertainment may not yield enrichment of rare variants. To analyze the sequencing data in general pedigrees provided by Genetic Analysis Workshop 19 (GAW19), we propose a novel method to test rare-variant association in general pedigrees for quantitative traits. Applying the proposed method to the GAW19 data set, we compare its power with that of two popular methods for family-based designs.
Methods
Consider a sample of \( n \) pedigrees with \( n_i \) members in the \( i \)th pedigree and a genomic region with \( M \) variants. Let \( y_{ij} \) and \( g_{ij} = (g_{ij1}, \dots, g_{ijM})^T \) denote the trait value and the genotypes at the \( M \) variants in the genomic region for the \( j \)th individual in the \( i \)th pedigree. Let \( x_{ij} = \sum_{m=1}^{M} w_m g_{ijm} \) denote the weighted combination of genotypes at the \( M \) variants, where \( w = (w_1, \dots, w_M)^T \) is a vector of weights.
The score test statistic is given by \( T_{\text{score}} = U^2 / V \) (equation (1)), where \( U = \sum_{i=1}^{n} \sum_{j=1}^{n_i} \left( x_{ij} - \overline{x} \right) \left( y_{ij} - \overline{y} \right) \), \( V = w^T \varSigma w \sum_{i=1}^{n} y_i^T \Phi_i y_i \), \( y_i = \left( y_{i1}, \dots, y_{i n_i} \right)^T \), \( \overline{y} = \frac{1}{\sum_{i=1}^{n} n_i} \sum_{i=1}^{n} \sum_{j=1}^{n_i} y_{ij} \), \( \overline{x} \) is the corresponding sample mean of the \( x_{ij} \), \( \Phi_i \) is twice the kinship coefficient matrix of the \( i \)th pedigree, and \( \varSigma = \operatorname{cov}(g_{11}, g_{11}) \) is the covariance matrix of the multiple-variant genotype of one individual. \( \varSigma \) can be estimated by \( \widehat{\varSigma} = \frac{1}{\sum_{i=1}^{n} n_i} \sum_{i=1}^{n} \sum_{j=1}^{n_i} \left( g_{ij} - \overline{g} \right) \left( g_{ij} - \overline{g} \right)^T \), where \( \overline{g} = \frac{1}{\sum_{i=1}^{n} n_i} \sum_{i=1}^{n} \sum_{j=1}^{n_i} g_{ij} \). It is worth pointing out that \( T_{\text{score}} \) is equivalent to the quantitative version of the retrospective likelihood score statistic proposed by Schaid et al. [16].
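The quantities \( U \), \( V \), and \( \widehat{\varSigma} \) defined above can be computed in a few lines of NumPy. The following is a minimal sketch rather than the authors' implementation: the function name `score_statistic`, the input layout (one genotype matrix, one trait vector, and one \( \Phi_i \) matrix per pedigree), and the toy data are assumptions made for illustration. The statistic is taken to have the usual score form \( U^2 / V \), and the trait values are assumed to have been centered beforehand.

```python
import numpy as np

def score_statistic(genotypes, traits, phis, w):
    """Return (U, V, T) for one genomic region (illustrative sketch).

    genotypes: list of (n_i x M) arrays, one per pedigree
    traits:    list of length-n_i arrays (assumed centered beforehand)
    phis:      list of (n_i x n_i) matrices, twice the kinship coefficients
    w:         length-M vector of variant weights
    """
    G = np.vstack(genotypes).astype(float)    # all individuals stacked, (N x M)
    y = np.concatenate(traits).astype(float)  # all trait values, length N
    w = np.asarray(w, dtype=float)

    # Weighted genotype combination x_ij and the score U
    x = G @ w
    U = float(np.sum((x - x.mean()) * (y - y.mean())))

    # Sample covariance of the genotype vector, divided by N as in the text
    Gc = G - G.mean(axis=0)
    sigma_hat = Gc.T @ Gc / G.shape[0]

    # Sum over pedigrees of y_i^T Phi_i y_i
    quad = sum(
        float(np.asarray(y_i, dtype=float) @ np.asarray(phi_i, dtype=float)
              @ np.asarray(y_i, dtype=float))
        for y_i, phi_i in zip(traits, phis)
    )
    V = float(w @ sigma_hat @ w) * quad
    return U, V, U ** 2 / V

# Toy data (hypothetical): two pedigrees of two first-degree relatives, one variant.
genotypes = [np.array([[0], [1]]), np.array([[2], [1]])]
traits = [np.array([-1.0, 0.0]), np.array([1.0, 0.0])]
# Twice the kinship matrix of a parent-offspring or full-sib pair
phis = [np.array([[1.0, 0.5], [0.5, 1.0]])] * 2
U, V, T = score_statistic(genotypes, traits, phis, w=[1.0])
print(U, V, T)
```

For this toy input, \( U = 2 \), \( V = 1 \), and the statistic equals 4. Stacking all individuals into one matrix keeps the computation of \( \widehat{\varSigma} \) simple, while the kinship term is still accumulated pedigree by pedigree.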
where \( \sigma_{mm} \) is the \( (m, m) \)th element of \( \widehat{\varSigma}_0 \) and \( u_m \) is the \( m \)th element of \( u \). Under the null hypothesis, \( T_{\text{OW-score}} \) is asymptotically distributed as a mixture of independent \( \chi^2 \) statistics [18, 19]. Alternatively, the distribution of \( T_{\text{OW-score}} \) can be approximated by a Satterthwaite approximation for the distribution of quadratic forms [7, 20, 21] or by a scaled \( \chi^2 \) distribution [16]. We propose to approximate the distribution of \( T_{\text{OW-score}} \) by a scaled \( \chi^2 \) distribution, with the scale \( \delta \) and the degrees of freedom \( d \) estimated from the expectation and variance of \( T_{\text{OW-score}} \). Note that \( u \sim N\left( 0, \varSigma \sum_{i=1}^{n} y_i^T \Phi_i y_i \right) \). We have \( \widehat{\mu}_T = \widehat{E}\left( T_{\text{OW-score}} \right) = \operatorname{trace}\left( \widehat{\varSigma} \widehat{\varSigma}_0^{-1} \right) \) and \( \widehat{\sigma}_T^2 = \widehat{\operatorname{var}}\left( T_{\text{OW-score}} \right) = 2 \operatorname{trace}\left( \widehat{\varSigma} \widehat{\varSigma}_0^{-1} \widehat{\varSigma} \widehat{\varSigma}_0^{-1} \right) \). Then the scale \( \delta \) is estimated as \( \widehat{\delta} = \widehat{\sigma}_T^2 / \left( 2 \widehat{\mu}_T \right) \) and the degrees of freedom \( d \) are estimated as \( \widehat{d} = 2 \widehat{\mu}_T^2 / \widehat{\sigma}_T^2 \).
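The estimates of \( \delta \) and \( d \) follow from moment matching: a variable distributed as \( \delta \chi^2_d \) has mean \( \delta d \) and variance \( 2 \delta^2 d \), and setting these equal to \( \widehat{\mu}_T \) and \( \widehat{\sigma}_T^2 \) gives the two expressions above. The sketch below computes the four quantities from given matrices. The function name `scaled_chi2_parameters` and the example matrices are hypothetical, and the construction of \( \widehat{\varSigma}_0 \) is left to the caller.

```python
import numpy as np

def scaled_chi2_parameters(sigma_hat, sigma0_hat):
    """Moment-matched scale and degrees of freedom (illustrative sketch).

    sigma_hat:  (M x M) estimated genotype covariance matrix
    sigma0_hat: (M x M) matrix playing the role of Sigma_0-hat (invertible)
    """
    sigma_hat = np.asarray(sigma_hat, dtype=float)
    sigma0_hat = np.asarray(sigma0_hat, dtype=float)

    A = sigma_hat @ np.linalg.inv(sigma0_hat)
    mu = float(np.trace(A))             # estimated expectation of the statistic
    var = 2.0 * float(np.trace(A @ A))  # estimated variance of the statistic

    delta = var / (2.0 * mu)            # scale
    d = 2.0 * mu ** 2 / var             # degrees of freedom
    return mu, var, delta, d

# Hypothetical two-variant example
sigma_hat = [[2.0, 1.0], [1.0, 2.0]]
sigma0_hat = [[2.0, 0.0], [0.0, 2.0]]
mu, var, delta, d = scaled_chi2_parameters(sigma_hat, sigma0_hat)
print(mu, var, delta, d)
```

Given an observed statistic `T`, the approximate p value is the upper tail probability of a \( \chi^2 \) distribution with `d` degrees of freedom evaluated at `T / delta`, for example `scipy.stats.chi2.sf(T / delta, d)` if SciPy is available.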
We compare the performance of our OW-score test with (a) the WS-score test, which is the score test given by equation (1) with the weights of Madsen and Browning [3], and (b) famSKAT, the family-based sequence kernel association test of Chen et al. [11].
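To make the WS-score weights concrete, the sketch below computes weights in the style of Madsen and Browning [3], who weight each variant by the inverse of \( \sqrt{n q (1 - q)} \) with a pseudo-count-adjusted allele frequency \( q \). In their case-control setting \( q \) is estimated from unaffected individuals; for a quantitative trait there is no such group, so this example assumes that \( q \) is estimated from all individuals. That choice, the function name `madsen_browning_style_weights`, and the toy genotypes are assumptions, not details taken from this paper.

```python
import numpy as np

def madsen_browning_style_weights(G):
    """Per-variant weights 1 / sqrt(n q (1 - q)) (illustrative sketch).

    G: (n x M) matrix of minor-allele counts (0, 1, or 2).
    q is estimated from all n individuals with a pseudo-count (an assumption).
    """
    G = np.asarray(G, dtype=float)
    n = G.shape[0]
    q = (G.sum(axis=0) + 1.0) / (2.0 * n + 2.0)  # adjusted allele frequency
    return 1.0 / np.sqrt(n * q * (1.0 - q))

# Toy genotypes (hypothetical): 4 individuals, 2 variants
G = np.array([[0, 1],
              [1, 0],
              [0, 0],
              [1, 2]])
weights = madsen_browning_style_weights(G)
print(weights)
```

Rarer variants receive larger weights, so the first variant (two minor alleles) is weighted more heavily than the second (three minor alleles).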
Results
We applied our proposed method, the WS-score test, and famSKAT to the simulated data from GAW19. All tests were conducted on the 849 individuals, from 20 pedigrees, who had no missing genotypes or phenotypes. Sex, age, blood pressure medication status, and smoking status were considered as covariates in this study. We were aware of the underlying simulation model.
Power comparisons of the three tests using the average of diastolic blood pressure (DBP) at three time points as the phenotype (significance level of 5%)

Gene | \( T_{\text{OW-score}} \) | \( T_{\text{WS-score}} \) | famSKAT |
---|---|---|---|
CGN | 0.135 | 0 | 0.035 |
FLT3 | 0.005 | 0 | 0.08 |
LEPR | 0.05 | 0.015 | 0.065 |
MAP4 | 0.175 | 0.185 | 0.425 |
MTRR | 0.465 | 0.005 | 0.06 |
NRF1 | 0 | 0.005 | 0.035 |
PTTG1IP | 0.02 | 0.145 | 0.06 |
RAI1 | 0.845 | 0.005 | 0.155 |
REPIN1 | 0.915 | 0.05 | 0.085 |
SLC35E2 | 0.005 | 0 | 0.05 |
TNN | 0 | 0 | 0.035 |
ZFP37 | 0 | 0.005 | 0.005 |
ZNF443 | 0.01 | 0.015 | 0.195 |
ZNF544 | 0.005 | 0.015 | 0.06 |
We also evaluated the type I error rate of the proposed OW-score test. To do so, we used 1000 blocks (100 variants in each block) from chromosome 5 that are far from causal variants. In each block, we applied the OW-score test to each of the 200 replicates to test for association between genotypes and the phenotype of interest, obtaining one p value for each replicate and each block. The estimated type I error rates of the proposed test were 0.04887, 0.00921, and 0.00131 at significance levels of 0.05, 0.01, and 0.001, respectively. We also considered the average of systolic blood pressure (SBP) at three time points as the phenotype of interest, which yielded similar results.
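The empirical type I error rate is the fraction of null p values that fall below the significance level; with 1000 blocks and 200 replicates, each rate above is based on 200,000 p values. The sketch below shows this calculation together with a binomial standard error. The function names and the toy p values are hypothetical, and the standard error treats the tests as independent, which is an assumption.

```python
import math

def rejection_rate(p_values, alpha):
    """Fraction of p values strictly below the significance level alpha."""
    p_values = list(p_values)
    rejected = sum(1 for p in p_values if p < alpha)
    return rejected / len(p_values)

def binomial_se(rate, n_tests):
    """Binomial standard error of an estimated rate, assuming independent tests."""
    return math.sqrt(rate * (1.0 - rate) / n_tests)

# Toy p values (hypothetical), one per replicate and block
toy_p = [0.003, 0.04, 0.2, 0.5, 0.0004, 0.7, 0.03, 0.9, 0.6, 0.08]
rates = {alpha: rejection_rate(toy_p, alpha) for alpha in (0.05, 0.01, 0.001)}
print(rates)

# Standard error of a nominal 5% rate estimated from 200,000 tests
se_nominal = binomial_se(0.05, 1000 * 200)
print(se_nominal)
```

The same rejection-rate calculation, applied to the p values obtained for a causal gene across replicates, gives an empirical power estimate.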
Discussion
Next-generation sequencing technologies make it possible to test rare-variant associations directly. However, the development of powerful statistical methods for rare-variant association studies is still underway. In this article, we proposed a novel statistical method for rare-variant association studies of quantitative traits in general pedigrees. The application to the GAW19 data set showed that the proposed method has a correct type I error rate and, in most of the cases considered, is more powerful than the two methods with which it was compared.
We described our method for quantitative traits. For qualitative traits, we can derive a score test similar to that given by equation (1). However, the performance of the proposed method for qualitative traits requires further investigation. Like many statistical methods for rare variant association studies, the proposed method can consider phenotype measurement at only one time point. Statistical methods based on sequence data have been developed for unrelated individuals that have phenotype measurements at multiple time points [22]. From a statistical standpoint, modeling using longitudinal phenotypes is more informative than that using phenotypes at a single time point and thus can increase the power of an association test [22, 23]. Our future work includes extension of the proposed method to longitudinal phenotypes.
Conclusions
In this article, we developed a novel statistical method for rare-variant association studies in general pedigrees, whether the pedigrees are randomly ascertained or ascertained on the basis of phenotype. Application to the GAW19 data set showed that the proposed method is more powerful than the other two methods in most of the cases. The method takes a retrospective view, which allows it to account for complex and undefined ascertainment of families. The GAW19 data are based on randomly ascertained pedigrees, and the results of applying our method to these data showed that it has a correct type I error rate under random ascertainment. When ascertainment is not random and is instead based on trait values, the proposed method is still expected to have a correct type I error rate. If pedigrees are ascertained because of extreme trait values, the proposed method is expected to have higher power than methods that assume randomly ascertained pedigrees.
Declarations
Acknowledgements
The GAW19 whole genome sequence data were provided by the T2D-GENES (Type 2 Diabetes Genetic Exploration by Next-generation sequencing in Ethnic Samples) Consortium, which is supported by National Institutes of Health (NIH) grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW19 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The GAW is supported by NIH grant R01 GM031575.
Declarations
This article has been published as part of BMC Proceedings Volume 10 Supplement 7, 2016: Genetic Analysis Workshop 19: Sequence, Blood Pressure and Expression Data. Summary articles. The full contents of the supplement are available online at http://bmcproc.biomedcentral.com/articles/supplements/volume-10-supplement-7. Publication of the proceedings of Genetic Analysis Workshop 19 was supported by National Institutes of Health grant R01 GM031575.
Authors’ contributions
QS designed the overall study, HZ and ZW conducted statistical analyses, and HZ, XW, and QS drafted the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
- Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83(3):311–21.
- Morgenthaler S, Thilly WG. A strategy to discover genes that carry multiallelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutat Res. 2007;615(1-2):28–56.
- Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009;5(2):e1000384.
- Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, Sunyaev SR. Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet. 2010;86(6):832–8.
- Han F, Pan W. A data-adaptive sum test for disease association with multiple common or rare variants. Hum Hered. 2010;70(1):42–54.
- Hoffmann TJ, Marini NJ, Witte JS. Comprehensive approach to analyzing rare genetic variants. PLoS One. 2010;5(11):e13584.
- Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare variant association testing for sequencing data with the sequence kernel association test (SKAT). Am J Hum Genet. 2011;89(1):82–93.
- Sha Q, Wang X, Wang X, Zhang S. Detecting association of rare and common variants by testing an optimally weighted combination of variants. Genet Epidemiol. 2012;36(6):561–71.
- Shi G, Rao D. Optimum designs for next-generation sequencing to discover rare variants for common complex disease. Genet Epidemiol. 2011;35(6):572–9.
- Liu D, Leal S. A unified framework for detecting rare variant quantitative trait associations in pedigree and unrelated individuals via sequence data. Hum Hered. 2012;73(2):105–22.
- Chen H, Meigs JB, Dupuis J. Sequence kernel association test for quantitative traits in family samples. Genet Epidemiol. 2013;37(2):196–204.
- Svishcheva GR, Belonogova NM, Axenovich TI. FFBSKAT: fast family-based sequence kernel association test. PLoS One. 2014;9(6):e99407.
- Zhu X, Feng T, Li Y, Lu Q, Elston RC. Detecting rare variants for complex traits using family and unrelated data. Genet Epidemiol. 2010;34(2):171–87.
- Feng T, Elston R, Zhu X. Detecting rare and common variants for complex traits: sibpair and odds ratio weighted sum statistics (SPWSS, ORWSS). Genet Epidemiol. 2011;35(5):398–409.
- Zhu Y, Xiong M. Family-based association studies for next-generation sequencing. Am J Hum Genet. 2012;90(6):1028–45.
- Schaid DJ, McDonnell SK, Sinnwell JP, Thibodeau SN. Multiple genetic variant association testing by collapsing and kernel methods with pedigree or population structured data. Genet Epidemiol. 2013;37(5):409–18.
- Pan W. Asymptotic tests of association with multiple SNPs in linkage disequilibrium. Genet Epidemiol. 2009;33(6):497–507.
- Liu D, Lin X, Ghosh D. Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics. 2007;63(4):1079–88.
- Liu H, Tang Y, Zhang H. A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables. Comput Stat Data Anal. 2009;53:853–6.
- Kwee LC, Liu D, Lin X, Ghosh D, Epstein MP. A powerful and flexible multi locus association test for quantitative traits. Am J Hum Genet. 2008;82(2):386–97.
- Liu D, Ghosh D, Lin X. Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC Bioinformatics. 2008;9:292.
- Wang S, Fang S, Sha Q, Zhang S. Detecting association of rare and common variants by testing an optimally weighted combination of variants with longitudinal data. BMC Proc. 2014;8 Suppl 1:S91.
- Furlotte N, Eskin E, Eyheramendy S. Genome-wide association mapping with longitudinal data. Genet Epidemiol. 2012;36(5):463–71.