Extended T2 tests for longitudinal family data in whole genome sequencing studies
© Liu et al.; licensee BioMed Central Ltd. 2014
Published: 17 June 2014
Family data and rare variants are two key features of whole genome sequencing analysis for hunting the missing heritability of common human diseases. Recently, Zhu and Xiong proposed the generalized T2 tests that combine rare variant analysis and family data analysis. In similar fashion, we developed the extended T2 tests for longitudinal whole genome sequencing data for family-based association studies. The new methods simultaneously incorporate three correlation sources: from linkage disequilibrium, from pedigree structure, and from the repeated measures of covariates. We assess and compare these methods using the simulated data from Genetic Analysis Workshop 18. We show that, in general, the extended T2 tests incorporating longitudinal repeated measures have higher power than the single-time-point T2 tests in detecting hypertension-associated genome segments.
Compared with traditional genome-wide association studies (GWAS), whole genome sequencing (WGS) is a more efficient way of finding the missing heritability of diseases . While GWAS are mostly based on microarray genotyping, which can discover only common genetic variants, WGS is able to reveal rare and structural variants, which are crucial factors behind disease phenotypes . As the cost of sequencing decreases significantly, we expect that WGS will become increasingly predominant in the hunt for novel disease genes.
Most of the recent discoveries from sequencing studies were based on the Mendelian trait model . Genetic association studies based on the complex trait model are challenging because of limited sample size as well as the new properties of sequencing data. WGS data are distinct from GWAS data in two major aspects. First, WGS provides a huge number of rare variants. Even with large allelic effects, caused by very small minor allele frequencies (MAFs), the association tests between single rare variants and the trait are less powerful and unreliable . Second, family designs play a critical role in WGS. Because of its relatively high cost, WGS tends to exploit families of patients, so that the rare causal variants are likely enriched through cotransmission of the disease . Furthermore, the pedigree structure allows statistical imputation of the genotypes at no experimental cost, which potentially increases the statistical power [6, 7].
In a recent celebrated work, Zhu and Xiong proposed a set of generalized T2 tests for family-based WGS data . These methods simultaneously address the correlations among genetic variants (i.e., linkage disequilibrium [LD]) and the correlations among family members (i.e., kinship). Rare-variant collapsing procedures [9, 10] are also integrated into the tests. However, these methods cannot incorporate covariates and do not address the correlation structure for longitudinal repeated measures. In this study, we further extended the methodology of the T2 tests to address these limits. By applying these methods to an analysis of the Genetic Analysis Workshop 18 (GAW18) simulation data, we showed that the asymptotic null distributions of Zhu and Xiong  are problematic in controlling the type I error rate, and that our extended methods are likely more powerful for longitudinal data.
Generalized T2 tests for family data
Zhu and Xiong  showed that the covariance as a result of both LD and kinship could be explicitly expressed as a Kronecker product of the corresponding covariance matrices. Following the idea of Hotelling's T2 test [11, 12], the authors proposed a generalized T2 test that incorporates these covariance matrices, which are estimated separately by using the same data. Depending on various strategies of collapsing of rare variants, here we consider three generalized T2 tests of Zhu and Xiong.
T 2 : The genotypes of rare variants between adjacent common variants are summed up, and one covariance matrix is estimated for both common and collapsed rare variants.
CMC.ZXpaper (CMC test): The rare variants are collapsed in the same way as above, but the covariance matrices are estimated separately for common and rare variants (assuming they are uncorrelated).
CMC.ZXcode: Rare variants are collapsed by the maximum of their genotypes, and one covariance matrix is estimated for both common and collapsed rare variants. This strategy follows the R function pedCMC of Zhu and Xiong (https://sph.uth.edu/hgc/faculty/xiong/software-D.html).
Extended T2 test for family data with longitudinal repeated measures
Following the notations in Figure 1, let be the number of the cases, be the number of the controls, and . The genotype column vector of the tth common variant is , the aligned column vector of all T common variants is represented by . Similarly, for the collapsed genotypes of rare variants, the genotype column vector of the sth rare variant is , and for totally S rare variants. Considering the covariates with longitudinal repeated measures, the column vector of the cth covariate at the jth repeated measurement point is , and the aligned column vector is for totally C covariates, each measured for J times. Similarly, the row vectors are denoted as follows. For , the vectors are the rows in the block of common variants, the row vectors are for rare variants, and are for longitudinal covariates. The row average in cases is , and that in controls is . The row averages for rare variants and covariates are obtained analogously.
Following the idea of the generalized T2 test, the extended T2 test is where ,
The key problem is to estimate . Following the assumption of Zhu and Xiong  that and are independent, we consider two cases. In the first case, assume is also independent with and . Then , whereby
and are the covariance matrix among the elements in and , respectively (e.g., the LD among the genetic variants), is the kinship matrix, and denotes the Kronecker product. For the covariate block, we consider , where is the covariance matrix among the elements , and is a matrix that captures the correlations among individuals in terms of environmental covariates.
We consider two simplification assumptions: (a) indicates that covariate variables among individuals are independent, considering the individual dependence has been captured by the genetics; and (b) indicates that covariate variables among individuals have the similar dependence pattern as that according to genetics (e.g., children may be more likely to smoke if parents do, or the age of children is correlated with the age of parents). According to the various rare-variant collapsing strategies in the above generalized T2 tests by Zhu and Xiong , the corresponding extended T2 tests are denoted T2.longi, CMC.ZXpaper.longi, and CMC.ZXcode.longi, respectively.
Asymptotic and permutation tests
Zhu and Xiong derived asymptotic chi-square distribution for the null. In their paper , the degrees of freedom (DF) equal the number of variants; in their R code, the DF equal the rank of data matrix. The latter is better but still gives inflated p values as shown below. Thus, we applied a permutation test for the type I error rate being well controlled. Specifically, let and , , , denote the test statistics of the gth genome window from the original data and from the lth permutation, respectively. The empirical p value for the gth window is , where L =1000. Because the target is to find the associations with genetic variants, not with the covariates, the permutations are applied only to the genotype data to retain the relationship between response and the covariates.
For evaluating the above methods, we used the "dose" genotype data of 1,215,399 single-nucleotide variants (SNVs) on chromosome 3 and the 200 simulation replicates of hypertension outcomes and covariates (age, hypertension medicine use, smoking status). As an arbitrary, yet simple, way to group variants, we split chr3 into 19,080 windows, each 10 kilobase pairs (kbp) long. In each window, rare variants (MAF <0.05) between adjacent common variants were collapsed, leaving 654,415 genetic factors (common or rare variants, or collapsed rare-variant groups) to be analyzed. The average number of genetic factors contained in the windows is 34.3, the median is 32, the minimum is 1, and the maximum is 330. For the simulated phenotypes, the number of individuals is 849 in 20 families, where the family sizes are from 21 to 74, with the mean 42.45 and the median 36.5. There are 188 simulated true SNVs contained in 129 true windows (1, 3, 7, 32, and 86 windows contain 5, 4, 3, 2, and 1 true SNVs, respectively) on chr3. The knowledge of these true SNVs was used only for evaluating the power of these association tests, not for designing data analysis strategy.
In the simulation data of GAW18, true SNVs are always allocated on genes. Using genes as windows to group SNVs may concentrate the true SNVs and has the potential to improve the detection power. However, the idea of WGS, instead of exome sequencing, is that the disease-related genetic factors might allocate outside of genes. So we did not use the knowledge that true SNVs are in genes; instead, we evaluated the methods based on fixed-genome segment windows.
There are several limitations and future research topics based on the current work. First, matrix estimation is a key issue in this methodology development. Good estimation of matrices and their inverses can better incorporate correlation structures' potential to improve the performance. Here we simply adopted the same variance matrix estimate in Hotelling's T2 test. This is a maximal likelihood estimate if observations are independent. Unfortunately, independency is not true for family data in the first place. Besides the correlation issue, for a high-dimensional matrix with a potentially sparse structure, there are better estimates of the covariance matrix and its inverse . Second, the permutation test is relatively slow, especially for handling large amounts of data in WGS. It would be desirable to derive more accurate asymptotic distributions for fast and precise p value calculation. Third, necessary modifications of these tests are needed to handle missing data and unequal numbers of repeated measures, which are common problems.
We have extended Zhu and Xiong's  generalized T2 tests to incorporate the covariates with longitudinal repeated measures. These methods account for 3 sources of correlation structures among genetic variants, family members, and time series observations. Compared with the T2 test methods for snapshot observations, the new methods have higher power to detect hypertension-related genome segments according to the GAW18 simulation data.
We are grateful to the National Institutes of Health funding support for GAW18 and for the student travel awards to YL and JX. We are grateful to WPI Computing and Communications Center for computational support. The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
- Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al: Finding the missing heritability of complex diseases. Nature. 2009, 461: 747-753. 10.1038/nature08494.PubMed CentralView ArticlePubMedGoogle Scholar
- Cirulli ET, Goldstein DB: Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat Rev Genet. 2010, 11: 415-425. 10.1038/nrg2779.View ArticlePubMedGoogle Scholar
- Stitziel NO, Kiezun A, Sunyaev S: Computational and statistical approaches to analyzing variants identified by exome sequencing. Genome Biol. 2011, 12: 227-10.1186/gb-2011-12-9-227.PubMed CentralView ArticlePubMedGoogle Scholar
- Kryukov GV, Shpunt A, Stamatoyannopoulos JA, Sunyaev SR: Power of deep, all-exon resequencing for discovery of human trait genes. Proc Natl Acad Sci U S A. 2009, 106: 3871-3876. 10.1073/pnas.0812824106.PubMed CentralView ArticlePubMedGoogle Scholar
- Ott J, Kamatani Y, Lathrop M: Family-based designs for genome-wide association studies. Nat Rev Genet. 2011, 12: 465-474.View ArticlePubMedGoogle Scholar
- Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin--rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2001, 30: 97-101.View ArticlePubMedGoogle Scholar
- Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR: Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet. 2012, 44: 955-999. 10.1038/ng.2354.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhu Y, Xiong M: Family-based association studies for next-generation sequencing. Am J Hum Genet. 2012, 90: 1028-1045. 10.1016/j.ajhg.2012.04.022.PubMed CentralView ArticlePubMedGoogle Scholar
- Cohen JC, Kiss RS, Pertsemlidis A, Marcel YL, McPherson R, Hobbs HH: Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science. 2004, 305: 869-872. 10.1126/science.1099870.View ArticlePubMedGoogle Scholar
- Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, Kathiresan S, Purcell SM, Roeder K, Daly MJ: Testing for an unusual distribution of rare variants. PLoS Genet. 2011, 7: e1001322-10.1371/journal.pgen.1001322.PubMed CentralView ArticlePubMedGoogle Scholar
- Hotelling H: The generalization of Student's ratio. Ann Math Statist. 1931, 2: 360-378. 10.1214/aoms/1177732979.View ArticleGoogle Scholar
- Xiong M, Zhao J, Boerwinkle E: Generalized T2 test for genome association studies. Am J Hum Genet. 2002, 70: 1257-1268. 10.1086/340392.PubMed CentralView ArticlePubMedGoogle Scholar
- Yang J, Weedon MN, Purcell S, Lettre G, Estrada K, Willer CJ, Smith AV, Ingelsson E, O'Connell JR, Mangino M, et al: Genomic inflation factors under polygenic inheritance. Eur J Hum Genet. 2011, 19: 807-812. 10.1038/ejhg.2011.39.PubMed CentralView ArticlePubMedGoogle Scholar
- Cai TT, Zhang CH, Zhou HH: Optimal rates of convergence for covariance matrix estimation. Ann Statist. 2010, 38: 2118-2144. 10.1214/09-AOS752.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.