- Open Access
A multi-level model for analyzing whole genome sequencing family data with longitudinal traits
© Chen et al.; licensee BioMed Central Ltd. 2014
- Published: 17 June 2014
Compared with microarray-based genotyping, next-generation whole genome sequencing (WGS) studies have the strength to provide greater information for the identification of rare variants, which likely account for a significant portion of missing heritability of common human diseases. In WGS, family-based studies are important because they are likely enriched for rare disease variants that segregate with the disease in relatives. We propose a multilevel model to detect disease variants using family-based WGS data with longitudinal measures. This model incorporates the correlation structure from family pedigrees and that from repeated measures. The iterative generalized least squares algorithm was applied to estimation of parameters and test of associations. The model was applied to the data of Genetic Analysis Workshop 18 and compared with existing linear mixed-effect models. The multilevel model shows higher power at practical p-value levels and a better type I error control than linear mixed-effect model. Both multilevel and linear mixed-effect models, which use the longitudinal repeated information, have higher power than the methods that only use data collected at one time point.
- Rare Variant
- Multilevel Model
- Whole Genome Sequencing
- Kinship Matrix
- Genetic Analysis Workshop
Whole genome sequencing (WGS) provides comprehensive collection of genetic variations and thus is promising in discovering novel inheritable factors for both Mendelian and complex traits. Two data properties distinguish WGS from microarray-based genome-wide association study (GWAS). First, WGS data contain rare causal mutations that could have large allelic effect. However, the statistical association for such rare variants is weak at population level because of small allele frequency ; therefore, population-based case-control study, which is commonly applied in GWAS, is less powerful for WGS. Second, family design is attractive and commonly applied in WGS studies. Causal rare variants are likely enriched through cotransmission in families. Moreover, pedigree structures allow statistical imputation of genotypes without experimental cost . Additionally, family-based data analyses automatically control for population stratification and are potentially able to incorporate helpful genetic information on phase, effects of parental origin, cotransmission of variants, and so on.
Detection of disease variants can also be facilitated by trajectory information on individual changes over time. Longitudinal genetic studies enable a close investigation of both genetic factors that lead to a disease and environmental determinants that modulate the subsequent progression of the disease. In WGS, it is important to develop powerful methods that accommodate both within-family correlation structure and correlation among repeated measures. Here we extend a multilevel model [4, 5] to WGS longitudinal family data, which simultaneously accounts for familial and time-series correlations. The implementation is based on the iterative generalized least squares (IGLS) algorithm [6, 7], which allows conclusions to be drawn about both genetic and environmental effects while controlling the complex correlation structure. We assessed the multilevel model by comparing with the linear mixed-effects (LME) models using "dose" genotypes on chromosome 3 and the 200 simulation replicates of longitudinal response and covariates provide by Genetic Analysis Workshop 18 (GAW18) .
Method 1: LME model
where is response of the ith repeated measure of the jth individual in the kth family, where , , and , with being the number of measures for individual j in family k and being the number of individuals in family k; is a covariate vector (including genotype) for fixed effects . is a covariate vector for random effects , where , the covariance matrix among individuals in family k (e.g., the kinship matrix). Also, where is the covariance matrix among the repeated measures for individual j in family k. We assume and are independent between each other and among themselves for all j and k. To implement the LME model, we applied the following R package:
GWAF: R package GWAF was design for genome-wide analysis for family data . It accounts for the pedigree correlation structure by kinship matrix. However, it does not handle longitudinal repeated measures. So this method was used to represent the cross-sectional analysis for family data and was compared with other family-data analysis incorporating longitudinal information.
Lmekin: R function lmekin in package coxme  was applied to account for both the family correlation structure and the correlation structure of the longitudinal repeated measures. Specifically, we set the model that includes a random intercept at individual level to account for the correlation of repeated measures assuming compound symmetry structure, a random intercept at family level to account for the clustering effect among family members. Furthermore, the kinship matrix was incorporated through its varlist option to account for the kinship correlation among family members.
Method 2: Multi-level model
So the term can be decomposed as , such that the matrixes are all known and the parameters can be estimated as described below. Certainly, more complicated correlation structure can be modeled by a further decomposition according to the number of covariance parameters to be estimated. Finally, is the independent and identically distributed error term, and is the identity matrix for all observations.
The estimation procedure starts at an arbitrary (e.g., obtained from a multiple regression fitting) and then iterates between steps 1 and 2 until convergence. Because the IGLS estimate is equivalent to the restricted maximal likelihood estimate , we can apply a Z-test to calculate p-values for the elements in , which contains the fixed genetic effects. In particular, because , the Z-test statistic for is , and the two-tailed p-value is . Certainly, this multilevel model has the potential to be further extended to incorporate a more complicated covariance structure for more sophisticated modeling.
For evaluating the methods, we used the "dose" genotype data of the 169 true single-nucleotide variants (SNVs) on chromosome 3 that were associated with diastolic blood pressure (DBP) in 200 simulation replicates. These data contain 849 individuals in 20 families, and the number of individuals in families is 21 to 74, with the mean 42.45 and the median 36.5. Kinship matrixes of these families were directly calculated based on the pedigree information. The above models were fitted with or without covariates: age, blood pressure medicine status, and sex. For GWAF, which does not analyze longitudinal data, we applied the DBP at the first time point as the response. For lmekin and multilevel model, we applied all three longitudinal repeated measures. The knowledge of the true SNVs was only used for evaluating the power of these association tests, not for the data analysis strategy.
We studied the power of detecting the 169 DBP-related SNVs on chromosome 3. Based on the phenotype data in the simulation replicate 1, Figure 1c shows the true positive rate of detecting these true SNVs over a variety of p-value cutoffs. In general, the power of detecting true SNVs is low at small or moderate p-values. This phenomenon indicates that the sample size is still relatively too small to detect a large proportion of the weak genetic effects simulated in the data. At the same time, longitudinal methods (lmekin and multilevel models) are better than the one-time-point model (GWAF); the latter does not have much power except for the strongest SNVs. The lmekin and the multilevel models have similar performance overall, but the multilevel model is better at the region of relatively small p-values (e.g., p-value <0.1) that are of practical interest. For both lmekin and multilevel models, there is no big difference between the models with and without covariates. We also studied the power of detecting specific SNVs by using the data of 200 simulation replicates. For example, by the multilevel model with covariates, the strongest SNV at location 48040283 always got significant p-values from 1.8 × 10−31 to 3.09 × 10−9.
In this work, our main focus was to see whether modeling longitudinal data could provide helpful information to increase the power of detecting true SNVs when compared with the methods for analyzing data at one time point. Here we directly applied the original genotype data into modeling and illustrated that the longitudinal repeated observations were indeed helpful to detect DBP-related genetic factors. However, many true SNVs are rare variants, some of which could have big allelic effect for specific individuals when the disease mutation presents. As a result of small minor allele frequency (MAF), the association between such rare variants and their corresponding phenotypes is still weak at the population level . This may be one of the main reasons why the overall power is low in detecting the majority of the causal or regulatory genetic factors. Various strategies of rare-variant collapsing procedures [14, 15] could be applied to grouping and combining genotypes of rare variants, which has potential to further increase the power.
The computational speed of the multilevel model is comparable with the linear mixed-effects model estimation by lmekin. Both models are computationally demanding (e.g., ~10 minutes for our implementation of multilevel model and 8 minutes for lmekin to process one SNV on a MacBook Pro with 2.9-GHz Intel Core i7). However, we observed that the convergence speed of the iterative generalized least squares algorithm for the multilevel model is relatively fast: the results usually do not change much after two iterations. So, restricting the number of iterations could potentially reduce computational time. Further study on improving computation efficiency will be carried out in the near future.
We developed a multilevel model for fitting family-based genotype data and repeated measures of covariates to quantitative longitudinal response, which accounts for correlations among individuals, nesting effects at the family and individual levels, and the time series correlations due to the repeated measures of covariates and responses. Using the simulated data of GAW18, this method showed more accurate type I error control than the LME model by lmekin, which is likely the result of better account for correlations among individuals. The multilevel model also provided higher power at small p-value cutoffs. At the same time, both lmekin and multilevel model, which use the longitudinal information, have higher power than GWAF, which only models data at one time point.
We are grateful to the National Institutes of Health (NIH) funding support (GM031575) for GAW18 and for a student travel award to TC We are grateful to WPI Computing and Communications Center for computational support. The GAW18 WGS data were provided by the T2D-GENES (Type 2 Diabetes Genetic Exploration by Next-generation sequencing in Ethnic Samples) Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The GAW is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
- Kryukov GV, Shpunt A, Stamatoyannopoulos JA, Sunyaev SR: Power of deep, all-exon resequencing for discovery of human trait genes. Proc Natl Acad Sci USA. 2009, 106: 3871-3876. 10.1073/pnas.0812824106.PubMed CentralView ArticlePubMedGoogle Scholar
- Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin--rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2001, 30: 97-101.View ArticlePubMedGoogle Scholar
- Ott J, Kamatani Y, Lathrop M: Family-based designs for genome-wide association studies. Nat Rev Genet. 2011, 12: 465-474.View ArticlePubMedGoogle Scholar
- Goldstein H: Multilevel mixed linear model analysis using iterative generalized least squares. Biometrika. 1986, 73: 43-56. 10.1093/biomet/73.1.43.View ArticleGoogle Scholar
- Goldstein H, Browne W, Rasbash J: Multilevel modeling of medical data. Stat Med. 2002, 3291-3315. 21Google Scholar
- Goldstein H: Restricted unbiased iterative generalized least-squares estimation. Biometrika. 1989, 622-623. 76Google Scholar
- Goldstein H, Rasbash J: Efficient computational procedures for the estimation of parameters in multilevel models based on iterative generalised least squares. Comput Stat Data Anal. 1992, 13: 63-71. 10.1016/0167-9473(92)90154-8.View ArticleGoogle Scholar
- Almasy L, Dyer TD, Peralta JM, Jun G, Fuchsberger C, Almeida MA, Kent JW, Fowler S, Duggirala R, Blangero J: Data for Genetic Analysis Workshop 18: human whole genome sequence, blood pressure, and simulated phenotypes in extended pedigrees. BMC Proc. 2014, 8 (suppl 2): S2-PubMed CentralView ArticlePubMedGoogle Scholar
- Chen MH, Yang Q: GWAF: an R package for genome-wide association analyses with family data. Bioinformatics. 2010, 26: 580-581. 10.1093/bioinformatics/btp710.PubMed CentralView ArticlePubMedGoogle Scholar
- Therneau T: The lmekin function. [ftp://126.96.36.199/R/web/packages/coxme/vignettes/lmekin%20pdf%202012]
- Rasbash J, O'Connor T, Jenkin J: Multilevel models for family data. Bristol, England: Centre for Multilevel Modelling, University of Bristol. 2009Google Scholar
- Searle S: Large sample variances of maximum likelihood estimators of variance components using unbalanced data. Biometrics. 1970, 26: 505-524. 10.2307/2529105.View ArticleGoogle Scholar
- Efron B: Size,power and false discovery rates. Ann Statist. 2007, 35: 1351-1377. 10.1214/009053606000001460.View ArticleGoogle Scholar
- Cohen JC, Kiss RS, Pertsemlidis A, Marcel YL, McPherson R, Hobbs HH: Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science. 2004, 305: 869-872. 10.1126/science.1099870.View ArticlePubMedGoogle Scholar
- Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, Kathiresan S, Purcell SM, Roeder K, Daly MJ: Testing for an unusual distribution of rare variants. PLoS Genet. 2011, 7: e1001322-10.1371/journal.pgen.1001322.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.