Proceedings | Open | Published:
A multi-level model for analyzing whole genome sequencing family data with longitudinal traits
BMC Proceedingsvolume 8, Article number: S86 (2014)
Compared with microarray-based genotyping, next-generation whole genome sequencing (WGS) studies have the strength to provide greater information for the identification of rare variants, which likely account for a significant portion of missing heritability of common human diseases. In WGS, family-based studies are important because they are likely enriched for rare disease variants that segregate with the disease in relatives. We propose a multilevel model to detect disease variants using family-based WGS data with longitudinal measures. This model incorporates the correlation structure from family pedigrees and that from repeated measures. The iterative generalized least squares algorithm was applied to estimation of parameters and test of associations. The model was applied to the data of Genetic Analysis Workshop 18 and compared with existing linear mixed-effect models. The multilevel model shows higher power at practical p-value levels and a better type I error control than linear mixed-effect model. Both multilevel and linear mixed-effect models, which use the longitudinal repeated information, have higher power than the methods that only use data collected at one time point.
Whole genome sequencing (WGS) provides comprehensive collection of genetic variations and thus is promising in discovering novel inheritable factors for both Mendelian and complex traits. Two data properties distinguish WGS from microarray-based genome-wide association study (GWAS). First, WGS data contain rare causal mutations that could have large allelic effect. However, the statistical association for such rare variants is weak at population level because of small allele frequency ; therefore, population-based case-control study, which is commonly applied in GWAS, is less powerful for WGS. Second, family design is attractive and commonly applied in WGS studies. Causal rare variants are likely enriched through cotransmission in families. Moreover, pedigree structures allow statistical imputation of genotypes without experimental cost . Additionally, family-based data analyses automatically control for population stratification and are potentially able to incorporate helpful genetic information on phase, effects of parental origin, cotransmission of variants, and so on.
Detection of disease variants can also be facilitated by trajectory information on individual changes over time. Longitudinal genetic studies enable a close investigation of both genetic factors that lead to a disease and environmental determinants that modulate the subsequent progression of the disease. In WGS, it is important to develop powerful methods that accommodate both within-family correlation structure and correlation among repeated measures. Here we extend a multilevel model [4, 5] to WGS longitudinal family data, which simultaneously accounts for familial and time-series correlations. The implementation is based on the iterative generalized least squares (IGLS) algorithm [6, 7], which allows conclusions to be drawn about both genetic and environmental effects while controlling the complex correlation structure. We assessed the multilevel model by comparing with the linear mixed-effects (LME) models using "dose" genotypes on chromosome 3 and the 200 simulation replicates of longitudinal response and covariates provide by Genetic Analysis Workshop 18 (GAW18) .
Method 1: LME model
Linear mixed-effects models offer a natural approach to deal with correlation structures among observations. For longitudinal family data, we can define an LME model:
where is response of the ith repeated measure of the jth individual in the kth family, where , , and , with being the number of measures for individual j in family k and being the number of individuals in family k; is a covariate vector (including genotype) for fixed effects . is a covariate vector for random effects , where , the covariance matrix among individuals in family k (e.g., the kinship matrix). Also, where is the covariance matrix among the repeated measures for individual j in family k. We assume and are independent between each other and among themselves for all j and k. To implement the LME model, we applied the following R package:
GWAF: R package GWAF was design for genome-wide analysis for family data . It accounts for the pedigree correlation structure by kinship matrix. However, it does not handle longitudinal repeated measures. So this method was used to represent the cross-sectional analysis for family data and was compared with other family-data analysis incorporating longitudinal information.
Lmekin: R function lmekin in package coxme  was applied to account for both the family correlation structure and the correlation structure of the longitudinal repeated measures. Specifically, we set the model that includes a random intercept at individual level to account for the correlation of repeated measures assuming compound symmetry structure, a random intercept at family level to account for the clustering effect among family members. Furthermore, the kinship matrix was incorporated through its varlist option to account for the kinship correlation among family members.
Method 2: Multi-level model
We extend the classic multi-level model [4, 5, 11] to analyze WGS family data with longitudinal repeated measures. The response for the ith measure (level 1) of the jth individual (level 2) in the kth family (level 3) can be written as
where and are similarly defined in (1). The rest random-effect terms on the right side of the equation are normal distributed with mean zero and variance characterizing the correlation structure among observations. Denote the response vector . We have , where
The first random term characterizes the clustering effects at the family and individual levels. Specifically, , where is a matrix of 1's with dimension being the size of kth family, is a matrix of 1's with dimension being the number of repeated measures per individual. denotes the matrix direct sum, and denotes the Kronecker product. The second random term indicates the genetic correlation (kinship coefficients) among individuals in the kth family. Mathematically, , where is the kinship matrix. The third random term indicates the correlation among repeated measures in the jth individual: , where is an identity matrix with dimension being the size of the kth family and, is the correlation matrix among repeated individuals. For example, if we assume compound symmetry structure, for three repeated measures,
So the term can be decomposed as , such that the matrixes are all known and the parameters can be estimated as described below. Certainly, more complicated correlation structure can be modeled by a further decomposition according to the number of covariance parameters to be estimated. Finally, is the independent and identically distributed error term, and is the identity matrix for all observations.
Step 1: Given , estimate by the least squares estimation of variance . Specifically, this is a procedure of fitting regression model of response vector to the design matrix , where denotes the vectorization of the upper triangular part of matrix . So,
Step 2: Given , estimate by the weighed least squares estimate:
The estimation procedure starts at an arbitrary (e.g., obtained from a multiple regression fitting) and then iterates between steps 1 and 2 until convergence. Because the IGLS estimate is equivalent to the restricted maximal likelihood estimate , we can apply a Z-test to calculate p-values for the elements in , which contains the fixed genetic effects. In particular, because , the Z-test statistic for is , and the two-tailed p-value is . Certainly, this multilevel model has the potential to be further extended to incorporate a more complicated covariance structure for more sophisticated modeling.
For evaluating the methods, we used the "dose" genotype data of the 169 true single-nucleotide variants (SNVs) on chromosome 3 that were associated with diastolic blood pressure (DBP) in 200 simulation replicates. These data contain 849 individuals in 20 families, and the number of individuals in families is 21 to 74, with the mean 42.45 and the median 36.5. Kinship matrixes of these families were directly calculated based on the pedigree information. The above models were fitted with or without covariates: age, blood pressure medicine status, and sex. For GWAF, which does not analyze longitudinal data, we applied the DBP at the first time point as the response. For lmekin and multilevel model, we applied all three longitudinal repeated measures. The knowledge of the true SNVs was only used for evaluating the power of these association tests, not for the data analysis strategy.
First, we evaluated the type I error rate control for these methods. Fitting the 169 DBP-related SNVs on chromosome 3 to Q1, a null response provided by GAW18 "to facilitate assessment of type I error," we plotted in Figure 1a the false-positive rates over a variety of p-value cutoffs. It is clear that the type I error rate of lmekin is highly inflated, and the type I error rates of multilevel model and GWAF are closer to the expected level around the diagonal line. The inflation is worse when covariates are contained in the models (denoted "covar"). We also studied the type I error rate through permutation. Figure 1b shows the false-positive rates for fitting the permuted genotype data of these SNVs to DBP response, which retained the relationship between covariates and DBP but destroyed the association between SNVs and DBP. Now both lmekin and our multilevel models control the type I error rate perfectly well. To explain the puzzle, we checked the GAW18 "answers" and found that Q1 was simulated as a quantitative trait correlated among family members with heritability 0.68, but the total heritability for DBP is only 0.317. This means that Q1 values have stronger correlation than DBP values do. The inflation of the type I error of lmekin indicates that this LME model is less capable than our multilevel model in accounting for the correlation among individuals (cf. ).
We studied the power of detecting the 169 DBP-related SNVs on chromosome 3. Based on the phenotype data in the simulation replicate 1, Figure 1c shows the true positive rate of detecting these true SNVs over a variety of p-value cutoffs. In general, the power of detecting true SNVs is low at small or moderate p-values. This phenomenon indicates that the sample size is still relatively too small to detect a large proportion of the weak genetic effects simulated in the data. At the same time, longitudinal methods (lmekin and multilevel models) are better than the one-time-point model (GWAF); the latter does not have much power except for the strongest SNVs. The lmekin and the multilevel models have similar performance overall, but the multilevel model is better at the region of relatively small p-values (e.g., p-value <0.1) that are of practical interest. For both lmekin and multilevel models, there is no big difference between the models with and without covariates. We also studied the power of detecting specific SNVs by using the data of 200 simulation replicates. For example, by the multilevel model with covariates, the strongest SNV at location 48040283 always got significant p-values from 1.8 × 10−31 to 3.09 × 10−9.
In this work, our main focus was to see whether modeling longitudinal data could provide helpful information to increase the power of detecting true SNVs when compared with the methods for analyzing data at one time point. Here we directly applied the original genotype data into modeling and illustrated that the longitudinal repeated observations were indeed helpful to detect DBP-related genetic factors. However, many true SNVs are rare variants, some of which could have big allelic effect for specific individuals when the disease mutation presents. As a result of small minor allele frequency (MAF), the association between such rare variants and their corresponding phenotypes is still weak at the population level . This may be one of the main reasons why the overall power is low in detecting the majority of the causal or regulatory genetic factors. Various strategies of rare-variant collapsing procedures [14, 15] could be applied to grouping and combining genotypes of rare variants, which has potential to further increase the power.
The computational speed of the multilevel model is comparable with the linear mixed-effects model estimation by lmekin. Both models are computationally demanding (e.g., ~10 minutes for our implementation of multilevel model and 8 minutes for lmekin to process one SNV on a MacBook Pro with 2.9-GHz Intel Core i7). However, we observed that the convergence speed of the iterative generalized least squares algorithm for the multilevel model is relatively fast: the results usually do not change much after two iterations. So, restricting the number of iterations could potentially reduce computational time. Further study on improving computation efficiency will be carried out in the near future.
We developed a multilevel model for fitting family-based genotype data and repeated measures of covariates to quantitative longitudinal response, which accounts for correlations among individuals, nesting effects at the family and individual levels, and the time series correlations due to the repeated measures of covariates and responses. Using the simulated data of GAW18, this method showed more accurate type I error control than the LME model by lmekin, which is likely the result of better account for correlations among individuals. The multilevel model also provided higher power at small p-value cutoffs. At the same time, both lmekin and multilevel model, which use the longitudinal information, have higher power than GWAF, which only models data at one time point.
Kryukov GV, Shpunt A, Stamatoyannopoulos JA, Sunyaev SR: Power of deep, all-exon resequencing for discovery of human trait genes. Proc Natl Acad Sci USA. 2009, 106: 3871-3876. 10.1073/pnas.0812824106.
Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin--rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2001, 30: 97-101.
Ott J, Kamatani Y, Lathrop M: Family-based designs for genome-wide association studies. Nat Rev Genet. 2011, 12: 465-474.
Goldstein H: Multilevel mixed linear model analysis using iterative generalized least squares. Biometrika. 1986, 73: 43-56. 10.1093/biomet/73.1.43.
Goldstein H, Browne W, Rasbash J: Multilevel modeling of medical data. Stat Med. 2002, 3291-3315. 21
Goldstein H: Restricted unbiased iterative generalized least-squares estimation. Biometrika. 1989, 622-623. 76
Goldstein H, Rasbash J: Efficient computational procedures for the estimation of parameters in multilevel models based on iterative generalised least squares. Comput Stat Data Anal. 1992, 13: 63-71. 10.1016/0167-9473(92)90154-8.
Almasy L, Dyer TD, Peralta JM, Jun G, Fuchsberger C, Almeida MA, Kent JW, Fowler S, Duggirala R, Blangero J: Data for Genetic Analysis Workshop 18: human whole genome sequence, blood pressure, and simulated phenotypes in extended pedigrees. BMC Proc. 2014, 8 (suppl 2): S2-
Chen MH, Yang Q: GWAF: an R package for genome-wide association analyses with family data. Bioinformatics. 2010, 26: 580-581. 10.1093/bioinformatics/btp710.
Therneau T: The lmekin function. [ftp://184.108.40.206/R/web/packages/coxme/vignettes/lmekin%20pdf%202012]
Rasbash J, O'Connor T, Jenkin J: Multilevel models for family data. Bristol, England: Centre for Multilevel Modelling, University of Bristol. 2009
Searle S: Large sample variances of maximum likelihood estimators of variance components using unbalanced data. Biometrics. 1970, 26: 505-524. 10.2307/2529105.
Efron B: Size,power and false discovery rates. Ann Statist. 2007, 35: 1351-1377. 10.1214/009053606000001460.
Cohen JC, Kiss RS, Pertsemlidis A, Marcel YL, McPherson R, Hobbs HH: Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science. 2004, 305: 869-872. 10.1126/science.1099870.
Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, Kathiresan S, Purcell SM, Roeder K, Daly MJ: Testing for an unusual distribution of rare variants. PLoS Genet. 2011, 7: e1001322-10.1371/journal.pgen.1001322.
We are grateful to the National Institutes of Health (NIH) funding support (GM031575) for GAW18 and for a student travel award to TC We are grateful to WPI Computing and Communications Center for computational support. The GAW18 WGS data were provided by the T2D-GENES (Type 2 Diabetes Genetic Exploration by Next-generation sequencing in Ethnic Samples) Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The GAW is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
The authors declare that they have no competing interests.
ZW designed the overall study. ZW, TC, and PS conducted statistical analyses and drafted the manuscript. All authors read and approved the final manuscript.