A generalized least-squares framework for rare-variant analysis in family data

Li, Dalin; Rotter,, Jerome I; Guo, Xiuqing

doi:10.1186/1753-6561-8-S1-S28

Volume 8 Supplement 1

Genetic Analysis Workshop 18

Proceedings
Open access
Published: 17 June 2014

A generalized least-squares framework for rare-variant analysis in family data

Dalin Li^1,2,
Jerome I Rotter,^2,3 &
Xiuqing Guo^2,3

BMC Proceedings volume 8, Article number: S28 (2014) Cite this article

1843 Accesses
1 Citations
Metrics details

Abstract

Rare variants may, in part, explain some of the hereditability missing in current genome-wide association studies. Many gene-based rare-variant analysis approaches proposed in recent years are aimed at population-based samples, although analysis strategies for family-based samples are clearly warranted since the family-based design has the potential to enhance our ability to enrich for rare causal variants. We have recently developed the generalized least squares, sequence kernel association test, or GLS-SKAT, approach for the rare-variant analyses in family samples, in which the kinship matrix that was computed from the high dimension genetic data was used to decorrelate the family structure. We then applied the SKAT-O approach for gene-/region-based inference in the decorrelated data. In this study, we applied this GLS-SKAT method to the systolic blood pressure data in the simulated family sample distributed by the Genetic Analysis Workshop 18. We compared the GLS-SKAT approach to the rare-variant analysis approach implemented in family-based association test-v1 and demonstrated that the GLS-SKAT approach provides superior power and good control of type I error rate.

Background

Rare variants may, in part, explain some of the missing heritability in current genome-wide association studies [1]. Many rare-variant analysis approaches have been proposed in recent years [2–9]; however most are aimed at population-based case-control samples. Because most of the rare variants arise from recent mutations in pedigrees [10], the family-based design has the potential to enhance our ability to enrich for rare risk or protective variants that occur in the pedigrees over several generations and can substantially increase power. We recently developed an analysis strategy based on generalized least squares (GLS) [11] for family-based rare-variant association analysis, in which we first use the kinship matrix to decorrelate the family-based data, then apply a SKAT-O [9] approach to the decorrelated data, which we term the GLS-SKAT approach (Li D, personal communications, 2013). In this study, we applied the GLS-SKAT method to analyze the simulated systolic blood pressure (SBP) data in the family sample provided by the Genetic Analysis Workshop 18 (GAW18) to examine its performance.

Methods

Residual calculation for SBP in GAW18 simulated data

Independent individuals in the GAW18 data were extracted based on the list of unrelated individuals provided by the GAW18 organizers. For each simulation, a linear regression model was built based on the independent individuals using covariates including gender, age, age square, antihypertension medicine usage, and smoking. The linear regression model built from the unrelated subjects was then projected to the correlated subjects and the residuals were calculated for all individuals in GAW18.

Because the GLS-SKAT approach (Li D, personal communications, 2013) can implicitly control for population substructure, neither ethnicity nor principal components were included as covariates in the residual calculation.

Kinship matrix calculation

The software EMMAX [12] was used for kinship matrix calculation. To ensure that the kinship matrix was nonsingular, one individual in each identical twin pair was excluded. Variants with minor allele frequencies (MAFs) less than 0.01 were also excluded in the kinship matrix calculation.

GLS transformation

GLS transformation was performed using R code developed in-house. The GLS transformation has been described in detail elsewhere (manuscript in preparation). Briefly, a transformation matrix was calculated as the inverse of the decomposition of the kinship matrix. Then, this transformation matrix was used to decorrelate the family data by multiplying it with both the phenotype and genotypes matrices. The residual of SBP after controlling for the covariates was used as the phenotype. Given the time limitation, only simulated chromosome 3 genotype data was used as recommended by the GAW organizer. All variants, both common and rare, were transformed in this step.

Analysis of the decorrelated data

For gene-based inference, the SKAT-O approach [9] was applied to the decorrelated data. Here, a gene region was defined as 20 kilobases (kb) up and downstream of the gene transcript start and stop sites, respectively. Both rare and common variants were included in this analysis, and the single-nucleotide polymorphisms (SNPs) were weighted inversely to their MAF based on the weighting framework previously proposed by Bowling and Bowling [4]. Again, no population stratification measures nor ethnicity information was included.

This GLS-SKAT procedure was applied to replicates 1 to 100 in the GAW18 simulated data. Power and type I error rate were calculated based on the 100 simulated data sets. To achieve a family-wise error rate of 0.05 after applying the Bonferroni correction, an alpha level of 4.0E-5 was utilized in the power calculation to account for the 1247 gene regions on chromosome 3.

Family-based association test for rare variants

As a comparison we analyzed the same 100 data sets using the family-based association test (FBAT) test for rare variants (described in the updated FBAT version 2.04). There are 2 versions of the FBAT test for rare variants: FBAT-v0, which weights all the variants equally, and FBAT-v1, which weights the variants inversely to their MAF. Here we used the FBAT-v1 in the analysis. For consistency, both rare and common variants also were analyzed via the FBAT-v1 test.

We were blind to the GAW18 answer sheet when we carried out the analyses. We compared the results to the answer sheet and carried out power calculations only after the GAW18 meeting.

Results

Detailed description of the GAW data can be found in the work of Almasy et al [13]. In the GLS-SKAT analysis, 2 individuals who were identical twins of others were excluded in the kinship matrix calculation, leaving 847 individuals with both phenotype and genotype data in the analysis. In the FBAT-v1 analysis, 369 trios from the GAW18 data were used because the FBAT test can only use trios.

Figure 1 shows the QQ plots of SBP analysis results in the first simulation (chromosome 3 only) for both GLS-SKAT and FBAT-v1 methods. The lambda of the QQ plot is 1.057 for GLS-SKAT and 0.967 for FBAT.

To calculate the type I error rate, we pooled the statistics across all of the 100 simulated data sets after excluding genes located within 1 megabit (Mb) to any of the causal SNPs on chromosome 3. The type I error rate was calculated for both the GLS-SKAT and FBAT-v1 analyses (Table 1). Both approaches have type I error rates close to the nominal alpha level.

Table 1 Type I error rate of the GLS-SKAT and FBAT-v1 analysis

Full size table

We also calculated power using the 100 simulation data sets for both the GLS-SKAT and FBAT-V1 methods for the MAP4 gene region, which was simulated to contribute to SBP variations (Table 2). We observed a clear advantage to the GLS-SKAT approach, which had a power of 0.69 in comparison to 0.43 for FBAT-v1 at an alpha significance level of 0.001, and 0.34 in comparison to 0.08 with a more stringent significance level of 4.0E-5, with the correction for the number of genes tested.

Table 2 Power of the GLS-SKAT and FBAT-v1 analysis for the MAP4 gene region

Full size table

Discussion

We applied the GLS-SKAT approach in the simulated GAW18 data set. This approach is based on a GLS framework developed for family-based samples, in which the kinship matrix is first computed using the high-dimensional genetics data, and then the kinship matrix is used to decorrelate the family data. Building upon the belief that the property of this framework would lead to a data projection that would yield a best unbiased linear estimate for each single variant in the genome (paper in preparation), we have followed the SKAT-O approach [9] for gene-/region-based inference to capture the potential causal rare variants.

Application of the GLS-SKAT approach to GAW18 simulated family data showed that the GLS-SKAT approach does not have inflated type I error rates with complex family structures, indicating that the GLS-based framework can properly decorrelate family data. Moreover, in all analyses when using GLS-SKAT, this was observed without incorporating any covariates related to ethnicity or population substructure, indicating that, by accounting for the pairwise relatedness, the kinship matrix can be used to control for both major scale population structure (ie, ethnicity) and the much finer scale substructures (ie, family structure). This is consistent with what has been observed in previous investigations [12, 14].

By incorporating all family members in the analysis, the GLS-SKAT approach makes full use of all phenotype and genotype data. This is an advantage compared to the FBAT-based test, in which only trios were included. Moreover, because the FBAT-based analysis is based on transmission disequilibrium, the phenotypes of parents do not contribute to the test statistics. This leads to additional loss of information and consequently lower power, as observed in this study.

Another advantage of the GLS-SKAT approach is that it does not rely on the known family structure because the kinship matrix is estimated using the high dimensional genetics data, making it robust to the potential errors of unidentified cryptic family structure. This is particularly important for samples with complex large pedigrees, in which the cleaning and correction of the self-reported family structure using genetic data are cumbersome.

The GLS approach is similar in spirit to the previously proposed EMMAX and GRAMMAR approach [12, 14], as well as the recently proposed GRAMMAR-GAMMA approach [15], because all those methods use the kinship matrix or covariance matrix to remove the influence of family structure. In particular, the GRAMMAR-GAMMA approach by Svishcheva et al [15] proposed to transform the phenotype vector using the covariance matrix to speed up the score-based association test. Our GLS approach is based on a very similar idea except that we propose to transform both phenotype and genotype, which allows us to completely ignore the covariance matrix when calculating the test statistics. This should lead to additional computational efficiency and potentially fewer restrictions in practice.

In the application of the GLS approach to GAW18 data, the transformation was performed using the kinship matrix after controlling for all covariates that contribute to SBP in the simulation model. However, in practice there could be residual correlation as a result of the structure of unknown covariates within the family, as one reviewer pointed out. This potential problem can be solved by using a covariance matrix instead of the projection matrix in the GLS transformation, in which the covariance matrix is a weighted average of the kinship matrix and an identical matrix with the weight being the calculated hereditability. This is similar to the approach used in GRAMMAR-GAMMA [15].

Conclusions

In summary, we applied the GLS-SKAT approach to GAW18 simulated SBP data. The results demonstrated that the GLS-SKAT approach can properly control for both population and family structure using the kinship matrix estimated from high dimensional genetics data. By using all the available phenotype and genotype information, we demonstrated that the GLS-SKAT approach has a clear advantage in terms of power compared to the FBAT-based test.

References

Gorlov IP, Gorlova OY, Sunyaev SR, Spitz MR, Amos CI: Shifting paradigm of association studies: value of rare single-nucleotide polymorphisms. Am J Hum Genet. 2008, 82: 100-112. 10.1016/j.ajhg.2007.09.006.
Article PubMed Central CAS PubMed Google Scholar
Morgenthaler S, Thilly WG: A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutat Res. 2007, 615: 28-56. 10.1016/j.mrfmmm.2006.09.003.
Article CAS PubMed Google Scholar
Li B, Leal SM: Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008, 83: 311-321. 10.1016/j.ajhg.2008.06.024.
Article PubMed Central CAS PubMed Google Scholar
Madsen BE, Browning SR: A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009, 5: e1000384-10.1371/journal.pgen.1000384.
Article PubMed Central PubMed Google Scholar
Price AL, Kryukov GB, de Bakker PI, Purcell SM, Staples J, Wei LJ, Sunyaev SR: Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet. 2010, 86: 832-838. 10.1016/j.ajhg.2010.04.005.
Article PubMed Central PubMed Google Scholar
Han F, Pan W: A data-adaptive sum test for disease association with multiple common or rare variants. Hum Hered. 2010, 42-54. 70
Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, Kathiresan S, Purcell SM, Roeder K, Daly MJ: Testing for an unusual distribution of rare variants. PLoS Genet. 2011, 7: e1001322-10.1371/journal.pgen.1001322.
Article PubMed Central CAS PubMed Google Scholar
Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X: Rare variant association testing for sequencing data using the sequence kernel association test (SKAT). Am J Hum Genet. 2011, 89: 82-93. 10.1016/j.ajhg.2011.05.029.
Article PubMed Central CAS PubMed Google Scholar
Lee S, Emond MJ, Bamshad MJ, Barnes KC, Rieder MJ, Nickerson DA, NHLBI GO Exome Sequencing Project--ESP Lung Project Team, Christiani DC, Wurfel MM, Lin X: Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am J Hum Genet. 2012, 91: 224-237. 10.1016/j.ajhg.2012.06.007.
Article PubMed Central CAS PubMed Google Scholar
Keinan A, Clark AG: Recent explosive human population growth has resulted in an excess of rare genetic variants. Science. 2012, 336: 740-743. 10.1126/science.1217283.
Article PubMed Central CAS PubMed Google Scholar
Greene WH: Econometric Analysis. 2000, New York, Macmillan, 4
Google Scholar
Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, Sabatti C, Eskin E: Variance component model to account for sample structure in genome-wide association studies. Nat Genet. 2010, 42: 348-354. 10.1038/ng.548.
Article PubMed Central CAS PubMed Google Scholar
Almasy L, Dyer TD, Peralta JM, Jun G, Fuchsberger C, Almeida MA, Kent JW, Fowler S, Duggirala R, Blangero J: Data for Genetic Analysis Workshop 18: human whole genome sequence, blood pressure, and simulated phenotypes in extended pedigrees. BMC Proc. 2014, 8 (suppl 2): S2-
Article PubMed Central PubMed Google Scholar
Zhou X, Stephens M: Genome-wide efficient mixed-model analysis for association studies. Nat Genet. 2012, 44: 7821-7824.
Google Scholar
Svishcheva GR, Axenovich TI, Belonogova NM, van Duijn CM, Aulchenko YS: Rapid variance components-based method for whole-genome association analysis. Nat Genet. 2012, 44: 1166-1170. 10.1038/ng.2410.
Article CAS PubMed Google Scholar

Download references

Acknowledgements

Supported by NIH P01 DK046763, NIH/NIDDK DERC DK063491, CTSI grant UL1TR000124, and NIH-NHLBI N01 HC-95159. The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.

This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.

Author information

Authors and Affiliations

Medical Genetics Institute, Cedars-Sinai Medical Center, Los Angeles, CA, 90048, USA
Dalin Li
David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
Dalin Li, Jerome I Rotter, & Xiuqing Guo
Institute for Translational Genomics and Population Sciences, Los Angeles Biomedical Research Institute and Department of Pediatrics, Harbor-UCLA Medical Center, Torrance, CA, 90502, USA
Jerome I Rotter, & Xiuqing Guo

Authors

Dalin Li
View author publications
You can also search for this author in PubMed Google Scholar
Jerome I Rotter,
View author publications
You can also search for this author in PubMed Google Scholar
Xiuqing Guo
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Xiuqing Guo.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

DL conceived and performed the analysis and drafted the manuscript. JIR and XG supervised the whole study and helped to prepare the manuscript. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Li, D., Rotter,, J.I. & Guo, X. A generalized least-squares framework for rare-variant analysis in family data. BMC Proc 8 (Suppl 1), S28 (2014). https://doi.org/10.1186/1753-6561-8-S1-S28

Download citation

Published: 17 June 2014
DOI: https://doi.org/10.1186/1753-6561-8-S1-S28

Genetic Analysis Workshop 18

A generalized least-squares framework for rare-variant analysis in family data

Abstract

Background

Methods

Residual calculation for SBP in GAW18 simulated data

Kinship matrix calculation

GLS transformation

Analysis of the decorrelated data

Family-based association test for rare variants

Results

Discussion

Conclusions

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors' contributions

Rights and permissions

About this article

Cite this article

Keywords

BMC Proceedings

Contact us

Genetic Analysis Workshop 18

A generalized least-squares framework for rare-variant analysis in family data

Abstract

Background

Methods

Residual calculation for SBP in GAW18 simulated data

Kinship matrix calculation

GLS transformation

Analysis of the decorrelated data

Family-based association test for rare variants

Results

Discussion

Conclusions

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors' contributions

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Proceedings

Contact us