Association analysis of whole genome sequencing data accounting for longitudinal and family designs
- Yijuan Hu^{1},
- Qin Hui^{2} and
- Yan V Sun^{2, 3, 4}Email author
https://doi.org/10.1186/1753-6561-8-S1-S89
© Hu et al.; licensee BioMed Central Ltd. 2014
Published: 17 June 2014
Abstract
Using the whole genome sequencing data and the simulated longitudinal phenotypes for 849 pedigree-based individuals from Genetic Analysis Workshop 18, we investigated various approaches to detecting the association of rare and common variants with blood pressure traits. We compared three strategies for longitudinal data: (a) using the baseline measurement only, (b) using the average from multiple visits, and (c) using all individual measurements. We also compared the power of using all of the pedigree-based data and the unrelated subset. The analyses were performed without knowledge of the underlying simulating model.
Background
Whole genome sequencing (WGS) makes it possible for investigators to extend association studies to rare variants. Rare variants, which have minor allele frequencies (MAFs) of less than 1% to 5%, might play an important role in the etiology of complex traits and account for missing heritability unexplained by common variants [1–3]. However, traditional single-variant tests for common variants have limited power for testing rare variants because of their low frequencies and large numbers. A number of methods [4–7] have been developed to address this challenge by jointly analyzing rare variants within a region. Among these methods, the burden test of Lin and Tang [7] easily fits into the regression framework that can accommodate complex study designs and phenotypes, thus remaining a competitive option.
The longitudinal study design, which collects repeated measurements on the same subject over time, has been routinely used in epidemiologic and clinical research. The repeated measurements can reduce error and thus increase statistical power compared with the single measurement. There have been increasing applications of such a design in genome-wide association studies (GWAS) with a focus on common variants. The analytical strategies include using the measurement at a single time point [8], using the summarized univariate measurement [9] and adopting the linear mixed-effects model to fully exploit information in the repeated measurements [10, 11]. However, the implementation of such designs is limited in the context of WGS studies with a focus on rare-variant associations.
In the era of next-generation sequencing studies, the family-based design has the unique advantages of protecting against population stratification, detecting genotyping errors, and facilitating accurate imputation, all of which are challenging issues to cope with in the studies of rare variants using unrelated subjects. However, it is well known that enrolling and sequencing additional family members will not increase statistical power as much as that can be achieved by the same number of unrelated subjects. The degree of power gain from the added family members depends on the extent of the within-family correlation and the size of each family. Thus, it is of interest to assess this power gain in each specific study.
Genetic Analysis Workshop 18 (GAW18) provided WGS data (sequencing plus imputation) in a pedigree-based sample with longitudinal measurements for systolic blood pressure (SBP) and diastolic blood pressure (DBP). In this study, we implement methods to exploit longitudinal and family structures and apply these methods to examine the associations of aggregated rare variants, as well as common single-nucleotide polymorphisms (SNPs), with SBP and DBP. We compare the power of the methods using the baseline measurement only, using the averaged value over repeated measurements, and using the full information of longitudinal data. We also contrast the power of the methods using all of the pedigree-based data with that using the unrelated subset.
Methods
In this study, we focus on the first replicate of simulated SBP and DBP and the genetic data from sequencing and imputation only on chromosome 3 because of limited computation resources. The analyses were performed without knowledge of the underlying simulating model. We obtained 849 individuals from 20 pedigrees, among which 142 are unrelated. All individuals have SBP and DBP measurements at three time points with no missing data, as well as age, gender, smoking status, and antihypertensive medication status. Note that we preadjust the SBP and DBP measurements by the antihypertensive medication status (i.e., increasing SBP by 10 mm Hg and DBP by 5 mm Hg if the subject is taking medication). We define common variants as those with MAFs 5% or greater and obtain 403,098 SNPs on chromosome 3. We jointly analyze rare variants by mRNA transcripts, which are the functional products of genes. We exclude transcripts whose total rare allele frequency (i.e., sum of MAFs over all inclusive variants) is less than 0.01 and end up with a total of 813 transcripts represented by accession numbers. Given a common single-nucleotide polymorphism (SNP) or a transcript for the phenotype, we consider using (a) the baseline, (b) the time-averaged, and (c) the repeated measurements; for study subjects, we consider using (a) the entire pedigree-based sample and (b) the unrelated subjects only. All statistical analyses were carried out in R (http://www.r-project.org) version 2.15.1.
Notation
Assume there are m rare variants in a transcript. Under the population-based design, let Y_{ it } denote the phenotype measured for subject i at time t, G_{ i }= (G_{ i1 }, G_{ i2 }, ..., G_{ im })^{T} the genotypes of the m variants, and X_{ it }= (X_{ it1 }, X_{ it2 }, ..., X_{ itq })^{T} the q covariates including the intercept and possibly time-varying ones. Under the family-based design and when subject i belongs to family p, we modify the aforementioned notation to be Y_{ pit }, G_{pi} and X_{ pit }.
Burden score of rare variants
We focus on variants with MAF less than 5% and that are putatively functional (i.e., nonsense, missense, or splice site mutations). Specifically, using the chromosomal location (NCBI built 37.1) provided by the GAW18 data, we search for functional annotation of 812,234 SNPs with MAFs 5% or less on chromosome 3 using the GVS server (http://snp.gs.washington.edu). We consider the following functional categories: missense, missense-near-splice, splice-3, splice-5, stop-gained, stop-lost, and stop-lost-near-splice. Note that the functional annotation is specific to each transcript. The burden score of the ith subject at a given transcript is defined as the sum of genotypes of the selected variants: ${S}_{i}={\sum {}_{j=1}}_{,\dots ,m}{G}_{ij}$.
In the framework of Lin and Tang [7], the burden score is used in the following regression models as a regular covariate.
Models and assumptions
Unrelated subjects with single measurements of blood pressure
where ${\u03f5}_{\text{i}}$ is an error term with mean zero and variance ${\sigma}^{\text{2}}$ and X_{ i }includes the intercept, gender, smoking status, and baseline (averaged) age in the baseline (averaged) phenotype model.
Unrelated subjects with repeated measurements of blood pressure
where b_{ i } is a random effect that follows $\text{N}\left(0,{\sigma}_{b}^{\text{2}}\right);$ ${\u03f5}_{it}$ is an error term that follows $\text{N}\left(0,{\sigma}^{\text{2}}\right);$ ${\u03f5}_{it}$ and b_{ i } are mutually independent; and X_{ it }consists of the intercept, gender, smoking status, and the time-varying age. By including age into X_{ it }, we assume that the traits change linearly with time. Both visual inspection of the individual-level trait trajectories and statistical testing of the age coefficient support the linear modeling (results not shown). Because $\text{Cov}\left({Y}_{it},{Y}_{it},\right)={\sigma}_{b}^{2}+{\sigma}^{2}\phantom{\rule{0.3em}{0ex}}\text{for}\phantom{\rule{2.77695pt}{0ex}}t={t}^{\prime}\phantom{\rule{2.77695pt}{0ex}}\text{and}\phantom{\rule{2.77695pt}{0ex}}\text{Cov}\left({Y}_{it},{Y}_{it},\right)={\sigma}_{b}^{2}\phantom{\rule{2.77695pt}{0ex}}\text{for}\phantom{\rule{2.77695pt}{0ex}}t\ne {t}^{\prime},$ the random effect b_{ i } induces a squared correlation of ${\sigma}_{b}^{\text{2}}/\left({\sigma}_{b}^{\text{2}}+{\sigma}^{\text{2}}\right)$ between any pair of measurements from subject i. Note that b_{ i } is shared by different measurements of subject i so that the induced correlations are the same.
Families with single measurements of blood pressure
where g_{ pi } is a random effect representing genetic similarity among family members, ${\u03f5}_{\text{pi}}$ is an error term that follows $\text{N}\left(0,{\sigma}^{\text{2}}\right),$ g_{ pi } is independent of ${\u03f5}_{pi}$${\u03f5}_{\text{pi}},$ and X_{ pi }is the same as X_{ i }in (1). In addition, we assume that ${g}_{pi}~\text{N}\left(0,{\sigma}_{g}^{2}\right),$ $\text{Corr(}{g}_{pi},{g}_{pi},)=\text{2}{\psi}_{ii},$, where ${\psi}_{ii},$ is the kinship coefficient between family member i and i'. Unlike b_{ i } in (2), which is shared among correlated units (time), there is a random effect g_{ pi } for each unit (subject) here, and their covariance matrix is specified so that correlations among different pairs of family members are different.
Families with repeated measurements of blood pressure
For both (4) and (5), the covariance between different family members is $\text{Cov}\left({Y}_{pit},{Y}_{p{i}^{\prime}{t}^{\prime}}\right)=\text{2}{\psi}_{ii},{\sigma}_{g}^{2}$ for any t and t'. However, the covariance between measurements $\left(t\ne {t}^{\prime}\right)$ from the same subject is $\text{Cov}\left({Y}_{pit},{Y}_{pit},\right)={\sigma}_{g}^{2}+{\sigma}_{b}^{2}$ based on (4) and $\text{Cov}\left({Y}_{pit},{Y}_{pit},\right)={\sigma}_{g}^{2}$ based on (5).
Although model (4) is more flexible than (5), the chromosome-wide scan based on (4) is not feasible within the given timeframe and available computational resources. We thus adopt a two-stage strategy that first scans chromosome 3 using (5) and then refines the p-values of top SNPs using (4).
Population stratification
GAW18 data consist of Mexican Americans from San Antonio, a population that may have an admixed ancestry of whites and Native Americans. To account for possible population stratification, we include top principal components (PCs) of SNP genotypes as covariates in the above regression models. We first obtain independent SNPs (linkage disequilibrium R^{2} <0.2) restricted to those with MAFs 5% or greater using the unrelated subjects. Then we project the SNP loadings of unrelated subjects to their relatives to calculate the eigenvectors of the entire sample of families.
Results
p-Values for the top five common single-nucleotide polymorphisms based on the analysis of averaged systolic blood pressure
SBP p-values | ||||||||
---|---|---|---|---|---|---|---|---|
SNP ID | Gene | Chr | Position | MAF | Baseline | Averaged | Repeated-1 | Repeated-2 |
3_47903424 | MAP4 | 3 | 47903424 | 0.124 | 2.3 × 10^{-9} | 6.3 × 10^{-11} | 2.2 × 10^{-9} | 1.1 × 10^{-10} |
3_47903305 | MAP4 | 3 | 47903305 | 0.123 | 3.2 × 10^{-9} | 6.5 × 10^{-11} | 1.7 × 10^{-9} | 1.0 × 10^{-10} |
3_47905079 | MAP4 | 3 | 47905079 | 0.124 | 3.2 × 10^{-9} | 6.5 × 10^{-11} | 1.7 × 10^{-9} | 1.0 × 10^{-10} |
3_47588649 | MAP4 | 3 | 47588649 | 0.122 | 3.5 × 10^{-9} | 6.7 × 10^{-11} | 1.8 × 10^{-9} | 1.1 × 10^{-10} |
3_47990500 | MAP4 | 3 | 47990500 | 0.123 | 3.6 × 10^{-9} | 7.2 × 10^{-11} | 2.1 × 10^{-9} | 1.1 × 10^{-10} |
DBP p -values | ||||||||
SNP ID | Gene | Chr | Position | MAF | Baseline | Averaged | Repeated-1 | Repeated-2 |
3_48064367 | MAP4 | 3 | 48064367 | 0.128 | 1.4 × 10^{-11} | 3.6 × 10^{-13} | 3.5 × 10^{-13} | 3.6 × 10^{-13} |
3_47711490 | MAP4 | 3 | 47711490 | 0.120 | 1.8 × 10^{-11} | 3.9 × 10^{-13} | 5.4 × 10^{-12} | 3.9 × 10^{-13} |
3_48092335 | MAP4 | 3 | 48092335 | 0.127 | 2.8 × 10^{-11} | 4.8 × 10^{-13} | 2.2 × 10^{-12} | 5.2 × 10^{-13} |
3_48105528 | MAP4 | 3 | 48105528 | 0.127 | 2.8 × 10^{-11} | 4.8 × 10^{-13} | 2.2 × 10^{-12} | 5.2 × 10^{-13} |
3_47990500 | MAP4 | 3 | 47990500 | 0.123 | 4.9 × 10^{-11} | 5.0 × 10^{-13} | 5.8 × 10^{-12} | 5.4 × 10^{-13} |
p-Values for the top three transcripts based on the analysis of averaged systolic blood pressure (diastolic blood pressure
SBP p-values | |||||||
---|---|---|---|---|---|---|---|
Accession ID | Gene | Chr | SMAF | Baseline | Averaged | Repeated-1 | Repeated-2 |
NM_001134364.1 | MAP4 | 3 | 0.080 | <2.2 × 10^{-16} | <2.2 × 10^{-16} | <2.2 × 10^{-16} | <2.2 × 10^{-16} |
NM_002375.4 | MAP4 | 3 | 0.074 | <2.2 × 10^{-16} | <2.2 × 10^{-16} | <2.2 × 10^{-16} | <2.2 × 10^{-16} |
NM_030885.3 | MAP4 | 3 | 0.036 | <2.2 × 10^{-16} | <2.2 × 10^{-16} | 1.6 × 10^{-15} | <2.2 × 10^{-16} |
DBP p -values | |||||||
Accession ID | Gene | Chr | SMAF | Baseline | Averaged | Repeated-1 | Repeated-2 |
NM_001134364.1 | MAP4 | 3 | 0.080 | <2.2 × 10^{-16} | <2.2 × 10^{-16} | <2.2 × 10^{-16} | <2.2 × 10^{-16} |
NM_002375.4 | MAP4 | 3 | 0.074 | 4.4 × 10^{-16} | <2.2 × 10^{-16} | <2.2 × 10^{-16} | <2.2 × 10^{-16} |
NM_030885.3 | MAP4 | 3 | 0.036 | 6.0 × 10^{-15} | 2.2 × 10^{-16} | 5.3 × 10^{-15} | <2.2 × 10^{-16} |
The results of testing the genetic association with DBP show similar patterns as with SBP (data not shown). In particular, using the averaged DBP yielded better power than using the repeated measurements. Tables 1 and 2 provide the top common SNPs and transcripts, respectively.
Discussion and conclusions
We investigated three approaches to exploiting longitudinal phenotype data and assessed the power gain of adding family members in the context of WGS studies. Most GWAS have focused on the population-based design, which maximizes the power per genotyped subject. Our results demonstrated that including family members can also significantly boost the power. Most GWAS have ignored the longitudinal nature of the phenotype data, which are available from many prospective cohorts. The use of the longitudinal data can provide a more accurate measurement of the phenotype and thus serves as a powerful tool in genetic association studies. With more clinic data being available through the electronic medical record (EMR) system and more clinic populations with genotypic data, the search for disease-associated common and rare variants can be more fruitful by improving the phenotyping via longitudinal information.
It appears somewhat counterintuitive that using the time-averaged measurement is more powerful than using the repeated measurement in the analysis of the GAW18 data. This is possible because, in the presence of linear time effect, the averaged measurement does not lose any information compared with the repeated measurements but simply reduces error. When more complex longitudinal structures exist, the repeated measurements retain full information and are expected to outperform the averaged measurement.
A family-based design can allow us to test association and linkage simultaneously. In this paper, we focused on the association analysis only. We modeled the association in the fixed-effect parameters and accounted for family relatedness using the random-effect parameters, whose covariances among family members are formulated through the kinship coefficient. Our models can be readily extended to linkage analysis by including another set of random-effect parameters whose covariances depend on the proportion of alleles shared identical by descent at the marker locus between a relative pair [12].
Declarations
Acknowledgements
YVS and QH were supported in part by National Institutes of Health (NIH) grant RC1 HL100245 from the National Heart, Lung, and Blood Institute. The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
Authors’ Affiliations
References
- Schork NJ, Murray SS, Frazer KA, Topol EJ: Common vs. rare allele hypotheses for complex diseases. Curr Opin Genet Dev. 2009, 19: 212-219. 10.1016/j.gde.2009.04.010.PubMed CentralView ArticlePubMedGoogle Scholar
- Cohen JC, Kiss RS, Pertsemlidis A, Marcel YL, McPherson R, Hobbs HH: Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science. 2004, 305: 869-872. 10.1126/science.1099870.View ArticlePubMedGoogle Scholar
- Nejentsev S, Walker N, Riches D, Egholm M, Todd JA: Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes. Science. 2009, 324: 387-389. 10.1126/science.1167728.PubMed CentralView ArticlePubMedGoogle Scholar
- Madsen BE, Browning SR: A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009, 5: e1000384-10.1371/journal.pgen.1000384.PubMed CentralView ArticlePubMedGoogle Scholar
- Price AL, Kryukov GV, de Bakker PIW, Purcell SM, Staples J, Wei LJ, Sunyaev SR: Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet. 2010, 86: 832-838. 10.1016/j.ajhg.2010.04.005.PubMed CentralView ArticlePubMedGoogle Scholar
- Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X: Rare variant association testing for sequencing data using the sequence kernel association test (SKAT). Am J Hum Genet. 2011, 89: 82-93. 10.1016/j.ajhg.2011.05.029.PubMed CentralView ArticlePubMedGoogle Scholar
- Lin DY, Tang ZZ: A general framework for detecting disease associations with rare variants in sequencing studies. Am J Hum Genet. 2011, 89: 354-367. 10.1016/j.ajhg.2011.07.015.PubMed CentralView ArticlePubMedGoogle Scholar
- Sabatti C, Service SK, Hartikainen A, Pouta A, Ripatti S, Brodsky J, Jones CG, Zaitlen NA, Varilo T, Kaakinen M, et al: Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nat Genet. 2008, 41: 35-46.PubMed CentralView ArticlePubMedGoogle Scholar
- Ionita-Laza I, McQueen M, Laird N, Lange C: Genomewide weighted hypothesis testing in family-based association studies, with an application to a 100K scan. Am J Hum Genet. 2007, 81: 607-614. 10.1086/519748.PubMed CentralView ArticlePubMedGoogle Scholar
- Smith EN, Chen W, Kähönen M, Kettunen J, Lehtimäki T, et al: Longitudinal genome-wide association of cardiovascular disease risk factors in the Bogalusa Heart Study. PLoS Genet. 2010, 6: e1001094-10.1371/journal.pgen.1001094.PubMed CentralView ArticlePubMedGoogle Scholar
- Luan J, Kerner B, Zhao JH, Loos RJF, Sharp SJ, Muthén BO, Wareham NJ: A multilevel linear mixed model of the association between candidate genes and weight and body mass index using the Framingham longitudinal family data. BMC Proc. 2009, 3 (suppl 7): S115-10.1186/1753-6561-3-s7-s115.PubMed CentralView ArticlePubMedGoogle Scholar
- Abecasis GR, Cookson WOC, and Cardon LR: Pedigree tests of transmission disequilibrium. Eur J Hum Genet. 2000, 545-551. 8Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.