Association analysis of whole genome sequencing data accounting for longitudinal and family designs

Using the whole genome sequencing data and the simulated longitudinal phenotypes for 849 pedigree-based individuals from Genetic Analysis Workshop 18, we investigated various approaches to detecting the association of rare and common variants with blood pressure traits. We compared three strategies for longitudinal data: (a) using the baseline measurement only, (b) using the average from multiple visits, and (c) using all individual measurements. We also compared the power of using all of the pedigree-based data and the unrelated subset. The analyses were performed without knowledge of the underlying simulating model.


Background
Whole genome sequencing (WGS) makes it possible for investigators to extend association studies to rare variants. Rare variants, which have minor allele frequencies (MAFs) of less than 1% to 5%, might play an important role in the etiology of complex traits and account for missing heritability unexplained by common variants [1][2][3]. However, traditional single-variant tests for common variants have limited power for testing rare variants because of their low frequencies and large numbers. A number of methods [4][5][6][7] have been developed to address this challenge by jointly analyzing rare variants within a region. Among these methods, the burden test of Lin and Tang [7] easily fits into the regression framework that can accommodate complex study designs and phenotypes, thus remaining a competitive option.
The longitudinal study design, which collects repeated measurements on the same subject over time, has been routinely used in epidemiologic and clinical research. The repeated measurements can reduce error and thus increase statistical power compared with the single measurement. There have been increasing applications of such a design in genome-wide association studies (GWAS) with a focus on common variants. The analytical strategies include using the measurement at a single time point [8], using the summarized univariate measurement [9] and adopting the linear mixed-effects model to fully exploit information in the repeated measurements [10,11]. However, the implementation of such designs is limited in the context of WGS studies with a focus on rare-variant associations.
In the era of next-generation sequencing studies, the family-based design has the unique advantages of protecting against population stratification, detecting genotyping errors, and facilitating accurate imputation, all of which are challenging issues to cope with in the studies of rare variants using unrelated subjects. However, it is well known that enrolling and sequencing additional family members will not increase statistical power as much as that can be achieved by the same number of unrelated subjects. The degree of power gain from the added family members depends on the extent of the within-family correlation and the size of each family. Thus, it is of interest to assess this power gain in each specific study.
Genetic Analysis Workshop 18 (GAW18) provided WGS data (sequencing plus imputation) in a pedigreebased sample with longitudinal measurements for systolic blood pressure (SBP) and diastolic blood pressure (DBP). In this study, we implement methods to exploit longitudinal and family structures and apply these methods to examine the associations of aggregated rare variants, as well as common single-nucleotide polymorphisms (SNPs), with SBP and DBP. We compare the power of the methods using the baseline measurement only, using the averaged value over repeated measurements, and using the full information of longitudinal data. We also contrast the power of the methods using all of the pedigree-based data with that using the unrelated subset.

Methods
In this study, we focus on the first replicate of simulated SBP and DBP and the genetic data from sequencing and imputation only on chromosome 3 because of limited computation resources. The analyses were performed without knowledge of the underlying simulating model. We obtained 849 individuals from 20 pedigrees, among which 142 are unrelated. All individuals have SBP and DBP measurements at three time points with no missing data, as well as age, gender, smoking status, and antihypertensive medication status. Note that we preadjust the SBP and DBP measurements by the antihypertensive medication status (i.e., increasing SBP by 10 mm Hg and DBP by 5 mm Hg if the subject is taking medication). We define common variants as those with MAFs 5% or greater and obtain 403,098 SNPs on chromosome 3. We jointly analyze rare variants by mRNA transcripts, which are the functional products of genes. We exclude transcripts whose total rare allele frequency (i.e., sum of MAFs over all inclusive variants) is less than 0.01 and end up with a total of 813 transcripts represented by accession numbers. Given a common single-nucleotide polymorphism (SNP) or a transcript for the phenotype, we consider using (a) the baseline, (b) the time-averaged, and (c) the repeated measurements; for study subjects, we consider using (a) the entire pedigree-based sample and (b) the unrelated subjects only. All statistical analyses were carried out in R (http://www.r-project.org) version 2.15.1.

Notation
Assume there are m rare variants in a transcript. Under the population-based design, let Y it denote the phenotype measured for subject i at time t, G i = (G i1 , G i2 , ..., G im ) T the genotypes of the m variants, and X it = (X it1 , X it2 , ..., X itq ) T the q covariates including the intercept and possibly time-varying ones. Under the family-based design and when subject i belongs to family p, we modify the aforementioned notation to be Y pit , G pi and X pit .

Burden score of rare variants
We focus on variants with MAF less than 5% and that are putatively functional (i.e., nonsense, missense, or splice site mutations). Specifically, using the chromosomal location (NCBI built 37.1) provided by the GAW18 data, we search for functional annotation of 812,234 SNPs with MAFs 5% or less on chromosome 3 using the GVS server (http://snp.gs.washington.edu). We consider the following functional categories: missense, missense-near-splice, splice-3, splice-5, stop-gained, stoplost, and stop-lost-near-splice. Note that the functional annotation is specific to each transcript. The burden score of the ith subject at a given transcript is defined as the sum of genotypes of the selected variants: In the framework of Lin and Tang [7], the burden score is used in the following regression models as a regular covariate.

Models and assumptions
Unrelated subjects with single measurements of blood pressure The single measurements of blood pressure can be the baseline or time-averaged SBP or DBP. We denote the baseline and averaged measurements by Y i1 and Y i+ = t Y it , respectively. Because they are all quantitative, it is natural to relate each of them to S i and X i through the linear regression model: where i is an error term with mean zero and variance σ 2 and X i includes the intercept, gender, smoking status, and baseline (averaged) age in the baseline (averaged) phenotype model.

Unrelated subjects with repeated measurements of blood pressure
To account for the correlation of measurements from the same individual, we use the linear mixed-effects model. For subject i at time t, it is written as where b i is a random effect that follows N(0, σ 2 b ); it is an error term that follows N(0, σ 2 ); it and b i are mutually independent; and X it consists of the intercept, gender, smoking status, and the time-varying age. By including age into X it , we assume that the traits change linearly with time. Both visual inspection of the individual-level trait trajectories and statistical testing of the age coefficient support the linear modeling (results not shown). Because the random effect b i induces a squared correlation of σ 2 b /(σ 2 b + σ 2 ) between any pair of measurements from subject i. Note that b i is shared by different measurements of subject i so that the induced correlations are the same.

Families with single measurements of blood pressure
To account for the phenotype correlation among subjects from the same family, we also adopt a linear mixed-effects model. For the ith subject in the pth pedigree, we formulate that where g pi is a random effect representing genetic similarity among family members, pi is an error term that follows N(0, σ 2 ), g pi is independent of pi pi , and X pi is the same as X i in (1). In addition, we assume that Corr(g pi , g pi , ) = 2ψ ii , Corr(g pi , g pi , ) = 2ψ ii ,, where ψ ii , is the kinship coefficient between family member i and i'. Unlike b i in (2), which is shared among correlated units (time), there is a random effect g pi for each unit (subject) here, and their covariance matrix is specified so that correlations among different pairs of family members are different.

Families with repeated measurements of blood pressure
Model (3) can be readily extended to accommodate repeated measurements. We include the repeated measurements in (3) as follows: where X pit contains the time-varying age, g pi is introduced previously, b pi follows the same distribution as b i in model (2), and g pi and b pi are independent of each other. Unlike in (2), b pi here characterizes the additional correlation between repeated measurements after adjusting for the genetically induced portion. To see this, we consider the reduced model without b i : For both (4) and (5), the covariance between different family members is Cov Y pit , Y pi t = 2ψ ii , σ 2 g for any t and t'. However, the covariance between measurements (t = t ) from the same subject is Cov Y pit , Y pit , = σ 2 g + σ 2 b based on (4) and Cov Y pit , Y pit , = σ 2 g based on (5). Although model (4) is more flexible than (5), the chromosome-wide scan based on (4) is not feasible within the given timeframe and available computational resources. We thus adopt a two-stage strategy that first scans chromosome 3 using (5) and then refines the p-values of top SNPs using (4).

Population stratification
GAW18 data consist of Mexican Americans from San Antonio, a population that may have an admixed ancestry of whites and Native Americans. To account for possible population stratification, we include top principal components (PCs) of SNP genotypes as covariates in the above regression models. We first obtain independent SNPs (linkage disequilibrium R 2 <0.2) restricted to those with MAFs 5% or greater using the unrelated subjects. Then we project the SNP loadings of unrelated subjects to their relatives to calculate the eigenvectors of the entire sample of families. Figure 1 displays the quantile-quantile (QQ) plots of p-values for testing the association between common SNPs and SBP. All 6 tests produced proper type I error because their genomic control parameter λ s are close to 1. This suggests that the data are well described by our models and population stratification is appropriately adjusted by the PCs. Clearly, using all pedigree-based samples is substantially more powerful than using the unrelated subjects only. In addition, using the averaged SBP yielded smaller p-values for top SNPs than using the baseline or repeated measurements, and using the repeated measurements is slightly more powerful than using the baseline. This pattern can also be seen in Figure 2. The top five SNPs based on the method using the averaged SBP and all subjects are listed in Table 1, whose last column provides the refined p-values from model (4). Note that the use of model (4) does not alter the aforementioned order based on power, although it tends to slightly improve on the use of model (5). Using the Bonferroni correction, the genome-wide significance threshold is 1.3 × 10 -7 , at which the top five SNPs can be declared as genome-wide significant by any method. Note that we only focused on chromosome 3, so what we are assessing is in fact chromosome-wide significance. For testing the association between rare variants and SBP, the 6 tests also have controlled type I error (see Figure 3 for QQ plots of p-values). Again, compared with the unrelated subset, the relatives added considerable information on the associations of the top three transcripts. All three types of SBP generated comparable power with all individuals, and the three consensus top transcripts are described in Table 2. Using the Bonferroni correction, the genome-wide significance threshold is 6.2 × 10 -5 , at which the three top transcripts can be declared as genome-wide significant by any method. All of the identified common and rare variants map to the gene MAP4, which spans from 47,892,180 to 48,130,769 on chromosome 3.

Results
The results of testing the genetic association with DBP show similar patterns as with SBP (data not shown). In particular, using the averaged DBP yielded better power than using the repeated measurements. Tables 1 and 2 provide the top common SNPs and transcripts, respectively.

Discussion and conclusions
We investigated three approaches to exploiting longitudinal phenotype data and assessed the power gain of adding family members in the context of WGS studies. Most GWAS have focused on the population-based design, which maximizes the power per genotyped subject. Our results demonstrated that including family members can also significantly boost the power. Most  Note: Repeated-1 and repeated-2 correspond to model (5) and model (4), respectively. DBP, diastolic blood pressure; MAF, minor allele frequency; SBP, systolic blood pressure Figure 3 Quantile-quantile plots of p-values for tests between common single-nucleotide polymorphisms and systolic blood pressure (SBP) GWAS have ignored the longitudinal nature of the phenotype data, which are available from many prospective cohorts. The use of the longitudinal data can provide a more accurate measurement of the phenotype and thus serves as a powerful tool in genetic association studies. With more clinic data being available through the electronic medical record (EMR) system and more clinic populations with genotypic data, the search for disease-associated common and rare variants can be more fruitful by improving the phenotyping via longitudinal information.
It appears somewhat counterintuitive that using the time-averaged measurement is more powerful than using the repeated measurement in the analysis of the GAW18 data. This is possible because, in the presence of linear time effect, the averaged measurement does not lose any information compared with the repeated measurements but simply reduces error. When more complex longitudinal structures exist, the repeated measurements retain full information and are expected to outperform the averaged measurement.
A family-based design can allow us to test association and linkage simultaneously. In this paper, we focused on the association analysis only. We modeled the association in the fixed-effect parameters and accounted for family relatedness using the random-effect parameters, whose covariances among family members are formulated through the kinship coefficient. Our models can be readily extended to linkage analysis by including another set of random-effect parameters whose covariances depend on the proportion of alleles shared identical by descent at the marker locus between a relative pair [12]. Note: SMAF is the sum of minor allele frequencies (MAFs) of single-nucleotide polymorphisms included in the burden score of the gene. DBP, diastolic blood pressure; SBP, systolic blood pressure.