Adjustment of familial relatedness in association test for rare variants

High-throughput sequencing technology allows researchers to test associations between phenotypes and all the variants identified throughout the genome, and is especially useful for analyzing rare variants. However, the statistical power to identify phenotype-associated rare variants is very low with typical genome-wide association studies because of their low allele frequencies among unrelated individuals. In contrast, a family-based design may have more power because rare variants are more likely to be enriched in families than among unrelated individuals. Regardless, an analysis of family-based association studies needs to account appropriately for relatedness between family members. We analyzed the observed quantitative trait systolic blood pressure as well as the simulated Q1 data in the Genetic Analysis Workshop 18 data set using 4 tests: (a) a single-variant test, (b) a collapsing test, (c) a single-variant test where familial relatedness was accounted for, and (d) a collapsing test where familial relatedness was accounted for. We then compared the results of the 4 methods and observed that adjusting for familial relatedness could appropriately control the false-positive rate while maintaining reasonable power to detect several strongly associated variants/genes.


Background
Current platforms for genome-wide association studies are limited to scanning common variants. Although rare variants may contribute a significant proportion of heritability, the statistical power to detect rare variants is low because of their low allele frequencies. Recent advances in high-throughput sequencing technologies have provided us great opportunities to delve deeper into the genetic components of complex traits by identifying millions of rare variants in the human genome [1] and allowing them to be tested for associations with complex traits.
In an effort to increase the power to detect rare-variant associations, many methods have been proposed to aggregate the effects of multiple rare variants within a specific functional unit, for example, a gene [2]. Among those methods, the kernel score test has enjoyed great popularity thanks to its flexibility and computational efficiency [3]. Family-based designs may offer more power, however, because related individuals are more likely than unrelated individuals to be enriched for rare variants [4]. One challenge of analyzing family data is that familial relatedness needs to be carefully adjusted for in the association analysis. Linear mixed models have been used to account for population stratification in the context of genome-wide association studies (GWAS) [5], and can also be naturally adapted to a family-based design. In this paper, we adopted a linear mixed model to adjust for familial relatedness while aggregating the effects of rare variants within a gene using a kernel score test.

Data description and preprocessing
The Genetic Analysis Workshop 18 (GAW18) data consists of genotyping and sequencing data for 959 individuals from 20 extended families. Among them, 849 individuals had at least 1 blood pressure measurement. For the real data, we used the first nonmissing measurement of systolic blood pressure (SBP) as the target quantitative trait. For the 200 replicates of the simulated phenotype, we also used the first measurement of SBP as the target quantitative trait. Finally, to evaluate the false-positive rate, we used the simulated Q1 data as the null phenotype.
The common variants in the genotyping data were used to estimate the genetic similarity matrix. As for the sequencing data, we selected only the nonsynonymous mutations for the association test in recognition of the fact that variants causing amino acid alterations tend to have large effects on the phenotype.

The model
We used the following linear mixed model to adjust for familial relatedness: Y = Cγ + Xβ + Zα + e where Y represents the phenotype (quantitative trait), C represents the collection of the covariates (age and gender), X includes the genotypes of the variants to be tested, and Z is the design matrix for the whole-genome polygenic random effects α. Finally, γ and β are fixed effects and e is the random residual. When the random effects are integrated out, the model is marginalized as Y ∼ N(Cγ + Xβ, Kσ 2 α + Iσ 2 e ), where σ 2 α and σ 2 are the variances for the random effects α and e, respectively. As for K, one could use twice the theoretical kinship matrix to represent the familial relationships. However, to account for potential cryptic relatedness between individuals across different families, we used a genetic similarity matrix that was estimated from the genotyping data as follows: where z im is the number of minor alleles for ith individual at the mth marker, f m is the allele frequency for the mth marker, and p is the total number of markers in the genotyping data. We investigated 4 methods of association analysis: (a) a single-variant association test without adjusting for familial relatedness (UNI), that is, the component Zα is excluded in the model; (b) a single-variant association test adjusting for familial relatedness (UNI-ADJ); (c) a kernel score test without adjusting for familial relatedness (SKAT); and (d) a kernel score test adjusting for familial relatedness (SKAT-ADJ). The first method is a simple linear regression. The second method tests the null hypothesis of zero fixed effects in a linear mixed model. Specifically, we used the likelihood ratio test implemented in the software GEMMA [6]. The third method is the direct application of the SKAT software [3] that ignores familial relatedness. As for the fourth method, we adopted the approach described in Ref. [7]. Specifically, we estimated the covariance matrix R = cov (Zα + e) = Kσ 2 α + Iσ 2 e under the null model without genes or variants being tested. Then we applied a data transformation to calculateỸ ,X, andC , where: X = R −1/2 X,X = R −1/2 X,C = R −1/2 C, leading to the following transformed model,Ỹ =Cγ +Xβ +ẽ, wherẽ e = R −1/2 (Zα + e). Note that eachẽ i is independent and identically distributed; therefore, the kernel score test can be appropriately applied on the transformed data.

Simulated data
For the simulated data, we analyzed the genes that explain the highest percentages of SBP variance (the percentages of SBP variance explained by each gene in the simulation were provided by the GAW18 organizers) using the 4 methods described above. Genes without nonsynonymous mutations were excluded. For the first two single-variant methods, we performed a Bonferroni correction on the smallest p value of the variants in a gene to obtain the gene-level p value. To assess the falsepositive rate, we also calculated the p values using Q1 as the phenotype. The power and the false-positive rate were calculated based on the 200 replicates. Table 1 shows the results for the top 7 genes. As the table shows, ignoring familial relatedness results in inflation of the false-positive rate using the UNI and SKAT methods, in contrast with the UNI-ADJ and SKAT-ADJ methods, which still demonstrate good control of the false-positive rate. As for the power, the single-variant method does better for genes whose effects on the phenotype are dominated by a single variant (eg, LEPR) whereas the collapsing method is more powerful for genes containing multiple variants with comparable effect sizes.

Real data
Consistent with our observation in the simulated data, the Q-Q plot (Figure 1) shows that the UNI and SKAT methods that ignore familial relatedness result in substantial inflation of the false-positive rate. For UNI-ADJ, only 1 variant (chr11:119059358) has a Bonferroni-corrected p value smaller than 0.05 (P Bonferroni = 0.038). It is a coding variant for gene PDZD3. To our knowledge, the literature has suggested no role for this gene in blood pressure. For SKAT-ADJ, there is no gene with a Bonferroni-corrected p value smaller than 1.

Discussion and conclusions
In this study, we investigated the utility of linear mixed models in adjusting for familial structure. We also attempted to combine a linear mixed model method with a kernel score test-a state-of-the-art collapsing methodology that tests for association between a group of variants and a phenotype.

Conclusions
We found that linear mixed models are able to satisfactorily adjust for familial relatedness in the sense that false-positive rates are well controlled at the nominal significance level. Not surprisingly, in terms of the statistical power, the performance of the single-variant method and the collapsing method depends on the distribution of the variants' effect sizes within the gene. The single-variant method is more powerful for genes with one dominating causal variant, whereas the collapsing method is more powerful for genes with multiple causal variants of similar effect sizes.   Authors' details