Volume 8 Supplement 1

## Genetic Analysis Workshop 18: Human sequence data in extended pedigrees

# Family-based Bayesian collapsing method for rare-variant association study

- Liang He
^{1}Email author and - Janne M Pitkäniemi
^{1, 2}

**8(Suppl 1)**:S37

https://doi.org/10.1186/1753-6561-8-S1-S37

© He and Pitkäniemi; licensee BioMed Central Ltd. 2014

**Published: **17 June 2014

## Abstract

In this study, we analyze the Genetic Analysis Workshop 18 data to identify the genes and underlying single-nucleotide polymorphisms on 11 chromosomes that exhibit significant association with systolic blood pressure. We propose a novel family-based method for rare-variant association detection based on the hierarchical Bayesian framework. The method controls spurious associations caused by population stratification, and improves the statistical power to detect not only individual rare variants, but also genes with either continuous or binary outcomes. Our method utilizes nuclear family information, and takes into account the effects of all single-nucleotide polymorphisms in a gene, using a hierarchical model. When we apply this method to the genome-wide Genetic Analysis Workshop 18 data, several genes and single-nucleotide polymorphisms are identified as potentially related to systolic blood pressure.

## Keywords

## Background

Current studies suggest that a large number of common variants identified in genome-wide association studies (GWAS) as being associated with various complex diseases can account for only a small portion of phenotype variation [1]. With the advent of next-generation sequencing, attention has focused on rare variants (RVs), such as single-nucleotide polymorphisms (SNPs) with a minor allele frequency (MAF) of less than 1%. Traditional single-marker methods lose statistical power for detecting RV association because of their rare occurrence. In the last few years, however, a variety of methods have been developed, including the combined multivariate and collapsing (CMC) method [2] and the weighted sum (WS) method [3]. More sophisticated methods that are robust to different variant effects include the kernel-based adaptive cluster (KBAC) method [4], the C-alpha test [5], and the sequence kernel association test (SKAT) [6].

These methods, however, all assume that individuals are independently sampled and are, therefore, vulnerable to the influence of population stratification. Exploring marker transmission within a family avoids the issues of population stratification. More important, once anRV enters a family, it can segregate to other family members so that copies of the minor allele are enriched in the data. This could potentially increase the statistical power of family-based approaches.

Here we propose a novel family-based Bayesian collapsing model (FBCM) capable of identifying associations of RVs and genes with quantitative phenotypes. The method builds on the hierarchical quantitative transmission disequilibrium test (HQTDT) [7]. Compared to classical statistical methods, the Bayesian framework incorporates prior information, thereby providing an alternative approach to situations in which factors affecting the power of the test, such as the MAF of the SNP, play an important role [8]. We combine HQTDT with the idea of collapsing under a Bayesian framework. Then we expand the model in a data-driven manner by utilizing random effects to model the signals of individual rare variants within a gene.

## Methods

### Family-based Bayesian collapsing model

Several statistical models based on the Bayesian framework, such as model selection [9] and multiple regression [10], have been proposed for RV association detection. Because of the large scale of model space and matrix calculation, these approaches suffer from impractical computational time if a full joint posterior distribution is required. Although most Bayesian methods endeavor to employ various optimization or approximation algorithms to obtain a point estimate, the loss of uncertainty information on the estimate means the significance of the estimate cannot be evaluated. In this paper, we propose a Bayesian model that aims to efficiently generate a full posterior distribution without the loss of model space, and is viable for family-based genome-wide association analysis. The central idea comes from collapsing RVs and modeling their effects using variant-specific random effects. In some cases, it is probable that an RV is enriched in certain pedigrees while being very rare in others. Thus, among a group of RVs, some can be both neutral and associated with the phenotype through population stratification. To solve this problem, the RVs are collapsed in 2 orthogonal components to adjust for the possible population stratification.

*Κ*diallelic loci (in this paper, a locus always refers to the location of a SNP) with MAF less than 1%. Given a set of

*i*= 1,...,

*M*nuclear families, each of which contains

*n*

_{ i }siblings so that the total number of offspring is $\sum _{i=1}^{M}{n}_{i}=N,$ we define the coded genotypic score

*G*

_{ ijk }for the

*j*

^{th}child in the

*i*

^{th}family as the number of minor alleles at the

*k*

^{th}locus. It is assumed that both parents of each child are available, and, correspondingly, the genotypic scores of the parents at the

*k*

^{th}locus in the

*i*

^{th}family are denoted by

*GM*

_{ ik }and

*GF*

_{ ik }, respectively, for the mother and father. Conditional on the parental genotypes, the expected score for the offspring in the

*i*

^{th}family at the

*k*

^{th}locus under mendelian law is $G{E}_{ik}=\frac{G{M}_{ik}+G{F}_{ik}}{2}$. Furthermore, the deviation of the genotypic score for the

*j*

^{th}child in the

*i*

^{th}family at the

*k*

^{th}locus, which is denoted by

*D*

_{ ijk }, is

*G*

_{ ijk }−

*GE*

_{ ik }. For technical reasons, we add a pseudolocus

*k*= 0 and define

*GE*

_{ i0 }=

*D*

_{ ij0 }= 0. When at least 1of the parents carries the copies of minor alleles at a locus, it is then possible to observe deviation in offspring at this locus. However, given a moderate set of variants, it is very unlikely for an individual to harbor minor alleles at more than 2 causal variants. For instance, when MAF is 0.005 and there are 50 independent causal RVs, the probability of an individual having minor alleles at more than 2 loci is $\mathsf{\text{1}}-0.\mathsf{\text{9}}{\mathsf{\text{9}}}^{\mathsf{\text{5}}0}-\mathsf{\text{5}}0\phantom{\rule{2.77695pt}{0ex}}\cdot \phantom{\rule{2.77695pt}{0ex}}0.\mathsf{\text{9}}{\mathsf{\text{9}}}^{\mathsf{\text{49}}}\cdot \phantom{\rule{2.77695pt}{0ex}}0.0\mathsf{\text{1}}-\frac{50.49}{2}\cdot \phantom{\rule{2.77695pt}{0ex}}0.\mathsf{\text{9}}{\mathsf{\text{9}}}^{\mathsf{\text{48}}}\cdot \phantom{\rule{2.77695pt}{0ex}}0.0{\mathsf{\text{1}}}^{\mathsf{\text{2}}}\approx \mathsf{\text{1}}.\mathsf{\text{38}}\%$. Thus, by taking advantage of the rare occurrence of copies of minor alleles, for each individual we consider at most 2 loci that have nonzero deviation in a candidate gene. These are indexed by

*r*

_{ ij }and

*s*

_{ ij }, which are defined below.

This method dramatically shortens computational time by avoiding large-scale matrix computation in Gibbs sampling. If an individual has nonzero deviation at fewer than 2 loci, both or *s*_{
ij
} are 0. Those with the smallest MAFs are selected if an individual has more than 2 loci with nonzero deviation. Thus, more emphasis is placed on those with smaller MAFs because deleterious functional variants tend to have low frequencies [12]. Given that RVs often do not exhibit strong linkage disequilibrium (LD) with either rare or common SNPs [11], for a moderate number of RVs such approximation loses much less information than do naive collapsing methods. Moreover, including 2 loci enables the model to detect the additive effect combination of 2 RVs.

*y*

_{ ij }denote the quantitative phenotype for the

*j*

^{th}child in the

*i*

^{th}family. The relationship between the phenotype and the set of RVs in the candidate gene can be expressed by a hierarchical model

where *µ* is the global intercept and *ε*_{
ij
} is the random error. The genotypic score is decomposed into within-family and between-family components, and the construction of formula (1) guarantees the orthogonality of those 2 components. Inference based on *β*_{1} provides a stratification-resistant within-family test, while *β*_{2} estimates the genetic effect resulting from stratification. As a result of the limitation of the sample size for the inference and the fact that the variance components are not our major interest in this study, the family-level variable *φ*_{
i
} is modeled as a random effect. This enables us to capture the between-family variance that includes the influence of the family-specific environmental factor.

The vectors of variant-specific random effects $\stackrel{\u0303}{\alpha}=\left({\alpha}_{0},{\alpha}_{1},\cdots \phantom{\rule{0.3em}{0ex}},{\alpha}_{\mathsf{\text{k}}}\right)$ and $\stackrel{\u0303}{\gamma}=\left({\gamma}_{0},{\gamma}_{1},\cdots \phantom{\rule{0.3em}{0ex}},{\gamma}_{\mathsf{\text{k}}}\right)$ modulate the within-family and between-family global effects *β*_{1} and *β*_{2}, respectively. *r*_{
ij
} and *s*_{
ij
} are individual-specific indices of which elements in $\stackrel{\u0303}{\alpha}$ and $\stackrel{\u0303}{\gamma}$ contribute to the *j*^{th} child in the *i*^{th} family. It is possible that some of the RVs are neutral, but may be associated with the phenotype through population stratification. Ignoring this possibility will not only inflate the type I error rate, but also will introduce noise after collapsing. By modeling these 2 situations using $\stackrel{\u0303}{\alpha}$ and $\stackrel{\u0303}{\gamma}$ separately, formula (1) (below) manages to detect the association, accounting for neutral RVs as well as population stratification.

Although it is less common to observe LD between RVs compared to common variants, only independent representative SNPs are selected and included in the analysis. Because 2 loci at most are taken into account for each individual, such selection improves the accuracy and efficiency of the model.

### Prior distributions for random effects

*β*

_{1}and $\stackrel{\u0303}{\alpha}$,

*β*

_{2}and $\stackrel{\u0303}{\gamma}$) may result in a nonidentifiable model. To ensure identifiability, $\stackrel{\u0303}{\alpha}$ and $\stackrel{\u0303}{\gamma}$ --except for ${\alpha}_{0}$ and ${\gamma}_{0}$, which are random effects of the pseudolocus and sampled from

*Bern*(0)--are selected to be independently sampled from Bernoulli distributions with hyper-parameters

*p*

_{ k }and

*q*

_{ k }

*. β*

_{1}and

*β*

_{2}are given a noninformative normal prior distribution with some variance ${\sigma}_{\beta}^{2}$, that is,

The *k*^{th} variant is treated as associated when *α*_{
k
} = 1; otherwise it is neutral. The hyperparameter *p*_{
k
} is the predictor for *α*_{
k
} and can be regarded as the probability of the *k*^{th} variant being associated. With such a prior distribution for *α*_{
k
}, the model actually selects the optimal group of associated RVs in a data-driven way and then collapses them together.

### Bayesian inference

*β*

_{1}= 0. However, in a Bayesian framework, this hypothesis cannot be evaluated directly because the posterior distribution of

*β*

_{1}is continuous. Instead, we can conduct a composite hypothesis test:

Where $\in $ is a small positive number. Although in principle the choice of $\in $ is arbitrary, a too small $\in $ might inflate the estimate error resulting from the numerical approximation. So we set $\in $ as $0.\mathsf{\text{2}}*{\widehat{\sigma}}_{\u03f5},$ where ${\widehat{\sigma}}_{\u03f5}$ is the estimated standard deviation for random error. The Bayes factor (BF) is a good way to summarize the evidence provided by the data in favor of one statistical model over another while also taking into account the complexity of a model. Note that the BF can be expressed using the ratio of the odds of the posterior distribution, which can be obtained approximately by the Monte Carlo Markov chain (MCMC) method, to the prior odds. For the prior distribution described above, the prior odds are calculated as

$\frac{P\left(\left|{\beta}_{1}\right|>\u03f5\cap {\sum}_{k=1}^{K}{\alpha}_{k}>0\right)}{P\left(\left|{\beta}_{1}\right|\le \u03f5\cup {\sum}_{k=1}^{K}{\alpha}_{k}=0\right)}=\frac{\left(1-erf\left(\frac{\u03f5}{\sqrt{2}{\sigma}_{\beta}}\right)\right)\left(1-{0.5}^{K}\right)}{erf\left(\frac{\u03f5}{\sqrt{2}{\sigma}_{\beta}}\right)\left(1-{0.5}^{K}\right)+{0.5}^{K}},$ (4)

*erf*(•) is the error function, defined as: $erf\left(x\right)=\frac{2}{\sqrt{\pi}}{\int}_{0}^{x}{e}^{-{t}^{2}}dt$. Thus, the hybrid BF can be obtained by

_{ k, }

*k*ϵ 1,...,

*Κ*. Note that if we treat α

_{ k }as a model indicator, one way to quantify and summarize the posterior probabilities is to calculate the marginal BF, which is the ratio of the posterior odds to the prior odds of the same variable, defined as:

The model is implemented using WinBUGS with 50,000 iterations, and the convergence is checked by investigating the autocorrelations for all parameters. We also simulate several chains with different initial values simultaneously, and evaluate convergence with the Gelman-Rubin convergence diagnostic tool [13].

## Results

*r*. Half the RVs are randomly selected to be neutral but are associated with the phenotype through population stratification, represented by an indicator vector

*s*. The genotypic scores of the parents are independently sampled from $Bern\left(2\phantom{\rule{2.77695pt}{0ex}}\xb7\phantom{\rule{2.77695pt}{0ex}}MAF\left(k\right)\right)$ for the

*k*

^{th}RV, where $MAF\left(k\right)$ is fixed as 0.005 throughout all RVs. Then the genotypes of children are obtained from parental haplotypes by random transmission, denoted by a 2 × 50 matrix G. Gis divided into the expected genotypic score matrix

*E*and the deviation matrix

*D*for the offspring in a family. The phenotypes of the 2 sibs in each family are generated from

*N*(

*β*

_{1}• (D × r) +

*β*

_{2}•(E × s),

**Σ**), where

*β*

_{1}is the effect size,

*β*

_{2}= 0.5 reflects the effect of population stratification, and $\text{\Sigma}=\left(\begin{array}{cc}\hfill 2\hfill & \hfill 1\hfill \\ \hfill 1\hfill & \hfill 2\hfill \end{array}\right)$ is the covariance matrix to reflect the family structure. We set the hyperparameter ${\sigma}_{\beta}^{2}$ as 10

^{4}and tuned the BF cutoff equal to 2 so that the type I error rate is controlled below 0.005. Figure 1 shows the power curves with

*β*

_{1}= 1 and 1.5.

To evaluate FBCM using real data, the association analyses are performed by fitting our method to the data that use the full pedigree structure provided, with the entry for each variant being the estimated number of minor alleles carried. Our aim is to identify the genes and underlying RVs related to systolic blood pressure (SBP) throughout those 11 chromosomes among the GAW18 type 2 diabetes families. To better reflect the association between predisposition to hypertension and the variants, the highest SBP measured at the 4 examination points is selected as the phenotype for each individual. Log-transformation of the phenotype is performed to fix the skewness of the phenotype distribution. The age corresponding to the highest measured SBP is included in our model as a covariate to account for the significant correlation between age and SBP. Individuals with any missing data or without parental information are excluded, leaving 275 trios remaining.

Most significant genes associated with SBP

Chr | Ensembl ID (ENSG00000-) | HGNC symbol | BF | Effect size |
---|---|---|---|---|

5 | 213568 | HNRNPA1P13 | 8.52 | 0.43 |

19 | 127529 | OR7C2 | 4.82 | −0.24 |

15 | 259500 | RP11-138E16.2 | 4.43 | −0.32 |

| ||||

Chr | Position | BF | Index SNP | MAF |

5 | 135764696 | 36.84 | rs139658064 | 0.004 |

5 | 135765214 | 0.00368 | NA | NA |

Next, we investigated the underlying SNPs among the most related gene, HNRNPA1P13. The most significant SNPs based on their BFs in formula (3) are listed at the bottom of Table 1. The MAF information on these SNPs comes from the 1000 Genomes Project (http://www.1000genomes.org/). The larger BFs favor the evidence against the null hypothesis and indicate that positive deviation from the expected number of transmitted minor alleles drives the effect of the gene, while the BFs much lower than 1 suggest the effect of the deviation in the opposite direction. For example, given the effect size of gene HNRNPA1P13 is 0.43, SNP at position 135765214 with BF 0.00368 indicates that more transmitted copies of a minor allele from parents are likely to have a negative impact on the phenotype.

## Discussion

The FBCM proposed here is a novel statistical method for analyzing the association of RVs in pedigree data. The new methodology accounts for potential nonassociated variants by introducing random effects in a multiplicative way to approximate and capture the variant effects. The FBCM also takes into account situations in which some pooled variants can be associated with a phenotype through population stratification. Although the between-family component in our model can be integrated into the family-level random effect ${\varphi}_{i}$, when the RV under investigation shows a significant between-family effect, our model performs better by capturing this effect to reduce the residual error. The model is based on the HQTDT, but expands the HQTDT by incorporating the collapsing information of the deviation from the expected genotypic score for a group of SNPs and at the same time maintaining orthogonality. Unlike the model selection method [9], our model employs random effects to predict variant-specific effects based on data and succeeds in boosting the sensitivity for gene association detection by collapsing the random effects of RVs. By taking advantage of the rare occurrence of minor alleles in an individual, the algorithm considers at most 2 sites in order to reduce the number of predictors, circumventing the huge computational burden involved in obtaining the full posterior distribution.

In variant-level analysis, our method improves the power to detect RV effects. The improvement of statistical power can be achieved by accounting for the random effects of all variants in a candidate gene through Gibbs sampling. Moreover, in the GAW18 data, it has been shown that the occurrence of a RV tends to be more common if a family member carries a minor allele. Thus, the family-based analysis is expected to have more power than the independent population-based analysis. The results show that our family-based method is able to identify both genes and individual SNPs significantly related to the phenotype, even in RV situations.

Our model can be further expanded in many ways. The appropriate link functions can be employed to handle other forms of phenotype, such as binary data. In this study, we focus on families without any missing data. However, for the trios with missing parental genotype information, the genetic score can be decomposed into between-family and within-family components by using only sibs genotypes. For the random effect distribution, the Bernoulli distribution is assigned as the prior distribution for the random effects of individual variants. For better modeling of the effects of individual variants, more sophisticated distributions can be employed.

## Conclusions

We have demonstrated that a novel FBCM can be applied to identify associations between RVs and quantitative traits for pedigree data. This method cannot only detect the gene effect, but can also pinpoint the underlying SNPs. Compared to other methods for handling RVs, our method based on family data improves statistical power by collapsing and accounting for all possible RV effects in a gene with population stratification controlled. Because the method allows for computational efficiency in obtaining the full posterior distribution, it is applicable to large-scale association tests. The results of our genome-wide analyses provide insights into the potential role of RVs in SBP.

## Declarations

### Acknowledgements

The authors are grateful to the HjeltInstitute for providing facilities to complete this work. The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.

This article has been published as part of *BMC Proceedings* Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.

## Authors’ Affiliations

## References

- Hindorff L, MacArthur J, Wise A, Junkins H, Hall P, Klemm A, Manolio T: A Catalog of Published Genome-Wide Association Studies. [http://www.genome.gov/gwastudies]
- Li B, Leal SM: Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008, 83: 311-321. 10.1016/j.ajhg.2008.06.024.PubMed CentralView ArticlePubMedGoogle Scholar
- Madsen BE, Browning SR: A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009, 5: e1000384-10.1371/journal.pgen.1000384.PubMed CentralView ArticlePubMedGoogle Scholar
- Liu DJ, Leal SM: A novel adaptive method for the analysis of next-generation sequencing data to detect complex trait associations with rare variants due to gene main effects and interactions. PLoS Genet. 2010, 6: e1001156-10.1371/journal.pgen.1001156.PubMed CentralView ArticlePubMedGoogle Scholar
- Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, Kathiresan S, Purcell SM, Roeder K, Daly MJ: Testing for an unusual distribution of rare variants. PLoS Genet. 2011, 7: e1001322-10.1371/journal.pgen.1001322.PubMed CentralView ArticlePubMedGoogle Scholar
- Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X: Rare-variant association testing for sequencing data with the sequence kernel association test. Am J HumGenet. 2011, 89: 82-93. 10.1016/j.ajhg.2011.05.029.PubMed CentralView ArticlePubMedGoogle Scholar
- Gauderman WJ: Candidate gene association analysis for a quantitative trait, using parent-offspring trios. Genet Epidemiol. 2003, 25: 327-338. 10.1002/gepi.10262.View ArticlePubMedGoogle Scholar
- Stephens M, Balding DJ: Bayesian statistical methods for genetic association studies. Nat Rev Genet. 2009, 10: 681-690. 10.1038/nrg2615.View ArticlePubMedGoogle Scholar
- Quintana MA, Berstein JL, Thomas DC, Conti DV: Incorporating model uncertainty in detecting rare variants: the Bayesian risk index. Genet Epidemiol. 2011, 35: 638-649. 10.1002/gepi.20613.PubMed CentralView ArticlePubMedGoogle Scholar
- Yi N, Zhi D: Bayesian analysis of rare variants in genetic association studies. Genet Epidemiol. 2011, 35: 57-69. 10.1002/gepi.20554.PubMed CentralView ArticlePubMedGoogle Scholar
- Pritchard JK, Cox NJ: The allelic architecture of human disease genes: common disease-common variant...or not?. Hum Mol Genet. 2002, 11: 2417-2423. 10.1093/hmg/11.20.2417.View ArticlePubMedGoogle Scholar
- Nielsen R, Hellmann I, Hubisz M, Bustamante C, Clark AG: Recent and ongoing selection in the human genome. Nat RevGenet. 2007, 8: 857-868.View ArticleGoogle Scholar
- Gelman A, Rubin DB: Inference from iterative simulation using multiple sequences. Stat Sci. 1992, 7: 457-472. 10.1214/ss/1177011136.View ArticleGoogle Scholar
- Jeffreys H: Theory of Probability. 1998, New York, Oxford University Press, 3Google Scholar
- Ehret G: Genome-wide association studies: contribution of genomics to understanding blood pressure and essential hypertension. Curr Hypertens Rep. 2010, 12: 17-25. 10.1007/s11906-009-0086-6.PubMed CentralView ArticlePubMedGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## Comments

View archived comments (1)