Combining information from linkage and association mapping for next-generation sequencing longitudinal family data
© Balliu et al.; licensee BioMed Central Ltd. 2014
Published: 17 June 2014
In this analysis, we investigate the contributions that linkage-based methods, such as identical-by-descent mapping, can make to association mapping to identify rare variants in next-generation sequencing data. First, we identify regions in which cases share more segments identical-by-descent around a putative causal variant than do controls. Second, we use a two-stage mixed-effect model approach to summarize the single-nucleotide polymorphism data within each region and include them as covariates in the model for the phenotype. We assess the impact of linkage disequilibrium in determining identical-by-descent states between individuals by using markers with and without linkage disequilibrium for the first part and the impact of imputation in testing for association by using imputed genome-wide association studies or raw sequence markers for the second part. We apply the method to next-generation sequencing longitudinal family data from Genetic Association Workshop 18 and identify a significant region at chromosome 3: 40249244-41025167 (p-value = 2.3 × 10−3).
In genetic association studies, joint analysis of multiple single-nucleotide polymorphisms (SNPs) can be more powerful than separate SNP analysis because single markers typically either have small effect sizes (common variants) or minor allele frequencies that are too small to reliably fit models (rare variants) . If the rare variant effects were large, they would have been found through previous family-based linkage studies if the disease was not heterogeneous. There may be a middle ground in which multiple rare variants of moderate effect size play a key role in the etiology of some diseases. Such situations might be ideal for identity-by-descent (IBD) mapping . Moreover, with the availability of genome-wide SNP data, the density of SNP markers has increased dramatically, making it possible to detect segments of IBD as small as 2 centimorgans (cM) .
In this article, we investigate the contribution that linkage-based methods, such as IBD mapping, can make to association mapping to identify rare variants in next-generation sequencing data. In the first part of our analysis, we use the methods of Browning and Thompson  to identify regions in which cases share more segments of IBD around a putative causal variant than do controls. After selecting these regions, we use a two-stage mixed-effects model approach, which was recently proposed by Tsonaka et al , to summarize the SNP data within each region and include them as covariates in the model for the phenotype. To increase our power to identify rare variants, we also include the number of rare variants per region as a covariate in the model.
Description of genotypic data sets used in each part of the analysis
IBD mapping: regions with excess of IBD sharing
Association mapping: Two-stage approach
Type of data
(65,000, Illumina chips)
Imputed (dosage) WGS based on existing GWAS framework
Whole genome sequence
We consider data from 939 individuals from 20 families; 464 are directly sequenced individuals and imputed WGS data, based on existing genome-wide association studies (GWAS) data, are available for their family members. We restrict our work to real genotypic data from chromosome 3. For each individual, we have information on age at examination and current tobacco smoking for up to 4 time points. We use the binary trait hypertension diagnosis at the first time point for selection of regions with excess IBD sharing and the quantitative trait diastolic blood pressure (DBP) for the phenotype model.
Selection of regions with excess IBD sharing
We construct all possible case-case (CaCa) and case-control (CaCo) pairs, such that individuals within pairs are unrelated. This results in 9229 CaCa pairs and 10080 CaCo pairs. We estimate the IBD state using 2 data sets: one containing approximately 50,000 GWAS markers, which we refer to as the AllMark data set, and 1 containing only 784 LD-pruned GWAS markers, the NoLD data set. From both data sets we eliminate SNPs with minor allele frequencies (MAFs) <5% because shared alleles that are assumed to be rare represent strong evidence for IBD and can distort results if this assumption is violated . In brief, the NoLD markers are selected using a sliding window 1 cM in size, removing markers based on linkage information content and excluding markers with the lowest MAF. At each marker we calculate the rate of IBD for each of the 2 groups and subtract their genomic average over all markers and pairs. If the ratio between CaCa pairs is larger than the maximum CaCo ratio, exceeding a certain threshold, we consider this region for association analysis.
To compute the IBD states between pairs of individuals, we use Thompson's method  with ibd_haplo (MORGAN) [6, 7]. This method uses a continuous-time Markov rate matrix to model and estimate IBD states among pairs of individuals, using data at dense SNP loci, ignoring the LD structure. However, LD remains a major confounding factor because LD is itself a reflection of coancestry at the population level. To assess the impact of LD on IBD estimation, we present results for both AllMark and NoLD data sets.
In ibd_haplo, one needs to specify a value for parameters of the latent IBD process β, the pointwise pairwise probability of IBD, and α, the overall rate of change of IBD state along a chromosome. The choice of these parameters defines the time-depth of the IBD that is sought . For the results shown in this paper, α = 0.05 and β = 0.01. We use a calling threshold of 0.9 as the probability that each of the IBD states must reach for the state to be called.
After the regions have been selected, we use the two-stage approach of Tsonaka et al to test for their association with the longitudinal phenotype . In the first stage, a random-effects model is used to summarize the regions via their empirical Bayes (EB) estimates. Next, the EB estimates of a specific region r, obtained from the first stage, are added as covariates into the model for the phenotype to test for region effects. Below, we describe in brief the phenotypic model used in the second stage.
where xijt is the vector with covariate values for age and smoking status, eb ijr is the EB estimates of the region r, obtained from the first stage, and s ijr is the number of rare variants within region r; u ij is the random family effect and follows a multivariate normal distribution with mean 0 and variance-covariance matrix , where R is the coefficient of relationships matrix; eijt is a normally distributed residual with a 4 × 4 covariance matrix to model the correlation among 6 repeated measurements. We use a multivariate Wald statistic with 2 degrees of freedom to test the null hypothesis of no region effect; that is, H0: β2 = β3 = 0.
Description of IBD between case-case and case-control pairs
Mean length of segments
Descriptions of regions
After selecting the regions, we tested their association with the longitudinal phenotype by fitting a linear mixed model to DBP with the EB estimates per region, smoking status, and age as covariates. To further increase our power, we considered a second model, where we adjusted also for the sum of rare variants. We used 2 different genotype data, DOS with imputed genotypes on 939 individuals and WGS with complete genomics on 464 individuals. To account for multiple testing, we used a Bonferroni correction, using a significance level of alpha divided by the maximum number of independent regions tested for each data set; that is, 7 for the NoLD and 8 for the AllMark. We used 6 × 10−3 as the significance level for AllMark and 7 × 10−3 for NoLD.
P-value for testing region effects using the NoLD data set
β2, β3 a
β2, β3 a
1.3 × 10−3
2.3 × 10−3
3.6 × 10−3
9.3 × 10−3
2.1 × 10−3
We have presented a method that combines linkage and association-based mapping to identify rare variants in next-generation sequencing data. Initially, we identify regions with an excess of IBD between case-case as compared to case-control pairs. Subsequently, we use a two-stage approach to summarize the regions via an EB estimate of the genetic variation and test for region effects. The two-stage approach captures the correlation between SNPs within regions by using random effects. These types of approaches can be more powerful than methods that ignore the dependency structure between the SNPs . The approach can be directly applied to family and longitudinal data and can deal with missing genotypes.
One main advantage of this method, as compared to an association-only approach , is that by using the IBD mapping in the first step, we reduce the number of candidate regions to areas more enriched for putative causal loci. This considerably reduces the number of tests that need to be performed, and testing for interactions becomes feasible. This method can also be used for non-gene regions, although cautiously, because possibly important regions might already have been excluded in the first part, if the parameters for the IBD are misspecified. Moreover, if the resulting regions contain too many markers, the effect of rare variants might be diluted. The regions are selected using the binary hypertension diagnosis phenotype at the first measurement and not the quantitative DBP phenotype analyzed in the association study. This may be a problem if the 2 phenotypes are different. In our case, the binary phenotype was created using a threshold for the quantitative phenotype or information on medications. If the effect of a variant changes over time, we might lose power by determining the IBD states only on the first measurement. For individuals receiving treatment, the recorded DBP could be considered as a right-censored value, because we know that it is less than what the untreated value would be. Our approach ignores this information, which again may result in power loss. One way to address this issue could be to use a nonparametric algorithm to adjust blood pressure for treatment effect .
In this article, we do not present results for type I error or power. However, Tsonaka et al.  and Houwing-Duistermaat et al.  report results for both regarding the two-stage approach. Using extensive simulations, Tsonaka et al. showed that the test statistics preserve the type I error at nominal level for scenarios comparable to ours. Houwing- Duistermaat et al. analyzed the simulated phenotypes from this Genetic Analysis Workshop (GAW) and found that the power was as high as 96.5% and 72.5% using the imputed GWAS and WGS data, respectively.
We found significant results only when the candidate regions were selected using the NoLD and DOS data. One reason for the better performance of the NoLD data, as compared to the AllMark data, could be the presence of LD in the latter. LD leads to increased rates of false positive IBD results , which could erroneously indicate these regions as interesting. The absence of overlap between regions when using these 2 data sets also indicates the sensitivity of the method to the amount of LD in the data. Another reason for the better performance of the NoLD data set could be the region selection process. In the NoLD data, the markers are further apart from each other, as compared to the AllMark data set. Hence, when selecting a region (at least 2 markers), we automatically include more SNPs and rare variants.
When the NoLD and WGS data were used, the signal of the region found using DOS was no longer significant. This power loss could be a result of the smaller sample size in the WGS data, which leads to increased variances of the parameter estimates (results not shown). The same happens for the estimates of the genetic variance. On one hand, using the DOS data we estimate with a variance of 1.4366 (p-value 6.9 × 10−11). On the other hand, when using WGS, the estimate becomes much smaller, , and its variance increases to 27.23 (p-value = 0.99). Removing the number of rare variants from the model led to a significant p-value for this region.
Using the NCBI database , we found that the gene CADM2, which is 146 kilobase (kb) on the right of the region we identified, is associated, among other phenotypes, with blood pressure and body mass index . More specifically, 3 SNPs in this gene are associated with blood pressure: rs1370032 (p-value = 7.22 × 10−5), rs13074417 (p-value = 7.625 × 10−5), and rs4859048 (p- value = 7.872 × 10−5).
We identified a significant region when IBD states were determined using LD-pruned markers and association with phenotype was tested using the imputed genotypes of 939 individuals. When the raw sequence data of 464 individuals was used, the signal was significant only after the number of rare variants from the phenotype model were removed.
The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7-Health-F5-2012) under grant agreement n° 305280 (MIMOmics). The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
- Cantor RM, Lange K, Sinsheimer JM: Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Am J Hum Genet. 2010, 86: 6-33. 10.1016/j.ajhg.2009.11.017.PubMed CentralView ArticlePubMedGoogle Scholar
- Browning SR, Thompson EA: Detecting rare variant associations by identity-by-descent mapping in case-control studies. Genetics. 2012, 190: 1521-1531. 10.1534/genetics.111.136937.PubMed CentralView ArticlePubMedGoogle Scholar
- Browning BL, Browning SR: A fast, powerful method for detecting identity by descent. Am J Hum Genet. 2011, 88: 173-182. 10.1016/j.ajhg.2011.01.010.PubMed CentralView ArticlePubMedGoogle Scholar
- Tsonaka R, van der Helm-van Mil AHM, Houwing-Duistermaat JJ: A two-stage mixed-effects model approach for gene-set analyses in candidate gene studies. Stat Med. 2011, 31: 1190-1202.View ArticlePubMedGoogle Scholar
- Brown MD, Glazner CG, Zheng C, Thompson EA: Inferring coancestry in population samples in the presence of linkage disequilibrium. Genetics. 2012, 190: 1447-1460. 10.1534/genetics.111.137570.PubMed CentralView ArticlePubMedGoogle Scholar
- Thompson EA: The IBD process along four chromosomes. Theor Popul Biol. 2008, 73: 369-373. 10.1016/j.tpb.2007.11.011.PubMed CentralView ArticlePubMedGoogle Scholar
- The IBD_Haplo software is part of MORGAN 3.1. April 2012 release, [http://www.stat.washington.edu/thompson/Genepi/pangaea.shtml]
- Chen LS, Hutter CM, Potter JD, Liu Y, Prentice RL, Peters U, Hsu L: Insights into colon cancer etiology via regularized approach to gene set analysis of GWAS data. Am J Hum Genet. 2010, 86: 860-871. 10.1016/j.ajhg.2010.04.014.PubMed CentralView ArticlePubMedGoogle Scholar
- Houwing-Duistermaat JJ, Helmer Q, Balliu B, van de Akker E, Tsonaka R, Uh HW: Gene analysis for longitudinal family data using nested random effects models. BMC Proc. 2014, 8 (suppl 2): S88-PubMed CentralView ArticlePubMedGoogle Scholar
- Soler JMP, Blangero J: Longitudinal familial analysis of blood pressure involving parametric (co)variance functions. BMC Genet. 2003, 4 (Suppl 1): S87-10.1186/1471-2156-4-S1-S87.PubMed CentralView ArticlePubMedGoogle Scholar
- The National Center for Biotechnology Information. [http://www.ncbi.nlm.nih.gov/gap/PheGenI?tab=1&gene=253559]
- Speliotes EK, Willer CJ, Berndt SI, Monda KL, Thorleifsson G, Jackson AU, Lango Allen H, Lindgren CM, Luan J, Mägi R, et al: Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat Genet. 2010, 42: 937-948. 10.1038/ng.686.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.