- Open Access
Combining information from linkage and association mapping for next-generation sequencing longitudinal family data
BMC Proceedingsvolume 8, Article number: S34 (2014)
In this analysis, we investigate the contributions that linkage-based methods, such as identical-by-descent mapping, can make to association mapping to identify rare variants in next-generation sequencing data. First, we identify regions in which cases share more segments identical-by-descent around a putative causal variant than do controls. Second, we use a two-stage mixed-effect model approach to summarize the single-nucleotide polymorphism data within each region and include them as covariates in the model for the phenotype. We assess the impact of linkage disequilibrium in determining identical-by-descent states between individuals by using markers with and without linkage disequilibrium for the first part and the impact of imputation in testing for association by using imputed genome-wide association studies or raw sequence markers for the second part. We apply the method to next-generation sequencing longitudinal family data from Genetic Association Workshop 18 and identify a significant region at chromosome 3: 40249244-41025167 (p-value = 2.3 × 10−3).
In genetic association studies, joint analysis of multiple single-nucleotide polymorphisms (SNPs) can be more powerful than separate SNP analysis because single markers typically either have small effect sizes (common variants) or minor allele frequencies that are too small to reliably fit models (rare variants) . If the rare variant effects were large, they would have been found through previous family-based linkage studies if the disease was not heterogeneous. There may be a middle ground in which multiple rare variants of moderate effect size play a key role in the etiology of some diseases. Such situations might be ideal for identity-by-descent (IBD) mapping . Moreover, with the availability of genome-wide SNP data, the density of SNP markers has increased dramatically, making it possible to detect segments of IBD as small as 2 centimorgans (cM) .
In this article, we investigate the contribution that linkage-based methods, such as IBD mapping, can make to association mapping to identify rare variants in next-generation sequencing data. In the first part of our analysis, we use the methods of Browning and Thompson  to identify regions in which cases share more segments of IBD around a putative causal variant than do controls. After selecting these regions, we use a two-stage mixed-effects model approach, which was recently proposed by Tsonaka et al , to summarize the SNP data within each region and include them as covariates in the model for the phenotype. To increase our power to identify rare variants, we also include the number of rare variants per region as a covariate in the model.
To assess the impact of linkage disequilibrium (LD) on our analysis, we present results from estimating IBD probabilities using markers with and without LD. We assess the impact of imputation by analyzing both imputed dosage genome-wide association studies (DOS) and whole genome sequence (WGS) data. Table 1 provides a description of the data sets used for IBD and association mapping.
We consider data from 939 individuals from 20 families; 464 are directly sequenced individuals and imputed WGS data, based on existing genome-wide association studies (GWAS) data, are available for their family members. We restrict our work to real genotypic data from chromosome 3. For each individual, we have information on age at examination and current tobacco smoking for up to 4 time points. We use the binary trait hypertension diagnosis at the first time point for selection of regions with excess IBD sharing and the quantitative trait diastolic blood pressure (DBP) for the phenotype model.
Selection of regions with excess IBD sharing
We construct all possible case-case (CaCa) and case-control (CaCo) pairs, such that individuals within pairs are unrelated. This results in 9229 CaCa pairs and 10080 CaCo pairs. We estimate the IBD state using 2 data sets: one containing approximately 50,000 GWAS markers, which we refer to as the AllMark data set, and 1 containing only 784 LD-pruned GWAS markers, the NoLD data set. From both data sets we eliminate SNPs with minor allele frequencies (MAFs) <5% because shared alleles that are assumed to be rare represent strong evidence for IBD and can distort results if this assumption is violated . In brief, the NoLD markers are selected using a sliding window 1 cM in size, removing markers based on linkage information content and excluding markers with the lowest MAF. At each marker we calculate the rate of IBD for each of the 2 groups and subtract their genomic average over all markers and pairs. If the ratio between CaCa pairs is larger than the maximum CaCo ratio, exceeding a certain threshold, we consider this region for association analysis.
To compute the IBD states between pairs of individuals, we use Thompson's method  with ibd_haplo (MORGAN) [6, 7]. This method uses a continuous-time Markov rate matrix to model and estimate IBD states among pairs of individuals, using data at dense SNP loci, ignoring the LD structure. However, LD remains a major confounding factor because LD is itself a reflection of coancestry at the population level. To assess the impact of LD on IBD estimation, we present results for both AllMark and NoLD data sets.
In ibd_haplo, one needs to specify a value for parameters of the latent IBD process β, the pointwise pairwise probability of IBD, and α, the overall rate of change of IBD state along a chromosome. The choice of these parameters defines the time-depth of the IBD that is sought . For the results shown in this paper, α = 0.05 and β = 0.01. We use a calling threshold of 0.9 as the probability that each of the IBD states must reach for the state to be called.
After the regions have been selected, we use the two-stage approach of Tsonaka et al to test for their association with the longitudinal phenotype . In the first stage, a random-effects model is used to summarize the regions via their empirical Bayes (EB) estimates. Next, the EB estimates of a specific region r, obtained from the first stage, are added as covariates into the model for the phenotype to test for region effects. Below, we describe in brief the phenotypic model used in the second stage.
Let DBPijt be the diastolic blood pressure for individual j from family i at time point t, where i = 1, . . . , N , j = 1, . . . , ni, t = 1, . . . , 4, and ni is the number of individuals in family i. We use the following linear mixed model for each region r:
where xijt is the vector with covariate values for age and smoking status, eb ijr is the EB estimates of the region r, obtained from the first stage, and s ijr is the number of rare variants within region r; u ij is the random family effect and follows a multivariate normal distribution with mean 0 and variance-covariance matrix , where R is the coefficient of relationships matrix; eijt is a normally distributed residual with a 4 × 4 covariance matrix to model the correlation among 6 repeated measurements. We use a multivariate Wald statistic with 2 degrees of freedom to test the null hypothesis of no region effect; that is, H0: β2 = β3 = 0.
Table 2 presents the mean proportions and lengths of IBD segments shared for both groups. Averages were taken over all markers and all pairs. For both AllMark and NoLD, we observed a small difference in both mean proportion and length. However, in AllMark, where LD is ignored, the mean proportion of IBD is increased, as compared to NoLD. We compared the rates between the 2 groups and found 8 and 7 regions with an excess of IBD between CaCa pairs for AllMark and NoLD, respectively. Table 3 lists the starting and ending physical positions of these regions, as well as the number of SNPs and rare variants they contain. Interestingly, we observed no overlap between regions when using markers with and without LD.
After selecting the regions, we tested their association with the longitudinal phenotype by fitting a linear mixed model to DBP with the EB estimates per region, smoking status, and age as covariates. To further increase our power, we considered a second model, where we adjusted also for the sum of rare variants. We used 2 different genotype data, DOS with imputed genotypes on 939 individuals and WGS with complete genomics on 464 individuals. To account for multiple testing, we used a Bonferroni correction, using a significance level of alpha divided by the maximum number of independent regions tested for each data set; that is, 7 for the NoLD and 8 for the AllMark. We used 6 × 10−3 as the significance level for AllMark and 7 × 10−3 for NoLD.
No significant results were found when the candidate regions were selected using the AllMark data (results not shown). Table 4 gives the results of the analysis based on NoLD. When NoLD and DOS were used, there was a significant result for the region 3:40249244-41025167 (p-value of the 2df Wald 2.3 × 10−3). When WGS was used instead of DOS, the variance of the estimates increased and the signal was no longer significant. When the number of rare variants was removed from the model, the region again reached significance (p-value = 2.1 × 10−3).
We have presented a method that combines linkage and association-based mapping to identify rare variants in next-generation sequencing data. Initially, we identify regions with an excess of IBD between case-case as compared to case-control pairs. Subsequently, we use a two-stage approach to summarize the regions via an EB estimate of the genetic variation and test for region effects. The two-stage approach captures the correlation between SNPs within regions by using random effects. These types of approaches can be more powerful than methods that ignore the dependency structure between the SNPs . The approach can be directly applied to family and longitudinal data and can deal with missing genotypes.
One main advantage of this method, as compared to an association-only approach , is that by using the IBD mapping in the first step, we reduce the number of candidate regions to areas more enriched for putative causal loci. This considerably reduces the number of tests that need to be performed, and testing for interactions becomes feasible. This method can also be used for non-gene regions, although cautiously, because possibly important regions might already have been excluded in the first part, if the parameters for the IBD are misspecified. Moreover, if the resulting regions contain too many markers, the effect of rare variants might be diluted. The regions are selected using the binary hypertension diagnosis phenotype at the first measurement and not the quantitative DBP phenotype analyzed in the association study. This may be a problem if the 2 phenotypes are different. In our case, the binary phenotype was created using a threshold for the quantitative phenotype or information on medications. If the effect of a variant changes over time, we might lose power by determining the IBD states only on the first measurement. For individuals receiving treatment, the recorded DBP could be considered as a right-censored value, because we know that it is less than what the untreated value would be. Our approach ignores this information, which again may result in power loss. One way to address this issue could be to use a nonparametric algorithm to adjust blood pressure for treatment effect .
In this article, we do not present results for type I error or power. However, Tsonaka et al.  and Houwing-Duistermaat et al.  report results for both regarding the two-stage approach. Using extensive simulations, Tsonaka et al. showed that the test statistics preserve the type I error at nominal level for scenarios comparable to ours. Houwing- Duistermaat et al. analyzed the simulated phenotypes from this Genetic Analysis Workshop (GAW) and found that the power was as high as 96.5% and 72.5% using the imputed GWAS and WGS data, respectively.
We found significant results only when the candidate regions were selected using the NoLD and DOS data. One reason for the better performance of the NoLD data, as compared to the AllMark data, could be the presence of LD in the latter. LD leads to increased rates of false positive IBD results , which could erroneously indicate these regions as interesting. The absence of overlap between regions when using these 2 data sets also indicates the sensitivity of the method to the amount of LD in the data. Another reason for the better performance of the NoLD data set could be the region selection process. In the NoLD data, the markers are further apart from each other, as compared to the AllMark data set. Hence, when selecting a region (at least 2 markers), we automatically include more SNPs and rare variants.
When the NoLD and WGS data were used, the signal of the region found using DOS was no longer significant. This power loss could be a result of the smaller sample size in the WGS data, which leads to increased variances of the parameter estimates (results not shown). The same happens for the estimates of the genetic variance. On one hand, using the DOS data we estimate with a variance of 1.4366 (p-value 6.9 × 10−11). On the other hand, when using WGS, the estimate becomes much smaller, , and its variance increases to 27.23 (p-value = 0.99). Removing the number of rare variants from the model led to a significant p-value for this region.
Using the NCBI database , we found that the gene CADM2, which is 146 kilobase (kb) on the right of the region we identified, is associated, among other phenotypes, with blood pressure and body mass index . More specifically, 3 SNPs in this gene are associated with blood pressure: rs1370032 (p-value = 7.22 × 10−5), rs13074417 (p-value = 7.625 × 10−5), and rs4859048 (p- value = 7.872 × 10−5).
We identified a significant region when IBD states were determined using LD-pruned markers and association with phenotype was tested using the imputed genotypes of 939 individuals. When the raw sequence data of 464 individuals was used, the signal was significant only after the number of rare variants from the phenotype model were removed.
Cantor RM, Lange K, Sinsheimer JM: Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Am J Hum Genet. 2010, 86: 6-33. 10.1016/j.ajhg.2009.11.017.
Browning SR, Thompson EA: Detecting rare variant associations by identity-by-descent mapping in case-control studies. Genetics. 2012, 190: 1521-1531. 10.1534/genetics.111.136937.
Browning BL, Browning SR: A fast, powerful method for detecting identity by descent. Am J Hum Genet. 2011, 88: 173-182. 10.1016/j.ajhg.2011.01.010.
Tsonaka R, van der Helm-van Mil AHM, Houwing-Duistermaat JJ: A two-stage mixed-effects model approach for gene-set analyses in candidate gene studies. Stat Med. 2011, 31: 1190-1202.
Brown MD, Glazner CG, Zheng C, Thompson EA: Inferring coancestry in population samples in the presence of linkage disequilibrium. Genetics. 2012, 190: 1447-1460. 10.1534/genetics.111.137570.
Thompson EA: The IBD process along four chromosomes. Theor Popul Biol. 2008, 73: 369-373. 10.1016/j.tpb.2007.11.011.
The IBD_Haplo software is part of MORGAN 3.1. April 2012 release, [http://www.stat.washington.edu/thompson/Genepi/pangaea.shtml]
Chen LS, Hutter CM, Potter JD, Liu Y, Prentice RL, Peters U, Hsu L: Insights into colon cancer etiology via regularized approach to gene set analysis of GWAS data. Am J Hum Genet. 2010, 86: 860-871. 10.1016/j.ajhg.2010.04.014.
Houwing-Duistermaat JJ, Helmer Q, Balliu B, van de Akker E, Tsonaka R, Uh HW: Gene analysis for longitudinal family data using nested random effects models. BMC Proc. 2014, 8 (suppl 2): S88-
Soler JMP, Blangero J: Longitudinal familial analysis of blood pressure involving parametric (co)variance functions. BMC Genet. 2003, 4 (Suppl 1): S87-10.1186/1471-2156-4-S1-S87.
The National Center for Biotechnology Information. [http://www.ncbi.nlm.nih.gov/gap/PheGenI?tab=1&gene=253559]
Speliotes EK, Willer CJ, Berndt SI, Monda KL, Thorleifsson G, Jackson AU, Lango Allen H, Lindgren CM, Luan J, Mägi R, et al: Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat Genet. 2010, 42: 937-948. 10.1038/ng.686.
The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7-Health-F5-2012) under grant agreement n° 305280 (MIMOmics). The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
The authors declare that they have no competing interests.
BB participated in the design of the study, performed the statistical analysis, interpreted the results, and wrote the manuscript. HWU participated in the design of the study and in the interpretation of the results. RT developed the statistical method. SB participated in the design of the study and helped to draft the manuscript. QH participated in the analysis of the data and the interpretation of the results. JJHD conceived of the study, participated in its design and coordination, and helped draft the manuscript. All authors read and approved the final manuscript.