Logistic Bayesian LASSO for detecting association combining family and case-control data
© The Author(s). 2018
- Published: 17 September 2018
Because the GAW20 samples provide limited information when only case-control or trio data are considered, we propose eLBL, an extension of the Logistic Bayesian LASSO (least absolute shrinkage and selection operator) methodology that analyzes both types of data jointly to increase statistical power, especially for detecting association between rare haplotypes and complex diseases. The methodology is further extended to account for familial correlation among the case-control individuals and the trios. We took a 2-step analysis strategy. In Step 1, we performed a genome-wide single-SNP (single-nucleotide polymorphism) search using the Monte Carlo pedigree disequilibrium test (MCPDT) to determine interesting regions for the Adult Treatment Panel (ATP) binary trait. In Step 2, we applied eLBL to haplotype blocks covering the SNPs flagged in Step 1. Several significantly associated haplotypes were identified; most are in blocks contained in protein coding genes that appear to be relevant for metabolic syndrome. The results are further substantiated with a Type I error study and by an additional analysis using the triglyceride measurements directly as a quantitative trait.
As next-generation sequencing (NGS) technology becomes more accurate and affordable, many recent studies have focused on assessing associations between common complex diseases and single-nucleotide variants (SNVs), paying particular attention to those that are rare. Various methods have been proposed, but most can only identify candidate genes or regions. To narrow the list of potential causal variants, it would be helpful to investigate haplotype blocks formed by single-nucleotide polymorphisms (SNPs) in regions/genes where associations are suggested but may not necessarily be genome-wide significant. Apart from being able to identify biologically relevant variants, haplotype-based methods can be more powerful than SNV-based methods because multilocus genotypes contain more information than single-locus genotypes, especially when causal loci interact in cis, leading to disease etiology. If there are rare causal SNVs in a haplotype block, then rare haplotypes can tag such causal variants, a conclusion based on a simulation study. More importantly, rare haplotypes may be obtained from common SNPs, rendering NGS data unnecessary. The power for detecting rare haplotype associations is further enhanced in a family-based study, because rare associated variants are enriched in families afflicted with the disease compared to population samples of independent cases and controls of the same size. Currently, numerous methods exist, including a class based on Logistic Bayesian LASSO (LBL) for detecting associations of haplotypes, common or rare, using either case-control or family-based data [1, 3].
The GAW20 Real Data Package provides a good opportunity to apply LBL to identify haplotypes that are associated with metabolic syndrome. Specifically, we consider the ATP binary trait derived from the measurements taken at visit 2 (before drug intervention). Among the 188 pedigrees in the data set, only 17 contain complete case–parent trios (ie, genotype information for both parents and the child and phenotype status for the child are all available), leading to a total of only 25 such trios. In addition, we extracted 283 cases (ATP = 1) and 475 controls (ATP = 0) with available genotype information. Because the number of trios is extremely small, these data alone clearly provide insufficient power to detect haplotype association. However, their inclusion may enhance detection power compared to using case–control data only. Because the current LBL methodology focuses on a single study design, we propose eLBL, an extension of the LBL methodology that combines case–control and case–parent trio data for a joint analysis. Furthermore, because cases, controls, and trios are all extracted from the same set of pedigrees, there are intrinsic correlations. To account for such familial dependency, we have adopted a composite likelihood adjustment approach.
Extension of the logistic Bayesian LASSO accounting for familial dependency
To specify the probabilities and elaborate on ∅, we model, for any given individual in the study with haplotype pair Z, the odds of disease θZ = P(Y = 1|Z)/P(Y = 0|Z) with the logistic model log θZ = α + XZβ, where XZ is the design vector corresponding to haplotype pair Z, coded according to the assumed mode of inheritance (eg, additive, recessive, dominant); β = (β1, ⋯, βK) (part of the parameter collection ∅) is the vector of regression coefficients, with βj the effect of the jth variant on the log odds; and α is the baseline effect (related to the phenocopy rate). Note that under an additive model, the jth variant is the jth haplotype, and the total number of distinct haplotypes is K + 1. We cast the problem into a Bayesian framework, in which the adjusted likelihood in eq. (4) is used for correct posterior inference. The detailed Markov chain Monte Carlo (MCMC) inference procedure follows the original LBL methodology, using shrinkage priors to increase power for detecting rare haplotypes [1, 3]; the adjustment factor k is updated in each MCMC iteration. Convergence of the Markov chain was checked with commonly used diagnostic tools. The ratio of the posterior odds to the prior odds, namely the Bayes factor (BF), is used to assess the significance of each βj. We have also constructed empirical posterior credible intervals (CIs) for the odds ratios (ORs). For each haplotype, the OR is essentially the exponential of the corresponding β in the logistic model given above. It is estimated, together with its CI, from the posterior sample of the β values. A haplotype is declared significant only if both criteria are met: BF > 2 and a CI that does not include the null value 1.
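The model and decision rule described above can be illustrated with a minimal Python sketch. It is not the eLBL implementation: the MCMC sampler, the shrinkage priors, and the likelihood adjustment are not reproduced, and posterior draws of a single βj are simply taken as input. The function names, the threshold `epsilon = 0.1` used to define a non-null effect (|βj| > ε), the argument `prior_prob_alt` (the prior probability of that event), and the use of the exponentiated posterior mean as the OR point estimate are all assumptions made for illustration.

```python
import numpy as np

def additive_design(hap_pair, haplotypes, baseline):
    """Additive coding: copies of each non-baseline haplotype in the pair."""
    non_baseline = [h for h in haplotypes if h != baseline]
    return np.array([hap_pair.count(h) for h in non_baseline], dtype=float)

def disease_probability(alpha, beta, x):
    """P(Y = 1 | Z) implied by log(theta_Z) = alpha + x . beta."""
    log_odds = alpha + float(np.dot(x, beta))
    return 1.0 / (1.0 + np.exp(-log_odds))

def summarize_haplotype(beta_draws, prior_prob_alt, epsilon=0.1, level=0.95):
    """Summarize posterior draws of one beta_j (OR, credible interval, BF)."""
    if not 0.0 < prior_prob_alt < 1.0:
        raise ValueError("prior_prob_alt must lie strictly between 0 and 1")
    beta_draws = np.asarray(beta_draws, dtype=float)
    n = beta_draws.size

    # Empirical credible interval for the odds ratio exp(beta_j).
    tail = 100.0 * (1.0 - level) / 2.0
    lower, upper = np.percentile(np.exp(beta_draws), [tail, 100.0 - tail])

    # Posterior probability of a non-null effect, |beta_j| > epsilon (assumed
    # definition), kept away from 0 and 1 so the odds stay finite.
    post_prob_alt = float(np.mean(np.abs(beta_draws) > epsilon))
    post_prob_alt = min(max(post_prob_alt, 0.5 / n), 1.0 - 0.5 / n)

    # Bayes factor = posterior odds / prior odds.
    posterior_odds = post_prob_alt / (1.0 - post_prob_alt)
    prior_odds = prior_prob_alt / (1.0 - prior_prob_alt)
    bayes_factor = posterior_odds / prior_odds

    # Decision rule from the text: BF > 2 and the CI excludes 1.
    significant = bool(bayes_factor > 2 and not (lower <= 1.0 <= upper))
    return {
        "odds_ratio": float(np.exp(beta_draws.mean())),
        "ci": (float(lower), float(upper)),
        "bayes_factor": float(bayes_factor),
        "significant": significant,
    }
```

For example, with three hypothetical haplotypes `h1`, `h2`, `h3` and `h3` as baseline, `additive_design(("h1", "h3"), ["h1", "h2", "h3"], "h3")` gives the design vector (1, 0), so K = 2 coefficients describe K + 1 = 3 haplotypes.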
A 2-step analysis strategy
Because the proposed eLBL (extension of the Logistic Bayesian LASSO [least absolute shrinkage and selection operator]) methodology relies on an MCMC procedure to sample from the posterior distribution, it is computationally intensive and thus not suitable for a whole-genome scan. Instead, we adopt the following 2-step strategy. In the first step, we use the Monte Carlo pedigree disequilibrium test (MCPDT), a family-based single-SNP association testing method, to scan 654,767 SNPs across the 22 autosomes. We excluded SNPs with low minor allele frequencies (< 1%). MCPDT imputes missing data and takes familial relationships into account; consequently, it is viewed as using all information to the maximum extent possible. In the second step, we formed haplotype blocks around the SNPs selected in Step 1 using Haploview. We then applied eLBL to identify haplotype(s) within each block that have a significant influence on the ATP binary trait.
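The screening logic of Step 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: MCPDT itself is not reimplemented, so the p values are taken as a precomputed input; genotypes are assumed to be coded as 0/1/2 allele counts with `np.nan` for missing values; and the function names and the `n_top` cutoff are hypothetical choices, not the authors' selection rule.

```python
import numpy as np

def minor_allele_frequency(genotypes):
    """MAF from 0/1/2 allele counts; np.nan marks a missing genotype."""
    g = np.asarray(genotypes, dtype=float)
    observed = g[~np.isnan(g)]
    if observed.size == 0:
        return float("nan")
    freq = observed.sum() / (2.0 * observed.size)
    return float(min(freq, 1.0 - freq))

def select_snps(snp_ids, genotype_matrix, p_values, maf_min=0.01, n_top=10):
    """Drop SNPs with MAF below maf_min, then keep the n_top smallest p values.

    genotype_matrix has one row per individual and one column per SNP.
    """
    columns = np.asarray(genotype_matrix, dtype=float).T
    kept = []
    for snp, column, p in zip(snp_ids, columns, p_values):
        maf = minor_allele_frequency(column)
        if not np.isnan(maf) and maf >= maf_min:
            kept.append((p, snp))
    kept.sort()  # ascending p value
    return [snp for _, snp in kept[:n_top]]
```

The SNPs returned by such a filter would then define the regions in which haplotype blocks are formed for Step 2.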
Top 10 SNPs with the smallest p values as identified by MCPDT
2.59 × 10−8
3.48 × 10−8
1.80 × 10−7
1.22 × 10−4
5.92 × 10−5
9.48 × 10−5
1.23 × 10−4
6.24 × 10−6
1.15 × 10−4
8.95 × 10−5
Significant haplotypes identified by eLBL; CI does not include 1 and BF > 2
To make maximum use of the information in the limited samples available when only case–control or trio data are considered, we propose eLBL, an extension of the LBL methodology, so that both types of data can be analyzed jointly to increase statistical power. This new approach is further extended to adjust for familial correlations, leading to correct statistical inference using dependent data. Our 2-step analysis strategy was designed to increase statistical power. Indeed, by using all available information, MCPDT identified 3 genome-wide significant SNPs, which disappear when only observed data are used (results not shown). In turn, eLBL with shrinkage priors was able to recover haplotypes (many of them rare) that are associated with the ATP binary trait. The genes harboring the haplotype blocks studied all appear to be related to metabolic syndrome. The increase in power is clearly seen, as several of the associated haplotypes contain SNPs that do not reach genome-wide significance. To further substantiate the gain in power with the new eLBL approach, we performed an analysis with only independent cases and controls using the original LBL. The results, as expected, are sensitive to the selection of the independent samples, and miss many of the haplotypes identified in Table 2. Similarly, an analysis of 17 independent trios using famLBL reveals that the sample size is too small to obtain interpretable results. With an increase in power, the natural question is whether there is also an increase in Type I error, as eLBL is a new method and has not been studied thoroughly. To answer this question, we performed a limited simulation study in which data were simulated under a null model. To mimic the familial dependence structure and the linkage disequilibrium structure of the real data, we simulated our data using the GAW20 families and the inferred haplotypes with the estimated frequencies from block 8.
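A null simulation of this kind is commonly summarized by the fraction of replicates in which a haplotype is falsely declared significant. The Python sketch below shows one way to compute that empirical rate and a binomial standard error, separately for rare and common haplotypes. It is an illustration only: the function names, the `(frequency, flagged)` record layout, and the choice of 0.05 as the frequency cutoff are assumptions, and no simulated data from the study are reproduced.

```python
import math

def type1_error_rate(flags):
    """Fraction of null replicates flagged significant, with a binomial SE."""
    n = len(flags)
    if n == 0:
        raise ValueError("need at least one replicate")
    rate = sum(bool(f) for f in flags) / n
    se = math.sqrt(rate * (1.0 - rate) / n)
    return rate, se

def rates_by_frequency(records, cutoff=0.05):
    """Summarize (haplotype_frequency, flagged) records on each side of cutoff."""
    groups = {
        "rare": [flag for freq, flag in records if freq <= cutoff],
        "common": [flag for freq, flag in records if freq > cutoff],
    }
    # Skip empty groups so the rate is never computed from zero replicates.
    return {name: type1_error_rate(flags) for name, flags in groups.items() if flags}
```

An estimated rate well below the nominal level in the rare group would indicate a conservative procedure for rare haplotypes.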
Our results indicate that there is no elevated Type I error. On the contrary, eLBL is seen to be conservative for rare variants. Nevertheless, for haplotypes with frequencies greater than 0.05, the Type I error is as expected. To further substantiate the results from eLBL, we also analyzed the triglyceride level from visit 2 directly as a quantitative trait using a variation of LBL, again accounting for the familial structures in the data. For all 4 protein coding genes identified by eLBL (SRPT2, SLC37A2, STARD13, and ABBC1), the quantitative analysis also identified associated haplotypes in blocks contained within these genes. These results, together with the Type I error study and the annotations of the genes, affirm the results from eLBL, leading to our conjecture that the haplotypes identified are potentially either involved in the causal mechanism or playing a regulatory role in metabolic syndrome.
Publication of this article was supported by NIH R01 GM031575. This work was supported in part by NSF grant DMS-1208968.
Availability of data and materials
The data that support the findings of this study are available from the Genetic Analysis Workshop (GAW), but restrictions apply to the availability of these data, which were used under license for the current study. Qualified researchers may request these data directly from GAW.
About this supplement
This article has been published as part of BMC Proceedings Volume 12 Supplement 9, 2018: Genetic Analysis Workshop 20: envisioning the future of statistical genetics by exploring methods for epigenetic and pharmacogenomic data. The full contents of the supplement are available online at https://bmcproc.biomedcentral.com/articles/supplements/volume-12-supplement-9.
XZ, MW, and HZ implemented the algorithms and performed the data analyses. SL conceived the study and supervised the analyses and interpretation. XZ, MW, and SL wrote the manuscript. WS reviewed the manuscript and contributed to the discussion. All authors have read and approved the final manuscript.
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
1. Biswas S, Lin S. Logistic Bayesian lasso for identifying association with rare haplotypes and application to age-related macular degeneration. Biometrics. 2012;68(2):587–97.
2. Wang M, Lin S. Detecting associations of rare variants with common diseases: collapsing or haplotyping? Brief Bioinform. 2015;16(5):1–10.
3. Wang M, Lin S. FamLBL: detecting rare haplotype disease association based on common SNPs using case-parent triads. Bioinformatics. 2014;30(18):2611–8.
4. Varin C, Reid N, Firth D. An overview of composite likelihood methods. Stat Sin. 2011;21:5–42.
5. Ribatet M, Cooley D, Davison AC. Bayesian inference from composite likelihoods, with an application to spatial extremes. Stat Sin. 2012;22(2):813–45.
6. Ding J, Lin S, Liu Y. Monte Carlo pedigree disequilibrium test for markers on the X chromosome. Am J Hum Genet. 2006;79(3):567–73.
7. Barrett JC, Fry B, Maller J, Daly MJ. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005;21(2):263–5.