Two-stage family-based designs for sequencing studies
© Yang and Thomas; licensee BioMed Central Ltd. 2014
Published: 17 June 2014
The cost of next-generation sequencing is now approaching that of the first generation of genome-wide single-nucleotide genotyping panels, but this is still out of reach for large-scale epidemiologic studies with tens of thousands of subjects. Furthermore, the anticipated yield of millions of rare variants poses serious challenges for distinguishing causal from noncausal variants for disease. We explore the merits of using family-based designs for sequencing substudies to identify novel variants and prioritize them for their likelihood of causality. While the sharing of variants within families means that family-based designs may be less efficient for discovery than sequencing of a comparable number of unrelated individuals, the ability to exploit cosegregation of variants with disease within families helps distinguish causal from noncausal ones. We introduce a score test criterion for prioritizing discovered variants in terms of their likelihood of being functional. We compare the relative statistical efficiency of 2-stage versus1-stage family-based designs by application to the Genetic Analysis Workshop 18 simulated sequence data.
Most genome-wide association studies for discovering common variants associated with disease traits have been conducted using a case-control design with unrelated controls. Not only are unrelated individuals easier to identify and enroll than are entire families (particularly multiple-case families), but the statistical efficiency per subject genotyped is typically higher using unrelated controls than using unaffected siblings or other relatives . However, with the growing interest in rare variants and the availability of next-generation sequencing technology, there has been a resurgence of interest in using family-based designs [2–5]. Althoughfamily-based designs are less efficient for discovering novel variants than designs using unrelated individuals with the same total number of subjects, they may have other advantages that may outweigh this limitation. Specifically, by exploiting information about cosegregation with disease within families, they may be more efficient at prioritizing potentially causal variants from noncausal ones for subsequent testing for association with disease in larger samples. The ability to exploit mendelian inheritance may also substantially improve the imputation of rare variants in untested samples . Finally, family-based designs can exploit both between-family and within-family comparisons in various 2-stage designs for better power while being robust to bias from population stratification [7–11]. The net result could be improved power for the ultimate goal of discovering novel associations with disease. We focus here on the first of these advantages.
We consider a 2-stage design using individuals from a family study for discovery of novel variants and screening, followed by association testing in an independent data set using the linear regression of the phenotype on the genotype. A subset of members is selected for sequencing, preferentially sampling those with the most extreme phenotypes. We rank all the variants identified in these individuals using a novel score test and select those most likely to be causal for genotyping and association testing in the replication data set. These designs are compared with a single-stage family-based design in which all available members are sequenced .
Methods of prioritization
Bayes factor criterion. Following the principles described in Petersen et al , the probability that any particular variant is causal under a given genetic model can be estimated by accumulating likelihood ratio contributions (the ratio of the likelihoods of the data under the alternative hypothesis that a particular variant is causal to the likelihood under the null hypothesis that it is not causal) across families. The likelihood ratio requires a specific alternative hypothesis to be tested against the null hypothesis, and estimation of these parameters is likely to be highly unstable for rare variants. To avoid this, Petersen et al  compute a Bayes factor (BF) by averaging over a prior distribution of allele frequencies and relative risks. BFs are computed as the ratio of the conditional probabilities of the joint genotypes of the sampled individuals under the true model to that under the null.
where Y f is the vector of phenotypes for family f; μ f is the family-specific mean phenotype; Φ f is the kinship matrix, that is, the matrix of kinship coefficients; G fv is the vector of genotypes for variant v; and q v is its minor allele frequency. The G fv -q v deviationsare set to zero for untyped individuals, but the inclusion of the kinship terms for typed-untyped pairs allows their phenotypes to contribute. This statistic has mean zero under the null hypothesis and asymptotic variance var(T v ) = ∑ f t fv 2. Very similar tests have recently been described by Schifano et al  and Chen et al . For the purpose of prioritizing variants, it is sufficient to calculate the score test T v 2/var(T v ) for each variant and select the top-ranked ones at some cutoff. In other simulations, we have found this statistic to be highly correlated with the BF, to show nearly as good discrimination between causal and noncausal variants, and to be computationally much faster.
Application to Genetic Analysis Workshop 18 data
We compared the various design and analysis alternatives on a subset of the simulated Genetic Analysis Workshop 18 (GAW18) data. Based on the "answers" provided, we chose to focus on the MAP4 region of chromosome 3, which contains 15 functional variants having the strongest associations with both diastolic blood pressure (DBP) (6.5% of the total phenotypic variants) and systolic blood pressure (SBP) (7.8%). These 15 variants spanned a broad range of MAFs and effect sizes, individually accounting for from 2.8% to <0.3% of the variance. We selected all the variants in the region from 200 kilobases (kb) upstream to 100 kb downstream of the transcription start site (1151 variants in total). For comparison, we selected six 300-kb regions at random from those on chromosome 3 that harbored no functional variants for either trait and included all variants in these regions (6195 variants in total). For simplicity, we used the most likely genotypes for the imputed individuals, although the expected allele dosages would have been better. We also limited our initial exploration to 1000 variants (all 15 functional and 985 of the null variants).
For each trait, we preprocessed the phenotype data using a general linear mixed model to extract the intercept and slope coefficients for age and their variances for each individual, after adjustment for gender and current hypertension treatment, with random effects for family. These estimates were then treated as the phenotypes in the genetic analysis of each variant individually, using linear regression.
For the single-stage design, we analyzed the associations using all 959 individuals using the quantitative transmission disequilibrium test with mating-type means (QTDTM)  and tabulated the proportion of functional and nonfunctional variants that were associated at 0.05 significance after Bonferroni correction. We also tried the FBAT-rare procedure [18, 19], which is similar except that the test is performed at the nuclear family level rather than at the individual offspring level. Because of the small number of informative nuclear families, the variance estimator from this test was less stable. Because no residual within-family dependency was simulated, the QTDTM test is valid, so we present the results only from this test.
For the 2-stage design, we first selected 2, 4, or 6 members of each pedigree for whom sequencing data were available, excluding those in the subset of maximally unrelated individuals and the closely related full sib and parent-offspring pairs. We used a logistic function of the squared rank deviation from the family's median phenotype to select these members at random. We used the sequence data on only these sampled individuals (along with the complete phenotype data) to compute the score test for each variant. In the second stage, we tested the association of the top-ranked associations in the data set of unrelated individuals using linear regression of the phenotype on the genotype, with Bonferroni adjustment for only the prioritized variants. We varied the thresholds for prioritization from 0.5% to 16% of the top-ranked variants.
Because of the computational burden, we restricted these analyses to phenotype replicates 1 to 5 and analyzed each replicate using 20 random subsets of members' sequence data.
Mean score tests for the complete pedigrees.
Mean score test
Proportion of variants (%)
With only 15 causal variants (most of them with very small effects) in the subset of variants we analyzed, we cannot reliably compare the power of the various designs, although for these data, the 2-stage procedure did not seem to perform better than the 1-stage procedure. This may be partly a result of the small size of the replication sample of unrelated individuals (N = 157). As a test, we expanded the data set by combining the unrelated individuals from 5 replicates for the second step and the power for the joint test rose substantially (results not shown). Despite the lower power of the 2-stage approach, the cost is much lower because only a small number of individuals need to be fully sequenced and the genotyping required for replication is much cheaper. Because the costs of sequencing and targeted genotyping are rapidly changing and could be quite different for individual- and family-based designs, we have not addressed cost-efficiency. See  and  for discussion of optimization of sampling fractions for studies of independent individuals using individual and pooled sequencing under cost constraints.
We arbitrarily fixed the total number of variants to be prioritized at 100, but the general principles for design of 2-stage designs [21–24] could be applied to optimize the allocation of sample sizes across the 2 stages and the threshold for prioritizing variants, subject to cost constraints. As this threshold becomes more restrictive, fewer variants will be selected, lowering power for the first stage, but because the penalty for multiple testing will be less, power for the second stage will be improved; a similar tradeoff applies to sample size allocation between the 2 stages. As a preliminary exploration, we varied this threshold between 0.5% and 16% of variants, and the overall power increased monotonically with the threshold. Of course, the number of false positives also increased, but at a much lower rate, so that the false discovery rate dropped with increasing threshold and number of members sequenced. Still, the false discovery rate is very large, so that a further replication would be needed to weed out the false positives.
It is also possible that more appropriate adjustment for time-dependent treatment data would improve power for all these analyses (although it's unlikely to affect the relative performance of the 1-stage and 2-stage designs). Because treatment is itself related to blood pressure, it is both a confounder and an intermediate variable on a causal pathway, so neither ignoring it nor covariate adjustment is appropriate. This problem was extensively discussed at Genetic Analysis Workshop 13 ; see references [26–29] for discussion of several better approaches. As a rough test of this hypothesis, we reran the analysis using the simulated effect sizes for gender and treatment, and the results (not shown) were essentially unchanged.
Supported in part by NIH grants U19 CA148107, U01 HG005927, R01 ES019876, R21 ES020794, and P30 CA014089. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575. The authors are grateful to John Morrison for programming help with extraction of the selected data from the massive GAW18data sets. The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
- Witte JS, Gauderman WJ, Thomas DC: Asymptotic bias and efficiency in case-control studies of candidate genes and gene-environment interactions: basic family designs. Am J Epidemiol. 1999, 149: 693-705. 10.1093/oxfordjournals.aje.a009877.View ArticlePubMedGoogle Scholar
- Ionita-Laza I, Ottman R: Study designs for identification of rare disease variants in complex diseases:the utility of family-based designs. Genetics. 2011, 189: 1061-1068. 10.1534/genetics.111.131813.PubMed CentralView ArticlePubMedGoogle Scholar
- Shi G, Rao DC: Optimum designs for next-generation sequencing to discover rare variants for common complex disease. Genet Epidemiol. 2011, 35: 572-579.PubMed CentralView ArticlePubMedGoogle Scholar
- Feng T, Elston RC, Zhu X: Detecting rare and common variants for complex traits: sibpair and odds ratio weighted sum statistics (SPWSS, ORWSS). Genet Epidemiol. 2011, 35: 398-409. 10.1002/gepi.20588.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhu X, Feng T, Li Y, Lu Q, Elston RC: Detecting rare variants for complex traits using family and unrelated data. Genet Epidemiol. 2010, 34: 171-187. 10.1002/gepi.20449.PubMed CentralView ArticlePubMedGoogle Scholar
- Li Y, Willer C, Sanna S, Abecasis G: Genotype imputation. Annu Rev Genomics Hum Genet. 2009, 10: 387-406. 10.1146/annurev.genom.9.081307.164242.PubMed CentralView ArticlePubMedGoogle Scholar
- Murphy A, Weiss ST, Lange C: Screening and replication using the same data set: testing strategies for family-based studies in which all probands are affected. PLoS Genet. 2008, 4: e1000197-10.1371/journal.pgen.1000197.PubMed CentralView ArticlePubMedGoogle Scholar
- Van Steen K, McQueen MB, Herbert A, Raby B, Lyon H, Demeo DL, Murphy A, Su J, Datta S, Rosenow C, et al: Genomic screening and replication using the same data set in family-based association testing. Nat Genet. 2005, 37: 683-691. 10.1038/ng1582.View ArticlePubMedGoogle Scholar
- Feng T, Zhang S, Sha Q: Two-stage association tests for genome-wide association studies based on family data with arbitrary family structure. Eur J Hum Genet. 2007, 15: 1169-1175. 10.1038/sj.ejhg.5201902.View ArticlePubMedGoogle Scholar
- Lange C, DeMeo D, Silverman EK, Weiss ST, Laird NM: Using the noninformative families in family-based association tests: a powerful new testing strategy. Am J Hum Genet. 2003, 73: 801-811. 10.1086/378591.PubMed CentralView ArticlePubMedGoogle Scholar
- Wason JM, Dudbridge F: A general framework for two-stage analysis of genome-wide association studies and its application to case-control studies. Am J Hum Genet. 2012, 90: 760-773. 10.1016/j.ajhg.2012.03.007.PubMed CentralView ArticlePubMedGoogle Scholar
- Yang F, Thomas DC: Two-stage design of sequencing studies for testing association with rare variants. Hum Hered. 2011, 71: 209-220. 10.1159/000328193.PubMed CentralView ArticlePubMedGoogle Scholar
- Petersen GM, Parmigiani G, Thomas D: Missense mutations in disease genes: a Bayesian approach to evaluate causality. Am J Hum Genet. 1998, 62: 1516-1524. 10.1086/301871.PubMed CentralView ArticlePubMedGoogle Scholar
- Ionita-Laza I, Makarov V, Yoon S, Raby B, Buxbaum J, Nicolae DL, Lin X: Finding disease variants in mendelian disorders by using sequence data: methods and applications. Am J Hum Genet. 2011, 89: 701-712. 10.1016/j.ajhg.2011.11.003.PubMed CentralView ArticlePubMedGoogle Scholar
- Schifano ED, Epstein MP, Bielak LF, Jhun MA, Kardia SLR, Peyser PA, Lin X: SNP set association analysis for familial data. Genet Epidemiol. 2012, 36: 797-810.PubMed CentralPubMedGoogle Scholar
- Chen H, Meigs JB, Dupuis J: sequence kernel association test for quantitativetraits in family samples. Genet Epidemiol. 2013, 37: 196-204. 10.1002/gepi.21703.PubMed CentralView ArticlePubMedGoogle Scholar
- Gauderman WJ: Candidate gene association analysis for a quantitative trait, using parent-offspring trios. Genet Epidemiol. 2003, 25: 327-338. 10.1002/gepi.10262.View ArticlePubMedGoogle Scholar
- Yip W, De G, Raby BA, Laird N: Identifying causal rare variants of disease through family-based analysis of Genetics Analysis Workshop 17 data set. BMC Proc. 2011, 5 (Suppl 9): S21-10.1186/1753-6561-5-S9-S21.PubMed CentralView ArticlePubMedGoogle Scholar
- De G, Yip W, Ionita-Laza I, Laird N: Rare variant analysis for family-based design. PloS One. 2013, 8: e48495-10.1371/journal.pone.0048495.PubMed CentralView ArticlePubMedGoogle Scholar
- Liang W, Thomas DC, Conti DV: Analysis and optimal design for association studies using next-generation sequencing with case-control pools. Genet Epidemiol:doi: 10.1002.gepi.21681. epub ahead of print September 12, 2012.Google Scholar
- Skol AD, Scott LJ, Abecasis GR, Boehnke M: Optimal designs for two-stage genome-wide association studies. Genet Epidemiol. 2007, 31: 776-788. 10.1002/gepi.20240.View ArticlePubMedGoogle Scholar
- Wang H, Thomas DC, Pe'er I, Stram DO: Optimal two-stage genotyping designs for genome-wide association scans. Genet Epidemiol. 2006, 30: 356-368. 10.1002/gepi.20150.View ArticlePubMedGoogle Scholar
- Thomas DC, Casey G, Conti DV, Haile RW, Lewinger JP, Stram DO: Methodological issues in multistage genome-wide association studies. Stat Sci. 2009, 24: 414-429. 10.1214/09-STS288.PubMed CentralView ArticlePubMedGoogle Scholar
- Satagopan JM, Elston RC: Optimal two-stage genotyping in population-based association studies. Genet Epidemiol. 2003, 25: 149-157. 10.1002/gepi.10260.View ArticlePubMedGoogle Scholar
- Almasy L, Cupples LA, Daw EW, Levy D, Thomas D, Rice JP, Santangelo S, MacCluer JW: Genetic Analysis Workshop 13: introduction to workshop summaries. Genet Epidemiol. 2003, 25 (Suppl 1): S1-S4.View ArticlePubMedGoogle Scholar
- Levy D, DeStefano AL, Larson MG, O'Donnell CJ, Lifton RP, Gavras H, Cupples LA, Myers RH: Evidence for a gene influencing blood pressure on chromosome 17. Genome scan linkage results for longitudinal blood pressure phenotypes in subjects from the Framingham heart study. Hypertension. 2000, 36: 477-483. 10.1161/01.HYP.36.4.477.View ArticlePubMedGoogle Scholar
- Gauderman WJ, Macgregor S, Briollais L, Scurrah K, Tobin M, Park T, Wang D, Rao S, John S, Bull S: Longitudinal data analysis in pedigree studies. Genet Epidemiol. 2003, 25 (Suppl 1): S18-S28.View ArticlePubMedGoogle Scholar
- Bickeböller H, Barrett JH, Jacobs KB, Rosenberger A: Modeling and dissection of longitudinal blood pressure and hypertension phenotypes in genetic epidemiological studies. Genet Epidemiol. 2003, 25 (Suppl 1): S72-S77.View ArticlePubMedGoogle Scholar
- Cui JS, Hopper JL, Harrap SB: Antihypertensive treatments obscure familial contributions to blood pressure variation. Hypertension. 2003, 41: 207-210. 10.1161/01.HYP.0000044938.94050.E3.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.