Prioritization of family member sequencing for the detection of rare variants
© The Author(s). 2016
Published: 18 October 2016
The advent of affordable sequencing has enabled researchers to discover many variants contributing to disease, including rare variants. There are methods for determining the most informative individuals for sequencing, but the application of these methods is more complex when working with families. Sets of large families can be beneficial in finding rare variants, but it may be unfeasible to sequence all members of these family sets.
Using simulated data from the Genetic Analysis Workshop 19, we applied multiple regression to identify cases and controls. To find the best controls for each case, we used kinship coefficients to match within families. Selected cases and controls were analyzed for rare variants, collapsed by gene, associated with hypertension using the family-based rare variant association test (FARVAT).
The gene with the strongest simulated effect, MAP4, did not meet the Bonferroni corrected significance threshold. However, analysis of cases and controls using our selection method substantially improved the significance of MAP4, despite the reduction in sample size.
Taking the additional steps to select the optimal cases and controls from large family data sets can help ensure that only informative individuals are included in analysis and may improve the ability to detect rare variants.
Whole-genome sequencing (WGS) is an important tool in the discovery of rare variants that influence disease. Family-based association studies have likewise been crucial in the fine-mapping of genetic variants contributing to complex disease. Decreased sequencing costs have made it increasingly feasible to sequence large families or even large sets of families, but WGS remains too expensive for most studies. To address this, a subset of family members may be selected for WGS, but it can be difficult to determine which configuration of family members will have the greatest power to detect rare variants. Extreme phenotyping is an approach that compares individuals at opposite ends of the phenotypic spectrum, on the premise that rare causal variants are enriched at the extremes of complex traits [1–3]. The appeal of this approach is its cost-effectiveness; however, the decrease in cost relies on the ability to cheaply phenotype many more patients than will be sequenced. A drawback is the decreased sample size, which can result in loss of power. We modified the process of extreme phenotyping and combined it with family-based selection to make the best use of the data. We defined our cases and controls as individuals with extremes of unexplained variation in systolic blood pressure (SBP) after adjusting for covariates in a regression analysis; these individuals are most likely to have a genetic component explaining their SBP [1, 4]. As a second step, we used kinship coefficients to eliminate those individuals who are least likely to contribute useful genetic information to the analysis because they are either too closely related (e.g., parent–child) or unrelated.
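The idea of selecting individuals with extremes of unexplained variation can be illustrated with a minimal sketch. It fits an ordinary least squares regression of SBP on covariates and flags individuals whose standardized residuals are large. The function names, the toy data, the single covariate (age), and the cutoff of one standard deviation are all assumptions made for illustration; they are not taken from the analysis described here.

```python
from typing import List, Sequence, Tuple


def ols_residuals(X: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Fit y = b0 + X.b by ordinary least squares and return the residuals."""
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend an intercept
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda k: abs(A[k][c]))
        if abs(A[piv][c]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        A[c], A[piv] = A[piv], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for k in range(p):
            if k != c:
                f = A[k][c]
                A[k] = [vk - f * vc for vk, vc in zip(A[k], A[c])]
    beta = [A[i][p] for i in range(p)]
    return [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]


def select_extremes(
    ids: Sequence[str], residuals: Sequence[float], z_cut: float = 1.0
) -> Tuple[List[str], List[str]]:
    """Return (cases, controls): standardized residual >= z_cut or <= -z_cut.

    The cutoff z_cut is a hypothetical choice, not the authors' threshold.
    """
    n = len(residuals)
    mean = sum(residuals) / n
    sd = (sum((r - mean) ** 2 for r in residuals) / (n - 1)) ** 0.5
    z = [(r - mean) / sd for r in residuals]
    cases = [i for i, zi in zip(ids, z) if zi >= z_cut]
    controls = [i for i, zi in zip(ids, z) if zi <= -z_cut]
    return cases, controls


# Toy data (invented): SBP rises with age, but p2 is much higher and p4 much
# lower than their age alone would predict.
ids = ["p1", "p2", "p3", "p4", "p5", "p6"]
ages = [30, 40, 50, 60, 70, 80]
sbp = [115.0, 140.0, 125.0, 110.0, 135.0, 140.0]

residuals = ols_residuals([[a] for a in ages], sbp)
cases, controls = select_extremes(ids, residuals)
```

Note that p6 has the same raw SBP as p2, but only p2 is flagged as a case, because p6's blood pressure is largely explained by age. This is the distinction between raw extremes and extremes of unexplained variation.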
Table: Descriptive characteristics of base population, potential cases and controls, and selected cases and controls
Columns: Base cases (n = 261); Base controls (n = 458); Potential cases (n = 170); Potential controls (n = 277); Selected cases (n = 128); Selected controls (n = 188); Selected cases vs. controls
Rows: SBP (mm Hg); DBP (mm Hg)
Extremes of unexplained variation
Prioritization of subjects
Quality control of sequencing data
In addition to the quality control (QC) performed by the organizers of Genetic Analysis Workshop prior to release, further QC steps were taken using VCFtools version 0.1.12a for chromosome 3, which initially included 1,757,452 sites among 464 sequenced individuals. No individuals were missing more than 10 % of calls, and thus, none were removed. Sites with a call rate of less than 95 % were removed (210,954 sites), leaving 1,546,498 sites. Sites that were out of Hardy-Weinberg equilibrium within the founders (n = 91) were also removed, using a Bonferroni-corrected p value cutoff of 3.2 × 10^−8 (0.05/1,546,498); this removed 6903 sites and left a total of 1,539,595 sites. Sites that did not pass QC were then removed from the data set of imputed genotypes, which included 959 subjects (both the 464 sequenced individuals and those whose genotypes were imputed using the sequenced subjects as input). This data set contained 1,215,399 imputed sites, of which 87,555 were removed as a result of the aforementioned QC process, leaving 1,127,844 sites for analysis.
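The site counts and the Bonferroni-corrected Hardy-Weinberg threshold quoted above can be checked with a few lines of arithmetic. The variable names are ours; the numbers are those reported in the text.

```python
# Sequenced data, chromosome 3.
initial_sites = 1_757_452
removed_low_call_rate = 210_954   # call rate < 95 %
removed_hwe = 6_903               # out of Hardy-Weinberg equilibrium in founders

sites_after_call_rate = initial_sites - removed_low_call_rate
# Bonferroni correction over the sites that were tested for HWE.
hwe_threshold = 0.05 / sites_after_call_rate
sites_after_hwe = sites_after_call_rate - removed_hwe

# Imputed data set (959 subjects).
imputed_sites = 1_215_399
imputed_removed = 87_555
imputed_sites_after_qc = imputed_sites - imputed_removed

print(f"sites tested for HWE: {sites_after_call_rate:,}")
print(f"HWE p value threshold: {hwe_threshold:.1e}")
print(f"sequenced sites after QC: {sites_after_hwe:,}")
print(f"imputed sites after QC: {imputed_sites_after_qc:,}")
```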
Gene-based annotation was performed with the sites remaining after QC using ANNOVAR (Annotate Variation) and the human genome RefSeq database based on hg19. Sites in intragenic regions or outside of a gene were mapped to the closest gene. Sites further than 5 kbp from a gene were excluded, because the simulation model selected causal variants within this range. This excluded 560,882 sites and left 566,962 sites.
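The closest-gene rule with a 5 kbp limit can be sketched as follows. This is only an illustration of the rule as described above, not ANNOVAR's implementation; the gene names and coordinates are hypothetical, and we assume that a site exactly 5 kbp from a gene is retained.

```python
from typing import List, Optional, Tuple

# (name, start, end) on a single chromosome; coordinates are inclusive.
Gene = Tuple[str, int, int]

MAX_DISTANCE_BP = 5_000  # sites further than this from any gene are excluded


def distance_to_gene(pos: int, gene: Gene) -> int:
    """Distance in bp from a site to a gene; 0 if the site lies inside it."""
    _, start, end = gene
    if pos < start:
        return start - pos
    if pos > end:
        return pos - end
    return 0


def assign_gene(
    pos: int, genes: List[Gene], max_distance: int = MAX_DISTANCE_BP
) -> Optional[str]:
    """Map a site to its closest gene, or None if it is out of range."""
    if not genes:
        return None
    nearest = min(genes, key=lambda g: distance_to_gene(pos, g))
    if distance_to_gene(pos, nearest) > max_distance:
        return None
    return nearest[0]


# Hypothetical gene models, for illustration only.
genes = [("GENE_A", 10_000, 20_000), ("GENE_B", 40_000, 50_000)]

inside = assign_gene(15_000, genes)      # within GENE_A
flanking = assign_gene(24_000, genes)    # 4 kbp downstream of GENE_A
excluded = assign_gene(30_000, genes)    # 10 kbp from either gene
```

For a whole chromosome, a sorted gene list with binary search would avoid scanning every gene for every site; the linear scan is kept here for clarity.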
Sequencing data from chromosome 3 for each set of cases and controls (base, potential, and selected) were analyzed using FARVAT. FARVAT allows for the use of a dichotomous outcome and takes little computational time. FARVAT provides burden-, variance component–, and SKAT-O–type tests, and additionally provides the Pedigree Combined Multivariate and Collapsing (PedCMC) and collapsing-based tests. We used the variance component–type test because it performs well for genes whose functional rare variants have effects in opposite directions, as is likely to be the case for most genes. Users have the option to specify an offset to improve statistical efficiency. We chose the disease prevalence-based offset, using the hypertension prevalence of 0.26 among Hispanic adults as reported by the National Health and Nutrition Examination Survey (NHANES). In addition, age and sex were included as covariates.
Table: Analysis of genes associated with hypertension in simulated data
Columns: Base cases & controls (n = 719); Potential cases & controls (n = 447); Selected cases & controls (n = 316)
The potential power of family data is appealing for the discovery of rare variants that contribute to complex disease. Family data sets can contain hundreds or thousands of individuals, and WGS may not be feasible for every individual in every family. Frequently, researchers will select some family members for sequencing and then impute sequencing data for the remaining family members using existing genome-wide SNP data. However, this can still be costly, and the accuracy of imputation varies depending on the approach used [14, 15]. As an alternative, it is possible to limit analyses to fewer family members yet bypass imputation.
Careful selection of cases and controls is key to narrowing the potential candidates for sequencing. Our multistep approach can be applied to any outcome and allows elimination of those individuals who are least likely to have a genetic component to their outcome, with further elimination of those individuals who will be genetically uninformative to a rare variant association analysis. Multiple factors contribute to complex disease, and it may be important to consider all of these factors in the effort to find genetic determinants. By using multiple regression, we were able to take several covariates into consideration; each of these covariate phenotypes is easily and inexpensively obtained. The inclusion of these covariates allowed us to focus our attention on those individuals with unexplained and, likely, genetic hypertension. Through this approach, cases and controls were not simply defined as those with the highest and lowest blood pressures, respectively, but rather as those with blood pressure that is higher or lower than expected given their age, sex, smoking habits, and blood pressure medication usage. The use of theoretical kinship coefficients ensured that only genetically informative individuals were included in the analysis.
As with any selection process, the sample size decreased as the requirements for inclusion became more stringent. While this decrease reduces costs, loss of power from decreased sample size is a serious concern. In addition, the combination of multiple phenotypic components into a case definition forces the use of a dichotomous outcome during analysis, which generally results in a loss of power. However, we found that the signal for MAP4, the gene with the strongest simulated effect on SBP, improved with each step of the selection process, indicating that our selection process overcame the loss of power caused by the decrease in sample size and the dichotomization of a quantitative trait.
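The kinship-based filtering step discussed above can be sketched as follows. Theoretical kinship coefficients are computed recursively from the pedigree, and candidate controls are kept only if they are related to the case but not too closely. The pedigree, the function names, and the window of 0 < kinship < 0.25 (which excludes unrelated individuals as well as parent–child and full-sibling pairs) are assumptions for illustration; the exact thresholds used in the analysis are not stated here. The sketch also assumes that each individual has either both parents or neither recorded, and that founders are unrelated.

```python
from functools import lru_cache
from typing import Callable, Dict, List, Optional, Tuple

# id -> (father, mother); founders are recorded as (None, None).
Pedigree = Dict[str, Tuple[Optional[str], Optional[str]]]


def make_kinship(ped: Pedigree) -> Callable[[str, str], float]:
    """Return a function phi(i, j) giving the theoretical kinship coefficient."""

    @lru_cache(maxsize=None)
    def depth(i: str) -> int:
        father, mother = ped[i]
        return 0 if father is None else 1 + max(depth(father), depth(mother))

    @lru_cache(maxsize=None)
    def phi(i: str, j: str) -> float:
        father, mother = ped[i]
        if i == j:
            # Self-kinship: 1/2 for founders, raised by parental relatedness.
            return 0.5 if father is None else 0.5 * (1.0 + phi(father, mother))
        if depth(i) < depth(j):
            return phi(j, i)  # always expand the individual deeper in the pedigree
        if father is None:
            return 0.0  # two distinct founders are assumed unrelated
        return 0.5 * (phi(father, j) + phi(mother, j))

    return phi


def informative_controls(
    case: str,
    candidates: List[str],
    phi: Callable[[str, str], float],
    lower: float = 0.0,
    upper: float = 0.25,
) -> List[str]:
    """Keep candidates with lower < kinship < upper (hypothetical window)."""
    return [c for c in candidates if lower < phi(case, c) < upper]


# Hypothetical three-generation pedigree.
ped: Pedigree = {
    "F": (None, None), "M": (None, None),   # founding couple
    "S": (None, None), "T": (None, None),   # married-in founders
    "C1": ("F", "M"), "C2": ("F", "M"),     # full siblings
    "G": ("S", "C1"), "H": ("T", "C2"),     # first cousins
}
phi = make_kinship(ped)
kept = informative_controls("G", ["C1", "C2", "H", "T", "F"], phi)
```

In this toy pedigree, the case G keeps an uncle or aunt (C2), a first cousin (H), and a grandparent (F) as candidate controls, while the parent C1 and the unrelated founder T are dropped.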
Family data can be useful for the detection of rare variants, but must be carefully analyzed. There are options to prioritize the selection of cases and controls for sequencing and analysis. Careful case definitions, combined with information on family structure, can help ensure that only the most informative individuals are chosen for sequencing. This can help keep costs low and, potentially, improve the ability to detect rare variants. However, loss of power is a real concern, meaning the selection process may only yield meaningful results if there is a large base population from which to select.
We would like to thank Jace Otting for his assistance with data programming. This research was supported in part by a core grant to the Center for Demography and Ecology at the University of Wisconsin-Madison (P2C HD047873).
This article has been published as part of BMC Proceedings Volume 10 Supplement 7, 2016: Genetic Analysis Workshop 19: Sequence, Blood Pressure and Expression Data. Summary articles. The full contents of the supplement are available online at http://bmcproc.biomedcentral.com/articles/supplements/volume-10-supplement-7. Publication of the proceedings of Genetic Analysis Workshop 19 was supported by National Institutes of Health grant R01 GM031575.
RS completed analysis and writing. JMK assisted with writing and data summaries. BFD cleaned and prepared sequencing data. CDE supervised all efforts. All authors read and approved the final manuscript.
The authors declare they have no competing interests.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Lanktree MB, Hegele RA, Schork NJ, Spence JD. Extremes of unexplained variation as a phenotype: an efficient approach for genome-wide association studies of cardiovascular disease. Circ Cardiovasc Genet. 2010;3(2):215–21.
- Lee S, Abecasis GR, Boehnke M, Lin X. Rare-variant association analysis: study designs and statistical tests. Am J Hum Genet. 2014;95(1):5–23.
- Li D, Lewinger JP, Gauderman WJ, Murcray CE, Conti D. Using extreme phenotype sampling to identify the rare causal variants of quantitative traits in association studies. Genet Epidemiol. 2011;35(8):790–9.
- Spence JD, Barnett PA, Bulman DE, Hegele RA. An approach to ascertain probands with a non-traditional risk factor for carotid atherosclerosis. Atherosclerosis. 1999;144(2):429–34.
- Blangero J, Teslovich TM, Sim X, Almeida MA, Jun G, Dyer TD, Johnson M, Peralta JM, Manning AK, Wood AR, et al. Omics squared: human genomic, transcriptomic, and phenotypic data for Genetic Analysis Workshop 19. BMC Proc. 2015;9(8):S2.
- Choi S, Lee S, Cichon S, Nothen MM, Lange C, Park T, Won S. FARVAT: a family-based rare variant association test. Bioinformatics. 2014;30(22):3197–205.
- Yan T, Yang YN, Cheng X, DeAngelis MM, Hoh J, Zhang H. Genotypic association analysis using discordant-relative-pairs. Ann Hum Genet. 2009;73(1):84–94.
- Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, et al. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156–8.
- Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38(16):e164.
- Jiang D, McPeek MS. Robust rare variant association testing for quantitative traits in samples with related individuals. Genet Epidemiol. 2014;38(1):10–20.
- Zhu Y, Xiong M. Family-based association studies for next-generation sequencing. Am J Hum Genet. 2012;90(6):1028–45.
- Morris AP, Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol. 2010;34(2):188–93.
- Nwankwo T, Yoon SS, Burt V, Gu Q. Hypertension among adults in the United States: National Health and Nutrition Examination Survey, 2011–2012. NCHS Data Brief. 2013;133:1–8.
- Song S, Shields R, Li X, Li J. Joint analysis of sequence data and single-nucleotide polymorphism data using pedigree information for imputation and recombination inference. BMC Proc. 2014;8 Suppl 1:S20.
- Hinrichs AL, Culverhouse RC, Suarez BK. Genotypic discrepancies arising from imputation. BMC Proc. 2014;8 Suppl 1:S17.