Prioritization of family member sequencing for the detection of rare variants

Abstract

Background

The advent of affordable sequencing has enabled researchers to discover many variants contributing to disease, including rare variants. There are methods for determining the most informative individuals for sequencing, but the application of these methods is more complex when working with families. Sets of large families can be beneficial in finding rare variants, but it may be unfeasible to sequence all members of these family sets.

Methods

Using simulated data from the Genetic Analysis Workshop 19, we applied multiple regression to identify cases and controls. To find the best controls for each case, we used kinship coefficients to match within families. Selected cases and controls were analyzed for rare variants, collapsed by gene, associated with hypertension using the family-based rare variant association test (FARVAT).

Results

The gene with the strongest simulated effect, MAP4, did not meet the Bonferroni corrected significance threshold. However, analysis of cases and controls using our selection method substantially improved the significance of MAP4, despite the reduction in sample size.

Conclusions

Taking the additional steps to select the optimal cases and controls from large family data sets can help ensure that only informative individuals are included in analysis and may improve the ability to detect rare variants.

Background

Whole-genome sequencing (WGS) is an important tool in the discovery of rare variants that influence disease. Family-based association studies have likewise been crucial in the fine-mapping of genetic variants contributing to complex disease. Decreased sequencing costs have made it increasingly feasible to sequence large families or even large sets of families, but WGS remains too expensive for most studies. To address this, a subset of family members may be selected for WGS, but it can be difficult to determine which configuration of family members will have the greatest power to detect rare variants. Extreme phenotyping is an approach that compares individuals at opposite ends of the phenotypic spectrum, on the premise that rare causal variants will be enriched in the extremes of complex traits [1–3]. The appeal of this approach is its cost-effectiveness; however, the decrease in cost relies on the ability to cheaply phenotype many more patients than will be sequenced [3]. A drawback is the decreased sample size, which can result in loss of power. We modified the process of extreme phenotyping and combined it with family-based selection to make the best use of the data. We defined our cases and controls as individuals with extremes of unexplained variation in systolic blood pressure (SBP) after adjusting for covariates in a regression analysis; these individuals are most likely to have a genetic component explaining their SBP [1, 4]. As a second step, we used kinship coefficients to eliminate those individuals who are least likely to contribute useful genetic information to the analysis because they are either too closely related (eg, parent–child) or unrelated.

Methods

Study population

We analyzed replicate 1 of the simulated data set from the Genetic Analysis Workshop 19 (GAW19) T2D-GENES Project 2, a family data set with WGS data [5], with knowledge of the simulation model. The provided data set consisted of family-based WGS data and simulated phenotypes for diastolic blood pressure (DBP) and SBP. Covariates included sex, age, hypertensive status, antihypertensive medication use, and smoking status. Prior to modeling, families without sequencing data available for any family member were omitted. The remaining sample consisted of 261 individuals with hypertension and 458 individuals without (base cases and controls; Table 1).

Table 1 Descriptive characteristics of base population, potential cases and controls, and selected cases and controls

Extremes of unexplained variation

To define cases and controls, we modified an approach that selects participants with variation in their phenotype that is unexplained by known nongenetic risk factors, and thus are most likely to have a genetic component [1, 4]. Using SBP as the outcome, we used multiple regression to adjust for the following nongenetic variables that affect SBP: age, sex, smoking status, and antihypertensive medication use. The original data were longitudinal; for subjects with hypertension, the first year with this diagnosis was used in the model. If the year used had missing data and the next year had more complete data, that next year was used. For those without hypertension, the year with the most complete data was used. Subjects with hypertension who were above the regression line were those with unexplained high SBP and were selected as potential cases (n = 170; see Table 1). These cases are identified in red in Fig. 1. Subjects without hypertension who were below the regression line were those with unexplained low SBP and were selected as potential controls (n = 277; see Table 1). These potential controls are identified in blue in Fig. 1.
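The residual-based selection described above can be illustrated with a minimal sketch. It assumes one record per subject (the choice of year from the longitudinal data is not modeled), and the `Subject` fields, the function names, and the toy values are hypothetical, not taken from the GAW19 data.

```python
from collections import namedtuple

# Hypothetical record layout; binary covariates are coded 0/1.
Subject = namedtuple("Subject", "sid sbp age male smoker on_meds hypertensive")


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Ordinary least squares via the normal equations X'X beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def residuals(subjects):
    """Observed SBP minus SBP predicted from age, sex, smoking, and medication."""
    rows = [[1.0, s.age, s.male, s.smoker, s.on_meds] for s in subjects]
    beta = fit_ols(rows, [s.sbp for s in subjects])
    return {
        s.sid: s.sbp - sum(b * x for b, x in zip(beta, r))
        for s, r in zip(subjects, rows)
    }


def select_potential(subjects):
    """Hypertensive subjects above the line are potential cases;
    nonhypertensive subjects below the line are potential controls."""
    res = residuals(subjects)
    cases = [s.sid for s in subjects if s.hypertensive and res[s.sid] > 0]
    controls = [s.sid for s in subjects if not s.hypertensive and res[s.sid] < 0]
    return cases, controls


# Toy data for illustration only.
toy_subjects = [
    Subject("A", 150, 40, 1, 0, 0, True),
    Subject("B", 128, 55, 0, 1, 1, False),
    Subject("C", 162, 63, 1, 1, 0, True),
    Subject("D", 110, 35, 0, 0, 0, False),
    Subject("E", 145, 48, 1, 0, 1, True),
    Subject("F", 135, 70, 0, 1, 0, False),
    Subject("G", 170, 52, 1, 1, 1, True),
    Subject("H", 118, 45, 0, 0, 1, False),
]
potential_cases, potential_controls = select_potential(toy_subjects)
```

Hypertensive subjects below the line and nonhypertensive subjects above it fall into neither group, which is why the potential sets are smaller than the base population.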

Fig. 1

Modeling for selection of cases and controls. The base population used in modeling (n = 719) was plotted by observed systolic blood pressure (SBP) against expected SBP as predicted by multiple regression. Subjects in red are hypertensive with observed SBP above the expected value (the regression line) and were designated as potential cases (n = 170). Subjects in blue are nonhypertensive with observed SBP below the expected value and were designated as potential controls (n = 277)

Prioritization of subjects

The process for control selection is outlined in Fig. 2. Modeling resulted in several controls being available for each case; however, the familial relationship between these potential controls and cases had not yet been taken into consideration. Family structure was determined by kinship coefficients calculated with the family-based rare variant association test (FARVAT) using pedigree data [6]. Controls who were unrelated to any case were excluded, as they were genetically uninformative. In addition, parent–child pairs may be less powerful in association analyses as a result of overmatching [7], so controls who were parents of cases were excluded. Only nonparent controls who were related to cases (ie, with a nonzero kinship coefficient) were included in the analysis, and any cases without a related control were excluded. As a result, some cases were matched to multiple controls, whereas others were matched to a single control.
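The kinship-based filtering can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: kinship coefficients are assumed to be precomputed and stored per pair, a control who is a parent of any case is dropped entirely, and the identifiers and function name are hypothetical.

```python
def match_controls(cases, controls, kinship, parents):
    """Return {case: [related nonparent controls]}.

    kinship: dict mapping frozenset({id_a, id_b}) to a kinship coefficient
             (missing pairs are treated as unrelated, coefficient 0).
    parents: dict mapping a subject id to the set of that subject's parents.
    """
    # Assumption: a control who is a parent of any case is excluded outright.
    parent_ids = set()
    for case in cases:
        parent_ids |= parents.get(case, set())
    eligible = [c for c in controls if c not in parent_ids]

    matches = {}
    for case in cases:
        related = [
            c for c in eligible
            if kinship.get(frozenset((case, c)), 0.0) > 0
        ]
        if related:  # cases without any related control are dropped
            matches[case] = related
    return matches


# Toy pedigree: case1 has a parent, a sibling, and a cousin among the
# potential controls; case2 has no relatives among them.
toy_kinship = {
    frozenset(("case1", "parent1")): 0.25,
    frozenset(("case1", "sibling1")): 0.25,
    frozenset(("case1", "cousin1")): 0.0625,
}
toy_parents = {"case1": {"parent1", "parent2"}}
selected = match_controls(
    ["case1", "case2"],
    ["parent1", "sibling1", "cousin1", "unrelated1"],
    toy_kinship,
    toy_parents,
)
```

In this toy pedigree only `case1` survives, matched to its sibling and cousin; the parent, the unrelated control, and `case2` are all dropped.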

Fig. 2

Selection of cases and controls. Multistep process using modeling to choose potential cases and controls, and kinship coefficients to select cases and controls

Quality control of sequencing data

In addition to the quality control (QC) performed by the organizers of Genetic Analysis Workshop prior to release [5], further QC steps were taken using VCFtools version 0.1.12a [8] for chromosome 3, which initially included 1,757,452 sites among 464 sequenced individuals. No individuals were missing more than 10 % of calls, and thus, none were removed. Sites with a call rate of less than 95 % were removed (210,954 sites), as were sites that were out of Hardy-Weinberg equilibrium within the founders (6903 sites removed using n = 91 founders) at a p value cutoff of less than 3.2 × 10⁻⁸ (Bonferroni corrected: 0.05/1,546,498 = 3.2 × 10⁻⁸), leaving a total of 1,539,595 sites. Sites that did not pass QC were then removed from the data set of imputed genotypes that included 959 subjects (both sequenced individuals and those with imputed genotypes using the 464 sequenced subjects as input for the imputation). This data set contained 1,215,399 imputed sites, of which 87,555 sites were removed as a result of the aforementioned QC process, leaving 1,127,844 sites for analysis.
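The site counts and the Hardy-Weinberg cutoff reported above can be checked with a few lines of arithmetic. The variable names and the `passes_site_qc` helper are illustrative only; the actual filtering was done with VCFtools.

```python
# Site bookkeeping for chromosome 3, using the counts reported in the text.
initial_sites = 1_757_452
removed_low_call_rate = 210_954   # call rate below 95 %
removed_hwe = 6_903               # out of Hardy-Weinberg equilibrium in founders

after_call_rate = initial_sites - removed_low_call_rate
hwe_cutoff = 0.05 / after_call_rate   # Bonferroni over the remaining sites
after_hwe = after_call_rate - removed_hwe

imputed_sites = 1_215_399
removed_from_imputed = 87_555
analysis_sites = imputed_sites - removed_from_imputed


def passes_site_qc(call_rate, hwe_p, cutoff=hwe_cutoff):
    """Hypothetical per-site predicate mirroring the two filters above."""
    return call_rate >= 0.95 and hwe_p >= cutoff


print(after_call_rate, f"{hwe_cutoff:.1e}", after_hwe, analysis_sites)
```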

Annotations

Gene-based annotation was performed with the sites remaining after QC using ANNOVAR (Annotate Variation) [9] and the human genome RefSeq database based on hg19. Sites in intergenic regions were mapped to the closest gene. Those that were further than 5 kbp from a gene were excluded, as the simulation model selected causal variants that were within this range, which left 566,962 sites (560,882 out of range).
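The nearest-gene mapping with a 5 kbp limit can be sketched as below. This is not ANNOVAR's implementation: the gene names and coordinates are made up, genes are treated as simple intervals on one chromosome, and a site exactly 5000 bp away is assumed to be kept.

```python
def nearest_gene(pos, genes):
    """Return (gene_name, distance_bp) for the gene closest to pos.

    genes: list of (name, start, end) intervals; distance is 0 inside a gene.
    """
    best = None
    for name, start, end in genes:
        if start <= pos <= end:
            dist = 0
        elif pos < start:
            dist = start - pos
        else:
            dist = pos - end
        if best is None or dist < best[1]:
            best = (name, dist)
    return best


def assign_sites(sites, genes, max_dist=5000):
    """Map each site to its closest gene, dropping sites beyond max_dist."""
    kept = {}
    for pos in sites:
        hit = nearest_gene(pos, genes)
        if hit is not None and hit[1] <= max_dist:
            kept[pos] = hit[0]
    return kept


# Hypothetical genes and sites for illustration.
toy_genes = [("GENE_A", 10_000, 20_000), ("GENE_B", 40_000, 50_000)]
toy_sites = [15_000, 22_000, 30_000, 36_000]
assigned = assign_sites(toy_sites, toy_genes)
```

Here the site at 30,000 is 10 kbp from both genes and is excluded, while the sites at 22,000 and 36,000 are assigned to their nearest gene.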

Genetic analysis

Sequencing data from chromosome 3 for each set of cases and controls (base, potential, and selected) were analyzed using FARVAT [10]. FARVAT allows for the use of a dichotomous outcome and takes little computational time. FARVAT provides burden-, variance component–, and SKAT-O–type tests, and additionally provides the Pedigree Combined Multivariate and Collapsing (PedCMC) [11] and collapsing-based tests [12]. We utilized the variance component–type test, as this test performs well for genes whose functional rare variants have effects in opposite directions, as is likely to be the case for most genes [6]. Users have the option to specify an offset to improve statistical efficiency. We chose the disease prevalence-based offset, using the hypertension prevalence of 0.26 among Hispanic adults as reported by the National Health and Nutrition Examination Survey (NHANES) [13]. In addition, age and sex were included as covariates.
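The reason a variance component–type test tolerates effects in opposite directions can be shown with a simplified sketch. The statistics below are the generic burden and SKAT-style forms for unrelated individuals, not FARVAT's actual statistics, which also account for relatedness and yield p values; the function names and toy genotypes are hypothetical.

```python
def variant_scores(genotypes, affected, prevalence):
    """Per-variant score: sum over subjects of allele count times (y - offset).

    genotypes: one row per subject, one minor-allele count per variant.
    affected:  1 for cases, 0 for controls.
    prevalence: offset subtracted from the dichotomous outcome.
    """
    n_variants = len(genotypes[0])
    return [
        sum(g[j] * (y - prevalence) for g, y in zip(genotypes, affected))
        for j in range(n_variants)
    ]


def burden_statistic(scores, weights):
    """Collapse first, then square: opposite-signed scores cancel."""
    return sum(w * s for w, s in zip(weights, scores)) ** 2


def variance_component_statistic(scores, weights):
    """Square first, then sum: opposite-signed scores both contribute."""
    return sum((w * s) ** 2 for w, s in zip(weights, scores))


# Toy gene with two rare variants: one carried only by cases (risk),
# one carried only by controls (protective).
toy_genotypes = [[1, 0], [1, 0], [0, 1], [0, 1]]
toy_affected = [1, 1, 0, 0]
scores = variant_scores(toy_genotypes, toy_affected, prevalence=0.5)
burden = burden_statistic(scores, [1.0, 1.0])
vc = variance_component_statistic(scores, [1.0, 1.0])
```

With the symmetric offset of 0.5 chosen for this toy example, the two scores are 1.0 and -1.0, so the burden statistic is 0 while the variance component statistic is 2. With the prevalence offset of 0.26 used in the analysis, the cancellation would be partial rather than exact.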

Results

Table 1 provides descriptive results of potential and selected cases and controls. Gene sets on chromosome 3 were analyzed by FARVAT; some gene sets were excluded, as FARVAT will not analyze gene sets with only 1 single nucleotide polymorphism (SNP). FARVAT recalculates minor allele frequency among each set of individuals being analyzed, resulting in a different number of gene sets for each set of cases and controls, as shown in Table 1. After Bonferroni correction for multiple testing (p = 0.05/1389 = 0.000036), none of the genes reached significance for any of the 3 sets of cases and controls. Because MAP4 was simulated to be significantly associated with SBP, Table 2 includes the results for MAP4, along with the 10 most significant genes for each analysis, which tended to vary. Of the genes on chromosome 3, only MAP4, FLNB, and ABTB1 were simulated to have an effect on SBP, explaining 7.79 %, 0.29 %, and 0.13 % of the total variance in SBP, respectively. Although MAP4 did not meet the Bonferroni-corrected significance threshold, analysis of potential cases and controls showed improved significance for MAP4 over the analysis of all individuals in the base population, and analysis of selected cases and selected controls further improved the significance of MAP4. Figure 3 displays quantile-quantile plots of each analysis; these plots show no inflation of the observed p values, indicating that type I error was controlled.
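The multiple-testing threshold and the coordinates of a quantile-quantile plot can be computed as in the sketch below. The function names are illustrative, the expected quantiles use the common (rank − 0.5)/m convention as an assumption, and no p values from the study are reproduced here.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def qq_points(p_values):
    """Return (expected, observed) pairs on the -log10 scale, smallest p first.

    Under the null hypothesis the points should fall near the diagonal;
    points consistently above it indicate inflation.
    """
    observed = sorted(p_values)
    m = len(observed)
    return [
        (-math.log10((rank + 0.5) / m), -math.log10(p))
        for rank, p in enumerate(observed)
    ]


threshold = bonferroni_threshold(0.05, 1389)
print(f"{threshold:.6f}")
```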

Table 2 Analysis of genes associated with hypertension in simulated data
Fig. 3

Quantile-quantile (Q-Q) plots of analyses. Q-Q plots of each analysis, including base population, potential cases and controls, and selected cases and controls

Discussion

The potential power of family data is appealing for the discovery of rare variants that contribute to complex disease. Family data sets can contain hundreds or thousands of individuals, and WGS may not be feasible for every individual in every family. Frequently, researchers will select some family members for sequencing and then impute sequencing data for the remaining family members using existing genome-wide SNP data; however, this can still be costly, and the accuracy of imputation varies depending on the approach used [14, 15]. As an alternative, it is possible to limit analyses to fewer family members yet bypass imputation. Careful selection of cases and controls is key to narrowing the potential candidates for sequencing. Our multistep approach can be applied to any outcome and allows elimination of those individuals who are least likely to have a genetic component to their outcome, with further elimination of those individuals who will be genetically uninformative to a rare variant association analysis.

Multiple factors contribute to complex disease, and it may be important to consider all of these factors in the effort to find genetic determinants. By using multiple regression, we were able to take several covariates into consideration; each of these covariate phenotypes is easily and inexpensively obtained. The inclusion of these covariates allowed us to focus our attention on those individuals with unexplained and, likely, genetic hypertension. Through this approach, cases and controls were not simply defined as those with the highest and lowest blood pressures, respectively, but rather as those with blood pressure that is higher or lower than expected given their age, sex, smoking habits, and blood pressure medication usage. The use of theoretical kinship coefficients ensured that only genetically informative individuals were included in the analysis.

As with any selection process, the sample size decreased as the requirements for inclusion became more stringent. While this decrease reduces costs, loss of power from decreased sample size is a serious concern. In addition, the combination of multiple phenotypic components into a case definition forces the use of a dichotomous outcome during analysis, which generally results in a loss of power. However, we found that the signal for MAP4, the gene with the strongest simulated effect on SBP, improved with each step of the selection process, suggesting that our selection process offset the loss of power caused by the decrease in sample size and the dichotomization of a quantitative trait.

Conclusions

Family data can be useful for the detection of rare variants, but must be carefully analyzed. There are options to prioritize the selection of cases and controls for sequencing and analysis. Careful case definitions, combined with information on family structure, can help ensure that only the most informative individuals are chosen for sequencing. This can help keep costs low and, potentially, improve the ability to detect rare variants. However, loss of power is a real concern, meaning the selection process may only yield meaningful results if there is a large base population from which to select.

References

  1. Lanktree MB, Hegele RA, Schork NJ, Spence JD. Extremes of unexplained variation as a phenotype: an efficient approach for genome-wide association studies of cardiovascular disease. Circ Cardiovasc Genet. 2010;3(2):215–21.

  2. Lee S, Abecasis GR, Boehnke M, Lin X. Rare-variant association analysis: study designs and statistical tests. Am J Hum Genet. 2014;95(1):5–23.

  3. Li D, Lewinger JP, Gauderman WJ, Murcray CE, Conti D. Using extreme phenotype sampling to identify the rare causal variants of quantitative traits in association studies. Genet Epidemiol. 2011;35(8):790–9.

  4. Spence JD, Barnett PA, Bulman DE, Hegele RA. An approach to ascertain probands with a non-traditional risk factor for carotid atherosclerosis. Atherosclerosis. 1999;144(2):429–34.

  5. Blangero J, Teslovich TM, Sim X, Almeida MA, Jun G, Dyer TD, Johnson M, Peralta JM, Manning AK, Wood AR, et al. Omics squared: human genomic, transcriptomic, and phenotypic data for Genetic Analysis Workshop 19. BMC Proc. 2015;9(8):S2.

  6. Choi S, Lee S, Cichon S, Nothen MM, Lange C, Park T, Won S. FARVAT: a family-based rare variant association test. Bioinformatics. 2014;30(22):3197–205.

  7. Yan T, Yang YN, Cheng X, DeAngelis MM, Hoh J, Zhang H. Genotypic association analysis using discordant-relative-pairs. Ann Hum Genet. 2009;73(1):84–94.

  8. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, et al. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156–8.

  9. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38(16):e164.

  10. Jiang D, McPeek MS. Robust rare variant association testing for quantitative traits in samples with related individuals. Genet Epidemiol. 2014;38(1):10–20.

  11. Zhu Y, Xiong M. Family-based association studies for next-generation sequencing. Am J Hum Genet. 2012;90(6):1028–45.

  12. Morris AP, Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol. 2010;34(2):188–93.

  13. Nwankwo T, Yoon SS, Burt V, Gu Q. Hypertension among adults in the United States: National Health and Nutrition Examination Survey, 2011–2012. NCHS Data Brief. 2013;133:1–8.

  14. Song S, Shields R, Li X, Li J. Joint analysis of sequence data and single-nucleotide polymorphism data using pedigree information for imputation and recombination inference. BMC Proc. 2014;8 Suppl 1:S20.

  15. Hinrichs AL, Culverhouse RC, Suarez BK. Genotypic discrepancies arising from imputation. BMC Proc. 2014;8 Suppl 1:S17.

Acknowledgements

We would like to thank Jace Otting for his assistance with data programming. This research was supported in part by a core grant to the Center for Demography and Ecology at the University of Wisconsin-Madison (P2C HD047873).

Declarations

This article has been published as part of BMC Proceedings Volume 10 Supplement 7, 2016: Genetic Analysis Workshop 19: Sequence, Blood Pressure and Expression Data. Summary articles. The full contents of the supplement are available online at http://bmcproc.biomedcentral.com/articles/supplements/volume-10-supplement-7. Publication of the proceedings of Genetic Analysis Workshop 19 was supported by National Institutes of Health grant R01 GM031575.

Authors’ contributions

RS completed analysis and writing. JMK assisted with writing and data summaries. BFD cleaned and prepared sequencing data. CDE supervised all efforts. All read and approved the final manuscript.

Competing interests

The authors declare they have no competing interests.

Author information

Corresponding author

Correspondence to Corinne D Engelman.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Sippy, R., Kolesar, J.M., Darst, B.F. et al. Prioritization of family member sequencing for the detection of rare variants. BMC Proc 10 (Suppl 7), 11 (2016). https://doi.org/10.1186/s12919-016-0035-8

  • DOI: https://doi.org/10.1186/s12919-016-0035-8
