On family-based genome-wide association studies with large pedigrees: observations and recommendations

Fardo, David W; Zhang, Xue; Ding, Lili; He, Hua; Kurowski, Brad; Alexander, Eileen S; Mersha, Tesfaye B; Pilipenko, Valentina; Kottyan, Leah; Nandakumar, Kannabiran; Martin, Lisa

doi:10.1186/1753-6561-8-S1-S26

Volume 8 Supplement 1

Genetic Analysis Workshop 18

Proceedings
Open access
Published: 17 June 2014

David W Fardo¹, Xue Zhang², Lili Ding²,³, Hua He², Brad Kurowski²,³, Eileen S Alexander⁴, Tesfaye B Mersha²,³, Valentina Pilipenko², Leah Kottyan², Kannabiran Nandakumar¹ & Lisa Martin²,³

BMC Proceedings volume 8, Article number: S26 (2014)

Abstract

Family-based association studies are employed less often than case-control designs in the search for disease-predisposing genes. The optimal statistical genetic approach for complex pedigrees is unclear when evaluating both common and rare variants. We examined the empirical power and type I error rates of 2 common approaches, the measured genotype approach (MGA) and family-based association testing (FBAT), through simulations from a set of multigenerational pedigrees. Overall, these results suggest that much larger sample sizes will be required for family-based studies and that power was higher with MGA than with FBAT. Taking into account computational time and potential bias, a 2-step strategy is recommended, with FBAT followed by MGA.

Background

Phenotypic variation in complex traits is conferred through both common and rare variants. It has been suggested that common variation plays a role at the level of the population, whereas rare variation has stronger effects at the levels of the clan (extended family) and the nuclear family [1]. To date, a large number of genome-wide association studies (GWAS) have focused on population-level variation. Since the first GWAS was published in 2005 [2], more than 1000 have been conducted. By using predominantly case-control designs with single-variant analyses, these studies have identified common variants associated with common diseases and related phenotypes. Alternatively, family-based approaches using trios and nuclear families have been increasingly utilized with GWAS and next-generation sequencing [3–9]. In the past 10 years, studies of extended families have been much more limited, even though individuals sharing recent ancestors share regions of the genome other than disease-causing variants and may provide a better proxy for the total mutation load [1]. Thus, there is a clear need to evaluate strategies for the analysis of genetic data from extended families.

The measured genotype approach (MGA) and family-based association testing (FBAT) are 2 broad strategies to examine family-based association in the context of large extended families. MGA from a variance components framework utilizes a mixed model in which familial relationships are accounted for using random effects and genetic variants are incorporated as fixed effects. In contrast, FBAT relies solely on within-family information by constructing a score test that essentially provides a correlation between phenotype and genotype. However, performance of these approaches in the context of variants of varying frequency with modest to moderate effect in extended family data is unclear.

Thus, this paper evaluates the performance of MGA and FBAT in the context of large extended families genotyped for both common and rare variants (minor allele frequency ≥5% and <5%, respectively). To accomplish this, we use chromosome 3 variants from single-nucleotide polymorphism (SNP) genotyping chips, as well as the simulated phenotypes from the Genetic Analysis Workshop 18 (GAW18) data set, which is based on the multigenerational structure of the San Antonio Family Studies (SAFS) [10].

Methods

We analyzed 20 large pedigrees generated from SAFS that range from 21 to 76 members in size. We used the chromosome 3 data to test for association with diastolic blood pressure (DBP) at exam 1 in the 200 simulation replicates, employing both MGA [11] and FBAT [6, 12]. To assess empirical false-positive rates, we analogously analyzed Q1, a trait simulated with no genetic link.

Details regarding the San Antonio Family Heart Study (SAFHS) and the San Antonio Family Diabetes/Gallbladder Study (SAFDGS), which comprise the SAFS, have been provided elsewhere [13, 14]. Pertinent to our analyses, GWAS data were generated from this study using a variety of genotyping platforms and extensively cleaned, resulting in a total of 472,049 SNPs. The 65,519 SNPs residing on chromosome 3 were used in our analyses.

Measured genotype approach

First, we used MGA [9, 15] as implemented in SOLAR (Texas Biomedical Research Institute, San Antonio, TX) [16]. This approach accounts for phenotypic correlation between family members by including a polygenic component as a random effect. Each SNP is coded additively (ie, as a count of minor alleles) and is incorporated as a fixed effect in the following model:

$$\mathrm{DBP} = μ + β_{1}\,\mathrm{age} + β_{2}\,\mathrm{age}^{2} + β_{3}\,\mathrm{BPMED} + β \times \mathrm{SNP} + g + e \qquad (1)$$

where $μ$ is the grand mean for DBP; $β_{1}$, $β_{2}$, and $β_{3}$ are the respective covariate effects; $β$ is the SNP effect; and $g$ and $e$ are the random genetic (additive polygenic) and residual effects. We assume that $g$ and $e$ are normally distributed with zero mean and covariance matrices $2 Φ σ_{g}^{2}$ and $I σ_{e}^{2}$, respectively, where $Φ$ is the kinship matrix, $I$ is the identity matrix, and $σ_{g}^{2}$ and $σ_{e}^{2}$ are the additive genetic and residual variances. To test a SNP effect, the log likelihood of the model estimating an unconstrained SNP effect is compared to the log likelihood of the model in which the SNP effect is constrained to zero. Assuming that trait values follow a multivariate normal distribution, twice the difference in the log likelihoods of these 2 models is asymptotically distributed as $χ_{1}^{2}$.
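The likelihood ratio test described above can be sketched in a few lines. This is a minimal illustration, not SOLAR's implementation: the function name and the two log-likelihood values are hypothetical, and the sketch assumes both models have already been fitted by some mixed-model software. It relies on the identity that, for a chi-square variable with 1 degree of freedom, the upper tail probability equals `erfc(sqrt(x / 2))`.

```python
import math


def lrt_pvalue(loglik_full: float, loglik_null: float) -> tuple[float, float]:
    """Likelihood ratio test with 1 degree of freedom.

    loglik_full: maximized log likelihood with the SNP effect estimated.
    loglik_null: maximized log likelihood with the SNP effect fixed at zero.
    Returns the test statistic and its asymptotic p-value.
    """
    stat = 2.0 * (loglik_full - loglik_null)
    stat = max(stat, 0.0)  # guard against tiny negative values from optimizer noise
    # For a chi-square variable with 1 df, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log likelihoods, for illustration only.
stat, p = lrt_pvalue(loglik_full=-1520.3, loglik_null=-1523.1)
print(f"LRT statistic = {stat:.2f}, p = {p:.4f}")
```

Because the two models differ only in the single SNP coefficient, one degree of freedom is appropriate; a model adding several SNP terms at once would need a different reference distribution.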

Family-based association test: marginal tests

Second, we used FBAT to test for association. Here we define the FBAT test statistic by

$$\frac{\left[\sum_{ij} t_{ij} \left(x_{ij} - E(x_{ij} \mid S_{ij})\right)\right]^{2}}{\sum_{ij} t_{ij}^{2}\, \mathrm{Var}(x_{ij} \mid S_{ij})} \sim χ_{1}^{2} \qquad (2)$$

where $t_{ij}$ is the residual phenotype (DBP at exam 1) for the jth nonfounder of the ith family after regression on age, age squared, sex, and blood pressure medication use, all at the first exam; $x_{ij}$ is the additively coded genotype (ie, minor allele count) for this subject; and $S_{ij}$ are the sufficient statistics [17] for the jth nonfounder of the ith family (eg, the sufficient statistics consist of parental genotypes when analyzing mother-father-offspring trios). FBAT analysis was performed in SNP & Variation Suite v7.6.10 (Golden Helix, Bozeman, MT, http://www.goldenhelix.com) using PBAT's [18] hybrid pedigree algorithm, which clusters trios within extended pedigrees to improve computation time.
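For intuition, the score statistic can be sketched for the simplest case of trios with both parents genotyped, where the sufficient statistics are the parental genotypes. The sketch below uses the standard FBAT form (squared sum of phenotype-weighted transmission deviations, divided by the sum of their null variances) and assumes Mendelian transmission with independent trios. The `Trio` type and all function names are hypothetical, and the code does not reproduce PBAT's hybrid pedigree algorithm.

```python
import math
from typing import NamedTuple


class Trio(NamedTuple):
    t: float      # residual phenotype of the offspring
    x: int        # offspring minor-allele count (0, 1, or 2)
    father: int   # paternal minor-allele count
    mother: int   # maternal minor-allele count


def transmission_moments(father: int, mother: int) -> tuple[float, float]:
    """Mean and variance of the offspring allele count given parental genotypes."""
    pf, pm = father / 2.0, mother / 2.0  # chance each parent transmits the minor allele
    mean = pf + pm
    var = pf * (1.0 - pf) + pm * (1.0 - pm)  # the two transmissions are independent
    return mean, var


def fbat_statistic(trios: list[Trio]) -> tuple[float, float]:
    """Return the 1-df score statistic and its asymptotic p-value."""
    score, variance = 0.0, 0.0
    for trio in trios:
        mean, var = transmission_moments(trio.father, trio.mother)
        score += trio.t * (trio.x - mean)
        variance += trio.t ** 2 * var
    if variance == 0.0:
        raise ValueError("no informative trios (all parents homozygous)")
    stat = score ** 2 / variance
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Hypothetical data: two trios with heterozygous parents.
example = [Trio(1.0, 2, 1, 1), Trio(-1.0, 0, 1, 1)]
print(fbat_statistic(example))
```

Trios with two homozygous parents contribute zero variance, which is why only families with at least one heterozygous parent are informative for the test.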

Family-based association test: screening approach

In addition to examining FBAT test statistics marginally, we also employed the Van Steen screening approach [19], which allows for a reduction in the multiple comparisons burden. Briefly, the screening method imputes nonfounder variants by conditioning on the corresponding sufficient statistics and then estimates the conditional power for each variant. This metric is then used to screen, or rank, variants for testing, thereby reducing the adjustment necessary to declare statistical significance. Extensions of this have been proposed [20]; here, for simplicity of exposition, we use the simple top 10 approach, as done in Herbert et al [21], of testing only the top 10 variants based on conditional power using a Bonferroni-corrected significance threshold of 0.05/10.
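The top 10 rule can be sketched as a rank-then-test step. This is only an illustration of the selection logic: it assumes the conditional power estimates and FBAT p-values have already been computed elsewhere, and the `ScreenedSnp` type and `top_k_screen` function are hypothetical names.

```python
from typing import NamedTuple


class ScreenedSnp(NamedTuple):
    snp_id: str
    conditional_power: float  # screening metric used only for ranking
    fbat_p: float             # p-value from the FBAT test


def top_k_screen(snps: list[ScreenedSnp], k: int = 10, alpha: float = 0.05) -> list[str]:
    """Test only the k SNPs with the highest conditional power at alpha / k."""
    ranked = sorted(snps, key=lambda s: s.conditional_power, reverse=True)
    threshold = alpha / k
    return [s.snp_id for s in ranked[:k] if s.fbat_p < threshold]


# Hypothetical input: 12 SNPs, of which only the 10 highest-ranked are tested.
snps = [ScreenedSnp(f"rs{i}", 1.0 - 0.05 * i, 0.5) for i in range(12)]
snps[0] = ScreenedSnp("rs0", 1.0, 0.001)
print(top_k_screen(snps))
```

A SNP outside the top 10 is never tested, however small its FBAT p-value, which is the price paid for the much less stringent threshold of 0.05/10.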

Power

Each of the 17 SNPs from the simulation model that are causal for DBP $(|\hat{β}_{DBP}| > 0)$ was tested with MGA and FBAT, first at a nominal 5% significance threshold and then at a Bonferroni-corrected threshold. The Bonferroni correction was calculated slightly differently for MGA and FBAT. For MGA analyses, 62,715 SNPs were considered (monomorphic SNPs were removed), resulting in a significance threshold of 0.05/62,715. These same SNPs were examined using FBAT, but only the 58,519 SNPs with at least 10 informative families were tested, giving a Bonferroni-corrected threshold of 0.05/58,519.
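As a small worked example, the two corrected thresholds and the empirical power calculation can be written out directly. The SNP counts come from the text; the function names and the replicate p-values are hypothetical and serve only to show the arithmetic.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def empirical_power(p_values: list[float], threshold: float) -> float:
    """Fraction of simulation replicates in which the test rejects at `threshold`."""
    if not p_values:
        raise ValueError("need at least one replicate")
    return sum(p < threshold for p in p_values) / len(p_values)


mga_threshold = bonferroni_threshold(0.05, 62_715)
fbat_threshold = bonferroni_threshold(0.05, 58_519)
print(f"MGA threshold:  {mga_threshold:.3e}")
print(f"FBAT threshold: {fbat_threshold:.3e}")

# Hypothetical p-values for one causal SNP across five replicates.
replicate_p = [3e-8, 0.02, 4e-7, 0.3, 1e-3]
print("nominal power:  ", empirical_power(replicate_p, 0.05))
print("corrected power:", empirical_power(replicate_p, mga_threshold))
```

The same proportion-of-rejections calculation, applied to a trait with no genetic effect, gives an empirical type I error rate.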

Type I error

To assess false-positive rates, we examined the trait Q1, simulated with no genetic influence. Linkage disequilibrium (LD) was used to prune the chromosome 3 SNPs and create a subsample of 1228 uncorrelated SNPs. These SNPs were used to estimate type I error rates with both MGA and FBAT to maintain consistency across approaches. The pruning approach has 2 advantages. First, it reduces the computational burden, which was especially problematic in MGA, where computation time increases substantially with the degree of pedigree complexity as a result of estimation of the mixed model. Second, it results in an error rate more in line with the number of true comparisons, because Bonferroni correction is overly conservative when tests are correlated. To calculate a comparable assessment of type I error using the Van Steen screening approach, the proportion of noncausal SNPs declared significant in each replicate was averaged.
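One common way to obtain an approximately uncorrelated SNP subset is greedy pruning on pairwise squared genotype correlation. The sketch below illustrates that idea only; the text does not state which pruning algorithm or threshold was used, so the `max_r2` value of 0.1, the function names, and the use of genotype correlation as an LD proxy are all assumptions.

```python
def r_squared(a: list[int], b: list[int]) -> float:
    """Squared Pearson correlation between two genotype vectors (minor-allele counts)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as uncorrelated
    return cov * cov / (var_a * var_b)


def ld_prune(genotypes: dict[str, list[int]], max_r2: float = 0.1) -> list[str]:
    """Keep a SNP only if its r^2 with every previously kept SNP is at most max_r2."""
    kept: list[str] = []
    for snp, g in genotypes.items():
        if all(r_squared(g, genotypes[k]) <= max_r2 for k in kept):
            kept.append(snp)
    return kept


# Hypothetical genotypes for six individuals at three SNPs.
genotypes = {
    "snpA": [0, 1, 2, 0, 1, 2],
    "snpB": [0, 1, 2, 0, 1, 2],  # identical to snpA, so it is pruned
    "snpC": [1, 1, 1, 0, 2, 0],  # uncorrelated with snpA, so it is kept
}
print(ld_prune(genotypes))
```

This all-pairs version is quadratic in the number of SNPs; practical tools typically restrict comparisons to a sliding window along the chromosome.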

Of note, the multiple testing correction approach differed between the power and the type I error evaluation. Specifically, the LD pruning step was not performed when examining empirical power. Although it is optimal to use the same procedure to assess error rate and power, the varying pruning step should not bias our results.

Results

Power

Overall, there was low power to detect causal variants (Table 1). Only 3 SNPs achieved greater than 20% power at the nominal significance level. SNP rs11711953 in MAP4 had a large effect on DBP (heritability 2.29%) and a minor allele frequency (MAF) of 0.026. The other 2 SNPs with marginal power, rs4683602 and rs16851435, are common (MAFs of 0.272 and 0.243, respectively) but exhibited much more modest effects (heritability 0.003% and <10⁻⁵, respectively). After accounting for multiple testing, only rs11711953 remained detectable, and then only with MGA. With the Van Steen top 10 screening approach (FBAT-VS), the MAP4 SNP was also detectable, but at a lower rate than with MGA.

Table 1 Empirical powers for DBP causal variants.

Type I error

Using the Q1 phenotype, we found that both MGA and FBAT controlled the type I error rate at the nominal significance level (type I error rate 0.05 for both). After controlling for multiple testing, no false positives were identified with any of the methods.

Discussion and conclusions

Using a cohort of extended families, we evaluated the performance of 2 family-based methods (MGA and FBAT) in identifying causal variants of varying allele frequency and effect size. Overall, the approaches exhibited low power, with only 3 variants identified more than 20% of the time. Nevertheless, both approaches exhibited appropriate family-wise false-positive rates. Taken together, these results suggest that family-based studies require large sample sizes to detect the majority of effects.

The variant identified across all approaches (rs11711953) had a MAF of 0.026 and a true effect size of −6.2235 (with heritability of 2.29%). It appears that the ability to detect this variant was driven by the very strong effect size (more than 10× greater than that of any other variant). The other 2 variants identified were more common but had relatively small effect sizes. Because other common variants had larger effect sizes, there is clearly a complex interplay of factors influencing power to detect effects.
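One way to see how allele frequency and effect size combine is through the variance a single SNP contributes to the trait. As a rough worked example, assuming a purely additive biallelic model under Hardy-Weinberg equilibrium (an assumption made here for illustration, not a statement about the GAW18 simulation model), a SNP with minor allele frequency $p$ and per-allele effect $β$ contributes

$$σ_{SNP}^{2} = 2 p (1 - p) β^{2},$$

and its heritability is $σ_{SNP}^{2}$ divided by the total trait variance. For rs11711953,

$$σ_{SNP}^{2} = 2 (0.026)(0.974)(6.2235)^{2} \approx 1.96$$

in squared trait units. Under the same assumptions, a common variant with a MAF of 0.272 would need a per-allele effect of only about

$$|β| = \sqrt{\frac{1.96}{2 (0.272)(0.728)}} \approx 2.2$$

to contribute the same variance, which illustrates why a rare variant must carry a much larger effect to be equally detectable.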

Both methods suffered from overall low power. This suggests that substantially larger data sets and methodological extensions incorporating multiple variants, such as FBAT-RV [22], will be required when testing for effects of rare variants on complex phenotypes. However, care is required to prevent spurious association results when increasing sample size. Specifically, because the measured genotype approach is susceptible to confounding as a result of population stratification, combining data across multiple studies may be problematic. In the current study, there were no inflated false-positive rates using any of the methodologies, suggesting that there were no adverse effects of population stratification. However, given the extremely low power of this study, care must be taken not to overinterpret these findings. Future studies need to explore this possibility with more genetically diverse family samples to examine the relative merits of family-based approaches. Notably, methods that rely on between-family information must appropriately handle population stratification because their validity is contingent on either its absence [23] or sufficient adjustment, as opposed to FBAT approaches, which are, by design, robust to population stratification.

One of the major challenges in these analyses was the computational time, especially for MGA, for which genome-wide analyses are infeasible. MGA analysis took approximately 30 seconds per SNP, whereas FBAT took approximately one-eighth of a second per SNP. Ideally, without any constraints on computation time and with sufficient evidence to rule out population stratification, it is best to perform both MGA and FBAT across the genome and focus on regions of overlap, that is, those with the most evidence for true association. However, because both time and population substructure are often constraints, when choosing between MGA- and FBAT-type analyses, we recommend initially employing an FBAT screening approach with a less-stringent significance threshold because of its speed and robustness to population stratification, and then following up regions of interest with MGA for confirmation to identify the variants most likely to be causal.
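The recommended 2-step strategy and its effect on computation time can be sketched as follows. This is a schematic under stated assumptions: the per-SNP timings are the approximate figures quoted above, while the function name, the `mga_test` callable, and both thresholds are hypothetical placeholders (the text does not specify numerical values for the screening or confirmation thresholds).

```python
from typing import Callable

MGA_SECONDS_PER_SNP = 30.0    # approximate timing quoted in the text
FBAT_SECONDS_PER_SNP = 0.125  # approximate timing quoted in the text


def two_step(
    fbat_p: dict[str, float],
    mga_test: Callable[[str], float],
    screen_alpha: float,
    confirm_alpha: float,
) -> tuple[list[str], float]:
    """Screen every SNP with FBAT, then run MGA only on SNPs that pass the screen.

    Returns the confirmed SNPs and a rough estimate of compute time in seconds.
    """
    candidates = [snp for snp, p in fbat_p.items() if p < screen_alpha]
    confirmed = [snp for snp in candidates if mga_test(snp) < confirm_alpha]
    est_seconds = (len(fbat_p) * FBAT_SECONDS_PER_SNP
                   + len(candidates) * MGA_SECONDS_PER_SNP)
    return confirmed, est_seconds


# Hypothetical p-values; a dictionary lookup stands in for an actual MGA fit.
fbat_p = {"rsA": 0.001, "rsB": 0.2, "rsC": 0.004}
mga_p = {"rsA": 1e-5, "rsC": 0.3}
print(two_step(fbat_p, mga_p.__getitem__, screen_alpha=0.01, confirm_alpha=0.001))

# Rough cost of scanning the chromosome 3 SNP sets with each method alone.
print(f"MGA alone:  {62_715 * MGA_SECONDS_PER_SNP / 86_400:.1f} days")
print(f"FBAT alone: {58_519 * FBAT_SECONDS_PER_SNP / 3_600:.1f} hours")
```

The time saved depends almost entirely on how many SNPs pass the screen, so the choice of screening threshold trades computation against the risk of discarding true signals.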

In summary, analysis of the GAW18 simulated phenotypes, DBP and Q1, allowed us to examine the performance of family-based association methods in the context of extended families and variants of varying frequency. Overall, we found that the GAW18 data were underpowered to detect all but one of the variants regardless of the approach used. Approaches to ease the burden of multiple testing are beneficial, and simulations with explicit population stratification are needed to further compare these methods.

References

1. Lupski JR, Belmont JW, Boerwinkle E, Gibbs RA: Clan genomics and the complex architecture of human disease. Cell. 2011, 147: 32-43. 10.1016/j.cell.2011.09.008.
2. Klein RJ, Zeiss C, Chew EY, Tsai J-Y, Sackler RS, Haynes C, Henning AK, SanGiovanni JP, Mane SM, Mayne ST, et al: Complement factor H polymorphism in age-related macular degeneration. Science. 2005, 308: 385-389. 10.1126/science.1109557.
3. Lasky-Su J, Won S, Mick E, Anney RJL, Franke B, Neale B, Biederman J, Smalley SL, Loo SK, Todorov A, et al: On genome-wide association studies for family-based designs: an integrative analysis approach combining ascertained family samples with unselected controls. Am J Hum Genet. 2010, 86: 573-580. 10.1016/j.ajhg.2010.02.019.
4. Murphy A, Weiss ST, Lange C: Two-stage testing strategies for genome-wide association studies in family-based designs. Methods Mol Biol. 2010, 620: 485-496. 10.1007/978-1-60761-580-4_17.
5. Luo L, Boerwinkle E, Xiong M: Association studies for next-generation sequencing. Genome Res. 2011, 21: 1099-1108. 10.1101/gr.115998.110.
6. Laird NM, Lange C: The role of family-based designs in genome-wide association studies. Stat Sci. 2009, 24: 388-397. 10.1214/08-STS280.
7. Sha Q, Zhang Z, Zhang S: Joint analysis for genome-wide association studies in family-based designs. PLoS One. 2011, 6: 8-
8. Qin H, Feng T, Zhang S, Sha Q: A data-driven weighting scheme for family-based genome-wide association studies. Eur J Hum Genet. 2010, 18: 596-603. 10.1038/ejhg.2009.201.
9. Aulchenko YS, De Koning D-J, Haley C: Genomewide rapid association using mixed model and regression: a fast and simple method for genomewide pedigree-based quantitative trait loci association analysis. Genetics. 2007, 177: 577-585. 10.1534/genetics.107.075614.
10. Almasy L, Dyer T, Peralta J, Jun G, Fuchsberger C, Almeida M, Kent JW, Fowler S, Duggirala R, Blangero J: Data for Genetic Analysis Workshop 18: human whole genome sequence, blood pressure, and simulated phenotypes in extended pedigrees. BMC Proc. 2014, 8 (suppl 2): S2-
11. Amin N, Van Duijn CM, Aulchenko YS: A genomic background based method for association analysis in related individuals. PLoS One. 2007, 2: e1274. 10.1371/journal.pone.0001274.
12. Laird NM, Horvath S, Xu X: Implementing a unified approach to family-based tests of association. Genet Epidemiol. 2000, 19: S36-S42. 10.1002/1098-2272(2000)19:1+<::AID-GEPI6>3.0.CO;2-M.
13. Mitchell BD, Kammerer CM, Blangero J, Mahaney MC, Rainwater DL, Dyke B, Hixson JE, Henkel RD, Sharp RM, Comuzzie AG, et al: Genetic and environmental contributions to cardiovascular risk factors in Mexican Americans. The San Antonio Family Heart Study. Circulation. 1996, 94: 2159-2170. 10.1161/01.CIR.94.9.2159.
14. Hunt KJ, Lehman DM, Arya R, Fowler S, Leach RJ, Göring HHH, Almasy L, Blangero J, Dyer TD, Duggirala R, Stern MP: Genome-wide linkage analyses of type 2 diabetes in Mexican Americans: the San Antonio Family Diabetes/Gallbladder Study. Diabetes. 2005, 54: 2655-2662. 10.2337/diabetes.54.9.2655.
15. Boerwinkle E, Chakraborty R, Sing CF: The use of measured genotype information in the analysis of quantitative phenotypes in man. Ann Hum Genet. 1986, 50: 181-194. 10.1111/j.1469-1809.1986.tb01037.x.
16. Almasy L, Blangero J: Multipoint quantitative-trait linkage analysis in general pedigrees. Am J Hum Genet. 1998, 62: 1198-1211. 10.1086/301844.
17. Rabinowitz D, Laird N: A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information. Hum Hered. 2000, 50: 211-223. 10.1159/000022918.
18. Lange C, DeMeo D, Silverman EK, Weiss ST, Laird NM: PBAT: tools for family-based association studies. Am J Hum Genet. 2004, 74: 367-369. 10.1086/381563.
19. Van Steen K, McQueen MB, Herbert A, Raby B, Lyon H, Demeo DL, Murphy A, Su J, Datta S, Rosenow C, Christman M, et al: Genomic screening and replication using the same data set in family-based association testing. Nat Genet. 2005, 37: 683-691. 10.1038/ng1582.
20. Ionita-Laza I, McQueen MB, Laird NM, Lange C: Genomewide weighted hypothesis testing in family-based association studies, with an application to a 100K scan. Am J Hum Genet. 2007, 81: 607-614. 10.1086/519748.
21. Herbert A, Gerry NP, McQueen MB, Heid IM, Pfeufer A, Illig T, Wichmann H-E, Meitinger T, Hunter D, Hu FB, et al: A common genetic variant is associated with adult and childhood obesity. Science. 2006, 312: 279-283. 10.1126/science.1124779.
22. De G, Yip W-K, Ionita-Laza I, Laird N: Rare variant analysis for family-based design. PLoS One. 2013, 8: e48495. 10.1371/journal.pone.0048495.
23. Lange K, Sinsheimer JS, Sobel E: Association testing with Mendel. Genet Epidemiol. 2005, 29: 36-50. 10.1002/gepi.20073.

Acknowledgements

We are grateful to Dr. Patrick Breheny for useful discussion and the anonymous reviewers whose suggestions improved the manuscript. This work was supported in part by NIH grants 8P20GM103436-12 (DWF, KN), K25AG043546 (DWF), NS36695 (LD, LJM), AI070235 (HH, LJM, TBM), AI066738 (LJM), HL111459 (LJM, VP), T32-ES10957 (ESA), K12 HD001097-16 (BGK), K01HL103165 (TBM).

The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.

This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.

Author information

Authors and Affiliations

Department of Biostatistics, University of Kentucky College of Public Health, 111 Washington Ave, Lexington, KY, 40536, USA
David W Fardo & Kannabiran Nandakumar
Department of Pediatrics, Cincinnati Children's Hospital Medical Center, 3333 Burnet Avenue, Cincinnati, OH, 45229, USA
Xue Zhang, Lili Ding, Hua He, Brad Kurowski, Tesfaye B Mersha, Valentina Pilipenko, Leah Kottyan & Lisa Martin
Department of Pediatrics, University of Cincinnati College of Medicine, 2600 Clifton Ave, Cincinnati, OH, 45229, USA
Lili Ding, Brad Kurowski, Tesfaye B Mersha & Lisa Martin
Department of Environmental Health, University of Cincinnati College of Medicine, 2600 Clifton Ave, Cincinnati, OH, 45229, USA
Eileen S Alexander

Corresponding author

Correspondence to David W Fardo.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

DWF, XZ, and LJM designed the overall study. DWF, XZ, and KN conducted the statistical analyses and created tables. DWF, XZ, and LJM drafted the manuscript, which was revised by LD, HH, BGK, ESA, TBM, VP, LK, and KN. All authors discussed the project throughout, read, and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Fardo, D.W., Zhang, X., Ding, L. et al. On family-based genome-wide association studies with large pedigrees: observations and recommendations. BMC Proc 8 (Suppl 1), S26 (2014). https://doi.org/10.1186/1753-6561-8-S1-S26
