Identity-by-descent graphs offer a flexible framework for imputation and both linkage and association analyses
© Blue et al.; licensee BioMed Central Ltd. 2014
Published: 17 June 2014
We demonstrate the flexibility of identity-by-descent (IBD) graphs for genotype imputation and testing relationships between genotype and phenotype. We analyzed chromosome 3 and the first replicate of simulated diastolic blood pressure. IBD graphs were obtained from complete pedigrees and full multipoint marker analysis, facilitating subsequent linkage and other analyses. For rare alleles, pedigree-based imputation using these IBD graphs had a higher call rate than did population-based imputation. Combining the two approaches improved call rates for common alleles. We found it advantageous to incorporate known, rather than estimated, pedigree relationships when testing for association. Replacing missing data with imputed alleles improved association signals as well. Analyses were performed with knowledge of the underlying model.
Patterns of identity-by-descent (IBD) sharing within and across pedigrees are fundamental for the understanding of genetic variation, including its distribution, origin, and relationship to phenotype. Recent analytical and computational advances have allowed us to estimate the distribution of patterns of IBD sharing in large and complex pedigrees using the program gl_auto in the MORGAN v3.1 package (http://www.stat.washington.edu/thompson/Genepi/pangaea.shtml). These estimates are computationally intensive: for example, family 10 (83 members) required 727 CPU minutes on an Intel L5427 Xeon 2.50-GHz processor. However, the resulting sampled IBD graphs can be quickly reused for several types of analysis, including genotype imputation in pedigrees, or obtaining and refining a linkage signal. At Genetic Analysis Workshop 18 (GAW18), we used these sampled IBD graphs for (a) imputation of genotypes in pedigrees compared with a "population-based" method that uses an external reference panel, (b) linkage analyses using both parametric and variance components models, and (c) association testing of both observed and imputed genotypes using two strategies to incorporate relationships between subjects.
Genetic map and markers
We analyzed GAW18 marker data for chromosome 3. We did not use the GAW18 sequence data because they included imputed variants, although our methods would work for sequence data as well. We obtained genetic map positions (cM) for the genome-wide association study (GWAS) markers from the Rutgers sex-averaged interpolated positions of dbSNP Build 134 (http://compgen.rutgers.edu/maps), excluding the 116 loci with missing values. Kosambi positions were converted to Haldane positions to suit assumptions made by the Lander-Green algorithm. We found no Mendelian inconsistencies in the 65,403 markers using Loki v2.4.7. For linkage and association analyses, we removed markers with minor allele frequency (MAF) less than 0.05 (13,139 markers) and/or more than 5% missing data (4,939 markers), leaving 48,892 markers for analysis.
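The Kosambi-to-Haldane conversion can be illustrated with a minimal sketch. It assumes the conversion is done interval by interval between adjacent markers (each Kosambi distance is turned into a recombination fraction, which is then turned into a Haldane distance); the function name and the interval-wise approach are illustrative assumptions, not a description of the exact procedure used here.

```python
import math

def kosambi_to_haldane(positions_cm):
    """Convert cumulative Kosambi positions (cM) to Haldane positions (cM).

    Assumption: the conversion is applied to each interval between
    adjacent markers, and the first marker keeps its original position.
    """
    haldane = [positions_cm[0]]
    for prev, cur in zip(positions_cm, positions_cm[1:]):
        d_k = (cur - prev) / 100.0                 # Kosambi distance in Morgans
        theta = 0.5 * math.tanh(2.0 * d_k)         # Kosambi map function
        d_h = -0.5 * math.log(1.0 - 2.0 * theta)   # inverse Haldane map function
        haldane.append(haldane[-1] + 100.0 * d_h)
    return haldane
```

Because the Haldane map function assumes no crossover interference, the same recombination fraction corresponds to a longer Haldane distance, so the converted map is never shorter than the Kosambi map.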
Phenotype and families
We began with the simulated diastolic blood pressure at time point 1 from SIMPHEN.1.csv. Given the contents of the answer key to the simulated data, we included age, sex, age*sex, and treatment in a linear regression model. The residuals are our adjusted phenotype values.
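As an illustration of this adjustment, the sketch below fits the linear regression by ordinary least squares and returns the residuals. It is a minimal standard-library example under stated assumptions: the function names are hypothetical, sex and treatment are assumed to be coded numerically (for example 0/1), and the normal equations are solved directly, which is adequate for a handful of covariates but is not how a statistics package would necessarily do it.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def adjusted_phenotype(dbp, age, sex, treatment):
    """Residuals of dbp regressed on intercept, age, sex, age*sex, and treatment."""
    rows = [[1.0, a, s, a * s, t] for a, s, t in zip(age, sex, treatment)]
    p = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, dbp)) for i in range(p)]
    beta = solve_linear(xtx, xty)
    return [y - sum(b * v for b, v in zip(beta, r)) for r, y in zip(rows, dbp)]
```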
We analyzed 7 families showing evidence of cryptic relatedness in the hope of reducing genetic and allelic heterogeneity in our trait. Using all available GWAS data, we estimated kinship coefficients between all pairs of individuals in the data set using both the KING-robust and REAP methods to accommodate admixture, as explained in detail elsewhere. As is standard quality control in pedigree studies, pedigree relationships were validated by empirical estimates of kinship. Pairwise kinship coefficients exceeding those for second cousins were observed for subject pairs across families 5 to 8, 10, 21, and 25. These families were used in the BEAGLE imputation and SOLAR association analyses described later. Family 10 was chosen for further analyses because it was the family with the strongest evidence of association with our trait.
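The cross-family screen can be sketched as follows. The expected kinship coefficient for second cousins is 1/64; the data structures and the function name are hypothetical, and the kinship estimates are assumed to have been produced beforehand by a tool such as KING-robust or REAP.

```python
SECOND_COUSIN_KINSHIP = 1.0 / 64.0  # expected kinship coefficient for second cousins

def cryptically_related_families(kinship, family_of):
    """Return the families linked by at least one cross-family pair whose
    estimated kinship exceeds the second-cousin expectation.

    kinship:   dict mapping (subject_a, subject_b) -> estimated kinship
    family_of: dict mapping subject -> family identifier
    """
    flagged = set()
    for (a, b), phi in kinship.items():
        if family_of[a] != family_of[b] and phi > SECOND_COUSIN_KINSHIP:
            flagged.update((family_of[a], family_of[b]))
    return flagged
```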
Estimation of identity-by-descent sharing
A single set of IBD graphs was used for all pedigree-based analyses. To generate IBD graphs with the program gl_auto, we used a subset of 351 markers with an average spacing of one marker per 0.65 cM, choosing at each targeted region the marker with the highest value of heterozygosity multiplied by the number of observed genotypes. Markov-chain Monte Carlo sampling with a state-of-the-art hybrid sampler [11, 12] allowed us to use both large pedigrees and many markers. We saved every 50th of 50,000 sampled realizations of IBD graphs for chromosome 3 (1,000 realizations in total), conditional on all observed genotypes, the genetic map, and pedigree structure.
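A minimal sketch of the framework-marker selection rule follows. Two details are assumptions made for illustration: heterozygosity is taken to be the expected heterozygosity computed from observed allele frequencies, and a "targeted region" is approximated by a fixed-width bin along the map. All names are hypothetical.

```python
def marker_score(genotypes):
    """Heterozygosity multiplied by the number of observed genotypes.

    genotypes: list of (allele1, allele2) tuples, with None for missing.
    """
    observed = [g for g in genotypes if g is not None]
    if not observed:
        return 0.0
    counts = {}
    for g in observed:
        for allele in g:
            counts[allele] = counts.get(allele, 0) + 1
    total = 2 * len(observed)
    het = 1.0 - sum((c / total) ** 2 for c in counts.values())
    return het * len(observed)

def pick_framework_markers(markers, spacing_cm=0.65):
    """Keep the best-scoring marker in each bin of width spacing_cm.

    markers: list of (name, position_cM, genotypes).
    """
    best = {}
    for name, pos, genotypes in markers:
        bin_index = int(pos // spacing_cm)
        score = marker_score(genotypes)
        if bin_index not in best or score > best[bin_index][0]:
            best[bin_index] = (score, pos, name)
    return [name for _, _, name in sorted(best.values(), key=lambda t: t[1])]
```

Weighting heterozygosity by the number of observed genotypes favors markers that are both informative and well typed, so a highly polymorphic marker with many missing calls can lose to a slightly less polymorphic one with complete data.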
We used the program GIGI to impute genotypes conditional on the sampled IBD graphs. Imputation markers were not in the framework set used to produce IBD graphs. For each imputation marker, a set of genotypes for all subjects was sampled from the genotype probability distribution, given observed data at the imputation marker in some subjects, the sampled IBD graphs, allele frequencies, and the meiotic map. Genotype and allele probabilities were then averaged across the sampled IBD graphs. We called both alleles of a missing genotype if Pr(genotype) exceeded 0.8, and otherwise called one allele if Pr(allele) exceeded 0.9. Genotypes failing to meet these criteria were not called.
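The two-stage calling rule can be sketched as below. This is an illustration rather than GIGI's own code: the function name and return convention are hypothetical, and Pr(allele) is assumed here to mean the probability that the subject carries at least one copy of the allele, computed by summing the genotype probabilities that contain it.

```python
def call_genotype(geno_probs, geno_threshold=0.8, allele_threshold=0.9):
    """Return (a1, a2) if a genotype is called, (a, None) if only one
    allele is called, or None if nothing meets the thresholds.

    geno_probs: dict mapping an unordered genotype (a1, a2) -> probability,
                already averaged across the sampled IBD graphs.
    """
    best_geno, best_p = max(geno_probs.items(), key=lambda kv: kv[1])
    if best_p > geno_threshold:
        return best_geno
    # Assumed definition: Pr(allele) = Pr(at least one copy of the allele).
    alleles = sorted({a for g in geno_probs for a in g})
    carry = {a: sum(p for g, p in geno_probs.items() if a in g) for a in alleles}
    best_allele = max(alleles, key=lambda a: carry[a])
    if carry[best_allele] > allele_threshold:
        return (best_allele, None)
    return None
```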
For comparison, we also used BEAGLE, which uses an outside reference panel of genotypes and population-level linkage disequilibrium to impute marker information among unrelated individuals. We compared results using three reference panels: the genotyped subjects from family 10 and the other families (experiment F10 + FO), only samples from family 10 (experiment F10), and the other families without family 10 (experiment FO). BEAGLE's 3,621 scaffold markers were chosen to be common (MAF >0.3) and evenly spaced (at least 0.05 cM apart). As with GIGI, we called both alleles of a missing genotype if Pr(genotype) exceeded 0.8, and otherwise called one allele if Pr(allele) exceeded 0.9.
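One way to pick scaffold markers meeting the two stated criteria is a greedy left-to-right pass along the map, sketched below. The greedy strategy and all names are assumptions for illustration; the text does not say how the 3,621 markers were actually selected.

```python
def choose_scaffold(markers, min_maf=0.3, min_gap_cm=0.05):
    """Greedy selection of common, well-separated markers.

    markers: list of (name, position_cM, maf).
    A marker is kept if its MAF exceeds min_maf and it lies at least
    min_gap_cm beyond the previously kept marker.
    """
    chosen = []
    last_pos = None
    for name, pos, maf in sorted(markers, key=lambda m: m[1]):
        if maf <= min_maf:
            continue
        if last_pos is None or pos - last_pos >= min_gap_cm:
            chosen.append(name)
            last_pos = pos
    return chosen
```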
[Table: Design of imputation experiments. Fields: subjects with observed marker data (including F10 + FO); number of subjects; number of variants observed per subject.]
Intrigued by the complementary data used by BEAGLE and GIGI, we evaluated a combination of their results. Using design 2 data, we first used GIGI to call both alleles if Pr(genotype) exceeded 0.99, or to call one allele if Pr(allele) exceeded 0.995, thus calling alleles only if essentially forced by the pedigree data. For loci with uncalled genotypes, we then used results from BEAGLE F10 + FO with call thresholds of Pr(genotype) greater than 0.8 and Pr(allele) greater than 0.9. For loci where GIGI left one allele uncalled, we accepted the BEAGLE genotype if it included the single allele called by GIGI.
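The combination rule can be written as a short decision function. This is a sketch with hypothetical names and call encoding: a full call is a tuple (a1, a2), a one-allele call is (a, None), and no call is None. The text does not say what happens when BEAGLE's genotype contradicts GIGI's single allele; the sketch assumes GIGI's partial call is retained in that case.

```python
def combine_calls(gigi_call, beagle_call):
    """Merge a strict-threshold GIGI call with a BEAGLE call.

    gigi_call:   call made with thresholds 0.99 (genotype) / 0.995 (allele)
    beagle_call: call made with thresholds 0.8 (genotype) / 0.9 (allele)
    """
    if gigi_call is not None and None not in gigi_call:
        return gigi_call          # both alleles forced by the pedigree data
    if gigi_call is None:
        return beagle_call        # nothing from GIGI: fall back to BEAGLE
    known = gigi_call[0]          # GIGI called exactly one allele
    if beagle_call is not None and None not in beagle_call and known in beagle_call:
        return beagle_call        # BEAGLE genotype is consistent with GIGI
    return gigi_call              # assumption: keep GIGI's partial call
```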
We computed lod scores for family 10 and all cryptically related families at a subset of 44 positions from the IBD graphs, yielding a spacing of approximately 1 lod score per 5 cM. To obtain multipoint lod scores, we (a) used the program IBDgraph [13, 14] to identify equivalence classes among the realized IBD graphs at each position, (b) computed likelihoods for one representative of each equivalence class at each position with the mlink program, and (c) computed a weighted average from the sampled IBD graphs to obtain an estimate of the multipoint lod score for the trait at each position.
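Step (c) can be illustrated with a minimal sketch. It assumes that each equivalence class is weighted by the number of sampled IBD graphs falling into it, and that the lod score is the base-10 logarithm of the averaged trait likelihood divided by the likelihood under no linkage; the function and argument names are hypothetical.

```python
import math

def multipoint_lod(class_likelihoods, class_counts, null_likelihood):
    """Estimate the lod score at one position.

    class_likelihoods: trait likelihood for one representative IBD graph
                       of each equivalence class
    class_counts:      number of sampled IBD graphs in each class
    null_likelihood:   trait likelihood with the trait locus unlinked
    """
    total = sum(class_counts)
    mean_likelihood = sum(
        lik * n for lik, n in zip(class_likelihoods, class_counts)
    ) / total
    return math.log10(mean_likelihood / null_likelihood)
```

Grouping graphs into equivalence classes means the expensive likelihood is computed once per class rather than once per sampled graph, which is where the reuse of a single set of IBD graphs pays off.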
We tested three parametric models. Model 1 is a quantitative trait locus (QTL) model with parameters defined by the single-nucleotide polymorphism (SNP) with the biggest contribution to the simulated trait variance. Because this SNP explains only 0.0229% of the simulated trait variance, model 1 tests whether we can detect a locus with a small effect size if it is modeled perfectly. Model 2 is a QTL that is the weighted average for all functional SNPs within the gene bearing the "biggest" SNP. The result is a common allele with small effect sizes and is an attempt to model the cumulative effects of several functional variants within a single gene. Model 3 is a perfectly penetrant additive locus, where affection status indicates that the subject carries the risk allele at the biggest SNP; it tests whether we could detect the SNP locus if it perfectly explained the trait variance. We compared results from the same IBD graphs with a typical variance components (VC) lod score, as implemented by SOLAR.
We used VC analyses to investigate association with the trait of candidate covariate SNPs while accounting for correlations among related subjects. We analyzed family 10 alone, as well as all 7 families jointly. Each of the top 5 SNPs, ranked by p-value, identified from a half genome scan, was tested for association with a linear mixed model using dose of the minor allele as the fixed effect and a polygenic component, with covariance given by the kinship matrix, as a random effect. Whereas SOLAR uses the pedigree-based kinship matrix to account for relatedness, EMMAX estimates the kinship matrix from the genome-wide genotype correlations. These two programs fit the same model, differing only in the source of the kinship matrix.
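Written out, a model of this kind takes the following general form. The notation is introduced here for illustration and is not taken from either program's documentation, and the scaling of the kinship matrix may differ between implementations.

```latex
y = \mu \mathbf{1} + \beta\, g + u + e, \qquad
u \sim \mathcal{N}\!\left(0,\; 2\Phi\,\sigma_a^{2}\right), \qquad
e \sim \mathcal{N}\!\left(0,\; I\,\sigma_e^{2}\right)
```

Here $y$ is the vector of adjusted phenotypes, $g$ the dose of the minor allele at the tested SNP, $\beta$ its fixed effect, $u$ the polygenic random effect, and $\Phi$ the kinship matrix: derived from the known pedigree in SOLAR, and estimated from genome-wide genotypes in EMMAX. The association test is a test of $\beta = 0$.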
We also performed VC analyses with SOLAR with various combinations of imputed and observed genotype data within family 10 to evaluate the usefulness of imputed genotype data. For these analyses, we used the weighted average of genotype probabilities obtained from GIGI to provide an expected dose of the minor allele, given the observed data.
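The expected dose is the probability-weighted count of minor alleles. A minimal sketch, with a hypothetical function name and input encoding:

```python
def expected_minor_dose(geno_probs):
    """Expected number of minor alleles for one subject at one marker.

    geno_probs: dict mapping the minor-allele count (0, 1, or 2) to the
                genotype probability averaged across sampled IBD graphs.
    """
    return sum(count * prob for count, prob in geno_probs.items())
```

Using the expected dose rather than a hard call lets subjects with uncertain imputed genotypes contribute to the association test in proportion to how much is known about them.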
Although enough copies of the risk allele segregate within the family to generate a linkage signal if the risk allele were indeed causal (model 3, lodmax = 5.36), this locus does not explain enough phenotypic variation within this family to provide measurable evidence of linkage (lodmax <0.5 for models 1 and 2). VC lod score analyses provided comparable results: no evidence of linkage in family 10 and an all-families lod score near 0.2.
Family-based association testing
[Table: Association test p-values from use of all available genotype data.]
[Table: Association test p-values from use of imputed genotype data within family 10. Column: N observed : N imputed : N total subjects.]
IBD graphs provided ample opportunity to investigate relationships between individuals and between genotypes and phenotypes. Pedigree-based imputation that exploited these graphs outperformed population-based imputation for rare variants, even when the latter included family members of the subjects being imputed. We also showed that the two approaches may be combined to improve call rate and accuracy for some uses. Both parametric and VC linkage analysis failed to detect a linkage signal. Further examination revealed no cosegregation of phenotypes and genotypes at the functional variants on chromosome 3 in these families in SIMPHEN.1.csv (John Blangero, personal communication), although this was not true of the other simulated replicates. In contrast, family-based association testing with a mixed model was still able to detect association with the functional variants. We found that using the known pedigree structure in SOLAR provided similar but slightly stronger evidence for association than EMMAX, which treats subjects as unrelated but accounts for relatedness through an empirical covariance matrix. Finally, use of observed genotype data provides a stronger association signal than imputed data, although the difference between the two sets of p-values can be negligible. This suggests that when direct genotyping is not possible, pedigree-based imputation provides a practical and useful alternative.
This research was supported by the National Institutes of Health (NIH) grants AG040184, AG005136, AG039700, CA148958, GM046255, GM075091, HD054562, MH092367, and MH094293. The GAW18 whole genome sequencing data were provided by the Type 2 Diabetes Genetic Exploration by Next-generation sequencing in Ethnic Samples (T2D-GENES) Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The GAW is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
1. Cheung CY, Thompson EA, Wijsman EM: GIGI: an approach to effective imputation of dense genotypes on large pedigrees. Am J Hum Genet. 2013, 92: 504-516. 10.1016/j.ajhg.2013.02.011.
2. Marchani EE, Wijsman EM: Estimation and visualization of identity-by-descent within pedigrees simplifies interpretation of complex trait analysis. Hum Hered. 2011, 72: 289-297. 10.1159/000334083.
3. Rosenthal EA, Ronald J, Rothstein J, Rajagopalan R, Ranchalis J, Wolfbauer G, Albers JJ, Brunzell JD, Motulsky AG, Rieder MJ, et al: Linkage and association of phospholipid transfer protein activity to LASS4. J Lipid Res. 2011, 52: 1837-1846. 10.1194/jlr.P016576.
4. Lander ES, Green PJ: Construction of multilocus genetic maps in humans. Proc Natl Acad Sci USA. 1987, 84: 2363-2367. 10.1073/pnas.84.8.2363.
5. Heath SC: Markov chain Monte Carlo segregation and linkage analysis for oligogenic models. Am J Hum Genet. 1997, 61: 748-760. 10.1086/515506.
6. Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM: Robust relationship inference in genome-wide association studies. Bioinformatics. 2010, 26: 2867-2873. 10.1093/bioinformatics/btq559.
7. Thornton T, Tang H, Hoffmann TJ, Ochs-Balcom HM, Caan BJ, Risch N: Estimating kinship in admixed populations. Am J Hum Genet. 2012, 91: 122-138. 10.1016/j.ajhg.2012.05.024.
8. Thornton T, Conomos MP, Sverdlov S, Marchani EE, Cheung C, Glazner C, Lewis SM, Wijsman EM: Estimating and adjusting for ancestry admixture in statistical methods for relatedness inference, heritability estimation, and association testing. BMC Proc. 2014, 8 (suppl 2): S5.
9. Browning BL, Browning SR: A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet. 2009, 84: 210-223. 10.1016/j.ajhg.2009.01.005.
10. Almasy L, Blangero J: Multipoint quantitative-trait linkage analysis in general pedigrees. Am J Hum Genet. 1998, 62: 1198-1211. 10.1086/301844.
11. Wijsman EM, Rothstein JH, Thompson EA: Multipoint linkage analysis with many multiallelic or dense diallelic markers: MCMC provides practical approaches for genome scans on general pedigrees. Am J Hum Genet. 2006, 79: 846-858. 10.1086/508472.
12. Tong LP, Thompson E: Multilocus lod scores in large pedigrees: combination of exact and approximate calculations. Hum Hered. 2008, 65: 142-153. 10.1159/000109731.
13. Thompson EA: The structure of genetic linkage data: from LIPED to 1M SNPs. Hum Hered. 2011, 71: 86-96. 10.1159/000313555.
14. Koepke H, Thompson EA: Efficient testing operations on dynamic graph structures using strong hash functions. Department of Statistics, technical reports. 2010, Seattle: University of Washington.
15. Lathrop GM, Lalouel JM, Julier C, Ott J: Strategies for multilocus linkage analysis in humans. Proc Natl Acad Sci USA. 1984, 81: 3443-3446. 10.1073/pnas.81.11.3443.
16. Sobel E, Lange K: Descent graphs in pedigree analysis: applications to haplotyping, location scores, and marker-sharing statistics. Am J Hum Genet. 1996, 58: 1323-1337.
17. Cheung CYK, Wijsman E: Imputing genotypes in large pedigrees: a comparison between GIGI and BEAGLE. American Society of Human Genetics. 2012, San Francisco, vol. (abstract).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.