Volume 8 Supplement 1
An exploration of heterogeneity in genetic analysis of complex pedigrees: linkage and association using whole genome sequencing data in the MAP4 region
© Bull et al.; licensee BioMed Central Ltd. 2014
Published: 17 June 2014
We conduct pedigree-based linkage and association analyses of simulated systolic blood pressure data in the nonascertained large Mexican American pedigrees provided by Genetic Analysis Workshop 18, focusing on observed sequence variants in MAP4 on chromosome 3. Because pedigrees are large and sequence data have been completed by imputation, it is feasible to conduct analysis for each pedigree separately as well as for all pedigrees combined. We are interested in quantifying and explaining between-pedigree heterogeneity in linkage and association signals. To this end, we first examine minor allele frequency differences between pedigrees. In some of the pedigrees, rare and low-frequency variants occur at a higher prevalence than in all pedigrees combined. In simulation replicate 1, we conduct variance-components linkage and association analysis of all 894 MAP4 variants to compare analytic approaches in single pedigree and combined analysis. In all 200 replicates, we similarly examine the 15 causal variants in MAP4 known under the generating model. We illustrate how random allele frequency variation among pedigrees leads to heterogeneity in pedigree-specific linkage and association signals.
Whole genome sequencing holds out the promise of mapping the effects of genetic variants on complex traits more effectively, and thereby identifying the causal variants involved in disease expression. Even when the effect of a causal variant is large, if the allele is rare, its signal will be overwhelmed at the population level unless a very large sample can be analyzed. As pointed out by Gagnon et al and others, variants that occur rarely in a population are unlikely to segregate in more than a few pedigrees when pedigrees are not ascertained by disease or trait values. However, when a rare allele has entered a family, it can segregate to multiple family members, increasing the allele frequency in that pedigree and improving power to detect rare or low-frequency causal variants.
In analysis of the original Genetic Analysis Workshop pedigree data (reported in detail in Chen et al), we observed substantial between-pedigree heterogeneity and differences between linkage and association tests, which we seek here to explore more thoroughly in the simulated data with the underlying model known. In the original data, pedigree heterogeneity could arise from variation in genetic effect size, in minor allele frequency, or from allelic or locus heterogeneity. In the simulated data, heterogeneity is largely caused by variation in minor allele frequencies between pedigrees, because the causal variants and their effect sizes were fixed under the generating model and applied to all individuals. We anticipated that analysis of the observed sequence data and the simulated systolic blood pressure (SBP) would allow us to describe natural genetic variation among the San Antonio Family Study (SAFS) pedigrees, as well as assess whether selection of pedigrees with linkage can enrich for rare variants and improve detection of variants in association analysis.
SAFS pedigree data
We analyzed the imputed "best guess" sequence genotype data for a total of 959 study participants in the 20 pedigrees, as provided, including 894 MAP4-designated sequence variants encompassing the chromosome 3 region from 47.892183 to 48.130741 megabases (Mb). Under the generating model for the phenotype simulations, the sequence data were the same for each replicate, and the same set of 15 variants was defined as causal, so only the randomly generated phenotypes varied from replicate to replicate. We estimated the minor allele frequency (MAF) at each segregating locus within a pedigree. As described in Chen et al, we analyzed the residuals of SBP from censored linear regression, averaged over 3 visits, as a quantitative phenotype accounting for use of antihypertensive medication, age, sex, and smoking.
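As a minimal sketch of the within-pedigree allele frequency calculation, the following Python counts copies of the minor allele among the genotyped members of each pedigree. The simple counting estimator, the function name `pedigree_allele_freq`, and the input layout are assumptions made for illustration; the text does not state which estimator was used, and this version ignores relatedness among pedigree members.

```python
from collections import defaultdict


def pedigree_allele_freq(genotypes, pedigree_of):
    """Return {pedigree_id: allele frequency} for one biallelic variant.

    genotypes   : {person_id: copies of the overall minor allele (0, 1, 2),
                   or None when the genotype is missing}
    pedigree_of : {person_id: pedigree_id}

    Assumption: naive allele counting over all genotyped pedigree members.
    """
    copies = defaultdict(int)       # minor-allele copies seen per pedigree
    chromosomes = defaultdict(int)  # genotyped chromosomes per pedigree
    for person, g in genotypes.items():
        if g is None:
            continue  # skip missing genotypes
        ped = pedigree_of[person]
        copies[ped] += g
        chromosomes[ped] += 2
    return {ped: copies[ped] / chromosomes[ped] for ped in chromosomes}


# Hypothetical toy data: two pedigrees with two genotyped members each.
toy_genotypes = {"a": 0, "b": 1, "c": 2, "d": 0}
toy_pedigrees = {"a": 1, "b": 1, "c": 2, "d": 2}
freqs = pedigree_allele_freq(toy_genotypes, toy_pedigrees)
```

A pedigree whose frequency is 0 is one in which the variant does not segregate, so it would be excluded when summarizing segregating loci.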
Linkage and association analysis
We applied the genetic analysis software SOLAR for variance-components models to assess linkage and association in pedigrees. With single-marker identity-by-descent (IBD) estimates based on kinship and the sequence data for each pedigree, we performed 2-point linkage analysis across the MAP4 region. Association analyses, conducted for each single variant, included measured genotype (MG) analysis, which relates the quantitative trait directly to the genotype in all individuals, and the quantitative transmission disequilibrium test (QTDT), which relates the variation in the quantitative trait to the difference between the genotype observed in the offspring and that expected given the parental genotypes. Association analysis by QTDT is of interest because it is an explicitly pedigree-based association method that detects association in the presence of linkage by testing for transmission disequilibrium. In some sense it is "intermediate" between linkage analysis (purely within-pedigree analysis) and the MG association analysis (purely between individuals). As recommended to reduce type I error, we took linkage information into account in the association analysis.
In the QTDT method, the mean phenotype is modeled as a linear combination of fixed effects (ie, genotype score) and random effects (ie, polygenic and linkage components). The genotype score $g$ is decomposed into between-family ($b$) and within-family ($w$) components in the fixed-effect model $E(\text{phenotype}) = \mu + \beta_b b + \beta_w w$. The MG approach estimates regression coefficients with the constraint $\beta_b = \beta_w$. The QTDT approach estimates both $\beta_b$ and $\beta_w$ and tests whether the within-family parameter $\beta_w$ is significantly different from zero. The parameter $\beta_w$ reflects the within-pedigree correlation between phenotype and the allelic transmission score $w = g - b$, which is the deviation of the observed genotype from the expected genotype. The test of $\beta_w$ is, therefore, robust to stratification effects.
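The decomposition can be illustrated with a small Python sketch for a single nuclear family. It assumes both parents are genotyped and takes the between-family score as the average of the parental genotypes; the function names and toy values are hypothetical, and the random-effect (polygenic and linkage) components of the full variance-components model are omitted.

```python
def qtdt_scores(g_father, g_mother, offspring_genotypes):
    """Return a (b, w) pair for each offspring of one nuclear family.

    Genotypes are coded as copies (0, 1, 2) of the allele of interest.
    Assumption: both parents are genotyped, so b is the parental average.
    """
    b = (g_father + g_mother) / 2.0  # expected offspring genotype
    return [(b, g - b) for g in offspring_genotypes]  # w = g - b


def expected_phenotype(mu, beta_b, beta_w, b, w):
    """Fixed-effect mean of the model: mu + beta_b * b + beta_w * w."""
    return mu + beta_b * b + beta_w * w


# One heterozygous and one homozygous parent, two offspring (toy values).
scores = qtdt_scores(1, 0, [0, 1])

# Under the MG constraint beta_b == beta_w, the mean reduces to mu + beta * g.
mg_means = [expected_phenotype(0.0, 2.0, 2.0, b, w) for b, w in scores]
```

Setting the two coefficients equal recovers a regression on the observed genotype $g = b + w$, which is why MG can be viewed as a constrained version of the same fixed-effect model.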
In replicate 1, for each of the 894 loci with sequence variants, we computed the LOD score and the asymptotic MG and QTDT p values for all pedigrees combined and for each pedigree separately, and constructed LOD score and −log10(p value) regional profile plots. For processing of all 200 replicates, we considered only the 15 MAP4 causal variants specified in the simulation model. Given the small size of MAP4 relative to a typical linkage region, and the limitations of single-point IBD estimation for linkage analysis, we took the maximum of the 2-point LOD as a regional measure of linkage. We constructed box plots of the LOD score and −log10(p value) to examine variation across replicates by pedigree and differences in power among the causal loci.
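The two summaries used in the profile plots and box plots can be sketched in a few lines of Python. The function name `regional_summary` and the input values are hypothetical; the LOD scores and p values themselves would come from the variance-components software, which is not reproduced here.

```python
import math


def regional_summary(lod_scores, p_values):
    """Summarize one region for one pedigree (or all pedigrees combined).

    lod_scores : 2-point LOD scores, one per variant in the region
    p_values   : association p values, one per variant

    Returns the maximum LOD (regional linkage measure) and the list of
    -log10(p) values on the scale used for plotting.
    """
    max_lod = max(lod_scores)
    neg_log10_p = [-math.log10(p) for p in p_values]
    return max_lod, neg_log10_p


# Toy values for illustration only.
max_lod, neg_log10_p = regional_summary([0.1, 1.2, 0.7], [0.01, 1e-5])
```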
Results and discussion
MAP4 variants in SAFS pedigrees
Of the 894 MAP4 variants, only a fraction was observed within any single pedigree (ranging from 179 in pedigree 47 to 389 in pedigree 5) and, as expected, rare variants (MAF <1%) were most prevalent. Overall MAF values for the 15 "causal" loci in MAP4 ranged from 0.5% to 37.8%: 3 common variants (>5% MAF) were observed in all 20 pedigrees, 3 low-frequency variants (1% to 5% MAF) appeared in 7 or more pedigrees, and 9 rare variants (<1% MAF) appeared in 1 to 5 pedigrees. In any one pedigree, only 3 to 8 of the 15 variants were observed. When a rare variant was observed, the corresponding pedigree-specific MAF was substantially higher, with nearly all being >1% for causal variants.
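The frequency classes used above can be written as a small Python helper, together with a count of the pedigrees in which a variant segregates. The thresholds follow the text (rare <1%, low-frequency 1% to 5%, common >5%); the function names, the treatment of a MAF of exactly 5% as low-frequency, and the toy frequencies are assumptions for illustration.

```python
def maf_class(maf):
    """Classify a variant by minor allele frequency (given as a proportion)."""
    if maf < 0.01:
        return "rare"
    if maf <= 0.05:  # assumption: exactly 5% counts as low-frequency
        return "low-frequency"
    return "common"


def n_segregating_pedigrees(freq_by_pedigree):
    """Count pedigrees in which the minor allele is observed at least once."""
    return sum(1 for f in freq_by_pedigree.values() if f > 0)


# Toy example: a variant that is rare overall but enriched in one pedigree.
overall_class = maf_class(0.005)
pedigree_freqs = {1: 0.0, 2: 0.04, 3: 0.0}
enriched_class = maf_class(pedigree_freqs[2])
n_seg = n_segregating_pedigrees(pedigree_freqs)
```

In the toy example the variant is rare overall but falls in the low-frequency class within the one pedigree where it segregates, which mirrors the enrichment pattern described in the text.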
Linkage and association in single pedigrees and all pedigrees combined
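Table 1. Replicate 1 linkage LOD scores, association p values, and regression coefficients for pedigree-specific and combined all-pedigree analysis of locus 10 in MAP4. (Table body not recoverable from the source text; surviving fragments are the label "Locus 10 LOD" and the values 1.5 × 10^−5 and 1.3 × 10^−14.)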
In large pedigrees consisting of multiple nuclear families, the pedigree-specific QTDT analysis yields separate regression coefficients that correspond to between-family and within-family genotype scores calculated in each nuclear family. The between-family score is an expected genotype (typically, an average of the parental genotypes) and the within-family score for an individual is the deviation of their observed genotype from the expected, which is taken as a measure of allelic transmission. Like other transmission-based methods, families in which one or both parents are homozygous can be uninformative for the QTDT test, with a within-family score equal to zero. When a variant occurs infrequently, there will be few heterozygous individuals and a pedigree consisting of multiple such families may be entirely uninformative for estimation of a within-family regression coefficient (as in Table 1: pedigrees 4, 11, 14, 16, 20). In contrast, MG is essentially a linear regression of phenotype on the observed genotype score that includes all individuals in a pedigree, and a combined analysis will also include individuals from pedigrees in which all individuals are homozygous. This is the main reason why QTDT is so much less powerful than MG.
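To make the notion of an uninformative pedigree concrete, the following Python sketch counts offspring with a nonzero within-family score across the nuclear families of one pedigree. It assumes both parents are genotyped in every family, uses the parental average as the expected genotype, and relies on a hypothetical input layout; a count of zero means the within-family coefficient cannot be estimated from that pedigree.

```python
def informative_offspring(families):
    """Count offspring whose within-family score w = g - b is nonzero.

    families : list of (g_father, g_mother, [offspring genotypes]) tuples,
               with genotypes coded as allele copies (0, 1, 2).
    Assumption: both parents are genotyped in every nuclear family.
    """
    count = 0
    for g_father, g_mother, offspring in families:
        b = (g_father + g_mother) / 2.0  # expected genotype in this family
        count += sum(1 for g in offspring if g != b)
    return count


# Both parents homozygous in every family: no informative offspring.
homozygous_pedigree = [(0, 0, [0, 0]), (0, 2, [1])]
n_uninformative = informative_offspring(homozygous_pedigree)

# One heterozygous parent: every offspring deviates from the expectation.
het_pedigree = [(0, 1, [0, 1, 1])]
n_informative = informative_offspring(het_pedigree)
```

An MG analysis would still use all of these individuals, because it regresses the phenotype on the observed genotype rather than on the deviation from the parental expectation.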
Under the fixed effects part of the QTDT model, the between-family and within-family scores are approximately orthogonal so the between and within regression coefficients are approximately independent, but only the latter is robust to population stratification. Because the QTDT within-family regression coefficient in a single pedigree may be based on a small number of informative transmissions, it can be imprecise and the corresponding asymptotic p values may be inaccurate, so, as illustrated in Table 1, combined pedigree analysis will be an improvement. In the replicate 1 analysis, the between and within QTDT coefficients agree well for the combined analysis and for the 4 pedigrees with nonnegligible LOD scores, suggesting lack of population stratification bias in the simulated data.
Consideration of the combined pedigree maxLOD across the MAP4 region was sufficient to identify a gene-specific region for sequence analysis, even with single-point IBD estimation. For each of the 9 MAP4 rare variants (MAF <1%) included in the phenotype-generating model, segregation of the variant was observed in 5 or fewer of the 20 pedigrees, with mean MAF >1% in the subset. The low-frequency variant responsible for the largest percentage of SBP variation explained (locus 10), which segregated in 12 pedigrees, exhibited the best power for association with good agreement between MG and QTDT signals. Pedigrees with some evidence for linkage at this locus were enriched for the low-frequency variant (or conversely, linkage evidence was contributed according to the frequency of the variant), and it followed that these pedigrees were also more informative in the within-family QTDT assessing transmission disequilibrium, and contributed to smaller standard errors in the MG regression analysis. As we understand it, the Genetic Analysis Workshop 18 (GAW18) pedigrees comprise a sample of large Mexican American families, not ascertained on the phenotype. Therefore, differences in MAF between pedigrees arise from random differences in the genotypes of the founder individuals for each pedigree. Because the genetic model used to generate the phenotype data in the simulated data sets is the same in each pedigree, variation in MAF among the pedigrees is a major source of heterogeneity in the pedigree-specific linkage and association tests.
This research was supported by funding from the Canadian Institutes of Health Research: CIHR Operating Grant MOP-84287 (SBB, principal investigator), CIHR Training Grant GET-101831 (ZC). ZC is a Fellow with CIHR STAGE (Strategic Training for Advanced Genetic Epidemiology). The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
- Bailey-Wilson JE, Wilson AF: Linkage analysis in the next-generation sequencing era. Hum Hered. 2011, 72: 228-236. 10.1159/000334381.
- Gagnon F, Roslin NM, Lemire M: Successful identification of rare variants using oligogenic segregation analysis as a prioritizing tool for whole-exome sequencing studies. BMC Proc. 2011, 5 (Suppl 9): S11. 10.1186/1753-6561-5-S9-S11.
- Chen Z, Tan KR, Bull SB: Multiphase analysis by linkage, quantitative transmission disequilibrium, and measured genotype: systolic blood pressure in complex Mexican American pedigrees. BMC Proc. 2014, 8 (Suppl 1): S108.
- Almasy L, Blangero J: Multipoint quantitative-trait linkage analysis in general pedigrees. Am J Hum Genet. 1998, 62: 1198-1211. 10.1086/301844.
- Abecasis GR, Cookson WO, Cardon LR: Pedigree tests of transmission disequilibrium. Eur J Hum Genet. 2000, 8: 545-551. 10.1038/sj.ejhg.5200494.
- Kent JW, Dyer TD, Göring HH, Blangero J: Type I error rates in association versus joint linkage/association tests in related individuals. Genet Epidemiol. 2007, 31: 173-177. 10.1002/gepi.20200.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.