Volume 3 Supplement 7
Genetic Analysis Workshop 16
Contrasting identity-by-descent estimators, association studies, and linkage analyses using the Framingham Heart Study data
 Elizabeth E Marchani^{1},
 Yanming Di^{2},
 Yoonha Choi^{3},
 Charles Cheung^{3},
 Ming Su^{2},
 Frederick Boehm^{3},
 Elizabeth A Thompson^{2} and
 Ellen M Wijsman^{1, 3}
DOI: 10.1186/1753-6561-3-S7-S102
© Marchani et al; licensee BioMed Central Ltd. 2009
Published: 15 December 2009
Abstract
We explored the utility of population- and pedigree-based analyses using the Framingham Heart Study genome-wide 50 k single-nucleotide polymorphism marker data provided for Genetic Analysis Workshop 16. Our aims were: 1) to compare identity-by-descent sharing estimates from variable amounts of data; 2) to apply each of these estimates to a case-control association study designed to control for relatedness among samples; and 3) to contrast these results to those obtained using model-based and model-free linkage analysis methods.
Background
The study of quantitative traits has led to the development of tools of varying complexity designed to identify chromosomal regions associated with disease. This has been coupled in recent years with the increasing availability of data sets that include hundreds of thousands of markers. We investigated the utility of using more data and more sophisticated analyses by applying analytical methods of variable complexity to a single data set and comparing their results. We also estimated the same statistics using variable amounts of marker data to investigate the influence of the amount of marker data on the estimates. Generally speaking, we asked whether we really benefit from these new tools and large data sets, or whether they simply increase the complexity of our research. More specifically, we: 1) compared estimates of identity-by-descent (IBD) sharing from increasing amounts of marker data; 2) evaluated how well different IBD estimates corrected a case-control association study for relatedness; and 3) compared the results of our case-control study and various linkage analyses to contrast any signals of association between genotype and phenotype.
Methods
Genetic map and marker data
We focused exclusively on the Genetic Analysis Workshop (GAW) 16 50 k marker data, and primarily analyzed only chromosome 7. We matched the position of each chromosome 7 marker to the sex-averaged Kosambi map sequence position on the Rutgers map [1], and then converted those positions to a Haldane map. Markers within 0.01 cM of each other were given unique, sequential map positions so that no two map positions overlapped.
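The map conversion described above can be sketched as follows. This is a minimal illustration, not the authors' script: it converts each inter-marker interval from Kosambi to Haldane centimorgans through the recombination fraction and then cumulates the intervals. The function names, the choice to keep the first marker at its original position, and the rule of widening any gap below 0.01 cM to exactly 0.01 cM are assumptions made for this example.

```python
import math

def kosambi_to_haldane_cm(interval_cm):
    """Convert one inter-marker interval from Kosambi cM to Haldane cM."""
    d = interval_cm / 100.0                  # Kosambi distance in Morgans
    r = 0.5 * math.tanh(2.0 * d)             # Kosambi map function: distance to recombination fraction
    return -50.0 * math.log(1.0 - 2.0 * r)   # inverse Haldane map function, result in cM

def haldane_positions(kosambi_cm, min_gap_cm=0.01):
    """Cumulate converted intervals along one chromosome (input sorted by position).

    Assumptions of this sketch: the first marker keeps its original position,
    and any converted gap below min_gap_cm is widened to min_gap_cm.
    """
    if not kosambi_cm:
        return []
    positions = [kosambi_cm[0]]
    for prev, cur in zip(kosambi_cm, kosambi_cm[1:]):
        gap = kosambi_to_haldane_cm(cur - prev)
        positions.append(positions[-1] + max(gap, min_gap_cm))
    return positions
```

For closely spaced markers the two map functions nearly agree, so the conversion mainly matters over longer intervals, where the Haldane distance is the larger of the two.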
We filtered markers with >3% missing data and with minor allele frequency <0.05. We used chi-square tests to test the null hypothesis of Hardy-Weinberg equilibrium, and removed markers yielding the largest 1% of the test statistics, leaving 2132 markers on chromosome 7. A "thinned" marker panel was obtained by selecting approximately every tenth marker from this "dense" filtered marker set, preferentially selecting markers with higher rare allele frequencies among founders because they are more suitable for linkage analysis. The final thinned data set included 214 markers on chromosome 7 (and 3465 genome-wide markers) with a marker density of ~1 per cM.
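The filtering and thinning steps above might look like the following sketch. It assumes genotypes coded as 0/1/2 copies of one allele with None for missing calls, applies the Hardy-Weinberg cut after the missingness and allele-frequency filters, and thins by keeping the marker with the highest minor allele frequency in each consecutive block of ten. The coding, the order of the filters, the block rule, and all function names are illustrative assumptions; the text does not specify them.

```python
def allele_stats(genos):
    """Return (missing rate, minor allele frequency, HWE chi-square) for one marker."""
    called = [g for g in genos if g is not None]
    n = len(called)
    missing = 1.0 - n / len(genos)
    n_het, n_hom_alt = called.count(1), called.count(2)
    p = (2 * n_hom_alt + n_het) / (2 * n) if n else 0.0
    chisq = 0.0
    if n and 0.0 < p < 1.0:
        q = 1.0 - p
        expected = [n * q * q, 2 * n * p * q, n * p * p]
        observed = [called.count(0), n_het, n_hom_alt]
        chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return missing, min(p, 1.0 - p), chisq

def filter_markers(markers, max_missing=0.03, min_maf=0.05, hwe_drop=0.01):
    """markers: list of (name, genotypes) in map order. Returns [(name, maf), ...]."""
    stats = [(name,) + allele_stats(genos) for name, genos in markers]
    kept = [s for s in stats if s[1] <= max_missing and s[2] >= min_maf]
    n_drop = int(len(kept) * hwe_drop)   # largest 1% of HWE statistics
    worst = {s[0] for s in sorted(kept, key=lambda s: s[3], reverse=True)[:n_drop]}
    return [(s[0], s[2]) for s in kept if s[0] not in worst]

def thin(filtered, block=10):
    """Keep the highest-MAF marker from each consecutive block of markers."""
    return [max(filtered[i:i + block], key=lambda m: m[1])[0]
            for i in range(0, len(filtered), block)]
```

In the study the allele frequencies used for thinning came from founders, so a fuller version would pass founder genotypes only to `allele_stats`.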
Pedigree data cleaning
To ensure compatibility with the linkage analysis programs in the MORGAN package [2], we merged the two members of each of 25 monozygotic twin pairs. Parents missing pedigree information who were referenced by at least two family members were given records of their own. Mendelian-inconsistent genotypes were identified by Loki 2.4.7 [3] and recoded as missing genotypes for all members of each affected pedigree. Not all individuals sharing a pedigree number could be connected, so we split each such pedigree into smaller pedigrees defined by the available parent-offspring relationships.
Phenotype data refinement
For linkage-based analyses, we focused on high-density lipoprotein level (HDL) and chromosome 7 because of previous evidence of linkage within the Framingham Heart Study (FHS) [4–8]. We used observations from Exam 11 for the Original Cohort, and Exam 1 for the Offspring and Generation 3 Cohort, age-matching the second and third generations to maximize the number of individuals in our study. Height was imputed from Exam 7 of the Original Cohort data when it was missing from Exam 11 in order to calculate body mass index (BMI). We fit linear regression models to adjust HDL for age, BMI, sex, cholesterol treatment status, and cohort.
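The covariate adjustment can be illustrated with an ordinary least-squares fit whose residuals serve as the adjusted phenotype. The sketch below uses only the Python standard library and solves the normal equations directly; the function name and the encoding of covariates as numeric columns (for example, sex and treatment status as 0/1 indicators) are assumptions of this example, and the exact model specification used by the authors is not given in the text.

```python
def ols_residuals(y, covariates):
    """Residuals of y after least-squares regression on covariates plus an intercept.

    y: list of trait values; covariates: list of rows, one list of numbers per person.
    """
    rows = [[1.0] + [float(v) for v in x] for x in covariates]
    p = len(rows[0])
    # Normal equations (X'X) beta = X'y, stored as an augmented matrix.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
           + [sum(r[i] * yi for r, yi in zip(rows, y))]
           for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(aug[k][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariates are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(p):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    beta = [aug[i][p] / aug[i][i] for i in range(p)]
    return [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
```

A call such as `ols_residuals(hdl, [[age, bmi, sex, treated, cohort] for ...])` would return one adjusted value per person, with `hdl` and the covariate names standing in for the study variables.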
Quantitative trait locus models
We performed Bayesian oligogenic segregation analysis using the software package Loki 2.4.7 [3] to identify and describe models for quantitative trait loci (QTLs) associated with the adjusted HDL phenotype. The QTL with the largest effect size (A allele frequency = 0.76, AA genotype effect = 1.39, Aa genotype effect = 1.75, aa genotype effect = 24.21, variance due to the QTL = 12.04, additive variance = 13.53, dominance variance = 23.54) was incorporated into the MORGAN [2] lm_multiple analysis described below. We used the posterior distribution of these models to generate a sample of simulated traits for use in empirical significance testing [9].
IBD sharing and kinship from population and pedigree data
Without reference to the pedigree structures, we estimated k-coefficients, where k_{i} is the probability that i alleles are shared IBD, using the thinned chromosome 7 and genome-wide panels of markers, as well as subsets of 214 and 1000 genome-wide markers. We estimated k-coefficients for all possible pairs of independent people (n = 1827), using all founders in the pedigrees and other unrelated individuals, and for all pairs of individuals within each pedigree. Kinship coefficients, Φ, were subsequently computed as Φ = 0.25k_{1} + 0.5k_{2} and pairs of individuals with Φ > 0.2 were noted.
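As a small worked illustration of the kinship formula, the sketch below computes Φ from k-coefficients and flags pairs above the 0.2 threshold. The function names and the example k-coefficients (textbook expectations for standard relationships, not estimates from the FHS data) are assumptions added for illustration.

```python
def kinship(k1, k2):
    """Kinship coefficient from the probabilities of sharing 1 and 2 alleles IBD."""
    return 0.25 * k1 + 0.5 * k2

def flag_close_pairs(k_estimates, threshold=0.2):
    """k_estimates: dict mapping a pair label to (k0, k1, k2). Returns flagged labels."""
    return [pair for pair, (_, k1, k2) in k_estimates.items()
            if kinship(k1, k2) > threshold]

# Expected k-coefficients for some standard relationships (hypothetical labels).
expected = {
    "parent-offspring": (0.0, 1.0, 0.0),    # kinship 0.25
    "full sibs":        (0.25, 0.5, 0.25),  # kinship 0.25
    "half sibs":        (0.5, 0.5, 0.0),    # kinship 0.125
    "unrelated":        (1.0, 0.0, 0.0),    # kinship 0.0
}
close = flag_close_pairs(expected)
```

Under these expectations a threshold of 0.2 picks out pairs related at about the first-degree level or closer, while half sibs and more distant relatives fall below it.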
We selected four pairs of individuals who showed high apparent relatedness, as estimated from the thinned chromosome 7 markers, but who had different pedigree numbers. For each pair, the dense chromosome 7 markers were used to detect IBD segments using the model of Thompson [10]. We used a prior marginal pairwise IBD probability of 0.1 and an IBD-change rate parameter giving a prior expected length of chromosome in a particular IBD state of 1 cM, averaged over the nine possible IBD states in accordance with their marginal prior probabilities. The dense chromosome 7 data set was also used to flag tracts of homozygous markers (>9 SNPs in a row) shared between the two members of each of these four pairs.
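The tract-flagging step can be sketched as a simple run-length scan. This example interprets a shared homozygous tract as a run of consecutive markers at which both individuals are homozygous for the same allele; that interpretation, the 0/1/2 genotype coding with None for missing calls, and the function name are assumptions of the sketch rather than details given in the text.

```python
def shared_homozygous_tracts(geno_a, geno_b, min_len=10):
    """Return (start, end) index pairs, inclusive, of shared homozygous runs.

    geno_a, geno_b: genotypes in map order, coded 0/1/2 allele copies, None if missing.
    min_len=10 corresponds to ">9 SNPs in a row".
    """
    n = min(len(geno_a), len(geno_b))
    tracts, start = [], None
    for i in range(n):
        a, b = geno_a[i], geno_b[i]
        shared = a is not None and a == b and a in (0, 2)
        if shared and start is None:
            start = i                      # a new run begins here
        elif not shared and start is not None:
            if i - start >= min_len:
                tracts.append((start, i - 1))
            start = None                   # the run is broken
    if start is not None and n - start >= min_len:
        tracts.append((start, n - 1))      # run extends to the last marker
    return tracts
```

A missing call breaks a run in this version; a more permissive variant could skip missing genotypes instead.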
We used a new "case-control" study design that corrects for relatedness (both known and estimated as cryptic kinship) within the sample [11], choosing 838 "cases" and 844 "controls" from the upper and lower 15^{th} percentiles of the trait distribution in the full data set. The correction for relatedness essentially eliminates inflated test statistics resulting from inclusion of related individuals. We corrected the naïve chi-square statistic p-values using three types of kinship coefficients: pedigree-based prior, pedigree-based posterior, and population-estimated kinship coefficients. The pedigree-based prior was computed based on pedigree structure alone, while the pedigree-based posterior was based on the gl_auto results (described below) that used both pedigree structure and marker data. The dense chromosome 7 marker panel was used for the case-control study, while the thinned chromosome 7 marker panel was used for estimation of kinship coefficients. The population-estimated kinship coefficient was a maximum-likelihood estimate based on the thinned chromosome 7 marker data.
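The tail-sampling design and the naïve (uncorrected) test statistic can be sketched as below. This example selects the lower and upper 15% of the adjusted trait and computes a standard 1-df allelic chi-square test; it does not implement the relatedness correction of [11], whose details are not given here. The function names, the 0/1/2 genotype coding, and the use of an allelic (rather than genotypic) test are assumptions of this sketch.

```python
import math

def select_tails(traits, fraction=0.15):
    """traits: dict of person id to adjusted trait value. Returns (lower_ids, upper_ids)."""
    ranked = sorted(traits, key=traits.get)
    n_tail = round(len(ranked) * fraction)
    return ranked[:n_tail], ranked[len(ranked) - n_tail:]

def allelic_chisq(group1, group2):
    """Naive 1-df allelic chi-square for two groups of 0/1/2 genotypes (None = missing).

    Returns (statistic, p_value). Treats all alleles as independent, which is
    exactly the assumption that relatedness violates.
    """
    def allele_counts(genos):
        called = [g for g in genos if g is not None]
        alt = sum(called)
        return alt, 2 * len(called) - alt
    a, b = allele_counts(group1)
    c, d = allele_counts(group2)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0
    stat = (a + b + c + d) * (a * d - b * c) ** 2 / denom
    # Upper tail of a chi-square with 1 df, via the complementary error function.
    return stat, math.erfc(math.sqrt(stat / 2.0))
```

Because related individuals share alleles, the independence assumption behind this statistic fails in family data, which is why the study corrects it using kinship coefficients.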
Linkage analyses
Two MORGAN [2] programs, lm_multiple and gl_auto, were used for lod score analyses and realization of inheritance indicators conditional on marker data, respectively. Options in both programs now allow the multiple-meiosis sampler to be used with the locus sampler, leading to more accurate Markov-chain Monte Carlo (MCMC) sampling of inheritance indicators on large pedigrees [12]. Additionally, both programs have options to run sequentially over pedigrees, permitting easier processing of output on disjoint pedigrees and allowing for exact computation of lod scores on small (≤14 meioses) pedigrees in lm_multiple, and independent realizations of inheritance indicators in gl_auto. This allows computationally intensive MCMC approximation to be used only where necessary.
Genetic linkage can be detected with pedigree data using inheritance vector realizations. We used the inheritance vectors obtained from gl_auto for two linkage analysis methods: 1) standard variance-components (VC) analysis using SOLAR, and 2) a novel conditional inheritance vector test using the w-score [13], which is the expectation over founder genotypes of a maximized likelihood given those founder genotypes, to test whether we could resolve the number of causal loci in a region of interest indicated by the VC results. The w-score analyses were performed only on the size 49 pedigrees, while VC analysis was performed on this subset as well as on all pedigrees for comparison. We summarized the results using randomized p-values for the conditional test [14] and empirical p-values for the VC analysis through trait simulation and the inheritance vectors described above [9]. We also performed three Bayesian oligogenic joint segregation and linkage analyses on all pedigrees using Loki 2.4.7 [3], where every 100^{th} of 500,000 iterations was used to compute Bayes' factors for the presence of a QTL within each 2-cM bin.
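The idea behind an empirical p-value from trait simulation can be sketched in a few lines: the observed statistic is compared with the statistics obtained from traits simulated under the null hypothesis of no linkage. The function name and the add-one convention (which avoids reporting a p-value of exactly zero) are assumptions of this sketch; the procedure actually used is described in [9].

```python
def empirical_p_value(observed, simulated):
    """Proportion of null-simulated statistics at least as large as the observed one.

    observed: the linkage statistic from the real trait at one position.
    simulated: statistics at the same position from traits simulated without linkage.
    """
    exceed = sum(1 for s in simulated if s >= observed)
    return (exceed + 1) / (len(simulated) + 1)
```

The resolution of such a p-value is limited by the number of simulated traits: with N replicates, the smallest value this version can return is 1/(N + 1).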
Results
IBD sharing and kinship from population and pedigree data
Estimated proportions of IBD for four putatively unrelated pairs of individuals
Pairs                         10895 and 9894              13728 and 11898             19185 and 11156             23487 and 25107
                              I^{b}  S^{c}  g^{d}  G^{e}  I      S      g      G      I      S      g      G      I      S      g      G
k_{0}^{a}                     0.13   0.11   0.27   0.26   0.44   0.39   0.43   0.53   0.45   0.33   1.00   0.98   0.17   0.19   0.39   0.50
k_{1}                         0.48   0.44   0.41   0.52   0.45   0.52   0.45   0.36   0.51   0.62   0.00   0.02   0.70   0.70   0.61   0.49
IBD within individuals^{f}    0.16                        0.37                        0.26                        0.28
No. homozygous tracts         3                           5                           3                           7
Range of tract length (SNPs)  [10:14]                     [6:12]                      [5:11]                      [3:21]
Linkage analyses
Randomized p-values summarize test significance as well as uncertainty of the test results. For example, although we find significant evidence for linkage near 38 cM using VC analysis and trait resimulation, the conditional test p-values in the same region are uncertain, as indicated by the range of p-values estimated at that position. To resolve these uncertainties, we would need to use markers at greater density or with greater polymorphism levels to infer the inheritance vectors in this region.
Discussion
As a rule, more marker data increases the stability and apparent accuracy of IBD estimates (Figure 1), as the number of pairs of independent people with Φ > 0.2 declined dramatically with increasing numbers of loci analyzed. The distribution of loci across the genome also influenced the estimated IBD sharing between pairs of independent people, as 214 markers from across the genome identified fewer pairs of independent people with k_{1} > 0.8 than 214 markers from chromosome 7.
However, a relatively modest number of markers was needed to achieve this stability. Although we see a dramatic difference between the k-statistics estimated from 214 vs. 1000 genome-wide markers, little difference is observed between estimates from 1000 vs. 3465 genome-wide markers. This suggests that we can thin our marker panels to avoid the effects of linkage disequilibrium without strongly altering our inferred k-statistics. This also allows for the generation of multiple, equivalent, thinned data sets from a single dense data set, to be used for replication purposes.
Even with thousands of markers, relationships between some pairs of independent people were detected by several approaches. Some of these pairs shared pedigree numbers but could not be connected in the pedigree file. This suggests that our methods were able to detect real relatives even when they were not labelled as such. However, some pairs had unique pedigree numbers, suggesting some cryptic relatedness among the FHS participants and raising the possibility that adjustment for such relationships might be necessary in some analyses.
Relatedness, known or cryptic, clearly inflated uncorrected p-values in the case-control study. Fortunately, our corrections using pedigree and marker-estimated relatedness worked well. Because many case-control studies do not have access to pedigree data, this suggests that our method may be applied to genome-wide association studies without the need for additional pedigree information. The pedigree-posterior estimate of relatedness overcorrected our test statistic, although as the analysis used only the thinned chromosome 7 markers, this is not surprising given the results in Figure 1.
The number and strength of linkage signals varied across methods and by the amount of data used. Linkage analyses, but not the case-control analyses, provided evidence for HDL loci on chromosome 7. The bimodality of the linkage signal between 20 and 40 cM was more clearly defined with the more computationally intensive and trait-model-based w-score and MCMC-based oligogenic linkage analyses. The w-score also identified a possible linkage signal near 95 cM, although the confidence in actual p-values varied across the chromosome. Analysis of the size 49 pedigrees emphasized the peak near 20 cM at the expense of the peak near 40 cM, while analyses of all pedigrees identified a modest signal near 180 cM. This signal was detected in a previous GAW [5], suggesting that with more data there may be additional evidence for this locus.
Our novel methods were useful in a variety of situations. Inheritance vectors generated by gl_auto were used in the VC analyses and in empirical significance testing. Analysis of IBD segments identified wide swathes of shared chromosomal regions between pairs of independent people with patterns not visible in a single summary statistic. The method to correct case-control studies for relatedness was practical and effective, and the w-score provided both localization and confidence of information in linkage analysis. Although encompassing a wide range of approaches, these methods show clear promise for future work.
Conclusion
The use of additional data and analytical methods of increasing complexity appears to have paid dividends. However, there are clearly limits. More markers provide better IBD sharing estimates, but a marker density greater than one marker per 1 to 3 cM would likely give only a slight improvement. Correcting case-control studies for relatedness is effective, relatively simple, and can be done using marker data alone. Linkage analyses of greater complexity identified more, albeit weak, linkage signals than simpler analyses. It would appear that all association and linkage methods are capable of detecting strong and clear signals. Because not all studies are fortunate enough to have strong signals, sophisticated analytical tools and large, but not too large, data sets may deliver additional results along with their complexity.
List of abbreviations used
BMI: Body mass index
FHS: Framingham Heart Study
GAW: Genetic Analysis Workshop
HDL: High-density lipoprotein
IBD: Identity by descent
MCMC: Markov-chain Monte Carlo
QTL: Quantitative trait locus
SNP: Single-nucleotide polymorphism
VC: Variance components
Declarations
Acknowledgements
Funding to the authors was provided by NIH grants GM46225 (EAT, EMW, MS, YD), HL30086 (EMW), AG05136 (EMW, YC, CC), HD055782 (EMW), AG00258 (EM), and GM075091 (FB). The Genetic Analysis Workshops are supported by NIH grant GM031575 and the Framingham Heart Study is supported by NHLBI grant N01HC25195.
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S7.
References
1. Kong X, Murphy K, Raj T, He C, White PS, Matise TC: A combined linkage and physical map of the human genome. Am J Hum Genet. 2004, 75: 1143-1148. 10.1086/426405.
2. MORGAN. [http://www.stat.washington.edu/thompson/Genepi/MORGAN/Morgan.shtml]
3. Heath SC: Markov chain Monte Carlo segregation and linkage analysis for oligogenic models. Am J Hum Genet. 1997, 61: 748-760. 10.1086/515506.
4. Shearman AM, Ordovas JM, Cupples LA, Schaefer EJ, Harmon MD, Shao Y, Keene JD, DeStefano AL, Joost O, Wilson PW, Housman DE, Myers RH: Evidence for a gene influencing the TG/HDL-C ratio on chromosome 7q32.3-qter: a genome wide scan in the Framingham study. Hum Mol Genet. 2000, 9: 1315-1320. 10.1093/hmg/9.9.1315.
5. George AW, Sasu S, Li N, Rothstein JH, Sieberts SK, Stewart W, Wijsman EM, Thompson EA: Approaches to mapping genetically correlated complex traits. BMC Genet. 2003, 4 (suppl 1): S71. 10.1186/1471-2156-4-S1-S71.
6. Horne BD, Malhotra A, Camp NJ: Comparison of linkage analysis methods for genome-wide scanning of extended pedigrees, with application to the TG/HDL-C ratio in the Framingham Heart Study. BMC Genet. 2003, 4 (suppl 1): S93. 10.1186/1471-2156-4-S1-S93.
7. Kathiresan S, Manning AK, Demissie S, D'Agostino RB, Surti A, Guiducci C, Gianniny L, Burtt NP, Melander O, Orho-Melander M, Arnett DK, Peloso GM, Ordovas JM, Cupples LA: A genome-wide association study for blood lipid phenotypes in the Framingham Heart Study. BMC Med Genet. 2007, 8 (suppl 1): S17. 10.1186/1471-2350-8-S1-S17.
8. Zhang X, Wang K: Bivariate linkage analysis of cholesterol and triglyceride levels in the Framingham Heart Study. BMC Genet. 2003, 4 (suppl 1): S62. 10.1186/1471-2156-4-S1-S62.
9. Igo RP, Wijsman EM: Empirical significance values for linkage analysis: trait simulation using posterior model distributions from MCMC oligogenic segregation analysis. Genet Epidemiol. 2008, 32: 119-131. 10.1002/gepi.20267.
10. Thompson EA: The IBD process along four chromosomes. Theor Popul Biol. 2008, 73: 369-373. 10.1016/j.tpb.2007.11.011.
11. Choi Y, Wijsman EM, Weir BS: Case-control association testing in the presence of unknown relationships. Genet Epidemiol in press.
12. Tong L, Thompson EA: Multilocus lod scores in large pedigrees: combination of exact and approximate calculations. Hum Hered. 2008, 65: 142-153. 10.1159/000109731.
13. Di Y, Thompson EA: Conditional tests for localizing trait genes. Hum Hered. 2009, 68: 139-150. 10.1159/000218112.
14. Thompson EA, Geyer CJ: Fuzzy p-values in latent variable problems. Biometrika. 2007, 94: 49-60. 10.1093/biomet/asm001.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.