Identifying variants that contribute to linkage for dichotomous and quantitative traits in extended pedigrees
© Chen et al; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
Compared to genome-wide association analysis, linkage analysis is less influenced by allelic heterogeneity. The use of linkage information in large families should provide a great opportunity to identify less frequent variants. We perform a linkage scan for both dichotomous and quantitative traits in eight extended families. For the dichotomous trait, we identified one linkage region on chromosome 4q. For quantitative traits, we identified two regions on chromosomes 4q and 6p for Q1 and one region on chromosome 6q for Q2. To identify variants that contribute to these linkage signals, we performed standard association analysis in genomic regions of interest. We also screened less frequent variants in the linkage region based on the risk ratio and phenotypic distribution among carriers. Two rare variants at VEGFC and one common variant on chromosome 4q conferred the greatest risk for the dichotomous trait. We identified two rare variants on chromosomes 4q (VEGFC) and 6p (VEGFA) that explain 12.4% of the total phenotypic variance of trait Q1. We also identified four variants (including one at VNN3) on chromosome 6q that are able to drop the linkage LOD from 3.7 to 1.0. These results suggest that the use of classical linkage and association methods in large families can provide a useful approach to identifying variants that are responsible for diseases and complex traits in families.
Common variants have been successfully identified for many diseases and complex traits through the use of genome-wide association studies (GWAS). Although many of the findings from GWAS have been replicated in different populations, current association results have yet to explain many existing linkage regions of interest. Furthermore, GWAS have limited power to identify rarer variants, in contrast to linkage analysis, which is not sensitive to allelic heterogeneity. Thus the use of linkage information in family data (especially in large pedigrees) provides great opportunities to identify rarer variants.
In this paper, we present our analysis of the Genetic Analysis Workshop 17 (GAW17) family data (the first set of simulated family data, analyzed without knowledge of the underlying simulation model). Our data consist of one dichotomous trait (with 30% of individuals being affected) and three quantitative traits for 697 individuals from 8 extended families. Each individual has genotypes at 24,487 single-nucleotide polymorphisms (SNPs) across 22 autosomes, and variants at more than 85% of these SNPs are either rare (minor allele frequency [MAF] < 0.01) or less frequent (0.01 < MAF < 0.05). The GAW17 family data are rich in relative pairs (579 sib pairs, 2,430 second-degree relative pairs, and thousands of other types). Each family includes four generations, with family size varying from 73 to 128. For the dichotomous trait, there are 48 affected sib pairs, 251 affected second-degree relative pairs, 202 affected third-degree relative pairs, 43 affected fourth-degree relative pairs, and 5 affected fifth-degree relative pairs. The use of more distantly related relative pairs has the potential to increase the power to identify rare, infrequent, and common variants.
Rather than breaking large pedigrees into smaller ones and then applying a standard software package (such as Merlin), in this analysis we develop new implementations to perform genome-wide linkage scans in large pedigrees for both dichotomous and quantitative traits. We then investigate the association of variants in the identified linkage regions.
Linkage and association methods for dichotomous traits
for each affected relative pair (parent-offspring pairs are excluded for lack of variation). Because the covariate Age is a strong risk factor for the dichotomous trait (a 10-year increase in age doubles the risk of being affected), it is crucial to take into account the effect of aging in the NPL analysis. We include only affected relative pairs with an age difference no larger than 16 years in the NPL analysis. Although this threshold is somewhat arbitrary, we subsequently applied other threshold values to ensure that our results were not too sensitive to this value. Note that our linkage results are not inflated by potential linkage disequilibrium (LD) between adjacent markers, because the IBD statistics are known in our simulated data. In practice, IBD statistics need to be estimated, and the LD needs to be properly modeled in the linkage analysis when genotype data for the parents of affected sib pairs are not complete.
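The pair-selection rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record layout (`id`, `family`, `affected`, `age`, `parents`) and the function name `eligible_pairs` are hypothetical, and the sketch does not check that two members of the same family are actually related (for example, married-in spouses would still need to be excluded).

```python
from itertools import combinations

def eligible_pairs(people, max_age_gap=16):
    """Return affected within-family pairs whose ages differ by at most
    max_age_gap years, skipping parent-offspring pairs."""
    affected = [p for p in people if p["affected"]]
    pairs = []
    for a, b in combinations(affected, 2):
        if a["family"] != b["family"]:
            continue  # only within-family pairs carry linkage information
        if a["id"] in b["parents"] or b["id"] in a["parents"]:
            continue  # parent-offspring pairs are excluded
        if abs(a["age"] - b["age"]) <= max_age_gap:
            pairs.append((a["id"], b["id"]))
    return pairs

# Hypothetical toy data: one grandparent, two affected sibs, one grandchild.
people = [
    {"id": "g1", "family": 1, "affected": True, "age": 70, "parents": ()},
    {"id": "c1", "family": 1, "affected": True, "age": 45, "parents": ("g1", "g2")},
    {"id": "c2", "family": 1, "affected": True, "age": 40, "parents": ("g1", "g2")},
    {"id": "k1", "family": 1, "affected": True, "age": 20, "parents": ("c1", "x")},
    {"id": "u1", "family": 1, "affected": False, "age": 42, "parents": ("g1", "g2")},
    {"id": "o1", "family": 2, "affected": True, "age": 44, "parents": ()},
]

pairs_16 = eligible_pairs(people)                   # default 16-year threshold
pairs_20 = eligible_pairs(people, max_age_gap=20)   # a looser threshold
```

Rerunning the scan with several values of `max_age_gap` mirrors the sensitivity check mentioned in the text.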
To examine the association of SNPs with simulated phenotypes, we apply the standard transmission disequilibrium test (TDT) and the more recent generalized disequilibrium test (GDT) to the simulated data. The GDT is able to make use of all discordant relative pairs in extended pedigrees to compare the allele frequency differences between affected and unaffected individuals within families. Given the potential lack of power of existing association methods to detect rare variants, we developed a strategy to screen rare and infrequent variants in the linkage region. At each SNP with MAF < 0.05, we compute the odds of being affected among carriers of the variant. If the odds among carriers of the variant are much larger (or smaller) than the overall odds, we perform follow-up analyses for this SNP and consider the overall effect of the collapsed rare and infrequent variants.
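A minimal sketch of this screening step is given below. The text does not say how much larger or smaller the carrier odds must be, so the `fold` parameter (default 2.0) is an assumption made for illustration, as are the function names. The overall odds of 0.3/0.7 follow from the 30% affection rate reported for these data.

```python
import math

def count_carriers(genotypes, affected):
    """Count affected and unaffected carriers of the minor allele.
    genotypes: minor-allele counts (0, 1, or 2); affected: booleans."""
    n_aff = sum(1 for g, a in zip(genotypes, affected) if g > 0 and a)
    n_unaff = sum(1 for g, a in zip(genotypes, affected) if g > 0 and not a)
    return n_aff, n_unaff

def carrier_odds(n_affected, n_unaffected):
    """Odds of being affected among carriers (inf if no unaffected carriers,
    nan if there are no carriers at all)."""
    if n_unaffected == 0:
        return math.inf if n_affected > 0 else math.nan
    return n_affected / n_unaffected

def flag_variant(n_affected, n_unaffected, overall_odds, fold=2.0):
    """Flag a variant whose carrier odds differ from the overall odds by at
    least `fold` in either direction (threshold is an assumption)."""
    odds = carrier_odds(n_affected, n_unaffected)
    if math.isnan(odds):
        return False
    return odds >= fold * overall_odds or odds <= overall_odds / fold

overall_odds = 0.3 / 0.7                           # 30% of individuals are affected
example_flag = flag_variant(9, 6, overall_odds)    # carrier odds 9/6 = 1.5
```

A likelihood-based criterion that gives less weight to variants with few carriers would be a natural refinement of this simple ratio rule.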
Linkage and association methods for quantitative traits
For the quantitative traits, we implement a score-based robust linkage analysis. Although our test statistic is identical to that in the regression-based method, our software implementation allows much larger pedigrees than the Merlin-regress software (the key element Cov(π_ij, π_kl) in the test statistic can be conveniently calculated as a function of the pedigree structure without using the SNP data). Covariates in the linear regression of the quantitative traits include Age, Smoking status, and the first principal component from a multidimensional scaling (MDS) structure analysis in which the family structure is incorporated. These covariates are adjusted in the robust score test.
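To illustrate what "a function of the pedigree structure" means, the sketch below computes kinship coefficients φ_ij from parent links alone, using the standard recursive definition; for noninbred relatives the expected IBD sharing is 2φ_ij. This is not the authors' formula for Cov(π_ij, π_kl), which is not given here; it shows only the simpler expected-sharing quantity. The function name and input layout are hypothetical.

```python
def kinship(pedigree):
    """Kinship coefficients from pedigree structure only.
    pedigree: list of (id, father, mother), with parents listed before
    their children and None for the parents of founders.
    Returns a dict mapping (id_i, id_j) to the kinship coefficient."""
    parents = {pid: (f, m) for pid, f, m in pedigree}
    order = [pid for pid, _, _ in pedigree]
    phi = {}

    def get(a, b):
        # Unknown (founder) parents are treated as unrelated to everyone.
        if a is None or b is None:
            return 0.0
        return phi[(a, b)]

    for idx, i in enumerate(order):
        f, m = parents[i]
        # Self-kinship: 1/2 plus half the kinship between the two parents.
        phi[(i, i)] = 0.5 * (1.0 + get(f, m))
        # Kinship with every earlier individual j: average over i's parents.
        for j in order[:idx]:
            value = 0.5 * (get(j, f) + get(j, m))
            phi[(i, j)] = value
            phi[(j, i)] = value
    return phi

# Hypothetical three-generation pedigree.
ped = [
    ("F", None, None), ("M", None, None), ("P", None, None),
    ("S1", "F", "M"), ("S2", "F", "M"),
    ("G", "S1", "P"),
]
phi = kinship(ped)
expected_sharing_sibs = 2 * phi[("S1", "S2")]   # expected IBD proportion for sibs
```

Because only parent links are used, the cost depends on pedigree size rather than on the number of markers.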
We examine the identified linkage region using the variance component score test as implemented in the GDT software package (through the parameter fastAssoc). Although the algorithm implemented in the GDT package is identical to the one implemented in Merlin, the GDT implementation can handle much larger pedigrees because the (time-consuming) Lander-Green algorithm is required by Merlin but not by GDT. We then perform additional association scans that adjust for the most significant SNPs. Finally, we fit a variance component model to estimate the unexplained heritability and the effect of each variant associated with the quantitative trait.
Results and discussion
For the dichotomous trait, Age is the most statistically significant risk factor (p = 2 × 10^−16), with a 10-year increase in age doubling the risk of being affected. Smoking is the second largest risk factor. Sex is not significantly associated with the affection status. Our NPL scan on 22 autosomes with adjustment for Age revealed one significant linkage region on chromosome 4q with maximum LOD = 3.1 at position 170.34 Mb and a one-LOD support interval between 142.8 Mb and 177.9 Mb. The linkage evidence is derived primarily from the largest family (family 7, consisting of 128 individuals). Family 7 alone provides the maximum LOD (4.2) in a region between 153.57 Mb and 177.90 Mb.
The evidence supporting linkage remains strong even with a variable age threshold. When Age is not incorporated into the NPL analysis, the maximum LOD is 2.5 in the same region. We also identified a region on chromosome 9 with suggestive evidence in support of linkage (maximum LOD = 2.9 in a region between 4.5 Mb and 7.0 Mb). However, this result is sensitive to the age threshold. When Age is not incorporated into the analysis, the evidence supporting linkage dropped to LOD = 1.7; thus we restricted our subsequent analyses to the region on chromosome 4q.
To localize the variants that contribute to the evidence supporting linkage, we first performed TDT and GDT analyses. No significant association was found (no associations have a p < 0.001). Rather than comparing allele frequency differences between affected and unaffected individuals, we computed the odds of being affected among carriers of rare and infrequent variants. We estimated allele frequencies based on all 202 founders, representing the general population. In the linkage region, four SNPs had odds greater than or equal to 1: C4S4373 at 167.01 Mb (odds = 9/6, MAF = 0.002), C4S4915 at 176.14 Mb (odds = 10/8, MAF = 0.005), C4S4916 at 176.14 Mb (odds = 9/6, MAF = 0.002), and C4S4935 at 177.85 Mb (odds = 16/15, MAF = 0.002). C4S4373 and C4S4916 (9 Mb apart) are in complete LD (r^2 = 1, D′ = 1), so only one of these two SNPs needed to be further studied. In addition, C4S4915 and C4S4916 (101 bp apart) are in strong LD. All C4S4916 variant carriers were also C4S4915 variant carriers, whereas three of the C4S4915 carriers were not C4S4916 carriers. All C4S4916 and C4S4935 carriers were private to family 7 and were not present in other families.
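For reference, the two LD measures quoted above can be computed from two-locus haplotype frequencies as in the sketch below. It assumes phased haplotypes coded 0/1 at each SNP, which is an assumption of this illustration rather than a statement about how the LD values in the text were obtained; the function name is hypothetical.

```python
def ld_stats(haplotypes):
    """Return (D, |D'|, r^2) for two biallelic SNPs.
    haplotypes: list of (a, b) pairs, each allele coded 0 or 1."""
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # allele-1 frequency at SNP 1
    p_b = sum(b for _, b in haplotypes) / n          # allele-1 frequency at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    var_product = p_a * (1 - p_a) * p_b * (1 - p_b)
    if var_product == 0:
        raise ValueError("at least one SNP is monomorphic")
    d = p_ab - p_a * p_b
    # D is scaled by its maximum attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d, abs(d) / d_max, d * d / var_product

# Toy example: the rare alleles always travel together (complete LD).
complete = [(1, 1)] + [(0, 0)] * 9
d, d_prime, r2 = ld_stats(complete)
```

When two rare alleles are always carried together, as for C4S4373 and C4S4916, both measures reach 1 and the SNPs cannot be distinguished statistically.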
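Table: Variants in the chromosome 4q linkage region that are associated with the dichotomous trait in family 7 (table body not recovered; the only surviving column header is "Number of smokers").
Table: Logistic regression model for the dichotomous trait (row labels not recovered; surviving values, in order: 2 × 10^−16, 2.6 × 10^−5, 1.5 × 10^−2, 3.5 × 10^−5, 3.6 × 10^−3).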
In addition to the SNP-by-SNP association analysis, the rare variants with larger effect identified through our screening procedure were collapsed within genes before the association analysis. We assumed that the collapsed genotype of an individual was a heterozygote if the individual was a carrier of any of the rare variants (a homozygote with two copies of the rare variant is rather rare). However, the GDT p-values were not substantially improved using the collapsed alleles.
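The collapsing rule can be sketched as follows: an individual is coded as a heterozygote (1) for a gene if he or she carries any of the selected rare variants in that gene, and as a noncarrier (0) otherwise. The data layout, function name, and SNP and gene labels are hypothetical.

```python
def collapse_by_gene(genotypes, snp_to_gene):
    """Collapse selected rare variants into one pseudo-genotype per gene.
    genotypes: {person: {snp: minor-allele count}}
    snp_to_gene: {snp: gene} for the variants selected by the screen.
    Returns {gene: {person: 0 or 1}}."""
    genes = sorted(set(snp_to_gene.values()))
    collapsed = {gene: {} for gene in genes}
    for person, calls in genotypes.items():
        for gene in genes:
            collapsed[gene][person] = 0          # default: noncarrier
        for snp, count in calls.items():
            gene = snp_to_gene.get(snp)
            if gene is not None and count > 0:
                collapsed[gene][person] = 1      # carrier of any rare variant
    return collapsed

# Hypothetical example with two genes and three individuals.
snp_to_gene = {"snpA": "GENE1", "snpB": "GENE1", "snpC": "GENE2"}
genotypes = {
    "p1": {"snpA": 0, "snpB": 0, "snpC": 1},
    "p2": {"snpA": 1, "snpB": 0, "snpC": 0},
    "p3": {"snpA": 0, "snpB": 2, "snpC": 0},
}
collapsed = collapse_by_gene(genotypes, snp_to_gene)
```

Treating rare homozygotes as heterozygotes (as for `p3` here) matches the simplifying assumption stated in the text.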
The MDS structure analysis identified population substructure among founders, and principal components for nonfounders were approximated according to their relationship relative to founders. We found that one principal component was sufficient to represent the substructure in the data. The first principal component was included in the analysis as a covariate with Age and Smoking status for the quantitative traits. A polygenic analysis demonstrated that all quantitative traits were highly heritable, with h^2 = 0.615 (Q1), h^2 = 0.432 (Q2), and h^2 = 0.697 (Q4). A bivariate analysis identified a modest genetic correlation between Q1 and Q2 (r_G = 0.255), suggesting that these two traits may share genes in common. There was no evidence of a common genetic basis between Q4 and the other two quantitative traits.
The robust quantitative trait linkage analysis identified two linkage regions for Q1 and one linkage region for Q2. The first region for Q1 is on chromosome 4q, overlapping with the linkage region for the dichotomous trait. The maximum linkage support is LOD = 14.8 at 167.1 Mb, with a wide region of support (88.0–186.1 Mb with LOD > 3). The second region supporting linkage for Q1 is on chromosome 6p (LOD = 9.1, 25.6–26.4 Mb) with a large support region (LOD > 3 from 0 to 80.0 Mb). The region supporting linkage for Q2 is on chromosome 6q at position 143.6 Mb, with a maximum LOD of 3.7.
Table: Variance component regression model for Q1 (row labels not recovered; surviving values, in order: 1.9 × 10^−72, 1.3 × 10^−11, 8.8 × 10^−3, 2.3 × 10^−15, 8.1 × 10^−16).
The estimated heritability was reduced with the two rare variants included, from h^2 = 0.615 to h^2 = 0.491, consistent with the two variants explaining 12.4% of the total phenotypic variance. In the largest family (family 7), the two variants explained 26.7% of the total phenotypic variance (the estimated heritability was reduced from 0.814 to 0.547). We screened the rare variants according to the phenotypic distribution of the carriers, but we did not identify any other rare variants that contributed independently to the variation beyond that from C4S4935 and C6S2981. On chromosome 6p, besides C6S2981 with a large effect on Q1, other potential rare variants include C6S752 at 25.83 Mb, C6S2245 at 31.71 Mb, and C6S2432 at 32.92 Mb; however, all C6S752 carriers (and all except one C6S2245 carrier) are also C6S2981 carriers, and all C6S2432 carriers are also C6S752 carriers. Thus an LD block exists in the region between 25.83 Mb and 43.85 Mb. This LD block could explain why the SNP with the strongest effect is at 43.85 Mb even though the linkage peak is at 26 Mb.
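The explained-variance figures follow from the drop in estimated heritability, under the assumption (made here for illustration) that the total phenotypic variance is the same in both models, so that the variance attributed to the two variants is removed from the polygenic component:

```latex
\begin{align*}
\Delta h^2_{\text{all families}} &= 0.615 - 0.491 = 0.124 \quad (12.4\%),\\
\Delta h^2_{\text{family 7}}     &= 0.814 - 0.547 = 0.267 \quad (26.7\%).
\end{align*}
```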
Q2 exhibits significant linkage on chromosome 6q (with a maximum LOD = 3.7 at 143.6 Mb). By screening the phenotypic distribution among the carriers, we identified two rare variants, one less frequent variant, and one common variant that partly explained the evidence for linkage in this region: C6S5449 at 133.1 Mb (MAF = 0.005, at gene VNN3), C6S6047 at 144.8 Mb (MAF = 0.012, at gene UTRN), C6S6659 at 155.5 Mb (MAF = 0.064, at gene TIAM2), and C6S6839 at 155.6 Mb (MAF = 0.002, at gene TIAM2 private to family 4). When the four SNPs were adjusted in the linkage analysis, the LOD was reduced from 3.7 to 1.0. Although the four variants were able to explain most of the linkage in the region, they had small effects on Q2: They explained only 4.3% of total phenotypic variance, and none of the variants were significantly associated with Q2.
The linkage region and the candidate variants established in this work could be crucial for future functional analysis. The LD confounds the choice of functional SNPs. Functional analysis that follows up on this work could further determine the functional variants in the region.
Our linkage and association analysis identified two rare variants (at VEGFC) and one common variant on chromosome 4q for the dichotomous trait, two rare variants on chromosomes 4q (at VEGFC) and 6p (at VEGFA) that explain 12.4% (or 26.7% in the largest family) of the phenotypic variance of trait Q1, and rare variants at VNN3 on chromosome 6q that explain the linkage to trait Q2. No linkage regions were identified for trait Q4. The variant at VEGFC on chromosome 4q underlies both the dichotomous trait and the quantitative trait Q1. Compared to the true model that was used to simulate the GAW17 family data, our linkage and association findings for all four traits are confirmed.
Although linkage could be due to a single variant in a gene, as in the simulated GAW17 family data, given no prior knowledge of the genetic model, we should not rule out the possibility that linkage could be due to multiple variants in a gene. Therefore screening for rarer variants with larger effect is crucial for the association analysis. Our current screening procedure, which is based on the odds of being affected among carriers, is somewhat preliminary. Some further improvement could consider the likelihood of affection status among carriers (which gives less weight to a small number of carriers). Bioinformatic annotation (e.g., nonsynonymous SNPs only) should be incorporated as well.
The variants identified in our analysis are rare in the general population (e.g., 1 out of 404) and would be difficult to identify in a population-based study. These variants have a much higher frequency in some families, and our work shows that the use of linkage and association in large families provides a powerful way to identify variants that are responsible for diseases and complex traits.
This work is partly supported by National Institutes of Health (NIH) grants 5UC2 HL103010-02 and 7U01 DK062418. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575. We thank Josyf Mychaleckyj for several thoughtful suggestions and the two anonymous reviewers for valuable input that improved the manuscript.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
- Lander ES, Schork NJ: Genetic dissection of complex traits. Science. 1994, 265: 2037-2048. 10.1126/science.8091226.
- Almasy LA, Dyer TD, Peralta JM, Kent JW, Charlesworth JC, Curran JE, Blangero J: Genetic Analysis Workshop 17 mini-exome simulation. BMC Proc. 2011, 5 (suppl 9): S2. 10.1186/1753-6561-5-S9-S2.
- Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin: rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002, 30: 97-101. 10.1038/ng786.
- Sham PC, Purcell S, Cherny SS, Abecasis GR: Powerful regression-based quantitative-trait linkage analysis of general pedigrees. Am J Hum Genet. 2002, 71: 238-253. 10.1086/341560.
- Chen WM, Abecasis GR: Estimating the power of variance component linkage analysis in large pedigrees. Genet Epidemiol. 2006, 30: 471-484. 10.1002/gepi.20160.
- Abecasis GR, Wigginton JE: Handling marker-marker linkage disequilibrium: pedigree analysis with clustered markers. Am J Hum Genet. 2005, 77: 754-767. 10.1086/497345.
- Spielman RS, McGinnis RE, Ewens WJ: Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993, 52: 506-516.
- Chen WM, Manichaikul A, Rich SS: A generalized family-based association test for dichotomous traits. Am J Hum Genet. 2009, 85: 364-376. 10.1016/j.ajhg.2009.08.003.
- Chen WM, Broman KW, Liang KY: Power and robustness of linkage tests for quantitative traits in general pedigrees. Genet Epidemiol. 2005, 28: 11-23. 10.1002/gepi.20034.
- Chen WM, Broman KW, Liang KY: Quantitative trait linkage analysis by generalized estimating equations: unification of variance components and Haseman-Elston regression. Genet Epidemiol. 2004, 26: 265-272. 10.1002/gepi.10315.
- Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM: Robust relationship inference in genome-wide association studies. Bioinformatics. 2010, 26: 2867-2873. 10.1093/bioinformatics/btq559.
- Chen WM, Abecasis GR: Family-based association tests for genomewide association scans. Am J Hum Genet. 2007, 81: 913-926. 10.1086/521580.
- Lander ES, Green P: Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci USA. 1987, 84: 2363-2367. 10.1073/pnas.84.8.2363.
- Liang KY, Zeger SL: Longitudinal data analysis using generalized linear models. Biometrika. 1986, 73: 13-22. 10.1093/biomet/73.1.13.
- Pilia G, Chen WM, Scuteri A, Orru M, Albai G, Dei M, Lai S, Usala G, Lai M, Loi P, et al: Heritability of cardiovascular and personality traits in 6,148 Sardinians. PLoS Genet. 2006, 2: e132. 10.1371/journal.pgen.0020132.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.