Volume 5 Supplement 9
Genetic Analysis Workshop 17: Unraveling Human Exome Data
Finding genes that influence quantitative traits with tree-based clustering
- Ian J Wilson^{1}Email author,
- Richard AJ Howey^{1},
- Darren T Houniet^{1} and
- Mauro Santibanez-Koref^{1}
DOI: 10.1186/1753-6561-5-S9-S98
© Wilson et al; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
Abstract
We present a new statistical method to identify genes in which one or more variants influence quantitative traits. We use the Genetic Analysis Workshop 17 (GAW17) data set of unrelated individuals as a test of the method on the raw GAW17 phenotypes and on residuals after fitting linear models to individual-based covariates. By performing appropriate randomization tests, we found many significant results for a proportion of the genes that contain variants that directly contribute to disease but that have an increased type I error for analyses of raw phenotypes. Power calculations show that our methods have the ability to reliably identify a subset of the loci contributing to disease. When we applied our method to derived phenotypes, we removed many false positives, giving appropriate type I error rates at little cost to power. The correlation between genome-wide heterozygosity and the value of the trait Q1 appears to drive much of the type I error in this data set.
Background
Multilocus approaches to associations between variants and traits are likely to be advantageous when rare single-nucleotide polymorphisms (SNPs), which have an undetectable effect on a trait when considered singly, can explain a large proportion of the genetic variance at a locus when they are taken together [1–3]. Investigators have taken a gene-centric approach to association testing using, for example, entropy [4], weighted sums [2, 3], and distance measures [5] to summarize information across different sites. Our approach uses data-driven tree-based clustering to combine genotypes across multiple loci. The tree structure makes our algorithm an efficient way to search through SNPs that best explain the difference in quantitative trait values. Our tree construction method ensures that genotypes that differ by a mutation at a single locus always cluster on the tree and gives an easily interpretable visualization of the SNPs at a locus that affects the trait.
Sevon et al. [6] developed a method, TreeDT, that uses lexical sorting of haplotypes to produce a tree-based test of association. We use the idea of lexical trees but extend it by using multilocus genotypes and by working with quantitative traits rather than case-control status. The method can be used on haplotypes, but using multilocus genotypes is a natural extension when we are interested in the effects of rare variants, because these variants are unlikely to be present in two copies and phasing of such variants is much more uncertain [7]. We develop this method to work with rare variants by using recoded multilocus genotypes rather than haplotypes and by extending the statistics used to look for associations between quantitative traits and the tree structure. Every node on the tree represents a multilocus genotype that appears one or more times in the population. Shorter multilocus genotypes are situated at internal nodes of the tree. This method provides a pictorial summary of the information contained in a region at different genes along the chromosome. The methods presented here are implemented both as a stand-alone program and as an R library [8].
Methods
Data preparation
Genotype frequencies for example data set
Multilocus genotype | |||
---|---|---|---|
Allele count | Code | Control counts | Case counts |
0-0-0-0 | 0-0-0-0 | 27 | 18 |
0-0-0-1 | 0-0-0-1 | 7 | 8 |
0-0-1-0 | 0-0-1-0 | 4 | 13 |
0-0-2-0 | |||
0-0-1-1 | 0-0-1-1 | 0 | 6 |
0-0-1-2 | |||
0-0-2-1 | |||
0-0-2-2 | |||
0-1-0-0 | 0-1-0-0 | 9 | 4 |
0-2-0-0 | |||
1-0-0-0 | 1-0-0-0 | 4 | 0 |
Principal components and individual variation
Derived phenotype data
where ε ~ N(0, σ^{2}), using backwards stepwise selection on the explanatory variables and using the Bayesian information criterion [11] to decide which variables are retained. After model fitting, we use the standardized residuals as phenotypes. We calculate a second set of derived phenotypes in the same way but with the initial model also containing the first six variable PC loadings, which we label PC1, … , PC6.. All of this model selection is done on phenotype data set 1. We then fit the selected models individually to each of the 200 replicate data sets and use their standardized residuals as phenotype data. All calculations are performed using standard R functions.
Building trees from genetic data
Test statistics
Obtaining a tree test statistic is a two-stage process. First, we require values of partial test statistics defined on the leaves and internal nodes of our tree. A variety of test statistics are available, but we are restricted to those that depend only on values of a trait at a node and its descendants. Using information at ancestral nodes or on nodes on other branches of the tree (such as other individuals from the same population) is not possible within this framework. We use the term disjoint here for a set of nodes in which none of the nodes is the ancestor of another. For further details see Sevon et al. [6].
that is, , but other approaches are possible and implemented in our program. As a side effect of the calculation, we obtain intermediate values for S_{ j }, j = 1, …, k − 1. Typically for our tests, we take k = 10.
Significance testing
The null distribution of S_{ k } is not available. Because the GAW17 data were sampled from different populations, a straightforward randomization was not appropriate and individuals were randomized within populations; that is, the traits were randomly relabeled within each population so that the mean and standard deviation of the traits stayed the same within populations across replicate permutations. The seven different populations used were Luhya, Yoruba, Japanese, Denver Chinese, Han Chinese, CEPH (European-descended residents of Utah), and Tuscan.
We calculate statistical significance with 10^{6} randomized replicates per gene for the phenotype data set. We perform power calculations over multiple data sets using 10^{5} replicate simulations over all 200 data sets. One hundred thousand replicate permutations for all 3,205 genes of a single data set typically takes about 60 minutes, and these calculations are easy to perform in parallel, making them feasible for whole-genome data. The R package QTLTree [8] is available from IJW’s website (http://www.staff.ncl.ac.uk/i.j.wilson).
Results
Model choice derived phenotypes
The backwards model selection results in three additional derived phenotypes: Q1A, the residuals after fitting Age and Smoke; Q1B, the residuals after fitting Age + Smoke + PC1 + PC4; and Q4A, the standardized residuals after fitting a linear model with predictors Age + Sex + Smoke to Q4. No nonconstant terms were kept in models for Q2, and no extra PC terms were kept for Q4.
Analysis of data set 1
Summary results for data set 1
Q1 | Q2 | ||||||||
---|---|---|---|---|---|---|---|---|---|
Gene | Q1 rank | Q1A rank | Q1B rank | p-value for Q1 | p-value for Q1A | p-value for Q1B | Gene | Q2 rank | p-value for Q2 |
FLT1* | 1 | 1 | 1 | 0.0000 | 0.0000 | 0.0000 | LRRC18 | 1 | 0.000441 |
ZNF91 | 2 | 5 | 6 | 0.0000 | 0.0000 | 0.0002 | WDFY4 | 2 | 0.000441 |
FLJ22662 | 3 | 22 | 900 | 0.0000 | 0.0000 | 0.2604 | SPAG8 | 3 | 0.000453 |
ZNF454 | 4 | 3 | 27 | 0.0000 | 0.0000 | 0.0044 | RARB* | 24 | 0.004243 |
KRT3 | 5 | 12 | 134 | 0.0000 | 0.0000 | 0.0285 | GCKR* | 28 | 0.005489 |
MAP2K6 | 6 | 16 | 834 | 0.0000 | 0.0000 | 0.2360 | VNN1* | 42 | 0.00835 |
ZNF568 | 7 | 4 | 118 | 0.0000 | 0.0000 | 0.0244 | BCHE* | 358 | 0.063238 |
BRCA1 | 8 | 15 | 58 | 0.0000 | 0.0000 | 0.0115 | SIRT1* | 382 | 0.068709 |
RGPD8 | 9 | 7 | 140 | 0.0000 | 0.0000 | 0.0301 | INSIG1* | 391 | 0.07131 |
KDR* | 14 | 2 | 15 | 0.0001 | 0.0000 | 0.0018 | VNN3* | 639 | 0.128922 |
VEGFC* | 15 | 238 | 26 | 0.0001 | 0.0090 | 0.0044 | LPL* | 1,443 | 0.382003 |
VEGFA* | 157 | 197 | 282 | 0.0086 | 0.0062 | 0.0717 | PLAT* | 1,833 | 0.514235 |
ELAVL4 | 312 | 729 | 1,248 | 0.0281 | 0.0983 | 0.3655 | VWF* | 2,071 | 0.59704 |
ARNT* | 653 | 1,086 | 983 | 0.1025 | 0.2037 | 0.2872 | PDGFD* | 2,426 | 0.716842 |
FLT4* | 1,352 | 788 | 1,757 | 0.3187 | 0.1173 | 0.5233 | VLDLR* | 2,691 | 0.811708 |
HIF1A* | 1,563 | 998 | 1,660 | 0.3930 | 0.1814 | 0.4937 | SREBF1* | 2,891 | 0.883901 |
Power calculations
Summaries of repeatable significant results over 200 data sets
Q1 | Q1A | Q1B | Q2 | Q4 | |||||
---|---|---|---|---|---|---|---|---|---|
Gene | n ≤ 0.01 | Gene | n | Gene | n | Gene | n | Gene | n |
FLT1 | 200 | FLT1 | 200 | FLT1 | 191 | VNN1 | 67 | BUD13 | 170 |
FLJ22662 | 192 | KDR | 200 | KDR | 135 | OR5B2 | 48 | SLC22A1 | 154 |
ZNF713 | 188 | TERT | 197 | MAP2K7 | 135 | PTGIS | 39 | SEPT1 | 129 |
KDR | 187 | ZNF713 | 196 | FOXO3 | 133 | ZNF568 | 38 | CYP3A43 | 116 |
FOXO3 | 170 | C1ORF147 | 194 | HSZFP36 | 94 | METTL2B | 32 | TOB2 | 115 |
KRT3 | 170 | FLJ22662 | 194 | EPHB1 | 90 | GCKR | 32 | NF2 | 85 |
KRT75 | 164 | OR6C4 | 192 | LRP4 | 88 | MUC19 | 31 | NUP188 | 83 |
GRK1 | 162 | PRKCH | 192 | RBM6 | 86 | UNC45B | 31 | OR10T2 | 68 |
E2F2 | 157 | ALX4 | 190 | GRIA4 | 79 | ZNF518B | 31 | C16ORF55 | 67 |
CETP | 155 | E2F2 | 189 | FNDC3A | 78 | VNN3 | 29 | ICAM4 | 65 |
TERT | 155 | ETV6 | 189 | HIST2H2BE | 73 | SIRT1 | 25 | LOC1001316 | 65 |
MAP2K6 | 154 | KRT75 | 189 | HAS3 | 72 | BCHE | 22 | MYCBP | 65 |
VEGFC | 45 | VEGFC | 65 | VEGFC | 65 | SREBF1 | 16 | WBP1 | 65 |
ARNT | 38 | HIF1A | 51 | VEGFA | 19 | LPL | 14 | MUC3A | 61 |
VEGFA | 32 | FLT4 | 51 | ARNT | 18 | RARB | 11 | CNGA3 | 58 |
FLT4 | 51 | VEGFA | 32 | FLT4 | 7 | PDGFD | 8 | GTSE1 | 58 |
ELAVL4 | 15 | ARNT | 37 | HIF1A | 7 | VLDLR | 7 | MMP27 | 58 |
HIF1A | 12 | ELAVL4 | 12 | HIF3A | 2 | VWF | 3 | PIK3R2 | 58 |
HIF3A | 3 | HIF3A | 2 | ELAVL4 | 2 | INSIG1 | 1 | TAAR5 | 56 |
Discussion and conclusions
The methods in QTLTree described here are a quick way to collapse the information contained in genotypes within a gene into a form that allows quick calculation of an optimum set of SNPs or combination of SNPs. Within the R environment, the method can also be used to interactively explore the sets of SNPs that may be affecting a quantitative trait. The methods seem to be able to detect an appreciable proportion of genes underlying variation in phenotypes, although the large number of detected loci that do not contribute to variation using the raw trait data is worrying, because within-population randomization should account for any simple differences between populations. Because different levels of structure exist within the data and because gene-environment-region interactions are possible (e.g., the age structure differs between populations, and the values of the Q1 and Q4 phenotypes depend on age), we attempted a further level of correction. Using derived phenotypes after regressing on correlated phenotypes and PC loadings improved the type I error rates while not reducing the power for realistic false-positive rates.
Correlations between population statistics and phenotypic traits
Trait | Average heterozygosity | Rotation 1 | Rotation 2 | Rotation 3 | Rotation 4 |
---|---|---|---|---|---|
Age | 0.02 | 0.14 | 0.16 | 0.07 | 0.06 |
Smoke | −0.01 | −0.04 | −0.05 | 0.04 | 0.01 |
Q1 | 0.16 | −0.15 | −0.07 | 0.05 | −0.19 |
Q2 | 0.05 | −0.01 | 0.01 | −0.04 | −0.05 |
Q4 | −0.06 | −0.09 | −0.12 | −0.07 | −0.03 |
Q1A | 0.17 | −0.21 | −0.12 | 0.02 | −0.24 |
Q1B | 0.02 | 0.00 | −0.01 | −0.01 | 0.00 |
Q4A | −0.10 | 0.06 | 0.02 | −0.02 | 0.06 |
Affected | 0.14 | 0.02 | 0.09 | 0.07 | −0.05 |
The strong false-positive signals with the raw data lead us to ask, can the difference in the number of variable sites between individuals explain the inflated errors in Q1? To test this for phenotypes Q1 and Q2, we created data sets with just the SNPs that affected disease and all non-disease-causing SNPs. The correlation across individuals between average heterozygosity for SNPs affecting Q1 and average heterozygosity for SNPs not affecting Q1 is 0.22 (Pearson r^{2}, p = 5.6 × 10^{−9}), and the correlation for Q2 is r^{2} = 0.14 (p = 2 × 10^{−4}). This may explain some of the false positives. Although such problems are unlikely to arise for real data, they emphasize the difficulties that may crop up in future studies using next-generation sequencing technologies if case and control subjects are not treated in the same way and if genotyping and variant calling are not performed blind to disease status.
Declarations
Acknowledgments
We thank Rita Cantor, the participants in GAW17 Group 13, and two anonymous reviewers for helpful discussions on the paper. Financial support was provided to DTH by the British Heart Foundation and to IJW and RAJH through Wellcome Trust grant 087436. The Genetic Analysis Workshops are supported by National Institutes of Health grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
Authors’ Affiliations
References
- Neale BM, Sham PC: The future of association studies: gene-based analysis and replication. Am J Hum Genet. 2004, 75: 353-362. 10.1086/423901.PubMed CentralView ArticlePubMedGoogle Scholar
- Madsen BE, Browning SR: A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009, 5: e1000384-10.1371/journal.pgen.1000384.PubMed CentralView ArticlePubMedGoogle Scholar
- Dering C, Pugh E, Ziegler A: Statistical analysis of rare sequence variants: an overview of collapsing methods. Genet Epidemiol. 2011, X (suppl X): X-X.Google Scholar
- Cui Y, Kang G, Sun K, Qian M, Romero R, Fu W: Gene-centric genomewide association study via entropy. Genetics. 2008, 179: 637-650. 10.1534/genetics.107.082370.PubMed CentralView ArticlePubMedGoogle Scholar
- Liu JZ, McRae AF, Nyholt DR, Medland SE, Wray NR, Brown KM, AMFS Investigators, Hayward NK, Montgomery GW, Visscher PM, et al: A versatile gene-based test for genome-wide association studies. Am J Hum Genet. 2010, 87: 139-145. 10.1016/j.ajhg.2010.06.009.PubMed CentralView ArticlePubMedGoogle Scholar
- Sevon P, Toivonen H, Ollikainen V: TreeDT: tree pattern mining for gene mapping. IEEE/ACM Trans Comput Biol Bioinform. 2006, 3: 174-185. 10.1109/TCBB.2006.28.View ArticlePubMedGoogle Scholar
- Christensen GB, Lambert CG: Search for compound heterozygous effects in exome sequence of unrelated subjects. BMC Proc. 2011, 5 (suppl 9): S95-10.1186/1753-6561-5-S9-S95.PubMed CentralView ArticlePubMedGoogle Scholar
- R Development Core Team: R: a language and environment for statistical computing. 2010, Vienna, Austria, R Development Core TeamGoogle Scholar
- Almasy LA, Dyer TD, Peralta JM, Kent JW, Charlesworth JC, Curran JE, Blangero J: Genetic Analysis Workshop 17 mini-exome simulation. BMC Proc. 2011, 5 (suppl 9): S2-10.1186/1753-6561-5-S9-S2.PubMed CentralView ArticlePubMedGoogle Scholar
- Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, Cann HM, Barsh GS, Feldman M, Cavalli-Sforza LL, et al: Worldwide human relationships inferred from genome-wide patterns of variation. Science. 2008, 319: 1100-1104. 10.1126/science.1153717.View ArticlePubMedGoogle Scholar
- Schwarz G: Estimating the dimension of a model. Ann Stat. 1978, 6: 461-464. 10.1214/aos/1176344136.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.