Finding genes that influence quantitative traits with tree-based clustering

We present a new statistical method to identify genes in which one or more variants influence quantitative traits. We use the Genetic Analysis Workshop 17 (GAW17) data set of unrelated individuals as a test of the method on the raw GAW17 phenotypes and on residuals after fitting linear models to individual-based covariates. By performing appropriate randomization tests, we found many significant results for a proportion of the genes that contain variants that directly contribute to disease but that have an increased type I error for analyses of raw phenotypes. Power calculations show that our methods have the ability to reliably identify a subset of the loci contributing to disease. When we applied our method to derived phenotypes, we removed many false positives, giving appropriate type I error rates at little cost to power. The correlation between genome-wide heterozygosity and the value of the trait Q1 appears to drive much of the type I error in this data set.


Background
Multilocus approaches to associations between variants and traits are likely to be advantageous when rare single-nucleotide polymorphisms (SNPs), which have an undetectable effect on a trait when considered singly, can explain a large proportion of the genetic variance at a locus when they are taken together [1][2][3]. Investigators have taken a gene-centric approach to association testing using, for example, entropy [4], weighted sums [2,3], and distance measures [5] to summarize information across different sites. Our approach uses data-driven tree-based clustering to combine genotypes across multiple loci. The tree structure makes our algorithm an efficient way to search through SNPs that best explain the difference in quantitative trait values. Our tree construction method ensures that genotypes that differ by a mutation at a single locus always cluster on the tree and gives an easily interpretable visualization of the SNPs at a locus that affects the trait.
Sevon et al. [6] developed a method, TreeDT, that uses lexical sorting of haplotypes to produce a treebased test of association. We use the idea of lexical trees but extend it by using multilocus genotypes and by working with quantitative traits rather than case-control status. The method can be used on haplotypes, but using multilocus genotypes is a natural extension when we are interested in the effects of rare variants, because these variants are unlikely to be present in two copies and phasing of such variants is much more uncertain [7]. We develop this method to work with rare variants by using recoded multilocus genotypes rather than haplotypes and by extending the statistics used to look for associations between quantitative traits and the tree structure. Every node on the tree represents a multilocus genotype that appears one or more times in the population. Shorter multilocus genotypes are situated at internal nodes of the tree. This method provides a pictorial summary of the information contained in a region at different genes along the chromosome. The methods presented here are implemented both as a stand-alone program and as an R library [8].

Data preparation
For these analyses we use the unrelated individuals genotype data from Genetic Analysis Workshop 17 (GAW17); these data consist of 697 individuals genotyped at 22,487 sites. The data generation is detailed by Almasy et al. [9]. For tree analyses, we recode these genotype data as binary multilocus genotypes (BMGs) by coding homozygotes for the common allele as 0 and all other genotypes as 1, so that the BMG of an individual at a gene is a vector of 0's and 1's. This recoding is illustrated with example data in Table 1. We perform two sets of analyses: all preliminary work is done on the first phenotype data set, and subsequent power calculations are performed on all 200 data sets.

Principal components and individual variation
Loadings from principal components analysis (PCA) were calculated for allele counts for all genotype data using standard R functions. Plots of loadings are shown in Figure 1. The first principal component (PC), explaining 35% of the variance, does not resemble typical PCA results because it does not produce a separation into the three main population groupings that is seen in other studies with comparable samples [10], whereas rotations 2 and 3 effectively separate the individuals into three groups. Rotations 1 and 4 are more closely related to the overall variation seen in a sample (as measured by average heterozygosity) than to differences between populations, as seen for SNP array data [10].

Derived phenotype data
To incorporate suspected relations between Q1, Q2, Q4, and the other explanatory variables (Age, Smoke, Sex, and PC loadings), we calculate two additional derived phenotypes for each of Q1, Q2, and Q4. We construct the first residual phenotype by fitting the linear model: where ε~N(0, s 2 ), using backwards stepwise selection on the explanatory variables and using the Bayesian information criterion [11] to decide which variables are retained. After model fitting, we use the standardized residuals as phenotypes. We calculate a second set of derived phenotypes in the same way but with the initial model also containing the first six variable PC loadings, which we label PC1, … , PC6.. All of this model selection is done on phenotype data set 1. We then fit the selected models individually to each of the 200 replicate data sets and use their standardized residuals as phenotype data. All calculations are performed using standard R functions.

Building trees from genetic data
Consider a set of BMGs for a gene (here we could also use haplotypes if we had accurate phasing) as strings of 0's and 1's. Put all the individuals at the root of a tree. Now consider the variant positions in that gene one position at a time from left to right or using some other ordering. For the first position, all those samples with a 0 at the position are put on the left branch of the root, and all those with a 1 are put on the right branch. The two leaves of this tree now contain BMGs of length one 1. Now step through all the other variable positions for each leaf. If there is any variation at the current position among the individuals at a leaf, the leaf is split in two, with all individuals with a 0 on the left branch and all individuals with a 1 on the right branch. Repeat for all the variable sites at the gene. After k sites the leaves contain BMGs of length k. This procedure is illustrated in Figure 2. The multilocus genotypes that map to multilocus genotype codes for our example data are shown in Table 1.

Test statistics
Obtaining a tree test statistic is a two-stage process. First, we require values of partial test statistics defined on the leaves and internal nodes of our tree. A variety of test statistics are available, but we are restricted to those that depend only on values of a trait at a node and its descendants. Using information at ancestral nodes or on nodes on other branches of the tree (such as other individuals from the same population) is not possible within this framework. We use the term disjoint here for a set of nodes in which none of the nodes is the ancestor of another. For further details see Sevon et al. [6].
Although many partial test statistics are possible, we take a simple one, the z score: The example data set is shown in Figure 2.
where x ij , j = 1, …, n i , are the values of the trait at node i and x and s are the sample mean and standard deviation, respectively, of the trait over all individuals. Two trees with the respective z scores at the nodes are shown in Figure 3.
The QTLTree test statistic over the whole tree, S k , is defined as the maximum value of: where the summation is over m disjoint nodes where m ≤ k, and f is some function. For our analyses, we take: that is, f y y ( ) = , but other approaches are possible and implemented in our program. As a side effect of the calculation, we obtain intermediate values for S j , j = 1, …, k − 1. Typically for our tests, we take k = 10.

Significance testing
The null distribution of S k is not available. Because the GAW17 data were sampled from different populations, a straightforward randomization was not appropriate and individuals were randomized within populations; that is, the traits were randomly relabeled within each population so that the mean and standard deviation of the traits stayed the same within populations across replicate permutations. The seven different populations used were Luhya, Yoruba, Japanese, Denver Chinese, Han Chinese, CEPH (European-descended residents of Utah), and Tuscan. We calculate statistical significance with 10 6 randomized replicates per gene for the phenotype data set. We perform power calculations over multiple data sets using 10 5 replicate simulations over all 200 data sets. One hundred thousand replicate permutations for all 3,205 genes of a single data set typically takes about 60 minutes, and these calculations are easy to perform in parallel, making them feasible for whole-genome data. The R package QTLTree [8] is available from IJW's website (http://www.staff.ncl.ac.uk/i.j.wilson).

Model choice derived phenotypes
The backwards model selection results in three additional derived phenotypes: Q1A, the residuals after fitting Age and Smoke; Q1B, the residuals after fitting Age + Smoke + PC1 + PC4; and Q4A, the standardized residuals after fitting a linear model with predictors Age + Sex + Smoke to Q4. No nonconstant terms were kept in models for Q2, and no extra PC terms were kept for Q4.
Analysis of data set 1 Table 2 gives the genes with the highest significance levels for analyses of the three Q1-related traits and Q2. Although for the uncorrected trait the gene with rank 1 is true, the rest of the top 10 ranked genes are all unrelated to the trait and are significant after Bonferroni correction (p < 5 × 10 Example of multilocus genotype tree. This tree is constructed by considering sites left to right along the binary multilocus genotype (BMG). The root of the tree contains all 100 individuals. As we consider successive SNPs, all nodes containing both 0 and 1 individuals at the SNP are split, with individuals carrying a 1 put on the top branch and those carrying a 0 on the lower branch. The leaves of the tree (yellow background) carry full-length BMGs, and internal nodes (blue) carry partial BMGs, with sites that can carry 0 or 1 labeled with a hash (#). Case and control counts are given by numbers above and below the node label, respectively.
follow their expectations well, whereas those for Q1 approach acceptability only when using phenotype Q1B, corrected using PC loadings. Results from phenotype Q2 show some deviation from the expected at low p-values, but these do not seem to be due to true associations from results in Table 2.

Power calculations
Power calculation results are summarized in Figure 4. There is some power to detect true associations at genes influencing traits Q1 and Q2, but not for all genes. Using the derived trait Q1B, which incorporates PC loadings, increases the true-positive rate at low false-positive rates.  Figure 3 Genotype trees underlying two disease positions. Binary multilocus genotype trees underlying genes affecting phenotype Q1 from data set 1. ARNT is not significantly associated with Q1 using QTLTree, whereas KDR is associated. Numbers in boxes are values of the z statistic at the node; counts of individuals at the leaf are to the right. Counts for internal nodes can be calculated by looking at descendants. Colored boxes indicate evidence for association with phenotype within the tree: red, z > 2.0; pink, 1 ≤ z ≤ 2; and green, −2 ≤ z ≤ −1. Uncoloured boxes show no evidence of association with phenotype.
Using five disjoint test statistics increases the power over using a single statistic. Further test statistics did not further increase power (results not shown). Table 3 shows that there is a tendency for false positives to be found at the same genes over all replicate simulations, even for the derived traits. This table also informs us that although phenotype Q4 is well behaved for data set 1, across all data sets there is a tendency for false-positive results to be seen in the same genes. Using the residual derived phenotype Q4A corrected this problem (results not shown). All p-values based on 10 6 randomizations. Ranks are the rank of the genes when ordered by p-value. Genes with asterisks indicate true associations with trait. Genes were chosen on the basis of rank for Q1 analyses or of being true.   in sequencing samples, because samples with higher coverage tend to have more variants called. The strong false-positive signals with the raw data lead us to ask, can the difference in the number of variable sites between individuals explain the inflated errors in Q1? To test this for phenotypes Q1 and Q2, we created data sets with just the SNPs that affected disease and all non-disease-causing SNPs. The correlation across individuals between average heterozygosity for SNPs affecting Q1 and average heterozygosity for SNPs not affecting Q1 is 0.22 (Pearson r 2 , p = 5.6 × 10 −9 ), and the correlation for Q2 is r 2 = 0.14 (p = 2 × 10 −4 ). This may explain some of the false positives. Although such problems are unlikely to arise for real data, they emphasize the difficulties that may crop up in future studies using nextgeneration sequencing technologies if case and control subjects are not treated in the same way and if genotyping and variant calling are not performed blind to disease status.