Comparison of results from tests of association in unrelated individuals with uncollapsed and collapsed sequence variants using tiled regression
© Sung et al; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
Tiled regression is an approach designed to determine the set of independent genetic variants that contribute to the variation of a quantitative trait in the presence of many highly correlated variants. In this study, we evaluate the statistical properties of the tiled regression method using the Genetic Analysis Workshop 17 data in unrelated individuals for traits Q1, Q2, and Q4. To increase the power to detect rare variants, we use two methods to collapse rare variants and compare the results with those from the uncollapsed data. In addition, we compare the tiled regression method to traditional tests of association with and without collapsed rare variants. The results show that collapsing rare variants generally improves the power to detect associations regardless of method, although only variants with the largest allelic effects could be detected. However, for traditional simple linear regression, the average estimated type I error is dependent on the trait and varies by about three orders of magnitude. The estimated type I error rate is stable for tiled regression across traits.
The assumptions of independence between observations and between independent variables are a major theoretical underpinning of much of traditional statistics. However, because of the linear nature of the genome and the intrinsically familial nature of genetics, these assumptions are often violated when traditional statistical methods are applied to genetic data. As the density of genetic markers increases, ultimately encompassing the entire genome, the correlations between markers increase, depending in large part on the linkage disequilibrium structure in the sample. For unrelated individuals a multiple linear regression approach, including all variants across the entire genome, would be ideal. However, the large number of variants relative to the number of samples and the presence of a large degree of multicollinearity among markers make this approach intractable. In addition, the inclusion of rare sequence variants, including several rare variants in the same gene, makes traditional tests of association problematic because of the low frequency of many of the variants. In this study, we use tiled regression to analyze the unrelated individuals from the simulated Genetic Analysis Workshop 17 (GAW17) data in order to identify the set of variants that are responsible for the variation in quantitative traits and to compare the use of uncollapsed and collapsed sequence variants.
To examine the data for population substructure, we first performed a principal components analysis on 1,356 common SNPs (minor allele frequency [MAF] > 0.2 and r² < 0.2) with Eigensoft, version 3.0. The self-reported ethnicity of each individual was plotted against the first two principal components. The individuals could be grouped into three populations (Asian, African, and European), except for one self-reported European who was classified in the Asian group. This individual, NA12829, was removed from all subsequent analyses, leaving 696 individuals. However, because the sample sizes for the subpopulations were small, the subpopulations were considered a single sample. We used linear regression to adjust and center each quantitative trait (Q1, Q2, and Q4) for age, sex, and smoking status in each of the 200 replicates.
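The covariate adjustment described above can be illustrated with a minimal sketch. It assumes ordinary least squares with an intercept, so that the residuals are both adjusted and centered; the function name `adjust_trait` and the toy data are hypothetical and are not taken from the GAW17 files or from the authors' software.

```python
import numpy as np

def adjust_trait(trait, covariates):
    """Return residuals of `trait` after least-squares regression on `covariates`.

    An intercept column is included, so the residuals have mean zero:
    the adjusted trait is centered as well as covariate-adjusted.
    """
    y = np.asarray(trait, dtype=float)
    X = np.asarray(covariates, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # intercept + covariates
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta

# Toy data standing in for one replicate (values are made up for illustration).
rng = np.random.default_rng(0)
n = 50
age = rng.uniform(20, 80, n)
sex = rng.integers(0, 2, n)
smoke = rng.integers(0, 2, n)
q1 = 0.02 * age + 0.3 * sex - 0.5 * smoke + rng.normal(size=n)

q1_adj = adjust_trait(q1, np.column_stack([age, sex, smoke]))
```

In a full analysis the same function would be applied to Q1, Q2, and Q4 separately in each replicate.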
We used the genotypes for the 24,473 nonmonomorphic single-nucleotide polymorphisms (SNPs), including common and rare sequence variants (collectively referred to here as sequence variants), as provided (uncollapsed) and with rare sequence variants collapsed [3, 4]. To collapse the rare variants, we used two methods: (1) collapsing all variants with a MAF < 0.01 and (2) collapsing nonsynonymous variants with a MAF < 0.01. The rare variants were collapsed into a single variant for each genomic region defined by hot spots (see Tiled regression section). The derived region-wide collapsed variants were coded as the presence or absence of any rare allele within each region. Common variants were left uncollapsed and coded as the number of minor alleles.
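The coding scheme above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function `collapse_rare_variants`, the dictionary layout, and the variant and region labels are all hypothetical. For the second collapsing method, the input would first be restricted so that only nonsynonymous rare variants are collapsed.

```python
def collapse_rare_variants(genotypes, maf, region_of, maf_threshold=0.01):
    """Collapse rare variants into one presence/absence indicator per region.

    genotypes: dict variant_id -> list of minor-allele counts (0, 1, 2) per individual
    maf:       dict variant_id -> minor allele frequency
    region_of: dict variant_id -> region (tile) label
    Returns a dict of coded variables: common variants keep their allele
    counts; each region containing rare variants gains one 0/1 indicator.
    """
    coded = {}
    for vid, counts in genotypes.items():
        if maf[vid] >= maf_threshold:
            coded[vid] = list(counts)  # common: number of minor alleles
        else:
            key = ("collapsed", region_of[vid])
            indicator = coded.setdefault(key, [0] * len(counts))
            for i, count in enumerate(counts):
                if count > 0:
                    indicator[i] = 1  # carries at least one rare allele
    return coded

# Toy example with four individuals (hypothetical variants and regions).
genotypes = {
    "v1": [0, 1, 2, 0],
    "v2": [0, 0, 0, 1],
    "v3": [1, 0, 0, 0],
    "v4": [0, 0, 1, 0],
}
maf = {"v1": 0.30, "v2": 0.005, "v3": 0.002, "v4": 0.004}
region_of = {"v1": "tile_A", "v2": "tile_A", "v3": "tile_A", "v4": "tile_B"}
coded = collapse_rare_variants(genotypes, maf, region_of)
```

Here `v2` and `v3` fall in the same region and merge into a single indicator, while the common variant `v1` keeps its allele counts.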
In tiled regression, the genome is divided into independent segments based on predefined regions. Recombination hot spots (i.e., well-defined regions of increased recombination) are used to delineate regions. The term tile denotes both the sequence of DNA between two hot spot regions and a hot spot region itself. Each sequence variant is assigned to a tile based on its physical position. A tile is selected if the multiple linear regression on all variants in the tile shows a significant relationship to trait variation (testing the null hypothesis that all variant coefficients are 0) or if the simple linear regression on any single variant in the tile is significant. A stepwise regression is then used to select the important independent variants within each selected tile. Thereafter, the significant variants are combined across tiles in higher-order stepwise regressions, first at the chromosome level and then at the genome level. The end result is a multiple linear regression model that includes a set of variants that independently contribute to trait variation. An appropriate null distribution for determining the significance level of the overall results is being investigated, and permutation tests will likely be required to obtain an accurate significance level.
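The assignment of variants to tiles by physical position can be sketched as below. The sketch assumes that hot spots on one chromosome are given as sorted, non-overlapping (start, end) intervals and treats each hot spot as the half-open interval [start, end); the function names and coordinates are hypothetical and do not reflect how TRAP stores or indexes tiles.

```python
from bisect import bisect_right

def build_tile_boundaries(hot_spots):
    """Flatten non-overlapping (start, end) hot spots into one sorted boundary list."""
    boundaries = []
    for start, end in sorted(hot_spots):
        boundaries.extend([start, end])
    return boundaries

def assign_tile(position, boundaries):
    """Return (tile_index, kind) for a physical position on one chromosome.

    Even indices are segments between hot spots; odd indices are hot spots,
    so both kinds of region count as tiles.
    """
    index = bisect_right(boundaries, position)
    kind = "hot_spot" if index % 2 == 1 else "between_hot_spots"
    return index, kind

# Two hypothetical hot spots define five tiles on this toy chromosome.
hot_spots = [(1000, 1500), (4000, 4200)]
boundaries = build_tile_boundaries(hot_spots)
tiles = {pos: assign_tile(pos, boundaries) for pos in (500, 1200, 3000, 4100, 9000)}
```

Variants that share a tile index would then be analyzed together in the tile-level regressions.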
Test of association for uncollapsed and collapsed sequence variants
We used tiled regression, as implemented in TRAP (Tiled Regression Analysis Package, http://research.nhgri.nih.gov/software/TRAP), to identify the set of independently significant sequence variants that affect each of the covariate-adjusted quantitative traits and to compare results from the uncollapsed and collapsed approaches. Tiles were determined on the basis of the location of recombination hot spots in Human Genome Sequence build 36. We used critical values of 0.1 and 0.01 for the initial screening of the multiple and simple regressions, respectively. We used a critical value of 0.01 for entering and retaining variables in the stepwise regressions. Simple linear regressions (SLRs) were performed with TRAP and PLINK.
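The initial screening rule with these critical values can be written as a small predicate. This is a sketch only: it assumes the p-values have already been computed elsewhere and that a strict inequality is used, neither of which is stated in the text, and the function `tile_is_selected` is hypothetical rather than part of TRAP.

```python
def tile_is_selected(p_multiple, p_simple, alpha_multiple=0.1, alpha_simple=0.01):
    """Screen one tile using the critical values reported in the text.

    p_multiple: p-value for the multiple regression on all variants in the tile
                (null hypothesis: all variant coefficients are 0)
    p_simple:   iterable of p-values, one simple linear regression per variant
    A tile passes if either criterion is met (strict inequality is an assumption).
    """
    return p_multiple < alpha_multiple or any(p < alpha_simple for p in p_simple)

# Hypothetical p-values for three tiles.
passes_on_multiple = tile_is_selected(0.05, [0.20, 0.50])
passes_on_single = tile_is_selected(0.40, [0.30, 0.004])
fails_both = tile_is_selected(0.40, [0.30, 0.02])
```

Only tiles that pass this screen would proceed to the stepwise regression, which uses the separate critical value of 0.01 for entering and retaining variables.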
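We requested the answers for the GAW17 simulated data and compared the resulting sets of significant variants with the simulation model to examine power and type I error. Specifically, we determined measures analogous to average power and type I error rate: the proportion of the 200 replicates (PoR) in which a causal variant or a noncausal variant, respectively, was identified as significant.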
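One plausible way to compute such replicate-based measures is sketched below. It assumes that each replicate yields a set of significant variant identifiers and that the measures are simple averages over causal and noncausal variants; the exact averaging used by the authors is not specified in the text, and all names and data here are hypothetical.

```python
def proportion_of_replicates(significant_by_replicate, variants):
    """PoR per variant: fraction of replicates whose significant set contains it."""
    n_replicates = len(significant_by_replicate)
    return {
        v: sum(v in significant for significant in significant_by_replicate) / n_replicates
        for v in variants
    }

def average_por(por, subset):
    """Mean PoR over a subset of variants."""
    subset = list(subset)
    return sum(por[v] for v in subset) / len(subset)

# Toy example: four variants, four replicates, two causal variants.
all_variants = ["v1", "v2", "v3", "v4"]
causal = {"v1", "v2"}
significant_by_replicate = [{"v1"}, {"v1", "v3"}, set(), {"v1", "v2"}]

por = proportion_of_replicates(significant_by_replicate, all_variants)
power_like = average_por(por, causal)
type1_like = average_por(por, [v for v in all_variants if v not in causal])
```

In this toy example the average PoR is 0.5 over the causal variants and 0.125 over the noncausal variants.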
Results are presented here in detail for uncollapsed variants and for collapsed variants, including only nonsynonymous variants with MAF < 0.01, for the combined populations for the 200 replicates. Results from the analysis of collapsed variants including all variants with MAF < 0.01 were similar to those for the collapsed nonsynonymous variants and are not shown. Results from both SLR implementations (PLINK and the method included in TRAP) were nearly identical, as expected.
Proportion of 200 replicates identifying causal variants in traits Q1 and Q2
PoR for uncollapsed variants
PoR for collapsed variants (MAF < 0.01 and nonsynonymous)
7.9 × 10⁻²
2.5 × 10⁻²
8.2 × 10⁻²
1.6 × 10⁻¹
4.8 × 10⁻²
1.1 × 10⁻¹
2.2 × 10⁻²
6.3 × 10⁻⁴
4.9 × 10⁻⁴
7.2 × 10⁻²
2.7 × 10⁻³
1.8 × 10⁻³
Average proportion of 200 replicates identifying noncausal variants in traits Q1, Q2, and Q4
PoR for uncollapsed variants
PoR for collapsed variants (MAF < 0.01 and nonsynonymous)
9.7 × 10⁻⁴
5.5 × 10⁻⁶
6.8 × 10⁻⁴
1.3 × 10⁻³
7.3 × 10⁻⁶
1.0 × 10⁻³
1.3 × 10⁻³
6.2 × 10⁻⁶
3.1 × 10⁻⁶
1.7 × 10⁻³
3.8 × 10⁻⁶
2.1 × 10⁻⁶
1.3 × 10⁻³
3.5 × 10⁻⁶
2.0 × 10⁻⁷
1.7 × 10⁻³
3.2 × 10⁻⁶
Discussion and conclusions
Regardless of the method used, collapsing rare variants generally increased the proportion of replicates that identified a significant causal variant. However, the allele frequency and effect size of the allele for most of the causal variants were too small to be detected in these data using these methods. For Q1, only variants in FLT1 had a PoR greater than about 0.5. For Q2, only one variant in VNN1 had a PoR of any sizable magnitude. Nearly all the other causal variants for Q1 and Q2 were undetectable.
More troubling was the inconsistency across traits in the average PoR for noncausal variants identified as significant by the SLR method. The average PoR differed by about three orders of magnitude for the traditional SLR methods, most likely because of differences in the underlying simulation models for the traits, identical genotypes across replicates, correlated phenotypes, and, perhaps most important, the high degree of multicollinearity in the genotyping data.
The lack of agreement between the empirically derived average PoR and the expected significance levels for the SLR methods may be due to the alternative null hypothesis problem. Under the null hypothesis of no genetic component, no causative variants influence the phenotype; that is, the phenotype is essentially a normally distributed random variable. Under the other null hypothesis, there are no causative variants among the set of variants considered; however, other unknown causal variants do, in fact, influence the phenotypic distribution. These unknown causative variants affect the correlation structure of the phenotypes and may be correlated with other known or unknown variants. Trait Q4 was not generated by specific causative variants but rather is a polygenic effect, which is essentially a normally distributed random variable in unrelated individuals. On the other hand, traits Q1 and Q2 were generated by known causative variants that were correlated with other noncausative variants. These correlations can increase both the type I error rate and the power of the test. This may explain the inflated type I error rate for the SLR methods. The increase in type I error rate for Q1 is quite large, perhaps reflecting the presence of causative variants with larger locus-specific heritability than the causative variants in Q2. The tiled regression approach attempts to minimize these correlations by identifying the set of independent variants that most affect phenotypic variation while minimizing the degree of multicollinearity. It appears from these results that the type I error rates for tiled regression are stable with respect to the underlying null hypothesis, although additional work will be required to determine accurate significance levels for the entire tiled regression procedure, not just the significance levels in the final model.
This project was supported by the Division of Intramural Research at the National Human Genome Research Institute, National Institutes of Health (NIH). The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
- Wilson AF, Kim Y, Sung H, Cai J, McMahon FJ, Sorant AJM: Tiled regression: the use of regression methods in hotspot defined genomic segments to identify independent genetic variants responsible for variation in quantitative traits. International Genetic Epidemiology Society, 18th Annual Meeting. 2009, Abstract 142 http://www.geneticepi.org/meetings/2009 in Abstracts.doc
- Patterson N, Price AL, Reich D: Population structure and Eigenanalysis. PLoS Genet. 2006, 2: e190. 10.1371/journal.pgen.0020190.
- Li B, Leal SM: Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008, 83 (3): 311-321. 10.1016/j.ajhg.2008.06.024.
- Dering C, Pugh E, Ziegler A: Statistical analysis of rare sequence variants: an overview of collapsing methods. Genet Epidemiol. 2011, 35 (8): 12-17.
- Sorant AJM, Cai J, Sung H, Kim Y, Wilson AF: Tiled Regression Analysis Package (TRAP): software implementation of tiled regression methodology. 2010, International Genetic Epidemiology Society, 19th Annual Meeting, Abstract 244 http://www.geneticepi.org/iges_files/2010%20Abstract%20Document.pdf
- Human Genome Sequence Build 36. [http://www.stats.ox.ac.uk/~mcvean/OXSTAT/GeneticMap_b36]
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, et al: PLINK: a tool set for whole-genome association and population-based linkage analysis. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.
- Almasy LA, Dyer TD, Peralta JM, Kent JW, Charlesworth JC, Curran JE, Blangero J: Genetic Analysis Workshop 17 mini-exome simulation. BMC Proc. 2011, 5 (Suppl 9): S2. 10.1186/1753-6561-5-S9-S2.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.