Volume 5 Supplement 9
Genetic Analysis Workshop 17: Unraveling Human Exome Data
New insights into old methods for identifying causal rare variants
- Haitian Wang^{1},
- Chien-Hsun Huang^{2},
- Shaw-Hwa Lo^{2},
- Tian Zheng^{2} and
- Inchi Hu^{1}Email author
DOI: 10.1186/1753-6561-5-S9-S50
© Wang et al; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
Abstract
The advance of high-throughput next-generation sequencing technology makes possible the analysis of rare variants. However, the investigation of rare variants in unrelated-individuals data sets faces the challenge of low power, and most methods circumvent the difficulty by using various collapsing procedures based on genes, pathways, or gene clusters. We suggest a new way to identify causal rare variants using the F-statistic and sliced inverse regression. The procedure is tested on the data set provided by the Genetic Analysis Workshop 17 (GAW17). After preliminary data reduction, we ranked markers according to their F-statistic values. Top-ranked markers were then subjected to sliced inverse regression, and those with higher absolute coefficients in the most significant sliced inverse regression direction were selected. The procedure yields good false discovery rates for the GAW17 data and thus is a promising method for future study on rare variants.
Background
There is a growing interest in the role of rare variants in disease etiology—rare in the sense that the minor allele frequency (MAF) is less than 1%. Earlier genome-wide association studies identified risk loci that accounted for only 5–10% of disease heritability [1]. There is now an increasing body of evidence that suggests an association between the rare variants and complex diseases [2]. However, the small variance of rare variants makes their association with phenotypes difficult to detect. To increase the detection power of such associations, most existing methods collapse the rare variants using biological information. Some of these collapsing methods are based on genes or pathways, whereas others involve functionality, synonymous single-nucleotide polymorphisms (SNPs), or nonsynonymous SNPs [3]. Although the collapsing methods increase the allele frequency so that the risk effect is amplified, the noise in the collapsed variables may also increase. This could render the collapsing method less effective in some cases.
In this paper, we propose a three-step method that does not use collapsing. After removing the SNPs that are identical in value across all subjects in the data, we calculate the F-statistic for all the markers. We show that the F-statistic does not down-weight a rare variant despite its low allele frequency and thus is effective in capturing the effect of the rare variant. Next, we apply sliced inverse regression (SIR) to top-ranked markers selected by the F-statistic. When the number of selected top markers is not too large, we find that SIR performs well in identifying the SNPs used to simulate the Genetic Analysis Workshop 17 (GAW17) phenotypes from the top markers (hereafter these SNPs are referred to as the answers).
Methods
Data set
The data set [4] from GAW17 consists of 24,487 SNPs and 697 unrelated individuals. The genotypes are real sequencing data from the 1000 Genomes Project. In particular, rare variants (i.e., MAF < 1%) make up 74% of the total variants. Based on the same genotypes, 200 replications are simulated. Four phenotypes are available for analysis: the quantitative traits Q1, Q2, Q4, and the disease affected status. In this paper, we use Q1 to illustrate our method, which is influenced by 39 SNPs in 9 genes. Although the method is applicable to Q2, which is influenced by 72 SNPs in 13 genes, the result is not reported here. Q4 is too noisy to obtain meaningful conclusions. Our method does not require any information on the answers, although we use the information to assess the performance of the method.
Preliminary data reduction
Because most of the SNPs are rare variants, many of them take identical values across all individuals in the data set; that is, statistically they are indistinguishable. In the preliminary data reduction step, we remove the identical-valued SNPs and keep track of them for later reference. By doing so, we reduce the 24,487 SNPs to 15,124 distinguishable markers. The benefit of this procedure is twofold. First, the dimension is reduced for subsequent statistical analysis, and, second, removing identical-valued SNPs prevents the numerical problems caused by degenerate matrices in regression and principal component types of analyses.
Selection by F-statistics
where:
Y_{ i }=β_{ 0 }+ β_{ 1 }X_{ 1i }+ε_{ I }
We calculate the F-statistic for each of the 15,124 markers by fitting a simple linear regression model, one marker at a time. The response variable Y is the average value of Q1 over 10 replications. The markers are then ranked by F values. A number of top markers are selected for further analysis in the next step.
Selection by sliced inverse regression
SIR [5] is a nonparametric regression method that uses local smoothing of the response variable. It retrieves high-dimension data features from low-dimensional projection. After standardizing X, we estimate an inverse regression of X on sliced Y. The inverse regression, that is, the computation of E(X|Y), converts a high-dimensional regression problem of Y on X to many simple regressions of X on Y. In estimating E(X|Y), the range of Y is divided into small intervals (sliced) to increase computational efficiency. Next, we perform a principal components analysis on E(X|Y), and the principal components (PCs) are returned. These are the SIR effective dimension reduction directions discussed by Li [5]. We simply call them the SIR directions. In our analysis, SIR is performed using the R package dr.
Top markers selected from the previous step are subjected to SIR, and the response variable of SIR is the same as the one used in the previous step of determining the F-statistic. Because the first SIR direction (DIR1) contains the most important information of the data, it is the most significant, and we use only this SIR direction in our method. Next, we perform a two-mean clustering analysis on the absolute DIR1 coefficients. The markers in the cluster with higher mean are our final candidates for causal SNPs.
Results
Our method is first applied to data from replicates 1 to 10 (group 1). To assess the false discovery rate (FDR) of our method, we divided the 200 replications of the GAW17 data set into 20 groups with 10 replicates per group (i.e., replicates 11 to 20 are group 2, replicates 21 to 30 are group 3, etc.). We average the ratios of the number of identified answers to the number of final candidate SNPs over 20 groups to obtain the FDR.
Performance of the F-statistic
Answers in the top 100 markers identified by the F-statistic in the group 1 data
Collapsed sequence | Original sequence | Gene | Rank in selected markers | SNP | MAF | β |
---|---|---|---|---|---|---|
10648 | 16705 | FLT1 | 1 | C13S523 | 0.066714 | 0.64997 |
10647 | 16704 | FLT1 | 2 | C13S522 | 0.027977 | 0.6183 |
10640 | 16692 | FLT1 | 11 | C13S431 | 0.017217 | 0.74136 |
571 | 5386 | KDR | 17 | C4S1877 | 0.000717 | 1.07706 |
571 (identical) | 5392 | KDR | 17 | C4S1889 | 0.000717 | 0.94133 |
10649 | 16706 | FLT1 | 28 | C13S524 | 0.004304 | 0.62223 |
994 | 1153 | ARNT | 32 | C1S6533 | 0.011478 | 0.5619 |
3627 | 5390 | KDR | 66 | C4S1884 | 0.020803 | 0.29558 |
How many top markers should be used?
Trade-off between FDR and the number of identified answers
Number of input markers to SIR | FDR (20 groups) (%) | Average number of identified answers (20 groups) |
---|---|---|
10 | 12.7 | 2.75 |
30 | 20.8 | 4.3 |
50 | 37.7 | 4.8 |
70 | 38.5 | 3.9 |
80 | 47.7 | 5 |
Performance of SIR in 20 groups with top 30 markers FDR = 20.8%
Group | Number of identified answers (A) | Number of candidate markers (B) | Ratio of A to B (%) | (B − A)/B (%) |
---|---|---|---|---|
1 | 5 | 5 | 100.0 | 0.0 |
2 | 4 | 4 | 100.0 | 0.0 |
3 | 4 | 6 | 66.7 | 33.3 |
4 | 4 | 6 | 66.7 | 33.3 |
5 | 4 | 6 | 66.7 | 33.3 |
6 | 4 | 6 | 66.7 | 33.3 |
7 | 5 | 6 | 83.3 | 16.7 |
8 | 5 | 5 | 100.0 | 0.0 |
9 | 5 | 6 | 83.3 | 16.7 |
10 | 4 | 5 | 80.0 | 20.0 |
11 | 4 | 4 | 100.0 | 0.0 |
12 | 4 | 5 | 80.0 | 20.0 |
13 | 6 | 8 | 75.0 | 25.0 |
14 | 3 | 4 | 75.0 | 25.0 |
15 | 4 | 6 | 66.7 | 33.3 |
16 | 3 | 4 | 75.0 | 25.0 |
17 | 5 | 6 | 83.3 | 16.7 |
18 | 6 | 7 | 85.7 | 14.3 |
19 | 4 | 5 | 80.0 | 20.0 |
20 | 3 | 6 | 50.0 | 50.0 |
Performance of SIR
Discussion
Why does the F-statistic work for rare variants?
Thus the weight given to the regression sum of squares (SSR) for the minor allele is (n − 1) times that for the major allele. This helps to manifest the effects of rare variants on the response.
Simultaneous versus separate treatment of common and rare variants
In the preceding analysis, we calculated the F-statistic for both common and rare variants simultaneously. Applying the F-statistic on private variants only leads to further interesting findings. After removing identical SNPs, we reduce the 9,433 private variants to 685 distinguishable markers. In the group 1 data, among the top 20 markers ranked by F-statistic, five answers are found. Three of them are new (CAS4935, C4S1873, and C4S1887) to the previously identified SNPs when applying the F-statistic to all markers. This suggests that a separate treatment of common and rare variants is a promising strategy for further investigation.
Conclusions
We use traditional statistical methods for new applications in the context of rare variant research. A three-step method is introduced. First, we perform a simple data reduction by removing identical-valued SNPs. Second, we calculate F-statistics on all markers and select those with the top F values. Finally, we perform SIR on the top markers and perform two-mean clustering on the absolute coefficients of the first SIR direction. Markers in the cluster with a higher mean are the final candidates for causal SNPs. Using the top 30 markers from the F-statistic, we find that the FDR is 20.8% and that the average number of answers is 4.3. We show that the proposed method is an effective and easy-to-use feature selection approach in the context of rare variants.
Declarations
Acknowledgments
This research was supported by Hong Kong Research Grants Council grants 642207, and DAG S08/09.BM15, and by National Institutes of Health (NIH) grants R01 GM070789 and 3R01 GM070789-05S1 and by National Science Foundation (NSF) grant DMS 0714669. HW would like to thank GAW17 for awarding the student travel support. We thank both reviewers for their thoughtful comments. Genetic Analysis Workshop 17 is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
Authors’ Affiliations
References
- Asimit J, Zeggini E: Rare variant association analysis methods for complex traits. Annu Rev Genet. 2010, 44: 293-308. 10.1146/annurev-genet-102209-163421.View ArticlePubMedGoogle Scholar
- Bansal V, Libiger O, Torkamani A, Schork NJ: Statistical analysis strategies for association studies involving rare variants. Nat Rev Genet. 2010, 11: 773-785. 10.1038/nrg2867.PubMed CentralView ArticlePubMedGoogle Scholar
- Dering C, Pugh E, Ziegler A: Statistical analysis of rare sequence variants: an overview of collapsing methods. Genet Epidemiol. 2011, X (suppl X): X-X.Google Scholar
- Almasy LA, Dyer TD, Peralta JM, Kent JW, Charlesworth JC, Curran JE, Blangero J: Genetic Analysis Workshop 17 mini-exome simulation. BMC Proc. 2011, 5 (suppl 9): S2-10.1186/1753-6561-5-S9-S2.PubMed CentralView ArticlePubMedGoogle Scholar
- Li KC: Sliced inverse regression for dimension reduction (with discussion). J Am Stat Assoc. 1991, 86: 316-342. 10.2307/2290563.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.