Volume 3 Supplement 7
Genetic Analysis Workshop 16
A quantilebased method for association mapping of quantitative phenotypes: an application to rheumatoid arthritis phenotypes
 Saurabh Ghosh^{1}Email author,
 Krishna Rao Sanapala^{1},
 Abhik Ghosh^{1} and
 Sujatro Chakladar^{1}
DOI: 10.1186/175365613S7S18
© Ghosh et al; licensee BioMed Central Ltd. 2009
Published: 15 December 2009
Abstract
Genetic association of populationbased quantitative trait data has traditionally been analyzed using analysis of variance (ANOVA). However, violations of certain statistical assumptions may lead to falsepositive association results. In this study, we have explored modelfree alternatives to ANOVA using correlations between allele frequencies in the different quantile intervals of the quantitative trait and the quantile values. We performed genomewide association scans on anticyclic citrullinated peptide and rheumatoid factorimmunoglobulin M, two quantitative traits correlated with rheumatoid arthritis, using the data provided in Genetic Analysis Workshop 16. Both the quantitative traits exhibited significant evidence of association on Chromosome 6, although not in the human leukocyte antigen region which is known to harbor a major gene predisposing to rheumatoid arthritis. We found that while a majority of the significant findings using the asymptotic thresholds of ANOVA was not validated using permutations, a relatively higher proportion of the significant findings using the asymptotic cutoffs of the correlation statistic were validated using permutations.
Background
Complex genetic traits are usually characterized by correlated quantitative precursors. Because quantitative traits carry more information on withingenotype variation compared to binary (affected/unaffected) traits, it has been argued that analyzing quantitative endophenotypes may be a more powerful strategy than analyzing clinical endpoints for identifying genes controlling a complex trait. While some familybased methods of association have been developed for quantitative traits [1–3], populationbased quantitative trait data have usually been analyzed using classical analysis of variance (ANOVA) methods. However, ANOVA is valid in a strict statistical sense only under the assumption of normality of the variable of interest and equality of variances in each underlying group. On the other hand, the assumption of normality and equality of variances of the quantitative traits for the different genotypes at a quantitative trait locus (QTL) is genetically unrealistic, particularly if the trait is correlated with a disease outcome [4]. Studies have shown that the ANOVA statistic may lead to an inflated rate of false positives when the underlying assumptions are violated [5]. Thus, it is of interest to explore for modelfree alternatives that would circumvent this problem. In this study, we explore some novel quantilebased statistics to test for allelic association and perform genomewide association scans of anticyclic citrullinated peptide (antiCCP) and rheumatoid factor immunoglobulin M (RFUW) levels, two quantitative phenotypes correlated with the rheumatoid arthritis (RA) affection status in the North American Rheumatoid Arthritis Consortium (NARAC) data provided in Genetic Analysis Workshop (GAW) 16. The basic paradigm of the methods is that a marker allele in linkage disequilibrium with an allele at the QTL would have either a strictly increasing or a strictly decreasing frequency distribution across the range of quantitative trait values. We compare the association results of our proposed methods with those obtained using ANOVA.
Data description
For our analyses, we used data on antiCCP and RFUW levels along with genotypes at 531,689 singlenucleotide polymorphisms (SNPs) distributed over the 22 autosomal chromosomes. Data on the two quantitative traits were available only on individuals affected with RA. After removing the individuals with missing phenotype data, our methods used data on 867 individuals (640 females and 227 males) for antiCCP and 746 individuals (548 females and 148 males) for RFUW. We found that 51,524 SNPs had a minor allele frequency less than 0.05. Because these markers are unlikely to be involved in disease pathogenesis under a commonvariant, commondisease (CVCD) model, they were also removed from the association analyses.
Statistical methodology
Suppose Q_{1}, Q_{2},..., Q_{ k }denote the k sample quantiles of the quantitative trait in study with Q_{0} defined as the minimum value of the trait in the data. Suppose p_{ i }denotes the proportion of the minor allele at a particular SNP among individuals with quantitative trait values in the interval (Q_{i1}, Q_{ i }), i = 1,2,..., k. It can be shown that the frequency of a marker allele conditioned on the value of the quantitative trait lying in an interval (a, b) is independent of a and b if and only if the coefficient of linkage disequilibrium between the QTL and the marker locus is zero. Moreover, in the presence of linkage disequilibrium between the two loci, the frequency of the marker allele in positive linkage disequilibrium with the QTL allele predisposing to high values of the quantitative trait will increase with the quantile intervals. Thus, one can test for allelic association based on the correlation between Q_{ i }and p_{ i }values. A test based on the direct correlation coefficient (r) between the two variables would be equivalent to testing whether the slope coefficient of a linear regression of p_{ i }values on Q_{ i }values is zero or not (that is, whether there is a linear trend (either increasing or decreasing) of p_{ i }values with increasing values of Q_{ i }). The test statistic is √(n2)r/√(1r^{2}) and is distributed as t with (n2) degrees of freedom under the assumption of bivariate normality of the underlying variables, although the assumption can be relaxed for large samples. We note that this statistic is identical to that used for testing the slope coefficient in the linear regression mentioned above. A more robust statistic is based on the rank correlation (R) between the variables given by √(n1)R and is asymptotically distributed as a standard normal variate [6]. However, it may be more appropriate not to rely on the asymptotic distributions and use permutation principles, instead, to determine the significance of the above tests. We randomly permute the quantitative trait values across the individuals keeping the SNP genotype data unchanged. This preserves the marginal distributions of both the quantitative traits and the genotypes, but generates the null distribution of no association. Since it is becoming increasingly evident that the Bonferroni correction for multiple testing is highly overconservative and hence leads to an elevated rate of false negatives, we use the false discovery rate (FDR) procedure [7] with an overall rate of 0.05 to identify SNPs significantly associated with the quantitative trait in study.
One of the statistical issues is the choice of an appropriate number of quantiles. The method involves the estimation of the minor allele frequency in each quantile interval as well as computing the correlation between these estimated frequencies and the quantile values. If the number of quantiles is too large, each quantile interval will comprise very few observations, leading to inefficient estimation of the allele frequencies. On the other hand, if the number of quantiles is too small, the total number of observations for the computation of the correlation coefficient will not be sufficiently large to make reliable inferences on association. We suggest that each quantile interval comprise approximately 25 observations so that the variance of the estimated allele frequency [p(1p)/n] cannot exceed 0.01. The minimum number of quantiles should be about 30 for appropriate use of large sample properties of the correlation coefficient.
Results
We first tested for possible deviation from HardyWeinberg equilibrium (HWE) for each of the 480,165 SNPs. To attain an overall significance level of 0.05, we used a Bonferroni corrected pvalue less than 10^{7} as evidence of departure from HWE. Those SNPs that failed the HWE test were removed from the association analyses. We performed three tests of association: the first using ANOVA, the second using the direct correlation coefficient, and the third using the rank correlation coefficient for each of the two quantitative traits: antiCCP and RFUW levels. To determine an effective number of independent tests for an appropriate FDR correction, we used the software HAPLOVIEW, which identified 367,392 tag SNPs, based on a threshold of pairwise r^{2} greater than 0.8.
Neither of the two quantitative traits follows a normal distribution (pvalue < 0.01 using the KolmogorovSmirnov statistic and validated by QQ plots). There was also no sex effect in either antiCCP levels (p = 0.109 using a two sample ttest, p = 0.118 using the MannWhitney test) or RFUW levels (p = 0.7 using a two sample ttest, p = 0.08 using the MannWhitney test) and hence, no sex adjustment was necessary for the association analyses. For the quantilebased methods, we divided the data into 31 quantile intervals for both the quantitative traits (for antiCCP, 28 observations in each of the first 30 intervals and 27 in the last interval; for RFUW, 24 observations in each of the first 30 intervals and 26 in the last interval).
Most significant association findings
Phenotype  Method  Chromosome  SNP  pvalue 

AntiCCP  ANOVA  1  rs1211759  6.58 × 10^{23} 
Correlation  1  rs17123469  5.31 × 10^{19}  
RFUW  ANOVA  6  rs6456834  7.07 × 10^{23} 
Correlation  1  rs2785665  1.03 × 10^{19} 
When we analyzed RFUW levels using ANOVA, we found 1399 significantly associated SNPs, while the direct correlation coefficient and the rank correlation statistics provided 178 and 15 significant SNPs, respectively. The most extreme pvalue based on the rank correlation statistic was 0.006 on chromosome 2. However, unlike antiCCP, the RFUW levels did not exhibit any genomic region with a cluster of significant SNPs.
As pointed out in the section entitled "Statistical Methodology," it was of interest to explore whether the above association findings, obtained using asymptotic tests, can be validated by permutation tests. We found that for antiCCP, only 182 of the 935 SNPs that showed significant evidence of association using ANOVA were validated by permutations, while for RFUW, only 297 of the 1399 SNPs showed significant association using permutations. However, the corresponding numbers using the direct correlation statistic were 47 out of 180 SNPs for antiCCP and 99 out of 178 for RFUW. Thus, among the SNPs exhibiting significant association using asymptotic thresholds, the permutation tests validated a higher proportion of SNPs for the correlation statistic compared to ANOVA. This can be intuitively explained by the fact that the quantilebased correlation statistic is modelfree in nature and hence relatively more robust to violations in distributional assumptions. The rank correlation statistic did not provide any significant evidence of association using the permutation test procedure.
None of the SNPs in the two regions that have been implicated in RA pathogenesis, the human leukocyte antigen (HLA) region (on 6p21) and the PTPN22 gene (on 1p13), showed any significant evidence of association with either of the two quantitative phenotypes at the overall FDR of 0.05 using any of the methods.
Conclusion
The proposed quantilebased statistics are modelfree alternatives to the ANOVA approach for association mapping of quantitative traits and circumvent the problem of departures from normality and homoskedasticity of the quantitative traits conditioned on the QTL genotypes. While these methods are more robust than ANOVA, they are expected to be less powerful when the underlying assumptions of ANOVA are valid. In particular, the rank correlation statistic is bounded by √(n1), where n is the number of observations, and hence, has limits to the pvalues when the asymptotic distribution is used. The direct correlation coefficient statistic, which is equivalent to the test of the slope coefficient in a linear regression, is a more suitable alternative. However, a permutation strategy is likely to provide more accurate pvalues compared to asymptotic distributions, especially if the number of quantiles is not very large. We also note that it is not possible to assess the relative performances of the different methods using real data efficiently. We are carrying out extensive simulations independently to compare the powers of the different tests. Our preliminary simulations based on a sample size of 500 suggest that the powers of the correlationbased statistic and ANOVA are comparable under a homoskedastic QTL model, but the correlationbased method is more powerful under a heteroskedastic model.
An intriguing result of the present study has been the lack of evidence of significant association in the HLA region on chromosome 6 and the PTPN22 gene on chromosome 1 with either of the two quantitative phenotypes. A similar phenomenon was also observed in the analyses of these two quantitative traits in the data provided in Genetic Analysis Workshop 15 [9]. This raises the possibility that the biological pathway involved in modulating the levels of antiCCP and RFUW may be different from that of the clinical endpoint of RA. Alternatively, the paradox can be explained by the fact that since a clinical endpoint is usually a function of multiple quantitative precursors, the effect size of any gene is small for individual quantitative traits and hence require much larger sample sizes to exhibit significant evidence of association.
Because the data on the two quantitative traits were available only on individuals affected with RA, the association findings should be interpreted as polymorphisms involved in differential elevations of antiCCP and RFUW levels in RAaffected individuals from the mean normal levels of these phenotypes. There is reduced variation in the quantitative phenotypes when restricted only to RA cases and hence, an appropriate design to identify polymorphisms associated with the two quantitative phenotypes would include data on controls in addition to the cases.
List of abbreviations used
 ANOVA:

Analysis of variance
 antiCCP:

Anticyclic citrullinated peptide
 CVCD:

Commonvariant, commondisease
 FDR:

False discovery rate
 HLA:

Human leukocyte antigen
 HWE:

HardyWeinberg equilibrium
 GAW16:

Genetic Analysis Workshop 16
 NARAC:

North American Rheumatoid Arthritis Consortium
 QTL:

Quantitative trait locus
 RA:

rheumatoid arthritis
 RFUW:

Rheumatoid factor immunoglobulin M
 SNP:

Singlenucleotide polymorphism.
Declarations
Acknowledgements
This work was supported by the Fogarty International Center, National Institutes of Health, USA grant R01 TW 00660405 and by a grant from the Indian Council of Medical Research.
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/17536561/3?issue=S7.
Authors’ Affiliations
References
 Allison DB: Transmissiondisequilibrium tests for quantitative traits. Am J Hum Genet. 1997, 60: 676690.PubMed CentralPubMedGoogle Scholar
 George V, Tiwari HK, Shu Y, Zhu X, Elston RC: Linkage and association analysis of alcoholism using a regressionbased transmission/disequilibrium test. Genet Epidemiol. 1999, 17 (Suppl 1): S157161.View ArticlePubMedGoogle Scholar
 Abecasis GR, Cardon LR, Cookson WO: A general test of association for quantitative traits in nuclear families. Am J Hum Genet. 2000, 66: 279292. 10.1086/302698.PubMed CentralView ArticlePubMedGoogle Scholar
 Vimaleswaran K, Radha V, Ghosh S, Majumder PP, Deepa R, Babu HNS, Rao MRS, Mohan V: Peroxisome proliferatoractivated receptorγ coactivator 1α(PGC1 α) and their relationship with type 2 diabetes in Asian Indians. Diabetic Med. 2005, 22: 15161521. 10.1111/j.14645491.2005.01709.x.View ArticlePubMedGoogle Scholar
 Ghosh S, De G: Association analysis of populationbased quantitative trait data: an assessment of ANOVA. Hum Hered. 2007, 64: 8288. 10.1159/000101426.View ArticlePubMedGoogle Scholar
 Randles RH, Wolfe DA: Introduction to the theory of nonparametric statistics. 1979, New York, John Wiley & SonsGoogle Scholar
 Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Statist Soc Ser B. 1995, 57: 289300.Google Scholar
 Lin Y, Zhang M, Wang L, Pungpapong V, Fleet JC, Zhang D: Simultaneous genomewide association studies of anticyclic citrullinated peptide in rheumatoid arthritis using penalized orthogonalcomponents regression. BMC Proc. 2009, 3 (suppl 7): S2010.1186/175365613s7s20.PubMed CentralView ArticlePubMedGoogle Scholar
 Ghosh S, Babron MC, Amos CI, Briollais L, Chen P, Chen WV, Chiu WF, Drigalenko E, Etzel CJ, Hamshere ML, Holmans PA, MargaritteJeannin P, Lebrec JJ, Lin S, Lin WY, Mandhyan DD, Nishchenko I, Schaid DJ, Seguardo R, Shete S, Taylor K, Tayo BO, Wan S, Wei LY, Wu CO, Yang XR: Linkage analyses of rheumatoid arthritis and related quantitative phenotypes: the GAW15 experience. Genet Epidemiol. 2007, 31 (Suppl 1): S8695. 10.1002/gepi.20284.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.