Volume 3 Supplement 7
Genetic Analysis Workshop 16
Evaluation of an optimal receiver operating characteristic procedure
 Neal Jeffries^{1} and
 Gang Zheng^{1}
DOI: 10.1186/1753-6561-3-S7-S56
© Jeffries and Zheng; licensee BioMed Central Ltd. 2009
Published: 15 December 2009
Abstract
Lu and Elston have recently proposed a procedure for developing optimal receiver operating characteristic curves that maximize the area under a receiver operating characteristic curve in the setting of a predictive genetic test. The method requires only summary data, not individual-level genetic data. In an era of increased data sharing, we investigate the performance of this algorithm when individual-level genetic data are available and compare this approach to more standard receiver operating characteristic curve-building methods.
Conclusion
Though the Lu-Elston method can produce an optimal area under the curve under some assumptions, the method typically has little advantage over standard multivariable logistic methods when individual-level data are available. Also, the standard approach easily allows comparison of nested models via likelihood-ratio tests and incorporation of covariates; the Lu-Elston approach is shown to have some difficulties with such analyses. These conclusions are based on evaluations using the Genetic Analysis Workshop 16 rheumatoid arthritis data set.
Background
Lu and Elston [1] present an approach to constructing receiver operating characteristic (ROC) curves with optimal area under the curve (AUC). The approach is applicable to case-control studies and does not require the availability of an individual-level data set. The method may be based solely upon summary information: marker-specific allele frequencies in cases and controls, penetrance, and disease prevalence. In this approach one can construct multivariable predictive models of disease without knowing the joint distribution of the markers, using an assumption of no interactions between markers. Further, the method is optimal in the sense that the area under the ROC curve is maximized. The authors provide a more complex extended model involving linkage disequilibrium (LD) correlations and haplotype frequency estimation that allows for interactions among markers; however, evaluations using this approach are not pursued in this brief report because we would not expect our conclusions to change qualitatively.
Here we assess how this method performs when a case-control data set is available and therefore many methods are available for constructing ROC curves based on the joint distribution of markers. In such situations we examine the extent to which the approach remains optimal and how it may be extended with respect to marker selection and incorporation of covariates. These issues are examined using a small subset of markers drawn from the Genetic Analysis Workshop 16 (GAW16) rheumatoid arthritis (RA) data.
Methods
The AUC is computed using the trapezoidal rule, as in Lu and Elston [1].
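The construction summarized in this report (genotype probabilities formed as products over markers, likelihood ratios for each multi-marker genotype, then a trapezoidal AUC) can be sketched as follows. This is a minimal illustration under the no-interaction assumption, not the authors' code; the function names and the example frequencies are hypothetical.

```python
from itertools import product


def lu_elston_roc(case_freqs, control_freqs):
    """Return (fpr, tpr) lists for an ROC curve built under marker independence.

    case_freqs / control_freqs: one list per marker holding the genotype
    frequencies (for example AA, Aa, aa) among cases and among controls.
    """
    combos = []
    for genotype in product(*[range(len(f)) for f in case_freqs]):
        p_case = 1.0
        p_control = 1.0
        # Multiplicative (no-interaction) assumption across markers.
        for marker, g in enumerate(genotype):
            p_case *= case_freqs[marker][g]
            p_control *= control_freqs[marker][g]
        if p_case == 0.0 and p_control == 0.0:
            continue  # genotype cannot occur in either group
        lr = float("inf") if p_control == 0.0 else p_case / p_control
        combos.append((lr, p_case, p_control))

    # Declare genotypes "positive" in order of decreasing likelihood ratio.
    combos.sort(key=lambda c: c[0], reverse=True)
    fpr, tpr = [0.0], [0.0]
    for _, p_case, p_control in combos:
        tpr.append(tpr[-1] + p_case)
        fpr.append(fpr[-1] + p_control)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
        for i in range(1, len(fpr))
    )


# Hypothetical genotype frequencies for two markers (not GAW16 values).
case_freqs = [[0.5, 0.4, 0.1], [0.6, 0.3, 0.1]]
control_freqs = [[0.3, 0.5, 0.2], [0.5, 0.4, 0.1]]
fpr, tpr = lu_elston_roc(case_freqs, control_freqs)
print(round(trapezoid_auc(fpr, tpr), 4))
```

With six three-genotype markers the loop visits all 3^6 = 729 genotype combinations, whether or not each combination is observed in the data.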
Results
Application to GAW16
The Lu-Elston approach uses a set of markers to categorize individuals as testing positive or testing negative. The approach is not designed for biomarker selection/discovery, but to construct an ROC curve with a given set of predetermined markers. For this data set the predetermined SNP markers studied were rs6457617, rs2476601, rs7574865, rs1061622, rs2073838, and rs1248696, corresponding to the MHC, PTPN22, STAT4, TNFRSF1B, SLC22A4, and DLG5 genes, respectively. The first three were chosen because they were linked to RA [2]. The last three SNPs were chosen because 1) they were among the SNPs selected for an RA candidate gene study [3] and 2) among the candidate SNPs only these three appear to be included in this Illumina data set. Thus, all six SNPs were chosen independently of the GAW16 data under consideration. Applying the Lu-Elston approach with genotype frequencies taken from the RA data set yields an ROC curve with AUC = 0.7504. Without a data set, one could have used estimates of prevalence, allele frequencies, and published log-odds ratios to derive genotype frequencies for cases and controls as described by Lu and Elston. However, because the purpose of using these estimates is to obtain genotype frequencies for cases and controls, it is easier to estimate them directly from the data at hand. Further, using the data at hand promotes comparability between this approach and the logistic regression approaches.
The AUC is a global summary measure of how the false-positive rates (FPRs) and true-positive rates (TPRs) change as the cutoff for declaring a positive test is varied. It will be compared to the AUC obtained using conventional logistic regression methods and the same set of predetermined SNPs. Starting with an ROC curve, either the Lu-Elston or the logistic regression method can be used to develop a diagnostic test.
Logistic regression as an alternative to Lu-Elston
Given that the Lu-Elston ROC may be exactly reproduced from a series of univariate regression analyses, this raises the question of whether a single multivariate regression model may produce better discrimination. It might be expected that the fitted probabilities from six univariate models are a special case of the fitted probabilities available from a multivariate model, so a higher AUC could be obtained in a less restricted multivariate model. Alternatively, Lu and Elston argue their model is optimal in that it should have the highest AUC value. In fact, no broad generalizations can be made: in some cases the optimal Lu-Elston method as described above will produce an AUC exceeding that constructed through a simple multiple logistic regression of the same factors, and in other situations the Lu-Elston method will perform worse. This arises because different assumptions may be made regarding P(G_{k} | D) and the collection of genotypes under consideration may differ.
The two curves are quite close, though the logistic curve has a slightly lower computed AUC: 0.7503 compared with 0.7504. This slight decrease associated with the logistic model cannot be generalized. For example, if only five markers are used (excluding the third marker, rs7574865) then the Lu-Elston AUC is 0.7487 and the multivariate model AUC is 0.7490. Of course, these AUCs are identical for practical purposes; the comparisons show that neither approach has uniformly higher AUCs.
For the case of six markers, the Lu-Elston curve is based on 3^{6} = 729 possible genotypes, while the logistic curve is based on the 178 unique genotypes, as determined by the six SNPs, that are actually observed. However, the approaches produce similar results, and this might be expected if the relation in Eq. (1) is approximately true.
where P(D | G_{k}) is the empirically observed proportion of cases among those with genotype G_{k}. Using Eq. (6), one can derive an empirical ROC curve with an AUC of 0.793. However, such an approach likely yields an overfitted model. As an example, a genotype with two cases and no controls would have an infinitely large LR_{(K)} with estimated sensitivity of 100%, a figure that is not likely to be reproduced in a follow-up study with more individuals having that genotype.
Model fitting aspects
where Var(A_{2} - A_{1}) = Var(A_{2}) + Var(A_{1}) - 2Cov(A_{2}, A_{1}), A_{2} and A_{1} represent the different optimal AUCs corresponding to the two collections of SNPs, and the variance term in the denominator would be estimated by a bootstrap approach. They propose comparing the resulting Z-statistic to a standard normal distribution to assess whether one collection represents a significant improvement over the other. However, in the context of nested collections, when one collection properly contains all the SNPs in the other collection, such a comparison likely produces p-values that are incorrect. This follows because if the second collection properly contains the first, then the optimality theorem of Lu and Elston dictates that A_{2} (the AUC associated with the second collection) cannot be smaller than A_{1}, so Z in Eq. (7) is necessarily non-negative. Therefore, the evaluation of Z by a standard normal distribution is not appropriate.
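Eq. (7) itself is not reproduced in this text. From the description above (a Z-statistic built from the two AUCs, with the variance term in the denominator), it presumably has the following form; this is a reconstruction from the surrounding prose, not a quotation of the original equation.

```latex
Z = \frac{A_{2} - A_{1}}{\sqrt{\widehat{\mathrm{Var}}(A_{2} - A_{1})}},
\qquad
\mathrm{Var}(A_{2} - A_{1}) = \mathrm{Var}(A_{2}) + \mathrm{Var}(A_{1})
  - 2\,\mathrm{Cov}(A_{2}, A_{1}).

% Nested collections: A_{2} \ge A_{1} implies Z \ge 0, so
% P(Z < 0) = 0, whereas a standard normal reference gives \Phi(0) = 1/2.
```

Under this form, half of the standard normal reference distribution lies on values that the statistic can never take for nested collections, which is why the resulting p-values are miscalibrated.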
As a conventional alternative, a likelihood-ratio test may be used with the multivariate logistic regression approach to determine whether an additional marker would improve the model.
To evaluate which of the methods (bootstrap or multivariable logistic likelihood-ratio test) had appropriate type I error behavior when adding an unrelated SNP, we sampled 1445 markers from those chromosomes that hold none of the original six markers, spaced at roughly equidistant intervals within each chromosome. Originally, 2000 such markers were drawn, but only 1445 met the quality control and minor allele frequency conditions imposed to ensure that bootstrap samples would generate all three genotypes. Our assumption is that few, if any, of these markers are strongly related to arthritis.
As expected, the bootstrap approach did not perform well because a standard normal distribution centered about 0 is ill-suited for evaluating a test statistic that is necessarily non-negative (i.e., A_{2} ≥ A_{1}). The Kolmogorov-Smirnov p-value for testing whether the 1445 test statistics followed a standard normal distribution was p < 10^{-15}. On the other hand, the likelihood-ratio test performed appropriately for the multivariate logistic regression approach. The p-value for testing whether the 1445 test statistics followed a χ^{2} distribution with two degrees of freedom was p = 0.55. The likelihood-ratio test incorporates a genomic control correction [4] for population stratification that is achieved by dividing all 1445 log-likelihood ratio test statistics by the ratio of the median test statistic value to the median value of a χ^{2} distribution with two degrees of freedom. The inclusion of this genomic control procedure is not likely to account for the difference between the two approaches because the basic problem with the bootstrap approach concerns using a standard normal distribution centered about 0 to model a non-negative random variable.
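The genomic control adjustment described above reduces to a single scaling step. The sketch below assumes the likelihood-ratio statistics are already computed; the function names and the example statistics are hypothetical. It relies on two standard facts about the χ^{2} distribution with two degrees of freedom: its median is 2 ln 2 and its upper tail probability is exp(-x/2).

```python
import math
from statistics import median

# Median of a chi-square distribution with 2 degrees of freedom.
CHI2_2DF_MEDIAN = 2.0 * math.log(2.0)


def genomic_control(lrt_stats):
    """Divide each statistic by (observed median / chi-square(2) median)."""
    inflation = median(lrt_stats) / CHI2_2DF_MEDIAN
    return [s / inflation for s in lrt_stats], inflation


def chi2_2df_pvalue(stat):
    """Upper-tail p-value of a chi-square(2) statistic: exp(-x / 2)."""
    return math.exp(-stat / 2.0)


# Hypothetical likelihood-ratio statistics (not GAW16 results).
stats = [0.5, 1.0, 2.0, 3.0, 6.0]
adjusted, inflation = genomic_control(stats)
pvalues = [chi2_2df_pvalue(s) for s in adjusted]
print(inflation, pvalues)
```

After the adjustment the median of the statistics equals the χ^{2} median by construction, so the correction changes the scale of the statistics but not their ranking.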
We explored the possibility of using a permutation rather than a bootstrap approach to determine whether the addition of another SNP leads to significant improvement in AUC within the Lu-Elston approach. Here, the case-control labels for the additional SNP (one of the 1445) are permuted and an associated A_{2} - A_{1} difference is computed. One thousand permutations produce an A_{2} - A_{1} permutation distribution, which is compared to the observed A_{2} - A_{1} in the original data set. If the original A_{2} - A_{1} exceeds, say, 95% of the empirical A_{2} - A_{1} distribution, this may be taken as evidence of significant AUC improvement. The approach appears promising but was complicated by the indication of population stratification: the empirical p-value distribution was similar to that of the likelihood-ratio test before the stratification adjustment. While the Devlin-Roeder approach to account for stratification may work for a likelihood-ratio test, it is unclear how to proceed for a permutation test. Further, the permutation of labels for just the additional SNP will remove LD with nearby SNPs, which could affect performance.
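The permutation scheme just described can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: genotypes are coded 0, 1, 2, genotype frequencies are estimated separately in cases and controls, the AUC is computed under the multiplicative (no-interaction) assumption, and all function names and data are hypothetical. Permuting the case-control labels of the additional SNP is implemented by shuffling its pooled genotypes and reassigning them to the two groups.

```python
import random
from itertools import product


def genotype_freqs(column):
    """Relative frequencies of genotype codes 0, 1, 2 in one group."""
    counts = [0, 0, 0]
    for g in column:
        counts[g] += 1
    return [c / len(column) for c in counts]


def lu_elston_auc(case_cols, control_cols):
    """AUC from per-marker genotype columns under marker independence."""
    case_f = [genotype_freqs(c) for c in case_cols]
    ctrl_f = [genotype_freqs(c) for c in control_cols]
    combos = []
    for geno in product(range(3), repeat=len(case_f)):
        pc = pn = 1.0
        for m, g in enumerate(geno):
            pc *= case_f[m][g]
            pn *= ctrl_f[m][g]
        if pc == 0.0 and pn == 0.0:
            continue
        combos.append((float("inf") if pn == 0.0 else pc / pn, pc, pn))
    combos.sort(key=lambda c: c[0], reverse=True)
    auc = tpr = 0.0
    for _, pc, pn in combos:
        auc += pn * (tpr + pc / 2.0)  # trapezoid of width pn
        tpr += pc
    return auc


def permutation_pvalue(base_cases, base_controls, new_cases, new_controls,
                       n_perm=1000, seed=0):
    """Observed A2 - A1 and its one-sided permutation p-value."""
    rng = random.Random(seed)
    a1 = lu_elston_auc(base_cases, base_controls)
    observed = lu_elston_auc(base_cases + [new_cases],
                             base_controls + [new_controls]) - a1
    pooled = list(new_cases) + list(new_controls)
    n_case = len(new_cases)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # permute labels of the additional SNP only
        diff = lu_elston_auc(base_cases + [pooled[:n_case]],
                             base_controls + [pooled[n_case:]]) - a1
        if diff >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


# Hypothetical toy data: two base SNPs and one strongly associated new SNP.
base_cases = [[0, 1, 2, 1] * 10, [0, 0, 1, 2] * 10]
base_controls = [[0, 0, 1, 1] * 10, [0, 1, 1, 2] * 10]
new_cases = [2] * 30 + [1] * 10
new_controls = [0] * 30 + [1] * 10
observed, pvalue = permutation_pvalue(base_cases, base_controls,
                                      new_cases, new_controls, n_perm=200)
print(observed, pvalue)
```

Because only the additional SNP is shuffled, A_{1} is unchanged across permutations. The sketch does not address the stratification issue or the loss of LD noted above.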
Incorporating covariates
Logistic regression easily includes covariate information as additional regressors; the covariates may be discrete or continuous. The Lu-Elston approach toward incorporation of covariates is to first categorize the covariate as a factor (even though it may be continuous in nature). Next, the same multiplicative approach is used to determine the probabilities of observing each combination of covariates and genotypes for cases and controls. From these probabilities the likelihood ratios and ROC curves are constructed as before. In the event the covariates are continuous in nature, such a data transformation entails a loss of information and efficiency.
Conclusion
The Lu-Elston approach is valuable for developing classification models in the absence of individual-level data. We have applied Lu and Elston's approach for constructing ROC curves and compared it to conventional logistic regression methods. When the assumption of multiplicative effects without interactions among markers holds, there should be little difference between the Lu-Elston and conventional logistic methods. The advantages of the conventional approach are the ability to use standard approaches toward model selection based upon log-likelihood differences and a simple way to incorporate covariates via regression.
List of abbreviations used
AUC: Area under curve
FPR: False-positive rate
GAW16: Genetic Analysis Workshop 16
LD: Linkage disequilibrium
RA: Rheumatoid arthritis
ROC: Receiver operating characteristic
SNP: Single-nucleotide polymorphism
TPR: True-positive rate
Declarations
Acknowledgements
The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences.
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S7.
References
1. Lu Q, Elston RC: Using the optimal receiver operating characteristic curve to design a predictive genetic test, exemplified with type 2 diabetes. Am J Hum Genet. 2008, 82: 641-651. 10.1016/j.ajhg.2007.12.025.
2. Plenge RM, Seielstad M, Padyukov L, Lee AT, Remmers EF, Ding B, Liew A, Khalili H, Chandrasekaran A, Davies LR, Li W, Tan AK, Bonnard C, Ong RT, Thalamuthu A, Pettersson S, Liu C, Tian C, Chen WV, Carulli JP, Beckman EM, Altshuler D, Alfredsson L, Criswell LA, Amos CI, Seldin MF, Kastner DL, Klareskog L, Gregersen PK: TRAF1-C5 as a risk locus for rheumatoid arthritis - a genomewide study. N Engl J Med. 2007, 357: 1199-1209. 10.1056/NEJMoa073491.
3. Plenge RM, Padyukov L, Remmers EF, Purcell S, Lee AT, Karlson EW, Wolfe F, Kastner DL, Alfredsson L, Altshuler D, Gregersen PK, Klareskog L, Rioux JD: Replication of putative candidate-gene associations with rheumatoid arthritis in >4,000 samples from North America and Sweden: association of susceptibility with PTPN22, CTLA4, and PADI4. Am J Hum Genet. 2005, 77: 1044-1060. 10.1086/498651.
4. Devlin B, Roeder K: Genomic control for association studies. Biometrics. 1999, 55: 997-1004. 10.1111/j.0006-341X.1999.00997.x.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.