Comparing logistic regression, support vector machines, and permanental classification methods in predicting hypertension

In this paper, we compare logistic regression and 2 other classification methods in predicting hypertension given the genotype information. We use logistic regression analysis in the first step to detect significant single-nucleotide polymorphisms (SNPs). In the second step, we use the significant SNPs with logistic regression, support vector machines (SVMs), and a newly developed permanental classification method for prediction purposes. We also detect rare variants and investigate their impact on prediction. Our results show that SVMs and permanental classification both outperform logistic regression, and they are comparable in predicting hypertension status.


Background
Genetic Analysis Workshop 18 (GAW18) data provide genotypes from a real human whole-genome sequencing study including systolic blood pressure (SBP) and diastolic blood pressure (DBP), as well as covariates such as age, medication use, cigarette smoking, parents, and pedigrees. The genome-wide association study (GWAS) data of 1043 individuals come from 20 Mexican American pedigrees enriched for type 2 diabetes from San Antonio, Texas. The data are longitudinal, with 3 measurements for most participants at 4 time intervals (1981 to 1996, 1997 to 2000, 1998 to 2006, and 2009 to 2011). Because there are missing observations in the original phenotype data, simulated hypertension status and blood pressure data sets were generated according to the real genotypes and other covariates.
In our analysis, we use the GWAS data for chromosome 3 and the simulated phenotype data from GAW18. The chromosome 3 GWAS data contain 65,519 single-nucleotide polymorphisms (SNPs). The simulated phenotypes were generated from the real data for 3 examination times and cover the 849 individuals who have both phenotypes and imputed sequence data in the real data set. Two hundred replicates of simulated phenotype data are provided. All individuals have simulated phenotype information at 3 time points with no missing data. The goal of our analysis is to predict whether people will have hypertension. Hence, our data set includes a binary response for simulated hypertension status, GWAS genotypes, age, sex, smoking status, parents, and pedigree information. The covariate medication status is excluded because it contains diagnosis information and is thus highly correlated with hypertension. We choose different numbers of SNPs and compare the corresponding prediction error rates. We also compare the performance of 3 approaches: logistic regression analysis, support vector machines (SVMs) [1][2][3], and the newly developed permanental classification method [4].

Methods
In this paper, we treat SIMPHEN.1.csv as the training data set and SIMPHEN.2.csv-SIMPHEN.5.csv as the testing sets. The conclusions are similar if we use other replicates as training and testing data sets.

Single-nucleotide polymorphism selection using logistic regression
Our main goals are to predict hypertension and to compare the prediction performances of logistic regression and the other 2 classification methods. In the GAW18 data, the hypertension diagnosis variable HTN is binary (yes = 1; no = 0). The logistic regression model has been used extensively for handling categorical responses and shows competitive performance in a wide range of applications. We apply it in this paper as a baseline model. Two sources of data are used in the logistic regression analysis: simulated phenotype data and SNP data. We use simulated phenotype data rather than the real phenotypes because the simulated phenotypes do not contain missing values.
In this step, we treat the 3 repeated measurements for each participant as 3 observations to increase the power in identifying the effects of 2 time-variant covariates, Age and Smoke. The interaction term Age × Sex is also included because of its significance. The factors Father and Mother represent the hypertension status of the parents. The factor Pedigree is included to identify critical effects associated with family history. Father, Mother, and Pedigree are all highly significant in the baseline model.
Next, each SNP in the recommended data set chr3gwas.csv is added separately into the model to measure its significance in terms of p-values:
logit(Pr(HTN = 1)) = Smoke + Age + Sex + Age × Sex + Mother + Father + Pedigree + SNP_i
We sort the corresponding p-values in increasing order and regard SNPs at the beginning of the list as the most significant ones. The SNPs are listed in Table 1. We also perform SNP selections using linear regression analysis with SBP and DBP as response variables separately because HTN = 1 is defined as SBP greater than 140 mm Hg or DBP greater than 90 mm Hg. No transformation is needed for SBP or DBP according to the Box-Cox power transformation. We expect that the most significant SNPs for HTN are also ranked high using SBP or DBP. Indeed, rs11711953 and rs11706549 are the 2 most significant SNPs for all 3 responses. Table 2 lists the 2 × 3 frequency table of hypertension diagnosis and genotype for rs11711953, where XX represents missing values. The p-value of the associated chi-square test is 0.0011, which implies a significant difference in genotype frequencies between the hypertension group and the non-hypertension group.
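As an illustration of the chi-square test on a 2 × 3 frequency table of hypertension status by genotype, the following is a minimal sketch in Python. The function name chi_square_2x3 and the counts are hypothetical and are not the values of Table 2. The sketch relies on the fact that a 2 × 3 table has 2 degrees of freedom, for which the chi-square tail probability is exactly exp(−x/2).

```python
import math

def chi_square_2x3(table):
    """Pearson chi-square test for a 2 x 3 table of counts.

    Returns (statistic, p_value). The table has (2-1)*(3-1) = 2 degrees
    of freedom, and for 2 degrees of freedom the chi-square survival
    function is exactly exp(-x / 2). Assumes every row and column total
    is positive.
    """
    if len(table) != 2 or any(len(row) != 3 for row in table):
        raise ValueError("expected a 2 x 3 table of counts")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, math.exp(-stat / 2.0)

# Hypothetical counts (NOT the values in Table 2): rows are HTN = 0 and
# HTN = 1, columns are three genotype categories.
counts = [[300, 150, 50],
          [100, 100, 50]]
stat, p_value = chi_square_2x3(counts)
print(f"chi-square = {stat:.3f}, p-value = {p_value:.3g}")
```

A small p-value from such a test indicates that the genotype frequencies differ between the two hypertension groups.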
Attempts to use a subset of only low linkage disequilibrium (LD) SNPs (67 SNPs with mutual correlation r < 0.95 and 434 SNPs with r < 0.99 among the 2,500 most significant SNPs) were less successful than using the entire list. Therefore, in the remainder of the paper, we focus on the entire list with the exception that SNPs in perfect LD (r = 1) are removed, which leaves 62,735 SNPs for logistic regression analysis.
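The filtering by mutual correlation can be sketched as a greedy pass over the SNPs ranked by significance. This is a minimal sketch under stated assumptions, not the procedure actually used: genotypes are assumed to be coded as allele counts 0/1/2, the absolute value of the Pearson correlation is used as r, and the names pearson_r and prune_by_correlation are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of allele counts."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no LD information
    return sxy / math.sqrt(sxx * syy)

def prune_by_correlation(ranked_snps, max_abs_r):
    """Keep a SNP only if |r| with every SNP kept so far is below max_abs_r.

    ranked_snps is a list of (name, codes) pairs, most significant first.
    """
    kept = []
    for name, codes in ranked_snps:
        if all(abs(pearson_r(codes, other)) < max_abs_r for _, other in kept):
            kept.append((name, codes))
    return [name for name, _ in kept]

# Hypothetical genotypes for six individuals (assumed 0/1/2 coding).
ranked = [
    ("snpA", [0, 1, 2, 1, 0, 2]),
    ("snpB", [0, 1, 2, 1, 0, 2]),  # identical to snpA: perfect LD
    ("snpC", [2, 1, 0, 1, 2, 0]),  # snpA with alleles flipped: r = -1
    ("snpD", [0, 0, 1, 2, 1, 2]),  # r = 0.5 with snpA
]
no_perfect_ld = prune_by_correlation(ranked, max_abs_r=1 - 1e-9)
low_ld = prune_by_correlation(ranked, max_abs_r=0.4)
print(no_perfect_ld, low_ld)
```

Setting max_abs_r just below 1 removes only SNPs in perfect LD, while thresholds such as 0.95 or 0.99 give the stricter low-LD subsets.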

Prediction based on logistic regression
In the second step, we add the most significant SNPs, that is, those SNPs with the smallest p-values, into the baseline logistic regression model and use the extended model to predict hypertension status. Each SNP provided in the data has up to four genotypes including XX, so it enters the model through at most four indicator variables, and the model could become too complex to handle even with a small number of SNPs. Therefore, for each SNP, the insignificant genotypes (p-value ≥ 0.05) are grouped into a single category, "Not Significant" (NS).
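The grouping of insignificant genotypes can be illustrated with a short Python sketch. The per-genotype p-values, the genotype labels, the function names, and the choice of NS as the reference level of the indicator coding are all assumptions made for illustration.

```python
def collapse_genotypes(genotypes, level_p_values, alpha=0.05):
    """Relabel every genotype whose p-value is >= alpha as 'NS'.

    Genotypes without a recorded p-value are also treated as NS.
    """
    return [g if level_p_values.get(g, 1.0) < alpha else "NS" for g in genotypes]

def indicator_columns(labels, reference="NS"):
    """Build 0/1 indicator columns for every level except the reference."""
    levels = sorted(set(labels) - {reference})
    rows = [[1 if label == level else 0 for level in levels] for label in labels]
    return levels, rows

# Hypothetical genotypes for one SNP and hypothetical per-genotype p-values.
genotypes = ["AA", "AG", "GG", "XX", "AG"]
p_values = {"AA": 0.001, "AG": 0.20, "GG": 0.03, "XX": 0.60}

labels = collapse_genotypes(genotypes, p_values)
levels, rows = indicator_columns(labels)
print(labels)
print(levels, rows)
```

In this toy case the SNP contributes two indicator variables instead of four, which is the reduction in model size that the grouping is meant to achieve.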

Classification
The model is fitted on the training set SIMPHEN.1.csv with the top 5, 10, 15, 20, 50, 100, and 200 SNPs as the predictors and SIMPHEN.2.csv-SIMPHEN.5.csv as the testing sets. We apply the supervised classification methods of support vector machines and the permanental classification to predicting hypertension.
Given a training data set {(x_i, y_i), i = 1, ..., n}, where y_i indicates the class to which the covariate x_i belongs, SVMs use a projection function of the input data into a high-dimensional feature space in which a hyperplane with the maximal margin is found to divide the observations having y_i = 0 from those having y_i = 1. The testing sets are then mapped into that same space and predicted to be in a category based on which side of the hyperplane they fall on. Here we use a radial basis kernel, K(x, x') = exp(−γ||x − x'||^2), with tuning parameters (γ, C), where γ controls the width of the kernel and C is the size of the error penalty. To tune the (γ, C) parameters, we use 10-fold cross-validation on the training data.
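The tuning of (γ, C) by 10-fold cross-validation can be sketched in plain Python as follows. This is a toy sketch under several assumptions and is not the software used in the analysis: the radial basis kernel is taken in its standard form exp(−γ||x − x'||^2), the SVM is a simplified variant without an intercept that is fitted by coordinate ascent on the dual problem, folds are assigned by index rather than at random, and the data and grid values are hypothetical.

```python
import math

def rbf_kernel(x, z, gamma):
    """Radial basis kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def train_svm(X, y, gamma, C, sweeps=50):
    """Intercept-free kernel SVM fitted by dual coordinate ascent.

    Labels in y are 0/1 and are mapped to -1/+1 internally.
    Returns the dual coefficients alpha with 0 <= alpha_i <= C.
    """
    s = [1 if label == 1 else -1 for label in y]
    n = len(X)
    K = [[rbf_kernel(X[i], X[j], gamma) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            f_i = sum(alpha[j] * s[j] * K[i][j] for j in range(n))
            # Newton step on coordinate i, clipped to the box [0, C].
            alpha[i] = min(C, max(0.0, alpha[i] + (1 - s[i] * f_i) / K[i][i]))
    return alpha

def predict_svm(X_train, y_train, alpha, gamma, x_new):
    """Classify x_new by the sign of the kernel expansion."""
    s = [1 if label == 1 else -1 for label in y_train]
    score = sum(a * si * rbf_kernel(xi, x_new, gamma)
                for a, si, xi in zip(alpha, s, X_train))
    return 1 if score >= 0 else 0

def cv_error(X, y, gamma, C, k=10):
    """k-fold cross-validated error rate; observation i is in fold i mod k."""
    errors = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        alpha = train_svm(X_tr, y_tr, gamma, C)
        errors += sum(predict_svm(X_tr, y_tr, alpha, gamma, X[i]) != y[i]
                      for i in test)
    return errors / len(X)

def tune(X, y, gammas, Cs, k=10):
    """Return (error, gamma, C) with the smallest cross-validated error."""
    return min((cv_error(X, y, g, c, k), g, c) for g in gammas for c in Cs)

# Hypothetical two-dimensional toy data: two well-separated groups.
X = ([(0.1 * i, 0.05 * i) for i in range(10)]
     + [(3 + 0.1 * i, 3 - 0.05 * i) for i in range(10)])
y = [0] * 10 + [1] * 10

best_error, best_gamma, best_C = tune(X, y, gammas=[0.1, 1.0, 10.0],
                                      Cs=[0.1, 1.0, 10.0])
print(best_error, best_gamma, best_C)
```

In practice an established SVM library would replace train_svm and predict_svm; the sketch only makes the roles of γ, C, and the cross-validation loop explicit.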
The permanental classification method is a novel stochastic classification method. It regards all observations belonging to the same class as a realization of a stochastic point process, called a permanental process. For each class, the method provides a probability of membership by measuring the stochastic distance between the new observation and each class. For our data analysis, we use the covariance function K(x, x') = exp{−||x − x'||^2/τ^2} and parameter α for the permanental process, and we use 10-fold cross-validation to tune (α, τ) on the training data. One of the major advantages of permanental classification is that it is capable of handling high-dimensional data and multiple classes efficiently.
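To make the role of the covariance function and of (α, τ) more concrete, the following toy sketch computes class membership probabilities from α-permanents. It rests on assumptions that the text does not state: the score of a class is taken to be the ratio of the α-permanent of the covariance matrix of the class with the new observation added to that of the class alone, and the scores are normalized over classes. The α-permanent is evaluated by brute force over permutations, which is feasible only for a handful of points; the data and function names are hypothetical.

```python
import math
from itertools import permutations

def count_cycles(perm):
    """Number of cycles of a permutation given as a tuple of images."""
    seen, cycles = set(), 0
    for start in range(len(perm)):
        if start not in seen:
            cycles += 1
            j = start
            while j not in seen:
                seen.add(j)
                j = perm[j]
    return cycles

def alpha_permanent(A, alpha):
    """Sum over permutations of alpha^(number of cycles) * prod_i A[i][perm(i)]."""
    n = len(A)
    total = 0.0
    for perm in permutations(range(n)):
        product = 1.0
        for i in range(n):
            product *= A[i][perm[i]]
        total += alpha ** count_cycles(perm) * product
    return total

def covariance(x, z, tau):
    """Covariance function K(x, z) = exp(-||x - z||^2 / tau^2)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / tau ** 2)

def gram(points, tau):
    """Covariance matrix of a list of points."""
    return [[covariance(p, q, tau) for q in points] for p in points]

def class_probabilities(classes, x_new, alpha, tau):
    """Normalized permanent ratios (assumed form of the class scores)."""
    scores = {}
    for label, points in classes.items():
        with_new = alpha_permanent(gram(list(points) + [x_new], tau), alpha)
        without = alpha_permanent(gram(list(points), tau), alpha)
        scores[label] = with_new / without
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

# Hypothetical toy data: three training points per class, one new point.
classes = {
    "HTN=0": [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)],
    "HTN=1": [(2.0, 2.0), (2.1, 1.8), (1.9, 2.2)],
}
probs = class_probabilities(classes, x_new=(0.1, 0.1), alpha=1.0, tau=1.0)
print(probs)
```

With α = 1 the α-permanent reduces to the ordinary matrix permanent, and the new point, which lies among the HTN=0 training points, receives the larger probability for that class.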

Effect of logistic regression
Given the fitted logistic regression model, the predicted hypertension status is "yes = 1" if Pr(HTN = 1) ≥ 0.5 and "no = 0" otherwise. We then perform the logistic regression with different numbers (5, 10, 15, 20) of non-identical SNPs included in the baseline model. The prediction errors of logistic regression are summarized in Table 3. It can be seen from Table 3 that the decrease in training error is small as the number of SNPs increases from 5 to 20, while the testing errors increase. This indicates that overfitting becomes an issue when more than 10 SNPs are included. Moreover, among the 20 SNPs added into the model, there are 11 SNPs with mutual correlation less than 0.90 (12 for 0.95 and 15 for 0.99).
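The classification rule and the prediction error rate can be written down in a few lines of Python. The fitted probabilities and observed statuses below are hypothetical, and the function names are chosen for illustration only.

```python
def predict_status(probabilities, threshold=0.5):
    """Predict HTN = 1 when the fitted Pr(HTN = 1) is at least the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def error_rate(predicted, observed):
    """Proportion of observations whose predicted status is wrong."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return sum(p != o for p, o in zip(predicted, observed)) / len(observed)

# Hypothetical fitted probabilities and observed hypertension statuses.
fitted = [0.90, 0.50, 0.49, 0.20, 0.70]
observed = [1, 1, 1, 0, 0]

predicted = predict_status(fitted)
print(predicted, error_rate(predicted, observed))
```

The same error_rate function applies to the training set and to each testing set, so training and testing errors are directly comparable.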

Rare variants
Rare variants could be critical in interpreting some individual cases in practice [5]. However, it is hard to detect these rare variants using regression models, so we conduct a separate analysis for the rare variants. We define rare variants as genotypes whose minor allele frequency is less than 5% over the whole study group in each SNP. The number of rare variants found is 31,794. Chi-square tests are performed to detect the most significant rare variants for hypertension in terms of p-values. Table 4 lists 2 × 2 frequency tables for the top 2 rare variants. The corresponding p-values for them are 4.06 × 10^-12 and 2.64 × 10^-10. As with SNPs, many identical rare variants exist. Therefore, different numbers (5, 10, 15, 20) of significant non-identical rare variants are added into the baseline model. The program does not converge if more rare variants are selected. When the 20 most significant rare variants (p-value < 10^-8) are included, only 6 rare variants among them have mutual correlation less than 0.99. As a result, in most cases, the prediction errors of models with selected rare variants do not improve much. It is not surprising that rare variants do not work as well as the original SNPs because rare variants help only with the prediction of a small portion of patients.
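One reading of the 5% rule can be sketched as follows. The sketch assumes, for illustration only, that genotypes are coded as counts 0/1/2 of one allele with None for a missing genotype, and that SNPs without any variation are not counted as rare variants; the names and data are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """Minor allele frequency from allele counts 0/1/2; None marks missing."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)

def rare_variants(snp_table, cutoff=0.05):
    """Names of SNPs with 0 < MAF < cutoff, in the order of the table."""
    rare = []
    for name, genotypes in snp_table.items():
        maf = minor_allele_frequency(genotypes)
        if maf is not None and 0 < maf < cutoff:
            rare.append(name)
    return rare

# Hypothetical genotypes for 40 individuals at three SNPs.
snp_table = {
    "snp1": [0] * 38 + [1] * 2,              # MAF = 2/80 = 0.025
    "snp2": [0] * 10 + [1] * 20 + [2] * 10,  # MAF = 0.5
    "snp3": [2] * 38 + [1, None],            # 39 called, MAF = 1/78
}
print(rare_variants(snp_table))
```

The frequency is computed over the whole study group, so a variant is classified as rare or common before any split into hypertension groups.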

Effect of classification
We use the same genotypes (SNPs) and covariates (Smoke, Age, Sex, Age × Sex, Mother, Father, Pedigree) chosen by logistic regression. The numbers of SNPs used for SVM and permanental classification are 0, 5, 10, 15, 20, 50, 100, and 200. Tables 5, 6, and 7 list the average prediction error rates over all four testing sets, from the second to the fifth, using common variants, rare variants, and their combinations. The analysis of the simulated data shows that the best prediction error rates of SVM and permanental classification are both close to 12%. Moreover, the rare variants do not provide significant improvement for prediction.

Conclusions
The logistic regression model is used as a baseline. A more sophisticated regression model could be used, but here we focus on the SVM and permanental classification methods. The pedigree and SNP information helps predict hypertension. The strength of SVM and permanental classification is that they are able to handle many SNPs in strong LD. When the 20 or fewer most significant SNPs from the single-SNP logistic regression are used as predictors for the SVM or permanental classifiers, the error rates are comparable. Moreover, the error rates are reduced from about 22% for multi-SNP logistic regression to 12% for SVM and permanental classification when the 100 most significant SNPs from the single-SNP logistic regression are used as predictors. When the 200 most significant SNPs are used as predictors, the testing error rate increases somewhat for SVM and decreases for permanental classification. This implies that overfitting occurs for SVM in this situation. The nonparametric SVM and semiparametric permanental classification can include more SNPs and thus can result in lower prediction errors.
To identify significant SNPs, HTN as a binary response may be less powerful than the quantitative blood pressure measurements. We will compare SNP selection based on blood pressure measurements with selection based on HTN in a subsequent paper. If some rare variants do contribute to hypertension, regression may fail to identify them because few individuals carry each rare variant. Moreover, the rare variants provided only small improvements in predicting hypertension. Collapsing methods [6,7], which create dummy variables indicating the presence of any rare variant in a gene, can be more powerful, and many such approaches are in the literature. Based on current testing results, the classification methods outperformed logistic regression because they included a large number of SNPs; the pedigree information and the common variants of SNPs contribute greatly to prediction. In addition, SVM and permanental classification have comparable prediction errors when considering pedigree information.
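The idea behind a collapsing method can be sketched in Python as a single dummy variable per gene. This is a simplified illustration and not the specific methods of [6,7]: genotypes are assumed to be coded as minor allele counts 0/1/2, a missing genotype (None) is treated as carrying no minor allele, and the variant names and data are hypothetical.

```python
def collapse_gene(minor_allele_counts, rare_variant_ids):
    """Return one 0/1 indicator per individual for a gene.

    The indicator is 1 if the individual carries at least one minor allele
    at any of the listed rare variants. Missing genotypes (None) are
    treated as 0, which is an assumption of this sketch.
    """
    n = len(next(iter(minor_allele_counts.values())))
    indicator = []
    for person in range(n):
        carries = any((minor_allele_counts[v][person] or 0) > 0
                      for v in rare_variant_ids)
        indicator.append(1 if carries else 0)
    return indicator

# Hypothetical minor allele counts for five individuals.
counts = {
    "rv1": [0, 1, 0, 0, None],
    "rv2": [0, 0, 0, 2, 0],
    "common": [1, 2, 1, 0, 1],  # not rare, so not collapsed
}
gene_indicator = collapse_gene(counts, ["rv1", "rv2"])
print(gene_indicator)
```

The collapsed indicator pools the carriers of all rare variants in a gene into one group, which is larger than the group of carriers of any single rare variant.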