Multitask feature selection in microarray data by binary integer programming
BMC Proceedings volume 7, Article number: S5 (2013)
Abstract
A major challenge in microarray classification is that the number of features is typically orders of magnitude larger than the number of examples. In this paper, we propose a novel feature filter algorithm that selects the feature subset with maximal discriminative power and minimal redundancy by solving a quadratic objective function with binary integer constraints. To improve computational efficiency, the binary integer constraints are relaxed and a low-rank approximation is applied to the quadratic term. The proposed feature selection algorithm is then extended to multitask microarray classification problems. We compared the single-task version of the proposed algorithm with 9 existing feature selection methods on 4 benchmark microarray datasets. The empirical results show that the proposed method achieved the most accurate predictions overall. We also evaluated the multitask version of the proposed algorithm on 8 multitask microarray datasets. The multitask feature selection algorithm resulted in significantly higher accuracy than the single-task feature selection methods.
Background
Microarray technology can simultaneously measure the expression levels of thousands of genes for a given biological sample, which is classified into one of several categories (e.g., cancer vs. control tissues). Each sample is represented by a feature vector of gene expressions obtained from a microarray. Using a set of microarray samples with known class labels, the goal is to learn a classifier able to classify a new tissue sample based on its microarray measurements. A typical microarray classification dataset contains a limited number of labeled examples, ranging from only a few to several hundred. Building a predictive model from such small-sample, high-dimensional data is a challenging problem that has received significant attention in the machine learning and bioinformatics communities. To reduce the risk of overfitting, a typical strategy is to select a small number of features (i.e., genes) before learning a classification model. As such, feature selection [1, 2] becomes an essential technique in microarray classification.
There are several reasons for feature selection in microarray data, in addition to improving the classifier's generalization ability. First, the selected genes might be of interest to domain scientists who want to identify disease biomarkers. Second, building a classifier from a small number of features could result in an easily interpretable model that could give important clues to biologists. Depending on how the feature selection process is combined with the model learning process, feature selection techniques can be organized into three categories. (1) Filter methods [3] are independent of the learning algorithm. (2) Wrapper methods [4] are coupled with the learning algorithm using heuristics such as forward selection and backward elimination. (3) Embedded methods [5, 6] integrate feature selection as a part of the classifier training. Both the wrapper and the embedded methods effectively introduce hyperparameters that require computationally costly nested cross-validation and increase the likelihood of overfitting. Feature filter methods are very popular because they are typically conceptually simple, computationally efficient, and robust to overfitting. These properties also explain why filter methods are more widely used than the other two approaches in microarray data classification.
Traditional filter methods rank the features based on their correlation with the class label and then select the top-ranked features. The correlation can be measured by statistical tests (e.g., the t-test) or by information-theoretic criteria such as mutual information. Filter methods easily scale up to high-dimensional data and can be used in conjunction with any supervised learning algorithm. However, because the traditional filter methods assess each feature independently, highly correlated features tend to have similar rankings and tend to be selected jointly. Using redundant features could result in low classification accuracy. As a result, one common improvement for filter methods is to reduce redundancy between selected features. For example, minimal-redundancy-maximal-relevance (mRMR), proposed by [7], selects the feature set with both maximal relevance to the target class and minimal redundancy among the selected features. Because of the high computational cost of considering all possible feature sets, the mRMR algorithm selects features greedily, minimizing their redundancy with features chosen in previous steps and maximizing their relevance to the target class.
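As a minimal sketch of such a univariate filter (an illustration only, not the implementation used in the experiments), the Python snippet below scores each feature by its absolute Pearson correlation with the class label and keeps the top m. The function name `rank_by_abs_correlation` and the toy data are hypothetical.

```python
import numpy as np

def rank_by_abs_correlation(X, y, m):
    """Return the indices of the m features with the largest |Pearson r| to y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    denom[denom == 0] = np.inf          # constant features get score 0
    scores = np.abs(Xc.T @ yc) / denom
    return np.argsort(-scores)[:m], scores

# Toy data (hypothetical): 6 samples, 5 features.
rng = np.random.default_rng(0)
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)
X = rng.normal(size=(6, 5))
X[:, 2] = y + 0.01 * rng.normal(size=6)        # strongly relevant feature
X[:, 4] = X[:, 2] + 0.01 * rng.normal(size=6)  # near-duplicate of feature 2

top, scores = rank_by_abs_correlation(X, y, m=2)
print(sorted(top.tolist()))
```

Because feature 4 is a noisy copy of feature 2, both receive nearly identical scores and are selected together, which illustrates the redundancy problem that motivates methods such as mRMR.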
A common critique of popular feature selection filters is that they are typically based on relatively simple heuristics. To address this concern, recent research has produced more principled formulations of feature filters. For example, the algorithms proposed in [8] and [9] attempt to select the feature subset with maximal relevance and minimal redundancy by solving a constrained quadratic optimization problem (QP). The objective used by [8] is a combination of a quadratic term and a linear term. The redundancy between feature pairs is measured by the quadratic term and the relevance between features and the class label is measured by the linear term. The features are ranked based on a weight vector obtained by solving the QP. The main limitation of this method is that the relevance between a feature and the class label is measured by either Pearson correlation or mutual information. However, Pearson correlation assumes a normal distribution of the measurements, which might not be appropriate for measuring correlation between numerical features and a binary target. Mutual information requires discrete variables and is sensitive to discretization. The objective used by [9] contains only one quadratic term. This quadratic term consists of two parts: one measures feature relevance using mutual information between features and the class label, and the other measures feature redundancy using mutual information between each feature pair. However, the square matrix in the proposed quadratic term is not positive semidefinite. Thus, the resulting optimization problem is not convex and could result in poor local optima.
In this paper, we propose a novel feature filter method to find the feature subset that maximizes the inter-class separability and intra-class tightness, and minimizes the pairwise correlations between selected features. We formulate the problem as a quadratic program with binary integer constraints. For high-dimensional microarray data, solving the proposed quadratic programming problem with binary integer constraints incurs high time and space costs. Therefore, we relax the binary integer constraints and apply a low-rank approximation to the quadratic term in the objective function. The resulting problem can be solved efficiently to obtain a small subset of features with maximal relevance and minimal redundancy.
In many real-life microarray classification problems, the size of the given microarray dataset is particularly small (e.g., we might have fewer than 10 labeled high-dimensional examples). In this case, even the most carefully designed feature selection algorithms are bound to underperform. Probably the only remedy is to borrow strength from external microarray datasets. Recent research [10, 11] illustrates that multitask feature selection algorithms can improve the classification accuracy. The multitask feature selection algorithms select the informative features jointly across many different microarray classification datasets. Following this observation, we extend our feature selection algorithm to the multitask microarray classification setup.
The contributions of this paper can be summarized as follows. (1) We propose a novel gene filter method that obtains a feature subset with maximal discriminative power and minimal redundancy; (2) the globally optimal solution can be found efficiently by relaxing the integer constraints and using a low-rank approximation technique; (3) we extend our feature selection method to the multitask classification setting; (4) the experimental results show that our algorithms achieve higher accuracy than the existing filter feature selection methods, in both single-task and multitask settings.
Results and discussion
We compared our proposed feature selection algorithm with 9 representative feature selection filters. The first 6 are standard feature selection filters: Pearson Correlation (PC), ChiSquare [3], GINI, Infogain, the Kruskal-Wallis test (KW) and Relief [12]. They rank the features based on different criteria that measure the correlation between each feature and the class label. The remaining 3 are state-of-the-art feature selection methods that are able to remove redundant features: mRMR [7], QPFS [8] and SASMIF [9]. The feature similarity for both QPFS and our algorithm was measured by Pearson correlation. For a fair comparison, for the SASMIF method we used the top m ranked features. To balance the effect of feature relevance and feature redundancy, the parameter λ in (9) was set to $\frac{m^{2} M \sum_{i} C_{i}}{\sum_{i,j} Q_{ij}}$. The low-rank parameter k was set to 0.1 · M, as suggested in [13]. Our algorithm is denoted as STBIP for the single-task version and MTBIP for the multitask version.
Given the selected features, we used LIBLINEAR [14] to train a linear SVM model. The linear SVM was chosen because previous studies [5] showed that SVM classifiers can be very accurate on microarray data. The regularization parameter C of LIBLINEAR was chosen among {10^{3}, 10^{4}, …, 10^{3}}. For the experiments in the single-task scenario, we used nested 5-fold cross-validation to select the optimal regularization parameter. For the experiments in the multitask learning scenario, nested cross-validation was too time consuming. Thus, we simply fixed the regularization parameter to 1 in the multitask experiments.
Single task feature selection
In this section, we evaluate our proposed feature selection algorithm for single-task learning using four benchmark microarray gene expression cancer datasets: (1) the Colon dataset [15], containing 62 samples, 40 tumor and 22 normal; (2) the Lung dataset [16], containing 86 samples, 24 from patients that died and 62 from patients that survived; (3) the Diffuse B-cell Lymphoma (DLBCL) dataset [17], containing 77 samples, 58 from DLBCL patients and 19 from B-cell lymphoma patients; (4) the Myeloma dataset [18], containing 173 samples, 137 from patients with bone lytic lesions and 36 from control patients. We summarize the characteristics of these datasets in Table 1.
For each microarray dataset, we randomly selected 20 positive and 20 negative examples (except for the DLBCL dataset, where we chose 15 positive and 15 negative examples) as the training set and used the rest as the test set. Due to the class imbalance in the test sets, we used AUC, the area under the Receiver Operating Characteristic (ROC) curve, to evaluate the performance. The average AUC over 10 repetitions of the experiment with different random splits into training and test sets is reported in Table 2. We compared the AUC of the different feature selection algorithms for m = 20, 50, 100, 200, 1000. For each dataset, the best AUC score among all methods is emphasized in bold. As shown in Table 2, our proposed method achieved the highest accuracy on the Colon and DLBCL datasets. On the Myeloma dataset, it had the highest accuracy when m = 100 and 1000 and the second highest accuracy when m = 20, 50 and 200. On the Lung dataset, our algorithm was ranked in the upper half of the competing algorithms. The last column in Table 2 shows the average AUC score across the four datasets. Our method achieved the highest average AUC scores. The next two most successful feature selection algorithms were Relief and QPFS. mRMR had somewhat lower accuracy, comparable to simple filters such as PC, ChiSquare, GINI and InfoGain. SASMIF was considerably less accurate, while KW was the least successful.
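For readers unfamiliar with the metric, the following sketch computes AUC directly from its probabilistic interpretation: the fraction of (positive, negative) test pairs in which the positive example receives the higher classifier score, with ties counted as one half. It is a plain-Python illustration with a hypothetical function name, not the evaluation code used for Table 2.

```python
def auc_score(labels, scores):
    """AUC as P(score of a random positive > score of a random negative)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Each correctly ordered pair counts 1, each tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

perfect = auc_score([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
mixed = auc_score([1, 0, 1, 0], [0.9, 0.8, 0.2, 0.1])
print(perfect, mixed)
```

Because the statistic depends only on how positives are ranked relative to negatives, it is insensitive to the class proportions of the test set, which is why it suits the imbalanced test sets described above.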
Multitask feature selection
In this section, we evaluate our proposed feature selection algorithm for multitask learning. We used 8 cancer-related binary microarray classification datasets published in [19]. The data are summarized in Table 3. As shown in Table 3, each of the 8 microarray datasets is very small. Single-task feature selection algorithms are not expected to perform well on such data because there might be insufficient information even when simple feature selection filters are used. In contrast, our multitask feature selection algorithm is expected to improve the accuracy by borrowing strength across multiple microarray datasets.
For each microarray data set, we randomly selected N^{+} = 2, 3, 4, 5 positive and the same number of negative examples as the training data and used the rest as the test data. We show the results for m = 100 in this section. The average AUC across these 8 microarray datasets is shown in Figure 1. The results clearly show the multitask version of our proposed algorithm was the most successful algorithm overall.
To better understand why the multitask feature selection algorithm obtained better overall accuracy than the single-task feature selection algorithms, we show the AUC score on each individual microarray dataset for ${N}^{+}=3$ in Table 4. The single-task version of our feature selection algorithm had the highest overall accuracy among the single-task benchmarks, a result consistent with Table 2. The multitask version of our algorithm has a higher AUC than its single-task version on 4 datasets, and its average AUC is about 1.5% higher. In 4 cases (the Colon, Lung, Pancreas and Renal datasets) we can also observe negative transfer, where the accuracy drops. How to prevent negative transfer in multitask feature selection is an interesting topic for our future research.
Gene-annotation enrichment analysis for multitask microarray datasets
The multitask experimental results show that the accuracies obtained by MTBIP are better overall than those of the single-task feature filters. We therefore performed functional annotation of the genes selected by MTBIP. In the MTBIP filter, only one selected gene list is obtained for all 8 types of cancer. Given this gene list, the top 10 enriched GO terms were obtained using DAVID Bioinformatics Resources [20] and are shown in Table 5. In this table, the hits column gives the number of genes in the selected gene list that are associated with the specific GO term. The p-value was obtained by the Fisher exact test, which is used to measure the gene enrichment in annotation terms. After obtaining the enriched GO terms, we used the Comparative Toxicogenomics Database (CTD) [21] to check whether there is an association between each GO term and each cancer type. The last column in Table 5 shows the disease association for each GO term. The datasets are ordered as Bladder (B), Breast (B), Colon (C), Lung (L), Pancreas (P), Prostate (P), Renal (R) and Uterus (U). If a GO term is associated with the given type of cancer, we write down the cancer name. Otherwise, we put the symbol # in that position. The enriched GO terms based on the MTBIP gene list tend to be associated with many different types of cancer. As shown in Table 5, GO:0005856 (cytoskeleton), GO:0005886 (plasma membrane) and GO:0032403 (protein complex binding) were associated with 7 different cancers. GO:0030054 (cell junction) and GO:0015629 (actin cytoskeleton) were associated with 6 different cancers.
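To make the enrichment p-value concrete, the sketch below computes a plain one-sided Fisher exact (hypergeometric tail) probability of seeing at least the observed number of hits. This is an assumption-laden illustration: the function and argument names are hypothetical, and DAVID may report a modified variant of the test, so values from this sketch need not match Table 5.

```python
from math import comb

def enrichment_pvalue(hits, list_size, annotated, population):
    """P(X >= hits) when list_size genes are drawn without replacement from a
    population in which `annotated` genes carry the GO term."""
    upper = min(list_size, annotated)
    total = comb(population, list_size)
    tail = sum(
        comb(annotated, x) * comb(population - annotated, list_size - x)
        for x in range(hits, upper + 1)
    )
    return tail / total

# Toy numbers (hypothetical): 10 genes in total, 5 annotated, 5 selected.
p_all_hits = enrichment_pvalue(5, 5, 5, 10)   # every selected gene is annotated
p_no_hits = enrichment_pvalue(0, 5, 5, 10)    # the tail covers every outcome
print(p_all_hits, p_no_hits)
```

A small p-value indicates that the selected list contains more genes with the term than would be expected from a random list of the same size.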
Conclusion
We proposed a novel feature filter method to select a feature subset with high discriminative power and minimal redundancy. The proposed feature selection method is based on a quadratic optimization problem with binary integer constraints. It can be solved efficiently by relaxing the binary integer constraints and applying a low-rank approximation to the quadratic term in the objective. Furthermore, we extended our feature selection algorithm to multitask classification problems. The empirical results on a number of microarray datasets show that in the single-task scenario the proposed algorithm results in higher accuracy than the existing feature selection methods. The results also suggest that our multitask feature selection algorithm can further improve the microarray classification performance.
Methodology
Feature selection by binary integer programming
Let us denote the training dataset as D = {(x_{ i }, y_{ i }), i = 1, …, N}, where x_{ i } is an M-dimensional feature vector for the ith example, y_{ i } is its class label, and N is the number of training examples. Our objective is to select a feature subset that is strongly predictive of the class label and has low redundancy. We introduce a binary vector $w={\left[{w}_{1},{w}_{2},\ldots ,{w}_{M}\right]}^{T}$ to indicate which features are selected:
$w_{j}=\begin{cases}1, & \text{if feature } j \text{ is selected}\\ 0, & \text{otherwise}\end{cases}\qquad (1)$
So, the new feature vector for the ith example after feature selection can be represented as g_{ i } = x_{ i } ⊙ w, where the symbol ⊙ denotes the element-wise product. Therefore, g_{ ij } = x_{ ij } for w_{ j } = 1 and g_{ ij } = 0 for w_{ j } = 0. Alternatively, g_{ i } can be represented as g_{ i } = W x_{ i }, where W is a diagonal matrix whose diagonal is the vector w.
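The two equivalent representations of the masked feature vector can be checked on a small example. The values below are hypothetical and serve only to illustrate the notation.

```python
import numpy as np

x = np.array([2.0, -1.0, 0.5, 3.0])   # one example with M = 4 features
w = np.array([1, 0, 1, 0])            # binary indicator: keep features 0 and 2

g_elementwise = x * w                 # g = x ⊙ w
g_matrix = np.diag(w) @ x             # g = W x, with W = diag(w)

print(g_elementwise, g_matrix)
```

Both forms zero out the unselected coordinates; the matrix form is the one used in the trace manipulation of Proposition 1.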
Intuitively, we would like the examples from the same class to be close (intra-class tightness) and the examples from different classes to be far away (inter-class separability) in the space defined by the selected features. The Euclidean distance between two examples x_{ i } and x_{ j } in the new feature space can be calculated as
The inter-class separability of the data can be measured by a sum of the pairwise distances between examples with different class labels
The intra-class tightness of the data can be measured by a sum of the pairwise distances between examples with the same class label
Therefore, the problem of selecting a feature subset to maximize the intra-class tightness and inter-class separability can be formulated as
Objective (5) can be rewritten as
where matrix A is defined as:
In addition to the objective (5) or (6), in order to improve the diversity of the selected features, we would like to select a feature subset with minimal redundancy. A feature is defined to be redundant if there is another feature highly correlated with it. Let us denote Q as a symmetric positive semidefinite matrix of size M × M, whose element Q_{ ij } represents the similarity between feature i and feature j. Since the measurements of each feature across different samples are normally distributed, it is reasonable to use Pearson correlation to measure the similarity between two features here. Also, the similarity matrix Q is positive semidefinite when Pearson correlation is used. Then, we define the redundancy among the selected set of features represented by vector w as their average pairwise similarity w^{T}Q w /m^{2}, where m is the number of selected features. Our objective is to minimize the redundancy defined in this way.
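The redundancy score w^{T}Q w /m^{2} can be illustrated with a short sketch. It assumes Q is the raw (signed) Pearson correlation matrix of the feature columns; the function name and the toy matrix are hypothetical.

```python
import numpy as np

def average_redundancy(X, w):
    """Return w^T Q w / m^2, with Q the Pearson correlation matrix of X's columns."""
    Q = np.corrcoef(X, rowvar=False)
    m = w.sum()
    return float(w @ Q @ w) / m ** 2

# Column 1 is an exact multiple of column 0; column 2 is uncorrelated with column 0.
X = np.array([[1.0, 2.0,  1.0],
              [2.0, 4.0, -1.0],
              [3.0, 6.0, -1.0],
              [4.0, 8.0,  1.0]])

r_dup = average_redundancy(X, np.array([1, 1, 0]))   # two duplicated features
r_div = average_redundancy(X, np.array([1, 0, 1]))   # two uncorrelated features
min_eig = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)).min()
print(r_dup, r_div, min_eig)
```

Selecting the duplicated pair gives a redundancy of about 1, while the uncorrelated pair gives about 0.5 (only the diagonal self-similarities remain). The smallest eigenvalue of Q is non-negative up to rounding, consistent with the positive semidefiniteness noted above.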
The first contribution of this paper is to formulate the feature selection task as a new quadratic programming problem subject to binary integer and linear constraints as follows,
The first term in (8), which is linear as shown in Proposition 1 below, tries to maximize the inter-class separability and intra-class tightness of the data. It describes the discriminative power of the selected feature subset. The second, quadratic term is the average pairwise similarity score between the selected features, and minimizing it reduces feature redundancy. Parameter λ is introduced to control the trade-off between feature relevance and feature redundancy. Since Q is a positive semidefinite matrix, the proposed objective function is convex. The first constraint ensures that the resulting vector w is binary, while the second constraint ensures that exactly m features are selected.
Proposition 1. The first term of the objective function (8) can be written as a linear term c^{T} w, where c is a vector of size M with elements c_{ i } = (X^{T}LX)_{ ii }, and L is the Laplacian matrix of A, defined as L = D − A. D is a diagonal degree matrix such that $D_{ii}=\sum_{j}A_{ij}$. X is the N × M feature matrix, in which each row corresponds to one example. (X^{T}LX)_{ ii } denotes the ith element on the diagonal of the matrix X^{T}LX.
Proof. Let us denote W as a diagonal matrix where W_{ ii } = w_{ i }. Then,
Because w_{ i } ∈ {0, 1}, WW^{T} = W. Therefore, $\mathrm{trace}\left(X^{T}LXWW^{T}\right)=\sum_{i=1}^{M}\left(X^{T}LX\right)_{ii}W_{ii}=c^{T}w$, where c_{ i } = (X^{T}LX)_{ ii }.
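Proposition 1 can be checked numerically. The sketch below rests on two labeled assumptions, because the defining equations are not reproduced here: the pair weights A_{ ij } are taken to be +1 for same-class pairs and −1 for different-class pairs, and the first term of the objective is taken to be one half of the A-weighted sum of squared pairwise distances. Under these assumptions the weighted sum equals c^{T}w for any binary w.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 5
X = rng.normal(size=(N, M))
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Assumed (hypothetical) weights: +1 same class, -1 different class, 0 on the diagonal.
A = np.where(y[:, None] == y[None, :], 1.0, -1.0)
np.fill_diagonal(A, 0.0)

L = np.diag(A.sum(axis=1)) - A        # graph Laplacian L = D - A
c = np.diag(X.T @ L @ X)              # c_i = (X^T L X)_ii

w = np.array([1, 0, 1, 1, 0])         # an arbitrary binary selection
Z = X * w                             # row i is W x_i
diff = Z[:, None, :] - Z[None, :, :]
pairwise = 0.5 * np.sum(A * np.sum(diff ** 2, axis=2))

print(pairwise, c @ w)
```

The practical consequence is that c can be precomputed once from X and the labels, so the discriminative term costs only an inner product for each candidate w.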
Based on Proposition 1, objective (8) can be rewritten as the following constrained quadratic optimization problem,
There are two practical obstacles in solving (9): (1) the binary constraint on the variable w, and (2) the feature similarity matrix Q has size M × M, which implies a high computational cost for high-dimensional data. In the next two sections, we first relax the binary constraint and then apply a low-rank approximation to Q. The resulting constrained optimization problem can be solved very efficiently, in time linear in the number of features M.
Problem Relaxation. Due to the binary constraint on the indicator vector w, it is difficult to solve (9) [9]. To resolve this, we first relax the binary constraint on w by allowing its elements w_{ i } to be within the range [0, m]. Then, (9) can be approximated by
Now, (10) becomes a standard Quadratic Programming (QP) problem. The optimal solution can be obtained by a general QP solver (e.g., MOSEK [22]).
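As an illustration of solving the relaxed problem without a commercial solver, the sketch below uses projected gradient descent, which is swapped in here for the general QP solver mentioned above. It assumes the relaxed problem has the form: minimize c^{T}w + λ w^{T}Q w subject to w ≥ 0 and Σ_{i} w_{ i } = m. This form, the function names, and the toy inputs are assumptions made for the example.

```python
import numpy as np

def project_to_scaled_simplex(v, m):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = m}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - m
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def solve_relaxed_qp(c, Q, lam, m, n_iter=500):
    """Projected gradient descent on c^T w + lam * w^T Q w over the scaled simplex."""
    M = len(c)
    w = np.full(M, m / M)                                   # feasible start
    lip = 2.0 * lam * max(np.linalg.eigvalsh(Q)[-1], 1e-12) # gradient Lipschitz constant
    for _ in range(n_iter):
        grad = c + 2.0 * lam * (Q @ w)
        w = project_to_scaled_simplex(w - grad / lip, m)
    return w

# Toy problem: features 0 and 3 have the most favorable (lowest) linear cost.
c = np.array([-5.0, 0.0, 0.0, -5.0])
w_toy = solve_relaxed_qp(c, np.eye(4), lam=1.0, m=2)
w_flat = solve_relaxed_qp(np.zeros(4), np.eye(4), lam=1.0, m=2)
print(w_toy, w_flat)
```

With an identity Q and no linear term, the weight is spread evenly over all features; once the linear term favors features 0 and 3, the solution concentrates on them. Ranking the entries of the relaxed solution then yields the selected subset.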
Low-rank Approximation. The matrix Q in (10) is of size M × M, which results in high time and space costs for high-dimensional microarray data. Therefore, we would like to avoid this computational bottleneck by using low-rank approximation techniques.
The matrix Q in (10) is symmetric positive semidefinite. So, it can be decomposed as Q = UΛU^{T}, where U is a matrix of eigenvectors and Λ is a diagonal matrix with the corresponding eigenvalues of Q. By setting $\alpha =\Lambda^{\frac{1}{2}}U^{T}w$, it follows that $w=U\Lambda^{-\frac{1}{2}}\alpha$. Therefore, problem (10) can be rewritten as
Typically, the rank of Q (let us denote it as k) is much smaller than M, $k\ll M$. Therefore, we can replace the full eigenvector and eigenvalue matrices U and Λ by the top k eigenvectors and eigenvalues, resulting in an M × k matrix U_{ k } and a k × k diagonal matrix Λ_{ k }, without losing much information. Therefore, (11) is reformulated as
Since α is a vector of length k, with $k\ll M$, the QP (11) is reduced to a new QP in a k-dimensional space with M + 1 constraints. Once the solution α of (12) is obtained, the variable w in the original space can be approximated by $w=U_{k}\Lambda_{k}^{-\frac{1}{2}}\alpha$.
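The change of variables can be verified on a small example: with α = Λ_{ k }^{1/2}U_{ k }^{T}w the quadratic form w^{T}Q w becomes simply α^{T}α, and mapping α back recovers the component of w that lies in the span of the top k eigenvectors. The matrix below is a hypothetical rank-3 example, so k = 3 loses nothing in the quadratic term.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(6, 3))
Q = B @ B.T                                   # 6 x 6, PSD, rank 3

evals, evecs = np.linalg.eigh(Q)              # eigenvalues in ascending order
k = 3
Lam_k, U_k = evals[-k:], evecs[:, -k:]        # top-k eigenpairs

w = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
alpha = np.sqrt(Lam_k) * (U_k.T @ w)          # alpha = Lam_k^{1/2} U_k^T w
w_back = U_k @ (alpha / np.sqrt(Lam_k))       # w ~= U_k Lam_k^{-1/2} alpha

print(w @ Q @ w, alpha @ alpha)
```

Note that `w_back` equals the projection of w onto the column space of U_{ k }, which is why the recovered w is described as an approximation rather than an exact inverse.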
Decomposing the matrix Q requires O(M^{3}) time, which is expensive for microarray data, where M is large. Next we show how to efficiently compute the top k eigenvectors and eigenvalues using the Nyström approximation technique [23]. The Nyström method approximates an M × M symmetric positive semidefinite matrix Q by
$Q\approx E_{Mk}W_{kk}^{-1}E_{Mk}^{T}\qquad (13)$
where E_{ Mk } denotes the submatrix of Q created by selecting k of its columns, and W_{ kk } is the submatrix that corresponds to the intersection of the selected columns and rows. Sampling schemes in the Nyström method include random sampling [23], probabilistic sampling [24], and k-means based sampling [13]. We chose k-means sampling in our experiments because [13] showed that it produces very good low-rank approximations at a relatively low cost. Given (13), we can easily obtain the low-rank approximation of Q as
$Q\approx GG^{T},\quad G=E_{Mk}W_{kk}^{-\frac{1}{2}}\qquad (14)$
As shown in the following Proposition 2, the top k eigenvectors and eigenvalues can be computed in O(Mk^{2}) time using the Nyström method, which is much more efficient than the eigendecomposition of Q, which requires O(M^{3}) time.
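A minimal sketch of the Nyström factorization follows. It uses uniform random column sampling as a stand-in for the k-means sampling used in the experiments, and the function name, the eigenvalue threshold `eps`, and the toy matrix are assumptions for illustration. Only the M × k block of sampled columns is needed; the full Q is built here solely to check the result.

```python
import numpy as np

def nystrom_factor(E, landmark_idx, eps=1e-10):
    """Return G (M x k) such that Q ~= G G^T, given E = Q[:, landmark_idx]."""
    W = E[landmark_idx, :]                         # k x k block W_kk
    evals, evecs = np.linalg.eigh(W)
    keep = evals > eps                             # guard against tiny eigenvalues
    V = evecs[:, keep]
    W_inv_sqrt = (V / np.sqrt(evals[keep])) @ V.T  # symmetric (pseudo-)inverse square root
    return E @ W_inv_sqrt

rng = np.random.default_rng(0)
B = rng.normal(size=(50, 4))
Q = B @ B.T                                        # 50 x 50 PSD matrix of rank 4

landmarks = rng.choice(50, size=4, replace=False)  # random sampling (assumption)
G = nystrom_factor(Q[:, landmarks], landmarks)
err = np.abs(G @ G.T - Q).max()
print(G.shape, err)
```

For a matrix whose rank does not exceed the number of sampled columns, as in this toy case, the approximation is exact up to rounding; for real similarity matrices the quality depends on how well the sampled columns cover the feature space.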
Proposition 2. The top k eigenvectors U_{ k } and the corresponding eigenvalues Λ_{ k } of $Q=GG^{T}$ can be approximated as Λ_{ k } = Λ_{ G } and $U_{k}=GU_{G}\Lambda_{G}^{-\frac{1}{2}}$, where U_{ G } and Λ_{ G } are obtained by the eigendecomposition of the k × k matrix $G^{T}G=U_{G}\Lambda_{G}U_{G}^{T}$.
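Proof. First, we observe that U_{ k } contains orthonormal columns: $U_{k}^{T}U_{k}=\Lambda_{G}^{-\frac{1}{2}}U_{G}^{T}G^{T}GU_{G}\Lambda_{G}^{-\frac{1}{2}}=\Lambda_{G}^{-\frac{1}{2}}\Lambda_{G}\Lambda_{G}^{-\frac{1}{2}}=I$.
Next, we observe that $U_{k}\Lambda_{k}U_{k}^{T}=GU_{G}\Lambda_{G}^{-\frac{1}{2}}\Lambda_{G}\Lambda_{G}^{-\frac{1}{2}}U_{G}^{T}G^{T}=GU_{G}U_{G}^{T}G^{T}=GG^{T}=Q$.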
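The statement of Proposition 2 can also be checked numerically. The sketch below forms the small k × k matrix G^{T}G, takes its eigendecomposition, and lifts the eigenvectors back to M dimensions with the Λ_{ G }^{−1/2} scaling. The function name and the random G are hypothetical.

```python
import numpy as np

def top_eig_from_factor(G, eps=1e-10):
    """Eigenpairs of the M x M matrix G G^T computed from the k x k matrix G^T G."""
    evals, U_G = np.linalg.eigh(G.T @ G)
    keep = evals > eps                    # drop numerically zero directions
    evals, U_G = evals[keep], U_G[:, keep]
    U_k = (G @ U_G) / np.sqrt(evals)      # U_k = G U_G Lam_G^{-1/2}
    return evals, U_k

rng = np.random.default_rng(3)
G = rng.normal(size=(30, 4))              # M = 30, k = 4
Lam_k, U_k = top_eig_from_factor(G)

orthonormal = np.allclose(U_k.T @ U_k, np.eye(U_k.shape[1]))
reconstructs = np.allclose((U_k * Lam_k) @ U_k.T, G @ G.T)
print(orthonormal, reconstructs)
```

The cost is dominated by forming G^{T}G and the product G U_{ G }, both O(Mk^{2}), plus an O(k^{3}) eigendecomposition of the small matrix.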
Our proposed feature selection algorithm is summarized in Algorithm 1. In Algorithm 1, steps 1 to 5 require O(Mk^{2} + k^{3}) time. The QP in step 6, with k variables, has polynomial time complexity with respect to k. Step 7 requires O(Mk) time. Therefore, overall, the proposed feature selection algorithm is very efficient and has time complexity linear in the number of features M.
Algorithm 1 Single-Task Binary Integer Program Feature Selection
Input: training data X, their labels y, regularization parameter λ, number of features m, low-rank parameter k.
Output: m selected features
1. Apply Proposition 1 to compute the vector c
2. Use k-means to select k landmark features for the low-rank approximation of Q
3. Compute E_{ Mk } and W_{ kk } in (13)
4. Obtain the low-rank approximation of Q by (14)
5. Apply Proposition 2 to compute the top k eigenvalues Λ_{ k } and eigenvectors U_{ k } of Q
6. Obtain α by solving the lower-dimensional QP problem (12)
7. Obtain w in the original feature space as $w=U_{k}\Lambda_{k}^{-\frac{1}{2}}\alpha$
8. Rank the features according to the weight vector w and select the top m features
Multitask feature selection by binary integer programming
Multitask learning algorithms have been shown to be able to achieve significantly higher accuracy than single task learning algorithms both empirically [11] and theoretically [25]. Motivated by these promising results, in this section, we extend our feature selection algorithm to the multitask setting. The objective is to select features which are discriminative and nonredundant over multiple microarray datasets.
Let us suppose there are K different but similar classification tasks, and denote the training data of the tth task as $D^{t}=\left\{\left(\mathbf{x}_{i}^{t},y_{i}^{t}\right),\ i=1,\ldots ,N^{t}\right\}$, where $N^{t}$ is the number of training examples of the tth task. [10, 11] proposed multitask feature selection algorithms that use the ℓ_{1,2} norm to regularize the linear model coefficients β across K different classification tasks. The ℓ_{1,2} norm regularizer over all βs across K classification tasks can be expressed as $\sum_{j=1}^{M}\sqrt{\sum_{t=1}^{K}\left(\beta_{t}^{j}\right)^{2}}$, where $\beta_{t}^{j}$ is the coefficient of the jth feature in the tth task. Because it applies the ℓ_{1} norm to the ℓ_{2} norms of the groups of coefficients of each feature across K tasks, the ℓ_{1,2} norm regularizer selects the same feature subset across K tasks. However, the ℓ_{1,2} norm regularized problem is challenging to solve because of the non-smoothness of the ℓ_{1,2} norm. In this section, we show that our proposed feature selection method can be easily extended to a multitask learning version. The resulting optimization problem has the same form as objective (9), which can be solved efficiently as shown in the previous section.
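For concreteness, the ℓ_{1,2} regularizer can be computed as the sum over features of the ℓ_{2} norm of that feature's coefficients across tasks. The sketch below assumes the coefficients are stored in an M × K array (one row per feature, one column per task); the layout and the function name are choices made for this example.

```python
import numpy as np

def l12_norm(B):
    """Sum over features (rows) of the l2 norm of each feature's coefficients across tasks."""
    return float(np.sqrt((B ** 2).sum(axis=1)).sum())

# Three features, two tasks (hypothetical coefficients).
B = np.array([[3.0, 4.0],    # used by both tasks: contributes 5
              [0.0, 0.0],    # unused by every task: contributes 0
              [1.0, 0.0]])   # used by one task: contributes 1
value = l12_norm(B)
print(value)
```

A feature contributes nothing only when its whole row is zero, which is why penalizing this quantity drives entire rows to zero and yields a feature subset shared by all tasks.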
Let us denote w_{ t } as the binary indicator defined in (1) that represents the selected feature subset of the tth classification task. If we do not consider the relatedness between these K classification tasks, each individual w_{ t } can be obtained by applying Algorithm 1 to the corresponding classification task. Based on the conclusions of [10, 11], it would be beneficial to select the same feature subset across the K related classification tasks. In our case, this can be achieved by setting w_{ t } = w ∀ t. Therefore, the same feature subset across K tasks, defined by vector w, can be obtained by solving the following optimization problem,
where c_{ j } and Q_{ j } are the linear and quadratic terms of the QP corresponding to the jth task. The details of how to compute c_{ j } and Q_{ j } are explained in the previous section. The techniques of relaxing the binary integer constraints and applying a low-rank approximation to Q, introduced in the previous section, can be used to solve (15). The extended multitask feature selection algorithm is also a feature filter. It can be used in conjunction with any supervised learning algorithm.
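One way to read the multitask extension is that, once all tasks share a single w, the per-task linear and quadratic terms simply add up, so the joint problem has the same form as the single-task one. The sketch below illustrates this under the assumption that the multitask objective is the unweighted sum of per-task objectives of the form c^{T}w + λ w^{T}Q w; the function names and random inputs are hypothetical.

```python
import numpy as np

def combine_tasks(c_list, Q_list):
    """Sum per-task linear terms and similarity matrices (all tasks share the same M features)."""
    return np.sum(c_list, axis=0), np.sum(Q_list, axis=0)

def objective(w, c, Q, lam):
    return float(c @ w + lam * (w @ Q @ w))

rng = np.random.default_rng(2)
M, K, lam = 5, 3, 0.1
c_list = [rng.normal(size=M) for _ in range(K)]
Q_list = [np.corrcoef(rng.normal(size=(10, M)), rowvar=False) for _ in range(K)]

c_all, Q_all = combine_tasks(c_list, Q_list)
w = np.full(M, 2 / M)                               # any shared relaxed indicator
joint = objective(w, c_all, Q_all, lam)
per_task = sum(objective(w, c, Q, lam) for c, Q in zip(c_list, Q_list))
print(joint, per_task)
```

Since a sum of positive semidefinite matrices is positive semidefinite, the combined problem stays convex and the same relaxation and low-rank machinery applies to the aggregated terms.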
References
1. Guyon I, Elisseeff A: An Introduction to Variable and Feature Selection. Journal of Machine Learning Research. 2003, 1157-1182.
2. Saeys Y, Inza I, Larrañaga P: A review of feature selection techniques in bioinformatics. Bioinformatics. 2007, 23 (19): 2507-2517. 10.1093/bioinformatics/btm344.
3. Liu H, Setiono R: A probabilistic approach to feature selection - a filter solution. Proceedings of the Thirteenth International Conference on Machine Learning. 1996, 319-327.
4. Kohavi R, John G: Wrappers for Feature Subset Selection. Artificial Intelligence. 1997, 97: 273-324. 10.1016/S0004-3702(97)00043-X.
5. Guyon I, Weston J, Barnhill S, Vapnik V: Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning. 2002, 46: 389-422. 10.1023/A:1012487302797.
6. Tibshirani RJ: Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B. 1996, 58: 267-288.
7. Peng H, Long F, Ding CHQ: Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2005, 27 (8): 1226-1238.
8. Lujan IR, Huerta R, Elkan C, Cruz CS: Quadratic Programming Feature Selection. Journal of Machine Learning Research. 2010, 11: 1491-1516.
9. Liu S, Liu H, Latecki LJ, Yan S, Xu C, Lu H: Size Adaptive Selection of Most Informative Features. Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence. 2011
10. Argyriou A, Evgeniou T, Pontil M: Convex multitask feature learning. Machine Learning. 2008, 73 (3): 243-272. 10.1007/s10994-007-5040-8.
11. Obozinski G, Taskar B, Jordan MI: Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing. 2010, 20 (2): 231-252. 10.1007/s11222-008-9111-x.
12. Kira K, Rendell LA: A Practical Approach to Feature Selection. Proceedings of the Ninth International Conference on Machine Learning. 1992, 249-256.
13. Zhang K, Kwok JT, Parvin B: Prototype vector machine for large scale semi-supervised learning. Proceedings of the Twenty-sixth International Conference on Machine Learning. 2009, 1233-1240.
14. Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ: LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research. 2008, 9: 1871-1874.
15. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. The Proceedings of the National Academy of Sciences USA. 1999, 6745-6750.
16. Beer DG, Kardia SL, Huang CC, Giordano TJ, Levin AM, Misek DE, Lin L, Chen G, Gharib TG, Thomas DG, Lizyness ML, Kuick R, Hayasaka S, Taylor JM, Iannettoni MD, Orringer MB, Hanash S: Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nature Medicine. 2002, 816-824.
17. Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, Gaasenbeek M, Angelo M, Reich M, Pinkus GS, Ray TS, Koval MA, Last KW, Norton A, Lister TA, Mesirov J, Neuberg DS, Lander ES, Aster JC, Golub TR: Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine. 2002, 68-74.
18. Tian E, Zhan F, Walker R, Rasmussena E, Ma Y, Barlogie B, Shaughnessy J: The role of the Wnt-signaling antagonist DKK1 in the development of osteolytic lesions in multiple myeloma. New England Journal of Medicine. 2003, 2483-2494.
19. Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov JP, Poggio T, Gerald W, Loda M, Lander ES, Golub TR: Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences USA. 2001, 15149-54.
20. Da Wei Huang BTS, Lempicki RA, et al: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature protocols. 2008, 4: 44-57. 10.1038/nprot.2008.211.
21. Davis AP, King BL, Mockus S, Murphy CG, Saraceni-Richards C, Rosenstein M, Wiegers T, Mattingly CJ: The comparative toxicogenomics database: update 2011. Nucleic acids research. 2011, 39 (suppl 1): D1067-D1072.
22. Andersen ED, Andersen KD: The MOSEK interior point optimizer for linear programming: an implementation of the homogeneous algorithm. High Performance Optimization. 2000, 197-232.
23. Williams CKI, Seeger M: The Effect of the Input Density Distribution on Kernel-based Classifiers. Proceedings of the Seventeenth International Conference on Machine Learning. 2000, 1159-1166.
24. Drineas P, Mahoney MW: On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning. Journal of Machine Learning Research. 2005, 6: 2153-2175.
25. Ben-David S, Schuller R: Exploiting Task Relatedness for Multiple Task Learning. COLT: Proceedings of the Workshop on Computational Learning Theory. 2003
Acknowledgements
This work was supported by the U.S. National Science Foundation Grant IIS0546155.
Declarations
Publication of this work was supported by the U.S. National Science Foundation Grant IIS0546155.
This article has been published as part of BMC Proceedings Volume 7 Supplement 7, 2013: Proceedings of the Great Lakes Bioinformatics Conference 2013. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/7/S7.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
LL and SV conceived the study and developed the algorithm. LL wrote the first draft of the manuscript. Both authors participated in the preparation of the manuscript and approved the final version.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Lan, L., Vucetic, S. Multitask feature selection in microarray data by binary integer programming. BMC Proc 7, S5 (2013). https://doi.org/10.1186/1753-6561-7-S7-S5
Keywords
 Feature Selection
 Feature Subset
 Feature Selection Method
 Feature Selection Algorithm
 Quadratic Programming