An efficient heuristic method for active feature acquisition and its application to protein-protein interaction prediction
© Thahir et al.; licensee BioMed Central Ltd. 2012
Published: 13 November 2012
Machine learning approaches for classification learn the pattern of the feature space of different classes, or learn a boundary that separates the feature space into different classes. The features of the data instances are usually available, and it is only the class labels of the instances that are unavailable. For example, to classify text documents into different topic categories, the words in the documents are features and they are readily available, whereas the topic is what is predicted. However, in some domains obtaining features is resource-intensive, so not all features may be available. An example is protein-protein interaction (PPI) prediction, where not only are the labels ('interacting' or 'non-interacting') unavailable, but so are some of the features. It may be possible to obtain at least some of the missing features by carrying out a few experiments as permitted by the available resources. If only a few experiments can be carried out to acquire missing features, which proteins should be studied and which features of those proteins should be determined? From the perspective of machine learning for PPI prediction, it is desirable to acquire those features that, when used in training the classifier, improve its accuracy the most. That is, the utility of a feature acquisition is measured by how much the acquired features contribute to improving the accuracy of the classifier. Active feature acquisition (AFA) is a strategy to preselect such instance-feature combinations (i.e. protein and experiment combinations) for maximum utility. The goal of AFA is the creation of an optimal training set that would result in the best classifier, and not the determination of the best classification model itself.
We present a heuristic method for active feature acquisition that calculates the utility of acquiring a missing feature. This heuristic takes into account the change in belief of the classification model induced by the acquisition of the feature under consideration. Compared with random selection of the proteins on which experiments are performed and of the type of experiment performed, the heuristic method reduces the number of experiments to as little as 40% of the number needed by random selection. The most notable characteristic of this method is that it does not require re-training of the classification model on every possible combination of instance, feature and feature value. For this reason, our method is far less computationally expensive than previous AFA strategies.
The results show that our heuristic method for AFA creates an optimal training set with far fewer features acquired than random acquisition. This shows the value of active feature acquisition as an aid to protein-protein interaction prediction, where feature acquisition is costly. Compared to previous methods, the proposed method reduces computational cost while also achieving a better F-score. The proposed method is valuable because it points AFA in a direction with far lower computational expense: for the first time, it removes the need to train a classifier for every combination of instance, feature and feature value, which would be impractical in several domains.
Constructing a complete human protein-protein interaction (PPI) network (the 'interactome') can accelerate discovery in biomedical sciences and is crucial to the study of disease mechanisms and drug discovery. For example, proteins (genes) that are associated with a disease interact with other disease-related genes more closely in the interactome; for this reason, protein-disease associations can be determined from network topological features such as the degree of a node (i.e. protein), the average distance of the node from disease-related proteins, etc. Several network-based approaches have been devised to determine gene-disease associations and functional modules using the interactome, including neighborhood based approaches, clustering/graph partitioning based methods and random-walks [3–6]. However, only a fraction of the whole human interactome is known today, calling for methods to discover hitherto-unknown PPIs [7, 8].
Determining PPIs by high-resolution experimental methods is very resource intensive. High throughput methods such as yeast 2-hybrid and mass spectrometry have low assay sensitivity (i.e. the interactions that they can detect are only a subset of all PPIs that exist), and even among those that they can detect, each screen identifies only a smaller subset. Computational methods are therefore necessary to complement the high-throughput methods to reconstruct the interactome expeditiously. Several computational systems have been developed for prediction of protein-protein interactions, particularly for yeast and human, using machine learning approaches [10–14]. These approaches employ statistical machine learning methods to classify whether two proteins interact with each other or not, based on biological features of the proteins such as their localization, molecular function and the tissues in which they are expressed. All of these methods assume that a training data set is available, and that the remaining goal is to develop an algorithm that learns to model the relation between feature space and labels represented by the training data.
However, in the current training data many features are unknown (i.e. 'missing') for many proteins. Carrying out wet-lab experiments to determine all such missing features is infeasible, as those experiments require human expertise, time, high-end equipment and other resources. It may however be possible to carry out a few experiments to determine some of the missing features, if not all. If only a few missing values can be determined, which features for which proteins should be determined by experiments? From the perspective of machine learning for PPI prediction, it is desirable to carry out those experiments whose results, when used in training the classifier, improve its accuracy the most. That is, the utility of a feature acquisition is measured by how much the acquired features contribute to improving the accuracy of the classifier. Active feature acquisition (AFA) is a strategy to preselect such instance-feature combinations (i.e. protein and experiment combinations) for maximum utility. Note that the goal of AFA is the creation of an optimal training set that would result in the best classifier, and not the determination of the best classification model itself. Once the training data has been created with active feature acquisition, any state-of-the-art method, such as random forest based methods, may be applied to learn the classification model. While PPI prediction itself has been actively studied recently [11, 12, 15], an AFA strategy has not been applied in this domain.
A few algorithms have been developed for AFA in other application domains; they calculate the utility of a feature acquisition based on the accuracy of the current model and its confidence in the prediction. Melville et al. proposed a framework for performing active feature acquisition, which is described here briefly. The training set T of m instances is represented by the matrix F, where Fi,j corresponds to the value of the j-th feature of the i-th instance. The feature matrix initially has missing values, whereas the class label of each instance is already known. Missing features may be acquired with the active feature acquisition procedure at a cost of Ci,j for feature Fi,j; qi,j refers to the query for the value of Fi,j. The objective of AFA is to query for missing feature values such that the most accurate classifier is built for a given feature-acquisition budget. The framework proposed by Melville et al. is iterative: in each iteration, a set of missing features that provide the highest expected improvement in classifier accuracy at minimal cost is chosen and queried. The acquired feature values are added to the training data and the classifier is retrained. The process is repeated until a desired level of classifier accuracy is achieved, or the budget available for feature acquisition is exhausted.
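The selection criterion in each iteration is an expected utility. A plausible form, reconstructed here from the symbol definitions that follow and not quoted from the original framework, is given below; dividing the accuracy gain by the cost is an assumption based on the "highest expected improvement at minimal cost" criterion, and K denotes the number of values the feature can take.

```latex
E(f_i) \;=\; \sum_{k=1}^{K} P(f_i = V_k)\,
\frac{A(F,\, f_i = V_k) - A(F)}{C(f_i)}
```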
Here, A(F, f_i = V_k) is the accuracy of the classifier when it is trained with the value of f_i set to V_k, A(F) is the accuracy of the original classifier, and C(f_i) is the cost of acquiring the feature value. P(f_i = V_k) is estimated by building a classifier C_i for each feature: in the training data, all the features other than f_i, together with the class label, are taken as feature values and C_i is built. The classifier C_i predicts the probability that a missing feature takes a particular value, given the other feature values and the class label of the instance. The method computes the expected utility of each missing value across all the instances. The missing feature with maximum expected utility is selected and its value is obtained (by experimentation or manual labeling, as applicable).
This method is computationally intensive for several classifier types and for several domains, because the classifier needs to be trained for each missing feature and each of its possible values in order to measure A(F, f_i = V_k). Therefore, to evaluate the utility of a single missing feature of a given instance, the classifier has to be retrained K times. As this procedure is repeated for each of the missing feature elements, the classifier has to be retrained |M|*K times in a single iteration (where M is the set of all missing features over all instances). Although incremental learning can be done efficiently for classifiers like Naive Bayes, for several other classifiers it is inefficient. For instance, in the case of Random Forests, retraining the classifier once has a time complexity of T*N*log(N), where T is the number of trees in the random forest and N is the number of instances in the training data. So, the total time complexity for evaluating the utility of all the missing features is T*N*log(N)*|M|*K. When the dataset is large and has several missing values, the time for evaluating the expected utility would be very high. To overcome this, Melville et al. proposed Sampled Expected Utility, wherein a random subset S of instances with missing feature values is selected and evaluated by the above procedure. The results show that this expected utility approach performs better than a method that randomly picks missing feature values for labeling. Saar-Tsechansky et al. create the reduced consideration set S by giving preference to missing features in instances that are misclassified or whose label is highly uncertain according to the induced classifier model. Though methods like sampled expected utility reduce the consideration set, for large data sets with several missing features this approach would still be computationally very expensive, especially for models which are parametric.
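The retraining counts above can be made concrete with a small calculation. The sketch below only restates the |M|*K and T*N*log(N)*|M|*K expressions from the text; the function names are hypothetical, constant factors are ignored, and the natural logarithm is an arbitrary choice of base.

```python
import math


def retrainings_per_iteration(num_missing: int, values_per_feature: int) -> int:
    """Retrainings needed to score every missing entry once: |M| * K."""
    return num_missing * values_per_feature


def random_forest_cost(num_trees: int, num_instances: int,
                       num_missing: int, values_per_feature: int) -> float:
    """Rough operation count T * N * log(N) * |M| * K (constant factors ignored)."""
    cost_per_fit = num_trees * num_instances * math.log(num_instances)
    return cost_per_fit * retrainings_per_iteration(num_missing, values_per_feature)


# Illustrative numbers only: 10,000 missing entries with 5 possible values each
# already require 50,000 retrainings in a single iteration.
n_fits = retrainings_per_iteration(10_000, 5)
```

Even with modest illustrative numbers the count of full retrainings grows with the product of the number of missing entries and the number of candidate values, which is the cost that sampling-based variants try to reduce.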
Gregory et al. proposed an active feature acquisition approach that they evaluated specifically on two sequence labeling tasks. Their approach also requires re-training of classifiers. Attenberg, Melville and Provost present a unified approach to active dual supervision, in which they determine which feature or instance should be acquired to benefit the classifier the most by extending the sampled expected utility measures proposed for active dual supervision; their methods, however, still require re-training the classifiers.
In expected utility based approaches for AFA, the usefulness of acquiring a missing feature is estimated by retraining the classifier for each of the possible values that the missing feature can take and then calculating the expected improvement in classifier accuracy. However, retraining the classifier for every possible value, for each missing feature of each instance, is computationally very intensive, or even infeasible for large multi-dimensional data sets.
In this work we propose a novel heuristic that measures the utility of acquiring a missing feature value without retraining the classifier multiple times.
P(y = L_m | C, p) = predicted probability that 'p' has label L_m according to the previously learnt classifier C.
P(y = L_m | C, (p ∩ f_i = V_j)) = predicted probability that 'p' has label L_m according to the previously learnt classifier C, when the feature f_i of 'p' is set to V_j.
If Δρ is less than 0, setting f_i to V_j concurs with the belief of C (i.e. the estimated probability of 'p' belonging to its correct class L_m according to C increases). Hence, if f_i is set to V_j in 'p' and C is retrained, the classifier is not expected to update its model. Therefore, Δρ is set to 0 in that case.
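The change-in-belief computation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: Δρ is read as the drop P(y = L_m | C, p) − P(y = L_m | C, (p ∩ f_i = V_j)), clipped at 0 as described above; weighting each candidate value by an estimated probability P(f_i = V_j) to obtain a single utility is an assumption; `predict_proba`, `value_probs` and the toy classifier are hypothetical names; and a missing feature is represented by `None`.

```python
from typing import Callable, Dict, Hashable, Mapping, Optional

Instance = Dict[str, Optional[Hashable]]  # None marks a missing feature (assumption)
# Hypothetical interface: returns P(y = label | C, instance) for a fixed, already trained C.
PredictProba = Callable[[Instance, Hashable], float]


def delta_rho(predict_proba: PredictProba, instance: Instance,
              true_label: Hashable, feature: str, value: Hashable) -> float:
    """Drop in the predicted probability of the true label when feature := value, clipped at 0."""
    before = predict_proba(instance, true_label)
    filled = dict(instance)          # copy so the caller's instance is not modified
    filled[feature] = value
    after = predict_proba(filled, true_label)
    return max(0.0, before - after)  # values that agree with C's belief contribute 0


def heuristic_utility(predict_proba: PredictProba, instance: Instance,
                      true_label: Hashable, feature: str,
                      value_probs: Mapping[Hashable, float]) -> float:
    """Assumed aggregation: expectation of delta_rho over the candidate feature values."""
    return sum(prob * delta_rho(predict_proba, instance, true_label, feature, value)
               for value, prob in value_probs.items())


def toy_predict_proba(instance: Instance, label: Hashable) -> float:
    """Toy stand-in for a trained classifier with a single feature 'go_bp'."""
    p_interacting = {None: 0.6, "high": 0.9, "low": 0.2}[instance.get("go_bp")]
    return p_interacting if label == "interacting" else 1.0 - p_interacting
```

Note that the classifier is only queried for predictions and never retrained, so scoring a missing entry costs one prediction per candidate value.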
In the domain of PPI prediction, there is no "negative dataset" available; that is, there are no pairs that are known to be non-interacting. However, only about one pair in every 500 to 1500 randomly selected pairs is expected to be an interacting pair. Therefore, random pairs are usually treated as negative class instances in this domain. For our work, we created training and testing datasets of 10,000 protein-pair instances each, with 2,000 interacting pairs and 8,000 random pairs. AFA is carried out in batch mode, selecting 500 missing values in each batch.
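Batch-mode selection can be sketched as picking the highest-scoring missing entries in each round. The text gives only the batch size of 500; choosing the top-scoring entries by utility is an assumption here, and `select_batch` and the `(instance index, feature name)` key are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

Query = Tuple[int, str]  # (instance index, feature name) of a missing entry


def select_batch(utilities: Dict[Query, float], batch_size: int = 500) -> List[Query]:
    """Return up to batch_size missing entries with the highest utility, best first."""
    return heapq.nlargest(batch_size, utilities, key=utilities.get)
```

Using `heapq.nlargest` avoids sorting all candidate entries when the batch is much smaller than the number of missing values.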
where Set1 is the set of GO terms for P1 and Set2 is the set of GO terms for P2. Three feature values are computed, one each from the GO annotations for biological process, cellular component and molecular function.
Two gene expression features are computed. They are the mean and standard deviation of the correlation values (PPCm) for the 70 categories.
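A minimal sketch of the two expression features follows. It assumes that each category supplies one expression profile per protein, that the per-category correlation is the Pearson coefficient, and that the population standard deviation is used; none of these details is stated in the text, and the function names are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length profiles; 0.0 if either is constant (assumption)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)


def expression_features(profiles_p1: Sequence[Sequence[float]],
                        profiles_p2: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Mean and population standard deviation of the per-category correlations."""
    correlations = [pearson(a, b) for a, b in zip(profiles_p1, profiles_p2)]
    return statistics.fmean(correlations), statistics.pstdev(correlations)
```

In the setting described above, `profiles_p1` and `profiles_p2` would each hold 70 profiles, one per category.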
The efficiency of the process can be improved further by identifying proteins that have little variance in correlations. For example, if protein P1 does not show much variance, then for any protein P2 there will be little correlation between P1 and P2.
D1 is the set of domains in protein P1
D2 is the set of domains in protein P2
score(d1,d2) is the interaction score between the domains d1 and d2.
For a given protein pair, this feature measures how close the genes encoding the proteins are to each other in the genome. The data for computing this feature is downloaded from ftp://ftp.ncbi.nlm.nih.gov/gene/. The distance score between the genes is computed from their locus tags and the chromosomes on which they are located.
T1 is the set of tissues P1 occurs in
T2 is the set of tissues P2 occurs in
The metric that we employ here is the F-score, which is commonly used in the domain of information retrieval. The F-score is the harmonic mean of precision and recall. Precision is the percentage of true positives among all predicted interactions; recall is the percentage of true positives among all real interactions.
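The definition above translates directly into code. The sketch below takes counts of true positives, false positives and false negatives; the function name and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def f_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when either is undefined or both are zero."""
    predicted = true_pos + false_pos   # all predicted interactions
    actual = true_pos + false_neg      # all real interactions
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

For example, 80 true positives with 20 false positives and 80 false negatives give a precision of 0.8 and a recall of 0.5.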
The Gene Expression and Gene Neighborhood features in PPI prediction feature vectors have nearly 100% coverage, and therefore do not depend on active feature acquisition. The Gene Ontology features (biological process, cellular component and molecular function), the domain feature and the tissue feature have a large number of missing values. So we considered these five features to study active feature acquisition for PPI prediction. We consider only protein pairs where the individual proteins have gene ontology annotations and at least one of tissue or domain annotations. This is to ensure that the feature vector is reasonably filled. A training and a test data set of 10,000 instances each were generated. The training set has 10,000 × 5 = 50,000 feature values, of which nearly half are missing in the original dataset. Additionally, we set another 10,000 feature values (which are otherwise available in the dataset) to be missing, so as to simulate acquiring these features as and when asked by the algorithm. In other words, these are the feature values which are available for acquisition by the AFA system. To apply AFA, we need to discretize the real-valued features. To do that, we apply the commonly used Minimum Description Length (MDL) based discretization method proposed by Fayyad and Irani. We use the Weka Machine Learning Toolkit's implementation of this discretization method.
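The simulation of acquirable values can be sketched as hiding a random subset of the known entries and keeping them in a lookup that answers acquisition queries. This is an illustrative setup only: the text does not say how the 10,000 hidden values were chosen, so uniform random sampling, the `None` marker for missing values and the function name are all assumptions.

```python
import random
from typing import Dict, List, Optional, Tuple

Matrix = List[List[Optional[float]]]  # None marks a missing value (assumption)


def hide_known_values(matrix: Matrix, n_hide: int,
                      seed: int = 0) -> Tuple[Matrix, Dict[Tuple[int, int], float]]:
    """Return a masked copy of matrix and a pool mapping hidden (row, col) to the true value."""
    rng = random.Random(seed)
    known = [(i, j) for i, row in enumerate(matrix)
             for j, value in enumerate(row) if value is not None]
    chosen = rng.sample(known, n_hide)  # raises ValueError if n_hide exceeds the known entries
    masked = [list(row) for row in matrix]
    pool: Dict[Tuple[int, int], float] = {}
    for i, j in chosen:
        pool[(i, j)] = matrix[i][j]
        masked[i][j] = None
    return masked, pool
```

A query for entry (i, j) is then answered by looking it up in the returned pool, which stands in for carrying out the experiment.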
We also carried out active feature acquisition on other standard classification tasks with data available at the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html). However, the AFA method proposed here did not perform better in these cases (we did not test the Melville method, but tested AFA against random selection). It remains to be seen whether the proposed heuristic method has a particular advantage in domains like PPI prediction, i.e. when (i) the data has several missing values, or (ii) the positive instances are an extremely rare category among the unlabeled instances. Although this is discouraging, the proposed method presents a novel direction for estimating the utility, one that does not depend on training a classifier numerous times in each iteration. The evaluations on these datasets are as yet preliminary. Rigorous testing and analysis remain to be carried out, with our method as well as previous methods, to understand which domain characteristics lead to the success or failure of different methods in these domains.
Active learning methods optimize the interaction between a computational method and a human expert by preselecting the data on which the expert is to spend time or resources, so that the outcome contributes most beneficially to the computational algorithm. Typically these methods are applied to domains that have massive amounts of data, such as astronomical images or world-wide-web documents, where, even though each data instance can be labelled with little manual effort, the creation of training data that is representative of the entire dataset can benefit from active learning approaches. In the molecular biology domain, however, the reasons for active learning are atypical. Here, even though the data may not be as massive, the resources, time and expertise required to characterize each instance are very large, making it impossible to characterize even moderately large datasets. For this reason, active learning methods can contribute to the domain of molecular biology, and guide the selection of molecule-experiment combinations that yield maximum benefit towards characterizing other molecules by computational methods. We have previously applied active learning for label acquisition for protein-protein interaction prediction.
Here, we presented a new heuristic approach for Active Feature Acquisition (AFA) that reduces computational cost by estimating the improvement a feature value would bring to the classifier. In contrast, other expected utility-based methods for feature acquisition train a new classifier for each 'instance-feature-value' triple. The results show that AFA achieves a comparable F-score by acquiring only 40% as many missing features as the random method. Further, to the best of our knowledge AFA has not previously been applied to PPI prediction, and the results show that AFA would be critical for this domain, where the biological features are missing for several protein pairs (especially for pairs involving proteins which have not been studied extensively).
Active label/feature acquisition strategies generally work under budget constraints, and it is necessary to account for the cost of acquiring the missing values. The cost of experimentally determining the interaction of protein pairs might vary for different pairs, depending upon the localization of the proteins and the experimental conditions which need to be created to verify the interaction. Similarly, the cost of obtaining the missing features might differ for the various feature types. So it is necessary to develop computational methods which are able to model the cost of experimental annotation and incorporate it into the active label/feature acquisition strategies.
The heuristic we proposed for active feature acquisition works in a batch mode, selecting a group of missing features to be acquired in each iteration; further improvements can be achieved by incorporating the marginal relevance of the features with respect to each other, to ensure diversity in the selected missing features within a batch. It would be interesting to see how to address active learning in domains with a sparse-label and sparse-feature space. The previously proposed Active Information Approaches may be a starting point in this direction. The active learning and active feature acquisition approaches we considered evaluate the utility only at the level of a particular instance or missing feature. It is possible that acquiring a particular pair of missing labels or features can bring in much higher utility than the sum of the utilities of acquiring each of them individually. Further, we may be constrained by the budget we can spend to learn the classifier. However, performing a complete look-ahead has exponential time complexity. So highly simplified look-ahead procedures such as single feature look-ahead (SFL) and randomized single feature look-ahead (RSFL) have been proposed. Developing advanced look-ahead policies that incorporate more information about the state space and deeper look-ahead would enable higher error reduction for the given budget.
This work has been funded in part by the BRAINS grant R01MH094564 awarded to MG by the National Institute of Mental Health of the National Institutes of Health (NIMH/NIH) of the USA. The authors would like to thank Dr. Jaime Carbonell for discussions that led to the development of this approach.
This article has been published as part of BMC Proceedings Volume 6 Supplement 7, 2012: Proceedings from the Great Lakes Bioinformatics Conference 2012. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/6/S7.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.