Towards a semi-automatic functional annotation tool based on decision-tree techniques
- Jérôme Azé†1,
- Lucie Gentils†1,
- Claire Toffano-Nioche1,
- Valentin Loux2,
- Jean-François Gibrat2,
- Philippe Bessières2,
- Céline Rouveirol3,
- Anne Poupon4 and
- Christine Froidevaux1
© Azé et al; licensee BioMed Central Ltd. 2008
Published: 17 December 2008
Due to the continuous improvements of high throughput technologies and experimental procedures, the number of sequenced genomes is increasing exponentially. Ultimately, the task of annotating these data relies on the expertise of biologists. The necessity for annotation to be supervised by human experts is the rate limiting step of the data analysis. To face the deluge of new genomic data, the need for automating, as much as possible, the annotation process becomes critical.
We consider the annotation of a protein with terms of the functional hierarchy that has been used to annotate Bacillus subtilis, and we propose a set of rules that predict classes in terms of elements of this hierarchy, i.e., a class is a node or a leaf of the hierarchy tree. The rules are obtained through two decision-tree techniques, first-order decision-trees and multilabel attribute-value decision-trees, using as training data the proteins from two lactic bacteria: Lactobacillus sakei and Lactobacillus bulgaricus. We tested the two methods, first independently, then in a combined approach, and evaluated the results using hierarchical evaluation measures. Results obtained for the two approaches on both genomes are comparable and show good precision together with a high prediction rate. Using combined approaches increases the recall and the prediction rate.
The combination of the two approaches is very encouraging and we will further refine these combinations in order to get rules even more useful for the annotators. This first study is a crucial step towards designing a semi-automatic functional annotation tool.
Due to the continuous improvements of high-throughput technologies and experimental procedures, the number of sequenced genomes is increasing exponentially. The first sequenced genome was published 10 years ago. Currently, about 800 genomes (figure updated in May 2008) have been completely sequenced and published, coding for more than 6 million proteins (as stored in the protein sequence database UniProtKB). A further 3 700 new genomes are expected in the near future .
Expert biologists play a central role in the analysis of this massive amount of raw data. To annotate a new genome they need to integrate many pieces of information coming from various sources: results of bioinformatics analysis programs, data stored in specialized databases, results of high-throughput experiments such as transcriptomics and proteomics, information stored in the literature, and general knowledge about the domain of interest (biological properties of the studied organism, its ecology, etc.). Even for a small bacterial genome, containing about 2 000 genes, this annotation task is a heavy burden that takes a small team of annotators between 12 and 18 months to complete. A number of annotation tools have been designed to help biologists concentrate exclusively on this high-level task. The aims of these tools are to hide technical details, to make the system implementation transparent, to centralize and facilitate access to relevant data, and to report a synthesis of all the findings to the annotators in an efficient manner. In spite of these tools, the need for human supervision of the annotation process still constitutes the bottleneck of genomic data analysis. Therefore, to face the deluge of new genomic data, there is a pressing need to automate the annotation process itself as far as possible. Computational annotation methods should take into account as much relevant information as possible regarding the analyzed genome, as human experts do.
Let us emphasize here that there is a difference between the direct annotation of the gene product, e.g., "fatty acid-binding protein, adipocyte" and the annotation of the protein with terms of a functional hierarchy, for instance for GO , "GO:16564; Molecular function: transcription repressor activity" or "GO:42632; Biological process: cholesterol homeostasis". In the latter case different proteins are grouped according to their molecular function or to the functional path they belong to. In this article we are concerned with the second type of annotation.
Annotation is mostly based on evolutionary considerations, more precisely on the concept of homology. Two genes or proteins are homologous if they descend from a common ancestor. As such they share a number of properties, in particular their function. The principle of annotation is thus to infer a homology relationship between a gene (protein) of interest and a gene (protein) whose function is known, and to transfer this function.
State of the art
Computational annotation methods range from symbolic to numerical techniques. Some of them are based on machine-learning techniques (e.g. SPEARMINT  or GOPET  that use C4.5  and SVM  respectively) while others are probabilistic approaches (e.g. MAGIC [7, 8] which is based on a Bayesian network or the Bayesian approach proposed in ).
In the context of the RAFALE project  our goal is to provide biologists with a semi-automatic tool for functional annotation. As a straightforward consequence, both the productivity of the annotators and the consistency of the annotations would be improved. It is a semi-automatic tool in the sense that the process is collaborative: annotations are suggested by rules that reflect known protein annotations, but the annotations are ultimately validated by the biologists. We chose to learn rules obtained through decision-trees because they exhibit several good features. They can be easily understood and used by human annotators. They represent modular pieces of information that can be considered as explanations of the annotations proposed to users. In our approach we aim not only at obtaining good-quality annotations but also at showing how they have been obtained. This point is essential if biologists are to evaluate the quality of the annotations and use them. Otherwise, biologists would not trust such rules and would not use them, thus missing an opportunity to save time. However, we do not restrict ourselves to high-quality annotations. Unlike HAMAP , we may propose several alternative annotations, together with their confidence degree, leaving the final decision to the biologists. In the following, we propose to apply two decision-tree techniques to the problem of predicting classes from a functional hierarchy, in the same spirit as in  which deals with the problem of predicting ORF functional classes. Two different frameworks have been chosen to represent rules that are more or less expressive and accordingly more or less expensive: first-order decision-trees  and multilabel attribute-value decision-trees . As we are more interested in providing biologists with reliable annotations (even if they concern only a restricted subset of proteins), we aim at obtaining rules with high precision rather than good recall (see section Results).
Annotation framework and genomes under study
The available data
In this work our training set corresponds to data provided by the AGMIAL annotation platform . This platform has been used to annotate two lactic bacteria: Lactobacillus sakei  and Lactobacillus bulgaricus .
AGMIAL embodies an annotation strategy that considers the following pieces of information:
• modular aspect and intrinsic properties of protein sequences;
• search for homology relationship between proteins;
• genomic context;
• subcellular localization.
More than 30 bioinformatics methods belonging to the above categories are implemented in AGMIAL. As mentioned in the Background section, homology search techniques represent the cornerstone of the annotation process. However, with the availability of many sequenced genomes and thus the possibility of annotating a new genome in the light of other known genomes, techniques based on the genomic context are becoming increasingly important.
In this study, we choose to focus only on classes 1, 2 and 3. Classes 5 and 6 correspond to proteins for which the annotators judged there was not enough information to conclude on a particular function. Class 4 is a medley that gathers together various heterogeneous functions without any relationship, and it was not possible to learn regularities from data of this class. The exclusion of these three first-level classes and their subclasses in the hierarchy removed 11 out of 62 classes (18%).
To generate annotation rules, we have to describe the proteins in terms of their properties. Some properties are intrinsic such as the number of transmembrane segments, the isoelectric point, the molecular mass, the number of domains and their type, etc. Other properties express a relationship between the protein of interest and proteins of other genomes (homology relationship) or between proteins of the analyzed genome (genomic context relationship). These properties are provided by the bioinformatics programs that analyze the genomic data.
For each protein of interest, we use homologous proteins that have been found with BLAST . For the current study we only consider close homologs, i.e., those having more than 50% identical residues and an e-value less than 10^-4. In addition, the lengths of the protein and its homolog have to be similar to exclude the case of domains (l1 ≥ 0.8 × l2 or l2 ≥ 0.8 × l1, with l1 the length of the protein and l2 the length of its homolog). We then extract the GO-terms  associated with the homologous proteins in the Uniprot data bank . The GO-terms correspond to functional classes of the Gene Ontology . A protein has usually many homologs and each homolog can be described by several GO-terms. To build the blastmatchGo descriptor we group together all the homologs that have the same GO-term and we consider the fraction (f) of homologs that have a particular GO-term.
For instance, this will generate rules such as: 'if blastmatchGo(esa100, GO:0006810, f) and f > 0.7 then class = 3.5'. In this expression, esa100 is the 100th protein of the L. sakei genome starting from the origin of replication, and GO:0006810 is a term of the Gene Ontology that is associated with more than 70% of the homologs of esa100 found by BLAST.
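The construction of the blastmatchGo descriptor and the example rule above can be sketched in Python as follows. This is only an illustrative sketch: the Hit record, the function names and the toy hits are hypothetical, and the length criterion is read here as requiring the shorter sequence to be at least 80% as long as the longer one.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST hit for a query protein (hypothetical record layout)."""
    identity: float      # fraction of identical residues, between 0 and 1
    evalue: float
    length: int          # length of the homolog
    go_terms: frozenset  # GO-terms attached to the homolog


def is_close_homolog(query_length, hit):
    """Apply the identity, e-value and length filters described in the text."""
    shorter = min(query_length, hit.length)
    longer = max(query_length, hit.length)
    # Assumption: "similar length" means shorter >= 0.8 * longer.
    return hit.identity > 0.5 and hit.evalue < 1e-4 and shorter >= 0.8 * longer


def blastmatch_go(query_length, hits):
    """Return {GO-term: fraction f of close homologs carrying that term}."""
    close = [h for h in hits if is_close_homolog(query_length, h)]
    if not close:
        return {}
    counts = Counter(term for h in close for term in h.go_terms)
    return {term: n / len(close) for term, n in counts.items()}


def example_rule(descriptor):
    """Rule quoted in the text: GO:0006810 with f > 0.7 gives class 3.5."""
    return "3.5" if descriptor.get("GO:0006810", 0.0) > 0.7 else None


# Toy hits for a query protein of length 300 (invented values).
hits = [
    Hit(0.62, 1e-30, 310, frozenset({"GO:0006810"})),
    Hit(0.55, 1e-12, 295, frozenset({"GO:0006810", "GO:0016020"})),
    Hit(0.58, 1e-20, 305, frozenset({"GO:0006810"})),
    Hit(0.51, 1e-8, 300, frozenset({"GO:0016020"})),
    Hit(0.90, 1e-50, 120, frozenset({"GO:0003677"})),  # too short, rejected
]
descriptor = blastmatch_go(300, hits)
predicted = example_rule(descriptor)
```

With these toy hits, four homologs pass the filters and three of them carry GO:0006810, so f = 0.75 and the rule fires.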
The blastmatchSw descriptor is similar to the blastmatchGo descriptor, but it uses Swiss-Prot (SW) keywords  instead of GO-terms to describe homologous proteins.
The interpro descriptor provides information about domains and motifs. We associate an INTERPRO  identifier to a protein if the corresponding domain or motif is found in the protein.
Number of proteins that have at least one descriptor of each type: blastmatchGo, blastmatchSw, interpro.
The descriptors corresponding to intrinsic properties of the proteins considered in this study are:
• TM the number of transmembrane segments;
• pI the isoelectric point;
• mm the molecular weight.
Each protein has many homologs described by GO-terms, SW keywords and INTERPRO identifiers. In order to avoid redundancy and to reduce the search space of the machine learning algorithms, we applied mappings of SW keywords and INTERPRO identifiers to GO-terms. We used the mappings provided on the GO web page . We kept Swiss-Prot keywords and INTERPRO identifiers if no mapping to a GO-term was found. This mapping allows a reduction of the search space that the machine learning algorithm needs to explore.
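The mapping step can be illustrated with a short Python sketch. The two dictionaries stand in for the mapping files; their contents, the identifiers IPR_EXAMPLE_1 and IPR_EXAMPLE_2, and the function name are hypothetical and chosen only to show how redundancy is removed while unmapped descriptors are kept.

```python
def map_to_go(descriptors, sw2go, interpro2go):
    """Replace SW keywords and INTERPRO identifiers by GO-terms when possible.

    descriptors: iterable of (kind, value) pairs, kind in {"go", "sw", "interpro"}.
    Returns a set, so descriptors that map to the same GO-term collapse.
    """
    tables = {"sw": sw2go, "interpro": interpro2go}
    mapped = set()
    for kind, value in descriptors:
        targets = tables.get(kind, {}).get(value)
        if targets:
            mapped.update(("go", term) for term in targets)
        else:
            # GO-terms, and descriptors without any mapping, are kept as they are.
            mapped.add((kind, value))
    return mapped


# Toy mapping tables (not real mapping file content).
sw2go = {"Transport": ["GO:0006810"]}
interpro2go = {"IPR_EXAMPLE_1": ["GO:0003677"]}

descriptors = [
    ("go", "GO:0006810"),
    ("sw", "Transport"),             # maps to a GO-term already present
    ("sw", "Unmapped keyword"),      # no mapping: kept
    ("interpro", "IPR_EXAMPLE_1"),   # maps to a new GO-term
    ("interpro", "IPR_EXAMPLE_2"),   # no mapping: kept
]
reduced = map_to_go(descriptors, sw2go, interpro2go)
```

In this toy case five descriptors are reduced to four, because the SW keyword and the GO-term describe the same function.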
Impact of the mappings SW → GO and INTERPRO → GO.
In this section, we present the two machine learning techniques we used to learn decision-trees: the ILP framework and the multilabel probabilistic decision-tree.
TILDE is a relational learning system from the ILP community that is based on first-order logical decision-trees. It uses top-down induction of decision-trees by adapting C4.5's heuristics. It allows discretization of numeric attributes and provides look-ahead facilities so that properties of descriptors and parameters can be easily set through a bias file.
We decided to predict protein function by using TILDE level by level, beginning from the upper level of the functional hierarchy. In order to discriminate the three classes of the first level, we build three decision-trees, where each class in turn is considered as the set of examples, while the two others give the counter-examples. Note that with this method a protein may be assigned up to 3 classes of the first level. In order to stay close to the AGMIAL system which allows only one annotation for a protein, we chose to assign a "no prediction" tag to a protein if the three trees disagree on the class predicted. This leads to a decrease in the recall value but, of course, to an increase in the precision.
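The way the three first-level trees are reconciled can be sketched as follows. This is a minimal illustration, not the TILDE implementation: the function name is hypothetical, and "the three trees disagree" is interpreted here as "not exactly one tree accepts the protein".

```python
def first_level_class(votes):
    """Combine three one-against-the-others trees into a single prediction.

    votes: {class label: True if the tree learnt for that class accepts
    the protein}. Returns the class label, or None for "no prediction".
    """
    accepted = [label for label, is_member in votes.items() if is_member]
    # Assumption: a prediction is kept only when exactly one tree accepts.
    return accepted[0] if len(accepted) == 1 else None


agree = first_level_class({"1": False, "2": False, "3": True})
conflict = first_level_class({"1": True, "2": False, "3": True})
nobody = first_level_class({"1": False, "2": False, "3": False})
```

Returning None for conflicting or empty votes is what lowers the recall and the prediction rate while protecting the precision.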
As the second and third levels contain fewer proteins than the first one, we decided to learn multiclass trees, that is, trees where each leaf refers to a single class, but where several classes can be found at different leaves. Thus we obtained ten trees: three at the first level, three at the second level and four at the third level, as only four classes at the second level had subclasses.
Multilabel probabilistic decision-tree
In a hierarchical multilabel classification tree, an example may belong to several classes. Moreover, an example belonging to some class with some membership degree also belongs to its superclasses with higher membership degrees.
Each leaf of a probabilistic decision-tree represents a vector of classes where the membership degree is equal to the proportion of the training examples observed in the leaf (and belonging to the class). For example, a leaf may be the vector: (3 – 90%, 2 – 10%, 3.2 – 85%, 3.1 – 15%, 3.2.3 – 36%, 3.2.5 – 64%). Different algorithms, derived from C4.5 , have been proposed [25, 14]. We chose to use the Clus-HMC algorithm  that has been designed to take into account class hierarchy. The algorithm uses minimization of the average variance and a weighted Euclidean distance to compare two partitions of the data. The distance takes into account the depth of the classes in the hierarchy.
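As an illustration of the two notions above, the following Python sketch computes the class vector of a leaf from the labels of its training examples and compares two such vectors with a depth-weighted Euclidean distance. It is not the Clus-HMC code: the function names, the toy labels, and the weighting scheme w0 ** depth with w0 = 0.75 are assumptions made for the example.

```python
import math
from collections import Counter


def ancestors_and_self(label):
    """'3.2.5' -> ['3', '3.2', '3.2.5']"""
    parts = label.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def leaf_vector(labels):
    """Membership degree of each class = proportion of the leaf's training
    examples that belong to it, an example also counting for its superclasses."""
    counts = Counter()
    for label in labels:
        counts.update(ancestors_and_self(label))
    return {c: counts[c] / len(labels) for c in sorted(counts)}


def weighted_distance(v1, v2, w0=0.75):
    """Euclidean distance where a class at depth d has weight w0 ** d
    (assumed scheme: deeper classes weigh less)."""
    classes = set(v1) | set(v2)
    return math.sqrt(sum(
        w0 ** (c.count(".") + 1) * (v1.get(c, 0.0) - v2.get(c, 0.0)) ** 2
        for c in classes
    ))


# Toy leaf containing ten training proteins (invented labels).
labels = ["3.2.5"] * 6 + ["3.2.3"] * 2 + ["3.1", "2"]
vector = leaf_vector(labels)

top_level_error = weighted_distance({"3": 1.0}, {"2": 1.0})
deep_error = weighted_distance({"3": 1.0, "3.1": 1.0}, {"3": 1.0, "3.2": 1.0})
```

With such weights, confusing two first-level classes costs more than confusing two subclasses of the same class, which is the intended effect of taking depth into account.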
In this study, we use the parameters empirically found to be the best by Blockeel et al. in . In order to evaluate the methods, we turn to Hierarchical Evaluation Measures, that are adapted to our data.
Hierarchical Evaluation Measure
Kiritchenko et al.  defined a Hierarchical Evaluation Measure which respects the three main properties that a hierarchical evaluation measure should satisfy:
1. The measure gives credit to partially correct classification;
2. The measure punishes distant errors more heavily;
3. The measure punishes errors at higher levels of a hierarchy more heavily.
Hierarchical precision (hP) and hierarchical recall (hR) have been reformulated with our parameters to respect the three above properties. A hierarchical F-score (hF_β ∈ [0, 1]) has been defined in . The hF_β measure combines precision (hP) and recall (hR) to provide a single evaluation of a hierarchical classification tool. It is controlled by the parameter β ∈ [0, +∞), which gives more or less weight to either precision or recall.
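We also employ the prediction rate measure, pr, representing the percentage of predicted proteins and defined as pr = n_p/n, where n_p is the number of proteins for which a prediction is made and n is the total number of proteins.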
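A minimal Python sketch of these measures is given below. It follows the set-based definitions of Kiritchenko et al. (each label is extended with its ancestors, and hP and hR are computed from the overlap between the extended sets), not necessarily the reformulation used in this work. The function names and the toy data are hypothetical, and the treatment of proteins without prediction (ignored for hP, counted for hR) is an assumption.

```python
def with_ancestors(label):
    """'3.5.3' -> {'3', '3.5', '3.5.3'}"""
    parts = label.split(".")
    return {".".join(parts[:i]) for i in range(1, len(parts) + 1)}


def hierarchical_scores(truth, predicted, beta=1.0):
    """Return (hP, hR, hF_beta, pr) for single-label annotations.

    truth: {protein: reference class}
    predicted: {protein: predicted class, or None for "no prediction"}
    """
    overlap = n_pred_classes = n_true_classes = n_predicted = 0
    for protein, true_label in truth.items():
        true_set = with_ancestors(true_label)
        n_true_classes += len(true_set)
        guess = predicted.get(protein)
        if guess is None:
            continue  # assumption: unpredicted proteins only affect hR and pr
        n_predicted += 1
        pred_set = with_ancestors(guess)
        n_pred_classes += len(pred_set)
        overlap += len(true_set & pred_set)
    hp = overlap / n_pred_classes if n_pred_classes else 0.0
    hr = overlap / n_true_classes if n_true_classes else 0.0
    denom = beta ** 2 * hp + hr
    hf = (beta ** 2 + 1) * hp * hr / denom if denom else 0.0
    pr = n_predicted / len(truth) if truth else 0.0
    return hp, hr, hf, pr


# Toy example with four proteins (invented annotations).
truth = {"p1": "3.5.3", "p2": "3.5", "p3": "2.1", "p4": "1.2"}
predicted = {"p1": "3.5.2", "p2": "3.5.3", "p3": "2.1", "p4": None}
hp, hr, hf, pr = hierarchical_scores(truth, predicted)
```

In this sketch a sibling error (p1) still receives credit for the two shared ancestors, and a prediction that is more detailed than the reference (p2) lowers hP only through the extra class.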
It may happen that some predictions are more detailed than the expert annotation. To respect the spirit of the measures defined in , in the evaluation of the performance of our methods we consider the more detailed prediction as an incorrect prediction (see Fig. 2-d). However, a more detailed prediction might very well be correct. Indeed, the annotations considered as references here were made a couple of years ago with less information than is available today. Consequently, the prediction will often correspond to the annotation that a human expert would make based on the information currently available. For example, in L. sakei, protein "DNA directed RNA polymerase, a subunit" annotated in class 3.5 (RNA synthesis) is predicted in 3.5.3 (transcription elongation), as it should be.
Results and Discussion
For the two approaches, TILDE and multilabel probabilistic decision tree, we carry out two different tests. In the first test, proteins of both genomes are considered as a whole, and rules are learnt on a fraction of them and tested on the other fraction by a 3-fold cross validation procedure. In the second test, rules are learnt on proteins of a genome and tested on proteins of the second genome. The latter is a more "natural" way of proceeding since we seek to annotate new genomes in the light of previously annotated genomes. Results of these tests are evaluated with the four measures previously presented but we will only detail the second test, which is more natural.
We have also combined the two tested methods as follows:
• Combined-Multilabel: first carry out the prediction with Multilabel. If no prediction is obtained, employ TILDE.
• Combined-TILDE: this is the converse of the previous approach, use TILDE first then Multilabel.
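Both combinations amount to a simple fallback between two predictors, which can be sketched as follows. The two lookup tables are toy stand-ins for the trained Multilabel and TILDE models, and all names are hypothetical.

```python
def combine(primary, fallback):
    """Build a predictor that uses `fallback` only when `primary` abstains."""
    def predict(protein):
        result = primary(protein)
        return result if result is not None else fallback(protein)
    return predict


# Toy predictors: each returns a class, or None for "no prediction".
multilabel_table = {"esa1": "3.5", "esa2": None, "esa3": None}
tilde_table = {"esa1": "3.6", "esa2": "2.1", "esa3": None}

combined_multilabel = combine(multilabel_table.get, tilde_table.get)
combined_tilde = combine(tilde_table.get, multilabel_table.get)

results_multilabel_first = [combined_multilabel(p) for p in ("esa1", "esa2", "esa3")]
results_tilde_first = [combined_tilde(p) for p in ("esa1", "esa2", "esa3")]
```

The order matters only for proteins on which both methods give a prediction: the first method always wins, so the combination inherits its precision on those proteins and gains coverage from the second.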
When TILDE is used as the first prediction, no real gain is observed. The prediction rate (pr) for the TILDE approach is close to 1 and thus the Multilabel approach is only used for the few proteins that are not predicted by TILDE.
On the other hand, when the Multilabel approach is used as the first prediction method, the gain is substantial both in terms of recall (almost 10%) and prediction rate (20 to 26%). This increase in the recall is accompanied by a slight decrease in the precision. However, overall, the precision remains close to 0.8, which is good enough for use in a semi-automatic application.
Trees and rules
For the first level, the homologs of esa800 do not have the GO-term "translation", but more than 69% of them are associated with the GO term "DNA binding" which is enough to classify the protein in class 3 (conf = 98%) ("information pathways" see Tab. 4).
For the second level, the homologs of esa800 do not have the GO-term "translation" but are associated with the GO-term "transcription" which corresponds to class 3.5 (conf= 97%) (RNA synthesis). For the third level, the homologs of esa800 are not associated with the GO-term "transferase activity". This is a very general term that does not carry any specific information in favor of a particular class. However in the context of class 3.5, it makes sense since the elongation process (3.5.3) corresponds to the attachment (transfer) of a new nucleotide to the growing RNA chain. Therefore the protein is predicted 3.5.2 (conf = 87%) since its homologs do not have this term.
The multilabel approach proposes a similar rule, which leads to the same class (conf = 90%) for esa800 (Fig. 6).
Annotators can thus easily interpret these rules and trees and confirm or reject the rule conclusion.
Conclusion and perspectives
Results obtained for the two approaches on both genomes are comparable and are good enough to be useful for the annotators (good precision and high prediction rate). A first attempt at combining the two approaches is very encouraging (this increases the recall and the prediction rate). We will further refine these combinations.
We are now thoroughly analysing the rules obtained from the trees and comparing them in order to extract common pieces of knowledge which could be considered as strongly reliable for an automatic annotation. The biological meaning of these rules and their relevance for annotation purposes will be investigated by experts who use the AGMIAL platform. As we may obtain several possible annotations, we would like to extend the AGMIAL interface so that it supports multiple annotations for the same protein, if required, and provides the user with different predictions together with their confidence degree. We also plan to learn new trees based on a richer set of descriptors for the training examples, for instance, by taking into account the genomic context or subcellular localisation. Finally, we are considering validating our approach by applying it to other genomes and by learning other expressive classifiers.
Note added in proofs: we were considering applying our methodology on the 5 MIPS genomes annotated with the MIPS Funcat functional hierarchy. MIPS scientists published recently a paper  describing a work quite similar, in spirit if not in methodological details, to the one we presented here, using Funcat and their 5 annotated genomes.
This work was supported in part by grants from the ACI IMPBIO (French national project RAFALE) and from the Agence Nationale de la Recherche (French national project Microbiogenomics, ANR-05-MMSA-0009-02).
This article has been published as part of BMC Proceedings Volume 2 Supplement 4, 2008: Selected Proceedings of Machine Learning in Systems Biology: MLSB 2007. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/2?issue=S4.
1. Genomes On Line. [http://www.genomesonline.org]
2. Consortium TGO: Gene ontology: tool for the unification of biology. Nat Genet. 2000, 25: 25-9. 10.1038/75556.
3. Kreitschmann W, Fleischmann W, Apweiler R: Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT. Bioinformatics. 2001, 17: 920-926. 10.1093/bioinformatics/17.10.920.
4. Vinayagam A, del Val C, Schubert F, Eils R, Glatting K, Suhai S, Konig R: GOPET: a tool for automated predictions of Gene Ontology terms. BMC Bioinformatics. 2006, 7: 161. 10.1186/1471-2105-7-161.
5. Quinlan R: C4.5: Programs for Machine Learning. 1993, Morgan Kaufmann.
6. Cristianini N, Shawe-Taylor J: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. 2000, Cambridge University Press, [ISBN: 0 521 78019 5].
7. Troyanskaya O, Dolinski K, Owen A, Altman R, Botstein D: A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc Natl Acad Sci. 2003, 100 (14): 8348-53. 10.1073/pnas.0832373100.
8. Barutcuoglu Z, Schapire R, Troyanskaya O: Hierarchical multi-label prediction of gene function. Bioinformatics. 2006, 22: 830-6. 10.1093/bioinformatics/btk048.
9. Levy E, Ouzounis C, Gilks W, Audit B: Probabilistic annotation of protein sequences based on functional classifications. BMC Bioinformatics. 2005, 6: 302. 10.1186/1471-2105-6-302.
10. RAFALE: French national project RAFALE. [http://www.lri.fr/RAFALE]
11. Gattiker A, Michoud K, Rivoire C, Auchincloss AH, Coudert E, Lima T, Kersey P, Pagni M, Sigrist CJA, Lachaize C, Veuthey AL, Gasteiger E, Bairoch A: Automated annotation of microbial proteomes in SWISS-PROT. Computational Biology and Chemistry. 2003, 27: 49-58. 10.1016/S1476-9271(02)00094-4.
12. Clare A, King R: Machine learning of functional class from phenotype data. Bioinformatics. 2002, 18: 160-166. 10.1093/bioinformatics/18.1.160.
13. Blockeel H, Raedt LD: Top-Down Induction of First-Order Logical Decision Trees. Artificial Intelligence. 1998, 101 (1–2): 285-297. 10.1016/S0004-3702(98)00034-4. [http://citeseer.ist.psu.edu/blockeel98topdown.html]
14. Blockeel H, Schietgat L, Struyf J, Dzeroski S, Clare A: Decision Trees for Hierarchical Multilabel Classification: A Case Study in Functional Genomics. Principles and Practice of Knowledge Discovery in Databases (PKDD'06). 2006, 18-29.
15. Bryson K, Loux V, Bossy R, Nicolas P, Chaillou S, Guchte van de M, Penaud S, Maguin E, Hoebeke M, Bessières P, Gibrat JF: AGMIAL: implementing an annotation strategy for prokaryote genomes as a distributed system. Nucleic Acids Res. 2006, 34 (12): 3533-45. 10.1093/nar/gkl471.
16. Chaillou S, Champomier-Vergès MC, Cornet M, Coq AMCL, Dudez AM, Martin V, Beaufils S, Darbon-Rongère E, Bossy R, Loux V, Zagorec M: The complete genome sequence of the meat-borne lactic acid bacterium Lactobacillus sakei 23K. Nature Biotechnology. 2005, 23: 1527-33. 10.1038/nbt1160.
17. Guchte van de M, Penaud S, Grimaldi C, Barbe V, Bryson K, Nicolas P, Robert C, Oztas S, Mangenot S, Couloux A, Loux V, Dervyn R, Bossy R, Bolotin A, Batto J, Walunas T, Gibrat J, Bessieres P, Weissenbach J, Ehrlich S, Maguin E: The complete genome sequence of Lactobacillus bulgaricus reveals extensive and ongoing reductive evolution. Proc Natl Acad Sci USA. 2006, 103: 9274-9279. 10.1073/pnas.0603024103.
18. Moszer I, Jones L, Moreira S, Fabry C, Danchin A: Subtilist: the reference database for the Bacillus subtilis genome. Nucleic Acids Res. 2002, 30: 62-5. 10.1093/nar/30.1.62.
19. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-402. 10.1093/nar/25.17.3389.
20. Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LSL: The Universal Protein Resource (UniProt). Nucleic Acids Res. 2005, 33 (Database issue): D154-D159.
21. Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, Ashburner M: The Sequence Ontology: a tool for the unification of genome annotations. Genome Biology. 2005, 6 (5): R44. 10.1186/gb-2005-6-5-r44.
22. Bairoch A, Apweiler R: The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000, 28: 45-8. 10.1093/nar/28.1.45.
23. Zdobnov EM, Apweiler R: InterProScan - an integration platform for the signature-recognition methods in InterPro. Bioinformatics. 2001, 17 (9): 847-8. 10.1093/bioinformatics/17.9.847.
24. GeneOntology: The Gene Ontology. Revision of February 2007, [http://www.geneontology.org/external2go/]
25. Clare A: Machine learning and data mining for yeast functional genomics. PhD thesis. 2003, University of Wales Aberystwyth.
26. Kiritchenko S, Matwin S, Nock R, Famili AF: Learning and Evaluation in the Presence of Class Hierarchies: Application to Text Categorization. Canadian Conference on Artificial Intelligence 2006. 2006, 395-406.
27. Tetko I, Rodchenkov I, Walter M, Rattei T, Mewes H: Beyond the best match: machine learning annotation of protein sequences by integration of different sources of information. Bioinformatics. 2008, 24: 621-8. 10.1093/bioinformatics/btm633.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.