Bio-event definition in text mining towards event interconnection
© Li and Liakata; 2015
Published: 6 August 2015
Event extraction is one of the main focuses in bio-text mining (TM). Interconnecting extracted events into reaction networks provides biologists with a wealth of fine-grained information on biochemical reactions. Intuitively, extracted events could be connected into networks based on the entities they share. However, this approach is limited by: 1) its dependence on flawless entity normalisation; and 2) its inability to express the directionality of the various relations and reactions. To enrich the information in extracted events and facilitate their interconnection, we propose a modification to the bio-event definition that makes it more compatible with the structure of biological reactions and with community-supported biological semantic resources. More specifically, we propose aligning bio-events with the reactions in the Systems Biology Markup Language (SBML), which would make bio-events more biologically meaningful and directly re-usable by domain experts.
The last decade has seen increased interest and rapid advances in the semantic study of biology, resulting in a number of semantic knowledge resources proposed by the bio community. The Systems Biology Markup Language (SBML) is a successful example of such efforts. SBML is a machine-readable and transferable format for depicting biological and biochemical reaction networks. It has been widely adopted and used to encode a broad range of biological networks.
Ohta et al. compared SBML with the event definitions in the series of GENIA tasks, pointing out that the bio-event types in current TM tasks are insufficient to cover all types of biochemical reactions in existing networks. Later tasks in the BioNLP series have tried to extend coverage to more types of bio-events.
We argue below that not only are more event types required, but the current event structure also needs to be modified.
BioNLP events and biological reactions
The main arguments of an event defined in the latest BioNLP'13 GE task are Theme and Cause. While Theme and Cause correspond directly to the notions of Patient and Agent, respectively, in linguistic thematic relations, assigning these arguments delivers insufficient information about the roles of participants in reactions.
By contrast, an SBML file encodes a network as a set of biochemical reactions interconnected by their participants. The main elements of each reaction are reactants, modifiers and products, which respectively denote the substances involved in, influencing and produced by the reaction.
The BioNLP 2013 Pathway Curation (PC) task has augmented the Theme and Cause arguments with Products and Participants. However, several issues remain unresolved. For example, a protein modification event in the PC task contains a single Theme. This is based on the knowledge that such events occur between proteins and certain molecules and always result in the binding of the two. Without explicit mention of the products, though, computers cannot automatically interconnect such events via, for example, coreference. Moreover, regulatory events cannot be incorporated into a gene expression event as modifiers if the event is not provided in a directional format. Modifiers are also missing for many events.
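The reactant/modifier/product structure described above can be sketched as a small data structure. This is a minimal illustration rather than the SBML object model: the `Reaction` class, the `downstream` helper and the species names `A`, `B` and `A B` are all hypothetical, chosen only to show how directionality falls out of matching one reaction's products against another's reactants.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Reaction:
    """Hypothetical SBML-style reaction record (not the real SBML API)."""
    name: str
    reactants: List[str] = field(default_factory=list)  # substances involved in the reaction
    modifiers: List[str] = field(default_factory=list)  # substances influencing it
    products: List[str] = field(default_factory=list)   # substances it produces


def downstream(reaction: Reaction, network: List[Reaction]) -> List[Reaction]:
    """Return the reactions that take at least one product of `reaction` as a reactant."""
    produced = set(reaction.products)
    return [r for r in network
            if r is not reaction and produced & set(r.reactants)]


# Placeholder species names, used only for illustration.
binding = Reaction("binding", reactants=["A", "B"], products=["A B"])
modification = Reaction("modification", reactants=["A B"])
network = [binding, modification]
```

Here `downstream(binding, network)` returns the modification reaction, while `downstream(modification, network)` is empty: the link is directed, which a connection based only on shared entity names would not capture.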
Proposed amendments to event definition
Table 1: Examples for the event types from BioNLP'13 GE task. Where the table gives an inferred product, it follows the example text in parentheses.
MBP mRNA transcription
degradation of the p100 NF-kB protein
the association of HOIP with CD40 (inferred product: HOIP CD40)
localization of ΔFKH
Post-translational modification of NF-kB p65
NF-kB p65 phosphorylation (inferred product: NF-kB pho)
ubiquitination, and subsequent degradation of (inferred product: I-kBα ubi)
Acetylation of p65 (inferred product: p65 ace)
Deacetylation of p65 by histone deacetylase-3 (inferred product: p65 dea)
Example 1: point mutation at Ser536
Example 2: HOIP functions downstream of TRAF2
M-CSF stimulated PKCα
inhibits MEK1 and MEK2
Gene expression (GE) is the process of synthesizing proteins from genetic codes. The same name is commonly used to refer to both a gene and its product, e.g. a protein. Therefore, the same gene name is used for both reactant and product. As a sub-process of GE, Transcription could take the same approach. So, the GE example in Table 1 could have IL-10 as the reactant.
Protein catabolism is the process of proteins breaking down into amino acids. Therefore, if mentions of the generated amino acids appear alongside the degraded protein, the amino acid names should be annotated as products. Likewise, if the names of related proteases occur, they should be annotated as modifiers.
Binding is the formation of macromolecules by the aggregation of two or more molecules. The generated molecules, which are physical clusters of the original macromolecules, constitute the products. In Table 1, the inferred cluster could be named after the conjunction of the reactants. This can help event interconnection to produce more sensible reaction cascades. For example, in "post-translational modification state of CD40-associated HOIP", post-translational modification is taking place on the macromolecule consisting of CD40 and HOIP rather than on either of them alone. If the product of the binding of CD40 and HOIP is named CD40 HOIP, the downstream encoding of the protein modification can use CD40 HOIP as the reactant.
The more specific Protein modification types include phosphorylation, ubiquitination, acetylation and deacetylation. These processes attach specific chemical groups to, or remove them from, other molecules. Therefore, the products of these processes can be inferred in a similar way as for binding.
Regulations, including positive and negative regulations, are processes that catalyze or inhibit other processes without producing anything via the regulation process per se. They are akin to the notion of modifiers defined in SBML. We propose that regulations should be incorporated into the processes they influence. This would merge regulation events with others. For example, "Addition of U0126 to the cultures abrogated the production of IL-10" could be extracted as a gene expression event of IL-10 with U0126 as the modifier, rather than as an extra regulation event of U0126, although the extraction may technically be achieved in two steps.
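As a minimal sketch of the naming idea above, the binding product can be derived by joining the reactant names, and a later protein modification can then take that product as its reactant. The dictionary layout and the function names `infer_binding_product` and `binding_event` are assumptions made for illustration; they are not part of any BioNLP or SBML tooling.

```python
from typing import Dict, List


def infer_binding_product(reactants: List[str]) -> str:
    """Name the inferred complex after the conjunction of its reactants."""
    return " ".join(reactants)


def binding_event(reactants: List[str]) -> Dict[str, object]:
    """Build a hypothetical reaction-style record for a binding event."""
    return {
        "type": "Binding",
        "reactants": list(reactants),
        "modifiers": [],
        "products": [infer_binding_product(reactants)],
    }


# "the association of HOIP with CD40", with the complex named as in the text.
binding = binding_event(["CD40", "HOIP"])

# The downstream protein modification acts on the complex, not on CD40 or HOIP alone.
modification = {
    "type": "Protein_modification",
    "reactants": list(binding["products"]),
    "modifiers": [],
    "products": [],
}
```

Because the modification's reactant is literally the binding's product, the two events can be chained by name matching. In practice the order of the joined names would need to be fixed by some convention, otherwise the same complex could receive two different names.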
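Product inference for the modification types could follow the same pattern as for binding. The sketch below is hypothetical: the short tags (`pho`, `ubi`, `ace`, `dea`) are modelled on the inferred entries in Table 1, while the event type keys and the function name are assumptions introduced only for this example.

```python
# Short tags modelled on the inferred products in Table 1; the dictionary keys
# are assumed event type names, not an official inventory.
MODIFICATION_TAGS = {
    "Phosphorylation": "pho",
    "Ubiquitination": "ubi",
    "Acetylation": "ace",
    "Deacetylation": "dea",
}


def infer_modification_product(substrate: str, event_type: str) -> str:
    """Name the product of a modification event as '<substrate> <tag>'."""
    try:
        tag = MODIFICATION_TAGS[event_type]
    except KeyError:
        raise ValueError(f"no product rule for event type {event_type!r}")
    return f"{substrate} {tag}"


acetylated = infer_modification_product("p65", "Acetylation")
deacetylated = infer_modification_product("p65", "Deacetylation")
```

With these assumptions, "Acetylation of p65" yields the product name `p65 ace`, which a subsequent event mentioning acetylated p65 could use as its reactant.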
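The two-step extraction mentioned above can be sketched as follows: a regulation event is first extracted in the usual Theme/Cause form, and is then folded into the event it regulates, with its Cause becoming a modifier. The `Event` and `Reaction` classes, the type names and `merge_regulation` are hypothetical, and the handling of gene expression products simply reuses the gene name as proposed earlier in this section.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Event:
    """Hypothetical Theme/Cause-style event; the theme may be another event."""
    type: str
    theme: Union[str, "Event"]
    cause: Optional[str] = None


@dataclass
class Reaction:
    """Hypothetical reactant/modifier/product record."""
    type: str
    reactants: List[str]
    modifiers: List[str] = field(default_factory=list)
    products: List[str] = field(default_factory=list)


REGULATION_TYPES = {"Regulation", "Positive_regulation", "Negative_regulation"}


def merge_regulation(event: Event) -> Reaction:
    """Fold regulation events into the event they regulate, as modifiers."""
    modifiers: List[str] = []
    # Unwrap nested regulations, collecting each Cause as a modifier.
    while event.type in REGULATION_TYPES and isinstance(event.theme, Event):
        if event.cause is not None:
            modifiers.append(event.cause)
        event = event.theme
    if not isinstance(event.theme, str):
        raise TypeError("expected an entity name as the innermost theme")
    # For gene expression, the gene name serves as both reactant and product.
    products = [event.theme] if event.type == "Gene_expression" else []
    return Reaction(event.type, [event.theme], modifiers, products)


# "Addition of U0126 to the cultures abrogated the production of IL-10"
expression = Event("Gene_expression", theme="IL-10")
regulation = Event("Negative_regulation", theme=expression, cause="U0126")
merged = merge_regulation(regulation)
```

The merged record is a gene expression reaction of IL-10 with U0126 as its modifier. Note that this sketch discards the polarity of the regulation; keeping it would require an extra field on the modifier.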
Event interconnection requires further research into entity coreference, event coreference and discourse analysis. Encoding extracted and inferred information from bio-events into SBML format can help by maintaining reaction directionality and enabling meaningful coreference.
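To make the encoding step concrete, the sketch below serialises one reaction into an SBML-style XML fragment using Python's standard library. The element names (`reaction`, `listOfReactants`, `listOfProducts`, `listOfModifiers`, `speciesReference`, `modifierSpeciesReference`) are those used by SBML, but the fragment is only an assumption-laden illustration: it omits the enclosing `sbml` and `model` elements, the species declarations and any further attributes a validator may require, and the identifiers `ge_IL10`, `IL10` and `U0126` are invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import List


def reaction_to_xml(reaction_id: str, reactants: List[str],
                    products: List[str], modifiers: List[str]) -> ET.Element:
    """Build a <reaction> element with SBML-style participant lists."""
    reaction = ET.Element("reaction", id=reaction_id)
    sections = (
        ("listOfReactants", "speciesReference", reactants),
        ("listOfProducts", "speciesReference", products),
        ("listOfModifiers", "modifierSpeciesReference", modifiers),
    )
    for list_tag, ref_tag, species_ids in sections:
        if not species_ids:
            continue  # skip empty lists rather than emit empty containers
        container = ET.SubElement(reaction, list_tag)
        for species_id in species_ids:
            ET.SubElement(container, ref_tag, species=species_id)
    return reaction


# Gene expression of IL-10 with U0126 as modifier (identifiers are illustrative).
element = reaction_to_xml("ge_IL10", reactants=["IL10"],
                          products=["IL10"], modifiers=["U0126"])
xml_text = ET.tostring(element, encoding="unicode")
```

The hyphen is dropped from the identifier because SBML identifiers are restricted to letters, digits and underscores; the original surface form of the entity would have to be kept elsewhere, for example in a name attribute.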
This position paper argues that it is possible, and indeed advantageous, to enhance the output formats of extracted bio-events and make them compatible with the widely used SBML format for biological reactions. The format can be further refined to meet the complexity of bio-events. A possible first step would be to use the enhanced format to annotate existing corpora, e.g. those from the BioNLP tasks, or to adapt them to the new format semi-automatically.
- Chen Li, Maria Liakata, Dietrich Rebholz-Schuhmann: Biological network extraction from scientific literature: state of the art and challenges. Briefings in Bioinformatics. 2013, bbt006.
- Michael Hucka, et al: The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics. 2003, 19 (4): 524-531. 10.1093/bioinformatics/btg015.
- Tomoko Ohta, Sampo Pyysalo, Jun'ichi Tsujii: From pathways to biomolecular events: opportunities and challenges. Proceedings of BioNLP 2011 Workshop. Association for Computational Linguistics. 2011, 105-113.
- Tomoko Ohta, et al: Pathway curation support as an information extraction task. Proceedings of the Fourth International Symposium on Languages in Biology and Medicine. 2011, 11-
- Jin-Dong Kim, Yue Wang, Yasunori Yamamoto: The Genia Event Extraction Shared Task, 2013 Edition-Overview. ACL. 2013, 2013: 8-
- Tomoko Ohta, et al: Overview of the pathway curation (PC) task of BioNLP shared task 2013. 2013.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.