Linked annotations: a middle ground for manual curation of biomedical databases and text corpora

Goldberg, Tatyana; Vinchurkar, Shrikant; Cejuela, Juan Miguel; Jensen, Lars Juhl; Rost, Burkhard

doi:10.1186/1753-6561-9-S5-A4

Volume 9 Supplement 5

Selected abstracts from the Biomedical Linked Annotation Hackathon 2015

Meeting abstract
Open access
Published: 06 August 2015

Tatyana Goldberg¹,², Shrikant Vinchurkar¹, Juan Miguel Cejuela¹, Lars Juhl Jensen³ & Burkhard Rost¹

BMC Proceedings volume 9, Article number: A4 (2015)

Summary

Annotators of text corpora and curators of biomedical databases carry out the same labor-intensive task: manually extracting structured data from unstructured text. Tasks are needlessly repeated because text corpora are widely scattered. We envision that a linked annotation resource unifying many corpora could be a game changer. Such an open forum would help annotators focus on novel annotations and make the best use of the effort of many experts. As a proof of concept, we annotated protein subcellular localization in 100 abstracts cited by UniProtKB. A detailed comparison between our new corpus and the original UniProtKB annotations revealed novel or more detailed annotations for 42% of the entries (proteins). In a unified linked annotation resource, these could immediately extend the utility of text corpora beyond the text-mining community. Our example motivates the central idea that linked annotations from text corpora can complement database annotations.

Background

The natural language processing (NLP) and biomedical research communities both invest great effort in producing high-quality manual annotations of biomedical literature. The focus and the annotation strategies of the two communities have, however, differed so much that collaboration has remained surprisingly limited. Most text corpora contain detailed markup of only a few types of entities and relationships in a limited number of abstracts or articles [1] (with exceptions such as the CRAFT corpus [2]). In contrast, manually curated databases such as Swiss-Prot/UniProtKB [3] aim to annotate each entity with a wide range of information extracted from the literature, but with less focus on the text structure.

We envision linked annotations as a possible middle ground between these two strategies for curating literature, one that could link the efforts of the two distinct communities. By connecting the annotations of different types of entities and relationships in existing and future corpora, a linked annotation resource could be constructed that would have much greater coverage and diversity of annotations than any existing text corpus. Such a resource would be valuable to NLP researchers and database curators alike.

Here, we present a case study on protein subcellular localization to demonstrate that the corpus annotation strategy can improve database annotation. The localization of a protein is one aspect of protein function and therefore constitutes one of the three hierarchies that the Gene Ontology (GO) uses to capture protein function [4].

The LocText corpus

We assembled a corpus of 100 PubMed abstracts referenced by UniProtKB. We focused on three model organisms: Homo sapiens (50 entries), Saccharomyces cerevisiae (baker's yeast, 25 entries), and the plant Arabidopsis thaliana (25 entries). We used 46 of the 100 abstracts to develop our annotation guidelines, which are available at https://www.tagtog.net/-corpora/loctext.

Two of us (TG & SV) then annotated the remaining 54 abstracts. The two sets of annotations agreed at F1 = 94% for entities and at F1 = 80% for relationships. We normalized protein names to UniProtKB identifiers and localizations to GO identifiers. The resulting corpus contains 306 annotated relationships involving 201 different UniProtKB proteins and 48 distinct GO localization terms. All annotations were made within the framework of the tagtog system (Figure 1; http://tagtog.net) [5], and Reflect was used to aid protein name normalization (http://reflect.ws) [6]. The corpus is available for download at https://www.tagtog.net/-corpora/loctext under the Creative Commons Attribution 4.0 (CC-BY 4.0) license.
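The agreement figures above are F1 scores between two annotators. The abstract does not spell out the matching criteria, so the following is only a minimal sketch, assuming that each annotation is represented as a hashable tuple and that two annotations agree only when the tuples are identical. The document ids and offsets are invented for illustration and are not taken from the corpus.

```python
def f1_agreement(annotator_a, annotator_b):
    """F1 between two annotation sets, treating annotator_a as the reference."""
    a, b = set(annotator_a), set(annotator_b)
    if not a and not b:
        return 1.0  # neither annotator marked anything: full agreement by convention
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(b)  # share of b's annotations also made by a
    recall = overlap / len(a)     # share of a's annotations also made by b
    return 2 * precision * recall / (precision + recall)


# Toy entity annotations: (document id, start offset, end offset, entity type).
first = {
    ("doc1", 0, 6, "protein"),
    ("doc1", 20, 29, "location"),
    ("doc2", 5, 11, "protein"),
}
second = {
    ("doc1", 0, 6, "protein"),
    ("doc1", 20, 29, "location"),
    ("doc2", 40, 47, "location"),
}

score = f1_agreement(first, second)
print(f"F1 = {score:.2f}")
```

Because F1 is symmetric in precision and recall, swapping which annotator serves as the reference does not change the score.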

Corpus provides novel annotations

Linked annotations from text corpora can complement database annotations only if manual corpus annotations identify relationships not captured by existing databases. Therefore, all our annotations were done from scratch without using database annotations. Comparing our "from scratch" annotations with those from UniProtKB revealed important novelty added by our text corpus.

We found novel or more detailed localization annotations with respect to UniProtKB for 84 of 201 (42%) proteins in 34 abstracts (Table 1); for example, Arabidopsis RabF2a (UniProtKB entry RAF2A_ARATH) is localized to endosomes (Figure 1). For over half of these proteins with additional annotations (47/84 = 56%), UniProtKB did not cite the abstracts. This is likely explained by the way proteins are annotated, one protein at a time: if a curator works on one protein and an abstract also mentions the localization of another protein that is not the focus of the curator, the localization of the latter might not be annotated.
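The two percentages quoted above follow directly from the reported counts. A short check, using only the numbers stated in the text (the variable names are chosen here for readability):

```python
proteins_total = 201     # UniProtKB proteins in the corpus
proteins_with_new = 84   # proteins with novel or more detailed annotations
abstract_not_cited = 47  # of those, proteins for which UniProtKB did not cite the abstract

share_new = proteins_with_new / proteins_total
share_not_cited = abstract_not_cited / proteins_with_new

print(f"{share_new:.0%} of proteins gain annotations")
print(f"{share_not_cited:.0%} of those come from abstracts not cited by UniProtKB")
```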

Table 1 Localization annotations in our corpus and in UniProtKB. The table categorizes the corpus relationships by organism relative to whether they represent existing annotations in UniProtKB, more detailed annotations, or truly novel annotations. It further subdivides the counts based on whether or not the relationships involve UniProtKB proteins that cite the abstract.
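The categories used in Table 1 can be illustrated with a small sketch. It assumes that a corpus relationship counts as "more detailed" when the database already holds an ancestor of the annotated GO term, which is one plausible reading and not necessarily the authors' exact procedure. The protein names, the database content, and the one-entry hierarchy are hypothetical; the real GO is a graph in which a term can have several parents.

```python
from collections import Counter

# Toy child -> parent map standing in for the GO cellular component hierarchy.
GO_PARENT = {
    "GO:0005769": "GO:0005768",  # early endosome -> endosome
}


def ancestors(term):
    """Return all ancestors of a term in the toy hierarchy."""
    found = set()
    while term in GO_PARENT:
        term = GO_PARENT[term]
        found.add(term)
    return found


def classify(protein, go_term, database):
    """Classify one corpus relationship against the database annotations."""
    known = database.get(protein, set())
    if go_term in known:
        return "existing"
    if ancestors(go_term) & known:
        return "more detailed"
    return "novel"


# Hypothetical database annotations and corpus relationships.
database = {"PROT_A": {"GO:0005768"}, "PROT_B": {"GO:0005634"}}
citing = {"PROT_A"}  # proteins whose database entry cites the abstract
corpus = [
    ("PROT_A", "GO:0005768"),
    ("PROT_A", "GO:0005769"),
    ("PROT_B", "GO:0005886"),
    ("PROT_C", "GO:0005737"),
]

# Tally by category and by whether the protein's entry cites the abstract.
counts = Counter((classify(p, t, database), p in citing) for p, t in corpus)
for (category, cites), n in sorted(counts.items()):
    print(f"{category:13s} cites abstract: {cites!s:5s} {n}")
```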

Perspectives

Our case study showed that corpora containing manual annotations of the subcellular localization of proteins can contribute novel information to curated databases such as UniProtKB. Notably, this holds even in the worst-case scenario in which annotations are limited to abstracts of articles that the database curators have already used. We expect our findings to generalize to most types of protein annotation, including disease associations and tissue expression.

Today, databases avoid the trouble of integrating these annotations because most text corpora are too limited in size and scope. Having corpus developers combine their annotations into a single, unified linked annotation resource could thus be an important step towards integrating corpus annotations into databases, turning them into richer data collections. Even before integration with databases happens, researchers will be able to use semantic web technologies to combine the information in the linked annotation resource with that in existing databases, since UniProtKB and many other databases are already Resource Description Framework (RDF) compliant.
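To make the semantic web route concrete, a linked annotation can be written as an RDF triple whose subject and object reuse identifiers that databases already publish. The sketch below emits one N-Triples line in plain Python. It assumes the URI patterns http://purl.uniprot.org/uniprot/ for proteins and http://purl.obolibrary.org/obo/ for GO terms; the predicate URI and the accession "P00000" are placeholders invented for illustration, not values from the corpus.

```python
UNIPROT_BASE = "http://purl.uniprot.org/uniprot/"  # assumed URI pattern
OBO_BASE = "http://purl.obolibrary.org/obo/"       # assumed URI pattern
# Hypothetical predicate; a real resource would reuse a term from an
# existing vocabulary instead of an example.org URI.
LOCALIZED_IN = "http://example.org/linked-annotation/localizedIn"


def to_ntriple(accession, go_id):
    """Serialize one protein-localization annotation as an N-Triples line."""
    subject = UNIPROT_BASE + accession
    obj = OBO_BASE + go_id.replace(":", "_")
    return f"<{subject}> <{LOCALIZED_IN}> <{obj}> ."


# "P00000" is a placeholder accession; GO:0005768 is the GO term for endosome.
triple = to_ntriple("P00000", "GO:0005768")
print(triple)
```

Because both ends of the triple are shared identifiers, such statements could in principle be loaded next to a database's own RDF and queried together.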

We envision a linked annotation resource that grows continuously, supported by annotation tools that make it easy for corpus developers to link future annotations, for example through a standard JSON format. Not all linked annotations need to be made manually, though. Also including results from automatic text mining pipelines would help address the challenge of the prohibitively high costs of large-scale manual annotation [2]. Associations extracted from both open and non-open access journals can be linked, as redistribution of extracted facts is not prohibited by most publishers' licenses.
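As an illustration of what a JSON exchange format for linked annotations might carry, the sketch below serializes the RabF2a example as text spans, normalized identifiers, and one relationship. The field names, the sentence, and the "PLACEHOLDER" source id are assumptions made for this example and do not reproduce a published schema or a passage from the corpus.

```python
import json

# Illustrative sentence, not a quotation from any abstract.
text = "RabF2a localizes to endosomes."


def span(text, mention):
    """Character offsets of the first occurrence of a mention."""
    begin = text.index(mention)
    return {"begin": begin, "end": begin + len(mention)}


annotation = {
    "sourcedb": "PubMed",
    "sourceid": "PLACEHOLDER",  # the PubMed id would go here
    "text": text,
    "denotations": [
        {"id": "T1", "span": span(text, "RabF2a"), "obj": "UniProtKB:RAF2A_ARATH"},
        {"id": "T2", "span": span(text, "endosomes"), "obj": "GO:0005768"},
    ],
    "relations": [
        {"id": "R1", "subj": "T1", "pred": "localizedIn", "obj": "T2"},
    ],
}

serialized = json.dumps(annotation, indent=2)
print(serialized)
```

Keeping the character offsets alongside the normalized identifiers preserves the link back to the text that NLP researchers need, while the identifiers alone are enough for database curators.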

References

1. Neves M: An analysis on the entity annotations in biological corpora. F1000Res. 2014, 3: 96.
2. Verspoor K, Cohen KB, Lanfranchi A, Warner C, Johnson HL, Roeder C, Choi JD, Funk C, Malenkiy Y, Eckert M, Xue N, Baumgartner WA, Bada M, Palmer M, Hunter LE: A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools. BMC Bioinformatics. 2012, 13: 207. 10.1186/1471-2105-13-207.
3. UniProt Consortium: Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2014, 42: D191-D198.
4. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
5. Cejuela JM, McQuilton P, Ponting L, Marygold SJ, Stefancsik R, Millburn GH, Rost B, FlyBase Consortium: tagtog: interactive and text-mining-assisted annotation of gene mentions in PLOS full-text articles. Database. 2014, 2014: bau033. 10.1093/database/bau033.
6. Pafilis E, O'Donoghue SI, Jensen LJ, Horn H, Kuhn M, Brown NP, Schneider R: Reflect: augmented browsing for the life scientist. Nat Biotechnol. 2009, 27: 508-510. 10.1038/nbt0609-508.
7. Molendijk AJ, Ruperti B, Singh MK, Dovzhenko A, Ditengou FA, Milia M, Westphal L, Rosahl S, Soellick TR, Uhrig J, Weingarten L, Huber M, Palme K: A cysteine-rich receptor-like kinase NCRK and a pathogen-induced protein kinase RBK1 are Rop GTPase interactors. Plant J. 2008, 53: 909-923.
8. Baumgartner WA, Cohen KB, Fox LM, Acquaah-Mensah G, Hunter L: Manual curation is not sufficient for annotation of genomic databases. Bioinformatics. 2007, 23 (13): i41-i48. 10.1093/bioinformatics/btm229.

Acknowledgements

Funding: Alexander von Humboldt Foundation through German Federal Ministry for Education and Research, Ernst Ludwig Ehrlich Studienwerk, and the Novo Nordisk Foundation Center for Protein Research (NNF14CC0001).

Author information

Authors and Affiliations

Bioinformatics & Computational Biology, Department of Informatics, Technical University of Munich (TUM), 85748, Garching, Germany
Tatyana Goldberg, Shrikant Vinchurkar, Juan Miguel Cejuela & Burkhard Rost
TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), 85748, Garching, Germany
Tatyana Goldberg
Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, 2200, Copenhagen N, Denmark
Lars Juhl Jensen

Corresponding authors

Correspondence to Lars Juhl Jensen or Burkhard Rost.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Goldberg, T., Vinchurkar, S., Cejuela, J.M. et al. Linked annotations: a middle ground for manual curation of biomedical databases and text corpora. BMC Proc 9 (Suppl 5), A4 (2015). https://doi.org/10.1186/1753-6561-9-S5-A4
