Interoperability of text corpus annotations with the semantic web
© Verspoor et al.; 2015
Published: 6 August 2015
This paper explores the adaptation of the PubAnnotation model to more general recent proposals for the representation of annotations in the Semantic Web, referred to here as the Open Annotation model and the focus of the W3C Web Annotation Working Group. We argue that interoperability with standards under development for text annotation on the web, and with recent proposals related to nanopublications, will have benefits for the use and consistency of linguistically annotated text corpora.
Formal annotation of language data is an activity that dates back at least to the classic work of Kucera and Francis on the Brown Corpus. More broadly, annotation is a general scholarly activity by which scholars organize existing knowledge and facilitate the creation and sharing of new knowledge. Annotation is also becoming increasingly pervasive in the context of social media. Recognition of the widespread importance of annotation has resulted in recent efforts to develop standard data models for annotation [2–4], specifically targeting Web formalisms in order to take advantage of increasing efforts to expose information on the Web, such as through Linked Data initiatives (http://linkeddata.org). The World Wide Web Consortium (W3C) has formed the Web Annotation Working Group (http://www.w3.org/annotation/) to develop specifications for a Web annotation architecture.
In this paper, we propose the adoption of general semantic web-oriented annotation proposals for text annotation in the context of text corpora intended for use in developing Biomedical Natural Language Processing (BioNLP) solutions. We specifically look at adapting the current PubAnnotation format for compatibility with those proposals. We propose a representation of an annotated corpus in terms of the data models under development in the broader scholarly annotation community, and develop a translator from the existing PubAnnotation JSON format to the Open Annotation model.
This generalization of the model is particularly pertinent to collaborative annotation scenarios; exposing linguistic annotations in the de facto language of the Semantic Web, the W3C's Resource Description Framework (RDF), provides several advantages that we have previously described. We further demonstrate that the model can be integrated with the nanopublications model [7, 8], facilitating the use of such annotations in a growing set of data publication tools.
PubAnnotation is an annotation repository that also provides a web services interface exposing the underlying texts and associated annotations. This interface makes use of a simple JSON format that directly associates a span of text with a particular concept string.
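To make the span-to-concept association concrete, the following is a minimal sketch in Python of how such a JSON document might be read. The field names (`text`, `denotations`, `span`, `begin`, `end`, `obj`) reflect our understanding of the PubAnnotation JSON format and should be checked against the service documentation; the sample text, source identifier, and annotation ID are invented for illustration.

```python
import json

# Hypothetical PubAnnotation-style document; text and identifiers are invented.
SAMPLE = """
{
  "sourcedb": "PubMed",
  "sourceid": "0000000",
  "text": "IL-2 activates T cells.",
  "denotations": [
    {"id": "T1", "span": {"begin": 0, "end": 4}, "obj": "Protein"}
  ]
}
"""


def covered_text(doc):
    """Return (id, covered string, concept string) for each denotation."""
    return [
        (d["id"], doc["text"][d["span"]["begin"]:d["span"]["end"]], d["obj"])
        for d in doc["denotations"]
    ]


doc = json.loads(SAMPLE)
rows = covered_text(doc)
for row in rows:
    print(row)
```

Each denotation carries only character offsets and a concept string, which is the direct association that the rest of this paper maps onto the richer Open Annotation structure.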
Open Annotation model
The W3C Web Annotation Working Group will base its proposals on the prior Annotation Ontology and Open Annotation Collaboration models. Each of these models in turn incorporates elements from the earlier Annotea model. We refer to these proposals collectively as the Open Annotation model (OpenAnn), and adopt it as our target representation.
High-level model for scholarly annotation
The basic high-level data model of the two primary Open Annotation models defines an Annotation as an association between a Body (a content resource) and one or more Target resources. The annotation provides some information about the target through the connection to the body. For instance, an annotation may relate the token "apple" in a text (the target of the annotation) to the concept of an apple, perhaps represented as WordNet synset "apple#1" (the body of the annotation).
Annotations can be augmented with meta-data, e.g. the author or creation time of the annotation. The model allows for each element of the annotation - the annotation itself, the target, and the body - to have different associated meta-data, such as different authors.
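The high-level model described above can be sketched as a small set of Python data classes. This is an illustration only, not part of any Open Annotation specification: the class and field names, the `doc:example` source identifier, and the metadata values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class Target:
    source: str                      # identifier of the annotated document
    start: int                       # character offset where the span begins
    end: int                         # character offset just past the span
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class Body:
    concept: str                     # identifier of the associated concept
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class Annotation:
    body: Body
    targets: Tuple[Target, ...]      # one or more targets
    metadata: Dict[str, str] = field(default_factory=dict)


text = "She ate an apple."
start = text.index("apple")
annotation = Annotation(
    body=Body("wordnet:apple#1", {"creator": "lexicon-team"}),
    targets=(Target("doc:example", start, start + len("apple")),),
    metadata={"creator": "annotator-1", "created": "2015-01-01"},
)
covered = text[annotation.targets[0].start:annotation.targets[0].end]
print(covered, "->", annotation.body.concept)
```

Keeping a separate metadata mapping on the annotation, the body, and the target mirrors the point that each element can have its own author or creation time.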
The initial use cases for Open Annotation focused on single target-concept relationships, formalized as an expectation that the body of an Annotation be a single web resource, represented as a URI. However, to accommodate more complex bodies, a set of RDF statements can be captured in a construct known as a named graph. The named graph as a whole has a URI. We propose to bundle all Body content into a named graph, so that both simple (e.g., entity) annotations and more complex (e.g., event) annotations can be captured in a consistent representation.
This extension enables complex semantics to be associated with a resource, as well as supporting fine-grained tracking of the provenance of compositional annotations. These developments make possible the integration of linguistic annotation with the scholarly annotation models.
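The idea of bundling body content into a named graph can be illustrated with a short Python sketch that writes out TriG-style text. The `http://example.org/` namespace and every term in it (`refersTo`, `Binding`, `hasTheme`, the mention and event identifiers) are hypothetical placeholders, not vocabulary used by PubAnnotation or Open Annotation.

```python
def named_graph(graph_iri, triples):
    """Serialize (subject, predicate, object) terms as one TriG-style named graph."""
    lines = [f"<{graph_iri}> {{"]
    lines += [f"  {s} {p} {o} ." for s, p, o in triples]
    lines.append("}")
    return "\n".join(lines)


EX = "http://example.org/"  # hypothetical namespace, for illustration only

# A simple entity body: a single statement inside its own named graph.
entity_body = named_graph(
    EX + "body/entity1",
    [(f"<{EX}mention1>", f"<{EX}refersTo>", f"<{EX}Protein>")],
)

# A more complex event body: several statements under one graph URI.
event_body = named_graph(
    EX + "body/event1",
    [
        (f"<{EX}event1>", "a", f"<{EX}Binding>"),
        (f"<{EX}event1>", f"<{EX}hasTheme>", f"<{EX}mention1>"),
        (f"<{EX}event1>", f"<{EX}hasTheme>", f"<{EX}mention2>"),
    ],
)

print(entity_body)
print(event_body)
```

In both cases the annotation can point at a single graph URI as its body, whether that graph holds one statement or many.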
Representing PubAnn in OpenAnn
As an example of the use of OpenAnn for PubAnnotation (PubAnn), we transform a PubAnn JSON statement for an entity into OpenAnn. The PubAnn statement
is represented in OpenAnn as the following set of RDF statements:
# The basic annotation structure
<PubMed-1134658-Ann1> a oa:Annotation.
<PubMed-1134658-Ann1> oa:serializedBy <http://pubannotation.org>.
<PubMed-1134658-Ann1> oa:hasTarget <PubMed-1134658-0-SR1>.
<PubMed-1134658-Ann1> oa:hasBody <PubMed-1134658-0-T13>.
<PubMed-1134658-Ann1> oa:motivatedBy oa:tagging.
<PubMed-1134658-Ann1> prov:generatedOn "20141111".
<PubMed-1134658-0-T13> prov:derivedFrom <PubMed-1134658-0-SR1> . }
# The body (content) of the annotation.
# A named graph.
<PubMed-1134658-0-SR1> sio:refers-to genia:Protein . }
# The target of the annotation.
<PubMed-1134658-0-SR1> a oa:SpecificResource ;
# A selector for a location within the text resource.
<PubMed-1134658-0-S1304-13099> a oa:TextPositionSelector ;
oa:start 1304 ;
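The mapping illustrated by the statements above can be sketched as a small Python function. This is not the translator described in this paper, only a simplified illustration under stated assumptions: the identifier scheme loosely mirrors the listing (without its "-0-" component), the input follows a PubAnnotation-style denotation with `id`, `span`, and `obj` fields, the sample values are invented, and provenance statements are omitted. The terms `oa:hasSource`, `oa:hasSelector`, and `oa:end` are taken from the Open Annotation vocabulary.

```python
def to_open_annotation(sourcedb, sourceid, denotation, index=1):
    """Return Turtle/TriG-style lines for one PubAnnotation-style denotation."""
    doc = f"{sourcedb}-{sourceid}"
    begin = denotation["span"]["begin"]
    end = denotation["span"]["end"]
    ann = f"<{doc}-Ann{index}>"               # the annotation itself
    body = f"<{doc}-{denotation['id']}>"      # named graph holding the body
    target = f"<{doc}-SR{index}>"             # specific resource (the span)
    selector = f"<{doc}-S{begin}-{end}>"      # text position selector
    return [
        f"{ann} a oa:Annotation .",
        f"{ann} oa:hasTarget {target} .",
        f"{ann} oa:hasBody {body} .",
        f"{body} {{ {target} sio:refers-to genia:{denotation['obj']} . }}",
        f"{target} a oa:SpecificResource ;",
        f"    oa:hasSource <{doc}> ;",
        f"    oa:hasSelector {selector} .",
        f"{selector} a oa:TextPositionSelector ;",
        f"    oa:start {begin} ;",
        f"    oa:end {end} .",
    ]


# Invented sample denotation, for illustration only.
sample = {"id": "T1", "span": {"begin": 10, "end": 14}, "obj": "Protein"}
lines = to_open_annotation("PubMed", "0000000", sample)
print("\n".join(lines))
```

A fuller version would also declare the `oa:`, `sio:`, and `genia:` prefixes and emit the provenance statements shown in the listing.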
We also extend our representation to be compatible with nanopublications (http://www.nanopub.org/guidelines) [7, 8], a community standard for encapsulating assertions with their provenance into a portable digital object, by defining the annotation body to be the assertion of a nanopublication.
np:has-assertion <PubMed-1134658-0-T13> ;
a np:Nanopublication . }
The above approach can be similarly applied for capturing relational or event semantics as the body of an annotation, by encapsulating a set of triples representing the event within a named graph. We leave such examples for a more in-depth paper.
The adoption of the Open Annotation formalism for representing annotations over textual corpora brings those annotations into the realm of the semantic web, enabling consistent specification of annotation content, provenance, and meta-data in terms of resolvable and reusable ontology concepts. It will allow annotations generated by different systems or individuals over the same documents to be more easily integrated, compared and contrasted. It further ensures interoperability of corpus annotations with components for authoring, sharing, and displaying annotations in browsers and other technical systems that will be developed through the broader efforts of the W3C, including digital publishing tools (cf. the Domeo annotation toolkit for the precursor of the Open Annotation model).
Nanopublications seem to be a particularly apt choice for structuring OpenAnn text annotations in the biomedical domain. Using nanopublications, the assertion, provenance, and metadata for a PubAnnotation are clearly demarcated into named graphs, which can be retrieved, validated, and viewed by a growing set of data publication tools.
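The demarcation into named graphs can be sketched in Python as a head graph that points to separate assertion, provenance, and publication-information graphs. The property names `np:hasAssertion`, `np:hasProvenance`, and `np:hasPublicationInfo` follow the nanopublication schema as we understand it (the fragment earlier in this paper spells the first as `np:has-assertion`); the `http://example.org/` IRIs and the `-head` naming convention are hypothetical.

```python
def nanopub_head(np_iri, assertion_iri, provenance_iri, pubinfo_iri):
    """Return a TriG-style head graph linking a nanopublication to its parts."""
    head_iri = np_iri + "-head"  # hypothetical naming convention
    return "\n".join([
        f"<{head_iri}> {{",
        f"  <{np_iri}> a np:Nanopublication ;",
        f"    np:hasAssertion <{assertion_iri}> ;",
        f"    np:hasProvenance <{provenance_iri}> ;",
        f"    np:hasPublicationInfo <{pubinfo_iri}> .",
        "}",
    ])


base = "http://example.org/np1"  # hypothetical identifier
head = nanopub_head(
    base,
    base + "-assertion",    # would hold the annotation body
    base + "-provenance",   # would hold how the assertion was derived
    base + "-pubinfo",      # would hold metadata about the nanopublication
)
print(head)
```

Under the approach in this paper, the assertion graph is the annotation body, so the same graph URI serves both the Open Annotation and the nanopublication views of the data.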
Furthermore, nanopublications are being used in an increasing number of biomedical resources to represent factual assertions and their provenance, and a number of tools are being developed specifically to work with nanopublications (e.g., the NanoBrowser, http://nanobrowser.inn.ac/). They have been used for incentivizing the publication of human variation data, capturing claims and scientific discourse, and publishing text-mined associations. Bringing together Open Annotation with nanopublications offers substantial opportunities for access to and reuse of text annotations in combination with information derived from structured databases.
We have introduced a proposal for the representation of text annotations in terms of the Open Annotation model, and demonstrated how it could be applied to the current PubAnnotation JSON format. We structured our model to also be compatible with nanopublications, in order to enable integration of text annotations with information derived from curated databases. The result is a representation for text annotation on the web that is interoperable with the framework of two increasingly relevant semantic web models.
1. Kucera H, Francis WN: Computational analysis of present-day American English. 1967, Brown University Press.
2. Ciccarese Paolo, Ocana Marco, Castro Leyla Garcia, Das Sudeshna, Clark Tim: An open annotation ontology for science on web 3.0. Journal of Biomedical Semantics. 2011, 2: S4.
3. Hunter J, Cole T, Sanderson R, Van de Sompel H: The open annotation collaboration: A data model to support sharing and interoperability of scholarly annotations. 2011.
4. Sanderson Robert, Ciccarese Paolo, Van de Sompel Herbert: Designing the W3C open annotation data model. Paper presented at the 5th Annual ACM Web Science Conference. 2013, Paris, France.
5. Kim Jin-Dong, Wang Yue: PubAnnotation: a persistent and sharable corpus and annotation repository. Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. 2012.
6. Verspoor KM, Livingston KM: Towards Adaptation of Linguistic Annotation to Scholarly Annotation Formalisms on the Semantic Web. Paper presented at the Sixth Linguistic Annotation Workshop (LAW VI). 2012, Jeju, Republic of Korea.
7. Groth Paul, Gibson Andrew, Velterop Johannes: The Anatomy of a Nano-publication. Information Services and Use. 2010, 30: 51-56.
8. Mons B, Velterop J: Nano-Publication in the e-science era. 2009.
9. Sernadela Pedro, van der Horst Eelke, Thompson Mark, Lopes Pedro, Roos Marco, Oliveira José Luís: A Nanopublishing Architecture for Biomedical Data. 8th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2014). Edited by: J. Saez-Rodriguez, M.P. Rocha, F. Fdez-Riverola & J.F. De Paz Santana. 2014, Springer International Publishing, 277-84.
10. Kahan J, Koivunen MR, Prud'Hommeaux E, Swick RR: Annotea: An Open RDF Infrastructure for Shared Web Annotations. Computer Networks. 2002, 39: 589-608. 10.1016/S1389-1286(02)00220-7.
11. Fellbaum C: WordNet: An Electronic Lexical Database (Language, Speech, and Communication). 1998, Cambridge, Massachusetts: The MIT Press.
12. Carroll JJ, Bizer C, Hayes P, Stickler P: Named graphs, provenance and trust. 2005.
13. Livingston Kevin, Bada Michael, Hunter Lawrence, Verspoor Karin: Representing annotation compositionality and provenance for the Semantic Web. Journal of Biomedical Semantics. 2013, 4: 38. 10.1186/2041-1480-4-38.
14. Ciccarese Paolo, Ocana Marco, Clark Tim: Open semantic annotation of scientific publications using DOMEO. Journal of Biomedical Semantics. 2012, 3: S1.
15. Patrinos George P, Cooper David N, van Mulligen Erik, Gkantouna Vassiliki, Tzimas Giannis, Tatum Zuotian, Schultes Erik, Roos Marco, Mons Barend: Microattribution and nanopublication as means to incentivize the placement of human genome variation data into the public domain. Human Mutation. 2012, 33: 1503-12. 10.1002/humu.22144.
16. Kuhn Tobias, Barbano Paolo Emilio, Nagy Mate Levente, Krauthammer Michael: Broadening the Scope of Nanopublications. The Semantic Web: Semantics and Big Data. Edited by: P. Cimiano, O. Corcho, V. Presutti, L. Hollink & S. Rudolph. 2013, Springer Berlin Heidelberg, 487-501.
17. Clark Tim, Ciccarese Paolo, Goble Carole: Micropublications: a semantic model for claims, evidence, arguments and annotations in biomedical communications. Journal of Biomedical Semantics. 2014, 5: 28. 10.1186/2041-1480-5-28.
18. Queralt-Rosinach Núria, Kuhn Tobias, Chichester Christine, Dumontier Michel, Sanz Ferran, Furlong Laura Inés: Publishing DisGeNET as Nanopublications. 2014, bioRxiv.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.