=Paper=
{{Paper
|id=Vol-1650/smbm16Santos
|storemode=property
|title=A Curation Pipeline and Web-Services for PDF Documents
|pdfUrl=https://ceur-ws.org/Vol-1650/smbm16Santos.pdf
|volume=Vol-1650
|authors=André Santos,Sérgio Matos,David Campos,José Luís Oliveira
|dblpUrl=https://dblp.org/rec/conf/smbm/SantosMCO16
}}
==A Curation Pipeline and Web-Services for PDF Documents==
André Santos (1), Sérgio Matos (1), David Campos (2) and José Luís Oliveira (1)

(1) DETI/IEETA, University of Aveiro, 3810-193 Aveiro, Portugal, {aleixomatos,andre.jeronimo,jlo}@ua.pt

(2) BMD Software, 3810-074 Aveiro, Portugal, david.campos@bmd-software.com
===Abstract===
The continuous growth of the biomedical literature and the need to efficiently find and extract information from its content led to the development of various text mining tools. More recently, these tools started being integrated in user-friendly applications, facilitating their use by expert database curators. However, these tools were mainly designed to extract information from text-based documents, in XML and other formats, while today a considerable part of the biomedical literature is published and distributed in PDF format. To address this limitation, we extended the web-based literature curation tool Egas, adding support for direct document curation and annotation over PDF files, with side-by-side visualization of the original PDF document and of the extracted textual content. Egas’ PDF document processing and text-mining features are supported by a newly developed web-services platform built over Neji, a highly efficient information extraction framework. These web services allow integrating PDF text extraction and annotation capabilities into other tools and text mining pipelines.

===1 Introduction===
The large amount of information and knowledge continuously produced in the biomedical domain is reflected in the number of published journal articles. In 2015, the bibliographic database MEDLINE contained over 23 million references to journal articles in life sciences, of which 1 million were added in that year (U.S. National Library of Medicine, 2016). At this rate, staying updated with the current knowledge and identifying the most relevant publications and information on a given subject is a very challenging task for researchers.

To facilitate the access to knowledge, several resources started by manually curating scientific articles, extracting and structuring relevant and validated information. However, with the rapid growth of data this task became unfeasible (Yeh et al., 2003; Rebholz-Schuhmann et al., 2005), and automatic information extraction tools were developed and integrated in the curation pipeline in order to accelerate the curation process (Neves and Leser, 2012). This also led to the need to create end-user interfaces to these tools, allowing their use by curators in an efficient manner. The success of the BioCreative Interactive Annotation Task series demonstrates the importance of these efforts (Arighi et al., 2013).

While existing information extraction tools have been shown to achieve robust performance in various tasks, and various literature curation tools have been proposed that make use of such automated methods, they were generally designed to work with plain text or with structured formats such as XML. There is however a lack of tools for supporting curation workflows that make use of the Portable Document Format (PDF), which has become one of the most popular file formats for publishing and sharing documents.

We have previously presented Neji (Campos et al., 2013), an open source framework for biomedical concept recognition, and Egas (Campos et al., 2014), a web-based tool for literature curation built with modern web technologies and providing simple inline representation of annotations and user-friendly interaction. In this paper we present new features added to Egas and Neji to support text-mining and curation workflows over PDF documents. In Section 2 we describe Neji’s new PDF processing functionalities and present its
new web-services platform. These web-services are used by the curation tool for extracting the text from PDF documents and for obtaining automatic concept annotations, and also facilitate the integration of Neji’s functionalities in external text-mining pipelines and tools. Egas is described in Section 3, highlighting the new PDF annotation features including side-by-side synchronous visualization of the extracted text and of the original PDF, and also the display of concept annotations over the PDF document.

===2 Neji===
Neji is an open source framework for biomedical concept recognition built around four crucial characteristics: modularity, scalability, speed and usability. It follows several state-of-the-art methods for biomedical natural language processing (NLP), namely methods for sentence splitting, tokenization, lemmatization, part-of-speech (POS) tagging, chunking and dependency parsing. The concept recognition tasks are performed using dictionary matching and machine learning techniques with normalization. This framework implements a very flexible and efficient concept tree to store the document annotations, supporting nested and intersected concepts with one or more identifiers. It supports several input and output formats including the most popular ones in biomedical text mining, such as IeXML, Pubmed XML, A1, CONLL and BioC. The architecture of Neji allows users to configure the processing of documents according to their specific objectives and goals, for example by simply combining existing or new modules for reading, processing and writing data, or by selecting the appropriate dictionaries or machine learning models according to the concept types of interest.

Neji has been evaluated on several corpora, covering different concept types (Campos et al., 2013; Campos et al., 2015; Matos et al., 2016). Table 1 shows a summary of the concept identification performance.

{| class="wikitable"
|+ Table 1: Neji concept recognition results on a variety of corpora and concept types. D: Dictionary; ML: Machine-Learning
! Corpus !! Concept type !! F-score !! Method
|-
| CRAFT || Species || 95% || D
|-
| CRAFT || Cell || 92% || D
|-
| CRAFT || Gene and Protein || 76% || ML
|-
| CRAFT || Chemicals || 65% || D
|-
| CRAFT || Cellular Component || 83% || D
|-
| CRAFT || Biological Process and Molecular Function || 63% || D
|-
| NCBI Disease || Disorders || 85% || D
|-
| Anem || Anatomy || 82% || D
|-
| BC II Gene Mention || Gene and Protein || 87% || ML
|-
| tmVar || Genetic Variants || 86% || ML
|-
| BC IV ChemdNER || Chemicals || 87% || ML
|}

====2.1 Pipeline and modules====
The main component of Neji is the processing pipeline (Figure 1), a series of independent modules that are executed sequentially, each of them responsible for a specific processing task. We used Monq.jfa (http://www.pifpafpuf.de/Monq.jfa/), a library for fast and flexible text filtering with regular expressions, to implement each pipeline module as a custom deterministic finite automaton (DFA) with specific rules and actions.

Figure 1: Neji processing pipeline and modular architecture (Campos et al., 2013)

=====2.1.1 Handling PDF files=====
Thanks to Neji’s modular architecture, adding PDF processing capabilities only required the implementation of a new reader module. For this, we integrated LA-PDFText (Ramakrishnan et al., 2012), a state-of-the-art open-source tool for handling PDF documents. LA-PDFText makes use of a carefully crafted set of rules defined on the business rules management system DROOLS, allowing it to correctly handle different PDF layouts such as one column, two columns and mixed layouts. This feature also allows defining different sets of rules for specific PDF layouts if necessary, and we therefore included in the new Neji reader an optional parameter for this.

In order to evaluate the text extraction quality, we obtained the original PDF documents corresponding to the 67 full-text articles that compose the CRAFT corpus (Bada et al., 2012), and compared the text extracted by LA-PDFText, through our processing pipeline, to the distributed text contents, which were extracted from XML files. For these articles, published in 21 different journals and having distinct layouts, we obtained an exact match in 90% of the extracted sentences.

Apart from extracting the text, which is sufficient for running the processing pipeline, we added further capabilities to the reader, in order to make use of PDF processing in the curation tool Egas. Namely, we apply sentence splitting to the extracted chunks of text, and extract the position of each sentence in each page to allow aligning and navigating between the plain text and PDF views in the user interface. This information is associated with each sentence and carried over to the remaining modules in the pipeline. A new writer module was also implemented that exports this extended information in JSON format, for simple reuse in external tools.

====2.2 Web-services====
Neji web-services are intended to facilitate the use of, and access to, Neji functionalities by providing a simple RESTful API that allows developers to send their input documents and receive the plain text extracted from the submitted PDF file and also annotation results in various well-known formats, including standoff (A1) (Kim et al., 2009; Stenetorp et al., 2012) and BioC (Comeau et al., 2013).

Different annotation services can be configured in the platform, in which a service is an annotation pipeline with a custom set of resources (dictionaries and ML models) and processing properties. This provides a way to easily manage concurrent annotation services, allowing the configuration of the properties and resources of each of them independently. Additionally, resources are loaded into memory as soon as a new service is created. Since this usually is an expensive step, especially for large ML models, having the resources in memory greatly reduces the total annotation time.

===3 Egas===
Egas is a web-based platform for biomedical text mining and collaborative curation that supports inline annotation of concept occurrences and of relations between these concepts. Annotations can be performed automatically, using the available services for automatic concept and relation identification, or manually, wherein a user can add new annotations and also edit or remove automatically generated annotations. The results can then be exported to various standard annotation formats.

To adapt Egas to support literature curation over PDF documents, we integrated PDF text extraction using the Neji web services RESTful API and adapted the interface for side-by-side visualization of the extracted text alongside the original PDF
document, allowing navigation between both zones and synchronizing the text annotation area and the PDF visualization area.

Egas’ file import web services were also extended to support PDF files. As with the remaining file formats, this web service is responsible for receiving the file, extracting the text content using Neji’s PDF processing feature as described above, and creating the whole data structure to support document annotations. This structure also includes sentence information retrieved from Neji, such as the start and end indexes, with respect to the extracted plain text, and its position within the PDF page, allowing synchronous scrolling and navigation between the plain text and PDF views.

Figure 2 shows Egas’ user interface for PDF annotation. The original PDF document is displayed on the right-side panel, while the left panel shows the annotation panel with the extracted text, allowing annotation using the same simple interactions as for other document formats, as described in Campos et al. (2014). As can be seen in the figure, concept annotations added by the automatic annotation services or by the curator are displayed on the plain text as well as on the PDF document. Additionally, a tooltip with information associated with each annotation is shown when hovering the mouse over the annotation on either panel. By clicking a sentence number on the annotation panel, the PDF document is scrolled accordingly, and the corresponding sentence is briefly highlighted to facilitate its identification. Conversely, double-clicking a sentence on the PDF scrolls the text on the annotation panel and highlights the corresponding sentence.

Figure 2: Egas PDF annotation interface

===4 Conclusions===
Assisted literature curation tools, based on text mining and information extraction methods, are increasingly being used by curation teams, helping to expedite their tasks. However, there is a lack of tools that support direct annotation of PDF documents, which is a very common format for the scientific literature and other document types, such as patents. We present a new feature of Egas that allows direct document curation and annotation over PDF files, with side-by-side visualization of the original PDF document and of the extracted textual content. By aligning the user-friendliness of Egas with the possibility of reading the document in a very familiar format such as PDF, we provide a more convenient and agreeable literature curation environment, which could contribute to improved efficiency.

===References===
Cecilia N Arighi, Ben Carterette, K Bretonnel Cohen, Martin Krallinger, W John Wilbur, Petra Fey, Robert Dodson, Laurel Cooper, Ceri E Van Slyke, Wasila Dahdul, et al. 2013. An overview of the BioCreative 2012 Workshop Track III: interactive text mining task. Database, 2013:bas056.

Michael Bada, Miriam Eckert, Donald Evans, Kristin Garcia, Krista Shipley, Dmitry Sitnikov, William A Baumgartner, K Bretonnel Cohen, Karin Verspoor, Judith A Blake, et al. 2012. Concept annotation in the CRAFT corpus. BMC Bioinformatics, 13(1):1.

David Campos, Sérgio Matos, and José Luís Oliveira. 2013. A modular framework for biomedical concept recognition. BMC Bioinformatics, 14(1):281.

David Campos, Jóni Lourenço, Sérgio Matos, and José Luís Oliveira. 2014. Egas: a collaborative and interactive document curation platform. Database: the journal of biological databases and curation, 2014.

David Campos, Sérgio Matos, and José L Oliveira. 2015. A document processing pipeline for annotating chemical entities in scientific documents. Journal of Cheminformatics, 7(1):1.

Donald C Comeau, Rezarta Islamaj Doğan, Paolo Ciccarese, Kevin Bretonnel Cohen, Martin Krallinger, Florian Leitner, Zhiyong Lu, Yifan Peng, Fabio Rinaldi, Manabu Torii, et al. 2013. BioC: a minimalist approach to interoperability for biomedical text processing. Database, 2013:bat064.

Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Jun’ichi Tsujii. 2009. Overview of BioNLP’09 shared task on event extraction. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, pages 1–9. Association for Computational Linguistics.

Sérgio Matos, David Campos, Renato Pinho, Raquel M Silva, Matthew Mort, David N Cooper, and José Luís Oliveira. 2016. Mining clinical attributes of genomic variants through assisted literature curation in Egas. Database, 2016:baw096.

Mariana Neves and Ulf Leser. 2012. A survey on annotation tools for the biomedical literature. Briefings in Bioinformatics, page bbs084.

Cartic Ramakrishnan, Abhishek Patnia, Eduard Hovy, and Gully APC Burns. 2012. Layout-aware text extraction from full-text PDF of scientific articles. Source Code for Biology and Medicine, 7(1):7.

Dietrich Rebholz-Schuhmann, Harald Kirsch, and Francisco Couto. 2005. Facts from text: is text mining ready to deliver? PLoS Biol, 3(2):e65.

Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun’ichi Tsujii. 2012. BRAT: a web-based tool for NLP-assisted text annotation. Pages 102–107.

U.S. National Library of Medicine. 2016. Detailed Indexing Statistics: 1965-2015.

Alexander S Yeh, Lynette Hirschman, and Alexander A Morgan. 2003. Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup. Bioinformatics, 19(suppl 1):i331–i339.
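
As an illustration of the sentence-level alignment described in Sections 2.1.1 and 3 (start and end indexes in the extracted plain text, plus a position within a PDF page, exported as JSON), the following is a minimal Python sketch. It is not Neji’s or Egas’ actual code or schema: the class `SentenceRecord`, its field names, and the helper functions `to_json`, `sentence_at` and `scroll_target` are all hypothetical and chosen only for this example.

```python
import json
from bisect import bisect_right
from dataclasses import dataclass, asdict
from typing import List, Optional, Tuple


@dataclass
class SentenceRecord:
    """One extracted sentence. All field names are hypothetical."""
    index: int   # sentence number shown in the text panel
    start: int   # start offset in the extracted plain text (inclusive)
    end: int     # end offset in the extracted plain text (exclusive)
    page: int    # 1-based PDF page number
    bbox: Tuple[float, float, float, float]  # x, y, width, height on the page


def to_json(records: List[SentenceRecord]) -> str:
    """Serialize the records, as a JSON writer module might do."""
    return json.dumps([asdict(r) for r in records])


def sentence_at(records: List[SentenceRecord], offset: int) -> Optional[SentenceRecord]:
    """Return the sentence whose [start, end) span contains a text offset.

    Assumes records are sorted by start offset and do not overlap.
    """
    starts = [r.start for r in records]
    i = bisect_right(starts, offset) - 1
    if i >= 0 and records[i].start <= offset < records[i].end:
        return records[i]
    return None


def scroll_target(records: List[SentenceRecord], index: int) -> Optional[Tuple[int, float]]:
    """Return (page, y) that a PDF view could scroll to for a sentence number."""
    for r in records:
        if r.index == index:
            return r.page, r.bbox[1]
    return None


# Made-up sample data: three sentences spread over two pages.
records = [
    SentenceRecord(0, 0, 20, 1, (50.0, 700.0, 200.0, 12.0)),
    SentenceRecord(1, 21, 48, 1, (50.0, 680.0, 210.0, 12.0)),
    SentenceRecord(2, 49, 80, 2, (50.0, 720.0, 190.0, 12.0)),
]

if __name__ == "__main__":
    print(to_json(records))
    print(sentence_at(records, 25))
    print(scroll_target(records, 2))
```

Under these assumptions, `scroll_target` corresponds to clicking a sentence number in the text panel, while `sentence_at` corresponds to the reverse lookup from a text position to its sentence. A sorted list with binary search is used here only because sentence spans are ordered and non-overlapping; any other index would serve equally well.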