A curation pipeline and web-services for PDF documents

André Santos¹, Sérgio Matos¹, David Campos² and José Luís Oliveira¹

¹ DETI/IEETA, University of Aveiro, 3810-193 Aveiro, Portugal
{aleixomatos,andre.jeronimo,jlo}@ua.pt

² BMD Software, 3810-074 Aveiro, Portugal
david.campos@bmd-software.com


Abstract

The continuous growth of the biomedical literature and the need to efficiently find and extract information from its content led to the development of various text mining tools. More recently, these tools started being integrated in user-friendly applications, facilitating their use by expert database curators. However, these tools were mainly designed to extract information from text-based documents, in XML and other formats, while today a considerable part of the biomedical literature is published and distributed in PDF format. To address this limitation, we extended the web-based literature curation tool Egas, adding support for direct document curation and annotation over PDF files, with side-by-side visualization of the original PDF document and of the extracted textual content. Egas’ PDF document processing and text-mining features are supported by a newly developed web-services platform built over Neji, a highly efficient information extraction framework. These web services allow integrating PDF text extraction and annotation capabilities into other tools and text mining pipelines.

1   Introduction

The large amount of information and knowledge continuously produced in the biomedical domain is reflected in the number of published journal articles. In 2015, the bibliographic database MEDLINE contained over 23 million references to journal articles in life sciences, of which 1 million were added in that year (U.S. National Library of Medicine, 2016). At this rate, staying updated with the current knowledge and identifying the most relevant publications and information on a given subject is a very challenging task for researchers.

To facilitate the access to knowledge, several resources started by manually curating scientific articles, extracting and structuring relevant and validated information. However, with the rapid growth of data this task became unfeasible (Yeh et al., 2003; Rebholz-Schuhmann et al., 2005), and automatic information extraction tools were developed and integrated in the curation pipeline in order to accelerate the curation process (Neves and Leser, 2012). This also led to the need to create end-user interfaces to these tools, allowing their use by curators in an efficient manner. The success of the BioCreative Interactive Annotation Task series demonstrates the importance of these efforts (Arighi et al., 2013).

While existing information extraction tools have been shown to achieve robust performance in various tasks, and various literature curation tools have been proposed that make use of such automated methods, they were generally designed to work with plain text or with structured formats such as XML. There is, however, a lack of tools for supporting curation workflows that make use of the Portable Document Format (PDF), which has become one of the most popular file formats for publishing and sharing documents.

We have previously presented Neji (Campos et al., 2013), an open source framework for biomedical concept recognition, and Egas (Campos et al., 2014), a web-based tool for literature curation built with modern web technologies and providing simple inline representation of annotations and user-friendly interaction. In this paper we present new features added to Egas and Neji to support text-mining and curation workflows over PDF documents. In Section 2 we describe Neji’s new PDF processing functionalities and present its
           Figure 1: Neji processing pipeline and modular architecture (Campos et al., 2013)
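
The modular design shown in Figure 1 can be illustrated with a minimal Python sketch, in which independent modules (a sentence splitter, a dictionary matcher and a writer) are executed in sequence over a shared document object. This is not Neji's API: the names `Document`, `run_pipeline`, `make_dictionary_matcher` and `to_standoff`, the naive splitting rule and the toy dictionary with its identifiers are all assumptions made for illustration only.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Document:
    """Shared state passed from one module to the next (hypothetical)."""
    text: str
    sentences: List[Tuple[int, int]] = field(default_factory=list)         # (start, end)
    annotations: List[Tuple[int, int, str]] = field(default_factory=list)  # (start, end, id)


Module = Callable[[Document], Document]


def sentence_splitter(doc: Document) -> Document:
    # Naive rule: a sentence runs up to the next '.', '!' or '?'.
    pattern = r"[^.!?\s][^.!?]*[.!?]?"
    doc.sentences = [(m.start(), m.end()) for m in re.finditer(pattern, doc.text)]
    return doc


def make_dictionary_matcher(dictionary: Dict[str, str]) -> Module:
    """Build a module that tags case-insensitive, whole-word dictionary hits."""
    def matcher(doc: Document) -> Document:
        for term, concept_id in dictionary.items():
            regex = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(regex, doc.text, re.IGNORECASE):
                doc.annotations.append((m.start(), m.end(), concept_id))
        doc.annotations.sort()
        return doc
    return matcher


def run_pipeline(doc: Document, modules: List[Module]) -> Document:
    # Modules are independent of each other and executed strictly in sequence.
    for module in modules:
        doc = module(doc)
    return doc


def to_standoff(doc: Document) -> List[str]:
    """Writer step: one tab-separated, standoff-like line per annotation."""
    return [
        f"T{i}\t{cid} {start} {end}\t{doc.text[start:end]}"
        for i, (start, end, cid) in enumerate(doc.annotations, start=1)
    ]


# Toy dictionary with made-up identifiers.
modules = [sentence_splitter,
           make_dictionary_matcher({"BRCA1": "GENE:1", "mice": "SPECIES:2"})]
result = run_pipeline(Document("BRCA1 is a gene. Mice were used."), modules)
print("\n".join(to_standoff(result)))
```

In this sketch, changing the processing amounts to changing the list passed to `run_pipeline`, which mirrors the idea of combining reader, processing and writer modules according to the task at hand.
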


new web-services platform. These web-services are used by the curation tool for extracting the text from PDF documents and for obtaining automatic concept annotations, and also facilitate the integration of Neji’s functionalities in external text-mining pipelines and tools. Egas is described in Section 3, highlighting the new PDF annotation features, including side-by-side synchronous visualization of the extracted text and of the original PDF, and the display of concept annotations over the PDF document.

2   Neji

Neji is an open source framework for biomedical concept recognition built around four crucial characteristics: modularity, scalability, speed and usability. It follows several state-of-the-art methods for biomedical natural language processing (NLP), namely methods for sentence splitting, tokenization, lemmatization, part-of-speech (POS) tagging, chunking and dependency parsing. The concept recognition tasks are performed using dictionary matching and machine learning techniques with normalization. This framework implements a very flexible and efficient concept tree to store the document annotations, supporting nested and intersected concepts with one or more identifiers. It supports several input and output formats including the most popular ones in biomedical text mining, such as IeXML, PubMed XML, A1, CoNLL and BioC. The architecture of Neji allows users to configure the processing of documents according to their specific objectives and goals, for example by simply combining existing or new modules for reading, processing and writing data, or by selecting the appropriate dictionaries or machine learning models according to the concept types of interest.

Neji has been evaluated on several corpora, covering different concept types (Campos et al., 2013; Campos et al., 2015; Matos et al., 2016). Table 1 shows a summary of the concept identification performance.

Table 1: Neji concept recognition results on a variety of corpora and concept types. D: Dictionary; ML: Machine-Learning

Corpus               Concept type                                 F-score   Method
CRAFT                Species                                      95%       D
                     Cell                                         92%       D
                     Gene and Protein                             76%       ML
                     Chemicals                                    65%       D
                     Cellular Component                           83%       D
                     Biological Process and Molecular Function    63%       D
NCBI Disease         Disorders                                    85%       D
AnEM                 Anatomy                                      82%       D
BC II Gene Mention   Gene and Protein                             87%       ML
tmVar                Genetic Variants                             86%       ML
BC IV CHEMDNER       Chemicals                                    87%       ML

2.1   Pipeline and modules

The main component of Neji is the processing pipeline (Figure 1), a series of independent modules that are executed sequentially, each of them responsible for a specific processing task. We used Monq.jfa¹, a library for fast and flexible text filtering with regular expressions, to implement each pipeline module as a custom deterministic finite automaton (DFA) with specific rules and actions.

¹ http://www.pifpafpuf.de/Monq.jfa/

2.1.1   Handling PDF files

Thanks to Neji’s modular architecture, adding PDF processing capabilities only required the implementation of a new reader module. For this, we integrated LA-PDFText (Ramakrishnan et al., 2012), a state-of-the-art open-source tool for handling PDF documents. LA-PDFText makes use of a carefully crafted set of rules defined on the business rules management system Drools, which allows it to correctly handle different PDF layouts such as one column, two columns and mixed layouts. This feature also allows defining different sets of rules for specific PDF layouts if necessary, and we therefore included in the new Neji reader an optional parameter for specifying such a rule set.

In order to evaluate the text extraction quality, we obtained the original PDF documents corresponding to the 67 full-text articles that compose the CRAFT corpus (Bada et al., 2012), and compared the text extracted by LA-PDFText, through our processing pipeline, to the distributed text contents, which were extracted from XML files. For these articles, published in 21 different journals and having distinct layouts, we obtained an exact match in 90% of the extracted sentences.

Apart from extracting the text, which is sufficient for running the processing pipeline, we added further capabilities to the reader in order to make use of PDF processing in the curation tool Egas. Namely, we apply sentence splitting to the extracted chunks of text, and extract the position of each sentence in each page to allow aligning and navigating between the plain text and PDF views in the user interface. This information is associated with each sentence and carried over to the remaining modules in the pipeline. A new writer module was also implemented that exports this extended information in JSON format, for simple reuse in external tools.

2.2   Web-services

Neji web-services are intended to facilitate the use of and access to Neji functionalities by providing a simple RESTful API. This API allows developers to send their input documents and receive the plain text extracted from the submitted PDF file, as well as annotation results in various well-known formats, including standoff (A1) (Kim et al., 2009; Stenetorp et al., 2012) and BioC (Comeau et al., 2013).

Different annotation services can be configured in the platform, where a service is an annotation pipeline with a custom set of resources (dictionaries and ML models) and processing properties. This provides a way to easily manage concurrent annotation services, allowing the configuration of the properties and resources of each of them independently. Additionally, resources are loaded into memory as soon as a new service is created. Since loading is usually an expensive step, especially for large ML models, having the resources in memory greatly reduces the total annotation time.

3   Egas

Egas is a web-based platform for biomedical text mining and collaborative curation that supports inline annotation of concept occurrences and of relations between these concepts. Annotations can be performed automatically, using the available services for automatic concept and relation identification, or manually, wherein a user can add new annotations and also edit or remove automatically generated annotations. The results can then be exported to various standard annotation formats.

To adapt Egas to support literature curation over PDF documents, we integrated PDF text extraction using the Neji web services RESTful API and adapted the interface for side-by-side visualization of the extracted text alongside the original PDF
                                Figure 2: Egas PDF annotation interface
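
Navigating between the two panels of Figure 2 depends on each sentence carrying both its character offsets in the extracted text and its position in the PDF. The following minimal Python sketch shows one way such a two-way lookup could work. The field names (`start`, `end`, `page`, `top`), the convention that `top` grows downwards from the top of the page, and the sample values are hypothetical; they do not reproduce the JSON exported by Neji or the data structure used by Egas.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SentenceInfo:
    start: int   # offset of the first character in the extracted text
    end: int     # offset one past the last character
    page: int    # 1-based PDF page number
    top: float   # vertical position on the page (assumed to grow downwards)


class SentenceIndex:
    """Two-way lookup between text offsets and PDF positions."""

    def __init__(self, sentences: List[SentenceInfo]) -> None:
        self._by_offset = sorted(sentences, key=lambda s: s.start)
        self._starts = [s.start for s in self._by_offset]

    def text_to_pdf(self, offset: int) -> Optional[SentenceInfo]:
        # Binary search for the last sentence starting at or before the offset.
        i = bisect_right(self._starts, offset) - 1
        if i >= 0 and offset < self._by_offset[i].end:
            return self._by_offset[i]
        return None  # the offset falls between sentences

    def pdf_to_text(self, page: int, y: float) -> Optional[SentenceInfo]:
        # Closest sentence on the page that starts at or above position y.
        candidates = [s for s in self._by_offset if s.page == page and s.top <= y]
        return max(candidates, key=lambda s: s.top, default=None)


index = SentenceIndex([
    SentenceInfo(start=0, end=16, page=1, top=100.0),
    SentenceInfo(start=17, end=32, page=1, top=112.0),
    SentenceInfo(start=33, end=60, page=2, top=40.0),
])
hit = index.text_to_pdf(20)        # text panel -> PDF panel
print(hit.page, hit.top)
back = index.pdf_to_text(1, 115.0)  # PDF panel -> text panel
print(back.start, back.end)
```

A user interface could call `text_to_pdf` to decide where to scroll the PDF view, and `pdf_to_text` to find which sentence to highlight in the text view.
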


document, allowing navigation between both zones by synchronizing the text annotation area and the PDF visualization area.

Egas’ file import web services were also extended to support PDF files. As with the remaining file formats, this web service is responsible for receiving the file, extracting the text content using Neji’s PDF processing feature as described above, and creating the whole data structure to support document annotations. This structure also includes sentence information retrieved from Neji, such as the start and end indexes of each sentence with respect to the extracted plain text and its position within the PDF page, allowing synchronous scrolling and navigation between the plain text and PDF views.

Figure 2 shows Egas’ user interface for PDF annotation. The original PDF document is displayed on the right-side panel, while the left panel shows the annotation panel with the extracted text, allowing annotation using the same simple interactions as for other document formats, as described by Campos et al. (2014). As can be seen in the figure, concept annotations added by the automatic annotation services or by the curator are displayed on the plain text as well as on the PDF document. Additionally, a tooltip with information associated with each annotation is shown when hovering the mouse over the annotation on either panel. Clicking a sentence number on the annotation panel scrolls the PDF document accordingly and briefly highlights the corresponding sentence to facilitate its identification. Conversely, double-clicking a sentence on the PDF scrolls the text on the annotation panel and highlights the corresponding sentence.

4   Conclusions

Assisted literature curation tools, based on text mining and information extraction methods, are increasingly being used by curation teams, helping to expedite their tasks. However, there is a lack of tools that support direct annotation of PDF documents, which is a very common format for the scientific literature and other document types, such as patents. We present a new feature of Egas that allows direct document curation and annotation over PDF files, with side-by-side visualization of the original PDF document and of the extracted textual content. By combining the user-friendliness of Egas with the possibility of reading the document in a very familiar format such as PDF, we provide a more convenient and agreeable literature curation environment, which could contribute to improved efficiency.

References

Cecilia N Arighi, Ben Carterette, K Bretonnel Cohen, Martin Krallinger, W John Wilbur, Petra Fey, Robert Dodson, Laurel Cooper, Ceri E Van Slyke, Wasila Dahdul, et al. 2013. An overview of the BioCreative 2012 workshop track III: interactive text mining task. Database, 2013:bas056.

Michael Bada, Miriam Eckert, Donald Evans, Kristin Garcia, Krista Shipley, Dmitry Sitnikov, William A Baumgartner, K Bretonnel Cohen, Karin Verspoor, Judith A Blake, et al. 2012. Concept annotation in the CRAFT corpus. BMC Bioinformatics, 13(1):1.

David Campos, Sérgio Matos, and José Luís Oliveira. 2013. A modular framework for biomedical concept recognition. BMC Bioinformatics, 14(1):281.

David Campos, Jóni Lourenço, Sérgio Matos, and José Luís Oliveira. 2014. Egas: a collaborative and interactive document curation platform. Database: the journal of biological databases and curation, 2014.

David Campos, Sérgio Matos, and José L Oliveira. 2015. A document processing pipeline for annotating chemical entities in scientific documents. Journal of Cheminformatics, 7(1):1.

Donald C Comeau, Rezarta Islamaj Doğan, Paolo Ciccarese, Kevin Bretonnel Cohen, Martin Krallinger, Florian Leitner, Zhiyong Lu, Yifan Peng, Fabio Rinaldi, Manabu Torii, et al. 2013. BioC: a minimalist approach to interoperability for biomedical text processing. Database, 2013:bat064.

Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Jun’ichi Tsujii. 2009. Overview of BioNLP’09 shared task on event extraction. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, pages 1–9. Association for Computational Linguistics.

Sérgio Matos, David Campos, Renato Pinho, Raquel M Silva, Matthew Mort, David N Cooper, and José Luís Oliveira. 2016. Mining clinical attributes of genomic variants through assisted literature curation in Egas. Database, 2016:baw096.

Mariana Neves and Ulf Leser. 2012. A survey on annotation tools for the biomedical literature. Briefings in Bioinformatics, page bbs084.

Cartic Ramakrishnan, Abhishek Patnia, Eduard Hovy, and Gully APC Burns. 2012. Layout-aware text extraction from full-text PDF of scientific articles. Source Code for Biology and Medicine, 7(1):7.

Dietrich Rebholz-Schuhmann, Harald Kirsch, and Francisco Couto. 2005. Facts from text: is text mining ready to deliver? PLoS Biol, 3(2):e65.

Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun’ichi Tsujii. 2012. BRAT: a web-based tool for NLP-assisted text annotation. Pages 102–107.

U.S. National Library of Medicine. 2016. Detailed Indexing Statistics: 1965-2015.

Alexander S Yeh, Lynette Hirschman, and Alexander A Morgan. 2003. Evaluation of text data mining for database curation: lessons learned from the KDD challenge cup. Bioinformatics, 19(suppl 1):i331–i339.