<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A curation pipeline and web-services for PDF documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>André Santos</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sérgio Matos</string-name>
          <email>aleixomatos@ua.pt</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Campos</string-name>
          <email>david.campos@bmd-software.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José Luís Oliveira</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>BMD Software</institution>
          ,
          <addr-line>3810-074 Aveiro</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DETI/IEETA, University of Aveiro</institution>
          ,
          <addr-line>3810-193 Aveiro</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The continuous growth of the biomedical literature and the need to efficiently find and extract information from its content led to the development of various text mining tools. More recently, these tools have been integrated into user-friendly applications that facilitate their use by expert database curators. However, these tools were mainly designed to extract information from text-based documents, in XML and other formats, while today a considerable part of the biomedical literature is published and distributed in PDF format. To address this limitation, we extended the web-based literature curation tool Egas, adding support for direct document curation and annotation over PDF files, with side-by-side visualization of the original PDF document and of the extracted textual content. Egas' PDF document processing and text-mining features are supported by a newly developed web-services platform built over Neji, a highly efficient information extraction framework. These web services allow PDF text extraction and annotation capabilities to be integrated into other tools and text mining pipelines.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The large amount of information and knowledge
continuously produced in the biomedical domain
is reflected in the number of published journal
articles. In 2015, the bibliographic database
MEDLINE contained over 23 million references to
journal articles in life sciences, of which 1 million
were added in that year
        <xref ref-type="bibr" rid="ref13">(U.S. National Library of
Medicine, 2016)</xref>
        . At this rate, staying updated
with the current knowledge and identifying the
most relevant publications and information on a
given subject is a very challenging task for
researchers.
      </p>
      <p>
        To facilitate the access to knowledge, several
resources started by manually curating scientific
articles, extracting and structuring relevant and
validated information. However, with the rapid
growth of data this task became unfeasible
        <xref ref-type="bibr" rid="ref11 ref14">(Yeh
et al., 2003; Rebholz-Schuhmann et al., 2005)</xref>
        , and
automatic information extraction tools were
developed and integrated in the curation pipeline in
order to accelerate the curation process
        <xref ref-type="bibr" rid="ref9">(Neves and
Leser, 2012)</xref>
        . This also led to the need to
create end-user interfaces for these tools, allowing
curators to use them efficiently. The
success of the BioCreative Interactive Annotation
Task series demonstrates the importance of these
efforts
        <xref ref-type="bibr" rid="ref1">(Arighi et al., 2013)</xref>
        .
      </p>
      <p>While existing information extraction tools
have been shown to achieve robust performance in
various tasks, and various literature curation tools
have been proposed that make use of such
automated methods, they were generally designed to
work with plain text or with structured formats
such as XML. There is, however, a lack of tools
for supporting curation workflows that make use
of the Portable Document Format (PDF), which
has become one of the most popular file formats
for publishing and sharing documents.</p>
      <p>
        We have previously presented Neji
        <xref ref-type="bibr" rid="ref3">(Campos et
al., 2013)</xref>
        , an open source framework for
biomedical concept recognition, and Egas
        <xref ref-type="bibr" rid="ref4">(Campos et
al., 2014)</xref>
        , a web-based tool for literature
curation built with modern web technologies and
providing simple inline representation of annotations
and user-friendly interaction. In this paper we
present new features added to Egas and Neji to
support text-mining and curation workflows over
PDF documents. In Section 2 we describe Neji’s
new PDF processing functionalities and present its
new web-services platform. These web-services
are used by the curation tool for extracting the text
from PDF documents and for obtaining automatic
concept annotations, and also facilitate the
integration of Neji’s functionalities in external
text-mining pipelines and tools. Egas is described in
Section 3, highlighting the new PDF annotation
features including side-by-side synchronous
visualization of the extracted text and of the original
PDF, and also the display of concept annotations
over the PDF document.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Neji</title>
      <p>Neji is an open source framework for
biomedical concept recognition built around four
crucial characteristics: modularity, scalability, speed
and usability. It integrates several state-of-the-art
methods for biomedical natural language
processing (NLP), namely methods for sentence
splitting, tokenization, lemmatization, part-of-speech (POS) tagging, chunking
and dependency parsing. Concept recognition
is performed using dictionary matching and
machine learning techniques, with normalization.
The framework implements a flexible and
efficient concept tree to store the document
annotations, supporting nested and intersecting concepts
with one or more identifiers. It supports several
input and output formats including the most popular
ones in biomedical text mining, such as IeXML,
PubMed XML, A1, CoNLL and BioC. The
architecture of Neji allows users to configure the
processing of documents according to their specific
objectives and goals, for example by simply
combining existing or new modules for reading,
processing and writing data, or by selecting the
appropriate dictionaries or machine learning models
according to the concept types of interest.</p>
      <p>
        Neji has been evaluated on several corpora,
covering different concept types
        <xref ref-type="bibr" rid="ref3 ref5 ref8">(Campos et al., 2013;
Campos et al., 2015; Matos et al., 2016)</xref>
        . Table
1 shows a summary of the concept identification
performance.
      </p>
      <sec id="sec-2-1">
        <title>2.1 Pipeline and modules</title>
        <p>The main component of Neji is the processing
pipeline (Figure 1), a series of independent
modules, each of them responsible for a specific
processing task, that are executed sequentially. We
used Monq.jfa<sup>1</sup>, a library for fast and flexible text
filtering with regular expressions, to implement
each pipeline module as a custom deterministic
finite automaton (DFA) with specific rules and
actions.</p>
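        <p>As an illustration of this design, the following Python sketch chains independent modules that are executed in order over a shared document. It is only a sketch under stated assumptions, not Neji's code: the function names (split_sentences, tag_dictionary, run_pipeline), the dictionary entries and the document layout are hypothetical, and plain functions stand in for the Monq.jfa automata.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re
from typing import Any, Callable, Dict, List

# A document is passed between modules as a plain dict (an assumption of this sketch).
Document = Dict[str, Any]
Module = Callable[[Document], Document]


def split_sentences(doc: Document) -> Document:
    """Naive sentence splitting on '.', '!' or '?' followed by whitespace."""
    text = doc["text"].strip()
    doc["sentences"] = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    return doc


def tag_dictionary(doc: Document) -> Document:
    """Toy dictionary matcher: records (sentence index, term, concept type)."""
    dictionary = {"BRCA1": "Gene", "aspirin": "Chemical"}  # hypothetical entries
    annotations = []
    for index, sentence in enumerate(doc["sentences"]):
        for term, concept in dictionary.items():
            if term in sentence:
                annotations.append((index, term, concept))
    doc["annotations"] = annotations
    return doc


def run_pipeline(doc: Document, modules: List[Module]) -> Document:
    """Execute the independent modules one after another, in the given order."""
    for module in modules:
        doc = module(doc)
    return doc


result = run_pipeline(
    {"text": "BRCA1 is a gene. Patients received aspirin."},
    [split_sentences, tag_dictionary],
)
print(result["annotations"])
```
]]></preformat>
        <p>Because each module only reads and extends the shared document, modules can be added, removed or reordered without changing the others.</p>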
      </sec>
      <sec id="sec-2-2">
        <title>2.1.1 Handling PDF files</title>
        <p>
          Thanks to Neji’s modular architecture, adding
PDF processing capabilities only required the
implementation of a new reader module. For this,
we integrated LA-PDFText
          <xref ref-type="bibr" rid="ref10">(Ramakrishnan et al.,
2012)</xref>
          , a state-of-the-art open-source tool for
handling PDF documents. LA-PDFText makes use of
a carefully crafted set of rules defined on the
business rules management system DROOLS,
allowing it to correctly handle different PDF layouts, such
as one-column, two-column and mixed layouts.
This feature also allows defining different sets of
rules for specific PDF layouts if necessary, and we
therefore included an optional parameter for this
in the new Neji reader.
          <fn id="fn1">
            <label>1</label>
            <p>http://www.pifpafpuf.de/Monq.jfa/</p>
          </fn>
        </p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Corpora and concept types used to evaluate Neji's concept identification performance.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Corpus</th><th>Concept type</th></tr>
            </thead>
            <tbody>
              <tr><td rowspan="6">CRAFT</td><td>Species</td></tr>
              <tr><td>Cell</td></tr>
              <tr><td>Gene and Protein</td></tr>
              <tr><td>Chemicals</td></tr>
              <tr><td>Cellular Component</td></tr>
              <tr><td>Biological Process and Molecular Function</td></tr>
              <tr><td>NCBI Disease</td><td>Disorders</td></tr>
              <tr><td>AnEM</td><td>Anatomy</td></tr>
              <tr><td>BC II Gene Mention</td><td>Gene and Protein</td></tr>
              <tr><td>tmVar</td><td>Genetic Variants</td></tr>
              <tr><td>BC IV ChemdNER</td><td>Chemicals</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-2-8">
        <p>
          In order to evaluate the text extraction quality,
we obtained the original PDF documents
corresponding to the 67 full-text articles that compose
the CRAFT corpus
          <xref ref-type="bibr" rid="ref2">(Bada et al., 2012)</xref>
          , and
compared the text extracted by LA-PDFText, through
our processing pipeline, to the distributed text
contents, which were extracted from XML files. For
these articles, published in 21 different journals
and having distinct layouts, we obtained an exact
match in 90% of the extracted sentences.
        </p>
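        <p>The text does not specify how sentences were compared, so the following Python sketch shows only one plausible way to compute such an exact-match rate. The function names and the whitespace normalization step are assumptions made for illustration, and the example sentences are invented.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Iterable


def normalize(sentence: str) -> str:
    """Collapse runs of whitespace so line-wrapping differences are ignored."""
    return " ".join(sentence.split())


def exact_match_ratio(extracted: Iterable[str], reference: Iterable[str]) -> float:
    """Fraction of extracted sentences found verbatim among the reference sentences."""
    reference_set = {normalize(s) for s in reference}
    extracted_list = [normalize(s) for s in extracted]
    if not extracted_list:
        return 0.0
    matches = sum(1 for s in extracted_list if s in reference_set)
    return matches / len(extracted_list)


# Invented toy data: the third extracted sentence keeps a hyphenation artifact.
reference = ["Neji is fast.", "Egas is web-based.", "PDF is common.", "Text is mined."]
extracted = ["Neji is fast.", "Egas is  web-based.", "PDF is com- mon.", "Text is mined."]
print(exact_match_ratio(extracted, reference))
```
]]></preformat>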
        <p>Apart from extracting the text, which is
sufficient for running the processing pipeline, we
added further capabilities to the reader in
order to make use of PDF processing in the curation
tool Egas. Namely, we apply sentence splitting to
the extracted chunks of text, and extract the
position of each sentence in each page to allow
aligning and navigating between the plain text and PDF
views in the user interface. This information is
associated with each sentence and carried over to the
remaining modules in the pipeline. A new writer
module was also implemented that exports this
extended information in JSON format, for simple
reuse in external tools.</p>
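        <p>The kind of per-sentence record described above could be sketched in Python as follows. The actual JSON schema written by Neji is not given in this paper, so the class name SentencePosition and all field names are hypothetical; the sketch only shows text offsets and page coordinates travelling together and being serialized as JSON.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class SentencePosition:
    """One extracted sentence plus where it sits in the text and on the PDF page."""
    start: int     # offset of the first character in the extracted plain text
    end: int       # offset one past the last character
    page: int      # 1-based page number in the PDF
    x: float       # left edge of the sentence's bounding box, in page units
    y: float       # top edge of the bounding box
    width: float
    height: float
    text: str


def export_sentences(sentences: List[SentencePosition]) -> str:
    """Serialize the sentence records as a JSON array."""
    return json.dumps([asdict(s) for s in sentences], indent=2)


sentences = [
    SentencePosition(0, 13, 1, 72.0, 90.5, 200.0, 12.0, "Neji is fast."),
    SentencePosition(14, 32, 1, 72.0, 104.5, 230.0, 12.0, "Egas is web-based."),
]
print(export_sentences(sentences))
```
]]></preformat>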
      </sec>
      <sec id="sec-2-9">
        <title>2.2 Web-services</title>
        <p>
          Neji web-services are intended to facilitate
access to Neji’s functionalities by
providing a simple RESTful API that allows developers
to send their input documents and receive the plain
text extracted from a submitted PDF file, as well as
annotation results in various well-known formats,
including standoff (A1)
          <xref ref-type="bibr" rid="ref12 ref7">(Kim et al., 2009;
Stenetorp et al., 2012)</xref>
          and BioC
          <xref ref-type="bibr" rid="ref6">(Comeau et al., 2013)</xref>
          .
        </p>
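        <p>A client of such an API might prepare a request along the following lines. This Python sketch is purely illustrative: the base URL, the path layout, the query parameter and the service name are placeholders invented here, since the actual endpoints are not listed in this paper. The request is only built, not sent.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import urllib.request

# Placeholder base URL and path layout; the real endpoints are not given in this paper.
BASE_URL = "http://localhost:8080/neji-placeholder"


def build_annotation_request(
    pdf_bytes: bytes, service: str, output_format: str
) -> urllib.request.Request:
    """Prepare a POST request that submits a PDF to a named annotation service."""
    url = f"{BASE_URL}/services/{service}/annotate?format={output_format}"
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


# "genes" and "bioc" are hypothetical service and format names.
request = build_annotation_request(b"%PDF-1.4 ...", "genes", "bioc")
print(request.get_method(), request.full_url)
# Sending is left to the caller, for example:
#     with urllib.request.urlopen(request) as response:
#         annotations = response.read().decode("utf-8")
```
]]></preformat>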
        <p>Different annotation services can be configured
in the platform, where each service is an annotation
pipeline with a custom set of resources
(dictionaries and ML models) and processing properties.
This provides a way to easily manage concurrent
annotation services, allowing the configuration of
the properties and resources of each of them
independently. Additionally, resources are loaded into
memory as soon as a new service is created. Since
loading is usually an expensive step, especially for
large ML models, keeping the resources in memory
greatly reduces the total annotation time.</p>
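        <p>The load-once behaviour described above can be sketched in Python as follows. The class names AnnotationService and ServiceRegistry, the loader function and the dictionary entries are hypothetical and are not taken from the platform; the sketch only shows resources being loaded when a service is created and reused for every later request.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Callable, Dict, List

Loader = Callable[[], Dict[str, str]]


class AnnotationService:
    """An annotation pipeline whose resources are loaded once, at creation time."""

    def __init__(self, name: str, loader: Loader) -> None:
        self.name = name
        # The expensive step (reading dictionaries or ML models) happens here, once.
        self.resources = loader()

    def annotate(self, text: str) -> List[str]:
        """Reuse the in-memory resources; nothing is reloaded per request."""
        return [f"{term}:{concept}"
                for term, concept in self.resources.items() if term in text]


class ServiceRegistry:
    """Keeps independently configured services side by side."""

    def __init__(self) -> None:
        self._services: Dict[str, AnnotationService] = {}

    def create(self, name: str, loader: Loader) -> None:
        self._services[name] = AnnotationService(name, loader)

    def annotate(self, name: str, text: str) -> List[str]:
        return self._services[name].annotate(text)


load_calls: List[str] = []  # records every (simulated) expensive load


def load_gene_dictionary() -> Dict[str, str]:
    """Stand-in for a slow resource load."""
    load_calls.append("genes")
    return {"BRCA1": "Gene", "TP53": "Gene"}


registry = ServiceRegistry()
registry.create("genes", load_gene_dictionary)
first = registry.annotate("genes", "BRCA1 interacts with TP53.")
second = registry.annotate("genes", "No mention here.")
print(first, second, len(load_calls))
```
]]></preformat>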
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Egas</title>
      <p>Egas is a web-based platform for biomedical text
mining and collaborative curation that supports
inline annotation of concept occurrences and of
relations between these concepts. Annotations can be
performed automatically, using the available
services for automatic concept and relation
identification, or manually, wherein a user can add new
annotations and also edit or remove automatically
generated annotations. The results can then be
exported to various standard annotation formats.</p>
      <p>To adapt Egas to support literature curation over
PDF documents, we integrated PDF text
extraction using the Neji web services RESTful API and
adapted the interface for side-by-side visualization
of the extracted text alongside the original PDF
document. Navigation between the two
areas is synchronized, keeping the text annotation area and
the PDF visualization area aligned.</p>
      <p>Egas’ file import web services were also
extended to support PDF files. As with the
remaining file formats, this web service is
responsible for receiving the file, extracting the text
content using Neji’s PDF processing feature as
described above, and creating the whole data
structure to support document annotations. This
structure also includes sentence information retrieved
from Neji, such as each sentence’s start and end indexes with
respect to the extracted plain text and its
position within the PDF page, allowing synchronous
scrolling and navigation between the plain text and
PDF views.</p>
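      <p>One way to use this sentence information for synchronization is to look up the sentence that contains a given text offset and return its location in the PDF. The Python sketch below assumes sentence records sorted by start offset; the record fields and function names are hypothetical and do not describe Egas' actual client code.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# Hypothetical sentence records, sorted by start offset: character span in the
# plain text plus the page and vertical position in the PDF.
SENTENCES: List[Dict[str, float]] = [
    {"start": 0, "end": 13, "page": 1, "y": 90.5},
    {"start": 14, "end": 32, "page": 1, "y": 104.5},
    {"start": 33, "end": 60, "page": 2, "y": 72.0},
]
STARTS = [s["start"] for s in SENTENCES]


def sentence_at(offset: int) -> Optional[Dict[str, float]]:
    """Return the sentence whose [start, end) span contains the text offset."""
    index = bisect_right(STARTS, offset) - 1
    if index >= 0 and offset < SENTENCES[index]["end"]:
        return SENTENCES[index]
    return None


def pdf_scroll_target(offset: int) -> Optional[Tuple[float, float]]:
    """Map a position in the text panel to the (page, y) the PDF panel should show."""
    sentence = sentence_at(offset)
    return None if sentence is None else (sentence["page"], sentence["y"])


print(pdf_scroll_target(20))  # offset inside the second sentence
print(pdf_scroll_target(13))  # offset in the gap between two sentences
```
]]></preformat>
      <p>The reverse direction, from a double-click on the PDF to the text panel, would use the same records indexed by page and position instead of by text offset.</p>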
      <p>
        Figure 2 shows Egas’ user interface for PDF
annotation. The original PDF document is displayed
on the right-side panel, while the left panel shows
the annotation panel with the extracted text,
allowing annotation using the same simple
interactions as for other document formats, as described
in
        <xref ref-type="bibr" rid="ref4">(Campos et al., 2014)</xref>
        . As can be seen in the
figure, concept annotations added by the automatic
annotation services or by the curator are displayed
on the plain text as well as on the PDF
document. Additionally, a tooltip with the information
associated with each annotation is shown when
hovering the mouse over the annotation on either panel.
By clicking a sentence number on the annotation
panel, the PDF document is scrolled accordingly,
and the corresponding sentence is briefly
highlighted to facilitate its identification. Conversely,
double-clicking a sentence on the PDF scrolls the
text on the annotation panel and highlights the
corresponding sentence.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4 Conclusions</title>
      <p>Assisted literature curation tools, based on text
mining and information extraction methods, are
increasingly being used by curation teams,
helping to expedite their tasks. However, there is a
lack of tools that support direct annotation of PDF
documents, which is a very common format for
the scientific literature and other document types,
such as patents. We present a new feature of Egas
that allows direct document curation and
annotation over PDF files, with side-by-side visualization
of the original PDF document and of the extracted
textual content. By combining the user-friendliness
of Egas with the possibility of reading the
document in a very familiar format such as PDF, we
provide a more convenient and agreeable literature
curation environment, which could contribute to
improved efficiency.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Cecilia N</given-names>
            <surname>Arighi</surname>
          </string-name>
          , Ben Carterette, K Bretonnel Cohen, Martin Krallinger, W John Wilbur, Petra Fey, Robert Dodson, Laurel Cooper, Ceri E Van Slyke, Wasila Dahdul, et al.
          <year>2013</year>
          .
          <article-title>An overview of the BioCreative 2012 workshop track III: interactive text mining task</article-title>
          .
          <source>Database</source>
          ,
          <year>2013</year>
          :
          <fpage>bas056</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Michael</given-names>
            <surname>Bada</surname>
          </string-name>
          , Miriam Eckert, Donald Evans, Kristin Garcia, Krista Shipley, Dmitry Sitnikov, William A Baumgartner,
          <string-name>
            <given-names>K Bretonnel</given-names>
            <surname>Cohen</surname>
          </string-name>
          , Karin Verspoor,
          Judith A Blake
          , et al.
          <year>2012</year>
          .
          <article-title>Concept annotation in the CRAFT corpus</article-title>
          .
          <source>BMC bioinformatics</source>
          ,
          <volume>13</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Campos</surname>
          </string-name>
          , Sérgio Matos, and José Luís Oliveira.
          <year>2013</year>
          .
          <article-title>A modular framework for biomedical concept recognition</article-title>
          .
          <source>BMC bioinformatics</source>
          ,
          <volume>14</volume>
          (
          <issue>1</issue>
          ):
          <fpage>281</fpage>
          , jan.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Campos</surname>
          </string-name>
          , Jóni Lourenço, Sérgio Matos, and José Luís Oliveira.
          <year>2014</year>
          .
          <article-title>Egas: a collaborative and interactive document curation platform</article-title>
          .
          <source>Database : the journal of biological databases and curation</source>
          ,
          <year>2014</year>
          , jan.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Campos</surname>
          </string-name>
          , Sérgio Matos, and José L Oliveira.
          <year>2015</year>
          .
          <article-title>A document processing pipeline for annotating chemical entities in scientific documents</article-title>
          .
          <source>Journal of cheminformatics</source>
          ,
          <volume>7</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Donald C</given-names>
            <surname>Comeau</surname>
          </string-name>
          , Rezarta Islamaj Doğan, Paolo Ciccarese
          , Kevin Bretonnel Cohen, Martin Krallinger, Florian Leitner, Zhiyong Lu, Yifan Peng, Fabio Rinaldi,
          <string-name>
            <given-names>Manabu</given-names>
            <surname>Torii</surname>
          </string-name>
          , et al.
          <year>2013</year>
          .
          <article-title>BioC: a minimalist approach to interoperability for biomedical text processing</article-title>
          .
          <source>Database</source>
          ,
          <year>2013</year>
          :
          <fpage>bat064</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Jin-Dong</given-names>
            <surname>Kim</surname>
          </string-name>
          , Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and
          <string-name>
            <given-names>Jun'ichi</given-names>
            <surname>Tsujii</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Overview of BioNLP'09 shared task on event extraction</article-title>
          .
          <source>In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Sérgio</given-names>
            <surname>Matos</surname>
          </string-name>
          , David Campos, Renato Pinho, Raquel M Silva, Matthew Mort, David N Cooper, and José Luís Oliveira.
          <year>2016</year>
          .
          <article-title>Mining clinical attributes of genomic variants through assisted literature curation in Egas</article-title>
          .
          <source>Database</source>
          ,
          <year>2016</year>
          :
          <fpage>baw096</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Mariana</given-names>
            <surname>Neves</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ulf</given-names>
            <surname>Leser</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>A survey on annotation tools for the biomedical literature</article-title>
          .
          <source>Briefings in bioinformatics</source>
          , page bbs084.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Cartic</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          , Abhishek Patnia, Eduard Hovy, and Gully Apc Burns.
          <year>2012</year>
          .
          <article-title>Layout-aware text extraction from full-text PDF of scientific articles</article-title>
          .
          <source>Source code for biology and medicine</source>
          ,
          <volume>7</volume>
          (
          <issue>1</issue>
          ):
          <fpage>7</fpage>
          , jan.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Dietrich</given-names>
            <surname>Rebholz-Schuhmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Harald</given-names>
            <surname>Kirsch</surname>
          </string-name>
          , and Francisco Couto.
          <year>2005</year>
          .
          <article-title>Facts from text: is text mining ready to deliver?</article-title>
          <source>PLoS Biol</source>
          ,
          <volume>3</volume>
          (
          <issue>2</issue>
          ):
          <fpage>e65</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Pontus</given-names>
            <surname>Stenetorp</surname>
          </string-name>
          , Sampo Pyysalo, Goran Topić,
          <string-name>
            <given-names>Tomoko</given-names>
            <surname>Ohta</surname>
          </string-name>
          , Sophia Ananiadou, and
          <string-name>
            <given-names>Jun'ichi</given-names>
            <surname>Tsujii</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>BRAT: a web-based tool for NLP-assisted text annotation</article-title>
          . pages
          <fpage>102</fpage>
          -
          <lpage>107</lpage>
          , apr.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <collab>U.S. National Library of Medicine</collab>
          .
          <year>2016</year>
          .
          <source>Detailed Indexing Statistics: 1965-2015</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Alexander S</given-names>
            <surname>Yeh</surname>
          </string-name>
          , Lynette Hirschman, and Alexander A Morgan.
          <year>2003</year>
          .
          <article-title>Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>19</volume>
          (
          <issue>suppl 1</issue>
          ):
          <fpage>i331</fpage>
          -
          <lpage>i339</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>