A curation pipeline and web-services for PDF documents

André Santos
DETI/IEETA, University of Aveiro
3810-193 Aveiro, Portugal

Sérgio Matos
DETI/IEETA, University of Aveiro
3810-193 Aveiro, Portugal

David Campos
BMD Software
3810-074 Aveiro, Portugal
david.campos@bmd-software.com

José Luís Oliveira
DETI/IEETA, University of Aveiro
3810-193 Aveiro, Portugal

The continuous growth of the biomedical literature and the need to efficiently find and extract information from its content led to the development of various text mining tools. More recently, these tools started being integrated in user-friendly applications, facilitating their use by expert database curators. However, these tools were mainly designed to extract information from text-based documents, in XML and other formats, while today a considerable part of the biomedical literature is published and distributed in PDF format.

To address this limitation, we extended the web-based literature curation tool Egas, adding support for direct document curation and annotation over PDF files, with side-by-side visualization of the original PDF document and of the extracted textual content. Egas' PDF document processing and text-mining features are supported by a newly developed web-services platform built over Neji, a highly efficient information extraction framework. These web services allow PDF text extraction and annotation capabilities to be integrated into other tools and text mining pipelines.

Introduction

The large amount of information and knowledge continuously produced in the biomedical domain is reflected in the number of published journal articles. In 2015, the bibliographic database MEDLINE contained over 23 million references to journal articles in life sciences, of which 1 million were added in that year (U.S. National Library of Medicine, 2016). At this rate, staying updated with the current knowledge and identifying the most relevant publications and information on a given subject is a very challenging task for researchers.

To facilitate access to knowledge, several resources started by manually curating scientific articles, extracting and structuring relevant and validated information. However, with the rapid growth of data this task became unfeasible (Yeh et al., 2003; Rebholz-Schuhmann et al., 2005), and automatic information extraction tools were developed and integrated in the curation pipeline in order to accelerate the curation process (Neves and Leser, 2012). This also led to the need to create end-user interfaces to these tools, allowing curators to use them in an efficient manner. The success of the BioCreative Interactive Annotation Task series demonstrates the importance of these efforts (Arighi et al., 2013).

While existing information extraction tools have been shown to achieve robust performance in various tasks, and various literature curation tools have been proposed that make use of such automated methods, they were generally designed to work with plain text or with structured formats such as XML. There is, however, a lack of tools for supporting curation workflows that make use of the Portable Document Format (PDF), which has become one of the most popular file formats for publishing and sharing documents.

We have previously presented Neji (Campos et al., 2013), an open source framework for biomedical concept recognition, and Egas (Campos et al., 2014), a web-based tool for literature curation built with modern web technologies and providing simple inline representation of annotations and user-friendly interaction. In this paper we present new features added to Egas and Neji to support text-mining and curation workflows over PDF documents. In Section 2 we describe Neji's new PDF processing functionalities and present its new web-services platform. These web-services are used by the curation tool for extracting the text from PDF documents and for obtaining automatic concept annotations, and also facilitate the integration of Neji's functionalities in external text-mining pipelines and tools. Egas is described in Section 3, highlighting the new PDF annotation features including side-by-side synchronous visualization of the extracted text and of the original PDF, and also the display of concept annotations over the PDF document.

Figure 1: Neji processing pipeline and modular architecture (Campos et al., 2013)

Neji

Neji is an open source framework for biomedical concept recognition built around four crucial characteristics: modularity, scalability, speed and usability. It integrates several state-of-the-art methods for biomedical natural language processing (NLP), namely methods for sentence splitting, tokenization, lemmatization, part-of-speech (POS) tagging, chunking and dependency parsing. The concept recognition tasks are performed using dictionary matching and machine learning techniques with normalization. This framework implements a very flexible and efficient concept tree to store the document annotations, supporting nested and intersected concepts with one or more identifiers. It supports several input and output formats including the most popular ones in biomedical text mining, such as IeXML, Pubmed XML, A1, CONLL and BioC. The architecture of Neji allows users to configure the processing of documents according to their specific objectives and goals, for example by simply combining existing or new modules for reading, processing and writing data, or by selecting the appropriate dictionaries or machine learning models according to the concept types of interest.
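To make the idea of a concept tree with nested and intersected annotations more concrete, the following is a minimal sketch in Python. It is not Neji's actual data structure: the class `Annotation`, the function `insert` and the identifiers in the example are hypothetical, chosen only to illustrate how containment can be stored as parent-child links and partial overlap as a separate relation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    """A concept mention: character span, surface text and one or more identifiers."""
    start: int
    end: int  # exclusive
    text: str
    ids: List[str]
    children: List["Annotation"] = field(default_factory=list)
    intersections: List["Annotation"] = field(default_factory=list)

    def contains(self, other: "Annotation") -> bool:
        return self.start <= other.start and other.end <= self.end

    def overlaps(self, other: "Annotation") -> bool:
        return self.start < other.end and other.start < self.end


def insert(siblings: List[Annotation], new: Annotation) -> None:
    """Add `new` to a list of sibling nodes, nesting it where possible."""
    # If an existing node fully contains the new span, descend into it.
    for node in siblings:
        if node.contains(new):
            insert(node.children, new)
            return
    # Existing nodes fully contained in the new span become its children.
    for node in [n for n in siblings if new.contains(n)]:
        siblings.remove(node)
        new.children.append(node)
    # Remaining overlaps are partial: record them as intersections.
    for node in siblings:
        if node.overlaps(new):
            node.intersections.append(new)
            new.intersections.append(node)
    siblings.append(new)
    siblings.sort(key=lambda a: (a.start, -a.end))


sentence = "human BRCA1 gene mutation"
roots: List[Annotation] = []
# Identifiers below are placeholders, not real database entries.
for start, end, ids in [(6, 11, ["GENE:example-1"]),
                        (6, 16, ["GENE:example-1", "GENE:example-2"]),
                        (12, 25, ["VARIANT:example-3"])]:
    insert(roots, Annotation(start, end, sentence[start:end], ids))

for root in roots:
    print(root.text, "| nested:", [c.text for c in root.children],
          "| intersects:", [i.text for i in root.intersections])
```

In this example "BRCA1" ends up nested under "BRCA1 gene", while "gene mutation" only partially overlaps "BRCA1 gene" and is therefore stored as an intersection at the same level.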

Neji has been evaluated on several corpora, covering different concept types (Campos et al., 2013; Campos et al., 2015; Matos et al., 2016). Table 1 shows a summary of the concept identification performance.

Pipeline and modules

The main component of Neji is the processing pipeline (Figure 1), a series of independent modules, each of them responsible for a specific processing task, that are executed sequentially. We used Monq.jfa (footnote 1), a library for fast and flexible text filtering with regular expressions, to implement each pipeline module as a custom deterministic finite automaton (DFA) with specific rules and actions.
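The sequential, modular design described above can be illustrated with a short Python sketch. This is an assumption-laden illustration and not Neji's implementation: Neji modules are DFAs built with Monq.jfa in Java, whereas here each module is a plain function and matching is done with Python's `re` module. All names (`Document`, `sentence_splitter`, `dictionary_tagger`, `run_pipeline`, `to_standoff`) and the toy dictionary are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Document:
    text: str
    sentences: List[str] = field(default_factory=list)
    # Each annotation is (start, end, surface text, concept label).
    annotations: List[Tuple[int, int, str, str]] = field(default_factory=list)


Module = Callable[[Document], Document]


def sentence_splitter(doc: Document) -> Document:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    doc.sentences = re.split(r"(?<=[.!?])\s+", doc.text.strip())
    return doc


def dictionary_tagger(dictionary: Dict[str, str]) -> Module:
    """Build a module that tags case-insensitive whole-word dictionary matches."""
    def tag(doc: Document) -> Document:
        for term, label in dictionary.items():
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, doc.text, flags=re.IGNORECASE):
                doc.annotations.append((m.start(), m.end(), m.group(), label))
        doc.annotations.sort()
        return doc
    return tag


def run_pipeline(doc: Document, modules: List[Module]) -> Document:
    """Execute the modules one after another, passing the document along."""
    for module in modules:
        doc = module(doc)
    return doc


def to_standoff(doc: Document) -> str:
    """Writer step: serialize annotations as standoff-style text-bound lines."""
    return "\n".join(
        f"T{i}\t{label} {start} {end}\t{text}"
        for i, (start, end, text, label) in enumerate(doc.annotations, 1)
    )


doc = run_pipeline(
    Document("BRCA1 is a human gene. Mutations in BRCA1 increase cancer risk."),
    [sentence_splitter, dictionary_tagger({"BRCA1": "GENE", "human": "SPECIES"})],
)
print(to_standoff(doc))
```

Because every module takes and returns the same document object, modules can be added, removed or reordered without changing the others, which is the property the pipeline design relies on.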

Handling PDF files

Thanks to Neji's modular architecture, adding PDF processing capabilities only required the implementation of a new reader module. For this, we integrated LA-PDFText (Ramakrishnan et al., 2012), a state-of-the-art open-source tool for handling PDF documents. LA-PDFText makes use of a carefully crafted set of rules defined on the business rules management system DROOLS, allowing it to correctly handle different PDF layouts such as one-column, two-column and mixed layouts. This feature also allows defining different sets of rules for specific PDF layouts if necessary, and we therefore included in the new Neji reader an optional parameter for this.

In order to evaluate the text extraction quality, we obtained the original PDF documents corresponding to the 67 full-text articles that compose the CRAFT corpus (Bada et al., 2012), and compared the text extracted by LA-PDFText, through our processing pipeline, to the distributed text contents, which were extracted from XML files. For these articles, published in 21 different journals and having distinct layouts, we obtained an exact match in 90% of the extracted sentences.
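A sentence-level exact-match rate of the kind reported above could be computed along the following lines. This is only a sketch: the paper does not specify how sentences were normalized or paired, so the whitespace normalization and the multiset matching used here are assumptions, and the function names and sample sentences are hypothetical.

```python
from collections import Counter
from typing import List


def normalize(sentence: str) -> str:
    """Collapse runs of whitespace so that layout differences are ignored."""
    return " ".join(sentence.split())


def exact_match_rate(extracted: List[str], reference: List[str]) -> float:
    """Fraction of extracted sentences that also occur in the reference text.

    Each reference sentence can be matched at most once.
    """
    remaining = Counter(normalize(s) for s in reference)
    matched = 0
    for sentence in extracted:
        key = normalize(sentence)
        if remaining[key] > 0:
            remaining[key] -= 1
            matched += 1
    return matched / len(extracted) if extracted else 0.0


# Toy data: the second extracted sentence carries a hyphenation artifact.
extracted = [
    "The gene was expressed.",
    "Results are shown in Fig- ure 2.",
    "Mice were  housed in groups.",
    "We thank the reviewers.",
]
reference = [
    "The gene was expressed.",
    "Results are shown in Figure 2.",
    "Mice were housed in groups.",
    "We thank the reviewers.",
]
print(f"{exact_match_rate(extracted, reference):.0%}")
```

The denominator is the number of extracted sentences, matching the way the result is stated in the text.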

Apart from extracting the text, which is sufficient for running the processing pipeline, we added further capabilities to the reader in order to make use of PDF processing in the curation tool Egas. Namely, we apply sentence splitting to the extracted chunks of text, and extract the position of each sentence in each page to allow aligning and navigating between the plain text and PDF views in the user interface. This information is associated with each sentence and carried over to the remaining modules in the pipeline. A new writer module was also implemented that exports this extended information in JSON format, for simple reuse in external tools.
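As an illustration of what such an export might look like, the sketch below serializes sentence offsets and page positions to JSON and reads them back. The actual schema produced by Neji's writer is not given in the text, so every field name here (`index`, `start`, `end`, `page`, `x`, `y`, `width`, `height`) and the coordinate values are assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class SentenceRecord:
    index: int     # sentence number within the document
    start: int     # character offset in the extracted plain text
    end: int       # exclusive end offset
    page: int      # 1-based PDF page number
    x: float       # bounding box of the sentence on the page (assumed units)
    y: float
    width: float
    height: float


def export_json(text: str, sentences: List[SentenceRecord]) -> str:
    """Serialize the plain text together with per-sentence position data."""
    payload = {"text": text, "sentences": [asdict(s) for s in sentences]}
    return json.dumps(payload, indent=2)


first = "BRCA1 is a human gene."
second = "Mutations in BRCA1 increase cancer risk."
text = first + " " + second
records = [
    SentenceRecord(1, text.index(first), text.index(first) + len(first),
                   1, 72.0, 90.0, 220.0, 12.0),
    SentenceRecord(2, text.index(second), text.index(second) + len(second),
                   1, 72.0, 104.0, 300.0, 12.0),
]
exported = export_json(text, records)

# An external tool can recover each sentence and its page position.
loaded = json.loads(exported)
for s in loaded["sentences"]:
    print(s["index"], s["page"], loaded["text"][s["start"]:s["end"]])
```

Keeping character offsets and page coordinates in the same record is what lets a consumer move between the plain text and the PDF view without re-processing the file.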

Web-services

Neji web-services are intended to facilitate access to and use of Neji functionalities by providing a simple RESTful API that allows developers to send their input documents and receive the plain text extracted from the submitted PDF file, as well as annotation results in various well-known formats, including standoff (A1) (Kim et al., 2009; Stenetorp et al., 2012) and BioC (Comeau et al., 2013).

Different annotation services can be configured in the platform, in which a service is an annotation pipeline with a custom set of resources (dictionaries and ML models) and processing properties. This provides a way to easily manage concurrent annotation services, allowing the configuration of the properties and resources of each of them independently. Additionally, resources are loaded into memory as soon as a new service is created. Since this usually is an expensive step, especially for large ML models, having the resources in memory greatly reduces the total annotation time.
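The benefit of loading resources when a service is created, rather than on each request, can be sketched as follows. This is a simplified illustration and not the platform's code: the classes `AnnotationService` and `ServiceRegistry`, the dictionary format and the service names are all hypothetical, and a small in-memory lookup table stands in for the dictionaries and ML models.

```python
import re
from typing import Dict, List, Tuple


class AnnotationService:
    """An annotation pipeline with its own resources, loaded once at creation."""

    def __init__(self, name: str, dictionaries: Dict[str, List[str]]):
        self.name = name
        # Expensive step done eagerly: build the term lookup a single time.
        self.lookup: Dict[str, str] = {
            term.lower(): concept
            for concept, terms in dictionaries.items()
            for term in terms
        }

    def annotate(self, text: str) -> List[Tuple[str, str]]:
        """Return (token, concept) pairs using the in-memory lookup."""
        return [
            (token, self.lookup[token.lower()])
            for token in re.findall(r"\w+", text)
            if token.lower() in self.lookup
        ]


class ServiceRegistry:
    """Holds independently configured services that can be used concurrently."""

    def __init__(self) -> None:
        self._services: Dict[str, AnnotationService] = {}

    def create(self, name: str, dictionaries: Dict[str, List[str]]) -> None:
        self._services[name] = AnnotationService(name, dictionaries)

    def annotate(self, name: str, text: str) -> List[Tuple[str, str]]:
        # No loading happens here; requests reuse the resident resources.
        return self._services[name].annotate(text)


registry = ServiceRegistry()
registry.create("genes", {"GENE": ["BRCA1", "TP53"]})
registry.create("species", {"SPECIES": ["human", "mouse"]})

sample = "BRCA1 is a human gene"
print(registry.annotate("genes", sample))
print(registry.annotate("species", sample))
```

Each service keeps its own resources, so two services with different dictionaries can annotate the same text independently, and the loading cost is paid once per service rather than once per document.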

Egas

Egas is a web-based platform for biomedical text mining and collaborative curation that supports inline annotation of concept occurrences and of relations between these concepts. Annotations can be performed automatically, using the available services for automatic concept and relation identification, or manually, wherein a user can add new annotations and also edit or remove automatically generated annotations. The results can then be exported to various standard annotation formats.

To adapt Egas to support literature curation over PDF documents, we integrated PDF text extraction using the Neji web services RESTful API and adapted the interface for side-by-side visualization of the extracted text alongside the original PDF. Egas' file import web services were also extended to support PDF files. As with the remaining file formats, this web service is responsible for receiving the file, extracting the text content using Neji's PDF processing feature as described above, and creating the whole data structure to support document annotations. This structure also includes sentence information retrieved from Neji, such as the start and end indexes, with respect to the extracted plain text, and its position within the PDF page, allowing synchronous scrolling and navigation between the plain text and PDF views.

Figure 2 shows Egas' user interface for PDF annotation. The original PDF document is displayed on the right-side panel, while the left panel shows the annotation panel with the extracted text, allowing annotation using the same simple interactions as for other document formats, as described in (Campos et al., 2014). As can be seen in the figure, concept annotations added by the automatic annotation services or by the curator are displayed on the plain text as well as on the PDF document. Additionally, a tooltip with information associated with each annotation is shown when hovering over the annotation on either panel. By clicking a sentence number on the annotation panel, the PDF document is scrolled accordingly, and the corresponding sentence is briefly highlighted to facilitate its identification. Conversely, double-clicking a sentence on the PDF scrolls the text on the annotation panel and highlights the corresponding sentence.
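The two navigation directions described above reduce to two lookups over the per-sentence position data. The sketch below shows one possible way to implement them; it is not taken from Egas, and the `SentenceBox` record, the function names and the coordinate convention (origin at the top-left of the page, y increasing downwards) are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SentenceBox:
    index: int     # sentence number shown in the annotation panel
    page: int      # 1-based PDF page
    x: float       # bounding box on the page (assumed top-left origin)
    y: float
    width: float
    height: float


def scroll_target(boxes: List[SentenceBox], sentence_index: int) -> Optional[Tuple[int, float]]:
    """Text panel -> PDF: page and vertical offset to scroll to for a sentence."""
    for box in boxes:
        if box.index == sentence_index:
            return (box.page, box.y)
    return None


def sentence_at(boxes: List[SentenceBox], page: int, x: float, y: float) -> Optional[int]:
    """PDF -> text panel: sentence whose box contains the clicked point, if any."""
    for box in boxes:
        inside_x = box.x <= x <= box.x + box.width
        inside_y = box.y <= y <= box.y + box.height
        if box.page == page and inside_x and inside_y:
            return box.index
    return None


boxes = [
    SentenceBox(1, 1, 72.0, 90.0, 220.0, 12.0),
    SentenceBox(2, 1, 72.0, 104.0, 300.0, 12.0),
    SentenceBox(3, 2, 72.0, 90.0, 180.0, 12.0),
]

print(scroll_target(boxes, 3))            # clicking sentence number 3
print(sentence_at(boxes, 1, 100.0, 110.0))  # double-clicking on page 1
```

A linear scan is enough for a single article; for very long documents the boxes could be grouped per page first, which is left out here for brevity.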

Conclusions

Assisted literature curation tools, based on text mining and information extraction methods, are increasingly being used by curation teams, helping to expedite their tasks. However, there is a lack of tools that support direct annotation of PDF documents, which is a very common format for the scientific literature and other document types, such as patents. We present a new feature of Egas that allows direct document curation and annotation over PDF files, with side-by-side visualization of the original PDF document and of the extracted textual content. By combining the user-friendliness of Egas with the possibility of reading the document in a very familiar format such as PDF, we provide a more convenient and agreeable literature curation environment, which could contribute to improved efficiency.

Figure 2: Egas PDF annotation interface

Table 1: Neji concept recognition results on a variety of corpora and concept types. D: Dictionary; ML: Machine-Learning

Corpus               Concept type                                F-score   Method
CRAFT                Species                                     95%       D
CRAFT                Cell                                        92%       D
CRAFT                Gene and Protein                            76%       ML
CRAFT                Chemicals                                   65%       D
CRAFT                Cellular Component                          83%       D
CRAFT                Biological Process and Molecular Function   63%       D
NCBI Disease         Disorders                                   85%       D
AnEM                 Anatomy                                     82%       D
BC II Gene Mention   Gene and Protein                            87%       ML
tmVar                Genetic Variants                            86%       ML
BC IV CHEMDNER       Chemicals                                   87%       ML

1. Monq.jfa: http://www.pifpafpuf.de/Monq.jfa/

References

Cecilia N Arighi, Ben Carterette, K Bretonnel Cohen, Martin Krallinger, W John Wilbur, Petra Fey, Robert Dodson, Laurel Cooper, Ceri E Van Slyke, Wasila Dahdul, et al. 2013. An overview of the BioCreative 2012 workshop track III: interactive text mining task. Database, 2013:bas056.

Michael Bada, Miriam Eckert, Donald Evans, Kristin Garcia, Krista Shipley, Dmitry Sitnikov, William A Baumgartner, K Bretonnel Cohen, Karin Verspoor, Judith A Blake, et al. 2012. Concept annotation in the CRAFT corpus. BMC Bioinformatics, 13(1):1.

David Campos, Sérgio Matos, and José Luís Oliveira. 2013. A modular framework for biomedical concept recognition. BMC Bioinformatics, 14(1):281, jan.

David Campos, Jóni Lourenço, Sérgio Matos, and José Luís Oliveira. 2014. Egas: a collaborative and interactive document curation platform. Database: the journal of biological databases and curation, 2014, jan.

David Campos, Sérgio Matos, and José L Oliveira. 2015. A document processing pipeline for annotating chemical entities in scientific documents. Journal of Cheminformatics, 7(1):1.

Donald C Comeau, Rezarta Islamaj Dogan, Paolo Ciccarese, Kevin Bretonnel Cohen, Martin Krallinger, Florian Leitner, Zhiyong Lu, Yifan Peng, Fabio Rinaldi, Manabu Torii, et al. 2013. BioC: a minimalist approach to interoperability for biomedical text processing. Database, 2013:bat064.

Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Jun'ichi Tsujii. 2009. Overview of BioNLP'09 shared task on event extraction. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task. Association for Computational Linguistics.

Sérgio Matos, David Campos, Renato Pinho, Raquel M Silva, Matthew Mort, David N Cooper, and José Luís Oliveira. 2016. Mining clinical attributes of genomic variants through assisted literature curation in Egas. Database, 2016:baw096.

Mariana Neves and Ulf Leser. 2012. A survey on annotation tools for the biomedical literature. Briefings in Bioinformatics, bbs084.

Cartic Ramakrishnan, Abhishek Patnia, Eduard Hovy, and Gully APC Burns. 2012. Layout-aware text extraction from full-text PDF of scientific articles. Source Code for Biology and Medicine, 7(1):7, jan.

Dietrich Rebholz-Schuhmann, Harald Kirsch, and Francisco Couto. 2005. Facts from text - is text mining ready to deliver? PLoS Biol, 3(2):e65.

Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. BRAT: a web-based tool for NLP-assisted text annotation. apr.

U.S. National Library of Medicine. 2016. Detailed Indexing Statistics.

Alexander S Yeh, Lynette Hirschman, and Alexander A Morgan. 2003. Evaluation of text data mining for database curation: lessons learned from the KDD challenge cup. Bioinformatics, 19(suppl 1).