A curation pipeline and web-services for PDF documents

André Santos¹, Sérgio Matos¹, David Campos² and José Luís Oliveira¹
¹ DETI/IEETA, University of Aveiro, 3810-193 Aveiro, Portugal
{aleixomatos,andre.jeronimo,jlo}@ua.pt
² BMD Software, 3810-074 Aveiro, Portugal
david.campos@bmd-software.com

Abstract

The continuous growth of the biomedical literature and the need to efficiently find and extract information from its content led to the development of various text mining tools. More recently, these tools have been integrated into user-friendly applications, facilitating their use by expert database curators. However, these tools were mainly designed to extract information from text-based documents, in XML and other formats, while today a considerable part of the biomedical literature is published and distributed in PDF format. To address this limitation, we extended the web-based literature curation tool Egas, adding support for direct document curation and annotation over PDF files, with side-by-side visualization of the original PDF document and of the extracted textual content. Egas’ PDF document processing and text-mining features are supported by a newly developed web-services platform built over Neji, a highly efficient information extraction framework. These web services allow PDF text extraction and annotation capabilities to be integrated into other tools and text mining pipelines.

1 Introduction

The large amount of information and knowledge continuously produced in the biomedical domain is reflected in the number of published journal articles. In 2015, the bibliographic database MEDLINE contained over 23 million references to journal articles in life sciences, of which 1 million were added in that year (U.S. National Library of Medicine, 2016). At this rate, staying updated with the current knowledge and identifying the most relevant publications and information on a given subject is a very challenging task for researchers.

To facilitate access to knowledge, several resources started by manually curating scientific articles, extracting and structuring relevant and validated information. However, with the rapid growth of data this task became unfeasible (Yeh et al., 2003; Rebholz-Schuhmann et al., 2005), and automatic information extraction tools were developed and integrated in the curation pipeline in order to accelerate the curation process (Neves and Leser, 2012). This also led to the need to create end-user interfaces to these tools, allowing their use by curators in an efficient manner. The success of the BioCreative Interactive Annotation Task series demonstrates the importance of these efforts (Arighi et al., 2013).

While existing information extraction tools have been shown to achieve robust performance in various tasks, and various literature curation tools have been proposed that make use of such automated methods, they were generally designed to work with plain text or with structured formats such as XML. There is, however, a lack of tools for supporting curation workflows that make use of the Portable Document Format (PDF), which has become one of the most popular file formats for publishing and sharing documents.

We have previously presented Neji (Campos et al., 2013), an open source framework for biomedical concept recognition, and Egas (Campos et al., 2014), a web-based tool for literature curation built with modern web technologies and providing simple inline representation of annotations and user-friendly interaction. In this paper we present new features added to Egas and Neji to support text-mining and curation workflows over PDF documents. In Section 2 we describe Neji’s new PDF processing functionalities and present its new web-services platform. These web-services are used by the curation tool for extracting the text from PDF documents and for obtaining automatic concept annotations, and also facilitate the integration of Neji’s functionalities in external text-mining pipelines and tools. Egas is described in Section 3, highlighting the new PDF annotation features including side-by-side synchronous visualization of the extracted text and of the original PDF, and also the display of concept annotations over the PDF document.

2 Neji

Neji is an open source framework for biomedical concept recognition built around four crucial characteristics: modularity, scalability, speed and usability. It integrates several state-of-the-art methods for biomedical natural language processing (NLP), namely methods for sentence splitting, tokenization, lemmatization, part-of-speech (POS) tagging, chunking and dependency parsing. The concept recognition tasks are performed using dictionary matching and machine learning techniques with normalization. This framework implements a very flexible and efficient concept tree to store the document annotations, supporting nested and intersected concepts with one or more identifiers. It supports several input and output formats including the most popular ones in biomedical text mining, such as IeXML, PubMed XML, A1, CoNLL and BioC. The architecture of Neji allows users to configure the processing of documents according to their specific objectives and goals, for example by simply combining existing or new modules for reading, processing and writing data, or by selecting the appropriate dictionaries or machine learning models according to the concept types of interest.

Neji has been evaluated on several corpora, covering different concept types (Campos et al., 2013; Campos et al., 2015; Matos et al., 2016). Table 1 shows a summary of the concept identification performance.

Table 1: Neji concept recognition results on a variety of corpora and concept types. D: Dictionary; ML: Machine-Learning

Corpus               Concept type                                F-score   Method
-------------------  ------------------------------------------  --------  ------
CRAFT                Species                                     95%       D
                     Cell                                        92%       D
                     Gene and Protein                            76%       ML
                     Chemicals                                   65%       D
                     Cellular Component                          83%       D
                     Biological Process and Molecular Function   63%       D
NCBI Disease         Disorders                                   85%       D
Anem                 Anatomy                                     82%       D
BC II Gene Mention   Gene and Protein                            87%       ML
tmVar                Genetic Variants                            86%       ML
BC IV ChemdNER       Chemicals                                   87%       ML

2.1 Pipeline and modules

The main component of Neji is the processing pipeline (Figure 1), a series of independent modules that are executed sequentially, each of them responsible for a specific processing task. We used Monq.jfa (http://www.pifpafpuf.de/Monq.jfa/), a library for fast and flexible text filtering with regular expressions, to implement each pipeline module as a custom deterministic finite automaton (DFA) with specific rules and actions.

Figure 1: Neji processing pipeline and modular architecture (Campos et al., 2013)

2.1.1 Handling PDF files

Thanks to Neji’s modular architecture, adding PDF processing capabilities only required the implementation of a new reader module. For this, we integrated LA-PDFText (Ramakrishnan et al., 2012), a state-of-the-art open-source tool for handling PDF documents. LA-PDFText makes use of a carefully crafted set of rules defined on the business rules management system DROOLS, allowing it to correctly handle different PDF layouts such as one column, two columns and mixed layouts. This feature also allows defining different sets of rules for specific PDF layouts if necessary, and we therefore included in the new Neji reader an optional parameter for this.

In order to evaluate the text extraction quality, we obtained the original PDF documents corresponding to the 67 full-text articles that compose the CRAFT corpus (Bada et al., 2012), and compared the text extracted by LA-PDFText, through our processing pipeline, to the distributed text contents, which were extracted from XML files. For these articles, published in 21 different journals and having distinct layouts, we obtained an exact match in 90% of the extracted sentences.

Apart from extracting the text, which is sufficient for running the processing pipeline, we added further capabilities to the reader in order to make use of PDF processing in the curation tool Egas. Namely, we apply sentence splitting to the extracted chunks of text, and extract the position of each sentence in each page to allow aligning and navigating between the plain text and PDF views in the user interface. This information is associated with each sentence and carried over to the remaining modules in the pipeline. A new writer module was also implemented that exports this extended information in JSON format, for simple reuse in external tools.

2.2 Web-services

Neji web-services are intended to facilitate the use of and access to Neji functionalities by providing a simple RESTful API that allows developers to send their input documents and receive the plain text extracted from the submitted PDF file, as well as annotation results in various well-known formats, including standoff (A1) (Kim et al., 2009; Stenetorp et al., 2012) and BioC (Comeau et al., 2013).

Different annotation services can be configured in the platform, in which a service is an annotation pipeline with a custom set of resources (dictionaries and ML models) and processing properties. This provides a way to easily manage concurrent annotation services, allowing the configuration of the properties and resources of each of them independently. Additionally, resources are loaded into memory as soon as a new service is created. Since this is usually an expensive step, especially for large ML models, having the resources in memory greatly reduces the total annotation time.

3 Egas

Egas is a web-based platform for biomedical text mining and collaborative curation that supports inline annotation of concept occurrences and of relations between these concepts. Annotations can be performed automatically, using the available services for automatic concept and relation identification, or manually, wherein a user can add new annotations and also edit or remove automatically generated annotations. The results can then be exported to various standard annotation formats.

To adapt Egas to support literature curation over PDF documents, we integrated PDF text extraction using the Neji web services RESTful API and adapted the interface for side-by-side visualization of the extracted text alongside the original PDF document, allowing navigation between both zones and synchronizing the text annotation area and the PDF visualization area.

Egas’ file import web services were also extended to support PDF files. As with the remaining file formats, this web service is responsible for receiving the file, extracting the text content using Neji’s PDF processing feature as described above, and creating the whole data structure to support document annotations. This structure also includes sentence information retrieved from Neji, such as the start and end indexes, with respect to the extracted plain text, and its position within the PDF page, allowing synchronous scrolling and navigation between the plain text and PDF views.

Figure 2 shows Egas’ user interface for PDF annotation. The original PDF document is displayed on the right-side panel, while the left panel shows the annotation panel with the extracted text, allowing annotation using the same simple interactions as for other document formats, as described in (Campos et al., 2014). As can be seen in the figure, concept annotations added by the automatic annotation services or by the curator are displayed on the plain text as well as on the PDF document. Additionally, a tooltip with information associated with each annotation is shown when hovering the mouse over the annotation on either panel. By clicking a sentence number on the annotation panel, the PDF document is scrolled accordingly, and the corresponding sentence is briefly highlighted to facilitate its identification. Conversely, double-clicking a sentence on the PDF scrolls the text on the annotation panel and highlights the corresponding sentence.

Figure 2: Egas PDF annotation interface

4 Conclusions

Assisted literature curation tools, based on text mining and information extraction methods, are increasingly being used by curation teams, helping to expedite their tasks. However, there is a lack of tools that support direct annotation of PDF documents, which is a very common format for the scientific literature and other document types, such as patents. We present a new feature of Egas that allows direct document curation and annotation over PDF files, with side-by-side visualization of the original PDF document and of the extracted textual content. By aligning the user-friendliness of Egas with the possibility of reading the document in a very familiar format such as PDF, we provide a more convenient and agreeable literature curation environment, which could contribute to improved efficiency.

References

Cecilia N Arighi, Ben Carterette, K Bretonnel Cohen, Martin Krallinger, W John Wilbur, Petra Fey, Robert Dodson, Laurel Cooper, Ceri E Van Slyke, Wasila Dahdul, et al. 2013. An overview of the BioCreative 2012 workshop track III: interactive text mining task. Database, 2013:bas056.

Michael Bada, Miriam Eckert, Donald Evans, Kristin Garcia, Krista Shipley, Dmitry Sitnikov, William A Baumgartner, K Bretonnel Cohen, Karin Verspoor, Judith A Blake, et al. 2012. Concept annotation in the CRAFT corpus. BMC bioinformatics, 13(1):1.

David Campos, Sérgio Matos, and José Luís Oliveira. 2013. A modular framework for biomedical concept recognition. BMC bioinformatics, 14(1):281, jan.

David Campos, Jóni Lourenço, Sérgio Matos, and José Luís Oliveira. 2014. Egas: a collaborative and interactive document curation platform. Database: the journal of biological databases and curation, 2014, jan.

David Campos, Sérgio Matos, and José L Oliveira. 2015. A document processing pipeline for annotating chemical entities in scientific documents. Journal of cheminformatics, 7(1):1.

Donald C Comeau, Rezarta Islamaj Doğan, Paolo Ciccarese, Kevin Bretonnel Cohen, Martin Krallinger, Florian Leitner, Zhiyong Lu, Yifan Peng, Fabio Rinaldi, Manabu Torii, et al. 2013. BioC: a minimalist approach to interoperability for biomedical text processing. Database, 2013:bat064.

Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Jun’ichi Tsujii. 2009. Overview of BioNLP’09 shared task on event extraction. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, pages 1–9. Association for Computational Linguistics.

Sérgio Matos, David Campos, Renato Pinho, Raquel M Silva, Matthew Mort, David N Cooper, and José Luís Oliveira. 2016. Mining clinical attributes of genomic variants through assisted literature curation in Egas. Database, 2016:baw096.

Mariana Neves and Ulf Leser. 2012. A survey on annotation tools for the biomedical literature. Briefings in bioinformatics, page bbs084.

Cartic Ramakrishnan, Abhishek Patnia, Eduard Hovy, and Gully APC Burns. 2012. Layout-aware text extraction from full-text PDF of scientific articles. Source code for biology and medicine, 7(1):7, jan.

Dietrich Rebholz-Schuhmann, Harald Kirsch, and Francisco Couto. 2005. Facts from text: is text mining ready to deliver? PLoS Biol, 3(2):e65.

Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun’ichi Tsujii. 2012. BRAT: a web-based tool for NLP-assisted text annotation. pages 102–107, apr.

U.S. National Library of Medicine. 2016. Detailed Indexing Statistics: 1965-2015.

Alexander S Yeh, Lynette Hirschman, and Alexander A Morgan. 2003. Evaluation of text data mining for database curation: lessons learned from the KDD challenge cup. Bioinformatics, 19(suppl 1):i331–i339.
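
As an illustration of the sentence alignment described in Sections 2.1.1 and 3, the following Python sketch shows one way a client could hold the per-sentence information (start and end indexes in the extracted plain text, plus a position within a PDF page) and navigate in both directions between the text view and the PDF view. It is only a sketch under stated assumptions: the paper does not specify the JSON schema produced by the Neji writer, so every field name here (start, end, page, x, y, width, height) and every function name is hypothetical, and the bounding-box representation of a sentence position is an assumption.

```python
import json
from bisect import bisect_right
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class SentenceRecord:
    """One extracted sentence. All field names are hypothetical."""
    start: int     # offset of the first character in the extracted plain text
    end: int       # offset one past the last character
    page: int      # PDF page on which the sentence appears
    x: float       # left edge of an assumed bounding box on that page
    y: float       # top edge of the bounding box
    width: float
    height: float


def to_json(sentences: List[SentenceRecord]) -> str:
    """Serialize the records as a JSON array of objects."""
    return json.dumps([asdict(s) for s in sentences])


def from_json(payload: str) -> List[SentenceRecord]:
    """Rebuild the records from the JSON produced by to_json."""
    return [SentenceRecord(**item) for item in json.loads(payload)]


def sentence_at_offset(sentences: List[SentenceRecord], offset: int) -> Optional[int]:
    """Text to PDF: index of the sentence containing a plain-text offset.

    Assumes the records are sorted by start and do not overlap.
    """
    starts = [s.start for s in sentences]
    i = bisect_right(starts, offset) - 1
    if i >= 0 and sentences[i].start <= offset < sentences[i].end:
        return i
    return None


def sentence_at_point(sentences: List[SentenceRecord],
                      page: int, x: float, y: float) -> Optional[int]:
    """PDF to text: index of the first sentence whose box contains the point."""
    for i, s in enumerate(sentences):
        if (s.page == page
                and s.x <= x <= s.x + s.width
                and s.y <= y <= s.y + s.height):
            return i
    return None
```

With records like these, clicking a sentence in the text panel amounts to looking up its page and box and scrolling the PDF view there, while a double-click on the PDF resolves to a sentence index through sentence_at_point. The linear scan is a deliberate simplification; a per-page index would be a natural refinement for long documents.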