OCMiner: Text Processing, Annotation and Relation Extraction for the Life Sciences Timo Böhme, Matthias Irmer, Anett Püschel, Claudia Bobach, Ulf Laube, and Lutz Weber OntoChem GmbH, Halle (Saale), Germany {timo.boehme,matthias.irmer,anett.pueschel, claudia.bobach,ulf.laube,lutz.weber}@ontochem.com http://www.ontochem.com Abstract. We present OCMiner, a high-performance text processing system for large document collections of scientific publications. Several linguistic options allow adjusting the quality of annotation results which can be specialized and fine-tuned for the recognition of Life Science terms. Recognized terms are mapped to semantic concepts which are ontologically located within their respective domain taxonomies. Rely- ing on a correct identification and semantic interpretation of mentions of domain concepts, relations between entities are extracted. The an- notated text, as well as extracted knowledge triples, can be visualized on a web-based front-end at http://www.ocminer.com/, permitting an explorative information retrieval. Keywords: text mining, chemical named entity recognition, relation extraction, explorative information retrieval 1 Background Life Science knowledge mining methods rely on a correct annotation of terms and phrases with concepts from different knowledge domains – in particular chemistry, proteins and diseases – followed by the application of suitable semantic relation extraction algorithms. We present the implementation of a high quality context sensitive annotation of named entities in text documents that makes use of an exchangeable set of chemistry, protein and disease ontologies. A particular challenge in recognizing Life Science terms in free text is chem- ical named entity recognition. The difficulty of correctly annotating chemical terms lies in the large number of chemical terms and chemicals as well as in the great variability of chemical expressions: There are trivial and systematic names for chemical compounds and classes, as well as formulas and trade names for drugs. Chemical names can be extremely long and may contain variations of meaningful punctuation symbols and parentheses. Moreover, different chem- istry name types can even be mixed within one chemical expression. Similarly, the recognition of protein terms in texts and the correct mapping to protein concepts is a non-trivial issue. Protein terms are often abbreviated and appear 2 Böhme et al. in various spelling variants (with or without hyphens, spaces etc., e.g. FLT1, FLT-1, FLT 1) and may be confused with other terms (e.g. ASK protein). Like- wise, frequent disease terms are often homonymous to other concepts, e.g. a “flash” might be a physiological circumstance only in certain contexts. In sum, the precise identification of named entities is an important prerequisite for the extraction of correct and relevant relations between annotated concepts, e.g. metabolic pathways or relations between chemicals and diseases. 2 System Description OCMiner is a modular processing pipeline for unstructured information based on the Apache UIMA framework. The system architecture is depicted in Fig. 1. Annotation pipeline Pre-dictionary modules input via file system or normalizer XML language abbrev. document web service & detagger detector tokenizer annotator structure Post-dictionary modules OCR Dictionary Chemistry-specific modules picture annotation name-2- class/group formula PDF annotated structure recognition annotator XML PDF chemistry documents full text reader dictionary coordinated PDF proteins cleanup NE rule entity dictionary annotations combiner extracted resolution diseases data XML dictionary formats full text reader other XML domain Relation extraction modules dictionaries custom phrase relation relation web reader tagger matcher flitering service other formats index Ontological domain knowledge Search and look-up chemistry search back-end proteins diseases anatomy celllines web ... front-end web server client browser Fig. 1. OCMiner document processing pipeline Documents are read from a variety of sources (text and picture PDF, XML, etc.) and standardized for further analysis. Then, preparatory processes such as language detection, sentence splitting, tokenization, document structuring, etc. take place. As the core of the annotation process, we have a dictionary-based named entity recognition module which uses a high performance dictionary look- up technology with support for very large dictionaries (our chemical dictionary has about 34 million entries). It implements specific language and dictionary dependent treatment options, e.g. spelling variations, spaces/hyphens, diacrit- ics, Greek letters, plural forms. This context-sensitive fine-tuning is especially important in the annotation of protein and chemistry terms. OCMiner 3 Importantly, recognized terms are semantically interpreted as mentions of concepts that are ontologically located within domain-specific taxonomies. Our dictionaries are generated from fine-grained domain ontologies in form of con- ceptual taxonomies. This semantic mapping provides the basis for ontological search methods and knowledge extraction technologies. Particular importance is given to the chemical dictionary. It is generated from a compound database built from various publicly available sources such as PubChem, MeSH, Drug- Bank, ChEMBL, among others. Our system is able to automatically arrange compounds into a single chemical ontology according to their structure or their functional properties [1]. As a consequence, a given textual expression is not only recognized as a chemical term but also semantically interpreted as a men- tion of a chemical entity which is precisely classified in the taxonomy. Similarly, the knowledge of other domains is hierarchically organized into taxonomies of concepts of varying specifity, eg. species, diseases or anatomy. Additional components handle specific scenarios. For instance, the abbrevi- ation annotator finds expansions of acronyms and abbreviated terms. Another module recognizes expressions like “vitamin A and B” as a coordinated entity and annotates “vitamin A” as such and “B” as “vitamin B”. A chemistry-specific module tries to recognize whether a given chemical expression refers to a specific compound, a compound class, or a substituent group/fragment. A processing step which serves as a prerequisite for relation extraction is the combination of annotated concepts to complex entities. Thus, post-dictionary modules combine sequences of named entities. For instance, the text phrase “human raf kinase inhibitor”, initially annotated as a sequence of named en- tities [organism human] [protein raf] [protein kinase] [chemistry inhibitor], is com- bined to a single – though internally complex – entity referring to a chemical compound class: [chemistry [protein [organism human] raf kinase] inhibitor]. This is especially useful for the recognition of higher-level combined entities made up of constituents of the domains chemistry, proteins, species, anatomy and celllines. Our system applies a shallow pattern-based approach to the extraction of re- lations between annotated concepts from different domains, e.g. between chem- icals and diseases, or metabolic pathways, physico-chemical properties of com- pounds, etc. First, the annotated input text is tokenized into phrase tokens, where named entities, including higher-level combined entities, constitute single tokens. Note that the system does not rely on part-of-speech tagging, parsing or other sophisticated but time-consuming natural language processing techniques. Instead, extraction rules work on phrase tokens and take specific attributes of involved named entities into account, such as the type of a chemical entity. A dedicated relation ontology defines a taxonomy of relations to be extracted. Ex- amples for relation concepts are “[compound] treats [disease]” or “[compound] metabolizes to [compound]”. For each relation concept, specific mappings from natural language syntax patterns to semantic normalizations are defined. Pat- tern definitions have a syntax and a complexity similar to regular expressions, allowing for nested grouping and variable order of tokens. The relation matcher module matches the tokenized input text against these rules and generates nor- 4 Böhme et al. malized relation representations in form of triples of concept identifiers: hentity1 , relation concept, entity2 i. Processed documents and extracted information can be stored in various ways. First, XML documents with inline annotations of recognized concept men- tions can be generated out of heterogeneous input documents. Second, extracted knowledge such as keyword lists and relations between mentioned concepts (i.e. knowledge triples) can be stored in various formats (custom XML formats, RDF triples, SBML, CML, etc.) and accessed as a web service. Third, annotated enti- ties and relation triples can be stored in an index (triple store or Lucene index), which is used for a web-based retrieval and visualization of the data. In par- ticular, the index is accessed by a search back-end, which provides the indexed data to the web front-end. The OCMiner front-end displays annotated docu- ments and provides a user interface for the navigation along relation chains, e.g. from chemicals over proteins to diseases, permitting an explorative information retrieval on multitudes of scientific publications (Fig. 2). Fig. 2. OCMiner web interface for relation navigation 3 Evaluation The system was evaluated as part of the BioCreative IV challenge. In the CHEMD- NER task for evaluating chemical NER, we obtained a precision of 85% at a recall of 71% (F-score 78%) [2]. In the CTD task, which consisted in providing annota- tions as a web service, our system reached an outstanding response time of 0.14 s/document, while ranking among the first two teams in annotation quality [3]. References 1. Bobach, C., T. Böhme, U. Laube, A. Püschel, L. Weber (2012): Automated com- pound classification using a chemical ontology, J. of Cheminformatics 4(1), 40. 2. Irmer, M., C. Bobach, T. Böhme, U. Laube, A. Püschel, L. Weber (2013): Chem- ical Named Entity Recognition with OCMiner, Proceedings of the 4th BioCreative challenge evaluation workshop, vol. 2, 92-96. 3. Wiegers, T. C., A. P. Davis and C. J. Mattingly (2014): Web services-based text- mining demonstrates broad impacts for interoperability and process simplification. Database. doi:10.1093/database/bau050.