=Paper=
{{Paper
|id=Vol-2535/paper_4
|storemode=property
|title=Ontology-based Entity Recognition and Annotation
|pdfUrl=https://ceur-ws.org/Vol-2535/paper_4.pdf
|volume=Vol-2535
|authors=Thomas Hoppe,Jamal Al Qundus,Silvio Peikert
|dblpUrl=https://dblp.org/rec/conf/qurator/HoppeQP20
}}
==Ontology-based Entity Recognition and Annotation==
Ontology-based Entity Recognition and Annotation*

Thomas Hoppe<sup>1,2</sup>, Jamal Al Qundus<sup>1</sup>, Silvio Peikert<sup>1</sup>

<sup>1</sup> Fraunhofer-Institut FOKUS, Berlin, Germany

<sup>2</sup> Hochschule für Technik und Wirtschaft, Fachbereich 4, Angewandte Informatik, Berlin, Germany

{jamal.al.qundus, thomas.hoppe, silvio.peikert}@fokus.fraunhofer.de

Abstract. The majority of transmitted information consists of written text, either printed or electronic. Extraction of this information from digital resources requires the identification of important entities. While Named Entity Recognition (NER) is an important task for the extraction of factual information and the construction of knowledge graphs, other information such as terminological concepts and relations between entities is of similar importance in the context of knowledge engineering, knowledge base enhancement and semantic search. While the majority of approaches focus on NER in the context of the World Wide Web and thus need to cover the broad range of common knowledge, we focus in the present work on the recognition of entities in highly specialized domains and describe our approach to ontology-based entity recognition and annotation (OER). Our approach, implemented as a first prototype, outperforms existing approaches in the precision of extracted entities, especially in the recognition of inflected terms and of compound terms such as "German Federal Ministry of Education and Research".

Keywords: Ontology, Entity Recognition, Text Annotation, DBpedia Spotlight, BioPortal Annotator.

===1 Introduction===
Two realms define the range in which entity recognition has to take place. One realm needs to cover a large and broad range of common entities, which are related to common knowledge and contained in the broad range of web resources and documents, largely consisting of factual information about named entities.
The other realm covers highly specialized information in monothematic application domains and has a strong focus on the terminology used in the domain. Although it often also covers a large, but still limited, set of entities, these entities are usually identified by complex names, such as the chemical compound Acetylsalicylic Acid or the job title Servicetechniker für Windkraftanlagen (service technician for wind turbines).

<sup>*</sup> This work has been partially supported by the "Wachstumskern Qurator – Corporate Smart Insights" project (03WKDA1F) funded by the German Federal Ministry of Education and Research (BMBF). Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Recognition of entities in the first realm is known as named entity recognition (NER). Since these approaches are often based on large common corpora, they are usually generic and domain-independent, and applicable to a broad range of application areas. Although they can cope with text in arbitrary domains, these approaches have problems recognizing all important entities in an application domain. Because of this incompleteness, they can achieve only limited recall. Further, their precision is limited by missing information.

We summarize approaches of the second realm, which are not limited to named entities, under the more general term entity recognition (ER). These approaches rely on given background knowledge about the entities in a particular domain. Thus, they are domain-dependent and applicable to a smaller range of domains, but configurable through the background knowledge. Their goal is to detect as many relevant domain entities as possible, thus enabling higher recall and precision. If such an approach uses knowledge formalized as an ontology<sup>1</sup>, we term it ontology-based entity recognition (OER).
An example of NER in the first realm is, among others, DBpedia Spotlight, an open source annotation tool for recognizing named entities in text and linking them to DBpedia resources [2]. DBpedia Spotlight can be trained with Wikipedia content for different languages. The quality of its entity spotting approach thus depends on the documents available in the Wikipedia of a particular language. This approach comes with additional limitations. Inflected and compound terms are often not recognized, as [3] points out. The coverage of entities from specialized domains is uneven: while, e.g., VIPs and genomics are covered in depth, the products of a particular company are covered only superficially.

Approaches of the first realm often require disambiguation mechanisms to decide which interpretation of a named entity is intended in a given context. DBpedia Spotlight either tries to identify the right interpretation automatically or lets the user select one of the interpretations [2].

Entity recognition in the second realm is based on the availability of given controlled vocabularies, which may originate from lists of terms, taxonomies, thesauri or ontologies. Although several papers describe and evaluate such systems, the systems themselves are usually not publicly available. One exception is the BioPortal Annotator<sup>2</sup>, which performs ER and annotation of documents based on a large number of biomedical ontologies. But the BioPortal Annotator has problems too: if more than one ontology is used for annotation, its results are highly redundant. Its ability to recognize compound terms is limited and, even if just one ontology is used for ER, it is rather slow.

Disambiguation plays a minor role in this second realm, since the number of polysemous terms in controlled vocabularies is usually rather small. Although a term like "construction" can be used in a narrow domain with different meanings, such as process, department, or task, these meanings are often strongly related. Hence, a clean differentiation of these meanings is not always necessary.

<sup>1</sup> We use the term ontology in the sense of [1] as "... an explicit, shared specification of a conceptualization" and interchangeably with knowledge model.

<sup>2</sup> https://bioportal.bioontology.org/annotator (last access Dec. 13th, 2019)

To complement DBpedia Spotlight, we decided to follow an approach similar to [3,4] and to develop a fast Ontology-based Recognition and Annotation system (OER), which accounts for common spelling errors as well as inflected and compound terms. In contrast to [3], however, we base the system on controlled vocabularies obtained from knowledge models. Although our approach, like [3], is based on a two-layered transducer architecture, we simplified the recognition process. [3] uses a parallel multi-process approach that looks ahead for compound terms in order to avoid backtracking. This approach may deliver several alternative compounds; therefore, a voting process chooses the best compound, i.e. the longest matching one. We use a single process instead, which scans through the text and recursively looks ahead for the longest compound. This approach avoids backtracking too, by deciding which longest sequence to keep when ascending from the recursive lookahead.

===2 Architecture===
The architecture of OER consists of two parts working in two subsequent phases. During the first phase, compilation, a language-dependent lemmaCache is initialized and the terminology of a knowledge model is precompiled into a lookupDictionary. The second phase, annotation, uses the precompiled lemmaCache and the lookupDictionary for the annotation of texts, as shown in Fig. 1.

Fig. 1. OER Architecture

====2.1 Lemmatization====
The lemmaCache is initialized from different sources depending on the language.
For German, the lemmaCache is initialized on the basis of a dump of Morphy's morphology dictionary<sup>3</sup>, which allows more than 400,000 German word forms to be lemmatized directly from the start. This cache is augmented during runtime via the API of Wortschatz Leipzig with additional lemmas derived from 1,000,000 sentences from Wikipedia or a news corpus. For English, the lemmas are currently derived from WordNet only.

<sup>3</sup> http://www.danielnaber.de/morphologie/ (last access Dec. 13th, 2019)

====2.2 Terminology Extraction====
A knowledge model of a domain is used as the source for deriving a controlled vocabulary consisting of (term, URI) tuples. The terms, which may consist of single tokens or token sequences, form the controlled vocabulary for the entity recognition. The URIs are the values used for annotating the recognized entities. By default, the terms are derived for knowledge models in OWL, RDFS and SKOS from rdfs:label, skos:prefLabel, skos:altLabel or skos:hiddenLabel. However, since a knowledge model may consist of more complex structures built from concepts, preferred terms and their preferred labels, we also allow users of OER to define their own derivation pattern for (term, URI) tuples via a user-defined SPARQL query. For simpler knowledge models, we also allow the specification of (term, URI) tuples via CSV.

Especially for German, it is important that the entity recognition can recognize different spelling variants of the same entity. German is famous for its compound nouns, creating new nouns by connecting adverbs (Soforthilfe), adjectives (Dreirad), verbs (Fahrlehrer) and nouns (Mädchenhandelsschule). However, under certain circumstances parts of a compound can be separated by a hyphen in order to improve legibility and to avoid ambiguity (Mädchen-Handelsschule). As the analysis of search queries has shown, authors and users often even separate these parts incorrectly by blanks (Robert Koch Institut)<sup>4</sup>.
These deficiencies may also occur combined in different variations, e.g. Johannes Gutenberg-Universität Mainz. In order to recognize all these variations easily as referring to the same entity or concept, we compute them in advance and store them in a two-layered prefix tree structure.

<sup>4</sup> https://deppenleerzeichen.de/ (last access Dec. 13th, 2019)

====2.3 Two-Layered Tree-based Recognizer====
The first layer is based on a radix tree that forms the termStore. Each lemmatized term contained in a term sequence is used as a key of the termStore to store and access a unique id for each lemma. The list of unique ids of each term sequence is subsequently used as a key in a prefix tree called sequenceStore, which forms the second layer. These lists of unique ids are used to store and access the URIs of the entities and concepts corresponding to the term sequences.

Thus, a term sequence like gewählter Abgeordneter des deutschen Bundestags (elected member of the German Bundestag) is lemmatized and normalized as gewählt abgeordnete des deutsch bundestag, which in turn is translated into a numerical list, e.g. [2643, 92, 83634, 12344]. This encoding of the term sequence is used as a list-based key to access the URI of the corresponding knowledge model concept in the sequenceStore. This translation saves space through the numerical encoding of strings and allows different inflections of labels, such as gewählten Abgeordneten des deutschen Bundestag, to be mapped to the same URI.

====2.4 Recognition and Annotation by Compound Term Lookahead====
This two-layered data structure is set up during the compilation phase and used at run-time to scan a given text in order to recognize and annotate compound terms.

Suppose that the knowledge model contains two additional concepts with the labels gewählte Abgeordnete and deutscher Bundestag. Assume further that we want to annotate the following text: Als gewählte Abgeordnete des deutschen Fischzüchterverbandes reisen sie nach Berlin und treffen die gewählten Abgeordnete des deutschen Bundestags. (As elected representatives of the German fish farmers' association, they travel to Berlin and meet the elected members of the German Bundestag.)
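The two-layered store of Section 2.3, filled with the three labels of this example, can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: a plain dictionary replaces the radix tree of the first layer, nested dictionaries serve as the prefix tree of the second layer, the small `LEMMAS` table stands in for the lemmaCache, and the class name `TwoLayerStore` as well as the `ex:` URIs are hypothetical.

```python
from typing import Dict, List, Optional

# Toy lemma table standing in for the lemmaCache (illustrative, not Morphy data).
LEMMAS: Dict[str, str] = {
    "gewählter": "gewählt", "gewählte": "gewählt", "gewählten": "gewählt",
    "abgeordneter": "abgeordnete", "abgeordneten": "abgeordnete",
    "deutscher": "deutsch", "deutschen": "deutsch",
    "bundestags": "bundestag",
}

URI_KEY = "uri"  # marks a node that ends a complete term sequence


def lemmatize(token: str) -> str:
    """Lowercase the token and map it to its lemma if one is known."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)


class TwoLayerStore:
    """Hypothetical stand-in for the lookupDictionary."""

    def __init__(self) -> None:
        self.term_store: Dict[str, int] = {}  # layer 1: lemma -> unique id
        self.sequence_store: dict = {}        # layer 2: prefix tree over ids

    def add(self, label: str, uri: str) -> None:
        """Register one (term, URI) tuple under its lemmatized id sequence."""
        node = self.sequence_store
        for token in label.split():
            term_id = self.term_store.setdefault(
                lemmatize(token), len(self.term_store))
            node = node.setdefault(term_id, {})
        node[URI_KEY] = uri

    def encode(self, tokens: List[str]) -> Optional[List[int]]:
        """Translate tokens into lemma ids; None if any lemma is unknown."""
        ids = []
        for token in tokens:
            term_id = self.term_store.get(lemmatize(token))
            if term_id is None:
                return None
            ids.append(term_id)
        return ids

    def get(self, tokens: List[str]) -> Optional[str]:
        """Return the URI stored for the token sequence, if any."""
        ids = self.encode(tokens)
        if ids is None:
            return None
        node = self.sequence_store
        for term_id in ids:
            node = node.get(term_id)
            if node is None:
                return None
        return node.get(URI_KEY)


store = TwoLayerStore()
store.add("gewählter Abgeordneter des deutschen Bundestags", "ex:MdB")
store.add("gewählte Abgeordnete", "ex:Abgeordnete")
store.add("deutscher Bundestag", "ex:Bundestag")

# An inflected variant of the first label resolves to the same URI.
print(store.get("gewählten Abgeordneten des deutschen Bundestag".split()))
```

Because lookups go through lemma ids, differently inflected forms of a label share one path in the second layer, which is the property the text relies on.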
The recognition and annotation process simply scans the tokenized text from the beginning until a term contained in the termStore is reached (see Fig. 2). In the example, this is gewählte. Starting from this term, a lookahead is performed that searches for the longest sequence of terms which are contained in the termStore and which form a term sequence contained in the sequenceStore. As soon as a subsequent term is not included in the termStore, the lookahead process terminates and delivers the longest term sequence still contained in the sequenceStore together with its length (see Fig. 3).

 annotate(text, lc, ld):   /* lc (lemmaCache), ld (lookupDictionary) */
   tt := tokenize(text); s := p := 0; at := ''; a := []
   for token in tt:
     p := p + 1
     if s > 1:              /* skip tokens covered by the last match */
       s := s - 1
       continue
     elseif not lc.lemmatize(token) in ld.termStore:
       at := at + ' ' + token
     else:
       (phrase, l) := lookahead(tt[p:], [token], 1)
       if not phrase == []:
         URI := ld.get(phrase)
         if not URI == '':
           a.append((p, phrase, URI))
           at := at + ' ' + wrapHTML(phrase)
           s := l           /* remember the length of the match */
           continue
       at := at + ' ' + token
   return (a, at)

Fig. 2. Pseudo Code of Annotation Process

The lookahead identifies gewählte Abgeordnete des deutschen as a sequence of terms that are each included in the termStore. As soon as it determines that Fischzüchterverbandes is not included in the termStore, it falls back to the longest term sequence found so far, gewählte Abgeordnete, since longer term sequences are not contained in the sequenceStore.

The longest term sequences found are used to retrieve the corresponding annotation values from the sequenceStore; the lengths of the identified term sequences are used to skip the next n tokens in the scan pipeline.

Eventually, this process recognizes for the example text the annotations of the term sequences gewählte Abgeordnete and gewählte Abgeordnete des deutschen Bundestags.
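The scan with longest-match lookahead described above can also be sketched in runnable form. The following Python is only an illustration under stated assumptions: the lookahead is written iteratively rather than recursively, the stores are a flat set and a flat dictionary instead of the two tree layers, tokenization is a naive whitespace split, and the `LEMMAS` table, the function names and the `ex:` URIs are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Toy lemma table standing in for the lemmaCache (illustrative only).
LEMMAS: Dict[str, str] = {
    "gewählter": "gewählt", "gewählte": "gewählt", "gewählten": "gewählt",
    "abgeordneter": "abgeordnete", "abgeordneten": "abgeordnete",
    "deutscher": "deutsch", "deutschen": "deutsch",
    "bundestags": "bundestag",
}

Key = Tuple[str, ...]


def lemmatize(token: str) -> str:
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)


def build(vocabulary: Dict[str, str]) -> Tuple[Set[str], Dict[Key, str]]:
    """Compile {label: URI} into a term store and a sequence store."""
    term_store: Set[str] = set()
    sequence_store: Dict[Key, str] = {}
    for label, uri in vocabulary.items():
        key = tuple(lemmatize(t) for t in label.split())
        term_store.update(key)
        sequence_store[key] = uri
    return term_store, sequence_store


def lookahead(lemmas: List[str], start: int,
              term_store: Set[str], sequence_store: Dict[Key, str]) -> int:
    """Length of the longest stored sequence beginning at `start` (0 if none)."""
    best = 0
    end = start
    # Advance while tokens are known terms; remember the longest full match.
    while end < len(lemmas) and lemmas[end] in term_store:
        end += 1
        if tuple(lemmas[start:end]) in sequence_store:
            best = end - start
    return best


def annotate(text: str, term_store: Set[str],
             sequence_store: Dict[Key, str]) -> List[Tuple[int, str, str]]:
    """Return (token position, matched phrase, URI) for each longest match."""
    tokens = text.replace(".", " ").replace(",", " ").split()  # naive tokenizer
    lemmas = [lemmatize(t) for t in tokens]
    annotations = []
    i = 0
    while i < len(tokens):
        n = lookahead(lemmas, i, term_store, sequence_store)
        if n:
            uri = sequence_store[tuple(lemmas[i:i + n])]
            annotations.append((i, " ".join(tokens[i:i + n]), uri))
            i += n  # skip the tokens covered by the match
        else:
            i += 1
    return annotations


terms, sequences = build({
    "gewählter Abgeordneter des deutschen Bundestags": "ex:MdB",
    "gewählte Abgeordnete": "ex:Abgeordnete",
    "deutscher Bundestag": "ex:Bundestag",
})
text = ("Als gewählte Abgeordnete des deutschen Fischzüchterverbandes reisen sie "
        "nach Berlin und treffen die gewählten Abgeordnete des deutschen Bundestags.")
for position, phrase, uri in annotate(text, terms, sequences):
    print(position, phrase, uri)
```

On the example text this sketch yields two annotations, the two-token match at token position 1 and the five-token match at position 13, which corresponds to the outcome described for the example.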
 lookahead(tt, fp, n):
   if tt == []: return (fp, n)
   termFound := lc.lemmatize(tt[0]) in ld.termStore
   phraseFound := fp in ld.sequenceStore
   if termFound or phraseFound:
     (ph, l) := lookahead(tt[1:], fp.append(tt[0]), n+1)
     if ph in ld.sequenceStore: return (ph, l)
     elseif phraseFound: return (fp, n)
   return ([], n)

Fig. 3. Pseudo Code of LookAhead Procedure

===3 Evaluation===
In a first evaluation during development, we compared this solution with DBpedia Spotlight on recruitment-related German texts and with the BioPortal Annotator on medical texts in English using the MeSH ontology. In both cases, our system is able to identify compound terms, in German as well as in English.

====3.1 Recruitment Domain====
For a first evaluation of OER's annotations against DBpedia Spotlight, we used the Recruitment Thesaurus of Ontonym<sup>5</sup>, which currently consists of more than 16,000 concepts and more than 20,000 labels, some of them multilingual. As an illustration, the following text excerpt from Wikipedia leads to the annotations shown in Fig. 4:

„Medizinisch-technischer Assistent (MTA) ist die Sammelbezeichnung für die vier Berufsbilder der technischen Assistenten in der Medizin und Tiermedizin im deutschen Gesundheitswesen. Sie umfasst im Einzelnen die Ausbildungsberufe:

Medizinisch-technischer Assistent – Funktionsdiagnostik (MTAF)

Medizinisch-technischer Laboratoriumsassistent (MTLA oder MTA-L)

Medizinisch-technischer Radiologieassistent (MTRA, MTA-R oder RTA)

Veterinärmedizinisch-technischer Assistent (VMTA)

Der Namensbestandteil „-assistent“ kann zur Verwechslung mit dem Beruf des medizinischen Fachangestellten (Arzthelfer) führen, der sich in Ausbildung und Tätigkeit aber deutlich unterscheidet.“

<sup>5</sup> A former spin-off (2008-2015) of the Freie Universität Berlin and the first author.

Fig. 4. Comparison of Annotations from DBpedia Spotlight (top) and OER (bottom)

Fig. 4 and Table 2 (in the appendix) indicate that the use of specialized domain knowledge can improve the recall and precision of annotation in comparison to an annotator using a broader and more general knowledge model.

====3.2 Medical Domain====
In a second evaluation, we compared OER with the BioPortal Annotator on medical texts annotated with MeSH<sup>6</sup> [4]. Table 1 shows the annotations of the following text excerpt from Wikipedia:

“Aspirin, also known as (Acetylsalicylic Acid), (ASA), is a medication used to treat pain, fever, or inflammation. Specific inflammatory conditions which aspirin is used to treat include Kawasaki disease, pericarditis, and rheumatic fever. Aspirin given shortly after a heart attack decreases the risk of death. Aspirin is also used long-term to help prevent further heart attacks, ischaemic strokes, and blood clots in people at high risk. It may also decrease the risk of certain types of cancer, particularly colorectal cancer. For pain or fever, effects typically begin within 30 minutes. Aspirin is a nonsteroidal anti-inflammatory drug (NSAID) and works similarly to other NSAIDs but also suppresses the normal functioning of platelets.”

{| class="wikitable"
|+ Table 1. Annotations of the sample text<sup>8</sup>
! BioPortal Annotator !! OER
|-
| aspirin : 5 || aspirin : 5
|-
| risk : 3 || risk : 3
|-
| pain : 2 || pain : 2
|-
| fever : 2 || fever : 2
|-
| || acetylsalicylic acid : 1
|-
| inflammation : 1 || inflammation : 1
|-
| disease : 1 || kawasaki disease : 1
|-
| pericarditis : 1 || pericarditis : 1
|-
| rheumatic fever : 1 || rheumatic fever : 1
|-
| heart : 2 || heart attack : 1
|-
| || heart attacks : 1<sup>7</sup>
|-
| death : 1 || death : 1
|-
| || strokes : 1
|-
| blood : 2 || blood clots : 1
|-
| cancer : 1 || colorectal cancer : 1
|-
| || nsaids : 1
|-
| || platelets : 1
|}

In contrast to the mgrep-based approach of the BioPortal Annotator, as identified in [5], OER is not only able to find compound terms of MeSH concepts, it even finds annotations of terms that the BioPortal Annotator is not able to recognize.

===4 Summary===
Entity Recognition is an important task for the identification of information in written text.
To address this challenge, we have implemented a first prototype of an Ontology-based Recognition and Annotation system (OER), which is fast and can handle common spelling mistakes, inflections, and compound terms. The architecture of OER supports two phases. In a first compilation phase, a language-dependent lemmaCache is initialized and a knowledge model is precompiled into a lookupDictionary, which allows terms of the controlled vocabulary to be identified quickly and their corresponding concept URIs to be retrieved. The second annotation phase uses these data structures to annotate texts by a single-threaded recursive scanning process of the text, which always delivers the longest matching term sequence. We could show that OER, through the use of domain knowledge, gives better annotations than DBpedia Spotlight. In contrast to the BioPortal Annotator, its annotations are more complete and it identifies compound terms better.

<sup>6</sup> https://www.nlm.nih.gov/mesh/meshhome.html (last access Dec. 13th, 2019)

<sup>7</sup> This difference is caused by WordNet's inability to lemmatize "attacks".

<sup>8</sup> Numbers indicate the number of occurrences of each term.

Currently OER is still in a prototype phase and has some limitations. One of these limitations is the lemmatization of German compound terms. Since such compounds usually do not appear in morphological dictionaries, we intend to augment the lemmaCache with a simple approach for splitting compounds, lemmatizing their head term and joining the lemmatized fragments together. Another limitation is the treatment of the different notations of gender-neutral terms, which can be solved rather easily. Because of the nature of the texts and domains we want to process with the system, we deliberately ignored the question of disambiguation for the initial development.
Of course, one limitation is introduced by the knowledge models used: only the terms contained in the knowledge model, their lemmas and the word forms related to these lemmas can be recognized by this approach. Therefore, the annotations will only be as good as the knowledge models themselves. However, we do not regard this as a limitation; instead, we consider it a feature, since it allows focusing on the entities contained in the knowledge model of a target domain [6].

Besides the lemmatization of compounds and the treatment of gender-neutral terms, an interesting, more experimental augmentation of the system would be the recognition of the semantic equivalence of certain noun phrases and compound nouns. Additionally, the annotation process could be extended by annotating terms with categorical information and by limiting the number of annotated terms based on a numerical measure of their specificity. Of course, further code optimizations and investigations of the quality of the annotations as well as of the speed of the annotation process still need to follow.

===References===
[1] R. Studer, V.R. Benjamins, D. Fensel, 'Knowledge Engineering: Principles and Methods', Data and Knowledge Engineering 25(1-2):161-197, Elsevier, 1998.

[2] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer, 'DBpedia spotlight: shedding light on the web of documents', in Proceedings of the 7th International Conference on Semantic Systems, 2011, pp. 1-8.

[3] C. Jilek, M. Schröder, R. Novik, S. Schwarz, H. Maus, and A. Dengel, 'Inflection-tolerant ontology-based named entity recognition for real-time applications', arXiv preprint arXiv:1812.02119, 2018.

[4] C. Jonquet, N. Shah, M. Musen, 'A System for Ontology-Based Annotation of Biomedical Data', in: A. Bairoch, S. Cohen-Boulakia, C. Froidevaux (eds), Data Integration in the Life Sciences, DILS 2008, Lecture Notes in Computer Science, Vol. 5109, Springer, Berlin, Heidelberg.

[5] D. Sanchez-Cisneros, F. Aparicio Gali, 'An Ontology-based named entity recognition system for biomedical texts', Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013).

[6] Miha Štravs, Jernej Zupančič, 'Named Entity Recognition Using Gazetteer of Hierarchical Entities', 10.1007/978-3-030-22999-3_65, in: Advances and Trends in Artificial Intelligence. From Theory to Practice, F. Wotawa, et al. (Eds.), LNAI 11606, pp. 768-776, Springer Nature, 2019.

===5 Appendix===
{| class="wikitable"
|+ Table 2. Comparison of OER and DBpedia Spotlight Annotation<sup>9</sup>
! Term (OER) !! Annotation (OER) !! Term (DBpedia Spotlight) !! Annotation (DBpedia Spotlight)
|-
| Medizinisch-technischer Assistent || ont:Medizinisch-Technischer-Assistent || Medizinisch-technischer Assistent || dbp:Medizinisch-technischer_Assistent
|-
| MTA || ont:Medizinisch-Technischer-Assistent || MTA || dbp:Medizinisch-technischer_Assistent
|-
| Berufsbilder || ont:Beruf || ||
|-
| technischen Assistenten || ont:technischer_Assistent || technischen Assistenten || dbp:Technischer_Assistent
|-
| Medizin || ont:Medizin || Medizin || dbp:Medizin
|-
| Tiermedizin || ont:Tiermedizin || Tiermedizin || dbp:Veterinärmedizin
|-
| deutschen || ont:Deutsch || ||
|-
| Gesundheitswesen || ont:Gesundheitswesen || Gesundheitswesen || dbp:Gesundheitssystem
|-
| Ausbildungsberufe || ont:Ausbildungsberuf || ||
|-
| Medizinisch-technischer Assistent || ont:Medizinisch-Technischer-Assistent || Medizinisch-technischer Assistent || dbp:Medizinisch-technischer_Assistent
|-
| Funktionsdiagnostik || ont:Funktionsdiagnostik || Funktionsdiagnostik || dbp:Medizinische_Untersuchung
|-
| MTAF || ont:Medizinisch-Technischer-Assistent_fuer_Funktionsdiagonstik || ||
|-
| Medizinisch-technischer Laboratoriumsassistent || ont:Medizinisch-Technischer_Laboratoriumsassistent || Medizinisch-technischer Assistent || dbp:Medizinisch-technischer_Assistent
|-
| MTLA || ont:Medizinisch-Technischer_Laboratoriumsassistent || ||
|-
| MTA-L || ont:Medizinisch-Technischer_Laboratoriumsassistent || ||
|-
| Medizinisch-technischer Radiologieassistent || ont:Medizinisch-Technischer_Radiologieassistent || Radiologieassistent || dbp:Radiologie
|-
| MTRA || ont:Medizinisch-Technischer_Radiologieassistent || MTRA || dbp:Medizinisch-technischer_Assistent
|-
| || || RTA || dbp:Radio_Television_Afghanistan
|-
| MTA-R || ont:Medizinisch-Technischer_Radiologieassistent || ||
|-
| Veterinärmedizinisch-technischer Assistent || ont:Veterinaermedizinisch-technischer_Assistent || ||
|-
| VMTA || ont:Veterinaermedizinisch-technischer_Assistent || ||
|-
| Beruf || ont:Beruf || ||
|-
| medizinischen Fachangestellten || ont:medizinische_Fachangestellter || medizinischen Fachangestellten || dbp:Medizinischer_Fachangestellter
|-
| Arzthelfer || ont:Arztfachhelfer || Arzthelfer || dbp:Medizinischer_Fachangestellter
|-
| führen || ont:Leitung || ||
|-
| Ausbildung || ont:Ausbildung || ||
|-
| Tätigkeit || ont:Aufgabe || ||
|}

<sup>9</sup> Name spaces of URIs are abbreviated. Terms and annotations found by either OER or DBpedia Spotlight alone are marked in green. Red marks wrong annotations, and orange marks annotations which are correct but not precise.