MeDetect: Domain Entity Annotation in Biomedical References Using Linked Open Data

Li Tian¹, Weinan Zhang¹, Haofen Wang¹, Chenyang Wu¹, Yuan Ni², Feng Cao², Yong Yu¹

¹ Shanghai Jiao Tong University, Shanghai; ² IBM China Research Laboratory, Beijing
¹ {tianli,wnzhang,whfcarter,wucy,yyu}@apex.sjtu.edu.cn; ² {niyuan,caofeng}@cn.ibm.com

Abstract. With the ever-growing use of textual medical records, annotating domain entities has come to be regarded as an important task in the biomedical field. At the same time, open data sources are being actively interlinked within the Linking Open Data (LOD) project. The linked data cloud contains a very large number of entities, along with a very large number of properties describing the semantic relationships between them. In this paper, we propose a knowledge-incentive approach based on LOD for entity annotation in the biomedical field. Based on this approach, we implement MeDetect, a prototype system for this task. The experimental results verify the effectiveness and efficiency of our approach.

Keywords: Domain Entity Annotation, Linked Open Data

1 Introduction

Entity annotation aims at discovering entities in references automatically. It is useful for many tasks, including information extraction, classification, text summarization, question answering, and literature-based knowledge discovery. Meanwhile, the Web as a global information space is developing from a Web of documents into a Web of data. Billions of triples are currently publicly available in Web data sources from different domains, and these data sources are becoming more tightly interrelated as the number of links in the form of mappings grows. Building on these two observations, our contributions in this paper can be summarized as follows.

1. We have proposed a novel knowledge-incentive approach based on LOD for entity annotation in the biomedical field.
   This approach offers data flexibility, language independence, and semantic relationship enrichment, which make it more convenient and informative for further applications.
2. We have proposed to use collective annotation, leveraged by LOD information, to conduct entity filtering and disambiguation.
3. We have developed MeDetect to implement our proposal. The experimental results verify the effectiveness and efficiency of our approach.

2 Methods

The overall design of MeDetect is shown in Figure 1.

Fig. 1. The architecture of MeDetect

Data Preparation. This step is conducted off-line. It generates two data structures for the on-line process of MeDetect: an entity glossary and entity relation data. For biomedical ontologies such as DrugBank, we directly use their entities and relations to build the entity glossary and the relation data. For general-purpose ontologies such as DBpedia, which also contain non-biomedical entities, we propose a two-stage approach. In the first stage, we use the owl:sameAs links from LODD to DBpedia to find seed entities, and then expand the entity set from this seed set using the category information shared between entities. In the second stage, we use a Support Vector Machine (SVM) [1] as a binary classifier to pick the biomedical entities from the candidate set.

Entity Name Recognition. Given the biomedical entity glossary, the on-line entity name recognition step syntactically matches the entity names in the glossary against the content of the input biomedical references. The recognized entities, together with their URIs, are passed to the next step as candidate entities to be further filtered.

Entity Filtering and Disambiguation. The last, but most important, on-line step of MeDetect is entity filtering and disambiguation. We implement this step based on recent work on Web page annotation, Collective Annotation [2].
This approach not only estimates the importance of each entity for the input text but also, more importantly, filters out irrelevant entities based on inter-entity relationships. Collective annotation relies on two functions: the single entity importance function and the entity pair coherence function.

The single entity importance function estimates the relevance between an entity and the input text, based on their syntactic and semantic similarity, using logistic regression or category-based matching. Here, the entity description information available through its URI can be matched against the input text.

The entity pair coherence function judges the topic similarity or consistency of pairs of entity URIs so as to filter out noise and cope with ambiguity. For example, if a candidate entity has no relationship or common topic with the others, it is quite likely to be noise. Likewise, if a candidate entity name has more than one URI, the entity pair coherence function calculates the coherence of each of these URIs with the URIs of the other entities and chooses the most coherent one as the final URI for this entity name. This is how ambiguity is handled. In MeDetect, we use a LOD neighborhood overlap calculation [3] to implement the entity pair coherence function, as shown in Figure 2.

Fig. 2. LOD neighborhood overlap calculation in collective annotation of MeDetect

Finally, Figure 3 shows an example of MeDetect entity annotation for a biomedical reference. With the URI of each extracted entity, further information (such as the description of each entity and its links to related entities) can be provided directly in the annotation service.

Fig. 3. An example of MeDetect entity annotation

3 Results

The effectiveness and efficiency of MeDetect are evaluated in an experimental study.
In the experiment, 120 paper abstracts on different biomedical topics are randomly selected from PubMed, and three human judges with a biomedical or computer science background score the output entities for each abstract. For a comparative evaluation of the quality of MeDetect, we also run MetaMap and LingPipe on the same abstracts. Here, MeDetect\FD denotes MeDetect without filtering and disambiguation.

Compared      #Test        #Output    #Correct   Average    Average
systems       references   entities   entities   accuracy   running time
MeDetect      120          598        455        76.1%      20.2 ms
MeDetect\FD   120          683        468        68.5%      11.9 ms
MetaMap       120          1,062      412        35.4%      601.4 ms
LingPipe      120          782        510        65.2%      69.3 ms

Table 1. Performance comparison among MeDetect, MeDetect\FD, MetaMap, and LingPipe

As Table 1 shows, the average accuracy of MeDetect (76.1%) is higher than that of MetaMap (35.4%) and LingPipe (65.2%). Without entity filtering and disambiguation, MeDetect\FD is less accurate (68.5%), although it runs faster (11.9 ms versus 20.2 ms per reference). In sum, MeDetect provides the most satisfactory performance: it is the most accurate of the compared systems and is faster than both MetaMap and LingPipe.

4 Conclusion

This paper describes a novel knowledge-incentive approach based on LOD for entity annotation in the biomedical field. The approach offers data flexibility, language independence, and semantic relationship enrichment, which make it more adaptive and informative for further applications. We implement a prototype system, MeDetect, to demonstrate our approach to domain entity annotation for biomedical references. Its three key components are data preparation, entity name recognition, and entity filtering and disambiguation. The system demonstrates high annotation accuracy and the flexibility to incorporate additional LOD sources. In future work, we will enrich the entity glossary of MeDetect by adding more LOD sources. More importantly, MeDetect will be further used for triple extraction from biomedical references.

References

1. Suykens J.A.K. and Vandewalle J. Least Squares Support Vector Machine Classifiers. Neural Processing Letters 1999.
2. Kulkarni S., Singh A., Ramakrishnan G.
   and Chakrabarti S. Collective Annotation of Wikipedia Entities in Web Text. SigKDD Proc. 2010.
3. Zhou W., Wang H., Chao J., Zhang W. and Yu Y. LODDO: Using Linked Open Data Description Overlap to Measure Semantic Relatedness Between Named Entities. JIST Proc. 2011.
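
Illustrative sketch. The entity pair coherence step described in Section 2 can be illustrated with a small Python example. This is only a sketch under stated assumptions, not the MeDetect implementation: the Jaccard overlap of neighbor sets stands in for the LOD neighborhood overlap measure of [3], whose exact form is not given here; the greedy per-name selection simplifies the joint optimization of [2]; the single entity importance function is omitted; and all function names, the ex: URIs, and the toy neighbor sets are hypothetical.

```python
def neighborhood_overlap(a, b):
    """Jaccard overlap of two LOD neighbor sets (assumed stand-in for [3])."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def disambiguate(candidates, neighbors, min_coherence=0.0):
    """Choose at most one URI per entity name using pairwise coherence.

    candidates: dict mapping entity name -> list of candidate URIs
    neighbors:  dict mapping URI -> set of neighboring LOD URIs
    Names whose best URI does not exceed min_coherence are treated as noise.
    """
    result = {}
    for name, uris in candidates.items():
        # Candidate URIs of all *other* entity names form the context.
        context = [u for other, us in candidates.items()
                   if other != name for u in us]
        best_uri, best_score = None, min_coherence
        for uri in uris:
            # Total coherence of this URI with every context URI.
            score = sum(
                neighborhood_overlap(neighbors.get(uri, set()),
                                     neighbors.get(c, set()))
                for c in context
            )
            if score > best_score:
                best_uri, best_score = uri, score
        if best_uri is not None:
            result[name] = best_uri
    return result


# Hypothetical toy data: "fever" is ambiguous, "apple" is unrelated noise.
candidates = {
    "aspirin": ["ex:Aspirin"],
    "ibuprofen": ["ex:Ibuprofen"],
    "fever": ["ex:Fever_symptom", "ex:Fever_album"],
    "apple": ["ex:Apple_Inc"],
}
neighbors = {
    "ex:Aspirin": {"ex:NSAID", "ex:Analgesic"},
    "ex:Ibuprofen": {"ex:NSAID", "ex:Analgesic", "ex:Inflammation"},
    "ex:Fever_symptom": {"ex:Analgesic", "ex:Infection"},
    "ex:Fever_album": {"ex:Music", "ex:Pop"},
    "ex:Apple_Inc": {"ex:Computer", "ex:Smartphone"},
}

result = disambiguate(candidates, neighbors)
print(result)
```

On this toy data, the sketch resolves "fever" to ex:Fever_symptom, because only that URI shares neighbors with the drug entities, and it drops "apple", whose single candidate has no overlap with any other entity. A real system would combine this coherence score with the single entity importance score before deciding.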