A Hybrid Approach to Domain-Specific Entity Linking

Alex Olieman, Jaap Kamps, Maarten Marx, Arjan Nusselder
University of Amsterdam, Amsterdam, The Netherlands
{olieman|kamps|maartenmarx}@uva.nl, arjan@nusselder.eu

ABSTRACT
The current state-of-the-art Entity Linking (EL) systems are geared towards corpora that are as heterogeneous as the Web, and therefore perform sub-optimally on domain-specific corpora. A key open problem is how to construct effective EL systems for specific domains, as knowledge of the local context should in principle increase, rather than decrease, effectiveness. In this paper we propose the hybrid use of simple specialist linkers in combination with an existing generalist system to address this problem. Our main findings are the following. First, we construct a new reusable benchmark for EL on a corpus of domain-specific conversations. Second, we test the performance of a range of approaches under the same conditions, and show that specialist linkers obtain high precision in isolation, and high recall when combined with generalist linkers. Hence, we can effectively exploit local context and get the best of both worlds.

Categories and Subject Descriptors
H.3.1 [Information Systems]: Content Analysis and Indexing – abstracting methods, indexing methods, linguistic processing.

Keywords
Information Extraction, Entity Linking, Semantic Annotation, Conversational Text, Minutes, Parliamentary Proceedings.

1. INTRODUCTION
In the Entity Linking (EL) task, textual mentions are linked to corresponding Knowledge Base (KB) entries. The majority of state-of-the-art EL systems utilize one or more open-domain KBs, such as Wikipedia, DBpedia, Freebase, or YAGO, as a basis for learning their entity recognition and disambiguation models [10]. This approach shows definite merit when the target corpus consists of texts with heterogeneous topical contents [9], e.g. in a random sample of news articles or blog posts.

Anyone with the desire to annotate a domain-specific (i.e. homogeneous) corpus, however, will at some point face sub-optimal results when using a domain-agnostic EL system. This problem has been identified as one of three promising research directions in this area [10]. The main aim of this paper is to investigate domain-specific Entity Linking.

The straightforward solution of training the state-of-the-art recognition and disambiguation models on the corpus (instead of on Wikipedia) can be extremely costly if accurate training data needs to be handcrafted from scratch. Alternatively, a generalist EL system can be used on the corpus without modification. This is clearly a minimum-cost option, but its performance depends highly on the similarity of the corpus with the text that the models are based on (e.g. Wikipedia articles). Currently, the most practical approach for domain-specific EL likely lies somewhere in the middle: some adaptation needs to occur, and preferably with minimum effort. In this paper, we propose to use specialized linkers for salient entity types within the corpus' domain, which can work in concert with a generally trained model.

We apply our approach to conversational text, in particular parliamentary proceedings, i.e. the minutes of parliamentary debates. When generating semantic annotations from conversational records, e.g. minutes or online conversations, the structure of the records already provides much useful information. It tells us, for instance, who was the speaker of each unit of speech, who spoke in response to whom, and who participated in the conversation. Additional information may be provided by metadata for each conversation, such as when and where it took place, between which group of people, or what the occasion or agenda was. Moreover, when the structure of the records is congruent, parsing this information is straightforward.

This offers a springboard for generating valuable annotations by applying subsequent NLP to the full records. This paper focuses on utilizing Information Extraction (IE) techniques–EL in particular–to enrich existing structure-based annotations. The techniques under investigation in this paper are designed to be applicable to written records of any kind of conversation.

Our contribution lies in answering the following questions:
1. How can mentions of the most salient entity types within a corpus be linked at a low cost in terms of system development and domain expertise?
2. How can a reusable benchmark for EL on conversations be constructed that allows comparison between systems and combinations of systems?
3. How effective are the specialist linkers, and how effective is their hybrid combination with generalist EL systems?

2. RELATED WORK
Until the beginning of the 21st century, it was common to collect the domain knowledge that was needed for an IE task in a KB [9]. Progress in supervised machine learning, and the availability of high-coverage encyclopedic resources, however, has led to the use of open-domain KBs in recent years. The domain-specific nature of IE is no longer expressed in the KB, but instead in the training data [9]. This has moved the adaptation cost of applying EL on a specific corpus from the system developer to the domain expert.

Efforts to reduce the need for domain experts have been made by semi-supervised adaptation of generalist models to a target corpus [10]. One promising direction is Transfer Learning, which is known to work for classification tasks [5], whereas this has not been demonstrated sufficiently for EL. Alternatively, a domain-relevant part of the KB can be selected by excluding KB-entries that are more likely to be generated by a parsimonious unigram model of the KB (with the corpus as background), than by the unigram corpus model [2]. In Berlanga et al. [2], KB-entries are also tailored by basing entity-specific language models on both the corpus and the KB.

The recently presented GERBIL [11] is a KB-agnostic EL benchmarking framework, which addresses issues with the comparability and reproducibility of EL systems and experiments. Our benchmark is complementary to GERBIL, in that it additionally allows combinations of EL systems to be evaluated.

3. DOMAIN-SPECIFIC LINKING
Our approach is to develop specialist linkers for entity types that are mentioned frequently in the target corpus. These linkers capitalize on a small amount of background knowledge, and achieve entity recognition and disambiguation by means of pattern detection, string matching, and structured queries against the corpus.

We have selected the Dutch parliamentary proceedings as the target corpus for an experiment. The corpus is available in an XML format with rich (structural) annotations, and covers the period from 1814 until today. The automated analysis of parliamentary proceedings is part of a larger international effort, and has been facilitated by previous work in the PoliticalMashup project [7].

Two off-the-shelf EL systems are used as baseline systems, and also as components for our combination approach. The first is DBpedia Spotlight v0.7, which takes raw text as input and produces links with generative models based on DBpedia and Wikipedia [4]. It distinguishes itself from other state-of-the-art EL systems by creating entity-specific language models from the context of Wikipedia page links, rather than from the pages themselves. The second system consists of separate entity recognition and disambiguation modules. Frog is an NLP workbench for Dutch [3], from which the phrase chunking module is used to identify noun phrases. The identified phrases are subsequently passed on to the UvA Semanticizer, which takes a learning to rerank approach to disambiguation [8].

3.1 Domain-specific candidate entities
The simplest way that we have considered to annotate entities of a specific type starts by collecting names for the entities in question, including acronyms. These names are stored in a dictionary, which maps them to canonical URIs. Subsequently, a state machine that encodes all names is constructed by the Aho-Corasick algorithm [1]. This allows the set of names to be matched in an arbitrary input string, and the URI of mentioned entities to be found in the dictionary.

This minimal-effort approach is fundamentally limited to entity types in which no ambiguity exists. The many-to-one mapping from names to URIs deals with synonymy, but does not allow a single name to be associated with multiple entities. Such a linker therefore needs to target a type with few instances, or in which ambiguous names are already avoided because they would confuse communication. It is difficult, for instance, to find brands in the same sector that share a name or acronym.

In our corpus we target Dutch political parties (n=155), because they are highly relevant as well as unambiguously named. The linker uses case-insensitive, leftmost-longest string matching. Any matches that are part of a longer token are rejected. The second prerequisite for such a simple approach is that none of the entity names should also occur as common words. If such names do occur, this may be addressed with case-sensitive string matching.

3.2 Genre-specific characteristics
The genre of conversational text exhibits several characteristics that can act as useful clues for EL. We focus on features that relate to mentioned persons, a salient entity type in many kinds of conversation. Conversations have a temporal aspect, i.e. all words have been spoken or written at some point in time, and it is likely for a longitudinal corpus that the frequency with which a particular person is mentioned varies over time. Conversations are also situated: they occur in a (virtual) space in which a person may be present or not. These features can be used for disambiguation.

If a corpus focuses on a specific domain of discourse, there may also be characteristic ways in which names are used, as in etiquette and/or jargon. Government and parliament members (in short: members) adhere to guidelines on how to address each other during a parliamentary debate. Members are addressed as, e.g., Mr., Mrs., colleague, minister, or secretary of state. This characteristic of the corpus can be utilized by detecting where a member is mentioned, thereby avoiding the ambiguity between their name and its homonyms. Government members are often mentioned only by their role (e.g. minister), which may be followed by a portfolio (e.g. the minister of finance). We have developed a regular expression that matches such patterns, and which avoids including words that are not part of a name or portfolio.

The phrases that are found by the regular expression need to be linked to the URIs of the mentioned members (n=3,664). In any parliamentary debate, most of the people that are mentioned are present in that session. The PoliticalMashup proceedings include a structured speakers list which we use to resolve such mentions. We use an index of members to disambiguate mentions of non-speakers, and query it with their name, and the date and the house in which the debate took place. A link is generated only if this query has a single result, i.e., when it can be created with high confidence.

We have also built an index of government members by time period, role, and portfolio, with which to resolve mentions by role. If the portfolio is not mentioned, the linker assumes that the mentioned person is a speaker. The speakers list is searched for any members with the mentioned role (i.e. minister or secretary). If there are multiple candidates, we assume that the last-mentioned member with this role is mentioned here.

4. EVALUATION
In this section we describe the development of a reusable benchmark for EL on a corpus of domain-specific conversations. Our approach–using specialist linkers for salient entity types, and their combination with general-purpose EL systems–is tested with this benchmark, and we report on its results.

4.1 Benchmark
We have selected a sample of Dutch parliamentary proceedings from the period 1999-2012. In consideration of the uneven spread in topical content over the various debates, we have stratified the sample into governmental departments, with which we assume the topical content is strongly associated. There is no formal one-to-one relation between debates and departments, and therefore we have used speakers with a government position as an indicator.

The size of the sample is restricted to the approximate length of a 3-hour debate, to limit the amount of time that our volunteer annotators needed to spend on manual annotation. From this overall limit, we allocated per-department quota in proportion to the number of debates associated with the department during the full period. For each department, a random debate is selected and taken out of the pool. From this debate, a random scene–a single member's speaking time with optional interruptions and replies–is picked, and included in the sample. These steps are repeated for all departments in round-robin fashion until the overall limit is reached. Departments for which the quota is full skip their turn.

Table 1. Composition of the stratified sample

Department                         |scenes|   |a| (1)   |a_per|   |a_org|
Economic Affairs                       4         97        29        10
Security and Justice                   4         90        31         7
Infrastructure and the Environ.        4         79        41        14
Without department                     4         72        33        16
Social Affairs and Employment          4         61        32        10
Interior and Kingdom Relations         4         57        17        11
Finance                                4         53        30         1
Foreign Affairs                        3         51         7         5
Education, Culture and Science         2         43        16         7
Health, Welfare and Sport              4         32        19         1
General Affairs                        3         32        11         5
Defense                                3         15         5         4
Total                                 43        682       271        91

(1) Number of phrases that have been annotated by at least one system.

This sample (see Table 1) was subsequently annotated by the two baselines and the specialist linkers. The resulting annotations were pooled into the sample's XML format. In order to assess the quality of these annotations against a consistent gold standard, we employ two human annotators for an independent and a consensus-building annotation round. We have established guidelines for them, e.g.: adjectively used names should be linked, but metaphorical speech and pronouns should not.

We have additionally developed a web interface to facilitate the creation of the gold standard by human annotators. The interface displays a single debate at a time, and clearly marks the scene of interest. The phrases that have been annotated by at least one of the systems are highlighted in this scene, and the annotator is able to select the mentioned entity from a list, or by entering a Wikipedia or PoliticalMashup URL manually. The annotator may also indicate that the mentioned entity is not present in either KB, or that the phrase should not be annotated at all. This benchmark does not evaluate entity mention boundaries in the interest of simplifying the manual annotation task. Overlapping annotations are displayed by the interface as their longest span, and annotators are able to enter multiple valid URLs. The pre-selection of candidate entities is achieved by deduplicating the system annotations, and adding to this the top results from queries to Wikipedia and PM with the annotated phrase.

4.2 Combination of system annotations
We have taken a simple approach to combining the output of multiple systems to address the aim of linking mentioned entities that are specific to the domain, as well as other entities. This approach is intended not to make use of any training data.

Earlier work on how to combine the output of multiple generalist EL systems has used a voting method [6], and shows it to be somewhat effective. Taking a vote on how to link, however, seems less promising when systems are specialized towards certain entity types. If we take the analogy of asking a question in a room full of specialists, who answers the question matters a great deal. We therefore employ a preference ordering instead: the most specialized (i.e. estimated high-precision) system is asked to link a phrase first, and only if it does not, the second system in the order is asked, and so on. By adding a generalist EL system at the end of the chain, the phrases that mention non-domain-specific entities also have their chance at being linked.

4.3 Results
We have used the developed benchmark to assess the correctness of the annotations that were generated by the specialist linkers and the baseline systems. To this end, we calculate precision and recall between the system and gold annotations (n=639). Figure 1 shows the performance of the specialist linkers (+, PM), DBpedia Spotlight (○, DBpS), Frog+Semanticizer (×, F+S), and preference-ordered combinations thereof (◁, △, ◁, ∗). For the single systems, the performance on annotations that link to persons and organizations is also shown separately.

Figure 1. Performance of the EL systems and combinations

These results show that the specialist linkers were able to generate a larger number of accurate annotations for the corpus than either of the baseline systems, whilst limited to two specific entity types. F+S is the more precise of the baselines, but DBpS produces a greater number of potentially useful links. Neither baseline is much good at identifying the people that are mentioned in this corpus, as we had expected, but F+S is surprisingly good at annotating organizations.

Specialist linkers, generally speaking, gain a head start over generalist systems by working with a smaller set of candidate entities. They are able to spot phrases that they should link with higher confidence, and in some cases lack the need to disambiguate, because they only know about a one-to-one mapping from the spotted phrase to an entity. Where generalist EL systems are somewhat biased towards entities with a high Web-presence, a specialist system should be biased towards entity types that are of interest to the users of a particular corpus. The linker with which we targeted parliament members is additionally empowered by some temporal awareness, and a mapping from government positions to office-holders. It is therefore the only system that can accurately link persons that are only mentioned by their office.

Our approach of combining a relatively simple custom-made EL system with an off-the-shelf EL system has also proven to be successful. Letting the specialist PM linkers annotate any phrases they could, and letting the remaining phrases be annotated by either DBpS or F+S, produced a significantly better result (+27% ≤ ΔF1 ≤ +99%) than any of the systems could by themselves. If high recall is of importance, it can be achieved by combining all three systems in an order of descending precision. The number of phrases for which only one of the combined systems produces an annotation gives an upper bound for the gain in recall. There are 548 such phrases for DBpS and PM, 451 for F+S and PM, and 333 for DBpS and F+S in our corpus.

5. APPLICATIONS
The potential for semantic annotations to improve information access is clear when we focus on users with a deep interest in the corpus' domain. An obvious application is in semantic search [2], where entity linking can help address issues with homonymy and synonymy in document retrieval. More notably, entity links can simplify the kind of queries that are used in corpus analysis, to which the desired answer is not a list of documents.

Consider this example for the genre of conversational text: "give an overview of all the questions that have been addressed to person X". This information need could be answered at a high level, e.g., by displaying a timeline which shows the frequency of asked questions, and, for any selected time period, who were the top question-askers and which other entities are mentioned frequently in the context of these questions. A user may also drill down into a (filtered) concordance view of the questions addressed to person X. The advantage over keyword search is that EL can resolve partial and ambiguous name matches, and mentions of role-holders, to specific individuals. The way in which an entity is mentioned thus becomes part of the answer, instead of the query.

Another example is the application of EL for Social Network Analysis. When the conversational corpus is viewed as a social network, the structure of the conversations can already shed light on some of the relations in this network. In the parliamentary domain, e.g., it is possible to derive a graph of who is interrupted by whom from the structure of the proceedings [7]. Entity links allow us to see the much broader graph of who mentioned whom during a conversation. By showing this mention graph against the background of the interruption graph, it becomes easy to explore the cases in which people mention each other for other reasons than a direct reply.

Finally, the low-cost annotation approach that we have described can be used to bootstrap other EL approaches, and other Information Extraction tasks. In cases where it is desirable to have an EL system learn to improve its annotation performance over time, our approach can be used to generate training data with an acceptable quality for weakly supervised methods. Moreover, accurate entity links form the basis for more elaborate IE tasks. For relation extraction, for example, they answer the question "between which entities does this relation hold?", and for sentiment analysis the question "who expresses this sentiment about what?"

6. CONCLUSIONS
The current state-of-the-art entity linking systems aim to be open-domain solutions for corpora that are as heterogeneous as the Web. An unfortunate effect of this aim is that such generalist EL systems often disappoint when they are used on domain-specific corpora. We have proposed and evaluated a solution that is highly cost-effective in comparison with existing alternative approaches.

We have outlined the prerequisites for, and development of, a lightweight linking system that targets salient entity types in a specific corpus. In our approach, the output of such specialist linkers is combined in a simple manner with that of an off-the-shelf EL system, which is responsible for linking mentioned entities of non-salient types that are also of interest to the corpus' users. The specialist system, two baseline generalist systems, and hybrid combinations thereof have been evaluated against a gold standard that has been carefully constructed by two human annotators who have experience in using the selected corpus. This gold standard, along with system annotations, annotation guidelines and accompanying code, is available as an open-data benchmark for the EL community at http://datahub.io/dataset/el-bm-nl-9912.

Our results show that the specialist system offers performance that is competitive with the two baseline systems, even though it is limited to two highly specific entity types. Moreover, by combining the specialist linkers with one or both generalist EL systems, recall can be significantly increased at a modest precision cost.

Acknowledgements
This research was supported by the Netherlands Organization for Scientific Research (ExPoSe project, NWO CI # 314.99.108; DiLiPaD project, NWO Digging into Data # 600.006.014). We extend our gratitude to Evelijn Martinius and Rosa Merino Claros for helping to prepare the gold standard.

7. REFERENCES
[1] Aho, A.V. and Corasick, M.J. 1975. Efficient string matching: an aid to bibliographic search. Communications of the ACM. 18, (1975), 333–340.
[2] Berlanga, R., Nebot, V. and Pérez, M. 2014. Tailored semantic annotation for semantic search. Journal of Web Semantics. (2014), 1–13.
[3] Van den Bosch, A., Busser, B., Canisius, S. and Daelemans, W. 2007. An efficient memory-based morphosyntactic tagger and parser for Dutch. Selected Papers of CLIN 2007 (Leuven, Belgium, 2007), 99–114.
[4] Daiber, J., Jakob, M., Hokamp, C. and Mendes, P.N. 2013. Improving Efficiency and Accuracy in Multilingual Entity Extraction. Proc. of I-Semantics 2013 (Graz, Austria, 2013), 3–6.
[5] Daumé III, H., Kumar, A. and Saha, A. 2010. Frustratingly Easy Semi-Supervised Domain Adaptation. Proceedings of DANLP '10 (2010), 53–59.
[6] Gagnon, M., Zouaq, A. and Jean-Louis, L. 2013. Can we use linked data semantic annotators for the extraction of domain-relevant expressions? Proc. of WWW 2013 companion (2013), 1239–1246.
[7] De Goede, B., Marx, M., Nusselder, A. and van Wees, J. 2011. Succinct summaries of narrative events using social networks. Proc. of HT '11. (2011), 299–304.
[8] Odijk, D., Meij, E. and de Rijke, M. 2013. Feeding the Second Screen: Semantic Linking based on Subtitles. OAIR 2013 (2013).
[9] Piskorski, J. and Yangarber, R. 2013. Information Extraction: Past, Present, and Future. Multi-Source, Multilingual Information Extraction and Summarization. Springer-Verlag. 23–50.
[10] Shen, W., Wang, J. and Han, J. 2014. Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions. IEEE Transactions on Knowledge and Data Engineering. 4347, 2 (2014), 443–460.
[11] Usbeck, R. et al. 2015. GERBIL – General Entity Annotator Benchmarking Framework. Proc. of WWW 2015 (Florence, Italy, 2015).
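
As an illustration of the dictionary-based linker described in Section 3.1, the following is a minimal Python sketch of case-insensitive, leftmost-longest name matching that rejects matches inside a longer token. It is not the implementation used in the experiments: for brevity it swaps the Aho-Corasick automaton for a compiled regular-expression alternation, which gives the same matches on a small name list but does not scale in the same way. The function names are hypothetical, and the URIs in PARTY_URIS are placeholders rather than real PoliticalMashup identifiers.

```python
import re

# Illustrative many-to-one dictionary from names (incl. acronyms) to URIs.
# The URIs are placeholders (assumption), not actual PoliticalMashup URIs.
PARTY_URIS = {
    "VVD": "http://example.org/party/vvd",
    "Volkspartij voor Vrijheid en Democratie": "http://example.org/party/vvd",
    "PvdA": "http://example.org/party/pvda",
    "Partij van de Arbeid": "http://example.org/party/pvda",
    "D66": "http://example.org/party/d66",
}


def build_matcher(names):
    """Compile one pattern that matches any of the given names."""
    # Longest names first: regex alternation tries alternatives in order,
    # so this yields the longest name at the leftmost matching position.
    ordered = sorted(names, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    # The lookarounds reject matches that are part of a longer token.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


def link_names(text, name_to_uri):
    """Return (start, end, surface form, URI) for every name found in text."""
    lookup = {name.lower(): uri for name, uri in name_to_uri.items()}
    matcher = build_matcher(name_to_uri)
    return [
        (m.start(), m.end(), m.group(0), lookup[m.group(0).lower()])
        for m in matcher.finditer(text)
    ]


if __name__ == "__main__":
    sentence = "Mr. Jansen (PvdA) asks the vvd and D66 for support."
    for start, end, surface, uri in link_names(sentence, PARTY_URIS):
        print(start, end, surface, uri)
```

If entity names also occur as common words, dropping the re.IGNORECASE flag (and the lower-casing of the lookup keys) would give the case-sensitive variant mentioned in Section 3.1.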
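
The preference ordering of Section 4.2 can be sketched in a few lines of Python. The sketch assumes, hypothetically, that each system's output is available as a mapping from a phrase identifier to the URI it proposes, and that the systems are passed in order from most to least specialized; neither the data layout nor the function name is taken from the benchmark code.

```python
def combine_by_preference(system_outputs):
    """Combine annotations from several systems by preference order.

    system_outputs: list of dicts (phrase id -> URI), ordered from the most
    specialized (estimated highest precision) to the most general system.
    A later system only contributes links for phrases that no earlier
    system has linked.
    """
    combined = {}
    for output in system_outputs:
        for phrase_id, uri in output.items():
            # setdefault keeps the link of the first system that produced one.
            combined.setdefault(phrase_id, uri)
    return combined


if __name__ == "__main__":
    # Toy annotations (assumption): identifiers and URIs are made up.
    specialist = {1: "pm:member-a"}
    generalist_1 = {1: "wiki:Other_A", 2: "wiki:Topic_B"}
    generalist_2 = {2: "dbp:Topic_B2", 3: "dbp:Topic_C"}
    print(combine_by_preference([specialist, generalist_1, generalist_2]))
```

Unlike voting, this scheme needs no training data and no agreement between systems; the only decision to make is the order in which the systems are consulted.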
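
To make the evaluation of Section 4.3 concrete, the following Python sketch computes precision, recall and F1 for one system against a gold standard. It rests on assumptions that the paper does not spell out: the gold standard is modelled as a mapping from a phrase identifier to the set of URIs that annotators accepted as valid (only for phrases that should be linked), a system link counts as correct when it is in that set, and mention boundaries are ignored. The function name and data layout are hypothetical.

```python
def precision_recall_f1(system, gold):
    """Score system links against gold links.

    system: dict, phrase id -> URI proposed by the system.
    gold:   dict, phrase id -> set of URIs accepted by the annotators
            (assumed to contain only phrases that should be linked).
    """
    correct = sum(
        1 for phrase_id, uri in system.items()
        if uri in gold.get(phrase_id, set())
    )
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy data (assumption): identifiers and URIs are made up.
    gold = {1: {"a"}, 2: {"x", "b"}, 4: {"d"}, 5: {"e"}}
    system = {1: "a", 2: "b", 3: "c"}
    print(precision_recall_f1(system, gold))
```

Applying the same function to the output of a preference-ordered combination and to each single system gives the kind of F1 comparison reported in the results.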