A Hybrid Approach to Domain-Specific Entity Linking
             Alex Olieman                   Jaap Kamps                      Maarten Marx                  Arjan Nusselder
                                        University of Amsterdam, Amsterdam, The Netherlands
                           {olieman|kamps|maartenmarx}@uva.nl                            arjan@nusselder.eu

ABSTRACT                                                                    We apply our approach to conversational text, in particular par-
The current state-of-the-art Entity Linking (EL) systems are                liamentary proceedings, i.e. the minutes of parliamentary debates.
geared towards corpora that are as heterogeneous as the Web, and            When generating semantic annotations from conversational rec-
therefore perform sub-optimally on domain-specific corpora. A               ords, e.g. minutes or online conversations, the structure of the
key open problem is how to construct effective EL systems for               records already provides much useful information. It tells us, for
specific domains, as knowledge of the local context should in               instance, who was the speaker of each unit of speech, who spoke
principle increase, rather than decrease, effectiveness. In this            in response to whom, and who participated in the conversation.
paper we propose the hybrid use of simple specialist linkers in             Additional information may be provided by metadata for each
combination with an existing generalist system to address this              conversation, such as when and where it took place, between
problem. Our main findings are the following. First, we construct           which group of people, or what the occasion or agenda was.
a new reusable benchmark for EL on a corpus of domain-specific              Moreover, when the structure of the records is congruent, parsing
conversations. Second, we test the performance of a range of                this information is straightforward.
approaches under the same conditions, and show that specialist              This offers a springboard for generating valuable annotations by
linkers obtain high precision in isolation, and high recall when            applying subsequent NLP to the full records. This paper focuses
combined with generalist linkers. Hence, we can effectively ex-             on utilizing Information Extraction (IE) techniques–EL in particu-
ploit local context and get the best of both worlds.                        lar–to enrich existing structure-based annotations. The techniques
                                                                            under investigation in this paper are designed to be applicable to
Categories and Subject Descriptors                                          written records of any kind of conversation.
H.3.1 [Information Systems]: Content Analysis and Indexing –
                                                                            Our contribution lies in answering the following questions:
abstracting methods, indexing methods, linguistic processing.
                                                                             1. How can mentions of the most salient entity types within a
Keywords                                                                        corpus be linked at a low cost in terms of system develop-
Information Extraction, Entity Linking, Semantic Annotation,                    ment and domain expertise?
Conversational Text, Minutes, Parliamentary Proceedings.                     2. How to construct a reusable benchmark for EL on conversa-
                                                                                tions, that allows comparison between systems and combi-
1. INTRODUCTION                                                                 nations of systems?
In the Entity Linking (EL) task, textual mentions are linked to              3. How effective are the specialist linkers, and how effective is
corresponding Knowledge Base (KB) entries. The majority of                      their hybrid combination with generalist EL systems?
state-of-the-art EL systems utilize one or more open-domain KBs,
such as Wikipedia, DBpedia, Freebase, or YAGO, as the basis for             2. RELATED WORK
learning their entity recognition and disambiguation models [10].           Until the beginning of the 21st century, it was common to collect
This approach shows definite merit when the target corpus con-              the domain knowledge that was needed for an IE task in a KB [9].
sists of texts with heterogeneous topical contents [9], e.g. in a           Progress in supervised machine learning, and the availability of
random sample of news articles or blog posts.                               high-coverage encyclopedic resources, however, has led to the use
Anyone with the desire to annotate a domain-specific (i.e. homo-            of open-domain KBs in recent years. The domain-specific nature
geneous) corpus, however, will at some point face sub-optimal               of IE is no longer expressed in the KB, but instead in the training
results when using a domain-agnostic EL system. This problem                data [9]. This has moved the adaptation cost of applying EL on a
has been identified as one of three promising research directions           specific corpus from the system developer to the domain expert.
in this area [10]. The main aim of this paper is to investigate             Efforts to reduce the need for domain experts have been made by
domain-specific Entity Linking.                                             semi-supervised adaptation of generalist models to a target corpus
The straightforward solution of training the state-of-the-art recog-        [10]. One promising direction is Transfer Learning, which is
nition and disambiguation models on the corpus (instead of on               known to work for classification tasks [5], whereas this has not
Wikipedia) can be extremely costly if accurate training data needs          been demonstrated sufficiently for EL. Alternatively, a domain-
to be handcrafted from scratch. Alternatively, a generalist EL              relevant part of the KB can be selected by excluding KB-entries
system can be used on the corpus without modification. This is              that are more likely to be generated by a parsimonious unigram
clearly a minimum-cost option, but its performance depends                  model of the KB (with the corpus as background), than by the
highly on the similarity of the corpus with the text that the models        unigram corpus model [2]. In Berlanga et al. [2], KB-entries are
are based on (e.g. Wikipedia articles). Currently, the most practi-         also tailored by basing entity-specific language models on both
cal approach for domain-specific EL likely lies somewhere in the            the corpus and the KB.
middle: some adaptation needs to occur, and preferably with                 The recently presented GERBIL [11] is a KB-agnostic EL
minimum effort. In this paper, we propose to use specialized                benchmarking framework, which addresses issues with the com-
linkers for salient entity types within the corpus’ domain, which           parability and reproducibility of EL systems and experiments. Our
can work in concert with a generally trained model.                         benchmark is complementary to GERBIL, in that it additionally
                                                                            allows combinations of EL systems to be evaluated.

3. DOMAIN-SPECIFIC LINKING                                                    If a corpus focuses on a specific domain of discourse, there may
Our approach is to develop specialist linkers for entity types that           also be characteristic ways in which names are used, as in eti-
are mentioned frequently in the target corpus. These linkers capi-            quette and/or jargon. Government and parliament members (in
talize on a small amount of background knowledge, and achieve                 short: members) adhere to guidelines on how to address each other
entity recognition and disambiguation by means of pattern detec-              during a parliamentary debate. Members are addressed as, e.g.,
tion, string matching, and structured queries against the corpus.             Mr., Mrs., colleague, minister, or secretary of state. This charac-
                                                                              teristic of the corpus can be utilized by detecting where a member
We have selected the Dutch parliamentary proceedings as the                   is mentioned, and thereby avoid the ambiguity between their name
target corpus for an experiment, available in an XML format with              and its homonyms. Government members are often mentioned
rich (structural) annotations, and which covers 1814 to the present.          only by their role (e.g. minister), which may be followed by a
The automated analysis of parliamentary proceedings is part of a              portfolio (e.g. the minister of finance). We have developed a
larger international effort, and has been facilitated by previous             regular expression that matches such patterns, and which avoids
work in the PoliticalMashup project [7].                                      including words that are not part of a name or portfolio.
Two off-the-shelf EL systems are used as baseline systems, and                The phrases that are found by the regular expression need to be
also as components for our combination approach. The first is                 linked to the URIs of the mentioned members (n=3,664). In any
DBpedia Spotlight v0.7, which takes raw text as input and pro-                parliamentary debate, most of the people that are mentioned are
duces links with generative models based on DBpedia and Wik-                  present in that session. The PoliticalMashup proceedings include a
ipedia [4]. It distinguishes itself from other state-of-the-art EL            structured speakers list which we use to resolve such mentions.
systems by creating entity-specific language models from the                  We use an index of members to disambiguate mentions of non-
context of Wikipedia page links, rather than from the pages them-             speakers, and query it with their name, and the date and the house
selves. The second system is comprised of separate entity recogni-            in which the debate took place. A link is generated only if this
tion and disambiguation modules. Frog is an NLP workbench for                 query has a single result, i.e., when it can be created with high
Dutch [3], from which the phrase chunking module is used to                   confidence.
identify noun phrases. The identified phrases are subsequently
passed on to the UvA Semanticizer, which takes a learning to re-              We have also built an index of government members by time
rank approach to disambiguation [8].                                          period, role, and portfolio, with which to resolve mentions by role.
                                                                              If the portfolio is not mentioned, the linker assumes that the men-
3.1 Domain-specific candidate entities                                        tioned person is a speaker. The speakers list is searched for any
The simplest way that we have considered to annotate entities of a            members with the mentioned role (i.e. minister or secretary). If
specific type starts by collecting names for the entities in question,        there are multiple candidates, we assume that the last-mentioned
including acronyms. These names are stored in a dictionary,                   member with this role is mentioned here.
which maps them to canonical URIs. Subsequently, a state ma-
chine that encodes all names is constructed by the Aho-Corasick               4. EVALUATION
algorithm [1]. This allows the set of names to be matched in an               In this section we describe the development of a reusable bench-
arbitrary input string, and the URI of mentioned entities to be               mark for EL on a corpus of domain-specific conversations. Our
found in the dictionary.                                                      approach–using specialist linkers for salient entity types, and their
                                                                              combination with general-purpose EL systems–is tested with this
This minimal-effort approach is fundamentally limited to entity
                                                                              benchmark, and we report on its results.
types in which no ambiguity exists. The many-to-one mapping
from names to URIs deals with synonymy, but does not allow a                  4.1 Benchmark
single name to be associated with multiple entities. Such a linker            We have selected a sample of Dutch parliamentary proceedings
therefore needs to target a type with few instances, or in which              from the period 1999-2012. In consideration of the uneven spread
ambiguous names are already avoided because they would con-                   in topical content over the various debates, we have stratified the
fuse communication. It is difficult, for instance, to find brands in          sample into governmental departments, with which we assume the
the same sector that share a name or acronym.                                 topical content is strongly associated. There is no formal one-to-
In our corpus we target Dutch political parties (n=155), because              one relation between debates and departments, and therefore we
they are highly relevant as well as unambiguously named. The                  have used speakers with a government position as an indicator.
linker uses case-insensitive, leftmost-longest string matching. Any           The size of the sample is restricted to the approximate length of a
matches that are part of a longer token are rejected. The second              3-hour debate, to limit the amount of time that our volunteer
prerequisite for such a simple approach is that none of the entity            annotators needed to spend on manual annotation. From this
names should also occur as common words. If such names do                     overall limit, we allocated per-department quota in proportion to
occur, this may be addressed with case-sensitive string matching.             the number of debates associated with the department during the
3.2 Genre-specific characteristics                                            full period. For each department, a random debate is selected and
The genre of conversational text exhibits several characteristics             taken out of the pool. From this debate, a random scene–a single
that can act as useful clues for EL. We focus on features that                member's speaking time with optional interruptions and replies–is
relate to mentioned persons, a salient entity type in many kinds of           picked, and included in the sample. These steps are repeated for
conversation. Conversations have a temporal aspect, i.e. all words            all departments in round-robin fashion until the overall limit is
have been spoken or written at some point in time, and it is likely           reached. Departments for which the quota is full skip their turn.
for a longitudinal corpus that the frequency with which a particu-
lar person is mentioned varies over time. Conversations are also
situated: they occur in a (virtual) space in which a person may be
present or not. These features can be used for disambiguation.

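
The resolution of member mentions described in Section 3.2 can be sketched as follows. The sketch covers mentions by name only, not mentions by role or portfolio. The pattern `ADDRESS` is a simplified English stand-in for our regular expression, and the `Member` record, its fields, and the identifiers are hypothetical. Returning no link when several speakers in a session share a surname is an assumption of this example.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Member:
    uri: str      # placeholder identifier, not a real PoliticalMashup URI
    surname: str
    house: str
    start: date   # first day in office
    end: date     # last day in office


# Hypothetical pattern: a form of address followed by a capitalised surname.
ADDRESS = re.compile(r"\b(?:Mr\.|Mrs\.|colleague|minister)\s+([A-Z][\w-]+)")


def resolve(surname, speakers, index, debate_date, house) -> Optional[str]:
    """Resolve a surname to a member URI, or None if it stays ambiguous."""
    # First try the members who speak in this session.
    present = [m for m in speakers if m.surname == surname]
    if present:
        return present[0].uri if len(present) == 1 else None
    # Otherwise query the index by name, date, and house.
    hits = [
        m for m in index
        if m.surname == surname and m.house == house
        and m.start <= debate_date <= m.end
    ]
    # Link only when the query has a single result.
    return hits[0].uri if len(hits) == 1 else None


def link_members(text, speakers, index, debate_date, house):
    """Return (phrase, URI or None) for every detected member mention."""
    return [
        (m.group(0), resolve(m.group(1), speakers, index, debate_date, house))
        for m in ADDRESS.finditer(text)
    ]
```

Restricting the index query to the date and house of the debate is what allows a bare surname to be resolved with high confidence in most cases.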

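
The sampling procedure of Section 4.1 can be sketched as below. The data layout is an assumption made for illustration: each department maps to a list of debates, each debate is a non-empty list of `(scene_id, length)` pairs, and sample size is measured in the same unit as `length`. The function name `stratified_sample` is likewise hypothetical.

```python
import random


def stratified_sample(pool, overall_limit, rng):
    """Round-robin sampling of scenes with per-department quota.

    pool: {department: [debate, ...]}, debate = [(scene_id, length), ...]
    """
    total_debates = sum(len(debates) for debates in pool.values())
    # Quota in proportion to the number of debates per department.
    quota = {
        dep: overall_limit * len(debates) / total_debates
        for dep, debates in pool.items()
    }
    remaining = {dep: list(debates) for dep, debates in pool.items()}
    used = {dep: 0 for dep in pool}
    sample, total = [], 0
    while total < overall_limit:
        progressed = False
        for dep in pool:
            if total >= overall_limit:
                break
            if used[dep] >= quota[dep] or not remaining[dep]:
                continue  # quota is full (or no debates left): skip this turn
            # Select a random debate and take it out of the pool.
            debate = remaining[dep].pop(rng.randrange(len(remaining[dep])))
            # Pick a random scene from that debate.
            scene_id, length = rng.choice(debate)
            sample.append((dep, scene_id))
            used[dep] += length
            total += length
            progressed = True
        if not progressed:
            break  # every department is exhausted or at its quota
    return sample


pool = {
    "A": [[("a1", 10)], [("a2", 10)], [("a3", 10)]],
    "B": [[("b1", 10)]],
}
sample = stratified_sample(pool, overall_limit=40, rng=random.Random(0))
```

In this toy pool, department A holds three of the four debates and therefore receives three quarters of the overall limit.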
            Table 1. Composition of the stratified sample

Department                          |scenes|    |a|¹   |a_per|   |a_org|
Economic Affairs                         4       97       29        10
Security and Justice                     4       90       31         7
Infrastructure and the Environ.          4       79       41        14
Without department                       4       72       33        16
Social Affairs and Employment            4       61       32        10
Interior and Kingdom Relations           4       57       17        11
Finance                                  4       53       30         1
Foreign Affairs                          3       51        7         5
Education, Culture and Science           2       43       16         7
Health, Welfare and Sport                4       32       19         1
General Affairs                          3       32       11         5
Defense                                  3       15        5         4
Total                                   43      682      271        91

This sample, see Table 1, was subsequently annotated by the two
baselines and the specialist linkers. The resulting annotations were
pooled into the sample's XML format. In order to assess the quali-                   Figure 1. Performance of the EL systems and combinations
ty of these annotations against a consistent gold standard, we                      is asked, and so on. By adding a generalist EL system at the end
employ two human annotators for an independent and a consen-                        of the chain, the phrases that mention non-domain-specific entities
sus-building annotation round. We have established guidelines for                   also have their chance at being linked.
them, e.g.: names used as adjectives should be linked, but
metaphorical speech and pronouns should not.                                        4.3 Results
We have additionally developed a web interface to facilitate the                    We have used the developed benchmark to assess the correctness
creation of the gold standard by human annotators. The interface                    of the annotations that were generated by the specialist linkers and
displays a single debate at a time, and clearly marks the scene of                  the baseline systems. To this end, we calculate precision and
interest. The phrases that have been annotated by at least one of                   recall between the system and gold annotations (n=639). Figure 1
the systems are highlighted in this scene, and the annotator is able                shows the performance of the specialist linkers (+, PM), DBpedia
to select the mentioned entity from a list, or by entering a Wikipe-                Spotlight (○, DBpS), Frog+Semanticizer (×, F+S), and prefer-
dia or PoliticalMashup URL manually. The annotator may also                         ence-ordered combinations thereof (◁,△,◁,∗ ). For the single
indicate that the mentioned entity is not present in either KB, or                  systems, the performance on annotations that link to persons and
that the phrase should not be annotated at all. This benchmark                      organizations is also shown separately.
does not evaluate entity mention boundaries in the interest of                      These results show that the specialist linkers were able to generate
simplifying the manual annotation task. Overlapping annotations                     a larger number of accurate annotations for the corpus than either
are displayed by the interface as their longest span, and annotators                of the baseline systems, whilst limited to two specific entity types.
are able to enter multiple valid URLs. The pre-selection of                         F+S is the more precise of the baselines, but DBpS produces a
candidate entities is achieved by deduplicating the system annotations,             greater number of potentially useful links. Neither baseline is
and adding to this the top results from queries to Wikipedia and                    good at identifying the people that are mentioned in this
PM with the annotated phrase.                                                       corpus, as we had expected, but F+S is surprisingly good at anno-
                                                                                    tating organizations.
4.2 Combination of system annotations
We have taken a simple approach to combining the output of                          Specialist linkers, generally speaking, gain a head start over gen-
multiple systems to address the aim of linking mentioned entities                   eralist systems by working with a smaller set of candidate entities.
that are specific to the domain, as well as other entities. This                    They are able to spot phrases that they should link with higher
approach is intended not to make use of any training data.                          confidence, and in some cases lack the need to disambiguate,
                                                                                    because they only know about a one-to-one mapping from the
Earlier work on how to combine the output of multiple generalist                    spotted phrase to an entity. Where generalist EL systems are
EL systems has used a voting method [6], and shows it to be                         somewhat biased towards entities with a high Web-presence, a
somewhat effective. Taking a vote on how to link, however,                          specialist system should be biased towards entity types that are of
seems less promising when systems are specialized towards cer-                      interest to the users of a particular corpus. The linker with which
tain entity types. If we take the analogy of asking a question in a                 we targeted parliament members is additionally empowered by
room full of specialists, who answers the question matters a great                  some temporal awareness, and a mapping from government posi-
deal. We therefore employ a preference ordering instead: the most                   tions to office-holders. It is therefore the only system that can
specialized (i.e. estimated high-precision) system is asked to link                 accurately link persons that are only mentioned by their office.
a phrase first, and only if it produces no link, the second system in the order
                                                                                    Our approach of combining a relatively simple custom-made EL
                                                                                    system with an off-the-shelf EL system has also proven to be
1
    Number of phrases that have been annotated by at least one system.              successful. Letting the specialist PM linkers annotate any phrases

they could, and letting the remaining phrases be annotated by either        shelf EL system, which is responsible for linking mentioned
DBpS or F+S, produced a significantly better result (+27%                   entities of non-salient types that are also of interest to the corpus'
≤ ∆𝐹1 ≤ +99%) than any of the systems could by themselves. If               users. The specialist system, two baseline generalist systems, and
high recall is of importance, it can be achieved by combining all           hybrid combinations thereof have been evaluated against a gold
three systems in an order of descending precision. The number of            standard that has been carefully constructed by two human anno-
phrases for which only one of the combined systems produces an              tators who have experience in using the selected corpus. This gold
annotation gives an upper bound for the gain in recall. There are           standard, along with system annotations, annotation guidelines
548 such phrases for DBpS and PM, 451 for F+S and PM, and                   and accompanying code, is available as an open-data benchmark
333 for DBpS and F+S in our corpus.                                         for the EL community at http://datahub.io/dataset/el-bm-nl-9912.
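
The preference-ordered combination of Section 4.2 and the scoring of Section 4.3 can be summarized in a short sketch. It assumes that each system's output is a mapping from a phrase identifier to a single URI, and that the gold standard maps each phrase that should be linked to its set of valid URIs. These data structures, the function names, and the identifiers in the example are assumptions, not the format of the published benchmark.

```python
def combine(*systems):
    """Merge annotations from systems given in descending order of preference.

    Each system is a dict {phrase_id: uri}. A later system only links the
    phrases that no earlier system has linked.
    """
    merged = {}
    for annotations in systems:
        for phrase, uri in annotations.items():
            merged.setdefault(phrase, uri)
    return merged


def precision_recall(system, gold):
    """Score {phrase_id: uri} against gold {phrase_id: set of valid URIs}."""
    correct = sum(1 for phrase, uri in system.items() if uri in gold.get(phrase, ()))
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data with placeholder identifiers.
gold = {1: {"pm:a"}, 2: {"dbp:X"}, 3: {"dbp:Y"}, 4: {"pm:b"}}
specialist = {1: "pm:a", 4: "pm:b"}
generalist = {1: "dbp:wrong", 2: "dbp:X", 5: "dbp:Z"}

hybrid = combine(specialist, generalist)
scores = precision_recall(hybrid, gold)
```

In this toy example the specialist alone is precise but covers only half of the gold annotations, while the hybrid gains recall from the generalist's link for phrase 2 and loses some precision through its spurious link for phrase 5.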
5. APPLICATIONS                                                             Our results show that the performance of the specialist system is
The potential for semantic annotations to improve information               competitive with the two baseline systems, even though it is limited
access is clear when we focus on users with a deep interest in the          to two highly specific entity types. Moreover, by combining the
corpus' domain. An obvious application is in semantic search [2],           specialist linkers with one or both generalist EL systems, recall
where entity linking can help address issues with homonymy and              can be significantly increased at a modest precision cost.
synonymy in document retrieval. More notably, entity links can              Acknowledgements This research was supported by the Nether-
simplify the kind of queries that are used in corpus analysis, to           lands Organization for Scientific Research (ExPoSe project, NWO
which the desired answer is not a list of documents.                        CI # 314.99.108; DiLiPaD project, NWO Digging into Data #
Consider this example for the genre of conversational text: give an         600.006.014). We extend our gratitude to Evelijn Martinius and
overview of all the questions that have been addressed to person            Rosa Merino Claros for helping to prepare the gold standard.
X. This information need could be answered at a high level, e.g.,
by displaying a timeline which shows the frequency of asked
                                                                            7. REFERENCES
questions, and, for any selected time period, who were the top              [1] Aho, A. V. and Corasick, M.J. 1975. Efficient string
question-askers and which other entities are mentioned frequently               matching: an aid to bibliographic search. Communications of
in the context of these questions. A user may also drill-down into              the ACM. 18, (1975), 333–340.
a (filtered) concordance view of the questions addressed to person          [2] Berlanga, R., Nebot, V. and Pérez, M. 2014. Tailored
X. The advantage over keyword search is that EL can resolve                     semantic annotation for semantic search. Journal of Web
partial and ambiguous name matches, and mentions of role-                       Semantics. (2014), 1–13.
holders, to specific individuals. The way in which an entity is
                                                                            [3] Van den Bosch, A., Busser, B., Canisius, S. and Daelemans,
mentioned thus becomes part of the answer, instead of the query.
                                                                                W. 2007. An efficient memory-based morphosyntactic tagger
Another example is the application of EL for Social Network                     and parser for Dutch. Selected Papers of CLIN 2007 (Leuven,
Analysis. When the conversational corpus is viewed as a social                  Belgium, 2007), 99–114.
network, the structure of the conversations can already shed light
                                                                            [4] Daiber, J., Jakob, M., Hokamp, C. and Mendes, P.N. 2013.
on some of the relations in this network. In the parliamentary
                                                                                Improving Efficiency and Accuracy in Multilingual Entity
domain, e.g., it is possible to derive a graph of who is interrupted
                                                                                Extraction. Proc. of I-Semantics 2013 (Graz, Austria, 2013),
by whom from the structure of the proceedings [7]. Entity links
                                                                                3–6.
allow us to see the much broader graph of who mentioned whom
during a conversation. By showing this mention graph against the            [5] Daumé III, H., Kumar, A. and Saha, A. 2010. Frustratingly
background of the interruption graph, it becomes easy to explore                Easy Semi-Supervised Domain Adaptation. Proceedings of
the cases in which people mention each other for other reasons                  DANLP ’10 (2010), 53–59.
than a direct reply.                                                        [6] Gagnon, M., Zouaq, A. and Jean-Louis, L. 2013. Can we use
Finally, the low-cost annotation approach that we have described                linked data semantic annotators for the extraction of domain-
can be used to bootstrap other EL approaches, and other Infor-                  relevant expressions? Proc. of WWW 2013 companion
mation Extraction tasks. In cases where it is desirable to have an              (2013), 1239–1246.
EL system learn to improve its annotation performance over time,            [7] De Goede, B., Marx, M., Nusselder, A. and van Wees, J.
our approach can be used to generate training data with an ac-                  2011. Succinct summaries of narrative events using social
ceptable quality for weakly supervised methods. Moreover, accu-                 networks. Proc. of HT ’11. (2011), 299–304.
rate entity links form the basis for more elaborate IE tasks. E.g.
for relation extraction they answer the question between which              [8] Odijk, D., Meij, E. and de Rijke, M. 2013. Feeding the
entities does this relation hold? and for sentiment analysis the                Second Screen: Semantic Linking based on Subtitles. OAIR
question who expresses this sentiment about what?                               2013 (2013).
                                                                            [9] Piskorski, J. and Yangarber, R. 2013. Information Extraction:
6. CONCLUSIONS                                                                  Past, Present, and Future. Multi-Source, Multilingual
The current state-of-the-art entity linking systems aim to be open-             Information Extraction and Summarization. Springer-Verlag.
domain solutions for corpora that are as heterogeneous as the                   23–50.
Web. An unfortunate effect of this aim is that such generalist EL
                                                                            [10] Shen, W., Wang, J. and Han, J. 2014. Entity Linking with a
systems often disappoint when they are used on domain-specific
                                                                                 Knowledge Base: Issues, Techniques, and Solutions. IEEE
corpora. We have proposed and evaluated a solution that is highly
                                                                                 Transactions on Knowledge and Data Engineering. 4347, 2
cost-effective in comparison with existing alternative approaches.
                                                                                 (2014), 443–460.
We have outlined the prerequisites for, and development of, a
                                                                            [11] Usbeck, R. et al. 2015. GERBIL – General Entity Annotator
lightweight linking system that targets salient entity types in a
                                                                                 Benchmarking Framework. Proc. of WWW 2015 (Florence,
specific corpus. In our approach, the output of such specialist
                                                                                 Italy, 2015).
linkers is combined in a simple manner with that of an off-the-
