<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>High-Throughput and Language-Agnostic Entity Disambiguation and Linking on User Generated Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Preeti Bhargava</string-name>
          <email>preeti.bhargava@lithium.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nemanja Spasojevic</string-name>
          <email>nemanja.spasojevic@lithium.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guoning Hu</string-name>
          <email>guoning.hu@lithium.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lithium Technologies | Klout</institution>
          ,
          <addr-line>San Francisco, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
        <p>The Entity Disambiguation and Linking (EDL) task matches entity mentions in text to a unique Knowledge Base (KB) identifier such as a Wikipedia or Freebase id. It plays a critical role in the construction of a high quality information network, and can be further leveraged for a variety of information retrieval and NLP tasks such as text categorization and document tagging. EDL is a complex and challenging problem due to ambiguity of the mentions and real world text being multi-lingual. Moreover, EDL systems need to have high throughput and should be lightweight in order to scale to large datasets and run on off-the-shelf machines. More importantly, these systems need to be able to extract and disambiguate dense annotations from the data in order to enable an Information Retrieval or Extraction task running on the data to be more efficient and accurate. In order to address all these challenges, we present the Lithium EDL system and algorithm - a high-throughput, lightweight, language-agnostic EDL system that extracts and correctly disambiguates 75% more entities than state-of-the-art EDL systems and is significantly faster than them.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>
        In Natural Language Processing (NLP), Entity
Disambiguation and Linking (EDL) is the task of matching entity
mentions in text to a unique Knowledge Base (KB) identifier
such as a Wikipedia or a Freebase id. It differs from the
conventional task of Named Entity Recognition, which is
focused on identifying the occurrence of an entity and its
type but not the specific unique entity that the mention
refers to. EDL plays a critical role in the construction of a
high quality information network such as the Web of Linked
Data [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Moreover, when any new piece of information is
extracted from text, it is necessary to know which real world
entity this piece refers to. If the system makes an error here,
it loses this piece of information and introduces noise.
      </p>
      <p>EDL can be leveraged for a variety of information retrieval
and NLP tasks such as text categorization and document
tagging. For instance, any document which contains entities
such as Michael Jordan and NBA can be tagged with
categories Sports and Basketball. It can also play a significant
role in recommender systems which can personalize content
for users based on the entities they are interested in.</p>
      <p>EDL is complex and challenging due to several reasons:
• Ambiguity - The same entity mention can refer to
different real world entities in different contexts. A
clear example of ambiguity is the mention Michael
Jordan which can refer to the basketball player in
certain context or the machine learning professor
from Berkeley. To the discerning human eye, it
may be easy to identify the correct entity, but any
EDL system attempting to do so needs to rely on
contextual information when faced with ambiguity.
• Multi-lingual content - The emergence of the web and
social media poses an additional challenge to NLP
practitioners because the user generated content on
them is often multi-lingual. Hence, any EDL system
processing real world data on the web, such as user
generated content from social media and networks,
should be able to support multiple languages in order
to be practical and applicable. Unfortunately, this is
a challenge that has not been given enough attention.
• High throughput and lightweight - State-of-the-art
EDL systems should be able to work on large scale
datasets, often involving millions of documents with
several thousand entities. Moreover, these systems
need to have low resource consumption in order to
scale to larger datasets in a finite amount of time.
In addition, in order to be applicable and practical,
they should be able to run on off-the-shelf commodity
machines.
• Rich annotated information - All information
retrieval and extraction tasks are more efficient and
accurate if the underlying data is rich and dense. Hence,
EDL systems need to ensure that they extract and
annotate many more entities and of different types
(such as professional titles, sports, activities etc.)
in addition to just named entities (such as persons,
organizations, locations etc.). However, most existing
systems focus on extracting named entities only.
In this paper, we present our EDL system and algorithm,
hereby referred to as the Lithium EDL system, which is
a high-throughput, lightweight and language-agnostic EDL
system that extracts and correctly disambiguates 75% more
entities than state-of-the-art EDL systems and is significantly
faster than them.
</p>
    </sec>
    <sec id="sec-2">
<title>1.1 Related Work</title>
      <p>
        EDL has been a well studied problem in literature and has
gained a lot of attention in recent years. Approaches that
disambiguate entity mentions with respect to Wikipedia date
back to Bunescu and Pasca’s work in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Cucerzan [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
attempted to solve the same problem by using heuristic rules
and Wikipedia disambiguation markups to derive mappings
from display names of entities to their Wikipedia entries.
However, this approach doesn’t work when the entity is
not well defined in their KB. Milne and Witten [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] refined
Cucerzan’s work by defining topical coherence using
normalized Google Distance [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and only using ‘unambiguous
entities’ to calculate topical coherence.
      </p>
      <p>
        Recent approaches have focused on exploiting statistical
text features such as mention and entity counts, entity
popularity and context similarity to disambiguate entities.
Spotlight [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] used a maximum likelihood estimation approach
using mention and entity counts. To combine different types
of disambiguation knowledge together, Han and Sun [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
proposed a generative model to include evidences from entity
popularity, mention-entity association and context similarity
in a holistic way. More recently, systems like AIDA [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and
AIDA-light [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] have proposed graphical approaches that
employ these statistical measures and attempt the
disambiguation of multiple entities in a document simultaneously.
Bradesco et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] followed an approach similar to AIDA-light
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] but limited the entities of interest to people and
companies. However, a major disadvantage of such approaches
is that their combinatorial nature results in intractability,
which makes them harder to scale to very large datasets
in a finite amount of time. In addition, none of these systems
supports multi-lingual content, which is very common
nowadays due to the proliferation of user generated content on
the web.
      </p>
      <p>Our work differs from the existing work in several ways.
We discuss these in the contributions outlined below.
</p>
    </sec>
    <sec id="sec-3">
<title>1.2 Contributions</title>
      <p>
        Our contributions in this paper are:
• Our EDL algorithm uses several context-dependent
and context-independent features, such as
mention-entity cooccurrence, entity-entity cooccurrence,
entity importance etc., to disambiguate mentions to
their respective entities.
• In contrast to several existing systems such as Google
Cloud NL API 1, OpenCalais 2 and AIDA [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], our
EDL system recognizes several types of entities
(professional titles, sports, activities etc.) in addition to
named entities (people, places, organizations etc.).
Our experiments (Section 7.2) demonstrate that it
recognizes and correctly disambiguates about 75%
more entities than state-of-the-art systems. Such
richer and denser annotations are particularly useful
in understanding the user generated content on social
media to model user conversations and interests.
1 https://cloud.google.com/natural-language/
2 http://www.opencalais.com/
• Our EDL algorithm is language-agnostic and
currently supports 6 different languages including
English, Arabic, Spanish, French, German, and
Japanese3. As a result, it is highly applicable to process
real world text such as multi-lingual user generated
content from social media. Moreover, it does not
need any added customizations to support additional
languages. In contrast, systems such as AIDA [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
and AIDA-light [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] need to be extended by
additional components in order to support other
languages such as Arabic [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
• Our EDL system has high throughput and is very
lightweight. It can be run on an off-the-shelf
commodity machine and scales easily to large datasets.
Experiments with a dataset of 910 million documents
showed that our EDL system took about 2.2ms per
document (with an average size of 169 bytes) on a
2.5 GHz Xeon processor (Section 6.3). Moreover, our
experiments demonstrate that our system’s runtime
per unique entity extracted is about 3.5 times faster
than state-of-the-art systems such as AIDA [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
<title>2 KNOWLEDGE BASE</title>
      <p>Our KB consists of about 1 million Freebase4 machine ids
for entities. These were chosen from a subset of all Freebase
entities that map to Wikipedia entities. We prefer to use
Freebase rather than Wikipedia as our KB since in
Freebase, the same id represents a unique entity across multiple
languages. Due to limited resources and usefulness of the
entities, our KB contains approximately 1 million most
important entities from among all the Freebase entities. This
gives us a good balance between coverage and relevance of
entities for processing common social media text. Section
3.3.1 explains how entity importance is calculated, which
enables us to rank the top 1 million Freebase entities.</p>
      <p>In addition to the KB entities, we also employ two special
entities: NIL and MISC. The NIL entity indicates that there is no
entity associated with the mention; e.g., the mention ‘the’ within
a sentence may link to the entity NIL. This entity is useful
especially when it comes to dealing with stop words and false
positives. MISC indicates that the mention links to an entity
which is outside the selected entity set in our KB.</p>
    </sec>
    <sec id="sec-5">
      <title>3 SYSTEM ARCHITECTURE</title>
      <p>This paper is focused on describing the Lithium EDL system.
However, the EDL system is a component of a larger Natural
Language Processing (NLP) pipeline, hereby referred to as
the Lithium NLP pipeline, which we describe briefly here.
Figure 1 shows the high level overview of the Lithium NLP
pipeline. It consists of several Text Preprocessing stages
before EDL.
3Our EDL system can easily support more languages with the ready
availability of ground truth data in them
4Freebase was a standard community generated KB until June 2015
when Google deprecated it in favor of the commercially available
Knowledge Graph API.
</p>
      <p>[Figure 1. High-level overview of the Lithium NLP pipeline: Text Normalization, Sentence Breaking, Tokenization and Entity Extraction, followed by EDL.]</p>
    </sec>
    <sec id="sec-6">
      <title>Text Preprocessing</title>
      <p>The Lithium NLP pipeline processes an input text document
in the following stages before EDL:
• Language Detection - This stage detects the language
of the input document using a naive Bayesian filter.</p>
      <p>
        It has a precision of 99% and is available on GitHub5.
• Text Normalization - This stage normalizes the text
by escaping unescaped characters and replacing some
special characters based on the detected language.
For example, it replaces non-ASCII punctuations
with spaces and converts accents to regular
characters for English.
• Sentence Breaking - This stage breaks the
normalized text into sentences using the Java Text API6.
This tool can distinguish sentence breakers from
other marks, such as periods within numbers and
abbreviations, according to the detected language.
• Tokenization - This stage converts each sentence into
a sequence of tokens via the Lucene Standard
Tokenizer7.
• Entity Extraction - This stage captures mentions in
each sentence that belong to precomputed offline
dictionaries. Please see Section 3.3.1 for more details
about dictionary generation. A mention may contain
a single token or several consecutive tokens, but a
token can belong to at most one mention. Often
there are multiple ways to break a sentence into a
set of mentions. To make this task computationally
efficient, we apply a simple greedy strategy that
analyzes windows of n-grams (n ∈ [
        <xref ref-type="bibr" rid="ref1 ref6">1,6</xref>
        ]) and extracts
the longest mention found in each window.
      </p>
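<p>The greedy longest-match strategy above can be sketched as follows. This is a minimal illustration, not the pipeline's actual implementation: it assumes a whitespace tokenizer and an in-memory set of known mention surface forms, and the function and variable names are hypothetical.

```python
def extract_mentions(tokens, dictionary, max_n=6):
    """Scan left to right; at each position try the longest n-gram first
    (n in [1, 6]) and emit the first one found in the dictionary, so each
    token belongs to at most one mention."""
    mentions = []
    i = 0
    while len(tokens) > i:
        match = None
        # try the widest window first, shrinking toward a single token
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in dictionary:
                match = candidate
                break
        if match is not None:
            mentions.append(match)
            i += len(match.split())  # consume all tokens of the mention
        else:
            i += 1
    return mentions
```

Because the longest window wins, a dictionary containing both "Michael Jordan" and "Jordan" yields the two-token mention rather than the single-token one.</p>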
<p>An extracted mention may map to multiple candidate entities.
Our pipeline determines the best entity for each mention in
the EDL phase, which is described in Section 3.3.</p>
    </sec>
    <sec id="sec-7">
<title>3.2 Data Set Generation</title>
      <p>
        Since our goal here is to build a language-agnostic EDL
system, we needed a dataset that scales across several languages
and also has good entity density and coverage. Unfortunately,
such a dataset is not readily available. Hence, we generated a
ground truth data set for our EDL system, the Densely
Annotated Wikipedia Text (DAWT)8 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], using densely Wikified
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] or annotated Wikipedia articles. Wikification is entity
linking with Wikipedia as the KB. We started with Wikipedia
      </p>
<sec id="sec-7-1">
        <p>5 https://github.com/shuyo/language-detection
6 https://docs.oracle.com/javase/7/docs/api/java/text/BreakIterator.html
7 http://lucene.apache.org/core/4_5_0/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html
8 DAWT and other derived datasets are available for download at:
https://github.com/klout/opendata/tree/master/wiki_annotation.</p>
<p>[Figure 2. System architecture of the EDL stage: offline dictionaries, context-independent and context-dependent feature calculators, supervised classifiers, entity context and document context, first-pass disambiguation and scoring of easy entities, second-pass disambiguation and scoring, and final disambiguation producing annotated text with disambiguated entities.]</p>
        <p>data dumps9, which were further enriched by introducing
more hyperlinks in the existing document structure. Our
main goals when building this data set were to maintain high
precision and increase linking coverage. As a last step, the
hyperlinks to Wikipedia articles in a specific language were
replaced with links to their Freebase ids to adapt to our KB.
The densely annotated Wikipedia articles had on average
4.8 times more links than the original articles.</p>
      </sec>
    </sec>
    <sec id="sec-8">
<title>3.3 Entity Disambiguation and Linking</title>
      <p>
        The system architecture of the EDL stage is shown in Figure
2. Similar to the approach employed by AIDA-light [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], it
employs a two-pass algorithm (explained in detail in Section
4) which first identifies a set of easy mentions, which have
low ambiguity and can be disambiguated and linked to their
respective entities with high confidence. It then leverages
these easy entities and several context dependent and
independent features to disambiguate and link the remaining hard
mentions. However, unlike AIDA-light [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], our approach
does not use a graph based model to jointly disambiguate
entities because such approaches can become intractable with
increase in the size of the document and number of entities.
In addition, our EDL problem is posed as a classification
rather than a regression problem as in AIDA-light [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
<p>The EDL stage consists of the following components:</p>
      <p>3.3.1 Offline Dictionaries Generation. Our EDL system
uses several dictionaries capturing language models,
probabilities and relations across entities and topics. These are
generated by offline processes that leverage various multi-lingual
data sources. The dictionaries are:</p>
<sec id="sec-8-1">
        <p>9 https://dumps.wikimedia.org/</p>
        <p>
          • Mention-Entity Cooccurrence - This dictionary is
derived using the DAWT data set [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Here, we
estimate the prior probability that a mention Mi refers
to an entity Ej (including NIL and MISC) with
respect to our KB and corpora. It is equivalent to
the cooccurrence probability of the mention and the
entity:
count(Mi → Ej) / count(Mi)</p>
        <p>
We generate a separate dictionary for each language.
Moreover, since DAWT is 4.8 times denser than
Wikipedia, these dictionaries capture several more
mentions and are designed to be exhaustive across
several domains.
• Entity-Entity Cooccurrence - This dictionary is also
derived using DAWT. In this case, we capture
cooccurrence frequencies among entities by counting
all the entities that simultaneously appear within
a sliding window of 50 tokens. Moreover, this data
is accumulated across all languages and is language
independent in order to capture better relations and
create a smaller memory footprint when
supporting additional languages. Also, for each entity, we
consider only the top 30 co-occurring entities which
have at least 10 or more co-occurrences across all
supported languages.
• Entity Importance - The entity importance score [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]
is derived as a global score identifying how important
an extracted entity is for a casual observer. This
score is calculated using linear regression with
features capturing popularity within Wikipedia links,
and importance of the entity within Freebase. We
used signals such as Wiki page rank, Wiki and
Freebase incoming and outgoing links, and type
descriptors within knowledge base etc.
• Topic Parent - The Klout Topic Ontology10 is a
manually curated ontology built to capture social media
users’ interests [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and expertise scores [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] across
multiple social networks. As of December 2016, it
consists of roughly 7,500 topic nodes and 13,500
edges encoding hierarchical relationships among them.
The Topic Parents dictionary contains the parent
topics for each topic within this ontology.
• Entity To Topic Mapping - This dictionary essentially
contains topics from the Klout Topic Ontology that
are associated with the different entities in our KB.
E.g. Michael Jordan, the basketball player, will be
associated with the topics ‘Basketball’ and ‘Sports’. We
generate this dictionary via a weighted ensemble of
several algorithms that employ entity co-occurrence
and propagate the topic labels. A complete
description of these algorithms is beyond the scope of this
paper.
3.3.2 Context.
        </p>
<p>• Document context - As mentioned earlier, the Lithium
EDL system relies on disambiguating a set of easy
mentions in the document which are then leveraged
to disambiguate the hard mentions. Thus, for each
document, we maintain a document context C(Ti)
which includes all the easy entities in the document
text that have been disambiguated. This context
also includes cached pairwise feature scores for the
context dependent features between the easy and
hard entities (see Section 4.2.1 for a description of
the context dependent features).
10 https://github.com/klout/opendata
• Entity context - For each candidate entity Ek of a
hard mention, we define an entity context C'(Ek)
which includes the position of the corresponding
mention in the document, the index number of the
candidate entity as well as an easy entity window
E0k surrounding the hard mention. The appropriate
window size W is determined by parameter tuning
on a validation set.</p>
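<p>The easy entity window around a hard mention can be sketched as follows. This is a simplified illustration with hypothetical names: positions are token offsets, the window is taken as the W nearest disambiguated easy entities, and window_size stands in for the tuned parameter W.

```python
def easy_entity_window(hard_position, easy_entities, window_size):
    """easy_entities: list of (position, entity_id) pairs for the
    disambiguated easy entities in the document. Returns the
    window_size easy entities closest to the hard mention."""
    ranked = sorted(easy_entities, key=lambda pe: abs(pe[0] - hard_position))
    return [entity for _, entity in ranked[:window_size]]
```
</p>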
        <p>3.3.3 Supervised Classifiers. We pose our EDL problem
as a binary classification problem for the following reason:
For each mention, only one of the candidate entities is the
correct label entity. Our ground truth data set provides the
labeled correct entity but does not have any scores or ranked
order for the candidate entities. Hence, we pose this problem
as predicting one of the two labels {True, False} for each
candidate entity (where True indicates it is the correctly
disambiguated entity for a mention and False indicates that
it is not).</p>
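<p>This binary formulation can be illustrated with a short sketch: every (mention, candidate entity) pair becomes one training row, labeled True only for the gold entity from the ground truth. The helper name and input shape are hypothetical simplifications.

```python
def build_training_rows(mentions):
    """mentions: list of (candidate_entities, gold_entity) pairs.
    Returns one (entity, label) row per candidate entity."""
    rows = []
    for candidates, gold in mentions:
        for entity in candidates:
            rows.append((entity, entity == gold))
    return rows
```
</p>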
        <p>Using the process described in Section 3.2, we generated a
ground truth training set of 70 English Wikipedia pages which
had a total of 43,662 mentions and 147,236 candidate entities.
We experimented with several classifiers such as Decision
Trees, Random Forest, k-Nearest Neighbors and Logistic
Regression on this training set. Decision Trees and Logistic
Regression outperformed most of the classifiers. While
Random Forest was as accurate as the Decision Tree classifier, it
was computationally more expensive. Hence, we use Decision
Tree and Logistic Regression in the Lithium EDL system.</p>
      </sec>
    </sec>
<sec id="sec-10">
      <title>4 ENTITY DISAMBIGUATION AND LINKING ALGORITHM</title>
      <p>Algorithm 1 describes the Lithium EDL two-pass algorithm.
We discuss it in detail now (the design choices for various
parameters are explained in Section 5).
</p>
    </sec>
    <sec id="sec-11">
<title>4.1 First pass</title>
      <p>The first pass of the algorithm iterates over all mentions in
the document text and disambiguates mentions that have:
• Only one candidate entity: In this case, the algorithm
disambiguates the mention to the lone candidate
entity.
• Two candidate entities with one being NIL/MISC: In
this case, the algorithm disambiguates the mention
to the candidate entity with the higher
Mention-Entity-Cooccurr prior probability (above λ1, the Easy Mention
Disambiguation threshold with NIL).
• Three or more candidate entities with one entity
having a very high prior: In this case, the algorithm
disambiguates the mention to that candidate entity.</p>
      <p>Algorithm 1: Lithium EDL algorithm</p>
<p>Input: Text Ti with extracted mentions Mall and a set of candidate entities for each mention
Output: Text Ti with extracted mentions Mall and a unique disambiguated entity for each mention
// First pass - disambiguate the easy mentions
Measy ← easy mentions obtained from the first pass on Ti;
Eeasy ← disambiguated easy entities obtained from the first pass on Ti;
Document context C(Ti) ← C(Ti) + Eeasy;
Mhard ← Mall - Measy;
// Second pass - iterate over the hard mentions
foreach mention Mj ∈ Mhard do
  Hj ← candidate entities of Mj;
  // Iterate over the candidate entities of the hard mention
  foreach entity Ek ∈ Hj do
    Entity context C'(Ek) ← C'(Ek) + E0k (set of easy entities in a window around Ek);
    FEk ← feature vector of context-independent and context-dependent feature values for Ek using C'(Ek);
    Classify FEk as one of {True, False} using the Decision Tree classifier;
    SEk ← final score for Ek generated using the Logistic Regression model weights;
    Add SEk to Sj (set of candidate entity scores for Mj);
  end
  // Final disambiguation - select one candidate entity as the disambiguated entity Dj for Mj
  if only one Ek ∈ Hj is labeled True then
    Dj ← the Ek labeled True;
  else if multiple Ek are labeled True then
    Dj ← the highest scoring Ek labeled True;
  else if no Ek is labeled True then
    Dj ← arg max (SEk);
  if Dj is NIL and NIL_MARGIN_GAIN &lt; threshold then
    Dj ← arg max (Sj - SDj);
end</p>
      <p>4.2 Second Pass.
The second pass of the algorithm uses several context-independent
and context-dependent features as well as supervised
classifiers to label and score the candidate entities for each hard
mention and finally disambiguate it.</p>
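<p>The control flow of Algorithm 1 can be sketched compactly as follows. This is a simplified illustration, not the actual implementation: classify and score are hypothetical stand-ins for the trained Decision Tree and Logistic Regression models, the first pass is reduced to the single-candidate rule, and the NIL margin-of-gain check is omitted for brevity.

```python
def two_pass_edl(mentions, candidates, classify, score):
    """mentions: all extracted mentions; candidates: dict mapping each
    mention to its list of candidate entities. Returns mention -> entity."""
    # First pass: mentions with a single candidate are trivially easy.
    resolved = {m: candidates[m][0] for m in mentions if len(candidates[m]) == 1}
    context = set(resolved.values())  # easy entities as disambiguation context

    # Second pass: label and score the candidates of each hard mention.
    for m in mentions:
        if m in resolved:
            continue
        scored = [(e, classify(e, context), score(e, context)) for e in candidates[m]]
        true_labeled = [(e, s) for e, lab, s in scored if lab]
        if len(true_labeled) == 1:
            resolved[m] = true_labeled[0][0]          # the single True candidate
        elif true_labeled:
            resolved[m] = max(true_labeled, key=lambda es: es[1])[0]
        else:
            resolved[m] = max(scored, key=lambda t: t[2])[0]  # all labeled False
    return resolved
```
</p>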
      <p>4.2.1 Features. We use several language agnostic features
to classify each candidate entity for each hard mention as
‘True’ or ‘False’. These include both context-independent
(useful for disambiguating and linking entities in short and
sparse texts such as tweets) as well as context-dependent
features (useful for disambiguating and linking entities in
long and rich text). Each feature produces a real value in
[0.0,1.0].</p>
<p>The context independent features are:
• Mention-Entity Cooccurrence (Mention-Entity-Cooccurr) -
This feature value is equal to the Mention-Entity-Cooccurr prior probability.
• Mention-Entity Jaccard Similarity (Mention-Entity-Jaccard) -
This reflects the similarity between the
mention Mi and the representative name of a
candidate entity Ej. The mention and the entity display
names are first tokenized and the Jaccard similarity
is then computed between the token sets as
|Tokens(Mi) ∩ Tokens(Ej)| / |Tokens(Mi) ∪ Tokens(Ej)|</p>
      <sec id="sec-11-2">
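<p>A minimal sketch of this token-set Jaccard computation, assuming simple lowercased whitespace tokenization rather than the pipeline's Lucene tokenizer:

```python
def jaccard_similarity(mention, entity_name):
    """Jaccard similarity between the token sets of a mention and a
    candidate entity's display name."""
    a = set(mention.lower().split())
    b = set(entity_name.lower().split())
    return len(a.intersection(b)) / len(a.union(b))
```
</p>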
        <p>
          For instance, the mention Marvel could refer to the
entities Marvel Comics or Marvel Entertainment,
both of which have a Jaccard Similarity of 0.5 with
the mention.
• Entity Importance (Entity-Importance) - This
reflects the importance or the relevance of the
candidate entity as determined by an entity scoring and
ranking algorithm [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] which ranks the top 1 million
entities occurring in our KB. For instance, the entity
Apple Inc. has an importance of 0.66 while Apple
(fruit) has an importance of 0.64 as ranked by the
Entity Scoring algorithm.
        </p>
<p>• Entity-Entity Topic Semantic Similarity
(Entity-Entity-Topic-Sim) - As mentioned in Section 3.3.1, each
entity in our KB is associated with a finite number
of topics in our topic ontology. For instance, entity
Apple Inc. maps to the topic ‘Apple’ and Google Inc.
maps to the topic ‘Google’ while ‘Apple (fruit)’ will
map to the topic ‘Food’. Figure 3 shows a partial
view of the ontology for the above mentioned topics.</p>
        <p>For each candidate entity Ei of a hard mention
Mi, we compute the minimum semantic distance
of its topics with topics of each entity in E0i over
all possible paths in our topic ontology space. The
similarity is the inverse of the distance. For instance,
consider the hard mention Apple, having two
candidate entities - Apple Inc. and Apple (fruit) for it, and
E0i containing the entity Google Inc. which has been
disambiguated. As shown in Figure 3, the
semantic distance between the topics for Apple Inc. and
Google Inc. is 4 while the semantic distance between
the topics for Apple (fruit) and Google Inc. is 5. As
a result, it is more likely that Apple disambiguates
to Apple Inc.</p>
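<p>This feature can be sketched as a shortest-path computation over the topic ontology. The sketch below is a simplified illustration that treats the ontology as an undirected adjacency map; the function and variable names are hypothetical, and the similarity is taken as the inverse of the minimum topic distance.

```python
from collections import deque

def topic_distance(graph, start, goal):
    """graph: topic -> list of neighboring topics. BFS shortest-path length."""
    if start == goal:
        return 0
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt == goal:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # goal unreachable from start

def topic_similarity(graph, topics_a, topics_b):
    """Inverse of the minimum semantic distance between the two topic sets."""
    distances = [topic_distance(graph, ta, tb) for ta in topics_a for tb in topics_b]
    reachable = [d for d in distances if d is not None]
    if not reachable:
        return 0.0
    if 0 in reachable:   # shared topic: maximal similarity
        return 1.0
    return max(1.0 / d for d in reachable)
```
</p>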
<p>Thus, we first determine the set of topics Ti that
the candidate entity Ei is associated with. For each
entity Ej in E0i, we generate the set of topics Tj. The
feature value is computed as</p>
        <p>max 1/distance(ti, tj) ∀ ti ∈ Ti, tj ∈ Tj
4.2.2 Classification and Scoring. As a penultimate step
in the second pass, the computed features are combined into
a feature vector for a candidate entity and the Decision Tree
classifier labels the feature vector as ‘True’ or ‘False’. In
addition, for each candidate entity, we also generate final
scores using weights generated by the Logistic Regression
classifier that we trained in Section 3.3.3. We use an ensemble
of the two classifiers in the final disambiguation step as it
helps overcome the individual bias of each classifier.</p>
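<p>The classify-and-score step can be illustrated with a minimal sketch, where a hypothetical tree_label function stands in for the trained Decision Tree and the final score is a logistic function of the feature vector under assumed Logistic Regression weights:

```python
import math

def logistic_score(features, weights, bias=0.0):
    """Score a candidate entity's feature vector with logistic weights."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def classify_and_score(feature_vector, tree_label, weights):
    """Return the tree's True/False label together with the logistic score;
    the final disambiguation step consumes both."""
    return tree_label(feature_vector), logistic_score(feature_vector, weights)
```
</p>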
        <p>4.2.3 Final Disambiguation. The final disambiguation step
needs to select one of the labeled candidate entities as the
disambiguated entity for the mention. However, multiple
cases arise at the time of disambiguation:
• Only one candidate entity is labeled as ‘True’- Here,
the algorithm selects that entity as the disambiguated
entity for the given mention.
• Multiple candidate entities labeled as ‘True’ - Here,
the algorithm selects the highest scoring entity (from
among those labeled ‘True’) as the disambiguated
entity except when this entity is NIL/MISC. In that
case, the algorithm checks the margin of gain or
the score difference between the NIL/MISC entity
and the next highest scoring entity that is labeled
‘True’. If the margin of gain is less than a threshold
(less than NIL margin of gain threshold, λ3) then
the next highest scoring entity (from among those
labeled ‘True’) is selected.
• All candidate entities labeled as ‘False’ - Here, the
algorithm selects the highest scoring entity as the
disambiguated entity except when this entity is NIL/MISC.
In that case, the algorithm checks the margin of gain
for this entity over the next highest scoring entity.
If the margin of gain is less than a threshold (less
than NIL margin of gain threshold, λ3) then the
next highest scoring entity is selected.
</p>
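<p>The three disambiguation cases above, including the NIL/MISC margin-of-gain check, can be sketched as follows. This is a simplified illustration: the labeled-and-scored candidates and the threshold λ3 (lambda3 below) are assumed inputs.

```python
def final_disambiguation(scored, lambda3):
    """scored: list of (entity, label, score) triples, where label is the
    classifier's True/False decision. Returns the disambiguated entity."""
    special = {"NIL", "MISC"}
    # Prefer candidates labeled True; fall back to all candidates otherwise.
    pool = [(e, s) for e, lab, s in scored if lab] or [(e, s) for e, _, s in scored]
    ranked = sorted(pool, key=lambda es: es[1], reverse=True)
    best = ranked[0]
    if best[0] in special and len(ranked) > 1:
        margin = best[1] - ranked[1][1]
        if lambda3 > margin:   # margin of gain too small: prefer a real entity
            return ranked[1][0]
    return best[0]
```
</p>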
      </sec>
    </sec>
    <sec id="sec-12">
<title>4.3 Demonstrative Example</title>
<p>To demonstrate the efficacy of our algorithm, let’s
disambiguate the sample text: “Google CEO Eric Schmidt said
that the competition between Apple and Google and iOS vs.
Android is ‘the defining fight of the tech industry.’”</p>
      <p>Figure 4 walks through the disambiguation of the sample
text. The Text Preprocessing stages extract the mentions
(highlighted in bold) and generate the candidate entities and
the prior cooccurrence scores for each mention. As shown,
the extracted mentions and their candidate entities are:
• Google - NIL and Google Inc.
• CEO - NIL and Chief Executive
• Eric Schmidt - NIL and Eric Schmidt
• Apple - NIL, Apple (fruit), Apple Inc. and Apple Records
• iOS - NIL and iOS
• Android - NIL, Android (OS) and Android (robot)
• tech industry - Technology
In the first pass, the algorithm disambiguates the easy
mentions. Based on their high prior scores and number of
candidate entities, it disambiguates Eric Schmidt, iOS and tech
industry (highlighted in color) to their correct entities. In
the second pass, it uses the easy mention window and
computes several context dependent and independent features to
score and classify the candidate entities of the hard mentions.</p>
      <p>Note that, for the purpose of clarity and simplicity, we are
not walking through the feature and final score computation.
Also, though our algorithm utilizes the Freebase machine id for each
candidate entity, we only show the entity name for clarity.</p>
      <p>[Figure 4: two-pass disambiguation of the sample sentence, showing
the NIL and non-NIL candidate entities generated for each mention and
the ‘True’/‘False’ labels assigned in the second pass.]</p>
      <p>
As shown, for the remaining hard mentions, it classifies
the candidate entities as ‘True’ or ‘False’. In the final
disambiguation step, it selects one of the labeled entities as the
correct disambiguated entity. In the sample sentence, for all
the mentions, only one of the candidate entities is labeled
as ‘True’, and hence the algorithm selects that entity as the
disambiguated entity for each mention.</p>
    </sec>
    <sec id="sec-13">
      <title>5 PARAMETER TUNING</title>
      <p>Our algorithm uses four hyperparameters - two in the
first pass and two in the second pass. These are:
• Easy mention disambiguation threshold with NIL
(λ1) - This threshold is used to disambiguate easy
mentions which have 2 candidate entities, one of
which is the NIL entity.
• Easy mention disambiguation threshold (λ2) - This
threshold is used to disambiguate easy mentions
which have 3 or more candidate entities but where the
mention maps to one of them with a very high prior
probability.
• NIL margin of gain threshold (λ3) - This threshold
is used in the second pass to disambiguate entities
when multiple candidates, or none, are labeled
‘True’.
• Window size (W) - This parameter represents the
size of the easy entity window around each hard
entity.</p>
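      <p>One way the two easy-mention thresholds might gate the first pass is sketched below. This is illustrative: the function name and the prior-map representation are our own, and the single-candidate case is inferred from the demonstrative example (tech industry has only one candidate and is disambiguated in the first pass).</p>

```python
# Tuned thresholds from Section 5 (illustrative constants).
LAMBDA_1 = 0.75  # easy-mention threshold when one of two candidates is NIL
LAMBDA_2 = 0.90  # easy-mention threshold for mentions with 3+ candidates

def is_easy_mention(priors):
    """priors: dict mapping candidate entity name -> prior probability
    that the mention refers to it (a hypothetical representation)."""
    best_prob = max(priors.values())
    if len(priors) == 1:
        return True  # a single unambiguous candidate, e.g. tech industry
    if len(priors) == 2 and "NIL" in priors:
        return best_prob >= LAMBDA_1
    return best_prob >= LAMBDA_2
```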
      <p>Using the process described in Section 3.2, we generated a
ground truth validation set of 10 English Wikipedia pages
which had a total of 7242 mentions and 23,961 candidate
entities. We used parameter sweeping experiments to
determine the optimal value of these parameters. We measured
the performance (in terms of precision, recall and f-score) of
the algorithm on the validation set with different parameter
settings and picked the parameter values that had the best
performance. Based on our experiments, we set the optimal
value of λ1 as 0.75, λ2 as 0.9, W as 400 and λ3 as 0.5.</p>
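      <p>The parameter sweep described above amounts to a grid search over the validation set. The sketch below is illustrative; the evaluate callback stands in for running the full two-pass algorithm with the given parameter values and measuring its f-score.</p>

```python
from itertools import product

def sweep(evaluate, grid):
    """Grid search: evaluate(params) returns the f-score obtained on the
    validation set with those parameter values; grid maps each parameter
    name to the list of values to try."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

With evaluate wired to the algorithm and the 10-page validation set, sweeping λ1, λ2, λ3 and W in this manner yields the optimal values reported above.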
    </sec>
    <sec id="sec-14">
      <title>6 EVALUATION</title>
    </sec>
    <sec id="sec-15">
      <title>6.1 Test data</title>
      <p>Using the process described in Section 3.2, we generated a
ground truth test set of 20 English Wikipedia pages which
had a total of 18,773 mentions.</p>
    </sec>
    <sec id="sec-16">
      <title>6.2 Metrics</title>
      <p>We use standard performance metrics like precision, recall,
f-score and accuracy to evaluate our EDL system on the
test set. However, due to our problem setup, we calculate
true positives, false positives, true negatives and false
negatives in an unconventional way, as shown in Table 1.
Precision, recall, f-score and accuracy are calculated in the
standard way as: P = tp / (tp + fp), R = tp / (tp + fn),
F1 = 2 × P × R / (P + R), and
Accuracy = (tp + tn) / (tp + tn + fp + fn).</p>
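      <p>In code, with tp, fp, tn and fn as the counts obtained per Table 1, these metrics are simply:</p>

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)
```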
    </sec>
    <sec id="sec-17">
      <title>6.3 Results</title>
      <p>We compute the performance metrics for individual features
as well as for various feature sets on our English language
test set to assess their impact. Table 2 shows the feature
effectiveness results for our algorithm. As evident from the
results, Mention-Entity-Cooccurr has the biggest impact on the
performance of the algorithm among all individual features
as it has the highest individual precision and f-score.
When combined, the context independent features
have higher precision and f-score than the context
dependent features. This could be due to the fact that
in shorter text documents, there may not be enough easy
mentions disambiguated in the first pass. Since the context
dependent features rely on the easy entity window for
computation, their performance will be impacted. However, when
all these features are taken together, the overall performance
improves even further. This demonstrates that context is an
important factor in entity disambiguation and linking. Our
final algorithm, which utilizes all the context dependent and
independent feature sets, has a precision of 63%, recall of
87% and f-score of 73%.</p>
      <p>Table 3 shows the performance of the Lithium EDL system
across various languages. We note that the test datasets
for these languages are smaller. However, the algorithm’s
performance is comparable to that for the English dataset.
</p>
    </sec>
    <sec id="sec-18">
      <title>6.4 Runtime Performance</title>
      <p>The Lithium EDL system has been built to run in a bulk
manner as well as a REST API service. The two major
challenges that we faced while developing the system were the
volume of new data that we process in bulk daily and limited
computational capacity. These challenges had a significant
influence on our system design and algorithmic approach.</p>
      <p>As a demonstrative example, the most time-consuming task in
our MapReduce cluster processes around 910 million
documents, with an average document size of 169 bytes, taking
about 2.2ms per document. Our MapReduce cluster has
around 150 Nodes each having a 2.5 GHz Xeon processor.
The processing is distributed across 400 reducers. The
Reduce step takes about 2.5 hrs. Each reducer task runs as a
single thread with an upper bound of 7GB on memory where
the processing pipeline and models utilize 3.7GB.</p>
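      <p>A quick back-of-the-envelope check on these cluster numbers is shown below (a sketch; the gap between the resulting wall time per document and the quoted 2.2 ms per-document processing time would be MapReduce scheduling, shuffle and I/O overhead).</p>

```python
# Sanity check of the bulk-processing numbers quoted above.
docs = 910_000_000   # documents processed in the daily bulk run
reducers = 400       # parallel reducer tasks
wall_hours = 2.5     # observed wall time of the Reduce step

# Total reducer-seconds spent, spread over all documents.
reducer_seconds = wall_hours * 3600 * reducers
ms_per_doc = reducer_seconds / docs * 1000
print(round(ms_per_doc, 2))  # ~3.96 ms of reducer wall time per document
```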
      <p>
        A more detailed breakdown of the computational
performance of our system as a function of document length is
shown in Figure 5. The overall performance of the system is
a linear function of text length. We also analyze this
performance for different languages as well as for different stages of
the Lithium NLP pipeline. We can see that the computation
is slowest for English since it has the maximum number of
entities [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
    </sec>
    <sec id="sec-19">
      <title>7 COMPARISON WITH OTHER COMMERCIAL SYSTEMS</title>
    </sec>
    <sec id="sec-20">
      <p>Currently, due to limited resources at our end and due to
inherent differences in KB, data and text preprocessing stages,
a direct comparison of the Lithium EDL system’s
performance (in terms of precision, recall and f-score) with other
commercial systems, such as Google Cloud NL API,
OpenCalais and AIDA, is not possible. Hence, we compare our
system with them on a different set of metrics.
</p>
    </sec>
    <sec id="sec-21">
      <title>7.1 Comparison on languages</title>
      <p>While the Lithium EDL system supports 6 languages
(English, Arabic, Spanish, French, German, and
Japanese), Google Cloud NL API supports mainly 3 languages:
English, Spanish, and Japanese. Similarly, OpenCalais
supports only English, Spanish, and French while AIDA only
supports English and Arabic.
</p>
    </sec>
    <sec id="sec-22">
      <title>7.2 Comparison on linked entity density</title>
      <p>A major advantage of our system is the ability to discover
and disambiguate a much larger number of entities compared
to other state-of-the-art systems. As a demonstration, we
compared our result with Google Cloud NL API and
OpenCalais12. In particular, we ran both APIs on documents in
our test data set with the common subset of languages that
they supported.</p>
      <p>Table 4 compares the total number of unique entities
disambiguated by Lithium EDL system and those by Google NL.
An entity from Google NL is considered to be disambiguated
if it was associated with a Wikipedia link. Column Both
shows the numbers of entities that were disambiguated by
both systems. Most entities disambiguated by Google NL
were also disambiguated by our system. In addition, our
system disambiguated several more entities. Based on
the precision of our system, we can estimate that at least
6080 disambiguated entities from our system are correct.
This implies that Google NL missed more than 2600 entities
that were correctly disambiguated by our system. Thus, our
system correctly disambiguated at least 75% more entities
than Google NL.</p>
      <p>Table 5 shows a similar comparison between our system and
OpenCalais. Every entity from the OpenCalais API is considered
to be disambiguated. However, since OpenCalais does
not link the disambiguated entities to Wikipedia or Freebase
but to its own proprietary KB, we cannot determine which
entities were discovered by both the systems. Nevertheless,
based on the precision of our system, at least 3500 entities
that were correctly disambiguated by our system, were missed
by OpenCalais, which is significantly more than the number
of entities they detected.
</p>
    </sec>
    <sec id="sec-23">
      <title>7.3 Comparison on runtime</title>
      <p>
        We compared the runtime performance of the Lithium NLP
pipeline against AIDA13 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] on several English language
documents. Comparison results are shown in Figure 6 on
the log-log scale. In Figure 6a we can see that the text
preprocessing stage of the Lithium pipeline is about
30,000-50,000 times faster compared to AIDA, which is based on
Stanford NLP NER [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The results for the disambiguation
stage are shown in Figure 6b. The disambiguation stage of
both systems takes a similar amount of time. However,
AIDA fails to extract as many entities, as evident in Figure 6c,
which shows that AIDA extracts 2.8 times fewer entities per
50kb of text. Finally, the disambiguation runtime per unique
entity extracted of the Lithium pipeline is about 3.5 times faster
than that of AIDA, as shown in Figure 6d. In conclusion, although
AIDA entity disambiguation is fairly fast and robust, our
system runs significantly faster and is capable of extracting
many more entities.
</p>
    </sec>
    <sec id="sec-24">
      <title>7.4 Comparison on demonstrative example</title>
      <p>In order to explicitly demonstrate the benefits and
expressiveness of our system, we also compare the results of our EDL
system with Google Cloud NL API, OpenCalais and AIDA
on the example that we discussed in Section 4.3. Figure 7
shows the disambiguation and linking results generated by
our EDL system and the three other systems (Google NL
Cloud API, OpenCalais and AIDA) that we compare with.</p>
      <p>12We also analyzed AlchemyAPI (http://www.alchemyapi.com/resources)
but it only processed a limited amount of text in a document
and was not very stable on languages other than English.
13https://github.com/yago-naga/aida</p>
      <p>[Figure 7: entity annotations produced by Lithium, AIDA, OpenCalais
and the Google Cloud NLP API on the sample sentence “Google CEO
Eric Schmidt said that the competition between Apple and Google and
iOS vs. Android is ‘the defining fight of the tech industry.’ ”]</p>
      <p>As evident, our EDL system disambiguates and links more
entities correctly than the other 3 systems. All the other
systems fail to disambiguate and link iOS and tech industry.</p>
      <p>In addition, AIDA incorrectly disambiguates Apple.</p>
      <p>In this paper, we presented the Lithium EDL system that
disambiguates and links entity mentions in text to their
unique Freebase ids. Our EDL algorithm uses several context
dependent and context independent features to disambiguate
mentions to their respective entities. Moreover, it recognizes
several types of entities in addition to named entities like
people, places, organizations. In addition, our EDL system is
language-agnostic and currently supports several languages
including English, Arabic, Spanish, French, German, and
Japanese. As a result, it is highly applicable to process real
world text such as multi-lingual user generated content from
social media in order to model user interests and expertise.</p>
      <p>We compared our EDL system with several
state-of-the-art systems and demonstrated that it has high throughput
and is very lightweight. It can be run on an off-the-shelf
commodity machine and scales easily to large datasets. Also,
our experiments show that our EDL system extracts and
correctly disambiguates about 75% more entities than
existing state-of-the-art commercial systems such as Google
Cloud NL API and OpenCalais, and is significantly faster
than some of them. In the future, we plan to add support for
several other languages to our EDL system once we have
collected enough ground truth data for them. We also plan
to migrate to Wikipedia as our KB. We will also compare our
system’s performance against several state-of-the-art systems
on metrics such as precision, recall and f-score with respect
to existing benchmarked datasets.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Prantik</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nemanja</given-names>
            <surname>Spasojevic</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Global Entity Ranking Across Multiple Languages</article-title>
          .
          <source>In Proceedings of the 26th International Conference on World Wide Web</source>
          . to appear.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Luka</given-names>
            <surname>Bradesko</surname>
          </string-name>
          , Janez Starc, and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Pacifico</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Isaac Bloomberg Meets Michael Bloomberg: Better EntityDisambiguation for the News</article-title>
          .
          <source>In 24th International Conference on World Wide Web.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Razvan C</given-names>
            <surname>Bunescu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marius</given-names>
            <surname>Pasca</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Using Encyclopedic Knowledge for Named entity Disambiguation.</article-title>
          .
          <source>In EACL. 9-16.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Rudi L</given-names>
            <surname>Cilibrasi</surname>
          </string-name>
          and Paul MB Vitanyi.
          <year>2007</year>
          .
          <article-title>The google similarity distance</article-title>
          .
          <source>IEEE Transactions on knowledge and data engineering 19</source>
          ,
          <issue>3</issue>
          (
          <year>2007</year>
          ),
          <fpage>370</fpage>
          -
          <lpage>383</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Silviu</given-names>
            <surname>Cucerzan</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Large-Scale Named Entity Disambiguation Based on Wikipedia Data.</article-title>
          .
          <source>In EMNLP-CoNLL</source>
          , Vol.
          <volume>7</volume>
          .
          <fpage>708</fpage>
          -
          <lpage>716</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Joachim</given-names>
            <surname>Daiber</surname>
          </string-name>
          , Max Jakob, Chris Hokamp, and
          <string-name>
            <surname>Pablo N Mendes</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Improving efficiency and accuracy in multilingual entity extraction</article-title>
          .
          <source>In Proceedings of the 9th International Conference on Semantic Systems. ACM</source>
          ,
          <fpage>121</fpage>
          -
          <lpage>124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Jenny Rose</given-names>
            <surname>Finkel</surname>
          </string-name>
          , Trond Grenager, and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling</article-title>
          .
          <source>In 43rd Annual Meeting on Association for Computational Linguistics</source>
          .
          <fpage>363</fpage>
          -
          <lpage>370</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Xianpei</given-names>
            <surname>Han</surname>
          </string-name>
          and
          <string-name>
            <given-names>Le</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>A generative entity-mention model for linking entities with knowledge base</article-title>
          .
          <source>In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics</source>
          ,
          <fpage>945</fpage>
          -
          <lpage>954</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Tom</given-names>
            <surname>Heath</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Linked data: Evolving the web into a global data space</article-title>
          .
          <source>Synthesis lectures on the semantic web: theory and technology 1</source>
          ,
          <issue>1</issue>
          (
          <year>2011</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Rada</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andras</given-names>
            <surname>Csomai</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Wikify!: Linking Documents to Encyclopedic Knowledge</article-title>
          .
          <source>In Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management (CIKM '07)</source>
          .
          <fpage>233</fpage>
          -
          <lpage>242</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>David</given-names>
            <surname>Milne</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ian H.</given-names>
            <surname>Witten</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Learning to Link with Wikipedia</article-title>
          .
          <source>In Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM '08)</source>
          .
          <fpage>509</fpage>
          -
          <lpage>518</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Dat Ba</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , Johannes Hoffart,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Theobald</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>AIDA-light: High-Throughput Named-Entity Disambiguation.</article-title>
          .
          <source>In LDOW.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Nemanja</given-names>
            <surname>Spasojevic</surname>
          </string-name>
          , Preeti Bhargava, and
          <string-name>
            <given-names>Guoning</given-names>
            <surname>Hu</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>DAWT: Densely Annotated Wikipedia Texts across multiple languages</article-title>
          .
          <source>In Proceedings of the 26th International Conference Companion on World Wide Web. International World Wide Web Conferences Steering Committee</source>
          , to appear.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Nemanja</given-names>
            <surname>Spasojevic</surname>
          </string-name>
          , Prantik Bhattacharyya, and
          <string-name>
            <given-names>Adithya</given-names>
            <surname>Rao</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Mining half a billion topical experts across multiple social networks</article-title>
          .
          <source>Social Network Analysis and Mining 6</source>
          ,
          <issue>1</issue>
          (
          <year>2016</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Nemanja</given-names>
            <surname>Spasojevic</surname>
          </string-name>
          , Jinyun Yan,
          <string-name>
            <given-names>Adithya</given-names>
            <surname>Rao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Prantik</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>LASTA: Large Scale Topic Assignment on Multiple Social Networks</article-title>
          .
          <source>In Proc. of ACM Conference on Knowledge Discovery and Data Mining (KDD) (KDD '14).</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Mohamed Amir</given-names>
            <surname>Yosef</surname>
          </string-name>
          , Johannes Hoffart, Ilaria Bordino,
          <string-name>
            <given-names>Marc</given-names>
            <surname>Spaniol</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Aida: An online tool for accurate disambiguation of named entities in text and tables</article-title>
          .
          <source>Proceedings of the VLDB Endowment 4</source>
          ,
          <issue>12</issue>
          (
          <year>2011</year>
          ),
          <fpage>1450</fpage>
          -
          <lpage>1453</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Mohamed Amir</given-names>
            <surname>Yosef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marc</given-names>
            <surname>Spaniol</surname>
          </string-name>
          , and Gerhard Weikum.
          <year>2014</year>
          .
          <article-title>AIDArabic: A named-entity disambiguation framework for Arabic text</article-title>
          .
          <source>In The EMNLP 2014 Workshop on Arabic Natural Language Processing</source>
          .
          <fpage>187</fpage>
          -
          <lpage>195</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>