<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Weasel: a machine learning based approach to entity linking combining di erent features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Felix Tristram</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Walter</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philipp Cimiano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christina Unger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Semantic Computing Group, CITEC, Bielefeld University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The task of entity linking consists in disambiguating named entities occurring in textual data by linking them to an identi er in a knowledge base that represents the real-world entity they denote. We present Weasel, a novel approach that is based on a combination of di erent features that is trained using a Support Vector Machine. We compare our approach to state-of-the-art tools such as FOX and DBpedia spotlight, showing that it outperforms both on the AIDA/CoNLL dataset and provides comparable results on the KORE50 dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>Wikipedia</kwd>
        <kwd>entity linking</kwd>
        <kwd>vector similarity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The task of entity linking consists in disambiguating named entities occurring in
textual data by linking them to an identi er in a knowledge base that represents
the real-world entity they denote. We present Weasel, a novel approach that
is based on a combination of di erent features that is trained using a Support
Vector Machine. Given a named entity and a resource candidate, Weasel
combines four features that quantify i) the likelihood that the actual name refers to
the resource candidate (extracted from Wikipedia anchors), ii) the similarity of
the textual context in which the name occurs to the textual context of the
resource candidate (measured as the bag-of-words similarity using cosine), iii) the
overlap between the semantic neighborhood signature of the resource candidate
and the named entities mentioned in the larger context of the named entity, and
iv) the prominence of the resource candidate as measured by its page rank in
the DBpedia graph. An optimal model is trained using an SVM which combines
these four features into an overall decision whether the named entity refers to a
candidate resource or not.</p>
      <p>As underlying knowledge base, we use DBpedia and rely on Wikipedia pages
as textual context of the resources in DBpedia to extract our textual features.</p>
      <p>
        We compare our approach to state-of-the-art tools such as FOX [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and
DBpedia spotlight [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], showing that it outperforms both on the AIDA/CoNLL dataset
and provides comparable results on the KORE50 dataset.
      </p>
      <p>In the following section we describe the approach and then report evaluation
results in Section 3.</p>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <p>
        The input data for Weasel is a regular piece of text, for example a newspaper
article in which entity-representing sequences of tokens (fragments) are already
identi ed (we use the Stanford NER [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]). For a given fragment, the algorithm's
task is to determine the matching real-world entity. It does so by picking the
most likely entity from a set of entity candidates, e.g. speci ed in a knowledge
base, in our case DBpedia.
      </p>
      <p>A fragment which represents a real-world entity is called an anchor. To gather
a database of anchors and their corresponding entities, we extracted all anchor
texts from a Wikipedia dump1. Within a dump, hyperlinks to other pages appear
in the form [[Barack Obama|President Obama]], where Barack Obama refers
to an entity2 and President Obama is the anchor string which appears in the
page's text. A map is generated by parsing all Wikipedia articles (in a chosen
language) and counting the occurrences of anchor-entity pairs. The candidates
for a given fragment are found by looking up all entities that were referred to
by the fragment within Wikipedia.</p>
      <p>Candidates are scored by a support vector machine.3 Four features are
computed for each candidate, which are used as input for the SVM: tf-idf vector
similarity, semantic neighborhood signature vector similarity, candidate
reference probability, and page rank. In the following we explain each of them in
more detail.</p>
      <sec id="sec-2-1">
        <title>2.1 tf-idf vector similarity</title>
        <p>The tf-idf vector similarity measures how similar the input is to the abstract
of the candidate's Wikipedia entry with regards to the words which appear in
both texts. The tf-idf value for a word re ects how important it is in a document,
therefore if there is a large overlap between the important words in the input
and the abstract of a candidate's Wikipedia page, the candidate is considered
to be more likely correct. It is calculated based on the term frequency tf(t; d),
de ned as the raw frequency of term t in document d, and the inverse document
frequency idf(t; D), which is de ned as follows:
idf(t; D) = log</p>
        <p>N
jd 2 D : t 2 dj</p>
        <p>Where N is the number of documents in the corpus and jd 2 D : t 2 dj is the
number of documents t appears in. The tf-idf value is the product of the two:
t df(t; d; D) = tf(t; d) idf(t; D)
1 http://en.wikipedia.org/wiki/Wikipedia:Database_download
2 http://en.wikipedia.org/wiki/Barack_Obama
3 For the training set the correct assignment and three randomly chosen false
candidates for each entity are used. Weasel is using Weka's SMO: http://weka.
sourceforge.net/doc.dev/weka/classifiers/functions/SMO.html</p>
        <p>Weasel computes a tf-idf vector for each entity extracted from the Wikipedia
dump by picking those 100 terms t from the abstract a of its Wikipedia page
that have the highest tf-idf score. Here the tf-value can be obtained by simply
counting all occurrences of t in a, and the idf-value can be obtained by treating
the set of all entity abstracts as the corpus and computing the idf-score for every
occurring word.</p>
        <p>The tf-idf vector for the input is then computed in the same fashion, where
the term frequency for every term in the input is determined by counting its
occurrences in the input text. Terms that are unique in the input are ignored,
as they do not a ect the entity's vector score (they are multiplied by zero in the
vector similarity calculation).</p>
        <p>The tf-idf vector similarity is then calculated as the cosine similarity between
two vectors A (the input terms) and B (the candidate terms):
t dfVectorSimilarity(A; B) = cos( ) =</p>
        <p>A B
kAkkBk
2.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Semantic Neighborhood Signature Vector Similarity</title>
        <p>The semantic neighborhood signature vector similarity measures the overlap
between the candidate's semantic neighborhood signature and the set of all
candidates for the input. The rationale is that a candidate is more likely to be the
correct assignment for a given fragment if entities from its semantic
neighborhood signature are considered for other fragments (as related entities probably
appear together).</p>
        <p>
          The semantic neighborhood signature for an entity is computed in a similar
way as Babelfy [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] does, except that instead of the BabelNet4 knowledge base
we use a graph of interlinked Wikipedia pages, where each page is a vertex and
each link between pages is an edge. The idea is to identify for each resource
all other resources it is closely related to. This is a two step process. First, the
edges are assigned weights so that edges in denser parts of the graph become
more important. The weight of an edge weight(v; v0) is the number of directed
triangles (directed cycles of length 3) it appears in:
        </p>
        <p>weight(v; v0) := jf(v; v0; v00) : (v; v0); (v0; v00); (v00; v) 2 Egj + 1
Where v, v0 and v00 are vertices and E is the set of all the graph's edges.</p>
        <p>Second, random walks with restart are performed from each vertex. For a
number of n steps (in our case 100 000, which are still computable within a
reasonable timespan) the graph is explored by `walking' to a connected vertex
with probability 1 or to move back to the vertex of origin with probability
. The probability of visiting the vertex v0 is based on the normalized weight of
the connecting edge:</p>
        <p>P (v0jv) =</p>
        <p>P
v002V
weight(v; v0)
weight(v; v00)
4 http://babelnet.org/
Where V is the set of all vertices in the graph.</p>
        <p>During the random walk for a given root vertex vr, the visits to each vertex
vi(i 6= r) are counted. The result is a frequency distribution of related vertices
in which the vertices with higher counts are interpreted as to be more relevant
to the root vertex than vertices with lower counts. Vertices with a count lower
than threshold are ignored5.</p>
        <p>
          We compute a measure similar to the personalized page rank [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] using a
random walk approach similar to the one implemented in Babelfy. This step is
performed o ine.
        </p>
        <p>To calculate the similarity score for each candidate, their semantic
neighborhood signature vector needs to compared to the semantic neighborhood signature
vector of the input. Unfortunately, the input does not have a semantic
neighborhood signature of its own. As a surrogate, a candidate vector is generated
as follows: Let m(e) ! c be a mapping from an entity e to a count c. For each
candidate entity of every fragment, check whether it already appears in the
mapping. If not, add it to the map and set its count to one. If it exists, increment
its count by one.</p>
        <p>Therefore a candidate's semantic neighborhood signature similarity score
increases as entities from their semantic neighborhood signature appear more often
as candidates in a given input.
2.3</p>
      </sec>
      <sec id="sec-2-3">
        <title>Candidate Reference Probability</title>
        <p>To account for how often an anchor refers to a given entity, the candidate
reference probability is incorporated into the score calculation. For example, for the
anchor Michelangelo we might receive as candidates both the artist
Michelangelo6 and the Ninja Turtle Michelangelo7. While both are valid candidates
worthy of consideration, we can extract from Wikipedia that Michelangelo referred
only 21 times to the Ninja Turtle, but 1173 times to the artist. Turning these
candidate reference frequencies into probabilities (dividing by the total number
of references towards a given entity) we can use them as a factor when calculating
the score for a candidate:
crp(a; e) :=</p>
        <p>ref(a; e)
P ref(a0; e)
a02A</p>
        <p>Where ref(a; e) is the number of references from anchor/candidate a to entity
e and A is the set of all anchors referring to e.
5 = 10 in our implementation, so that n and
6 http://en.wikipedia.org/wiki/Michelangelo
7 http://en.wikipedia.org/wiki/Michelangelo_(Teenage_Mutant_Ninja_
Turtles)
have the same ratio as in Babelfy.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Page Rank</title>
        <p>The idea behind the PageRank algorithm is to estimate the relevance of a website
to a search query not only by whether the search terms appear on that page,
but also by how many other pages were providing a hyperlink to the website
in question. As DBpedia entries form a highly interconnected graph, the same
principle can be applied in order to determine the page rank of a resource in
DBpedia.</p>
        <p>Let u be a web page. Bu is the set of pages that have a hyperlink pointing
to u. Let Nu be the number of outgoing links for page u. A slightly simpli ed
PageRank is then de ned as follows:</p>
        <p>R(u) = c X
v2Bu</p>
        <p>R(v)
Nv
Where c is a normalizing factor. Notably the PageRank of a page is derived
from the PageRank of its incoming links, meaning that the algorithm has to
be iterated until the system reaches a steady state. The overall PageRank for a
graph is supposed to be constant, but there are corner cases (like cycles) that
have to be accounted for; see for more information. Our implementation uses
the JUNG framework8.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>
        We evaluate our entity linking approach on two datasets9, AIDA/CoNLL and
KORE50, and compare its results with two other state-of-the-art entity linking
approaches, AGDISTIS [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] as part of FOX10, and DBpedia Spotlight11, in terms
of precision and recall. Precision, recall, and the derived F1 score were de ned
as follows:
      </p>
      <p>P recision =</p>
      <p>Recall =
# of correctly assigned entities</p>
      <p># of all assigned entities
# of correctly assigned entities
# of gold standard entities</p>
      <p>precision recall
F1 = 2</p>
      <p>precision + recall</p>
      <p>Precision, Recall and F-Measure are micro-averaged over all named entities.
Documents in which more named entities occur have thus a larger impact on the
scores.
8 Java Universal Network/Graph Framework: http://jung.sourceforge.net/
9 Weasel's results for the 10-fold cross validations can be downloaded
from: http://sebastianwalter.org/weasel/nFoldResultsAIDA.zip and
http://sebastianwalter.org/weasel/nFoldResultsKore50.zip
10 http://aksw.org/Projects/FOX.html
11 http://spotlight.dbpedia.org</p>
      <p>AIDA/CoNLL</p>
      <p>FOX
DBpedia Spotlight</p>
      <p>Weasel</p>
      <p>Precision</p>
      <p>The quality of the NER used to nd entities poses an upper bound on
the achievable entity linking scores. For the AIDA/CoNLL dataset the
highest achievable linking precision is 0.79, while the highest achievable recall is 0.98
(max. F1: 0.88). For the KORE50 dataset the values are 0.93 and 0.88
respectively (max. F1: 0.90).</p>
      <p>Table 1 shows the evaluation results. Weasel outperforms both FOX and
DBpedia Spotlight on the AIDA/CoNLL dataset and achieves comparable results
on the KORE50 dataset. The latter is due to the fact that KORE50 was designed
to have a high degree of ambiguity and short sentences, which renders Weasel's
approach less e ective: Short sentences provide fewer tokens to evaluate a
tfidf overlap, while increased numbers of candidates allow for more detrimental
semantic neighborhood signature vector overlaps.</p>
      <p>Table 2 shows results leaving out single features and using only one feature.
We clearly see that the page rank is the dominating factor. This deserves closer
investigation.</p>
      <p>Weasel works best when a fragment refers to the most common meaning of a
word. For example, assignments like BSE ! Bovine spongiform encephalopathy
are usually correct. More ambiguous fragments get assigned correctly as well
when that assignment is the most common one in the given context, for example
commission ! European Commission in a debate of the European Parliament.
Problems arise when the target entity is not the most common assignment for
a given fragment. For example, the fragment Spanish in Spanish Farm Minister
Loyola de Palacio leads to the assignment Spanish ! Spanish Language, as the
fragment usually refers to the language, whereas the gold assignment is Spanish
! Spain.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>
        We have presented Weasel, a new entity linking approach that exploits a
combination of features that are combined using a Support Vector Machine. We
have shown that the approach works very well when the input sentences are
longer, clearly outperforming FOX and DBpedia Spotlight on the AIDA/CoNLL
dataset. On the KORE50 dataset where sentences are shorter, it achieves
comparable results to these approaches. In future work, we will release a service
that can be publicly accessed using a similar interface to DBpedia Spotlight and
without
SNS
tf-idf
page rank
CRP
only with
SNS
tf-idf
page rank
CRP
F1
FOX. We will also extend the approach to other languages and perform a more
systematic evaluation using the GERBIL framework [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgment</title>
      <p>This work was supported by the Cluster of Excellence Cognitive Interaction
Technology CITEC (EXC 277) at Bielefeld University, which is funded by the
German Research Foundation (DFG).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Eneko</given-names>
            <surname>Agirre</surname>
          </string-name>
          and
          <string-name>
            <given-names>Aitor</given-names>
            <surname>Soroa</surname>
          </string-name>
          .
          <article-title>Personalizing pagerank for word sense disambiguation</article-title>
          .
          <source>In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL '09</source>
          , pages
          <fpage>33</fpage>
          {
          <fpage>41</fpage>
          ,
          <string-name>
            <surname>Stroudsburg</surname>
          </string-name>
          , PA, USA,
          <year>2009</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Jenny</given-names>
            <surname>Rose</surname>
          </string-name>
          <string-name>
            <surname>Finkel</surname>
          </string-name>
          , Trond Grenager, and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Incorporating nonlocal information into information extraction systems by gibbs sampling</article-title>
          .
          <source>In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05</source>
          , pages
          <fpage>363</fpage>
          {
          <fpage>370</fpage>
          ,
          <string-name>
            <surname>Stroudsburg</surname>
          </string-name>
          , PA, USA,
          <year>2005</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Pablo</surname>
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Mendes</surname>
            , Max Jakob,
            <given-names>Andres</given-names>
          </string-name>
          <string-name>
            <surname>Garc</surname>
            a-Silva, and
            <given-names>Christian</given-names>
          </string-name>
          <string-name>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Dbpedia spotlight: Shedding light on the web of documents</article-title>
          .
          <source>In Proceedings of the 7th International Conference on Semantic Systems, I-Semantics '11</source>
          , pages
          <issue>1</issue>
          {
          <fpage>8</fpage>
          , New York, NY, USA,
          <year>2011</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Moro</surname>
          </string-name>
          , Alessandro Raganato, and
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Navigli</surname>
          </string-name>
          .
          <article-title>Entity Linking meets Word Sense Disambiguation: a Uni ed Approach. Transactions of the Association for Computational Linguistics (TACL</article-title>
          ),
          <volume>2</volume>
          :
          <fpage>231</fpage>
          {
          <fpage>244</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Rene</given-names>
            <surname>Speck</surname>
          </string-name>
          and
          <string-name>
            <surname>Axel-Cyrille Ngonga Ngomo</surname>
          </string-name>
          .
          <article-title>Named entity recognition using fox</article-title>
          .
          <source>In International Semantic Web Conference</source>
          <year>2014</year>
          (
          <issue>ISWC2014</issue>
          ),
          <source>Demos &amp; Posters</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Ricardo</given-names>
            <surname>Usbeck</surname>
          </string-name>
          ,
          <string-name>
            <surname>Axel-Cyrille Ngonga</surname>
            <given-names>Ngomo</given-names>
          </string-name>
          , Michael Roder, Daniel Gerber, SandroAthaide Coelho, Soren Auer, and
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Both</surname>
          </string-name>
          .
          <article-title>Agdistis - graph-based disambiguation of named entities using linked data</article-title>
          . In Peter Mika, Tania Tudorache, Abraham Bernstein, Chris Welty, Craig Knoblock, Denny Vrandecic, Paul Groth, Natasha Noy, Krzysztof Janowicz, and Carole Goble, editors,
          <source>The Semantic Web | ISWC</source>
          <year>2014</year>
          , volume
          <volume>8796</volume>
          of Lecture Notes in Computer Science, pages
          <volume>457</volume>
          {
          <fpage>471</fpage>
          . Springer International Publishing,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Ricardo</given-names>
            <surname>Usbeck</surname>
          </string-name>
          , Michael Roder,
          <string-name>
            <surname>Axel-Cyrille Ngonga</surname>
            <given-names>Ngomo</given-names>
          </string-name>
          , Ciro Baron, Andreas Both, Martin Brummer, Diego Ceccarelli, Marco Cornolti, Didier Cherix,
          <string-name>
            <given-names>Bernd</given-names>
            <surname>Eickmann</surname>
          </string-name>
          , et al. Gerbil:
          <article-title>General entity annotator benchmarking framework</article-title>
          .
          <source>In Proceedings of the 24th International Conference on World Wide Web</source>
          , pages
          <volume>1133</volume>
          {
          <fpage>1143</fpage>
          . International World Wide Web Conferences Steering Committee,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>