<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Graph-based Entity Linking using Shortest Path</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yongsun Shim</string-name>
          <email>yongsun0926@snu.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sungkwon Yang</string-name>
          <email>sungkwon.yang@snu.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hyunwhan Joe</string-name>
          <email>hyunwhanjoe@snu.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hong-Gee Kim</string-name>
          <email>hgkim@snu.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Biomedical Knowledge Engineering Laboratory, Seoul National University</institution>
          ,
          <addr-line>Seoul</addr-line>
          ,
          <country country="KR">Korea</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Entity Linking (EL) is a technique that links named entities in a given context to relevant entities in a given knowledge base. EL generally consists of two important tasks, candidate entity lookup and candidate ranking, but existing ranking approaches have limitations. To overcome them, we address EL through the interdependencies among not only the named entities but also the common nouns and verbs appearing in the context. Our second approach assumes that the more closely words are related, the closer together they appear in the context. In this paper, we propose approaches that build on these ideas to overcome the limitations.</p>
      </abstract>
      <kwd-group>
        <kwd>Entity Linking</kwd>
        <kwd>Graph-based knowledge base</kwd>
        <kwd>Shortest Path</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Entity Linking (EL), one of many natural language processing tasks,
is also known as Named Entity Disambiguation (NED). EL is a technique that links
named entities in a given context to relevant entities in a given knowledge base. Here,
the context means the natural language sentences we want to analyze. EL is used in
various fields such as information extraction, semantic search, and question
answering.</p>
      <p>Generally, EL consists of two important tasks. The first is candidate entity
lookup, which lists the entities in the given knowledge base that correspond
to the named entity of interest. The second, called candidate ranking, ranks
these candidate entities by relevance. There are two main approaches to candidate ranking.
The first is the local-compatibility approach, which
compares the similarity between the context and the description of each candidate entity.
Its limitation is that it does not consider the relations among
all the name mentions appearing in the context. The second is the pairwise
approach, which considers interdependencies among the
candidate entities of the named entities appearing in the context. However, this
approach computes similarity only between pairs of candidate entities. It can
yield wrong results when two candidate entities lack a direct relation,
even though they are indirectly connected. In other words, it lacks
global interdependence.</p>
      <p>To overcome these limitations, we try to solve the EL problem through the
interdependencies among not only the named entities but also the common nouns and verbs
appearing in the context.</p>
    </sec>
    <sec id="sec-2">
      <title>Our Approach</title>
      <p>In our approach, a context means a sentence. We try to use every word in
a context: not only the named entities but also the
general nouns and verbs that appear in it. This differs from
the traditional approaches, which use only the named entities. Our
assumption is that all the words appearing in a context are related to each other.</p>
      <p>The second approach uses the ordering of the words. The idea is
that every word appearing in a context is associated with the others: when
a word appears, the words related to it tend to appear nearby in the
context.</p>
      <p>The third approach extends these ideas to a graph-based knowledge base:
vertices that are related to each other are located close together in the
knowledge graph, just as related words appear close together in a context.</p>
      <p>With these approaches, EL in a graph-based knowledge base can be solved by
finding shortest paths. That is, if words appear together in a
context and the distance between them is small, the distance between the
vertices they map to in the graph-based knowledge base will also be small.</p>
    </sec>
    <sec id="sec-3">
      <title>Workflow</title>
      <p>In this paper, we use all the words in a context to prevent
information loss, and we consider their ordering, assuming that
neighboring words are closely related to each other. Thus every word
that survives pre-processing is grouped into a sequence, which we then apply
to a pre-constructed weighted graph.</p>
      <p>This chapter describes the workflow of this paper, which is
shown in Fig. 1 and divided into five steps.</p>
      <p>The first step assigns weights to the graph-based knowledge base. This step is
performed only once, at the start. When weighting the entire graph, we
target the edges of triangles and their corresponding vertices. We use the
triangle shape because we start from the assumption that the meaning of a
specific vertex can be known from its directly linked vertices together with vertex
information beyond one depth.</p>
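      <p>The paper does not spell out the weighting formula; as one plausible reading, the sketch below weights each edge by the number of triangles it closes, i.e., the common neighbors of its two endpoints. The toy graph and the base weight of 1 are illustrative assumptions.</p>

```python
# Weight each undirected edge by the number of triangles it closes.
def triangle_weights(adj):
    weights = {}
    for u in adj:
        for v in adj[u]:
            if u > v:  # visit each undirected edge once
                continue
            shared = adj[u].intersection(adj[v])  # common neighbours = triangles
            weights[(u, v)] = 1 + len(shared)  # base weight 1 is an assumption
    return weights

# toy knowledge graph as an adjacency map of sets (illustrative data)
adj = {
    "Paris": {"France", "Seine", "Eiffel_Tower"},
    "France": {"Paris", "Seine"},
    "Seine": {"Paris", "France"},
    "Eiffel_Tower": {"Paris"},
}
weights = triangle_weights(adj)
```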
    </sec>
    <sec id="sec-4">
      <title>Context Pre-processing &amp; NER</title>
      <p>The second step is context pre-processing, which extracts the whole word set from the
context. This step consists of two parts.</p>
      <p>The first part is context pre-processing proper: tokenization, stopword
removal, and stemming. Among the Stanford NLP tools, we use the tokenizer to
split the text into words and the stopword-removal tool to remove stopwords such as
I, am, and are. After the stopwords are eliminated, the remaining words are converted
to their base forms through stemming.</p>
      <p>The second part is Named Entity Recognition (NER), which extracts the named
entities. Among the Stanford NLP tools, the NER tool is used
to perform entity recognition.</p>
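      <p>As a rough illustration of this step, the sketch below uses simple standard-library stand-ins (a regex tokenizer, a tiny stopword list, and naive suffix stripping) in place of the Stanford tools named above; it is not the actual pipeline.</p>

```python
import re

STOPWORDS = {"i", "am", "are", "is", "the", "a", "an", "of", "to", "in"}

def preprocess(context):
    # tokenize on letter runs (stand-in for the Stanford tokenizer)
    tokens = re.findall(r"[A-Za-z]+", context.lower())
    kept = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    stemmed = []
    for t in kept:
        # naive suffix stripping in place of a real stemmer
        for suffix in ("ing", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed
```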
    </sec>
    <sec id="sec-5">
      <title>Extract Candidate Entities</title>
      <p>The third step extracts candidates for each named entity using the whole word set. This
step consists of two parts.</p>
      <p>
        The first part is Word Sense Disambiguation, which selects the proper
meaning of a word whenever the extracted whole word set contains a common noun with an
ambiguous meaning. In this paper, we apply the Lesk algorithm [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
a well-known knowledge-based approach: we compute the similarity between the context
and each dictionary definition of the word, and the definition with the highest
similarity is selected as the proper meaning.
      </p>
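      <p>The selection rule above can be sketched as a simplified Lesk: count the word overlap between the context and each sense's definition and keep the best one. The sense inventory and glosses below are toy data, not a real dictionary.</p>

```python
def simplified_lesk(context_words, senses):
    # choose the sense whose definition shares the most words with the context
    context = set(context_words)
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context.intersection(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# toy sense inventory for the ambiguous noun "bank" (illustrative glosses)
senses = {
    "bank_river": "sloping land beside a body of water such as a river",
    "bank_finance": "a financial institution that accepts deposits of money",
}
```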
      <p>The second part extracts candidates for each named entity with an
ambiguous meaning in the whole word set. Candidate
named entities are extracted from DBpedia through keyword matching.</p>
      <p>That is, the whole word set obtained through steps 2 and 3 is extended into candidate
word sets, each a combination of disambiguated words and candidate entities.</p>
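      <p>A minimal sketch of the keyword-matching lookup, using a toy label table in place of an actual DBpedia query (the dbr: identifiers and labels are illustrative):</p>

```python
def candidate_entities(mention, entity_labels):
    # keyword matching: keep entities whose label contains the mention
    m = mention.lower()
    return [e for e, label in entity_labels.items() if m in label.lower()]

# toy label table standing in for a DBpedia label lookup
entity_labels = {
    "dbr:Paris": "Paris",
    "dbr:Paris_Hilton": "Paris Hilton",
    "dbr:Paris,_Texas": "Paris, Texas",
    "dbr:London": "London",
}
```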
    </sec>
    <sec id="sec-6">
      <title>Apply Candidate Word Sets to Pre-weighted Graph</title>
      <p>The fourth step creates a graph for each candidate word set by applying it to the
weighted graph constructed in step 1. When the candidate word set is applied to the
weighted graph, the graph gradually expands from the vertex of the first word to the other vertices
connected to it. If we find the second vertex of the candidate word set while
expanding, we extend the graph again based on the second vertex. In this way, when
the last vertex appears while the graph is gradually expanded, the process stops.</p>
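      <p>The expansion procedure above might be sketched as a breadth-first search that restarts its frontier each time the next word of the candidate set is reached; the restart rule is our reading of "extend the graph again based on the second vertex", and the toy adjacency data is an assumption.</p>

```python
from collections import deque

def expand_until_found(adj, word_seq):
    # breadth-first expansion from the first word's vertex; whenever the
    # next word of the candidate set is reached, restart the frontier there;
    # stop once the last word has appeared
    start = word_seq[0]
    visited = {start}
    remaining = deque(word_seq[1:])
    frontier = deque([start])
    while remaining:
        if not frontier:
            return None  # next word unreachable from the expanded region
        v = frontier.popleft()
        for nb in adj.get(v, ()):
            if nb in visited:
                continue
            visited.add(nb)
            frontier.append(nb)
            if nb == remaining[0]:
                remaining.popleft()
                frontier = deque([nb])  # extend again from the found vertex
                break
    return visited

# toy vertices (edge weights omitted; expansion only needs adjacency)
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
```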
    </sec>
    <sec id="sec-7">
      <title>Find the Shortest Distance Graph in the Candidate Graph</title>
      <p>In step 5, we compute the shortest distance between the vertices of each candidate word
set, based on the graph created for it. When moving from one
vertex to another, we select the vertex connected by the edge with the highest weight
between the two vertices, and we follow this path until the last vertex is reached. Among the
candidate graphs, we choose the one with the shortest path. The
candidate named entity of the selected graph is considered the correct answer for
the named entity used in the context.</p>
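      <p>One way to sketch step 5: run Dijkstra's algorithm with each edge's distance taken as the reciprocal of its weight, so that high-weight edges are preferred as described above, and score each candidate by the summed distance along its word sequence. The reciprocal distance and the toy graph are our assumptions, not the paper's exact procedure.</p>

```python
import heapq

def shortest_distance(adj, weights, src, dst):
    # Dijkstra; an edge's distance is the reciprocal of its weight (assumption),
    # so the highest-weight edge between two vertices is the "shortest" one
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, v = heapq.heappop(heap)
        if v == dst:
            return d
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry
        for nb in adj.get(v, ()):
            w = weights.get((v, nb)) or weights.get((nb, v), 1.0)
            nd = d + 1.0 / w
            if nd < dist.get(nb, float("inf")):
                dist[nb] = nd
                heapq.heappush(heap, (nd, nb))
    return float("inf")

def score_candidate(adj, weights, word_seq):
    # total distance along consecutive words; the smallest score wins step 5
    return sum(shortest_distance(adj, weights, a, b)
               for a, b in zip(word_seq, word_seq[1:]))

# toy graph: the car sense of "Jaguar" is connected to the context words
adj = {
    "Jaguar_(car)": ["engine"],
    "Jaguar_(animal)": ["jungle"],
    "engine": ["Jaguar_(car)", "speed"],
    "speed": ["engine"],
    "jungle": ["Jaguar_(animal)"],
}
weights = {("Jaguar_(car)", "engine"): 3.0,
           ("engine", "speed"): 2.0,
           ("Jaguar_(animal)", "jungle"): 1.0}
best = min(["Jaguar_(car)", "Jaguar_(animal)"],
           key=lambda c: score_candidate(adj, weights, [c, "engine", "speed"]))
```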
    </sec>
    <sec id="sec-8">
      <title>Discussion &amp; Conclusion</title>
      <p>In this paper, we addressed EL using a graph-based knowledge base.
We try to solve the EL problem with three approaches. Our first approach uses
every word appearing in a context, because the words in a context are
related to each other; that is, it exploits the information of words
appearing together. The second approach assumes that the more closely
words are related, the closer together they appear; because the order of the
words is important, we represent the set of words as a sequence. The third
approach expects that strongly related words will also be close to each other in the graph-based
knowledge base.</p>
      <p>In the future, we will simplify the process of extending the graph for each candidate
word set, and apply algorithms such as dynamic programming to the
shortest-distance computation.</p>
      <p>Acknowledgments. This work was supported by an Institute for Information &amp;
communications Technology Promotion (IITP) grant funded by the Korea
government (MSIP) (No. 2013-0-00109, WiseKB: Big data based self-evolving
knowledge base and reasoning platform).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Moro</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raganato</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Navigli</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Entity Linking meets Word Sense Disambiguation: a Unified Approach</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>2</volume>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Collective Entity Linking in Web Text: A Graph-Based Method</article-title>
          .
          <source>Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pp.
          <fpage>765</fpage>
          -
          <lpage>774</lpage>
          . SIGIR '11, Beijing, China (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Lesk</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone</article-title>
          .
          <source>Proceedings of the 5th Annual International Conference on Systems Documentation</source>
          , pp.
          <fpage>24</fpage>
          -
          <lpage>26</lpage>
          . SIGDOC '86, Toronto, Ontario, Canada (
          <year>1986</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>