<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Linked Open Data sources for Entity Disambiguation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Esther Villar Rodríguez</string-name>
          <email>esther.villar@tecnalia.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ana I. Torre Bastida</string-name>
          <email>isabel.torre@tecnalia.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ana García</string-name>
          <email>agarcia@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marta González Rodríguez</string-name>
          <email>marta.gonzalez@tecnalia.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>OPTIMA Unit, TECNALIA</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ETSI Informática, UNED</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Within the framework of RepLab 2013, the filtering task tries to discover whether a tweet is related to a certain entity or not. Our work tries to take advantage of the Web of Data in order to create a context for every entity, extracted from the available Linked Data sources. In Natural Language Processing (NLP), context is the key element able to distinguish the semantics contained in a message by analyzing the frame in which its words are embedded.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Reputation management is used by companies (or individuals) to monitor public
opinion with the aim of maintaining a good brand image. The first step is to establish the
correct relation between an opinion (text) and the entity with some degree of
confidence. This is the objective of the filtering task in RepLab, an initiative promoted by
the EU project Limosine, focused on the ability to process and understand what the
strengths and weaknesses of an entity are, based on users' opinions
(http://www.limosine-project.eu/events/replab2013).</p>
      <sec id="sec-1-1">
        <title>Nowadays there is a large amount of available information on the web, such as</title>
        <p>web pages, social media data (tweets, facebook and others) or blogs. All of them
mention different entities, such as locations, characters, organizations ... The problem
appears when a name refers to an entity that have several meanings, for example the
song “Munich” of the music group “Editors” and the German city of the same name.</p>
      </sec>
      <sec id="sec-1-2">
        <title>For this filtering task, our system uses an approach based on the semantic context</title>
        <p>of an entity. The goal of this work is to create a description of an entity that will help
to achieve a, enough complex, semantic context to execute a successful
disambiguation. The data sources from where entity descriptions are extracted make up the Web
of Data, specifically the Linked Open Data Cloud.</p>
      </sec>
      <sec id="sec-1-3">
        <title>In this respect, our research has been developed in the frame of Linked Open Data</title>
        <p>paradigm that is a recommended best practice for exposing, sharing, and connecting
pieces of data, information, and knowledge on the Semantic Web using URIs and</p>
      </sec>
      <sec id="sec-1-4">
        <title>RDF. Due to activities like this, the volume of data in RDF format is continuously</title>
        <p>growing, building the known “Web of Data”, which today is the largest free available
knowledge base. Its size, open access, semantic character and continued growth led us
to choose it as our information provider for context generation during the filtering
task. This process is carried out using different semantic technologies: such as the</p>
      </sec>
      <sec id="sec-1-5">
        <title>SPARQL query language, the ontology definition languages, like RDF, RDFS or</title>
      </sec>
      <sec id="sec-1-6">
        <title>OWL and RDF repositories (SPARQL endpoints).</title>
      </sec>
      <sec id="sec-1-7">
        <title>The main contribution of this paper is the definition of a system that achieves high precision/sensitivity in the tasks of filtering by entities, using semantic technologies to extract context information from the Linked Open Data Sources, as are going to be presented in the following.</title>
        <p>2</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Proposed Approach</title>
      <sec id="sec-2-1">
        <title>In this section, we introduce our approach for filtering entities on tweets. Our procedure uses the semantic context of the analyzed entities, and compares it versus the terms contained in the tweet.</title>
      </sec>
      <sec id="sec-2-2">
        <title>First of all the tweets are preprocessed, extracting the terms involved on them.</title>
      </sec>
      <sec id="sec-2-3">
        <title>These terms are the input for a second phase, where equivalent available forms for the</title>
        <p>concepts are obtained by the Stylus1 tool. When all the possible forms of a term are
calculated, the last step consists on generating a semantic context by querying
different data sources (modeled by a set of ontologies) that the Linked Open Data Cloud
provides to us.</p>
      </sec>
      <sec id="sec-2-4">
        <title>The section is divided into four parts. First we introduce the motivation to use a semantic context for entities filtering, later we explain the preprocessing phase of the system. In the third subsection we include a description of the generation of the context and finally we resume our filtering algorithm.</title>
        <p>2.1</p>
        <sec id="sec-2-4-1">
          <title>Motivation</title>
          <p>The main reasons to utilize a semantic context for discovering the relatedness of
tweets with the different entities processed are the next two ones:
 Powerful modeling and rendering capabilities offered by ontologies. The
ontologies allow us to capture the concepts and properties of a particular domain. It is
possible to draw a conceptual structure as detailed and complete as necessary.
Furthermore, the process of describing ontologies is simple and straightforward,
generating an independent and autonomous model.
 The amount of free available semantically represented data (RDF, RDFS, and
OWL) into the linked Open Data Cloud. Nowadays the amount of information
available in RDF format is huge. The Linked Data paradigm has promoted the Web
of Data, formed by the Linked Datasets. This makes possible that any user can
obtain information about heterogeneous domains. Consulting these datasets through
technologies such as SPARQL or RDF Dumps, the user can get semantic
information about concepts or entities using modeling ontologies of different
repositories.</p>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>1 http://www.daedalus.es/productos/</title>
      </sec>
      <sec id="sec-2-6">
        <title>To illustrate the benefits which can give us a semantic context, there is an example here: In the field of reputation analysis of music groups, we consider the following tweet and the studied entity is the group U2:</title>
        <p>"Enjoy a lot in the last concert of the singer Bono"</p>
      </sec>
      <sec id="sec-2-7">
        <title>At first there is nothing in the tweet that can help us to relate it in a syntactically manner with "U2". But using the semantic context generated for this group of music, we know that among the members of the group, their vocalist is Paul David Hewson, better known by its artistic name "Bono".</title>
      </sec>
      <sec id="sec-2-8">
        <title>The semantic context allows us to build relationships that lead to U2 from Bono</title>
        <p>and in this way we deduce that a tweet talking about Bono entity, it also does
indirectly about the U2 entity and therefore the tweet and the second entity are related.</p>
      </sec>
      <sec id="sec-2-9">
        <title>For the extraction of the necessary information for the generation of context, we have considered the data sources and ontologies shown in the table 1:</title>
        <sec id="sec-2-9-1">
          <title>Dataset Name</title>
        </sec>
        <sec id="sec-2-9-2">
          <title>DBPEDIA</title>
        </sec>
        <sec id="sec-2-9-3">
          <title>MusicBranz</title>
        </sec>
        <sec id="sec-2-9-4">
          <title>EventMedia</title>
        </sec>
        <sec id="sec-2-9-5">
          <title>ZBW Economics</title>
        </sec>
        <sec id="sec-2-9-6">
          <title>Swetodblp</title>
        </sec>
        <sec id="sec-2-9-7">
          <title>DBLP</title>
        </sec>
        <sec id="sec-2-9-8">
          <title>Domain</title>
        </sec>
      </sec>
      <sec id="sec-2-10">
        <title>General domain.</title>
      </sec>
      <sec id="sec-2-11">
        <title>Music domain</title>
      </sec>
      <sec id="sec-2-12">
        <title>Media domain</title>
      </sec>
      <sec id="sec-2-13">
        <title>Economic domain</title>
      </sec>
      <sec id="sec-2-14">
        <title>University bibliography domain</title>
      </sec>
      <sec id="sec-2-15">
        <title>University bibliography domain</title>
        <p>Sparql Endpoint
http://dbpedia.org/sparql
http://dbtune.org/musicbrainz/sparql
http://eventmedia.eurecom.fr/sparql
http://zbw.eu/beta/sparql/
http://datahub.io/dataset/sweto-dblp
http://dblp.rkbexplorer.com/sparql/</p>
      </sec>
      <sec id="sec-2-16">
        <title>The table shows datasets of the four domains used in the task of filtering: music, university, banks and automobile.</title>
        <p>2.2</p>
        <sec id="sec-2-16-1">
          <title>Preprocessing of tweets</title>
        </sec>
      </sec>
      <sec id="sec-2-17">
        <title>This task is in charge of extracting the terms contained in the tweet. Before that the</title>
        <p>terms are compared with the entities, they need to be pre-processed to remove the
typical characteristics of the tweets (#) which can affect the precision.</p>
        <p>This preprocessing has three main tasks (fig 1):
1. Removing URL. URLs in this approach are eliminated, because they do not
provide value for the comparison in a the semantic context. In future work, we
will try to replace the URLs by entities that represent them, and we could even
consider the various relationships / links inside the web page that are identified
by the URL under study.</p>
      </sec>
      <sec id="sec-2-18">
        <title>2. Removing mentions. For our task, the mentions are not interesting at this moment, because the relationship between them and the content of the tweet is irrelevant.</title>
      </sec>
      <sec id="sec-2-19">
        <title>3. Transforming hashtags. Hashtags are topics with relevance that somehow summarize the content of the tweet; therefore we parse their terms so they can be treated by subsequent processes.</title>
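        <p>A minimal sketch of these three steps in Python (the regular expressions and the function name are illustrative assumptions, not the system's actual implementation):</p>
        <preformat>
import re

def preprocess_tweet(tweet):
    """Illustrative sketch: remove URLs and mentions, turn hashtags into plain terms."""
    text = re.sub(r"https?://\S+", " ", tweet)   # 1. remove URLs
    text = re.sub(r"@\w+", " ", text)            # 2. remove mentions
    text = re.sub(r"#(\w+)", r"\1", text)        # 3. transform hashtags into ordinary terms
    return text.split()                          # extracted terms

print(preprocess_tweet("Listening to #LedZeppelin with @a_friend http://t.co/xyz"))
# -> ['Listening', 'to', 'LedZeppelin', 'with']
</preformat>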
      </sec>
      <sec id="sec-2-20">
        <title>The context represents the related concepts/entities and the kind of relationships</title>
        <p>between them. In our approach, we generate a context for each needed entity.</p>
      </sec>
      <sec id="sec-2-21">
        <title>The information to build the context is obtained from the datasets shown in Table</title>
      </sec>
      <sec id="sec-2-22">
        <title>1. Depending on the type of entity, we perform different types of questions to a specific domain (music, banks, automobiles, universities). These queries are constructed from the different forms or variants that represent the entity.</title>
      </sec>
      <sec id="sec-2-23">
        <title>Thus, the context generation process consists in two sub-processes, (figure 2):</title>
        <p> Extraction of the forms of an entity. Using the API of Stylus2 (Daedalus) the
entity forms have been extracted to try to avoid misspelled or ambiguous names.</p>
      </sec>
      <sec id="sec-2-24">
        <title>IBM Software IBM Software Group IBM System</title>
      </sec>
      <sec id="sec-2-25">
        <title>2 http://www.daedalus.es/productos/stilus/stilus-sem/</title>
        <p> Extraction of the concepts/entities and relationships from the previous forms.</p>
      </sec>
      <sec id="sec-2-26">
        <title>This task is performed by consulting different datasets through their corresponding</title>
      </sec>
      <sec id="sec-2-27">
        <title>SPARQL endpoint, using SPARQL query language and following the ontologies of each dataset, that depends on the type of entity. An example of a SPARQL query type is described in figure 2.</title>
        <p>2.4</p>
        <sec id="sec-2-27-1">
          <title>Filtering Algorithm</title>
        </sec>
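        <p>A sketch of this second sub-process against the DBpedia endpoint of Table 1, using the SPARQLWrapper Python library; the query pattern is an illustrative assumption, not the exact query of figure 2:</p>
        <preformat>
from SPARQLWrapper import SPARQLWrapper, JSON

def context_for_form(form, endpoint="http://dbpedia.org/sparql"):
    """Sketch: collect English labels of resources linked to an entity form."""
    sparql = SPARQLWrapper(endpoint)
    # Hypothetical query: resources labeled with the form, plus the labels
    # of everything they are directly related to.
    sparql.setQuery("""
        SELECT DISTINCT ?related WHERE {
          ?s rdfs:label "%s"@en .
          ?s ?p ?o .
          ?o rdfs:label ?related .
          FILTER (lang(?related) = "en")
        } LIMIT 100
    """ % form)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["related"]["value"] for b in results["results"]["bindings"]]

# e.g. context_for_form("U2") should surface related labels such as "Bono"
</preformat>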
      </sec>
      <sec id="sec-2-28">
        <title>Filtering Algorithm</title>
        <p>In this section, the final version of the complete algorithm is provided.</p>
        <preformat>
// Preprocess the tweets
PREPROCESS_TWEETS(tweet_list) return processedTweet_list
BEGIN
  FOR EACH tweet IN tweet_list
  {
    processedTweet = RemoveURL(tweet);
    processedTweet = RemoveMentions(processedTweet);
    processedTweet = TransformHashTags(processedTweet);
    processedTweet_list.add(processedTweet);
  }
  return processedTweet_list;
END

// Generate the semantic context for each entity form
CONTEXT_GENERATION(entity_forms) return context_list
BEGIN
  FOR EACH entity IN entity_forms
  {
    queries = SelectTypeQuery(entity);
    results = ExecuteSparql(queries);
    context_list.put(entity, results);
  }
  return context_list;
END

// Main program
MAIN(t_files, e_files)
BEGIN
  tweet_list = readTweets(t_files);
  entity_list = readEntities(e_files);
  tweet_terms = PREPROCESS_TWEETS(tweet_list);
  // Obtain all the forms for each entity with the Stilus API
  entity_forms = OBTAIN_FORMS(entity_list);
  // Obtain the context for each entity
  context_list = CONTEXT_GENERATION(entity_forms);
  // Compare tweets versus entity contexts
  relatedness_list = COMPARE(context_list, tweet_terms);
  // Write the filtering task results to a file
  writeFilteringOutput(relatedness_list);
END
</preformat>
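        <p>The COMPARE step is not expanded in the pseudocode above; a minimal sketch of one plausible realization, assuming a simple term-overlap test between the tweet terms and each entity's generated context (function name and signature are illustrative), is:</p>
        <preformat>
def compare(context_list, tweet_terms_list):
    """Sketch of COMPARE: mark a tweet as RELATED to an entity when any of
    its terms appears in that entity's generated semantic context."""
    relatedness_list = []
    for terms in tweet_terms_list:
        for entity, context in context_list.items():
            lowered = {c.lower() for c in context}
            related = any(t.lower() in lowered for t in terms)
            relatedness_list.append((entity, terms, related))
    return relatedness_list
</preformat>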
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and Results</title>
      <sec id="sec-3-1">
        <title>The corpus of RepLab 2013 uses Twitter data in English and Spanish. The corpus</title>
        <p>consists of a collection of tweets (at least 2,200 for each entity) potentially associated
to 61 entities from four domains: automotive, banking, universities and music/artists.</p>
      </sec>
      <sec id="sec-3-2">
        <title>As result measures, reliability and sensitivity are used [8]. For a better and deep understanding, the outcomes for this typical binary classification problem (true positives, false positives, true negatives and false negatives) are showed:</title>
        <p>Predicat
ed Class
Related
Unrelated
Run
BEST_APPROACH
UNEDTECNALIA_filtering_1
4</p>
      </sec>
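      <p>For reference, a minimal sketch of how sensitivity is computed from the confusion-matrix counts (reliability, the precision-like counterpart from [8], is not reproduced here):</p>
      <preformat>
def sensitivity(tp, fn):
    """True positive rate: proportion of actual positives correctly classified."""
    return tp / (tp + fn)

# e.g. with 80 true positives and 20 false negatives: sensitivity(80, 20) == 0.8
</preformat>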
    </sec>
    <sec id="sec-4">
      <title>Related work</title>
      <p>
        The disambiguation task has become essential when trying to mine the web in
search of opinions. Brands or individuals such as Apple or Bush usually lead to
confusion due to their ambiguity, and each mention needs to be disambiguated as a related or
unrelated reference. In many approaches Wikipedia has been used to tackle this
challenge by co-reference resolution methods (measuring context similarity through
vectors or other kinds of metrics) [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Research has also focused on the
appearance of a pair of named entities in both texts to come to a conclusion about
their interrelation. The problem with Twitter is the shortness of its messages, which
makes the comparison more difficult (especially considering the usual lack of two
co-appearing entities).
      </p>
      <sec id="sec-4-1">
        <title>Some works are carried out by mapping name entities to Wikipedia articles and overlapping the context surrounding the entity in the text (the string which is wanted to be disambiguated) [3]. The systems return the entity which best matches the context.</title>
      </sec>
      <sec id="sec-4-2">
        <p>
          This approach, instead, tries to take advantage of the Linked Open Data Cloud: a huge
open database where data can be queried and retrieved directly. This avoids scanning
unstructured pages and obtaining wrong or disconnected information. Another work that
uses data extracted from Linked Open Data sources as its corpus is
presented by Hogan et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-3">
        <p>
          In Natural Language Processing, named entity recognition (NER) is an extensively
researched field. Typically, approaches have used Wikipedia for explicit
disambiguation [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], but there are also some examples of how semantics can be used for this
task [
          <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
          ]. Both works are based on defining a similarity measure based on
semantic relatedness.
        </p>
        <p>
          Hoffart et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is the approach closest to our work, because the knowledge
bases used in their work are Linked Data sources, like DBpedia and YAGO, and in
our research we also use DBpedia (among others). The main difference is that in our
approach we generate a context in which we place the entities under study. Afterwards
we check whether the text has any relationship with the generated context, instead of using a
measure of semantic relatedness.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <sec id="sec-5-1">
        <title>The results reveal a high value for sensitivity which comes along with a low value for false negatives. This indicates that the system does not usually get wrong classifications and if it concludes that one example is related to one entity it is almost sure that it is correct.</title>
      </sec>
      <sec id="sec-5-2">
        <p>The reliability, however, is quite poor due to the fact that the context is very
closed, and thus there are a lot of examples that are not found (false negative rate). This
leads us to think that the context should be widely enriched, which could enlarge the
correctly classified group. So, to filter with better precision, the context should contain not
only semantic information from Linked Data sources, but also domain concepts such
as verbs, idioms or any kind of expressions prone to be good indicators. For example:</p>
        <p>Entity: Led Zeppelin
Tweet: Listening to Led Zeppelin
Context: [music, band, concert, instrument, … listen, …]
Result: TRUE</p>
      </sec>
      <sec id="sec-5-3">
        <title>These clues could be extracted and treated by means of PLN and IR (Information</title>
      </sec>
      <sec id="sec-5-4">
        <title>Retrieval) algorithms. The first ones to preprocess the words (including stemming and</title>
        <p>disambiguation treatment) and the former in order to find a similarity-based structure
for the data so the filtering can be carried out by measuring the distance between the
query (actually the relationship related/unrelated) and the tweet according to the
clues. Commonly, a data mining process would need to learn from training examples
or on the other hand to use some statistical method as the tf-idf scheme or LSI (Latent</p>
      </sec>
      <sec id="sec-5-5">
        <title>Semantic Indexing) able to categorize and clustering the concepts most associated to a certain subject For context generation, we will also analyze more refined techniques in the same research line:</title>
        <p>o Improving the semantic context, using a larger number of Linked Datasets,
and refining the questions to be sent. In order to improve the questions we
plan to delve deeper into the ontologies and thereby expand the scope of the
context.
o Using other disambiguation techniques that can be combined with our
approach, as the information extraction from web pages, cited in the text, the
study of hash tags and mentions or using other non-semantic corpuses.</p>
      </sec>
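      <p>A minimal sketch of such a tf-idf distance between a tweet and a clue-enriched context, using scikit-learn (an illustrative assumption about future work, not part of the current system; stemming, as mentioned above, would further normalize word forms):</p>
      <preformat>
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

context = "music band concert instrument singer listen"   # clue-enriched context for U2
tweet = "Enjoy a lot in the last concert of the singer Bono"

# Vectorize both texts in the same tf-idf space and measure their closeness.
vectors = TfidfVectorizer().fit_transform([context, tweet])
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(score)  # above a chosen threshold -> classify the tweet as RELATED
</preformat>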
      <sec id="sec-5-6">
        <title>The combination of all these techniques would allow creating a huge semanticpragmatic context with the valuable distinct feature of not being static, but an increasing and open context fed by Linked Data.</title>
      </sec>
      <sec id="sec-5-7">
        <title>Acknowledgments. This work has been partially supported by the Regional Govern</title>
        <p>ment of Madrid under Research Network MA2VIRMR (S2009/TIC-1542), and by</p>
      </sec>
      <sec id="sec-5-8">
        <title>HOLOPEDIA (TIN 2010-21128-C02). Special thanks to Daedalus for the free licenc</title>
        <p>ing to the utilization of Stilus Core. Hereby the authors would like to thank Fundación</p>
      </sec>
      <sec id="sec-5-9">
        <title>Centros Tecnológicos Iñaki Goenaga (País Vasco) for the awarded doctoral grant to the first author.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ravin</surname>
            , Y. and
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Kazi</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Is Hillary Rodham Clinton the President?</article-title>
          <source>In ACL Workshop on Coreference and its Applications</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Wacholder</surname>
            , N.,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Ravin</surname>
            , and
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Choi</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Disambiguation of proper names in text</article-title>
          .
          <source>In Proceedings of ANLP</source>
          ,
          <fpage>202</fpage>
          -
          <lpage>208</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Bunescu and Pasca.
          <year>2006</year>
          . Razvan C.
          <article-title>Bunescu</article-title>
          and
          <string-name>
            <given-names>Marius</given-names>
            <surname>Pasca</surname>
          </string-name>
          .
          <article-title>Using encyclopedic knowledge for named entity disambiguation</article-title>
          .
          <source>In EACL. The Association for Computer Linguistics</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. HOGAN,
          <string-name>
            <surname>Aidan</surname>
          </string-name>
          , et al.
          <article-title>Scalable and distributed methods for entity matching, consolidation and disambiguation over linked data corpora</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          ,
          <year>2012</year>
          , vol.
          <volume>10</volume>
          , p.
          <fpage>76</fpage>
          -
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. HAN,
          <article-title>Xianpei; ZHAO, Jun. Named entity disambiguation by leveraging wikipedia semantic knowledge</article-title>
          .
          <source>En Proceedings of the 18th ACM conference on Information and knowledge management. ACM</source>
          ,
          <year>2009</year>
          . p.
          <fpage>215</fpage>
          -
          <lpage>224</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. HOFFART,
          <string-name>
            <surname>Johannes</surname>
          </string-name>
          , et al.
          <article-title>Robust disambiguation of named entities in text</article-title>
          .
          <source>En Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics</source>
          ,
          <year>2011</year>
          . p.
          <fpage>782</fpage>
          -
          <lpage>792</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. HAN,
          <article-title>Xianpei; ZHAO, Jun. Structural semantic relatedness: a knowledge-based method to named entity disambiguation</article-title>
          .
          <source>En Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics</source>
          ,
          <year>2010</year>
          . p.
          <fpage>50</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Amigo</surname>
          </string-name>
          , Enrique and Gonzalo, Julio and Verdejo,
          <string-name>
            <surname>Felisa</surname>
          </string-name>
          . A
          <article-title>General Evaluation Measure for Document Organization Tasks</article-title>
          .
          <source>Proceedings SIGIR</source>
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>