=Paper=
{{Paper
|id=Vol-1179/CLEF2013wn-RepLab-VillarRodriguezEt2013
|storemode=property
|title=Using Linked Open Data Sources for Entity Disambiguation
|pdfUrl=https://ceur-ws.org/Vol-1179/CLEF2013wn-RepLab-VillarRodriguezEt2013.pdf
|volume=Vol-1179
|dblpUrl=https://dblp.org/rec/conf/clef/RodriguezTSR13
}}
==Using Linked Open Data Sources for Entity Disambiguation==
Using Linked Open Data Sources for Entity Disambiguation

Esther Villar Rodríguez (OPTIMA Unit, TECNALIA) esther.villar@tecnalia.com
Ana I. Torre Bastida (OPTIMA Unit, TECNALIA) isabel.torre@tecnalia.com
Ana García Serrano (ETSI Informática, UNED) agarcia@lsi.uned.es
Marta González Rodríguez (OPTIMA Unit, TECNALIA) marta.gonzalez@tecnalia.com

Abstract. Within the framework of RepLab 2013, the filtering task tries to determine whether a tweet is related to a given entity or not. Our work takes advantage of the Web of Data to create a context for each entity, extracted from the available Linked Data sources. In Natural Language Processing (NLP), context is the key element that makes it possible to distinguish the semantics contained in a message by analyzing the frame in which its words are embedded.

1 Introduction

Reputation management is used by companies (or individuals) to monitor public opinion with the aim of maintaining a good brand image. The first step is to establish the correct relation between an opinion (text) and the entity with some degree of confidence. This is the objective of the filtering task in RepLab, an initiative promoted by the EU project Limosine, which focuses on the ability to process and understand the strengths and weaknesses of an entity based on user opinions (http://www.limosine-project.eu/events/replab2013).

Nowadays there is a large amount of information available on the web, such as web pages, social media data (tweets, Facebook posts and others) or blogs. All of them mention different entities, such as locations, people or organizations. The problem appears when a name refers to an entity that has several meanings, for example the song "Munich" by the music group "Editors" and the German city of the same name.

For this filtering task, our system uses an approach based on the semantic context of an entity. The goal of this work is to create a description of an entity that provides a sufficiently rich semantic context to carry out a successful disambiguation. The data sources from which entity descriptions are extracted make up the Web of Data, specifically the Linked Open Data cloud.

In this respect, our research has been developed within the Linked Open Data paradigm, a recommended best practice for exposing, sharing and connecting pieces of data, information and knowledge on the Semantic Web using URIs and RDF. Thanks to activities like this, the volume of data in RDF format is continuously growing, building the so-called "Web of Data", which today is the largest freely available knowledge base. Its size, open access, semantic character and continued growth led us to choose it as our information provider for context generation during the filtering task. This process is carried out using different semantic technologies: the SPARQL query language, ontology definition languages such as RDF, RDFS or OWL, and RDF repositories (SPARQL endpoints).

The main contribution of this paper is the definition of a system that aims at high precision and sensitivity in entity filtering, using semantic technologies to extract context information from Linked Open Data sources, as presented in the following sections.

2 Proposed Approach

In this section we introduce our approach for filtering entities in tweets. Our procedure uses the semantic context of the analyzed entities and compares it against the terms contained in the tweet. First of all, the tweets are preprocessed to extract the terms they contain.
These terms are the input for a second phase, where the equivalent forms available for each concept are obtained with the Stilus tool (http://www.daedalus.es/productos/). Once all the possible forms of a term have been calculated, the last step consists of generating a semantic context by querying the different data sources (modeled by a set of ontologies) that the Linked Open Data cloud provides.

The section is divided into four parts. First we introduce the motivation for using a semantic context for entity filtering; then we explain the preprocessing phase of the system. In the third subsection we describe the generation of the context, and finally we summarize our filtering algorithm.

2.1 Motivation

The main reasons for using a semantic context to discover the relatedness of tweets to the entities under study are the following two:

- The powerful modeling and representation capabilities offered by ontologies. Ontologies allow us to capture the concepts and properties of a particular domain. It is possible to draw a conceptual structure as detailed and complete as necessary. Furthermore, the process of describing ontologies is simple and straightforward, generating an independent and autonomous model.

- The amount of freely available, semantically represented data (RDF, RDFS and OWL) in the Linked Open Data cloud. Nowadays the amount of information available in RDF format is huge. The Linked Data paradigm has promoted the Web of Data, formed by the linked datasets. This makes it possible for any user to obtain information about heterogeneous domains. By consulting these datasets through technologies such as SPARQL endpoints or RDF dumps, the user can get semantic information about concepts or entities using the modeling ontologies of the different repositories.

To illustrate the benefits that a semantic context can give us, consider the following example. In the field of reputation analysis of music groups, we analyze the following tweet, where the studied entity is the group U2:

"Enjoy a lot in the last concert of the singer Bono"

At first sight there is nothing in the tweet that relates it syntactically to "U2". But using the semantic context generated for this music group, we know that its vocalist is Paul David Hewson, better known by his stage name "Bono". The semantic context allows us to build relationships that lead from Bono to U2, and in this way we deduce that a tweet talking about the entity Bono also talks indirectly about the entity U2, and therefore the tweet and the second entity are related.

For the extraction of the information needed to generate the context, we have considered the data sources and ontologies shown in Table 1.

Table 1. Available datasets and ontologies in the Linked Open Data cloud for the studied domains

  Dataset         Domain                           SPARQL endpoint
  DBpedia         General domain                   http://dbpedia.org/sparql
  MusicBrainz     Music domain                     http://dbtune.org/musicbrainz/sparql
  EventMedia      Media domain                     http://eventmedia.eurecom.fr/sparql
  ZBW Economics   Economic domain                  http://zbw.eu/beta/sparql/
  Swetodblp       University bibliography domain   http://datahub.io/dataset/sweto-dblp
  DBLP            University bibliography domain   http://dblp.rkbexplorer.com/sparql/

The table shows datasets for the four domains used in the filtering task: music, universities, banking and automotive.

2.2 Preprocessing of tweets

This task is in charge of extracting the terms contained in each tweet. Before the terms are compared with the entities, they need to be preprocessed to remove the characteristics typical of tweets (URLs, mentions, the # symbol) which can affect precision. This preprocessing has three main tasks (Fig. 1); a small illustrative sketch is given after Fig. 1:

1. Removing URLs. URLs are eliminated in this approach because they do not provide value for the comparison against the semantic context. In future work we will try to replace each URL by the entities it represents, and we could even consider the various relationships/links inside the web page identified by the URL under study.

2. Removing mentions. For our task, mentions are not of interest at this moment, because the relationship between them and the content of the tweet is irrelevant.

3. Transforming hashtags. Hashtags are topics with relevance that somehow summarize the content of the tweet; therefore we parse their terms so they can be treated by subsequent processes.

Fig. 1. Preprocessing of tweets
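To make the three steps above concrete, the following minimal Python sketch illustrates the kind of transformation applied to each tweet. It is only an illustration under our own assumptions: the regular expressions, the function names (chosen to mirror the pseudocode of Section 2.4) and the example tweet are not taken from the actual implementation.

  import re

  def remove_urls(text):
      # Task 1: drop http/https links; they are not compared against the context.
      return re.sub(r"https?://\S+", " ", text)

  def remove_mentions(text):
      # Task 2: drop @user mentions, considered irrelevant for filtering.
      return re.sub(r"@\w+", " ", text)

  def transform_hashtags(text):
      # Task 3: keep the topical words of a hashtag, e.g. "#LedZeppelin" -> "Led Zeppelin",
      # "#u2_tour" -> "u2 tour", so they can be treated as ordinary terms.
      def expand(match):
          tag = match.group(1).replace("_", " ").replace("-", " ")
          return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", tag)
      return re.sub(r"#(\w+)", expand, text)

  def preprocess_tweet(text):
      # Pipeline corresponding to the three tasks of Section 2.2.
      text = remove_urls(text)
      text = remove_mentions(text)
      text = transform_hashtags(text)
      return [t for t in re.split(r"\W+", text) if t]

  print(preprocess_tweet("Listening to #LedZeppelin http://t.co/abc via @friend"))
  # ['Listening', 'to', 'Led', 'Zeppelin', 'via']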
2.3 Context generation

The context represents the related concepts/entities and the kinds of relationships between them. In our approach, we generate a context for each entity of interest. The information to build the context is obtained from the datasets shown in Table 1. Depending on the type of entity, we perform different types of queries against a specific domain (music, banking, automotive, universities). These queries are constructed from the different forms or variants that represent the entity.

Fig. 2. Process of context generation

Thus, the context generation process consists of two sub-processes (Fig. 2):

- Extraction of the forms of an entity. Using the Stilus API by Daedalus (http://www.daedalus.es/productos/stilus/stilus-sem/), the entity forms are extracted to try to avoid misspelled or ambiguous names. For example, for IBM: "IBM Software", "IBM Software Group", "IBM System".

- Extraction of the concepts/entities and relationships from the previous forms. This task is performed by consulting different datasets through their corresponding SPARQL endpoints, using the SPARQL query language and following the ontologies of each dataset, which depend on the type of entity. An example of this kind of SPARQL query is shown in Fig. 2; a sketch of such a query is also given after this list.
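The exact SPARQL queries depend on each dataset's ontology and are only shown schematically in Fig. 2. As an illustration of the second sub-process, the sketch below queries the public DBpedia endpoint listed in Table 1 with the SPARQLWrapper library and collects the labels of resources directly connected to one entity form; the query shape and the function name build_context are our own assumptions, not the authors' exact implementation.

  from SPARQLWrapper import SPARQLWrapper, JSON

  def build_context(entity_label, endpoint="http://dbpedia.org/sparql"):
      # Collect the labels of resources directly linked to the entity,
      # in either direction, as a rough semantic context.
      sparql = SPARQLWrapper(endpoint)
      sparql.setQuery("""
          PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
          SELECT DISTINCT ?label WHERE {
            ?entity rdfs:label "%s"@en .
            { ?entity ?p ?related } UNION { ?related ?p ?entity }
            ?related rdfs:label ?label .
            FILTER (lang(?label) = "en")
          }
          LIMIT 200
      """ % entity_label)
      sparql.setReturnFormat(JSON)
      results = sparql.query().convert()
      return {row["label"]["value"] for row in results["results"]["bindings"]}

  # The context built for "U2" should contain, among other things, the band members,
  # so a tweet mentioning "Bono" can be related back to the entity U2 (cf. Section 2.1).
  context = build_context("U2")

A domain-specific version would follow the ontologies of MusicBrainz, DBLP, etc., instead of this generic one-hop neighbourhood query.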
2.4 Filtering Algorithm

In this section, the final version of the complete algorithm is given in pseudocode.

  // Preprocess the tweets
  PREPROCESS_TWEETS(tweet_list) return processedTweet_list
  BEGIN
    FOR EACH tweet IN tweet_list {
      processedTweet = RemoveURL(tweet);
      processedTweet = RemoveMentions(processedTweet);
      processedTweet = TransformHashTag(processedTweet);
      processedTweet_list.add(processedTweet);
    }
    return processedTweet_list;
  END

  // Generate the semantic context of each entity
  CONTEXT_GENERATION(entity_forms) return context_list
  BEGIN
    FOR EACH entity IN entity_forms {
      queries = SelectTypeQuery(entity);
      results = ExecuteSparql(queries);
      context_list.put(entity, results);
    }
    return context_list;
  END

  // Main program
  MAIN(t_files, e_files)
  BEGIN
    tweet_list = readTweets(t_files);
    entity_list = readEntities(e_files);
    tweet_terms = PREPROCESS_TWEETS(tweet_list);
    // Obtain all the forms of each entity with the Stilus API
    ent_forms = OBTAIN_FORMS(entity_list);
    // Obtain the context of each entity
    context_list = CONTEXT_GENERATION(ent_forms);
    // Compare tweets against entity contexts
    relatedness_list = COMPARE(context_list, tweet_terms);
    // Write the filtering task results to a file
    writeFilteringOutput(relatedness_list);
  END

3 Experiments and Results

The corpus of RepLab 2013 uses Twitter data in English and Spanish. It consists of a collection of tweets (at least 2,200 for each entity) potentially associated with 61 entities from four domains: automotive, banking, universities and music/artists. As evaluation measures, reliability and sensitivity are used [8]. For a better and deeper understanding, the outcomes of this typical binary classification problem (true positives, false positives, true negatives and false negatives) are shown in Table 2.

Table 2. Confusion matrix of the filtering run

                          Actual: Related   Actual: Unrelated
  Predicted: Related        TP = 22926        FP = 1358
  Predicted: Unrelated      FN = 47135        TN = 18495

Sensitivity (also called the true positive rate) measures the proportion of actual positives which are correctly classified:

  Sensitivity = TP / Actual P = 0.327229    (1)

Reliability (also called precision) measures the fraction of retrieved instances which belong to the positive class:

  Reliability = TP / Predicted P = 0.944078    (2)
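As a quick check of equations (1) and (2), the values can be reproduced from the raw counts of Table 2 with a few lines of Python:

  # Counts taken from Table 2.
  tp, fp, fn, tn = 22926, 1358, 47135, 18495

  sensitivity = tp / (tp + fn)   # share of actually related tweets that were recovered
  reliability = tp / (tp + fp)   # share of tweets predicted as related that are correct

  print(f"Sensitivity = {sensitivity:.6f}")   # 0.327229
  print(f"Reliability = {reliability:.6f}")   # 0.944078

Note that the official figures in Table 3 follow the evaluation methodology of [8], so they need not coincide with these raw counts.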
Table 3. Filtering results

  Run                        Reliability   Sensitivity   F(R,S)
  BEST_APPROACH                 0.7288        0.4507      0.4885
  UNEDTECNALIA_filtering_1      0.2760        0.2862      0.1799

4 Related Work

The disambiguation task has become essential when mining the web in search of opinions. Brands or individuals such as Apple or Bush often lead to confusion due to their ambiguity, and each mention needs to be disambiguated as a related or unrelated reference. In many approaches Wikipedia has been used to tackle this challenge through co-reference resolution methods (measuring context similarity through vectors or other kinds of metrics) [1, 2]. Research has also focused on the appearance of a pair of named entities in both texts in order to draw a conclusion about their interrelation. The problem with Twitter is the shortness of its messages, which makes the comparison more difficult (especially considering the usual absence of two co-occurring entities). Some works map named entities to Wikipedia articles and overlap the context surrounding the entity in the text (the string to be disambiguated) [3]; these systems return the entity that best matches the context. Our approach, instead, takes advantage of the Linked Open Data cloud, a huge open knowledge base that can be queried directly. This avoids scanning unstructured pages and obtaining wrong or disconnected information. Another work that uses data extracted from Linked Open Data sources as its corpus is presented by Hogan et al. [4].

In Natural Language Processing, Named Entity Recognition (NER) is an extensively researched field. Typically, approaches use Wikipedia for explicit disambiguation [5], but there are also examples of how semantics can be used for this task [6, 7]. Both works are based on defining a similarity measure built on semantic relatedness. Hoffart et al. [6] is the approach closest to our work, because the knowledge bases used in their work are Linked Data sources such as DBpedia and YAGO, and in our research we also use DBpedia (among others). The main difference is that in our approach we generate a context in which we place the entities under study and afterwards check whether the text has any relationship with the generated context, instead of using a measure of semantic relatedness.

5 Conclusions

The results reveal a high value for reliability, which comes along with a low number of false positives. This indicates that the system rarely produces wrong positive classifications: if it concludes that an example is related to an entity, it is almost certainly correct. The sensitivity, however, is quite poor, because the context is very narrow and many related examples are therefore not found (a high false negative rate). This leads us to think that the context should be considerably enriched, which could enlarge the group of correctly classified examples. So, to filter with better coverage, the context should contain not only semantic information from Linked Data sources, but also domain concepts such as verbs, idioms or any kind of expression prone to be a good indicator:

  Entity: Led Zeppelin
  Tweet: "Listening to Led Zeppelin"
  Context: [music, band, concert, instrument, ... listen, ...]
  Result: TRUE
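The COMPARE step of the algorithm in Section 2.4 is not spelled out in the text. The following minimal sketch shows one way to reproduce the behaviour of the example above through simple term overlap between the preprocessed tweet and the entity context; the overlap criterion and the function name are assumptions for illustration only.

  def compare(context_labels, tweet_terms):
      # Flatten multi-word context labels into lower-cased tokens and decide
      # relatedness by simple overlap (assumed criterion: at least one shared term).
      context_tokens = {tok.lower() for label in context_labels for tok in label.split()}
      return any(t.lower() in context_tokens for t in tweet_terms)

  # Example from above, assuming the entity forms are part of the context.
  context = ["Led Zeppelin", "music", "band", "concert", "instrument", "listen"]
  tweet = ["Listening", "to", "Led", "Zeppelin"]
  print(compare(context, tweet))   # True ("led" and "zeppelin" occur in the context)

A more robust version would apply stemming, so that "Listening" also matches the context term "listen", as discussed below.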
These clues could be extracted and treated by means of NLP and IR (Information Retrieval) algorithms: the former to preprocess the words (including stemming and disambiguation treatment), and the latter to find a similarity-based structure for the data, so that the filtering can be carried out by measuring the distance between the query (actually the related/unrelated relationship) and the tweet according to these clues. Commonly, a data mining process would need to learn from training examples or, alternatively, to use some statistical method such as the tf-idf scheme or LSI (Latent Semantic Indexing), able to categorize and cluster the concepts most associated with a certain subject.

For context generation, we will also analyze more refined techniques along the same research line:

- Improving the semantic context by using a larger number of Linked Datasets and refining the queries to be sent. In order to improve the queries we plan to delve deeper into the ontologies and thereby expand the scope of the context.

- Using other disambiguation techniques that can be combined with our approach, such as extracting information from the web pages cited in the text, studying hashtags and mentions, or using other non-semantic corpora.

The combination of all these techniques would allow creating a huge semantic-pragmatic context with the valuable distinctive feature of not being static, but an increasing and open context fed by Linked Data.

Acknowledgments. This work has been partially supported by the Regional Government of Madrid under Research Network MA2VICMR (S2009/TIC-1542) and by HOLOPEDIA (TIN 2010-21128-C02). Special thanks to Daedalus for the free licensing of Stilus Core. The authors would also like to thank Fundación Centros Tecnológicos Iñaki Goenaga (País Vasco) for the doctoral grant awarded to the first author.

References

1. Ravin, Y., Kazi, Z.: Is Hillary Rodham Clinton the President? In: ACL Workshop on Coreference and its Applications (1999)
2. Wacholder, N., Ravin, Y., Choi, M.: Disambiguation of proper names in text. In: Proceedings of ANLP, pp. 202-208 (1997)
3. Bunescu, R. C., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: EACL. The Association for Computer Linguistics (2006)
4. Hogan, A., et al.: Scalable and distributed methods for entity matching, consolidation and disambiguation over linked data corpora. Web Semantics: Science, Services and Agents on the World Wide Web 10, 76-110 (2012)
5. Han, X., Zhao, J.: Named entity disambiguation by leveraging Wikipedia semantic knowledge. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 215-224. ACM (2009)
6. Hoffart, J., et al.: Robust disambiguation of named entities in text. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 782-792. Association for Computational Linguistics (2011)
7. Han, X., Zhao, J.: Structural semantic relatedness: a knowledge-based method to named entity disambiguation. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 50-59. Association for Computational Linguistics (2010)
8. Amigó, E., Gonzalo, J., Verdejo, F.: A general evaluation measure for document organization tasks. In: Proceedings of SIGIR 2013 (2013)