=Paper=
{{Paper
|id=Vol-1486/paper_46
|storemode=property
|title=What is Special about Bethlehem, Pennsylvania? Identifying Unusual Facts about DBpedia Entities
|pdfUrl=https://ceur-ws.org/Vol-1486/paper_46.pdf
|volume=Vol-1486
|dblpUrl=https://dblp.org/rec/conf/semweb/SchaferRP15
}}
==What is Special about Bethlehem, Pennsylvania? Identifying Unusual Facts about DBpedia Entities==
What is Special about Bethlehem, Pennsylvania? Identifying Unexpected Facts about DBpedia Entities Benjamin Schäfer, Petar Ristoski, and Heiko Paulheim University of Mannheim, Germany Research Group Data and Web Science {benni,petar,heiko}@dwslab.de Abstract. Most Linked Data browsers list all facts about an entity in an equal manner. In this paper, we present a prototype for identifying unexpected facts about entities, i.e., those facts that deviate from the expectations. To that end, we use an attribute-wise method for anomaly detection, which is also capable of providing qualitative explanations for the anomalies found. By comparing an entity at hand to a reference set of similar entities, we can provide information on how the entity at hand differs from the typical patterns found for similar entities, and display those unexpected facts together with a short explanation. Keywords: DBpedia, Data Exploration, Anomaly Detection 1 Introduction Many linked data browsers display lists of facts about an entity at hand without a particular notion of order or importance [2]. While some approaches exist for ranking the existing information [1, 3], the top ranked facts for an entity are often the trivial ones (e.g., Bethlehem, Pennsylvania is a City). A slightly different problem is the search for unexpected or surprising facts. Rather than ranking facts by their importance, ranking by unexpectedness re- quires a notion of the usual state of an entity. To that end, an entity needs to be compared to a reference set of similar entities, and the typical patterns underly- ing the entities in that set have to be identified. Then, unexpected facts can be identified as those facts of an entity which strongly deviate from the patterns. In this demonstration, we introduce a prototype for finding unexpected facts in DBpedia.1 Starting from selecting a DBpedia entity, the user can define the reference set and is then presented a number of unexpected facts. The first results with selected entities look promising. 2 Prototype The basic workflow of the tool comprises four steps, as depicted in Fig. 1. In the first step, the user selects a DBpedia entity to analyze. This step is supported 1 Available online at http://topfacts.informatik.uni-mannheim.de/ 2 Benjamin Schäfer, Petar Ristoski, and Heiko Paulheim 1 2 3 4 Entity Reference Set Anomaly Presentation Selection Selection Detection of Facts ABC attribute-wise Lookup categories model learning verbalized rules Fig. 1: Schematic depiction of the tool workflow. by DBpedia Lookup and its autocomplete function.2 For example, the entity dbpedia:Bethlehem, Pennsylvania is selected. Once the entity is selected, the user has to select a reference set of entities to compare to. To that end, all YAGO types, which form a much richer hierarchy than the DBpedia ontology types [9], are retrieved.3 All types that have between 20 and 1,000 entities can be used as a reference set.4 In our example, the user may, e.g., select the class yago:CitiesInPennsylvania. In the next step, the reference set is retrieved. For each entity, an attribute vector is created, using attributes such as datatype properties and direct types. Based on those feature vectors, an individual anomaly score for each attribute is computed using the ALSO approach [5]. This approach learns a predictive model for each attribute from all other attributes. Then, it computes the anomaly score for each attribute value based on the deviation between the actual value and the value predicted by the model, and the predictive strength of the model. For building the models, we use the rule variant of M5’ [6]. Finally, for all attributes that have a high anomaly score, the finding is out- put as the models’ justification for expecting a different value, ordered by the respective anomaly score. An example output is shown in Fig. 2. Following the details on demand paradigm [8], the single statements which are involved in the justification are shown upon request. Since the output would also show quite a few statements that are not unex- pected facts, but mere errors in DBpedia, we filter out those rules referring to statements which are inconsistent with the DBpedia ontology. For implementing the prototype, we use RapidMiner server5 with the Linked Open Data extension [7]. 3 Example Findings In this section, we show some interesting example findings for different resources and reference sets. 2 http://lookup.dbpedia.org 3 DBpedia delivers YAGO types as well, so no separate linkage to YAGO is required. 4 The numbers have been chosen for having a reference set that is big enough for discovering some meaningful patterns, and at the same time small enough to be processed in real time. 5 http://www.rapidminer.com What is Special about Bethlehem, Pennsylvania? 3 Bethlehem, PA compared to Cities in Pennsylvania: Bethlehem is one of the oldest places in Pennsylvania, being founded in 1741. Furthermore, most cities in Pennsylvania are not a founding place of any organization, but Bethlehem is the founding place of Lehigh University Press. Pennsylvania compared to States of the US: For Pennsylvania, we find that it has some uncommon characteristics for the US founding states6 : it is unusually large (119,283 square kilometers, with only New York being larger), has an unusually large maximum elevation (Mount Davis with 979m), and an unusually low area covered with water (2.7%). Black Swan compared to Ballet Films: It is unusual, e.g., that Black Swan is an Academy Award winning ballet film. Futhermore, ballet films are usually not thrillers. Trent Reznor compared to American Heavy Metal Singers: Unlike other heavy metal singers, Reznor is also a piano player and has written various film scores. Joanne K. Rowling compared to British Billionaires: Rowling is the only female among the British billionaires, and one of the rare supporters of the Labour party. 4 Conclusion and Outlook In this paper, we have introduced a prototype which identifies unexpected facts about DBpedia entities. We compare an entity to a reference set of similar en- tities, and identify those facts which deviate from the patterns that are typical for the reference set. While first anecdotal findings are promising, a full user evaluation, also con- trasting different presentation variants, still has to be conducted. Such an eval- uation should, ideally, try to define and capture the human notion of unexpect- edness, which, however, is not trivial. In our prototype, we have so far used direct types, numeric datatype at- tributes, and relations as features. Other features, such as relations to individu- als or qualified relations [4], might even lead to more findings, but come at the cost of a dimensionality explosion, and, hence, problems with realtime process- ing. Thus, some mechanism for on-the-fly feature selection would be required. Furthermore, the impact of the choice of different rule learning algorithms and heuristics would be interesting to explore. For defining the reference set, we have only used YAGO categories so far. It would be interesting to also allow more sophisticated restrictions, e.g., compare a city to other cities in the same range of inhabitants. In summary, the demo shows a novel way of interacting with Linked Data and identifying facts which are interesting to the user. 6 Although this was not the contrast set we chose, many of the rules found refer to the founding states. 4 Benjamin Schäfer, Petar Ristoski, and Heiko Paulheim Fig. 2: Example explanations provided by the tool. Acknowledgements The work presented in this paper has been partly funded by the German Re- search Foundation (DFG) under grant number PA 2373/1-1 (Mine@LOD). References 1. Gong Cheng, Thanh Tran, and Yuzhong Qu. Relin: relatedness and informativeness- based centrality for entity summarization. In Proceedings of the 10th International Semantic Web Conference (ISWC2011), pages 114–129, 2011. 2. Aba-Sah Dadzie and Matthew Rowe. Approaches to visualising linked data: A survey. Semantic Web, 2(2):89–124, 2011. 3. Li Ding, Rong Pan, Tim Finin, Anupam Joshi, Yun Peng, and Pranam Kolari. Finding and ranking knowledge on the semantic web. In Proceedings of the 4th International Semantic Web Conference (ISWC2005), pages 156–170, 2005. 4. Heiko Paulheim and Johannes Fürnkranz. Unsupervised Generation of Data Mining Features from Linked Open Data. In International Conference on Web Intelligence, Mining, and Semantics (WIMS’12), 2012. 5. Heiko Paulheim and Robert Meusel. A Decomposition of the Outlier Detection Problem into a Set of Supervised Learning Problems. Machine Learning, (2-3):509– 531, 2015. 6. John Quinlan. Learning with continuous classes. In 5th Australian Joint Conference on Artificial Intelligence, volume 92, pages 343–348. Singapore, 1992. 7. Petar Ristoski, Christian Bizer, and Heiko Paulheim. Mining the Web of Linked Data with RapidMiner. Journal of Web Semantics, 2015. 8. Ben Shneiderman. The eyes have it: A task by data type taxonomy for information visualizations. In IEEE Symposium on Visual Languages, pages 336–343, 1996. 9. Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: a core of se- mantic knowledge. In 16th international conference on World Wide Web, pages 697–706, 2007.