Type Prediction for Entities in DBpedia by Aggregating Multilingual Resources

Thi-Nhu Nguyen1,4, Hideaki Takeda2, Khai Nguyen2,3, Ryutaro Ichise2, Tuan-Dung Cao4

1 Haiphong University, Vietnam
nhunt@dhhp.edu.vn
2 National Institute of Informatics, Japan
{takeda,nhkhai,ichise}@nii.ac.jp
3 University of Science, VNU-HCMC, Vietnam
nhkhai@fit.hcmus.edu.vn
4 Hanoi University of Science and Technology, Vietnam
dungct@soict.hust.edu.vn

Abstract. The entity type is a very important piece of information in DBpedia. Because this information is described inconsistently across languages, it is difficult to recognize the most suitable type of an entity. We propose a method that predicts the entity type based on a novel conformity measure, which combines the specificity level of a type with majority voting. The experimental results show that our method suggests informative types and outperforms the baselines.

Keywords: DBpedia, Ontology, Mappings, Conformity, Consistency

1 Introduction

DBpedia is built upon a community effort to extract knowledge from Wikipedia [1]. Contributors from many countries have joined the DBpedia mapping project, whose goal is to map Wikipedia templates to the types (e.g., Species, Person, and Place) of the DBpedia ontology [2]. Despite the maturity of the DBpedia community, the lack of consensus between contributors working on different languages remains an issue.

In DBpedia, a real-world entity is represented by multiple instances. Each instance is described in a specific language, and its type is determined by the mappings constructed for that language. Because the mappings are created manually and separately for each language, the types of instances can differ even when those instances describe the same entity. Concretely, for a given entity, the assigned types may be correct but differ in their level of specificity, or they may simply be incorrect.
For example, the entity Barack Obama is recognized as Person, Politician, President, Artist, and Book across 29 languages. Here, Person, Politician, and President are consistent with one another but differ in their level of specificity. Meanwhile, Artist and Book are incorrect. In this situation, choosing the most suitable type of an entity is necessary to guarantee the consolidation of DBpedia, but it is a difficult task.

According to a preliminary analysis, the agreement of type assignments among different languages is low, even when only two particular languages are compared. Table 1 shows the percentage of instances sharing the same type for 10 language pairs, namely the pairs with the largest number of instances that have a type in both languages among all 476 language pairs. Overall, only 37% of the pairs have more than 50% of their instances assigned the same type.

Table 1. The statistics of the agreement of types between two languages

Languages   # instances having a type   % instances having
            in both languages           the same type
nl sv       308462                      21.86%
en nl       201248                      53.28%
en sv       158842                      8.47%
en es       149234                      30.92%
it en       144815                      10.59%
en sv       158842                      8.47%
nl es       143634                      77.53%
pt en       132773                      68.55%
nl it       130938                      8.83%
pl it       118935                      8.75%

Recently, entity type prediction has come to be considered an important problem. It improves the usability of DBpedia language versions whose mapping community is still immature. In addition, it is the core of automatic mapping creation [3]. Two simple approaches to type prediction are majority voting and the most specific ancestor. The disadvantage of majority voting is that the suggested type is often not specific enough for the entity. The most specific ancestor returns even more general types.

In this paper, we propose a new method to predict the most suitable type of an entity. Our method is an improvement over majority voting. In particular, we focus on how to retrieve more specific types.

2 Type suggestion

In this section, we describe how to predict the most suitable type for an entity.
Our idea is to combine the specificity level of a type with majority voting. The input is the set of most specific types assigned to the entity by the different languages. We define the conformity $\mathrm{Conf}(t)$ of a most specific type $t$. The conformity is a recursive value: the sum of the frequency of $t$ and the conformity of its parent.

$\mathrm{Conf}(t) = \mathrm{Freq}(t) + \mathrm{Conf}(\mathrm{parent}(t))$   (1)

where $\mathrm{Freq}(t)$ is the number of languages that treat the entity as $t$. For an entity, we select the most suitable type by picking the one with the highest conformity. The chosen type is therefore both widely supported across languages and sufficiently specific. If several types share the highest conformity, we rank them by the conformity of their parent types.

Fig. 1. All types of Barack Obama entity and their frequency.

Consider the example in Fig. 1. In this figure, the entity Barack Obama is assigned 6 types across 29 languages. The conformity of the type President is the highest ($\mathrm{Conf}(\mathrm{President}) = 18$). Therefore, it is selected as the prediction result.

3 Experiment and evaluation

We evaluate our method on a manually crafted dataset and compare it with two baselines: (1) majority voting and (2) most specific ancestor. We build an entity database from all available language versions of DBpedia. An entity is the compilation of the instances interconnected via owl:sameAs links. The difficulty of type suggestion lies in the diversity of types. Therefore, we select entities with high inconsistency. Concretely, we first randomly select 500 entities whose type is available in at least 5 languages. Then, we pick the 100 most inconsistent ones. Here, the inconsistency is estimated by the entropy of the types' frequencies. Unlike the conformity in Eq. 1, transitive types are counted in order to respect the hierarchical relations, where the transitive types of a type are its ancestor types.
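
The conformity of Eq. 1 and the entropy-based inconsistency can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the hierarchy fragment `PARENT`, the function names, and the per-type counts are hypothetical (the counts are not the values of Fig. 1), and because the text does not spell out the exact entropy formula, the sketch takes the share of languages covering each type (directly or through a descendant) as the probability.

```python
import math
from collections import Counter

# Hypothetical fragment of a type hierarchy (child -> parent); the root has no parent.
PARENT = {
    "Thing": None,
    "Agent": "Thing",
    "Person": "Agent",
    "Politician": "Person",
    "President": "Politician",
    "Artist": "Person",
    "Work": "Thing",
    "Book": "Work",
}

def conformity(t, freq):
    """Eq. 1: Freq(t) plus the conformity of the parent of t (0 above the root)."""
    if t is None:
        return 0
    return freq.get(t, 0) + conformity(PARENT[t], freq)

def predict(freq):
    """Pick the assigned type with the highest conformity.

    Ties are broken by the conformity of the parent type.
    """
    return max(
        freq,
        key=lambda t: (conformity(t, freq), conformity(PARENT[t], freq)),
    )

def ancestors(t):
    """Transitive types of t: all of its ancestors up to the root."""
    result = []
    t = PARENT[t]
    while t is not None:
        result.append(t)
        t = PARENT[t]
    return result

def inconsistency(freq):
    """Entropy over type frequencies, counting transitive (ancestor) types.

    Assumption: p is the share of languages covering a type.
    """
    n_languages = sum(freq.values())
    counts = Counter()
    for t, n in freq.items():
        counts[t] += n
        for a in ancestors(t):
            counts[a] += n
    return -sum(
        (c / n_languages) * math.log2(c / n_languages) for c in counts.values()
    )

# Illustrative counts: number of languages assigning each most specific type.
freq = {"Person": 5, "Politician": 4, "President": 3, "Artist": 1, "Book": 1}

print(conformity("President", freq))  # 12 = 3 + 4 + 5
print(predict(freq))                  # President
```

With these illustrative counts, majority voting over the most specific types would return Person (5 languages), whereas the conformity measure prefers President because it accumulates the support of its ancestors Politician and Person.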
After the selection, an expert is asked to assign the most suitable type among the available types of each entity (3). Finally, we compare the results of our method, (1), and (2) against (3).

Table 2. The accuracy of type suggestion methods

Our method   Majority voting   Most specific ancestor
55%          44%               0%

Table 2 shows that our method gives the best result. Because entities of high inconsistency are selected, the most specific ancestor method always chooses owl:Thing, which is the root of the DBpedia ontology. Although majority voting is better than the most specific ancestor, its result is generally not specific enough. This experiment demonstrates that our prediction method performs well, but 45% of the predicted types still differ from the expert's opinion. Most of these cases involve place entities, because the definitions of administrative regions and residential areas differ among countries. The DBpedia ontology currently lacks types to represent all these dissimilarities. For example, Voultegon is a commune in France, but there is no type for commune. Therefore, this entity should be mapped to the Settlement type. However, our method returns the inaccurate type City because this type is more specific than Settlement.

4 The demo

We build a tool, named MLDQ (see footnote 2), to visualize entity types. A user can enter keywords in any language, or a URI, to query an entity. The database contains 86,290,758 entities, which are constructed from 128,866,644 instances of all languages. We use Lucene (see footnote 1) to index the entities by all labels (i.e., rdfs:label) provided in all languages. The tool visualizes the hierarchy of types in the different languages, the types suggested by our method and by the baselines, and some general information about the entity (e.g., the entropy of inconsistency).

5 Conclusion and future work

In this work, we proposed a new method that combines the specificity level of a type with majority voting to suggest the most suitable type of an entity.
Three methods were evaluated, and the results show that our method is the most promising one, although some weaknesses remain. For future work, we will evaluate our method with deeper analyses, including comparisons with more baselines. We also aim to improve our method by considering the conformity of transitive types in order to give more accurate predictions.

References

1. Lehmann, J.; Isele, R.; Jakob, M.; et al.: DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web – Interoperability, Usability, Applicability, vol. 6, no. 2, pp. 167-195 (2015).
2. Mendes, P.N.; Jakob, M.; Bizer, C.: DBpedia: A Multilingual Cross-Domain Knowledge Base. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, pp. 1813-1817 (2012).
3. Palmero, A.; Giuliano, C.; Lavelli, A.: Automatic Mapping of Wikipedia Templates for Fast Deployment of Localised DBpedia datasets. In Proceedings of the 13th International Conference on Knowledge Management and Knowledge Technologies, pp. 1:1-1:8 (2013).

1 https://lucenenet.apache.org/
2 https://sites.google.com/site/iswc2016demo/