1 Haiphong University, Vietnam

Automatic Alignment between Wikipedia Attributes and DBpedia Properties

0 Hanoi University of Science and Technology, Vietnam


DBpedia plays a central role in Linked Open Data (LOD) because of the large and growing number of resources linked to it. The project currently extracts information from Wikipedia and represents it as RDF triples. The extraction procedure requires Wikipedia infobox attributes to be mapped manually to DBpedia properties. However, the number of attributes across all Wikipedia language editions is very large, so this task is time-consuming and labor-intensive. We propose a novel method that performs the mapping automatically, based on an instance-based approach enhanced with label translation. Experiments on Vietnamese Wikipedia confirm that our method improves the mapping results.

Keywords: DBpedia, Ontology, Mappings, Infobox, Attributes, Properties


nhunt@dhhp.edu.vn

1 Introduction

DBpedia is built on a community effort to extract knowledge from Wikipedia [1]. The project currently maintains an extraction framework and a shared ontology to retrieve knowledge from the most frequent infoboxes in the Wikipedia editions [5]. Each infobox is a set of attribute-value pairs that summarizes a Wikipedia article. Contributors from many countries have joined the DBpedia mapping project, whose goal is to map Wikipedia infoboxes to classes, and infobox attributes to the corresponding properties, in the DBpedia ontology [2]. The 2016-04 version of the DBpedia Ontology encompasses 754 classes, which form a subsumption hierarchy and are described by more than 3,000 different properties¹. Thanks to crowdsourcing, many infoboxes have been mapped, but the number of completed mappings is still limited. Thus, the alignment among the multilingual DBpedia editions is currently incomplete. Although DBpedia extracts information from 128 languages, only 32 languages have mappings and only 19 have their own chapters². The mapping community is clearly still immature. Meanwhile, increasing the number of DBpedia versions helps to improve the interlinking and richness of LOD. Therefore, automatically mapping attributes to their corresponding properties is a useful way to deploy DBpedia chapters quickly. Manual mapping, by contrast, is highly prone to changes in Wikipedia, a noticeable drawback considering how fast edits are made [3].

Our main idea builds on an instance-based approach, which assumes that an attribute and a property are the same if their values are equivalent. In this paper, we propose a new method that improves this approach with label translation, using Vietnamese Wikipedia as our case study.

2 Mapping extraction

In this section, we describe how to determine whether an attribute $a$ contained in the Wikipedia infobox $I$ can be mapped to a given property $r$ in DBpedia. Given RDF datasets in two languages, the alignment harvests the pairs whose values are similar. Given a target language and a set of source (pivot) languages, after the data processing step we classify the values into individual sets by type: date, number, string and object. We then compute the value-based similarity between attributes and properties. For each candidate pair, the similarity $S(a, r)$ of an alignment $[a, r]$ is computed from an inner function $\sigma$ with values in $[0, 1]$, which measures the similarity between the values of $a$ and $r$ for each pivot language $l$. Here $e_W(a)$ addresses the value of the Wikipedia attribute $a$ in the target language, and $e_l(r)$ extracts the value of the property $r$ of DBpedia in the pivot language $l$. We denote their concrete values by $v_a$ and $v_r$, respectively. As mentioned above, we distinguish two kinds of property when computing the similarity, so a separate function is applied for each property type.
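As a rough illustration of the value-based comparison described above, the following Python sketch compares typed values and averages an inner similarity over the articles available in each pivot language. The function names (`value_similarity`, `pair_similarity`), the relative difference used for numbers, the token overlap used for strings, and the averaging itself are assumptions made for illustration; the exact aggregation used in this paper is not reproduced here, and the sample data are made up.

```python
from datetime import date


def value_similarity(a, b):
    """Illustrative inner similarity in [0, 1] between two values of the same type."""
    if a is None or b is None:
        return 0.0
    if isinstance(a, date) and isinstance(b, date):
        # Dates: exact match only (an assumption).
        return 1.0 if a == b else 0.0
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        # Numbers: 1 minus the relative difference (an assumption).
        if a == b:
            return 1.0
        scale = max(abs(a), abs(b))
        return max(0.0, 1.0 - abs(a - b) / scale)
    if isinstance(a, str) and isinstance(b, str):
        # Strings and object URIs: Jaccard overlap of lower-cased tokens.
        tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
        if not tokens_a or not tokens_b:
            return 0.0
        return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
    # Values of different types are treated as dissimilar.
    return 0.0


def pair_similarity(attr_values, prop_values_by_lang):
    """Average the inner similarity over pivot languages and shared articles.

    attr_values: {article_id: value} for one Wikipedia attribute (target language).
    prop_values_by_lang: {lang: {article_id: value}} for one DBpedia property.
    """
    scores = []
    for prop_values in prop_values_by_lang.values():
        for article, prop_value in prop_values.items():
            if article in attr_values:
                scores.append(value_similarity(attr_values[article], prop_value))
    return sum(scores) / len(scores) if scores else 0.0


# Made-up sample: one attribute observed on two articles, two pivot languages.
attribute = {"p1": 100, "p2": "Red River"}
prop = {"en": {"p1": 100, "p2": "red river"}, "de": {"p1": 50}}
print(pair_similarity(attribute, prop))
```

Averaging is only one possible aggregation; a threshold on the resulting score would then decide which candidate pairs are kept.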

However, determining similarity from values alone introduces noise into the returned results, because two values may coincide even though the attribute and the property are not actually equivalent. To overcome this drawback, we generate a dictionary to translate labels. Here, we consider a Wikipedia article A and the DBpedia resource that describes the same entity. Only the attributes whose values are equivalent to property values are used.

We denote them by a set $\{a_1, a_2, \ldots, a_m\}$ of Wikipedia attributes and a set $\{r_1, r_2, \ldots, r_n\}$ of DBpedia properties in language $l$. We then have at most $m \cdot n$ matched pairs $(a_i, r_j)$, where $1 \le i \le m$ and $1 \le j \le n$. It is desirable to reduce this number to $k$ ($k < m \cdot n$). To do so, the labels of attributes and properties are translated into the same language. Obviously, it is convenient to translate them into English. We can then use WordNet to obtain their synsets. Finally, we use majority voting to retrieve the best pairs.

2 http://wiki.dbpedia.org/about/language-chapters
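The label-based filtering step can be sketched as follows. This is a minimal sketch under stated assumptions: labels are taken to be already translated into English, a small hand-written table `SYNONYMS` stands in for WordNet synsets, and the vote is counted across pivot languages. The helper names, the candidate tuple layout and the sample labels are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical stand-in for WordNet synsets: groups of interchangeable labels.
SYNONYMS = {
    "birth date": {"birth date", "date of birth", "born"},
    "population": {"population", "inhabitants"},
}


def synset(label):
    """Return the synonym group containing the label, or the label on its own."""
    label = label.lower().strip()
    for group in SYNONYMS.values():
        if label in group:
            return group
    return {label}


def labels_match(attr_label_en, prop_label_en):
    """Two English labels match if their synonym groups overlap."""
    return bool(synset(attr_label_en) & synset(prop_label_en))


def filter_candidates(candidates):
    """Keep value-based candidates whose labels agree, then vote per attribute.

    candidates: list of (attr_label_en, prop_name, prop_label_en, pivot_lang).
    Returns {attr_label_en: prop_name} with the most voted property; on a tie,
    the property seen first wins.
    """
    votes = {}
    for attr_label, prop_name, prop_label, _lang in candidates:
        if labels_match(attr_label, prop_label):
            votes.setdefault(attr_label, Counter())[prop_name] += 1
    return {attr: counter.most_common(1)[0][0] for attr, counter in votes.items()}


# Made-up candidates produced by a value-based step in three pivot languages.
candidates = [
    ("date of birth", "birthDate", "birth date", "en"),
    ("date of birth", "birthDate", "birth date", "de"),
    ("date of birth", "deathDate", "death date", "nl"),
    ("inhabitants", "populationTotal", "population", "en"),
]
print(filter_candidates(candidates))
# {'date of birth': 'birthDate', 'inhabitants': 'populationTotal'}
```

The pair ("date of birth", "deathDate") is discarded because its labels disagree even though its values may have matched, which is the kind of noise the translation step is meant to remove.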

3 Experiment and evaluation

To evaluate our approach to automatic mapping, we carried out experiments on Vietnamese Wikipedia, using the existing DBpedia editions in three pivot languages (English, German and Dutch) as training data. Currently, editors of Vietnamese Wikipedia use infoboxes with attributes in both Vietnamese and English. For English attributes, we take advantage of the existing mappings to extract mappings easily. We therefore focus mainly on the Vietnamese attributes, because they contain more information with high accuracy. An infobox is chosen for mapping according to two principles: the test data and the infobox have the same number of attributes, and all chosen articles link to articles in the pivot languages. However, the values of infobox attributes in Wikipedia articles are often incomplete or even null. Thus, for each infobox in Vietnamese Wikipedia we create a dataset in which as many attributes as possible have values. The most frequent infoboxes are mapped first. This guarantees good coverage, as infoboxes are distributed according to Zipf's law. We then pick the 100 infoboxes with the most occurrences in the statistics³.

Bảng phân loại (Categories) / 167
Thông tin khu dân cư (Infobox settlement) / 419
Thông tin hành tinh (Infobox planet) / 87
Thông tin đơn vị hành chính Việt Nam (Infobox administrative divisions of Vietnam) / 71
Thông tin nhân vật hoàng gia (Infobox royalty) / 158
Thông tin tiểu sử bóng đá (Infobox football biography) / 250
Thông tin nhạc sĩ (Infobox musical artist) / 112
Thông tin phim (Infobox film) / 53
Thông tin viên chức (Infobox officeholder) / 197

3 http://mappings.dbpedia.org/server/statistics/vi/

Figure 1. (a) Comparison of value D. (b) Comparison of value E.

In fact, some attributes occur many times while others appear rarely. Occurrence is therefore one of the most important criteria for evaluating the mapping results. For the evaluation, we use Eq. (2) and Eq. (3) to compute the proportion of mapped attributes and the percentage based on occurrences.

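$$D = \frac{|M_I|}{|A_I|} \qquad (2)$$

and

$$E = \frac{\sum_{a \in M_I} o(a)}{\sum_{a \in A_I} o(a)} \qquad (3)$$

where $A_I$ is the set of attributes in infobox $I$, $M_I$ is the set of correctly mapped attributes and $o(a)$ is the number of occurrences of an attribute $a$. We compare our results in two cases: (1) the instance-based approach before the improvement and (2) after the improvement with translation.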
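To make the two measures concrete, the sketch below computes the share of mapped attributes and the occurrence-weighted share for a single infobox. The function name `coverage_metrics` and the sample counts are hypothetical; the numbers are made up and are not results from the experiments.

```python
def coverage_metrics(occurrences, mapped):
    """Return (share of mapped attributes, occurrence-weighted share).

    occurrences: {attribute: number of occurrences} for all attributes of one infobox.
    mapped: iterable of attributes that were mapped correctly.
    """
    mapped = set(mapped) & set(occurrences)
    if not occurrences:
        return 0.0, 0.0
    proportion = len(mapped) / len(occurrences)
    total = sum(occurrences.values())
    weighted = sum(occurrences[a] for a in mapped) / total if total else 0.0
    return proportion, weighted


# Made-up counts: two frequent attributes are mapped, one rare attribute is not.
occurrences = {"tên": 90, "dân số": 9, "cỡ bản đồ": 1}
proportion, weighted = coverage_metrics(occurrences, {"tên", "dân số"})
print(proportion, weighted)
```

In this made-up case two of three attributes are mapped, yet they account for 99 of 100 occurrences, which is why the occurrence-weighted measure can be much higher than the plain proportion when the unmapped attributes are rare.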

Table 1 shows the mapping results on the top 10 infoboxes; our method gives better results, especially for infoboxes whose attributes are almost all in Vietnamese. Figure 1 illustrates the results before and after the improvement more clearly. Nevertheless, some attributes remain unmapped. Most of them are attributes with low occurrences. Moreover, in some cases no relation between the attribute and a property exists, or the attribute is so specific to Vietnamese Wikipedia that it is difficult to find a corresponding property in DBpedia. For instance, consider the infobox “Thông tin đơn vị hành chính Việt Nam” (Infobox administrative divisions of Vietnam). Attributes such as “cỡ bản đồ” (map size) and “nhãn bản đồ” (map label) have occurrences less than or equal to 1. In addition, the attributes “xã” (commune) and “phường” (ward) could not be matched with any existing property in DBpedia.

4 The demo

We built a tool that converts Vietnamese Wikipedia articles into DBpedia resources based on the generated mappings. A user can enter keywords for a subject in Vietnamese, and the system shows suggestions for them. When the user clicks a subject, the article is shown together with a button that converts it into a DBpedia resource. The system also allows users to view and extract the data as RDF triples. The demo uses our algorithm to automatically map attributes of Vietnamese Wikipedia infoboxes to the corresponding properties in the DBpedia ontology, with a database in any language or a URI to query the entity. The tool, named AMA⁴, is a first simulator for building a Vietnamese DBpedia chapter automatically.
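The conversion step can be pictured with the following simplified sketch, which turns the attribute-value pairs of one infobox into N-Triples lines using a mapping table. It is not the AMA implementation: the function name `infobox_to_ntriples`, the resource namespace `http://vi.dbpedia.org/resource/`, the treatment of every value as a plain literal, and the sample values are all assumptions made for illustration.

```python
def infobox_to_ntriples(title, infobox, mappings,
                        resource_ns="http://vi.dbpedia.org/resource/",  # assumed namespace
                        ontology_ns="http://dbpedia.org/ontology/"):
    """Convert mapped infobox attributes of one article into N-Triples lines.

    title: article title, used to build the subject IRI.
    infobox: {attribute: value} taken from the article's infobox.
    mappings: {attribute: DBpedia property name} produced by the mapping step.
    """
    subject = f"<{resource_ns}{title.replace(' ', '_')}>"
    triples = []
    for attribute, value in infobox.items():
        prop = mappings.get(attribute)
        if prop is None or value in (None, ""):
            continue  # skip unmapped attributes and empty values
        # Simplification: every value becomes a plain string literal.
        literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
        triples.append(f'{subject} <{ontology_ns}{prop}> "{literal}" .')
    return triples


# Made-up infobox values; only "dân số" has a mapping in this example.
infobox = {"dân số": "1000000", "cỡ bản đồ": "250px"}
mappings = {"dân số": "populationTotal"}
for line in infobox_to_ntriples("Hải Phòng", infobox, mappings):
    print(line)
# <http://vi.dbpedia.org/resource/Hải_Phòng> <http://dbpedia.org/ontology/populationTotal> "1000000" .
```

A fuller version would emit typed literals for dates and numbers and resource IRIs for object properties, following the value types distinguished in the mapping step.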

5 Conclusion and future work

In this work, we proposed a new method that builds on the instance-based approach and improves it with translation. The experiments show that our method gives better results, although several weaknesses remain. This suggests that our methodology is promising for the development of linked data and for the fast deployment of localized DBpedia chapters in settings where the mapping communities are still weak. In future work, we will investigate this algorithm for other languages.

1. Lehmann, J., Isele, R., Jakob, M., et al.: DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web - Interoperability, Usability, Applicability, vol. 6, no. 2, pp. 167-195 (2015).

2. Mendes, P.N., Jakob, M., Bizer, C.: DBpedia: A Multilingual Cross-Domain Knowledge Base. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation, pp. 1813-1817 (2012).

3. Palmero, A., Giuliano, C., Lavelli, A.: Automatic Mapping of Wikipedia Templates for Fast Deployment of Localised DBpedia Datasets. In: Proceedings of the 13th International Conference on Knowledge Management and Knowledge Technologies, pp. 1:1-1:8 (2013).