<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Alignment between Wikipedia Attributes and DBpedia Properties</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Hanoi University of Science and Technology</institution>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>DBpedia plays a central role in Linked Open Data (LOD) because of the large and growing number of resources linked to it. Currently, the project extracts information from Wikipedia and represents it as RDF triples. The extraction procedure requires Wikipedia infobox attributes to be mapped manually to DBpedia properties. However, the number of attributes across the Wikipedia editions in different languages is very large, so this task is time-consuming and labor-intensive. We propose a novel method for automatic mapping, based on an instance-based approach enhanced with label translation. Experiments on Vietnamese Wikipedia confirm a significant improvement when our method is applied.</p>
      </abstract>
      <kwd-group>
        <kwd>DBpedia</kwd>
        <kwd>Ontology</kwd>
        <kwd>Mappings</kwd>
        <kwd>Infobox Attributes</kwd>
        <kwd>Properties</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Haiphong University, Vietnam</title>
      <p>nhunt@dhhp.edu.vn</p>
      <p>DBpedia is built on a community effort to extract knowledge from Wikipedia
[1]. Currently, the project maintains an extraction framework and a shared
ontology to retrieve knowledge based on the most frequent infoboxes within the Wikipedia
editions [5]. Each infobox is a set of attribute-value pairs that summarizes
a Wikipedia article. Contributors from many countries have joined the
DBpedia mapping project, whose goal is to map Wikipedia infoboxes to
classes, and their attributes to the corresponding properties, in the DBpedia ontology [2].
The 2016-04 version of the DBpedia Ontology encompasses 754 classes, which form
a subsumption hierarchy and are described by more than 3000 different properties<sup>1</sup>.
Thanks to crowdsourcing, a large number of infoboxes have been mapped. However,
the number of completed mappings is still small, and the alignment
among the multilingual DBpedia editions is therefore incomplete. Although DBpedia extracts
information from 128 languages, only 32 languages have mappings and
only 19 have their own chapters<sup>2</sup>. Clearly, the mapping community is
still immature. Meanwhile, increasing the number of DBpedia versions helps to improve the
connectivity and richness of LOD. Therefore, automatically mapping attributes to their
corresponding properties is a useful way to deploy DBpedia chapters quickly.
Manual mapping, by contrast, is highly prone to changes in Wikipedia, a noticeable
drawback considering how fast edits are made [3].</p>
      <p>The main idea builds on an instance-based approach: an attribute and a
property are assumed to be the same if their values are equivalent. In this paper, we
propose a new method that improves this approach by using label
translation, with Vietnamese Wikipedia as our case study.</p>
      <sec id="sec-1-1">
        <title>Mapping extraction</title>
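        <p>In this section, we describe how to determine whether an attribute <inline-formula><tex-math>a</tex-math></inline-formula> contained in the
Wikipedia infobox I can be mapped to a given property r in DBpedia. Given RDF datasets in two
languages, the alignment task is to harvest pairs whose values are similar.
Given a target language and a set of source (pivot) languages, after the data processing
step we classify the values into individual sets: date, number, string and object.
We then compute the value-based similarity between attributes and properties.
For each candidate pair, the similarity of an alignment <inline-formula><tex-math>[a, r]</tex-math></inline-formula> is measured by a score
<inline-formula><tex-math>S(a, r)</tex-math></inline-formula> computed from the terms
<inline-formula><tex-math>\sigma\big(W(a, l_t),\, D(r, l)\big)</tex-math></inline-formula>, one for each pivot language <inline-formula><tex-math>l</tex-math></inline-formula>.</p>
        <p>Here <inline-formula><tex-math>\sigma</tex-math></inline-formula> is an inner function with values in [0,1] that calculates the similarity
between the values of <inline-formula><tex-math>a</tex-math></inline-formula> and r for each pivot language;
<inline-formula><tex-math>W(a, l_t)</tex-math></inline-formula> gives the value of the Wikipedia attribute <inline-formula><tex-math>a</tex-math></inline-formula> in the target language <inline-formula><tex-math>l_t</tex-math></inline-formula>, and
<inline-formula><tex-math>D(r, l)</tex-math></inline-formula> extracts the value of the property r of DBpedia in the pivot language <inline-formula><tex-math>l</tex-math></inline-formula>.
We denote these concrete values by <inline-formula><tex-math>v_a</tex-math></inline-formula> and <inline-formula><tex-math>v_r</tex-math></inline-formula>, respectively.
As mentioned above, we distinguish two kinds of property
when computing the similarity, and we apply a separate function for each property type.</p>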
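        <p>As a minimal sketch of the value-based comparison described above, the following Python code scores one attribute-property pair. The inner functions for numbers, dates and strings, and the choice to average the inner similarity over the pivot languages, are illustrative assumptions rather than the exact definitions used in this work; all function and variable names are hypothetical.</p>
        <preformat>
```python
from datetime import date
from difflib import SequenceMatcher


def inner_similarity(attr_value, prop_value):
    """Hypothetical inner function that returns a score in [0, 1]."""
    numeric = (int, float)
    if isinstance(attr_value, numeric) and isinstance(prop_value, numeric):
        # Numbers: relative closeness (assumed choice).
        largest = max(abs(attr_value), abs(prop_value))
        if largest == 0:
            return 1.0
        return max(0.0, 1.0 - abs(attr_value - prop_value) / largest)
    if isinstance(attr_value, date) and isinstance(prop_value, date):
        # Dates: exact match only (assumed choice).
        return 1.0 if attr_value == prop_value else 0.0
    # Strings and object labels: character-level ratio (assumed choice).
    left = str(attr_value).lower()
    right = str(prop_value).lower()
    return SequenceMatcher(None, left, right).ratio()


def alignment_similarity(attr_value, prop_values_by_language):
    """Average the inner similarity over the pivot languages that have a value.

    Averaging is an assumption made for this sketch.
    """
    scores = [
        inner_similarity(attr_value, value)
        for value in prop_values_by_language.values()
        if value is not None
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy values for one Vietnamese attribute and one DBpedia property.
population_vi = 7000000
population_pivots = {"en": 7000000, "de": 6900000, "nl": None}
score = alignment_similarity(population_vi, population_pivots)
```
        </preformat>
        <p>In this toy case the Dutch edition has no value, so only the English and German values contribute to the score.</p>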
        <p>However, determining the similarity from values alone introduces
noise into the returned results, because an attribute and a property may share the same values
without actually being equivalent. To overcome this drawback, we improve the approach by generating a
dictionary to translate labels. Here, we consider a Wikipedia article A and the DBpedia
resource that describes the same entity. Only the attributes whose values are equivalent
are used.</p>
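        <p>We denote them by a set <inline-formula><tex-math>\{a_1, a_2, \dots, a_m\}</tex-math></inline-formula> of Wikipedia attributes that share the same value
and a set <inline-formula><tex-math>\{r_1, r_2, \dots, r_n\}</tex-math></inline-formula> of DBpedia properties in language <inline-formula><tex-math>l</tex-math></inline-formula> that share the same value. If
these two values are equivalent, we obtain at most m*n matched pairs
<inline-formula><tex-math>(a_i, r_j)</tex-math></inline-formula>, where <inline-formula><tex-math>1 \le i \le m</tex-math></inline-formula> and <inline-formula><tex-math>1 \le j \le n</tex-math></inline-formula>. It is desirable to reduce this number to k (k &lt; m*n).
To do so, the labels of the attributes and properties are translated into the same language.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2 http://wiki.dbpedia.org/about/language-chapters</title>
      <p>Obviously, it is convenient to translate into English. We can then use WordNet to obtain the synsets of the translated labels.
Finally, we use a majority voting method to retrieve the best pairs.</p>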
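      <p>The label-based filtering can be sketched as follows. This is only an illustration under stated assumptions: the small dictionary stands in for the generated translation dictionary, the synonym table stands in for WordNet synsets, and the vote is taken over the candidates proposed by each pivot language, which is one possible reading of the majority voting step. All names and data are hypothetical.</p>
      <preformat>
```python
from collections import Counter

# Hypothetical Vietnamese-to-English label dictionary.
LABEL_DICTIONARY = {"dân số": "population", "diện tích": "area"}

# Hypothetical synonym sets standing in for WordNet synsets.
SYNONYMS = {
    "population": {"population", "populationtotal", "inhabitants"},
    "area": {"area", "areatotal", "surface"},
}


def labels_match(attribute_label, property_label):
    """True when the translated attribute label shares a synonym set
    with the property label."""
    english = LABEL_DICTIONARY.get(attribute_label.lower())
    if english is None:
        return False
    return property_label.lower() in SYNONYMS.get(english, {english})


def best_property(attribute_label, candidates_by_language):
    """Majority vote over label-compatible candidates from each pivot language."""
    votes = Counter(
        candidate
        for candidates in candidates_by_language.values()
        for candidate in candidates
        if labels_match(attribute_label, candidate)
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Value-based candidates proposed by three pivot languages (toy data).
candidates = {
    "en": ["populationTotal", "areaTotal"],
    "de": ["populationTotal"],
    "nl": ["inhabitants", "populationTotal"],
}
chosen = best_property("dân số", candidates)
```
      </preformat>
      <p>Candidates whose labels are not compatible with the translated attribute label receive no votes, which is how the number of matched pairs shrinks below m*n in this sketch.</p>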
      <sec id="sec-2-1">
        <title>Experiment and evaluation</title>
        <p>To evaluate our approach to automatic mapping, we carried out
experiments on Vietnamese Wikipedia, using the existing DBpedia editions in three pivot
languages (English, German and Dutch) as training data. Currently, editors of
Vietnamese Wikipedia employ infoboxes in both Vietnamese and English. For English
attributes, we take advantage of the existing mappings to extract mappings easily. We
therefore concentrate on the Vietnamese attributes, which contain more
information with high accuracy. An infobox is chosen for mapping according to two
principles: the number of attributes in the test data equals that of the infobox, and all chosen
articles must link to the corresponding articles in the pivot languages. However, the values of infobox
attributes are often incomplete or even null in Wikipedia articles. We therefore
create a dataset for each infobox in Vietnamese Wikipedia so that as many
attributes as possible have values. The most frequent infoboxes are mapped first, which
guarantees good coverage because infobox usage follows Zipf's law.
We then pick the 100 infoboxes with the most occurrences in the statistics<sup>3</sup>.</p>
        <list list-type="simple">
          <list-item><p>Bảng phân loại (Categories) / 167</p></list-item>
          <list-item><p>Thông tin khu dân cư (Infobox settlement) / 419</p></list-item>
          <list-item><p>Thông tin hành tinh (Infobox planet) / 87</p></list-item>
          <list-item><p>Thông tin đơn vị hành chính Việt Nam (Infobox administrative divisions of Vietnam) / 71</p></list-item>
          <list-item><p>Thông tin nhân vật hoàng gia (Infobox royalty) / 158</p></list-item>
          <list-item><p>Thông tin tiểu sử bóng đá (Infobox football biography) / 250</p></list-item>
          <list-item><p>Thông tin nhạc sĩ (Infobox musical artist) / 112</p></list-item>
          <list-item><p>Thông tin phim (Infobox film) / 53</p></list-item>
          <list-item><p>Thông tin viên chức (Infobox officeholder) / 197</p></list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 http://mappings.dbpedia.org/server/statistics/vi/</title>
      <p>Figure 1. a) Comparison of value D; b) Comparison of value E.</p>
      <p>In fact, some attributes occur many times while others
appear rarely. The number of occurrences is therefore one of the most important criteria for evaluating the
mapping results. For the evaluation, we use Eq. (2) and Eq. (3) to compute the proportion
of mapped attributes and the percentage based on occurrences.</p>
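      <p>
        <disp-formula><tex-math>P_{\mathrm{attr}}(I) = \frac{|M_I|}{|A_I|} \qquad (2)</tex-math></disp-formula>
        and
        <disp-formula><tex-math>P_{\mathrm{occ}}(I) = \frac{\sum_{a \in M_I} o(a)}{\sum_{a \in A_I} o(a)} \qquad (3)</tex-math></disp-formula>
where <inline-formula><tex-math>A_I</tex-math></inline-formula> is the set of attributes in infobox <inline-formula><tex-math>I</tex-math></inline-formula>, <inline-formula><tex-math>M_I</tex-math></inline-formula> is the set of correctly mapped attributes, and
<inline-formula><tex-math>o(a)</tex-math></inline-formula> is the number of occurrences of an attribute <inline-formula><tex-math>a</tex-math></inline-formula>. We compare our results in two cases: (1) the
instance-based approach before and (2) after improving it with translation.
      </p>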
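      <p>The two measures can be computed as in the short sketch below. It assumes that Eq. (2) is the fraction of an infobox's attributes that were mapped correctly and that Eq. (3) is the same fraction weighted by attribute occurrences; the function names and the occurrence counts are hypothetical.</p>
      <preformat>
```python
def attribute_coverage(all_attributes, mapped_attributes):
    """Assumed form of Eq. (2): share of attributes mapped correctly."""
    attributes = set(all_attributes)
    if not attributes:
        return 0.0
    mapped = set(mapped_attributes).intersection(attributes)
    return len(mapped) / len(attributes)


def occurrence_coverage(occurrences, mapped_attributes):
    """Assumed form of Eq. (3): the same share, weighted by occurrences."""
    total = sum(occurrences.values())
    if total == 0:
        return 0.0
    mapped = sum(
        count for name, count in occurrences.items()
        if name in mapped_attributes
    )
    return mapped / total


# Hypothetical occurrence counts for the attributes of one infobox.
occurrences = {"tên": 400, "dân số": 350, "diện tích": 240, "cỡ bản đồ": 1}
correctly_mapped = {"tên", "dân số"}

by_attribute = attribute_coverage(occurrences.keys(), correctly_mapped)
by_occurrence = occurrence_coverage(occurrences, correctly_mapped)
```
      </preformat>
      <p>With these toy counts, half of the attributes are mapped, but they account for a larger share of all occurrences, which is why the occurrence-based measure is reported alongside the attribute-based one.</p>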
      <p>Table 1 shows the mapping results on the top 10 infoboxes. Our method gives
better results, especially for infoboxes whose attributes are almost all Vietnamese. Figure 1
illustrates the results before and after the improvement more clearly. Conversely, the
remaining attributes have not been mapped yet. Most of them are attributes with low
occurrences. Moreover, in some cases no relation between the attribute and a property exists, or
the attribute is so specific to Vietnamese Wikipedia that it is difficult to
find a corresponding property in DBpedia. For instance, consider the infobox
“Thông tin đơn vị hành chính Việt Nam” (Infobox administrative divisions of
Vietnam). Attributes such as “cỡ bản đồ” (map size) and “nhãn bản đồ” (map label)
have occurrences less than or equal to 1. Besides, the attributes “xã” (commune) and
“phường” (ward) could not be matched with any existing property in DBpedia.</p>
      <sec id="sec-3-1">
        <title>The demo</title>
        <p>We built a tool that converts Vietnamese Wikipedia articles into DBpedia resources
based on the generated mappings. When a user enters keywords for a subject in
Vietnamese, the system shows suggestions for them. When the
user clicks a subject, the article is shown together with a button that converts it to a
DBpedia resource. The system also allows users to view and extract the data as RDF
triples. The demo uses the algorithm to automatically map the attributes of
Vietnamese Wikipedia infoboxes to the corresponding properties in the DBpedia ontology, using a
database in any language or a URI to query the entity. The tool, named AMA<sup>4</sup>,
is a first simulator for automatically building a Vietnamese DBpedia chapter.</p>
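        <p>The conversion step can be illustrated with a minimal sketch that turns the attribute-value pairs of one infobox into triples by looking up each attribute in a mapping table. This is not the code of the tool itself: the mapping table, the resource namespace and the function name are assumptions made for illustration.</p>
        <preformat>
```python
# Hypothetical mappings from Vietnamese infobox attributes to DBpedia properties.
MAPPINGS = {"dân số": "populationTotal", "diện tích": "areaTotal"}

RESOURCE_BASE = "http://vi.dbpedia.org/resource/"  # assumed namespace
ONTOLOGY_BASE = "http://dbpedia.org/ontology/"


def infobox_to_triples(title, infobox):
    """Return (subject, predicate, object) triples for the mapped attributes."""
    subject = RESOURCE_BASE + title.replace(" ", "_")
    triples = []
    for attribute, value in infobox.items():
        prop = MAPPINGS.get(attribute)
        if prop is None:
            continue  # attributes without a mapping are skipped
        triples.append((subject, ONTOLOGY_BASE + prop, str(value)))
    return triples


# Toy infobox with one attribute that has no mapping.
infobox = {"dân số": 2000000, "diện tích": 1500, "cỡ bản đồ": 250}
triples = infobox_to_triples("Hải Phòng", infobox)
```
        </preformat>
        <p>Attributes that have no mapping, such as the map size in this toy infobox, produce no triple.</p>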
      </sec>
      <sec id="sec-3-2">
        <title>Conclusion and future work</title>
        <p>In this work, we propose a new method that improves the instance-based approach
with translation. The experiments show that our method achieves better
results, although several weaknesses remain. This suggests that our methodology is
promising for the development of linked data and for the fast deployment of
localized DBpedia chapters where mapping communities are still weak. In
future work, we will investigate this algorithm for other languages.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Isele</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jakob</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; et al.:
          <article-title>DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia</article-title>
          .
          <source>Semantic Web - Interoperability, Usability, Applicability</source>
          , vol.
          <volume>6</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>167</fpage>
          -
          <lpage>195</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Mendes</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jakob</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>DBpedia: A Multilingual Cross-Domain Knowledge Base</article-title>
          . In:
          <source>Proceedings of the Eighth International Conference on Language Resources and Evaluation</source>
          , pp.
          <fpage>1813</fpage>
          -
          <lpage>1817</lpage>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Palmero</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Giuliano</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lavelli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Automatic Mapping of Wikipedia Templates for Fast Deployment of Localised DBpedia datasets</article-title>
          . In:
          <source>Proceedings of the 13th International Conference on Knowledge Management and Knowledge Technologies</source>
          , pp.
          <fpage>1:1</fpage>
          -
          <lpage>1:8</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>