<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>i18n-CKG: Considerations in Building Internationalization Contextualized Knowledge Graphs</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Toronto</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper proposes a methodology for creating an Internationalization Contextualized Knowledge Graph (i18n-CKG) that is designed to capture the wide range of international concepts and concerns. This methodology first surveys how well-known entities are represented through different properties in various language locales. Next, the i18n-CKG is constructed as a combination of the i18n properties with interlanguage translations and a scoring system to measure the universality of the properties. As a brief exercise, an i18n-CKG that models the i18n differences for a popular entertainer in English, Korean, Russian, and Chinese is presented.</p>
      </abstract>
      <kwd-group>
        <kwd>Internationalization</kwd>
        <kwd>i18n</kwd>
        <kwd>Knowledge graph</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Internationalization, often abbreviated as "i18n" (18 represents the number of
letters between the first and last), refers to the adaptation of a product or service
based in a certain language or locale for another language or locale. As different
countries, cultures, and communities have different priorities and ways of viewing
the world, it is important for knowledge graphs to be able to flexibly represent
these i18n differences. In this paper, a methodology for creating a contextualized
knowledge graph that is sensitive to i18n issues, an i18n-CKG, is presented.</p>
      <p>
        First, a brief literature review will show how i18n considerations have been
previously addressed in existing knowledge graphs, such as DBpedia. Next, the
methodology for creating an i18n-CKG will be described with a case study
using representative knowledge graph applications from different language locales:
English (en), Korean (ko), Russian (ru), and Chinese (zh). The paper ends with
a discussion of future work.
      </p>
      <p>
        A number of prior works have expanded the i18n capabilities of Linked Data.
There have been case studies of i18n for the Greek DBpedia [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Bengali DBpedia [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], multilingual ontology matching [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and guidelines for multilingual linked
data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], in addition to the ongoing language-specific DBpedia versions and other knowledge graphs
such as YAGO [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and Wikidata [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Copyright © 2018 for this paper by its authors. Copying permitted for private
and academic purposes.
      </p>
      <p>
        It is also important to recognize that people may underestimate the difficulties
of representing international knowledge in Linked Data. Often, it is not
as simple as mapping ontologies from different languages together. In software
development, a number of notable resources have pointed out common false
assumptions about names [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], languages [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and code [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>The intuition for this methodology can be expressed as follows. Although the
English Wikipedia<sup>1</sup> is a widely used website, people from different languages,
communities, and cultures maintain their own Wikipedia or equivalent
repository of open knowledge. Similarly, although the Google search engine holds a
significant market share globally, other search engines are prominent in certain
locales, such as Naver in South Korea and Yandex in Russia. By analyzing these
i18n data sources and search services as a "ground truth", it is possible to
compare how different locales represent relationships between entities.</p>
      <p>
        First, the i18n data sources and popular local services for delivering
structured information should be identified. Wikipedia<sup>2</sup> consists of many language
versions. The infoboxes in certain articles also display the structures (or
ontologies) used in the respective language Wikipedia. These multilingual infoboxes act as
up-to-date sources of i18n data that may not be captured in older releases of
linked data such as Wikidata and DBpedia. In the case of search services, Google
is widely popular and holds a significant market share worldwide [
        <xref ref-type="bibr" rid="ref7 ref9">7, 9</xref>
        ]. In South
Korea, Naver is a competitor to Google, while Baidu and Yandex dominate in
China and Russia respectively [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. To the extent that these local search services
provide results catered to their language's audience, it is useful to compare how
popular search queries and entities are represented differently.
      </p>
      <p>For an implementation of this methodology, a notable entity that would
be displayed in the search services of all four language locales (English (en),
Korean (ko), Russian (ru), Chinese (zh)) was selected. As a popular Korean pop
music entertainer with appearances in international music, television, and films,
South Korean singer "Rain" was compared across the aforementioned services
to determine which relationships (properties) were most prominently displayed
in the structured infoboxes and search engine results. With the structured data
from the various sources collected computationally, machine translation of
non-English texts and human-supervised reconciliation of complex properties were
used to create the final set of triples from each i18n source.</p>
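      <p>The reconciliation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for this paper: the function name <monospace>build_presence</monospace>, the lookup table, and the sample labels are hypothetical. Labels that have already been machine translated are matched against a table of canonical property names, and anything unmatched is set aside for human-supervised review.</p>
      <preformat><![CDATA[
```python
def build_presence(observations, reconcile):
    """Group canonical property names by locale.

    observations: iterable of (locale, translated_label) pairs.
    reconcile: dict mapping a lower-cased label to a canonical property name.
    Returns (presence, unresolved); unresolved labels are left for human review.
    """
    presence = {}
    unresolved = []
    for locale, label in observations:
        key = label.strip().lower()
        if key in reconcile:
            presence.setdefault(locale, set()).add(reconcile[key])
        else:
            unresolved.append((locale, label))
    return presence, unresolved


# Illustrative inputs only (assumed); not data collected for this paper.
observations = [
    ("en", "Date of birth"),
    ("zh", "Date of Birth"),
    ("zh", "Blood type"),
    ("en", "Born"),  # ambiguous label, deliberately absent from the table
]
reconcile = {"date of birth": "Date of birth", "blood type": "Blood type"}

presence, unresolved = build_presence(observations, reconcile)
```
]]></preformat>
      <p>Keeping unmatched labels in a separate list, rather than guessing a mapping, mirrors the human-supervised handling of complex properties.</p>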
      <p>As part of the i18n-CKG, an additional measure is proposed to score how
frequently the i18n properties are present across language locales. This expresses
how "local" a specific property is for a language. Such scores would also be
represented as triples: (p, i18nPropScore_en, value). A simple
implementation is to use <inline-formula><tex-math>s_l = r/t</tex-math></inline-formula>, where r is the presence of the property in language l
(0 or 1) and t is the total number of locales in which the property is present. Higher
scores indicate more "local" or locale-specific properties, while lower scores mean
more "universal" properties, unless the value is zero, which means the property is
absent from that locale. The code to generate the
proposed i18n-CKG is maintained on GitHub<sup>3</sup>.
<fn><label>1</label><p>https://en.wikipedia.org/wiki/Main_Page</p></fn>
<fn><label>2</label><p>https://www.wikipedia.org/</p></fn></p>
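      <p>As a rough sketch of the scoring rule, the score for a property in locale l can be computed as r/t, where r is 1 if the property is present in l (0 otherwise) and t is the number of locales in which the property appears. The function name <monospace>i18n_prop_scores</monospace> and the input layout are assumptions for illustration and are not taken from the project repository; the sample input covers only the two properties whose scores are reported in this paper.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def i18n_prop_scores(presence):
    """Return (property, predicate, score) triples.

    presence maps a locale code to the set of properties shown in that locale.
    """
    # t: number of locales in which each property is present
    totals = Counter(p for props in presence.values() for p in props)
    triples = []
    for locale in sorted(presence):
        for prop in sorted(totals):
            r = 1 if prop in presence[locale] else 0
            triples.append((prop, "i18nPropScore_" + locale, r / totals[prop]))
    return triples


# Presence pattern implied by the reported scores: "Date of birth" appears in
# all four locales, "Blood type" only in zh.
presence = {
    "en": {"Date of birth"},
    "ko": {"Date of birth"},
    "ru": {"Date of birth"},
    "zh": {"Date of birth", "Blood type"},
}
scores = {(p, pred): s for p, pred, s in i18n_prop_scores(presence)}
```
]]></preformat>
      <p>With this input, <monospace>scores[("Date of birth", "i18nPropScore_en")]</monospace> evaluates to 0.25 and <monospace>scores[("Blood type", "i18nPropScore_zh")]</monospace> to 1.0.</p>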
      <p>
        A complete translation of the ontology across languages (ontology
localization) can be accomplished through methods outlined in existing guides
[
        <xref ref-type="bibr" rid="ref2">2</xref>
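        ]. The properties, <inline-formula><tex-math>p \in P</tex-math></inline-formula>, that were discovered in the languages, <inline-formula><tex-math>l \in L</tex-math></inline-formula>, each
have a language-specific label represented by an RDF triple as (p, label,
"string"@lang). For each p across the languages in L, additional triples on the
order of a maximum of <inline-formula><tex-math>(|P|(|L| - 1))(|L|)</tex-math></inline-formula> should be newly created, assuming no
property labels are already available outside the local language. For Table 1,
this would be up to <inline-formula><tex-math>348 = ((29)(4 - 1))(4)</tex-math></inline-formula> additional triples.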
      </p>
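      <p>The upper bound on new label triples can be checked with a short sketch. The helper names below (<monospace>max_new_label_triples</monospace>, <monospace>label_triple</monospace>, <monospace>missing_labels</monospace>) are hypothetical and only illustrate the arithmetic and the (p, label, "string"@lang) notation; they are not part of the published code.</p>
      <preformat><![CDATA[
```python
def max_new_label_triples(num_properties, num_languages):
    """Upper bound (|P| (|L| - 1)) |L| on newly created label triples."""
    return num_properties * (num_languages - 1) * num_languages


def label_triple(prop, text, lang):
    """Format a label triple in the (p, label, "string"@lang) notation."""
    return f'({prop}, label, "{text}"@{lang})'


def missing_labels(labels, languages):
    """List (property, language) pairs that still lack a label.

    labels maps a property name to a dict of {language: label string}.
    """
    return [
        (prop, lang)
        for prop in sorted(labels)
        for lang in languages
        if lang not in labels[prop]
    ]


languages = ["en", "ko", "ru", "zh"]
bound = max_new_label_triples(29, len(languages))  # 29 properties, 4 languages

# Illustrative case: a property labeled only in English needs three more labels.
todo = missing_labels({"Date of birth": {"en": "Date of birth"}}, languages)
example = label_triple("Date of birth", "Date of birth", "en")
```
]]></preformat>
      <p>Here <monospace>bound</monospace> is 348, matching the figure given for Table 1, and <monospace>todo</monospace> lists the ko, ru, and zh labels that would still need translation.</p>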
    </sec>
    <sec id="sec-3">
      <title>Results of i18n-CKG Construction</title>
      <p>Table 1 shows how the four language locales represent the properties of the
entity "Rain". For instance, the "date of birth" and "place of birth" properties
are the most consistently displayed across all data sources, suggesting that these
properties are universally interesting across locales. A number of properties
represent less well-known attributes, such as weight, blood type, and zodiac sign.
These properties are likely to be of interest in the language locales where they
are represented, but not as relevant or interesting for locales that omit them.</p>
      <p>A selection of the preliminary calculations of the i18n property scores is
noted below. Higher scores indicate that a property is more relevant in a
particular language locale, such as "Blood type" for zh, while lower scores are
attributed to properties that are widely found in many language locales. Once
all additional triples have been created, the resulting i18n-CKG data may be
loaded into an existing triplestore or knowledge graph to provide i18n context.</p>
      <list list-type="bullet">
        <list-item><p>(Date of birth, i18nPropScore_en, 0.25 (= 1/4))</p></list-item>
        <list-item><p>(Date of birth, i18nPropScore_zh, 0.25 (= 1/4))</p></list-item>
        <list-item><p>(Blood type, i18nPropScore_en, 0.0 (= 0/1))</p></list-item>
        <list-item><p>(Blood type, i18nPropScore_zh, 1.0 (= 1/1))</p></list-item>
      </list>
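      <p>One possible way to prepare such score triples for loading into a triplestore is to serialize them as N-Triples. The sketch below is an assumption-laden illustration: the namespace <monospace>http://example.org/i18n-ckg/</monospace> and the function names are hypothetical, and the paper does not prescribe a serialization format. Only the Python standard library is used.</p>
      <preformat><![CDATA[
```python
from urllib.parse import quote

BASE = "http://example.org/i18n-ckg/"  # hypothetical namespace, not from the paper
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def iri(local_name):
    """Build an IRI in the hypothetical namespace, percent-encoding spaces."""
    return "<" + BASE + quote(local_name) + ">"


def to_ntriples(score_triples):
    """Serialize (property, predicate, score) tuples as N-Triples lines."""
    lines = []
    for prop, predicate, score in score_triples:
        # Typed literal: the score as an xsd:double
        literal = '"' + repr(float(score)) + '"^^<' + XSD_DOUBLE + ">"
        lines.append(iri(prop) + " " + iri(predicate) + " " + literal + " .")
    return "\n".join(lines) + "\n"


# The four score triples reported in this section.
score_triples = [
    ("Date of birth", "i18nPropScore_en", 0.25),
    ("Date of birth", "i18nPropScore_zh", 0.25),
    ("Blood type", "i18nPropScore_en", 0.0),
    ("Blood type", "i18nPropScore_zh", 1.0),
]
document = to_ntriples(score_triples)
```
]]></preformat>
      <p>The resulting text contains one statement per line and could be passed to the bulk-load facility of whichever triplestore is in use.</p>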
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future Work</title>
      <p>A number of extensions to the methodology proposed in this paper are possible.
Qualitatively, the differences between the properties in the different language
ontologies can be compared more deeply. For example, for the property label
"Born", there may be a combination of "place of birth", "date of birth", and/or
"birth name" that are actually found in the data. In addition, there are also
opportunities to expand the computational methods in future iterations of this
research to scale with the number of entities (on the order of millions) and
languages (on the order of hundreds).
<fn><label>3</label><p>https://github.com/nchah/i18n-ckg</p></fn></p>
      <p>At an implementation level, the creation of i18n labeled properties requires
international language translation capabilities that can also scale. A
combination of reliable machine translation tools and linguistic verification by bilingual
experts may need to be implemented to maintain accuracy. While this is
straightforward for some properties (e.g. "date of birth"), difficulties may become
apparent as more obscure or i18n-specific properties are examined. Improvements
in these processes should be undertaken while also building on similar ontology
localization and mapping efforts in such cases as DBpedia<sup>4</sup>.
<fn><label>4</label><p>http://mappings.dbpedia.org/index.php/Main_Page</p></fn></p>
      <p>Limitations to the i18n-CKG approach are also important to consider. For
instance, biases in the data may affect the coverage of i18n-CKG. Certain
languages and communities that are not well represented online would not be
included to the same degree as more widely used languages such as English and
French. There may also be divergent representations of the same entity or
property within the same language or locale, which require additional work to resolve.</p>
      <p>As more and more locales are covered, it will also be necessary to develop
processes for dealing with data that conflict between i18n locales. Conflicting data
between i18n locales may arise due to political issues (e.g. territorial disputes),
cultural differences, or unreliable i18n sources, among other reasons. Further
work is needed to handle such potentially sensitive matters.</p>
      <p>In this paper, a preliminary methodology for creating a contextualized
knowledge graph that is sensitive to i18n issues, an i18n-CKG, was described. The
two-stage approach included a survey of i18n differences using various i18n sources
of "ground truth" and the construction of a knowledge graph that accounted
for the i18n differences. Future research in building contextualized knowledge
graphs with i18n considerations in mind should prove fruitful as new methods
and improved processes are developed.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Computerphile:
          <article-title>Internationalis(z)ing code - computerphile</article-title>
          . https://www.youtube.com/watch?v=0j74jcxSunY (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Gómez-Pérez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vila-Suero</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montiel-Ponsoda</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gracia</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aguado-de-Cea</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Guidelines for multilingual linked data</article-title>
          .
          <source>In: Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics</source>
          . p.
          <fpage>3</fpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hamill</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Falsehoods programmers believe about language</article-title>
          . http://garbled.benhamill.com/2017/04/18/falsehoods-programmers-believe-about-language/ (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kontokostas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bratsas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antoniou</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Metakides</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Internationalization of linked data: The case of the Greek DBpedia edition</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>15</volume>
          ,
          <fpage>51</fpage>
          –
          <lpage>61</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>McKenzie</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Falsehoods programmers believe about names</article-title>
          . https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/ (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Meilicke</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García-Castro</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Hage</surname>
            ,
            <given-names>W.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montiel-Ponsoda</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Azevedo</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stuckenschmidt</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Šváb-Zamazal</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Svátek</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tamilin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>Multifarm: A benchmark for multilingual ontology matching</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>15</volume>
          ,
          <fpage>62</fpage>
          –
          <lpage>68</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. NetMarketShare:
          <article-title>Search engine market share</article-title>
          . https://www.netmarketshare.com/search-engine-market-share.aspx
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Sarkar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marjit</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Biswas</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Towards Bengali DBpedia</article-title>
          .
          <source>Procedia Technology</source>
          <volume>10</volume>
          ,
          <fpage>890</fpage>
          –
          <lpage>899</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Statcounter:
          <article-title>Search engine market share</article-title>
          . http://gs.statcounter.com/search-engine-market-share (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Suchanek</surname>
            ,
            <given-names>F.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kasneci</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>YAGO: a core of semantic knowledge</article-title>
          .
          <source>In: Proceedings of the 16th International Conference on World Wide Web</source>
          . pp.
          <fpage>697</fpage>
          –
          <lpage>706</lpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Vrandečić</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krötzsch</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Wikidata: a free collaborative knowledgebase</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>57</volume>
          (
          <issue>10</issue>
          ),
          <fpage>78</fpage>
          –
          <lpage>85</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>