=Paper=
{{Paper
|id=None
|storemode=property
|title=NLP for Interlinking Multilingual LOD
|pdfUrl=https://ceur-ws.org/Vol-1045/paper-05.pdf
|volume=Vol-1045
|dblpUrl=https://dblp.org/rec/conf/semweb/Lesnikova13
}}
==NLP for Interlinking Multilingual LOD==
<pdf width="1500px">https://ceur-ws.org/Vol-1045/paper-05.pdf</pdf>
<pre>
        NLP for Interlinking Multilingual LOD

                                Tatiana Lesnikova

                          INRIA & LIG, Grenoble, France
                           tatiana.lesnikova@inria.fr
                            http://exmo.inrialpes.fr/


      Abstract. Nowadays, there are many natural languages on the Web,
      and we can expect that they will stay there even with the development
      of the Semantic Web. Though the RDF model enables structuring infor-
      mation in a unified way, the resources can be described using different
      natural languages. To find information about the same resource across
      different languages, we need to link identical resources together. In this
      paper we present an instance-based approach for resource interlinking.
      We also show how a problem of graph matching can be converted into a
      document matching for discovering cross-lingual mappings across RDF
      data sets.

      Keywords: Multilingual Mappings, Cross-Lingual Link Discovery, Cross-
      Lingual RDF Data Set Linkage


1   Problem Statement

Due to the Resource Description Framework (RDF), the information on the Web
can be turned from the unstructured mass into the structured data represented
in the form of triples. The Linked Open Data (LOD) cloud containing billions
of triples is constantly growing. Since data sets are created independently, there
can be several Uniform Resource Identifiers (URIs) denoting the same entity
across different RDF data sets. As a result, one needs to address the problem
of entity resolution: identify and interlink the same entity across multiple data
sources.
    The RDF syntax is relatively simple and unambiguous: RDF = graph +
identifiers (labels). This is what the identification of resources can be based on.
However, this problem can become particularly difficult when there are multi-
lingual elements in a graph as a simple string matching technique is doomed
to fail. Hence, specific Natural Language Processing (NLP) techniques must be
considered.
    Our research problem is to find out methods for linking the same resource
located in several RDF data sets and described in various natural languages and
study the impact of available NLP techniques on the interlinking procedure.
2      Tatiana Lesnikova

2   Relevancy

Internet is a multilingual system, and we believe that it will continue to accom-
modate a diversity of natural languages despite the development of the Semantic
Web. Even though there are many resources in English, some other languages
occupy a decent portion of the Web space as well (see Internet world users by
language statistics 1 ). And we expect the necessity to tackle the multilingual-
ity problem to persist. There are many resources that could be interlinked. At
present, the number of languages 2 of RDF data sets amounts to 503.
    The importance of cross-lingual mappings has been discussed in several works
[1–3].
    Recently a Best Practices for Multilingual Linked Open Data Community
Group 3 has been created to elaborate a large spectrum of practices with regard
to multilingual LOD.
    Availability of the cross-lingual links is imperative for several neighboring
research areas. For example, to overcome the problem of ontology heterogene-
ity, some research has been done on monolingual ontology integration based on
instances interlinked by owl:sameAs [4]. If owl:sameAs links could be provided
between instances expressed in different languages, other experiments on inte-
grating underlying ontologies could be conducted.
    The owl:sameAs links between instances can be also valuable in other appli-
cations such as Question Answering over multilingual structured knowledge-base
[5] since a system can take advantage of the information presented in a language
different from a language that is being queried.
    Thus, the growing number of data sources in RDF format with multilin-
gual labels and the importance of cross-lingual links for other Semantic Web
applications motivate our interest in cross-lingual link discovery.


3   Related Work

The problem of searching for the same entity across multiple sources has been
investigated in several research fields. In database community, it is known as
instance identification, record linkage or record matching problem. In [6], the
authors use the term ”duplicate record detection” and provide a thorough survey
on the matching techniques. Though the work done in record linkage is similar
to our research, it does not contain cross-lingual aspect and RDF semantics.
    In the field of Information Retrieval (IR), within the framework of the Cross-
Language Evaluation Forum (CLEF)4 , the Web People Search Evaluation Cam-
paigns (2007-2010)5 focused on the Web People Search and person name ambi-
guity on Web pages and aimed at building a system which could estimate the
1
  http://www.internetworldstats.com/stats7.htm
2
  http://stats.lod2.eu/languages
3
  http://www.w3.org/community/bpmlod/
4
  http://www.clef-initiative.eu/
5
  http://nlp.uned.es/weps/weps-3
                                    NLP for Interlinking Multilingual LOD       3

number of referents and cluster Web pages that refer to the same individual into
one group. The research was performed on monolingual data.
    Cross-lingual entity linking has been addressed in Knowledge Base Popula-
tion track (KBP2011)[7] within the Text Analysis conference. The task is to link
entity mentions in a text to a knowledge base (Wikipedia). If entity mentions
are not in KB, they should be clustered into a separate group. Experiments were
done both on monolingual (English) and cross-lingual data (Chinese to English).
Authors in [8] used both language-independent and translation-based methods.
    In contrast to the research outlined above, we aim at providing insights into
the problem of cross-lingual interlinking from the point where data are already
in RDF format, and we can vary different parameters in order to determine their
impact on the interlinking operation.
    In the Semantic Web, interlinking resources that represent the same real-
world object and that are scattered across multiple Linked Data sets is a widely
researched topic. Within the Data Interlinking track (IM@OAEI 2011), several
interlinking systems have been proposed [9–13]. All of the systems were eval-
uated on monolingual data sets. Recent developments have been made also in
multilingual ontology matching [14, 15].
    To the best of our knowledge, there is no interlinking system specifically
designed to link RDF data sets with multilingual labels.


4   Research Questions

The goal of our work is to provide methods to link interrelated resources across
multilingual RDF data sets. For now, we restrict ourselves to owl:sameAs link
[16] as it is a classical type of link that is usually established, and it is also
important for tracking information about the same resource across different data
sources. Given two RDF data sets with URIs and literals in different natural
languages, the output will be a set of triples of type URI1 owl:sameAs URI2.
    Our general research question is: To what extent is it possible to interlink
data sets in different languages? To answer this question, within the framework
that we describe in the Proposed Approach section, we need to explore which
parameters influence this task. More specifically:

1. How to represent entities from RDF graphs?

     – What is the optimal distance for collecting language elements in traver-
       sal?
     – Is it necessary to preserve the structure of the graph in a virtual docu-
       ment by weighting the path length?

2. How to make entities described in different natural languages comparable?

     – What are the most appropriate Machine Translation techniques (rule-
       based, statistical, hybrid)?
4       Tatiana Lesnikova

     – What is the impact of translating one language into another or pivot
       language?
     – How does the output of similarity measures vary according to the con-
       text?

All these parameters will be studied with respect to specific contexts (language
pairs, data set types, amount of textual data available). We also plan to exper-
iment with graph matching techniques to see the difference with a translation-
based approach. Apart from Machine Translation, we will explore techniques
used for word alignment, thesaurus-based word sense disambiguation, multilin-
gual document ranking, and mapping to multilingual lexicons.


5    Hypotheses

We introduce several hypotheses that we would like to test in our research.

1. If two URIs denote the same real-world object, the descriptions of the prop-
   erties of this object should overlap with each other.
2. If descriptions are in different natural languages, then NLP techniques could
   help to decrease uncertainty across a set of resources.
3. If the descriptions of an entity overlap significantly, the similarity between
   them will be higher than between other entities.
4. If the degree of similarity depends on the available language context for each
   entity, then the more language data there are, the better will be the matching
   results.
5. If language data can be taken from two sources in RDF graph: property
   names and literals; then literals are more important since they are more
   informative.


6    Proposed Approach

Due to the presence of natural language terms in RDF graphs, we adopt a
language-oriented approach.

    The proposed approach includes several steps (see Figure 1).

1. Given two data sets with a resource representation in different natural lan-
   guage, extract language data for each URI. Thus, we create a ”virtual”
   document for each URI.
2. Compare virtual documents in pairs from both sets.
3. Find the maximum similarity between two representations of the resource.
4. Establish an owl:sameAs link between the two most similar representations.

    One should mind the following aspects of this approach:
                                    NLP for Interlinking Multilingual LOD        5

                                                      3
      DOCUMENT(en)                 SIMILARITY               DOCUMENT(en)


   translation                                                        2


      DOCUMENT(zh)                                          DOCUMENT(ru)
                                          4


                                                                      1


                                   owl:sameAs ?
        RESOURCE                                               RESOURCE


Fig. 1. Linking Process. Resources are described in Chinese and Russian languages
and then translated into English.


 – The idea of creating a ”virtual” document has been employed in ontology
   matching [17]. The intuition of converting a graph into a document represen-
   tation is that even though the taxonomy (structure) of graphs can be similar,
   the possibility to distinguish between two different things and identify the
   identical ones relies on their comparison. Thus, it is important to take into
   account lexical elements in a graph.

 – Once we have documents representing resources, we need to decide how to de-
   fine similarity between these resources. Similarity between documents can be
   taken for similarity between resources. Since we have documents in different
   languages, we can experiment with different types of Machine Translation
   (statistics-based, rule-based, hybrid). To estimate which strategy yields a
   better result, we will run our system by changing the translation component
   iteratively. Significant difference in results may signal which translation type
   is more beneficial. To enhance scalability, it would be interesting to trans-
   late the whole source corpus once and not to translate each label again and
   again. This would also allow for more contextual translation. The choice of
   translation techniques can also depend on the language combinations, for
   example, for rare languages, for which there does not exist enough parallel
   corpora, dictionary-based approaches might help.

 – At the resource comparison step, it is important to reduce a number of
   possible comparisons for the sake of time-efficiency. For example, only com-
   parisons between certain entity types are allowed. In case of using Supervised
   Machine Learning, the problem of training data is the most prominent one
   since there has been no official benchmark. And creating a generic training
   set for a heterogeneous amount of Linked Data seems very unrealistic. Then,
6         Tatiana Lesnikova

      instead of training, it would be interesting to test clustering algorithms and
      find appropriate parameters for identity resolution.

    – There are many techniques to compute similarity. A broad overview of them
      is given in [18]. We will use a vector space model [19] to represent terms in a
      ”virtual” document as vectors of features. The choice of particular similarity
      measures is yet to investigate. When terms are in different languages, docu-
      ment similarity fails. Some similarity measures perform better on long texts.
      After transformation of ”virtual” document into vectors, similarity metrics
      (e.g. Cosine, Euclidean) can be computed.

    – A virtual document per URI shall contain language data in proximity to a
      given URI. The hypothesis is that the more textual data we have to charac-
      terize a resource, the easier it would be to identify the identical ones.


      There are some complications as to textual data. Two scenarios are possible:
    – URI can be looked up and the textual data extracted (as in case of DBpedia)
    – No extra textual data are available per URI except the data in a graph itself.
    To overcome this lack of context for a particular resource, we propose to
browse a graph up to n+1 hops from the URI under investigation and collect
data along the way. The data carriers are property names and literals. Thus, a
virtual document for a particular URI will be the accumulation of data gathered
during graph traversal.
    This way of collecting a ”profile” for a resource entails a question: does the
difference of two graph structures affect the results of interlinking? On the one
hand, taking into account the success of statistical machine translation based
on statistical modelling and probabilities, the order of words is not always that
important. It would be interesting to see whether it holds for RDF interlinking as
well. On the other hand, we can try to preserve the order of collected properties
and literals in a virtual document by putting weight for each language element.
The further it is from the URI at question, the lower the weight. Term weight
can be assigned by computing termf requency in a document or distribution of
terms across a collection of documents known as inverse documentf requency
(IDF). Terms that appear in few documents can be discriminative with regard
to the rest of the documents. Combination of both TF x IDF is widely used in
vector space models.
    Once virtual documents are collected from both graphs, the documents will
be compared and results evaluated.

7      Reflections
We believe that we can succeed in finding the solution for our research topic
because we plan to put our research on a solid foundation and combine dif-
ferent methods to achieve the task. In traditional Web, there has been much
                                      NLP for Interlinking Multilingual LOD          7

work done on multilingual NLP, i.e. language identification, machine transla-
tion, cross-language information retrieval. We are going to conduct series of
experiments and see what works and how we can improve what does not work.
This would allow us to preserve only the best practices and finally crystallize
a solution to the problem. The author of this research proposal is also guided
by the specialists in the domain that will contribute to the right choice of the
research direction.


8     Evaluation

Evaluation means comparing the retrieved links against some reference. Standard
measures usually serve for evaluation of an interlinking system (Precision, Recall,
F-measure). The biggest challenge for evaluating our system is the absence of
standard benchmark tests. As described in [20], there are several ways to go
about this challenge. One of them would be to rely on the existing links between
resources in DBpedia. This could be considered as a good alternative if not yet
another hurdle: the existing interlanguage links can be inaccurate [21]. So, in
our research we plan to experiment with different evaluation settings: we may
experiment only with bi-directional links and/or study transitivity in order to
ensure the correctness of test cases. The English, French, Russian versions of
DBpedia 6 and Baidu Baike in Chinese [22] will be used for our experiments.
We will also try to identify types of entities to focus on.


References

 1. Gracia, J., Montiel-Ponsoda, E., Gômez-Pérez, A.: Cross-lingual Linking on the
    Multilingual Web of Data. In: Proc. of the 3rd Workshop on the Multilingual
    Semantic Web (MSW 2012) at ISWC 2012, Boston (USA), CEUR-WS ISSN 1613-
    0073, vol. 936 (2012)
 2. Gracia, J., Montiel-Ponsoda, E., Cimiano, P., Gômez-Pérez, A., Buitelaar, P., Mc-
    Crae, J.: Challenges for the multilingual Web of Data. Journal of Web Semantics,
    11, 63–71 (2012)
 3. Buitelaar, P., Choi, K.-S., Cimiano, P., Hovy, H. E.: The Multilingual Semantic
    Web (Dagstuhl Seminar 12362). Dagstuhl Reports 2(9), pp.15–94 (2012)
 4. Zhao, L., Ichise, R.: Instance-Based Ontological Knowledge Acquisition. In:
    Proc.10th International Conference, ESWC 2013, Vol. 7882, pp.155–169. LNCS,
    Springer Berlin Heidelberg (2013)
 5. Cabrio, E., Cojan, J., Gandon, F., and Hallili, A.: Querying multilingual DBpedia
    with QAKiS. In: Proc. 10th International Conference, ESWC 2013. Demo paper.
    Montpellier, France (2013)
 6. Elmagarmid, A., Ipeirotis, P., Verykios, V.: Duplicate record detection: A survey.
    IEEE Transactions on Knowledge and Data Engineering. 19(1), 1–16 (2007)
 7. Ji, H.,Grishman, R., Dang, H.T.: An Overview of the TAC2011 Knowledge BAse
    Population Track. In: Proc. Text Analytics Conference (TAC2011) (2011)
6
    http://dbpedia.org/About
8       Tatiana Lesnikova

 8. Monahan, S., Lehmann, J., Nyberg, T., Plymale, J., Jung, A.: Cross-Lingual Cross-
    Document Coreference with Entity Linking. In: Proc. TAC2011 (2011)
 9. Ngonga Ngomo, A.-C., Auer, S.: LIMES - A Time-Efficient Approach for Large-
    Scale Link Discovery on the Web of Data. IJCAI, pp. 2312–2317 (2011)
10. Nguyen, K., Ichise, R., Le, B.: SLINT: A Schema-Independent Linked Data Inter-
    linking System. In: Proc. of the 7th International Workshop on Ontology Matching,
    pp.1–12 (2012)
11. Volz, J., Bizez, C., Gaedke, M., and Kobilarov, G.: Discovering and maintaining
    links on the web of data. In: Proc. of ISWC’ 09, Springer-Verlag Berlin, Heidelberg,
    pp. 650–665, 2009.
12. Araújo, S., Hidders, J., Schwabe, D., Arjen, P. de Vries: SERIMI - Re-
    source Description Similarity, RDF Instance Matching and Interlinking. CoRR,
    abs/1107.1104 (2011)
13. Niu, X., Rong, S., Zhang, Y., and Wang, H.: Zhishi.links results for OAEI 2011.
    In: Proc. of ISWC’ 11 6th Workshop on Ontology Matching, pp. 220–227 (2011)
14. Meilicke, C., Trojahn, C., Sváb-Zamazal, O., Ritze, D.: Multilingual Ontology
    Matching Evaluation - a First Report on Using MultiFarm. In: Proc. of the 2d
    International Workshop on Evaluation of Semantic Technologies, pp.1–12, Herak-
    lion, Greece (2012)
15. Meilicke, C., Castro, R.G., Freitas, F., van Hage, W.R., Montiel-Ponsoda, E., de
    Azevedo, R.R., Stuckenschmidt, H., Sváb-Zamazal O., Svátek, V., Tamilin, A., Tro-
    jahn, C., Wang, S.: MultiFarm: A Benchmark for Multilingual Ontology Matching.
    Journal of Web Semantics. 15, 62–68 (2012)
16. Halpin, H., Hayes, J. P.: When owl: sameAs isn’t the Same: An Analysis of
    Identity Links on the Semantic Web. In: Proc. of the Linked Data on the
    Web Workshop (LDOW2010), Raleigh, North Carolina, USA, April 27, 2010,
    CEUR Workshop Proceedings, ISSN 1613-0073, online http://ceur-ws.org/Vol-
    628/ldow2010 paper09.pdf
17. Qu, Y., Hu, W., Cheng, G.: Constructing virtual documents for ontology matching.
    In: Proc. of the 15th International Conference of World Wide Web, pp.23–31 (2006)
18. Euzenat, J. and Shvaiko, P.: Ontology Matching. Springer-Verlag, Heidelberg
    (2007)
19. Salton, G.: The SMART Retrieval System: Experiments in Automatic Document
    Processing. Prentice Hall, Englewood Cliffs, NJ (1971)
20. Lesnikova, T.: Interlinking Cross-Lingual RDF Data Sets. In: Proc. 10th Inter-
    national Conference, ESWC 2013, Vol. 7882, pp. 671-675. LNCS, Springer Berlin
    Heidelberg (2013)
21. Rinser, D., Lange, D., Naumann, F.: Cross-lingual entity matching and infobox
    alignment in Wikipedia. Information Systems, 38 (6), pp. 887-907 (2013)
22. Wang, Z., Wang, Z., Li J., Pan, J. Z.: Knowledge extraction from Chinese wiki
    encyclopedias. Journal of Zhejiang University - Science C 13(4): 268-280 (2012)

</pre>