                    RiMOM Results for OAEI 2016

                   Yan Zhang, Hailong Jin, Liangming Pan, Juanzi Li

                       Tsinghua University, Beijing, China.
             {z-y14, jinhl, panlm14}@mails.tsinghua.edu.cn
                        ljz@keg.tsinghua.edu.cn



       Abstract. This paper presents the results of RiMOM in the Ontology Alignment
       Evaluation Initiative (OAEI) 2016. RiMOM participated in all three tracks of
       Instance Matching this year. In this paper, we first describe the overall framework
       of our system (RiMOM). Then we detail the techniques used in the framework for
       instance matching. Last, we give a thorough analysis of our results and discuss
       some future work on RiMOM.


1   Presentation of the system
With the rapid development of the Semantic Web, knowledge bases have become a dominant
mechanism to represent data semantics on the Web. In practice, data is often distributed
across heterogeneous data sources. For example, there are a large number of ontological
knowledge bases nowadays, such as DBpedia [1], YAGO [2, 3], XLore [4], etc. It is
inevitable that knowledge about the same real-world entity may be stored in different
knowledge bases. Therefore, the data integration process requires the detection of such
heterogeneous instances to ensure integrity and consistency.
    It should also be noticed that many knowledge bases are described in different
languages. For example, Wikipedia, a well-known public encyclopedia, contains 281
language versions. It is becoming the norm that the same real-world entities are
described in different languages. Thus, there is a growing need to align instances in a
cross-lingual environment so that we can share knowledge from all over the world. In
consideration of this circumstance, and based on the previous version of RiMOM [5], we
propose an extended version which provides support for cross-lingual instance matching
in a supervised or an unsupervised way.
    There are three major techniques in our system: blocking, multi-strategy matching,
and machine learning.
 1. Blocking: We index the instances in the two knowledge bases based on their
    objects, and then select instance pairs that share the same keys as candidates.
    This step limits the number of pairs to be compared, which significantly
    improves the efficiency of the system.
 2. Multi-strategy: We implement several matchers in our instance matching system.
    These matchers can be executed in parallel, and their results are aggregated
    according to the characteristics of the source ontologies.
 3. Machine learning: In general, some alignments already exist. For example,
    there are a number of cross-lingual links between two language versions of
    Wikipedia. To make full use of such data, we formalize instance matching as a
    binary classification problem and use the reference mappings to train a
    classifier, which determines whether an instance pair is equivalent or not.

    Faced with the challenges of large-scale instance matching, we propose a novel data
integration framework, RiMOM-2016 (the latest version of RiMOM), which is based on our
former ontology and instance matching system RiMOM [5, 6]. The RiMOM-2016 framework is
designed specifically for large-scale and cross-lingual instance matching tasks. It
presents a novel multi-strategy method that fits different kinds of ontologies and
employs a learning-based approach to obtain instance alignments in multilingual
environments.


1.1    State, purpose, general statement

This section describes the overall framework of RiMOM-2016. The overview of the
instance matching system is shown in Fig. 1. The system includes seven modules,
i.e., Preprocess, Predicate Alignment, Matcher Choosing, Candidate Pair Generation,
Matching Score Calculation, Instance Alignment and Validation. The sequence of the
process is shown in Fig. 1, and we illustrate it as follows.




                             Fig. 1. Framework of RiMOM 2016
 1. Preprocess: The system begins with Preprocess, which loads the ontologies and
    parameters into the system. Meanwhile, the preprocessor collects some meta data
    about the two ontologies, which will be used in the later Predicate Alignment
    and Matcher Choosing steps.
 2. Predicate Alignment: In this process, we will get the alignments of the predicates
    between the two ontologies.
 3. Matcher Choosing: The system chooses one or more of the most suitable matchers
    according to the meta data of the ontologies.
 4. Candidate Pair Generation: In this step, instance pairs become candidates when
    they have the same literal objects on some discriminative predicates.
 5. Matching Score Calculation & Instance Alignment: This procedure is the most
    striking difference from the previous version of RiMOM. In RiMOM-2016, we obtain
    alignments in a supervised or an unsupervised way, depending on whether reference
    alignments exist. In the unsupervised case, we calculate similarities between two
    instances on each property and then aggregate these similarities according to the
    degree of identifying obtained in step 1. In the supervised case, when reference
    alignments exist, we calculate the same similarities, construct a similarity
    vector for each pair, and train a logistic regression model [7]. For each
    candidate instance pair, we use this model to determine whether it is equivalent
    or not.
 6. Validation: We evaluate the alignment result in terms of Precision, Recall and
    F1-Measure if a validation data set is available.

1.2   Specific techniques used


    This year we participated in all three tracks of Instance Matching. We describe the
specific techniques in this section.
    Data Preprocessing: First, we remove stop words such as "a", "of", "the", etc.
Afterwards, we calculate the TF-IDF values of words in each knowledge base. We also
compute some statistics for each predicate in order to obtain its degree of identifying,
which will be used in similarity aggregation.
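    As an illustration, the sketch below shows this kind of preprocessing on plain-text
literal values. It is a minimal sketch only: the stop-word list and the exact TF-IDF
weighting used by RiMOM are not specified in this paper, so both are placeholders.

    import math
    from collections import Counter

    STOP_WORDS = {"a", "of", "the"}  # placeholder; the real list is larger

    def tokenize(text):
        # Lowercase, split on whitespace, and drop stop words.
        return [t for t in text.lower().split() if t not in STOP_WORDS]

    def tf_idf(documents):
        # documents: one string per instance, built from its literal values.
        tokenized = [tokenize(d) for d in documents]
        df = Counter(w for doc in tokenized for w in set(doc))  # document frequency
        n = len(tokenized)
        scores = []
        for doc in tokenized:
            tf = Counter(doc)
            scores.append({w: (tf[w] / len(doc)) * math.log(n / df[w]) for w in tf}
                          if doc else {})
        return scores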
    Predicate Alignment: Predicates can express rich semantics, and there exist
one-to-one, one-to-many, or many-to-many relationships among them. It is apparent that
we should obtain the alignments of the predicates before we calculate the similarity of
instances. In RiMOM-2016, we use an object-based method to align predicates, similar to
that of RiMOM-2015 [5].
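    The exact object-based formula is not given here, so the following is only one
plausible sketch: two predicates are aligned when the sets of object values they take
overlap strongly (measured by Jaccard similarity). The 0.5 threshold is an assumption.

    def align_predicates(source_objects, target_objects, threshold=0.5):
        # source_objects / target_objects: dict mapping each predicate to the set
        # of object values it takes in the respective ontology.
        alignments = []
        for p1, values1 in source_objects.items():
            best, best_score = None, 0.0
            for p2, values2 in target_objects.items():
                union = values1 | values2
                score = len(values1 & values2) / len(union) if union else 0.0
                if score > best_score:
                    best, best_score = p2, score
            if best is not None and best_score >= threshold:
                alignments.append((p1, best, best_score))
        return alignments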
    Blocking: This step aims to pick a relatively small set of candidate pairs from all
pairs. Due to the large scale of the knowledge bases, it is impossible to calculate
matching scores for all instance pairs. In our method, we first generate an inverted
index on the objects. Instance pairs are added to the candidate set when they have
common objects. This method may reduce recall slightly, but it also reduces the scale of
the computation significantly.
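    A minimal sketch of this blocking step, assuming each instance is represented simply
as the set of its literal object values; the key normalization (lower-casing) is an
assumption.

    from collections import defaultdict

    def build_inverted_index(instances):
        # instances: dict mapping an instance URI to the set of its object values.
        index = defaultdict(set)
        for uri, objects in instances.items():
            for obj in objects:
                index[str(obj).strip().lower()].add(uri)
        return index

    def generate_candidates(source_instances, target_instances):
        # A pair becomes a candidate as soon as the two instances share at least
        # one (normalized) object value.
        source_index = build_inverted_index(source_instances)
        target_index = build_inverted_index(target_instances)
        candidates = set()
        for key, source_uris in source_index.items():
            for s in source_uris:
                for t in target_index.get(key, ()):
                    candidates.add((s, t))
        return candidates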
    Multi-Strategy: We implement several matchers in our system, e.g., a label-based
approach and a structure-based approach. In the Preprocess step, we compare the schemas
of the two ontologies. If the ranges of the predicates are similar, the label-based
approach plays a key role in the matching process. Otherwise, if the literal properties
are not similar (e.g., the two ontologies are defined in different languages or the
intersection of values is very small), the label-based approach is not effective. In
this case, we obtain some supplementary information (e.g., machine translation, WordNet)
or use the structure-based approach (or use the structure similarity as a feature). In
addition, we use a learning-based method if we have data for training.
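    The decision between the two approaches can be sketched as follows; the overlap
measure and the 0.1 threshold are assumptions rather than the values actually used by
RiMOM.

    def choose_matcher(source_literals, target_literals, min_overlap=0.1):
        # source_literals / target_literals: sets of literal values sampled from
        # the two ontologies; min_overlap is an assumed threshold.
        union = source_literals | target_literals
        overlap = len(source_literals & target_literals) / len(union) if union else 0.0
        if overlap >= min_overlap:
            return "label-based"
        # Low literal overlap (e.g., different languages): fall back to
        # translation plus structure-based matching.
        return "structure-based"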
    Similarity Calculation & Instance Alignment: In the OAEI 2016 Instance Matching
track, some subtasks are defined in a single language, while others use multilingual
data sets (e.g., the SABINE task).
    Unsupervised method: we use an object-based method to obtain alignments, defined as
follows:
                      f_{p_n}(i_1, i_2) = Sim(O_{i_1}^{p_n}, O_{i_2}^{p_n'})                  (1)

    where i_1 and i_2 are instances from the two data sets respectively. O_{i_1}^{p_n}
represents the object value of instance i_1 on property p_n, and
Sim(O_{i_1}^{p_n}, O_{i_2}^{p_n'}) represents the similarity of the object values of
these two instances on property p_n and its corresponding property p_n'. The computation
of this similarity depends on the data type; for example, we use the Levenshtein
distance for type:text and an indicator function for type:int.
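    A sketch of the type-dependent similarity Sim in Equation 1; normalizing the edit
distance to [0, 1] is our assumption.

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def sim(o1, o2):
        # Type-dependent object similarity used in Equation (1).
        if isinstance(o1, int) and isinstance(o2, int):
            return 1.0 if o1 == o2 else 0.0               # indicator function
        s1, s2 = str(o1), str(o2)
        longest = max(len(s1), len(s2))
        if longest == 0:
            return 1.0
        return 1.0 - levenshtein(s1, s2) / longest        # normalized to [0, 1]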


      Sim(i_1, i_2) = ω_1 × f_{p_1}(i_1, i_2) + ω_2 × f_{p_2}(i_1, i_2) + ... + ω_n × f_{p_n}(i_1, i_2)   (2)

    For each property p_j, we calculate the similarity according to Equation 1 and
aggregate these similarities with weights ω_j, which indicate the importance of the
properties.
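    Equation 2 translates directly into code. In the sketch below the weights are
assumed to be given (e.g., derived from the degree of identifying computed during
preprocessing) and are normalized to sum to one, which is our assumption.

    def aggregate_similarity(pair_similarities, weights):
        # pair_similarities: dict property -> f_p(i1, i2) from Equation (1).
        # weights: dict property -> ω_p; here they are normalized so that the
        # aggregated score stays in [0, 1].
        total = sum(weights.get(p, 0.0) for p in pair_similarities)
        if total == 0.0:
            return 0.0
        return sum(weights.get(p, 0.0) * s
                   for p, s in pair_similarities.items()) / total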
    Supervised method: In Equation 2, the weights ω_j are determined by the meta data of
the ontologies or set manually. Intuitively, they could be improved by a learning-based
method if some alignments already exist. We therefore formulate instance matching as a
binary classification problem. For a pair of instances i_1 and i_2, the feature vector
is f = {f_{p_i}}_{i=1}^{n}. Thus, we can use a sigmoid function to compute the
probability that instance i_1 is equivalent to i_2.

                          P(i_1 ≡ i_2) = 1 / (1 + e^{-w·f(i_1, i_2)})                         (3)
    If i_1 ≡ i_2, then P(i_1 ≡ i_2) > 0.5; otherwise P(i_1 ≡ i_2) < 0.5. The weights w
can be determined by maximum likelihood estimation for logistic regression. The
assumption behind this model is that the machine learning method can determine which
properties are more important for the instance matching problem.
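    The paper does not state which implementation of logistic regression was used; the
sketch below uses scikit-learn as one possible choice, with the per-property similarities
as features and the reference alignments as labels.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_matcher(similarity_vectors, labels):
        # similarity_vectors: one row per candidate pair, one column per property
        # similarity f_p(i1, i2); labels: 1 if the pair is in the reference
        # alignment, 0 otherwise.
        model = LogisticRegression()
        model.fit(np.asarray(similarity_vectors), np.asarray(labels))
        return model

    def is_equivalent(model, similarity_vector, threshold=0.5):
        # The probability of the positive class corresponds to P(i1 ≡ i2).
        prob = model.predict_proba(np.asarray([similarity_vector]))[0, 1]
        return prob > threshold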


1.3    Link to the system and parameters file

The RiMOM system and configuration files (2016 version) can be found at https://
drive.google.com/file/d/0BzqVVt4Q8YUuaHpseWJOZkI4MnM/view?
usp=sharing.
2     Results

The Instance Matching track contains three tracks and seven subtasks. RiMOM-2016
participated in all of them, and we present the results and related analysis in this
section.


2.1    SABINE Track

There are two subtasks in this track: Inter-linguistic mapping and Data linking. Table 1
shows the results for the Inter-linguistic mapping task and Table 2 shows the results
for the Data linking task. Inter-linguistic mapping is a cross-lingual task between
English and Italian; as the results show, RiMOM performs well in this task. The Data
linking task requires participants to link entities to DBpedia, and RiMOM achieves high
Recall but low Precision in this task.


Table 1. The result for Inter-linguistic mapping

       Tool        Precision  Recall  F-measure
   LogMapIm          0.012    0.016     0.014
   AML               0.919    0.916     0.917
   LogMapLite        0.358    0.153     0.214
   RiMOM             0.955    0.932     0.943

Table 2. The result for Data linking

       Tool        Precision  Recall  F-measure
   LogMapIm          NaN      0.000     NaN
   AML               0.926    0.855     0.889
   LogMapLite        NaN      0.000     NaN
   RiMOM             0.424    0.917     0.580




2.2    SYNTHETIC Track

There are two subtasks in this track: UOBM and SPIMBENCH. Each subtask contains two data
sets of different sizes: sandbox is a small data set, while mainbox is a large one.
Tables 3, 4, 5 and 6 show the final results in this track. We think RiMOM produces
satisfactory results in all of the subtasks.


Table 3. The result for UOBM sandbox

       Tool        Precision  Recall  F-measure
   LogMapIm          0.701    0.207     0.320
   AML               0.785    0.577     0.665
   RiMOM             0.771    0.877     0.821

Table 4. The result for UOBM mainbox

       Tool        Precision  Recall  F-measure
   LogMapIm          0.625    0.023     0.044
   AML               0.509    0.515     0.512
   RiMOM             0.443    0.516     0.477




2.3    DOREMUS Track

This track contains three subtasks: 9-heterogeneities, 4-heterogeneities, and False
Positive Trap. Table 7 shows the final results in this track.
Table 5. The result for SPIMBENCH sandbox

       Tool        Precision  Recall  F-measure
   LogMapIm          0.958    0.766     0.851
   AML               0.907    0.749     0.820
   RiMOM             0.984    1.000     0.992

Table 6. The result for SPIMBENCH mainbox

       Tool        Precision  Recall  F-measure
   LogMapIm          0.981    0.695     0.814
   AML               0.900    0.747     0.816
   RiMOM             0.991    1.000     0.995


Table 7. The result for DOREMUS Track

       Sub-task           Precision  Recall  F-measure
   9-heterogeneities        0.813    0.813     0.813
   4-heterogeneities        0.746    0.746     0.746
   False Positive Trap      0.707    0.707     0.707



2.4   Discussions on the way to improve the proposed system

Our system can only align two ontologies at a time, and we think it would be a
significant improvement to develop a system that is able to align several ontologies
simultaneously. In addition, in cross-lingual environments our system still relies on
machine translation; we hope to develop a language-independent method.


3     Conclusion and future work
In this paper, we present the RiMOM system for the OAEI 2016 campaign. We participated
in all three tracks of Instance Matching this year and described the specific techniques
we used. In our project, we designed a new framework to align instances in different
languages, and the results show that our method is effective.
    In the future, we will continue to improve our system.


4     Acknowledgement
The work is supported by the 973 Program (No. 2014CB340504), NSFC-ANR
(No. 61261130588), the NSFC key project (No. 61533018), the Tsinghua University
Initiative Scientific Research Program (No. 20131089256) and the THU-NUS NExT Co-Lab.
References
1. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S.:
   DBpedia - A crystallization point for the Web of Data. J. Web Sem. 7(3) (2009) 154–165
2. Hoffart, J., Suchanek, F.M., Berberich, K., Weikum, G.: YAGO2: A spatially and temporally
   enhanced knowledge base from Wikipedia. Artif. Intell. 194 (2013) 28–61
3. Mahdisoltani, F., Biega, J., Suchanek, F.: YAGO3: A knowledge base from multilingual
   Wikipedias. In: 7th Biennial Conference on Innovative Data Systems Research (CIDR) (2014)
4. Wang, Z., Li, J., Wang, Z., Li, S., Li, M., Zhang, D., Shi, Y., Liu, Y., Zhang, P., Tang, J.:
   XLore: A large-scale English-Chinese bilingual knowledge graph. In: Proceedings of the ISWC
   2013 Posters & Demonstrations Track, Sydney, Australia, October 23, 2013. (2013) 121–124
5. Zhang, Y., Li, J.: RiMOM results for OAEI 2015. In: Ontology Matching (2015) 185
6. Li, J., Tang, J., Li, Y., Luo, Q.: RiMOM: A dynamic multistrategy ontology alignment frame-
   work. IEEE Trans. Knowl. Data Eng. 21(8) (2009) 1218–1232
7. Hosmer, D.W., Lemeshow, S.: Introduction to the logistic regression model. Applied Logistic
   Regression, Second Edition (2000) 1–30