GMap Results for OAEI 2021*

Weizhuo Li1,4, Shiqi Zhou2, Qiu Ji1(B), and Bingjie Lu3

1 School of Modern Posts, Nanjing University of Posts and Telecommunications, China
2 School of Computer Science and Engineering, Southeast University, Nanjing, China
3 Zhejiang Lab, Hangzhou, China
4 State Key Laboratory for Novel Software Technology, Nanjing University, China
liweizhuo@amss.ac.cn, qiuji@njupt.edu.cn

Abstract. GMap is an alternative probabilistic scheme for ontology matching, which combines the sum-product network and the noisy-or model. More precisely, we employ the sum-product network to encode the similarities based on the set of individuals and disjointness axioms. The noisy-or model is utilized to encode the probabilistic matching rules, which describe the influences among entity pairs across ontologies. This paper briefly introduces GMap and its results on two tracks (i.e., Conference and Anatomy) of OAEI 2021.

1 Presentation of the system

1.1 State, purpose, general statement

State-of-the-art approaches have utilized probabilistic graphical models [1] for ontology matching, such as OMEN [2], iMatch [3], CODI [4] and MORW [5]. However, few of them can guarantee tractable inference while ensuring no loss in inference accuracy. Therefore, these matching systems had to adopt approximate inference, so the quality of their alignments was affected to some extent. In this paper, we propose an alternative probabilistic scheme, called GMap, which combines the sum-product network (SPN) and the noisy-or model [6]. Beyond tractable inference, these two graphical models have some inherent advantages for ontology matching. For the SPN, even if knowledge (e.g., individuals or disjointness axioms) is missing, it can still calculate their contributions by maximum a posteriori (MAP) inference. For the noisy-or model, it is natural to incorporate probabilistic matching rules that describe the influences among entity pairs.
Figure 1 shows the matching framework of GMap. Given two ontologies O1 and O2, we first calculate the lexical similarity based on edit distance, external lexicons and TF-IDF [7] with the max strategy. Then, we employ the SPN to encode the similarities based on individuals and disjointness axioms and calculate their contribution through maximum a posteriori (MAP) inference [1]. After that, we utilize the noisy-or model to encode the probabilistic matching rules and the value calculated by the SPN. With the one-to-one constraint and the crisscross strategy in the refine module, GMap obtains initial alignments. The whole matching procedure of GMap is iterative: if no additional mapping is identified, the matching process terminates.

Fig. 1: Matching process in GMap (compute lexical similarity → use the SPN to encode individuals and disjointness axioms → use the noisy-or model to encode probabilistic matching rules → refine matches; iterate while additional matches are identified)

* Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). This work was partially sponsored by the Natural Science Foundation of China grants (No. 62006125, 61602259), the NUPTSF grants (No. NY220171, NY220176), the Foundation of Jiangsu Provincial Double-Innovation Doctor Program (JSSCBS20210532), and the State Key Laboratory for Novel Software Technology at Nanjing University.

1.2 Specific techniques used

The similarities based on individuals and disjointness axioms. Under the open-world assumption, individuals or disjointness axioms are missing at times. Therefore, we define a special assignment, "Unknown", for the similarities based on the set of individuals and disjointness axioms. For individuals, we employ string equivalence to judge their equality.
When we calculate the similarity of concepts based on individuals across ontologies, we regard the individuals of each concept as a set and use the Ochiai coefficient5 to measure the value. We use a boundary t to divide the value into three assignments (i.e., 1, 0 and Unknown). Assignment 1 (or 0) means that the pair matches (or mismatches). If the value lies strictly between 0 and t, or the individuals of one concept are missing, the assignment is set to Unknown.

For disjointness axioms, we utilize these axioms together with the subsumption relations within the ontologies and define rules to determine the assignments of the similarity. For example, let x1 and y1 be concepts from O1 and x2 a concept from O2. If x1 matches x2 and x1 is disjoint with y1, then y1 is disjoint with x2 as well as with their descendants. This similarity also has three assignments. Assignment 1 (or 0) means the pair mismatches (or overlaps). If none of the rules is satisfied, the assignment is Unknown.

Using SPN to encode the similarities based on individuals and disjointness axioms. A sum-product network is a directed acyclic graph with weighted edges, where variables are leaves and internal nodes are sums and products [8]. As shown in Figure 2, we designed a sum-product network, denoted by S, to encode the above similarities and calculate the contributions. All the leaves in S, called indicators, are binary-valued. M represents the contribution of individuals and disjointness axioms, and the indicators M1, M2, M3 comprise all of its assignments. M1 = 1 (or M2 = 1) means that the contribution is positive (or negative). If M3 = 1, the contribution is Unknown. Similarly, the indicators D0, D1, I1, I2, I3 correspond to the assignments of the similarities based on the set of individuals and disjointness axioms. The concrete assignment metrics are listed in Tables 1–2, and the assignment metric of M is similar to the metric of similarity D.
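The individual-based similarity and its indicator encoding can be sketched as follows. This is an illustrative sketch: the boundary t = 0.8 is not GMap's tuned value, and the function and constant names are our own.

```python
import math

def individual_similarity(inds1, inds2, t=0.8):
    """Assignment of similarity I from two concepts' individual sets.
    Returns 1 (match), 0 (mismatch) or "Unknown"; t is illustrative."""
    if not inds1 or not inds2:              # open world: individuals missing
        return "Unknown"
    a, b = set(inds1), set(inds2)           # string equivalence on names
    ochiai = len(a & b) / math.sqrt(len(a) * len(b))
    if ochiai >= t:
        return 1                            # the pair matches
    if ochiai == 0:
        return 0                            # the pair mismatches
    return "Unknown"                        # value strictly between 0 and t

# Indicator encoding (I1, I2, I3) of each assignment, as in Table 2.
I_INDICATORS = {1: (1, 0, 0), 0: (0, 1, 0), "Unknown": (0, 0, 1)}
```

For instance, `individual_similarity(["a", "b", "c"], ["a"])` yields `"Unknown"`, since the Ochiai value 1/√3 ≈ 0.577 lies strictly between 0 and t.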
5 https://en.wikipedia.org/wiki/Cosine_similarity

Table 1: Metric for Similarity D
Assignments    Indicators
D = 1          D0 = 0, D1 = 1
D = 0          D0 = 1, D1 = 0
D = Unknown    D0 = 1, D1 = 1

Table 2: Metric for Similarity I
Assignments    Indicators
I = 1          I1 = 1, I2 = 0, I3 = 0
I = 0          I1 = 0, I2 = 1, I3 = 0
I = Unknown    I1 = 0, I2 = 0, I3 = 1

Fig. 2: The designed SPN S: when D0 = 1, D1 = 0, the distribution of M depends on the distribution of I; when D0 = 0, D1 = 1, the distributions of M and I are independent.

With MAP inference in the SPN [8], we can obtain the indicators' values for the contribution M. Precisely, MAP inference in an SPN consists of three steps. First, replace the sum nodes with max nodes. Second, in a bottom-up pass, each max node computes its maximum weighted value. Finally, a downward pass starts from the root node and recursively selects the highest-valued child of each max node, which yields the indicators' values of M. Moreover, even if the set of individuals or the disjointness axioms are missing, we can still calculate the contribution M by MAP inference. Assume I = 1 and D = Unknown for one pair. With the defined similarities and assignment metrics of the SPN, we obtain I1 = 1, I2 = 0, I3 = 0, D0 = 1, D1 = 1. As the contribution M is not given, we set M1 = 1, M2 = 1, M3 = 1. After MAP inference, we observe M1 = 1, which means that the contribution is positive. Besides, it also infers D0 = 1, which means that the individuals of the matching pair overlap.

As the network S is complete and decomposable, inference in S can be computed in time linear in the number of edges [9]. Therefore, MAP inference is tractable.

Combining the lexical similarity and the contribution calculated by SPN. Considering the range of the lexical similarity, we introduce a scaling factor α to limit its contribution.
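The three-step MAP procedure (sum → max replacement, bottom-up evaluation, top-down selection) can be sketched on a toy max-product network. The structure, weights and variable names below are illustrative and much simpler than the SPN S of Figure 2:

```python
import math

# A node is ("leaf", name), ("prod", [children]) or ("max", [(weight, child), ...]).
# "max" nodes play the role of the SPN's sum nodes after step 1 (sum -> max).

def upward(node, evidence):
    """Step 2: bottom-up pass. `evidence` maps an indicator name to 0 or 1;
    indicators of unobserved variables are left unset and default to 1."""
    kind = node[0]
    if kind == "leaf":
        return evidence.get(node[1], 1)
    if kind == "prod":
        return math.prod(upward(child, evidence) for child in node[1])
    return max(w * upward(child, evidence) for w, child in node[1])

def downward(node, evidence, selected):
    """Step 3: top-down pass. Follow the highest-valued child of each max
    node and record which indicator leaves the MAP state turns on."""
    kind = node[0]
    if kind == "leaf":
        if evidence.get(node[1], 1) == 1:
            selected.add(node[1])
    elif kind == "prod":
        for child in node[1]:
            downward(child, evidence, selected)
    else:  # max node: recurse only into the argmax branch
        _, best = max(node[1], key=lambda wc: wc[0] * upward(wc[1], evidence))
        downward(best, evidence, selected)

# Toy network over two binary variables X and Y (indicators X1/X0, Y1/Y0).
spn = ("max", [
    (0.6, ("prod", [("leaf", "X1"), ("leaf", "Y1")])),
    (0.4, ("prod", [("leaf", "X0"), ("leaf", "Y0")])),
])

# Evidence Y = 0; X is unobserved, so both of its indicators stay at 1
# and the downward pass decides its MAP value (here X = 0).
evidence = {"Y1": 0, "Y0": 1}
selected = set()
downward(spn, evidence, selected)
```

Here the unobserved variable X plays the role of the missing contribution M: leaving both of its indicators at 1 lets the downward pass decide its MAP value, just as GMap sets M1 = M2 = M3 = 1 when M is not given.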
This scaling factor also helps us analyze the sources of the different contributions. The SPN-based similarity, denoted by S0, is defined in Eq. 1 and calculated according to the indicators' values of M and D:

    S0(x1, x2) =
        0                           if M2 = 1, D1 = 1
        α · lexSim(x1, x2) + λ      if M1 = 1, D0 = 1       (1)
        α · lexSim(x1, x2) − λ      if M2 = 1, D0 = 1
        α · lexSim(x1, x2)          if M3 = 1, D0 = 1

where λ is a contribution factor that represents the contribution based on the disjointness axioms and the set of individuals. If the contribution is positive (or negative) and the pair overlaps, the SPN-based similarity equals the scaled lexical similarity plus (or minus) λ. If the contribution is Unknown and the pair overlaps, the SPN-based similarity equals the scaled lexical similarity. If the pair mismatches, the inferred contribution is negative and the SPN-based similarity is set to 0.0.

Using the noisy-or model to encode probabilistic matching rules. As listed in Table 3, we utilize probabilistic matching rules to describe the influences among related pairs across ontologies.

Table 3: The probabilistic matching rules between entity pairs
ID  Category  Probabilistic matching rule
R1  Class     Two classes probably match if their fathers match
R2  Class     Two classes probably match if their children match
R3  Class     Two classes probably match if their siblings match
R4  Class     Two classes asserted in domains probably match if the related object properties match and the ranges of these properties match
R5  Class     Two classes asserted in ranges probably match if the related object properties match and the domains of these properties match
R6  Class     Two classes asserted in domains probably match if the related data properties match and the types of these properties are the same

Considering the matching probability of one pair, we observe that the condition of each rule has two values (i.e., T or F) and that all the matching rules are conditionally independent given the value of this pair.
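Eq. 1 and the noisy-or combination of the rules (introduced in Section 1.1) can be sketched together. The values of α and λ and the per-rule probabilities below are illustrative, and the M/D cases are passed as named strings rather than indicator tuples:

```python
def spn_based_similarity(lex_sim, M, D, alpha=0.8, lam=0.15):
    """Eq. 1: combine the scaled lexical similarity with the SPN contribution.
    M in {"positive", "negative", "unknown"}; D in {"overlap", "mismatch"}."""
    if D == "mismatch":                 # M2 = 1, D1 = 1
        return 0.0
    scaled = alpha * lex_sim
    if M == "positive":                 # M1 = 1, D0 = 1
        return scaled + lam
    if M == "negative":                 # M2 = 1, D0 = 1
        return scaled - lam
    return scaled                       # M3 = 1, D0 = 1

def noisy_or(s0, rule_probs):
    """Noisy-or gate: s0 is the leak probability (the SPN-based similarity);
    rule_probs holds P(Si = 1 | Ri) per rule, i.e. 0 when Ri = F."""
    p_no_match = 1.0 - s0
    for p in rule_probs:
        p_no_match *= 1.0 - p
    return 1.0 - p_no_match
```

For example, with lexSim = 0.9, a positive contribution and overlapping individuals, S0 = 0.8 · 0.9 + 0.15 = 0.87; feeding any fired rule into `noisy_or` can only raise the matching probability, matching the observation that all rules are conducive to matching.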
Moreover, all the matching rules are conducive to improving the matching probability of this pair. Therefore, we utilize the noisy-or model [1] to encode them.

Fig. 3: The network structure of the noisy-or model designed in GMap, with parents S0, S1, ..., S6 feeding an OR gate, where

    P(Si = 1 | Ri) = 0 if Ri = F, and f(ci) if Ri = T
    P(S = 0 | S0, R1, ..., R6) = (1 − S0) · ∏_{i=1}^{6} (1 − P(Si = 1 | Ri))
    P(S = 1 | S0, R1, ..., R6) = 1 − P(S = 0 | S0, R1, ..., R6)

Figure 3 shows the designed noisy-or model applied to concept pairs; the extension to property pairs is straightforward. Ri corresponds to the ith rule and Si is the conditional probability depending on the condition of Ri. S0 represents the SPN-based similarity, which serves as the leak probability [1]. We can easily calculate the matching probability of each pair, P(S = 1 | S0, R1, ..., R6), according to the formulas in Figure 3, where ci is the count of satisfactions of Ri and the sigmoid function f(ci) is used to limit the upper bound of the contribution of each Ri. As inference in the noisy-or model can be computed in time linear in the number of nodes [1], GMap keeps inference tractable throughout the whole matching process.

1.3 Adaptations made for the evaluation

There are two kinds of parameters that need to be set. One kind comes from the networks and is set manually based on some considerations [10]. The others, such as the scaling factor (α) and the contribution factor (λ) in Eq. 1 and the threshold (θ), are tuned on the I3CON data set6. Nevertheless, we did not make any specific adaptation for the OAEI 2021 evaluation campaign, and all parameters are the same for all the tracks.

1.4 Link to the system and parameters file

The latest version of GMap, a Maven project implemented on the MELT platform [11], can be downloaded from Google Drive7. In addition, GMap is also an open-source project at https://github.com/liweizhuo001/GMap1.1.

2 Results

In this section, we present the results GMap achieved on OAEI 2021.
Our system mainly focuses on the Conference and Anatomy tracks.

6 http://www.atl.external.lmco.com/projects/ontology/i3con.html
7 https://drive.google.com/file/d/1kq8ntRVQclFF-TZOb6AesqJZ7e0P-h7F/view?usp=sharing

2.1 Conference

The Conference track contains sixteen ontologies from the conference organization domain. According to the crisp evaluation based on the main (blind) reference alignment of the Conference track, the results of GMap and other top-ranked matching systems are listed in Table 4.

Table 4: Results of GMap and other top-ranked matching systems in the Conference track
Matcher     Precision  Recall  F0.5-Measure  F1-Measure  F2-Measure
AML         0.78       0.62    0.74          0.69        0.65
LogMap      0.76       0.56    0.71          0.64        0.59
GMap        0.61       0.61    0.61          0.61        0.61
ATMatcher   0.69       0.51    0.64          0.59        0.54
Wiktionary  0.66       0.53    0.63          0.59        0.55

Overall, GMap ranked 3rd among the 14 participants in terms of F1-Measure; it outperforms the others in recall except for AML, but its precision is lower than theirs. There are mainly two reasons. One is the lexical similarity, which combines the similarities based on edit distance, external lexicons and TF-IDF with the max strategy. The other is the noisy-or model, which can hardly describe negative effects on pair matching [1]. Both of them retain some false positive matches after matching finishes. Especially for property pairs, even when their domains and ranges mismatch, GMap cannot describe this negative impact. Therefore, employing alignment debugging techniques [12–18] is a comparatively ideal solution to this issue.

2.2 Anatomy

The goal of the Anatomy track is to find an alignment between the Adult Mouse Anatomy (2744 classes) and a part of the NCI Thesaurus (3304 classes). The results of GMap and other top-ranked matching systems are listed in Table 5.
Table 5: Results of GMap and other top-ranked matching systems in the Anatomy track
Matcher     Runtime (s)  Size  Precision  F1-Measure  Recall  Recall+  Coherent
AML         32           1471  0.956      0.941       0.927   0.81     +
LogMapBio   1043         1586  0.874      0.894       0.914   0.773    +
LogMap      7            1402  0.917      0.881       0.848   0.602    +
TOM         15068        1313  0.933      0.866       0.808   0.525    −
GMap        2362         1344  0.916      0.861       0.812   0.534    −
Fine-TOM    2647         1315  0.916      0.851       0.794   0.490    −
Wiktionary  493          1194  0.956      0.843       0.753   0.347    −

As a result, GMap ranked 5th among the 13 participants in terms of F1-Measure. We believe that the lexical-based module and the simplified combination strategy may be the main bottlenecks of GMap. Benefiting from thesauruses (e.g., UMLS) and optimized combination strategies, most top-ranked systems obtain better performance on ontology matching tasks. As mentioned above, most systems (e.g., AML, LogMap) employ mapping validation techniques, which help to further improve the quality of the alignment. These techniques are not employed in the current version, and we leave this issue for future work.

3 General comments

3.1 Comments on the results

GMap [6] achieved qualified results in its second participation in OAEI and is competitive with several top-ranked systems in the main OAEI tracks. Benefiting from the SPN and the noisy-or model, the quality of the alignment can be further improved compared with the original lexical similarity. However, some weaknesses remain. For example, the alignment incoherence of GMap is still unsolved, which influences its performance. In addition, it is important for us to consider the efficiency of GMap, such as running time and memory usage, tailored for large-scale ontologies.

3.2 Discussions on the way to improve the proposed system

GMap still has a lot of room for improvement. Employing mapping validation techniques would help to resolve the alignment incoherence and reduce false positive matches in the final alignments.
In addition, seeking available data sets to learn the parameters of the sum-product network and the noisy-or model is also a direction of our future work.

4 Conclusion

In this paper, we have presented GMap and its results on two tracks (i.e., Conference and Anatomy) of OAEI 2021. The results indicate that GMap is competitive with the top-ranked systems by combining some special graphical models (i.e., the SPN and the noisy-or model). On the other hand, we discuss possible solutions for the disadvantages exposed. In the future, we would like to participate in more tracks (e.g., Multifarm, Complex, Interactive matching) and hope to address the efficiency of large-scale ontology matching tasks.

References

1. Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT Press, 2009.
2. Prasenjit Mitra, Natasha F. Noy, and Anuj Rattan Jaiswal. OMEN: A Probabilistic Ontology Mapping Tool. In Proceedings of the 4th International Semantic Web Conference, pages 537–547. Springer, 2005.
3. Sivan Albagli, Rachel Ben-Eliyahu-Zohary, and Solomon E. Shimony. Markov network based ontology matching. Journal of Computer and System Sciences, 78(1):105–118, 2012.
4. Mathias Niepert, Christian Meilicke, and Heiner Stuckenschmidt. A Probabilistic-Logical Framework for Ontology Matching. In Proceedings of the 24th AAAI Conference on Artificial Intelligence. AAAI Press, 2010.
5. Chuncheng Xiang, Baobao Chang, and Zhifang Sui. An Ontology Matching Approach Based on Affinity-Preserving Random Walks. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, pages 1471–1478. AAAI Press, 2015.
6. Weizhuo Li. Combining sum-product network and noisy-or model for ontology matching. In Proceedings of the 10th International Workshop on Ontology Matching, pages 35–39. CEUR-WS.org, 2015.
7. Jérôme Euzenat and Pavel Shvaiko. Ontology Matching. Springer Science & Business Media, 2013.
8.
Hoifung Poon and Pedro Domingos. Sum-Product Networks: A New Deep Architecture. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, pages 337–346. AUAI Press, 2011.
9. Robert Gens and Pedro Domingos. Learning the Structure of Sum-Product Networks. In Proceedings of the 30th International Conference on Machine Learning, pages 873–880. JMLR.org, 2013.
10. Li Ding and Tim Finin. Characterizing the semantic web on the web. In Proceedings of the 5th International Semantic Web Conference, pages 242–257. Springer, 2006.
11. Sven Hertling, Jan Portisch, and Heiko Paulheim. MELT - Matching EvaLuation Toolkit. In Semantic Systems. The Power of AI and Knowledge Graphs - 15th International Conference, SEMANTiCS 2019, pages 231–245, 2019.
12. Christian Meilicke. Alignment incoherence in ontology matching. PhD thesis, Universitätsbibliothek Mannheim, 2011.
13. Ernesto Jiménez-Ruiz and Bernardo Cuenca Grau. LogMap: Logic-Based and Scalable Ontology Matching. In Proceedings of the 10th International Semantic Web Conference, pages 273–288. Springer, 2011.
14. Emanuel Santos, Daniel Faria, Catia Pesquita, and Francisco M. Couto. Ontology alignment repair through modularization and confidence-based heuristics. PLoS ONE, 10(12):1–19, 2015.
15. Weizhuo Li, Songmao Zhang, and Guilin Qi. A graph-based approach for resolving incoherent ontology mappings. Journal of Web Intelligence, 16(1):15–35, 2018.
16. Silvana Castano, Alfio Ferrara, Davide Lorusso, Tobias Henrik Näth, and Ralf Möller. Mapping validation by probabilistic reasoning. In Proceedings of the 5th European Semantic Web Conference, pages 170–184. Springer, 2008.
17. Jan Noessner and Mathias Niepert. ELOG: A probabilistic reasoner for OWL EL. In Proceedings of the 5th International Conference on Web Reasoning and Rule Systems, pages 281–286. Springer, 2011.
18. Weizhuo Li and Songmao Zhang. Repairing mappings across biomedical ontologies by probabilistic reasoning and belief revision.
Knowledge-Based Systems, 209:106436, 2020.