    Ontology Matching Evaluation: A Statistical Perspective

                       Majid Mohammadi1 , Wout Hofman2 , Yao-hua Tan1
       1
           Faculty of Technology, Policy and Management, Delft University of Technology, Netherlands
       2
           Department of Technical Science, The Netherlands Institute of Applied Technology (TNO),
                                          Soesterberg, the Netherlands




           Abstract. This paper proposes statistical approaches to test whether the difference between
           two ontology matchers is real. Specifically, the performances of the matchers over multiple
           data sets are obtained and, based on these performances, a conclusion is drawn as to whether
           one method is better than the other. To this end, the paired t-test and the Wilcoxon signed
           rank test are proposed, and comparisons over six recently proposed methods are reported.



       Keywords: Ontology alignment, evaluation, statistical inference, paired t-test, Wilcoxon signed rank
test



1      Introduction

There has been an increasing interest in ontology matching (or alignment) over recent years.
As data come from various sources these days, heterogeneity among data is inevitable. The
solution to this issue is ontology matching, which has a wide range of applications, from data
integration and agent interoperability in computer science to matching ontologies in biomedicine
and geoscience. As a result, a plethora of methods have been proposed, each claiming to be
better than, or competitive with, other state-of-the-art algorithms. However, no evidence has
been brought to support such claims.


2      Binary comparison of matchers

Hypothesis testing is one of the major topics in the realm of statistical inference. Here,
we aim at utilizing this technique to indicate whether the average difference in the performance
scores of two matchers over multiple benchmarks is meaningful. To leverage hypothesis testing,
a null hypothesis is required. The null hypothesis (denoted by H0) states that there is no
significant difference between two populations according to the available samples of the
populations. The alternative hypothesis (denoted by Ha), on the other hand, is the rival
hypothesis and states that there is a meaningful difference between the two populations based
on the available samples. Thus, it is desirable to reject the null hypothesis and accept the
alternative hypothesis. In the ontology matching case, the performances of various matchers
over a range of data sets are available, and we would like to test whether the average
difference in their performances is merely due to chance. In other words, the null and
alternative hypotheses in this case are

                                                H0 : P̂ 1 = P̂ 2

                                                Ha : P̂ 1 ≠ P̂ 2                                         (1)

where P̂ i denotes the average performance of matcher i.
Before running any statistical test, the significance level α must be determined; α is the
probability of rejecting the null hypothesis when it is in fact true. To the best of our
knowledge, no statistical techniques have been employed to test the above-mentioned hypothesis.
First, the widely used paired t-test is presented in more detail. Since the t-test has strict
preconditions that must be satisfied, it must be warned that it might be inappropriate and
statistically unsafe. Thus, the Wilcoxon signed rank test is also presented, which is able to
detect differences even when the number of samples is not large.


2.1    Paired t-test

A common way to check whether the difference between two matchers on different data sets is
not random is to compute the paired t-test. Let dᵢ = Pᵢ¹ − Pᵢ² be the difference between the
performances of the two matchers over the i-th data set. The t statistic is computed as

                                         t = d̄ / (σ̂d / √N),

where d̄ and σ̂d are the sample average and standard deviation of the differences, respectively,
and N is the number of data sets. This statistic is distributed according to the Student
distribution with N − 1 degrees of freedom. After obtaining the probability of observing the
data given that H0 is true (the p-value) according to the Student distribution, H0 can be
rejected if p-value < α, and Ha is then accepted.
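As an illustration, the following Python sketch computes the paired t-test for two matchers
whose per-data-set F-measures are purely hypothetical; it assumes NumPy and SciPy are available,
and the significance level α = 0.05 is chosen only for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical F-measures of two matchers over the same six data sets.
matcher_1 = np.array([0.81, 0.74, 0.69, 0.90, 0.77, 0.85])
matcher_2 = np.array([0.77, 0.71, 0.71, 0.84, 0.76, 0.80])

alpha = 0.05                        # significance level
d = matcher_1 - matcher_2           # per-data-set differences d_i
n = len(d)

# Manual computation: t = mean(d) / (std(d) / sqrt(N)), with N - 1 degrees of freedom.
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))

# Library computation for cross-checking.
t_stat, p_value = stats.ttest_rel(matcher_1, matcher_2)

print(f"t (manual) = {t_manual:.3f}, t (scipy) = {t_stat:.3f}, p-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the average performances differ significantly.")
else:
    print("Cannot reject H0: the observed difference may be due to chance.")
```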


2.2    Wilcoxon Signed Rank test

The non-parametric alternative to the paired t-test is the Wilcoxon signed rank test. This
method ranks the absolute values of the performance differences of two matchers and then
compares the ranks of the positive and negative differences. After computing the difference
between the two matchers over the i-th data set, dᵢ, the differences are ranked based on the
values of |dᵢ|, disregarding their signs. If dᵢ = 0, the data set is ignored, and average ranks
are assigned in case of ties. Let W⁺ = Σ_{dᵢ>0} rank(dᵢ), W⁻ = Σ_{dᵢ<0} rank(dᵢ), and
T = min(W⁺, W⁻). Then

                            z = (T − N(N+1)/4) / √(N(N+1)(2N+1)/24)

is distributed according to the normal distribution.
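A corresponding sketch for the Wilcoxon signed rank test, on the same hypothetical scores, is
given below; with the default settings, scipy.stats.wilcoxon drops zero differences and, for the
two-sided test, reports T = min(W⁺, W⁻) together with the p-value.

```python
import numpy as np
from scipy import stats

# Same hypothetical per-data-set F-measures as in the paired t-test sketch.
matcher_1 = np.array([0.81, 0.74, 0.69, 0.90, 0.77, 0.85])
matcher_2 = np.array([0.77, 0.71, 0.71, 0.84, 0.76, 0.80])

alpha = 0.05

# The library ranks |d_i|, drops zero differences, and returns T = min(W+, W-).
T, p_value = stats.wilcoxon(matcher_1, matcher_2)

print(f"T = {T:.1f}, p-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Cannot reject H0")
```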


3     Experimental Results

Table 1 tabulates the p-values obtained by the paired t-test and the Wilcoxon signed rank test
over six recently proposed methods.

Table 1. The p-values obtained by the paired t-test (above the diagonal) and the Wilcoxon signed rank test
(below the diagonal) over six recently proposed methods: XMAP, AML, AML2014, CroMatcher, edna and refalign.

                         XMAP        AML         AML2014     CroMatcher   edna        refalign
             XMAP        –           0.526403    0.23326767  0.00094182   0.000972    0.000939
             AML         0.640625    –           0.05359674  0.00079181   0.113909    0.000697
             AML2014     0.00647436  0.01596065  –           0.00026227   0.243871    0.000243
             CroMatcher  0.00097656  6.10E-05    8.56E-05    –            2.83E-06    0.01664
             edna        0.000822    0.011231    0.058088    0.000287     –           4.75E-06
             refalign    0.000977    6.10E-05    8.50E-05    0.003906     0.000285    –
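For completeness, the sketch below shows how such a pairwise p-value matrix could be assembled,
with paired t-test p-values above the diagonal and Wilcoxon signed rank p-values below it; the
matcher names and scores are hypothetical placeholders, not the data behind Table 1.

```python
import numpy as np
from scipy import stats

# Hypothetical per-data-set F-measures for three matchers (same data set order).
scores = {
    "matcher_A": np.array([0.86, 0.79, 0.91, 0.74, 0.88, 0.69]),
    "matcher_B": np.array([0.84, 0.76, 0.90, 0.70, 0.83, 0.63]),
    "matcher_C": np.array([0.76, 0.67, 0.84, 0.59, 0.79, 0.56]),
}
names = list(scores)
k = len(names)
pvals = np.full((k, k), np.nan)  # diagonal stays NaN (no self-comparison)

for i in range(k):
    for j in range(k):
        if i < j:    # above the diagonal: paired t-test
            pvals[i, j] = stats.ttest_rel(scores[names[i]], scores[names[j]]).pvalue
        elif i > j:  # below the diagonal: Wilcoxon signed rank test
            pvals[i, j] = stats.wilcoxon(scores[names[i]], scores[names[j]]).pvalue

print(names)
print(np.round(pvals, 4))
```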