FTRLIM Results for OAEI 2020 ? ??

                   Xiaowen Wang1 , Yizhi Jiang1 , Hongfei Fan1 ,
                       Hongming Zhu1? ? ? , and Qin Liu1

         School of Software Engineering, Tongji University, Shanghai, China
       {1931533,1931566,fanhongfei,zhu hongming,qin.liu}@tongji.edu.cn


        Abstract. FTRLIM is a distributed framework that is designed for
        large-scale instance matching. The FTRLIM framework leverages the
        blocking algorithm to generate candidate instance pairs, and applies the
        follow-the-regularized-leader model to determine whether candidate in-
        stance pairs are matched. FTRLIM participated in the SPIMBENCH
        Track of OAEI 2020, and achieved the fastest matching efficiency both
        in SANDBOX and MAINBOX, as well as the competitive matching qual-
        ity.


 1     Presentation of the system
 1.1   State, purpose, general statement
 The instance-based matching has gradually become a promising topic recently[1].
 Many methods have been proposed to complete the instance matching task. Sev-
 eral state-of-the-art instance matching methods evolve from ontology matching
 methods such as LogMap[2], AML[3], RiMOM-IM[4], and Lily[5]. As the scale
 of the data increases, the efficiency and cost requirements of instance matching
 methods become more stringent.
     FTRLIM is a distributed instance matching framework that focus more on
 the matching efficiency. When matching instances, it first generates indexes for
 instances based on their attributes. Instances with the same index are divided
 into the same instance block, and instances from different sources under the same
 block form the candidate instance pairs. Then FTRLIM figures out the matched
 instance pairs leveraging the online-learning model, follow-the-regularized-leader
 (FTRL). This is the second time that FTRLIM has participated in the OAEI
 evaluation. To participate in the SPIMBENCH Track, FTRLIM is rebuilt using
 JAVA with core functionalities as the submitted version. The complete version
 of FTRLIM has been developed and deployed on a Spark cluster, which provides
 the FTRLIM framework with ability to deal with large-scale data. Compared
  ?
    Copyright © 2020 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).
 ??
    This research has been supported by the National Key R&D Program of China
    (No. 2018YFB0505000), the National Natural Science Foundation of China (No.
    61702374), and the Fundamental Research Funds for the Central Universities.
???
    Corresponding author, email: zhu hongming@tongji.edu.cn
with last year’s version, this year’s FTRLIM has been slightly changed, which
will be introduced later.

1.2   Specific techniques used
This section introduces the refined working flow of FTRLIM. FTRLIM consists
of four major components: Blocker, Comparator, Trainer, and Matcher. The
framework accepts input instances in the OWL format, which are stored in
source dataset and target dataset, respectively. FTRLIM finds matched instances
between the two datasets. The overview of the FTRLIM’s work flow is presented
in Fig. 1.


                    Fig. 1. Work Flow of FTRLIM OAEI 2020


Blocker Since the scale of instances that need to be matched is usually very
large, it is very time-consuming and space-consuming to compare all the in-
stances with each other to find matched instance pairs. Blocker extracts fea-
tures of textual attributes related to instances to generate indexes for them.
The interactions among different textual information are taken into considera-
tion, which allows instances to be fine-grained divided. It also has the ability to
infer indexes for instances whose textural attributes are in-completed or miss-
ing. FTRLIM supports users to generate indexes for instances via more than
one attribute. Instances with the same index are divided into the same instance
block, and instances from different sources under the same block will form can-
didate instance pairs. Only when a pair of instances is a candidate pair can it be
matched in the following procedures. When there are only two instances from
different data sources in the same block, these two instances will form a unique
instance pair[4], which will be regarded as an matched instance pair directly.

Comparator All candidate pairs will be sent to the comparator to calculate
similarity. The comparator compares two instances from user-specified aspects.
The edit distance similarity is calculated for textual instance attributes, while
the Jaccard similarity is calculated for instance relationships. The calculation
results will be arranged in order to form the similarity vector. Formally, let the
list of predicates adopted by Comparator be hp1 , p2 , . . . , pn i, then the similarity
vector of the two instance is

                     hs1 , s2 , . . . , sn i , si ∈ [0, 1], (i = 1, 2, . . . , n)

where si is the similarity of the two instances under the i-th predicate.


Trainer FTRLIM treats the instance matching as a regression problem, where
the similarity score between two instances can be regarded as the probability that
the two instances are matched. We innovatively introduce the FTRL model[6] to
solve the problem. FTRL is a widely-used online logistic regression model with
high precision, excellent sparsity, fast training speed and satisfactory streaming
data processing ability. Trainer is designed to train the FTRL model for instance
matching. It first generates train set for the FTRL model. After the preparation
of train set is completed, the FTRL model will be trained with hyperparameters
in configuration files. Benefiting from the FTRL model’s feature, the training
process won’t cost a long time. The Trainer component plays a greater role in
the complete version. It can be used to accept the feedback of users and adjust
the parameters of the FTRL model. Users are allowed to choose a batch of
candidate instance pairs and correct the similarity score, or pick up a certain
pair to correct.


Matcher All candidate pairs will obtain their final similarity scores in this
component. Since FTRLIM produces the similarity scores in the interval [0,1],
candidate pairs whose scores are greater than 0.5 will be regarded as matched
pairs. The matching score s is calculated as follows:

                                                   1
                                       s=                                           (1)
                                              1 + e−xT w

where x is the similarity vector, w is the weight of the FTRL model. In this
year’s submission, all elements of similarity vectors accepted by the FTRL model
are unified from [0, 1] to [-1, 1] to satisfy the symmetry of the equation.


Configurations FTRLIM is easily to be tailored according to user’s require-
ments. We expect that all matching procedures are under user’s control, thus
we allow users to customize their own FTRLIM system using configuration files.
Users are able to set the attributes for index generation, the attributes and re-
lationships for comparison, the hyperparameters for the FTRL model and many
other detailed parameters to get a better result.
1.3   Adaptions made for the evaluation

To participate in the evaluation, we rebuilt FTRLIM and replaced some manual
operations with automatic strategies.
    The train set for training the FTRL model is automatically generated in the
submitted version, while it needs manual scoring in the completed version. The
train set is composed of instance pairs’ similarity vectors as well as their simi-
larity scores. The Trainer regards all unique pairs as matched pairs. Therefore,
it selects all similarity vectors of unique pairs as positive samples, and assigns
them with similarity score 1.0. The mismatched pairs are built by replacing one
instance of each unique pair randomly. These pairs are assigned with similar-
ity score 0.0 and treated as negative samples in the train set. In the completed
version, however, FTRLIM does not regard all unique pairs as matched pairs
directly. It will compute the mean value of similarity vectors’ elements as the
raw score for each instance pairs. Then it will select a batch of instance pairs
that have raw scores higher than a threshold as positive samples, as well as the
same amount of instance pairs whose raw scores are lower than the threshold
as negative samples. Users will determine the similarity score by themselves to
generate the train set. Besides, we excluded the non-core functionalities of FTR-
LIM such as the user-feedback and the load balance mechanism. The ways of
input and output is adapted for the evaluation as well.


1.4   Link to the system and parameters file

The implementation of FTRLIM and relevant System Adapter for HOBBIT
platform can be found at this FTRLIM-HOBBIT’s gitlab page.1


2     Result

In this section, we present the results obtained by FTRLIM in the OAEI 2020
competition. FTRLIM participated in the SPIMBENCH Track, which aims at
determining whether two OWL instances describe the same Creative Work. The
datasets are generated and transformed using SPIMBENCH[7]. Our competitors
includes LogMap[2], AML[3], Lily[5] and REMinder. The first three systems have
participated in this track for many years, while REMinder is a new contestants
in this year. The results are published in this OAEI 2020 result page2 .


2.1   SPIMBENCH

The SPIMBENCH task is executed in two datasets, the SANDBOX and the
MAINBOX, of different size. The SANDBOX has about 380 instances and 10000
triplets, while the MAINBOX has about 1800 instances and 50000 triplets. We
summarized the results of the SPIMBENCH Track in Table 1 and Table 2, where
the best results are indicated in bold.
                         Table 1. The Results of SANDBOX

                                LogMap AML       Lily FTRLIM REMiner
                Fmeasure      0.8413 0.8645 0.9917 0.9214         0.9983
                Precision     0.9383 0.8349 0.9836 0.8542            1
                  Recall      0.7625 0.8963 1         1           0.9967
             Time performance 7483    6446 2050    1525            7284

                         Table 2. The Results of MAINBOX

                               LogMap AML        Lily   FTRLIM REMiner
               Fmeasure         0.7856   0.8605 0.0.9954 0.9215   0.9977
               Precision        0.8801   0.8385 0.9908 0.8558     0.9987
                 Recall         0.7095   0.8835     1    0.9980   0.9967
            Time performance    26782    38772 3899      2247      33966


    Compared with all competitors, FTRLIM achieves the best time performance
on both two datasets. The time cost of our framework is reduced by 25.6% than
the second fastest one, Lily, on SANDBOX, while it is reduced by 42.4% than
Lily on MAINBOX. The results on time performance indicate the efficiency of
FTRLIM, which is more essential for large-scale instance matching. The FTR-
LIM also achieves the highest recall on SANDBOX and almost the highest recall
on MAINBOX. The precision of FTRLIM is relatively low on both datasets.
There are two reasons that account for this situation. One reason is that the au-
tomatic strategy we adopted for generating train set is flawed. In the generated
train set, there is almost no similarity between the the sample instance pairs with
low score. Although this kind of samples helps the FTRL model learn to distin-
guish similar instance pairs from dissimilar instance pairs, it does not help the
model distinguish matched instance pairs from similar instance pairs. Then the
model prefers to predict high similarity scores for similar instance pairs, which
improves the recall but reduces the precision. Another reason is that there may
be problems with the way unique pairs are treated. Regarding the unique pairs
as matched pairs directly will also affect the precision of the prediction. But the
overall matching quality of FTRLIM is still competitive.


3     General comments

3.1    Comments on the result

FTRLIM has achieved time performance in both datasets of SPIMBENCH. The
Blocker component makes a significant contribution to achieving the results.
It helps the framework filter out instance pairs with a high possibility to be
1
    https://git.project-hobbit.eu/937522035/ftrlimhobbit
2
    http://oaei.ontologymatching.org/2020/results
matched effectively and efficiently. The Comparator component only needs to
compare instances with the same indexes rather than every instance pairs. The
datasets of SPIMBENCH contain a wealth of textual information, and there are
many attributes that can be used to build indexes or to compare the similarity
among instances. The FTRL model trained by Trainer is able to learn a weight
for attributes or relationships and distinguish instance pairs that points to the
same entity in the real world. Compared with other systems, the precision of
FTRLIM is unsatisfactory, which should be improved in feature works.


3.2   Improvements

There are still many aspects to be improved in FTRLIM. The submitted version
of FTRLIM generates flawed train set for training the FTRL model, and consid-
ers unique pairs as matched instances unconditionally. The automatic strategy
adopted by Trainer and Matcher should be optimized to address the problems.
More comparison methods for various data types should be attached to our
frameworks as well. Although FTRLIM is specially designed to solve the in-
stance matching problem, it is also expected to produce meaningful results in
other similar tracks in the future.


4     Conclusion

In this paper, we briefly presented our instance matching framework FTRLIM.
The core functionalities and components of FTRLIM were introduced, and the
evaluation results of FTRLIM were presented and analyzed. FTRLIM achieved
significantly better time performance than other systems on both two datasets of
SPIMBENCH, as well as the competitive matching quality. The results indicated
the effectiveness and high efficiency of our matching strategy, which is important
for matching instances on large-scale datasets.


References

1. Otero-Cerdeira, L., Rodrı́guez-Martı́nez, F.J., Gómez-Rodrı́guez, A.: Ontology
   matching: A literature review. Expert Systems with Applications 42(2), 949–971
   (2015)
2. Jiménez-Ruiz, E., Grau, B.C.: Logmap: Logic-based and scalable ontology matching.
   In: International Semantic Web Conference. pp. 273–288. Springer (2011)
3. Faria, D., Pesquita, C., Santos, E., Cruz, I.F., Couto, F.M.: Agreementmakerlight
   results for oaei 2013. In: OM. pp. 101–108 (2013)
4. Shao, C., Hu, L., Li, J.Z., Wang, Z., Chung, T.L., Xia, J.B.: Rimom-im: A novel
   iterative framework for instance matching. Journal of Computer Science and Tech-
   nology 31, 185–197 (2016)
5. Wu, J., Pan, Z., Zhang, C., Wang, P.: Lily results for oaei 2019. In: OM@ ISWC.
   pp. 153–159 (2019)
6. McMahan, H.B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., Nie, L.,
   Phillips, T., Davydov, E., Golovin, D., et al.: Ad click prediction: a view from the
   trenches. In: Proceedings of the 19th ACM SIGKDD international conference on
   Knowledge discovery and data mining. pp. 1222–1230 (2013)
7. Saveta, T., Daskalaki, E., Flouris, G., Fundulaki, I., Herschel, M., Ngomo, A.C.N.:
   Spimbench : A scalable , schema-aware instance matching benchmark for the se-
   mantic publishing domain (2014)