<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xiaowen Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yizhi Jiang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yi Luo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hongfei Fan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hua Jiang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hongming Zhu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qin Liu</string-name>
          <email>qin.liug@tongji.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Software Engineering, Tongji University</institution>
          ,
          <addr-line>Shanghai</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Tsingtao Advanced Research Institute, Tongji University</institution>
          ,
          <addr-line>Shanghai</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>To achieve better efficiency and feasibility in instance matching between two datasets, we propose a system named FTRLIM, which is based on the FTRL (Follow the Regularized Leader) model. FTRLIM generates indexes for instances, which enables it to figure out possible matching instance pairs efficiently. FTRLIM participated in the SPIMBENCH track of OAEI 2019 and obtained the highest F-measure in SANDBOX and almost the highest F-measure in MAINBOX, with the lowest time cost. The results also indicate potential directions for further improvement of FTRLIM.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1 Presentation of the system</title>
      <sec id="sec-2-1">
        <title>1.1 State, purpose, general statement</title>
        <p>
          Researchers have worked extensively on ontology alignment, and early methods mainly
focused on matching ontologies based on the schema. Recently,
instance-based matching has gradually become a promising topic [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Many
ontology matching systems support solving the instance matching
problem, such as LogMap [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], AML [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], Lily [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and RiMOM-IM [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. With
the rapid growth of data scale, matching instances across
large-scale knowledge graphs has become a practical requirement.
        </p>
        <p>
          FTRLIM is designed to provide an effective and efficient solution for matching
instances among large-scale datasets. Its core functionalities are as
follows:
          <list list-type="order">
            <list-item><p>Build indexes for instances based on textual attributes. Only instances with
the same index can be aligned.</p></list-item>
            <list-item><p>Calculate the similarity between two instances on certain attributes and
relationships. Different methods are used to calculate the similarity
according to the data types of the attributes or relationships.</p></list-item>
            <list-item><p>Generate the train set for the FTRL model [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] automatically from the given data. Specific instance pairs are selected as the train set during the
matching process without manual operations.</p></list-item>
            <list-item><p>Aggregate the similarities of different attributes and relationships into a
similarity score with the FTRL model, which is trained after the train set has been generated.</p></list-item>
            <list-item><p>Select aligned instances according to the similarity score of each instance
pair.</p></list-item>
            <list-item><p>Customize all procedures based on configuration files.</p></list-item>
          </list>
        </p>
        <p>FTRLIM is a newly developed system, and this is the first time that we have
participated in the OAEI evaluation. To check the feasibility and
efficiency of the system, we rebuilt it in Java with only its core
functionalities. The complete version of FTRLIM has been developed and
deployed on a Spark cluster, which gives the system the ability to deal with
large-scale data. A user feedback mechanism is integrated into the complete
version as well, so that the system can correct matching results on the basis of feedback.
Finally, the complete version also supports merging the attributes
and relationships of aligned instances.</p>
      </sec>
      <sec id="sec-2-2">
        <title>1.2 Specific techniques used</title>
        <p>
          FTRLIM consists of five major components: Index Generator, Comparator,
Train set Generator, Model Trainer and Matcher. The system accepts input
instances in OWL format, stored in a source dataset and a target dataset,
and finds aligned instances between the two datasets. The
architecture of FTRLIM is presented in Fig. 1.
        </p>
        <p>
          Index Generator. Since the number of instances to be aligned is usually
very large, comparing every instance with every other one to find aligned
pairs costs too much time and space. FTRLIM instead uses the textual
information related to instances to efficiently select the instance pairs that could be aligned.
This work is done by the Index Generator, which builds indexes for all input instances based on their
attributes. The system first extracts the values of a specified instance attribute and
regards each value as a document; together these documents constitute a
document set. TF-IDF is then used to find keywords for each document,
and the index of an instance is generated from the combination of its
keywords. Users may generate indexes from more than
one attribute. In this case, the indexes of an instance created from the
different attributes are concatenated to form the final index. Instances
with the same index are placed in the same instance block, and instances from
different sources within the same block form candidate instance pairs. Only
candidate pairs can be aligned in the following
procedures. When a block contains only two instances, which come from different data sources,
these two instances form a unique instance pair [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], which
is regarded as an aligned instance pair directly. Missing attribute values
are taken into consideration to avoid losing candidate instances as far as possible.
        </p>
        <p>
          Comparator. All candidate pairs are sent to the Comparator, which calculates
their similarity from different aspects. The
edit distance similarity is calculated for textual instance attributes, while the
Jaccard similarity is calculated for instance relationships. The
results are arranged in a fixed order to form the similarity vector. For example, if
a candidate pair (x<sub>1</sub>, x<sub>2</sub>) is compared under two attributes (a<sub>1</sub>, a<sub>2</sub>) and one relationship
r<sub>1</sub>, and the similarities from these three aspects are 0.3, 1 and 0.8, respectively, then the
similarity vector is ⟨0.3, 1, 0.8⟩. All pairs are compared on identical
aspects, so that the same dimension of different similarity vectors has the
same meaning.
        </p>
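        <p>The Index Generator and Comparator steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the FTRLIM implementation: the whitespace tokenizer, the choice of the top two TF-IDF tokens as keywords, the normalisation of the edit distance by the longer string, the handling of missing values as empty strings or empty sets, and all function and attribute names are hypothetical.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter, defaultdict


def tfidf_keywords(docs, top_k=2):
    """Return, per document, its top_k TF-IDF tokens as a sorted tuple (the index)."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    keys = []
    for toks in tokenized:
        scores = {tok: cnt / len(toks) * math.log(len(docs) / doc_freq[tok])
                  for tok, cnt in Counter(toks).items()}
        best = sorted(scores, key=lambda tok: (-scores[tok], tok))[:top_k]
        keys.append(tuple(sorted(best)))
    return keys


def build_blocks(source, target, top_k=2):
    """Map each index to ([source ids], [target ids]); inputs are dicts id -> text."""
    ids = list(source) + list(target)
    sides = [0] * len(source) + [1] * len(target)
    docs = list(source.values()) + list(target.values())
    blocks = defaultdict(lambda: ([], []))
    for inst_id, side, key in zip(ids, sides, tfidf_keywords(docs, top_k)):
        blocks[key][side].append(inst_id)
    return blocks


def split_pairs(blocks):
    """Return (pairs to score, unique pairs); one instance per side is a unique pair."""
    candidates, unique = [], []
    for src_ids, tgt_ids in blocks.values():
        pairs = [(s, t) for s in src_ids for t in tgt_ids]
        (unique if len(pairs) == 1 else candidates).extend(pairs)
    return candidates, unique


def edit_similarity(a, b):
    """1 - Levenshtein distance / length of the longer string (assumed normalisation)."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))


def jaccard(x, y):
    """Size of the intersection divided by the size of the union of two sets."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if x | y else 1.0


def similarity_vector(inst1, inst2, attributes, relationships):
    """Compare two instances (dicts) aspect by aspect, always in the same order."""
    vec = [edit_similarity(inst1.get(a, ""), inst2.get(a, "")) for a in attributes]
    vec += [jaccard(inst1.get(r, ()), inst2.get(r, ())) for r in relationships]
    return vec


source = {"s1": "alpha beta news", "s2": "gamma delta news", "s3": "gamma delta news"}
target = {"t1": "news alpha beta", "t2": "gamma delta news"}
candidates, unique = split_pairs(build_blocks(source, target))
print("unique pairs:", unique)
print("pairs to score:", candidates)

x1 = {"title": "abcd", "tags": {"a", "b"}}
x2 = {"title": "abcf", "tags": {"a", "b", "c"}}
print(similarity_vector(x1, x2, ["title"], ["tags"]))
```
]]></preformat>
        <p>In this toy input the word shared by every document receives a TF-IDF weight of zero, so it never becomes part of an index, and only instances that share their distinctive tokens end up in the same block.</p>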
        <p>Train set Generator. Judging whether a pair of instances is aligned is in fact
a binary classification problem. We introduce the FTRL model to
solve this problem, since it is able to perform
classification on large-scale datasets. The model must be trained first, and the
Train set Generator produces the train set for it.
The train set is composed of the similarity vectors of instance pairs together with their
similarity scores. The Train set Generator regards all unique pairs as aligned
pairs. It therefore selects the similarity vectors of all unique pairs as positive
samples and assigns them the similarity score 1.0. Unaligned pairs are built
by randomly replacing one instance of each unique pair. These pairs are assigned
the similarity score 0.0 and treated as negative samples in the train set. The
input of the FTRL model is the similarity vector, and the output is the similarity
score. This component differs from its counterpart in the complete version of FTRLIM, which
is described in Section 1.3.</p>
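        <p>The sampling strategy described above can be sketched as follows. This is an illustrative sketch only: which side of a unique pair is replaced is not specified in the text, so the code assumes that the source instance is swapped, and the names build_train_set and compare are hypothetical. A randomly built pair may occasionally be a true match, which the sketch does not check.</p>
        <preformat><![CDATA[
```python
import random


def build_train_set(unique_pairs, source_ids, compare, seed=0):
    """Return a list of (similarity vector, label) samples.

    Positive samples are the unique pairs (label 1.0). For each unique pair,
    one negative sample (label 0.0) is built by replacing its source instance
    with a different, randomly chosen source instance (an assumption).
    """
    rng = random.Random(seed)
    samples = []
    for src, tgt in unique_pairs:
        samples.append((compare(src, tgt), 1.0))
        others = [s for s in source_ids if s != src]
        if others:
            samples.append((compare(rng.choice(others), tgt), 0.0))
    return samples


# Toy comparator: instances "match" when their numeric suffixes agree.
def toy_compare(src, tgt):
    return [1.0 if src[1:] == tgt[1:] else 0.0]


train_set = build_train_set([("s1", "t1"), ("s2", "t2")], ["s1", "s2", "s3"], toy_compare)
for vector, label in train_set:
    print(vector, label)
```
]]></preformat>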
        <p>Model Trainer. The FTRL model is trained in this component with the
hyperparameters given in the configuration files. Thanks to the characteristics of the FTRL model,
training does not take long. The trainer plays a larger role in the
complete version, where it also accepts user feedback and
adjusts the parameters of the FTRL model accordingly. Users may choose a batch
of candidate instance pairs and correct their similarity scores, or pick a single
pair to correct.</p>
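        <p>For illustration, a model of this kind can be trained with the per-coordinate FTRL-Proximal update for logistic regression given in [6]. The sketch below assumes that formulation; the paper does not state which variant FTRLIM uses, and the class name, the added bias term and the default hyperparameter values are placeholders rather than the values from the FTRLIM configuration files.</p>
        <preformat><![CDATA[
```python
import math


class FTRLProximal:
    """Logistic regression trained online with per-coordinate FTRL-Proximal."""

    def __init__(self, dim, alpha=0.5, beta=1.0, l1=0.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * (dim + 1)  # the last coordinate is a bias term
        self.n = [0.0] * (dim + 1)  # per-coordinate sum of squared gradients

    def _weights(self):
        """Closed-form weights; coordinates with |z| <= l1 are exactly zero."""
        weights = []
        for z, n in zip(self.z, self.n):
            if abs(z) <= self.l1:
                weights.append(0.0)
            else:
                sign = 1.0 if z > 0 else -1.0
                denom = (self.beta + math.sqrt(n)) / self.alpha + self.l2
                weights.append(-(z - sign * self.l1) / denom)
        return weights

    @staticmethod
    def _sigmoid(s):
        return 1.0 / (1.0 + math.exp(-max(min(s, 35.0), -35.0)))

    def predict(self, x):
        """Similarity score in (0, 1) for a similarity vector x."""
        features = list(x) + [1.0]
        return self._sigmoid(sum(w * f for w, f in zip(self._weights(), features)))

    def update(self, x, y):
        """One online step on sample (x, y) with label y in {0.0, 1.0}."""
        features = list(x) + [1.0]
        weights = self._weights()
        p = self._sigmoid(sum(w * f for w, f in zip(weights, features)))
        for i, f in enumerate(features):
            g = (p - y) * f  # gradient of the log loss for coordinate i
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * weights[i]
            self.n[i] += g * g


model = FTRLProximal(dim=3)
train_set = [([0.9, 1.0, 0.8], 1.0), ([0.1, 0.2, 0.0], 0.0)]
for _ in range(200):
    for vector, label in train_set:
        model.update(vector, label)
print(model.predict([0.9, 1.0, 0.8]), model.predict([0.1, 0.2, 0.0]))
```
]]></preformat>
        <p>Each update touches every coordinate once, so the cost of one training step is linear in the length of the similarity vector.</p>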
        <p>Matcher. All candidate pairs obtain their final similarity scores in this
component. The trained FTRL model takes the similarity vectors and predicts
a matching score for each of them. Instance pairs with a score larger than 0.5 are
regarded as aligned pairs. Together with the unique pairs, they form the final output of aligned
instances.</p>
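        <p>The selection rule above can be sketched as follows. The function name match and the stand-in scoring function are hypothetical; in FTRLIM the scores come from the trained FTRL model.</p>
        <preformat><![CDATA[
```python
def match(candidate_vectors, unique_pairs, predict, threshold=0.5):
    """Return the set of aligned pairs.

    candidate_vectors maps each candidate pair to its similarity vector.
    A pair is kept when its predicted score is strictly larger than the
    threshold; unique pairs are added without being scored.
    """
    aligned = {pair for pair, vec in candidate_vectors.items() if predict(vec) > threshold}
    return aligned | set(unique_pairs)


# Stand-in scorer for the example only: the mean of the similarity vector.
def mean_score(vec):
    return sum(vec) / len(vec)


vectors = {("s2", "t2"): [0.9, 0.8], ("s3", "t2"): [0.2, 0.1]}
print(sorted(match(vectors, [("s1", "t1")], mean_score)))
```
]]></preformat>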
        <p>Configurations. FTRLIM can easily be tailored to users'
requirements. We want all matching procedures to be under the user's control, so
we allow users to customize their own FTRLIM system through configuration files.
Users can set the attributes for index generation, the attributes and
relationships for comparison, the hyperparameters of the FTRL model and many
other detailed parameters to obtain a better result.</p>
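        <p>The format of the FTRLIM configuration files is not described here. Purely as an illustration of the kinds of settings listed above, the following sketch loads a hypothetical JSON configuration; every key name, attribute name and value in it is an assumption made for the example.</p>
        <preformat><![CDATA[
```python
import json

# Hypothetical configuration: none of these keys or values come from FTRLIM.
CONFIG_TEXT = """
{
  "index_attributes": ["title"],
  "compare_attributes": ["title", "description"],
  "compare_relationships": ["about"],
  "ftrl": {"alpha": 0.5, "beta": 1.0, "l1": 0.0, "l2": 0.0},
  "match_threshold": 0.5
}
"""

REQUIRED_KEYS = {"index_attributes", "compare_attributes",
                 "compare_relationships", "ftrl", "match_threshold"}


def load_config(text):
    """Parse the configuration and fail early if a required key is missing."""
    config = json.loads(text)
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise ValueError("missing configuration keys: " + ", ".join(sorted(missing)))
    return config


config = load_config(CONFIG_TEXT)
print(config["index_attributes"], config["ftrl"])
```
]]></preformat>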
      </sec>
      <sec id="sec-2-3">
        <title>1.3 Adaptations made for the evaluation</title>
        <p>To participate in the evaluation, we rebuilt the FTRLIM system and replaced
some manual operations with automatic strategies. In the complete version,
FTRLIM does not regard all unique pairs as aligned pairs directly. Instead, it
computes the mean value of the elements of the similarity vector as the raw score of each
instance pair. It then selects a batch of instance pairs whose raw scores are
higher than a threshold as positive samples, and the same number of
instance pairs whose raw scores are lower than the threshold as negative samples.
Users determine the similarity scores of these samples themselves to generate the train set.
In the version developed for OAEI, this procedure is replaced by the automatic strategy described
in Section 1.2. We also excluded the non-core functionalities of the system and adapted the
input and output to the evaluation.</p>
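        <p>The sample selection of the complete version can be sketched as follows. The text does not say how the batch is drawn, so this sketch assumes that the highest-scoring pairs above the threshold and the lowest-scoring pairs below it are taken; the function name, the threshold and the batch size are likewise assumptions. The selected pairs would then be shown to the user for labelling.</p>
        <preformat><![CDATA[
```python
def select_training_batch(vectors, threshold=0.5, batch_size=2):
    """Return (positive pairs, negative pairs) chosen by raw score.

    vectors maps each instance pair to its similarity vector. The raw score
    of a pair is the mean of the elements of its similarity vector.
    """
    raw = {pair: sum(vec) / len(vec) for pair, vec in vectors.items()}
    ranked = sorted(raw, key=lambda pair: (-raw[pair], pair))  # highest score first
    positives = [p for p in ranked if raw[p] > threshold][:batch_size]
    # Same number of negatives as positives, taken from the low end.
    negatives = [p for p in reversed(ranked) if raw[p] < threshold][:len(positives)]
    return positives, negatives


vectors = {
    "a": [1.0, 0.8],
    "b": [0.6, 0.8],
    "c": [0.2, 0.4],
    "d": [0.0, 0.2],
    "e": [0.6, 0.6],
}
positives, negatives = select_training_batch(vectors)
print("to confirm as aligned:", positives)
print("to confirm as unaligned:", negatives)
```
]]></preformat>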
      </sec>
      <sec id="sec-2-4">
        <title>1.4 Link to the system and parameters file</title>
        <p>The implementation of FTRLIM and the corresponding System Adapter for the HOBBIT
platform can be found on the FTRLIM-HOBBIT GitLab page.<sup>3</sup></p>
        <p><sup>3</sup> https://git.project-hobbit.eu/937522035/ftrlimhobbit</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2 Results</title>
      <p>
        In this section, we present the results obtained by FTRLIM in the OAEI 2019
competition. FTRLIM participated in the SPIMBENCH track, which aims at
determining when two OWL instances describe the same Creative Work. The
datasets are generated and transformed using SPIMBENCH [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We are the newest
team to join this track. Our competitors are LogMap [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], AML [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Lily [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which
have participated in this track for many years. The results are published on the
OAEI 2019 results page.<sup>4</sup>
      </p>
      <sec id="sec-3-1">
        <title>2.1 SPIMBENCH</title>
        <p>The SPIMBENCH task is executed on two datasets of different sizes, the SANDBOX and the
MAINBOX. The SANDBOX has about 380 instances and 10000
triples, while the MAINBOX has about 1800 Creative Works and 50000 triples.</p>
        <p>
          The evaluation results for SANDBOX are summarized in Table 1, where the best
results are indicated in bold. Compared with AML [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], Lily [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and LogMap [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ],
FTRLIM obtained the highest F-measure, the highest recall and the best time
performance, while its precision is 0.08 lower than that of LogMap, which has the best precision.
        </p>
        <p>The evaluation results for MAINBOX are presented in Table 2, with the best
results in bold. FTRLIM is approximately 41% faster than Lily and 17 times
faster than the slowest system, while its F-measure is only 0.00014 lower than the
best one. It also obtained nearly perfect recall and the second-highest
precision.</p>
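        <p>For reference, the measures compared above can be computed as follows, assuming the standard set-based definitions of precision, recall and F-measure over predicted and reference alignments. The exact computation performed by the evaluation platform is not described here, and the pairs in the example are made up for illustration.</p>
        <preformat><![CDATA[
```python
def precision_recall_f1(predicted, reference):
    """Return (precision, recall, F-measure) of predicted pairs against a reference."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up alignment: 3 of 4 predicted pairs are correct, 5 pairs are expected.
predicted = {("s1", "t1"), ("s2", "t2"), ("s3", "t3"), ("s4", "t9")}
reference = {("s1", "t1"), ("s2", "t2"), ("s3", "t3"), ("s5", "t5"), ("s6", "t6")}
print(precision_recall_f1(predicted, reference))
```
]]></preformat>
        <p>Because the F-measure is the harmonic mean of precision and recall, a system can reach the highest F-measure without having the highest precision, as long as its recall is high enough.</p>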
        <p><sup>4</sup> http://oaei.ontologymatching.org/2019/results</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3 General comments</title>
      <sec id="sec-4-1">
        <title>3.1 Comments on the results</title>
        <p>FTRLIM achieved satisfactory performance on both datasets of SPIMBENCH,
especially on the SANDBOX. The Index Generator contributes significantly
to these results. It helps the system select the instance pairs that are
highly likely to be aligned, both effectively and efficiently. The Comparator only
needs to compare instances with the same index rather than every instance
pair. The datasets of SPIMBENCH contain a wealth of textual information,
and many attributes can be used to build indexes or to compare
the similarity among instances. The FTRL model trained by the Model Trainer
component performs as we expected: it learns a weight for each attribute or
relationship and distinguishes pairs of instances that refer to the same real-world
entity.</p>
        <p>Compared with LogMap, the F-measure of FTRLIM is 8-13% higher, while
its precision is 4-8% lower. This result shows that FTRLIM can still be
improved to obtain higher precision. The OAEI version of FTRLIM considers
unique pairs to be aligned instances unconditionally, which is not always correct. One
possible way to solve this problem is to validate the matching results. This is one
of the focuses of our future work.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2 Improvements</title>
        <p>Many aspects of the FTRLIM system can still be improved. Besides
adding the validation stage described in Section 3.1, we will continue to optimize the
algorithm for generating instance indexes and the matching strategy in
future work. More comparison methods and supported data types should be
added to the system as well. We are also committed to building a GUI for
the system. Although FTRLIM is specifically designed to solve the instance
matching problem, we expect it to produce meaningful results in other similar
tracks in the future.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4 Conclusion</title>
      <p>In this paper, we briefly presented our instance matching system FTRLIM. The
core functionalities and components of the system were introduced, and the
evaluation results of FTRLIM were presented and analyzed. FTRLIM achieved
significantly better time performance than the other systems on both datasets of
SPIMBENCH, and obtained the highest F-measure in SANDBOX and almost the
same F-measure as the best system in MAINBOX. The results demonstrate the
effectiveness and high efficiency of our matching strategy, which is important for
matching instances among large-scale datasets.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Otero-Cerdeira</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodríguez-Martínez</surname>
            ,
            <given-names>F.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez-Rodríguez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Ontology matching: A literature review</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>42</volume>
          (
          <issue>2</issue>
          ),
          <fpage>949</fpage>
          -
          <lpage>971</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Jiménez-Ruiz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grau</surname>
            ,
            <given-names>B.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cross</surname>
            ,
            <given-names>V.V.</given-names>
          </string-name>
          :
          <article-title>LogMap family participation in the OAEI 2018</article-title>
          . In: OM@ISWC (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Faria</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pesquita</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balasubramani</surname>
            ,
            <given-names>B.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tervo</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carrico</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garrilha</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Couto</surname>
            ,
            <given-names>F.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cruz</surname>
            ,
            <given-names>I.F.</given-names>
          </string-name>
          :
          <article-title>Results of AML participation in OAEI 2018</article-title>
          . In: OM@ISWC (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Lily results for OAEI 2018</article-title>
          . In: OM@ISWC (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Shao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chung</surname>
            ,
            <given-names>T.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xia</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          :
          <article-title>RiMOM-IM: A novel iterative framework for instance matching</article-title>
          .
          <source>Journal of Computer Science and Technology</source>
          <volume>31</volume>
          ,
          <fpage>185</fpage>
          -
          <lpage>197</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>McMahan</surname>
            ,
            <given-names>H.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holt</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sculley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Young</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ebner</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grady</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nie</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Phillips</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davydov</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Golovin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chikkerur</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wattenberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hrafnkelsson</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boulos</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kubica</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Ad click prediction: a view from the trenches</article-title>
          .
          <source>In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Saveta</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daskalaki</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flouris</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fundulaki</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herschel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ngomo</surname>
            ,
            <given-names>A.C.N.</given-names>
          </string-name>
          :
          <article-title>SPIMBench: A scalable, schema-aware instance matching benchmark for the semantic publishing domain</article-title>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>