<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DLinker Results for OAEI 2022</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bill Gates Happi Happi</string-name>
          <email>bill.happi@ird.fr</email>
          <email>billhappi@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Géraud Fokou Pelap</string-name>
          <email>geraud.fokou@univ-dschang.org</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danai Symeonidou</string-name>
          <email>danai.symeonidou@inrae.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierre Larmande</string-name>
          <email>pierre.larmande@ird.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIADE, University of Montpellier</institution>
          ,
          <addr-line>IRD, CIRAD, 911 Av. Agropolis, 34394 Montpellier</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>INRAE, SupAgro, UMR MISTEA, University of Montpellier</institution>
          ,
          <addr-line>2 place Pierre Viala, 34060 Montpellier</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Dschang</institution>
          ,
          <addr-line>Dschang ville</addr-line>
          ,
          <country country="CM">Cameroon</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>DLinker is a system for matching instances across two RDF data sources. Its performance rests mainly on a deep comparison of literals. The main comparison algorithm searches for the longest common subsequence (LCS) shared by two literals. A formula then computes a confidence percentage for the similarity of the literals and compares it with a threshold supplied as an input hyperparameter. Two instances are validated as similar only when they reach the acceptation threshold, which is also supplied as a hyperparameter. The current version compares strings as they are, without taking synonyms into account. This is DLinker's first participation in the OAEI campaign, on two tracks (SPIMBENCH and SPATIAL) comprising 9 challenges. DLinker demonstrated its ability to process different data with good accuracy and in a very short time. In the SPATIAL track, DLinker outperformed the state of the art, finishing first with the shortest time. Overall, DLinker exhibits strengths and weaknesses that are discussed in this paper.</p>
      </abstract>
      <kwd-group>
        <kwd>Instance Matching</kwd>
        <kwd>Syntactic Similarity</kwd>
        <kwd>Data Linking Algorithm</kwd>
        <kwd>Data Processing</kwd>
        <kwd>Synonyms</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Presentation of DLinker</title>
      <sec id="sec-1-1">
        <title>1.1. General Statement</title>
        <p>
          DLinker is a generic instance matching system. Its performance rests mainly on a recursive
algorithm that compares literals using the longest common subsequence (LCS) [<xref ref-type="bibr" rid="ref1">1</xref>]. DLinker
performs the matching in two steps. First, during the training step, the algorithm learns from
the hyperparameters to optimize the prediction of pairs of similar instances. Second, during the
prediction step, the tool applies the tuned model. DLinker participated in 2 tracks composed of 9
challenges during the OAEI campaign. The tool achieved good results in the SPATIAL track,
where it was the best in both run time and accuracy on the small- and large-scale EQUALS
and OVERLAPS topological tasks for TOMTOM and SPATEN. In the SPIMBENCH track,
we participated on the small dataset, obtaining good accuracy with a run time of about 15.6 s
(Section 2.2.1). The main similarity measure of the tool is based on the LCS
algorithm, of which it can be seen as a deep variant, which makes it applicable to a wide range of
applications. DLinker's short run time is explained by the parallel comparison of the pairs of
literal objects. DLinker was wrapped for the HOBBIT framework [<xref ref-type="bibr" rid="ref2">2</xref>].
        </p>
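        <p>To illustrate the LCS-based comparison described above, the following Python sketch computes the length of the longest common subsequence of two literals with the textbook dynamic-programming recurrence and compares a normalized score with a literal threshold. The function names and the normalization 2 * LCS / (len(a) + len(b)) are assumptions made for illustration; they are not taken from the DLinker sources, whose exact formula and recursion over the measurement depth are not reproduced here.</p>
        <preformat><![CDATA[
```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (textbook DP)."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                # Extend the common subsequence ending before both characters.
                current.append(previous[j - 1] + 1)
            else:
                # Drop one character from either string and keep the best.
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def literal_similarity(a: str, b: str) -> float:
    """Hypothetical normalization (an assumption): 2 * LCS / (len(a) + len(b))."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def literals_match(a: str, b: str, literal_threshold: float) -> bool:
    """Accept the pair when the score reaches the literal threshold."""
    return literal_similarity(a, b) >= literal_threshold
```
]]></preformat>
        <p>Under these assumptions, literals_match("Montpellier", "Montpelier", 0.9) holds, because the shorter string is a subsequence of the longer one and the score is 20/21.</p>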
      </sec>
      <sec id="sec-1-2">
        <title>1.2. System overview</title>
        <p>Initially, the theoretical formalization of DLinker aimed at a multilingual data
linking tool. However, the current version performs this task properly
only when the sources to be linked are in the same language. As shown in Figure 1, DLinker
matches only instances. It takes as input two data sources to be linked, in different formats and
linked or not, and produces as output a set of alignments in a suitable data format.
1. Loading: loads the input files and returns an RDF data graph in the form of triples. Also
loads the hyperparameters (HP) obtained from the training data. These hyperparameters
are:
• predicate threshold: the similarity value above which two predicates are considered
similar;
• literal threshold: the similarity value above which two literals are considered
similar;
• acceptation threshold: the minimal number of pairs of similar objects that two
instances must share to be linked;
• measurement depth: the number of times the longest common subsequence is searched
for between the pairs of objects of the instances. It can also be
seen as the depth of recursion during the search for similar literals;
2. Compute similar predicates: retrieves the unique predicates of each graph and returns
the pairs of similar or complementary predicates between them, using the predicate
threshold;
3. Select and filter the specific (s, o) pairs to compare: first, each pair (s, o)
(respectively (s’, o’)) is selected in each of the two data sources from the similar
predicate pairs (pi, p’j). Then sub-lists of Cartesian products ((s, o), (s’, o’)) are built
for the (s, o) (respectively (s’, o’)) pairs of these predicates. Finally, the number of
unnecessary comparisons between the literal objects of the pairs in these subsets is reduced
by computing a completeness or proximity ratio between them;
4. Compare pairs and validate: the literal objects associated with the pairs
((s, o), (s’, o’)) are compared. These comparisons are validated using the
literal threshold, before or after the similarity search depth (measurement depth
hyperparameter) is reached. A pair of instances is validated once the Acceptation Threshold (AT)
is reached after several successful object comparisons. To track the validation of each
pair of compared subjects, we hash the concatenation of the two subjects and
store it as a &lt;key, value&gt; entry. The value associated with the key is incremented on each
positive object comparison until it reaches AT;
5. Generate alignment: since the comparisons run in parallel, each sub-process handling a
subset of pairs ((s, o), (s’, o’)) adds its validated pairs (s, s’, relation, 1) to a shared list,
based on the acceptation threshold hyperparameter;
6. Output file: the alignments are written in the expected output format; for
the moment we provide “.rdf” and “.nt”.</p>
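        <p>The validation in steps 4 and 5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not DLinker's actual code: the function names, the use of SHA-1, the "|" separator and the "owl:sameAs" relation label are all hypothetical, and the comparisons are processed sequentially rather than in parallel.</p>
        <preformat><![CDATA[
```python
import hashlib
from collections import defaultdict


def pair_key(subject_a: str, subject_b: str) -> str:
    """Hash of the concatenated subjects (SHA-1 and '|' are assumptions)."""
    return hashlib.sha1(f"{subject_a}|{subject_b}".encode("utf-8")).hexdigest()


def validate_pairs(comparisons, acceptation_threshold: int):
    """Turn object-comparison outcomes into alignments.

    comparisons: iterable of (subject_a, subject_b, matched) tuples, one per
    object comparison between the two subjects.
    Returns a list of (s, s', relation, 1) tuples for every subject pair whose
    number of positive comparisons reaches the acceptation threshold.
    """
    counts = defaultdict(int)  # key -> number of positive object comparisons
    subjects = {}              # key -> (subject_a, subject_b)
    for subject_a, subject_b, matched in comparisons:
        if matched:
            key = pair_key(subject_a, subject_b)
            counts[key] += 1
            subjects[key] = (subject_a, subject_b)
    return [
        (*subjects[key], "owl:sameAs", 1)  # relation label is an assumption
        for key, count in counts.items()
        if count >= acceptation_threshold
    ]
```
]]></preformat>
        <p>With an acceptation threshold of 2, a subject pair supported by two positive object comparisons is kept, while a pair supported by only one is discarded.</p>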
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Results</title>
      <p>This section describes the results of the DLinker system on two tracks, SPATIAL and
SPIMBENCH. The evaluation was run on a Linux virtual machine with 256 GB of RAM
and 32 vCPUs (2.4 GHz). The table below lists the different hyperparameters
that were used to obtain the following results:</p>
      <sec id="sec-2-1">
        <title>2.1. SPATIAL Data</title>
        <p>[Table: values of the predicate threshold, literal threshold, acceptation threshold and measurement depth hyperparameters.]</p>
        <p>This track (https://project-hobbit.eu/hobbit-spatial-benchmark-v2-0/) concerns data from
spatial data management systems that store topological relationships between spatial
resources that can be linked together. These resources are described from a large
information set such as LinkedGeoData. The data are generated by a spatial benchmark
generator (https://github.com/hobbit-project/SpatialBenchmark). This benchmark supports several
topological relations (Equals, Disjoint, Touches, Contains/Within, Covers/CoveredBy, Intersects,
Crosses, Overlaps) and contains three data generators (TomTom, Spaten
and DEBS).</p>
        <p>1. Spaten is an open-source configurable spatio-temporal and textual dataset generator
that can produce large volumes of data based on realistic user behavior. Spaten extracts
GPS traces from realistic routes using the Google Maps API and combines them with
real POIs and relevant user comments crawled from TripAdvisor. Spaten publicly offers
GB-size datasets with millions of check-ins and GPS traces;
2. TomTom provides a Synthetic Trace Generator, developed in the context of the HOBBIT
project, that facilitates the creation of an arbitrary volume of data from statistical
descriptions of vehicle traffic. More specifically, it generates traces, a trace being a list of
(longitude, latitude) pairs recorded by one device (phone, car, etc.) throughout one day.
TomTom was the only data generator in the first version of SPgen;
3. DEBS provides a selection of AIS data collected from the MarineTraffic coastal network.
It has been used for the EU H2020 research project BigDataOcean and the ACM DEBS
Grand Challenge 2018.</p>
        <p>The results below are available at https://hobbit-project.github.io/OAEI_2022.html.</p>
        <sec id="sec-2-1-1">
          <title>2.1.1. Evaluation on SandBox LINESTRINGS - LINESTRINGS</title>
          <p>• Under the EQUALS topological relationship, DLinker finished in 1667 ms for Spaten
and 10487 ms for TomTom, with the highest accuracy and the lowest overall time;
• Under the OVERLAPS topological relationship, DLinker finished in 3236 ms for Spaten
and 3087 ms for TomTom, with the highest accuracy and the lowest overall time. The
summary can be found in Figure 2.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.1.2. Evaluation on MainBox LINESTRINGS - LINESTRINGS</title>
          <p>• Under the EQUALS topological relationship, DLinker finished in 2026 ms for Spaten
and 37072 ms for TomTom, with the highest accuracy and the lowest overall time;
• Under the OVERLAPS topological relationship, DLinker finished in 2547 ms for Spaten
and 5458 ms for TomTom, with the highest accuracy and the lowest overall time. The
summary can be found in Figure 3.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. SPIMBENCH Data</title>
        <p>The datasets [<xref ref-type="bibr" rid="ref3">3</xref>] in this track are produced with the SPIMBENCH benchmark generator [<xref ref-type="bibr" rid="ref4">4</xref>],
which generates descriptions of the same entity by applying value-based, structure-based and
semantics-aware transformations to a source dataset in order to create the target
dataset(s). The goal of the SPIMBENCH task is to determine when two instances describe
the same Creative Work. A dataset is composed of a Tbox (contains the ontology and the
instances) and a corresponding Abox (contains only the instances). The datasets share almost
the same ontology (with some differences at the level of properties, due to the structure-based
transformations).</p>
        <p>What we expect from participants. Participants are requested to match
instances in the source dataset (Tbox1) against the instances of the target dataset (Tbox2). The
goal of the task is to produce a set of mappings between the pairs of matching instances that refer to
the same real-world entity. An instance in the source dataset (Tbox1) can have at most one
matching counterpart in the target dataset (Tbox2). Note that only instances of Creative Work
are to be mapped in this task (see https://hobbit-project.github.io/OAEI_2022.html). The instances of the other classes appearing in the sources are
used to examine whether the matching systems take into account RDFS [<xref ref-type="bibr" rid="ref5">5</xref>] and OWL [<xref ref-type="bibr" rid="ref6">6</xref>] constructs in
order to discover correspondences between instances that can be found only by considering
schema information. After processing 380 instances (10000 triples) for each file (source and
target), we obtained the scores presented in the following section.</p>
        <sec id="sec-2-2-1">
          <title>2.2.1. Evaluation on small data set</title>
          <p>DLinker took part in the evaluation on the small data set and finished in 15555 ms with an accuracy
of 0.791, as shown in Table 2.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. General comments and conclusions</title>
      <p>Comments on the results. DLinker was ranked first on the SPATIAL track and second on the
SPIMBENCH track. Its fast processing time is due to the parallelization of the
literal comparisons performed when analyzing instances, which rely on the LCS and its high precision.
High accuracy can be guaranteed only for so-called in-place comparisons, not for those that require
synonyms to be considered. Not taking synonyms into account during instance matching
is the main weakness of DLinker and prevents it from extending properly to ontology
alignment.</p>
      <p>Room for improvement in our system. Although DLinker appears stable on the datasets
evaluated so far, we plan to make it more robust and efficient for other challenges.
We see many exciting possibilities for future work. For example, we intend to
implement multilingual functionality to link two data sources in different languages and to
take synonyms into account during comparisons. We also plan to build an automatic
hyperparameter generator that works from the training datasets.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work was supported by the IRD DIADE UNIT and the FOOSIN project. We also thank the
organisers of the OAEI evaluation campaign for providing test data and infrastructure.</p>
    </sec>
    <sec id="sec-5">
      <title>A. Online Resources</title>
      <p>The sources for the Python and Java versions of DLinker are available at:
• Python version: https://github.com/BillGates98/DLinker
• Java version: https://github.com/BillGates98/dlinker-adapter</p>
    </sec>
    <sec id="sec-6">
      <title>B. Diagrams</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bepery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Abdullah-Al-Mamun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <article-title>Computing a longest common subsequence for multiple sequences</article-title>
          ,
          <year>2015</year>
          . doi:10.1109/EICT.2015.7391933.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Röder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kuchelev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          ,
          <article-title>Hobbit: A platform for benchmarking big linked data</article-title>
          ,
          <source>Data Science</source>
          <volume>3</volume>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          .
          doi:10.3233/DS-190021.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Jiménez-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Saveta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Šváb-Zamazal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hertling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Röder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Fundulaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sherif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Annane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Bellahsene</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ben Yahia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Diallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Faria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kachroudi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khiat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lambrix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mackeprang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mohammadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Trojahn</surname>
          </string-name>
          ,
          <article-title>Introducing the HOBBIT platform into the ontology alignment evaluation campaign</article-title>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Saveta</surname>
          </string-name>
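          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Daskalaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Flouris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Fundulaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Herschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          ,
          <article-title>Pushing the limits of instance matching systems: A semantics-aware benchmark for linked data</article-title>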
          ,
          <year>2015</year>
          , pp.
          <fpage>105</fpage>
          -
          <lpage>106</lpage>
          .
          doi:10.1145/2740908.2742729.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Horrocks</surname>
          </string-name>
          ,
          <article-title>RDFS(FA) and RDF MT: Two semantics for RDFS</article-title>
          ,
          <year>2003</year>
          . doi:10.1007/978-3-540-39718-2_3.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Souhaib</surname>
          </string-name>
          , K. Mohamed, K. E. El Kadiri,
          <source>Ontology Alignment OWL-Lite</source>
          ,
          <year>2012</year>
          . doi:10.5772/28619.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>