<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data Link Discovery Frameworks for Biomedical Linked Data: A comprehensive study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Houssein Dhayne</string-name>
          <email>houssein.dhayne@net.usj.edu.lb</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hanan Farhat</string-name>
          <email>hanan.farhat@net.usj.edu.lb</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rima Kilany</string-name>
          <email>rima.kilany@usj.edu.lb</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Engineering, ESIB, Saint Joseph University</institution>
          ,
          <addr-line>Beirut</addr-line>
          ,
          <country country="LB">Lebanon</country>
        </aff>
      </contrib-group>
      <fpage>5</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>Data discovery, linking and integration techniques are of great importance for the big data variety challenge. Linked Open Data (LOD) and Semantic Web technologies have worked as a driver to address this challenge. However, until 2015, the linkage of triples of LOD increased to 40%, of which only 3% of overall triples are links between different datasets. Today, with the increasing amount of available LOD datasets (9671 datasets compose the LOD), the need to link them together is becoming vital. Links are usually generated, or discovered, by specific frameworks such as SILK and LIMES, which are two of the most effective tools in this domain. They apply instance matching rather than ontology matching, and support active learning. Both have their drawbacks and their advantages, which makes it hard to disregard either of them. This paper aims to evaluate whether SILK and LIMES are potential options for interlinking large-scale biomedical datasets, comparing the two frameworks at many levels, starting from the general features and reaching the comparison measures, the resulting files, the performance and the effectiveness of the links produced. The conclusions drawn from this work are to be used as a reference for the evaluation of the core differences between SILK and LIMES and therefore for choosing the most suitable tool in a biomedical context. It can be considered as an opening for future research and enhancements of such frameworks.</p>
        <p>Index Terms: Semantic Web, Link Discovery, Biomedical Linked Data, Data Links</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>With the rise of Big Data awareness among biomedical providers, there is a need to harness the techniques of data integration and analytics to create significant value towards aiding the process of care delivery and disease exploration [1], [2]. Among the different dimensions that characterize big data, the variety dimension seems to be the most intriguing one for the Semantic Web and the one where the research community can contribute [3], [4].</p>
      <p>The Resource Description Framework (RDF) paradigm, published on the Web in accordance with the Linked Data principles and best practices [5] and containing information about genes, proteins, pathways, diseases, and drugs [6], has evolved as a powerful enabler for the transition of the current unstructured data into interlinked data [7]. For instance, linked data solve the integration of unstructured data by replacing or annotating the data elements of medical texts or images with unique identifiers, providing structured querying of multiple heterogeneous sources [8].</p>
      <p>Linked Open Data (LOD) is a set of best practices to publish RDF linked data on the Web in a machine-readable way, with an explicitly defined semantic meaning, linked to other datasets and allowed to be searched for. LOD principles can be summarized as the publishing of open, linked and structured data, in non-proprietary formats, using URIs. An indexed ready-to-consume crawl of a large portion of LOD (see Fig. 1), called LOD-a-lot (http://lod-a-lot.lod.labs.vu.nl/), contains 28,362,198,927 triples, made up of 3,214,347,198 subjects, 1,168,932 predicates, and 3,178,409,386 objects [9]. With this increasing volume of datasets, the name Big Linked Data has been appearing in research terms. Big Linked Data is an instance of Big Data that is the union of big and linked data; the authors in [10] presented its list of characteristics, which was created by unifying the characteristics of Big and Linked Data.</p>
      <p>Moreover, until 2015, the linkage of triples of LOD increased to 40%, of which only 3% of overall triples are links between different datasets (Fig. 1 shows the growth from 2007 to 2016). Therefore, new problems are arising that require new solutions from the data science community. Hence, the importance of Link-Discovery Frameworks, which are responsible for creating the links, has increased, taking into consideration both efficiency and effectiveness. Efficiency is the optimized process run-time, in addition to the execution time of the preceding and the following steps, excluding complex criteria and computations that consume both time and resources. Effectiveness, on the other hand, means that the resulting evaluated links are accurate and complete.</p>
      <p>The Link-Discovery problem can be defined as a task that takes two datasets as input and produces a set of links between entities of the two datasets as output. Formally, let S (source) and T (target) be two sets of RDF instances, s and t two instances of S and T respectively, and Θ ∈ [0, 1] a similarity threshold. Link Discovery is the process that leads to the discovery of all pairs (s, t) ∈ S × T that are linked by a relation φ, relying on their properties and using a similarity metric ρ: if ρ(s, t) ≥ Θ, then the two entities s and t are considered to be linked by φ.</p>
      <p>This paper aims to compare two Link-Discovery Frameworks, SILK and LIMES, at many levels, starting from the
general features, the comparison measures and the resulting files,
and reaching the performance. In addition, it aims to evaluate
the quality of the links produced.</p>
      <p>The rest of the paper is structured as follows: Section II reviews
comparison studies covering both frameworks,
SILK and LIMES. Section III presents the general process of
link discovery frameworks. Section IV gives an overview
of SILK and LIMES and compares their general
features, while Section V details the criteria used to perform the
comparison experiments. Section VI presents the
results of our experiments, and finally, Section VII concludes
and provides recommendations for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>II. RELATED WORK</title>
      <p>
        Link Discovery frameworks are divided into two types:
domain-specific frameworks (e.g., GNAT, which is specific to music [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]) and
universal frameworks. SILK and LIMES are both universal
frameworks that aim to generate links between entities of
data resources. They have many common features, as well
as dissimilar ones that we detail later.
      </p>
      <p>
        Many studies have compared SILK and LIMES. Nevertheless,
these studies covered only the efficiency challenge (run-time),
assuming that effectiveness (link quality) is guaranteed. The
two main studies comparing the frameworks were published in
2011: the first compared SILK version 2 to LIMES
version 0.3.21 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and the second compared SILK version
2.3 to LIMES version 0.5 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Both studies favored
LIMES over SILK in terms of speed. However, we cannot rely on
this result, since both frameworks now have newer, refactored
versions.
      </p>
      <p>An important element that makes the comparison
unfair in these two studies is the difference in how
comparison thresholds are defined. Even in older versions, LIMES offered
the option of specifying a threshold for each operator
individually, in addition to the threshold for the aggregated output, whereas in
SILK the threshold could only be specified for the output of
the aggregation. The per-comparison threshold was only recently
introduced in newer versions of SILK.</p>
      <p>Even though effectiveness, or link quality, was not a concern
in the latest studies on SILK and LIMES, the tests they
implemented, whether comparing the two frameworks or evaluating
the performance of each alone, have helped develop the
criteria to follow in order to evaluate the quality of generated
links and to deduce which framework, SILK or LIMES, has the
better impact on effectiveness. Some of those
studies are summarized in Table I, where each implemented
test is shown in detail: the source and target datasets,
the source and target classes, the properties compared and
the comparison operators used for each test. Notably,
the choice of the comparison operator is
closely tied to the property being compared.</p>
      <p>
        A study [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] published in 2015 and updated in 2017
addressed link-discovery frameworks and compared
ten of them using a specified benchmark. The frameworks
compared were RiMOM, KnoFuss, AgreementMaker, CODI,
SERIMI, LogMap, SLINT+, Zhishi.links, SILK and LIMES.
These ten tools can be classified along two axes: machine
learning (involved or not involved) and matching (ontology
matching or instance matching). Seven frameworks appeared
to exclude machine learning from the matching process and
to support ontology matching. Yet, Semantic Web research has
gradually shifted from ontology matching to instance matching,
considering the heterogeneity of documents and real-world
entities, which may appear at many web locations with
different descriptions. As the objective of the Semantic Web is
to make data more understandable by machines, machine
learning is a natural way of making use of linked data.
Therefore, the three frameworks classified as instance matching
frameworks that include machine learning, namely SILK, LIMES,
and KnoFuss, are the ones of interest. In contrast to SILK and
LIMES, KnoFuss supports only one type of data link,
owl:sameAs, while SILK and LIMES also support
other RDF link types such as alternate, start, and next.
      </p>
      <p>Consequently, it is essential for any comparison
study to start by classifying the comparison measures
offered by both tools; this is detailed in
Section IV-C. An overview of both SILK and LIMES and a
description of their internal architecture are presented in the
sections that precede it.</p>
    </sec>
    <sec id="sec-3">
      <title>III. LINK DISCOVERY PROCESS</title>
      <p>The matching process is the core part of link discovery. A
single comparison can be summarized in a few steps. First,
the datasets from which the instances will be picked
are determined (a source and a target). Second, the required
classes are picked, followed by the
instances that will be compared. Each instance is described
by properties and their values (e.g., property:
Name, value: Amoxicillin; property: Chemical Formula, value:
C16H19N3O5S). Next, the file containing the link specifications
is imported, the properties to compare are specified, and both
aggregation and comparison operators are chosen, in order to execute
the comparison operation. Finally, the generated links are
filtered into link candidates to be evaluated and exported
to user-specified output files.</p>
      <p>A link discovery process can be divided into 3 stages
(Fig. 2), and all Link Discovery Frameworks apply this process
in order to generate relatively accurate links.</p>
      <p>The first stage is the Pre-Matching stage. It is concerned
with configuring the framework in an optimized manner, and
includes bringing the to-be-linked data from their corresponding
resources, namely a source dataset and a target dataset
that might be in the form of RDF dumps or SPARQL
endpoints. After the source and target datasets are drawn out, the
specifications of the to-be-generated links are written into a
file and imported by the framework, which compares data based
on them. Some other framework-specific parameters are also
stated in the Pre-Matching stage, for example the link
acceptance threshold value. In addition, if machine learning is
involved in the link discovery process, training datasets are
imported at this stage. Some frameworks also
provide the option of benefiting from external resources such as
dictionaries of RDF vocabulary or previous mappings, which are
a form of crowd participation (crowd-sourcing).</p>
      <p>When the Pre-Matching stage ends, the Matching stage starts.
It can be of two types: instance matching or ontology
matching. End-users can intervene in the automated process
in cases of learning-based matching, and this intervention is
a candidate role for crowd-sourcing. SILK and LIMES are
instance matching frameworks and thus are not involved
in ontology matching. When the matching process is complete,
the result is the set of discovered link candidates, each with a
percentage or a relative value indicating its
accuracy.</p>
      <p>The final stage post-processes the outcome: it evaluates the link
candidates that fall below the acceptance threshold but above the
verification threshold. This stage can be done automatically
(using machine learning), depending on the framework
architecture, or manually, which would be a form of crowd-sourcing.
Finally, the links are exported in a user-specified format
and then published or saved.</p>
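      <p>The acceptance and verification thresholds described above can be sketched as a simple triage of scored link candidates. This is a minimal illustration, not the logic of SILK or LIMES: the function name <monospace>triage</monospace> and the default threshold values are hypothetical.</p>
      <preformat>
```python
def triage(candidates, accept=0.95, verify=0.8):
    """Split scored candidates into accepted, to-review and rejected links.

    candidates: iterable of (source, target, similarity) tuples.
    accept and verify are illustrative values, not framework defaults.
    """
    accepted, review, rejected = [], [], []
    for source, target, similarity in candidates:
        if similarity >= accept:
            # At or above the acceptance threshold: keep automatically.
            accepted.append((source, target))
        elif similarity >= verify:
            # Between the two thresholds: needs manual or learned review.
            review.append((source, target))
        else:
            rejected.append((source, target))
    return accepted, review, rejected
```
      </preformat>
      <p>Candidates in the middle band are the ones a reviewer, or a trained classifier, would inspect during post-processing.</p>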
    </sec>
    <sec id="sec-4">
      <title>IV. SILK &amp; LIMES IN A NUTSHELL</title>
      <sec id="sec-4-1">
        <title>A. SILK</title>
        <p>
          SILK [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] is an open-source link discovery framework for
the web, based on RDF. SILK offers a Workbench that enables
the user to manage sets of data sources, easily edit and observe
linking and transformation tasks graphically, create and edit
reference links, and easily evaluate the generated links.
        </p>
        <p>SILK queries the data lists drawn out of the
corresponding resources using resource listers. The source and target
datasets can be local data dumps or resources accessed
via a SPARQL endpoint. Only the target lists pass through
an indexer, which indexes them as a
pre-processing step in order to facilitate the matching process.
The indexing process separates data into blocks and indexes
them by one or more of their properties (mostly labels). The
source lists do not pass through the indexer but are directly
cached on disk in order to be retrieved later at the matching
stage. Only the target lists are indexed because
the matching process is executed on each instance of
the source list, in order to compare it to the best potential
matches from the target list. Indexing the target list
therefore yields a run-time optimization at the comparison level of
the matching process. This time-optimizing step may cause
some links to be missed when blocks of lower matching
potential that contain correct links are excluded. Only the retained links are
written to an output file, whose format can be specified by
the user (e.g., CSV).</p>
        <p>In the matching process, a similarity value is computed for every pair
of instances, and the corresponding aggregation
metric (specified by the user) is evaluated. Comparison
measures (which are metrics or semi-metrics) are processed
by RDF path translators, which transform them into SPARQL
queries and send them to SPARQL endpoints to be evaluated.
Results of the queries are cached temporarily in memory, until
the links whose values are above the acceptance threshold are
picked and saved. The number of links to be picked for
each resource is specified initially by the user; this is called the
link limit. The resulting links are picked from a list of link
candidates, each of which has a corresponding similarity value
(the similarity between the source instance and the target instance),
such that they have the highest similarity values within the
highest-potential "blocks of matching" (according to the
calculated indices of instances).</p>
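        <p>The combination of an acceptance threshold and a link limit can be illustrated with the following sketch. It assumes candidates are already scored and held in memory; the function name <monospace>apply_link_limit</monospace> is hypothetical and this is not SILK's implementation.</p>
        <preformat>
```python
from collections import defaultdict


def apply_link_limit(candidates, threshold, link_limit):
    """Keep, per source instance, the best-scoring links above a threshold.

    candidates: iterable of (source, target, similarity) tuples.
    link_limit: maximum number of links retained for each source.
    """
    by_source = defaultdict(list)
    for source, target, similarity in candidates:
        if similarity >= threshold:
            by_source[source].append((similarity, target))

    links = []
    for source, scored in by_source.items():
        # Highest similarity first, then cut at the link limit.
        scored.sort(reverse=True)
        for similarity, target in scored[:link_limit]:
            links.append((source, target, similarity))
    return links
```
        </preformat>
        <p>With a link limit of 1, each source instance keeps only its single highest-rated target.</p>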
      </sec>
      <sec id="sec-4-2">
        <title>B. LIMES</title>
        <p>
          The name "LIMES" [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] stands for Link Discovery
Framework for Metric Spaces. Like SILK, it is a tool that generates
links between similar entities belonging to different data
resources. However, LIMES estimates similarity values
using the mathematical characteristics of metric spaces. This
helps reduce the number of comparisons and thus decreases
the run-time of the process.
        </p>
        <p>
          The mathematical principles underlying the LIMES
framework are summarized by defining the metric space and
the matching task. A metric space is described by four
conditions: non-negativity, identity of indiscernibles, symmetry,
and the triangle inequality (TI) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Semi-metrics differ
in that they do not satisfy the fourth condition of
metrics (TI). A matching task, in turn, consists of computing
the list of instances from the source and target sets that
match the metric conditions.
        </p>
        <p>
          The general workflow of LIMES starts by reading three
inputs: the source and target datasets, and the link specification
file. After being imported, the data from the datasets is
separated into "strings", "numeric values and values mappable into
vector space", and "leftover values". They are then mapped
using the String Mapper, Numeric Mapper, and Miscellaneous
Mapper respectively. The mappers guarantee that the data is
converted into values belonging to the metric space, according
to the boundary condition derived from the TI [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], as follows:
          <disp-formula><tex-math>m(x, y) - m(y, z) \leq m(x, z) \leq m(x, y) + m(y, z)</tex-math></disp-formula>
        </p>
        <p>
          The distance from x to z can be approximated when knowing
the distance from x to a reference point y as well as the distance
from the reference point y to z. The reference point y is called an
exemplar [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], and is used to define the center of a portion
of the total metric space. Exemplars help calculate the upper
and lower bounds of the distance from point x to point z,
so that, when these bounds are compared to the threshold theta, a decision can be
taken regarding link generation. A set of exemplars is selected
so that they are distributed uniformly in the metric
space and are as dissimilar as possible. Approximating
the distances from points to exemplars reduces the time
needed for comparisons.
        </p>
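        <p>How exemplar distances can avoid exact comparisons may be sketched as follows. This is an illustration of the triangle-inequality bounds only, not LIMES code; the function names and the three-way decision are assumptions. The lower bound uses the absolute difference, which follows from applying the inequality in both directions together with symmetry.</p>
        <preformat>
```python
def distance_bounds(d_xy, d_yz):
    """Bounds on m(x, z) given m(x, y) and m(y, z) for an exemplar y."""
    lower = abs(d_xy - d_yz)
    upper = d_xy + d_yz
    return lower, upper


def decide(d_xy, d_yz, theta):
    """Decide on a pair (x, z) for a distance threshold theta.

    Returns "skip" when the pair cannot be within theta, "link" when it
    is guaranteed to be within theta, and "compute" when the exact
    distance is still needed.
    """
    lower, upper = distance_bounds(d_xy, d_yz)
    if lower > theta:
        return "skip"
    if theta >= upper:
        return "link"
    return "compute"
```
        </preformat>
        <p>Only pairs that fall in the "compute" case require an exact distance computation, which is where the reduction in comparisons comes from.</p>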
      </sec>
      <sec id="sec-4-3">
        <title>C. SILK vs. LIMES</title>
        <p>In this section, we compare SILK and LIMES
according to the following features: General Information and
Accessibility, Framework Configuration, Run-time Optimization, and
Link Discovery and Evaluation. Current studies emphasize
run-time optimization while disregarding other
important features of the frameworks. Moreover, their tests
were performed on older versions of both frameworks (Table I).
Hence the need for an updated, complete comparison
that covers all the features of SILK and LIMES, in order to
make an informed evaluation of their effectiveness
regarding link quality.</p>
        <p>1) General Information and Accessibility: Features under
this category are as follows:
• While LIMES is based on Java, the initial release of SILK was
developed in Python (2009), and the second version was
reimplemented in Scala (2010).
• A web interface is available for SILK (http://SILKframework.org/), while a practical
desktop application is available for LIMES (http://aksw.org/Projects/LIMES.html).
• Tools and sources are available to download for both
frameworks: SILK (https://github.com/silk-framework/silk) and LIMES (https://github.com/dice-group/LIMES).
• Both frameworks have a link specification language:
SILK-LSL and LIMES-LSL.
• The latest version of SILK at the time of writing was
published on 12/2/2016, while that of LIMES was published on 4/4/2017.</p>
      </sec>
      <sec id="sec-4-4">
        <title>2) Framework Configuration</title>
        <p>
          The differences between SILK and LIMES regarding framework configuration are:
• Both frameworks support manual and learning-based
link discovery, but SILK supports two additional
methods: the generation of links using genetic
programming and batch learning [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
• The configuration information in the link specification files
of SILK and LIMES, which specifies the elements of the
comparison, is very similar (datasets, classes,
aggregation operators, comparison operators, link limit, XML
version, etc.) but is organized differently.
• Both frameworks support various input and output file
formats, which differ between SILK and LIMES.
SILK supports eight source-file formats and LIMES
supports six, of which XML, RDF dump, SPARQL
endpoint, and N-TURTLE are in common. As for the
output format, both support N-Triples.
• Acceptance threshold: In both SILK and LIMES,
each specification has a threshold value against which
similarity values are accepted. In addition, the user can set
threshold values that determine which links are automatically
accepted and which should be reviewed manually.
SILK additionally allows specifying the number of links to be picked
for a single data item. Only the highest-rated links per
source data item remain after the filtering.
        </p>
        <p>
          3) Run-Time Optimization: Run-time is the execution time
of the matching process, excluding the execution time of
pre-processing and post-processing operations. The two frameworks
achieve run-time optimization as follows:
• Parallel clustering: parallel processing is supported by
both frameworks using customized versions of
MapReduce.
• Pre-processing methods: The pre-processing method
adopted by SILK relies on dividing data into blocks
in order to enable indexing (often by labels)
and thus reduce comparisons. As for LIMES, the
pre-processing method applied is filtering. The
pre-processing functions are provided by the Xapian search engine library for
SILK, while the newer version of the LIMES framework
integrates an extended version of the PPJoin+ algorithm [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
Efficiency tests performed to compare the run-time of SILK
and LIMES concluded that LIMES is 60 times faster, which
we attribute to the difference in the pre-matching methods. Yet,
the versions tested are old, so these results should be updated.
        </p>
      </sec>
      <sec id="sec-4-5">
        <title>4) Link Discovery and Evaluation</title>
        <p>
          Link Discovery and Evaluation parameters depend on multiple features, as follows:
• Supported measures: SILK, in its latest version, has 16
similarity measures, versus 60 similarity measures for
LIMES in its 1.1.2 version. However, it should be noted
that both frameworks support the addition of new
measures, with the difference that in LIMES the user
must add mappers for such measures so that they fit the filtering
pre-matching process.
• Generated links: Both frameworks support the generation
of owl:sameAs link types in addition to other RDF
link [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] types. When links are generated, they are
evaluated by the framework and then accepted
or rejected before being released to output files. However,
in SILK, the user can intervene in filtering the results
and accepting the links, by approving links with similarity
values under the threshold or declining links with high
similarity values (crowd-sourcing).
• Links evaluation: SILK lets the user optionally state, for each
comparison, whether the measure is mandatory
(required) or not, and optionally give each measure a
certain weight of the total weight. In LIMES,
those parameters are not optional. Every measure that is
specified is required, with no ability
to disregard its similarity values. In addition, all
measures are weighted equally in LIMES (there is no weight
parameter).
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>V. PRE-EVALUATION PHASE</title>
      <p>Before evaluating the performance of the two frameworks and their
impact on link quality, the data to be matched should be
carefully selected and the link specification parameters set,
taking into account the difference in the threshold
concept of each framework.</p>
      <sec id="sec-5-1">
        <title>A. Comparison Measures</title>
        <p>There are five string-specific measures common to both tools.
While SILK separates character-based from token-based
measures for strings, LIMES does not. For numeric measures,
SILK provides a date-specific measure, whereas LIMES does
not. In fact, the only numeric measure supported by LIMES
is the Euclidean distance. SILK classifies the wgs84
measure as numeric, although wgs84 is specific
to geo properties (such as georss:point). This led us to
include it under the geo type too, alongside the 19
geo-specific LIMES measures. In addition, SILK has
two extensions, Spatial Relations and Temporal Relations,
which provide spatial distances (centroid and minimum distance)
and temporal distances (days, hours, milliseconds,
minutes, months, seconds, and years). Therefore, the only
measures common to the two tools are Jaro, JaroWinkler,
Levenshtein, Cosine and Jaccard. It is thus most relevant to
compare the two frameworks based on these measures, in order
to detect the differences in their behavior precisely and fairly.</p>
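        <p>As an illustration of one of the common measures, the sketch below computes a token-based Jaccard similarity. The tokenization (lower-casing and whitespace splitting) is an assumption made for this example; SILK and LIMES may tokenize differently.</p>
        <preformat>
```python
def jaccard(a, b):
    """Token-based Jaccard similarity between two strings, in [0, 1]."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    union = tokens_a.union(tokens_b)
    if not union:
        # Two empty strings are treated as identical (a convention).
        return 1.0
    return len(tokens_a.intersection(tokens_b)) / len(union)
```
        </preformat>
        <p>Because it works on token sets, this measure is insensitive to token order, whereas character-based measures such as Levenshtein are not.</p>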
      </sec>
      <sec id="sec-5-2">
        <title>B. Threshold-based Similarity and Distance</title>
        <p>
          Based on the link discovery behavior, we can formally use
two types of threshold [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]:
• Link Discovery on the similarity threshold. Given two
sets S and T of instances, a similarity measure ρ over the
properties of s ∈ S and t ∈ T and a similarity threshold
τ ∈ [0, 1], the goal of LD is to compute the set of pairs
of instances (s, t) ∈ S × T such that ρ(s, t) ≥ τ.
• Link Discovery on the distance threshold. Given two sets
S and T of instances, a distance measure ι over the
properties of s ∈ S and t ∈ T and a distance threshold
θ ∈ [0, +∞[, the goal of LD is to compute the set of pairs
of instances (s, t) ∈ S × T such that ι(s, t) ≤ θ.
        </p>
        <p>While LIMES uses similarities, SILK works with distances.
Therefore, we use the setting τ = 1/(1 + θ) to transform the
distance threshold θ to the similarity threshold τ.</p>
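        <p>The conversion between the two threshold types can be sketched as a pair of helper functions. The function names are hypothetical; the formula is the setting stated above, and its inverse follows by solving for the distance threshold.</p>
        <preformat>
```python
def distance_to_similarity_threshold(theta):
    """Map a distance threshold theta in [0, +inf) to tau = 1 / (1 + theta)."""
    if theta >= 0:
        return 1.0 / (1.0 + theta)
    raise ValueError("distance threshold must be non-negative")


def similarity_to_distance_threshold(tau):
    """Inverse mapping: theta = 1 / tau - 1, for tau in (0, 1]."""
    if tau > 0 and 1 >= tau:
        return 1.0 / tau - 1.0
    raise ValueError("similarity threshold must lie in (0, 1]")
```
        </preformat>
        <p>Under this setting, similarity thresholds of 0.6, 0.8 and 0.95 correspond to distance thresholds of about 0.667, 0.25 and 0.053 respectively.</p>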
        <p>Moreover, it is worth noting that not all measures available
in SILK are normalized: Levenshtein, wgs84, date, dateTime,
and num20 are not normalized. The use of non-normalized
measures may lead to similarity values that are higher than the
threshold set. For instance, Normalized Levenshtein Distance
should be used instead of Levenshtein in SILK. Conversely,
all measures are normalized in LIMES.</p>
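        <p>The difference between a raw and a normalized edit distance can be illustrated as follows. Dividing by the length of the longer string is a common convention and is assumed here; it is not confirmed to be the exact normalization used by SILK.</p>
        <preformat>
```python
def levenshtein(a, b):
    """Raw edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,         # deletion
                               current[j - 1] + 1,      # insertion
                               previous[j - 1] + cost)) # substitution
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```
        </preformat>
        <p>The raw distance grows with string length, so a single threshold in [0, 1] is only meaningful for the normalized variant.</p>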
        <p>In our tests, we used thresholds of 0.6, 0.8 and 0.95 (according
to the LIMES threshold concept) in order to calculate the precision
and compare it between both frameworks.</p>
        <p>Fig. 3. The measurement similarity between the two datasets is based on
intervention name of linkedct and title of the drug from drugbank.</p>
      </sec>
      <sec id="sec-5-3">
        <title>C. Datasets &amp; Link Specifications</title>
        <p>Linking biomedical datasets can benefit
global health systems. Therefore, in this
experiment, we test and study the behavior of both SILK
and LIMES in link discovery between the following two
biomedical datasets:</p>
        <p>
          LinkedCT (http://linkedct.org/) is a dataset derived from a service named
ClinicalTrials.gov, which is initially provided by the U.S. National
Institutes of Health. This service is mainly a registry
of more than 60 thousand clinical trials conducted
in 158 countries. Each clinical trial is associated with relevant
information such as a brief description of the trial, related disorders
and interventions, eligibility criteria, sponsors, and
locations (investigators). The RDF version of the dataset
contains 48,909,090 triples and 2,023,055 links [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
        </p>
        <p>
          DrugBank (https://old.datahub.io/dataset/bio2rdf-drugbank) is a large repository of around 5000 FDA-approved small
molecule and biotech drugs. It contains
detailed information about drugs (pharmacological, chemical
and pharmaceutical data) in addition to comprehensive drug
target data (such as structure, sequence, and pathway information).
The Linked Data version of DrugBank contains
3,649,531 triples and 1,828,410 links [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
        </p>
        <p>Regarding comparison metrics, the common measures identified above
(Levenshtein, Jaro, JaroWinkler, Jaccard and
Cosine) were tested, in order to guarantee the most relevant results. The
properties compared were the title property of DrugBank and
the intervention name property of LinkedCT, which holds the drug name.
Run-times were then measured.</p>
      </sec>
      <sec id="sec-5-4">
        <title>D. Gold Standard</title>
        <p>To evaluate the links created when testing the SILK and LIMES
discovery frameworks, we chose to leverage the existing
"seeAlso" links between LinkedCT and DrugBank. Accordingly,
52084 links were extracted from the LinkedCT dataset and prepared
to be used as a gold standard. Moreover, we developed a
Java application (https://github.com/housseindh/LinkDiscoveryEvaluationMetrics) to compare the links discovered by SILK and
LIMES with the gold standard and to measure the quality
metrics of the links. The application takes the gold standard and
the generated (.nt) file, compares them, and calculates
various metrics (such as precision and recall) detailed in the
next section.</p>
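        <p>The comparison against the gold standard can be sketched as set operations over (subject, object) pairs. This is a hypothetical Python illustration, not the authors' Java application. It assumes every link is one N-Triples line with URI subject and object separated by whitespace, and it ignores the predicate so that links expressed with different link types can still be matched.</p>
        <preformat>
```python
def read_links(lines):
    """Collect (subject, object) pairs from simple N-Triples lines."""
    links = set()
    for line in lines:
        parts = line.split()
        # Expect exactly: subject predicate object "."
        if len(parts) == 4 and parts[3] == ".":
            links.add((parts[0], parts[2]))
    return links


def confusion_counts(generated, gold):
    """Return (TP, FP, FN) for generated links against a gold standard."""
    true_positives = len(generated.intersection(gold))
    false_positives = len(generated.difference(gold))
    false_negatives = len(gold.difference(generated))
    return true_positives, false_positives, false_negatives
```
        </preformat>
        <p>In practice the lines would be read from the gold standard file and from the .nt file produced by each framework.</p>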
      </sec>
    </sec>
    <sec id="sec-6">
      <title>VI. EVALUATION</title>
      <sec id="sec-6-1">
        <title>A. Experimental objectives and Set-Up</title>
        <p>The twofold objective of the study of SILK and LIMES
is to: 1) evaluate the quality of discovered links in the case of
biomedical datasets, according to two dimensions, thresholds
and similarity measures; and 2) evaluate the run-time of each
experiment.</p>
        <p>We built our scenario around the Intervention Name
property of type Drug as source and the Title property of Drug
instances as target, using the "LinkedCT" and "DrugBank" datasets
respectively. We chose these datasets so as to emphasize
the difference at the level of link quality. For instance, the
compared strings can differ in length and
may contain token permutations. Fig. 3 shows examples of
triples from the two datasets as well as their similar properties.</p>
        <p>We ran this test on 346,576 distinct entities of the source
dataset (LinkedCT) against 7,678 entities of the target dataset
(DrugBank), using each of the four
string-specific comparison measures (Levenshtein, Jaccard, Jaro and
JaroWinkler). The cosine comparison measure was excluded because
the data was not compatible with it in SILK. Testing with
different thresholds is important because correct
links sometimes have low similarity values, and detecting them
requires lower thresholds.</p>
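        <p>Why the choice of measure and threshold matters can be illustrated
with a small sketch. The Python code below assumes a Levenshtein similarity
normalized by the length of the longer string and a Jaccard similarity over
whitespace-separated tokens; SILK and LIMES may normalize differently, and the
strings are made up for illustration.</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard coefficient over lower-cased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta.union(tb)
    return 1.0 if not union else len(ta.intersection(tb)) / len(union)


def is_link(similarity: float, threshold: float) -> bool:
    """A pair is accepted as a link when its similarity reaches the threshold."""
    return similarity >= threshold


# Made-up strings: a misspelling and a token permutation
for left, right in [("aspirin", "asprin"), ("sodium chloride", "chloride sodium")]:
    print(left, "|", right,
          "| Levenshtein:", round(levenshtein_similarity(left, right), 3),
          "| Jaccard:", round(jaccard_similarity(left, right), 3))
```

        <p>Under these assumptions, a single-character misspelling passes a
0.8 threshold with Levenshtein but not a 0.95 threshold, whereas a token
permutation leaves the Jaccard similarity at 1.0.</p>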
        <p>All experiments were performed on a laptop equipped with an
Intel Core i7 quad-core processor (2.90 GHz) and 20 GB of RAM,
running Windows 10 and Java (JDK/JRE 1.8), with
the maximum heap size set to 10 GB.</p>
        <p>To evaluate the correctness of the links generated by the
matching process, three measures should be calculated for
different experiment sets: Precision, Recall, and F-Score.</p>
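        <p>
          <disp-formula>
            <tex-math>\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN} \tag{1}</tex-math>
          </disp-formula>
        </p>
        <p>
          <disp-formula>
            <tex-math>\mathrm{F\text{-}Score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}</tex-math>
          </disp-formula>
        </p>
        <p>where TP = True Positives, FP = False Positives and FN = False
Negatives.</p>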
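        <p>The three measures can be computed from the TP, FP and FN counts as
follows. This is a minimal Python sketch; the function names are illustrative,
and returning 0.0 when a denominator is zero is an assumption of the sketch
rather than something the text specifies.</p>

```python
def precision(tp: int, fp: int) -> float:
    """Share of discovered links that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if tp + fp > 0 else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of gold-standard links that were discovered: TP / (TP + FN)."""
    return tp / (tp + fn) if tp + fn > 0 else 0.0


def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0


# Made-up counts, for illustration only
tp, fp, fn = 6, 2, 4
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 4), round(r, 4), round(f_score(p, r), 4))
```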
      </sec>
      <sec id="sec-6-2">
        <title>B. Experimental Results</title>
        <p>The columns of Table II give, as the average of three runs, the
number of False Positives (FP), True Positives (TP) and False Negatives (FN)
for three thresholds (0.95, 0.8, 0.6) and four
similarity measures. As an overall
observation, TP remained approximately constant across
tests, at a value that corresponds to the number of entities in
the gold standard dataset. LIMES performed
particularly well in this respect, retaining the same values across
thresholds with both Levenshtein and Jaccard, whereas SILK
did so only with Jaccard. All other values varied
along the three dimensions of the computation: framework,
threshold and similarity measure.</p>
        <p>Despite the 10 GB of heap memory, the two similarity
measures Jaro and JaroWinkler failed to produce a result at a
threshold of 0.6; the runs ended with the Java errors “GC overhead limit exceeded”
and “Java heap space”.</p>
        <p>Regarding the run-time evaluation, Fig. 4a summarizes the
results of our experiments with SILK and LIMES. As an overall
observation, Jaccard achieved the shortest run-time for all
thresholds. Comparing LIMES with SILK,
we observe that in most experiments LIMES is faster than
or approximately as fast as SILK. The only case where the SILK
run-time noticeably exceeded that of LIMES was with the JaroWinkler
comparison operator at a threshold of 0.8.</p>
        <p>Fig. 4b and 4c show the quality metrics of link discovery.
Both frameworks achieved very good results in terms of
recall. In terms of precision, LIMES maintained very
good results at all thresholds with Levenshtein
and Jaccard, whereas SILK had poor results at the 0.6 threshold with
Levenshtein under the same experiment specifications. Moreover,
in the Jaro and JaroWinkler experiments, the
results were poor for both frameworks, and worse
for LIMES.</p>
        <p>In order to evaluate the effectiveness and efficiency of
these two frameworks together, we propose an equation that
relates the F-Score to the run-time
duration. However, because the execution time differs considerably
from one experiment to another, far more than the F-Score does,
we apply the logarithm
function to smooth out the high impact of run-time for big
values. We therefore propose to evaluate effectiveness and
efficiency using the following EE equation:</p>
        <p>
          <disp-formula>
            <tex-math>EE = \frac{\mathrm{F\text{-}Score}}{\log(\mathrm{Runtime})} \tag{2}</tex-math>
          </disp-formula>
        </p>
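        <p>A minimal Python sketch of the EE measure follows, assuming it is the
ratio of the F-Score to the logarithm of the run-time, as the text describes.
The natural logarithm and a run-time expressed in seconds are assumptions; the
text states neither the base nor the unit. The input values are made up.</p>

```python
import math


def ee(f_score: float, runtime: float) -> float:
    """EE = F-Score / log(Runtime).

    Assumes runtime > 1 (in the chosen unit) so that the logarithm is positive.
    """
    if not runtime > 1:
        raise ValueError("runtime must be greater than 1 for a positive logarithm")
    return f_score / math.log(runtime)


# Same F-Score, run-times that differ by a factor of 100 (made-up values)
fast = ee(0.9, 100.0)
slow = ee(0.9, 10000.0)
print(round(fast, 4), round(slow, 4))
```

        <p>Because of the logarithm, a run that takes 100 times longer at the
same F-Score only halves EE in this example, instead of dividing it by 100.</p>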
        <p>Looking at the EE values in Fig. 4d, LIMES appears
to maintain consistent effectiveness regardless
of the threshold for both the Levenshtein and
Jaccard similarity measures, whereas SILK was remarkably unsteady.</p>
      </sec>
      <sec id="sec-6-3">
        <title>C. Technical Evaluation</title>
        <p>
          The authors of LIMES acknowledge a drawback of the framework [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]: it is
optimized only for measures that are metrics, which excludes
non-metric measures such as JaroWinkler. It favors performance
and ease of use over recall and precision, and regards
the user's adjustment of the thresholds
as human feedback that can compensate for this drawback.
        </p>
        <p>The evaluation using Precision, Recall and F-Score
supports the view that LIMES theoretically has better chances
than SILK on large datasets. However, the close
results obtained by SILK and LIMES in the current tests do
not exclude SILK from the set of efficient universal link discovery
frameworks.</p>
        <p>
          Both frameworks perform a pre-matching step
to improve the performance of comparison: to
speed the process up, they aim to reduce the
number of comparisons that must be carried out. SILK uses
indexing, while LIMES uses the triangle inequality. The
algorithm used in LIMES computes
exemplars from the resources and filters candidates against them before
computing the similarity and serializing the result [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]; this is the reason
why LIMES is faster and less likely to miss links.
The exemplars, each of which represents a portion
of the metric space, are selected to be as dissimilar as
possible within the data, and their distribution allows the filtering
process to be parallelized before matching. Filtering takes place by matching
each point to an exemplar in order to compute pessimistic estimates of
instance similarities, so that only pairs that cannot reach the threshold
are discarded and links are not missed. In SILK,
the indexing process divides the data into blocks and
indexes them by some of their properties (mostly labels);
then, for each comparison, matching is performed only on
potential blocks. This lowers the number of comparisons
and thus speeds up the process, but it does not guarantee that
no links are missed.
        </p>
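        <p>The exemplar-based filtering described above can be sketched as
follows. This is a simplified Python illustration of filtering with the
triangle inequality, not the actual LIMES implementation: the greedy
farthest-first exemplar selection stands in for “as dissimilar as possible”,
the threshold theta is expressed as a maximum distance rather than a minimum
similarity, and all function names and the toy one-dimensional data are
assumptions. Any true metric, such as the Levenshtein distance, could be
passed in place of the absolute difference used here.</p>

```python
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Metric = Callable[[T, T], float]


def pick_exemplars(targets: Sequence[T], k: int, dist: Metric) -> List[T]:
    """Greedy farthest-first choice of up to k mutually distant exemplars.

    Assumes targets is non-empty; always returns at least one exemplar.
    """
    exemplars = [targets[0]]
    while min(k, len(targets)) > len(exemplars):
        # Next exemplar: the target farthest from its closest exemplar so far.
        exemplars.append(max(targets, key=lambda t: min(dist(t, e) for e in exemplars)))
    return exemplars


def assign(targets: Sequence[T], exemplars: List[T], dist: Metric) -> Dict[int, List[Tuple[T, float]]]:
    """Attach each target to its nearest exemplar and remember that distance."""
    groups: Dict[int, List[Tuple[T, float]]] = {i: [] for i in range(len(exemplars))}
    for t in targets:
        nearest = min(range(len(exemplars)), key=lambda j: dist(t, exemplars[j]))
        groups[nearest].append((t, dist(t, exemplars[nearest])))
    return groups


def match(sources: Sequence[T], exemplars: List[T],
          groups: Dict[int, List[Tuple[T, float]]],
          dist: Metric, theta: float) -> Tuple[List[Tuple[T, T]], int]:
    """Return (pairs within distance theta, number of full comparisons made)."""
    links: List[Tuple[T, T]] = []
    comparisons = 0
    for s in sources:
        for i, e in enumerate(exemplars):
            d_se = dist(s, e)
            for t, d_et in groups[i]:
                # Triangle inequality: d(s, t) >= |d(s, e) - d(e, t)|.
                # If this lower bound already exceeds theta, (s, t) cannot match.
                if abs(d_se - d_et) > theta:
                    continue
                comparisons += 1
                if not dist(s, t) > theta:
                    links.append((s, t))
    return links, comparisons


def distance(a: float, b: float) -> float:
    """Toy metric: absolute difference between two numbers."""
    return abs(a - b)


# Toy one-dimensional data, for illustration only
target_points = [0, 1, 2, 50, 51, 100]
source_points = [1, 52]
theta = 1
exemplars = pick_exemplars(target_points, 3, distance)
groups = assign(target_points, exemplars, distance)
links, comparisons = match(source_points, exemplars, groups, distance, theta)
print(links, comparisons)
```

        <p>In this toy run only 4 of the 12 possible pairs are compared in full,
and the links found are the same as those of an exhaustive comparison, because
the filter discards only pairs whose lower bound already exceeds the
threshold.</p>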
      </sec>
    </sec>
    <sec id="sec-7">
      <title>VII. CONCLUSION &amp; FUTURE WORKS</title>
      <p>In this paper, we summarized the core differences between
SILK and LIMES and presented an experiment that
evaluates the performance of entity comparison and measures the
quality of discovered links. In particular, we applied the
process to large-scale biomedical data (the LinkedCT and DrugBank
datasets). We performed many experiments to evaluate the
impact of threshold values and of similarity measures on
efficiency and effectiveness, in order to identify the
strengths and weaknesses of each framework. This comprehensive
study clarified and validated the fact that each of SILK and
LIMES has its own advantages and is more appropriate to
specific similarity measures. More specifically, LIMES
flourishes with Levenshtein at all thresholds, while SILK
stands out with Jaccard at low thresholds.</p>
      <p>As future work, we aim to perform more tests on the remaining
comparison measures and on different aggregation
scenarios, in order to investigate the best use-case domain of each
framework. In addition, substantial work will
focus on the active learning that is already
integrated into both frameworks, and on testing performance
in a distributed environment.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Merelli</surname>
          </string-name>
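          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pérez-Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gesing</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>D'Agostino</surname>
          </string-name>
          , “
          <article-title>Managing, analysing, and integrating big data in medical bioinformatics: open problems and future perspectives</article-title>
          ,”
          <source>BioMed Research International</source>
          , vol.
          <volume>2014</volume>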
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Dhayne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Haque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kilany</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Taher</surname>
          </string-name>
          , “
          <article-title>In search of big medical data integration solutions-a comprehensive survey</article-title>
          ,”
          <source>IEEE Access</source>
          , vol.
          <volume>7</volume>
          , pp.
          <fpage>91265</fpage>
          -
          <lpage>91290</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Hitzler</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Janowicz</surname>
          </string-name>
          , “
          <article-title>Linked data, big data, and the 4th paradigm</article-title>
          .”
          <source>Semantic Web</source>
          , vol.
          <volume>4</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>233</fpage>
          -
          <lpage>235</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Dhayne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Chamoun</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sokhn</surname>
          </string-name>
          , “
          <article-title>Survey: When semantics meet crowdsourcing to enhance big data variety</article-title>
          ,” in
          <source>Communications Conference (MENACOMM), IEEE Middle East and North Africa</source>
          . IEEE,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          , “
          <article-title>Linked data: The story so far</article-title>
          ,” in
          <source>Semantic services, interoperability and web applications: emerging concepts</source>
          .
          <publisher-name>IGI Global</publisher-name>
          ,
          <year>2011</year>
          , pp.
          <fpage>205</fpage>
          -
          <lpage>227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Samwald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bouton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Kallesøe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Willighagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hajagos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Marshall</surname>
          </string-name>
          , E. Prud'hommeaux,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          , E. Pichler et al., “
          <article-title>Linked open drug data for pharmaceutical research and development</article-title>
          ,”
          <source>Journal of Cheminformatics</source>
          , vol.
          <volume>3</volume>
          , no.
          <issue>1</issue>
          , p.
          <fpage>19</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Dhayne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kilany</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Haque</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Taher</surname>
          </string-name>
          , “
          <article-title>SeDIE: A semantic-driven engine for integration of healthcare data</article-title>
          ,” in
          <source>2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)</source>
          . IEEE,
          <year>2018</year>
          , pp.
          <fpage>617</fpage>
          -
          <lpage>622</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaveri</surname>
          </string-name>
          , “
          <article-title>Introduction to linked data and its lifecycle on the web</article-title>
          ,” in
          <source>Reasoning Web International Summer School</source>
          . Springer,
          <year>2014</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Beek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Martínez-Prieto</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Arias</surname>
          </string-name>
          , “
          <article-title>LOD-a-lot</article-title>
          ,” in
          <source>International Semantic Web Conference</source>
          . Springer,
          <year>2017</year>
          , pp.
          <fpage>75</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Haque</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.-S.</given-names>
            <surname>Hacid</surname>
          </string-name>
          , “
          <article-title>Blinked data: Concepts, characteristics, and challenge</article-title>
          ,” in
          <source>2014 IEEE World Congress on Services (SERVICES)</source>
          . IEEE
          ,
          <year>2014</year>
          , pp.
          <fpage>426</fpage>
          -
          <lpage>433</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Raimond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sutton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Sandler</surname>
          </string-name>
          , “
          <article-title>Automatic interlinking of music datasets on the semantic web</article-title>
          ,”
          <source>LDOW</source>
          , vol.
          <volume>369</volume>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          , “
          <article-title>LIMES - a time-efficient approach for large-scale link discovery on the web of data</article-title>
          ,” in
          <source>IJCAI</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>2312</fpage>
          -
          <lpage>2317</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          , “
          <article-title>A time-efficient hybrid approach to link discovery</article-title>
          ,”
          <source>Ontology Matching</source>
          , vol.
          <volume>1</volume>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nentwig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hartung</surname>
          </string-name>
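          ,
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          , “
          <article-title>A survey of current link discovery frameworks</article-title>
          ,”
          <source>Semantic Web</source>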
          , vol.
          <volume>8</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>419</fpage>
          -
          <lpage>436</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Volz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaedke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Kobilarov</surname>
          </string-name>
          , “
          <article-title>Silk - a link discovery framework for the web of data</article-title>
          ,”
          <source>LDOW</source>
          , vol.
          <volume>538</volume>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , “
          <article-title>Active learning of expressive linkage rules using genetic programming</article-title>
          ,”
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          , vol.
          <volume>23</volume>
          , pp.
          <fpage>2</fpage>
          -
          <lpage>15</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Beckett</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>McBride</surname>
          </string-name>
          , “
          <article-title>RDF/XML syntax specification (revised)</article-title>
          ,”
          <source>W3C Recommendation</source>
          , vol.
          <volume>10</volume>
          , no.
          <issue>2.3</issue>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kementsietsidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          , “
          <article-title>LinkedCT: A linked data space for clinical trials</article-title>
          ,”
          <source>arXiv preprint arXiv:0908.0567</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Wishart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Knox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hassanali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stothard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Woolsey</surname>
          </string-name>
          , “
          <article-title>DrugBank: a comprehensive resource for in silico drug discovery and exploration</article-title>
          ,”
          <source>Nucleic Acids Research</source>
          , vol.
          <volume>34</volume>
          , no.
          <issue>suppl 1</issue>
          , pp.
          <fpage>D668</fpage>
          -
          <lpage>D672</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>