=Paper= {{Paper |id=Vol-2622/paper2 |storemode=property |title=Data Link Discovery Tools for Big Linked Data: A comprehensive study |pdfUrl=https://ceur-ws.org/Vol-2622/paper2.pdf |volume=Vol-2622 |authors=Houssein Dhayne,Hanan Farhat,Rima Kilany |dblpUrl=https://dblp.org/rec/conf/bdcsintell/DhayneFK19 }} ==Data Link Discovery Tools for Big Linked Data: A comprehensive study== https://ceur-ws.org/Vol-2622/paper2.pdf
    Data Link Discovery Frameworks for Biomedical
         Linked Data: A comprehensive study
                 1st Houssein Dhayne                                     2nd Hanan Farhat                               3rd Rima Kilany
            Faculty of Engineering, ESIB                         Faculty of Engineering, ESIB                   Faculty of Engineering, ESIB
               Saint Joseph University                              Saint Joseph University                        Saint Joseph University
                  Beirut, Lebanon                                      Beirut, Lebanon                                Beirut, Lebanon
           houssein.dhayne@net.usj.edu.lb                         hanan.farhat@net.usj.edu.lb                      rima.kilany@usj.edu.lb



   Abstract—Data discovery, linking and integration techniques                      Linked Open Data (LOD) is a set of best practices to publish
are of great importance for the big data variety challenge. Linked               RDF linked data on the Web in a machine-readable way,
Open Data (LOD) and Semantic Web technologies have worked                        with an explicitly defined semantic meaning, linked to other
as a driver to address this challenge. However, until 2015, the
linkage of triples of LOD has increased to 40%, of which only 3%                 datasets and allowed to be searched for. LOD principles can
of overall triples are links between different datasets. Today, with             be summarized in publishing of open, linked and structured
the increasing amount of available LOD datasets, 9671 datasets                   data, in non-proprietary formats using URIs. An indexed
compose the LOD, the need to link them together is becoming                      ready-to-consume crawl of a large portion of LOD (see
vital. Links are usually generated, or discovered, by specific                   Fig.1), called LOD-a-lot1 , contains 28,362,198,927 triples,
frameworks such as SILK and LIMES, which are two of the
most effective tools in this domain. They apply instance matching                made up of 3,214,347,198 subjects, 1,168,932 predicates, and
rather than ontology matching, and support active learning. They                 3,178,409,386 objects [9]. With this increasing volume of
both have their drawbacks and their advantages, which makes                      datasets, the term Big Linked Data has been appearing
it hard to disregard one of them. This paper aims to evaluate                    in the research literature. Big Linked Data is an instance of Big
whether SILK and LIMES are potential options for interlinking                    Data that is the union of big and linked data, where authors in
large-scale biomedical datasets, comparing the two frameworks
at many levels, starting from the general features, reaching the                 [10] presented their list of characteristics, which was created
comparison measures, the resulting files, the performance and                    by unifying the characteristics of Big and Linked Data.
the effectiveness of the links produced. The conclusions drawn                      Moreover, until 2015, the linkage of triples of LOD has
from this work are to be used as a reference for the evaluation of               increased to 40%, of which only 3% of overall triples are links
the core differences between SILK and LIMES and therefore for                    between different datasets (Fig. 1 showing the growth from
choosing the most suitable tool in a Biomedical context. It can be
considered as an opening for future research and enhancements                    2007 to 2016). New problems are therefore arising that require
of such frameworks.                                                              new solutions from the data science community. Hence,
   Index Terms—Semantic Web, Link Discovery, Biomedical                          the importance of Link-Discovery Frameworks, which are
Linked Data, Data Links                                                          responsible for creating the links, has increased, taking into
                                                                                 consideration the efficiency and effectiveness. Efficiency is
                            I. INTRODUCTION                                      the optimized process run-time, in addition to the execution
                                                                                 time of the preceding and the following steps, excluding
   With the rise of Big Data awareness among biomedical                          complex criteria and computations that consume both time
providers, there is a need to harness the techniques of data                     and resources. Effectiveness, on the other hand, means that the
integration and analytics to create significant value towards                    resulting links are accurate and complete.
aiding the process of care delivery and disease exploration [1],                    The Link-Discovery problem can be defined as a task that takes
[2]. Among the different dimensions that characterize big data,                  two datasets as input and produces a set of links between
the variety dimension seems to be the most intriguing one for                    entities of the two datasets as output. Formally, let S (source)
the Semantic Web and the one where the research community                        and T (target) be two sets of RDF instances, let s and t be
can contribute [3], [4]. The Resource Description Framework (RDF)                instances of S and T respectively, and let $\Theta \in [0, 1]$ be a
paradigm, published on the Web in accordance with the                            similarity threshold. Link-Discovery is the process that finds
Linked Data principles and best practices [5] and containing                     the set of all pairs $(s, t) \in S \times T$ that are linked by a
information about genes, proteins, pathways, diseases, and                       relation $\varphi$, judged on their properties with a similarity
drugs [6], has evolved as a powerful enabler for the transition                  metric $\rho$: if $\rho(s, t) \geq \Theta$, then the two entities s
of the current unstructured data into interlinked Data [7].                      and t are considered to be linked by $\varphi$.
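
The formal definition of Link-Discovery can be made concrete with a minimal Python sketch. It assumes string-valued instances and uses the standard-library difflib ratio as a stand-in for the similarity metric; the names rho and discover_links and the sample drug labels are illustrative only and are not taken from SILK or LIMES.

```python
from difflib import SequenceMatcher
from itertools import product


def rho(s: str, t: str) -> float:
    """Illustrative similarity metric in [0, 1] (not a SILK or LIMES measure)."""
    return SequenceMatcher(None, s.lower(), t.lower()).ratio()


def discover_links(source, target, theta):
    """Return every pair (s, t) in S x T whose similarity reaches theta."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    return [(s, t) for s, t in product(source, target) if rho(s, t) >= theta]


# Hypothetical sample data: two tiny "datasets" of drug labels.
S = ["Amoxicillin", "Ibuprofen"]
T = ["amoxicillin", "Ibuprofene", "Paracetamol"]
links = discover_links(S, T, theta=0.9)
print(links)
```

The sketch compares every pair in S x T, so its cost grows with |S| * |T|; avoiding this quadratic cost is what the run-time optimizations of link-discovery frameworks are for.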
For instance, linked data solve the integration of unstructured                     This paper aims to compare two Link-Discovery Frame-
data by replacing or annotating the data elements, of medical                    works, SILK and LIMES, at many levels starting from the
texts or images, with unique identifiers, providing a structured
querying of multiple heterogeneous sources [8].                                    1 http://lod-a-lot.lod.labs.vu.nl/



 Copyright © 2019 for this paper by its authors. Use permitted under Creative
 Commons License Attribution 4.0 International (CC BY 4.0).



                                                                     test is shown in details; the source datasets and target datasets,
                                                                     the source and target classes, the properties compared and
                                                                     the comparison operators used for each test. As a result,
                                                                     it is notable that the choice of the comparison operator is
                                                                     pertinently related to the property to compare.
                                                                        A study [14] published in 2015 and updated in 2017 has
                                                                     handled the issue of link-discovery frameworks, and compared
   Fig. 1. Growth of the Linking Open Data Cloud from 2007 to 2016   ten of them using a specified benchmark. The frameworks
                                                                     compared were RiMOM, KnoFuss, AgreementMaker, CODI,
                                                                     SERIMI, LogMap, SLINT+, Zhishi.links, SILK and LIMES.
general features, the comparison measures, the resulting files       However, these ten tools can be classified as follows: Machine
and reaching the performance. In addition, it aims to evaluate       Learning(Involved or Not Involved), and Matching (Ontology
the quality of links produced.                                       Matching or Instance Matching). Seven frameworks appeared
   The rest of the paper is structured as follows: Section 2 sheds   to exclude machine learning from the matching process, and
light on comparison studies covering both frameworks,                support ontology matching. Yet, Semantic Web research has
SILK and LIMES. Section 3 shows the general process of               gradually shifted from ontology matching to instance matching,
link discovery frameworks. In Section 4 we give an overview          given the heterogeneity of documents and real-world
of SILK and LIMES respectively and compare their general             entities, which may appear at many web locations with
features, while in Section 5, the criteria used to perform the       different descriptions. As the objective of the Semantic Web was
comparison experiments are detailed. Section 6 presents the          to make data more understandable by machines, machine
results of our experiments, and finally, Section 7 concludes         learning set the best example for making use of linked data.
and provides recommendations for future work.                        Therefore, three frameworks: SILK, LIMES, and KnoFuss,
                                                                     classified as instance matching frameworks including machine
                      II. RELATED WORK
                                                                     learning, came into focus. In contrast to SILK and
   Link Discovery frameworks are divided into two types:             LIMES, KnoFuss supports only one type of data link,
Domain Specific (e.g., GNAT, specific for music [11]) and            owl:sameAs, whereas SILK and LIMES also support other RDF
Universal frameworks. SILK and LIMES are both Universal              link types such as alternate, start, and next.
Frameworks that aim to generate links between entities of               Consequently, it would be essential for any comparison
data resources. They have many common features, as well              study to start by the classification of comparison measures
as dissimilar ones that we will detail later.                        offered by both tools, and this is what will be detailed in
   Many studies compared SILK and LIMES. Nevertheless,               Section IV-C. An overview of both SILK and LIMES and a
these studies covered only the efficiency challenge (run-time),      description of their internal architecture are presented in the
assuming that effectiveness (link quality) is guaranteed. The        following sections.
main two studies comparing the frameworks were published in
2011, where the first compared SILK of version 2 to LIMES of                        III. LINK DISCOVERY PROCESS
version 0.3.21 [12], and the second compared SILK of version            The matching process is the core part of link-discovery. A
2.3 to LIMES of version 0.5 [13]. Both studies ended favoring        single comparison can be summarized in a few steps. First,
LIMES over SILK speed wise. However, we cannot rely on               the datasets from which the instances will be picked should
this result since both frameworks have newer and refactored          be determined (a source and a target). Second, the required
versions.                                                            classes are to be picked, and then comes the choice of
   An important element that makes the comparison unfair             instances that will be compared. Each instance has its description
in these two studies is the difference in how comparison
thresholds are defined. While LIMES, even in older versions, had
the option for specifying the threshold for each operator by
itself in addition to the threshold for the aggregated output, in
SILK the threshold could only be specified for the output of
the aggregation. This per-comparison threshold was recently
introduced into the newer versions of SILK.
   Even though effectiveness or link quality was out of concern
in the latest studies on SILK and LIMES, the implemented
tests, whether comparing those two frameworks or evaluating
the performance of each alone, have helped develop the
criterion to be followed in order to evaluate generated links
quality, and deduce which framework, SILK or LIMES, has
better impact on effectiveness. Some of those constructive
studies are summarized in Table I, where each implemented                   Fig. 2. General Workflow of Link Discovery Frameworks




                                                               TABLE I
                                                COMPARISONS APPLIED IN RECENT STUDIES.

   #  Study                                Source Dataset/Class     Target Dataset/Class       Properties                                      Comparison Operator
   1  LIMES - A Time-Efficient Approach    DBpedia/Villages         LinkedCT/Villages          rdf:type vs. linkedct:condition_name            Levenshtein
      for Large-Scale Link Discovery       DBpedia                  DrugBank                   rdfs:label                                      Levenshtein
      on the Web of Data [12]              DBpedia                  DBpedia                    rdfs:label                                      Levenshtein
                                           MESH                     LinkedCT                   rdfs:label                                      Levenshtein

   2  A Time-Efficient Hybrid Approach     DBpedia/cities & towns   Geonames/populatedPlaces   gn:name vs. gn:alternateName                    Trigrams for strings,
      to Link Discovery [13]               Linkedgeodata            Geonames                   wgs84:long vs. wgs84:long                       Euclidean for numerics
                                                                                               wgs84:lat vs. wgs84:lat
                                           DBpedia                  Linkedgeodata              rdfs:label, population

   3  SILK: A Link Discovery Framework     DBpedia/cities           Geonames/PopulatedPlaces   rdfs:label vs. gn:name                          JaroSimilarity
      for the Web of Data [15]                                                                 rdfs:label vs. gn:alternateName                 JaroSimilarity
                                                                                               foaf:page vs. gn:wikipediaArticle               maxSimilarityInSets
                                                                                               dbpedia:populationTotal vs. gn:population       numSimilarity
                                                                                               wgs84_pos:lat vs. wgs84_pos:lat                 numSimilarity
                                                                                               wgs84_pos:long vs. wgs84_pos:long               numSimilarity

   4  Active Learning of Expressive        SiderDrugBank                                       Out of concern, active learning is the target   Levenshtein
      Linkage Rules using Genetic          NewYorkTimes                                                                                        Jaccard
      Programming [16]                     LinkedMDB                                                                                           Numeric
                                           DBpediaDrugBank                                                                                     Geographic



defined by Subjects and Values of the Subjects (Ex. subject:              in cases of learning-based matching, and this intervention is
Name, value: Amoxicillin; subject: Chemical Formula, value:               a candidate role for crowd-sourcing. SILK and LIMES are
C16 H19 N3 O5 S). Next, the file containing Link specifications           instance matching frameworks and thus they are not involved
is imported, subjects are specified, and operators of both                in ontology matching. When the matching process is complete,
aggregation and comparison are chosen, in order to execute                the result is the set of discovered link candidates, each with a
the comparison operation. Finally, the generated links should             percentage or relative value indicating its accuracy.
be filtered into link-candidates to be evaluated and exported
to users in user-specified output files.                                     Post-processing the outcome is evaluating the link candi-
   A link discovery process can be divided into 3 stages                  dates that are below the acceptance threshold and above the
(Fig. 2), and all Link Discovery Frameworks apply this process            verification threshold. This stage can be done automatically
in order to generate relatively accurate links.                           (using Machine Learning) based on the framework architecture,
   The first stage is the Pre-Matching stage. It is concerned             or manually, which would be a form of crowd-sourcing.
with configuring the framework in an optimized manner, and                Finally, the links are exported in a user-specified format,
includes bringing to-be-linked data from their corresponding              then published or saved.
resources, which are a source dataset and a target dataset
                                                                                         IV. SILK & LIMES IN A NUTSHELL
that might be in the form of RDF dumps or SPARQL end-
points. After source and target datasets are drawn out, the               A. SILK
specifications of the to-be-generated links are written into a               SILK [15] is an open-source link discovery framework for
file and imported by the framework to compare data based                  the web, based on RDF. SILK offers a Workbench that enables
on them. Some other framework-specific parameters are to be               the user to manage sets of data sources, easily edit and observe
stated in the pre-processing stage too, as for example, the link          linking and transformation tasks graphically, create and edit
acceptance threshold value. In addition, if machine-learning is           reference links, and easily evaluate the generated links.
involved in the link discovery process, training datasets are to             SILK queries the data lists drawn out of the correspond-
be imported at this stage. On the other hand, some frameworks             ing resources using resources listers. The source and target
provide the option of benefiting from external resources like             datasets could be local data dumps, or resources accessed
dictionaries of RDF vocabulary or previous mappings that are              via a SPARQL endpoint. Only the target lists pass through
a form of crowds participation (crowd-sourcing).                          an indexer that is responsible for indexing them as a pre-
   As the Pre-Matching stage ends, the matching stage starts,             processing step, in order to facilitate the matching process.
and it can be of two types: Instance Matching or Ontology                 The indexing process separates data into blocks and indexes
Matching. End-users can intervene in the automated process                them by one or more of their properties (mostly labels). The




source lists do not pass through the indexer but are directly        The distance from x to z can be approximated when knowing
cached on disk in order to be retrieved later at the matching        the distance from x to reference point y as well as the distance
stage. The reason why only target lists get indexed is that          from reference point y to z. The reference point y is called an
the matching process will be executed on each instance of            exemplar [13], and is used to define the center of a portion
the source list, in order to compare it to the best potential        of the total metric space. Exemplars help calculate the upper
matches from the target list. Indexing the target list               and lower bounds of the distance from point x to point z,
yields a run-time optimization at the comparison level of            so when compared to theta, the threshold, the decision can be
the matching process. This time-optimizing step might miss           taken regarding link generation. A set of exemplars is selected
some links when it excludes blocks of lower matching                 so that they are distributed uniformly across the metric
potential that contain correct links. Only the retained links will   space and are as dissimilar as possible. The approximations
be written to an output file whose format can be specified by        of distances from points to exemplars allow reducing the time
the user (e.g., CSV).                                                needed for comparisons.
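
The blocking step that SILK applies to the target list can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SILK's indexer: the blocking key (the first three characters of a lowercased label) and the names block_key, build_index, and candidate_pairs are hypothetical, as are the sample labels.

```python
from collections import defaultdict


def block_key(label: str) -> str:
    """Hypothetical blocking key: first three characters of the lowercased label."""
    return label.strip().lower()[:3]


def build_index(target_labels):
    """Index the target list only, grouping labels into blocks by key."""
    index = defaultdict(list)
    for label in target_labels:
        index[block_key(label)].append(label)
    return index


def candidate_pairs(source_labels, index):
    """Pair each source label with the targets of its own block only."""
    for s in source_labels:
        for t in index.get(block_key(s), []):
            yield s, t


targets = ["Aspirin", "Aspartame", "Ibuprofen", "Acetylsalicylic acid"]
index = build_index(targets)
pairs = list(candidate_pairs(["aspirin", "ibuprofene"], index))
print(pairs)
```

Only three of the eight possible pairs are compared. The pair ("aspirin", "Acetylsalicylic acid") is never generated because the two labels fall into different blocks, which illustrates how blocking can miss a correct link in exchange for fewer comparisons.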
   In the matching process, a similarity value for every pair
of instances is computed and the corresponding aggregation           C. SILK vs. LIMES
metric (specified by the user) is evaluated. Then, comparison           In this section, we will compare SILK and LIMES accord-
measures (that are metrics or semi-metrics) are acted upon           ing to the following features: General Information and Acces-
by RDF path translators, which transform them into SPARQL            sibility, Framework Configuration, Run-time Optimization and
queries and send them to SPARQL endpoints to get evaluated.          Link Discovery and Evaluation. Current studies emphasize
Results of the query are cached temporarily in memory, until         run-time optimization issues while disregarding other impor-
links whose values are above the acceptance threshold are            tant features of the frameworks. In addition, the tests
picked and saved. The number of links to be picked for               were performed on older versions of both frameworks (Table I).
each resource is specified initially by the user. This is called     Hence the need for an updated, complete comparison
link limit. The resulting links are picked from a list of link       that covers all the features of SILK and LIMES in order to
candidates, in which each has a corresponding similarity value       be able to do an informed evaluation of their effectiveness
(similarity between the source instance and target instance),        regarding link quality.
such that they have the highest similarity values within the            1) General Information and Accessibility: Features under
highest potential ”blocks of matching” (according to the             this category are as follows:
calculated indices of instances).                                       • While LIMES is based on Java, SILK's initial release was
B. LIMES                                                                   developed in Python (2009), and the second version was
   The word ”LIMES” [12] stands for Link Discovery Frame-                  reimplemented in Scala (2010).
                                                                                                                  2
work for Metrics Spaces. Like SILK, it is a tool to generate            • A web interface is available for SILK , while a practical

links between similar entities belonging to different data re-             desktop application is available for LIMES3 .
sources. However, LIMES tool estimates the similarity values            • Tools and sources are available to download for both

using the mathematical characteristics of metric spaces. This              frameworks; SILK4 and LIMES5 .
helps reduce the number of comparisons and thus decreases               • Both frameworks have a link specification language;

the complexity of the run-time process.                                    SILK-LSL and LIMES-LSL.
   The mathematical principles underlying the LIMES frame-              • The latest version of SILK at the time of writing was

work are summarized by defining the Metric Space and                       published on 12/2/2016, while that of LIMES was on 4/4/2017.
the Matching Task. The metric space is described by four                2) Framework Configuration: The differences between
conditions: non-negativity, identity of indiscernibles, symmetry,    SILK and LIMES regarding the framework configuration are:
and triangle inequality (TI) [13]. The difference with semi-            • Both frameworks support manual and learning based
metrics is that they do not satisfy the fourth condition of                link discovery process, but SILK supports two additional
metrics (TI). On the other side, a matching task is computing              methods; the generation of links using genetic program-
the list of instances from source and target sets such that they           ming and batch learning [16].
match the metric conditions.                                            • The configuration information of Link Specification file
   The General work-flow of LIMES starts by reading three                  of SILK and LIMES, which specify the elements of the
inputs; source and target datasets, and the link specification             comparison, are very similar (datasets, classes, aggre-
file. After being imported, the data from the datasets is sepa-            gation operators, comparison operators, link limit, xml
rated into ”strings”, ”numeric values and values mappable into             version, etc) but are organized differently.
vector space”, and ”leftover values”. They are then mapped              • The Input/Output file formats supported by both frame-
using String Mapper, Numeric Mapper, and Miscellaneous                     works are various, and differ between SILK and LIMES.
Mapper respectively. The Mappers guarantee that the data is
converted into values belonging to the metric space, according         2 http://SILKframework.org/
                                                                       3 http://aksw.org/Projects/LIMES.html
to the boundary condition realized from the TI [13], as follows:       4 https://github.com/silk-framework/silk

     $m(x, y) - m(y, z) \leq m(x, z) \leq m(x, y) + m(y, z)$                 5 https://github.com/dice-group/LIMES
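
The boundary condition derived from the triangle inequality, m(x, y) - m(y, z) <= m(x, z) <= m(x, y) + m(y, z), can be read as a pruning rule. The following Python sketch is an illustration only and not the LIMES implementation: it assumes a distance-based threshold theta (smaller distances mean more similar instances), and the names ti_bounds and decide are hypothetical. Taking the absolute value of the difference in the lower bound is valid because a metric is symmetric.

```python
def ti_bounds(d_xy: float, d_yz: float):
    """Bounds on m(x, z) implied by the triangle inequality via reference point y."""
    return abs(d_xy - d_yz), d_xy + d_yz


def decide(d_xy: float, d_yz: float, theta: float) -> str:
    """Classify the pair (x, z) against theta without computing m(x, z)."""
    lower, upper = ti_bounds(d_xy, d_yz)
    if lower > theta:
        return "discard"  # m(x, z) >= lower > theta, so (x, z) cannot be a link
    if upper <= theta:
        return "accept"   # m(x, z) <= upper <= theta, so (x, z) must be a link
    return "compute"      # bounds are inconclusive, m(x, z) must be evaluated


print(decide(5.0, 1.0, theta=2.0))
print(decide(0.5, 0.5, theta=2.0))
print(decide(2.0, 1.0, theta=2.0))
```

Only pairs that fall into the "compute" case require an exact distance computation, which is where the saving in comparisons comes from.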




     SILK supports eight source-file formats and LIMES                  to disregard its similarity values. Add to this that all
     supports six, of which XML, RDF Dump, SPARQL                       measures are weighed the same in LIMES (no weight
     endpoint, and N-TURTLE are in common. As for the                   parameter).
     format of the output, both support N-triples.                               V. PRE-EVALUATION PHASE
   • Acceptance threshold: In both SILK and LIMES, each
     specification has a threshold value for accepting the           Before evaluating the two frameworks' performance and
     similarity values. In addition, the user can set threshold    impact on link quality, the data to be matched should be
     values that determine which links are automatically           carefully selected, and the link specification parameters set,
     accepted and which should be reviewed manually. SILK          taking into account the difference in the threshold
     also allows specifying the number of links to be picked       concept of each framework.
     for a single data item: only the highest-rated links per      A. Comparison Measures
     source data item will remain after the filtering.                There exist five String-specific measures in both tools.
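
The combined effect of an acceptance threshold and a link limit can be sketched as follows. This is a minimal Python illustration, assuming candidates arrive as (source, target, similarity) tuples; the function name filter_links and the sample values are hypothetical and do not reproduce the SILK implementation.

```python
from collections import defaultdict


def filter_links(candidates, threshold, link_limit):
    """Keep, per source item, the link_limit best candidates at or above threshold.

    candidates: iterable of (source, target, similarity) tuples.
    """
    by_source = defaultdict(list)
    for source, target, sim in candidates:
        if sim >= threshold:
            by_source[source].append((target, sim))
    kept = []
    for source, links in by_source.items():
        # Highest-rated links first, then cut the list at the link limit.
        links.sort(key=lambda link: link[1], reverse=True)
        kept.extend((source, target, sim) for target, sim in links[:link_limit])
    return kept


# Hypothetical link candidates for two source items.
candidates = [
    ("s1", "t1", 0.95), ("s1", "t2", 0.91), ("s1", "t3", 0.60),
    ("s2", "t4", 0.88),
]
result = filter_links(candidates, threshold=0.9, link_limit=1)
print(result)
```

With a link limit of 1, the second candidate for s1 is dropped even though it passes the threshold, and s2 keeps no link at all because its only candidate falls below the threshold.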
   3) Run-Time Optimization: Run-time is the execution time        SILK separates Character-based from Token-based measures
of the matching process excluding the execution time of pre-       for Strings, whereas LIMES does not. For Numeric measures,
processing and post-processing operations. The frameworks          SILK provides a date-specific measure, whereas LIMES does
achieve run-time optimization as follows:                          not. In fact, the only Numeric measure supported by LIMES
   • Parallel clustering: parallel processing is supported by      is the Euclidean distance measure. SILK classifies the wgs84
     both frameworks using customized versions of                  measure as numeric, although wgs84 is specific
     MapReduce.                                                    to geo properties (such as georss:point). This led us to
   • Pre-processing methods: The pre-processing method             include it under the geo type too, alongside the 19 geo-
     adopted by SILK relies on dividing data into blocks           specific LIMES measures. However, SILK has, in addition,
     in order to enable indexing (often by labels)                 two extensions of Spatial Relations and Temporal Relations,
     and thus reduce comparisons. As for LIMES, the pre-           namely Temporal Distances and Spatial Distances specific
     processing method applied is filtering. Sources of pre-       to centroid, minimum distances, days, hours, milliseconds,
     processing functions are: Xapian search engine library for    minutes, months, seconds, and years. Therefore, the only
     SILK, while the novel version of the LIMES framework          measures common to the two tools are Jaro, JaroWinkler,
     integrates an extended version of PPJoin+ algorithm [13].     Levenshtein, Cosine and Jaccard. It is therefore most relevant to
                                                                   compare the two frameworks based on these measures in order
Efficiency tests performed to compare the run-time of SILK
                                                                   to detect the differences in their behavior, precisely and fairly.
and LIMES concluded that LIMES is faster by 60 times, which
we justify by the difference in the pre-matching methods. Yet,     B. Threshold-based Similarity and Distance
the versions tested upon are old, and should be updated.              Based on the Link discovery behavior, we can formally use
   4) Link Discovery and Evaluation: Link Discovery and            two type of threshold [13]:
Evaluation parameters depend on multiple features, as follows:        • Link Discovery on the similarity threshold. Given two
   • Supported measures: SILK, in its latest version, has 16             sets S and T of instances, a similarity measure ⇢ over the
     similarity measures, versus 60 similarity measures for              properties of s 2 S and t 2 T and a similarity threshold
     LIMES in its 1.1.2 version. However, it should be noted             ⌧ 2 [0, 1], the goal of LD is to compute the set of pairs
     that both frameworks do support the addition of new                 of instances (s, t) 2 S ⇥ T such that ⇢(s, t) > ⌧ .
     measures, with the difference that in LIMES the user             • Link Discovery on the distance threshold. Given two sets
     should add mappers to such measures to fit in the filtering         S and T of instances, a distance measure ◆ over the
     pre-matching process.                                               properties of s 2 S and t 2 T and a distance threshold
   • Generated links: Both frameworks support the generation             ✓ 2 [0, +1[ the goal of LD is to compute the set of pairs
     of owl:sameAs link types in addition to other RDF                   of instances (s, t) 2 S ⇥ T such that ◆(s, t) 6 ✓.
     link [17] types. When links are generated, they are to be        While LIMES uses similarities, SILK works with distances.
     evaluated by the framework, then judged it to be accepted     Therefore, we use the setting ⌧ = (1 + ✓) 1 to transform the
     or not, before releasing them to output files. However,       distance threshold ✓ to the similarity threshold ⌧
     for SILK, the user can interfere in filtering the results        Moreover, it is worth noting that not all measures available
     and accepting the links by approving links with similarity    in SILK are normalized: Levenshtein, wgs84, date, dateTime,
     measures under the threshold, or declining links with high    and num20 are not normalized. The use of non-normalized
     similarity values (crowd-sourcing).                           measures may lead to similarity values that are higher than the
   • Links evaluation: SILK makes evaluation and comparison        threshold set. For instance, Normalized Levenshtein Distance
     weight optional to state whether the measure is mandatory     should be used instead of Levenshtein in SILK. Conversely,
     (required) or not, and optional to give each measure a        all measures are normalized in LIMES.
     certain weight of the total weight too. As for LIMES,            In our tests, we used 0.6, 0.8 and 0.95 thresholds (according
     those parameters are not optional. A measure is supposed      to LIMES threshold concept) in order to calculate the precision
     to be specified when it is required, with no ability          and compare it between both frameworks.
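The conversion between a distance threshold and a similarity threshold, and the effect of normalizing Levenshtein, can be sketched as follows. This is a minimal illustration, not code taken from SILK or LIMES: the function names are hypothetical, and dividing the edit distance by the length of the longer string is one common normalization that we assume here, which may differ from the definition used by either framework.

```python
def distance_to_similarity_threshold(theta: float) -> float:
    """Map a distance threshold theta in [0, +inf) to tau = 1 / (1 + theta)."""
    if theta < 0:
        raise ValueError("a distance threshold cannot be negative")
    return 1.0 / (1.0 + theta)


def similarity_to_distance_threshold(tau: float) -> float:
    """Inverse mapping, theta = 1 / tau - 1, defined for tau in (0, 1]."""
    if not 0.0 < tau <= 1.0:
        raise ValueError("a similarity threshold must lie in (0, 1]")
    return 1.0 / tau - 1.0


def levenshtein(a: str, b: str) -> int:
    """Plain (non-normalized) edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Assumed normalization: edit distance divided by the longer length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


if __name__ == "__main__":
    # Distance thresholds corresponding to the similarity thresholds used in the tests.
    for tau in (0.6, 0.8, 0.95):
        print(tau, round(similarity_to_distance_threshold(tau), 4))
    # A raw edit distance is not bounded by 1, whereas the normalized one is.
    print(levenshtein("Aspirin", "Aspirin 81 mg"))
    print(normalized_levenshtein("Aspirin", "Aspirin 81 mg"))
```

The last two calls show why normalization matters: the raw distance is an unbounded integer that cannot be compared with a threshold in [0, 1], while the normalized value can.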




Fig. 3. The similarity measurement between the two datasets is based on the intervention name from LinkedCT and the drug title from DrugBank.

C. Datasets & Link Specifications
   Linking biomedical datasets will open new possibilities for global health systems and thus for humanity. Therefore, in this experiment, we test and study the behavior of both SILK and LIMES in link discovery between the following two biomedical datasets:
   LinkedCT⁶ is a dataset derived from a service named ClinicalTrials.gov, which is initially provided by the U.S. National Institutes of Health. The service is mainly a registry of more than 60 thousand entries of clinical trials conducted in 158 countries. Each clinical trial is associated with relevant information such as a brief description of the trial, disorders and interventions related to it, eligibility criteria, sponsors, locations (investigators), etc. The RDF version of the dataset contains 48,909,090 triples and 2,023,055 links [18].
   DrugBank⁷ is a large repository of around 5000 small molecule and biotech drugs that are FDA-approved. It contains detailed information about drugs (pharmacological, chemical and pharmaceutical data) in addition to comprehensive drug target data (such as structure, sequence, and pathway information). The Linked Data version of DrugBank contains 3,649,531 triples and 1,828,410 links [19].
   Regarding comparison metrics, the discerned common measures were tested (Levenshtein, Jaro, JaroWinkler, Jaccard and Cosine), in order to guarantee the most relevant results. The properties compared were the title property of DrugBank and, for LinkedCT, the intervention name property (intervention intervention name), which holds the drug name. Then run-times were calculated.

D. Gold Standard
   To evaluate the links created when testing the SILK and LIMES discovery frameworks, we chose to leverage existing “seeAlso” links between LinkedCT and DrugBank. Therefore, 52084 links were extracted from the LinkedCT dataset and prepared to be used as a gold standard. Moreover, we developed a Java application⁸ to compare links discovered by SILK and LIMES with the gold standard and to measure the quality metrics of the links. The application takes the gold standard and the generated (.nt) file in order to compare them and calculate various metrics (such as precision and recall) detailed in the next section.

  ⁶ http://linkedct.org/
  ⁷ https://old.datahub.io/dataset/bio2rdf-drugbank
  ⁸ https://github.com/housseindh/LinkDiscoveryEvaluationMetrics

                       VI. EVALUATION

A. Experimental objectives and Set-Up
   The twofold objective of the study of SILK and LIMES is to: 1) evaluate the quality of discovered links in the case of biomedical datasets, according to two dimensions: thresholds and similarity measures; and 2) evaluate the run-time of each experiment.
   We built our scenario around the Intervention Name property of type Drug as the source and the Title property of Drug instances as the target, using the “LinkedCT” and “DrugBank” datasets respectively. We chose these datasets so as to emphasize the difference at the level of link quality. For instance, the compared data could have strings of different lengths and possible token permutations. Fig. 3 shows examples of triples from the two datasets as well as their similar properties.
   We ran this test on 346576 different entities of the source dataset (LinkedCT) against 7678 entities of the target dataset (DrugBank). We performed it using each of the four String-specific comparison measures (Levenshtein, Jaccard, Jaro and JaroWinkler). The Cosine comparison measure was excluded, as the data was not compatible with it in SILK. Testing with different thresholds is important because correct links sometimes have low similarity values, which require lower thresholds to be detected.
   All experiments were performed on a laptop equipped with an Intel Core i7 quad-core processor (2.90 GHz) and 20 GB RAM, with the maximum heap size set to 10 GB, running Windows 10 and Java version JDK/JRE 1.8.
   To evaluate the correctness of the links generated by the matching process, three measures should be calculated for the different experiment sets: Precision, Recall, and F-Score.

\[
\mathit{Precision} = \frac{TP}{TP + FP}, \qquad
\mathit{Recall} = \frac{TP}{TP + FN}, \qquad
\mathit{FScore} = 2 \times \frac{\mathit{Precision} \times \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}}
\tag{1}
\]

where TP = True Positive, FP = False Positive and FN = False Negative.

B. Experimental Results
   The columns in Table II indicate the average result of 3 runs: False Positive (FP), True Positive (TP) and False Negative (FN) for three different thresholds (0.95, 0.8, 0.6). We used four different similarity measures for evaluation. As an overall observation, TP retained an approximately similar value for each test, which corresponds to the number of entities in the gold standard dataset. Accordingly, LIMES performed particularly well by retaining the same values with different thresholds using Levenshtein and Jaccard. However, SILK accomplished that only with Jaccard. All other values varied according to the three dimensions of computation: frameworks, thresholds and similarity measures.
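The comparison of a generated link file against the gold standard can be sketched as below. This is a hedged illustration in Python, not the authors' Java application: the function names and the example IRIs are hypothetical, the parser only handles N-Triples lines whose three terms are IRIs, and we assume that a link is identified by its (subject, object) pair regardless of the predicate, since the gold standard uses “seeAlso” while a framework may emit owl:sameAs.

```python
import re

# One N-Triples statement whose subject, predicate and object are all IRIs.
TRIPLE = re.compile(r'^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')


def read_links(lines):
    """Return the set of (subject, object) pairs found in N-Triples lines."""
    links = set()
    for line in lines:
        match = TRIPLE.match(line)
        if match:  # blank lines, comments and literal objects are skipped
            links.add((match.group(1), match.group(3)))
    return links


def confusion_counts(gold, discovered):
    """Count true positives, false positives and false negatives."""
    tp = len(gold & discovered)   # links found and present in the gold standard
    fp = len(discovered - gold)   # links found but absent from the gold standard
    fn = len(gold - discovered)   # gold-standard links that were not found
    return tp, fp, fn


if __name__ == "__main__":
    # Hypothetical IRIs, used only to exercise the functions.
    gold = read_links([
        "<http://example.org/ct/1> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://example.org/db/A> .",
        "<http://example.org/ct/2> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://example.org/db/B> .",
    ])
    found = read_links([
        "<http://example.org/ct/1> <http://www.w3.org/2002/07/owl#sameAs> <http://example.org/db/A> .",
        "<http://example.org/ct/3> <http://www.w3.org/2002/07/owl#sameAs> <http://example.org/db/C> .",
    ])
    print(confusion_counts(gold, found))
```

In practice the two line iterables would be the opened gold-standard file and the (.nt) file produced by a framework; the three counts are the inputs to Precision, Recall and F-Score.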




                                                             TABLE II
    FP, TP AND FN FOR THE TASK OF INTERLINKING LINKEDCT AND DRUGBANK USING DIFFERENT THRESHOLDS AND SIMILARITY MEASURES.

Threshold   |                  0.95                   |                   0.8                   |                   0.6
Framework   |       LIMES        |        SILK        |       LIMES        |        SILK        |       LIMES        |        SILK
Similarity  |      FP    TP   FN |      FP    TP   FN |      FP    TP   FN |      FP    TP   FN |      FP    TP   FN |      FP    TP   FN
Levenshtein |    6901 50965 1118 |    7088 50930 1153 |    6901 50965 1118 |   13814 50967 1116 |    6901 50965 1118 |  101677 51056 1027
Jaccard     |    6973 50989 1094 |    6928 51075 1008 |    6973 50989 1094 |    6913 50986 1097 |    6901 50965 1118 |    6908 50957 1126
Jaro        |   19801 51113  970 |    8316 50993 1090 |  886401 51189  894 |  109944 51189  894 | Java heap space    | GC overhead
JaroWinkler |   40301 51138  945 |    9734 51027 1056 | 1720793 51194  889 |  176815 51324  759 | GC overhead        | GC overhead
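As a worked illustration, Precision, Recall and F-Score can be derived from the FP, TP and FN counts of Table II. The sketch below is ours rather than the authors' evaluation code; the function names are hypothetical, and only the Levenshtein counts at the 0.95 and 0.6 thresholds are copied from the table.

```python
def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP); returns 0.0 when no link was produced."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN); returns 0.0 when the gold standard is empty."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# (FP, TP, FN) copied from Table II for the Levenshtein measure.
LEVENSHTEIN = {
    ("LIMES", 0.95): (6901, 50965, 1118),
    ("SILK", 0.95): (7088, 50930, 1153),
    ("LIMES", 0.6): (6901, 50965, 1118),
    ("SILK", 0.6): (101677, 51056, 1027),
}

if __name__ == "__main__":
    for (framework, threshold), (fp, tp, fn) in LEVENSHTEIN.items():
        p, r = precision(tp, fp), recall(tp, fn)
        print(f"{framework} @ {threshold}: P={p:.3f} R={r:.3f} F={f_score(p, r):.3f}")
```

With these counts, the precision of LIMES stays at roughly 0.88 at both thresholds, while the precision of SILK at the 0.6 threshold falls to roughly one third because of the much larger number of false positives; recall remains close to 0.98 in all four cases.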




Fig. 4. Experimental results of metrics of links generated by multiple similarity measures when interlinking LinkedCT and DrugBank: (a) run-time in seconds; (b) precision; (c) recall; (d) EE: dividing F-Score by log(Runtime).
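Panel (d) reports EE, the F-Score divided by the logarithm of the run-time. A minimal sketch of that ratio is given below; the function name is hypothetical, and the base-10 logarithm and the run-time unit (seconds) are assumptions on our part, since the base of the logarithm is not stated.

```python
import math


def effectiveness_efficiency(f_score: float, runtime_seconds: float) -> float:
    """EE = F-Score / log10(run-time); the log base is an assumption."""
    if runtime_seconds <= 1.0:
        # log10(x) is zero or negative for x <= 1, so the ratio is not meaningful.
        raise ValueError("run-time must be greater than 1 second")
    return f_score / math.log10(runtime_seconds)


if __name__ == "__main__":
    # Illustrative values only, not measurements from the experiments.
    print(effectiveness_efficiency(0.9, 100.0))
    print(effectiveness_efficiency(0.9, 10000.0))
```

Because of the logarithm, a hundredfold increase in run-time only halves EE in this example, which is the smoothing effect the measure is meant to provide.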



   Although 10 GB of memory was allocated, the two similarity measures Jaro and JaroWinkler failed to produce a result at a threshold of 0.6, because of "Java GC overhead limit exceeded" and "Java heap space" errors.
   Regarding the run-time evaluation, Fig. 4a summarizes our experimental results for SILK and LIMES. As an overall observation, we find that Jaccard achieved the best run-time at all thresholds. Comparing the run-times of LIMES and SILK, we observe that in most experiments LIMES is faster than or approximately equal to SILK. The only case where SILK's run-time noticeably exceeded LIMES's was with the JaroWinkler comparison operator at the 0.8 threshold.
   Figs. 4b and 4c show the quality metrics of link discovery. Both frameworks achieved very good results in terms of recall. In terms of precision, while LIMES maintained very good results for all thresholds when dealing with Levenshtein and Jaccard, SILK had poor results at the 0.6 threshold using Levenshtein for the same experiment specifications. Moreover, for the preceding experiments with Jaro and JaroWinkler, the results were poor for both frameworks, and worse for LIMES.
   In order to evaluate the effectiveness and efficiency of these two frameworks, we propose to use an equation that calculates the proportional value of F-Score to the run-time duration. However, because of the considerable differences in the value of the execution time between each experiment compared to the value of F-Score, we apply the logarithm function to smooth out the high impact of run-time for big




values. Therefore we propose to evaluate the effectiveness and efficiency by using the following EE equation:

\[
EE = \frac{\mathit{FScore}}{\log(\mathit{Runtime})}
\tag{2}
\]

   Looking at the results of the EE equation in Fig. 4d, it seems that LIMES maintained consistent effectiveness regardless of the threshold for both the Levenshtein and Jaccard similarity measures, while SILK was remarkably unsteady.

C. Technical Evaluation
   The LIMES framework admits a drawback [12]: it is optimized only for metrics, which is not the case for non-metric measures such as JaroWinkler. It favors performance and ease of use over recall/precision factors, and considers the contribution of the user in modifying the thresholds as human feedback that can compensate for this drawback.
   The evaluation of the results using Precision, Recall and F-Score confirms that LIMES theoretically has better chances than SILK in the case of large datasets. However, the close results between SILK and LIMES in the current tests do not exclude SILK from the efficient universal link discovery frameworks.
   The two frameworks tend to perform a pre-matching process to improve the performance of comparison. In addition, to speed the process up, the objective is to reduce the number of comparisons that need to be made. SILK uses indexing, while LIMES uses the Triangle Inequality. Obviously, the algorithm used in LIMES, which depends on computing exemplars from the resources and filtering them before computing the similarity and serializing the result [12], is the reason why LIMES is faster and makes it less probable to miss links. The distribution of exemplars, where each represents a portion of the metric space and which are selected to be as dissimilar as possible in the set of data, allows the parallelism of the filtering process before matching. Filtering takes place by matching each point to an exemplar to compute pessimistic estimates of instance similarities, which leads to missing links. In SILK, the indexing process divides the data into blocks and indexes them by some of their properties (mostly labels); then, for each comparison, matching is performed only on potential blocks. This lowers the number of comparisons and thus speeds up the process, but does not guarantee that no links are missed.

            VII. CONCLUSION & FUTURE WORKS
   In this paper, we summarized the core differences between SILK and LIMES and presented an experiment that evaluates the performance of entity comparison and measures the quality of discovered links. In particular, we applied the process to large-scale biomedical data (the LinkedCT and DrugBank datasets). We performed many experiments to evaluate the impact of threshold values and of similarity measures on efficiency and effectiveness, in order to verify the points of strength and weakness of each framework. This comprehensive study clarified and validated the fact that each of SILK and LIMES has its own advantages and is more appropriate for specific similarity measures. More specifically, LIMES flourishes with Levenshtein at all thresholds, while SILK emerges with Jaccard at low thresholds.
   As a future plan, we aim to perform more tests on the rest of the comparison measures and on different aggregation scenarios, to explore in depth the best use-case domain of each framework. In addition, a great deal of work shall be focused on considering active learning, which is already integrated into both frameworks, and on testing the performance in a distributed environment.

                         REFERENCES
 [1] I. Merelli, H. Pérez-Sánchez, S. Gesing, and D. D'Agostino, “Managing, analysing, and integrating big data in medical bioinformatics: open problems and future perspectives,” BioMed research international, vol. 2014, 2014.
 [2] H. Dhayne, R. Haque, R. Kilany, and Y. Taher, “In search of big medical data integration solutions-a comprehensive survey,” IEEE Access, vol. 7, pp. 91 265–91 290, 2019.
 [3] P. Hitzler and K. Janowicz, “Linked data, big data, and the 4th paradigm.” Semantic Web, vol. 4, no. 3, pp. 233–235, 2013.
 [4] H. Dhayne, R. K. Chamoun, and M. Sokhn, “Survey: When semantics meet crowdsourcing to enhance big data variety,” in Communications Conference (MENACOMM), IEEE Middle East and North Africa. IEEE, 2018, pp. 1–6.
 [5] C. Bizer, T. Heath, and T. Berners-Lee, “Linked data: The story so far,” in Semantic services, interoperability and web applications: emerging concepts. IGI Global, 2011, pp. 205–227.
 [6] M. Samwald, A. Jentzsch, C. Bouton, C. S. Kallesøe, E. Willighagen, J. Hajagos, M. S. Marshall, E. Prud’hommeaux, O. Hassanzadeh, E. Pichler et al., “Linked open drug data for pharmaceutical research and development,” Journal of cheminformatics, vol. 3, no. 1, p. 19, 2011.
 [7] H. Dhayne, R. Kilany, R. Haque, and Y. Taher, “Sedie: A semantic-driven engine for integration of healthcare data,” in 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2018, pp. 617–622.
 [8] A.-C. N. Ngomo, S. Auer, J. Lehmann, and A. Zaveri, “Introduction to linked data and its lifecycle on the web,” in Reasoning Web International Summer School. Springer, 2014, pp. 1–99.
 [9] J. D. Fernández, W. Beek, M. A. Martínez-Prieto, and M. Arias, “Lod-a-lot,” in International Semantic Web Conference. Springer, 2017, pp. 75–83.
[10] R. Haque and M.-S. Hacid, “Blinked data: Concepts, characteristics, and challenge,” in Services (SERVICES), 2014 IEEE World Congress on. IEEE, 2014, pp. 426–433.
[11] Y. Raimond, C. Sutton, and M. B. Sandler, “Automatic interlinking of music datasets on the semantic web.” LDOW, vol. 369, 2008.
[12] A.-C. N. Ngomo and S. Auer, “Limes-a time-efficient approach for large-scale link discovery on the web of data.” in IJCAI, 2011, pp. 2312–2317.
[13] A.-C. N. Ngomo, “A time-efficient hybrid approach to link discovery,” Ontology Matching, vol. 1, 2011.
[14] M. Nentwig, M. Hartung, A.-C. Ngonga Ngomo, and E. Rahm, “A survey of current link discovery frameworks,” Semantic Web, vol. 8, no. 3, pp. 419–436, 2017.
[15] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov, “Silk-a link discovery framework for the web of data.” LDOW, vol. 538, 2009.
[16] R. Isele and C. Bizer, “Active learning of expressive linkage rules using genetic programming,” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 23, pp. 2–15, 2013.
[17] D. Beckett and B. McBride, “Rdf/xml syntax specification (revised),” W3C recommendation, vol. 10, no. 2.3, 2004.
[18] O. Hassanzadeh, A. Kementsietsidis, L. Lim, R. J. Miller, and M. Wang, “Linkedct: A linked data space for clinical trials,” arXiv preprint arXiv:0908.0567, 2009.
[19] D. S. Wishart, C. Knox, A. C. Guo, S. Shrivastava, M. Hassanali, P. Stothard, Z. Chang, and J. Woolsey, “Drugbank: a comprehensive resource for in silico drug discovery and exploration,” Nucleic acids research, vol. 34, no. suppl 1, pp. D668–D672, 2006.



