<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DisMatch results for OAEI 2016</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maciej Rybiński</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>María del Mar Roldán-García</string-name>
          <email>mmar@lcc.uma.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José García-Nieto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José F. Aldana-Montes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. de Lenguajes y Ciencias de la Computación, University of Malaga, ETSI Informática, Campus de Teatinos</institution>
          ,
          <addr-line>Malaga - 29071</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>DisMatch is an experimental ontology matching system based on a corpus-based distributional measure for approximating semantic relatedness. Because the measure is built on a domain-related corpus, it can be applied to matching problems within the domain of that corpus, here the Disease and Phenotype track. In this paper, we briefly present the proposed approach and the results obtained in the evaluation, as well as some early conclusions regarding the performance of DisMatch.</p>
      </abstract>
      <kwd-group>
        <kwd>Ontology Matching</kwd>
        <kwd>Benchmarking</kwd>
        <kwd>Lexical Semantic Relatedness</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>Presentation of the system</title>
      <sec id="sec-2-1">
        <title>State, purpose, general statement</title>
        <p>
          It has been demonstrated that corpus-based measures can
successfully approximate human judgment of semantic relatedness between pairs
of concepts [
          <xref ref-type="bibr" rid="ref1 ref3 ref4">1,3,4</xref>
          ]. DisMatch is an experimental system built to evaluate
the applicability of a state-of-the-art, domain-focused, corpus-based
measure of semantic relatedness to the task of ontology alignment.
        </p>
        <p>
          For a pair of ontologies, DisMatch calculates the matrix of semantic
relatedness between labels representing their concepts. It then uses this matrix as the
input for the classic algorithm of Similarity Flooding [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], in order to incorporate
the taxonomic information into our final results.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Specific techniques used</title>
        <p>The workflow of DisMatch can be broken down into the following steps:</p>
        <list list-type="order">
          <list-item><p>Preprocessing: extraction of the taxonomies and labels of the concepts.</p></list-item>
          <list-item><p>Assigning distributional representations to the concepts of the ontologies.</p></list-item>
          <list-item><p>Calculating the semantic relatedness for the pairs of concepts of the respective ontologies.</p></list-item>
          <list-item><p>Calculating the similarity propagation given the taxonomies and initial relatedness scores (SimFlood).</p></list-item>
          <list-item><p>Calculating the final similarity scores.</p></list-item>
          <list-item><p>Filtering.</p></list-item>
        </list>
        <p>
          In step (2), we use vector-based representations in the style of ESA (Explicit
Semantic Analysis [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]), adapted for use in the biomedical domain.
The representations are created from the labels of individual
concepts. The distributional representations are obtained through a combined use
of Wikipedia and a domain-focused corpus of scientific documents, i.e. Medline.
        </p>
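        <p>The idea behind step (2) can be illustrated with a minimal sketch. It assumes a toy three-document corpus in place of Wikipedia and Medline, plain TF-IDF weighting, and a label vector formed by summing word vectors; the names <monospace>CORPUS</monospace>, <monospace>build_index</monospace> and <monospace>esa_vector</monospace> are hypothetical and are not taken from DisMatch.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter

# Toy stand-in for the background corpus: each document acts as one "concept".
CORPUS = {
    "doc_heart": "heart muscle cardiac disease heart failure",
    "doc_kidney": "kidney renal disease failure dialysis",
    "doc_bone": "bone fracture skeletal density",
}


def build_index(corpus):
    """Map each term to its TF-IDF weight in every document that contains it."""
    n_docs = len(corpus)
    term_counts = {doc: Counter(text.split()) for doc, text in corpus.items()}
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    index = {}
    for doc, counts in term_counts.items():
        for term, tf in counts.items():
            idf = math.log(n_docs / doc_freq[term])
            index.setdefault(term, {})[doc] = tf * idf
    return index


def esa_vector(label, index):
    """Represent a label as the sum of the concept vectors of its words."""
    vector = Counter()
    for word in label.lower().split():
        for doc, weight in index.get(word, {}).items():
            vector[doc] += weight
    return dict(vector)


INDEX = build_index(CORPUS)
print(esa_vector("cardiac failure", INDEX))
```
]]></preformat>
        <p>In this toy setting the label "cardiac failure" is weighted most heavily towards the heart document, with a smaller weight on the kidney document, which also mentions "failure". Words that never occur in the corpus contribute nothing to the vector.</p>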
        <p>In step (3), we use the vectors from step (2) to approximate
relatedness as the cosine similarity of these vectors. To calculate the similarity
propagation in step (4), we apply the basic version of the Similarity Flooding algorithm to
the taxonomic structures. We do, however, restrict the size of the propagation graph by
excluding the nodes whose initial relatedness does not surpass a certain minimum
threshold.</p>
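        <p>Steps (3) and (4) can be sketched as follows. This is a simplified illustration rather than DisMatch's code: the function names, the default threshold of 0.1, the fixed number of iterations and the equal-share propagation weights are all assumptions, and only parent-child links are propagated.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def flood(initial, parent_a, parent_b, threshold=0.1, iterations=20):
    """Simplified Similarity Flooding over two taxonomies.

    initial: {(a, b): relatedness}; parent_a / parent_b: {child: parent}.
    Pairs that do not surpass `threshold` are left out of the propagation graph.
    """
    nodes = {pair for pair, score in initial.items() if score > threshold}
    # Pair (a, b) is linked to (parent(a), parent(b)) when both pairs survive.
    neighbours = {pair: [] for pair in nodes}
    for a, b in nodes:
        up = (parent_a.get(a), parent_b.get(b))
        if up in nodes:
            neighbours[(a, b)].append(up)
            neighbours[up].append((a, b))
    sigma = {pair: initial[pair] for pair in nodes}
    for _ in range(iterations):
        updated = {}
        for pair in nodes:
            # Each neighbour shares its score equally among its own neighbours.
            incoming = sum(sigma[n] / len(neighbours[n]) for n in neighbours[pair])
            updated[pair] = sigma[pair] + incoming
        # Normalise by the largest value, as in the basic fixpoint formula.
        top = max(updated.values(), default=1.0) or 1.0
        sigma = {pair: value / top for pair, value in updated.items()}
    return sigma


initial = {
    ("hp_heart", "mp_heart"): 0.9,
    ("hp_organ", "mp_organ"): 0.5,
    ("hp_heart", "mp_organ"): 0.05,  # below the threshold, so excluded
}
scores = flood(initial, {"hp_heart": "hp_organ"}, {"mp_heart": "mp_organ"})
print(scores)
```
]]></preformat>
        <p>The pair with the low initial score never enters the propagation graph, which is what keeps the graph small. The two remaining pairs reinforce each other because their concepts are connected by parent-child links in both taxonomies.</p>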
        <p>We calculate the final similarity scores (step 5) as an average between the
initial scores (semantic relatedness) and the similarity propagation output. This
gives more importance to the relatedness score (which is the point of our
experiment), and also caters for cases in which Similarity Flooding is poorly applicable.</p>
        <p>The filtering is done by: i) accepting only a maximal number of candidate
matches per node of an ontology; ii) eliminating candidate matches below a
certain similarity threshold; iii) accepting a globally maximal number of candidate
matches.</p>
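        <p>Steps (5) and (6) can be sketched as below. Several details are assumptions made for illustration: the average is taken as an unweighted mean, pairs missing from the propagation output are given a propagated score of 0, the per-node cap is applied greedily to both ontologies, and all parameter names and default values are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def final_scores(relatedness, propagated):
    """Step 5: unweighted mean of initial relatedness and propagation output."""
    return {
        pair: (score + propagated.get(pair, 0.0)) / 2
        for pair, score in relatedness.items()
    }


def filter_matches(scores, per_node=1, threshold=0.5, global_max=1000):
    """Step 6: per-node cap (i), similarity threshold (ii), global cap (iii)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    used_a, used_b = Counter(), Counter()
    kept = []
    for (a, b), score in ranked:
        # Candidates are sorted, so nothing after this point can qualify.
        if score < threshold or len(kept) >= global_max:
            break
        if used_a[a] >= per_node or used_b[b] >= per_node:
            continue
        used_a[a] += 1
        used_b[b] += 1
        kept.append((a, b, score))
    return kept


relatedness = {("a1", "b1"): 0.9, ("a1", "b2"): 0.8, ("a2", "b2"): 0.6, ("a3", "b3"): 0.2}
propagated = {("a1", "b1"): 1.0, ("a1", "b2"): 0.4, ("a2", "b2"): 0.8}
alignment = filter_matches(final_scores(relatedness, propagated))
print([(a, b) for a, b, _ in alignment])
```
]]></preformat>
        <p>With these toy values the candidate (a1, b2) is dropped by the per-node cap because a1 is already matched to b1, and (a3, b3) is dropped by the threshold. Lowering <monospace>global_max</monospace> or raising <monospace>threshold</monospace> makes the filtering stricter and so reduces the number of mappings returned.</p>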
      </sec>
      <sec id="sec-2-3">
        <title>Adaptations made for the evaluation</title>
        <p>No specific adaptations were made for the experiments, apart from minor changes
of the filtering parameters (i.e. the global number of candidate matches accepted
in the final alignment).</p>
      </sec>
      <sec id="sec-2-4">
        <title>Link to the set of provided alignments</title>
        <p>The set of provided alignments is available at http://bit.ly/2dPA9H5.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results of the Disease and Phenotype track</title>
      <p>DisMatch has been evaluated in both tasks of the Disease and Phenotype track:
HP-MP (alignment of Human Phenotype Ontology with Mammalian Phenotype
Ontology) and DOID-ORDO (alignment of Human Disease Ontology with
Orphanet Rare Disease Ontology). A summary of results is reported on the official
OAEI 2016 Disease and Phenotype track site (http://oaei.ontologymatching.org/2016/results/phenotype/).</p>
      <p>It can be observed that the results of DisMatch are relatively far off the silver
standard created in the evaluation process. We believe that this is largely due
to setting up the system with parameters that resulted in overly strict filtering
that created a relatively low number of mappings. In turn, the low number of
mappings led to poor recall, both in the silver standard evaluation and w.r.t.
the set of manually created mappings.</p>
      <p>The precision of DisMatch in the HP-MP alignment looks quite promising,
especially if we consider the number of unique mappings produced by the
system. Out of the total of 644 mappings, 353 are confirmed by at least
one other system (thus falling into the ‘correct’ category of silver standard
2). Out of these 353, 293 are confirmed by at least two other systems (‘correct’ in
silver standard 3). The remaining 291 mappings are unique to DisMatch. Table 1
presents an overview of the unique mappings produced by the respective systems.
The precision of the unique mappings produced by DisMatch is estimated at
0.8333, which means that a large share of them are correct mappings
discovered only by our system. In this regard, the proposed approach obtained the
highest percentage of positive contribution (19.80%), with a relatively low negative
contribution (3.96%).</p>
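      <p>The HP-MP counts above can be cross-checked with a few lines of arithmetic. The variable names are illustrative, and the last figure is only a back-of-the-envelope estimate obtained by multiplying the number of unique mappings by the estimated precision; it is not a reported result.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Counts reported for DisMatch on the HP-MP task.
total_mappings = 644
confirmed_by_one_other = 353   # 'correct' in silver standard 2
confirmed_by_two_others = 293  # 'correct' in silver standard 3
unique_precision = 0.8333      # estimated precision of the unique mappings

unique_mappings = total_mappings - confirmed_by_one_other
confirmed_by_exactly_one = confirmed_by_one_other - confirmed_by_two_others
unique_ratio = unique_mappings / total_mappings
estimated_correct_unique = unique_mappings * unique_precision

print(f"unique mappings: {unique_mappings}")
print(f"confirmed by exactly one other system: {confirmed_by_exactly_one}")
print(f"unique-to-total ratio: {unique_ratio:.3f}")
print(f"estimated correct unique mappings: {estimated_correct_unique:.0f}")
```
]]></preformat>
      <p>The first value reproduces the 291 unique mappings stated in the text; under these figures, a little under half of all DisMatch mappings for this task are unique to the system.</p>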
      <p>In the case of the DOID-ORDO alignment, the performance of our system is
limited, as it is affected not only by the low recall related to the poor parameter
selection, but also by the inability of our structural mapping component to cope
with the structure of the Orphanet ontology. This shortcoming will be addressed
in future versions of DisMatch. Nonetheless, as shown in Table 2, even in
this setting, the system managed to produce a considerable number (an estimated
40% of 259, i.e. more than 100) of correct unique mappings.</p>
      <p>The relatedness measure seems to capture non-trivial matches better than, for
example, string edit distance. At the same time, it still works for the trivial cases,
as common words will generate similar distributional representations. The main
strength of DisMatch (and its distributional semantic relatedness component) is
its ability to find non-trivial mappings, which seems to be confirmed by the
number of unique correct matches generated by the system (and the
unique-to-total mappings ratio).</p>
      <p>Nonetheless, the structural matching strategy still seems to be an important
component of the system, as the relatedness matcher on its own will, for example,
generate high-confidence matches for inputs such as ‘X syndrome’ and ‘Y
syndrome’ if X and Y are very rare in the background corpus. The importance
of the structural matching step is consistent with the performance
gap between the HP-MP case (where the structural matcher worked) and the DOID-ORDO
case (where it did not work properly).</p>
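      <p>The ‘X syndrome’ versus ‘Y syndrome’ effect can be reproduced with a small sketch. It assumes, for illustration only, that a label vector is the sum of its word vectors and that the rare words X and Y are entirely absent from the background corpus; the vectors and concept names are invented.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math

# Hypothetical concept-space vectors. "x" and "y" have no entry because the
# rare words are assumed not to occur in the background corpus at all.
WORD_VECTORS = {
    "syndrome": {"doc_genetics": 0.7, "doc_clinic": 0.4},
    "marfan": {"doc_genetics": 0.9, "doc_connective_tissue": 0.8},
}


def label_vector(label):
    """Sum the vectors of the known words in a label."""
    vector = {}
    for word in label.lower().split():
        for concept, weight in WORD_VECTORS.get(word, {}).items():
            vector[concept] = vector.get(concept, 0.0) + weight
    return vector


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


rare_pair = cosine(label_vector("X syndrome"), label_vector("Y syndrome"))
known_pair = cosine(label_vector("Marfan syndrome"), label_vector("Y syndrome"))
print(rare_pair, known_pair)
```
]]></preformat>
      <p>Both rare labels collapse onto the vector of "syndrome" alone, so their cosine similarity is maximal even though they may denote unrelated conditions. A label containing a word the corpus does cover is pulled away from that shared vector, which is why a structural signal is needed to separate the rare cases.</p>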
      <p>We believe that DisMatch could be improved substantially by improving
the combination of relatedness and structural matching, i.e. by employing a better suited
structural matcher. Furthermore, our current structural matching strategy relies
solely on strictly taxonomic relationships, which is not always enough (e.g. in the
case of the Orphanet ontology).</p>
      <p>In addition, the semantic relatedness module generates candidate mappings
that are not necessarily ‘equivalent’, as the measure does not distinguish between
the possible relationship types. It is worth considering adding a
‘prediction’ module that classifies the relationship type of the
mappings.</p>
      <p>Moreover, when it comes to improving the performance of the relatedness
module itself, it seems that the measure provides more accurate results for
shorter input texts. This points to two possible improvements: (a) finding
a better suited compositional approach for the lexical relatedness measure, or
(b) using shorter inputs (possibly through the synonym properties of the
ontologies to be aligned).</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>The results obtained with the DisMatch system show enough promise to continue
the experiments with corpus-based distributional relatedness measures applied
to the problem of ontology alignment. We believe that our focus should now
be on providing an optimal set of additional components around the relatedness
measure. In addition, we expect that tuning of the filtering parameters will lead
the proposed system to reach higher precision with respect to silver standards.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work has been partially funded by Grants TIN2014-58304-R (Spanish
Ministry of Education and Science), P11-TIC-7529 (Innovation, Science and
Enterprise Ministry of the regional government of the Junta de Andalucía) and
P12-TIC-1519 (Plan Andaluz de Investigación, Desarrollo e Innovación). José
García-Nieto is the recipient of a Post-Doctoral fellowship of “Captación de Talento
para la Investigación” at Universidad de Málaga.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Evgeniy</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          and
          <string-name>
            <given-names>Shaul</given-names>
            <surname>Markovitch</surname>
          </string-name>
          .
          <article-title>Computing semantic relatedness using wikipedia-based explicit semantic analysis</article-title>
          .
          <source>In IJCAI</source>
          , volume
          <volume>7</volume>
          , pages
          <fpage>1606</fpage>
          -
          <lpage>1611</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Melnik</surname>
          </string-name>
          , Hector Garcia-Molina, and
          <string-name>
            <given-names>Erhard</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          <article-title>Similarity flooding: A versatile graph matching algorithm and its application to schema matching</article-title>
          .
          <source>In Data Engineering</source>
          ,
          <year>2002</year>
          . Proceedings. 18th International Conference on, pages
          <fpage>117</fpage>
          -
          <lpage>128</lpage>
          . IEEE,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Ted</given-names>
            <surname>Pedersen</surname>
          </string-name>
          ,
          <string-name>
            <surname>Serguei V S Pakhomov</surname>
          </string-name>
          , Siddharth Patwardhan, and Christopher G Chute.
          <article-title>Measures of semantic similarity and relatedness in the biomedical domain</article-title>
          .
          <source>Journal of biomedical informatics</source>
          ,
          <volume>40</volume>
          (
          <issue>3</issue>
          ):
          <fpage>288</fpage>
          -
          <lpage>299</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Maciej</given-names>
            <surname>Rybinski</surname>
          </string-name>
          and José
          <string-name>
            <given-names>F</given-names>
            <surname>Aldana-Montes</surname>
          </string-name>
          .
          <article-title>Calculating semantic relatedness for biomedical use in a knowledge-poor environment</article-title>
          .
          <source>BMC bioinformatics</source>
          ,
          <volume>15</volume>
          (
          <issue>Suppl 14</issue>
          ):
          <fpage>S2</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>