<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hertuda Results for OAEI 2012</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Technische Universität Darmstadt</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Hertuda is a very simple element-based matcher. It shows that tokenization combined with a string measure can also yield good results. It is an improved version of the first version, which was submitted to OAEI 2011.5.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>Presentation of the system</title>
      <sec id="sec-2-1">
        <title>State, purpose, general statement</title>
        <p>Hertuda is a first approach to an element-based matcher that relies on string comparison. It generates
only homogeneous matchings, which are compatible with OWL Lite/DL. This means that
classes, data properties and object properties are handled separately. As a result, there
are three thresholds that can be set independently: one for class-to-class, one for
object-property-to-object-property and one for data-property-to-data-property matchings.
A single overall threshold sets all three sub-thresholds to the same value.</p>
        <p>The cross product of all concepts of the two ontologies is computed. If the confidence
of a comparison is higher than the threshold for this type of matching, the pair is added to
the resulting alignment. For each concept, all labels, comments and URI fragments are extracted
and form a set of terms. To compare two concepts, that is, their sets of terms, each
element of the first set is compared with each element of the second set. The best value
is the similarity of the two concepts.</p>
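        <p>The cross product and the "best value" rule can be sketched in Python as follows. This is a minimal illustration, not Hertuda's implementation: the function names, the example ontologies and the use of <monospace>difflib.SequenceMatcher</monospace> as a stand-in term measure are all assumptions made for the example.</p>
        <preformat>
```python
from difflib import SequenceMatcher


def term_similarity(term_one, term_two):
    # Stand-in measure (difflib ratio), NOT Hertuda's actual term comparison.
    return SequenceMatcher(None, term_one.lower(), term_two.lower()).ratio()


def compare_concepts(terms_one, terms_two, similarity=term_similarity):
    """Return the best similarity over all pairs of terms of two concepts."""
    return max((similarity(a, b) for a in terms_one for b in terms_two),
               default=0.0)


def match(concepts_one, concepts_two, threshold):
    """Compare every concept pair and keep those above the threshold.

    Both arguments map a concept name to its set of terms.
    """
    alignment = []
    for name_one, terms_one in concepts_one.items():
        for name_two, terms_two in concepts_two.items():
            confidence = compare_concepts(terms_one, terms_two)
            if confidence > threshold:
                alignment.append((name_one, name_two, confidence))
    return alignment


# Hypothetical toy ontologies: concept name -> labels, comments, fragments.
ontology_one = {"o1#Paper": {"Paper", "A scientific paper"}}
ontology_two = {"o2#Article": {"Article", "paper"}, "o2#Person": {"Person"}}
print(match(ontology_one, ontology_two, threshold=0.88))
```
        </preformat>
        <p>In this toy example only the pair o1#Paper and o2#Article passes the threshold, because one term of each concept is identical after lower-casing.</p>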
        <p>A preprocessing step for term comparison is tokenization. All camel-case terms
and terms containing underscores or hyphens are split into single tokens and converted
to lower case. Therefore writePaper, write-paper and write_paper all result in the same two
tokens, namely {write} and {paper}.</p>
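        <p>A minimal Python sketch of this tokenization step is given below. It assumes a split based on regular expressions; the function name and the exact splitting rules are assumptions for illustration and may differ from Hertuda's own code (for example, runs of upper-case letters such as acronyms are not split here).</p>
        <preformat>
```python
import re


def tokenize(term):
    """Split a term on camel-case boundaries, underscores, hyphens and
    whitespace, and lower-case the resulting tokens."""
    # Insert a space at every lower-case/digit to upper-case boundary.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", term)
    # Treat underscores, hyphens and whitespace as separators.
    parts = re.split(r"[\s_\-]+", spaced)
    return [part.lower() for part in parts if part]


for term in ("writePaper", "write-paper", "write_paper"):
    print(term, tokenize(term))
```
        </preformat>
        <p>All three spellings produce the token list ['write', 'paper'].</p>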
        <p>
          Afterwards a similarity matrix is computed with the Damerau–Levenshtein distance
[
          <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
          ]. The average of the best mappings is then returned as the similarity between two
token sets. Figure 1 schematically depicts the algorithm of Hertuda.
        </p>
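        <p>The token-level comparison can be sketched in Python as follows. Several details are assumptions, since the text does not fix them: the distance is implemented as the optimal string alignment variant of Damerau–Levenshtein, it is normalised by the length of the longer token, and the "average of the best mappings" is taken over the best match of every token in both lists. All function names are hypothetical.</p>
        <preformat>
```python
def osa_distance(a, b):
    """Optimal string alignment distance, a restricted variant of the
    Damerau-Levenshtein distance (adjacent transpositions cost 1)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def token_similarity(a, b):
    """Map the distance to a similarity in [0, 1] (assumed normalisation)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - osa_distance(a, b) / longest


def best_average_score(tokens_one, tokens_two):
    """Average the best match of every token, taken in both directions."""
    if not tokens_one or not tokens_two:
        return 0.0
    matrix = [[token_similarity(x, y) for y in tokens_two] for x in tokens_one]
    row_best = [max(row) for row in matrix]
    col_best = [max(col) for col in zip(*matrix)]
    return (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best))


print(best_average_score(["write", "paper"], ["write", "papers"]))
```
        </preformat>
        <p>Under these assumptions, identical token lists score 1.0, and a single extra character in one token lowers the score only slightly.</p>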
      </sec>
      <sec id="sec-2-2">
        <title>Specific techniques used</title>
        <p>The final matching system consists of the string matching approach and a filter that
removes alignments not considered in the reference alignment. The system is
depicted in Figure 2.</p>
        <p>The filter removes alignments that are correct but not contained in the reference alignment.
The removed mappings mostly stem from upper-level ontologies such as Dublin Core or Friend
of a Friend (FOAF).</p>
        <preformat>
void function hertuda() {
  for each type in {class, data property, object property}
    for each concept cOne in ontology one
      for each concept cTwo in ontology two
        if (compareConcepts(cOne, cTwo) &gt; threshold(type)) {
          add alignment between cOne and cTwo
        }
}

float compareConcepts(Concept cOne, Concept cTwo) {
  for each termOne in {label(cOne), comment(cOne), fragment(cOne)}
    for each termTwo in {label(cTwo), comment(cTwo), fragment(cTwo)}
      conceptsMatrix[termOne, termTwo] = compareTerms(termOne, termTwo)
  return maximumOf(conceptsMatrix)
}

float compareTerms(String tOne, String tTwo) {
  tokensOne = tokenize(tOne)
  tokensTwo = tokenize(tTwo)
  tokensOne = removeStopwords(tokensOne)
  tokensTwo = removeStopwords(tokensTwo)
  for each x in tokensOne
    for each y in tokensTwo
      similarityMatrix[x, y] = damerauLevenshtein(x, y)
  return bestAverageScore(similarityMatrix)
}
        </preformat>
      </sec>
      <sec id="sec-2-3">
        <title>Adaptations made for the evaluation</title>
        <p>No specific adaptations were made. The overall threshold for the normalised
Damerau–Levenshtein distance is set to 0.88.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Link to the system and parameters file</title>
        <p>The tool version submitted to OAEI 2012 can be downloaded from
http://www.ke.tu-darmstadt.de/resources/ontology-matching/hertuda</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <sec id="sec-3-1">
        <title>Benchmark</title>
        <p>The implemented approach is purely string based and works on the element level.
Missing labels or comments, or terms replaced by random strings, therefore have a strong
effect on the matching algorithm.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Anatomy</title>
        <p>Hertuda has a higher recall than the StringEquiv baseline from OAEI 2011.5 (0.673 compared to 0.622).
Because of the tokenization and the string distance, the precision is much lower (0.69
compared to 0.997). This yields a worse F-Measure for Hertuda (0.681 compared to 0.766).</p>
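        <p>The F-Measure values for the Anatomy track can be checked against the reported precision and recall. The short Python sketch below assumes that the F-Measure is the balanced harmonic mean of precision and recall (F1); the function name is chosen for illustration.</p>
        <preformat>
```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f_measure(0.69, 0.673), 3))   # Hertuda     -> 0.681
print(round(f_measure(0.997, 0.622), 3))  # StringEquiv -> 0.766
```
        </preformat>
        <p>Both results agree with the figures for the two systems when expressed on the same 0 to 1 scale.</p>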
      </sec>
      <sec id="sec-3-3">
        <title>Conference</title>
        <p>The first version of Hertuda only compares the tokens for equality, whereas the new
version computes a string similarity. The recall is slightly higher than that of the first
version, but the precision is lower. All in all, the F-Measure has increased by 0.01. This
approach can find a mapping between has the first name and hasFirstName with a
similarity of 1.0.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Multifarm</title>
        <p>Hertuda is not designed for multilingual matching. Nevertheless, some simple alignments
are returned, such as person(en) and person(de).</p>
      </sec>
      <sec id="sec-3-5">
        <title>Library</title>
        <p>In the Library track, a relatively high recall was achieved (0.925). Splitting
the words, however, resulted in a very low precision (0.465).</p>
        <p>Hertuda was only able to complete the small tasks for FMA-NCI and FMA-SNOMED.
The large ones did not finish in time. A likely reason is that the complexity is too high
because of the cross product of all concepts.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>General comments</title>
      <sec id="sec-4-1">
        <title>Comments on the results</title>
        <p>The approach shows that simple string-based algorithms can also yield good results.
The improvement over version 1 is small, but the recall was higher in many tracks.
The precision was lower as a consequence, but the F-Measure values are often better.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Discussions on the way to improve the proposed system</title>
        <p>One way to improve Hertuda is to add more stop words in different languages. This
helps when comparing two ontologies that share the same language, if that language is not
English.</p>
        <p>Another option is to set the thresholds more precisely instead of using one value for all. It is also
conceivable to set the threshold based on the ontologies being matched. This would help to
improve the low precision in some tracks.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>The results show that a string-based algorithm can also produce good alignments. The
recall of this version is in many cases much higher than that of the first version. Thus it is
possible to use this matcher as a preliminary step for structural matchers.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Damerau</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A technique for computer detection and correction of spelling errors</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>7</volume>
          (
          <issue>3</issue>
          ) (
          <year>1964</year>
          )
          <fpage>171</fpage>
          -
          <lpage>176</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Levenshtein</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Binary codes capable of correcting deletions, insertions and reversals</article-title>
          .
          In:
          <source>Soviet Physics Doklady</source>
          . Volume
          <volume>10</volume>
          . (
          <year>1966</year>
          )
          <fpage>707</fpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>