Hertuda Results for OEAI 2012

                                        Sven Hertling

                             Technische Universität Darmstadt
                           hertling@ke.tu-darmstadt.de


        Abstract. Hertuda is a very simple element based matcher. It shows that tok-
        enization and a string measure can also yield in good results. It is an improved
        version of the first version submitted to the OAEI 2011.5.


1     Presentation of the system

1.1   State, purpose, general statement

Hertuda is a first idea of an element based matcher with a string comparison. It generates
only homogeneous matchings, that are compatible with OWL Lite/DL. This means that
classes, data properties and object properties are handled separately. As a result there
are three thresholds that can be set independently. One for class to class, object to object
property and data to data property. A simple overall threshold sets all sub-thresholds to
the same value.
     Over all concepts a cross product is computed. If the confidence of a comparison
is higher than the threshold for this type of matching, then it is added to the resulting
alignment. For each concept all labels, comments and URI fragments are extracted.
Then these terms form a set. To compare two concepts, respectively sets of terms, each
element of the first set is compared with each element of the second set. The best value
is the similarity measure for these concepts.
     A preprocessing step for term comparison is to tokenize it. All camel case terms
or terms with underscores or hyphens in it, are split into single tokens and converted
to lower case. Therefore writePaper, write-paper and write paper will all result in two
tokens, namely {write} and {paper}.
     Afterwards a similarity matrix is computed with the Damerau–Levenshtein distance
[1, 2]. The average of the best mappings are then returned as the similarity between two
token sets. Figure 1 depicts schematically the algorithm of Hertuda.


1.2   Specific techniques used

The final matching system contains of the string matching approach and a filter for
removing alignments that are not considered in the reference alignment. The system is
depicted in figure 2.
    The filter removes all alignments that are true, but are not in the reference alignment.
The removed mappings are mostly from upper level ontologies like dublin core or friend
of a friend.
void function hertuda() {
  for each type in {class, data property, object property}
    for each concept cOne in ontology one
      for each concept cTwo in ontology two
        if(compareConcepts(cOne, cTwo) > threshould(type)){
          add alignment between cOne and cTwo
        }
}

float compareConcepts(Concept cOne, Concept cTwo) {
  for each termOne in {label(cOne), comment(cOne), fragment(cOne)}
   for each termTwo in {label(cTwo), comment(cTwo), fragment(cTwo)}
     conceptsMatrix[termOne, termTwo] = compareTerms(termOne, termTwo)

    return maximumOf(conceptsMatrix)
}

float compareTerms(String tOne, String tTwo) {
  tokensOne = tokenize(tOne)
  tokensTwo = tokenize(tTwo)

    tokensOne = removeStopwords(tokensOne)
    tokensTwo = removeStopwords(tokensTwo)

    for each x in tokensOne
     for each y in tokensTwo
      similarityMatrix[x, y] = damerauLevenshtein(x, y)

    return bestAverageScore(similarityMatrix)
}


                            Fig. 1. Algorithm for Hertuda


1.3   Adaptations made for the evaluation


There are no specific adaptions made. The overall threshold for a normalised Dam-
erau–Levenshtein distance is set to 0.88.


1.4   Link to the system and parameters file


The tool version submitted to OAEI 2012 can be downloaded from http://www.
ke.tu-darmstadt.de/resources/ontology-matching/hertuda
Fig. 2. Composition of matching algorithms of Hertuda. The string based approach and the filter
are sequential composed.


2     Results

2.1   Benchmark

The implemented approach is only string based and works on the element level, whereby
missing labels or comments or replaced terms by random strings has a high effect on
the matching algorithm.


2.2   Anatomy

Hertuda has a higher recall than the StringEquiv from OEAI 2011.5 (0.673 to 0.622).
Through the tokenization and also the string distance the precision is much lower (0.69
to 0.997). This yield in worse F-Measure for Hertuda (68.1 to 0.766).


2.3   Conference

The first version of Hertuda only compares the tokens for equality, whereas the new
version computes a string similarity. Though the recall is a little bit higher than the first
version, but the precision is lower. All in all, the F-Measure has increased by 0.01. This
approach can find a mapping between has the first name and hasFirstName with an
similarity of 1.0.


2.4   Multifarm

Hertuda is not designed for multiligual matching. Nevertheless, some simple alignments
are returned like person(en) ≡ person(de).


2.5   Library

In the Library Track a relatively high recall has been achieved (0.925). Through splitting
the words a very low precision value (0.465) was the result.
2.6   Large Biomedical Ontologies

Hertuda was only capable to match the small task for FMA-NCI and FMA-SNOMED.
The large ones are not finished in time. The reason can be, that the complexity is to high
trough the cross product of all concepts.


3     General comments

3.1   Comments on the results
The approach shows, that also simple string based algorithms can yield in good results.
The improvement of version 1 is not much, but the recall was higher in many tracks.
The precision was therefore lower, but it ends often in better F-Measure values.


3.2   Discussions on the way to improve the proposed system

To improve Hertuda it is possible to add more stop words in different languages. This
helps by comparing two ontologies that have the same language, but this differs from
English.
    Another point is to set the threshold more precise and not one for all. It is also
imaginable to set the threshold based on the matching ontologies. This will help to
reduce the low precision in some tracks.


4     Conclusion
The results show that an string based algorithm can also produce good alignments. The
recall of this version is in many cases much higher that the first version. Thus it is
possible to use this matcher as a previous step of structural matchers.


References
1. Damerau, F.: A technique for computer detection and correction of spelling errors. Commu-
   nications of the ACM 7(3) (1964) 171–176
2. Levenshtein, V.: Binary codes capable of correcting deletions, insertions and reversals. In:
   Soviet Physics Doklady. Volume 10. (1966) 707