=Paper=
{{Paper
|id=Vol-2288/oaei18_paper5
|storemode=property
|title=DOME results for OAEI 2018
|pdfUrl=https://ceur-ws.org/Vol-2288/oaei18_paper5.pdf
|volume=Vol-2288
|authors=Sven Hertling,Heiko Paulheim
|dblpUrl=https://dblp.org/rec/conf/semweb/HertlingP18
}}
==DOME results for OAEI 2018==
Sven Hertling and Heiko Paulheim
Data and Web Science Group, University of Mannheim, Germany
{sven,heiko}@informatik.uni-mannheim.de

Abstract. DOME (Deep Ontology MatchEr) is a scalable matcher which relies on large texts describing the ontological concepts. Using the doc2vec approach, these texts are used to train a fixed-length vector representation of the concepts. Mappings are generated if two concepts are close to each other in the resulting vector space. If no large texts are available, DOME falls back to a string-based matching technique. Due to its high scalability, it can also produce results in the largebio track of OAEI and can be applied to very large ontologies. The results look promising if huge texts are available, but there is still a lot of room for improvement.

1 Presentation of the system

1.1 State, purpose, general statement

Ontology matching is often based on string comparisons because each resource is described by URI fragments (the last part of a URI after the # sign), rdfs:labels, and rdfs:comments. The DOME matcher specifically relies on large texts which describe the resources and thereby allows for a better distinction in the case of similar labels. Especially in knowledge graphs like DBpedia or YAGO, such texts are easily extracted from the corresponding Wikipedia abstract.

The usual problem with such large texts is the matching with other similar and long texts. One possible way is to use topic modeling like latent semantic analysis (LSA) [2] or latent Dirichlet allocation (LDA) [1]. The extracted topics can then be used to find overlaps and, in the end, similar concepts. DOME uses another approach called doc2vec (also called paragraph vectors [5]), which is based on word2vec [6]. The idea is to represent variable-length texts, like sentences, paragraphs, and documents, as a fixed-length feature vector. This vector is trained to predict the words appearing in the document. Thus, this vector represents the semantics of the concept when it is trained on texts which define the meaning of the concept.

Two approaches for training this vector are established: Distributed Memory (DM) and Distributed Bag of Words (DBOW). Applied to an example concept like Harry Potter (http://harrypotter.wikia.com/wiki/Harry_Potter), the framework of DM is shown in Figure 1. During training, the algorithm iterates over the given text in a sliding window of a specified and fixed length. The goal is to predict the last word given the first n words. One special vector is the first one, which represents the paragraph vector. In our case this is the URI of the concept. All large texts which define this resource can be used to train this vector.

Fig. 1. Training of Distributed Memory given the concept Harry Potter and a small excerpt of the corresponding wiki abstract.

Another approach for generating the concept vector is Distributed Bag of Words (DBOW), shown in Figure 2. Instead of using vectors for each word, it tries to predict words from the text as an output.

Fig. 2. Training of Distributed Bag of Words. The example is the same as in Figure 1, but now the concept URI together with a small subset of the text is used to predict the following word.

DOME uses the DM sequence learning algorithm with a vector size of 300 and a window size of 5. The training is repeated for 10 epochs. The minimal word frequency is set to the minimum so that all words contribute to the concept vector.
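In DL4J, which DOME uses (see Section 1.3), such a configuration could look roughly like the sketch below. This is a minimal illustration rather than the actual DOME source; the iterator over the concept texts (one labelled document per concept URI) and the tokenizer setup are assumptions made for the example.

```java
import org.deeplearning4j.models.embeddings.learning.impl.sequence.DM;
import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.deeplearning4j.models.word2vec.VocabWord;
import org.deeplearning4j.text.documentiterator.LabelAwareIterator;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class ConceptVectorTraining {

    // Trains paragraph vectors over concept texts; each document is assumed
    // to be labelled with the URI of the concept it describes.
    public static ParagraphVectors train(LabelAwareIterator conceptTexts) {
        TokenizerFactory tokenizer = new DefaultTokenizerFactory();

        ParagraphVectors vectors = new ParagraphVectors.Builder()
                .sequenceLearningAlgorithm(new DM<VocabWord>()) // Distributed Memory
                .layerSize(300)       // vector size
                .windowSize(5)        // sliding window size
                .epochs(10)           // number of training epochs
                .minWordFrequency(1)  // minimum frequency, so all words contribute
                .trainWordVectors(true)
                .iterate(conceptTexts)
                .tokenizerFactory(tokenizer)
                .build();
        vectors.fit();
        return vectors;
    }
}
```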
We compute a predefined set of properties which contain definitional texts by two simple rules: 1) directly choose rdfs:comment, and 2) use every property whose URI ends in “abstract”. This can be further improved in the next version of DOME.

Fig. 3. Matching strategy of DOME.

The doc2vec model is trained on all texts available in both ontologies. For each concept in the second ontology, the corresponding concept vector is computed, and the concepts which have the most similar vectors to those from the first ontology are retrieved. A mapping between two resources is established when the cosine similarity is above 0.85 (the threshold has been chosen based on a manual inspection of the results).

The whole matching approach is shown in Figure 3. The labels and fragments of each resource are compared using string-based similarity. Specifically, the texts are tokenized and all punctuation (especially underscores and the like) is removed. After lowercasing, these values are stored in a hash structure. A mapping is created when the fragments or labels match exactly. The confidence value of these alignments is set to 1.0. After this step, the doc2vec approach is applied to find further matching concepts. We ensured that the mapping is OWL compliant because we only match instances to instances, classes to classes, and properties to properties. In the latter case we further distinguish datatype properties and object properties, but also match properties declared as rdf:Property. With such a setup, the matcher is very scalable and can match all types of resources.
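The string-based step can be sketched as follows. This is a simplified illustration, not DOME's actual data structures: resources are represented by their URIs, labels and fragments are plain strings, and a mapping is just a triple of source URI, target URI, and confidence.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class StringMatcher {

    // Lowercase, replace underscores and other punctuation by spaces, and collapse
    // whitespace so that e.g. "Harry_Potter" and "harry potter" become equal keys.
    static String normalize(String value) {
        return value.toLowerCase(Locale.ROOT)
                .replaceAll("[\\p{Punct}]+", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // Index the labels/fragments of the first ontology in a hash structure and
    // create a mapping with confidence 1.0 for every exact match from the second ontology.
    static List<String[]> match(Map<String, String> labelToUri1, Map<String, String> labelToUri2) {
        Map<String, String> index = new HashMap<>();
        for (Map.Entry<String, String> e : labelToUri1.entrySet()) {
            index.put(normalize(e.getKey()), e.getValue());
        }
        List<String[]> mappings = new ArrayList<>();
        for (Map.Entry<String, String> e : labelToUri2.entrySet()) {
            String uri1 = index.get(normalize(e.getKey()));
            if (uri1 != null) {
                mappings.add(new String[] {uri1, e.getValue(), "1.0"});
            }
        }
        return mappings;
    }
}
```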
1.2 Specific techniques used

The main technique used in DOME is the doc2vec approach [5] for comments and abstracts of concepts. It is only activated when there is enough text to process. All other matching techniques rely on fast string similarity. No further filtering of the alignment is executed, but during matching only one-to-one mappings are allowed.

1.3 Adaptations made for the evaluation

DOME is implemented in Java and uses DL4J (Deep Learning for Java, https://deeplearning4j.org) as an implementation of the doc2vec approach. DL4J heavily relies on platform-specific implementations which are stored in multiple JAR files. This allows it to make use of GPUs to further speed up the computation. DOME relies on the CPU implementation of DL4J because it is not clear upfront whether all evaluation machines used for OAEI contain a DL4J-compatible GPU.

Although the DL4J framework allows for searching for related concepts, it does not provide the similarity values out of the box. Thus, the framework is modified to also retrieve these values, which can be used in the alignment file to represent the confidence of a mapping. Since the values are already normalized, no further post-processing of the similarity values is needed.

Unfortunately, the packaged SEALS matcher was not able to run under the SEALS evaluation routine. The SEALS client loads all JAR files in its own classpath. This is a very secure way of running third-party code, but at the same time one of the most frequent causes of matchers not working at OAEI, as in the case of DOME. The root cause is the custom classloader of SEALS, which uses the JCL library (https://github.com/kamranzafar/JCL). The SEALS classloader is a subclass of the AbstractClassLoader in the JCL library. Both classloaders do not implement all methods (especially the getPackage method) of the standard classloader. Many other libraries use such methods to load further operating-system-specific code. This applies to the DL4J library as well as the sqlite-jdbc library. We fixed the error by creating an intermediate matcher which calls another Java process. Within that process the classloader is the standard one, and the DL4J library can be loaded without any errors. We released a matching framework which takes care of the SEALS and HOBBIT packaging, the uploading, and the creation of the intermediate matcher (https://github.com/sven-h/ontMatchingHobbit).
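A minimal sketch of such an intermediate matcher is shown below, under the assumption that the real matcher can be started as a separate JAR (here called dome.jar) which takes the source and target ontology locations as arguments and prints the path of the generated alignment file; the class name and argument handling are illustrative, not the actual DOME code.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class IntermediateMatcher {

    // Runs the real matcher in a fresh JVM so that it is loaded by the standard
    // classloader instead of the SEALS/JCL classloader.
    public static String match(String sourceOntology, String targetOntology) throws Exception {
        ProcessBuilder builder = new ProcessBuilder(
                "java", "-jar", "dome.jar", sourceOntology, targetOntology);
        builder.redirectErrorStream(true);
        Process process = builder.start();

        // The external process is expected to print the path of the alignment file.
        String alignmentFile = null;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                alignmentFile = line; // keep the last printed line as the result
            }
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IllegalStateException("External matcher failed with exit code " + exitCode);
        }
        return alignmentFile;
    }
}
```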
1.4 Link to the system and parameters file

DOME can be downloaded from https://www.dropbox.com/s/1bpektuvcsbk5ph/DOME.zip?dl=0.

2 Results

The following section discusses the results for each track of OAEI 2018 where DOME is able to produce meaningful results. This includes the anatomy, conference, largebio, phenotype, and knowledge graph tracks. DOME was not able to complete the multifarm track because currently no translation component is included. This would be possible with cross-lingual embedding approaches as shown in [8]. For the complex and interactive tracks, the matching system has to produce a different type of output mapping or matching strategy, which is not implemented. The tracks biodiv and iimb do not contain enough free text in the selected properties.

2.1 Anatomy

In the anatomy track, only labels are given, thus the doc2vec approach is not used here. There are some properties like oboInOwl:hasRelatedSynonym or oboInOwl:hasDefinition which point to resources with more describing text, but these resources are not recognized by DOME, since we do not implement a larger list of properties used to point to texts. Therefore, DOME only utilizes string-based matching for this track. The text is lowercased, tokenized, and then matched based on a hashing algorithm. This results in a high precision of 0.997 (similar to the string equivalence baseline) and a very low runtime of 22 seconds. Only LogMapLt was 4 seconds faster. Due to a slightly lower recall of 0.615 (0.07 lower than the baseline), DOME has a lower F-Measure than the baseline. An improvement in this track would be to use the additional texts from oboInOwl:hasRelatedSynonym and oboInOwl:hasDefinition to further increase the recall. In order not to have to manually maintain such a list, it would also be possible to incorporate all literals that consist of text of at least a certain number of words.

2.2 Conference

Within the conference track, DOME is a bit better than the baseline and often similar to edna (a string editing distance matcher adopted from the benchmark track). Evaluated against the original reference alignment, DOME performs exactly like edna in the class mappings and a bit better in the property mappings, both in terms of recall and precision. This results in a 0.07 better F-Measure. But there is still a lot of room for improvement, because this year the best matcher reached an F-Measure of 0.58 in this track. When comparing to the entailed reference alignment, DOME has the same evaluation measures as edna and is a bit better when comparing properties. If both classes and properties are taken into account, DOME is only 0.01 better than edna and 0.15 behind the current best matcher. In most of the conference ontologies, there are no long natural language texts. Only in rare cases, some classes are described by a comment. Those were processed by the doc2vec model, but this did not yield any new mappings.

2.3 Largebio

In the largebio track, the number of classes is very high. In the case of FMA-SNOMED, this results in matching 78,989 classes to 122,464 classes. Matchers which compare a string from one ontology to all concepts of the other ontology have a quadratic runtime and usually cannot finish in time. DOME is one of five matchers (DOME, FCAMapX, LogMap, LogMapBio, XMap) which were able to return results within the given time limit. It is the fastest one and terminates within 30 seconds on the largest task. The second fastest is XMap with 7 minutes, and the slowest one is LogMapBio with 49 minutes. The reason here is the same as in the anatomy track: most resources are only described by a label and fragment without further textual content. Thus, DOME relies on string comparison with a high precision but low recall. In the case of “SNOMED-NCI whole”, this results in a precision of 0.907 and a recall of 0.485 (F-Measure of 0.632). The best matcher on this subtrack in terms of F-Measure is FCAMapX with a value of 0.733.

2.4 Phenotype

The phenotype track is based on a real use case, and the matcher should find alignments between disease and phenotype ontologies. DOME is also able to complete this track, but with a low F-Measure of 0.483 (HP-MP) and 0.633 (DOID-ORDO). The precision is again the highest among all matchers, but the recall is below 0.5. However, some ontologies in this track, like the DOID ontology, have further properties containing describing texts, like obo:IAO_0000115 (the label of the property is “definition”). DOME in its current version does not make use of this property, but, as discussed for the anatomy track above, those could be utilized by extending the system.

2.5 Knowledge Graph

The knowledge graph track is a new track where classes, properties, and instances should be matched. As already pointed out in [3,4], matching the classes and properties is easier than matching the instances. This is also the case for the DOME matcher. It returns all three types of mappings and completes all nine sub-tasks. On average it returns 16 class, 207 property, and 15,912 instance mappings. DOME achieved an F-Measure of 0.73 for the class correspondences. It is balanced between recall and precision, but even the baseline has a higher recall, so there should be some room for improvement. When analyzing the property alignments, only DOME and the baseline can produce any results. Most likely, the reason is that all properties are typed as rdf:Property and not subdivided into owl:DatatypeProperty and owl:ObjectProperty. As discussed above, DOME is configured to also match rdf:Property. This results in an F-Measure of 0.84. Instance matches are generated by AML, DOME, LogMap, LogMapLt, and the baseline. Especially in the instance mapping, the doc2vec approach can help because long comments and abstracts of the resources are available. DOME was the second best matcher with an F-Measure of 0.61 (the baseline is the best “matcher” with an F-Measure of 0.69). Overall, looking at the results for classes, properties, and instances together, DOME has an F-Measure of 0.68, which is better than all matchers except the baseline.

3 General comments

3.1 Comments on the results

The overall results show that DOME is in a development phase. Sometimes it can beat at least the baselines in terms of F-measure, and sometimes not. Currently there are not many tracks which provide a large amount of describing text for each resource, but many ontologies and knowledge graphs exist out there where this is the case.

3.2 Discussions on the way to improve the proposed system

Based on the evaluation on all kinds of different tracks, we identified a number of possible improvements. First of all, some ontologies use properties which connect a resource to its describing text and which are not recognized by DOME. One possible approach to fix this would be to use all properties which contain long texts, selected by some heuristic, e.g., strings exceeding a certain number of characters on average, as sketched below. This would include more text to help the doc2vec model to better differentiate the concepts.
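A simple version of such a heuristic could look like the following sketch, assuming the literal values have already been grouped by property URI; the plain map representation and the concrete threshold are illustrative assumptions, not part of DOME.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TextPropertySelector {

    // Selects all properties whose literal values are, on average, longer than the
    // given number of characters and are therefore likely to contain descriptive
    // text that is useful for training the doc2vec model.
    static Set<String> selectTextProperties(Map<String, List<String>> literalsByProperty,
                                            int minAverageLength) {
        Set<String> textProperties = new HashSet<>();
        for (Map.Entry<String, List<String>> entry : literalsByProperty.entrySet()) {
            List<String> values = entry.getValue();
            if (values.isEmpty()) {
                continue;
            }
            long totalLength = 0;
            for (String value : values) {
                totalLength += value.length();
            }
            if (totalLength / values.size() >= minAverageLength) {
                textProperties.add(entry.getKey());
            }
        }
        return textProperties;
    }
}
```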
Another possible improvement is to use pretrained word vectors. Those might contain more semantics for each word than vectors trained directly on the describing texts of the two ontologies. However, for some very domain-specific ontologies with large amounts of text, the generic pretrained embeddings might even perform worse; thus, it is an open research question which of the two yields better results.

A third possible approach is to combine the approach of RDF2Vec [7] (i.e., computing the word2vec embedding of random walks within knowledge graphs) with the various cross-lingual embedding approaches shown in [8]. One simple approach would be to learn a linear transformation between the two generated embeddings of the ontologies.

4 Conclusions

In this paper, we have introduced the DOME matcher, which relies on document embeddings for texts describing the concepts defined in an ontology. The results for DOME are analyzed on the different tracks of the OAEI. DOME is a highly scalable matching system capable of generating class, property, and instance alignments. On some tracks where a lot of text describing each resource exists, it shows promising results. However, the matcher is currently in an early state and offers a lot of room for improvement.

References

1. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. Journal of Machine Learning Research 3(Jan), 993–1022 (2003)
2. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6), 391–407 (1990)
3. Hertling, S., Paulheim, H.: DBkWik: A consolidated knowledge graph from thousands of wikis. In: IEEE International Conference on Big Knowledge, ICBK 2018, Singapore (2018)
4. Hofmann, A., Perchani, S., Portisch, J., Hertling, S., Paulheim, H.: DBkWik: Towards knowledge graph creation from thousands of wikis. In: Proceedings of the International Semantic Web Conference (Posters and Demos), Vienna, Austria. pp. 21–25 (2017)
5. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine Learning. pp. 1188–1196 (2014)
6. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. pp. 3111–3119 (2013)
7. Ristoski, P., Paulheim, H.: RDF2Vec: RDF graph embeddings for data mining. In: International Semantic Web Conference. pp. 498–514. Springer (2016)
8. Ruder, S., Vulić, I., Søgaard, A.: A survey of cross-lingual word embedding models. arXiv preprint arXiv:1706.04902 (2017)