Using Word Semantics on Entity Names for Correspondence Set Generation

Introduction

In ontology matching, many works use word semantics to align ontologies. One commonly used resource is WordNet [4] [5], which groups together words that share the same meaning. Thesauri and lexicons like WordNet provide rich semantic information but require large amounts of human effort to create and maintain.

Vector space representations of word semantics are a family of language models that associate words with vectors in a semantic space, where each dimension represents a component of word meaning [2] [1] [3]. These methods exploit the semantic similarity of words: two vectors are close in space when their words are close in meaning. The vectors are usually computed by a learning algorithm on large corpora such as Wikipedia and then used to evaluate the similarity between two words.

In this work, we exploit the word-word similarities of the GloVe model as an external resource for ontology matching. The hypothesis is that two entities can be matched based on the words in their names, using the word-word similarity provided by the model. We built a prototype and evaluated its performance against the baselines from OAEI.

Prototype

To build the simplest prototype, we used pre-trained vectors¹ from GloVe and two ontologies $O_1$ and $O_2$. Each entity $e$ defined in $O_1$ or $O_2$ is associated with one vector $\vec{v}_e = (a_1, \ldots, a_n)$ based on its name, where each component $a_i$ corresponds to a semantic dimension shared by words with related meanings. If entity $e$ has a compound name, we average the vectors of the words in its name and set the resulting vector as $\vec{v}_e$. To generate a correspondence between two entities $e_1$ and $e_2$, from $O_1$ and $O_2$ respectively, we calculate the cosine similarity between their vectors $\vec{v}_1$ and $\vec{v}_2$. If the cosine similarity is above a lower bound, we keep the correspondence; otherwise, it is discarded. This lower bound was empirically set to 0.7, as this value gave the best results.
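The procedure described above can be sketched roughly as follows. This is a minimal illustration, not the prototype's actual code: the function names, the splitting of compound names on camel case and non-alphanumeric characters, and the toy three-dimensional vectors are all assumptions made for the example. The loader assumes the plain-text layout of the pre-trained GloVe files (one word per line followed by its space-separated components) and is not called here.

```python
import math
import re


def load_glove(path):
    """Read a GloVe text file: each line is a word followed by its components."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


def tokenize(name):
    """Split an entity name into lowercase words (assumed splitting rule)."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", spaced) if w]


def entity_vector(name, embeddings, dim):
    """Average the word vectors of a name; unknown words count as the zero vector."""
    words = tokenize(name)
    total = [0.0] * dim
    for w in words:
        vec = embeddings.get(w, [0.0] * dim)
        total = [t + x for t, x in zip(total, vec)]
    n = max(len(words), 1)
    return [t / n for t in total]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match(entities1, entities2, embeddings, dim, threshold=0.7):
    """Keep every entity pair whose cosine similarity is above the lower bound."""
    vecs2 = {e: entity_vector(e, embeddings, dim) for e in entities2}
    alignment = []
    for e1 in entities1:
        v1 = entity_vector(e1, embeddings, dim)
        for e2, v2 in vecs2.items():
            sim = cosine(v1, v2)
            if sim > threshold:
                alignment.append((e1, e2, sim))
    return alignment


# Toy vectors (hypothetical, for illustration only; real GloVe vectors have 50+ dimensions).
toy = {
    "paper": [1.0, 0.0, 0.0],
    "article": [0.9, 0.1, 0.0],
    "author": [0.0, 1.0, 0.0],
    "writer": [0.1, 0.9, 0.0],
    "review": [0.0, 0.0, 1.0],
}
result = match(["Paper", "PaperAuthor"], ["Article", "Writer", "Review"], toy, dim=3)
for e1, e2, sim in result:
    print(f"{e1} = {e2} ({sim:.2f})")
```

In this toy run the compound name PaperAuthor is matched to both Article and Writer, because averaging places its vector between the two words; with a single global threshold, one entity may therefore take part in several correspondences.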

Repeating this procedure for all entity pairs yields the complete alignment. Finally, we compare this alignment with the baseline alignments edna (edit-distance based) and StringEquiv (string-equivalence based) from OAEI 2016 on the conference and benchmark data sets. The results are presented in Table 1. The prototype obtained low recall on both data sets. On the benchmark data set, the majority of errors occurred on tests with random entity names, which explains the low recall. This is expected, since entity names are the only source of information our method uses to capture entity semantics and generate correspondences.
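For reference, the measures reported in the comparison can be computed from a generated alignment and a reference alignment as sketched below. This uses the standard set-based definitions of precision, recall, and F1-measure; the function name and the toy alignments are hypothetical, and the exact OAEI evaluation procedure (for example, how scores are averaged over test cases) is not reproduced here.

```python
def evaluate(found, reference):
    """Set-based precision, recall and F1 of an alignment against a reference."""
    found, reference = set(found), set(reference)
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy alignments (hypothetical entity pairs, for illustration only).
found = [("Paper", "Article"), ("PaperAuthor", "Writer"), ("PaperAuthor", "Article")]
reference = [
    ("Paper", "Article"),
    ("PaperAuthor", "Writer"),
    ("Chair", "Chairman"),
    ("Reviewer", "Referee"),
]
precision, recall, f1 = evaluate(found, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here two of the three generated correspondences are correct, while two of the four reference correspondences are missed, which is the pattern of higher precision and lower recall discussed in the text.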

On the conference data set, the prototype's F1-measure (0.54) fell between those of the two baselines (0.53 for StringEquiv and 0.56 for edna). Many words from entity names were not in the vocabulary of the vectors and were assigned the vector $\vec{0}$, which contributes to the modest recall.

Conclusion

These results are not ground-breaking, but they are promising. Furthermore, given the simplicity of the prototype, there are many places where it can be improved. For example, in a future experiment, we could train our own vectors and fine-tune the hyperparameters of the model. We believe that these improvements may increase performance and lead to further research in the area.

Table 1. Comparison between the prototype and the baselines on each data set

Dataset (method)            Precision   Recall   F1-measure
Conference (edna)           0.74        0.45     0.56
Conference (StringEquiv)    0.76        0.41     0.53
Conference (Prototype)      0.71        0.45     0.54
Benchmark (edna)            0.35        0.51     0.41
Benchmark (Prototype)       0.72        0.26     0.34

¹ Obtained at http://nlp.stanford.edu/data/glove.6B.zip

[1] J. Pennington, R. Socher, C. D. Manning. GloVe: Global Vectors for Word Representation. Empirical Methods in Natural Language Processing (EMNLP), 2014.
[2] T. Mikolov, K. Chen, G. Corrado, J. Dean. Efficient Estimation of Word Representations in Vector Space. Computing Research Repository (CoRR) abs-, 2013.
[3] E. Gabrilovich, S. Markovitch. Wikipedia-based Semantic Interpretation for Natural Language Processing. J. Artif. Intell. Res. 34, 2009.
[4] W. He, X. Yang, D. Huang. A Hybrid Approach for Measuring Semantic Similarity between Ontologies Based on WordNet. Knowledge Science, Engineering and Management - 5th International Conference, 2011.
[5] F. Lin, K. Sandkuhl. A Survey of Exploiting WordNet in Ontology Matching. Artificial Intelligence in Theory and Practice II, 43, 2008.