Using Word Semantics on Entity Names for Correspondence Set Generation

Rafael Vieira¹ and Kate Revoredo²

¹,² Federal University of the State of Rio de Janeiro (UNIRIO), Brazil
¹ katerevoredo@uniriotec.br, ² rvieira.research@gmail.com.br

1 Introduction

In ontology matching, many works use word semantics to align ontologies. One commonly used resource is WordNet [4][5], which groups words that share the same meaning. Thesauri and lexicons like WordNet provide rich semantic information, but they require large amounts of human effort to create and maintain.

Vector space representations of word semantics are a family of language models that associate words with vectors in a semantic space, where each dimension represents a component of the meaning of words [2][1][3]. These methods exploit the semantic similarity of words: words that are close in meaning receive vectors that are close in space. The vectors are usually computed by a learning algorithm on large corpora such as Wikipedia and then used to evaluate the similarity between two words.

In this work, we exploit the word-word similarities in the GloVe model as an external resource for ontology matching. The hypothesis is that two entities can be matched based on the words in their names, using the word-word similarity provided by the model. We built a prototype and evaluated its performance against the baselines from OAEI.

2 Prototype

To build the simplest possible prototype, we used pre-trained vectors¹ from GloVe and two ontologies $O_1$ and $O_2$. Each entity $e$ defined in $O_1$ or $O_2$ is associated with one vector $\vec{v}_e = (a_1, \ldots, a_n)$ based on its name, where each component $a_i$ is the value along one semantic dimension, shared by words with related meaning. If entity $e$ has a compound name, we average the vectors of the words in its name and use the resulting vector as $\vec{v}_e$.
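The name-to-vector step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype's actual code: the helper names `split_name` and `entity_vector` are hypothetical, the tokenization rule (camelCase, underscores, hyphens, spaces) is an assumption because the paper does not say how compound names were split, and a toy three-dimensional dictionary stands in for the pre-trained GloVe vectors. Words missing from the vocabulary are mapped to the zero vector, which is what the paper reports for out-of-vocabulary words.

```python
import re


def split_name(name):
    """Split an entity name into lowercase words.

    Assumption: names use camelCase, underscores, hyphens, or spaces.
    The paper does not specify its tokenization.
    """
    # Insert a space at every lowercase/digit -> uppercase boundary.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return [w.lower() for w in re.split(r"[\s_\-]+", spaced) if w]


def entity_vector(name, word_vectors, dim):
    """Average the vectors of all words in an entity name.

    Out-of-vocabulary words contribute the zero vector, so they still
    count in the denominator and pull the average toward zero.
    """
    words = split_name(name)
    if not words:
        return [0.0] * dim
    total = [0.0] * dim
    for word in words:
        vec = word_vectors.get(word, [0.0] * dim)
        for i, x in enumerate(vec):
            total[i] += x
    return [x / len(words) for x in total]


# Toy 3-dimensional vectors standing in for pre-trained GloVe vectors.
toy_vectors = {
    "conference": [1.0, 0.0, 2.0],
    "paper": [3.0, 2.0, 0.0],
}

print(entity_vector("ConferencePaper", toy_vectors, 3))  # [2.0, 1.0, 1.0]
```

Counting out-of-vocabulary words in the denominator is one possible reading of the averaging step; skipping them instead would give a different vector for partially covered names.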
To generate a correspondence between two entities $e_1$ and $e_2$, from $O_1$ and $O_2$ respectively, we calculate the cosine similarity between their associated vectors $\vec{v}_1$ and $\vec{v}_2$. If the cosine similarity is above a lower bound, we keep the correspondence; otherwise, it is discarded. This lower bound was empirically set to 0.7, as this value gave the best results.

¹ Obtained at http://nlp.stanford.edu/data/glove.6B.zip

After repeating this procedure for all entity pairs, we have the complete alignment. Finally, we compare this alignment with the baseline alignments edna (edit distance based) and StringEquiv (string equivalence based) from OAEI 2016 on the conference and benchmark data sets. The results are presented in Table 1.

Dataset (method)           Precision   Recall   F1-measure
Conference (edna)          0.74        0.45     0.56
Conference (StringEquiv)   0.76        0.41     0.53
Conference (Prototype)     0.71        0.45     0.54
Benchmark (edna)           0.35        0.51     0.41
Benchmark (Prototype)      0.72        0.26     0.34

Table 1. Comparison between the prototype and the baselines on each data set

The prototype obtained low recall on both data sets. The majority of errors on the benchmark data set were on tests with random entity names, resulting in the low recall. This is expected, since our method uses entity names as its only source of information to gather the entity semantics and then generate correspondences.

On the conference data set, the prototype performed between the two baselines. Many words from entity names were not in the vocabulary of the vectors and were assigned the vector $\vec{0}$, which contributes to the average recall.

3 Conclusion

These results are not ground-breaking, but they are promising. Furthermore, given the simplicity of the prototype, there are many places where it can be improved. For example, in a future experiment, we should train our own vectors and fine-tune the hyperparameters of the model.
We believe that these improvements may provide increased performance and lead to further research in the area.

References

1. Pennington, J., Socher, R., Manning, C. D.: GloVe: Global Vectors for Word Representation. Empirical Methods in Natural Language Processing (EMNLP), 1532–1543 (2014)
2. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space. Computing Research Repository (CoRR), abs/1301.3781 (2013)
3. Gabrilovich, E., Markovitch, S.: Wikipedia-based Semantic Interpretation for Natural Language Processing. J. Artif. Intell. Res., 34, 443–498 (2009)
4. He, W., Yang, X., Huang, D.: A Hybrid Approach for Measuring Semantic Similarity between Ontologies Based on WordNet. Knowledge Science, Engineering and Management - 5th International Conference, 68–78 (2011)
5. Lin, F., Sandkuhl, K.: A Survey of Exploiting WordNet in Ontology Matching. Artificial Intelligence in Theory and Practice II, 43, 341–350 (2008)