Lexical Sense Alignment using Weighted Bipartite b-Matching Sina Ahmadi Insight Centre for Data Analytics, Data Science Institute National University of Ireland Galway sina.ahmadi@insight-centre.org Mihael Arcan Insight Centre for Data Analytics, Data Science Institute National University of Ireland Galway mihael.arcan@insight-centre.org John P. McCrae Insight Centre for Data Analytics, Data Science Institute National University of Ireland Galway john.mccrae@insight-centre.org 1 Introduction Lexical resources are important components of natural language processing (NLP) applications providing linguistic information about the vocabulary of a language and the semantic relationships between the words. While there is an increasing number of lexical resources, particularly expert-made ones such as WordNet [8] or FrameNet [2], as well as collaboratively- curated ones such as Wikipedia1 or Wiktionary2 , manual construction and maintenance of such resources is a cumbersome task. This can be efficiently addressed by NLP techniques. Aligned resources have shown to improve word, knowledge and domain coverage and increase multilingualism by creating new lexical resources such as Yago [13], BabelNet [9] and ConceptNet [12]. In addition, they can improve the performance of NLP tasks such as word sense disambiguation [10], semantic role tagging [15] and semantic relations extraction [14]. 2 Objective One of the current challenges in aligning lexical data across different resources is word sense alignment (WSA). Different monolingual resources may use different wordings and structures for the same concepts and entries. There are various approaches in aligning definitional texts based on semantic similarity and linking techniques. For instance, Meyer and Gurevych [7] use semantic similarity and Personalized PageRank (PPR) to estimate the semantic relatedness in linking Wiktionary and WordNet. Pilehvar and Navigli [11] go beyond the surface form semantic similarity by transforming resources into semantic networks. Differently, Matuschek and Gurevych [5] present Dijkstra-WSA algorithm which aligns word senses using Dijkstra’s shortest path algorithm. In this study, we present a similarity-based approach for WSA in English WordNet and Wiktionary with a focus on the polysemous items. Our approach relies on semantic textual similarity using features such as string distance metrics and word embeddings, and a graph matching algorithm. Transforming the alignment problem into a bipartite graph matching enables us to apply graph matching algorithms, in particular, weighted bipartite b-matching (WBbM). 1 https://www.wikipedia.org/ 2 https://www.wiktionary.org/ © Sina Ahmadi and Mihael Arcan and John McCrae; licensed under Creative Commons License CC-BY LDK 2019 - Posters Track. Editors: Thierry Declerck and John P. McCrae XX:2 Lexical Sense Alignment using Bipartite b-Matching 3 Method WBbM is one of the widely studied classical problems in combinatorial optimization for modeling data management applications, e-commerce and resource allocation systems [1, 3, 4]. WBbM is a variation of the weighted bipartite matching, also known as assignment problem. In the assignment problem, the optimal matching only contains one-to-one matching with the highest weight sum. This bijective mapping restriction is not realistic in the case of lexical resources where an entry may be linked to more than one entries. Therefore, WBbM aims at providing a more diversified matching where a node may be connected to a certain number of nodes. Formally, given G = ((U, V ), E) with weights W and vertex-labelling functions L : U ∪ V → N and B : U ∪ V → N, WBbM finds a subgraph H = ((U, V ), E 0 ) P which maximizes e∈E 0 W (e) having u ∈ [L(u), B(u)] and v ∈ [L(v), B(v)]. In other words, the number of the edges that can be connected to a node is determined by the lower and upper bound functions L and B, respectively. Algorithm 1: Greedy WBbM Input: G = ((U, V ), E, W ), bounds L and B Output: H = ((U, V )), E 0 , W ) satisfying bound constraints with a greedily-maximized P score e∈E 0 W (e) 0 1 E =∅ 2 Sort E by descending W(e) 3 for e to E do 4 if H = ((U, V )), E 0 ∪ {e}, W ) does not violate L and B then 5 E 0 = E 0 ∪ {e} 6 return H = ((U, V )), E 0 , W ) Algorithm 1 presents the WBbM algorithm with a greedy approach where an edge is selected under the condition that adding such an edge does not violate the lower and the upper bounds, i.e. L and B. R1 S1 S2 WBbM MATCHING Ei .. GRAPH GENERATOR Sim(R1 : Ei , R2 : Ei ) = [l0 , b0 ]   Sn σS S 0 σS S 0 ... [l, b] .. R1 : Ei : S1 R2 : Ei : S10  1 1 1 2  .  σS2 S 0 σS2 S 0 ... [l, b] R1 : Ei : S2 R2 : Ei : S20 [l0 , b0 ]   1 2  .. ..  .. ..  .. . .  ..  . .. R2  . . .  . 0   S10 R1 : Ei : Sn R2 : Ei : Sm σSn S10 ... σSn Sm 0 [l, b] [l0 , b0 ] Ei S20 [l0 , b0 ] .. 0 Sm Figure 1 Sense alignment system Sina Ahmadi et al. XX:3 We evaluate the performance of our approach on aligning sense definitions in WordNet and Wiktionary using an aligned resource presented by Meyer and Gurevych [7]. Given an identical entry in English WordNet and Wiktionary, we first convert the senses to a bipartite graph where each side of the graph represents the senses belonging to one resource. Then, we extract the similarity scores between those senses using a similarity function. The similarity function is a trained model based on similarity features such as word length ratio, longest common subsequence, Jaccard measure, word embeddings and forward precision, which is performed by NASIC [6]. And finally, the senses in the the weighted bipartite graph are matched by the WBbM algorithm. This process is illustrated in Figure 1 where senses of entry Ei in resource R1 , {S1 , S2 , ..., Sn }, are aligned with the senses of the same entry in 0 0 0 R2 , {S1 , S2 , ..., Sn }. The lower and upper bounds of the right side and left side of the graph, respectively [l, b] and [l0 , b0 ], are the parameters to be tuned. 4 Evaluation In order to evaluate the performance of our alignment approach, we calculated macro precision Pmacro , macro recall Rmacro , average F-measure Favg and average accuracy Aavg as follows: |E| TP 1 X T Pi P = Pmacro = (1) TP + FP |E| i=1 T Pi + F Pi |E| TP 1 X T Pi R= Rmacro = (2) TP + FN |E| i=1 T Pi + F Ni |E| P ×R 1 X F =2× Favg = Fi (3) P +R |E| i=1 |E| 1 X T Pi + T Ni Aavg = (4) |E| i=1 T Pi + T Ni + F Pi + F Ni where E refers to the set of entries, TP, TN, FN and FP respectively refer to true positive, true negative, false negative and false positive. Table 1 provides the evaluation results using the WBbM algorithm with different com- binations of the matching bounds over the left side (WordNet senses) and the right side (Wiktionary senses) of the alignment graph. We observe that a higher upper bound increases the recall. On the other hand, setting the lower bound to 1 provides a higher precision, while parameters with a lower bound of 0, e.g. [0, 3], lack precision. Note that [0, 1] parameter performs similarly as a bijective mapping algorithms such as the assignment problem where a node can be only matched to one node. Our approach delivers superior results in comparison to the baseline results provided by McCrae and Buitelaar [6]. L D K Po s t e r s XX:4 Lexical Sense Alignment using Bipartite b-Matching Left bound, right bound Pmacro Rmacro Favg Aavg [0, 1], [0, 1] 81.86 61.83 68.51 69.48 [0, 2], [0, 1] 78.13 70.74 73.28 76.57 [0, 3], [0, 1] 77.88 71.38 73.59 77.13 [1, 2], [1, 2] 81.21 74.17 76.59 79.49 [1, 3], [1, 3] 81.26 75.02 77.12 80.14 [1, 5], [0, 1] 81.25 75.25 77.28 80.33 [1, 5], [1, 2] 81.25 75.23 77.26 80.32 Table 1 WBbM algorithm performance on alignment of WordNet and Wiktionary 5 Conclusion We revisited WordNet-Wiktionary alignment task and proposed an approach based on textual and semantic similarity and WBbM algorithm. We demonstrated that this approach is efficient for aligning resources in comparison to the baseline results thanks to the flexibility of the matching algorithm. However, tuning the parameters of the matching algorithm needs further investigations of the resource and is not following a rule. In future work, we plan to extend our experiments to more resources. Acknowledgements This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 731015. References 1 Faez Ahmed, John P Dickerson, and Mark Fuge. Diverse weighted bipartite b-matching. arXiv preprint arXiv:1702.07134, 2017. 2 Collin F Baker, Charles J Fillmore, and John B Lowe. The berkeley framenet project. In Proceedings of the 17th international conference on Computational linguistics-Volume 1, pages 86–90. Association for Computational Linguistics, 1998. 3 Cheng Chen, Sean Chester, Venkatesh Srinivasan, Kui Wu, and Alex Thomo. Group-aware weighted bipartite b-matching. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 459–468. ACM, 2016. 4 Cheng Chen, Lan Zheng, Venkatesh Srinivasan, Alex Thomo, Kui Wu, and Anthony Sukow. Conflict-aware weighted bipartite b-matching and its application to e-commerce. IEEE Transactions on Knowledge and Data Engineering, 28(6):1475–1488, 2016. 5 Michael Matuschek and Iryna Gurevych. Dijkstra-wsa: A graph-based approach to word sense alignment. Transactions of the Association for Computational Linguistics, 1:151–164, 2013. 6 John P McCrae and Paul Buitelaar. Linking datasets using semantic textual similarity. Cybernetics and Information Technologies, 18(1):109–123, 2018. 7 Christian M Meyer and Iryna Gurevych. What psycholinguists know about chemistry: Aligning Wiktionary and Wordnet for increased domain coverage. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 883–892, 2011. 8 George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995. Sina Ahmadi et al. XX:5 9 Roberto Navigli and Simone Paolo Ponzetto. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250, 2012. 10 Roberto Navigli and Simone Paolo Ponzetto. Joining forces pays off: Multilingual joint word sense disambiguation. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, pages 1399–1410. Association for Computational Linguistics, 2012. 11 Mohammad Taher Pilehvar and Roberto Navigli. A robust approach to aligning heterogen- eous lexical resources. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 468–478, 2014. 12 Robert Speer, Joshua Chin, and Catherine Havasi. Conceptnet 5.5: An open multilingual graph of general knowledge. In AAAI, pages 4444–4451, 2017. 13 Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web, pages 697–706. ACM, 2007. 14 Robert S Swier and Suzanne Stevenson. Exploiting a verb lexicon in automatic semantic role labelling. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 883–890. Association for Computational Linguistics, 2005. 15 Nianwen Xue and Martha Palmer. Calibrating features for semantic role labeling. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004. L D K Po s t e r s