=Paper=
{{Paper
|id=Vol-3033/paper18
|storemode=property
|title=A Methodology for Large-Scale, Disambiguated and Unbiased Lexical Knowledge Acquisition Based on Multilingual Word Alignment
|pdfUrl=https://ceur-ws.org/Vol-3033/paper18.pdf
|volume=Vol-3033
|authors=Francesca Grasso,Luigi Di Caro
|dblpUrl=https://dblp.org/rec/conf/clic-it/GrassoC21
}}
==A Methodology for Large-Scale, Disambiguated and Unbiased Lexical Knowledge Acquisition Based on Multilingual Word Alignment==
Francesca Grasso, Luigi Di Caro
University of Turin, Department of Computer Science
{fr.grasso,luigi.dicaro}@unito.it
Abstract

In order to be concretely effective, many NLP applications require the availability of lexical resources providing varied, broadly shared, and language-unbounded lexical information. However, state-of-the-art knowledge models rarely adopt such a comprehensive and cross-lingual approach to semantics. In this paper, we propose a novel automatable methodology for knowledge modeling based on a multilingual word alignment mechanism that enhances the encoding of unbiased and naturally disambiguated lexical knowledge. Results from a simple implementation of the proposal show relevant outcomes that are not found in other resources.

1 Introduction

Lexical resources constitute a key instrument for many NLP tasks such as Word Sense Disambiguation and Machine Translation. However, their potential may vary widely depending on the nature of the lexical-semantic knowledge they encode, as well as on how the linguistic data are stored and linked within the network (Zock and Biemann, 2020). The resources that are presently available, such as WordNet (Miller, 1995), typically encode lexical-semantic knowledge mainly in terms of word senses, defined by textual (i.e. dictionary) definitions, and lexical entries are linked and put in context through lexical-semantic relations. These relations, being only of a paradigmatic nature, are characterized by a sharing of the same defining properties between the words and a requirement that the words be of the same syntactic class (Morris and Hirst, 2004). Typically related words are therefore not represented due to the absence of syntagmatic links. Additionally, word senses suffer from a lack of explicit common-sense knowledge and context-dependent information. Finally, the well-known fine granularity of word senses in WordNet (Palmer et al., 2007) is due to the lack of a meaning encoding system capable of representing concepts in a flexible way. Other kinds of resources such as FrameNet (Baker et al., 1998) and ConceptNet (Speer et al., 2017) present the same issue, while returning different types and degrees of structural semantic information and disambiguation capabilities.

In this contribution, we provide a novel methodology for the retrieval and representation of unbiased and naturally disambiguated lexical information that relies on a multilingual word alignment mechanism. In particular, we exploit textual resources in different languages¹ in order to acquire and align varied lexical-semantic material of the form [...] that is common and shared by all the k languages involved. As we demonstrate through a simple implementation, our method makes it possible to create new lexical-semantic relations between words that are not always available in other resources, as well as to perform an automatic word sense disambiguation process. This system therefore enhances the encoding of prototypical semantic information of concepts that is also likely to be free from strong cultural-linguistic and lexicographic biases.

The benefits provided by our novel multilingual word alignment mechanism are thus fourfold: (i) a linguistic and lexicographic de-biasing of lexical knowledge; (ii) naturally-disambiguated aligned lexical entries; (iii) the discovery of novel lexical-semantic relations; and (iv) the representation of prototypical semantic information of concepts in different languages.

¹ In this work, we start with the combination of three languages: English, German and Italian.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2 Background and Related Work

2.1 Bias Types

Due to its complex and fluid nature, lexical semantics needs to undergo a process of abstraction and simplification in order to be encoded into a formal model. As a result, lexical knowledge provided by lexical resources - especially when monolingual - will inherently carry different types of biases. In particular, (i) linguistic and (ii) lexicographic biases affect the encoding, consumption, and exploitation of lexical knowledge in downstream tasks.

Linguistic bias. Lexical information encoded in a language’s lexicon, as well as the potential contexts in which a given lexeme can occur, inevitably reflect the socio-cultural background of the speakers of that language. Lexical resources used for the compilation of lexical knowledge are often conceived as monolingual, therefore they mostly return culture-bounded semantic information which does not account for more shared knowledge.

Lexicographic bias. The nuclear components extracted from textual definitions can be different depending on the resource used, even within a single language (Kiefer, 1988). For example, the definition of “cow” reported by the Oxford Dictionary is “a large animal kept on farms to produce milk or beef” while the Merriam-Webster Dictionary reports “the mature female of cattle”. Both endogenous and exogenous properties can be subjectively reported (Woods, 1975), such as the term “large” and the milk production respectively.

2.2 Related Work

On one side, lexicons are built on top of synsets² and contextualize meanings (or senses) mainly in terms of paradigmatic relations. WordNet (Miller, 1995) and BabelNet (Navigli and Ponzetto, 2010) can be seen as the cornerstone and the summit in that respect. However, if on the one hand WordNet’s dense network of taxonomic relationships allows a high degree of systematization, on the other hand, a key unsolved issue with “wordnets” is the fine granularity of their inventories. Note that multilingualism in BabelNet is provided as an indexing service rather than as an alignment and unbiasing systematization method.

Extensions of these resources also include Common-Sense Knowledge (CSK), which refers to some (to a certain extent) widely-accepted and shared information. CSK describes the kind of general knowledge material that humans use to define, differentiate and reason about the conceptualizations they have in mind (Ruggeri et al., 2019). ConceptNet (Speer et al., 2017) is one of the largest CSK resources, collecting and automatically integrating data starting from the original MIT Open Mind Common Sense project³. However, terms in ConceptNet are not disambiguated. Property norms (McRae et al., 2005; Devereux et al., 2014) represent a similar kind of resource, which is more focused on the cognitive and perception-based aspects of word meaning. Norms, in contrast with ConceptNet, are based on semantic features empirically constructed via questionnaires producing lexical (often ambiguous) labels associated with target concepts, without any systematic methodology of knowledge collection and encoding.

Another widespread modeling approach is based on vector space models of lexical knowledge. Vectors are automatically learnt from large corpora utilizing a wide range of statistical techniques, all centered on Harris’ distributional assumption (Harris, 1954), i.e. words that occur in the same contexts tend to have similar meanings. Well-known models include word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2016), sense embeddings (Huang et al., 2012; Iacobacci et al., 2015; Kumar et al., 2019), and contextualized embeddings (Scarlini et al., 2020). However, the relations holding between vector representations are not typed, nor are they organized systematically.

Among the several other modeling strategies proposed, lexicographic-centered resources have been focused on the contextualization of lexical items within syntactic structures, e.g. Corpus Pattern Analysis (CPA) (Hanks, 2004), situation frames such as FrameNet (Fillmore, 1977; Baker et al., 1998) and conceptual frames (Moerdijk et al., 2008; Leone et al., 2020). Words are not taken in isolation and the meaning they are attributed is connected to prototypical patterns or typed slots. However, these theories and methods for building semantic resources remain linked to the lexical basis and do not manage the mentioned biases.

² Words considered as synonyms in specific contexts.
³ https://www.media.mit.edu/projects/open-mind-common-sense/overview/
3 The Multilingual Word Alignment

As is known, a single word form can be associated with more than one related sense, causing what is referred to as semantic ambiguity, or polysemy. This phenomenon, however, manifests itself differently across languages, since each language encodes meaning into words in its own particular way. We can therefore assume that, while a given polysemous word may be ambiguous in a certain context, a semantically corresponding word in another language may not be. Based on this assumption, it is possible to exploit this cross-language property to disambiguate a given word using its semantic equivalent in another language when they both occur in the same context. Such a disambiguation process can take place because the two words feature different semantic - specifically, polysemous - behaviours. Accordingly, we developed a knowledge acquisition methodology that features the power of word sense disambiguation, relying on a multilingual alignment mechanism.

After providing a brief illustration of the languages we have selected for this first trial, we describe the methodology in more detail by using a basic example. Afterwards, a simple implementation of the proposed mechanism is presented.

3.1 Languages Involved

Among the benefits provided by the multilingual word alignment methodology we propose, one is that it prevents the represented lexical information from containing strong cultural-linguistic biases. This objective is pursued through the use of three different languages, reflecting in turn three diverse backgrounds. For this first trial we involved English, German and Italian. These languages were chosen primarily because we are proficient in them and are therefore able to exert control over the data of our trial, as well as to interpret the results properly. Concurrently, given the nature of the methodology, it was necessary to select a set of languages with a certain degree of similarity in terms of shared lexical-semantic material. Indeed, the alignment mechanism can work and be effective as long as the lexical-semantic systems of the languages involved reflect a somewhat similar cultural-linguistic background. For example, we might expect languages to agree on the meanings of “carp”, “cottage” and “sled” as long as speakers of these languages have comparable exposure to the relevant data. We would not expect a language spoken in a place without carp to have a word corresponding to “carp”. The purpose of this project is not to forcibly identify universally valid semantic relationships, but rather to avoid reporting biased information deriving from the use of data coming from a single linguistic context. For this reason, in our case the choice fell on European languages⁴ (two Germanic languages and a Romance one).

We now describe in detail the alignment mechanism through a basic example. Consider the following word forms: wool (EN); Wolle (DE); lana (IT), expressing a single target concept⁵. For each of the three lexical forms we collect a set of related words in terms of paradigmatic (e.g. synonyms) and syntagmatic (e.g. co-occurrences) relations. The target-related words can be modifiers, verbs, or substantives. We thus obtain three different lists of words, one for each of the languages involved. The retrieved terms in the lists are still potentially ambiguous, since they refer to a lexical form rather than to a contextually defined concept. Table 1 provides a small excerpt of such unordered lists of related words.

wool (EN)    Wolle (DE)     lana (IT)
sheep        Schal          cotone
cotton       spinnen        Biella
synthetic    Baumwolle      sintetica
spin         Rudolf         sciarpa
scarf        synthetisch    pecora
mitten       Schafe         filare

Table 1: Unordered lists of single-language related words for the target concept wool-Wolle-lana.

The lexical data in the lists are subsequently compared and filtered in order to select only the semantic items that occur in all the lists, i.e., in the reported example, those shared by the three languages⁶. The resulting words are thus aligned with their semantic counterparts, generating a set of aligned triplets, as shown in Table 2.

⁴ By “European” we refer to the European linguistic area.
⁵ An absolute monosemy is, of course, realistically unreachable.
⁶ This implies the presence of a translation step.

This multilingual word alignment provides, as a consequence, an automatic Word Sense Disambiguation system. Once the triplets are formed, their members will indeed be associated with a
likely unique sense, i.e. the one coming from the intersection of all possible language-specific senses related to the three words. In other terms, the target-related words, once aligned, naturally identify (and provide) a common semantic context. As a consequence, potentially polysemous words are disambiguated through such context, without any support from sense repositories. For example, the context-consistent sense of the verb to spin (EN), which is a highly polysemous word in English, can be identified by selecting the only sense that is also shared by the other two aligned words, i.e. “turn fibres into thread”. In fact, neither spinnen (DE) nor filare (IT) can possibly mean e.g. “rotate”.

wool (EN)    ↔  Wolle (DE)    ↔  lana (IT)
sheep        ↔  Schafe        ↔  pecora
cotton       ↔  Baumwolle     ↔  cotone
synthetic    ↔  synthetisch   ↔  sintetica
spin         ↔  spinnen       ↔  filare
scarf        ↔  Schal         ↔  sciarpa

Table 2: Examples of aligned concept-related words for the target concept wool-Wolle-lana.

This mechanism generates a twofold effect: besides performing word sense disambiguation, it also provides lexical knowledge in the form of (paradigmatic and syntagmatic) lexical-semantic relations between words that is also language-unbounded. In the first place, the uncontrolled character of the data retrieval and alignment process enables the generation of novel lexical-semantic relations that are likely not available in other structured resources. Additionally, since the resulting set of words related to the target can be only the one shared by multiple languages, the lexical information will not reflect a specific cultural/linguistic background, but rather a common and shared one. For example, in Table 1 the presence of the word “Biella” among the list of words related to “lana” probably reflects the fact that the Italian city Biella is (locally) famous for its wool, so the two words may co-occur frequently. Similarly, if we consider the alignment [...], a lexeme related to the English word form would be “rain”, due to the well-known idiom “it’s raining cats and dogs”. However, neither “Biella” nor corresponding words for “rain” can possibly appear in the lists of related words of the respective other languages, being language-specific items within those contexts. Therefore, the lexical information provided by the alignment mechanism will be free from strong cultural-linguistic biases. Finally, as illustrated in the next section, by exploiting multiple and differently built resources, we are able to reduce arbitrariness and lexicographic biases within the lexical knowledge represented.

4 Implementation

In this section we describe details and results of a simple implementation of the proposed alignment mechanism for the acquisition of disambiguated and unbiased lexical information. In particular, the system is composed of two main modules: a context generation module and an alignment procedure. We finally report the results of an evaluation to highlight mainly (i) the autonomous disambiguation power of the approach, (ii) the quality of the alignments and their unbiased and syntagmatic nature, and (iii) the amount of unveiled lexical-semantic relations not covered by existing state-of-the-art resources such as BabelNet.

POS     scale (EN)    bilancia (IT)    Waage (DE)
noun    accuracy      precisione       Genauigkeit
noun    balance       equilibrio       Balance
noun    bulk          massa            Masse
noun    control       controllo        Kontrolle
noun    device        dispositivo      Gerät
noun    figure        cifra            Zahl
adj     accurate      preciso          genau
adj     smart         intelligente     intelligent
verb    indicate      indicare         zeigen
verb    set           regolare         einstellen

Table 3: 10 automatic alignments (out of 74) for the target concept scale-bilancia-Waage (BabelNet synset 00069470n).

4.1 Context for Multilingual Alignment

To retrieve the concept-related words for the multilingual alignment we made use of two textual resources: Sketch Engine (Kilgarriff et al., 2014) and the Leipzig Corpora Collection (Quasthoff et al., 2014). Through the former, we searched for related words with its tool named “Word Sketch” on the TenTen Corpus Family⁷. In particular, we were able to automatically collect words appearing in the following grammatical relations: “modifiers of w”, “adj. predicates of w”, “verbs with w as subject” and “verbs with w as object”. The retrieved concept-related words are then lemmatized and marked with the suitable POS tags. Finally, we utilized the Leipzig Corpora Collection portal for searching additional context words in terms of left and right (POS-tagged) co-occurrences.

4.2 Multilingual Alignment

The Google Translate API was used for finding translations of related words in the three languages⁸. In particular, given a certain term t_L1 in a language L1, we opted for retrieving all its possible translations into the other two languages (L2, L3). We then tried to match each translated item with the previously-retrieved sets of related words in L2, L3. Whenever the [t_L1 ↔ t_L2]; [t_L1 ↔ t_L3] match succeeded, we finally checked any possible [t_L2 ↔ t_L3] match. If a [t_L1 ↔ t_L2 ↔ t_L3] semantic equivalence occurs, then the alignment can take place. Table 3 shows an excerpt of automatic alignments for the concept scale (bn:00069470n).

4.3 Evaluation

Our aim is not to overcome state-of-the-art resources but rather to incorporate new and unbiased semantic relations from a novel multilingual alignment mechanism. In particular, we wanted to verify to what extent our knowledge acquisition method is able to unveil lexical relations not yet covered by a state-of-the-art resource (BabelNet). Thus, we first generated sets of related words from BabelNet in order to compare them with those produced and aligned by our (automated) methodology. In particular, through the BabelNet API, we obtained the English, Italian, and German lexicalizations of the synsets connected to each target synset, together with the words included in their glosses⁹. As test cases, we randomly picked 500 concepts constituting polysemous words in at least one of the three languages, obtaining non-empty alignments for 456 of them. In Table 4 we report the results of the alignment on six concepts.

             00008050n  00069470n  00069470n  00062766n  00008364n  00008363n  Avg.
(en)         libra      scale      plane      plane      bank       bank
(it)         bilancia   bilancia   aereo      piano      banca      riva
(de)         Waage      Waage      Flugzeug   Ebene      Bank       Ufer
triplets     26         74         272        151        349        80
novel(en)    88.46%     87.84%     88.97%     89.40%     87.68%     91.25%     88.9%
novel(it)    76.92%     66.22%     75.74%     73.51%     75.64%     68.75%     72.8%
novel(de)    88.46%     74.32%     87.87%     84.11%     81.66%     76.25%     82.1%

Table 4: Alignments for six ambiguous concepts and percentage of unveiled novel relations in each language with respect to the BabelNet database. Some examples of triplets for the concept scale-bilancia-Waage (bn:00069470n) are shown in Table 3.

Despite its limitations, our first implementation of the proposed methodology was able to discover a total of 76,152 multilingual alignments over the 456 concepts, with (on average) more than 80% novel semantic relations with respect to what is currently encoded in BabelNet across the three languages. Still, the extracted data represent mostly unbiased and disambiguated knowledge, leading towards the construction of a new large-scale and multilingual prototypical lexical database.

5 Conclusions and Future Work

In this paper we proposed an original methodology for acquiring and encoding lexical knowledge through a novel yet simple mechanism of multilingual alignment. The aim was to represent varied, disambiguated, and language-unbounded lexical knowledge by minimizing strong linguistic and lexicographic biases. A simple implementation and experimentation on 456 concepts led to the discovery of around 76K aligned lexical-semantic features, of which more than 80% were new when compared with a current state-of-the-art resource such as BabelNet. Future directions include the use of more languages and large-scale runs over thousands of main concepts (Bentivogli et al., 2004; Di Caro and Ruggeri, 2019; Camacho-Collados and Navigli, 2017).

⁷ https://www.sketchengine.eu/documentation/tenten-corpora
⁸ No surrounding syntactic context for the words to align was available for more advanced Machine Translation.
⁹ We used the spaCy library to analyze, extract and lemmatize the text - https://spacy.io.
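
The matching procedure of Section 4.2 and the novelty measure of Section 4.3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the related-word lists are the excerpt from Table 1, while the `TRANSLATIONS` tables, the `known_en` set, and the function names `matches`, `align_triplets` and `novel_rate` are hypothetical stand-ins (the paper uses the Google Translate API and BabelNet instead). Novelty is assumed here to be the share of triplets whose member in a given language is absent from the reference resource.

```python
# Related-word lists per language (excerpt from Table 1, target word excluded).
RELATED = {
    "en": {"sheep", "cotton", "synthetic", "spin", "scarf", "mitten"},
    "de": {"Schal", "spinnen", "Baumwolle", "Rudolf", "synthetisch", "Schafe"},
    "it": {"cotone", "Biella", "sintetica", "sciarpa", "pecora", "filare"},
}

# Hypothetical translation tables standing in for a translation service.
# Key: (source language, target language); value: word -> all candidate translations.
TRANSLATIONS = {
    ("en", "de"): {
        "sheep": {"Schafe", "Schaf"},
        "cotton": {"Baumwolle"},
        "synthetic": {"synthetisch", "künstlich"},
        "spin": {"spinnen", "drehen"},
        "scarf": {"Schal"},
        "mitten": {"Fäustling"},
    },
    ("en", "it"): {
        "sheep": {"pecora"},
        "cotton": {"cotone"},
        "synthetic": {"sintetica", "sintetico"},
        "spin": {"filare", "girare"},
        "scarf": {"sciarpa"},
        "mitten": {"muffola"},
    },
    ("de", "it"): {
        "Schafe": {"pecora", "pecore"},
        "Baumwolle": {"cotone"},
        "synthetisch": {"sintetica", "sintetico"},
        "spinnen": {"filare"},
        "Schal": {"sciarpa"},
    },
}


def matches(word, src, tgt):
    """Translations of `word` that also occur in the related-word list of `tgt`."""
    return TRANSLATIONS.get((src, tgt), {}).get(word, set()) & RELATED[tgt]


def align_triplets(l1="en", l2="de", l3="it"):
    """Keep (t1, t2, t3) only if the t1-t2, t1-t3 and t2-t3 matches all succeed."""
    triplets = set()
    for t1 in RELATED[l1]:
        for t2 in matches(t1, l1, l2):
            # t3 must be reachable both from t1 and from t2.
            for t3 in matches(t1, l1, l3) & matches(t2, l2, l3):
                triplets.add((t1, t2, t3))
    return triplets


def novel_rate(triplets, known_words, position):
    """Share of triplets whose member at `position` is absent from `known_words`."""
    if not triplets:
        return 0.0
    novel = sum(1 for t in triplets if t[position] not in known_words)
    return novel / len(triplets)


if __name__ == "__main__":
    triplets = align_triplets()
    for t in sorted(triplets):
        print(" <-> ".join(t))
    # Hypothetical set of English words already linked to the concept
    # in a reference resource.
    known_en = {"sheep"}
    print(f"novel(en) = {novel_rate(triplets, known_en, 0):.2%}")
```

On this toy data the language-specific items ("Biella", "Rudolf", "mitten") drop out because they have no counterpart in all three lists, and the unrelated translations of "spin" ("drehen", "girare") are discarded because they do not occur in the German and Italian related-word lists.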
References

Collin F Baker, Charles J Fillmore, and John B Lowe. 1998. The Berkeley FrameNet project. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 86–90.

Luisa Bentivogli, Pamela Forner, Bernardo Magnini, and Emanuele Pianta. 2004. Revising the WordNet domains hierarchy: semantics, coverage and balancing. In Proceedings of the workshop on multilingual linguistic resources, pages 94–101.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Jose Camacho-Collados and Roberto Navigli. 2017. BabelDomains: Large-scale domain labeling of lexical resources. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 223–228.

Barry J Devereux, Lorraine K Tyler, Jeroen Geertzen, and Billi Randall. 2014. The CSLB concept property norms. Behavior Research Methods, 46(4):1119–1127.

Luigi Di Caro and Alice Ruggeri. 2019. Unveiling middle-level concepts through frequency trajectories and peaks analysis. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, pages 1035–1042.

Charles J Fillmore. 1977. Scenes-and-frames semantics. Linguistic structures processing, 59:55–88.

Patrick Hanks. 2004. Corpus pattern analysis. In Euralex Proceedings, volume 1, pages 87–98.

Zellig S Harris. 1954. Distributional structure. Word, 10(2-3):146–162.

Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proc. of ACL, pages 873–882.

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. SensEmbed: learning sense embeddings for word and relational similarity. In Proceedings of ACL, pages 95–105.

Ferenc Kiefer. 1988. Linguistic, conceptual and encyclopedic knowledge: Some implications for lexicography. In T. Magay and J. Zigány, editors, Proceedings of the 3rd EURALEX International Congress, pages 1–10, Budapest, Hungary, sep. Akadémiai Kiadó.

Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. 2014. The Sketch Engine: Ten years on. Lexicography, 1(1):7–36.

Sawan Kumar, Sharmistha Jat, Karan Saxena, and Partha Talukdar. 2019. Zero-shot word sense disambiguation using sense definition embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5670–5681.

Valentina Leone, Giovanni Siragusa, Luigi Di Caro, and Roberto Navigli. 2020. Building semantic grams of human knowledge. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2991–3000.

Ken McRae, George S Cree, Mark S Seidenberg, and Chris McNorgan. 2005. Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37(4):547–559.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.

George A Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Fons Moerdijk, Carole Tiberius, and Jan Niestadt. 2008. Accessing the ANW dictionary. In Proc. of the workshop on Cognitive Aspects of the Lexicon, pages 18–24.

Jane Morris and Graeme Hirst. 2004. Non-classical lexical semantic relations. In Proceedings of the Computational Lexical Semantics Workshop at HLT-NAACL 2004, pages 46–51, Boston, Massachusetts, USA, May 2 - May 7. Association for Computational Linguistics.

Roberto Navigli and Simone Paolo Ponzetto. 2010. BabelNet: Building a very large multilingual semantic network. In Proc. of ACL, pages 216–225. Association for Computational Linguistics.

Martha Palmer, Hoa Trang Dang, and Christiane Fellbaum. 2007. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering, 13(02):137–163.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

Uwe Quasthoff, Dirk Goldhahn, and Thomas Eckart. 2014. Building large resources for text mining: The Leipzig Corpora Collection. In Text Mining, pages 3–24. Springer.

Alice Ruggeri, Luigi Di Caro, and Guido Boella. 2019. The role of common-sense knowledge in assessing semantic association. Journal on Data Semantics, 8(1):39–56.

Bianca Scarlini, Tommaso Pasini, and Roberto Navigli. 2020. SensEmBERT: Context-Enhanced Sense Embeddings for Multilingual Word Sense Disambiguation. In Proceedings of the 34th Conference on Artificial Intelligence. Association for the Advancement of Artificial Intelligence.

Robert Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence.

William A Woods. 1975. What’s in a link: Foundations for semantic networks. In Representation and understanding, pages 35–82. Elsevier.

Michael Zock and Chris Biemann. 2020. Comparison of different lexical resources with respect to the tip-of-the-tongue problem. Journal of Cognitive Science, 21(2):193–252.