<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Spanish Word Embeddings Learned on Word Association Norms</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Helena</forename><surname>Gómez-Adorno</surname></persName>
							<email>helena.gomez@iimas.unam.mx</email>
							<affiliation key="aff0">
								<orgName type="department">Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas</orgName>
								<orgName type="institution">Universidad Nacional Autónoma de México</orgName>
								<address>
									<settlement>Ciudad de México</settlement>
									<country key="MX">México</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jorge</forename><surname>Reyes-Magaña</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Instituto de Ingeniería</orgName>
								<orgName type="institution">Universidad Nacional Autónoma de México</orgName>
								<address>
									<settlement>Ciudad de México</settlement>
									<country key="MX">México</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">Facultad de Matemáticas</orgName>
								<orgName type="institution">Universidad Autónoma de Yucatán</orgName>
								<address>
									<settlement>Mérida</settlement>
									<region>Yucatán</region>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gemma</forename><surname>Bel-Enguix</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Instituto de Ingeniería</orgName>
								<orgName type="institution">Universidad Nacional Autónoma de México</orgName>
								<address>
									<settlement>Ciudad de México</settlement>
									<country key="MX">México</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gerardo</forename><surname>Sierra</surname></persName>
							<email>gsierram@iingen.unam.mx</email>
							<affiliation key="aff1">
								<orgName type="department">Instituto de Ingeniería</orgName>
								<orgName type="institution">Universidad Nacional Autónoma de México</orgName>
								<address>
									<settlement>Ciudad de México</settlement>
									<country key="MX">México</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Spanish Word Embeddings Learned on Word Association Norms</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">B0BAAD462DF2C7D157E44D943ADF5014</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T02:45+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>word vectors</term>
					<term>node2vec</term>
					<term>word association norms</term>
					<term>Spanish</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Word embeddings are vector representations of words in an n-dimensional space, used for many natural language processing tasks. A large training corpus is needed for learning good-quality word embeddings. In this work, we present a method based on the node2vec algorithm for learning embeddings from paths in a graph. We used a collection of Word Association Norms in Spanish to build a graph of word connections. The nodes of the network correspond to the words in the corpus, whereas the edges correspond to pairs of words given in a free association test. We evaluated our word vectors on human-annotated benchmarks, achieving better results than vectors trained on a billion-word corpus with word2vec, fastText, and GloVe.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The representation of words in a vector space has been a very active research area in recent decades. Computational models like singular value decomposition (SVD) and latent semantic analysis (LSA) are capable of deriving word vector representations (word embeddings) from the term-document matrix. Both methods can reduce a dataset of N dimensions to only its most important features. Recently, Mikolov et al. <ref type="bibr" target="#b18">[19]</ref> introduced word2vec, inspired by the distributional hypothesis, which establishes that words in similar contexts tend to have similar meanings <ref type="bibr" target="#b21">[22]</ref>. This method uses a neural network to learn vector representations of words by predicting other words in their context. The vector representation of a word obtained by word2vec has the remarkable capability of preserving linear regularities between words.</p><p>In order to build an adequate and reliable vector space model, capable of capturing semantic similarity and linear regularities between words, large volumes of text are needed. Although word2vec is fast and efficient to train, and pre-trained word vectors are usually available online, it is still computationally expensive to process large volumes of data in non-commercial environments, that is, on personal computers.</p><p>Free association is an experimental technique commonly used to discover the way in which the human mind structures knowledge <ref type="bibr" target="#b7">[8]</ref>. In free association tests, a person is asked to say the first word that comes to mind in response to a given stimulus word. The set of lexical relations obtained with these experiments is called Word Association Norms (WAN). 
These kinds of resources reflect both semantic and episodic contents <ref type="bibr" target="#b5">[6]</ref>.</p><p>In previous work <ref type="bibr" target="#b3">[4]</ref>, we learned word vectors in English from a graph obtained from a WAN corpus. The vectors learned from this graph were able to map the contents of semantic and episodic memory into a vector space. For this purpose, we used the node2vec algorithm <ref type="bibr" target="#b13">[14]</ref>, which learns mappings of nodes to a continuous vector space from the complete network, taking into account the neighborhood of each node. The algorithm performs biased random walks to explore different neighborhoods in order to capture not only the structural roles of the nodes in the network but also the communities to which they belong.</p><p>In this paper, we extend our previous work on learning word vectors in English [4] by learning vector representations of words from resources that collect word association norms in Spanish. We build two embedding resources of different dimensions, the first based on the Normas de Asociación Libre en Castellano <ref type="bibr" target="#b9">[10]</ref> (NALC), and the other using the corpus of Normas de Asociación de Palabras para el Español de México <ref type="bibr" target="#b1">[2]</ref> (NAP). The embeddings obtained from both resources are available on GitHub: the NALC-based embeddings<ref type="foot" target="#foot_0">4</ref> and the NAP-based embeddings<ref type="foot" target="#foot_1">5</ref>.</p><p>The rest of the paper is organized as follows. In Section 2, we discuss related work. In Section 3, we present the corpora of Word Association Norms. In Section 4, we describe the methodological framework for learning word vectors from WANs. Section 5 shows the evaluation of the generated vectors, using word similarity datasets in Spanish. Finally, in Section 6 we draw some conclusions and point out possible directions for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Semantic networks <ref type="bibr" target="#b24">[25]</ref> are graphs relating words <ref type="bibr" target="#b0">[1]</ref>, used in linguistics and psycholinguistics not only to study the organization of the vocabulary but also to approach the structure of knowledge. Many languages have corpora of WANs. In past decades, different association lists were elaborated with the collaboration of a large number of volunteers. In recent years, however, the web has become a natural way to get data for building such resources. Jeux de Mots<ref type="foot" target="#foot_2">6</ref> provides an example in French <ref type="bibr" target="#b17">[18]</ref>, whereas the Small World of Words<ref type="foot" target="#foot_3">7</ref> contained datasets in 14 languages at the time of writing.</p><p>Sinopalnikova and Smrz <ref type="bibr" target="#b23">[24]</ref> showed that word association thesauri (WATs) are comparable to balanced text corpora and can replace them in the absence of a corpus. The authors presented a methodological framework for building and extending semantic networks with a WAT, including a comparison of the quality and information provided by WATs vs. other language resources.</p><p>Borge-Holthoefer &amp; Arenas <ref type="bibr" target="#b5">[6]</ref> used free association information for extracting semantic similarity relations with a Random Inheritance Model (RIM). The obtained vectors were compared with LSA-based vector representations and the WAS (word association space) model. Their results indicate that RIM can successfully extract word feature vectors from a free association network.</p><p>In a recent work, De Deyne et al. <ref type="bibr" target="#b8">[9]</ref> introduced a method for learning word vectors from WANs using a spreading activation approach in order to encode the semantic structure of the WAN. The authors used part of the Small World of Words network. 
The word association-based model was compared with a word embeddings model (word2vec) using relatedness and similarity judgments from humans, obtaining an average improvement of 13% over the word2vec model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Word Association Norms in Spanish</head><p>Many languages have compilations of word association norms. In past decades, some interesting resources were developed with a large number of volunteers. Among the best-known English resources accessible on the web are the Edinburgh Associative Thesaurus<ref type="foot" target="#foot_4">8</ref> (EAT) <ref type="bibr" target="#b16">[17]</ref> and the resource of Nelson et al.<ref type="foot" target="#foot_5">9</ref> <ref type="bibr" target="#b20">[21]</ref>.</p><p>For Spanish, there are several corpora of free word association; in this work we used two WAN resources: a) the Corpus de Normas de Asociación de Palabras para el Español de México (NAP) <ref type="bibr" target="#b1">[2]</ref> and b) the Corpus de Normas de Asociación Libre en Castellano <ref type="bibr" target="#b9">[10]</ref> (NALC).</p><p>The NAP corpus was elaborated with a group of 578 young adult native Mexican Spanish speakers, 239 men and 339 women, with ages ranging from 18 to 28 years and at least 11 years of education. The total number of tokens in the corpus is 65731, with 4704 different words. The authors used 234 stimulus words, all of them common nouns taken from the MacArthur word comprehension and production inventory <ref type="bibr" target="#b15">[16]</ref>. It is important to mention that although the stimuli are always nouns, the associated words are free-choice, that is, the informants can respond to the stimulus word with any word regardless of its grammatical category.</p><p>For each stimulus and its associates, the authors computed different measures: time, frequency, and association strength.</p><p>The NALC corpus includes 5819 stimulus words and their corresponding associates, obtained from the free association responses of a sample of 525 subjects for 247 words, of 200 subjects for 664 words, and of 100 subjects for the remaining words. 
In the compilation of association norms, approximately 1500 university students have participated so far. All the subjects had Spanish as their native language and participated voluntarily in the empirical study. The total number of different words in the corpus is 31207.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Learning Word Embeddings on Spanish WANs</head><p>The graph that represents a given WAN corpus is formally defined as G = {V, E, φ}, where:</p><formula xml:id="formula_0">V = {v i | i = 1, ..., n}</formula><p>is the finite set of nodes with size n, V ≠ ∅, which corresponds to the stimulus words along with their associates.</p><formula xml:id="formula_1">E = {(v i , v j ) | v i , v j ∈ V, 1 ≤ i, j ≤ n}</formula><p>is the set of edges, which corresponds to the connections between stimulus and associate words; φ : E → R is a weighting function over the edges.</p><p>We performed experiments with directed and undirected graphs. In the directed graphs, each pair of nodes (v i , v j ) follows an established order, where the initial node v i corresponds to the stimulus word and the final node v j to an associated word. In the undirected graph, all the stimuli are connected with their corresponding associates without any order of precedence. We evaluated three edge weighting functions. Time measures the seconds the participant takes to give an answer for each stimulus. Frequency establishes the number of occurrences of each word associated with a stimulus; in this work we use the inverse frequency (IF):</p><formula xml:id="formula_2">IF = ΣF − F</formula><p>where F is the frequency of a given associated word and ΣF is the sum of the frequencies of the words connected to the same stimulus.</p><p>Association strength establishes a relation between the frequency and the number of responses for each stimulus. It can be calculated as AS_W = (AW × 100) / ΣF, where AW is the frequency of a given word associated with a stimulus, and ΣF is the sum of the frequencies of the words connected to the same stimulus (the total number of answers). 
We also used the inverse of the association strength (IAS):</p><formula xml:id="formula_3">IAS = 1 − F / ΣF</formula><p>The NAP corpus provides all three weighting functions; however, for the NALC corpus only the association strength is available. Thus, in our evaluation we report results using only the association strength for the NALC corpus.</p></div>
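The weighting functions above can be sketched in a few lines of Python. The miniature set of (stimulus, associate, frequency) triples and the helper name edge_weights are hypothetical, invented for illustration only; they are not part of the NAP or NALC resources.

```python
from collections import defaultdict

# Hypothetical miniature WAN: (stimulus, associate, frequency) triples.
wan = [
    ("perro", "gato", 30),
    ("perro", "hueso", 15),
    ("perro", "animal", 5),
]

def edge_weights(records):
    """For each (stimulus, associate) edge, compute the three weightings
    described above: inverse frequency IF = ΣF - F, association strength
    AS_W = AW * 100 / ΣF, and inverse association strength IAS = 1 - F/ΣF."""
    totals = defaultdict(int)                 # ΣF per stimulus
    for stim, _, f in records:
        totals[stim] += f
    weights = {}
    for stim, assoc, f in records:
        sigma = totals[stim]
        weights[(stim, assoc)] = {
            "IF": sigma - f,
            "AS": f * 100 / sigma,
            "IAS": 1 - f / sigma,
        }
    return weights

w = edge_weights(wan)
# ΣF for "perro" is 50, so ("perro", "gato") gets IF = 20, AS = 60.0, IAS = 0.4
```

These weights can then be attached as edge attributes of the WAN graph before running node2vec over it.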
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Node2vec</head><p>Node2vec <ref type="bibr" target="#b13">[14]</ref> finds a mapping f : V → R^d that transforms the nodes of a graph into d-dimensional vectors. It defines a neighborhood N s (u) ⊂ V in the network for each node u ∈ V through a sampling strategy S. The goal of the algorithm is to maximize the probability of observing subsequent nodes on a random walk of a fixed length.</p><p>The sampling strategy designed in node2vec allows it to explore neighborhoods with biased random walks. The parameters p and q control the balance between breadth-first search (BFS) and depth-first search (DFS) in the graph. Thus, choosing an adequate balance allows preserving both the community structure and the structural equivalence between nodes in the new vector space.</p><p>In this work, we used the implementation of the node2vec project, which is available on the web<ref type="foot" target="#foot_6">10</ref>, with default values for all parameters. We also examined the quality of vectors with different numbers of dimensions.</p></div>
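The biased walk at the heart of node2vec can be illustrated with a small self-contained sketch. This is a simplified re-implementation of the second-order transition weights from the node2vec paper (1/p for returning to the previous node, 1 for a common neighbor, 1/q otherwise), not the reference code used in the experiments; the toy adjacency list is invented.

```python
import random

def biased_walks(adj, num_walks=10, walk_len=8, p=1.0, q=1.0, seed=0):
    """Generate node2vec-style second-order random walks over an
    undirected adjacency dict {node: [neighbors]}."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            while len(walk) < walk_len:
                cur = walk[-1]
                nbrs = adj[cur]
                if not nbrs:                  # dead end: stop this walk
                    break
                if len(walk) == 1:            # first step is uniform
                    walk.append(rng.choice(nbrs))
                    continue
                prev = walk[-2]
                # Unnormalized transition weights (Grover & Leskovec):
                # 1/p to go back, 1 to a common neighbor, 1/q to go farther.
                weights = [
                    1 / p if n == prev
                    else 1.0 if n in adj[prev]
                    else 1 / q
                    for n in nbrs
                ]
                walk.append(rng.choices(nbrs, weights=weights)[0])
            walks.append(walk)
    return walks

adj = {"perro": ["gato", "hueso"], "gato": ["perro"], "hueso": ["perro"]}
walks = biased_walks(adj, num_walks=2, walk_len=5)
# Every consecutive pair in a walk is an existing edge of the graph.
```

The resulting walks are treated as sentences of node IDs and fed to a skip-gram model (e.g. gensim's Word2Vec) to obtain the d-dimensional vectors.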
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Spanish Word Embeddings Evaluation</head><p>There are several evaluation methods for unsupervised word embedding methodologies <ref type="bibr" target="#b22">[23]</ref>, which are categorized as extrinsic and intrinsic. In the extrinsic evaluation, the quality of the word vectors is measured by the improvement in performance on a given natural language processing (NLP) task <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref>. Intrinsic evaluation measures the ability of word vectors to capture syntactic or semantic relationships <ref type="bibr" target="#b2">[3]</ref>.</p><p>The hypothesis of the intrinsic evaluation is that similar words should have similar representations. We therefore first visualized a sample of words using the t-SNE projection of the word vectors into a two-dimensional space. Figure <ref type="figure" target="#fig_0">1</ref> shows how words that are related to each other are grouped. We show the word vectors obtained from graphs with the three weighting functions using the NAP corpus only. In all cases the vectors illustrate some interesting phenomena. For example, when frequency is taken as the weight (the bottom graph), the word pájaro (bird) is drawn very close to avión (plane). From this, it can be inferred that the feature "fly" is more representative than "animal" for the model. For its part, the word caballo (horse) is represented closer to camioneta (truck) than to other animals, focusing more on its status as "transportation".</p><p>In addition, we evaluated the ability of word vectors to capture semantic relationships through a word similarity task. 
Specifically, we used two widely known benchmarks: a) the WordSim-353 corpus <ref type="bibr" target="#b10">[11]</ref>, composed of pairs of semantically related terms with similarity scores given by humans, and b) the MC-30 <ref type="bibr" target="#b19">[20]</ref> benchmark, containing 30 word pairs. Both datasets were used in their Spanish versions<ref type="foot" target="#foot_7">11</ref> <ref type="bibr" target="#b14">[15]</ref>.</p><p>We calculated the cosine similarity between the vectors of the word pairs contained in the above-mentioned datasets and compared it with the similarity given by humans using the Spearman correlation. To deal with the fact that not every word of the test datasets is included in our word association norms, we introduced the concept of overlap in the experiments: the total number of words common to the two lists being compared. The remaining words are excluded from the evaluation. In principle, a large overlap is a positive feature of this approach. Tables <ref type="table" target="#tab_1">1 and 2</ref> present the Spearman correlation of the similarity given by human taggers with the similarity obtained with word vectors (learned from NAP and NALC separately). We report different dimensions of word vectors learned on the undirected graphs with different weighting functions. We also report the overlap, which is the number of words found in both the given WAN corpus (NAP or NALC) and the evaluation dataset (ES-WS-353 or MC-30). It can be observed that the word embeddings obtained from the NALC corpus achieved better correlation with the human similarities than the embeddings obtained from the NAP corpus on both datasets, ES-WS-353 and MC-30. 
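The intrinsic evaluation described above reduces to three steps: filter the benchmark pairs by vocabulary overlap, score the surviving pairs with cosine similarity, and correlate the scores with the human judgments. A minimal sketch, with an invented toy vocabulary and invented human scores; the Spearman implementation uses the rank-difference formula without tie correction:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(x, y):
    """Spearman rank correlation via 1 - 6*Σd²/(n(n²-1)); assumes no ties."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Invented embeddings and human similarity judgments.
vectors = {"perro": [1.0, 0.0], "gato": [0.9, 0.1], "mesa": [0.0, 1.0]}
pairs = [("perro", "gato", 9.0), ("perro", "mesa", 1.5), ("perro", "nube", 3.0)]

model, human = [], []
for w1, w2, score in pairs:
    if w1 in vectors and w2 in vectors:   # "overlap" filter: skip OOV pairs
        model.append(cosine(vectors[w1], vectors[w2]))
        human.append(score)

rho = spearman(model, human)   # here the overlap is 2 of the 3 pairs
```

In practice a library routine such as scipy.stats.spearmanr would be used, since the benchmarks contain tied scores.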
The difference in the results can be explained by the vocabulary size of the two WANs: the NALC corpus has a higher overlap with both evaluation datasets than the NAP corpus.</p><p>In order to test and compare the quality of the Spanish word vectors, we also performed the experiments with pre-trained Spanish vectors<ref type="foot" target="#foot_8">12</ref>. We selected three word embedding models: word2vec<ref type="foot" target="#foot_9">13</ref>, GloVe<ref type="foot" target="#foot_10">14</ref>, and fastText<ref type="foot" target="#foot_11">15</ref>.</p><p>Table <ref type="table" target="#tab_2">3</ref> shows the Spearman rank order correlation between the cosine similarity obtained with word vectors pre-trained on large corpora and the human similarities (obtained from the WordSim-353 and MC-30 datasets), in comparison with the correlation between the NAP embeddings and the human-rated similarities. In the same way, Table <ref type="table" target="#tab_3">4</ref> shows the same comparison between pre-trained word vectors and the NALC-based embeddings.</p><p>Among the pre-trained models, the highest correlation value was obtained with the fastText <ref type="bibr" target="#b4">[5]</ref> vectors trained on the Spanish Wikipedia. Our method outperformed the pre-trained vectors when the vectors were learned on the NALC corpus, on both evaluation datasets, ES-WS-353 and MC-30.</p><p>We introduced a method for learning Spanish word embeddings from corpora of Word Association Norms. For learning the word vectors, we applied the node2vec algorithm to the graphs of two WAN corpora, NAP and NALC. We employed weighting functions on the edges of the graph based on three different criteria: time, inverse frequency, and inverse association strength. The best results were obtained with the association strength; however, the time weighting function also achieved high results. 
Words with a higher association strength usually have a shorter response time, which leads the algorithm to connect more related words in a neighborhood, because the node2vec algorithm looks for shorter paths in the graph.</p><p>The results we obtained using the NALC corpus are higher than those obtained with word embeddings pre-trained on large corpora. The performance even surpasses the results achieved with the vectors trained on the Spanish billion-word corpus <ref type="bibr" target="#b6">[7]</ref>. However, some simple strategies could help improve our results further, such as adjusting the parameters of the algorithm and adapting the system to different types of node neighborhoods, which could produce different configurations of the vectors. In future work we will perform an extrinsic evaluation of these Spanish word vectors, i.e., on some natural language processing task <ref type="bibr" target="#b3">[4]</ref>.</p><p>The evaluations carried out with the vectors learned on the NAP corpus also showed promising results with respect to the similarity and relatedness indexes. However, due to the small vocabulary size, the results were lower than those obtained with pre-trained embeddings. As future work, we plan to address this problem by automatically generating word association norms between pairs of words retrieved from a medium-sized corpus. With this process, we will build a new resource that can account for syntactic, semantic, and cognitive connections between words.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Projection of the word vectors in 5 semantic groups (of ten words each).</figDesc><graphic coords="6,134.77,115.83,345.83,344.87" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Spearman rank order correlations between Spanish WAN embeddings (based on cosine similarity) and the ES-WS-353 dataset.</figDesc><table><row><cell></cell><cell></cell><cell>NAP</cell><cell></cell><cell>NALC</cell></row><row><cell></cell><cell></cell><cell>Overlap 140</cell><cell></cell><cell>Overlap 322</cell></row><row><cell>Dimension</cell><cell>Inv. Frequency</cell><cell>Inv. Association</cell><cell cols="2">Time Inv. Association</cell></row><row><cell>300</cell><cell>0.489</cell><cell>0.463</cell><cell>0.461</cell><cell>0.650</cell></row><row><cell>200</cell><cell>0.454</cell><cell>0.456</cell><cell>0.491</cell><cell>0.641</cell></row><row><cell>128</cell><cell>0.503</cell><cell>0.463</cell><cell>0.450</cell><cell>0.659</cell></row><row><cell>100</cell><cell>0.471</cell><cell>0.478</cell><cell>0.495</cell><cell>0.664</cell></row><row><cell>50</cell><cell>0.523</cell><cell>0.503</cell><cell>0.503</cell><cell>0.626</cell></row><row><cell>25</cell><cell>0.484</cell><cell>0.478</cell><cell>0.572</cell><cell>0.611</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Spearman rank order correlations between Spanish WAN embeddings (based on cosine similarity) and the MC-30 dataset</figDesc><table><row><cell></cell><cell></cell><cell>NAP</cell><cell></cell><cell>NALC</cell></row><row><cell></cell><cell></cell><cell>Overlap 11</cell><cell></cell><cell>Overlap 27</cell></row><row><cell>Dimension</cell><cell>Inv. Frequency</cell><cell>Inv. Association</cell><cell cols="2">Time Inv. Association</cell></row><row><cell>300</cell><cell>0.305</cell><cell>0.563</cell><cell>0.545</cell><cell>0.837</cell></row><row><cell>200</cell><cell>0.468</cell><cell>0.381</cell><cell>0.263</cell><cell>0.844</cell></row><row><cell>128</cell><cell>0.545</cell><cell>0.272</cell><cell>0.300</cell><cell>0.767</cell></row><row><cell>100</cell><cell>0.336</cell><cell>0.418</cell><cell>0.372</cell><cell>0.806</cell></row><row><cell>50</cell><cell>0.527</cell><cell>0.509</cell><cell>0.272</cell><cell>0.814</cell></row><row><cell>25</cell><cell>0.454</cell><cell>0.400</cell><cell>0.563</cell><cell>0.788</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 .</head><label>3</label><figDesc>Spearman rank order correlation comparison of NAP embeddings and pretrained word vectors with the evaluation datasets.</figDesc><table><row><cell>Source</cell><cell>Vector size</cell><cell cols="2">MC-30 (Overlap 11) ES-WS-353 (Overlap 140)</cell></row><row><cell>Fasttext-sbwc</cell><cell>300</cell><cell>0.881</cell><cell>0.639</cell></row><row><cell>Fasttext-wiki</cell><cell>300</cell><cell>0.936</cell><cell>0.701</cell></row><row><cell>Glove-sbwc</cell><cell>300</cell><cell>0.827</cell><cell>0.532</cell></row><row><cell>Word2vec-sbwc</cell><cell>300</cell><cell>0.890</cell><cell>0.634</cell></row><row><cell>n2v-Inverse Association</cell><cell>300</cell><cell>0.563</cell><cell>0.463</cell></row><row><cell>n2v-Inverse Frequency</cell><cell>300</cell><cell>0.305</cell><cell>0.489</cell></row><row><cell>n2v-Time</cell><cell>25</cell><cell>0.563</cell><cell>0.572</cell></row><row><cell cols="3">6 Conclusions and Future Work</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4 .</head><label>4</label><figDesc>Spearman rank order correlation comparison of NALC embeddings and pretrained word vectors with the evaluation datasets.</figDesc><table><row><cell>Source</cell><cell>Vector size</cell><cell cols="2">MC-30 (Overlap 27) ES-WS-353 (Overlap 322)</cell></row><row><cell>Fasttext-sbwc</cell><cell>300</cell><cell>0.762</cell><cell>0.613</cell></row><row><cell>Fasttext-wiki</cell><cell>300</cell><cell>0.793</cell><cell>0.624</cell></row><row><cell>Glove-sbwc</cell><cell>300</cell><cell>0.707</cell><cell>0.482</cell></row><row><cell>Word2vec-sbwc</cell><cell>300</cell><cell>0.795</cell><cell>0.624</cell></row><row><cell>n2v-Inverse Association</cell><cell>300</cell><cell>0.837</cell><cell>0.650</cell></row><row><cell>n2v-Inverse Association</cell><cell>200</cell><cell>0.844</cell><cell>0.664</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_0">https://github.com/jocarema/nalc_vectors</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_1">https://github.com/jocarema/nap_vectors</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_2">http://www.jeuxdemots.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_3">https://smallworldofwords.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_4">http://www.eat.rl.ac.uk/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_5">http://web.usf.edu/FreeAssociation</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_6">http://snap.stanford.edu/node2vec/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_7">http://web.eecs.umich.edu/ ~mihalcea/downloads.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_8">https://github.com/uchile-nlp/spanish-word-embeddings</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="13" xml:id="foot_9">https://code.google.com/archive/p/word2vec/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="14" xml:id="foot_10">https://nlp.stanford.edu/projects/glove/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="15" xml:id="foot_11">https://github.com/facebookresearch/fastText/blob/master/ pretrained-vectors.md</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work was partially supported by the following projects: Conacyt FC-2016-01-2225 and PAPIIT IA401219, IN403016, AG400119.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Words in the mind: An introduction to the mental lexicon</title>
		<author>
			<persName><forename type="first">J</forename><surname>Aitchison</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2012">2012</date>
			<publisher>John Wiley &amp; Sons</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Corpus de normas de asociación de palabras para el español de México</title>
		<author>
			<persName><forename type="first">N</forename><surname>Arias-Trejo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">B</forename><surname>Barrón-Martínez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">H L</forename><surname>Alderete</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">A R</forename><surname>Aguirre</surname></persName>
		</author>
		<imprint>
			<publisher>NAP</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Don&apos;t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors</title>
		<author>
			<persName><forename type="first">M</forename><surname>Baroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Dinu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kruszewski</surname></persName>
		</author>
		<ptr target="http://www.aclweb.org/anthology/P14-1023" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<meeting>the 52nd Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="238" to="247" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Wan2vec: Embeddings learned on word association norms</title>
		<author>
			<persName><forename type="first">G</forename><surname>Bel-Enguix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Gómez-Adorno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Reyes-Magaña</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sierra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Semantic Web</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Enriching word vectors with subword information</title>
		<author>
			<persName><forename type="first">P</forename><surname>Bojanowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<idno type="DOI">10.1162/tacl_a_00051</idno>
		<idno type="arXiv">arXiv:1607.04606</idno>
		<ptr target="https://arxiv.org/abs/1607.04606" />
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">Computing Research Repository</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Navigating word association norms to extract semantic information</title>
		<author>
			<persName><forename type="first">J</forename><surname>Borge-Holthoefer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Arenas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 31st Annual Conference of the Cognitive Science Society</title>
				<meeting>the 31st Annual Conference of the Cognitive Science Society</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Cardellino</surname></persName>
		</author>
		<ptr target="http://crscardellino.github.io/SBWCE/" />
		<title level="m">Spanish Billion Words Corpus and Embeddings</title>
				<imprint>
			<date type="published" when="2016-03">March 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Associative strength and semantic activation in the mental lexicon: Evidence from continued word associations</title>
		<author>
			<persName><forename type="first">S</forename><surname>De Deyne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Navarro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Storms</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 35th Annual Conference of the Cognitive Science Society</title>
				<meeting>the 35th Annual Conference of the Cognitive Science Society</meeting>
		<imprint>
			<publisher>Cognitive Science Society</publisher>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Predicting human similarity judgments with distributional models: The value of word associations</title>
		<author>
			<persName><forename type="first">S</forename><surname>De Deyne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Perfors</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Navarro</surname></persName>
		</author>
		<idno type="DOI">10.24963/ijcai.2017/671</idno>
		<ptr target="https://doi.org/10.24963/ijcai.2017/671" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers</title>
				<meeting>COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="1861" to="1870" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Fernandez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Díez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Alonso</surname></persName>
		</author>
		<ptr target="http://inico.usal.es/usuarios/gimc/normas/index_nal.asp" />
		<title level="m">Normas de asociación libre en castellano de la Universidad de Salamanca</title>
				<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Placing search in context: The concept revisited</title>
		<author>
			<persName><forename type="first">L</forename><surname>Finkelstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Gabrilovich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Matias</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Rivlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Solan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wolfman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ruppin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th International Conference on World Wide Web</title>
				<meeting>the 10th International Conference on World Wide Web</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="406" to="414" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Improving feature representation based on a neural network for author profiling in social media texts</title>
		<author>
			<persName><forename type="first">H</forename><surname>Gómez-Adorno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Markov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sidorov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Posadas-Durán</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Sanchez-Perez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chanona-Hernandez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Intelligence and Neuroscience</title>
		<imprint>
			<biblScope unit="page">13</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Document embeddings learned on various types of n-grams for cross-topic authorship attribution</title>
		<author>
			<persName><forename type="first">H</forename><surname>Gómez-Adorno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Posadas-Durán</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sidorov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Pinto</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computing</title>
		<imprint>
			<biblScope unit="page" from="1" to="16" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">node2vec: Scalable feature learning for networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Grover</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Leskovec</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 22nd ACM International Conference on Knowledge Discovery and Data Mining</title>
				<meeting>the 22nd ACM International Conference on Knowledge Discovery and Data Mining</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="855" to="864" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Cross-lingual semantic relatedness using encyclopedic knowledge</title>
		<author>
			<persName><forename type="first">S</forename><surname>Hassan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mihalcea</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2009 Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="1192" to="1201" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Jackson-Maldonado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Thal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fenson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Marchman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Newton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Conboy</surname></persName>
		</author>
		<title level="m">MacArthur Inventarios del Desarrollo de Habilidades Comunicativas (Inventarios): User&apos;s guide and technical manual</title>
				<meeting><address><addrLine>Baltimore, MD</addrLine></address></meeting>
		<imprint>
			<publisher>Brookes</publisher>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">An associative thesaurus of English and its computer analysis</title>
		<author>
			<persName><forename type="first">G</forename><surname>Kiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Armstrong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Milroy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Piper</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1973">1973</date>
			<publisher>Edinburgh University Press</publisher>
			<pubPlace>Edinburgh</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Making people play for lexical acquisition</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lafourcade</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 7th SNLP 2007, Pattaya</title>
				<meeting>the 7th SNLP 2007, Pattaya</meeting>
		<imprint>
			<date type="published" when="2007-12">December 2007</date>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="13" to="15" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Efficient estimation of word representations in vector space</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1301.3781</idno>
		<ptr target="https://arxiv.org/abs/1301.3781" />
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
	<note type="report_type">Computing Research Repository</note>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Contextual correlates of semantic similarity</title>
		<author>
			<persName><forename type="first">G</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Charles</surname></persName>
		</author>
		<idno type="DOI">10.1080/01690969108406936</idno>
		<ptr target="https://doi.org/10.1080/01690969108406936" />
	</analytic>
	<monogr>
		<title level="j">Language and cognitive processes</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="28" />
			<date type="published" when="1991">1991</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Word association, rhyme and word fragment norms</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">L</forename><surname>Nelson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Mcevoy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">A</forename><surname>Schreiber</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1998">1998</date>
			<publisher>The University of South Florida</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">The distributional hypothesis</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sahlgren</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Italian Journal of Linguistics</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="33" to="53" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Evaluation methods for unsupervised word embeddings</title>
		<author>
			<persName><forename type="first">T</forename><surname>Schnabel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Labutov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mimno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Joachims</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2015 Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="298" to="307" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<title level="m" type="main">Word association thesaurus as a resource for extending semantic networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Sinopalnikova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Smrz</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="267" to="273" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Conceptual graphs as a universal knowledge representation</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">F</forename><surname>Sowa</surname></persName>
		</author>
		<idno type="DOI">10.1016/0898-1221(92)90137-7</idno>
		<ptr target="https://doi.org/10.1016/0898-1221(92)90137-7" />
	</analytic>
	<monogr>
		<title level="j">Computers &amp; Mathematics with Applications</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="75" to="93" />
			<date type="published" when="1992">1992</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
