A model for high-coverage lexical semantic annotation generation

Attila Novák, Borbála Siklósi
Pázmány Péter Catholic University, Faculty of Information Technology and Bionics,
MTA-PPKE Hungarian Language Technology Research Group
Práter u. 50/a, 1083 Budapest, Hungary

Abstract

AI applications often receive their input in the form of natural language text, or as the transcription of spoken text. A commonsense inference system should transform such input into a formal representation with a limited vocabulary in order to be able to process it. In this paper, we present a method based on neural word embeddings that automatically assigns semantic features to words of natural language. These features either describe the ontological category of a given word or provide some characterization or additional information. We show that our method has high coverage, performs well for English and Hungarian, and can easily be extended to other languages as well.

Introduction

One of the most natural representations of commonsense knowledge is natural language. What people think or know about the world is expressed in either spoken or written language. Due to the popularity and accessibility of on-line media, crowds of people put their knowledge into written texts, either in the form of very short comments on social media sites or in the form of longer posts, in addition to the writings of professional journalists. These texts, which are produced daily, adapt to changes in language use, and not only general knowledge but also facts and beliefs about the actual state of the world are represented in them. Moreover, not only standard language but also slang and words used in informal contexts and special domains are present in texts collected from the Web. In addition, more and more books representing a wide range of domains and styles are digitized. Large written corpora consisting of these resources are available as raw material for research and can be exploited as a source of knowledge.

A more structured form of knowledge representation is hand-crafted ontologies, such as WordNet (Fellbaum 1998; Miller 1995) or DBpedia (Lehmann et al. 2015). In WordNet, concepts are collected into synonym sets and organized into a strictly hierarchical structure of hyponymy relations, along with some horizontal relations, like meronymy. However, WordNet has been criticized for its too high granularity at the bottom level and its generality at the top level (Brown 2008). Moreover, its middle layers also contain many concepts that may be appropriate in a scientific taxonomy, like ‘fissiped mammal.n’, but are not present in everyday language use. Similar problems concern most other structured knowledge bases. Moreover, since they are extremely costly to produce or to extend to achieve good lexical coverage, these resources are static in nature: they are not able to keep up with changes in language use and daily life, and they contain only standard word forms.

Whatever its source, a knowledge base is an essential component of a commonsense inference system. Even though recent results achieved by applying deep neural systems to raw textual input have been significant, traditional inference systems first transform their input written in natural language into a formal representation using features extracted from one or more knowledge bases, and then try to solve the given task based on this formal representation. In order to be able to process arbitrary input, the coverage of the knowledge bases used should be as high as possible (Davis 1990).

In this paper, we present an automatic method that is able to assign semantic features or atomic predicates to practically any (even non-standard/slang or misspelled) word form in a text in a language-independent manner. As we apply morphological analysis and lemmatization to the corpus both at the time of generating the embedding models and at query time, all forms of a single lemma are covered instead of only those explicitly present in the original corpus. This is essential to achieve good coverage for an agglutinating language like Hungarian, where a single lexeme may have hundreds of possible word forms, only a few of which are actually present even in a huge corpus. Instead of constructing another static knowledge base of fixed vocabulary, we propose a dynamic tool that can be retrained or fine-tuned at any time using an up-to-date, possibly domain-specific corpus appropriate to the task at hand. The target formalism or set of semantic features to be used is also an interchangeable parameter of the proposed method. The set of features and predicates presented in this paper is derived from formalized definitions of a subset of the headwords (including the defining vocabulary) of the Longman Dictionary of Contemporary English (LDOCE) (Summers 2005). Both the vocabulary of the model and the features used are embedded in a word embedding vector space model created by a neural network (Mikolov et al. 2013).

Before we present the structure of the paper, let the following example illustrate the kind of semantic annotation automatically assigned by the model to words in the sentence The cow gives milk to her calf.:

cow: mammal, at_farm, produce_milk, HAS{four(legs)}, animal
gives: =AGT.CAUSE{=DAT.HAS.=PAT}, give, offer, communicate
milk: food, sweet, drink, liquid
calf: young, mammal, animal, has_wool, HAS{four(legs)}

The paper is structured as follows: first, a brief introduction to neural word embeddings is presented. This is followed by the description of the lexical resource that we used when creating our models. In the following section, the method of building the model is described. In this paper, the method is demonstrated for English. However, existing semantic resources can also be mapped to word embedding spaces over the vocabulary of other languages. We have performed experiments with Hungarian, an agglutinative language with scarce semantic resources, but the method can easily be applied to other languages as well. Finally, we present both qualitative and quantitative evaluation of the models.

Word Embedding Models

Traditional models of distributional semantics build word representations by counting words occurring in a fixed-size context of the target word (Baroni, Dinu, and Kruszewski 2014). In contrast, more recent methods for building distributional representations of words use neural networks to generate word embedding models (Mikolov et al. 2013; Pennington, Socher, and Manning 2014), the most influential implementation of which is word2vec (https://code.google.com/archive/p/word2vec/).

When training embedding models, a fixed-size context of each word in the vocabulary is used as the input of a neural network. This network is used to predict the target word from the context by using back-propagation, adjusting the weights assigned to the connections between the input neurons (each corresponding to an item in the whole vocabulary) and the projection layer of the network. This weight vector can finally be extracted and used as the embedding vector of the target word. Since similar words are used in similar contexts, these vectors, optimized for prediction from the context, will also be similar for similar words. There are two types of neural networks used for this task. One of them is the so-called CBOW (continuous bag-of-words) model, in which the network is used to predict the target word from the context, while the other model, called skip-gram, is used to predict the context from the target word. For both models, the embedding vectors can be extracted from the middle layer of the network and used alike as a dense vector representation of the meaning of the words.

The vectors thus obtained point to certain locations in the semantic space consistently, so that semantically and/or syntactically related words are close to each other, while unrelated ones are more distant. Moreover, it has been shown that vector operations can also be applied to these representations: the semantic relation between two words can be characterized by the algebraic difference of the two vectors representing these words. Similarly, the meaning of the composition of two (or more) words is generally well represented by the sum of the corresponding embedding vectors (Mikolov, Yih, and Zweig 2013).

As the words are represented as dense real-valued vectors, the similarity of two words can easily be defined in terms of the angle between their vectors, i.e. the most similar words for a query word can be retrieved by finding its nearest neighbours in the vector space according to cosine distance.

One of the main drawbacks of building such a model from raw corpora, however, is that by itself it is not able to handle polysemy and homonymy, because a single representational vector is built for each lexical element regardless of the number of its different senses. We applied a simple method to alleviate this problem, at least in cases where the homonyms have different parts-of-speech. In order to assign different vectors to the same word form with different parts-of-speech, we applied PoS tagging and lemmatization to the training corpora before building the model. The main PoS tag of each word was attached to the word as a suffix in the form lemma#PoS; thus a different embedding vector was created for homonymous lemmas with different parts-of-speech.

We trained an English word embedding model on the English Wikipedia dump of 2.25 billion tokens (8.24 M token types; downloaded from https://dumps.wikimedia.org/ in May 2016) that was annotated using the Stanford tagger (Toutanova et al. 2003). Since the CBOW model has proved to be more efficient for large training corpora, we used this model architecture for training, with the radius of the context window set to 5, the number of dimensions set to 300, and a token frequency limit of 5.

Figure 1 illustrates how the words pianist, teacher, turner, maid and their three nearest neighbors are arranged in the English word embedding space (the PoS tag is NN for all example words, and it is omitted from the figure). The original vectors consist of 300 dimensions, but these were mapped to a 2D representation using the t-SNE algorithm (van der Maaten and Hinton 2008).
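The similarity and composition operations described above can be sketched in a few lines. This is a toy illustration with hypothetical 3-dimensional vectors (real models use 300 dimensions learned from a corpus):

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def compose(u, v):
    # the meaning of a two-word composition is approximated
    # by the sum of the corresponding embedding vectors
    return [a + b for a, b in zip(u, v)]

# hypothetical toy embeddings (illustrative values, not from a trained model)
vectors = {
    "cow#NN":  [0.9, 0.1, 0.2],
    "calf#NN": [0.8, 0.2, 0.3],
    "milk#NN": [0.1, 0.9, 0.1],
}

def nearest(query, vocab):
    # rank vocabulary items by cosine similarity to the query vector
    return max(vocab, key=lambda w: cosine_similarity(query, vocab[w]))
```

With such toy vectors, the nearest neighbour of cow#NN among the remaining words is calf#NN, since the two vectors point in nearly the same direction.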
Lexical Resources

Our goal was to create a model that can assign semantic features and elementary predicates to words in an arbitrary text. Thus, first, the set of features to be used had to be defined. The Longman Dictionary of Contemporary English (LDOCE) (Summers 2005) is a traditional dictionary containing words and their definitions. All definitions in the dictionary are written using a constrained defining vocabulary, the Longman Defining Vocabulary (LDV). The definitions of a subset of headwords in LDOCE, including all items in the LDV and the most frequent words listed in the BNC and the Google unigram counts, were transformed into a formal description containing only unary and binary predicates in a resource called 4lang (Kornai et al. 2015), as illustrated by the following examples (for an explanation of the notation used in these definitions, see (Kornai et al. 2015)):

bread: food, FROM/2742 flour, bake MAKE
(a type of food made from flour and water that is mixed together and then baked)

show: =AGT CAUSE[=DAT LOOK =PAT], communicate
(to let someone see something)

We further transformed this format so that we obtained a set of category labels (here: unary and binary predicates), each with a list of example words. This was achieved by segmenting the formal descriptions into elementary predicates (by splitting at commas), but we did not segment predicates into further parts, so e.g. HAS{four(legs)} remained an atomic feature. Each such token was treated as a category label. Then, all words that had the particular token in their definition were listed as examples for that label. This resulted in 1489 category labels and 12,507 words listed as examples for them. Then, in order to make this resource compatible with the word embedding model built from the Wikipedia corpus, its vocabulary was intersected with that of the model. Even though the vocabulary of this resource consists mostly of frequent words used in LDOCE definitions, it also includes some affixes, inflected forms, and a few multiword items, which are not present in the lemmatized Wikipedia model, so the intersection resulted in 11,039 words. Table 1 shows some example words for some features derived from the 4lang resource.

Category                    | Example words in 4lang
PART_OF.body                | body#NN, tongue#NN, back#NN, neck#NN, shoulder#NN, bone#NN, skin#NN, wrist#NN, buttock#NN etc.
=AGT.HAS.mouth              | swallow#VB, suck#VB, eat#VB, drink#VB
HAS{four(legs)}             | horse#NN, tiger#NN
mammal                      | mammal#NN, lion#NN, deer#NN, man#NN, horse#NN, sheep#NN, cattle#NN, rabbit#NN, cat#NN, pig#NN, goat#NN, cow#NN
=AGT.HAS.mind               | read#VB, remember#VB, feel#VB, understand#VB
=AGT.CAUSE{=DAT.KNOW.=PAT}  | express#VB, teach#VB

Table 1: Example words for some semantic features (predicates) after transforming the definitions to the format consisting of labels and example words

Figure 1: The arrangement of the 3 nearest neighbors of the words pianist, teacher, turner, maid in the English word embedding space

However, some categories were too broad, and the set of words listed for them was too heterogeneous. To handle this problem, a hierarchical agglomerative clustering algorithm was applied to the set of words in those categories that contained at least five words. The reason for applying hierarchical clustering rather than k-means is based on the argument of (Pereira, Tishby, and Lee 1993), who state that due to the sophisticated variability of written texts, the number of clusters of the concepts used in a certain text cannot be predicted. A hierarchical organization, however, is appropriate for producing compact groups of words and phrases based on the actual text, rather than on some predefined generalization. The linkage method for the hierarchical clustering was chosen based on the cophenetic correlation between the original data points and the resulting linkage matrix (Sokal and Rohlf 1962). The best correlation was achieved when using Ward's distance criterion (Ward 1963), resulting in small and dense groups of terms at the lower levels of the resulting dendrogram. However, we did not need the whole hierarchy, represented as a binary tree, but separate, compact groups of terms, i.e. well-separated subtrees of the dendrogram. The most intuitive way of defining the cutting points of the tree is to find large jumps in the clustering levels. To put it more formally, the height of each link in the cluster tree is compared with the heights of neighbouring links below it down to a certain depth. If this difference is larger than a predefined threshold value (i.e. the link is inconsistent), then the link is a cutting point. For more details of the clustering algorithm, see (Siklósi 2016). Each cluster was then labeled with the original category label with a numeric index added.

Even though we present our method using only the 4lang dictionary as a lexical resource, the system can be built from any dictionary that can be transformed to a similar format.

Method

Our objective was to create a model with high lexical coverage that can also return the most relevant semantic features for words not present in 4lang. In order to achieve this goal, the semantic features from this controlled set were projected into the embedding space containing the representation of the words. The nearest feature neighbors of each word can then be retrieved from the model using the cosine distance metric.

For each indexed semantic predicate label output by the clustering algorithm, we iterated over the list of example words annotated with their part-of-speech (the crude PoS tags used in the 4lang resource had to be mapped to the more fine-grained PTB tags returned by the Stanford tagger) and retrieved their embedding vectors from the word embedding model built from the PoS-tagged Wikipedia corpus. As a simple but effective method for rendering a representation vector for a set of words from their corresponding word embeddings, we took the mean of these vectors and used that as the embedding vector of that particular semantic feature.
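The projection of a feature into the embedding space described above can be sketched as follows: the feature vector is the mean of the example words' vectors, and the features assigned to a query word are its nearest neighbours among these feature vectors. The word vectors and feature labels below are hypothetical 2-dimensional toy values, not outputs of the actual model:

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(vectors):
    # representation of a feature: the mean of its example words' embeddings
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# hypothetical toy word embeddings (lemma#PoS keys as in the paper)
word_vecs = {
    "horse#NN": [0.9, 0.1], "tiger#NN": [0.8, 0.3],
    "eat#VB":   [0.1, 0.9], "drink#VB": [0.2, 0.8],
}

# feature vectors computed from their 4lang example words; these are kept in
# a store separate from the word vectors so lookup can be restricted to features
feature_vecs = {
    "HAS{four(legs)}": mean_vector([word_vecs["horse#NN"], word_vecs["tiger#NN"]]),
    "=AGT.HAS.mouth":  mean_vector([word_vecs["eat#VB"], word_vecs["drink#VB"]]),
}

def nearest_features(word_vec, n=1):
    # rank all feature vectors by cosine similarity to the query word's vector
    ranked = sorted(feature_vecs,
                    key=lambda f: cosine_similarity(word_vec, feature_vecs[f]),
                    reverse=True)
    return ranked[:n]
```

A query vector close to the animal words is assigned HAS{four(legs)}, while one close to the ingestion verbs is assigned =AGT.HAS.mouth; since ranking replaces exact lookup, a word absent from the lexical resource still receives the features of its embedding-space neighbourhood.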
Thus a representation of each predicate used in the definitions was obtained in the semantic space created from the English PoS-tagged corpus. These semantic feature vectors were kept separate from the word vectors in the original embedding model in order to be able to restrict lookup to either words or features derived from each lexical resource.

To find the relevant features for a query word tagged with its appropriate part-of-speech, its representational vector is retrieved from the word embedding model, and its nearest neighbors are taken from the model containing the semantic predicates. Since nearest neighbors are searched for instead of requiring an exact match, out-of-vocabulary words (with respect to the original lexical resources) can also be assigned semantic labels. The only requirement is that the word must be present in the word embedding model.

Other languages

We also carried out some experiments to apply our method to another language, Hungarian. Hungarian is an agglutinative language with very few lexical semantic resources. As the original 4lang dictionary contained the Hungarian translation of the vocabulary included (3477 words), it was straightforward to create a similar model for Hungarian as well. For this, we had to create a Hungarian word embedding model, which was built from a web-crawled corpus of 3.18 billion tokens (27.49 M token types) that was annotated using the PurePos tagger (Orosz and Novák 2013), augmented with the Humor Hungarian morphological analyzer (Novák 2014; Novák, Siklósi, and Oravecz 2016). We applied the method described above to define the position of the features in the Hungarian word embedding space by calculating the mean of the vector representations of the Hungarian example words for each semantic predicate. Our approach can easily be extended to any other language by translating this dictionary of moderate size (relative to complicated knowledge bases). Furthermore, this method also adapts to differences in word usage across languages, since words are represented by their embedding vectors in the target language.

Experiments and Results

The aim of this research was to investigate the possibility of providing a high-coverage tool for assigning a semantic representation to words of a natural language input dynamically, instead of using a static knowledge base with a limited vocabulary. Thus, we first investigated the performance of the tool on some example input, and then we also performed a quantitative analysis.

Qualitative analysis

Table 2 shows an example: Laika likes eating fried onion with cucumber. First, using the Stanford parser, the input is annotated with part-of-speech tags, and each word is lemmatized. Then, for each lemmatized content word (i.e. omitting the function word with) with its corresponding part-of-speech, the top 10 nearest features are retrieved from the model and ordered by their distance from the vector representing the target word in the embedding space. Note that the number of top n features generated for each word is a free parameter, but moving further in the semantic space results in less and less appropriate features for the target word. Table 3 shows the WordNet hypernyms assigned to each content word in the same sentence (the representation of the adjective fried and the proper name Laika is missing from WordNet).

Original word | Analyzed word | Features
Laika         | Laika#NNP     | carnivorous, mammal, faithful, HAS.short(hair/3359), HAS{four(legs)}, AT/2744.farm, companion, young, EAT.flesh, HAS.long(tail)
likes         | like#VB       | want, =PAT{person}, wish, emotion, ask, =AGT.HAS.mind, annoy, =PAT.IN/2758.mind, communicate, desire, =AGT.HAS.body
eating        | eat#VB        | swallow, =AGT.HAS.mouth, eat, love, INSTRUMENT.tongue, =AGT.CAUSE{=PAT{move}}, sleep, suck, sing, touch, rest
fried         | fried#JJ      | food, ’.COOK/825, ’.SERVE, thick/2134, FROM/2742.flour, bake.MAKE, FROM/2742.milk, food.IN/2758, vegetable, sweet, bread
onion         | onion#NN      | ’.COOK/825, vegetable, fruit, food, FROM/2742.milk, sweet, round, soft, thick/2134, PART_OF.plant
with          | with#IN       | —
cucumber      | cucumber#NN   | vegetable, fruit, food, ’.COOK/825, sweet, ’.EAT, round, CAUSE{food.HAS.taste}, PART_OF.plant, soft

Table 2: An example sentence, Laika likes eating fried onion with cucumber, with features assigned to each word using our method

Original word | Analyzed word | Hypernyms
Laika         | Laika#NNP     | —
likes         | like#VB       | desire, want
eating        | eat#VB        | consume, digest, take in, take, have
fried         | fried#JJ      | —
onion         | onion#NN      | vegetable, produce, food, solid, matter, physical entity, entity
with          | with#IN       | —
cucumber      | cucumber#NN   | vegetable, produce, food, solid, matter, physical entity, entity

Table 3: An example sentence, Laika likes eating fried onion with cucumber, with hypernyms from WordNet assigned to each word

As can be seen in the example, our model is able to assign two types of features to words. Ontological/taxonomic categories, such as carnivorous and mammal for the word Laika, or vegetable and food for the words onion and cucumber, appear together with characteristic features of the given concept, such as faithful, HAS{four(legs)}, AT/2744.farm, or round and CAUSE{food.HAS.taste}. While the first type of features can be extracted from traditional ontologies, the latter type of characteristics cannot. However, we believe that the latter type of features forms an important part of commonsense knowledge, because if people are asked to describe a concept, they will typically use such characteristics. Moreover, an inference system can also benefit from such descriptions. It can also be seen from the example that the model “knows” that Laika is a dog, as it returns semantic features characterizing dogs. In addition, the feature EAT.flesh emphasizes the contrast between Laika being a dog and eating cucumber and onion.

Another benefit of our model, as mentioned above, is that it is able to generate features for all the words that are present in the original corpus the word embedding was built from, not only for the extremely limited set of words included in the 4lang dictionary. WordNet and other hand-made resources are limited to the words and the classification that the designers of the resource had in mind. Our model, in contrast, is able to assign features to proper names, slang words, and mistyped word forms as well, as long as these are represented in the corpus the word embedding model was created from. In addition to the above example containing the dog name Laika, the following examples show some of the nearest features for two more proper names and two slang words:

IBM: information.IN, computer, equipment, electric, group
Facebook: information.ON, ABOUT.recent(events), computer
hype: fame, fun, idea, popular, surprise
numpty: bad, lazy, stupid, lack(work), dull

A weakness of our method is that in some cases it also adds noise to the generated features. For example, features such as sleep or sing generated for the verb eat are not ones we would expect to be part of the definition of eat (even if in a broader sense they might be related). Inappropriate features like these may be eliminated manually from the representations generated by the model. The model can thus also be used as an aid in a semi-automatic semantic resource creation or extension process, proposing an initial representation that can be cleaned manually for applications that require a high-precision lexical semantic representation. Otherwise, the generated semantic features can be used in models performing some downstream task even without filtering out the noise. In that case, the added semantic features may improve the performance of the downstream tool by providing mostly useful features for words that would otherwise completely lack a semantic representation.

Quantitative analysis

We also carried out two kinds of quantitative analysis of the performance of our model. First, we checked the robustness of the model by performing a sanity check. For each word present in the original 4lang dictionary, we calculated how many of the semantic features present in the original definition were retrieved among the top N features returned by the model (feature recall, Rf) and the percentage of words for which all features were retrieved (word recall, Rw). The results are shown in Table 4 as a function of N (numbers are percentages). Recall was also calculated ignoring words having more than N features (Rwp) and discounting features over the N limit for words having more than N features (Rfp). As no definition contained more than 10 terms, Rwp is identical to Rw and Rfp is identical to Rf for N ≥ 10. The definitions are terse and contain a minimal description of each word: half of the words are defined by only a single term, and almost all words by not more than 5 (see column |f| ≤ N). Feature precision (P(f)) apparently decreases quickly as the number of features retrieved increases if we blindly accept only terms present in the original definitions as correct. See, however, further discussion below. The last column of the table shows the mean average precision (MAP) of features (terms) present in the original definitions.

N  | Rw    | Rwp   | Rf    | Rfp   | |f|≤N  | P(f)  | MAP
1  | 44.11 | 88.18 | 50.79 | 92.66 | 50.02  | 92.66 | 92.66
5  | 86.88 | 87.75 | 91.38 | 92.26 | 99.00  | 56.70 | 89.66
10 | 93.39 | 93.39 | 95.97 | 95.97 | 100.00 | 32.70 | 90.56
15 | 95.61 | 95.61 | 97.36 | 97.36 | 100.00 | 22.89 | 90.77
20 | 96.48 | 96.48 | 97.93 | 97.93 | 100.00 | 17.54 | 90.82

Table 4: Performance of the model for English tested on definitions in the 4lang vocabulary as a function of the number N of top-ranked features retrieved for each word. Rw: word recall (words for which all features were retrieved), Rwp: recall for words having no more than N features, Rf: feature recall, Rfp: feature recall ignoring features over the top N, |f| ≤ N: percentage of words having no more than N features, P(f): feature precision, MAP: mean average precision of features. Numbers are percentages.

In the other experiment, we selected 280 words not present in the original dictionary, chosen randomly from a predefined list of Hungarian words in which each word was assigned to one of 28 semantic domains (e.g. food, vehicles, locations, occupations, etc.). From each domain, 10 words were chosen randomly and translated to English. Then, for these words, the 10 nearest features were generated, and two human annotators checked whether each feature was adequate for the given word. The same evaluation was performed for Hungarian. The agreement between the annotators was 0.798 for English and 0.734 for Hungarian according to Cohen's kappa, which is substantial in both cases. The results are shown in Table 5.

Language  | acc    | d-acc  | #F  | #B
English   | 75.13% | 90.07% | 559 | 277
Hungarian | 73.86% | 88.34% | 584 | 295

Table 5: Performance of the model on 280 different test words for English and Hungarian. acc: feature accuracy, d-acc: domain accuracy of features, #F: number of different features, #B: number of features marked wrong at least once.

The table shows feature accuracy (acc: the ratio of correctly assigned features). We also automatically computed feature “domain accuracy” (d-acc): here we ignored feature assignment errors where the same feature was marked adequate for another test word in the same domain. The number of different features that appeared in this evaluation and the number of features marked wrong at least once are shown in the last two columns. Note that the feature accuracy (precision) for 10 features retrieved turned out to be much higher (75.13%) than in the sanity check experiment (only 32.70%), even though this list contained words not in the original resource. The reason for this is that the model returns many features which, while not explicitly present in the original terse definitions, correctly follow from the knowledge embodied in the feature model. For example, while the definition of dog in 4lang contains only 3 terms: animal, faithful and carnivorous, the top 10 features retrieved from the model also include mammal, HAS{four(legs)}, hairy and companion. The sanity check experiment thus grossly underestimated the precision of the model.
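The mean average precision reported in the last column of Table 4 can be computed as sketched below (a generic MAP implementation over ranked feature lists, not the authors' code; the dog example values are illustrative):

```python
def average_precision(ranked_features, gold_features):
    # AP: mean of precision@k taken at each rank k where a gold feature occurs,
    # normalized by the number of gold features
    hits, precisions = 0, []
    for k, feat in enumerate(ranked_features, start=1):
        if feat in gold_features:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(gold_features) if gold_features else 0.0

def mean_average_precision(results):
    # results: list of (ranked feature list, gold feature set) pairs, one per word
    return sum(average_precision(r, g) for r, g in results) / len(results)

# e.g. for the 4lang dog entry (gold terms: animal, faithful, carnivorous),
# a ranking that interleaves them with extra features such as mammal or
# companion still scores well, since all gold terms appear near the top
ap = average_precision(
    ["animal", "mammal", "faithful", "companion", "carnivorous"],
    {"animal", "faithful", "carnivorous"})
```

This also illustrates why MAP stays high in Table 4 while P(f) drops: extra retrieved features lower strict precision at large N, but MAP only depends on how early the gold terms are ranked.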
Conclusion

We have presented an automatic method that is able to assign semantic features to words of natural language. This approach exploits the representative power of neural word embeddings by mapping features derived from formal definitions of words to the vector space of the given language. In addition to some illustrative examples, we have presented an evaluation of the models demonstrating that the method works with relatively high accuracy. Although there is a moderate amount of noise in the set of generated features, the method has very high coverage, being able to process proper names and non-standard words as well, which cannot all be included in hand-made static knowledge bases. As such, our automatic method can be used as the basis of a manually constructed resource, or can provide valuable input for downstream applications, such as commonsense inference systems.

Acknowledgments

This research has been implemented with support provided by grant FK125217 of the National Research, Development and Innovation Office of Hungary, financed under the FK17 funding scheme.

References

Baroni, M.; Dinu, G.; and Kruszewski, G. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 238–247. Baltimore, Maryland: Association for Computational Linguistics.

Brown, S. W. 2008. Choosing sense distinctions for WSD: Psycholinguistic evidence. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, HLT-Short '08, 249–252. Stroudsburg, PA, USA: Association for Computational Linguistics.

Davis, E. 1990. Representations of Commonsense Knowledge. Morgan Kaufmann.

Fellbaum, C., ed. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Kornai, A.; Ács, J.; Makrai, M.; Nemeskey, D. M.; Pajkossy, K.; and Recski, G. 2015. Competence in lexical semantics. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, 165–175. Denver, Colorado: Association for Computational Linguistics.

Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P. N.; Hellmann, S.; Morsey, M.; van Kleef, P.; Auer, S.; and Bizer, C. 2015. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal 6(2):167–195.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26 (NIPS 2013), 3111–3119. Lake Tahoe, Nevada, USA.

Mikolov, T.; Yih, W.; and Zweig, G. 2013. Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT 2013, 746–751. Atlanta, Georgia, USA.

Miller, G. A. 1995. WordNet: A lexical database for English. Communications of the ACM 38:39–41.

Novák, A. 2014. A new form of humor – mapping constraint-based computational morphologies to a finite-state representation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014). Reykjavik, Iceland: European Language Resources Association (ELRA).

Novák, A.; Siklósi, B.; and Oravecz, C. 2016. A new integrated open-source morphological analyzer for Hungarian. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Paris, France: European Language Resources Association (ELRA).

Orosz, G., and Novák, A. 2013. PurePos 2.0: a hybrid tool for morphological disambiguation. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2013), 539–545. Hissar, Bulgaria: INCOMA Ltd.

Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.

Pereira, F.; Tishby, N.; and Lee, L. 1993. Distributional clustering of English words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, ACL '93, 183–190. Stroudsburg, PA, USA: Association for Computational Linguistics.

Siklósi, B. 2016. Using embedding models for lexical categorization in morphologically rich languages. In Gelbukh, A., ed., Computational Linguistics and Intelligent Text Processing: 17th International Conference, CICLing 2016. Konya, Turkey: Springer International Publishing, Cham.

Sokal, R. R., and Rohlf, F. J. 1962. The comparison of dendrograms by objective methods. Taxon 11(2):33–40.

Summers, D. 2005. Longman Dictionary of Contemporary English. Longman.

Toutanova, K.; Klein, D.; Manning, C. D.; and Singer, Y. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL '03), 173–180. Stroudsburg, PA, USA: Association for Computational Linguistics.

van der Maaten, L., and Hinton, G. E. 2008. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research 9:2579–2605.

Ward, J. H. 1963. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 58(301):236–244.