A model for high-coverage lexical semantic annotation generation

Attila Novák, Borbála Siklósi
Pázmány Péter Catholic University, Faculty of Information Technology and Bionics,
MTA-PPKE Hungarian Language Technology Research Group
Práter u. 50/a, 1083 Budapest, Hungary

Abstract

AI applications often receive their input in the form of natural language text, or as the transcription of spoken text. A commonsense inference system should transform such input into a formal representation with a limited vocabulary in order to be able to process it. In this paper, we present a method based on neural word embeddings that automatically assigns semantic features to words of natural language. These features either describe the ontological category of a given word or provide some characterization or additional information. We show that our method has high coverage, performs well for English and Hungarian, and can easily be extended to other languages as well.

Introduction

One of the most natural representations of commonsense knowledge is natural language. What people think or know about the world is expressed in either spoken or written language. Due to the popularity and accessibility of on-line media, crowds of people put their knowledge into written texts, either in the form of very short comments on social media sites or in the form of longer posts, in addition to the writings of professional journalists. These texts, which are produced daily, adapt to changes in language use, and not only general knowledge but also facts and beliefs about the actual state of the world are represented in them. Moreover, not only standard language but also slang and words used in informal contexts and special domains are present in texts collected from the Web. In addition, more and more books representing a wide range of domains and styles are digitized. Large written corpora consisting of these resources are available as raw material for research and can be exploited as a source of knowledge.

A more structured form of knowledge representation is hand-crafted ontologies, such as WordNet (Fellbaum 1998; Miller 1995) or DBpedia (Lehmann et al. 2015). In WordNet, concepts are collected into synonym sets and organized into a strictly hierarchical structure of hyponymy relations, along with some horizontal relations, like meronymy. However, WordNet has been criticized for its too high granularity at the bottom level and its generality at the top level (Brown 2008). Moreover, its middle layers also contain many concepts that may be appropriate in a scientific taxonomy, like ‘fissiped mammal.n’, but are not present in everyday language use. Similar problems concern most other structured knowledge bases. Moreover, since they are extremely costly to produce or to extend to achieve good lexical coverage, these resources are static in nature: they are not able to keep up with changes in language use and daily life, and they contain only standard word forms.

Whatever its source, a knowledge base is an essential component of a commonsense inference system. Even though recent results achieved by applying deep neural systems to raw textual input have been significant, traditional inference systems first transform their input written in natural language into a formal representation using features extracted from one or more knowledge bases, and then try to solve the given task based on this formal representation. In order to be able to process arbitrary input, the coverage of the knowledge bases used should be as high as possible (Davis 1990).

In this paper, we present an automatic method that is able to assign semantic features or atomic predicates to practically any (even non-standard/slang or misspelled) word form in a text in a language-independent manner. As we apply morphological analysis and lemmatization to the corpus both at the time of generating the embedding models and at query time, all forms of a single lemma are covered instead of only those explicitly present in the original corpus. This is essential to achieve good coverage for an agglutinating language like Hungarian, where a single lexeme may have hundreds of possible word forms, only a few of which are actually present even in a huge corpus. Instead of constructing another static knowledge base of fixed vocabulary, we propose a dynamic tool that can be retrained or fine-tuned at any time using an up-to-date, possibly domain-specific corpus appropriate to the task at hand. The target formalism or set of semantic features to be used is also an interchangeable parameter of the proposed method. The set of features and predicates presented in this paper is derived from formalized definitions of a subset of the headwords (including the defining vocabulary) of the Longman Dictionary of Contemporary English (LDOCE) (Summers 2005). Both the vocabulary of the model and the features used are embedded in a word embedding vector space model created by a neural network (Mikolov et al. 2013).

Before we present the structure of the paper, let the following example illustrate the kind of semantic annotation automatically assigned by the model to words in the sentence The cow gives milk to her calf.:

cow: mammal, at_farm, produce_milk, HAS{four(legs)}, animal
gives: =AGT.CAUSE{=DAT.HAS.=PAT}, give, offer, communicate
milk: food, sweet, drink, liquid
calf: young, mammal, animal, has_wool, HAS{four(legs)}

The paper is structured as follows: first, a brief introduction to neural word embeddings is presented. This is followed by the description of the lexical resource that we used when creating our models. In the following section, the method of building the model is described. In this paper, the method is demonstrated for English. However, existing semantic resources can also be mapped to word embedding spaces over the vocabulary of other languages. We have performed experiments with Hungarian, an agglutinative language with scarce semantic resources, but the method can easily be applied to other languages as well. Finally, we present both qualitative and quantitative evaluation of the models.

Word Embedding Models

Traditional models of distributional semantics build word representations by counting words occurring in a fixed-size context of the target word (Baroni, Dinu, and Kruszewski 2014). In contrast, more recent methods for building distributional representations of words use neural networks to generate word embedding models (Mikolov et al. 2013; Pennington, Socher, and Manning 2014), the most influential implementation of which is word2vec (https://code.google.com/archive/p/word2vec/).

When training embedding models, a fixed-size context of each word in the vocabulary is used as the input of a neural network. This network is used to predict the target word from the context by using back-propagation, adjusting the weights assigned to the connections between the input neurons (each corresponding to an item in the whole vocabulary) and the projection layer of the network. This weight vector can finally be extracted and used as the embedding vector of the target word. Since similar words are used in similar contexts, these vectors, optimized for prediction from the context, will also be similar for similar words. There are two types of neural networks used for this task. One of them is the so-called CBOW (continuous bag-of-words) model, in which the network is used to predict the target word from the context, while the other model, called skip-gram, is used to predict the context from the target word. For both models, the embedding vectors can be extracted from the middle layer of the network and used alike as a dense vector representation of the meaning of the words.

The vectors thus obtained point to certain locations in the semantic space consistently, so that semantically and/or syntactically related words are close to each other, while unrelated ones are more distant. Moreover, it has been shown that vector operations can also be applied to these representations: the semantic relation between two words can be characterized by the algebraic difference of the two vectors representing these words. Similarly, the meaning of the composition of two (or more) words is generally well represented by the sum of the corresponding embedding vectors (Mikolov, Yih, and Zweig 2013).

As the words are represented as dense real-valued vectors, the similarity of two words can easily be defined in terms of the angle between their vectors, i.e. the most similar words for a query word can be retrieved by finding its nearest neighbours in the vector space according to cosine distance.

One of the main drawbacks of building such a model from raw corpora, however, is that by itself it is not able to handle polysemy and homonymy, because a single representational vector is built for each lexical element regardless of the number of its different senses. We applied a simple method to alleviate this problem, at least in cases where the homonyms have different parts-of-speech. In order to assign different vectors to the same word form with different parts-of-speech, we applied PoS tagging and lemmatization to the training corpora before building the model. The main PoS tag of each word was attached to the word as a suffix in the form lemma#PoS; thus a different embedding vector was created for homonymous lemmas with different parts-of-speech.

We trained an English word embedding model on the English Wikipedia dump of 2.25 billion tokens (8.24 M token types; downloaded from https://dumps.wikimedia.org/ in May 2016) that was annotated using the Stanford tagger (Toutanova et al. 2003). Since the CBOW model has proved to be more efficient for large training corpora, we used this model architecture for training, with the radius of the context window set to 5, the number of dimensions set to 300, and a token frequency limit of 5.

Figure 1 illustrates how the words pianist, teacher, turner, maid and their three nearest neighbors are arranged in the English word embedding space (the PoS tag is NN for all example words, and it is omitted from the figure). The original vectors consist of 300 dimensions, but these were mapped to a 2D representation using the t-SNE algorithm (van der Maaten and Hinton 2008).
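The similarity and composition operations described above can be sketched in a few lines. This is a toy illustration with hypothetical 3-dimensional vectors (real models use 300 dimensions learned from a corpus):

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def compose(u, v):
    # the meaning of a two-word composition is approximated
    # by the sum of the corresponding embedding vectors
    return [a + b for a, b in zip(u, v)]

# hypothetical toy embeddings (illustrative values, not from a trained model)
vectors = {
    "cow#NN":  [0.9, 0.1, 0.2],
    "calf#NN": [0.8, 0.2, 0.3],
    "milk#NN": [0.1, 0.9, 0.1],
}

def nearest(query, vocab):
    # rank vocabulary items by cosine similarity to the query vector
    return max(vocab, key=lambda w: cosine_similarity(query, vocab[w]))
```

With such toy vectors, the nearest neighbour of cow#NN among the remaining words is calf#NN, since the two vectors point in nearly the same direction.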
Lexical Resources

Our goal was to create a model that can assign semantic features and elementary predicates to words in an arbitrary text. Thus, first, the set of features to be used had to be defined. The Longman Dictionary of Contemporary English (LDOCE) (Summers 2005) is a traditional dictionary containing words and their definitions. All definitions in the dictionary are written using a constrained defining vocabulary, the Longman Defining Vocabulary (LDV). The definitions of a subset of headwords in LDOCE, including all items in the LDV and the most frequent words listed in the BNC and the Google unigram counts, were transformed into a formal description containing only unary and binary predicates in a resource called 4lang (Kornai et al. 2015), as illustrated by the following examples (for an explanation of the notation used in these definitions, see (Kornai et al. 2015)):

bread: food, FROM/2742 flour, bake MAKE
(a type of food made from flour and water that is mixed together and then baked)

show: =AGT CAUSE[=DAT LOOK =PAT], communicate
(to let someone see something)

We further transformed this format so that we obtained a set of category labels (here: unary and binary predicates), each with a list of example words. This was achieved by segmenting the formal descriptions into elementary predicates (by splitting at commas), but we did not segment predicates into further parts, so e.g. HAS{four(legs)} remained an atomic feature. Each such token was treated as a category label. Then, all words that had the particular token in their definition were listed as examples for that label. This resulted in 1489 category labels and 12,507 words listed as examples for them. Then, in order to make this resource compatible with the word embedding model built from the Wikipedia corpus, its vocabulary was intersected with that of the model. Even though the vocabulary of this resource consists mostly of frequent words used in LDOCE definitions, it also includes some affixes, inflected forms, and a few multiword items, which are not present in the lemmatized Wikipedia model, so the intersection resulted in 11,039 words. Table 1 shows some example words for some features derived from the 4lang resource.

Category                    | Example words in 4lang
PART_OF.body                | body#NN, tongue#NN, back#NN, neck#NN, shoulder#NN, bone#NN, skin#NN, wrist#NN, buttock#NN etc.
=AGT.HAS.mouth              | swallow#VB, suck#VB, eat#VB, drink#VB
HAS{four(legs)}             | horse#NN, tiger#NN
mammal                      | mammal#NN, lion#NN, deer#NN, man#NN, horse#NN, sheep#NN, cattle#NN, rabbit#NN, cat#NN, pig#NN, goat#NN, cow#NN
=AGT.HAS.mind               | read#VB, remember#VB, feel#VB, understand#VB
=AGT.CAUSE{=DAT.KNOW.=PAT}  | express#VB, teach#VB

Table 1: Example words for some semantic features (predicates) after transforming the definitions to the format consisting of labels and example words

Figure 1: The arrangement of the 3 nearest neighbors of the words pianist, teacher, turner, maid in the English word embedding space

However, some categories were too broad, and the set of words listed for them was too heterogeneous. To handle this problem, a hierarchical agglomerative clustering algorithm was applied to the set of words in those categories that contained at least five words. The reason for applying hierarchical clustering rather than k-means is based on the argument of (Pereira, Tishby, and Lee 1993), who state that due to the sophisticated variability of written texts, the number of clusters of the concepts used in a certain text cannot be predicted. A hierarchical organization, however, is appropriate for producing compact groups of words and phrases based on the actual text, rather than on some predefined generalization. The linkage method for the hierarchical clustering was chosen based on the cophenetic correlation between the original data points and the resulting linkage matrix (Sokal and Rohlf 1962). The best correlation was achieved when using Ward's distance criterion (Ward 1963), resulting in small and dense groups of terms at the lower levels of the resulting dendrogram. However, we did not need the whole hierarchy, represented as a binary tree, but separate, compact groups of terms, i.e. well-separated subtrees of the dendrogram. The most intuitive way of defining the cutting points of the tree is to find large jumps in the clustering levels. To put it more formally, the height of each link in the cluster tree is compared with the heights of neighbouring links below it down to a certain depth. If this difference is larger than a predefined threshold value (i.e. the link is inconsistent), then the link is a cutting point. For more details of the clustering algorithm, see (Siklósi 2016). Each cluster was then labeled with the original category label with a numeric index added.

Even though we present our method using only the 4lang dictionary as a lexical resource, the system can be built from any dictionary that can be transformed to a similar format.

Method

Our objective was to create a model with high lexical coverage that can also return the most relevant semantic features for words not present in 4lang. In order to achieve this goal, the semantic features from this controlled set were projected into the embedding space containing the representation of the words. The nearest feature neighbors of each word can then be retrieved from the model using the cosine distance metric.

For each indexed semantic predicate label output by the clustering algorithm, we iterated over the list of example words annotated with their part-of-speech (the crude PoS tags used in the 4lang resource had to be mapped to the more fine-grained PTB tags returned by the Stanford tagger) and retrieved their embedding vectors from the word embedding model built from the PoS-tagged Wikipedia corpus. As a simple but effective method for rendering a representation vector for a set of words from their corresponding word embeddings, we took the mean of these vectors and used that as the embedding vector of that particular semantic feature.
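The projection of a feature into the embedding space described above can be sketched as follows: the feature vector is the mean of the example words' vectors, and the features assigned to a query word are its nearest neighbours among these feature vectors. The word vectors and feature labels below are hypothetical 2-dimensional toy values, not outputs of the actual model:

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(vectors):
    # representation of a feature: the mean of its example words' embeddings
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# hypothetical toy word embeddings (lemma#PoS keys as in the paper)
word_vecs = {
    "horse#NN": [0.9, 0.1], "tiger#NN": [0.8, 0.3],
    "eat#VB":   [0.1, 0.9], "drink#VB": [0.2, 0.8],
}

# feature vectors computed from their 4lang example words; these are kept in
# a store separate from the word vectors so lookup can be restricted to features
feature_vecs = {
    "HAS{four(legs)}": mean_vector([word_vecs["horse#NN"], word_vecs["tiger#NN"]]),
    "=AGT.HAS.mouth":  mean_vector([word_vecs["eat#VB"], word_vecs["drink#VB"]]),
}

def nearest_features(word_vec, n=1):
    # rank all feature vectors by cosine similarity to the query word's vector
    ranked = sorted(feature_vecs,
                    key=lambda f: cosine_similarity(word_vec, feature_vecs[f]),
                    reverse=True)
    return ranked[:n]
```

A query vector close to the animal words is assigned HAS{four(legs)}, while one close to the ingestion verbs is assigned =AGT.HAS.mouth; since ranking replaces exact lookup, a word absent from the lexical resource still receives the features of its embedding-space neighbourhood.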
Thus a representation of each predicate used in the definitions was obtained in the semantic space created from the English PoS-tagged corpus. These semantic feature vectors were kept separate from the word vectors in the original embedding model in order to be able to restrict lookup to either words or features derived from each lexical resource.

To find the relevant features for a query word tagged with its appropriate part-of-speech, its representational vector is retrieved from the word embedding model, and its nearest neighbors are taken from the model containing the semantic predicates. Since nearest neighbors are searched for instead of requiring an exact match, out-of-vocabulary words (with respect to the original lexical resources) can also be assigned semantic labels. The only requirement is that the word must be present in the word embedding model.

Other languages

We also carried out some experiments to apply our method to another language, Hungarian. Hungarian is an agglutinative language with very few lexical semantic resources. As the original 4lang dictionary contained the Hungarian translation of the vocabulary included (3477 words), it was straightforward to create a similar model for Hungarian as well. For this, we had to create a Hungarian word embedding model, which was built from a web-crawled corpus of 3.18 billion tokens (27.49 M token types) that was annotated using the PurePos tagger (Orosz and Novák 2013), augmented with the Humor Hungarian morphological analyzer (Novák 2014; Novák, Siklósi, and Oravecz 2016). We applied the method described above to define the position of the features in the Hungarian word embedding space by calculating the mean of the vector representations of the Hungarian example words for each semantic predicate. Our approach can easily be extended to any other language by translating this dictionary of moderate size (relative to complicated knowledge bases). Furthermore, this method also adapts to differences in word usage across languages, since words are represented by their embedding vectors in the target language.

Experiments and Results

The aim of this research was to investigate the possibility of providing a high-coverage tool for assigning a semantic representation to words of a natural language input dynamically, instead of using a static knowledge base with a limited vocabulary. Thus, we first investigated the performance of the tool on some example input, and then we also performed a quantitative analysis.

Qualitative analysis

Table 2 shows an example: Laika likes eating fried onion with cucumber. First, using the Stanford parser, the input is annotated with part-of-speech tags, and each word is lemmatized. Then, for each lemmatized content word (i.e. omitting the function word with) with its corresponding part-of-speech, the top 10 nearest features are retrieved from the model and ordered by their distance from the vector representing the target word in the embedding space. Note that the number of top n features generated for each word is a free parameter, but moving further in the semantic space results in less and less appropriate features for the target word. Table 3 shows the WordNet hypernyms assigned to each content word in the same sentence (the representation of the adjective fried and the proper name Laika is missing from WordNet).

Original word | Analyzed word | Features
Laika         | Laika#NNP     | carnivorous, mammal, faithful, HAS.short(hair/3359), HAS{four(legs)}, AT/2744.farm, companion, young, EAT.flesh, HAS.long(tail)
likes         | like#VB       | want, =PAT{person}, wish, emotion, ask, =AGT.HAS.mind, annoy, =PAT.IN/2758.mind, communicate, desire, =AGT.HAS.body
eating        | eat#VB        | swallow, =AGT.HAS.mouth, eat, love, INSTRUMENT.tongue, =AGT.CAUSE{=PAT{move}}, sleep, suck, sing, touch, rest
fried         | fried#JJ      | food, ’.COOK/825, ’.SERVE, thick/2134, FROM/2742.flour, bake.MAKE, FROM/2742.milk, food.IN/2758, vegetable, sweet, bread
onion         | onion#NN      | ’.COOK/825, vegetable, fruit, food, FROM/2742.milk, sweet, round, soft, thick/2134, PART_OF.plant
with          | with#IN       | —
cucumber      | cucumber#NN   | vegetable, fruit, food, ’.COOK/825, sweet, ’.EAT, round, CAUSE{food.HAS.taste}, PART_OF.plant, soft

Table 2: An example sentence, Laika likes eating fried onion with cucumber, with features assigned to each word using our method

Original word | Analyzed word | Hypernyms
Laika         | Laika#NNP     | —
likes         | like#VB       | desire, want
eating        | eat#VB        | consume, digest, take in, take, have
fried         | fried#JJ      | —
onion         | onion#NN      | vegetable, produce, food, solid, matter, physical entity, entity
with          | with#IN       | —
cucumber      | cucumber#NN   | vegetable, produce, food, solid, matter, physical entity, entity

Table 3: An example sentence, Laika likes eating fried onion with cucumber, with hypernyms from WordNet assigned to each word

As can be seen in the example, our model is able to assign two types of features to words. Ontological/taxonomic categories, such as carnivorous and mammal for the word Laika, or vegetable and food for the words onion and cucumber, appear together with characteristic features of the given concept, such as faithful, HAS{four(legs)}, AT/2744.farm, or round and CAUSE{food.HAS.taste}. While the first type of features can be extracted from traditional ontologies, the latter type of characteristics cannot. However, we believe that the latter type of features forms an important part of commonsense knowledge, because if people are asked to describe a concept, they will typically use such characteristics. Moreover, an inference system can also benefit from such descriptions. It can also be seen from the example that the model “knows” that Laika is a dog, as it returns semantic features characterizing dogs. In addition, the feature EAT.flesh emphasizes the contrast between Laika being a dog and eating cucumber and onion.

Another benefit of our model, as mentioned above, is that it is able to generate features for all the words that are present in the original corpus the word embedding was built from, not only for the extremely limited set of words included in the 4lang dictionary. WordNet and other hand-made resources are limited to the words and the classification that the designers of the resource had in mind. Our model, in contrast, is able to assign features to proper names, slang words, and mistyped word forms as well, as long as these are represented in the corpus the word embedding model was created from. In addition to the above example containing the dog name Laika, the following examples show some of the nearest features for two more proper names and two slang words:

IBM: information.IN, computer, equipment, electric, group
Facebook: information.ON, ABOUT.recent(events), computer
hype: fame, fun, idea, popular, surprise
numpty: bad, lazy, stupid, lack(work), dull

A weakness of our method is that in some cases it also adds noise to the generated features. For example, features such as sleep or sing generated for the verb eat are not ones we would expect to be part of the definition of eat (even if in a broader sense they might be related). Inappropriate features like these may be eliminated manually from the representations generated by the model. The model can thus also be used as an aid in a semi-automatic semantic resource creation or extension process, proposing an initial representation that can be cleaned manually for applications that require a high-precision lexical semantic representation. Otherwise, the generated semantic features can be used in models performing some downstream task even without filtering out the noise. In that case, the added semantic features may improve the performance of the downstream tool by providing mostly useful features for words that would otherwise completely lack a semantic representation.

Quantitative analysis

We also carried out two kinds of quantitative analysis of the performance of our model. First, we checked the robustness of the model by performing a sanity check. For each word present in the original 4lang dictionary, we calculated how many of the semantic features present in the original definition were retrieved among the top N features returned by the model (feature recall, Rf) and the percentage of words for which all features were retrieved (word recall, Rw). The results are shown in Table 4 as a function of N (numbers are percentages). Recall was also calculated ignoring words having more than N features (Rwp) and discounting features over the N limit for words having more than N features (Rfp). As no definition contained more than 10 terms, Rwp is identical to Rw and Rfp is identical to Rf for N ≥ 10. The definitions are terse and contain a minimal description of each word: half of the words are defined by only a single term, and almost all words by not more than 5 (see column |f| ≤ N). Feature precision (P(f)) apparently decreases quickly as the number of features retrieved increases if we blindly accept only terms present in the original definitions as correct. See, however, further discussion below. The last column of the table shows the mean average precision (MAP) of features (terms) present in the original definitions.

N  | Rw    | Rwp   | Rf    | Rfp   | |f|≤N  | P(f)  | MAP
1  | 44.11 | 88.18 | 50.79 | 92.66 | 50.02  | 92.66 | 92.66
5  | 86.88 | 87.75 | 91.38 | 92.26 | 99.00  | 56.70 | 89.66
10 | 93.39 | 93.39 | 95.97 | 95.97 | 100.00 | 32.70 | 90.56
15 | 95.61 | 95.61 | 97.36 | 97.36 | 100.00 | 22.89 | 90.77
20 | 96.48 | 96.48 | 97.93 | 97.93 | 100.00 | 17.54 | 90.82

Table 4: Performance of the model for English tested on definitions in the 4lang vocabulary as a function of the number N of top-ranked features retrieved for each word. Rw: word recall (words for which all features were retrieved), Rwp: recall for words having no more than N features, Rf: feature recall, Rfp: feature recall ignoring features over the top N, |f| ≤ N: percentage of words having no more than N features, P(f): feature precision, MAP: mean average precision of features. Numbers are percentages.

In the other experiment, we selected 280 words not present in the original dictionary, chosen randomly from a predefined list of Hungarian words in which each word was assigned to one of 28 semantic domains (e.g. food, vehicles, locations, occupations, etc.). From each domain, 10 words were chosen randomly and translated to English. Then, for these words, the 10 nearest features were generated, and two human annotators checked whether each feature was adequate for the given word. The same evaluation was performed for Hungarian. The agreement between the annotators was 0.798 for English and 0.734 for Hungarian according to Cohen's kappa, which is substantial in both cases. The results are shown in Table 5.

Language  | acc    | d-acc  | #F  | #B
English   | 75.13% | 90.07% | 559 | 277
Hungarian | 73.86% | 88.34% | 584 | 295

Table 5: Performance of the model on 280 different test words for English and Hungarian. acc: feature accuracy, d-acc: domain accuracy of features, #F: number of different features, #B: number of features marked wrong at least once.

The table shows feature accuracy (acc: the ratio of correctly assigned features). We also automatically computed feature “domain accuracy” (d-acc): here we ignored feature assignment errors where the same feature was marked adequate for another test word in the same domain. The number of different features that appeared in this evaluation and the number of features marked wrong at least once are shown in the last two columns. Note that the feature accuracy (precision) for 10 features retrieved turned out to be much higher (75.13%) than in the sanity check experiment (only 32.70%), even though this list contained words not in the original resource. The reason for this is that the model returns many features which, while not explicitly present in the original terse definitions, correctly follow from the knowledge embodied in the feature model. For example, while the definition of dog in 4lang contains only 3 terms: animal, faithful and carnivorous, the top 10 features retrieved from the model also include mammal, HAS{four(legs)}, hairy and companion. The sanity check experiment thus grossly underestimated the precision of the model.
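The mean average precision reported in the last column of Table 4 can be computed as sketched below (a generic MAP implementation over ranked feature lists, not the authors' code; the dog example values are illustrative):

```python
def average_precision(ranked_features, gold_features):
    # AP: mean of precision@k taken at each rank k where a gold feature occurs,
    # normalized by the number of gold features
    hits, precisions = 0, []
    for k, feat in enumerate(ranked_features, start=1):
        if feat in gold_features:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(gold_features) if gold_features else 0.0

def mean_average_precision(results):
    # results: list of (ranked feature list, gold feature set) pairs, one per word
    return sum(average_precision(r, g) for r, g in results) / len(results)

# e.g. for the 4lang dog entry (gold terms: animal, faithful, carnivorous),
# a ranking that interleaves them with extra features such as mammal or
# companion still scores well, since all gold terms appear near the top
ap = average_precision(
    ["animal", "mammal", "faithful", "companion", "carnivorous"],
    {"animal", "faithful", "carnivorous"})
```

This also illustrates why MAP stays high in Table 4 while P(f) drops: extra retrieved features lower strict precision at large N, but MAP only depends on how early the gold terms are ranked.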
Conclusion

We have presented an automatic method that is able to assign semantic features to words of natural language. This approach exploits the representative power of neural word embeddings by mapping features derived from formal definitions of words to the vector space of the given language. In addition to some illustrative examples, we have presented an evaluation of the models demonstrating that the method works with relatively high accuracy. Although there is a moderate amount of noise in the set of generated features, the method has very high coverage, being able to process proper names and non-standard words as well, which cannot all be included in hand-made static knowledge bases. As such, our automatic method can be used as the basis of a manually constructed resource, or can provide valuable input for downstream applications, such as commonsense inference systems.

Acknowledgments

This research has been implemented with support provided by grant FK125217 of the National Research, Development and Innovation Office of Hungary, financed under the FK17 funding scheme.

References

Baroni, M.; Dinu, G.; and Kruszewski, G. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 238–247. Baltimore, Maryland: Association for Computational Linguistics.

Brown, S. W. 2008. Choosing sense distinctions for WSD: Psycholinguistic evidence. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, HLT-Short '08, 249–252. Stroudsburg, PA, USA: Association for Computational Linguistics.

Davis, E. 1990. Representations of Commonsense Knowledge. Morgan Kaufmann.

Fellbaum, C., ed. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Kornai, A.; Ács, J.; Makrai, M.; Nemeskey, D. M.; Pajkossy, K.; and Recski, G. 2015. Competence in lexical semantics. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, 165–175. Denver, Colorado: Association for Computational Linguistics.

Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P. N.; Hellmann, S.; Morsey, M.; van Kleef, P.; Auer, S.; and Bizer, C. 2015. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal 6(2):167–195.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26 (NIPS 2013), 3111–3119. Lake Tahoe, Nevada, USA.

Mikolov, T.; Yih, W.; and Zweig, G. 2013. Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT 2013, 746–751. Atlanta, Georgia, USA.

Miller, G. A. 1995. WordNet: A lexical database for English. Communications of the ACM 38:39–41.

Novák, A. 2014. A new form of humor – mapping constraint-based computational morphologies to a finite-state representation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014). Reykjavik, Iceland: European Language Resources Association (ELRA).

Novák, A.; Siklósi, B.; and Oravecz, C. 2016. A new integrated open-source morphological analyzer for Hungarian. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Paris, France: European Language Resources Association (ELRA).

Orosz, G., and Novák, A. 2013. PurePos 2.0: a hybrid tool for morphological disambiguation. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2013), 539–545. Hissar, Bulgaria: INCOMA Ltd.

Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.

Pereira, F.; Tishby, N.; and Lee, L. 1993. Distributional clustering of English words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, ACL '93, 183–190. Stroudsburg, PA, USA: Association for Computational Linguistics.

Siklósi, B. 2016. Using embedding models for lexical categorization in morphologically rich languages. In Gelbukh, A., ed., Computational Linguistics and Intelligent Text Processing: 17th International Conference, CICLing 2016. Konya, Turkey: Springer International Publishing, Cham.

Sokal, R. R., and Rohlf, F. J. 1962. The comparison of dendrograms by objective methods. Taxon 11(2):33–40.

Summers, D. 2005. Longman Dictionary of Contemporary English. Longman.

Toutanova, K.; Klein, D.; Manning, C. D.; and Singer, Y. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL '03), 173–180. Stroudsburg, PA, USA: Association for Computational Linguistics.

van der Maaten, L., and Hinton, G. E. 2008. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research 9:2579–2605.

Ward, J. H. 1963. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 58(301):236–244.