Analysis of Italian Word Embeddings

Rocco Tripodi                          Stefano Li Pira
Ca’ Foscari University of Venice       University of Warwick
rocco.tripodi@unive.it                 stefano.li-pira@wbs.ac.uk
Abstract

English. In this work we analyze the performance of two of the most widely used word embedding algorithms, skip-gram and continuous bag of words, on the Italian language. These algorithms have many hyper-parameters that have to be carefully tuned in order to obtain accurate word representations in vector space. We provide an extensive analysis and evaluation, showing which configurations of parameters perform best on specific analogy tasks.

Italiano. In this work we analyze the performance of two of the most widely used word embedding algorithms: skip-gram and continuous bag of words. These algorithms have several hyper-parameters that must be set carefully in order to obtain accurate representations of words in vector spaces. We present a thorough analysis and an evaluation of the two algorithms, showing which parameter configurations are best for specific applications.

1   Introduction

The distributional hypothesis of language, set forth by Firth (1935) and Harris (1954), states that the meaning of a word can be inferred from the contexts in which it is used. Using the co-occurrence of words in a large corpus, we can observe, for example, that the contexts in which client is used are very similar to those in which customer occurs, and less similar to those in which waitress or retailer occur. A wide range of algorithms has been developed to exploit these properties. Recently, one of the most widely used methods in many natural language processing (NLP) tasks is word embeddings (Bengio et al., 2003; Mikolov et al., 2010; Mikolov et al., 2013). It is based on neural network techniques and has been shown to capture semantic and syntactic properties of words, taking raw text as input without other sources of information. It represents each word as a vector such that words that appear in similar contexts are represented by similar vectors (Collobert and Weston, 2008; Mikolov et al., 2013). The dimensions of these vectors are not easily interpretable and, in contrast with explicit representations, do not correspond to specific concepts.

In Mikolov et al. (2013), the authors propose two different models that seek to maximize, respectively, the probability of a word given its context (continuous bag-of-words model) and the probability of the surrounding words (before and after the current word) given the current word (skip-gram model). In this work we seek to further explore these relationships by generating word embeddings for over 40 different parametrizations of the continuous bag-of-words (CBOW) and skip-gram (SG) architectures, since, as shown in Levy et al. (2015), the choice of the hyper-parameters heavily affects the construction of the embedding spaces.

Specifically, our contributions include:

• Word embedding. An analysis of how different hyper-parameters achieve different accuracy levels in relation recovery tasks (Mikolov et al., 2013).

• Morpho-syntactic and semantic analysis. Word embeddings have been shown to capture semantic and syntactic properties; we compare two different objectives for recovering relational similarities in semantic and morpho-syntactic tasks.

• Qualitative analysis. We investigate problematic cases.
2   Related works

The interest that word embedding models have attracted in the international NLP community has recently been confirmed by the increasing number of studies adopting these algorithms for languages other than English. One of the first examples is the Polyglot project, which produced word embeddings for 117 languages (Al-Rfou et al., 2013). They demonstrated the utility of word embeddings, achieving performances competitive with state-of-the-art methods for English on a part-of-speech tagging task. Attardi et al. (2014) made the first attempt to introduce word embeddings for Italian, obtaining similar results. They showed that, using word embeddings, they obtained one of the best accuracy levels in a named entity recognition task.

However, these optimistic results are not confirmed by more recent studies. Indeed, the performance of word embeddings on accuracy tests is not directly comparable to that obtained for English. For example, Attardi and Simi (2014), combining word embeddings in a dependency parser, did not observe improvements over a baseline system that does not use such features. Berardi et al. (2015) found 47% accuracy on Italian versus 60% accuracy on English. These results may be a sign of the higher complexity of Italian with respect to English, as we will see in Section 4.1.

Similarly, recent works that trained word embeddings on tweets have highlighted some critical issues. One of these aspects is that the morphology of a word is opaque to word embeddings. Indeed, the relatedness in meaning of a lemma’s different word forms, i.e., its different string representations, is not systematically encoded. This means that, in morphologically rich languages with long-tailed frequency distributions, even the word embedding representations of word forms of common lemmata may become very poor (Kim et al., 2016).

For this reason, some recent contributions on Italian tweets have tried to capture these aspects. Tamburini (2016) trained SG on a set of 200 million tweets. He proposed a PoS-tagging system integrating neural representation models and a morphological analyzer, exhibiting very good accuracy. Similarly, Stemle (2016) proposed a system that uses word embeddings and augments the WE representations with character-level representations of word beginnings and endings.

We have observed that in these studies the authors used either the most common set-up of parameters gathered from the literature (Tamburini, 2016; Stemle, 2016; Berardi et al., 2015) or arbitrary values (Attardi and Simi, 2014; Attardi et al., 2016). Despite the relevance given to these parameters in the literature (Goldberg, 2017), we have not seen studies that analyze the different strategies behind the possible parametrizations. In the next section, we propose a model to deepen these aspects.

3   Italian word embeddings

Previous results on word analogy tasks have been reported using vectors obtained with proprietary corpora (Berardi et al., 2015). To make the experiments reproducible, we trained our models on a dump of the Italian Wikipedia (dated 2017.05.01), from which we used only the body text of each article. The obtained texts have been lowercased and filtered according to the corresponding parameter of each model. The corpus consists of 994,949 sentences that result in 470,400,914 tokens.

The hyper-parameters used to construct the different embeddings for the SG and CBOW models are: the size of the vectors (dim), the window size of the word contexts (w), the minimum number of word occurrences (m), and the number of negative samples (n). The values that these hyper-parameters can take are shown in Table 1.

    HP    SG                   CBOW
    dim   200, 300, 400, 500   200, 300, 400, 500
    w     3, 5                 2, 5
    m     1, 5                 1, 5
    n     1, 5, 10             1, 5, 15

          Table 1: Hyper-parameters
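As a concrete reference, a minimal sketch of this sweep using the gensim library is shown below. It mirrors, rather than reproduces, the training pipeline described above: the corpus path and output file names are illustrative, while gensim's parameter names map directly onto the hyper-parameters of Table 1.

    import itertools

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # One lowercased, tokenized sentence per line, extracted from the
    # Italian Wikipedia dump (hypothetical file name).
    corpus = LineSentence("itwiki-20170501-body.txt")

    # SG grid of Table 1; for CBOW the grid uses w in {2, 5} and n in {1, 5, 15}.
    for dim, w, m, n in itertools.product(
            [200, 300, 400, 500], [3, 5], [1, 5], [1, 5, 10]):
        model = Word2Vec(
            corpus,
            vector_size=dim,  # dim: size of the vectors ("size" in older gensim)
            window=w,         # w: window size of the word contexts
            min_count=m,      # m: minimum number of word occurrences
            negative=n,       # n: number of negative samples
            sg=1,             # 1 = skip-gram; 0 = CBOW
            workers=4,
        )
        model.wv.save(f"sg-{dim}-w{w}-m{m}-n{n}.kv")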
4   Evaluation

The obtained embedding spaces¹ are evaluated on a word analogy task, using an enriched version of the Google word analogy test (Mikolov et al., 2013), translated into Italian by Berardi et al. (2015). It contains 19,791 questions and covers 19 relation types, 6 of which are semantic and 13 morpho-syntactic (see Table 2).

¹The trained vectors with the best performances are available at http://roccotripodi.com/ita-we
    Morphosyntactic                     Semantic
    adjective-to-adverb                 capital-common-countries
    opposite                            capital-world
    comparative                         currency
    superlative (assoluto)              city-in-state
    present-participle (gerundio)       regione capoluogo
    nationality-adjective
    past-tense
    plural
    plural-verbs (3rd person)
    plural-verbs (1st person)
    remote-past-verbs (1st person)
    noun-masculine-feminine-singular
    noun-masculine-feminine-plural
    #10,876                             #8,915

          Table 2: Relation types

The proportions of these two types of questions are roughly balanced, as shown in Table 2.

To recover these relations, two different methods are used to compute vector analogies: 3CosAdd (Eq. 1) (Mikolov et al., 2013) and 3CosMul (Eq. 2) (Levy et al., 2014):

    \mathrm{3CosAdd}: \quad \operatorname*{argmax}_{b^* \in V} \; \cos(b^*, b - a + a^*)    (1)

    \mathrm{3CosMul}: \quad \operatorname*{argmax}_{b^* \in V} \; \frac{\cos(b^*, b)\,\cos(b^*, a^*)}{\cos(b^*, a) + \varepsilon}    (2)

where \varepsilon is a small constant (set to 0.001 in Levy et al. (2014)) that prevents division by zero.

These two measures try to capture different relations between word vectors. The idea behind them is to use the cosine similarity to recover the vector of the hidden word (b*), which has to be the vector most similar to two positive words and most dissimilar to one negative word. In this way, it is possible to model relations such as queen is to king what woman is to man: the word queen (b*) is represented by a vector that has to be similar to king (b) and woman (a*) and different from man (a). The two analogy measures differ slightly in how they weight each aspect of the similarity relation. 3CosAdd allows one sufficiently large term to dominate the expression (Levy et al., 2014), whereas 3CosMul achieves a better balance, amplifying the small differences between terms and reducing the larger ones (Levy et al., 2014). As explained in Levy et al. (2014), we expect 3CosMul to outperform 3CosAdd on both the syntactic and the semantic tasks, as it tries to normalize the strength of the relationships that the hidden term has both with the attractor terms and with the repeller term.
    m = 1       m = 5      Berardi
    3,227,282   847,355    733,392

          Table 3: Vocabulary length

4.1   Experimental results

The results of our evaluation are presented in Figure 1. The main trend that can be noticed is that accuracy increases as the number of dimensions of the embedded vectors increases. This indicates that Italian benefits from a rich representation that can account for its rich morphology. Another important trend that emerges is that the parameters have the same effect on both algorithms, and that the two algorithms perform very differently on all the tasks: CBOW has very low accuracy compared to SG. We can also see that the dim hyper-parameter is not correlated with the size of the vocabulary (model complexity), as one might expect. In fact, with increasing values of dim the accuracy increases whatever the value of m, and it is m that heavily affects the vocabulary length (see Table 3). However, the dim hyper-parameter seems to be correlated only with the accuracy on the semantic tasks, while the performance on the morpho-syntactic tasks does not seem to get a big boost from increasing the dimensionality.

With respect to the size of the context (w) used to create the word representations, we do not observe a clear difference between the 18 pairs, either in SG or in CBOW. On the contrary, a clear trend can be observed varying the n hyper-parameter: with n = 1 the accuracy is significantly lower than the one we obtained with n = 5 or n = 10. Increasing the number of negative samples consistently increases the accuracy.

These results also support the claim put forward by Levy et al. (2014) that the 3CosMul method is more suited to recovering analogy relations. In fact, we can see that on average the right bars of the plots are higher than the left ones.
4.2   Error analysis

If we restrict the error analysis to the most macroscopic differences in Figure 1, we can compare three different parametrizations: SG-200-w5-m5-n10, SG-500-w5-m5-n1, and SG-500-w5-m5-n10. In this way we can analyze the results obtained by changing the number of dimensions of the vectors and the role played by n.
Figure 1: Results as accuracy with different hyper-parameters (y axis) using the 3CosAdd (left bar) and the 3CosMul (right bar) formula. The green part of each bar indicates the accuracy on the morpho-syntactic task, the red part the accuracy on the semantic task. The + sign on each bar indicates the accuracy on the entire dataset. The upper row of the figure shows the results of the SG algorithm and the bottom row the results of CBOW. The last two bars of the SG plots indicate the results obtained using the vectors made available by Berardi et al. (2015).

In Table 4, the total number of errors and the number of different words that have not been recovered by each parametrization are presented.

    Parametrization     #errors   #words
    SG-200-w5-m5-n10    10,113    543
    SG-500-w5-m5-n1     10,506    535
    SG-500-w5-m5-n10    9,337     525

Table 4: Total number of errors and number of different words that have not been recovered

From this table we can see that most of the errors are made on a relatively small set of words. This phenomenon can be studied by analyzing the most problematic cases. In Table 5 we can see the list of the most common errors ranked by frequency for each method.

    SG-200-w5-m5-n10   #     SG-500-w5-m5-n1   #     SG-500-w5-m5-n10   #
    california         328   california        349   california         287
    texas              223   texas             224   texas              165
    arizona            164   arizona           164   arizona            145
    florida            144   ohio              142   florida            124
    ohio               135   florida           140   ohio               112

          Table 5: Most common errors

As we can see from these lists, the errors are made on the same words, because these are the most common in the dataset (e.g., the dataset contains 217 queries that require Florida as the answer, compared to 55 for Italia). However, if we compare the frequency of these errors in the analogy test across the three parametrizations, we can observe an improvement of approximately 15% in accuracy with SG-500-w5-m5-n10. Indeed, although many errors are not recovered under any of the parametrizations, we can observe that approximately 21% of the errors are recovered under certain parametrizations (Table 6).

    Parametrization      #errors solved
    dim = 500 & n = 10   873
    only dim = 500       645
    only n = 10          927

          Table 6: Solved errors

To further investigate these improvements related to the aforementioned parametrizations, we focused on one of the most frequent errors in the analogy test, the word California. As we can see from the list of solved analogy tests (Table 7), different parametrizations are helpful for solving different types of analogies. For example, an increase in the dimensionality increases the accuracy, but mainly on analogy tests with words whose representations in the training data relate to a wider set of contexts (Houston:Texas; Chicago:Illinois). The best parametrization is obtained by increasing the negative sampling. As we can see from the examples provided, the analogies are resolved thanks to a contextual similarity between the two pairs (Huntsville:Alabama; Oakland:California). In these cases, negative sampling could help filter out from each representation those words that are not expected to be relevant to the word embeddings.

    dim = 500 & n = 10:
      Milwaukee Wisconsin Oakland California
      Shreveport Louisiana Oakland California
      Irvine California Shreveport Louisiana
      Irvine California Baltimore Maryland
      Sacramento California Henderson Nevada
      Sacramento California Orlando Florida

    only n = 10:
      Huntsville Alabama Oakland California
      Baltimore Maryland Oakland California
      Irvine California Phoenix Arizona
      Arlington Texas Irvine California
      Phoenix Arizona Sacramento California
      Huntsville Alabama Sacramento California

    only dim = 500:
      Houston Texas Oakland California
      Chicago Illinois Oakland California
      Denver Colorado Oakland California
      Philadelphia Pennsylvania Oakland California
      Portland Oregon Oakland California
      Tulsa Oklahoma Irvine California

          Table 7: Examples of analogy tests solved

Similar types of improvement are noticed on analogy tests that contain a challenging word, predire (to predict). The results of this analysis are presented in Table 9, where it is possible to see that a higher dimensionality improves the accuracy on analogy tests containing open-domain verbs (e.g., descrivere, vedere).

Similarly to the previous case, a higher dimensionality allows for fine-grained partitions, improving the correct associations between terms. However, also in this case, the best parametrizations are obtained by increasing the negative sampling or both parameters. As we can see, both the present participle and the past tense pairs are correctly solved. These examples provide preliminary evidence of how negative sampling, by filtering out non-informative words from the relevant context of each word, is able to build representations by opposition that are beneficial both for semantic and for syntactic associations.

Examples of words that almost always are not recovered correctly are presented in Table 10. A selected list of words problematic for all parametrizations is shown in Table 8. It contains plurals, feminine forms, currencies, superlatives and ambiguous words. The low performance on these cases can be explained by the poor coverage of these categories in the training data. In particular, it would be interesting to study the case of feminine forms and to analyze whether it is due to a gender bias in the Italian Wikipedia, as a preliminary analysis of the most frequent errors that persist under all the parametrizations seems to suggest. The words that have benefited from the increase of n are:

    ghana, slovenia, ucraino, portoghese, pakistan, zimbabwe, giocando,
    contessa, irlandese, namibia, serbia, migliorano, suonano, messicano,
    scrivendo, implementano, maltese, giordania

The errors that have been introduced by increasing this parameter are related to the words in Table 11. It is interesting to notice that, given an error in an analogy test, it is often possible to find the correct answer among the top five words most similar to the query. Precisely, we observed this phenomenon in 26% of the cases for SG-200-w5-m5-n10, in 27% of the cases for SG-500-w5-m5-n1 and in 25% for SG-500-w5-m5-n10. Furthermore, in approximately 50% of these cases the correct answer is the second most similar word.
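This near-miss analysis can be sketched as a simple check of whether the gold answer appears among the five nearest candidates; here wrong is assumed to hold the (a, a*, b, b*) tuples of the questions a model failed.

    def near_miss_rate(wv, wrong, topn=5):
        # Fraction of failed questions whose gold answer b_star still
        # appears among the topn candidates for the analogy query.
        hits = 0
        for a, a_star, b, b_star in wrong:
            candidates = [w for w, _ in wv.most_similar(
                positive=[b, a_star], negative=[a], topn=topn)]
            hits += b_star in candidates
        return hits / len(wrong)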
Most of the recovery errors are due to vocabulary issues. In fact, many words of the test set have no correspondence in the developed embedding spaces. This is due to the low frequency of many words that are not in the training corpus or that have been removed from the vocabulary because of their (low) frequency. For this reason we kept the m hyper-parameter very low (i.e., 1 and 5), in counter-tendency with recent works that use larger corpora and then remove infrequent words by setting m to high values (e.g., 50 or 100). In fact, with increasing values of m the number of not-given answers increases rapidly: it goes from 300 (m = 1) to 893 (m = 5).

Some of the words that are not present in the vocabulary even with m = 1 include plural verbs (1st person), which probably are not used by a typical Wikipedia editor, and remote past verbs (1st person), a tense that in recent years has been disappearing from written and spoken Italian. Some of these verbs are:

    giochiamo, zappiamo, mescolai, affiliamo, implementai, rallentiamo,
    rallentai, nuotai

In Berardi et al. (2015) the number of not-given answers is 1,220. The accuracy of their embeddings, obtained using a larger corpus and the hyper-parameters that perform well for English, is always lower than that obtained with our settings, on both the morpho-syntactic and the semantic tasks. This confirms that the tuning of the parameters is crucial for a good embedding representation, since Berardi et al. (2015)'s model has been trained on a much larger corpus and for this reason should outperform ours. Furthermore, their model seems to have some tokenization problems.
    pilotesse   migliore           colori        meloni
    pere        matrigna           figliastra    sua
    real        lev                yen           mamma
    kwanza      vantaggiosissimo   urlano        stimano
    aquila      eroina             programmato   impossibilmente

          Table 8: Always wrong

    dim = 500 & n = 10:
      dire detto predire predetto
      mescolare mescolando predire predicendo
      predire predicendo generare generando
      rallentare rallentando predire predicendo
      scoprire scoprendo predire predicendo

    only n = 10:
      cantare cantato predire predetto
      correre correndo predire predicendo
      generare generando predire predicendo
      predire predicendo programmare programmando
      scrivere scrivendo predire predicendo

    only dim = 500:
      descrivere descritto predire predetto
      vedere visto predire predetto

          Table 9: Examples of analogy tests solved

    SG-200-w5-m5-n10   #    SG-500-w5-m5-n1   #    SG-500-w5-m5-n10   #
    capre              26   groenlandia       27   ratti              26
    rapidamente        26   silenziosamente   27   ovviamente         25
    dolcissimo         26   caldissimo        27   incredibilmente    25
    apparentemente     26   occhi             27   grandissimo        25
    andato             26   greco             27   malvolentieri      25

          Table 10: Almost always wrong

    irlanda    afghanistan   albania    egiziano
    olandese   provvedono    francese   svizzero

          Table 11: New errors
5   Conclusions

We have tested two word representation methods, SG and CBOW, training them only on a dump of the Italian Wikipedia. We compared the results of the two models using 12 combinations of hyper-parameters.

We have adopted a simple word analogy test to evaluate the generated word embeddings. The results have shown that increasing the number of dimensions and the number of negative examples improves the performance of both models.

These types of improvement seem to be beneficial only for the semantic relationships. On the contrary, the syntactic relationships are negatively affected by the low frequency of many of their terms. This may be related to the morphological complexity of Italian. In the future it would be helpful to represent the spatial relationships within specific syntactic domains, in order to evaluate the contribution of hyper-parametrization to the accuracy on syntactic relationships. Moreover, future work will include testing these word embedding parametrizations in practical applications (e.g., the analysis of patent descriptions and book corpora).
Acknowledgments

Part of this work was conducted during a collaboration of the first author with DocFlow Italia. All the experiments in this paper were conducted on the SCSCF multiprocessor cluster system at Ca’ Foscari University of Venice.
References

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In CoNLL-2013, page 183.

Giuseppe Attardi and Maria Simi. 2014. Dependency parsing techniques for information extraction.

Giuseppe Attardi, Vittoria Cozza, and Daniele Sartiano. 2014. Adapting linguistic tools for the analysis of Italian medical records.

Giuseppe Attardi, Daniele Sartiano, Chiara Alzetta, and Federica Semplici. 2016. Convolutional neural networks for sentiment analysis on Italian tweets. In CLiC-it/EVALITA.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

Giacomo Berardi, Andrea Esuli, and Diego Marcheggiani. 2015. Word embeddings go to Italy: A comparison of models and training datasets. In IIR.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM.

John Rupert Firth. 1935. The technique of semantics. Transactions of the Philological Society, 34(1):36–73.

Yoav Goldberg. 2017. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309.

Zellig S. Harris. 1954. Distributional structure. Word, 10(2-3):146–162.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In AAAI, pages 2741–2749.

Omer Levy and Yoav Goldberg. 2014. Linguistic regularities in sparse and explicit word representations. In CoNLL, pages 171–180.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Tomas Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Interspeech, volume 2, page 3.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Egon W. Stemle. 2016. bot.zen @ EVALITA 2016 – A minimally-deep learning PoS-tagger (trained for Italian tweets). In CLiC-it/EVALITA.

Fabio Tamburini. 2016. A BiLSTM-CRF PoS-tagger for Italian tweets using morphological information. In CLiC-it/EVALITA.