bot.zen @ EVALITA 2016 - A minimally-deep learning PoS-tagger (trained for Italian Tweets)

Egon W. Stemle
Institute for Specialised Communication and Multilingualism
EURAC Research
Bolzano/Bozen, Italy
egon.stemle@eurac.edu

Abstract

English. This article describes the system that participated in the POS tagging for Italian Social Media Texts (PoSTWITA) task of the 5th periodic evaluation campaign of Natural Language Processing (NLP) and speech tools for the Italian language, EVALITA 2016.
The work is a continuation of Stemle (2016) with minor modifications to the system and different data sets. It combines a small selection of trending techniques, which implement matured methods, from NLP and ML to achieve competitive results on PoS tagging of Italian Twitter texts; in particular, the system uses word embeddings and character-level representations of word beginnings and endings in an LSTM RNN architecture. Labelled data (Italian UD corpus, DiDi and PoSTWITA) and unlabelled data (Italian C4Corpus and PAISÀ) were used for training.
The system is available under the APLv2 open-source license.

Italiano. Questo articolo descrive il sistema che ha partecipato al task POS tagging for Italian Social Media Texts (PoSTWITA) nell'ambito di EVALITA 2016, la 5ª campagna di valutazione periodica del Natural Language Processing (NLP) e delle tecnologie del linguaggio.
Il lavoro è un proseguimento di quanto descritto in Stemle (2016), con modifiche minime al sistema e insiemi di dati differenti. Il lavoro combina alcune tecniche correnti che implementano metodi comprovati dell'NLP e del Machine Learning, per raggiungere risultati competitivi nel PoS tagging dei testi italiani di Twitter. In particolare il sistema utilizza strategie di word embedding e di rappresentazione character-level di inizio e fine parola, in un'architettura LSTM RNN. Dati etichettati (Italian UD corpus, DiDi e PoSTWITA) e dati non etichettati (Italian C4Corpus e PAISÀ) sono stati utilizzati in fase di training.
Il sistema è disponibile sotto licenza open source APLv2.

1 Introduction

Part-of-speech (PoS) tagging is an essential processing stage for virtually all NLP applications. Subsequent tasks, like parsing, named-entity recognition, event detection, and machine translation, often utilise PoS tags and benefit (directly or indirectly) from accurate tag sequences.
For many years, however, work on PoS tagging concentrated mainly on standardized texts, and frequent phenomena in computer-mediated communication (CMC) and Web corpora such as emoticons, acronyms, interaction words, iteration of letters, graphostylistics, shortenings, addressing terms, spelling variations, and boilerplate (Androutsopoulos, 2007; Bernardini et al., 2008; Beißwenger, 2013) still deteriorate the performance of PoS taggers (Giesbrecht and Evert, 2009; Baldwin et al., 2013).
On the other hand, the interest in automatic evaluation of social media texts, in particular microblogging texts such as tweets, has been growing considerably, and specialised tools for Twitter data have become available for different languages. Italian, however, completely lacks such resources, both regarding annotated corpora and specific PoS-tagging tools [1]. To this end, the POS tagging for Italian Social Media Texts (PoSTWITA) task was proposed for EVALITA 2016, concerning the domain adaptation of PoS taggers to Twitter texts.
Our system combined word2vec (w2v) word embeddings (WEs) with a single-layer Long Short-Term Memory (LSTM) recurrent neural network (RNN) architecture. The sequence of unlabelled w2v representations of words is accompanied by the sequence of n-grams of the word beginnings and endings, and is fed into the RNN, which in turn predicts PoS labels.
The paper is organised as follows: we present our system design in Section 2, the implementation in Section 3, and its evaluation in Section 4. Section 5 concludes with an outlook on possible implementation improvements.

[1] http://www.evalita.it/2016/tasks/postwita
2 Design

Overall, our design takes inspiration from as far back as Benello et al. (1989), who used four preceding words and one following word in a feed-forward neural network with backpropagation for PoS tagging; it builds upon the strong foundation laid down by Collobert et al. (2011) for a neural network (NN) architecture and learning algorithm that can be applied to various natural language processing tasks; and ultimately it is a variation of Nogueira dos Santos and Zadrozny (2014), who trained a NN for PoS tagging with character-level and WE representations of words.
Also note that an earlier version of the system was used in Stemle (2016) to participate in the EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication / social media (Beißwenger et al., 2016).

2.1 Word Embeddings

Recently, state-of-the-art results on various linguistic tasks were accomplished by architectures using neural-network based WEs. Baroni et al. (2014) conducted a set of experiments comparing the popular w2v (Mikolov et al., 2013a; Mikolov et al., 2013b) implementation for creating WEs to other distributional methods with state-of-the-art results across various (semantic) tasks. These results suggest that the word embeddings substantially outperform the other architectures on semantic similarity and analogy detection tasks. Subsequently, Levy et al. (2015) conducted a comprehensive set of experiments and comparisons that suggest that much of the improved results are due to the system design and parameter optimizations, rather than the selected method. They conclude that "there does not seem to be a consistent significant advantage to one approach over the other".
Word embeddings provide high-quality, low-dimensional vector representations of words from large corpora of unlabelled data, and the representations, typically computed using NNs, encode many linguistic regularities and patterns (Mikolov et al., 2013b).

2.2 Character-Level Sub-Word Information

The morphology of a word is opaque to WEs, and the relatedness of the meaning of a lemma's different word forms, i.e. its different string representations, is not systematically encoded. This means that in morphologically rich languages with long-tailed frequency distributions, even some WE representations for word forms of common lemmata may become very poor (Kim et al., 2015).
We agree with Nogueira dos Santos and Zadrozny (2014) and Kim et al. (2015) that sub-word information is very important for PoS tagging, and therefore we augment the WE representations with character-level representations of the word beginnings and endings; thereby, we also stay language agnostic (at least, as much as possible) by avoiding the need for often language-specific morphological pre-processing.
2.3 Recurrent Neural Network Layer

Language models are a central part of NLP. They are used to place distributions over word sequences that encode systematic structural properties of the sample of linguistic content they are built from, and can then be used on novel content, e.g. to rank it or to predict some feature on it. For a detailed overview of language modelling research see Mikolov (2012).
A straightforward approach to incorporating WEs into feature-based language models is to use the embeddings' vector representations as features [2]. Having said that, WEs are also used in NN architectures, where they constitute (part of) the input to the network.
Neural networks consist of a large number of simple, highly interconnected processing nodes in an architecture loosely inspired by the structure of the cerebral cortex of the brain (O'Reilly and Munakata, 2000). The nodes receive weighted inputs through these connections and fire according to their individual thresholds of the shared activation function. A firing node passes on an activation to all successive connected nodes. During learning, the input is propagated through the network and the output is compared to the desired output. Then the weights of the connections (and the thresholds) are adjusted step-wise so as to more closely resemble a configuration that would produce the desired output. After all input cases have been presented, the process typically starts over again, and the output values will usually be closer to the correct values.
RNNs are NNs in which the connections between the elements form directed cycles, i.e. the networks have loops, and this enables them to model sequential dependencies of the input. However, regular RNNs have fundamental difficulties learning long-term dependencies, and special kinds of RNNs need to be used (Hochreiter, 1991); a very popular kind is the so-called long short-term memory (LSTM) network proposed by Hochreiter and Schmidhuber (1997).
Overall, with this design we not only benefit from available labelled data but also from available general or domain-specific unlabelled data.

[2] For an overview see, e.g., Turian et al. (2010).

3 Implementation

We maintain the implementation in a source code repository at https://github.com/bot-zen/. The version tagged as 1.1 comprises the version that was used to generate the results submitted to the shared task (ST).
Our system feeds WEs and character-level sub-word information into a single-layer RNN with an LSTM architecture.

3.1 Word Embeddings

When computing WEs we take into consideration Levy et al. (2015): they observed that one specific configuration of w2v, namely the skip-gram model with negative sampling (SGNS), "is a robust baseline. While it might not be the best method for every task, it does not significantly underperform in any scenario. Moreover, SGNS is the fastest method to train, and cheapest (by far) in terms of disk space and memory consumption". Coincidentally, Mikolov et al. (2013b) also suggest using SGNS. We incorporate w2v's original C implementation for learning WEs [3] in an independent pre-processing step, i.e. we pre-compute the WEs. Then, we use gensim [4], a Python tool for unsupervised semantic modelling from plain text, to load the pre-computed data and to compute the vector representations of input words for our NN.

[3] https://code.google.com/archive/p/word2vec/
[4] https://radimrehurek.com/gensim/
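As a minimal sketch of this two-step setup (not the released code), the pre-computed vectors could be loaded and queried with gensim roughly as follows; the file name and the helper function are illustrative assumptions only.

import gensim

# Load vectors that were pre-computed with the original word2vec C tool
# (hypothetical file name; the C tool writes this binary format).
word_vectors = gensim.models.KeyedVectors.load_word2vec_format(
    'didi_ud_postwita_500.bin', binary=True)

def embedding_for(token):
    """Return the 500-dimensional vector for a token, or None if it is unknown.

    Unknown words are handled separately when the network input is built
    (see Section 3.3).
    """
    if token in word_vectors:
        return word_vectors[token]
    return None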
3.2 Character-Level Sub-Word Information

Our implementation uses a one-hot encoding with a few additional features for representing sub-word information. The one-hot encoding transforms a categorical feature into a vector where the categories are represented by equally many dimensions with binary values. We convert a letter to lower-case and use the sets of ASCII characters, digits, and punctuation marks as categories for the encoding. Then, we add dimensions to represent further binary features: 'uppercase' (was upper-case prior to conversion), 'digit' (is a digit), 'punctuation' (is a punctuation mark), 'whitespace' (is white space other than the new-line character; note that this category is usually empty, because we expect our tokens not to include white-space characters), and 'unknown' (any other character, e.g. diacritics). This results in vectors with more than a single one-hot dimension.
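The sketch below illustrates this per-character encoding. The exact category set, and therefore the exact dimensionality (80 per character in the feature counts of Section 3.3), as well as the helper name, are illustrative assumptions rather than the released code.

import string

# One-hot categories: lower-cased ASCII letters, digits, and punctuation;
# the trailing dimensions hold the additional binary flags described above.
CATEGORIES = list(string.ascii_lowercase + string.digits + string.punctuation)
FLAGS = ['uppercase', 'digit', 'punctuation', 'whitespace', 'unknown']

def encode_char(ch):
    """Encode a single character as a binary feature vector (sketch)."""
    vec = [0] * (len(CATEGORIES) + len(FLAGS))
    low = ch.lower()
    if low in CATEGORIES:
        vec[CATEGORIES.index(low)] = 1
    if ch.isupper():
        vec[len(CATEGORIES) + FLAGS.index('uppercase')] = 1
    if ch.isdigit():
        vec[len(CATEGORIES) + FLAGS.index('digit')] = 1
    if ch in string.punctuation:
        vec[len(CATEGORIES) + FLAGS.index('punctuation')] = 1
    if ch.isspace() and ch != '\n':
        vec[len(CATEGORIES) + FLAGS.index('whitespace')] = 1
    if not any(vec):
        # Anything else, e.g. diacritics, falls into the 'unknown' dimension.
        vec[len(CATEGORIES) + FLAGS.index('unknown')] = 1
    return vec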
3.3 Recurrent Neural Network Layer

Our implementation uses Keras, a high-level NN library written in Python and capable of running on top of either TensorFlow or Theano (Chollet, 2015). In our case it runs on top of Theano, a Python library that allows one to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently (The Theano Development Team et al., 2016).
The inputs to our network are sequences of the same length as the sentences we process. During training, we group sentences of the same length into batches and process the batches in order of increasing sentence length. Each single word in a sequence is represented by its sub-word information and two WEs that come from two sources (see Section 4). For unknown words, i.e. words without a pre-computed WE, we first try to find the most similar WE considering 10 surrounding words. If this fails, the unknown word is mapped to a randomly generated vector representation. In total, each word is represented by 2,280 features: two times 500 (WEs), and sixteen times 80 for the two 8-grams (word beginning and ending). If words are shorter than 8 characters, their 8-grams are zero-padded.
This sequential input is fed into an LSTM layer that, in turn, projects to a fully connected output layer with a softmax activation function. During training we use dropout for the projection into the output layer, i.e. we set a fraction (0.5) of the input units to 0 at each update, which helps prevent overfitting (Srivastava et al., 2014). We use categorical cross-entropy as the loss function and backpropagation in conjunction with the RMSprop optimization for learning. At the time of writing, this was the Keras default, or the explicitly documented option to be used, for our type of architecture.
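A minimal Keras sketch of this architecture is given below, assuming the 2,280-dimensional per-token input described above. The hidden-layer size and the size of the tag set are placeholders (the paper does not report them), and the code is illustrative rather than the released implementation.

from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense, TimeDistributed

N_FEATURES = 2280  # 2 x 500 WE dimensions + 16 x 80 character features
N_TAGS = 22        # placeholder: size of the UD + Twitter-specific tag set
HIDDEN = 128       # placeholder: the hidden size is not reported in the paper

model = Sequential([
    # One time step per token; batches contain sentences of equal length.
    LSTM(HIDDEN, return_sequences=True, input_shape=(None, N_FEATURES)),
    Dropout(0.5),  # dropout (0.5) on the projection into the output layer
    TimeDistributed(Dense(N_TAGS, activation='softmax')),
])
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')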
4 Results

We used our slightly modified implementation to participate in the POS tagging for Italian Social Media Texts (PoSTWITA) shared task (ST) of the 5th periodic evaluation campaign of Natural Language Processing (NLP) and speech tools for the Italian language, EVALITA 2016. First, we describe the corpora used for training, and then the specific system configuration(s) for the ST.

4.1 Training Data for w2v and PoS Tagging

4.1.1 DiDi-IT (PoS, w2v)

didi-it (Frey et al., 2016) (version September 2016) is the Italian sub-part of the DiDi corpus, a corpus of South Tyrolean German and Italian from Facebook (FB) users' wall posts, comments on wall posts, and private messages.
The Italian part consists of around 100,000 tokens collected from 20 profiles of Facebook users residing in South Tyrol. This version has about 20,000 PoS tags semi-automatically corrected by a single annotator.
The anonymised corpus is freely available for research purposes.

4.1.2 Italian UD (PoS, w2v)

Universal Dependencies (UD) is a project that is developing cross-linguistically consistent treebank annotation for many languages [5].
The italian-UD corpus [6] (version from January 2015) was originally obtained by conversion from ISDT (Italian Stanford Dependency Treebank) and released for the dependency parsing ST of EVALITA 2014 (Bosco et al., 2014). The corpus has semi-automatically converted PoS tags from the original two Italian treebanks, which differ both in corpus composition and in the adopted annotation schemes.
The corpus contains around 317,000 tokens in around 13,000 sentences from different sources and genres. It is available under the CC BY-NC-SA 3.0 license [7].

[5] http://universaldependencies.org/
[6] http://universaldependencies.org/it/overview/introduction.html
[7] Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported, i.e. the data can be copied, redistributed, and adapted for purposes other than commercial ones. See https://creativecommons.org/licenses/by-nc-sa/3.0/ for more details.

4.1.3 PoSTWITA (PoS and w2v)

postwita is the Twitter data made available by the organizers of the ST. It contains tweets from the EVALITA 2014 SENTIPLOC corpus: the development and test sets plus additional tweets from the same period of time were manually annotated, for a global amount of 6,438 tweets (114,967 tokens), and were distributed as the development set. The data is PoS tagged according to UD, but with the additional insertion of seven Twitter-specific tags. All the annotations were carried out by three different annotators. The data was only distributed to the task participants.

4.1.4 C4Corpus (w2v)

c4corpus [8] is a full-document Italian Web corpus that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date. See Habernal et al. (2016) for details about the corpus construction pipeline and other information about the corpus.
The corpus contains about 670m tokens in 22m sentences. The data is available under the CreativeCommons license family.

[8] https://github.com/dkpro/dkpro-c4corpus

4.1.5 PAISÀ (w2v)

paisa (Lyding et al., 2014) is a corpus of authentic contemporary Italian texts from the web (harvested in September/October 2010). It was created in the context of the project PAISÀ (Piattaforma per l'Apprendimento dell'Italiano Su corpora Annotati) with the aim to provide a large resource of freely available Italian texts for language learning by studying authentic text materials.
The corpus contains about 270m tokens in about 8m sentences. The data is available under the CC BY-NC-SA 3.0 license [9].

[9] https://creativecommons.org/licenses/by-nc-sa/3.0/

4.2 PoSTWITA shared task

For the ST we used one overall configuration for the system but three different corpus configurations for training; however, only one corpus configuration was entered into the ST. We used PoS tags from didi-it + postwita (run 1), from italian-UD (run 2), and from both (run 3).
For w2v we trained a 500-dimensional skip-gram model on didi-it + italian-UD + postwita that ignored all words with less than 2 occurrences within a window size of 10; it was trained with negative sampling (value 15). We also trained a 500-dimensional skip-gram model on c4corpus + paisa that ignored all words with less than 33 occurrences within a window size of 10; it was also trained with negative sampling (value 15). The other w2v parameters were left at their default settings [10].

[10] -sample 1e-3 -iter 5 -alpha 0.025
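The models themselves were trained with the original word2vec C tool; the following gensim call is only a sketch of an approximately equivalent configuration for the didi-it + italian-UD + postwita model (the c4corpus + paisa model would use min_count=33). The sentence iterator and the output file name are assumptions.

from gensim.models import Word2Vec

# `sentences` is assumed to be an iterable of tokenised sentences
# (lists of token strings) built from the respective corpora.
model = Word2Vec(
    sentences,
    vector_size=500,  # 500-dimensional vectors
    sg=1,             # skip-gram
    negative=15,      # negative sampling (value 15)
    window=10,        # window size 10
    min_count=2,      # ignore words with fewer than 2 occurrences
    sample=1e-3,      # remaining w2v defaults: -sample 1e-3
    epochs=5,         #                         -iter 5
    alpha=0.025,      #                         -alpha 0.025
)
model.wv.save_word2vec_format('didi_ud_postwita_500.bin', binary=True)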
The evaluation of the systems was done by the organisers on unlabelled but pre-tokenised data (4,759 tokens in 301 tweets) and was based on a token-by-token comparison. The considered metric was accuracy, i.e. the number of correctly assigned PoS tags divided by the total number of tokens.

Run  Training data (PoS)                     Accuracy (%)
(1)  didi-it + postwita                      76.00
(2)  italian-UD                              80.54
(3)  didi-it + postwita + italian-UD         81.61
     Winning Team                            93.19

Table 1: Official result(s) of our PoS tagger for the three runs on the PoSTWITA ST data.

We believe the unexpectedly small performance gain from utilizing the much larger italian-UD data over the rather small didi-it + postwita data may be rooted in the insertion of Twitter-specific tags into the data (see 4.1.3), something we did not account for, i.e. 18,213 of 289,416 and, more importantly, 7,778 of 12,677 sentences had imperfect information during training.

5 Conclusion & Outlook

We presented our submission to the PoSTWITA task of EVALITA 2016, where we participated with moderate results. In the future, we will try to rerun the experiment with training data that takes into consideration the Twitter-specific tags of the task.

Acknowledgments

The computational results presented have been achieved in part using the Vienna Scientific Cluster (VSC).

References

Jannis K. Androutsopoulos. 2007. Neue Medien – neue Schriftlichkeit? Mitteilungen des Deutschen Germanistenverbandes, 1:72–97.

Timothy Baldwin, Paul Cook, Marco Lui, Andrew MacKinlay, and Li Wang. 2013. How noisy social media text, how diffrnt social media sources? In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 356–364, Nagoya, Japan, October. Asian Federation of Natural Language Processing. http://aclweb.org/anthology/I13-1041.

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247. Association for Computational Linguistics. http://www.aclweb.org/anthology/P14-1023.

Michael Beißwenger, Sabine Bartsch, Stefan Evert, and Kay-Michael Würzner. 2016. EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication, Social Media and Web Corpora. In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, pages 78–90, Berlin, Germany. Association for Computational Linguistics.

Michael Beißwenger. 2013. Das Dortmunder Chat-Korpus: ein annotiertes Korpus zur Sprachverwendung und sprachlichen Variation in der deutschsprachigen Chat-Kommunikation. LINSE - Linguistik Server Essen, pages 1–13.

Julian Benello, Andrew W. Mackie, and James A. Anderson. 1989. Syntactic category disambiguation with neural networks. Computer Speech & Language, 3(3):203–217, July. http://www.sciencedirect.com/science/article/pii/0885230889900181.

Silvia Bernardini, Marco Baroni, and Stefan Evert. 2008. A WaCky Introduction. In Wacky! Working papers on the Web as Corpus, pages 9–40. GEDIT, Bologna, Italy. http://wackybook.sslmit.unibo.it/pdfs/bernardini.pdf.

Cristina Bosco, Felice Dell'Orletta, Simonetta Montemagni, Manuela Sanguinetti, and Maria Simi. 2014. The EVALITA 2014 Dependency Parsing Task. In Proceedings of CLiC-it 2014 and EVALITA 2014, pages 1–8. Pisa University Press.

François Chollet. 2015. Keras: Deep Learning library for Theano and TensorFlow. https://github.com/fchollet/keras.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural Language Processing (almost) from Scratch. Journal of Machine Learning Research, 12:2493–2537. https://arxiv.org/abs/1103.0398.

Jennifer-Carmen Frey, Aivars Glaznieks, and Egon W. Stemle. 2016. The DiDi Corpus of South Tyrolean CMC Data: A multilingual corpus of Facebook texts. Upcoming.

Eugenie Giesbrecht and Stefan Evert. 2009. Is Part-of-Speech Tagging a Solved Task? An Evaluation of POS Taggers for the German Web as Corpus. In Web as Corpus Workshop (WAC5). http://sigwac.org.uk/raw-attachment/wiki/WAC5/WAC5_proceedings.pdf#page=27.

Ivan Habernal, Omnia Zayed, and Iryna Gurevych. 2016. C4Corpus: Multilingual Web-size corpus with free license. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), page (to appear), Portorož, Slovenia, May. European Language Resources Association (ELRA).

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780, November. http://dx.doi.org/10.1162/neco.1997.9.8.1735.

Sepp Hochreiter. 1991. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TU München.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2015. Character-Aware Neural Language Models. CoRR, abs/1508.06615. http://arxiv.org/abs/1508.06615.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225. https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/570.

Verena Lyding, Egon Stemle, Claudia Borghetti, Marco Brunello, Sara Castagnoli, Felice Dell'Orletta, Henrik Dittmann, Alessandro Lenci, and Vito Pirrelli. 2014. The PAISÀ Corpus of Italian Web Texts. In Proceedings of the 9th Web as Corpus Workshop (WaC-9), pages 36–43, Gothenburg, Sweden. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. CoRR, abs/1301.3781. http://arxiv.org/abs/1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed Representations of Words and Phrases and their Compositionality. CoRR, abs/1310.4546, October. http://arxiv.org/abs/1310.4546.

Tomáš Mikolov. 2012. Statistical Language Models Based on Neural Networks. Ph.D. thesis, Brno University of Technology. http://www.fit.vutbr.cz/~imikolov/rnnlm/thesis.pdf.

Cícero Nogueira dos Santos and Bianca Zadrozny. 2014. Learning Character-level Representations for Part-of-Speech Tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1818–1826. http://jmlr.org/proceedings/papers/v32/santos14.pdf.

Randall C. O'Reilly and Yuko Munakata. 2000. Computational Explorations in Cognitive Neuroscience: Understanding the Mind by Simulating the Brain. MIT Press. http://books.google.com/books?id=BLf34BFTaIUC.
Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research (JMLR), 15:1929–1958.

Egon W. Stemle. 2016. bot.zen @ EmpiriST 2015 - A minimally-deep learning PoS-tagger (trained for German CMC and Web data). In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, pages 115–119. Association for Computational Linguistics.

The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, et al. 2016. Theano: A Python framework for fast computation of mathematical expressions. CoRR, abs/1605.02688. http://arxiv.org/abs/1605.02688.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 384–394, Stroudsburg, PA, USA. Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1858681.1858721.