         ITAmoji 2018: Emoji Prediction via Tree Echo State Networks

               Daniele Di Sarli, Claudio Gallicchio, Alessio Micheli
                         Department of Computer Science
                           University of Pisa, Pisa, Italy
    d.disarli@studenti.unipi.it, {gallicch,micheli}@di.unipi.it



                        Abstract

    English. For the “ITAmoji” EVALITA 2018 competition we mainly exploit a Reservoir Computing approach to learning, with an ensemble of models for trees and sequences. The sentences for the models of the former kind are processed by a language parser, and the words are encoded by using pretrained FastText word embeddings for the Italian language. With our method, we ranked 3rd out of 5 teams.

    Italiano. Per la competizione EVALITA 2018 sfruttiamo principalmente un approccio Reservoir Computing, con un ensemble di modelli per sequenze e per alberi. Le frasi per questi ultimi sono elaborate da un parser di linguaggi e le parole codificate attraverso degli embedding FastText preaddestrati per la lingua italiana. Con il nostro metodo ci siamo classificati terzi su un totale di 5 team.

1   Introduction

Echo State Networks (Jaeger and Haas, 2004) are an efficient class of recurrent models under the framework of Reservoir Computing (Lukoševičius and Jaeger, 2009), where the recurrent part of the model (“reservoir”) is carefully initialized and then left untrained (Gallicchio and Micheli, 2011). The only weights that are trained are part of a usually simple readout layer¹. Echo State Networks were originally designed to work on sequences; however, it has been shown how to extend them to deal with recursively structured data, and trees in particular, with Tree Echo State Networks (Gallicchio and Micheli, 2013), also referred to as TreeESNs.

¹ Trained in closed form, e.g. by Moore-Penrose pseudo-inversion or Ridge regression.

   We follow this approach for solving the ITAmoji task in the EVALITA 2018 competition (Ronzano et al., 2018). In particular, we parse the input texts into trees resembling the grammatical structure of the sentences, and then we use multiple TreeESN models to process the parse trees and make predictions. We then merge these models into an ensemble to produce our final predictions.

2   Task and Dataset

Given a set of Italian tweets, the goal of the ITAmoji task is to predict the most likely emoji associated with each tweet. The dataset contains 250,000 tweets in Italian, each of them originally containing only one (possibly repeated) of the 25 emojis considered in the task (see Figure 1). The emojis are removed from the sentences and used as targets.

Figure 1: Emojis under consideration and their frequency within the dataset. Frequencies range from 20.27% for the most frequent emoji down to 1.06% for the rarest.

   The test dataset contains 25,000 similarly processed tweets.
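The target extraction described above (one emoji per tweet, stripped from the text and used as the label) can be sketched as follows. The `EMOJIS` list (here truncated to three entries) and the function name are our own illustrative choices, not part of the official task tools:

```python
import re

# Hypothetical subset of the 25 emojis considered in the task
# (red heart, face with tears of joy, smiling face with heart-eyes).
EMOJIS = ["\u2764\ufe0f", "\U0001F602", "\U0001F60D"]

def make_example(tweet):
    """Strip the (possibly repeated) emoji from a tweet and use it as the target."""
    for label, emoji in enumerate(EMOJIS):
        if emoji in tweet:
            text = re.sub(r"\s+", " ", tweet.replace(emoji, "")).strip()
            return text, label
    return None  # tweet contains none of the tracked emojis

# make_example("Che bella giornata \U0001F602\U0001F602")
# -> ("Che bella giornata", 1)
```

Repeated occurrences of the emoji are all removed, matching the task description that a tweet contains only one emoji, possibly repeated.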
3   Preprocessing

The provided dataset has been shuffled and split into a training set (80%) and a validation set (20%).

   We preprocessed the data by first removing any URL from the sentences, as most of them did not contain any informative content (e.g. “https://t.co/M3StiVOzKC”). We then parsed the sentences by using two different parsers for the Italian language: Tint² (Palmero Aprosio and Moretti, 2016) and spaCy (Honnibal and Johnson, 2015). This produced two sets of trees, both including information about the dependency relations between the nodes of each tree. We finally replace each word with its corresponding pretrained FastText embedding (Joulin et al., 2016).

² Emitting data in the CoNLL-U format (Nivre et al., 2016), a revised version of the CoNLL-X format (Buchholz and Marsi, 2006).

4   Description of the system

Our ensemble is composed of 13 different models: 12 of them are TreeESNs, while the remaining one is a Long Short-Term Memory (LSTM) network over characters. Different random initializations (“trials”) of the model parameters are all included in the ensemble in order to enrich the diversity of the hypotheses. We summarize the entire configuration in Table 1.

     #   Class      Reservoir units   fφ     Readout            Parser   Trials
     1   TreeESN    1000              ReLU   MLP                Tint     10
     2   TreeESN    1000              Tanh   MLP                Tint     10
     3   TreeESN    5000              Tanh   MLP                Tint      1
     4   TreeESN    5000              Tanh   MLP                spaCy     2
     5   TreeESN    5000              ReLU   MLP                Tint      1
     6   TreeESN    5000              ReLU   MLP                spaCy     1
     7   TreeESN    5000              Tanh   Ridge regression   Tint      1
     8   TreeESN    5000              Tanh   Ridge regression   spaCy     3
     9   TreeESN    5000              ReLU   Ridge regression   Tint      1
    10   TreeESN    5000              ReLU   Ridge regression   spaCy     3
    11   TreeESN    5000              Tanh   Ridge regression   Tint      1
    12   TreeESN    5000              Tanh   Ridge regression   spaCy     2
    13   CharLSTM   –                 –      –                  –         1

Table 1: Composition of the ensemble, highlighting the differences between the models.

4.1   TreeESN models

The TreeESN that we are using is a specialization of the description given by Gallicchio and Micheli (2013), and the reader can refer to that work for additional details. Here, the state corresponding to node n of an input tree t is computed as:

    x(n) = f( W_in u(n) + (1/k) Σ_{i=1}^{k} Ŵ_i x(ch_i(n)) ),                (1)

where u(n) is the label of node n in the input tree, k is the number of children of node n, ch_i(n) is the i-th child of node n, W_in is the input-to-reservoir weight matrix, Ŵ_i is the recurrent reservoir weight matrix associated with the grammatical relation between node n and its i-th child, and f is the element-wise applied activation function of the reservoir units (in our case, tanh). All matrices in Equation 1 are left untrained.

   Note that Equation 1 determines a recursive application (bottom-up visit) over each node of the tree t until the state for all nodes is computed, which we can express in structured form as x(t). The resulting tree x(t) is then mapped into a fixed-size feature representation via the χ state mapping function. We make use of mean and sum state mapping functions, respectively yielding the mean and the sum of all the states. The result, χ(x(t)), is then projected into a different space by a matrix W_φ:

    ŷ = f_φ( W_φ χ(x(t)) ),                (2)

where f_φ is an activation function.
   For the readout we use both a linear regression approach with L2 regularization, known as Ridge regression (Hoerl and Kennard, 1970), and a multilayer perceptron (MLP):

    y = readout(ŷ),                (3)

where y ∈ R^25 is the output vector, which represents a score for each of the classes: the index with the highest value corresponds to the most likely class.

4.2   CharLSTM model

The CharLSTM model uses a bidirectional LSTM (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005) with 2 layers, which takes as input the characters of the sentences, expressed as pretrained character embeddings of size 300. The LSTM output is then fed into a linear layer with 25 output units.
   Similar models have been used in recent works related to emoji prediction: see for example the model used by Barbieri et al. (2017), or the one by Baziotis et al. (2018), which is however a more complex word-based model.

4.3   Ensemble

We take into consideration two different ensembles, both containing the models in Table 1, but with different strategies for weighting the N_P predictions. In the following, let Y ∈ R^(N_P × 25) be the matrix containing one prediction per row.
   The weights for the first ensemble (corresponding to the run file run1.txt) have been produced by a random search: at each iteration we compute a random vector w ∈ R^(N_P) with entries sampled from a random variable W², where W ∼ U[0, 1]. The square increases the probability of sampling near-zero weights. After selecting the best configuration on the validation set, the predictions from each of the models are merged together in a weighted mean:

    ȳ = wY.                (4)

   For the second type of ensemble (corresponding to the run file run2.txt) we adopt a multilayer perceptron. We feed as input the N_P predictions concatenated into a single vector y_(1...N_P) ∈ R^(25·N_P), so that the model is:

    ȳ = tanh( y_(1...N_P) W_1 + b_1 ) W_2 + b_2,                (5)

where the hidden layer has size 259 and the output layer is composed of 25 units.
   In both types of ensemble, as before, the output vector contains a score for each of the classes, providing a way to rank them from the most to the least likely. The most likely class c̃ is thus computed as c̃ = arg max_i ȳ_i.

5   Training

The training algorithm differs based on the kind of model under consideration. We address each of them in the following paragraphs.

Models 1-6   The first six models are TreeESNs using a multilayer perceptron as readout. Since the main evaluation metric for the competition is the Macro F-score, each of these models has been trained by rebalancing the frequencies of the different target classes. In particular, the sampling probability for each input tree has been skewed so that the data extracted during training follows a uniform distribution with respect to the target class. For the readout part we use the Adam algorithm (Kingma and Ba, 2015) for the stochastic optimization of the multi-class cross entropy loss function.

Models 7-10   Models from 7 to 10 are again TreeESNs, but with a Ridge regression readout. In this case, 25 classifiers are trained with a one-vs-all method, one for each class, using binary targets.

Models 11-12   Models 11 and 12 are again TreeESNs with a Ridge regression readout, but they are trained to distinguish only between the most frequent class, the second most frequent class, and all the other classes aggregated together. This is done to try to improve the ensemble precision and recall for the top two classes.

Model 13   The last model is a sequential LSTM over character embeddings. As in the first 6 models, the Adam algorithm is used to optimize the cross entropy loss function.

6   Results

The ensemble seems to bring a substantial improvement to the performance on the validation set, as highlighted in Table 2. This is made possible by the number and diversity of the different models, as can be seen in Figure 2, where we show the Pearson correlation coefficients between the predictions of the models in the ensemble.
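The two evaluation metrics used throughout (Macro-F1 and Coverage Error) can be computed with scikit-learn; in this minimal sketch the random scores merely stand in for real model outputs:

```python
import numpy as np
from sklearn.metrics import coverage_error, f1_score

rng = np.random.default_rng(0)
n, k = 1000, 25                      # number of tweets, number of emoji classes
y_true = rng.integers(0, k, size=n)  # gold labels
scores = rng.random((n, k))          # one score per class, e.g. an ensemble output
y_pred = scores.argmax(axis=1)

macro_f1 = f1_score(y_true, y_pred, average="macro")

# coverage_error expects binary indicator targets; for single-label data it is
# the average rank of the gold label in the score-sorted list (best value: 1.0).
cov = coverage_error(np.eye(k)[y_true], scores)
```

With purely random scores, Macro-F1 stays near 1/k and the Coverage Error near (k+1)/2, which gives a sense of the chance-level baselines for the figures reported in Tables 2 and 3.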
   On the test set we scored substantially lower, with the Macro-F1 and Coverage Error reported in Table 3. These numbers are close to those obtained by the top two models applied to the Spanish language in the “Multilingual Emoji Prediction” task of the SemEval-2018 competition (Barbieri et al., 2018), with F1 scores of 22.36 and 18.73 (Çöltekin and Rama, 2018; Coster et al., 2018). In Figure 3 we report the confusion matrix (with values normalized over the columns to address label imbalance) and the accuracy over the top-N classes.

Figure 2: Plot of the correlation between the predictions of the models in the ensemble. For reasons of space, not all labels are shown on the axes.

    Run    Avg F1   Max F1   Ens. F1   CovE
    run1   14.4     18.5     24.9      4.014
    run2   14.4     18.5     26.7      3.428

Table 2: Performance obtained on the validation set for the two submitted runs. The columns are, in order, the average and maximum Macro-F1 over the models in the ensemble, and the Macro-F1 and Coverage Error of the ensemble.

    Run    Macro-F1   Coverage Error
    run1   19.24      5.4317
    run2   18.80      5.1144

Table 3: Performance on the test set. These values have been obtained by retraining the models over the whole dataset (training set and validation set) after the final model selection phase.

Figure 3: Confusion matrix (top) and accuracy at top-N (bottom) on the test set. Labels are ordered by frequency.

   An interesting characteristic of this approach, though, is computation time: we were able to train a TreeESN with 5000 reservoir units over 200,000 trees in just about 25 minutes, and this without exploiting any parallelism between the trees.
   In ITAmoji 2018, our team ranked 3rd out of 5. Detailed results and rankings are available at http://bit.ly/ITAmoji18.

7   Discussion and conclusions

Different authors have highlighted the difference in performance between SVM models and (deep) neural models for emoji prediction, and more in general for text classification tasks, suggesting that simple models like SVMs are better able to capture the features which are most important for generalization: see for example the reports of the SemEval-2018 participants Çöltekin and Rama (2018) and Coster et al. (2018).
   In this work, instead, we approached the problem from the novel perspective of reservoir computing applied to the grammatical tree structure of the sentences. Despite a significant performance drop on the test set³, we showed that, paired with a rich ensemble, the method is comparable to the results obtained in the past by other participants in similar competitions using very different models.

³ Probably due to overtraining: we observed that the Macro-F1 exceeded 0.40 in training.
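As a closing illustration, the run1 weighting scheme of Section 4.3 can be sketched as follows. Function names and the plain macro-F1 implementation are our own (the official evaluation script may differ); the search draws squared-uniform weights, which biases it toward near-zero values as described in the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

def macro_f1(y_true, y_pred, k):
    """Plain macro-averaged F1 over k classes (placeholder for the official metric)."""
    f1s = []
    for c in range(k):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        f1s.append(0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn))
    return float(np.mean(f1s))

def random_search_weights(Y_val, y_true, k=25, iters=1000):
    """Y_val: array of shape (samples, N_P, k), one (N_P x k) prediction matrix
    per validation sample. Returns the weight vector with the best Macro-F1."""
    n_models = Y_val.shape[1]
    best_w, best_score = None, -1.0
    for _ in range(iters):
        w = rng.uniform(0.0, 1.0, size=n_models) ** 2   # W^2, W ~ U[0, 1]
        # Weighted combination of the model predictions, Eq. (4), per sample.
        y_pred = np.einsum("p,spk->sk", w, Y_val).argmax(axis=1)
        score = macro_f1(y_true, y_pred, k)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```

Since the argmax is invariant to the scale of the scores, normalizing the weighted sum into a true mean would not change the predicted class.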
References

Francesco Barbieri, Miguel Ballesteros, and Horacio Saggion. 2017. Are Emojis Predictable? arXiv preprint arXiv:1702.07285.

Francesco Barbieri, Jose Camacho-Collados, Francesco Ronzano, Luis Espinosa Anke, Miguel Ballesteros, Valerio Basile, Viviana Patti, and Horacio Saggion. 2018. SemEval 2018 Task 2: Multilingual Emoji Prediction. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 24–33.

Christos Baziotis, Nikos Athanasiou, Georgios Paraskevopoulos, Nikolaos Ellinas, Athanasia Kolovou, and Alexandros Potamianos. 2018. NTUA-SLP at SemEval-2018 Task 2: Predicting Emojis using RNNs with Context-aware Attention. arXiv preprint arXiv:1804.06657.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on Multilingual Dependency Parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning, pages 149–164. Association for Computational Linguistics.

Çağrı Çöltekin and Taraka Rama. 2018. Tübingen-Oslo at SemEval-2018 Task 2: SVMs perform better than RNNs in Emoji Prediction. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 34–38.

Joël Coster, Reinder Gerard Dalen, and Nathalie Adriënne Jacqueline Stierman. 2018. Hatching Chick at SemEval-2018 Task 2: Multilingual Emoji Prediction. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 445–448.

Claudio Gallicchio and Alessio Micheli. 2011. Architectural and Markovian factors of echo state networks. Neural Networks, 24(5):440–456.

Claudio Gallicchio and Alessio Micheli. 2013. Tree Echo State Networks. Neurocomputing, 101:319–337.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM networks. In Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on, volume 4, pages 2047–2052. IEEE.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Arthur E Hoerl and Robert W Kennard. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.

Matthew Honnibal and Mark Johnson. 2015. An Improved Non-monotonic Transition System for Dependency Parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373–1378, Lisbon, Portugal, September. Association for Computational Linguistics.

Herbert Jaeger and Harald Haas. 2004. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of Tricks for Efficient Text Classification. arXiv preprint arXiv:1607.01759.

Diederik P Kingma and Jimmy Lei Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).

Mantas Lukoševičius and Herbert Jaeger. 2009. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127–149.

Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D Manning, Ryan T McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A Multilingual Treebank Collection. In LREC.

A. Palmero Aprosio and G. Moretti. 2016. Italy goes to Stanford: a collection of CoreNLP modules for Italian. ArXiv e-prints, September.

Francesco Ronzano, Francesco Barbieri, Endang Wahyu Pamungkas, Viviana Patti, and Francesca Chiusaroli. 2018. Overview of the EVALITA 2018 Italian Emoji Prediction (ITAMoji) Task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA’18), Turin, Italy. CEUR.org.