<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">ITAmoji 2018: Emoji Prediction via Tree Echo State Networks</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Daniele</forename><forename type="middle">Di</forename><surname>Sarli</surname></persName>
							<email>d.disarli@studenti.unipi.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Pisa</orgName>
								<address>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Claudio</forename><surname>Gallicchio</surname></persName>
							<email>gallicch@di.unipi.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Pisa</orgName>
								<address>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alessio</forename><surname>Micheli</surname></persName>
							<email>micheli@di.unipi.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Pisa</orgName>
								<address>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">ITAmoji 2018: Emoji Prediction via Tree Echo State Networks</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">48D8EFEE6FA90EE79F5218C120CEB992</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T21:47+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>English.</head><p>For the "ITAmoji" EVALITA 2018 competition we mainly exploit a Reservoir Computing approach to learning, with an ensemble of models for trees and sequences. The sentences for the models of the former kind are processed by a language parser and the words are encoded by using pretrained FastText word embeddings for the Italian language. With our method, we ranked 3 rd out of 5 teams.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Italiano.</head><p>Per la competizione EVALITA 2018 sfruttiamo principalmente un approccio Reservoir Computing, con un ensemble di modelli per sequenze e per alberi. Le frasi per questi ultimi sono elaborate da un parser di linguaggi e le parole codificate attraverso degli embedding FastText preaddestrati per la lingua italiana. Con il nostro metodo ci siamo classificati terzi su un totale di 5 team.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Echo State Networks <ref type="bibr" target="#b11">(Jaeger and Haas, 2004</ref>) are an efficient class of recurrent models under the framework of Reservoir Computing <ref type="bibr" target="#b14">(Lukoševičius and Jaeger, 2009)</ref>, where the recurrent part of the model ("reservoir") is carefully initialized and then left untrained <ref type="bibr" target="#b6">(Gallicchio and Micheli, 2011)</ref>. The only weights that are trained are part of a usually simple readout layer 1 . Echo State Networks were originally designed to work on sequences, however it has been shown how to extend them to deal with recursively structured data, and 1 Trained in closed form, e.g. by Moore-Penrose pseudoinversion, or Ridge Regression. 20.27% 19.86% 9.45% 5.35% 5.13% 4.11%</p><p>3.54% 3.33% 2.80% 2.57%</p><p>2.18% 2.16% 2.03% 1.94% 1.78%</p><p>1.67% 1.55% 1.52% 1.49% 1.39%</p><p>1.37% 1.28% 1.12% 1.07% 1.06%</p><p>Figure <ref type="figure">1</ref>: Emojis under consideration and their frequency within the dataset.</p><p>trees in particular, with Tree Echo State Networks <ref type="bibr" target="#b7">(Gallicchio and Micheli, 2013)</ref>, also referred to as TreeESNs.</p><p>We follow this approach for solving the ITAmoji task in the EVALITA 2018 competition <ref type="bibr">(Ronzano et al., 2018)</ref>. In particular, we parse the input texts into trees resembling the grammatical structure of the sentences, and then we use multiple TreeESN models to process the parse trees and make predictions. We then merge these models by using an ensemble to make our final predictions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Task and Dataset</head><p>Given a set of Italian tweets, the goal of the ITAmoji task is to predict the most likely emoji associated with each tweet. The dataset contains 250,000 tweets in Italian, each of them originally containing only one (possibly repeated) of the 25 emojis considered in the task (see Figure <ref type="figure">1</ref>). The emojis are removed from the sentences and used as targets.</p><p>The test dataset contains 25,000 tweets similarly processed.</p><p>The provided dataset has been shuffled and split into a training set (80%) and a validation set (20%).</p><p>We preprocessed the data by first removing any URL from the sentences, as most of them did not contain any informative content (e.g. "https://t.co/M3StiVOzKC"). We then parsed the sentences by using two different parsers for the Italian language: Tint<ref type="foot" target="#foot_0">2</ref> (Palmero Aprosio and Moretti, 2016) and spaCy <ref type="bibr" target="#b10">(Honnibal and Johnson, 2015)</ref>. This produced two sets of trees, both including information about the dependency relations between the nodes of each tree. We finally replace each word with its corresponding pretrained FastText embedding <ref type="bibr" target="#b12">(Joulin et al., 2016)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Description of the system</head><p>Our ensemble is composed by 13 different models, 12 of which are TreeESNs and the other one is a Long Short-Term Memory (LSTM) over characters. Different random initializations ("trials") of the model parameters are all included in the ensemble in order to enrich the diversity of the hypotheses. We summarize the entire configuration in Table <ref type="table" target="#tab_0">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">TreeESN models</head><p>The TreeESN that we are using is a specialization of the description given by <ref type="bibr" target="#b7">Gallicchio and Micheli (2013)</ref>, and the reader can refer to that work for additional details. Here, the state corresponding to node n of an input tree t is computed as:</p><formula xml:id="formula_0">x(n) = f W in u(n) + 1 k k i=1 Ŵn i x(ch i (n)) ,</formula><p>(1) where u(n) is the label of node n in the input tree, k is the number of children of node n, ch i (n) is the i-th child of node n, W in is the input-toreservoir weight matrix, Ŵn i is the recurrent reservoir weight matrix associated to the grammatical relation between node n and its i-th child, and f is the element-wise applied activation function of the reservoir units (in our case, it is a tanh). All matrices in Equation 1 are left untrained.</p><p>Note that Equation 1 determines a recursive application (bottom-up visit) over each node of the tree t until the state for all nodes is computed, which we can express in structured form as x(t). The resulting tree x(t) is then mapped into a fixedsize feature representation via the χ state mapping function. We make use of mean and sum state mapping functions, respectively yielding the mean and the sum of all the states. The result, χ(x(t)), is then projected into a different space by a matrix W φ :</p><formula xml:id="formula_1">ŷ = f φ (W φ χ(x(t))) ,<label>(2)</label></formula><p>where f φ is an activation function.</p><p>For the readout we use both a linear regression approach with L2 regularization known as Ridge regression <ref type="bibr" target="#b9">(Hoerl and Kennard, 1970</ref>) and a multilayer perceptron (MLP):</p><formula xml:id="formula_2">y = readout(ŷ),<label>(3)</label></formula><p>where y ∈ R 25 is the output vector, which represents a score for each of the classes: the index with the highest value corresponds to the most likely class.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">CharLSTM model</head><p>The CharLSTM model uses a bidirectional LSTM (Hochreiter and Schmidhuber, 1997; <ref type="bibr" target="#b8">Graves and Schmidhuber, 2005)</ref> with 2 layers, which takes as input the characters of the sentences expressed as pretrained character embeddings of size 300. The LSTM output is then fed into a linear layer with 25 output units.</p><p>Similar models have been used in recent works related to emoji prediction, see for example the model used by <ref type="bibr" target="#b0">Barbieri et al. (2017)</ref>, or the one by <ref type="bibr" target="#b3">Baziotis et al. (2018)</ref>, which is however a more complex word-based model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Ensemble</head><p>We take into consideration two different ensembles, both containing the models in Table <ref type="table" target="#tab_0">1</ref>, but with different strategies for weighting the N P predictions. In the following, let Y ∈ R N P ×25 be the matrix containing one prediction per row.</p><p>The weights for the first ensemble (corresponding to the run file run1.txt) have been produced by a random search: at each iteration we compute a random vector w ∈ R N P with entries sampled from a random variable W 2 , W ∼ U[0, 1]. near-zero weights. After selecting the best configuration on the validation set, the predictions from each of the models are merged together in weighted mean:</p><formula xml:id="formula_3">ȳ = wY<label>(4)</label></formula><p>For the second type of ensemble (corresponding to the run file run2.txt) we adopt a multilayer perceptron. We feed as input the N P predictions concatenated into a single vector y (1...N P ) ∈ R 25N P , so that the model is:</p><formula xml:id="formula_4">ȳ = tanh y (1...N P ) W 1 + b 1 W 2 + b 2 , (5)</formula><p>where the hidden layer has size 259 and the output layer is composed by 25 units.</p><p>In both types of ensemble, as before, the output vector contains a score for each of the classes, providing a way to rank them from the most to the least likely. The most likely class c is thus computed as c = arg max i ȳi .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Training</head><p>The training algorithm differs based on the kind of model taken under consideration. We address each of them in the following paragraphs.</p><p>Models 1-6 The first six models are TreeESNs using a multilayer perceptron as readout. Given the fact that the main evaluation metric for the competition is the Macro F-score, each of the models has been trained by rebalancing the frequencies of the different target classes. In particular, the sampling probability for each input tree has been skewed so that the data extracted during training follows a uniform distribution with respect to the target class. For the readout part we use the Adam algorithm <ref type="bibr" target="#b13">(Kingma and Ba, 2015)</ref> for the stochastic optimization of the multi-class cross entropy loss function.</p><p>Models 7-10 Models from 7 to 10 are again TreeESNs, but with a Ridge Regression readout. In this case, 25 classifiers are trained with a 1-vs-all method, one for each class, using binary targets.</p><p>Models 11-12 Models 11 and 12 are again TreeESNs with a Ridge Regression readout, but they are trained to distinguish only between the most frequent class, the second most frequent class and all the other classes aggregated together. This is done to try to improve the ensemble precision and recall for the top two classes.</p><p>Model 13 The last model is a sequential LSTM over character embeddings. Like in the first 6 models, the Adam algorithm is used to optimize the cross entropy loss function.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Results</head><p>The ensemble seems to bring a substantial improvement to the performance on the validation set, as highlighted in Table <ref type="table" target="#tab_1">2</ref>. This is possible thanks to the number and diversity of the different models, as we can see in Figure <ref type="figure">2</ref> where we show the Pearson correlation coefficients between the predictions of the models in the ensemble.</p><p>On the test set we scored substantially lower,   <ref type="bibr">2018)</ref>. In Figure <ref type="figure">3</ref> we report the confusion matrix (with values normalized over the columns to address label imbalance) and the accuracy over the top-N classes.</p><p>An interesting characteristic of this approach, though, is computation time: we were able to train a TreeESN with 5000 reservoir units over 200,000 trees in just about 25 minutes, and this is without exploiting parallelism between the trees.</p><p>In ITAmoji 2018, our team ranked 3 rd out of 5. Detailed results and rankings are available at http://bit.ly/ITAmoji18.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Discussion and conclusions</head><p>Different authors have highlighted the difference in performance between SVM models and (deep) neural models for emoji prediction, and more in general for text classification tasks, suggesting that simple models like SVMs are more able to capture the features which are most important for generalization: see for example the reports of the SemEval-2018 participants C ¸öltekin and Rama (2018) and <ref type="bibr" target="#b5">Coster et al. (2018)</ref>.</p><p>In this work, instead, we approached the problem from the novel perspective of reservoir computing applied to the grammatical tree structure of the sentences. Despite a significant performance drop on the test set<ref type="foot" target="#foot_1">3</ref> we showed that, paired with a rich ensemble, the method is comparable to the results obtained in the past by other participants in similar competitions using very different models.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 2 :Figure 3 :</head><label>23</label><figDesc>Figure 2: Plot of the correlation between the predictions of the models in the ensemble. For reasons of space, not all labels are shown on the axes.</figDesc><graphic coords="4,88.52,82.59,161.53,161.53" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>The square increases the probability of sampling Composition of the ensemble, highlighting the differences between the models.</figDesc><table><row><cell># Class</cell><cell>Reservoir units</cell><cell>f φ</cell><cell>Readout</cell><cell cols="2">Parser Trials</cell></row><row><cell>1 TreeESN</cell><cell>1000</cell><cell>ReLU</cell><cell>MLP</cell><cell>Tint</cell><cell>10</cell></row><row><cell>2 TreeESN</cell><cell>1000</cell><cell>Tanh</cell><cell>MLP</cell><cell>Tint</cell><cell>10</cell></row><row><cell>3 TreeESN</cell><cell>5000</cell><cell>Tanh</cell><cell>MLP</cell><cell>Tint</cell><cell>1</cell></row><row><cell>4 TreeESN</cell><cell>5000</cell><cell>Tanh</cell><cell>MLP</cell><cell>spaCy</cell><cell>2</cell></row><row><cell>5 TreeESN</cell><cell>5000</cell><cell>ReLU</cell><cell>MLP</cell><cell>Tint</cell><cell>1</cell></row><row><cell>6 TreeESN</cell><cell>5000</cell><cell>ReLU</cell><cell>MLP</cell><cell>spaCy</cell><cell>1</cell></row><row><cell>7 TreeESN</cell><cell>5000</cell><cell cols="2">Tanh Ridge regression</cell><cell>Tint</cell><cell>1</cell></row><row><cell>8 TreeESN</cell><cell>5000</cell><cell cols="3">Tanh Ridge regression spaCy</cell><cell>3</cell></row><row><cell>9 TreeESN</cell><cell>5000</cell><cell cols="2">ReLU Ridge regression</cell><cell>Tint</cell><cell>1</cell></row><row><cell>10 TreeESN</cell><cell>5000</cell><cell cols="3">ReLU Ridge regression spaCy</cell><cell>3</cell></row><row><cell>11 TreeESN</cell><cell>5000</cell><cell cols="2">Tanh Ridge regression</cell><cell>Tint</cell><cell>1</cell></row><row><cell>12 TreeESN</cell><cell>5000</cell><cell cols="3">Tanh Ridge regression spaCy</cell><cell>2</cell></row><row><cell>13 CharLSTM</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>1</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Performance obtained on the validation set for the two submitted runs. The columns are, in order, the average and maximum Macro-F1 over the models in the ensemble, and the Macro-F1 and Coverage Error of the ensemble.</figDesc><table><row><cell cols="5">Run Avg F1 Max F1 Ens. F1 CovE</cell></row><row><cell>run1</cell><cell>14.4</cell><cell>18.5</cell><cell>24.9</cell><cell>4.014</cell></row><row><cell>run2</cell><cell>14.4</cell><cell>18.5</cell><cell>26.7</cell><cell>3.428</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Performance on the test set. These values have been obtained by retraining the models over the whole dataset (training set and validation set) after the final model selection phase.with the Macro-F1 and Coverage Errors reported in Table3. These numbers are close to those obtained by the top two models applied to the Spanish language in the "Multilingual Emoji Prediction" task of the SemEval-2018 competition<ref type="bibr" target="#b17">(Barbieri et al., 2018)</ref>, with F1 scores of22.36 and  18.73 (C ¸öltekin and Rama, 2018;<ref type="bibr" target="#b5">Coster et al.,</ref> </figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">Emitting data in the CoNLL-U format<ref type="bibr" target="#b15">(Nivre et al., 2016)</ref>, a revised version of the CoNLL-X format(Buchholz  and Marsi, 2006).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">Probably due to overtraining: we observed that Macro-F1 overcame 0.40 in training.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">Francesco</forename><surname>Barbieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Miguel</forename><surname>Ballesteros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Horacio</forename><surname>Saggion</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1702.07285</idno>
		<title level="m">Are Emojis Predictable?</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title/>
		<author>
			<persName><forename type="first">Francesco</forename><surname>Barbieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jose</forename><surname>Camacho-Collados</surname></persName>
		</author>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">SemEval 2018 Task 2: Multilingual Emoji Prediction</title>
		<author>
			<persName><forename type="first">Francesco</forename><surname>Barbieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jose</forename><surname>Camacho-Collados</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Francesco</forename><surname>Ronzano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Luis</forename><surname>Espinosa Anke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Miguel</forename><surname>Ballesteros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Valerio</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Viviana</forename><surname>Patti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Horacio</forename><surname>Saggion</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of The 12th International Workshop on Semantic Evaluation</title>
				<meeting>The 12th International Workshop on Semantic Evaluation</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="24" to="33" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Predicting Emojis using RNNs with Context-aware Attention</title>
		<author>
			<persName><forename type="first">Christos</forename><surname>Baziotis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nikos</forename><surname>Athanasiou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Georgios</forename><surname>Paraskevopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nikolaos</forename><surname>Ellinas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Athanasia</forename><surname>Kolovou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alexandros</forename><surname>Potamianos</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1804.06657</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Tenth Conference on Computational Natural Language Learning</title>
				<editor>
			<persName><forename type="first">Sabine</forename><surname>Buchholz</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Erwin</forename><surname>Marsi</surname></persName>
		</editor>
		<meeting>the Tenth Conference on Computational Natural Language Learning</meeting>
		<imprint>
			<date type="published" when="2006">2018. 2006</date>
			<biblScope unit="page" from="149" to="164" />
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Tübingen-Oslo at SemEval-2018 Task 2: SVMs perform better than RNNs in Emoji Prediction</title>
		<author>
			<persName><forename type="first">Çağrı</forename><surname>Çöltekin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Taraka</forename><surname>Rama</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of The 12th International Workshop on Semantic Evaluation</title>
				<meeting>The 12th International Workshop on Semantic Evaluation</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="34" to="38" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Hatching Chick at SemEval-2018 Task 2: Multilingual Emoji Prediction</title>
		<author>
			<persName><forename type="first">Joël</forename><surname>Coster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Reinder</forename><forename type="middle">Gerard</forename><surname>van Dalen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nathalie</forename><forename type="middle">Adriënne Jacqueline</forename><surname>Stierman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of The 12th International Workshop on Semantic Evaluation</title>
				<meeting>The 12th International Workshop on Semantic Evaluation</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="445" to="448" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Architectural and Markovian factors of echo state networks</title>
		<author>
			<persName><forename type="first">Claudio</forename><surname>Gallicchio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alessio</forename><surname>Micheli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural Networks</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="440" to="456" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Tree Echo State Networks</title>
		<author>
			<persName><forename type="first">Claudio</forename><surname>Gallicchio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alessio</forename><surname>Micheli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neurocomputing</title>
		<imprint>
			<biblScope unit="volume">101</biblScope>
			<biblScope unit="page" from="319" to="337" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Framewise phoneme classification with bidirectional LSTM networks</title>
		<author>
			<persName><forename type="first">Alex</forename><surname>Graves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jürgen</forename><surname>Schmidhuber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">;</forename><surname>Hochreiter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jürgen</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IJCNN&apos;05. Proceedings. 2005 IEEE International Joint conference on</title>
				<imprint>
			<date type="published" when="1997">2005. 2005. 1997</date>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="1735" to="1780" />
		</imprint>
	</monogr>
	<note>Long short-term memory</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Ridge regression: Biased estimation for nonorthogonal problems</title>
		<author>
			<persName><forename type="first">Arthur</forename><forename type="middle">E</forename><surname>Hoerl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Robert</forename><forename type="middle">W</forename><surname>Kennard</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Technometrics</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="55" to="67" />
			<date type="published" when="1970">1970</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">An Improved Non-monotonic Transition System for Dependency Parsing</title>
		<author>
			<persName><forename type="first">Matthew</forename><surname>Honnibal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mark</forename><surname>Johnson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2015 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Lisbon, Portugal</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2015-09">2015. September</date>
			<biblScope unit="page" from="1373" to="1378" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication</title>
		<author>
			<persName><forename type="first">Herbert</forename><surname>Jaeger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Harald</forename><surname>Haas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Science</title>
		<imprint>
			<biblScope unit="volume">304</biblScope>
			<biblScope unit="issue">5667</biblScope>
			<biblScope unit="page" from="78" to="80" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Bag of Tricks for Efficient Text Classification</title>
		<author>
			<persName><forename type="first">Armand</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Edouard</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Piotr</forename><surname>Bojanowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tomas</forename><surname>Mikolov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1607.01759</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Adam: Amethod for stochastic optimization</title>
		<author>
			<persName><forename type="first">P</forename><surname>Diederik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jimmy</forename><forename type="middle">Lei</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><surname>Ba</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 3rd International Conference on Learning Representations (ICLR)</title>
				<meeting>the 3rd International Conference on Learning Representations (ICLR)</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Reservoir computing approaches to recurrent neural network training</title>
		<author>
			<persName><forename type="first">Mantas</forename><surname>Lukoševičius</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Herbert</forename><surname>Jaeger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Science Review</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="127" to="149" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Universal Dependencies v1: A Multilingual Treebank Collection</title>
		<author>
			<persName><forename type="first">Joakim</forename><surname>Nivre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marie-Catherine</forename><surname>De Marneffe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Filip</forename><surname>Ginter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoav</forename><surname>Goldberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jan</forename><surname>Hajic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ryan</forename><forename type="middle">T</forename><surname>Mcdonald</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Slav</forename><surname>Petrov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sampo</forename><surname>Pyysalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Natalia</forename><surname>Silveira</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">LREC</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Italy goes to Stanford: a collection of CoreNLP modules for Italian</title>
		<author>
			<persName><forename type="first">A</forename><surname>Palmero Aprosio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Moretti</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016-09">2016. September</date>
		</imprint>
	</monogr>
	<note>ArXiv e-prints</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Overview of the EVALITA</title>
		<author>
			<persName><forename type="first">Francesco</forename><surname>Ronzano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Francesco</forename><surname>Barbieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Endang</forename><forename type="middle">Wahyu</forename><surname>Pamungkas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Viviana</forename><surname>Patti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Francesca</forename><surname>Chiusaroli</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Italian Emoji Prediction (ITAMoji) Task</title>
		<ptr target="CEUR.org" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA&apos;18)</title>
				<editor>
			<persName><forename type="first">Tommaso</forename><surname>Caselli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Nicole</forename><surname>Novielli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Viviana</forename><surname>Patti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Paolo</forename><surname>Rosso</surname></persName>
		</editor>
		<meeting>the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA&apos;18)<address><addrLine>Turin, Italy</addrLine></address></meeting>
		<imprint/>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
