ghostwriter19 @ ATE_ABSITA: Zero-Shot and ONNX to Speed up BERT on Sentiment Analysis Tasks at EVALITA 2020

Mauro Bennici
You Are My Guide, Torino
mauro@youaremyguide.com

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

With the arrival of BERT (https://github.com/google-research/bert) in 2018, NLP research has taken a significant step forward. However, the necessary computing power has grown accordingly. Various distillation and optimization systems have been adopted, but they are expensive in terms of cost-benefit ratio, and the most important improvements are obtained by creating increasingly complex models with more layers and parameters. In this research, we will see how, by mixing transfer learning, zero-shot learning, and the ONNX runtime (https://microsoft.github.io/onnxruntime/), we can access the power of BERT right now, optimizing time and resources and achieving noticeable results on day one.

1 Introduction

In a process where data change very quickly and complete retraining must happen in the shortest possible time, transfer learning techniques have made fast fine-tuning of BERT models possible. The distillation of a model makes it possible to decrease the load and the inference times without significantly losing accuracy. These models, however, still require at least constant fine-tuning. In addition, a BERT model specially designed for the Italian language, with a vocabulary containing technical terms, increases its effectiveness.

Constant and multi-disciplinary training requires specific skills and tailor-made services. In this research, we will see an effective way to make both possible. The idea is to use a format for exchanging AI models between libraries and frameworks, the ONNX project, together with a runtime, the ONNX runtime project, to optimize inference for many platforms, languages, and hardware. The ONNX runtime project is still working on optimizing training directly in the ONNX format.

The second goal is to find a viable alternative with acceptable performance at the start of a new project, while waiting for a trained BERT model.

The research was carried out on the ATE_ABSITA task (de Mattei et al., 2020) at EVALITA 2020 (Basile et al., 2020), using all three available sub-tasks.

2 Description of the system

To start using a sentiment analysis system, we need several elements, certainly a starting dataset with the related labels. In the tasks of the challenge, we have the reviews of 23 different products. Each review has a corresponding rating assigned by the end-user. For each review, it was required to extract the aspects contained in it. By aspect, we mean every opinion word that expresses a sentiment polarity. Finally, each aspect was classified with a pair of boolean values, positive and negative, giving four possible states.

Imagine a system that receives an unspecified number of reviews in real-time, with new products and different categories. We find ourselves in the situation of always having to fine-tune our models. The complexity of BERT makes this constant realignment difficult because of the training time required. By reducing the training time, or by putting an alternative in place in the meantime, new perspectives open up, such as:

- Make inference calls before a fully trained model is completed.
- Train the new model.
- Run the new BERT model.
- (optional) Reclassify recent product reviews after the model update.

In this perspective, in order to validate my hypotheses, I used the AlBERTo model (Polignano et al., 2019), also used in the baseline, and Ktrain (https://github.com/amaiya/ktrain), a wrapper for TensorFlow (https://www.tensorflow.org/), with the autofit option.
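As a rough illustration of this setup, the sketch below fine-tunes an Italian BERT checkpoint with Ktrain's Transformer wrapper and the autofit option. The Hugging Face model identifier for AlBERTo, the hyperparameters, and the in-line toy data are assumptions for illustration only, not the exact configuration behind the submissions; the token-level model for Task 1 would use a sequence-tagging setup rather than a sentence classifier.

```python
# Minimal sketch (not the exact competition code): fine-tuning an Italian BERT
# model with Ktrain's autofit option. The hub id, hyperparameters, and the tiny
# in-line dataset are illustrative assumptions.
import ktrain
from ktrain import text

# Assumed Hugging Face id for AlBERTo; any Italian BERT checkpoint would work here.
MODEL_NAME = "m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0"

# Toy review/rating pairs standing in for the ATE_ABSITA training split.
x_train = ["Prodotto ottimo, lo consiglio.", "Pessima qualita', da evitare."]
y_train = ["5", "1"]
x_val, y_val = x_train, y_train  # placeholder validation split

t = text.Transformer(MODEL_NAME, maxlen=128,
                     class_names=["1", "2", "3", "4", "5"])
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_val, y_val)

learner = ktrain.get_learner(t.get_classifier(), train_data=trn,
                             val_data=val, batch_size=16)

# autofit uses a triangular learning-rate policy; 2e-5 is a common BERT choice,
# with early stopping after 3 epochs without improvement.
learner.autofit(2e-5, early_stopping=3)

predictor = ktrain.get_predictor(learner.model, preproc=t)
predictor.save("alberto_review_classifier")
```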
The first submission, called ghostwriter19_a, was obtained by training all the models with the Ktrain framework. The results for the three tasks of the second submission, called ghostwriter19_b, were obtained in two different ways:

- for the first two tasks, I used the model of the first submission, but exported to ONNX and run with the ONNX runtime;
- for the third task, I used a model built with TensorFlow and a Zero-Shot Learner [ZSL] (Brown et al., 2020).

To test the models, I used two different machines with Ubuntu 20.04 LTS:

- 6 vCPU on Intel Xeon E5-2690 v4, 112 GB of RAM, with a P100 (GPU);
- 14 cores on Intel Xeon E5-2690 v4, 32 GB of RAM (CPU).

2.1 Task 1 – ATE: Aspect Term Extraction

To identify an aspect, the dataset contains a label for every single word, with three possible values:

- B for Begin of an aspect.
- I for Inside an aspect.
- O for Outside, not in an aspect.

For example, the review "La borraccia termica svolge egregiamente il proprio compito di mantenere la temperatura, calda o fretta che sia. La costruzione è ottimale e ben rifinita. Acquisto straconsigliato!" (roughly: "The insulated bottle does an excellent job of keeping the temperature, be it hot or cold. The construction is optimal and well finished. Highly recommended purchase!") is labeled as:

[Figure: BIO labeling of the example review]

The model is evaluated with the F1-score, computed from the fully matched aspects, the partially matched ones, and the missed ones. The preliminary results with the Ktrain model were encouraging (table 1).

Model            F1-Score
ghostwriter19_a  0.6152
Baseline         0.2556

Table 1: Task 1 DEV results

At this point, the model was exported to ONNX in maximum compatibility mode and run with the ONNX runtime optimized for CPU. Accuracy remained unchanged, but the speed of inference improved significantly (table 2).

Model                                  Queries per second
ghostwriter19_a CPU                    4
ghostwriter19_b CPU with ONNX runtime  68
ghostwriter19_a GPU                    124
ghostwriter19_b GPU with ONNX runtime  217

Table 2: Performance comparison on Task 1

The improvement is 17x for the CPU version and 1.75x for the GPU version.
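As a rough sketch of this ONNX Runtime path, the code below loads an already exported graph, runs it on CPU with all graph optimizations enabled, and measures queries per second, the metric reported in tables 2 and 4. The export command, the graph input names, the tokenizer identifier, and the sequence length are assumptions, not the exact settings used for ghostwriter19_b.

```python
# Minimal sketch: CPU inference with ONNX Runtime and a rough throughput measure.
# The model is assumed to have been exported beforehand, e.g. with
#   python -m tf2onnx.convert --saved-model alberto_saved_model --output alberto.onnx
import time

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0")

# Enable all graph optimizations before creating the CPU session.
options = ort.SessionOptions()
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession("alberto.onnx", options,
                               providers=["CPUExecutionProvider"])

review = "La costruzione è ottimale e ben rifinita."
enc = tokenizer(review, padding="max_length", truncation=True,
                max_length=128, return_tensors="np")

# Feed only the inputs that the exported graph actually declares.
graph_inputs = {i.name for i in session.get_inputs()}
feeds = {name: enc[name].astype(np.int64)
         for name in ("input_ids", "attention_mask", "token_type_ids")
         if name in graph_inputs and name in enc}

# Rough queries-per-second measurement, analogous to tables 2 and 4.
n = 100
start = time.perf_counter()
for _ in range(n):
    logits = session.run(None, feeds)[0]
elapsed = time.perf_counter() - start
print(f"{n / elapsed:.1f} queries per second")
print("predicted class index:", int(np.argmax(logits, axis=-1)[0]))
```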
2.2 Task 2 – ABSA: Aspect-based Sentiment Analysis

For this task, the aspects identified in Task 1 have been used. This implies that an error in Task 1 will have a decisive impact on Task 2.

Each aspect can be classified as:

- positive (POS: true, NEG: false);
- negative (POS: false, NEG: true);
- mixed polarity (POS: true, NEG: true);
- neutral polarity (POS: false, NEG: false);

as shown in the image on the challenge website (http://www.di.uniba.it/~swap/ate_absita/task.html).

The results on the DEV set outperform the baseline (table 3).

Model            F1-Score
ghostwriter19_a  0.6019
Baseline         0.2

Table 3: Task 2 DEV results

Also for this task, the performance improves with the use of the ONNX runtime (table 4).

Model                                  Queries per second
ghostwriter19_a CPU                    3
ghostwriter19_b CPU with ONNX runtime  56
ghostwriter19_a GPU                    97
ghostwriter19_b GPU with ONNX runtime  154

Table 4: Performance comparison on Task 2

The improvement is 9.5x for the CPU version and 1.59x for the GPU version.

2.3 Task 3 – SA: Sentiment Analysis

Task 3 is a classification problem. However, fully understanding the score is not easy. The evaluation is carried out by different people and with different styles: a product with a similar review may be rated differently, according to the expectations and judgment of each user.

Furthermore, in order to avoid the long training time that constant updating requires, and as an alternative to the systems used in previous editions of EVALITA, such as an ensemble of Random Forest and Bi-LSTM (Bennici and Portocarrero, 2018) or an SVM system (Barbieri et al., 2016), I used a Zero-Shot Learner [ZSL] (Pushp & Srivastava, 2017). A ZSL is a way to make predictions without prior training (Petroni et al., 2019): it relies on the embeddings of a pre-trained model, AlBERTo in this case, and on the embeddings of the proposed labels as possible results (Schick and Schütze, 2020).

The proposed labels were the possible rating values, i.e., the numbers from 1 to 5. The predicted value is a weighted average of the two labels with the highest probabilities, if and only if the gap between their probabilities is less than 10^-3; otherwise, only the label with the highest probability is considered valid.

For this task, I omitted the ONNX runtime test because a stable converter for the ZSL version is not available.
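The following sketch shows one way such a zero-shot predictor could be put together. The Italian label verbalizations, the mean-pooled AlBERTo embeddings, and the cosine-similarity scoring are assumptions introduced for illustration; only the candidate labels 1 to 5 and the tie-breaking rule (a weighted average of the two most probable ratings when their probability gap is below 10^-3) follow the description above.

```python
# Illustrative sketch of a zero-shot rating predictor for Task 3. The label
# wordings, the mean pooling, and the cosine-similarity scoring are assumed;
# the paper only specifies the candidate labels 1..5 and the tie-breaking rule.
import numpy as np
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

MODEL = "m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0"  # assumed AlBERTo id
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = TFAutoModel.from_pretrained(MODEL)

# Hypothetical Italian verbalizations of the ratings 1..5.
LABELS = {1: "pessimo", 2: "scarso", 3: "nella media", 4: "buono", 5: "ottimo"}

def embed(text: str) -> np.ndarray:
    """Mean-pooled last hidden state, used as a crude sentence embedding."""
    enc = tokenizer(text, return_tensors="tf", truncation=True, max_length=128)
    hidden = encoder(**enc).last_hidden_state            # shape (1, seq_len, dim)
    return tf.reduce_mean(hidden, axis=1).numpy()[0]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

label_vectors = {rating: embed(word) for rating, word in LABELS.items()}

def predict_rating(review: str) -> float:
    review_vec = embed(review)
    ratings = np.array(sorted(label_vectors), dtype=float)
    sims = np.array([cosine(review_vec, label_vectors[r]) for r in sorted(label_vectors)])
    probs = np.exp(sims) / np.exp(sims).sum()             # softmax over candidate ratings
    top2 = np.argsort(probs)[-2:][::-1]                   # indices of the two best ratings
    if probs[top2[0]] - probs[top2[1]] < 1e-3:
        weights = probs[top2] / probs[top2].sum()         # weighted average of the best two
        return float(np.dot(weights, ratings[top2]))
    return float(ratings[top2[0]])

print(predict_rating("La costruzione è ottimale e ben rifinita."))
```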
The score for this task is the Root Mean Squared Error (RMSE) between the predicted polarity and the polarity assigned by the user. The results on the DEV set are reported in table 5.

Model             RMSE score
ghostwriter19_a   0.6997
ghostwriter19_b   0.8526
Baseline AlBERTo  1.0806

Table 5: Task 3 DEV results

The loss in performance is 18%, but the entire previous training phase is skipped (table 5).

3 Results

The results obtained with the DEV dataset are very positive, both in terms of accuracy and of performance. ZSL has proven to be a remarkable technology to invest in, while the Ktrain models seem to suffer from heavy overfitting.

The aim of this research is not to deliver the most relevant model, but to prove that a model can be production-ready with less time and fewer resources. Even so, in all three tasks the models outperformed the baseline by a significant margin in terms of accuracy/RMSE.

3.1 Results for Task 1

The final results with the TEST dataset are:

Model              F1 score
ghostwriter19_a_D  0.6152
ghostwriter19_a_T  0.5399
Baseline AlBERTo   0.2556

Table 6: TEST dataset results for Task 1

The results are about 12% lower than those obtained in the research phase (table 6). It will be interesting to continue experimenting with different ONNX options to find a better combination of compatibility and performance.

3.2 Results for Task 2

The final results with the TEST dataset are:

Model              F1 score
ghostwriter19_a_D  0.6019
ghostwriter19_b_T  0.4994
Baseline AlBERTo   0.2

Table 7: TEST dataset results for Task 2

The loss from DEV to TEST is about 17% (table 7). However, the relative difference between the results of Tasks 1 and 2 has been maintained across the DEV and TEST datasets. This is in line with expectations: the worse model performance in Task 1 impacted Task 2 proportionally. In return, working on a better model will improve both tasks.

3.3 Results for Task 3

For Task 3 we have:

Model              RMSE score
ghostwriter19_a_D  0.6997
ghostwriter19_b_D  0.8526
ghostwriter19_a_T  0.81394
ghostwriter19_b_T  0.83479
Baseline AlBERTo   1.0806

Table 8: TEST dataset results for Task 3

The difference between the DEV and TEST datasets is marked here only for the trained model, about 14% (table 8). The untrained one actually performed slightly better, by about 2%, on the TEST dataset. This result confirms, as assumed, that an underperforming trained model can have the same performance as a model that uses ZSL.

The price to pay, however, is that the average inference time for the ZSL is 157x higher than that of the pure TensorFlow model obtained with Ktrain.

4 Conclusion

The results demonstrated that it is possible to create hybrid systems for training and inference to make the power of BERT more accessible. While a new, optimized model is being trained, an untrained ZSL model can fill the gap in the meantime.

Optimizing our models, and in the future training them, in a format intrinsically optimized for the platform and framework of choice does not affect performance or future use. The improvements obtained with the ONNX runtime on these Italian tasks are in line with what Microsoft demonstrated, for the English language, at the beginning of 2020 (Ning et al., 2020).

The next step is to make the ONNX export work with a Zero-Shot Learner [ZSL], in order to compensate, at least in part, for the more significant resources that it inevitably requires.
References

Barbieri, F., Basile, V., Croce, D., Nissim, M., Novielli, N., & Patti, V. (2016). Overview of the EVALITA 2016 SENTIment POLarity Classification Task. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & the Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). CEUR-WS.org.

Basile, V., Croce, D., Di Maro, M., & Passaro, L. (2020). EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020). CEUR-WS.org.

Bennici, M., & Portocarrero, X. S. (2018). Ensemble for aspect-based sentiment analysis. In T. Caselli, N. Novielli, V. Patti, & P. Rosso (Eds.), Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2018). CEUR-WS.org.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... Amodei, D. (2020). Language Models are Few-Shot Learners. https://arxiv.org/abs/2005.14165

de Mattei, L., de Martino, G., Iovine, A., Miaschi, A., Polignano, M., & Rambelli, G. (2020). Overview of the EVALITA 2020 Aspect Term Extraction and Aspect-based Sentiment Analysis (ATE_ABSITA) Task. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

Ning, E., Yan, N., Zhu, J., & Li, J. (2020). Microsoft open sources breakthrough optimizations for transformer inference on GPU and CPU. https://cloudblogs.microsoft.com/opensource/2020/01/21/microsoft-onnx-open-source-optimizations-transformer-inference-gpu-cpu/

Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., & Riedel, S. (2019). Language Models as Knowledge Bases? https://arxiv.org/abs/1909.01066

Polignano, M., Basile, P., de Gemmis, M., Semeraro, G., & Basile, V. (2019). AlBERTo: Italian BERT Language Understanding Model for NLP Challenging Tasks Based on Tweets. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019). CEUR-WS.org.

Pushp, P. K., & Srivastava, M. M. (2017). Train Once, Test Anywhere: Zero-Shot Learning for Text Classification. https://arxiv.org/abs/1712.05972

Schick, T., & Schütze, H. (2020). Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference. https://arxiv.org/abs/2001.07676