ghostwriter19 @ ATE_ABSITA: Zero-Shot and ONNX to Speed up BERT on Sentiment Analysis Tasks at EVALITA 2020

Mauro Bennici
You Are My Guide, Torino
mauro@youaremyguide.com

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

With the arrival of BERT (https://github.com/google-research/bert) in 2018, NLP research has taken a significant step forward. However, the necessary computing power has grown accordingly. Various distillation and optimization systems have been adopted, but they are expensive in terms of cost-benefit ratio, and the most important improvements are obtained by creating increasingly complex models with more layers and parameters. In this research, we will see how, by mixing transfer learning, zero-shot learning, and the ONNX runtime (https://microsoft.github.io/onnxruntime/), we can access the power of BERT right now, optimizing time and resources and achieving noticeable results on day one.

1 Introduction

In a process where data change very quickly and complete retraining must happen in the shortest possible time, transfer learning techniques have made fast fine-tuning of BERT models possible. The distillation of a model makes it possible to decrease the load and the inference times without significantly losing accuracy. These models, however, still require at least constant fine-tuning. In addition, a BERT model specially designed for the Italian language, with a vocabulary containing technical terms, increases its effectiveness.

Constant and multi-disciplinary training requires specific skills and tailor-made services. In this research, we will see an effective way to make both possible. The idea is to use a format for exchanging AI models between libraries and frameworks, the ONNX project, together with a runtime, the ONNX runtime project, to optimize inference for many platforms, languages, and hardware. The ONNX runtime project is still working on optimizing training directly in the ONNX format.

The second goal is to find a viable alternative with acceptable performance at the start of a new project, while waiting for a trained BERT model.

The research was carried out on the ATE_ABSITA task (de Mattei et al., 2020) at EVALITA 2020 (Basile et al., 2020), using all three available sub-tasks.

2 Description of the system

To start using a sentiment analysis system, we need several elements, certainly a starting dataset with the related labels. In the tasks of the challenge, we have the reviews of 23 different products. Each review has a corresponding rating assigned by the end-user. For each review, it was required to extract the aspects contained in it. By aspect, we mean every opinion word that expresses a sentiment polarity. Finally, each aspect was classified with a pair of boolean values, positive and negative, giving four possible states.

Imagine a system that receives an unspecified number of reviews in real-time, with new products and different categories. We find ourselves in the situation of always having to fine-tune our models. The complexity of BERT makes this constant realignment difficult because of the training time required. By reducing the training time, or by putting an alternative in place in the meantime, new perspectives open up, such as:

- Make inference calls before a fully trained model is completed.
- Train the new model.
- Run the new BERT model.
- (optional) Reclassify recent product reviews after the model update.

In this perspective, in order to validate my hypotheses, I used the AlBERTo model (Polignano et al., 2019), also used in the baseline, and Ktrain (https://github.com/amaiya/ktrain), a wrapper for TensorFlow (https://www.tensorflow.org/), with the autofit option.
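As a rough illustration of this setup, the sketch below fine-tunes an Italian BERT checkpoint with Ktrain's Transformer wrapper and the autofit option. The Hugging Face model identifier for AlBERTo, the hyperparameters, and the in-line toy data are assumptions for illustration only, not the exact configuration behind the submissions; the token-level model for Task 1 would use a sequence-tagging setup rather than a sentence classifier.

```python
# Minimal sketch (not the exact competition code): fine-tuning an Italian BERT
# model with Ktrain's autofit option. The hub id, hyperparameters, and the tiny
# in-line dataset are illustrative assumptions.
import ktrain
from ktrain import text

# Assumed Hugging Face id for AlBERTo; any Italian BERT checkpoint would work here.
MODEL_NAME = "m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0"

# Toy review/rating pairs standing in for the ATE_ABSITA training split.
x_train = ["Prodotto ottimo, lo consiglio.", "Pessima qualita', da evitare."]
y_train = ["5", "1"]
x_val, y_val = x_train, y_train  # placeholder validation split

t = text.Transformer(MODEL_NAME, maxlen=128,
                     class_names=["1", "2", "3", "4", "5"])
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_val, y_val)

learner = ktrain.get_learner(t.get_classifier(), train_data=trn,
                             val_data=val, batch_size=16)

# autofit uses a triangular learning-rate policy; 2e-5 is a common BERT choice,
# with early stopping after 3 epochs without improvement.
learner.autofit(2e-5, early_stopping=3)

predictor = ktrain.get_predictor(learner.model, preproc=t)
predictor.save("alberto_review_classifier")
```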
The first submission, called ghostwriter19_a, was obtained by training all the models with the Ktrain framework. The results for the three tasks of the second submission, called ghostwriter19_b, were obtained in two different ways:

- for the first two tasks, I used the model of the first submission, but exported to ONNX and run with the ONNX runtime;
- for the third task, I used a model built with TensorFlow and a Zero-Shot Learner [ZSL] (Brown et al., 2020).

To test the models, I used two different machines with Ubuntu 20.04 LTS:

- 6 vCPU on Intel Xeon E5-2690 v4, 112 GB of RAM, with a P100 (GPU);
- 14 cores on Intel Xeon E5-2690 v4, 32 GB of RAM (CPU).

2.1 Task 1 – ATE: Aspect Term Extraction

To identify an aspect, the dataset contains a label for every single word, with three possible values:

- B for Begin of an aspect.
- I for Inside an aspect.
- O for Outside, not in an aspect.

For example, the review "La borraccia termica svolge egregiamente il proprio compito di mantenere la temperatura, calda o fretta che sia. La costruzione è ottimale e ben rifinita. Acquisto straconsigliato!" (roughly: "The insulated bottle does an excellent job of keeping the temperature, be it hot or cold. The construction is optimal and well finished. Highly recommended purchase!") is labeled as:

[Figure: BIO labeling of the example review]

The model is evaluated with the F1-score, computed from the fully matched aspects, the partially matched ones, and the missed ones. The preliminary results with the Ktrain model were encouraging (table 1).

Model            F1-Score
ghostwriter19_a  0.6152
Baseline         0.2556

Table 1: Task 1 DEV results

At this point, the model was exported to ONNX in maximum compatibility mode and run with the ONNX runtime optimized for CPU. Accuracy remained unchanged, but the speed of inference improved significantly (table 2).

Model                                  Queries per second
ghostwriter19_a CPU                    4
ghostwriter19_b CPU with ONNX runtime  68
ghostwriter19_a GPU                    124
ghostwriter19_b GPU with ONNX runtime  217

Table 2: Performance comparison on Task 1

The improvement is 17x for the CPU version and 1.75x for the GPU version.
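As a rough sketch of this ONNX Runtime path, the code below loads an already exported graph, runs it on CPU with all graph optimizations enabled, and measures queries per second, the metric reported in tables 2 and 4. The export command, the graph input names, the tokenizer identifier, and the sequence length are assumptions, not the exact settings used for ghostwriter19_b.

```python
# Minimal sketch: CPU inference with ONNX Runtime and a rough throughput measure.
# The model is assumed to have been exported beforehand, e.g. with
#   python -m tf2onnx.convert --saved-model alberto_saved_model --output alberto.onnx
import time

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0")

# Enable all graph optimizations before creating the CPU session.
options = ort.SessionOptions()
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession("alberto.onnx", options,
                               providers=["CPUExecutionProvider"])

review = "La costruzione è ottimale e ben rifinita."
enc = tokenizer(review, padding="max_length", truncation=True,
                max_length=128, return_tensors="np")

# Feed only the inputs that the exported graph actually declares.
graph_inputs = {i.name for i in session.get_inputs()}
feeds = {name: enc[name].astype(np.int64)
         for name in ("input_ids", "attention_mask", "token_type_ids")
         if name in graph_inputs and name in enc}

# Rough queries-per-second measurement, analogous to tables 2 and 4.
n = 100
start = time.perf_counter()
for _ in range(n):
    logits = session.run(None, feeds)[0]
elapsed = time.perf_counter() - start
print(f"{n / elapsed:.1f} queries per second")
print("predicted class index:", int(np.argmax(logits, axis=-1)[0]))
```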
2.2 Task 2 – ABSA: Aspect-based Sentiment Analysis

For this task, the aspects identified in Task 1 have been used. This implies that an error in Task 1 will have a decisive impact on Task 2.

Each aspect can be classified as:

- positive (POS: true, NEG: false);
- negative (POS: false, NEG: true);
- mixed polarity (POS: true, NEG: true);
- neutral polarity (POS: false, NEG: false);

as shown in the image on the challenge website (http://www.di.uniba.it/~swap/ate_absita/task.html).

The results on the DEV set outperform the baseline (table 3).

Model            F1-Score
ghostwriter19_a  0.6019
Baseline         0.2

Table 3: Task 2 DEV results

Also for this task, the performance improves with the use of the ONNX runtime (table 4).

Model                                  Queries per second
ghostwriter19_a CPU                    3
ghostwriter19_b CPU with ONNX runtime  56
ghostwriter19_a GPU                    97
ghostwriter19_b GPU with ONNX runtime  154

Table 4: Performance comparison on Task 2

The improvement is 9.5x for the CPU version and 1.59x for the GPU version.

2.3 Task 3 – SA: Sentiment Analysis

Task 3 is a classification problem. However, fully understanding the score is not easy. The evaluation is carried out by different people and with different styles: a product with a similar review may be rated differently, according to the expectations and judgment of each user.

Furthermore, in order to avoid the long training time that constant updating requires, and as an alternative to the systems used in previous editions of EVALITA, such as an ensemble of Random Forest and Bi-LSTM (Bennici and Portocarrero, 2018) or an SVM system (Barbieri et al., 2016), I used a Zero-Shot Learner [ZSL] (Pushp & Srivastava, 2017). A ZSL is a way to make predictions without prior training (Petroni et al., 2019): it relies on the embeddings of a pre-trained model, AlBERTo in this case, and on the embeddings of the proposed labels as possible results (Schick and Schütze, 2020).

The proposed labels were the possible rating values, i.e., the numbers from 1 to 5. The predicted value is a weighted average of the two labels with the highest probabilities, if and only if the gap between their probabilities is less than 10^-3; otherwise, only the label with the highest probability is considered valid.

For this task, I omitted the ONNX runtime test because a stable converter for the ZSL version is not available.
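The following sketch shows one way such a zero-shot predictor could be put together. The Italian label verbalizations, the mean-pooled AlBERTo embeddings, and the cosine-similarity scoring are assumptions introduced for illustration; only the candidate labels 1 to 5 and the tie-breaking rule (a weighted average of the two most probable ratings when their probability gap is below 10^-3) follow the description above.

```python
# Illustrative sketch of a zero-shot rating predictor for Task 3. The label
# wordings, the mean pooling, and the cosine-similarity scoring are assumed;
# the paper only specifies the candidate labels 1..5 and the tie-breaking rule.
import numpy as np
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

MODEL = "m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0"  # assumed AlBERTo id
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = TFAutoModel.from_pretrained(MODEL)

# Hypothetical Italian verbalizations of the ratings 1..5.
LABELS = {1: "pessimo", 2: "scarso", 3: "nella media", 4: "buono", 5: "ottimo"}

def embed(text: str) -> np.ndarray:
    """Mean-pooled last hidden state, used as a crude sentence embedding."""
    enc = tokenizer(text, return_tensors="tf", truncation=True, max_length=128)
    hidden = encoder(**enc).last_hidden_state            # shape (1, seq_len, dim)
    return tf.reduce_mean(hidden, axis=1).numpy()[0]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

label_vectors = {rating: embed(word) for rating, word in LABELS.items()}

def predict_rating(review: str) -> float:
    review_vec = embed(review)
    ratings = np.array(sorted(label_vectors), dtype=float)
    sims = np.array([cosine(review_vec, label_vectors[r]) for r in sorted(label_vectors)])
    probs = np.exp(sims) / np.exp(sims).sum()             # softmax over candidate ratings
    top2 = np.argsort(probs)[-2:][::-1]                   # indices of the two best ratings
    if probs[top2[0]] - probs[top2[1]] < 1e-3:
        weights = probs[top2] / probs[top2].sum()         # weighted average of the best two
        return float(np.dot(weights, ratings[top2]))
    return float(ratings[top2[0]])

print(predict_rating("La costruzione è ottimale e ben rifinita."))
```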
The score for this task is the Root Mean Squared Error (RMSE) between the predicted polarity and the polarity assigned by the user. The results on the DEV set are reported in table 5.

Model             RMSE score
ghostwriter19_a   0.6997
ghostwriter19_b   0.8526
Baseline AlBERTo  1.0806

Table 5: Task 3 DEV results

The loss in performance is 18%, but the entire previous training phase is skipped (table 5).

3 Results

The results obtained with the DEV dataset are very positive, both in terms of accuracy and of performance. ZSL has proven to be a remarkable technology to invest in, while the Ktrain models seem to suffer from heavy overfitting.

The aim of this research is not to deliver the most relevant model, but to prove that a model can be production-ready with less time and fewer resources. Even so, in all three tasks the models outperformed the baseline by a significant margin in terms of accuracy/RMSE.

3.1 Results for Task 1

The final results with the TEST dataset are:

Model              F1 score
ghostwriter19_a_D  0.6152
ghostwriter19_a_T  0.5399
Baseline AlBERTo   0.2556

Table 6: TEST dataset results for Task 1

The results are about 12% lower than those obtained in the research phase (table 6). It will be interesting to continue experimenting with different ONNX options to find a better combination of compatibility and performance.

3.2 Results for Task 2

The final results with the TEST dataset are:

Model              F1 score
ghostwriter19_a_D  0.6019
ghostwriter19_b_T  0.4994
Baseline AlBERTo   0.2

Table 7: TEST dataset results for Task 2

The loss from DEV to TEST is about 17% (table 7). However, the relative difference between the results of Tasks 1 and 2 has been maintained across the DEV and TEST datasets. This is in line with expectations: the worse model performance in Task 1 impacted Task 2 proportionally. In return, working on a better model will improve both tasks.

3.3 Results for Task 3

For Task 3 we have:

Model              RMSE score
ghostwriter19_a_D  0.6997
ghostwriter19_b_D  0.8526
ghostwriter19_a_T  0.81394
ghostwriter19_b_T  0.83479
Baseline AlBERTo   1.0806

Table 8: TEST dataset results for Task 3

The difference between the DEV and TEST datasets is marked here only for the trained model, about 14% (table 8). The untrained one actually performed slightly better, by about 2%, on the TEST dataset. This result confirms, as assumed, that an underperforming trained model can have the same performance as a model that uses ZSL.

The price to pay, however, is that the average inference time for the ZSL is 157x higher than that of the pure TensorFlow model obtained with Ktrain.

4 Conclusion

The results demonstrated that it is possible to create hybrid systems for training and inference to make the power of BERT more accessible. While a new, optimized model is being trained, an untrained ZSL model can fill the gap in the meantime.

Optimizing our models, and in the future training them, in a format intrinsically optimized for the platform and framework of choice does not affect performance or future use. The improvements obtained with the ONNX runtime on these Italian tasks are in line with what Microsoft demonstrated, for the English language, at the beginning of 2020 (Ning et al., 2020).

The next step is to make the ONNX export work with a Zero-Shot Learner [ZSL], in order to compensate, at least in part, for the more significant resources that it inevitably requires.
References

Barbieri, F., Basile, V., Croce, D., Nissim, M., Novielli, N., & Patti, V. (2016). Overview of the EVALITA 2016 SENTIment POLarity Classification Task. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & the Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). CEUR-WS.org.

Basile, V., Croce, D., Di Maro, M., & Passaro, L. (2020). EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020). CEUR-WS.org.

Bennici, M., & Portocarrero, X. S. (2018). Ensemble for aspect-based sentiment analysis. In T. Caselli, N. Novielli, V. Patti, & P. Rosso (Eds.), Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2018). CEUR-WS.org.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... Amodei, D. (2020). Language Models are Few-Shot Learners. https://arxiv.org/abs/2005.14165

de Mattei, L., de Martino, G., Iovine, A., Miaschi, A., Polignano, M., & Rambelli, G. (2020). Overview of the EVALITA 2020 Aspect Term Extraction and Aspect-based Sentiment Analysis (ATE_ABSITA) Task. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

Ning, E., Yan, N., Zhu, J., & Li, J. (2020). Microsoft open sources breakthrough optimizations for transformer inference on GPU and CPU. https://cloudblogs.microsoft.com/opensource/2020/01/21/microsoft-onnx-open-source-optimizations-transformer-inference-gpu-cpu/

Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., & Riedel, S. (2019). Language Models as Knowledge Bases? https://arxiv.org/abs/1909.01066

Polignano, M., Basile, P., de Gemmis, M., Semeraro, G., & Basile, V. (2019). AlBERTo: Italian BERT Language Understanding Model for NLP Challenging Tasks Based on Tweets. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019). CEUR-WS.org.

Pushp, P. K., & Srivastava, M. M. (2017). Train Once, Test Anywhere: Zero-Shot Learning for Text Classification. https://arxiv.org/abs/1712.05972

Schick, T., & Schütze, H. (2020). Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference. https://arxiv.org/abs/2001.07676