ghostwriter19 @ SardiStance: Generating new Tweets to Classify SardiStance EVALITA 2020 Political Tweets

Mauro Bennici
You Are My Guide, Torino
mauro@youaremyguide.com

Abstract

English. Understanding the events and the dominant thought is of great help to convey the desired message to our potential audience, be it marketing or political propaganda. Succeeding while the event is still ongoing is of vital importance to prepare alerts that require immediate action. A micro-message platform like Twitter is the ideal place to read a large amount of data linked to a theme and self-categorized by its users through hashtags and mentions. In this research, I will show how a simple translator can be used to bring to a common factor the styles, vocabulary, grammar, and other characteristics that make each of us unique in the way we express ourselves.

Italiano. Understanding the events and the dominant thought is of great help in conveying the desired message to our potential audience, whether it is marketing or political propaganda. Succeeding while the event is still ongoing is of vital importance in order to set up alerts that require immediate action. A micro-messaging platform like Twitter is the ideal place to read a large quantity of data linked to a theme, often self-categorized by its own users through hashtags and mentions. In this research I will show how a simple translator can be used to bring to a common factor the styles, vocabulary, grammar, and other characteristics that make each of us unique in our way of expressing ourselves.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Each of us has a unique way of writing. However, the fewer options we have to express our ideas, the more the necessary synthesis leads to the loss of information that is precious for accurately assessing our real intentions.

Furthermore, the more a subject is debated, the more the style and tone change: the conversation becomes full of irony or aggressiveness. Extrapolating a single line without its context is dangerous. The same sentence can have different interpretations depending on the moment in which it is pronounced, the audience it is intended for, the place where it is written, and the historical period in which it was composed.

My hypothesis is that we can translate all these different styles into a single "language style" that fully expresses the real intentions of the writer. The challenge is to understand when a user has expressed a comment in favor of, against, or neutral towards the Sardines, an Italian political movement. The research was carried out for the SardiStance task (Cignarella et al., 2020) at EVALITA 2020 (Basile et al., 2020). Two models were created for Task A, but they also performed well on Task B.

2 Description of the system

The two tasks are similar. In Task A, it is necessary to classify the stance of a tweet based only on its text. Task A is divided into two subtasks:

- Constrained: it is allowed to use additional resources such as a lexicon, but no other resources (such as labeled tweets) to help the training process.
- Unconstrained: each resource used must be reported in the final report.

In Task B, the context information provided by the post author can also be used. The additional information refers to:

- post statistics (favorites, retweets, replies, source);
- the author's profile (number of posts, number of followers, emoji in the bio);
- the author's circle of relationships (friends, replies, retweets, and quotes).

This research focuses on Task A Constrained. Given the constraints of Task A, it is not possible to access any information other than the text of the tweet, so I concentrated on understanding how to clean it up.

The training dataset contains:

- the tweet ID;
- the user ID;
- the text;
- the label.

The label options are:

- Against;
- Favor;
- Neutral / None.

To be sure not to use any data except the text, the user ID, which is useful for Task B, was discarded.

In order to validate my hypothesis, I used the AlBERTo model, created from tweets (Polignano et al., 2019), together with Ktrain (https://github.com/amaiya/ktrain), an automated training framework that wraps TensorFlow (https://www.tensorflow.org/), to classify the tweets. To avoid manual errors and any involuntary optimization, I used the autofit option.
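As an illustration of how little manual tuning this setup requires, the following minimal sketch reproduces a comparable pipeline with Ktrain's Transformer wrapper and autofit. The Hugging Face identifier used for AlBERTo, the file and column names, the label set, the batch size, and the learning rate are my assumptions, not values reported in the paper.

    # Minimal sketch of the classification setup; model id, file/column names, and
    # hyperparameters are assumptions, not values reported in the paper.
    import pandas as pd
    import ktrain
    from ktrain import text
    from sklearn.model_selection import train_test_split

    MODEL_NAME = "m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0"  # assumed AlBERTo checkpoint
    CLASSES = ["AGAINST", "FAVOR", "NONE"]

    df = pd.read_csv("sardistance_train.csv")  # assumed file with 'text' and 'label' columns
    x_tr, x_val, y_tr, y_val = train_test_split(
        df["text"].tolist(), df["label"].tolist(), test_size=0.1, random_state=42)

    t = text.Transformer(MODEL_NAME, maxlen=128, class_names=CLASSES)
    trn = t.preprocess_train(x_tr, y_tr)
    val = t.preprocess_test(x_val, y_val)

    learner = ktrain.get_learner(t.get_classifier(), train_data=trn, val_data=val, batch_size=16)

    # autofit picks a triangular learning-rate schedule and stops on plateau,
    # so no manual hyper-parameter search is involved.
    learner.autofit(2e-5)

    predictor = ktrain.get_predictor(learner.model, preproc=t)
    print(predictor.predict("nessuno tocchi le sardine"))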
First, I wrote a series of algorithms to make the texts to be compared homogeneous. The first one breaks up composed hashtags into sentences and words, using capital letters as separators. For example:

- #IoStoConLeSardine becomes "io sto con le sardine" ["I'm with the sardines"];
- #NessunoTocchiLeSardine becomes "nessuno tocchi le sardine" ["nobody touches the sardines"].

As a second step, I removed repeated vowels within a word, for example:

- "Svegliaaaa" becomes "Sveglia" ["Wake up!"].

I also replaced the word "sardine" with "PartitoPoliticoS" ["PoliticalPartyS"] to prevent the entity from being mistaken for the fish that is the movement's symbol. I did not remove any stop words, because they are useful for building the translation system.

At this point, I made a copy of the dataset in order to translate it. I used the spaCy (https://spacy.io/api/annotation) functions for POS tagging, dependency parsing, and entity recognition to obtain all the essential components of my translator.

The translator is a simple text representation: it rewrites each sentence following the scheme

- subject adjectives;
- subjects;
- verb in the infinitive form;
- object adjectives;
- objects;
- exclamations / other words.

At this stage, the words are not modified to make the sentence grammatically correct. Words only exchange places; only the verbs are changed, to the infinitive form. Entities of type person [PER] take precedence over the others.

The translator concentrates its attention on the aspects inside the sentences, to be sure not to remove words carrying a valid sentiment polarity (Barbieri et al., 2016) and to avoid losing them, as can happen in a round-trip translation through external translation services (Marivate & Sefara, 2020). The attempt to represent the text in a form that is more recognizable and identifiable for an algorithm rests on the fact that the algorithm can still recognize the entities described and the polarity expressed for each of them. For this purpose, the translator makes several attempts to fit words into their suggested positions.
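The sketch below illustrates one possible implementation of these cleaning steps and of the reordering scheme with spaCy. The regular expressions, the it_core_news_sm pipeline, the dependency labels used to locate subjects, objects, and their adjectives, and the use of the lemma as a stand-in for the infinitive are assumptions of mine; the paper does not publish its code.

    # Illustrative sketch of the cleaning steps and of the reordering "translator";
    # regexes, pipeline name, and dependency labels are assumptions, not the paper's code.
    import re
    import spacy

    nlp = spacy.load("it_core_news_sm")  # assumed Italian pipeline with tagger, parser, and NER

    def split_hashtag(tag: str) -> str:
        # "#IoStoConLeSardine" -> "io sto con le sardine"
        words = re.findall(r"[A-ZÀ-Ý]?[a-zà-ÿ]+|\d+", tag)
        return " ".join(w.lower() for w in words)

    def clean(text: str) -> str:
        text = re.sub(r"#\w+", lambda m: split_hashtag(m.group()), text)        # expand hashtags in place
        text = re.sub(r"([aeiouàèéìòù])\1+", r"\1", text, flags=re.IGNORECASE)  # "Svegliaaaa" -> "Sveglia"
        text = re.sub(r"\bsardin\w*\b", "PartitoPoliticoS", text, flags=re.IGNORECASE)
        return text

    def translate(text: str) -> str:
        # Rewrite every sentence as: subject adjectives, subjects, verbs (lemma, i.e. the
        # infinitive in Italian), object adjectives, objects, everything else.
        doc = nlp(clean(text))
        rewritten = []
        for sent in doc.sents:
            subjects = [t for t in sent if t.dep_ in ("nsubj", "nsubj:pass")]
            objects = [t for t in sent if t.dep_ in ("obj", "iobj")]
            verbs = [t.lemma_ for t in sent if t.pos_ == "VERB"]
            subj_adj = [c.text for t in subjects for c in t.children if c.dep_ == "amod"]
            obj_adj = [c.text for t in objects for c in t.children if c.dep_ == "amod"]
            used = set(subjects) | set(objects) | \
                   {c for t in subjects + objects for c in t.children if c.dep_ == "amod"}
            rest = [t.text for t in sent
                    if t not in used and t.pos_ != "VERB" and not t.is_punct]
            rewritten.append(" ".join(subj_adj + [t.text for t in subjects] + verbs
                                      + obj_adj + [t.text for t in objects] + rest))
        return " ".join(rewritten)

    print(translate("#NessunoTocchiLeSardine le sardine pacifiche hanno riempito la piazza!"))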
Finally, I trained two models with the Ktrain framework. Model 1, which uses the translated tweets, was submitted as ghostwriter19_Task_A_1_c. Model 2, trained only on the cleaned tweets, was submitted as ghostwriter19_Task_A_2_c.

2.1 First results

The models are evaluated with the F1-score. The main score is the average of the F1-score on the Favor tweets and the F1-score on the Against tweets.

When comparing the two models, the first result is that the translated tweets performed worse, albeit by a few percentage points (table 1).

Model                       F1-Score
ghostwriter19_Task_A_1_c    0.5613
ghostwriter19_Task_A_2_c    0.6004
Estimated Baseline          0.5386

Table 1: First results

Analyzing the results of both models in detail (tables 2 and 3), we have that:

ghostwriter19_Task_A_1_c    F1-Score
Against                     0.69
Favor                       0.43
Neutral                     0.42

Table 2: F1-score details of model 1

ghostwriter19_Task_A_2_c    F1-Score
Against                     0.70
Favor                       0.50
Neutral                     0.32

Table 3: F1-score details of model 2

The problem is evident: model 1 has a harder time distinguishing the Favor tweets from the Neutral ones. The good news is that both models overcame the estimated baseline.

2.2 Hashtags and Mentions

Considering that on Twitter hashtags are also used for classification purposes, the operation that replaces them was modified: the hashtags are now appended at the end of the new tweets. Mentions are also taken into account and processed in the same way as hashtags (table 4).
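Before looking at the numbers, the sketch below shows one way this variant could be implemented: hashtags and mentions are collected from the original tweet and re-attached, expanded, at the end of the generated text. The helper names and the regular expressions are mine, not the paper's.

    # Illustrative sketch: expand hashtags and mentions and append them at the end
    # of the generated tweet instead of replacing them in place (helper names are mine).
    import re

    TAG = re.compile(r"[#@]\w+")

    def expand_tag(tag: str) -> str:
        # "#IoStoConLeSardine" -> "io sto con le sardine"; "@UnUtente" -> "un utente"
        return " ".join(w.lower() for w in re.findall(r"[A-ZÀ-Ý]?[a-zà-ÿ]+|\d+", tag))

    def append_tags(original_tweet: str, translated_body: str) -> str:
        tags = TAG.findall(original_tweet)
        return (translated_body + " " + " ".join(expand_tag(t) for t in tags)).strip()

    # translated_body would be the output of the reordering step, run on the tweet
    # with hashtags and mentions stripped out.
    print(append_tags("le piazze sono piene #IoStoConLeSardine @UnUtente",
                      "piene piazze essere"))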
Model                       F1-Score
ghostwriter19_Task_A_1_c    0.5822
ghostwriter19_Task_A_2_c    0.6004
Estimated Baseline          0.5386

Table 4: Model 1 with hashtags and mentions in the translated tweets

Analyzing the results in detail (table 5), we can see that:

ghostwriter19_Task_A_1_c    F1-Score
Against                     0.71
Favor                       0.45
Neutral                     0.41

Table 5: F1-score details of model 1 with hashtags and mentions in the translated tweets

The model gained two percentage points on both Against and Favor, against a one-point loss on Neutral. Unfortunately, it still remains two points below model 2, trained on the cleaned tweets only.

2.3 Passive verbs

Analyzing the newly generated texts, I noticed that essential information was lost by putting all the verbs in the infinitive: if a verb was in the passive form, the subject and the object of the sentence ended up reversed. At the same time, I noticed that very long tweets contained more than one sentence.

I therefore modified the translator to distinguish passive from active verbs, swapping the sentence's subject and object when necessary. Since a tweet can contain several sentences, the hashtags are now appended only once, at the end of the whole generated tweet (table 6).
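A sketch of how the passive/active distinction could be added to the reordering step is shown below. The dependency labels used to detect the passive construction and the agent (nsubj:pass, aux:pass, obl with a "da" case marker) and the swap logic are assumptions based on spaCy's Italian models, not code from the paper.

    # Illustrative sketch of the passive/active handling; the dependency labels
    # and the swap logic are assumptions, not the paper's code.
    import spacy

    nlp = spacy.load("it_core_news_sm")  # assumed Italian pipeline

    def subjects_and_objects(sent):
        """Return (subjects, objects), swapped when the sentence is in the passive voice."""
        passive = any(t.dep_ in ("nsubj:pass", "aux:pass") for t in sent)
        subjects = [t for t in sent if t.dep_ in ("nsubj", "nsubj:pass")]
        objects = [t for t in sent if t.dep_ in ("obj", "iobj")]
        # In an Italian passive the semantic subject is usually the "da ..." complement.
        agents = [t for t in sent
                  if t.dep_.startswith("obl")
                  and any(c.dep_ == "case" and c.lower_.startswith("da") for c in t.children)]
        if passive:
            return (agents or objects), subjects
        return subjects, objects

    doc = nlp("La piazza è stata riempita dalle sardine.")
    for sent in doc.sents:
        subj, obj = subjects_and_objects(sent)
        print([t.text for t in subj], [t.text for t in obj])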
Model                       F1-Score
ghostwriter19_Task_A_1_c    0.6306
ghostwriter19_Task_A_2_c    0.6004
Estimated Baseline          0.5386

Table 6: Model 1 with hashtags and mentions in the translated tweets, plus active/passive verbs

Analyzing the results in detail (table 7), we can see that:

ghostwriter19_Task_A_1_c    F1-Score
Against                     0.76
Favor                       0.50
Neutral                     0.40

Table 7: F1-score details of model 1 with hashtags and mentions in the translated tweets, plus active/passive verbs

The model gained five percentage points on the Against and Favor tweets, against a further one-point loss on the Neutral ones. The translation model is now the best model.

3 Results

Model 1 was ultimately 3 percentage points better than Model 2 on the training dataset. The better performance of the model was also confirmed on the test dataset, with an advantage of 2.5 percentage points.

3.1 Results for Task A

The final results on the test dataset are:

Model                       F1-score
ghostwriter19_Task_A_1_c    0.6257
ghostwriter19_Task_A_2_c    0.6004
Baseline                    0.5784

Table 8: Test dataset results for Task A

Model 1 is about 7.5% better than the baseline (table 8). It should be remembered that both models were trained with the autofit option, i.e. without any specific tuning, in order to verify whether a "translation" of the original text could bring evident advantages.

3.2 Results for Task B

Although no context information was used, I still submitted the Task A predictions to Task B. The final results on the test dataset are:

Model                       F1-score
ghostwriter19_Task_A_1_c    0.6257
ghostwriter19_Task_A_2_c    0.6004
Baseline                    0.6284

Table 9: Test dataset results for Task B

Even if model 1 was not able to reach the proposed baseline, the difference between the two systems is 0.4% (table 9). The detailed results of the models are shown in tables 10 and 11.

3.3 Detailed results for Task A

model     f-avg   prec_a  prec_f  prec_n  recall_a  recall_f  recall_n  f_a     f_f     f_n
1_c       0.6257  0.8106  0.4709  0.3226  0.6981    0.5357    0.4651    0.7502  0.5012  0.3810
2_c       0.6004  0.8094  0.4772  0.2921  0.6523    0.4796    0.5349    0.7224  0.4784  0.3778
baseline  0.5784  0.7549  0.3975  0.2589  0.6806    0.4949    0.2965    0.7158  0.4409  0.2764

Table 10: Task A detailed results of the proposed models compared to the baseline model

3.4 Detailed results for Task B

model     f-avg   prec_a  prec_f  prec_n  recall_a  recall_f  recall_n  f_a     f_f     f_n
1_c       0.6257  0.8106  0.4709  0.3226  0.6981    0.5357    0.4651    0.7502  0.5012  0.3810
2_c       0.6004  0.8094  0.4772  0.2921  0.6523    0.4796    0.5349    0.7224  0.4784  0.3778
baseline  0.6284  0.7845  0.4506  0.3054  0.7507    0.5357    0.2965    0.7672  0.4895  0.3009

Table 11: Task B detailed results of the proposed models compared to the baseline model

4 Conclusion

In a preliminary way, the final results demonstrate that it is possible to obtain an improvement in the predictions by reducing the differences of expression to a predetermined structure.

The system is, however, already more efficient, in terms of training times and final scores, than the Bi-LSTM ensembles that were used successfully up to two years ago (Bennici & Portocarrero, 2018).

The next step is to optimize the model's training in order to ascertain whether the performance gain is maintained, and to what extent. At the same time, the translator can be improved by switching to a sequence-to-sequence system for a meaningful and efficient text representation that will include, among other things, changing the form of every word according to the grammar and the original intention of the writer (Lewis et al., 2019).

References

Barbieri, F., Basile, V., Croce, D., Nissim, M., Novielli, N., & Patti, V. (2016). Overview of the Evalita 2016 SENTIment POLarity Classification Task. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy. CEUR-WS.org.

Basile, V., Croce, D., Di Maro, M., & Passaro, L. (2020). EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020). CEUR-WS.org.

Bennici, M., & Portocarrero, X. S. (2018). Ensemble for aspect-based sentiment analysis. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18), Turin, Italy. CEUR-WS.org.

Cignarella, A. T., Lai, M., Bosco, C., Patti, V., & Rosso, P. (2020). SardiStance@EVALITA2020: Overview of the Task on Stance Detection in Italian Tweets. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., … Zettlemoyer, L. (2019). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. https://arxiv.org/abs/1910.13461

Marivate, V., & Sefara, T. (2020). Improving Short Text Classification Through Global Augmentation Methods. In Lecture Notes in Computer Science: Machine Learning and Knowledge Extraction, 385-399. doi:10.1007/978-3-030-57321-8_21

Polignano, M., Basile, P., de Gemmis, M., Semeraro, G., & Basile, V. (2019). AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019). CEUR-WS.org.