<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TextWiller @ SardiStance, HaSpeede2: Text or Con-text? A Smart Use of Social Network Data in Predicting Polarization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Federico Ferraccioli</string-name>
          <email>ferraccioli@stat.unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Sciandra</string-name>
          <email>andrea.sciandra@unimore.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mattia Da Pont</string-name>
          <email>mattia.dapont@wmr.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Girardi</string-name>
          <email>paolo.girardi@unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dario Solari</string-name>
          <email>dario.solari@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Madonna</string-name>
          <email>domenico.madonna@studenti.unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Livio Finos</string-name>
          <email>livio.finos@unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università degli Studi di Modena e Reggio Emilia</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università degli Studi di Padova</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this contribution we describe the systems (i.e. statistical models) used to participate in the EVALITA 2020 campaign, SardiStance (Tasks A and B) and HaSpeeDe2 (Tasks A and B). We first developed a classifier by extracting features from the texts and from the social network of the users. Then, we fitted the data through extreme gradient boosting, with cross-validation tuning of the hyper-parameters. A key factor for the good performance in SardiStance Task B was the feature extraction through Multidimensional Scaling of the distance matrix (minimum path, undirected graph) applied to each network. The second system exploits the same features as above, but it trains and performs predictions in two steps. Its performance proved to be lower than that of the single-step model.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        In this paper we describe and show the results of
the approach we developed to participate in the
SardiStance task
        <xref ref-type="bibr" rid="ref3">(Cignarella et al., 2020)</xref>
        for polarity detection (i.e. Tasks A and B, both with
constrained data) within the EVALITA campaign
        <xref ref-type="bibr" rid="ref1">(Basile et al., 2020)</xref>
        . The goal of this task was
stance detection in Italian tweets about the
Sardines movement. Task A is a three-class
classification task where the system has to
predict whether a tweet is in Favour, Against or
Neutral/none towards the given target, exploiting only
textual information, i.e. the text of the tweet.
      </p>
      <p>
        Copyright © 2020 for this paper by its authors. Use
permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).
      </p>
      <p>
        Task B is the same as the first one, except that a wider
range of contextual information is available, that
is: the number of retweets, the number of favours,
the type of posting source (e.g. iOS or Android),
and the date of posting. Furthermore, the networks
of the users based on Friends, Quote, Reply and
Retweet were provided. We developed two
systems (i.e. models) extracting features from the
text (both for Task A and B) and from the social
network of the users (only for Task B) and then
exploited extreme gradient boosting
        <xref ref-type="bibr" rid="ref2">(Chen et al.,
2020)</xref>
        to train the model on the data. A
cross-validation hyper-parameter tuning was used to
define the optimal set of parameters.
      </p>
      <p>
        We use a very similar strategy for HaSpeede2
        <xref ref-type="bibr" rid="ref10">(Sanguinetti et al., 2020)</xref>
        , where the goal is the
prediction of Hate Speech (i.e. Task A) and
Stereotype (i.e. Task B). In this case, however, the
sample contains documents from three different
topics. We believe that these may be characterized
by different vocabularies and kinds of speech. We
take this into account in the prediction model, as
explained in Section 3.3.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Features extraction and E.D.A.</title>
      <sec id="sec-2-1">
        <title>2.1 Text-based Features extraction</title>
        <p>
          The text preprocessing was done in R
          <xref ref-type="bibr" rid="ref9">(R Core
Team, 2019)</xref>
          software with the package TextWiller
          <xref ref-type="bibr" rid="ref11">(Solari et al., 2019)</xref>
          (function normalizzaTesti with
          (function normalizzaTesti with
default parameters). We describe the process
used to define the features both for SardiStance
and HaSpeeDe2.
        </p>
        <p>The first set of features is defined by the
columns of the DocumentTermMatrix, which is a
matrix having documents on the rows and a
column for each term. Each cell contains the
count of the given term in the document. We defined
the matrix on the basis of the normalized texts,
removing terms (i.e. columns) with a sparsity
larger than .9. This procedure generated a vocabulary
of 317 terms for SardiStance and 170 terms
for HaSpeeDe2.</p>
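        <p>The document-term matrix with its sparsity cut-off can be sketched as follows. This is a minimal Python illustration on a toy corpus (the original pipeline is in R, and the tokenizer, corpus and threshold here are invented for the example): the sparsity of a term is the share of documents in which it does not occur, and terms above the cut-off are dropped.</p>
        <preformat>
```python
import numpy as np

def dtm(docs, max_sparsity=0.9):
    """Build a document-term matrix and drop overly sparse terms.

    Sparsity of a term = share of documents in which it does NOT occur.
    """
    vocab = sorted({t for d in docs for t in d.split()})
    M = np.array([[d.split().count(t) for t in vocab] for d in docs])
    keep = (M == 0).mean(axis=0) <= max_sparsity
    return M[:, keep], [t for t, k in zip(vocab, keep) if k]

# Toy normalized corpus (not the SardiStance data).
docs = ["sardine in piazza", "piazza bologna", "sardine sardine"]
M, terms = dtm(docs, max_sparsity=0.5)
# terms -> ['piazza', 'sardine']; 'bologna' and 'in' are too sparse
```
        </preformat>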
        <p>In Figure 1 we plot the term frequencies for the
"In favour" and "Against" stances. The terms
close to the bisector are the ones with a
similar frequency in the two classes (such as "caro",
"alto", "acqua"), so these terms probably do not
carry much information useful to our cause. More
interesting are the terms far from the
bisector, like "bolognanonsilega", "antifascismo",
"abuso" or "branco", which we expected
to carry more weight in the classification model.</p>
        <p>Figure 1: term frequencies in the AGAINST (x-axis) and FAVOR (y-axis) classes, on logarithmic scales (0.010% to 10.000%).</p>
        <p>
          Further text features considered were: the
number of characters and the number of words, the
counts of ”?” and ”!” for each document.
Moreover, a sentiment value was computed for each
document by the sentiment function of the R package
TextWiller
          <xref ref-type="bibr" rid="ref11">(Solari et al., 2019)</xref>
          .
        </p>
        <p>Figure 2 shows the association between True
Stances and Sentiment. This variable will be used
as a feature in Task A and B models.</p>
        <p>Previous analyses, such as the attribution of sentiment
through a lexicon, rely on a bag-of-words
(BoW) approach. One of the most notable
disadvantages of BoW is that it generally fails to
capture word semantics, since word order is ignored.
A common solution to this problem is the
use of Word Embeddings (WE).</p>
        <p>
          WE techniques are
based on neural networks and generate dense
vectors for word representation, by defining a
context window, i.e. a string of words before and
after a focal word, that will be used to train a
word embedding model. In WE, words are
represented as coordinates on a latent multidimensional
space derived from an underlying deep learning
model that considers the contiguous words. So,
for both tasks we also used a WE technique to
produce context-based features. In particular, we
used the word2vec model
          <xref ref-type="bibr" rid="ref8">(Mikolov et al., 2013)</xref>
          ,
a widely used natural language processing
technique to extract word associations from a large
corpus of text. word2vec comprises two neural
prediction models: the continuous bag-of-words
(CBoW) model and the Skip-gram (SG) model.
The CBoW model predicts a target word from its
context words, while the SG model predicts the
context words given a target word. Since WE
needs a huge corpus of textual data for training
and given the limited amount of tweets, we
augmented the data with the PAISÀ corpus
          <xref ref-type="bibr" rid="ref7">(Lyding et
al., 2013)</xref>
          , a large collection of Italian web texts.
We trained the model with embedded dimension
set to 50 and a 5 words context window. The
results for each word are then combined via
averaging to obtain the final features.
        </p>
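        <p>The last step above, averaging the word vectors of a document into a single feature vector, can be sketched as follows (a Python illustration with random stand-in vectors in place of a trained 50-dimensional word2vec model; the vocabulary is invented for the example):</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in embeddings: in the real pipeline these would come from
# word2vec trained on the tweets augmented with the PAISÀ corpus.
emb = {w: rng.normal(size=50) for w in ["sardine", "piazza", "bologna"]}

def doc_vector(tokens, emb, dim=50):
    """Average the vectors of in-vocabulary tokens; zeros if none match."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

v = doc_vector(["sardine", "piazza", "oov-token"], emb)  # 50-dim feature
```
        </preformat>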
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Network-based Features extraction</title>
        <p>A key point to explain the good performance in
the SardiStance Task B (i.e. second best score,
F-avg = 0.7309) is the efficient extraction of
features from the four available networks, that is:
Friends, Retweet, Reply, and Quote. For each
network, a distance matrix among subjects was
computed. The distance used is the shortest path,
forcing the graph to be undirected. The distance
matrix was then projected into a Euclidean space
through a Multidimensional Scaling (MDS). Since
we expected the users to be strongly polarized
in clusters within the network, we also expected
the largest dimension to discriminate among the
stances. Therefore, we retained the first and
second dimension for each of the four networks. This
expectation was confirmed by Exploratory Data
Analysis. As an example, in Figure 3 we show
the scatter plot of the first two dimensions for the
Friends network. The first dimension clearly
discriminates the three stances (in particular Favour
vs Against).</p>
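        <p>A minimal sketch of this pipeline, with shortest paths computed by breadth-first search and a classical MDS written directly in Python (the toy two-clique graph below stands in for the real user networks, and the original computation was done in R):</p>
        <preformat>
```python
from collections import deque
import numpy as np

def shortest_path_matrix(n, edges):
    """All-pairs shortest-path distances on an undirected, unweighted graph."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    D = np.full((n, n), np.inf)
    for s in range(n):
        D[s, s] = 0.0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if np.isinf(D[s, v]):
                    D[s, v] = D[s, u] + 1
                    q.append(v)
    return D

def classical_mds(D, k=2):
    """Project a distance matrix into k Euclidean dimensions (classical MDS)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # double-centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # inner-product (Gram) matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:k]         # keep the k largest
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

# Two cliques joined by one edge, mimicking polarized user clusters.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
D = shortest_path_matrix(6, edges)
coords = classical_mds(D, k=2)   # first two MDS dimensions as features
```
        </preformat>
        <p>On this toy graph the first MDS dimension separates the two cliques, mirroring the polarization we expected in the user networks.</p>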
        <p>
Due to the relatively small sample size of the
train set (composed of 2,132 tweets in Italian;
we state the language following the Bender Rule),
we decided not to use any neural
network. Instead, we preferred a gradient boosting
approach
          <xref ref-type="bibr" rid="ref6">(Friedman, 1999)</xref>
          . Since this method has
been developed within the statistical learning
community, we use the word "model" as a
synonym for "system". We adopted the R
implementation of XGBoost (eXtreme Gradient Boosting)
          <xref ref-type="bibr" rid="ref2">(Chen et al., 2020)</xref>
          . A cross-validation parameter
tuning was used to define the optimal set of
parameters.
        </p>
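        <p>The cross-validation tuning can be sketched as a plain grid search over hyper-parameters. This Python sketch uses a toy threshold classifier in place of XGBoost, and all names (cv_tune, fit, score, thr) are illustrative, not the paper's actual code:</p>
        <preformat>
```python
from itertools import product
import numpy as np

def cv_tune(X, y, fit, score, grid, k=3, seed=0):
    """Return (best mean CV score, best hyper-parameter dict)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    best = None
    for params in (dict(zip(grid, v)) for v in product(*grid.values())):
        fold_scores = []
        for i in range(k):
            train = np.concatenate([f for j, f in enumerate(folds) if j != i])
            model = fit(X[train], y[train], **params)
            fold_scores.append(score(model, X[folds[i]], y[folds[i]]))
        if best is None or np.mean(fold_scores) > best[0]:
            best = (np.mean(fold_scores), params)
    return best

# Toy model: classify as 1 when the first feature exceeds `thr`.
def fit(Xtr, ytr, thr):
    return thr

def score(thr, Xte, yte):
    return float(np.mean((Xte[:, 0] > thr) == yte))

X = np.array([[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])
y = np.array([0, 0, 0, 1, 1, 1])
best_score, best_params = cv_tune(X, y, fit, score, {"thr": [0.0, 0.5, 1.0]})
# best_params -> {'thr': 0.5}, the only threshold separating the classes
```
        </preformat>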
      </sec>
      <sec id="sec-2-3">
        <title>3.1 System One</title>
        <p>As features for Task A, we used information taken
from the text, that is, words/emoticons, special
characters, scores of word embedding (50
dimensions), sentiment, length of the message and
number of words.</p>
        <p>For Task B we used the same features used for
Task A together with the first and the second
dimension extracted from the MDS computed for
each network (as explained in Section 2.2).</p>
      </sec>
      <sec id="sec-2-4">
        <title>3.2 System Two</title>
        <p>System Two uses the same features as
System One for Tasks A and B; the difference lies in
how it targets the evaluation metric, the average of
F1-Against and F1-Favour. With the aim of casting the model into
the metric, we fitted two separate models (i.e.
one for Favour and one for Against) in a first
step, and then combined the two predictions in
a second step. To be more precise, the two
models used in the first step predict whether a document is
in Favour or not (first model) and whether it is Against
or not (second model). The two predictions are
combined in a final score by a simple subtraction:
(Predicted1==Favour) - (Predicted2==Against),
which yields a final score in {-1, 0, 1}.</p>
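        <p>The subtraction above can be made concrete in a few lines (a Python sketch; the label strings are illustrative):</p>
        <preformat>
```python
def combine(pred_favour, pred_against):
    """(Predicted1==Favour) - (Predicted2==Against) -> score in {-1, 0, 1}."""
    return int(pred_favour == "FAVOUR") - int(pred_against == "AGAINST")

# One case per possible outcome of the two first-step models.
scores = [combine(*p) for p in
          [("FAVOUR", "other"), ("other", "AGAINST"), ("other", "other")]]
# scores -> [1, -1, 0]: Favour, Against, and None respectively
```
        </preformat>
        <p>Note that a document flagged by both first-step models also receives a score of 0, consistently with the subtraction.</p>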
      </sec>
      <sec id="sec-2-5">
        <title>3.3 System for HaSpeeDe2</title>
        <p>The corpus of documents for HaSpeeDe2 is a
sample of tweets from three different topics, namely
Immigrants, Muslims and Roma communities.
Since the vocabulary may change across topics, we
want our models to account for this specificity. We
do so with models that use the estimated
topic. The topic is estimated by an XGBoost model
(trained by cross-validation). Table 1 and Table 2
report the confusion matrix and the performance
indices of the trained model (cross-validated).</p>
        <sec id="sec-2-5-1">
          <p>Table 1 (Prediction vs. Reference): Immigrants 408, Rom 24, Terrorism 41.</p>
          <p>System One is based on an XGBoost with
binomial response (for both tasks). The fitting is done
separately, after splitting the sample based on
the topic classification provided by the model
described above. The model is
trained with the same cross-validation strategy used
to train System One for the SardiStance task.</p>
          <p>System Two is based on an XGBoost with
binomial response (for both tasks). The estimate is
computed on the whole sample (i.e. without
the splitting of System One), but the topic classification is
used as a feature.</p>
          <p>For both systems the basic set of features is the
same used in SardiStance Task A.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 Results and discussion</title>
      <sec id="sec-3-1">
        <title>4.1 Results for HaSpeeDe2</title>
        <p>The results of the two systems are disappointing.
The final positions are always at the very bottom of
the rankings. This may be partially due to a
sub-optimal parameter optimization (we discovered a
mistake in the parameter settings), but this is
certainly not the only reason. We will take this result
as an opportunity to revise the approach.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2 Results for SardiStance</title>
        <p>System Two performed poorly in the final score
for both tasks. Our intuition is that the benefit
of a separate optimization of F1-Against and F1-Favour
was overcome by the gain of a joint training
(i.e. System One). We will devote further efforts
to better understand this result.</p>
        <p>The results for System One are given in Table 3
for Task A and Table 4 for Task B, respectively.</p>
        <p>The rank of System One in Task A is 13, which
is just below the benchmark. The System was
weak in the correct estimation of the Favour stance
(F1-Favour = 0.3791), while it estimated the
Against stance fairly well (F1-Against = 0.776).</p>
        <p>The best performance of System One is on Task
B (F1-Against = 0.8505, F1-Favour = 0.6114),
where it scored the 2nd position.</p>
        <sec id="sec-3-2-2">
          <title>5 Conclusions</title>
          <p>For SardiStance, the System One proposed here
performed well in Task B, while it obtained a
much poorer result in Task A. It exploits a simple
method to handle the network-based information,
while further refinement should be made in the
exploitation of the text-based information. In this way
we want to stress the importance of data mashup,
as the system we deployed showed better results
for Task B, which contains, in addition to texts,
information of a different nature derived from
network structures.</p>
          <p>
            It is to be expected that the different networks
carry similar information. A future direction of
research should be the joint analysis of the
networks. There is a lively community
working on multilayer networks
            <xref ref-type="bibr" rid="ref4">(De Domenico et al.,
2013)</xref>
            <xref ref-type="bibr" rid="ref5">(Durante et al., 2017)</xref>
            that may inspire a more
effective use of this joint information.
          </p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          , Danilo Croce, Maria Di Maro, and
          <string-name>
            <given-names>Lucia C.</given-names>
            <surname>Passaro</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          . In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors,
          <source>Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2020</year>
          ). CEUR-WS.org.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Tianqi</given-names>
            <surname>Chen</surname>
          </string-name>
          , Tong He,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Benesty</surname>
          </string-name>
          , Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, Rory Mitchell, Ignacio Cano, Tianyi Zhou,
          <string-name>
            <given-names>Mu</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Junyuan</given-names>
            <surname>Xie</surname>
          </string-name>
          , Min Lin,
          <string-name>
            <given-names>Yifeng</given-names>
            <surname>Geng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yutian</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <year>2020</year>
          . xgboost: Extreme Gradient Boosting.
          <source>R package version 1.0.0.2.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Alessandra Teresa</given-names>
            <surname>Cignarella</surname>
          </string-name>
          , Mirko Lai, Cristina Bosco, Viviana Patti, and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>SardiStance@EVALITA2020: Overview of the Task on Stance Detection in Italian Tweets</article-title>
          . In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors,
          <source>Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA</source>
          <year>2020</year>
          ). CEUR-WS.org.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Manlio</given-names>
            <surname>De Domenico</surname>
          </string-name>
          , Albert Solé-Ribalta, Emanuele Cozzo, Mikko Kivelä,
          <string-name>
            <given-names>Yamir</given-names>
            <surname>Moreno</surname>
          </string-name>
          , Mason A. Porter, Sergio Gómez, and
          <string-name>
            <given-names>Alex</given-names>
            <surname>Arenas</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Mathematical Formulation of Multilayer Networks</article-title>
          .
          <source>Physical Review X</source>
          ,
          <volume>3</volume>
          (
          <issue>4</issue>
          ):
          <fpage>041022</fpage>
          , October.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Daniele</given-names>
            <surname>Durante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>David B.</given-names>
            <surname>Dunson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Joshua T.</given-names>
            <surname>Vogelstein</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Nonparametric bayes modeling of populations of networks</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          ,
          <volume>112</volume>
          (
          <issue>520</issue>
          ):
          <fpage>1516</fpage>
          -
          <lpage>1530</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Jerome H.</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Stochastic gradient boosting</article-title>
          .
          <source>Computational Statistics and Data Analysis</source>
          ,
          <volume>38</volume>
          :
          <fpage>367</fpage>
          -
          <lpage>378</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Verena</given-names>
            <surname>Lyding</surname>
          </string-name>
          , Egon Stemle, Claudia Borghetti, Marco Brunello, Sara Castagnoli, Felice Dell'Orletta, Henrik Dittmann, Alessandro Lenci, and
          <string-name>
            <given-names>Vito</given-names>
            <surname>Pirrelli</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>PAISÀ corpus of Italian web texts</article-title>
          .
          <source>Eurac Research CLARIN Centre.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>R Core</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <year>2019</year>
          . R:
          <article-title>A Language and Environment for Statistical Computing</article-title>
          . R Foundation for Statistical Computing, Vienna, Austria.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Manuela</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          , Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and
          <string-name>
            <given-names>Irene</given-names>
            <surname>Russo</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Overview of the EVALITA 2020 Second Hate Speech Detection Task (HaSpeeDe 2)</article-title>
          . In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors,
          <source>Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020), Online</source>
          . CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Dario</given-names>
            <surname>Solari</surname>
          </string-name>
          , Andrea Sciandra, and
          <string-name>
            <given-names>Livio</given-names>
            <surname>Finos</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>TextWiller: Collection of functions for text mining, specially devoted to the Italian language</article-title>
          .
          <source>Journal of Open Source Software</source>
          ,
          <volume>4</volume>
          (
          <issue>41</issue>
          ):
          <fpage>1256</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>