<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Venses @ HaSpeeDe2 &amp; SardiStance: Multilevel Deep Linguistically Based Supervised Approach to Classification</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Ca' Bembo - Dorsoduro 1075 - Università Ca' Foscari - 30131 Venezia</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>In this paper we present the results obtained with ItVENSES, a system for syntactic and semantic processing based on the parser for Italian called ItGetaruns, which is used to analyse each sentence. In previous EVALITA tasks we only used semantics to produce the results. In this year's EVALITA, we used both a fully and a mixed statistically based approach, as well as the semantic one used previously. The statistical approaches are all characterized by the use of n-grams and the usual tf-idf indices. We added another parameter, the Kullback-Leibler Divergence, to compute similarities. In addition we used emoticons and hashtags. Results for the two permitted runs have been fairly low, around 40% F1-score. We continued producing other runs on the basis of the statistical approach and, after receiving the gold test version and the evaluation script, we discovered that in one of these additional runs - the fourth - we improved up to 54% macro-F1 for the HaSpeeDe2 task and up to 48% macro-F1 for SardiStance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In this paper we will present work carried out by
the Venses Team in Evalita 2020
        <xref ref-type="bibr" rid="ref1">(Basile et al., 2020)</xref>
        . We will comment in the following both
on the SardiStance Task
        <xref ref-type="bibr" rid="ref2">(Cignarella et al., 2020)</xref>
        and
on the HaSpeeDe2 Task (Sanguinetti et al.
2020). The reason for this is discussed in the
sections below, but it was basically determined
by the overlap in the choice of the features
to adopt for the classification tasks. (Copyright © 2020 for this paper by its authors. Use
permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).) To show
how the two tasks share part of the features we
created a table where we compare the output of
the first step in the process, i.e. the creation of a
frequency list dictionary. The frequency list that
we show in Table 1. below is made of nominal
entities that were extracted automatically from
the total frequency list. We call this frequency
list the InstanceList, the position occupied by
each entry the InstanceListPosition, and the rank
the InstanceRank. In the first column we indicate
the rank; in the following two columns we report the
word(s), preceded by their frequency values. In the
second pair of columns, columns 4 and 5,
we make a comparison between the two corpora
based on the frequency lists and the rank each entry
has received.
      </p>
      <p>We use three types of values: the frequency
value from the general frequency list derived from
the corpus; the rank position in the InstanceList
in case the word appears in both InstanceLists;
and the word "nil" in case the entry is not present
in the general frequency list of the comparing
corpus. In column 4 the comparison is made
between the first list (HaSpeeDe2), with its instances,
and the second list (SardiStance). Every word is
associated with its rank in the InstanceList and a
second element which can be one of three: the
position in the second list, if available; the
position in the general FrequencyList of the
compared corpus; or nil in case the word is not present.
For instance, we can see that the words rom,
migranti, profughi, terroristi, nomadi, islamici (Roma,
migrants, refugees, terrorists, nomads, Islamists)
are not present in the second list, so they
characterize the first corpus (HaSpeeDe2) as
different from the SardiStance one, specializing
it in a particular list of topics or keywords. When
we look at column 5, where the comparison is
made in reverse order, we discover that sardine,
bibbiano, bonaccini (sardines, Bibbiano,
Bonaccini) are not present in the first list. Most
importantly, we discovered that the most frequent
words of the two lists are not shared: "rom" in
list 1 and "sardine" in list 2. In the sections
below we present the module for supervised
automatic classification and the experiments that
we devised using basically two approaches: a
semantic approach vs. a statistical approach.
In Tables 2 and 3 we report the subdivision into
classes of the training and test corpora for
the two tasks, SardiStance and HaSpeeDe2, with
percent values to allow for comparisons. As can
be noticed, in the SardiStance corpus the majority
class is constituted by AGAINST, followed by
FAVOR and then NONE. In the test set, the
distribution into the three classes favors AGAINST,
and for the other two classes it is almost identical.
The same happens in the other corpus,
HaSpeeDe2, where we notice a majority of
occurrences of the NULL class in the training
corpus. In the test set this is still valid, but we
see an important increase of the
BothHateAndStereo class and a strong reduction of the
Stereo class. Of course, these differences in class
distribution may have influenced the final
outcome when, as in our case, there is a default
class choice at the end of the computation for each
tweet. Here below is some general
quantitative information for the two corpora:
Corpus/Class        HaSpeeDe2 Abs.Val.   HaSpeeDe2 Percent
NULL                3,049                44.5825%
OnlyHATE            748                  10.9372%
OnlySTEREO          1,024                14.9729%
BothHATEAndSTEREO   2,018                29.5072%
Totals              6,839                100%
Table 2. Distribution of Classes for HaSpeeDe2
Tweets training and test corpora</p>
    </sec>
    <sec id="sec-2">
      <title>The Module for Supervised Automatic Classification</title>
      <p>We present the modules for automatic
classification, which use three different approaches: a fully
statistical bag-of-words (BOW) one, a fully semantically
based one, and a mixed one, both bag-of-words and
(partially) semantically based. With the
exception of the fully semantic approach, the
remaining approaches are all characterized
by the use of n-grams and a fully supervised
method to create the model. In all approaches the
model is created on the basis of an automatically
built dictionary of unique wordforms sorted by
frequency, where the 25 most frequent
nominal expressions are chosen as supplied instances
for n-gram construction.</p>
      <p>Eventually, we created six different classifiers
that we will present in the sections below. They
are: a fully semantic classifier; a lexically-based
semantic classifier; a mixed statistical and lexical-semantic
classifier using supervised n-grams; a
fully statistical tf-idf classifier based on
differences; a fully statistical Kullback-Leibler
Divergence (henceforth KLD) classifier based on
differences; and a classifier based on emoticons and on
hashtags.</p>
      <p>First approach.</p>
      <p>We will start by describing the lexically-based
semantic classifier. This is used for both tasks
but in a different manner. Whereas in the
semantic classifier it is treated as an important
component of the evaluation module, it becomes just a
default classifier in the statistical classifiers, used in
case of failure of the previous ones. It is
organized into a grid with seven slots:
[Polarity, Appraisal, NegativeW, PositiveW,
SwearW, HateW, StereoW]</p>
      <sec id="sec-2-1">
        <p>
          Polarity is computed at a propositional level by
the deep parser and is described below. The
remaining slots are all lexically processed. In
particular Appraisal Classes are derived from
previous work on political newspapers
          <xref ref-type="bibr" rid="ref6">(Stingo and
Delmonte, 2016)</xref>
          ; Swear Words, Negative and
Positive Words are derived from previous work
on opinion and sentiment analysis and were used
in SenticPol
          <xref ref-type="bibr" rid="ref3">(Delmonte, 2014)</xref>
          ; finally
HateWords and StereoWords were collected from the
HurtLex made available by the organizers,
proceeding by a manual selection of Italian words
and discarding all English words.
        </p>
        <p>The second approach, which we call
semantically based, uses three levels of classification.
Besides using an n-gram model, it uses a majority-vote
approach based on the presence of emoticons
previously classified on the basis of the training
set. The most important module is fired in case
of failure (no n-gram available to match) in the
two previous steps and is totally based on
semantics. It builds an interpretation from deep
semantic analysis, evaluating the presence of appraisal-theory
labeled items, the presence of hate/stereotype
items from lexical lookup, and their propositional-level
semantics. In the sections below we
describe in detail the three-level classification
module. This approach covers 93% of the whole
training set - but see below. However, its
predictive power is not so great.</p>
        <p>Third Approach.</p>
        <p>The bag-of-words approach associates a
numerical parameter with each word and the resulting sum
with each tweet. At first we used tf-idf as the
mathematical formula for characterizing each
word occurrence and each tweet. We applied
tf-idf to each word in each tweet and used the
output to map the indices to n-grams and produce a
model. Then we used this model to predict the
similarity with n-grams obtained from the held-out
development set of tweets. The results were
however very poor, 20% accuracy, which, added
to the 12% obtained from the emoticon model, made
a 32% final accuracy.</p>
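        <p>The tf-idf weighting just described can be sketched as follows (a minimal stdlib-only illustration with names of our own choosing, not the system's code): tf is the raw count of a wordform in the tweet and idf is the log of the inverse document frequency over tweets.
```python
import math
from collections import Counter

def tf_idf(tweets):
    """Per-tweet tf-idf weights: tf = raw count of the wordform in the
    tweet, idf = log(N / number of tweets containing the wordform)."""
    n = len(tweets)
    df = Counter()
    for toks in tweets:
        df.update(set(toks))                 # document frequency per type
    return [{w: c * math.log(n / df[w]) for w, c in Counter(toks).items()}
            for toks in tweets]
```
Since most wordforms occur only once per tweet, tf is nearly constant across words, which is consistent with the poor results just reported.</p>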
        <p>
          We assumed the reason was that tweets are too
short to be useful for term-frequency
computation. In the majority of the cases wordforms
appeared only once in each document/tweet – apart
from stop words. So we searched for a formula
which could be better suited for this task and
could represent both frequency and dispersion at
the corpus level. We found it in a number of papers
published by
          <xref ref-type="bibr" rid="ref4">Gries (2008</xref>
          , 2020), but also in an online
paper by Koos van der Wilt. The important part of
the formula regards the role of frequency of
occurrence in the total corpus, which is used to
produce TF so that it resembles a probability
of occurrence, and the concept of entropy2. Gries
defines this formula as a way of characterizing
“keyness” by including dispersion information.
To do that he augmented frequency information
by using the Kullback-Leibler Divergence.
Wordforms can become key for their
frequency of occurrence, their dispersion, or both.
The formula is able to "tease apart distributional
differences".
For each word w and document (tweet) A:
p = frequency of w in document A of the corpus
divided by the total frequency of w in the corpus;
q = total number of tokens in document A of the
corpus divided by the total number of tokens in the
corpus;
KLD(w) = p × log(p/q), and for the whole document
KLD(A) = ∑ p × log(p/q).
In the same paper Gries suggests computing
keyness also for n-grams, besides multiword
expressions, and this is what we did. The
summation applies to the document/tweet and is used to
differentiate each tweet from the others and
produce a similarity or distance evaluation. We
proceeded as before to verify the predictive
ability of this new formula and came out with
44/45% accuracy, a 12% gain.
        </p>
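        <p>A minimal sketch of the per-word KLD keyness and its per-tweet summation as defined above (our own naming, not the system's code):
```python
import math
from collections import Counter

def kld_keyness(doc_tokens, corpus_freq, corpus_size):
    """Per-word KLD keyness for one document (tweet) and its sum.
    p = freq of w in the document / total freq of w in the corpus;
    q = document length / corpus length; term = p * log(p / q)."""
    q = len(doc_tokens) / corpus_size
    per_word = {}
    for w, f in Counter(doc_tokens).items():
        p = f / corpus_freq[w]
        per_word[w] = p * math.log(p / q)
    return per_word, sum(per_word.values())
```
Words concentrated in one tweet but rare in the rest of the corpus receive a high keyness; the summed index characterizes the tweet as a whole.</p>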
      </sec>
    </sec>
    <sec id="sec-4">
      <title>The Semantically-Based Module and the N-gram Models</title>
      <p>The general procedure we organized for the three
approaches is as follows.</p>
      <p>At first we massaged the text in order to obtain a
normalized version, correcting wrong word accents like
"nè" instead of "né", etc. The text is then turned
into an xml file to suit the Prolog input
requirements imposed by the system. It is then
precompiled by a set of regular expressions. (According to
Koos van der Wilt, ibid., pag. 2: "Classification
according to the KLD takes place on the assumption the
training set reflects order and the test set, a document to be
categorized, reflects a deviation from this order and is
therefore chaotic or entropic. The lower the entropy regarding
the training set, the more likely it is a given test set belongs
to that training set.") With these expressions we separate
the hash symbol # from its tag; we separate the
@ symbol from the following username; we
cancel the word URL; we separate all punctuation
marks from a preceding or following word; then
we lowercase all words and produce a sorted list
which is then used to count frequencies
associated to each wordform and produce the dictionary
of unique wordforms or types.</p>
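      <p>The preprocessing steps above can be sketched as follows (the regular expressions are our reconstruction, not the system's Prolog code):
```python
import re
from collections import Counter

def normalize(text):
    """Preprocessing sketch: detach '#' from its tag and '@' from the
    username, cancel the word URL, detach punctuation, lowercase."""
    text = re.sub(r"([#@])", r"\1 ", text)
    text = re.sub(r"\bURL\b", " ", text)
    text = re.sub(r"([.,;:!?()\"'])", r" \1 ", text)
    return text.lower().split()

def build_dictionary(tweets):
    """Dictionary of unique wordforms (types) sorted by frequency."""
    freq = Counter(t for tw in tweets for t in normalize(tw))
    return sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
```
The sorted dictionary is the input from which the 25 supplied instances are then selected.</p>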
      <p>Then we choose the first 25 nominal entities
from the list, erasing generic or general nouns
like "person", "people", etc. The final list of
features is treated as supplied instances to search for in
the construction of n-grams, from 4-grams up to
8-grams: we take all sequences of four to eight tokens
where the ending or beginning word must be
taken from the list of instances. If eight is not
available we accept down to 4-grams. Instances are
collapsed under three unique general topics,
which are the following: racism, politics,
sardines/Salvini.</p>
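      <p>The instance-anchored n-gram extraction can be illustrated as follows (a simplified sketch assuming tokenized input; the back-off from 8-grams to 4-grams follows the description above):
```python
def topic_ngrams(tokens, topics, lo=4, hi=8):
    """All n-grams (preferring 8-grams, backing off to 4-grams) whose
    first or last token is a supplied instance (topic word)."""
    out = []
    for n in range(hi, lo - 1, -1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in topics or gram[-1] in topics:
                out.append(gram)
        if out:                  # longest available length found: stop
            break
    return out
```
Short tweets thus yield one n-gram at most, or none, exactly as discussed for the model coverage below.</p>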
      <p>Since we process each tweet using lemmata in
every approach, we do sentence splitting and
tagging. Every tagged token is then lemmatized
and, in the semantically-based approach, it is
subsequently associated with a lexically validated
three-valued sentiment label.</p>
      <p>In the semantically-based approach, we then
compute syntactic constituency and
dependencies for every sentence. This information is
passed to the semantic processor which produces
predicate argument structures for every sentence
present in each tweet. In case no punctuation is
available and the sentence is longer than 40
tokens, we activate an empirical set of rules to
insert punctuation and divide the tweet into
sentences, by checking for the presence of words
starting with an uppercase letter that are not Named
Entities. If the sentence splitter fails we activate a
search for sentence level coordinating or
subordinating conjunctions. Many tweets are just
fragments and contain a list of nouns and
adjectives: we add a dummy verb ESSERE/to_be in
order to allow the semantics to work.</p>
      <p>Propositional-level semantics is produced by the
computation of factivity, negation, subjectivity,
modality, speech_act, and diathesis, which then
produce a fixed set of semantic labels to allow for a
correct interpretation.</p>
      <p>In the mixed approach and in the statistics-only
approach we proceed as follows. Before
producing n-grams, we erase punctuation with the
exception of the hash symbol, which informs the
system of the presence of a hashtag or a slogan.
Similarity is computed by matching every lemma
from two n-grams labeled with the same main
topic. We established a ratio of 0.3 as the
threshold for acceptance, but then we check that the
semantics is identical or very similar. We assume with
Emily Bender that "a system trained on form
alone cannot in principle learn meaning"3. So we
use an approach which is based partially on
bag-of-words n-grams - using frequency lists and
n-grams - but we associate a semantic interpretation
with every n-gram of the model. Semantics is used
to verify and confirm the first approximation of a
similarity measure based on wordforms4 and
lemmata. We assume that n-grams belonging to a
statement cannot possibly be regarded as having
the same meaning when the comparison is
made with an n-gram extracted from a
proposition which has negation at the propositional level.</p>
    </sec>
    <sec id="sec-6">
      <title>The Experiment and the Evaluation Module of ItVenses</title>
      <p>We organized our classifiers to produce two runs
as required by the two tasks, SardiStance and
HaSpeeDe2. However, we then realized that we
needed to produce more runs in order to take into
account all the variables involved in the
statistically-based module. Eventually we had to choose one
modality for the single run with the statistical
module, trusting the results obtained from the
development set as described here below.
To produce a development set we held out 20%
of the whole training corpus - 427 tweets for
SardiStance and 1,000 tweets for HaSpeeDe2 -
that we called the devtset, and remodulated the
n-gram model accordingly by subtracting the
n-grams related to the same sequence of tweets.
For HaSpeeDe2 the system produced 23,000
n-grams for the training corpus and 19,738 for the
development. The development set is made of
1,000 tweets held out from the total 6,839, which
adds up to 136,536 tokens.</p>
      <p>For SardiStance, we have 4,993 n-grams from
the training corpus and 4,003 for the
development: the development set is made of 427 tweets
held out from the total 2,132 tweets, adding up to
57,774 tokens.</p>
      <p>The system takes as input the analysis of one
tweet at a time.
3 Emily Bender at a meeting at Uppsala University
organized by Joakim Nivre.
4 Rather than using actual wordforms we could use the rank
number associated with each type in the dictionary, as would
be done in current machine learning approaches. But given
the size of the training corpus we did not think it would be
necessary: the model for the SardiStance task takes just
5Mb of memory and the one for Absita 10Mb.
In the mixed semantic-statistic
module, the multilevel evaluation process
consists of four steps which take advantage of the
following previously compiled analyses: a
full-fledged semantic analysis at
propositional level; a three-valued labeling of each
word/lemma by lexically-driven sentiment
dictionaries; a six-slot analysis of ironic/sarcastic contents
at tweet level; a model for emoticons; and a list of
special hashtags inducing a direct evaluation.
This is what we use in the semantic-only
approach. The evaluation process is performed
recursively for each tweet, and starts by searching
for the presence of emoticons extracted in the
previous analysis and organized in a model: in this
case, the decision is taken by majority vote based
on the type of emoticons present in the tweet. As
for the semantic-only module, the problem was
how to select the best candidate from the pool of
model n-grams with different value labels. We
solved this problem with a scoring procedure. We
produced two levels of scoring: a first one based
on the number of sentiment labels with
positive/negative value, producing as a score the ratio
of their total number divided by the total number of
words in the n-gram. Negative words are valued
double. The second scoring analysis is based
on the contents of the propositional-level
semantics: here we associate 0.25 with each proposition
marked differently from statement; another 0.25
is added for the presence of predicates different from the
"dummy" verb ESSERE; eventually another 0.25
is added in case one of the arguments or
attributes is shared with the input n-grams.</p>
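      <p>The two-level scoring procedure can be sketched as follows (the data shapes and names are our assumptions, not the system's code):
```python
def score_ngram(ngram_words, sentiment, proposition, input_args):
    """Two-level score: (1) ratio of sentiment-labeled words over the
    n-gram length, with negative words counted double; (2) 0.25
    bonuses at the propositional level, as described in the text."""
    hits = sum(2 if sentiment.get(w) == "negative" else 1
               for w in ngram_words if w in sentiment)
    lexical = hits / len(ngram_words)
    semantic = 0.0
    if proposition.get("discourse_class") != "statement":
        semantic += 0.25
    if proposition.get("predicate") != "essere":    # dummy verb ESSERE
        semantic += 0.25
    if input_args.intersection(proposition.get("args", [])):
        semantic += 0.25
    return lexical + semantic
```
Candidates from the model can then be ranked by this combined score.</p>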
      <p>Eventually, we imposed coincidence at the level
of the Discourse Class associated with the utterance.
We use seven different labels: statement,
question, exclamation, negated, unreal,
opinion-subjective, conditional.</p>
      <sec id="sec-6-3">
        <title>Creating and Accessing N-gram Models</title>
        <p>If the semantics-only method needs just the words
from the two tweets being evaluated by means of
linguistic parameters, the two other methods or
approaches we used are based on n-gram models,
which introduce a great number of variables.
First of all, our n-gram models are organized in a
different manner from the way in which they are
usually conceived, so that their usage is also
peculiar and needs detailed explanation. N-grams
are not collected randomly by recursively
creating bigrams and trigrams.</p>
        <p>We can define three phases in the processing of
our n-gram models: phase 1, building; phase 2,
choosing; phase 3, evaluating. We will clarify
each phase in detail below.</p>
        <p>Phase 1. Building fully supervised n-gram
models
As explained above, we collect topic words from the
unique-wordform dictionary derived from the training set.
Topic words are the key entry in the n-gram, in
that n-grams are built from each tweet around
topic words. There are two constraints at the basis of
each n-gram: one is content related and the other
is quantity related. The quantity constraint
requires each n-gram to be longer than 3 words in
sequence, in addition to the topic word. The
content constraint requires that each n-gram must
have at least a topic word at the beginning or end
of the sequence of words. That is, each n-gram
has a topic word as head or as tail. N-grams are
strictly conditioned by the length of the tweet
from which they are extracted. Short tweets may
have only one n-gram at most, or none. Long
tweets may have two or more n-grams depending
on their content: they would all be contained in
the same list headed by the summed KLD index for
that tweet. N-grams can be expressed in actual
words or in lemmata. In the latter case, words are
no longer available to subsequent analysis. We
organized models with both words and lemmata.
Every n-gram comes with the class attributed to
the tweet in which it was contained.</p>
        <sec id="sec-6-3-1">
          <title>Phase 2. Choice constraints on n-grams</title>
          <p>Thus n-grams are each associated with two KLD
indices: one for each word, and another one from
the lump sum - which is unique - of all the word
indices contained in the tweet. In this way,
n-grams coming from the same tweet can be easily
identified, and this information can be used to
select sequences of n-grams. Sequences of
n-grams, when matched with the input tweet, are
used to reinforce the similarity hypothesis.
Choosing n-grams from the model is basically
done on the basis of the ratio of intersecting
words/lemmata. We established different ratios:
one fifth or 20% of intersection, one fourth or
25%, one third or 30%, and finally one half or 50%
intersecting words/lemmata. The ratio may vary
according to another important parameter, which
is tied to the way in which the n-gram is used.
We can decide to use words or lemmata, but also to
erase grammatical or function words. In case we
erase function words, only content
words will be computed in the intersection, which is a much
smaller number and requires a smaller ratio for
comparison. We tried all three choosing manners.</p>
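          <p>The choice constraint can be illustrated as follows (a sketch; the exact denominator of the intersection ratio is our assumption, as the text does not make it explicit):
```python
def intersection_ratio(ngram_a, ngram_b, stopwords=frozenset()):
    """Share of common words/lemmata between two n-grams; function
    words can be erased first (one of the three modes in the text)."""
    a = set(ngram_a).difference(stopwords)
    b = set(ngram_b).difference(stopwords)
    if not a or not b:
        return 0.0
    return len(a.intersection(b)) / min(len(a), len(b))

def choose(model, input_ngram, threshold=0.3, stopwords=frozenset()):
    """Keep model n-grams reaching the acceptance threshold."""
    return [(g, cls) for g, cls in model
            if intersection_ratio(g, input_ngram, stopwords) >= threshold]
```
Passing a stopword list reproduces the content-words-only mode with its correspondingly smaller ratio.</p>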
        </sec>
        <sec id="sec-6-3-2">
          <title>Phase 3. Evaluating n-gram candidates</title>
          <p>Once the methods have been selected and
candidate n-grams are extracted from the model
according to the choice constraints, the outcome may
be just one candidate, in which case the evaluation stops, or
more than one candidate, which is the rule. We then
have a list of candidate n-grams with the best
ones at the top. The list may be created in a
number of different manners. Each candidate has the KLD
index inherited from the tweet and three other
indices. One is the ratio of intersecting
words/lemmata: the higher this ratio, the more
relevant the n-gram. Another index is the sum
of the KLD indices associated with each of its
words/lemmata: the lower this sum, the more
relevant the n-gram (rare content words have a
lower KLD index). Finally, the third index is the
one associated with the tweet in which the n-grams
are contained. Choosing the best candidate in
fact usually means selecting the best candidates
from the list, because it almost never happens
that there is only one candidate at the top with
the best ratio or best index. The choice requires
collecting candidates at the top with the same
ratio/index. However, this may require another
step, since the best candidates may be associated
with different classes. So, after the first sieve
has reduced the number of best candidates,
another sieve requires selecting the most frequent
class, and this is done by reordering the best
candidates on the basis of their class. In fact, this
might also be one possible general method:
rather than selecting only the best candidates, one
might reorder all candidates chosen on the basis
of the intersection ratio, and count and choose
the most frequent class. Eventually, another
evaluation modality can be derived from the
KLD indices. We compute differences on the
basis of the KLD sum index for each model
n-gram compared to the input n-gram and use this
difference as the relevant index. When
candidates are sorted in a list, the top will be
populated by the lowest indices, which can be used to
characterize similarity. We chose the class of the
top n-gram, but also tried a better way by selecting
the first n-gram carrying a non-negative index.
Negative sums may still indicate higher
differences between two n-grams.</p>
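          <p>The second sieve, selecting the most frequent class among the best-tied candidates, can be sketched as (our naming; the default class stands in for the fallback discussed above):
```python
from collections import Counter

def pick_class(candidates, default="NONE"):
    """Keep candidates tied at the best ratio/index, then choose the
    most frequent class among them; fall back to a default class."""
    if not candidates:
        return default
    best = max(ratio for ratio, _ in candidates)
    tied = [cls for ratio, cls in candidates if ratio == best]
    return Counter(tied).most_common(1)[0][0]
```
The alternative general method mentioned in the text amounts to applying the same count over all candidates rather than only the tied best ones.</p>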
          <p>Thus overall we come up with 6 different
methods, multiplied by two (function words erased / all
words/lemmata), which amounts to 12 different
methods. We experimented with them all, but in the
end we concentrated only on a few. Since it is
reasonable to assume that not all tweets of the
training set will be classified in the model, due to
the lack of an instance defined by the list of
automatically derived keywords in the training
corpus, we first ascertained the coverage of the
development set by the training text,
using in this case the model for the training set.
We report here below both training-set coverage
of the development set and development-set
results for both tasks. As can be easily noticed,
coverage for the semantic-statistic module is
poor, and the same applies to the so-called
lexical-semantic module, which is even worse and,
as said above, we only used as a default.</p>
          <p>Cover SardiStance: 57.98%; Devel SardiStance: 35.31%</p>
          <p>Cover HaSpeeDe2: 58.34%; Devel HaSpeeDe2: 38.67%</p>
          <p>We then performed additional runs with the
statistical module, always on the test set. However,
we were unable to know the results until the
evaluation script and the Test Gold set were distributed.</p>
        </sec>
      </sec>
      <sec id="sec-6-4">
        <title>Task HaspeeDe2 - News</title>
        <p>Task A: RUN-1 Macro-F1: 0.5024333; RUN-2 Macro-F1: 0.3805618</p>
        <p>Task B: RUN-1 Macro-F1: 0.5386702; RUN-2 Macro-F1: 0.3671441</p>
        <p>As for the SardiStance task, results for the
HaSpeeDe2 task, obtained and delivered in due
time, are not particularly satisfactory, even
though they are in line with the results obtained for the
development set. After the
evaluation script and the Test Gold set were
distributed to all participants, we realized that
we had one run with the worst result and another
with the best result. The former run was obtained by
choosing the first candidate with a positive value
from the list proposed by the KLD indices,
in the list of candidates produced by a difference
computed between the index of test-set n-grams
and the index of train-model n-grams. The latter
run was instead obtained by choosing the best
candidate - the one with the higher value in terms of the
number of shared words from the intersection at
word level between test-set n-gram and
train-model n-gram - and got the following results:</p>
      </sec>
      <sec id="sec-6-5">
        <title>Task SardiStance</title>
      </sec>
      <sec id="sec-6-6">
        <title>Run3 (Statistical module - first candidate with positive value)</title>
        <p>Macro-F1 0.299607934</p>
      </sec>
      <sec id="sec-6-7">
        <title>Run4 (Statistical module - higher word intersection)</title>
        <p>Macro-F1 0.427668958
Even considering this fourth run, our ranking
would not change. We assume that basing the
evaluation on one n-gram alone is not the best
solution. So we modified our evaluation
procedure by requiring a sequence of at least two
n-grams for each tweet/news item to be chosen at the
same time, using the same tweet-related KLD to
select them. In this case we were able to cover
more text and get a better similarity measure that
we report in the subsection below.</p>
        <p>We present here below the official results
obtained at first for the HaSpeeDe2 task A-B, both
for News and for Tweets, and then the results
obtained for SardiStance. Consider that we could
report results for two runs only, and we chose
the Semantic-Statistic and the Statistics-Only ones.</p>
      </sec>
      <sec id="sec-6-8">
        <title>Task HaspeeDe2 - Tweets</title>
        <p>Task A: RUN-1 Macro-F1: 0.5054034; RUN-2 Macro-F1: 0.4726022</p>
        <p>Task B: RUN-1 Macro-F1: 0.5078902; RUN-2 Macro-F1: 0.4671661</p>
        <p>In fact there is a remarkable difference from the
result obtained for the Development set in the
semantic-statistic module.</p>
      </sec>
      <sec id="sec-6-9">
        <title>The Improvements in the Statistical Module</title>
        <p>After receiving the Test Gold version and the
evaluation script, we continued producing other
runs on the basis of the statistical approach and
the choices of algorithm we had available, for
instance restricting the choice of candidates only to
those in which two or more n-grams had been
selected. We discovered that in one of these
additional runs - the fifth for SardiStance and the
sixth for HaSpeeDe2 - we improved up to 54%
macro-F1 for the HaSpeeDe2 task and up to 48%
macro-F1 for SardiStance. Here below are the results
for SardiStance and, further on, the ones for
HaSpeeDe2.</p>
      </sec>
      <sec id="sec-6-10">
        <title>Run-5 SardiStance Task</title>
        <p>Macro-F1: 0.484871151</p>
      </sec>
      <sec id="sec-6-11">
        <title>Run-6 HaSpeeDe2 Task News</title>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>In this paper we presented the system we used
for the two tasks HaSpeeDe2 and SardiStance.
We used different approaches, one of which was
based on previous participation in similar Evalita
tasks. Two methods are however innovative in
their use of fully supervised, automatically
derived n-grams. We use a statistical measure to
classify n-grams and a variety of different possible
solutions, which we explain in detail. The high
number of possible results is however only
evaluated against the development set. We are
convinced that participants in these tasks - which
are mainly directed to the use of commonly
available machine-learning software - should be
allowed to propose a higher number of runs, due
to the variability of the behaviour of the algorithm
when relevant parameters in statistical tools are
modified.</p>
      <p>Koos van der Wilt, Linguistics improves
statistical classification: the positive effects of
reducing feature dimensionality or grammatical
feature selection. Downloadable at
https://www.academia.edu/27207951/Linguistics_improves_statistical_classification_with_KLD_NB_TF_IDF_K_NN_the_positive_effects_of_reducing_feature_dimensionality_or_grammatical_feature_selection_Koos_van_der_Wilt.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name><surname>Basile</surname>, <given-names>Valerio</given-names></string-name>, <string-name><surname>Croce</surname>, <given-names>Danilo</given-names></string-name>, <string-name><surname>Di Maro</surname>, <given-names>Maria</given-names></string-name>, and <string-name><surname>Passaro</surname>, <given-names>Lucia C.</given-names></string-name>,
          <year>2020</year>.
          <article-title>EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>,
          in <source>Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020)</source>, CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name><surname>Cignarella</surname>, <given-names>Alessandra Teresa</given-names></string-name>, <string-name><surname>Lai</surname>, <given-names>Mirko</given-names></string-name>, <string-name><surname>Bosco</surname>, <given-names>Cristina</given-names></string-name>, <string-name><surname>Patti</surname>, <given-names>Viviana</given-names></string-name>, and <string-name><surname>Rosso</surname>, <given-names>Paolo</given-names></string-name>,
          <year>2020</year>.
          <article-title>Overview of the EVALITA 2020 Task on Stance Detection in Italian Tweets (SardiStance)</article-title>,
          in <source>Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020)</source>, CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name><surname>Delmonte</surname>, <given-names>R.</given-names></string-name>,
          <year>2014</year>.
          <article-title>ITGETARUNS A Linguistic Rule-Based System for Pragmatic Text Processing</article-title>,
          in <source>Proceedings of the Fourth International Workshop EVALITA 2014</source>, Pisa, Edizioni PLUS, Pisa University Press, vol. <volume>2</volume>, pp. <fpage>64</fpage>-<lpage>69</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name><surname>Gries</surname>, <given-names>Stefan Th.</given-names></string-name>,
          <year>2008</year>.
          <article-title>Dispersions and adjusted frequencies in corpora</article-title>.
          <source>International Journal of Corpus Linguistics</source> <volume>13</volume>/4: <fpage>403</fpage>-<lpage>437</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name><surname>Gries</surname>, <given-names>Stefan Th.</given-names></string-name>,
          <year>2010</year>.
          <article-title>Dispersions and adjusted frequencies in corpora: further explorations</article-title>.
          In Stefan Th. Gries, Stefanie Wulff, and Mark Davies (eds.), <source>Corpus linguistic applications: current studies, new directions</source>. Amsterdam: Rodopi, <fpage>197</fpage>-<lpage>212</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name><surname>Stingo</surname>, <given-names>M.</given-names></string-name> and <string-name><surname>Delmonte</surname>, <given-names>R.</given-names></string-name>,
          <year>2016</year>.
          <article-title>Annotating Satire in Italian Political Commentaries with Appraisal Theory</article-title>,
          in Larry Birnbaum, Octavian Popescu and Carlo Strapparava (eds.), <source>Natural Language Processing meets Journalism - Proceedings of the Workshop</source>, NLPMJ 2016, pp. <fpage>74</fpage>-<lpage>79</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>