ERTIM@MC2: Diversified Argumentative Tweets Retrieval

Kévin Deturck (INaLCO ERTIM, 75007 Paris, France / Viseo Innovation, 38000 Grenoble, France), Parantapa Goswami (Viseo Innovation, 38000 Grenoble, France), Damien Nouvel (INaLCO ERTIM, 75007 Paris, France), Frédérique Segond (INaLCO ERTIM, 75007 Paris, France)

kevin.deturck@viseo.com, parantapa.goswami@viseo.com, damien.nouvel@inalco.fr, frederique.segond@inalco.fr

Abstract. In this paper, we present our participation in the 2018 edition of the CLEF MC2 lab, task 2 "Mining opinion argumentation". The task consists in detecting the most argumentative and diverse Tweets about a set of festivals, in English and French, from a massive multilingual collection. We measure the argumentativity of a Tweet by computing the amount of argumentative content it contains, where an argument combines the expression of an opinion with supporting facts and a particular discourse structure. Regarding diversity, we consider the number of festival aspects covered by the Tweets. An initial step filters the original dataset to fit the language and topic requirements of the task. We then compute and integrate linguistic descriptors to detect claims and their respective justifications in Tweets. The final step extracts the most diverse arguments by clustering Tweets according to their textual content and selecting the most argumentative ones from each cluster. We conclude the paper by describing how the descriptors were combined in the different runs we submitted and by discussing their results.

Keywords: Argumentation, Opinion, Twitter.

1 Introduction

The CLEF MC2 Lab 2018 [1] proposes an information retrieval task for festival organizers who would like to know what people think about their event on Twitter. A user's query can be either in French or in English and specifies a topic from a list of festival names.
We design a system, based on linguistic information, which selects the 100 most argumentative and diverse Tweets associated with a user's query (Tweets are collected from Twitter, http://www.twitter.com). An initial step filters Tweets by language and topic in order to reduce the amount of data to be processed. We first extract French and English Tweets by performing language detection with an external tool. Then, using regular expressions and keywords, a topic filtering step extracts, for each language, the sets of Tweets related to the different festivals.

We perform a linguistic enrichment on the previously extracted sets of Tweets. These enriched Tweets are then used to compute the argumentativity score of each Tweet and to measure diversity among Tweets.

Argumentation is a process of construction with arguments, which are sets of premises, in other words facts chosen to support claims [2]. Claims are personal statements made by an individual about a topic. Thus, a claim is the expression of an individual's opinion as a polarity (negative, neutral, positive) about a topic. We link argumentation and opinion in that the former supports the latter. Since an argumentation is related to an opinion, we measure the argumentativity of a Tweet according to the amount of opinion and argumentation it contains. Opinion mining is driven by subjectivity detection, because subjectivity is the property of a personal expression and opinion, as stated above, is personal. We also consider factuality a crucial marker of argumentation [3]. Factuality measures how many facts are present in a discourse. A fact is the opposite of subjective content, as it stands for a proposition which is true independently of its enunciator. Finally, since argumentation is a process of construction, we also use discourse structuration markers to detect it.
Diversity is measured on a set of Tweets according to the variety of festival aspects mentioned in the featured viewpoints. Therefore, the Tweets returned by our system must be distant with respect to the aspects they contain. That is why we measure diversity as a distance among Tweets, using clustering on their textual content.

In what follows, we present the general architecture of the system together with the different linguistic modules and resources we used. We also explain the different configurations of the runs we submitted.

2 Our approach to the detection of the most argumentative and diverse Tweets within MC2

The overall approach (see Fig. 1) consists in applying different filtering steps in order to reduce the original set of "Festival" Tweets to those relevant for the particular task context, and to map the most relevant Tweets to users' queries according to their level of argumentativity and diversity.

We reduce the original dataset by two pre-filtering steps to fit the particular task context. The original dataset contains languages other than English and French, so the initial challenge is to identify and separate English and French Tweets by a language filtering step. A list of festival names is provided as topics for each language; we detect and extract the Tweets which mention these festivals.

We perform data enrichment on the pre-filtered set using Natural Language Processing tools. It consists of morpho-syntactic and semantic information on which the calculation of the argumentativity score is based.

We compute the argumentativity score of a Tweet as the amount of both opinion and argumentation it contains. For example, a Tweet with only one claim, such as "I love Hellfest.", will get a lower argumentativity score than a Tweet which combines an opinion and an associated argumentation, as in "I love Hellfest because it is ethic.".

We define the diversity of Tweets as the number of different aspects they mention about the festivals.
For example, a set of Tweets about the Cannes festival that only contains Tweets like "I love Cannes festival because the introduction was great!" and "Beautiful introduction at Cannes!" is argumentative but not relevant for diversity, as it only mentions one aspect. The more diverse the Tweets are, the more of the individuals' critical criteria are provided, so that festival organizers get a larger perspective on what people think and why.

Fig. 1. System general architecture

2.1 Language filtering

The language filtering step (see Fig. 2) is performed using the Python module "langid.py" (https://github.com/saffsd/langid.py); we chose this module because it combines state-of-the-art results and speed, which is essential for processing such a massive dataset [4].

Fig. 2. Principle of the language filtering module

2.2 Topic filtering

The original dataset contains Tweets that are not only about the festivals from the particular task context. The next step consists, for each language, in detecting and grouping the Tweets into categories corresponding to the lists of festivals provided (see Fig. 3).

Fig. 3. Principle of the topic filtering module

Topic detection is performed using regular expressions based on keywords representative of each festival. We select a set of "representative" keywords associated with each festival based on the mentions in the topically categorized sample of Tweets provided by the organizers. For two festivals, Cannes and Avignon, we noticed that the city name is often used alone (without "festival", e.g. "Cannes" instead of "Festival de Cannes"), so we decided to only look for "cannes" and "avignon". Regular expressions are built so that the tokens may appear in any case and any order.

2.3 Data enrichment

The goal of this intermediary step is to enrich the pre-filtered data with linguistic information. The output of this step is stored so that the process runs just once, preventing a loss in performance.
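As an illustration, the keyword-based topic filtering of section 2.2 can be sketched with Python regular expressions. The keyword lists below are hypothetical stand-ins for the ones actually derived from the organizers' sample, and this simplified sketch only handles case variation, not token reordering:

```python
import re

# Illustrative keyword patterns per festival (the real lists were derived
# from the organizers' categorized sample). For Cannes and Avignon the
# city name alone is used, as noted above.
FESTIVAL_KEYWORDS = {
    "cannes": [r"cannes"],
    "avignon": [r"avignon"],
    "hellfest": [r"hellfest", r"hell\s*fest"],
}

# One case-insensitive pattern per festival.
FESTIVAL_PATTERNS = {
    name: re.compile("|".join(words), re.IGNORECASE)
    for name, words in FESTIVAL_KEYWORDS.items()
}

def topics_of(tweet_text):
    """Return the list of festivals mentioned in a tweet."""
    return [name for name, pat in FESTIVAL_PATTERNS.items()
            if pat.search(tweet_text)]
```

Each Tweet matched by at least one pattern is routed to the corresponding festival category.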
We first normalize each Tweet using the Python module "tweet-preprocessor" (https://pypi.org/project/tweet-preprocessor/). It is fully customizable, allowing us to specify the parts of the Tweets we want to remove: URLs, mentions, emojis, smileys. We decided to keep hashtags as they might contain important information. For example, in "The sound is too loud! #FestivalCannes", the hashtag allows us to identify the topic of the Tweet. We normalize hashtags by removing the "#" character.

After text normalization, for each Tweet, we extract the following information using NLP tools selected according to our needs; they are mostly bilingual and fast enough to handle the data size (see Table 3):

─ List of tokens
─ List of lemmas
─ List of POS labels
─ Subjectivity score
─ Opinion polarity score

The lists of tokens, lemmas and POS labels are obtained by running the TreeTagger tool on the normalized Tweets. We use normalized Tweets because TreeTagger is meant to analyze regular texts, while the original Tweets are noisy and can contain Tweet-specific elements like smileys. Given the language of the text to analyze (English or French) as a parameter, TreeTagger returns, for each form, its POS label and lemma.

The subjectivity and opinion polarity scores are obtained using the "TextBlob" library and its adaptation for French, "textblob-fr". TextBlob computes the scores using lexical resources and pattern matching. We run it on the normalized Tweets.

2.4 Opinion and argumentation filtering

This step computes an argumentativity score for each Tweet according to the opinion and argumentation it contains. We selected linguistic features that may represent both aspects.

For opinion detection, we use the subjectivity score, as we consider the expression of an opinion to be subjective content [5]. We consider that the higher its subjectivity score, the more opinionated a Tweet is.
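A minimal stand-in for the normalization step of section 2.3 (the actual pipeline uses tweet-preprocessor; this standard-library sketch only covers URLs, mentions and hashtags, not emojis or smileys):

```python
import re

def normalize(tweet):
    """Rough stand-in for the tweet-preprocessor step: drop URLs and
    @mentions, keep hashtag content without the '#' character."""
    tweet = re.sub(r"https?://\S+", "", tweet)   # URLs
    tweet = re.sub(r"@\w+", "", tweet)           # mentions
    tweet = re.sub(r"#(\w+)", r"\1", tweet)      # '#tag' -> 'tag'
    return re.sub(r"\s+", " ", tweet).strip()    # collapse whitespace
```

For the example above, `normalize` keeps "FestivalCannes" as plain text, so the topic information carried by the hashtag survives normalization.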
We also use the opinion polarity score, not for the polarity itself, but for its magnitude, which may also indicate how opinionated a Tweet is. These two scores are combined with their respective weights (specified in section 2.6) into a magnitude score, described in equation (1):

magnitude(tweet) = α ∗ subjectivity + β ∗ |polarity|   (1)

where magnitude(tweet) is the opinion magnitude score (in [0,1]) of tweet, subjectivity is the subjectivity score (in [0,1]), polarity is the polarity score (in [-1,1], hence the absolute value), and α and β are their respective weights.

We also use two lexical resources, one for English and one for French. For English, we use [6], which encodes the "arousal" property of 13,915 English lemmas. It associates a score to each lemma according to the affectivity it denotes; our hypothesis is that the more a Tweet contains high-affectivity lemmas (high scores), the more opinionated it is (see equation 2). For French, we use [7], a lexicon which associates to 14,129 non-neutral lemmas a binary polarity value ("positive" or "negative") and six binary values indicating whether the lemma evokes (1) or not (0) each of six sentiments: joy, anger, surprise, sadness, disgust and fear. We consider sentiment an internal psychological state whose expression can serve the formulation of an opinion. Our hypothesis is that the more a French Tweet contains lemmas present in this lexicon (which encodes only non-neutral lemmas) and with a high number of sentiment denotations, the more opinionated it is (see equation 3).

(Tool URLs: TreeTagger, http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/; TextBlob, http://textblob.readthedocs.io/en/dev/; textblob-fr, https://github.com/sloria/textblob-fr)
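Equation (1) with the weights used in the submitted runs (section 2.6) is straightforward to express; a minimal sketch:

```python
def magnitude(subjectivity, polarity, alpha=0.75, beta=0.25):
    """Opinion magnitude of a tweet (equation 1). The default weights
    alpha and beta are those used in the submitted runs (section 2.6)."""
    return alpha * subjectivity + beta * abs(polarity)
```

For a Tweet with subjectivity 0.8 and polarity -0.4, the magnitude is 0.75 · 0.8 + 0.25 · 0.4 = 0.7.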
arousal(tweet) = (Σ_{i=1..n} arousal(lemma_i)) / n   (2)

where arousal(tweet) is the arousal score (in [0,1]) of tweet, arousal(lemma_i) is the lexicon-based arousal score (normalized to [0,1]) of lemma_i, and n is the number of lemmas in tweet.

expressivity(tweet) = (Σ_{i=1..n} expressivity(lemma_i)) / n   (3)

where expressivity(tweet) is the expressivity score (in [0,1]) of tweet, expressivity(lemma_i) is the expressivity of lemma_i computed following equation (4), and n is the number of lemmas in tweet.

expressivity(lemma) = |true| / 7   (4)

where expressivity(lemma) is the lexicon-based expressivity score of lemma, and |true| is the number of valid properties among presence in the lexicon and the six lexicon-annotated sentiments.

Besides these lexicon-based measures, we also detect opinion in a Tweet by taking into account the proportion of adjectives among the POS tags; our hypothesis is that the more a Tweet contains adjectives, the more opinionated it is (see equation 5).

descriptivity(tweet) = |adjectives| / n   (5)

where descriptivity(tweet) is the descriptivity score of tweet, |adjectives| is the number of tokens tagged as adjectives, and n is the number of tokens in tweet.

Regarding argumentation, we assume that an argumentative text is particularly structured, so as to effectively combine arguments and opinions. Conjunctions are discourse connectors, so we suppose they are particularly used to structure a text. We use POS tags to measure the proportion of conjunctions in a Tweet (see equation 6).

structuration(tweet) = |conjunctions| / n   (6)

where structuration(tweet) is the structuration score (in [0,1]) of tweet, |conjunctions| is the number of conjunctions in tweet, and n is the number of tokens in tweet.

For English Tweets, we compute a concreteness score (see equation 7) relying on the lexical resource [8].
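The lexicon- and POS-based scores above reduce to simple ratios and averages; a sketch (the function names are ours, and the POS labels assumed here depend on the TreeTagger tagset):

```python
def expressivity_lemma(in_lexicon, sentiment_flags):
    """Equation 4: number of valid properties (presence in the lexicon
    plus the six binary sentiment flags) out of 7."""
    return (int(in_lexicon) + sum(sentiment_flags)) / 7

def pos_ratio(pos_tags, target):
    """Equations 5 and 6: proportion of tokens carrying a given POS
    label (e.g. adjective or conjunction) in the tweet."""
    return sum(tag == target for tag in pos_tags) / len(pos_tags)

def lexicon_average(lemmas, scores):
    """Shape shared by equations 2, 3 and 7: mean lexicon score over the
    tweet's lemmas, scoring 0 for lemmas missing from the resource."""
    return sum(scores.get(l, 0.0) for l in lemmas) / len(lemmas)
```

A lemma present in the French lexicon and flagged for two sentiments thus gets an expressivity of 3/7, and a Tweet with two adjectives out of four tokens a descriptivity of 0.5.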
This resource associates to nearly 40,000 English lemmas a score which indicates how perceptible (by the five senses) their meaning is. As we start from the hypothesis that an argumentative text is factual, thus independent from an individual's state of mind, we formulate the additional hypothesis that it may contain more concrete lemmas.

concreteness(tweet) = (Σ_{i=1..n} concreteness(lemma_i)) / n   (7)

where concreteness(tweet) is the concreteness score (in [0,1]) of tweet, concreteness(lemma_i) is the lexicon-based concreteness score of lemma_i (normalized to [0,1], 0 if the lemma is missing), and n is the number of tokens in tweet.

2.5 Diversity filtering

In this final step, we build a set of Tweets that maximizes the diversity criterion among the most argumentative Tweets. Diversity measures how many festival aspects the Tweets mention. Thus, we suppose that diverse Tweets contain semantically distant words. For example, we detect that "This festival is too expensive." and "Ticket price for this festival is too high." mention a similar aspect through the semantic proximity between the words "expensive" and "price". Conversely, the texts "This festival program is so good!" and "This festival proposes a good choice of beers!" are more distant, due to the semantic distance between the words "program" and "beer".

As diversity is computed according to the lexical-semantic distance between Tweets, we use word embedding models from Sketch Engine to get a spatial representation of words, one model for English and one for French. As we want to keep as much form-wise information as possible, we select for both languages word-form models (without lowercasing). For English, we select the model based on the British National Corpus because it is the lightest, which avoids memory problems at loading time. For French, the only model proposed is based on a Web corpus. We vectorize a Tweet by matching its tokens against the model using the fastText Python module.
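The paper does not detail how token vectors are combined into a single Tweet vector; a common choice, assumed here purely for illustration, is to average the embeddings of the tokens found in the model:

```python
def tweet_vector(tokens, embeddings, dim=300):
    """Illustrative (assumed) vectorization: average the word vectors of
    the tweet's tokens; tokens absent from the embedding model are
    skipped, and an all-zero vector is returned if nothing matches."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```

In the real pipeline the lookups go through the fastText Python module against the Sketch Engine word-form models.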
We use K-means clustering via the scikit-learn toolkit to compute the distance between vectorized Tweets. (Tool URLs: Sketch Engine embeddings, https://embeddings.sketchengine.co.uk/static/index.html; fastText, https://github.com/facebookresearch/fastText/tree/master/python; scikit-learn, http://scikit-learn.org/stable/index.html)

2.6 Impact of dataset pre-filtering

Table 1 shows the efficiency of the pre-filtering steps (regarding languages and topics) in reducing data size, by evaluating compression ratios over properties that are selective for argumentativity. It includes linguistic properties (lemmas, subjectivity and opinion polarity) obtained as described in section 2.3. We consider properties which are relative indicators of point-of-view diversity in our data regarding sources (authors) and vocabulary (lemmas). We also select the subjectivity and opinion polarity scores as a means to measure the inclination of authors to express and explain their point of view.

The language filtering step removes more than 40% of the original data, compressing the unique-authors ratio, computed with equation (8), by more than 50% for both languages. The dataset is much more tractable, but with an impoverishment of sources. Only around 1% of the lemmas used in Tweets are different, which may be difficult for lexical approaches. Polarity and subjectivity average magnitudes (on a [0,1] scale) are low for the two languages; this may be positive for distinguishing argumentative Tweets.

ratio = nUniqAuthors / nTweets   (8)

where ratio is the ratio of the number of unique authors, nUniqAuthors, to the number of Tweets, nTweets.

We can observe in Table 1 the evolution of author and vocabulary usage between the first two filtering steps. The unique-authors ratio increases by around 80% for both languages; it is a considerable increase compared to the previous step and a positive result for the representativeness of the data.
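Once Tweets are vectorized and clustered (scikit-learn K-means with k = 100, as described in section 2.7), the selection of the most argumentative Tweet per cluster can be sketched with the standard library only (cluster labels and argumentativity scores are assumed to be precomputed):

```python
def most_argumentative_per_cluster(tweets, labels, scores):
    """Keep the highest-scoring tweet of each cluster, ranked best
    first (the selection step of section 2.7)."""
    best = {}  # cluster label -> (tweet, score)
    for tweet, label, score in zip(tweets, labels, scores):
        if label not in best or score > best[label][1]:
            best[label] = (tweet, score)
    # Rank the per-cluster winners by decreasing argumentativity.
    return [t for t, s in sorted(best.values(), key=lambda x: -x[1])]
```

With k = 100 clusters this yields exactly the 100 Tweets a run returns, the most argumentative first.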
Vocabulary usage is poor in English (0.2% of different lemmas) and in French (0.5%); this may be a relevant concern for the diversity criterion. Polarity and subjectivity average magnitudes stay low, even though they increase for French Tweets; the selective power of this information may be preserved.

Table 1. Statistics on the dataset through the pre-filtering steps

                              Initial          English                  French
                              multilingual     Language    Topic        Language    Topic
                              "Festival"       filtered    filtered     filtered    filtered
                              Tweets
#Tweets                       63M              34M         2M           3M          200k
#Unique authors               45M              9M          1M           1M          100k
#Tokens                       960M             532M        25M          41M         3M
#Unique lemmatized tokens     N/A              7M          61k          252k        15k
Subjectivity magnitude avg.   N/A              0.28        0.28         0.26        0.15
Polarity magnitude avg.       N/A              0.18        0.14         0.13        0.07

Notes: token counts obtained using the Unix 'wc' command; lemmatization performed with TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/); subjectivity and polarity magnitudes obtained using TextBlob (https://github.com/sloria/textblob); "N/A" indicates that the tool is not designed to support all languages of the original dataset.

2.7 Runs description

A run returns the 100 most argumentative and diverse Tweets for all languages and festival names from the particular task context. To get them, we run K-means with k = 100 and select the Tweet with the highest argumentativity score from each cluster. Each run results in a ranked set of Tweets, with the most argumentative first.

We submitted three runs, which differ by the features and associated weights used for computing the argumentativity score of each Tweet. We combine scores, described in section 2.4, that use the same types of linguistic information (POS-based and lexical types). The purpose is to evaluate the impact of each type. Mathematically, the combination is an arithmetic mean (see equations 9, 10 and 11).
posScore(tweet) = (structuration(tweet) + descriptivity(tweet)) / 2   (9)

lexicalScoreEn(tweet) = (arousal(tweet) + concreteness(tweet)) / 2   (10)

lexicalScoreFr(tweet) = expressivity(tweet)   (11)

In all runs, we set the weights in the magnitude equation (1) to α = 0.75 and β = 0.25, as we are more confident in the subjectivity analysis than in the polarity one, based on our observations of some analyzed data. Even if we do not have much confidence in the magnitude score itself, it is the feature most directly related to argumentativity; therefore we use it in all runs, mostly with a minor weight (0.25). Run 1 uses all types of features, while the two others interchange the use of lexicon-based and POS-based scores to evaluate their respective impact on result quality.

Run 1 uses all types of features: the POS-based, opinion (magnitude) and lexicon-based scores (see equations 12 and 13). The English lexical resources cover both the opinion and argumentation aspects, whereas the French one only covers opinion. We therefore give the lexicon-based score a smaller weight in the French run (0.25) than in the English one (0.50). We try to compensate for the lack of an argumentation lexicon in French by giving a larger weight to the POS-based score (0.50) than in English (0.25). The magnitude score gets the same weight for the two languages (0.25), as we do not have much confidence in the tool that computes it.

argumentativityEn(tweet) = 0.25 ∗ magnitude(tweet) + 0.50 ∗ lexicalScoreEn(tweet) + 0.25 ∗ posScore(tweet)   (12)

argumentativityFr(tweet) = 0.25 ∗ magnitude(tweet) + 0.25 ∗ lexicalScoreFr(tweet) + 0.50 ∗ posScore(tweet)   (13)

Run 2 uses the magnitude and lexicon-based scores (see equations 14 and 15).
As we said previously, the French lexicon does not cover all the task aspects, contrary to the English one, so we give it a smaller weight in French (0.50) than in English (0.75). In English, we give a major weight to the lexicon-based score because we attach more importance to the manually built lexical resource than to the automatically computed magnitude score.

argumentativityEn(tweet) = 0.25 ∗ magnitude(tweet) + 0.75 ∗ lexicalScoreEn(tweet)   (14)

argumentativityFr(tweet) = 0.50 ∗ magnitude(tweet) + 0.50 ∗ lexicalScoreFr(tweet)   (15)

Run 3 uses the POS-based score in association with the magnitude score (see equation 16). As the tool we use to extract the POS labels is the same for English and French, we give the same weight (0.75) to the POS-based score for the two languages. It gets a major weight because of the lack of reliability of the opinion score.

argumentativity(tweet) = 0.25 ∗ magnitude(tweet) + 0.75 ∗ posScore(tweet)   (16)

3 Results, conclusions and perspectives

An overview of the results obtained by the different systems in the lab can be found in [9]. In this section, we focus on our own system, as it is the subject matter of the present paper and we are in a relevant position to review its results. At this time, no diversity results have been provided by the organizers.

3.1 Results

Regarding argumentativity, the organizers used two measures to evaluate the quantity of argumentative content in the runs. NDCG measures the relevance of Tweets with a discount function over the rank: in each run, the most relevant Tweets must appear first. Relevance is assessed according to regular expressions which match argumentative content. Two references for argumentative content were used: a manual one prepared by annotators, and another obtained by pooling the runs of the different participants' systems.
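The run-specific combinations (equations 9 through 16) are plain weighted means; a sketch of a few of them (function names are ours):

```python
def pos_score(structuration, descriptivity):
    """Equation 9: arithmetic mean of the two POS-based scores."""
    return (structuration + descriptivity) / 2

def argumentativity_en_run1(magnitude, lexical_en, pos):
    """Equation 12: weights used for English in run 1."""
    return 0.25 * magnitude + 0.50 * lexical_en + 0.25 * pos

def argumentativity_run3(magnitude, pos):
    """Equation 16: run 3 uses the same weights for both languages."""
    return 0.25 * magnitude + 0.75 * pos
```

For instance, a Tweet with magnitude 0.4 and posScore 0.8 gets an argumentativity of 0.25 · 0.4 + 0.75 · 0.8 = 0.7 in run 3.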
A measure named "%arg" gives the percentage of argumentative content relative to both the pooling and manual references. Table 2 presents the results for all runs.

Table 2. Argumentativity results

         English                               French
         NDCG-manual   NDCG-pooling   %arg     NDCG-manual   NDCG-pooling   %arg
Run 1    0.002         0.36           21.81    2.597         2.06           22.00
Run 2    0.007         0.60           36.72    2.594         1.39           20.43
Run 3    0.003         0.39           20.36    2.594         1.99           21.89

We observe in Table 2 that the best results are not obtained with the same types of features across the languages. All runs use the magnitude of the subjectivity and polarity scores (see section 2.7), but in English the best results are obtained by adding lexical features (run 2), while in French the best run combines lexical and POS-based features (run 1). We explain this difference by the different natures of the lexical resources across the languages: as we suspected when preparing the runs (see section 2.7), the English lexical resource may be more related to argumentativity, especially through the concreteness property, while the French one is only about sentiment expression, which may be useful for opinion mining (see section 2.4) but not sufficient to detect argumentative content.

Comparing the results within one language, run 2 in French is particularly low (NDCG-pooling and %arg). This run in French may not include enough features related to argumentativity; the presence of opinion polarity, subjectivity or sentiments in a Tweet should indicate that it contains a personal expression, but it does not imply that this expression is justified by an argumentation. However, the addition of the lexical sentiment feature allows run 1 to be the best in French (in comparison with run 3). We think that personal content may be the base on which argumentation builds as a supporting tool; in other words, particularly on Twitter, we suppose that there might not be argumentation without personal content.
We note that POS-based information capturing structuration is effective even on Tweets, probably because POS-based scores are relative measures that remain comparable among Tweets. In English, it is surprising to observe that the run with lexical and POS-based features (run 1) gets lower results than the run without POS-based information (run 2). Regarding the weights of the two runs (see section 2.7), the lexical feature accounts for 3/4 of the score in run 2 but only 1/2 in run 1; we think that the lower results of run 1 compared to run 2 might be explained by the lower importance of the lexical feature in run 1 rather than by the addition of the POS-based feature. This is supported by the result of run 3, which uses the POS-based feature and gets a better NDCG score than run 1.

Considering our position among the different participants' systems is interesting: we are in first place by NDCG-pooling for both languages (the lowest scores are 0.00 in French and 0.05 in English), but by NDCG-manual we are in last place for English (the best score is 0.06) and in penultimate place for French (the lowest score is 2.28, from the baseline, and the best score is 2.89). This means that our system does not correctly match the manual reference but extracts arguments not considered by the annotators or by the other participants. It may reflect a divergence in what is considered argumentative.

3.2 Conclusions and perspectives

The hypothesis that argumentation uses words denoting concrete things seems to be validated by the importance of the corresponding lexical feature in English, which yields a better score when used with a greater weight. In French, the discourse connectors feature gives the best results and validates the assumption that a text is more structured when it is argumentative, even for short messages like Tweets. As the lexicon encoding the concreteness and arousal properties yields the best results, it may be relevant to build a corresponding resource for French; this could be achieved by a translation process.
It would be interesting to see whether we would then also get better results in French.

We would like to analyze similar features on other text media to compare their respective contributions. In particular, it would be interesting to evaluate the POS-based feature with structuration words on texts which are less bound by their size; a longer text may need to be more structured.

References

1. CLEF MC2 Lab Homepage, http://www.mc2.talne.eu, last accessed 2018/05/24.
2. Palau, R. M., Moens, M. F.: Argumentation mining: the detection, classification and structure of arguments in text. In: Proceedings of the 12th International Conference on Artificial Intelligence and Law, pp. 98-107, ACM (2009).
3. Dusmanu, M., Cabrio, E., Villata, S.: Argument mining on Twitter: arguments, facts and sources. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2317-2322 (2017).
4. Lui, M., Baldwin, T.: langid.py: An off-the-shelf language identification tool. In: Proceedings of the ACL 2012 System Demonstrations, pp. 25-30 (2012).
5. Tsytsarau, M., Palpanas, T.: Survey on mining subjective data on the web. Data Mining and Knowledge Discovery 24(3), 478-514 (2012).
6. Warriner, A. B., Kuperman, V., Brysbaert, M.: Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods 45(4), 1191-1207 (2013).
7. Abdaoui, A., Azé, J., Bringay, S., Poncelet, P.: FEEL: a French Expanded Emotion Lexicon. Language Resources and Evaluation 51(3), 833-855 (2017).
8. Brysbaert, M., Warriner, A. B., Kuperman, V.: Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods 46(3), 904-911 (2014).
9. Hajjem, M., Cossu, J. V., Latiri, C., SanJuan, E.: CLEF 2018. In: International Conference of the Cross-Language Evaluation Forum for European Languages Proceedings, LNCS volume, Springer, Avignon (2018).