<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ERTIM@MC2: Diversified Argumentative Tweets Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kévin Deturck</string-name>
          <email>kevin.deturck@viseo.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Parantapa Goswami</string-name>
          <email>parantapa.goswami@viseo.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Damien Nouvel</string-name>
          <email>damien.nouvel@inalco.fr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frédérique Segond</string-name>
          <email>frederique.segond@inalco.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>INaLCO ERTIM</institution>
          ,
          <addr-line>75007 Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>INaLCO ERTIM, 75007 Paris, France/Viseo Innovation</institution>
          ,
          <addr-line>38000 Grenoble</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Viseo Innovation</institution>
          ,
          <addr-line>38000 Grenoble</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper, we present our participation in the 2018 edition of CLEF MC2, task 2 “Mining opinion argumentation”. The task consists in detecting the most argumentative and diverse Tweets about festivals, in English and French, from a massive multilingual collection. We measure the argumentativity of a Tweet by computing the amount of argumentation compounds it contains, where an argumentation compound combines the expression of an opinion, its support with facts, and a particular discourse structure. Regarding diversity, we consider the number of festival aspects covered by the Tweets. An initial step filters the original dataset to fit the language and topic requirements of the task. Then, we compute and integrate linguistic descriptors to detect claims and their respective justifications in Tweets. The final step extracts the most diverse arguments by clustering Tweets according to their textual content and selecting the most argumentative ones from each cluster. We conclude the paper by describing how the descriptors were combined in the different runs we submitted and by discussing their results.</p>
      </abstract>
      <kwd-group>
        <kwd>Argumentation</kwd>
        <kwd>Opinion</kwd>
        <kwd>Twitter</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        CLEF MC2 Lab 2018 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] proposes an information retrieval task for festival organizers
who would like to know what people think about their event on Twitter (http://www.twitter.com). A user’s query
can be either in French or in English and also specifies a topic from a list of festival
names. We design a system, based on linguistic information, which selects the 100
most argumentative and diverse Tweets associated with a user’s query. An initial step
filters Tweets according to languages and topics in order to reduce the amount of data
to be processed. We first extract French and English Tweets by performing language
detection with an external tool. Then, using regular expressions and key words, a
topic filtering step extracts, for each language, sets of Tweets related to the different
festivals.</p>
      <p>We perform a linguistic enrichment on the previously extracted sets of Tweets. We
then use these Tweets enriched with linguistic information to compute the
argumentativity score of each Tweet and measure diversity among Tweets.</p>
      <p>
        Argumentation is a process of construction with arguments that are sets of premises,
in other words facts chosen to support claims [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Claims are personal statements made
by an individual about a topic. Thus, a claim is the expression of an individual’s opinion
as a polarity (negative, neutral, positive) towards a topic. We link argumentation
and opinion in that the former supports the latter. Since argumentation is related
to an opinion, we measure the argumentativity of a Tweet according to the amount of
opinion and argumentation it contains. Opinion mining is driven by subjectivity
detection, because subjectivity is the property of a personal expression and we said opinion
is personal. We also think the characterization of argumentation by factuality is a crucial
marker [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Factuality measures how many facts are present in a discourse. A fact is the
opposite of subjective content, as it stands for a proposition which is true independently
of its enunciator. As we mentioned, argumentation is a process of construction, so we also
use discourse structuration markers to detect argumentation.
      </p>
      <p>Diversity is measured on a set of Tweets according to the variety of festival aspects
mentioned in the expressed viewpoints. Therefore, the Tweets returned by our system
must be distant with respect to the aspects they contain. That is why we measure diversity
as a distance among Tweets, using clustering on their textual content.</p>
      <p>In what follows, we present the general architecture of the system together with the
different linguistic modules and resources we used. We also explain the different
configurations of the runs we have submitted.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Our approach to the detection of the most argumentative and diverse Tweets within MC2</title>
      <p>The overall approach (see Fig. 1) consists in applying different filtering steps in order
to reduce the original set of “Festival” Tweets to those relevant for the particular task
context and to map the most relevant Tweets to users’ queries according to their level
of argumentativity and diversity.</p>
      <p>We reduce the original dataset by two pre-filtering steps to fit the particular task
context. The original dataset contains languages other than English and French, thus the
initial challenge is to identify and separate English and French Tweets in a language
filtering step. A list of festival names is provided as topics for each language. We detect
and extract the Tweets which contain mentions of these festivals.</p>
      <p>We perform data enrichment on the pre-filtered set using Natural Language
Processing tools. It consists of morpho-syntactic and semantic information on which the
calculation of the argumentativity score is based.</p>
      <p>We compute the argumentativity score of a Tweet as the amount of both opinion and
argumentation it contains. For example, a Tweet with only one claim such as “I love
Hellfest.” will get a lower argumentativity score than a Tweet which combines an opinion
and an associated argumentation, as in “I love Hellfest because it is ethic.”.</p>
      <p>
        We define the diversity of Tweets as the number of different aspects they mention
about the festivals. For example, a set of Tweets about the Cannes festival that only
contains Tweets like “I love Cannes festival because the introduction was great!” and
“Beautiful introduction at Cannes!” is argumentative but not relevant for diversity as it
only mentions one aspect. The more diverse the Tweets are, the more of the individuals’ critical
criteria are provided, so that festival organizers get a larger perspective on what people
think and why.
      </p>
      <p>
        The language filtering step (see Fig. 2) is performed using the Python module
“langid.py” (https://github.com/saffsd/langid.py); we choose this module because it combines state-of-the-art results and speed,
which is essential for processing such a massive dataset [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
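      <p>As an illustration, the following is a minimal Python sketch of this language filtering step with langid.py; the input format (one Tweet text per entry) and the wrapper function are illustrative assumptions, not our exact implementation.</p>
      <preformat>
# Minimal sketch of the language filtering step with langid.py.
# The input format (one Tweet text per entry) is an assumption for illustration.
import langid

def filter_languages(tweets, wanted=("en", "fr")):
    """Group Tweets by detected language, keeping only the wanted ones."""
    kept = {lang: [] for lang in wanted}
    for text in tweets:
        lang, score = langid.classify(text)  # returns (language code, score)
        if lang in kept:
            kept[lang].append(text)
    return kept

tweets = ["I love this festival!", "J'adore ce festival !", "Me encanta este festival"]
print({lang: len(ts) for lang, ts in filter_languages(tweets).items()})
      </preformat>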
    </sec>
    <sec id="sec-3">
      <sec id="sec-3-1">
        <title>2.2 Topic filtering</title>
        <p>The original dataset contains Tweets that are not only about the festivals from the
particular task context. The next step consists, for each language, in detecting and grouping
the Tweets into categories corresponding to the lists of festivals provided (see Fig. 3).
Topic detection is performed using regular expressions based on key words
representative of each festival. We select a set of “representative” key words associated with each
festival based on the mentions in the topically categorized sample of Tweets provided
by the organizers. For two festivals, Cannes and Avignon, we notice that the city name
is often used alone (without “festival” like “Cannes” instead of “Festival de Cannes”)
so we decide to only look for “cannes” and “avignon”. Regular expressions are built so
that the tokens may appear in any case and any order.
</p>
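        <p>A minimal Python sketch of this keyword-based topic filtering is given below; the keyword lists are illustrative examples, not the exact ones we used.</p>
        <preformat>
# Minimal sketch of the keyword-based topic filtering.
# The keyword lists are illustrative; tokens may appear in any case and any order.
import re

FESTIVAL_KEYWORDS = {
    "Festival de Cannes": ["cannes"],
    "Festival d'Avignon": ["avignon"],
    "Hellfest": ["hellfest"],
}

def match_topics(text):
    """Return the festivals whose keywords all appear in the Tweet, case-insensitively."""
    topics = []
    for festival, keywords in FESTIVAL_KEYWORDS.items():
        if all(re.search(r"\b" + re.escape(kw) + r"\b", text, re.IGNORECASE) for kw in keywords):
            topics.append(festival)
    return topics

print(match_topics("Amazing line-up at HELLFEST this year"))
        </preformat>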
      </sec>
      <sec id="sec-3-2">
        <title>2.3 Data enrichment</title>
        <p>The goal of this intermediary step is to enrich the pre-filtered data with linguistic
information. The output of this step is stored so that the process only needs to run once,
avoiding a loss in performance.</p>
        <p>We first normalize each Tweet using the Python module “tweet-preprocessor” (https://pypi.org/project/tweet-preprocessor/). It is
fully customizable, allowing us to specify the parts of the Tweets we want to remove:
URLs, mentions, emojis and smileys. We decide to keep hashtags as they might contain
important information; for example, in “The sound is too loud! #FestivalCannes”, the hashtag allows us
to identify the topic of the Tweet. We normalize hashtags by removing the “#” character.</p>
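        <p>A minimal sketch of this normalization with tweet-preprocessor follows; the helper function is an illustrative assumption built on the module’s set_options/clean interface.</p>
        <preformat>
# Minimal sketch of Tweet normalization with the tweet-preprocessor module.
# URLs, mentions, emojis and smileys are removed; hashtags are kept and only
# stripped of the "#" character.
import preprocessor as p

p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.EMOJI, p.OPT.SMILEY)

def normalize(text):
    cleaned = p.clean(text)          # remove the elements selected above
    return cleaned.replace("#", "")  # keep hashtag content, drop the marker

print(normalize("The sound is too loud! #FestivalCannes @someone http://t.co/xyz"))
        </preformat>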
        <p>After text normalization, we extract the following information for each Tweet using
NLP tools selected according to our needs; they are mostly bilingual and fast enough to
handle the data size (see Table 3).</p>
        <p>─ List of tokens
─ List of lemmas
─ List of POS labels
─ Subjectivity score
─ Opinion polarity score</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <p>Lists of tokens, lemmas and POS labels are obtained by running the TreeTagger tool (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) on
the normalized Tweets. We use the normalized Tweets because TreeTagger is meant to
analyze regular texts, while the original Tweets are noisy as they can contain
Tweet-specific elements like smileys. Given as a parameter the language of the
text to analyze (English or French), TreeTagger returns a list of lists, each containing a
form, its POS label and its lemma.</p>
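      <p>One possible way to obtain this information from Python is the third-party treetaggerwrapper package, sketched below as an assumption (it requires a local TreeTagger installation and is not necessarily how our pipeline calls the tool).</p>
      <preformat>
# Sketch of the morpho-syntactic enrichment with TreeTagger, through the
# third-party treetaggerwrapper package (assumes TreeTagger is installed locally).
import treetaggerwrapper

tagger = treetaggerwrapper.TreeTagger(TAGLANG="en")
tags = treetaggerwrapper.make_tags(tagger.tag_text("The sound is too loud at this festival"))
for tag in tags:
    print(tag.word, tag.pos, tag.lemma)  # token, POS label, lemma
      </preformat>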
      <p>Subjectivity and opinion polarity scores are obtained using the “TextBlob” library (http://textblob.readthedocs.io/en/dev/) and
its adaptation for French named “textblob-fr” (https://github.com/sloria/textblob-fr). TextBlob computes the scores using
lexical resources and pattern matching. We run it on the normalized Tweets.</p>
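      <p>For illustration, a minimal sketch of this scoring step with TextBlob and textblob-fr, following their documented usage; the wrapper function is an assumption.</p>
      <preformat>
# Minimal sketch of subjectivity / polarity scoring with TextBlob (English)
# and textblob-fr (French).
from textblob import TextBlob
from textblob_fr import PatternTagger, PatternAnalyzer

def sentiment_scores(text, lang):
    if lang == "fr":
        blob = TextBlob(text, pos_tagger=PatternTagger(), analyzer=PatternAnalyzer())
        polarity, subjectivity = blob.sentiment  # textblob-fr returns a (polarity, subjectivity) pair
    else:
        blob = TextBlob(text)
        polarity, subjectivity = blob.sentiment.polarity, blob.sentiment.subjectivity
    return polarity, subjectivity

print(sentiment_scores("I love Hellfest because it is ethic.", "en"))
print(sentiment_scores("J'adore ce festival, il est magnifique.", "fr"))
      </preformat>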
      <sec id="sec-4-1">
        <title>2.4 Opinion and argumentation filtering</title>
        <p>This step computes an argumentativity score for each Tweet according to the opinion
and argumentation it contains. We have selected linguistic features that may represent
both aspects.</p>
        <p>
          For opinion detection, we use the subjectivity score, as we consider the expression of
an opinion to be a subjective content [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We consider that the higher its subjectivity
score, the more “opinioned” a Tweet is. We also use the opinion polarity score, not for
the polarity itself, but for its magnitude, which may also indicate how much opinioned a
Tweet is. These two scores are combined with their respective weights (specified in
section 2.6) into a magnitude score described in equation (2):
mag(T) = (w_subj · subj(T) + w_pol · |pol(T)|) / (w_subj + w_pol) (2)
where mag(T) is the opinion magnitude score (comprised between [0-1]) of Tweet T,
subj(T) is its subjectivity score (comprised between [0-1]), |pol(T)| is the magnitude of
its polarity score (in [0-1]), and w_subj and w_pol are their respective weights.
We also use two lexical resources: one for English and one for French. For English, we
use [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], which encodes the “arousal” property of 13,915 English lemmas. It associates
a score to each lemma according to the affectivity it denotes; our hypothesis is that the
more a Tweet contains high-affectivity lemmas (high scores), the more opinioned it
would be (see equation 3). For French, we use [7], a French lexicon which associates
to 14,129 non-neutral lemmas a binary polarity value (“positive” or “negative”) and six
binary values indicating whether each lemma evokes (1) or not (0) one of six
sentiments: joy, anger, surprise, sadness, disgust and fear. We consider
sentiment as an internal psychological state whose expression can serve the formulation of an
opinion. Our hypothesis is that the more a French Tweet contains lemmas present in
this lexicon, the more opinioned it would be.
        </p>
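        <p>The following minimal Python sketch illustrates equation (2) read as a weighted arithmetic mean of the two scores; the weight values are illustrative assumptions.</p>
        <preformat>
# Sketch of the opinion magnitude score of equation (2), assuming a weighted
# arithmetic mean of the subjectivity score and the absolute polarity score.
def opinion_magnitude(subjectivity, polarity, w_subj=0.5, w_pol=0.5):
    """Both input scores are expected in [0, 1] (polarity taken as a magnitude)."""
    return (w_subj * subjectivity + w_pol * abs(polarity)) / (w_subj + w_pol)

print(opinion_magnitude(0.9, -0.6))  # a subjective, strongly polarized Tweet
        </preformat>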
        <p>For English Tweets, we compute a concreteness score (see equation 8) relying on
the lexical resource [8]. It associates to nearly 40,000 English lemmas a score which
indicates how perceptible (by the five senses) their meaning is. As we start from the
hypothesis that an argumentative text is factual, thus independent from an individual’s
state of mind, we formulate a further hypothesis saying that it may contain more concrete
lemmas:
conc(T) = (1 / n) · Σ_{i=1..n} conc_lex(lemma_i) (8)
where conc(T) is the concreteness score, comprised between [0-1], of Tweet T,
conc_lex(lemma_i) is the lexicon-based concreteness score (7) of its i-th lemma
(normalized between [0-1], 0 if the lemma is missing from the lexicon) and n is the
number of tokens in T.</p>
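        <p>A minimal sketch of equation (8) follows; the toy lexicon stands in for the concreteness norms of [8] and its scores are assumed already normalized to [0-1].</p>
        <preformat>
# Sketch of the concreteness score of equation (8): the average lexicon-based
# concreteness of a Tweet's lemmas, with 0 for lemmas absent from the lexicon.
# The toy lexicon below stands in for the concreteness norms.
CONCRETENESS = {"ticket": 0.95, "price": 0.70, "sound": 0.85, "love": 0.25}

def concreteness_score(lemmas):
    if not lemmas:
        return 0.0
    return sum(CONCRETENESS.get(lemma, 0.0) for lemma in lemmas) / len(lemmas)

print(concreteness_score(["ticket", "price", "be", "too", "high"]))
        </preformat>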
      </sec>
      <sec id="sec-4-2">
        <title>2.5 Diversity filtering</title>
        <p>In this final step, we build a set of Tweets that maximizes the diversity criterion among
the most argumentative Tweets. Diversity measures how much festival aspects the
Tweets mention. Thus, we suppose diverse Tweets may contain words semantically
distant. For example, we detect that “This festival is too expensive.” and “Ticket price
for this festival is too high.” mention a similar aspect by the semantic proximity
between the words “expensive” and “price”. Inversely, the texts “This festival program is
so good!” and “This festival proposes a good choice of beers!” are more distant due to
the semantic distance between the words “program” and ”beer”.</p>
        <p>As diversity is computed according to the lexical semantic distance between
Tweets, we use word embedding models from Sketch Engine (https://embeddings.sketchengine.co.uk/static/index.html) to get a spatial
representation of words, one for English and one for French. As we want to keep as much
form-wise information as possible, we select for both languages word form models
(without lowercasing). For English, we select the model based on the British National
Corpus because it is the lightest and therefore avoids memory problems at loading time. For
French, the only one proposed is a model based on a Web corpus. We vectorize a Tweet
by matching its tokens against the model using the FastText Python module (https://github.com/facebookresearch/fastText/tree/master/python). We use
K-means clustering via the scikit-learn toolkit (http://scikit-learn.org/stable/index.html) to compute the distance between
vectorized Tweets.</p>
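        <p>The sketch below illustrates one way to vectorize a Tweet by averaging the vectors of the tokens found in the embedding model; the tiny in-memory vector table and the averaging strategy are illustrative assumptions, not the Sketch Engine model itself.</p>
        <preformat>
# Sketch of Tweet vectorization: average the embedding vectors of the tokens
# found in the word-embedding model.  The tiny vector table below stands in
# for the word form model actually loaded in the pipeline.
import numpy as np

WORD_VECTORS = {
    "expensive": np.array([0.9, 0.1, 0.0]),
    "price":     np.array([0.8, 0.2, 0.1]),
    "program":   np.array([0.1, 0.9, 0.3]),
}

def vectorize(tokens, dim=3):
    vectors = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

print(vectorize(["Ticket", "price", "for", "this", "festival"]))
        </preformat>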
      </sec>
      <sec id="sec-4-3">
        <title>2.6 Impacts of dataset pre-filtering</title>
        <p>We evaluate the impact of the pre-filtering steps in reducing data size by measuring compression ratios among selective properties for
argumentativity. This includes linguistic properties (lemmas, subjectivity and opinion
polarity) obtained as described in section 2.3. We consider properties which are relative</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <p>We can observe in Table 1 the evolution of author and vocabulary usage between
the first two filtering steps. The unique authors ratio increases by around 80% for both
languages; it is a considerable increase compared to the previous step and a positive result
for the representativeness of the data. Vocabulary usage is poor in English (0.2% of
different lemmas) and in French (0.5%); this may be a relevant concern for the diversity
criterion. Polarity and subjectivity average magnitudes stay low even if they increase for
French Tweets; the selective power of this information may be preserved.</p>
      <p>A run returns the 100 most argumentative and diverse Tweets for all languages and
festival names from the particular task context. To get the 100 most argumentative and
diverse Tweets, we run K-Means with k = 100 and select the Tweet with the highest
argumentativity score from each cluster. Each run results in a ranked set of Tweets, with the
most argumentative first.</p>
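      <p>A minimal sketch of this diversity selection follows: cluster the Tweet vectors with K-Means (k is reduced here; the submitted runs used k = 100) and keep the most argumentative Tweet of each cluster. The random stand-in data and the helper function are illustrative assumptions.</p>
      <preformat>
# Sketch of the diversity filtering: K-Means over Tweet vectors, then keep the
# Tweet with the highest argumentativity score in each cluster.
import numpy as np
from sklearn.cluster import KMeans

def select_diverse(tweet_vectors, arg_scores, k=3):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(np.asarray(tweet_vectors))
    best = {}
    for idx, label in enumerate(labels):
        if label not in best or arg_scores[idx] > arg_scores[best[label]]:
            best[label] = idx
    # rank the selected Tweets by decreasing argumentativity
    return sorted(best.values(), key=lambda i: arg_scores[i], reverse=True)

vectors = np.random.RandomState(0).rand(20, 3)   # stand-ins for vectorized Tweets
scores = np.random.RandomState(1).rand(20)       # stand-ins for argumentativity scores
print(select_diverse(vectors, scores))
      </preformat>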
      <p>We have submitted three runs which differ by the features and associated weights used
for computing the argumentativity score of each Tweet. We combine scores, described
in section 2.4, that use the same types of linguistic information (POS-based and lexical
types). The purpose is to evaluate the impact of each type of feature. Mathematically, the combination is an
arithmetic mean (see equations 9, 10 and 11).</p>
      <p>Equations (9), (10) and (11) define the argumentativity score of a Tweet for runs 1, 2
and 3 respectively, each as an arithmetic mean of the feature scores selected for that run.</p>
      <sec id="sec-5-1">
        <title>3.1 Results</title>
        <p>Regarding argumentativity, the organizers used two measures to evaluate the quantity
of argumentative content in runs. NDCG measures the relevance of Tweets with a
discount function over the rank: in each run, the most relevant Tweets must appear first.
NDCG measures the relevance of each run according to regular expressions which
match argumentative content. Two references for argumentative content have been
used: one manually prepared by annotators and another one obtained by pooling the runs
from the different participants’ systems. A measure named “%arg” gives the percentage
of argumentative content compared to both the pooling and the manual references. Table 2
presents the results for all runs.</p>
        <p>We observe in Table 2 that the best results are not obtained with the same types of features
across the languages. All runs use the magnitude of subjectivity and polarity scores (see
section 2.7) but in English the best results are obtained by addition with lexical features
(run 2) while in French the best run combines lexical and POS-based features (run 1).
We explain this difference by the different natures of lexical resources across the
languages; as we suspected preparing the runs (see section 2.7), the lexical resource for
English may be more related to argumentativity especially with the concreteness
property while the French one is only about sentiment expression which may be useful for
opinion mining (see section 2.4) but not sufficient to detect argumentative content.</p>
        <p>Comparing the results in one language, run 2 in French is particularly low
(NDCGpooling and %arg). This run in French may not include enough features related to
argumentativity; the presence of opinion polarity, subjectivity or sentiments in a Tweet
should indicate that it contains a personal expression but it does not imply that it is
justified by an argumentation. However, the addition of the lexical sentiment feature
allows run 1 to be the best in French (in comparison with run 3). We think that a
personal content may be the base for argumentation as a supporting tool. In other words,
particularly on Twitter, we suppose that there might not be argumentation without a
personal content. We note that POS-based information capturing discourse structuration is
effective even on Tweets, probably because POS-based scores are relative among
Tweets. In English, it is surprising to observe that the run with lexical and POS-based
features (run 1) gets lower results than the run without POS-based information (run 2).
Regarding the weights among the two runs (see section 2.7), the lexical feature is ¾ of
the score in run 2 whereas it is ½ in run 1; we think that the lower results in run 1
compared to run 2 might be explained by the lower importance of the lexical feature in
run 1 rather than the addition of the POS-based feature. It is supported by the result of
run 3 which uses the POS-based feature and gets a better NDCG score than run 1.</p>
        <p>Considering our position among the different participants’ systems is interesting:
by the pooling reference we are in first place for both languages (the lowest scores are 0.00 in
French and 0.05 in English), while by the manual reference we are in last place for English (the best score is 0.06) and
penultimate for French (the lowest score is 2.28, from the baseline, and the best score is
2.89). It means that our system does not correctly match the manual reference but
extracts arguments not considered by the annotators or other participants. Maybe it
reflects a divergence in what is considered as argumentative.</p>
        <p>
The hypothesis of argumentation using words which denote concrete things seems to
be validated by the importance of the corresponding lexical feature in English, getting
a better score when it is used with a greater weight. In French, the discourse connectors
feature gives the best results and validates the assumption of a more structured text
when it is argumentative, even on short messages like Tweets.</p>
        <p>As the lexicon encoding the concreteness and arousal properties allows us to obtain the best
results, it may be relevant to build a corresponding resource in French; this could be
achieved by a translation process. It would be interesting to see whether we then also get better
results in French.</p>
        <p>We would like to analyze similar features on other text media to compare their
respective contributions. In particular, it would be interesting to evaluate the POS-based
feature with structuration words on texts which are less bound by their size, as a
longer text may need to be more structured.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. CLEF MC2 Lab Homepage, http://www.mc2.talne.eu, last accessed 2018/05/24.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Palau</surname>
            ,
            <given-names>R. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moens</surname>
            ,
            <given-names>M. F.</given-names>
          </string-name>
          :
          <article-title>Argumentation mining: the detection, classification and structure of arguments in text</article-title>
          .
          <source>In: Proceedings of the 12th international conference on artificial intelligence and law</source>
          , pp.
          <fpage>98</fpage>
          -
          <lpage>107</lpage>
          , ACM (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dusmanu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cabrio</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Villata</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Argument mining on twitter: arguments, facts and sources</article-title>
          .
          <source>In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          , pp.
          <fpage>2317</fpage>
          -
          <lpage>2322</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Lui</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baldwin</surname>
          </string-name>
          , T.:
          <article-title>langid. py: An off-the-shelf language identification tool</article-title>
          .
          <source>In: Proceedings of the ACL 2012 system demonstrations</source>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>30</lpage>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Tsytsarau</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Palpanas</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Survey on mining subjective data on the web</article-title>
          .
          <source>In: Data Mining and Knowledge discovery 24(3)</source>
          ,
          <fpage>478</fpage>
          -
          <lpage>514</lpage>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Warriner</surname>
            ,
            <given-names>A. B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuperman</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brysbaert</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Norms of valence, arousal, and dominance for 13,915 English lemmas</article-title>
          .
          <source>In: Behavior research methods 45(4)</source>
          ,
          <fpage>1191</fpage>
          -
          <lpage>1207</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>