<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DAEDALUS at RepLab 2014: Detecting RepTrak Reputation Dimensions on Tweets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>César de Pablo-Sánchez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Janine García-Morera</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julio Villena-Román</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José Carlos González-Cristóbal</string-name>
          <email>josecarlos.gonzalez@upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DAEDALUS - Data, Decisions and Language, S.A.</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Carlos III de Madrid</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universidad Politécnica de Madrid</institution>
        </aff>
      </contrib-group>
      <fpage>1505</fpage>
      <lpage>1511</lpage>
      <abstract>
        <p>This paper describes our participation at the RepLab 2014 reputation dimensions scenario. Our idea was to evaluate the best combination strategy of a machine learning classifier with a rule-based algorithm based on logical expressions of terms. Results show that our baseline experiment using just Naive Bayes Multinomial with a term vector model representation of the tweet text is ranked second among runs from all participants in terms of accuracy.</p>
      </abstract>
      <kwd-group>
        <kwd>RepLab</kwd>
        <kwd>CLEF</kwd>
        <kwd>reputation analysis</kwd>
        <kwd>reputation dimensions</kwd>
        <kwd>machine learning classifier</kwd>
        <kwd>Naive Bayes Multinomial</kwd>
        <kwd>rule-based approach</kwd>
        <kwd>hybrid approach</kwd>
        <kwd>combination</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        RepLab [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a competitive evaluation exercise for reputation analysis, launched in the
2012 edition of the CLEF [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] campaign. It initially focused on the problem of
monitoring the reputation of entities (mainly companies) on Twitter, covering the
tasks of entity name disambiguation, reputation polarity, topic detection and topic
ranking. RepLab 2014 introduced two new tasks: the categorization of
messages with respect to standard reputation dimensions, and the characterization of
Twitter profiles (author profiling) with respect to a certain activity domain.
      </p>
      <p>Specifically, the reputation dimensions scenario consists of a classification task
that must return the reputation dimension implicit in a given tweet, chosen
from the standard categorization provided by the Reputation Institute<sup>1</sup>: (1)
Products/Services, (2) Innovation, (3) Workplace, (4) Citizenship, (5) Governance,
(6) Leadership, (7) Performance, and (8) Undefined. Participants are provided with a
training corpus containing a collection of tweets in Spanish and English that refer to a
selected set of entities in the automotive or banking domain. Each tweet is categorized
into one of the aforementioned reputation dimensions.</p>
      <p>
        This paper describes our participation in the RepLab 2014 reputation dimensions
scenario. Our team is led by DAEDALUS<sup>2</sup>, a leading provider of
language-based solutions in Spain, and includes research groups from Universidad Politécnica
de Madrid and Universidad Carlos III de Madrid. We are long-time participants in CLEF [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], having taken part in
many different tracks and tasks since 2003, including both previous editions of RepLab [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>The idea behind our participation was to evaluate the best strategy for combining
a machine learning classifier with a rule-based algorithm based on logical expressions
of terms. Our experiments and the results achieved are presented and discussed in the
following sections.</p>
    </sec>
    <sec id="sec-2">
      <title>Experiments</title>
      <p>The dataset for the reputation dimensions task contains tweets in two languages, English
and Spanish, from two different domains, automotive and banking. Our system uses a
different pipeline for each language, as we were interested in
comparing the rule-based classifiers developed for Spanish with
statistical machine-learning classifiers. We submitted five runs that combine the
statistical and rule-based classifiers.</p>
      <p>We invested some effort in tokenizing the tweet text and
URL, because preliminary cross-validation experiments on the training corpus
showed that the tokenization process was much more important
than the choice of algorithm. Our runs use information from the text and
extended_url fields of the tweet.</p>
      <p>Our baseline run (Run #1) is based on a supervised classifier for each language.
A Multinomial Naive Bayes (NBM) classifier over a simple bag-of-words representation
was selected by cross-validation from a collection of different algorithms.</p>
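      <p>As an illustration of this kind of baseline, the following minimal Python sketch implements a multinomial Naive Bayes classifier over bag-of-words counts. It is not the Weka implementation used in our runs: the class name, the add-one smoothing and the toy training data are assumptions made only for this example.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Multinomial Naive Bayes over bag-of-words counts, with add-one smoothing."""

    def fit(self, docs, labels):
        # docs: list of token lists; labels: one class label per document.
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        self.n_docs = len(labels)
        return self

    def log_score(self, tokens, label):
        # log P(label) + sum over tokens of log P(token | label).
        total = sum(self.term_counts[label].values())
        denom = total + len(self.vocab)
        score = math.log(self.class_counts[label] / self.n_docs)
        for t in tokens:
            if t in self.vocab:  # terms unseen in training are ignored
                score += math.log((self.term_counts[label][t] + 1) / denom)
        return score

    def predict(self, tokens):
        # Pick the class with the highest log score.
        return max(self.class_counts, key=lambda c: self.log_score(tokens, c))


# Toy training data, invented for illustration (not taken from the RepLab corpus).
train_docs = [
    ["new", "car", "model", "great", "engine"],
    ["bank", "fees", "account", "service"],
    ["ceo", "resigns", "board"],
    ["ceo", "announces", "strategy"],
]
train_labels = ["Products/Services", "Products/Services", "Leadership", "Leadership"]

clf = MultinomialNB().fit(train_docs, train_labels)
prediction = clf.predict(["ceo", "strategy", "board"])
print(prediction)
```
]]></preformat>
      <p>In this sketch one classifier would be trained per language, each on the tokens produced by the tokenization step.</p>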
      <p>
        We used the Weka 3.7 implementation of NBM [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and its WordTokenizer,
which lets the user define split characters that are removed from the term vector space
representation of the text. Besides the usual split symbols (spaces and some
punctuation), we used tweet-specific delimiters such as hashtag marks (#), username marks (@) and
emoticons, as well as URL-specific delimiters such as slashes, ampersands, question
marks and hyphens, which are used to separate words in SEO-optimized URLs. Finally,
because a large number of terms were low-frequency numerals, we also added numbers to the set
of delimiters to help with normalization.
      </p>
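      <p>A delimiter-based tokenizer of this kind can be sketched in a few lines of Python. The exact delimiter set below, the lowercasing step and the example tweet are assumptions for illustration; they do not reproduce the configuration of the Weka tokenizer used in our runs.</p>
      <preformat><![CDATA[
```python
import re

# Delimiter characters, assumed from the description above: whitespace and common
# punctuation, tweet markers (# and @), URL separators (/ & ? -) and digits.
# Simple emoticons such as :) consist only of these characters, so they vanish too.
DELIMITERS = re.compile(r"[\s.,;:!?\"'()\[\]=_#@/&\-0-9]+")


def tokenize(text):
    """Lowercase the text and split it on any run of delimiter characters."""
    return [tok for tok in DELIMITERS.split(text.lower()) if tok]


# Hypothetical tweet text followed by its expanded URL.
tweet = "Loving the new #Audi A3 :) @dealer http://example.com/best-car-deals?id=42&ref=tw"
tokens = tokenize(tweet)
print(tokens)
```
]]></preformat>
      <p>Splitting the URL on slashes, hyphens, question marks and ampersands exposes the words of the URL path (here "best", "car" and "deals") as ordinary terms, while digits are discarded.</p>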
      <p>During development, we tested different parameter configurations and
algorithms, and concluded that NBM was robust enough and that other
representations (bigrams, feature selection) added no value.</p>
      <p>The two classifiers perform differently, as
the amount of training data for each language was quite different. The English training
data comprises 11 869 tweets, whereas the Spanish data is about one third of that size (3 692
tweets). In our preliminary cross-validation experiments, the Spanish classifier
was about 10% lower in accuracy than the English classifier, and the difference was
particularly marked for categories with few labelled instances (Innovation,
Leadership or Workplace).</p>
      <p><sup>2</sup> http://www.daedalus.es/</p>
      <p>The rest of the runs make use of different combinations of this NBM classifier with
a rule-based classifier for business reputation developed prior to our participation in
the task. This rule-based classifier is an adaptation for tweets of a previous model
developed for longer texts like news and blogs. This classifier was only available in
Spanish, so English just uses the initial baseline NBM classifier.</p>
      <p>The combination of methods in the different runs is described in the following table.</p>
      <p>
        The rule-based classifier is built using the Textalytics Text Classification API [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
which, despite its name, is itself based on a hybrid algorithm [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] that combines
statistical classification, which provides a base model that is relatively easy to train,
with rule-based filtering, which post-processes and improves the results
of the statistical classifier by filtering out false positives and dealing with false
negatives. This makes it possible to reach a high degree of precision in different environments.
      </p>
      <p>The machine learning classifier uses an implementation based on kNN, and a
simple rule language makes it possible to express lists of positive, negative and
relevant (multiword) terms appearing in the text.</p>
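      <p>The interplay between statistical scores and term rules can be illustrated with the following Python sketch. The semantics chosen here (a negative term vetoes a category, at least one positive term is required when any are listed, and each relevant term adds a fixed boost) are assumptions for the example, as are the names Rule and apply_rules; they do not describe the actual rule language of the Textalytics API.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """Hypothetical rule for one category, made of lists of (multiword) terms."""
    category: str
    positive: list = field(default_factory=list)  # at least one must appear
    negative: list = field(default_factory=list)  # none may appear
    relevant: list = field(default_factory=list)  # each hit boosts the score


def apply_rules(text, scores, rules, boost=0.1):
    """Post-process statistical scores: drop vetoed categories, boost relevant hits."""
    text = text.lower()
    result = {}
    for rule in rules:
        score = scores.get(rule.category, 0.0)
        # Plain substring matching is used for brevity; real rules would match terms.
        if any(term in text for term in rule.negative):
            continue  # a negative term vetoes the category
        if rule.positive and not any(term in text for term in rule.positive):
            continue  # no positive term found: treated as a false positive
        score += boost * sum(term in text for term in rule.relevant)
        result[rule.category] = score
    return result


rules = [
    Rule("Products/Services", positive=["coche", "garantía"],
         negative=["dimite"], relevant=["relación calidad precio"]),
    Rule("Leadership", positive=["presidente", "consejero delegado"]),
]
# Hypothetical scores, standing in for the output of the statistical classifier.
scores = {"Products/Services": 0.6, "Leadership": 0.5}
labels = apply_rules("El nuevo coche tiene muy buena relación calidad precio", scores, rules)
print(labels)
```
]]></preformat>
      <p>Because the function returns every category that survives the filtering step, the sketch is multilabel: a single message can keep several categories.</p>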
      <p>The classifier uses a slightly modified RepTrak ontology that contains more
detailed classes. For instance, "Products and services" includes "Satisfaction of
necessities", "Reclamations", "Customer relationship management", "Value for
money", "Quality of products and services" and "Warranty". Moreover, it is a
multilabel classifier and can assign several labels to a single message.</p>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>The reputation dimensions task has been evaluated as a classification problem, so
accuracy and precision/recall measures over each class are reported, using accuracy as
the main measure.</p>
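      <p>For reference, these measures can be computed as in the following Python sketch. The function name and the toy label sequences are assumptions for illustration and are unrelated to the official RepLab evaluation scripts.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def evaluate(gold, predicted):
    """Return overall accuracy and per-class (precision, recall)."""
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    accuracy = sum(correct.values()) / len(gold)
    per_class = {}
    for label in sorted(set(gold) | set(predicted)):
        # Precision: correct predictions of the label over all predictions of it.
        precision = correct[label] / pred_counts[label] if pred_counts[label] else 0.0
        # Recall: correct predictions of the label over all gold instances of it.
        recall = correct[label] / gold_counts[label] if gold_counts[label] else 0.0
        per_class[label] = (precision, recall)
    return accuracy, per_class


# Toy labels, invented for illustration.
gold = ["Products/Services", "Products/Services", "Innovation", "Leadership"]
predicted = ["Products/Services", "Products/Services", "Products/Services", "Leadership"]
accuracy, per_class = evaluate(gold, predicted)
print(accuracy, per_class)
```
]]></preformat>
      <p>In the toy example the Innovation tweet is absorbed by the most frequent class, so accuracy stays high while recall for Innovation drops to zero.</p>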
      <p>Results achieved by our runs are shown in Table 3. The columns in the table are
accuracy and the ratio of classified tweets, i.e., the fraction of the tweets
available at the time of evaluation that were classified. The organizers state that a baseline that
assigns every tweet to the most frequent class would get 56% accuracy.</p>
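      <p>A most-frequent-class baseline of the kind mentioned by the organizers can be sketched as follows. The function name and the toy label counts are assumptions for illustration; they do not reproduce the 56% figure, which depends on the real test set.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training class."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)


# Toy label distributions, invented for illustration.
train = ["Products/Services"] * 6 + ["Citizenship"] * 3 + ["Governance"]
test = ["Products/Services"] * 5 + ["Citizenship"] * 3 + ["Undefined"] * 2
majority_label, baseline_accuracy = majority_baseline_accuracy(train, test)
print(majority_label, baseline_accuracy)
```
]]></preformat>
      <p>The baseline accuracy equals the share of the most frequent class in the test set, which is why a skewed class distribution makes it hard to beat.</p>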
      <p>The following table and figure represent the distribution of classes in the gold
standard and in the output of our runs. Our runs, like most runs from participants, are
clearly biased toward the most frequent class ("Products and services"), as can be seen
by comparing them with the gold standard.</p>
      <p>[Table and figure: distribution of classes in the gold standard (GOLD) and in Runs #1 to #5.]</p>
      <p>The following table reports the precision and recall of our runs and of the best
ranked experiment in terms of accuracy. Our problem appears to lie in recall rather
than in precision.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future work</title>
      <p>Results show that our baseline experiment using Naive Bayes Multinomial with a
term vector model representation of the tweet text is ranked second among runs from
all participants in terms of accuracy. No definite conclusion can be drawn from this
as to whether the Naive Bayes algorithm achieves better or worse accuracy for
predicting reputation dimensions than our rule-based model, as the approaches are mixed
across both languages. If the rule-based model had been migrated to English in time, the
comparison among runs would have been easier. Moreover, again due to lack of time and
resources, we have not yet been able to carry out an individual analysis by language,
so we do not yet understand the contribution of each approach to the final result.</p>
      <p>However, accuracy values show that, despite the difficulty of the task, results
are quite acceptable and somewhat validate the idea that this technology may
already be included in an automated workflow as the first step towards social
media mining and online reputation analysis.</p>
      <p>Moreover, a manual inspection of the training data reveals certain
misclassifications and a lack of consistent criteria in the assignment of categories, with some points of
ambiguity and disagreement about whether a tweet should be
assigned to a given reputation dimension, specifically in the case of Products/Services
and Citizenship. We would appreciate a clear description of the guidelines, with
annotation criteria that depend on the context.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This work has been supported by several Spanish R&amp;D projects: Ciudad2020:
Towards a New Model of a Sustainable Smart City (INNPRONTA IPT-20111006),
MA2VICMR: Improving the Access, Analysis and Visibility of Multilingual and
Multimedia Information in Web (S2009/TIC-1542) and MULTIMEDICA:
Multilingual Information Extraction in Health Domain and Application to Scientific
and Informative Documents (TIN2010-20644-C03-01).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
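          1.
          <string-name><given-names>Enrique</given-names> <surname>Amigó</surname></string-name>,
          <string-name><given-names>Jorge</given-names> <surname>Carrillo-de-Albornoz</surname></string-name>,
          <string-name><given-names>Irina</given-names> <surname>Chugur</surname></string-name>,
          <string-name><given-names>Adolfo</given-names> <surname>Corujo</surname></string-name>,
          <string-name><given-names>Julio</given-names> <surname>Gonzalo</surname></string-name>,
          <string-name><given-names>Edgar</given-names> <surname>Meij</surname></string-name>,
          <string-name><given-names>Maarten</given-names> <surname>de Rijke</surname></string-name>,
          <string-name><given-names>Damiano</given-names> <surname>Spina</surname></string-name>.
          <year>2014</year>.
          <article-title>Overview of RepLab 2014: author profiling and reputation dimensions for Online Reputation Management</article-title>.
          <source>Proceedings of the Fifth International Conference of the CLEF Initiative</source>,
          September 2014, Sheffield, United Kingdom.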
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. CLEF.
          <year>2014</year>
          .
          <article-title>CLEF Initiative (Conference and Labs of the Evaluation Forum)</article-title>
          . http://www.clef-initiative.eu/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><given-names>Julio</given-names> <surname>Villena-Román</surname></string-name>,
          <string-name><given-names>Sara</given-names> <surname>Lana-Serrano</surname></string-name>,
          <string-name><given-names>Cristina</given-names> <surname>Moreno-García</surname></string-name>,
          <string-name><given-names>Janine</given-names> <surname>García-Morera</surname></string-name>,
          <string-name><given-names>José Carlos</given-names> <surname>González-Cristóbal</surname></string-name>.
          <year>2012</year>.
          <article-title>DAEDALUS at RepLab 2012: Polarity Classification and Filtering on Twitter Data</article-title>.
          <source>CLEF 2012 Labs and Workshop Notebook Papers</source>.
          Rome, Italy, September 2012.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Textalytics Text Classification API.
          <year>2014</year>.
          <article-title>Text Classification v1.1</article-title>.
          http://textalytics.com/core/class-info
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><given-names>Julio</given-names> <surname>Villena-Román</surname></string-name>,
          <string-name><given-names>Sonia</given-names> <surname>Collada-Pérez</surname></string-name>,
          <string-name><given-names>Sara</given-names> <surname>Lana-Serrano</surname></string-name>, and
          <string-name><given-names>José Carlos</given-names> <surname>González-Cristóbal</surname></string-name>.
          <year>2011</year>.
          <article-title>Método híbrido para categorización de texto basado en aprendizaje y reglas</article-title>.
          <source>Procesamiento del Lenguaje Natural</source>, Vol.
          <volume>46</volume>, pp.
          <fpage>35</fpage>-<lpage>42</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><given-names>Julio</given-names> <surname>Villena-Román</surname></string-name>,
          <string-name><given-names>Sonia</given-names> <surname>Collada-Pérez</surname></string-name>,
          <string-name><given-names>Sara</given-names> <surname>Lana-Serrano</surname></string-name>, and
          <string-name><given-names>José Carlos</given-names> <surname>González-Cristóbal</surname></string-name>.
          <year>2011</year>.
          <article-title>Hybrid Approach Combining Machine Learning and a Rule-Based Expert System for Text Categorization</article-title>.
          <source>Proceedings of the 24th International Florida Artificial Intelligence Research Society Conference (FLAIRS-11)</source>,
          May 18-20, 2011, Palm Beach, Florida, USA. AAAI Press.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><given-names>M.</given-names> <surname>Hall</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Frank</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Holmes</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Pfahringer</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Reutemann</surname></string-name>, and
          <string-name><given-names>I.H.</given-names> <surname>Witten</surname></string-name>.
          <year>2009</year>.
          <article-title>The WEKA Data Mining Software: An Update</article-title>.
          <source>SIGKDD Explorations</source>, Volume
          <volume>11</volume>, Issue 1.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>