<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Graham McDonald</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Romain Deveaud</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard McCreadie</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Timothy Gollins</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Craig Macdonald</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iadh Ounis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computing Science, University of Glasgow</institution>
          ,
          <addr-line>G12 8QQ, Glasgow</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <fpage>1500</fpage>
      <lpage>1504</lpage>
      <abstract>
        <p>This paper describes our participation in the RepLab 2014 Reputation Dimensions task. The task is a multi-class classification task in which tweets relating to an entity of interest are to be classified by their reputation dimension. We investigate two approaches. Firstly, we use a term's gini-index score to quantify how representative the term is of a specific class, and construct class profiles for tweet classification. Secondly, we perform tweet enrichment using a web-scale corpus to derive terms representative of a tweet's class, before training a classifier on the enriched tweets. Our tweet enrichment approach with an SVM classifier achieved the best overall accuracy of the task (0.7318), showing that this approach is effective for classifying tweets by their reputation dimensions and is a promising direction for future work.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        This notebook paper describes our participation in the Reputation Dimensions task of
RepLab 2014 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. RepLab is a competitive exercise for Online Reputation
Management (ORM) systems, organized as an activity of the Cross Language Evaluation
Forum (CLEF). ORM is concerned with the tracking and monitoring of media to identify
what is being said about an entity [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. With the increased popularity of social media
and communication platforms such as Twitter, which allow users to reach a global
audience and share their experiences in real time, it is especially important for companies
to be able to monitor their public perception, assisted by ORM tools, and to react in an
appropriate and timely manner.
      </p>
      <p>For a company to respond to changing public opinion expressed in online
communications about it, three components of a communication
must be understood. Firstly, the company must be aware of which aspects (Dimensions) of
its business the communication is about, for example Products &amp; Services. Secondly,
the company must understand the type of author, for example whether the author is a journalist.
Thirdly, the company must understand how influential the author is.</p>
      <p>In RepLab 2014, we participated in the Reputation Dimensions
task, which addresses the first of these three components. The remainder of this paper is
structured as follows. Section 2 gives an overview of the Reputation Dimensions task
before Section 3 outlines our classification approaches. Section 4 presents the results of
our submitted runs and, finally, in Section 5 we present our conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>THE REPUTATION DIMENSIONS CORPUS AND TASK</title>
      <p>In this section we give an overview of the Reputation Dimensions corpus and task.
The corpus consisted of English and Spanish tweets crawled during the period 1st June
2012 to 31st December 2012, with just over 75% English tweets. Each tweet related
to at least one of 31 entities of interest from the Banking and Automotive industries.
For each entity, there were at least 2,200 tweets, with at least 700 and 1,500 tweets for
the training and test sets respectively. The most recent tweets were used for the test set.
Participants were provided with the tweet IDs and had to download the tweet text directly
from Twitter. To retrieve the tweet text, we used the Java tool provided by the RepLab
organisers.</p>
      <p>The Reputation Dimensions task is a multi-class classification task. Participating
systems were to classify each tweet into one of the seven reputation dimensions (Innovation,
Citizenship, Leadership, Workplace, Governance, Performance and Products &amp;
Services) defined in the RepTrak Framework by the Reputation Institute
(http://www.reputationinstitute.com/about-reputation-institute/the-reptrak-framework). The data set also
included an “Undefined” class for tweets that were not classified as one of these seven
dimensions. Undefined tweets are not included in the evaluation.</p>
    </sec>
    <sec id="sec-3">
      <title>CLASSIFICATION APPROACHES</title>
      <p>In this section we give an overview of the approaches we deployed for our participation
in the Reputation Dimensions task. The research questions we address in our
participation are twofold: (1) Can we use the gini-index of a term as a measure of the term’s
belonging to a reputation dimension, in order to construct dimension profiles for tweet
classification? and (2) Can we identify a tweet’s reputation dimension with greater accuracy by
enriching the tweet using a web-scale corpus?</p>
      <p>The remainder of this section is structured as follows: Section 3.1 outlines our
Reputation Dimension Profiling approach, then Section 3.2 gives an overview of our
approach for tweet enrichment.</p>
      <sec id="sec-3-1">
        <title>REPUTATION DIMENSION PROFILING</title>
        <p>
          For our dimension profiling approach we convert the tweets to lower case, remove
non-alphanumeric characters, new lines and URLs, before using the Terrier Information
Retrieval Platform [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] to remove stop-words and index the tweets for each class (as
defined in the gold standard). We discard terms with a term frequency below 3 before
calculating the conditional probability of a term belonging to each class, normalised by
the class distribution over the collection. Using this probability, we calculate a term’s
gini-index [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] score to quantify the term’s discriminatory power between classes. Tweets
and profiles are then represented as term frequency vectors and we classify tweets to
their closest dimension profile using cosine similarity. For developing this approach,
we performed a 5-fold cross validation on the training data, creating profiles from the
training split of each fold.
        </p>
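        <p>As a rough illustration of the scoring step described above, the following Python sketch computes a normalised gini-index for a single term. It assumes the common formulation from the text classification literature, in which the class-conditional probabilities of a term are divided by the class priors, rescaled to sum to one, and then squared and summed. The function name, the input dictionaries and the use of per-class tweet counts are assumptions made for the example, not a description of our implementation.</p>
        <preformat><![CDATA[
```python
def gini_index(term_class_counts, class_sizes):
    """Normalised gini-index of one term (illustrative sketch).

    term_class_counts: class -> number of tweets of that class containing the term
    class_sizes:       class -> number of tweets of that class (all assumed non-zero)
    Returns (gini, probs), where probs maps each class to its normalised probability.
    """
    total = sum(term_class_counts.get(c, 0) for c in class_sizes)
    if total == 0:
        raise ValueError("term does not occur in any class")
    n_tweets = sum(class_sizes.values())

    # P(class | term) divided by the class prior P(class).
    weighted = {}
    for c, size in class_sizes.items():
        p_class_given_term = term_class_counts.get(c, 0) / total
        prior = size / n_tweets
        weighted[c] = p_class_given_term / prior

    # Rescale so that the values sum to one, then sum the squares.
    norm = sum(weighted.values())
    probs = {c: w / norm for c, w in weighted.items()}
    gini = sum(p * p for p in probs.values())
    return gini, probs
```
]]></preformat>
        <p>Dividing by the class prior prevents a term from appearing discriminative merely because one class contains many more tweets than the others.</p>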
        <p>The “Undefined” class is included in our gini-index calculations, giving eight classes
in total and therefore scores in the range of 0.125 (1/8, least discriminative terms) to 1
(most discriminative terms). For
terms with a suitably high gini-index score, we use the term’s class conditional
probability to determine the class that the term is most representative of and add the term
to that class’s profile. Appropriate gini-index and class conditional probability thresholds
were determined by parameter analysis, resulting in profiles constructed from terms
with a gini-index score &gt;= 0.3 and a class conditional probability &gt; 0.1.</p>
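        <p>The profile construction and cosine-similarity classification can be sketched as follows. This is a minimal illustration under stated assumptions: each term is assigned to the class with its highest probability, the profile weight of a term is a frequency supplied by the caller, and tweets that share no term with any profile fall back to “Undefined”. The function names and data layout are hypothetical; only the two thresholds come from the text.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter

GINI_THRESHOLD = 0.3  # gini-index score >= 0.3, as reported in the text
PROB_THRESHOLD = 0.1  # class conditional probability > 0.1, as reported in the text


def build_profiles(term_scores, term_freqs):
    """term_scores: term -> (gini, {class: probability}).
    term_freqs:  term -> frequency used as the term's weight (assumed input)."""
    profiles = {}
    for term, (gini, probs) in term_scores.items():
        if gini < GINI_THRESHOLD:
            continue
        best_class = max(probs, key=probs.get)
        if probs[best_class] > PROB_THRESHOLD:
            profiles.setdefault(best_class, Counter())[term] = term_freqs.get(term, 1)
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse term frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(tweet_tokens, profiles, default="Undefined"):
    """Assign a tweet to its closest profile; the fallback label is an assumption."""
    vec = Counter(tweet_tokens)
    scored = {c: cosine(vec, p) for c, p in profiles.items()}
    best = max(scored, key=scored.get, default=None)
    return best if best is not None and scored[best] > 0 else default
```
]]></preformat>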
        <p>We submitted three runs employing this gini-index technique. Firstly, uogTr_RD_1
classifies tweets using profiles constructed by this process. Secondly, uogTr_RD_2
constructs profiles using this process before enriching the profiles with expansion terms
derived from Wikipedia (http://en.wikipedia.org/wiki/Main_Page). Finally, uogTr_RD_3 constructs profiles using this process
before enriching the profiles with class-specific query expansion terms.</p>
      </sec>
      <sec id="sec-3-2">
        <title>TWEET ENRICHMENT</title>
        <p>For our tweet enrichment approach we pre-process tweets using the same approach as
in Section 3.1 (converting to lowercase, removing non-alphanumeric characters, new
lines and URLs), before enriching the tweets.</p>
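        <p>The pre-processing steps listed above can be sketched in a few lines of Python. The regular expressions, the choice to remove URLs before stripping punctuation, and the decision to keep non-ASCII letters (so that accented characters in Spanish tweets survive) are assumptions made for this example.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical patterns; the exact rules used in our system are not given in the text.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALNUM_RE = re.compile(r"[^\w\s]|_")  # anything that is not a letter, digit or whitespace


def preprocess(tweet):
    """Lower-case a tweet and remove URLs, non-alphanumeric characters and new lines."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)        # drop URLs first, otherwise they fragment into tokens
    text = NON_ALNUM_RE.sub(" ", text)  # replace punctuation and symbols with spaces
    return " ".join(text.split())       # collapse new lines and repeated whitespace
```
]]></preformat>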
        <p>
          To obtain enrichment terms, we use Terrier to submit a raw tweet as a query to
a large contemporaneous web corpus. The top 10 retrieved documents then form a
pseudo-relevant document set. We calculate the entropy of each term within the set
of retrieved documents, and we select the top 20 terms with the highest entropy as the
most informative terms related to the tweet. Then, we enrich the pre-processed tweets by
appending these informative terms. Stop-words are removed from the enriched tweets,
which are further converted into term frequency feature vectors that are used to train
several types of classifiers in Weka [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], using 10-fold cross validation.
        </p>
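        <p>The term selection step can be illustrated with the sketch below. The text does not fix the exact entropy formulation, so the sketch assumes one plausible reading: the entropy of a term is computed over the distribution of its occurrences across the retrieved documents, so that terms spread evenly over the pseudo-relevant set score highest. Retrieval itself is not shown; the documents are passed in as lists of tokens, and all function names are hypothetical.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def term_entropies(docs):
    """docs: list of token lists, e.g. the top 10 retrieved documents."""
    per_doc = [Counter(d) for d in docs]
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)

    entropies = {}
    for term, total in totals.items():
        h = 0.0
        for counts in per_doc:
            tf = counts.get(term, 0)
            if tf:
                p = tf / total  # share of the term's occurrences found in this document
                h -= p * math.log2(p)
        entropies[term] = h
    return entropies


def enrich(tweet_tokens, docs, n_terms=20):
    """Append the n_terms highest-entropy terms to the tweet (ties broken alphabetically)."""
    entropies = term_entropies(docs)
    ranked = sorted(entropies, key=lambda t: (-entropies[t], t))
    return list(tweet_tokens) + ranked[:n_terms]
```
]]></preformat>
        <p>In this sketch stop-words are not filtered; following the description above, they would be removed from the enriched tweet afterwards.</p>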
        <p>
          We submitted two runs employing this technique: Firstly, for uogTr_RD_4 we train
an SVM model using Weka and LibSVM [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Secondly, for uogTr_RD_5 we train the
Weka implementation of the Random Forests [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] classification algorithm.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>RESULTS</title>
      <p>A total of 30 systems were submitted for the Reputation Dimensions task. The task
organisers also reported on a baseline text classification approach that used tweet words as
features to train an SVM classifier. Systems were ranked by their
overall accuracy.</p>
      <p>Table 1 presents the accuracy scores of our runs plus the baseline approach and the
mean over all 31 submissions.</p>
      <p>We see that our Tweet Enrichment approach with an SVM model was
ranked first, with an accuracy score of 0.7318. The Tweet Enrichment
approach with a Random Forests model also performs well, achieving an accuracy score
of 0.6871, markedly above the baseline score of 0.6221.</p>
      <p>Our Dimension Profiling approach performed less well. Increasing
the gini-index and class conditional probability thresholds results in
fewer terms being selected for a class profile, so profiles become smaller as
they become more class specific. Given the sparse nature of tweets, this makes classifying
previously unseen tweets increasingly difficult, because fewer tweets share any term with a profile. Enriching the
dimension profiles counteracted this somewhat: accuracy increased
from 0.4960 for uogTr_RD_1 to 0.6086 and 0.6205 for uogTr_RD_3
and uogTr_RD_2 respectively.</p>
      <p>Table 2 shows the relative frequency of classes for classified tweets, calculated as
(number of predictions for the class / total number of tweets classified) × 100, and Table 3 shows the precision and recall
for each of the classes. We see that most of the runs are slightly biased towards the
largest class, “Products &amp; Services”, but the runs that performed best achieved notably
higher precision on smaller classes such as “Innovation” and “Leadership”. We would
expect to be able to further improve our results by achieving higher precision scores for
“Workplace” and “Performance”.</p>
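      <p>For reference, the per-class statistics discussed above can be computed as in the following Python sketch. It assumes that gold and predicted labels are available as two aligned lists and that “Undefined” tweets have already been filtered out by the caller; the function name and the output layout are illustrative only.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def class_statistics(gold, predicted):
    """Accuracy plus per-class relative frequency (%), precision and recall."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and of equal length")
    n = len(predicted)
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)

    stats = {}
    for c in set(gold_counts) | set(pred_counts):
        stats[c] = {
            # share of all classified tweets that were predicted as class c
            "relative_frequency": 100.0 * pred_counts[c] / n,
            "precision": correct[c] / pred_counts[c] if pred_counts[c] else 0.0,
            "recall": correct[c] / gold_counts[c] if gold_counts[c] else 0.0,
        }
    accuracy = sum(correct.values()) / n
    return accuracy, stats
```
]]></preformat>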
    </sec>
    <sec id="sec-5">
      <title>CONCLUSIONS</title>
      <p>In this paper, we described our participation in the RepLab 2014 Reputation Dimensions
task. We investigated two distinct approaches. Firstly, we used a term’s gini-index score
to identify terms representative of specific classes and to build class profiles for classifying
tweets. Secondly, we took a tweet enrichment approach, using a large
contemporaneous web corpus to derive terms representative of the tweet’s class before training a
classifier on the enriched tweets.</p>
      <p>We found that our tweet enrichment approach performed very well. In particular, we
note that when training an SVM classifier our tweet enrichment approach achieved the
best overall accuracy of the task (0.7318). This approach also performed markedly above
the baseline (0.6871 against 0.6221) when training a Random Forests classifier. These results are very encouraging
and we intend to explore this methodology further as future work.</p>
      <p>ACKNOWLEDGMENTS The authors would like to thank the Information Technology
as a Utility Network (ITaaU) and the EC co-funded project SMART (FP7-287583).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Amigó</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carrillo-de-Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chugur</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corujo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meij</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Rijke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spina</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Overview of RepLab 2014: author profiling and reputation dimensions for Online Reputation Management</article-title>
          .
          <source>In: Proceedings of the Fifth International Conference of the CLEF Initiative</source>
          . (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Madden</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Reputation management and social media</article-title>
          . Washington, DC:
          <source>Pew Internet &amp; American Life Project</source>
          . Retrieved May 26
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ounis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amati</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plachouras</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macdonald</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lioma</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Terrier: A high performance and scalable information retrieval platform</article-title>
          .
          <source>In: Proc. OSIR</source>
          . (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Aggarwal</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A survey of text classification algorithms</article-title>
          . In Aggarwal, C.C.,
          <string-name>
            <surname>Zhai</surname>
          </string-name>
          , C., eds.:
          <source>Mining Text Data</source>
          . Springer US (
          <year>2012</year>
          )
          <fpage>163</fpage>
          -
          <lpage>222</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holmes</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfahringer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reutemann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          :
          <article-title>The weka data mining software: an update</article-title>
          .
          <source>ACM SIGKDD explorations newsletter 11(1)</source>
          (
          <year>2009</year>
          )
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          :
          <article-title>LIBSVM: A library for support vector machines</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          <volume>2</volume>
          (
          <issue>3</issue>
          ) (
          <year>2011</year>
          )
          <fpage>27:1</fpage>
          -
          <lpage>27:27</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Random forests</article-title>
          .
          <source>Machine learning 45(1)</source>
          (
          <year>2001</year>
          )
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>