<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UNED at CLEF RepLab 2014: Author Profiling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jacinto Jesus Mena Lomeña</string-name>
          <email>jmena@alumno.uned.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Lopez Ostenero</string-name>
          <email>flopez@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>UNED NLP &amp; IR Group</institution>
          ,
          <addr-line>Juan del Rosal, 16, 28040 Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <fpage>1537</fpage>
      <lpage>1546</lpage>
      <abstract>
        <p>This paper describes a learning system developed for the RepLab 2014 author profiling task at UNED. The system uses a voting model, which employs a small set of features based mainly on the tweet text information, such as POS tags, number of hashtags or number of links. In the unofficial run, the feature set was extended with Twitter metadata such as number of followers or retweet speed. The system achieved good results in author categorisation, although its performance in author ranking was low.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This paper describes the participation of UNED in RepLab 2014, where we
tackled the author profiling task, which focuses on classifying and ranking Twitter profiles
using their tweet streams.</p>
      <p>Twitter is one of the main sources of data relevant for online
reputation management because of its spontaneity and immediacy. However, not
all tweets have the same impact: the way in which a post may affect the
reputation of a company often depends on who published it. The author
profiling task aims at classifying authors by their type of activity and identifying
the influential ones, that is, those whose tweets are more likely to propagate quickly and
widely through the network and to produce a greater effect. The final goal is
to build a ranked list of the selected Twitter profiles.</p>
      <p>The paper is organised as follows. Section 2 introduces the applied approach,
briefly describing the features considered and the learning process.
Section 3 explains the configurations of the model for author categorisation and
author ranking. In Section 4, we report the results obtained for each subtask.
Finally, in Section 5, we conclude and outline possible future improvements of the
system.</p>
      <sec id="sec-1-1">
        <title>Features</title>
        <p>The model uses the following set of features:</p>
        <p>Bag of Words: a feature set was built with the Weka filter StringToWordVector.
It contains a vector of occurrences of words in a document. We used
the default configuration of this Weka filter.</p>
        <p>This feature helps to identify the words that are most decisive
for the classification of Twitter profiles in the Author Categorisation subtask.
It could be more discriminative if the domain information were taken into account
to split the classification algorithm by domain.</p>
        <p>
          Number of sentences: The system used GATE [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] with the SentenceSplitter
resource to obtain the number of sentences as a feature. We used a specific
SentenceSplitter for each language, one for English and another for Spanish.
        </p>
        <p>
          POS information: Seven features were built based on the POS tags. We used
the GATE POS Tagger with the OpenNLP framework and different models
for each language. Before running the POS tagging, we preprocessed the tweet
contents with regular expressions to remove hashtags, mentions, and URLs.
From the resulting POS tags, we derived seven features that count
the number of adverbs, verbs, adjectives, nouns, pronouns, foreign words,
and abbreviations. This follows the previous work in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], where the number of
POS elements was considered for measuring polarity. In our
opinion, these features could characterise the author's writing style and could be useful
in author categorisation.
        </p>
        <p>Number of links: We built a method based on regular expressions to count the number
of links in the tweet.</p>
        <p>As with the POS features, we consider this feature useful for the author
categorisation subtask because it reflects stylistic characteristics of the user's
writing.</p>
        <p>
          Number of hashtags: Following [
          <xref ref-type="bibr" rid="ref3 ref4">3,4</xref>
          ], we included a process based on regular
expressions to count the number of hashtags.
        </p>
        <p>The hypothesis is that the number of hashtags could be indicative of the
relevancy of a tweet, as the more hashtags there are, the more topics will be
involved.</p>
        <p>
          Number of mentions: Again based on the work in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], we included a count of the
number of explicit mentions of users, of the form @user.
        </p>
        <p>For instance, the following tweet yields a value of 6
mentions for this feature:</p>
        <p>still waiting on @MeganBerry's #fbumpf contribution :)
kevinGEEdavis @MerlinUWard @MimiOrtega @jeremarketer @AmyVernon
@IAmMrSid</p>
        <p>
          Number of smileys: The system considered the number of smileys, based on the
experience of [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. To count smileys, we manually built a dictionary using
information extracted from Wikipedia.
        </p>
        <p>Buenos días :) A por un fin de semana increíble lleno de color
amigs ;) http://ow.ly/i/2EXp7</p>
        <p>Language: We used the language label provided by the RepLab 2014 organisers
as a feature of the classifier.</p>
        <p>This feature is used mainly to determine the set of words to be considered
as Bag Of Words.</p>
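        <p>The regular-expression counts (links, hashtags, mentions) and the dictionary-based smiley count described above can be sketched in a few lines of Python. This is only an illustration: the paper does not give the exact expressions or the contents of the smiley dictionary, so the patterns, the four-entry smiley list and the function name tweet_features are assumptions.</p>
        <preformat>
```python
import re

# Illustrative patterns; the exact expressions used by the system are not given.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
# Tiny stand-in for the manually built smiley dictionary (assumption).
SMILEYS = (":)", ";)", ":(", ":D")


def tweet_features(text):
    """Return per-tweet counts of links, hashtags, mentions and smileys."""
    # Drop URLs first so that their characters are not counted as other features.
    without_urls = URL_RE.sub(" ", text)
    return {
        "links": len(URL_RE.findall(text)),
        "hashtags": len(HASHTAG_RE.findall(without_urls)),
        "mentions": len(MENTION_RE.findall(without_urls)),
        "smileys": sum(without_urls.count(s) for s in SMILEYS),
    }


example = ("still waiting on @MeganBerry's #fbumpf contribution :) "
           "kevinGEEdavis @MerlinUWard @MimiOrtega @jeremarketer "
           "@AmyVernon @IAmMrSid")
features = tweet_features(example)
print(features)
```
        </preformat>
        <p>Applied to the mentions example given earlier, the sketch counts six mentions, one hashtag and one smiley.</p>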
        <p>In the unofficial run, we included two new features based on Twitter
metadata. To obtain them, we used Twitter4J, a Java wrapper for the Twitter REST API. We
built the following new features:</p>
        <p>Number of followers: We queried Twitter for the number of
followers of every profile in the training and test data sets.</p>
        <p>The idea was to use this feature in the Author Ranking subtask to generate
the weight values whose application is described below.</p>
        <p>Retweet speed: We examined the last retweet of each author. The retweet speed
was calculated from the creation date of the tweet, its number of retweets, and the
creation date of the last retweet:</p>
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math>\mathit{avgTime} = \frac{\mathit{LastRTCreationTime} - \mathit{TweetCreationTime}}{\mathit{NumberOfRT}}</tex-math>
        </disp-formula>
        <p>In order to sort elements, we built a weight measure, calculated with
the following formula:</p>
        <disp-formula id="eq2">
          <label>(2)</label>
          <tex-math>\mathit{weight} = \frac{\mathit{NumberOfFollowers}}{\mathit{AverageRTSpeed}}</tex-math>
        </disp-formula>
        <p>This formula relates the retweet speed to the number of followers.
The aim is to capture those cases in which, given two profiles, for instance one
with 1,500 followers and the other with 1,600, the former shows more activity in
terms of tweet propagation and retweet speed than the latter. The
underlying hypothesis is that a profile with fewer
followers and higher speed is more relevant than a profile with more followers and
lower speed. One run was configured with this weight parameter. For this
feature, the larger the weight value, the more important the profile.</p>
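        <p>A small worked example may help to see how Equations 1 and 2 interact. The Python sketch below assumes that the AverageRTSpeed term of Equation 2 is the avgTime value of Equation 1, measured in seconds; the two profiles, their timestamps and the function names are hypothetical and chosen only to mirror the 1,500 versus 1,600 followers case.</p>
        <preformat>
```python
from datetime import datetime


def avg_retweet_time(tweet_created, last_rt_created, number_of_rt):
    """Equation 1: elapsed time up to the last retweet, per retweet (seconds)."""
    return (last_rt_created - tweet_created).total_seconds() / number_of_rt


def weight(number_of_followers, average_rt_speed):
    """Equation 2: number of followers divided by the retweet speed value."""
    return number_of_followers / average_rt_speed


created = datetime(2014, 1, 1, 12, 0, 0)
# Hypothetical profile A: 1,500 followers, 10 retweets within 100 seconds.
avg_a = avg_retweet_time(created, datetime(2014, 1, 1, 12, 1, 40), 10)
# Hypothetical profile B: 1,600 followers, 10 retweets within 400 seconds.
avg_b = avg_retweet_time(created, datetime(2014, 1, 1, 12, 6, 40), 10)

weight_a = weight(1500, avg_a)
weight_b = weight(1600, avg_b)
print(weight_a, weight_b)
```
        </preformat>
        <p>Under these assumptions, profile A obtains a weight of 150.0 and profile B a weight of 40.0, so the profile with fewer followers but faster retweets is ranked first.</p>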
        <p>Due to Twitter rate limiting, we only managed to obtain retweet speed information
for about 50% of the profiles. For the remaining profiles, the
feature was left empty when building the classifier for the Author Categorisation subtask. For
Author Ranking, an average speed was assigned, multiplied by the number of
followers.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Learning Process and Confidence Methods</title>
        <p>The learning process of our system consists of a voting system, a set of
classifiers, and a method that resolves ties by means of confidence scores.</p>
        <p>We divided the training data set into five subsets, each containing 20% of the data.
The 601 tweets provided by the organisers for each profile were likewise split into five
parts. The classifiers were trained treating each tweet as an instance, rather
than grouping all the data related to one profile into a single instance. Four of the subsets
were used to train the system with the following Weka algorithms:</p>
        <list list-type="bullet">
          <list-item><p>ZeroR</p></list-item>
          <list-item><p>RandomTree [<xref ref-type="bibr" rid="ref5">5</xref>]</p></list-item>
          <list-item><p>RandomForests [<xref ref-type="bibr" rid="ref6 ref8">8,6</xref>]</p></list-item>
          <list-item><p>Naive Bayes</p></list-item>
        </list>
        <p>These four algorithms were thus trained on 80% of the data set. The remaining 20%
was used to create a confidence score table.</p>
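        <p>The 80/20 partition described above can be illustrated with a minimal Python sketch. The paper does not state how tweets were distributed among the five parts, so the round-robin assignment and the function name split_into_parts are assumptions made for illustration only.</p>
        <preformat>
```python
def split_into_parts(items, n_parts=5):
    """Assign items to n_parts subsets in round-robin order (assumption)."""
    return [items[i::n_parts] for i in range(n_parts)]


# Integer stand-ins for the 601 tweets of one profile.
tweets = list(range(601))
parts = split_into_parts(tweets)

# Four parts are used to train the classifiers.
training = [t for part in parts[:4] for t in part]
# The remaining part feeds the confidence score table.
held_out = parts[4]
print(len(training), len(held_out))
```
        </preformat>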
        <p>That partition of the training set had nearly 300,000 tweets. We iterated tweet by
tweet and stored four rows per tweet in a relational database as confidence
information. As a result, for each RepLab 2014 subtask we had a table with close to 1,200,000
rows from which to query confidence information. The following
formula was used to resolve those cases in which at least three classifiers gave the
same result:</p>
        <disp-formula id="eq3">
          <label>(3)</label>
          <tex-math>\mathit{confidence}(\mathit{cat}, \mathit{algs}) = \sum_{\mathit{alg} \in \mathit{algs}} \frac{\mathit{nRightClassification}(\mathit{cat}, \mathit{alg})}{\mathit{nClassifications}(\mathit{cat}, \mathit{alg})}</tex-math>
        </disp-formula>
        <p>Here, cat is the category for which the confidence value has to be
calculated, and algs is the set of algorithms whose result was the category cat.
nRightClassification gives the number of correct classifications for
this category produced by this algorithm, and nClassifications
counts the number of classifications for that category by the same algorithm.</p>
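        <p>To make Equation 3 concrete, the following Python sketch computes confidence scores for two competing categories and picks the higher one. The counts, the category label "Journalist" and the dictionary layout are hypothetical; in the system described here these values would be queried from the confidence table in the relational database.</p>
        <preformat>
```python
def confidence(cat, algs, n_right, n_total):
    """Equation 3: sum, over the algorithms that chose cat, of
    correct classifications divided by all classifications for cat."""
    return sum(n_right[(cat, alg)] / n_total[(cat, alg)] for alg in algs)


# Hypothetical counts gathered on the held-out part of the training data.
n_right = {("Journalist", "NaiveBayes"): 30, ("Journalist", "RandomTree"): 20,
           ("Undecidable", "RandomForest"): 30, ("Undecidable", "ZeroR"): 20}
n_total = {("Journalist", "NaiveBayes"): 40, ("Journalist", "RandomTree"): 40,
           ("Undecidable", "RandomForest"): 50, ("Undecidable", "ZeroR"): 40}

# Two classifiers voted for each category, so the vote is tied.
votes = {"Journalist": ["NaiveBayes", "RandomTree"],
         "Undecidable": ["RandomForest", "ZeroR"]}
scores = {cat: confidence(cat, algs, n_right, n_total)
          for cat, algs in votes.items()}
winner = max(scores, key=scores.get)
print(scores, winner)
```
        </preformat>
        <p>With these illustrative counts, "Journalist" scores 0.75 + 0.5 = 1.25 and "Undecidable" scores 0.6 + 0.5 = 1.1, so the tie is resolved in favour of "Journalist".</p>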
        <p>The confidence scores are used to decide which category is more plausible
after training. Figure 1 shows the architecture of the confidence score
component and how the confidence score table is populated with the
outcomes of the algorithms, based on the training data.</p>
        <p>Figure 2 illustrates how the confidence score information is used to
disambiguate the results and decide which class value should be assigned to a profile.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Algorithms</title>
      <p>In this section, we describe the algorithm configurations. Table 1 provides an
overview of the Author Categorisation algorithms, specifying the kind of data
used in each of them. "_AC" in the run identifiers indicates the Author
Categorisation task, while "_AR" stands for Author Ranking.</p>
      <sec id="sec-2-1">
        <title>Author Categorisation</title>
        <p>Basic configuration. This is the first and simplest system configuration
(ORM_UNED_AC_1) for author categorisation. It consists only of classification
algorithms and does not take into account information about the domain. The
four classifiers were fed with a small set of features which included BoW, POS,
hashtags, mentions, links, smileys, and language. The classification result was
obtained by applying a basic voting algorithm using majority rule.</p>
        <p>In order to avoid the bias towards the most frequent class (Undecidable),
a threshold was applied. The Undecidable label was assigned
only if it was supported by 80% or more of the votes. Below that threshold, we
assigned the profile the most voted class other than Undecidable.</p>
        <p>The four classifiers classified each profile tweet by tweet. For each profile,
we generated four class values per tweet, producing nearly 2,400 class values per
profile. This information was used to obtain the majority result of the voting
algorithm.</p>
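        <p>The thresholded majority vote can be sketched as follows. This is a minimal Python illustration rather than the system's implementation: the function name vote, the label "Journalist" and the "Company" label are hypothetical, and the sketch assumes a non-empty list of votes.</p>
        <preformat>
```python
from collections import Counter


def vote(labels, majority="Undecidable", threshold=0.8):
    """Majority vote that accepts the majority label only with enough support."""
    counts = Counter(labels)
    # Keep the frequent class only when it reaches the threshold share of votes.
    if counts[majority] / len(labels) >= threshold:
        return majority
    # Otherwise fall back to the most voted class among the remaining ones.
    others = {c: n for c, n in counts.items() if c != majority}
    return max(others, key=others.get)


below = ["Undecidable"] * 70 + ["Journalist"] * 20 + ["Company"] * 10
above = ["Undecidable"] * 85 + ["Journalist"] * 15
print(vote(below), vote(above))
```
        </preformat>
        <p>In the first list Undecidable has only 70% of the votes, so the profile is labelled "Journalist"; in the second it has 85% and is kept.</p>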
      </sec>
      <sec id="sec-2-2">
        <title>Basic configuration with domain features</title>
        <p>This configuration (ORM_UNED_AC_2) includes information about the profile domain.
Algorithms were defined to consider the domain element and decide which algorithm
should be used. It uses the same set of features as the basic configuration, but
chooses different classifiers depending on the domain.</p>
        <p>As before, we used a threshold to avoid the bias towards the most frequent
class (Undecidable), setting it at the same value. This configuration produced eight
classifiers.</p>
        <p>Confidence scores model. This configuration used information about the
confidence of the classification algorithms when their results were close to a tie. We submitted
the results of this configuration as ORM_UNED_AC_3.</p>
        <p>The confidence information was used to decide the outcome of the
classification. In case of a tie, we calculated confidence scores using Equation 3. We
used the same feature set and threshold as in the basic configuration.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Confidence scores and social information model</title>
        <p>We built a last configuration using a new kind of information, social information (ORM_UNED_AC_4),
after the official deadline for submitting results.</p>
        <p>This configuration, for which we can only report an unofficial result, is similar to
the confidence scores model described above, but uses two new features:
number of followers and retweet speed.</p>
        <p>For the Undecidable class, we applied the same threshold
as in the basic configuration.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Author Ranking</title>
        <p>For the Author Ranking subtask, we submitted one official run: ORM_UNED_AR_3
(see Table 1). The developed algorithm is described below.</p>
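        <p>Basic configuration. We used the following features:</p>
        <list list-type="bullet">
          <list-item><p>Class value of opinion maker/non-opinion maker</p></list-item>
          <list-item><p>Number of followers</p></list-item>
          <list-item><p>Retweet speed</p></list-item>
        </list>
        <p>The weight function defined in Equation 2 was used to sort the ranking
results.</p>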
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>The test set contained three domains. We employed two domains and, in order
to assign a value to the third domain, we selected one of the classifiers built using
the training data set. Tables 3, 4 and 5 report the scores obtained for the evaluation
metrics used in the Author Categorisation subtask: Reliability (R), Sensitivity (S) and
F1(R, S) for each domain. For the automotive and banking domains, we also
include the scores of the baselines for reference.</p>
      <p>We have described the algorithms submitted to the RepLab 2014 Author Profiling
task, where we tackled both author categorisation and author ranking.</p>
      <p>Author categorisation was our main focus at RepLab 2014. We submitted
three official runs and one unofficial run. Our proposal was based on a voting system
featuring a method that calculates confidence scores to resolve ties in votes. However,
the results obtained with the confidence method were not as good as we expected,
as they were surpassed by the basic configuration. Nevertheless, although the
confidence method obtained the worst results in Average Accuracy, it turned out to be the
best in F-measure, not only among our runs but also among all the
Author Categorisation task participants.</p>
      <p>Future work in author categorisation will focus on selecting new
features and improving the whole system in order to make processing more
efficient. Furthermore, we will have to refine the confidence formula to avoid
setting a threshold for the majority "Undecidable" class.</p>
      <p>Regarding author ranking, the poor results can be partly explained by the
lack of information for building the ranking. Due to the Twitter rate limit, we
failed to obtain the necessary information about followers and retweet speed for
all the profiles. Profiles without this information were assigned
an average value. This distortion might have affected the system's outcome.</p>
      <p>For author ranking, future work will focus on getting more information from
Twitter, although the first step, of course, will be to improve the query process
to cope with the Twitter rate limit.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cunningham</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maynard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bontcheva</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Text Processing with GATE</article-title>
          . Gateway Press CA (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Filgueiras</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amir</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>POPSTAR at RepLab 2013: Polarity for reputation classification</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Greenwood</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aswani</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bontcheva</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Reputation profiling with GATE</article-title>
          . In: CLEF (Online Working Notes/Labs/Workshop) (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Martín</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spina</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amigo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>UNED at RepLab 2012: Monitoring task</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Meina</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brodzinska</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Celmer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Czokow</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patera</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pezacki</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilk</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Ensemble-based classification for author profiling using various features</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Mosquera</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernandez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martínez-Barco</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moreda</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>DLSI-Volvam at RepLab 2013: Polarity classification on Twitter data</article-title>
          .
          <source>In: Working Notes of CLEF 2013 Evaluation Labs and Workshop</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Pang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Opinion mining and sentiment analysis</article-title>
          .
          <source>Foundations and Trends in Information Retrieval</source>
          <volume>2</volume>
          (
          <issue>1-2</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>135</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Saleiro</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pasquali</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teixeira</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinto</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nozari</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Felix</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strecht</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>POPSTAR at RepLab 2013: Name ambiguity resolution on Twitter</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>