<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>University of Tehran at RepLab 2014</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abolfazl Aleahmad</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Payam Karisani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Masoud Rahgozar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Farhad Oroumchian</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Database Research Group, Control and Intelligent Processing Center of Excellence, School of Electrical and Computer Engineering, University of Tehran</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Wollongong in Dubai</institution>
        </aff>
      </contrib-group>
      <fpage>1528</fpage>
      <lpage>1536</lpage>
      <abstract>
        <p>In this paper, we present our approach to the author ranking subtask, which is part of the author profiling task in RepLab 2014. In this subtask, systems are expected to detect influential authors and opinion makers on Twitter. For a given domain, a system's output must be a list of authors ranked by their probability of being an influential author or opinion maker. Our system uses a Time-sensitive Voting algorithm, which is based on the hypothesis that influential authors tweet actively about topics of their interest. In this method, the hot topics of each domain are extracted and a time-sensitive voting algorithm ranks the authors on their respective topics.</p>
      </abstract>
      <kwd-group>
        <kwd>Microblog retrieval</kwd>
        <kwd>Twitter profile ranking</kwd>
        <kwd>social networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Twitter has become a common means of spreading personal and public information
in recent years. Users of every social status tweet about different subjects
on the Internet. The upward trend in the use of websites like Twitter is confirmed by
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which reports that Twitter had 200 million users as of February 2013. The
information needs of Internet users must also be addressed in this important
category of social media, which has been the subject of much research in the information
retrieval field. The key question in textual information retrieval is how to compute the
relevance probability of a document with regard to a user query. Three major factors
are generally used in effective retrieval models [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]: term frequency, document length,
and inverse document frequency.
      </p>
      <p>Of these three factors, term frequency and document length normalization are
not meaningful in microblog retrieval because users’ posts are so short. On the
other hand, other signals such as users’ hashtags or retweets are
far more important and have been exploited by microblog retrieval techniques.
These facts show the distinct nature of microblogs, which should be taken into account
in information retrieval algorithms.</p>
      <p>The third year of the RepLab campaign addresses Online Reputation Management
systems. It comprises two major tasks: Reputation Dimensions and Author
Profiling. The Author Profiling task itself consists of two subtasks: Author
Categorization and Author Ranking. This paper describes our experiments in the author
ranking subtask, which the RepLab organizers describe as follows:
"Systems will be expected to find out which authors have more
reputational influence (who the influencers or opinion makers are) and which
profiles are less influential or have no influence at all. For a given domain (e.g.
automotive or banking), the systems’ output will be a ranking of profiles
according to their probability of being an opinion maker with respect to the
concrete domain, optionally including the corresponding weights"</p>
      <p>Our aim in the author ranking task is to verify the hypothesis that influential
authors tweet more about hot topics in their domain than other users do.
The rest of this paper is organized as follows: Section 2 describes the collection and
preparation process of the provided RepLab 2014 dataset and our experimental setup,
Section 3 presents our proposed algorithm, Sections 4 and 5 report our experimental results,
and Section 6 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Experimental Setup</title>
      <p>The dataset of the author profiling task consists of nearly 7500 English and Spanish
Twitter profiles, which are categorized into automotive, banking and miscellaneous
domains. Every profile has at least 1000 followers, and the last 600 tweets
published by each profile at crawling time were collected.</p>
      <p>The dataset is split into training and test sets that contain around 33% and 67% of
the profiles respectively. The training set consists of 1185 and 1315 profiles from the
automotive and banking domains, and the test set contains 2345, 2500 and 146 profiles
from the automotive, banking and miscellaneous domains respectively. Table 1 shows the
dataset features.</p>
      <p>The evaluations are carried out based on manual judgments of reputation experts.
The outputs are stored in the standard TREC format and the traditional information
retrieval criteria (MAP, R-Precision, and P@N) are used to evaluate the performance
of each system.</p>
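      <p>As an illustration of the evaluation measures named above, the following is a minimal Python sketch of P@N, R-Precision and MAP for ranked profile lists. It is not the official evaluation tool; the function names and the representation of judgments as a set of relevant profile identifiers are assumptions made for this example.</p>
      <preformat>
```python
from typing import Iterable, Sequence, Set, Tuple


def precision_at_n(ranking: Sequence[str], relevant: Set[str], n: int) -> float:
    """P@N: share of the top-n ranked profiles that are judged relevant."""
    if n == 0:
        return 0.0
    return sum(1 for profile in ranking[:n] if profile in relevant) / n


def r_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """R-Precision: P@R, where R is the number of relevant profiles."""
    return precision_at_n(ranking, relevant, len(relevant))


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Average of the precision values at each rank holding a relevant profile."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, profile in enumerate(ranking, start=1):
        if profile in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """MAP: mean of average precision over (ranking, relevant) pairs."""
    scores = [average_precision(ranking, relevant) for ranking, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0
```
      </preformat>
      <p>In this sketch, each domain would contribute one (ranking, relevant) pair, where the ranking is the system output for that domain and the relevant set holds the profiles judged influential.</p>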
      <p>The author ranking collection of RepLab 2014 contains approximately 4.5 million
tweets from 7491 profiles. This collection was downloaded directly from Twitter;
Table 2 contains properties of the crawled collection:</p>
      <p>We used Twitter’s standard API to get the number of followers of each
profile. Because of the limitations of the Twitter API, we also developed a tool that
downloads the HTML page of each tweet, from which the text, the retweet
count, and the favorite count of the tweet are extracted. Our proposed algorithm does not use any
external resources.</p>
      <p>
        The tweet messages are stored in TREC format and then indexed in Terrier 3.5 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Our submitted runs were produced using the Terrier retrieval engine with default
settings (stopword removal, Porter stemming, etc.). We also used the
Stanford Topic Modeling Toolbox [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to predict the significant topics in each domain,
as discussed in the next section.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Our Algorithm</title>
      <p>2- Retrieval: Let R<sub>i</sub>(Q<sub>j</sub>) be the set of 1000 tweets retrieved by the PL2 model
(implemented in Terrier) for the topic Q<sub>Di,j</sub>.</p>
      <p>3- Topic-based profile ranking: Let ProfileTopicRank<sub>i,j</sub> be the list of profiles
ranked by the Time-sensitive Voting algorithm based on each list R<sub>i</sub>(Q<sub>j</sub>).</p>
      <p>4- Topic weight calculation: Let Weight<sub>i,j</sub> be the precision calculated for
each R<sub>i</sub>(Q<sub>j</sub>) based on the relevance judgments of the training dataset.</p>
      <p>5- Final author ranking: Let T be a constant weighting threshold. Let
ProfileRank<sub>i</sub> be the result of merging the lists ProfileTopicRank<sub>i,j</sub> with Weight<sub>i,j</sub> ≥ T
by weighted averaging, using the weights Weight<sub>i,j</sub>.</p>
      <p>Finally, ProfileRank<sub>i,j</sub> will contain the final rank of profile j in each domain D<sub>i</sub>.</p>
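      <p>Steps 4 and 5 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the submitted implementation: the function names are hypothetical, the precision in step 4 is read as the share of retrieved tweets whose author is judged influential in the training data, and a profile that is absent from a topic list is assumed to contribute a score of zero to the weighted average.</p>
      <preformat>
```python
from collections import defaultdict
from typing import Dict, List, Set


def topic_weight(tweet_authors: List[str], influential: Set[str]) -> float:
    """Step 4 (assumed reading): precision of one retrieved tweet list, taken as
    the share of tweets written by authors judged influential in training."""
    if not tweet_authors:
        return 0.0
    return sum(1 for a in tweet_authors if a in influential) / len(tweet_authors)


def merge_topic_rankings(topic_scores: Dict[str, Dict[str, float]],
                         weights: Dict[str, float],
                         threshold: float = 0.0) -> List[str]:
    """Step 5: weighted average of per-topic profile scores, keeping only the
    topics whose weight is at least the threshold T."""
    kept = {t: weights.get(t, 0.0) for t in topic_scores
            if weights.get(t, 0.0) >= threshold}
    total_weight = sum(kept.values())
    if total_weight == 0:
        return []
    merged = defaultdict(float)
    for topic, weight in kept.items():
        for profile, score in topic_scores[topic].items():
            # Profiles missing from a topic list implicitly add 0 (assumption).
            merged[profile] += weight * score / total_weight
    # Highest merged score first; ties are broken by profile name.
    return sorted(merged, key=lambda p: (-merged[p], p))
```
      </preformat>
      <p>With the threshold set to zero every topic list takes part in the merge; raising it discards topics whose retrieved tweets rarely come from influential authors in the training data.</p>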
    </sec>
    <sec id="sec-4">
      <title>Official Runs</title>
      <p>We submitted five different runs to the author ranking subtask of RepLab 2014. The
following table describes each run briefly:</p>
      <sec id="sec-4-4">
        <title>Description</title>
        <table-wrap>
          <table>
            <thead>
              <tr>
                <th>Run Name</th>
                <th>Description</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>UTDBRG_AR_1</td>
                <td>The output of the algorithm in table 1 with N=100, except that the Local method of [<xref ref-type="bibr" rid="ref9">9</xref>] is used instead of Time-sensitive Voting and, in step 1 of the algorithm, all tweet terms are considered instead of tweet hashtags.</td>
              </tr>
              <tr>
                <td>UTDBRG_AR_2</td>
                <td>The output of the algorithm in table 1 with N=50, except that only tweets retweeted more than 100 times are considered in step 1.</td>
              </tr>
              <tr>
                <td>UTDBRG_AR_3</td>
                <td>The output of the algorithm in table 1 with N=50, except that the Voting method of [<xref ref-type="bibr" rid="ref8">8</xref>] is used instead of Time-sensitive Voting.</td>
              </tr>
              <tr>
                <td>UTDBRG_AR_4</td>
                <td>This run uses the number of followers to re-rank the result of the first run, UTDBRG_AR_1.</td>
              </tr>
              <tr>
                <td>UTDBRG_AR_5</td>
                <td>The output of the algorithm in table 1 with N=100.</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The threshold T is set to zero in all of the above runs. In the remaining part of
this section we compare the submitted runs based on the official results released by
the track organizers. Table 5 compares the five submitted runs based on the MAP
measure:</p>
        <p>[Table 5 residue. Surviving labels: "Document cut-off"; UTDBRG_AR_1, UTDBRG_AR_2, UTDBRG_AR_3, UTDBRG_AR_4, UTDBRG_AR_5. The values were not recoverable.]</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Further Experiments</title>
      <p>It is clear that the topic creation step plays an important role in the final performance
of our algorithm. We therefore decided to improve the algorithm by changing its
first step. For this purpose, two categories of topic sets are created as
follows:</p>
      <p>─ Grouping hashtags: The keywords extracted in step 1 of the proposed
algorithm are grouped together to form a number of representative topics in each
domain. The process is as follows:</p>
      <p>(i) The algorithm of table 1 is run once, so that keywords are extracted in step 1 and
weighted in step 4. These keywords are then ranked in
descending order of weight.</p>
      <p>(ii) The ordered list of keywords is split into groups, and each group is
considered a topic. In other words, each topic consists of a number of
hashtags grouped together.</p>
      <p>─ Using topic modeling: Latent Dirichlet Allocation (LDA) [<xref ref-type="bibr" rid="ref1">1</xref>] is used to create
additional topics for each domain. A number of topics are extracted from the
training set of each domain using LDA. For this purpose we used the
Stanford Topic Modeling Toolbox [<xref ref-type="bibr" rid="ref2">2</xref>].</p>
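      <p>The hashtag grouping procedure can be illustrated with a short Python sketch. The text does not state how many keywords go into each group, so the fixed group_size parameter, like the function name, is an assumption made only for this example.</p>
      <preformat>
```python
from typing import Dict, List


def group_hashtags(weights: Dict[str, float], group_size: int) -> List[List[str]]:
    """Rank keywords by descending weight and cut the ranked list into
    consecutive groups; each group then serves as one topic."""
    if 1 > group_size:
        raise ValueError("group_size must be at least 1")
    # Ties in weight are broken alphabetically to keep the output deterministic.
    ranked = sorted(weights, key=lambda k: (-weights[k], k))
    return [ranked[i:i + group_size] for i in range(0, len(ranked), group_size)]
```
      </preformat>
      <p>Each returned group could then be issued as a single query in step 2, for example by joining its hashtags with spaces.</p>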
      <p>After the two topic sets are created, they are combined into a single list of topics.
This list is then treated as the output of the first step and, in step 2, the top
10,000 tweets are retrieved. The rest of the algorithm is executed as discussed in table
1. The following table compares the performance of the modified algorithm with the
performance of UTDBRG_AR_5 based on MAP in the automotive and banking
domains.</p>
      <p>Also, the following figures compare the performance of the modified algorithm
with the performance of UTDBRG_AR_5 based on precision-recall and P@N
measures.</p>
      <p>[Figure residue. Surviving labels: axis "Precision"; series "Improved Algorithm" and "UTDBRG_AR_5". The plotted values were not recoverable.]</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>
        In the author ranking task of RepLab 2014, we presented a new algorithm based
on the voting algorithm [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The official evaluation results of RepLab 2014 show that the
proposed algorithm outperforms other algorithms in the automotive domain. The topic
creation step of our algorithm used simple keywords, so it could not perform well in
the banking domain, which contains more diverse tweets. We therefore used topic modeling in
addition to tweet hashtags to improve the topic creation step of our algorithm.
Evaluation of the improved algorithm shows that it works even better than the previous
algorithm.
      </p>
      <p>Analysis of our five official runs shows that the fourth run, named UTDBRG_AR_4,
performed better than the others. The main reason is the use of more keywords
(N=100). It also shows that the number of followers is a good feature for detecting
influential people, so we would like to investigate other structural features such as
people's centrality. It is also worth mentioning that we made use of the number of
retweets in our experiments, but the feature was not helpful. Perhaps the main reason
is that all authors in the collection have more than 1000 followers, so the
feature is not very discriminative. This feature should be investigated further in the future.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>David M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Michael I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Latent Dirichlet allocation</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          <volume>3</volume>
          (March 2003),
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <source>Stanford Topic Modeling Toolbox</source>
          , http://nlp.stanford.edu/software/tmt/tmt-0.4, last visited on 5 June
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Text REtrieval Conference (TREC) TREC_Eval tool: http://trec.nist.gov/trec_eval, last visited on 5 June
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. http://twittercounter.com/pages/100, last visited on 5 June
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. http://blog.twitter.com/2013/03/celebrating-twitter7.html, last visited on 5 June
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <source>Modern Information Retrieval</source>
          , 2nd edition, chapter 3,
          <string-name>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <source>Terrier IR Platform version 3.5</source>
          , http://terrier.org, last visited on 5 June
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Craig</given-names>
            <surname>Macdonald</surname>
          </string-name>
          and
          <string-name>
            <given-names>Iadh</given-names>
            <surname>Ounis</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Voting for candidates: adapting data fusion techniques for an expert search task</article-title>
          .
          <source>In Proceedings of the 15th ACM international conference on Information and knowledge management (CIKM '06)</source>
          . ACM, New York, NY, USA,
          <fpage>387</fpage>
          -
          <lpage>396</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Yeha</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Seung-Hoon</given-names>
            <surname>Na</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jong-Hyeok</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Utilizing local evidence for blog feed search</article-title>
          .
          <source>Inf. Retr</source>
          .
          <volume>15</volume>
          ,
          <issue>2</issue>
          (April
          <year>2012</year>
          ),
          <fpage>157</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>