<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arjumand Younus</string-name>
          <email>arjumand.younus@nuigalway.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Colm O'Riordan</string-name>
          <email>colm.oriordan@nuigalway.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriella Pasi</string-name>
          <email>pasi@disco.unimib.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computational Intelligence Research Group, National University of Ireland Galway</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information Retrieval Lab</institution>
          ,
          <addr-line>Informatics, Systems and Communication</addr-line>
          ,
          <institution>University of Milan Bicocca</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Using Twitter as an effective marketing tool has become a gold mine for companies interested in their online reputation. A quite significant research challenge related to the above issue is to disambiguate tweets with respect to company names. In fact, finding if a particular tweet is relevant or irrelevant to a company is an important task not satisfactorily solved yet; to address this issue in this paper we propose a Wikipedia-based two-pass algorithm. The experimental evaluations demonstrate the effectiveness of the proposed approach.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Twitter<sup>1</sup> is a microblogging site that currently<sup>2</sup> ranks 8th worldwide in total
traffic according to Alexa<sup>3</sup>. This popularity has turned Twitter into an
effective marketing platform, with almost all major companies maintaining
Twitter accounts. Moreover, Twitter users often express their opinions about
companies via 140-character messages called tweets [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Companies are highly interested in monitoring their online reputation; this, however,
involves the significant challenge of disambiguating company names in text. The
task becomes even more challenging in tweets because of heavy noise, short length,
and lack of context for company name disambiguation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This paper
describes our experience in dealing with some of these challenges in the context of
the RepLab2012 filtering task, where we are given a set of companies and, for each
company, a set of tweets, some of which are relevant to the company
and some of which are not.
      </p>
      <p>Our approach consists of a two-pass filtering algorithm that makes use of
Wikipedia as an external knowledge resource in the first pass, and a concept
term score propagation mechanism in the second pass. The first pass is
precision-oriented: the aim is to keep noisy tweets to a minimum. The second
pass enhances recall via a score propagation technique and by making use
of other sources of evidence. Our technique shows high accuracy over the
RepLab2012 dataset.</p>
      <p>The rest of the paper is organized as follows. Section 2 presents a more
detailed description of the considered problem. Section 3 describes the proposed
methodology in detail. Section 4 presents the experimental evaluation of
our method, and Section 5 summarizes the related work. Finally, Section 6
concludes the paper.</p>
      <fn-group>
        <fn id="fn1"><label>1</label><p>http://twitter.com</p></fn>
        <fn id="fn2"><label>2</label><p>Data as of July 31, 2012</p></fn>
        <fn id="fn3"><label>3</label><p>http://www.alexa.com/siteinfo/twitter.com</p></fn>
      </fn-group>
    </sec>
    <sec id="sec-2">
      <title>Problem Statement</title>
      <p>In this section we give a brief overview of the RepLab2012 filtering task. We were
given company names and, for each, a set of tweets obtained by issuing a query (i.e. the
name of the company) to Twitter Search<sup>4</sup>. Because of noise in Twitter,
the tweets returned for the query may or may not be relevant to the company.
The task is to distinguish the tweets that are relevant to a company from those
that are not.</p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>This section describes the proposed filtering method in detail. We first explain
how we use Wikipedia as an external knowledge resource: we use only certain portions
of a company’s Wikipedia page, as only some portions are meaningful for
filtering purposes. Next, we explain the two passes of our algorithm.</p>
      <sec id="sec-3-1">
        <title>Wikipedia as External Knowledge Resource</title>
        <p>
          As Meij et al. point out in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], simple matching between tweets and Wikipedia
texts would produce a significant number of irrelevant and noisy concept terms.
The authors further mention that such noise can be eliminated on either the
Wikipedia or the textual side. Based on the intuition that the significant
information about a company is concentrated in certain
portions of its Wikipedia article (i.e. concept terms exist in some portions of
the text), we perform this cleaning on the Wikipedia side as follows:
        </p>
        <list list-type="bullet">
          <list-item><p>We use the information within the Wikipedia infoboxes.</p></list-item>
          <list-item><p>We use the information within the Wikipedia category box.</p></list-item>
          <list-item><p>We parse the paragraphs of the Wikipedia text and apply POS
tagging to them. After POS tagging, we extract unigrams, bigrams and trigrams
for the significant POS tags [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].</p></list-item>
        </list>
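        <p>The n-gram extraction step, and the subsequent split into single concept terms, can be sketched as follows. This is a minimal illustration rather than the implementation used in our system: it assumes the text has already been tokenized and tagged with Penn Treebank tags, and the choice of nouns and adjectives as the "significant" tags, as well as the names <monospace>SIGNIFICANT_TAGS</monospace>, <monospace>extract_ngrams</monospace> and <monospace>split_into_terms</monospace>, are assumptions made for the example.</p>
        <preformat><![CDATA[
```python
# Assumption: nouns and adjectives are treated as the "significant"
# Penn Treebank tags; the paper does not list the tags it used.
SIGNIFICANT_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ"}


def extract_ngrams(tagged, max_n=3, tags=SIGNIFICANT_TAGS):
    """Return n-grams (n = 1..max_n) whose tokens all carry a significant tag.

    `tagged` is a list of (word, pos_tag) pairs for one sentence.
    """
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            if all(tag in tags for _, tag in window):
                ngrams.append(tuple(word.lower() for word, _ in window))
    return ngrams


def split_into_terms(ngrams):
    """Split the extracted n-grams into a set of single concept terms."""
    return {word for gram in ngrams for word in gram}


# Hypothetical, hand-tagged sentence used only for illustration.
tagged_sentence = [
    ("Apple", "NNP"),
    ("designs", "VBZ"),
    ("consumer", "NN"),
    ("electronics", "NNS"),
]
grams = extract_ngrams(tagged_sentence)
terms = split_into_terms(grams)
print(grams)
print(sorted(terms))
```
]]></preformat>
        <p>In this toy sentence the verb breaks every trigram, so only the three unigrams and the bigram "consumer electronics" survive before being split into single terms.</p>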
        <p>Finally, the concept terms extracted from the various portions of Wikipedia
are split into single terms. These are then matched against the terms
in the tweets for the task of filtering.</p>
        <fn-group>
          <fn id="fn4"><label>4</label><p>http://twitter.com/search</p></fn>
        </fn-group>
      </sec>
      <sec id="sec-3-2">
        <title>First Pass</title>
        <p>After collecting all the concept terms extracted from Wikipedia, we order them
by their specificity in the Wikipedia article, thereby assigning a score to each term.
We then check for occurrences of these concept terms in each tweet: the
number of occurrences of each term is multiplied by the score of that
concept term, and these products together constitute the score for the tweet.
Tweets with a score above a certain threshold are considered relevant. The intuition behind the use
of Wikipedia concept terms in the first pass is to keep precision as high as
possible by retaining only tweets that are very likely to be relevant.</p>
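        <p>A minimal sketch of the first-pass scoring is given below. It assumes that the per-term products are summed to give the tweet score and that the specificity scores are already available as a dictionary; the scores, the threshold value, the tokenizer and the example tweets are all hypothetical and chosen only to make the example concrete.</p>
        <preformat><![CDATA[
```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a tweet and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tweet_score(tweet, term_scores):
    """Sum, over concept terms, of (occurrences in tweet) * (term score)."""
    counts = Counter(tokenize(tweet))
    return sum(counts[term] * score for term, score in term_scores.items())


def first_pass(tweets, term_scores, threshold):
    """Keep only the tweets whose score exceeds the threshold."""
    return [t for t in tweets if tweet_score(t, term_scores) > threshold]


# Hypothetical specificity scores for a company's concept terms.
term_scores = {"iphone": 3.0, "cupertino": 2.0, "mac": 1.0}
tweets = [
    "New iPhone and Mac announced in Cupertino",
    "I ate an apple pie today",
    "my mac is slow",
]
relevant = first_pass(tweets, term_scores, threshold=1.5)
print(relevant)
```
]]></preformat>
        <p>With this threshold only the first tweet is retained, which mirrors the precision-oriented behaviour of the first pass: a tweet with a single weak concept term is left for the second pass to recover.</p>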
      </sec>
      <sec id="sec-3-3">
        <title>Second Pass</title>
        <p>The second pass uses the idea of concept term score propagation
to discover more tweets relevant to a particular company, i.e. to increase
recall. The score propagation technique is based on the intuition that terms
co-located with significant concept terms may have some relevance to that concept.
The scores of the concept terms in a relevant tweet obtained from the first pass are
redistributed among the co-located terms. This in turn gives some score to the
non-concept terms, and using the scores of these non-concept terms we perform
a second computation to obtain the scores of tweets. Moreover, this pass uses
further sources of evidence:</p>
        <list list-type="bullet">
          <list-item><p>the POS tag of the company name occurring within the tweets;</p></list-item>
          <list-item><p>URLs occurring within the tweets;</p></list-item>
          <list-item><p>Twitter usernames occurring within the tweets;</p></list-item>
          <list-item><p>hashtags occurring within the tweets.</p></list-item>
        </list>
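        <p>One possible reading of the score propagation step is sketched below. The redistribution rule is an assumption: here the total concept-term score in each first-pass tweet is split evenly among that tweet's non-concept terms and accumulated across tweets. The function names and the example data are hypothetical, and the extra sources of evidence are left out.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict


def propagate_scores(relevant_tweets, term_scores):
    """Redistribute concept-term scores to co-located non-concept terms.

    `relevant_tweets` is a list of token lists kept by the first pass.
    Assumption: the concept-term mass of a tweet is shared equally among
    its non-concept terms.
    """
    propagated = defaultdict(float)
    for tokens in relevant_tweets:
        concept = [t for t in tokens if t in term_scores]
        others = [t for t in tokens if t not in term_scores]
        if not concept or not others:
            continue
        share = sum(term_scores[t] for t in concept) / len(others)
        for t in others:
            propagated[t] += share
    return dict(propagated)


def second_pass_score(tokens, term_scores, propagated):
    """Score a tweet using both concept and propagated term scores."""
    return sum(
        term_scores.get(t, 0.0) + propagated.get(t, 0.0) for t in tokens
    )


# Hypothetical data: two tweets accepted by the first pass.
term_scores = {"iphone": 3.0, "mac": 1.0}
first_pass_tweets = [["new", "iphone", "keynote"], ["mac", "keynote"]]
propagated = propagate_scores(first_pass_tweets, term_scores)
candidate = ["watching", "keynote", "tonight"]
print(propagated)
print(second_pass_score(candidate, term_scores, propagated))
```
]]></preformat>
        <p>The candidate tweet contains no Wikipedia concept term, yet it receives a non-zero score through the propagated term "keynote", which is how the second pass can recover tweets missed by the first.</p>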
        <p>The score propagation technique, together with the extra sources of evidence
mentioned above, enables us to extract more tweets relevant to the company, thus
increasing recall. We report the evaluation results in the next
section.</p>
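        <p>The surface-level sources of evidence (URLs, usernames and hashtags) can be pulled out of a tweet with simple pattern matching, as in the sketch below. The regular expressions and the function name <monospace>surface_evidence</monospace> are assumptions for illustration; the POS tag of the company name would require a tagger and is omitted, and how the evidence is weighted in the tweet score is not shown.</p>
        <preformat><![CDATA[
```python
import re

# Simplified patterns; real tweets may need more careful handling.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")


def surface_evidence(tweet):
    """Extract URLs, usernames and hashtags from the text of a tweet."""
    urls = URL_RE.findall(tweet)
    # Remove URLs first so that '#' or '@' inside a URL is not matched.
    text = URL_RE.sub(" ", tweet)
    return {
        "urls": urls,
        "usernames": MENTION_RE.findall(text),
        "hashtags": HASHTAG_RE.findall(text),
    }


# Hypothetical tweet used only for illustration.
evidence = surface_evidence(
    "Loving the new #iPhone from @Apple http://example.com/p#top"
)
print(evidence)
```
]]></preformat>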
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Evaluations</title>
      <p>As mentioned in Section 2, the task is one of binary classification; hence
we report the effectiveness of our algorithm using the standard evaluation
metrics of precision and recall. We were given a very small trial dataset (six
companies) and a considerably larger test dataset (31 companies). We report
our results for the test dataset. Table 1 shows the precision and recall figures
after the application of the first pass and after the application of the second
pass of our algorithm.</p>
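      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption><p>Precision and recall on the test dataset after the first and second pass</p></caption>
        <table>
          <thead>
            <tr><th></th><th>First Pass</th><th>Second Pass</th></tr>
          </thead>
          <tbody>
            <tr><td>Precision</td><td>0.84827</td><td>0.81129</td></tr>
            <tr><td>Recall</td><td>0.16307</td><td>0.76229</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>As Table 1 shows, the first pass yields high precision but extremely low
recall. The second pass increases recall substantially (from 0.16307 to 0.76229)
while reducing precision only slightly (from 0.84827 to 0.81129). This large increase in recall
demonstrates the effectiveness of the score-propagation technique combined with the
use of multiple sources of evidence.</p>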
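      <p>For reference, precision and recall for the binary relevance decision can be computed as in the sketch below. The gold labels and system outputs are toy data invented for the example and are unrelated to the RepLab2012 figures reported above; they only reproduce the qualitative pattern of a conservative first pass and a more inclusive second pass.</p>
      <preformat><![CDATA[
```python
def precision_recall(gold, predicted):
    """Precision and recall for boolean relevance labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels: True means "relevant to the company".
gold = [True, True, True, True, False, False]
first_pass_output = [True, False, False, False, False, False]
second_pass_output = [True, True, True, False, True, False]

print(precision_recall(gold, first_pass_output))
print(precision_recall(gold, second_pass_output))
```
]]></preformat>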
      <p>
        The RepLab2012 filtering task used the measures of Reliability and
Sensitivity for evaluation purposes; these are described in detail in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Table 2 presents
a snapshot of the official results for the Filtering task of RepLab 2012, where
CIRGDISCO is the name of our team.
      </p>
      <p>
        There has been increasing interest over the past few years in applying natural language
processing techniques to tweets. We provide an overview
of some of this related work below.
      </p>
      <p>
        Named entity recognition in tweets has been an area of active research over
the past few years with a focus on semi-supervised learning techniques [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
Given the lack of context in tweets for the task of named entity recognition,
Liu et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] use a large training set and apply a KNN classifier in combination
with a CRF-based labeler, whereas Ritter et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] employ Labeled LDA over a
Freebase corpus.
      </p>
      <p>
        Understanding and analysing the content of tweets is another significant
research direction where the goal is to extract keyphrases from the content of tweets
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Hong and Davison [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] proposed several schemes to train standard LDA
and Author-Topic LDA models for topic discovery over Twitter data. Zhao
et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] extracted and ranked topical key phrases on Twitter through the use
of topic models [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] followed by topical PageRank.
      </p>
      <p>
        Semantic enrichment of microblog posts has also been studied with the aim
of determining what a tweet is about [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Abel et al. make use of news articles
to contextualize tweets [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] while Meij et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] provide a fine-grained semantic enrichment
of tweets through matching with Wikipedia.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We proposed a two-pass algorithm for company name disambiguation in tweets.
Our algorithm uses Wikipedia as the primary knowledge resource in the
first pass, in which the tweets are matched against Wikipedia terms.
The matched terms are then used for score propagation in the second pass,
which also makes use of multiple sources of evidence. Our algorithm
showed competitive performance, which demonstrates the effectiveness of the
techniques employed in the two passes. As future work, we aim to refine the score
propagation technique of the second pass by taking into account better features
for more effective score computation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>F.</given-names>
            <surname>Abel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.-J.</given-names>
            <surname>Houben</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Tao</surname>
          </string-name>
          .
          <article-title>Semantic enrichment of twitter posts for user profile construction on the social web</article-title>
          .
          <source>In Proceedings of the 8th extended semantic web conference on The semantic web: research and applications - Volume Part II, ESWC'11</source>
          , pages
          <fpage>375</fpage>
          -
          <lpage>389</lpage>
          , Berlin, Heidelberg,
          <year>2011</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Verdejo</surname>
          </string-name>
          .
          <article-title>Reliability and Sensitivity: Generic Evaluation Measures for Document Organization Tasks</article-title>
          . UNED, Madrid, Spain,
          <year>2012</year>
          .
          <source>Technical Report.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>L.</given-names>
            <surname>Hong</surname>
          </string-name>
          and
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Davison</surname>
          </string-name>
          .
          <article-title>Empirical study of topic modeling in twitter</article-title>
          .
          <source>In Proceedings of the First Workshop on Social Media Analytics, SOMA '10</source>
          , pages
          <fpage>80</fpage>
          -
          <lpage>88</lpage>
          , New York, NY, USA,
          <year>2010</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <article-title>Recognizing named entities in tweets</article-title>
          .
          <source>In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11</source>
          , pages
          <fpage>359</fpage>
          -
          <lpage>367</lpage>
          , Stroudsburg, PA, USA,
          <year>2011</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>K.</given-names>
            <surname>Massoudi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tsagkias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Weerkamp</surname>
          </string-name>
          .
          <article-title>Incorporating query expansion and quality indicators in searching microblog posts</article-title>
          .
          <source>In Proceedings of the 33rd European conference on Advances in information retrieval</source>
          ,
          <source>ECIR'11</source>
          , pages
          <fpage>362</fpage>
          -
          <lpage>367</lpage>
          , Berlin, Heidelberg,
          <year>2011</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>E.</given-names>
            <surname>Meij</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Weerkamp</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          .
          <article-title>Adding semantics to microblog posts</article-title>
          .
          <source>In Proceedings of the fifth ACM international conference on Web search and data mining</source>
          ,
          <source>WSDM '12</source>
          , pages
          <fpage>563</fpage>
          -
          <lpage>572</lpage>
          , New York, NY, USA,
          <year>2012</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>A.</given-names>
            <surname>Pak</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Paroubek</surname>
          </string-name>
          .
          <article-title>Twitter as a corpus for sentiment analysis and opinion mining</article-title>
          .
          <source>In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)</source>
          , Valletta, Malta, May
          <year>2010</year>
          .
          <publisher-name>European Language Resources Association (ELRA)</publisher-name>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>A.</given-names>
            <surname>Ritter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clark</surname>
          </string-name>
          , Mausam, and
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          .
          <article-title>Named entity recognition in tweets: an experimental study</article-title>
          .
          <source>In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11</source>
          , pages
          <fpage>1524</fpage>
          -
          <lpage>1534</lpage>
          , Stroudsburg, PA, USA,
          <year>2011</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Yerva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Miklós</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Aberer</surname>
          </string-name>
          .
          <article-title>What have fruits to do with technology?: the case of orange, blackberry and apple</article-title>
          .
          <source>In Proceedings of the International Conference on Web Intelligence, Mining and Semantics, WIMS '11</source>
          , pages
          <fpage>48:1</fpage>
          -
          <lpage>48:10</lpage>
          , New York, NY, USA,
          <year>2011</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Achananuparp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.-P.</given-names>
            <surname>Lim</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Topical keyphrase extraction from twitter</article-title>
          .
          <source>In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11</source>
          , pages
          <fpage>379</fpage>
          -
          <lpage>388</lpage>
          , Stroudsburg, PA, USA,
          <year>2011</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.-P.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Comparing twitter and traditional media using topic models</article-title>
          .
          <source>In Proceedings of the 33rd European conference on Advances in information retrieval</source>
          ,
          <source>ECIR'11</source>
          , pages
          <fpage>338</fpage>
          -
          <lpage>349</lpage>
          , Berlin, Heidelberg,
          <year>2011</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>