CIRG IRGDISCO at RepLab2013 Filtering Task: Use of Wikipedia's Graph Structure for Entity Name Disambiguation in Tweets

Muhammad Atif Qureshi 1,2, Arjumand Younus 1,2, Daniel Abril 3, Colm O'Riordan 1, and Gabriella Pasi 2

1 Computational Intelligence Research Group, National University of Ireland Galway, Ireland
2 Information Retrieval Lab, Informatics, Systems and Communication, University of Milan Bicocca, Milan, Italy
3 Institut d'Investigació en Intel·ligència Artificial, Consejo Superior de Investigaciones Científicas, Campus UAB, Bellaterra, Spain

muhammad.qureshi@nuigalway.ie, arjumand.younus@nuigalway.ie, dabril@iiia.csic.es, colm.oriordan@nuigalway.ie, pasi@disco.unimib.it

Abstract. Social media repositories serve as a significant source of evidence when extracting information related to the reputation of a particular entity (e.g., a particular politician, singer or company). Reputation management experts manually mine social media repositories (in particular Twitter) to monitor the reputation of a particular entity. Recently, the online reputation management evaluation campaign known as RepLab at CLEF has turned attention to devising computational methods to support reputation management experts. A significant research challenge in this setting is to disambiguate tweets with respect to entity names: deciding whether a particular tweet is relevant or irrelevant to a particular entity is an important task that has not yet been solved satisfactorily. To address this issue, in this paper we combine "context phrases" extracted from a tweet with the Wikipedia disambiguation articles for a particular entity in an SVM classifier that utilizes features extracted from the Wikipedia graph structure, i.e., links into Wikipedia articles and links from Wikipedia articles. Additionally, we use term-specificity and term-collocation features derived from the Wikipedia article of the entity under investigation. The experimental evaluations do not show a significant improvement over the baseline, and other systems outperform our approach; however, manual inspection of the feature sets and training data suggests that the proposed Wikipedia graph-based features may show a promising outcome when used in combination with more sophisticated learning algorithms.

1 Introduction

Companies are increasingly making use of social media to broaden their reach and enhance their marketing. At the same time, social media users extensively voice their opinions about various entities (e.g., musicians, movies, companies) [5]. This has given birth to a new area within the marketing domain known as "online reputation management," in which automated methods for monitoring the reputation of entities are essential, requiring novel computational algorithms to facilitate the work of reputation management experts [1, 3]. This paper describes our experience in devising a completely automated algorithm for the "entity name disambiguation" challenge in the context of the RepLab2013 filtering task [2], where we are given a set of entities and, for each entity, a set of tweets of which some are relevant to the entity and some are irrelevant. Our approach consists in making use of the knowledge encoded within the Wikipedia graph structure for entity name disambiguation in tweets.
We utilize the Wikipedia disambiguation pages for an entity to determine the amount of ambiguity within a particular tweet, while at the same time proposing a technique on top of the Wikipedia graph structure to determine the context in a tweet (a considerable research challenge in itself, given the amount of noise and the small amount of text in tweets). Although the experimental results do not show a striking performance improvement over the baseline, we argue that the use of the Wikipedia graph structure for entity name disambiguation in tweets is a promising direction to pursue.

2 Related Work

There has been increasing interest in applying natural language processing techniques to tweets over the past few years. However, in spite of the immense significance of extracting commercially useful information from tweets, the amount of research dedicated to company name disambiguation in tweets is very limited. The only two serious efforts undertaken to stimulate this research task are the WePS online reputation management evaluation campaign at CLEF 2010 [1] and the RepLab online reputation management evaluation campaign at CLEF 2012 [3].

The best two teams in the WePS online reputation management evaluation campaign were LSIR-EPFL [7] and ITC-UT [8]. The LSIR-EPFL system builds profiles for each company relying on external resources such as WordNet or the company homepage, in addition to a manual list of keywords for the company and the most frequent unrelated senses of the company name. The profiles are then used to extract tweet-specific features for use in an SVM classifier. The ITC-UT system is based on a two-step algorithm. In the first step, the algorithm categorizes queries by predicting the class of each company ("organization-like names" or "general-word-like names") using a Naive Bayes classifier with six binary features (for example: is the query an acronym? is the query an entry in a dictionary?). For this categorization they use thresholds manually set by looking at the results on the training data. The second step consists in categorizing the tweets using a set of heuristics. Despite showing promising results, both LSIR-EPFL and ITC-UT rely heavily on the manual selection of terms and thresholds for the company name disambiguation task.

During the RepLab2012 online reputation management evaluation campaign, the best-performing team relied on hand-coded rules [3] for the filtering task. Here we define a completely automatic algorithm for this task that relies on the Wikipedia graph structure as an external knowledge resource of evidence. The method is unique in that it does not require any sort of manual keywords or hand-coded rules.

3 Methodology

3.1 Background

The underlying filtering algorithm makes use of the encyclopedic structure of Wikipedia; more specifically, the knowledge encoded in Wikipedia's graph structure is utilized for the classification of tweets as relevant or irrelevant with respect to a particular entity. Wikipedia is organized into categories in a taxonomy-like structure (see Figure 1); we say taxonomy-like because it is not strictly hierarchical, due to the presence of cycles in the Wikipedia category graph. Each Wikipedia category can have an arbitrary number of subcategories and can itself appear inside an arbitrary number of supercategories (e.g., category C4 in Figure 1 is a subcategory of C2 and C3, and a supercategory of C5, C6 and C7). Furthermore, in Wikipedia each article can belong to an arbitrary number of categories, where each category acts as a kind of semantic tag for that article [10]. As an example, in Figure 1, article A1 belongs to categories C1 and C10, article A2 belongs to categories C3 and C4, while article A3 belongs to categories C4 and C7. The articles and the Wikipedia category graph are thus interlinked, and our algorithm makes use of these interlinks for the task of entity name disambiguation within tweets.

[Fig. 1. Wikipedia Category Graph Structure along with Wikipedia Articles]
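To make this structure concrete, the following toy Python sketch (our illustration; it is not part of the system described in this paper) encodes the category and article relations of Figure 1 as plain dictionaries and performs a cycle-safe, hop-limited subcategory traversal of the kind used later in Section 3.2. All names and values are assumptions made for exposition only.

    # Toy, illustrative encoding of Figure 1 (for exposition only): categories
    # form a taxonomy-like graph and articles carry categories as semantic
    # tags. The traversal guards against cycles, since the Wikipedia category
    # graph is not strictly hierarchical.
    from collections import deque

    # category -> subcategories (cf. Figure 1)
    subcategories = {
        "C2": ["C4"],
        "C3": ["C4"],
        "C4": ["C5", "C6", "C7"],
    }

    # article -> categories acting as semantic tags (cf. Figure 1)
    article_categories = {
        "A1": ["C1", "C10"],
        "A2": ["C3", "C4"],
        "A3": ["C4", "C7"],
    }

    def subcategories_within(root, max_hops):
        """Collect the categories reachable from `root` in at most
        `max_hops` subcategory steps, tolerating cycles."""
        seen, frontier = {root}, deque([(root, 0)])
        while frontier:
            category, depth = frontier.popleft()
            if depth == max_hops:
                continue
            for sub in subcategories.get(category, []):
                if sub not in seen:
                    seen.add(sub)
                    frontier.append((sub, depth + 1))
        return seen

    # Articles tagged by any category within two hops of C2
    related = {article for article, cats in article_categories.items()
               if set(cats) & subcategories_within("C2", max_hops=2)}
    print(related)  # {'A2', 'A3'} (set order may vary)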
3.2 Wikipedia-Based Feature Set

Our proposed approach involves a two-step method for entity name disambiguation. In the first step we determine the context phrases within a tweet using an approach similar to that of Meij et al. [6]. In the second step we use the link structure of Wikipedia to extract a rich feature set which enables us to perform the disambiguation task.

Context phrase extraction is performed by generating the possible n-grams within the phrase chunks of a tweet; we do not perform n-gram generation over the complete tweet but instead treat a tweet as a composition of phrase chunks with boundaries such as commas, semicolons and sentence terminators, along with tweet-specific markers such as @ and RT. Similar to the technique in [6] (although, unlike [6], we do not apply supervised machine learning for the reduction of candidate phrases), we then reduce the candidate phrases extracted from a tweet to those that have a match among Wikipedia article titles. The reduced set of phrases extracted from a tweet is referred to as ContextPhrases; a small illustrative sketch of this step follows below.
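As a rough illustration of this first step, the following Python sketch (our reconstruction, not the authors' code) generates n-grams within phrase chunks and filters them against a set of Wikipedia article titles. The chunk-boundary pattern, the value of max_n and the tiny title set are assumptions made for illustration.

    # Minimal sketch of context phrase extraction: n-grams are generated
    # within phrase chunks only, and kept when they match a Wikipedia
    # article title.
    import re

    wikipedia_titles = {"olympic games", "london"}  # hypothetical tiny sample

    # Chunk boundaries: punctuation plus tweet-specific markers such as @, RT
    CHUNK_BOUNDARIES = re.compile(r"[,;.!?]|(?:^|\s)(?:@\w+|RT)(?=\s|$)")

    def context_phrases(tweet, max_n=3):
        """Return the n-grams (n <= max_n) inside phrase chunks of `tweet`
        that match a Wikipedia article title."""
        phrases = set()
        for chunk in CHUNK_BOUNDARIES.split(tweet):
            words = chunk.lower().split()
            for n in range(1, max_n + 1):
                for i in range(len(words) - n + 1):
                    candidate = " ".join(words[i:i + n])
                    if candidate in wikipedia_titles:
                        phrases.add(candidate)
        return phrases

    print(context_phrases("RT @user London is buzzing, Olympic Games start today!"))
    # -> {'london', 'olympic games'}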
As mentioned in Section 3.1, a significant aspect of our proposed approach is the Wikipedia graph structure; more specifically, the links between categories and articles and the links within articles are used as the fundamental building blocks for the extraction of Wikipedia graph-specific features. At the first level, we use the parent Wikipedia article for the entity under investigation (given as part of the dataset for this task) and extract its parent categories PC_entity, from which we manually choose the categories related to the entity under investigation. Sub-categories are then extracted from PC_entity up to a hop count of two; finally, all articles belonging to these sub-categories are marked as being related to the entity under investigation, and we refer to the set of these articles as ARelated_entity.

The final step consists of constructing an information table of Wikipedia-based features as follows:

– We extract the disambiguation pages for the entity under investigation and take the context phrases extracted in the first step. For each of these we then find the sets of Wikipedia articles in inlinks, outlinks, and inlinks+outlinks. More specifically, for each disambiguated Wikipedia article for the entity, say e_d, and each context phrase p in the set ContextPhrases, we extract the following sets of Wikipedia articles:
  • Wikipedia articles linking to e_d and p, referred to as inlinks
  • Wikipedia articles linked from e_d and p, referred to as outlinks
  • Wikipedia articles linking both to and from e_d and p, referred to as inlinks+outlinks
– Using the information in the sets inlinks, outlinks and inlinks+outlinks, the features shown in Table 1 are constructed (see the illustrative sketch preceding Section 3.4).
– Corresponding to each feature in Table 1 is a boolean feature that reflects a mapping between the numerical feature and the articles in ARelated_entity. This mapping is constructed by taking the average scores across all context phrases in a tweet and choosing the disambiguated Wikipedia article with the highest score; if the mapping is to an article in ARelated_entity, we set the value of this feature to "1", and to "0" otherwise.

Table 1. Rich feature set for entity name disambiguation in tweets on top of the Wikipedia article link structure

  Feature                               Description
  Intersection_duplication              Number of intersections between the inlinks, outlinks and inlinks+outlinks sets of e and p, without removing duplicated articles
  NormalizedIntersection_duplication    Number of intersections between the inlinks, outlinks and inlinks+outlinks sets of e and p, without removing duplicated articles, normalized by the total number of articles in the sets
  Intersection_noduplication            Number of intersections between the inlinks, outlinks and inlinks+outlinks sets of e and p, after removing duplicated articles
  NormalizedIntersection_noduplication  Number of intersections between the inlinks, outlinks and inlinks+outlinks sets of e and p, after removing duplicated articles, normalized by the total number of articles in the sets
  Ratio_inlink:outlink                  Ratio of the number of articles in inlinks to the number of articles in outlinks

3.3 Additional Features

We also use five additional features, obtained from our system [9] used in the previous RepLab online reputation management evaluation campaign in 2012 [3], where our team was the second best amongst the participating teams. The technique described in [9] is a two-pass approach: the first pass uses term-specificity scores of concept terms (i.e., terms in the infoboxes corresponding to the Wikipedia article of the entity, and proper nouns ("NNP") appearing in the Wikipedia article of the entity, obtained by applying the Stanford POS tagger to the article), while the second pass utilizes a score-propagation mechanism in which terms co-located with concept terms are assigned a new score for the re-computation of the score of each tweet. In addition to the two pass scores, the following were also used in our RepLab2012 submission:

– the POS tag of the company name occurring within the tweet
– the URL occurring within the tweet
– the hashtag occurring within the tweet

Note that the score of the first pass, the score of the second pass, the POS tag of the company name in the tweet, the URL occurring in the tweet and the hashtag occurring in the tweet are the five features used in our system for RepLab2013.
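Before turning to the learning setup, the following Python sketch (ours; the paper does not specify the exact counting or normalization conventions, so those details are assumptions) illustrates how the intersection features of Table 1 could be computed for a single pair of a disambiguated entity article and a context phrase, given their inlink and outlink article lists.

    # Illustrative sketch of the Table 1 intersection features for one
    # disambiguated entity article e_d and one context phrase p. Lists (not
    # sets) are used so that duplicated articles can be counted when
    # duplicates are kept; the counting conventions are our assumptions.

    def table1_features(e_inlinks, e_outlinks, p_inlinks, p_outlinks):
        """Intersection features over inlinks, outlinks and inlinks+outlinks,
        with and without duplicate articles, plus the inlink:outlink ratio."""
        features = {}
        link_sets = {
            "inlinks": (e_inlinks, p_inlinks),
            "outlinks": (e_outlinks, p_outlinks),
            "inlinks+outlinks": (e_inlinks + e_outlinks,
                                 p_inlinks + p_outlinks),
        }
        for name, (e_links, p_links) in link_sets.items():
            shared = set(e_links) & set(p_links)
            # with duplicates: count every occurrence of a shared article
            dup = sum(1 for a in e_links + p_links if a in shared)
            total = len(e_links) + len(p_links)
            features["Intersection_dup(%s)" % name] = dup
            features["NormIntersection_dup(%s)" % name] = (
                dup / total if total else 0.0)
            # without duplicates: size of the plain set intersection
            nodup_total = len(set(e_links) | set(p_links))
            features["Intersection_nodup(%s)" % name] = len(shared)
            features["NormIntersection_nodup(%s)" % name] = (
                len(shared) / nodup_total if nodup_total else 0.0)
        features["Ratio_inlink:outlink"] = (
            len(e_inlinks) / len(e_outlinks) if e_outlinks else 0.0)
        return features

    # hypothetical link lists for a quick check
    print(table1_features(e_inlinks=["A1", "A2", "A2"], e_outlinks=["A3"],
                          p_inlinks=["A2", "A4"], p_outlinks=["A3", "A5"]))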
3.4 Machine Learning and Experimental Runs

Using the feature sets described in Sections 3.2 and 3.3, we train a support vector machine over the training data and then use it to predict labels for the test data. We performed six machine learning runs, as follows:

1. For the first run, we use all features, i.e., both the Wikipedia graph-based features of Section 3.2 and the additional score-based features of Section 3.3, whilst training one SVM per entity.
2. For the second run, we use only the Wikipedia graph-based features of Section 3.2 whilst training one SVM per entity.
3. For the third run, we use only the score-based features of Section 3.3 whilst training one SVM per entity.
4. For the fourth run, we use all features, i.e., both the Wikipedia graph-based features of Section 3.2 and the additional score-based features of Section 3.3, whilst training one SVM per category, i.e., combining all tweets related to a particular category into one training set and one test set.
5. For the fifth run, we use only the Wikipedia graph-based features of Section 3.2 whilst training one SVM per category.
6. For the sixth run, we use only the score-based features of Section 3.3 whilst training one SVM per category.

4 Experimental Results

4.1 Dataset

We performed our experiments using the dataset provided by the organizers of RepLab 2013 [2]. The dataset comprises 61 entities, and for each entity at least 2,200 tweets were collected: the first 700 constitute the training set, and the rest serve as the test set. Furthermore, the entities are grouped into categories based on their type; the four types distributed as part of RepLab2013 are: 1) automotive, 2) banking, 3) universities, and 4) music.

4.2 Evaluation Measures

The measures used for evaluation purposes are Reliability and Sensitivity, which are described in detail in [4]. In the case of filtering, Reliability and Sensitivity are equivalent to the product of the precision scores over the positive and negative classes (Reliability) and the product of the recall scores over the two classes (Sensitivity). The property that makes them particularly suitable for the filtering problem is that they are strict with respect to standard measures, i.e., a high value of Reliability and Sensitivity implies a high value in all standard measures.
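As a concrete reading of this description, the following Python sketch (ours) computes Reliability and Sensitivity for a binary filtering run. The harmonic-mean combination F(R,S) is our assumption of the standard F formulation; [4] gives the authoritative definitions.

    # Minimal sketch of Reliability and Sensitivity for the binary filtering
    # case as described above: the product of per-class precisions
    # (Reliability) and the product of per-class recalls (Sensitivity).
    # The harmonic-mean F(R,S) is assumed, not taken from this paper.

    def reliability_sensitivity(gold, predicted):
        """gold, predicted: equal-length lists of labels such as
        'related' / 'unrelated'."""
        reliability, sensitivity = 1.0, 1.0
        for label in set(gold):
            true_pos = sum(1 for g, p in zip(gold, predicted)
                           if g == label and p == label)
            predicted_as = sum(1 for p in predicted if p == label)
            actually = sum(1 for g in gold if g == label)
            reliability *= true_pos / predicted_as if predicted_as else 0.0
            sensitivity *= true_pos / actually if actually else 0.0
        f = (2 * reliability * sensitivity / (reliability + sensitivity)
             if reliability + sensitivity else 0.0)
        return reliability, sensitivity, f

    gold = ["related", "related", "unrelated", "unrelated", "related"]
    predicted = ["related", "unrelated", "unrelated", "related", "related"]
    print(reliability_sensitivity(gold, predicted))  # (0.333..., 0.333..., 0.333...)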
4.3 Results

Table 2 presents a snapshot of the official results for the filtering task of RepLab 2013, where CIRG IRDISCO is the name of our team. As can be seen from Table 2, out of a total of 11 teams participating in the RepLab2013 filtering task, six teams as well as the baseline system outperform our best run. We believe this to be a consequence of a considerably high skew in the training set of tweets: for most entities, the training tweets contained a high percentage of related tweets, which affects the performance of learning algorithms such as support vector machines.

Table 2. Results of the filtering task of RepLab 2013

  Team             Reliability  Sensitivity  F(R,S)
  popstar 2        0.729        0.451        0.489
  popstar 3        0.764        0.440        0.480
  popstar 7        0.759        0.428        0.470
  popstar 8        0.589        0.444        0.448
  SZTE NLP 7       0.599        0.444        0.439
  SZTE NLP 10      0.547        0.428        0.407
  SZTE NLP 5       0.508        0.429        0.3911
  SZTE NLP 1       0.491        0.429        0.3910
  SZTE NLP 8       0.507        0.428        0.3893
  SZTE NLP 4       0.480        0.429        0.3889
  SZTE NLP 6       0.517        0.428        0.3886
  SZTE NLP 3       0.496        0.425        0.3882
  SZTE NLP 2       0.492        0.426        0.387
  lia 1            0.658        0.357        0.381
  SZTE NLP 9       0.519        0.416        0.380
  UAMCLYR 4        0.555        0.401        0.379
  UAMCLYR 1        0.631        0.400        0.375
  lia 6            0.619        0.331        0.341
  UNED ORM 2       0.425        0.384        0.338
  BASELINE         0.490        0.320        0.326
  UAMCLYR 3        0.697        0.303        0.322
  Daedalus 1       0.353        0.448        0.321
  Daedalus 3       0.349        0.443        0.318
  lia 10           0.680        0.291        0.312
  lia 9            0.680        0.282        0.302
  UNED ORM 2       0.473        0.327        0.3018
  Daedalus 4       0.302        0.474        0.297
  Daedalus 2       0.299        0.479        0.2963
  lia 8            0.687        0.266        0.2962
  UAMCLYR 2        0.573        0.313        0.292
  lia 5            0.489        0.310        0.289
  lia 4            0.489        0.310        0.289
  UAMCLYR 5        0.569        0.307        0.286
  popstar 5        0.521        0.267        0.282
  CIRG IRDISCO 4   0.341        0.329        0.2724
  lia 2            0.423        0.331        0.2720

5 Future Work

Despite the unfavorable outcome of the RepLab2013 filtering task for our runs, we see significant value in the graph-based features mined from Wikipedia article inlinks and outlinks. Manual inspection of the feature set shows a clear difference in the inlink and outlink intersections between related and non-related tweets. As future work we aim to investigate the value of the Wikipedia graph-based features when used in combination with more sophisticated learning algorithms.

References

1. E. Amigó, J. Artiles, J. Gonzalo, D. Spina, B. Liu, and A. Corujo. WePS-3 Evaluation Campaign: Overview of the On-line Reputation Management Task. In 2nd Web People Search Evaluation Workshop (WePS 2010), CLEF 2010 Conference, Padova, Italy, 2010.
2. E. Amigó, J. Carrillo de Albornoz, I. Chugur, A. Corujo, J. Gonzalo, T. Martín, E. Meij, M. de Rijke, and D. Spina. Overview of RepLab 2013: Evaluating Online Reputation Monitoring Systems. In Fourth International Conference of the CLEF Initiative, CLEF 2013, Valencia, Spain, Proceedings, Springer LNCS, 2013.
3. E. Amigó, A. Corujo, J. Gonzalo, E. Meij, and M. de Rijke. Overview of RepLab 2012: Evaluating Online Reputation Management Systems. In CLEF 2012 Labs and Workshop Notebook Papers, 2012.
4. E. Amigó, J. Gonzalo, and F. Verdejo. A General Evaluation Measure for Document Organization Tasks. In Proceedings of SIGIR 2013, July 2013.
5. C. Dellarocas, N. F. Awad, and X. M. Zhang. Exploring the Value of Online Reviews to Organizations: Implications for Revenue Forecasting and Planning. Management Science, pages 1407-1424, 2003.
6. E. Meij, W. Weerkamp, and M. de Rijke. Adding Semantics to Microblog Posts. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM '12, pages 563-572, New York, NY, USA, 2012. ACM.
7. S. R. Yerva, Z. Miklós, and K. Aberer. What Have Fruits to Do with Technology? The Case of Orange, Blackberry and Apple. In Proceedings of the International Conference on Web Intelligence, Mining and Semantics, WIMS '11, pages 48:1-48:10, New York, NY, USA, 2011. ACM.
8. M. Yoshida, S. Matsushima, S. Ono, and I. Sato. ITC-UT: Tweet Categorization by Query Categorization for On-line Reputation Management. In 2nd Web People Search Evaluation Workshop (WePS 2010), CLEF 2010 Conference, 2010.
9. A. Younus, C. O'Riordan, and G. Pasi. CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets. In CLEF 2012 Evaluation Labs and Workshop, 2012.
10. T. Zesch and I. Gurevych. Analysis of the Wikipedia Category Graph for NLP Applications. In Proceedings of the TextGraphs-2 Workshop (NAACL-HLT), 2007.