<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>REINA at RepLab2013 Topic Detection Task: Community Detection*</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>José Luis Alonso Berrocal</string-name>
          <email>berrocal@usal.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlos G. Figuerola</string-name>
          <email>figue@usal.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ángel Zazo Rodríguez</string-name>
          <email>zazo@usal.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Science and Technology Institute. Research Group REINA. University of Salamanca</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Social networks have become a large repository of comments from which a wide range of information can be extracted. Twitter is one of the largest and most widespread social networks, and is therefore an important source for detecting states of opinion, events and happenings, sometimes even before the mainstream media report them. Topic detection is important for discovering the areas of interest that arise in tweets. We used a classical scheme to build a similarity matrix between tweets and then applied community detection techniques. The results have been good and allow us to study new possibilities.</p>
      </abstract>
      <kwd-group>
        <kwd>Topic detection</kwd>
        <kwd>Community detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Social networks, blogs and online forums of any kind have become a large repository of
comments from which a wide range of information can be extracted [1], as shown by the
numerous studies carried out in recent years. Twitter is one of the largest and most
widespread social networks, and is therefore an important source for detecting states of
opinion, events and happenings, sometimes even before the mainstream media report
them [2], [3].</p>
      <p>It can be used to share information and also to describe virtually any daily activity [4],
because it allows users to express their opinions and interests in real time, in a brief and
highly personalized way [5]. Its importance is shown by its presence in virtually all
areas of life (social, economic, educational ...) and by its coverage of any topic (sports,
culture, entertainment, industry, science).</p>
      <p>* Financed by the project of the Ministry of Science and Innovation FFI2011-27763</p>
      <p>In quantitative terms, the importance of this network is reflected in the volume of
tweets generated every day: about 200 million in June 2011, a number that keeps
increasing [6]. In qualitative terms, its influence is reflected in the many social events
that are relayed in real time in order to gain visibility, as well as in the large number of
original messages that are spread further (retweets). Twitter can even be considered an
opinion niche, since a message created by one person (either original or a fragment of
another work, such as a newspaper headline or an extract from a news item) can be
retweeted by others, who in turn relay it again, causing a diffusion effect in clusters.</p>
      <p>It is true that much of the information in tweets is completely irrelevant, and that in
many cases messages isolated from their context lose value. Even so, Twitter is a very rich
source of information, because it condenses what users (individuals, institutions or
companies) consider relevant: news highlights, opinions or feelings that would be very
difficult to collect through other channels. For this reason Twitter messages are being
used as raw material for many lines of research, ranging from the role played by
different types of users in the dissemination of information [7] to sociological analysis
[8], applications to classification [6] and information retrieval [5] [9], semantic
analysis [10], etc.</p>
      <p>In this sense, monitoring the comments, messages and opinions posted on Twitter is
useful from the point of view of the digital reputation of people and institutions.
Early detection of issues and assessments concerning a particular subject allows those
concerned to react appropriately and maintain a positive public image [11].</p>
    </sec>
    <sec id="sec-2">
      <title>Topic detection</title>
      <p>Our group has focused on topic detection, beginning with some exploratory work in this
field. Topic Detection and Tracking (TDT) is an area that began as a track in the Text
Retrieval Conferences (TREC) [12], and this year it is one of the tasks of RepLab2013 [13].
However, the application of these techniques to Twitter is relatively new. Some notable
works are those of Sankaranarayanan et al. [14], who applied clustering techniques, and
Petrović et al. [15], who also built an experimental collection of tweets that has been
used in other studies, the Edinburgh Twitter Corpus [16]. Mathioudakis and Koudas [17]
proposed a system for detecting trending topics from a stream of tweets. Sharifi, Hutton
and Kalita [18] have also worked on the detection of trending topics, as have Cheong and
Lee [19], although their work focuses on the temporal evolution of the trending topics.</p>
      <p>The question is therefore how to determine the similarity between all pairs of tweets
relating to a given entity. Measuring the similarity between two documents is one of the
central problems in information retrieval and can be approached in various ways. One of
the best known is to consider each document (tweet) as a bag of words and apply a
classical tf × idf scheme.</p>
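      <p>The bag-of-words scheme just described can be sketched as follows. This is only a minimal illustration, not the code used in the experiments: the exact weighting variant is not stated in the text, so raw term frequency multiplied by log(N / df) is an assumption, and the function names and the toy tweets are hypothetical.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times log(N / df).
        vectors.append({t: f * math.log(n_docs / df[t]) for t, f in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(docs):
    """Symmetric matrix of pairwise cosine similarities between documents."""
    vecs = tfidf_vectors(docs)
    n = len(vecs)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sim[i][j] = sim[j][i] = cosine(vecs[i], vecs[j])
    return sim


# Toy, already tokenized tweets (hypothetical data).
tweets = [
    ["new", "phone", "launch", "today"],
    ["phone", "launch", "event", "today"],
    ["football", "match", "tonight"],
]
sim = similarity_matrix(tweets)
```
]]></preformat>
      <p>In this toy example the first two tweets share three terms and obtain a similarity between 0 and 1, while the third shares no term with the others and obtains 0.</p>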
      <p>Tweets, however, are documents with a number of special features that should be
taken into account. Anta and colleagues [20] mention several of them. To the 'classic'
issues of using unigrams, bigrams, trigrams, etc., or stemming, tweets add emoticons,
abbreviations, a slang specific to the medium, and numerous orthographic and
typographic anomalies. The brevity of tweets is another important issue
to consider [21]. In our specific case, the texts are in at least two possible languages:
English and Spanish.</p>
      <p>Finally, we build a graph representation of the tweets of each entity and apply
Social Network Analysis to it [22]. Social Network Analysis is a
measurement tool that allows the structure of the interactions
between the actors of a network to be known and analysed [23].</p>
      <p>There is a wide range of indicators, such as density, centrality, centralization,
betweenness, closeness, etc., that allow the analysis of both individual nodes and the
network as a whole, although the detection of communities, groups, cliques, etc. is a
subject of particular interest.</p>
      <p>The strategy adopted in this work has been the application of Social
Network Analysis techniques, in particular community detection techniques. In a social
network G = (V, E), a community is a subgraph of entities C ⊆ V that are associated with
common elements of interest. The elements that are part of the community can be
topics, real-life people, places, events, etc. These techniques are based on detecting
groups of nodes in a network that are strongly linked to each other. In our case the tweets
are the nodes of the network, and a semantic similarity between two tweets means a link
between the corresponding nodes.</p>
      <p>There are many techniques for detecting communities [24] [25-27], such as hierarchical
clustering algorithms, methods based on cliques, cut-based grouping, the Girvan-Newman
algorithm, etc.</p>
      <p>One widely used method is the analysis of modularity [28] (the number of links between
groups is small, while within groups it is high), with the Louvain algorithm [29] standing
out. Another method that is proving effective is the VOS (visualization of similarities)
clustering algorithm; some studies show it to be more effective than other systems, in
particular because it performs better than modularity-based systems at detecting small
clusters [30]. It is a modification of the modularity-based algorithm in which the
weights are maximized differently [31].</p>
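      <p>To make the notion of modularity concrete, the sketch below computes the standard Newman modularity of a given partition, Q = sum over communities c of [L_c / m - (d_c / 2m)^2], where m is the total edge weight, L_c the weight of the links inside c and d_c the summed weighted degree of its nodes. It only scores a partition; it is neither the Louvain nor the VOS algorithm, and the function name and toy graph are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of a partition of an undirected weighted graph.

    edges: iterable of (u, v, weight) with u != v, each edge listed once.
    community_of: dict mapping every node to a community label.
    """
    total = 0.0                    # m: sum of all edge weights
    internal = defaultdict(float)  # L_c: weight of edges inside each community
    degree = defaultdict(float)    # d_c: summed weighted degree per community
    for u, v, w in edges:
        total += w
        degree[community_of[u]] += w
        degree[community_of[v]] += w
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w
    if total == 0.0:
        return 0.0
    # Q = sum over communities of  L_c / m  -  (d_c / 2m)^2
    return sum(internal[c] / total - (degree[c] / (2.0 * total)) ** 2
               for c in degree)


# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
         (2, 3, 1.0)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "a" for node in split}
q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
```
]]></preformat>
      <p>Splitting the toy graph into its two triangles gives Q = 5/14 (about 0.357), whereas putting every node in one community gives Q = 0, which matches the intuition of few links between groups and many within them.</p>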
      <p>Regarding the VOS clustering technique, we can also use VOS mapping for visualization,
which is very effective compared to other methods and adds value to community detection
systems [32]. In this map, the colors indicate the density within each community, ranging from
blue (low density) to red (high density). We can see the most important communities
and how they are placed in relation to each other.</p>
    </sec>
    <sec id="sec-3">
      <title>Our approach</title>
      <p>Since this is the first time we have participated in this task, our approach has been
simple and without too many refinements. We considered each tweet (within each entity) as
a document whose basic features are the words it contains, and we then applied the
classic tf × idf weighting scheme and the cosine measure to construct a similarity
matrix [33]. Some specific decisions applied in our work were:
─ we made no distinction between the languages of the tweets; the appropriate treatment
may well differ notably from one language to another [20, 34], but in our case we
performed a uniform lexical analysis of all tweets
─ we applied a simple s-stemmer
─ we removed the words with fewer than 4 characters
Additionally, we discarded emoticons, and we considered hashtags
and entity terms particularly interesting.</p>
      <p>On the other hand, many tweets contain web links, which we considered especially
interesting: if two tweets link to the same website, we assume that they deal with very
similar issues. Thus, the URLs of these links are treated as features of the tweets, just as
important as the terms.</p>
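      <p>The lexical decisions described above (uniform analysis, s-stemming, dropping short words and emoticons, keeping hashtags and URLs as features) might look roughly like the following sketch. Several details are assumptions, because the text does not specify them: the stemming rules follow one reading of Harman's S-stemmer, the length filter is applied before stemming, and the regular expression, function names and sample tweet are hypothetical.</p>
      <preformat><![CDATA[
```python
import re

# URLs, hashtags and plain words; anything else (emoticons, punctuation) is dropped.
TOKEN_RE = re.compile(r"https?://\S+|#\w+|\w+")


def s_stem(word):
    """Strip English plural endings (rules assumed from Harman's S-stemmer)."""
    if word.endswith("ies"):
        return word if word.endswith(("eies", "aies")) else word[:-3] + "y"
    if word.endswith("es"):
        return word if word.endswith(("aes", "ees", "oes")) else word[:-1]
    if word.endswith("s"):
        return word if word.endswith(("us", "ss")) else word[:-1]
    return word


def features(tweet, min_len=4):
    """Turn a tweet into a list of terms: URLs, hashtags and stemmed words."""
    terms = []
    for token in TOKEN_RE.findall(tweet):
        if token.startswith(("http://", "https://")):
            # The URL itself is a feature; trailing punctuation is trimmed.
            terms.append(token.rstrip(".,;:!?)"))
        elif token.startswith("#"):
            terms.append(token.lower())  # hashtags are kept whole
        elif len(token) >= min_len:
            # Words shorter than min_len characters are discarded.
            terms.append(s_stem(token.lower()))
    return terms


sample = "New phones from the companies :-) #Launch http://example.com/x"
sample_terms = features(sample)
```
]]></preformat>
      <p>Two tweets that link to the same page then share the URL as a common term, so the link contributes to their similarity in the same way as any shared word or hashtag.</p>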
      <p>Given the small number of terms present in a tweet, the co-occurrence of URLs,
hashtags and entities is especially significant. Some studies apply techniques designed
to increase the number of terms per tweet [35], following the links and adding the words
of the referenced website to the features of that tweet. Anta and colleagues [20],
however, report the amount of noise that this technique produces.</p>
      <p>Other possible refinements, such as the use of Wikipedia [36] to obtain additional
information and produce more accurate results, have not been applied on this occasion.</p>
      <p>Once the network had been weighted with the similarity values, we generated an
individual network for each of the entities under study. For each entity we obtained the
number of communities (with the VOS clustering algorithm), isolated each community,
and then calculated its density.</p>
      <p>When we use the term density in the context of relationships and social networks, we
refer to a widely used concept. It can be defined as the proportion of links in a network
relative to the total number of possible links (sparse versus dense networks). Other authors
define density as the interface between network members. Density is a social network
analysis indicator that allows us to measure the extent to which a network is connected.
We can say further that in a dense network the nodes are very closely related to each
other, in line with the theory that "the performance of a network has a positive
association with the high density of the network".</p>
      <p>With these data we created two runs:
1. reina_1: Topics were assigned to all tweets, according to the community to which
they belonged. A topic was assigned to every tweet, even if its community consisted of
few documents.
2. reina_2: A filter was applied according to the density of the communities of each
entity. We considered only communities with a density greater than 0.5. A topic was
assigned only to tweets belonging to these communities.</p>
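      <p>The difference between the two runs can be illustrated with a short sketch that computes the density of each community (links present divided by links possible) and optionally drops communities at or below a threshold. It assumes that community labels have already been produced by some clustering step and that links are treated as unweighted; how the links were derived from the similarity weights is not stated in the text. The function names and data are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict
from itertools import combinations


def community_density(members, linked):
    """Share of the possible undirected links inside a community that exist."""
    n = len(members)
    if n < 2:
        return 0.0  # assumption: a single tweet has no possible links
    present = sum(1 for a, b in combinations(sorted(members), 2)
                  if frozenset((a, b)) in linked)
    return present / (n * (n - 1) / 2)


def assign_topics(community_of, edges, min_density=None):
    """Map tweet -> topic (its community), skipping sparse communities if asked."""
    linked = {frozenset(e) for e in edges}
    groups = defaultdict(set)
    for tweet, community in community_of.items():
        groups[community].add(tweet)
    topics = {}
    for community, members in groups.items():
        density = community_density(members, linked)
        if min_density is not None and density <= min_density:
            continue  # filtered run: these tweets receive no topic
        for tweet in members:
            topics[tweet] = community
    return topics


# Toy entity: community 0 is a triangle, community 1 has one link among four tweets.
community_of = {"t1": 0, "t2": 0, "t3": 0, "t4": 1, "t5": 1, "t6": 1, "t7": 1}
edges = [("t1", "t2"), ("t2", "t3"), ("t1", "t3"), ("t4", "t5"), ("t3", "t4")]

run_1 = assign_topics(community_of, edges)                   # every tweet gets a topic
run_2 = assign_topics(community_of, edges, min_density=0.5)  # dense communities only
```
]]></preformat>
      <p>In the toy data the triangle has density 1.0 and survives the filter, while the second community has density 1/6 and is dropped, so the filtered run labels only three of the seven tweets. This mirrors the trade-off between result quality and the ratio of processed tweets.</p>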
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>The results of our two runs (reina_1 and reina_2) were as follows.
For the F measure [37] (table 1), reina_2 gave the better results of the two,
and it also performed very well with respect to the rest of the runs. Note, however,
that the ratio of processed tweets obtained for this run is the lowest of the whole set,
a consequence of the density filter. The threshold used in this filter needs to be revised
in order to improve the ratio of tweets.</p>
      <table-wrap id="tab-2">
        <label>Table 2</label>
        <table>
          <thead>
            <tr>
              <th>SYSTEM</th>
              <th>Amount of improved systems (UIR &gt; 0.2)</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>UAMCLYR_topic_detection_07</td>
              <td>12</td>
            </tr>
            <tr>
              <td>replab2013_UNED_ORM_topic_detection_2</td>
              <td>11</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Concerning the amount of improved systems (table 2), we can see that reina_2
again behaves better than reina_1 and also offers good results with respect to all the
runs.</p>
      <sec id="sec-4-4">
        <title>System Pair</title>
        <p>[Table 3: system pairs and UIR. Systems named in the table:
replab2013_UNED_ORM_topic_detection_1, replab2013_UNED_ORM_topic_detection_2,
lia_topic_detection_2, lia_topic_detection_3 and BASELINE.]</p>
        <p>In the comparison of system pairs by UIR [38] (table 3), reina_2 improves on
reina_1 and continues to maintain good results with respect to all the runs of the track.</p>
        <p>With this working method we can visualize the communities detected for a given entity.
We show two different views (Fig. 1 and Fig. 2) of the entity RL2013D01E002, which give us
a representation of these communities; these ultimately correspond to the specific topics
detected in the entity.</p>
        <p>This working method allows us simultaneously to reduce a complete entity
to its various communities and to establish the relationships between these
communities, and therefore the relationships among topics (Fig. 3).</p>
        <p>We have proposed a different way of detecting topics, and it has given reasonably
good results.</p>
        <p>Combining a basic scheme for generating the similarity matrix with community
detection gives promising results.</p>
        <p>Filtering by network density gives better results than no filtering.
However, the threshold used in the filtering lowered the ratio of processed tweets.</p>
        <p>In the future we need to try different schemes for generating the similarity matrix,
try different community detection algorithms and use other filtering techniques.</p>
        <p>4. Java, A., et al., Why we twitter: An analysis of a microblogging community, in
Advances in Web Mining and Web Usage Analysis. 2009, Springer. p. 118-138.</p>
        <p>5. Garcia Esparza, S., M.P. O’Mahony, and B. Smyth, Mining the real-time web:
a novel approach to product recommendation. Knowledge-Based Systems,
2012. 29: p. 3-11.</p>
        <p>6. Lee, K., et al. Twitter trending topic classification. in Data Mining Workshops
(ICDMW), 2011 IEEE 11th International Conference on. 2011. IEEE.</p>
        <p>7. Cha, M., et al., The world of connections and information flow in Twitter.
Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE
Transactions on, 2012. 42(4): p. 991-998.</p>
        <p>8. Chen, S.-C., D.C. Yen, and M.I. Hwang, Factors influencing the continuance
intention to the usage of Web 2.0: An empirical study. Computers in Human
Behavior, 2012. 28(3): p. 933-941.</p>
        <p>9. Yerva, S.R., Z. Miklós, and K. Aberer, Quality-aware similarity assessment
for entity matching in Web data. Information Systems, 2012. 37(4): p. 336-351.</p>
        <p>10. Narr, S., E.W. De Luca, and S. Albayrak. Extracting semantic annotations
from twitter. in Proceedings of the fourth workshop on Exploiting semantic
annotations in information retrieval. 2011. ACM.</p>
        <p>11. Jansen, B.J., M. Zhang, K. Sobel, and A. Chowdury, Twitter power: Tweets as
electronic word of mouth. Journal of the American Society for Information Science and
Technology, 2009. 60(11): p. 2169-2188.</p>
        <p>12. Allan, J., Topic detection and tracking, J. Allan, Editor 2002, Kluwer
Academic Publishers. p. 1-16.</p>
        <p>13. Amigó, E., et al. Overview of RepLab 2013: Evaluating Online Reputation
Monitoring Systems. in Fourth International Conference of the CLEF
initiative, CLEF 2013, Valencia, Spain. Proceedings.</p>
        <p>14. Sankaranarayanan, J., et al. TwitterStand: news in tweets. in Proceedings of
the 17th ACM SIGSPATIAL International Conference on Advances in
Geographic Information Systems. ACM.</p>
        <p>15. Petrović, S., M. Osborne, and V. Lavrenko. Streaming first story detection
with application to Twitter. in Human Language Technologies: The 2010
Annual Conference of the North American Chapter of the Association for
Computational Linguistics. Association for Computational Linguistics.</p>
        <p>16. Petrović, S., M. Osborne, and V. Lavrenko. The Edinburgh Twitter corpus. in
Proceedings of the NAACL HLT 2010 Workshop on Computational
Linguistics in a World of Social Media. Association for Computational
Linguistics.</p>
        <p>17. Mathioudakis, M. and N. Koudas. TwitterMonitor: trend detection over the
twitter stream. in Proceedings of the 2010 ACM SIGMOD International
Conference on Management of data. ACM.</p>
        <p>18. Sharifi, B., M.-A. Hutton, and J.K. Kalita. Experiments in Microblog
Summarization. in Proceedings of the 2010 IEEE Second International
Conference on Social Computing. IEEE Computer Society.</p>
        <p>19. Cheong, M. and V. Lee. Integrating web-based intelligence retrieval and
decision-making from the twitter trends knowledge base. in Proceedings of the
2nd ACM workshop on Social web search and mining. ACM.</p>
        <p>20. Anta, A.F., et al., Sentiment Analysis and Topic Detection of Spanish Tweets:
A Comparative Study of NLP Techniques. Procesamiento del Lenguaje
Natural, 2012. 50: p. 45-52.</p>
        <p>21. Sriram, B., et al. Short text classification in twitter to improve information
filtering. in Proceedings of the 33rd international ACM SIGIR conference on
Research and development in information retrieval. ACM.</p>
        <p>22. Wasserman, S. and K. Faust, Social Network analysis: methods and
applications. 1998, Cambridge: Cambridge University Press.</p>
        <p>23. Molina, J.L., El análisis de redes sociales: una introducción. 2001, Barcelona:
Bellaterra.</p>
        <p>24. Porter, M.A., J.P. Onnela, and P.J. Mucha, Communities in networks. Notices
of the American Mathematical Society, 2009. 56(9): p. 1082-1097.</p>
        <p>25. Fortunato, S., Community detection in graphs. Physics Reports, 2010. 486(3):
p. 75-174.</p>
        <p>26. Papadopoulos, S., et al., Community detection in social media. Data Mining
and Knowledge Discovery, 2012. 24(3): p. 515-554.</p>
        <p>27. Wang, Z., Detection of overlapping communities in networks: a probabilistic
approach. 2012.</p>
        <p>28. Blondel, V.D., et al., Fast unfolding of communities in large networks. Journal
of Statistical Mechanics: Theory and Experiment, 2008. 2008(10): p. P10008.</p>
        <p>29. De Meo, P., et al. Generalized louvain method for community detection in
large networks. in Intelligent Systems Design and Applications (ISDA), 2011
11th International Conference on. 2011. IEEE.</p>
        <p>30. Van Eck, N.J., Methodological advances in bibliometric mapping of
science. 2011: Erasmus University Rotterdam.</p>
        <p>31. Waltman, L., N.J. van Eck, and E. Noyons, A unified approach to mapping
and clustering of bibliometric networks. Journal of Informetrics, 2010. 4(4):
p. 629-635.</p>
        <p>32. Van Eck, N.J., et al., A comparison of two techniques for bibliometric
mapping: Multidimensional scaling and VOS. Journal of the American Society
for Information Science and Technology, 2010. 61(12): p. 2405-2416.</p>
        <p>33. Salton, G. and C. Buckley, Term-weighting approaches in automatic text
retrieval. Information Processing and Management, 1988. 24(5): p. 513-523.</p>
        <p>34. Qureshi, M.A., C. O'Riordan, and G. Pasi. Concept Term Expansion Approach
for Monitoring Reputation of Companies on Twitter. in CLEF (Online
Working Notes/Labs/Workshop).</p>
        <p>35. Benhardus, J. and J. Kalita, Streaming trend detection in twitter. International
Journal of Web Based Communities, 2013. 9(1): p. 122-139.</p>
        <p>36. Osborne, M., et al. Bieber no more: First story detection using Twitter and
Wikipedia. in SIGIR 2012 Workshop on Time-aware Information Access.</p>
        <p>37. Amigó, E., J. Gonzalo, and F. Verdejo. A General Evaluation Measure for
Document Organization Tasks. in Proceedings SIGIR 2013 /07.</p>
        <p>38. Amigó, E., et al., Combining evaluation metrics via the unanimous
improvement ratio and its application to clustering tasks. Journal of Artificial
Intelligence Research, 2011. 42(1): p. 689-718.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Thelwall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Buckley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Paltoglou</surname>
          </string-name>
          ,
          <article-title>Sentiment in Twitter events</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          ,
          <year>2011</year>
          .
          <volume>62</volume>
          (
          <issue>2</issue>
          ): p.
          <fpage>406</fpage>
          -
          <lpage>418</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>O'Brien</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <article-title>Twitter breaks news of plane crash in the Hudson</article-title>
          .
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Kwak</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , et al.
          <article-title>What is Twitter, a social network or a news media</article-title>
          ?
          <source>in Proceedings of the 19th international conference on World wide web. ACM.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>