Diversity-Aware Clustering of SIOC Posts

       Andreas Thalhammer, Ioannis Stavrakantonakis, and Ioan Toma

          University of Innsbruck, Technikerstr. 21a, A-6020 Innsbruck
    {andreas.thalhammer, ioannis.stavrakantonakis, ioan.toma}@sti2.at


      Abstract. Sentiment analysis, topic extraction, and named entity
      recognition are emerging methods in the field of Web Mining. In
      addition to SQL-like querying and corresponding visualizations, they
      make new ways of organizing content possible. In this demo paper we
      apply efficient clustering algorithms that stem from the image
      retrieval field to sioc:Post entities, blending similarity scores of
      sentiment and covered topics. We demonstrate the system with a
      visualization component that combines different diversity aspects
      within microposts by Twitter users and a static news article
      collection.


1   Introduction
Named entity recognition, automatic tagging, and sentiment detection in
microposts, news articles, blog posts, forum posts, etc. provide us with new
ways of interacting with content. Not only is it possible to retrieve answers
to queries like “select all positive articles that mention Barack Obama”, but
these features also offer a new way of organizing content: combining sentiment
and topic similarity in a single clustering approach. This enables the user to
browse datasets in a novel way, for example by getting overviews of positive
and negative opinions on the topic “champions league final” or by retrieving
different topic clusters in negative Tweets from a specific user.
    In this work, we demonstrate the application of two efficient clustering
algorithms that stem from the image retrieval domain to sentiment analysis in
combination with topic extraction and named entity recognition. We apply our
approach to two use cases: microposts and news articles. Moreover, readers
are invited to try the system with live Twitter data to gain new insights into
the polarity and topic distribution of politicians’ Tweets as well as their own.


2   Related Work
The contribution of our work is twofold: from a cluster dimension perspective,
both sentiment and topics are covered, and from a domain perspective, both
news articles and Tweets are covered. In this short paper we are not able
to provide an extensive overview of the state of the art, but we would like to
contextualize our approach along with two related approaches.
    [3] presents a study on automatically clustering and classifying Tweets. The
results of that paper indicate that a supervised methodology based
on hash-tags could produce better results than the traditional unsupervised
methods. Furthermore, the authors present a methodology for finding the most
representative Tweet in a cluster. Automatic detection of topics discussed in
Tweets is pointed out as one of the interesting problems in Tweet analysis.
    [2] proposes an emotion-oriented clustering approach that groups blog search
results into sentiment clusters according to the sentiment similarities between
their titles and snippets. This is related to the grouping that we perform on
the retrieved articles when we choose to cluster them by sentiment rather than
by topic. The authors’ goals are similar to ours, as their approach focuses on
very short portions of text. Our method covers this case as well, since we
cluster Tweets, which are no longer than 140 characters. The sentiment
detection relies on SentiWordNet¹, which is built on top of WordNet and
provides sentiment scores on the glosses of WordNet.
    In contrast to [3] and [2], which focus on clustering either by topic or by
sentiment, our approach combines both elements. For this, we introduce a
straightforward combination of topic and sentiment similarity measures that
can be flexibly weighted towards either topic or sentiment. Similarly to [2],
we try to cover clusters of microposts as well as longer articles.


3    Data Extraction, Modeling, and Storage

We utilize the Twitter API to access the microposts and a static news corpus
of the RENDER project². The extracted Twitter data is processed using the
Enrycher service³ and stored in a Sesame⁴ or OWLIM⁵ triple store. The news
data has already been processed with Enrycher and is available in the correct
format in an OWLIM triple store. As a data model we use the SIOC
vocabulary [1] in combination with the Knowledge Diversity Ontology⁶ (KDO)
[4]. KDO was developed in the context of the RENDER project and supports
assigning sentiments to SIOC posts. Moreover, we make use of the newly
introduced type kdo:NewsArticle and the class sioc-types:MicroblogPost, both
being subclasses of sioc:Post. Depending on the respective document, the
Enrycher service [6] assigns to instances of these subclasses a range of
sioc:topics as well as a sentiment (i.e., kdo:hasSentiment). The data model as
well as the instances are stored in and retrieved from a triple store
implementing the SAIL⁷ interface (e.g., OWLIM).
¹ SentiWordNet – http://sentiwordnet.isti.cnr.it/
² RENDER News Corpus – http://rendernews.ontotext.com/, RENDER project –
  http://render-project.eu
³ Enrycher – http://enrycher.ijs.si, http://ailab.ijs.si/tools/enrycher/
⁴ Sesame – http://www.openrdf.org/
⁵ OWLIM – http://owlim.ontotext.com/
⁶ KDO – http://kdo.render-project.eu/
⁷ SAIL API – http://www.openrdf.org/doc/sesame2/system/ch05.html


4    Diversity-Aware Clustering

Van Leuken et al. introduce “visual diversification of image search results” in
[5]. The clustering algorithms involved are reported to be effective and
efficient. The similarity measures introduced there are based on the visual
similarity of images. For our document-based approach, we employ a combination
of two similarity measures, namely topic and sentiment similarity. The final
score is calculated with a flexible weighting component γ (with 0 ≤ γ ≤ 1),
where γ = 1 considers only topics and γ = 0 only sentiment. We calculate the
similarity of two sioc:Posts p1 and p2 as follows:

              sim(p1, p2) = γ · jacc(p1, p2) + (1 − γ) · sent(p1, p2)      (1)

In Formula 1, the functions jacc and sent remain to be defined. jacc is the
Jaccard similarity index over the topics of the two posts:

     jacc(p1, p2) = |topics(p1) ∩ topics(p2)| / |topics(p1) ∪ topics(p2)|      (2)

We assume the extracted sentiment scores to be in the interval of [0, 1] with 1
being most positive and 0 being most negative. The similarity score sent takes
this into account, having the highest similarity of 1 if the two scores are equal.
This similarity score is calculated as follows:

                    sent(p1, p2) = 1 − |score(p1) − score(p2)|               (3)

If the scores do not lie in this interval, they are min-max normalized over all
posts q in the dataset, and Formula 3 is applied to the normalized scores:

     score'(p) = (score(p) − min_q score(q)) / (max_q score(q) − min_q score(q))      (4)

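As an illustration, Formulas 1 to 4 can be sketched in Python as follows. The
Post class, its field names, the default γ = 0.5, and the handling of empty
topic sets and of identical raw scores are assumptions made for this sketch;
they are not taken from the implementation of the service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    """Hypothetical stand-in for a sioc:Post with topics and a sentiment."""
    uri: str
    topics: frozenset
    score: float  # sentiment score, assumed to be normalized to [0, 1]


def normalize(scores):
    """Min-max normalize raw sentiment scores to [0, 1] (Formula 4)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: if all raw scores are identical, map them to 0.0.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def jacc(p1, p2):
    """Jaccard index over the topic sets of two posts (Formula 2)."""
    union = p1.topics | p2.topics
    if not union:
        # Assumption: two posts without any topics have topic similarity 0.
        return 0.0
    return len(p1.topics & p2.topics) / len(union)


def sent(p1, p2):
    """Sentiment similarity; 1.0 if both scores are equal (Formula 3)."""
    return 1.0 - abs(p1.score - p2.score)


def sim(p1, p2, gamma=0.5):
    """Weighted combination of topic and sentiment similarity (Formula 1)."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return gamma * jacc(p1, p2) + (1.0 - gamma) * sent(p1, p2)
```

With γ = 1 the measure reduces to the Jaccard index over topics, and with
γ = 0 to pure sentiment similarity.
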
We utilize the Folding and Maximum algorithms from [5]. These algorithms
were originally designed to cluster according to the visual similarity of images.
Rather than using image histograms, we apply these algorithms to textual
features of posts, using the similarity measure from above (see Formula 1).
    The Folding algorithm assumes a ranked list as input. Two disjoint lists
are maintained, the representatives and the rest. At the start, the ranked
input is the rest. The algorithm selects the first element of the rest (i.e.,
of the ranked input list) as a representative. Then, in ranking order, each
element of the rest is compared to the representatives and added to the
representatives list if its similarity to all existing representatives is less
than a certain reference point (i.e., a variable ε). When all representatives
are established, each element remaining in the rest is assigned to the cluster
whose representative is most similar to it.
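
The Folding procedure described above can be sketched in Python as follows.
This is a minimal sketch based on the description in this section, not the
code of [5] or of the service; the function name, the index-based return
format, and the generic sim argument are assumptions.

```python
def folding(ranked, sim, epsilon):
    """Cluster a ranked list of items.

    Returns a dict that maps the index of each representative to the list
    of indices in its cluster (the representative itself comes first).
    """
    if not ranked:
        return {}

    representatives = [0]  # the top-ranked element is always a representative
    rest = []
    for i in range(1, len(ranked)):
        # An element becomes a representative only if it is sufficiently
        # dissimilar (similarity below epsilon) to ALL existing ones.
        if all(sim(ranked[i], ranked[r]) < epsilon for r in representatives):
            representatives.append(i)
        else:
            rest.append(i)

    clusters = {r: [r] for r in representatives}
    for i in rest:
        # Assign each remaining element to its most similar representative.
        best = max(representatives, key=lambda r: sim(ranked[i], ranked[r]))
        clusters[best].append(i)
    return clusters
```

Any symmetric similarity function with values in [0, 1] can be passed as sim,
for example the combined measure of Formula 1 with a fixed γ.
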
    The Maximum algorithm is similar to Folding but has some distinct
features. It belongs to the class of randomized algorithms. Again there are
two disjoint lists, the representatives and the rest, and the input is
initially assigned to the rest. The first representative is selected randomly
from the rest. Then, the algorithm adds the element which has minimum maximum
similarity (or maximum minimum distance) to the representatives, i.e., the
element whose most similar representative is least similar to it. As long as
this minimum maximum similarity is less than ε, the selected element becomes a
new representative. Once it is no longer less than ε, all representatives are
found and the remaining elements in the rest list are assigned to the clusters
with the closest representatives.

     Data: List L containing sioc:Posts
     Result: double value of ε
     sumAll := 0;
     for each sioc:Post s1 in L do
         sum := 0;
         for each sioc:Post s2 in L do
             if s1 != s2 then
                 sum := sum + sim(s1, s2);
         avg := sum / (size(L) - 1);
         sumAll := sumAll + avg;
     return sumAll / size(L);
                           Algorithm 1: ε estimation

    Both algorithms produce clusters, each with a selected representative. What
remains open is how to select an appropriate value for ε. For this, we determine
the average similarity of a sioc:Post to another (see Algorithm 1).
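
The ε estimation and the Maximum procedure can be sketched together in Python
as follows. This is a minimal sketch and not the code of [5] or of the
service. It assumes that ε is a similarity threshold as in Folding, so that a
candidate becomes a representative only while its similarity to all existing
representatives is below ε. The function names, the optional rng parameter,
and the index-based return format are assumptions as well.

```python
import random


def estimate_epsilon(items, sim):
    """Average similarity of an item to every other item (cf. Algorithm 1)."""
    n = len(items)
    if n < 2:
        raise ValueError("at least two items are required")
    sum_all = 0.0
    for i, a in enumerate(items):
        # Compare by position so that equal-valued items are still counted.
        total = sum(sim(a, b) for j, b in enumerate(items) if j != i)
        sum_all += total / (n - 1)
    return sum_all / n


def maximum(items, sim, epsilon, rng=None):
    """Randomized clustering; returns {representative index: member indices}."""
    if not items:
        return {}
    rng = rng or random.Random()

    rest = list(range(len(items)))
    first = rng.choice(rest)  # the first representative is chosen at random
    representatives = [first]
    rest.remove(first)

    while rest:
        # Similarity of each remaining element to its closest representative.
        closest = {i: max(sim(items[i], items[r]) for r in representatives)
                   for i in rest}
        # Candidate: the element with minimum maximum similarity.
        candidate = min(rest, key=closest.get)
        if closest[candidate] >= epsilon:
            break  # every remaining element is close to some representative
        representatives.append(candidate)
        rest.remove(candidate)

    clusters = {r: [r] for r in representatives}
    for i in rest:
        best = max(representatives, key=lambda r: sim(items[i], items[r]))
        clusters[best].append(i)
    return clusters
```

A typical call would first compute epsilon = estimate_epsilon(posts, sim) and
then pass it to maximum(posts, sim, epsilon). Since the first representative
is chosen at random, repeated runs may select different representatives.
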


5     Implementation

We implemented the diversity-aware ranking service with Oracle GlassFish 3.x.
The source code is available as a GitHub project⁸ and a deployment can be
found at http://ranking.render-project.eu/. There, users can specify a
variety of parameters and retrieve the JSON output of the clustering. For a
better user experience, we introduce a jQuery-based visualization component
that is demonstrated at http://ranking.render-project.eu/tweetVis.html
(Twitter) and http://ranking.render-project.eu/vis.html (news). Figure 1
shows the news visualization component. The slider at the top changes the γ
value of the similarity measure (see Formula 1), shifting the weight towards
either sentiment similarity or topic similarity.


6     Conclusion

We have implemented a diversity-aware ranking service that enables clustering
and retrieval of SIOC posts along two dimensions: sentiment and topic. We
exemplify our approach on live Twitter data and a static news dataset. This
work is also meant to open up new directions for content organization,
navigation, and presentation.
⁸ Source code – https://github.com/athalhammer/RENDER-ranking-service



                      Fig. 1. The news visualization component.


Acknowledgements We are grateful for the feedback from Daniele Pighin
(Google Zurich). This research was partly funded by the European Union’s Sev-
enth Framework Programme (FP7/2007-2013) under grant agreement no. 257790
(RENDER project).


References
1. John G. Breslin, Andreas Harth, Uldis Bojars, and Stefan Decker. Towards
   semantically-interlinked online communities. In The Semantic Web: Research and
   Applications, volume 3532 of Lecture Notes in Computer Science, pages 500–514.
   Springer Berlin Heidelberg, 2005.
2. Shi Feng, Daling Wang, Ge Yu, Chao Yang, and Nan Yang. Sentiment clustering: A
   novel method to explore in the blogosphere. In Proceedings of the Joint International
   Conferences on Advances in Data and Web Management, APWeb/WAIM ’09, pages
   332–344, Berlin, Heidelberg, 2009. Springer-Verlag.
3. K. D. Rosa, R. Shah, B. Lin, A. Gershman, and R. Frederking. Topical clustering
   of tweets. Proceedings of the ACM SIGIR: SWSM, 2011.
4. Andreas Thalhammer, Ioan Toma, Rakebul Hasan, Elena Simperl, and Denny
   Vrandečić. How to represent knowledge diversity. Poster at 10th intl. Semantic
   Web Conf. (ISWC11), 10 2011.
5. Reinier H. van Leuken, Lluis Garcia, Ximena Olivares, and Roelof van Zwol. Visual
   diversification of image search results. In Proc. of the 18th intl. conf. on World
   Wide Web, WWW ’09, pages 341–350, New York, NY, USA, 2009. ACM.
6. Tadej Štajner, Delia Rusu, Lorand Dali, Blaž Fortuna, Dunja Mladenić, and Marko
   Grobelnik. Enrycher: service oriented text enrichment. In Proc. of the 11th intl.
   multiconference Information Society, IS-2009, 2009.

