<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Diversity-Aware Clustering of SIOC Posts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andreas Thalhammer</string-name>
          <email>andreas.thalhammer@sti2.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioannis Stavrakantonakis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioan Toma</string-name>
          <email>ioan.toma@sti2.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Innsbruck</institution>
          ,
          <addr-line>Technikerstr. 21a, A-6020 Innsbruck</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Sentiment analysis, topic extraction, and named entity recognition are emerging methods in the field of Web mining. Beyond SQL-like querying and corresponding visualizations, they make new ways of organizing content possible. In this demo paper, we apply efficient clustering algorithms that stem from the image retrieval field to sioc:Post entities, blending similarity scores for sentiment and covered topics. We demonstrate the system with a visualization component that combines different diversity aspects within microposts by Twitter users and within a static news article collection.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Named entity recognition, automatic tagging, and sentiment detection in
microposts, news articles, blog posts, forum posts, etc. provide us with new ways of
interacting with content. Not only is it possible to answer queries
like “select all positive articles that mention Barack Obama”, but these features also
offer a new way of organizing content: combining sentiment and topic
similarity in a single clustering approach. This enables the user to browse datasets
in a novel way, for example by getting overviews of positive and negative opinions
on the topic “champions league final” or by retrieving different topic clusters in
negative Tweets from a specific user.</p>
      <p>In this work, we demonstrate the application of two efficient clustering
algorithms that stem from the image retrieval domain to sentiment analysis in
combination with topic extraction and named entity recognition. We apply our
approach to two use cases: microposts and news articles. Moreover, readers
are invited to try the system with live Twitter data to gain new insights into
the polarity and topic distribution of politicians’ Tweets as well as their own.
The contribution of our work is twofold: from a cluster dimension perspective,
both sentiment and topics are covered; from a domain perspective,
both news articles and Tweets are covered. In this short paper we cannot
provide an extensive overview of the state of the art, but we would like to
contextualize our approach with respect to two related approaches.</p>
      <p>
        [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] presents a study on automatically clustering and classifying Tweets. The
paper stresses that a supervised methodology based
on hashtags could produce better results than traditional unsupervised
methods. Furthermore, the authors present a methodology for finding the most
representative Tweet in a cluster. Automatic detection of topics discussed in
Tweets is pointed out as one of the interesting problems in Tweet analysis.
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposes an emotion-oriented clustering approach based on
sentiment similarities between blog search result titles and snippets. The authors
group blog search results into sentiment clusters, which is
related to the grouping that we perform on the retrieved articles when we choose
to cluster them by sentiment rather than by topic. The authors’ goals
are similar to ours, as their approach focuses on very short text portions; our method
covers these as well, since we cluster Tweets, which are no longer than 140
characters. Their sentiment detection relies on SentiWordNet<sup>1</sup>, which is built
on top of WordNet and provides sentiment scores for the glosses of WordNet.
      </p>
      <p>
        In comparison to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which focus on clustering by either topics or
sentiments, our approach combines both elements. To this end,
we introduce a straightforward combination of topic and sentiment similarity
measures that can be flexibly adapted to be more specific towards either topic
or sentiment. Similarly to [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] we try to cover clusters of microposts as well as
longer articles.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Data Extraction, Modeling, and Storage</title>
      <p>
        We use the Twitter API to access microposts and a static news corpus
of the RENDER project<sup>2</sup>. The extracted Twitter data is processed with the
Enrycher service<sup>3</sup> and stored in a Sesame<sup>4</sup> or OWLIM<sup>5</sup> triple store. The news
data has already been processed with Enrycher and is available in the correct
format in an OWLIM triple store. As a data model we use the SIOC
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] vocabulary in combination with the Knowledge Diversity Ontology<sup>6</sup> (KDO)
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. KDO was developed in the context of the RENDER project and supports
assigning sentiments to SIOC posts. Moreover, we make use of the newly introduced
type kdo:NewsArticle and the class sioc-types:MicroblogPost, both of which are
subclasses of sioc:Post. Depending on the respective document, the Enrycher
service [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] assigns to instances of these subclasses a range of sioc:topics as well
as a sentiment (i.e., kdo:hasSentiment). The data model and the instances
are stored in and retrieved from a triple store implementing the SAIL<sup>7</sup> interface
(e.g., OWLIM).
      </p>
      <p><sup>1</sup> SentiWordNet – http://sentiwordnet.isti.cnr.it/</p>
      <p><sup>2</sup> RENDER News Corpus – http://rendernews.ontotext.com/, RENDER project – http://render-project.eu</p>
      <p><sup>3</sup> Enrycher – http://enrycher.ijs.si, http://ailab.ijs.si/tools/enrycher/</p>
      <p><sup>4</sup> Sesame – http://www.openrdf.org/</p>
      <p><sup>5</sup> OWLIM – http://owlim.ontotext.com/</p>
      <p><sup>6</sup> KDO – http://kdo.render-project.eu/</p>
      <p><sup>7</sup> SAIL API – http://www.openrdf.org/doc/sesame2/system/ch05.html</p>
    </sec>
    <sec id="sec-3">
      <title>Diversity-Aware Clustering</title>
      <p>
        Van Leuken et al. introduce “visual diversification of image search results” in
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The clustering algorithms involved are reported to be effective and efficient.
The similarity measures introduced there are based on the visual similarity of images.
For our document-based approach, we employ a combination of two similarity
measures, namely topic and sentiment similarity. The final score is calculated
with a flexible weighting component γ (with 0 ≤ γ ≤ 1). We calculate the
similarity of two sioc:Posts p1 and p2 as follows:
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math>\mathrm{sim}(p_1, p_2) = \gamma \cdot \mathrm{jacc}(p_1, p_2) + (1 - \gamma) \cdot \mathrm{sent}(p_1, p_2)</tex-math>
        </disp-formula>
      </p>
      <p>
        In Formula 1, the functions jacc and sent remain to be defined. jacc is
a simple Jaccard similarity index over topics:
        <disp-formula id="eq2">
          <label>(2)</label>
          <tex-math>\mathrm{jacc}(p_1, p_2) = \frac{\lvert \mathrm{topics}(p_1) \cap \mathrm{topics}(p_2) \rvert}{\lvert \mathrm{topics}(p_1) \cup \mathrm{topics}(p_2) \rvert}</tex-math>
        </disp-formula>
      </p>
      <p>
        We assume the extracted sentiment scores to be in the interval [0, 1], with 1
being most positive and 0 being most negative. The similarity score sent takes
this into account and reaches its highest value of 1 if the two scores are equal.
It is calculated as follows:
        <disp-formula id="eq3">
          <label>(3)</label>
          <tex-math>\mathrm{sent}(p_1, p_2) = 1 - \lvert \mathrm{score}(p_1) - \mathrm{score}(p_2) \rvert</tex-math>
        </disp-formula>
        If the scores are not in the mentioned interval, they are normalized
as follows:
        <disp-formula id="eq4">
          <label>(4)</label>
          <tex-math>\mathrm{score}(p) = \frac{\mathrm{score}(p) - \min(\mathrm{score}(p))}{\max(\mathrm{score}(p)) - \min(\mathrm{score}(p))}</tex-math>
        </disp-formula>
      </p>
      <p>
        We use the Folding and Maximum algorithms from [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. These algorithms
were originally designed to cluster images according to their visual similarity.
Rather than using image histograms, we apply these algorithms to textual
features of posts, using the similarity measure from above (see Formula 1).
      </p>
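      <p>The similarity measure of Formula 1, together with its components jacc and sent and the score normalization, can be sketched in Python as follows. This is a minimal illustration and not the code of the ranking service: the Post class, the default γ of 0.5, and the handling of empty topic sets and constant scores are assumptions made for this example.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    """Hypothetical stand-in for a sioc:Post and its extracted features."""
    topics: frozenset  # extracted topics of the post
    score: float       # sentiment score, expected in [0, 1]


def normalize(scores):
    """Min-max normalize raw sentiment scores into [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: constant scores are mapped to 0.0 to avoid 0/0.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def jacc(p1, p2):
    """Jaccard index over the topic sets of two posts."""
    union = p1.topics | p2.topics
    if not union:
        # Assumption: two posts without any topics share nothing.
        return 0.0
    return len(p1.topics & p2.topics) / len(union)


def sent(p1, p2):
    """Sentiment similarity: 1 for equal scores, 0 for opposite extremes."""
    return 1.0 - abs(p1.score - p2.score)


def sim(p1, p2, gamma=0.5):
    """Weighted blend of topic and sentiment similarity (Formula 1)."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return gamma * jacc(p1, p2) + (1.0 - gamma) * sent(p1, p2)


a = Post(frozenset({"obama", "election"}), 0.75)
b = Post(frozenset({"obama", "economy"}), 0.25)
print(sim(a, b, gamma=0.5))  # equal weight on topics and sentiment
print(sim(a, b, gamma=1.0))  # topics only
print(sim(a, b, gamma=0.0))  # sentiment only
```
]]></preformat>
      <p>Setting γ to 1 reduces the measure to pure topic similarity, while γ = 0 clusters by sentiment alone.</p>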
      <p>The Folding algorithm assumes a ranked list as input. Two
disjoint lists are maintained: the representatives and the rest. At the start, the ranked
input is the rest. The algorithm selects the first element of the rest (i.e., of the
ranked input list) as a representative. Subsequently, each element of the rest
is compared to the representatives and added to the representatives list if
its similarity to all existing representatives is less than a certain reference point
(i.e., a variable ε). When all representatives are established, each element in the
rest is assigned to the cluster whose representative is most similar to it.</p>
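      <p>The Folding procedure described above can be sketched as follows. The function names, the index-based cluster representation, and the toy topic sets are assumptions made for illustration; the similarity function is passed in as a parameter, and a plain Jaccard index over topics stands in for Formula 1 here.</p>
      <preformat><![CDATA[
```python
def jaccard(a, b):
    """Toy similarity for the example: Jaccard index over two topic sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def folding(ranked, sim, eps):
    """Cluster a ranked list.

    Returns a dict mapping the index of each representative to the
    indices of its cluster members (the representative included).
    """
    if not ranked:
        return {}
    reps = [0]  # the top-ranked element is the first representative
    for i in range(1, len(ranked)):
        # A new representative must be dissimilar to all existing ones.
        if all(sim(ranked[i], ranked[r]) < eps for r in reps):
            reps.append(i)
    clusters = {r: [r] for r in reps}
    for i in range(len(ranked)):
        if i in clusters:
            continue
        # Assign every other element to its most similar representative.
        best = max(reps, key=lambda r: sim(ranked[i], ranked[r]))
        clusters[best].append(i)
    return clusters


posts = [
    {"obama", "election"},
    {"obama", "vote"},
    {"football", "final"},
    {"football", "goal"},
    {"election", "vote"},
]
print(folding(posts, jaccard, eps=0.2))
```
]]></preformat>
      <p>Because the input is traversed in rank order, the result is deterministic and higher-ranked posts are preferred as representatives.</p>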
      <p>The Maximum algorithm is similar to Folding but has some distinct
features. It belongs to the class of randomized algorithms.
Again there are two disjoint lists, the representatives and the rest, where the rest
is initially the input. The first representative is
selected randomly from the rest. Then, the algorithm repeatedly adds the element that
has the minimum maximum similarity (or maximum minimum distance) to the
representatives. As soon as this minimum maximum similarity is no longer less than
ε, all representatives are found and the remaining elements in the rest list are
assigned to the clusters with the closest representatives.</p>
      <preformat>Data: list L containing sioc:Posts
Result: double value of ε
sumAll := 0;
for each sioc:Post s1 in L do
    sum := 0;
    for each sioc:Post s2 in L do
        if s1 != s2 then
            sum := sum + sim(s1, s2);
    avg := sum / (size(L) - 1);
    sumAll := sumAll + avg;
return sumAll / size(L);</preformat>
      <p>Algorithm 1: ε estimation</p>
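      <p>A possible reading of the Maximum algorithm is sketched below. The sketch interprets ε as a similarity threshold in the same sense as in Folding, so it stops once the candidate's maximum similarity to the representatives is no longer below ε; this reading, the function names, the optional rng parameter, and the toy data are assumptions for illustration.</p>
      <preformat><![CDATA[
```python
import random


def jaccard(a, b):
    """Toy similarity for the example: Jaccard index over two topic sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def maximum(items, sim, eps, rng=None):
    """Randomized clustering; returns {representative index: member indices}."""
    rng = rng or random.Random()
    if not items:
        return {}
    rest = list(range(len(items)))
    # The first representative is drawn at random from the input.
    reps = [rest.pop(rng.randrange(len(rest)))]

    def max_sim_to_reps(i):
        # Similarity of element i to its most similar representative.
        return max(sim(items[i], items[r]) for r in reps)

    while rest:
        # Candidate: the element least similar to all representatives.
        candidate = min(rest, key=max_sim_to_reps)
        if max_sim_to_reps(candidate) >= eps:
            break  # every remaining element is close to some representative
        reps.append(candidate)
        rest.remove(candidate)

    clusters = {r: [r] for r in reps}
    for i in rest:
        best = max(reps, key=lambda r: sim(items[i], items[r]))
        clusters[best].append(i)
    return clusters


posts = [
    {"obama", "election"},
    {"obama", "vote"},
    {"football", "final"},
    {"football", "goal"},
    {"election", "vote"},
]
print(maximum(posts, jaccard, eps=0.2, rng=random.Random(0)))
```
]]></preformat>
      <p>Passing a seeded random.Random instance makes a run reproducible; without it, the chosen representatives may differ between runs.</p>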
      <p>Both algorithms produce clusters, each with a selected representative.
It remains open, however, how to select an appropriate value for ε.
We estimate it as the average similarity of one sioc:Post to another (see
Algorithm 1).</p>
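      <p>A runnable counterpart to Algorithm 1 might look as follows. The function name, the toy data, and the Jaccard stand-in for Formula 1 are assumptions for illustration. Posts are compared by position rather than by value, which is also an assumption: it keeps two distinct posts with identical features in the average.</p>
      <preformat><![CDATA[
```python
def jaccard(a, b):
    """Toy similarity for the example: Jaccard index over two topic sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def estimate_eps(posts, sim):
    """Average similarity of a post to every other post, over all posts."""
    n = len(posts)
    if n < 2:
        raise ValueError("at least two posts are needed")
    sum_all = 0.0
    for i, s1 in enumerate(posts):
        # Sum of similarities of s1 to all other posts.
        total = sum(sim(s1, s2) for j, s2 in enumerate(posts) if i != j)
        sum_all += total / (n - 1)
    return sum_all / n


posts = [
    {"obama", "election"},
    {"obama", "vote"},
    {"football", "final"},
    {"football", "goal"},
    {"election", "vote"},
]
eps = estimate_eps(posts, jaccard)
print(eps)
```
]]></preformat>
      <p>The estimate requires a similarity computation for every ordered pair of posts, so its cost grows quadratically with the size of the list.</p>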
    </sec>
    <sec id="sec-4">
      <title>Implementation</title>
      <p>We implemented the diversity-aware ranking service on Oracle GlassFish 3.x.
The source code is available as a GitHub project<sup>8</sup>, and a deployment can be
found at http://ranking.render-project.eu/. There, users can specify a
variety of parameters and retrieve the clustering as JSON output. For a
better user experience, we introduce a jQuery-based visualization component
that is demonstrated at http://ranking.render-project.eu/tweetVis.html
(Twitter) and http://ranking.render-project.eu/vis.html (news). Figure
1 shows the news visualization component. The slider at the top changes the γ
value of the similarity measure (see Formula 1) towards either sentiment
similarity or topic similarity.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We have implemented a diversity-aware ranking service that enables clustering
and retrieval of SIOC posts along two dimensions: sentiment and topic. We
exemplify our approach on live Twitter data and a static news dataset. This
work is also meant to open up new directions for content organization,
navigation, and presentation.</p>
      <p><sup>8</sup> Source code – https://github.com/athalhammer/RENDER-ranking-service</p>
      <p><bold>Acknowledgements.</bold> We are grateful for the feedback from Daniele Pighin
(Google Zurich). This research was partly funded by the European Union’s
Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 257790
(RENDER project).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. John G. Breslin, Andreas Harth, Uldis Bojars, and
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Decker</surname>
          </string-name>
          .
          <article-title>Towards semantically-interlinked online communities</article-title>
          .
          <source>In The Semantic Web: Research and Applications</source>
          , volume
          <volume>3532</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>500</fpage>
          -
          <lpage>514</lpage>
          . Springer Berlin Heidelberg,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Shi</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Daling</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ge</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chao</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Nan</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>Sentiment clustering: A novel method to explore in the blogosphere</article-title>
          .
          <source>In Proceedings of the Joint International Conferences on Advances in Data and Web Management</source>
          , APWeb/WAIM '09, pages
          <fpage>332</fpage>
          -
          <lpage>344</lpage>
          , Berlin, Heidelberg,
          <year>2009</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Rosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gershman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Frederking</surname>
          </string-name>
          .
          <article-title>Topical clustering of tweets</article-title>
          .
          <source>Proceedings of the ACM SIGIR: SWSM</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Thalhammer</surname>
          </string-name>
          , Ioan Toma, Rakebul Hasan, Elena Simperl, and Denny Vrandečić.
          <article-title>How to represent knowledge diversity</article-title>
          .
          <source>Poster at 10th intl. Semantic Web Conf. (ISWC11)</source>
          ,
          <year>10 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Reinier H.</given-names>
            <surname>van Leuken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lluis</given-names>
            <surname>Garcia</surname>
          </string-name>
          , Ximena Olivares, and Roelof van Zwol.
          <article-title>Visual diversification of image search results</article-title>
          .
          <source>In Proc. of the 18th intl. conf. on World Wide Web, WWW '09</source>
          , pages
          <fpage>341</fpage>
          -
          <lpage>350</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Tadej</given-names>
            <surname>Štajner</surname>
          </string-name>
          , Delia Rusu, Lorand Dali, Blaž Fortuna, Dunja Mladenić, and Marko Grobelnik.
          <article-title>Enrycher: service oriented text enrichment</article-title>
          .
          <source>In Proc. of the 11th intl. multiconference Information Society</source>
          , IS-2009,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>