<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Wordclouds to Navigate and Summarize Twitter Search Results</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rianne Kaptein</string-name>
          <email>rianne@oxyme.com</email>
          <aff>Oxyme, Amsterdam, The Netherlands</aff>
        </contrib>
      </contrib-group>
      <abstract>
        <p>This paper describes an application in which wordclouds are used to navigate and summarize Twitter search results. A search on Twitter can return thousands of relevant tweets. Looking at only the first few result pages does not give an overview of what is discussed in all search results. Our application summarizes sets of tweets into wordclouds, which can be used to get a first idea of the contents of the tweets. The application also provides the option to zoom in on a certain part of the search results to inspect them in more detail. The application has not been formally evaluated, but we do provide some insights and points for discussion.</p>
        <p>One of the most common problems in Information Retrieval is information overload: there is simply too much relevant information available for users to process. Applications are therefore needed to help users deal with large amounts of data. In this paper we describe an application which was developed for this purpose. The use of wordclouds in the application serves two purposes:</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. To summarize</title>
    </sec>
    <sec id="sec-2">
      <title>2. To aid navigation</title>
      <p>This application was developed with the following two user
scenarios in mind:</p>
    </sec>
    <sec id="sec-3">
      <title>1. General Twitter search</title>
      <p>Nowadays many people express their opinions about
products, services and companies on Twitter. When you want to
get a broad overview of what people are tweeting in general
about a company or event, it does not suffice to read the first
few pages of search results. You want to get a feeling for the
most frequently discussed topics overall, and dive into
particular subtopics of special interest, such as product
recommendations.</p>
      <p>Presented at EuroHCIR2012. Copyright c 2012 for the individual papers
by the papers’ authors. Copying permitted only for private and academic
purposes. This volume is published and copyrighted by its editors.</p>
    </sec>
    <sec id="sec-4">
      <title>2. Searching fragments of categorized data</title>
      <p>Besides Twitter there are many more places on the Web where
people express their opinions. These opinions can be
collected and annotated with labels such as sentiment, source,
market etcetera. When you have a large amount of annotated
data available, it is interesting to see for example what are
the different topics discussed in positive and in negative
messages.</p>
      <p>In this paper we will focus on the first user scenario: General
Twitter search, since Twitter data is abundant and publicly available.
Humans have a great capacity to notice terms which are out of the
ordinary. When looking at a wordcloud there will always be some
unexpected terms which catch your attention and are good pointers
for further investigation. In tweets about public transport you can
expect for example tweets about delays, but you might not expect
certain tweets about recent events such as a new colour of the trains.
What we try to do in the wordclouds is to emphasize the words that
are noteworthy from a statistical point of view, and leave it up to
the user to decide which messages to explore further.</p>
      <p>
        Although the usefulness of tagclouds for navigation is still a topic of
debate [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], exploratory applications which make use of wordclouds
for summarization and navigation of search results have been
moderately successful on specific domains such as web documents [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
and PubMed publications in biomedical literature [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
The search results that we are investigating in this paper have three
characteristics:
      </p>
      <p>A search result is a short textual message. By design a Twitter
message cannot contain more than 140 characters.</p>
      <p>The number of search results is large. If this were not the
case, you could simply read through all of them, since the
results are short texts.</p>
      <p>There are many, equally relevant search results. In web search
there are usually not more than a handful of highly relevant
search results; many of the other search results contain copied or
redundant information, or only mention the search words
in passing. Although Twitter search results also contain
redundant information, i.e. repeated tweets and retweets, the
set of relevant tweets can still consist of thousands of equally
relevant tweets.</p>
      <p>In the next sections of this paper we will present our approach
(Section 2), a case study (Section 3), and finally our conclusions
(Section 4).</p>
      <p>On the first screen the system offers a number of selections that
can be made to make sure you generate the wordclouds that
best represent your data and your analysis purpose. The input
is collected using text fields, radio buttons and checkboxes. The first
part of the input screen is shown in Figure 1.</p>
    </sec>
    <sec id="sec-5">
      <title>The following selections can be made:</title>
      <p>File selection, a tab separated text file is required as input.</p>
      <p>Text selection, which column in the dataset to use as textual
input for the wordcloud generation.</p>
      <p>Category selection, based on a value in any column of your
dataset your data can be categorized. It is also possible to
create categories based on the presence of words in the contents
of your data, e.g. to create a category for all tweets containing
the term ‘happy’.</p>
      <p>Language, used for the removal of standard stopwords.
Optionally, additional stopwords can be specified. These
words will not occur in any of the wordclouds.</p>
      <p>Stemming, currently available only for English. The Krovetz
stemmer is used, because this stemmer always stems words
into other valid English words.</p>
      <p>Exclude numbers, when your data includes many numbers
such as product prices it can be desirable to exclude these
numbers from the wordcloud.</p>
      <p>Exclude retweets / repeated posts, when your data contains
a tweet that is retweeted very frequently, this one tweet will
dominate the wordcloud which can be undesirable.</p>
      <p>Include only usernames, for Twitter data only, keep only the
usernames, i.e. all the words starting with @.</p>
      <p>Include only hashtags, for Twitter data only, i.e. all the words
starting with #.</p>
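      <p>As an illustration of how these selections might act on the text, the following Python sketch filters tokens according to a few of the options listed above. The stopword list, the regular expression, the function names and the rule that a retweet starts with ‘RT’ are all assumptions made for this example; they are not taken from the application.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical, tiny stopword list; a real list would depend on the language.
STOPWORDS = {"the", "a", "is", "to", "at", "in", "of", "and"}

# Keep '@' and '#' as token prefixes; all other punctuation separates tokens.
TOKEN_RE = re.compile(r"[@#]?\w+")


def tokenize(text, extra_stopwords=(), exclude_numbers=False,
             only_usernames=False, only_hashtags=False):
    """Split a message into lowercase tokens and apply the selected filters."""
    stop = STOPWORDS | set(extra_stopwords)
    tokens = []
    for tok in TOKEN_RE.findall(text.lower()):
        if tok in stop:
            continue
        if exclude_numbers and tok.isdigit():
            continue
        if only_usernames and not tok.startswith("@"):
            continue
        if only_hashtags and not tok.startswith("#"):
            continue
        tokens.append(tok)
    return tokens


def drop_repeats(tweets):
    """Drop retweets (assumed to start with 'RT ') and exact repeated texts."""
    seen = set()
    kept = []
    for tweet in tweets:
        key = tweet.strip().lower()
        if key.startswith("rt ") or key in seen:
            continue
        seen.add(key)
        kept.append(tweet)
    return kept
```
]]></preformat>
      <p>In this sketch, additional stopwords are passed per call, so a term such as ‘torch’ could be hidden for one wordcloud without changing the default list.</p>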
      <p>The second screen shows the output, which consists of wordclouds
for the categories you have specified, as well as a wordcloud for all
the search results.</p>
      <p>
        Wordclouds for categories are generated using a parsimonious
language model. This model compares the frequency of words in a
set of documents to the average term probability in a background
collection containing similar documents to extract the most
noteworthy terms. In this case the background collection consists of all the
retrieved search results. Terms that are only mentioned occasionally
in the set of documents and terms which have a similar or higher
probability of occurrence in the background collection will not be
included in the parsimonious language model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        The parsimonious language model [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is an extension to the
standard language model based on maximum likelihood estimation, and
is created using an Expectation-Maximization algorithm.
Maximum likelihood estimation is used to make an initial estimate of
the probabilities of words occurring in the set of documents:
        <disp-formula>
          <tex-math>P_{mle}(t_i|S) = \frac{tf(t_i; S)}{\sum_{t} tf(t; S)}</tex-math>
        </disp-formula>
where S is the set of documents, and tf(t; S) is the text frequency,
i.e. the number of occurrences of term t in the set of documents
S. Subsequently, parsimonious probabilities are estimated using
Expectation-Maximization:
      </p>
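      <p>The maximum likelihood estimate can be sketched in a few lines of Python. This is a minimal illustration that assumes each document has already been split into a list of tokens; the function name is hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def mle(docs):
    """Return P_mle(t|S) = tf(t;S) / sum over t' of tf(t';S).

    `docs` is the set S, given as a list of token lists.
    """
    tf = Counter(tok for doc in docs for tok in doc)
    total = sum(tf.values())
    return {term: count / total for term, count in tf.items()}
```
]]></preformat>
      <p>For two documents with the tokens ‘torch’, ‘leeds’ and ‘torch’, the term ‘torch’ accounts for two of the three occurrences and ‘leeds’ for one.</p>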
    </sec>
    <sec id="sec-6">
      <title>E-step:</title>
      <p>
        <disp-formula>
          <tex-math>e_t = tf(t; S) \cdot \frac{(1 - \lambda) P(t|S)}{(1 - \lambda) P(t|S) + \lambda P(t|C)}</tex-math>
        </disp-formula>
      </p>
    </sec>
    <sec id="sec-7">
      <title>M-step:</title>
      <p>
        <disp-formula>
          <tex-math>P_{pars}(t|S) = \frac{e_t}{\sum_{t} e_t}</tex-math>
        </disp-formula>
i.e. normalize,
where C is the background collection model. In the initial E-step,
maximum likelihood estimates are used for P(t|S). We set the
smoothing parameter λ to 0.9. In the M-step the words that
receive a probability below a threshold of 0.001 are removed from
the model. The iteration process stops after a fixed number of
iterations.</p>
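      <p>The estimation procedure can be sketched in Python as follows. The sketch uses the values given in the text for the smoothing parameter (0.9) and the pruning threshold (0.001). The number of iterations (10), the renormalization after pruning and all names are assumptions made for this illustration, not details of the application.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def parsimonious(subset_docs, background_docs, lam=0.9,
                 threshold=0.001, iterations=10):
    """Estimate P_pars(t|S) for a subset S against a background collection C.

    Both arguments are lists of token lists. `iterations` is an assumed value.
    """
    tf = Counter(tok for doc in subset_docs for tok in doc)
    bg = Counter(tok for doc in background_docs for tok in doc)
    bg_total = sum(bg.values())
    p_c = {t: n / bg_total for t, n in bg.items()}

    # Initial estimate: maximum likelihood probabilities P(t|S).
    total = sum(tf.values())
    p = {t: n / total for t, n in tf.items()}

    for _ in range(iterations):
        # E-step: expected count of each term explained by the subset model.
        e = {}
        for t, p_t in p.items():
            own = (1 - lam) * p_t
            e[t] = tf[t] * own / (own + lam * p_c.get(t, 0.0))
        # M-step: normalize, then remove terms below the threshold.
        norm = sum(e.values())
        p = {t: v / norm for t, v in e.items()}
        p = {t: v for t, v in p.items() if v >= threshold}
        # Assumption: renormalize so the remaining terms sum to one.
        kept = sum(p.values())
        p = {t: v / kept for t, v in p.items()}
    return p


if __name__ == "__main__":
    torch_tweets = [["torch", "sheffield", "london2012"],
                    ["torch", "leeds", "london2012"]]
    all_tweets = torch_tweets + [["london2012", "olympics"]] * 20
    for term, prob in sorted(parsimonious(torch_tweets, all_tweets).items()):
        print(term, round(prob, 3))
```
]]></preformat>
      <p>In the small example at the end, ‘london2012’ occurs in every message of the background collection, so its probability shrinks in each iteration until it falls below the threshold, while ‘torch’ and the city names remain.</p>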
      <p>In the next section we present a case study that shows the output
generated by the application.</p>
      <sec id="sec-7-1">
        <title>Case</title>
        <p>Using an example search we will demonstrate how we use
wordclouds in our application to navigate and summarize the search
results. We executed a search on Twitter using the Twitter search
API<sup>1</sup> for the query ‘#london2012’ over the last 5 days, saved all
the 30,504 search results in a .csv file, and loaded this file into our
application. Looking at the wordcloud over all the results that is
shown in Figure 2, we see the term ‘torch’ is frequently used, and
we zoom in on this aspect of the ‘#london2012’ search. By
clicking on the word ‘torch’ a list of messages is shown that all contain
the term ‘torch’, so these messages can be inspected in more detail.
This list of messages is still quite long however, consisting of 1,046
tweets. We can zoom in further on these tweets by going back to
the input screen and specifying ‘torch’ as a category. Now, a
parsimonious wordcloud is created from the 1,046 tweets that contain
the term ‘torch’. The resulting wordcloud is shown in Figure 3.
The figure is a screenshot of the screen that is displayed when the
word ‘Sheffield’ is clicked, showing the tweets containing the word
‘Sheffield’.</p>
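      <p>The zoom steps in this case can be illustrated with a small Python sketch that reads a tab separated file and keeps only the messages containing a given term. The column names, the sample rows and the function names are hypothetical and only serve the example.</p>
      <preformat><![CDATA[
```python
import csv
import io
import re


def load_texts(tsv_text, text_column):
    """Read tab separated data and return the values of the text column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return [row[text_column] for row in reader]


def containing(texts, term):
    """Return the messages that contain `term` as a whole token."""
    term = term.lower()
    return [t for t in texts if term in re.findall(r"[@#]?\w+", t.lower())]


if __name__ == "__main__":
    # Hypothetical sample data with the columns 'user' and 'text'.
    data = ("user\ttext\n"
            "ann\tThe torch reaches Sheffield\n"
            "bob\tTickets for #london2012\n"
            "cat\tTorch relay in Leeds!\n")
    torch = containing(load_texts(data, "text"), "torch")
    print(len(torch))
    print(containing(torch, "sheffield"))
```
]]></preformat>
      <p>Applying the filter twice, first for ‘torch’ and then for ‘Sheffield’, mirrors the two zoom steps described above: each step can only narrow the set of messages.</p>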
        <p>Words which occur frequently in all of the ‘#london2012’
messages, such as ‘#london2012’, ‘2012’, and ‘olympics’, receive a
lower score from the parsimonious model, and almost none of these
words occur in the ‘torch’ wordcloud. General words that
occur frequently in all of the messages, such as ‘get’ and ‘will’, are
also filtered out. Instead the cloud contains words that occur more
frequently in the subset of messages that contain the word ‘torch’, for
example some of the cities that the torch passes through such as
Sheffield, Leeds and Manchester. Every result in this cloud by
definition contains the word ‘torch’, so it takes a prominent place
in the wordcloud. You can choose not to show the word ‘torch’ in
the wordcloud by specifying it as a stopword on the input screen.
Clicking on a term in the wordcloud has the same effect as query
expansion, i.e. adding that term to your query and retrieving another
set of results. When you use the Twitter API to search Twitter
without query operators, only results that contain all of
the search terms in the tweet, username or hyperlink are returned. This means
adding a term to your query will not lead to more search results.
Other results are returned only if you remove the original query
terms.</p>
        <p><sup>1</sup> https://dev.twitter.com/docs/api/1/get/search</p>
      </sec>
      <sec id="sec-7-2">
        <title>Observations</title>
        <p>We have not had the chance to evaluate our application by
means of a user study. However, we do want to point out the
following observations. Given the nature of our data, i.e. a collection
of tweets, there might be some improvements possible that exploit
this particular type of data. Tweets can contain special elements
in the text, namely usernames, hashtags, links, and emoticons. We
make the following observations:</p>
        <p>Usernames and hashtags are currently considered in the sense
that we remove all punctuation except the characters ‘@’ and
‘#’ which are the indicators of usernames and hashtags
respectively. There is an option to generate wordclouds containing
only usernames, or only hashtags. In the default settings
usernames and hashtags are included as is in the wordcloud. For
future work we want to discuss and investigate two open
issues:
1. Can a word with a hashtag be considered as the same
word without the hashtag? While a hashtag term does
not always have to be a real word, e.g. #london2012,
in many cases it is, e.g. #london. For the wordcloud
should the terms ‘london’ and ‘#london’ be merged?
Sometimes usernames are used in a similar way as
hashtags to address companies, e.g. in this tweet:
‘Ambush marketing at the Olympics! Well played, @Nike.
bit.ly/N4zAUc #London2012’.
2. A related issue is the importance or term weights of
usernames and hashtags. Is a hashtag a stronger signal,
and should it therefore be featured more prominently in
the wordcloud? Similarly for usernames, but usernames
could also be considered a weaker signal, so should they
be featured less prominently?
Both of these questions can also be considered when you want
to optimize a retrieval algorithm.</p>
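      <p>One possible answer to the first question can be sketched in Python: a hashtag is merged with the plain word only when that word is known. The vocabulary used to decide what counts as a known word, and the function name, are assumptions of this sketch; whether merging is desirable at all remains the open issue raised above.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def merge_hashtags(counts, vocabulary):
    """Fold '#word' into 'word' when the bare word is in `vocabulary`.

    `counts` maps terms to frequencies; `vocabulary` is a hypothetical set
    of known words, for example the words seen without a hashtag.
    """
    merged = Counter()
    for term, n in counts.items():
        bare = term[1:]
        if term.startswith("#") and bare in vocabulary:
            merged[bare] += n
        else:
            merged[term] += n
    return merged
```
]]></preformat>
      <p>With this rule ‘#london’ would be counted together with ‘london’, while ‘#london2012’ would stay a separate term as long as ‘london2012’ is not in the vocabulary.</p>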
        <p>Besides ‘@’ and ‘#’, all other punctuation is removed
during text preprocessing. This means all emoticons like ‘:)’
are removed. Sometimes these emoticons are used as
indicators of sentiment, i.e. tweets containing ‘:)’ are classified
as positive messages, and tweets containing ‘:(’ as negative
messages. In this sense the emoticons do represent
valuable information that could be included in the wordcloud.
If an emoticon appeared in the wordcloud, clicking on it
would show all the messages associated with, for example, a
positive emoticon.</p>
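      <p>A tokenizer that keeps emoticons could look like the following Python sketch. The emoticon pattern covers only a few common forms and is an assumption of this example, as are the names; it is not part of the current application, which removes this punctuation.</p>
      <preformat><![CDATA[
```python
import re

# Assumed pattern: eyes, an optional nose, and one of a few mouth characters.
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")
# Try emoticons first, then words with an optional '@' or '#' prefix.
TOKEN_RE = re.compile(EMOTICON_RE.pattern + r"|[@#]?\w+")


def tokenize_keep_emoticons(text):
    """Tokenize a message, lowercasing words but keeping emoticons as is."""
    tokens = []
    for tok in TOKEN_RE.findall(text):
        tokens.append(tok if EMOTICON_RE.fullmatch(tok) else tok.lower())
    return tokens
```
]]></preformat>
      <p>Emoticons kept in this way would be counted like any other term, so they could appear in the wordcloud and be clicked to list the matching messages.</p>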
        <p>Feedback from users is required to determine the most useful
improvements for the application.</p>
      </sec>
      <sec id="sec-7-3">
        <title>Conclusions</title>
        <p>In this paper we have shown how wordclouds can be used to
summarize and navigate search results, and in particular Twitter search
results. Wordclouds are a quick way to summarize and get a first
overview of large amounts of data. Using human observation skills
it is easy to zoom in on a group of messages in which you are
interested, i.e. all messages that contain a specific term from the
wordcloud. In future work we would like to evaluate the usefulness of
wordclouds for navigation and summarization of search results in a
user study.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gottron</surname>
          </string-name>
          .
          <article-title>Document Word Clouds: Visualising Web Documents as Tag Clouds to Aid Users in Relevance Decisions</article-title>
          . In M. Agosti,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Borbinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kapidakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Papatheodorou</surname>
          </string-name>
          , and G. Tsakonas, editors,
          <source>ECDL</source>
          , volume
          <volume>5714</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>94</fpage>
          -
          <lpage>105</lpage>
          . Springer,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Helic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Trattner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Strohmaier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Andrews</surname>
          </string-name>
          .
          <article-title>Are tag clouds useful for navigation? A network-theoretic analysis</article-title>
          .
          <source>IJSCCPS</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <fpage>33</fpage>
          -
          <lpage>55</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hiemstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          .
          <article-title>Parsimonious Language Models for Information Retrieval</article-title>
          .
          <source>In Proceedings SIGIR'04</source>
          , pages
          <fpage>178</fpage>
          -
          <lpage>185</lpage>
          . ACM Press, New York NY,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaptein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hiemstra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          .
          <article-title>How Different are Language Models and Word Clouds?</article-title>
          In
          <source>Advances in Information Retrieval: 32nd European Conference on IR Research (ECIR 2010)</source>
          , volume
          <volume>5993</volume>
          of LNCS
          , pages
          <fpage>556</fpage>
          -
          <lpage>568</lpage>
          . Springer,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B. Y.-L.</given-names>
            <surname>Kuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hentrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. M.</given-names>
            <surname>Good</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Wilkinson</surname>
          </string-name>
          .
          <article-title>Tag clouds for summarizing web search results</article-title>
          .
          <source>Proceedings of the 16th international conference on World Wide Web WWW</source>
          <volume>07</volume>
          ,
          <issue>196</issue>
          :
          <fpage>1203</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>