Using Wordclouds to Navigate and Summarize Twitter Search Results

Rianne Kaptein
Oxyme
Amsterdam, The Netherlands
rianne@oxyme.com


Abstract

This paper describes an application in which wordclouds are used to navigate and summarize Twitter search results. A search on Twitter can return thousands of relevant tweets. Looking only at the first few result pages does not give an overview of what is discussed across all search results. Our application summarizes sets of tweets into wordclouds, which can be used to get a first idea of the contents of the tweets. The application also provides the option to zoom in on a certain part of the search results to inspect them in more detail. The application has not been formally evaluated, but we do provide some insights and points for discussion.

1     Introduction

One of the most common problems in Information Retrieval is information overload: there is simply too much relevant information available for users to process. Applications are therefore needed to help users deal with large amounts of data. In this paper we describe an application which was developed for this purpose. The use of wordclouds in the application serves two purposes:
    1. To summarize
    2. To aid navigation

This application was developed with the following two user scenarios in mind:
    1. General Twitter search
       Nowadays many people express their opinions about products, services and companies on Twitter. When you want to get a broad overview of what people are tweeting in general about a company or event, it does not suffice to read the first few pages of search results. You want to get a feeling for the most frequently discussed topics overall, and dive into particular subtopics of special interest, such as product recommendations.
    2. Searching fragments of categorized data
       Besides Twitter there are many more places on the Web where people express their opinions. These opinions can be collected and annotated with labels such as sentiment, source, market, etcetera. When you have a large amount of annotated data available, it is interesting to see, for example, which topics are discussed in positive and in negative messages.

In this paper we will focus on the first user scenario, general Twitter search, since Twitter data is abundant and publicly available.

Humans have a great capacity to notice terms which are out of the ordinary. When looking at a wordcloud there will always be some unexpected terms which catch your attention and are good pointers for further investigation. In tweets about public transport you can expect, for example, tweets about delays, but you might not expect certain tweets about recent events such as a new colour of the trains. What we try to do in the wordclouds is to emphasize the words that are noteworthy from a statistical point of view, and leave it up to the user to decide which messages to explore further.

Although the usefulness of tagclouds for navigation is still a topic of debate [2], exploratory applications which make use of wordclouds for summarization and navigation of search results have been moderately successful on specific domains such as web documents [1] and PubMed publications in biomedical literature [5].

The search results that we are investigating in this paper have three characteristics:
    • A search result is a short textual message. By design a Twitter message cannot contain more than 140 characters.
    • The number of search results is large. If this were not the case, you could simply read through all of them, since the results are short texts.
    • There are many, equally relevant search results. In web search there are usually not more than a handful of highly relevant search results. Many of the search results contain copied or redundant information, or only mention the search words occasionally. Although Twitter search results also contain redundant information, i.e. repeated tweets and retweets, the set of relevant tweets can still consist of thousands of equally highly relevant tweets.

In the next sections of this paper we will present our approach (Section 2), a case study (Section 3), and finally our conclusions (Section 4).

Presented at EuroHCIR2012. Copyright © 2012 for the individual papers by the papers’ authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.
2     Approach

The application consists of two screens. The first screen handles the input, the second screen displays the results based on your input.

On the first screen the system offers a number of selections that can be made to make sure you generate the wordclouds that best represent your data and your analysis purpose. The input is collected using text fields, radio buttons and checkboxes. The first part of the input screen is shown in Figure 1.

Figure 1. First part of the input screen

The following selections can be made:

    • File selection: a tab-separated text file is required as input.
    • Text selection: which column in the dataset to use as textual input for the wordcloud generation.
    • Category selection: your data can be categorized based on a value in any column of your dataset. It is also possible to create categories based on the presence of words in the contents of your data, e.g. to create a category for all tweets containing the term ‘happy’.
    • Language: used for the removal of standard stopwords.
    • Optionally, additional stopwords can be specified. These words will not occur in any of the wordclouds.
    • Stemming: currently available only for English. The Krovetz stemmer is used, because this stemmer always stems words into other valid English words.
    • Exclude numbers: when your data includes many numbers, such as product prices, it can be desirable to exclude these numbers from the wordcloud.
    • Exclude retweets / repeated posts: when your data contains a tweet that is retweeted very frequently, this one tweet will dominate the wordcloud, which can be undesirable.
    • Include only usernames: for Twitter data only, keep only the usernames, i.e. all the words starting with @.
    • Include only hashtags: for Twitter data only, keep only the hashtags, i.e. all the words starting with #.

The second screen shows the output, which consists of wordclouds for the categories you have specified, as well as a wordcloud for all the search results.

Wordclouds for categories are generated using a parsimonious language model. This model compares the frequency of words in a set of documents to the average term probability in a background collection containing similar documents to extract the most noteworthy terms. In this case the background collection consists of all the retrieved search results. Terms that are only mentioned occasionally in the set of documents and terms which have a similar or higher probability of occurrence in the background collection will not be included in the parsimonious language model [4].

The parsimonious language model [3] is an extension to the standard language model based on maximum likelihood estimation, and is created using an Expectation-Maximization algorithm. Maximum likelihood estimation is used to make an initial estimate of the probabilities of words occurring in the set of documents.

\[ P_{mle}(t_i|S) = \frac{tf(t_i, S)}{\sum_{t} tf(t, S)} \qquad (1) \]

where S is the set of documents, and tf(t, S) is the term frequency, i.e. the number of occurrences of term t in the set of documents S. Subsequently, parsimonious probabilities are estimated using Expectation-Maximization:

\[ \text{E-step:} \quad e_t = tf(t, S) \cdot \frac{(1 - \lambda) P(t|S)}{(1 - \lambda) P(t|S) + \lambda P(t|C)} \]
\[ \text{M-step:} \quad P_{pars}(t|S) = \frac{e_t}{\sum_{t} e_t}, \text{ i.e. normalize} \qquad (2) \]

where C is the background collection model. In the initial E-step, maximum likelihood estimates are used for P(t|S). We set the smoothing parameter λ to 0.9. In the M-step the words that receive a probability below a threshold of 0.001 are removed from the model. The iteration process stops after a fixed number of iterations.

In the next section we present a case that shows the output generated by the application.

3    Case

Using an example search we will demonstrate how we use wordclouds in our application to navigate and summarize the search results. We executed a search on Twitter using the Twitter search API¹ for the query ‘#london2012’ over the last 5 days, saved all the 30,504 search results in a .csv file and loaded this file into our application. Looking at the wordcloud over all the results that is shown in Figure 2, we see the term ‘torch’ is frequently used, and we zoom in on this aspect of the ‘#london2012’ search. By clicking on the word ‘torch’ a list of messages is shown that all contain the term ‘torch’, so these messages can be inspected in more detail. This list of messages is still quite long however, consisting of 1,046 tweets. We can zoom in further on these tweets by going back to the input screen and specifying ‘torch’ as a category. Now, a parsimonious wordcloud is created from the 1,046 tweets that contain the term ‘torch’. The resulting wordcloud is shown in Figure 3. The figure is a screenshot of the screen that is displayed when the word ‘Sheffield’ is clicked, showing the tweets containing the word ‘Sheffield’.

¹ https://dev.twitter.com/docs/api/1/get/search

Words which occur frequently in all of the ‘#london2012’ messages, such as ‘#london2012’, ‘2012’, and ‘olympics’, receive a lower score from the parsimonious model, and almost none of these
words occur in the ‘torch’ wordcloud. General words that occur frequently in all of the messages, such as ‘get’ and ‘will’, are also filtered out. Instead the cloud contains words that occur more frequently in the subset of messages that contain the word ‘torch’, for example some of the cities that the torch passes through, such as Sheffield, Leeds and Manchester. Every result in this cloud by definition contains the word ‘torch’, therefore it takes a prominent place in the wordcloud. You can choose not to show the word ‘torch’ in the wordcloud by specifying it as a stopword on the input screen.

Clicking on a term in the wordcloud has the same effect as query expansion, i.e. adding that term to your query and retrieving another set of results. When you use the Twitter API to search Twitter without query operators, only results that contain all of the search terms in the tweet, username or hyperlink will be returned. This means adding a term to your query will not lead to more search results. Other results will only be returned if you remove the original query terms.

Figure 2. Wordcloud of all #london2012 Twitter search results, showing the tweets containing the term ‘torch’

Figure 3. Wordcloud of #london2012 Twitter search results containing the term ‘torch’, showing the tweets containing the term ‘sheffield’

Observations

We have not had the chance to evaluate our application by means of a user study. However, we do want to point out the following observations. Given the nature of our data, i.e. a collection of tweets, there might be some improvements possible that exploit this particular type of data. Tweets can contain special elements in the text, namely usernames, hashtags, links, and emoticons. We make the following observations:

    • Usernames and hashtags are currently considered in the sense that we remove all punctuation except the characters ‘@’ and ‘#’, which are the indicators of usernames and hashtags respectively. There is an option to generate wordclouds containing only usernames, or only hashtags. In the default settings usernames and hashtags are included as is in the wordcloud. For future work we want to discuss and investigate two open issues:
         1. Can a word with a hashtag be considered as the same word without the hashtag? While a hashtag term does not always have to be a real word, e.g. #london2012, in many cases it is, e.g. #london. For the wordcloud
            should the terms ‘london’ and ‘#london’ be merged? Sometimes usernames are used in a similar way to hashtags to address companies, e.g. in this tweet: ‘Ambush marketing at the Olympics! Well played, @Nike. bit.ly/N4zAUc #London2012’.
         2. A related issue is the importance or term weights of usernames and hashtags. Is a hashtag a stronger signal, and should it therefore be featured more prominently in the wordcloud? Similarly for usernames, but usernames could also be considered a weaker signal, so should they be featured less prominently?
      Both of these questions can also be considered when you want to optimize a retrieval algorithm.

    • Besides ‘@’ and ‘#’, all other punctuation is removed during text preprocessing. This means all emoticons like ‘:)’ are removed. Sometimes these emoticons are used as indicators of sentiment, i.e. tweets containing ‘:)’ are classified as positive messages, and tweets containing ‘:(’ as negative messages. In this sense the emoticons do indeed represent valuable information that could be included in the wordcloud. When an emoticon appears in the wordcloud, clicking on it can give you all the messages associated with, for example, a positive emoticon.

Feedback from users is required to determine the most useful improvements for the application.

4     Conclusions

In this paper we have shown how wordclouds can be used to summarize and navigate search results, and in particular Twitter search results. Wordclouds are a quick way to summarize and get a first overview of large amounts of data. Using human observation skills it is easy to zoom in on a group of messages in which you are interested, i.e. all messages that contain a specific term from the wordcloud. In future work we would like to evaluate the usefulness of wordclouds for navigation and summarization of search results in a user study.

5    References

[1] T. Gottron. Document Word Clouds: Visualising Web Documents as Tag Clouds to Aid Users in Relevance Decisions. In M. Agosti, J. L. Borbinha, S. Kapidakis, C. Papatheodorou, and G. Tsakonas, editors, ECDL, volume 5714 of Lecture Notes in Computer Science, pages 94–105. Springer, 2009.
[2] D. Helic, C. Trattner, M. Strohmaier, and K. Andrews. Are tag clouds useful for navigation? A network-theoretic analysis. IJSCCPS, 1(1):33–55, 2011.
[3] D. Hiemstra, S. Robertson, and H. Zaragoza. Parsimonious Language Models for Information Retrieval. In Proceedings SIGIR’04, pages 178–185. ACM Press, New York NY, 2004.
[4] R. Kaptein, D. Hiemstra, and J. Kamps. How Different are Language Models and Word Clouds? In Advances in Information Retrieval: 32nd European Conference on IR Research (ECIR 2010), volume 5993 of LNCS, pages 556–568. Springer, 2010.
[5] B. Y.-L. Kuo, T. Hentrich, B. M. Good, and M. D. Wilkinson. Tag clouds for summarizing web search results. Proceedings of the 16th international conference on World Wide Web WWW 07, 196:1203, 2007.
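
As an illustration of the parsimonious language model of Section 2 (Equations 1 and 2), the following is a minimal Python sketch, not the application's actual code. The function names `mle` and `parsimonious_model`, the toy term counts, and the default of 10 iterations are assumptions made for this example; the paper only states that a fixed number of iterations is used. The values λ = 0.9 and the pruning threshold of 0.001 are taken from the text.

```python
from collections import Counter


def mle(counts):
    """Maximum likelihood estimate of Equation 1: tf(t, S) / sum_t tf(t, S)."""
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def parsimonious_model(subset_counts, background_probs, lam=0.9,
                       threshold=0.001, iterations=10):
    """Estimate P_pars(t|S) with the E-step and M-step of Equation 2.

    subset_counts: term frequencies tf(t, S) of the category S.
    background_probs: P(t|C) for the background collection C.
    iterations: assumed value, the paper does not give the number used.
    """
    probs = mle(subset_counts)  # initial estimate of P(t|S)
    for _ in range(iterations):
        # E-step: weight each term frequency by how much better the
        # category model explains the term than the background model.
        expected = {}
        for term, p in probs.items():
            numerator = (1 - lam) * p
            denominator = numerator + lam * background_probs.get(term, 0.0)
            expected[term] = subset_counts[term] * numerator / denominator
        # M-step: normalize, then drop terms below the threshold. The
        # remaining terms are renormalized by the next iteration.
        total = sum(expected.values())
        probs = {term: e / total for term, e in expected.items()
                 if e / total >= threshold}
    return probs


# Toy counts (hypothetical): all results versus the 'torch' category.
background = Counter({"olympics": 50, "torch": 10, "sheffield": 5, "get": 35})
subset = Counter({"torch": 10, "sheffield": 5, "olympics": 5, "get": 5})

model = parsimonious_model(subset, mle(background))
for term, p in sorted(model.items(), key=lambda item: -item[1]):
    print(f"{term}\t{p:.3f}")
```

In this toy example the terms ‘olympics’ and ‘get’, which are relatively more frequent in the background collection than in the category, fall below the threshold after a few iterations, so only ‘torch’ and ‘sheffield’ remain in the model. This mirrors the behaviour described in Section 3.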
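
The preprocessing options of Section 2 and the punctuation handling discussed under Observations can be sketched in the same spirit. This is a hedged illustration only: the regular expression, the helper names `tokenize`, `drop_repeats` and `select_category`, and the way retweets are recognised (a leading ‘RT @user:’ marker) are assumptions for this example, and the paper does not specify how the application implements these steps.

```python
import re

# Assumed token pattern: runs of word characters, optionally prefixed by
# '@' (username) or '#' (hashtag). All other punctuation, including
# emoticons such as ':)', is discarded.
TOKEN = re.compile(r"[@#]?\w+")


def tokenize(text, stopwords=frozenset(), exclude_numbers=False):
    """Lowercase a message and split it into terms for the wordcloud."""
    tokens = TOKEN.findall(text.lower())
    if exclude_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    return [t for t in tokens if t not in stopwords]


def drop_repeats(tweets):
    """Keep the first occurrence of each message, ignoring a leading
    'RT @user:' marker (an assumed convention for retweets)."""
    seen, unique = set(), []
    for tweet in tweets:
        key = re.sub(r"^rt\s+@\w+:?\s*", "", tweet.lower()).strip()
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique


def select_category(tweets, term):
    """Create a category from all messages that contain the given term."""
    return [tweet for tweet in tweets if term in tokenize(tweet)]


tweets = [
    "The torch is in Sheffield today :)",
    "RT @bob: The torch is in Sheffield today :)",
    "Well played, @Nike. #London2012",
]
for tweet in drop_repeats(tweets):
    print(tokenize(tweet, stopwords={"the", "is", "in"}))
```

Keeping ‘@’ and ‘#’ attached to the term means that ‘london’ and ‘#london’ stay separate terms, which is exactly the open issue raised under Observations. Merging them would only require stripping the prefix in `tokenize`.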