Serendipitous Browsing: Stumbling through Wikipedia

Claudia Hauff and Geert-Jan Houben
Web Information Systems
Delft University of Technology
Delft, the Netherlands
{c.hauff,g.j.p.m.houben}@tudelft.nl

Presented at Searching4Fun workshop at ECIR2012. Copyright © 2012 for the individual papers by the papers’ authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.

ABSTRACT
While in the early years of the Web, searching for information and keeping in touch used to be the two main reasons for ‘going online’, today we turn to the Web in many different situations, including when we look for entertainment to pass the time or relax. A popular tool to facilitate the users’ desire for entertainment is StumbleUpon, which allows users to “stumble” through the Web one (semi-random) page at a time. Interestingly to us, many StumbleUpon users appreciate being served Wikipedia articles, which are informative pieces of text that educate the reader about a particular concept. The leisure activity of stumbling can thus also incorporate a learning experience. Since life-long learning is an important characteristic of knowledge economies, it is crucial to understand the interplay between these two, at first sight, opposing forces. We hypothesize that a greater understanding of what makes certain Wikipedia articles more attractive than others to the serendipitously browsing user will enable us to develop adaptations that expose a greater number of Wikipedia articles to the leisure seeking user.

Categories and Subject Descriptors: H.3.3 Information Storage and Retrieval: Information Search and Retrieval
General Terms: Human Factors, Experimentation
Keywords: free-choice learning, educational leisure, serendipitous browsing

1. INTRODUCTION
In the early years of the Web, searching for information and keeping in touch used to be the two main reasons for ‘going online’. Today, we rely on the Web in increasingly diverse situations including shopping, consultations and learning. While these examples are all directed towards a particular goal the user has, we also turn to the Web at times when we simply want to be entertained to pass the time or relax. The possibilities for entertaining yourself on the Web are manifold: one can play games, listen to music, watch movies or simply browse through the Web in the hope of finding entertaining pages. Due to the sheer size of the Web though, random browsing is not effective for discovering pages that may be interesting to the individual user. For this reason, a number of services have become popular that recommend web pages to users based on their interests. One popular tool to facilitate the users’ desire for entertainment by serendipitous browsing is StumbleUpon¹ (SU), which allows users to “stumble” through the Web one (semi-random) page at a time. Interestingly to us, many SU users appreciate being shown Wikipedia² articles, which are informative pieces of text that educate the reader about a particular concept. The leisure activity of stumbling thus can also incorporate a learning experience, which might contribute to the development of novel ideas and lead to creative insights. Since life-long learning is an important characteristic of knowledge economies, it is crucial to understand the interplay between these two seemingly opposing forces (entertainment vs. learning). We hypothesize that a greater understanding of what makes certain Wikipedia articles more attractive than others to the serendipitously browsing user will enable us to develop adaptations that expose a greater number of Wikipedia articles to the leisure seeking user.

¹ http://www.stumbleupon.com/
² http://www.wikipedia.org/

In this position paper we make an argument for the importance of this task. We draw from a number of insights gained in museum studies [11], where the question of how learning can be facilitated in leisure settings (the museum visit) has been investigated for many years. While we do not consider the SU pages to be similar to museum objects, we do find a number of parallels.

A first experiment on the stumbled Wikipedia pages revealed that, just as in museums not all objects are equally attractive to visitors, not all articles are interesting to the average StumbleUpon user. In fact, only a very small number of Wikipedia articles gather a large number of views by SU users; most articles are rarely viewed. While we have no answer yet to the question of how to automatically classify articles according to their attractiveness to the serendipitously browsing user, we have developed a number of hypotheses which are outlined in Section 3.2.

If we assume for a moment that we are indeed able to develop such an approach, a number of application scenarios can be envisioned:
• A qualitative study of the features that play a role in piquing the interest of users who do not have an information need will enable Wikipedia contributors to write their articles in a way that is more accessible to such users.
• Wikipedia is available in many different languages, and such a prediction method would allow us to bootstrap a recommender like StumbleUpon in different languages by adding an initial set of interesting, high quality pages before the critical mass of users is reached.
• Outliers (articles with many ’Likes’ but a low probability of being attractive) can be manually investigated to reduce spam. Conversely, attractive articles that have not yet been discovered can be identified and injected into the index.
• The passages that trigger the surprise or the attractiveness of an article can be identified and highlighted to the browsing user. This may help to keep those serendipitously browsing users engaged who initially only scan the article quickly.
• E-learning applications can also benefit, as articles which are interesting to the casual reader can be found this way.

The rest of the paper is organized as follows: related work is presented in Section 2, followed by a preliminary analysis of stumbled Wikipedia pages (Section 3) and the conclusions (Section 4).

2. RELATED WORK
For this work, we draw inspiration from two areas. On the one hand, we consider research into so-called educational leisure settings and free-choice learning, which is a multi-disciplinary field that includes aspects from sociology, psychology and education. On the other hand, our work is also strongly related to serendipity.

Educational leisure settings can be found in a wide range of institutions including museums [12], national parks, zoos, science centers [5], etc. As the name suggests, these institutions serve two purposes: to educate the public as well as to provide an entertaining experience to the visitors. Educational leisure settings can be characterized by a number of commonalities with respect to the visitors and their learning experience [9, 10, 11]: (i) the visitors gain direct experience, (ii) they decide what and whether at all to learn, (iii) the learning process is guided by their interests, (iv) learning is influenced by the visitors’ social interactions and (v) the visitors are a highly diverse group, with different educational backgrounds and prior knowledge. Since learning in this setting is voluntary, the visitors’ motivation plays an important role: why did they come?

Serendipity, the act of encountering information nuggets unexpectedly, has mostly been investigated in the context of education [3] and work-related discoveries after serendipitous moments. One of the works outside of this realm is [6], where tools were developed to help people reminisce in their own digital collections. In goal-directed Web search the potential for serendipitous encounters has also been recently investigated [2], while [1] offers an insightful discussion of serendipity and how it is used, exploited and induced in computer science.

Finally we note that different aspects of Wikipedia articles have also been investigated in the past, though not from the perspective of serendipitously browsing users. For instance, in [7] it was found that the writing style distinguishes so-called featured articles in Wikipedia³ from unfeatured articles. Classifying Wikipedia articles according to their quality, as defined by Wikipedia contributors, was also investigated in [13], where network motifs and graph patterns in the editor-article graph were exploited.

³ Featured Wikipedia articles are of particularly high quality and chosen by Wikipedia editors.

3. STUMBLEUPON

Figure 1: A StumbleUpon user can contribute Web pages he likes to the index and he can “stumble” pages that are in the SU index according to his interests. One page at a time is shown; the user can provide feedback in terms of like and dislike.
[Diagram components: page submission, user rating, page index, user profiles, recommender engine, ’Stumble!’ button, meta-data page for each index entry.]

The usage of StumbleUpon is depicted in Figure 1. A user “stumbles” pages with a simple click of the ’Stumble!’ button in his browser toolbar. In response, the user is presented with a random page from the Web, biased according to his user profile or his friends’ ’Likes’. The simplicity of the system protects the user from information overload [4, 8]: a user has only two choices when faced with a stumbled page, either to start reading or to continue stumbling. Users can also contribute pages to the SU index: whenever an SU user discovers a web page that is not yet in the index and that he likes, he can add it by means of the ’Like’ button. Finally, for each page in the SU index, there is an SU page which contains meta-data, including the number of users who viewed/liked the page, the category in which the user who discovered the page placed it, and the comments users left about the page.

3.1 Wikipedia Articles in StumbleUpon
In all experiments we report here, we utilize the English Wikipedia dump enwiki-20111007 from October 2011. In a pre-processing step, we selected all Wikipedia articles that are neither redirects to other articles, nor new articles or explicit disambiguation pages, and that have a length of at least 500 characters (to remove stubs). In total, 3,552,059 articles remained.

In order to determine the popularity of Wikipedia articles in StumbleUpon, we randomly selected half of these Wikipedia articles and queried the StumbleUpon API for their number of views by SU users. Since SU is a recommendation engine, we can safely assume that the highly viewed pages are also highly popular and liked. We note that the number of ’Likes’ a page has received is not accessible through the StumbleUpon API. The information is accessible, though, on the SU meta-data page, which we manually checked for the results reported in Table 1.

Among the evaluated 1,776,029 articles, we found 267,958 (15.13%) of them to be contained in the SU index. In our initial investigation, we also considered the French and German Wikipedia, which are two of the largest non-English Wikipedia repositories. However, we only found a very limited number of their articles in the SU index (in both cases less than 1%) and thus did not consider them further. Thus, an application scenario as proposed in the introduction (to bootstrap a recommender for a new language) is highly desirable.

Let us now focus on those articles that were submitted by Stumblers to the index. Figure 2 shows a scatter plot of the number of views versus the number of Wikipedia articles in the index. As can be expected, most articles have very few views (the median number of views is 10) while a small number of articles have gathered more than half a million views.

Figure 2: Log-log scatter plot of the number of views versus the number of articles in the SU index.
[Axes: number of views (x) against number of pages (y), both logarithmic.]

To give an impression of the type of articles that have gathered few or many views, Table 1 contains the ten most viewed Wikipedia articles in our data set as well as ten random examples of articles that were viewed one hundred times. We chose these two settings as they represent two extremes: on the one hand, articles that were viewed and also liked by a large number of people, and on the other hand, articles that were shown a number of times but were less well received by the SU users.

Table 1: A list of Wikipedia articles that are contained in the SU index. For the most viewed articles, shown are also the number of views and likes (in millions), the StumbleUpon category the page was assigned to by the user who discovered it, and the date (month/year) at which the page was discovered.

Most viewed articles               #Views  #Likes  SU Category       Date
List of unusual deaths             3.99M   0.423M  Bizarre/Oddities  12/2004
Flying Spaghetti Monster           1.39M   0.121M  Satire            08/2005
Wrap rage                          0.86M   0.040M  Bizarre/Oddities  01/2008
Shigeru Miyamoto                   0.75M   0.019M  Video Games       10/2003
Benjaman Kyle                      0.74M   0.051M  Bizarre/Oddities  12/2008
One red paperclip                  0.72M   0.070M  Bizarre/Oddities  09/2006
List of colors                     0.70M   0.066M  Arts              01/2005
Do not stand at my grave and weep  0.64M   0.132M  Poetry            10/2007
Fuel cell                          0.56M   0.009M  Science           06/2005
Raymond Robinson (Green Man)       0.54M   0.036M  Bizarre/Oddities  05/2008

Example articles viewed 100 times  SU Category
Biblioscape                        Software
Edge of chaos                      Chaos/Complexity
Gottfried Wilhelm Leibniz Prize    Biology
Mario Buda                         Crime
Proto-Indo-European language       Linguistics
Cisco Adler                        Alternative Rock
Biofeedback                        Psychology
Ovipositor                         Sexual Health
Concealer                          Beauty
Winklepickers                      Fashion

It should also be noted that the SU category Bizarre & Oddities, which dominates the list of the ten most viewed articles, is not as prevalent when considering a larger set of articles. In fact, the top 100 viewed articles in our data set belong to 59 different SU categories: Bizarre & Oddities occurs 12 times, followed by the Writing category (5 times) and a number of categories with three occurrences, including Arts, Science and Linguistics. Only one of the top 100 articles was a so-called featured article (indicating that previous work on featured article prediction, e.g. [7], might not be applicable here), while seven were semi-protected articles due to previous vandalism activities. Notable is also the fact that 12 out of the 100 articles are of the form List of X, where X = {algorithms, legendary creatures, band name etymologies}, to name three examples.

While for a human reader it is usually not difficult to quickly judge whether an article is potentially interesting to him or not, it is a challenge to derive a method that automatically classifies articles accordingly. What exactly makes one article more interesting to the general public than another? In order to get a first understanding of what users think about the most viewed articles and possibly also why they like them, we analysed the comments that were posted on the SU info page for each of the ten most viewed Wikipedia articles. This analysis is very cursory because, compared to the number of views, very few users actually comment on an article, as commenting distracts from the ’stumbling’ experience. For example, the article Wrap rage, with 0.86 million views and forty thousand likes, has 41 comments. In total, we analysed 479 comments and identified four broad categories:

(A) Comments expressing surprise
• “There’s a name for this?”
• “I’d never heard of this before (go StumbleUpon!). Very cool.”

(B) Comments expressing admiration, sadness, sorrow, etc.
• “That’s so sad”
• “No one should go through life afraid to take a walk.”
• “don’t know what to say actually..”

(C) Comments about the usefulness of the knowledge
• “Simple, but helpful for designers.”
• “An exceptional list of colours and their code, invaluable to graphic designers, webmasters etc.”

(D) Comments expressing negative sentiments towards the article
• “Fake.”
• “Why stumble everyday wikipedia articles?”

3.2 Working Hypotheses
Based on the preliminary qualitative insights gained, we developed three intuitions that we believe will enable us to predict to what extent a Wikipedia article is likely to be beneficial to the average SU user.

Intuition A. Articles that contain unexpected nuggets of information can be identified by considering how semantically related the article is to the other articles it contains links to. For instance, the List of unusual deaths Wikipedia article has, among others, outgoing links to the following diverse articles: Common fig, Malvasia (wine), Eddystone Lighthouse, Hawaii, and Chimney. We hypothesize that finding such seemingly unrelated articles can be used as a measure of the likelihood of the article being of interest.

Intuition B. Articles that evoke emotional feelings can be discovered through a form of sentiment analysis. Although Wikipedia articles are written in a neutral style, some topics are bound to evoke emotions and those emotional topics can be identified.

Intuition C. Articles that contain useful knowledge may be identified indirectly by considering their Talk pages, the amount of ongoing discussion and the style of the discussions. Discussions of articles about practically useful information are not likely to be emotionally charged, unlike discussions about, for instance, politicians or religious topics.

We emphasize that these are hypotheses that need to be verified in future work.

4. CONCLUSIONS
In this position paper we have proposed to investigate what makes certain Wikipedia articles interesting to users who are browsing the Web without a goal in order to pass the time or relax. Since such articles are educational to some degree, the leisure activity of browsing (stumbling) can thus also incorporate a learning experience. Since life-long learning is an important characteristic of knowledge economies, it is crucial to understand the interplay between these two forces. We argue that a greater understanding of which features are indicative of an article’s attractiveness to the average user (stumbler) will enable us to develop adaptations that expose a greater number of Wikipedia articles to the leisure seeking user.

5. REFERENCES
[1] P. André, m. schraefel, J. Teevan, and S. T. Dumais. Discovery is never by chance: designing for (un)serendipity. In C&C ’09, pages 305–314, 2009.
[2] P. André, J. Teevan, and S. T. Dumais. From x-rays to silly putty via uranus: serendipity and its role in web search. In CHI ’09, pages 2033–2036, 2009.
[3] L. Björneborn. Design dimensions enabling divergent behaviour across physical, digital, and social library interfaces. In Persuasive Technology, volume 6137, pages 143–149. 2010.
[4] D. Bollen, B. P. Knijnenburg, M. C. Willemsen, and M. Graus. Understanding choice overload in recommender systems. In RecSys ’10, pages 63–70, 2010.
[5] J. H. Falk and M. Storksdieck. Science learning in a leisure setting. Journal of Research in Science Teaching, 47(2), 2010.
[6] J. Helmes, K. O’Hara, N. Vilar, and A. Taylor. Meerkat and tuba: Design alternatives for randomness, surprise and serendipity in reminiscing. In Human-Computer Interaction - INTERACT 2011, volume 6947, pages 376–391. 2011.
[7] N. Lipka and B. Stein. Identifying featured articles in wikipedia: writing style matters. In WWW ’10, pages 1147–1148, 2010.
[8] A. Oulasvirta, J. P. Hukkinen, and B. Schwartz. When more is less: the paradox of choice in search engine use. In SIGIR ’09, pages 516–523, 2009.
[9] J. Packer. Learning for fun: The unique contribution of educational leisure experiences. Curator: The Museum Journal, 49(3):329–344, 2006.
[10] J. Packer. Beyond learning: Exploring visitors’ perceptions of the value and benefits of museum experiences. Curator: The Museum Journal, 51(1):33–54, 2008.
[11] J. Packer and R. Ballantyne. Motivational factors and the visitor experience: A comparison of three sites. Curator: The Museum Journal, 45(3):183–198, 2002.
[12] J. M. Packer. Motivational factors and the experience of learning in educational leisure settings. PhD thesis, Queensland University of Technology, 2004.
[13] G. Wu, M. Harrigan, and P. Cunningham. Characterizing wikipedia pages using edit network motif profiles. In SMUC ’11, pages 45–52, 2011.
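
Intuition A (Section 3.2) can be illustrated with a minimal sketch. It is not a method evaluated in this paper: it assumes that every article is represented by the set of titles it links to, and it uses the Jaccard overlap of two such link sets as a crude stand-in for semantic relatedness. The function names `jaccard` and `unrelatedness_score`, as well as the toy link graph, are hypothetical and serve only as illustration.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two sets: 0.0 for disjoint sets, 1.0 for identical ones."""
    if not a and not b:
        return 0.0  # convention chosen here: two empty sets share nothing
    return len(a & b) / len(a | b)


def unrelatedness_score(article: str, links: dict[str, set[str]]) -> float:
    """Mean dissimilarity (1 - Jaccard) between an article and its link targets.

    `links` maps an article title to the set of titles it links to.
    Targets whose own link set is unknown are skipped. A score close to 1.0
    means the article mostly points to articles with which it shares few
    links, which is the assumed proxy for "seemingly unrelated".
    """
    own_links = links.get(article, set())
    known_targets = [t for t in own_links if t in links]
    if not known_targets:
        return 0.0
    total = sum(1.0 - jaccard(own_links, links[t]) for t in known_targets)
    return total / len(known_targets)


# Toy link graph with invented titles (assumption, not data from the paper).
toy_links = {
    "Hub": {"T1", "T2"},
    "T1": {"T2", "Z"},  # shares one link ("T2") with "Hub"
    "T2": {"Q"},        # shares no link with "Hub"
}
score = unrelatedness_score("Hub", toy_links)
```

In this toy graph, "Hub" shares one of three distinct links with "T1" and none with "T2", so the score is the mean of 2/3 and 1. An actual experiment would have to extract the link sets from the Wikipedia dump and would probably need a stronger relatedness measure than link overlap.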