xTAS and ThemeStreams
Extendable Text Analysis Service and its Usage in a Topic Monitoring Tool

Ork de Rooij
University of Amsterdam
Science Park 904, 1098 XH Amsterdam, The Netherlands
orooij@uva.nl

Tom Kenter
University of Amsterdam
Science Park 904, 1098 XH Amsterdam, The Netherlands
tom.kenter@uva.nl

Maarten de Rijke
University of Amsterdam
Science Park 904, 1098 XH Amsterdam, The Netherlands
derijke@uva.nl

ABSTRACT
xTAS is an extendable multi-user text analysis service for large-scale multilingual document analysis developed at the University of Amsterdam. It can process large numbers of documents in a timely manner through a web interface that can be used by multiple users at once. In this demonstration paper we present recent additions, which include semanticization, on-the-fly TF-IDF model generation and on-the-fly co-occurrence metrics. Furthermore, we demonstrate ThemeStreams, a novel topic monitoring tool built on top of xTAS.

Categories and Subject Descriptors
I.2.7 [Natural Language Processing]: Text Analysis

General Terms
Algorithms, Performance, Experimentation

Keywords
text analysis, web service, distributed processing, microblog visualization

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DIR 2013, April 26, 2013, Delft, The Netherlands.

1. INTRODUCTION
xTAS¹ is an integrated set of text analysis services for processing documents in a timely manner. It is available through a web API that can be used by multiple users at once. xTAS includes tools for stemming, tokenization, named entity recognition, part-of-speech tagging, sentiment analysis and various types of aggregation on top of these. The purpose of xTAS is to run text processing tasks as fast as possible, without users having to concern themselves with databases, storage or result caching.

The software can run multiple tasks in parallel, possibly on different machines (nodes). xTAS is built solely with open source software. It uses Celery [2] to distribute tasks between nodes. By default MongoDB [4] is used to store documents and results, though other options are available as well.

The software is extendable. Additional functionality can easily be added through a plugin architecture.

In what follows we describe recent additions to xTAS and we present ThemeStreams, a novel topic monitoring tool built on top of xTAS.

¹ See http://xtas.net

2. XTAS
Recent additions and improvements to xTAS include:

• Semanticization²
  xTAS can semantically enrich texts by linking entities mentioned in them to their Wikipedia articles.

• On-the-fly TF-IDF model generation and application
  TF-IDF models based on a user-selected series of documents can be trained on the fly. The models can be used to provide TF-IDF statistics for words in new documents.

• Co-occurrence metric calculation
  A variety of co-occurrence metric calculation methods were added to xTAS, including maximum likelihood estimate, pointwise mutual information, log likelihood ratio and χ². This enables users to calculate the co-occurrence of entities in a set of documents.

• Automatic language identification
  If the language of a document is not supplied, xTAS can automatically determine it. Currently this is implemented by using TextCat [6].

• Support for multiple document stores
  Besides MongoDB [4], xTAS can communicate directly with Apache Solr [1] or ElasticSearch [3]. These stores can be used as a document repository as well as a result cache.

• Response time improvements
  Analysis of xTAS usage over time shows that named entity recognition is a frequently requested and time-consuming analysis. In order to keep response times at near-real-time speeds, xTAS keeps several NER models (for all supported languages) in memory on each xTAS node.

² Semanticization, the process of linking mentions of concepts in a text to the articles in an external knowledge base they denote, is also referred to as entity linking or Wikification.

3. THEMESTREAMS
ThemeStreams³ is a visual interface that helps answer the question “Who is talking about what?”. It does so for topics in the Dutch political landscape by showing the ebb and flow of conversations about particular themes through time. While there are many topic monitoring tools available, the novelty of ThemeStreams lies in its ability to present the user with a quick overview of the relative frequency of posts a particular group of users issued on a certain subject.

ThemeStreams is based on tweets posted on Twitter by four groups of people:

• politicians (ministers, members of parliament, but also the local ranks of politicians in municipalities and provinces)
• political journalists (newspaper journalists as well as talk show hosts of political television shows)
• lobbyists (people pushing the people who are active in politics)
• other influencers (these include (satirical) columnists, politically engaged celebrities and stand-up comedians)

The harvesting of these tweets started late 2011. At the time of writing, we follow about 1400 individual users, who, together with all people participating in conversations with these inner circle users, yield a set of just over 3.9M tweets.

Figure 1: ThemeStreams - A visual interface that answers the question “Who is talking about what?”. Tweets are shown in a stream graph, categorized by their authors and weighted by their conversational influence. Parts of the stream can be selected and detailed word clouds per group pop up to show what was being said by whom during that period in time.

The interactive visual interface is aimed at giving insight into the ownership and dynamics of themes being discussed. It enables users to answer questions such as “Who put this issue on the map?”, “Who picked up on this topic?” and “Is this topic gaining momentum?”. ThemeStreams allows users to explore streams of tweets either from a fixed set of predefined themes or through a search box. It uses stream graphs [5] to indicate how the four influence groups discuss a specified theme, thereby depicting the volume, the “aliveness” and the ownership of a topic.

The interface indicates the time a tweet was posted, the influence group the poster belongs to and the number of people who reacted to a statement (which can be used to estimate the “size” and “lifetime” of a statement). Initially a combined word cloud is shown with words colorized by the group they originate from. Users can zoom in to parts of the stream for more detail. Doing so results in individual word clouds being displayed per influence group during the selected period.

Initial usability studies were carried out with university staff members and media analysts working for a communication agent. We found that ThemeStreams was intuitive to understand and that it was easy to inspect parts of a tweet stream in detail. The combined clouds proved to be insightful for a fast overview of data. The individual clouds proved to be useful for inspecting relative word usage between groups. We also found a need for depicting the most represented speakers within a group.

³ See an online demo at http://themestreams.xtas.net/

4. FUTURE WORK
xTAS is actively being used in a number of research and production environments. As such, work on xTAS is ongoing and features are being deployed in close collaboration with end users. Currently, we focus on adding support for temporal tagging and for easier deployment on large clusters.

A more detailed user study of ThemeStreams is currently in progress. We are also looking into additional application scenarios for ThemeStreams, like discourse analysis over time in other domains such as newspaper archives.

5. ACKNOWLEDGEMENTS
This research was partially supported by the European Union’s ICT Policy Support Programme as part of the Competitiveness and Innovation Framework Programme, CIP ICT-PSP under grant agreement nr 250430, the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreements nr 258191 (PROMISE Network of Excellence) and 288024 (LiMoSINe project), the Netherlands Organisation for Scientific Research (NWO) under project nrs 612.061.814, 612.061.815, 640.004.802, 727.011.005, 612.001.116, HOR-11-10, the Center for Creation, Content and Technology (CCCT), the BILAND project funded by the CLARIN-nl program, the Dutch national program COMMIT, the ESF Research Network Program ELIAS, the Elite Network Shifts project funded by the Royal Dutch Academy of Sciences (KNAW), and the Netherlands eScience Center under project number 027.012.105.

6. REFERENCES
[1] Apache Solr. http://lucene.apache.org/solr/.
[2] Celery: Distributed Task Queue. http://celeryproject.org/.
[3] elasticsearch. http://www.elasticsearch.org/.
[4] MongoDB. http://www.mongodb.org/.
[5] L. Byron and M. Wattenberg. Stacked graphs–geometry & aesthetics. Visualization and Computer Graphics, IEEE Transactions on, 14(6):1245–1252, 2008.
[6] W. B. Cavnar, J. M. Trenkle, et al. N-gram-based text categorization. Ann Arbor MI, 48113(2):161–175, 1994.
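
As an illustration of the on-the-fly TF-IDF models listed among the recent additions in Section 2, the following minimal Python sketch trains document frequencies on a user-selected set of documents and then scores the words of a new document. It is not the xTAS implementation: the function names `train_tfidf_model` and `tfidf_scores`, the whitespace tokenizer and the smoothed IDF formula are all assumptions made for illustration, since the paper does not specify which TF-IDF variant is used.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def train_tfidf_model(documents):
    """Collect document frequencies from a user-selected list of texts."""
    doc_freq = Counter()
    for text in documents:
        # Count each word once per document.
        doc_freq.update(set(tokenize(text)))
    return {"n_docs": len(documents), "doc_freq": doc_freq}


def tfidf_scores(model, text):
    """Score every word of a new document against a trained model."""
    tokens = tokenize(text)
    term_freq = Counter(tokens)
    n_docs = model["n_docs"]
    scores = {}
    for word, count in term_freq.items():
        df = model["doc_freq"].get(word, 0)
        # Assumption: smoothed IDF, so words unseen in training stay finite.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[word] = (count / len(tokens)) * idf
    return scores


# Hypothetical training set and new document.
training_docs = [
    "the cabinet debates the budget",
    "the budget is approved",
    "the minister resigns",
]
model = train_tfidf_model(training_docs)
scores = tfidf_scores(model, "the minister defends the budget")
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{score:.3f}")
```

Separating training from scoring mirrors the description in Section 2: the model is built once from the selected documents and can then be applied to any number of new documents.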
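
The co-occurrence metrics named in Section 2 (maximum likelihood estimate, pointwise mutual information, log likelihood ratio and χ²) can all be derived from a 2x2 contingency table of document counts. The sketch below shows one way to compute them in Python. It rests on several assumptions that the paper does not confirm: documents are represented as sets of entity labels, the maximum likelihood estimate is taken to be the conditional probability P(b | a), and the log likelihood ratio is computed as the G² statistic. The function names and the toy data are hypothetical.

```python
import math


def contingency(docs, a, b):
    """Return (both, only_a, only_b, neither) document counts for entities a and b."""
    n = len(docs)
    n_a = sum(1 for d in docs if a in d)
    n_b = sum(1 for d in docs if b in d)
    n_ab = sum(1 for d in docs if a in d and b in d)
    return n_ab, n_a - n_ab, n_b - n_ab, n - n_a - n_b + n_ab


def cooccurrence_metrics(docs, a, b):
    o11, o12, o21, o22 = contingency(docs, a, b)
    n = o11 + o12 + o21 + o22
    row1, row2 = o11 + o12, o21 + o22  # documents with a / without a
    col1, col2 = o11 + o21, o12 + o22  # documents with b / without b

    # Assumption: MLE of the conditional probability P(b | a).
    mle = o11 / row1 if row1 else 0.0

    # Pointwise mutual information in bits; -inf if a and b never co-occur.
    pmi = math.log2(o11 * n / (row1 * col1)) if o11 else float("-inf")

    # Pearson chi-squared statistic for a 2x2 table.
    denom = row1 * row2 * col1 * col2
    chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0

    # Log likelihood ratio (G^2): 2 * sum(observed * ln(observed / expected)).
    llr = 0.0
    for obs, row, col in ((o11, row1, col1), (o12, row1, col2),
                          (o21, row2, col1), (o22, row2, col2)):
        if obs:  # empty cells contribute nothing
            llr += obs * math.log(obs * n / (row * col))
    llr *= 2.0

    return {"mle": mle, "pmi": pmi, "chi2": chi2, "llr": llr}


# Hypothetical toy collection: each document is the set of entities it mentions.
docs = [{"A", "B"}, {"A", "B", "C"}, {"A"}, {"C"},
        {"B"}, {"C", "D"}, {"D"}, {"A", "B"}]
print(cooccurrence_metrics(docs, "A", "B"))
```

Computing the contingency table once and deriving every metric from it keeps the document scan separate from the statistics, so further metrics can be added without rescanning the collection.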