<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extendable Text Analysis Service and its Usage in a Topic Monitoring Tool</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ork de Rooij</string-name>
          <email>orooij@uva.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tom Kenter</string-name>
          <email>tom.kenter@uva.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maarten de Rijke</string-name>
          <email>derijke@uva.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Amsterdam</institution>
          ,
          <addr-line>Science Park 904, 1098 XH Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>xTAS is an extendable multi-user text analysis service for large-scale multilingual document analysis, developed at the University of Amsterdam. It can process large numbers of documents in a timely manner through a web interface that can be used by multiple users at once. In this demonstration paper we present recent additions, which include semanticization, on-the-fly TF-IDF model generation and on-the-fly co-occurrence metrics. Furthermore, we demonstrate ThemeStreams, a novel topic monitoring tool built on top of xTAS.</p>
      </abstract>
      <kwd-group>
        <kwd>text analysis</kwd>
        <kwd>web service</kwd>
        <kwd>distributed processing</kwd>
        <kwd>microblog visualization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>xTAS<sup>1</sup> is an integrated set of text analysis services for
processing documents in a timely manner. It is available
through a web API that can be used by multiple users
at once. xTAS includes tools for stemming, tokenization,
named entity recognition, part-of-speech tagging, sentiment
analysis and various types of aggregation on top of these. The
purpose of xTAS is to run text processing tasks as fast as
possible, without requiring users to deal with databases, storage
or result caching.</p>
    </sec>
    <sec id="sec-2">
      <title><sup>1</sup> See http://xtas.net</title>
      <p>DIR 2013, April 26, 2013, Delft, The Netherlands.</p>
      <p>
        The software can run multiple tasks in parallel, possibly
on different machines (nodes). xTAS is built solely with
open source software. It uses Celery [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to distribute tasks between nodes. By default, MongoDB [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is used to store
documents and results, though other options are available as
well.
      </p>
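      <p>As an illustration of how independent analysis tasks might be fanned out to workers, the following minimal sketch uses the thread pool from the Python standard library as a stand-in for Celery workers running on separate nodes. It is not the xTAS implementation: the task function tokenize and the helper run_tasks are hypothetical names chosen for this example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from concurrent.futures import ThreadPoolExecutor


def tokenize(text):
    """Hypothetical analysis task: split a document on whitespace."""
    return text.split()


def run_tasks(task, documents, workers=4):
    """Apply `task` to every document in parallel, preserving input order.

    The thread pool is only a local stand-in for a distributed task queue.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, documents))


if __name__ == "__main__":
    docs = ["xTAS runs tasks in parallel", "possibly on different nodes"]
    for tokens in run_tasks(tokenize, docs):
        print(tokens)
```
]]></preformat>
      <p>In a real deployment the pool would be replaced by a task queue whose workers live on different machines; the calling code would still submit a task per document and collect the results.</p>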
      <p>The software is extendable. Additional functionality can
easily be added through a plugin architecture.</p>
      <p>In what follows we describe recent additions to xTAS and we
present ThemeStreams, a novel topic monitoring tool built
on top of xTAS.</p>
      <sec id="sec-2-1">
        <title>2. XTAS</title>
        <p>Recent additions and improvements to xTAS include:</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Semanticization<sup>2</sup></title>
      <p>xTAS can semantically enrich texts by linking the entities
mentioned in them to their Wikipedia articles.</p>
      <p>On-the-fly TF-IDF model generation and application:
TF-IDF models based on a user-selected series of
documents can be trained on the fly. The models can then be
used to provide TF-IDF statistics for words in new
documents.</p>
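      <p>The train-then-apply pattern for TF-IDF can be sketched as follows. This is a minimal illustration, not the xTAS code: the function names train_idf and tfidf are hypothetical, documents are assumed to be tokenized already, and the smoothed IDF formula is one common variant chosen so that words unseen during training still get a finite weight.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def train_idf(documents):
    """Build an IDF table from a list of tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each word once per document.
        doc_freq.update(set(tokens))
    # Smoothed IDF (an assumption of this sketch, not taken from xTAS).
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}
    default = math.log(1 + n_docs) + 1.0  # weight for words never seen in training
    return idf, default


def tfidf(tokens, model):
    """Score each word of a new tokenized document against a trained model."""
    idf, default = model
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: (c / total) * idf.get(w, default) for w, c in counts.items()}


if __name__ == "__main__":
    model = train_idf([["budget", "debate"], ["budget", "health"]])
    print(tfidf(["health", "health", "budget"], model))
```
]]></preformat>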
      <sec id="sec-3-1">
        <title>Co-occurrence metric calculation</title>
        <p>A variety of co-occurrence metrics
were added to xTAS, including the maximum likelihood
estimate, pointwise mutual information, the log-likelihood
ratio and χ². This enables users to calculate the
co-occurrence of entities in a set of documents.</p>
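        <p>The four metrics named above can all be computed from the same document counts. The sketch below uses the textbook definitions over a 2x2 contingency table; it is an assumption that xTAS uses these exact formulations, and the function name cooccurrence_metrics and its parameters are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def cooccurrence_metrics(n_ab, n_a, n_b, n):
    """Association scores for entities A and B from document counts.

    n_ab: documents mentioning both, n_a / n_b: documents mentioning A / B,
    n: total number of documents. Assumes 0 < n_a, n_b < n and n_ab > 0.
    """
    # Observed 2x2 contingency table (A present/absent x B present/absent).
    observed = [n_ab, n_a - n_ab, n_b - n_ab, n - n_a - n_b + n_ab]
    # Expected counts if A and B occurred independently.
    expected = [
        n_a * n_b / n,
        n_a * (n - n_b) / n,
        (n - n_a) * n_b / n,
        (n - n_a) * (n - n_b) / n,
    ]
    mle = n_ab / n_a  # maximum likelihood estimate of P(B | A)
    pmi = math.log2(n_ab * n / (n_a * n_b))
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Cells with zero observations contribute nothing to the sum.
    llr = 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    return {"mle": mle, "pmi": pmi, "chi2": chi2, "llr": llr}


if __name__ == "__main__":
    print(cooccurrence_metrics(n_ab=18, n_a=20, n_b=50, n=100))
```
]]></preformat>
        <p>When the two entities occur independently, pointwise mutual information, χ² and the log-likelihood ratio are all zero; larger values indicate stronger association.</p>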
      </sec>
      <sec id="sec-3-2">
        <title>Automatic language identification</title>
        <p>
          If the language of a document is not supplied, xTAS
can determine it automatically. Currently this is
implemented using TextCat [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
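        <p>The n-gram approach behind this kind of language identification can be sketched as follows: each language is represented by a ranked list of its most frequent character n-grams, and a document is assigned to the language whose ranking is closest under an "out-of-place" distance. This is a simplified illustration and not the TextCat code; the function names, the n-gram length and the profile size are assumptions made for the example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Rank the most frequent character n-grams (1..max_n) of a text."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; unseen n-grams get the maximum penalty."""
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def identify_language(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))
```
]]></preformat>
        <p>Language profiles would be built once by calling ngram_profile on a training text per language and then passed to identify_language as a dictionary keyed by language code.</p>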
      </sec>
      <sec id="sec-3-3">
        <title>Support for multiple document stores</title>
        <p>
          Besides MongoDB [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], xTAS can communicate directly
with Apache Solr [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] or ElasticSearch [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. These stores
can be used as a document repository as well as a result
cache.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>Response time improvements</title>
        <p>Analysis of xTAS usage over time shows that named
entity recognition is a frequently requested and
time-consuming analysis. To keep response times at
near-real-time speeds, xTAS keeps several NER models
(for all supported languages) in memory on each xTAS
node.</p>
        <p><sup>2</sup> Semanticization, the process of linking mentions of
concepts in a text to the articles in an external knowledge base
they denote, is also referred to as entity linking or
Wikification.</p>
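        <p>Keeping NER models in memory amounts to loading each model once per worker process and reusing it for later requests. A minimal sketch of that idea, using a memoized loader, is given below. The class DummyNerModel is a hypothetical placeholder for a real model and its capitalization rule is for illustration only; none of these names come from xTAS.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from functools import lru_cache


class DummyNerModel:
    """Hypothetical stand-in for a real NER model loaded from disk."""

    def __init__(self, language):
        self.language = language

    def tag(self, tokens):
        # Placeholder rule: treat capitalized tokens as entity candidates.
        return [t for t in tokens if t[:1].isupper()]


@lru_cache(maxsize=None)
def get_model(language):
    """Load the model for a language once and keep it in memory afterwards."""
    return DummyNerModel(language)


def recognize_entities(tokens, language):
    """Tag tokens using the cached model for the given language."""
    return get_model(language).tag(tokens)


if __name__ == "__main__":
    print(recognize_entities(["Amsterdam", "is", "mooi"], "nl"))
```
]]></preformat>
        <p>Only the first request for a language pays the loading cost; later requests on the same worker reuse the cached object.</p>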
        <sec id="sec-3-4-1">
          <title>3. THEMESTREAMS</title>
          <p>ThemeStreams<sup>3</sup> is a visual interface that helps answer the
question "Who is talking about what?". It does so for topics
in the Dutch political landscape by showing the ebb and flow
of conversations about particular themes through time. While
there are many topic monitoring tools available, the novelty
of ThemeStreams lies in its ability to present the user with a
quick overview of the relative frequency of posts that a particular
group of users issued on a certain subject. ThemeStreams is
based on tweets posted on Twitter by four groups of people:</p>
          <list list-type="bullet">
            <list-item><p>politicians (ministers, members of parliament, but also
the local ranks of politicians in municipalities and provinces)</p></list-item>
            <list-item><p>political journalists (newspaper journalists as well as
talk show hosts of political television shows)</p></list-item>
            <list-item><p>lobbyists (people pushing the people who are active in
politics)</p></list-item>
            <list-item><p>other influencers (these include (satirical) columnists,
politically engaged celebrities and stand-up
comedians)</p></list-item>
          </list>
          <p>The harvesting of these tweets started in late 2011. At the
time of writing, we follow about 1400 individual users, who,
together with all people participating in conversations with
these inner-circle users, yield a set of just over 3.9M tweets.</p>
          <p>
            The interactive visual interface is aimed at giving insight
into the ownership and dynamics of themes being discussed.
It enables users to answer questions such as "Who put this
issue on the map?", "Who picked up on this topic?" and "Is this
topic gaining momentum?". ThemeStreams allows users to
explore streams of tweets either from a fixed set of predefined
themes or through a search box. It uses stream graphs [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]
to indicate how the four influence groups discuss a specified
theme, thereby depicting the volume, the "aliveness" and
ownership of a topic.
          </p>
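          <p>To make the stream graph idea concrete, the sketch below computes the lower and upper boundary of each group's layer from per-group tweet counts per time step. It assumes the symmetric baseline in which the stack is centred around zero (half of the total volume lies below the axis); the source does not state which baseline ThemeStreams uses, and the function name stream_layers and the example counts are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def stream_layers(series):
    """Compute (lower, upper) boundaries per time step for each layer.

    `series` maps a group name to a list of counts per time step; all
    lists must have the same length. Layers are stacked in dict order.
    """
    groups = list(series)
    steps = len(series[groups[0]])
    totals = [sum(series[g][t] for g in groups) for t in range(steps)]
    # Symmetric baseline: centre the whole stack around zero.
    lower = [-0.5 * total for total in totals]
    layers = {}
    for g in groups:
        upper = [lo + v for lo, v in zip(lower, series[g])]
        layers[g] = list(zip(lower, upper))
        lower = upper  # the next layer starts where this one ends
    return layers


if __name__ == "__main__":
    counts = {"politicians": [2, 4], "journalists": [2, 0]}
    for group, bounds in stream_layers(counts).items():
        print(group, bounds)
```
]]></preformat>
          <p>The thickness of a layer at a given time step is the group's volume, so a widening band indicates that a group is picking up a theme.</p>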
          <p>The interface indicates the time a tweet was posted, the
influence group the poster belongs to and the number of
people who reacted to a statement (which can be used to
estimate the "size" and "lifetime" of a statement). Initially a
combined word cloud is shown, with words colorized by the
group they originate from. Users can zoom in to parts of
the stream for more detail. Doing so results in individual
word clouds being displayed per influence group for the
selected period.</p>
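          <p>One way to build such a combined cloud is to count words over all groups and color each word by the group that used it most. The sketch below follows that reading, which is an assumption: the source does not specify how a word's originating group is determined. The function name combined_cloud and the example tokens are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def combined_cloud(tokens_by_group, top_k=10):
    """Return (word, total count, dominant group) for the most frequent words."""
    per_group = {g: Counter(tokens) for g, tokens in tokens_by_group.items()}
    total = Counter()
    for counts in per_group.values():
        total.update(counts)
    cloud = []
    for word, count in total.most_common(top_k):
        # Color the word by the group that used it most often.
        dominant = max(per_group, key=lambda g: per_group[g][word])
        cloud.append((word, count, dominant))
    return cloud


if __name__ == "__main__":
    tokens = {
        "politicians": ["zorg", "zorg", "begroting"],
        "journalists": ["zorg", "debat", "debat", "debat", "debat"],
    }
    for entry in combined_cloud(tokens, top_k=3):
        print(entry)
```
]]></preformat>
          <p>The per-group counters built in the first step are the same data an individual word cloud per influence group would be drawn from.</p>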
          <p>Initial usability studies were carried out with university
staff members and media analysts working for a
communication agent. We found that ThemeStreams was intuitive to
understand and that it was easy to inspect parts of a tweet stream
in detail. The combined clouds proved to be insightful for
a fast overview of the data. The individual clouds proved to
be useful for inspecting relative word usage between groups.
We also found a need for depicting the most represented
speakers within a group.</p>
        </sec>
        <sec id="sec-3-4-2">
          <title>4. FUTURE WORK</title>
          <p>xTAS is actively being used in a number of research and
production environments. As such, work on xTAS is
ongoing and features are being deployed in close collaboration
with end users. Currently, we focus on adding support for
temporal tagging and for easier deployment on large
clusters.</p>
          <p><sup>3</sup> See an online demo at http://themestreams.xtas.net/</p>
          <p>A more detailed user study of ThemeStreams is currently
in progress. We are also looking into additional
application scenarios for ThemeStreams, such as discourse analysis over
time in other domains, for example newspaper archives.</p>
        </sec>
        <sec id="sec-3-4-3">
          <title>5. ACKNOWLEDGEMENTS</title>
          <p>This research was partially supported by the European
Union's ICT Policy Support Programme as part of the
Competitiveness and Innovation Framework Programme, CIP
ICT-PSP under grant agreement nr 250430, the European
Community's Seventh Framework Programme
(FP7/2007-2013) under grant agreements nr 258191 (PROMISE
Network of Excellence) and 288024 (LiMoSINe project), the
Netherlands Organisation for Scientific Research (NWO)
under project nrs 612.061.814, 612.061.815, 640.004.802,
727.011.005, 612.001.116, HOR-11-10, the Center for Creation,
Content and Technology (CCCT), the BILAND project funded
by the CLARIN-nl program, the Dutch national program
COMMIT, the ESF Research Network Program ELIAS, the
Elite Network Shifts project funded by the Royal Dutch
Academy of Sciences (KNAW), and the Netherlands eScience
Center under project number 027.012.105.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>[1] Apache Solr</article-title>
          . http://lucene.apache.org/solr/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>[2] Celery: Distributed Task Queue</article-title>
          . http://celeryproject.org/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>[3] elasticsearch</article-title>
          . http://www.elasticsearch.org/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] MongoDB. http://www.mongodb.org/.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Byron</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Wattenberg</surname>
          </string-name>
          .
          <article-title>Stacked graphs – geometry &amp; aesthetics</article-title>
          .
          <source>Visualization and Computer Graphics, IEEE Transactions on</source>
          ,
          <volume>14</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1245</fpage>
          –
          <lpage>1252</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Cavnar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Trenkle</surname>
          </string-name>
          , et al.
          <article-title>N-gram-based text categorization</article-title>
          . Ann Arbor MI,
          <volume>48113</volume>
          (
          <issue>2</issue>
          ):
          <fpage>161</fpage>
          –
          <lpage>175</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>