<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>HealthMap - Global
Disease Alert Mapping System. Retrieved January</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Visualization of Clandestine Labs from Seizure Reports: Thematic Mapping and Data Mining Research Directions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohammed Abduljabbar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>xec@ksu.edu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Algorithms</institution>
          ,
          <addr-line>Experimentation, Human Factors</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computing and Information Sciences</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Geography Kansas State University Manhattan</institution>
          ,
          <addr-line>KS 66506</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Ryuichi Osuga</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Wesam Elshamy</institution>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>William H. Hsu</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2010</year>
      </pub-date>
      <volume>25</volume>
      <issue>2010</issue>
      <fpage>113</fpage>
      <lpage>120</lpage>
      <abstract>
        <p>The problem of spatiotemporal event visualization based on reports entails subtasks ranging from named entity recognition to relationship extraction and mapping of events. We present an approach to event extraction that is driven by data mining and visualization goals, particularly thematic mapping and trend analysis. This paper focuses on bridging the information extraction and visualization tasks and investigates topic modeling approaches. We develop a static, finite topic model and examine the potential benefits and feasibility of extending this to dynamic topic modeling with a large number of topics and continuous tome. We describe an experimental test bed for event mapping that uses this end-to-end information retrieval system, and report preliminary results on a geoinformatics problem: tracking of methamphetamine lab seizure events across time and space.</p>
      </abstract>
      <kwd-group>
        <kwd>information extraction</kwd>
        <kwd>information visualization</kwd>
        <kwd>event extraction</kwd>
        <kwd>topic modeling</kwd>
        <kwd>geoinformatics</kwd>
        <kwd>spatiotemporal information retrieval</kwd>
        <kwd>data mining</kwd>
        <kwd>machine learning</kwd>
        <kwd>time series</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>H.3.3 [Information Storage and Retrieval]: Information search
and retrieval – clustering, relevance feedback, selection process;
H.2.8 [Database Management]: Database applications – data
mining, spatial databases and GIS</p>
    </sec>
    <sec id="sec-2">
      <title>General Terms</title>
    </sec>
    <sec id="sec-3">
      <title>1. INTRODUCTION</title>
      <p>In this paper, we address the problem of event visualization based
on structured data, in the form of time-referenced and
georeferenced relational tuples, and on unstructured data, in the
form of free text. Information extraction systems based on named
entity recognition (NER) and relationship extraction have enabled
detection of events mentioned in free text and extraction of
structured tuples describing the location, time, along with other
attributes of an event. Identifying hotspots and trends, however,
remains an open problem. One limitation is the absence of ground
truth for high event activity. In some cases this is due to a lack of
Presented at EuroHCIR2012. Copyright © 2012 for the individual
papers by the papers' authors. Copying permitted only for private and
academic purposes. This volume is published and copyrighted by its
editors.
well-defined criteria for activity and relevance, while in some it is
due to limitations in existing annotation interfaces.</p>
      <p>
        We first present a basic approach to event visualization. Our
general framework makes use of mapping tools such as Google
Maps [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the Google web toolkit, and timeline visualization tools
such as MIT SIMILE [2]. It also builds upon previous work on
gazetteer-based event recognition and syntactic patterns for
semantic relationship detection. Next, we show how a system
developed originally for visualization of animal disease outbreaks
reported in online news documents can be adapted to display
reports of methamphetamine lab seizures compiled by regional
law enforcement. We briefly outline the development of a
domain-specific data description language for increased
portability and ease of information integration. We then discuss
the role of topic modeling and information retrieval approaches in
filtering and ranking events.
      </p>
      <p>A key technical contribution of this work is the application of
topic modeling algorithms in order to compute the posterior
probability of a particular spatial location, time unit, or
combination given the type of event, which is treated as a topic.
This allows the data to be interrogated systematically in order to
display geographic regions that are more prone to events of
interest. A potential application of this is to construct a time
composite map of administrative divisions within a state or
province, or a spatial composite time series by month or year,
showing active regions. These can be visualized using a
choropleth map: a map in which regions (geographic regions in
this case) are coded by colors or grayscale intensity levels. These
represent a variable of interest – in this case, event frequency.
Finally, the ability to estimate marginal likelihoods over locations
and times given the event type parameters can also be used to
filter events, to display only those that fall within a specified
frequency range. For example, the system can be configured to
search for seizures of methaphetamine production labs in counties
or districts where they are common or rare.</p>
    </sec>
    <sec id="sec-4">
      <title>2. EVENT VISUALIZATION TASKS</title>
    </sec>
    <sec id="sec-5">
      <title>2.1 Spatiotemporal Event Extraction</title>
      <p>The goal of event extraction is to identify phenomena related to
specific actions, occurrences, relationships, or entities. For
example, a positive test for a contagious animal disease on a farm
is an event that may be tied to an epidemic and identified post hoc
as indicating an outbreak. The seizure of equipment from a
methamphetamine production facility, or of waste products from a
dump site, is an example of an event in the domain of drug
enforcement. Events that can be localized in space and time form
the basis of spatiotemporal event extraction.
layers to be applied using the Google Maps API. This includes
choropleth maps with dynamically computable color palettes.
In the domains of veterinary epidemiology and drug enforcement,
decision support systems are typically based on spatiotemporal
event extraction and visualization. When events are already
available in structured form, they are usually compiled manually
from investigative reports by local or national authorities: animal
health agencies in the case of veterinary epidemiology and state of
national bureaus of investigation in the case of drug seizures. By
contrast, unstructured data often comes into the decision support
system as the result of a web crawl based on domain-specific
resource identifiers: seeds (URLs) and search terms. Federated
displays and user interfaces for these decision support systems
often combine event data from structured data repositories with
data extracted from free text. This entails data integration
challenges such as disambiguation, deduplication and identity
uncertainty for entities and events; expansion of existing named
entity sets from gazetteers (lists of known entities); and inference
of attributes for relationships representing events and entities
representing actors and objects. These topics are beyond the
scope of this paper; we refer the interested reader to existing
literature on the current state of the field in domain-focused
relationship extraction.</p>
      <p>Instead, we suppose here that some preliminary classification step
has already taken place to identify the entity that serves as anchor
point for an event, and that further classification or inference has
identified a putative time and location for the event. Whether this
is accomplished through supervised inductive learning from text
corpora or as a result of basic pattern matching, our starting point
is a candidate tuple to be analyzed, considered for presentation to
a decision-maker or search user, and if selected, visualized in the
context of events of interest.</p>
    </sec>
    <sec id="sec-6">
      <title>2.2 Georeferencing and Map Visualization</title>
      <p>Mapping out spatially-referenced events, even using structured
data sources, entails a straighforward but data-intensive
georeferencing task: looking up the coordinates (latitude and
longitude) of street addresses and postal codes where events are
reported to have occurred.</p>
      <p>The resulting coordinates are placed into a spatial database
management system (SDBMS) for visualization using software
libraries and services such as Google Maps, as shown in Figure 1.
For this purpose, we developed two alternative access layers with
a unified representation and geographic information system (GIS)
data model. The first layer is based on Google’s Keyhole Markup
Language (KML) and a file-based application programmer
interface (API), while the second layer is based on a PHP
interface to a MySQL database implementing the KML schema.
Our front-end application, TimeMap, can be configured to use
either layer.</p>
    </sec>
    <sec id="sec-7">
      <title>2.3 Timeline Visualization</title>
      <p>Figure 1 depicts the data integration between the map and timeline
visualization subsystems. The seizure event in April, 2010 is
represented on the map by a pop-up note, on the monthly scale
timeline (upper right) by a circled dot, and on the yearly scale
timeline (lower right) by a circled point.</p>
    </sec>
    <sec id="sec-8">
      <title>2.4 Thematic Mapping</title>
      <p>The object of thematic mapping is to depict phenomena and trends
in a geospatial context. Toward this end, we have added a
mapping overlay to the TimeMap framework that allows map
transformation such as superimposition of data and transparency</p>
    </sec>
    <sec id="sec-9">
      <title>2.5 Spatial Time Series Prediction</title>
      <p>A final rationale for the TimeMap visualization framework arises
from domain-specific data mining objectives in epidemiology and
criminology. Governmental agencies devoted to agriculture,
public health, and law enforcement often encounter a need for
predictive analytics tools to assist with decision-making in both
public policy and intervention, and with civic outreach. In the
domain of public health, tools such as HealthMap [4] have begun
to do for individual citizens what more general crime-mapping
systems are intended do for search users: provide relevance filters
based on criteria related to incident frequency, corroborative
reporting, and significance.</p>
    </sec>
    <sec id="sec-10">
      <title>3. TOPIC MODELING</title>
      <p>As mentioned in Section 2.1, named entity recognition combined
with date and location can provide a means of extracting a stream
of events and updates from news stories. This also holds for
microblogs and other social media. In order to classify new
events and detect the emergence or revival of new event-related
topics, however, a mechanism for monitoring update streams is
needed. This requires a more flexible topic model than the fixed
or expandable sets of named entities used for structured
information extraction. Furthermore, the frequency and semantic
heterogeneity of event reporting from multiple media outlets, even
on the web alone, may require enrichment of the parameters
beyond those used in classical generative models for information
retrieval. We now examine possible extensions to these document
clustering models.</p>
    </sec>
    <sec id="sec-11">
      <title>3.1 Static Topic Models</title>
    </sec>
    <sec id="sec-12">
      <title>3.2 Dynamic Topic Models</title>
      <p>As our preliminary experiments with historical data on both
epizootic disease outbreaks and meth lab seizures showed, news
flashes do not admit the kind of stationarity assumed in Figure 3.
Specifically, the latent variables of our topic model change over
time as a result of concept drift and the arrival of new topics,
which we can think of as a birth-death process tied to observable
events. Blei and Lafferty (2006) proposed a dynamic topic model
with fixed topic count K in which each topic’s word distribution
and popularity are linked over time. [5] Meanwhile, older topic
modeling algorithms such as Latent Semantic Analysis (LSA) [6]
that permit K to vary suffer from problems such as proximity of
different senses of a polysemous word, while variants
Probabilistic Latent Semantic Analysis (PLSA) [7] exhibit
parameter growth linear in the number of documents D.</p>
      <sec id="sec-12-1">
        <title>3.2.1 Discrete Time, Infinite Topic</title>
        <p>Ahmed and Xing (2010) proposed a partial solution to this
problem by introducing an infnite Dynamic Topic Model (iDTM)
that allows for an unbounded number of topics and an evolving
representation of topics according to a Markovian dynamics. [8]
They analyzed the birth and evolution of topics in the neural
computation community based on the Neural Information
Processing Systems (NIPS) conference proceedings. Their model
evolved topics over discrete epochs (time units). All proceedings
of a conference meeting fall into the same epoch. This model does
not suit less “bursty topic” applications such as meth lab seizures
or disease outbreaks, which are asynchronous whether reported in
the news or in local law enforcement records.</p>
        <p>For topic modeling applications such as event visualization, such
discrete time model may be too brittle. An extension to
continuous time will give it the needed flexibility to account for
variability and change in temporal granularity.</p>
      </sec>
      <sec id="sec-12-2">
        <title>3.2.2 Continuous Time, Finite Topic</title>
        <p>Meanwhile, Wang et al. (2008) proposed a continuous time
dynamic topic model that uses Brownian motion to simulate the
evolution of topics over time. [9] Although this model uses a
novel, sparse variational Kalman filtering algorithm for fast
inference, the number of topics it samples from is bounded, which
severely limits its application in news feed storyline creation and
article aggregation. When the number of topics covered by the
news feed is fewer than the pre-tuned number of topics K
specified in the model, similar stories will appear under different
headlines. On the other hand, if the number of topics covered
becomes greater than the preset number of topics, topics and
headlines will get conflated.
3.2.3 Proposed: Continuous Time, Infinite Topic
To accommodate the needs of non-bursty news updates in
domains such as our event visualization domains, we propose a
hybridization of the infinite topic model and the continuous time
model that combines a hierarchical Dirichlet process (DP) for
dynamic topic abstraction and refinement with Brownian motion
to capture stochastic topic drift. [10]
We can use a variational inference algorithm such as variational
Kalman filtering to factorize the variational distribution over
latent variables:
where  is the word distribution over topics and
1:, 1: ,1:
word distribution over topics for time 1:T, topic z1:T and word
index 1:N, where N is the size of the document lexicon.
is the</p>
      </sec>
    </sec>
    <sec id="sec-13">
      <title>4. APPLICATION TEST BED</title>
    </sec>
    <sec id="sec-14">
      <title>4.1 Prior Work: Veterinary Epidemiology</title>
      <p>Volkova &amp; Hsu (2010) describes earlier work on computational
information and knowledge management (CIKM) and information
extraction. [11] This research, motivated by a need to visualize
digests of news articles on animal disease outbreaks as shown in
Figure 4, led to the earliest prototype of our event visualization
system, implemented using Google Maps and MIT SIMILE. This
system used syntactic detectors for semantic equivalence
assertions (Volkova et al., 2010). [12]</p>
    </sec>
    <sec id="sec-15">
      <title>4.2 Kansas Meth Lab Seizures</title>
      <p>A more recent version of the event visualization system is
represented in the meth lab application described in Sections 1
and 2. This system forms the test bed for both the visualization
techniques described in Section 2 and the topic modeling
techniques intended to raise the precision of the federated system
by improving relevance filtering and ranking.</p>
    </sec>
    <sec id="sec-16">
      <title>5. EXPERIMENTAL EVALUATION</title>
      <p>Figure 5 shows some baseline descriptive statistics for meth lab
seizures from the test bed discussed in Sections 2 and 4.2.
In our topic model, topics are event types, which are 50 different
types of methamphetamine lab seizures. Given the text of a lab
seizure report or a news story of the event with its date and
location as a prior, we can use the topic model shown in Figure 3
to evaluate the likelihood of the event given the prior. Events
with likelihood value above a threshold are considered highly
likely to occur given the location and time of the event.
Given a collection of police lab seizure reports with date and
location, we ran the topic model on it and evaluated the topic
composition of each report in the collection. Table 1 shows the
topic "Abandoned dump site" proportion per seizure reports for
four Kansas counties over six years. If we set a likelihood
threshold value of 0.02, then for year 2000 the event will be
marked on the map for Cowley and Reno counties. The event will
not get marked on the map for year 2002 for any of these four
counties, and for year 2004 will be marked only in Reno County,
and so on.</p>
    </sec>
    <sec id="sec-17">
      <title>6. CONTINUING AND FUTURE WORK</title>
      <p>In continuing work, we are validating the baseline LDA output by
using it to filter and rank search results the seizure database
represented in Figure 5 and Table 1. Relevance feedback from
multiple subject matter experts is being used to evaluate both
LDA and DP-based topic models. [11]</p>
    </sec>
    <sec id="sec-18">
      <title>7. ACKNOWLEDGMENTS</title>
      <p>Thanks to Loretta Wyrick-Severin (Kansas Bureau of
Investigation) for assistance with public information requests and
to Surya Teja Kallumadi, Tim Weninger, Svitlana Volkova, John
Drouhard, Landon Fowles, and Andrew Berggren for
development work on TimeMap.
[8] Ahmed, A., &amp; Xing, E. P. (2010). Timeline: A dynamic
hierarchical dirichlet process model for recovering birth/death and
evolution of topics in text stream. Proceedings of UAI 2010, (pp.
20-29). AUAI Press.
[9] Wang, C., Blei, D. M., &amp; Heckerman, D. (2008). Continuous
time dynamic topic models. Proceedings of UAI 2008, (pp.
579586). AUAI Press.
[10] Blei, D. M. (2012, April). Introduction to probabilistic topic
models. Communications of the ACM, 55(4), pp. 77-84.
[11] Volkova, S., &amp; Hsu, W. H. (2010). Computational
knowledge and information management in veterinary
epidemiology. Proceedings of ISI 2010, (pp. 120-125). IEEE
Press.
[12] Volkova, S., Caragea, D., Hsu, W. H., Drouhard, J., &amp;
Fowles, L. (2010). Boosting Biomedical Entity Extraction by
using Syntactic Patterns for Semantic Relation Discovery.
Proceedings of WI-IAT 2010 (pp. 272 - 278). IEEE Press.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Google.</surname>
          </string-name>
          (
          <year>2010</year>
          ).
          <source>Google Maps. Retrieved June 29</source>
          ,
          <year>2012</year>
          , from https://maps.google.com
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>