=Paper=
{{Paper
|id=Vol-66/paper-4
|storemode=property
|title=Ontologies for Agent-Based Information Retrieval and Sequence Mining
|pdfUrl=https://ceur-ws.org/Vol-66/oas02-15.pdf
|volume=Vol-66
|authors=Subrata Das,Kurt Shuster and Curt Wu
}}
==Ontologies for Agent-Based Information Retrieval and Sequence Mining==
<pdf width="1500px">https://ceur-ws.org/Vol-66/oas02-15.pdf</pdf>
<pre>
      Ontologies for Agent-Based Information Retrieval and
                       Sequence Mining
                                            Subrata Das, Kurt Shuster, Curt Wu
                                                     Charles River Analytics, Inc.
                                                       625 Mount Auburn St.
                                                       Cambridge, MA 02138
                                                          +1 617 491 3474
                                                   {sdas, kshuster, cwu}@cra.com

ABSTRACT                                                               these data sources often requires that the data be stored in a
                                                                       number of independent repositories distributed over a network.
In this paper, we present two very practical problems in the areas     Because of the large volumes of data stored and the large number
of distributed information retrieval and pattern mining, as well as    of distinct data archives in which the data is located, scientists and
our proposed solutions via the use of intelligent agents and           analysts often face a daunting task when searching for specific
domain ontologies. The first problem is to retrieve data from          data or series of interrelated data. Moreover, each of these data
heterogeneous distributed data sources with a specific application     archives is responsible for a particular domain and autonomously
to distributed Earth Science data archives. Our proposed approach      maintains its data in its own distinct format. Consequently, users
is to develop an engine which acts as an interface agent by            have to learn the format or metadata information of individual
presenting users with the appearance of a single, unified,             data sources. Thus, we see a need for a tool that would
homogenous data source based on a domain ontology of Earth             automatically identify and retrieve data from distributed sources
Science terminology. Users can then pose high-level declarative        based on high-level user queries.
queries against this view. The system then translates each query
                                                                             A large amount of research has been directed toward the
into a set of sub-queries and spawns mobile agents to retrieve data
                                                                       problem of querying and integrating heterogeneous data from
corresponding to each sub-query. The second problem is to
                                                                       distributed sources. Simplified methods for querying such data
predict significant world events at multiple levels of abstraction
                                                                       sources, which may include traditional databases, knowledge
by analyzing a collection of events over a period of time in order
                                                                       bases, programs, Web pages, and data files, can broadly be
to generate sequential patterns. We specifically focus on
                                                                       categorized into the following two approaches (Widom, 1996): 1)
predicting terrorist actions by analyzing terrorist group activities
                                                                       a lazy or on-demand approach, where information is extracted
over time. We employ a hierarchical taxonomic organization of
                                                                       from the sources only when the queries are posed; and 2) an eager
contextual event types to obtain higher-level abstractions of
                                                                       or in-advance approach, where relevant information is extracted
observed low-level events. With this approach, significant events
                                                                       in advance in anticipation of queries and stored in a central
can be predicted at multiple levels of abstraction with associated
                                                                       repository. It is simply not practical to create another data
confidences. Although we have addressed these two problems by
                                                                       repository from several data sources that are already huge and
building prototypes in two different domains, their combination
                                                                       maintained autonomously. Thus adopting an on-demand approach
offers a powerful agent-based tool that can assist scientists and
                                                                       for distributed heterogeneous databases seems quite appropriate,
analysts by automatically retrieving and mining data collected
                                                                       though such an approach to data retrieval requires an
from multiple distributed data sources. Thus with the use of
                                                                       infrastructure for retrieving data from distributed data sources
relevant domain ontologies, the problems of data retrieval and
                                                                       based on the query requests. Mobile agent based data retrieval
pattern discovery can be combined and automated in a single,
                                                                       offers several advantages including remote computation, robustness to
elegant system.
                                                                       network connection interruption, and autonomy. Such an agent is
                                                                       an autonomous agent with behavior, state, and location.
Keywords                                                                    But irrespective of the approach adopted for integrating
Agent, Ontology, Taxonomy, Distributed Information Retrieval,          heterogeneous distributed data sources, it is necessary to provide
Sequence Mining.                                                       users with a single, unified, homogenous interface through which
                                                                       users can then pose high-level declarative queries to retrieve data
                                                                       from distributed data sources. This helps users to avoid the time-
1     INTRODUCTION                                                     consuming process of learning individual data sources. One
The exponential growth of the Internet in recent years has given       effective approach to building a unified interface to
the analysts (e.g. counterterrorism analysts) and scientists (e.g.     heterogeneous distributed data sources is via the use of a unified
space and environmental scientists) an opportunity to access large     domain ontology. An ontology in a particular domain is a
amounts of open-source and classified data that are routinely          description of the concepts and relationships that can exist in the
collected and stored on a continuous basis by many large               domain (Sowa, 2000). One of the primary purposes of
corporations and government agencies. Some important uses of           constructing an ontology is to provide a standard, unambiguous
such data include predicting future terrorist activities,              representation of a particular domain of knowledge (Arens et al.,
discovering new space phenomena, and predicting weather                1993). Ontologies have been built and used successfully in
patterns and global warming. However, the proprietary nature of        constructing multi-contextual knowledge bases, including
common-sense knowledge bases like Cyc (Lenat, 1995), as well         science and asymmetric threat prediction, including their
as enterprise knowledge (Uschold et al., 1998) and environmental     acquisition via Protégé and subsequent representation in a
science ontology EDCS (Birkel, 1999). Various ontology               machine readable XML format. Our approach to the use of
representation schemes and acquisition tools are now available,      ontologies is generic, in the sense that for a particular domain,
such as XML, Protégé (Noy et al., 2000), and KIF (Genesereth,        metadata information from individual sources will be translated to
1991).                                                               a uniform representation with the use of a single ontology of the
     However, there are several issues that must be addressed        domain concerned. Users will pose a query with the ontology in
during the process of building an ontology for a particular          mind and the system will automatically decompose queries into
domain:                                                              subqueries that are understood by individual data sources.
                                                                          The rest of the paper is organized as follows. The following
•   Ontological Structure
                                                                     section briefly describes the two projects and our approach.
    −    The type of ontology must be chosen based on the given      Section 3 describes our use of ontologies in these projects,
         task, with several options available, such as frame-        specifically the organization of ontologies in hierarchical
         based ontologies, task-based ontologies, and others         taxonomies. Section 4 describes our use of Protégé for acquiring
         (Fensel, 2001).                                             ontologies and their machine readable representations in XML.
    −    Many standardized language choices exist (e.g. KIF, OKBC)   Finally, Section 5 briefly describes our plan to combine the two
                                                                     approaches into an integrated information retrieval and sequence
    −    It is often impractical to independently create entire      mining system.
         ontologies due to the large size of the domain of
         interest; therefore several 3rd-party ontologies may
         need to be integrated.
                                                                     2      THE PROBLEMS
•   Ontology maintenance/evolution                                   This section introduces the two problems that we are currently
                                                                     dealing with and our approach especially with the use of domain
    −    Domain may be very specific to a particular field (e.g.     ontologies. For more details on these projects, readers are
         oceanic zonation terminology in (Frank and Kemp,            recommended to read (Das, Shuster, and Wu, 2002; Das and
         2001)), requiring expert assistance for generation.         Ruda, 2002).
    −    Ontologies that may change over time must be
         adaptable.                                                  2.1      Information Retrieval from Distributed Earth
•   Upper-level Ontologies                                                    Science Data Archives (ACQUIRE)
    −    If diverse ontologies must be integrated then semantic      NASA’s Earth Science Division continuously collects and stores
         discrepancies need to be rectified. This may require a      vast amounts of environmental data for use by a large and diverse
         high-level upper ontology (e.g. Cyc upper ontology          community of research scientists, engineers, and analysts. This
         (Lenat, 1995), SUMO (Niles and Pease, 2001)).               data comes from a wide variety of sources, including orbiting
•   Populating                                                       satellites, weather stations, research aircraft, and others. Various
                                                                     Distributed Active Archive Centers (DAACs) around the globe
    −    Much work needs to be done to manually map                  collect and maintain this data on behalf of NASA; each of these
         individual data sources to a global ontology –              DAACs is responsible for a particular domain and maintains its
         potentially requiring partial automation of the task.       data in its own distinct format. Researchers who require data
Additionally, our use of mobile agents for distributed information   stored in these archives often spend a great deal of time locating
retrieval raises further issues regarding their effective            and integrating the specific data they require. The process would
operation within an ontological framework:                           be much simpler and faster if there existed a single, homogenous
                                                                     data repository or the appearance (from the user’s point of view)
•   Mobile agents and ontologies
                                                                     of such a single repository. In this case, the user would not need
    −    As agents hop from site to site, it is sometimes            to ‘find’ the location of any data, since all of it would appear to be
         necessary that each agent carry the entire domain           located in the same place. Furthermore, the user could construct
         ontology and the translation mechanism for each site it     his exact query in the form of a suitable database query language,
         is likely to visit. This approach makes an agent bulkier    such as SQL.
         and therefore slower to move within the network.                 We have developed (Das, Shuster, and Wu, 2002) an Agent-
    −    Mapping from an individual database schema to a global      based Complex QUerying and Information Retrieval Engine
         ontology is not trivial; programmatic mapping may be        (ACQUIRE) for heterogeneous and distributed data sources, and
         required at the data source (e.g. converting Fahrenheit to  subsequently tested the system on simulated Earth Science data
         Celsius). Mobile agents will thus have to carry with        repositories. ACQUIRE implements the following three stages:
         them all relevant mapping and translating code.             •     Accepts a query from a user and decomposes it appropriately
      We are currently addressing the above-mentioned issues               into a set of sub-queries using site and domain models of the
within our two ongoing projects: 1) information retrieval from             distributed data stores
distributed Earth Science data sources (Das, Shuster, and Wu,
                                                                     •     Intelligently creates an optimized plan for retrieving answers
2002), funded by NASA; and 2) sequence mining for terrorist
                                                                           to these sub-queries over a network and spawns a set of
threat prediction (Das and Ruda, 2002), funded by DARPA. Our
                                                                           intelligent mobile agents to delegate these tasks
initial focus is to build ontologies in two domains, environmental
•     Appropriately merges the answers returned by the mobile             hierarchical structure. The event taxonomy is applied when events
      agents and then returns them to the user                            are extracted, and the hierarchical form of the taxonomy is
                                                                          especially useful when only scant information is available about
      Our on-demand approach to data retrieval requires an
                                                                          an event. The taxonomy can also be used to generate temporal
infrastructure for retrieving data from distributed data sources
                                                                          rules at various levels of abstraction.
based on the query requests that are generated from the
ACQUIRE front-end. We have used a mobile agent approach                        The events that are collected from open source and organized
(Kotz and Gray, 1999), where such an agent is defined as a named          hierarchically are then used by machine learning (ML) algorithms
object which contains code, persistent state, data, and a set of          to recognize temporal patterns of behavior and to discover
attributes such as movement history and authentication. A mobile          behavioral rules. These rules are used to predict future activities
agent can transport itself from one data server host to another as        based on current data/events. Initial results are promising,
needed for accomplishing its tasks such as searching for relevant         indicating that terrorist attacks can actually be predicted with a hit
data. Such an approach provides distributed querying at sites             rate of 88% (i.e., only 12% of attacks were not predicted) and a
where the relevant data is available instead of shipping large            false-alarm rate of 37%.
volumes of data across the network. Unlike remote procedure
calls, ongoing interactions do not require ongoing communication
in a mobile agent approach. An agent can perform actions with a           3     ONTOLOGIES AND TAXONOMIES
certain degree of autonomy, such as finding alternate routes in the       An ontology is an abstract model of a particular field of
event of a network failure. Another feature of a mobile agent             knowledge. An ontology describes concepts, attributes of
approach is the agents’ ability to carry arbitrary computations to the    concepts, and the relationships between concepts. For example, the
data storage site. This allows for greater flexibility when               taxonomy of species in biology is a type of ontology which
retrieving and processing remote data, as relevant data-processing        classifies all known biological organisms by Kingdom, Phylum,
code can be customized to the particulars of a given query.               Class, Order, Family, Genus, and Species. The system is
Numerous applications of mobile agents exist, including remote            hierarchical in nature, such that any organism in the hierarchy
database access, on-line shopping, and communicating with                 possesses all of the attributes of the higher-level classification units
travelers. Some of the commercial-off-the-shelf (COTS) software           to which it belongs. For example, Phylum Chordata consists of all
packages for mobile agents are: IBM’s Aglets, ObjectSpace’s               animals that have a notochord. Classes Mammalia and Reptilia
Voyager, and Mitsubishi Electric ITA’s Concordia. For our effort          both belong to this phylum, and thus they both share the common
we have explored several possible COTS packages for                       attribute of possessing a notochord. An instance is a concrete
implementing the mobile agents, and we eventually selected the            instantiation of a particular class within the ontology. So whereas
Grasshopper system from IKV Corporation                                   “African Elephant”, “Grey Wolf”, and “Saber-toothed Tiger”
(www.grasshopper.de).                                                     represent different species within the ontology of organisms,
                                                                          “Dumbo”, “Spot”, and “Fluffy” are specific instances of those
2.2     Sequence Mining for Significant Terrorist Action                  species. A knowledge base is a data structure which contains both
        Prediction (TACTICS)                                              an ontology and specific instances.
                                                                               One of the primary purposes of constructing an ontology is
The growing digitization of asymmetric warfare and the                    to provide a standard, unambiguous representation of a particular
exponential growth of the Internet in recent years have given the         domain of knowledge. This facilitates communication between
counterterrorism analyst an opportunity to access large amounts           domain experts in a given field. If a biologist discovers a new
of open-source data. One effective use of such data is for                species, she can specify its kingdom, phylum, etcetera, and other
generating past terrorist activity patterns to predict future terrorist   biologists will understand without ambiguity the attributes of the
activities. However, the manual extraction of hidden patterns             new species, since they all share the same vocabulary. The
within an unorganized large volume of open-source data is nearly          following two subsections describe the use of ontologies and
an insurmountable task. What is required is an automated                  taxonomies in two of our ongoing projects, ACQUIRE and
technique that can detect useful patterns                                 TACTICS.
within gathered data from open sources. We have developed (Das
and Ruda, 2002) one such technique where the goal is to make
accurate predictions of future events based on extracted patterns         3.1     Ontology of Earth Science Data in ACQUIRE
from past history, thereby supporting reliable behavior
prediction and threat assessment for counterterrorism.                    In ACQUIRE, the domain of discourse is Earth Science data, and
                                                                          thus we require an ontology of Earth Science terms, including
     Our recent DARPA-sponsored effort under the TACTICS                  standard definitions of space, time, weather, etcetera. This
program has so far been restricted to terrorist activity by a             ontology serves as a common reference linking the diverse and
particular terrorist group (the name and other specifics relating to      nonuniform naming schemes used in the various data sets stored
the actual group being studied are not disclosed for reasons of           in NASA’s DAAC system. For example, data from two different
personal security) and its activities during a particular time frame.     DAACs may contain temperature data for different regions of
The past history of the terrorist group activities during the period      the earth. One data source may store the temperature in a column
is represented as a sequence of events. These events include both         labeled “temp”, while the other uses “temperature”. To resolve
significant events such as actual terrorist attacks, as well as non-      this issue (known as the polysemy and synonymy problem),
attack events (e.g. leaders visit abroad). In order to represent all      ACQUIRE’s common earth science ontology will contain a
the possible events involving terrorist group activities, an event        TEMPERATURE class that unambiguously denotes all
taxonomy has been created that organizes the events into a
temperature measurements. All data sources accessible to
ACQUIRE will require a mapping between the data set’s
idiosyncratic naming convention and ACQUIRE’s universal
ontology. Thus both the “temp” data and the “temperature” data
can be accessed with a single query for TEMPERATURE.
Note that there are two distinct mapping steps in the process. The
first mapping is done off-line when the data source is added to
ACQUIRE’s list of available repositories. A system administrator
must perform this one-time mapping, known as data modeling, for
each data source when the data source is added. The second
mapping is the dynamic data acquisition performed by ACQUIRE
during actual data retrieval. The software automatically performs
this operation whenever a data source is accessed, thus providing
the ‘transparency’ of the system’s data retrieval functionality.
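
The two mapping steps described above can be illustrated with a minimal
Python sketch. It is not ACQUIRE's implementation: the names
register_source, resolve, and SOURCE_MODELS, and the source identifiers
daac_a and daac_b, are hypothetical and chosen only to mirror the
"temp"/"temperature" example.

```python
# Hypothetical sketch; none of these names come from ACQUIRE itself.

# Step 1 (off-line "data modeling"): an administrator records, once per
# data source, how its local column names map onto ontology classes.
SOURCE_MODELS = {}


def register_source(source, column_to_class):
    """Store the one-time mapping for a newly added data source."""
    SOURCE_MODELS[source] = dict(column_to_class)


# Step 2 (dynamic acquisition): at query time, resolve an ontology class
# to the concrete column used by each source that provides it.
def resolve(ontology_class):
    """Return {source: local column} for every source holding the class."""
    hits = {}
    for source, mapping in SOURCE_MODELS.items():
        for column, cls in mapping.items():
            if cls == ontology_class:
                hits[source] = column
    return hits


# Two assumed archives that name the same quantity differently.
register_source("daac_a", {"temp": "TEMPERATURE"})
register_source("daac_b", {"temperature": "TEMPERATURE", "lat": "LATITUDE"})

print(resolve("TEMPERATURE"))
```

A single request for TEMPERATURE thus reaches both sources, each through
its own column name, without the user knowing either local convention.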
      A second reason for employing an ontological approach to
data retrieval is that it allows for much greater flexibility in
query structure. For example, a researcher may wish to know the            Figure 2: A Marine Ontology (Frank and Kemp, 2001)
total precipitation over a given region and time period. Specific
NASA archives may store various types of precipitation (e.g. one             Due to this high level of specificity, it is essential that third-
that stores snowfall over a given region, another that stores          party ontologies created by domain experts be easily integrated
rainfall). If a user wants to know the total precipitation, he would   with ACQUIRE’s high-level upper ontology. Integrating diverse
have to query both snowfall and rainfall data sources                  ontologies will be crucial for realizing NASA’s goal of a
independently, and then combine the results. With an ontological       distributed, virtually-centralized, and semantically-rich database
approach, he can simply specify “precipitation” in his query, and      system. Until recently, however, a major problem with integrating
the system would automatically recognize snowfall and rainfall as      diverse ontologies has been the lack of a high-level upper
subclasses of precipitation. The system will then return all data      ontology to serve as a foundation for more domain-specific ones.
sets that store rainfall, snowfall, and any other type of              Typically, domain-specific ontologies either define their own
precipitation. Alternatively, he can simply specify “snowfall” in      high-level concepts or leave them out entirely. These high-level
his query, and the system would then only retrieve “snowfall”          semantic differences between diverse domains have restricted the
data sets.                                                             integration of ontologies from vastly different fields. The
                                                                       Suggested Upper Merged Ontology (Niles and Pease, 2001) is an
     As Earth Science data is the primary type of information          IEEE effort to create a standard upper ontology which will allow
stored at NASA’s DAACs, it is necessary to create an ontology of       semantic integration of diverse domain ontologies through shared
Earth Science terms, data types, etc. Because Earth Science data       high-level concepts. ACQUIRE will utilize SUMO as a
typically involves measurements of a particular region at a            foundation for the automatic integration of domain-specific
particular time, the ontology must include two primary                 ontologies for large, heterogeneous data sources.
measurement types: those of spatial and temporal values (Bishr
and Kuhn, 2000). Although most information stored in the DAAC               In ACQUIRE, a “query” is an abstract data type that
system is geospatial in nature, much of the data contain extremely     encapsulates both a request for data and any data-processing code to
domain-specific terminology. For example, an ontology of               be applied to that data. A query is generally constructed from a
oceanic zonation terms (Frank and Kemp, 2001) is shown below           higher-level “interface query” which depends on the particular
in Figure 1 and Figure 2.                                              user interface being employed. For example, ACQUIRE could
                                                                       employ an SQL interface in which the user enters a query as a
                                                                       standard SQL string. This string would then be translated to
                                                                       ACQUIRE’s internal query structure before being decomposed
                                                                       into individual subqueries to be retrieved by mobile agents.
                                                                       Alternatively, the interface may be a natural language system that
                                                                       takes English sentences as input and translates that input into
                                                                       ACQUIRE’s internal query representation. This way, ACQUIRE
                                                                       can accommodate any interface so long as it translates the user’s
                                                                       request into ACQUIRE’s internal query data structure. The details
                                                                       of this data structure are beyond the scope of this paper, but in
                                                                       general the structure is much like that of a parsed SQL query,
                                                                       with additional fields corresponding to any data processing code.
                                                                            Once the query is requested by the ACQUIRE interface, it
                                                                       must be decomposed into a series of subqueries corresponding to
                                                                       the actual physical location of the data and the particulars of the
                                                                       data schema used. This is done in three primary stages:
                                                                            First, ACQUIRE breaks the query into retrieval units based
     Figure 1: Oceanic Zonation (Frank and Kemp, 2001)                 on the physical location of the data types requested. So, if the
                                                                       query requires data of type “atmospheric-ozone” and “polar-ice-
thickness”, the system queries its catalog of data sites that contain    to imagine a system in which this type of translation would not
data of this type, and creates a retrieval agent for each one. In this   require customized processing code for each data site
example, “atmospheric-ozone” and “polar-ice-thickness” were              representation.
previously defined in the ontology of Earth Science terminology,
and any data sources containing information of this type were
previously cataloged by an administrator.                                3.2       Ontology and Taxonomy in TACTICS
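
The first decomposition stage (one retrieval agent per cataloged site
that holds a requested data type) might look roughly like the following
Python sketch. The CATALOG contents, the site names, and the
RetrievalAgent and decompose names are assumptions made for
illustration; the paper does not specify ACQUIRE's internal structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalAgent:
    """Hypothetical stand-in for a mobile agent bound to one site."""
    site: str
    data_type: str


# Assumed administrator-maintained catalog: ontology data type -> sites.
CATALOG = {
    "atmospheric-ozone": ["site_a"],
    "polar-ice-thickness": ["site_b", "site_c"],
}


def decompose(requested_types):
    """Create one retrieval agent per (data type, site) pair."""
    agents = []
    for data_type in requested_types:
        # Data types missing from the catalog yield no agents.
        for site in CATALOG.get(data_type, []):
            agents.append(RetrievalAgent(site, data_type))
    return agents


for agent in decompose(["atmospheric-ozone", "polar-ice-thickness"]):
    print(agent.site, agent.data_type)
```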
     The next step is to optimize the query. Suppose the query           In TACTICS, the domain of discourse is terrorist threat
was for all polar ice thickness measures taken when atmospheric          prediction, and thus we have defined an ontology of terrorist
ozone levels were above a certain threshold. The system would            activity terms, including standard definitions of attack, threat,
prioritize the retrieval by first retrieving all atmospheric ozone       propaganda, etcetera. The past history of the terrorist activities
levels and then directing the polar ice retrieval agents to retrieve     during the period considered is represented as a sequence of
data only for the regions and times that exceeded the threshold.         events. These events include significant events such as actual
                                                                         terrorist attacks as well as non-attack events (e.g., a leader's visit
     The final step is to map each agent’s ontology-based data
                                                                         abroad). The procedure for collecting the events using the
type against the data schema of the data site at which it is stored.
                                                                         developed ontology is currently semi-automated. Newspaper
For this process, a wrapper is created which maps the particulars
                                                                         articles and other sources are searched for connections to the
of the data site schema to the ontology-based description. So, if a
                                                                         group under consideration, and matching articles are stored in a
data site stores polar ice thickness in a relational database table
                                                                         database. Trained analysts then scrutinize these articles for events,
called “ICE” and a column called “THICKNESS”, the wrapper
                                                                         and any events are represented according to the event type
would consist of an appropriate SQL query that selects THICKNESS
                                                                         taxonomy (discussed below) and stored in the database as well.
data from table ICE. The wrapper also contains any data-
                                                                         The extracted events are then used by a sequence learning engine
processing code required. So if the thickness data is stored in feet
                                                                         to generate meaningful temporal rules.
but the user wants it in meters, then translation code will be sent
along with the agent to perform the translation at the site.                  We have developed a taxonomy for contextual event types
Additionally, if the query requested only the mean values, then          for a terrorist group. Contextual events form the top node of the
code to perform this (or any other) statistical operation will also      hierarchy, and represent incidents that occur in regions of interest
be included.                                                             and can be related to the group being studied. The taxonomy for
                                                                         contextual event types is shown in Figure 3. The set of all
     Once the query is decomposed and the retrieval agents
                                                                         contextual event types has been categorized into direct events,
generated, the system spawns the mobile agents and waits for the
                                                                         regular occurrences, and indirect events.
results to return, at which time it merges the results and presents
them to the user via the user interface. Notice that, in some cases,                                       Contextual Events
some agents may not leave until other ones have returned with
required intermediate data, as described above.
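
The wrapper described above can be sketched as follows. This is a minimal
illustration under stated assumptions, not the ACQUIRE implementation: the
Wrapper class, its field names, and the in-memory SQLite database standing in
for a remote data site are all hypothetical. Only the table name ICE, the
column name THICKNESS, the feet-to-meters translation, and the optional mean
come from the text.

```python
import sqlite3
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Optional

FEET_TO_METERS = 0.3048  # exact, by definition of the international foot


@dataclass
class Wrapper:
    """Hypothetical wrapper: maps one ontology data type onto one site's schema."""
    data_type: str                      # ontology term, e.g. "polar-ice-thickness"
    table: str                          # site-specific table name
    column: str                         # site-specific column name
    convert: Callable[[float], float]   # unit translation executed at the site
    statistic: Optional[Callable[[List[float]], float]] = None  # e.g. mean

    def sql(self) -> str:
        # Identifiers are assumed to come from a trusted, administrator-built catalog.
        return f"SELECT {self.column} FROM {self.table}"

    def run(self, conn: sqlite3.Connection):
        values = [self.convert(row[0]) for row in conn.execute(self.sql())]
        return self.statistic(values) if self.statistic else values


# Stand-in for a remote data site that stores thickness in feet.
site = sqlite3.connect(":memory:")
site.execute("CREATE TABLE ICE (THICKNESS REAL)")
site.executemany("INSERT INTO ICE VALUES (?)", [(10.0,), (20.0,)])

wrapper = Wrapper("polar-ice-thickness", "ICE", "THICKNESS",
                  convert=lambda feet: feet * FEET_TO_METERS,
                  statistic=mean)
mean_thickness_m = wrapper.run(site)  # mean thickness, converted to meters
```

Bundling the conversion and the statistic with the query keeps the computation
at the data site, so only the reduced result has to travel back with the agent.
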
     It should be noted that in the current incarnation of
ACQUIRE, all data accessible by the system must be manually                    Direct Events              Regular Occurrences             Indirect Events
modeled and mapped against the global ontology. Clearly, any
attempt to integrate large numbers of data sites will require a                       Figure 3: Taxonomy for Contextual Events
substantial manual data modeling effort. In addition, any mapped              Direct events are incidents that can be directly related to the
data site that later changes must be remapped against the                group. Figure 4 shows that the set of all direct events has been
data site catalog. One potential solution to this problem would be       categorized into action/activity by group, action/activity against
to send agents to unmapped data sites along with the entire              group, action/activity against population, action/activity in favor
domain ontology and code for automated data site mapping. Work           of group, and peripheral events. Of these five sub-categories, we
on the Cyc project (Lenat, 1995) has been done in the area of            focus on the action/activity by group category that includes events
automated database understanding, and such an approach could be          resulting from actions directly executed by group members.
used with our mobile agents to determine site contents. This
approach still has many inherent problems to overcome, however,
such as the large size of the agents required to transmit both the
ontology and data-analysis code.
     Another problem to address is that of unit type translation at
the data source. For example, one site may store temperature data        [Diagram: Direct Events branches into Action/Activity by Group,
in Celsius while another uses Fahrenheit units; data translation         Action/Activity against Group, Action/Activity against Population,
code must therefore be sent along with the mobile agents if              Action/Activity in favor of Group, and Peripheral Events]
remote computation is to be done at the distributed data sites.
Although the mapping between Celsius and Fahrenheit is trivial,                           Figure 4: Taxonomy for Direct Events
many such mappings are not. For example, a data site may
contain concentrations of a certain pollutant, say SO2, in a data             A portion of the structure of the action/activity by group
table, while another stores such information in an image with            category is shown in Figure 5. Group members carry out various
various concentrations represented by different colors. Queries          types of activities including political actions, the execution of
requiring a combination of both data sources would therefore             missions, threats of missions (often related to planning), and
require a much more complex data translation algorithm; it is hard       changes in their goals and modus operandi. Each of these types is
further sub-classified until it is refined to a level of classification   of well-known knowledge representation systems. These three
that cannot be specified any further. These atomic actions or             systems all conform to the Open Knowledge-Base Connectivity
activities by the group at the leaf nodes of a hierarchy are directly     (OKBC) protocol, which specifies a set of minimum requirements
observable and reported in the open source literature. For                for       interoperability     between     knowledge     bases
example, a bombing that results in the outcome of death is a              (http://www.ai.sri.com/~okbc/). For ACQUIRE, we are using the
specific observable event with a clear classification.                    Protégé-2000 KRS developed by Stanford Medical Informatics
                                                                          (http://protege.stanford.edu/index.shtml).
                              Action/Activity
                                by Group                                        Protégé is both a Knowledge Representation System and a
                                                                          graphical development tool. It is available free of charge, free
                                                                          from licensing conditions, for all commercial and educational
                                                                          purposes. It is actively updated and supported by its creators at
    Political     Attack          Threat        Planning     Changes
                                                                          SMI, and has a large and diverse user community. Protégé is
                                                                          being used by ACQUIRE for three purposes: as a representation
                                                                          language for an ontology of earth science data; for modeling data
    Hijacking    Kidnapping         Bombing       Assassination           sites and data sets against the ontology; and for querying the data
                                                                          sets. These three functional features will each be described in
 Figure 5: Partial Taxonomy for Actions/Activities by Group               detail below.
                                                                               As a knowledge representation language, Protégé offers a
                                                                          number of beneficial features. The primary one is its
      On the other hand, the executed mission/attack type is at a         compatibility with the OKBC protocol, which allows it to easily
higher level of abstraction and does not specify which type of            integrate partial ontologies that are themselves OKBC compliant.
mission is being undertaken. For example, given three hijacking           Protégé also supports multiple inheritance, which allows class
and two kidnapping actions, one could abstract the knowledge              membership in more than one parent class. Finally, ontologies
that five missions were executed without specifying the nature of         constructed with Protégé can be easily modified and extended
the missions. This kind of organization helps to generate                 without the need for major refactoring of the ontology’s existing
predictions of terrorist actions at various levels of abstraction and     structure. This is important because the ontology is likely to be
confidence. For example, consider the following three rules where         ‘dynamic’, in that it will change over time as the development
the number after each rule represents its confidence and where            team gains more experience with the salient concepts of ontology
100% signifies absolute confidence:                                       construction. In the longer term, this is important because even
IF Militants Captured and Jailed THEN Hijacking (30%)                     well-constructed ontologies are likely to change over time as
IF Militants Captured and Jailed THEN Kidnapping (20%)                    scientific information changes (for example, the taxonomy of
                                                                          species often changes as scientists discover new species or when
IF Militants Captured and Jailed THEN Hijacking & Kidnapping              they learn that known species were previously misclassified).
(10%)
                                                                                Data Modeling in ACQUIRE involves: 1) Ontology
The above three rules can be combined by adding the confidences           generation: defining the semantic types of information available
of the first two rules and subtracting the confidence of the third        from all sources; 2) domain modeling: the description of the
rule, which is the intersection of the sets, to generate a rule with a    actual objects and tables in a data source; and 3) site modeling:
higher level of abstraction:                                              the description of the site where a data source resides. We have
IF Militants Captured and Jailed THEN Attack (40%)                        started exploring the use of Protégé-2000 for all three aspects of
                                                                          data modeling. An example of ontology generation using Protégé
If the event Militants Captured and Jailed occurs then both
                                                                          is shown in Figure 6 below.
terrorist actions Hijacking and Kidnapping would be predicted at
different confidence levels, but the terrorist action Attack, which
is more abstract than Hijacking and Kidnapping, would be
predicted at a higher level of confidence. This kind of prediction
is useful when it is very important just to be aware of a terrorist
threat irrespective of its type.
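
The combination step above is an instance of inclusion-exclusion for two sets.
A minimal sketch, assuming confidences are held as whole-number percentages;
the function name combine_confidence and the dictionary layout are
illustrative and do not come from TACTICS. Only the two-child case given in
the text is covered; a parent with more children would need the higher-order
intersections as well.

```python
def combine_confidence(conf_a: int, conf_b: int, conf_both: int) -> int:
    """Confidence of (A or B) by inclusion-exclusion: A + B - (A and B)."""
    return conf_a + conf_b - conf_both


# Rule confidences (percent) for the antecedent "Militants Captured and Jailed".
rules = {
    "Hijacking": 30,
    "Kidnapping": 20,
    "Hijacking & Kidnapping": 10,
}

# Confidence of the more abstract consequent "Attack".
attack_confidence = combine_confidence(
    rules["Hijacking"],
    rules["Kidnapping"],
    rules["Hijacking & Kidnapping"],
)
```

With the three rules from the text this gives 30 + 20 - 10 = 40, matching the
40% confidence of the abstract rule.
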


4      ONTOLOGY ENCODING
This section describes our use of Protégé for acquiring ontologies
and their representation in a machine readable XML format.


4.1       Protégé-2000                                                                         Site modeling
                                                                                            specifies data type,
                                                                                             location, and data
A Knowledge Representation System (KRS) is a tool for                                         access wrapper
constructing knowledge bases. A KRS contains a set of protocols
that define the allowable structure of a particular ontology. Loom
(isi.edu/isd/LOOM), Protégé-2000 (protege.stanford.edu), and                        Figure 6: Site and Domain Model Ontology
Ontolingua (ksl.stanford.edu/software/ontolingua) are examples
     Once an ontology is created in Protégé, it can be populated      4.2     XML
with instance data. An instance is a concrete instantiation of a
particular class within the ontology (see Figure 7 below). This       In TACTICS, both the event-type taxonomy and the location
process of populating the ontology specifically maps the physical     taxonomy are stored in XML-based text files. XML provides an
location (site modeling) and access information (domain               excellent storage format because it is a good compromise between
modeling) to the abstract data representation language specified      human and machine readability, and a taxonomy is easily extended
by the ontology. The site model tells the system where to find a      by editing the appropriate file. The XML file uses a total of only
data set within the network, while the domain model defines the       three tags and three attributes. The nesting
actual names of tables and columns within that data set. Figure 8     of the elements reflects the hierarchy of the taxonomy. The basic
shows a portion of the text file output corresponding to this         element used is the <Node>, which has a required “name”
ontology.                                                             attribute, specifying the name of the node. The other attributes
                                                                      that may be assigned to the <Node> element are “key” and “ref”.
                                                                      The “key” attribute is used to give a node a unique reference
                                                Data type stored at
                                                 this archive site    name, for those cases where the name attribute is not unique. The
                                                                      “ref” attribute is used when branches in the hierarchy are joined,
                                                                      and specifies a unique <Node> name, or a key value. The other
                                                                      two elements are <Alternate>, which only uses the “name”
                                                   Archive            attribute, and <Comment> which places an arbitrary comment
                                                   location           between the element begin and end tags. The <Alternate> element
                                                                      is used to specify an alternate spelling for a <Node> name. This is
                                                                      especially useful for alternate spellings of place names, dealing
                                                                      with different languages, contractions, and even misspellings. A
                                                   Data               sample of the XML used to describe the event-type taxonomy is
                                                  wrapper
                                                                      shown in Figure 9 below. The sample demonstrates the use of the
                                                                      tags and attributes discussed above.
                                                                        <Node name="Shooting">
                                                                          <Node name="Leader" key="Leader2"/>
                                                                          <Node name="Member" key="Member2">
              Figure 7: Site Model Instance Data                            <Alternate name="Members"/>
                                                                            <Alternate name="member"/>
                                                                          </Node>
 (defclass Data_set_model                                                 <Node name="Civilian" key="Civilian2">
             (is-a USER)                                                    <Alternate name="Civilians"/>
             (role concrete)                                                <Alternate name="Civilian Shooting"/>
             (single-slot Data_wrapper
                          (type INSTANCE)
                                                                            <Alternate name="Shooting Civilian"/>
 ;+                       (allowed-classes Wrapper)                         <Alternate name="Shooting-Civilian"/>
 ;+                       (cardinality 0 1)                               </Node>
                          (create-accessor read-write))                 </Node>
             (single-slot Extent_type                                   <Node name="Imprisonment">
                          (type SYMBOL)                                   <Alternate name="Imprisonement"/>
 ;+                       (allowed-parents Extent)                        <Alternate name="Imprisonemnt"/>
 ;+                       (cardinality 0 1)
                                                                          <Alternate name="Imrisonment"/>
                          (create-accessor read-write))
             (single-slot Data_location                                   <Node ref="Leader2">
                          (type INSTANCE)                                   <Alternate name="Leader Imprisonment"/>
 ;+                       (allowed-classes Data_set_locati                  <Alternate name="leader Imprisonment"/>
 ;+                       (cardinality 0 1)                                 <Alternate name="Leader Imprisonement"/>
                          (create-accessor read-write)))                  </Node>
                                                                          <Node ref="Member2">
 (defclass Data_set_location
                                                                            <Alternate name="Member Imprisonment"/>
             (is-a USER)
             (role concrete)                                                <Alternate name="Imprisonment Member"/>
             (single-slot name_                                             <Alternate name="Member Imprisonemnt"/>
                         (type STRING)                                    </Node>
 ;+                      (cardinality 0 1)                                <Node ref="Civilian2">
                         (create-accessor read-write))                      <Alternate name="Civilian Imprisonment"/>
             (single-slot Repository_URL                                  </Node>
                         (type STRING)
                                                                        </Node>
 ;+                      (cardinality 1 1)
                         (create-accessor read-write)))
                                                                        Figure 9: XML Fragment from the Event Type Taxonomy

           Figure 8: Ontology Encoding in Protégé
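
A taxonomy file of the kind shown in Figure 9 can be read with a standard XML
parser. The sketch below is illustrative only and is not the TACTICS loader:
the enclosing <Taxonomy> root element is an assumption (Figure 9 is a fragment
and does not show the document root), the fragment is abridged, and the helper
name walk and the dictionary names are hypothetical. It resolves each "ref"
against the matching "key" and maps every alternate spelling to the full path
of its node.

```python
import xml.etree.ElementTree as ET

# Abridged from Figure 9; the <Taxonomy> root element is an assumption.
FRAGMENT = """
<Taxonomy>
  <Node name="Shooting">
    <Node name="Leader" key="Leader2"/>
    <Node name="Member" key="Member2">
      <Alternate name="Members"/>
    </Node>
  </Node>
  <Node name="Imprisonment">
    <Alternate name="Imprisonement"/>
    <Node ref="Leader2">
      <Alternate name="Leader Imprisonment"/>
    </Node>
  </Node>
</Taxonomy>
"""

root = ET.fromstring(FRAGMENT)

# First pass: index keyed nodes so that ref="..." can be resolved to a name.
keyed = {n.get("key"): n.get("name") for n in root.iter("Node") if n.get("key")}


def walk(node, path, alternates):
    """Map each alternate spelling to the taxonomy path of the node it names."""
    name = node.get("name") or keyed[node.get("ref")]
    here = path + (name,)
    for alt in node.findall("Alternate"):
        alternates[alt.get("name")] = here
    for child in node.findall("Node"):
        walk(child, here, alternates)


alternates = {}
for top in root.findall("Node"):
    walk(top, (), alternates)
```

After the walk, a misspelled label such as "Imprisonement" resolves to the
path ("Imprisonment",), and "Leader Imprisonment" resolves to
("Imprisonment", "Leader") through the ref/key link.
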
5    COMBINING OUR APPROACHES                                        [4] Das, S., Shuster, K., and Wu, C. “Agent-based Complex
                                                                         Querying and Information Retrieval Engine”, to appear in
We have seen how ontologies can be used for sequence mining of           the Proceedings of the First International Joint Conference
terrorist threats and for the retrieval of heterogeneous and             on Autonomous Agents and Multi-agent Systems (AAMAS
distributed data. Although we have not yet done so, we foresee           2002), Bologna, Italy, July 2002.
much potential for a system that combines these two approaches
into a single, comprehensive system. Such a system could             [5] Das, S. and Ruda, H. “Predicting Significant Events via
potentially automate the task of sequence discovery in large             Sequence Learning”, to be presented at the ECAI Workshop
bodies of scientific data, such as NASA’s massive Earth Science          on Knowledge Discovery from Temporal and Spatial Data,
data archives. Because of the tremendous volume of such data,            Lyon, France, July 2002.
sequence mining and other knowledge discovery methods                [6] Fensel, D. (2001) “Ontologies: A Silver Bullet for
traditionally require large, time-consuming data transfers. With a       Knowledge Management         and    Electronic   Commerce”.
mobile agent approach, the data can be analyzed for sequences at         Springer-Verlag.
the storage site, thus allowing a much larger corpus of data to be
analyzed.                                                            [7] Foley, P., Mamaghani, F., and Birkel, P. The Synthetic
                                                                         Environment Data Representation and Interchange
       One of the drawbacks of the TACTICS system is that data           Specification      (SEDRIS)        development project
must be fed in manually from news sources such as newspaper              (http://www.sedris.org/pr11trpl.htm).
articles and TV reports. An automated data retrieval system that
collects news items from a database could substantially facilitate   [8] Frank, R. and Kemp, Z. (2001) Ontologies for Knowledge
data acquisition. This would, of course, require a suitable              Discovery in Environmental Information Systems. In
ontology of news article ‘topics’, along with a significant amount       Raffacto A and Renso C, editors, International Conference
of manual work dedicated to classifying news archives against            Logic programming ICLP'01 Workshop Proceedings CRGD:
this ontology. Research in the field of automatic text                   Complex Reasoning on Geographical Data, December 2001.
understanding and classification would certainly be relevant here.   [9] Genesereth, M. R. (1991). “Knowledge Interchange Format”.
                                                                         In Proceedings of the Second International Conference on
6    CONCLUSIONS                                                         the Principles of Knowledge Representation and Reasoning
                                                                         (KR-91), Kaufman, pp 238-249.
In this paper, we have presented two very practical problems in
the areas of distributed information retrieval and pattern mining,   [10] Kotz, D. and Gray, R. (1999). Mobile Agents and the Future
and raised and addressed several issues in relation to our use of        of the Internet. ACM Operating Systems Review, August
intelligent agents and domain ontologies as proposed solutions to        1999, pp. 7-13.
the problems. We have described our use of Protégé for               [11] Lenat, D. B. "Cyc: A Large-Scale Investment in Knowledge
constructing ontologies and subsequent representation in a               Infrastructure." Communications of the ACM 38, no. 11
machine readable format. Our future plan is to continue                  (November 1995).
addressing the issues that are raised in Section 1, including the
ones related to the use of existing domain ontologies such as Cyc    [12] Niles, I. and Pease, A. (2001). Towards a Standard Upper
and EDCS. We will then address the task of combining the                 Ontology. In C. Welty and B. Smith (Eds.) Formal Ontology
process of information retrieval with pattern discovery by using a       in Information Systems: Collected Papers from the Second
single domain ontology to accomplish both tasks concurrently.            International Conference. New York: ACM Press, pp. 2-9.
                                                                     [13] Noy, N. F., Fergerson, R. W., and Musen, M. A. (2000). The
7    REFERENCES                                                          knowledge       model     of      Protege-2000:     Combining
                                                                         interoperability and flexibility. 2nd International Conference
[1] Arens, Y., Chee, C. Y., Hsu, C-N., In, H., and Knoblock, C.          on Knowledge Engineering and Knowledge Management
    A. (1993). Retrieving and integrating data from multiple             (EKAW'2000), Juan-les-Pins, France, 2000.
    information sources. International Journal on Intelligent and
    Cooperative Information Systems, Vol. 2, pp. 127-158.            [14] Sowa, J. (2000). “Knowledge Representation” Brooks/Cole.
[2] Birkel, P. (1999) “SEDRIS Data Coding Standard”, In              [15] Uschold, M., King, M., Moralee, S., and Zorgios, Y. (1998)
    Proceedings of the Spring Simulation Interoperability                The Enterprise Ontology The Knowledge Engineering
    Workshop, March 1999, 99S-SIW-011.                                   The Enterprise Ontology. The Knowledge Engineering
                                                                         (eds. Mike Uschold and Austin Tate). (Also available from
[3] Bishr, Y. and Kuhn, W. (2000) Ontology-Based Modelling               Artificial Intelligence Application Institute (AIAI),
    of Geospatial Information. In Proceedings of the 3rd AGILE           University of Edinburgh, Scotland, as AIAI-TR-195).
    Conference     on   Geographic      Information     Science,
    Helsinki/Espoo, May 25-27.                                       [16] Widom, J. (1996). “Integrating Heterogeneous Databases:
                                                                         Lazy or Eager?”, ACM Computing Surveys, Vol. 2.

</pre>