=Paper=
{{Paper
|id=Vol-66/paper-4
|storemode=property
|title=Ontologies for Agent-Based Information Retrieval and Sequence Mining
|pdfUrl=https://ceur-ws.org/Vol-66/oas02-15.pdf
|volume=Vol-66
|authors=Subrata Das,Kurt Shuster and Curt Wu
}}
==Ontologies for Agent-Based Information Retrieval and Sequence Mining==
Subrata Das, Kurt Shuster, Curt Wu
Charles River Analytics, Inc.
625 Mount Auburn St.
Cambridge, MA 02138
+1 617 491 3474
{sdas, kshuster, cwu}@cra.com
ABSTRACT
In this paper, we present two very practical problems in the areas of distributed information retrieval and pattern mining, as well as our proposed solutions via the use of intelligent agents and domain ontologies. The first problem is to retrieve data from heterogeneous distributed data sources, with a specific application to distributed Earth Science data archives. Our proposed approach is to develop an engine which acts as an interface agent by presenting users with the appearance of a single, unified, homogeneous data source based on a domain ontology of Earth Science terminology. Users can then pose high-level declarative queries against this view. The system then translates each query into a set of sub-queries and spawns mobile agents to retrieve data corresponding to each sub-query. The second problem is to predict significant world events at multiple levels of abstraction by analyzing a collection of events over a period of time in order to generate sequential patterns. We specifically focus on predicting terrorist actions by analyzing terrorist group activities over time. We employ a hierarchical taxonomic organization of contextual event types to obtain higher-level abstractions of observed low-level events. With this approach, significant events can be predicted at multiple levels of abstraction with associated confidences. Although we have addressed these two problems by building prototypes in two different domains, their combination offers a powerful agent-based tool that can assist scientists and analysts by automatically retrieving and mining data collected from multiple distributed data sources. Thus, with the use of relevant domain ontologies, the problems of data retrieval and pattern discovery can be combined and automated in a single, elegant system.

Keywords
Agent, Ontology, Taxonomy, Distributed Information Retrieval, Sequence Mining.

1 INTRODUCTION
The exponential growth of the Internet in recent years has given analysts (e.g. counterterrorism analysts) and scientists (e.g. space and environmental scientists) an opportunity to access large amounts of open-source and classified data that are routinely collected and stored on a continuous basis by many large corporations and government agencies. Some important uses of such data include predicting future terrorist activities, discovering new space phenomena, and predicting weather patterns and global warming. However, the proprietary nature of these data sources often requires that the data be stored in a number of independent repositories distributed over a network. Because of the large volumes of data stored and the large number of distinct data archives in which the data is located, scientists and analysts often face a daunting task when searching for specific data or series of interrelated data. Moreover, each of these data archives is responsible for a particular domain and autonomously maintains its data in its own distinct format. Consequently, users have to learn the format or metadata information of individual data sources. Thus, we see a need for a tool that would automatically identify and retrieve data from distributed sources based on high-level user queries.

A large amount of research has been directed toward the problem of querying and integrating heterogeneous data from distributed sources. Simplified methods for querying such data sources, which may include traditional databases, knowledge bases, programs, Web pages, and data files, can broadly be categorized into the following two approaches (Widom, 1996): 1) a lazy or on-demand approach, where information is extracted from the sources only when the queries are posed; and 2) an eager or in-advance approach, where relevant information is extracted in advance in anticipation of queries and stored in a central repository. It is simply not practical to create another data repository from several data sources that are already huge and maintained autonomously. Thus adopting an on-demand approach for distributed heterogeneous databases seems quite appropriate, though such an approach to data retrieval requires an infrastructure for retrieving data from distributed data sources based on the query requests. Mobile agent based data retrieval offers several advantages, including remote computation, robustness to network connection interruption, and autonomy. Such an agent is an autonomous agent with behavior, state, and location.

But irrespective of the approach adopted for integrating heterogeneous distributed data sources, it is necessary to provide users with a single, unified, homogeneous interface through which they can pose high-level declarative queries to retrieve data from distributed data sources. This helps users to avoid the time-consuming process of learning individual data sources. One effective approach to building a unified interface to heterogeneous distributed data sources is via the use of a unified domain ontology. An ontology in a particular domain is a description of the concepts and relationships that can exist in the domain (Sowa, 2000). One of the primary purposes of constructing an ontology is to provide a standard, unambiguous representation of a particular domain of knowledge (Arens et al., 1993). Ontologies have been built and used successfully in constructing multi-contextual knowledge bases, including
common-sense knowledge bases like Cyc (Lenat, 1995), as well as enterprise knowledge (Uschold et al., 1998) and the environmental science ontology EDCS (Birkel, 1999). Various ontology representation schemes and acquisition tools are now available, such as XML, Protégé (Noy et al., 2000), and KIF (Genesereth, 1991).

However, there are several issues that must be addressed during the process of building an ontology for a particular domain:
• Ontological Structure
− The type of ontology must be chosen based on the given task, with several options available, such as frame-based ontologies, task-based ontologies, and others (Fensel, 2001).
− Many standardized language choices are available (e.g. KIF, OKBC).
− It is often impractical to independently create entire ontologies due to the large size of the domain of interest; therefore several third-party ontologies may need to be integrated.
• Ontology maintenance/evolution
− The domain may be very specific to a particular field (e.g. oceanic zonation terminology in (Frank and Kemp, 2001)), requiring expert assistance for generation.
− Ontologies that may change over time must be adaptable.
• Upper-level Ontologies
− If diverse ontologies must be integrated, then semantic discrepancies need to be rectified. This may require a high-level upper ontology (e.g. the Cyc upper ontology (Lenat, 1995) or SUMO (Niles and Pease, 2001)).
• Populating
− Much work needs to be done to manually map individual data sources to a global ontology, potentially requiring partial automation of the task.

Our use of mobile agents for distributed information retrieval raises further issues regarding their effective operation within an ontological framework:
• Mobile agents and ontologies
− As agents hop from site to site, it is sometimes necessary that each agent carry the entire domain ontology and the translation mechanism for each site it is likely to visit. This makes an agent bulkier and therefore slower to move within the network.
− Mapping from an individual database schema to the global ontology is not trivial; programmatic mapping may be required at the data source (e.g. converting Fahrenheit to Celsius). Mobile agents will thus have to carry with them all relevant mapping and translating code.

We are currently addressing the above-mentioned issues within our two ongoing projects: 1) information retrieval from distributed Earth Science data sources (Das, Shuster, and Wu, 2002), funded by NASA; and 2) sequence mining for terrorist threat prediction (Das and Ruda, 2002), funded by DARPA. Our initial focus is to build ontologies in two domains, environmental science and asymmetric threat prediction, including their acquisition via Protégé and subsequent representation in a machine-readable XML format. Our approach to the use of ontologies is generic, in the sense that for a particular domain, metadata information from individual sources will be translated to a uniform representation with the use of a single ontology of the domain concerned. Users will pose a query with the ontology in mind and the system will automatically decompose queries into subqueries that are understood by individual data sources.

The rest of the paper is organized as follows. The following section briefly describes the two projects and our approach. Section 3 describes our use of ontologies in these projects, specifically the organization of ontologies in hierarchical taxonomies. Section 4 describes our use of Protégé for acquiring ontologies and their machine-readable representations in XML. Finally, Section 5 briefly describes our plan to combine the two approaches into an integrated information retrieval and sequence mining system.

2 THE PROBLEMS
This section introduces the two problems that we are currently dealing with and our approach, especially with the use of domain ontologies. For more details on these projects, readers are referred to (Das, Shuster, and Wu, 2002; Das and Ruda, 2002).

2.1 Information Retrieval from Distributed Earth Science Data Archives (ACQUIRE)
NASA’s Earth Science Division continuously collects and stores vast amounts of environmental data for use by a large and diverse community of research scientists, engineers, and analysts. This data comes from a wide variety of sources, including orbiting satellites, weather stations, research aircraft, and others. Various Distributed Active Archive Centers (DAACs) around the globe collect and maintain this data on behalf of NASA; each of these DAACs is responsible for a particular domain and maintains its data in its own distinct format. Researchers who require data stored in these archives often spend a great deal of time locating and integrating the specific data they require. The process would be much simpler and faster if there existed a single, homogeneous data repository or the appearance (from the user’s point of view) of such a single repository. In this case, the user would not need to ‘find’ the location of any data, since all of it would appear to be located in the same place. Furthermore, the user could construct his exact query in the form of a suitable database query language, such as SQL.

We have developed (Das, Shuster, and Wu, 2002) an Agent-based Complex QUerying and Information Retrieval Engine (ACQUIRE) for heterogeneous and distributed data sources, and subsequently tested the system on simulated Earth Science data repositories. ACQUIRE implements the following three stages:
• Accepts a query from a user and decomposes it appropriately into a set of sub-queries using site and domain models of the distributed data stores
• Intelligently creates an optimized plan for retrieving answers to these sub-queries over a network and spawns a set of intelligent mobile agents to delegate these tasks
• Appropriately merges the answers returned by the mobile agents and then returns them to the user

Our on-demand approach to data retrieval requires an infrastructure for retrieving data from distributed data sources based on the query requests that are generated from the ACQUIRE front-end. We have used a mobile agent approach (Kotz and Gray, 1999), where such an agent is defined as a named object which contains code, persistent state, data, and a set of attributes such as movement history and authentication. A mobile agent can transport itself from one data server host to another as needed for accomplishing its tasks, such as searching for relevant data. Such an approach provides distributed querying at sites where the relevant data is available instead of shipping large volumes of data across the network. Unlike remote procedure calls, ongoing interactions do not require ongoing communication in a mobile agent approach. An agent can perform actions with a certain degree of autonomy, such as finding alternate routes in the event of a network failure. Another feature of a mobile agent approach is the ability of agents to carry arbitrary computations to the data storage site. This allows for greater flexibility when retrieving and processing remote data, as relevant data-processing code can be customized to the particulars of a given query. Numerous applications of mobile agents exist, including remote database access, on-line shopping, and communicating with travelers. Some of the commercial-off-the-shelf (COTS) software packages for mobile agents are IBM’s Aglets, ObjectSpace’s Voyager, and Mitsubishi Electric ITA’s Concordia. For our effort we explored several possible COTS packages for implementing the mobile agents, and we eventually selected the Grasshopper system from IKV Corporation (www.grasshopper.de).

2.2 Sequence Mining for Significant Terrorist Action Prediction (TACTICS)
The growing digitization of asymmetric warfare and the exponential growth of the Internet in recent years have given the counterterrorism analyst an opportunity to access large amounts of open-source data. One effective use of such data is for generating past terrorist activity patterns to predict future terrorist activities. However, the manual extraction of hidden patterns within a large, unorganized volume of open-source data is a nearly insurmountable task. What is required is an automated technique that can detect useful patterns within data gathered from open sources. We have developed (Das and Ruda, 2002) one such technique, where the goal is to make accurate predictions of future events based on patterns extracted from past history, thereby supporting reliable behavior prediction and threat assessment for counterterrorism.

Our recent DARPA-sponsored effort under the TACTICS program has so far been restricted to terrorist activity by a particular terrorist group (the name and other specifics relating to the actual group being studied are not disclosed for reasons of personal security) and its activities during a particular time frame. The past history of the terrorist group activities during the period is represented as a sequence of events. These events include both significant events such as actual terrorist attacks, as well as non-attack events (e.g. leaders’ visits abroad). In order to represent all the possible events involving terrorist group activities, an event taxonomy has been created that organizes the events into a hierarchical structure. The event taxonomy is applied when events are extracted, and the hierarchical form of the taxonomy is especially useful when only scant information is available about an event. The taxonomy can also be used to generate temporal rules at various levels of abstraction.

The events that are collected from open sources and organized hierarchically are then used by machine learning (ML) algorithms to recognize temporal patterns of behavior and to discover behavioral rules. These rules are used to predict future activities based on current data/events. Initial results are promising, indicating that terrorist attacks can actually be predicted with a hit rate of 88% (i.e., only 12% of attacks were not predicted) and a false-alarm rate of 37%.

3 ONTOLOGIES AND TAXONOMIES
An ontology is an abstract model of a particular field of knowledge. An ontology describes concepts, attributes of concepts, and the relationships between concepts. For example, the taxonomy of species in biology is a type of ontology which classifies all known biological organisms by Kingdom, Phylum, Class, Order, Family, Genus, and Species. The system is hierarchical in nature, such that any organism in the hierarchy possesses all of the attributes of the higher-level classification units to which it belongs. For example, Phylum Chordata consists of all animals that have a notochord. Classes Mammalia and Reptilia both belong to this phylum, and thus they both share the common attribute of possessing a notochord. An instance is a concrete instantiation of a particular class within the ontology. So whereas “African Elephant”, “Grey Wolf”, and “Saber-toothed Tiger” represent different species within the ontology of organisms, “Dumbo”, “Spot”, and “Fluffy” are specific instances of those species. A knowledge base is a data structure which contains both an ontology and specific instances.

One of the primary purposes of constructing an ontology is to provide a standard, unambiguous representation of a particular domain of knowledge. This facilitates communication between domain experts in a given field. If a biologist discovers a new species, she can specify its kingdom, phylum, etcetera, and other biologists will understand without ambiguity the attributes of the new species, since they all share the same vocabulary. The following two subsections describe the use of ontologies and taxonomies in two of our ongoing projects, ACQUIRE and TACTICS.

3.1 Ontology of Earth Science Data in ACQUIRE
In ACQUIRE, the domain of discourse is Earth Science data, and thus we require an ontology of Earth Science terms, including standard definitions of space, time, weather, etcetera. This ontology serves as a common reference linking the diverse and nonuniform naming schemes used in the various data sets stored in NASA’s DAAC system. For example, data from two different DAACs may contain temperature data for different regions of the earth. One data source may store the temperature in a column labeled “temp”, while the other uses “temperature”. To resolve this issue (known as the polysemy and synonymy problem), ACQUIRE’s common earth science ontology will contain a TEMPERATURE class that unambiguously denotes all
temperature measurements. All data sources accessible to
ACQUIRE will require a mapping between the data set’s
idiosyncratic naming convention and ACQUIRE’s universal
ontology. Thus the “temp” data and the “temperature” data
can both be accessed with a single query for TEMPERATURE.
Note that there are two distinct mapping steps in the process. The
first mapping is done off-line when the data source is added to
ACQUIRE’s list of available repositories. A system administrator
must perform this one-time mapping, known as data modeling, for
each data source when the data source is added. The second
mapping is the dynamic data acquisition performed by ACQUIRE
during actual data retrieval. The software automatically performs
this operation whenever a data source is accessed, thus providing
the ‘transparency’ of the system’s data retrieval functionality.
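The two mapping steps can be illustrated with a minimal Python sketch. It is not ACQUIRE's implementation: the names `DataSource`, `Catalog`, `register`, and `retrieve`, the REGION class, and the sample readings are assumptions made purely for illustration. Here `register` plays the role of the one-time, off-line data modeling step, and `retrieve` plays the role of the dynamic translation performed at retrieval time.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    """A hypothetical data source with its own column naming convention."""
    name: str
    rows: list  # list of dicts keyed by the source's local column names
    # Result of off-line data modeling: ontology class -> local column name.
    mapping: dict = field(default_factory=dict)


class Catalog:
    """Hypothetical registry of sources mapped against the global ontology."""

    def __init__(self, ontology_classes):
        self.ontology_classes = set(ontology_classes)
        self.sources = []

    def register(self, source, mapping):
        """One-time, off-line mapping step performed by an administrator."""
        unknown = set(mapping) - self.ontology_classes
        if unknown:
            raise ValueError(f"not in ontology: {sorted(unknown)}")
        source.mapping = dict(mapping)
        self.sources.append(source)

    def retrieve(self, ontology_class):
        """Dynamic step: translate the ontology class to each local name."""
        results = []
        for source in self.sources:
            column = source.mapping.get(ontology_class)
            if column is None:
                continue  # this source holds no data of the requested class
            for row in source.rows:
                results.append((source.name, row[column]))
        return results


# Two sources store the same kind of data under different column names.
catalog = Catalog({"TEMPERATURE", "REGION"})
daac_a = DataSource("daac_a", [{"temp": 271.3}, {"temp": 268.9}])
daac_b = DataSource("daac_b", [{"temperature": 301.2}])
catalog.register(daac_a, {"TEMPERATURE": "temp"})
catalog.register(daac_b, {"TEMPERATURE": "temperature"})

# A single ontology-level request reaches both naming schemes.
readings = catalog.retrieve("TEMPERATURE")
```

In this sketch `readings` collects values from both sources even though neither uses the name TEMPERATURE locally, which is the kind of transparency the two mapping steps are meant to provide.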
A second reason for employing an ontological approach to data retrieval is that it allows for much greater flexibility in query structure. For example, a researcher may wish to know the total precipitation over a given region and time period. Specific NASA archives may store various types of precipitation (e.g. one that stores snowfall over a given region, another that stores rainfall). If a user wants to know the total precipitation, he would have to query both snowfall and rainfall data sources independently, and then combine the results. With an ontological approach, he can simply specify “precipitation” in his query, and the system would automatically recognize snowfall and rainfall as subclasses of precipitation. The system will then return all data sets that store rainfall, snowfall, and any other type of precipitation. Alternatively, he can simply specify “snowfall” in his query, and the system would then only retrieve “snowfall” data sets.

As Earth Science data is the primary type of information stored at NASA’s DAACs, it is necessary to create an ontology of Earth Science terms, data types, etc. Because Earth Science data typically involves measurements of a particular region at a particular time, the ontology must include two primary measurement types: those of spatial and temporal values (Bishr and Kuhn, 2000). Although most information stored in the DAAC system is geospatial in nature, much of the data contains extremely domain-specific terminology. For example, an ontology of oceanic zonation terms (Frank and Kemp, 2001) is shown below in Figure 1 and Figure 2.

Figure 2: A Marine Ontology (Frank and Kemp, 2001)

Due to this high level of specificity, it is essential that third-party ontologies created by domain experts be easily integrated with ACQUIRE’s high-level upper ontology. Integrating diverse ontologies will be crucial for realizing NASA’s goal of a distributed, virtually-centralized, and semantically-rich database system. Until recently, however, a major problem with integrating diverse ontologies has been the lack of a high-level upper ontology to serve as a foundation for more domain-specific ones. Typically, domain-specific ontologies either define their own high-level concepts or leave them out entirely. These high-level semantic differences between diverse domains have restricted the integration of ontologies from vastly different fields. The Suggested Upper Merged Ontology (SUMO) (Niles and Pease, 2001) is an IEEE effort to create a standard upper ontology which will allow semantic integration of diverse domain ontologies through shared high-level concepts. ACQUIRE will utilize SUMO as a foundation for the automatic integration of domain-specific ontologies for large, heterogeneous data sources.

In ACQUIRE, a “query” is an abstract data type that encapsulates both a request for data and any data-processing code to be applied to that data. A query is generally constructed from a higher-level “interface query” which depends on the particular user interface being employed. For example, ACQUIRE could employ an SQL interface in which the user enters a query as a standard SQL string. This string would then be translated to ACQUIRE’s internal query structure before being decomposed into individual subqueries to be retrieved by mobile agents. Alternatively, the interface may be a natural language system that takes English sentences as input and translates that input into ACQUIRE’s internal query representation. This way, ACQUIRE can accommodate any interface so long as it translates the user’s request into ACQUIRE’s internal query data structure. The details of this data structure are beyond the scope of this paper, but in general the structure is much like that of a parsed SQL query, with additional fields corresponding to any data processing code.
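Since the details of the internal query structure are not given here, the following Python sketch is only one possible reading of "a parsed SQL query with additional fields for data-processing code". The `Query` class, its field names, the `from_sql` translator, the restricted SELECT grammar it accepts, and the EARTH_SCIENCE view name are all hypothetical and are not taken from ACQUIRE.

```python
import re
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, List, Optional


@dataclass
class Query:
    """Hypothetical internal query: a parsed request plus processing code."""
    select: List[str]            # ontology data types requested
    source: str                  # name of the unified, ontology-based view
    where: Optional[str] = None  # filter condition, kept as plain text here
    processing: List[Callable] = field(default_factory=list)

    def apply_processing(self, values):
        """Run each attached processing step, in order, over the values."""
        for step in self.processing:
            values = step(values)
        return values


# Accepts only "SELECT <cols> FROM <view> [WHERE <condition>]".
_SQL = re.compile(
    r"^\s*SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<src>\w+)"
    r"(?:\s+WHERE\s+(?P<cond>.+?))?\s*;?\s*$",
    re.IGNORECASE,
)


def from_sql(text, processing=()):
    """Translate a restricted SQL interface query into a Query object."""
    match = _SQL.match(text)
    if match is None:
        raise ValueError(f"unsupported interface query: {text!r}")
    cols = [c.strip() for c in match.group("cols").split(",")]
    return Query(cols, match.group("src"), match.group("cond"), list(processing))


# An SQL-style interface query with a statistical step attached.
query = from_sql(
    "SELECT TEMPERATURE FROM EARTH_SCIENCE WHERE REGION = 'arctic'",
    processing=[mean],
)
```

A natural language front end could build the same `Query` object by a different route, which is the point of keeping the interface query separate from the internal structure.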
Figure 1: Oceanic Zonation (Frank and Kemp, 2001)

Once the query is requested by the ACQUIRE interface, it must be decomposed into a series of subqueries corresponding to the actual physical location of the data and the particulars of the data schema used. This is done in three primary stages:

First, ACQUIRE breaks the query into retrieval units based on the physical location of the data types requested. So, if the query requires data of type “atmospheric-ozone” and “polar-ice-thickness”, the system queries its catalog of data sites that contain data of this type, and creates a retrieval agent for each one. In this example, “atmospheric-ozone” and “polar-ice-thickness” were previously defined in the ontology of Earth Science terminology, and any data sources containing information of this type were previously cataloged by an administrator.

The next step is to optimize the query. Suppose the query was for all polar ice thickness measurements taken when atmospheric ozone levels were above a certain threshold. The system would prioritize the retrieval by first retrieving all atmospheric ozone levels and then directing the polar ice retrieval agents to retrieve polar ice data only from those regions and times.

The final step is to map each agent’s ontology-based data type against the data schema of the data site at which it is stored. For this process, a wrapper is created which maps the particulars of the data site schema to the ontology-based description. So, if a data site stores polar ice thickness in a relational database table called “ICE” and a column called “THICKNESS”, the wrapper would consist of an appropriate SQL query that selects THICKNESS data from table ICE. The wrapper also contains any data-processing code required. So if the thickness data is stored in feet but the user wants it in meters, then translation code will be sent along with the agent to perform the translation at the site. Additionally, if the query requested only the mean values, then code to perform this (or any other) statistical operation will also be included.

Once the query is decomposed and the retrieval agents generated, the system spawns the mobile agents and waits for the results to return, at which time it merges the results and presents them to the user via the user interface. Notice that, in some cases, some agents may not leave until other ones have returned with required intermediate data, as described above.

It should be noted that in the current incarnation of ACQUIRE, all data accessible by the system must be manually modeled and mapped against the global ontology. Clearly, any attempt to integrate large numbers of data sites will require a substantial manual data modeling effort. In addition, any data sites that change after being mapped must be remapped against the data site catalog. One potential solution to this problem would be to send agents to unmapped data sites along with the entire domain ontology and code for automated data site mapping. Work on the Cyc project (Lenat, 1995) has been done in the area of automated database understanding, and such an approach could be used with our mobile agents to determine site contents. This approach still has many inherent problems to overcome, however, such as the large size of the agents required to transmit both the ontology and data-analysis code.

Another problem to address is that of unit type translation at the data source. For example, one site may store temperature data in Celsius while another uses Fahrenheit units; data translation code must therefore be sent along with the mobile agents if remote computation is to be done at the distributed data sites. Although the mapping between Celsius and Fahrenheit is trivial, many such mappings are not. For example, a data site may contain concentrations of a certain pollutant, say SO2, in a data table, while another stores such information in an image with various concentrations represented by different colors. Queries requiring a combination of both data sources would therefore require a much more complex data translation algorithm; it is hard to imagine a system in which this type of translation would not require customized processing code for each data site representation.

3.2 Ontology and Taxonomy in TACTICS
In TACTICS, the domain of discourse is terrorist threat prediction, and thus we have defined an ontology of terrorist activity terms, including standard definitions of attack, threat, propaganda, etcetera. The past history of the terrorist activities during the period considered is represented as a sequence of events. These events include both significant events such as actual terrorist attacks, as well as non-attack events (e.g. leaders’ visits abroad). The procedure for collecting the events using the developed ontology is currently semi-automated. Newspaper articles and other sources are searched for connections to the group under consideration, and matching articles are stored in a database. Trained analysts then scrutinize these articles for events, and any events are represented according to the event type taxonomy (discussed below) and stored in the database as well. The extracted events are then used by a sequence learning engine to generate meaningful temporal rules.

We have developed a taxonomy for contextual event types for a terrorist group. Contextual events form the top node of the hierarchy, and represent incidents that occur in regions of interest and can be related to the group being studied. The taxonomy for contextual event types is shown in Figure 3. The set of all contextual event types has been categorized into direct events, regular occurrences, and indirect events.

Contextual Events
− Direct Events
− Regular Occurrences
− Indirect Events
Figure 3: Taxonomy for Contextual Events

Direct events are incidents that can be directly related to the group. Figure 4 shows that the set of all direct events has been categorized into action/activity by group, action/activity against group, action/activity against population, action/activity in favor of group, and peripheral events. Of these five sub-categories, we focus on the action/activity by group category, which includes events resulting from actions directly executed by group members.

Direct Events
− Action/Activity by Group
− Action/Activity against Group
− Action/Activity against Population
− Action/Activity in favor of Group
− Peripheral Events
Figure 4: Taxonomy for Direct Events

A portion of the structure of the action/activity by group category is shown in Figure 5. Group members carry out various types of activities including political actions, the execution of missions, threats of missions (often related to planning), and changes in their goals and modus operandi. Each of these types is
further sub-classified until it is refined to a level of classification that cannot be specified any further. These atomic actions or activities by the group at the leaf nodes of a hierarchy are directly observable and reported in the open source literature. For example, a bombing that results in the outcome of death is a specific observable event with a clear classification.

Action/Activity by Group
− Political
− Attack: Hijacking, Kidnapping, Bombing, Assassination
− Threat
− Planning
− Changes
Figure 5: Partial Taxonomy for Actions/Activities by Group

On the other hand, the executed mission/attack type is at a higher level of abstraction and does not specify which type of mission is being undertaken. For example, given three hijacking and two kidnapping actions, one could abstract the knowledge that five missions were executed without specifying the nature of the missions. This kind of organization helps to generate predictions of terrorist actions at various levels of abstraction and confidence. For example, consider the following three rules, where the number after each rule represents its confidence and where 100% signifies absolute confidence:
IF Militants Captured and Jailed THEN Hijacking (30%)
IF Militants Captured and Jailed THEN Kidnapping (20%)
IF Militants Captured and Jailed THEN Hijacking & Kidnapping (10%)

The above three rules can be combined by adding the confidences of the first two rules and subtracting the confidence of the third rule, which is the intersection of the sets (30% + 20% − 10% = 40%), to generate a rule with a higher level of abstraction:
IF Militants Captured and Jailed THEN Attack (40%)

If the event Militants Captured and Jailed occurs then both terrorist actions Hijacking and Kidnapping would be predicted at

of well-known knowledge representation systems. These three systems all conform to the Open Knowledge-Base Connectivity (OKBC) protocol, which specifies a set of minimum requirements for interoperability between knowledge bases (http://www.ai.sri.com/~okbc/). For ACQUIRE, we are using the Protégé-2000 KRS developed by Stanford Medical Informatics (http://protege.stanford.edu/index.shtml).

Protégé is both a Knowledge Representation System and a graphical development tool. It is available free of charge, free from licensing conditions, for all commercial and educational purposes. It is actively updated and supported by its creators at SMI, and has a large and diverse user community. Protégé is being used by ACQUIRE for three purposes: as a representation language for an ontology of earth science data; for modeling data sites and data sets against the ontology; and for querying the data sets. These three functional features will each be described in detail below.

As a knowledge representation language, Protégé offers a number of beneficial features. The primary one is its compatibility with the OKBC protocol, which allows it to easily integrate partial ontologies that are themselves OKBC-compliant. Protégé also supports multiple inheritance, which allows class membership in more than one parent class. Finally, ontologies constructed with Protégé can be easily modified and extended without the need for major refactoring of the ontology’s existing structure. This is important because the ontology is likely to be ‘dynamic’, in that it will change over time as the development team gains more experience with the salient concepts of ontology construction. In the longer term, this is important because even well-constructed ontologies are likely to change over time as scientific information changes (for example, the taxonomy of species often changes as scientists discover new species or when they learn that known species were previously misclassified).

Data Modeling in ACQUIRE involves: 1) ontology generation: defining the semantic types of information available from all sources; 2) domain modeling: the description of the actual objects and tables in a data source; and 3) site modeling: the description of the site where a data source resides. We have started exploring the use of Protégé-2000 for all three aspects of data modeling. An example of ontology generation using Protégé is shown in Figure 6 below.
different confidence levels, but the terrorist action Attack, which
is more abstract than Hijacking and Kidnapping, would be
predicted at a higher level of confidence. This kind of prediction
is useful when it is very important just to be aware of a terrorist
threat irrespective of its type.
4 ONTOLOGY ENCODING
This section describes our use of Protégé for acquiring ontologies
and their representation in a machine readable XML format.
4.1 Protégé-2000 Site modeling
specifies data type,
location, and data
A Knowledge Representation System (KRS) is a tool for access wrapper
constructing knowledge bases. A KRS contains a set of protocols
that define the allowable structure of a particular ontology. Loom
(isi.edu/isd/LOOM), Protégé-2000 (protege.stanford.edu), and Figure 6: Site and Domain Model Ontology
Ontolingua (ksl.stanford.edu/software/ontolingua) are examples
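The split between site modeling and domain modeling in ACQUIRE can be illustrated with a minimal Python sketch. It is not the ACQUIRE implementation: the class fields loosely mirror the slots of the Protégé classes in Figure 8 (Data_wrapper, Extent_type, Data_location, Repository_URL), while the wrapper fields, the semantic type name, the archive URLs, the table and column names, and the function plan_query are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class DataSetLocation:
    """Site model: where a data set resides on the network."""
    repository_url: str
    name: Optional[str] = None


@dataclass(frozen=True)
class Wrapper:
    """Domain model: actual table and column names (hypothetical fields)."""
    table: str
    columns: Dict[str, str]  # ontology term -> physical column name


@dataclass(frozen=True)
class DataSetModel:
    """Ties a semantic type from the ontology to one concrete data set."""
    extent_type: str
    data_location: DataSetLocation
    data_wrapper: Wrapper


def plan_query(models: List[DataSetModel], extent_type: str,
               terms: List[str]) -> List[Tuple[str, str, List[str]]]:
    """Translate ontology terms into (URL, table, columns) per matching data set."""
    plans = []
    for model in models:
        if model.extent_type != extent_type:
            continue
        mapping = model.data_wrapper.columns
        physical = [mapping[t] for t in terms if t in mapping]
        plans.append((model.data_location.repository_url,
                      model.data_wrapper.table, physical))
    return plans


# Hypothetical instance data: two archives storing the same semantic type
# under different physical names.
MODELS = [
    DataSetModel("SeaSurfaceTemperature",
                 DataSetLocation("http://archive-a.example/data", "Archive A"),
                 Wrapper("sst_daily", {"temperature": "temp_k", "time": "obs_time"})),
    DataSetModel("SeaSurfaceTemperature",
                 DataSetLocation("http://archive-b.example/store", "Archive B"),
                 Wrapper("ocean_obs", {"temperature": "sea_surface_temp",
                                       "time": "timestamp"})),
]

plans = plan_query(MODELS, "SeaSurfaceTemperature", ["temperature", "time"])
```

A query phrased in ontology terms is thus resolved separately for each archive: the site model supplies the location, and the domain model supplies the physical names that a wrapper would use when accessing the data.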
Once an ontology is created in Protégé, it can be populated with instance data. An instance is a concrete instantiation of a particular class within the ontology (see Figure 7 below). This process of populating the ontology specifically maps the physical location (site modeling) and access information (domain modeling) to the abstract data representation language specified by the ontology. The site model tells the system where to find a data set within the network, while the domain model defines the actual names of tables and columns within that data set. Figure 8 shows a portion of the text file output corresponding to this ontology.

Figure 7: Site Model Instance Data (callouts: Data type stored at this archive site; Archive location; Data wrapper)

(defclass Data_set_model
  (is-a USER)
  (role concrete)
  (single-slot Data_wrapper
    (type INSTANCE)
;+    (allowed-classes Wrapper)
;+    (cardinality 0 1)
    (create-accessor read-write))
  (single-slot Extent_type
    (type SYMBOL)
;+    (allowed-parents Extent)
;+    (cardinality 0 1)
    (create-accessor read-write))
  (single-slot Data_location
    (type INSTANCE)
;+    (allowed-classes Data_set_locati
;+    (cardinality 0 1)
    (create-accessor read-write)))
(defclass Data_set_location
  (is-a USER)
  (role concrete)
  (single-slot name_
    (type STRING)
;+    (cardinality 0 1)
    (create-accessor read-write))
  (single-slot Repository_URL
    (type STRING)
;+    (cardinality 1 1)
    (create-accessor read-write)))

Figure 8: Ontology Encoding in Protégé

4.2 XML
In TACTICS, both the event-type taxonomy and the location taxonomy are stored in XML-based text files. XML provides an excellent storage format because it is a good compromise between human and machine readability, and a taxonomy can be extended simply by editing the appropriate file. The structure of the XML file uses a total of only three tags and three attributes. The nesting of the elements reflects the hierarchy of the taxonomy. The basic element used is the , which has a required “name” attribute, specifying the name of the node. The other attributes that may be assigned to the element are “key” and “ref”. The “key” attribute is used to give a node a unique reference name, for those cases where the name attribute is not unique. The “ref” attribute is used when branches in the hierarchy are joined, and specifies a unique name, or a ref value. The other two elements are , which only uses the “name” attribute, and which places an arbitrary comment between the element begin and end tags. The element is used to specify an alternate spelling for a name. This is especially useful for alternate spellings of place names, dealing with different languages, contractions, and even misspellings. A sample of the XML used to describe the event-type taxonomy is shown in Figure 9 below. The sample demonstrates the use of the tags and attributes discussed above.

Figure 9: XML Fragment from the Event Type Taxonomy
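As a minimal sketch of how a taxonomy file of this kind might be read, the following Python fragment parses a small XML document with three tags and the attributes “name”, “key” and “ref”. The element names node, alias and comment are hypothetical stand-ins chosen for illustration, not the names used in TACTICS, and the sample data is invented (loosely following Figure 5). The sketch also assumes that a “ref” value points at the “key” of a node defined elsewhere, or at its “name” when no key is given.

```python
import xml.etree.ElementTree as ET

# Invented sample; tag names "node", "alias", "comment" are assumptions.
SAMPLE = """
<node name="Action/Activity by Group">
  <comment>Illustrative fragment only</comment>
  <node name="Attack">
    <node name="Hijacking"/>
    <node name="Kidnapping">
      <alias name="Abduction"/>
    </node>
    <node name="Bombing" key="bombing"/>
  </node>
  <node name="Threat">
    <node name="Bombing" ref="bombing"/>
  </node>
</node>
"""


def build_index(root):
    """Map each key (or name, when no key is given) to its defining element."""
    index = {}
    for el in root.iter("node"):
        if "ref" not in el.attrib:
            index[el.get("key", el.get("name"))] = el
    return index


def paths(el, index, prefix=()):
    """Yield the root-to-node tuple of names for every node, following refs."""
    target = index[el.attrib["ref"]] if "ref" in el.attrib else el
    here = prefix + (target.get("name"),)
    yield here
    for child in target.findall("node"):
        yield from paths(child, index, here)


def aliases(root):
    """Map each alternate spelling to the name of the node it belongs to."""
    result = {}
    for el in root.iter("node"):
        for alt in el.findall("alias"):
            result[alt.get("name")] = el.get("name")
    return result


root = ET.fromstring(SAMPLE)
index = build_index(root)
all_paths = list(paths(root, index))
alias_map = aliases(root)
```

Because nesting encodes the hierarchy, a recursive walk is enough to recover every path in the taxonomy, and the reference lookup lets one node appear under more than one parent without being defined twice.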
5 COMBINING OUR APPROACHES
We have seen how ontologies can be used for sequence mining of terrorist threats and for the retrieval of heterogeneous and distributed data. Although we have not yet done so, we foresee much potential for a system that combines these two approaches into a single, comprehensive system. Such a system could potentially automate the task of sequence discovery in large bodies of scientific data, such as NASA’s massive Earth Science data archives. Because of the tremendous volume of such data, sequence mining and other knowledge discovery methods traditionally require large, time-consuming data transfers. With a mobile agent approach, the data can be analyzed for sequences at the storage site, thus allowing a much larger corpus of data to be analyzed.

One of the drawbacks of the TACTICS system is that data must be fed in manually from news sources such as newspaper articles and TV reports. An automated data retrieval system that collects news items from a database could substantially facilitate data acquisition. This would, of course, require a suitable ontology of news article ‘topics’, along with a significant amount of manual work dedicated to classifying news archives against this ontology. Research in the field of automatic text understanding and classification would certainly be relevant here.

6 CONCLUSIONS
In this paper, we have presented two very practical problems in the areas of distributed information retrieval and pattern mining, and raised and addressed several issues in relation to our use of intelligent agents and domain ontologies as proposed solutions to the problems. We have described our use of Protégé for constructing ontologies and their subsequent representation in a machine-readable format. Our future plan is to continue addressing the issues that are raised in Section 1, including the ones related to the use of existing domain ontologies such as Cyc and EDCS. We will then address the task of combining the process of information retrieval with pattern discovery by using a single domain ontology to accomplish both tasks concurrently.

7 REFERENCES
[1] Arens, Y., Chee, C. Y., Hsu, C-N., In, H., and Knoblock, C. A. (1993). Retrieving and integrating data from multiple information sources. International Journal on Intelligent and Cooperative Information Systems, Vol. 2, pp. 127-158.
[2] Birkel, P. (1999) “SEDRIS Data Coding Standard”, In Proceedings of the Spring Simulation Interoperability Workshop, March 1999, 99S-SIW-011.
[3] Bishr, Y. and Kuhn, W. (2000) Ontology-Based Modelling of Geospatial Information. In Proceedings of the 3rd AGILE Conference on Geographic Information Science, Helsinki/Espoo, May 25-27.
[4] Das, S., Shuster, K., and Wu, C. “Agent-based Complex Querying and Information Retrieval Engine”, to appear in the Proceedings of the First International Joint Conference on Autonomous Agents and Multi-agent Systems (AAMAS 2002), Bologna, Italy, July 2002.
[5] Das, S. and Ruda, H. “Predicting Significant Events via Sequence Learning”, to be presented at the ECAI Workshop on Knowledge Discovery from Temporal and Spatial Data, Lyon, France, July 2002.
[6] Fensel, D. (2001) “Ontologies: A Silver Bullet for Knowledge Management and Electronic Commerce”. Springer-Verlag.
[7] Foley, P., Mamaghani, F. & Birkel, P. The Synthetic Environment Data Representation and Interchange Specification (SEDRIS) development project (http://www.sedris.org/pr11trpl.htm).
[8] Frank, R. and Kemp, Z. (2001) Ontologies for Knowledge Discovery in Environmental Information Systems. In Raffacto A and Renso C, editors, International Conference Logic programming ICLP'01 Workshop Proceedings CRGD: Complex Reasoning on Geographical Data, December 2001.
[9] Genesereth, M. R. (1991). “Knowledge Interchange Format”. In Proceedings of the Second International Conference on the Principles of Knowledge Representation and Reasoning (KR-91), Kaufman, pp 238-249.
[10] Kotz, D. and Gray, R. (1999). Mobile Agents and the Future of the Internet. ACM Operating Systems Review, August 1999, pp. 7-13.
[11] Lenat, D. B. "Cyc: A Large-Scale Investment in Knowledge Infrastructure." Communications of the ACM 38, no. 11 (November 1995).
[12] Niles, I. and Pease, A. (2001). Towards a Standard Upper Ontology. In C. Welty and B. Smith (Eds.) Formal Ontology in Information Systems: Collected Papers from the Second International Conference. New York: ACM Press, pp. 2-9.
[13] Noy, N. F., Fergerson, R. W., and Musen, M. A. (2000). The knowledge model of Protege-2000: Combining interoperability and flexibility. 2nd International Conference on Knowledge Engineering and Knowledge Management (EKAW'2000), Juan-les-Pins, France, 2000.
[14] Sowa, J. (2000). “Knowledge Representation” Brooks/Cole.
[15] Uschold, M., King, M., Moralee, S., and Zorgios, Y. (1998) The Enterprise Ontology. The Knowledge Engineering Review, Vol. 13, Special Issue on Putting Ontologies to Use (eds. Mike Uschold and Austin Tate). (Also available from Artificial Intelligence Application Institute (AIAI), University of Edinburgh, Scotland, as AIAI-TR-195).
[16] Widom, J. (1996). “Integrating Heterogeneous Databases: Lazy or Eager?”, ACM Computing Surveys, Vol. 2.