=Paper=
{{Paper
|id=Vol-440/paper-2
|storemode=property
|title=Automatic Ontology Creation from Text for National Intelligence Priorities Framework (NIPF)
|pdfUrl=https://ceur-ws.org/Vol-440/paper2.pdf
|volume=Vol-440
|dblpUrl=https://dblp.org/rec/conf/oic/BalakrishnaS08
}}
==Automatic Ontology Creation from Text for National Intelligence Priorities Framework (NIPF)==
Mithun Balakrishna, Munirathnam Srikanth
Lymba Corporation
Richardson, TX, 75080, USA
Email: {mithun,srikanth}@lymba.com
Abstract—Analysts are constantly overwhelmed with large amounts of unstructured data. This holds especially true for intelligence analysts tasked with extracting useful information from large data sources. To alleviate this problem, domain-specific and general-purpose ontologies/knowledge-bases have been proposed to help automate methods for organizing data and provide access to useful information. However, problems in ontology creation and maintenance have resulted in expensive procedures for expanding/maintaining the ontology library available to support the growing and evolving needs of the Intelligence Community (IC). In this paper, we present the semi-automatic development of an ontology library for the National Intelligence Priorities Framework (NIPF) topics. We use Jaguar-KAT, a state-of-the-art tool for knowledge acquisition and domain understanding, with minimal manual intervention to create NIPF ontologies loaded with rich semantic content. We also present evaluation results for the NIPF ontologies created using our methodology.

Index Terms—ontology generation, National Intelligence Priorities Framework (NIPF).

I. INTRODUCTION

Analysts are constantly overwhelmed by the large amounts of unstructured and semi-structured data they must process to extract useful information [1]. Over the past decade, ontologies and knowledge bases have gained popularity for their high potential benefits in a number of applications, including data/knowledge organization and search [2]. The data processing burden on intelligence analysts has been relieved by the integration of ontologies that help automate methods for organizing data and provide access to useful information [3].

Though a number of applications can benefit and have benefited from their integration with domain-specific and general-purpose ontologies/knowledge-bases, it is well known that ontology creation (popularly referred to as the knowledge acquisition bottleneck [2]) is an expensive process [4], [5]. The modeling of ontologies for non-trivial domains/topics is difficult and time/resource consuming. The knowledge acquisition bottleneck problems in ontology creation and maintenance have resulted in expensive procedures for maintaining and expanding the ontology library available to support the growing and evolving needs of the Intelligence Community (IC).

In this paper, we present the semi-automatic development of an ontology library for the 33 topics defined in the National Intelligence Priorities Framework (NIPF). NIPF is the Director of National Intelligence’s (DNI’s) guidance to the Intelligence Community on the national intelligence priorities approved by the President of the United States of America [6].

Lymba’s Jaguar-KAT [3], [7] is a state-of-the-art tool for knowledge acquisition and domain understanding. We use Jaguar to create rich NIPF ontologies by extracting deep semantic content from NIPF topic-specific document collections while keeping the manual intervention to a minimum. In this paper, we discuss the technical contributions of automatic concept and semantic relation extraction, automatic ontology construction, and the metrics to evaluate ontology quality.

II. AUTOMATIC ONTOLOGY GENERATION

Jaguar automatically builds domain-specific ontologies from text. The text input to Jaguar can come from a variety of document sources, including plain text, MS Word, PDF and HTML web pages. The ontology/knowledge-base created by Jaguar includes the following constituents:
• Ontological Concepts: the basic building blocks of an ontology
• Hierarchy: structure imposed on certain ontological concepts via transitive relations that generally hold to be universally true (e.g. ISA, Part-Whole, Locative, etc.)
• Contextual Knowledge Base: semantic contexts that encapsulate knowledge of events via semantic relations
• Axioms on Demand: assertions about concepts of interest generated from the available knowledge; this is useful for reasoning on text

Fig. 1. An example Jaguar knowledge-base containing concepts, hierarchy and contextual knowledge.

Figure 1 shows an example Jaguar knowledge-base containing concepts, hierarchy and contextual knowledge.
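The constituents listed above can be pictured with a minimal data-structure sketch. This is an illustration populated from the Figure 1 example (anthrax, the assassinate event with its AGT/THM/TMP arguments); the class and field names are our own, not Lymba’s actual representation:

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    """Illustrative container for the constituents of a Jaguar knowledge-base."""
    concepts: set = field(default_factory=set)    # ontological concepts
    hierarchy: set = field(default_factory=set)   # transitive links, e.g. (x, "ISA", y)
    contexts: list = field(default_factory=list)  # contextual knowledge base
    # axioms on demand would be generated from the above, not stored

    def add_link(self, x, rel, y):
        """Record a hierarchy relation together with its two concepts."""
        self.concepts |= {x, y}
        self.hierarchy.add((x, rel, y))

# Concepts and the semantic context are taken from the Figure 1 example
kb = KnowledgeBase()
kb.add_link("biological weapon", "ISA", "weapon")
kb.add_link("anthrax", "ISA", "biological weapon")
kb.contexts.append({                 # semantic context around an event
    "event": "assassinate",
    "AGT": "rebel",                  # agent
    "THM": "political leader",       # theme
    "TMP": "may 21",                 # temporal
})
```

The split between `hierarchy` and `contexts` mirrors the distinction drawn above between universally true transitive relations and event-bound semantic contexts.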
Fig. 2. An example depicting Jaguar’s iterative process of extracting concepts and semantic relations of interest using seed concepts. (The figure steps through iterations 0–3 over two sentences about chemical weapons: starting from the single seed Weapon, successive iterations add Chemical Weapon, Nerve Gas, Poisonous Gas, Gas and Fluorine-based Compound to the seed set via extracted ISA, PW, POS and THM relations.)
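The iterative loop depicted in Fig. 2 can be sketched as follows. This is a deliberately simplified stand-in for Jaguar’s pipeline: a single Hearst-style “Y, such as X” pattern plus a toy plural table replace the full NLP tool chain, and only ISA links are extracted (the figure’s POS, PW and THM relations are omitted):

```python
import re

PLURALS = {"weapons": "weapon", "gases": "gas", "compounds": "compound"}

def lemma(phrase):
    """Toy lemmatizer for the demo: map the head noun's plural to singular."""
    words = phrase.split()
    words[-1] = PLURALS.get(words[-1], words[-1])
    return " ".join(words)

def extract_isa(sentence):
    """Extract (hyponym, hypernym) pairs from one 'Y, such as ...' pattern,
    plus head-noun generalizations like 'chemical weapon' ISA 'weapon'."""
    m = re.search(r"((?:\w+ )?\w+), such as (.+)", sentence)
    if not m:
        return set()
    hyper = lemma(m.group(1))
    hypos = {lemma(part.strip().removeprefix("other "))
             for part in re.split(r",| and ", m.group(2)) if part.strip()}
    pairs = {(h, hyper) for h in hypos}
    for concept in hypos | {hyper}:
        if " " in concept:
            pairs.add((concept, concept.split()[-1]))
    return pairs

def grow_ontology(sentences, seeds, iterations=3):
    """Repeat sentence selection, extraction and seed augmentation
    (Jaguar's default is three iterations)."""
    seeds, relations = set(seeds), set()
    for _ in range(iterations):
        for sent in (s.lower() for s in sentences):
            if not any(seed in sent for seed in seeds):
                continue                      # sentence selection via current seeds
            for hypo, hyper in extract_isa(sent):
                if hypo in seeds or hyper in seeds:
                    relations.add((hypo, "ISA", hyper))
                    seeds.update((hypo, hyper))
    return seeds, relations

sentences = ["The rebels had access to chemical weapons, "
             "such as nerve gas and other poisonous gases"]
seeds, relations = grow_ontology(sentences, {"weapon"})
```

With the single seed Weapon, three iterations recover the Chemical Weapon, Nerve Gas, Poisonous Gas and Gas concepts shown in the figure.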
Fig. 3. An example depicting Jaguar’s merging of two ontologies through conflict resolution algorithms. (The figure shows two hierarchies, L1 and L2, over concepts such as work_place, industry, market, exchange, financial market, money_market, capital market and stock_market being merged into a single hierarchy.)
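The merge step depicted in Fig. 3 can be sketched as follows, using the three-way classification of relations described later in Section II (duplicate, redundant path, or new) plus a cycle test as the conflict check. The hierarchy encoding (a dict from concept to its direct ISA parents) is ours, and the example links are hypothetical, borrowed only from the figure’s concept labels:

```python
def reachable(hierarchy, start, goal):
    """True if goal is reachable from start by following ISA parents."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(hierarchy.get(node, ()))
    return False

def merge(reference, smaller):
    """Enumerate the smaller ontology's links and add each to the reference
    unless it is a duplicate, only creates a redundant path, or would
    introduce a cycle (a conflict, so the link is dropped)."""
    merged = {c: set(parents) for c, parents in reference.items()}
    for child, parents in smaller.items():
        for parent in parents:
            if parent in merged.get(child, set()):
                continue                              # duplicate link
            if reachable(merged, child, parent):
                continue                              # redundant path already exists
            if reachable(merged, parent, child):
                continue                              # would create a cycle
            merged.setdefault(child, set()).add(parent)
    return merged

# Concept names from the Fig. 3 labels; the link structure is hypothetical.
reference = {"financial market": {"market"}, "money_market": {"financial market"}}
smaller = {"stock_market": {"financial market"}, "money_market": {"market"}}
merged = merge(reference, smaller)
```

Here the new stock_market link is adopted, while the money_market → market link is dropped as a redundant path (money_market already reaches market through financial market).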
The input to Jaguar includes a document collection (plain text, MS Word, PDF and HTML web pages, etc.) and a seeds file containing the concepts/keywords of interest in the domain. Jaguar’s ontology creation involves complex text processing using advanced Natural Language Processing (NLP) tools and an advanced knowledge classification/management algorithm. A single run of Jaguar can be divided into the following two major phases:
• Text Processing
• Classification/Hierarchy Formation

In Text Processing, the first step is to extract textual content from the input document collection. The text files then go through a set of NLP processing tools: named-entity recognition, part-of-speech tagging, syntactic parsing, word-sense disambiguation, coreference resolution, and semantic parsing (or semantic relation discovery) [8], [9]. The concept discovery module then extracts the concepts of interest using the input seed set as a starting point and growing it based on the extracted NLP information [3].

The classification module forms a hierarchical structure within the set of identified domain concepts via transitive relations that generally hold to be universally true (e.g. ISA, Part-Whole, Locative, etc.). Jaguar uses well-formed procedures [7] to impose a hierarchical structure on the discovered concepts
set using the semantic relations discovered by Polaris [1] with WordNet [10] as the upper ontology.

A. Automatically Building NIPF Ontologies

In this paper, we use Jaguar to create an ontology library for the 33 topics defined in NIPF. For each NIPF topic, we collected 500 documents from the web (the Weapons topic was an exception: its collection had only 50 Wikipedia documents) and manually verified their relevance to the corresponding topic. We then use Jaguar to create an ontology for each identified NIPF topic. Jaguar builds each ontology with rich semantic content extracted from the corresponding NIPF topic document collection while keeping the manual intervention to a minimum. These ontologies are fine-tuned to contain the level of detail desired by an analyst.

1) Extracting Textual Content: We first extract text from the input NIPF document collections and then filter/clean up the extracted text. The NIPF text input to Jaguar comes from all possible document types, including MS Word, PDF and HTML web pages, and is therefore prone to many irregularities, such as incomplete or strangely formatted sentences, headings, and tabular information. The text extraction and filtering mechanism of Jaguar is a crucial step that makes the input acceptable for the subsequent NLP tools to process. The extraction/filtering rules include conversion/removal of non-ASCII characters, verbalization of Wikipedia infoboxes and tables, and conversion of punctuation symbols, among others.

2) Initial Seed Set Selection: For each NIPF topic, Jaguar is provided with an initial seed set containing, on average, 51 concepts of interest. The seed set is used to determine the set of text sentences of interest in a topic’s document collection. The initial seed set selection for each NIPF topic was performed manually based on the concepts found in the topic descriptions. The initial seed selection process is the only manual step in our NIPF ontology creation process. We are currently exploring automated methods for creating the initial seed set using a combination of statistical and semantic clues in the document collection.

3) Concept and Relation Discovery: For each NIPF topic, the set of text files extracted from the document collection is processed through the entire set of NLP tools listed in Section II. The NLP-processed data files are then passed through the concept discovery module, which identifies noun concepts in sentences that are related to the NIPF topic target words or seeds. The concept discovery module analyzes the syntactic parse tree of each processed sentence and scans it for noun phrases. Though Jaguar has the capability to extract verb concepts by analyzing verb phrases, for our current NIPF ontology creation experiment we focused only on noun concepts and their semantic relations. Each noun phrase is then processed, and well-formed noun concepts are extracted based on a set of syntactic patterns and rules.

Noun concepts (which are part of the seed set), their semantic relations (extracted from the semantic parser, Polaris [8], [9]) and the noun concepts involved in semantic relations with the seed set concepts are added into data structures for subsequent processing into the ontology’s hierarchy. The resulting data structures are processed and used to populate one or many semantic contexts: groups of relations or nested contexts which hold true around a common central concept. The seed set is then augmented with concepts that have hierarchical relations with the target words or seeds. The entire process of sentence selection, concept extraction, semantic relation extraction and seed concept set augmentation is repeated iteratively, n times (by default, n is set to 3). While processing the NIPF topic collections through Jaguar, we used ISA, Part-Whole and Synonymy semantic relations to automatically augment the seed concept set. Figure 2 depicts this iterative process of extracting concepts and semantic relations of interest using seed concepts.

4) Creating Concept Hierarchies: The extracted NIPF topic noun concepts and semantic relations are fed to the classification module to determine the hierarchical structure. Certain hypernymy relations discovered via classification contain anomalies (causing cycles) or redundancies. Hence, we run them through a conflict resolution engine to detect and correct inconsistencies. The conflict resolution engine creates a NIPF topic hierarchy link by link (relation by relation) and follows a conflict avoidance technique, wherein each new link is tested for causing inconsistencies before being added to the hierarchy.

5) Ontology Merging: Although single runs of Jaguar yield rich NIPF ontologies, Jaguar’s real power lies in providing an ontology maintenance option to layer ontologies from many different runs. Figure 3 depicts the process of merging two ontologies through conflict resolution algorithms. Jaguar can merge disparate ontologies or add new knowledge by using the aforementioned conflict resolution techniques. The merge tool merges the two ontologies’ concept sets, hierarchies (using conflict resolution), and their knowledge bases (sets of semantic contexts). Given two ontologies or knowledge bases, ontology merging is performed by enumerating the relations in the smaller ontology and adding them to the larger, or reference, ontology. A relation may already be represented by a similar relation in the reference ontology, may create a redundant path between concepts, or may be a new relation that can be added to the reference ontology. The conflict resolution techniques are then used to handle any conflict induced in the ontology and to generate a merged ontology. Merging is useful for distributed or parallel systems where small chunks of the input text may be processed on different portions of the system and then subsequently merged. It also provides a foundation for future work in contextual reasoning and epistemic logic. The resulting rich NIPF knowledge bases can be viewed at many different levels of granularity, providing an analyst with the level of detail desired.

III. EVALUATION OF JAGUAR’S NIPF ONTOLOGIES

Since the mid-1990s, various methodologies have been proposed to evaluate ontology generation/maintenance/reuse techniques [11].
TABLE I
Subset of semantic relations used to evaluate the performance of Jaguar’s automatic NIPF topical ontology generation from text.

| Semantic Relation | Definition | Examples | Code |
|---|---|---|---|
| ISA | X is a (kind of) Y | [X Y]: [John] is a [person] | ISA |
| Part-Whole/Meronymy | X is a part of Y | [X Y]: [The engine] is the most important part of [the car]; [X Y]: [steel] [cage]; [Y X]: [faculty] [professor]; [X Y]: [door] of the [car] | PW |
| Cause | X causes Y | [X Y]: [Drinking] causes [accidents] | CAU |

TABLE II
Performance results for Jaguar’s automatic topical NIPF ontology generation from text with respect to the semantic relations defined in Table I.

| Number of Annotators | NIPF Topic | Precision (Correctness) | Precision (Correctness+Relevance) | Coverage (Correctness) | Coverage (Correctness+Relevance) | F-Measure (Correctness) | F-Measure (Correctness+Relevance) |
|---|---|---|---|---|---|---|---|
| 3 | Weapons | 0.610090 | 0.501499 | 0.702424 | 0.657122 | 0.653009 | 0.568859 |
| 1 | Missiles | 0.533867 | 0.485364 | 0.793775 | 0.777747 | 0.63838 | 0.597715 |
| 2 | Illicit Drugs | 0.471938 | 0.274506 | 0.801422 | 0.701122 | 0.594053 | 0.39454 |
| 1 | Terrorism | 0.388788 | 0.291019 | 0.822285 | 0.776206 | 0.527953 | 0.423323 |

TABLE III
Semantic relation and concept extraction statistics for the evaluated NIPF ontologies presented in Table II.

| NIPF Topic | Unique Semantic Relations: ISA | PW | CAU | Others | Total | Unique Concepts: In ISA/PW/CAU | Others | Total |
|---|---|---|---|---|---|---|---|---|
| Weapons | 1683 | 766 | 113 | 946 | 3508 | 2620 | 1012 | 3473 |
| Missiles | 2939 | 2296 | 646 | 2692 | 8573 | 5982 | 3539 | 7873 |
| Illicit Drugs | 2356 | 2040 | 817 | 5464 | 10677 | 5107 | 4982 | 7935 |
| Terrorism | 2590 | 4219 | 1497 | 5405 | 13711 | 7929 | 6247 | 11638 |
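As a quick sanity check (ours, not part of the paper’s methodology), the per-relation counts reported in Table III sum exactly to the listed totals of unique semantic relations; the unique-concept columns are not expected to sum, since a concept can participate in several relation types:

```python
# Table III rows: (topic, ISA, PW, CAU, others, reported relation total)
rows = [
    ("Weapons",       1683,  766,  113,  946,  3508),
    ("Missiles",      2939, 2296,  646, 2692,  8573),
    ("Illicit Drugs", 2356, 2040,  817, 5464, 10677),
    ("Terrorism",     2590, 4219, 1497, 5405, 13711),
]
for topic, isa, pw, cau, others, total in rows:
    assert isa + pw + cau + others == total, topic
```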
All the proposed methodologies focus on some facet of the ontology generation problem and depend on the type of ontology being created/maintained and the purpose of the ontology [12]. It has been noted that not much progress has been achieved in developing a comprehensive, global technique for evaluating the correctness and relevance of ontologies [13].

We evaluated the quality of Jaguar’s NIPF ontologies by comparing them against manual gold annotations. Following the ontology evaluation levels defined in [12], our evaluations focus on the Lexical, Vocabulary, or Data Layer and the Other Semantic Relations levels. For a NIPF topic, the ontology and document collection were manually annotated by several human annotators and used in the evaluation of the ontology. Viewing an ontology as a set of semantic relations between two concepts, the annotators:
• Labeled an entry correct if the concepts and the semantic relation were correctly detected by the system, and marked the entry incorrect otherwise
• Labeled a correct entry as irrelevant if any of the concepts or the semantic relation were irrelevant to the domain
• Added new entries from the sentences if the concepts and the semantic relation were omitted by Jaguar

The annotation rules provide feedback on the automated concept tagging and semantic relation extraction, and are also used for computing precision (Pr) and coverage (Cvg) metrics for the automatically generated ontologies. The equations in (1) capture the metrics defined by Lymba to evaluate Jaguar’s automatic topical NIPF ontology generation from text:

Pr(Correctness) = [Nj(correct) + Nj(irrelevant)] / [Nj(correct) + Nj(incorrect) + Nj(irrelevant)]
Pr(Correctness+Relevance) = Nj(correct) / [Nj(correct) + Nj(incorrect) + Nj(irrelevant)]
Cvg(Correctness) = [Nj(correct) + Nj(irrelevant)] / [Ng(correct) + Ng(irrelevant) + Ng(added)]
Cvg(Correctness+Relevance) = Nj(correct) / [Ng(correct) + Ng(added)]        (1)

In (1), Nj(·) gives the counts from Jaguar’s output and Ng(·) gives the corresponding counts in the user annotations. Table II presents our initial evaluation results for 4 NIPF topics using the subset of 3 semantic relations (the ISA, PW and CAU relations) defined in Table I. Table III presents the semantic relation and concept extraction statistics for the four NIPF ontologies evaluated in this paper.

We use the metrics defined in (1) to evaluate the ontologies against the manual annotations from the different human annotators. The results in Table II are evaluation scores averaged over the results for the different annotators. The first column in Table II identifies the number of annotators for each topic. Jaguar obtained the best Precision results in both the Correctness and Correctness+Relevance evaluations for the Weapons NIPF topic. Please note that, as shown in Table III, a smaller number of concepts/semantic relations was extracted for this topic due to its smaller collection size (50 documents versus the 500-document sets for the other topics). The Terrorism NIPF topic obtained the best Coverage result for the Correctness evaluation, and it was also very close to the best Coverage result, obtained by the Missiles NIPF topic, for the Correctness+Relevance evaluation. The Weapons NIPF topic obtained the best F-Measure result (β = 1) for the Correctness evaluation, while the Missiles NIPF topic obtained the best F-Measure result for the Correctness+Relevance evaluation.
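The metrics in (1), and the F-measure reported in Table II, can be computed directly from the annotator counts. A sketch follows; the count-dictionary layout is ours, the formulas follow (1), and F with β = 1 is the harmonic mean of Pr and Cvg (which reproduces Table II’s F-Measure columns from its Precision and Coverage columns):

```python
def precision(nj):
    """Pr over Jaguar-output counts nj: correct / incorrect / irrelevant."""
    total = nj["correct"] + nj["incorrect"] + nj["irrelevant"]
    return {"correctness": (nj["correct"] + nj["irrelevant"]) / total,
            "correctness+relevance": nj["correct"] / total}

def coverage(nj, ng):
    """Cvg against gold counts ng, which include annotator-added entries."""
    return {"correctness": (nj["correct"] + nj["irrelevant"])
                           / (ng["correct"] + ng["irrelevant"] + ng["added"]),
            "correctness+relevance": nj["correct"] / (ng["correct"] + ng["added"])}

def f_measure(pr, cvg):
    """F with beta = 1: the harmonic mean of precision and coverage."""
    return 2 * pr * cvg / (pr + cvg)

# Reproduces the Weapons F-Measure (Correctness) from its Pr and Cvg columns
f_weapons = f_measure(0.610090, 0.702424)   # -> ~0.653009, as in Table II
```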
IV. CONCLUSIONS AND FUTURE WORK

In this paper, we presented the semi-automatic development of an ontology library for the NIPF topics. We used Jaguar-KAT, a state-of-the-art tool for knowledge acquisition and domain understanding, with minimal manual intervention to create NIPF ontologies loaded with rich semantic content. We also defined evaluation metrics to assess the quality of the NIPF ontologies created using our methodology, and evaluated a subset of Jaguar’s NIPF ontologies by comparing them against manual gold annotations. The results are very promising and show that a considerable amount of knowledge was automatically and accurately extracted by Jaguar from the input document collections while keeping the manual intervention in the process to a minimum. We plan to perform further analysis of the results and identify methods for improving the precision and coverage of text processing and ontology generation.
REFERENCES
[1] D. Bixler, D. Moldovan, and A. Fowler, “Using knowledge extraction
and maintenance techniques to enhance analytical performance,” in
Proceedings of International Conference on Intelligence Analysis, 2005.
[2] P. Cimiano, Ontology Learning and Population from Text: Algorithms,
Evaluation and Applications. Springer, 2006.
[3] D. Moldovan, M. Srikanth, and A. Badulescu, “Synergist: Topic and
user knowledge bases from textual sources for collaborative intelligence
analysis,” in CASE PI Conference, 2007.
[4] E. Ratsch, J. Schultz, J. Saric, P. C. Lavin, U. Wittig, U. Reyle, and
I. Rojas, “Developing a protein-interactions ontology,” Comparative and
Functional Genomics, vol. 4, no. 1, pp. 85–89, 2003.
[5] H. Pinto and J. Martins, “Ontologies: How can they be built?” Knowledge and Information Systems, vol. 6, no. 4, pp. 441–464, 2004.
[6] “FBI: National Security Branch - FAQ,”
Last accessed on Jul 21, 2008, available at
http://www.fbi.gov/hq/nsb/nsb_faq.htm#NIPF.
[7] D. I. Moldovan and R. Girju, “An interactive tool for the rapid
development of knowledge bases,” International Journal on Artificial
Intelligence Tools, vol. 10, no. 1-2, pp. 65–86, 2001.
[8] A. Badulescu, “Classification of semantic relations between nouns,”
Ph.D. dissertation, The University of Texas at Dallas, 2004.
[9] R. Girju, A. M. Giuglea, M. Olteanu, O. Fortu, O. Bolohan, and
D. Moldovan, “Support vector machines applied to the classification of
semantic relations in nominalized noun phrases,” in Lexical Semantics
Workshop in Human Language Technology (HLT), 2004.
[10] G. Miller, “WordNet: a lexical database for English,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[11] Y. Sure, G. A. Perez, W. Daelemans, M. L. Reinberger, N. Guarino, and
N. F. Noy, “Why evaluate ontology technologies? because it works!”
IEEE Intelligent Systems, vol. 19, no. 4, pp. 74–81, 2004.
[12] J. Brank, M. Grobelnik, and D. Mladenic, “A survey of ontology
evaluation techniques,” in Data Mining and Data Warehouses (SiKDD),
Ljubljana, Slovenia, 2005.
[13] A. Gangemi, C. Catenacci, M. Ciaramita, and J. Lehmann, “Modelling
ontology evaluation and validation,” in European Semantic Web Sympo-
sium/Conference (ESWC), 2006, pp. 140–154.