CEUR Workshop Proceedings Vol-440, paper 2 (https://ceur-ws.org/Vol-440/paper2.pdf)
    Automatic Ontology Creation from Text for
  National Intelligence Priorities Framework (NIPF)
                                           Mithun Balakrishna, Munirathnam Srikanth
                                                        Lymba Corporation
                                                   Richardson, TX, 75080, USA
                                                Email: {mithun,srikanth}@lymba.com


   Abstract—Analysts are constantly overwhelmed with large amounts of unstructured data. This holds especially true for intelligence analysts tasked with extracting useful information from large data sources. To alleviate this problem, domain-specific and general-purpose ontologies/knowledge-bases have been proposed to help automate methods for organizing data and provide access to useful information. However, problems in ontology creation and maintenance have resulted in expensive procedures for expanding/maintaining the ontology library available to support the growing and evolving needs of the Intelligence Community (IC). In this paper, we present the semi-automatic development of an ontology library for the National Intelligence Priorities Framework (NIPF) topics. We use Jaguar-KAT, a state-of-the-art tool for knowledge acquisition and domain understanding, with minimal manual intervention to create NIPF ontologies loaded with rich semantic content. We also present evaluation results for the NIPF ontologies created using our methodology.
   Index Terms—ontology generation, National Intelligence Priorities Framework (NIPF).

                       I. INTRODUCTION

   Analysts are constantly plagued and overwhelmed by the large amounts of unstructured and semi-structured data they must process to extract useful information [1]. Over the past decade, ontologies and knowledge bases have gained popularity for their high potential benefits in a number of applications, including data/knowledge organization and search [2]. The data processing burden on intelligence analysts has been relieved by the integration of ontologies that help automate methods for organizing data and provide access to useful information [3].
   Though a number of applications can and have benefited from their integration with domain-specific and general-purpose ontologies/knowledge-bases, it is well known that ontology creation (popularly referred to as the knowledge acquisition bottleneck [2]) is an expensive process [4], [5]. The modeling of ontologies for non-trivial domains/topics is difficult and time/resource consuming. The knowledge acquisition bottleneck problems in ontology creation and maintenance have resulted in expensive procedures for maintaining and expanding the ontology library available to support the growing and evolving needs of the Intelligence Community (IC).
   In this paper, we present the semi-automatic development of an ontology library for the 33 topics defined in the National Intelligence Priorities Framework (NIPF). NIPF is the Director of National Intelligence's (DNI's) guidance to the Intelligence Community on the national intelligence priorities approved by the President of the United States of America [6].
   Lymba's Jaguar-KAT [3], [7] is a state-of-the-art tool for knowledge acquisition and domain understanding. We use Jaguar to create rich NIPF ontologies by extracting deep semantic content from NIPF topic-specific document collections while keeping manual intervention to a minimum. In this paper, we discuss the technical contributions of automatic concept and semantic relation extraction, automatic ontology construction, and the metrics to evaluate ontology quality.

               II. AUTOMATIC ONTOLOGY GENERATION

   Jaguar automatically builds domain-specific ontologies from text. The text input to Jaguar can come from a variety of document sources, including plain text, MS Word, PDF, HTML web pages, etc. The ontology/knowledge-base created by Jaguar includes the following constituents:
   • Ontological Concepts: basic building blocks of an ontology
   • Hierarchy: structure imposed on certain ontological concepts via transitive relations that generally hold to be universally true (e.g. ISA, Part-Whole, Locative, etc.)
   • Contextual Knowledge Base: semantic contexts that encapsulate knowledge of events via semantic relations
   • Axioms on Demand: assertions about concepts of interest generated from the available knowledge; this is useful for reasoning on text

[Figure omitted: an ontology concept set (e.g. anthrax, biological weapon), a hierarchy with ISA/PW/CAU links, and a contextual knowledge base entry assassinate(AGT: rebel, THM: political leader, TMP: may 21).]
Fig. 1. An example Jaguar knowledge-base containing concepts, hierarchy and contextual knowledge.

   Figure 1 shows an example Jaguar knowledge-base containing concepts, hierarchy and contextual knowledge. The
[Figure omitted: starting from the seed set {Weapon} and the two sentences "The rebels had access to chemical weapons, such as nerve gas and other poisonous gases" and "The nerve gas was created using numerous fluorine-based compounds", three iterations extract ISA, POS, PW and THM relations and grow the seed set to {Weapon, Chemical Weapon, Nerve gas, Poisonous gas, Gas, Fluorine-based Compound}.]
Fig. 2. An example depicting Jaguar's iterative process of extracting concepts and semantic relations of interest using seed concepts.

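The iterative process depicted in Fig. 2 can be sketched as follows. This is a minimal illustration, not Jaguar's implementation: the relation triples are hand-written stand-ins for the semantic parser's output, and `expand_seeds` is a hypothetical helper.

```python
# Illustrative sketch of Fig. 2's seed-driven iteration. The triples are
# hand-written stand-ins for Jaguar's semantic-parser output.

# (concept_x, relation, concept_y) triples, grouped by source sentence
SENTENCES = [
    {("chemical weapon", "ISA", "weapon"),
     ("nerve gas", "ISA", "chemical weapon"),
     ("poisonous gas", "ISA", "chemical weapon"),
     ("nerve gas", "ISA", "gas"),
     ("poisonous gas", "ISA", "gas")},
    {("fluorine-based compound", "PW", "nerve gas")},
]

HIERARCHICAL = {"ISA", "PW", "SYN"}  # relations used to grow the seed set

def expand_seeds(seeds, sentences, n=3):
    """Repeat sentence selection, relation extraction and seed
    augmentation n times (the paper's default is n = 3)."""
    seeds = set(seeds)
    extracted = set()
    for _ in range(n):
        for triples in sentences:
            # a sentence is used only if it mentions a current seed
            if not any(x in seeds or y in seeds for x, _, y in triples):
                continue
            extracted |= triples
            # augment the seed set with hierarchically related concepts
            for x, rel, y in triples:
                if rel in HIERARCHICAL and (x in seeds or y in seeds):
                    seeds |= {x, y}
    return seeds, extracted

seeds, rels = expand_seeds({"weapon"}, SENTENCES)
```

Run on the two sentences of Fig. 2, the seed set grows from {weapon} to the six concepts listed in the figure's final iteration.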
[Figure omitted: two input ontologies L1 and L2 over market-related concepts (industry, work_place, exchange, market, stock_market, money_market, financial market, capital market) are merged into a single consistent hierarchy.]
Fig. 3. An example depicting Jaguar's merging of two ontologies through conflict resolution algorithms.

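The link-by-link, conflict-avoiding merge behind Fig. 3 can be sketched as below. This is an illustrative reimplementation under stated assumptions (hierarchies as child-to-parents maps, cycle and redundancy tests via graph reachability), not Lymba's actual algorithm.

```python
# Minimal sketch of link-by-link ontology merging with conflict
# avoidance: each link from the smaller hierarchy is added to the
# reference one only if it creates no cycle and no redundant path.

def reachable(hierarchy, src, dst):
    """True if dst can be reached from src via existing parent links."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(hierarchy.get(node, ()))
    return False

def merge(reference, smaller):
    """Add each (child -> parent) link of the smaller hierarchy to the
    reference one, skipping links that would cause a cycle or a
    redundant (already implied) path."""
    merged = {c: set(ps) for c, ps in reference.items()}
    for child, parents in smaller.items():
        for parent in parents:
            if reachable(merged, parent, child):  # would create a cycle
                continue
            if reachable(merged, child, parent):  # path already implied
                continue
            merged.setdefault(child, set()).add(parent)
    return merged

# Fig. 3-style example: two small "market" hierarchies (child -> parents)
L1 = {"market": {"industry"}, "stock_market": {"market"},
      "money_market": {"market"}}
L2 = {"money_market": {"financial market"},
      "capital market": {"financial market"},
      "stock_market": {"capital market"}}
merged = merge(L1, L2)
```

Relations from L2 that are consistent with L1 are layered in, so stock_market ends up under both market and capital market, as in the merged lattice of Fig. 3.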
input to Jaguar includes a document collection (plain text, MS Word, PDF, HTML web pages, etc.) and a seeds file containing the concepts/keywords of interest in the domain. Jaguar's ontology creation involves complex text processing using advanced Natural Language Processing (NLP) tools and an advanced knowledge classification/management algorithm. A single run of Jaguar can be divided into the following two major phases:
   • Text Processing
   • Classification/Hierarchy Formation
   In Text Processing, the first step is to extract textual content from the input document collection. The text files then go through a set of NLP processing tools: named-entity recognition, part-of-speech tagging, syntactic parsing, word-sense disambiguation, coreference resolution, and semantic parsing (or semantic relation discovery) [8], [9]. The concept discovery module then extracts the concepts of interest using the input seeds set as a starting point and growing it based on the extracted NLP information [3].
   The classification module forms a hierarchical structure within the set of identified domain concepts via transitive relations that generally hold to be universally true (e.g. ISA, Part-Whole, Locative, etc.). Jaguar uses well-formed procedures [7] to impose a hierarchical structure on the discovered concepts
set using the semantic relations discovered by Polaris [1] and with WordNet [10] as the upper ontology.

A. Automatically Building NIPF Ontologies

   In this paper, we use Jaguar to create an ontology library for the 33 topics defined in NIPF. For each NIPF topic, we collected 500 documents from the web (the Weapons topic was an exception; its collection had only 50 Wikipedia documents) and manually verified their relevance to the corresponding topic. We then use Jaguar to create an ontology for each identified NIPF topic. Jaguar builds each ontology with rich semantic content extracted from the corresponding NIPF topic document collection while keeping manual intervention to a minimum. These ontologies are fine-tuned to contain the level of detail desired by an analyst.
   1) Extracting Textual Content: We first extract text from the input NIPF document collections and then filter/clean up the extracted text. The NIPF text input to Jaguar comes from all possible document types, including MS Word, PDF and HTML web pages, and is therefore prone to many irregularities, such as incomplete or strangely formatted sentences, headings, and tabular information. The text extraction and filtering mechanism of Jaguar is a crucial step that makes the input acceptable for the subsequent NLP tools to process. The extraction/filtering rules include conversion/removal of non-ASCII characters, verbalization of Wikipedia infoboxes and tables, and conversion of punctuation symbols, among others.
   2) Initial Seed Set Selection: For each NIPF topic, Jaguar is provided with an initial seed set containing on average 51 concepts of interest. The seed set is used to determine the set of text sentences of interest in a topic's document collection. The initial seed set selection for each NIPF topic was performed manually based on the concepts found in the topic descriptions. The initial seed selection is the only manual step in our NIPF ontology creation process. We are currently exploring automated methods for creating the initial seed set using a combination of statistical and semantic clues in the document collection.
   3) Concept and Relation Discovery: For each NIPF topic, the set of text files extracted from the document collection is processed through the entire set of NLP tools listed in Section II. The NLP-processed data files are then passed through the concept discovery module, which identifies noun concepts in sentences that are related to the NIPF topic target words or seeds. The concept discovery module analyzes the syntactic parse tree of each processed sentence and scans it for noun phrases. Though Jaguar has the capability to extract verb concepts by analyzing verb phrases, for our current NIPF ontology creation experiment we focused only on noun concepts and their semantic relations. Each noun phrase is then processed, and well-formed noun concepts are extracted based on a set of syntactic patterns and rules.
   Noun concepts (which are part of the seed set), their semantic relations (extracted from the semantic parser, Polaris [8], [9]) and the noun concepts involved in semantic relations with the seed set concepts are added into data structures for subsequent processing into the ontology's hierarchy. The resulting data structures are processed and used to populate one or many semantic contexts: groups of relations or nested contexts which hold true around a common central concept. The seed set is then augmented with concepts that have hierarchical relations with the target words or seeds. The entire process of sentence selection, concept extraction, semantic relation extraction and seed concept set augmentation is repeated in an iterative manner, n times (by default, n is set to 3). While processing the NIPF topic collections through Jaguar, we used ISA, Part-Whole and Synonymy semantic relations for automatically augmenting the seed concept set. Figure 2 depicts this iterative process of extracting concepts and semantic relations of interest using seed concepts.
   4) Creating Concept Hierarchies: The extracted NIPF topic noun concepts and semantic relations are fed to the classification module to determine the hierarchical structure. Certain hypernymy relations discovered via classification contain anomalies (causing cycles) or redundancies. Hence, we run them through a conflict resolution engine to detect and correct inconsistencies. The conflict resolution engine creates a NIPF topic hierarchy link by link (relation by relation) and follows a conflict avoidance technique, wherein each new link is tested for causing inconsistencies before being added to the hierarchy.
   5) Ontology Merging: Although single runs of Jaguar yield rich NIPF ontologies, Jaguar's real power lies in providing an ontology maintenance option to layer ontologies from many different runs. Figure 3 depicts the process of merging two ontologies through conflict resolution algorithms. Jaguar can merge disparate ontologies or add new knowledge by using the aforementioned conflict resolution techniques. The merge tool merges the two ontologies' concept sets, hierarchies (using conflict resolution), and knowledge bases (sets of semantic contexts). Given two ontologies or knowledge bases, ontology merging is performed by enumerating the relations in the smaller ontology and adding them to the larger, or reference, ontology. A relation may be represented by a similar relation in the reference ontology, may create a redundant path between concepts, or may be a new relation that can be added to the reference ontology. The conflict resolution techniques are then used to handle any conflict induced in the ontology and generate a merged ontology. Merging is useful for distributed or parallel systems, where small chunks of the input text may be processed on different portions of the system and subsequently merged. It also provides a foundation for future work in contextual reasoning and epistemic logic. The resulting rich NIPF knowledge bases can be viewed at many different levels of granularity, providing an analyst with the level of detail desired.

            III. EVALUATION OF JAGUAR'S NIPF ONTOLOGIES

   Since the mid-1990s, various methodologies have been proposed to evaluate ontology generation/maintenance/reuse techniques [11]. All the proposed methodologies have focused
                                                            TABLE I
 SUBSET OF SEMANTIC RELATIONS USED TO EVALUATE THE PERFORMANCE OF JAGUAR'S AUTOMATIC NIPF TOPICAL ONTOLOGY GENERATION FROM TEXT.

                          Semantic Relation     Definition           Example                                                      Code
                          ISA                   X is a (kind of) Y   [XY] [John] is a [person]                                    ISA
                          Part-Whole/Meronymy   X is a part of Y     [XY] [The engine] is the most important part of [the car]    PW
                                                                     [XY] [steel] [cage]
                                                                     [YX] [faculty] [professor]
                                                                     [XY] [door] of the [car]
                          Cause                 X causes Y           [XY] [Drinking] causes [accidents]                           CAU


                                                            TABLE II
    PERFORMANCE RESULTS FOR JAGUAR'S AUTOMATIC TOPICAL NIPF ONTOLOGY GENERATION FROM TEXT WITH RESPECT TO THE SEMANTIC RELATIONS DEFINED IN TABLE I.

       Number of    NIPF            Precision                             Coverage                              F-Measure
       Annotators   Topic           Correctness   Correctness+Relevance   Correctness   Correctness+Relevance   Correctness   Correctness+Relevance
           3        Weapons         0.610090      0.501499                0.702424      0.657122                0.653009      0.568859
           1        Missiles        0.533867      0.485364                0.793775      0.777747                0.63838       0.597715
           2        Illicit Drugs   0.471938      0.274506                0.801422      0.701122                0.594053      0.39454
           1        Terrorism       0.388788      0.291019                0.822285      0.776206                0.527953      0.423323
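As a consistency check on Table II, each F-Measure value is the harmonic mean (β = 1) of the corresponding Precision and Coverage values:

```python
# Consistency check: the F-Measure values in Table II are the harmonic
# means (beta = 1) of the corresponding Precision and Coverage columns.
def f_measure(precision, coverage):
    return 2 * precision * coverage / (precision + coverage)

# Weapons topic, Correctness columns: 0.610090, 0.702424 -> 0.653009
f1_weapons = f_measure(0.610090, 0.702424)
# Missiles topic, Correctness columns: 0.533867, 0.793775 -> 0.63838
f1_missiles = f_measure(0.533867, 0.793775)
```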


                                                           TABLE III
         SEMANTIC RELATION AND CONCEPT EXTRACTION STATISTICS FOR THE EVALUATED NIPF ONTOLOGIES PRESENTED IN TABLE II.


                                         NIPF               Unique Semantic Relations                      Unique Concepts
                                        Topic        ISA     PW     CAU      Others   Total      In ISA/PW/CAU      Others       Total
                                      Weapons        1683    766     113      946      3508            2620          1012        3473
                                       Missiles      2939   2296     646     2692      8573            5982          3539        7873
                                    Illicit Drugs    2356   2040     817     5464     10677            5107          4982        7935
                                      Terrorism      2590   4219    1497     5405     13711            7929          6247        11638
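As a quick sanity check on Table III, each Total in the unique-semantic-relations columns is the row sum of the ISA, PW, CAU and Others counts:

```python
# Sanity check on Table III: each "Total" in the unique semantic
# relations columns equals the sum of its ISA, PW, CAU and Others counts.
RELATION_COUNTS = {
    # topic: (ISA, PW, CAU, Others, Total)
    "Weapons":       (1683,  766,  113,  946,  3508),
    "Missiles":      (2939, 2296,  646, 2692,  8573),
    "Illicit Drugs": (2356, 2040,  817, 5464, 10677),
    "Terrorism":     (2590, 4219, 1497, 5405, 13711),
}
for topic, (isa, pw, cau, others, total) in RELATION_COUNTS.items():
    assert isa + pw + cau + others == total, topic
```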




on some facet of the ontology generation problem, and depend on the type of ontology being created/maintained and the purpose of the ontology [12]. It is noted that not much progress has been achieved in developing a comprehensive and global technique for evaluating the correctness and relevance of ontologies [13].

   Pr(Correctness) = [Nj(correct) + Nj(irrelevant)] / [Nj(correct) + Nj(incorrect) + Nj(irrelevant)]
   Pr(Correctness+Relevance) = Nj(correct) / [Nj(correct) + Nj(incorrect) + Nj(irrelevant)]
   Cvg(Correctness) = [Nj(correct) + Nj(irrelevant)] / [Ng(correct) + Ng(irrelevant) + Ng(added)]
   Cvg(Correctness+Relevance) = Nj(correct) / [Ng(correct) + Ng(added)]                          (1)

   We evaluated the quality of Jaguar's NIPF ontologies by comparing them against manual gold annotations. Following the ontology evaluation levels defined in [12], our evaluations focus on the Lexical, Vocabulary, or Data Layer and the Other Semantic Relations levels. For a NIPF topic, the ontology and document collection were manually annotated by several human annotators and used in the evaluation of the ontology. Viewing an ontology as a set of semantic relations between two concepts, the annotators:
   • Labeled an entry correct if the concepts and the semantic relation were correctly detected by the system, and incorrect otherwise
   • Labeled a correct entry as irrelevant if any of the concepts or the semantic relation are irrelevant to the domain
   • Added new entries from the sentences if the concepts and the semantic relation were omitted by Jaguar

   The annotation rules provide feedback on the automated concept tagging and semantic relation extraction and are also used for computing precision (Pr) and coverage (Cvg) metrics for the automatically generated ontologies. The equations in (1) capture the metrics defined by Lymba to evaluate Jaguar's automatic topical NIPF ontology generation from text. In (1), Nj(.) gives the counts from Jaguar's output and Ng(.) corresponds to the counts in the user annotations. Table II presents our initial evaluation results for 4 NIPF topics using a subset of 3 semantic relations (the ISA, PW and CAU relations defined in Table I). Table III presents the semantic relation and concept extraction statistics for the four NIPF ontologies evaluated in this paper.
   We use the metrics defined in (1) to evaluate the ontologies against the manual annotations from different human annotators. The results in Table II represent the evaluation scores averaged over the results for the different annotators. The first column in Table II identifies the number of annotators for each topic. Jaguar obtained the best Precision results in both the Correctness and Correctness+Relevance evaluations for the Weapons NIPF topic. Please note that, as shown in Table III, a smaller number of concepts/semantic relations was extracted for this topic due to its smaller collection size (50 documents versus the 500-document set for the other topics). The Terrorism NIPF topic obtained the best Coverage result for the Correctness evaluation and was also very close to the best Coverage result, obtained by the Missiles NIPF topic, for the Correctness+Relevance evaluation. The Weapons NIPF topic obtained the best F-Measure result (β = 1) for the Correctness evaluation, while the Missiles NIPF topic obtained the best F-Measure result for the Correctness+Relevance evaluation.
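The metrics in (1) are straightforward to compute from annotation counts. The sketch below uses hypothetical counts purely to exercise the formulas:

```python
# Sketch of the precision/coverage metrics in Eq. (1); the annotation
# counts below are hypothetical, chosen only to exercise the formulas.
def precision(nj_correct, nj_incorrect, nj_irrelevant, relevance=False):
    """Pr over Jaguar's output counts Nj(.)."""
    hits = nj_correct if relevance else nj_correct + nj_irrelevant
    return hits / (nj_correct + nj_incorrect + nj_irrelevant)

def coverage(nj_correct, nj_irrelevant, ng_correct, ng_irrelevant,
             ng_added, relevance=False):
    """Cvg of Jaguar's counts Nj(.) against annotator counts Ng(.)."""
    if relevance:
        return nj_correct / (ng_correct + ng_added)
    return (nj_correct + nj_irrelevant) / (ng_correct + ng_irrelevant + ng_added)

# hypothetical counts: 70 correct, 20 incorrect, 10 irrelevant entries
p = precision(70, 20, 10)         # (70 + 10) / 100 = 0.8
c = coverage(70, 10, 70, 10, 20)  # (70 + 10) / 100 = 0.8
```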
            IV. CONCLUSIONS AND FUTURE WORK
   In this paper, we presented the semi-automatic development of an ontology library for the NIPF topics. We used Jaguar-KAT, a state-of-the-art tool for knowledge acquisition and domain understanding, with minimal manual intervention, to create NIPF ontologies loaded with rich semantic content. We also defined evaluation metrics to assess the quality of the NIPF ontologies created using our methodology, and evaluated a subset of Jaguar's NIPF ontologies by comparing them against manual gold annotations. The results are very promising and show that a considerable amount of knowledge was automatically and accurately extracted by Jaguar from the input document collections while keeping manual intervention in the process to a minimum. We plan to perform further analysis of the results and identify methods for improving the precision and coverage of text processing and ontology generation.
                             REFERENCES
 [1] D. Bixler, D. Moldovan, and A. Fowler, “Using knowledge extraction
     and maintenance techniques to enhance analytical performance,” in
     Proceedings of International Conference on Intelligence Analysis, 2005.
 [2] P. Cimiano, Ontology Learning and Population from Text: Algorithms,
     Evaluation and Applications. Springer, 2006.
 [3] D. Moldovan, M. Srikanth, and A. Badulescu, “Synergist: Topic and
     user knowledge bases from textual sources for collaborative intelligence
     analysis,” in CASE PI Conference, 2007.
 [4] E. Ratsch, J. Schultz, J. Saric, P. C. Lavin, U. Wittig, U. Reyle, and
     I. Rojas, “Developing a protein-interactions ontology,” Comparative and
     Functional Genomics, vol. 4, no. 1, pp. 85–89, 2003.
 [5] H. Pinto and J. Martins, "Ontologies: How can they be built?" Knowledge and Information Systems, vol. 6, no. 4, pp. 441–464, 2004.
 [6] "FBI: National Security Branch - FAQ," available at http://www.fbi.gov/hq/nsb/nsb_faq.htm#NIPF, last accessed Jul 21, 2008.
 [7] D. I. Moldovan and R. Girju, “An interactive tool for the rapid
     development of knowledge bases,” International Journal on Artificial
     Intelligence Tools, vol. 10, no. 1-2, pp. 65–86, 2001.
 [8] A. Badulescu, “Classification of semantic relations between nouns,”
     Ph.D. dissertation, The University of Texas at Dallas, 2004.
 [9] R. Girju, A. M. Giuglea, M. Olteanu, O. Fortu, O. Bolohan, and
     D. Moldovan, “Support vector machines applied to the classification of
     semantic relations in nominalized noun phrases,” in Lexical Semantics
     Workshop in Human Language Technology (HLT), 2004.
[10] G. Miller, "WordNet: a lexical database for English," Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[11] Y. Sure, G. A. Perez, W. Daelemans, M. L. Reinberger, N. Guarino, and
     N. F. Noy, "Why evaluate ontology technologies? Because it works!"
     IEEE Intelligent Systems, vol. 19, no. 4, pp. 74–81, 2004.
[12] J. Brank, M. Grobelnik, and D. Mladenic, “A survey of ontology
     evaluation techniques,” in Data Mining and Data Warehouses (SiKDD),
     Ljubljana, Slovenia, 2005.
[13] A. Gangemi, C. Catenacci, M. Ciaramita, and J. Lehmann, “Modelling
     ontology evaluation and validation,” in European Semantic Web Sympo-
     sium/Conference (ESWC), 2006, pp. 140–154.