                 Semantics for Information Sharing and
                Discovery in the Intelligence Community
                                                  Martin Thurn
                  Northrop Grumman Corp., 4805 Stonecroft Blvd., Chantilly VA 20151, 703-449-3803,
                                              martin.thurn@ngc.com


   Abstract—One of the distinguishing characteristics of the intelligence
community is the strict security framework that is used to control
classified information. A counterproductive side effect of this strict
security is that intelligence analysts are often not aware of information
that is relevant to their analysis. Semantic technology and ontologies can
help analysts discover relevant information even if that information is
under the strictest controls and even if the analysts are not cleared to
access the data. These techniques can be applied immediately within the
current security framework of the intelligence community.

   Index Terms—Discovery, Information Sharing, Metadata, Redaction,
Semantics

                         I. INTRODUCTION

   THE many agencies of the United States intelligence community – and the
corresponding organizations of her friend and partner countries around the
world – employ a strict security framework to protect and control
classified information. The basis of this framework is that a person is
granted access to a sensitive document only if they need to know those data
to perform their duties.
   This basis creates two immediate impediments to information sharing and
discovery across the boundaries of security levels and compartments. When
sensitivity classifications are assigned to an entire document, it prevents
an unapproved analyst from seeing any portion of the document, even when
the document may actually contain a mixture of sensitive and unclassified
information. To make matters worse, it is often the case that an unapproved
analyst is prevented from knowing even the existence of that document. In
the former case, the analyst can at least ask for permission to read the
document and fulfill her duties; in the latter case, there is virtually no
hope for the analyst ever to see the data.

        II. PHYSICALLY-SEPARATE STANDARD SEMANTIC METADATA

   We have developed an approach to discovering and sharing information
that is particularly well-suited to the intelligence community, an approach
based on physically-separate standard semantic metadata. “Metadata” is a
general term that refers to data that describes other data. Metadata for a
document may explicitly identify the title of the document, provide a table
of all the geographic locations mentioned in the document, or include any
other information about the properties or content of the document.
“Physically separate” means that the metadata is stored in a separate file
rather than being embedded within the data file itself – an important
contrast to the dominant practice of embedding all metadata within
documents. “Semantic” means that the metadata represents the meaning of the
data, as opposed to just syntactic sugar. In particular, our approach
focuses on expressing the semantics of the content of the document, i.e.
the actual body text, rather than facts about the document which are
typically found in the header. “Standard” means that the metadata is
represented using semantics standards such as the Resource Description
Framework (RDF) and Web Ontology Language (OWL). In addition, “Standard
Semantic” means that the metadata strictly corresponds to an ontology so
that the meaning is explicit and can be processed by automated tools.
   Using physically separate semantic metadata for discovery is not a new
idea – this is a technique that has been used successfully by libraries for
centuries. A card (whether paper or electronic) in a card catalog is
metadata for a book in a library’s holdings. The card for a rare and
delicate book is itself neither rare nor delicate, and therefore does not
have to be subject to the same protections as the book itself. Whereas the
book may be held in a special collection accessible only to approved
scholars, the card describing the book can be publicly accessible, updated
frequently, and copies can be distributed to other libraries. In contrast,
metadata that is not physically separate from the data – metadata that
appears in the front matter of a book, for instance – cannot be updated and
can only be accessed by those who already have access to the book itself.
   Within the intelligence community, working with physically separate
metadata has all the advantages of working with catalog cards, and also
solves fundamental security problems that stand in the way of discovery and
sharing of information. There are two keys to this aspect of the solution.
First, the physically separate metadata can be at a lower level of
classification than the data itself. It is entirely possible that the very
nature of the metadata makes it lower level; or the system can be
specifically designed so that the metadata is of a lower classification, if
necessary. Second, the physically separate metadata can be stored on a
different network (or several different networks) than the original data.
The bottom line is,
while an organization may not be able to share much of its data for
security reasons, it may be able to share a great deal of metadata. That
metadata will allow intelligence analysts to discover the existence of
information that is important to them even if they have not been cleared to
access the data itself.
   It should also be noted that since electronic metadata files can be much
larger than physical index cards in a traditional card catalog, the
metadata may easily contain a wealth of valuable content information that
can be exploited independently of the actual data file. Of course, the
metadata might not have the same authority as the actual data (see the
sample scenario below), but it certainly can be used to suggest hypotheses.

                   III. ONTOLOGIES FOR DISCOVERY

   Rich ontologies are essential to the success of the approach to
discovery described here. Ontologies allow semantic searches to match even
if the query concept is more specific or more general than the concept in
the metadata. Semantic metadata is data about the meaning of the data.
Meaning has the property that it can be abstracted, which is important for
both discovery and security reasons. An aircraft ontology, for instance,
may indicate that the B-2 is a stealth bomber, a stealth bomber is a type
of bomber, and that bombers are a type of airplane. This will allow a
semantic search for the concept “airplane” to discover documents that
mention specific types of aircraft such as the B-2 (even when the documents
do not contain the query word “airplane”). And if the fact that a B-2 was
used for a particular mission makes a document classified, unclassified
metadata can be generated by referring to the more abstract concepts of
“stealth bomber” or, if necessary, “bomber” or just “airplane”. By
abstracting as little as possible to meet security requirements, the
semantic metadata can make the maximum amount of information available for
discovery and exploitation. Rich standard ontologies facilitate this type
of searching and abstraction. In the ideal case, the ontologies themselves
will be standards used across the intelligence community – a central topic
of this conference.
   Discovery based on physically-separate metadata is often viewed as a
last resort – a technique to be used only when security restrictions
prevent access to the data itself. Indeed, one could argue that it should
be a last resort when only very basic document metadata (e.g. Title,
Author, Date) is available. However, semantic metadata can be arbitrarily
rich, containing a detailed, unambiguous, machine-interpretable version of
the information contained in a document. Since rich metadata provides an
unambiguous and direct representation of the meaning of a document,
metadata can serve as a better basis for discovery and automatic
exploitation than even the document itself. As rich semantic metadata
becomes available for more and more documents in a repository, search
recall should increase, because exact matches are not necessary; and as the
metadata becomes richer, the precision should increase as well, since
fine-grained concepts from an ontology are less ambiguous than English
words. Once sufficiently rich semantic metadata is available,
metadata-based discovery can exceed both the recall and the precision of
keyword searching against full text documents.

             IV. SAMPLE SCENARIO OF SEMANTIC DISCOVERY

   An intelligence analyst is creating a map of the locations of certain
objects of interest. In the past, creating such maps required reading
intelligence cables that describe, in ordinary English, the locations of
the objects at various times. The analyst would then have to type all the
coordinates into a geographical information system (GIS) to create the map
– a tedious and error-prone task.
   In our approach, as each cable arrives, a metadata file is created that
contains RDF descriptions of what objects were at what locations at what
times based on standard ontologies. This RDF can be automatically generated
using existing information extraction technology such as NetOwl from SRA
International, TextTrainer from Northrop Grumman, or AeroText from Rocket
Software. A semantic metadata search – either a live search initiated by
the analyst, or an automated “batch” query that runs overnight – is then
used to discover all the metadata files that describe locations of objects
of interest. Having standard ontologies greatly facilitates the indexing
and retrieval required for this type of search. Since RDF is completely
structured, the resulting locations can automatically be loaded into the
GIS application. As a result, maps that previously took weeks to create
manually are now automatically generated in seconds more accurately from a
more comprehensive set of sources.
   After automatically generating a new map, the analyst sees an alarming
pattern and decides to write a report. Of course, she can’t use metadata as
source information for a formal intelligence report, so she logs on to the
data repository (to which she has access) to verify the pattern against the
original reporting. However, she is denied access to several of the cables
because they are stored in a restricted collection. Through official
channels (referenced in the metadata) she requests access to the restricted
collection, receives access, confirms the accuracy of the map, and produces
an important report. In the past, she never would have seen the pattern in
the first place because she wasn’t aware of the reports in the restricted
collection.

               V. ONTOLOGIES FOR INFORMATION SHARING

   The approach and claims described above for using semantic metadata to
improve discovery hold true equally well for information sharing – one can
simply view the sharing as a “push” of metadata across security boundaries
whereas discovery is like a “pull”. However, the use of ontologies and rich
semantic metadata can enhance information sharing in a radical way.
   Recall that our semantic metadata is represented in a standard language
(RDF) that is well-defined and machine-interpretable, and that we can
create rich ontologies in OWL
that are also machine-interpretable. For discovery, these ontologies enable
semantic searching by abstracting the query concepts; to aid information
sharing, ontologies can be used to automatically abstract or redact the
semantic metadata itself.
   Another feature of OWL is that it can encode inferences and other
logical constructs which can then be automatically processed in software.
Classification guides, rules, and policies can be represented in OWL, and
the computer can automatically apply those rules and policies to semantic
metadata. This allows the automatic redaction or abstraction of classified
metadata so that it conforms to the lower classification level. Semantic
technologies that exist today enable us to automatically redact metadata
for information sharing.
   We can actually take this one step further. We can write a
classification guide in OWL in such a way that a theorem prover can be used
to mathematically prove that the redacted data does not violate any
classification rules. Pellet is one example of a widely used and
well-respected open source theorem prover.

              VI. SAMPLE SCENARIO OF SEMANTIC SHARING

   Local law enforcement has a need-to-know whenever FBI identifies an
individual in the local community with terrorist connections. However,
local law enforcement has no need to know (and no interest in) the sources
or methods FBI used to obtain such information. In the past, whenever a new
terrorist connection was established and documented, the entire data record
was classified because it described how FBI obtained the information to
create the connection. The only way local law enforcement came to know
about the connection would be if an FBI agent read the entire report,
distilled it down to an unclassified version, obtained the relevant
approvals, and finally sent the information to local law enforcement.
   In our approach, as each suspect interview summary report is generated,
an RDF metadata file is generated containing names and known-terrorist
connections. Again, this can be automatically generated using existing
information extraction technology. This RDF metadata is automatically
routed to local law enforcement via a fully accredited hardware/software
guard device at the FBI network boundary. This guard reads the RDF,
compares it to classification guides and policies encoded in OWL, and
performs a logical redaction of the simple metadata facts. The redacted RDF
metadata is then allowed to pass outside the FBI network and travels on to
local law enforcement, where it can automatically be added to a database or
reformatted into a textual message. Through official channels (referenced
in the RDF), local law enforcement can request confirmation of the
information at any later date.

                          VII. CONCLUSION

   Discovering information in an environment with strict security
constraints is a critical problem for the intelligence community.
Physically-separate metadata can be used to overcome some of these
problems. Metadata can have a lower level of classification than the data
itself, and can reside on a different network than the data itself. In this
way, more accessible metadata indexes can be created and exploited while
fully maintaining the security of the source data. This means that even the
most sensitive documents can be discoverable, and much of the information
they contain can be exploited – even by analysts who have absolutely no
access to the source documents themselves. Effective discovery and
exploitation, however, depend on the availability of rich content metadata
that is based on extensive ontologies.
   There is an inherent conflict in the intelligence community between the
responsibility to share information and the responsibility to protect it.
This dilemma can be finessed by protecting data and sharing rich metadata.
This approach can be implemented within the current strict security
framework and will benefit significantly from the type of ontology work
discussed at this conference.
   Semantic technologies that exist today enable us to automatically
convert documents to metadata, automatically redact that metadata to any
security level, and automatically prove that the redaction is sound and
complete.

                          ACKNOWLEDGMENT

   Martin Thurn thanks Dr. Terry Patten for 20 years of friendship and
mentoring, and for his pioneering work in computational linguistics,
natural language processing, information extraction, and most recently,
application of semantic technologies to the problem of secure information
sharing.

                            REFERENCES

[1]   D. Nardi and R.J. Brachman, “An Introduction to Description Logics”,
      The Description Logic Handbook, Jan. 2003.
[2]   F. Baader and W. Nutt, “Basic Description Logics”, The Description
      Logic Handbook, Jan. 2003.
[3]   A. Uszok, J. Bradshaw, R. Jeffers, N. Suri, P. Hayes, M. Breedy, L.
      Bunch, M. Johnson, S. Kulkarni, and J. Lott, “KAoS Policy and Domain
      Services: Toward a Description-Logic Approach to Policy
      Representation, Deconfliction, and Enforcement”, Proceedings, pp
      93-96 [IEEE 4th International Workshop on Policies for Distributed
      Systems and Networks, June 4-6 2003].
[4]   J. Bradshaw, A. Uszok, R. Jeffers, N. Suri, P. Hayes, M. Burstein, A.
      Acquisti, B. Benyo, M. Breedy, M. Carvalho, D. Diller, M. Johnson, S.
      Kulkarni, J. Lott, M. Sierhuis, and R. Van Hoof, “Representation and
      Reasoning for DAML-Based Policy and Domain Services in KAoS and
      Nomads” [AAMAS ’03, July 14-18 2003, Melbourne Australia].
[5]   N. Suri, J. Bradshaw, M. Burstein, A. Uszok, B. Benyo, M. Breedy, M.
      Carvalho, D. Diller, P. Groth, R. Jeffers, M. Johnson, S. Kulkarni,
      and J. Lott, “DAML-Based Policy Enforcement for Semantic Data
      Transformation and Filtering in Multi-agent Systems” [AAMAS ’03,
      July 14-18 2003, Melbourne Australia].
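
                             APPENDIX

   The ontology-based matching of Section III and the abstraction-based
redaction of Section V can be illustrated with a minimal sketch. It assumes
a toy "is-a" hierarchy held in a Python dictionary in place of a real OWL
ontology, and a simple set of forbidden concepts in place of a
classification guide encoded in OWL. Only the aircraft hierarchy comes from
the text; the function names, cable identifiers, and the "tank" concept are
hypothetical and serve only the illustration.

```python
# Toy "is-a" hierarchy from the aircraft example in Section III.
# The data layout, function names, and cable identifiers are hypothetical.
IS_A = {
    "B-2": "stealth bomber",
    "stealth bomber": "bomber",
    "bomber": "airplane",
}


def ancestors(concept):
    """Return the concept followed by each successively more general concept."""
    chain = [concept]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


def matches(query, concept):
    """True if the query is the same as, more general than, or more specific
    than the metadata concept."""
    return query in ancestors(concept) or concept in ancestors(query)


def search(query, metadata):
    """Return the identifiers of documents whose metadata matches the query."""
    return sorted(
        doc for doc, concepts in metadata.items()
        if any(matches(query, c) for c in concepts)
    )


def abstract(concept, forbidden):
    """Return the most specific concept at or above `concept` that is not
    forbidden, or None if every level is forbidden."""
    for candidate in ancestors(concept):
        if candidate not in forbidden:
            return candidate
    return None


def redact(concepts, forbidden):
    """Abstract each concept as little as possible; drop what cannot be shared."""
    shared = []
    for concept in concepts:
        safe = abstract(concept, forbidden)
        if safe is not None and safe not in shared:
            shared.append(safe)
    return shared


# Hypothetical metadata: document identifier -> concepts mentioned.
METADATA = {
    "cable-1": ["B-2"],
    "cable-2": ["bomber"],
    "cable-3": ["tank"],
}

if __name__ == "__main__":
    print(search("airplane", METADATA))
    print(redact(["B-2"], {"B-2"}))
    print(redact(["B-2"], {"B-2", "stealth bomber"}))
```

   In this sketch a search for "airplane" finds cable-1 and cable-2 even
though neither contains that word, and redacting "B-2" yields "stealth
bomber" when only the specific aircraft is sensitive, or "bomber" when the
stealth capability is sensitive as well. A production system would
presumably express the hierarchy in OWL and delegate both steps to a
reasoner rather than to hand-written lookups.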