Semantics for Information Sharing and Discovery in the Intelligence Community

Martin Thurn
Northrop Grumman Corp., 4805 Stonecroft Blvd., Chantilly VA 20151, 703-449-3803, martin.thurn@ngc.com

Abstract—One of the distinguishing characteristics of the intelligence community is the strict security framework that is used to control classified information. A counterproductive side-effect of this strict security is that intelligence analysts are often not aware of information that is relevant to their analysis. Semantic technology and ontologies can help analysts discover relevant information even if that information is under the strictest controls and even if the analysts are not cleared to access the data. These techniques can be applied immediately within the current security framework of the intelligence community.

Index Terms—Discovery, Information Sharing, Metadata, Redaction, Semantics

I. INTRODUCTION

The many agencies of the United States intelligence community – and the corresponding organizations of her friend and partner countries around the world – employ a strict security framework to protect and control classified information. The basis of this framework is that a person is granted access to a sensitive document only if they need to know those data to perform their duties.

This basis creates two immediate impediments to information sharing and discovery across the boundaries of security levels and compartments. When sensitivity classifications are assigned to an entire document, it prevents an unapproved analyst from seeing any portion of the document, even when the document may actually contain a mixture of sensitive and unclassified information. To make matters worse, it is often the case that an unapproved analyst is prevented from knowing even the existence of that document. In the former case, the analyst can at least ask for permission to read the document and fulfill her duties; in the latter case, there is virtually no hope for the analyst ever to see the data.

II. PHYSICALLY-SEPARATE STANDARD SEMANTIC METADATA

We have developed an approach to discovering and sharing information that is particularly well-suited to the intelligence community, an approach based on physically-separate standard semantic metadata. “Metadata” is a general term that refers to data that describes other data. Metadata for a document may explicitly identify the title of the document, provide a table of all the geographic locations mentioned in the document, or include any other information about the properties or content of the document. “Physically separate” means that the metadata is stored in a separate file rather than being embedded within the data file itself – an important contrast to the dominant practice of embedding all metadata within documents. “Semantic” means that the metadata represents the meaning of the data, as opposed to just syntactic sugar. In particular, our approach focuses on expressing the semantics of the content of the document, i.e. the actual body text, rather than facts about the document which are typically found in the header. “Standard” means that the metadata is represented using semantic standards such as the Resource Description Framework (RDF) and Web Ontology Language (OWL). In addition, “Standard Semantic” means that the metadata strictly corresponds to an ontology so that the meaning is explicit and can be processed by automated tools.

Using physically separate semantic metadata for discovery is not a new idea – this is a technique that has been used successfully by libraries for centuries. A card (whether paper or electronic) in a card catalog is metadata for a book in a library’s holdings. The card for a rare and delicate book is itself neither rare nor delicate, and therefore does not have to be subject to the same protections as the book itself. Whereas the book may be held in a special collection accessible only to approved scholars, the card describing the book can be publicly accessible, updated frequently, and copied for distribution to other libraries. In contrast, metadata that is not physically separate from the data – metadata that appears in the front matter of a book, for instance – cannot be updated and can only be accessed by those who already have access to the book itself.

Within the intelligence community, working with physically separate metadata has all the advantages of working with catalog cards, and also solves fundamental security problems that stand in the way of discovery and sharing of information. There are two keys to this aspect of the solution. First, the physically separate metadata can be at a lower level of classification than the data itself. It is entirely possible that the very nature of the metadata makes it lower level; or the system can be specifically designed so that the metadata is of a lower classification, if necessary. Second, the physically separate metadata can be stored on a different network (or several different networks) than the original data. The bottom line is, while an organization may not be able to share much of its data for security reasons, it may be able to share a great deal of metadata. That metadata will allow intelligence analysts to discover the existence of information that is important to them even if they have not been cleared to access the data itself.

It should also be noted that since electronic metadata files can be much larger than physical index cards in a traditional card catalog, the metadata may easily contain a wealth of valuable content information that can be exploited independently of the actual data file. Of course, the metadata might not have the same authority as the actual data (see the sample scenario below), but it certainly can be used to suggest hypotheses.

III. ONTOLOGIES FOR DISCOVERY

Rich ontologies are essential to the success of the approach to discovery described here. Ontologies allow semantic searches to match even if the query concept is more specific or more general than the concept in the metadata. Semantic metadata is data about the meaning of the data. Meaning has the property that it can be abstracted, which is important for both discovery and security reasons. An aircraft ontology, for instance, may indicate that the B-2 is a stealth bomber, a stealth bomber is a type of bomber, and that bombers are a type of airplane. This will allow a semantic search for the concept “airplane” to discover documents that mention specific types of aircraft such as the B-2 (even when the documents do not contain the query word “airplane”). And if the fact that a B-2 was used for a particular mission makes a document classified, unclassified metadata can be generated by referring to the more abstract concepts of “stealth bomber” or, if necessary, “bomber” or just “airplane”. By abstracting as little as possible to meet security requirements, the semantic metadata can make the maximum amount of information available for discovery and exploitation. Rich standard ontologies facilitate this type of searching and abstraction. In the ideal case, the ontologies themselves will be standards used across the intelligence community – a central topic of this conference.

Discovery based on physically-separate metadata is often viewed as a last resort – a technique to be used only when security restrictions prevent access to the data itself. Indeed, one could argue that it should be a last resort when only very basic document metadata (e.g. Title, Author, Date) is available. However, semantic metadata can be arbitrarily rich, containing a detailed, unambiguous, machine-interpretable version of the information contained in a document. Since rich metadata provides an unambiguous and direct representation of the meaning of a document, metadata can serve as a better basis for discovery and automatic exploitation than even the document itself. As rich semantic metadata becomes available for more and more documents in a repository, search recall should increase, because exact matches are not necessary; and as the metadata becomes richer, the precision should increase as well, since fine-grained concepts from an ontology are less ambiguous than English words. Once sufficiently rich semantic metadata is available, metadata-based discovery can exceed both the recall and the precision of keyword searching against full text documents.

IV. SAMPLE SCENARIO OF SEMANTIC DISCOVERY

An intelligence analyst is creating a map of the locations of certain objects of interest. In the past, creating such maps required reading intelligence cables that describe, in ordinary English, the locations of the objects at various times. The analyst would then have to type all the coordinates into a geographical information system (GIS) to create the map – a tedious and error-prone task.

In our approach, as each cable arrives, a metadata file is created that contains RDF descriptions of what objects were at what locations at what times based on standard ontologies. This RDF can be automatically generated using existing information extraction technology such as NetOwl from SRA International, TextTrainer from Northrop Grumman, or AeroText from Rocket Software. A semantic metadata search – either a live search initiated by the analyst, or an automated “batch” query that runs overnight – is then used to discover all the metadata files that describe locations of objects of interest. Having standard ontologies greatly facilitates the indexing and retrieval required for this type of search. Since RDF is completely structured, the resulting locations can automatically be loaded into the GIS application. As a result, maps that previously took weeks to create manually are now generated automatically in seconds, more accurately, and from a more comprehensive set of sources.

After automatically generating a new map, the analyst sees an alarming pattern and decides to write a report. Of course, she can’t use metadata as source information for a formal intelligence report, so she logs on to the data repository (to which she has access) to verify the pattern against the original reporting. However, she is denied access to several of the cables because they are stored in a restricted collection. Through official channels (referenced in the metadata) she requests access to the restricted collection, receives access, confirms the accuracy of the map, and produces an important report. In the past, she never would have seen the pattern in the first place because she wasn’t aware of the reports in the restricted collection.

V. ONTOLOGIES FOR INFORMATION SHARING

The approach and claims described above for using semantic metadata to improve discovery hold true equally well for information sharing – one can simply view the sharing as a “push” of metadata across security boundaries whereas discovery is like a “pull”. However, the use of ontologies and rich semantic metadata can enhance information sharing in a radical way.

Recall that our semantic metadata is represented in a standard language (RDF) that is well-defined and machine-interpretable, and that we can create rich ontologies in OWL that are also machine-interpretable. For discovery, these ontologies enable semantic searching by abstracting the query concepts; to aid information sharing, ontologies can be used to automatically abstract or redact the semantic metadata itself.

Another feature of OWL is that it can encode inferences and other logical constructs which can then be automatically processed in software. Classification guides, rules, and policies can be represented in OWL, and the computer can automatically apply those rules and policies to semantic metadata. This allows the automatic redaction or abstraction of classified metadata so that it conforms to a lower classification level. Semantic technologies that exist today enable us to automatically redact metadata for information sharing.

We can actually take this one step further. We can write a classification guide in OWL in such a way that a theorem prover can be used to mathematically prove that the redacted data does not violate any classification rules. Pellet is one example of a widely-used and well-respected open source theorem prover.

VI. SAMPLE SCENARIO OF SEMANTIC SHARING

Local law enforcement has a need to know whenever the FBI identifies an individual in the local community with terrorist connections. However, local law enforcement does not have a need to know (nor do they even care about) the sources or methods the FBI used to obtain such information. In the past, whenever a new terrorist connection was established and documented, the entire data record was classified because it described how the FBI obtained the information to create the connection. The only way local law enforcement came to know about the connection would be if an FBI agent read the entire report, distilled it down to an unclassified version, obtained the relevant approvals, and finally sent the information to local law enforcement.

In our approach, as each suspect interview summary report is generated, an RDF metadata file is generated containing names and known-terrorist connections. Again, this can be automatically generated using existing information extraction technology. This RDF metadata is automatically routed to local law enforcement via a fully accredited hardware/software guard device at the FBI network boundary. This guard reads the RDF, compares it to classification guides and policies encoded in OWL, and performs a logical redaction of the simple metadata facts. The redacted RDF metadata is then allowed to pass outside the FBI network and travels on to local law enforcement, where it can automatically be added to a database or reformatted into a textual message. Through official channels (referenced in the RDF), local law enforcement can request confirmation of the information at any later date.

VII. CONCLUSION

Discovering information in an environment with strict security constraints is a critical problem for the intelligence community. Physically-separate metadata can be used to overcome some of these problems. Metadata can have a lower level of classification than the data itself, and can reside on a different network than the data itself. In this way, more accessible metadata indexes can be created and exploited while fully maintaining the security of the source data. This means that even the most sensitive documents can be discoverable, and much of the information they contain can be exploited – even by analysts who have absolutely no access to the source documents themselves. Effective discovery and exploitation, however, depend on the availability of rich content metadata that is based on extensive ontologies.

There is an inherent conflict in the intelligence community between the responsibility to share information and the responsibility to protect it. This dilemma can be finessed by protecting data and sharing rich metadata. This approach can be implemented within the current strict security framework and will benefit significantly from the type of ontology work discussed at this conference.

Semantic technologies that exist today enable us to automatically convert documents to metadata, automatically redact that metadata to any security level, and automatically prove that the redaction is sound and complete.

ACKNOWLEDGMENT

Martin Thurn thanks Dr. Terry Patten for 20 years of friendship and mentoring, and for his pioneering work in computational linguistics, natural language processing, information extraction, and most recently, application of semantic technologies to the problem of secure information sharing.
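The ontology-based search of Section III and the abstraction-based redaction of Section V can be illustrated with a small sketch. It is not the system described in this paper: it uses a plain Python dictionary in place of an OWL ontology, a set of concept names in place of a classification guide, and a simple membership check in place of a theorem prover. All names here (`PARENT`, `METADATA`, `semantic_search`, `abstract`, `redact`) are hypothetical, and the "F-16" and "Fighter" concepts are added only to give the search something to exclude.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy ontology (child -> parent), following the aircraft example
# in Section III. "F-16" and "Fighter" are illustrative additions.
PARENT: Dict[str, str] = {
    "B-2": "StealthBomber",
    "StealthBomber": "Bomber",
    "Bomber": "Airplane",
    "F-16": "Fighter",
    "Fighter": "Airplane",
}

Record = Tuple[str, str]  # (document id, concept mentioned in the document)

# Hypothetical metadata records, stored separately from the documents.
METADATA: List[Record] = [
    ("cable-001", "B-2"),
    ("cable-002", "F-16"),
    ("cable-003", "Bomber"),
]


def ancestors(concept: str) -> List[str]:
    """Return the concept followed by its ancestors, most specific first."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def subsumes(general: str, specific: str) -> bool:
    """True if `general` is `specific` itself or one of its ancestors."""
    return general in ancestors(specific)


def semantic_search(query: str, metadata: List[Record]) -> List[str]:
    """Match records whose concept is more specific OR more general than the query."""
    return [
        doc_id
        for doc_id, concept in metadata
        if subsumes(query, concept) or subsumes(concept, query)
    ]


def abstract(concept: str, restricted: Set[str]) -> Optional[str]:
    """Abstract as little as possible: return the most specific releasable concept."""
    for candidate in ancestors(concept):
        if candidate not in restricted:
            return candidate
    return None  # nothing about this concept may be released


def redact(metadata: List[Record], restricted: Set[str]) -> List[Record]:
    """Produce lower-level metadata by abstracting or dropping restricted concepts."""
    released = []
    for doc_id, concept in metadata:
        safe = abstract(concept, restricted)
        if safe is not None:
            released.append((doc_id, safe))
    return released


if __name__ == "__main__":
    restricted = {"B-2"}  # stand-in for a classification guide
    released = redact(METADATA, restricted)
    # Simple stand-in for the proof step: no released concept is restricted.
    violations = [rec for rec in released if rec[1] in restricted]
    print(semantic_search("Airplane", released))
    print(released, violations)
```

In this sketch a query for "Airplane" still finds the record derived from the B-2 cable after redaction, because the released concept "StealthBomber" remains a kind of airplane. A real deployment would presumably rely on an OWL reasoner for both subsumption and the policy check rather than on hand-written traversal.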