IUI Workshops'19, March 20, 2019, Los Angeles, USA

Understanding and Exploring Competitive Technical Data from Large Repositories of Unstructured Text

James J. Nolan, Decisive Analytics Corporation, Arlington, VA, USA, jim.nolan@dac.us
Mark Stevens, Decisive Analytics Corporation, Jeffersonville, IN, USA, mark.stevens@dac.us
Peter David, Decisive Analytics Corporation, Arlington, VA, USA, peter.david@dac.us

ABSTRACT
We present an approach to automatically processing open source unstructured data to extract relevant technical information. The approach is tailored towards technology monitoring, and specifically to preventing “technical surprise”: when a competitor or adversary develops and deploys an unexpected technology. Our approach takes advantage of Natural Language Processing, Entity Extraction, and Visual Document Processing. We provide an intuitive interface that allows users to easily interact with the Machine Learning system.

Author Keywords
Technology tracking, natural language processing, semantic reasoning, entity extraction, relationship extraction, directed exploration.

ACM Reference format:
Nolan, James, Stevens, Mark, and David, Peter. 2019. Understanding and Exploring Competitive Technical Data from Large Repositories of Unstructured Text. In Joint Proceedings of the ACM IUI 2019 Workshops, Los Angeles, USA, March 20, 2019, 5 pages.

Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

INTRODUCTION
Technology across the global landscape is changing at a record pace. Keeping track of or discovering this information in a timely fashion remains a difficult challenge.

There are many use cases where it is important to prevent “technical surprise”, when a competitor or adversary develops and deploys an unexpected technology. For example, consider smart phone technology. Smart phone manufacturers, such as Apple or Samsung, would like to know immediately when one of their competitors develops a chip that outperforms previous generations or develops a new glass with greater drop resistance. Consider military adversaries as another example. Military leaders need to know when adversaries develop new planes or weapons that can fly higher or further than before.

The challenge of tracking technology to avoid technical surprise is due in part to the fact that manufacturers attempt to guard this information, choosing not to publish it for fear of losing their competitive advantage. However, in spite of this guarding of information, the data does frequently end up in the open source domain through industry publications, journal/conference proceedings, news sources, or press releases. Given that this information does make it out into the unstructured wild, finding, extracting, and tagging the information, and ultimately putting it into a format and structure that can be used for competitive analysis, is an enormous challenge.

To address this problem, we developed a tool called Tech-Trakr,¹ which encapsulates a suite of Natural Language Processing (NLP) and Machine Learning (ML) capabilities to perform automated extraction and support directed exploration of competitive technical data from unstructured text. In this paper, we (1) provide an overview of the Tech-Trakr system and underlying technologies, and (2) present a case study to exemplify how Tech-Trakr supports directed exploration for understanding a particular technology.

¹ http://techtrakr.dac.us/techtrakr-ui/#/about

TECH-TRAKR OVERVIEW
An overview of the Tech-Trakr tool is illustrated in Figure 1. Working from left to right, Tech-Trakr harvests data from the web using results from commercial search engines, as well as focusing on specific sites of interest, news sources, and classified document collections. Ingested data is formatted and normalized for processing to provide clean data to the downstream algorithms. Tech-Trakr automatically identifies specific entities, resolves them to remove ambiguity, extracts metadata and values that describe those entities, and identifies relationships between them. These analytics are informed by domain-specific ontologies that encode expertise for optimization. The entity and relationship information that is output by the analytics is then stored in a database and can be accessed via an API or through custom visualizations.

Figure 1 - An Overview of the Tech-Trakr approach.

Figure 2 provides more detail on the NLP capabilities and visualizations embedded within Tech-Trakr. For the purposes of this paper, we focus on two key enablers for updating technology databases from unstructured data, Entity Extraction and Relationship Discovery, and on how they provide actionable, database-quality information from unstructured data.

Figure 2 - An Overview of the NLP Capabilities Utilized by Tech-Trakr

BACKGROUND
Tech-Trakr relies on four primary NLP capabilities that automatically process and provide insights into large unstructured text repositories: Dealing with Diverse Data, Statistical Topic Modeling (STM), Semantic Role Labeling (SRL), and Entity Extraction and Disambiguation.

Dealing with Diverse Unstructured Data Sets
Tech-Trakr provides the ability to collect, parse, and extract relevant information from diverse and unstructured data sets. Collection from such a large number of sources produces data with extreme variety of file formats, document organization, page layout, text style, and content. This extreme document variety makes it difficult to perform even simple tasks such as understanding the content and how it impacts analysis. A machine learning capability that can automate skills that analysts perform well, while also scaling up to handle data velocity, is critically needed. Tasks such as extracting document titles, authorship information, security classification, or top-level headings are difficult and time-consuming processes.

One major cause of this difficulty is the proliferation of metadata-less formats such as PDF files of scanned documents and text data formatted solely through improvised or informally defined typographic conventions. These data lack an underlying, machine-readable explanation of how text formatting and style represent document structure and metadata.

Tech-Trakr provides two foundational machine learning capabilities for dealing with these issues:

Data Harvesting
Tech-Trakr acquires content both by ingesting internally-maintained collections of documents and by initiating web searches for online content. Tech-Trakr’s ingest pipeline is designed to process data with extreme heterogeneity in terms of file formats, document organization, page layout, text style, and content. The on-line retrieval function runs periodically, retrieving new content when it is available online.

Visual Document Processing
Tech-Trakr includes a Visual Document Processing (VDP) capability that uses visual analysis of documents to infer the communicative intent of the author and to recover document structure and metadata. Our algorithm identifies the components of documents, such as titles, headings, and body content, based on their appearance. Our algorithm is entirely format-agnostic; it does not rely on document mark-up or metadata to identify the structural components of a document. Instead, it operates on an image of a document and can learn from any document type, including scanned images.

Statistical Topic Modeling
Statistical topic modeling [1] discovers topics and clusters documents to support rapid exploration of data.

Semantic Role Labeling (SRL)
SRL extracts meaning from sentences by identifying and labeling semantic predicates and arguments. SRL is an NLP process that maps the words and phrases in unstructured text to a formal model of text meaning. In our prior work, we developed an SRL capability that analyzes the semantics of whole sentences, identifies the fundamental concepts, called frames [2], that are discussed, maps words and phrases from the text to the roles that are related to these concepts, and ultimately updates structured databases. Our SRL capability has been used to perform analysis of open source data [3], extract rich social network structures from unstructured text [4], and accurately extract entities from unstructured data and map information about them to structured databases. An example of the output of our SRL capability is shown in Figure 3.

Figure 3 - Semantic Role Labeling Example. Sample text: “US jets destroyed a Russian T-72 battle tank in East Syria Saturday after government forces fired on US special ops near same location of last week's attack. 3 inside tank were killed.” Frame and role labels shown in the figure: Destroying (Destroyer, Undergoer, Time); Attack (Assailant, Victim); Killing (Victim).

Figure 3 illustrates how SRL maps words and phrases in text to a structured model of meaning. Our event-oriented model of meaning defines hundreds of event types, such as Destroying, Attack, and Killing. The SRL algorithm determined that the words destroyed, fired, and killed in the sample text evoke these event types. Other phrases in the text, such as US jets, were mapped to event-specific roles.

This SRL capability is at the core of our Tech-Trakr product. Tech-Trakr uses SRL to find the relationships between products, manufacturers, and other entities. Tech-Trakr stores the extracted information in a database, allowing downstream analytics to retrieve information about events, relationships between entities, and attributes of those entities.

Entity Extraction and Disambiguation
Entity extraction and disambiguation provide a consolidated view of an entity across the entire text data set [5].

CASE STUDY: TRANSPARENT ARMOR
As a working example of Tech-Trakr’s capabilities, consider an analyst tasked with assessing industry’s Transparent Armor² capabilities. An analyst can perform an open-source search to quickly discover that the compounds Aluminum Oxynitride (ALON) and Aluminum Oxide (Sapphire) are critical components for transparent armor. While discovering the importance of these components may be a simple task, it is significantly more difficult to determine all of the important characteristics of these materials and present them in a meaningful way for sharing with other analysts. Additionally, consider that ALON and Sapphire are two of a potentially large set of Transparent Armor materials of interest to an analyst. The challenge is accurately extracting every relevant material and guaranteeing that this information is up-to-date and accurate. In other words, the challenge is extracting all relevant information for all technologies of interest, at scale, as it becomes available.

² Transparent Armor is a type of bullet proof glass that can be worn to prevent injury

Now let us consider specifically the Transparent Armor use case and walk through how Tech-Trakr automatically populates a database with information that can be used to generate a detailed profile about this technology.

Technology Profile
In Figure 4, we show Tech-Trakr’s automatically generated profile of the Transparent Armor technology, which includes extracted attributes and relationships related to the technology. These extracted attributes and relationships are those that Tech-Trakr has identified as important to most accurately capture the essence of Transparent Armor. Within the Component section, for example, Tech-Trakr has automatically linked ALON to Night Vision Goggles and Joint Air-to-Ground Missiles. Additionally, Tech-Trakr extracted the chemical composition (Aluminum Oxynitride), the manufacturer (Surmet Corporation), the manufacturing location (Burlington, Mass.), and the claim that it can defeat a .50 BMG Armor Piercing Round. We are displaying only a small subset of the over 100 categories of information that have been discovered about this ballistic glass that comprises Transparent Armor.

Figure 4 - Extracted categories and their Values for Transparent Armor

The complete Tech-Trakr profile for Transparent Armor consists of over 200 additional characteristics, correlations, and relationships, and was extracted from fewer than 500 articles discovered via the open source harvesting capability.

Directed Source Exploration
The Tech-Trakr profile shown in Figure 4 is interactive and allows the analyst to drill into the source material from which the relevant information was extracted. The attribute values highlighted in blue are named entity hyperlinks that navigate to more information about that concept in the form of its own entity profile. The icons to the right of each attribute value provide advanced user options and information. Clicking the icon that resembles an eye navigates to an annotated view of the source data from which the attribute value was extracted.

For example, in Figure 5, within the Defeats section of the Transparent Armor profile, there is an entry for the .50 BMG Armor-Piercing Round. An analyst interested in understanding how the system determined that Transparent Armor has a “Defeats” relationship with this caliber of ammunition can click the eye icon on that row of the profile, which navigates to the annotated source data view shown in Figure . This view displays the source sentence and the name of the semantic frame from which the attribute or relationship was identified. Additionally, the source sentence is annotated with the recognized roles of the semantic frame, including the verb that evoked the frame. The “View Artifact” link located beneath the source sentence navigates to the original annotated document, complete with metadata outlining the originating source as well as the collection and creation date of the document, as shown in Figure 6.

Figure 5 - Exploring the "Defeats" category to Determine a Type of Munition incapable of penetrating Transparent Armor

Figure 6 - The Tech-Trakr Approach enables exploration down to the original source document

CONCLUSION
In this paper we present a tool called Tech-Trakr that automatically extracts and provides analysts with an overview and directed exploration of technical data from unstructured text. This tool is based on NLP techniques, including SRL and entity extraction and disambiguation, to automatically extract and organize information relevant to various technologies. Tech-Trakr produces technology profiles containing relevant information, such as chemical composition, capabilities, strength, durability, and alternate applications. Analysts interact with the profiles to explore the relevant source data and gain additional understanding of the technology. We demonstrate the Tech-Trakr capability using a specific use case of understanding Transparent Armor technology.

REFERENCES
1. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” The Journal of Machine Learning Research 3 (2003): 993–1022.
2. Ruppenhofer, Josef, Michael Ellsworth, Miriam R. L. Petruck, Christopher R. Johnson, and Jan Scheffczyk. FrameNet II: Extended Theory and Practice, 2010. http://framenet2.icsi.berkeley.edu/docs/r1.5/book.pdf.
3. Kase, Sue E. “Accelerating Exploitation of Low-Grade Intelligence through Semantic Text Processing of Social Media.” DTIC Document, 2013. http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=ADA587022.
4. Davenport, Jack H., and James J. Nolan. “Social Network Analysis Realization and Exploitation.” Baltimore, MD, 2015.
5. Ward, Kevin, and Jack Davenport. “Human-Machine Interaction to Disambiguate Entities in Unstructured Text and Structured Datasets.” Anaheim, CA, 2017.
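
As an illustration of the frame-and-role output described in the Semantic Role Labeling section and Figure 3, the following minimal Python sketch shows one way such output could be held in memory and flattened into database rows for later retrieval. It is not Tech-Trakr's implementation: the `Frame` class, the `role_fillers` table, and the helper names are hypothetical, the annotations are written by hand rather than produced by an SRL algorithm, and the alignment of roles to text spans is an assumption read from the order of the labels in the figure.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Frame:
    """One evoked event type with its role fillers (hypothetical structure)."""
    name: str          # event type, e.g. "Destroying"
    trigger_word: str  # word in the text that evoked the frame
    roles: dict = field(default_factory=dict)  # role name -> text span


# Hand-written annotations for the Figure 3 sample text. The pairing of each
# role with a text span is an assumption based on the figure's label order.
FRAMES = [
    Frame("Destroying", "destroyed", {
        "Destroyer": "US jets",
        "Undergoer": "a Russian T-72 battle tank",
        "Time": "Saturday",
    }),
    Frame("Attack", "fired", {
        "Assailant": "government forces",
        "Victim": "US special ops",
    }),
    Frame("Killing", "killed", {"Victim": "3 inside tank"}),
]


def to_rows(frames):
    """Flatten frames into (frame, trigger_word, role, filler) tuples."""
    return [
        (f.name, f.trigger_word, role, span)
        for f in frames
        for role, span in f.roles.items()
    ]


def load(rows):
    """Store the flattened rows in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE role_fillers "
        "(frame TEXT, trigger_word TEXT, role TEXT, filler TEXT)"
    )
    conn.executemany("INSERT INTO role_fillers VALUES (?, ?, ?, ?)", rows)
    return conn


def fillers(conn, frame, role):
    """Return every text span stored for the given frame and role."""
    cur = conn.execute(
        "SELECT filler FROM role_fillers WHERE frame = ? AND role = ?",
        (frame, role),
    )
    return [row[0] for row in cur.fetchall()]


if __name__ == "__main__":
    conn = load(to_rows(FRAMES))
    print(fillers(conn, "Destroying", "Destroyer"))
```

Storing one row per role filler keeps the table schema independent of the event type, so a downstream query such as `fillers(conn, "Attack", "Victim")` works the same way for any frame the model defines.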