Big Data for Combating Cyber Attacks

Terry Janssen, PhD, SAIC
Chief Scientist & Cyber Strategist
Cyber Operations
Washington, D.C. USA
terry.l.janssen@saic.com

Nancy Grady, PhD, SAIC
Technical Fellow, Data Science
Emerging Technologies
Oak Ridge, TN USA
nancy.w.grady@saic.com

Abstract—This position paper explores a means of improving cybersecurity using Big Data technologies augmented by ontology for preventing or reducing losses from cyber attacks. Because of the priority of this threat to national security, it is necessary to attain results far superior to those found in modern-day security operations centers (SOCs). Focus is on the potential application of ontology engineering to this end. Issues and potential next steps are discussed.

Keywords—big data; ontology; cybersecurity; modeling; search; discovery; analytics; variety; metadata

I. INTRODUCTION

The last few years have seen tremendous increases in the amount of data being generated and used to provide capabilities never before possible. "Big Data" refers to the new engineering paradigm that scales data systems horizontally across a collection of distributed resources, rather than relying only on the earlier vertical scaling that brought faster processors and more data storage into a single monolithic data platform. Big Data technologies have the potential to revolutionize our capability to handle the large datasets generated in cyber data analytics. The challenge, however, is not just in handling the large volumes and high data generation rates, but in leveraging all available data sources to provide better and faster analytics for attack detection and response. In this paper, we discuss Big Data analytics, metadata, and semantics for data integration, and their application to cybersecurity and cyber data management.

II. BIG DATA

Big Data has several defining characteristics, including volume, variety (of data types and domains-of-origin), and the data flow characteristics of velocity (rate) and variability (change in rate) with which the data is generated and collected.

Traditional data systems collect data and curate it into information stored in a data warehouse, with a schema tuned for the specific analytics for which the data warehouse was built.

Velocity refers to a characteristic previously described as streaming data. The log data from cell phones, for example, flows rapidly into systems, and alerting and analytics are done on the fly before the curation and routing of data or aggregated information into persistent storage. In a Big Data architecture, this implies the addition of application servers to handle the load. Variability refers to changes in the data flow's velocity, which for cost-effectiveness leads to the automated spawning of additional processors in cloud systems to handle the load as it increases, and the release of those resources as the load diminishes. Volume is the dataset characteristic most identified with Big Data. The engineering revolution began due to the massive datasets from web and system logs. The implication has been the storage of the data in its raw format, onto distributed resources, with the curation and imposition of a schema only when the data is read.

Big Data Analytics. Much of the development of Big Data engineering is a result of the need to analyze massive web log data. Massive web logs were first filtered by page for aggregate page counts, to determine the popularity of pages. Then the pages were analyzed for sessions (spawning the now massive "cookie" industry to make this simpler). "Sessions" are the sequence of activities that describe a customer's interaction with the site at a single sitting, with the analyst defining what time window is considered a session. The next step in analytics capability came from the realization that these sessions could be abstracted into patterns rather than being treated as just the literal collection of pages. With this step, traversal patterns helped site designers see the efficiencies in their link structure. Furthermore, these usage patterns could in some cases be attached to a customer account record. With this step, the site could be tuned to benefit the most valuable customers, with separate paths designed for the casual visitor to browse, leaving easy, efficient handling for loyal customers. This pattern-oriented analysis carries over to the cyber domain, in analyzing logs from a server.
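To make the sessionization step concrete, the following minimal Python sketch groups page views into sessions using an analyst-chosen time window and then abstracts each session into its traversal pattern. The log records, field layout, and 30-minute window are invented for illustration and are not drawn from any dataset discussed in this paper.

from datetime import datetime, timedelta

# Illustrative log records as (visitor_id, timestamp, page); the field layout is assumed.
log = [
    ("v1", datetime(2013, 9, 1, 10, 0), "/home"),
    ("v1", datetime(2013, 9, 1, 10, 5), "/products"),
    ("v1", datetime(2013, 9, 1, 14, 0), "/home"),      # long gap, so a new session starts here
    ("v2", datetime(2013, 9, 1, 10, 2), "/home"),
]

def sessionize(records, window=timedelta(minutes=30)):
    """Group page views into sessions: a gap longer than `window` starts a new session."""
    sessions = {}
    for visitor, ts, page in sorted(records, key=lambda r: (r[0], r[1])):
        visitor_sessions = sessions.setdefault(visitor, [])
        if visitor_sessions and ts - visitor_sessions[-1][-1][0] <= window:
            visitor_sessions[-1].append((ts, page))
        else:
            visitor_sessions.append([(ts, page)])
    return sessions

for visitor, visitor_sessions in sessionize(log).items():
    # Abstract each session into its traversal pattern (the ordered list of pages visited).
    patterns = [[page for _, page in s] for s in visitor_sessions]
    print(visitor, patterns)

The same grouping, applied to server logs keyed by source address instead of visitor ID, is the starting point for the cyber session patterns discussed next.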
The last 15 years have seen the extension of a number of analytics techniques to leverage the horizontal Big Data scaling paradigm to address both log data and the linked-node data found in social sites. The cyber community can leverage web log analysis and Social Network Analysis to use these massive amounts of data to determine session patterns and the appropriateness of activity between resources. The challenge is that cyber must also deal with a richer set of attributes for the resources and their expected or allowed interconnections, which adds a variety of other contextual datasets into the analysis.

Variety. Traditional systems handled the variety of data through a laborious integration process to standardize terminology, normalize into relational tables, choose indexes, and store into a data warehouse tuned for the specific analytics that are needed. This is an inflexible process that does not easily accommodate new data sources, changes in the underlying data feeds, or new analytical requirements.

For web log analysis, the extension to customer session analytics only required the assignment of a customer or visitor ID to the session, allowing integration with a purchasing history. In the cyber analytics case, the integration point is not so simple. The integration of packet data with server log data, port-to-port connectivity data, server type data, network router settings, and so forth provides a more complex use case, needing a more sophisticated way to integrate such a variety of data, some of which carries a number of additional attributes that are needed.

Recently, variety datasets have been addressed through mashups that dynamically integrate a couple of datasets from multiple domains to provide new business capabilities. Early mashups demonstrated this value, for example, in the integration of crime data with real estate listings; a valuable analysis that was not possible before the availability of open datasets. There is a limitation to such mashups, because only a limited number of datasets are integrated and the integration variables are manually selected. This type of manual integration is insufficient for analytics across different large-volume datasets with complex inter-relationships.

Variety is the Big Data attribute that will enable more sophisticated cyber analytics. The requirement is for a mechanism to integrate multiple highly diverse datasets in an automated and scalable way. This is best achieved through controlled metadata.

III. METADATA

The executive branch has been pushing an open data initiative to move the federal government into being a data steward. The goal in releasing the data is to better serve the public and promote economic growth through the reuse of this data. The difficulty in using this data arises from the lack of metadata descriptions. Data reuse requires as much information as possible on the provenance of data: the full history of the methods used for collection, curation, and analysis. Proper metadata increases the chances that datasets are re-purposed correctly, leading to analytical conclusions that are less likely to be flawed.

Two mechanisms are used for dataset integration in a relational model: lookup tables are established to translate to a common vocabulary for views, and a one-to-one correspondence is used to create keys between tables. In a NoSQL environment, joins are not possible, so table lookups and keys cannot be used for data integration. The connection of data across datasets must instead reside in the query logic and must rely on information external to the datasets. This metadata logic must be used to select the relevant data for later integration and analysis, implying the need for both a standard representation and additional attributes to achieve automated data retrieval.
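A minimal sketch of this idea follows: a small metadata registry, held outside the data stores, maps a common concept to each dataset's own field name so that retrieval across the stores can be automated. The registry entries, dataset names, and field names are hypothetical, and the in-memory lists stand in for what would be NoSQL repositories in practice.

# Hypothetical metadata registry: common concept -> field name within each dataset.
METADATA = {
    "ip_address": {"firewall_log": "src_ip", "dns_log": "client", "asset_db": "ipv4"},
}

# Stand-ins for three separately stored datasets (in practice, NoSQL stores or raw files).
DATASETS = {
    "firewall_log": [{"src_ip": "10.0.0.5", "action": "DENY"}],
    "dns_log":      [{"client": "10.0.0.5", "query": "evil.example.com"}],
    "asset_db":     [{"ipv4": "10.0.0.5", "owner": "finance"}],
}

def records_about(concept, value):
    """Pull matching records from every dataset, using the metadata layer to
    translate the common concept into each dataset's own field name."""
    results = {}
    for dataset, field in METADATA[concept].items():
        results[dataset] = [r for r in DATASETS[dataset] if r.get(field) == value]
    return results

print(records_about("ip_address", "10.0.0.5"))

Because the mapping lives entirely in the metadata layer, adding a fourth dataset only requires a new registry entry, not a change to the stored data or to the query logic.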
A second approach is used to speed the data integration process for manual mashups of diverse datasets. Often XML wrappers are used to encapsulate the data elements, with the nomenclature for each dataset provided in the wrapper, based on user interpretation of the data elements. This approach allows rapid integration of data through the wrappers (as opposed to a lengthy data warehouse integration), but it is not an approach that can be automated, nor can it be used for large-volume datasets that cannot be copied. Even in a mashup, the wrapper terms used in the metadata are themselves subject to interpretation, making reuse of data elements difficult.

Without metadata referenced to well-understood standard terminology applicable across domains, diverse datasets cannot be integrated automatically. In addition, the integrating elements must be applied outside the big data storage, implying that the integration logic must reside in the metadata layer.

IV. SEMANTIC TECHNOLOGY

Semantic technologies are crucial for the future handling of big datasets across multiple domains. While we have methods for unique concept identification arising through the Semantic Web, these technologies have not made inroads into traditional data management systems. Traditionally, the ETL process has been used to enforce standard terminology across datasets, with foreign keys to external tables for the related information. This is not a scalable solution, since the introduction of a new data source requires the careful construction of foreign keys to each other dataset in the database. This lack of extensibility to additional sources highlights the limitations of current approaches to horizontal scaling. In addition, there are limitations on the continued expansion of large data warehouses, highlighting their inability to continue to scale vertically.

Semantic technologies have also not yet made inroads into Big Data systems. Big datasets characterized mainly by volume tend to be monolithic, with no integration across datasets. The data is typically stored in its raw state (as generated), and no joins were allowed in the initial Big Data engineering. Given this, most Big Data analytics approaches apply to single datasets.

For solutions addressing the integration of variety datasets, the ability to integrate the datasets using semantic technology that uniquely defines their elements is a fundamental requirement. Two overarching requirements need to be addressed to use ontology for the integration of Big Data: constructing the ontology, and using the ontology to integrate big datasets.

Ontology scaling. The standard method for data access through an ontology is to ingest the data into an ontological database, where the data elements are encoded along with their extant relationships. This does not work in a Big Data scenario, since ontological databases do not have the horizontal scalability needed to handle data of high volume, velocity, or diversity. Further exacerbating the problem, some of the data needing to be integrated is not owned by the analytical organization and cannot be ingested, only accessed through query subsets.

Separate ontology for metadata. The implementation of an integrating ontology would consequently need to reside in the metadata for browsing and querying. While this metadata could be browsed manually, the real value comes if it can be made actionable, such that selections over the metadata ontology automatically construct queries to the Big Data repository. A number of ontologies relevant to the cyber domain already exist, encompassing resources, attack events, and so forth. The key is to incorporate the appropriate elements and the relationships needed to describe the elements in the desired datasets. Our intent is not to recreate a cyber ontology from scratch, but to leverage those that exist to develop a first order ontology specific to the integration of the relevant cyber datasets. Focusing on first order logic will keep the ontology actionable for dynamic data integration.

In order to serve as the facilitator for automated data integration, this first order ontology would need to contain elements such as data element definitions, dataset location, data-producing resource characteristics, and resource connectivity.
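The sketch below illustrates, under invented names and a placeholder query syntax, how such metadata-ontology elements might be recorded as triples and made actionable: a selection over the ontology emits a query against the repository that holds the relevant dataset. It is an illustration of the intent, not the authors' implementation.

# Illustrative metadata-ontology assertions as (subject, predicate, object) triples.
TRIPLES = [
    ("netflow", "isA", "Dataset"),
    ("netflow", "storedAt", "hdfs://cluster/raw/netflow"),
    ("netflow", "hasElement", "dst_port"),
    ("dst_port", "definedAs", "TCP/UDP destination port of a flow"),
    ("fw01", "isA", "Firewall"),
    ("fw01", "producesDataset", "netflow"),
    ("fw01", "connectedTo", "core_router"),
]

def objects(subject, predicate):
    """Return all objects asserted for a given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]

def build_query(resource, element, value):
    """Select over the metadata ontology and emit a query against the Big Data store
    holding the dataset produced by `resource` (the query syntax is a placeholder)."""
    queries = []
    for dataset in objects(resource, "producesDataset"):
        if element in objects(dataset, "hasElement"):
            location = objects(dataset, "storedAt")[0]
            queries.append(f"SCAN {location} WHERE {element} = {value!r}")
    return queries

print(build_query("fw01", "dst_port", 445))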
For analytics, additional mid-level ontologies would be needed to provide reasoning over the data, such as time and location. Domain-specific ontology elements would include, for example, resource attributes by resource type, translations such as Internet protocol (IP) to location, and derived attack pattern components.

The key to the use of a semantic representation for the metadata is separating the semantic metadata from the data storage. In order to leverage the scalability and speed of high-volume NoSQL solutions, the ontology will need to reside in its own scalable environment. Data exploration would require a mechanism to browse the metadata within the ontology, with a seamless transfer mechanism to flow down into the data.

Probabilistic Challenges. One significant challenge in the use of ontology for automated data analytics across datasets resides in the need for probabilistic reasoning. Typically in ontology representations, triples are considered "facts," implying full confidence in the data elements being described. In the real world, such a luxury is typically non-existent. Resources will continually be updated, and there will be latency before the new configurations are updated in the ontology. Attack chains will have multiple possible paths, with probabilistic representations of each link type. Activity counts must be evaluated with a statistical significance test to determine if an activity is truly of concern. Such counts will have variations relative to time of day and day of week. Using an ontology for such probabilistic analytics will require the ability to analyze activity under some uncertainty. Much work has been done on probabilistic ontology, like MEBN, which inserts Bayes' theorem in ontology nodes [1].
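As a rough illustration of testing whether an activity count is truly of concern, the sketch below compares an observed hourly count against a baseline mean for that day of week and hour of day using a Poisson tail probability. The baseline values, alert threshold, and example counts are assumptions made purely for the example.

import math

# Hypothetical baseline: mean connection counts keyed by (day-of-week, hour-of-day).
BASELINE = {("Mon", 14): 120.0, ("Sun", 3): 4.0}

def poisson_tail(observed, mean):
    """P(X >= observed) for X ~ Poisson(mean), via the complement of the CDF."""
    if observed <= 0:
        return 1.0
    term = math.exp(-mean)          # P(X = 0)
    cdf = term
    for k in range(1, observed):
        term *= mean / k            # P(X = k) from P(X = k - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def is_anomalous(day, hour, observed, alpha=0.001):
    """Flag a count as suspicious only if it is statistically surprising for that
    time of day and day of week, rather than merely large in absolute terms."""
    mean = BASELINE[(day, hour)]
    return poisson_tail(observed, mean) < alpha

print(is_anomalous("Mon", 14, 150))  # large, but plausible for a busy weekday hour
print(is_anomalous("Sun", 3, 25))    # small, but far above the quiet-hour baseline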
V. APPLICATION TO CYBERSECURITY

Practical application to countering cyber attacks is achievable in the near term. The following questions can be answered with properly implemented Big Data technologies that span the variety of datasets: What data is available on malware X attacks globally? How many machines did an event land on? What ports were leveraged? What users were affected? What machines were compromised? What was leaked? Was sensitive information lost? Who did it? Was it an insider or an outsider? More difficult questions for the future would be: What is the composite activity globally of this attacker that penetration tested (pentested) my perimeter? What are all the locations globally of attacks? What should I expect from this attacker within the next hour? Next week? Next month? (Based on the historical data on this attacker.) What unsafe actions are my users taking, rank ordered by risk significance? What suspicious activity occurred today? Where is the greatest risk within the enterprise? It would also be useful to tabulate statistics on vulnerabilities versus attacks, and to visualize the results.

The latter "future set" of questions requires more research and development in topics like machine learning and reasoning, and is well beyond this paper's scope. For example, can ontology as proposed in this paper help us reason about risk based on the topology of devices and controls? Theoretically this is deterministic, and machines should be able to do it better than humans. Our intent is to model the perimeter security of a large enterprise network, collect real-time data, reason about risk in real time based on the topology of devices and controls, and respond to threats in an attempt to prevent loss. Given the appropriate set of data and the generation of a set of reasonable hypotheses, can we use Big Data to collect evidence to support or refute those security risk and threat hypotheses, in time to prevent loss?
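One way to picture this evidence-collection loop is a sequential Bayesian update over a single threat hypothesis, in the spirit of the probabilistic ontology work mentioned earlier. The hypothesis, evidence items, and likelihoods below are entirely invented, and the update assumes the evidence items are conditionally independent given the hypothesis.

# Hypothesis H: "the perimeter scan observed yesterday preceded an active intrusion."
prior = 0.05

# Invented evidence items with assumed likelihoods P(e | H) and P(e | not H).
evidence = [
    ("outbound traffic to known C2 address", 0.60, 0.02),
    ("new local admin account created",      0.40, 0.05),
    ("failed logins within normal range",    0.50, 0.70),  # mildly disconfirming
]

def update(prior, observations):
    """Sequentially apply Bayes' rule for each observed evidence item."""
    p = prior
    for name, p_given_h, p_given_not_h in observations:
        numerator = p_given_h * p
        p = numerator / (numerator + p_given_not_h * (1.0 - p))
        print(f"after '{name}': P(H | evidence so far) = {p:.3f}")
    return p

posterior = update(prior, evidence)

The open question raised above is whether the supporting or refuting evidence can be pulled from the Big Data stores quickly enough for such an update to matter before the loss occurs.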
Progress-to-Date. As a first step in preparing to instantiate an ontology, we have been mindful of what hundreds of organizations do in the current cybersecurity management process in a global networked enterprise. A description of this workflow is beyond this paper's scope. System awareness currently resides in the minds of hundreds of professionals who track threats and malware, maintain security devices like firewalls and the configurations and patches of thousands of network devices, monitor events and log files, create tickets when an anomaly is observed, and perform remedial actions such as Incident Response; Configuration Management; Vulnerability and Patch Management; Firewall, Intrusion Detection and Prevention; Deep Packet Inspection and Cyber Threat Assessment; Security Architecture and Design; and so forth.

We propose to elicit all knowledge necessary for assessment, decision, planning, and response into this ontology. At first glance this may appear daunting, but based on the successes with ontology engineering in recent years, and given the high stakes, we believe this is not only practical but necessary to better understand how to solve this national priority problem.

Cyber-security management has the characteristics of a successful knowledge elicitation and ontology engineering endeavor. The information is in digital form, and cyber-security processes are repetitive, meaning that the same indications of an attack are well documented and routinely observed in typical network operations, and the remedial steps are documented and used routinely. This is not to say that cybersecurity experts are not highly knowledgeable and skilled; just the opposite. This knowledge can be coded and reused for the parts the machine does best; humans should continue to do the parts that they do better than machines. With this expectation, we will meet the goal stated up front of flipping the current situation to one where a network's defense is optimized and efficient, lowering the cost of defense and making attacks very hard and expensive for the attacker.
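As a toy example of coding such documented knowledge for reuse, the sketch below pairs one well-known indication of attack with its documented remedial step as a machine-checkable rule, leaving unmatched events to the analyst. The indicator, threshold, and response text are placeholders rather than an operational playbook.

from dataclasses import dataclass

@dataclass
class Rule:
    """A documented indication of attack paired with its documented remedial step."""
    name: str
    indicator: callable        # predicate over an observed event
    remedial_action: str       # the step an analyst would otherwise perform manually

RULES = [
    Rule(
        name="brute-force login attempt",
        indicator=lambda event: event.get("failed_logins", 0) >= 50,
        remedial_action="lock account, open incident ticket, notify SOC on-call",
    ),
]

def triage(event):
    """Apply every documented rule to an event; unmatched events go to a human."""
    for rule in RULES:
        if rule.indicator(event):
            return f"{rule.name}: {rule.remedial_action}"
    return "no documented indication matched; route to analyst for review"

print(triage({"host": "srv-12", "failed_logins": 73}))
print(triage({"host": "srv-12", "failed_logins": 2}))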
Cyber Ontology for Countering Attacks. The top levels of the ontology are illustrated in Figures 1 and 2.

Figure 1. Upper Level and Lower Level Infrastructure Ontology.

Figure 2. Lower Level Ontology for Attack and Defense.

VI. FUTURE STEPS

We are in the planning phase for continued research and development, beginning with the Big Data analytics necessary to more fully identify, understand, and respond to cyber attacks. In parallel, we would like to develop a proof-of-concept prototype to test how well this ontology and Big Data integration would work in practice in a large enterprise network with high traffic and a large number of cyber attacks. The key to the success of this prototype will be to focus on one narrow aspect of cyber attack defense; once that aspect is implemented and demonstrated, it can be used to extrapolate the resources needed for development and implementation in large production environments.

Our goal is a proof-of-concept prototype of the entire process, but only for a few appropriate types of attacks and their respective response plans, as defined by a fairly rigorous test set. Big Data elements for the proof of concept have been partially selected.

Ontology engineering tools are being evaluated for their suitability for implementing this ontology for use in the system as previously described. A trade study will need to be conducted so that tools can be selected for implementation of a production system capable of meeting the aforementioned objectives in a large, global enterprise network. For the purpose of demonstrating the concept, we selected an ontology engineering tool from highfleet.com that reportedly provides an implementation of first order logic that is decidable and tractable (by simple programming constraint); it is a tool that one of the authors has used in the past. Results from the little done to date are positive, but we cannot do an assessment until the ontology is populated. There are other ontology engineering tools, for example the description logic-based Protégé ontology editor. We have not made a decision; eventually we will need to identify appropriate metrics and conduct assessments to determine what would be needed for a production grade deployment to address this problem space.

Due to page limit constraints, it is impossible to discuss all aspects of the cyber ontology development, but a few aspects need to be mentioned. For example, there are many good resources for specifying and instantiating these ontologies to a level useful in cyber; most notable are the efforts by MITRE [2]. Research issues remain unanswered, and they can be categorized into big data and analytics, ontology and probabilistic reasoning, decision making, and design and architecture. Cybersecurity is a hard problem, and it is doubtful that the approach taken in this paper, or any other, will be a complete solution. Furthermore, cyber attack sophistication is advancing rapidly, which compounds the problem significantly [3].

REFERENCES

[1] Laskey, K. B., "MEBN: A Language for First-Order Bayesian Knowledge Bases," Department of Systems Engineering and Operations Research, George Mason University, Fairfax, VA, 2007.
[2] Obrst, L., Chase, P., and Markeloff, R., "Developing an Ontology of the Cyber-security Domain," Semantic Technology for Intelligence, Defense and Security (STIDS) 2012, GMU, Fairfax, VA, 2012.
[3] http://www.cnas.org/technology-and-national-security