Big Data for Combating Cyber Attacks

Terry Janssen, PhD, SAIC
Chief Scientist & Cyber Strategist
Cyber Operations
Washington, D.C. USA
terry.l.janssen@saic.com

Nancy Grady, PhD, SAIC
Technical Fellow, Data Science
Emerging Technologies
Oak Ridge, TN USA
nancy.w.grady@saic.com

Abstract—This position paper explores a means of improving cybersecurity using Big Data technologies augmented by ontology for preventing or reducing losses from cyber attacks. Because of the priority of this threat to national security, it is necessary to attain results far superior to those found in modern-day security operations centers (SOCs). Focus is on the potential application of ontology engineering to this end. Issues and potential next steps are discussed.

Keywords—big data; ontology; cybersecurity; modeling; search; discovery; analytics; variety; metadata

I. INTRODUCTION

The last few years have seen tremendous increases in the amount of data being generated and used to provide capabilities never before possible. "Big Data" refers to the new engineering paradigm that scales data systems horizontally across a collection of distributed resources, rather than relying only on the earlier vertical scaling that brought faster processors and more data storage into a single monolithic data platform. Big Data technologies have the potential to revolutionize our capability to handle the large datasets generated in cyber data analytics. The challenge, however, is not just in handling the large volumes and high data generation rates, but in leveraging all available data sources to provide better and faster analytics for attack detection and response. In this paper, we discuss Big Data analytics, metadata, and semantics for data integration, and their application to cybersecurity and cyber data management.

II. BIG DATA

Big Data has several defining characteristics, including volume, variety (of data types and domains-of-origin), and the data flow characteristics of velocity (rate) and variability (change in rate) with which the data is generated and collected.

Traditional data systems collect data and curate it into information stored in a data warehouse, with a schema tuned for the specific analytics for which the data warehouse was built.

Velocity refers to a characteristic previously described as streaming data. The log data from cell phones, for example, flows rapidly into systems, and alerting and analytics are done on the fly before the curation and routing of data or aggregated information into persistent storage. In a Big Data architecture, this implies the addition of application servers to handle the load. Variability refers to changes in the data flow's velocity, which for cost-effectiveness leads to the automated spawning of additional processors in cloud systems to handle the load as it increases, and the release of those resources as the load diminishes. Volume is the dataset characteristic most identified with Big Data. The engineering revolution began due to the massive datasets from web and system logs. The implication has been the storage of the data in its raw format, onto distributed resources, with the curation and imposition of a schema only when the data is read.

Big Data Analytics. Much of the development of Big Data engineering is a result of the need to analyze massive web log data. Massive web logs were first filtered by page for aggregate page counts, to determine the popularity of pages. Then the pages were analyzed for sessions (spawning the now massive "cookie" industry to make this simpler). "Sessions" are the sequence of activities that describe a customer's interaction with the site at a single sitting, with the analyst defining what time window is considered a session. The next step in analytics capability came from the realization that these sessions could be abstracted into patterns rather than being treated as just the literal collection of pages. With this step, traversal patterns helped site designers see the efficiencies in their link structure. Furthermore, these usage patterns could in some cases be attached to a customer account record. With this step, the site could be tuned to benefit the most valuable customers, with separate paths designed for the casual visitor to browse, leaving easy, efficient handling for loyal customers. This pattern-oriented analysis carries over to the cyber domain, in analyzing logs from a server.
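To make the sessionization step concrete, the following minimal Python sketch groups page views into sessions using an analyst-chosen time window and then abstracts each session into its traversal pattern. The log records, field layout, and 30-minute window are invented for illustration and are not drawn from any dataset discussed in this paper.

from datetime import datetime, timedelta

# Illustrative log records as (visitor_id, timestamp, page); the field layout is assumed.
log = [
    ("v1", datetime(2013, 9, 1, 10, 0), "/home"),
    ("v1", datetime(2013, 9, 1, 10, 5), "/products"),
    ("v1", datetime(2013, 9, 1, 14, 0), "/home"),      # long gap, so a new session starts here
    ("v2", datetime(2013, 9, 1, 10, 2), "/home"),
]

def sessionize(records, window=timedelta(minutes=30)):
    """Group page views into sessions: a gap longer than `window` starts a new session."""
    sessions = {}
    for visitor, ts, page in sorted(records, key=lambda r: (r[0], r[1])):
        visitor_sessions = sessions.setdefault(visitor, [])
        if visitor_sessions and ts - visitor_sessions[-1][-1][0] <= window:
            visitor_sessions[-1].append((ts, page))
        else:
            visitor_sessions.append([(ts, page)])
    return sessions

for visitor, visitor_sessions in sessionize(log).items():
    # Abstract each session into its traversal pattern (the ordered list of pages visited).
    patterns = [[page for _, page in s] for s in visitor_sessions]
    print(visitor, patterns)

The same grouping, applied to server logs keyed by source address instead of visitor ID, is the starting point for the cyber session patterns discussed next.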
The last 15 years have seen the extension of a number of analytics techniques to leverage the horizontal Big Data scaling paradigm to address both log data and the linked-node data found in social sites. The cyber community can leverage web log analysis and Social Network Analysis to use these massive amounts of data to determine session patterns and the appropriateness of activity between resources. The challenge is that cyber must also deal with a richer set of attributes for the resources and their expected or allowed interconnections, which adds a variety of other contextual datasets into the analysis.

Variety. Traditional systems handled the variety of data through a laborious integration process to standardize terminology, normalize into relational tables, choose indexes, and store into a data warehouse tuned for the specific analytics that are needed. This is an inflexible process that does not easily accommodate new data sources, changes in the underlying data feeds, or new analytical requirements.

For web log analysis, the extension to customer session analytics only required the assignment of a customer or visitor ID to the session, allowing integration with a purchasing history. In the cyber analytics case, the integration point is not so simple. The integration of packet data with server log data, port-to-port connectivity data, server type data, network router settings, and so forth provides a more complex use case, needing a more sophisticated way to integrate such a variety of data, some of which carries a number of additional attributes that are needed.

Recently, variety datasets have been addressed through mashups that dynamically integrate a couple of datasets from multiple domains to provide new business capabilities. Early mashups demonstrated this value, for example, in the integration of crime data with real estate listings; a valuable analysis that was not possible before the availability of open datasets. There is a limitation to such mashups, because only a limited number of datasets are integrated and the integration variables are manually selected. This type of manual integration is insufficient for analytics across different large-volume datasets with complex inter-relationships.

Variety is the Big Data attribute that will enable more sophisticated cyber analytics. The requirement is for a mechanism to integrate multiple highly diverse datasets in an automated and scalable way. This is best achieved through controlled metadata.

III. METADATA

The executive branch has been pushing an open data initiative to move the federal government into being a data steward. The goal in releasing the data is to better serve the public and promote economic growth through the reuse of this data. The difficulty in using this data arises from the lack of metadata descriptions. Data reuse requires as much information as possible on the provenance of data: the full history of the methods used for collection, curation, and analysis. Proper metadata increases the chances that datasets are re-purposed correctly, leading to analytical conclusions that are less likely to be flawed.

Two mechanisms are used for dataset integration in a relational model: lookup tables are established to translate to a common vocabulary for views, and a one-to-one correspondence is used to create keys between tables. In a NoSQL environment, joins are not possible, so table lookups and keys cannot be used for data integration. The connection of data across datasets must instead reside in the query logic and must rely on information external to the datasets. This metadata logic must be used to select the relevant data for later integration and analysis, implying the need for both a standard representation and additional attributes to achieve automated data retrieval.
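A minimal sketch of this idea follows: a small metadata registry, held outside the data stores, maps a common concept to each dataset's own field name so that retrieval across the stores can be automated. The registry entries, dataset names, and field names are hypothetical, and the in-memory lists stand in for what would be NoSQL repositories in practice.

# Hypothetical metadata registry: common concept -> field name within each dataset.
METADATA = {
    "ip_address": {"firewall_log": "src_ip", "dns_log": "client", "asset_db": "ipv4"},
}

# Stand-ins for three separately stored datasets (in practice, NoSQL stores or raw files).
DATASETS = {
    "firewall_log": [{"src_ip": "10.0.0.5", "action": "DENY"}],
    "dns_log":      [{"client": "10.0.0.5", "query": "evil.example.com"}],
    "asset_db":     [{"ipv4": "10.0.0.5", "owner": "finance"}],
}

def records_about(concept, value):
    """Pull matching records from every dataset, using the metadata layer to
    translate the common concept into each dataset's own field name."""
    results = {}
    for dataset, field in METADATA[concept].items():
        results[dataset] = [r for r in DATASETS[dataset] if r.get(field) == value]
    return results

print(records_about("ip_address", "10.0.0.5"))

Because the mapping lives entirely in the metadata layer, adding a fourth dataset only requires a new registry entry, not a change to the stored data or to the query logic.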
A second approach is used to speed the data integration process for manual mashups of diverse datasets. Often XML wrappers are used to encapsulate the data elements, with the nomenclature for each dataset provided in the wrapper, based on user interpretation of the data elements. This approach allows rapid integration of data through the wrappers (as opposed to a lengthy data warehouse integration), but it is not an approach that can be automated, nor can it be used for large-volume datasets that cannot be copied. Even in a mashup, the wrapper terms used in the metadata are themselves subject to interpretation, making reuse of data elements difficult.

Without metadata referenced to well-understood standard terminology applicable across domains, diverse datasets cannot be integrated automatically. In addition, the integrating elements must be applied outside the big data storage, implying that the integration logic must reside in the metadata layer.

IV. SEMANTIC TECHNOLOGY

Semantic technologies are crucial for the future handling of big datasets across multiple domains. While we have methods for unique concept identification arising through the Semantic Web, these technologies have not made inroads into traditional data management systems. Traditionally, the ETL process has been used to enforce standard terminology across datasets, with foreign keys to external tables for the related information. This is not a scalable solution, since the introduction of a new data source requires the careful construction of foreign keys to each other dataset in the database. This lack of extensibility to additional sources highlights the limitations of current approaches to horizontal scaling. In addition, there are limitations on the continued expansion of large data warehouses, highlighting their inability to continue to scale vertically.

Semantic technologies have also not yet made inroads into Big Data systems. Big datasets characterized mainly by volume tend to be monolithic, with no integration across datasets. The data is typically stored in its raw state (as generated), and no joins were allowed in the initial Big Data engineering. Given this, most Big Data analytics approaches apply to single datasets.

For solutions addressing the integration of variety datasets, the ability to integrate the datasets using semantic technology that uniquely defines their elements is a fundamental requirement. Two overarching requirements need to be addressed to use ontology for the integration of Big Data: constructing the ontology, and using the ontology to integrate big datasets.

Ontology scaling. The standard method for data access through an ontology is to ingest the data into an ontological database, where the data elements are encoded along with their extant relationships. This does not work in a Big Data scenario, since ontological databases do not have the horizontal scalability needed to handle data of high volume, velocity, or diversity. Further exacerbating the problem, some of the data needing to be integrated is not owned by the analytical organization and cannot be ingested, only accessed through query subsets.

Separate ontology for metadata. The implementation of an integrating ontology would consequently need to reside in the metadata for browsing and querying. While this metadata could be browsed manually, the real value comes if it can be made actionable, such that selections over the metadata ontology automatically construct queries to the Big Data repository. A number of ontologies relevant to the cyber domain already exist, encompassing resources, attack events, and so forth. The key is to incorporate the appropriate elements and the relationships needed to describe the elements in the desired datasets. Our intent is not to recreate a cyber ontology from scratch, but to leverage those that exist to develop a first order ontology specific to the integration of the relevant cyber datasets. Focusing on first order logic will keep the ontology actionable for dynamic data integration.

In order to serve as the facilitator for automated data integration, this first order ontology would need to contain elements such as data element definitions, dataset location, data-producing resource characteristics, and resource connectivity.
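The sketch below illustrates, under invented names and a placeholder query syntax, how such metadata-ontology elements might be recorded as triples and made actionable: a selection over the ontology emits a query against the repository that holds the relevant dataset. It is an illustration of the intent, not the authors' implementation.

# Illustrative metadata-ontology assertions as (subject, predicate, object) triples.
TRIPLES = [
    ("netflow", "isA", "Dataset"),
    ("netflow", "storedAt", "hdfs://cluster/raw/netflow"),
    ("netflow", "hasElement", "dst_port"),
    ("dst_port", "definedAs", "TCP/UDP destination port of a flow"),
    ("fw01", "isA", "Firewall"),
    ("fw01", "producesDataset", "netflow"),
    ("fw01", "connectedTo", "core_router"),
]

def objects(subject, predicate):
    """Return all objects asserted for a given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]

def build_query(resource, element, value):
    """Select over the metadata ontology and emit a query against the Big Data store
    holding the dataset produced by `resource` (the query syntax is a placeholder)."""
    queries = []
    for dataset in objects(resource, "producesDataset"):
        if element in objects(dataset, "hasElement"):
            location = objects(dataset, "storedAt")[0]
            queries.append(f"SCAN {location} WHERE {element} = {value!r}")
    return queries

print(build_query("fw01", "dst_port", 445))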
For analytics, additional mid-level ontologies would be needed to provide reasoning over the data, such as time and location. Domain-specific ontology elements would include, for example, resource attributes by resource type, translations such as Internet protocol (IP) to location, and derived attack pattern components.

The key to the use of a semantic representation for the metadata is separating the semantic metadata from the data storage. In order to leverage the scalability and speed of high-volume NoSQL solutions, the ontology will need to reside in its own scalable environment. Data exploration would require a mechanism to browse the metadata within the ontology, with a seamless transfer mechanism to flow down into the data.

Probabilistic Challenges. One significant challenge in the use of ontology for automated data analytics across datasets resides in the need for probabilistic reasoning. Typically in ontology representations, triples are considered "facts," implying full confidence in the data elements being described. In the real world, such a luxury is typically non-existent. Resources will continually be updated, and there will be latency before the new configurations are updated in the ontology. Attack chains will have multiple possible paths, with probabilistic representations of each link type. Activity counts must be evaluated with a statistical significance test to determine if an activity is truly of concern. Such counts will have variations relative to time of day and day of week. Using an ontology for such probabilistic analytics will require the ability to analyze activity under some uncertainty. Much work has been done on probabilistic ontology, like MEBN, which inserts Bayes' theorem in ontology nodes [1].
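As a rough illustration of testing whether an activity count is truly of concern, the sketch below compares an observed hourly count against a baseline mean for that day of week and hour of day using a Poisson tail probability. The baseline values, alert threshold, and example counts are assumptions made purely for the example.

import math

# Hypothetical baseline: mean connection counts keyed by (day-of-week, hour-of-day).
BASELINE = {("Mon", 14): 120.0, ("Sun", 3): 4.0}

def poisson_tail(observed, mean):
    """P(X >= observed) for X ~ Poisson(mean), via the complement of the CDF."""
    if observed <= 0:
        return 1.0
    term = math.exp(-mean)          # P(X = 0)
    cdf = term
    for k in range(1, observed):
        term *= mean / k            # P(X = k) from P(X = k - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def is_anomalous(day, hour, observed, alpha=0.001):
    """Flag a count as suspicious only if it is statistically surprising for that
    time of day and day of week, rather than merely large in absolute terms."""
    mean = BASELINE[(day, hour)]
    return poisson_tail(observed, mean) < alpha

print(is_anomalous("Mon", 14, 150))  # large, but plausible for a busy weekday hour
print(is_anomalous("Sun", 3, 25))    # small, but far above the quiet-hour baseline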
V. APPLICATION TO CYBERSECURITY

Practical application to countering cyber attacks is achievable in the near term. The following questions can be answered with properly implemented Big Data technologies that span the variety of datasets: What data is available on malware X attacks globally? How many machines did an event land on? What ports were leveraged? What users were affected? What machines were compromised? What was leaked? Was sensitive information lost? Who did it? Was it an insider or an outsider? More difficult questions for the future would be: What is the composite activity globally of this attacker that penetration tested (pentested) my perimeter? What are all the locations globally of attacks? What should I expect from this attacker within the next hour? Next week? Next month? (Based on the historical data on this attacker.) What unsafe actions are my users taking, rank ordered by risk significance? What suspicious activity occurred today? Where is the greatest risk within the enterprise? It would also be useful to tabulate statistics on vulnerabilities versus attacks, and to visualize the results.

The latter "future set" of questions requires more research and development in topics like machine learning and reasoning, and is well beyond this paper's scope. For example, can ontology as proposed in this paper help us reason about risk based on the topology of devices and controls? Theoretically this is deterministic, and machines should be able to do it better than humans. Our intent is to model the perimeter security of a large enterprise network, collect real-time data, reason about risk in real time based on the topology of devices and controls, and respond to threats in an attempt to prevent loss. Given the appropriate set of data and the generation of a set of reasonable hypotheses, can we use Big Data to collect evidence to support or refute those security risk and threat hypotheses, in time to prevent loss?
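One way to picture this evidence-collection loop is a sequential Bayesian update over a single threat hypothesis, in the spirit of the probabilistic ontology work mentioned earlier. The hypothesis, evidence items, and likelihoods below are entirely invented, and the update assumes the evidence items are conditionally independent given the hypothesis.

# Hypothesis H: "the perimeter scan observed yesterday preceded an active intrusion."
prior = 0.05

# Invented evidence items with assumed likelihoods P(e | H) and P(e | not H).
evidence = [
    ("outbound traffic to known C2 address", 0.60, 0.02),
    ("new local admin account created",      0.40, 0.05),
    ("failed logins within normal range",    0.50, 0.70),  # mildly disconfirming
]

def update(prior, observations):
    """Sequentially apply Bayes' rule for each observed evidence item."""
    p = prior
    for name, p_given_h, p_given_not_h in observations:
        numerator = p_given_h * p
        p = numerator / (numerator + p_given_not_h * (1.0 - p))
        print(f"after '{name}': P(H | evidence so far) = {p:.3f}")
    return p

posterior = update(prior, evidence)

The open question raised above is whether the supporting or refuting evidence can be pulled from the Big Data stores quickly enough for such an update to matter before the loss occurs.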
Progress-to-Date. As a first step in preparing to instantiate an ontology, we have been mindful of what hundreds of organizations do in the current cybersecurity management process in a global networked enterprise. A description of this workflow is beyond this paper's scope. System awareness currently resides in the minds of hundreds of professionals who track threats and malware, maintain security devices like firewalls and the configurations and patches of thousands of network devices, monitor events and log files, create tickets when an anomaly is observed, and perform remedial actions such as Incident Response; Configuration Management; Vulnerability and Patch Management; Firewall, Intrusion Detection and Prevention; Deep Packet Inspection and Cyber Threat Assessment; Security Architecture and Design; and so forth.

We propose to elicit all knowledge necessary for assessment, decision, planning, and response into this ontology. At first glance this may appear daunting, but based on the successes with ontology engineering in recent years, and given the high stakes, we believe this is not only practical but necessary to better understand how to solve this national priority problem.

Cyber-security management has the characteristics of a successful knowledge elicitation and ontology engineering endeavor. The information is in digital form, and cyber-security processes are repetitive, meaning that the same indications of an attack are well documented and routinely observed in typical network operations, and the remedial steps are documented and used routinely. This is not to say that cybersecurity experts are not highly knowledgeable and skilled; just the opposite. This knowledge can be coded and reused for the parts the machine does best; humans should continue to do the parts that they do better than machines. With this expectation, we will meet the goal stated up front of flipping the current situation to one where a network's defense is optimized and efficient, lowering the cost of defense and making attacks very hard and expensive for the attacker.
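As a toy example of coding such documented knowledge for reuse, the sketch below pairs one well-known indication of attack with its documented remedial step as a machine-checkable rule, leaving unmatched events to the analyst. The indicator, threshold, and response text are placeholders rather than an operational playbook.

from dataclasses import dataclass

@dataclass
class Rule:
    """A documented indication of attack paired with its documented remedial step."""
    name: str
    indicator: callable        # predicate over an observed event
    remedial_action: str       # the step an analyst would otherwise perform manually

RULES = [
    Rule(
        name="brute-force login attempt",
        indicator=lambda event: event.get("failed_logins", 0) >= 50,
        remedial_action="lock account, open incident ticket, notify SOC on-call",
    ),
]

def triage(event):
    """Apply every documented rule to an event; unmatched events go to a human."""
    for rule in RULES:
        if rule.indicator(event):
            return f"{rule.name}: {rule.remedial_action}"
    return "no documented indication matched; route to analyst for review"

print(triage({"host": "srv-12", "failed_logins": 73}))
print(triage({"host": "srv-12", "failed_logins": 2}))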
Cyber Ontology for Countering Attacks. The top levels of the ontology are illustrated in Figures 1 and 2.

Figure 1. Upper Level and Lower Level Infrastructure Ontology.

Figure 2. Lower Level Ontology for Attack and Defense.

VI. FUTURE STEPS

We are in the planning phase for continued research and development, beginning with the Big Data analytics necessary to more fully identify, understand, and respond to cyber attacks. In parallel, we would like to develop a proof-of-concept prototype to test how well this ontology and Big Data integration would work in practice in a large enterprise network with high traffic and a large number of cyber attacks. The key to the success of this prototype will be to focus on one narrow aspect of cyber attack defense; once that aspect is implemented and demonstrated, it can be used to extrapolate the resources needed for development and implementation in large production environments.

Our goal is a proof-of-concept prototype of the entire process, but only for a few appropriate types of attacks and their respective response plans, as defined by a fairly rigorous test set. Big Data elements for the proof of concept have been partially selected.

Ontology engineering tools are being evaluated for their suitability for implementing this ontology for use in the system as previously described. A trade study will need to be conducted so that tools can be selected for implementation of a production system capable of meeting the aforementioned objectives in a large, global enterprise network. For the purpose of demonstrating the concept, we selected an ontology engineering tool from highfleet.com that reportedly provides an implementation of first order logic that is decidable and tractable (by simple programming constraint); it is a tool that one of the authors has used in the past. Results from the little done to date are positive, but we cannot do an assessment until the ontology is populated. There are other ontology engineering tools, for example the description logic-based Protégé ontology editor. We have not made a decision; eventually we will need to identify appropriate metrics and conduct assessments to determine what would be needed for a production grade deployment to address this problem space.

Due to page limit constraints, it is impossible to discuss all aspects of the cyber ontology development, but a few aspects need to be mentioned. For example, there are many good resources for specifying and instantiating these ontologies to a level useful in cyber; most notable are the efforts by MITRE [2]. Research issues remain unanswered, and they can be categorized into big data and analytics, ontology and probabilistic reasoning, decision making, and design and architecture. Cybersecurity is a hard problem, and it is doubtful that the approach taken in this paper, or any other, will be a complete solution. Furthermore, cyber attack sophistication is advancing rapidly, which compounds the problem significantly [3].

REFERENCES

[1] Laskey, K. B., "MEBN: A Language for First-Order Bayesian Knowledge Bases," Department of Systems Engineering and Operations Research, George Mason University, Fairfax, VA, 2007.
[2] Obrst, L., Chase, P., and Markeloff, R., "Developing an Ontology of the Cyber-security Domain," Semantic Technology for Intelligence, Defense and Security (STIDS) 2012, GMU, Fairfax, VA, 2012.
[3] http://www.cnas.org/technology-and-national-security