=Paper=
{{Paper
|id=Vol-2269/FSS-18_paper_53
|storemode=property
|title=Data and Deep Models Applied to Cyber Security Data Analysis
|pdfUrl=https://ceur-ws.org/Vol-2269/FSS-18_paper_53.pdf
|volume=Vol-2269
|authors=Ying Zhao,Andrew Polk,Shaun Kallis,Lauren Jones,Riqui Schwamm,Tony Kendall
|dblpUrl=https://dblp.org/rec/conf/aaaifs/ZhaoPKJSK18
}}
==Data and Deep Models Applied to Cyber Security Data Analysis==
<pdf width="1500px">https://ceur-ws.org/Vol-2269/FSS-18_paper_53.pdf</pdf>
<pre>
           Big Data and Deep Models Applied to Cyber Security Data Analysis
                 Ying Zhao                                    Andrew Polk                             Shaun Kallis
         Naval Postgraduate School                          UC Santa Barbara                Cal State University Monterey Bay
             yzhao@nps.edu                                 polk@umail.ucsb.edu                shaunlantzkallis@gmail.com

               Lauren Jones                                Riqui Schwamm                             Tony Kendall
         Naval Postgraduate School                      Naval Postgraduate School               Naval Postgraduate School
            lmjones@nps.edu                               rschwamm@nps.edu                         wakendal@nps.edu


                            Abstract                                   12 gigabytes of network information and 1.6 billion events.
                                                                       There were known malicious activities (identified as Red
  We present initial work that applies big data and deep mod-
  els to a cyber security data analysis with a use case approach.
                                                                       Team Actions) conducted within this network during this
  We explored new technologies such as BDP (Big Data Plat-             time period.
  form) as a service on the Amazon AWS system and Lexical                 Some of the information contained within the dataset was
  Link Analysis (LLA). BDP provides various analytics in near          anonymized or deidentified. While this removes significant
  real-time to help decision makers respond to threats in              amounts of information from the data set, there is still valu-
  a timely manner. We also used LLA as an example of deep              able information to be gleaned about the behavior of the net-
  models and a data-driven unsupervised ML method that can             work due to unity of identification across the five different
  improve cyber decision making.                                       files (i.e. User 1 or U1 is the same user across all data sets
                                                                       and Computer 1 or C1 is the same computer across all data
DoD networks require strong Cyber Situational Awareness                sets).
Analytic Capabilities (CSAAC) because adversaries deploy
                                                                          Some of the well-known ports (e.g., HTTP port 80, HTTPS port 443,
increasingly sophisticated malicious activities against DoD
                                                                       etc.), protocols (e.g. 6 for Transmission Control Protocol),
networks, which in turn requires the capture and inspection
                                                                       and system users (e.g. SYSTEM or Local Service) were left
of packets transmitted within the network to assess the cyber
                                                                       identified within the datasets. Time was captured in one-
security questions of who, what, where and when.
                                                                       second intervals, starting with a time epoch of (1). In order to
   New big data analytical tools and technologies can dra-
                                                                       illustrate the methodologies studied in this paper, we started
matically improve CSAAC by effectively and efficiently ag-
                                                                       with the Domain Name Service (DNS) data set. Figure 1
gregating the ever-increasing volume of data from disparate
                                                                       shows a snapshot of the LANL-DNS data. Time, source
sources that could provide early detection of network vulner-
                                                                       computer, and computer resolved are the attributes.
abilities, threats, and attacks. Big data and deep models could
                                                                          The LANL cyber data set was chosen for a number of dif-
provide significant opportunities to perform better analysis
                                                                       ferent reasons over other popular open source data sets (e.g.,
of real-time data and potentially:
                                                                       DARPA (DARPA 2000) or KDD data (KDD 1999) sources).
• Prevent expensive and damaging distributed denial of ser-            The LANL cyber data set is from 2015, one of the more re-
   vice (DDoS) attacks                                                 cent data sets of this size and complexity, so it contains the
• Maintain a competitive advantage of the military or busi-            activities of some newer malicious attack methodologies.
   nesses by protecting expensive research                             The goal is to classify and predict the hacked or hacking
                                                                       computers using big data and deep models.
• Prevent blackmail from email or ransomware
• Better secure vital networked infrastructure                                                   Methods
                  Data Set Description                                 In order to incrementally test cyber data sets using poten-
                                                                       tial big data and deep models including ML/AI methods, the
The cyber data was taken from multiple routers in the
                                                                       LANL-DNS data file was initially pre-processed, analyzed,
Los Alamos National Laboratory's internal network (LANL
                                                                       and interpreted to understand the output results shown in this
2017). The data set contains Windows authentication events
                                                                       paper before testing on other more complex data sets. The
and processes, domain name lookups, network flow data,
                                                                       steps for understanding the data:
and hacking events. The data covers 58 days and totals
Copyright © by the paper's authors. Copying permitted for private
and academic purposes. In: Joseph Collins, Prithviraj Dasgupta,
Ranjeev Mittu (eds.): Proceedings of the AAAI Fall 2018 Sympo-
sium on Adversary-Aware Learning Techniques and Trends in Cy-
bersecurity, Arlington, VA, USA, 18-19 October, 2018, published
at http://ceur-ws.org                                                               Figure 1: The LANL-DNS log data[2]
• Perform data visualization and exploration: display and         Visualization/Exploration Using Gephi and Plotly For
  visualize data initially and check data quality.                the data exploration, we also used an open source network
                                                                  display program Gephi (Gephi 2018) as a way to visualize
• Perform unsupervised machine learning to discover inter-
                                                                  the LANL cyber data that shows the connections between
  esting patterns and anomalies.
                                                                  points in data sets. Gephi uses Source and Target fields to
• Apply supervised learning to generate more precise clas-        draw the network graphs. Gephi also includes a timeline
  sification or prediction models.                                function to allow a user to view the connections between
                                                                  nodes at specific times or in a range of times.
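As a minimal sketch of preparing such input, the following Python snippet aggregates DNS-style records into a weighted edge table with the Source and Target columns that Gephi reads. The helper name `dns_to_gephi_edges`, the sample rows, and the extra Weight column are illustrative assumptions, not part of the workflow described in the paper.

```python
import csv
import io
from collections import Counter


def dns_to_gephi_edges(rows):
    """Aggregate (time, source, resolved) rows into weighted Source/Target edges."""
    # Count how often each (source, resolved) pair occurs across all events.
    weights = Counter((src, dst) for _time, src, dst in rows)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (src, dst), weight in sorted(weights.items()):
        writer.writerow([src, dst, weight])
    return out.getvalue()


# Hypothetical sample rows in the (time, source computer, computer resolved) layout.
sample = [(2, "C4653", "C5030"), (2, "C5782", "C16712"), (6, "C4653", "C5030")]
edge_csv = dns_to_gephi_edges(sample)
print(edge_csv)
```

The resulting text can be saved to a file and loaded as an edge table; collapsing repeated pairs into a weight keeps the graph small, at the cost of dropping the per-event timestamps that a timeline view would need.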
Data Visualization and Exploration Using Big Data                    The LANL flow data was displayed with Gephi. Since the
Platforms (BDP)                                                   red team created hacking events, such as the teal-colored com-
                                                                  puter nodes in Figure 4, the hacking or hacked computer
Defense Information Systems Agency (DISA)'s BDP is on
                                                                  nodes resulted from the red team's actions. Each node is a
Amazon Web Services (AWS) and is a mix of standard big data
                                                                  computer. Figure 4 shows the hacking events during a 24-
tools and customization including tools for ingestion, data
                                                                  hour period. One teal node is hacking, the orange nodes are
management, security, data exploration, and data analysis.
                                                                  being hacked, and purple nodes are neither hacking nor being
These functions are supported by open source tools includ-
                                                                  hacked. The color of the edges between nodes represents the
ing PostgreSQL, Apache Maven, Apache Spark, Apache
                                                                  protocols used for the connections. Purple edges are most
Storm (Kronos), Elasticsearch, GEM prospector, Hadoop,
                                                                  likely TCP. Green connections are protocol-1 which may be
Map/Reduce, Kafka, Accumulo, Unity, IronHide (Kibana),
                                                                  related to the hacked computers.
ZooKeeper, Kryo library, Node.js, and R Shiny.
                                                                     The shape of the graph provides clues as to the nature of
   BDP can process large-scale real time data feeds to pro-
                                                                  the nodes. Nodes that are highly connected to other nodes
vide useful visualizations of the data for initial data explo-
                                                                  may be name servers or popular web servers. The hacked
ration to discover anomalous events. Ingestion of the LANL-
                                                                  nodes appear to lie near the nodes with higher numbers of
DNS data into the BDP cluster included the following steps:
• Customized and formatted a rapid deployment archive
  (RDA) for parsing the CSV file data
• Connected a Puppet server to upload data to the Kronos
  server, which ingested and parsed the data
   For the data visualization and exploration, we used Unity
and Kibana/Iron Hide. Unity uses queries to visualize time
series, histograms, and pie charts for the initial examinations
of the data. Iron Hide creates Data-Driven Documents (D3)
visualizations including heat maps, graphs, and charts which
could indicate threats. Figure 2 shows the Unity histogram
of the event counts (i.e., each line in the LANL-DNS data is
an event associated with a timestamp) for all the computers.
Figure 3 shows a Kibana heat map of the number of connections     Figure 3: BDP data exploration: A heat map showing the
made for each computer (y-axis) over time (x-axis). These         number of connections made for each computer over time
tools could show big data in near real time to provide rapid      from Kibana
updates for a focused segment.
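The two views described above can be approximated offline with a short Python sketch: a histogram of event counts per time bin and a per-computer count of connections per time bin (the data behind a heat map). This is not the BDP, Unity, or Kibana implementation; the function names, the one-hour bin width, the choice to count connections by source computer, and the sample rows are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def event_histogram(rows, bin_seconds=3600):
    """Count events per time bin; each (time, source, resolved) row is one event."""
    return Counter(t // bin_seconds for t, _src, _dst in rows)


def connection_heatmap(rows, bin_seconds=3600):
    """Map each source computer to {time bin: number of connections it made}."""
    grid = defaultdict(Counter)
    for t, src, _dst in rows:
        grid[src][t // bin_seconds] += 1
    return grid


# Hypothetical rows: (time in seconds, source computer, computer resolved).
sample = [(10, "C1", "C9"), (20, "C1", "C8"), (3700, "C2", "C9"), (3800, "C1", "C9")]
hist = event_histogram(sample)
heat = connection_heatmap(sample)
print(dict(hist))
print({computer: dict(bins) for computer, bins in heat.items()})
```

Rows of `heat` correspond to the y-axis (computers) and keys of each inner counter to the x-axis (time bins), so unusually dense cells stand out as candidates for closer inspection.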


Figure 2: BDP Unity histogram of the event counts (i.e.,
each line in the LANL-DNS data is an event associated with
a timestamp)                                                          Figure 4: Gephi network visualization of computers
connections (high centralities).                                 are no identifying features differentiating an end user de-
   We also explored the Sankey graph with Python Plotly          vice such as a personal computer versus a DNS server; all
(Sankey 2018). Figure 5 shows a Sankey graph to catego-          are identified as anonymous devices, such as C123. Figure
rize how different parameters such as protocols, port num-       7 shows an example of a LLA network discovered from the
bers, and packets connected to each other in the LANL flow       LANL-DNS data. Each node is a computer. The links repre-
data. For example, protocol-6 is mostly associated with port     sent how likely two computers are linked as a “source” and
ranges 1025-65536 and then port ranges 0-1024.                   “resolve” pair in the events (timestamps). A correlation mea-
                                                                 sure is computed using Equation (1). Colored nodes (com-
Unsupervised Learning Using Lexical Link                         puters) are grouped into clusters based on their link pat-
Analysis (LLA)                                                   terns using LLA.
                                                                 r_{ij} = \frac{\text{Linked Events (Computer } i \text{ and Computer } j)}{\sqrt{(\text{Events Computer } i)(\text{Events Computer } j)}} \qquad (1)
LLA (Zhao, MacKinnon, and Gallup 2015) describes
the characteristics of a complex system using a list of
attributes or features with specific vocabularies or lexical
terms. Because the number of lexical terms can be potentially
very large from big data, the model can be viewed as a deep
model for big data. For example, we can describe a sys-             One can filter the nodes based on the strength of the links
tem using word pairs or bi-grams as lexical terms extracted      in LLA as shown in Figure 8.
from text data. LLA automatically discovers word pairs, and         The detailed LLA outputs for the LANL-DNS data set are
displays them as word pair networks. Bi-grams allow LLA          listed as follows:
to be extended to numerical or categorical data. For exam-          Output 1: The list of words representing the computers
ple, for structured data such as attributes from databases, we   in the data set and nodes in the network with the following
discretize and then categorize attributes and their values to    characteristics computed as shown in Figure 9.
word-like features. The word pair model can further be ex-
tended to a context-concept-cluster model (Zhao and Zhou         • Group: which group a node belongs to. A node or a word is a
2014). A context can represent a location, a time point or         computer.
an object (e.g. file name) shared across data sources. For ex-   • Type: group type from LLA.
ample, in information assurance, information is the context,     • Degree: how many connections each node has.
assurance is the concept. The timestamp and computer name are
the contexts to link different data sources.                     • Betweenness: how many connections belong to the differ-
   Figure 6 shows an example of such a word network dis-           ent groups.
covered from text data. Clean energy and renewable energy are    • Degree in: how many connections a computer (word) has as
two bi-gram word pairs. For a text document, words are rep-        resolve.
resented as nodes and word pairs as the links between nodes.
A word center (e.g., energy in Figure 6) is formed around a
word node connected with a list of other words to form more
word pairs with the center word energy.
   We computed associations and links as pairs of a source
computer and a resolve computer from the LANL-DNS data
set. The strength of an association or link is defined as
the number of time points or events at which the two computers are
linked via “source” or “resolve”.
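The link strength and the correlation measure of Equation (1) can be sketched in a few lines of Python. This is an illustrative reading rather than the LLA implementation: it assumes that "Events Computer i" means the number of events in which computer i appears as either source or resolve, and the function name and sample rows are hypothetical.

```python
import math
from collections import Counter


def link_correlations(rows):
    """Return {(source, resolved): r_ij} for DNS rows of (time, source, resolved)."""
    linked = Counter()  # events in which i is the source and j is resolved
    events = Counter()  # events in which a computer appears at all (assumption)
    for _time, src, dst in rows:
        linked[(src, dst)] += 1
        events[src] += 1
        if dst != src:
            events[dst] += 1
    # r_ij = linked events / sqrt(events_i * events_j), as in Equation (1).
    return {
        (src, dst): count / math.sqrt(events[src] * events[dst])
        for (src, dst), count in linked.items()
    }


# Hypothetical sample rows.
sample = [(1, "C1", "C2"), (2, "C1", "C2"), (3, "C1", "C3"), (4, "C4", "C2")]
r = link_correlations(sample)
print(r)
```

Under this reading, a pair that always appears together scores 1, and filtering on a minimum value of r keeps only the strongest links.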
   The output from LLA for the LANL-DNS data processing
identified 15237 unique active devices (computers). There


Figure 5: Sankey graph of the LANL flow data. The numbers
of packets and bytes are each split into five groups, each       Figure 6: An example of a word network from text data by
covering one fifth of the maximum value                          LLA
• Degree out: how many connections a computer (word) has as      of 15237 total computers are either hacked or hacking as the
  source.                                                        ground truth. Therefore, if there were a perfect prediction al-
   Output 2: The list of associations between computers.         gorithm, the top 1.75% of the sorted nodes (based on the
                                                                 perfect scores) would predict 100% of the hacked or hack-
   After the initial data exploration, the research question is  ing computers, as shown in the leftmost curve (two straight
how to predict hacking and hacked computers                      lines). Two results are interesting:
from these data sets. We computed additional metrics based       • The best-performing prediction metric is Multi (de-
on Output 1 of LLA as follows:                                     gree in*degree out), where the top 2160 nodes (14%) in-
• Multi: degree in*degree out;                                     clude 62% of the total hacked or hacking nodes. This is
                                                                   the best gain over other scores: For example, if sorted by
• DIV: degree in/degree out if degree out is not 0; else 0;        the degree in scores, the top 14% contains 56% of the total
• SUM: degree in+degree out;                                       hacked or hacking nodes. If sorted by the random scores,
                                                                   14% contains 14% of the total hacked or hacking nodes,
• DIFF: degree in-degree out                                       which is the worst performing prediction.
  Figure 11 shows a gains chart for predicting the hacked
                                                                 • The bottom ranked 40% of the nodes (from 9112 to
and hacking computers. The x-axis shows the computer se-
                                                                   15237) are normal. This is also significant since we can
quence number ranked by the four metrics. The y-axis shows
                                                                   eliminate the 40% nodes when examining hacked or hack-
percentage of hacked or hacking computer nodes. 1.75% out
                                                                   ing nodes, which is a big labor saving for cyber security
                                                                   analysts.
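The derived metrics and a single point on a gains chart can be illustrated with a small Python sketch. It is not the authors' code: the function names, the toy degree values, and the set of "hacked" nodes are hypothetical, chosen only to show how ranking by a metric translates into a captured fraction.

```python
def derived_metrics(degree_in, degree_out):
    """Compute the four derived metrics from a node's in-degree and out-degree."""
    return {
        "Multi": degree_in * degree_out,
        "DIV": degree_in / degree_out if degree_out != 0 else 0,
        "SUM": degree_in + degree_out,
        "DIFF": degree_in - degree_out,
    }


def gains_at(scores, positives, top_fraction):
    """Fraction of all positive nodes found in the top-scoring fraction of nodes."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    cutoff = max(1, round(len(ranked) * top_fraction))
    hits = sum(1 for node in ranked[:cutoff] if node in positives)
    return hits / len(positives)


# Hypothetical (degree in, degree out) values and ground-truth labels.
degrees = {"C1": (5, 4), "C2": (1, 0), "C3": (3, 3), "C4": (0, 2), "C5": (2, 1)}
hacked = {"C1", "C5"}
multi = {node: derived_metrics(d_in, d_out)["Multi"] for node, (d_in, d_out) in degrees.items()}
gain_40 = gains_at(multi, hacked, 0.4)
gain_60 = gains_at(multi, hacked, 0.6)
print(gain_40, gain_60)
```

Evaluating `gains_at` over a range of fractions traces one curve of a gains chart; a random score would on average capture a share of positives equal to the fraction examined, which is the baseline the metrics are compared against.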
                                                                    The metric “degree in*degree out” indicates that highly
                                                                 active devices are more likely to be hacked. High activity
                                                                 alone does not mean a device is anomalous; however, a
                                                                 common behavior seen in malicious actions is increased
                                                                 activity of devices that may be involved in the
                                                                 unauthorized action. We later computed an activity metric
                                                                 by counting the number of events (i.e., timestamps) a
                                                                 computer is associated with in the data set. This is a much
                                                                 simpler metric to compute than the associations in LLA. The
                                                                 activity metric shows a gain similar to that of the best LLA
                                                                 metric. We also appended other node characteristics from the
                                                                 flow data, such as the number of source ports, number of
                                                                 destination ports, total duration of a node's connections,
                                                                 total packets of a node's connections, and total bytes of a
                                                                 node's connections, as shown in Figure 12, and then applied supervised machine
Figure 7: An example of feature network from the LANL-           learning methods using the tool (Hall et al. 2009), in an at-
DNS data by LLA


Figure 8: The links of computer nodes filtered from Figure             Figure 9: LLA outputs of the node characteristics
7
tempt to generate better gains charts. So far, the metric “de-   puters in a network where DNS, flows, services and login in-
gree in*degree out” from unsupervised LLA shows a slight         formation are collected. We showed how big data visualiza-
edge over other methods.                                         tion and exploration tools such as BDP, Gephi, and Python
                                                                 Plotly can explore big data to provide meaningful information
                       Conclusion                                to decision makers. Gephi and Plotly are good for prototyp-
We applied big data and deep analytics methods to the            ing. The BDP has security advantages and shows potential
LANL cyber data used to detect the hacked or hacking com-        for finding anomalies in near real time through various met-
                                                                 rics. LLA computes the associations, statistics and central-
                                                                 ities for nodes (computers) and derived metrics are signifi-
                                                                 cantly useful to predict hacked or hacking nodes in the gains
                                                                 chart. The best-performing metric shows that the top
                                                                 14% of the nodes include 62% of the hacked or hacking nodes and
                                                                 the bottom 40% of the nodes are 100% normal and can therefore
                                                                 be eliminated from examination.

                                                                                   Acknowledgements
                                                                 The authors would like to thank the Naval Research Program at
                                                                 the Naval Postgraduate School and the Naval Research En-
                                                                 terprise Internship Program at the Office of Naval Research
                                                                 for the research support. The views and conclusions con-
                                                                 tained in this document are those of the authors and should
                                                                 not be interpreted as representing the official policies, either
                                                                 expressed or implied, of the U.S. Government.

                                                                                         References
                                                                 DARPA. 2000. DARPA intrusion detection scenario specific
    Figure 10: Derived metrics from the output of LLA            data sets, retrieved from https://www.ll.mit.edu/r-d/datasets.
                                                                 Gephi. 2018. The open graph viz platform, retrieved from
                                                                 https://gephi.org/.
                                                                 Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann,
                                                                 P.; and Witten, I. H. 2009. The WEKA Data Mining Soft-
                                                                 ware: An Update. SIGKDD Explorations 11(1):10–18.
                                                                 KDD. 1999. The UCI KDD archive, Information and Com-
                                                                 puter Science, University of California, Irvine, retrieved from
                                                                 http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
                                                                 LANL. 2017. The LANL cyber data, retrieved from
                                                                 https://csr.lanl.gov/data/cyber1/.
                                                                 Sankey. 2018. Sankey graph with python plotly, retrieved
                                                                 from https://plot.ly/python/sankey-diagram/.
                                                                 Zhao, Y., and Zhou, C. 2014. US patent
                                                                 8,903,756: System and method for knowledge pat-
Figure 11: Gains chart using centrality node scores com-         tern search from networked agents, retrieved from
puted from LLA and derived metrics                               https://www.google.com/patents/us8903756.
                                                                 Zhao, Y.; MacKinnon, D.; and Gallup, S. 2015. Big data
                                                                 and deep learning for understanding DoD data. In Journal of
                                                                 Defense Software Engineering, Special Issue: Data Mining
                                                                 and Metrics.


       Figure 12: Combined data for computer nodes

</pre>