CLEAR-ROAD: Extraction of Temporally Co-occurring yet
                  Rare Critical Alerts
                              Gordon Werner∗1 and Shanchieh Jay Yang†1
                                    1
                                        Rochester Institute of Technology


                                                    Abstract
              Intrusion detection systems generate a large number of streaming alerts. It can be
          overwhelming for analysts to quickly and effectively understand behavior within a network.
          Critical alerts occur so infrequently that it can be difficult to determine what surrounding
          alerts are actually related to them, providing a deep challenge to analysts. What if an analyst
          could provide a collection of known critical alerts and quickly receive a summary detailing
          their temporal behaviors within a network as well consistently co-occurring signatures that
          pre-empt or succeed the critical action? What if this information could be provided in near
          real time, with no training data, and with the capability to adapt to changing temporal
          patterns and relationships across signatures? The Concept Learning for Intrusion Event
          Aggregation in Realtime with Rare co-Occurring Alert signature Discovery (CLEAR-ROAD)
          answers that question, revealing consistent co-occurrences derived from alerts with similar
          temporal arrival patterns. Alerts are aggregated, or sequenced, based on their unique and
          invariant arrival patterns, not external training data. The signature patterns expressed by
          such temporal activity are then discovered through pattern mining techniques. A constrained
          databasing approach is used to reduce the number of sequences processed by an average
          of 90% for individual streams. Case studies are conducted to analyze the co-occurring
          signatures found across two real world datasets, one from a SOC operation and another
          from a penetration testing competition. CLEAR-ROAD is able to find consistently co-
          occurring signatures across streams and datasets quickly and effectively. Differences in
          temporal behavior are also found to lead to unique co-occurring signatures for some critical
          alerts. Case studies show the clear and near-immediate benefits provided to analysts by the
          system.


1        Introduction
Network security has never been more important than in recent times. As networks continue
to grow in scale so to does the threat of malicious activity. Even with Security operation
centers (SOCs) staffed with analysts dedicated to monitoring a network it is easy for them to
be overwhelmed. Modern intrusion detection systems (IDSs) generate massive numbers of alerts
quickly making it difficult for analysts to quickly or easily understand what is happening.
    Imagine the scenario where a network suddenly sees an influx of a certain critical alert
such as “ET WEB_SERVER ColdFusion administrator access." While the signature can inform
analysts of general intent, they would be driven to find what additional alert signatures are
being generated to construct an overall attacker “action". Even using a short time window
around the critical signature, a manual query by an analyst will return a large number of unique
alert signatures.
    ∗
        gxw9834@rit.edu
    †
        jay.yang@rit.edu


    Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons
                        License Attribution 4.0 International (CC BY 4.0).
    Proceedings of the Conference on Applied Machine Learning for Information Security, 2021
CLEAR-ROAD                                                                       G.Werner, S.Yang


    This work aims to solve a unique problem motivated through discussions with real world
SOC operators. Given a set of critical signatures from a SOC analyst, can they quickly be
provided with the timing information of any other alert signatures which co-occur with statistical
dependence? Critical signatures are rare occurring very infrequently in a network and it may
not be obvious at a glance how they relate to other alerts. While pattern mining of cyber alerts
has been briefly explored in recent literature, this is the first to the author’s knowledge which
aims to provide such directed and clear insights into alert co-occurrence to cyber analysts.
    This paper introduces and details the Concept Learning for Intrusion Event Aggregation
in Realtime with Rare co-Occurring Alert signature Discovery (CLEAR-ROAD) system. Using
data driven statistical processing, IDS alerts can be sequenced in real time based on their
temporal arrival patterns with no external training data. Pattern mining techniques applied to
constrained sequence data bases [29] allow for regular discovery of co-occurring signatures with
low performance overhead.
    By processing alerts in this way the additional temporal context and relationships across
signatures is represented. Of the 113 critical signatures analyzed, 71 were found to have statisti-
cally co-occurring signatures. 65 of these had consistent co-occurring signatures across external
sources across two IDS datasets exhibiting similar temporal behavior within the same network.
In some cases, variation in temporal behavior lead to unique co-occurring signatures for each
timing pattern. These results are highlighted and discussed through thorough case studies of
the “GPL EXPLOIT CodeRed v2 root.exe access" and “ET WEB_SERVER ColdFusion admin-
istrator access" signatures.
    The rest of this paper is organized as follows. Section 2 discusses related work in the field of
alert aggregation and correlation. Section 3 details the motivation and challenges in finding co-
occurring signatures. Section 4 details the CLEAR-SPADE architecture. Section 5 introduces
the experimental set up, datasets used and the differences between them while Section 6 details
the case studies conducted through the eyes of a SOC analyst. Section 7 concludes the paper.


2     Related Work
2.1     Alert Aggregation
IDS systems are used to raise alerts to network administrators of suspected anomalous or mali-
cious activity [12]. Quickly and effectively processing the extreme number of alerts [4] in order
to construct a clear and unified knowledge of a network’s security status [13] is necessary for
defense. A simple and straightforward way to reduce the overall number of alerts is through
aggregation [9]. Alerts of the same type occurring close with one another are removed. The
intuition here is that they are generated by ongoing activity, e.g, scanning, or alerts that are gen-
erated from the same activity by multiple scanners [8]. Traditional aggregation aims to remove
‘redundant’ alerts [21] but provides no deeper insight into the alerts presented to an analyst.

2.2     Alert Correlation
Alert correlation methods aim to process cyber alerts and provide deeper meaning to them [14].
Some early approaches provided modeling languages that allowed users to define known attack
scenarios [2]. These types of approaches require a large, manually defined set of training patterns
[18]. There is high overhead in creating such a dataset, especially when one considers that there
are always new and emerging attack patterns. Machine learning techniques have been used to
learn attack scenarios [3] or to reconstruct incoming alerts to previously clustered alerts [20],
however these methods are still dependent on manual labeling of training data [14]. Rule-based
approaches attempt to match prerequisite and consequent actions to cyber attack steps [1], but
these too require specific attack knowledge and are unable to detect emerging attacks as they
are not defined in the rules database.

    Proceedings of the Conference on Applied Machine Learning for Information Security, 2021
CLEAR-ROAD                                                                      G.Werner, S.Yang


    Statistically driven approaches to alert correlation attempt to determine relationships across
alerts without prior domain knowledge [14]. Researchers in [17] a Bayesian model was generated
for each “hyper alert." Hyper alert construction leverages many basic alert aggregation principles,
collecting all alerts with identical attributes such as source and destination IP and signature.
Conditional probabilities are then calculated for each pair of hyper alert in an attempt to produce
a causal relationship between alerts. This approach was iterated upon in [18] to provide “online"
alert correlation however the approach still required extensive off-line Bayesian training using
large sets of hyper alerts.

2.3   Temporal Correlation of Alerts
Many alert correlation techniques use some type of windowing when processing new alerts [5].
Alert occurrence frequencies are used when determining relationships across alert types, but the
temporal relationships across alerts usually are not used in alert correlation. This seems like a
missed opportunity as time series modeling has been applied effectively to network traffic, alert
counts and cyber intrusion events in research. ARIMA models provided a boost in accuracy
when forecasting cyber event counts [26]. Hourly counts of individual signatures have been
leveraged to detect abnormalities in occurrence rates [25]. Weekly analysis of malicious activity
against a commercial entity found seasonal behavior, and changes in intensity over the course of
a day or week [27]. Researchers in [28] are the first to our knowledge to incorporate statistically
driven aggregation of alerts based on their arrival patterns. Aggregation leveraged the notion of
concept drift, the phenomena where the distributions or relationships across features change over
time [15]. These changes can happen gradually or suddenly, and are very common in network
traffic and other human driven systems [24]. Manual analysis of learned temporal “concepts"
found signatures with consistent temporal properties and co-occurrences.

2.4   Data Mining Applied to Cyber Alerts
In [23] directed graphs were constructed based on the source and destination IPs of alerts.
Pattern mining was conducted over a day’s alerts to collect association rules. It is not sufficient
to simply collect association rules as they can occur without a true dependence existing [11]. In
[19] sequences of hyper alerts were mined to find patterns within a network. Due to the use of
hyper alerts, directional order across alert types must be estimated as a hyper alert is treated as
a single event although it is made up of a number of alerts occurring at varying times within the
current window. Applying similar pattern mining techniques to individual alerts should provide a
clearer and more confident context to signature occurrences and relationships. The performance
of sequential mining algorithms to alert and netflow data was explored in [10]. While the efficacy
of the algorithms was not explored the performance results showed both that such algorithms
can be run in an on-line manner with low performance overhead and that sequential database
construction can severely impact performance to the point that online operation is infeasible.
    Previous applications of pattern mining to cyber alert data would treat entire streams of
alerts as individual sequences. Pattern mining algorithms are then applied to the group of all
sequences within a network. While this allows for frequent patterns to be found across streams,
it does not allow for patterns unique to streams to be explored. Creating sequences within
streams is difficult as there is no clear or obvious beginning or ending point to a users “actions"
found within alerts.

2.5   Data Driven Learning of Temporal Behaviors
The Concept Learning for Intrusion Event Aggregation in Realtime (CLEAR) system is able
to dynamically learn the temporal arrival patterns of cyber alert streams in near real-time and
with no training data [28]. CLEAR aggregates alerts as they are generated by an IDS system


  Proceedings of the Conference on Applied Machine Learning for Information Security, 2021
CLEAR-ROAD                                                                       G.Werner, S.Yang


and is capable of detecting and ending an aggregate with a maximum delay of two alert arrivals.
CLEAR employs a concept learning engine which builds overarching temporal behaviors made
up of statistically similar aggregates. CLEAR’s data driven approach to aggregation allows for
aggregates to be more than collections of alerts with matching fields [9]. It’s lack of dependence
on training data and ability to learn and adapt to ongoing network traffic patterns ensures that
its concepts best represent the current temporal behaviors of alerts within a network.


3     Finding Co-Occurring Signatures: Definitions and Challenges
Finding signatures that occur with a analyst specified critical signature presents a new and
unique challenge. Figure 1 shows an example stream of cyber alerts over time. This mimics an
analyst’s view after querying for the critical alerts and those that occurred temporally near to it.
Are these alerts related to one another? Is the occurrence of the critical signature statistically
dependent on those around it? If so, what temporal patterns exist between them? Can this
example be applied to future instances of the critical signature? These are questions raised
by SOC analysts when attempting to process IDS alert data and ones that do not have easy
answers. By removing a reliance on external historic data and training, analysts can be confident
that the results and statistics they are being presented with are representative of their current
network.


                            Figure 1: Illustration of IDS alert stream


3.1     Sequencing Alerts Analytically
Not every malicious entity attacks a network the same way. Further, a single attacker may
deploy different strategies to reach the same goal when met with resistance. If all alerts from a
single external IP are collected into a single sequence, the new co-occurrences brought on by such
changes in strategy could go unnoticed. Pattern mining algorithms do not consider the number
of pattern occurrences within the same sequence, only the number of sequences containing them
[6]. To better understand and find co-occurring signatures a finer grained sequencing of alerts
is needed.
    Aggregates created by CLEAR are strong candidates for use as sequences in this context.
They are created in a data driven and unsupervised way with no external training data; se-
quencing temporally similar alert arrivals together. CLEAR aggregates are temporally invariant
and statistically unique from one another [28]. This provides great confidence that alerts are
correctly sequenced based on temporal arrival patterns.

3.2     Extracting Co-Occurring Signatures
By sequencing alerts by their temporal arrival patterns it is possible to analyze and under-
stand signature co-occurrences within specific episode types. Concepts generated by CLEAR
are statistically unique from one another. Should a specific alert signature appear in multiple
concepts, it is safe to assume that the signature was generated by two unique temporal alert


    Proceedings of the Conference on Applied Machine Learning for Information Security, 2021
CLEAR-ROAD                                                                      G.Werner, S.Yang


“episodes." The temporal structure of concepts helps give context to any co-occurring signatures
found within, and can potentially change which signatures are co-occurring for a given critical
one. SOC analysts are concerned with the flow between alerts: which signature co-occurs prior
to or after a critical one? At what timing? CLEAR aggregates are a series of successive alerts,
however aggregation is based solely on the inter arrival times (IATs) of alerts. Additional analy-
sis is necessary to discover any patterns across other alert attributes within individual concepts.
Sequential pattern mining algorithms process collections of temporally ordered items in order to
find patterns that occur frequently. From these sequences, rules and their statistical confidence
and correlation can be derived. This well fits the goals of this research, to find co-occurring
alerts and understand their temporal relationship to a critical signature. The application of se-
quential pattern and rule mining to CLEAR aggregates and concepts will produce co-occurring
signatures unique to the temporal patterns exhibited by the episodes containing them.


4     CLEAR-ROAD Architecture
To quickly and effectively find and deliver co-occurring alert signatures and their temporal char-
acteristics to a SOC analyst the Concept Learning for Intrusion Event Aggregation in Realtime
with Rare co-Occurring Alert signature Discovery (CLEAR-ROAD) system is presented. Alerts
are processed by CLEAR as they are generated by the IDS and maintain concepts and aggre-
gates for each stream with at most a two alert delay. ROAD post-processes these concepts and
aggregates with sequential pattern and rule mining. To reduce overhead and increase efficiency,
sequence databases (SDBs) are constrained to only process sub-sequences containing the critical
signature while still producing statistics accurately in relation to the entire database. ROAD’s
processing can be manually run by an analyst at anytime or scheduled to occur at fixed in-
tervals, processing recent historic time windows based on supplied parameters. ROAD finds
all co-occurring signatures and collects and presents multiple levels of statistics to the analyst.
High level summaries for critical signatures as well as in depth statistics for each co-occurring
signature are collected and delivered to analysts quickly.
    A flowchart describing the overall process of the presented system can be found in Figure
2. CLEAR runs in an online manner processing IDS alerts and outputting aggregates in near-
real time. Each aggregate is mapped to the temporal concept it represents. These aggregate
and concept mappings are then ingested by ROAD, and processed with cSPADE to extract co-
occurring alert signatures. To reduce processing overhead, a constrained database is constructed
using the analyst supplied critical signatures. Only aggregates containing the critical signatures
are processed by cSPADE, with additional post processing done to the results to account for
the aggregates not included in the SDB. Sequences and rules are then parsed, processed and
tabulated before being presented to the analyst.
    While this processed can be manually executed by an analyst at any time, it can also be run
periodically to analyze potential changes in co-occurrence. The system can process any number
of critical signatures specified by the analyst when it is invoked.

4.1     Pattern Mining of Temporal Concepts
After being processed by CLEAR, ROAD analyzes the sequenced alerts to extract co-occurring
signatures. The constrained SPADE (cSPADE) algorithm is used [29] as it allows for itemsets,
maintains order within sequences and can find frequent patterns that occur with gaps. IDSs pro-
duce a high volume of alerts, some of which are false positives [5] or unrelated to attack actions.
cSPADE’s allowance for gaps between frequent patterns in sequences means that potential noise
in the alert stream will not impact finding consistently co-occurring alerts.
    cSPADE analyzes a sequence data base (SDB) which is made up of a number of individual
sequences S. Each sequence Si contains some number of events Si = ei,1 , ei,2 , ...ei,n with each


    Proceedings of the Conference on Applied Machine Learning for Information Security, 2021
CLEAR-ROAD                                                                        G.Werner, S.Yang


                  Figure 2: Flowchart of IDS Alerts Through CLEAR-ROAD

event containing a number of items im ∈ I that occur at a specific time tn . Each successive event
in a sequence occurs at a time later than the previous event ei,a , ei,b , ta < tb . If multiple items
occur at the same time they are stored in the same event ei,j = Ij |i1 , i2 , ...im . cSPADE processes
all sequences in the SDB looking for any sub-sequences s ⊆ S which occur at a frequency higher
than a user designated minimum. A sub-sequence’s frequency, or support Sup(s), is the count
of sequences it is found in divided by the total number of sequences in the SDB N .
     Association rules can be mined from frequent sequences to derive potential relationships
between item occurrences [7]. An association rule is defined over the directional relationship of
two frequent sub-sequences A → B. The support of the rule Sup(A → B) is the proportion of
sequences the rule is found in within the SDB. Rules have a confidence value shown in (2) which
define the rate of occurrence of B given the appearance of A.

                                                             A
                                     Support = Sup(A) =                                           (1)
                                                             N
                                                     Sup(A → B)
                                  Conf (A → B) =                                                  (2)
                                                       Sup(A)
     A number of metrics have been used to analyze the “importance" of individual rules [7] based
on the specific needs of the miner. Lift is one such metric and it measures how likely it is for
the consequent of the rule to occur in relation to the antecedent based on the frequency of each
occurring individually. A lift higher than 1 indicates that the occurrence of the two parts of the
rule are directly correlated with one another, their occurrence together in a sequence is more
likely than random chance. The calculation of lift can be found in (3). Lift is a beneficial metric
in finding co-occurring alerts given that critical alerts are inherently rare. It is likely that any
potential co-occurring alert will have a much higher frequency than the critical alert, therefore
it is imperative to ensure that their co-occurrence is not mere chance, but that it is statistically
probable that they occurred in relation to one another.

                                                Sup(A → B)
                                     Lif t =                                                      (3)
                                               Sup(A) · Sup(B)
    Lift is a powerful metric as it can potentially find co-occurring signatures in as little as one
co-occurrence. With one co-occurrence, the lift calculation simplifies to A·B
                                                                            N
                                                                               where A and B are
the counts of the individual signature occurrences.

  Proceedings of the Conference on Applied Machine Learning for Information Security, 2021
CLEAR-ROAD                                                                     G.Werner, S.Yang


4.2     Constrained SDB Generation
As noted in [10], how the SDB is constructed can severely impact the performance of pattern
mining algorithms. CLEAR’s aggregates are leveraged as sequences as they represent a station-
ary alert arrival pattern [28]. Each stream is independently processed with found sequences and
co-occurring signatures across streams found through analysis of ROAD’s statistical results.
    SDBs are further constrained by discarding any sequences not containing the critical signa-
ture. After being processed by cSPADE, the resulting support statistics are scaled to account
for the unconstrained database. Figure 3 shows the boxplot of the overall size reduction of the
SDB obtained when analyzing only aggregates containing the critical signature for all critical
signatures found in the CPTC dataset. SDBs saw on average a 90.1% reduction in sequences
when including only aggregates containing the signature.


                      Figure 3: Plot of SDB size reduction for all concepts


5     Initial Experimentation and Co-Occurring Signature Findings
Experiments were conducted over two unique sets of Suricata [22] cyber intrusion alerts. The
first set of data was captured in a real world SOC operation (RSOC) environment. IP addresses
were obfuscated to maintain privacy with an additional alert field indicating which, if any,
IPs were external to the network. Alerts were captured and aggregated by CLEAR in real time
during the summer 2020. Alert correlation was later developed and integrated and was therefore
conducted offline.
     The second dataset was collected at the 2018 National Collegiate Penetration Testing Com-
petition (CPTC) [16]. CPTC is a yearly college competition where a number of teams execute
parallel penetration testing operations against a common “client" network. There were nine
teams competing with each provided identical personal and target networks. Such rigid struc-
ture and duplication in network infrastructure allows for greater confidence in potential results.
Knowing that eight groups with the same goals are acting on a network leads to an expectation
of similar behaviors across teams. This should therefore manifest in consistent alert correlation
across the dataset.
     Data streams were created based on “adversary" IPs. For RSOC this was done using the
external IPs captured by the IDS system. In CPTC this was done using the known IPs given
to each of the team members. By parsing the data in this way each stream can be most closely
interpreted as a single attacker’s behavior, and is a common approach to parsing intrusion alerts
in research.
     The CPTC dataset resulted in 50 streams made up of nearly two hundred thousand alerts
over the course of the day of the competition. The RSOC environment by comparison saw
six hundred and twenty thousand alerts in its first day alone generated by over one thousand
external IP addresses. Although the RSOC dataset experienced a larger overall number of alerts,

    Proceedings of the Conference on Applied Machine Learning for Information Security, 2021
CLEAR-ROAD                                                                     G.Werner, S.Yang


they are spread over a large number of relatively short streams. On average, RSOC streams
contain 378.9 alerts while CPTC streams contained 2097.6 alerts. It is very common in a real
world scenario for an external IP to connect to a network and generate only a small number of
alerts. In many cases these are false alarms raised by non-malicious use of the network. Even
when malicious however, it is still not uncommon for an attacker to connect to a network with
a unique IP address in order to conduct a short burst of activity. The CPTC competition by
contrast provided individuals with unique IP addresses to be used for the ten hour competition.
This results in a small number of very long streams, contrasting the RSOC dataset.
    IDS signatures contained in the CPTC dataset were manually mapped by SOC experts to
corresponding attack stages. For the following results, any alert mapped to the attack stages
“arbitrary code execution, brute force credentials, command and control, data exfiltration, and
privilege escalation" were treated as critical.

5.1   Quantitative Summary of CLEAR-ROAD
Table 1 shows the breakdown of critical signatures found in the CPTC dataset by attack stage.
Of the 113 signatures present CLEAR-ROAD found at least one co-occurring signature for 71
(62.8%) of them. The final column of the table shows regularly co-occurring signatures, those
that co-occur with the critical signature at a rate of at least fifty percent across the entire
dataset.

             Table 1: Summary Results for CPTC Critical Signatures
 Atk. Stage            Tot. Crit. Sig. w/co-sig w/reg. co-sig In RSOC                  Same Co-Sig
 ARB. CODE EXE         55              35       34             8                       1
 BRUTE FORCE CREDS 4                   3        2              3                       1
 COM. & CON.           19              9        9              10                      8
 DATA EXFIL            26              19       16             6                       1
 PRIV ESC              9               5        4              3                       2

    The final two columns of the table list the counts of critical signatures also found in the
RSOC database. Of the 30 signatures found in both datasets, a total of 14 had the same co-
occurring signatures in both datasets. The command and control and privilege escalation critical
signatures made up a majority of those that had similar co-occurring signatures across datasets.

5.2   High Level Summary Results for Critical Signatures
Table 2 shows high level summary results for individual critical signatures from both datasets.
Even at this high level, key insights can be made regarding the occurrence patterns of critical
signatures to help determine where to focus. As expected, critical signatures are extremely rare
in the set of all generated IDS alerts for both datasets, in most cases a critical alert occurs
fewer than one in one hundred alerts. However even with this rarity these signatures appear in
aggregates with a high number of unique signatures. Were an analyst to manually query the
critical alerts and the signatures surrounding them it would still be quite difficult for them to
determine which are truly co-occurring.
    By comparison, even just high level results can provide immediate feedback to an analyst.
The bottom four rows of the table highlight a group of critical signatures that co-occur with
one another. These four signatures are all variations of alerts that notify specific configuration
options in a PHP URI. It is clear that the occurrence of these signatures are related given their
identical statistics. Most likely they are borne from an attacker sweeping the network for a
number of potential PHP vulnerabilities. The signatures also co-occur with extremely high lift,
indicating that they occur together in aggregates and can very rarely be found separate from
one another.

  Proceedings of the Conference on Applied Machine Learning for Information Security, 2021
CLEAR-ROAD                                                                       G.Werner, S.Yang


                   Table 2: Summary Results of Critical Signature Correlation
                Cri.Sig.Abr. Dataset Rarity(%) Agg.Sigs. µ Lift µ IAT
                CFADMN        CPTC    1.39         43          4.67    6.1 ms
                CFADMN        RSOC    1.39         19          1.45    1.2 s
                CFAPIA        CPTC    0.16         27          4.52    1.5 ms
                CFAPIA        RSOC    0.07         14          1.75    170 ms
                CFUTIL        CPTC    0.16         27          4.52    1.5 ms
                CFUTIL        RSOC    0.04         11          1.75    322 ms
                DRUPAL        CPTC    0.13         20          5.78    10.8 ms
                DRUPAL        RSOC    0.02         15          1.33    1.6 s
                CDERED        CPTC    0.39         15          7.8     15.8 s
                SMPURI        CPTC    0.04         9           196     2.6 ms
                SSPURI        CPTC    0.04         9           196     2.6 ms
                DIFURI        CPTC    0.04         9           196     2.6 ms
                OBDURI        CPTC    0.04         9           196     2.6 ms


    To better explore the CLEAR-ROAD system and how it can impact an analyst’s ability to
understand network activity, results are presented as case studies. The only assumption made
in collecting results is that the analyst has a known collection of individual alert signatures
that are considered “critical" and that this was provided to the system at run-time. In the
presented experiments critical signatures were those categorized by SOC experts as “command
and control," “privilege escalation," “arbitrary code execution" and “data exfiltration." Examples
chosen best highlight certain findings regarding co-occurrence consistency but are not the only
examples contained within the datasets.


6     Case Studies
6.1     Case Study 1: CodeRed
The critical signature “CDERED" (GPL EXPLOIT CodeRed v2 root.exe access) was only ob-
served in the CPTC dataset, but was an attack vector leveraged by a number of the teams giving
good insight into pattern consistency across users. The CodeRed worm attempts to connect to
random hosts in the hope of finding a Microsoft IIS web server. Figure 4 shows a selection
of boxplots representing the IATs of alerts in the concepts from 7 teams containing the signa-
ture. Each of the two temporal “modes" could potentially see unique or additional correlated
signatures possibly accounting for the differences in presentation of the critical signature across
streams. Having this initial high-level context helps frame the analyst’s expectations as they
delve deeper into the statistics presented regarding individual co-occurring signatures.
    Table 3 shows statistics relating individual co-occurring signatures with the critical signature
under analysis. The third column represents the count of aggregates containing the critical and
co-occurring signature while direction columns indicates the order of occurrence between the two
signatures. The lift column shows the average lift for all occurrences of the signature pairs while
the IAT column shows the average arrival time between the critical and co-occurring signatures.
The IATs and appearances of the co-occurring signatures match with the frequencies and timings
of the concepts seen in Figure 4. There is a strong consistency in these co-occurring signatures
with the critical signature independent of the source, but dependent on the timing. Thanks to
the structure of the CPTC competition each individual attacker was targeting a copy of the
same network. With the same configuration, the same action generated the same set of co-
occurring signatures, even among any other alerts generated by other actions being taken within
the network.


    Proceedings of the Conference on Applied Machine Learning for Information Security, 2021
CLEAR-ROAD                                                                     G.Werner, S.Yang


       Figure 4: Boxplots of concepts containing “CDRED" signature in CPTC dataset

                     Table 3: Detailed Co-Occurring Signature Statistics
                Cri.Sig.Abr. Co.Sig.Abr. Apps. Dir          µ Lift µ IAT
                CDERED        ISAPIA       28      cri→co 18.9       150 ms
                CDERED        ROOTA        22      co→cri 24.5       17 ms
                CDERED        MSAAC        4       co→cri 21         5 ms
                CDERED        JEXBO        6       co→cri 3          34.5 s
                CDERED        DTLEAK       4       cri→co 6          28 s


    To fully understand the relationships between critical and co-occurring signatures and any
potential attack vectors executed, analysis of the individual signatures and their causes are
needed. The first group of co-occurring signatures “ROOTA" (GPL WEB_SERVER / root
access), “MSAAC" (GPL EXPLOIT /msadc/samples/ access) and “ISAPIA" (GPL EXPLOIT
ISAPI .idq access) co-occur with CDERED quickly, with IATs measuring in milliseconds. Sig-
natures ROOTA and MSAAC are triggered by an attempt to access specific directories of an
IIS server, the usual target of the CDERED worm. Successful access is what most likely lead to
the code red exploit being deployed against the server, generating the critical signature shortly
after the initial access. Signatures vary commonly followed the code red alert an indicates a
successful buffer overflow on a IIS server. This could indicate the worm was successful in finding
and exploiting an IIS server.
    The second temporal behavior highlights a completely different attacker action. The sig-
nature “JEXBO" (ETPRO WEB_SERVER JexBoss Common URI struct Observed 2 (IN-
BOUND)) relates to a java platform testing tool that has been used in ransomware attacks
such as “SamSam." Jexboss is used in conjunction with web servers so it is not unreasonable to
assume that it was used as a potential injection vector for the code red worm. About 30 sec-
onds after the Jexboss signature, the code red signature was alerted followed by the “DTLEAK"
(ETPRO WEB_SERVER Possible Information Leak Vuln CVE-2015-1648) which alerts of a
potential data leak through the opening of a command terminal.
    Figure 5 Illustrates the differences in timing between the found correlated signatures and a
potential timing flow of the overall attack. Such a plot makes clear the average timings between
critical and co-occurring signatures while also highlighting the temporal discrepancy between
the two “modes" that CDERED is seen in in the network. Having the knowledge of these co-
occurring signatures can help a SOC analyst to adjust their defenses to appropriately react when
a ROOTA or JEXBO alert is generated with in the network to prevent future potential exploits


  Proceedings of the Conference on Applied Machine Learning for Information Security, 2021
CLEAR-ROAD                                                                     G.Werner, S.Yang


      Figure 5: Timing illustration of unique signature co-occurrences related to Code Red


6.2    Case Study 2: ColdFusion
The critical alert signature “CFADMN" (ET WEB_SERVER ColdFusion administrator access)
occurred in both datasets with each experiencing very different appearance rates for co-occurring
signatures. While both datasets saw “CFAPIA" (ET WEB_SERVER ColdFusion adminapi
access), “CFUTIL" (ET WEB_SERVER ColdFusion componentutils access) and “DRUPAL"
(ET EXPLOIT Possible CVE-2014-3704 Drupal SQLi attempt URLENCODE 1), they did not
co-occur with CFADMN in the CPTC dataset with the same frequency as in RSOC. This may
seem strange given that most of these co-occurring signatures are directly related to ColdFusion
servers, but in fact highlights another strength of CLEAR-ROAD’s data driven processing. Just
as different network topologies can cause unique timing characteristics within streams, so too can
the maintenance and configuration of the infrastructure. The CFADMN alert is raised when the
administrator of an Adobe ColdFusion web server is remotely accessed. This is simply one of a
number of alerts that can be raised when a malicious entity is attempting to access a ColdFusion
server. CFAPIA, CFUTIL and CFPWDA (ET WEB_SERVER ColdFusion password.properties
access) all alert on different methods of gaining access to a ColdFusion server. If not configured
properly, it is possible to retrieve the component utils page, the administrator page or the
adminapi page of a ColdFusion server through a standard web call. It is possible for sensitive
infomation, such as login credentials to be stored in plaintext in the component utils pages.

      Table 4: Detailed Co-Occurring Signature Statistics for CFADMIN Critical Signature
              Cri.Sig.    Co.Sig.      Dataset Apps. Dir          µ L µ IAT
              CFADMN CFAPIA            RSOC     1630      cri→co 1.81 0.5 s
              CFADMN CFAPIA            CPTC     2         cri→co 27.6 184 us
              CFADMN CFUTIL            RSOC     903       cri→co 1.82 1.52 s
              CFAPIA      CFUTIL       CPTC     1         co↔cri 13    2 8ms
              CFADMN DRUPAL RSOC                158       co→cri 1.61 1.97 s
              CFADMN DRUPAL CPTC                1         co→cri 3     70.6 ms
              CFADMN CFPWDA CPTC                4         co→cri 23    71 ms
              CFADMN NMAPSC CPTC                82        co↔cri 1.93 5.5 ms
              CFADMN PHPINA            CPTC     31        cri↔co 11.5 20.1 ms
              DRUPAL STREX             RSOC     370       co→cri 1.2   1.84 s

    The discrepancy in co-occurrence numbers across datasets is most likely caused by effec-
tive security measures taken within the RSOC environment. ColdFusion servers in RSOC are
most likely well updated and defended to avoid vulnerabilities that allow for data leaks through
external accesses. Therefor most attackers will attempt to access the administrator, api, and


  Proceedings of the Conference on Applied Machine Learning for Information Security, 2021
CLEAR-ROAD                                                                     G.Werner, S.Yang


component utils unsuccessfully. The CPTC competition however intentionally builds networks
with vulnerabilities for teams to attack. It is likely that a ColdFusion server used in the compe-
tition could be accessed in this way, meaning teams did not need to attempt both CFUTIL and
CFADMN accesses as frequently throughout the competition. Another indicator that the CPTC
ColdFusion servers were poorly maintained is the “CFPWDA" (ET WEB_SERVER ColdFu-
sion password.properties access) which exploits a vulnerability that provides the attacker with
hashed administrator passwords. This exploit has been patched as of 2013, but it is not unlikely
that such an out of date server was intentionally included in a penetration testing competition.
Figure 6 illustrates the co-occurrences and timing for signatures related to CFADMN in both
datasets.


Figure 6: Timing illustration of Cold Fusion and co-signatures in RSOC (top) and CPTC (bot)

    Also interesting is the order of the three signatures across datasets. Since all three are
independent vectors of attacking a ColdFusion server order does not truly matter, however
there is a clear difference between the core approach used by CPTC teams and real world
entities. Interestingly both datasets see alerts with the “DRUPAL" (ET EXPLOIT Possible
CVE-2014-3704 Drupal SQLi attempt URLENCODE 1) co-occurring before CFADMN. The
DRUPAL signature can alert to potential SQL injection attacks against Drupal 7 web servers.
Drupal servers are not based on the same coding language as ColdFusion servers. Most likely
attackers are targeting web servers using a script, this theory is given more evidence by the co-
occurrence of the “NMAPSC" (ET SCAN Nmap Scripting Engine User-Agent Detected (Nmap
Scripting Engine)) in CPTC and the “STREX" (ET EXPLOIT Apache Struts 2 REST Plugin
XStream RCE (ProcessBuilder)) exploit commonly seen preceding DRUPAL in RSOC. Likely
each network is being scanned for web servers with potential vulnerabilities which when found
are targeted. The variation in ordering would indicate different scripts are being used in each
network.


7     Concluding Remarks
It is infeasible to expect an analyst to be able to manually query and process all instances of a
specific critical alert in a timely manner. While some automated approaches to alert correlation
exist they require extensive training sets requiring manual labeling and are not focused on
providing fast, intuitive feedback to an analyst in real time. SOC analysts are interested in
knowing which signatures are actively co-occurring with certain “critical" signatures within their
networks. Given the high number of alert signatures and the rate at which alerts are generated,
it is nearly impossible to expect an analyst to discover these co-occurring signatures manually.

    Proceedings of the Conference on Applied Machine Learning for Information Security, 2021
CLEAR-ROAD                                                                      G.Werner, S.Yang


    The CLEAR-ROAD system is able to quickly and effectively provide this key information
to analysts defending a network. A real world SOC operation’s alert data was processed and
co-occurring signatures were found. Deeper analysis of the signatures produced gave strong sup-
porting evidence towards the co-occurrences truly stemming from an attacker’s action. By first
processing alert arrivals to learn the temporal behaviors within a stream, co-occurring signa-
tures and patterns are given an additional dimension of context. As seen in the first case study
some critical signatures are used in different ways resulting in unique temporal patterns and
co-occurring signatures. CLEAR-ROAD was able to find consistent alert co-occurrences across
streams and across datasets with unique alert timings and vastly different stream characteristics.
    CLEAR-ROAD’s approach to SDB construction saw on average a 90.1 % reduction in the
number of sequences processed with no impact to pattern or rule mining results. This allows
for much lower processing times, providing an analyst with insights even quicker. Systems such
as this that aim to provide new and unique perspective to SOC analysts are necessary. While
automation of IDS systems and models is inevitable, the human element can never be fully
removed from cyber defense. Developing and providing tools such as CLEAR-ROAD allow for
smarter and more proactive defense.


References
 [1] F. Cuppens and A. Miege. Alert correlation in a cooperative intrusion detection framework.
     In Proceedings 2002 IEEE symposium on security and privacy, pages 202–215. IEEE, 2002.

 [2] F. Cuppens and R. Ortalo. Lambda: A language to model a database for detection of
     attacks. In International Workshop on Recent Advances in Intrusion Detection, pages 197–
     216. Springer, 2000.

 [3] O. Dain and R. Cunningham. Fusing a heterogeneous alert stream into scenarios. In
     Applications of Data Mining in Computer Security, pages 103–122. Springer, 2002.

 [4] H. Debar and A. Wespi. Aggregation and correlation of intrusion-detection alerts. In
     International Workshop on Recent Advances in Intrusion Detection, pages 85–103. Springer,
     2001.

 [5] H. T. Elshoush and I. M. O. Alert correlation in collaborative intelligent intrusion detection
     systems—a survey. Applied Soft Computing, 11(7):4349–4365, 2011.

 [6] P. Fournier-Viger, J. C.-W. Lin, R. U. Kiran, Y. S. Koh, and R. Thomas. A survey of
     sequential pattern mining. Data Science and Pattern Recognition, 1(1):54–77, 2017.

 [7] J. Hipp, U. Güntzer, and G. Nakhaeizadeh. Algorithms for association rule mining—a
     general survey and comparison. ACM sigkdd explorations newsletter, 2(1):58–64, 2000.

 [8] M. Husák and M. Čermák. A graph-based representation of relations in network security
     alert sharing platforms. In 2017 IFIP/IEEE Symposium on Integrated Network and Service
     Management (IM), pages 891–892. IEEE, 2017.

 [9] M. Husák, M. Čermák, M. Laštovička, and J. Vykopal. Exchanging security events: Which
     and how many alerts can we aggregate? In 2017 IFIP/IEEE Symposium on Integrated
     Network and Service Management (IM), pages 604–607. IEEE, 2017.

[10] M. Husák, J. Kašpar, E. Bou-Harb, and P. Čeleda. On the sequential pattern and rule
     mining in the analysis of cyber security alerts. In Proceedings of the 12th International
     Conference on Availability, Reliability and Security, pages 1–10, 2017.


  Proceedings of the Conference on Applied Machine Learning for Information Security, 2021
CLEAR-ROAD                                                                   G.Werner, S.Yang


[11] P. Lenca, B. Vaillant, P. Meyer, and S. Lallich. Association rule interestingness measures:
     Experimental and theoretical studies. In Quality Measures in Data Mining, pages 51–76.
     Springer, 2007.

[12] H. Liao, Y. Lin, C.and Lin, and K. Tung. Intrusion detection system: A comprehensive
     review. Journal of Network and Computer Applications, 36(1):16–24, 2013.

[13] F. Maggi, M. Matteucci, and S. Zanero. Reducing false positives in anomaly detectors
     through fuzzy alert aggregation. Information Fusion, 10(4):300–311, 2009.

[14] S. Mirheidari, S. Arshad, and R. Jalili. Alert correlation algorithms: A survey and taxon-
     omy. In Cyberspace Safety and Security, pages 183–197. Springer, 2013.

[15] J. Moreno-Torres, T. Raeder, R. Alaiz-RodríGuez, N. Chawla, and F. Herrera. A unifying
     view on dataset shift in classification. Pattern Recognition, 45(1):521–530, 2012.

[16] J. Pelletier. Collegiate penetration testing competition, 2018. https://nationalcptc.
     org/.

[17] X. Qin and W. Lee. Discovering novel attack strategies from infosec alerts. In European
     Symposium on Research in Computer Security, pages 439–456. Springer, 2004.

[18] H. Ren, N. Stakhanova, and A. Ghorbani. An online adaptive approach to alert correlation.
     In International Conference on Detection of Intrusions and Malware, and Vulnerability
     Assessment, pages 153–172. Springer, 2010.

[19] R. Sadoddin and A. A. Ghorbani. An incremental frequent structure mining framework for
     real-time alert correlation. computers & security, 28(3-4):153–173, 2009.

[20] R. Smith, N. Japkowicz, and M. Dondo. Clustering using an autoassociator: A case study
     in network event correlation. In IASTED PDCS, pages 613–618, 2005.

[21] J. Sun, L. Gu, et al. An efficient alert aggregation method based on conditional rough
     entropy and knowledge granularity. Entropy, 22(3):324, 2020.

[22] Suricata. Suricata open source ids, 2020. https://suricata-ids.org/.

[23] J. Treinen and R. Thurimella. A framework for the application of association rule mining
     in large intrusion detection infrastructures. In International Workshop on Recent Advances
     in Intrusion Detection, pages 1–18. Springer, 2006.

[24] P. Vaz de Melo, C. Faloutsos, R. Assunção, and A. Loureiro. The self-feeding process:
     a unifying model for communication dynamics in the web. In Proceedings of the 22nd
     international conference on World Wide Web, pages 1319–1330. ACM, 2013.

[25] J. Viinikka, H. Debar, L. Me, A. Lehikoinen, and M. Tarvainen. Processing intrusion
     detection alert aggregates with time series modeling. Information Fusion, 10(4):312–324,
     2009.

[26] G. Werner, S. Yang, and K. McConky. Time series forecasting of cyber attack intensity.
     In Proceedings of the 12th Annual Conference on Cyber and Information Security Research,
     CISRC ’17, pages 18:1–18:3, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4855-3.
     doi: 10.1145/3064814.3064831. URL http://doi.acm.org/10.1145/3064814.3064831.

[27] G. Werner, S. Yang, and K. McConky. Leveraging intra-day temporal variations to predict
     daily cyberattack activity. In 2018 IEEE International Conference on Intelligence and
     Security Informatics (ISI), pages 58–63. IEEE, 2018.

  Proceedings of the Conference on Applied Machine Learning for Information Security, 2021
CLEAR-ROAD                                                                  G.Werner, S.Yang


[28] G. Werner, S. Yang, and K. McConky. Near real-time intrusion alert aggregation using
     concept-based learning. In Proceedings of the 18th ACM International Conference on Com-
     puting Frontiers. ACM, 2021.

[29] M. Zaki. Sequence mining in categorical domains: incorporating constraints. In Proceedings
     of the ninth international conference on Information and knowledge management, pages
     422–429, 2000.


  Proceedings of the Conference on Applied Machine Learning for Information Security, 2021