=Paper=
{{Paper
|id=Vol-3260/paper6
|storemode=property
|title=ThePhish: an Automated Open-Source Phishing Email Analysis Platform
|pdfUrl=https://ceur-ws.org/Vol-3260/paper6.pdf
|volume=Vol-3260
|authors=Emanuele Galdi,Gaetano Perrone,Simon Pietro Romano
|dblpUrl=https://dblp.org/rec/conf/itasec/GaldiPR22
}}
==ThePhish: an Automated Open-Source Phishing Email Analysis Platform==
<pdf width="1500px">https://ceur-ws.org/Vol-3260/paper6.pdf</pdf>
<pre>
ThePhish: an Automated Open-Source Phishing Email
Analysis Platform
Emanuele Galdi1 , Gaetano Perrone1,* and Simon Pietro Romano1
1
    Department of Information Technology and Electrical Engineering, University of Naples Federico II, Naples, Italy


                                         Abstract
                                         Phishing, and specifically phishing emails, are becoming the most pervasive cyberattack and the most
                                         widely used infection vector. As a consequence, SOCs, CERTs, and CSIRTs are overwhelmed by the
                                         number of emails that they need to analyze every day, with the majority of them being false positives. The
                                         manual email analysis is a huge waste of effort. Thus, finding approaches to the full or at least partially
                                         automated analysis is crucial. This work aims to present ThePhish, an open-source phishing email
                                         analysis platform capable of automating the entire email analysis process, starting from the extraction of
                                         the observables from the header and the body of the email to the elaboration of a verdict, which is final
                                         in most cases. The framework leverages the effectiveness of important open-source projects, namely,
                                         MISP, TheHive and Cortex, to filter out a significant number of false positives. If ThePhish is sure about
                                         the maliciousness of the email, it scores it as “malicious”. However, an email sometimes can only be
                                         considered suspicious and need further analysis. In these cases, ThePhish offers several features that
                                         allow analysts to speed up the analysis process and obtain further details on the suspicious emails.

                                         Keywords
                                         Phishing, Email, Cybersecurity, Malware, Automation


1. Introduction
The number of cyberattacks is growing faster and faster, and cyberattacks have been rated
seventh and eighth in the ranking of the top 10 risks of 2020 in terms of likelihood and impact
respectively [1]. This increment has surely been exacerbated by the COVID-19 outbreak,
which on the one hand has triggered a massive digital transformation of all the companies
and organizations around the world, but on the other hand has led to a bigger attack surface
for the attackers to exploit. A cybersecurity incident can cause a business a lot of damage,
such as financial losses, loss of productivity, reputation damage, legal liability, or business
continuity problems. Among the sheer number of typologies of cyberattacks, the ones that
are becoming the most pervasive are those involving social engineering [2]. Social engineering
is the psychological manipulation of people into performing actions or divulging confidential
information [3], so it does not depend on the technological measures used by an organization
to protect its assets, but it is based on human error and, as it is said by Bruce Schneier in [4]:
ITASEC’22: Italian Conference on Cybersecurity, June 20–23, 2022, Rome, Italy
*
 Corresponding author.
$ emanuele.galdi@secsi.io (E. Galdi); gaetano.perrone@unina.it (G. Perrone); spromano@unina.it (S. P. Romano)
 https://github.com/emalderson/ (E. Galdi); https://github.com/giper45/ (G. Perrone);
https://www.docenti.unina.it/simonpietro.romano/ (S. P. Romano)
 0000-0002-1607-1095 (E. Galdi); 0000-0001-8238-6426 (G. Perrone); 0000-0002-5876-0382 (S. P. Romano)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
“People often represent the weakest link in the security chain and are chronically responsible for
the failure of security systems”. Social engineering itself is a broad category of attacks, of which
the best known is definitely phishing. It consists of an attacker persuading a victim to take
actions that make it possible for the attacker himself to access the victim’s device, account, or
personal information, for example, by pretending to be a person or organization the victim trusts
[5]. Although the methods to carry out a phishing attack are diverse, when one thinks about
phishing, the first thing that comes to his mind is surely a situation where the victim receives an
email that tries to convince him to click on a malicious link or downloads a malicious attachment.
The problem is that while some phishing emails are fairly simple to spot for someone with
a trained eye, some are definitely not. With the number of cyberattacks growing so rapidly,
another problem arises: SOCs (Security Operations Centers), CERTs (Computer Emergency
Response Teams), and CSIRTs (Computer Security Incident Response Teams) are struggling to
face all the alerts that require their attention. Moreover, the number of accurate alerts is only a
small percentage, with the rest being false positives. This leads to a massive waste of effort by
the analysts who have to manually take several actions to filter out the false positives instead
of focusing on only the alerts. This problem is a natural consequence of a lack of automation,
which is critical for a SOC that wants to do its job in a more efficient way [6]. Clearly, the
concept of automation inside a SOC can be also applied to alerts related to potential phishing
emails. In fact, analyzing an email is a time-consuming task that can take hours, so the full or
at least partial automation of the analysis to filter out false positives or simplify the analyst’s
job reveals to be crucial. In this work, we present ThePhish, an open-source phishing email
analysis platform that makes analysts’ activities faster and more efficient. The remainder of this
paper is structured as follows. In Section 2 we explore several open-source similar solutions and
show their limitations. Section 3 describes the Cyber Threat Intelligence concept and introduces
the frameworks used by ThePhish, i.e., MISP, TheHive and Cortex. We also show that it is
possible to integrate these frameworks to benefit from all their features, as we did in ThePhish.
Sections 4 and 5 describe ThePhish modules and explain how it is possible to integrate the
framework into security analysis processes. The last section shows possible future evolutions
of the platform.


2. State of the art
Tools that help the analyst do his job when it comes to phishing email analysis can be imple-
mented following different approaches and provide different automation levels to the analysis.
In this section, some already existent tools will be presented, with a focus on free online tools
and open-source projects that are available on GitHub, leaving out the various commercial
and paid products, which are usually implemented as comprehensive security solutions that
are tailored on the needs of a particular organization with the objective of offering protection
against many types of attacks, phishing included, and that also provide automated ways to
respond to such attacks if necessary. That is because the aim of this work is to implement an
open-source solution that is specifically focused on being an aid to an analyst that is dealing
with phishing emails and anyone can use that for free.
2.1. MxToolbox
MxToolbox supports global Internet operations by providing free, fast, and accurate network
diagnostic and lookup tools [7]. For example, it is possible to test SMTP, HTTP/HTTPS, and
DNS servers, perform different types of DNS queries, or obtain information about domains and
IP addresses. What is more is that it has some tools that can be helpful when analyzing an
email. In fact, it is possible to check if a domain has correctly set-up SPF, DKIM, and DMARC
records or if an IP address is blacklisted, but the most useful tools are certainly the Email Header
Analyzer and the Spam Analyzer. The former makes it possible to submit the header of an
email to get it in a human-readable format and with the possible problems highlighted, like a
failed DMARC authentication. The latter allows submitting a full email (header and body) and
analyzing it. It uses the SpamAssassin [8] software to analyze a message and return a spam
score. SpamAssassin is an intelligent email filter that uses a diverse range of tests to identify
spam emails. These tests are applied to both the header and the content of the email to classify
it using advanced statistical methods.

2.2. EmailAnalyzer
EmailAnalyzer is an open-source command-line tool written in Python that is available on
GitHub. It is able to extract data such as email addresses, IP addresses, URLs and attachments
from an email provided as an EML or MSG file [9]. It supports the expansion of shortened URLs
and also allows scanning URLs and hashes of attached files with VirusTotal [10]. Furthermore,
it allows the use of a whitelist through a configuration file that accepts regular expressions. At
the end of the analysis this tool creates the extracted-attachments folder, which contains
the attachments found in the email, and different files containing:

    • the extracted email addresses (i.e., emails.txt);
    • the extracted IP addresses (i.e., ips.txt);
    • the extracted URLs (i.e., urls.txt);
    • the extracted Received lines (i.e., received.txt).

If the command-line option -vt is used, the tool will scan the URLs that are located in the
urls.txt file and the files that are located in the extracted-attachments folder with
VirusTotal. If a file is found as malicious, it will be renamed by appending the word _malware at
its end, while if a URL is found as malicious, it will be added to a file named malware_urls.txt.
Obviously, an API key for VirusTotal is needed to use this feature. However, it has been noted
that the extraction of the observables it makes through the use of regular expressions is not
100% precise. For instance, those regular expressions may fail and extract URLs that end with an
HTML tag, or erroneous header lines (e.g., Received-SPF among the Received ones). Moreover,
email Message-ID header lines may be mistaken for the email addresses as well.

2.3. Sooty
Sooty is an open-source command-line tool written in Python that is available on GitHub. It
aims to aid SOC analysts with automating part of their workflow. One of the goals of Sooty is to
perform as many of the routine checks as possible, allowing the analyst to spend more time on
deeper analysis within the same time frame [11]. It offers the following features that allows to:

    • transform a URL in a way that prevents a user from clicking on it by mistake. This is
      particularly useful when dealing with malicious URLs.
    • choose one of many different transformations for URLs or strings, such as URL unshort-
      ening or base64 decoding).
    • perform reputation checks for IP addresses, email addresses or URLs on VirusTotal,
      AbuseIPDB [12] and Tor exit nodes
    • perform DNS, reverse DNS and WHOIS lookups.
    • generate a hash for a string or a text and check it for known malicious activity against
      VirusTotal.
    • extract IP addresses, email addresses, URLs, and some basic header lines from an email
      provided as an MSG file format (EML is currently not supported). Then, it can automat-
      ically check email addresses against emailrep.io [13] for known malicious activity and
      also against HaveIBeenPwned [14] to see if those addresses are included in a past data
      breach. Moreover, it allows submitting URLs to PhishTank [15] to see if they are related
      to phishing. Lastly, it is possible to create a dynamic response template based on the
      result of the email analysis.
    • get a reputation report from urlscan.io [16] for a URL.

Obviously, most of the external tools used to analyze the information provided by the analyst
require an API key. It should be noted that the tool searches for those pieces of information
only inside the body of the email, ignoring the header lines. Also, it does not extract all the
header lines but only the ones usually displayed by the mail client application. The only email
address that is automatically analyzed is the one found in the From field of the email. All the
other pieces of information that have to be analyzed must be provided manually to the tool.

2.4. IsThisLegit
IsThisLegit is an open-source tool that is available on GitHub [17]. It makes it easy to receive,
analyze and respond to phishing reports and consists of two parts:

    • Dashboard: a Google App Engine application and is the analyst’s window into phishing
      reports from the organization. Analysts can use the dashboard to view, categorize and
      respond to phishing emails.
    • Chrome extension: a button in Gmail application that makes it easier for users to report
      phishing emails to the dashboard.

Once an email is submitted, a report is shown in a table on the initial dashboard page. An
analyst can view a particular report and perform the following actions:

    • Classify the report as “Benign”, “Malicious” or “Pending”.
    • Respond to the user who submitted the report.
    • Obtain an overview about the report, which includes information such as the person who
      submitted the report, the time at which the report was submitted and the current status.
      It also contains a timeline, which tracks any changes to the report.
    • View the list of all the header fields in the email.
    • View the text and HTML content of the email.
It is also possible to create rules that check for a match against a new phishing submission and
perform an action if they match. These rules can either match on header fields or on the body
content. The actions are short Python code snippets that can be coded by the analyst at will.
For example, a rule can be used to automatically classify a report if it matches that rule. Thus,
by using this tool an analyst is able to manage the reports in a simple and polite way, but he
has to write Python snippets that allow automating part of the analysis, otherwise he has to
analyze all the pieces of information manually.


3. Cyber threat intelligence
In order to analyze an email, it is important to extract many pieces of information, such as IP
addresses, email addresses, domains, and URLs, and analyze them one by one. From now on,
the term that will be used to refer to such information is the term observable. An observable
is defined as an event (benign or malicious) on a network, or a system [18]. This event can
be, for instance, the sighting of a specific IP address that might require a further analysis if
considered malicious. If an observable is related to malicious activity, it is called Indicator
of Compromise (IoC). An IoC is defined as an artifact that, with high confidence, indicates a
computer intrusion [19]. In the case of email analysis, IoCs can be IP addresses, email addresses,
domains, URLs deemed related to phishing or spam activity, and dangerous attachment files. To
find out whether an observable is malicious, an analyst can use online tools such as VirusTotal
to analyze a URL or a file or check an IP address or domain against a blocklist. In addition,
he/she can use sandbox services to run an executable file (e.g., an email attachment) into an
automated, virtualized, and safe environment to observe its behavior and obtain additional IoCs.
After IoCs have been identified via a process of incident response and computer forensics, they
can be used for early detection of future attack attempts using intrusion detection systems and
antivirus software.
   Although IoCs found inside an organization might be useful to prevent future similar attacks,
they are not enough to prevent attacks that are currently in the wild and have never targeted the
organization. Also, IoCs alone are not able to identify who is behind an attack, their motivations,
or the ultimate sponsor of the attack itself [20]. If an organization had this information, it
would be able to make important decisions that go far beyond the simple addition of a rule
in a firewall. What is needed is the concept of Cyber Threat Intelligence (CTI). Gartner defines
CTI as “evidence-based knowledge, including context, mechanisms, indicators, implications
and actionable advice, about an existing or emerging menace or hazard to assets that can be
used to inform decisions regarding the subject’s response to that menace or hazard” [21]. CTI
is not only the raw data or the single IoC but requires rich contextual information that can
only be created with the application of human analysis. This contextual information includes
the linkage between the technical indicators, adversaries, their motivations and intents, and
information about who is being targeted. Analysts, in fact, should not only focus on the result of
an attack, like an IoC but also look at the tactics and techniques that indicate that an attack is in
progress. The real advantage provided by CTI is, however, the concept of cyber threat information
sharing. It provides access to threat information that might otherwise be unavailable to an
organization, which in this way can benefit from the knowledge, experience, and capabilities
of other organizations to gain a complete understanding of the threats that it may face. This
allows one organization’s detection to become another’s prevention [22]. In the remainder of
this section we will describe three of the most important CTI open-source freameworks, namely,
MISP, Cortex and TheHive.

3.1. MISP
Malware Information Sharing Platform (MISP) is a free and open-source software helping in-
formation sharing of threat intelligence including cyber security indicators [23]. Its aim is to
help improve the countermeasures used against targeted attacks and set up preventive actions
and detection [24]. It makes it possible to store IoCs in a structured manner, and thus enjoy the
correlation, automated exports for IDSes or SIEMs, in STIX, OpenIOC and many other formats
including CSV and plain text, and synchronize to other MISP servers. It also makes it easier to
share with, but also to receive from trusted partners and trust groups so as to enable fast and
effective detection of attacks. MISP provides an intuitive web interface, but also a REST API
that can be used for automation and feeding devices. Moreover, there is also a Python library
called PyMISP that allows easy access via the API [23]. The basic building block of MISP is
the event. Each event is made up of a list of attributes, which are atomic pieces of data that
could be IoCs like for example an IP address, a URL or a file. Each time an attribute is created,
MISP will check whether that attribute already exists in the system so as to find a correlation.
Not only does a correlation manifest itself with an exact match, but also with the presence of
attributes that are believed to be related in some way, for example there may be a correlation
between an IP address and a range of IP addresses to which it belongs. Moreover, it is possible
to add information to an event by either using tags or more advanced features. Apart from
being a self-contained repository of attacks and malware, one of the main features of MISP is its
ability to connect to other MISP servers (also called MISP instances) and share its information.
The exchange of data between two or more MISP instances is called synchronization. Another
way of getting events from a remote source in MISP is by using feeds. Feeds are remote or
local resources containing indicators that can be automatically imported into MISP at regular
intervals. A great advantage of MISP is the fact that it is supported by many open-source and
proprietary tools. An example of such a tool is TheHive, which is an open-source Security
Incident Response Platform (SIRP) [25].

3.2. TheHive
TheHive is a scalable, open-source and free Security Incident Response Platform (SIRP) created
by TheHive Project, tightly integrated with MISP, designed to make life easier for SOCs, CSIRTs,
CERTs and any information security practitioner dealing with security incidents that need to
be investigated and acted upon swiftly. It provides a web interface from which it is possible to
manage alerts related to security events coming from a multitude of sources, like a SIEM, an IDS,
an email report or a MISP event. Alerts can be ignored, marked as read, previewed and imported.
When an alert is imported, it becomes a case that needs to be investigated [26][27][28]. It also
offers a REST API and most of its endpoints are accessible through TheHive4py, which is the
Python API client for TheHive [29]. The core construct of TheHive is the case, which is also the
core construct of most security investigations. A case is characterized by a title, a description
and a date. Moreover, it is also characterized by several elements, some of which are outlined
below [26][27][28][30]:

    • Tasks: They are used to track the actions taken to answer the investigative questions,
      but also to track containment, eradication and remediation events. They can contain
      multiple logs, which are text entries used to describe an analyst’s progress, attach pieces
      of evidence or noteworthy files and even password-protected ZIP archives containing
      malware or suspicious data.
    • Observables: They can be of different types, for example IP addresses, email addresses,
      URLs and domains. In addition, custom observable types can be defined if needed. They
      can be tagged, flagged as IoC and analyzed. If an observable in a case has already been
      seen in other cases, it is automatically marked as sighted, and cases that share common
      observables are considered related.
    • Tags: They are another way of adding information to a case and can be used for quick
      searching and filtering. They are labels that can be attached to cases, but also to many
      other TheHive objects like alerts and observables. For instance, it is possible to add the
      source that provided or generated an observable by using a tag.

In order to reduce the time wasted on the creation of cases that share the same structure, TheHive
supports the creation of pre-defined case templates. Those templates can also be used to outline
the steps to follow so as to drive the team’s activity. [26][27]. A case can be generated from
an alert or created from scratch. The alert is another important TheHive construct that shares
many properties with the case construct, including the observables that have been observed
in the security event that generated that alert. All those fields will be directly mapped to the
correspondent case fields once the alert is imported. Moreover, case templates can also be used
to create cases from alerts when they are imported. A key feature of TheHive is collaboration. In
fact, each analyst has his own account with its set of permissions and every action he performs
is logged in a real-time live feed, which is visible by other analysts. Both cases and tasks can be
assigned to an analyst and it is possible for multiple analysts to work on the same case but on
different tasks, so as to share the responsibilities. In order to keep track of the progress of the
investigation, cases and tasks can be in different states. For example, a case can be in the “Open”
state, but it can transit in the “Resolved” state when the analysis is terminated and the case
gets closed, specifying a series of fields like the possible impact of the incident. Similarly, tasks
can be in a “Waiting” state when they have not been assigned to an analyst yet, then they can
transit in the “InProgress” state when they are started and they can transit in the “Completed”
state when they are closed.
3.3. Cortex
Cortex is a powerful open-source observable analysis and active response engine [26] created by
TheHive Project that allows analyzing observables at scale by querying a single tool instead of
several [31]. It provides a web interface from which it is possible to analyze observables one by
one or in bulk mode, but it can also be used to automate these operations and submit large sets
of observables from TheHive or through the Cortex REST API. Moreover, most of the endpoints
offered by the Cortex REST API are accessible through Cortex4py, which is the Python API client
for Cortex [32]. The usage of Cortex is based on neurons, which are autonomous applications
managed by and run through the Cortex core engine [33]. They can be of one of two types:

    • Analyzers: They allow analyzing different types of observables by automating the in-
      teraction with a service or a tool so as to speed up the analysis and make it possible to
      contain threats before it is too late. Cortex comes with more than a hundred analyzers for
      popular services such as VirusTotal, emailrep.io, urlscan.io, AbuseIPDB and PhishTank.
      It should be noted that while many analyzers are free to use, some require special access
      and others need a valid service subscription or product license [26][33].
    • Responders: They are installed along with the analyzers. Unlike analyzers, they are only
      useful when Cortex is used in conjunction with TheHive, in fact they perform different
      actions and apply to alerts, cases, tasks, task logs and observables [33].

Analyzers and responders can be enabled, disabled and configured from the web interface. For
each of them it is possible to define many parameters such as a rate limits, usernames, passwords
and API keys. When an observable is submitted for analysis, Cortex creates a job. That job will
generate an analysis report in JSON format if it terminates successfully [28]. Moreover, these
job reports can be cached so that if an analyzer is launched against the same observable, the
previous report can be returned without re-executing the analyzer. The cache is used only if the
second job occurs within a configurable amount of time, where the default is 10 minutes [33].
A job is also created when a responder is started. Also in this case a JSON report regarding the
result of the performed action is provided.

3.4. Integrating TheHive, Cortex and MISP
Cortex is a great tool on its own, but its real potential is only unlocked when used in conjunction
with TheHive. Indeed, TheHive can connect to one or multiple Cortex instances in order to have
access to neurons. When TheHive is connected to a Cortex server, it is possible to launch one
or more analyzers against the observables in a case so as to obtain additional information about
them. The output of each analyzer is a report in JSON format that is viewable from TheHive.
Moreover, it is also possible to launch responders against cases, observables, tasks, task logs
and alerts to execute an action. For example, the Mailer responder can be launched against a
task to automatically send an email containing the description of the task itself, which can be
for example the result of the analysis [34].
   Even though many organizations can share information about cases and observables through
TheHive and Cortex, the actual support for cyber threat information sharing is provided by the
integration with MISP. In fact, by integrating TheHive with MISP it is possible to automatically
import MISP events as alerts and also export cases to MISP as events. Moreover, also Cortex
can be integrated with MISP so as to allow searching observables within a MISP instance. This
is possible thanks to a MISP Search analyzer that is available for Cortex that returns the number
of events where the observable has been found and a list of links to those events with additional
data. This analyzer is very useful when it is launched against an observable in a case from
TheHive, so as to further enrich the information on the investigation.


4. ThePhish
Many tools can be used to automate or facilitate some of the activities to perform to analyze an
email, but none of them provides complete automation for the analysis process. The aim of this
work is to take advantage of the huge potential of TheHive, Cortex and MISP, and develop an
application that is able to automate the entire analysis cycle of an email. ThePhish is a web
application written in Python that allows the analyst to choose the email to analyze and obtain
a verdict, which can be final or not. Figure 1 shows an overview of how the application works
[35][36]. The scenario depicted in Figure 1 is composed of the following steps:

   1. An attacker starts a phishing campaign and sends a phishing email to a user.
   2. A user who receives such an email can send that email as an attachment to the mailbox
      used by ThePhish.
   3. The analyst interacts with ThePhish and selects the email to analyze.
   4. ThePhish extracts all the observables from the email and creates a case on TheHive. The
      observables are analyzed thanks to Cortex and its analyzers.

 Phishing email        Attacker                  Analyst


                                                                     No
                                                                                    Case closed

                                                                            Yes
User

              Forward suspicious email to
                                                           Is the verdict final?
               ThePhish as an attachment


                                            Analysis result
                                             notification

Figure 1: ThePhish overview
   5. ThePhish calculates a verdict based on the verdicts of the analyzers.
   6. If the verdict is final, the case is closed and the user is notified. In addition, if it is a
      malicious email, the case is exported to MISP.
   7. If the verdict is not final, the analyst’s intervention is required. He must review the case
      on TheHive along with the results given by the various analyzers to formulate a final
      verdict, then it can send the notification to the user, optionally export the case to MISP
      and close the case.

ThePhish relieves the analyst from manually extracting all the observables from the header
and the body of the email and adding them one by one in a case on TheHive. Moreover, he
does not need to start the various analyzers on each observable, send notifications to users, nor
interacting with MISP. Even in the case in which his intervention is required, the majority of
the work will have already been performed so that he can focus only on things that matter to
elaborate a final verdict.
   In order to automate the analysis process, ThePhish communicates with TheHive, Cortex and
MISP through the REST APIs made available by TheHive and Cortex. The sequence diagram in
Figure 2 and Figure 3 shows the high-level interactions among all the components. Supposing
that at least one user has already sent a suspicious email as an attachment in EML format to
ThePhish, the workflow is as follows:

   1. The analyst interacts with ThePhish to obtain a list of suspicious emails to analyze.
      ThePhish obtains those emails by selecting all the unread emails in the mailbox that have
      an email message in EML format as an attachment.
   2. The analyst interacts with ThePhish to make the analysis of the selected email start.
   3. ThePhish parses both the header and the body of the attached email to extract all the
      observables.
   4. ThePhish creates a case on TheHive and adds all the previously extracted observables to
      it.
   5. ThePhish starts the Mailer responder to send a notification email to the user in order to
      let him know that the analysis of the email he sent has started.
   6. ThePhish waits for the responder job to complete.
   7. For each observable, ThePhish starts all the analyzers that have been configured on Cortex
      for that observable type. It should be noted that the control returns to ThePhish once an
      analyzer job is started, without waiting for the analysis to terminate.
   8. If the analyzer is the MISP Search analyzer, Cortex checks the presence of the analyzed
      observable among MISP events, while for the other analyzers the interaction is with an
      external service and is not represented in the diagram.
   9. ThePhish waits until all the previously started analyzer jobs are terminated.
  10. Once the results of all the analyzers are available, ThePhish elaborates the verdict on the
      email.
  11. If the verdict is “malicious”, ThePhish interacts with TheHive, which in turn interacts
      with MISP, in order to export the case as an event to MISP.
                                             ThePhish                       TheHive                          Cortex                       MISP
         User   Analyst


                          Request list of
                          emails to analyze         Obtain suitable
                                                    emails from mailbox


                           Select email
                           to analyze
                                                    Obtain attached email


                                                    Parse attached email
                                                    and extract observables


                                                           Create case


                                          loop     [for each observable]


                                                    Add observable to case


                                                     Run Mailer responder
                                                                                    Run Mailer responder

                                                                                       responder job
                                                          responder job

                                                   Notify start of
                                                   analysis via email

                                                    Wait for the responder
                                                    job to terminate


                                      loop       [for each observable]


                                          loop     [for each applicable analyzer]


                                                           Run analyzer
                                                                                       Run analyzer

                                                                                        analyzer job
                                                           analyzer job


                                                                                                  loop          [for each started analyzer]

Figure 2: ThePhish sequence diagram (part 1)
                                                                                                       alt      [analyzer == MISP Search]


                                                                                                                      Search observable
                                                                                                                      among events
  12. If the verdict is either “malicious” or “safe”, the Mailer responder is started to send the
      verdict via email to the user and then, when the responder job              terminates, the case is
                                                                              [else]

      closed. The verdict is then shown to the analyst, which can thenInteract     start with
                                                                                          a new analysis.
                                                                                 external service
  13. If the verdict is “suspicious”, the analyst can review the case on TheHive. In that case, the
      results of all the analyzers will already be available and the analyst only has to make the
                                   Wait for all the analyzer
      final decision and close thejobs
                                     case.    Moreover, if the email is classified as “malicious” in this
                                       to terminate

      phase, the analyst can also export the case to MISP from TheHive in just one click. This
      allows ThePhish to classify     the next email that has some observables in common with
                                   Elaborate verdict

      this email to be classified as “malicious” as well, thanks to the MISP Search analyzer.
   alt      [verdict == malicious or verdict == safe]


                                          opt     [verdict == malicious]


                                                        Export case to MISP
                                                                                                       Create event


                                                     Run Mailer responder
                                                                                    Run Mailer responder

                                                                                       responder job
                                                          responder job

                                                        Send verdict
                                                        via email

                                                    Wait for the responder
                                                    job to terminate
                                         loop    [for each applicable analyzer]


                                                        Run analyzer
                                                                                     Run analyzer

                                                                                      analyzer job
                                                        analyzer job


                                                                                                loop          [for each started analyzer]


                                                                                                     alt      [analyzer == MISP Search]


                                                                                                                    Search observable
                                                                                                                    among events


                                                                                                              [else]


                                                                                                                    Interact with
                                                                                                                    external service


                                                  Wait for all the analyzer
                                                  jobs to terminate


                                                  Elaborate verdict


   alt      [verdict == malicious or verdict == safe]


                                         opt    [verdict == malicious]


                                                     Export case to MISP
                                                                                                     Create event


                                                   Run Mailer responder
                                                                                  Run Mailer responder

                                                                                     responder job
                                                        responder job

                                                     Send verdict
                                                     via email

                                                  Wait for the responder
                                                  job to terminate


                                                         Close case

                          Show verdict

            [verdict == suspicious]


                     Request intervention
                     of the analyst


         User   Analyst
                                          ThePhish                        TheHive                          Cortex                       MISP


Figure 3: ThePhish sequence diagram (part 2)


5. ThePhish Implementation
This section will describe implementation details and illustrate the main ThePhish modules.
Finally, we explain the communication between the front-end and the back-end component.
ThePhish is a web application written in Python 3. The web server is implemented using Flask,
which is a lightweight Web Server Gateway Interface (WSGI) web application framework [37].
In contrast, the front-end part of the application, which is the dynamic page written in HTML,
CSS, and JavaScript, is implemented using the front-end framework Bootstrap. Apart from the
webserver module, the back-end logic of the application is constituted by three Python modules
that encapsulate the logic of the application itself and a Python class used to support the logging
facility through the WebSocket protocol. Moreover, there are several configuration files used by
the aforementioned modules that serve various purposes.

5.1. Obtaining the list of emails to analyze
The list_emails module allows analyst to visualize the emails suitable for the analysis. Figure
4 shows the activity diagram of this module. The first two actions that the module performs are
needed to connect to the IMAP server. Then, it retrieves all the unread emails from the mailbox
on the IMAP server using the imaplib module and walks through their multipart structure
using the email Python standard library module to find out whether they contain an EML
attachment. For each email that satisfies this condition, From, Subject, the date and email body
will be extracted, along with the Subject field of the attached email. These pieces of information,
together with the Unique Identification Number (UID) of the email, will be added to the list to


 List emails

                                        Retrieve emails

                 Initial
              configuration

                                           Select all the unread
                                           emails with an EML
                                           attachment from the
                                                 mailbox
               Connect to
              IMAP server

                                               Create a list
                                                containing
                                          information regarding                       List to
                                           the selected emails                        display


Figure 4: Activity diagram of the list_emails module
return so that the analyst can visualize information regarding the emails to analyze on the web
interface. The UID is a fundamental piece of information. Indeed it uniquely identifies the
email in an IMAP folder and is used when the analyst selects the email to process during the
analysis. When the execution of the module terminates, information regarding the emails that
can be analyzed is returned. It is displayed to the analyst on the web interface, giving him the
possibility to choose the email to analyze.

5.2. Creating the case on TheHive
When the analyst selects an email, the case_from_email module manages the observable
extraction and the case creation on TheHive. Figure 5 shows the activity diagram of this module.
The first two actions that the module performs are needed to connect to the IMAP server and
the TheHive instance. Moreover, the file whitelist.json is also considered. It is used to
configure the lists of observables that have to be excluded from the analysis to reduce the
occurrence of false positives. When the user selects the email to analyze from the web interface,

      Create case from email
                                                               Create case


                                                               [Template does    [Template
                                                                  not exist]       exists]


                                       Parse eml
                         Initial
                                                                 Create case
                      configuration
                                                                  template

                                        Obtain Subject field
                                          of the attached          Create case on TheHive
                                               email
                       Connect to
                      IMAP server
                                              Extract            Add the observables found in
                                         observables from                 the header
                                         the header of the
                                          attached email

                 Obtain eml
                                                                 Add the observables found in
                                              Extract                      the body
                                         observables from
                                          the body of the
                      Retrieve the        attached email
                     selected email                                Add the attachments as
 UID of the                                                             observables
  email to
  analyze                                      Extract
                                         attachments of the
                     Obtain external
                                           attached email
                     From field (the                                 Add the hashes of the
                      user's email                               attachments as observables
                        address)
                                            Calculate the
                                            hashes of the
                                         attachments of the
                      Obtain EML                                    Add the EML file as an
                                           attached email
                      attachment                                         observable              Newly created
                                                                                                     case


                                                                                                Email address of
                                                                                                the user to send
                                                                                                 notifications to


Figure 5: Activity diagram of the case_from_email module
the UID previously returned by the list_emails module is sent back to the server and passed
to this module. The email module utilizes the UID to fetch the email from the mailbox and
extract its From field, i.e., the email address of the user who sent the email to ThePhish for
analysis. This email will be used to send him the email notifications. The module indeed also
extracts the EML attachment to parse. Once the EML attachment is available, it is an email
itself that has to be parsed to extract the observables it contains. The extraction logic uses the
ioc_finder Python module [38] to extract email addresses, IP addresses, URLs and domains
from that buffer. To decide whether an observable should be considered or not and avoid
analyzing that may cause false positives, ThePhish allows creating a whitelist. Through the
whitelist, the analyst can also customize the observables deemed relevant for the email analysis.
The whitelist is contained in a file named whitelist.json. It is constituted by many different
lists to offer great flexibility in terms of observable types to match and matching modes. It
supports the following matching modes:
    • Exact string matching for email addresses, IP addresses, URLs, domains, file names, file
      types and hashes
    • Regex matching for email addresses, IP addresses, URLs, domains and file names
    • Regex matching for subdomains, URLs and email addresses that contain the specified
      domains
While both the parts related to exact matching and regex matching are used without any
modification, the remaining parts are used to create three more lists of regular expressions. It is
not required for the analyst to design complex regular expressions to enable those features, but
he only needs to add the domains to the right lists. These regular expressions have been designed
to avoid some unwanted behaviors. For instance, if the intention is to whitelist the domain
“paypal.com”, they prevent domains like “paypal.com.attacker.com” from being mistakenly
whitelisted. The observables are extracted from the values of some carefully chosen header
fields and the body of the email. Then, the attachments are extracted, and their SHA256 hashes
are calculated in order to add them as observables to the case as well. Finally, the content of the
EML file itself is saved in a variable so that it can be added to the case as an observable named
after the Subject field of the email it contains. Then, TheHive4py is used to create a case and
add observables, which are also tagged to indicate the relevant email sections. Subject field of
the email under analysis is used to name the case. Before doing that, TheHive4py also creates a
case template.

5.3. Running the analysis
The run_analysis module automates the entire analysis procedure and calculates a verdict
that is shown to the analyst and sent via email to the user. Thanks to the case_from_email
module, it knows the case under analysis and the user email address to send notifications to.
Figure 6 shows the activity diagram of this module. The first two actions that the module
performs are needed to connect to the IMAP server and the TheHive, Cortex, and MISP instances.
Moreover, the file whitelist.json is also considered. Then, this module uses TheHive4py to
obtain the IDs of the three tasks present in the case and handles their life cycle. Each task is
handled by one of the remaining three actions. The three tasks are described below.
          Run analysis

Email address of the
   user to send
  notifications to

                                                                      Terminate analysis

                                              Analyze observables

                            Initial
                         configuration
                                                                                Start "ThePhish
                                                                                  result" task
                                                  Start "ThePhish
                                                   analysis" task

                                                                                Calculate verdict
                   Notify start of analysis       <<structured>>
                                                For each observable                          [Verdict is
                                                                                            "suspicious"]
                                                Start analyzers
                                                on observable
                       Start "ThePhish
                                                                                 [Else]
                       notification" task


                                                Handle rate-limited                         [Verdict is
                                                                            [Verdict is
                           Run Mailer              analyzers                                  "safe"]
                                                                           "malicious"]
                       responder (start of
                            analysis)
                                                                                 Export case to
                                                                                     MISP
                                                 Wait for analysis
                                                   completion
                       Close "ThePhish
                       notification" task
                                                                                   Run Mailer
                                                 Obtain the verdict            responder (verdict)
                                                  of each started
                                                      analyzer

                                                                                Close "ThePhish
                                                                                   result" task
                                                 Close "ThePhish
                                                  analysis" task

                                                                                 Close the case             Verdict


   Newly created
       case


Figure 6: Activity diagram of the run_analysis module


     • ThePhish notification: This task is used to send the notification that the analysis has been
       started to the user. This is obtained by starting the Mailer responder using Cortex4py
       after having filled the description of the task with the user’s email address and the body
       of the email.
     • ThePhish analysis: This task is used to analyze the observables. This is obtained by
       starting all the analyzers that are applicable to each observable using Cortex4py. Since
       this task is a fairly complex procedure, it has been modeled as an activity itself that is
       repeated for each observable, and its activity diagram is shown in Figure 7. Each analyzer
       outputs a JSON report containing a maliciousness level for an observable that can be one
 Start analyzers on observable


                                    [Else]                 [Observable is the EML file]


                        [Else]       [Observable is a URL]


                                                Execute
      Execute all the                                                             Execute Yara
                                              UnshortenLink
        analyzers                                                                   analyzer
                                                analyzer


                                             [Else]

                                                       [Successfully
                                                        unshortened]

                                     Add unshortened link as an
                                       observable to the case


                                             Execute remaining
                                                 analyzers


Figure 7: Start analyzers on observable activity diagram


      of “info”, “safe”, “suspicious”, or “malicious”. However, even though the report structure
      usually follows a convention, this convention is not always respected. Moreover, after
      analyzing the code of many analyzers and several tests, some analyzers have been found
      to contain bugs. For this reason, we made various tweaks and workarounds to obtain the
      maliciousness levels provided by these analyzers anyway or to prevent the application
      from crashing due to those bugs. Furthermore, these levels do not always represent
      the real maliciousness level of an observable. Since this depends on how the analyzers
      themselves have been programmed, ThePhish comes with another configuration file
      called analyzers_level_conf.json, with which it is possible to create a mapping
      between the actual maliciousness levels provided by any analyzer and the levels decided
      by the analyst. Besides that, this file allows the analyst to choose which observable types
      should be modified.
    • ThePhish result: This task calculates the verdict and sends it to the user. The judgment is
      calculated so that if at least one analyzer has given a malicious maliciousness level to at
      least one observable, then the email is marked as malicious, and all the observables that
      satisfy this condition are marked as IoC. If that is not the case, then the email is marked as
      “suspicious” if there is at least one analyzer that has given a “suspicious” maliciousness level
      to at least one observable. Otherwise, the email is marked as “safe”. The cases containing
      “malicious” emails, are exported to MISP along with all the observables marked as IoC.
      Finally, the task is updated with a description, including the final verdict and the email
      address of the user to send it to by using the Mailer responder. Both the task and the
      case are closed, and the verdict is returned to the analyst. However, if the verdict is
      “suspicious”, the analyst’s intervention is required, so neither the task nor the case is
      closed.

5.4. Interaction between front-end and back-end
The user interacts with a web application front-end to perform email analysis activities. The
front-end component establishes a bi-directional connection with the server to manage asyn-
chronous tasks. This is done by using the Socket.IO JavaScript library [39] in the web page
that enables real-time, bi-directional, and event-based communication between the browser
and the server. This connection is established with a WebSocket connection whenever possible
and will use HTTP long polling as a fallback. For this to work, the server application uses the
Flask-SocketIO Python library [40], which provides a Socket.IO integration for Flask applications.
ThePhish then uses this connection to display the progress of the analysis on the web interface.
Once the connection has been established, a unique identifier for the socket session (SID) is
available both on the client and the server. Then, when the analyst requests the analysis of a
specific email, that SID is sent as a parameter in the request so that the server can associate the
request with one of the connected clients. Whenever the analyst acts on the web interface, an
AJAX request is sent to the server, an asynchronous HTTP request that permits the exchange of
data with the server in the background and updates the page without reloading it. If the analyst
wants to visualize the list of emails to analyze, he must click on the “List emails” button on
the web interface, which triggers an asynchronous HTTP GET request to the correct endpoint.
Once the response is obtained, the HTML DOM is updated to show the list of emails. At this
point, the analyst can select the email to analyze by clicking on the corresponding “Analyze”
button. Again, an asynchronous HTTP request is triggered, but this time it is a POST request
to another endpoint. Once the response is obtained, the HTML DOM is updated to show the
result of the analysis. It should be noted that the payload of the request contains the UID of the
selected email and the SID used to identify the client to send log messages to.
6. Conclusions
As of today, phishing emails are the most widely used infection vector. The natural consequence
of this fact is that SOCs, CERTs, and CSIRTs are becoming overwhelmed by the number of
emails they need to analyze every day, with the majority being false positives. To avoid wasting
time and effort, many commercial or open-source solutions have been proposed to automate,
at least partially, the long and tedious process of email analysis. In this work, we presented
ThePhish, an open-source phishing email analysis platform. It is based on three open-source
platforms, namely TheHive, Cortex, and MISP, and allows automating the entire analysis process
starting from the extraction of the observables from the header and the body of the email to
the elaboration of a verdict, which is the final in most cases. In addition, it allows the analyst
to intervene in the analysis process and obtain further details on the email being analyzed
if necessary. The platform has been released under the AGPL license and made available on
GitHub so that anyone can contribute to improving it over time. The development of ThePhish
will, in fact, not stop here, as there is significant room for improvements. Further changes
will be made in the future to add new functionalities to ThePhish, support any new feature
introduced by TheHive and Cortex, support new analyzers, and fix bugs that might be present in
the current release or any future release of ThePhish. In future works, we are going to compare
ThePhish with other phishing email analysis approaches against a known dataset, such as
[41]. This dataset is composed of 303 phishing emails. Preliminary studies seem promising, as
ThePhish was able to parse 90% of the emails in the dataset and classify those parsed emails
correctly. In particular, 84.6% of the parsed emails were classified as malicious, while 15.4%
of them as suspicious. Anyway, we also would like to evaluate the performance of ThePhish
against non-phishing emails.


References
 [1] World Economic Forum (WEF), The Global Risks Report 2020, Technical Report, 2020.
 [2] 2021 Must-Know Cyber Attack Statistics and Trends, 2021. URL: https://www.embroker.
     com/blog/cyber-attack-statistics/.
 [3] Wikipedia contributors, Social engineering (security) — Wikipedia, The Free Encyclopedia,
     2021. URL: https://en.wikipedia.org/wiki/Social_engineering_(security).
 [4] Bruce Schneier, Secrets and Lies: Digital Security in a Networked World, Wiley, 2015.
 [5] All About Phishing Scams & Prevention: What You Need to Know, 2021. URL: https:
     //www.kaspersky.com/resource-center/preemptive-safety/phishing-prevention-tips.
 [6] Sam Erdheim, A SOC Under Siege: Alert Overload and Cyber Skills Short-
     age, 2018. URL: https://fidelissecurity.com/threatgeek/threat-detection-response/
     industry-professional-shortage/.
 [7] MxToolbox, 2021. URL: https://mxtoolbox.com/.
 [8] SpamAssassin, 2021. URL: https://spamassassin.apache.org/.
 [9] MrCalv1n, EmailAnalyzer, 2020. URL: https://github.com/MrCalv1n/EmailAnalyzer.
[10] VirusTotal, 2021. URL: https://www.virustotal.com/.
[11] TheresAFewConors, Sooty, 2021. URL: https://github.com/TheresAFewConors/Sooty.
[12] AbuseIPDB, 2021. URL: https://www.abuseipdb.com/.
[13] emailrep.io, 2021. URL: https://emailrep.io/.
[14] HaveIBeenPwned, 2021. URL: https://haveibeenpwned.com/.
[15] PhishTank, 2021. URL: https://www.phishtank.com/.
[16] urlscan.io, 2021. URL: https://urlscan.io/.
[17] duo-labs, IsThisLegit, 2020. URL: https://github.com/duo-labs/isthislegit.
[18] Observable - Glossary, 2021. URL: https://csrc.nist.gov/glossary/term/observable.
[19] Wikipedia contributors, Indicator of compromise — Wikipedia, The Free Encyclopedia,
     2021. URL: https://en.wikipedia.org/wiki/Indicator_of_compromise.
[20] Wikipedia contributors, Cyber threat intelligence — Wikipedia, The Free Encyclopedia,
     2021. URL: https://en.wikipedia.org/wiki/Cyber_threat_intelligence.
[21] Different Definitions of Threat Intelligence and Gartner’s Perspective, 2016. URL: https:
     //socradar.io/different-definitions-of-threat-intelligence-and-gartners-perspective/.
[22] ITL, CYBER-THREAT INTELLIGENCE AND INFORMATION SHARING, Technical Report,
     2017.
[23] MISP - Open Source Threat Intelligence Platform & Open Standards For Threat Information
     Sharing, 2021. URL: https://www.misp-project.org/.
[24] MISP - Open Source Threat Intelligence Platform, 2021. URL: https://www.circl.lu/services/
     misp-malware-information-sharing-platform/.
[25] User guide of MISP intelligence sharing platform, 2021. URL: https://www.circl.lu/doc/
     misp/.
[26] TheHive-Project, TheHive Project, 2021. URL: https://thehive-project.org/.
[27] TheHive-Project, TheHive, 2021. URL: https://github.com/TheHive-Project/TheHive.
[28] Saâd Kadhi, TheHive, Cortex and MISP: How They All Fit Together, 2017. URL: https://blog.
     thehive-project.org/2017/06/19/thehive-cortex-and-misp-how-they-all-fit-together/.
[29] TheHive-Project, TheHive4py Documentation, 2021. URL: https://thehive-project.github.
     io/TheHive4py/.
[30] Chris Sanders, Investigation Case Management with TheHive, 2017. URL: https://
     chrissanders.org/2017/03/case-management-the-hive/.
[31] TheHive-Project, Cortex, 2021. URL: https://github.com/TheHive-Project/Cortex.
[32] TheHive-Project, Cortex4py, 2021. URL: https://github.com/TheHive-Project/Cortex4py.
[33] TheHive-Project, Cortex Docs, 2021. URL: https://github.com/TheHive-Project/
     CortexDocs.
[34] Arnaud Loos, Open Source SIRP with Elasticsearch and TheHive -
     Part 6 - Case Management,                   2019. URL: https://arnaudloos.com/2019/
     open-source-sirp-part-6-case-management/.
[35] Flaticon, 2021. URL: https://www.flaticon.com/.
[36] Free Icons, Clipart Illustrations, Photos, and Music, 2021. URL: https://icons8.com.
[37] pallets, Flask, 2021. URL: https://github.com/pallets/flask.
[38] fhightower, ioc-finder, 2021. URL: https://github.com/fhightower/ioc-finder.
[39] SOCKET.IO, 2021. URL: https://socket.io/.
[40] miguelgrinberg, Flask-SocketIO, 2021. URL: https://github.com/miguelgrinberg/
     flask-socketio.
[41] J. Nazario, Phishing Corpus, 2018. URL: https://monkey.org/~jose/phishing/phishing-2015.
[42] Docker docs, 2021. URL: https://docs.docker.com.
[43] TheHive-Project, thehive4-cortex3-misp-shuffle, 2021. URL: https://github.com/
     TheHive-Project/Docker-Templates/tree/main/docker/thehive4-cortex3-misp-shuffle.
[44] Emanuele Galdi (emalderson), ThePhish, 2021. URL: https://github.com/emalderson/
     ThePhish.


A. ThePhish example usage
In order to start the analysis process, the analyst must first navigate to the web page of ThePhish
and obtain the list of emails to analyze, as shown in Figure 8. It should be noted that the emails
must be forwarded by the users as attachments in EML format so as to prevent the contamination
of the email header. The analyst can then select the email to analyze and start the analysis, the
progress of which is shown in Figure 9. In the meantime, ThePhish extracts the observables
from the email and then interacts with TheHive. Figure 10 shows the case populated with the
extracted observables. At the end of the analysis, ThePhish calculates the verdict. Since the
verdict is “malicious”, all the observables that are found to be “malicious” are marked as IoC.
In this case only one observable is marked as IoC, as shown in Figure 11. The case is then
exported to MISP as an event, with a single attribute represented by the observable mentioned
above. Figure 12 shows the event created on MISP and Figure 13 shows its attribute. It should
be noted that, due to how TheHive implements the interaction with MISP, the event will not be


Figure 8: List of emails to analyze
Figure 9: Analysis progress


Figure 10: Observables added to the case


published by default and will have to be published manually by the analyst. Nevertheless, the
event does not need to be published for the MISP Search analyzer to find a match between an
observable in a case and an attribute of an event. Then, ThePhish sends the verdict via email
to the user thanks to the Mailer responder. Finally, the case is closed. Figure 14 shows that the
case has been closed after five minutes and resolved as “True Positive” with “No Impact”, which
means that the attack has been detected before it could do any damage. Once the case is closed,
Figure 11: Observable marked as IoC


Figure 12: Event created on MISP
Figure 13: Attribute of the MISP event


Figure 14: Task and case closed


the verdict is available for the analyst on the web interface together with the entire log of the
analysis progress, as shown in Figure 15. The above-depicted case was related to a phishing
email, but a similar workflow can be observed when the analyzed email is classified as “safe”.
On the other hand, when an email is classified as “suspicious”, the verdict is only displayed
to the analyst on the web interface. At this point the analyst needs to use the buttons on the
left-hand side of the page to use TheHive, Cortex and MISP for further analysis. This is because
the analysis has not been completed yet and so the user is only notified that the analysis of the
email that he forwarded to ThePhish has been started. Indeed, the last task and the case have
not been closed yet since they need to be closed by the analyst himself once he elaborates a
final verdict. The analyst can view the full reports of all the analyzers on TheHive and Cortex,
also the ones returned by the analyzers that only return an “info” maliciousness level. These
analyzers, in fact, are not considered during the computation of the verdict but they can be
Figure 15: Final verdict (malicious) shown to the analyst


useful for getting more information about an observable. Moreover, in case this revealed not to
be enough, the analyst could also download the EML file of the email and analyze it manually.
When the analyst terminates the analysis, he can populate the body of the email to send to
the user in the last task, start the Mailer responder, export the case to MISP if the verdict is
“malicious” by clicking on the “Export” button and then close the case.
   To do this demonstration, 45 different analyzers have been enabled, 12 of which included
in the analyzers_level_conf.json file to modify their maliciousness level. Moreover, the
whitelist.json file has been populated to avoid false positives. Both these files have been
populated based on many tests on different phishing and safe emails. However, changing the
enabled analyzers and the content of the configuration files may make ThePhish give different
results. This means that a universally correct configuration does not exist, but it is highly
dependant on the organization’s needs and on whether the observables present in the email are
known by the enabled analyzers. Hence, in order to use ThePhish, an analyst must first test it
to configure it properly, but he must also keep this configuration always up to date.


B. Pull requests to TheHive4py
TheHive offers many API endpoints that allow performing the majority of the actions that can
be performed from the web interface. However, the functions provided by TheHive4py do not
cover all those API endpoints yet. In order to make ThePhish able to use all the functionalities
it needs, the following two functionalities have been added to TheHive4py:

    • Export to MISP: Exports a case to MISP with all of its observables marked as IoC.
    • Run a responder: Launches a responder against alerts, cases, tasks, task logs or observables
      by its ID.

Since TheHive4py is an open-source project available on GitHub, two pull requests have been
made to make these functionalities available to the entire community. The pull requests have
been accepted and included in the 1.8.0 TheHive4py milestone.


C. ThePhish deploy
In order to use ThePhish, it is necessary to deploy TheHive, Cortex and MISP instances along
with the services needed for them to work. The complete procedure needed to set up those
services for a production-ready environment can be found in the official guides of TheHive,
Cortex and MISP. However, in this case, Docker [42] and Docker Compose have been used to
deploy the entire application. In order to facilitate the deployment procedure, TheHive Project
has made available Docker images for both TheHive and Cortex. Moreover, several Docker
templates have been made available as well. Those templates contain a docker-compose.yml
file used to configure the containers that constitute the application. It should also be noted
that the Cortex neurons are started as docker containers themselves, which means that their
images are pulled from Docker Hub the first time they have to be executed and then every time
a neuron is re-executed, a container is created based on that image and is destroyed at the end
of the execution. ThePhish has been tested on a virtual machine running Ubuntu 20.04 LTS
with Docker Engine 20.10.8 and Docker Compose 1.29.2, using a modified version of a Docker
template provided by TheHive Project [43]. It not only uses TheHive and Cortex containers
but also MISP, MySQL, Redis, Apache Cassandra and Elasticsearch containers, which are not
provided by TheHive Project. Even though this template does not provide the full configuration
options, it is enough to demonstrate how ThePhish works. The original docker-compose.yml
file has been edited so as to remove unused services and add another container used to run
ThePhish. The versions of the services used are listed below:

    • Apache Cassandra 3.11
    • TheHive 4.1.9
    • Elasticsearch 7.11.1
    • Cortex 3.1.1
    • Redis 6.2.5
    • MySQL 8.0.26
    • MISP 2.4.148

   ThePhish has been made available on GitHub as an open-source project under the AGPL
license at the repository emalderson/ThePhish [44]. It is possible to refer to that repository for a
complete installation and configuration guide.

</pre>