Proxy in a Haystack: Uncovering and Classifying MFA
                                Bypass Phishing Attacks in Large-Scale
                                Authentication Data
                                Rebecca Lynch1 , M.C.S., Lauren Saue-Fletcher2
                                1
                                    Sr. Data Scientist, Security Research, Duo Security, Cisco Secure
                                2
                                    Graduate Student, Stanford Empirical Security Research Group, Stanford University


                                                                         Abstract
                                                                         While phishing has long been a prevalent threat against authentication systems, a recent gain in popularity
                                                                         of OSS reverse-proxy kits has made detection and prevention of phishing attacks increasingly difficult.
                                                                         Open-source tools such as evilginx are capable of not only phishing credentials and passcodes, but
                                                                         proxying an entire multi-factor authentication (MFA) flow and all associated cookies. In this scenario, the
                                                                         user sees an expected login prompt from the MFA provider, proxied through the attack server, while the
                                                                         MFA provider sees what appears to be a valid login session simply originating from a different IP address.
                                                                         To the authentication provider, the IP of the attack server is often the only apparent difference between
                                                                         a malicious and a benign authentication. This, coupled with inaccuracies in IP geolocation databases,
                                                                         highly variable user behaviors, ISP IP shuffling, benign VPN usage, and a severe imbalance between
                                                                         benign and malicious authentications, limits traditional server-side ML detection capabilities. Using data
                                                                         from Duo Security, a large authentication provider, we apply point-in-time DNS data to authentication
                                                                         records to identify domains corresponding to the source IP address of the client at the moment of access.
                                                                         We then applied targeted URL and behavioral filtering to identify likely attacker-owned domain-IP
                                                                         pairs. We analyzed authentications from these IP addresses to provide new insights on MFA phishing
                                                                         attack signatures. With this newly uncovered set of labeled malicious authentications, we test a variety
                                                                         of classification approaches in the detection of MFA bypass attacks. We demonstrate the benefits of
                                                                         threat-informed data mining in true positive sample generation, as well as the performance and usability
                                                                         tradeoffs of multiple classification methods in the server-side detection of MFA bypass attacks. These
                                                                         classification techniques applied on newly labeled phishing authentication data are then shown to
                                                                         out-perform unsupervised methods in the identification of malicious authentications.

                                                                         Keywords
                                                                         multi-factor authentication, phishing, threat detection,


                                1. Introduction
                                1.1. Terminology
                                We use the term “access device” as the device initiating an authentication, “authentication
                                device” as the (optional) device approving the authentication, such as a mobile phone approving
                                a Push request, and “user” as the end user attempting to authenticate. Users belong to an
                                CAMLIS’23: Conference on Applied Machine Learning for Information Security, October 19–20, 2023, Arlington, VA
                                ∗
                                    Corresponding author.
                                Envelope-Open beccalyn@cisco.com (R. Lynch); laurensauefletcher@stanford.edu (L. Saue-Fletcher)
                                GLOBE beccalynch.com (R. Lynch)
                                                                       © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                    CEUR
                                    Workshop
                                    Proceedings
                                                  http://ceur-ws.org
                                                  ISSN 1613-0073
                                                                       CEUR Workshop Proceedings (CEUR-WS.org)


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
“customer” or organization which has configured an authentication provider to protect one or
more “applications” that a user accesses.

1.2. Phishing Attacks
Phishing attacks remain the most prevalent methodology leveraged by bad actors; a recent
estimate[1] by CISA approximates that 90% of all recent cyber incidents began with some sort
of phishing vector. A 2023 analysis by ZScalar[2] showed that phishing attacks witnessed a
staggering increase of 47.2% in 2022. In that time, an estimated $52 million USD was lost as a
direct result of phishing attacks[3], though that number is likely higher as only approximately
2.1% of phishing attacks are actually reported[4]. Traditional phishing requires a non-trivial
amount of effort to execute, often requiring the creation of a fraudulent website. The attacker
then lures users into sharing credentials and, if needed, MFA tokens. In the case of MFA-targeted
phishing, these attacks often focus on “phishable” authentication methods, such as One-Time
Passcodes (OTP) sent via SMS or mobile apps. While providing more security than no MFA,
these MFA methods can be trivially phished. MFA codes typically remain valid until used,
allowing the attacker to obtain them from the user via their phishing site and replay them to
the authentication provider for unauthorized access.

1.3. Reverse-Proxy MFA Bypass Attacks
While prevention of phishing attacks has improved as more secure MFA factors are increasingly
adopted, a method of phishing has been developed that bypasses most MFA factors altogether[5].
With this adversary-in-the-middle (AitM) approach, traffic between the authentication provider
and the victim is directly proxied through an attack server, significantly reducing the effort
required by an attacker, as spinning up a custom phishing site is no longer necessary. Tools like
evilginx are entirely open-source and offer pre-configured proxy kits for a number of popular
sites such as Facebook, Twitter, Outlook, and Paypal. These tools proxy the entire login flow
that mirrors the user’s expected experience with almost no setup required by the attacker. Upon
a victim accessing the attack server via a phishing URL, the server negotiates an SSL connection
with both the victim and the authentication provider, giving the attacker decrypted access to
all credentials, MFA codes, and cookies shared between the two. The URL is typically the only
perceptible difference for an end user. This difference is generally of limited use, as 38 of 70
surveyed users in our own simulated internal attack reported that they did not check the URL
prior to clicking the link. Phishing resistant authentication methods, such as FIDO2, defend
against reverse-proxy phishing attacks, however these authentication methods are often more
difficult to deploy in practice.[6]
   As an authentication provider, we find these types of attacks to be difficult to detect, as the
attack server proxies all information including user-agent strings and OS telemetry throughout
the login. Attackers can even subvert client-side detection via the ability to inject and overwrite
the Javascript loaded by the client. Because of this, our server-side detection capabilities
via traditional machine learning methods are extremely limited. Often, the only perceptibly
different signal is the source IP address of the login attempt which will belong to the attack
server rather than the user themself. Attempts to classify attack instances are generally limited
Figure 1: An overview of hew reverse-proxy AitM servers bypass Push-based MFA


by a scarcity of labeled data; labeling true positive attacks is exceedingly difficult, as there are a
number of benign reasons for users to authenticate through proxy services or utilize different
networks throughout their standard authentication behavior. Additionally, the sheer volume of
authentications and variety of potential implementations of this style of attack has previously
limited our detection capabilities to unsupervised anomaly detection methods.

1.4. Detection Improvements via DNS Data Integration
With these limitations in mind, we propose the improvement of server-side detection of AitM
phishing attacks with the integration of DNS intelligence relating to observed access device IP
addresses. Available signals for detection are limited to the IP address of the device initiating
access with the authentication provider, the IP address, if relevant, of the authentication device
approving the authentication such as a mobile phone, subsequent IP metadata, as well as
telemetry received from request headers throughout the authentication. We employ a system
in which point-in-time DNS information is used to identify IP addresses suspected of running
these reverse-proxy phishing servers. This approach is based on an understanding of the attack
topology – to employ this attack, a proxy server must be configured with a domain and valid
SSL certificate in order to effectively phish a user. This same server is (in most cases) the server
that initiates the connection with the authentication provider. This understanding allows us to
identify access device IP addresses corresponding to registered domains as a means of more
targeted threat identification. Once these domains were filtered and vetted, we then labeled
authentications deriving from these access IPs as phishing attacks. These labeled instances
can then be used to improve detection capabilities with the introduction of supervised ML
approaches, providing a threat-informed path beyond strictly unsupervised methods.


2. Methodology
2.1. Threat-Informed Data Labeling via DNS Data Integration
Spanning Duo authentication log data from 2023-05-10 to 2022-05-24, we identified 22,280,355
unique access device IP addresses. Our first goal in identifying attack servers was to find IP
addresses associated with potential phishing domains; if a user is currently experiencing an
AitM attack, it’s likely that the phishing URL they accessed is mapped to the IP address seen in
our data, because the attacker’s server will be the one initiating the authentication with Duo
Security. To do this, Farsight Security’s DNS query data was joined against this authentication
data to identify domains that were associated with an authentication access device IP. Roughly
300,000 unique domains were found to map to access device IPs within that window. To narrow
this, we first considered common attributes of phishing URLs such as the existence of phish
“hint” words (login, cash, quick, auth, etc.), common brand names or brand misspellings in the
URL[7], as well as the presence of repeated characters, symbols, or misleading TLDs such as
google[.]com[.]uz. Further work was done to determine the age of the domains and associated
autonomous system number (ASN) in filtering out legitimate domains. Domains older than one
year or owned by trusted organizations such as educational or government agencies were ruled
out. These filtering measures led to the identification of roughly 300 potentially suspicious
domain-IP pairs. We then analyzed the authentications originating from these IPs to determine
the likelihood of malicious activity from these domain-IP pairs. IPs with regular usage by a
consistent set of users were ruled out as either belonging to legitimate proxy services that
are typically used in Zero Trust networks, or personal proxies used to subvert organizational
or government censorship. While not inherently benign, this activity is not likely associated
with phishing and therefore not relevant in our identification of possible AitM servers. Further
filtering was done to remove IPs for which users had extensive history in our data, likely
indicating a home network on which they host a personal domain. With this filtering, we
identified 14 domains matching both suspicious URL markers and authentication behavior
that would suggest a phishing campaign – these domains are listed in the Appendix. These
behaviors include one-off authentications across multiple users and ASNs associated with
known hosting providers that offer free or cheap web hosting. A number of these sites were
still accessible at the time this was written, their alleged purposes ranging from mail/package
reception, moving services, Starbucks ordering, and, most prevalently, tutoring and educational
resources for college students. Of the 14 domain-IP pairs identified, there were over 25 impacted
users, accounting for 77 authentication attempts within the analyzed two week window. 61%
of the discovered phishing authentications impacted users at educational institutions, with
financial services and manufacturing representing the next most common vectors at 14% and
6% respectively.
Figure 2: An overview of targeted data labeling via DNS data integration and subsequent feature
generation and classification


2.2. Feature Engineering
For the users impacted by these suspected AitM servers, we pulled the entirety of their authen-
tication history in 2023. Authentications within the 5/10 - 5/24 window originating from the
phishing IPs were labeled as phish authentications, while the rest were labeled as benign.
The resulting dataset contained 77 phishing authentications and 12,561 benign authentications.
Real-world authentication data is highly imbalanced; for every billion authentications that we
process, confirmed true positive attack reports are in the single digits. For this reason, we
operated within the constraint of this imbalance and chose to not generate synthetic malicious
samples or down-sample the benign class. We generated rolling features based on a targeted
understanding of the likely taxonomy of a phishing attack. Many of the features are a computed
likelihood based on previous authentication data for each user, implemented as the percentage
match of a feature or feature pair over a user’s 90 day successful authentication history. Features
were largely generated at the user level to allow for a generalized classification approach. As
each user behaves differently, even within the same organization, it is necessary to consider
features as they pertain to an individual’s history, rather than make generalizations about
suspected attack behavior and introduce unnecessary bias. These rolling probabilistic features
include (1) the access device browser type, (2) the access device’s country and state of origin as
inferred from MaxMind’s geo-IP dataset, (3) the pairing of the access device IP’s ASN and the
application being accessed, (4) the pairing of the MFA factor used and the application being
accessed, and (5) the pairing of the access device operating system and the application being
accessed. In short, rather than classify on these features themselves, we classify on the user’s
probability that their authentication would have each feature value. We additionally included
boolean features to indicate (6) whether an access device browser’s version had decreased since
the prior successful authentication, and (7) whether the access device carrier (e.g. Comcast,
Amazon, DigitalOcean) associated with the IP has changed since the prior successful authenti-
cation. We additionally incorporated known effective features from prior risk-based assessment
work, including (8) whether the access device ASN and (9) IP address are novel within a user’s
organization, (10) whether the access device IP has been seen by a different organization within
a 24 hour lookback, and (11) the distance between the access device location of the current and
prior authentication as well as (12) the average distance between the access device location of
the current and last ten authentications.

2.3. Classification of Phishing Authentications
We used both XGBoost and LightGBM to classify these malicious authentications, chosen due
to the cardinality and extreme imbalance within our sample dataset. These models were tuned
to optimize the 𝐹1 score. The 𝐹1 score is defined here, using 𝑝 = precision and 𝑟 = recall.
                                                2∗𝑝∗𝑟
                                         𝐹1 =                                                  (1)
                                                 𝑝+𝑟
   We selected 𝐹1 as the performance metric due to the data imbalance – when working with
imbalanced classes, we must optimize for precision (proportion of flagged records that are
correctly identified as malicious) and recall (proportion of truly malicious records that are
correctly flagged). The tuned parameters for each model can be found in Appendix B. Ad-
ditionally, an unsupervised Isolation Forest model was employed as a benchmark, as similar
detection methods are currently used at Duo due to the previously described limitations in
labeled data. The contamination rate for the IF model was set at 0.01 to properly represent the
rate of imbalance in the training dataset.


3. Results
3.1. Feature Correlation with True Positive Phishing Attacks
Generated features were assessed against benign and phishing authentications to better under-
stand the signals that separated the identified true positives. The full set of visualizations for
these features are shown in Appendix A. Of the probabilistic features, the probability of seeing
a given ASN and application pair had one of the highest levels of separation between classes,
with location probabilities also showing discernible differences.
   Among boolean features, we see the most notable difference between classes for the features
involving the distance between authentications, with the majority of phishing authentications
having a > 100 mile distance from the last successful authentication, compared to only 20% of
legitimate authentications.

Table 1
Classification Results
 Model                                                          Recall    Precision   Accuracy
 XGBoost                                                           0.63        0.02         0.64
 LightGBM                                                          0.61        0.05         0.81
 Isolation Forest                                                  0.06        0.08         0.98
Figure 3: Density curves showing distributions of probabilities for phishing vs. benign authentications,
for ASN + application and access device state probabilities respectively


3.2. Supervised vs Unsupervised Classification of Phishing Authentications
Precision, recall, and accuracy are shown below for each approach. It is important to note
preemptively that these metrics were measured purely against the generated user-level time
series data and do not take into account features that would objectively improve both ac-
curacy and precision. These include policy measures that organizations using Duo commonly
employ, including allowlisted networks, remembered devices, and secure FIDO2 factors. In a
real-world application, authentications meeting these criteria would not be flagged as malicious
regardless of the time-series features used here.


4. Discussion
4.1. Label Generation
It is essential to maintain a high-confidence labeled dataset in any classification problem.
This is especially crucial in the realm of cybersecurity, as the risk posed by false negatives
(failing to identify a malicious authentication) can lead not only to financial loss but significant
impacts to victims’ lives. The risk associated with false positives (incorrectly marking benign
authentications as malicious) is also moderate, as our services are used to protect critical
applications including medical software, university portals, and software reliability instructure.
In the authentication space, the large volume of highly cardinal data is further compounded by
the amount of variability within user-level data, making identifying malicious authentications
incredibly difficult. In this case, even with the augmentation of our authentication data with DNS
query information, significant domain knowledge was necessary to uncover high confidence
true positives. When looking for instances of this particular attack, however, our approach
to data augmentation significantly reduced our search space from 800 million authentications
originating from 20 million unique IPs, to 300,000 IPs with corresponding domains, to only
300 IPs with highly suspicious domains. We plan to improve upon this filtering method and
design a real-time system by which this data can be integrated into our detection systems and
allow us to continue to build a set of high confidence true positives. Features found in the
previous section to be highly correlated with phishing attacks can be used to aid in further
threat research as a means of narrowing this search for threat researchers seeking to identify
malicious behavior when DNS information may not be available.

4.2. Limitations of Classification Techniques on Authentication Data
Classification on authentication data is inherently difficult. The vast majority of authentications
in our dataset are not malicious. While our dataset was limited by a small count of confirmed true
positives, the nature of authentication data would likely lead to similar performance numbers
even with the presence of more identified phishing IPs for several reasons. First, the efficacy of
the benign authentication labels is generally unknown. While we can reasonably attest that our
labeled true positives are from malicious attack servers, our confidence that the labeled benign
authentications are truly benign is generally lower. This is likely to lead to a degradation of
precision, as we cannot necessarily ensure that all misclassified “attack” authentications are
truly misclassified. Second, users in general exhibit many legitimate behaviors that may appear
as malicious or anomalous activity. Behavior that is normal for one user may be indicative of
a malicious authentication for a different user. The behavior among individuals using Duo’s
authentication services varies greatly: the average user in our analyzed data utilized 4.5 unique
network carriers, 18 unique IP addresses, and 2.2 distinct operating systems over a two month
period. User behavior also varies seasonally. Many users are university students that exhibit a
dramatic shift in activity at the start and end of the school year, both in terms of the features of
the behavior (different access device types, locations, VPN utilization), and in authentication
volume. With these limitations in mind, the intention of this work is not to propose novel
or perfect ML methods of detection, but rather to describe the application of threat-informed
data filtering in providing a path forward when dealing with the detection of an otherwise
imperceptible attack. That said, the supervised classification methods we were able to use as
a direct result of targeted threat-informed data labeling shows a profound improvement in
detection recall over currently employed unsupervised methods. We intend to use these findings
to develop a fully integrated DNS-aware authentication classification system that can extend
these methods and, with the aid of human label verification, continue to build a set of informed
true positive malicious authentications and improve our automated detection capabilities.


References
[1] CISA, Stop ransomware, 2023. URL: https : / / www . cisa . gov / stopransomware /
    general-information.
[2] D. Desai, R. Hedge, E. Laufer, J. Wang, 2023 phishing report reveals 47.2%
    surge in phishing attacks, 2023. https : / / www . zscaler . com / blogs / security-research /
    2023-phishing-report-reveals-47-2-surge-phishing-attacks-last-year.
[3] Federal bureau of investigation internet crime report, 2022. URL: https://www.ic3.gov/
    Media/PDF/AnnualReport/2022State/StateReport.aspx.
[4] C. Baron, 28% of bec attacks opened by employees, new data shows, 2023. URL: https:
    //abnormalsecurity.com/blog/28-of-bec-attacks-opened-by-employees.
[5] B. Toulas, Mfa adoption pushes phishing actors to reverse-proxy solutions, 2022.
[6] M. Kapko, What is phishing-resistant multifactor authentication? it’s complicated., 2022.
    URL: https://www.cybersecuritydive.com/news/phishing-resistant-mfa/633703/.
[7] A. Moraru, P. R. Donahue, Top 50 most impersonated brands in phishing attacks and new
    tools you can use to protect your employees from them, 2023. https://blog.cloudflare.com/
    50-most-impersonated-brands-protect-phishing/.
A. Feature Separation


Figure 4: Density curves showing distributions of probabilities for phishing vs. benign authentications.
Figure 5: Proportions of boolean values for phishing vs. benign authentications.
B. Domains
  • package-usps[.]us
  • *[.]zetlandcapitals[.]com
  • criteriacorp[.]microsoftonline[.]app-account-127[.]cloud
  • volvo[.]microsoftonline[.]app-account-140[.]cloud
  • cbeyondata[.]microsoftonline[.]app-account-126[.]cloud
  • b9746927-a325-5d2d-7f91-ca0105ac5f52[.]cnnic[.]rip
  • t3[.]freegradely[.]xyz
  • starburkx[.]com
  • gooduugfdhgf[.]click
  • clientedesco004[.]descobrresgate[.]com
  • dvfffpyvl[.]mom
  • lswj35[.]suporteswr[.]com
  • uiuvjfkkge[.]buzz
  • wwwofc[.]getgoingmove[.]com