<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An OSINT-driven architecture for digital asset discovery and risk evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tetiana Babenko</string-name>
          <email>babenko.tetiana.v@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kateryna Kolesnikova</string-name>
          <email>kkolesnikova@iitu.edu.kz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Damelya Yeskendirova</string-name>
          <email>d.yeskendirova@iitu.edu.kz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yelena Bakhtiyarova</string-name>
          <email>y.bakhtiyarova@iitu.edu.kz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Meruyert Dauletbek</string-name>
          <email>m.dauletbek@iitu.edu.kz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleksandr Kruchinin</string-name>
          <email>kruchinin.o.v@nmu.one</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>International Information Technology University</institution>
          ,
          <addr-line>34/1 Manas St., Almaty, 050040</addr-line>
          ,
          <country country="KZ">Kazakhstan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Technical University "DniproPolytechnic"</institution>
          ,
          <addr-line>Dmytra Yavornytskoho Avenue, 19, Dnipro, 49005</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>The expansion of organizational digital footprints creates unprecedented challenges for security teams attempting to maintain comprehensive asset visibility. This paper presents practical implementation results from deploying an OSINT-driven architecture that automates digital asset discovery and risk evaluation through passive intelligence collection combined with machine learning analysis. During twelve months of operational deployment across 3,800 organizations, the system processed 4.8 million OSINT records while achieving 93.3% accuracy in risk classification and detecting 76.8% of confirmed security incidents. The architecture reduced security analyst workload by 58% and decreased per-organization assessment costs from $234 to $19, making continuous monitoring economically viable for resource-constrained organizations. We detail practical challenges encountered during implementation, including handling 11.8% missing values in WHOIS data and managing false positive rates of 21.8% for small organizations. The paper provides actionable guidance for practitioners implementing similar systems, with specific parameter configurations, integration strategies, and maintenance requirements derived from production experience.</p>
      </abstract>
      <kwd-group>
        <kwd>open-source intelligence</kwd>
        <kwd>digital assets</kwd>
        <kwd>cyber risk assessment</kwd>
        <kwd>machine learning</kwd>
        <kwd>GBDT</kwd>
        <kwd>DBSCAN</kwd>
        <kwd>passive reconnaissance</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Modern organizations face an expanding attack surface that traditional security approaches struggle
to comprehensively address. The challenge becomes particularly acute when considering the
interconnected nature of digital ecosystems where vulnerabilities in third-party infrastructure can
cascade through supply chains. Recent incidents underscore this reality, with the February 2024
ransomware attack on Change Healthcare disrupting operations nationwide and affecting
approximately 190 million individuals [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Supply chain compromises targeting cloud service
providers have likewise revealed systemic weaknesses in third-party dependency management [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        The fundamental problem lies in the mismatch between the scale of modern infrastructure and the
capabilities of manual security assessment. Organizations today maintain presence across multiple
cloud providers, operate hybrid environments spanning on-premises and virtualized resources, and
depend on countless third-party services. The economic impact continues to grow, with cybercrime
losses reaching approximately $600 billion annually according to the Center for Strategic and
International Studies [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Recent research indicates cybercrime costs have increased substantially,
from approximately $3 trillion in 2015 to over $6 trillion in 2021, with projections suggesting
continued growth in economic impact [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Traditional reconnaissance methodologies rely heavily on active scanning techniques that
directly probe target systems to enumerate network assets, including IP addresses, domains, and
exposed services [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. These approaches worked reasonably well when infrastructures were simpler
and more static. However, the dynamic nature of modern deployments renders periodic scanning
insufficient. While tools such as Shodan and Nmap remain widely deployed for asset discovery, they
present inherent limitations in scalability and stealth operation. Active reconnaissance generates
network traffic that can trigger intrusion detection systems, potentially alerting adversaries to
defensive activities [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        The dichotomy between active and passive reconnaissance techniques presents distinct
operational trade-offs. Active methods provide direct system interaction and real-time enumeration
capabilities but increase detection risk and may violate operational security requirements [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Organizations attempting comprehensive security assessments through traditional approaches find
themselves caught between thoroughness and operational stealth. Active scanning might reveal
current vulnerabilities but also exposes defensive activities to potential adversaries monitoring
network traffic. This creates a paradox where the very act of searching for vulnerabilities might
compromise operational security.
      </p>
      <p>Passive reconnaissance, particularly through Open Source Intelligence methodologies, offers an
alternative approach that avoids these detection risks. OSINT techniques leverage publicly available
information to construct detailed profiles of organizational infrastructure without generating any
network traffic directed at target systems. DNS records, certificate transparency logs, and WHOIS
registrations provide valuable intelligence about an organization's digital footprint. However, the
sheer volume of available data and the complexity of correlating information across multiple sources
create analytical challenges that exceed human processing capabilities.</p>
      <p>The temporal dynamics of modern infrastructure further complicate security assessment.
Cloud-native architectures enable rapid deployment and modification of services, with
infrastructure-as-code practices allowing entire environments to be created or destroyed within minutes. Traditional
periodic assessments fail to capture these rapid changes, leaving organizations blind to
vulnerabilities introduced between scanning cycles. The shift from static, on-premises infrastructure
to dynamic, distributed systems fundamentally changes the requirements for effective security
monitoring.</p>
      <p>Machine learning techniques offer promising solutions to these analytical challenges. The
heterogeneous nature of OSINT data, combining structured records with semi-structured logs and
unstructured text, aligns well with algorithms designed to handle mixed data types. Gradient Boosted
Decision Trees, in particular, demonstrate strong performance when processing the diverse features
extracted from OSINT sources. Similarly, unsupervised learning approaches can identify unusual
patterns and potential security anomalies without requiring extensive labeled training data.</p>
      <p>This work addresses these limitations through an integrated framework combining passive
OSINT collection with machine learning analysis. The approach builds upon established theoretical
foundations while introducing practical innovations for operational deployment. Our contribution
focuses on the implementation details and real-world performance characteristics that determine
system viability. By automating the collection, correlation, and analysis of OSINT data, the
framework enables continuous security monitoring at scales previously achievable only through
manual effort. The system demonstrates that comprehensive asset discovery and risk assessment can
transition from labor-intensive periodic activities to continuous, automated capabilities accessible to
organizations of varying sizes and security maturity levels.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>The integration of Open Source Intelligence methodologies with automated security assessment
frameworks represents an evolving field where traditional reconnaissance techniques intersect with
modern machine learning approaches. Contemporary research in this domain reveals significant
progress, though substantial challenges persist regarding scalability, accuracy, and operational
deployment.</p>
      <p>
        Recent investigations into the cybersecurity incident handling landscape have established
foundational principles for systematic response mechanisms. Cichonski and colleagues developed
comprehensive guidelines for computer security incident handling that remain influential in shaping
organizational approaches to threat detection and response [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Their work emphasizes the critical
importance of preparation and detection phases, where asset visibility directly impacts response
effectiveness. This framework particularly resonates when considering how OSINT methodologies
can enhance the detection phase through continuous passive monitoring rather than reactive
investigation after incidents occur.
      </p>
      <p>
        The emergence of cloud computing fundamentally altered security assessment requirements.
Kholidy and Baiardi proposed CIDS, a framework specifically designed for intrusion detection in
cloud systems, addressing unique challenges posed by virtualized and distributed infrastructures [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
Their research reveals how traditional security approaches fail to adequately cover dynamic cloud
environments where resources scale elastically and boundaries between organizational assets
become increasingly blurred. The framework demonstrates detection accuracies around 85% in
controlled cloud environments, though operational deployments often encounter additional
complexities related to multi-tenancy and shared responsibility models.
      </p>
      <p>
        Machine learning applications for security data analysis have progressed significantly,
particularly in handling heterogeneous data sources. Ester and collaborators introduced DBSCAN, a
density-based algorithm for discovering clusters in large spatial databases with noise [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. While
originally developed for spatial data analysis, DBSCAN's ability to identify outliers without requiring
predetermined cluster numbers proves exceptionally valuable for security anomaly detection. The
algorithm excels at identifying unusual patterns in OSINT data, where normal behavior varies
significantly across different organizational contexts and threat actors constantly evolve their
techniques to evade detection.
      </p>
      <p>
        A comprehensive systematic review by Browne and colleagues examined research utilizing
artificial intelligence for OSINT applications, revealing both achievements and persistent gaps in
current approaches [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Their analysis of over 200 papers identified recurring themes where
researchers successfully apply machine learning to specific OSINT challenges yet struggle to create
integrated systems that address the complete intelligence cycle. The review particularly highlights
how most existing work focuses on single data sources or specific threat types rather than
comprehensive frameworks capable of processing diverse OSINT streams simultaneously.
      </p>
      <p>[Table: Data sources | Accuracy range | Primary limitation | Operational readiness]</p>
      <p>
        The standardization of security terminology and concepts has evolved through various efforts,
with Shirey's Internet Security Glossary providing crucial definitional foundations that enable
consistent communication across research and operational communities [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. This standardization
becomes particularly important when correlating OSINT data from diverse sources, where
inconsistent terminology can create analytical gaps or duplicate efforts. The glossary's
comprehensive coverage of authentication, authorization, and threat concepts provides essential
vocabulary for describing complex relationships discovered through OSINT analysis.
      </p>
      <p>
        Internet-wide scanning capabilities have dramatically improved through innovations in scanning
technology. Durumeric and colleagues demonstrated with ZMap that complete IPv4 address space
scanning could be accomplished in under 45 minutes from a single machine [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. This capability
fundamentally changed assumptions about reconnaissance feasibility, showing that adversaries
could maintain near real-time awareness of internet-exposed assets. Their work reveals how
traditional defensive strategies assuming reconnaissance requires substantial time investment no
longer hold true. The research particularly influences how organizations must now assume their
external attack surface remains under constant observation by both legitimate researchers and
malicious actors.
      </p>
      <p>
        Pastor-Galindo and team conducted extensive analysis of OSINT opportunities, identifying what
they term the "not yet exploited goldmine" of intelligence possibilities [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Their comprehensive
survey reveals that while OSINT data sources have proliferated exponentially, reaching billions of
accessible records across hundreds of platforms, the actual utilization remains surprisingly limited.
Organizations typically leverage only 15-20% of available OSINT data relevant to their security
posture. The researchers attribute this gap primarily to technical barriers in data correlation and the
absence of automated processing frameworks capable of handling heterogeneous information
streams at scale.
      </p>
      <p>
        Szymoniak and collaborators performed a thorough review of OSINT opportunities and
challenges, documenting persistent issues with data quality and consistency [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Their analysis
shows approximately 15-20% of collected OSINT records contain incomplete or outdated
information, creating cascading effects throughout analytical pipelines. The temporal dynamics of
modern infrastructure exacerbate these challenges, with cloud resources appearing and disappearing
within hours. Privacy considerations further complicate collection efforts, particularly as regulations
increasingly restrict access to registration information that previously provided valuable
reconnaissance data.
      </p>
      <p>
        The human factors and ethical dimensions of OSINT deployment continue generating important
discussions. Riebe and colleagues investigated privacy concerns and acceptance factors through
representative surveys, uncovering fundamental tensions between operational necessity and privacy
expectations [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Security professionals consistently express a need for comprehensive
reconnaissance capabilities while acknowledging potential privacy violations. Kaufhold and team
expanded this analysis through a value-sensitive design perspective, revealing how ethical
considerations increasingly influence organizational deployment strategies [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Many organizations
now implement self-imposed collection restrictions despite potential security benefits, attempting to
balance comprehensive monitoring with privacy preservation.
      </p>
      <p>
        Systematic literature reviews by Evangelista and collaborators demonstrate the application
challenges when combining OSINT with artificial intelligence [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Their work identifies recurring
implementation barriers where theoretical advances fail to translate into operational capabilities. The
research particularly emphasizes how manual effort remains necessary for data validation and
correlation, with organizations reporting 40-60 analyst hours required for a comprehensive
single-entity assessment. Afrifa and colleagues explored ensemble machine learning techniques for security
applications, achieving improved detection accuracy through combined approaches [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Their work
suggests that integrating multiple algorithms can address limitations inherent in single-method
approaches, particularly when processing diverse OSINT data types.
      </p>
      <p>
        Chen and Guestrin's development of XGBoost established theoretical foundations for gradient
boosted decision trees in high-dimensional security contexts [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Their scalable tree boosting system
demonstrates exceptional performance when handling mixed data types characteristic of OSINT
sources. The algorithm's ability to process both numerical metrics and categorical attributes makes it
particularly suitable for security risk assessment where features span network statistics,
configuration parameters, and behavioral indicators. Operational deployments achieve classification
accuracies exceeding 90% in controlled environments, though real-world performance varies based
on data quality and feature engineering sophistication.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. System architecture</title>
      <p>The practical implementation of our OSINT-driven framework required careful architectural
decisions that balance theoretical optimality with operational constraints. Our deployment
experience across 3,800 organizations revealed that seemingly minor design choices could
dramatically impact system performance when processing millions of records daily. The architecture
evolved through several iterations as we encountered unexpected bottlenecks and discovered
optimization opportunities that theoretical analysis had not predicted.</p>
      <p>The fundamental architecture comprises four interconnected stages that operate asynchronously
yet maintain synchronized state through careful coordination mechanisms. Data flows from
collection through processing and machine learning analysis before generating actionable risk
assessments. Each stage operates independently, enabling horizontal scaling based on workload
characteristics while maintaining data consistency through eventual consistency patterns. The
pipeline design draws from established distributed system principles while addressing
OSINT-specific challenges that generic architectures fail to handle adequately.</p>
      <p>The collection stage orchestrates parallel ingestion from multiple OSINT sources, each presenting
unique challenges regarding rate limiting, data formats, and temporal dynamics. DNS collectors
query both authoritative nameservers and passive DNS providers, implementing adaptive throttling
to respect provider constraints while maximizing collection throughput. Certificate transparency
monitors maintain persistent connections to log servers, consuming the continuous stream of newly
issued certificates. WHOIS collectors face particular complexity due to format inconsistencies across
Regional Internet Registries, requiring specialized parsers for each RIR's response format. The rate
adaptation follows an exponential backoff strategy with jitter to prevent synchronized retry storms:
<disp-formula><label>(1)</label><tex-math><![CDATA[t_{\mathrm{wait}} = \min\left(t_{\mathrm{base}} \cdot 2^{\mathrm{attempts}} + \mathrm{random}(-\mathrm{jitter}, \mathrm{jitter}),\; t_{\max}\right)]]></tex-math></disp-formula>
where t_base = 1 second is the initial wait time, attempts counts consecutive rate-limit
responses, jitter bounds the random offset added to each wait, and t_max caps the wait. This approach
maintains collection efficiency while respecting provider constraints.</p>
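      <p>The backoff rule in Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the deployed collector code: the function name, the default jitter of 0.5 s, the cap of 300 s, and the clamp at zero are assumptions introduced here; only t_base = 1 second comes from the text.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


def backoff_wait(attempts, t_base=1.0, t_max=300.0, jitter=0.5, rng=random):
    """Seconds to wait after `attempts` consecutive rate-limit responses.

    t_base = 1.0 follows the text; t_max and jitter are assumed values.
    """
    # Exponential growth plus a symmetric random offset, as in Eq. (1).
    raw = t_base * (2 ** attempts) + rng.uniform(-jitter, jitter)
    # Cap at t_max; the lower clamp at zero is an added safeguard.
    return max(0.0, min(raw, t_max))


if __name__ == "__main__":
    for attempts in range(6):
        print(attempts, round(backoff_wait(attempts), 2))
```
]]></preformat>
      <p>Because each collector draws its own random offset, collectors that hit a rate limit at the same moment are unlikely to retry at the same moment.</p>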
      <p>
        Apache Kafka serves as the central nervous system for data movement, managing flow control
between stages while providing durability guarantees essential for compliance requirements [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
We selected Kafka over alternatives after extensive benchmarking revealed superior performance
under our specific workload patterns. The system maintains separate topics for each data source type,
enabling selective consumption based on processing capacity. This design choice proved critical
when certain data sources experienced volume spikes that would otherwise overwhelm downstream
processors.
      </p>
      <p>Processing heterogeneous OSINT data revealed operational challenges that required creative
solutions beyond standard data engineering practices. Missing values in WHOIS records emerged as
a persistent issue, with approximately 11.8% of records lacking critical fields. Rather than discarding
incomplete records, we implemented sophisticated imputation strategies leveraging correlated
attributes. Geographic information missing from WHOIS entries could often be reconstructed from
Autonomous System registration data or inferred from DNS naming patterns. The imputation
accuracy varies based on attribute uniqueness:</p>
      <p><disp-formula><label>(2)</label><tex-math><![CDATA[A_{\mathrm{imputation}} = \frac{\sum_{i=1}^{n} I(\mathrm{imputed}_i = \mathrm{actual}_i)}{n} \cdot (1 - U_{\mathrm{attr}})]]></tex-math></disp-formula>
where I is an indicator function for correct imputation, n is the number of imputed values, and U_attr measures the uniqueness
coefficient of the attribute being imputed. Highly unique attributes like specific contact emails
achieve lower imputation accuracy compared to standardized fields like country codes.</p>
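      <p>A small worked example may help in reading Eq. (2). The sketch below assumes the raw hit rate is multiplied by (1 − U_attr), with U_attr in [0, 1]; the function name and the sample values are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def imputation_accuracy(imputed, actual, uniqueness):
    """Uniqueness-adjusted imputation accuracy, following Eq. (2).

    `uniqueness` is the coefficient U_attr, assumed to lie in [0, 1].
    """
    if len(imputed) != len(actual) or not actual:
        raise ValueError("imputed and actual must be non-empty and equally long")
    if not 0.0 <= uniqueness <= 1.0:
        raise ValueError("uniqueness must lie in [0, 1]")
    # Indicator sum: count positions where the imputed value is correct.
    hits = sum(1 for i, a in zip(imputed, actual) if i == a)
    return (hits / len(actual)) * (1.0 - uniqueness)


if __name__ == "__main__":
    # Hypothetical country-code field: three of four values imputed correctly.
    print(imputation_accuracy(["KZ", "UA", "US", "DE"],
                              ["KZ", "UA", "US", "FR"], 0.2))
```
]]></preformat>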
      <p>DNS record noise presented another significant challenge, with ephemeral records and
load-balanced configurations creating unstable resolution patterns. Multi-stage filtering mechanisms
prioritize authoritative data sources while applying statistical smoothing to identify and remove
transient entries. The filtering pipeline first identifies authoritative nameservers through SOA record
analysis, then applies temporal stability checks to eliminate records that appear briefly before
disappearing. This approach reduced false positives from dynamic DNS entries by approximately 34%
while maintaining coverage of legitimate infrastructure changes.</p>
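      <p>The temporal stability check described above can be illustrated as follows. This sketch covers only the second filtering stage and omits SOA-based authoritative nameserver identification; the function name, the input format, and both thresholds (at least three sightings spanning at least 24 hours) are assumptions, not the values used in production.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def stable_records(observations, min_span_hours=24.0, min_sightings=3):
    """Return record keys that persist long enough to be considered stable.

    `observations` is an iterable of (record_key, hours_since_start) pairs.
    Both thresholds are illustrative assumptions.
    """
    seen = defaultdict(list)
    for key, hours in observations:
        seen[key].append(hours)

    stable = set()
    for key, times in seen.items():
        span = max(times) - min(times)
        # Drop records that appear briefly and then disappear.
        if len(times) >= min_sightings and span >= min_span_hours:
            stable.add(key)
    return stable


if __name__ == "__main__":
    obs = [("a.example.com", 0), ("a.example.com", 12), ("a.example.com", 30),
           ("tmp.example.com", 5), ("tmp.example.com", 6)]
    print(stable_records(obs))
```
]]></preformat>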
      <p>
        Redis provides high-performance caching essential for maintaining system responsiveness
despite the computational overhead of ML inference [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. The cache invalidation strategy employs
time-based decay with adaptive refresh based on access patterns. Frequently accessed assets receive
priority refresh while stale entries age out naturally. This hybrid approach reduced database load by
73% while maintaining cache coherency for rapidly changing data. The cache time-to-live calculation
incorporates both historical validity periods and observed change frequencies:
      </p>
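      <p><disp-formula><label>(3)</label><tex-math><![CDATA[\mathrm{TTL}(k) = \mathrm{TTL}_{\mathrm{base}} \cdot e^{-\alpha F(k)} \cdot \left(1 + \beta\,\sigma(k)\right)]]></tex-math></disp-formula>
where F(k) represents access frequency for key k, σ(k) measures prediction uncertainty, and α, β are
tuning parameters empirically determined through production monitoring.</p>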
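      <p>Eq. (3) translates directly into code. The sketch below is illustrative only: the text does not report values for TTL_base, α, or β, so the defaults here (3600 s, 0.1, 0.5) are placeholders, and the function name is hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def cache_ttl(access_freq, uncertainty, ttl_base=3600.0, alpha=0.1, beta=0.5):
    """Time-to-live in seconds for a cache key, following Eq. (3).

    ttl_base, alpha, and beta are placeholder values, not the tuned ones.
    """
    # Higher access frequency shortens the TTL, so hot keys refresh sooner;
    # the uncertainty term scales the result by (1 + beta * sigma).
    return ttl_base * math.exp(-alpha * access_freq) * (1.0 + beta * uncertainty)


if __name__ == "__main__":
    for freq in (0, 1, 10):
        print(freq, round(cache_ttl(freq, uncertainty=0.2), 1))
```
]]></preformat>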
      <p>
        Apache Spark handles both batch and stream processing workloads, implementing a lambda
architecture that combines historical analysis with real-time updates [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. The batch layer processes
complete datasets for model training and comprehensive risk assessments, while the speed layer
handles incremental updates for time-sensitive intelligence. This dual-path approach ensures both
analytical completeness and operational responsiveness. Feature extraction leverages Spark's
distributed computing capabilities, processing millions of records in parallel while maintaining data
locality for efficiency. Figure 1 illustrates the detailed data flow through the architecture, showing
how different OSINT sources integrate through the processing pipeline.
      </p>
      <p>Data redundancy across multiple sources complicated deduplication efforts since simple
hash-based approaches failed due to minor variations in representation. We employed locality-sensitive
hashing with Jaccard similarity measures to identify near-duplicate records. The similarity threshold
adapts based on source reliability scores:</p>
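      <p><disp-formula><label>(4)</label><tex-math><![CDATA[T_{\mathrm{similarity}} = T_{\mathrm{base}} - \alpha \cdot \min\left(R_{\mathrm{source}_1}, R_{\mathrm{source}_2}\right)]]></tex-math></disp-formula>
where T_base = 0.85 is the baseline threshold, R_source is a source reliability score derived from
historical accuracy assessments, and α is a scaling coefficient.</p>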
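      <p>The comparison step behind Eq. (4) can be sketched as below. For brevity the example computes exact Jaccard similarity over character shingles; at scale, a locality-sensitive hashing index would first narrow the candidate pairs, and that indexing step is not shown. The helper names, the shingle length of 3, and α = 0.1 are assumptions; only T_base = 0.85 comes from the text.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def shingles(text, k=3):
    """Set of character k-grams of a normalized string (k is an assumed value)."""
    text = text.lower().strip()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def jaccard(a, b):
    """Exact Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_threshold(r1, r2, t_base=0.85, alpha=0.1):
    """Adaptive threshold from Eq. (4); alpha = 0.1 is an assumed coefficient."""
    return t_base - alpha * min(r1, r2)


def is_near_duplicate(rec_a, rec_b, r_a, r_b):
    """True when two records meet the threshold for their source reliabilities."""
    score = jaccard(shingles(rec_a), shingles(rec_b))
    return score >= similarity_threshold(r_a, r_b)


if __name__ == "__main__":
    print(is_near_duplicate("Example Corp, Almaty", "example corp, almaty ", 0.9, 0.8))
```
]]></preformat>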
      <p>
        Kubernetes orchestration enables elastic scaling based on workload demands, automatically
provisioning additional processing nodes during peak collection periods [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. The container-based
deployment strategy simplifies updates and ensures consistency across distributed components. Pod
autoscaling policies consider both CPU utilization and queue depth metrics, preventing resource
starvation while maintaining cost efficiency. During typical operations, the system runs on 8-16
nodes, expanding to 24 nodes during surge periods.
      </p>
      <p>The architecture's modularity proved essential for maintaining system stability while
implementing improvements. Components can be updated independently without disrupting overall
operations, enabling continuous enhancement based on operational feedback. Performance
monitoring through distributed tracing revealed unexpected bottlenecks in serialization between
stages, leading to optimization of data formats that improved throughput by 28%. These practical
discoveries highlight how theoretical designs require substantial refinement when confronted with
real-world data characteristics and operational constraints.</p>
      <p>Integration challenges emerged when combining outputs from multiple ML models operating on
different feature sets. Rather than simple score aggregation, we developed a hierarchical fusion
approach that weights model outputs based on their historical accuracy for specific asset types. This
nuanced integration improved overall risk assessment accuracy by 12% compared to naive averaging
approaches. The fusion mechanism also provides uncertainty quantification, essential for security
teams making critical decisions based on automated assessments.</p>
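      <p>As a rough illustration of accuracy-weighted fusion, the sketch below combines per-model risk scores using each model's historical accuracy for the asset type at hand and reports the weighted spread of the scores as a simple disagreement measure. It is a flat, single-level simplification of the hierarchical approach described above; the model names, accuracy figures, and the choice of spread as an uncertainty proxy are all hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def fuse_scores(scores, accuracy_by_type, asset_type):
    """Fuse model risk scores with accuracy-derived weights.

    scores: {model_name: risk score in [0, 1]}
    accuracy_by_type: {model_name: {asset_type: historical accuracy}}
    Returns (fused_score, weighted_spread).
    """
    weights = {m: accuracy_by_type[m][asset_type] for m in scores}
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one model needs a positive weight")
    fused = sum(weights[m] * s for m, s in scores.items()) / total
    # Weighted standard deviation of the scores around the fused value,
    # used here as a simple proxy for model disagreement.
    variance = sum(weights[m] * (s - fused) ** 2 for m, s in scores.items()) / total
    return fused, variance ** 0.5


if __name__ == "__main__":
    scores = {"gbdt": 0.8, "anomaly": 0.4}             # hypothetical outputs
    accuracy = {"gbdt": {"domain": 0.9}, "anomaly": {"domain": 0.6}}
    print(fuse_scores(scores, accuracy, "domain"))
```
]]></preformat>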
    </sec>
    <sec id="sec-4">
      <title>4. Experimental setup</title>
      <p>The validation of our OSINT-driven framework required experimental conditions that reflect
real-world operational complexities while maintaining scientific reproducibility. Constructing datasets of
sufficient scale and diversity proved more challenging than initially anticipated, particularly when
attempting to capture the dynamic nature of modern digital infrastructure. Our experimental design
evolved through multiple iterations as we discovered that laboratory-style controlled experiments
failed to reveal critical performance characteristics that emerged only under production-scale
workloads.</p>
      <p>Dataset construction began with establishing collection parameters that would yield
representative samples across different organizational profiles and infrastructure types. The primary
dataset comprised 4.8 million OSINT records collected over twelve months, though we also
conducted experiments with smaller subsets of 1.2 million records to evaluate scalability
characteristics. This range enabled assessment of how system performance degrades or improves
with varying data volumes. The heterogeneous nature of sources meant that simple random sampling
would bias toward DNS records, which update more frequently than other data types. We therefore
implemented stratified sampling to ensure proportional representation across all OSINT categories.</p>
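      <p>One way to read the stratified sampling step is proportional allocation with a guaranteed minimum of one record per OSINT category, sketched below. The function name, the record layout, and the per-stratum minimum are assumptions made for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction from every stratum, keeping at least one record each.

    `key` maps a record to its stratum (for example, its OSINT source type).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)

    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


if __name__ == "__main__":
    # Hypothetical mix of 80 DNS, 15 certificate, and 5 WHOIS records.
    data = ([{"source": "dns", "id": i} for i in range(80)]
            + [{"source": "ct", "id": i} for i in range(15)]
            + [{"source": "whois", "id": i} for i in range(5)])
    picked = stratified_sample(data, key=lambda r: r["source"], fraction=0.2)
    print(len(picked))
```
]]></preformat>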
      <p>The temporal dimension of our experiment spanned twelve months from January to December
2024, capturing both short-term fluctuations and long-term infrastructure evolution patterns. This
duration revealed interesting seasonal variations, with infrastructure changes accelerating during
typical maintenance windows and slowing during holiday periods. Short-term dynamics included
hourly DNS updates and daily certificate renewals, while long-term patterns encompassed gradual
migrations to cloud services and periodic infrastructure refreshes. The extended observation period
proved essential for validating our temporal decay models, which require sufficient time to observe
vulnerability lifecycle patterns from disclosure through exploitation to remediation.</p>
      <p>Geographic and sectoral diversity emerged naturally from our collection methodology, though we
consciously included organizations from underrepresented regions to avoid bias toward
wellconnected infrastructure. The dataset spanned 47 countries with organizations distributed across
technology (24%), financial services (19%), healthcare (14%), government (8%), and other sectors
(35%). This distribution reflects the relative digital maturity and OSINT visibility of different
industries, with technology companies naturally generating more observable infrastructure changes
than traditional industries.</p>
      <p>Performance evaluation employed multiple metrics to capture different aspects of system
effectiveness. Accuracy alone proves misleading in imbalanced datasets where simply predicting the
majority class yields high scores. We therefore adopted a comprehensive metric suite that reveals
nuanced performance characteristics. The F1-score balances precision and recall through harmonic
mean calculation, proving particularly valuable for security applications where both false positives
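and false negatives carry significant costs [25]. The weighted F-score (F_β) generalizes this balance and can emphasize recall
to reflect security domain preferences:
<disp-formula><label>(5)</label><tex-math><![CDATA[F_{\beta} = \frac{(1 + \beta^{2}) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^{2} \cdot \mathrm{Precision} + \mathrm{Recall}}]]></tex-math></disp-formula></p>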
      <p>Setting β = 2 prioritizes minimizing false negatives, addressing the asymmetric costs of missing
critical vulnerabilities versus generating excess alerts.</p>
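      <p>To make the effect of β in Equation (5) concrete, the following minimal sketch computes the score for two classifiers with mirrored precision and recall. The function name and the numeric values are illustrative assumptions, not results from our experiments.</p>

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Weighted harmonic mean of precision and recall (Equation 5)."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0  # both precision and recall are zero
    return (1 + beta ** 2) * precision * recall / denom


# Hypothetical values chosen only for illustration
high_recall = f_beta(precision=0.80, recall=0.95)
high_precision = f_beta(precision=0.95, recall=0.80)
print(round(high_recall, 4), round(high_precision, 4))
```

      <p>With β = 2 the recall-heavy classifier scores higher even though both share the same F1, which is the asymmetry the metric is meant to express.</p>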
      <p>False positive and false negative rates received particular attention given their operational impact.
Security teams quickly lose confidence in systems generating excessive false alarms, while missed
vulnerabilities can lead to successful breaches. Our evaluation tracked these rates across different
organizational profiles and infrastructure types, revealing interesting patterns where false positive
rates increased for smaller organizations with more heterogeneous infrastructures. This granular
analysis enabled targeted improvements to classification algorithms for specific deployment
contexts.</p>
      <p>Workload reduction metrics quantified the practical benefits of automation beyond raw
performance numbers. We measured analyst hours required for comprehensive assessment both
with and without automated support, tracking time spent on data collection, initial classification,
validation, and report generation. The comparison revealed not just time savings but also qualitative
improvements in analysis depth, as analysts freed from routine tasks could pursue more
sophisticated investigations. Figure 2 illustrates the breakdown of time allocation before and after
framework deployment.</p>
      <p>Baseline comparisons required careful selection of representative tools that security teams
actually use rather than academic prototypes. Nmap represents the gold standard for active network
scanning, offering comprehensive service detection and vulnerability assessment capabilities [26].
However, its active nature generates detectable traffic that may alert adversaries to reconnaissance
activities. Shodan provides passive intelligence through internet-wide scanning results, though its
coverage depends on scanning frequency and may miss ephemeral or geographically restricted assets
[27]. Table 4 summarizes the comparative characteristics of baseline tools versus our framework.</p>
      <p>Cross-validation strategy addressed the challenge of natural clustering in OSINT data where
similar organizations might appear in both training and validation sets. Standard k-fold validation
produced overly optimistic performance estimates with gaps exceeding 12% between validation and
test performance. We developed modified stratified sampling that first clusters organizations based
on infrastructure characteristics, then constructs folds maintaining cluster representation. This
approach reduced train-test performance gaps to approximately 3%, providing realistic
generalization estimates.</p>
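      <p>The fold construction step can be sketched as follows. This is a simplified illustration rather than our exact procedure: it assumes cluster labels have already been produced by some clustering of infrastructure characteristics, and the function name, labels, and fold count are hypothetical.</p>

```python
import random
from collections import defaultdict


def cluster_stratified_folds(cluster_labels, k=5, seed=0):
    """Split sample indices into k folds so that each cluster is spread
    as evenly as possible across the folds."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        by_cluster[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0  # rotate the starting fold so small clusters do not all land in fold 0
    for label in sorted(by_cluster):
        members = by_cluster[label]
        rng.shuffle(members)
        for i, idx in enumerate(members):
            folds[(offset + i) % k].append(idx)
        offset += len(members)
    return folds


# Hypothetical cluster assignments for 12 organizations
labels = ["cloud"] * 6 + ["legacy"] * 4 + ["hybrid"] * 2
folds = cluster_stratified_folds(labels, k=2)
for fold in folds:
    print(sorted(labels[i] for i in fold))
```

      <p>Each fold then serves once as the validation set while the remaining folds form the training set. If the goal were instead to keep whole clusters out of training, a group-wise split would be the alternative design.</p>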
      <p>The experimental environment replicated production conditions as closely as possible while
maintaining controlled variables for reproducibility. Processing occurred on a Kubernetes cluster
with 16 nodes, each equipped with 32GB RAM and 8 CPU cores. This configuration enabled
evaluation of scalability characteristics through controlled node addition and removal. Network
bandwidth limitations simulated real-world collection constraints, particularly for rate-limited
OSINT sources. Storage utilized distributed systems to handle the data volume, with approximately
2.8TB required for the complete dataset including intermediate processing artifacts.</p>
      <p>Temporal validation proved particularly important given the dynamic nature of security
intelligence. We implemented sliding window validation where models trained on historical data
predicted future time periods, mimicking operational deployment where past observations inform
future risk assessments. This approach revealed performance degradation patterns, with accuracy
declining approximately 1.3% per month without retraining. The observation confirmed the necessity
of continuous model updates to maintain effectiveness against evolving threat landscapes.</p>
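      <p>A sliding window split of this kind can be sketched as below. The window sizes are assumptions made for illustration (six training months, one test month); the helper name is hypothetical.</p>

```python
def sliding_windows(n_periods, train_size, test_size=1, step=1):
    """Yield (train_periods, test_periods) index lists in chronological order.
    Training periods always precede the test periods."""
    last_start = n_periods - train_size - test_size
    for start in range(0, last_start + 1, step):
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test


# Twelve monthly periods, matching the January to December 2024 observation span
months = ["2024-%02d" % m for m in range(1, 13)]
for train, test in sliding_windows(len(months), train_size=6):
    print(months[train[0]], "to", months[train[-1]], "predicts", months[test[0]])
```

      <p>Because every test period lies strictly after its training window, no future information leaks into the model, which is what makes the measured degradation meaningful.</p>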
      <p>Statistical significance testing employed non-parametric methods suitable for non-normal
distributions characteristic of security metrics. The Wilcoxon signed-rank test compared paired
observations across different experimental conditions, while the Mann-Whitney U test evaluated
differences between independent groups [28]. These tests confirmed that performance improvements
exceeded random variation with p-values below 0.001 for key metrics. Bootstrap resampling provided
confidence intervals for performance estimates, revealing that true accuracy likely falls between
92.1% and 94.5% with 95% confidence.</p>
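      <p>The bootstrap step can be illustrated with a percentile interval over per-sample outcomes. The sketch below uses synthetic data (933 correct out of a hypothetical 1,000 samples) and assumed resampling settings, so its interval is not expected to reproduce the reported one.</p>

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `outcomes`
    (here: 1 = correct classification, 0 = incorrect)."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the mean of each resample
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[int((1.0 - alpha) * n_resamples) - 1]
    return lower, upper


# Synthetic outcomes for illustration only
outcomes = [1] * 933 + [0] * 67
low, high = bootstrap_ci(outcomes)
print(f"95% CI for accuracy: {low:.3f} to {high:.3f}")
```

      <p>The percentile method makes no normality assumption, which is consistent with the non-parametric tests used for significance.</p>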
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation and case studies</title>
      <p>Small organizations presented unique evaluation challenges that revealed both framework strengths
and limitations worth examining closely. These entities, typically with fewer than 100 employees and
limited IT resources, historically struggled to maintain comprehensive security visibility. Manual
OSINT analysis proved economically infeasible, while commercial scanning services often
overwhelmed small security teams with unfiltered results. Our framework's automated classification
theoretically addressed these constraints through intelligent prioritization and risk-based filtering.</p>
      <p>Deployment across 312 small organizations revealed interesting patterns in system performance.
The framework successfully identified previously unknown external assets in 89% of cases, with
organizations discovering an average of 14 exposed services they had not been monitoring. This
visibility improvement proved particularly valuable for detecting shadow IT deployments where
employees had provisioned cloud services outside official channels. One manufacturing company
discovered 23 unauthorized SaaS integrations that bypassed security controls, enabling remediation
before adversaries could exploit these exposures.</p>
      <p>However, small organizations also experienced elevated false positive rates reaching
approximately 21.8% compared to 12.7% for enterprise deployments. Investigation revealed that
heterogeneous infrastructure common in resource-constrained environments confused classification
algorithms trained primarily on standardized enterprise patterns. Small organizations often combine
legacy systems, consumer-grade services, and modern cloud platforms in unconventional
configurations that triggered anomaly detection despite representing legitimate operations. Table 5
details the performance variations observed across different organizational profiles.</p>
      <p>Despite higher false positive rates, small organizations unanimously reported improved security
posture through enhanced visibility. The ability to continuously monitor external attack surfaces
transformed their defensive capabilities from reactive to proactive. One healthcare clinic detected
certificate misconfiguration that exposed patient scheduling systems, addressing the vulnerability
before regulatory audits would have identified the compliance violation. These real-world impacts
demonstrate that imperfect automation still provides substantial value compared to
resource-constrained manual efforts.</p>
      <p>Critical infrastructure organizations provided particularly compelling evaluation scenarios given
the heightened consequences of security failures in essential services. We deployed the framework
across 47 critical infrastructure entities including power utilities, water treatment facilities, and
transportation systems. These organizations face unique challenges balancing operational
technology with information technology while maintaining resilience against increasingly
sophisticated threats [30].</p>
      <p>The framework's anomaly detection capabilities proved especially valuable for critical
infrastructure protection. One notable case involved a regional power utility where DBSCAN
clustering identified unusual certificate issuance patterns preceding an attempted intrusion. The
attackers had obtained legitimate certificates for typosquatted domains resembling the utility's
infrastructure, preparing for a sophisticated phishing campaign targeting operational technology
engineers. Traditional monitoring focused on known indicators would have missed this preparatory
activity since the certificates themselves appeared legitimate.</p>
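      <p>As a complementary illustration, and not the DBSCAN-based detection described above, lookalike domains can be screened with a simple edit-distance check against a protected name. The domains, threshold, and function names below are hypothetical.</p>

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def lookalikes(protected: str, observed: list[str], max_distance: int = 2) -> list[str]:
    """Return observed domains that are close to, but not equal to, `protected`."""
    return [d for d in observed
            if d != protected and max_distance >= edit_distance(d, protected)]


# Hypothetical names, e.g. domains seen in newly logged certificates
seen = ["power-utility.example", "power-uti1ity.example",
        "power-utilty.example", "unrelated.example"]
print(lookalikes("power-utility.example", seen))
```

      <p>A check like this only surfaces candidates. Deciding whether a burst of such certificates is anomalous still requires the kind of density-based analysis used in the case above.</p>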
      <p>The detection occurred 17 days before the planned attack launch, providing sufficient time for
comprehensive defensive measures. The utility implemented additional authentication requirements,
conducted targeted security awareness training, and coordinated with law enforcement to track the
threat actors. Post-incident analysis confirmed that early detection through certificate transparency
monitoring prevented potential operational disruption that could have affected 180,000 customers.
This case exemplifies how passive OSINT collection combined with intelligent analysis enables
preemptive defense rather than reactive incident response.</p>
      <p>Another critical infrastructure case demonstrated the framework's value for supply chain risk
assessment. A water treatment facility discovered that third-party vendors had exposed
configuration files containing network diagrams and system credentials on misconfigured cloud
storage. The framework identified these exposures through continuous monitoring of infrastructure
changes associated with vendor domains. Without automated OSINT analysis, these exposures
would likely have remained undetected until exploitation, as vendors rarely communicate security
incidents that do not directly impact service delivery.</p>
      <p>Comparative analysis of operational metrics before and after framework deployment revealed
fundamental shifts in how security teams allocate their time. Traditional approaches consumed
approximately 50% of analyst hours on routine data collection tasks, with another 30% devoted to
initial classification and prioritization. Investigation and strategic analysis received only 20% of
available time, limiting the depth and sophistication of security assessments. This distribution
reflected the manual burden of processing disparate OSINT sources rather than optimal resource
allocation.</p>
      <p>Framework deployment transformed this time allocation dramatically. Routine collection
dropped to less than 10% of analyst effort, primarily involving validation of automated results and
handling edge cases requiring human judgment. Initial classification consumed approximately 20% of
time, focusing on ambiguous cases where automated assessment confidence scores indicated
uncertainty. The remaining 70% of analyst hours shifted toward investigation, threat hunting, and
strategic security improvements. Figure 4 illustrates this operational transformation through
comparative time allocation analysis.</p>
      <p>This reallocation of human expertise toward higher-value activities improved not just efficiency
but also security effectiveness. Analysts reported discovering more sophisticated threats that
automated systems flagged for investigation but could not fully characterize without human insight.
One financial services company identified a complex multi-stage attack preparing to exploit their
partners' infrastructure as a stepping stone to their own systems. The automated framework detected
the initial reconnaissance activity, but human analysts connected disparate indicators to reveal the
complete attack chain.</p>
      <p>The evaluation also revealed interesting patterns in how different sectors adapted to automated
OSINT analysis. Technology companies integrated most seamlessly, with existing security
orchestration platforms consuming framework outputs through APIs. Financial services required
extensive customization to meet regulatory requirements for audit trails and evidence preservation.
Healthcare organizations needed additional privacy controls to prevent inadvertent exposure of
patient information during OSINT collection. Government entities mandated air-gapped
deployments with no external dependencies, necessitating architectural modifications for offline
operation.</p>
      <p>Cost analysis confirmed the economic transformation enabled by automation, though benefits
varied based on organizational characteristics. Large enterprises achieved absolute cost reductions
exceeding $200,000 annually by redirecting security budgets from manual assessment to strategic
initiatives. Small organizations realized percentage savings exceeding 90% that made continuous
monitoring feasible for the first time. Critical infrastructure entities emphasized that cost savings,
while significant, paled compared to risk reduction benefits from enhanced visibility and early threat
detection.</p>
      <p>The framework's impact extended beyond quantitative metrics to qualitative improvements in
security program maturity. Organizations reported increased confidence in their security posture
from comprehensive asset visibility. Security teams experienced reduced stress from automation of
routine tasks, enabling focus on engaging analytical challenges. Management gained better risk
visibility through continuous assessment rather than periodic snapshots. These intangible benefits,
while difficult to quantify, proved equally important for sustained adoption and organizational
commitment to the framework.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and future work</title>
      <p>The practical deployment of our OSINT-driven framework across 3,800 organizations demonstrates
that comprehensive digital asset discovery and risk assessment can transition from labor-intensive
manual processes to efficient automated capabilities. Processing 4.8 million OSINT records over
twelve months, the system achieved 93.3% classification accuracy while reducing analyst workload
by 58% and lowering per-organization assessment costs from $234 to $19. The theoretical foundations
of our risk assessment methodology draw from established work on distributed system
vulnerabilities and quality assessment frameworks [31], [32]. The identification of key risks in
distributed information systems provided critical insights for our feature selection process,
particularly regarding infrastructure criticality assessment [31]. Frameworks for analyzing cascading
failures in distributed architectures directly influenced our approach to modeling vulnerability
interactions. Furthermore, integrated quality assessment models demonstrated how automated
evaluation systems can maintain consistency while adapting to organizational contexts [32]. These
theoretical contributions enabled our framework to achieve the observed classification accuracy by
incorporating domain-specific knowledge into the machine learning pipeline.</p>
      <p>The framework's impact extends beyond raw performance metrics to enable qualitative
transformations in security operations. Small organizations with limited resources gained
continuous monitoring capabilities previously accessible only to well-funded enterprises.
Resource-constrained entities discovered an average of 14 unknown exposed services, with some identifying
over 20 shadow IT deployments that bypassed security controls. For large enterprises managing
complex distributed infrastructures, the system provided unified visibility across hybrid cloud
environments where traditional tools struggled with fragmentation. Critical infrastructure operators
particularly benefited from early warning capabilities, with one power utility preventing a
sophisticated attack through anomaly detection in certificate issuance patterns 17 days before
planned execution.</p>
      <p>
        The success of our density-based clustering approach for anomaly detection builds upon
foundational work in pattern recognition for security applications [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. The effectiveness of learning
vector quantization models for DDoS attack identification, achieving real-time classification of
network traffic patterns, influenced our decision to employ DBSCAN for behavioral anomaly
detection. While LVQ models focused specifically on network traffic analysis, we extended this
methodology to encompass broader OSINT data types, achieving 76.8% detection rates for confirmed
security incidents across diverse infrastructure environments.
      </p>
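      <p>For readers unfamiliar with DBSCAN, the sketch below shows how noise points, which the algorithm leaves unassigned, can serve as anomaly candidates. It is a minimal reference implementation, not our production pipeline; the two features (for example, certificates issued per day and new subdomains per day) and the parameter values are assumptions.</p>

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN. Returns one label per point: 0, 1, ... for clusters, -1 for noise."""
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself)
        return [j for j, q in enumerate(points) if eps >= math.dist(points[i], q)]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if min_pts > len(seeds):
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:   # core point: expand the cluster
                queue.extend(nbrs)
    return labels


# Hypothetical daily observations: five routine days and one unusual day
days = [(2, 1), (3, 1), (2, 2), (3, 2), (2.5, 1.5), (14, 9)]
labels = dbscan(days, eps=1.5, min_pts=3)
anomalies = [p for p, label in zip(days, labels) if label == NOISE]
print(labels, anomalies)
```

      <p>In practice eps and min_pts need tuning per data source, since feature scales differ widely across OSINT categories.</p>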
      <p>Future development directions focus on operational integration and intelligence enrichment to
further enhance the framework's practical utility. Integration with Security Information and Event
Management platforms represents an immediate priority [33]. Current deployment requires
standalone operation with manual correlation to existing security tools. Native SIEM integration
would enable automated response workflows where OSINT-derived risk assessments trigger
defensive actions without human intervention. This evolution from detection to automated response
could dramatically compress the time between threat identification and mitigation.</p>
      <p>The application of advanced neural network architectures offers promising directions for
enhancing threat detection capabilities. Recent work on hybrid approaches combining Graph Neural
Networks with LSTM architectures demonstrates potential for sophisticated attack vector
reconstruction [34]. These methodologies for analyzing complex attack patterns through temporal
graph structures could extend our DBSCAN clustering approach to capture more nuanced behavioral
sequences. The hybrid architecture achieves superior performance in reconstructing multi-stage
attack vectors, suggesting similar techniques could enhance our framework's ability to predict attack
progression from early reconnaissance indicators.</p>
      <p>Natural language processing integration could substantially enrich threat intelligence by
incorporating unstructured data sources currently beyond the framework's scope. Predictive
modeling approaches for environmental processes provide methodological guidance for processing
complex, noisy data streams [35]. Neural network implementations achieved significant accuracy
improvements through careful feature engineering and model selection. Similarly, work on
mathematical modeling tools for decision support in medicine illustrates how ML approaches handle
uncertain and incomplete data characteristic of real-world applications [36]. These insights suggest
that adapting similar techniques to security domain text analysis could enable early warning of
campaigns targeting specific sectors.</p>
      <p>The framework's applicability extends to specialized sectors facing unique security challenges.
Recent analysis of cryptojacking threats to digital agriculture systems highlights how sector-specific
threat models require tailored detection approaches [37], [38]. Work on endpoint system vulnerabilities
in agricultural environments demonstrates that our OSINT framework could be adapted for critical
infrastructure sectors with specialized requirements. The sustainable digital agriculture context
particularly resonates with our findings from critical infrastructure deployments, where operational
technology convergence creates novel attack surfaces requiring continuous monitoring.</p>
      <p>The challenge of model degradation over time necessitates continuous learning approaches that
adapt to evolving threat landscapes without full retraining. Current performance degradation of
approximately 1.3% monthly requires quarterly model updates consuming significant computational
resources. Online learning algorithms that incrementally adjust to new patterns while maintaining
stability could reduce this maintenance burden. However, the adversarial nature of security domains
complicates continuous learning, as attackers deliberately craft inputs to poison model updates.
Developing robust online learning mechanisms resistant to adversarial manipulation remains an
open research challenge with significant practical implications.</p>
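      <p>The maintenance trade-off can be made concrete with a simple projection. The sketch assumes the reported decline is linear and expressed in percentage points, which is a simplifying assumption; the function name is hypothetical.</p>

```python
def projected_accuracy(initial: float, monthly_drop: float, months: int) -> float:
    """Accuracy (in percent) after `months` without retraining, assuming a
    linear decline of `monthly_drop` percentage points per month."""
    return max(0.0, initial - monthly_drop * months)


# 93.3 and 1.3 are the figures reported in the text; linearity is an assumption
for months in (1, 3, 6):
    print(months, round(projected_accuracy(93.3, 1.3, months), 1))
```

      <p>Under these assumptions a quarterly schedule lets accuracy drift to roughly 89.4% before each update, which indicates how much an incremental learner would need to recover between retraining cycles.</p>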
      <p>This work represents a meaningful step toward scalable, cost-efficient, and proactive cyber risk
assessment that addresses contemporary security challenges. The framework, whose complete
technical methodology and validation are detailed in our comprehensive study [39], bridges the gap
between theoretical advances in machine learning and practical requirements of security operations,
demonstrating that sophisticated analytical capabilities need not remain exclusive to well-resourced
organizations.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This research has been funded by the Science Committee of the Ministry of Science and Higher
Education of the Republic of Kazakhstan (Grant No. АР26104787 “Development of solutions for the
protection of IoT-infrastructure of Smart cities of Kazakhstan based on AI”; grant funding by the
Ministry of Science and Higher Education of the Republic of Kazakhstan for research and technical
projects for 2025-2027).</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
      <p>[25] C. Manning, P. Raghavan, H. Schütze, "Introduction to Information Retrieval," Cambridge University Press, 2008.</p>
      <p>[26] G. Lyon, "Nmap Network Scanning: The Official Nmap Project Guide," Insecure Publications, 2009.</p>
      <p>[27] J. Matherly, "Complete Guide to Shodan," Shodan LLC, 2016.</p>
      <p>[28] M. Hollander, D. Wolfe, E. Chicken, "Nonparametric Statistical Methods," 3rd Edition, Wiley, 2013.</p>
      <p>[29] D. Powers, "Evaluation: From Precision, Recall and F-Score to ROC, Informedness, Markedness &amp; Correlation," Journal of Machine Learning Technologies, vol. 2, no. 1, pp. 37-63, 2011.</p>
      <p>[30] R. Langner, "Stuxnet: Dissecting a Cyberwarfare Weapon," IEEE Security &amp; Privacy, vol. 9, no. 3, pp. 49-51, 2011.</p>
      <p>[31] D. Palko, H. Hnatienko, T. Babenko, A. Bigdan, "Determining key risks for modern distributed information systems," CEUR Workshop Proceedings, vol. 3018, pp. 81-100, 2021.</p>
      <p>[32] T. Babenko, H. Hnatienko, V. Vialkova, "Modeling of the integrated quality assessment system of the information security management system," CEUR Workshop Proceedings, 2021.</p>
      <p>[33] D. Miller, S. Harris, A. Harper, S. VanDyke, C. Blask, "Security Information and Event Management (SIEM) Implementation," McGraw-Hill, 2010.</p>
      <p>[34] Y. Vitulyova, T. Babenko, K. Kolesnikova, N. Kiktev, O. Abramkina, "A Hybrid Approach Using Graph Neural Networks and LSTM for Attack Vector Reconstruction," Computers, 2025.</p>
      <p>[35] H. Olekh, K. Kolesnikova, T. Olekh, O. Mezentseva, "Environmental impact assessment procedure as the implementation of the value approach in environmental projects," CEUR Workshop Proceedings, vol. 2851, pp. 206-216, 2021.</p>
      <p>[36] A. Myrzakerimova, K. Kolesnikova, M. Nurmaganbetova, "Use of Mathematical Modeling Tools to Support Decision-Making in Medicine," Procedia Computer Science, vol. 231, pp. 335-340, 2024.</p>
      <p>[37] T. Babenko, K. Kolesnikova, M. Panchenko, O. Abramkina, N. Kiktev, Y. Meish, P. Mazurchuk, "Risk Assessment of Cryptojacking Attacks on Endpoint Systems: Threats to Sustainable Digital Agriculture," Sustainability, 2025.</p>
      <p>[38] T. Babenko, K. Kolesnikova, R. Lisnevskyi, S. Makilenov, Y. Landovsky, "Definition of Cryptojacking Indicators," CEUR Workshop Proceedings, vol. 3680, 2024.</p>
      <p>[39] T. Babenko, K. Kolesnikova, O. Abramkina, Y. Vitulyova, "Automated OSINT Techniques for Digital Asset Discovery and Cyber Risk Assessment," Computers, vol. 14, no. 10, 430, 2025. https://doi.org/10.3390/computers14100430.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Kok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. Z.</given-names>
            <surname>Jhanjhi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Supramaniam</surname>
          </string-name>
          ,
          <article-title>"Ransomware, threat and detection techniques: A review,"</article-title>
          <source>International Journal of Computer Science and Network Security</source>
          , vol.
          <volume>19</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>136</fpage>
          -
          <lpage>146</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Luo</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>H.</given-names>
            <surname>Guo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Kato</surname>
          </string-name>
          ,
          <article-title>"TSP security in intelligent and connected vehicles: Challenges and solutions," IEEE Wireless Communications</article-title>
          , vol.
          <volume>26</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>125</fpage>
          -
          <lpage>131</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>[3] Center for Strategic and International Studies</article-title>
          . (
          <year>2024</year>
          ).
          <article-title>The economic impact of cybercrime</article-title>
          . Available: https://www.csis.org
          <article-title>/analysis/economic-impact-cybercrime.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tanveer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <article-title>"Cybercrime investigation challenges in the era of cloud forensics and anonymous communication,"</article-title>
          <source>Digital Investigation</source>
          , vol.
          <volume>31</volume>
          , pp.
          <fpage>28</fpage>
          -
          <lpage>39</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Heartfield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Loukas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <article-title>"You are probably not the weakest link: Towards practical prediction of susceptibility to semantic social engineering attacks,"</article-title>
          <source>IEEE Access</source>
          , vol.
          <volume>4</volume>
          , pp.
          <fpage>6910</fpage>
          -
          <lpage>6928</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bozhenyuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Belavkin</surname>
          </string-name>
          ,
          <article-title>"Teaching and Learning IoT Cybersecurity and Vulnerability Assessment with Shodan through Practical Use Cases,"</article-title>
          <source>Sensors</source>
          , vol.
          <volume>20</volume>
          , no.
          <issue>11</issue>
          , p.
          <fpage>3048</fpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Irani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Webb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <article-title>"Large-scale automated classification of phishing pages," in Proceedings of the 18th Annual Network and Distributed System Security Symposium</article-title>
          (NDSS),
          <source>The Internet Society</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cichonski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Millar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Grance</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Scarfone</surname>
          </string-name>
          ,
          <article-title>"Computer security incident handling guide,"</article-title>
          <source>NIST Special Publication 800-61, Revision</source>
          <volume>2</volume>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Kholidy</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Baiardi</surname>
          </string-name>
          ,
          <article-title>"CIDS: A framework for intrusion detection in cloud systems,"</article-title>
          <source>Ninth International Conference on Information Technology</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Kriegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>"A density-based algorithm for discovering clusters in large spatial databases with noise,"</article-title>
          <source>KDD-96</source>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Browne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abedin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J. M.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <article-title>"A systematic review on research utilising artificial intelligence for open source intelligence (OSINT) applications,"</article-title>
          <source>International Journal of Information Security</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Shirey</surname>
          </string-name>
          ,
          <article-title>"Internet Security Glossary, Version 2,"</article-title>
          <source>RFC 4949</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Durumeric</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wustrow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Halderman</surname>
          </string-name>
          ,
          <article-title>"ZMap: Fast Internet-wide scanning and its security applications,"</article-title>
          <source>USENIX Security Symposium</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pastor-Galindo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nespoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gómez Mármol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Martínez Pérez</surname>
          </string-name>
          ,
          <article-title>"The Not Yet Exploited Goldmine of OSINT,"</article-title>
          <source>IEEE Access</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Szymoniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kedziora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hutchison</surname>
          </string-name>
          ,
          <article-title>"Open Source Intelligence Opportunities and Challenges: A Review,"</article-title>
          <source>Computers &amp; Security</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Riebe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bäumler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Kaufhold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Reuter</surname>
          </string-name>
          ,
          <article-title>"Privacy Concerns and Acceptance Factors of OSINT for Cybersecurity,"</article-title>
          <source>Proceedings on Privacy Enhancing Technologies</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Kaufhold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Riebe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bäumler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Reuter</surname>
          </string-name>
          ,
          <article-title>"Values and Value Conflicts in the Context of OSINT Technologies,"</article-title>
          <source>Computer Supported Cooperative Work</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Evangelista</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sassi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Napolitano</surname>
          </string-name>
          ,
          <article-title>"Systematic literature review to investigate the application of open source intelligence (OSINT) with artificial intelligence,"</article-title>
          <source>Journal of Applied Security Research</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Afrifa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Varadarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Appiahene</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Domfeh</surname>
          </string-name>
          ,
          <article-title>"Ensemble machine learning techniques for accurate and efficient detection of botnet attacks,"</article-title>
          <source>Eng</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <article-title>"XGBoost: A Scalable Tree Boosting System,"</article-title>
          <source>ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kreps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Narkhede</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <article-title>"Kafka: A Distributed Messaging System for Log Processing,"</article-title>
          <source>Proceedings of the NetDB Workshop</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>B.</given-names>
            <surname>Carlson</surname>
          </string-name>
          ,
          <article-title>"Redis in Action,"</article-title>
          <source>Manning Publications</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>T.</given-names>
            <surname>Babenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Toliupa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kovalova</surname>
          </string-name>
          ,
          <article-title>"LVQ models of DDOS attacks identification,"</article-title>
          <source>14th International Conference on Advanced Trends in Radioelectronics, Telecommunications and Computer Engineering (TCSET 2018)</source>
          , IEEE,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>B.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Grant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Oppenheimer</surname>
          </string-name>
          ,
          <article-title>"Borg, Omega, and Kubernetes: Lessons learned from container-management systems,"</article-title>
          <source>Communications of the ACM</source>
          , vol.
          <volume>59</volume>
          , no.
          <issue>5</issue>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>