<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Remote Host Operation System Type Detection Based on Machine Learning Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leonid Kupershtein</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tatiana Martyniuk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olesia Voitovych</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Artur Borusevych</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vinnytsia National Technical University</institution>
          ,
          <addr-line>Khmelnytske shose str., 95, Vinnytsia, 21021</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>65</fpage>
      <lpage>81</lpage>
      <abstract>
        <p>This article presents the results of research on using machine learning to detect the operating system of a remote host. Existing methods and tools for remote host operating system detection are analyzed, and their main advantages and disadvantages are identified. Machine learning methods are modeled. The software architecture is designed and an experimental application is developed. The application uses a trained machine learning model that detects the type and version of the operating system with high accuracy.</p>
      </abstract>
      <kwd-group>
        <kwd>Operating system detection</kwd>
        <kwd>machine learning</kwd>
        <kwd>computer networks</kwd>
        <kwd>network protocol</kwd>
        <kwd>scanning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The number of devices in computer networks grows every day. These devices
include routers, printers, IP phones, smart things, personal computers, laptops, smartphones, etc.</p>
      <p>
        However, not all network devices run the latest version of the operating system (OS) with
current security updates. Possible reasons include a lack of funding or of the hardware
required for the new version of the operating system to work properly; unwillingness of
users to master a new interface or new capabilities; and lack of support in the
new version of the operating system for the software in use. The need for constant OS updating is driven by the ever-increasing number
of identified vulnerabilities [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Exploiting these vulnerabilities could lead to breaches of the
confidentiality, integrity, and availability of data and other software, such as web services [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. OS
vulnerabilities make unauthorized access to database-oriented applications possible, which in
turn requires additional protection [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Network administrators must be ready for possible attacks. This
requires constant network monitoring to detect unauthorized devices and devices running an
old and/or vulnerable version of the operating system.
      </p>
      <p>
        In the initial stages of an authorized attack, penetration testing specialists and ethical hackers need to gather as much information about the
target as possible. This information is needed to form the most
effective attack vector and to identify potential vulnerabilities in the protection system of the
infrastructure under study [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Knowledge of the family, type, and version of the operating system installed on
network hosts helps here, because each OS is associated with certain vulnerabilities in its
software [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Currently, a significant number of software tools exist for operating system detection.
They determine the OS family only with some probability, let alone the type
and version of the OS.</p>
      <p>Therefore, it is very important to research and develop methods and tools that will determine
detailed information about the remote host operating system with high reliability, which will increase
the efficiency of identifying vulnerabilities and, consequently, increase the level of cyber security in
general.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methods and tools for the operating system detection</title>
      <p>There are two main methods for detecting a remote host operating system: active and passive.</p>
      <p>
        The active methods are based on sending specially crafted service packets to the target machine [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
The responses are then analyzed to form a conclusion about the operating system of the target node.
      </p>
      <p>The advantages of this method are:
• speed – since packets are sent directly to the target node, a response arrives quickly, without
waiting for the necessary packets to appear in the network;
• simplicity – usually the received responses only need to be compared with a signature
database, without analyzing parameters or their combinations;
• flexibility – because the packets are crafted manually, their contents can be adjusted
and new packets added as needed.</p>
      <p>However, there are also disadvantages:
• visibility – the probe packets are sent over the network, so they can be detected and
countermeasures applied;
• dependence on the signature database – if a record is missing from the database, the result is a wrong detection or no
answer at all;
• need for a response from the node – if no response is received from the target node,
the operating system cannot be detected.</p>
      <p>
        The passive methods of operating system detection are based on listening to network traffic,
collecting the transmitted packets, and then analyzing their contents to form a conclusion about the
remote host OS [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>The advantages of this method are:
• invisibility – the listening device only listens, so it produces no network activity,
or so little that it can be perceived as normal traffic during the business day;
• no need for a response from the target node – traffic from the target node to other
devices can be analyzed.</p>
      <p>The disadvantages of this method are:
• speed – it is necessary to wait for certain packets to appear before a
conclusion can be formed, which takes a long time when there is no network activity;
• implementation complexity – since self-generated packets cannot be sent, only
the information from intercepted packets can be used.</p>
      <p>In conclusion, the choice of approach depends on whether the scan must go
unnoticed or the result must be obtained quickly.</p>
      <p>Existing tools for remote node operating system detection are also considered. Currently, only a small
number of software products detect the operating system of a remote node. Often
this task is not their main function, but only one of the menu options.</p>
      <p>The most common tools that use the active detection method are Nmap, NetScanTools Pro and
Xprobe.</p>
      <p>
        Nmap is free, open-source software designed for network scanning and security auditing. The
program also detects the available nodes in the network, active network services, types
of firewalls, etc. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Nmap uses an active method to detect the operating system. It builds a
"fingerprint" by sending TCP, UDP and ICMP packets to known potentially open and closed ports
and analyzing the responses. The result is a conclusion that indicates the
operating system type of the node and the reliability of this conclusion. If no signature matches
completely, a score is computed in which each parameter has a corresponding weight in points. The
tool cannot detect the operating system passively; it is designed
specifically for active analysis [
        <xref ref-type="bibr" rid="ref10 ref7">7, 10</xref>
        ].
      </p>
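      <p>The weighted scoring used when no signature matches completely can be illustrated with a minimal sketch. It is not Nmap's actual algorithm or database format: the signatures, parameter names and weights below are hypothetical and serve only to show how per-parameter weights turn a partial match into a ranked guess with a reliability value.</p>
      <preformat>
```python
# Hypothetical signature database: expected parameter values per OS.
SIGNATURES = {
    "Windows 10": {"ttl": 128, "df": 1, "window_size": 64240},
    "Linux 5.4": {"ttl": 64, "df": 1, "window_size": 64240},
    "Windows XP": {"ttl": 128, "df": 1, "window_size": 65535},
}

# Hypothetical weights (points) for each parameter.
WEIGHTS = {"ttl": 5, "df": 1, "window_size": 4}


def score(observed, signature):
    """Sum the weights of the parameters whose observed value matches."""
    return sum(
        weight
        for name, weight in WEIGHTS.items()
        if observed.get(name) == signature.get(name)
    )


def best_match(observed):
    """Return (os_name, reliability); reliability = matched / total points."""
    total = sum(WEIGHTS.values())
    name, sig = max(SIGNATURES.items(), key=lambda item: score(observed, item[1]))
    return name, score(observed, sig) / total


# A response that matches no signature completely.
observed = {"ttl": 128, "df": 0, "window_size": 65535}
guess, reliability = best_match(observed)
print(guess, reliability)
```
      </preformat>
      <p>In this sketch the reliability is simply the share of weight points that matched, so a complete match yields 1.0.</p>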
      <p>
        NetScanTools Pro uses responses to ICMP packets to detect the operating system. This is the main
disadvantage of the tool – using only ICMP packets does not allow obtaining a reliable answer [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Xprobe is a software tool that relies on fuzzy signature matching, probabilistic assumptions,
and simultaneous analysis of multiple matches in the signature database [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Software</th><th>Method</th><th>Protocols</th><th>Last update</th><th>OS family</th></tr>
          </thead>
          <tbody>
            <tr><td>Nmap</td><td>Active</td><td>TCP, UDP, ICMP</td><td>23-04-2021</td><td>FreeBSD, iOS, Mac OSX, OpenSolaris, Linux, Windows</td></tr>
            <tr><td>p0f</td><td>Passive</td><td>TCP, HTTP</td><td>18-04-2016</td><td>FreeBSD, iOS, Mac OSX, OpenSolaris, Linux, Windows</td></tr>
            <tr><td>NetScanTools Pro</td><td>Active</td><td>ICMP</td><td>02-09-2020</td><td>FreeBSD, Mac OSX, Linux, Windows</td></tr>
            <tr><td>Xprobe</td><td>Active</td><td>ICMP, TCP</td><td>27-07-2005</td><td>FreeBSD, iOS, Mac OSX, OpenSolaris, Linux, Windows</td></tr>
            <tr><td>Satori</td><td>Passive</td><td>DHCP, TCP, HTTP, SMB</td><td>04-05-2021</td><td>FreeBSD, iOS, Mac OSX, OpenSolaris, Linux, Windows</td></tr>
            <tr><td>PRADS</td><td>Passive</td><td>TCP, UDP, DHCP, ICMP</td><td>23-09-2020</td><td>FreeBSD, iOS, Mac OSX, OpenSolaris, Linux, Windows</td></tr>
            <tr><td>Ettercap</td><td>Passive</td><td>TCP</td><td>19-09-2020 (app), 16-02-2010 (DB signature)</td><td>FreeBSD, iOS, Mac OSX, OpenSolaris, Linux, Windows</td></tr>
            <tr><td>NetworkMiner</td><td>Passive</td><td>TCP, HTTP</td><td>06-01-2021</td><td></td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        Wireshark can also be used for passive detection [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. In this case, certain fields, such as TTL and User-Agent, have to be analyzed
manually. For example, if the TTL value is 128 and the User-Agent
parameter contains the value "Windows NT 10.0", one can conclude that the device runs the Windows
10 operating system [
        <xref ref-type="bibr" rid="ref19 ref9">9, 19</xref>
        ]. However, this requires a database
that maps packet content values to operating system types [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>
        There are also tools that support both active and passive operating system detection,
for example SinFP [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and queso [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. However, these tools are no longer
supported and their download pages are unavailable.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Related works</title>
      <p>
        In [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] only mobile operating systems were analyzed:
Android v2.3, Android v4.4, iOS 5, iOS 8, Symbian 3 and Win Phone 7.5.
The dataset comprises 489 GB of data gathered over several months. The traffic used to
identify the OS was captured while watching videos on YouTube, downloading files, and making video calls
on Skype, as well as combined traffic. Combined traffic was captured while all the described actions were
performed on OSs that support multitasking. The results show a detection accuracy of around
70% when analyzing traffic for 30 seconds, around 90% when analyzing traffic for 5
minutes, and 100% when using combined traffic for a 30-second period.
      </p>
      <sec id="sec-3-29">
        <p>
          In research [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] only TCP SYN packets were captured for analysis. How the
packets were gathered is not described. The developed system searches for a signature in a signature database. The experiments gave
the following results: 86.3% of cases were detected by exact match, and a further 9.2% were detected
correctly using minimal distance match. The type I error is 4.5%. The developed system
classifies the OS into one of three classes: Windows 7 or 8, Windows 7 or Vista, Linux. It cannot
detect the exact OS version, and the classes are very general, because they unite two or more OSs in one class.
        </p>
        <p>
          SVM was used as the machine learning method in [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] to detect the OS. The Nmap signature database
served as the dataset. The training set consisted of 1503 samples and the test set of
1023 samples. Most of the samples were classified as Other system, Windows, or Linux. The developed system
classifies the detected OS as one of the following: Windows, Linux, FreeBSD, OpenBSD, Mac
OS, Sun Solaris, Cisco, Other system. The average accuracy is 86.63%. The error rate depends on
the detected OS: Windows – 3.91%; Linux – 5.19%; FreeBSD –
17.71%; OpenBSD – 15.85%; Mac – 25.8%; Solaris – 4.53%; Cisco – 24.22%; Other system –
9.74%.
        </p>
        <p>
          In research [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] only the headers of ICMP packets were analyzed. No system was developed; OS
detection was conducted manually by analyzing the TTL value and the increment of the identification field. Only five
OSs were used in the experiments: Windows 7, Windows 8.1, Windows 10, Linux 18.x, and Debian
7.x. In [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] two algorithms similar to those in our research were used: Decision Table and J48. Decision Table
used data from the ICMP protocol (checksum, checksum_status, ext.checksum, ident, length); J48 used
data from the IP, UDP and DNS protocols (IP: checksum_status, dsfield, dsfield.dscp, dsfield.ecn, flags,
flags.df, flags.mf, flags.rb, frag_offset, hdr_len, len, proto, ttl, version). For both Decision Table and J48
the dataset size is around 79,000 packets. The classified operating systems are Linux (Raspberry,
Xubuntu), Mac OS (10.7, 10.11) and Windows (7, 8, 10). Decision Table has an accuracy of 0.994 and J48
has an accuracy of 0.94.
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] the Decision Tree/C4.5 algorithm was used. The algorithm used data from the TCP protocol (window
size, ttl, don’t fragment bit, packet size, options order, window size of fin packet, ttl of fin packet,
don’t fragment of fin packet, packet size of fin packet). The dataset size is around 30,000 packets. The OSs that
were classified are: Windows Vista SP0-2, Windows 7 SP1, Windows 2000 SP2,4, Windows XP
SP1+, Linux. The algorithm has an accuracy of 0.9086.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Problem statement</title>
      <p>
        An analysis of the methods and tools for operating system type detection shows that all of
the approaches are based on a signature database. This imposes a limitation: if the signature of
an OS is missing from the database, the credibility of the result is low. A huge database covering
all possible OS types could be formed, but searching it for the
corresponding signature may take considerable time. An alternative solution for detecting the remote host OS type is to use
machine learning methods [
        <xref ref-type="bibr" rid="ref23 ref24 ref25 ref26 ref27 ref28 ref29">23-29</xref>
        ]. Many researchers have presented results of applying these methods.
They also rely on a signature database, but for the purpose of training a model.
Consequently, the method of obtaining signatures and their preliminary analysis and processing may
significantly affect the credibility of OS type detection.
      </p>
      <p>When the goal is real software development, the signature database should be formed manually.
This is necessary to understand the principles of signature formation, so that the tool can later be
used for actual tasks rather than remaining at the stage of a developed
model. In addition, this approach allows the system to scale, namely to gradually increase the number of detected
OSs. To make the OS type detection software more convenient, both active and passive
modes should be implemented.</p>
      <p>In addition, it is important to detect not only the type of OS but also its version, which can
significantly influence the user’s decision-making. Since OS type detection is a
classification task, the criterion for a good model is maximum accuracy and
precision. Another important metric for evaluating the tool is the response time. Most of
this time is spent sending probe packets and receiving responses, that is, on network delay.
The number of attempts can also strongly affect the time needed to receive a response. The
complexity of the machine learning model can significantly affect system efficiency as well, so once
sufficient accuracy is reached the model should be as simple as possible. The number of parameters extracted from the probe
packets and their pre-processing also add to the delay. Therefore, it is
advisable to select only the relevant protocol header fields.</p>
      <p>Thus, the main goals and criteria for developing OS detection
software based on machine learning methods can be determined:
• precision maximization;
• response time minimization;
• independent signature formation;
• scalability;
• cross-platform support;
• universality of use.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Data gathering and preprocessing</title>
      <p>
        Solving the object classification problem, namely detecting the OS type, involves
supervised learning algorithms [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. The problem can also be
posed as clustering of objects using unsupervised learning. In either case, a dataset
is needed to build a machine learning model. In addition, when using supervised learning, this dataset
must be labeled, i.e., each operating system class is associated with a specific label.
      </p>
      <p>
        A ready-made dataset could be used to solve these tasks [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], but it contains only families of
operating systems, whereas the user may be interested in the type and version of the operating
system. Therefore, the authors decided to form a dataset independently.
      </p>
      <p>
        Detection of the family / type of the remote node OS is based on analyzing the
headers of network protocols at the application, transport and network layers of the OSI model. Despite the
network protocol standards (RFC, IEEE), some header fields, such as the IP TTL, DF and ToS fields,
can differ significantly between operating systems, even within one family [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ].
      </p>
      <p>
        The experiment for forming a training dataset consists of generating certain traffic,
capturing it, and analyzing it. The analysis consists of parsing packets, reducing them to a single data format, and
selecting the relevant parameters (features). The dataset was created under laboratory
conditions. The following OS versions were studied: Linux (version 5.4.0), Mac OS (versions 10.12.4
and 11.4), Windows 10 (Corporate 20h2, Home 20h2), Windows 7 (Professional), Windows XP
(Professional SP3). Windows XP and Linux 5.4 were run as virtual machines
in VirtualBox [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], and the others were installed on real PCs.
      </p>
      <p>The experiment steps of dataset creation are shown in Figure 1.</p>
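      <p>[Figure 1: flowchart of the dataset creation steps, from traffic generation (1) to feature selection (5).]</p>
      <p>Step 1. Traffic generation. The generated traffic included viewing different web pages from the studied OS and downloading images from the web to the studied OS.</p>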
      <p>
        Step 2. Traffic capture. Packets were captured from the network with Wireshark, the most popular
traffic analyzer, which can record a capture session to a file for further processing
without a network connection [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. This software is distributed under the GNU GPL
license. Versions of Wireshark exist for various operating systems: Linux, Windows, MacOS,
FreeBSD, Solaris. The equally popular console utilities tshark and tcpdump can also be used [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ].
Traffic files are saved in pcap format. As a result, the following numbers of packets were collected: MacOS 11.4
– 67949, Windows XP Professional SP3 – 21294, Windows 10 Home 20h2 – 19291, Windows 7
Professional – 17741, Mac OS X 10.12.4 – 16204, Linux 5.4.0 – 14307, Windows 10 Corporate 20h2
– 14072.
      </p>
      <p>Step 3. Traffic parsing. From all captured traffic, only the packets whose source IP address
corresponds to the IP address of the PC with the studied OS were selected. Each packet was then
decomposed into the header fields of the main protocols: IP, ICMP, TCP, DNS and
HTTP. Parsing yielded the values of the following fields:
1. IP – version, hdr_len, dsfield, dsfield_dscp, dsfield_ecn, len, id, flags, flags_rb, flags_df,
flags_mf, frag_offset, ttl, proto, checksum, checksum_status.
2. ICMP – type, code, checksum, checksum_status, ident, seq, seq_le, data, data_data, data_len.
3. TCP – hdr_len, flags, flags_res, flags_ns, flags_cwr, flags_ecn, flags_urg, flags_ack,
flags_push, flags_reset, flags_syn, flags_fin, flags_str, window_size_value, window_size,
window_size_scalefactor, checksum, checksum_status.
4. DNS – id, flags, flags_response, flags_opcode, flags_truncated, flags_recdesired, flags_z,
flags_checkdisable, count_queries, count_answers, count_auth_rr, count_add_rr,
qry_name_len, count_labels, qry_type, qry_class.
5. HTTP – user_agent.</p>
      <p>Parsing was performed with Python 3.8 and the PyShark library, which works
efficiently with pcap files. The parsing results for each packet are saved in a CSV file for further
analysis. In addition to the field values, each record in the CSV file also contains the OS version. As a result, a
dataset containing 42,318 records was formed. It can also be used as a signature database, in which
a search yields a certain result.</p>
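      <p>The selection and export part of this step can be sketched with the standard library alone. The sketch assumes that the packets have already been decoded into dictionaries (for example with PyShark, which is not shown here); the record layout, the IP addresses and the helper name build_rows are hypothetical.</p>
      <preformat>
```python
import csv
import io

# Hypothetical subset of the parsed header fields.
FIELDS = ["ip_ttl", "ip_flags_df", "tcp_window_size"]


def build_rows(packets, host_ip, os_label):
    """Keep packets sent by the studied host and attach the OS label."""
    rows = []
    for pkt in packets:
        if pkt.get("ip_src") != host_ip:
            continue  # traffic from other hosts does not describe the studied OS
        row = {name: pkt.get(name, "") for name in FIELDS}
        row["os"] = os_label
        rows.append(row)
    return rows


# Hypothetical decoded packets; a missing field stays empty in the CSV.
packets = [
    {"ip_src": "192.168.0.10", "ip_ttl": 128, "ip_flags_df": 1, "tcp_window_size": 64240},
    {"ip_src": "192.168.0.1", "ip_ttl": 64, "ip_flags_df": 0},
    {"ip_src": "192.168.0.10", "ip_ttl": 128, "ip_flags_df": 1},
]

rows = build_rows(packets, "192.168.0.10", "Windows 10 Home 20h2")

buffer = io.StringIO()  # stands in for an open CSV file
writer = csv.DictWriter(buffer, fieldnames=FIELDS + ["os"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```
      </preformat>
      <p>Packets that lack a given protocol layer simply leave the corresponding columns empty, which keeps all records in one tabular format.</p>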
      <p>
        Step 4. Preprocessing. Almost no classifiers work directly with text data. They accept
features as numbers or as booleans (translated into the numbers 0/1) that indicate whether some feature is
present. Therefore, all categorical (text) features (IP: dsfield, flags; TCP:
flags, flags_str; DNS: flags, qry_class) must be converted into numbers. The following methods were used for this
purpose [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ]:
• label encoding – to encode the operating system classes by assigning a
number to each OS value;
• one-hot encoding – to encode attribute values (protocol parameters) by creating one column
per attribute value, where the column corresponding to the value is set to 1
and the other columns of the attribute are set to 0.
      </p>
      <p>The preprocessing module of the Scikit-learn (Python) library was used as a tool. Examples of
the converted data are shown in Tables 2 and 3.</p>
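      <p>Both encodings can be illustrated in a few lines of plain Python. This is a sketch of the idea only, with hypothetical sample values; Scikit-learn offers the same transformations as the LabelEncoder and OneHotEncoder classes.</p>
      <preformat>
```python
def label_encode(values):
    """Map each distinct value to an integer, in sorted order."""
    classes = sorted(set(values))
    index = {value: i for i, value in enumerate(classes)}
    return [index[value] for value in values], classes


def one_hot_encode(values):
    """Create one 0/1 column per distinct value, in sorted order."""
    categories = sorted(set(values))
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return rows, categories


# Hypothetical sample: OS labels (targets) and a categorical TCP flags feature.
os_labels = ["Windows 7", "Linux 5.4.0", "Windows 7", "Mac OS 11.4"]
tcp_flags = ["0x002", "0x010", "0x018", "0x010"]

encoded_labels, classes = label_encode(os_labels)
flag_columns, categories = one_hot_encode(tcp_flags)

print(classes, encoded_labels)
print(categories, flag_columns)
```
      </preformat>
      <p>One-hot encoding is why the number of columns grows: a single text feature with three distinct values becomes three numeric columns.</p>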
      <p>
        The id, checksum, and data fields of the TCP / IP protocols were not used during the
transformation and were removed from the feature list because they show no clear differences between OS
families. In addition, for some classifiers the data were normalized, i.e. each feature was scaled
to a mean of 0 and a standard deviation of 1 by the expression [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ]:
        <disp-formula>
          <label>(1)</label>
          <tex-math>z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},</tex-math>
        </disp-formula>
where z<sub>ij</sub> is the i-th normalized value of the j-th feature, x<sub>ij</sub> is the i-th original value of the j-th feature, µ<sub>j</sub> is the mean of the j-th
feature, and σ<sub>j</sub> is the standard deviation of the j-th feature. For this transformation, the “StandardScaler” class
from Scikit-learn (Python) is used.
      </p>
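      <p>The normalization of a single feature column, i.e. subtracting its mean and dividing by its standard deviation, can be sketched as follows. The TTL values are hypothetical, and the population standard deviation is assumed, which is also what Scikit-learn's StandardScaler uses.</p>
      <preformat>
```python
from statistics import fmean, pstdev


def standardize(column):
    """Scale one feature column to zero mean and unit standard deviation."""
    mu = fmean(column)
    sigma = pstdev(column)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in column]  # a constant feature carries no information
    return [(x - mu) / sigma for x in column]


ttl = [64, 64, 128, 128]  # hypothetical TTL values of four packets
z = standardize(ttl)
print(z)
```
      </preformat>
      <p>In practice the mean and standard deviation are estimated on the training set and then reused for the test set, so that both are scaled identically.</p>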
      <p>
        Step 5. Feature selection/importance. Since the number of features increased from 54 to
156 after the transformation of some features, it is advisable to reduce the dimensionality and select the
most relevant features. This speeds up model training, reduces computation
time when using the trained model, helps avoid overfitting, and improves the generalization ability of the
model. Among the many methods for selecting features by importance, the Recursive
Feature Elimination (RFE) method was used. Its essence is to build a model that includes
all features and to exclude the feature that is least significant from the point of view of the model.
A new model is then built on all features except those excluded at the previous
stages, and so on [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ]. The RFE class from the Scikit-learn library was used to solve the problem.
A decision tree model with default hyperparameters was used as the estimator. As a result of the
selection, 14 features were identified; they and their importance degrees are shown
in Table 4. The importance degree is obtained from the "feature_importances_" attribute of the trained
model. Although some features have little impact, in our opinion they are important and
give some expressiveness to different operating systems.
      </p>
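      <p>The elimination loop itself can be sketched without external libraries. To stay dependency-free, the sketch replaces the decision tree with a toy importance score (the share of a feature's variance explained by the class means); this score, the data and the function names are assumptions made for illustration. Unlike a decision tree, the toy score does not depend on which other features remain, so the sketch shows only the mechanics of refitting and dropping the weakest feature.</p>
      <preformat>
```python
from statistics import fmean, pvariance


def fit_importances(rows, labels, features):
    """Toy estimator: share of each feature's variance explained by class means."""
    importances = {}
    for name in features:
        column = [row[name] for row in rows]
        total = pvariance(column)
        if total == 0:
            importances[name] = 0.0
            continue
        class_means = []
        for label in labels:
            members = [r[name] for r, lab in zip(rows, labels) if lab == label]
            class_means.append(fmean(members))
        importances[name] = pvariance(class_means) / total
    return importances


def rfe(rows, labels, features, n_keep):
    """Refit, drop the least important feature, repeat until n_keep remain."""
    remaining = list(features)
    while len(remaining) > n_keep:
        importances = fit_importances(rows, labels, remaining)
        weakest = min(remaining, key=lambda name: importances[name])
        remaining.remove(weakest)
    return remaining


# Hypothetical packets: ip_id is noise, ttl separates the classes perfectly.
rows = [
    {"ttl": 128, "window_size": 64240, "ip_id": 10},
    {"ttl": 128, "window_size": 65535, "ip_id": 20},
    {"ttl": 64, "window_size": 64240, "ip_id": 10},
    {"ttl": 64, "window_size": 29200, "ip_id": 20},
]
labels = ["windows", "windows", "linux", "linux"]
features = ["ttl", "window_size", "ip_id"]

selected = rfe(rows, labels, features, n_keep=2)
print(selected)
```
      </preformat>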
    </sec>
    <sec id="sec-6">
      <title>6. Model selection and training</title>
      <p>When a classification problem is solved by machine learning methods, the question of choosing
a classifier model arises. At present, a significant number of classifiers are known and implemented,
which differ in their approaches to constructing the decision rule, as well as in their hyperparameters.
Correct adjustment of hyperparameters is one of the key points in obtaining the desired results.
In some machine learning models the number of hyperparameters can exceed 10, and
each hyperparameter can take different values. Finding the optimal combination is not an easy task.
One way to solve this problem is to build a model for each possible combination of values from the
given domains of the hyperparameters.</p>
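      <p>Exhaustive enumeration of hyperparameter combinations can be sketched as follows. The evaluate function is a hypothetical stand-in for training a model and measuring its accuracy; its formula is invented so that the example runs without data and does not reproduce any measured result.</p>
      <preformat>
```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Evaluate every combination of the given values and keep the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def evaluate(params):
    """Hypothetical stand-in for 'train a model and return its accuracy'."""
    depth_penalty = abs(params["max_depth"] - 6) * 0.01
    criterion_bonus = 0.02 if params["criterion"] == "entropy" else 0.0
    return 0.95 - depth_penalty + criterion_bonus


param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 6, 8],
}

best_params, best_score = grid_search(param_grid, evaluate)
n_combinations = len(list(product(*param_grid.values())))
print(best_params, round(best_score, 2), n_combinations)
```
      </preformat>
      <p>The number of models to build is the product of the domain sizes (here 2 × 4 = 8), which is why the search becomes expensive as hyperparameters are added.</p>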
      <p>As a result of preliminary data processing, a dataset was obtained, the structure of which is
shown in Table 5.</p>
      <p>[Table 5: structure of the full dataset and the small dataset.]</p>
      <sec id="sec-6-2">
        <p>
          For each model, the desired metric is calculated, and the model with the best value is
selected. This approach is implemented in the GridSearchCV class of the Scikit-learn library [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ].
Although this approach is quite costly in terms of machine time, it finds the best combination within the given domains. In addition,
although the Scikit-learn library does not support GPU computing, the cuML library
from the RAPIDS project can be used to parallelize computations with CUDA technology [
          <xref ref-type="bibr" rid="ref39 ref40">39, 40</xref>
          ].
        </p>
        <p>Using GridSearch technology, the best parameters for the most used classifier models were
automatically selected:
1. Decision Tree (DT) (criterion='entropy', max_depth=6);
2. Multilayer Perceptron (MLP) (alpha=0.05, hidden_layer_sizes=(50, 50, 50));
3. Gaussian Naive Bayes (GNB) (var_smoothing=1e-9);
4. K-Nearest Neighbors (KNN) (leaf_size=5, n_neighbors=1);
5. Support Vector Machine (SVM) (C=1000, gamma=0.0001);
6. Logistic Regression (LR) (C=5, max_iter=500, solver='newton-cg');
7. Random Forest (RF) (criterion='entropy', max_features='sqrt', min_samples_leaf=4,
n_estimators=1900).</p>
        <p>Since GridSearch is quite time-consuming, especially with a large number of models, it was
decided to run it on the small dataset (Table 5), which is representative enough to obtain adequate results.</p>
        <p>
          The following metrics were used to assess the quality of classifiers [
          <xref ref-type="bibr" rid="ref3 ref35">3, 35</xref>
          ]:
• Accuracy – the share of correctly identified classes;
• F1 – the harmonic mean of Precision and Recall, which balances type I and type II errors;
• Precision – the ratio of the number of correctly identified classes to the sum of the numbers of
correctly identified classes and type I errors (false positives);
• Recall – the ratio of the number of correctly identified classes to the sum of the numbers of
correctly identified classes and type II errors (false negatives);
• Confusion Matrix – a matrix showing the number of correct predictions and the number of
erroneous predictions for each class;
• FP – type I error (False Positive);
• FN – type II error (False Negative).
        </p>
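        <p>The metric definitions above can be written out as a short sketch. The function names are chosen for this example only; tp, fp and fn denote the numbers of true positives, false positives (type I errors) and false negatives (type II errors) for one class.</p>
        <preformat>
```python
def accuracy(correct: int, total: int) -> float:
    """Share of correctly identified samples."""
    return correct / total if total else 0.0


def precision(tp: int, fp: int) -> float:
    """Correct detections of a class among everything predicted as that class."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Correct detections of a class among all samples that belong to it."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


# Illustrative counts, not taken from the paper's tables.
print(precision(8, 2), recall(8, 2), f1(8, 2, 2), accuracy(90, 100))
```
        </preformat>
        <p>For a multi-class task such as OS detection, these values are computed per class and then averaged; which averaging scheme the authors used is not stated in this section.</p>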
        <p>After the optimal parameters of the classifiers were obtained by maximizing the accuracy
criterion, the classifiers were trained on the full dataset. Gaussian Naive Bayes was the fastest to
train and Logistic Regression the slowest: DT - 0.194 s, GNB - 0.0409 s, KNN - 0.742 s, LR - 482.51 s,
RF - 88.369 s, SVM - 5.08 s, MLP - 39.43 s. The models were trained on a PC with an AMD A8-4500M APU,
1.9 GHz (4 cores, 4 threads) and 8 GB of RAM. The dataset was divided into training and test sets in a
70%/30% ratio. The test results of the classifiers are presented in Table 6.</p>
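        <p>A 70%/30% split and a training-time measurement of this kind can be sketched as follows. The helper names and the shuffling with a fixed seed are assumptions of the example; the section does not say whether the authors shuffled or stratified the data, and sorting a list merely stands in for fitting a model.</p>
        <preformat>
```python
import random
import time
from typing import Callable, List, Sequence, Tuple


def train_test_split(samples: Sequence, test_share: float = 0.3,
                     seed: int = 0) -> Tuple[List, List]:
    """Shuffle sample indices and cut them into training and test parts."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(samples) * test_share)
    test = [samples[i] for i in indices[:n_test]]
    train = [samples[i] for i in indices[n_test:]]
    return train, test


def timed(fit: Callable[[], object]) -> Tuple[object, float]:
    """Run a training callable and return its result with the elapsed seconds."""
    start = time.perf_counter()
    result = fit()
    return result, time.perf_counter() - start


data = list(range(100))                           # placeholder for dataset rows
train, test = train_test_split(data)
model, seconds = timed(lambda: sorted(train))     # placeholder for model fitting
print(len(train), len(test))
```
        </preformat>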
        <p>As can be seen from Table 6, the Decision Tree model and its ensemble modification, Random
Forest, show the best results, without any errors. The worst classifier is Gaussian Naive Bayes, which
produced the highest number of erroneous predictions, but in general all classifiers have high metrics.
These results are confirmed by 5-fold cross-validation. Our metrics are better than the known research
results. However, an adequate comparison would require at least identical families / types / versions
of OS.</p>
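        <p>The idea of 5-fold cross-validation can be illustrated with a small index generator. This is a simplified sketch: the folds are interleaved rather than shuffled or stratified, and the function names and the evaluate callback are hypothetical.</p>
        <preformat>
```python
from statistics import mean
from typing import Callable, Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; every sample is tested exactly once."""
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for held_out, test_idx in enumerate(folds):
        train_idx = [i for m, fold in enumerate(folds) if m != held_out for i in fold]
        yield train_idx, test_idx


def cross_validate(n_samples: int,
                   evaluate: Callable[[List[int], List[int]], float],
                   k: int = 5) -> float:
    """Average the score returned by evaluate(train_idx, test_idx) over k folds."""
    return mean(evaluate(train, test) for train, test in k_fold_indices(n_samples, k))


splits = list(k_fold_indices(10, 5))
print(splits[0])
```
        </preformat>
        <p>Each model is trained five times on four fifths of the data and tested on the remaining fifth, so a high average score is less likely to be an artifact of one lucky split.</p>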
        <p>Based on the simulation results, confusion matrices were obtained, which are shown in Appendix
A. They contain the model prediction results, namely the following indicators: True Positive, True
Negative, False Positive (FP) and False Negative (FN).</p>
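        <p>How these four indicators follow from a multi-class confusion matrix can be shown with a short sketch. The labels and predictions below are invented for illustration and do not reproduce the matrices in Appendix A.</p>
        <preformat>
```python
from collections import Counter
from typing import Dict, List, Sequence


def confusion_matrix(y_true: Sequence[str], y_pred: Sequence[str],
                     labels: Sequence[str]) -> List[List[int]]:
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def per_class_counts(matrix: List[List[int]],
                     labels: Sequence[str]) -> Dict[str, Dict[str, int]]:
    """Derive TP, FP, FN and TN for every class from the matrix."""
    total = sum(sum(row) for row in matrix)
    result = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                  # rest of the row
        fp = sum(row[i] for row in matrix) - tp   # rest of the column
        tn = total - tp - fn - fp
        result[label] = {"TP": tp, "FP": fp, "FN": fn, "TN": tn}
    return result


labels = ["Linux", "MacOS", "Windows"]
y_true = ["Linux", "Linux", "MacOS", "Windows", "Windows", "Windows"]
y_pred = ["Linux", "Windows", "MacOS", "Windows", "Windows", "Linux"]
matrix = confusion_matrix(y_true, y_pred, labels)
counts = per_class_counts(matrix, labels)
print(matrix)
```
        </preformat>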
        <p>Analyzing the confusion matrices of the classifiers (Figs. A.1–A.6), we can note that most
classifiers make their errors within the OS family:
• Multilayer Perceptron: misclassification of Windows 10 Home (identified as Windows 10
Corporate) and Windows 7 Professional (identified as Windows XP Professional);
• Gaussian Naive Bayes: misclassification of Linux (identified as Windows 10 Home and
Corporate), MacOS (identified as Linux), Windows 10 Corporate (identified as Linux), Windows 10
Home (identified as Windows 10 Corporate and Linux);
• K-Nearest Neighbors: misclassification of Windows 10 Corporate (identified as MacOS 11.4
and Windows 10 Home), Windows 10 Home (identified as Windows 10 Corporate);
• Support Vector Machine: misclassification of Windows 10 Home (identified as Windows 10
Corporate) and Windows XP (identified as Windows 10 Home);
• Logistic Regression: misclassification of MacOS 11.4 (identified as MacOS 10.12.4) and
of Windows 10 Corporate (identified as Windows 10 Home and Linux).</p>
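        <p>Whether an error stays inside an OS family can be checked mechanically. The sketch below assumes that the family is the leading word of the class label, which is a convention adopted for this example; the pairs are a few of the confusions listed above.</p>
        <preformat>
```python
from typing import List, Tuple

FAMILIES = ("Windows", "MacOS", "Linux")


def os_family(label: str) -> str:
    """Map a class label such as 'Windows 10 Home' to its OS family."""
    for family in FAMILIES:
        if label.startswith(family):
            return family
    raise ValueError(f"unknown OS label: {label}")


def is_within_family(true_label: str, predicted_label: str) -> bool:
    """True when the wrong prediction still names the right family."""
    return os_family(true_label) == os_family(predicted_label)


# (true label, predicted label) pairs taken from the error list above.
errors: List[Tuple[str, str]] = [
    ("Windows 10 Home", "Windows 10 Corporate"),            # MLP, KNN, SVM
    ("Windows 7 Professional", "Windows XP Professional"),  # MLP
    ("MacOS 11.4", "MacOS 10.12.4"),                        # LR
    ("Windows 10 Corporate", "Linux"),                      # GNB, LR
    ("Windows 10 Corporate", "MacOS 11.4"),                 # KNN
]
within = [pair for pair in errors if is_within_family(*pair)]
across = [pair for pair in errors if not is_within_family(*pair)]
print(len(within), len(across))
```
        </preformat>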
        <p>Because the Decision Tree model is architecturally simpler than Random Forest, it is more
appropriate for use in the software. The tree consists of five levels and has 13 leaves. Because the
tree is bulky, its image is not given in this article.</p>
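        <p>The simplicity of a decision tree at prediction time can be seen in a toy sketch: classification is a short chain of threshold comparisons. The features (ttl, window_size), thresholds and labels below are hypothetical and much smaller than the trained tree with five levels and 13 leaves.</p>
        <preformat>
```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    label: str


@dataclass
class Node:
    feature: str
    threshold: float
    low: Union["Node", Leaf]    # branch taken when the value does not exceed the threshold
    high: Union["Node", Leaf]   # branch taken when the value exceeds the threshold


def predict(tree: Union[Node, Leaf], sample: Dict[str, float]) -> str:
    """Walk from the root to a leaf, one comparison per level."""
    while isinstance(tree, Node):
        tree = tree.high if sample[tree.feature] > tree.threshold else tree.low
    return tree.label


def count_leaves(tree: Union[Node, Leaf]) -> int:
    if isinstance(tree, Leaf):
        return 1
    return count_leaves(tree.low) + count_leaves(tree.high)


def depth(tree: Union[Node, Leaf]) -> int:
    if isinstance(tree, Leaf):
        return 0
    return 1 + max(depth(tree.low), depth(tree.high))


# Hypothetical tree: feature names and thresholds are made up for illustration.
toy_tree = Node("ttl", 64,
                Node("window_size", 30000, Leaf("Linux"), Leaf("MacOS")),
                Leaf("Windows"))
print(predict(toy_tree, {"ttl": 128, "window_size": 8192}))
```
        </preformat>
        <p>A Random Forest would repeat this walk for every tree in the ensemble and combine the votes, which is why a single tree is lighter to embed in an application.</p>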
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Application architecture and experiments</title>
      <p>Software was developed to apply the trained classifier model in practice. The software
architecture is shown in Figure 2.</p>
      <p>It consists of four modules: a visualization module, a scanning module, a preprocessing module
and an intelligent analysis module. The system also has its own database.</p>
      <p>The visualization module starts first. It initializes the graphical components of the program:
icons, buttons, windows, etc. It is also responsible for providing dialogue windows for user
interaction and for presenting detection results to the user.</p>
      <p>The scanning module has two purposes: sniffing network traffic to capture packets from the host
whose OS is to be detected, and sending probes (specific packets) to that host in order to receive
responses and perform OS detection.</p>
      <p>The preprocessing module gathers the fields of the captured packets that are needed for the OS
signature and creates the signature itself. A signature is a string that contains all the required
fields of the received packets, separated by commas.</p>
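      <p>The signature format described above, a comma-separated string of packet fields, can be sketched as follows. The field names and their order are hypothetical, since this section does not list the exact fields used by the preprocessing module.</p>
      <preformat>
```python
from typing import Dict, Sequence

# Hypothetical field order; the real set of fields is defined by the feature selection stage.
SIGNATURE_FIELDS = ("ip_ttl", "tcp_window", "tcp_mss", "ip_df")


def build_signature(packet_fields: Dict[str, object],
                    order: Sequence[str] = SIGNATURE_FIELDS) -> str:
    """Join the required packet fields into one comma-separated string."""
    missing = [name for name in order if name not in packet_fields]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    return ",".join(str(packet_fields[name]) for name in order)


def parse_signature(signature: str,
                    order: Sequence[str] = SIGNATURE_FIELDS) -> Dict[str, str]:
    """Split a signature back into named fields."""
    values = signature.split(",")
    if len(values) != len(order):
        raise ValueError("unexpected number of fields")
    return dict(zip(order, values))


# Illustrative values only.
signature = build_signature({"ip_ttl": 128, "tcp_window": 64240, "tcp_mss": 1460, "ip_df": 1})
print(signature)
```
      </preformat>
      <p>A fixed field order matters here: the classifier receives the values positionally, so the same order has to be used for training and for detection.</p>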
      <p>Figure 2: Software architecture. Blocks: User, Visualization module, Intelligent analysis module, Scanning module, Preprocessing module, Network, Database.</p>
      <sec id="sec-7-6">
        <title>Database</title>
        <p>The intelligent analysis module uses the trained machine learning classifier to detect the OS
from the provided signature. The classification results are sent to the visualization module, which
shows them to the user. The application database stores previous OS detection results for later use.
It can also store signatures and pcap files.</p>
        <p>Based on the proposed architecture, the software was developed in Python 3.8 with the PyQt5
library. SQLite is used as the database. The software has two modes: online and offline.</p>
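        <p>Storing detection results in SQLite can be sketched with the standard sqlite3 module. The table layout, column names and helper functions are assumptions of this example, not the schema of the developed software; the signature string is invented, while the host, label and probability repeat the first experiment reported below.</p>
        <preformat>
```python
import sqlite3
from typing import List, Tuple


def open_results_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) results table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS detections ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " host TEXT NOT NULL,"
        " signature TEXT NOT NULL,"
        " os_label TEXT NOT NULL,"
        " probability REAL NOT NULL,"
        " detected_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def save_detection(conn: sqlite3.Connection, host: str, signature: str,
                   os_label: str, probability: float) -> None:
    """Insert one detection result inside a transaction."""
    with conn:
        conn.execute(
            "INSERT INTO detections (host, signature, os_label, probability)"
            " VALUES (?, ?, ?, ?)",
            (host, signature, os_label, probability),
        )


def history(conn: sqlite3.Connection, host: str) -> List[Tuple[str, float]]:
    """Return earlier results for a host, oldest first."""
    rows = conn.execute(
        "SELECT os_label, probability FROM detections WHERE host = ? ORDER BY id",
        (host,),
    )
    return rows.fetchall()


conn = open_results_db()
save_detection(conn, "192.168.1.170", "128,64240,1460,1", "Windows 10 Corporate", 0.955)
print(history(conn, "192.168.1.170"))
```
        </preformat>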
        <p>In offline mode, the user can upload a pcap file from external or internal resources. This is
useful when traffic has already been captured and the OS type/version needs to be detected later. The
pcap file should contain the traffic of one specific host; otherwise, it can be filtered by IP address
and saved again. Online mode supports both active and passive scanning. Passive scanning takes more
time since, as stated before, all the needed packets are required for analysis. The software waits
until all the needed packets are captured. However, this can be sped up, for example, by connecting to
the host if it is a web server available through HTTP or FTP. In addition, the software can detect the
OS by analyzing the User-Agent header of HTTP packets.</p>
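        <p>OS detection from the User-Agent header can be approximated with a few pattern checks. This is a simplified sketch under stated assumptions: the function name and the mapping table are illustrative and incomplete, and the header can be spoofed, so the result is only a hint.</p>
        <preformat>
```python
import re
from typing import Optional

# Common "Windows NT" version tokens; the table is illustrative, not exhaustive.
_WINDOWS_NT = {
    "10.0": "Windows 10 or later",
    "6.3": "Windows 8.1",
    "6.2": "Windows 8",
    "6.1": "Windows 7",
    "5.1": "Windows XP",
}


def os_from_user_agent(user_agent: str) -> Optional[str]:
    """Guess the OS from platform tokens in a User-Agent string."""
    match = re.search(r"Windows NT (\d+\.\d+)", user_agent)
    if match:
        return _WINDOWS_NT.get(match.group(1), "Windows")
    match = re.search(r"Mac OS X (\d+(?:[._]\d+)*)", user_agent)
    if match:
        return "MacOS " + match.group(1).replace("_", ".")
    if "Linux" in user_agent:
        return "Linux"  # note: Android browsers also carry this token
    return None


print(os_from_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"))
```
        </preformat>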
        <p>Active mode is more appropriate since it does not require waiting: the software sends probes,
namely four ICMP requests along with up to 10 TCP SYN requests. The procedure is repeated up to 5
times if no response is received. After the preprocessing module receives the packets, it sends the
formed signature to the intelligent analysis module. The classifier predicts the OS
family/type/version of the target host together with the prediction probability. Next, an experiment
was conducted to check how the software works and to compare its results with Nmap. First, a host
running Windows 10 Corporate was scanned; its IP address in the local network is 192.168.1.170. The
results of the active scan mode are shown in Appendix B.</p>
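        <p>The retry logic of the active mode can be sketched independently of any packet library. In the example below, send_probes is a hypothetical callback that would transmit the ICMP and TCP SYN probes and return the captured responses; a fake sender replaces real network traffic.</p>
        <preformat>
```python
from typing import Callable, List

ICMP_PROBES = 4           # four ICMP requests per round
MAX_TCP_SYN_PROBES = 10   # up to 10 TCP SYN requests per round
MAX_ATTEMPTS = 5          # the round is repeated up to 5 times


def probe_host(send_probes: Callable[[int, int], List[bytes]],
               max_attempts: int = MAX_ATTEMPTS) -> List[bytes]:
    """Repeat the probe round until some response arrives or attempts run out."""
    for _ in range(max_attempts):
        responses = send_probes(ICMP_PROBES, MAX_TCP_SYN_PROBES)
        if responses:
            return responses
    return []


# Fake sender for demonstration: the host answers only on the third round.
calls = []


def fake_sender(icmp_count: int, syn_count: int) -> List[bytes]:
    calls.append((icmp_count, syn_count))
    return [b"reply"] if len(calls) == 3 else []


responses = probe_host(fake_sender)
print(len(calls), responses)
```
        </preformat>
        <p>An empty result after the last attempt means that no signature can be formed, so the classifier is not invoked for that host.</p>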
        <p>The results show that Nmap could not identify the installed operating system exactly. Its
output lists a wide range of possible operating systems (Windows 10, Windows 7, Windows Phone, Windows
Server, FreeBSD), and its probability is 0.92, compared with 0.955 for the developed software (Figs.
B.1, B.2). When the same actions were repeated for the host with IP address 192.168.43.60, the
developed software correctly detected Windows 10 Home 20H2 with a probability of 0.937 (Fig. B.3).
Nmap did not provide a result and reported that the host had too many signature matches (Fig. B.4). A
rented host with a public ("white") IP address was also used in the experiments. It had Linux 5.4
installed beforehand. The developed software detected it with a probability of 0.987 (Fig. B.5),
whereas Nmap reported that Windows 7 was installed (Fig. B.6).</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion and future work</title>
      <p>The analysis of existing methods and tools for remote node OS detection shows the main
advantages and disadvantages of their use. Two main approaches to remote node OS detection, active and
passive, are considered. Combining these approaches can make the detection process more flexible and
accurate. Existing methods of remote host OS detection are mostly based on a signature model of
decision making, which leads to many errors because no decision can be made when a matching signature
is absent. That is why machine learning methods were applied. To form the dataset for model training,
a five-stage process was implemented: traffic generation, traffic gathering, parsing, preprocessing
and feature selection.</p>
      <p>As a result of the research, a classifier based on the Decision Tree model was trained. Other
classifiers also showed high metrics, with a low rate of errors within the OS family. The high
accuracy of OS type detection indicates well-chosen features as well as a sufficient size of the input
dataset. This helped to avoid overfitting.</p>
      <p>The trained model detects seven OS versions within several families without errors on the test
set. A system architecture for remote OS detection is proposed. It is implemented in a cross-platform
application based on the trained model. This software tool can scan hosts in both offline and online
modes, and both active and passive scanning are implemented online. This makes the tool more versatile
than its analogues. The developed tool can be used by an ethical hacker for penetration testing and by
a network administrator for auditing and for checking the network for new unknown devices. The tool
can also be used to test how effectively server protection tools hide the server OS from detection.</p>
      <p>In the experiment, the developed software was more effective than Nmap. However, Nmap can
identify many more types of OS. Therefore, further research will be associated with scaling the model
as well as with expanding the functionality of the application. One such future function is a ranked
list of vulnerabilities inherent in the detected OS. This would speed up attack vector construction or
decision making for network protection.</p>
    </sec>
    <sec id="sec-9">
      <title>9. References</title>
      <p>10. Appendix
Appendix A. Confusion matrices
Figure A.2: K-Nearest Neighbors</p>
    </sec>
    <sec id="sec-10">
      <title>Appendix B. Experiments results</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>[1] CVE details: The ultimate security vulnerability datasource - Operation systems</article-title>
          . URL: https://www.cvedetails.com/product-list/product_type-o/vendor_id-0/firstchar-W/OperatingSystems.html.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>[2] Vulnerability and threat trends 2020: Research report</article-title>
          . URL: https://lp.skyboxsecurity.com/rs/440-MPQ-510/images/Skybox_Report_2020-VT_Trends.pdf
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Leonid</given-names>
            <surname>Kupershtein</surname>
          </string-name>
          , Tatiana Martyniuk, Olesia Voitovych, Bohdan Kulchytskyi, Andrii Kozhemiako et al.
          <article-title>"DDoS-attack detection using artificial neural networks in Matlab,"</article-title>
          <source>Proc. SPIE 11176, Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2019</source>
          . doi: 10.1117/12.2536478
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Voitovych</surname>
            ,
            <given-names>O.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuvkovetskyi</surname>
            ,
            <given-names>O.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kupershtein</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          <article-title>"SQL injection prevention system"</article-title>
          ,
          <source>2016 IEEE International Scientific Conference "Radio Electronics and Info Communications", UkrMiCo 2016 - Conference Proceedings</source>
          . doi: 10.1109/UkrMiCo.2016.7739642
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Voitovych</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kupershtein</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lukichov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikityuk</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <article-title>"Multilayer Access for Database Protection"</article-title>
          ,
          <source>2018 International Scientific-Practical Conference on Problems of Infocommunications Science and Technology, PICS&amp;T'2018 - Proceedings</source>
          , pp.
          <fpage>474</fpage>
          -
          <lpage>478</lpage>
          . doi: 10.1109/INFOCOMMST.2018.8632152
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Gurpreet K.</given-names>
            <surname>Juneja</surname>
          </string-name>
          <article-title>“Ethical hacking: a technique to enhance information security”</article-title>
          ,
          <source>International Journal of Innovative Research in Science, Engineering and Technology</source>
          , Vol.
          <volume>2</volume>
          , Issue
          <issue>12</issue>
          , pp.
          <fpage>7575</fpage>
          -
          <lpage>7580</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Orebaugh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pinkard</surname>
          </string-name>
          ,
          <article-title>Nmap in the Enterprise: your guide to network scanning, Burlington</article-title>
          , MA : Syngress Pub.,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Fyodor</surname>
          </string-name>
          <string-name>
            <surname>Lyon</surname>
          </string-name>
          ,
          <article-title>Nmap Network Scanning: The Official Nmap Project Guide to Network Discovery and Security Scanning</article-title>
          , Nmap Project,
          <year>2009</year>
          , 468 p.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <article-title>[9] Nmap: the Network Mapper - Free Security Scanner</article-title>
          . URL: https://nmap.org.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <article-title>Fingerprinting Methods Avoided by Nmap</article-title>
          . URL: https://nmap.org/book/osdetect-othermethods.html#osdetect-passive.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <article-title>NetScanTools Pro OS Fingerprinting Tool Description</article-title>
          . URL: https://www.netscantools.com/nstpro_os_fingerprinting.html.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <article-title>xprobe2(1) - Linux man page</article-title>
          . URL: https://linux.die.net/man/1.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>[13] p0f v3 (version 3.09b)</source>
          . URL: https://lcamtuf.coredump.cx/p0f3.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <article-title>GitHub - xnih/satori: Python rewrite of passive OS fingerprinting tool</article-title>
          . URL: https://github.com/xnih/satori.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <article-title>NetworkMiner</article-title>
          . URL: https://www.netresec.com/?page=NetworkMiner.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <article-title>GitHub - gamelinux/prads: Passive Real-time Asset Detection System</article-title>
          . URL: https://github.com/gamelinux/prads.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <article-title>Ettercap Home Page</article-title>
          . URL: https://www.ettercap-project.org.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <article-title>OS Fingerprinting using Wireshark</article-title>
          . URL: https://andytanoko.wordpress.com/2020/07/19/osfingerprinting-using-wireshark.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <article-title>Wireshark Tutorial: Identifying Hosts and Users</article-title>
          . URL: https://unit42.paloaltonetworks.com/using-wireshark-identifying-hosts-and-users.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <article-title>OS Detection Techniques</article-title>
          . URL: https://jonathansblog.co.uk/os detection-techniques.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>P.</given-names>
            <surname>Auffret</surname>
          </string-name>
          ,
          <article-title>SinFP, unification of active and passive operating system fingerprinting</article-title>
          .
          <source>Journal in Computer Virology</source>
          ,
          <year>2008</year>
          ,
          <volume>6</volume>
          (
          <issue>3</issue>
          ), pp.
          <fpage>197</fpage>
          -
          <lpage>205</lpage>
          . doi: 10.1007/s11416-008-0107-z
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C.</given-names>
            <surname>Trowbridge</surname>
          </string-name>
          ,
          <article-title>An Overview of Remote Operating System Fingerprinting</article-title>
          , White paper,
          <year>2003</year>
          . URL: https://sansorg.egnyte.com/dl/dp8wFpM37k/?
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Gurary</surname>
            ,
            <given-names>Jonathan</given-names>
          </string-name>
          &amp; Zhu, Ye &amp; Bettati, Riccardo &amp; Guan,
          <string-name>
            <surname>Yong.</surname>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Operating System Fingerprinting</article-title>
          . doi: 10.1007/978-1-4939-6601-1_7.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Tyagi</surname>
          </string-name>
          , Rohit &amp; Paul, Tuhin &amp; Bs, Manoj &amp; B.,
          <string-name>
            <surname>Thanudas</surname>
          </string-name>
          . (
          <year>2015</year>
          ).
          <article-title>Packet Inspection for Unauthorized OS Detection in Enterprises</article-title>
          .
          <source>IEEE Security &amp; Privacy</source>
          ,
          <volume>13</volume>
          , pp.
          <fpage>60</fpage>
          -
          <lpage>65</lpage>
          . doi: 10.1109/MSP.2015.86.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , T. Zou,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>"Remote Operation System Detection Base on Machine Learning,"</article-title>
          <source>2009 Fourth International Conference on Frontier of Computer Science and Technology</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>539</fpage>
          -
          <lpage>542</lpage>
          . doi: 10.1109/FCST.2009.21.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Song</surname>
          </string-name>
          , Jinho &amp; Kim, Yonggun &amp; Won, Yoojae,
          <source>Operating System Fingerprint Recognition Using ICMP</source>
          ,
          <year>2020</year>
          . doi: 10.1007/978-981-13-9341-9_49.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Aksoy</surname>
            ,
            <given-names>Ahmet</given-names>
          </string-name>
          &amp; Louis, Sushil &amp; Gunes, Mehmet,
          <source>Operation system fingerprinting via automated network traffic analysis</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2502</fpage>
          -
          <lpage>2509</lpage>
          . doi: 10.1109/CEC.2017.7969609.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Al-Shehari</surname>
          </string-name>
          ,
          <article-title>Taher &amp; Shahzad, Farrukh. Improving Operating System Fingerprinting using Machine Learning Techniques</article-title>
          .
          <source>International Journal of Computer Theory and Engineering</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aksoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Louis and M. H. Gunes</surname>
          </string-name>
          ,
          <article-title>"Operating system fingerprinting via automated network traffic analysis," 2017 IEEE Congress on Evolutionary Computation (CEC), Donostia</article-title>
          , Spain,
          <year>2017</year>
          , pp.
          <fpage>2502</fpage>
          -
          <lpage>2509</lpage>
          . doi: 10.1109/CEC.2017.7969609.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Martyniuk</surname>
            ,
            <given-names>T.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozhemiako</surname>
            ,
            <given-names>A.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kupershtein</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          <article-title>"Formalization of the Object Classification Algorithm"</article-title>
          ,
          <source>Cybernetics and Systems Analysis</source>
          ,
          <year>2015</year>
          ,
          <volume>51</volume>
          (
          <issue>5</issue>
          ), pp.
          <fpage>751</fpage>
          -
          <lpage>756</lpage>
          . doi: 10.1007/s10559-015-9767-0.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <article-title>Dataset Using TLS Fingerprints for OS Identification in Encrypted Traffic</article-title>
          . URL: https://zenodo.org/record/3461771.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <surname>De Montigny-Lebouf</surname>
            <given-names>A</given-names>
          </string-name>
          .
          <article-title>A Multi-Packet Signature Approach to Passive Operating System Detection</article-title>
          , Communications Research Centre, Canada,
          <year>2005</year>
          . URL: https://apps.dtic.mil/sti/pdfs/ADA436420.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <article-title>Virtualization: IBM Cloud Education</article-title>
          . URL: https://www.ibm.com/cloud/learn/virtualization-acomplete-guide.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <article-title>Tracing network traffic using tcpdump and tshark</article-title>
          . URL: https://techzone.ergon.ch/tcpdump
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <surname>Albon</surname>
            <given-names>C.</given-names>
          </string-name>
          <article-title>Machine Learning with Python Cookbook: Practical Solutions from Preprocessing to Deep Learning</article-title>
          , O'Reilly Media, Inc.,
          <year>2018</year>
          , 336 p.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>D.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Investigating the impact of data normalization on classification performance</article-title>
          ,
          <source>Applied Soft Computing</source>
          , Vol.
          <volume>97</volume>
          , Part B,
          <year>2020</year>
          . doi: 10.1016/j.asoc.2019.105524.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <surname>Kuhn</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johnson</surname>
            <given-names>K.</given-names>
          </string-name>
          <article-title>Feature Engineering and Selection: A Practical Approach for Predictive Models</article-title>
          . Chapman and Hall/CRC,
          <year>2019</year>
          . 310 p.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <article-title>Tuning the hyper-parameters of an estimator</article-title>
          . URL: https://scikit-learn.org/stable/modules/grid_search.html.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <article-title>Scikit-learn Tutorial - Beginner's Guide to GPU Accelerating ML Pipeline</article-title>
          . URL: https://developer.nvidia.com/blog/scikit-learn-tutorial-beginners-guide-to-gpu-accelerating-ml-pipeline.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <surname>Mahler</surname>
            <given-names>P.</given-names>
          </string-name>
          <article-title>RAPIDS Release 21.06</article-title>
          . URL: https://medium.com/rapids-ai/rapids-release-21-06-f9bd2e5a9aa4.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>