<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Information classification framework according to SOC 2 Type II</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oleh Deineka</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleh Harasymchuk</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrii Partyka</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerii Kozachok</string-name>
          <email>v.kozachok@kubg.edu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Borys Grinchenko Kyiv Metropolitan University</institution>
          ,
          <addr-line>18/2 Bulvarno-Kudryavska str., 04053 Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CPITS-II 2024: Workshop on Cybersecurity Providing in Information and Telecommunication Systems II</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>12 Stepana Bandery str., 79000 Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>182</fpage>
      <lpage>189</lpage>
      <abstract>
        <p>Large Language Models (LLMs) like GPT-3 and BERT, trained on extensive text data, are transforming data management and governance, areas crucial for SOC 2 Type II compliance. LLMs respond to prompts, guiding their output generation, and can automate tasks like data cataloging, enhancing data quality, ensuring data privacy, and assisting in data integration. These capabilities can support a robust data classification policy, a key requirement for SOC 2 Type II. Vector search, another important method in data management, finds similar items to a given item by representing them as vectors in a high-dimensional space. It offers high accuracy, scalability, and flexibility, supporting efficient data classification. Embeddings, which convert categorical data into a form that can be input into a model, play a key role in vector search and LLMs. Prompt engineering, the crafting of effective prompts, is crucial for guiding LLMs' output, and further enhancing data management and governance practices.</p>
      </abstract>
      <kwd-group>
        <kwd>SOC 2 Type II</kwd>
        <kwd>information classification</kwd>
        <kwd>data security</kwd>
        <kwd>LLM</kwd>
        <kwd>vector search</kwd>
        <kwd>prompt engineering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. SOC 2 Type II</title>
      <p>
        In today’s digital age, the exponential growth of
information assets, a significant portion of which is critical,
is a defining characteristic. The sheer volume of this
information necessitates its classification based on various
parameters and features, secure storage and transmission,
and protection against unauthorized access. The frequency
of potential attacks on information resources is on the rise
[1–3]. To counteract these threats, cybersecurity experts are
continually developing new standards, strategies, and
techniques, as well as advancing infrastructure [
        <xref ref-type="bibr" rid="ref11">4–9</xref>
        ]. A key
focus is the creation and research of standards for secure
data storage [
        <xref ref-type="bibr" rid="ref10 ref12 ref13 ref14 ref15 ref16 ref2">10–14</xref>
        ]. These standards provide insight into
how an organization controls data access and ensures its
security and confidentiality.
      </p>
      <p>The standards and requirements for data storage can
differ for organizations based on factors such as
geographical location, industry, sensitivity of the
information, and more. Specific organizations may have
unique standards and requirements based on their needs
and legal obligations.</p>
      <p>Most organizations formulate their security policies
based on international standards, often with the
involvement of external auditing firms that certify standard
compliance. However, professionals dealing with secure
storage of large data volumes still face numerous challenges,
including data integrity, confidentiality, and accessibility.
Ensuring the information remains unchanged from creation
through storage and retrieval can be a complex task.
Additionally, professionals must ensure confidentiality,
allowing only authorized individuals to access the data and
guarantee data accessibility when needed, a task that
becomes increasingly challenging with growing data
volumes.</p>
      <p>Despite the existence of various effective strategies,
methods, and systems for organizing big data storage,
certain problems persist. One significant issue is the
difficulty of searching for required information in
unstructured data.</p>
      <p>
        ISO 27001 [
        <xref ref-type="bibr" rid="ref13">11</xref>
        ] is a standard aimed at ensuring the
proper management of a company’s digital assets, including
financial information, intellectual property, employee data,
and trusted third-party information. Meanwhile, SOC 2
certification [
        <xref ref-type="bibr" rid="ref10 ref12 ref2">10</xref>
        ] is more widely recognized and typically preferred
by American and Canadian companies.
      </p>
      <p>SOC is divided into SOC 1, SOC 2, and SOC 3. The first
pertains exclusively to financial control, and the third is
primarily used for marketing purposes, allowing SaaS
providers to focus solely on SOC 2.</p>
      <p>The Service and Organization Controls 2 standard,
developed by the American Institute of Certified Public
Accountants using the Trust Services Criteria reliability
criteria, provides an independent evaluation of risk
management control procedures in IT companies that
provide services to users. The standard emphasizes data
privacy and confidentiality, making it a choice for giants
like Google and Amazon, for whom high-security levels and
transparent data processing processes are crucial. External
auditors are engaged for certification. Their role is to
examine the implemented practices, verify the company’s
adherence to its procedures, and monitor changes in
processes.</p>
      <p>SOC 2 Type II is a significant certification in the data
security and compliance landscape. It serves as an
attestation by an independent auditor that a service
organization’s systems are not only designed to meet the
Trust Services Criteria but also operate effectively over
time. The Trust Services Criteria cover several critical areas:
security, availability, processing integrity, confidentiality,
and privacy.</p>
      <p>The value of SOC 2 Type II lies in its ability to foster
trust with clients and stakeholders. By demonstrating a
commitment to stringent data management practices,
companies can assure clients that their sensitive data is
managed responsibly. This is particularly important in
sectors where data privacy and security are crucial, such as
financial services, healthcare, and cloud computing.</p>
      <p>
        Furthermore, the audit process for SOC 2 Type II helps
organizations identify and mitigate potential security risks,
ensuring they maintain a robust security posture. This
proactive approach to risk management is vital in an era
where cyber threats are continually evolving, and data
breaches can have devastating consequences. Hence, there
is a constant search for new strategies and methods to
ensure reliable data storage and user and device
authentication where this data is stored [
        <xref ref-type="bibr" rid="ref17 ref18 ref19 ref20">15–18</xref>
        ].
      </p>
      <p>In an increasingly regulated environment, SOC 2 Type
II compliance can also support adherence to legal and
regulatory requirements, helping organizations avoid
expensive penalties and legal issues associated with
noncompliance.</p>
      <p>
        From a business perspective, SOC 2 Type II compliance
can serve as a competitive differentiator. It signals to the
market that an organization is a reliable and secure partner,
which can be instrumental in winning new business and
retaining existing customers [
        <xref ref-type="bibr" rid="ref21">19</xref>
        ].
      </p>
      <p>The outcome of implementing SOC 2 is a report based
on the AICPA Attestation Standards, section 101, Attest
Engagement.</p>
    </sec>
    <sec id="sec-2">
      <title>2. SOC 2 Type II information classification policy</title>
      <sec id="sec-2-1">
        <title>2.1. Requirements</title>
        <p>SOC 2 Type II, while not
prescribing specific data classification policies, mandates
that organizations effectively manage and safeguard the
confidentiality, privacy, and security of information in line
with the Trust Services Criteria (TSC).</p>
        <p>A Data Classification Policy is essential in meeting these
criteria, especially the Security criterion, which is common
to all SOC 2 audits. A SOC 2 audit evaluates the
effectiveness of an organization’s processes and systems
based on the Trust Service Criteria and checks compliance
with information security standards and regulations,
including Common Criteria standards. To support SOC 2
Type II compliance, a Data Classification Policy should
address several general requirements, including the
identification of data types. The policy should define the
types of data the organization handles, including sensitive
data subject to SOC 2 considerations, such as personal
identifiable information (PII), business confidential data,
and intellectual property. The policy must also establish
clear classification levels that reflect the sensitivity of the
data, with common levels including Public, Internal Use
Only, Confidential, and Highly Confidential. Additionally,
the policy should define roles and responsibilities for data
classification, including data owners, custodians, and users,
and outline their responsibilities in maintaining data
classification.</p>
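        <p>As a minimal sketch, the classification levels and roles described above can be encoded as a small asset register. Everything below (the Level enum, the DataAsset record, and the raise_to_at_least helper) is illustrative naming, not part of SOC 2 itself; a real register would live in a governance tool.</p>

```python
from dataclasses import dataclass
from enum import IntEnum

# Hypothetical classification levels, ordered by sensitivity;
# the level names come from the common levels listed in the policy text.
class Level(IntEnum):
    PUBLIC = 0
    INTERNAL_USE_ONLY = 1
    CONFIDENTIAL = 2
    HIGHLY_CONFIDENTIAL = 3

@dataclass
class DataAsset:
    name: str
    level: Level
    owner: str      # data owner: accountable for the classification decision
    custodian: str  # data custodian: implements storage and handling controls

def raise_to_at_least(register, name, floor):
    """Raise an asset's classification to at least `floor`; never lower it."""
    current = register.get(name, Level.PUBLIC)
    register[name] = max(current, floor)
    return register[name]

# Illustrative register of assets and their classification levels.
register = {
    "customer_pii.csv": Level.HIGHLY_CONFIDENTIAL,
    "press_release.md": Level.PUBLIC,
}
raise_to_at_least(register, "press_release.md", Level.INTERNAL_USE_ONLY)
raise_to_at_least(register, "customer_pii.csv", Level.CONFIDENTIAL)  # no-op: never lowers
```

        <p>Ordering the enum by sensitivity makes "never lower a classification" a simple max() comparison.</p>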
        <p>
          A Data Classification Policy for SOC 2 Type II
compliance should specify handling requirements for each
classification level, including storage, transmission, access
controls, encryption standards, and end-of-life procedures.
The policy should also provide guidelines on how data
should be labeled or marked according to its classification
to ensure that it is easily identifiable and handled
appropriately. Access controls must be addressed, ensuring
that access to data is based on the principle of least privilege
and that only authorized individuals can access sensitive
data. The policy should outline data retention periods and
secure disposal methods for each classification level,
ensuring data is not kept longer than necessary and is
disposed of securely. Regular training and awareness
programs for employees should be mandated to understand
the importance of data classification and their role in it. The
policy should include provisions for regular auditing and
monitoring to ensure that classification controls are
effective and being followed. The policy should be linked to
an incident response plan that addresses potential data
breaches or loss, with procedures tailored to the
classification level of the data involved. The policy should
specify intervals for reviewing and updating data
classification procedures to ensure they remain relevant and
effective as the organization evolves, data volumes increase,
and new threats emerge. If data
is shared with or handled by third-party vendors, the data
classification policy must extend to these vendors, often
requiring them to adhere to similar or compatible
classification and handling standards. To ensure alignment
with SOC 2 Type II requirements, developing a Data
Classification Policy usually demands a comprehensive
understanding of the AICPA’s TSC and the unique data
protection requirements of the organization. Engaging with
seasoned compliance experts or auditors who can give
tailored advice and oversee compliance with the standard’s
stipulations is highly recommended. The AICPA’s guidance
and frameworks such as ISO 27001, when consulted and
utilized, can offer invaluable inputs for the creation and
sustenance of a strong data classification policy. It is crucial
to identify and categorize data based on its sensitivity,
importance, and regulatory mandates. Moreover, regular
reviews and updates of the policy should be conducted to
ensure its efficiency and continued compliance with SOC 2
Type II requirements [
          <xref ref-type="bibr" rid="ref22 ref23 ref24 ref25 ref26">20–24</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Design</title>
        <p>We therefore propose a Data Flow Diagram. Creating a Data
Flow Diagram necessitates an initial comprehensive grasp
of the various data types that your company possesses.
Typically, data can be divided into three main categories:
structured, semi-structured, and unstructured.</p>
        <p>Data that is organized in a prearranged manner, such as
the data stored in a relational database, is referred to as
structured data. Its consistent format makes structured data
easy to search, analyze, and manipulate. On the other hand,
semi-structured data has a certain level of organization but
lacks a strict format. XML and JSON files, which house data
in a hierarchical format without a fixed schema, are
examples of semi-structured data.</p>
        <p>
          Unstructured data is characterized by its lack of
inherent structure or organization. This category includes
text documents, images, and videos. The inconsistent format
of unstructured data can pose challenges when it comes to
searching, analyzing, and manipulating it [
          <xref ref-type="bibr" rid="ref27 ref28">25–26</xref>
          ].
        </p>
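        <p>A first-pass triage of sources into these three categories can be sketched with a simple extension heuristic. The mapping below is an assumption for illustration; file extensions are only a rough proxy for actual structure.</p>

```python
# Illustrative mapping from file extension to the three broad data
# categories; the table itself is an assumption, not a standard.
STRUCTURED = {".csv", ".tsv", ".parquet", ".sql"}
SEMI_STRUCTURED = {".json", ".xml", ".yaml"}

def data_category(filename):
    """Guess the broad category of a data file from its extension."""
    dot = filename.rfind(".")
    ext = filename[dot:].lower() if dot != -1 else ""
    if ext in STRUCTURED:
        return "structured"
    if ext in SEMI_STRUCTURED:
        return "semi-structured"
    # Text, images, video, and anything unrecognized fall through here.
    return "unstructured"
```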
        <p>
          After the identification of the company’s data types, the
subsequent phase involves gaining an understanding of the
metadata linked to that data. Metadata is essentially data
that offers information about other data. For example, the
metadata linked to a text document could include details like
the author, the date of creation, and the file size. A deep
understanding of the metadata associated with your data
can facilitate better organization, management, and analysis
of your data [
          <xref ref-type="bibr" rid="ref29">27</xref>
          ].
        </p>
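        <p>Basic technical metadata can be gathered from the filesystem alone, as in this sketch; richer metadata such as the author is format-specific and would need a dedicated parser, so it is omitted here.</p>

```python
import os
import time

def file_metadata(path):
    """Collect basic metadata about a file: name, size, modification
    date, and extension, using only the filesystem."""
    st = os.stat(path)
    dot = path.rfind(".")
    return {
        "name": os.path.basename(path),
        "size_bytes": st.st_size,
        "modified": time.strftime("%Y-%m-%d", time.gmtime(st.st_mtime)),
        "extension": path[dot:] if dot != -1 else "",
    }

# Demonstrate on a throwaway file.
with open("example_doc.txt", "w") as f:
    f.write("hello metadata")
meta = file_metadata("example_doc.txt")
os.remove("example_doc.txt")
```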
        <p>The process of creating a Data Flow Diagram continues
with the utilization of integration tools to manage and store
your data, once you have identified the types of data your
company owns and the metadata associated with that data.
Integration tools facilitate the extraction of data from
various sources, its transformation into a common format,
and its loading into a data store. This process, known as Extract, Transform, Load (ETL), consolidates your data into
a single location, simplifying its management and analysis
[
          <xref ref-type="bibr" rid="ref30 ref31 ref32 ref33 ref34">28–32</xref>
          ].
        </p>
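        <p>The ETL pattern just described can be sketched end to end with the standard library; the CSV source, table name, and validation rule below are illustrative assumptions.</p>

```python
import csv
import io
import sqlite3

# Extract: read rows from a CSV source (here an in-memory string).
source = io.StringIO("name,amount\nalice,10\nbob,oops\ncarol,5\n")
rows = list(csv.DictReader(source))

# Transform: normalize types and drop rows that fail validation.
clean = []
for r in rows:
    try:
        clean.append((r["name"].strip().title(), int(r["amount"])))
    except ValueError:
        continue  # "oops" is not an integer; skip the bad row

# Load: write the cleaned rows into a single SQLite store.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE payments (name TEXT, amount INTEGER)")
con.executemany("INSERT INTO payments VALUES (?, ?)", clean)
total = con.execute("SELECT SUM(amount) FROM payments").fetchone()[0]
```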
        <p>
          Following the extraction, transformation, and loading of
your data into a data store, the subsequent phase involves
creating a data model. A data model is a visual depiction of
the relationships between different data elements. It
provides a structure for organizing and structuring your
data and can assist in identifying patterns and trends within
your data [
          <xref ref-type="bibr" rid="ref35">33</xref>
          ].
        </p>
        <p>
          Once a data model has been created, the next step
involves classifying your data and linking it to the associated
metadata. This involves assigning a sensitivity level to your
data, based on its importance and the potential impact if it
were to be lost or stolen. After your data has been classified,
it can be linked to the associated metadata, providing
additional context and information about the data [
          <xref ref-type="bibr" rid="ref36">34</xref>
          ].
        </p>
        <p>
          The final phase in creating a Data Flow Diagram
involves creating an application that enables you to
visualize and manage your data. This application should
offer a user-friendly interface for accessing, analyzing, and
manipulating your data. It should also incorporate logic for
managing access, requests, and incidents, and should be
integrated with your ITSM system to ensure that data is
handled according to your company’s policies and
procedures [
          <xref ref-type="bibr" rid="ref37">35</xref>
          ].
        </p>
        <p>This solution offers numerous advantages over
traditional product-based offerings from various companies.
One of the primary benefits is the flexibility to choose the
hosting environment that best suits your needs, whether
on-premises or cloud-based. This allows you to align the
solution with your operational requirements and
infrastructure capabilities.</p>
        <p>Additionally, you have the liberty to select the technology
stack that best fits your project. This means that you’re not
confined to a predetermined set of technologies but can
customize the solution to leverage the most relevant and
efficient tools for your specific needs.</p>
        <p>In terms of team composition, you have the flexibility to
assemble a team that is uniquely suited to the project at
hand. This flexibility ensures that the right expertise and
skills are applied to deliver the best possible outcomes.</p>
        <p>Another advantage is the flexibility in budgeting. Unlike
vendor-specific solutions that may come with fixed
licensing costs, the budget for this solution can be adjusted
according to your financial capacity and project
requirements. This can result in significant cost savings
without compromising on quality or performance.</p>
        <p>
          Lastly, this solution offers robust change and feature
management capabilities. This means that it can easily adapt
to evolving business needs, with the ability to incorporate
new features and make necessary changes in a timely and
efficient manner. This flexibility ensures the solution remains
relevant and continues to deliver value over time [
          <xref ref-type="bibr" rid="ref15">13</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Information classification</title>
      <sec id="sec-3-1">
        <title>3.1. Overview</title>
        <p>Information classification is a critical process in data
management that involves categorizing data based on its
sensitivity, importance, and regulatory requirements. This
process is essential for organizations to effectively protect
their data and comply with various legal, regulatory, and
contractual obligations.</p>
        <p>The primary goal of information classification is to
facilitate appropriate levels of protection for different types
of data. By classifying data such as public, internal,
confidential, or highly confidential, organizations can apply
suitable security measures to each category, ensuring that
sensitive and critical data receives the highest level of
protection.</p>
        <p>Information classification is not a one-time activity but
a continuous process that needs to be integrated into the
organization’s data lifecycle. It involves identifying the
types of data the organization handles, defining
classification levels, assigning responsibilities for data
classification, and implementing procedures for handling,
storing, and disposing of data based on its classification.</p>
        <p>In addition to enhancing data security, information
classification also aids in risk management, regulatory
compliance, and resource allocation. It helps organizations
understand where their most sensitive and valuable data
resides, who has access to it, and how it is being protected,
enabling them to identify and mitigate potential risks. It also
supports compliance with regulations such as GDPR,
HIPAA, and SOC 2, which require organizations to
implement appropriate safeguards for sensitive data.
Furthermore, by identifying less sensitive data that requires
lower levels of protection, organizations can optimize their
use of resources.</p>
        <p>In today’s data-driven world, where vast volumes of
data are generated and processed every day, information
classification has become more important than ever. It is a
fundamental step in ensuring that all data is given the
appropriate level of protection and handled responsibly
throughout its lifecycle.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Importance</title>
        <p>The importance of Information Classification in the context
of SOC 2 Type II compliance cannot be overstated. It serves
as the foundation for data security and privacy controls,
helping organizations identify and protect their most
sensitive data.</p>
        <p>Firstly, Information Classification helps in identifying
the types of data an organization handles, including
sensitive data subject to SOC 2 considerations, such as
personally identifiable information (PII), confidential
business data, and intellectual property. This identification
is the first step towards implementing appropriate security
measures.</p>
        <p>Secondly, Information Classification aids in establishing
clear classification levels that reflect the sensitivity of the
data. These levels, which commonly include Public, Internal
Use Only, Confidential, and Highly Confidential, guide the
implementation of access controls, encryption standards,
and other security measures.</p>
        <p>Thirdly, Information Classification supports the
assignment of roles and responsibilities for data
classification, including data owners, custodians, and users.
This clear delineation of responsibilities ensures
accountability and promotes adherence to data security
policies.</p>
        <p>Lastly, Information Classification facilitates
compliance with legal and regulatory requirements,
including those stipulated by SOC 2 Type II.</p>
        <p>Information Content Extraction is a crucial process in
data management that involves retrieving structured
information from unstructured or semi-structured data
sources. This process is essential for transforming raw data
into meaningful and actionable insights.</p>
        <p>Structured data is data that is organized into a formatted
structure, often a relational database. This type of data is
readily searchable by simple, straightforward search engine
algorithms or other search operations.</p>
        <p>Semi-structured data is a form of structured data that
does not adhere to the formal structure of data models
associated with relational databases or other forms of data
tables but contains tags or other markers to separate
semantic elements and enforce hierarchies of records and
fields within the data. Examples of semi-structured data
include XML and JSON files.</p>
        <p>Unstructured data is information that either does not
have a pre-defined data model or is not organized in a
predefined manner. This type of data is typically text-heavy but
may contain data such as dates, numbers, and facts as well.
Examples of unstructured data include text files, PDFs, and
BLOBs (Binary Large Objects).</p>
        <p>Information extraction from these types of data involves
several steps, including text preprocessing, entity
recognition, relation extraction, and event extraction. Text
preprocessing involves cleaning and normalizing the text,
removing stop words, and stemming or lemmatizing words.</p>
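        <p>The preprocessing steps named above (cleaning and normalizing the text, removing stop words, stemming) can be sketched in a few lines. The stop-word list and suffix-stripping stemmer below are deliberately naive stand-ins; a real pipeline would use a full stemmer or lemmatizer.</p>

```python
import re

# A tiny illustrative stop-word list; real lists have hundreds of entries.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def naive_stem(word):
    # Deliberately naive suffix stripping; a Porter stemmer or a
    # lemmatizer would be used in practice.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Clean and normalize: lowercase, keep alphanumeric tokens only,
    # drop stop words, then stem what remains.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```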
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Framework</title>
        <p>Entity recognition identifies entities such as names,
locations, and dates in the text. Relation extraction identifies
relationships between these entities, and event extraction
identifies events in which these entities are involved.</p>
        <p>There are several approaches to information extraction:</p>
        <p>Rule-based methods: These methods use a set of
predefined rules or patterns to extract
information. For example, a rule might specify
that if a word is capitalized and followed by a
certain verb, it is likely a person’s name. While
rule-based methods can be very accurate, they are
also labor-intensive and may not generalize well
to new data.</p>
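        <p>The capitalization-plus-verb rule mentioned above can be illustrated with a single regular expression; the verb list is an assumption, and real rule-based extractors combine many such hand-tuned patterns.</p>

```python
import re

# One illustrative rule: a capitalized word immediately followed by a
# reporting verb is likely a person's name. The verb list is invented
# for this sketch.
PERSON_RULE = re.compile(r"\b([A-Z][a-z]+)\s+(?:said|reported|wrote)\b")

def extract_person_names(text):
    """Apply the single rule above and return candidate person names."""
    return PERSON_RULE.findall(text)

names = extract_person_names(
    "Alice said the audit passed. Bob wrote the report. the clerk said nothing."
)
```

        <p>Note the rule's fragility: "the clerk said" is correctly skipped, but a sentence-initial non-name ("Yesterday said...") would be a false positive, which is why rule sets need careful tuning.</p>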
        <p>Machine learning methods: These methods use
algorithms to learn patterns from labeled training
data and apply these patterns to new data. For
example, a machine learning model might learn
that words that often appear in the same context
as known person names are likely to be person
names themselves. Machine learning methods can
be very effective, especially with large amounts of
training data, but they can also be complex and
computationally intensive.</p>
        <p>
          Hybrid methods: These methods combine
rule-based and machine learning methods to leverage
the strengths of both. For example, a hybrid
method might use rules to extract easy-to-identify
information and machine learning to extract more
complex information [
          <xref ref-type="bibr" rid="ref38 ref39">36–37</xref>
          ].
        </p>
        <p>However, there are no clear recommendations
regarding the implementation of a specific method, and the
choice should be made considering a large number of
factors.</p>
        <p>
          Large Language Models (LLMs) [
          <xref ref-type="bibr" rid="ref40 ref41">38–39</xref>
          ] represent a
significant advancement in the field of artificial intelligence.
These models are trained on extensive volumes of text data,
enabling them to generate text that closely resembles
human writing. Notable examples of LLMs include GPT-3
by OpenAI and BERT by Google [
          <xref ref-type="bibr" rid="ref42 ref43 ref44">40–42</xref>
          ]. These models can
perform a wide range of tasks, such as answering queries,
crafting essays, summarizing texts, translating languages,
and even generating creative ideas.
        </p>
        <p>In the realm of data management and data governance,
LLMs can be leveraged in several innovative ways:</p>
        <sec id="sec-3-3-1">
          <title>Data Cataloging</title>
          <p>LLMs can streamline the process of data
cataloging. They can read and comprehend
the metadata associated with various data assets
and generate descriptions or tags for these assets,
thereby automating a traditionally manual
process.</p>
          <p>Data Quality: LLMs can play a pivotal role in
enhancing data quality. They can be trained to
identify and flag potential errors or
inconsistencies in data, facilitating proactive data
quality management.</p>
          <p>Data Privacy: LLMs can contribute to data privacy
efforts by identifying and redacting sensitive
information in datasets, thereby helping
organizations comply with data privacy
regulations.</p>
          <p>Data Integration: LLMs can aid in data integration
tasks. They can understand the context and
semantics of different data sources and assist in
mapping them to a common model, simplifying
the integration process.</p>
          <p>Choosing the right LLM for data management and data
governance depends on various factors, including the
specific requirements of the tasks, the size and complexity
of the data, the computational resources available, and the
expertise of the team.</p>
          <p>Vector search, or nearest neighbor search, is a
powerful technique utilized in machine learning and data
science to identify items that are most similar to a given
item. This method operates by representing items as vectors
in a multi-dimensional space. Each point in this space
corresponds to a potential item, and the position of that
point is determined by the characteristics of the item.</p>
          <p>The principle behind vector search is that similar items will
be located near each other in this space, while dissimilar items
will be further apart. When a new item is introduced, it is also
converted into a vector and placed into this space. The
algorithm then searches for vectors that are close to the new
vector, with the “closeness” being determined by a distance
metric such as Euclidean distance or cosine similarity.</p>
          <p>This technique is particularly useful when dealing with
large datasets, as it allows for efficient searching and
retrieval of items. It’s commonly used in recommendation
systems, image recognition, and natural language
processing among other applications.</p>
          <p>For instance, in a movie recommendation system, each
movie could be represented as a vector where each
dimension corresponds to a different genre. A romance
movie would be located closer to other romance movies and
further from action movies. When a user rates a movie, the
system can look for other movies that are close in the vector
space to recommend to the user.</p>
          <p>In essence, vector search is a method of transforming
complex, abstract items into a format that can be easily and
efficiently compared, enabling the rapid retrieval of similar
items from large datasets.</p>
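          <p>A brute-force version of this movie example can be sketched with plain cosine similarity; the genre vectors below are invented for illustration, and production systems would use an approximate index rather than scanning every item.</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical
    direction, 0.0 means orthogonal (completely dissimilar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, catalog):
    # Brute-force nearest-neighbor search over the whole catalog;
    # approximate indexes replace this scan at scale.
    return max(catalog, key=lambda name: cosine_similarity(query, catalog[name]))

# Toy movie vectors over two dimensions (romance, action), as in the text.
catalog = {
    "Casablanca": [0.9, 0.1],
    "Die Hard": [0.1, 0.9],
}
liked = [0.8, 0.2]  # a viewer who rated a romance-heavy movie highly
best = nearest(liked, catalog)
```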
        </sec>
        <sec id="sec-3-3-2">
          <title>Advantages of Vector Search</title>
          <p>High accuracy: Vector search can provide highly
accurate results because it considers the
relationships between different features of the
data. By representing data in a high-dimensional
space, it captures the nuances and complexities of
the data that might be missed by other methods.</p>
          <p>Scalability: Vector search is highly scalable and
can handle large amounts of data efficiently. This
makes it suitable for big data applications where
traditional search methods may be impractical.</p>
          <p>Flexibility: Vector search is highly flexible and can
be used with any data that can be represented as
a vector. This includes text, images, audio, and
more, making it applicable to a wide range of tasks
and industries.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>Disadvantages of Vector Search</title>
          <p>Computational complexity: Vector search can be
computationally intensive, especially when
dealing with high-dimensional data or large
datasets. This can make it slower than other
methods, particularly for real-time applications.</p>
          <p>Difficulty in choosing the right distance metric:
The effectiveness of vector search heavily
depends on the choice of distance metric, which
can be challenging to determine. The choice of
metric can significantly impact the results, and
there is often no one-size-fits-all solution.</p>
          <p>Sensitivity to Noise: Vector search can be
sensitive to noise in the data. Outliers or errors in
the data can affect the distance calculations and
lead to inaccurate results.</p>
          <p>
            Embeddings are a key component of vector search. In
machine learning, an embedding is a learned representation
for some specific type of data, such as words, users, or
products, where similar items have a similar representation.
They are used to convert categorical data into a form that
can be input into a model. Embeddings are particularly
useful for dealing with high-dimensional data, as they can
reduce the dimensionality of the data while preserving its
structure and relationships [
            <xref ref-type="bibr" rid="ref45 ref46 ref47">43–45</xref>
            ].
          </p>
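          <p>A minimal sketch of the embedding idea (the vocabulary, dimension, and vectors are illustrative; in practice the vectors are learned during training): a lookup table maps each categorical value to a dense vector whose size stays fixed as the vocabulary grows, reducing dimensionality relative to a one-hot encoding.</p>

```python
# Sketch of an embedding as a lookup table: each categorical value maps to a
# dense vector far smaller than a one-hot encoding of the vocabulary.
# The vectors here are random placeholders; real embeddings are learned.
import random

vocab = ["invoice", "receipt", "contract", "memo"]   # categorical data
embedding_dim = 3                                    # much smaller than real vocabularies

random.seed(0)
embedding_table = {word: [random.uniform(-1, 1) for _ in range(embedding_dim)]
                   for word in vocab}

def embed(word):
    """Convert a categorical value into a dense vector a model can consume."""
    return embedding_table[word]

one_hot_size = len(vocab)            # one-hot width grows with the vocabulary
dense_size = len(embed("invoice"))   # embedding width stays at embedding_dim
print(one_hot_size, dense_size)      # → 4 3
```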
          <p>Prompts play a crucial role in the functioning of Large
Language Models (LLMs) like GPT-3. A prompt is
essentially an input that is given to the model to guide its
output. It can be a question, a statement, or any piece of text.
The LLM generates a response to the prompt based on the
patterns it learned during its training on a large corpus of
text data.</p>
          <p>Prompts are valuable because they allow us to direct the
model’s output. By carefully crafting our prompts, we can
guide the model to generate useful and relevant responses.
For instance, if we’re using an LLM to write an email, we
might prompt it with “Dear [Recipient’s Name], I am
writing to inform you that...” and the model could generate
the rest of the email.</p>
          <p>In the context of data management, prompts can be used
to extract or generate specific pieces of information from or
about our data. For example, we could prompt an LLM with
a question about our data, such as “What is the average
value of column X?” or “How many entries in column Y are
above Z?”. The LLM could then generate a response based
on its understanding of the data.</p>
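          <p>A hypothetical sketch of such a data-question prompt (the table, column names, and prompt wording are assumptions, and no real LLM is called): the data is serialized into the prompt, and simple aggregates can be computed directly to check the model's answer.</p>

```python
# Hypothetical sketch: building a prompt that asks an LLM a question about
# tabular data. The rows and wording are placeholders, not a real API.
rows = [
    {"X": 10, "Y": 5},
    {"X": 20, "Y": 9},
    {"X": 30, "Y": 2},
]

def data_question_prompt(rows, question):
    """Serialize the table into the prompt so the model can reason over it."""
    header = "You are given the following table:\n"
    table = "\n".join(str(r) for r in rows)
    return f"{header}{table}\n\nQuestion: {question}\nAnswer:"

prompt = data_question_prompt(rows, "What is the average value of column X?")
print(prompt)

# For simple aggregates, the true answer can be computed directly as a check:
expected = sum(r["X"] for r in rows) / len(rows)  # 20.0
```

          <p>Grounding the prompt in the actual rows, and verifying aggregates independently, helps catch cases where the model's generated answer drifts from the data.</p>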
          <p>Prompts can also be used to generate metadata for our
data. For instance, we could prompt the LLM with a piece of
data and ask it to generate a description or a set of tags for
that data. This could be particularly useful for tasks like data
cataloging, where we need to generate human-readable
descriptions or annotations for large amounts of data.</p>
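          <p>A hypothetical sketch of a metadata-generation prompt for data cataloging (the field names and instruction wording are illustrative assumptions):</p>

```python
# Hypothetical sketch: a prompt template asking an LLM to generate catalog
# metadata (a description and tags) for a data asset.
def tagging_prompt(record):
    return (
        "Generate a one-sentence description and 3 tags for this data asset.\n"
        f"Name: {record['name']}\n"
        f"Columns: {', '.join(record['columns'])}\n"
        "Respond as: description: <text>; tags: <tag1>, <tag2>, <tag3>"
    )

asset = {"name": "customer_payments", "columns": ["customer_id", "amount", "paid_at"]}
print(tagging_prompt(asset))
```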
          <p>
            However, it’s important to note that the effectiveness of
prompts depends on the quality of the LLM’s training. If the
LLM has not been trained on relevant data, or if it has not
been trained to understand the specific format or context of
the prompts, it may not generate useful responses.
Therefore, careful prompt design and model training are
crucial for getting the most value out of LLMs in data
management [
            <xref ref-type="bibr" rid="ref48">46</xref>
            ].
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>In conclusion, the paper
discusses the importance of information classification in the
context of SOC 2 Type II compliance. Information
classification serves as the foundation for data security and
privacy controls, helping organizations identify and protect
their most sensitive data. By effectively classifying their
data, organizations can ensure its security, meet regulatory
requirements, and ultimately, safeguard their reputation
and business continuity. To optimize and increase efficiency
in the classification and organization of data by SOC 2 Type
II standards, it is proposed to apply Large Language Models
in this model. LLMs like GPT-3 and BERT, trained on
extensive text data, are transforming data management and
governance, areas crucial for SOC 2 Type II compliance.
LLMs respond to prompts, guiding their output generation,
and can automate tasks like data cataloging, enhancing data
quality, ensuring data privacy, and assisting in data
integration. These capabilities can support a robust data
classification policy, a key requirement for SOC 2 Type II.</p>
      <p>Vector search, another important method in data
management, finds similar items to a given item by
representing them as vectors in a high-dimensional space. It
offers high accuracy, scalability, and flexibility, supporting
efficient data classification. Embeddings, which convert
categorical data into a form that can be input into a model,
play a key role in vector search and LLMs.</p>
      <p>Prompt engineering, the crafting of effective prompts, is
crucial for guiding LLMs’ output and for further enhancing
data management and governance practices.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>B.</given-names>
            <surname>Matturdi</surname>
          </string-name>
          , et al.,
          <article-title>Big Data security and privacy: A review</article-title>
          ,
          <source>China Communications</source>
          ,
          <volume>11</volume>
          (
          <issue>14</issue>
          ) (
          <year>2014</year>
          )
          <fpage>135</fpage>
          -
          <lpage>145</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>doi: 10.1109/CC.2014.7085614.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Susukailo</surname>
          </string-name>
          , I. Opirskyy,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vasylyshyn</surname>
          </string-name>
          ,
          <article-title>Analysis of the attack vectors used by threat actors during the pandemic</article-title>
          ,
          <source>IEEE 15th International Scientific and Technical Conference on Computer Sciences and Information Technologies, CSIT 2020 - Proceedings</source>
          ,
          <volume>2</volume>
          (
          <year>2020</year>
          )
          <fpage>261</fpage>
          -
          <lpage>264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Islam</surname>
          </string-name>
          , et al.,
          <article-title>Security threats for big data: An empirical study</article-title>
          ,
          <source>Int. J. Inf. Commun. Technol. Human Dev</source>
          . (IJICTHD)
          <volume>10</volume>
          (
          <issue>4</issue>
          ) (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Namasudra</surname>
          </string-name>
          ,
          <article-title>DNACDS: Cloud IoE big data security and accessing scheme based on DNA cryptography</article-title>
          ,
          <source>Frontiers Comput. Sci</source>
          .
          <volume>18</volume>
          (
          <issue>1</issue>
          ) (
          <year>2024</year>
          )
          <fpage>181801</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>O. I.</given-names>
            <surname>Harasymchuk</surname>
          </string-name>
          , et al.,
          <article-title>Generator of pseudorandom bit sequence with increased cryptographic security</article-title>
          ,
          <source>Metallurgical and Mining Industry: Sci. Tech. J. 5</source>
          (
          <year>2014</year>
          )
          <fpage>25</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Dudykevych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mykytyn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ruda</surname>
          </string-name>
          ,
          <article-title>The concept of a deepfake detection system of biometric image modifications based on neural networks</article-title>
          ,
          <source>in: 3rd KhPI Week on Advanced Technology (KhPIWeek)</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          . doi: 10.1109/KhPIWeek57572.2022.9916378.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>O.</given-names>
            <surname>Vakhula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Opirskyy</surname>
          </string-name>
          ,
          <string-name>
            <surname>O. Mykhaylova,</surname>
          </string-name>
          <article-title>Research on Security Challenges in Cloud Environments and Solutions based on the “Security-as-Code” Approach</article-title>
          , in: Cybersecurity
          <source>Providing in Information and Telecommunication Systems</source>
          , vol.
          <volume>3550</volume>
          (
          <year>2023</year>
          )
          <fpage>55</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Maksymovych</surname>
          </string-name>
          , et al.,
          <article-title>Development of Additive Fibonacci Generators with Improved Characteristics for Cybersecurity Needs</article-title>
          ,
          <source>Appl. Sci.</source>
          <volume>12</volume>
          (
          <issue>3</issue>
          ) (
          <year>2022</year>
          )
          <fpage>1519</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          doi: 10.3390/app12031519.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V.</given-names>
            <surname>Maksymovych</surname>
          </string-name>
          , et al.,
          <article-title>Combined Pseudo-Random Sequence Generator for Cybersecurity</article-title>
          ,
          <source>Sensors</source>
          <volume>22</volume>
          (
          <year>2022</year>
          )
          <fpage>9700</fpage>
          . doi: 10.3390/s22249700.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [10] SOC 2 Compliance Documentation. URL: https://secureframe.com/hub/soc-2/compliancedocumentation
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [11] ISO/IEC 27001:
          <year>2022</year>
          URL: https://www.iso.org/standard/27001
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Maksymovych</surname>
          </string-name>
          , et al.,
          <article-title>Simulation of Authentication in Information-Processing Electronic Devices Based on Poisson Pulse Sequence Generators</article-title>
          ,
          <source>Electronics</source>
          <volume>11</volume>
          (
          <issue>13</issue>
          ) (
          <year>2022</year>
          ). doi: 10.3390/electronics11132039.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>O.</given-names>
            <surname>Deineka</surname>
          </string-name>
          , et al.,
          <article-title>Designing Data Classification and Secure Store Policy According to SOC 2 Type II</article-title>
          ,
          <source>in: Cybersecurity Providing in Information and Telecommunication Systems</source>
          , vol.
          <volume>3654</volume>
          (
          <year>2024</year>
          )
          <fpage>398</fpage>
          -
          <lpage>409</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>O.</given-names>
            <surname>Mykhaylova</surname>
          </string-name>
          , et al.,
          <article-title>Mobile Application as a Critical Infrastructure Cyberattack Surface</article-title>
          ,
          <source>in: Cybersecurity Providing in Information and Telecommunication Systems</source>
          , vol.
          <volume>3550</volume>
          (
          <year>2023</year>
          )
          <fpage>29</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>An Improved Data Backup Scheme Based on Multi-Factor Authentication</article-title>
          ,
          <source>in: IEEE 9th Intl Conference on Big Data Security on Cloud (BigDataSecurity)</source>
          ,
          <source>IEEE Intl Conference on High Performance and Smart Computing</source>
          ,
          <source>(HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS)</source>
          (
          <year>2023</year>
          ). doi: 10.1109/BigDataSecurity-HPSCIDS58521.2023.00041.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Shevchuk</surname>
          </string-name>
          , et al.,
          <article-title>Designing Secured Services for Authentication, Authorization, and Accounting of Users</article-title>
          ,
          <source>in: Cybersecurity Providing in Information and Telecommunication Systems II</source>
          , vol.
          <volume>3550</volume>
          (
          <year>2023</year>
          )
          <fpage>217</fpage>
          -
          <lpage>225</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Martseniuk</surname>
          </string-name>
          , et al.,
          <article-title>Automated Conformity Verification Concept for Cloud Security</article-title>
          ,
          <source>in: Cybersecurity Providing in Information and Telecommunication Systems</source>
          , vol.
          <volume>3654</volume>
          (
          <year>2024</year>
          )
          <fpage>25</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Horpenyuk</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Opirskyy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vorobets</surname>
          </string-name>
          ,
          <article-title>Analysis of Problems and Prospects of Implementation of PostQuantum Cryptographic Algorithms</article-title>
          ,
          <source>in: Classic, Quantum, and Post-Quantum Cryptography</source>
          , vol.
          <volume>3504</volume>
          (
          <year>2023</year>
          )
          <fpage>39</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Calder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Watkins</surname>
          </string-name>
          ,
          <article-title>IT Governance: An International Guide to Data Security and ISO27001/ISO27002</article-title>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [20] AICPA,
          <article-title>SOC 2® - SOC for Service Organizations: Trust Services Criteria</article-title>
          . URL: https://us.aicpa.org/interestareas/frc/assuranceadvisoryservices/soc-forservice-organizations
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [21]
          <article-title>IS Audit Basics: The Domains of Data and Information Audits</article-title>
          . URL: https://www.isaca.org/resources/isacajournal/issues/2016/volume-6/is-audit-basics-thedomains-of-data-and-information-audits
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [22]
          <article-title>Practical Data Security and Privacy for GDPR and CCPA</article-title>
          , ISACA J.
          <volume>3</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [23]
          <article-title>Boosting Cyber Security with Data Governance and Enterprise Data Management</article-title>
          ,
          <source>ISACA J</source>
          .
          <volume>3</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cannon</surname>
          </string-name>
          ,
          <article-title>IT Service Management: A Guide for ITIL Foundation Exam Candidates</article-title>
          , BCS
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>N.</given-names>
            <surname>Karumanchi</surname>
          </string-name>
          ,
          <article-title>Data Structures and Algorithms Made Easy: Data Structures and Algorithmic Puzzles</article-title>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Watson</surname>
          </string-name>
          ,
          <source>Data Management: Databases and Organizations</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rhodes-Ousley</surname>
          </string-name>
          ,
          <source>Information Security: The Complete Reference, Second Edition</source>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Munawar</surname>
          </string-name>
          ,
          <article-title>Extract Transform Loading (ETL) Based Data Quality for Data Warehouse Development</article-title>
          ,
          <source>in: 1st International Conference on Computer Science and Artificial Intelligence (ICCSAI)</source>
          , Jakarta, Indonesia (
          <year>2021</year>
          )
          <fpage>373</fpage>
          -
          <lpage>378</lpage>
          . doi: 10.1109/ICCSAI53272.2021.9609770.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>V.</given-names>
            <surname>Khoma</surname>
          </string-name>
          , et al.,
          <article-title>Comprehensive Approach for Developing an Enterprise Cloud Infrastructure</article-title>
          ,
          <source>in: Cybersecurity Providing in Information and Telecommunication Systems</source>
          , vol.
          <volume>3654</volume>
          (
          <year>2024</year>
          )
          <fpage>201</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chauhan</surname>
          </string-name>
          , Mastering Apache Airflow (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gaikwad</surname>
          </string-name>
          , Learning AWS Glue (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>D.</given-names>
            <surname>Anoshin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Avdeev</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. van Vliet</surname>
          </string-name>
          ,
          <source>Azure Data Factory Cookbook</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoberman</surname>
          </string-name>
          ,
          <article-title>Data Modeling Made Simple: A Practical Guide for Business and IT Professionals</article-title>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          ,
          <article-title>Data Classification: Algorithms and Applications</article-title>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sharp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Duhamel</surname>
          </string-name>
          ,
          <source>Microsoft Power Platform Enterprise Architecture</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          , et al.,
          <article-title>From Text to Knowledge for the Semantic Web: the ONTOTEXT project</article-title>
          ,
          <source>in: Proceedings of SWAP 2005 Workshop</source>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          ,
          <source>Mining the Web: Discovering Knowledge from Hypertext Data</source>
          , Morgan Kaufmann (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          , et al.,
          <article-title>Exploring the Application of Large Language Models in Detecting and Protecting Personally Identifiable Information in Archival Data: A Comprehensive Study</article-title>
          ,
          <source>in: IEEE International Conference on Big Data (BigData)</source>
          , Sorrento, Italy, (
          <year>2023</year>
          )
          <fpage>2116</fpage>
          -
          <lpage>2123</lpage>
          . doi: 10.1109/BigData59044.2023.10386949.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>A.</given-names>
            <surname>Piskozub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhuravchak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tolkachova</surname>
          </string-name>
          ,
          <article-title>Researching vulnerabilities in chatbots with LLM (Large language model</article-title>
          ),
          <source>Ukrainian Sci. J. Inf. Secur</source>
          .
          <volume>29</volume>
          (
          <issue>9</issue>
          ) (
          <year>2023</year>
          )
          <fpage>111</fpage>
          -
          <lpage>117</lpage>
          . doi: 10.18372/2225-5036.29.18069.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [40]
          <article-title>GPT-3 by OpenAI</article-title>
          . URL: https://openai.com/research/gpt-3/
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [41]
          <article-title>BERT by Google</article-title>
          . URL: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html/
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [42]
          <article-title>Amazon Bedrock - Automating Large-Scale, Fault-Tolerant Distributed Training in the Deep Learning Compiler Stack</article-title>
          . URL: https://aws.amazon.com/blogs/aws/amazon-bedrock-automating-large-scalefault-tolerant-distributed-training-in-the-deeplearning-compiler-stack/
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Manolopoulos</surname>
          </string-name>
          ,
          <source>Nearest Neighbor Search: A Database Perspective</source>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rajaraman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Ullman</surname>
          </string-name>
          , Mining of Massive Datasets (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <source>Deep Learning</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [46]
          <article-title>Teaching with AI</article-title>
          . URL: https://openai.com/blog/ teaching-with-ai
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>