Information classification framework according to SOC 2 Type II

Oleh Deineka1,†, Oleh Harasymchuk1,†, Andrii Partyka1,† and Valerii Kozachok2,*,†

1 Lviv Polytechnic National University, 12 Stepana Bandery str., 79000 Lviv, Ukraine
2 Borys Grinchenko Kyiv Metropolitan University, 18/2 Bulvarno-Kudryavska str., 04053 Kyiv, Ukraine

Abstract
Large Language Models (LLMs) like GPT-3 and BERT, trained on extensive text data, are transforming data management and governance, areas crucial for SOC 2 Type II compliance. LLMs respond to prompts, which guide their output generation, and can automate tasks like data cataloging, enhancing data quality, ensuring data privacy, and assisting in data integration. These capabilities can support a robust data classification policy, a key requirement for SOC 2 Type II. Vector search, another important method in data management, finds items similar to a given item by representing them as vectors in a high-dimensional space. It offers high accuracy, scalability, and flexibility, supporting efficient data classification. Embeddings, which convert categorical data into a form that can be input into a model, play a key role in vector search and LLMs. Prompt engineering, the crafting of effective prompts, is crucial for guiding LLMs' output and further enhancing data management and governance practices.

Keywords
SOC 2 Type II, information classification, data security, LLM, vector search, prompt

CPITS-II 2024: Workshop on Cybersecurity Providing in Information and Telecommunication Systems II, October 26, 2024, Kyiv, Ukraine
∗ Corresponding author.
† These authors contributed equally.
oleh.r.deineka@lpnu.ua (O. Deineka); garasymchuk@ukr.net (O. Harasymchuk); andrijp14@gmail.com (A. Partyka); v.kozachok@kubg.edu.ua (V. Kozachok)
ORCID: 0009-0005-9156-3339 (O. Deineka); 0000-0002-8742-8872 (O. Harasymchuk); 0000-0003-3037-8373 (A. Partyka); 0000-0003-0072-2567 (V. Kozachok)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. SOC 2 Type II

Today's digital age is defined by the exponential growth of information assets, a significant portion of which are critical. The sheer volume of this information necessitates its classification based on various parameters and features, its secure storage and transmission, and its protection against unauthorized access. The frequency of potential attacks on information resources is on the rise [1–3]. To counteract these threats, cybersecurity experts are continually developing new standards, strategies, and techniques, as well as advancing infrastructure [4–9]. A key focus is the creation and research of standards for secure data storage [10–14]. These standards provide insight into how an organization controls data access and ensures its security and confidentiality.

The standards and requirements for data storage can differ between organizations based on factors such as geographical location, industry, sensitivity of the information, and more. Specific organizations may have unique standards and requirements based on their needs and legal obligations.

Most organizations formulate their security policies based on international standards, often with the involvement of external auditing firms that certify standards compliance. However, professionals dealing with secure storage of large data volumes still face numerous challenges, including data integrity, confidentiality, and accessibility. Ensuring that information remains unchanged from creation through storage and retrieval can be a complex task. Additionally, professionals must ensure confidentiality, allowing only authorized individuals to access the data, and guarantee data accessibility when needed, a task that becomes increasingly challenging with growing data volumes.

Despite the existence of various effective strategies, methods, and systems for organizing big data storage, certain problems persist. One significant issue is the difficulty of searching for required information in unstructured data.

ISO 27001 [11] is a standard aimed at ensuring the proper management of a company's digital assets, including financial information, intellectual property, employee data, and trusted third-party information. Meanwhile, SOC 2 certification [10] is more recognized and typically preferred by American and Canadian companies.

SOC is divided into SOC 1, SOC 2, and SOC 3. The first pertains exclusively to financial controls, and the third is primarily used for marketing purposes, allowing SaaS providers to focus solely on SOC 2.
The Service and Organization Controls 2 standard, developed by the American Institute of Certified Public Accountants (AICPA) on the basis of the Trust Services Criteria, provides an independent evaluation of risk management control procedures in IT companies that provide services to users. The standard emphasizes data privacy and confidentiality, making it the choice of giants like Google and Amazon, for whom high security levels and transparent data processing are crucial. External auditors are engaged for certification. Their role is to examine the implemented practices, verify the company's adherence to its procedures, and monitor changes in processes.

SOC 2 Type II is a significant certification in the data security and compliance landscape. It serves as an attestation by an independent auditor that a service organization's systems are not only designed to meet the Trust Services Criteria but also operate effectively over time. The Trust Services Criteria cover several critical areas: security, availability, processing integrity, confidentiality, and privacy.

The value of SOC 2 Type II lies in its ability to foster trust with clients and stakeholders. By demonstrating a commitment to stringent data management practices, companies can assure clients that their sensitive data is managed responsibly. This is particularly important in sectors where data privacy and security are crucial, such as financial services, healthcare, and cloud computing. Furthermore, the audit process for SOC 2 Type II helps organizations identify and mitigate potential security risks, ensuring they maintain a robust security posture. This proactive approach to risk management is vital in an era where cyber threats are continually evolving and data breaches can have devastating consequences. Hence, there is a constant search for new strategies and methods to ensure reliable data storage and the authentication of the users and devices where this data is stored [15–18].

In an increasingly regulated environment, SOC 2 Type II compliance can also support adherence to legal and regulatory requirements, helping organizations avoid expensive penalties and legal issues associated with non-compliance.

From a business perspective, SOC 2 Type II compliance can serve as a competitive differentiator. It signals to the market that an organization is a reliable and secure partner, which can be instrumental in winning new business and retaining existing customers [19].

The outcome of implementing SOC 2 is a report based on the AICPA Attestation Standards, section 101, Attest Engagements.

2. SOC 2 Type II information classification and data classification policy

2.1. Requirements

SOC 2 Type II, while not prescribing specific data classification policies, mandates that organizations effectively manage and safeguard the confidentiality, privacy, and security of information in line with the Trust Services Criteria (TSC). A Data Classification Policy is essential in meeting these criteria, especially the Security criterion, which is common to all SOC 2 audits. A SOC 2 audit evaluates the effectiveness of an organization's processes and systems based on the Trust Services Criteria and checks compliance with information security standards and regulations, including Common Criteria standards.

To support SOC 2 Type II compliance, a Data Classification Policy should address several general requirements, starting with the identification of data types. The policy should define the types of data the organization handles, including sensitive data subject to SOC 2 considerations, such as personally identifiable information (PII), business confidential data, and intellectual property. The policy must also establish clear classification levels that reflect the sensitivity of the data, with common levels including Public, Internal Use Only, Confidential, and Highly Confidential. Additionally, the policy should define roles and responsibilities for data classification, including data owners, custodians, and users, and outline their responsibilities in maintaining data classification.
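To make such levels actionable in tooling, they can be encoded directly, together with a handling matrix. The following is a minimal Python sketch: the four levels follow the policy described above, but the concrete control values (encryption flags, retention periods) are illustrative assumptions for the sketch, not requirements taken from SOC 2 or the TSC.

```python
from enum import IntEnum

class Classification(IntEnum):
    """Sensitivity levels from the policy above, ordered least to most sensitive."""
    PUBLIC = 0
    INTERNAL_USE_ONLY = 1
    CONFIDENTIAL = 2
    HIGHLY_CONFIDENTIAL = 3

# Illustrative handling matrix; the values are assumptions, not SOC 2 mandates.
HANDLING = {
    Classification.PUBLIC: {"encrypt_at_rest": False, "retention_days": None},
    Classification.INTERNAL_USE_ONLY: {"encrypt_at_rest": False, "retention_days": 1825},
    Classification.CONFIDENTIAL: {"encrypt_at_rest": True, "retention_days": 1095},
    Classification.HIGHLY_CONFIDENTIAL: {"encrypt_at_rest": True, "retention_days": 365},
}

print(HANDLING[Classification.CONFIDENTIAL])
# {'encrypt_at_rest': True, 'retention_days': 1095}
```

Using an ordered enum rather than free-form labels makes the levels comparable, which the access-control sketch in the next subsection relies on as well.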
A Data Classification Policy for SOC 2 Type II compliance should specify handling requirements for each classification level, including storage, transmission, access controls, encryption standards, and end-of-life procedures. The policy should also provide guidelines on how data should be labeled or marked according to its classification, so that it is easily identifiable and handled appropriately. Access controls must be addressed, ensuring that access to data is based on the principle of least privilege and that only authorized individuals can access sensitive data. The policy should outline data retention periods and secure disposal methods for each classification level, ensuring data is not kept longer than necessary and is disposed of securely. Regular training and awareness programs should be mandated so that employees understand the importance of data classification and their role in it. The policy should include provisions for regular auditing and monitoring to verify that classification controls are effective and being followed. The policy should be linked to an incident response plan that addresses potential data breaches or loss, with procedures tailored to the classification level of the data involved. Finally, the policy should specify intervals for reviewing and updating data classification procedures to ensure they remain relevant and effective as the organization evolves, data volumes increase, and new threats emerge.

If data is shared with or handled by third-party vendors, the data classification policy must extend to these vendors, often requiring them to adhere to similar or compatible information classification and handling standards. Ensuring alignment with SOC 2 Type II requirements when developing a Data Classification Policy usually demands a comprehensive understanding of the AICPA's TSC and the unique data protection requirements of the organization.
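Because the classification levels are ordered, the least-privilege rule reduces to a comparison between a subject's clearance and an asset's level. A minimal, self-contained sketch; the string-based clearance model is an assumption for illustration, not a prescribed mechanism.

```python
ORDER = {"Public": 0, "Internal Use Only": 1, "Confidential": 2, "Highly Confidential": 3}

def can_access(user_clearance: str, data_level: str) -> bool:
    """Least privilege: a subject may read data only at or below their clearance."""
    return ORDER[user_clearance] >= ORDER[data_level]

# A user cleared for Confidential data cannot read Highly Confidential data.
assert can_access("Confidential", "Internal Use Only")
assert not can_access("Confidential", "Highly Confidential")
```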
Engaging seasoned compliance experts or auditors who can give tailored advice and oversee compliance with the standard's stipulations is highly recommended. The AICPA's guidance, together with frameworks such as ISO 27001, can offer invaluable input for the creation and maintenance of a strong data classification policy. It is crucial to identify and categorize data based on its sensitivity, importance, and regulatory mandates. Moreover, regular reviews and updates of the policy should be conducted to ensure its efficiency and continued compliance with SOC 2 Type II requirements [20–24].

2.2. Design

We therefore offer a Data Flow Diagram (Figure 1).

Figure 1: Data flow diagram

Creating a Data Flow Diagram necessitates an initial comprehensive grasp of the various data types that your company possesses. Typically, data can be divided into three main categories: structured, semi-structured, and unstructured.

Data that is organized in a prearranged manner, such as the data stored in a relational database, is referred to as structured data. Its consistent format makes structured data easy to search, analyze, and manipulate. Semi-structured data, on the other hand, has a certain level of organization but lacks a strict format. XML and JSON files, which house data in a hierarchical format without a fixed schema, are examples of semi-structured data. Unstructured data is characterized by its lack of inherent structure or organization. This category includes text documents, images, and videos. The inconsistent format of unstructured data can pose challenges when it comes to searching, analyzing, and manipulating it [25, 26].

After the identification of the company's data types, the subsequent phase involves gaining an understanding of the metadata linked to that data. Metadata is essentially data that offers information about other data. For example, the metadata linked to a text document could include details such as the author, the date of creation, and the file size. A deep understanding of the metadata associated with your data can facilitate better organization, management, and analysis of your data [27].
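For file-based assets, much of this technical metadata can be collected automatically. A small sketch using only the Python standard library; note that author information is usually not stored by the filesystem and would have to come from the document's own properties, so it is omitted here.

```python
from datetime import datetime, timezone
from pathlib import Path

def file_metadata(path: str) -> dict:
    """Collect basic technical metadata: name, size, and last-modified time."""
    p = Path(path)
    st = p.stat()
    return {
        "name": p.name,
        "size_bytes": st.st_size,
        "modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc).isoformat(),
    }

print(file_metadata("report.txt"))  # assumes report.txt exists in the working directory
```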
Once you have identified the types of data your company owns and the metadata associated with that data, the process of creating a Data Flow Diagram continues with the utilization of integration tools to manage and store your data. Integration tools facilitate the extraction of data from various sources, its transformation into a common format, and its loading into a data store. This process, known as Extract, Transform, Load (ETL), consolidates your data into a single location, simplifying its management and analysis [28–32].
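A minimal ETL sketch using only the Python standard library: records are extracted from a CSV file, transformed into a common format, and loaded into a SQLite data store. In practice this role is played by tools such as Apache Airflow, AWS Glue, or Azure Data Factory [30–32]; the file and column names here are illustrative assumptions.

```python
import csv
import sqlite3

def etl(csv_path: str, db_path: str) -> None:
    # Extract: read raw records from the source file.
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    # Transform: normalize into a common format (trim names, lowercase emails).
    records = [(r["name"].strip(), r["email"].strip().lower()) for r in rows]

    # Load: consolidate into a single data store.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, email TEXT)")
    con.executemany("INSERT INTO customers VALUES (?, ?)", records)
    con.commit()
    con.close()

etl("customers.csv", "warehouse.db")  # assumes customers.csv has name,email columns
```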
Following the extraction, transformation, and loading of your data into a data store, the subsequent phase involves creating a data model. A data model is a visual depiction of the relationships between different data elements. It provides a structure for organizing and structuring your data and can assist in identifying patterns and trends within your data [33].

Once a data model has been created, the next step involves classifying your data and linking it to the associated metadata. This involves assigning a sensitivity level to your data based on its importance and the potential impact if it were to be lost or stolen. After your data has been classified, it can be linked to the associated metadata, providing additional context and information about the data [34].
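A sketch of this step: an assessed impact score is mapped to a sensitivity level, and the result is linked to the asset's metadata as a catalog entry. The impact scale and the mapping are assumptions for illustration; a real policy defines its own assessment criteria.

```python
LEVELS = ["Public", "Internal Use Only", "Confidential", "Highly Confidential"]

def classify_by_impact(impact: int) -> str:
    """Map an assessed loss/theft impact score (0-3, illustrative) to a level."""
    return LEVELS[impact]

def catalog_entry(asset: str, metadata: dict, impact: int) -> dict:
    """Link the classified asset to its metadata for the data catalog."""
    return {"asset": asset, "classification": classify_by_impact(impact), **metadata}

print(catalog_entry("customers", {"owner": "sales", "format": "table"}, impact=2))
# {'asset': 'customers', 'classification': 'Confidential', 'owner': 'sales', 'format': 'table'}
```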
The final phase in creating a Data Flow Diagram involves creating an application that enables you to visualize and manage your data. This application should offer a user-friendly interface for accessing, analyzing, and manipulating your data. It should also incorporate logic for managing access, requests, and incidents, and should be integrated with your ITSM system to ensure that data is handled according to your company's policies and procedures [35].

This solution offers numerous advantages over traditional product-based offerings from various companies. One of the primary benefits is the flexibility to choose the hosting environment that best suits your needs, whether on-premise or cloud-based. This allows you to align the solution with your operational requirements and infrastructure capabilities. Additionally, you have the liberty to select the technology stack that best fits your project. This means that you are not confined to a predetermined set of technologies but can customize the solution to leverage the most relevant and efficient tools for your specific needs.

In terms of team composition, you have the flexibility to assemble a team that is uniquely suited to the project at hand. This ensures that the right expertise and skills are applied to deliver the best possible outcomes. Another advantage is flexibility in budgeting. Unlike vendor-specific solutions that may come with fixed licensing costs, the budget for this solution can be adjusted according to your financial capacity and project requirements. This can result in significant cost savings without compromising quality or performance. Lastly, this solution offers robust change and feature management capabilities: it can easily adapt to evolving business needs, incorporating new features and making necessary changes in a timely and efficient manner. This flexibility ensures the solution remains relevant and continues to deliver value over time [13].

3. Information classification

3.1. Overview

Information classification is a critical process in data management that involves categorizing data based on its sensitivity, importance, and regulatory requirements. This process is essential for organizations to effectively protect their data and comply with various legal, regulatory, and contractual obligations.

The primary goal of information classification is to facilitate appropriate levels of protection for different types of data. By classifying data as public, internal, confidential, or highly confidential, organizations can apply suitable security measures to each category, ensuring that sensitive and critical data receives the highest level of protection.

Information classification is not a one-time activity but a continuous process that needs to be integrated into the organization's data lifecycle. It involves identifying the types of data the organization handles, defining classification levels, assigning responsibilities for data classification, and implementing procedures for handling, storing, and disposing of data based on its classification.

In addition to enhancing data security, information classification also aids in risk management, regulatory compliance, and resource allocation. It helps organizations understand where their most sensitive and valuable data resides, who has access to it, and how it is being protected, enabling them to identify and mitigate potential risks. It also supports compliance with regulations such as GDPR, HIPAA, and SOC 2, which require organizations to implement appropriate safeguards for sensitive data. Furthermore, by identifying less sensitive data that requires lower levels of protection, organizations can optimize their use of resources.

In today's data-driven world, where vast volumes of data are generated and processed every day, information classification has become more important than ever. It is a fundamental step in ensuring that all data is given the appropriate level of protection and handled responsibly throughout its lifecycle.

3.2. Importance

The importance of Information Classification in the context of SOC 2 Type II compliance cannot be overstated. It serves as the foundation for data security and privacy controls, helping organizations identify and protect their most sensitive data.

Firstly, Information Classification helps in identifying the types of data an organization handles, including sensitive data subject to SOC 2 considerations, such as personally identifiable information (PII), confidential business data, and intellectual property. This identification is the first step towards implementing appropriate security measures.

Secondly, Information Classification aids in establishing clear classification levels that reflect the sensitivity of the data. These levels, which commonly include Public, Internal Use Only, Confidential, and Highly Confidential, guide the implementation of access controls, encryption standards, and other security measures.

Thirdly, Information Classification supports the assignment of roles and responsibilities for data classification, including data owners, custodians, and users. This clear delineation of responsibilities ensures accountability and promotes adherence to data security policies.

Lastly, Information Classification facilitates compliance with legal and regulatory requirements, including those stipulated by SOC 2 Type II.

Information Content Extraction is a crucial process in data management that involves retrieving structured information from unstructured or semi-structured data sources. This process is essential for transforming raw data into meaningful and actionable insights.

Structured data is data that is organized into a formatted structure, often a relational database. This type of data is readily searchable by simple, straightforward search engine algorithms or other search operations. Semi-structured data is a form of structured data that does not adhere to the formal structure of data models associated with relational databases or other forms of data tables but contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Examples of semi-structured data include XML and JSON files. Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. This type of data is typically text-heavy but may contain dates, numbers, and facts as well. Examples of unstructured data include text files, PDFs, and BLOBs (Binary Large Objects).

Information extraction from these types of data involves several steps, including text preprocessing, entity recognition, relation extraction, and event extraction. Text preprocessing involves cleaning and normalizing the text, removing stop words, and stemming or lemmatizing words. Entity recognition identifies entities such as names, locations, and dates in the text. Relation extraction identifies relationships between these entities, and event extraction identifies events in which these entities are involved.

3.3. Framework

Figure 2: Information classification

There are several approaches to information extraction:

1. Rule-based methods: These methods use a set of predefined rules or patterns to extract information. For example, a rule might specify that if a word is capitalized and followed by a certain verb, it is likely a person's name. While rule-based methods can be very accurate, they are also labor-intensive and may not generalize well to new data.
2. Machine learning methods: These methods use algorithms to learn patterns from labeled training data and apply these patterns to new data. For example, a machine learning model might learn that words that often appear in the same context as known person names are likely to be person names themselves. Machine learning methods can be very effective, especially with large amounts of training data, but they can also be complex and computationally intensive.
3. Hybrid methods: These methods combine rule-based and machine-learning methods to leverage the strengths of both. For example, a hybrid method might use rules to extract easy-to-identify information and machine learning to extract more complex information [36, 37], as the sketch below illustrates.

However, there are no clear recommendations regarding the implementation of a specific method; the choice should be made considering a large number of factors.
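The sketch contrasts the two ingredients of a hybrid extractor: a regular expression catches easy-to-identify items (email addresses), while a pretrained statistical model recognizes entities such as person names. It assumes spaCy and its small English model (en_core_web_sm) are installed; any comparable NER library would serve the same purpose.

```python
import re
import spacy  # assumes: pip install spacy && python -m spacy download en_core_web_sm

text = "Contact John Smith at john.smith@example.com about the Kyiv audit."

# Rule-based: a predefined pattern extracts well-structured items.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)

# Machine learning: a pretrained model recognizes named entities.
nlp = spacy.load("en_core_web_sm")
entities = [(ent.text, ent.label_) for ent in nlp(text).ents]

print(emails)    # ['john.smith@example.com']
print(entities)  # e.g. [('John Smith', 'PERSON'), ('Kyiv', 'GPE')]
```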
Large Language Models (LLMs) [38, 39] represent a significant advancement in the field of artificial intelligence. These models are trained on extensive volumes of text data, enabling them to generate text that closely resembles human writing. Notable examples of LLMs include GPT-3 by OpenAI and BERT by Google [40–42]. These models can perform a wide range of tasks, such as answering queries, crafting essays, summarizing texts, translating languages, and even generating creative ideas.

In the realm of data management and data governance, LLMs can be leveraged in several innovative ways:

1. Data Cataloging: LLMs can streamline the process of data cataloging. They can read and comprehend the metadata associated with various data assets and generate descriptions or tags for these assets, thereby automating a traditionally manual process (see the sketch after this list).
2. Data Quality: LLMs can play a pivotal role in enhancing data quality. They can be trained to identify and flag potential errors or inconsistencies in data, facilitating proactive data quality management.
3. Data Privacy: LLMs can contribute to data privacy efforts by identifying and redacting sensitive information in datasets, thereby helping organizations comply with data privacy regulations.
4. Data Integration: LLMs can aid in data integration tasks. They can understand the context and semantics of different data sources and assist in mapping them to a common model, simplifying the integration process.

Choosing the right LLM for data management and data governance depends on various factors, including the specific requirements of the tasks, the size and complexity of the data, the computational resources available, and the expertise of the team.
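As an illustration of the data cataloging use case, the sketch below builds a prompt from a table's technical metadata and asks an LLM to propose a description and tags. The llm_complete function is a hypothetical placeholder standing in for whichever model API is used (for example, a GPT-3-class completion endpoint); it is not a real library call.

```python
def llm_complete(prompt: str) -> str:
    """Hypothetical placeholder for a call to an LLM completion API."""
    raise NotImplementedError("wire this to your model provider")

def catalog_prompt(table: str, columns: list[str]) -> str:
    """Turn table metadata into a cataloging instruction for the model."""
    return (
        "You are a data steward. Given a table name and its columns, "
        "write a one-sentence description and propose 3 tags.\n"
        f"Table: {table}\nColumns: {', '.join(columns)}\n"
        "Answer as: description: ...; tags: ..."
    )

prompt = catalog_prompt("customers", ["name", "email", "signup_date"])
# description_and_tags = llm_complete(prompt)
```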
Vector search, or nearest neighbor search, is a powerful technique utilized in machine learning and data science to identify the items most similar to a given item. This method operates by representing items as vectors in a multi-dimensional space. Each point in this space corresponds to a potential item, and the position of that point is determined by the characteristics of the item.

The principle behind vector search is that similar items will be located near each other in this space, while dissimilar items will be further apart. When a new item is introduced, it is also converted into a vector and placed into this space. The algorithm then searches for vectors that are close to the new vector, with "closeness" determined by a distance metric such as Euclidean distance or cosine similarity.

This technique is particularly useful when dealing with large datasets, as it allows for efficient searching and retrieval of items. It is commonly used in recommendation systems, image recognition, and natural language processing, among other applications.

For instance, in a movie recommendation system, each movie could be represented as a vector where each dimension corresponds to a different genre. A romance movie would be located closer to other romance movies and further from action movies. When a user rates a movie, the system can look for other movies that are close in the vector space to recommend to the user.

In essence, vector search is a method of transforming complex, abstract items into a format that can be easily and efficiently compared, enabling the rapid retrieval of similar items from large datasets.

Advantages of Vector Search:

1. High Accuracy: Vector search can provide highly accurate results because it considers the relationships between different features of the data. By representing data in a high-dimensional space, it captures nuances and complexities of the data that might be missed by other methods.
2. Scalability: Vector search is highly scalable and can handle large amounts of data efficiently. This makes it suitable for big data applications where traditional search methods may be impractical.
3. Flexibility: Vector search is highly flexible and can be used with any data that can be represented as a vector. This includes text, images, audio, and more, making it applicable to a wide range of tasks and industries.

Disadvantages of Vector Search:

1. Computational Complexity: Vector search can be computationally intensive, especially when dealing with high-dimensional data or large datasets. This can make it slower than other methods, particularly for real-time applications.
2. Difficulty in Choosing the Right Distance Metric: The effectiveness of vector search heavily depends on the choice of distance metric, which can be challenging to determine. The choice of metric can significantly impact the results, and there is often no one-size-fits-all solution.
3. Sensitivity to Noise: Vector search can be sensitive to noise in the data. Outliers or errors in the data can affect the distance calculations and lead to inaccurate results.
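A minimal nearest-neighbor sketch with NumPy, following the movie example above: each movie is a vector of genre weights, and cosine similarity ranks how close the other items are to a query vector. The genre axes and weights are illustrative assumptions; production systems would replace this brute-force scan with an approximate index such as a dedicated vector database.

```python
import numpy as np

# Genre axes (illustrative): [romance, action, comedy]
movies = {
    "Love Letters": np.array([0.9, 0.1, 0.3]),
    "Fast Pursuit": np.array([0.1, 0.9, 0.2]),
    "Paris Nights": np.array([0.8, 0.2, 0.5]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means identical direction, 0.0 means orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(query: np.ndarray, k: int = 2) -> list[tuple[str, float]]:
    """Brute-force vector search: rank all items by similarity to the query."""
    scores = {name: cosine(query, vec) for name, vec in movies.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(nearest(np.array([1.0, 0.0, 0.4])))  # a romance-leaning query
```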
LLMs like GPT-3 and BERT, trained on documentation extensive text data, are transforming data management and [11] ISO/IEC 27001:2022 URL: https://www.iso.org/ governance, areas crucial for SOC 2 Type II compliance. standard/27001 LLMs respond to prompts, guiding their output generation, [12] V. Maksymovych, et al., Simulation of Authentication and can automate tasks like data cataloging, enhancing data in Information-Processing Electronic Devices Based quality, ensuring data privacy, and assisting in data on Poisson Pulse Sequence Generators. Electronics integration. These capabilities can support a robust data 11(13) (2022). doi: 10.3390/electronics11132039. classification policy, a key requirement for SOC 2 Type II. [13] O. Deineka, et al., Designing Data Classification and Vector search, another important method in data Secure Store Policy According to SOC 2 Type II, in: management, finds similar items to a given item by Cybersecurity Providing in Information and representing them as vectors in a high-dimensional space. It Telecommunication Systems, vol. 3654 (2024) 398– offers high accuracy, scalability, and flexibility, supporting 409. efficient data classification. Embeddings, which convert [14] O. Mykhaylova, et al., Mobile Application as a Critical categorical data into a form that can be input into a model, Infrastructure Cyberattack Surface, in: Cybersecurity play a key role in vector search and LLMs. Providing in Information and Telecommunication Prompt engineering, the crafting of effective prompts, is Systems, vol. 3550 (2023) 29–43. crucial for guiding LLMs’ output, and further enhancing [15] J. Yi, Y. Wen, An Improved Data Backup Scheme data management and governance practices. Based on Multi-Factor Authentication, in: IEEE 9th Intl Conference on Big Data Security on Cloud References (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing, (HPSC) and IEEE [1] B. Matturdi, et al., Big Data security and privacy: A Intl Conference on Intelligent Data and Security (IDS) review, China Communications, 11(14) (2014) 135–145. (2023). doi: 10.1109/BigDataSecurity-HPSC- doi: 10.1109/CC.2014.7085614. IDS58521.2023.00041. [2] V. Susukailo, I. Opirskyy, S. Vasylyshyn, Analysis of [16] D. Shevchuk, et al., Designing Secured Services for the attack vectors used by threat actors during the Authentication, Authorization, and Accounting of pandemic, IEEE 15th International Scientific and Users, in: Cybersecurity Providing in Information and Technical Conference on Computer Sciences and Telecommunication Systems II, vol. 3550 (2023) 217– Information Technologies, CSIT 2020 - Proceedings, 2 225. (2020) 261–264. [17] Y. Martseniuk, et al., Automated Conformity [3] M. N. Islam, et al., Security threats for big data: An Verification Concept for Cloud Security, in: empirical study, Int. J. Inf. Commun. Technol. Human Cybersecurity Providing in Information and Dev. (IJICTHD) 10(4) (2018) 1–18. Telecommunication Systems, vol. 3654 (2024) 25–37. [4] A. Singh, A. Kumar, S. Namasudra: DNACDS: Cloud [18] A. Horpenyuk, I. Opirskyy, P. Vorobets, Analysis of IoE big data security and accessing scheme based on Problems and Prospects of Implementation of Post- DNA cryptography, Frontiers Comput. Sci. 18(1) Quantum Cryptographic Algorithms, in: Classic, (2024) 181801. Quantum, and Post-Quantum Cryptography, vol. 3504 [5] O. I. Harasymchuk, et al., Generator of pseudorandom (2023) 39–49. bit sequence with increased cryptographic security, [19] A. Calder, S. 
Prompts play a crucial role in the functioning of Large Language Models (LLMs) like GPT-3. A prompt is essentially an input that is given to the model to guide its output. It can be a question, a statement, or any piece of text. The LLM generates a response to the prompt based on the patterns it learned during its training on a large corpus of text data.

Prompts are valuable because they allow us to direct the model's output. By carefully crafting our prompts, we can guide the model to generate useful and relevant responses. For instance, if we are using an LLM to write an email, we might prompt it with "Dear [Recipient's Name], I am writing to inform you that..." and the model could generate the rest of the email.

In the context of data management, prompts can be used to extract or generate specific pieces of information from or about our data. For example, we could prompt an LLM with a question about our data, such as "What is the average value of column X?" or "How many entries in column Y are above Z?". The LLM could then generate a response based on its understanding of the data.

Prompts can also be used to generate metadata for our data. For instance, we could prompt the LLM with a piece of data and ask it to generate a description or a set of tags for that data. This could be particularly useful for tasks like data cataloging, where we need to generate human-readable descriptions or annotations for large amounts of data.

However, it is important to note that the effectiveness of prompts depends on the quality of the LLM's training. If the LLM has not been trained on relevant data, or if it has not been trained to understand the specific format or context of the prompts, it may not generate useful responses. Therefore, careful prompt design and model training are crucial for getting the most value out of LLMs in data management [46].
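The sketch below illustrates such prompt design: a template fixes the task, the expected output format, and the data excerpt, so that the model's response stays constrained and machine-readable. The template wording and the classification-suggestion task are illustrative assumptions, not a prescribed prompt.

```python
TAGGING_PROMPT = """\
You are assisting with data cataloging under a SOC 2 data classification policy.
Given the data sample below, return JSON with keys "description" (one sentence)
and "suggested_level" (one of: Public, Internal Use Only, Confidential,
Highly Confidential).

Data sample:
{sample}

JSON:"""

def build_prompt(sample: str) -> str:
    """Fill the template with a data excerpt before sending it to the model."""
    return TAGGING_PROMPT.format(sample=sample)

print(build_prompt("name,email\nJohn Smith,john.smith@example.com"))
```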
4. Conclusions

This paper discusses the importance of information classification in the context of SOC 2 Type II compliance. Information classification serves as the foundation for data security and privacy controls, helping organizations identify and protect their most sensitive data. By effectively classifying their data, organizations can ensure its security, meet regulatory requirements, and ultimately safeguard their reputation and business continuity. To optimize and increase the efficiency of classifying and organizing data in accordance with SOC 2 Type II standards, we propose applying Large Language Models within this model.

LLMs like GPT-3 and BERT, trained on extensive text data, are transforming data management and governance, areas crucial for SOC 2 Type II compliance. LLMs respond to prompts, which guide their output generation, and can automate tasks like data cataloging, enhancing data quality, ensuring data privacy, and assisting in data integration. These capabilities can support a robust data classification policy, a key requirement for SOC 2 Type II. Vector search, another important method in data management, finds items similar to a given item by representing them as vectors in a high-dimensional space. It offers high accuracy, scalability, and flexibility, supporting efficient data classification. Embeddings, which convert categorical data into a form that can be input into a model, play a key role in vector search and LLMs. Prompt engineering, the crafting of effective prompts, is crucial for guiding LLMs' output and further enhancing data management and governance practices.

References

[1] B. Matturdi, et al., Big Data security and privacy: A review, China Communications 11(14) (2014) 135–145. doi: 10.1109/CC.2014.7085614.
[2] V. Susukailo, I. Opirskyy, S. Vasylyshyn, Analysis of the attack vectors used by threat actors during the pandemic, in: IEEE 15th International Scientific and Technical Conference on Computer Sciences and Information Technologies, CSIT 2020 - Proceedings, 2 (2020) 261–264.
[3] M. N. Islam, et al., Security threats for big data: An empirical study, Int. J. Inf. Commun. Technol. Human Dev. (IJICTHD) 10(4) (2018) 1–18.
[4] A. Singh, A. Kumar, S. Namasudra, DNACDS: Cloud IoE big data security and accessing scheme based on DNA cryptography, Frontiers Comput. Sci. 18(1) (2024) 181801.
[5] O. I. Harasymchuk, et al., Generator of pseudorandom bit sequence with increased cryptographic security, Metallurgical and Mining Industry: Sci. Tech. J. 5 (2014) 25–29.
[6] V. Dudykevych, H. Mykytyn, K. Ruda, The concept of a deepfake detection system of biometric image modifications based on neural networks, in: 3rd KhPI Week on Advanced Technology (KhPIWeek) (2022) 1–4. doi: 10.1109/KhPIWeek57572.2022.9916378.
[7] O. Vakhula, I. Opirskyy, O. Mykhaylova, Research on Security Challenges in Cloud Environments and Solutions based on the "Security-as-Code" Approach, in: Cybersecurity Providing in Information and Telecommunication Systems, vol. 3550 (2023) 55–69.
[8] V. Maksymovych, et al., Development of Additive Fibonacci Generators with Improved Characteristics for Cybersecurity Needs, Appl. Sci. 12(3) (2022) 1519. doi: 10.3390/app12031519.
[9] V. Maksymovych, et al., Combined Pseudo-Random Sequence Generator for Cybersecurity, Sensors 22 (2022) 9700. doi: 10.3390/s22249700.
[10] SOC 2 Compliance Documentation. URL: https://secureframe.com/hub/soc-2/compliance-documentation
[11] ISO/IEC 27001:2022. URL: https://www.iso.org/standard/27001
[12] V. Maksymovych, et al., Simulation of Authentication in Information-Processing Electronic Devices Based on Poisson Pulse Sequence Generators, Electronics 11(13) (2022). doi: 10.3390/electronics11132039.
[13] O. Deineka, et al., Designing Data Classification and Secure Store Policy According to SOC 2 Type II, in: Cybersecurity Providing in Information and Telecommunication Systems, vol. 3654 (2024) 398–409.
[14] O. Mykhaylova, et al., Mobile Application as a Critical Infrastructure Cyberattack Surface, in: Cybersecurity Providing in Information and Telecommunication Systems, vol. 3550 (2023) 29–43.
[15] J. Yi, Y. Wen, An Improved Data Backup Scheme Based on Multi-Factor Authentication, in: IEEE 9th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing (HPSC), and IEEE Intl Conference on Intelligent Data and Security (IDS) (2023). doi: 10.1109/BigDataSecurity-HPSC-IDS58521.2023.00041.
[16] D. Shevchuk, et al., Designing Secured Services for Authentication, Authorization, and Accounting of Users, in: Cybersecurity Providing in Information and Telecommunication Systems II, vol. 3550 (2023) 217–225.
[17] Y. Martseniuk, et al., Automated Conformity Verification Concept for Cloud Security, in: Cybersecurity Providing in Information and Telecommunication Systems, vol. 3654 (2024) 25–37.
[18] A. Horpenyuk, I. Opirskyy, P. Vorobets, Analysis of Problems and Prospects of Implementation of Post-Quantum Cryptographic Algorithms, in: Classic, Quantum, and Post-Quantum Cryptography, vol. 3504 (2023) 39–49.
[19] A. Calder, S. Watkins, IT Governance: An International Guide to Data Security and ISO27001/ISO27002 (2019).
[20] AICPA, SOC 2® - SOC for Service Organizations: Trust Services Criteria. URL: https://us.aicpa.org/interestareas/frc/assuranceadvisoryservices/soc-for-service-organizations
[21] IS Audit Basics: The Domains of Data and Information Audits. URL: https://www.isaca.org/resources/isaca-journal/issues/2016/volume-6/is-audit-basics-the-domains-of-data-and-information-audits
[22] Practical Data Security and Privacy for GDPR and CCPA, ISACA J. 3 (2020).
[23] Boosting Cyber Security with Data Governance and Enterprise Data Management, ISACA J. 3 (2017).
[24] D. Cannon, IT Service Management: A Guide for ITIL Foundation Exam Candidates, BCS (2012).
[25] N. Karumanchi, Data Structures and Algorithms Made Easy: Data Structures and Algorithmic Puzzles (2011).
[26] R. T. Watson, Data Management: Databases and Organizations (2017).
[27] M. Rhodes-Ousley, Information Security: The Complete Reference, 2nd ed. (2012).
[28] Munawar, Extract Transform Loading (ETL) Based Data Quality for Data Warehouse Development, in: 1st International Conference on Computer Science and Artificial Intelligence (ICCSAI), Jakarta, Indonesia (2021) 373–378. doi: 10.1109/ICCSAI53272.2021.9609770.
[29] V. Khoma, et al., Comprehensive Approach for Developing an Enterprise Cloud Infrastructure, in: Cybersecurity Providing in Information and Telecommunication Systems, vol. 3654 (2024) 201–215.
[30] S. Chauhan, Mastering Apache Airflow (2020).
[31] A. Gaikwad, Learning AWS Glue (2021).
[32] D. Anoshin, R. Avdeev, R. van Vliet, Azure Data Factory Cookbook (2020).
[33] S. Hoberman, Data Modeling Made Simple: A Practical Guide for Business and IT Professionals (2005).
[34] C. C. Aggarwal, Data Classification: Algorithms and Applications (2014).
[35] J. Sharp, Y. Duhamel, Microsoft Power Platform Enterprise Architecture (2020).
[36] B. Magnini, et al., From Text to Knowledge for the Semantic Web: the ONTOTEXT Project, in: Proceedings of SWAP 2005 Workshop (2005).
[37] S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann (2002).
[38] X. Yang, et al., Exploring the Application of Large Language Models in Detecting and Protecting Personally Identifiable Information in Archival Data: A Comprehensive Study, in: IEEE International Conference on Big Data (BigData), Sorrento, Italy (2023) 2116–2123. doi: 10.1109/BigData59044.2023.10386949.
[39] A. Piskozub, D. Zhuravchak, A. Tolkachova, Researching vulnerabilities in chatbots with LLM (Large Language Model), Ukrainian Sci. J. Inf. Secur. 29(9) (2023) 111–117. doi: 10.18372/2225-5036.29.18069.
[40] GPT-3 by OpenAI. URL: https://openai.com/research/gpt-3/
[41] BERT by Google. URL: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
[42] Amazon Bedrock: Automating Large-Scale, Fault-Tolerant Distributed Training in the Deep Learning Compiler Stack. URL: https://aws.amazon.com/blogs/aws/amazon-bedrock-automating-large-scale-fault-tolerant-distributed-training-in-the-deep-learning-compiler-stack/
[43] A. N. Papadopoulos, Y. Manolopoulos, Nearest Neighbor Search: A Database Perspective (2004).
[44] J. Leskovec, A. Rajaraman, J. D. Ullman, Mining of Massive Datasets (2014).
[45] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (2016).
[46] Teaching with AI. URL: https://openai.com/blog/teaching-with-ai