Information classification framework according to SOC 2 Type II

Oleh Deineka1,†, Oleh Harasymchuk1,†, Andrii Partyka1,† and Valerii Kozachok2,*,†

1 Lviv Polytechnic National University, 12 Stepana Bandery str., 79000 Lviv, Ukraine
2 Borys Grinchenko Kyiv Metropolitan University, 18/2 Bulvarno-Kudryavska str., 04053 Kyiv, Ukraine

Abstract
Large Language Models (LLMs) like GPT-3 and BERT, trained on extensive text data, are transforming data management and governance, areas crucial for SOC 2 Type II compliance. LLMs respond to prompts, which guide their output generation, and can automate tasks like data cataloging, enhancing data quality, ensuring data privacy, and assisting in data integration. These capabilities can support a robust data classification policy, a key requirement for SOC 2 Type II. Vector search, another important method in data management, finds items similar to a given item by representing them as vectors in a high-dimensional space. It offers high accuracy, scalability, and flexibility, supporting efficient data classification. Embeddings, which convert categorical data into a form that can be input into a model, play a key role in vector search and LLMs. Prompt engineering, the crafting of effective prompts, is crucial for guiding LLMs' output and further enhancing data management and governance practices.

Keywords
SOC 2 Type II, information classification, data security, LLM, vector search, prompt

CPITS-II 2024: Workshop on Cybersecurity Providing in Information and Telecommunication Systems II, October 26, 2024, Kyiv, Ukraine
∗ Corresponding author.
† These authors contributed equally.
oleh.r.deineka@lpnu.ua (O. Deineka); garasymchuk@ukr.net (O. Harasymchuk); andrijp14@gmail.com (A. Partyka); v.kozachok@kubg.edu.ua (V. Kozachok)
ORCID: 0009-0005-9156-3339 (O. Deineka); 0000-0002-8742-8872 (O. Harasymchuk); 0000-0003-3037-8373 (A. Partyka); 0000-0003-0072-2567 (V. Kozachok)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. SOC 2 Type II

Today's digital age is defined by the exponential growth of information assets, a significant portion of which are critical. The sheer volume of this information necessitates its classification based on various parameters and features, its secure storage and transmission, and its protection against unauthorized access. The frequency of potential attacks on information resources is on the rise [1–3]. To counteract these threats, cybersecurity experts are continually developing new standards, strategies, and techniques, as well as advancing infrastructure [4–9]. A key focus is the creation and research of standards for secure data storage [10–14]. These standards provide insight into how an organization controls data access and ensures its security and confidentiality.

The standards and requirements for data storage can differ between organizations based on factors such as geographical location, industry, sensitivity of the information, and more. Specific organizations may have unique standards and requirements based on their needs and legal obligations.

Most organizations formulate their security policies based on international standards, often with the involvement of external auditing firms that certify standards compliance. However, professionals dealing with secure storage of large data volumes still face numerous challenges, including data integrity, confidentiality, and accessibility. Ensuring that information remains unchanged from creation through storage and retrieval can be a complex task. Additionally, professionals must ensure confidentiality, allowing only authorized individuals to access the data, and guarantee data accessibility when needed, a task that becomes increasingly challenging with growing data volumes.

Despite the existence of various effective strategies, methods, and systems for organizing big data storage, certain problems persist. One significant issue is the difficulty of searching for required information in unstructured data.

ISO 27001 [11] is a standard aimed at ensuring the proper management of a company's digital assets, including financial information, intellectual property, employee data, and trusted third-party information. Meanwhile, SOC 2 certification [10] is more recognized and typically preferred by American and Canadian companies.

SOC is divided into SOC 1, SOC 2, and SOC 3. The first pertains exclusively to financial controls, and the third is primarily used for marketing purposes, allowing SaaS providers to focus solely on SOC 2.
The Service and Organization Controls 2 standard, developed by the American Institute of Certified Public Accountants (AICPA) on the basis of the Trust Services Criteria, provides an independent evaluation of risk management control procedures in IT companies that provide services to users. The standard emphasizes data privacy and confidentiality, making it the choice of giants like Google and Amazon, for whom high security levels and transparent data processing are crucial. External auditors are engaged for certification. Their role is to examine the implemented practices, verify the company's adherence to its procedures, and monitor changes in processes.

SOC 2 Type II is a significant certification in the data security and compliance landscape. It serves as an attestation by an independent auditor that a service organization's systems are not only designed to meet the Trust Services Criteria but also operate effectively over time. The Trust Services Criteria cover several critical areas: security, availability, processing integrity, confidentiality, and privacy.

The value of SOC 2 Type II lies in its ability to foster trust with clients and stakeholders. By demonstrating a commitment to stringent data management practices, companies can assure clients that their sensitive data is managed responsibly. This is particularly important in sectors where data privacy and security are crucial, such as financial services, healthcare, and cloud computing. Furthermore, the audit process for SOC 2 Type II helps organizations identify and mitigate potential security risks, ensuring they maintain a robust security posture. This proactive approach to risk management is vital in an era where cyber threats are continually evolving and data breaches can have devastating consequences. Hence, there is a constant search for new strategies and methods to ensure reliable data storage and the authentication of the users and devices where this data is stored [15–18].

In an increasingly regulated environment, SOC 2 Type II compliance can also support adherence to legal and regulatory requirements, helping organizations avoid expensive penalties and legal issues associated with non-compliance.

From a business perspective, SOC 2 Type II compliance can serve as a competitive differentiator. It signals to the market that an organization is a reliable and secure partner, which can be instrumental in winning new business and retaining existing customers [19].

The outcome of implementing SOC 2 is a report based on the AICPA Attestation Standards, section 101, Attest Engagements.

2. SOC 2 Type II information classification and data classification policy

2.1. Requirements

SOC 2 Type II, while not prescribing specific data classification policies, mandates that organizations effectively manage and safeguard the confidentiality, privacy, and security of information in line with the Trust Services Criteria (TSC). A Data Classification Policy is essential in meeting these criteria, especially the Security criterion, which is common to all SOC 2 audits. A SOC 2 audit evaluates the effectiveness of an organization's processes and systems based on the Trust Services Criteria and checks compliance with information security standards and regulations, including Common Criteria standards.

To support SOC 2 Type II compliance, a Data Classification Policy should address several general requirements, starting with the identification of data types. The policy should define the types of data the organization handles, including sensitive data subject to SOC 2 considerations, such as personally identifiable information (PII), business confidential data, and intellectual property. The policy must also establish clear classification levels that reflect the sensitivity of the data, with common levels including Public, Internal Use Only, Confidential, and Highly Confidential. Additionally, the policy should define roles and responsibilities for data classification, including data owners, custodians, and users, and outline their responsibilities in maintaining data classification.
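To make such levels actionable in tooling, they can be encoded directly, together with a handling matrix. The following is a minimal Python sketch: the four levels follow the policy described above, but the concrete control values (encryption flags, retention periods) are illustrative assumptions for the sketch, not requirements taken from SOC 2 or the TSC.

```python
from enum import IntEnum

class Classification(IntEnum):
    """Sensitivity levels from the policy above, ordered least to most sensitive."""
    PUBLIC = 0
    INTERNAL_USE_ONLY = 1
    CONFIDENTIAL = 2
    HIGHLY_CONFIDENTIAL = 3

# Illustrative handling matrix; the values are assumptions, not SOC 2 mandates.
HANDLING = {
    Classification.PUBLIC: {"encrypt_at_rest": False, "retention_days": None},
    Classification.INTERNAL_USE_ONLY: {"encrypt_at_rest": False, "retention_days": 1825},
    Classification.CONFIDENTIAL: {"encrypt_at_rest": True, "retention_days": 1095},
    Classification.HIGHLY_CONFIDENTIAL: {"encrypt_at_rest": True, "retention_days": 365},
}

print(HANDLING[Classification.CONFIDENTIAL])
# {'encrypt_at_rest': True, 'retention_days': 1095}
```

Using an ordered enum rather than free-form labels makes the levels comparable, which the access-control sketch in the next subsection relies on as well.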
A Data Classification Policy for SOC 2 Type II compliance should specify handling requirements for each classification level, including storage, transmission, access controls, encryption standards, and end-of-life procedures. The policy should also provide guidelines on how data should be labeled or marked according to its classification, so that it is easily identifiable and handled appropriately. Access controls must be addressed, ensuring that access to data is based on the principle of least privilege and that only authorized individuals can access sensitive data. The policy should outline data retention periods and secure disposal methods for each classification level, ensuring data is not kept longer than necessary and is disposed of securely. Regular training and awareness programs should be mandated so that employees understand the importance of data classification and their role in it. The policy should include provisions for regular auditing and monitoring to verify that classification controls are effective and being followed. The policy should be linked to an incident response plan that addresses potential data breaches or loss, with procedures tailored to the classification level of the data involved. Finally, the policy should specify intervals for reviewing and updating data classification procedures to ensure they remain relevant and effective as the organization evolves, data volumes increase, and new threats emerge.

If data is shared with or handled by third-party vendors, the data classification policy must extend to these vendors, often requiring them to adhere to similar or compatible information classification and handling standards. Ensuring alignment with SOC 2 Type II requirements when developing a Data Classification Policy usually demands a comprehensive understanding of the AICPA's TSC and the unique data protection requirements of the organization.
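Because the classification levels are ordered, the least-privilege rule reduces to a comparison between a subject's clearance and an asset's level. A minimal, self-contained sketch; the string-based clearance model is an assumption for illustration, not a prescribed mechanism.

```python
ORDER = {"Public": 0, "Internal Use Only": 1, "Confidential": 2, "Highly Confidential": 3}

def can_access(user_clearance: str, data_level: str) -> bool:
    """Least privilege: a subject may read data only at or below their clearance."""
    return ORDER[user_clearance] >= ORDER[data_level]

# A user cleared for Confidential data cannot read Highly Confidential data.
assert can_access("Confidential", "Internal Use Only")
assert not can_access("Confidential", "Highly Confidential")
```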
Engaging seasoned compliance experts or auditors who can give tailored advice and oversee compliance with the standard's stipulations is highly recommended. The AICPA's guidance, together with frameworks such as ISO 27001, can offer invaluable input for the creation and maintenance of a strong data classification policy. It is crucial to identify and categorize data based on its sensitivity, importance, and regulatory mandates. Moreover, regular reviews and updates of the policy should be conducted to ensure its efficiency and continued compliance with SOC 2 Type II requirements [20–24].

2.2. Design

We therefore offer a Data Flow Diagram (Figure 1).

Figure 1: Data flow diagram

Creating a Data Flow Diagram necessitates an initial comprehensive grasp of the various data types that your company possesses. Typically, data can be divided into three main categories: structured, semi-structured, and unstructured.

Data that is organized in a prearranged manner, such as the data stored in a relational database, is referred to as structured data. Its consistent format makes structured data easy to search, analyze, and manipulate. Semi-structured data, on the other hand, has a certain level of organization but lacks a strict format. XML and JSON files, which house data in a hierarchical format without a fixed schema, are examples of semi-structured data. Unstructured data is characterized by its lack of inherent structure or organization. This category includes text documents, images, and videos. The inconsistent format of unstructured data can pose challenges when it comes to searching, analyzing, and manipulating it [25, 26].

After the identification of the company's data types, the subsequent phase involves gaining an understanding of the metadata linked to that data. Metadata is essentially data that offers information about other data. For example, the metadata linked to a text document could include details such as the author, the date of creation, and the file size. A deep understanding of the metadata associated with your data can facilitate better organization, management, and analysis of your data [27].
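For file-based assets, much of this technical metadata can be collected automatically. A small sketch using only the Python standard library; note that author information is usually not stored by the filesystem and would have to come from the document's own properties, so it is omitted here.

```python
from datetime import datetime, timezone
from pathlib import Path

def file_metadata(path: str) -> dict:
    """Collect basic technical metadata: name, size, and last-modified time."""
    p = Path(path)
    st = p.stat()
    return {
        "name": p.name,
        "size_bytes": st.st_size,
        "modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc).isoformat(),
    }

print(file_metadata("report.txt"))  # assumes report.txt exists in the working directory
```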
Once you have identified the types of data your company owns and the metadata associated with that data, the process of creating a Data Flow Diagram continues with the utilization of integration tools to manage and store your data. Integration tools facilitate the extraction of data from various sources, its transformation into a common format, and its loading into a data store. This process, known as Extract, Transform, Load (ETL), consolidates your data into a single location, simplifying its management and analysis [28–32].
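A minimal ETL sketch using only the Python standard library: records are extracted from a CSV file, transformed into a common format, and loaded into a SQLite data store. In practice this role is played by tools such as Apache Airflow, AWS Glue, or Azure Data Factory [30–32]; the file and column names here are illustrative assumptions.

```python
import csv
import sqlite3

def etl(csv_path: str, db_path: str) -> None:
    # Extract: read raw records from the source file.
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    # Transform: normalize into a common format (trim names, lowercase emails).
    records = [(r["name"].strip(), r["email"].strip().lower()) for r in rows]

    # Load: consolidate into a single data store.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, email TEXT)")
    con.executemany("INSERT INTO customers VALUES (?, ?)", records)
    con.commit()
    con.close()

etl("customers.csv", "warehouse.db")  # assumes customers.csv has name,email columns
```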
Following the extraction, transformation, and loading of your data into a data store, the subsequent phase involves creating a data model. A data model is a visual depiction of the relationships between different data elements. It provides a structure for organizing and structuring your data and can assist in identifying patterns and trends within your data [33].

Once a data model has been created, the next step involves classifying your data and linking it to the associated metadata. This involves assigning a sensitivity level to your data based on its importance and the potential impact if it were to be lost or stolen. After your data has been classified, it can be linked to the associated metadata, providing additional context and information about the data [34].
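A sketch of this step: an assessed impact score is mapped to a sensitivity level, and the result is linked to the asset's metadata as a catalog entry. The impact scale and the mapping are assumptions for illustration; a real policy defines its own assessment criteria.

```python
LEVELS = ["Public", "Internal Use Only", "Confidential", "Highly Confidential"]

def classify_by_impact(impact: int) -> str:
    """Map an assessed loss/theft impact score (0-3, illustrative) to a level."""
    return LEVELS[impact]

def catalog_entry(asset: str, metadata: dict, impact: int) -> dict:
    """Link the classified asset to its metadata for the data catalog."""
    return {"asset": asset, "classification": classify_by_impact(impact), **metadata}

print(catalog_entry("customers", {"owner": "sales", "format": "table"}, impact=2))
# {'asset': 'customers', 'classification': 'Confidential', 'owner': 'sales', 'format': 'table'}
```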
The final phase in creating a Data Flow Diagram involves creating an application that enables you to visualize and manage your data. This application should offer a user-friendly interface for accessing, analyzing, and manipulating your data. It should also incorporate logic for managing access, requests, and incidents, and should be integrated with your ITSM system to ensure that data is handled according to your company's policies and procedures [35].

This solution offers numerous advantages over traditional product-based offerings from various companies. One of the primary benefits is the flexibility to choose the hosting environment that best suits your needs, whether on-premise or cloud-based. This allows you to align the solution with your operational requirements and infrastructure capabilities. Additionally, you have the liberty to select the technology stack that best fits your project. This means that you are not confined to a predetermined set of technologies but can customize the solution to leverage the most relevant and efficient tools for your specific needs.

In terms of team composition, you have the flexibility to assemble a team that is uniquely suited to the project at hand. This ensures that the right expertise and skills are applied to deliver the best possible outcomes. Another advantage is flexibility in budgeting. Unlike vendor-specific solutions that may come with fixed licensing costs, the budget for this solution can be adjusted according to your financial capacity and project requirements. This can result in significant cost savings without compromising quality or performance. Lastly, this solution offers robust change and feature management capabilities: it can easily adapt to evolving business needs, incorporating new features and making necessary changes in a timely and efficient manner. This flexibility ensures the solution remains relevant and continues to deliver value over time [13].

3. Information classification

3.1. Overview

Information classification is a critical process in data management that involves categorizing data based on its sensitivity, importance, and regulatory requirements. This process is essential for organizations to effectively protect their data and comply with various legal, regulatory, and contractual obligations.

The primary goal of information classification is to facilitate appropriate levels of protection for different types of data. By classifying data as public, internal, confidential, or highly confidential, organizations can apply suitable security measures to each category, ensuring that sensitive and critical data receives the highest level of protection.

Information classification is not a one-time activity but a continuous process that needs to be integrated into the organization's data lifecycle. It involves identifying the types of data the organization handles, defining classification levels, assigning responsibilities for data classification, and implementing procedures for handling, storing, and disposing of data based on its classification.

In addition to enhancing data security, information classification also aids in risk management, regulatory compliance, and resource allocation. It helps organizations understand where their most sensitive and valuable data resides, who has access to it, and how it is being protected, enabling them to identify and mitigate potential risks. It also supports compliance with regulations such as GDPR, HIPAA, and SOC 2, which require organizations to implement appropriate safeguards for sensitive data. Furthermore, by identifying less sensitive data that requires lower levels of protection, organizations can optimize their use of resources.

In today's data-driven world, where vast volumes of data are generated and processed every day, information classification has become more important than ever. It is a fundamental step in ensuring that all data is given the appropriate level of protection and handled responsibly throughout its lifecycle.

3.2. Importance

The importance of Information Classification in the context of SOC 2 Type II compliance cannot be overstated. It serves as the foundation for data security and privacy controls, helping organizations identify and protect their most sensitive data.

Firstly, Information Classification helps in identifying the types of data an organization handles, including sensitive data subject to SOC 2 considerations, such as personally identifiable information (PII), confidential business data, and intellectual property. This identification is the first step towards implementing appropriate security measures.

Secondly, Information Classification aids in establishing clear classification levels that reflect the sensitivity of the data. These levels, which commonly include Public, Internal Use Only, Confidential, and Highly Confidential, guide the implementation of access controls, encryption standards, and other security measures.

Thirdly, Information Classification supports the assignment of roles and responsibilities for data classification, including data owners, custodians, and users. This clear delineation of responsibilities ensures accountability and promotes adherence to data security policies.

Lastly, Information Classification facilitates compliance with legal and regulatory requirements, including those stipulated by SOC 2 Type II.

Information Content Extraction is a crucial process in data management that involves retrieving structured information from unstructured or semi-structured data sources. This process is essential for transforming raw data into meaningful and actionable insights.

Structured data is data that is organized into a formatted structure, often a relational database. This type of data is readily searchable by simple, straightforward search engine algorithms or other search operations. Semi-structured data is a form of structured data that does not adhere to the formal structure of data models associated with relational databases or other forms of data tables but contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Examples of semi-structured data include XML and JSON files. Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. This type of data is typically text-heavy but may contain dates, numbers, and facts as well. Examples of unstructured data include text files, PDFs, and BLOBs (Binary Large Objects).

Information extraction from these types of data involves several steps, including text preprocessing, entity recognition, relation extraction, and event extraction. Text preprocessing involves cleaning and normalizing the text, removing stop words, and stemming or lemmatizing words. Entity recognition identifies entities such as names, locations, and dates in the text. Relation extraction identifies relationships between these entities, and event extraction identifies events in which these entities are involved.

3.3. Framework

Figure 2: Information classification

There are several approaches to information extraction:

1. Rule-based methods: These methods use a set of predefined rules or patterns to extract information. For example, a rule might specify that if a word is capitalized and followed by a certain verb, it is likely a person's name. While rule-based methods can be very accurate, they are also labor-intensive and may not generalize well to new data.
2. Machine learning methods: These methods use algorithms to learn patterns from labeled training data and apply these patterns to new data. For example, a machine learning model might learn that words that often appear in the same context as known person names are likely to be person names themselves. Machine learning methods can be very effective, especially with large amounts of training data, but they can also be complex and computationally intensive.
3. Hybrid methods: These methods combine rule-based and machine-learning methods to leverage the strengths of both. For example, a hybrid method might use rules to extract easy-to-identify information and machine learning to extract more complex information [36, 37], as the sketch below illustrates.

However, there are no clear recommendations regarding the implementation of a specific method; the choice should be made considering a large number of factors.
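The sketch contrasts the two ingredients of a hybrid extractor: a regular expression catches easy-to-identify items (email addresses), while a pretrained statistical model recognizes entities such as person names. It assumes spaCy and its small English model (en_core_web_sm) are installed; any comparable NER library would serve the same purpose.

```python
import re
import spacy  # assumes: pip install spacy && python -m spacy download en_core_web_sm

text = "Contact John Smith at john.smith@example.com about the Kyiv audit."

# Rule-based: a predefined pattern extracts well-structured items.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)

# Machine learning: a pretrained model recognizes named entities.
nlp = spacy.load("en_core_web_sm")
entities = [(ent.text, ent.label_) for ent in nlp(text).ents]

print(emails)    # ['john.smith@example.com']
print(entities)  # e.g. [('John Smith', 'PERSON'), ('Kyiv', 'GPE')]
```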
Large Language Models (LLMs) [38, 39] represent a significant advancement in the field of artificial intelligence. These models are trained on extensive volumes of text data, enabling them to generate text that closely resembles human writing. Notable examples of LLMs include GPT-3 by OpenAI and BERT by Google [40–42]. These models can perform a wide range of tasks, such as answering queries, crafting essays, summarizing texts, translating languages, and even generating creative ideas.

In the realm of data management and data governance, LLMs can be leveraged in several innovative ways:

1. Data Cataloging: LLMs can streamline the process of data cataloging. They can read and comprehend the metadata associated with various data assets and generate descriptions or tags for these assets, thereby automating a traditionally manual process (see the sketch after this list).
2. Data Quality: LLMs can play a pivotal role in enhancing data quality. They can be trained to identify and flag potential errors or inconsistencies in data, facilitating proactive data quality management.
3. Data Privacy: LLMs can contribute to data privacy efforts by identifying and redacting sensitive information in datasets, thereby helping organizations comply with data privacy regulations.
4. Data Integration: LLMs can aid in data integration tasks. They can understand the context and semantics of different data sources and assist in mapping them to a common model, simplifying the integration process.

Choosing the right LLM for data management and data governance depends on various factors, including the specific requirements of the tasks, the size and complexity of the data, the computational resources available, and the expertise of the team.
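As an illustration of the data cataloging use case, the sketch below builds a prompt from a table's technical metadata and asks an LLM to propose a description and tags. The llm_complete function is a hypothetical placeholder standing in for whichever model API is used (for example, a GPT-3-class completion endpoint); it is not a real library call.

```python
def llm_complete(prompt: str) -> str:
    """Hypothetical placeholder for a call to an LLM completion API."""
    raise NotImplementedError("wire this to your model provider")

def catalog_prompt(table: str, columns: list[str]) -> str:
    """Turn table metadata into a cataloging instruction for the model."""
    return (
        "You are a data steward. Given a table name and its columns, "
        "write a one-sentence description and propose 3 tags.\n"
        f"Table: {table}\nColumns: {', '.join(columns)}\n"
        "Answer as: description: ...; tags: ..."
    )

prompt = catalog_prompt("customers", ["name", "email", "signup_date"])
# description_and_tags = llm_complete(prompt)
```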
Vector search, or nearest neighbor search, is a powerful technique utilized in machine learning and data science to identify the items most similar to a given item. This method operates by representing items as vectors in a multi-dimensional space. Each point in this space corresponds to a potential item, and the position of that point is determined by the characteristics of the item.

The principle behind vector search is that similar items will be located near each other in this space, while dissimilar items will be further apart. When a new item is introduced, it is also converted into a vector and placed into this space. The algorithm then searches for vectors that are close to the new vector, with "closeness" determined by a distance metric such as Euclidean distance or cosine similarity.

This technique is particularly useful when dealing with large datasets, as it allows for efficient searching and retrieval of items. It is commonly used in recommendation systems, image recognition, and natural language processing, among other applications.

For instance, in a movie recommendation system, each movie could be represented as a vector where each dimension corresponds to a different genre. A romance movie would be located closer to other romance movies and further from action movies. When a user rates a movie, the system can look for other movies that are close in the vector space to recommend to the user.

In essence, vector search is a method of transforming complex, abstract items into a format that can be easily and efficiently compared, enabling the rapid retrieval of similar items from large datasets.

Advantages of Vector Search:

1. High Accuracy: Vector search can provide highly accurate results because it considers the relationships between different features of the data. By representing data in a high-dimensional space, it captures nuances and complexities of the data that might be missed by other methods.
2. Scalability: Vector search is highly scalable and can handle large amounts of data efficiently. This makes it suitable for big data applications where traditional search methods may be impractical.
3. Flexibility: Vector search is highly flexible and can be used with any data that can be represented as a vector. This includes text, images, audio, and more, making it applicable to a wide range of tasks and industries.

Disadvantages of Vector Search:

1. Computational Complexity: Vector search can be computationally intensive, especially when dealing with high-dimensional data or large datasets. This can make it slower than other methods, particularly for real-time applications.
2. Difficulty in Choosing the Right Distance Metric: The effectiveness of vector search heavily depends on the choice of distance metric, which can be challenging to determine. The choice of metric can significantly impact the results, and there is often no one-size-fits-all solution.
3. Sensitivity to Noise: Vector search can be sensitive to noise in the data. Outliers or errors in the data can affect the distance calculations and lead to inaccurate results.
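A minimal nearest-neighbor sketch with NumPy, following the movie example above: each movie is a vector of genre weights, and cosine similarity ranks how close the other items are to a query vector. The genre axes and weights are illustrative assumptions; production systems would replace this brute-force scan with an approximate index such as a dedicated vector database.

```python
import numpy as np

# Genre axes (illustrative): [romance, action, comedy]
movies = {
    "Love Letters": np.array([0.9, 0.1, 0.3]),
    "Fast Pursuit": np.array([0.1, 0.9, 0.2]),
    "Paris Nights": np.array([0.8, 0.2, 0.5]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means identical direction, 0.0 means orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(query: np.ndarray, k: int = 2) -> list[tuple[str, float]]:
    """Brute-force vector search: rank all items by similarity to the query."""
    scores = {name: cosine(query, vec) for name, vec in movies.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(nearest(np.array([1.0, 0.0, 0.4])))  # a romance-leaning query
```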
LLMs like GPT-3 and BERT, trained on documentation extensive text data, are transforming data management and [11] ISO/IEC 27001:2022 URL: https://www.iso.org/ governance, areas crucial for SOC 2 Type II compliance. standard/27001 LLMs respond to prompts, guiding their output generation, [12] V. Maksymovych, et al., Simulation of Authentication and can automate tasks like data cataloging, enhancing data in Information-Processing Electronic Devices Based quality, ensuring data privacy, and assisting in data on Poisson Pulse Sequence Generators. Electronics integration. These capabilities can support a robust data 11(13) (2022). doi: 10.3390/electronics11132039. classification policy, a key requirement for SOC 2 Type II. [13] O. Deineka, et al., Designing Data Classification and Vector search, another important method in data Secure Store Policy According to SOC 2 Type II, in: management, finds similar items to a given item by Cybersecurity Providing in Information and representing them as vectors in a high-dimensional space. It Telecommunication Systems, vol. 3654 (2024) 398– offers high accuracy, scalability, and flexibility, supporting 409. efficient data classification. Embeddings, which convert [14] O. Mykhaylova, et al., Mobile Application as a Critical categorical data into a form that can be input into a model, Infrastructure Cyberattack Surface, in: Cybersecurity play a key role in vector search and LLMs. Providing in Information and Telecommunication Prompt engineering, the crafting of effective prompts, is Systems, vol. 3550 (2023) 29–43. crucial for guiding LLMs’ output, and further enhancing [15] J. Yi, Y. Wen, An Improved Data Backup Scheme data management and governance practices. Based on Multi-Factor Authentication, in: IEEE 9th Intl Conference on Big Data Security on Cloud References (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing, (HPSC) and IEEE [1] B. Matturdi, et al., Big Data security and privacy: A Intl Conference on Intelligent Data and Security (IDS) review, China Communications, 11(14) (2014) 135–145. (2023). doi: 10.1109/BigDataSecurity-HPSC- doi: 10.1109/CC.2014.7085614. IDS58521.2023.00041. [2] V. Susukailo, I. Opirskyy, S. Vasylyshyn, Analysis of [16] D. Shevchuk, et al., Designing Secured Services for the attack vectors used by threat actors during the Authentication, Authorization, and Accounting of pandemic, IEEE 15th International Scientific and Users, in: Cybersecurity Providing in Information and Technical Conference on Computer Sciences and Telecommunication Systems II, vol. 3550 (2023) 217– Information Technologies, CSIT 2020 - Proceedings, 2 225. (2020) 261–264. [17] Y. Martseniuk, et al., Automated Conformity [3] M. N. Islam, et al., Security threats for big data: An Verification Concept for Cloud Security, in: empirical study, Int. J. Inf. Commun. Technol. Human Cybersecurity Providing in Information and Dev. (IJICTHD) 10(4) (2018) 1–18. Telecommunication Systems, vol. 3654 (2024) 25–37. [4] A. Singh, A. Kumar, S. Namasudra: DNACDS: Cloud [18] A. Horpenyuk, I. Opirskyy, P. Vorobets, Analysis of IoE big data security and accessing scheme based on Problems and Prospects of Implementation of Post- DNA cryptography, Frontiers Comput. Sci. 18(1) Quantum Cryptographic Algorithms, in: Classic, (2024) 181801. Quantum, and Post-Quantum Cryptography, vol. 3504 [5] O. I. Harasymchuk, et al., Generator of pseudorandom (2023) 39–49. bit sequence with increased cryptographic security, [19] A. Calder, S. 
Prompts play a crucial role in the functioning of Large Language Models (LLMs) like GPT-3. A prompt is essentially an input that is given to the model to guide its output. It can be a question, a statement, or any piece of text. The LLM generates a response to the prompt based on the patterns it learned during its training on a large corpus of text data.

Prompts are valuable because they allow us to direct the model's output. By carefully crafting our prompts, we can guide the model to generate useful and relevant responses. For instance, if we are using an LLM to write an email, we might prompt it with "Dear [Recipient's Name], I am writing to inform you that..." and the model could generate the rest of the email.

In the context of data management, prompts can be used to extract or generate specific pieces of information from or about our data. For example, we could prompt an LLM with a question about our data, such as "What is the average value of column X?" or "How many entries in column Y are above Z?". The LLM could then generate a response based on its understanding of the data.

Prompts can also be used to generate metadata for our data. For instance, we could prompt the LLM with a piece of data and ask it to generate a description or a set of tags for that data. This could be particularly useful for tasks like data cataloging, where we need to generate human-readable descriptions or annotations for large amounts of data.

However, it is important to note that the effectiveness of prompts depends on the quality of the LLM's training. If the LLM has not been trained on relevant data, or if it has not been trained to understand the specific format or context of the prompts, it may not generate useful responses. Therefore, careful prompt design and model training are crucial for getting the most value out of LLMs in data management [46].
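The sketch below illustrates such prompt design: a template fixes the task, the expected output format, and the data excerpt, so that the model's response stays constrained and machine-readable. The template wording and the classification-suggestion task are illustrative assumptions, not a prescribed prompt.

```python
TAGGING_PROMPT = """\
You are assisting with data cataloging under a SOC 2 data classification policy.
Given the data sample below, return JSON with keys "description" (one sentence)
and "suggested_level" (one of: Public, Internal Use Only, Confidential,
Highly Confidential).

Data sample:
{sample}

JSON:"""

def build_prompt(sample: str) -> str:
    """Fill the template with a data excerpt before sending it to the model."""
    return TAGGING_PROMPT.format(sample=sample)

print(build_prompt("name,email\nJohn Smith,john.smith@example.com"))
```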
4. Conclusions

This paper discusses the importance of information classification in the context of SOC 2 Type II compliance. Information classification serves as the foundation for data security and privacy controls, helping organizations identify and protect their most sensitive data. By effectively classifying their data, organizations can ensure its security, meet regulatory requirements, and ultimately safeguard their reputation and business continuity. To optimize and increase the efficiency of classifying and organizing data in accordance with SOC 2 Type II standards, we propose applying Large Language Models within this model.

LLMs like GPT-3 and BERT, trained on extensive text data, are transforming data management and governance, areas crucial for SOC 2 Type II compliance. LLMs respond to prompts, which guide their output generation, and can automate tasks like data cataloging, enhancing data quality, ensuring data privacy, and assisting in data integration. These capabilities can support a robust data classification policy, a key requirement for SOC 2 Type II. Vector search, another important method in data management, finds items similar to a given item by representing them as vectors in a high-dimensional space. It offers high accuracy, scalability, and flexibility, supporting efficient data classification. Embeddings, which convert categorical data into a form that can be input into a model, play a key role in vector search and LLMs. Prompt engineering, the crafting of effective prompts, is crucial for guiding LLMs' output and further enhancing data management and governance practices.

References

[1] B. Matturdi, et al., Big Data security and privacy: A review, China Communications 11(14) (2014) 135–145. doi: 10.1109/CC.2014.7085614.
[2] V. Susukailo, I. Opirskyy, S. Vasylyshyn, Analysis of the attack vectors used by threat actors during the pandemic, in: IEEE 15th International Scientific and Technical Conference on Computer Sciences and Information Technologies, CSIT 2020 - Proceedings, 2 (2020) 261–264.
[3] M. N. Islam, et al., Security threats for big data: An empirical study, Int. J. Inf. Commun. Technol. Human Dev. (IJICTHD) 10(4) (2018) 1–18.
[4] A. Singh, A. Kumar, S. Namasudra, DNACDS: Cloud IoE big data security and accessing scheme based on DNA cryptography, Frontiers Comput. Sci. 18(1) (2024) 181801.
[5] O. I. Harasymchuk, et al., Generator of pseudorandom bit sequence with increased cryptographic security, Metallurgical and Mining Industry: Sci. Tech. J. 5 (2014) 25–29.
[6] V. Dudykevych, H. Mykytyn, K. Ruda, The concept of a deepfake detection system of biometric image modifications based on neural networks, in: 3rd KhPI Week on Advanced Technology (KhPIWeek) (2022) 1–4. doi: 10.1109/KhPIWeek57572.2022.9916378.
[7] O. Vakhula, I. Opirskyy, O. Mykhaylova, Research on Security Challenges in Cloud Environments and Solutions based on the "Security-as-Code" Approach, in: Cybersecurity Providing in Information and Telecommunication Systems, vol. 3550 (2023) 55–69.
[8] V. Maksymovych, et al., Development of Additive Fibonacci Generators with Improved Characteristics for Cybersecurity Needs, Appl. Sci. 12(3) (2022) 1519. doi: 10.3390/app12031519.
[9] V. Maksymovych, et al., Combined Pseudo-Random Sequence Generator for Cybersecurity, Sensors 22 (2022) 9700. doi: 10.3390/s22249700.
[10] SOC 2 Compliance Documentation. URL: https://secureframe.com/hub/soc-2/compliance-documentation
[11] ISO/IEC 27001:2022. URL: https://www.iso.org/standard/27001
[12] V. Maksymovych, et al., Simulation of Authentication in Information-Processing Electronic Devices Based on Poisson Pulse Sequence Generators, Electronics 11(13) (2022). doi: 10.3390/electronics11132039.
[13] O. Deineka, et al., Designing Data Classification and Secure Store Policy According to SOC 2 Type II, in: Cybersecurity Providing in Information and Telecommunication Systems, vol. 3654 (2024) 398–409.
[14] O. Mykhaylova, et al., Mobile Application as a Critical Infrastructure Cyberattack Surface, in: Cybersecurity Providing in Information and Telecommunication Systems, vol. 3550 (2023) 29–43.
[15] J. Yi, Y. Wen, An Improved Data Backup Scheme Based on Multi-Factor Authentication, in: IEEE 9th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing (HPSC), and IEEE Intl Conference on Intelligent Data and Security (IDS) (2023). doi: 10.1109/BigDataSecurity-HPSC-IDS58521.2023.00041.
[16] D. Shevchuk, et al., Designing Secured Services for Authentication, Authorization, and Accounting of Users, in: Cybersecurity Providing in Information and Telecommunication Systems II, vol. 3550 (2023) 217–225.
[17] Y. Martseniuk, et al., Automated Conformity Verification Concept for Cloud Security, in: Cybersecurity Providing in Information and Telecommunication Systems, vol. 3654 (2024) 25–37.
[18] A. Horpenyuk, I. Opirskyy, P. Vorobets, Analysis of Problems and Prospects of Implementation of Post-Quantum Cryptographic Algorithms, in: Classic, Quantum, and Post-Quantum Cryptography, vol. 3504 (2023) 39–49.
[19] A. Calder, S. Watkins, IT Governance: An International Guide to Data Security and ISO27001/ISO27002 (2019).
[20] AICPA, SOC 2® - SOC for Service Organizations: Trust Services Criteria. URL: https://us.aicpa.org/interestareas/frc/assuranceadvisoryservices/soc-for-service-organizations
[21] IS Audit Basics: The Domains of Data and Information Audits. URL: https://www.isaca.org/resources/isaca-journal/issues/2016/volume-6/is-audit-basics-the-domains-of-data-and-information-audits
[22] Practical Data Security and Privacy for GDPR and CCPA, ISACA J. 3 (2020).
[23] Boosting Cyber Security with Data Governance and Enterprise Data Management, ISACA J. 3 (2017).
[24] D. Cannon, IT Service Management: A Guide for ITIL Foundation Exam Candidates, BCS (2012).
[25] N. Karumanchi, Data Structures and Algorithms Made Easy: Data Structures and Algorithmic Puzzles (2011).
[26] R. T. Watson, Data Management: Databases and Organizations (2017).
[27] M. Rhodes-Ousley, Information Security: The Complete Reference, 2nd ed. (2012).
[28] Munawar, Extract Transform Loading (ETL) Based Data Quality for Data Warehouse Development, in: 1st International Conference on Computer Science and Artificial Intelligence (ICCSAI), Jakarta, Indonesia (2021) 373–378. doi: 10.1109/ICCSAI53272.2021.9609770.
[29] V. Khoma, et al., Comprehensive Approach for Developing an Enterprise Cloud Infrastructure, in: Cybersecurity Providing in Information and Telecommunication Systems, vol. 3654 (2024) 201–215.
[30] S. Chauhan, Mastering Apache Airflow (2020).
[31] A. Gaikwad, Learning AWS Glue (2021).
[32] D. Anoshin, R. Avdeev, R. van Vliet, Azure Data Factory Cookbook (2020).
[33] S. Hoberman, Data Modeling Made Simple: A Practical Guide for Business and IT Professionals (2005).
[34] C. C. Aggarwal, Data Classification: Algorithms and Applications (2014).
[35] J. Sharp, Y. Duhamel, Microsoft Power Platform Enterprise Architecture (2020).
[36] B. Magnini, et al., From Text to Knowledge for the Semantic Web: the ONTOTEXT Project, in: Proceedings of SWAP 2005 Workshop (2005).
[37] S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann (2002).
[38] X. Yang, et al., Exploring the Application of Large Language Models in Detecting and Protecting Personally Identifiable Information in Archival Data: A Comprehensive Study, in: IEEE International Conference on Big Data (BigData), Sorrento, Italy (2023) 2116–2123. doi: 10.1109/BigData59044.2023.10386949.
[39] A. Piskozub, D. Zhuravchak, A. Tolkachova, Researching vulnerabilities in chatbots with LLM (Large Language Model), Ukrainian Sci. J. Inf. Secur. 29(9) (2023) 111–117. doi: 10.18372/2225-5036.29.18069.
[40] GPT-3 by OpenAI. URL: https://openai.com/research/gpt-3/
[41] BERT by Google. URL: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
[42] Amazon Bedrock: Automating Large-Scale, Fault-Tolerant Distributed Training in the Deep Learning Compiler Stack. URL: https://aws.amazon.com/blogs/aws/amazon-bedrock-automating-large-scale-fault-tolerant-distributed-training-in-the-deep-learning-compiler-stack/
[43] A. N. Papadopoulos, Y. Manolopoulos, Nearest Neighbor Search: A Database Perspective (2004).
[44] J. Leskovec, A. Rajaraman, J. D. Ullman, Mining of Massive Datasets (2014).
[45] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (2016).
[46] Teaching with AI. URL: https://openai.com/blog/teaching-with-ai