The Apriori Method in the Collection of Significant Values of Pharmaceutical Data

Nataliya Maslova1, Olena Liubymenko1, Olha Polovynka2 and Yaroslaw Dorogyy1

1 Donetsk National Technical University, str. Potebni, 56, Lutsk, 43018, Ukraine
2 National Technical University "Kharkov Polytechnic Institute", ul. Kyrpychyova, 2, Kharkov, 61002, Ukraine

Abstract
The paper demonstrates the relevance and necessity of seeking efficient algorithms for large-scale data applications. A review of existing approaches and methods for analyzing unstructured data, which is typical of pharmaceutical data and medical research results, has been conducted, and the characteristics of such data are described. It is shown that Data Mining methods, including the Apriori algorithm, are applicable for discovering hidden patterns in large volumes of data. Modern tools for implementing the Apriori algorithm for parallel and distributed processing on platforms such as MPI, OpenMP, Hadoop, OpenCL, Spark, and Apache Flink are discussed, and the specifics of implementing the algorithm on each platform are described. Theoretical recommendations are provided, and the expected complexity (required workload) and time to obtain results are forecast for each platform. A computational experiment is conducted that, overall, confirms the theoretical conclusions. The concluding sections of the paper outline avenues for further research, envisaging deeper experimentation with parallel and distributed versions of association rule mining algorithms in various fields, including the medical and pharmacological domains.

Keywords
Apriori parallel algorithm, Big Data, pharmaceutical logistics

1. Introduction

The proliferation of sources and volumes of information, its diversity, and distributed storage have led to the necessity of reviewing not only data collection and storage technologies but also methods of analysis.
Data Mining allows for the generation of new ideas and understandings of the nature and interrelationships of objects or phenomena. Typical data analysis tasks include classification, clustering, association rule mining, anomaly detection, and others. In this context, the Apriori method serves as an important tool for analyzing big data.

The Apriori method has long been recognized as a valuable tool for intelligent data analysis, especially for discovering association rules in large datasets. However, the constantly increasing volume and complexity of data, as well as the emergence of new data processing platforms, pose significant challenges for traditional analysis methods. Apriori is used to discover association rules among data elements, enabling the identification of interesting relationships in large datasets that may not be visible to conventional analysis methods. This approach facilitates drawing conclusions and making decisions based on the analysis of large volumes of data with considerable efficiency and reliability.

One of the main problems associated with the constant growth of data is the difficulty of processing and analyzing large datasets given the limited performance of traditional methods. Various methodological approaches exist to address this problem, including parallel computation, distributed data processing, and the use of specialized algorithms. These approaches enable the efficient distribution of tasks and the optimization of data processing for large volumes.

The paper discusses approaches and methods for the analysis of unstructured data. It demonstrates how Data Mining methods and the Apriori algorithm can be used to uncover hidden relationships in large unstructured datasets, particularly in pharmaceutical logistics.
A comparative analysis of the characteristics of Apriori algorithm variants used for association rule mining is conducted, and a modification of the Apriori algorithm for processing unstructured distributed data using parallel methods is proposed. Results of applying both existing and proposed variants of association rule mining algorithms on various platforms are presented. In this context, the scientific novelty of the research lies in organizing the analysis of information platforms, with an emphasis on the parallel and distributed processing of unstructured data, to determine where the implementation of deep analysis algorithms, including the Apriori algorithm, will be most efficient in terms of speed and technological implementation when processing medical and pharmaceutical data.

CMIS-2024: Seventh International Workshop on Computer Modeling and Intelligent Systems, May 3, 2024, Zaporizhzhia, Ukraine
nataliia.maslova@donntu.edu.ua (N. Maslova); olena.liubymenko@donntu.edu.ua (O. Liubymenko); olga.polovinka1@gmail.com (O. Polovynka); yaroslav.dorohyi@donntu.edu.ua (Y. Dorogyy)
ORCID: 0000-0002-9078-0973 (N. Maslova); 0000-0002-5935-6891 (O. Liubymenko); 0000-0002-0575-4587 (O. Polovynka); 0000-0003-3848-9852 (Y. Dorogyy)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1.1. Problems of Analyzing Large Volumes of Data and the Purpose of the Study

The concept of "Big Data" encompasses datasets characterized by large volumes, high accumulation rates, and diversity, known as volume, velocity, and variety. Analytical publications on big data processing [1-3] reveal key trends, contributions, and research developments in this field. Processing large volumes of data requires powerful computational resources, efficient algorithms to ensure analysis speed and effectiveness, and appropriate information storage and processing systems.
There is a significant number of algorithms and methods for intelligent (deep) analysis, but each task proves to be unique, so research on methods for solving big data processing tasks remains relevant. Of special importance are algorithms for processing unstructured distributed data, which accumulate in fields such as finance, logistics, telecommunications, medicine, etc. [4]. Data analysis in these areas often needs to be conducted in real time, placing additional demands on processing speed and system responsiveness. Additionally, increasing data volumes may raise the risk of privacy and security breaches, so organizing an adequate level of data protection is crucial.

Neural networks, decision trees, fuzzy models, regression, and clustering methods are widely used for big data analysis. However, these methods are typically applied to structured data. Unstructured data, such as logistic data from pharmaceutical networks or medical research data related to drug selection and interaction analysis, cannot be processed with such methods [5]. The choice of processing methods and algorithms determines the research results and their effectiveness. The methods most applicable to both structured and unstructured large-scale data are Data Mining methods.

Research object: processes of intelligent analysis of unstructured large-scale data.
Research subject: algorithms and methods of intelligent data analysis for solving tasks of discovering hidden patterns.

The research objective is to select platforms for the effective implementation of deep analysis algorithms (the Apriori algorithm), considering the possibility of parallel and distributed processing of large-scale data. Pharmaceutical data applicable in medical research and pharmaceutical logistics data are chosen as input data for the experiments.

2. Apriori Method as a Tool for Big Data Analysis

To process large arrays of unstructured data and address the tasks mentioned above, it is advisable to use methods for mining association rules. Based on the identified dependencies, rule bases understandable to experts in applied fields can be synthesized. The Apriori method in Data Mining is an algorithm for extracting association rules from a database. Its main goal is to identify relationships between different elements in sets of structured and unstructured data.

The main Apriori algorithm has several stages:
− Candidate generation: initial itemsets are formed with one item. On the second iteration, two-item sets are analyzed; on the third, three-item sets; and so on, until the algorithm yields informative results (rules are formed).
− Candidate filtering: itemsets that do not meet a given support criterion are removed. Trimming non-informative values in the Apriori method is done by determining significance indicators, which may vary depending on the research area and the required reliability of the results. These indicators are support and confidence; support measures how often a set of items appears in the dataset.
− Association rule creation: the remaining itemsets are used to create association rules describing relationships between elements.

A detailed description of the basic algorithm can be found, for example, in [6]. Among the main drawbacks of this algorithm, the authors note the need for repeated scanning of the original dataset and the difficulty of determining threshold values for each indicator. Therefore, one of the main directions for improving the efficiency of the Apriori algorithm is parallelizing the processing of large datasets, which can significantly reduce scanning time.
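The stages described above can be condensed into a short illustrative sketch. The code below is a minimal pure-Python version of the algorithm; the function names, toy data, and thresholds are our own illustration, not the parallel implementations evaluated later in the paper:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return frequent itemsets with their support (fraction of transactions)."""
    n = len(transactions)
    tx = [set(t) for t in transactions]
    # Candidate generation, level 1: all single items
    level = {frozenset([i]) for t in tx for i in t}
    frequent = {}
    while level:
        # Candidate filtering: keep only itemsets meeting the support threshold
        counts = {c: sum(1 for t in tx if c <= t) for c in level}
        kept = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(kept)
        # Next level: join frequent k-itemsets into (k+1)-item candidates
        prev = list(kept)
        level = {a | b for a, b in combinations(prev, 2) if len(a | b) == len(a) + 1}
    return frequent

def rules(frequent, min_confidence):
    """Rule creation: derive A -> B with confidence = supp(A ∪ B) / supp(A)."""
    out = []
    for itemset, supp in frequent.items():
        for k in range(1, len(itemset)):
            for a in map(frozenset, combinations(itemset, k)):
                conf = supp / frequent[a]
                if conf >= min_confidence:
                    out.append((set(a), set(itemset - a), conf))
    return out
```

For example, on pharmacy-style transactions such as `[["aspirin", "vitamin C"], ["aspirin", "bandage"]]`, calling `apriori(...)` and then `rules(...)` returns the itemsets and rules that clear both thresholds; by the downward-closure property, every subset of a kept itemset is itself kept, so the confidence lookup never misses.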
The Apriori method is widely used in areas such as recommendation systems, purchase analysis (including pharmaceuticals), advertising strategies, education [7], and other fields where discovering relationships between elements of datasets is important. In medical modeling, the Apriori method is used for the automated analysis of medical data in the process of providing medical care and in researching the interaction of drugs taken by a patient [8-10]. Modern data analysis methods are used to examine the patient's basic information, medical history, physical condition, and other important textual characteristics. Additionally, correlations are sought between the factors that caused an illness and its degree of severity on the one hand, and the results of the patient's examination on the other. This is done to support automated clinical care and establish an accurate diagnosis.

In this work, the method is applied to pharmaceutical logistics tasks, which are critical factors in the implementation of management methods after the end of the life cycle in the pharmaceutical care process described in [11]. We will not dwell on the description and analysis of basic and modified algorithms for mining association rules; they are described in sufficient detail in the literature and in the authors' works [11]. We will provide a brief overview of the application area emphasized above and proceed directly to the analysis of the most modern and promising technologies and implementations in terms of speed and reliability.

3. Methodological Approaches to Addressing Big Data Processing Issues in Pharmaceuticals

Pharmaceutical logistics involves managing the movement of medical supplies and the related financial, personnel, and information flows in order to distribute pharmaceuticals effectively and reduce the overall costs of their supply, production, and sale while achieving the required quality and meeting consumer demands.
In the context of pharmaceutical logistics, big data may include information on the production, supply, storage, and distribution of medical supplies, clinical trials, electronic medical records of patients, information on pharmaceuticals and their interactions, as well as data on patients' previous prescriptions and treatments. Pharmacy network data are envisaged as input for obtaining the analytical dependencies needed to solve pharmaceutical logistics tasks. They are characterized by large volumes of information, a distributed structure, and the availability of specialized basic data collection and processing packages.

Materials [12-14] describe the features of pharmaceutical logistics, list twenty main principles characterizing it, and confirm the complexity and multidimensionality of pharmaceutical data processing tasks. Such data can be classified as Big Data, falling into the category of large volumes of structured and unstructured data.

The article [15] demonstrates the relevance and necessity of searching for effective Data Mining algorithms, including parallel ones, applied to large volumes of pharmaceutical data. Examples of existing processing packages are provided, and their main capabilities and shortcomings are analyzed. It is noted that the main problems in processing pharmaceutical data are:
− Distribution: information is collected from different sources and must be stored in a specialized repository;
− Significant volume (productivity): analyzing information volumes measured in terabytes can take an unacceptable amount of time and require significant computational power.
Modern technologies supporting IDA algorithms and methods include:
− data analysis software packages (STATISTICA, SAS, SPSS, STATGRAPHICS, etc.);
− artificial neural networks (BrainMaker, NeuroShell, OWL);
− case-based reasoning systems (tools), for example, KATE tools, Pattern Recognition Workbench;
− decision tree methods (KnowledgeSeeker, See5/C5.0, Clementine, SIPINA, IDIS, etc.);
− evolutionary programming tools (NeuroShell);
− genetic algorithms (GeneHunter);
− limited enumeration methods (used, for example, in the WizWhy system from WizSoft);
− systems for data visualization in multidimensional space (DataMiner 3D).

Despite this variety, [16] asserts that the main problems of processing most types of data can be solved through parallel (Parallel Data Mining) and/or distributed (Distributed Data Mining) Data Mining algorithms.

4. Tools for Implementing Parallel Versions of the Apriori Algorithm

The main modern instrumental tools for implementing the Apriori algorithm for parallel and distributed processing are the platforms listed below.

MPI (Message Passing Interface, 1990): a standard for parallel programming in large computing clusters. The related MPI4py package, developed in 2008, is a Python wrapper for the MPI library that allows developers to use MPI capabilities for parallel programming in Python.

OpenMP (Open Multi-Processing, 1997): an open standard for shared-memory parallel programming that provides compiler directives for parallelizing code at the thread level.

Hadoop (2006): a framework for processing and analyzing large volumes of data in distributed computing environments.

OpenCL (Open Computing Language, 2008): an open standard for parallel programming on heterogeneous systems such as GPUs, CPUs, and others.

Spark (2010): an in-memory framework for parallel data processing that supports a wide range of data analysis tasks.
Apache Flink (2014): a distributed system for processing streaming data and batch operations, providing high-speed data processing.

The platforms are listed by the years of their active market entry and allow programmers to use parallel computing resources effectively for the fast and efficient resolution of various tasks. Let us consider the peculiarities of implementing the Apriori algorithm on some of the listed platforms.

4.1. MPI Platform

The implementation of the Apriori algorithm in this environment is most thoroughly described in [17]. That work discusses modern methods for mining association rules and discovering knowledge in large volumes of data. Based on a comparative analysis and taking into account the specifics of the subject area, the use of the Apriori algorithm is justified, and criteria and methods of parallelization for the association rule search procedure are considered. The results obtained from the research were tested in the form of a parallel implementation of the QuickSort algorithm. As the hardware base, a parallel cluster of the university with the characteristics listed in Table 1 was used.

Table 1
Technical characteristics of the cluster

Fat servers: Dell PowerEdge R710; 16 cores; 24 GB RAM; HDD: 2 x 140 GB and 6 x 1000 GB; 2 IB ports; 4 LAN ports; 1 IPMI port.
Thin servers: Intel® Xeon® CPU X5560 @ 2.8 GHz; 16 cores; 16 GB RAM; no HDD; 1 IB port; 2 LAN ports.
Network devices: Dell PowerConnect 5448; InfiniBand Voltaire Grid Director.

The general scheme of the cluster is shown in Figure 1.

Figure 1: Cluster diagram

The MPI implementation allows data to be distributed among cluster nodes so that separate processing tasks are performed on each. MPI does not provide specific tools for distributing processes; it is the developer's responsibility to handle tasks such as data distribution, processing, execution, and final result collection.
Considering these capabilities and aiming to speed up result acquisition, the algorithm was divided into functions executed sequentially:

F = F1 + F2 + F3 + F4 + F5 + F6, (1)

where
− F1 reads part of the data from the file;
− F2 distributes the data between processes;
− F3 performs parallel scanning and calculation of itemsets (sets of elements that occur frequently in the given part of the sample);
− F4 handles data exchange between nodes (initiating data transmission operations between nodes);
− F5 aggregates results: each node receives partial results from the other nodes and combines them into a common list of frequently occurring itemsets;
− F6 implements the main part of the Apriori algorithm (the main function), which controls data reading, distribution among processes, exchange, the initiation of each process, and the generation of new candidates.

Steps 2-6 are repeated until the maximum (specified) itemset size or another stopping condition is reached. After processing is completed, each node aggregates the results and analyzes them to discover association rules. Calculations were performed with a fixed support count (20%).

In conclusion, the implementation of the Apriori algorithm using MPI is possible, although algorithmically it proved to be multi-stage. It is also worth noting that, although the computations themselves were quite fast, the block for reading data and distributing it among processes required significant time.
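The decomposition (1) can be illustrated without an MPI runtime. The sketch below (pure Python, single-item counting only, with hypothetical function names of our own) mimics F2 to F5 by splitting the transaction list into per-node chunks, scanning each chunk independently, and merging the partial counts. In a real MPI or MPI4py implementation, each chunk would be processed by a separate rank and the merge would use collective operations such as gather or reduce:

```python
from collections import Counter

def split_chunks(data, n_nodes):
    """F2: partition the transaction list into roughly equal per-node chunks."""
    k, r = divmod(len(data), n_nodes)
    chunks, start = [], 0
    for i in range(n_nodes):
        end = start + k + (1 if i < r else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def local_scan(chunk):
    """F3: each node counts item occurrences in its own part of the sample."""
    c = Counter()
    for transaction in chunk:
        c.update(set(transaction))
    return c

def aggregate(partials, total, min_support):
    """F4/F5: merge partial counts and keep items meeting the support threshold."""
    merged = Counter()
    for p in partials:
        merged.update(p)
    return {item: cnt / total for item, cnt in merged.items()
            if cnt / total >= min_support}

def run(transactions, n_nodes=4, min_support=0.2):
    """F6: driver: distribute the data, scan each chunk, aggregate the results."""
    chunks = split_chunks(transactions, n_nodes)
    partials = [local_scan(c) for c in chunks]  # executed concurrently under MPI
    return aggregate(partials, len(transactions), min_support)
```

The same split-scan-merge pattern applies unchanged at higher candidate levels; only the per-chunk scan becomes more expensive, which is why the reading and distribution stage (F1/F2) dominated the measured time.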
Implementation features of the Apriori algorithm for OpenMP include aspects such as memory-level parallelism, work distribution, mechanisms for synchronizing access to shared data between threads, memory allocation, performance optimization, testing, and validation. Memory-level parallelism means that OpenMP can process data simultaneously on multiple processor cores. Distributing work among different threads speeds up processing because different parts of the data can be processed at the same time. OpenMP provides mechanisms for synchronizing access to shared data between threads, which is important for preventing conflicts and race conditions during parallel computation. An Apriori implementation for OpenMP requires performance tuning, using optimal parameters and selecting the best algorithms for the specific task. After the algorithm was implemented, testing and validation were conducted on real data to ensure its correctness and efficiency.

OpenMP is a standard for parallel programming on multi-core and multi-processor shared-memory systems. It provides a set of compiler directives and a library of functions that allow programs to be parallelized at the directive level. For comparability with the MPI results, the C and C++ programming languages were used at the programming stage. Table 2 shows the execution time of the Apriori algorithm parallelized on a quad-core processor using OpenMP with a fixed support count (20%), alongside the same algorithm implemented with MPI technology.
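OpenMP itself targets C, C++, and Fortran, so its directives are not shown here; the sketch below reproduces the same work-sharing and synchronization pattern with Python threads, purely as a structural analogue (Python's GIL prevents true CPU parallelism here, and a real OpenMP version would use a parallel for directive with a critical section or reduction for the merge):

```python
import threading
from collections import Counter

def threaded_counts(transactions, n_threads=4):
    """Split the scan across threads; a lock guards the shared table,
    mirroring an OpenMP critical section around the merge step."""
    shared = Counter()
    lock = threading.Lock()

    def worker(chunk):
        local = Counter()            # private per-thread accumulator,
        for t in chunk:              # like an OpenMP thread-private copy
            local.update(set(t))
        with lock:                   # synchronized merge (critical section)
            shared.update(local)

    size = (len(transactions) + n_threads - 1) // n_threads
    threads = [threading.Thread(target=worker,
                                args=(transactions[i * size:(i + 1) * size],))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared
```

Accumulating into a thread-private counter and merging once at the end keeps lock contention low, which is the same reasoning behind OpenMP's reduction clause.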
Table 2
Algorithm execution time for MPI and OpenMP (the number in parentheses is the number of processes/threads)

Array size    MPI (1)    OpenMP (1)    MPI (2)    OpenMP (2)    MPI (4)    OpenMP (4)
1000000       0.16       0.17          0.13       0.13          1.53       0.32
5000000       1.22       0.51          1.11       0.46          1.63       0.43
10000000      3.564      0.95          3.17       0.86          1.51       0.64
15000000      7.008      1.17          4.40       0.98          1.72       0.41
20000000      11.536     1.31          6.71       1.23          1.63       0.72
40000000      40.571     25.23         27.25      23.15         11.88      5.21
80000000      151.998    27.32         81.49      27.64         21.6       5.34

As the experiment demonstrates, implementing Apriori for OpenMP can significantly improve data processing speed by leveraging parallel programming and optimizing algorithm performance across multiple processor cores.

4.3. Hadoop Platform

The platform is described in [18], [19], [20]. For instance, [19] examines the most well-known modifications of Apriori algorithms for association rule mining. The research presents the performance results of the linear Apriori and Apriori-TID algorithms on datasets of various sizes using computers with standard memory capacities. The feasibility of using the AIS algorithm and its parallel implementation on the Hadoop MapReduce framework is justified, and a comparison between the linear implementation of Apriori-TID and the parallel implementation of AIS is provided. Conclusions are drawn regarding the effectiveness of parallel algorithms for the formulated task.

In general, developing an Apriori program for Hadoop on 12 processors requires the Hadoop MapReduce framework and distributed computing. This can be achieved by leveraging the following capabilities of Hadoop and its compatible tools (Table 3).
Table 3
Stages of implementation of the Apriori algorithm on Hadoop

№  Stage                   Processing tool                         Stage description
1  Data distribution       Hadoop Distributed File System (HDFS)   The data should be distributed among the Hadoop cluster nodes
2  Fragment processing     Mapper                                  The Mapper implements the logic for selecting subsets of items from each fragment
3  Aggregation of results  Reducer                                 The Reducer combines and processes the results obtained from different pieces of data
4  Apriori                 Apriori process                         Finding frequently occurring itemsets and generating candidates for the next level; each iteration is performed on a separate subset of the data
5  Processing              Parallel processing                     Parallel processing of different data fragments at the same time on different nodes of the cluster
6  Load control            Resource management                     Monitoring the correct use of cluster resources to avoid excessive load on individual nodes

Thus, implementing the Apriori algorithm for Hadoop is a rather complex process that requires a deep understanding of the Hadoop platform itself, of the operation of associated tools (MapReduce, Apache Spark), and of their application in a software implementation. Another aspect to consider is that Hadoop requires programming in Java for MapReduce task implementation. For other programming languages, such as C++, one needs to use the Hadoop Streaming framework, which does not allow a direct comparison of results obtained on the Hadoop platform with those from the MPI and OpenMP experiments. Additionally, input files must be prepared in a format understandable to Hadoop, and an output directory must be created to obtain results.

4.4. OpenCL Platform

The implementation of the Apriori algorithm for OpenCL uses the open OpenCL standard for parallel programming on heterogeneous computing devices such as central processing units (CPUs), graphics processing units (GPUs), and others.
The distinctive feature of the Apriori implementation for OpenCL lies in harnessing the parallelism capabilities provided by OpenCL to process large volumes of data. The algorithm is divided into parts that can be computed in parallel and processed on different computing devices simultaneously. With OpenCL, the resources of various computing devices can be used efficiently, allowing computation to be distributed between CPUs and GPUs based on their capabilities and the tasks at hand. This enables significant improvements in data processing speed and in the performance of the Apriori algorithm.

The general process of implementing Apriori for OpenCL includes the following steps:
1. Development of an OpenCL kernel for computing subsets of itemsets and analyzing their frequency.
2. Implementation of mechanisms for partitioning the data and distributing it among different computing devices.
3. Configuration and optimization of computing system parameters for maximum performance.
4. Testing and validation of the implemented algorithm on real datasets.

4.5. Apache Spark Platform

Apache Spark is a distributed data processing system that provides capabilities for the parallel processing of large volumes of data. One of its features is the use of the Scala language for algorithm implementation. For data processing, candidate itemset generation, Apriori algorithm iterations, and result analysis in Spark, dedicated functions should be applied, some of which are listed in Table 4. The availability of such functions is a distinctive feature of Spark.

Table 4
Basic Apache Spark functions used to implement the Apriori method

Data processing functions:
1. map(func): performs data transformation, processing, and cleaning operations in a distributed Spark environment.
2. flatMap(func): each element can generate zero or more elements in the resulting RDD.
Functions for building candidate itemsets:
1. cartesian(anotherRDD): returns an RDD that contains all possible pairs (combinations) of elements from the current RDD and another RDD.
2. mapPartitions(func): applies the given function func to each partition of the RDD and returns a new RDD with the results.

Apriori iteration functions:
1. map, flatMap, filter, and others: data processing and filtering in the RDD.

Result analysis functions:
1. reduceByKey(func): combines the values for each key in the RDD using the given function func.
2. collect(): gathers all elements of the RDD onto a single node and returns an array.

4.6. Apache Flink Platform

The features of implementing the Apriori algorithm for Apache Flink include the following aspects:
− Stream processing: Apache Flink supports stream data processing, allowing data to be processed in real time. This can be particularly useful for analyzing large data streams, such as in large networks or IoT systems.
− Distributed processing: Apache Flink provides distributed data processing capabilities, enabling the use of server clusters for the fast processing of large volumes of data.
− Integration with other tools: Apache Flink integrates easily with other Apache tools such as Apache Hadoop, Apache Kafka, Apache Hive, and others. This facilitates seamless data exchange between different systems and leverages additional functionality.
− Programming simplicity: Apache Flink offers a convenient programming interface for implementing complex analytical tasks on large datasets. It supports Java, Scala, and Python, making program development easier for a wide range of developers.
− Performance optimization: Apache Flink incorporates built-in mechanisms for performance optimization, including query optimization and execution plan scheduling. This maximizes cluster resource utilization and reduces data processing time.
− Scalability: Apache Flink scales easily both vertically and horizontally, allowing the processing of both small and large volumes of data. It supports horizontal scaling, enabling new nodes to be added to the cluster to increase its performance.

Despite its many advantages, Apache Flink also has drawbacks. These include configuration complexity, the greater prevalence of Apache Spark, and significant resource requirements, which can pose challenges for scalability and performance when processing large datasets. Additionally, the currently incomplete documentation and debugging issues mean that Apache Flink cannot yet be recommended for roles other than research-oriented ones.

5. Recommendations and Results

The execution speed of the Apriori algorithm for the same data volume and a fixed support count can depend significantly on various factors, such as the characteristics of the program itself, the peculiarities of the data, and the hardware on which it runs. There is therefore no definitive answer to the question of which algorithm is "absolutely" fast. However, some general guidelines can be provided:
− OpenCL technologies can provide a boost in execution speed through parallel computation on a large number of GPU cores.
− MPI and MPI4py can work very efficiently on clusters with a large number of nodes. However, MPI requires complex development and management of communication between cluster nodes.
− OpenMP provides a simple way to parallelize code for multi-processor systems with multicore processors. It can be effective for execution on a single node with multiple processors.
− Hadoop and Spark are designed to work with large data volumes on server clusters. They provide distributed computing capabilities but usually demonstrate lower speed compared to solutions optimized for parallel computing on GPUs or multicore systems.
− Apache Flink is a distributed data processing system for streaming and batch data.
It can be an effective option for real-time or streaming data processing. In addition, performance can be improved by optimizing the code, using specialized libraries and frameworks, and adjusting system parameters for maximum productivity. It is also important to consider the specific task and project requirements when choosing a parallel programming platform.

Thus, an analysis of the feasibility of using different platforms for association rule mining algorithms has been conducted, and recommendations have been provided. The results of this descriptive comparison can assist researchers in selecting a data processing algorithm for large volumes of data using the Apriori algorithm, or in solving other data mining tasks.

In addition to the general recommendations, the authors conducted a series of experiments. Several datasets of different sizes were formed for the experiments. All datasets were stored as CSV files; this format is lightweight, contains no metadata, and is easily readable, which is crucial when reading large volumes of data. The experiments were conducted under identical conditions on a multicore computer with variations of 2, 4, and 8 cores. For comparison, the parallel processing and parallelization algorithm described in the authors' works was applied, as well as implementations of data analysis algorithms from the popular open-source data analysis tools RapidMiner and Weka, and the Apache Spark MLlib MapReduce-based parallel computing system.

Weka is a free library and software for machine learning and data analysis developed at the University of Waikato in New Zealand. It provides a wide range of machine learning algorithms for classification, clustering, regression, association rules, visualization, and other data analysis tasks. The Apriori algorithm in Weka is available both through the user interface and through the programming interface.
Weka is primarily designed for use on a single server or workstation and has no built-in support for distributed or parallel computing using GPUs, MPI, or Hadoop. However, in combination with other tools and technologies, the Weka library can be used as a node in a larger computation.

RapidMiner is a powerful data analysis platform for machine learning, model development, and visualization that allows users to analyze data effectively and use it for decision making. RapidMiner can work with data of various volumes, processing both small datasets (a few megabytes or gigabytes) and large ones (several gigabytes or terabytes). The Apriori algorithm in RapidMiner is available as one of the data analysis modules. Although RapidMiner is not directly associated with MPI (Message Passing Interface) or Hadoop, it can be integrated with them for processing large data volumes and distributed computing.

Apache Spark MLlib (Machine Learning Library) is a machine learning library that is part of Apache Spark, an open-source framework for parallel and distributed data processing. MLlib provides a wide range of machine learning and data analysis algorithms that can be used for tasks such as classification, regression, clustering, recommendations, image processing, and more. These algorithms are optimized for execution on distributed clusters using Apache Spark, allowing large volumes of data to be processed quickly and efficiently. The Apriori algorithm is not included in standard Apache Spark; in this work it was implemented in the Python programming language. Apache Spark is designed for the efficient processing of large volumes of data in distributed environments. It is integrated with Hadoop and can use it as the primary data store and platform for distributed computing, although the Spark library can also be used separately from Hadoop.
At the core of this integration is a common distributed file system (HDFS, the Hadoop Distributed File System) that allows Apache Spark to work with data stored in Hadoop.

The results of the experiments for the various data analysis algorithms are presented in Table 5.

Table 5
Execution time of association rule mining algorithms (seconds; numbered columns give the number of cores)

             Apache Spark MLlib            DXelopes                      RapidMiner  Weka
Sample       1      2      4      8        1      2      4      8        1           1
1000000      0.156  0.149  0.119  0.096    0.108  0.092  0.066  0.051    0.1144      0.0663
2000000      0.295  0.280  0.192  0.111    0.201  0.135  0.084  0.074    0.2145      0.1248
3000000      0.454  0.376  0.241  0.133    0.278  0.168  0.110  0.092    0.3289      0.1911
4000000      0.565  0.413  0.301  0.174    0.370  0.211  0.127  0.113    0.4225      0.247
5000000      0.618  0.485  0.350  0.237    0.415  0.301  0.190  0.131    0.5161      0.3016
6000000      0.823  0.686  0.397  0.299    0.548  0.319  0.176  0.143    0.5967      0.3484
7000000      0.935  0.761  0.475  0.360    0.632  0.344  0.194  0.155    0.6773      0.3952
8000000      1.119  0.913  0.585  0.417    0.706  0.403  0.219  0.170    0.8112      0.4732
9000000      1.240  0.999  0.692  0.464    0.800  0.446  0.235  0.198    0.8983      0.5239
10000000     1.445  1.089  0.778  0.522    0.849  0.595  0.370  0.239    1.0192      0.5954

A graphical interpretation of the results (calculation time for the Apriori algorithm on different platforms and data volumes) is provided in Figure 2.

Figure 2: Graphical interpretation of the results

Thus, for algorithms and platforms without GPUs, the experiment revealed that the most efficient implementation was the one based on the Weka platform, although Weka does not provide parallel processing capabilities, and its execution required significant preparatory work on the data before calculations. The Weka library may therefore, at this stage, be recommended for linear (single-threaded) implementations. Slightly lower performance was shown by the Apache Spark MLlib library, whose further prospect lies in the full utilization of Apache Spark's capabilities and its integration with Hadoop.
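The parallel behaviour implied by Table 5 can be quantified as speedup S_p = T_1 / T_p and efficiency E_p = S_p / p. A small sketch computing both from the 10,000,000-transaction row of the table (the values are taken directly from Table 5):

```python
# Execution times (seconds) for the 10,000,000-transaction sample, from Table 5.
times = {
    "Apache Spark MLlib": {1: 1.445, 2: 1.089, 4: 0.778, 8: 0.522},
    "DXelopes":           {1: 0.849, 2: 0.595, 4: 0.370, 8: 0.239},
}

for platform, t in times.items():
    for cores in (2, 4, 8):
        speedup = t[1] / t[cores]      # S_p = T_1 / T_p
        efficiency = speedup / cores   # E_p = S_p / p
        print(f"{platform}: {cores} cores -> "
              f"speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

Both platforms fall well short of linear speedup on this sample (roughly 2.8x for Spark MLlib and 3.6x for DXelopes on 8 cores), which is consistent with the coordination and data-preparation overheads discussed above.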
It should be noted that during the experiments, the authors' own implementation (based on the QuickSort algorithm) and the DXelopes algorithm (Kharkov Polytechnic Institute) were applied. A further direction of the research may therefore be the integration of machine learning methods with the Apriori algorithm. For example, the Apriori method can identify frequent itemsets in the dataset and use them to detect associations between different elements, while machine learning methods can model complex dependencies and predict outcomes from input data, using classification and regression models, as well as elements of cluster analysis, for various prediction or classification tasks.

6. Conclusions

As a result of the study, findings have been obtained that both generalize and advance the authors' previous research. The focus of the work has been on selecting platforms for the efficient implementation of deep analysis algorithms (specifically the Apriori algorithm), considering the possibility of parallel and distributed processing of large data volumes. Pharmaceutical data applicable to medical research and pharmaceutical logistics data were chosen as input for the experiments. The characteristics of such data, notably their distributed nature and substantial volume, were highlighted. The following actions were undertaken in the study. The concept of "big data" and the challenges associated with processing it were examined. The necessity of applying modern algorithms to analyze unstructured data, such as the results of medical research and pharmaceutical data (including pharmaceutical logistics data), was confirmed. The Apriori method was discussed as a tool for analyzing data of significant volume, and the study applied the authors' QuickSort-based program. The instrumental tools for implementing the Apriori algorithm for parallel and distributed processing (the MPI, OpenMP, Hadoop, OpenCL, Spark, and Apache Flink platforms) were analyzed.
The peculiarities of applying these platforms in constructing association rule mining algorithms were described, recommendations were provided, and the results of applying each of the mentioned platforms were forecast in terms of complexity (required workload) and time to obtain results. Computational experiments were conducted for a linear implementation and for parallel implementations on four multi-processor platforms, confirming the analytical conclusions presented in the final sections of the work. Paths for further research were outlined, envisioning deeper experiments with the application of parallel and distributed versions of association rule mining algorithms in various fields, including the medical-pharmacological sphere, with further integration of the Apriori algorithm with machine learning methods.

It is worth noting that the scientific novelty of the research lies in the integrated analysis of the characteristics of various information platforms designed for parallel and distributed processing of unstructured data. It has been demonstrated on which of these platforms the implementation of deep analysis algorithms, including the Apriori algorithm, will be most efficient in terms of speed and technological feasibility. The Apriori method has long been recognized as a valuable tool in data mining, particularly for uncovering association rules in large datasets; however, the ever-growing volume and complexity of data pose significant challenges to traditional methods of analysis. In this context, the scientific novelty of this study lies in the exploration of effective methods and algorithms for implementing the Apriori algorithm specifically tailored to handle large volumes of pharmaceutical data, with a focus on parallel and distributed processing.
The study delves into the methodological approaches necessary for processing unstructured data in pharmaceutical logistics, highlighting the importance of finding efficient Data Mining algorithms, particularly those capable of parallel and distributed processing. This aligns with the broader challenges faced in big data processing across various sectors and emphasizes the need for specialized approaches to handle the unique characteristics of pharmaceutical data. The research systematically evaluates various platforms for implementing the Apriori algorithm, considering factors such as parallelism, in-memory processing, performance optimization, and scalability. This comprehensive analysis extends beyond theoretical considerations to practical implementations, providing insights into the feasibility and effectiveness of each platform in real-world scenarios.

Furthermore, the study conducts experiments to empirically validate the performance of different implementations of the Apriori algorithm on various platforms and datasets. By comparing results across platforms and data volumes, the research offers valuable insights into the relative strengths and limitations of each approach, guiding researchers and practitioners in selecting the most suitable methods for their specific requirements. Overall, the study contributes to the advancement of data mining techniques in pharmaceutical logistics by offering novel insights into the implementation of the Apriori algorithm for processing large volumes of data. By addressing the challenges posed by big data in this domain and proposing practical solutions, the research enhances our understanding of how Data Mining methods can be effectively applied to extract valuable insights and improve decision-making processes in pharmaceutical logistics.

Acknowledgements

Thanks to Professor O. Dmitrieva for guiding the first stages of the work and for her scientific contribution to the research.

References

[1] M. M. Medeiros, N. C.
Hoppen and A. G. Maçada, Data science for business: benefits, challenges and opportunities, The Bottom Line: Managing Library Finances, ahead-of-print (2020). DOI: 10.1108/BL-12-2019-0132
[2] M. Juliardi, I. A. Malik, Bibliometric analysis of data science: trends, contributions, and research developments, West Science Interdisciplinary Studies 1(07) (2023) 365-375. DOI: 10.58812/wsis.v1i07.81
[3] L. P. Favero, P. Belfiore, Data Science for Business and Decision Making, Elsevier Science (2019). DOI: 10.1016/C2016-0-01101-4
[4] I. Dorohyi, Critical IT infrastructure resource distribution algorithm, in: Proceedings of the 11th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS 2021), vol. 2 (2021) 632-639.
[5] J. Luengo, D. García-Gil, S. Ramírez-Gallego, S. García, F. Herrera, Big Data Preprocessing: Enabling Smart Data, Springer International Publishing (2021). ISBN 978-3-030-39107-2
[6] Akobir Shahida, Apriori is a scalable algorithm for searching for associative rules, 2020. https://loginom.ru/blog/apriori
[7] N. Maslova, O. Polovynka, Associative methods as a tool to improve the quality of knowledge control, in: Proceedings of the 4th International Conference on Computational Linguistics and Intelligent Systems (COLINS 2020), Volume I: Main Conference, Lviv, 2020, pp. 778-787. https://dblp.uni-trier.de/rec/bibtex/conf/colins/MaslovaP20
[8] K. Gulzar, M. A. Memon, S. M. Mohsin, S. Aslam, A. Akber, M. A. Nadeem, An efficient healthcare data mining approach using Apriori algorithm: a case study of eye disorders in young adults, Information 14(4) (2023). DOI: 10.3390/info14040203
[9] X. Li, B. Tan, J. ZhengXu, J. X. Xiao, Y. Liu, The intervention of data mining in the allocation efficiency of multiple intelligent devices in intelligent pharmacy, Comput Intell Neurosci (2022). DOI: 10.1155/2022/5371575
[10] M. L. Faquetti, A. M. D. L. Torre, Th. Burkard, G. Obozinski, A. M.
Burden, Identification of polypharmacy patterns in new users of metformin using the Apriori algorithm: a novel framework for investigating concomitant drug utilization through association rule mining, Pharmacoepidemiology and Drug Safety 32(3) (2022) 366-381.
[11] O. L. Polovynka, D. V. Nikulin, O. A. Dmitrieva, Study of the speed of operation of a priori algorithms on large volumes of data, Science and Production, Series: Information Technologies, Mariupol, 22 (2020) 246-253. http://sap.pstu.edu/article/view/211383
[12] E. A. R. de Campos, I. C. de Paula, C. S. ten Caten, K. P. Tsagarakis, J. L. D. Ribeiro, Logistics performance: critical factors in the implementation of end-of-life management practices in the pharmaceutical care process, Environmental Science and Pollution Research 30 (2023) 29206-29228. DOI: 10.1007/s11356-022-24035-z
[13] B. Sarwar, Supply chain management in the pharmaceutical industry (2020). https://enhelion.com/blogs/2020/12/30/supply-chain-management-in-pharmaceutical-industry/
[14] O. O. Molodid, V. V. Shemena, General principles of organizing logistics business processes in a pharmaceutical company, Electronic journal "Effective Economy" 6 (2019). DOI: 10.32702/2307-2105-2019.6.62
[15] H. Jahani, R. Jain, D. Ivanov, Data science and big data analytics: a systematic review of methodologies used in the supply chain and logistics research, Ann Oper Res (2023). DOI: 10.1007/s10479-023-05390-7
[16] A. Koschmider, M. Aleknonytė-Resch, F. Fonger, C. Imenkamp, A. Lepsien, K. Apaydin, M. Harms, D. Janssen, D. Langhammer, T. Ziolkowski, Y. Zisgen, Process mining for unstructured data: challenges and research directions, arXiv preprint (30 Nov 2023). DOI: 10.48550/arXiv.2401.13677
[17] O. A. Dmitrieva, O. L. Polovynka, Algorithmic support for parallel methods of searching for associative rules, Collection of Scientific Works of the Donetsk National Technical University,
Series: "Computer Technology and Automation", Pokrovsk, 1(31) (2018) 62-69. https://science.donntu.edu.ua/ota-arhiv/31/025_dmitrieva.pdf
[18] The Apache Software Foundation, Apache Hadoop (28 July 2020). https://blogs.apache.org/hadoop/entry/announce-apache-hadoop-3-3
[19] O. Polovynka, O. Dmitrieva, Research of the efficiency of the Apriori group algorithms on different database sizes, SWorldJournal 6(1), pp. 71-78. https://www.sworldjournal.com/index.php/swj/article/view/swj06-01-012
[20] T. S. Yange, I. P. Gambo, R. Ikono, H. A. Soriyan, A multi-nodal implementation of Apriori algorithm for big data analytics using MapReduce framework, International Journal of Applied Information Systems (IJAIS) 12(31) (2020). DOI: 10.5120/ijais2020451868