=Paper=
{{Paper
|id=Vol-3632/ISWC2023_paper_480
|storemode=property
|title=Datalog with External Machine Learning Functions for Automated Cloud Resource Configuration
|pdfUrl=https://ceur-ws.org/Vol-3632/ISWC2023_paper_480.pdf
|volume=Vol-3632
|authors=Zhuoxun Zheng,Ognjen Savkovic,Huu Phuc Luu,Ahmet Soylu,Evgeny Kharlamov,Baifan Zhou
|dblpUrl=https://dblp.org/rec/conf/semweb/ZhengSLSKZ23
}}
==Datalog with External Machine Learning Functions for Automated Cloud Resource Configuration==
Zhuoxun Zheng (1,2), Ognjen Savkovic (3), Nikolay Nikolov (4,2), Luu Huu Phuc (1), Ahmet Soylu (4,2), Evgeny Kharlamov (4,2) and Baifan Zhou (4,2)

1 Bosch Center for AI, Germany; 2 Department of Informatics, University of Oslo, Norway; 3 Free University of Bozen-Bolzano, Italy; 4 Department of Computer Science, Oslo Metropolitan University, Norway

Abstract

Industry 4.0 and Internet of Things (IoT) technologies unlock unprecedented amounts of data from factory production, posing big data challenges. In that context, distributed computing solutions such as cloud systems are leveraged to parallelise the data processing and reduce computation time. As cloud systems become increasingly popular, there is growing demand for users who are not cloud experts (such as data scientists and domain experts) to deploy their solutions on them. To this end, we propose SemCloud, a semantics-enhanced cloud system, to tackle the challenges of data volume and of a broader user base. The system has been evaluated in an industrial use case with millions of data records, thousands of repeated runs, and domain users, showing promising results. This poster paper accompanies our full paper; it focuses on Datalog rules with external machine learning functions for automated resource configuration and provides additional discussion on formalism and implementation techniques.

Keywords: Datalog, knowledge graph, cloud configuration, machine learning

1. Introduction

Background and Challenges. Industry 4.0 focuses on smart factories that rely on IoT technology for automation. This produces massive amounts of production data, increasing the demand for data-driven solutions and cloud technology. Yet, the users of these solutions and of cloud technology are often not cloud experts, but rather domain experts and data scientists. In a standard data science project, the team requires extensive assistance from cloud experts whenever they want to deploy solutions or make small changes to solutions already deployed on the cloud. To facilitate the adoption of cloud systems for more projects and users, one can equip all projects with cloud experts, or launch training programmes on cloud technology. Both require careful planning to balance time, cost, and benefits.

Our Approach. Existing work on this topic addresses the cloud deployment issues only to a limited extent [1]: it either focuses only on the formal description of the cloud, or offers limited adaptability of cloud systems. To address the scalability challenges in data volume and to democratise cloud systems for more users, we propose SemCloud, a semantics-enhanced cloud system that scales semantic ETL pipelines on the cloud and allows non-cloud experts to deploy their solutions.
A rough description of the SemCloud workflow is as follows: (0) non-cloud experts create knowledge graphs (KGs) that represent ETL pipelines on a cloud system, where the attributes of the cloud resource configuration are under-specified; then Datalog rules execute in three steps: (i) graph extraction rules populate rule predicates by extracting information from the ETL-pipeline KGs; (ii) resource estimation rules estimate the resource consumption of the given pipeline assuming a single computing node (i.e., an infinitely large node); (iii) resource configuration rules find the optimal resource allocation in distributed computing for the given pipeline. This poster paper accompanies our full In-Use paper [2], which is inspired by an industrial use case of manufacturing quality monitoring; here we provide additional insights and more details regarding the implementation.

2. Approach

Use Case: Distributed Semantic ETL. In our welding use case, large amounts of data collected from different factories, customers, and software versions are integrated and analysed in parallel to optimise the subsequent welding production. To enable distributed ETL, we need a strategy that makes the ETL parallelisable. SemCloud achieves this by breaking the ETL down into pipelines of four steps: retrieve, slice, prepare, and store (Fig. 1), where data is first retrieved from databases or online streams and then split into subsets (e.g., each belonging to one welding machine) by slice, so that the following two steps can process and store the subsets in parallel. Here the cloud configuration plays an important role.

KG Construction for ETL Pipelines. SemCloud provides users with a GUI to construct semantic ETL pipelines and encode them into knowledge graphs, based on the SemCloud ontology (Figure 1a). The SemCloud ontology [3] is written in OWL 2 and consists of 20 classes and 165 axioms. It has three main classes: DataEntity, Task, and Requirement. DataEntity refers to any dataset to be processed; Task has sub-classes that represent the four types of tasks in data preparation: retrieve, slice, prepare, and store; Requirement describes the requirements for computing, storage, and networking resources. We illustrate the generation of ETL pipelines in KGs with the example in Figure 1b. For these data, the users construct (via the GUI) an ETL pipeline p1 with four layers. Firstly, data are "retrieved" from the welding factories: the layer l1 is of type RetrieveLayer and has the task t1 of type Retrieve. The task t1 has an IO handler io, which has an output d1 of type DataEntity. Then the data are read in by a task t2 of type Slice and "sliced" into smaller pieces d2, d3. These slices are input to different computing nodes for tasks t3 and t4 of type Prepare. Finally, all prepared data entities are stored by t5 of type Store.

Figure 1: (a) Schematic illustration of the SemCloud ontology and (b) a KG for an ETL pipeline.
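To make the shape of such a pipeline KG concrete, here is a minimal Python sketch using rdflib. It is illustrative only and not taken from SemCloud: the namespace IRI and the property names hasLayer, hasTask, hasIOHandler, and hasOutput are hypothetical stand-ins, while the class names follow the description above.

```python
# Illustrative encoding of the example ETL pipeline p1 (Figure 1b) as a KG.
# Namespace and property names are hypothetical, not the real SemCloud IRIs.
from rdflib import Graph, Namespace, RDF

SC = Namespace("http://example.org/semcloud#")  # hypothetical namespace
g = Graph()
g.bind("sc", SC)

# Pipeline p1 with a retrieve layer l1 and its task t1.
g.add((SC.p1, RDF.type, SC.ETLPipeline))
g.add((SC.p1, SC.hasLayer, SC.l1))          # hasLayer: assumed property
g.add((SC.l1, RDF.type, SC.RetrieveLayer))
g.add((SC.l1, SC.hasTask, SC.t1))           # hasTask: assumed property
g.add((SC.t1, RDF.type, SC.Retrieve))

# t1 writes its output d1 through an IO handler io.
g.add((SC.t1, SC.hasIOHandler, SC.io))      # hasIOHandler: assumed property
g.add((SC.io, SC.hasOutput, SC.d1))         # hasOutput: assumed property
g.add((SC.d1, RDF.type, SC.DataEntity))

# d1 is sliced by t2 into d2 and d3, prepared by t3/t4, and stored by t5.
g.add((SC.t2, RDF.type, SC.Slice))
g.add((SC.t3, RDF.type, SC.Prepare))
g.add((SC.t4, RDF.type, SC.Prepare))
g.add((SC.t5, RDF.type, SC.Store))

print(g.serialize(format="turtle"))
```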
Datalog ML Rules. Obtaining an optimised cloud configuration is not a trivial task. Cloud experts typically try different configurations by testing the system with various settings and use heuristics to decide on the configuration manually. To this end, SemCloud uses adaptive rules in Datalog with aggregation and calls to external predicates learned by ML (they are adaptive because the function parameters are learned). In particular, we consider non-recursive rules of the form B ← B1, ..., Bn, where B is the head of the rule (the consequence of the rule application) and B1, ..., Bn are either predicates that are joined, aggregate functions that filter the results, or expressions of the form Var = @FUNCT(Vars). For the theory of Datalog we refer to [4]. We have six sets of independent Datalog rules, divided into three steps.

Graph Extraction Rules. These rules populate the predicates that are then used for resource estimation and configuration. Rule rule0 exemplifies populating the predicate subgraph1, which is related to the ETL pipeline p. Similarly, rule1 creates subgraph2(p,n,v,ms,mp,ts,tp,nc,ns,mrs,mrp,mode).

  subgraph1(p,n,v,ms,mp,ssl,spr,sst) ←
    ETLPipeline(p), hasInputData(p,d), hasVolume(d,v), hasNoRecords(d,n),
    hasEstSliceMemory(p,ms), hasEstPrepareMemory(p,mp),
    hasEstSliceStorage(p,ssl), hasEstPrepareStorage(p,spr),
    hasEstStoreStorage(p,sst)                                   (rule0)

Resource Estimation Rules. These rules estimate the required resources assuming one computing node. For example, rule2 estimates the required slice memory (ms), prepare memory (mp), slice storage (ssl), prepare storage (spr), and store storage (sst). It then stores these estimates in the predicate estimated_resource.

  estimated_resource(p,ms,mp,ssl,spr,sst) ←
    subgraph1(p,n,v,ms,mp,ssl,spr,sst),
    ms=@func_ms(n,v), mp=#avg{@func_mp(n,v,ms,i):range(i)},
    ssl=@func_ssl(n,v), spr=#avg{@func_spr(n,v,ssl,i):range(i)},
    sst=@func_sst(n,v,ssl,spr)                                  (rule2)

Here @func_ms, @func_ssl, @func_sst, etc. are parameterised ML functions whose parameters are learnt during rule parameter learning. In the implementation, they are defined as external functions that are called in the grounding phase of the program and replaced by concrete values [5]. We also have further estimation rules for other resources, such as CPU consumption.
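To give an idea of how a parameterised external function such as @func_ms could look on the Python side, the following is a minimal sketch, assuming scikit-learn and a toy layout of the pilot running statistics; it is not the authors' implementation. It fits a degree-4 polynomial regressor (the degree selected in Section 3) and wraps it as a callable from (n, v) to an estimated slice memory, i.e. the concrete value that would replace @func_ms(n,v) during grounding.

```python
# Sketch of a learned external function in the spirit of @func_ms.
# Assumption: pilot run statistics are given as (n_records, volume, slice_memory)
# tuples; the column layout and units are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def learn_func_ms(pilot_runs):
    """Fit a degree-4 polynomial regressor and return a callable func_ms(n, v)."""
    data = np.asarray(pilot_runs, dtype=float)
    X, y = data[:, :2], data[:, 2]
    model = make_pipeline(PolynomialFeatures(degree=4), LinearRegression())
    model.fit(X, y)

    def func_ms(n, v):
        # The value returned here would replace @func_ms(n,v) in the ground rule.
        return float(model.predict([[n, v]])[0])

    return func_ms

# Usage with made-up pilot statistics (records, volume in GB, memory in MB):
# func_ms = learn_func_ms([(1e5, 2.1, 512), (5e5, 9.8, 1900), (1e6, 21.0, 3600)])
# print(func_ms(3e5, 6.0))
```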
Resource Configuration Rules. These rules find the optimal cloud configuration based on the estimated cloud resources. rule3 is an example for deciding the slicing strategy and the storage strategy, and for finding optimal resource configuration values such as the chunk size (nc), the slice size (ns), and the memory reservations for slice (mrs) and for prepare (mrp). In essence, rule3 stipulates that if the maximum of the estimated slice memory (ms) and prepare memory (mp) is greater than a given threshold (c1*nm), and the maximum of the estimated slice storage (ssl), prepare storage (spr), and store storage (sst) is smaller than or equal to another threshold (c2*ns), then the chosen strategy for the given pipeline is slicing (thus nc and ns are computed) with fast storage (fs), where the thresholds are calculated from the cloud attributes.

  configured_resource(p,nc,ns,fs,mrs,mrp) ←
    subgraph2(p,n,v,ms,mp,ts,tp,nc,ns,mrs,mrp,mode),
    estimated_resource(p,ms,mp,ssl,spr,sst),
    CloudAttributes(c,c1,c2,c3,nm,ns,fs,cs),
    #max{ms,mp} > (c1 * nm), #max{ssl,spr,sst} <= (c2 * ns),
    nc = @func_fs_1(n,v,ts,tp), ns = @func_fs_2(n,v,ts,tp),
    mrs = #min{ms, #max{@func_ss(n,v,nc,ns), c3*ms}},
    mrp = #min{mp, #max{@func_pn(n,v,nc,ns), c3*mp}}            (rule3)

Rule Parameter Learning with ML. The functions in the adaptive rules are in the form of ML models. For the resource estimation rules, the best model is selected after training three ML methods on the pilot running statistics: Polynomial Regression (PolyR), Multilayer Perceptron (MLP), and K-Nearest Neighbours (KNN). We selected these three methods because they are representative classic ML methods suitable for the scale of the pilot running statistics. The resource configuration rules are trained with the three ML methods combined with optimisation techniques, such as Bayesian optimisation or grid search. For example, the functions @func_fs_1 and @func_fs_2, which find the optimal chunk size (nc) and slice size (ns), are trained by finding the arguments (nc, ns) that minimise the total computing time t_total:

  (nc, ns) = arg min_{nc,ns} t_total = arg min_{nc,ns} f(v, n, nc, ns, t_slice, t_prepare)
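As a sketch of this optimisation step, the snippet below performs a simple grid search (one of the techniques mentioned above) over candidate (nc, ns) pairs and keeps the pair with the smallest predicted total computing time. The function predict_total_time stands in for a learned model of f; the candidate grids and the toy model in the usage comment are assumptions, not values from the paper.

```python
# Grid-search sketch for choosing chunk size (nc) and slice size (ns).
from itertools import product

def configure_slicing(v, n, t_slice, t_prepare, predict_total_time,
                      nc_grid=(1, 2, 4, 8, 16), ns_grid=(10, 50, 100, 500)):
    """Return the (nc, ns) pair minimising the predicted total computing time."""
    best_pair, best_time = None, float("inf")
    for nc, ns in product(nc_grid, ns_grid):
        t_total = predict_total_time(v, n, nc, ns, t_slice, t_prepare)
        if t_total < best_time:
            best_pair, best_time = (nc, ns), t_total
    return best_pair, best_time

# Usage with a toy stand-in for the learned model f:
# best, t = configure_slicing(v=20.0, n=1_000_000, t_slice=3.0, t_prepare=12.0,
#                             predict_total_time=lambda v, n, nc, ns, ts, tp:
#                                 ts + tp / nc + 0.01 * n / ns)
```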
3. Implementation and Evaluation

Implementation. We implement the Datalog rules with DLV and the external functions as Python plugins [6]. In particular, we first write and compile the external functions in Python, then define the interface for these functions in the Datalog program, and finally link the Datalog program with the compiled Python code when running DLV. For complicated rules with a reusable part, we introduce an auxiliary predicate that stands for the reusable part, to improve efficiency. A rule of the form B ← B1, ..., Bn becomes two parts: (Part I) B_aux ← B1, ..., Bm and (Part II) B ← B_aux, Bm+1, ..., Bn, where Part I is also reused in other rules. For instance, rule3 covers only one of the four cases of the comparisons #max{ms,mp} > (c1*nm) and #max{ssl,spr,sst} <= (c2*ns); the three other cases are (<=,<=), (<=,>), and (>,>). We introduce the auxiliary predicate configured_resource_aux to replace the first three lines of rule3 (from subgraph2 to CloudAttributes), and this predicate is reused in the inference of the three other cases.

Evaluation and Discussion. To verify the time efficiency and accuracy of the rule parameter learning and inference, we run SemCloud repeatedly 3562 times and gather pilot running statistics. These statistics are split into 80% for training and 20% for testing and inference. Three ML models are trained and tested. After a grid search, the selected hyper-parameters are: PolyR, degree 4; MLP, 2 hidden layers with 10 and 9 neurons; KNN, 2 neighbours. We use the following performance metrics: normalised mean absolute error (nmae), minimal training data amount (Min. |D_train|) for yielding satisfactory results, optimisation time (Opt. time), learning time (for ML training), and inference time (including the inference time of both ML and Datalog). The results (Table 1) show that PolyR has the best prediction accuracy, requires the least training data, and consumes the least time. Therefore, PolyR generates the best results and is selected for the use case. We presume the reason is that PolyR works better with small amounts of not very complex data (3562 repeated running statistics). The results show that our approach exhibits promising inference accuracy and time efficiency for automated cloud resource configuration.

Table 1: Parameter learning and reasoning on an Intel Core i7-10710U.
  Metric          | PolyR   | MLP      | KNN
  nmae            | 0.0671  | 0.0947   | 0.0818
  Min. |D_train|  | 7.42%   | 50.97%   | 10.00%
  Opt. time       | 1.12s   | 174.32s  | 7.25s
  Learning time   | 20.82ms | 120.31ms | 27.52ms
  Inference time  | <1ms    | <1ms     | <5ms

4. Conclusion and Outlook

This poster paper accompanies our full paper [2] with a focus on Datalog rules with external machine learning functions, and provides additional discussion on formalism and implementation techniques. The research is under the umbrella of Neuro-Symbolic AI for Industry 4.0 at Bosch. We aim at enhancing manufacturing technology with both symbolic AI [7] for improving transparency [8] and ML for predictive power. We will further improve the performance of the KG embedding method and develop other complementary technologies, such as ontologies [9], ontology-based data access, etc.

Acknowledgements. The work was partially supported by the European Commission funded projects DataCloud (101016835), enRichMyData (101070284), Graph-Massivizer (101093202), Dome 4.0 (953163), OntoCommons (958371), and the Norwegian Research Council funded projects (237898, 323325, 309691, 309834, and 308817).

References
[1] L. Youseff, M. Butrico, D. Da Silva, Toward a unified ontology of cloud computing, in: 2008 Grid Computing Environments Workshop, IEEE, 2008, pp. 1–10.
[2] B. Zhou, N. Nikolov, Z. Zheng, X. Luo, O. Savkovic, D. Roman, A. Soylu, E. Kharlamov, Scaling data science solutions with semantics and ML, in: ISWC, 2023.
[3] The SemCloud Ontology, 2023. Open source under: https://github.com/nsai-uio/SemCloud.
[4] S. Paramonov, et al., An ASP approach to query completeness reasoning, TPLP 13 (2013).
[5] N. Leone, et al., The DLV system, in: JELIA, Springer, 2002, pp. 537–540.
[6] DLVHEX Python plugin manual, http://www.kr.tuwien.ac.at/research/systems/dlvhex/doc2x/group__pythonpluginframework.html, 2016.
[7] D. Rincon-Yanez, et al., Addressing the scalability bottleneck of semantic technologies at Bosch, in: ESWC Industry Track, 2023.
[8] Z. Zheng, et al., Executable knowledge graph for transparent machine learning in welding monitoring at Bosch, in: CIKM, 2022, pp. 5102–5103.
[9] B. Zhou, Z. Zheng, D. Zhou, Z. Tan, O. Savković, H. Yang, Y. Zhang, E. Kharlamov, Knowledge graph-based semantic system for visual analytics in automatic manufacturing, in: ISWC, 2022.