Resource Awareness in Complex Industrial Systems – A Strategy for Software Updates Petar Rajković 1, Dejan Aleksić 2, Dragan Janković 1, Aleksandar Milenković 1, Anđelija Đorđević 1 1 University of Niš, Faculty of Electronic Engineering, Aleksandra Medvedeva 14, Niš, Serbia 2 University of Niš, Faculty of Science and Mathematics, Department of Physics, Višegradska 33, Niš Serbia Abstract The complex industrial systems consist of many heterogeneous devices running different pieces of software in a connected and layer-organized environment. Software instances in different levels communicate between each other using different protocols and are developed using various technologies. Available storage space and network throughput vary from layer to layer. When deploying a new version of the software to some device, an update package, which is, in some cases, of significantly higher volume than usual data traffic, needs to be distributed via a network, verified, and stored to the destination device. The old version needs to be backup in case of rollback. To reduce the impact of the mentioned problems, and to reduce the potential system downtime, we aimed to define the more general deployment approach that could be configured to use the combination of blue-green and canary deployment styles in combination with both shared and local backups. The main objective of this paper is to highlight the common problems with software updates across multiple layers and to bring the set of recommendations and guidelines for, from the resource awareness point of view, the most effective and the cheapest software updates, with the special focus on the lower levels. Keywords 1 Industrial software, IoT nodes, Software deployment strategy, Resource awareness 1. Introduction The complex industrial systems consist of many heterogeneous devices running different pieces of software in a connected and layer-organized environment [1]. Starting from the layer consisting of sensors and the actuators (in our work we will reference it as IoT layer) [2], through the Edge layer [3][4], via SCADA [5] and manufacturing execution systems (MES) [6] to enterprise resource planning (ERP) [7], all pieces of equipment run the software that needs to be updated from time to time. The update process itself comes with the risk of diverse potential failures that could leave parts of the system unresponsive, running with unpredictable behavior, or emitting erroneous data. For this reason, the update process must be executed in a highly controllable environment that allows easy and efficient rollbacks in case of flawed deployment is detected. All software components that are present in the industrial system are usually organized in layers. Layers exchange data with each other using different software protocols. The mentioned facts make the overall software update process a bit more complex than within a standard information system CERCIRAS WS01: 1st Workshop on Connecting Education and Research Communities for an Innovative Resource Aware Society EMAIL: petar.rajkovic@elfak.ni.ac.rs; alexa@pmf.ni.ac.rs; dragan.jankovic@elfak.ni.ac.rs; aleksandar.milenkovic@elfak.ni.ac.rs; andjelija.djordjevic@elfak.ni.ac.rs Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) environment, and every error could lead to serious domino effects [8] [9]. Updating software in one layer could have an effect not only on the targeted device but also on other devices in the same layer as well as on the other layers. I.e., the update that is performed in the device running in Edge level could affect software instances running both in IoT and MES layers. The additional limitation point is not only the expectation for the highest-possible-performances but also the requirement that software must run using as a small number of resources as possible. The complete system must have a high degree of resource awareness, and both storage space and network bandwidth usage must be carefully planned during the update process in order not to significantly reduce the execution of the running components [10][11]. In this paper we will present the following scenarios and the effects of different deployment configurations: - Software update in an IoT layer - Communication protocol update between layers (leading to software update on both sides) - Complex update including protocol and software changes in an IoT layer - Backup fails strategy - Deployment fail/ Rollback strategy - Rollbacks fail strategy The main objective of this paper is to highlight the common problems with software updates across multiple layers and to bring the set of recommendations and guidelines for, from the resource awareness point of view, the most effective and the cheapest software updates. The special focus in this paper will be on the lower levels since they request the highest resource awareness level. The upgrade planning for the lower levels is particularly important since it is very easy to use complete free space on the device as well as the network bandwidth during the upgrade process. 2. Related Work The existing literature offers a wide variety of deployment strategies evaluations and recommendations, but in most of the cases, the research covers software that runs in layers such as MES and ERP. These higher layers deal with a large number of clients transferring a significant amount of data and executing numerous transactions. When defining development strategies for lower levels, the common approaches from the literature are not directly implementable due to the unique limitations. The most critical points for resource management in lower levels are the storage capacity and the data traffic through the connecting networks. The overall effect is not the same on all layers [12]. I.e., manufacturing execution systems (MES) run in a shop floor environment on devices which processing power is close to standard computers. For devices running MES or ERP software the storage space is not a critical requirement, but they are usually connected to their server using the wireless network. The wireless networks in the industrial environment could experience different disruptions as the result of operating nearby machines generating high-frequency harmonics as well as different security threats [13]. For MES and ERP client nodes, data package verification and data consistency are the most important points. When deploying a new version of the software to some device, an update package, which is of significantly higher volume than usual data traffic, needs to be distributed via a network, verified, stored to the destination device, and the old version needs to be backup in case of rollback [14] [15]. Next, the Edge layer has the main mission to collect all the data from sensor networks and pass it to the MES. In this case, the proper buffer implementation ensures smooth software upgrades. All the mentioned layers are highly heterogeneous, with different pieces of hardware running the software instances with diverse category of software. Overall, in the complete industrial system, the type of used devices, their number, the amount of transferred data (per device) could be anything between 1kB and 1GB. To make the complete process more demanding, sometimes devices themselves do not have enough memory to store two versions of the software, thus they would require backup on a different location. This leads to the situation that sometimes is nearly impossible to have an upgrade with no, or at least with very low, downtime [16]. As with every process, a software update could fail due to numerous reasons. In that case, a complete deployment approach or deployment system needs to provide the possibility to roll back to the previous version [17]. The rollback will then take more resources and make the situation even worse, so we need to ensure that system governance successfully goes through the process [18]. To reduce the impact of the mentioned problems, and to reduce the potential system downtime, we aimed to define the more general approach that could be configured to use the combination of blue- green [19] and canary deployment [20] styles in combination with both shared and local backups [21]. This looks like the most promising approach for the IoT level. 3. Testing Environment The environment we used for testing consists of a set of 100 nodes that contain sensors and actuators, in further text IoT (IoT = Internet of Thing) nodes. The global system overview is shown in Figure 1. Generally, each IoT contains a different number of sensors and actuators which count within the node could be anything between a few and 1000. Sensors within one IoT node could be different, and all of them could run a different piece of software. Sensors could be active either constantly or just in predefined periods. During their operation time, they could collect very heterogeneous data with different sample rates. All these facts make the IoT level very dynamic from the operational point of view and could increase the probability that the complete node went out of a stable state in case of problematic deployments. The amount of available memory space is usually between 1 and 5 MB per device, which is nearly enough for the necessary software. The nodes in the IoT layer are connected using various methods – ranging from cable network connector to LoRaWan, making the inconsistent environment in terms of connection speed and quality. The most complex situation is with LoRa connected devices since their bandwidth could be in a range of only 10-20 kbps. Figure 1: The composition of the examined system IoT node layers are further connected to Edge computers or Edge nodes. Edge nodes are responsible for communication between the shop floor and hazardous areas on one side and higher levels such as MES and enterprise resource planning (ERP) on the other side. Edge nodes are devices based on Raspberry Pi or similar base sets and are usually connected by a Wireless network with an effective network speed of around 20 Mbps. Their space requirements are around 30 MB per node. There were 10 of these nodes in our test environment. From the resource awareness point of view, software components on MES and ERP levels are easier to handle. They are running on desktop/laptop computers with enough processing power, disk space, and bandwidth, but even with them, resource planning is inevitable. In our test environment, we used 50 MES clients connected to 2 MES servers (one main and one redundancy), and a similar number of ERP clients connected to the same server configuration (Microsoft Dynamics). All the clients in this level are a few hundred megabytes in volume, but they are located under a gigabyte network. 4. Deployment Strategy for IoT node The process of a software update for IoT nodes and sensor/actuator devices running in a production environment is considered as particularly sensitive. Small components, both in size and capacity, running in a hazardous environment where the only possible connection are relatively slow LoRa networks with no wiring possible and limited physical access, require detailed planning before an update (Figure 2). Besides the slow network, the low-performance hardware is one additional potential problem. This fact could result in an unacceptable long update process which could move the targeted device off the system for an extended period. The last, but not the least important is the problem of energy consumption. The software update is an activity that requires significantly more energy than the regular data collection and data transmission processes. Thus, this process must be planned for the period when the battery is charged to the highest possible level, and when the eventual rollback will not drain the battery. Figure 2: IoT node elements Figure 3: Semaphore-based blue-green deployment strategy 4.1. Software Update Approach for a Single IoT Node Looking at the single node, our choice for a software update is a semaphore-based green/blue approach (Figure 3). This approach is possible with devices that could store at least two versions of the software at the same time. The critical points, in this case, are usually low bandwidth and possibly low battery levels. The approaches to solving these two problems are not in the scope of this paper, and they will be addressed in another future work. The main idea here is to ensure that the target device always keeps two software versions – actual (version N-1) and previous (version N-2). The update process starts by replacing version N-2, with the new version – version N. At that moment, version N-1 is still active, and the device is running uninterrupted. During that period device experiences higher-than-average network traffic and battery use. Once when version N – 2 is deleted, and version N is uploaded and verified, the switchover could start. The device starts version N, but its communication points are still inactive. When version N is fully up and running, the semaphore opens communication to version N and stops version N-1. In that case, there is no operation downtime, and the complete update process is seamless for the customer (Figure 4). If the process is well-planned there will also be no data loss during the switchover process. In the worst case, only the signals that arrived during the switchover (which usually takes up to several seconds) could be lost and not processed. This approach is good in case of update errors since it offers an easy way to return to the previous (valid and proven) version N - 1 without the need for immediate additional traffic. Once when the error gets solved, version N could be replaced with the next update. This setup also supports both full and partial version updates, and it is even more suitable for more powerful devices – these that uses GSM modems instead of LoRa adapters. Usually, newer versions consume slightly higher data space, and the additional challenge that could appear during the software lifecycle is the situation when the new version requires more space than the available in the target device. In this case, the described approach will not be efficient any longer, and the solution must include additional components. ev_req_sleep 0 0 0 0 0 0 0 0 main task ev_ack_sleep 0 0 0 0 0 0 0 0 set event bit /* define sleep request event bits */ #define SLP_REQ_BAT_CHARGER_TASK_BIT ( 1 << 0 ) #define SLP_REQ_PARAM_TASK_BIT ( 1 << 1 ) #define SLP_REQ_GPS_TASK_BIT ( 1 << 2 ) #define SLP_REQ_LoRa_TASK_BIT ( 1 << 3 ) ev_req_sleep 0 1 0 0 0 0 0 0 #define SLP_REQ_GSM_TASK_BIT ( 1 << 4 ) #define SLP_REQ_MQTT_SENDER_TASK_BIT ( 1 << 5 ) ev_ack_sleep 0 0 0 0 0 0 0 0 #define SLP_REQ_READ_I2C_TASK_BIT ( 1 << 6 ) #define SLP_REQ_READ_485_TASK_BIT ( 1 << 7 ) read event bit /* define sleep acknowledgement event bits */ #define SLP_ACK_BAT_CHARGER_TASK_BIT ( 1 << 0 ) #define SLP_ACK_PARAM_TASK_BIT ( 1 << 1 ) I2C_comm #define SLP_ACK_GPS_TASK_BIT ( 1 << 2 ) #define SLP_ACK_LoRa_TASK_BIT ( 1 << 3 ) task #define SLP_ACK_GSM_TASK_BIT ( 1 << 4 ) #define SLP_ACK_MQTT_SENDER_TASK_BIT ( 1 << 5 ) set/reset event bit #define SLP_ACK_READ_I2C_TASK_BIT ( 1 << 6 ) #define SLP_ACK_READ_485_TASK_BIT ( 1 << 7 ) EventGroupHandle_t ev_req_sleep = NULL; ev_req_sleep 0 0 0 0 0 0 0 0 EventGroupHandle_t ev_ack_sleep = NULL; ev_ack_sleep 0 1 0 0 0 0 0 0 Figure 4: Software update sequence with the sleeping sequence 4.2. Software Update Approach for Devices with Limited Storage Space As it is mentioned, the more demanding situation is a case when it is not possible to have both versions N and N – 1 copied to the destination device simultaneously. The proposed solution for this problem is to use an additional device of the same type with (if possible) larger storage space – a backup node. The backup node is used to keep the backup versions of the running software. In the situation where the IoT layer consists of multiple similar (or same) nodes, adding one additional device to the system will not be considered as a drawback, but rather as an acceptable small cost. The deployment process starts with copying the new version (version N) to the backup node. Once this action is finished, the backup node will distribute version N to all devices running the same piece of software. In this situation, the overall downtime will be a bit higher since the target node must stop the previous version (N – 1), get a new one, and then start version N. With this approach having been implemented, the needed amount of traffic is higher, but this setup has its advantage when comes to potential rollbacks. After version N is uploaded to the backup node, deployment to sensor nodes will go one after another. It will start with the sentinel device (the concept borrowed from the canary deployment), and complete validation in production conditions will be done there. In case the new version is valid, the update of consecutive nodes will follow. If not, the rollback sequence will be done only on the sentinel device. The update itself in the second scenario does not allow continuous uptime on the device. In such a case, the currently running version (N-1) must be first put to sleep mode and then removed from the destination device. Next, the new version (version N) must be uploaded, configured and then the wake- up command will be applied to version N. Until version N is not started yet, the node will be in downtime and without the possibility to collect and exchange data, which is the potentially unavoidable weak spot. 4.3. Software Update in Edge Layer affecting IoT nodes The third scenario that will be presented is the effect of the software update in the Edge layer on the nodes in the IoT layer, in the scenario when the device from the Edge level must remain inactive for a period of deployment. In that case, devices at the IoT level will get disconnected for the same amount of time. The course of events in the IoT node will be as follows: - Devices in IoT nodes detect disconnection event - Devices raise the internal alarm - Start reconnection procedure in predefined time frames sem_mqtt_send Version N take/give take/give write read MQTT_SN_comm mqtt_msg queue Version N-1 Figure 5: Software update scheme with message queue While the Edge level node is not running, IoT nodes will have not have a destination where to send processed data. This will cause significant data loss for the complete deployment areas, which could be unacceptable if the process consumes an extensive amount of time. This problematic state will last until the Edge layer node starts running again. When a node from the Edge layer restarts and comes back online, IoT nodes will get connected again and continue to exchange data. In some cases, IoT nodes will not be able to connect back, either due to the change in communication protocol or to any hardware error. In these cases, IoT nodes will run a general alarm, and then the Edge node must be moved back to the previous version. In case when the update is needed in both layers, the general alarm will be stopped by the update notification signal, and then all IoT nodes will get updated one by one. The update will be driven from the backup node. One of the commonly used solutions to reduce the necessity for frequent updates across the levels is the using a buffer between the layers (Figure 5). In this case, the buffer is implemented as the message queue, and in most of the cases, when the communication protocol gets changed, only the synchronization buffer will be updated while all the nodes in the IoT layer will continue to work. In this way, downtime will hit only one layer (in this case Edge layer) while the other layers will continue to run almost without interruptions. 5. Results and Discussion Our research was led by the request to reduce the potential downtime during the software update in a challenging environment such IoT layer is. To achieve this goal, we decided to replace the standard deployment (stop-copy-run) with a combination of blue-green and canary deployment strategies, extended with the buffer component. Combining these three well-known approaches in the proposed way, we tried to benefit from all the positive aspects we could get: - blue-green deployment gives the possibility for a fast version switch - canary deployment allows prompt identification of deployment errors - the presence of a synchronization buffer allows us to keep one layer insulated and still operative while the connected layers are in downtime or performing an update. The proposed approach is initially tested at the IoT level since there we faced the toughest limitations regarding software resources, network bandwidth, and even energy consumption. Another reason to choose this layer for the initial test is also the fact that they are running in critical and hazardous areas. Thus, it is highly requested to reduce the possibility of direct human interaction, or the installation of the additional infrastructural elements such are power or network cables. To make the situation even more complex, we must note that physical access to the IoT nodes could not be achieved easily. It is often connected not only with technology but with mechanical and security procedures. In some cases, different mechanical elements must be removed to physically reach the device at the IoT level. Furthermore, IoT devices could run in a dangerous environment (for human beings) and then the strict procedure must be followed to access the device itself. The approach that was used before was the standard update, where the software component was just replaced with the new version – either fully or partly (stop-copy-start). The problems with the standard updates could be summarized as: - The downtime was always present. If the software component is in the updating process, the software device could not be used - In case of erroneous update, software should be brought back to the previous version which would lead to the further downtime - Restore process sometimes could drain the battery which would require that the personnel member must go to the hazardous area - Connected layers could not continue to work normally since they get flooded with alarm signals The results we achieved (Table 1) with the proposed combined deployment approach proved our expectation and vary between different software layers and scenarios. Applying the proposed strategy reduced the overall downtime and number of unnecessary rollbacks. This was achieved by the cost of the implementation of the backup node, the implementation of the buffer level, and by a slight increase in data traffic. We plan to expand the concept also to other layers and make their update processes as effective as possible. Using backup nodes with the optimized buffering will be the first approach to move to the MES level, and this will be followed with the buffers between Edge and MES as well as between MES and ERP. Table 1 The effects of the proposed deployment strategy on IoT level containing 50 IoT nodes connected to a single Edge node (TD – time to shut down the software in the node, TU – time to start the software in the node, TS – time switch between the versions, IS – software instance size per node, NN – number of nodes) Measurement With standard deployment With proposed deployment Number of software uploads to NN 1 (only to the backup node) IoT level – successful deployment Number of internal uploads – 0 NN successful deployment Number of software uploads - Average 10% of NN 1 to the backup node unsuccessful deployment Security check on upload NN 1 (only to backup node) Number of internal software 0 1 uploads – unsuccessful deployment Rollbacks with unsuccessful 10% of NN 1+1 deployments Downtime per node TD + TU (in seconds) TS (in milliseconds) Used space for software per node 1 x IS 2 x IS (with blue-green approach) Used space for software with NN x IS NN x IS + IS buffer node Update distribution Manual or with task scheduler Optimized by backup node Downtime when connected layer If update is running Until buffer has data update 6. Conclusion With the presented research, we managed to make a significant step forward in the design of the deployment strategy for complex, layer–organized, industrial software systems. When software update must be deployed, the common problems are downtime, network traffic increase, and storage space occupation. In lower levels, even the energy consumption during the deployment process could be an issue. To reduce the effects of the mentioned problems, especially in the cases when the rollback is needed, we defined the hybrid strategy containing a mix of blue-green and canary deployment supported by inter-layer buffer and backup node. Having this approach implemented we managed to reduce the overall downtime from 50% to close to 0. With the backup node active, we managed to reduce the number of software uploads in case of an erroneous update to less than 1%. On the other hand, to achieve mentioned results, we added the additional backup node to the system, but since its volume is slightly higher than the volume of regular IoT nodes, we find it acceptable. The results seem promising and for future work, we plan to adapt and extend this approach to the other layers of the complex industrial systems. 7. Acknowledgement This work is partially supported by CERCIRAS COST Action CA19135 funded by COST. 8. References [1] Shu, Zhaogang, et al. "Cloud-integrated cyber-physical systems for complex industrial applications." Mobile Networks and Applications 21.5 (2016): 865-878. [2] Kondratenko, Yuriy, et al. "Complex industrial systems automation based on the Internet of Things implementation." International Conference on Information and Communication Technologies in Education, Research, and Industrial Applications. Springer, Cham, 2017. [3] Sha, Kewei, et al. "Edgesec: Design of an edge layer security service to enhance IoT security." 2017 IEEE 1st International Conference on Fog and Edge Computing (ICFEC). IEEE, 2017. [4] Li, He, Kaoru Ota, and Mianxiong Dong. "Learning IoT in edge: Deep learning for the Internet of Things with edge computing." IEEE network 32.1 (2018): 96-101. [5] Sajid, Anam, Haider Abbas, and Kashif Saleem. "Cloud-assisted IoT-based SCADA systems security: A review of the state of the art and future challenges." IEEE Access 4 (2016): 1375-1384. [6] Coronado, Pedro Daniel Urbina, et al. "Part data integration in the Shop Floor Digital Twin: Mobile and cloud technologies to enable a manufacturing execution system." Journal of manufacturing systems 48 (2018): 25-33. [7] Chofreh, Abdoulmohammad Gholamzadeh, et al. "Development of guidelines for the implementation of sustainable enterprise resource planning systems." Journal of Cleaner Production 244 (2020): 118655. [8] Cozzani, Valerio, et al. "Quantitative assessment of domino and NaTech scenarios in complex industrial areas." Journal of Loss Prevention in the Process Industries 28 (2014): 10-22. [9] Chen, Yusong, et al. "Research on software failure analysis and quality management model." 2018 IEEE International Conference on Software Quality, Reliability and Security Companion (QRS-C). IEEE, 2018. [10] Usman, Muhammad, et al. "Compliance requirements in large-scale software development: An industrial case study." International Conference on Product-Focused Software Process Improvement. Springer, Cham, 2020. [11] Kalunga, Joseph, Simon Tembo, and Jackson Phiri. "Industrial Internet of Things Common Concepts, Prospects and Software Requirements." vol 9 (2020): 1-11. [12] Chen, Chao, Genserik Reniers, and Nima Khakzad. "A thorough classification and discussion of approaches for modeling and managing domino effects in the process industries." Safety science 125 (2020): 104618. [13] Ren, Zihui, Cheng Chen, and Lijun Zhang. "Security protection under the environment of WiFi." 2017 International Conference Advanced Engineering and Technology Research (AETR 2017). Atlantis Press, 2018. [14] Kim, Dae-Young, Seokhoon Kim, and Jong Hyuk Park. "Remote software update in trusted connection of long range IoT networking integrated with mobile edge cloud." IEEE Access 6 (2017): 66831-66840. [15] Asokan, N., et al. "ASSURED: Architecture for secure software update of realistic embedded devices." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37.11 (2018): 2290-2300. [16] Mugarza, Imanol, Jorge Parra, and Eduardo Jacob. "Cetratus: A framework for zero downtime secure software updates in safety‐critical systems." Software: Practice and Experience 50.8 (2020): 1399-1424. [17] Stević, Stevan, et al. "IoT-based software update proposal for next generation automotive middleware stacks." 2018 IEEE 8th International Conference on Consumer Electronics-Berlin (ICCE-Berlin). IEEE, 2018. [18] Mirhosseini, Samim, and Chris Parnin. "Can automated pull requests encourage software developers to upgrade out-of-date dependencies?." 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2017. [19] Fowler, M. "Blue-green deployment, March 2010." (2016). [20] Tarvo, Alexander, et al. "CanaryAdvisor: a statistical-based tool for canary testing." Proceedings of the 2015 International Symposium on Software Testing and Analysis. 2015. [21] Killi, Bala Prakasa Rao, and Seela Veerabhadreswara Rao. "Towards improving resilience of controller placement with minimum backup capacity in software defined networks." Computer Networks 149 (2019): 102-114.