A Survey of Health Management System for On-The-Fly Repairing of Concurrency Errors in Airborne Software Junho Lee1 , Seongyun Go2 , Eu-Teum Choi3 and Seongjin Lee4,* 1 Dept. of Aerospace Software Engineering, Gyeongsang National University, Jinju, Republic of Korea 2 Dept. of Aerospace and Software Engineering, Gyeongsang National University Jinju, Republic of Korea 3 Convergence Research Center for Materials and Mechanical Systems, Jinju, Republic of Korea 4 Dept. of AI Convergence Engineering, Gyeongsang National University, Jinju, Republic of Korea Abstract Concurrency errors are known for their difficulty of debugging and reproducing prior to execution. Undetected concurrency errors could result in nondeterministic executions that deviate from the pro- grammer’s intent and significantly undermining the reliability of the program. To prevent functional failure in airborne software, on-the-fly repairing of concurrency errors is crucial. This paper surveys the Health Management System for on-the fly repairing of concurrency errors in airborne software and motivates the future work in concurrent airborne software. Keywords Concurrency error, reliability, airborne software, on-the-fly repairing, Health Management System, 1. Introduction Airborne software is the embedded software that controls, manages, and monitors the state of airborne system[1, 2]. With the rapid advances in avionics, the importance of software in avionics has been increasing. In the 1960s, the proportion of software in F-4 was only 4%. Whereas, in the case of F-35 produced in 2007, the proportion of software had risen to 90%[3]. It indicates that software has played a crucial role in avionics system. As the software made up in avionics system has dramatically risen, the scale and complexity of airborne software have also grown. As a result, debugging of airborne software has become more challenging and the risk of system failure has increased due to potential errors and faults[4, 5, 6, 7]. In aircraft, when accidents occur due to the system failure, they could result in catastrophic loss of life and property. This serious issue can be observed in the accident that occurred during the inaugural test flight of the F-22 Raptor in 1992[4]. The failure of control software made pilot unable to control the induced oscillation, leading to a crash. Other examples can be observed in the case of Boeing 737 MAX accidents[8]. Boeing’s new aircraft model, the Boeing 737 MAX, experienced two crashes in October 2018 and March 2019, resulting in the loss of 189 lives and ISE 2023: 2nd International Workshop on Intelligent Software Engineering, December 4, 2023, Seoul * Corresponding author. $ dlwnsgh0901@gnu.ac.kr (J. Lee); tjddbs6696@gmail.com (S. Go); etchoi@gnu.ac.kr (E. Choi); insight@gnu.ac.kr (S. Lee) © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings 157 lives, respectively. The main cause of the accidents was determined to be a malfunction of Boeing’s newly introduced technology, MCAS (Maneuvering Characteristics Augmentation System). MCAS is a technology designed to prevent a crash by lowering the aircraft’s nose when a stall is anticipated. In the two instances of flight, due to the malfunctions of MCAS, the aircraft’s nose pitched abnormally downward. This made the pilots unable to respond appropriately, leading to the aircraft crashes. Concurrency errors are typically one of the most critical error types occurring in concurrent programs, and they must be eliminated as they can compromise the intended execution results of the program[9, 10, 11, 12, 13, 14]. Concurrency errors are difficult to debug and reproduce, so they are not easily eliminated during the development phase. To ensure the correct execution of airborne software, these errors must be eliminated during run-time. The software health management system in airborne software diagnoses and treats the errors occurring during the execution, ensuring the proper functioning of the program. Diagnosis is performed by monitoring the state changes of specific values or variables in a program. If an error is diagnosed, appropriate recovery actions are employed to ensure the proper execution of the program. Since 2010, there have been four studies in airborne software that utilize health management systems to eliminate concurrent errors during execution. Three researches[6, 7, 4] diagnose and repair atomicity violations among concurrent errors, while only one research[5] focuses on diagnosing and repairing order violations within the same context. Ha et al.[6][6] and Tchamgoue et al.[7] detect atomicity violations by comparing diagnosis algorithms with actual execution information, while Choi et al.[4] repairs atomicity violations by comparing pre-collected correct execution information with actual execution information. For on-the-fly repairing of order violation in airborne software, Kim et al.[5] diagnoses order violations by comparing the user- defined function call sequences with the information acquired during execution. The remainder of this paper is organized as follows. Section 2 gives the research background discussing concurrency errors and health management systems for on-the-fly repairing them in airborne software. In section 3, related works using health management systems to repair concurrency errors in airborne software are introduced. It also discusses some research questions of related work in section 4. Finally, section 5 concludes the research. 2. Background This section introduces concurrency errors that are hard to debug and remove during the development phase. It also explains the health management system for diagnosing and on-the- fly repairing errors in airborne software during the operational phase. 2.1. Concurrency Error Concurrency errors occur when two or more different threads access shared resources without proper synchronization, resulting in outcomes different from the programmer’s intent[11, 4, 5, 6, 7]. Types of concurrency errors include atomicity violation, order violaiton, and deadlock[11]. Atomicity violations occur in concurrent programs when at least one write access occur with- out proper synchronization[11, 15, 16, 17], violating the atomicity of the program’s execution in the affected section that should be performed atomically. Fig. 1 depicts two threads accessing a Figure 1: Possible interleavings of two threads performing x++ operation respectively shared variable sequentially, performing read and write operations respectively without proper synchronization. In the Fig. 1, white circles represent the read access to the shared variable x, while red circles represent write accesses to the shared variable x. Programmers typically expect that the result will increase by 2 when the x++ operation is performed twice. In cases a and d, the value of x is incremented by 2 because read and write accesses to the shared variable are performed atomically in the x++ operation. However, in cases b, c, e, and f, read and write accesses to the shared variable are not performed atomically, resulting in an increment of only 1 even if the x++ operation is performed twice. Such non-deterministic behavior makes it difficult to guarantee the intended execution of the program. Order violation is occurred when other threads disrupt the intended sequence of shared variable access defined by the developer[11, 10, 5]. Fig. 2 provides an example of order violation that occurred in Mozilla. In this example, two threads access and perform write operations on a shared variable called io_pending. The developer’s expectation is that thread 1 initializes io_pending to TRUE, and after a certain period, thread 2 will change it to FALSE. However, due to an incorrect interleaving, thread 2 could mistakenly initializes io_pending to FALSE first, and then thread 1 attempts to change it to TRUE. In this scenario, an error can occur, resulting in an infinite execution of the code inside a while loop. Deadlock is a type of concurrency error in which a process is unable to proceed when two threads each possess a resource that the other thread requires and they request each other’s resources. The purpose of the health management system (HMS) for airborne software is to prevent Figure 2: Example of an order violation in Mozilla system failures caused by faults or errors[1, 4, 5, 6, 7]. Errors can occur during operation due to faults that were not eliminated during the development phase. HMS monitors, diagnoses, and treats these errors at the module level, partition level, and process level. For error diagnosis, HMS continuously monitors specific variables or state changes. When an error is diagnosed, HMS employs recovery techniques to eliminate the error. Recovery techniques can be categorized into Forward Recovery and Backward Recovery, depending on the timing of dealing with errors. Forward Recovery is relatively straightforward to implement but may not be able to repair all errors. Whereas, Backward Recovery involves returning to the last checkpoint where no errors occurred to correct the fault. This method can address most errors but comes with significant time and space overheads when rolling back to a checkpoint. 2.2. Health Management System of Airborne Software Since 2010, research on utilizing health management systems of airborne software has been actively pursued to diagnose and repair errors [4, 5, 6, 7]. These Researches focus on repairing concurrency errors[4, 6, 7], one of the most severe types of software errors, which occur in concurrent programs. Ha et al.[6], Tchamgoue et al.[7], Choi et al.[4] diagnose and repair atomicity violations. To diagnose atomicity violations, Ha et al.[6] and Tchamgoue et al.[7] incorporate probe code through actual run-time diagnostic algorithms. Choi et al.[4] compares previously executed correct execution information with actual run-time information. Kim et al.[5] diagnoses and repairs order violation errors by comparing the user-defined function call sequence and information acquired during execution. 3. Related Works Errors occurring during the operational phase due to fautls that were not eliminated during the development phase can threaten the normal execution of airborne software. Given that eliminating concurrency errors by considering all possible interleavings of a program during work Diagnosis Treatment Ha et al. atomicity violation lock, unlock Tchamgoue et al. atomicity violation wait/set_event Choi et al. atomicity violation wait, signal Kim et al. order violation wait, signal Table 1 HMS for repairing concurrency errors in airborne software the development phase is impossible, it becomes critically important to address these errors during execution[9, 10, 12, 13, 14]. Since 2010, research in the field of airborne software has been underway to repair atomicity violations and order violations using health management systems. Table 1 lists research studies in the field of airborne software that utilize health management systems to repair concurrency errors. There are three studies focusing on repairing atomicity violations. To diagnose atomicity violations, Ha et al.[6] and Tchamgoue et al.[7] incorporate detection protocols, while Choi et al.[4] diagnoses atomicity violations by comparing information obtained from pre-execution with information from actual run-time. All three studies include synchronization techniques to remedy atomicity violations. In the cases of Ha et al.[6] and Tchamgoue et al.[7], they diagnose atomicity violations by inserting detection protocols and comparing the actual execution results. The detection protocol first determines the concurrency suitability of threads using labeling techniques. Each thread is assigned a unique number, and information about shared variable access and the execution sequence relationship among threads is stored in a data structure called ‘label.’ The labeling technique is then used to check whether threads executing in parallel are protected by synchronization techniques to diagnose atomicity violations. When an atomicity violation is diagnosed, these studies control the thread flow by inserting POSIX lock/unlock [6] or using APEX’s wait/set event [7]before and after shared variable access for threads without proper synchronization. Choi et al.[4] diagnoses atomicity violations by comparing Anticipated Invariant (AI)[14] based BSet() and RPre(). BSet() represents the correct execution order information obtained through pre-execution of the program, while RPre() represents the actual execution information. Choi et al.[4] considers it an atomicity violation if RPre is not included in the previously collected correct execution, BSet, and it repairs atomicity violations by delaying the specific thread accessing shared variables by performing APEX’s wait and signal. To diagnose order violations, Kim et al.[5] compares the user-defined function call sequence with information acquired during execution. When an order violation is detected, wait and signal calls are used to remove it. Figure 3: Techniques for repairing of atomicity violations 4. Research Questions 4.1. RQ1: Is it possible to repair atomicity violations regardless of the number of shared variables involving in errors? Atomicity violation can be categorized into a single-variable atomicity violation and multi- variable atomicity violation depending on the number of shared variables involving in errors[11]. In the case of a single-variable atomicity violation, it occurs when two or more threads access a shared variable concurrently without proper synchronization. Whereas, multi-variable atomicity violation occurs when two or more threads access two or more shared variables concurrently without proper synchronization. To diagnose and repair multi-variable atomicity violations, correlations among shared variables should be considered[11]. However, three existing studies only focus on a single-variable atomicity violation, not considering the correlations among the shared variables. Therefore, it is limited to diagnose and repair multi-variable atomicty violations using existing techniques. 4.2. RQ2: Is it possible to repair atomicty violations in signal-driven program? When a signal occurs in a program, the program is terminated or preempted by a user-defined signal handler[12]. Due to the characteristics of signals having higher priority than threads, they preempt the execution flow of threads upon occurrence. In a situation where a signal is raised while a thread is accessing a shared variable and sharing it with a signal handler, there is a potential for atomicity violation. In a scenario where a thread increments the value of a shared variable x using the x++ instruction, and a signal handler also increments the value of x using x++ instruction, if a signal is raised between the read and write access of the thread, atomicity violation may occur. The value of x will be incremented by only 1 when the x++ operation is performed twice. If atomicity violation is diagnosed, related works control threads inserting lock/unlock[6], APEX set/wait_event[7], POSIX wait/signal[4]. However, using these mechanisms in a signal handler, deadlock can occur. Due to the inherent nature of signal handlers, once invoked, they do not return to the thread until their execution is complete. If, however, control is transferred to a signal handler while a thread has acquired a lock, the signal handler is simply waiting for the thread to release the lock, and the thread is waiting for the signal handler to return, potentially resulting in a deadlock. Therefore, there is a need for future research to investigate methods that block the access of a signal handler, which shares a variable with a thread, when the thread is accessing that variable. 5. Conclusion This paper discussed one of the most serious type of software errors known as concurrency errors. To ensure the correct execution of the program, it is essential to eliminate undetected concurrency errors. The Health Management System of Airborne software is designed to diagnose and repair these errors during execution. While three studies have focused on repairing atomicity violations, one study has addressed to repair order violation in airborne software. The research question indicates that existing works for repairing atomicity violations primarily concentrate on a single-variable atomicity violation, and a data race among threads. Therefore, there is a need for future research to expand the scope of repairs to include multi-variable atomicity violations and those caused by signal handlers. Acknowledgments This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2021R1A2C1014163). References [1] A. E. E. C. (AEEC), Avionics Application Software Standard Interface – ARINC Specification 653 – Part 1, ARINC Inc, 2015. [2] J. Knight, The glass cockpit, Computer 40 (2007) 92–95. [3] D. G. Firesmith, P. Capell, D. Falkenthal, C. B. Hammons, L. DeWitt, T. Merendino, The method framework for engineering system architectures, CRC Press, 2008. [4] E.-t. Choi, T.-h. Kim, Y.-K. Jun, S. Lee, M. Han, On-the-fly repairing of atomicity violations in arinc 653 software, Applied Sciences 12 (2022) 2014. [5] T.-H. Kim, E.-T. Choi, Y.-K. Jun, An efficient on-the-fly repairing system of order violation errors for health management of airborne software, The Korean Society for Aeronautical and Space Sciences 12 (2020). [6] O.-K. Ha, G. M. Tchamgoue, J.-B. Suh, Y.-K. Jun, On-the-fly healing of race conditions in arinc-653 flight software, in: 29th Digital Avionics Systems Conference, IEEE, 2010, pp. 5–A. [7] G. M. Tchamgoue, O.-K. Ha, K.-H. Kim, Y.-K. Jun, A framework for on-the-fly race healing in arinc-653 applications, International Journal of Hybrid Information Technology, SERSC 4 (2011) 1–12. [8] D. KOENIG, New software glitch found in boeing’s troubled 737 max jet (2011). [9] Y. Lin, S. S. Kulkarni, Automatic repair for multi-threaded programs with deadlock/livelock using maximum satisfiability, in: Proceedings of the 2014 International Symposium on Software Testing and Analysis, 2014, pp. 237–247. [10] B. Lucia, L. Ceze, Cooperative empirical failure avoidance for multithreaded programs, ACM SIGPLAN Notices 48 (2013) 39–50. [11] S. Lu, S. Park, E. Seo, Y. Zhou, Learning from mistakes: a comprehensive study on real world concurrency bug characteristics, in: Proceedings of the 13th international conference on Architectural support for programming languages and operating systems, 2008, pp. 329–339. [12] G. M. Tchamgoue, K. H. Kim, Y.-K. Jun, Eventhealer: Bypassing data races in event-driven programs, Journal of Systems and Software 118 (2016) 208–220. [13] L. Zhang, C. Wang, Runtime prevention of concurrency related type-state violations in multithreaded applications, in: Proceedings of the 2014 International Symposium on Software Testing and Analysis, 2014, pp. 1–12. [14] M. Zhang, Y. Wu, S. Lu, S. Qi, J. Ren, W. Zheng, A lightweight system for detecting and tolerating concurrency bugs, IEEE Transactions on Software Engineering 42 (2016) 899–917. [15] G. Jin, W. Zhang, L. B. Deng, Dongdong, S. Lu, Automated {Concurrency-Bug} fixing, in: 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), 2012, pp. 221–236. [16] C. Li, R. Chen, B. Wang, T. Yu, D. Gao, M. Yang, Precise and efficient atomicity violation detection for interrupt-driven programs via staged path pruning, in: Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 2022, pp. 506–518. [17] A. Muzahid, N. Otsuki, J. Torrellas, Atomtracker: A comprehensive approach to atomic region inference and violation detection, in: 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, IEEE, 2010, pp. 287–297.