An empirical study of async wait flakiness in front-end testing

Yu Pei 1, Sarra Habchi 2, Renaud Rwemalika 1, Jeongju Sohn 1 and Mike Papadakis 1
1 University of Luxembourg, Luxembourg
2 Ubisoft, Montreal, Canada
yu.pei@uni.lu (Y. Pei); sarra.habchi@ubisoft.com (S. Habchi); renaud.rwemalika@uni.lu (R. Rwemalika); jeongju.sohn@uni.lu (J. Sohn); michail.papadakis@uni.lu (M. Papadakis)

Abstract
Automated front-end regression testing is an essential part of web development, allowing fast release cycles while maintaining high quality requirements. However, due to the asynchronous nature of web applications, front-end testing is sensitive to Async Wait flakiness, which reduces the usefulness of such test suites by introducing false alarms. In this work, we conducted an empirical study to investigate the causes and the impact of Async Wait flakiness in front-end testing. To do so, we built a dataset of 62 tests exhibiting reproducible Async Wait flakiness, each associated with a clean fix commit, which forms the foundation of our study. Our preliminary results suggest that tests relying on an explicit amount of waiting time to synchronize tend to create more flakiness (38 instances) than tests synchronizing on the status of DOM elements (24 instances). Further analysis shows that, where time-based issues are typically addressed by increasing the waiting time, DOM-based issues are resolved by introducing a missing synchronization point. We conclude our study with an analysis of the different implementations of synchronization mechanisms to provide tool manufacturers with concrete insights on how to improve their solutions.

Keywords: front-end testing, async wait, flaky tests, empirical study

1. Introduction

To remain competitive, companies are required to adapt to new business requirements in a timely fashion and to fix potential defects or vulnerabilities as soon as they are detected, to minimize any negative impact on their consumer base. As such, time to deployment has to be reduced to its minimum without compromising the quality of the product shipped to production. To address these potentially conflicting requirements, software producers have largely adopted test automation to ensure the quality of the software they deploy. This practice has proven to be efficient at improving software quality [1] and at finding vulnerabilities rapidly [19].

Regression testing is one of the software testing practices typically adopted by software-producing companies [20]. It ensures that the system under test (SUT) still functions as expected after any code change. To achieve this goal, every time a developer makes a change to the SUT, the regression test suite is executed against the SUT. If any of the tests fail, it is an indication that the newly introduced change broke a requirement exercised by the test suite. However, this signal may contain noise, or false alarms, where some tests fail even though their requirements are actually met. As a result, builds fail where they should have passed, increasing not only the time to market but also the production cost. This effect is even more pronounced when the failures are non-deterministic.
In the past years, researchers have increased their efforts to address non-deterministic test failures, also known as flaky tests [2, 3, 4, 5]. A flaky test script is one that might non-deterministically pass or fail on the same test code, resulting in different outcomes in different runs with no changes [6, 7]. To detect flaky tests, the majority of the proposed solutions rely on test reruns, where a test script is executed a certain number of times and is marked as flaky if the outcome differs in at least one execution [21]. Detecting flaky tests is an important challenge, as they bring many problems:

1. They increase the debugging cost. If developers do not know that a test is flaky, they might spend plenty of time debugging only to find that the observed test failure is not due to the recent changes but to the flakiness of the test [7]. Flaky tests often require hours or even days to debug. As Gruber et al. [22] explain, it takes approximately 170 runs of a test case to determine with certainty whether a test is flaky.
2. They undermine developers' trust in testing. The inconsistencies in the results of flaky tests over given code changes can cause developers to lose faith in the tests themselves.
3. They hide real bugs. If a flaky test fails randomly, developers tend to ignore its failures and thus might miss real bugs [9].

It is thus important to determine whether a test has become flaky. Because of the cost of flakiness, and because flakiness can originate in the test itself, researching flakiness in front-end testing can help developers by providing insight into effective detection and prevention methods. In this work, we focus our investigation on regression tests that target the user-facing layer of web applications; we refer to this type of test as front-end tests. Our preliminary investigation of front-end testing shows that flaky test failures often occur under the following four circumstances:

• Interaction with the Document Object Model (DOM). The DOM represents a web page. A test script can become flaky when there is an update to the DOM's style, structure, or content, i.e., when the test attempts to interact with elements of the DOM that do not render consistently.
• Rendering of resources. Flaky tests in this category attempt to perform an action on a User Interface (UI) component or UI resource before the resources are fully rendered.
• Transitions and animations. Animations make components highly time-sensitive [14]. Due to the sensitivity of animation scheduling, judging the progress of an animation based on an element's state can lead to problems.
• User interface. Front-end testing is an interactive testing process that includes diverse user interactions, such as keyboard input events, mouse click events, etc. This diversity in interactions may produce unexpected flaky test failures.

The four potential circumstances for flakiness identified by our preliminary study show a strong predominance of the Async Wait category defined by Luo et al. [9]. Note that this category is not only prevalent in front-end testing: Romano et al. [14] collected and analyzed commit data and discovered that 45% of the commits fell into the Async Wait category. As such, this paper focuses on the Async Wait category of flakiness, which we refer to as Async Wait flakiness in the remainder of this paper, and aims at providing a deeper understanding of its causes as well as providing developers with useful information to address this issue.
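To make the Async Wait scenario concrete, the following Mocha/Jest-style sketch shows a front-end test prone to this kind of flakiness. The application URL, selectors, and expected text are hypothetical, and the Puppeteer calls are used only for illustration; the sketch is not taken from the studied projects.

  const puppeteer = require('puppeteer');
  const assert = require('assert');

  it('shows the result after submitting the form', async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.test/form');   // hypothetical application

    await page.click('#submit');                     // triggers an asynchronous request

    // Flaky: '#result' is filled asynchronously; if the response arrives late,
    // the element is still empty (or absent) and the assertion fails.
    // A DOM-based synchronization point, e.g. `await page.waitForSelector('#result')`,
    // would remove the race.
    const text = await page.$eval('#result', el => el.textContent);
    assert.strictEqual(text, 'Success');

    await browser.close();
  });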
To do so, we carry out an empirical investigation of Async Wait flakiness in front-end testing, to provide a better understanding of flakiness in web applications. We collected flaky test cases written in JavaScript from GitHub and conducted our analysis on them. Specifically, we searched GitHub for web applications and for commits related to flaky tests using the keywords "async, wait, timeout", "flaky" and "flakiness". We discovered 62 commits in 26 web projects by manually reviewing the retrieved changes. We investigated the root cause and the fixing strategy of the flakiness by analyzing each commit. By identifying the causes of and fixing strategies for front-end flaky tests, our study intends to determine whether our characterization of flakiness differs from the ones proposed by research on general software systems and whether there are domain-specific flakiness patterns.

The main contributions of this study are as follows:

1. We investigate the main causes and fixing strategies behind tests exhibiting Async Wait flakiness.
2. We derive a reproducible dataset of 26 projects with 62 tests exhibiting Async Wait flakiness.
3. We compare fixing strategies for Async Wait flakiness across different testing frameworks. We investigate the relationship between the characteristics of different synchronization mechanisms, such as their respective ease of use, and the likelihood of inducing Async Wait flakiness.
4. Our study provides developers and tool manufacturers with insights into how Async Wait flakiness is potentially introduced.

2. Related work

Researchers have studied flaky tests and proposed several approaches to classify, identify, and fix them. Vahabzadeh et al. [13] carried out a comprehensive quantitative and qualitative study on test bugs, i.e., problems in the test code itself that might cause a test to fail or pass incorrectly. They identified three major categories: semantic bugs, flaky tests, and environmental bugs. Luo et al. [9] introduced ten major causes of flakiness (e.g., Async Wait, Concurrency, Test Order Dependency). Gao et al. [10] conducted a study concluding that reproducing flaky tests can be difficult; although we focus on front-end tests, our experience has been similar. Among these works, some try to understand the reasons for flakiness and elicit the causes from developers [8]. Lam et al. [11] studied the lifecycle of flaky tests in large-scale projects at Microsoft, focusing on the time between the reappearance of flakiness and its fix. Terragni et al. [12] proposed a technique that runs flaky tests in multiple containers with different environments simultaneously. As summarized by Romano et al. [14], flakiness problems show a high diffusion of the Async Wait category, which happens when a test script makes asynchronous calls without waiting for their results. Moreover, many works with goals different from ours propose techniques and tools to diagnose flaky test scripts [15], detect them in a test suite [16], or fix them [17]. To prevent and mitigate the negative impact of flaky tests during the web testing workflow, we focus our attention on Async Wait flaky tests, since this type of problem is one of the main causes of flakiness in web testing. We aim to advance the state of the art by providing insights specific to front-end testing.

3. Methodology

3.1. Research Questions

Our first research question regards the origin of Async Wait flakiness.
More specifically, we aim to define a finer-grained nomenclature to classify Async Wait flakiness based on its origin. Thus, we ask:

RQ1 What are the main root cause categories behind Async Wait flakiness?

To answer this question, we collected 26 projects with 62 flaky tests, all related to Async Wait flakiness. Each test makes at least one asynchronous call or asynchronous wait and passes if it completes on time, but fails if it ends too early or too late. Then, we investigate each test by analyzing its associated commits to explore the main causes of flakiness and the fixing strategies employed to remove it. Different from prior work, our research tries to reveal the main causes behind Async Wait flakiness in front-end tests. To do so, we formulate the following research question:

RQ2 How do developers fix Async Wait flakiness?

To understand how developers identify and fix Async Wait flakiness, we look into the changeset between the flaky and fixed versions of the test. Additionally, we study how to effectively pick synchronization mechanisms to minimize the test execution time. Finally, the last step of our endeavor regards the capabilities offered by different testing frameworks. More specifically, we ask:

RQ3 How do developers use different test frameworks to handle Async Wait flakiness?

With this research question, we intend to examine whether or not the actions taken to fix Async Wait flakiness are consistent across different testing frameworks. As such, we classify the type of changes performed within the tests relying on each framework and extract the utilities provided by the frameworks to help developers avoid Async Wait flakiness.

3.2. Data Collection

To identify commits that likely fix flaky tests, we search open-source projects on GitHub. During this process, we follow a procedure similar to the one used by Luo et al. [9]. We search for the keywords "web testing", "flaky", and "flakiness" in commits and projects written in JavaScript. When executing this query against the GitHub Search API, over 600 web projects are initially returned. To ensure that the projects are related to Async Wait flakiness in front-end tests, a second filter [14] is applied using the keywords "UI", "DOM", "async", "await", "delay", and "timeout". Next, we proceed to a manual analysis in order to identify the commits that modify test code and to isolate the commits related to fixing flaky tests. As a result, we identify roughly 200 projects with commits potentially fixing Async Wait flakiness. To validate this step and guarantee that the tests are flaky, we ensure that the flakiness is reproducible. Therefore, we clone the projects from GitHub and rerun the tests that were modified by the commits isolated in the previous step. To capture the flakiness, we execute each suspicious test 20 times; we label a test as flaky if any of its executions results in a different outcome. This manual inspection, followed by rerunning the suspicious tests, resulted in 62 tests from 26 web projects being collected in the dataset.
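As an illustration, the following minimal sketch reruns a suspicious test a fixed number of times and labels it flaky if the outcomes disagree; the helper name and the example test command are hypothetical, and the sketch is not the exact tooling used in this study.

  const { spawnSync } = require('child_process');

  // Rerun a suspicious test `reruns` times and label it flaky if its outcomes disagree.
  function isFlaky(testCommand, reruns = 20) {
    const outcomes = new Set();
    for (let i = 0; i < reruns; i++) {
      const result = spawnSync(testCommand, { shell: true, encoding: 'utf8' });
      outcomes.add(result.status === 0 ? 'pass' : 'fail');
    }
    return outcomes.size > 1;
  }

  // Example (hypothetical project and test filter):
  // isFlaky('npx jest src/profile.test.js -t "loads the profile page"');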
The collected dataset differs from those of previous studies in three respects. First, the flaky tests in our dataset are all reproducible, by construction of our collection process. Second, unlike previous flaky test datasets that contain various types of flaky tests, ours focuses solely on Async Wait flakiness in front-end test code. Third, each flaky test is associated with a fix commit that can be analyzed to extract the fine-grained root cause of the flakiness.

3.3. Study setup

RQ1 What are the main root cause categories behind Async Wait flakiness?

Tests can be flaky due to an async wait issue, where an asynchronous call is made or an asynchronous await is performed, but the result is not properly waited for before being used. In the context of testing a web application, async failures may occur when the application runs an asynchronous operation that must be completed before the application state or UI is ready to be tested. Without an appropriate synchronization mechanism between the test and the SUT, the test may, during its execution, perform assertions or interact with elements of the web page that are not yet available. Thus, in addition to collecting flaky tests, we also want to provide explanations for the observed non-deterministic behavior and explore the main causes of async wait flakiness in web front-end tests. We manually analyze each test, reviewing the test code and the developers' fixes to assign each test to a category. In addition, we rerun each test to collect error messages: part of the cases are caused by timing issues, while the others are element-related, e.g., the element does not exist. Moreover, the DOM and time are often the primary concerns of developers and testers when it comes to web front-end testing, since they directly influence how web pages display and load.

Table 1: Summary of cause categories

  Category      | Description of causes                             | Number of tests
  DOM-related   | DOM rendering and usage                           | 24
  Time-related  | Insufficient waiting time or exceeded time limit  | 38

Indeed, we identify two categories based on the call performed by the test at the fixed location: (1) time-related, such as "await page.waitForTimeout()" or "await wait()", where the synchronization point is an explicit amount of time; and (2) DOM-related, such as "await page.waitForSelector()", where the synchronization point depends on the rendering state of a specific DOM element. This definition allows us to determine whether the fix actions are DOM-related or time-related. For example, directly extending the waiting time is associated with a time-related fix, and the introduction of a method that waits for elements to be rendered is associated with a DOM-related fix. Note that this classification is already used in the literature and by practitioners. Spadini et al. [23], relying on developers' perception to create a classification of test smell severity, mention that time-related waits, i.e., the Sleepy Test smell, might introduce flakiness. This observation is corroborated by practitioners from industry, where developers believe that a good front-end testing framework should avoid explicit thread waiting for the sake of system stability [18, 24]. This is why, relying on the asynchronous instability causes proposed by previous work, we carry out a two-class categorization: one class is DOM-related and the other is time-related. We describe these categories in Table 1. Based on the results of our analysis, we can quickly classify Async Wait flakiness and propose an appropriate repair method for each cause.
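For illustration, the two kinds of synchronization points distinguished above look as follows in a Puppeteer-style test, assuming a hypothetical '.results-table' element and a one-second wait (page is a Puppeteer Page object):

  async function synchronizationExamples(page) {
    // Time-related: the synchronization point is an explicit amount of time;
    // the test passes only if one second happens to be enough.
    await page.waitForTimeout(1000);

    // DOM-related: the synchronization point is the rendering state of a DOM element;
    // the test resumes as soon as the element is available.
    await page.waitForSelector('.results-table');
  }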
RQ2 How do developers fix Async Wait flakiness?

We manually analyze all the fixes introduced by the developers by reviewing the comments and the changeset of each commit. First, we clean the changeset to remove any changes unrelated to fixing flakiness, such as refactoring operations. Then, we analyze the fine-grained changes to discover what contributes to the flakiness of the test suite. For example, when the test code lacks a "waitFor" method taking a DOM element as an argument, the flakiness is fixed by introducing a DOM-based synchronization point. DOM-related fixes are typically performed by introducing method calls such as "waitForElements" or "waitForBeTrue". Thus, during the manual analysis, we extract actions consisting of inserting these methods.

Code snippet 1

  async () => {
  - expect(connection.streams).to.have.length(0)
  + await pWaitFor(() => connection.streams.length === 0)
  }

For instance, code snippet 1 shows the introduction of a wait-for-true-condition method, offering a DOM-based synchronization point to the test. Another example can be found in the Shopify-theme-inspector project (code snippet 2), where the developer directly adds a method call that waits until the element renders before executing the assertion.

Code snippet 2

  async () => {
    await page.$eval('[data-refresh-button]', elem => elem.click());
  + await page.waitForSelector('.d3-flame-graph')
  }

On the other hand, time-related issues are typically fixed by increasing the waiting time (code snippet 3) or by adding an explicit wait call. We observe that most flaky behavior occurs due to a timeout that is too short for delayed results or callbacks.

Code snippet 3

  it("should see loaded profiles on the team page", function () {
    cy.visit("/team");
  - cy.wait(250);
  + cy.wait(1000);

    expect(cy.data("contributor-handle-GithubPerson")).to.exist;
  });

To conclude, our preliminary results suggest that developers use different strategies for fixing Async Wait flakiness depending on whether it is time-based or DOM-based. To solve DOM-related flakiness, they usually introduce a synchronization point on a specific DOM element, whereas for time-related flakiness, they usually add or extend the waiting time.

RQ3 How do developers use different test frameworks to handle Async Wait flakiness?

In order to compare the frameworks used to exercise the tests of our dataset, we extract the dependencies of each repository from its build configuration (package.json). Table 2 shows the outcome of this process and indicates that our dataset involves four test automation frameworks.

Table 2: The use of different testing frameworks

  Test framework | Number of flaky tests
  Jest           | 27
  Cypress        | 16
  Mocha          | 14
  Puppeteer      | 5
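As an illustration of this framework-identification step, the following minimal sketch reads a project's package.json and reports which of the studied frameworks it declares; the helper is hypothetical and not the exact tooling used in this study.

  const fs = require('fs');

  const KNOWN_FRAMEWORKS = ['jest', 'cypress', 'mocha', 'puppeteer'];

  // Return the test automation frameworks declared in a project's build configuration.
  function detectFrameworks(packageJsonPath) {
    const pkg = JSON.parse(fs.readFileSync(packageJsonPath, 'utf8'));
    const deps = { ...pkg.dependencies, ...pkg.devDependencies };
    return KNOWN_FRAMEWORKS.filter(name => name in deps);
  }

  // Example: detectFrameworks('./some-project/package.json') may return ['jest', 'puppeteer']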
For each testing framework, we build an exhaustive list of the Synchronization Mechanism Methods (SMM) provided by the framework; for each SMM in this catalog, we compute code metrics associated with its ease of use and the way it interacts with the SUT. Analyzing the entirety of the tests from our projects, regardless of their flakiness, we assign to each test the SMMs gathered from each framework. If an SMM is present in the fix commit identified in RQ2, it is labeled as flakiness-inducing; otherwise, it is labeled as non-inducing. Finally, we perform a statistical analysis to determine whether some types of SMMs, characterized by the pre-computed code metrics, are more likely to generate flakiness. Answering this question allows us to propose strategies that tool providers can implement to assist developers in avoiding the introduction of flakiness. Our preliminary results suggest the following:

• Jest: developers solve Async Wait flakiness mainly by using a separate wait, such as a page.wait() method or delay functions.
• Cypress: developers resolve Async Wait flakiness mainly by using the framework's integrated cy.wait() method.
• Mocha: Async Wait flakiness is also resolved using a separate wait method.
• Puppeteer: Async Wait flakiness is resolved using the framework's integrated page methods.

4. Conclusion and Future work

In this paper, we show that flaky tests are also common in front-end testing. Among the various types of flaky test failures, we specifically target Async Wait flakiness, one of the most prevalent categories of flakiness identified in prior work and throughout our preliminary study. For the benefit of the flakiness community and for our study, we build a dataset of tests exhibiting Async Wait flakiness. With this dataset, we analyze the root causes of Async Wait flakiness and study how it can be fixed effectively by investigating the developers' fixes, with particular attention to the testing frameworks employed and, more specifically, to their implementations of the different synchronization mechanisms. Our current observations suggest that this study will provide useful insights for future research on flaky tests in front-end testing. As part of our future research, we will assess whether the proposed procedure is generalizable and effective on other test suites, and investigate the possibility of developing automated fixing strategies.

References

[1] V. Garousi and M. Felderer, "Developing, verifying, and maintaining high-quality automated test scripts," IEEE Software, vol. 33, pp. 68–75, 2016.
[2] J. Bell, O. Legunsen, M. Hilton, L. Eloussi, T. Yung, and D. Marinov, "DeFlaker: Automatically detecting flaky tests," in 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, 2018, pp. 433–444.
[3] W. Lam, P. Godefroid, S. Nath, A. Santhiar, and S. Thummalapenta, "Root causing flaky tests in a large-scale industrial setting," in Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2019, pp. 101–111.
[4] M. Leotta et al., "A family of experiments to assess the impact of page object pattern in web test suite development," in 2020 IEEE 13th International Conference on Software Testing, Validation and Verification (ICST). IEEE, 2020.
[5] M. Leotta, A. Stocco, F. Ricca, and P. Tonella, "Pesto: Automated migration of DOM-based Web tests towards the visual approach," Software Testing, Verification and Reliability, vol. 28, no. 4, 2018, e1665.
[6] M. Eck, F. Palomba, M. Castelluccio, and A. Bacchelli, "Understanding flaky tests: The developer's perspective," in Joint Meeting of the European Software Engineering Conference and the Symposium on the Foundations of Software Engineering (ESEC/FSE). ACM, 2019, pp. 830–840.
[7] B. Zolfaghari et al., "Root causing, detecting, and fixing flaky tests: State of the art and future roadmap," Software: Practice and Experience, vol. 51, no. 5, 2021, pp. 851–867.
[8] M. Eck et al., "Understanding flaky tests: The developer's perspective," in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019.
[9] Q. Luo, F. Hariri, L. Eloussi, and D. Marinov, "An empirical analysis of flaky tests," in International Symposium on Foundations of Software Engineering (FSE). ACM, 2014, pp. 643–653.
[10] Z. Gao, Y. Liang, M. Cohen, A. Memon, and Z. Wang, "Making system user interactive tests repeatable: When and what should we control?," in ICSE, Florence, Italy, 2015, pp. 55–65.
[11] W. Lam, K. Muşlu, H. Sajnani, and S. Thummalapenta, "A study on the lifecycle of flaky tests," in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (ICSE '20). New York, NY, USA: Association for Computing Machinery, 2020, pp. 1471–1482.
[12] V. Terragni, P. Salza, and F. Ferrucci, "A container-based infrastructure for fuzzy-driven root causing of flaky tests," 2020.
[13] A. Vahabzadeh, A. M. Fard, and A. Mesbah, "An empirical study of bugs in test code," in 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME), Sept. 2015, pp. 101–110.
[14] A. Romano et al., "An empirical analysis of UI-based flaky tests," in 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 2021.
[15] J. Morán et al., "FlakyLoc: Flakiness localization for reliable test suites in web applications," Journal of Web Engineering, no. 2, 2020.
[16] J. Bell et al., "DeFlaker: Automatically detecting flaky tests," in 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, 2018.
[17] A. Shi et al., "iFixFlakies: A framework for automatically fixing order-dependent flaky tests," in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019.
[18] F. Ricca and A. Stocco, "Web test automation: Insights from the grey literature," in International Conference on Current Trends in Theory and Practice of Informatics. Springer, 2021, pp. 472–485.
[19] "Flaky tests at Google and how we mitigate them," https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html.
[20] M. Hilton, T. Tunnell, K. Huang, D. Marinov, and D. Dig, "Usage, costs, and benefits of continuous integration in open-source projects," in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. ACM, 2016, pp. 426–437.
[21] M. Fowler, "Eradicating non-determinism in tests," 2011. [Online]. Available: https://bit.ly/2PFHI5B
[22] M. Gruber et al., "An empirical study of flaky tests in Python," in 2021 14th IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 2021.
[23] D. Spadini et al., "Investigating severity thresholds for test smells," in Proceedings of the 17th International Conference on Mining Software Repositories, 2020.
[24] Y. Bushnev, "Top 15 UI test automation best practices," 2019. URL: https://www.blazemeter.com/blog/top-15-ui-test-automation-best-practices-you-should-follow