Building a Bridge between User-Adaptive Systems Evaluation and Software Testing

Veronika Bogina, Information Systems Dept., The University of Haifa, Haifa 31905, Israel, sveron@gmail.com
Tsvi Kuflik, Information Systems Dept., The University of Haifa, Haifa 31905, Israel, tsvikak@is.haifa.ac.il

ABSTRACT
User-Adaptive Systems (UASs) are futile without software. Moreover, integrating a user modeling component into a software system may introduce bugs if it is not tested properly. However, the evaluation of UASs does not intersect with software evaluation as commonly defined in Software Engineering. We suggest adopting common software engineering practices and changing the community's practice and methods by integrating software testing as an integral part of any study involving software development. This would create a win-win situation for both the researcher and the community, since the code would be bug-free and hence easily reproducible and reusable by other members of the community.

CCS Concepts
• Information systems → Recommender systems • Software and its engineering → Software testing and debugging

Keywords
Software Engineering; Testing; Evaluation; User-Adaptive Systems

1. THE BRIDGE FOUNDATION
User-Adaptive Systems (UASs) are interactive systems that adjust their functionality to individual users according to a user model built by learning user behavior, inference, or decision making [13]. Over the last decade, a wide variety of UASs has been introduced. Indeed, they incentivize researchers to publish their results in papers that often share the same structure: a few common essential parts such as introduction, related work, experiment/method, evaluation and conclusions. The evaluation part allows us, as researchers, to decide whether a specific system lives up to both the scientific community's and the end users' expectations in terms of quality and performance [16]. However, no one, as far as we know, has reported on testing these UASs for their correctness. This is primarily because performing a study without first testing the software properly is the reality in our community. As Valentino Rossi once put it: "Once the races begin it's more difficult and there is never that much time for testing" [9].

Let us consider the expectations of other researchers within the same scientific community when they read a relevant paper and are eager to replicate the experiment, perhaps with a different setup. There are two primary aspects in such a scenario: reproducibility and correctness of the method/algorithm/system introduced in the paper. Nowadays, reproducibility is a hot topic [3][14]. Researchers' intent is obvious: they want to be able to compare their algorithm with the published one by reproducing it according to the paper that describes the study. Moreover, a clear methodology eliminates redundant mail correspondence, in which researchers approach the authors of the published paper for more details, and thereby reduces the frustration of people who often struggle to guess an author's intention. Although everyone agrees that it is essential to improve the quality of research algorithms, at least by making the code publicly available and elaborating on the tools used during the research, nobody reports on any testing done to ensure that a system provides correct, consistent results. Once researchers get good results in terms of accuracy, coverage, novelty, serendipity or any other predefined evaluation metric, they hasten to share their findings with the community. However, without testing whether there is a bug in their implementation, or an inconsistency between the chain of actions and the paper they have published, the results may be questionable.

Hence we ask the reader: "How do you know whether your User-Adaptive System's results are correct? Have you tried to verify your results by writing new code and reproducing the steps that were defined in your paper?" By following such a strategy, you can kill two birds with one stone: first, test that your method/experiment is reproducible, and second, show that it works correctly by checking the results against another implementation.

The goal of this position paper is to examine existing evaluation methods in both UASs and Software Engineering and to suggest a testing approach that will strengthen this research field's outcomes in the future.

Evaluation metrics are commonly used to determine both the quality and performance of UASs [2]. The most frequently used are statistical evaluation methods, personality tests, accuracy, RMSE, A/B testing and benchmarking. Although all these metrics appraise UASs with respect to performance, accuracy and statistical significance, none of them addresses software testing for the code that is used to implement these UASs.
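As a concrete illustration of the "verify with a second implementation" idea above (a minimal sketch with hypothetical function names, not code from any published system), one can cross-check the metric code used in a study against an independently written, naive re-implementation on the same data:

# Hypothetical sketch: cross-checking one implementation of an evaluation
# metric against an independent, naive re-implementation on the same data.
import math
import random

def rmse_fast(predictions, truths):
    """RMSE as it might appear in a study's codebase."""
    n = len(predictions)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, truths)) / n)

def rmse_naive(predictions, truths):
    """Independent re-implementation used only for verification."""
    total = 0.0
    for i in range(len(predictions)):
        diff = predictions[i] - truths[i]
        total += diff * diff
    return (total / len(predictions)) ** 0.5

if __name__ == "__main__":
    random.seed(42)
    truths = [random.uniform(1, 5) for _ in range(1000)]
    predictions = [t + random.gauss(0, 0.5) for t in truths]
    # The two independently written implementations must agree;
    # a mismatch would indicate a bug in at least one of them.
    assert abs(rmse_fast(predictions, truths) - rmse_naive(predictions, truths)) < 1e-9
    print("RMSE cross-check passed:", round(rmse_fast(predictions, truths), 4))

The same pattern scales up: if the paper's pipeline and a clean-room reimplementation of its steps disagree, either the code or the paper's description is wrong, and both cases are worth discovering before publication.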
2. A BRIDGE TO SOFTWARE TESTING
According to Myers' classic definition, "Testing is the process of executing a program with the intention of finding errors." The intent of testing is to discover as many errors as possible and thereby bring the tested software to an accepted level of quality [5].

Software tests can be classified in different ways: according to the testing concept or according to the requirements [5]. The former distinguishes black-box testing (functionality) from white-box testing (structure). The latter is defined by McCall's classic model for the classification of software quality requirements [7], shown in Table 9.1 of [6].

The choice of testing strategy depends on the software and its requirements: whether one develops a desktop, web, Android or mission-critical application.

Let us examine two kinds of applications to show the differences in the testing process. A web application, a client-server software application, differs from other applications in a few ways. Indeed, it can be accessed by a large number of users from different parts of the world. Since each of them uses different hardware, operating systems, Web browsers and so on, such an application should be able to run on heterogeneous execution environments. Moreover, the ability to respond to user input in real time is essential [4]. Common examples of such applications are web mail, online retail stores, instant messaging chats, wikis and so on [12]. From the testing perspective, execution performance, availability testing, web accessibility, testing across different web browsers, operating systems and middleware, security, usability and hyperlink testing are germane testing mechanisms that should be used here in order to verify functional and non-functional requirements [4].
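To ground the black-box notion in the UAS setting (a sketch under assumed, hypothetical names; the recommender below is a toy and not taken from this paper), a functional test exercises a top-N recommender purely through its interface, without looking at its internals:

# Minimal sketch of black-box (functional) tests for a hypothetical
# top-N recommender: only inputs and outputs are examined.
import unittest

def recommend_top_n(user_ratings, item_popularity, n):
    """Toy popularity-based recommender: suggest the n most popular unseen items."""
    unseen = [item for item in item_popularity if item not in user_ratings]
    unseen.sort(key=lambda item: item_popularity[item], reverse=True)
    return unseen[:n]

class RecommenderBlackBoxTest(unittest.TestCase):
    def test_excludes_already_rated_items(self):
        recs = recommend_top_n({"a": 5}, {"a": 100, "b": 10, "c": 5}, n=2)
        self.assertNotIn("a", recs)  # never re-recommend items the user has seen

    def test_returns_at_most_n_items(self):
        recs = recommend_top_n({}, {"a": 3, "b": 2, "c": 1}, n=2)
        self.assertEqual(len(recs), 2)

    def test_orders_by_popularity(self):
        recs = recommend_top_n({}, {"a": 1, "b": 9, "c": 5}, n=3)
        self.assertEqual(recs, ["b", "c", "a"])

if __name__ == "__main__":
    unittest.main()

A white-box counterpart would, for example, check that every branch of the candidate-filtering logic is exercised; the point is only that both styles transfer directly to user modeling code.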
On the other hand, in mission-critical systems, whose failure may cause the failure of some goal-directed activity, errors or failures cannot be tolerated. Moreover, in safety-critical applications a failure can be catastrophic, meaning that errors in such systems are unacceptable. Thus reliability, availability, clear documentation and instructions, proper design and reviews, and security are essential parts of genuine testing [15]. Common examples are online banking systems, railway and aircraft operating systems, electric power systems and other similar computer systems [10].

As can be seen above, software testing is an essential part of software development. All applications require such an approach, without exception, and the testing methods are chosen according to their functionality and goals.

3. DISCUSSION AND CONCLUSIONS
In this section we claim that, since user models and user modeling components should not be separated from the software, the testing techniques that are applicable in software development should also suit User Modeling research. The only question is which testing techniques can be applied from the Software Engineering field.

After scrutinizing the various factor categories [6] and existing evaluation methods, it caught our eye that there is a gap in UAS testing: only the Operation factor is partially covered in UAS evaluation, while Revision and Transition are not considered. From the Revision perspective, there is a need to ensure the code's testability, and the code needs to be tested at least for accuracy, as a starting point. Let us argue how using Transition techniques can be advantageous. Wouldn't it be useful to make our code reusable by publishing it on GitHub with good documentation and clear code, thereby allowing other researchers to reuse modules from it? What about portability? Should we write software in Java ("write once, run everywhere") to exclude compatibility issues, like the ones we have with different Python versions and package support? Adaptability, installability and interoperability [11] can only strengthen our systems and code exchange between researchers.

We advocate that using software testing in User-Adaptive Systems is an essential process for the future, though it is up to researchers to decide which technique to use for their specific system. It would be beneficial to both the researcher and the community.

Furthermore, in order to be able to rely on the results of a system, there is a need to ensure its testability, and it needs to be tested at least for accuracy. Moreover, in order to enable replication of the experiments and software reuse, both portability and reusability need to be tested as well. We believe that approaching testing in UASs is going to benefit both the researcher and the community. As Burt Rutan maintained: "Testing leads to failure, and failure leads to understanding" [8].
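As an illustration of the Revision and replication points above (a minimal sketch with hypothetical names, not code from this paper), even two tiny tests can catch a miscomputed accuracy metric or a non-reproducible experiment:

# Hedged sketch: one check for metric accuracy (Revision) and one for
# replicability of a seeded toy experiment (reproducibility).
import random

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

def run_toy_experiment(seed):
    """Stand-in for an experiment: a seeded shuffle followed by evaluation."""
    rng = random.Random(seed)
    items = list("abcdefgh")
    rng.shuffle(items)
    return precision_at_k(items, relevant={"a", "b", "c"}, k=4)

def test_metric_matches_hand_computed_value():
    # 2 of the top 3 recommendations are relevant -> precision@3 = 2/3.
    assert abs(precision_at_k(["x", "a", "b"], {"a", "b"}, k=3) - 2 / 3) < 1e-12

def test_experiment_is_replicable():
    # The same seed must yield exactly the same result on every run.
    assert run_toy_experiment(seed=7) == run_toy_experiment(seed=7)

if __name__ == "__main__":
    test_metric_matches_hand_computed_value()
    test_experiment_is_replicable()
    print("accuracy and replication checks passed")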
4. REFERENCES
[1] Avazpour, I., Pitakrat, T., Grunske, L., & Grundy, J. (2014). Dimensions and metrics for evaluating recommendation systems. In Recommendation Systems in Software Engineering, 245-273. Springer Berlin Heidelberg.
[2] Chin, D. N. (2001). Empirical evaluation of user models and user-adapted systems. User Modeling and User-Adapted Interaction, 11(1-2), 181-194.
[3] Davidson, J., Liebald, B., Liu, J., Nandy, P., Van Vleet, T., Gargi, U., & Sampath, D. (2010, September). The YouTube video recommendation system. In Proceedings of the fourth ACM conference on Recommender systems, 293-296. ACM.
[4] Di Lucca, G. A., & Fasolino, A. R. (2006). Web application testing. In Web Engineering, 219-260. Springer Berlin Heidelberg.
[5] Galin, D. (2004). Software quality assurance: from theory to implementation. Pearson Education.
[6] Galin, D. (2004). Software quality assurance: from theory to implementation. Pearson Education. Table 9.1, 188.
[7] General Electric Company, McCall, J. A., Richards, P. K., & Walters, G. F. (1977). Factors in software quality: Final report. Information Systems Programs, General Electric Company.
[8] http://www.brainyquote.com/quotes/quotes/b/burtrutan394556.html
[9] http://www.brainyquote.com/quotes/quotes/v/valentinor301235.html
[10] https://en.wikipedia.org/wiki/Mission_critical
[11] https://en.wikipedia.org/wiki/Portability_testing
[12] https://en.wikipedia.org/wiki/Web_application
[13] Jameson, A. (2001, December). User-adaptive and other smart adaptive systems: Possible synergies. In Proceedings of the first EUNITE Symposium, Tenerife, Spain.
[14] Kohavi, R., Longbotham, R., Sommerfield, D., & Henne, R. M. (2009). Controlled experiments on the web: survey and practical guide. Data Mining and Knowledge Discovery, 18(1), 140-181.
[15] Parnas, D. L., van Schouwen, A. J., & Kwan, S. P. (1990). Evaluation of safety-critical software. Communications of the ACM, 33(6), 636-648.
[16] Said, A., & Bellogín, A. (2015, September). Replicable Evaluation of Recommender Systems. In Proceedings of the 9th ACM Conference on Recommender Systems, 363-364. ACM.