Building a Bridge between User-Adaptive Systems Evaluation and Software Testing

Veronika Bogina, Information Systems Dept., The University of Haifa, Haifa 31905, Israel, sveron@gmail.com
Tsvi Kuflik, Information Systems Dept., The University of Haifa, Haifa 31905, Israel, tsvikak@is.haifa.ac.il

ABSTRACT
User-Adaptive Systems (UASs) are futile without software. Moreover, integrating a user modeling component into a software system may introduce bugs if it is not tested properly. However, the evaluation of UASs does not intersect with software evaluation as commonly defined in Software Engineering. We suggest adopting common software engineering practices and changing the community's practice and methods by integrating software testing as an integral part of any study involving software development. This would create a win-win situation for both the researcher and the community, since the code would be bug-free and hence easily reproducible and reusable by other members of the community.

CCS Concepts
• Information systems → Recommender systems • Software and its engineering → Software testing and debugging

Keywords
Software Engineering; Testing; Evaluation; User-Adaptive Systems

1. THE BRIDGE FOUNDATION
User-Adaptive Systems (UASs) are interactive systems that adjust their functionality to individual users according to a user model built by learning user behavior, inference, or decision making [13]. Over the last decade, a wide variety of UASs has been introduced. Indeed, they incentivize researchers to publish their results in papers that often share the same structure: a few common essential parts such as introduction, related work, experiment/method, evaluation and conclusions. The evaluation part allows us, as researchers, to decide whether a specific system lives up to both the scientific community's and the end users' expectations in terms of quality and performance [16]. However, no one, as far as we know, has reported on testing these UASs for their correctness. This is primarily because performing a study without first testing the software properly is the reality in our community. As Valentino Rossi once put it: "Once the races begin it's more difficult and there is never that much time for testing" [9].

Let us consider the expectations of other researchers within the same scientific community when they read a relevant paper and are eager to replicate the experiment, perhaps with a different setup. There are two primary aspects in such a scenario: reproducibility and correctness of the method/algorithm/system introduced in the paper. Nowadays, reproducibility is a hot topic [3][14]. Researchers' intent is obvious: they want to be able to compare their algorithm with the published one by reproducing it according to the paper that describes the study. Moreover, a clear methodology eliminates redundant mail correspondence, in which researchers approach the authors of the published paper for more details, and thereby reduces the frustration of people who often struggle to guess an author's intention. Although everyone agrees that it is essential to improve the quality of research algorithms, at least by making the code publicly available and elaborating on the tools used during the research, nobody reports on any testing done to ensure that a system provides correct, consistent results. Once researchers get good results in terms of accuracy, coverage, novelty, serendipity or any other predefined evaluation metric, they hasten to share their findings with the community. However, without testing whether there is a bug in their implementation, or an inconsistency between the chain of actions and the paper they have published, the results may be questionable.

Hence we ask the reader: "How do you know whether your User-Adaptive System's results are correct? Have you tried to verify your results by writing new code and reproducing the steps that were defined in your paper?" By following such a strategy, you can kill two birds with one stone: first, test that your method/experiment is reproducible, and second, show that it works correctly by checking the results against another implementation.

The goal of this position paper is to examine existing evaluation methods in both UASs and Software Engineering and to suggest a testing approach that will strengthen this research field's outcomes in the future.

Evaluation metrics are commonly used to determine both the quality and performance of UASs [2]. The most frequently used are statistical evaluation methods, personality tests, accuracy, RMSE, A/B testing and benchmarking. Although all these metrics appraise UASs with respect to performance, accuracy and statistical significance, none of them addresses software testing for the code that is used to implement these UASs.
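As a concrete illustration of the "verify with a second implementation" idea above (a minimal sketch with hypothetical function names, not code from any published system), one can cross-check the metric code used in a study against an independently written, naive re-implementation on the same data:

# Hypothetical sketch: cross-checking one implementation of an evaluation
# metric against an independent, naive re-implementation on the same data.
import math
import random

def rmse_fast(predictions, truths):
    """RMSE as it might appear in a study's codebase."""
    n = len(predictions)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, truths)) / n)

def rmse_naive(predictions, truths):
    """Independent re-implementation used only for verification."""
    total = 0.0
    for i in range(len(predictions)):
        diff = predictions[i] - truths[i]
        total += diff * diff
    return (total / len(predictions)) ** 0.5

if __name__ == "__main__":
    random.seed(42)
    truths = [random.uniform(1, 5) for _ in range(1000)]
    predictions = [t + random.gauss(0, 0.5) for t in truths]
    # The two independently written implementations must agree;
    # a mismatch would indicate a bug in at least one of them.
    assert abs(rmse_fast(predictions, truths) - rmse_naive(predictions, truths)) < 1e-9
    print("RMSE cross-check passed:", round(rmse_fast(predictions, truths), 4))

The same pattern scales up: if the paper's pipeline and a clean-room reimplementation of its steps disagree, either the code or the paper's description is wrong, and both cases are worth discovering before publication.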
2. A BRIDGE TO SOFTWARE TESTING
According to Myers' classic definition, "Testing is the process of executing a program with the intention of finding errors." The intent of testing is to discover as many errors as possible and thereby bring the tested software to an accepted level of quality [5].

Software tests can be classified in different ways: according to the testing concept or according to the requirements [5]. The former distinguishes black-box testing (functionality) from white-box testing (structure). The latter is defined by McCall's classic model for the classification of software quality requirements [7], shown in Table 9.1 of [6].

The choice of testing strategy depends on the software and its requirements: whether one develops a desktop, web, Android or mission-critical application.

Let us examine two kinds of applications to show the differences in the testing process. A web application, a client-server software application, differs from other applications in a few ways. Indeed, it can be accessed by a large number of users from different parts of the world. Since each of them uses different hardware, operating systems, Web browsers and so on, such an application should be able to run on heterogeneous execution environments. Moreover, the ability to respond to user input in real time is essential [4]. Common examples of such applications are web mail, online retail stores, instant messaging chats, wikis and so on [12]. From the testing perspective, execution performance, availability testing, web accessibility, testing across different web browsers, operating systems and middleware, security, usability and hyperlink testing are germane testing mechanisms that should be used here in order to verify functional and non-functional requirements [4].
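To ground the black-box notion in the UAS setting (a sketch under assumed, hypothetical names; the recommender below is a toy and not taken from this paper), a functional test exercises a top-N recommender purely through its interface, without looking at its internals:

# Minimal sketch of black-box (functional) tests for a hypothetical
# top-N recommender: only inputs and outputs are examined.
import unittest

def recommend_top_n(user_ratings, item_popularity, n):
    """Toy popularity-based recommender: suggest the n most popular unseen items."""
    unseen = [item for item in item_popularity if item not in user_ratings]
    unseen.sort(key=lambda item: item_popularity[item], reverse=True)
    return unseen[:n]

class RecommenderBlackBoxTest(unittest.TestCase):
    def test_excludes_already_rated_items(self):
        recs = recommend_top_n({"a": 5}, {"a": 100, "b": 10, "c": 5}, n=2)
        self.assertNotIn("a", recs)  # never re-recommend items the user has seen

    def test_returns_at_most_n_items(self):
        recs = recommend_top_n({}, {"a": 3, "b": 2, "c": 1}, n=2)
        self.assertEqual(len(recs), 2)

    def test_orders_by_popularity(self):
        recs = recommend_top_n({}, {"a": 1, "b": 9, "c": 5}, n=3)
        self.assertEqual(recs, ["b", "c", "a"])

if __name__ == "__main__":
    unittest.main()

A white-box counterpart would, for example, check that every branch of the candidate-filtering logic is exercised; the point is only that both styles transfer directly to user modeling code.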
On the other hand, in mission-critical systems, whose failure may cause the failure of some goal-directed activity, errors or failures cannot be tolerated. Moreover, in safety-critical applications a failure can be catastrophic, meaning that errors in such systems are unacceptable. Thus reliability, availability, clear documentation and instructions, proper design and reviews, and security are essential parts of genuine testing [15]. Common examples are online banking systems, railway and aircraft operating systems, electric power systems and other similar computer systems [10].

As can be seen above, software testing is an essential part of software development. All applications require such an approach, without exception, and the testing methods are chosen according to their functionality and goals.

3. DISCUSSION AND CONCLUSIONS
In this section we claim that, since user models and user modeling components should not be separated from the software, the testing techniques that are applicable in software development should also suit User Modeling research. The only question is which testing techniques can be applied from the Software Engineering field.

After scrutinizing the various factor categories [6] and existing evaluation methods, it caught our eye that there is a gap in UAS testing: only the Operation factor is partially covered in UAS evaluation, while Revision and Transition are not considered. From the Revision perspective, there is a need to ensure the code's testability, and the code needs to be tested at least for accuracy, as a starting point. Let us argue how using Transition techniques can be advantageous. Wouldn't it be useful to make our code reusable by publishing it on GitHub with good documentation and clear code, thereby allowing other researchers to reuse modules from it? What about portability? Should we write software in Java ("write once, run everywhere") to exclude compatibility issues, like the ones we have with different Python versions and package support? Adaptability, installability and interoperability [11] can only strengthen our systems and code exchange between researchers.

We advocate that using software testing in User-Adaptive Systems is an essential process for the future, though it is up to researchers to decide which technique to use for their specific system. It would be beneficial to both the researcher and the community.

Furthermore, in order to be able to rely on the results of a system, there is a need to ensure its testability, and it needs to be tested at least for accuracy. Moreover, in order to enable replication of the experiments and software reuse, both portability and reusability need to be tested as well. We believe that approaching testing in UASs is going to benefit both the researcher and the community. As Burt Rutan maintained: "Testing leads to failure, and failure leads to understanding" [8].
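As an illustration of the Revision and replication points above (a minimal sketch with hypothetical names, not code from this paper), even two tiny tests can catch a miscomputed accuracy metric or a non-reproducible experiment:

# Hedged sketch: one check for metric accuracy (Revision) and one for
# replicability of a seeded toy experiment (reproducibility).
import random

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

def run_toy_experiment(seed):
    """Stand-in for an experiment: a seeded shuffle followed by evaluation."""
    rng = random.Random(seed)
    items = list("abcdefgh")
    rng.shuffle(items)
    return precision_at_k(items, relevant={"a", "b", "c"}, k=4)

def test_metric_matches_hand_computed_value():
    # 2 of the top 3 recommendations are relevant -> precision@3 = 2/3.
    assert abs(precision_at_k(["x", "a", "b"], {"a", "b"}, k=3) - 2 / 3) < 1e-12

def test_experiment_is_replicable():
    # The same seed must yield exactly the same result on every run.
    assert run_toy_experiment(seed=7) == run_toy_experiment(seed=7)

if __name__ == "__main__":
    test_metric_matches_hand_computed_value()
    test_experiment_is_replicable()
    print("accuracy and replication checks passed")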
4. REFERENCES
[1] Avazpour, I., Pitakrat, T., Grunske, L., & Grundy, J. (2014). Dimensions and metrics for evaluating recommendation systems. In Recommendation Systems in Software Engineering, 245-273. Springer Berlin Heidelberg.
[2] Chin, D. N. (2001). Empirical evaluation of user models and user-adapted systems. User Modeling and User-Adapted Interaction, 11(1-2), 181-194.
[3] Davidson, J., Liebald, B., Liu, J., Nandy, P., Van Vleet, T., Gargi, U., & Sampath, D. (2010, September). The YouTube video recommendation system. In Proceedings of the fourth ACM conference on Recommender systems, 293-296. ACM.
[4] Di Lucca, G. A., & Fasolino, A. R. (2006). Web application testing. In Web Engineering, 219-260. Springer Berlin Heidelberg.
[5] Galin, D. (2004). Software quality assurance: from theory to implementation. Pearson Education.
[6] Galin, D. (2004). Software quality assurance: from theory to implementation. Pearson Education. Table 9.1, 188.
[7] General Electric Company, McCall, J. A., Richards, P. K., & Walters, G. F. (1977). Factors in software quality: Final report. Information Systems Programs, General Electric Company.
[8] http://www.brainyquote.com/quotes/quotes/b/burtrutan394556.html
[9] http://www.brainyquote.com/quotes/quotes/v/valentinor301235.html
[10] https://en.wikipedia.org/wiki/Mission_critical
[11] https://en.wikipedia.org/wiki/Portability_testing
[12] https://en.wikipedia.org/wiki/Web_application
[13] Jameson, A. (2001, December). User-adaptive and other smart adaptive systems: Possible synergies. In Proceedings of the first EUNITE Symposium, Tenerife, Spain.
[14] Kohavi, R., Longbotham, R., Sommerfield, D., & Henne, R. M. (2009). Controlled experiments on the web: survey and practical guide. Data Mining and Knowledge Discovery, 18(1), 140-181.
[15] Parnas, D. L., van Schouwen, A. J., & Kwan, S. P. (1990). Evaluation of safety-critical software. Communications of the ACM, 33(6), 636-648.
[16] Said, A., & Bellogín, A. (2015, September). Replicable Evaluation of Recommender Systems. In Proceedings of the 9th ACM Conference on Recommender Systems, 363-364. ACM.