Evaluation and user acceptance issues of a Bayesian classifier based TV Recommendation System

Benedikt Engelbert
University of Applied Sciences Osnabrück
Artilleriestr. 46
49076 Osnabrück
+49 541 969 3262
b.engelbert@hs-osnabrueck.de

Karsten Morisse
University of Applied Sciences Osnabrück
Barbarastr. 16
49076 Osnabrück
+49 541 969 3615
k.morisse@hs-osnabrueck.de

Kai-Christoph Hamborg
University of Osnabrück
Seminarstr. 20
49069 Osnabrück
+49 541 969 4703
k.hamborg@uni-osnabrueck.de

CARS-2012, September 9, 2012, Dublin, Ireland.
Copyright is held by the author/owner(s).

ABSTRACT
Nowadays there is a large variety of TV channels and programs. This seems to be an advantage for the TV user, but in most cases the user is overwhelmed and unable to choose the most appropriate content. Assistive systems are needed to support the user in selecting the content that best matches his or her interests. The research group Next Generation PVR set out to develop a user-supporting Personal Video Recorder (PVR) in the form of a Bayesian classifier based recommendation system. The work on the prototype of the system is almost done. This paper focuses on the evaluation of that system. We present two types of evaluation scenarios as well as an approach for measuring user acceptance of a TV recommendation system; user acceptance is surveyed as part of the evaluation. In addition, the results of both scenarios and of the user acceptance survey are presented and discussed.

Categories and Subject Descriptors
I.2.m [Artificial Intelligence]: Miscellaneous

General Terms
Algorithms, Experimentation, Human Factors

Keywords
Recommendation System, Television, Evaluation, Bayesian classifier, User Acceptance

1. INTRODUCTION
The way media is consumed has clearly changed in recent years. Especially in times of high bandwidth and a wealth of media delivery services, the Internet plays an important role in the consumption of audio/video content. However, live television is still the most popular medium. In Germany 97% of all households possess a TV set, with an average use of 220 minutes a day [1]. It is therefore evident that the television market is still interesting for broadcasters. The satellite operator ASTRA carries up to 1700 TV channels for the region of Germany alone. Leaving aside encrypted and shopping channels, 53 is a realistic number of receivable TV channels [2]. For the TV user it is quite difficult to handle this enormous offer of content. In most cases even extensive TV guides list only a limited number of TV channels, and often only popular ones. The user has to invest time to get an overview of all the available content. Because this takes too much effort, most users focus on favored or popular TV channels, and the content that is most interesting with regard to the user's interests remains unnoticed. For this reason, a user-supporting Personal Video Recorder based on a Bayesian classifier has been presented in [3] to generate personalized TV recommendations and to counteract the problems of the given TV landscape from a user perspective. Because evaluation results were missing, it was not possible to make a statement about the quality of the recommendations. In this paper we present two evaluation scenarios to measure the quality of the developed Bayesian classifier based TV recommendation system. The quality of a recommendation system is obviously one of the most important properties to be determined. Nevertheless, in a user-supporting system the question of user acceptance is just as important. For this reason this paper also describes an approach for measuring user acceptance of recommendation systems and the associated results for the given system. It will be shown that the quality of recommendations and the user acceptance are strongly related.

To present our work, the paper is organized as follows: section 2 gives an overview of related work on recommendation systems with a focus on multimedia and TV systems. Section 3 gives a short overview of the evaluated system. We explain only the parts of the system that are necessary for the evaluation. Section 4 explains both evaluation scenarios and presents our approach for measuring user acceptance. The results of the evaluation are described in section 5 and discussed in section 6. The paper ends with a conclusion and future work.

2. RELATED WORK
Recommendation systems (RS) exist in several application areas and for different types of content. One of the most common examples is Amazon.com. Within the online shop, Amazon recommends items like books or CDs on the basis of already purchased items. For this, Amazon uses an item-to-item based approach, in which relations between items are calculated [4]. RS are also common in the area of multimedia applications. YouTube, for instance, is a popular online video broadcaster with millions of queries every day. The YouTube RS generates recommendations for related video content by analyzing user activities on the portal website [5]. With the intention of counteracting the enormous offer of video content on YouTube, the work in [6] presents a mobile RS based on an extended Bayesian classifier. The Bayesian classifier has also proven to be an efficient and well-working method in other fields of RS. In 1996 Billsus and Pazzani presented their work on classifying web sites using the Bayes classifier [7]. The main goal of the work was to classify web sites automatically into the classes like or dislike. This is done by extracting information from the HTML tags within the source code. The measured accuracy of the classification process reached up to 81% of properly classified web sites.
The work points out that an increasing database does not yield an increasing accuracy of the classification process. A similar effect for a Bayesian classifier approach has been reported in [8]. That work describes a system for classifying Arabic text documents. For the evaluation, a data set with an increasing number of words was created. The results show that the accuracy of correctly classified documents is highest, at 74%, for 800 words. With a further increasing number of words the accuracy continuously decreases. This information needs to be carefully respected in our evaluation framework described in section 4.

Recommendation systems are also available in the context of television. As early as 1998, [9] discusses fundamental ideas and methods for RS in the TV application area. Das and Herman characterize the use of two different user profiles, which are influenced by the implicit and explicit behavior of a TV user. Those user profiles are also considered in our work. A more concrete scenario is described in the work of Gutta et al. [10]. The paper describes an intelligent TV guide, which generates personalized TV recommendations. Those recommendations are generated from an adaptive user profile, in which a user's search requests for TV content are captured. A Bayesian classifier is also used for the recommendation process. In [11] a personalized Electronic Program Guide (EPG) is described. Through user interactions with a Set-Top-Box, a user profile can be derived to generate personalized recommendations. Users are assigned to a user group, on which the recommendations depend. An evaluation showed that the accuracy of this approach approaches 70% at maximum. A popular example in the TV area is the Set-Top-Box system TiVo [12]. The TiVo approach is similar to the Amazon item-to-item recommendation process described in [4], with the TiVo user base rating TV content. Using those ratings, the system tries to find related and similar TV items. The beginning period of the system is problematic: an item-to-item approach needs a certain number of ratings before the system is able to generate accurate recommendations. The TiVo system therefore uses a context-aware Bayesian classifier to counteract those circumstances. The work in [12] speaks of accurate, but internal evaluation results. That is why it is not possible to name any results at this point.

3. SYSTEM OVERVIEW
The following section describes the system architecture of the developed system. Section 2 presented several approaches, some ideas of which were also considered for this architecture. Figure 1 shows the components for the recommendation process within the system.

[Figure 1. System Architecture. Components: EPG Data Set; User Profiles (Initial, Adaptive); Bayesian classifier; Class like; Class dislike]

The main functionality of the Bayesian classifier is to classify given content objects into the classes like or dislike. The classifier is thus based on a two-class decision model, in which the conditional probability that a certain content fits into one of the classes like or dislike is calculated. Within the calculation the classifier compares object attributes and counts how often they occur in each of the predefined classes. For a detailed description of the classifier calculation we refer to our earlier work [3]. The object attributes are derived from the EPG data set the system uses. More specifically, the system uses generic DomainObjects, which are described for each application by metadata. In the given application, the metadata is derived from the EPG data with the following attributes: TV Channel, Title, Subtitle, Category (e.g. Movie, Show), Genre (e.g. Comedy), Actors, Description, Year. With the generic data model it is possible to use the classifier in other application areas as well.

The classification process needs a database on which the calculation depends. For this purpose the system provides User Profiles. There are two types of user profiles: the initial and the adaptive user profile. The initial profile is important at the beginning of the system usage. In section 2 the cold start problem has already been discussed. The user can fill the initial profile with TV content he or she likes, in order to get recommendations in the beginning period of the system as well. For this purpose, the system provides a web-based TV guide. The initial profile is created just once, whereas the adaptive profile is built up continuously during the system use. The adaptive profile is thus intended to grow over time, so that its influence as a database for the recommendation process increases. Conversely, the influence of the initial profile decreases. The assumption is that the larger the adaptive profile, the better the generated recommendations.

It remains an open question how the adaptive profile is created. We distinguish between implicit and explicit feedback, based on the work of Das and Herman in [9]. With explicit feedback the user consciously expresses whether he or she likes a certain TV content or not (e.g. by rating a TV content). Implicit feedback is based on the viewing behavior of the user. For example, if a user watched a TV content in full, the system assumes that the user likes the content. Otherwise the system registers the content as disliked. In this context the work of Hu et al. in [19] should be mentioned. That paper gives a profound analysis of implicit feedback, but it describes the use of a much more complex model. The improvement of our model could be future work. It is important to note that the DomainObjects stored within the user profiles are each assigned to a class like or dislike. This is necessary for the calculation mentioned at the beginning of this section.

4. EVALUATION
In section 3 the main components of the system architecture and their functionality have been described. We focused only on those parts of the system that are important for the evaluation process described in this section. First we explain two evaluation scenarios. The first one is an online scenario intended to reach a high number of potential volunteers. The second one is a more realistic scenario, in which the volunteers write down their TV habits in a TV diary. In section 4.2 we describe an approach to measure user acceptance of the given system. The approach is based on the Technology Acceptance Model (TAM), which characterizes the relationship between information systems and user acceptance.

4.1 Scenarios

4.1.1 Online Scenario
An important question at the beginning of the scenario design is at what point of the recommendation process the quality can be measured. As pointed out before, the system uses a Bayesian classifier, which classifies TV content into one of the classes like or dislike. In the long term, the main database for generating recommendations is the adaptive user profile, which is gathered from the feedback of the user during the use of the system. The idea of the online scenario is to simulate the creation of an adaptive user profile, on the basis of which the user gets personalized recommendations. We consider at least five steps in which the user fills the adaptive profile and the system generates recommendations. After every step the user needs to rate each of the given recommendations. Every step represents one week of a user's watching behavior, which means that the system generates recommendations from one week of EPG data. The rating of the recommendations is effectively a verification of the classification the system did. The system presents 20 TV items, of which ten belong to the class like and ten to the class dislike. The items are displayed to the user without class affiliation. The user needs to perform a classification on his or her own. In the background the system compares both classifications and stores the classification of the user within the adaptive profile for the next step. At the end of the scenario there is a sixth, final classification step, in which the system calculates the quality of the classification. To measure the quality, we use the F1 score, which is the harmonic mean of precision and recall. Precision and recall are common metrics for assessing the accuracy of classification systems.

To meet the requirements of the evaluation, the relationship between the temporal development of the adaptive user profile and the quality of the system classification needs to be respected. For this reason, every evaluation user performs six steps overall, so that the scenario has the same length and the same requirements for all users. Internally, every volunteer is randomly assigned a different number of steps. Thus, the completion of the adaptive profile varies between one and five weeks, and only the data for that random count of steps is stored. At the end of the evaluation we can compare the calculated quality between the varying lengths of the scenario. At the beginning of the scenario the user creates an initial profile with ten objects.

4.1.2 TV diary
The measurement of quality is similar to the scenario described in section 4.1.1. The significant difference is the development of the adaptive user profile. Whereas the Online Scenario was more a simulation of data creation, the TV diary deals with TV content that was really watched. The participants watch television as usual. They need to write down all the watched TV content in a TV diary with the name of the series, date and time, so that the system can find the watched content within the given EPG data. It is also mandatory for the participants to write down TV content they began to watch but switched away from out of dissatisfaction. Zapping behavior is not considered, because the system would not consider it either. The participants write down their viewing behavior for five weeks. After that, the evaluation supervisor collects all the diaries and transfers the diary data to the system. On the basis of this data, the system generates recommendations. Each participant needs to rate the system recommendations by dividing the displayed, unordered TV content into the classes like and dislike. The classifications of the system and of the participant are compared, and on this basis the quality is calculated. This is done in the same way as in the sixth step of the Online Scenario (cp. section 4.1.1). An initial profile is not created.

4.2 User Acceptance
Measuring user acceptance of a software system is related to the field of psychology. It is necessary to figure out the conditions that result in an actual use of a software system. As early as 1989 Davis presented the Technology Acceptance Model (TAM). The TAM is a model from information systems research that defines the variables Perceived Usefulness (PU) and Perceived Ease of Use (PEOU). These two variables are defined as the main factors influencing the use of a system. Davis defines the variables as follows:

Perceived Usefulness: "the degree to which a person believes that using a particular system would enhance his or her job performance" cp. [13] p. 320

Perceived Ease of Use: "the degree to which a person believes that using a particular system would be free of effort" cp. [13] p. 320

There is a lot of research on user acceptance issues in recommender systems, cp. [14][15]. Especially the work of Jones and Pu in [14] about user acceptance issues in music recommender systems is interesting for the work presented in this paper due to similar conditions. Jones and Pu adapted the TAM for use in music recommendation systems by defining the variables in a more application-specific way. They pointed out that 1) the entertainment provided by the given recommendations, 2) the correct adaption to the user feedback and 3) the completeness of the given database are important for Perceived Usefulness. As factors for Perceived Ease of Use Jones and Pu name usability and the effort needed until the system works properly. Because we use a simulation within the evaluation process, the aspect of usability has been almost entirely discarded. Only the usability of the creation of the initial user profile can be surveyed. At the end of the evaluation we presented a questionnaire with 20 items, adapted from the work in [14]. The user answered on a five-step Likert scale from 1 (totally disagree) to 5 (totally agree). The items were formulated as positive or negative statements so that the evaluation user could answer with the given Likert scale.

F1   I liked the TV content which has been recommended to me
F2   The TV content which has been recommended to me was tailored to my taste
F3   The TV content which has been recommended was new to me
F4   I liked the TV content I already knew
F5   In general I was satisfied with the recommendations
F6   The recommendations were as good as the recommendations from my friends
F7   Many recommendations were too similar to each other
F8   The system has a huge selection of possible content
F9   I like that the system identifies my taste
F10  I can influence the quality of the recommendations with my feedback
F11  The system recognizes my taste
F12  I know how the system generates recommendations after I've used it
F13  The time the system needs to generate recommendations is appropriate
F14  For the creation of the initial profile I needed to spend too much time
F15  The creation of the initial profile was easy and comfortable
F16  I would create the initial profile again to get proper recommendations fast
F17  The system asks me for my television watching too much
F18  If there is another technology which recommends other things to me (e.g. books), I would use it
F19  I think the system is useful and I would use it again
F20  I think the system is useful for choosing interesting TV content

Table 1 Questionnaire User Acceptance

5. RESULTS
In the following section we present the evaluation results. The section is again divided into two subsections. Section 5.1 presents the results for the evaluation scenarios; section 5.2 presents the results of the user acceptance questionnaire. The results will be discussed in section 6.

5.1 Evaluation Results

5.1.1 Online Scenario
For the Online Scenario we had 51 participants, nine of whom did not finish the evaluation. This reduces the number to 42, with 34 male and eight female participants. A large part of the group can be classified as educated (higher education level or university degree) with an age between 18 and 39. Table 2 presents the results for the Online Scenario. The data were tested for independence using the chi-square (χ²) test with a significance level of 5% (α=0.05). We checked the calculated chi-square value against the quantile of the chi-square distribution, 3.84, for the significance level α=0.05 and df=1 degree of freedom.

Iterat.  Participants  Objects  F1 Score  χ²-Value
1        8             30       0.61345   16.799
2        9             50       0.65      27.023
3        8             70       0.81319   69.237
4        8             90       0.74534   55.844
5        9             110      0.73143   41.216

Table 2 Results Online Scenario

The table shows the iteration number and the number of participants to whom that iteration number was randomly assigned. The column Objects presents the size of the adaptive user profile; e.g. for the scenario with one iteration the adaptive user profile stored 30 TV objects such as films or series. The fourth column shows the quality in the form of the calculated F1 score. At first the quality increases steadily. The system achieved its maximum quality with a database of 70 objects. With a further increasing data volume the quality decreases.

5.1.2 TV diary
Eight people took part in the TV diary, mostly between 18 and 39 years old and with a high educational level. The number of participants decreased from ten to eight because of incomplete TV diaries. Table 3 presents the results for the TV diary in the same way as the results in section 5.1.1. Due to the limited number of participants and data records we used Fisher's exact test to test the data set for independence. We have no quality data for the 5th iteration because the test result was not significant. In contrast to the Online Scenario the quality increases continuously. The highest value, in iteration four, is similar to the highest quality in the Online Scenario.

Iterat.  Participants  Objects  F1 Score  p-Value
1        2             12.5     0.73913   0.0069
2        2             11       0.75472   0.0083
3        1             32       0.80      0.0325
4        2             56       0.83333   0.0108
5        1             47       -         0.0573

Table 3 Results TV diary

5.2 User Acceptance Results
The following section presents the results of the proposed user acceptance approach. Figure 2 shows the median of each of the 20 items. Negatively formulated questions are marked with a dot on top of the bar.

[Figure 2 Acceptance Items/Median]

Figure 2 shows high results within the questionnaire, which is an indication of general user acceptance. Negatively formulated questions received low values overall, so these results also point in the right direction. For a closer look we refer to Table 1, which includes the question overview. The questions are divided into three sections asking for Quality, Effort and Acceptance, as already mentioned in section 4.2. To analyze the results, we interpret every subarea separately.

Items 1 to 12 concern the quality of the given recommendations. The quality surveyed within the questionnaire is a subjective interpretation by each user rather than an objective measurement. Especially the items 1, 2, 4 and 5 ask about the quality perceived by the user. All four items were rated with a value of at least 4, which is positive feedback. Item 6 asked whether the quality of the recommendations corresponds to the quality of TV recommendations from friends. The middle value of 3 is a little low, but reaching a human level of interpretation was not a stated goal of the system. Item 12 asked about the transparency of the system, which, with a value of 3, is also a little low. However, items 8 to 11 were answered approvingly with a value of 4. On the basis of the given results with agreeing (4) and fully agreeing (5) feedback we can say that a certain quality of the system recommendations is given, which influences the user acceptance positively.

For the subarea of effort we asked the items 13 to 17. Those items concern the effort needed for using the system properly. We got positive feedback for the items 14 to 17, with a value of 4 (agree) for positively and 2 (not agree) for negatively formulated questions. These items concern the creation of the initial profile, which is the only direct interaction between user and system needed to get recommendations presented. Only the time the system needs for creating new recommendations was rated with a neutral value (3). So we can say that the effort for using the system is limited, which is also a positive influence on user acceptance.

Within the subarea of acceptance of recommender systems in general and of the tested system, all questions were answered approvingly (4). These results also point to a positive user acceptance. Taken together, the results indicate that a general user acceptance of the system is given. There are some particular shortcomings in comparison to recommendations by friends. In this respect the system could be extended by a connection to one of the big social networks like Facebook or Twitter. A concrete approach for this was already presented in one of our previous works, cp. [16], but was not considered within the evaluation.

6. DISCUSSION
In the following section the results from the evaluation presented in section 5 are discussed. We proposed two different approaches for evaluating a Bayesian classifier based recommendation system. The approaches differ in the way the data for the generating process was collected. Within the Online Scenario data was collected in a continuous manner. The adaptive profiles within the TV diary were created more qualitatively, from real user behavior. However, both scenarios were designed to examine the quality of the given recommendations and to examine whether there is a connection between an increasing database (adaptive user profile) and the generated TV recommendations. First of all we can say that the system classifies objects with an F1 score between 0.81 and 0.83 with a database of between 56 and 70 already classified objects. This indicates that a sufficiently large database is needed to reach a proper quality. Both result tables show that the quality increases with an increasing number of classified data objects. Table 2, however, shows a decreasing quality at a high number of data objects. This effect has been observed in other application areas in which a Bayesian classifier has been used for classification. In [17] the classification of Arabic text documents is described. In that research, too, a large database causes a decreasing quality. The work of Billsus and Pazzani on classifying web sites also described a decreasing quality at one point of the evaluation [18]. It can be assumed that the effect only occurs within the online scenario. The Set-Top-Box prototype provides a mechanism by which older data objects are deleted. That means the database needs to be updated and should not exceed a certain threshold. A number between 56 and 70 data objects has been determined within our evaluation. The evaluation has also shown that the quality is at a proper level, but can be increased. One approach, proposed in section 5.2, is the integration of social networking to satisfy the user with recommendations made by friends. As mentioned, the approach is already implemented, but not yet evaluated. A resulting increase in quality is thus just a hypothesis. It would also be possible to increase the quality by interpreting attributes such as a series name synonymously. For this it is possible to implement a thesaurus or an ontology.

7. CONCLUSION
In this paper we presented an evaluation of a Bayesian classifier based recommendation system that generates TV recommendations on the basis of user behavior. In addition to two evaluation scenarios, which measure the quality of recommendations, we presented results of a user acceptance evaluation based on the work of [14]. With an F1 score of up to 0.83 the system reached a good quality and fulfills the demanded requirements. The results of the user acceptance survey gave good feedback on the acceptance and the actual use of the system. Even though we achieved a good quality, there is still room for improvement (cp. section 6). Our future work concentrates on the evaluation of the social recommendation approach.

8. REFERENCES
[1] Engel, B., and Ridder, C.-M. 2010. Massenkommunikation 2010. Press Conference of ARD, ZDF, Sep. 2010.
[2] SES Astra. Available at http://www.astra.de/2117/de
[3] Engelbert, B., Blanken, M. B., Kruthoff-Brüwer, R., and Morisse, K. 2011. A user supporting Personal Video Recorder by implementing a generic Bayesian classifier based recommendation system. In Proceedings of 7th IEEE International Workshop on PervasivE Learning, Life and Leisure (Seattle, WA, USA, March 21-25, 2011).
[4] Linden, G., Smith, B., and York, J. 2003. Amazon.com Recommendations. IEEE Press.
[5] Davidson, J., Liebald, B., Liu, J., Nandy, P., and Vleet, T.V. 2010. The YouTube Video Recommendation System. In ACM Conference Proceedings on Recommender Systems (Barcelona, Spain, September 2010).
[6] De Pessemier, T., Deryckere, T., and Martens, L. 2010. Extending the Bayesian Classifier to a Context-Aware Recommender System for Mobile Devices. In Fifth International Conference on Internet and Web Applications and Services (Barcelona, Spain, 2010).
[7] Billsus, D., and Pazzani, M. 1996. Revising User Profiles: The Search for Interesting Web Sites. In Proceedings of 3rd International Workshop on Multistrategy Learning.
[8] Thabtah, F., Eljinini, M. A., Zamzeer, M., and Hadi, W. M. 2009. Naive Bayesian Based on Chi Square to Categorize Arabic Data. Communications of the IBIMA (10).
[9] Das, D., and Herman, t. H. 1998. Recommender Systems for TV. Eindhoven: American Association for Artificial Intelligence.
[10] Gutta, S., Kurapati, K., Lee, K., Martino, J., Milanski, J., Schaffer, D., et al. 2000. TV Content Recommender System. Briarcliff Manor: American Association of Artificial Intelligence.
[11] Ardissono, L., Gena, C., Torasso, P., Bellifemine, F., Chiarotto, A., Difino, A., et al. 2003. Personalized Recommendation of TV Programs. In 8th Advanced in Artificial Intelligence Conference.
[12] Kamal, A., and van Stam, W. 2004. TiVo: Making Show Recommendations Using a Distributed Collaborative Filtering Architecture. ACM Knowledge Discovery and Data Mining. Seattle, Washington: ACM.
[13] Davis, F. D. 1989. Perceived Usefulness, Perceived Ease of Use, and User Acceptance of Information Technology. MIS Quarterly Vol. 13, pp. 319-339.
[14] Jones, N., and Pu, P. 2008. User Acceptance Issues in Music Recommender Systems. EPFL Technical Report. Lausanne.
[15] Hu, R., and Pu, P. 2009. Acceptance Issues of Personality-based Recommender Systems. 3rd ACM Conference on Recommender Systems (pp. 221-224). New York: ACM.
[16] Engelbert, B., Blanken, M., Kruthoff-Brüwer, R., and Morisse, K. 2011. A User Supporting Personal Video Recorder Based on a Generic Bayesian Classifier and Social Network Recommendations. J.J. Park, L.T. Yang, C. Lee (Eds.): Future Tech 2011, Proceedings, Part II. Communications in Computer and Information Science, Vol 185, pp. 1-8.
[17] Thabtah, F., Eljinini, M. A., Zamzeer, M., and Hadi, W. M. 2009. Naive Bayesian Based on Chi Square to Categorize Arabic Data. Communications of the IBIMA (10).
[18] Billsus, D., and Pazzani, M. 1996. Revising User Profiles: The Search for Interesting Web Sites. Proceedings of 3rd International Workshop on Multistrategy Learning.
[19] Hu, Y., Koren, Y., and Volinsky, C. 2008. Collaborative Filtering for Implicit Feedback Datasets. Proceedings of the 2008 8th IEEE International Conference on Data Mining. Washington, D.C., USA.
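The two-class decision model of section 3 and the F1 score of section 4.1.1 can be illustrated with a minimal Python sketch. It is not the implementation described in [3]: the independence assumption between attributes, the add-one smoothing, the attribute names, and the names train, classify, f1_score and EXAMPLE_PROFILE are assumptions made purely for illustration.

```python
import math
from collections import Counter

CLASSES = ("like", "dislike")

# Hypothetical toy profile: (attributes, class) pairs, loosely modelled on
# the EPG-derived DomainObject attributes. Not data from the evaluation.
EXAMPLE_PROFILE = [
    ({"genre": "Comedy", "channel": "ARD"}, "like"),
    ({"genre": "Comedy", "channel": "ZDF"}, "like"),
    ({"genre": "News", "channel": "ARD"}, "dislike"),
    ({"genre": "News", "channel": "RTL"}, "dislike"),
]


def train(profile):
    """Count how often each class and each (attribute, value) pair occurs."""
    class_counts = Counter()
    attr_counts = {c: Counter() for c in CLASSES}
    for attributes, label in profile:
        class_counts[label] += 1
        for key, value in attributes.items():
            attr_counts[label][(key, value)] += 1
    return class_counts, attr_counts


def classify(attributes, class_counts, attr_counts):
    """Return the class with the higher (log) posterior score."""
    total = sum(class_counts.values())
    scores = {}
    for c in CLASSES:
        # Prior with add-one smoothing (assumption, avoids log(0)).
        score = math.log((class_counts[c] + 1) / (total + len(CLASSES)))
        for key, value in attributes.items():
            # Smoothed estimate of P(attribute value present | class).
            seen = attr_counts[c][(key, value)]
            score += math.log((seen + 1) / (class_counts[c] + 2))
        scores[c] = score
    return max(scores, key=scores.get)


def f1_score(predicted, actual, positive="like"):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With the toy profile above, classify({"genre": "Comedy", "channel": "RTL"}, *train(EXAMPLE_PROFILE)) favors the class like, because the genre evidence outweighs the channel evidence. Comparing the system's classes with the user's classes through f1_score corresponds to the final classification step of both scenarios.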