A Cross-Cultural Analysis of Explanations for Product Reviews

John O’Donovan
Dept. of Computer Science, University of California, Santa Barbara, CA, USA
jod@cs.ucsb.edu

Shinsuke Nakajima
Faculty of Computer Science and Engineering, Kyoto Sangyo University, Kyoto, Japan
nakajima@cse.kyoto-su.ac.jp

Tobias Höllerer
Dept. of Computer Science, University of California, Santa Barbara, CA, USA
holl@cs.ucsb.edu

Mayumi Ueda
Faculty of Economics, University of Marketing and Distribution Sciences, Kobe, Japan
Mayumi Ueda@red.umds.ac.jp

Yuuki Matsunami
Faculty of Computer Science and Engineering, Kyoto Sangyo University
g1245108@cc.kyoto-su.ac.jp

Byungkyu Kang
Dept. of Computer Science, University of California, Santa Barbara, CA, USA
bkang@cs.ucsb.edu

ABSTRACT
Cosmetic products are inherently personal. Many people rely on product reviews when choosing to purchase cosmetics. However, reviewers can have tastes that vary based on personal, demographic or cultural background. Prior work has discussed methods for generating attribute-based explanations for item ratings on cosmetic products, based on associated text-based reviews. This paper focuses on evaluating explanation interfaces for product reviews and related attributes. We present the results of a cross-cultural user study that evaluates five associated explanation interfaces for cosmetic product reviews across groups of participants from three different cultural backgrounds. We applied a 3 by 2 within-subjects experimental design in a user study (N=150) to evaluate effects of UI design and personalization on a range of user experience metrics in a cosmetics shopping scenario. Results of the study show that 1) Korean and Japanese speakers chose the most complex UI more often than English speakers; 2) older participants also preferred more options in cosmetic product selection, regardless of cultural background; 3) personalization of product ratings did not show an effect on user experience; 4) attribute-based explanations were preferred over star ratings for all three cultures; 5) rating propensity evaluation showed that Japanese participants provided significantly higher ratings than Korean or English participants, and that females provided higher ratings than males, regardless of background.

CCS Concepts
•Human-centered computing → HCI design and evaluation methods; User models; User studies;

Keywords
User Experience, Explanation, Decision Making, User-Centric Evaluation

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
IntRS 2016, September 16, 2016, Boston, MA, USA.
Copyright remains with the authors and/or original copyright holders, 2016.

1 Introduction
Over the last 25 years, recommender systems have attempted to help users find the right information at the right time [15]. More recently, the proliferation of e-commerce applications supports buying and selling products in the global market with relatively little effort. Increasingly, consumers are relying on customer reviews to inform purchasing decisions [8]. In many cases, product reviews are presented in summary form via mechanisms such as star ratings. Such representations, however, typically fail to capture the subtle opinions that exist in the accompanying text-based reviews. In this paper, we build on recent work that automatically extracts attributes and associated ratings from online product reviews [10]. In particular, we focus on understanding how visual representations of various types of extracted item ratings impact user experience and conversion likelihoods in an e-commerce setting, as exemplified in Figure 1. Motivated by recent research that shows the importance of user experience over traditional accuracy metrics in recommender systems [7], we conduct a user experiment to understand how rating display affects user experience. Specifically, we applied a 3 by 2 within-subjects design (Table 1) in an online study (N=150) to evaluate effects of UI design and personalization on user experience metrics in a cosmetics shopping scenario, considering the following research questions:

R1: Do cross-cultural preference differences exist for recommendation interfaces? If so, what are the key predictors of these differences?

R2: Are there cross-cultural preference differences for personalized vs. non-personalized recommender system interfaces?

R3: Are there cross-cultural preference differences between traditional (star-rating) and more granular attribute-based recommender system interfaces?

R4: Are there differences in rating propensities across the three cultures? If so, what are the strongest predictors of observed rating shifts?

The cosmetics domain was used for this study, since cosmetics are sold globally and are inherently personal in nature. To explore variances in opinions on the explanation interfaces across different cultural backgrounds, participant groups were sourced from American, Japanese and Korean cultural backgrounds. These particular groups were selected as a representative sample with diverse cultures, and because they are among the fastest growing markets for cosmetics.¹

¹ http://polishcosmetics.pl/Korean-Market-Analysis.pdf

2 Related Work
In this study, we focus on explanations and transparency of recommender systems and on the associated role of product attributes mined from product reviews. Here, we discuss related work in these areas.

Product Attributes To understand consumer behavior in economics, research has focused on the different attributes and uncertainties that consumers consider when purchasing a product [8, 13]. For buyers, these attributes play important roles when deciding to purchase a product. More importantly, attributes vary widely across product types and users’ personal tastes. For example, [3] study the effects of search attributes and provide a comparison between traditional and online supermarkets. A recent study on description and performance uncertainty [4] focused on the difficulty in assessing the product’s characteristics. Building on works such as [13] that show advantages of using fine-grained product attributes in the recommendation process, we aim to further our understanding of the role of fine-grained product attribute ratings in consumer decisions.

Explanation and Transparency in Recommendation Within the recommender systems research community, there is an increasing understanding of the need for user-centered evaluations [12]. Recent keynote talks [2] and workshops [14] have helped to highlight the importance of this topic. In this paper, we follow Knijnenburg et al.’s [9] argument for a framework that takes a user-centric approach to recommender system evaluation, beyond the scope of recommendation accuracy. In contrast to that work, however, we argue that decision quality is an important evaluation metric that goes beyond the user experience metrics described in [9], and further, that it can be used to explain observed usage patterns for search and recommendation tools. Garcia-Molina [6] described differences and similarities between search and recommendation, and argued that interactive interfaces can help users understand and use these tools in more efficient ways. In the same vein, it has also been recognized that many recommender systems function as black boxes, providing no transparency into the working of the recommendation process, nor offering any additional information to accompany the recommendations beyond the recommendations themselves [7]. To address this issue, static or interactive/conversational explanations can be given to improve the transparency and control of recommender systems. Research on textual explanations in recommender systems to date has been evaluated in a wide range of domains (varying from movies to financial advice [5]). From a cross-cultural perspective, Chen and Pu performed a related study that evaluated perceptions of different recommendation interfaces in [1], using subjects from Chinese and Swiss backgrounds. In contrast to their study, which compared a novel UI against a list view and assessed user experience metrics, we focus on the perception of attribute ratings versus traditional, less fine-grained ratings, and on the impact of personalization on these perceptions. A second contrast to [1] is that our work explores rating propensity across the different groups.

3 Mining Attribute Ratings
This study builds on recent work [10] on attribute extraction from online product reviews. Specifically, we posit that richer explanations of a given product in the form of multiple attributes with corresponding scores (on five-star rating scales; see Figure 1) can provide benefits to potential customers. In the prototype of the proposed recommender system, both personalized information (Simgroup ratings: “Users similar to you rate this item as”) and multiple product attributes extracted from a review text are added as features. Through an online user study, we apply both novel approaches as controlled variables to the prototype design and investigate users’ preferences for such features across demographic backgrounds, particularly cultural backgrounds (English, Korean and Japanese).

4 Interface Design

[Figure 1 image: screenshots of the review page for a sample product ("Good Moisturizing Lotion") in Korean (KO), Japanese (JP) and English (EN). Annotated elements: A, product review text; B to E, mean rating; C and E, simgroup rating ("Users similar to you rate this item as"); D and E, attribute weights ("This reviewer rate this item as": Moisturizing, Tightening skin, Anti-aging, Cost, Organic, Brand, Scent).]
Figure 1: Screenshots of the interface used in the online user study. The annotations A-E show the items that varied in each condition, as shown in Table 1.

We designed a novel user interface for product review pages based on the feedback we received from a preliminary user study (N=100). We performed the study with a simple design layout to test the different visual conditions outlined in Table 1. Participants gave feedback on their preference for each UI in a virtual shopping scenario. They were also required to leave a comment on the interface design. For example, they reported the benefit of the new features, such as “I like the level of detail it has related to the product”, and suggested preferred features, such as “More alive colors” / “More explanations and ratings”. The collection of 100 comments was manually assessed, and improvements were made to the UI, including shortened review text with a “read more” button and a breakdown of multiple attributes extracted from the review text, shown on stars. The revised design is shown in Figure 1.

5 Experimental Setup
Figure 1 shows an example of the refactored interface for a sample product review. To test our hypotheses above, a 3x2 within-subjects experiment was conducted, controlling for personalization and rating type, as shown in Table 1. The study (N=150) was performed on the crowdsourcing platform Amazon Mechanical Turk. Each participant was shown a randomly ordered set of 5 different design layouts corresponding to the treatments in Table 1, and was asked to rank them in order of preference. Participants were also asked to rate the helpfulness of each. Participants were evenly balanced across cultural backgrounds. The content was shown in their primary language based on their cultural background. Overall, participants took between 5 and 10 minutes to complete the study, and were paid $1.50 for their time. Questions were added to test for user attention level and for language proficiency, including identification of differences between UIs and simple math questions written in the appropriate language. After filtering our data based on these metrics, group sizes were 39, 25 and 12 for English, Japanese and Korean, respectively. Participant age ranged from 18 to 64 with an average of 26. Gender groups were not evenly distributed, as expected for the cosmetics domain, with 70% female and 30% male.

Table 1: Overview of the controlled variables for the online user study.
UI Config | non-personalized (no information from similar users) | personalized (with social data from similar users)
review text only | A: product review text | -
review text with star rating | B: A + mean rating on stars | C: A + mean rating and the rating from active user’s simgroup on stars
review text, star rating and attributes | D: B + attribute weights computed from current review text (on stars) | E: C + attribute weights computed from current review text (on stars)

6 Results

[Figure 2 image]
Figure 2: Preferred User Interface by Culture.

[Figure 3 image]
Figure 3: Preferred User Interface by Age.

Perception and Rating Differences Figures 2 and 3 show the results for the UI ranking task, broken down by culture and by age, respectively. The results show a clear preference for design E in all groups, but there is a significant increase in that preference for participants over 40 (shown on the right side of Figure 3). This effect was also seen from 100 participants in the preliminary study. Interface E, shown in Figure 1, shows the most information, and allows users to understand how users similar to them rate individual product attributes. This effect might be a result of specific preferences for cosmetics developing with age, and accordingly, an increased need to explore user ratings on fine-grained product attributes (see Figure 3).

[Figure 4 image: grouped bar chart. x-axis: Text Only (A), NP-Star (B), P-Star (C), NP-Attr (D), P-Attr (E); series: English, Japanese, Korean.]
Figure 4: Cross-cultural perspective of helpfulness of the five evaluated interfaces.

Personalization and Rating Type Figure 4 shows the results of perceived usefulness of the interfaces, broken down by cultural groupings. Each UI condition is shown as a group on the x-axis, and each group contains the mean utility score for the three cultural groups. The x-axis groups (UI treatments) are also ranked from left to right based on number of visible features (UI complexity). This graph shows several interesting effects. First, there is a general preference across all groups for the attribute-based representations (groups D and E, on the right side) over less granular star-rating or text-based UIs. This is a promising result that indicates that attribute extraction and visualization has a positive effect on user experience. The second interesting result is that within the star-rating groups (2nd and 3rd) and the attribute-rating groups (4th and 5th) there is no notable difference between the personalized and non-personalized treatments. This result tells us that the granularity of presented ratings has a more positive impact on user experience than the perception that the ratings come from similar users. To investigate this result in more depth, a follow-up experiment is planned with a large corpus of product reviews collected from Amazon.com [11]² to compute actual similarity scores based on user profiles. This should give better insight into the observed effect. Figure 4 also answers R2, in that there are no significant differences between the cultural groups within each UI treatment, although the Japanese group showed a trend towards favoring the more complex UI treatments.

² http://jmcauley.ucsd.edu/data/amazon/

[Figure 5 image: three panels (frown, smile, neutral expression), mean rating on a 5-point Likert scale by gender; Female n=53, Male n=23.]
Figure 5: Difference in rating propensity by gender.

[Figure 6 image: three panels (frown, smile, neutral expression), mean rating on a 5-point Likert scale by culture; En n=39, Jp n=25, Ko n=12.]
Figure 6: Difference in rating propensity by culture.

Rating Propensity For different users, rating an item with a specific number of stars can have very different meanings. User ratings on items serve as the basis for most collaborative recommendation techniques, but these techniques tend to ignore such differences when computing neighborhoods for recommendation. Further, little work has been done to understand cross-cultural differences in rating propensity. Since these participant groupings were available in our experimental setup, a logical step was to evaluate rating propensities within each of the cultural groups, to serve as both an independent result and as a weighting factor for the analysis in Figure 4. Each participant was shown three randomly ordered faces with happy, neutral and sad expressions. They were asked to rate the ‘happiness’ perceived in each on a five-point Likert scale. Figure 5 shows the results by gender (for all groups). Interestingly, there is a trend for females to rate higher than males, and the difference becomes more pronounced for the ‘happy’ expression, shown on the rightmost plot of Figure 5, with a mean difference of 0.7 (relative increase of 16%, p<0.005). Figure 6 shows the results of the rating propensity analysis broken down by cultural group. Again, the graphs represent mean rating for sad, neutral and happy expression ratings from left to right, respectively. Here, we see a clear trend for higher ratings in the Japanese group across all three expressions. While this is only a small-scale initial study, we believe that this is an important result for the study of recommender system performance across different cultures in general, and a follow-up study on propensity of ratings for recommender systems is planned to investigate this further.

7 Discussion and Future Work
This study applied a 3 by 2 within-subjects experimental design in a user study (N=150) to evaluate effects of UI design and personalization on a range of user experience metrics in a cosmetics shopping scenario using participant groups from three different cultural backgrounds. Results of the study show that 1) Korean and Japanese speakers chose the most complex UI more often than English speakers; 2) older participants also preferred more options in cosmetic product selection, regardless of cultural background; 3) personalization of product ratings did not show an effect on user experience; 4) attribute-based explanations were preferred over star ratings for all three cultures; 5) rating propensity evaluation showed that Japanese participants gave significantly higher ratings than Korean or English participants, and that females provided higher ratings than males, regardless of background. A clear next step is to evaluate on real product data. The authors plan a follow-up study to compare LDA and dictionary-based approaches to product attribute extraction, and to explore how the resulting attributes can improve explanations and user profiles for collaborative filtering. Additionally, a more detailed evaluation of the different rating propensities across cultures is underway using a larger number of participants and multiple product domains.

8 References
[1] L. Chen and P. Pu. A cross-cultural user evaluation of product recommender interfaces. In Proceedings of the 2008 ACM Conference on Recommender Systems, RecSys ’08, pages 75–82, New York, NY, USA, 2008. ACM.
[2] E. H. Chi. Blurring of the boundary between interactive search and recommendation. In Proceedings of the 20th International Conference on Intelligent User Interfaces, pages 2–2. ACM, 2015.
[3] A. M. Degeratu, A. Rangaswamy, and J. Wu. Consumer choice behavior in online and traditional supermarkets: The effects of brand name, price, and other search attributes. International Journal of Research in Marketing, 17(1):55–78, 2000.
[4] A. Dimoka, Y. Hong, and P. A. Pavlou. On product uncertainty in online markets: Theory and evidence. MIS Quarterly, 36, 2012.
[5] A. Felfernig, E. Teppan, and B. Gula. Knowledge-based recommender technologies for marketing and sales. Int. J. Patt. Recogn. Artif. Intell., 21:333–355, 2007.
[6] H. Garcia-Molina. Thoughts on the future of recommender systems. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 1–2. ACM, 2014.
[7] J. L. Herlocker, J. A. Konstan, and J. Riedl. Explaining collaborative filtering recommendations. In ACM Conference on Computer Supported Cooperative Work, pages 241–250, 2000.
[8] Y. Kim and R. Krishnan. On product-level uncertainty and online purchase behavior: An empirical analysis. Management Science, 61(10):2449–2467, 2015.
[9] B. P. Knijnenburg, M. C. Willemsen, Z. Gantner, H. Soncu, and C. Newell. Explaining the user experience of recommender systems. User Modeling and User-Adapted Interaction, 22(4-5):441–504, 2012.
[10] Y. Matsunami, M. Ueda, S. Nakajima, T. Hashikami, S. Iwasaki, J. O’Donovan, and B. Kang. A method for automatic scoring of various aspects of cosmetic item review texts based on evaluation expression dictionary. In Proceedings of the 24th International MultiConference of Engineers and Computer Scientists, IMECS ’16, pages 392–397. IAENG, 2016.
[11] J. McAuley and A. Yang. Addressing Complex and Subjective Product-Related Queries with Customer Reviews. ArXiv e-prints, Dec. 2015.
[12] S. M. McNee, J. Riedl, and J. A. Konstan. Being accurate is not enough: How accuracy metrics have hurt recommender systems. In Extended Abstracts of the 2006 ACM Conference on Human Factors in Computing Systems (CHI 2006), 2006.
[13] J. O’Donovan, B. Smyth, V. Evrim, and D. McLeod. Extracting and visualizing trust relationships from online auction feedback comments. In IJCAI, pages 2826–2831, 2007.
[14] J. O’Donovan, N. Tintarev, A. Felfernig, P. Brusilovsky, G. Semeraro, and P. Lops. Joint workshop on interfaces and human decision making for recommender systems (IntRS). In H. Werthner, M. Zanker, J. Golbeck, and G. Semeraro, editors, RecSys, pages 347–348. ACM, 2015.
[15] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of ACM CSCW’94 Conference on Computer-Supported Cooperative Work, pages 175–186, 1994.