Genre Prediction to Inform the Recommendation Process

Nevena Dragovic
Computer Science Department
Boise State University
Boise, ID, USA
nevenadragovic@u.boisestate.edu

Maria Soledad Pera
Computer Science Department
Boise State University
Boise, ID, USA
solepera@boisestate.edu

ABSTRACT
In this paper we present a time-based genre prediction strategy that can inform the book recommendation process. To explicitly consider time in predicting genres of interest, we rely on a popular time series forecasting model as well as the reading patterns of each individual reader or group of readers (in the case of libraries or publishing companies). Based on an initial assessment using the Amazon dataset, we demonstrate that our strategy outperforms its baseline counterpart.

CCS Concepts
•Mathematics of computing → Time series analysis;
•Information systems → Recommender systems;

Keywords
Prediction; Genre; Books; Time Sequence; ARIMA

Copyright held by the author(s).

1. INTRODUCTION
Books, which constitute a billion-dollar industry¹, are the most popular reading material among all generations of readers, both for leisure and educational purposes. Hundreds of thousands of books of different types (e.g., paperback and e-books) and styles (e.g., fiction and non-fiction) are published on a yearly basis, giving readers a variety of options to choose from. Book recommendation systems, which are meant to enhance the decision-making process, can help users by identifying, among the sometimes overwhelming number of diverse books, the ones that best suit their interests and preferences. These recommenders are not exclusively designed to aid individuals in their quest for reading materials. They can also improve the decision-making process for libraries, by suggesting what books to buy in order to maximize the use of library resources by their patrons, and for publishing companies, by advising which books to publish in order to maximize revenue. To better serve stakeholders, recommenders must be able to predict interests and needs. However, given that preferences may change over time for different readers, the time component is crucial to consider in the prediction process [5].

There are many avenues that can be explored from a time-sensitive standpoint in order to generate better recommendations, including user-generated ratings, reviews, or book metadata. One of them, which is often overlooked and is the focus of our paper, is genre. By definition, genre (e.g., drama, comedy) is a category of literary composition, determined by literary technique, tone, content, or even length. While genre has been studied as a part of the recommendation process [4], the influence of its distribution over time on suggesting suitable books for individual users or groups of users (as in the case of libraries and publishing companies) has not been explored. Change of genre over time is a significant dimension for improving the genre prediction process. This can consequently influence the process performed by book recommenders, since it provides the likelihood of a reader's interest in each genre based on its occurrences at a specific point in time, not only for the most recent or the most frequently read genre. As an answer to this need, we propose a genre prediction strategy that examines genre distribution over time and applies time series analysis models. The goal of our strategy is to discover different areas of users’ interests, not only the most dominant ones.

From the users’ point of view, explicitly including time-based analysis to inform the recommendation process will lead to relevant suggestions that satisfy their specific reading needs. Finally, from the commercial point of view, the benefit would be in understanding the influence of reading patterns on decisions about what genre should be published or acquired at a given point in time.

2. RELATED WORK
A considerable number of studies examine the importance of book genre on readers’ activity [1, 2]. However, to the best of our knowledge, research based on past genre distribution coupled with time series analysis to influence the recommendation process has not been conducted. As presented in [4], genre is used as a data point to inform a cross-domain collaborative filtering approach that recommends books based on users’ genre preferences. You et al. [6] propose a clustering method based on users’ ratings and genre interests extracted from social networks to solve the cold-start problem affecting collaborative filtering approaches. Unlike these methods, we use time series to predict the genres most likely to be of current interest to each individual user, which can further enhance the book recommendation process.

3. METHOD
Predicting genre to inform the recommendation process, regardless of the major stakeholder (a reader, library, or publishing company), involves examining genres considered in the past, either read by specific users or purchased by customers. While a simple genre distribution analysis yields probabilities or weights that determine the most favored genres, it lacks the ability to consider the evolution of genre preference over time. To overcome this drawback, we propose a time-based genre examination, which requires information on reading activities among readers. As a first step of our proposed strategy, we explore the reading activity of a user to obtain the distribution of his/her genre interest during consecutive periods of time. In every period of time, for each genre we calculate a significance score that captures its importance by considering the number of books of that genre read in that period of time. Thereafter, to explicitly consider the change of genre preference distribution over time, our genre prediction strategy takes advantage of Auto-Regressive Integrated Moving Average² (ARIMA). We selected ARIMA since it is one of the most popular models that use time series for prediction purposes.

By using ARIMA we are able to determine a model tailored to each genre distribution to predict its importance for the corresponding user in real time based on its previous occurrences. Note that each predicted genre importance score is based on its occurrences in the past, the specific time when it occurred, and its importance for a specific user. To define the length of the time periods used by ARIMA, we relied on a recent study by Pew³ on reading habits in the USA: we establish one-month-long “windows” of time in which each user is expected to read at least one book, so our strategy uses one-month time frames from the first book log (either a bookmarked or a rated book) to the last.

4. INITIAL EVALUATION
Framework. To validate the performance of our proposed time-based genre prediction strategy, we selected a subset of the Amazon/LibraryThing⁴ book dataset. Since the dataset does not always include genre as a part of the provided metadata, we extended it by including genre information from the Library of Congress⁵. We used 1214 users⁶ along with the books they rated or reviewed. To quantify the assessment, we applied Mean Absolute Error (MAE), Accuracy, and Kullback-Leibler (KL) divergence [3]. MAE estimates the difference between the predicted genre importance and the ground truth, i.e., the genre distribution for a user at a given time, whereas Accuracy applies a binary strategy that reflects whether the predicted genres correspond to the ones read by a user in a given period of time. KL divergence measures how well a distribution q generated by a prediction strategy approximates distribution p, the ground truth. In establishing the ground truth for each user considered for evaluation purposes, we adopted the well-known N-1 strategy, such that the genres of the books rated by a given user U in the N-th time frame are treated as “relevant” genres for U, and the genres of the books rated in the previous N-1 windows are used for training U’s genre prediction model. As a baseline for our initial assessment, we use a traditional prediction strategy that considers the proportion of occurrences of each genre based on data collected over N-1 periods of time to estimate the importance of each genre for a given user in the current, i.e., N-th, time period.

Table 1: Evaluation using the Amazon dataset

                                  MAE     KL      ACC
With Time Series                  0.143   0.623   0.870
Without Time Series               0.144   0.663   0.826
With Time Series (3+ genre)       0.138   0.660   0.857
Without Time Series (3+ genre)    0.146   0.720   0.810

Results. As shown in Table 1, for N=11⁷, our time-based strategy outperforms the baseline. KL divergence scores show that the genre distribution predicted using the time-series approach better approximates the ground truth. Furthermore, the probability of occurrence of each considered genre is closer to the real values when the time component is included in the prediction process. As a further assessment, we observed the differences in genre predictions among users who read different numbers of distinct genres. For users who read only one or two genres, the time-based prediction strategy does not perform better than the baseline. However, if a user reads three or more genres, our time-based genre prediction strategy outperforms the baseline in all three metrics. This is not surprising, given that it is not hard to determine the area(s) of interest for a user who constantly reads only one or two book genres, which is why the baseline performs as well as the time-based prediction strategy. Given that users who read 3 or more genres represent 91% of the users in our sampled dataset, the proposed strategy provides significant improvements in predicting preferred genre for the vast majority of readers.

5. CONCLUSIONS
In this paper, we described our efforts in developing a time-based genre prediction strategy that can better inform the recommendation process. The novelty of our approach consists of incorporating an explicit time component to generate genre distribution. To the best of our knowledge, this is the first time that the well-known time series ARIMA model is used to predict the book genres of interest to readers. The described strategy provides successful predictions and outperforms the baseline for 77% of users based on the presented initial evaluation, while for the remaining users it provides predictions comparable to the baseline. Because of the scope of this paper, the conducted evaluation showcases the genre prediction performance for a single user, while we still need to conduct further assessments to quantitatively determine the degree to which the proposed strategy (i) provides successful genre predictions for libraries and publishing companies and (ii) influences the recommendation process to assist all three stakeholders.

¹ http://goo.gl/GMn8Nc
² http://goo.gl/Dhzcg7
³ http://goo.gl/BAUQK4
⁴ http://goo.gl/drH0yF
⁵ https://www.loc.gov/
⁶ In our initial assessment, we considered Amazon users who provided ratings for at least 35 books.
⁷ We empirically verified that for 6

6. REFERENCES
[1] P. Afflerbach. The influence of prior knowledge and text genre on readers’ prediction strategies. Journal of Literacy Research, 22(2):131–148, 1990.
[2] J. Anderson, A. Anderson, J. Lynch, and J. Shapiro. Examining the effects of gender and genre on interactions in shared book reading. Literacy Research and Instruction, 43(4):1–20, 2004.
[3] C. D. Manning and H. Schütze. Foundations of statistical natural language processing, volume 999. MIT Press, 1999.
[4] S. Sarawagi and S. H. Nagaralu. Data mining models as services on the internet. ACM SIGKDD Explorations Newsletter, 2(1):24–28, 2000.
[5] J. Wang and Y. Zhang. Opportunity model for e-commerce recommendation: right product; right time. In ACM SIGIR, pages 303–312, 2013.
[6] T. You, A. N. Rosli, I. Ha, and G.-S. Jo. Clustering method based on genre interest for cold-start problem in movie recommendation. JIIS, 19(1):57–77, 2013.
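Illustrative sketch. The evaluation protocol of Section 4 (N-1 training windows, a proportion-based baseline, and the MAE, KL divergence, and Accuracy metrics) can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in the paper: a reading log is a list of (window index, genre) pairs, the significance score of a genre in a window is taken to be its share of the books logged in that window, KL divergence is smoothed with a small constant to avoid division by zero, and a genre counts as "predicted" for Accuracy when its score is above zero. All function names are hypothetical, and the ARIMA forecasting step itself is not reproduced here.

```python
import math
from collections import Counter


def window_distribution(log, window, genres):
    """Genre shares within one window (assumed form of the significance score)."""
    counts = Counter(g for w, g in log if w == window)
    total = sum(counts.values())
    return {g: (counts[g] / total if total else 0.0) for g in genres}


def baseline_prediction(log, n, genres):
    """Baseline: proportion of occurrences of each genre over windows 1..n-1."""
    counts = Counter(g for w, g in log if w < n)
    total = sum(counts.values())
    return {g: (counts[g] / total if total else 0.0) for g in genres}


def mae(pred, truth):
    """Mean absolute error between predicted and true genre importance."""
    return sum(abs(pred[g] - truth[g]) for g in truth) / len(truth)


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q), with p the ground truth and q the prediction.

    eps is an assumed smoothing constant; both distributions are
    renormalized after smoothing so that they still sum to one.
    """
    ps = {g: p[g] + eps for g in p}
    qs = {g: q[g] + eps for g in p}
    zp, zq = sum(ps.values()), sum(qs.values())
    return sum((ps[g] / zp) * math.log((ps[g] / zp) / (qs[g] / zq)) for g in p)


def accuracy(pred, truth, threshold=0.0):
    """Share of genres whose predicted read/not-read label matches the truth.

    The threshold is an assumption; the paper only says Accuracy is binary.
    """
    hits = sum((pred[g] > threshold) == (truth[g] > 0) for g in truth)
    return hits / len(truth)


if __name__ == "__main__":
    # Toy log for one user: (window index, genre) pairs, with N = 3.
    log = [(1, "drama"), (1, "drama"), (2, "comedy"),
           (2, "drama"), (3, "comedy"), (3, "comedy")]
    genres = ["drama", "comedy", "mystery"]
    truth = window_distribution(log, 3, genres)   # ground truth: window N
    pred = baseline_prediction(log, 3, genres)    # trained on windows 1..N-1
    print(mae(pred, truth), kl_divergence(truth, pred), accuracy(pred, truth))
```

In this toy log the user shifts from drama to comedy, so the proportion-based baseline still favors drama in window N. This is the kind of preference drift that a per-genre time series forecast is meant to capture; replacing `baseline_prediction` with a forecast fitted to each genre's sequence of window scores would let the two strategies be compared on the same metrics.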