Genre Prediction to Inform the Recommendation Process

                                   Nevena Dragovic                              Maria Soledad Pera
                        Computer Science Department                        Computer Science Department
                           Boise State University                             Boise State University
                              Boise, ID, USA                                     Boise, ID, USA
                nevenadragovic@u.boisestate.edu                             solepera@boisestate.edu


ABSTRACT                                                           time component is crucial to consider in the
In this paper we present a time-based genre prediction strat-      prediction process [5].
egy that can inform the book recommendation process. To            There are many avenues that can be explored from a time-
explicitly consider time in predicting genres of interest, we      sensitive standpoint in order to generate better recommen-
rely on a popular time series forecasting model as well as         dations, including user-generated ratings, reviews or book
reading patterns of each individual reader or group of readers     metadata. One of them, which is often overlooked and is the
(in case of libraries or publishing companies). Based on a         focus of our paper, is genre. By its definition, genre (e.g.,
conducted initial assessment using the Amazon dataset, we          drama, comedy) is a category of literary composition, deter-
demonstrate our strategy outperforms its baseline counter-         mined by literary technique, tone, content, or even length.
part.                                                              While genre has been studied as a part of the recommenda-
                                                                   tion process [4], the influence of its distribution over time
                                                                   on suggesting suitable books for individual users or groups
CCS Concepts                                                       (as in the case of libraries and publishing companies) has
                                                                   not been explored. Change of genre in time is a significant
•Mathematics of computing → Time series analysis;
                                                                   dimension to improve the genre prediction process. This can
•Information systems → Recommender systems;
                                                                   consequently influence the process performed by book
                                                                   recommenders since it provides the likelihood of reader(s)
Keywords                                                           interest in each genre based on its occurrences at a specific
                                                                   point of time, not only the most recent or the most frequently
Prediction; Genre; Books; Time Sequence; ARIMA                     read one. As an answer to this need, we propose a genre
                                                                   prediction strategy that examines genre distribution over
1.     INTRODUCTION                                                time and applies time series analysis models. The goal of
   Books, which constitute a billion-dollar industry1, are the     our strategy is to discover different areas of users’ interests,
most popular reading material among all generations of read-       not only the most dominant ones.
ers, both for leisure and educational purposes. Hundreds of        From the users’ point of view, explicitly including time-based
thousands of books of different types (e.g., paperback and e-      analysis to inform the recommendation process will lead to
books) and styles (e.g., fiction and non-fiction) are published    relevant suggestions that satisfy their specific reading needs.
on a yearly basis, giving readers a variety of options to choose   Finally, from the commercial point of view, the benefit would
from. Book recommendation systems, which are meant to              be in understanding the influence of reading patterns on
enhance the decision making process, can help users by identi-     decisions about what genres should be published or acquired
fying, among the sometimes overwhelming number of diverse          at a given point in time.
books, the ones that best suit their interests and preferences.
These recommenders are not exclusively designed to aid in-         2.   RELATED WORK
dividuals in their quest for reading materials. They can also         A considerable number of studies examine the importance
improve the decision making process for libraries, by sug-         of book genre on readers’ activity [1, 2]. However, to the best
gesting what books to buy in order to maximize the use of          of our knowledge, research based on past genre distribution
library resources by their patrons, and publishing companies,      coupled with time series analysis to influence the recommen-
by advising which books to publish in order to maximize            dation process has not been conducted. As presented in [4],
revenue. To better serve stakeholders, recommenders must           genre is used as a data point to inform a cross-domain collab-
be able to predict interest and needs. However, given that         orative filtering approach that recommends books based on
preferences may alter over time for different readers, the         users’ genre preferences. You et al. [6] propose a clustering
                                                                   method based on users’ ratings and genre interests extracted
1                                                                  from social networks to solve the cold-start problem affect-
    http://goo.gl/GMn8Nc
                                                                   ing collaborative filtering approaches. Unlike the proposed
                                                                   methods, we use time series to predict the genres most likely
                                                                   currently of interest to each individual user, which can further
                                                                   enhance the book recommendation process.

                                                                   3.   METHOD
                                                                     Predicting genre to inform the recommendation process,
                                                                   regardless of the major stakeholder (a reader, library or
                                                                   publishing company), involves examining genres considered
Copyright held by the author(s).                                   in the past, either read by specific users or purchased by
customers. While a simple genre distribution analysis yields
probabilities or weights that determine the most favored            Table 1: Evaluation using the Amazon dataset
genres, it lacks the ability to consider genre preference evo-     Strategy                         MAE    KL     ACC
lution over time. To overcome this drawback, we propose a          With Time Series                 0.143  0.623  0.870
time-based genre examination which requires information on         Without Time Series              0.144  0.663  0.826
reading activities among readers. As a first step to our pro-      With Time Series (3+ genres)     0.138  0.660  0.857
posed strategy, we explore reading activity of a user to obtain    Without Time Series (3+ genres)  0.146  0.720  0.810
the distribution of his/her genre interest during continuous
periods of time. In every period of time, for each genre we
calculate a significance score that captures its importance       imates to the ground truth. Furthermore, the probability
by considering the number of books of that genre read in that     of occurrence of each considered genre is closer to the real
period of time. Thereafter, to explicitly consider the change     values when the time component is included in the prediction
of genre preference distribution over time, our genre predic-     process. As a further assessment, we observed the differences
tion strategy takes advantage of Auto-Regressive Integrated       in genre predictions among users who read different numbers
Moving Average2 (ARIMA). We selected ARIMA since it               of distinct genres. For users who read only one to two genres,
is one of the most popular models that uses time series for       the time-based prediction strategy does not perform better
prediction purposes.                                              than the baseline. However, if a user reads three or more
By using ARIMA we are able to determine a model tailored          genres, our time-based genre prediction strategy outperforms
to each genre distribution to predict its importance for the      the baseline in all three metrics. This is not surprising,
corresponding user in real time based on its previous occur-      given that it is not hard to determine area(s) of interest for
rences. Note that each predicted genre importance score is        a user who constantly reads only one or two book genres,
based on its occurrences in the past, the specific time when      which is why the baseline performs as well as the time-based
it occurred, and its importance for a specific user. To define    prediction strategy. Given that users who read three or more
the length of the time periods used by ARIMA, we relied on        genres represent 91% of the users in our sampled dataset,
a recent study done by Pew3 on reading habits in the USA.         the proposed strategy provides significant improvements in
Accordingly, we establish one-month-long “windows” of time in     predicting preferred genres for the vast majority of readers.
which each user is expected to read at least one book, so our
strategy uses one-month time frames from the first book           5.   CONCLUSIONS
log (either a bookmarked or a rated book) to the last.               In this paper, we described our efforts in developing a
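
The windowing and forecasting steps described in Section 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The paper does not give the exact significance formula, so the sketch assumes it is the share of a user's books in a window that belong to a genre. Windows are assumed to be calendar months counted from the first log. In place of a full ARIMA model, the sketch fits a first-order autoregressive model (ARIMA with p=1, d=0, q=0) by ordinary least squares. All function names are hypothetical.

```python
from collections import Counter
from datetime import date


def month_index(first: date, d: date) -> int:
    """Calendar months elapsed between the first log and d (assumed window rule)."""
    return (d.year - first.year) * 12 + (d.month - first.month)


def significance_by_window(logs):
    """logs: list of (date, genre) pairs for one user.

    Returns (genres, rows), where rows[t][g] is the assumed significance score
    of genre g in window t: the share of books read in window t with genre g.
    Windows with no logs get a score of 0.0 for every genre.
    """
    logs = sorted(logs)
    first = logs[0][0]
    n_windows = month_index(first, logs[-1][0]) + 1
    counts = [Counter() for _ in range(n_windows)]
    for d, genre in logs:
        counts[month_index(first, d)][genre] += 1
    genres = sorted({g for _, g in logs})
    rows = []
    for c in counts:
        total = sum(c.values())
        rows.append({g: (c[g] / total if total else 0.0) for g in genres})
    return genres, rows


def ar1_forecast(series):
    """One-step-ahead forecast from x_t = c + phi * x_(t-1), fitted by least squares.

    A simplified stand-in for ARIMA: no differencing and no moving-average terms.
    """
    if not series:
        return 0.0
    if len(series) < 3:
        return series[-1]  # too few points to fit; repeat the last value
    x, y = series[:-1], series[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        return my  # lagged values are constant; fall back to the mean
    phi = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    c = my - phi * mx
    return c + phi * series[-1]


def predict_genre_importance(logs):
    """Forecast each genre's score for the next window and normalize to sum to 1."""
    genres, rows = significance_by_window(logs)
    raw = {g: max(0.0, ar1_forecast([r[g] for r in rows])) for g in genres}
    total = sum(raw.values())
    return {g: (v / total if total else 1.0 / len(genres)) for g, v in raw.items()}
```

Calling `predict_genre_importance` on one user's dated genre logs yields a distribution over that user's genres for the next one-month window. Negative forecasts are clipped to zero before normalization, which is a design choice of this sketch and not something the paper specifies.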
                                                                  time-based genre prediction strategy that can better inform
4.   INITIAL EVALUATION                                           the recommendation process. The novelty of our approach
   Framework. To validate the performance of our proposed         consists of incorporating an explicit time component to gen-
time-based genre prediction strategy, we selected a subset of     erate genre distribution. To the best of our knowledge, this is
the Amazon/LibraryThing4 book dataset. Since the dataset          the first time that the well-known time series ARIMA model
does not always include genre as a part of the provided meta-     is used to predict book genre of readers’ interests. The
data, we extended it by including genre information from          described strategy provides successful predictions and out-
the Library of Congress5. We used 1214 users6 along with          performs the baseline for 77% of users based on the presented
the books they rated or reviewed. To quantify the assess-         initial evaluation, while for the remaining users it provides
ment, we applied Mean Absolute Error (MAE), Accuracy (ACC) and    predictions comparable to the baseline. Because of the scope
Kullback-Leibler (KL) divergence [3]. MAE estimates the           of this paper, the conducted evaluation showcases the genre
difference between the predicted genre importance and the         prediction performance for a single user, while we still need to
ground truth, i.e., the genre distribution for a user at a given  conduct further assessments in terms of quantitatively deter-
time, whereas Accuracy applies a binary strategy that reflects    mining the degree to which the proposed strategy (i) provides
whether the predicted genres correspond to the ones read by a user  successful genre predictions for libraries and publishing
in a given period of time. KL divergence measures how well        companies and (ii) influences the recommendation process to
a distribution q generated by a prediction strategy approx-       assist all three stakeholders.
imates distribution p, the ground truth. In establishing
the ground truth for each user considered for evaluation          6.   REFERENCES
purposes, we adopted the well-known N-1 strategy, such that       [1] P. Afflerbach. The influence of prior knowledge and text
the genres of the books rated by a given user U in the Nth time       genre on readers’ prediction strategies. Journal of
frame are treated as “relevant” genres for U, and the genres          Literacy Research, 22(2):131–148, 1990.
of the books rated in the previous N-1 windows are used for       [2] J. Anderson, A. Anderson, J. Lynch, and J. Shapiro.
training U’s genre prediction model. As a baseline for our            Examining the effects of gender and genre on
initial assessment, we use a traditional prediction strategy          interactions in shared book reading. Literacy Research
that considers the proportion of occurrences of each genre            and Instruction, 43(4):1–20, 2004.
based on data collected over N-1 periods of time to estimate      [3] C. D. Manning and H. Schütze. Foundations of
the importance of each genre for a given user on the current,         statistical natural language processing, volume 999. MIT
i.e., N, time period.                                                 Press, 1999.
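
The evaluation framework above can be sketched in a few lines. This is a hedged illustration, not the authors' code. The baseline is taken to be the proportion of each genre over the N-1 training windows, as described. The paper does not spell out the binary rule behind Accuracy, so the `threshold` rule below is an assumption, as are the natural logarithm and the `eps` smoothing in the KL computation. All function names are hypothetical.

```python
import math
from collections import Counter


def baseline_distribution(train_windows, genres):
    """Proportion of occurrences of each genre over the N-1 training windows.

    train_windows: list of lists, one list of genre labels per window.
    """
    counts = Counter(g for window in train_windows for g in window)
    total = sum(counts.values())
    return {g: (counts[g] / total if total else 0.0) for g in genres}


def mae(pred, truth):
    """Mean absolute difference between predicted and true genre importance."""
    return sum(abs(pred[g] - truth[g]) for g in truth) / len(truth)


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) with p the ground truth and q the prediction.

    Uses the natural log; eps guards against q[g] == 0 (both are assumptions).
    """
    return sum(p[g] * math.log(p[g] / max(q[g], eps)) for g in p if p[g] > 0)


def accuracy(pred, truth, threshold=0.1):
    """Assumed binary rule: a genre counts as predicted if its score exceeds
    the threshold, and as relevant if the user read it in window N."""
    hits = sum((pred[g] > threshold) == (truth[g] > 0) for g in truth)
    return hits / len(truth)
```

For a user whose training windows are `[["drama", "comedy"], ["drama"], ["drama", "drama"]]`, the baseline assigns drama 0.8 and comedy 0.2; the same three metric functions can then score either the baseline or a time-series prediction against the genre distribution observed in window N.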
Results. As shown in Table 1, for N=11 (see footnote 7) our strategy outperforms the  [4] S. Sarawagi and S. H. Nagaralu. Data mining models as
baseline. KL divergence scores show that the genre distri-            services on the internet. ACM SIGKDD Explorations
bution predicted using the time-series approach better approx-        Newsletter, 2(1):24–28, 2000.
2                                                                 [5] J. Wang and Y. Zhang. Opportunity model for
  http://goo.gl/Dhzcg7
3                                                                     e-commerce recommendation: right product; right time.
  http://goo.gl/BAUQK4                                                In ACM SIGIR, pages 303–312, 2013.
4
  http://goo.gl/drH0yF                                            [6] T. You, A. N. Rosli, I. Ha, and G.-S. Jo. Clustering
5
  https://www.loc.gov/                                                method based on genre interest for cold-start problem in
6
  In our initial assessment, we considered Amazon users who           movie recommendation. JIIS, 19(1):57–77, 2013.
provided ratings for at least 35 books.
7
  We empirically verified that for 6<N<11 the results are
comparable to the ones for N=11.