=Paper= {{Paper |id=Vol-2068/humanize8 |storemode=property |title=Personalizing a Parenting App: Parenting-Style Surveys Beat Behavioral Reading-Based Models |pdfUrl=https://ceur-ws.org/Vol-2068/humanize8.pdf |volume=Vol-2068 |authors=Mark Graus,Martijn Willemsen,Chris Snijders |dblpUrl=https://dblp.org/rec/conf/iui/GrausWS18 }} ==Personalizing a Parenting App: Parenting-Style Surveys Beat Behavioral Reading-Based Models== https://ceur-ws.org/Vol-2068/humanize8.pdf
 Personalizing an Online Parenting Library: Parenting-Style
  Surveys Outperform Behavioral Reading-Based Models
             Mark P. Graus                              Martijn C. Willemsen                     Chris C. P. Snijders
         Eindhoven University of                       Eindhoven University of                 Eindhoven University of
          Technology, IPO 0.20                          Technology, IPO 0.17                    Technology, IPO 1.[20]
         5600 MB Eindhoven, the                        5600 MB Eindhoven, the                  5600 MB Eindhoven, the
              Netherlands                                    Netherlands                              Netherlands
            m.p.graus@tue.nl                            m.c.willemsen@tue.nl                     c.c.p.snijders@tue.nl


ABSTRACT                                                               have to learn a whole new set of care-taking skills, ranging
The present study set out to personalize a digital library aimed       from practical (such as changing diapers) to more emotional
at new parents by reordering articles to match users’ inferred         (such as recognizing and reacting to a child’s emotions). There
interests. The interests were inferred from reading behavior           are numerous ways to acquire these skills: parents can get
as well as parenting styles measured through surveys. As               advice from relatives, or alternatively rely on vast amounts of
prior research has shown that parenting styles are related to          books, websites, videos, and other types of media.
how parents take care of their children, these styles are likely
                                                                       Parents have different styles of parenting and as such some
to be related to what content a parent is interested in. The
                                                                       topics may be very relevant to a parent, while others are com-
present study compared personalization based on parenting
                                                                       pletely irrelevant. In this sense, helping parents find their way
styles against other types of personalization.
                                                                       in content related to the parenting domain is similar to person-
We conducted a user study with 106 participants, in which we           alization areas such as movie or book recommendations. A
compared the effects of four different approaches of personal-         challenge in personalizing content on parenting is that first-
ization to our users’ reading behavior and user experience: a          time parents have to find their own way in a domain that is
non-personalized baseline, personalization based on reading            completely new to them. Parents may not have a clear view
behavior, personalization based on parenting styles measured           yet on the range of alternative ways of taking care of a child
through surveys, and a hybrid personalization based on both            that match their styles. It might not be easy for them to judge
reading behavior and parenting styles. We found that while the         what content is relevant and they might read content that is not
reading behavior was not significantly influenced by different         in line with their parenting styles or interests. As such, there
types of personalization, participants had a better user expe-         might be a discrepancy between what content new parents
rience with our survey-based approach. They indicated they             read and what is actually relevant to them. As a result person-
perceived a higher level of personalization and satisfaction           alization based on reading behavior (as is common) might not
with the system, even though in terms of objective metrics this        provide the desired results.
approach performed worse.
                                                                       An additional challenge is that parenting is an activity people
ACM Classification Keywords
                                                                       are very committed to and about which they hold strong beliefs.
                                                                       As a result, new parents might find certain types of content
H.5.2 Information Interfaces and Presentation (e.g., HCI):
                                                                       extremely irrelevant, to the point of being offended. A mother
User Interfaces; H.3.3 Information Storage and Retrieval: In-
                                                                       who struggled and eventually gave up breastfeeding might be
formation Search and Retrieval
                                                                       hurt by receiving unwanted breastfeeding advice. Being wrong
Author Keywords                                                        in personalization in this domain thus has a bigger impact than
Personalization; Parenting; User Experience; Cold Start;               in other domains.
Psychological Traits; Psychological Models; User Models                We aim to help parents in finding relevant content by person-
                                                                       alizing a digital library of information articles on parenting.
INTRODUCTION                                                           Because the content is aimed at new parents, we think a dis-
Becoming a parent is for many a big challenge in life. New             crepancy can exist between reading behavior and reading in-
parents have to get used to a new set of responsibilities and          terests and parenting styles measured through surveys might
                                                                       provide more reliable information for predicting reading inter-
                                                                       ests. To investigate this, we personalize a library using both
                                                                       behavior data and survey data.


                                                                       Research Question and Hypotheses
©2018. Copyright for the individual papers remains with the authors.   The current paper aims to investigate how a library comprising
Copying permitted for private and academic purposes.                   articles on parenting can be improved by personalizing the
HUMANIZE ’18, March 11, 2018, Tokyo, Japan
order in which the articles are presented1 . A screenshot of
what the library interface looked like can be found in Figure. 1.
In addition the paper investigates if and how parenting styles
can contribute to this personalization. The main research
question thus is “How does personalization based on parenting
styles compare to personalization based on reading behavior
in terms of user behavior and user experience?”
We try to answer this research question by investigating the
effects of personalization based on survey responses mea-
suring parenting styles (explained in Section 1.4) and more
conventional ways of personalization that rely on behavior
data. Specifically we compare the effects of survey-based per-
sonalization with personalization based on reading behavior,
personalization based on both reading behavior and survey
responses and a non-personalized baseline. We are interested
in the effects of this personalization both in terms of influ-
enced behavior (e.g. does personalization based on surveys
increase the number of articles users read?) and in terms of
user experience (e.g. does personalization based on surveys
result in a higher satisfaction with the digital library?). To
investigate the effects on the user experience we adopted the
User-Centric Evaluation Framework for personalized systems
by Knijnenburg and Willemsen [11]. We designed a UX sur-
vey with items aimed to measure different aspects of the user
experience.
With the survey we aimed to measure three aspects of the user
experience specifically and formulated survey items to do so:
the perceived level of personalization (“The library shows arti-
cles I find interesting”), the system satisfaction (“It was easy
to find relevant/interesting articles”), and reading satisfaction
(“I enjoyed reading the items I read”). We hypothesize that the
different ways of personalization influence the perceived level
of personalization. A higher level of personalization should
lead to a higher system satisfaction, which should lead to a
higher reading satisfaction. The higher system satisfaction is
also expected to increase the amount of reading by the user.                           Figure 1. uGrow ‘My Articles’ page

In terms of improving user satisfaction and increasing reading
behavior we hypothesize the following order in the different
personalizations, from worst to best:                                  of personalization can be found on numerous websites, for
                                                                       example in the form of recommendations on Amazon, or as
• The non-personalized library                                         filters on social media feeds such as Twitter and Facebook. In
• The library personalized based on just reading behavior              general the goal is to alter a system in a way that it caters to
                                                                       the individual needs of a user to influence user behavior or
• The library personalized based on just survey responses              user experience. A typical goal of influencing behavior is to
                                                                       make users consume more content in a media browsing system
• The library personalized based on both reading behavior              or purchase more items in a webshop, while a typical goal of
  and survey responses                                                 influencing user experience is to make it easier for users to
The remainder of this section will introduce the theoretical           reach their goals.
background on which this study is based.                               Personalization can be implemented in many different
                                                                       ways [19], but the most widely adopted methods rely on his-
Personalization                                                        torical data describing how users interact with a system, and
Personalization is the process of altering a system to fit to          combine these data across users to make predictions on what
the needs and/or preferences of an individual [15]. Examples           content a user will find relevant. The system is subsequently
1 The content and the design of the library were taken from Philips’   altered so that the user is exposed to more of the content he/she
uGrow app. Available for iOS https://itunes.apple.com/app/             is likely to find relevant.
ugrow-healthy-baby-development/id1063224663 and Android
https://play.google.com/store/apps/details?id=com.philips.             A standard problem related to this approach to personalization
cl.uGrowDigitalParentingPlatform                                       is the cold start problem [18]. More specifically, three cold
start problems exist: the system cold start, the user cold start,     Parenting Styles
and the item cold start. The system cold start occurs when            Zhao [20] performed a literature review on research on parent-
not enough data are available within the system as a whole to         ing with the goal of understanding how scholars operationalize
make predictions. The user and item cold start occur when             and measure parenting styles. Zhao was in particular interested
there are not enough interaction data available corresponding         in how parenting styles relate to the actual care-taking behav-
to respectively the user or the item, so that no predictions can      ior and as such, the review was primarily focused on research
be made for respectively the user or the item.                        that comprised both questionnaires and a behavioral aspect.
                                                                      She found that parenting as a whole is a combination of cog-
In the context of parenting an additional challenge occurs.
                                                                      nitive factors, the physical task of taking care of a baby, and
Apart from being new to a system, (some) parents are also
                                                                      the interplay between the two (cf. [2]). Zhao in addition found
new to being parents and they might find it hard to identify
                                                                      that researchers conceptualize parenting styles as individual
what content is relevant to them. This can result in a mismatch
                                                                      differences along two cognitive dimensions: structure (i.e.
between the content they read and the content that they are
                                                                      how important parents think structure is for their children) and
actually interested in. In systems in which user evaluations of
                                                                      attunement (i.e. how much parents value reacting to a child’s
content are not being tracked explicitly, assuming that content
                                                                      needs and how able they are at reading those needs) [1, 3, 16].
is appreciated because it was read may well lead to inaccurate
                                                                      Prototypical parenting styles are the resulting combinations
predictions about user preferences. Because of this, a library
                                                                      of scores along these two dimensions (high attunement/high
aimed at parents might benefit from relying on other types of
                                                                      structure, high attunement/low structure, low attunement/high
data for personalization.
                                                                      structure and low attunement/low structure). Other cognitive
                                                                      factors that have been identified in literature to play a role are
                                                                      parental distress, perceived self-efficacy, and the perceived
Personalization and Psychological Traits                              difficulty of the child.
Many psychological traits have been incorporated in personal-
ization applications. Hauser et al. [9] personalized an online        The cognitive factors allegedly have an interplay with how
tool to compare contracts for mobile phones based on cogni-           parents actually take care of their children. To validate these
tive styles (i.e. the way in which individuals prefer to process      parenting styles and investigate how they relate to care-taking
information) and showed that providing users information in           behavior, Zhao [20] conducted a survey study in which she
a way that matches their cognitive style (e.g. textual versus         measured parenting styles and asked respondents to self-report
visual information) increases buying propensity. Germanakos           on how they take care of their children. The analysis of the
and Belk [7] found that adapting an online learning environ-          survey data showed support for the conceptualization of par-
ment to the working memory capacity of its students resulted          enting styles along the previously mentioned dimensions of
in higher test scores.                                                structure and attunement. In addition it showed that parenting
                                                                      styles are related to the actual care-taking behavior of parents.
Similarly, Fernandez-Tobias et al. [5] showed that incorporat-        For example, parents scoring low on attunement are less likely
ing personality in collaborative filtering algorithms allowed         to engage in breast-feeding and more likely to opt for bottle-
them to better predict recommendations across domains (e.g.           feeding. As parenting styles are related to how parents take
recommending movies based on someone’s music listening be-            care of their children, they are likely to be useful predictors
havior). They did this by extending the SVD++ algorithm [12],         for what type of content parents are interested in. For example,
an algorithm used to predict ratings that users will assign to        parents that find structure important put their children to bed
items. Fernandez-Tobias et al. used a part of the myPerson-           at a fixed bedtime instead of waiting for the kid to become
ality dataset2 comprising 160k users and in total just over 5         sleepy. As a result they might be more conscious of the fact
million likes over 16k items (consisting of books, movies or          that their child does not fall asleep easily and will thus be more
music artists). The personality traits (the five factor model         interested in content on how to get a baby to sleep well than
with the traits openness to experience, conscientiousness, ex-        people that value flexibility over structure and wait for their
traversion, agreeableness and neuroticism [14]) were available        child to get sleepy.
for all users and were used to predict likes. Their results
showed that incorporating the personality information substan-
tially improved the extent to which likes on Facebook could           STUDY DESIGN
successfully be predicted.                                            To investigate our research question we designed a user study
                                                                      that consisted of two main parts, with a first part aimed at
These studies demonstrate that personalization can benefit            collecting initial data to be used for personalizing the “My
from considering and incorporating personal characteristics           Articles” page and a second part aimed at investigating the
(such as personality traits or cognitive styles). In the case         effects of the different ways of personalization on the reading
of parenting, parenting styles are psychological traits that are      behavior and user experience.
likely to play a role in what content parents find relevant. In the
present study we measure parenting styles and subsequently            During the first part, participants were asked to complete a
use them for personalizing the online library.                        survey to measure their parenting styles, after which they
                                                                      were invited to browse the non-personalized library (i.e. a
                                                                      library with a fixed order of articles). The responses to the
2 Available from http://mypersonality.org/wiki/doku.php?id=           surveys were stored for personalization later. The information
download_databases                                                    regarding what articles participants read during the browsing
phase was used for personalization based on reading behavior.
The order of articles for the second part of the study was
calculated in one out of four ways (described in more detail in
section 3). For each participant we selected at random which
set of predictions was used to personalize the library.
In the second part of the study the participants were re-invited
to interact with their now personalized digital library. Sub-
sequently, participants evaluated their experience with the
system through our UX survey. We first report on the initial
phases of the study.

Initial Data Collection: Survey and Reading Behavior
We implemented the online library on a website that was acces-
sible through browsers on computers and mobile phones. We
recruited participants through posts in online forums dedicated
to parenting and through Facebook ads targeting parents in
the United Kingdom and United States with children younger
than two years old. In total 234 parents clicked on the link to
participate in the study. All participants that completed the en-
tire study were compensated with $4.50 or £3.50 of shopping
credit for amazon.co.uk or amazon.com. The ad campaign and
data collection took place in May and June 2017.                      Figure 2. Distributions of the 5 factor scores measured through the first
                                                                      survey.
The first part of the study consisted of two steps. In the first
part people were asked to complete the survey to measure
parenting styles. After completing the survey, the participants
                                                                      The initial part of the data collection was concluded with
were presented with the digital library and invited to browse
                                                                      offering the participants to freely browse the online library.
through it and read the articles that they were interested in.
                                                                      Participants opened on average 2.23 articles (SD: 3.37 articles)
The participants were invited to read as many articles as they
                                                                      from 1.25 categories (SD: 1.51 categories). These data and the
wanted for as long as they wanted and to click a link labeled
                                                                      survey responses were used to calculate relevance predictions
“I’ve finished reading” once they felt like they read enough.
                                                                      for the individual participants.
After clicking this link participants were asked to submit their
email address for the second step of the study.
                                                                      CALCULATING RELEVANCE PREDICTIONS
In total 181 participants completed the survey (15 men/166            Based on the data collected in the first step of the study we cal-
women, 99 first time parents, with an average age of the baby         culated per participant four different relevance rank predictions
11.39 (SD: 7.96) months). On average the whole session                for all articles. As a baseline we used the non-personalized
lasted just over 6 minutes (378 seconds, SD: 279.80 seconds).         General Top-N. The three other ways of predicting differed
The survey consisted of 15 items of the original survey of            in what data from the first step were used. A survey-based
Zhao [20]. For the five cognitive (structure, attunement, ma-         ordering was based on the data from the survey responses
ternal self-efficacy, parental disstress and perceived difficulty     of the participants and on reading behavior at the aggregated
of the child) we selected per factor the two items with most          level. A reading-based ordering used only data regarding the
extreme factor loadings. We added items concerning the de-            articles that people had read in the first step. Finally, a hybrid
mographics of the parent (gender, level of education, whether         ordering used both the survey responses and the individual
they were first time parents) and child (gender, age) that had        reading behavior. The way these orderings were calculated is
had large effects on the self-reported behavior in the original       described in the following sections.
analysis. The factor scores for our participants were calculated
by using the factor loadings from the original survey and are
                                                                      Survey-Based Predictions
displayed in Figure 2. These scores show similar distributions
and correlations as the factors in the original survey.               We used the survey responses collected in the first step to
                                                                      predict relevance of the different articles for the participants
The interface of our library was made to have the look and            in our study. To do this, the participants were subdivided
feel of the original library (see Fig. 1) as much as possible.        in segments, by performing median splits on the 2 cognitive
As in the original interface, the articles are subdivided in          factors: attunement and structure. The user segment was then
categories that are displayed in rows. Within the category rows       defined to be the combination of these two scores, resulting
the articles are displayed horizontally. The user is able to scroll   in four segments. We considered incorporating the three other
up and down to different categories and left and right within         factors measured in the study (self-efficacy, parental distress
categories to the different articles. As in the original interface,   and perceived difficulty of the child), but given the number
the order of articles and categories was fixed: every participant     of users in our dataset adding additional factors resulted in
had exactly the same order of categories and articles.                segments that became too small.
                                                                         the positive only data describe whether or not a user read an
                                                                         article in the first step, and the predictions would indicate what
                                                                         items a user is most likely to read. In order to translate this
                                                                         positive-only, binary feedback into a ranking, pairs of items
                                                                         are semi-randomly selected per user, where each pair consists
                                                                         of an item that the user has interacted with and one with which
                                                                         the user has not interacted. The assumption is that the first
                                                                         is preferred over the second. Sampling a large number of
                                                                         pairs per user, results in a ranking that can be used in matrix
                                                                         factorization and the resulting model then calculates a relative
                                                                         relevance score instead of a rating.
   Figure 3. Article Categories Ranked on Popularity per Segment         Hybrid Predictions
                                                                         The BPRMF algorithm was extended to combine reading
                                                                         behavior and the individual parents’ user attributes inferred
As the participants read on average just over 2 articles, there
                                                                         from the survey for the calculation of hybrid predictions. The
was not enough data to show differences on the level of indi-
                                                                         BPRMF algorithm was extended similarly to how Fernandez-
vidual articles (i.e. articles were not read often enough to allow
                                                                         Tobias et al. [5] extended the SVD++ [12] algorithm to incor-
for enough variance), but participants from different segments
                                                                         porate personality in predictions.
did prefer different categories, as can be seen in Figure 3.
When investigating these predictions, the popularity order for           Where the original BPRMF algorithm uses two matrices P and
these categories seems to make sense intuitively. For example,           Q to calculate predictions, our user attribute aware BPRMF
the breastfeeding category is predicted to be more popular for           algorithm uses a third matrix Y . Y describes the user attributes
segments with high attunement, which is congruent with the               on the same k latent features the users and articles are ex-
relationship with breast-feeding and high attunement in the              pressed in. In our case we used high and low scores for the
original survey [20].                                                    five cognitive factors from our parenting style survey as user at-
                                                                         tributes. We decided again to use the median splits per factors
As a result we decided to sort the categories based on the
                                                                         to assign each user a high or low score for each factor in order
attunement-structure segment and sort the articles within each
                                                                         to prevent overfitting. Every user has thus 5 user attributes and
category based on general popularity. That is, the survey-based
                                                                         the relevance predictions are similar to the original BPRMF
predictions only personalized the order of the categories, not
                                                                         algorithm with an additional matrix in which user attributes
the articles within each category. We tried basing segments on
                                                                         are represented. The predicted relevance is then calculated
other factors than attunement and structure, but the resulting
                                                                         according to equation 2.
predictions were not as easily interpretable as the predictions
based on these segments.
                                                                                                                        !
Reading-Based Predictions                                                                   r̂ui = qi ∗ pu +    ∑ ya                      (2)
For the conditions based on reading behavior alone, we used                                                    a∈A(u)
the Bayesian Personalized Ranking Matrix Factorization (or
BPRMF) algorithm implemented in MyMediaLite [6, 17] to                   This model is fit using stochastic gradient descent. Each itera-
predict relevance. BPRMF is an extension to classic matrix               tion consists of two steps. In the first step the P and Q matrices
factorization [13] that allows it to calculate recommendations           are fit, while leaving the Y matrix constant. In the second step
from positive only feedback instead of rating data.                      the Y matrix is fit, while leaving the P and Q matrix con-
                                                                         stant. We implemented this algorithm in the MyMediaLite
Conventional matrix factorization attempts to complete the
                                                                         library [6].
matrix R with dimensionality of U (number of users) and I
(number of items). In this matrix the cells represent ratings            Calculated Relevance Predictions
the user has given to the corresponding item. This matrix is             In total the dataset contained 221 users and 508 reads3 . For
decomposed into two k-dimensional sub-matrices P and Q in                each user predictions using the four methods described above
which the rows of P and Q represent respectively users and               were calculated. The predictions were then sorted in two steps.
items in a k-dimensional latent feature space. These matrices            First the 7 categories were ordered based on the article with
are constructed so that the predicted rating r̂ui is calculated by       the highest predicted relevance (a strategy called min-rank
taking the inner product pu ∗ qi (see Equation 1).                       that has been shown to work well in similar circumstances [4,
                                                                         21]). Within the categories the articles were ordered based on
                                                                         predicted relevance.
                          r̂ui = qi ∗ pu                           (1)
                                                                         The algorithms for the reading-based and hybrid predictions
Rendle et al. [17] extended this matrix factorization into               required the tuning of a set of regularization hyperparame-
BPRMF to allow using positive only feedback to calculate                 ters, which we carried out using Bayesian Optimization. The
per user a ranking of the articles from highest predicted rel-           3 We included reading data from a pilot study to ensure we had enough
evance to lowest predicted relevance. In the current study,              data to calculate predictions
                                        5-fold Cross Validation                             Post-hoc Comparison
              algorithm      AUC        prec@5      prec@10     NDCG             AUC        prec@5    prec@10            NDCG
               baseline      0.840       0.083        0.065     0.424            0.706       0.146      0.104            0.477
                survey         -            -           -         -              0.650       0.060      0.062            0.353
               reading       0.832       0.079        0.061     0.411            0.767       0.176      0.114            0.522
                hybrid       0.769       0.080        0.059     0.404            0.807       0.214      0.126            0.561
                   Table 1. Performance Metrics calculated through 5-fold Cross Validation and a post-hoc performance analysis


Bayesian Optimization was performed using 5-fold cross vali-               relevance rank (or highest predicted relevance) within the cat-
dation, using AUC as the target measure. Once optimal values               egory, so that the category with the article with the highest
for the hyperparameters were established, the predictive mod-              predicted relevance was shown on top. This way of sorting
els were constructed and the predictive performance (i.e. the              categories has been shown to be one of the best strategies in
reading-based and hybrid recommendations) was investigated                 terms of reducing browsing time [4]. Within categories the
through 5-fold cross validation also. Table 1 shows these per-             articles were ordered by predicted relevance rank, with the
formance metrics of the three algorithms under the column                  article with the lowest predicted relevance rank to the left of
‘5-fold Cross Validation’. The performance metrics appeared                the list.
to be adequate4 . However, the baseline, reading-based, and
hybrid predictions are calculated on the level of the individual           Participants were allowed to browse the library freely during
articles, they cannot be easily compared to the survey-based               which we measured what articles the participants opened. Par-
                                                                           ticipants were shown a link labeled “I have finished reading”
predictions that are calculated first on the category level and
                                                                           that would take them to the survey as soon as they felt they
then within the categories on an individual article level. In
                                                                           read enough. The survey contained 11 items aimed at measur-
order to make a fair comparison, we performed a post-hoc
                                                                           ing Perceived Level of Personalization, System Satisfaction,
analysis by recalculating the performance metrics for the sets
of recommendations to correspond to the survey-based predic-               and Reading Satisfaction.
tions. We did this by calculating the lists of recommendations             Participants
and sorting all lists first by category based on the minimum
                                                                           All 181 users from the first part were invited to join the second
predicted rank of the article within that category and subse-
                                                                           part of the study via email. Of the 181 users we sent invita-
quently sorting the articles within their categories based on the
                                                                           tions to, 150 visited the second part and 121 completed the
predicted relevance for the individual articles. We then calcu-
                                                                           study. A number of cases were removed, for either trying to
lated performance metrics by using the actual reading behavior
                                                                           complete the study with multiple email addresses (3 users),
as ground truth. The outcome of these recalculations can be
                                                                           having missing data in the survey (1 user), or finishing the
found under the columns ‘Post-hoc Comparison’ in Table 1.
                                                                           second part of the study in less than 50 seconds (11 users). For
These numbers indicate the most accurate predictions for the
                                                                           our final data analysis we ended up with 106 users (9 men/97
hybrid predictions, followed by the reading-based predictions,
                                                                           women, 50 first time parents, mean (SD) age of the baby 10.63
the survey-based predictions and finally the non-personalized
                                                                           (8.45) months)
baseline. Based on these metrics we would expect the hybrid
predictions to be most in line with what participants will read,           These users were distributed roughly equally over conditions
and the survey-based least. This order is different from the               (baseline: 29, survey: 29, reading-based: 22, hybrid: 26). In
order in the k-fold cross validation metrics because no k-fold             addition, there appeared to be no bias in response rate for the
cross validation was applied to be able to compare with the                different parenting style segments of the participants, with
survey-based recommendations (i.e. the train and test set were             response rates of .56 for the low structure/high attunement
identical).                                                                segment, .65 for the high structure/low attunement segment,
                                                                           .73 for the high structure/high attunement segment and .60
RE-ENGAGING WITH THE NOW PERSONALIZED SYSTEM                               for the low structure/low attunement segment(χ 2 (3) = 3.239,
The second part of the study was used to investigate our re-               p = 0.356).
search question and test our hypotheses. To this end partici-
                                                                           Results
pants were re-invited to interact with the website, where they
were now shown the library personalized in one out of four                 To gain insight in how the different methods of predicting
ways (selected at random). The invitations were sent out after             relevance influenced the final recommendations participants
all predictions were calculated, which means that the time                 received, we calculated the difference of the recommendations
between finishing the first part and starting the second part              with the general Top-N in terms of Spearman Rank Correla-
differed between participants (median 42.6 days. SD: 15.4                  tion. The (Spearman) correlation coefficient ρ indicates to
days). In this step the interface was personalized by reorder-             what extent lists are similar, with a value of 1 if the order is
ing both the categories and the articles within the categories.            identical and -1 if they are in reverse order. The results are
The categories were ranked based on the minimum predicted                  shown in Figure 4 and they reveal that the available reading
                                                                           data does not allow personalization that differs a lot from the
4 An overview of the different metrics and how to interpret them can       baseline condition (as the correlation between reading-based
be found in [8].                                                           and baseline is 0.91 on average). Personalization based on the
survey-based predictions is quite different from the baseline
predictions, with an average correlation of 0.37. The hybrid
predictions fall somewhere in between the reading-based and
survey-based predictions with a correlation of 0.74. These
numbers indicate that the additional data of parenting styles
allows for personalization that deviates more from the baseline
than personalization based on reading behavior alone.
One possible explanation of the reading-based personalization
not differing much from the baseline is insufficient data. As
there are a limited number of users (221 users, see Section 2.1),
that read a limited number of articles (2.44 articles on average)
from a library with a limited number of articles (102) that was
presented in a fixed order. As such the dataset might not con-
tain enough variance between users’ reading behavior to fully                  Figure 5. Survey Items and Response Distributions. The light-grey items
benefit from collaborative filtering. What argues against this                 have been omitted from the analysis because of poor factor loadings.
is the fact that the reading-based and hybrid recommendations
appear to outperform the survey-based predictions in terms of
prediction accuracy (see Table 1).                                             User Experience
                                                                               As per the user-centric evaluation framework by Knijnenburg
Reading Behavior                                                               and Willemsen [11] all survey items were submitted to a struc-
Participants read on average 2.72 articles (SD: 4.28 articles),                tural equation model (SEM). The responses to the individual
but 42 participants (39.6%) did not read any articles. The                     items can be seen in Figure 5, with items belonging to Per-
descriptives for the number of article reads per condition are                 ceived Level of Personalization (pers1-pers4), System Satis-
shown in Table 2. The different conditions had no significant                  faction (syssat1-syssat4), and Reading Satisfaction (readsat1-
influence on the number of articles people read, as negative                   readsat3). The three items for reading satisfaction show very
binomial regressions with the condition as independent vari-                   low variance among each other, which lead to these three items
able and the number of reads as dependent variable showed                      not fitting in the model. This might have been caused by the
no significant difference across conditions. This implies that                 fact that the reading behavior did not differ across conditions
no support is found for the hypotheses regarding the effect of                 as we manipulated only the order in which the articles were
our experimental manipulations on how participants interact                    presented, and not the actual content in the library. Therefore,
with their personalized libraries.                                             people were actually able to read the same articles regardless
                                                                               of experimental condition and thus the reading satisfaction
                                                                               might be similar. Apart from the items on Reading Satisfac-
                                                                               tion, two of the remaining items (pers1 and syssat2) explained
           condition       Mean           SD            min    max        N
                                                                               little variance and were also removed from the analysis.
 1          baseline       2.448         3.501           0     13         29
 2           survey        2.517         4.032           0     16         29   Despite the fact that participants did not read a large amount
 3          reading        4.273         6.670           0     31         22   of articles, the interface did allow participants to get a general
 4           hybrid        2.038         2.289           0      9         26   idea of the library by looking at the categories and the article
                                                                               titles. Nevertheless, we do feel that the participants who ac-
                      Table 2. Article Reads Per Segment
                                                                               tually read articles are better able to evaluate the library. To
                                                                               account for this we introduced an additional (dummy) vari-
                                                                               able labeled ‘Read’ indicating whether or not people read any
                                                                               articles.
           Spearman's Rank Correlation
                  ●
                                                                               A SEM was constructed using the remaining six survey items
                                                                               measuring two latent constructs (Perceived Personalization
     0.9                                                                       and System Satisfaction), the experimental conditions, and the
                                                                     ●
                                                                     ●

                                                                               variable describing whether or not people read as exogenous
                                                                               variables. The two latent factors had high correlation, but the
     0.7
                                                                               model showed good fit (with χ 2 (36) = 44.447, p = .158, CFI
ρ




                                                                     ●
                                                                     ●
                                                                               = .984, TLI = .974, RMSEA = .047, 90% CI: [0.000, .088]).
     0.5
                                                                     ●
                                                                               For each participant we used this model to calculate the scores
                                                                               on these latent factors to be used for the remainder of the
                                                                               analysis.
               baseline         survey               reading     hybrid        As the final model consists of only two latent constructs
                                         condition                             (Perceived Personalization and Systems Satisfaction) that are
Figure 4. Boxplots of Spearman’s Rank Correlation with General Top-N           highly correlated, there is no clear underlying structural model
per Condition                                                                  to test anymore. For the analysis we could either combine both
factors into one overall latent factor, or analyze both factors             (2) in Table 3. Model (3) reveals how this effect holds up
separately. We chose to do the latter as both factors might still           for participants that read versus participants that did not. It
capture different nuances of the user experience, despite their             shows a negative interaction effect for the participants that
high correlation.                                                           received survey-based personalization and read at least one
                                                                            article, which suggests that only the people that do not read
We analyzed the effect of our manipulation on the factor scores             any articles actually perceive a higher system satisfaction; for
of both constructs through linear regressions, with the factor              those who do read at least one article the effect is strongly
scores as dependent variables and the experimental condition                reduced.
as independent variable. As additional moderator we included
the dummy variable representing whether or not people read                  In conclusion support is found for the hypothesis that survey-
articles.                                                                   based personalization outperforms the non-personalized base-
                                                                            line, while no evidence was found that the reading-based and
The average factor scores per condition for the two measured                hybrid personalization did so. The lack of effect in terms
constructs can be found in Figure 6. The image shows an                     of reader experience are in line with the comparison of the
increase in both Perceived Personalization and System Satis-                different predicted rankings in terms of Spearman’s Rank Cor-
faction for the survey-based condition. The effects are higher              relation, that showed a high similarity between the reading-
for the participants that did not read (represented in the red              based and non-personalized baseline. This comparison further
bars) and lower for the participants that did (represented in the
                                                                            showed that the survey-based personalization was most dif-
green bars).
                                                                            ferent from the baseline, which is also reflected in the user
                                                                            experience (albeit stronger for the people that did not read than
                                                                            the people that read). The hybrid conditions falls in between
                                                                            the survey-based and reading-based and similarly the effects
                                                                            on user experience appear to fall in between the effects of the
                                                                            survey-based and reading-based recommendations.

                                                                            CONCLUSION AND DISCUSSION
                                                                            This study set out to compare personalization based on psy-
                                                                            chological traits measured through a survey to personaliza-
                                                                            tion based on reading data. Through a user study we com-
                                                                            pared different methods against a non-personalized baseline
                                                                            and showed that personalization based on survey information
                                                                            about parenting styles resulted in a significantly higher experi-
                                                                            enced user satisfaction and perceived level of personalization
                                                                            despite a lower objective performance, whereas using only
Figure 6. Marginal effects on Perceived Personalization (top row) and
System Satisfaction (bottom row) for the different conditions. The error    historical reading behavior or the combination of historical
bars correspond to +/- 1 standard error. Separate bars are shown for        reading behavior and measured parenting styles did not. Our
participants that read no articles (red) and at least one (green). Scores   findings speak to the potential usefulness of including data
are standardized: a score of +1 implies 1SD higher than the baseline        regarding characteristics of users (collected through an initial
(baseline recommendations for a user that did not read). Error bars are
one standard error of the mean.
                                                                            survey or otherwise) in personalization to alleviate the cold
                                                                            start problem. While the actual reading behavior for users was
The regression models in Table 3 show these effects as well.                not influenced, an improved user experience may increase the
Regression model (1) shows a positive and significant effect on             probability for users to return to the library later on.
Perceived Personalization for participants in the survey-based              The fact that using the survey data for personalization also out-
condition, indicating that these participants had the feeling the           performed the condition where recommendations were based
library catered more to their interests5 . An additional, albeit            on both survey data and reading behavior is likely caused by
not statistically significant, effect the table shows is an effect          the fact that the hybrid recommender - given how we had im-
with a significance level of p < 0.1 for the increased perceived            plemented it - came up with suggestions that were relatively
level of personalization in the condition with hybrid person-               close to the baseline condition. Hybrid predictions that would
alization. Although caution is needed when interpreting this                have assigned more weight to the survey data might have
effect, it describes a trend towards participants experiencing a            faired better. In any case, we do see that personalization based
higher level of personalization with the hybrid personalization.            on surveys captures the interests better, or at least increase
In terms of System Satisfaction the patterns are slightly differ-           the reported user satisfaction, and that they lead to a more
ent. Participants that received the survey-based personalization            different order in which articles are presented than based on
were more satisfied with the system, as can be seen in model                the reading behavior alone.

5 Because the factor scores are calculated through a Structural Equa-
                                                                            From a system owner point of view it is worth noticing that the
                                                                            survey-based predictions were very straightforward to calcu-
tion Model they are normally distributed with a mean of 0 and SD
of 1. Participants in the condition with survey-based personalization       late and implement compared to the reading-based and hybrid
thus had a perceived level of personalization of 0.563 SD higher than       predictions. In addition, after completing the short survey the
participants in the baseline.                                               user can immediately benefit from personalization. Both the
                                                                   Dependent variable:
                                           Perceived Personalization              System Satisfaction
                                                      (1)                     (2)                    (3)
                                           β            (SE)          β         (SE)         β          (SE)
                  survey                   0.563∗       (0.241)       0.673∗∗ (0.243)         1.334∗∗∗ (0.334)
                  reading                  0.140        (0.260)       0.406     (0.261)       0.479 (0.460)
                  hybrid                   0.438•       (0.248)       0.273     (0.249)     −0.030 (0.391)
                  Read                                                                        0.196 (0.328)
                  survey:Read                                                               −1.279∗∗ (0.464)
                  reading:Read                                                              −0.158 (0.556)
                  hybrid:Read                                                                 0.389 (0.498)
                  Constant                 0.057        (0.171)       0.097     (0.171)     −0.004 (0.236)
                  Observations                        106                     106                    106
                  R2                                 0.062                   0.072                  0.186
                  Adjusted R2                        0.034                   0.045                  0.128
                  Residual Std. Error          0.919 (df = 102)        0.923 (df = 102)        0.882 (df = 98)
                  F Statistic                 2.236 (df = 3; 102)     2.649 (df = 3; 102) 3.200∗∗ (df = 7; 98)
                  Note:                                                     • p<0.1; ∗ p<0.05; ∗∗ p<0.01; ∗∗∗ p<0.001

Table 3. Regression Tables for Experimental Manipulation and Read on Perceived Level of Personalization and System Satisfaction. The regression
coefficients are the standardized β s and values between parentheses are standard errors.


reading-based and (to a lesser extent) the hybrid predictions             Participants in our study interacted with the system twice. One
require reading behavior from the user before they can be cal-            time for an initial data collection and a second time for the
culated. Admittedly providing explicit feedback in the form               evaluation. This difference might have lead to a discrepancy,
of a survey demands more effort than the implicit feedback                as in the first session people were exploring the system and
provided through the natural interaction of reading. However,             possibly paying attention to other aspects than in the second
the higher user experience suggests there might be a trade-off            session. For example, in the first session people were getting
between the costs of user effort and the benefits of accurate             used to the way of navigation in the library and getting ac-
personalization.                                                          quainted with the system and its usability may have been an
                                                                          issue. In the second session, participants are more likely to
Another interesting finding is that the effects of personalization        have evolved past this stage, and they can now focus more on
on user experience disappeared as soon as participants started            what it is that they want to read. This would imply that data in
reading articles. A possible explanation for this observation
                                                                          the first session is describing behavior of participants who are
can be the number of articles people see in the second part
                                                                          getting to know a system, and as a result models trained on
that they have already read in the first part. Seeing articles one
                                                                          this data will generate recommendations based on what an ex-
has already read may contribute to a higher perceived level of            ploring user will typically read, which may not be appropriate
personalization and satisfaction with the library as a whole,             to personalize a library for a participant who already knows
while reading these articles might actually be detrimental for            and is actively using a system.
the user experience. In other words, what looks good might
not necessarily be what helps the user and as such it might be            As mentioned in the results section, it is unsure how our find-
worthwhile to investigate the factors that influence user satis-          ings hold up in a setting with a bigger library and more inter-
faction of a personalized system before and after consumption             action data (both in terms of number of users and in terms of
and to see if and how these are different from each other. From           interactions per user). With only 102 articles in a fixed order,
a more general perspective this raises the question whether               behavior for participants in the initial data collection may not
and how personalization needs to anticipate possible changes              have differed enough from each other (yet) to allow the per-
and differences in the perception of recommendations as the               sonalization based on reading behavior to produce predictions
user progresses. Alternatively it might indicate that the pro-            that are personalized sufficiently. The fact that these personal-
cess of evaluating personalization is different and depends on            izations stayed relative close to the non-personalized baseline
whether the user is evaluating through observing or through               can be interpreted this way. The survey-based recommenda-
experiencing.                                                             tions on the other hand combined data from users with similar
                                                                          parenting styles and as a result were able to differentiate them-
                                                                          selves more from the non-personalized baseline. Having more
Shortcomings and Future Work                                              articles and perhaps also a somewhat longer initial period will
While the findings of this study indicate that using surveys as           allow for behavior with more differences between users, al-
a basis for personalization can improve personalized systems,             lowing to more effectively leverage the predictive power and
the specific application in which we tested our hypotheses                complexity of reading-based personalization, which in turn
might limit the extent to which this finding can be generalized.          will provide more insight into the conditions that play a role
in how personalization based on behavior compares to person-        2. Diana Baumrind. 1966. Effects of Authoritative Parental
alization based on psychological traits. However, our results          Control on Child Behavior. Child Development 37, 4 (dec
show that in this situation with limited reading data a short          1966), 887. DOI:http://dx.doi.org/10.2307/1126611
survey delivers good data for initial personalization.
                                                                    3. Jay Belsky and Sara R. Jaffee. 2015. The Multiple
In line with the previous argument, it is important to realize         Determinants of Parenting. In Developmental
that in terms of data per user, our participants only interacted       Psychopathology. Number April. John Wiley & Sons,
with the system once and read 2.23 articles on average. They           Inc., Hoboken, NJ, USA, 38–85. DOI:
might still have been in their cold start phase and there may          http://dx.doi.org/10.1002/9780470939406.ch2
not have been enough information about the users’ reading
                                                                    4. Gianluca Demartini, Paul-Alexandru Chirita, Ingo
behavior to provide useful recommendations. What argues
                                                                       Brunkhorst, and Wolfgang Nejdl. 2008. Ranking
against this is that both the hybrid and reading-based models
                                                                       Categories for Web Search. In Advances in Information
had higher prediction accuracy than the survey-based recom-
                                                                       Retrieval. Springer Berlin Heidelberg, Berlin, Heidelberg,
mendations. Given these observations it would be worthwhile
                                                                       564–569. DOI:
to perform a study that controls for the amount of feedback
                                                                       http://dx.doi.org/10.1007/978-3-540-78646-7_56
collected from the participants. Having more feedback per par-
ticipant allows to investigate how the number of interactions       5. Ignacio Fernández-Tobías, Matthias Braunhofer, Mehdi
per user affects the performance of the different personal-            Elahi, Francesco Ricci, and Iván Cantador. 2016.
ization approaches, similar to how Kluver and Konstan [10]             Alleviating the new user problem in collaborative filtering
investigated the effects of number of interactions on predictive       by exploiting personality information. User Modeling and
accuracy.                                                              User-Adapted Interaction 26, 2-3 (jun 2016), 221–255.
                                                                       DOI:http://dx.doi.org/10.1007/s11257-016-9172-z
Apart from the amount of data per user, the amount of data
available within the system as a whole may be another factor        6. Zeno Gantner, Steffen Rendle, Christoph Freudenthaler,
that plays a role which method of personalization works best.          and Lars Schmidt-Thieme. 2011. MyMediaLite: A Free
Evaluating how survey-based and reading-based personaliza-             Recommender System Library. In Proceedings of the 5th
tion compare over time, as more data enter the system as a             ACM Conference on Recommender Systems (RecSys
whole or per user, would provide valuable insight in which             2011).
approach works best when. One could imagine a system that
                                                                    7. Panagiotis Germanakos and Marios Belk. 2016.
starts out from personalization based on measured psychologi-
                                                                       Human-Centred Web Adaptation and Personalization.
cal traits that transitions into a system based more on behavior
                                                                       Springer International Publishing, Cham. 336 pages. DOI:
or a hybrid system. Investigating this effect would require a
                                                                       http://dx.doi.org/10.1007/978-3-319-28050-9
more longitudinal study, where users are invited to a personal-
ized library at multiple moments, to see whether and how the        8. Asela Gunawardana and Guy Shani. 2015. Evaluating
different approaches are affected by the cold start.                   Recommender Systems. Springer US, Boston, MA,
                                                                       265–308. DOI:
Apart from the drawback of a low number of participants for            http://dx.doi.org/10.1007/978-1-4899-7637-6_8
calculating relevance predictions, the low number also limited
the statistical power of our statistical analysis of the effects    9. J. R. Hauser, G. L. Urban, G. Liberali, and M. Braun.
of personalization. While young parents are active on the              2009. Website Morphing. Marketing Science 28, 2 (mar
internet, they are hard to approach. In the current study we did       2009), 202–223. DOI:
not manage to detect effects of personalization on reading be-         http://dx.doi.org/10.1287/mksc.1080.0459
havior and only differences between some of the experimental       10. Daniel Kluver. 2012. How Many Bits Per Rating ?
conditions. The effects caused by the personalization might            Proceedings of the 6th ACM conference on Recommender
have been smaller than the statistical power of our analysis           systems - RecSys ’12 (2012), 99–106. DOI:
allows us to detect. Conducting a study with more participants         http://dx.doi.org/10.1145/2365952.2365974
would allow us to detect these possibly smaller effects.
                                                                   11. Bart P. Knijnenburg and Martijn C. Willemsen. 2015.
In conclusion, the current paper demonstrates that measuring           Evaluating Recommender Systems with User
psychological traits for the sake of personalization is worth-         Experiments. In Recommender Systems Handbook.
while and might well lead to increased user satisfaction, but          Springer US, Boston, MA, 309–352. DOI:
additional work is needed to establish under which conditions          http://dx.doi.org/10.1007/978-1-4899-7637-6_9
this approach is valuable.
                                                                   12. Yehuda Koren. Factorization Meets the Neighborhood : a
REFERENCES                                                             Multifaceted Collaborative Filtering Model. (????). DOI:
 1. B Arnott and Amy Brown. 2013. An Exploration of                    http://dx.doi.org/978-1-60558-193-4/08/08
    Parenting Behaviours and Attitudes During Early Infancy:       13. Yehuda Koren, Robert Bell, and Chris Volinsky. 2009.
    Association with Maternal and Infant Characteristics.              Matrix Factorization Techniques for Recommender
    Infant and Child Development 22 (2013), 349–361. DOI:              Systems. IEEE Computer (2009), 42–49.
    http://dx.doi.org/10.1002/icd.1794
                                                                   14. Robert R. McCrae, Jr. Paul T. Costa, and Thomas A.
                                                                       Martin. 2005. The NEOâĂŞPIâĂŞ3: A More Readable
   Revised NEO Personality Inventory. Journal of              18. Andrew I Schein, Alexandrin Popescul, Lyle H Ungar,
   Personality Assessment 84, 3 (2005), 261–270. DOI:             and David M Pennock. 2002. Methods and metrics for
   http://dx.doi.org/10.1207/s15327752jpa8403_05 PMID:            cold-start recommendations. Proceedings of the 25th
   15907162.                                                      annual international ACM SIGIR conference on Research
                                                                  and development in information retrieval SIGIR 02 46,
15. Bamshad Mobasher. 2007. Data mining for web
                                                                  Sigir (2002), 253–260. DOI:
    personalization. The adaptive web (2007), 90–135. http:
                                                                  http://dx.doi.org/10.1145/564376.564421
    //link.springer.com/chapter/10.1007/978-3-540-72079-9
                                                              19. K. R. Venugopal, K. G. Srinivasa, and L. M. Patnaik.
16. Stephanie L Prady, Kathleen Kiernan, Lesley Fairley,
                                                                  2009. Algorithms for Web Personalization. Springer
    Sarah Wilson, and John Wright. 2014. Self-reported
                                                                  Berlin Heidelberg, Berlin, Heidelberg, 217–230. DOI:
    maternal parenting style and confidence and infant
                                                                  http://dx.doi.org/10.1007/978-3-642-00193-2_10
    temperament in a multi-ethnic community: Results from
    the Born in Bradford cohort. Journal of Child Health      20. Tiange Zhao. 2016. Investigating the relationship
    Care 18, 1 (2014), 31–46. DOI:                                between parenting beliefs and parenting practice for
    http://dx.doi.org/10.1177/1367493512473855                    in-app personalization. Master thesis. Eindhoven
                                                                  University of Technology.
17. Steffen Rendle, Wolf Huijsen, and Karen Tso-Sutter.
                                                                  https://pure.tue.nl/ws/files/46944250/855031-1.pdf
    2008. State-of-the-art Recommender Algorithms.
    Technical Report. www.mymediaproject.org                  21. Zheng Zhu. 2011. Improving Search Engines via
                                                                  Classification. Ph.D. Dissertation.