=Paper=
{{Paper
|id=Vol-2068/humanize8
|storemode=property
|title=Personalizing a Parenting App: Parenting-Style Surveys Beat Behavioral Reading-Based Models
|pdfUrl=https://ceur-ws.org/Vol-2068/humanize8.pdf
|volume=Vol-2068
|authors=Mark Graus,Martijn Willemsen,Chris Snijders
|dblpUrl=https://dblp.org/rec/conf/iui/GrausWS18
}}
==Personalizing a Parenting App: Parenting-Style Surveys Beat Behavioral Reading-Based Models==
Personalizing an Online Parenting Library: Parenting-Style Surveys Outperform Behavioral Reading-Based Models

Mark P. Graus, Martijn C. Willemsen, Chris C. P. Snijders
Eindhoven University of Technology, 5600 MB Eindhoven, the Netherlands
m.p.graus@tue.nl, m.c.willemsen@tue.nl, c.c.p.snijders@tue.nl

ABSTRACT
The present study set out to personalize a digital library aimed at new parents by reordering articles to match users' inferred interests. The interests were inferred from reading behavior as well as from parenting styles measured through surveys. As prior research has shown that parenting styles are related to how parents take care of their children, these styles are likely to be related to what content a parent is interested in. The present study compared personalization based on parenting styles against other types of personalization.

We conducted a user study with 106 participants, in which we compared the effects of four different approaches of personalization on our users' reading behavior and user experience: a non-personalized baseline, personalization based on reading behavior, personalization based on parenting styles measured through surveys, and a hybrid personalization based on both reading behavior and parenting styles. We found that while reading behavior was not significantly influenced by the different types of personalization, participants had a better user experience with our survey-based approach: they perceived a higher level of personalization and reported a higher satisfaction with the system, even though in terms of objective metrics this approach performed worse.

ACM Classification Keywords
H.5.2 Information Interfaces and Presentation (e.g., HCI): User Interfaces; H.3.3 Information Storage and Retrieval: Information Search and Retrieval

Author Keywords
Personalization; Parenting; User Experience; Cold Start; Psychological Traits; Psychological Models; User Models

©2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. HUMANIZE '18, March 11, 2018, Tokyo, Japan.

INTRODUCTION
Becoming a parent is for many a big challenge in life. New parents have to get used to a new set of responsibilities and have to learn a whole new set of care-taking skills, ranging from practical ones (such as changing diapers) to more emotional ones (such as recognizing and reacting to a child's emotions). There are numerous ways to acquire these skills: parents can get advice from relatives, or alternatively rely on vast amounts of books, websites, videos, and other types of media.

Parents have different styles of parenting, and as such some topics may be very relevant to one parent while being completely irrelevant to another. In this sense, helping parents find their way in content related to the parenting domain is similar to personalization areas such as movie or book recommendation. A challenge in personalizing content on parenting is that first-time parents have to find their own way in a domain that is completely new to them. Parents may not yet have a clear view of the range of alternative ways of taking care of a child that match their styles. It might not be easy for them to judge what content is relevant, and they might read content that is not in line with their parenting styles or interests. As such, there might be a discrepancy between what content new parents read and what is actually relevant to them. As a result, personalization based on reading behavior (as is common) might not provide the desired results.

An additional challenge is that parenting is an activity people are very committed to and about which they hold strong beliefs. As a result, new parents might find certain types of content extremely irrelevant, to the point of being offended. A mother who struggled with and eventually gave up breastfeeding might be hurt by receiving unwanted breastfeeding advice. Being wrong in personalization in this domain thus has a bigger impact than in other domains.

We aim to help parents find relevant content by personalizing a digital library of information articles on parenting. Because the content is aimed at new parents, we think a discrepancy can exist between reading behavior and reading interests, and parenting styles measured through surveys might provide more reliable information for predicting reading interests. To investigate this, we personalize a library using both behavior data and survey data.

Research Question and Hypotheses
The current paper aims to investigate how a library comprising articles on parenting can be improved by personalizing the order in which the articles are presented¹. A screenshot of the library interface can be found in Figure 1. In addition, the paper investigates if and how parenting styles can contribute to this personalization. The main research question thus is: "How does personalization based on parenting styles compare to personalization based on reading behavior in terms of user behavior and user experience?"

¹ The content and the design of the library were taken from Philips' uGrow app, available for iOS at https://itunes.apple.com/app/ugrow-healthy-baby-development/id1063224663 and for Android at https://play.google.com/store/apps/details?id=com.philips.cl.uGrowDigitalParentingPlatform

We try to answer this research question by investigating the effects of personalization based on survey responses measuring parenting styles (explained in the Parenting Styles subsection below) and of more conventional ways of personalization that rely on behavior data. Specifically, we compare the effects of survey-based personalization with personalization based on reading behavior, personalization based on both reading behavior and survey responses, and a non-personalized baseline. We are interested in the effects of this personalization both in terms of influenced behavior (e.g. does personalization based on surveys increase the number of articles users read?) and in terms of user experience (e.g. does personalization based on surveys result in a higher satisfaction with the digital library?). To investigate the effects on the user experience we adopted the user-centric evaluation framework for personalized systems by Knijnenburg and Willemsen [11] and designed a UX survey with items aimed at measuring different aspects of the user experience.

With the survey we aimed to measure three aspects of the user experience specifically and formulated survey items to do so: the perceived level of personalization ("The library shows articles I find interesting"), system satisfaction ("It was easy to find relevant/interesting articles"), and reading satisfaction ("I enjoyed reading the items I read"). We hypothesize that the different ways of personalization influence the perceived level of personalization. A higher perceived level of personalization should lead to a higher system satisfaction, which should lead to a higher reading satisfaction. The higher system satisfaction is also expected to increase the amount of reading by the user.
Figure 1. uGrow 'My Articles' page.

In terms of improving user satisfaction and increasing reading behavior we hypothesize the following order of the different personalizations, from worst to best:

• The non-personalized library
• The library personalized based on just reading behavior
• The library personalized based on just survey responses
• The library personalized based on both reading behavior and survey responses

The remainder of this section introduces the theoretical background on which this study is based.

Personalization
Personalization is the process of altering a system to fit the needs and/or preferences of an individual [15]. Examples of personalization can be found on numerous websites, for example in the form of recommendations on Amazon, or as filters on social media feeds such as Twitter and Facebook. In general, the goal is to alter a system in a way that it caters to the individual needs of a user in order to influence user behavior or user experience. A typical goal of influencing behavior is to make users consume more content in a media browsing system or purchase more items in a webshop, while a typical goal of influencing user experience is to make it easier for users to reach their goals.

Personalization can be implemented in many different ways [19], but the most widely adopted methods rely on historical data describing how users interact with a system, and combine these data across users to make predictions on what content a user will find relevant. The system is subsequently altered so that the user is exposed to more of the content they are likely to find relevant.
A standard problem related to this approach to personalization is the cold start problem [18]. More specifically, three cold start problems exist: the system cold start, the user cold start, and the item cold start. The system cold start occurs when not enough data are available within the system as a whole to make predictions. The user and item cold start occur when there are not enough interaction data available for, respectively, the user or the item, so that no predictions can be made for that user or item.

In the context of parenting an additional challenge occurs. Apart from being new to a system, (some) parents are also new to being parents, and they might find it hard to identify what content is relevant to them. This can result in a mismatch between the content they read and the content they are actually interested in. In systems in which user evaluations of content are not tracked explicitly, assuming that content is appreciated because it was read may well lead to inaccurate predictions about user preferences. Because of this, a library aimed at parents might benefit from relying on other types of data for personalization.

Parenting Styles
Zhao [20] performed a literature review on parenting research with the goal of understanding how scholars operationalize and measure parenting styles. Zhao was in particular interested in how parenting styles relate to actual care-taking behavior, and as such the review was primarily focused on research that comprised both questionnaires and a behavioral aspect. She found that parenting as a whole is a combination of cognitive factors, the physical task of taking care of a baby, and the interplay between the two (cf. [2]). Zhao in addition found that researchers conceptualize parenting styles as individual differences along two cognitive dimensions: structure (i.e. how important parents think structure is for their children) and attunement (i.e. how much parents value reacting to a child's needs and how able they are at reading those needs) [1, 3, 16]. Prototypical parenting styles are the resulting combinations of scores along these two dimensions (high attunement/high structure, high attunement/low structure, low attunement/high structure, and low attunement/low structure). Other cognitive factors that have been identified in the literature to play a role are parental distress, perceived self-efficacy, and the perceived difficulty of the child.

The cognitive factors allegedly have an interplay with how parents actually take care of their children. To validate these parenting styles and investigate how they relate to care-taking behavior, Zhao [20] conducted a survey study in which she measured parenting styles and asked respondents to self-report on how they take care of their children. The analysis of the survey data showed support for the conceptualization of parenting styles along the previously mentioned dimensions of structure and attunement. In addition, it showed that parenting styles are related to the actual care-taking behavior of parents. For example, parents scoring low on attunement are less likely to engage in breastfeeding and more likely to opt for bottle-feeding. As parenting styles are related to how parents take care of their children, they are likely to be useful predictors of what type of content parents are interested in. For example, parents that find structure important put their children to bed at a fixed bedtime instead of waiting for the child to become sleepy. As a result they might be more conscious of the fact that their child does not fall asleep easily, and will thus be more interested in content on how to get a baby to sleep well than people that value flexibility over structure and wait for their child to get sleepy.
Personalization and Psychological Traits
Many psychological traits have been incorporated in personalization applications. Hauser et al. [9] personalized an online tool to compare contracts for mobile phones based on cognitive styles (i.e. the way in which individuals prefer to process information) and showed that providing users information in a way that matches their cognitive style (e.g. textual versus visual information) increases buying propensity. Germanakos and Belk [7] found that adapting an online learning environment to the working memory capacity of its students resulted in higher test scores.

Similarly, Fernández-Tobías et al. [5] showed that incorporating personality in collaborative filtering algorithms allowed them to better predict recommendations across domains (e.g. recommending movies based on someone's music listening behavior). They did this by extending the SVD++ algorithm [12], an algorithm used to predict the ratings that users will assign to items. Fernández-Tobías et al. used a part of the myPersonality dataset² comprising 160k users and in total just over 5 million likes over 16k items (consisting of books, movies, or music artists). The personality traits (the five factor model with the traits openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism [14]) were available for all users and were used to predict likes. Their results showed that incorporating the personality information substantially improved the extent to which likes on Facebook could successfully be predicted.

² Available from http://mypersonality.org/wiki/doku.php?id=download_databases

These studies demonstrate that personalization can benefit from considering and incorporating personal characteristics (such as personality traits or cognitive styles). In the case of parenting, parenting styles are psychological traits that are likely to play a role in what content parents find relevant. In the present study we measure parenting styles and subsequently use them for personalizing the online library.

STUDY DESIGN
To investigate our research question we designed a user study that consisted of two main parts: a first part aimed at collecting initial data to be used for personalizing the "My Articles" page, and a second part aimed at investigating the effects of the different ways of personalization on reading behavior and user experience.

During the first part, participants were asked to complete a survey to measure their parenting styles, after which they were invited to browse the non-personalized library (i.e. a library with a fixed order of articles). The responses to the surveys were stored for later personalization. The information regarding what articles participants read during the browsing phase was used for personalization based on reading behavior.
The order of articles for the second part of the study was calculated in one of four ways (described in more detail below). For each participant we selected at random which set of predictions was used to personalize the library. In the second part of the study the participants were re-invited to interact with their now personalized digital library. Subsequently, participants evaluated their experience with the system through our UX survey. We first report on the initial phases of the study.

Initial Data Collection: Survey and Reading Behavior
We implemented the online library on a website that was accessible through browsers on computers and mobile phones. We recruited participants through posts in online forums dedicated to parenting and through Facebook ads targeting parents in the United Kingdom and United States with children younger than two years old. In total 234 parents clicked on the link to participate in the study. All participants that completed the entire study were compensated with $4.50 or £3.50 of shopping credit for amazon.com or amazon.co.uk. The ad campaign and data collection took place in May and June 2017.

The first part of the study consisted of two steps. In the first step people were asked to complete the survey to measure parenting styles. After completing the survey, the participants were presented with the digital library and invited to browse through it and read the articles they were interested in. The participants were invited to read as many articles as they wanted for as long as they wanted, and to click a link labeled "I've finished reading" once they felt they had read enough. After clicking this link participants were asked to submit their email address for the second step of the study.

In total 181 participants completed the survey (15 men/166 women, 99 first-time parents, with an average age of the baby of 11.39 (SD: 7.96) months). On average the whole session lasted just over 6 minutes (378 seconds, SD: 279.80 seconds). The survey consisted of 15 items from the original survey of Zhao [20]. For each of the five cognitive factors (structure, attunement, maternal self-efficacy, parental distress, and perceived difficulty of the child) we selected the two items with the most extreme factor loadings. We added items concerning the demographics of the parent (gender, level of education, whether they were first-time parents) and child (gender, age) that had shown large effects on the self-reported behavior in the original analysis. The factor scores for our participants were calculated using the factor loadings from the original survey and are displayed in Figure 2. These scores show similar distributions and correlations as the factors in the original survey.

Figure 2. Distributions of the 5 factor scores measured through the first survey.

The interface of our library was made to have the look and feel of the original library (see Fig. 1) as much as possible. As in the original interface, the articles are subdivided in categories that are displayed in rows. Within the category rows the articles are displayed horizontally. The user is able to scroll up and down between categories and left and right within categories to the different articles. As in the original interface, the order of articles and categories was fixed: every participant had exactly the same order of categories and articles.

The initial part of the data collection was concluded by offering the participants to freely browse the online library. Participants opened on average 2.23 articles (SD: 3.37 articles) from 1.25 categories (SD: 1.51 categories). These data and the survey responses were used to calculate relevance predictions for the individual participants.

CALCULATING RELEVANCE PREDICTIONS
Based on the data collected in the first step of the study we calculated per participant four different relevance rank predictions for all articles. As a baseline we used the non-personalized general Top-N. The three other ways of predicting differed in what data from the first step were used. A survey-based ordering was based on the survey responses of the participants and on reading behavior at the aggregated level. A reading-based ordering used only data regarding the articles that people had read in the first step. Finally, a hybrid ordering used both the survey responses and the individual reading behavior. The way these orderings were calculated is described in the following sections.

Survey-Based Predictions
We used the survey responses collected in the first step to predict the relevance of the different articles for the participants in our study. To do this, the participants were subdivided in segments by performing median splits on the two cognitive factors attunement and structure. The user segment was then defined as the combination of these two scores, resulting in four segments. We considered incorporating the three other factors measured in the study (self-efficacy, parental distress, and perceived difficulty of the child), but given the number of users in our dataset, adding additional factors resulted in segments that became too small.
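To make this segmentation step concrete, the following minimal sketch shows one way to derive the four segments via median splits and to rank categories by popularity within each segment. It is our illustration, not the study's code, and assumes pandas data frames with the column names given below.

<pre>
# Minimal sketch of the survey-based segmentation: median splits on the
# attunement and structure factor scores yield four parenting-style segments,
# and categories are ranked by read counts within each segment.
# Illustrative only; all names are ours, not taken from the study's code.
import pandas as pd

def assign_segment(scores: pd.DataFrame) -> pd.Series:
    """scores: one row per user (index = user id), columns 'attunement'
    and 'structure' holding the factor scores."""
    high_att = scores["attunement"] >= scores["attunement"].median()
    high_str = scores["structure"] >= scores["structure"].median()
    return (high_att.map({True: "high attunement", False: "low attunement"})
            + "/" + high_str.map({True: "high structure", False: "low structure"}))

def category_order_per_segment(reads: pd.DataFrame, segment: pd.Series) -> dict:
    """reads: columns 'user' and 'category', one row per article read.
    Returns, per segment, the categories sorted from most to least read."""
    reads = reads.assign(segment=reads["user"].map(segment))
    return {seg: grp["category"].value_counts().index.tolist()
            for seg, grp in reads.groupby("segment")}
</pre>

Within each segment, the articles inside a category would then simply be ordered by overall popularity, as described above.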
The assumption is that the first is preferred over the second. Sampling a large number of pairs per user, results in a ranking that can be used in matrix factorization and the resulting model then calculates a relative relevance score instead of a rating. Figure 3. Article Categories Ranked on Popularity per Segment Hybrid Predictions The BPRMF algorithm was extended to combine reading behavior and the individual parents’ user attributes inferred As the participants read on average just over 2 articles, there from the survey for the calculation of hybrid predictions. The was not enough data to show differences on the level of indi- BPRMF algorithm was extended similarly to how Fernandez- vidual articles (i.e. articles were not read often enough to allow Tobias et al. [5] extended the SVD++ [12] algorithm to incor- for enough variance), but participants from different segments porate personality in predictions. did prefer different categories, as can be seen in Figure 3. When investigating these predictions, the popularity order for Where the original BPRMF algorithm uses two matrices P and these categories seems to make sense intuitively. For example, Q to calculate predictions, our user attribute aware BPRMF the breastfeeding category is predicted to be more popular for algorithm uses a third matrix Y . Y describes the user attributes segments with high attunement, which is congruent with the on the same k latent features the users and articles are ex- relationship with breast-feeding and high attunement in the pressed in. In our case we used high and low scores for the original survey [20]. five cognitive factors from our parenting style survey as user at- tributes. We decided again to use the median splits per factors As a result we decided to sort the categories based on the to assign each user a high or low score for each factor in order attunement-structure segment and sort the articles within each to prevent overfitting. Every user has thus 5 user attributes and category based on general popularity. That is, the survey-based the relevance predictions are similar to the original BPRMF predictions only personalized the order of the categories, not algorithm with an additional matrix in which user attributes the articles within each category. We tried basing segments on are represented. The predicted relevance is then calculated other factors than attunement and structure, but the resulting according to equation 2. predictions were not as easily interpretable as the predictions based on these segments. ! Reading-Based Predictions r̂ui = qi ∗ pu + ∑ ya (2) For the conditions based on reading behavior alone, we used a∈A(u) the Bayesian Personalized Ranking Matrix Factorization (or BPRMF) algorithm implemented in MyMediaLite [6, 17] to This model is fit using stochastic gradient descent. Each itera- predict relevance. BPRMF is an extension to classic matrix tion consists of two steps. In the first step the P and Q matrices factorization [13] that allows it to calculate recommendations are fit, while leaving the Y matrix constant. In the second step from positive only feedback instead of rating data. the Y matrix is fit, while leaving the P and Q matrix con- stant. We implemented this algorithm in the MyMediaLite Conventional matrix factorization attempts to complete the library [6]. matrix R with dimensionality of U (number of users) and I (number of items). In this matrix the cells represent ratings Calculated Relevance Predictions the user has given to the corresponding item. 
Hybrid Predictions
The BPRMF algorithm was extended to combine reading behavior with the individual parents' user attributes inferred from the survey for the calculation of hybrid predictions. The BPRMF algorithm was extended similarly to how Fernández-Tobías et al. [5] extended the SVD++ algorithm [12] to incorporate personality in predictions.

Where the original BPRMF algorithm uses two matrices P and Q to calculate predictions, our user-attribute-aware BPRMF algorithm uses a third matrix Y. Y describes the user attributes on the same k latent features the users and articles are expressed in. In our case we used high and low scores on the five cognitive factors from our parenting-style survey as user attributes. We decided again to use median splits per factor to assign each user a high or low score for each factor, in order to prevent overfitting. Every user thus has 5 user attributes, and the relevance predictions are similar to the original BPRMF algorithm with an additional matrix in which the user attributes are represented. The predicted relevance is then calculated according to Equation 2, where A(u) is the set of attributes of user u.

r̂_ui = q_i · (p_u + Σ_{a∈A(u)} y_a)    (2)

This model is fit using stochastic gradient descent. Each iteration consists of two steps: in the first step the P and Q matrices are fit while leaving the Y matrix constant; in the second step the Y matrix is fit while leaving the P and Q matrices constant. We implemented this algorithm in the MyMediaLite library [6].
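Equation 2 only changes how a user is represented at prediction time: the user vector is shifted by the vectors of the user's active attributes. A hedged sketch of the scoring and the alternating fitting scheme, in our own notation (the actual model lives in MyMediaLite):

<pre>
# Sketch of the user-attribute-aware scoring of Equation 2. Y holds one
# k-dimensional vector per attribute (high/low on each of the five factors);
# attrs[u] lists the attribute indices active for user u. Illustrative only.
import numpy as np

def predict_relevance(P, Q, Y, attrs, u, i):
    return Q[i] @ (P[u] + Y[attrs[u]].sum(axis=0))

# The fitting alternates between the two parameter groups, roughly:
#   repeat:
#     1. SGD steps on sampled (u, i, j) pairs updating P and Q, Y held fixed
#     2. SGD steps on the same pairwise objective updating Y, P and Q held fixed
</pre>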
Calculated Relevance Predictions
In total the dataset contained 221 users and 508 reads³. For each user, predictions using the four methods described above were calculated. The predictions were then sorted in two steps. First, the 7 categories were ordered based on the article with the highest predicted relevance (a strategy called min-rank that has been shown to work well in similar circumstances [4, 21]). Within the categories the articles were ordered based on predicted relevance.

³ We included reading data from a pilot study to ensure we had enough data to calculate predictions.

The algorithms for the reading-based and hybrid predictions required the tuning of a set of regularization hyperparameters, which we carried out using Bayesian optimization. The Bayesian optimization was performed using 5-fold cross validation, with AUC as the target measure. Once optimal values for the hyperparameters were established, the predictive models were constructed and the predictive performance of the reading-based and hybrid recommendations was likewise investigated through 5-fold cross validation. Table 1 shows these performance metrics under the column '5-fold Cross Validation'. The performance metrics appeared to be adequate⁴. However, as the baseline, reading-based, and hybrid predictions are calculated at the level of individual articles, they cannot be easily compared to the survey-based predictions, which are calculated first at the category level and only then, within the categories, at the individual article level. In order to make a fair comparison, we performed a post-hoc analysis by recalculating the performance metrics for the sets of recommendations to correspond to the survey-based predictions. We did this by calculating the lists of recommendations and sorting all lists first by category, based on the minimum predicted rank of the articles within each category, and subsequently sorting the articles within their categories based on the predicted relevance of the individual articles. We then calculated performance metrics using the actual reading behavior as ground truth. The outcome of these recalculations can be found under the columns 'Post-hoc Comparison' in Table 1.

⁴ An overview of the different metrics and how to interpret them can be found in [8].

              5-fold Cross Validation          Post-hoc Comparison
algorithm   AUC    prec@5  prec@10  NDCG     AUC    prec@5  prec@10  NDCG
baseline    0.840  0.083   0.065    0.424    0.706  0.146   0.104    0.477
survey      -      -       -        -        0.650  0.060   0.062    0.353
reading     0.832  0.079   0.061    0.411    0.767  0.176   0.114    0.522
hybrid      0.769  0.080   0.059    0.404    0.807  0.214   0.126    0.561

Table 1. Performance metrics calculated through 5-fold cross validation and a post-hoc performance analysis.

These numbers indicate that the hybrid predictions are the most accurate, followed by the reading-based predictions, the survey-based predictions, and finally the non-personalized baseline. Based on these metrics we would expect the hybrid predictions to be most in line with what participants will read, and the survey-based predictions least. This order is different from the order in the k-fold cross validation metrics because no k-fold cross validation could be applied in the comparison with the survey-based recommendations (i.e. the train and test set were identical).
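The min-rank sorting and the post-hoc evaluation can be expressed compactly. The sketch below is our illustration with hypothetical inputs: it orders categories by their best-ranked article and computes precision@n against the observed reads.

<pre>
# Sketch of the min-rank ordering [4, 21] and a precision@n check against
# actual reads. Inputs are hypothetical stand-ins for the study's data.
def min_rank_order(article_rank, categories):
    """article_rank: article -> predicted relevance rank (1 = most relevant).
    categories: category -> list of its articles.
    Categories are sorted by their best-ranked article; articles within a
    category are sorted by their own predicted rank."""
    cats = sorted(categories,
                  key=lambda c: min(article_rank[a] for a in categories[c]))
    return [a for c in cats
            for a in sorted(categories[c], key=article_rank.__getitem__)]

def precision_at_n(ordered_articles, read_articles, n=5):
    """Fraction of the first n recommended articles the user actually read."""
    return sum(a in read_articles for a in ordered_articles[:n]) / n
</pre>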
The results are The categories were ranked based on the minimum predicted shown in Figure 4 and they reveal that the available reading data does not allow personalization that differs a lot from the 4 An overview of the different metrics and how to interpret them can baseline condition (as the correlation between reading-based be found in [8]. and baseline is 0.91 on average). Personalization based on the survey-based predictions is quite different from the baseline predictions, with an average correlation of 0.37. The hybrid predictions fall somewhere in between the reading-based and survey-based predictions with a correlation of 0.74. These numbers indicate that the additional data of parenting styles allows for personalization that deviates more from the baseline than personalization based on reading behavior alone. One possible explanation of the reading-based personalization not differing much from the baseline is insufficient data. As there are a limited number of users (221 users, see Section 2.1), that read a limited number of articles (2.44 articles on average) from a library with a limited number of articles (102) that was presented in a fixed order. As such the dataset might not con- tain enough variance between users’ reading behavior to fully Figure 5. Survey Items and Response Distributions. The light-grey items benefit from collaborative filtering. What argues against this have been omitted from the analysis because of poor factor loadings. is the fact that the reading-based and hybrid recommendations appear to outperform the survey-based predictions in terms of prediction accuracy (see Table 1). User Experience As per the user-centric evaluation framework by Knijnenburg Reading Behavior and Willemsen [11] all survey items were submitted to a struc- Participants read on average 2.72 articles (SD: 4.28 articles), tural equation model (SEM). The responses to the individual but 42 participants (39.6%) did not read any articles. The items can be seen in Figure 5, with items belonging to Per- descriptives for the number of article reads per condition are ceived Level of Personalization (pers1-pers4), System Satis- shown in Table 2. The different conditions had no significant faction (syssat1-syssat4), and Reading Satisfaction (readsat1- influence on the number of articles people read, as negative readsat3). The three items for reading satisfaction show very binomial regressions with the condition as independent vari- low variance among each other, which lead to these three items able and the number of reads as dependent variable showed not fitting in the model. This might have been caused by the no significant difference across conditions. This implies that fact that the reading behavior did not differ across conditions no support is found for the hypotheses regarding the effect of as we manipulated only the order in which the articles were our experimental manipulations on how participants interact presented, and not the actual content in the library. Therefore, with their personalized libraries. people were actually able to read the same articles regardless of experimental condition and thus the reading satisfaction might be similar. Apart from the items on Reading Satisfac- tion, two of the remaining items (pers1 and syssat2) explained condition Mean SD min max N little variance and were also removed from the analysis. 
Reading Behavior
Participants read on average 2.72 articles (SD: 4.28 articles), but 42 participants (39.6%) did not read any articles. The descriptives for the number of article reads per condition are shown in Table 2. The different conditions had no significant influence on the number of articles people read: negative binomial regressions with the condition as independent variable and the number of reads as dependent variable showed no significant difference across conditions. This implies that no support is found for the hypotheses regarding the effect of our experimental manipulations on how participants interact with their personalized libraries.

condition   Mean   SD     min  max  N
baseline    2.448  3.501  0    13   29
survey      2.517  4.032  0    16   29
reading     4.273  6.670  0    31   22
hybrid      2.038  2.289  0    9    26

Table 2. Article reads per condition.
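For readers who want to reproduce the count analysis, a hedged sketch with statsmodels is shown below. The data frame is simulated here, since the study's data are not published; only the model specification mirrors the text.

<pre>
# Sketch of the negative binomial check of reads per condition. The data
# are simulated stand-ins; names and seed are ours.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "condition": rng.choice(["baseline", "survey", "reading", "hybrid"], size=106),
    "reads": rng.poisson(2.7, size=106),  # placeholder for the observed counts
})
model = smf.glm("reads ~ C(condition, Treatment('baseline'))",
                data=df, family=sm.families.NegativeBinomial()).fit()
print(model.summary())  # in the study, no condition effect was significant
</pre>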
User Experience
As per the user-centric evaluation framework by Knijnenburg and Willemsen [11], all survey items were submitted to a structural equation model (SEM). The responses to the individual items can be seen in Figure 5, with items belonging to Perceived Level of Personalization (pers1-pers4), System Satisfaction (syssat1-syssat4), and Reading Satisfaction (readsat1-readsat3). The three items for reading satisfaction showed very low variance among each other, which led to these three items not fitting in the model. This might have been caused by the fact that reading behavior did not differ across conditions: we manipulated only the order in which the articles were presented, not the actual content of the library. People were therefore able to read the same articles regardless of experimental condition, and thus reading satisfaction might be similar across conditions. Apart from the items on Reading Satisfaction, two of the remaining items (pers1 and syssat2) explained little variance and were also removed from the analysis.

Figure 5. Survey Items and Response Distributions. The light-grey items have been omitted from the analysis because of poor factor loadings.

Despite the fact that participants did not read a large number of articles, the interface did allow participants to get a general idea of the library by looking at the categories and the article titles. Nevertheless, we do feel that the participants who actually read articles are better able to evaluate the library. To account for this we introduced an additional (dummy) variable labeled 'Read', indicating whether or not people read any articles.

A SEM was constructed using the remaining six survey items measuring two latent constructs (Perceived Personalization and System Satisfaction), with the experimental conditions and the variable describing whether or not people read as exogenous variables. The two latent factors had a high correlation, but the model showed good fit (χ²(36) = 44.447, p = .158, CFI = .984, TLI = .974, RMSEA = .047, 90% CI: [0.000, 0.088]). For each participant we used this model to calculate the scores on these latent factors, which were used for the remainder of the analysis.

As the final model consists of only two latent constructs (Perceived Personalization and System Satisfaction) that are highly correlated, there is no clear underlying structural model left to test. For the analysis we could either combine both factors into one overall latent factor, or analyze both factors separately. We chose the latter, as both factors might still capture different nuances of the user experience despite their high correlation.

We analyzed the effect of our manipulation on the factor scores of both constructs through linear regressions, with the factor scores as dependent variables and the experimental condition as independent variable. As an additional moderator we included the dummy variable representing whether or not people read articles.

The average factor scores per condition for the two measured constructs can be found in Figure 6. The figure shows an increase in both Perceived Personalization and System Satisfaction for the survey-based condition. The effects are higher for the participants that did not read (represented in the red bars) and lower for the participants that did (represented in the green bars).

Figure 6. Marginal effects on Perceived Personalization (top row) and System Satisfaction (bottom row) for the different conditions. Separate bars are shown for participants that read no articles (red) and at least one (green). Scores are standardized: a score of +1 implies 1 SD higher than the baseline (baseline recommendations for a user that did not read). Error bars are one standard error of the mean.

The regression models in Table 3 show these effects as well. Regression model (1) shows a positive and significant effect on Perceived Personalization for participants in the survey-based condition, indicating that these participants felt the library catered more to their interests⁵. The table additionally shows an effect at the p < 0.1 level for an increased perceived level of personalization in the condition with hybrid personalization. Although caution is needed when interpreting this effect, as it is not statistically significant at conventional levels, it describes a trend towards participants experiencing a higher level of personalization with hybrid personalization.

⁵ Because the factor scores are calculated through a structural equation model, they are normally distributed with a mean of 0 and SD of 1. Participants in the condition with survey-based personalization thus had a perceived level of personalization 0.563 SD higher than participants in the baseline.

In terms of System Satisfaction the patterns are slightly different. Participants that received the survey-based personalization were more satisfied with the system, as can be seen in model (2) in Table 3. Model (3) reveals how this effect holds up for participants that read versus participants that did not: it shows a negative interaction effect for participants that received survey-based personalization and read at least one article, which suggests that only the people that do not read any articles actually perceive a higher system satisfaction; for those who do read at least one article the effect is strongly reduced.

                      Perceived Personalization  System Satisfaction
                      (1)                  (2)                  (3)
                      β (SE)               β (SE)               β (SE)
survey                0.563* (0.241)       0.673** (0.243)      1.334*** (0.334)
reading               0.140 (0.260)        0.406 (0.261)        0.479 (0.460)
hybrid                0.438• (0.248)       0.273 (0.249)        -0.030 (0.391)
Read                                                            0.196 (0.328)
survey:Read                                                     -1.279** (0.464)
reading:Read                                                    -0.158 (0.556)
hybrid:Read                                                     0.389 (0.498)
Constant              0.057 (0.171)        0.097 (0.171)        -0.004 (0.236)
Observations          106                  106                  106
R²                    0.062                0.072                0.186
Adjusted R²           0.034                0.045                0.128
Residual Std. Error   0.919 (df = 102)     0.923 (df = 102)     0.882 (df = 98)
F Statistic           2.236 (df = 3; 102)  2.649 (df = 3; 102)  3.200** (df = 7; 98)
Note: • p<0.1; * p<0.05; ** p<0.01; *** p<0.001

Table 3. Regressions of the experimental manipulation and Read on Perceived Level of Personalization and System Satisfaction. The regression coefficients are standardized βs; values between parentheses are standard errors.
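The structure of the Table 3 models corresponds to ordinary least squares with a condition factor and a condition × Read interaction; a sketch with simulated data and illustrative column names:

<pre>
# Sketch of the Table 3 models: factor scores regressed on condition, with
# the 'Read' dummy as moderator. Data are simulated; names are ours.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "condition": rng.choice(["baseline", "survey", "reading", "hybrid"], size=106),
    "read": rng.integers(0, 2, size=106),   # read at least one article?
    "perc_pers": rng.normal(size=106),      # stand-ins for the SEM factor scores
    "sys_sat": rng.normal(size=106),
})
m1 = smf.ols("perc_pers ~ C(condition, Treatment('baseline'))", data=df).fit()
m3 = smf.ols("sys_sat ~ C(condition, Treatment('baseline')) * read", data=df).fit()
print(m1.summary())
print(m3.summary())  # the paper reports a negative survey x Read interaction
</pre>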
In conclusion, support is found for the hypothesis that survey-based personalization outperforms the non-personalized baseline, while no evidence was found that the reading-based and hybrid personalization do so. The lack of effect in terms of reader experience is in line with the comparison of the different predicted rankings in terms of Spearman's rank correlation, which showed a high similarity between the reading-based predictions and the non-personalized baseline. This comparison further showed that the survey-based personalization was most different from the baseline, which is also reflected in the user experience (albeit more strongly for the people that did not read than for the people that did). The hybrid condition falls in between the survey-based and reading-based conditions, and similarly its effects on user experience appear to fall in between the effects of the survey-based and reading-based recommendations.

CONCLUSION AND DISCUSSION
This study set out to compare personalization based on psychological traits measured through a survey to personalization based on reading data. Through a user study we compared different methods against a non-personalized baseline and showed that personalization based on survey information about parenting styles resulted in a significantly higher experienced user satisfaction and perceived level of personalization despite a lower objective performance, whereas using only historical reading behavior or the combination of historical reading behavior and measured parenting styles did not. Our findings speak to the potential usefulness of including data regarding characteristics of users (collected through an initial survey or otherwise) in personalization to alleviate the cold start problem. While the actual reading behavior of users was not influenced, an improved user experience may increase the probability of users returning to the library later on.

The fact that using the survey data for personalization also outperformed the condition where recommendations were based on both survey data and reading behavior is likely caused by the fact that the hybrid recommender - given how we had implemented it - came up with suggestions that were relatively close to the baseline condition. Hybrid predictions that would have assigned more weight to the survey data might have fared better. In any case, we do see that personalization based on surveys captures the interests better, or at least increases the reported user satisfaction, and that it leads to an order of articles that differs more from the baseline than an order based on reading behavior alone.

From a system owner's point of view it is worth noting that the survey-based predictions were very straightforward to calculate and implement compared to the reading-based and hybrid predictions. In addition, after completing the short survey the user can immediately benefit from personalization, whereas both the reading-based and (to a lesser extent) the hybrid predictions require reading behavior from the user before they can be calculated. Admittedly, providing explicit feedback in the form of a survey demands more effort than the implicit feedback provided through the natural interaction of reading. However, the higher user experience suggests there might be a trade-off between the costs of user effort and the benefits of accurate personalization.

Another interesting finding is that the effects of personalization on user experience disappeared as soon as participants started reading articles. A possible explanation for this observation is the number of articles people see in the second part that they had already read in the first part. Seeing articles one has already read may contribute to a higher perceived level of personalization and satisfaction with the library as a whole, while reading these articles might actually be detrimental to the user experience.

Participants in our study interacted with the system twice: once for the initial data collection and a second time for the evaluation. This difference might have led to a discrepancy, as in the first session people were exploring the system and possibly paying attention to other aspects than in the second session. For example, in the first session people were getting used to the way of navigating the library and getting acquainted with the system, and its usability may have been an issue. In the second session, participants are more likely to have evolved past this stage and can focus more on what it is they want to read. This would imply that the data from the first session describe the behavior of participants who are getting to know a system, and as a result models trained on these data will generate recommendations based on what an exploring user typically reads, which may not be appropriate for personalizing a library for a participant who already knows and is actively using the system.
In other words, what looks good might not necessarily be what helps the user, and as such it might be worthwhile to investigate the factors that influence user satisfaction with a personalized system before and after consumption, and to see if and how these differ. From a more general perspective, this raises the question whether and how personalization needs to anticipate possible changes and differences in the perception of recommendations as the user progresses. Alternatively, it might indicate that the process of evaluating personalization differs depending on whether the user is evaluating through observing or through experiencing.

Shortcomings and Future Work
While the findings of this study indicate that using surveys as a basis for personalization can improve personalized systems, the specific application in which we tested our hypotheses might limit the extent to which this finding can be generalized.

As mentioned in the results section, it is unclear how our findings hold up in a setting with a bigger library and more interaction data (both in terms of number of users and in terms of interactions per user). With only 102 articles in a fixed order, the behavior of participants in the initial data collection may not (yet) have differed enough between participants to allow personalization based on reading behavior to produce sufficiently personalized predictions. The fact that these personalizations stayed relatively close to the non-personalized baseline can be interpreted this way. The survey-based recommendations, on the other hand, combined data from users with similar parenting styles and as a result were able to differentiate themselves more from the non-personalized baseline. Having more articles and perhaps also a somewhat longer initial period would allow for behavior with more differences between users, making it possible to more effectively leverage the predictive power and complexity of reading-based personalization, which in turn would provide more insight into the conditions that play a role in how personalization based on behavior compares to personalization based on psychological traits. However, our results show that in this situation with limited reading data, a short survey delivers good data for initial personalization.

In line with the previous argument, it is important to realize that in terms of data per user, our participants only interacted with the system once and read 2.23 articles on average. They might still have been in their cold start phase, and there may not have been enough information about the users' reading behavior to provide useful recommendations. What argues against this is that both the hybrid and reading-based models had higher prediction accuracy than the survey-based recommendations. Given these observations it would be worthwhile to perform a study that controls for the amount of feedback collected from the participants. Having more feedback per participant makes it possible to investigate how the number of interactions per user affects the performance of the different personalization approaches, similar to how Kluver and Konstan [10] investigated the effects of the number of interactions on predictive accuracy.

Apart from the amount of data per user, the amount of data available within the system as a whole may be another factor that plays a role in which method of personalization works best. Evaluating how survey-based and reading-based personalization compare over time, as more data enter the system as a whole or per user, would provide valuable insight into which approach works best when. One could imagine a system that starts out with personalization based on measured psychological traits and transitions into a system based more on behavior, or a hybrid system. Investigating this effect would require a more longitudinal study, where users are invited to a personalized library at multiple moments, to see whether and how the different approaches are affected by the cold start.

Apart from the drawback of a low number of participants for calculating relevance predictions, the low number also limited the statistical power of our analysis of the effects of personalization. While young parents are active on the internet, they are hard to reach. In the current study we did not manage to detect effects of personalization on reading behavior, and we detected only differences between some of the experimental conditions. The effects caused by the personalization might have been smaller than the statistical power of our analysis allows us to detect. Conducting a study with more participants would allow us to detect these possibly smaller effects.

In conclusion, the current paper demonstrates that measuring psychological traits for the sake of personalization is worthwhile and might well lead to increased user satisfaction, but additional work is needed to establish under which conditions this approach is valuable.
REFERENCES
1. B. Arnott and Amy Brown. 2013. An Exploration of Parenting Behaviours and Attitudes During Early Infancy: Association with Maternal and Infant Characteristics. Infant and Child Development 22 (2013), 349–361. DOI: http://dx.doi.org/10.1002/icd.1794
2. Diana Baumrind. 1966. Effects of Authoritative Parental Control on Child Behavior. Child Development 37, 4 (Dec. 1966), 887. DOI: http://dx.doi.org/10.2307/1126611
3. Jay Belsky and Sara R. Jaffee. 2015. The Multiple Determinants of Parenting. In Developmental Psychopathology. John Wiley & Sons, Inc., Hoboken, NJ, USA, 38–85. DOI: http://dx.doi.org/10.1002/9780470939406.ch2
4. Gianluca Demartini, Paul-Alexandru Chirita, Ingo Brunkhorst, and Wolfgang Nejdl. 2008. Ranking Categories for Web Search. In Advances in Information Retrieval. Springer Berlin Heidelberg, Berlin, Heidelberg, 564–569. DOI: http://dx.doi.org/10.1007/978-3-540-78646-7_56
5. Ignacio Fernández-Tobías, Matthias Braunhofer, Mehdi Elahi, Francesco Ricci, and Iván Cantador. 2016. Alleviating the new user problem in collaborative filtering by exploiting personality information. User Modeling and User-Adapted Interaction 26, 2-3 (Jun. 2016), 221–255. DOI: http://dx.doi.org/10.1007/s11257-016-9172-z
6. Zeno Gantner, Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2011. MyMediaLite: A Free Recommender System Library. In Proceedings of the 5th ACM Conference on Recommender Systems (RecSys 2011).
7. Panagiotis Germanakos and Marios Belk. 2016. Human-Centred Web Adaptation and Personalization. Springer International Publishing, Cham. 336 pages. DOI: http://dx.doi.org/10.1007/978-3-319-28050-9
8. Asela Gunawardana and Guy Shani. 2015. Evaluating Recommender Systems. Springer US, Boston, MA, 265–308. DOI: http://dx.doi.org/10.1007/978-1-4899-7637-6_8
9. J. R. Hauser, G. L. Urban, G. Liberali, and M. Braun. 2009. Website Morphing. Marketing Science 28, 2 (Mar. 2009), 202–223. DOI: http://dx.doi.org/10.1287/mksc.1080.0459
10. Daniel Kluver. 2012. How Many Bits Per Rating? In Proceedings of the 6th ACM Conference on Recommender Systems (RecSys '12), 99–106. DOI: http://dx.doi.org/10.1145/2365952.2365974
11. Bart P. Knijnenburg and Martijn C. Willemsen. 2015. Evaluating Recommender Systems with User Experiments. In Recommender Systems Handbook. Springer US, Boston, MA, 309–352. DOI: http://dx.doi.org/10.1007/978-1-4899-7637-6_9
12. Yehuda Koren. 2008. Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08).
13. Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. IEEE Computer (2009), 42–49.
14. Robert R. McCrae, Paul T. Costa, Jr., and Thomas A. Martin. 2005. The NEO-PI-3: A More Readable Revised NEO Personality Inventory. Journal of Personality Assessment 84, 3 (2005), 261–270. DOI: http://dx.doi.org/10.1207/s15327752jpa8403_05. PMID: 15907162.
15. Bamshad Mobasher. 2007. Data mining for web personalization. In The Adaptive Web. Springer, 90–135. http://link.springer.com/chapter/10.1007/978-3-540-72079-9
16. Stephanie L. Prady, Kathleen Kiernan, Lesley Fairley, Sarah Wilson, and John Wright. 2014. Self-reported maternal parenting style and confidence and infant temperament in a multi-ethnic community: Results from the Born in Bradford cohort. Journal of Child Health Care 18, 1 (2014), 31–46. DOI: http://dx.doi.org/10.1177/1367493512473855
17. Steffen Rendle, Wolf Huijsen, and Karen Tso-Sutter. 2008. State-of-the-art Recommender Algorithms. Technical Report. www.mymediaproject.org
18. Andrew I. Schein, Alexandrin Popescul, Lyle H. Ungar, and David M. Pennock. 2002. Methods and metrics for cold-start recommendations. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '02), 253–260. DOI: http://dx.doi.org/10.1145/564376.564421
19. K. R. Venugopal, K. G. Srinivasa, and L. M. Patnaik. 2009. Algorithms for Web Personalization. Springer Berlin Heidelberg, Berlin, Heidelberg, 217–230. DOI: http://dx.doi.org/10.1007/978-3-642-00193-2_10
20. Tiange Zhao. 2016. Investigating the relationship between parenting beliefs and parenting practice for in-app personalization. Master thesis. Eindhoven University of Technology. https://pure.tue.nl/ws/files/46944250/855031-1.pdf
21. Zheng Zhu. 2011. Improving Search Engines via Classification. Ph.D. Dissertation.