=Paper=
{{Paper
|id=Vol-2068/humanize8
|storemode=property
|title=Personalizing a Parenting App: Parenting-Style Surveys Beat Behavioral Reading-Based Models
|pdfUrl=https://ceur-ws.org/Vol-2068/humanize8.pdf
|volume=Vol-2068
|authors=Mark Graus,Martijn Willemsen,Chris Snijders
|dblpUrl=https://dblp.org/rec/conf/iui/GrausWS18
}}
==Personalizing a Parenting App: Parenting-Style Surveys Beat Behavioral Reading-Based Models==
Personalizing an Online Parenting Library: Parenting-Style Surveys Outperform Behavioral Reading-Based Models Mark P. Graus Martijn C. Willemsen Chris C. P. Snijders Eindhoven University of Eindhoven University of Eindhoven University of Technology, IPO 0.20 Technology, IPO 0.17 Technology, IPO 1.[20] 5600 MB Eindhoven, the 5600 MB Eindhoven, the 5600 MB Eindhoven, the Netherlands Netherlands Netherlands m.p.graus@tue.nl m.c.willemsen@tue.nl c.c.p.snijders@tue.nl ABSTRACT have to learn a whole new set of care-taking skills, ranging The present study set out to personalize a digital library aimed from practical (such as changing diapers) to more emotional at new parents by reordering articles to match users’ inferred (such as recognizing and reacting to a child’s emotions). There interests. The interests were inferred from reading behavior are numerous ways to acquire these skills: parents can get as well as parenting styles measured through surveys. As advice from relatives, or alternatively rely on vast amounts of prior research has shown that parenting styles are related to books, websites, videos, and other types of media. how parents take care of their children, these styles are likely Parents have different styles of parenting and as such some to be related to what content a parent is interested in. The topics may be very relevant to a parent, while others are com- present study compared personalization based on parenting pletely irrelevant. In this sense, helping parents find their way styles against other types of personalization. in content related to the parenting domain is similar to person- We conducted a user study with 106 participants, in which we alization areas such as movie or book recommendations. A compared the effects of four different approaches of personal- challenge in personalizing content on parenting is that first- ization to our users’ reading behavior and user experience: a time parents have to find their own way in a domain that is non-personalized baseline, personalization based on reading completely new to them. Parents may not have a clear view behavior, personalization based on parenting styles measured yet on the range of alternative ways of taking care of a child through surveys, and a hybrid personalization based on both that match their styles. It might not be easy for them to judge reading behavior and parenting styles. We found that while the what content is relevant and they might read content that is not reading behavior was not significantly influenced by different in line with their parenting styles or interests. As such, there types of personalization, participants had a better user expe- might be a discrepancy between what content new parents rience with our survey-based approach. They indicated they read and what is actually relevant to them. As a result person- perceived a higher level of personalization and satisfaction alization based on reading behavior (as is common) might not with the system, even though in terms of objective metrics this provide the desired results. approach performed worse. An additional challenge is that parenting is an activity people ACM Classification Keywords are very committed to and about which they hold strong beliefs. As a result, new parents might find certain types of content H.5.2 Information Interfaces and Presentation (e.g., HCI): extremely irrelevant, to the point of being offended. A mother User Interfaces; H.3.3 Information Storage and Retrieval: In- who struggled and eventually gave up breastfeeding might be formation Search and Retrieval hurt by receiving unwanted breastfeeding advice. Being wrong Author Keywords in personalization in this domain thus has a bigger impact than Personalization; Parenting; User Experience; Cold Start; in other domains. Psychological Traits; Psychological Models; User Models We aim to help parents in finding relevant content by person- alizing a digital library of information articles on parenting. INTRODUCTION Because the content is aimed at new parents, we think a dis- Becoming a parent is for many a big challenge in life. New crepancy can exist between reading behavior and reading in- parents have to get used to a new set of responsibilities and terests and parenting styles measured through surveys might provide more reliable information for predicting reading inter- ests. To investigate this, we personalize a library using both behavior data and survey data. Research Question and Hypotheses ©2018. Copyright for the individual papers remains with the authors. The current paper aims to investigate how a library comprising Copying permitted for private and academic purposes. articles on parenting can be improved by personalizing the HUMANIZE ’18, March 11, 2018, Tokyo, Japan order in which the articles are presented1 . A screenshot of what the library interface looked like can be found in Figure. 1. In addition the paper investigates if and how parenting styles can contribute to this personalization. The main research question thus is “How does personalization based on parenting styles compare to personalization based on reading behavior in terms of user behavior and user experience?” We try to answer this research question by investigating the effects of personalization based on survey responses mea- suring parenting styles (explained in Section 1.4) and more conventional ways of personalization that rely on behavior data. Specifically we compare the effects of survey-based per- sonalization with personalization based on reading behavior, personalization based on both reading behavior and survey responses and a non-personalized baseline. We are interested in the effects of this personalization both in terms of influ- enced behavior (e.g. does personalization based on surveys increase the number of articles users read?) and in terms of user experience (e.g. does personalization based on surveys result in a higher satisfaction with the digital library?). To investigate the effects on the user experience we adopted the User-Centric Evaluation Framework for personalized systems by Knijnenburg and Willemsen [11]. We designed a UX sur- vey with items aimed to measure different aspects of the user experience. With the survey we aimed to measure three aspects of the user experience specifically and formulated survey items to do so: the perceived level of personalization (“The library shows arti- cles I find interesting”), the system satisfaction (“It was easy to find relevant/interesting articles”), and reading satisfaction (“I enjoyed reading the items I read”). We hypothesize that the different ways of personalization influence the perceived level of personalization. A higher level of personalization should lead to a higher system satisfaction, which should lead to a higher reading satisfaction. The higher system satisfaction is also expected to increase the amount of reading by the user. Figure 1. uGrow ‘My Articles’ page In terms of improving user satisfaction and increasing reading behavior we hypothesize the following order in the different personalizations, from worst to best: of personalization can be found on numerous websites, for example in the form of recommendations on Amazon, or as • The non-personalized library filters on social media feeds such as Twitter and Facebook. In • The library personalized based on just reading behavior general the goal is to alter a system in a way that it caters to the individual needs of a user to influence user behavior or • The library personalized based on just survey responses user experience. A typical goal of influencing behavior is to make users consume more content in a media browsing system • The library personalized based on both reading behavior or purchase more items in a webshop, while a typical goal of and survey responses influencing user experience is to make it easier for users to The remainder of this section will introduce the theoretical reach their goals. background on which this study is based. Personalization can be implemented in many different ways [19], but the most widely adopted methods rely on his- Personalization torical data describing how users interact with a system, and Personalization is the process of altering a system to fit to combine these data across users to make predictions on what the needs and/or preferences of an individual [15]. Examples content a user will find relevant. The system is subsequently 1 The content and the design of the library were taken from Philips’ altered so that the user is exposed to more of the content he/she uGrow app. Available for iOS https://itunes.apple.com/app/ is likely to find relevant. ugrow-healthy-baby-development/id1063224663 and Android https://play.google.com/store/apps/details?id=com.philips. A standard problem related to this approach to personalization cl.uGrowDigitalParentingPlatform is the cold start problem [18]. More specifically, three cold start problems exist: the system cold start, the user cold start, Parenting Styles and the item cold start. The system cold start occurs when Zhao [20] performed a literature review on research on parent- not enough data are available within the system as a whole to ing with the goal of understanding how scholars operationalize make predictions. The user and item cold start occur when and measure parenting styles. Zhao was in particular interested there are not enough interaction data available corresponding in how parenting styles relate to the actual care-taking behav- to respectively the user or the item, so that no predictions can ior and as such, the review was primarily focused on research be made for respectively the user or the item. that comprised both questionnaires and a behavioral aspect. She found that parenting as a whole is a combination of cog- In the context of parenting an additional challenge occurs. nitive factors, the physical task of taking care of a baby, and Apart from being new to a system, (some) parents are also the interplay between the two (cf. [2]). Zhao in addition found new to being parents and they might find it hard to identify that researchers conceptualize parenting styles as individual what content is relevant to them. This can result in a mismatch differences along two cognitive dimensions: structure (i.e. between the content they read and the content that they are how important parents think structure is for their children) and actually interested in. In systems in which user evaluations of attunement (i.e. how much parents value reacting to a child’s content are not being tracked explicitly, assuming that content needs and how able they are at reading those needs) [1, 3, 16]. is appreciated because it was read may well lead to inaccurate Prototypical parenting styles are the resulting combinations predictions about user preferences. Because of this, a library of scores along these two dimensions (high attunement/high aimed at parents might benefit from relying on other types of structure, high attunement/low structure, low attunement/high data for personalization. structure and low attunement/low structure). Other cognitive factors that have been identified in literature to play a role are parental distress, perceived self-efficacy, and the perceived Personalization and Psychological Traits difficulty of the child. Many psychological traits have been incorporated in personal- ization applications. Hauser et al. [9] personalized an online The cognitive factors allegedly have an interplay with how tool to compare contracts for mobile phones based on cogni- parents actually take care of their children. To validate these tive styles (i.e. the way in which individuals prefer to process parenting styles and investigate how they relate to care-taking information) and showed that providing users information in behavior, Zhao [20] conducted a survey study in which she a way that matches their cognitive style (e.g. textual versus measured parenting styles and asked respondents to self-report visual information) increases buying propensity. Germanakos on how they take care of their children. The analysis of the and Belk [7] found that adapting an online learning environ- survey data showed support for the conceptualization of par- ment to the working memory capacity of its students resulted enting styles along the previously mentioned dimensions of in higher test scores. structure and attunement. In addition it showed that parenting styles are related to the actual care-taking behavior of parents. Similarly, Fernandez-Tobias et al. [5] showed that incorporat- For example, parents scoring low on attunement are less likely ing personality in collaborative filtering algorithms allowed to engage in breast-feeding and more likely to opt for bottle- them to better predict recommendations across domains (e.g. feeding. As parenting styles are related to how parents take recommending movies based on someone’s music listening be- care of their children, they are likely to be useful predictors havior). They did this by extending the SVD++ algorithm [12], for what type of content parents are interested in. For example, an algorithm used to predict ratings that users will assign to parents that find structure important put their children to bed items. Fernandez-Tobias et al. used a part of the myPerson- at a fixed bedtime instead of waiting for the kid to become ality dataset2 comprising 160k users and in total just over 5 sleepy. As a result they might be more conscious of the fact million likes over 16k items (consisting of books, movies or that their child does not fall asleep easily and will thus be more music artists). The personality traits (the five factor model interested in content on how to get a baby to sleep well than with the traits openness to experience, conscientiousness, ex- people that value flexibility over structure and wait for their traversion, agreeableness and neuroticism [14]) were available child to get sleepy. for all users and were used to predict likes. Their results showed that incorporating the personality information substan- tially improved the extent to which likes on Facebook could STUDY DESIGN successfully be predicted. To investigate our research question we designed a user study that consisted of two main parts, with a first part aimed at These studies demonstrate that personalization can benefit collecting initial data to be used for personalizing the “My from considering and incorporating personal characteristics Articles” page and a second part aimed at investigating the (such as personality traits or cognitive styles). In the case effects of the different ways of personalization on the reading of parenting, parenting styles are psychological traits that are behavior and user experience. likely to play a role in what content parents find relevant. In the present study we measure parenting styles and subsequently During the first part, participants were asked to complete a use them for personalizing the online library. survey to measure their parenting styles, after which they were invited to browse the non-personalized library (i.e. a library with a fixed order of articles). The responses to the 2 Available from http://mypersonality.org/wiki/doku.php?id= surveys were stored for personalization later. The information download_databases regarding what articles participants read during the browsing phase was used for personalization based on reading behavior. The order of articles for the second part of the study was calculated in one out of four ways (described in more detail in section 3). For each participant we selected at random which set of predictions was used to personalize the library. In the second part of the study the participants were re-invited to interact with their now personalized digital library. Sub- sequently, participants evaluated their experience with the system through our UX survey. We first report on the initial phases of the study. Initial Data Collection: Survey and Reading Behavior We implemented the online library on a website that was acces- sible through browsers on computers and mobile phones. We recruited participants through posts in online forums dedicated to parenting and through Facebook ads targeting parents in the United Kingdom and United States with children younger than two years old. In total 234 parents clicked on the link to participate in the study. All participants that completed the en- tire study were compensated with $4.50 or £3.50 of shopping credit for amazon.co.uk or amazon.com. The ad campaign and data collection took place in May and June 2017. Figure 2. Distributions of the 5 factor scores measured through the first survey. The first part of the study consisted of two steps. In the first part people were asked to complete the survey to measure parenting styles. After completing the survey, the participants The initial part of the data collection was concluded with were presented with the digital library and invited to browse offering the participants to freely browse the online library. through it and read the articles that they were interested in. Participants opened on average 2.23 articles (SD: 3.37 articles) The participants were invited to read as many articles as they from 1.25 categories (SD: 1.51 categories). These data and the wanted for as long as they wanted and to click a link labeled survey responses were used to calculate relevance predictions “I’ve finished reading” once they felt like they read enough. for the individual participants. After clicking this link participants were asked to submit their email address for the second step of the study. CALCULATING RELEVANCE PREDICTIONS In total 181 participants completed the survey (15 men/166 Based on the data collected in the first step of the study we cal- women, 99 first time parents, with an average age of the baby culated per participant four different relevance rank predictions 11.39 (SD: 7.96) months). On average the whole session for all articles. As a baseline we used the non-personalized lasted just over 6 minutes (378 seconds, SD: 279.80 seconds). General Top-N. The three other ways of predicting differed The survey consisted of 15 items of the original survey of in what data from the first step were used. A survey-based Zhao [20]. For the five cognitive (structure, attunement, ma- ordering was based on the data from the survey responses ternal self-efficacy, parental disstress and perceived difficulty of the participants and on reading behavior at the aggregated of the child) we selected per factor the two items with most level. A reading-based ordering used only data regarding the extreme factor loadings. We added items concerning the de- articles that people had read in the first step. Finally, a hybrid mographics of the parent (gender, level of education, whether ordering used both the survey responses and the individual they were first time parents) and child (gender, age) that had reading behavior. The way these orderings were calculated is had large effects on the self-reported behavior in the original described in the following sections. analysis. The factor scores for our participants were calculated by using the factor loadings from the original survey and are Survey-Based Predictions displayed in Figure 2. These scores show similar distributions and correlations as the factors in the original survey. We used the survey responses collected in the first step to predict relevance of the different articles for the participants The interface of our library was made to have the look and in our study. To do this, the participants were subdivided feel of the original library (see Fig. 1) as much as possible. in segments, by performing median splits on the 2 cognitive As in the original interface, the articles are subdivided in factors: attunement and structure. The user segment was then categories that are displayed in rows. Within the category rows defined to be the combination of these two scores, resulting the articles are displayed horizontally. The user is able to scroll in four segments. We considered incorporating the three other up and down to different categories and left and right within factors measured in the study (self-efficacy, parental distress categories to the different articles. As in the original interface, and perceived difficulty of the child), but given the number the order of articles and categories was fixed: every participant of users in our dataset adding additional factors resulted in had exactly the same order of categories and articles. segments that became too small. the positive only data describe whether or not a user read an article in the first step, and the predictions would indicate what items a user is most likely to read. In order to translate this positive-only, binary feedback into a ranking, pairs of items are semi-randomly selected per user, where each pair consists of an item that the user has interacted with and one with which the user has not interacted. The assumption is that the first is preferred over the second. Sampling a large number of pairs per user, results in a ranking that can be used in matrix factorization and the resulting model then calculates a relative relevance score instead of a rating. Figure 3. Article Categories Ranked on Popularity per Segment Hybrid Predictions The BPRMF algorithm was extended to combine reading behavior and the individual parents’ user attributes inferred As the participants read on average just over 2 articles, there from the survey for the calculation of hybrid predictions. The was not enough data to show differences on the level of indi- BPRMF algorithm was extended similarly to how Fernandez- vidual articles (i.e. articles were not read often enough to allow Tobias et al. [5] extended the SVD++ [12] algorithm to incor- for enough variance), but participants from different segments porate personality in predictions. did prefer different categories, as can be seen in Figure 3. When investigating these predictions, the popularity order for Where the original BPRMF algorithm uses two matrices P and these categories seems to make sense intuitively. For example, Q to calculate predictions, our user attribute aware BPRMF the breastfeeding category is predicted to be more popular for algorithm uses a third matrix Y . Y describes the user attributes segments with high attunement, which is congruent with the on the same k latent features the users and articles are ex- relationship with breast-feeding and high attunement in the pressed in. In our case we used high and low scores for the original survey [20]. five cognitive factors from our parenting style survey as user at- tributes. We decided again to use the median splits per factors As a result we decided to sort the categories based on the to assign each user a high or low score for each factor in order attunement-structure segment and sort the articles within each to prevent overfitting. Every user has thus 5 user attributes and category based on general popularity. That is, the survey-based the relevance predictions are similar to the original BPRMF predictions only personalized the order of the categories, not algorithm with an additional matrix in which user attributes the articles within each category. We tried basing segments on are represented. The predicted relevance is then calculated other factors than attunement and structure, but the resulting according to equation 2. predictions were not as easily interpretable as the predictions based on these segments. ! Reading-Based Predictions r̂ui = qi ∗ pu + ∑ ya (2) For the conditions based on reading behavior alone, we used a∈A(u) the Bayesian Personalized Ranking Matrix Factorization (or BPRMF) algorithm implemented in MyMediaLite [6, 17] to This model is fit using stochastic gradient descent. Each itera- predict relevance. BPRMF is an extension to classic matrix tion consists of two steps. In the first step the P and Q matrices factorization [13] that allows it to calculate recommendations are fit, while leaving the Y matrix constant. In the second step from positive only feedback instead of rating data. the Y matrix is fit, while leaving the P and Q matrix con- stant. We implemented this algorithm in the MyMediaLite Conventional matrix factorization attempts to complete the library [6]. matrix R with dimensionality of U (number of users) and I (number of items). In this matrix the cells represent ratings Calculated Relevance Predictions the user has given to the corresponding item. This matrix is In total the dataset contained 221 users and 508 reads3 . For decomposed into two k-dimensional sub-matrices P and Q in each user predictions using the four methods described above which the rows of P and Q represent respectively users and were calculated. The predictions were then sorted in two steps. items in a k-dimensional latent feature space. These matrices First the 7 categories were ordered based on the article with are constructed so that the predicted rating r̂ui is calculated by the highest predicted relevance (a strategy called min-rank taking the inner product pu ∗ qi (see Equation 1). that has been shown to work well in similar circumstances [4, 21]). Within the categories the articles were ordered based on predicted relevance. r̂ui = qi ∗ pu (1) The algorithms for the reading-based and hybrid predictions Rendle et al. [17] extended this matrix factorization into required the tuning of a set of regularization hyperparame- BPRMF to allow using positive only feedback to calculate ters, which we carried out using Bayesian Optimization. The per user a ranking of the articles from highest predicted rel- 3 We included reading data from a pilot study to ensure we had enough evance to lowest predicted relevance. In the current study, data to calculate predictions 5-fold Cross Validation Post-hoc Comparison algorithm AUC prec@5 prec@10 NDCG AUC prec@5 prec@10 NDCG baseline 0.840 0.083 0.065 0.424 0.706 0.146 0.104 0.477 survey - - - - 0.650 0.060 0.062 0.353 reading 0.832 0.079 0.061 0.411 0.767 0.176 0.114 0.522 hybrid 0.769 0.080 0.059 0.404 0.807 0.214 0.126 0.561 Table 1. Performance Metrics calculated through 5-fold Cross Validation and a post-hoc performance analysis Bayesian Optimization was performed using 5-fold cross vali- relevance rank (or highest predicted relevance) within the cat- dation, using AUC as the target measure. Once optimal values egory, so that the category with the article with the highest for the hyperparameters were established, the predictive mod- predicted relevance was shown on top. This way of sorting els were constructed and the predictive performance (i.e. the categories has been shown to be one of the best strategies in reading-based and hybrid recommendations) was investigated terms of reducing browsing time [4]. Within categories the through 5-fold cross validation also. Table 1 shows these per- articles were ordered by predicted relevance rank, with the formance metrics of the three algorithms under the column article with the lowest predicted relevance rank to the left of ‘5-fold Cross Validation’. The performance metrics appeared the list. to be adequate4 . However, the baseline, reading-based, and hybrid predictions are calculated on the level of the individual Participants were allowed to browse the library freely during articles, they cannot be easily compared to the survey-based which we measured what articles the participants opened. Par- ticipants were shown a link labeled “I have finished reading” predictions that are calculated first on the category level and that would take them to the survey as soon as they felt they then within the categories on an individual article level. In read enough. The survey contained 11 items aimed at measur- order to make a fair comparison, we performed a post-hoc ing Perceived Level of Personalization, System Satisfaction, analysis by recalculating the performance metrics for the sets of recommendations to correspond to the survey-based predic- and Reading Satisfaction. tions. We did this by calculating the lists of recommendations Participants and sorting all lists first by category based on the minimum All 181 users from the first part were invited to join the second predicted rank of the article within that category and subse- part of the study via email. Of the 181 users we sent invita- quently sorting the articles within their categories based on the tions to, 150 visited the second part and 121 completed the predicted relevance for the individual articles. We then calcu- study. A number of cases were removed, for either trying to lated performance metrics by using the actual reading behavior complete the study with multiple email addresses (3 users), as ground truth. The outcome of these recalculations can be having missing data in the survey (1 user), or finishing the found under the columns ‘Post-hoc Comparison’ in Table 1. second part of the study in less than 50 seconds (11 users). For These numbers indicate the most accurate predictions for the our final data analysis we ended up with 106 users (9 men/97 hybrid predictions, followed by the reading-based predictions, women, 50 first time parents, mean (SD) age of the baby 10.63 the survey-based predictions and finally the non-personalized (8.45) months) baseline. Based on these metrics we would expect the hybrid predictions to be most in line with what participants will read, These users were distributed roughly equally over conditions and the survey-based least. This order is different from the (baseline: 29, survey: 29, reading-based: 22, hybrid: 26). In order in the k-fold cross validation metrics because no k-fold addition, there appeared to be no bias in response rate for the cross validation was applied to be able to compare with the different parenting style segments of the participants, with survey-based recommendations (i.e. the train and test set were response rates of .56 for the low structure/high attunement identical). segment, .65 for the high structure/low attunement segment, .73 for the high structure/high attunement segment and .60 RE-ENGAGING WITH THE NOW PERSONALIZED SYSTEM for the low structure/low attunement segment(χ 2 (3) = 3.239, The second part of the study was used to investigate our re- p = 0.356). search question and test our hypotheses. To this end partici- Results pants were re-invited to interact with the website, where they were now shown the library personalized in one out of four To gain insight in how the different methods of predicting ways (selected at random). The invitations were sent out after relevance influenced the final recommendations participants all predictions were calculated, which means that the time received, we calculated the difference of the recommendations between finishing the first part and starting the second part with the general Top-N in terms of Spearman Rank Correla- differed between participants (median 42.6 days. SD: 15.4 tion. The (Spearman) correlation coefficient ρ indicates to days). In this step the interface was personalized by reorder- what extent lists are similar, with a value of 1 if the order is ing both the categories and the articles within the categories. identical and -1 if they are in reverse order. The results are The categories were ranked based on the minimum predicted shown in Figure 4 and they reveal that the available reading data does not allow personalization that differs a lot from the 4 An overview of the different metrics and how to interpret them can baseline condition (as the correlation between reading-based be found in [8]. and baseline is 0.91 on average). Personalization based on the survey-based predictions is quite different from the baseline predictions, with an average correlation of 0.37. The hybrid predictions fall somewhere in between the reading-based and survey-based predictions with a correlation of 0.74. These numbers indicate that the additional data of parenting styles allows for personalization that deviates more from the baseline than personalization based on reading behavior alone. One possible explanation of the reading-based personalization not differing much from the baseline is insufficient data. As there are a limited number of users (221 users, see Section 2.1), that read a limited number of articles (2.44 articles on average) from a library with a limited number of articles (102) that was presented in a fixed order. As such the dataset might not con- tain enough variance between users’ reading behavior to fully Figure 5. Survey Items and Response Distributions. The light-grey items benefit from collaborative filtering. What argues against this have been omitted from the analysis because of poor factor loadings. is the fact that the reading-based and hybrid recommendations appear to outperform the survey-based predictions in terms of prediction accuracy (see Table 1). User Experience As per the user-centric evaluation framework by Knijnenburg Reading Behavior and Willemsen [11] all survey items were submitted to a struc- Participants read on average 2.72 articles (SD: 4.28 articles), tural equation model (SEM). The responses to the individual but 42 participants (39.6%) did not read any articles. The items can be seen in Figure 5, with items belonging to Per- descriptives for the number of article reads per condition are ceived Level of Personalization (pers1-pers4), System Satis- shown in Table 2. The different conditions had no significant faction (syssat1-syssat4), and Reading Satisfaction (readsat1- influence on the number of articles people read, as negative readsat3). The three items for reading satisfaction show very binomial regressions with the condition as independent vari- low variance among each other, which lead to these three items able and the number of reads as dependent variable showed not fitting in the model. This might have been caused by the no significant difference across conditions. This implies that fact that the reading behavior did not differ across conditions no support is found for the hypotheses regarding the effect of as we manipulated only the order in which the articles were our experimental manipulations on how participants interact presented, and not the actual content in the library. Therefore, with their personalized libraries. people were actually able to read the same articles regardless of experimental condition and thus the reading satisfaction might be similar. Apart from the items on Reading Satisfac- tion, two of the remaining items (pers1 and syssat2) explained condition Mean SD min max N little variance and were also removed from the analysis. 1 baseline 2.448 3.501 0 13 29 2 survey 2.517 4.032 0 16 29 Despite the fact that participants did not read a large amount 3 reading 4.273 6.670 0 31 22 of articles, the interface did allow participants to get a general 4 hybrid 2.038 2.289 0 9 26 idea of the library by looking at the categories and the article titles. Nevertheless, we do feel that the participants who ac- Table 2. Article Reads Per Segment tually read articles are better able to evaluate the library. To account for this we introduced an additional (dummy) vari- able labeled ‘Read’ indicating whether or not people read any articles. Spearman's Rank Correlation ● A SEM was constructed using the remaining six survey items measuring two latent constructs (Perceived Personalization 0.9 and System Satisfaction), the experimental conditions, and the ● ● variable describing whether or not people read as exogenous variables. The two latent factors had high correlation, but the 0.7 model showed good fit (with χ 2 (36) = 44.447, p = .158, CFI ρ ● ● = .984, TLI = .974, RMSEA = .047, 90% CI: [0.000, .088]). 0.5 ● For each participant we used this model to calculate the scores on these latent factors to be used for the remainder of the analysis. baseline survey reading hybrid As the final model consists of only two latent constructs condition (Perceived Personalization and Systems Satisfaction) that are Figure 4. Boxplots of Spearman’s Rank Correlation with General Top-N highly correlated, there is no clear underlying structural model per Condition to test anymore. For the analysis we could either combine both factors into one overall latent factor, or analyze both factors (2) in Table 3. Model (3) reveals how this effect holds up separately. We chose to do the latter as both factors might still for participants that read versus participants that did not. It capture different nuances of the user experience, despite their shows a negative interaction effect for the participants that high correlation. received survey-based personalization and read at least one article, which suggests that only the people that do not read We analyzed the effect of our manipulation on the factor scores any articles actually perceive a higher system satisfaction; for of both constructs through linear regressions, with the factor those who do read at least one article the effect is strongly scores as dependent variables and the experimental condition reduced. as independent variable. As additional moderator we included the dummy variable representing whether or not people read In conclusion support is found for the hypothesis that survey- articles. based personalization outperforms the non-personalized base- line, while no evidence was found that the reading-based and The average factor scores per condition for the two measured hybrid personalization did so. The lack of effect in terms constructs can be found in Figure 6. The image shows an of reader experience are in line with the comparison of the increase in both Perceived Personalization and System Satis- different predicted rankings in terms of Spearman’s Rank Cor- faction for the survey-based condition. The effects are higher relation, that showed a high similarity between the reading- for the participants that did not read (represented in the red based and non-personalized baseline. This comparison further bars) and lower for the participants that did (represented in the showed that the survey-based personalization was most dif- green bars). ferent from the baseline, which is also reflected in the user experience (albeit stronger for the people that did not read than the people that read). The hybrid conditions falls in between the survey-based and reading-based and similarly the effects on user experience appear to fall in between the effects of the survey-based and reading-based recommendations. CONCLUSION AND DISCUSSION This study set out to compare personalization based on psy- chological traits measured through a survey to personaliza- tion based on reading data. Through a user study we com- pared different methods against a non-personalized baseline and showed that personalization based on survey information about parenting styles resulted in a significantly higher experi- enced user satisfaction and perceived level of personalization despite a lower objective performance, whereas using only Figure 6. Marginal effects on Perceived Personalization (top row) and System Satisfaction (bottom row) for the different conditions. The error historical reading behavior or the combination of historical bars correspond to +/- 1 standard error. Separate bars are shown for reading behavior and measured parenting styles did not. Our participants that read no articles (red) and at least one (green). Scores findings speak to the potential usefulness of including data are standardized: a score of +1 implies 1SD higher than the baseline regarding characteristics of users (collected through an initial (baseline recommendations for a user that did not read). Error bars are one standard error of the mean. survey or otherwise) in personalization to alleviate the cold start problem. While the actual reading behavior for users was The regression models in Table 3 show these effects as well. not influenced, an improved user experience may increase the Regression model (1) shows a positive and significant effect on probability for users to return to the library later on. Perceived Personalization for participants in the survey-based The fact that using the survey data for personalization also out- condition, indicating that these participants had the feeling the performed the condition where recommendations were based library catered more to their interests5 . An additional, albeit on both survey data and reading behavior is likely caused by not statistically significant, effect the table shows is an effect the fact that the hybrid recommender - given how we had im- with a significance level of p < 0.1 for the increased perceived plemented it - came up with suggestions that were relatively level of personalization in the condition with hybrid person- close to the baseline condition. Hybrid predictions that would alization. Although caution is needed when interpreting this have assigned more weight to the survey data might have effect, it describes a trend towards participants experiencing a faired better. In any case, we do see that personalization based higher level of personalization with the hybrid personalization. on surveys captures the interests better, or at least increase In terms of System Satisfaction the patterns are slightly differ- the reported user satisfaction, and that they lead to a more ent. Participants that received the survey-based personalization different order in which articles are presented than based on were more satisfied with the system, as can be seen in model the reading behavior alone. 5 Because the factor scores are calculated through a Structural Equa- From a system owner point of view it is worth noticing that the survey-based predictions were very straightforward to calcu- tion Model they are normally distributed with a mean of 0 and SD of 1. Participants in the condition with survey-based personalization late and implement compared to the reading-based and hybrid thus had a perceived level of personalization of 0.563 SD higher than predictions. In addition, after completing the short survey the participants in the baseline. user can immediately benefit from personalization. Both the Dependent variable: Perceived Personalization System Satisfaction (1) (2) (3) β (SE) β (SE) β (SE) survey 0.563∗ (0.241) 0.673∗∗ (0.243) 1.334∗∗∗ (0.334) reading 0.140 (0.260) 0.406 (0.261) 0.479 (0.460) hybrid 0.438• (0.248) 0.273 (0.249) −0.030 (0.391) Read 0.196 (0.328) survey:Read −1.279∗∗ (0.464) reading:Read −0.158 (0.556) hybrid:Read 0.389 (0.498) Constant 0.057 (0.171) 0.097 (0.171) −0.004 (0.236) Observations 106 106 106 R2 0.062 0.072 0.186 Adjusted R2 0.034 0.045 0.128 Residual Std. Error 0.919 (df = 102) 0.923 (df = 102) 0.882 (df = 98) F Statistic 2.236 (df = 3; 102) 2.649 (df = 3; 102) 3.200∗∗ (df = 7; 98) Note: • p<0.1; ∗ p<0.05; ∗∗ p<0.01; ∗∗∗ p<0.001 Table 3. Regression Tables for Experimental Manipulation and Read on Perceived Level of Personalization and System Satisfaction. The regression coefficients are the standardized β s and values between parentheses are standard errors. reading-based and (to a lesser extent) the hybrid predictions Participants in our study interacted with the system twice. One require reading behavior from the user before they can be cal- time for an initial data collection and a second time for the culated. Admittedly providing explicit feedback in the form evaluation. This difference might have lead to a discrepancy, of a survey demands more effort than the implicit feedback as in the first session people were exploring the system and provided through the natural interaction of reading. However, possibly paying attention to other aspects than in the second the higher user experience suggests there might be a trade-off session. For example, in the first session people were getting between the costs of user effort and the benefits of accurate used to the way of navigation in the library and getting ac- personalization. quainted with the system and its usability may have been an issue. In the second session, participants are more likely to Another interesting finding is that the effects of personalization have evolved past this stage, and they can now focus more on on user experience disappeared as soon as participants started what it is that they want to read. This would imply that data in reading articles. A possible explanation for this observation the first session is describing behavior of participants who are can be the number of articles people see in the second part getting to know a system, and as a result models trained on that they have already read in the first part. Seeing articles one this data will generate recommendations based on what an ex- has already read may contribute to a higher perceived level of ploring user will typically read, which may not be appropriate personalization and satisfaction with the library as a whole, to personalize a library for a participant who already knows while reading these articles might actually be detrimental for and is actively using a system. the user experience. In other words, what looks good might not necessarily be what helps the user and as such it might be As mentioned in the results section, it is unsure how our find- worthwhile to investigate the factors that influence user satis- ings hold up in a setting with a bigger library and more inter- faction of a personalized system before and after consumption action data (both in terms of number of users and in terms of and to see if and how these are different from each other. From interactions per user). With only 102 articles in a fixed order, a more general perspective this raises the question whether behavior for participants in the initial data collection may not and how personalization needs to anticipate possible changes have differed enough from each other (yet) to allow the per- and differences in the perception of recommendations as the sonalization based on reading behavior to produce predictions user progresses. Alternatively it might indicate that the pro- that are personalized sufficiently. The fact that these personal- cess of evaluating personalization is different and depends on izations stayed relative close to the non-personalized baseline whether the user is evaluating through observing or through can be interpreted this way. The survey-based recommenda- experiencing. tions on the other hand combined data from users with similar parenting styles and as a result were able to differentiate them- selves more from the non-personalized baseline. Having more Shortcomings and Future Work articles and perhaps also a somewhat longer initial period will While the findings of this study indicate that using surveys as allow for behavior with more differences between users, al- a basis for personalization can improve personalized systems, lowing to more effectively leverage the predictive power and the specific application in which we tested our hypotheses complexity of reading-based personalization, which in turn might limit the extent to which this finding can be generalized. will provide more insight into the conditions that play a role in how personalization based on behavior compares to person- 2. Diana Baumrind. 1966. Effects of Authoritative Parental alization based on psychological traits. However, our results Control on Child Behavior. Child Development 37, 4 (dec show that in this situation with limited reading data a short 1966), 887. DOI:http://dx.doi.org/10.2307/1126611 survey delivers good data for initial personalization. 3. Jay Belsky and Sara R. Jaffee. 2015. The Multiple In line with the previous argument, it is important to realize Determinants of Parenting. In Developmental that in terms of data per user, our participants only interacted Psychopathology. Number April. John Wiley & Sons, with the system once and read 2.23 articles on average. They Inc., Hoboken, NJ, USA, 38–85. DOI: might still have been in their cold start phase and there may http://dx.doi.org/10.1002/9780470939406.ch2 not have been enough information about the users’ reading 4. Gianluca Demartini, Paul-Alexandru Chirita, Ingo behavior to provide useful recommendations. What argues Brunkhorst, and Wolfgang Nejdl. 2008. Ranking against this is that both the hybrid and reading-based models Categories for Web Search. In Advances in Information had higher prediction accuracy than the survey-based recom- Retrieval. Springer Berlin Heidelberg, Berlin, Heidelberg, mendations. Given these observations it would be worthwhile 564–569. DOI: to perform a study that controls for the amount of feedback http://dx.doi.org/10.1007/978-3-540-78646-7_56 collected from the participants. Having more feedback per par- ticipant allows to investigate how the number of interactions 5. Ignacio Fernández-Tobías, Matthias Braunhofer, Mehdi per user affects the performance of the different personal- Elahi, Francesco Ricci, and Iván Cantador. 2016. ization approaches, similar to how Kluver and Konstan [10] Alleviating the new user problem in collaborative filtering investigated the effects of number of interactions on predictive by exploiting personality information. User Modeling and accuracy. User-Adapted Interaction 26, 2-3 (jun 2016), 221–255. DOI:http://dx.doi.org/10.1007/s11257-016-9172-z Apart from the amount of data per user, the amount of data available within the system as a whole may be another factor 6. Zeno Gantner, Steffen Rendle, Christoph Freudenthaler, that plays a role which method of personalization works best. and Lars Schmidt-Thieme. 2011. MyMediaLite: A Free Evaluating how survey-based and reading-based personaliza- Recommender System Library. In Proceedings of the 5th tion compare over time, as more data enter the system as a ACM Conference on Recommender Systems (RecSys whole or per user, would provide valuable insight in which 2011). approach works best when. One could imagine a system that 7. Panagiotis Germanakos and Marios Belk. 2016. starts out from personalization based on measured psychologi- Human-Centred Web Adaptation and Personalization. cal traits that transitions into a system based more on behavior Springer International Publishing, Cham. 336 pages. DOI: or a hybrid system. Investigating this effect would require a http://dx.doi.org/10.1007/978-3-319-28050-9 more longitudinal study, where users are invited to a personal- ized library at multiple moments, to see whether and how the 8. Asela Gunawardana and Guy Shani. 2015. Evaluating different approaches are affected by the cold start. Recommender Systems. Springer US, Boston, MA, 265–308. DOI: Apart from the drawback of a low number of participants for http://dx.doi.org/10.1007/978-1-4899-7637-6_8 calculating relevance predictions, the low number also limited the statistical power of our statistical analysis of the effects 9. J. R. Hauser, G. L. Urban, G. Liberali, and M. Braun. of personalization. While young parents are active on the 2009. Website Morphing. Marketing Science 28, 2 (mar internet, they are hard to approach. In the current study we did 2009), 202–223. DOI: not manage to detect effects of personalization on reading be- http://dx.doi.org/10.1287/mksc.1080.0459 havior and only differences between some of the experimental 10. Daniel Kluver. 2012. How Many Bits Per Rating ? conditions. The effects caused by the personalization might Proceedings of the 6th ACM conference on Recommender have been smaller than the statistical power of our analysis systems - RecSys ’12 (2012), 99–106. DOI: allows us to detect. Conducting a study with more participants http://dx.doi.org/10.1145/2365952.2365974 would allow us to detect these possibly smaller effects. 11. Bart P. Knijnenburg and Martijn C. Willemsen. 2015. In conclusion, the current paper demonstrates that measuring Evaluating Recommender Systems with User psychological traits for the sake of personalization is worth- Experiments. In Recommender Systems Handbook. while and might well lead to increased user satisfaction, but Springer US, Boston, MA, 309–352. DOI: additional work is needed to establish under which conditions http://dx.doi.org/10.1007/978-1-4899-7637-6_9 this approach is valuable. 12. Yehuda Koren. Factorization Meets the Neighborhood : a REFERENCES Multifaceted Collaborative Filtering Model. (????). DOI: 1. B Arnott and Amy Brown. 2013. An Exploration of http://dx.doi.org/978-1-60558-193-4/08/08 Parenting Behaviours and Attitudes During Early Infancy: 13. Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Association with Maternal and Infant Characteristics. Matrix Factorization Techniques for Recommender Infant and Child Development 22 (2013), 349–361. DOI: Systems. IEEE Computer (2009), 42–49. http://dx.doi.org/10.1002/icd.1794 14. Robert R. McCrae, Jr. Paul T. Costa, and Thomas A. Martin. 2005. The NEOâĂŞPIâĂŞ3: A More Readable Revised NEO Personality Inventory. Journal of 18. Andrew I Schein, Alexandrin Popescul, Lyle H Ungar, Personality Assessment 84, 3 (2005), 261–270. DOI: and David M Pennock. 2002. Methods and metrics for http://dx.doi.org/10.1207/s15327752jpa8403_05 PMID: cold-start recommendations. Proceedings of the 25th 15907162. annual international ACM SIGIR conference on Research and development in information retrieval SIGIR 02 46, 15. Bamshad Mobasher. 2007. Data mining for web Sigir (2002), 253–260. DOI: personalization. The adaptive web (2007), 90–135. http: http://dx.doi.org/10.1145/564376.564421 //link.springer.com/chapter/10.1007/978-3-540-72079-9 19. K. R. Venugopal, K. G. Srinivasa, and L. M. Patnaik. 16. Stephanie L Prady, Kathleen Kiernan, Lesley Fairley, 2009. Algorithms for Web Personalization. Springer Sarah Wilson, and John Wright. 2014. Self-reported Berlin Heidelberg, Berlin, Heidelberg, 217–230. DOI: maternal parenting style and confidence and infant http://dx.doi.org/10.1007/978-3-642-00193-2_10 temperament in a multi-ethnic community: Results from the Born in Bradford cohort. Journal of Child Health 20. Tiange Zhao. 2016. Investigating the relationship Care 18, 1 (2014), 31–46. DOI: between parenting beliefs and parenting practice for http://dx.doi.org/10.1177/1367493512473855 in-app personalization. Master thesis. Eindhoven University of Technology. 17. Steffen Rendle, Wolf Huijsen, and Karen Tso-Sutter. https://pure.tue.nl/ws/files/46944250/855031-1.pdf 2008. State-of-the-art Recommender Algorithms. Technical Report. www.mymediaproject.org 21. Zheng Zhu. 2011. Improving Search Engines via Classification. Ph.D. Dissertation.