=Paper=
{{Paper
|id=Vol-2955/paper2
|storemode=property
|title=Evaluating Recommender Systems with and for Children: towards a Multi-Perspective Framework
|pdfUrl=https://ceur-ws.org/Vol-2955/paper2.pdf
|volume=Vol-2955
|authors=Emilia Gómez,Vicky Charisi,Stephane Chaudron 
|dblpUrl=https://dblp.org/rec/conf/recsys/GomezCC21
}}
==Evaluating Recommender Systems with and for Children: towards a Multi-Perspective Framework==
<pdf width="1500px">https://ceur-ws.org/Vol-2955/paper2.pdf</pdf>
<pre>
Evaluating recommender systems with and for
children: towards a multi-perspective framework
Emilia Gómez¹, Vicky Charisi¹ and Stephane Chaudron²
¹ Joint Research Centre, European Commission. Edificio Expo, C. Inca Garcilaso, 3, 41092 Seville, Spain.
² Joint Research Centre, European Commission. Via Enrico Fermi, 2749, 21027 Ispra (VA), Italy.


                                         Abstract
                                         Children are common users of recommender systems (RSs) when watching videos on streaming
                                         services, accessing information on the web or playing games, with tablets and phones being
                                         their favourite devices. Parents and educators have raised concerns about the risks that these
                                         systems pose to children and about the need to develop products and services that empower
                                         children by design and support children’s rights. The RSs literature shows that scenarios
                                         involving children are difficult to evaluate, which makes them a clear example of the need to
                                         integrate perspectives from multiple stakeholders. Motivated by the need for practical
                                         methodologies for children-centric trustworthy artificial intelligence, this paper provides a
                                         comprehensive view of the different perspectives involved in the evaluation of RSs for
                                         children. We first carry out a review of children-related research, with a focus on the RSs
                                         literature, which integrates knowledge from disciplines such as engineering, cognitive science
                                         and human-computer interaction. From this review, we identify the main opportunities,
                                         challenges and risks related to children-centred RSs and their evaluation. Finally, we propose
                                         a multi-perspective framework for the evaluation of RSs for children.

                                         Keywords
                                         recommender systems, information retrieval, children, evaluation, impact assessment, trustworthy arti-
                                         ficial intelligence


1. Introduction
A recommender system (RS) is a type of information retrieval (IR) system whose goal is to
suggest items from a large collection that meet the preferences of a user [1]. RSs are used in a
variety of domains, with well-known applications such as video services (e.g. YouTube), product
recommenders in online shopping, content recommenders in social media and web content
recommenders on topics such as restaurants, wines, dating, news, language teachers
or financial services. Children are common users of recommender systems. Watching videos
is one of the most common digital activities of children reported in the literature [2], and
tablets seem to be their favourite devices in studies carried out in Europe and the USA [3, 4].
   Despite the opportunities for new personalized learning and play experiences that RSs
provide to children, parents and educators have raised certain concerns regarding their use in

Perspectives on the Evaluation of Recommender Systems Workshop (PERSPECTIVES 2021), September 25th, 2021,
co-located with the 15th ACM Conference on Recommender Systems, Amsterdam, The Netherlands
Email: emilia.gomez-gutierrez@ec.europa.eu (E. Gómez); vasiliki.charisi@ec.europa.eu (V. Charisi);
stephane.chaudron@ec.europa.eu (S. Chaudron)
ORCID: 0000-0003-4983-3989 (E. Gómez); 0000-0001-7677-027X (V. Charisi); 0000-0001-7650-8562 (S. Chaudron)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
digital environments by children. One of the most promising directions for the mitigation of those
risks is the development of services that empower children by design and support children’s
rights [3, 5]. However, the RSs research literature contains few child-centric studies compared to
adult evaluations, so design decisions on datasets, algorithms and interaction designs are
mostly driven by adult needs. Existing literature has shown that scenarios involving children are
difficult for the evaluation of RSs [6, 7] and that they require a multi-stakeholder evaluation as
defined in [8].
   Motivated by the need for practical methodologies for child-centred artificial intelligence
(AI) as defined by UNICEF [9], the goal of this paper is to provide a comprehensive view on
the different perspectives involved in the evaluation of RSs for children. We first carry out a
literature review on research related to children and RSs, with a comprehensive review of KidRec
proceedings (International and Interdisciplinary Perspectives on Children and Recommender and
Information Retrieval Systems workshop series), key contributions from ACM Recommender
Systems Conference - RecSys, and insights from other communities such as cognitive science
and human-computer interaction. Then, we identify the main potential opportunities, the
emerging risks and the challenges in the evaluation of RSs. Finally, we propose a
multi-perspective framework for the comprehensive evaluation of RSs with children and for
children’s well-being.


2. Recommender systems components and evaluation
Recommender systems are implemented through different components which contribute to
their outcome and impact, as illustrated in Figure 1. Datasets are a crucial component for their
development and evaluation, as they set up part of the application context and scope in
terms of the information sources used for the recommendation. Machine learning algorithms learn
from these data to propose recommendations, which are presented to users by means of a
graphical user interface (GUI) and adapted to a particular hardware device, such as a computer,
mobile phone or tablet. By means of several user-interaction components, the system is able to
capture user behaviour with the system, and perform the relevant adaptations to the analyzed
data, algorithm and user interface.
   State-of-the-art recommendation algorithms are hybrid and combine different approaches
such as collaborative filtering techniques (e.g. recommending to a user the items that a similar
user liked in the past), content-based algorithms (e.g. recommending to a user items similar
to the ones she/he likes), demographic systems (e.g. targeting specific languages or countries)
or knowledge-based approaches (e.g. case-based reasoning systems) [1]. As an example, music
recommendation systems implement hybrid approaches using collaborative filtering (e.g. play
counts, information from peer users), music content description (e.g. features extracted from
music audio recordings such as melody, tempo or volume), music context descriptors (e.g.
information about the artist or lyrics found on the web), and user properties and behaviour
(e.g. demographics, mood). They now integrate state-of-the-art data-driven machine learning
techniques [10].
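
   The combination of collaborative filtering and content-based scoring described above can be
illustrated with a minimal sketch. It is not the method of [10] or of any particular product:
the users, items, play counts, feature vectors, the function names and the mixing weight alpha
are all hypothetical and chosen only for illustration.

```python
from math import sqrt

# Toy play counts (user -> {item: count}); all names and numbers are made up.
plays = {
    "ana": {"song_a": 5, "song_b": 3},
    "ben": {"song_a": 4, "song_b": 2, "song_c": 5},
    "cleo": {"song_d": 4},
}

# Toy content descriptors per item (e.g. normalised tempo and loudness).
features = {
    "song_a": [0.9, 0.2],
    "song_b": [0.8, 0.3],
    "song_c": [0.85, 0.25],
    "song_d": [0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def user_vector(user, items):
    """Play counts of `user` over a fixed item ordering."""
    return [plays[user].get(item, 0) for item in items]


def collaborative_score(user, item):
    """Play counts of other users for `item`, weighted by user-to-user similarity."""
    items = sorted(features)
    target = user_vector(user, items)
    return sum(
        cosine(target, user_vector(other, items)) * plays[other].get(item, 0)
        for other in plays
        if other != user
    )


def content_score(user, item):
    """Mean feature similarity between `item` and the items the user already played."""
    played = plays[user]
    if not played:
        return 0.0
    return sum(cosine(features[item], features[p]) for p in played) / len(played)


def hybrid_recommend(user, alpha=0.5, k=2):
    """Rank unseen items by a weighted mix of both scores (alpha is an assumed weight)."""
    candidates = [item for item in features if item not in plays[user]]
    cf = {item: collaborative_score(user, item) for item in candidates}
    max_cf = max(cf.values(), default=0) or 1.0  # normalise CF scores to [0, 1]
    mixed = {
        item: alpha * cf[item] / max_cf + (1 - alpha) * content_score(user, item)
        for item in candidates
    }
    return sorted(candidates, key=lambda item: mixed[item], reverse=True)[:k]
```

   In this toy data, hybrid_recommend("ana") ranks song_c above song_d, because a similar user
played song_c and its descriptors are close to the songs that "ana" already played.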
Figure 1: Components of a RS, adapted and complemented from [10]. Yellow blocks refer to system
inputs and outputs, grey ones to data and blue ones to data-driven and human-computer interaction
approaches. Examples are tailored to audio-visual media recommendation.

   RSs evaluation practices intend to assess the effectiveness of the system and include the
definition of the different aspects of the evaluation:


    • Content to be recommended in the evaluation exercise, including the selection or creation
      of specific content datasets.
    • User population, defined as the target population for the recommendations in terms of
      background, experience, age, culture, etc.
    • Methodology for the evaluation exercise and protocol, e.g. user study, online evaluation
      (where recommendation results are shown to real users of the system and we observe
      their behavior to measure their satisfaction with the system, e.g. if they select or play
      the recommended items), and offline evaluations (where we use existing datasets built
      from historical data, we then discard some of this data and we try to predict it using the
      recommendation algorithm) as explained in [11].
    • Criteria or metrics for system evaluation, with a focus on accuracy, which is usually
      represented by standard metrics such as mean squared error and root mean squared error,
      or by IR metrics such as precision and recall. Other aspects going beyond accuracy include
      diversity, novelty, coverage, robustness, serendipity, trust, privacy or reproducibility.
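
   The offline protocol and the accuracy metrics listed above can be sketched as follows. This
is an illustrative sketch under simple assumptions, not the procedure of [11]: the function names
are hypothetical, interaction histories are assumed to be ordered lists of item identifiers, and
the hold-out hides the last interactions of each user.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean squared error between predicted and observed ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    return sqrt(mse)


def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommended items against a relevant set."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def holdout_split(history, n_hidden=1):
    """Hide the last n_hidden (>= 1) interactions of each user for later prediction."""
    if n_hidden < 1:
        raise ValueError("n_hidden must be at least 1")
    train, hidden = {}, {}
    for user, items in history.items():
        train[user] = items[:-n_hidden]
        hidden[user] = set(items[-n_hidden:])
    return train, hidden


# Usage with made-up data: the recommender would be trained on `train`
# and its output compared with the hidden items.
history = {"u1": ["a", "b", "c"], "u2": ["b", "d"]}
train, hidden = holdout_split(history)
precision, recall = precision_recall_at_k(["c", "d", "e"], hidden["u1"], k=3)
```

   Beyond-accuracy criteria such as diversity or serendipity need additional definitions and are
not covered by this sketch.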


3. From general to children-centred recommender systems
General recommender systems, even if not adapted to or designed especially for children, are
widely used by children. For instance, according to Statista¹, as of March 2020, a survey on

   ¹ https://www.statista.com/statistics/1150571/share-us-parents-young-child-watch-youtube-videos/
parenting in the USA showed that 89% of parents with children aged 5 to 11 and 57% of
parents with children aged 0 to 2 reported that their children had watched YouTube
videos. The usage varies across countries, media modalities and platforms. For instance,
according to the same source, 23% of Brazilian parents with children between 10 and 12 years
old stated that their kids used Spotify, and 9% of Brazilian parents with children between
the ages of 7 and 9, as well as those with toddlers, reported that these children used the digital
music, podcast, and video streaming service².
   This extended usage is confirmed by some research studies. According to Izci et al. [4],
it is recognized that children of all ages use YouTube, and researchers found that children as
young as 6 months are exposed to videos on the YouTube platform. Radesky et al. also found
YouTube and YouTube Kids to be among the most commonly used applications in a study with
346 English-speaking parents and guardians of children aged 3 to 5 [2]. Chaudron et al. carried
out a cross-national study covering 21 European countries on young children (0 to 8) and digital
technologies [3]. Their analysis, grounded on data from 234 family interviews, showed that
children usually have their first interaction with digital technologies at a very early age
(below 2), through their parents’ devices, which are not tailored for them in the first place.
In a similar line, parents have identified their own perspectives on the use of RSs by children
that relate to the transformation of their own parental role, through the use of online
applications with RSs as a “digital babysitter” [3, 4].
   But as mentioned by Cunningham and Zhang [12], children are not miniature adults: they
have different needs, capabilities and expectations of computer products. Some studies have
addressed specific children’s needs, challenges and risks of this kind of technology [4], and
industry has also adapted its products to children (e.g. YouTube Kids³ or Spotify Kids⁴).
   In the ACM Recommender Systems (RecSys) conference⁵, the most well-known international
forum for RS research, only four papers have the keyword “child” in the
title or abstract [13, 14, 15, 16]. Pera and Ng [13] propose and evaluate a book recommender
for K-12 users and a readability analysis tool to determine the grade level of books. Milton et
al. [15] carry out an empirical study to identify the traits affecting children’s preferences in
books. They found differences in preference between children of different ages in terms of
preferred colours, emotions, length, writing style and topics, and point out the limited availability
of recorded interactions between young users and recommender systems. Based on this work, the
authors present in [16] StoryTime, a web-based book recommender specifically co-designed
with children, based on images which elicit their preferences. Fails et al. [14] established
in 2017 the International and Interdisciplinary Perspectives on Children and Recommender and
Information Retrieval Systems workshop series (KidRec⁶) as a research forum on
children-specific recommender systems, now in its fifth edition (2017-2021). KidRec proceedings contain
21 research works on different topics related to the design of children-centred RSs, dealing with
the specific challenges, applications, evaluation practices and ethical concerns.


    ² https://www.statista.com/statistics/1193642/children-using-spotify-brazil/
    ³ https://www.youtube.com/kids/
    ⁴ https://www.spotify.com/us/kids/
    ⁵ https://recsys.acm.org/
    ⁶ https://kidrec.github.io/
4. Potential opportunities and applications
The literature identifies several domains where recommender systems can bring value and
support children’s autonomy in several tasks by facilitating access to different information
sources and modalities. The specific applications of RSs for children, as presented in KidRec
proceedings, include information search [17], video recommendation [18, 4] (e.g. YouTube or
dedicated apps), music recommendation [19], learning [20, 21, 22, 23], second language learning
[24, 25], smart toys [26], story and book recommendation [13, 27, 16] and social media (e.g.
Messenger Kids⁷).
   The above-mentioned RS-based platforms for children have the potential, under certain
circumstances, to bring unique opportunities for learning, play and entertainment. First, these
platforms have the capacity to accommodate and render accessible to children large sets of
material that otherwise would not be accessible to them. This has a particular impact in
school-based activities, especially for children from less-advantaged socio-economic backgrounds, who
otherwise would not have access to a teacher to manually curate the
information for them. In addition, RSs for game-based learning for children can facilitate
self-guided cognitive training, especially when the system has an orientation towards transparency
with explainable recommendations [28]. Moreover, these systems can support children’s
diversification by allowing each individual child to have control over their own learning or play
and entertainment trajectories by selecting among a large set of recommended material to be
engaged with. In this way, children, even from a very young age, are empowered to develop their
agency, especially in online environments, while avoiding information overload [29]. Another
particularly beneficial feature of RS platforms is the possibility for children to send each other
messages, thus expanding the database to peer-to-peer recommendations [30]. RSs that are
used in educational settings can support cognitive self-regulated learning skills in children,
considering individual differences in abilities, preferences, and needs [28]. At the same time,
RS-based applications for children are often developed not only to scaffold the child but also
to monitor and report the child’s progress and predict future performance [31], which might
prove beneficial for the teaching process. These features can be used by parents and educators
in order to further support children’s development and well-being.
   The above-mentioned examples of RSs have the potential to benefit children only under
certain circumstances. For instance, a reading recommendation system that
collects data on a child’s engagement with books, and generates graph data and predictions,
can easily turn into a monitoring and surveillance tool [29], which would probably violate
children’s right to privacy. The identification and the mitigation of the potential risks of the
use of RSs by children will help us understand the possible necessary future actions for the
development of RSs that support children’s well-being. In the following sections we elaborate
on the relevant literature on the emerging risks and the challenges we face for their evaluation
in order to propose certain future directions.


   ⁷ https://www.technologyreview.com/2018/02/07/145469/facebooks-app-for-kids-should-freak-parents-out/
5. Emerging risks of recommender systems
While the use of RSs by children brings certain opportunities for children’s learning, play and
entertainment, recent research literature, policy reports and press articles have identified several
risks that children may encounter when using recommender systems:
    • Personal data collection: data related to the behaviour and interaction of users with
      RSs is crucial for their development. However, since children are a vulnerable population,
      it is important to protect their data and privacy [9] by considering which information is
      appropriate to gather. For instance, the Children’s Online Privacy Protection Act (COPPA) [32]
      states that people aged 13 or older can participate in social-media platforms, which need to
      make sure this age limit is correctly defined and enforced. The General Data Protection
      Regulation (GDPR) [33] also states that children merit specific protection with regard to
      their personal data. As a consequence, researchers need to make an additional effort to
      establish data ownership, responsibility and protection procedures when they design RSs
      for children [3].
    • Over-exposure: several studies mention the risk for children of being exposed to the
      same or similar media repeatedly, due to the fact that media recommender systems are
      driven by the notion of similarity [4]. This includes the so-called information bubbles,
      understood as the risk for children to encounter a certain type of content based on their
      previous choices, reinforcing those and giving less opportunity and room for discovering
      something different.
    • Being exposed to undesirable content, given that entertaining videos are not always
      adequate for children and may for instance contain sexual content, include physical
      violence or refer to unhealthy food or habits [3, 4].
    • Online advertising is also considered as a risk, as platforms may treat children as “young
      consumers”, linked to the concept of the “commodification of childhood” [4].
    • Addiction to or dependency on screens has also been identified as a relevant risk⁸. Lukoff
      et al. [34] carried out a survey with 120 adult YouTube users and a set of co-design
      experiences to analyze how some internal mechanisms implemented in the app can
      support user agency, as a low sense of agency can relate to negative life impacts such as
      loss of social opportunities, sleep or productivity. The authors found that, on the one
      hand, some mechanisms such as autoplay or automatic recommendations decrease the
      user’s sense of agency. On the other hand, some other functionalities such as search or
      playlist creation can support it. Research studies addressing children provide conclusions
      along similar lines. Hiniker et al. [35] carried out a behavioural study with 24 children
      aged 3 to 5, and they found that some design features can support children’s autonomy and
      self-regulation, such as those providing opportunity for planning and making choices,
      the ones reminding children of their intentions and those asking questions to the child.
      However, others, such as post-play, can undermine it.
    • Social media platforms or apps such as Messenger Kids allow children to post and message
      friends through a federation mechanism monitored by parents. Some voices have signaled
    ⁸ What Screen Addictions and Drug Addictions Have in Common: https://www.pbs.org/wgbh/nova/article/screen-time-addiction/
      the risk of these applications being used to familiarize children with commercial
      products that they will use when they become teenagers⁹.
    • Difficulty for parents to monitor children’s behaviour, as recommender systems
      are consumed by children mostly on personal devices such as tablets or phones [3].
    • Propagation of existing gender stereotypes present in search and recommendation
      systems [36].
   Although some of these risks also appear in the adult population, children need special
protection, given their vulnerability and the potential impact on their cognitive and socio-emotional
development [9]. In addition, the particular tendency of children to use trial-and-error methods
to learn how to use a tool increases several risks such as the deviation towards unsuitable content,
the accidental disclosure of personal information, and the unintended contact with people [3].


6. Challenges of RSs evaluation with children
In the previous sections we elaborated on the potential advantages and the emerging risks of
the use of RSs by children. For the design and development of RSs that promote children’s rights
and benefit their well-being while taking advantage of the unique opportunities of the use of
those systems, we need to develop scientifically rigorous and responsible techniques for
their evaluation, which, as we will discuss, is still a challenging endeavor.
   Some studies have identified and analysed children’s behaviour in adult-centred platforms.
One example in the music domain is the work by Schedl and Bauer [19] on the Last-FM platform.
Among all users, the authors found a small presence of children aged 6-17 compared to adults, and
a small but significant presence of young children (e.g. 6-10), including 5,953 users (12.9% of
users in the platform). For those children using Last-FM, Schedl and Bauer found that recommendation
algorithms based on collaborative filtering seem to work better for children than for adults.
The authors also found significant differences in musical genre preferences between young
and adult listeners. For instance, young listeners were found to have a high preference for rock
music and low preference for blues. In addition, the youngest age group (6-12) was found to
appreciate electronic music the most in comparison to the other age groups, and rock, folk,
punk, alternative, and metal were the least liked genres by this youngest group, compared to
the older groups [19]. The need to define children-specific musical genres is also visible in some
commercial products, e.g. Spotify Kids, with genres such as movies music, bedtimes tunes, party
jams and stories.
   We also find studies specifically focused on children, such as the work by Cunningham and
Zhang [12], who propose a participatory design activity for children with music recommender
systems. It is the only paper at ISMIR (International Society for Music Information Retrieval
Conference¹⁰) with the “Child” keyword in the title. The authors develop Kids Music Box, a
music recommendation system created with children aged 6 to 10 in mind. In this work, the authors
organize the different challenges for children to use music recommendation platforms in terms
of their cognitive and physical development and their preferred functionalities. In terms of
     ⁹ Child health advocates call for Facebook to shutter Messenger Kids app: http://social.techcrunch.com/2018/01/30/child-health-advocates-call-for-facebook-to-shutter-messenger-kids-app/
    ¹⁰ http://www.ismir.net
cognitive development, the authors mention that children should not be forced to use software
designed with complex interaction and interfaces, requiring good spelling, and reading skills
beyond their current abilities. In addition, they mention the need for children to get constant
visual or acoustic feedback, which is not always provided by textual interfaces. Children may
have difficulty with abstract concepts, so the selected icons should represent familiar, real-world
objects. Finally, they signal the fact that children use trial-and-error methods to learn how
to use a tool, which is not always the case for adults. In terms of physical development, the
authors suggest that children may have difficulty controlling the mouse, targeting small areas
on the screen or typing on the keyboard. We then need to design simple physical interactions
for this user population. Finally, as regards the specific needs or preferred functionalities for
music RSs, they mention the rating of songs, the synchronization of visuals with the music, the
incorporation of games while listening to music and the option to have parental settings or control.
These findings are in line with the need to integrate children-specific design recommendations,
which are adapted to their cognitive and physical needs and abilities.
   In fact, Human-Computer Interaction research has widely addressed the evaluation of in-
terfaces with children. Soni et al. review existing design recommendations for children’s
touchscreen interfaces based on cognitive, physical and socio-emotional developmental ap-
propriateness [37]. In their work, the authors define, from a review of the state of the art, the
Touchscreen Interaction Design Recommendation for Children - TDRC framework, incorporating
57 different design recommendations found in the literature, organized by interface dimensions.
This framework was used to empirically analyze how these recommendations were considered
in 50 popular apps, finding that only 63% of those apps followed design recommendations
to fulfill children’s cognitive (51%), physical (67%) and socio-emotional (72%) needs. This study
illustrates the existing divergence between research findings and practical children-centred
touchscreen applications.
   In the recommender systems literature, Ekstrand [6] summarizes the challenges of evaluating
RSs with children, confirmed by other authors, and including the following issues:

    • Data availability: the lack of data (i.e. the so-called “cold-start problem”) is one of the
      limitations of children-centric studies, aggravated by the above-mentioned data protection
      rights, as also signaled in [7]. All these aspects limit the availability of benchmarking
      datasets including children users, which are crucial for algorithm evaluation and
      development and to ensure the reproducibility of studies [6].
    • Limited survey abilities when dealing with children. Surveys provide a common
      strategy and practical way for large-scale evaluation of RSs. However, some studies have
      signaled the limitations of this methodology for children [38, 3, 6]. For instance, click logs
      from children’s interactions with a system are likely to be noisier than those from general
      users, and children are unlikely to be able to provide robust ratings, particularly when
      attempting to accommodate different factors such as educational value or information
      accuracy. Other methodologies such as user studies, usability exercises and participatory
      design processes are then required for children, which are costly to carry out on a
      large scale.
    • Multi-stakeholder evaluation: as mentioned in [6], RS evaluation has been tradition-
      ally centred on metrics and protocols that measure how the different system components
      impact "user" behaviour (e.g. accuracy, satisfaction, play counts) or platform/business
      outcomes (e.g. sales, user retention). However, in child-centred recommendation, we
      need to consider different stakeholders related to the target "user" or "consumer", as
      indicated by Bauer and Jannach [8]: the child has particular interests and information
      needs; the caretaker might decide on the information and content that are suitable for
      the child; in educational scenarios, the teacher uses a RS to support certain learning
      outcomes. Other stakeholders mentioned in [8] include the RS provider (e.g. platform),
      supplier (e.g. product manufacturer) and society in general. For instance, the RS provider
      may want to ease the discovery of specific content according to its business model. These
      different views need to be formulated and integrated into the design of the evaluation
      protocol. As mentioned by Bauer and Jannach in [8], multi-stakeholder evaluation implies
      the optimization of multiple objectives in parallel, and needs to be considered from the
      dataset and algorithm itself to the evaluation methodology. The authors also reflect on the
      concept of fairness and other ethical questions arising from the consideration of different
      stakeholders, e.g. provider vs consumer, which is another research gap [8].
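
   The optimization of multiple objectives in parallel can be illustrated with a small sketch
that compares system variants on per-stakeholder scores. It is not the method of [8]: the variant
names, objective names, scores and weights are hypothetical, and all scores are assumed to be
normalised to [0, 1] with higher values being better.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every objective and better on one."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


def pareto_front(candidates):
    """Names of the candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in candidates.items()
            if other != name
        )
    ]


def weighted_score(scores, weights):
    """Single score obtained as a weighted sum over the weighted objectives."""
    return sum(weights[k] * scores[k] for k in weights)


# Hypothetical evaluation results for three system variants.
candidates = {
    "engagement_first": {
        "child_satisfaction": 0.9,
        "caretaker_suitability": 0.4,
        "provider_retention": 0.9,
    },
    "curated": {
        "child_satisfaction": 0.6,
        "caretaker_suitability": 0.9,
        "provider_retention": 0.5,
    },
    "baseline": {
        "child_satisfaction": 0.5,
        "caretaker_suitability": 0.4,
        "provider_retention": 0.5,
    },
}

# Assumed weights expressing one possible prioritisation of stakeholders.
weights = {
    "child_satisfaction": 0.3,
    "caretaker_suitability": 0.5,
    "provider_retention": 0.2,
}

front = pareto_front(candidates)
best = max(front, key=lambda name: weighted_score(candidates[name], weights))
```

   The Pareto front keeps the variants that represent genuine trade-offs between stakeholders,
while the weighted sum makes explicit whose interests are prioritised. The choice of weights is
itself one of the fairness questions raised above.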


7. Towards a multi-perspective evaluation framework
After analyzing existing literature, we observe that the evaluation of recommender systems for
children is very challenging, and that different studies have approached varied evaluation aspects
in specific contexts. Landoni et al. [39] proposed an evaluation framework allowing the comparative
analysis of diverse IR strategies by a given user group, task and context. Building upon this
work and the previous review, we propose a multi-perspective framework covering four
dimensions: component, stakeholder, methodology and temporal scale,
as illustrated in Figure 2 and further detailed in the following subsections. These dimensions
should be driven by the intended context, purpose and expected value of the RS [8].


Figure 2: Perspectives to be considered in the evaluation of RS for children.
  As a complement to these four perspectives, we consider the aspects of reproducibility
and transparency as key requirements for meaningful evaluations, as open protocols and
community-built toolkits are the only way towards incremental, comparative and comprehensive
evaluations of RSs.
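
   As one possible way to make such evaluations open and comparable, an evaluation plan can be
recorded as a small data structure covering the four perspectives. The sketch below is only an
illustration: the class and field names are hypothetical, the enumerated values are taken from the
examples mentioned in this paper, and the temporal scale is left as free text.

```python
from dataclasses import dataclass, field
from enum import Enum


class Component(Enum):
    DEVICE = "device"
    USER_INTERFACE = "user interface"
    INTERACTION = "interaction mechanisms"
    ALGORITHM = "recommendation algorithm"
    DATA = "data"


class Stakeholder(Enum):
    CHILD = "child"
    PARENT_OR_GUARDIAN = "parent or guardian"
    EDUCATOR = "educator"
    COMPANY = "company"


class Methodology(Enum):
    USER_STUDY = "user study"
    ONLINE = "online evaluation"
    OFFLINE = "offline evaluation"
    SURVEY = "survey"
    PARTICIPATORY_DESIGN = "participatory design"


@dataclass
class EvaluationPlan:
    """Record of which perspectives an evaluation exercise covers."""

    purpose: str
    components: set = field(default_factory=set)
    stakeholders: set = field(default_factory=set)
    methodologies: set = field(default_factory=set)
    temporal_scale: str = ""  # free text; the values used by callers are assumptions

    def missing_perspectives(self):
        """Names of the perspectives that the plan leaves unspecified."""
        checks = [
            ("component", self.components),
            ("stakeholder", self.stakeholders),
            ("methodology", self.methodologies),
            ("temporal scale", self.temporal_scale),
        ]
        return [name for name, value in checks if not value]


# Usage: a plan that only specifies two of the four perspectives.
plan = EvaluationPlan(
    purpose="book recommendation in a school library",  # made-up example
    components={Component.ALGORITHM},
    stakeholders={Stakeholder.CHILD, Stakeholder.EDUCATOR},
)
```

   Publishing such a record together with the results would let other researchers see at a glance
which perspectives an evaluation leaves open.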

7.1. Component
We have seen that RSs are complex systems as represented in Figure 1. Different components
contribute to the system output, such as the device (e.g. computer, tablet, phone), user
interface, interaction mechanisms, functionalities, recommendation algorithm, data collected from
children or content information used for training. These components may also need varied
evaluation strategies, and there is a need to understand the impact of each of the components
on the final evaluation outcomes. Most evaluation approaches reviewed here focus on full
system evaluation or on a particular device, set of functionalities, user interface or
interaction paradigm. To our knowledge, there are no comprehensive evaluations of how
specific steps of the recommendation process affect and should target children, e.g. content
description methods, item similarity metrics, emotion recognition models, or collaborative
filtering strategies. This indicates a clear risk of bias and malfunction of state-of-the-art RSs for
children. Full-system evaluation, combined with the understanding of the role of the different
components, should be the target goal.

7.2. Stakeholder
We see the need to consider the evaluation exercise from the perspective of the different
stakeholders involved, who might have different needs and expectations from the RS. In
addition, we need to adapt the evaluation methodology to the particular user (e.g. as mentioned
before, surveys might be more adequate for adults). Moreover, the interaction between those
stakeholders and the processes that develop among them should also be understood. We reflect
now on the main stakeholders involved in the design of child-specific RSs:
    • Children have specific information needs and preferences, which calls for empowering
      their participation in the design process. Importantly, children are not a homogeneous
      group: age, gender and family and social background affect their choices and preferences.
      From this perspective, evaluation practices should reflect on the child’s individual
      history and current behaviour, the current context and culture.
    • Parents or guardians should be able to incorporate their preferences in terms of protection
      and values to be transmitted by RSs. We should note that, although some studies such
      as [3, 2] are based on interviews with parents/guardians, according to Radesky et al.,
      parent-reported duration of mobile device use in young children has low accuracy, and the
      use of objective measures is needed in future research. This reveals the need to contrast the
      evaluation results obtained with different methodologies and stakeholders.
    • Educators: educational goals and expected learning outcomes are of particular relevance
      when using RSs in educational contexts. Often, parents are involved in educational
      activities with their children, especially in informal settings, and educators play a major
      role in the protection of children as well as in the creation of opportunities for children’s
      participation. Notably, tensions can appear between decisions supporting children’s
      online protection and participation.
    • Companies need to consider the effect of the RS on their business and business model.
      As they are the main developers and integrators of the RS, it is important to understand their
      needs and limitations as related to system evaluation.
    • Policy makers: evaluation practices and results can provide the needed scientific evidence
      to design policies that can minimize risks, ensure children’s protection, empower their
      participation, and support shaping the current educational systems to prepare children in
      the best possible way.

7.3. Temporal scale
A third important dimension is the temporal scope of the evaluation exercise, which should also
fit its purpose:

    • Short-term: if we design a one-shot co-design exercise, user/usability study or survey, we
      will investigate the immediate effect of recommendations.
    • Mid-term: in this case, we would follow children in their interaction with a RS to study
      the potential impact after several sessions or exercises.
    • Long-term: longitudinal studies are also needed, with the goal of understanding the
      impact RSs may have on children in the long term, e.g. for their future development as
      teenagers or adults.

7.4. Methodology
Finally, we have mentioned the need to combine different methodologies for a comprehensive
RS evaluation:

    • Criteria and metrics: evaluation goals and criteria are linked to the selected metrics, either
      algorithm-centred accuracy metrics or application-specific holistic ones [40]. Metrics are
      also linked to the needs of different stakeholders and the target component and temporal
      scope. We consider, for instance, that the overall challenge of designing a child-specific
      RS is to understand how the RS addresses the child’s cognitive models and developmental aspects.
      This includes the consideration of different aspects such as: (1) How the RS supports
      the child’s need for agency acquisition; (2) How to implement design decisions that
      better fit children’s attention span; (3) What is the role of interactivity in children’s
      connections between online and offline scenarios; and (4) What is the right balance
      between children’s rapid development and their predisposition for repetition (especially
      in early childhood). Although traditionally the goal of the evaluation of a RS is linked
      to its accuracy, we also need to evaluate whether the tool scaffolds the child’s well-being
      and development by prioritizing children’s innate characteristics such as curiosity and
      exploration.
    • Set up and protocol: here, we would need to detail and select relevant evaluation methods
      including quantitative (surveys, behavioural data analysis) and qualitative (participatory
      design exercises, usability studies, interviews, focus groups, ethnographic studies) approaches.
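
As an illustrative sketch only (it is not part of the framework as published), the four
perspectives could be recorded as tags on each planned study, so that a research team can
check which stakeholders and temporal scales an evaluation plan still leaves uncovered. The
class name EvaluationStudy, the function coverage_gaps and the example plan are all
hypothetical; the stakeholder and temporal-scale labels are taken from Sections 7.2 and 7.3.

```python
from dataclasses import dataclass
from enum import Enum


class TemporalScale(Enum):
    # Labels taken from Section 7.3.
    SHORT_TERM = "short-term"
    MID_TERM = "mid-term"
    LONG_TERM = "long-term"


# Stakeholder groups listed in Section 7.2.
STAKEHOLDERS = frozenset(
    {"children", "parents", "educators", "companies", "policy makers"}
)


@dataclass(frozen=True)
class EvaluationStudy:
    """One planned study tagged along the four perspectives (hypothetical schema)."""
    name: str
    components: frozenset      # e.g. {"user interface", "recommendation algorithm"}
    stakeholders: frozenset    # subset of STAKEHOLDERS
    temporal_scale: TemporalScale
    methods: frozenset         # e.g. {"survey", "usability study"}


def coverage_gaps(studies):
    """Return the stakeholders and temporal scales not covered by any study."""
    seen_stakeholders = set()
    seen_scales = set()
    for study in studies:
        seen_stakeholders |= study.stakeholders
        seen_scales.add(study.temporal_scale)
    return {
        "stakeholders": sorted(STAKEHOLDERS - seen_stakeholders),
        "temporal_scales": sorted(
            scale.value for scale in TemporalScale if scale not in seen_scales
        ),
    }


# Hypothetical two-study plan, used only to exercise the check.
plan = [
    EvaluationStudy(
        name="one-shot usability study",
        components=frozenset({"user interface"}),
        stakeholders=frozenset({"children"}),
        temporal_scale=TemporalScale.SHORT_TERM,
        methods=frozenset({"usability study"}),
    ),
    EvaluationStudy(
        name="parent interviews after several sessions",
        components=frozenset({"full system"}),
        stakeholders=frozenset({"parents"}),
        temporal_scale=TemporalScale.MID_TERM,
        methods=frozenset({"interviews"}),
    ),
]

gaps = coverage_gaps(plan)
```

For this example plan, the check reports that educators, companies and policy makers are not
yet involved and that no long-term study is planned. Such a tally only flags missing
perspectives; it says nothing about the quality of the studies themselves.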


8. Conclusions
In this paper we have summarized existing literature on the evaluation, opportunities, risks and
challenges of children using recommender systems. An analysis of the literature on children-
specific RSs has revealed the main challenges researchers address to evaluate recommender
systems with child audiences, and the importance of children-centred design to minimize
the risks that recommender systems pose, without sacrificing the opportunities such systems
can bring to children. Our review shows that evaluation practices typically focus on RS accuracy;
however, we need to include other evaluation criteria, such as whether the tool scaffolds the
child’s well-being and development by prioritizing children’s innate characteristics such as
curiosity, exploration and creativity.
   In addition, while most research focuses on partial aspects of the evaluation such as the effect
of design decisions and interfaces in particular contexts, we propose a comprehensive
multi-perspective framework to develop reproducible and incremental evaluation practices that
allow a scientific understanding of the impact, potential bias and needed adaptations of RSs for
children, to make sure these systems support their current and future welfare.
   As mentioned in [8], we think that only by evaluating RSs from these different perspectives
will the research community understand the effect that their designs may have on individual
stakeholders (e.g. children, parents, business) and the wider society.


References
 [1] F. Ricci, L. Rokach, B. Shapira, P. Kantor, Recommender Systems Handbook, Springer, 2011.
     doi:10.1007/978-0-387-85820-3.
 [2] J. S. Radesky, H. M. Weeks, R. Ball, A. Schaller, S. Yeo, J. Durnez, M. Tamayo-Rios, M. Epstein,
     H. Kirkorian, S. Coyne, R. Barr, Young children’s use of smartphones and tablets, Pediatrics
     146 (2020). URL: https://pediatrics.aappublications.org/content/146/1/e20193518. doi:10.
     1542/peds.2019-3518.
 [3] S. Chaudron, R. D. Gioia, M. Gemo, Young Children (0-8) and Digital Technology - A
     qualitative study across Europe, Publication Office of the European Union, 2018. doi:10.
     2760/294383.
 [4] B. Izci, I. Jones, T. Ozdemir, L. Alktebi, E. Bakır, Youtube and young children: Research,
     concerns, and new directions., Lisbon School of Education, 2019, pp. 81–92.
 [5] EU strategy on the rights of the child COM/2021/142 final, Technical Report, European
     Commission, 2021. URL: https://eur-lex.europa.eu/legal-content/en/TXT/?uri=CELEX%
     3A52021DC0142.
 [6] M. Ekstrand, Challenges in evaluating recommendations for children, in: KidRec 2017,
     2017.
 [7] A. Milton, #thehorror: Evaluating information retrieval systems for kid, in: KidRec 2017,
     2020.
 [8] D. Jannach, C. Bauer, Escaping the mcnamara fallacy: Towards more impactful recom-
     mender systems research, AI Magazine 41 (2020) 79–95. URL: https://ojs.aaai.org/index.
     php/aimagazine/article/view/5312. doi:10.1609/aimag.v41i4.5312.
 [9] V. Dignum, M. Penagos, K. Pigmans, S. Vosloo, Policy Guidance on AI for Children,
     Technical Report, UNICEF, 2020. URL: https://www.unicef.org/globalinsight/reports/
     policy-guidance-ai-children.
[10] M. Schedl, E. Gómez, J. Urbano, Music Information Retrieval: Recent Developments and
     Applications, Now Foundations and Trends, 2014. doi:10.1561/9781601988072.
[11] J. Beel, M. Genzmehr, S. Langer, A. Nürnberger, B. Gipp, A comparative analysis of offline
     and online evaluations and discussion of research paper recommender system evaluation,
     in: Proceedings of the International Workshop on Reproducibility and Replication in
     Recommender Systems Evaluation, RecSys ’13, Association for Computing Machinery,
     New York, NY, USA, 2013, p. 7–14. URL: https://doi.org/10.1145/2532508.2532511. doi:10.
     1145/2532508.2532511.
[12] S. J. Cunningham, E. Zhang, Development of a music organizer for children, in: J. P.
     Bello, E. Chew, D. Turnbull (Eds.), ISMIR 2008, 9th International Conference on Music
     Information Retrieval, Drexel University, Philadelphia, PA, USA, September 14-18, 2008,
     2008, pp. 185–190. URL: http://ismir2008.ismir.net/papers/ISMIR2008_123.pdf.
[13] M. S. Pera, Y.-K. Ng, What to read next? making personalized book recommendations for
     k-12 users, in: Proceedings of the 7th ACM Conference on Recommender Systems, RecSys
     ’13, Association for Computing Machinery, New York, NY, USA, 2013, p. 113–120. URL:
     https://doi.org/10.1145/2507157.2507181. doi:10.1145/2507157.2507181.
[14] J. A. Fails, M. S. Pera, F. Garzotto, M. Gelsomini, Kidrec: Children & recommender
     systems: Workshop co-located with acm conference on recommender systems (recsys
     2017), in: Proceedings of the Eleventh ACM Conference on Recommender Systems, RecSys
     ’17, Association for Computing Machinery, New York, NY, USA, 2017, p. 376–377. URL:
     https://doi.org/10.1145/3109859.3109956. doi:10.1145/3109859.3109956.
[15] A. Milton, M. Green, A. Keener, J. Ames, M. D. Ekstrand, M. S. Pera, Storytime: Eliciting
     preferences from children for book recommendations, in: Proceedings of the 13th ACM
     Conference on Recommender Systems, RecSys ’19, Association for Computing Machinery,
     New York, NY, USA, 2019, p. 544–545. URL: https://doi.org/10.1145/3298689.3347048. doi:10.
     1145/3298689.3347048.
[16] A. Milton, L. Batista, G. Allen, S. Gao, Y.-K. D. Ng, M. S. Pera, “don’t judge a book by
     its cover”: Exploring book traits children favor, in: Fourteenth ACM Conference on
     Recommender Systems, RecSys ’20, Association for Computing Machinery, New York,
     NY, USA, 2020, p. 669–674. URL: https://doi.org/10.1145/3383313.3418490. doi:10.1145/
     3383313.3418490.
[17] M. Landoni, E. Murgia, T. Huibers, M. Pera, My name is sonny, how may i help you search-
     ing for information?, in: IDC ’19, Association for Computing Machinery (ACM), United
     States, 2019. 18th ACM International Conference on Interaction Design and Children, IDC
     2019; Conference date: 12-06-2019 Through 15-06-2019.
[18] Y. Deldjoo, C. Frà, M. Valla, M. A. Tuncel, F. Garzotto, P. Cremonesi, A. Paladini, D. Anghi-
     leri, Enhancing children’s experience with recommendation systems, in: KidRec 2017,
     2017.
[19] M. Schedl, C. Bauer, Online music listening culture of kids and adolescents: Listening
     analysis and music recommendation tailored to the young, in: KidRec 2017, 2017.
[20] M. S. Pera, Y.-K. Ng, With a little help from my friends: Generating personalized book
     recommendations using data extracted from a social website, in: 2011 IEEE/WIC/ACM
     International Conferences on Web Intelligence and Intelligent Agent Technology, volume 1,
     2011, pp. 96–99. doi:10.1109/WI-IAT.2011.9.
[21] M. Landoni, E. Murgia, F. Gramuglio, G. Manfredi, Teaching an alien: Children recom-
     mending what and how to learn, in: KidRec 2018, 2018.
[22] T. Horiuchi, M. Rothschild, R. Barrera, S. Gururajan, Designing a personally meaningful
     abcmouse.com: Challenges and questions in an edtech recommendation system, in: KidRec
     2018, 2017.
[23] A. Milton, E. Murgia, M. Landoni, T. Huibers, M. Pera, Here, there, and everywhere:
     Building a scaffolding for children’s learning through recommendations, in: O. Shalom,
     D. Jannach, I. Guy (Eds.), ImpactRS 2019: Impact of Recommender Systems 2019, CEUR
     workshop proceedings, CEUR, 2019. 1st Workshop on the Impact of Recommender Systems,
     ImpactRS 2019; Conference date: 19-09-2019 Through 19-09-2019.
[24] W. Ma, M. Zhang, C. Zhang, Y. Chen, Q. Xie, W. Sun, Y. Liu, S. Ma, A game-based data
     collecting framework for the recommendation of kids’ second language learning, in:
     KidRec 2017, 2017.
[25] H. Xie, M. Wang, D. Zou, F. L. Wang, A personalized task recommendation system for
     vocabulary learning based on readability and diversity, in: International conference on
     blended learning, Springer, 2019, pp. 82–92.
[26] F. Delprino, O. F. Bravo, M. Mariani, C. Piva, N. Izzo, M. Matera, R. Tassi, Playing outdoor,
     recommending new content: Stimulating kids’ learning through the abbot smart object,
     in: KidRec 2017, 2017.
[27] M. S. Pera, K. Wright, M. Ekstrand, Recommending texts to children with an expert in the
     loop, in: KidRec 2018, 2018.
[28] K. Tsiakas, E. Barakova, J. V. Khan, P. Markopoulos, Brainhood: towards an explainable
     recommendation system for self-regulated cognitive training in children, in: Proceedings
     of the 13th ACM International Conference on PErvasive Technologies Related to Assistive
     Environments, 2020, pp. 1–6.
[29] N. Kucirkova, The learning value of personalization in children’s reading recommendation
     systems: What can we learn from constructionism?, International Journal of Mobile and
     Blended Learning (IJMBL) 11 (2019) 80–95.
[30] I. Picton, The impact of ebooks on the reading motivation and reading skills of children
     and young people: A rapid literature review., National Literacy Trust (2014).
[31] M. Ueno, Y. Miyazawa, Irt-based adaptive hints to scaffold learning in programming, IEEE
     Transactions on Learning Technologies 11 (2017) 415–428.
[32] Children’s Online Privacy Protection Rule ("COPPA"), Technical Report, Federal
     Trade Commission, United States, 2021. URL: https://www.ftc.gov/enforcement/rules/
     rulemaking-regulatory-reform-proceedings/childrens-online-privacy-protection-rule.
[33] Regulation (EU) 2016/679 on the protection of natural persons with regard to the processing
     of personal data and on the free movement of such data, and repealing Directive 95/46/EC
     (General Data Protection Regulation), Technical Report, European Parliament and Council,
     2016. URL: https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng.
[34] K. Lukoff, U. Lyngs, H. Zade, J. V. Liao, J. Choi, K. Fan, S. A. Munson, A. Hiniker, How the
     Design of YouTube Influences User Sense of Agency, Association for Computing Machinery,
     New York, NY, USA, 2021. URL: https://doi.org/10.1145/3411764.3445467.
[35] A. Hiniker, S. S. Heung, S. R. Hong, J. A. Kientz, Coco’s Videos: An Empirical Investigation
     of Video-Player Design Features and Children’s Media Use, Association for Computing Ma-
     chinery, New York, NY, USA, 2018, p. 1–13. URL: https://doi.org/10.1145/3173574.3173828.
[36] A. Raj, A. Milton, M. D. Ekstrand, Pink for princesses, blue for superheroes: The need
     to examine gender stereotypes in kid’s products in search and recommendations, CoRR
     abs/2105.09296 (2021). URL: https://arxiv.org/abs/2105.09296. arXiv:2105.09296.
[37] N. Soni, A. Aloba, K. S. Morga, P. J. Wisniewski, L. Anthony, A framework of touchscreen
     interaction design recommendations for children (tidrc): Characterizing the gap between
     research evidence and design practice, in: Proceedings of the 18th ACM International
     Conference on Interaction Design and Children, IDC ’19, Association for Computing
     Machinery, New York, NY, USA, 2019, p. 419–431. URL: https://doi.org/10.1145/3311927.
     3323149. doi:10.1145/3311927.3323149.
[38] N. Borgers, E. de Leeuw, J. Hox, Children as respondents in survey research: Cognitive
     development and response quality 1, Bulletin of Sociological Methodology/Bulletin de
     Méthodologie Sociologique 66 (2000) 60–75. URL: https://doi.org/10.1177/075910630006600106.
     doi:10.1177/075910630006600106.
[39] M. Landoni, D. Matteri, E. Murgia, T. Huibers, M. Pera, Sonny, cerca! evaluating the
     impact of using a vocal assistant to search at school, in: F. Crestani, M. Braschler, J. Savoy,
     A. Rauber, H. Müller, D. Losada, G. Heinatz Bürki, L. Cappellato, N. Ferro (Eds.), Experimen-
     tal IR Meets Multilinguality, Multimodality, and Interaction, Lecture Notes in Computer Sci-
     ence, Springer, Netherlands, 2019, pp. 101–113. doi:10.1007/978-3-030-28577-7_6,
     10th International Conference of the CLEF Association, CLEF 2019.
[40] O. Anuyah, M. Green, A. Milton, S. Pera, The need for a comprehensive strategy to evaluate
     search engine performance in the classroom, in: KidRec 2019, 2019.

</pre>