A dialogue-based software architecture for gamified discrimination tests

Antonio Origlia, Dept. of Information Engineering, University of Padua, antonio.origlia@dei.unipd.it
Piero Cosi, Institute of Cognitive Sciences and Technology (CNR-ISTC), piero.cosi@pd.istc.cnr.it
Antonio Rodà, Dept. of Information Engineering, University of Padua, roda@dei.unipd.it
Claudio Zmarich, Institute of Cognitive Sciences and Technology (CNR-ISTC), claudio.zmarich@cnr.it

ABSTRACT
In this work we describe the current stage of development of a software architecture designed to present discrimination tests to pre-school children in the form of gamified tasks. We interpret the problem of administering these tests as a dialogue model using probabilistic rules to generate customised tests on the basis of the child’s performance. In the proposed architecture, the dialogue system controls a gaming setup composed of a virtual agent and a robotic companion that needs to be taught how to talk. This learning-by-teaching approach is used to camouflage a phoneme discrimination test that has the added value of being generated at runtime on the basis of the child’s performance. We describe the architectural components involved and how the dialogue system can make use of linguistic knowledge to generate the discrimination test and administer it by controlling the agents involved in the game.

Author Keywords
Gamification; software architecture; discrimination tests

Copyright © 2017 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. GHITALY17: 1st Workshop on Games-Human Interaction, September 18th, 2017, Cagliari, Italy.

INTRODUCTION
Phonetic perception abilities are in place and active already in the fetus, and their integrity is necessary for normal future speech development [12, 23]. Since the ability to discriminate linguistic sounds is associated with the correct acquisition and production of the same sounds, an alteration of this ability could contribute to the onset of speech and language disorders [2]. For this reason, the evaluation of the phonetic discrimination ability is important in order to identify at-risk subjects, allowing clinicians and caregivers to operate in focused and specific ways. For preschool children (from 3 years old onward), the paradigms of identification and discrimination are the same as those used by adults [16]. Among several types of discrimination tests, we chose the standard AX or “same-different” procedure.

Traditionally, AX tests that evaluate the phoneme discrimination capability of young children are designed as scripts, and the software used to administer this kind of test also follows scripts (e.g. [1]). These contain a series of (non-)word pairs presenting phoneme oppositions (e.g. ’pepi / ’pemi) in different syllabic structures (e.g. CV-CV is a disyllabic structure where each syllable has a single head consonant). The child is given the task to indicate, after listening to the experimenter reading the stimuli, whether the two (non-)words are the same or different. These tests are designed in such a way that consonants presenting a single distinctive trait are opposed each time (e.g. voiced/unvoiced, sonorant/non-sonorant). Control stimuli are present in such tests as pairs composed of the same word repeated twice and as pairs composed of completely different words. This scripted approach is necessary as it is impossible for a human expert to dynamically select word pairs that comply with a set of very strict constraints. Specifically, each word pair must:

• present opposed consonants that differ in exactly one trait
• have the same syllabic structure in the two (non-)words
• present the opposition in a precise position in the syllabic structure (e.g. the head consonant of the second syllable)
• have the accent in the same place in the two (non-)words

Given the young age of the considered subjects, it is necessary to mask the test in a game-like scenario to make it less imposing. Healthy contact with language, in the first years of life, consists of a playful activity where parents and infants engage in protoconversations made of rhythmical and musical content. This manifests the emotional regulation of primary inter-subjectivity [19], where interaction with the caregiver, either reciprocally directed or mediating access to objects of interest for the infant, manifests the typical playfulness often observed in mammals. At 9 months, secondary inter-subjectivity arises [22] and the baby’s interest moves onto sharing the ways companions use objects as she starts to interact with the material world in a more informed way. The caregivers’ language also shifts, in this phase, from questions and rhetorical comments to instructions and informative comments to support the baby’s interest in participating in a task [10]. This is “[...] the start of cultural information transfer between generations” [20, p. 74]. Playful behaviour adapts to new roles as the child grows older but always stays in the background, motivating access to cultural information, reinforcing memory and supporting the creation of meaning [17].

Language development strongly depends on inter-subjective experiences: cultural learning depends on the effective engagement of minds and bodies [9]. Although humans appear to be born with a natural disposition towards cultural learning [21], successful acquisition of cultural skills depends on the interaction quality, especially considering social feedback. Given the social nature of cultural transfer, it is not enough to expose children to new words without providing an adequate context for them. Engaging and meaningful activities are especially important to attract the children's interest and show them how words can provide the natural pleasure that comes with gaining competence in interacting with their loved ones and with peers.

Storytelling has been demonstrated to be a powerful means to accomplish this as children are born with “[...] an abundant and early armament of narrative tools” [3, p. 90]. Through storytelling, children acquire skills related to the so-called emergent literacy [18], which is a necessary prerequisite to mastering reading and writing. These skills cover metalinguistic awareness, cohesion and reference in oral communication and the capability of making one’s own intentions known to others. Emergent literacy capabilities “[...] are acquired first in language play and in storytelling. Many of them are acquired in the context of children’s interactions with peers, in early play contexts.” [5, p. 76]. Once again, the importance of social context and playful interaction is highlighted concerning the acquisition of literacy skills. Wordplay for children appears to be based on matching and substituting words on the basis of their sound rather than their meaning, as they appear to [5, p. 78] “[...] derive tremendous pleasure from rhyming words (“you silly”; “no, you pilly”) or words that sound similar (adult: “Indians lived in a teepee”; child: “pee-pee!”)”. In order to become meaningful and precious for children, teaching activities need to have a basis of experiences showing language as a tool to provide pleasure in social activities.

In this paper, we present a software architecture designed to present discrimination tests in a playful setup depicting a social situation with different kinds of virtual agents. This ongoing work builds upon the experience of the Colorado Literacy Tutor [6] and of the Italian Literacy Tutor [7].

SYSTEM ARCHITECTURE
The scripted approach has the disadvantage of not being able to adjust the test depending on the subject’s performance. As a limited amount of time is available to administer the test before the child gets tired, choosing the most informative stimulus at each step of the test would represent an advantage when information is clearer on some traits and more uncertain on others. Being able to concentrate on collecting information on specific aspects of linguistic competence that have been observed to be challenging for the child would optimise the available time. The system architecture we designed to administer the discrimination tests has two main purposes:

• dynamically adapt the test to the child’s performance;
• support groups of virtual agents to establish social setups.

To pursue the first goal, we represent the discrimination test as a dialogue model where each stimulus, once paired with the child’s answer, generates a new stimulus as a system response. This stimulus is selected depending on a utility function taking into account linguistic knowledge and the child’s performance. From an architectural point of view, this is reflected in a dialogue manager acting as the system’s controller and in linguistic knowledge being distributed between the dialogue manager and a database of Italian words. The dialogue manager is provided with the capability to establish which kind of information can be obtained by presenting each available stimulus and with a non-word generator using phonotactic rules to avoid structures not belonging to the Italian language. The database contains morpho-syntactic, phonological and frequency data about words to improve the quality of the selected stimuli.

To present the discrimination test in a social setup, the dialogue manager controls a set of virtual agents with different characteristics. In our case, a virtual avatar is presented on a computer screen and acts as the game’s guide while a social robot is used to implement a learning-by-teaching approach, detailed in the Interface section. The virtual avatar is controlled using the Unreal Engine 4 (www.unrealengine.com) and its voice is dynamically generated using the Mivoq Voice Synthesis Engine (www.mivoq.it). The synthetic voice has a number of advantages: it allows the system to be easily updated as the proposed stimuli are not pre-recorded, it allows the 3D characters to address the child by calling her by name, thus establishing a closer relationship, and it can be adapted to different kinds of characters. In the specific case of Mivoq, personalised voices and specific prosodic styles can also be synthesised, opening up a number of applications for game-like software artefacts. A tablet interface, also controlled using the Unreal Engine 4, is provided to the child to evaluate the proposed stimuli. Since the ability to adequately use a tablet interface appears to be reliable for children aged 5 years and older [24], this is the minimum age recommended to apply this technology. The robot used in our implementation is Nao (www.softbank.jp/en/corp/group/sbr/), which is a well-established robotic platform for working with children.

The dialogue manager does not make assumptions about the nature of the virtual agents it is connected to. The commands it generates are the same for both the robotic platform and the 3D character (e.g. Synthesise, Speak...). Command implementation is delegated to the specific platform to separate the test logic from its actual implementation. The full schema of the architecture we present is shown in Figure 1. In the following sections, we detail how each module was designed and its role in the general setting.

Figure 1. System Architecture.

While the system we are developing is able to administer the test without human supervision, we do not exclude the human expert from the experimental setup. The presence of a reference human figure is important to reassure the child and to integrate the obtained results in the light of direct observation of the child’s behaviour. In these development stages, moreover, the experience of practitioners is precious to improve the quality of the overall experience without altering the validity of the test.

LINGUISTIC KNOWLEDGE BASE
With the advent of Big Data and, in particular, with the increasing availability of Linked Open Data, the need to establish a representation format suitable for dynamic, rapidly changing and interconnected objects arose. RDF represents the most widely used solution to this need and has been adopted to implement the most widely known repositories of linked knowledge available today. An alternative to RDF is now represented by graph databases. Neo4J [25] is the graph database solution we used in our architecture. It is an open source graph database manager that has been developed over the last sixteen years and has been applied to a large number of tasks related, among others, to data representation [8] and visualisation [11]. In Neo4J, nodes and relationships may be assigned labels, which describe the type of the object they are associated with. In this work, labels are mainly used to represent morpho-syntactic characteristics of words and the nature of the relationships among nodes. Nodes and relationships may have properties, which are used here to store the details of each single node or relationship. Labels and properties are the main means used by Neo4J to filter data and retrieve answers to user queries.

In this work, we use the MultiWordNet-Extended (MWN-E) dataset [14] as the knowledge base to control the decision process for the discrimination test. The MWN-E dataset is based on the MultiWordNet dataset [15] and extended by introducing morpho-syntactic data (e.g. gender, number...), derived forms (e.g. plurals, conjugations...) and SAMPA pronunciations. Also, phonological neighbourhoods are computed and are of particular interest for this work. A word A is defined to be a phonological neighbour of the word B if it is possible to obtain B by altering the phonological representation of A using exactly one Insertion/Deletion/Substitution operation. Phonological neighbourhoods are represented by establishing relationships of type HAS_PHONOLOGICAL_NEIGHBOUR between two words if the Minimum Edit Distance of their phonological transcriptions equals 1. This kind of relationship has a distance property that, in these cases, is set to 1. Relationships of type HAS_PHONOLOGICAL_NEIGHBOUR are also established between words that have the same pronunciation but different written forms. In this case the value of the distance property is set to 0. Other than the data included in the version presented in [14], the MWN-E version used in this work also contains frequency data for the terms in the vocabulary presented in the Primo Vocabolario del Bambino (Children’s first vocabulary) [4] and from the Italian Wikipedia (data extracted from the 20/04/2017 Wikipedia.it dump). Currently, MWN-E consists of 292282 nodes containing 1536550 properties. 943174 relationships among these nodes are found, phonological neighbourhood relationships at distance 1 representing the majority.

The querying language used to extract data from a Neo4J database is Cypher. Cypher is designed to be a declarative language that highlights pattern structure by using an SQL-inspired ASCII-art syntax. A brief overview of the syntactic elements of Cypher queries is given here to help in understanding the example queries presented in this paper. The reader is referred to the online Cypher manual (https://neo4j.com/developer/cypher-query-language) for a more detailed presentation of Cypher. As nodes are usually represented by circles in graphical representations of graphs, in Cypher nodes are represented by round brackets. For example, the query MATCH (n:VERB) RETURN n returns all the nodes of the graph labelled as verbs. In the same way, since relationships are usually represented by labelled arrows in graph schemas, relationships between nodes are described by using ASCII arrows, too. The query MATCH (m)-[:DERIVES_FROM]->(:VERB {word: ’essere’}) RETURN m returns all the nodes that contain a term that derives from the verb essere (to be). The SQL-like WHERE clause may also be used to filter results using boolean logic.

The query shown in Figure 2 shows how to obtain a pair (w1, w2) consisting of disyllabic words that are phonological neighbours and are obtained by substituting the /p/ phoneme in the first word with the /b/ phoneme in the second word. Sets of words to be excluded after having been presented are also included (in this example, cubo and cupo) as well as the sorting logic. The first part of the Cypher query matches words that are linked by phonological neighbourhood relationships at distance 1, regardless of arc orientation. A filter is then applied on the syllabic structure using a regular expression on the SAMPA transcription property. In this case, only words presenting a CV-CV structure with the accent on the first syllable and presenting the phonemes /p/ and /b/ in opposition on the head of the second syllable are accepted. The regular expression is dynamically generated by the dialogue system depending on the opposition to present and on the word structure complexity. The former comes from a decision process implemented in the dialogue manager while the latter becomes more complex as the words available for each considered structure become less informative, as in the case of words presenting oppositions that have already been investigated.

Figure 2. Example query. Extracts a word pair of disyllabic phonological neighbours opposing the /p/ sound and the /b/ sound in the head of the second syllable.

OPENDIAL
Opendial [13] is a dialogue management framework based on probabilistic rules aiming at merging the best of rule-based and probabilistic dialogue management. In cases where the dialogue designer possesses a good amount of previous knowledge about the domain and has specific needs for fine-tuning rules, the rule-based approach can be integrated with probability and utility-based reasoning to fine-tune the system’s response. Probabilistic rules, in Opendial, are used to set up and update a Bayesian network consisting of variables representing the dialogue state. Depending on this, the dialogue manager selects the most probable user action given a set of possibly inaccurate inputs. Using a set of utility functions provided by the dialogue designer, the manager computes the most useful system reaction, possibly generating natural language responses or executing actions. In Opendial, it is possible to apply a priori estimates on future values of state variables. The probability distributions providing a priori estimates can be updated, using Bayesian inference, after the actual observation arrives to dynamically improve the model.

In Opendial, dialogue domains are described in an XML format specifically designed for the dialogue system. This is composed of a set of models triggered by variable updates and containing sets of rules to change the dialogue state. Opendial supports unification in its dialogue specification language so that variables can be included to obtain generic rules. In the example shown in Figure 3, a part of the model that identifies opposing traits given two phonemes is presented. The condition for the considered rule to fire is that the two phonemes in the opposition variable are not the same. If the condition is verified, a custom HasTraits function is used to determine whether each of the two phonemes has the sonorant trait. Then, the probability that the set of opposing traits contains the sonorant trait is equal to the XOR of the results obtained by applying the HasTraits function to the considered phonemes.

Figure 3. Example rule to check whether two phonemes have the Sonorant trait in opposition. The HasTraits function has been implemented in Java and exposed to the domain specification language.

Opendial can also be extended with Java-based plugins and functions. In our case, we developed a set of plugins to connect the dialogue system to the Neo4J database and to the remote actors providing the user interface. We also developed the custom function to compute the set of opposed traits given two phonemes and a utility model to select the most informative stimulus at each step. The system makes use of the prediction and feedback mechanism provided by Opendial to build the probability distributions describing the likelihood that a subject discriminates a specific trait. This is used to select the next stimulus that improves the user model the most, given previous answers. This approach results in an adaptive test. The description of the utility model is beyond the scope of this work so we provide only a brief description of the aspects it takes into account. The model considers the information entropy for each trait, the syllabic structures already used to present the available oppositions, the number of traits opposed in each possible phoneme pair and the intrinsic phoneme complexity evaluated on an acquisitional basis [26]. For all these aspects, a specific utility value is computed. The obtained measures are combined into a utility value that is used to select the best stimulus at each step.

INTERFACE
The interface proposed to the child to mask the discrimination test supports a narrative in which the Nao robot wants to learn how to speak and the 3D character needs the child’s help to teach it. A three-polar setup, shown in Figure 4, is established to involve the child in a socially engaging situation. Through this learning-by-teaching approach, the child is given an authoritative role to avoid making her feel threatened or evaluated. When the system starts, an introductory scenario is presented and the 3D character, shown in Figure 5, introduces itself. The scenario ends with the 3D character asking the child to caress Nao in order to wake it up. This has the goal of both providing the invitation to play and establishing physical contact between Nao and the child.

Figure 4. The experimental setup.

Figure 5. The 3D character. It guides the child through the game and interacts with Nao during cutscenes.

Whether the physical attributes of robots constitute an advantage for acceptability per se is still a debated issue. In our work, we attempt to fully exploit the physical presence of the robot by presenting tasks that require the child to physically interact with it. By proposing activities that a 3D character simply cannot be involved in, we attempt to capitalise on the robot’s potential to provide a more engaging multisensorial experience. Caressing is also a powerful social means to build attachment. On the other hand, the high level of control over the 3D character's movements makes it possible to efficiently represent its higher competence in the considered setup: differently from Nao, this avatar can move its lips and change its facial expressions, providing effective indications on how to continue playing. An advantage of the presented architecture is that different virtual agents can be combined to build the test upon the various advantages they offer.

After a tutorial session where Nao performs a small set of funny behaviours, the child is introduced to the actual test. The dialogue manager selects the most appropriate stimulus and coordinates the two agents so that one presents the first (non-)word and the other presents the second. The child is given one possibility to listen to the stimulus again and is required to provide same/different feedback using an evaluation card that appears on the tablet. The interface to provide feedback is shown in Figure 6.

Figure 6. The tablet interface. The child gives feedback by touching the red or green areas to evaluate Nao’s performance. A repeat button is also present to allow the child to listen to the opposed words again. This is allowed only once for each stimulus.

CONCLUSIONS AND FUTURE WORK
We have presented the work-in-progress on an architectural setup that has been designed to administer gamified discrimination tests. We interpret the test as a dialogue model between the child and a group of virtual characters controlled by a single artificial intelligence. Instead of providing pre-scripted tests, we propose an approach where the test is dynamically generated. The system is able to exploit a significant amount of linguistic knowledge to automatically select the most informative stimulus to present at each step. The architecture does not make assumptions about the nature of the virtual agents involved and can be reused to design other types of test. Future work will consist of evaluating the usability and appreciation of the discrimination test we are designing with children that do not show problems in language acquisition, to establish a baseline that will be useful to evaluate the approach on children with potential language problems. The possibilities given by the Mivoq engine to train personalised voices will also be explored.

ACKNOWLEDGMENTS
Antonio Origlia’s work is supported by Veneto Region and European Social Fund (grant C92C16000250006).

REFERENCES
1. André, C., Ghio, A., Cavé, C., and Teston, B. PERCEVAL: a computer-driven system for experimentation on auditory and visual perception. CoRR abs/0705.4415 (2007).
2. Brancalioni, A. R., Bertagnolli, A. P. C., Bonini, J. B., Gubiani, M. B., and Keske-Soares, M. The relation between auditory discrimination and phonological disorder. Jornal da Sociedade Brasileira de Fonoaudiologia 24, 2 (2012), 157–161.
3. Bruner, J. S. Acts of meaning, vol. 3. Harvard University Press, 1990.
4. Caselli, M. C., and Casadio, P. Il primo vocabolario del bambino. Milano: Franco Angeli, 1995.
5. Cassell, J. Towards a model of technology and literacy development: Story listening systems. Journal of Applied Developmental Psychology 25, 1 (2004), 75–105.
6. Cole, R. A. Roadmaps, journeys and destinations: speculations on the future of speech technology research. In Eighth European Conference on Speech Communication and Technology (2003).
7. Cosi, P., Delmonte, R., Biscetti, S., Cole, R. A., Pellom, B., and Vuren, S. v. Italian literacy tutor-tools and technologies for individuals with cognitive disabilities. In InSTIL/ICALL Symposium 2004 (2004).
8. Dietze, F., Karoff, J., Valdez, A. C., Ziefle, M., Greven, C., and Schroeder, U. An open-source object-graph-mapping framework for neo4j and scala: Renesca. In International Conference on Availability, Reliability, and Security, Springer (2016), 204–218.
9. Donald, M. A mind so rare: The evolution of human consciousness. WW Norton & Company, 2001.
10. Halliday, M. A. K. Learning How to Mean–Explorations in the Development of Language. ERIC, 1975.
11. Jiménez, P., Diez, J. V., and Ordieres-Mere, J. Hoshin kanri visualization with neo4j. Empowering leaders to operationalize lean structural networks. Procedia CIRP 55 (2016), 284–289.
12. Kuhl, P. K. Early language acquisition: cracking the speech code. Nature reviews neuroscience 5, 11 (2004), 831–843.
13. Lison, P., and Kennington, C. Opendial: A toolkit for developing spoken dialogue systems with probabilistic rules. ACL 2016 (2016), 67.
14. Origlia, A., Paci, G., and Cutugno, F. MWN-E: a graph database to merge morpho-syntactic and phonological data for italian. In Proceedings of Subsidia (2017).
15. Pianta, E., Bentivogli, L., and Girardi, C. Developing an aligned multilingual database. In Proc. of the 1st International Conference on Global WordNet (2002).
16. Polka, L., Jusczyk, P. W., and Rvachew, S. Methods for studying speech perception in infants and children. Speech perception and linguistic experience: Issues in cross-language research (1995), 49–89.
17. Reddy, V. How infants know minds. Harvard University Press, 2008.
18. Teale, W. H., and Sulzby, E. Emergent Literacy: Writing and Reading. Writing Research: Multidisciplinary Inquiries into the Nature of Writing Series. ERIC, 1986.
19. Trevarthen, C. Communication and cooperation in early infancy: A description of primary intersubjectivity. Before speech: The beginning of interpersonal communication (1979), 321–347.
20. Trevarthen, C. The functions of emotion in infancy. In The healing power of emotion: Affective neuroscience, development & clinical practice (Norton Series on Interpersonal Neurobiology), D. Fosha, D. J. Siegel, and M. F. Solomon, Eds. WW Norton & Company, 2009, 55–85.
21. Trevarthen, C., and Aitken, K. Regulation of brain development and age-related changes in infants' motives: The developmental function of regressive periods. In M. Heimann (ed.), Regression Periods in Human Infancy. Mahwah, NJ: Erlbaum (2003), 107–184.
22. Trevarthen, C., Hubley, P., et al. Secondary intersubjectivity: Confidence, confiding and acts of meaning in the first year. Action, gesture and symbol: The emergence of language (1978), 183–229.
23. Tsao, F.-M., Liu, H.-M., and Kuhl, P. K. Speech perception in infancy predicts language development in the second year of life: A longitudinal study. Child development 75, 4 (2004), 1067–1084.
24. Vatavu, R.-D., Cramariuc, G., and Schipor, D. M. Touch interaction for children aged 3 to 6 years: Experimental findings and relationship to motor skills. International Journal of Human-Computer Studies 74 (2015), 54–76.
25. Webber, J. A programmatic introduction to neo4j. In Proceedings of the 3rd annual conference on Systems, programming, and applications: software for humanity, ACM (2012), 217–218.
26. Zmarich, C., and Bonifacio, S. Phonetic inventories in italian children aged 18-27 months: a longitudinal study. In INTERSPEECH (2005), 757–760.
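As a supplementary illustration of the word-pair constraints listed in the Introduction and of the trait XOR described in the OPENDIAL section, the following minimal Python sketch checks whether two (non-)words form a valid single-trait opposition. It is not the Java HasTraits function used in the system: the trait table, the representation of a word as a tuple of syllables and all function names are assumptions made for this example only.

```python
from typing import Dict, FrozenSet, Optional, Tuple

# Hypothetical, deliberately tiny trait table (an assumption for illustration,
# not the phoneme inventory used by the system described in the paper).
TRAITS: Dict[str, FrozenSet[str]] = {
    "p": frozenset({"labial"}),
    "b": frozenset({"labial", "voiced"}),
    "t": frozenset({"coronal"}),
    "d": frozenset({"coronal", "voiced"}),
    "m": frozenset({"labial", "voiced", "sonorant", "nasal"}),
    "n": frozenset({"coronal", "voiced", "sonorant", "nasal"}),
}
VOWELS = set("aeiou")
Word = Tuple[str, ...]  # a word as a tuple of syllables, e.g. ("ku", "po")


def opposed_traits(ph1: str, ph2: str) -> FrozenSet[str]:
    """Traits held by exactly one of the two phonemes (a per-trait XOR)."""
    return TRAITS[ph1] ^ TRAITS[ph2]


def cv_skeleton(word: Word) -> Tuple[str, ...]:
    """Map each syllable to its consonant/vowel pattern, e.g. 'po' -> 'CV'."""
    return tuple("".join("V" if ch in VOWELS else "C" for ch in syl) for syl in word)


def single_trait_opposition(
    w1: Word, w2: Word, stress1: int, stress2: int
) -> Optional[Tuple[int, int, str]]:
    """Return (syllable index, position in syllable, opposed trait) if the pair
    satisfies all four constraints, otherwise None."""
    # Same accent position and same syllabic structure.
    if stress1 != stress2 or cv_skeleton(w1) != cv_skeleton(w2):
        return None
    # Exactly one differing segment, so the opposition has a precise position.
    diffs = [
        (i, j)
        for i, (s1, s2) in enumerate(zip(w1, w2))
        for j, (c1, c2) in enumerate(zip(s1, s2))
        if c1 != c2
    ]
    if len(diffs) != 1:
        return None
    i, j = diffs[0]
    c1, c2 = w1[i][j], w2[i][j]
    if c1 not in TRAITS or c2 not in TRAITS:
        return None  # the differing segment is not a known consonant
    traits = opposed_traits(c1, c2)
    if len(traits) != 1:
        return None  # the consonants differ in more than one trait
    return i, j, next(iter(traits))
```

For instance, under this toy table the pair ("ku", "po") / ("ku", "bo") with the accent on the first syllable is accepted as a voicing opposition on the head of the second syllable, while a pair opposing /p/ and /m/ is rejected because more than one trait differs.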
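The LINGUISTIC KNOWLEDGE BASE section states that the regular expression applied to the SAMPA transcription is generated dynamically from the opposition to present. The sketch below shows one way such a pattern could be built and applied to a toy lexicon in Python. The transcription format (a `"` stress mark before the stressed syllable and `-` as syllable separator), the simplified consonant and vowel classes and the function names are assumptions for illustration; the actual MWN-E property format and the query of Figure 2 may differ. In Cypher, a pattern of this kind would be applied with the `=~` operator.

```python
import re
from typing import Dict, List, Optional, Set, Tuple

# Simplified single-character symbol classes (assumption, not full SAMPA).
CONSONANTS = "pbtdkgfvszmnlr"
VOWELS = "aeiou"


def cvcv_pattern(target: str, stress: str = '"', sep: str = "-") -> str:
    """Regex for a stressed CV syllable followed by a CV syllable whose head
    consonant is `target`."""
    c = f"[{CONSONANTS}]"
    v = f"[{VOWELS}]"
    return f"^{re.escape(stress)}{c}{v}{re.escape(sep)}{re.escape(target)}{v}$"


def find_pairs(
    lexicon: Dict[str, str],
    ph1: str,
    ph2: str,
    exclude: Optional[Set[str]] = None,
    sep: str = "-",
) -> List[Tuple[str, str]]:
    """Word pairs (w1, w2) where w2 is obtained from w1 by substituting `ph1`
    with `ph2` in the head of the second syllable."""
    exclude = exclude or set()
    p1 = re.compile(cvcv_pattern(ph1, sep=sep))
    p2 = re.compile(cvcv_pattern(ph2, sep=sep))
    pairs = []
    for w1, t1 in sorted(lexicon.items()):
        if w1 in exclude or not p1.match(t1):
            continue
        # Transcription expected for the neighbour at edit distance 1.
        wanted = t1.replace(sep + ph1, sep + ph2, 1)
        for w2, t2 in sorted(lexicon.items()):
            if w2 not in exclude and p2.match(t2) and t2 == wanted:
                pairs.append((w1, w2))
    return pairs
```

The `exclude` argument mirrors the set of already presented words mentioned in the description of Figure 2; in the real system the neighbourhood relationship is precomputed in the graph rather than recomputed by string substitution as done here.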
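The OPENDIAL section mentions that a priori estimates are updated by Bayesian inference and that the utility model considers, among other aspects, the information entropy for each trait. Since the utility model itself is not described in the paper, the following Python sketch covers only that entropy term, using a Beta-Bernoulli belief per trait. This is a stand-in chosen for illustration, not the authors' model or Opendial's internal mechanism, and the class and function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class TraitBelief:
    """Beta(alpha, beta) belief over the probability that the child
    correctly discriminates a given trait (uniform prior by default)."""
    alpha: float = 1.0
    beta: float = 1.0

    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def update(self, correct: bool) -> None:
        """Conjugate Bayesian update after observing one answer."""
        if correct:
            self.alpha += 1.0
        else:
            self.beta += 1.0


def bernoulli_entropy(p: float) -> float:
    """Entropy, in bits, of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def most_uncertain_trait(beliefs: Dict[str, TraitBelief]) -> str:
    """Trait whose expected outcome is currently the most uncertain.
    Ties are broken alphabetically to keep the choice deterministic."""
    return max(sorted(beliefs), key=lambda t: bernoulli_entropy(beliefs[t].mean()))
```

Using the entropy of the posterior mean is a simplification; a fuller treatment would combine it with the other aspects listed in the paper (syllabic structures already used, number of opposed traits, phoneme complexity) into a single utility value.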
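The SYSTEM ARCHITECTURE section explains that the dialogue manager issues the same commands (e.g. Synthesise, Speak) to the robot and to the 3D character, leaving their implementation to each platform. A minimal Python sketch of this separation is given below. The class names, method signatures and the logging stand-in are hypothetical; the actual system relies on Opendial plugins and remote actors rather than on this code.

```python
from abc import ABC, abstractmethod
from typing import List


class Agent(ABC):
    """Platform-independent command interface seen by the dialogue manager."""

    @abstractmethod
    def synthesise(self, text: str) -> str:
        """Prepare the audio for `text` and return a handle to it."""

    @abstractmethod
    def speak(self, handle: str) -> None:
        """Play a previously synthesised utterance."""


class LoggingAgent(Agent):
    """Stand-in platform that only records the commands it receives.
    A real robot or 3D-character backend would subclass Agent instead."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.log: List[str] = []

    def synthesise(self, text: str) -> str:
        self.log.append(f"synthesise:{text}")
        return text

    def speak(self, handle: str) -> None:
        self.log.append(f"speak:{handle}")


def present_pair(first: Agent, second: Agent, word1: str, word2: str) -> None:
    """One agent presents the first (non-)word, the other presents the second."""
    for agent, word in ((first, word1), (second, word2)):
        agent.speak(agent.synthesise(word))
```

Because `present_pair` depends only on the abstract interface, the test logic stays unchanged when a platform is replaced, which is the design choice the paper motivates.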