A Needs-Driven Cognitive Architecture for Future 'Intelligent' Communicative Agents

Roger K. Moore
Dept. Computer Science, University of Sheffield, UK
Email: r.k.moore@sheffield.ac.uk

Abstract—Recent years have seen considerable progress in the deployment of 'intelligent' communicative agents such as Apple's Siri, Google Now, Microsoft's Cortana and Amazon's Alexa. Such speech-enabled assistants are distinguished from the previous generation of voice-based systems in that they claim to offer access to services and information via conversational interaction. In reality, interaction has limited depth and, after initial enthusiasm, users revert to more traditional interface technologies. This paper argues that the standard architecture for a contemporary communicative agent fails to capture the fundamental properties of human spoken language, so an alternative needs-driven cognitive architecture is proposed which models speech-based interaction as an emergent property of coupled hierarchical feedback control processes. The implications for future spoken language systems are discussed.

I. INTRODUCTION

The performance of spoken language systems has improved significantly in recent years, with corporate giants such as Microsoft and IBM issuing claim and counter-claim as to who has the lowest word error rates. Such progress has contributed to the deployment of ever more sophisticated voice-based applications, from the earliest military 'Command and Control' systems to the latest consumer 'Voice-Enabled Personal Assistants' (such as Siri) [1]. Research is now focussed on voice-based interaction with 'Embodied Conversational Agents' (ECAs) and 'Autonomous Social Agents', based on the assumption that spoken language will provide a 'natural' conversational interface between human beings and future (so-called) intelligent systems – see Fig. 1.

Fig. 1. The evolution of spoken language technology applications.

In reality, users' experiences with contemporary spoken language systems leave a lot to be desired. After initial enthusiasm, users lose interest in talking to Siri or Alexa, and they revert to more traditional interface technologies [2]. One possible explanation for this state of affairs is that, while component technologies such as automatic speech recognition and text-to-speech synthesis are subject to continuous ongoing improvement, the overall architecture of a spoken language system has been standardised for some time [3] – see Fig. 2. Standardisation is helpful because it promotes interoperability and expands markets. However, it can also stifle innovation by prescribing sub-optimal solutions. So, what (if anything) might be wrong with the architecture illustrated in Fig. 2?

Fig. 2. Illustration of the W3C Speech Interface Framework [3].

In the context of spoken language, the main issue with the architecture illustrated in Fig. 2 is that it reflects a traditional stimulus–response ('behaviourist') view of interaction: the user utters a request, and the system replies. This is the 'tennis match' analogy for language; a stance that is now regarded as restrictive and old-fashioned. Contemporary perspectives regard spoken language interaction as being more like a three-legged race than a tennis match [4]: continuous coordinated behaviour between coupled dynamical systems.

II. TOWARDS A 'COGNITIVE' ARCHITECTURE

What seems to be required is an architecture that replaces the traditional 'open-loop' stimulus–response arrangement with a 'closed-loop' dynamical framework: a framework in which needs/intentions lead to actions, actions lead to consequences, and perceived consequences are compared to intentions/needs in a continuous cycle of synchronous behaviours. Such an architecture has been proposed by the author [5], [6], [7] – see Fig. 3.

Fig. 3. Illustration of the proposed architecture for a needs-driven communicative agent [7].
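To make the closed-loop cycle concrete, here is a minimal Python sketch of a single needs-driven control unit in the spirit of Perceptual Control Theory [11]. It illustrates the principle only, and is not the architecture of Fig. 3: the Environment and ControlLoop classes, the scalar state and the gain value are all invented for the example.

```python
# A minimal sketch (not the author's implementation) of the closed-loop
# cycle described above: a need -> action -> consequence -> perception,
# with the perceived consequence continuously compared against the need.
# All names (Environment, ControlLoop) and values are illustrative.

class Environment:
    """Toy world: the agent's action nudges a single observable state."""
    def __init__(self, state: float = 0.0):
        self.state = state

    def apply(self, action: float) -> float:
        self.state += action          # consequence of acting
        return self.state             # what the agent can perceive


class ControlLoop:
    """One PCT-style control unit: acts to keep perception near a need."""
    def __init__(self, need: float, gain: float = 0.5):
        self.need = need              # reference value ('intention')
        self.gain = gain              # how strongly error drives action

    def step(self, perception: float) -> float:
        error = self.need - perception     # compare consequence to need
        return self.gain * error           # action that reduces the error


if __name__ == "__main__":
    world = Environment()
    agent = ControlLoop(need=10.0)
    perception = world.state
    for t in range(12):               # continuous cycle of behaviour
        action = agent.step(perception)
        perception = world.apply(action)
        print(f"t={t:2d}  action={action:+.3f}  perceived={perception:.3f}")
```

Stacking such units, with higher levels supplying the reference values of lower ones, yields the hierarchical arrangement discussed below.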
One of the key concepts embedded in the architecture illustrated in Fig. 3 is the agent's ability to 'infer' (using search) the consequences of its actions when they cannot be observed directly. Another is the use of a forward model of 'self' to model 'other'. Both of these features align well with the contemporary view of language as "ostensive inferential recursive mind-reading" [8]. The architecture also draws an analogy between the depth of each search process and 'motivation/effort'. This is because it has been known for some time that speakers continuously trade effort against intelligibility [9], [10], and this maps very nicely onto a hierarchical control-feedback process [11] which is capable of maintaining sufficient contrast at the highest pragmatic level of communication by means of suitable regulatory compensations at the lower semantic, syntactic, lexical, phonemic, phonetic and acoustic levels.

As a practical example, these ideas have been used to construct a new type of speech synthesiser (known as 'C2H') that adjusts its output as a function of its inferred communicative success [13], [14] – it listens to itself!
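To illustrate the kind of regulation involved, the sketch below caricatures this effort–intelligibility trade-off: the synthesiser monitors a model of how well a listener would receive its own output and hyper- or hypo-articulates accordingly, in the spirit of H&H theory [10]. The stand-in listener model, its coefficients and the target value are invented for the example and should not be read as the C2H implementation [13], [14].

```python
# A minimal sketch (names and the listener model are illustrative, not the
# C2H codebase) of H&H-style regulation: the synthesiser 'listens to itself'
# via a model of the listener, and raises or lowers articulatory effort to
# keep predicted intelligibility above a target.

def predicted_intelligibility(effort: float, noise: float) -> float:
    """Stand-in listener model: intelligibility grows with effort and
    shrinks with ambient noise (clipped to [0, 1])."""
    return max(0.0, min(1.0, 0.5 + 0.4 * effort - 0.6 * noise))

def regulate_effort(effort: float, noise: float,
                    target: float = 0.6, step: float = 0.1) -> float:
    """One cycle of the hyper/hypo trade-off: spend more effort only
    when inferred communicative success falls short of the target."""
    success = predicted_intelligibility(effort, noise)
    if success < target:
        return min(1.0, effort + step)    # hyper-articulate
    return max(0.0, effort - step)        # relax towards hypo-speech

if __name__ == "__main__":
    effort = 0.5
    for noise in [0.1, 0.1, 0.5, 0.5, 0.5, 0.1, 0.1]:  # mid-utterance noise burst
        effort = regulate_effort(effort, noise)
        print(f"noise={noise:.1f}  effort={effort:.2f}  "
              f"intelligibility={predicted_intelligibility(effort, noise):.2f}")
```

Running the sketch, effort rises during the simulated noise burst (a Lombard-style compensation [9]) and relaxes again once the noise subsides.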
III. FINAL REMARKS

Whilst the proposed cognitive architecture successfully captures some of the key elements of language-based interaction, it is important to note that such interaction between human beings is founded on substantial shared priors. This means that there may be a fundamental limit to the language-based interaction that can take place between mismatched partners such as a human being and an autonomous social agent [15].

ACKNOWLEDGMENT

This work was partially supported by the European Commission [EU-FP6-507422, EU-FP6-034434, EU-FP7-231868 and EU-FP7-611971], and the UK Engineering and Physical Sciences Research Council (EPSRC) [EP/I013512/1].

REFERENCES

[1] R. Pieraccini. The Voice in the Machine. Cambridge, MA: MIT Press, 2012.
[2] R. K. Moore, H. Li, & S.-H. Liao. Progress and prospects for spoken language technology: what ordinary people think. In INTERSPEECH (pp. 3007–3011). San Francisco, CA, 2016.
[3] Introduction and Overview of W3C Speech Interface Framework, http://www.w3.org/TR/voice-intro/
[4] F. Cummins. Periodic and aperiodic synchronization in skilled action. Frontiers in Human Neuroscience, 5(170), 1–9, 2011.
[5] R. K. Moore. PRESENCE: A human-inspired architecture for speech-based human-machine interaction. IEEE Trans. Computers, 56(9), 1176–1188, 2007.
[6] R. K. Moore. Spoken language processing: time to look outside? In L. Besacier, A.-H. Dediu, & C. Martín-Vide (Eds.), 2nd International Conference on Statistical Language and Speech Processing (SLSP 2014), Lecture Notes in Computer Science (Vol. 8791). Springer, 2014.
[7] R. K. Moore. PCT and Beyond: Towards a Computational Framework for "Intelligent" Systems. In A. McElhone & W. Mansell (Eds.), Living Control Systems IV: Perceptual Control Theory and the Future of the Life and Social Sciences. Benchmark Publications Inc. In press (available at https://arxiv.org/abs/1611.05379).
[8] T. Scott-Phillips. Speaking Our Minds: Why human communication is different, and how language evolved to make it special. London, New York: Palgrave Macmillan, 2015.
[9] E. Lombard. Le signe de l'élévation de la voix. Ann. Maladies Oreille, Larynx, Nez, Pharynx, 37, 101–119, 1911.
[10] B. Lindblom. Explaining phonetic variation: a sketch of the H&H theory. In W. J. Hardcastle & A. Marchal (Eds.), Speech Production and Speech Modelling (pp. 403–439). Kluwer Academic Publishers, 1990.
[11] W. T. Powers. Behavior: The Control of Perception. Hawthorne, NY: Aldine, 1973.
[12] S. Hawkins. Roles and representations of systematic fine phonetic detail in speech understanding. Journal of Phonetics, 31, 373–405, 2003.
[13] R. K. Moore & M. Nicolao. Reactive speech synthesis: actively managing phonetic contrast along an H&H continuum. 17th International Congress of Phonetic Sciences (ICPhS). Hong Kong, 2011.
[14] M. Nicolao, J. Latorre & R. K. Moore. C2H: A computational model of H&H-based phonetic contrast in synthetic speech. INTERSPEECH. Portland, OR, 2012.
[15] R. K. Moore. Is spoken language all-or-nothing? Implications for future speech-based human-machine interaction. In K. Jokinen & G. Wilcock (Eds.), Dialogues with Social Robots – Enablements, Analyses, and Evaluation. Springer Lecture Notes in Electrical Engineering, 2016.