Neurosymbolic Visual Commonsense On Integrated Reasoning and Learning about Space and Motion in Embodied Multimodal Interaction Mehul Bhatt School of Science and Technology, Örebro University – Sweden CoDesign Lab EU (Artificial and Human Intelligence)., Cognitive Vision and Perception » https:// codesign-lab.org/ cognitive-vision Abstract We present recent and emerging advances in computational cognitive vision addressing artificial visual and spatial intelligence at the interface of (spatial) language, (spatial) logic and (spatial) cognition research. With a primary focus on explainable sensemaking of dynamic visuospatial imagery, we highlight the (systematic and modular) integration of methods from knowledge representation and reasoning, computer vision, spatial informatics, and computational cognitive modelling. A key emphasis here is on generalised (declarative) neurosymbolic reasoning & learning about space, motion, actions, and events relevant to embodied multimodal interaction under ecologically valid naturalistic settings in everyday life. Practically, this translates to general-purpose mechanisms for computational visual commonsense encompassing capabilities such as (neurosymbolic) semantic question-answering, relational spatio-temporal learning, visual abduction etc. The presented work is motivated by and demonstrated in the applied backdrop of areas as diverse as autonomous driving, cognitive robotics, design of digital visuoauditory media, and behavioural visual perception research in cognitive psychology and neuroscience. More broadly, our emerging work is driven by an interdisciplinary research mindset addressing human-centred responsible AI through a methodological confluence of AI, Vision, Psychology, and (human-factors centred) Interaction Design. Keywords Cognitive vision, Knowlede representation and reasoning (KR), Machine Learning, Integration of reasoning & learning, Commonsense reasoning, Declarative spatial reasoning, Relational Learning, Computational cognitive modelling, Human-Centred AI, Responsible AI 1. Motivation spectrum of high-level human-centred sensemaking capa- bilities. These capabilities encompass operational functions Multimodality in embodied interaction is an inherent aspect such as: of human activity, be it in social, professional, or every- • Visuospatial conception formation, common- day mundane contexts. Next-generation human-centred sense/qualitative generalisation, analogical AI technologies, operating in such contextualised every- inference; day settings, will require an inherent foundational capacity to “make sense” of —e.g., perceive, understand, explain, • Hypothetical reasoning, argumentation, explana- anticipate— everyday, naturalistic interactional multimodal- tion, counterfactual reasoning; ity. This would be essential towards successfully achieving • Event based episodic maintenance & retrieval for technology mediated (“human-in-the-loop” ) collaborative perceptual narrativisation. assistance, as well as ensuring compliance with emerging human-centred ethical and legal requirements, performance The afore enumeration is by no means exhaustive: in benchmarks, and inclusive usability expectations. It is there- essence, in scope of artificial visual intelligence are diverse fore crucial that the foundational building blocks of such high-level cognitive visuospatial sensemaking capabili- next-generation systems be semantically aligned with the ties —be it mundane, analytical, or creative— that humans descriptive, analytical, and explanatory characteristics and acquire developmentally or through specialised training, complexity of human task conceptualisation, performance and are routinely adept at performing seamlessly in their benchmarks, and usability expectations. Against this back- everyday life and work (e.g., driving a vehicle, tracking drop, we define artificial visual intelligence [1] as: moving objects, navigating a crowded urban environment, engaging in sports, interpreting subtle cues in everyday » The computational capability to seman- people-communication from visual / gestural and auditory tically process and interpret diverse forms signals). of visual stimuli (typically, but not necessar- ily) emanating from sensing embodied mul- Our central focus is on the development of general, timodal interactions of / amongst humans domain-independent methods that may be seamlessly and other artefacts in diverse naturalistic integrated as part of hybrid computational cognitive system, situations of everyday life and work. or even within computational cognitive models / cognitive architectures [2]. We also contextualise and demonstrate in Within the scope of artificial visual intelligence are a wide- the backdrop of applications in autonomous driving, cog- nitive robotics, visuoauditory media design, and cognitive International Joint Conference on Artificial Intelligence (IJCAI)., STRL 24: psychology (e.g. [3, 4, 5, 6], [7, 8] ). Through applied case- Third International Workshop on Spatio-Temporal Reasoning and Learning studies, we provide a systematic model and general method- (STRL), IJCAI 2024 – 5 August 2024, Jeju, South Korea ology showcasing the integration of diverse, multi-faceted $ mehul.bhatt@oru.se (M. Bhatt) AI methods pertaining Knowledge Representation and Rea- € https://mehulbhatt.org (M. Bhatt) © 2024 CoDesign Lab EU. Use permitted under Creative Commons License Attribution 4.0 International soning, Computer Vision, Machine Learning, and Visual (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings Perception towards realising practical, human-centred, com- spatial systems [10] where integrated reasoning about ac- putational visual intelligence. tion and change [11, 12] is involved: • interpolation and projection of missing informa- 2. Neurosymbolic Visual tion, e.g., what could be hypothesised about missing information (e.g., moments of occlusion [13]); how Commonsense: Integrated can this hypothesis support planning an immediate Reasoning and Learning about next step? Space, Motion, and Inter(A)ction • object identity maintenance at a semantic level, e.g., in the presence of occlusions, missing and noisy In the present status quo, our research in (computational) quantitative data, error in detection and tracking neurosymbolic visual commonsense categorically addresses • ability to make default assumptions, e.g., pertain- three key questions: ing to persistence objects and/or object attributes I. What kind of (relational) abstraction mechanisms • maintaining consistent beliefs respecting (domain- are needed to computationally “make-sense” of em- neutral) commonsense criteria, e.g., related to com- bodied multimodal interaction ? positionality & indirect effects, space-time continu- II. How can (and why should) abstraction mechanisms ity, positional changes resulting from motion (such as in I) be founded on behaviourally estab- lished cognitive human- factors emanating from nat- • inferring / computing counterfactuals [14], in a uralistic empirical observation in real-world applied manner akin to human cognitive ability to perform contexts? mental simulation for purposes of introspection III. How to articulate behaviourally established abstrac- about the past or anticipation of the future, or per- tion mechanisms, preferences (etc) as formal declar- forming “what-if” reasoning tasks etc ative models suited for computational modelling We particularly emphasise the abilities to abstract, learn, aimed at operational“sensemaking” (encompassing and reason with cognitively rooted structured characterisa- capabilities such as abduction, relational learning, tions of commonsense knowledge about space and motion, counterfactual inference) ? encompassing visuospatial question-answering, abduction, and relational learning: Present work is particularly aimed at developing general methods for the semantic interpretation of (multimodal) dy- I. Visuospatial Question-Answering. Focus is on a com- namic visuospatial imagery with an emphasis on the ability to neurosymbolically perform abstraction, reasoning, and putational framework for semantic-question answering learning with cognitively rooted structured characterisa- with video and eye-tracking data founded in constraint logic tions of commonsense knowledge pertaining to space and programming; we also demonstrate an application in cogni- motion. Here, we specifically emphasise: tive film & media studies, where human perception of films vis-a-via cinematographic devices is of interest. • General foundational commonsense abstractions of » [4, 6, 7, 8] space, time, and motion needed for representation mediated (grounded) reasoning and learning with II. Visuospatial Abduction. Focus is on a hybrid archi- dynamic visuospatial stimuli (e.g., emanating from tecture for systematically computing robust visual explana- multimodal human behavioural signals in modali- tion(s) encompassing hypothesis formation, belief revision, ties such as RGB(D), video, audio, eye-tracking and and default reasoning with video data (for active vision possibly even bio signals [9]); for autonomous driving, as well as for offline processing). • Deep (visuospatial) semantics, entailing systemat- The architecture supports visual abduction with space-time ically formalised declarative (neurosymbolic) rea- histories as native entities, and founded in (functional) an- soning and learning with aspects pertaining to swer set programming based spatial reasoning. space, space-time, motion, actions & events, spatio- » [3, 13, 15][16, 17] linguistic conceptual knowledge. Here, it is of the essence that an expressive ontology consisting of, III. Relational Visuospatial Learning. Focus is on a gen- for instance, space, time, space-time motion primi- eral framework and pipeline for: relational spatio-temporal tives as first-class ‘neurosymbolic’ objects is accessi- (inductive) learning with an elaborate ontology supporting ble within the (declarative) programming paradigm a range of space-time features; and generating semantic, under consideration; and (declaratively) explainable interpretation models in a neu- rosymbolic pipeline demonstrated for the case of analysing • Explainable models of computational visuospatial visuospatial symmetry in visual art. commonsense based on a systematic integration of » [18][5][19] symbolic/relational methods on the one hand, and neural techniques aimed at low level quantitative Formal semantics and computational models of deep seman- (e.g., visual) data processing on the other; tics manifest themselves as neurosymbolic spatio-temporal At a higher level of abstraction, deep (visualspatial) se- extensions of established declarative AI frameworks such mantics (or deep semantics for short) entails inherent sup- as Constraint Logic Programming (CLP) [20], Inductive port for tackling a range of challenges concerning epistemo- Logic Programming (ILP) [21], and Answer Set Program- logical and phenomenological aspects relevant to dynamic ming (ASP) [22]. The more foundational aspects pertaining declarative spatial reasoning (built on top of CLP, ILP, ASP) neural learning techniques, or otherwise. independent of its relationship to cognitive vision research In this invited position statement, we have attempted to may be consulted in [23], [16, 24], [18]. summarise our mindset and ongoing work in the CoDesign Lab towards: 3. Discussion » Establishing a human-centric foundation and roadmap for the development of neu- The vision that drives our scientific methodology is: rosymbolically grounded inference about embodied multimodal interaction as iden- » To shape the nature and character of tifiable in a range of real-world application (machine-based) artificial visual intelligence contexts. with respect to human-centred cognitive considerations, demonstrating an exemplar This summary is not meant to be a comprehensive literature for developing, applying, and disseminating review; this may be obtained through the cited works. For such methods in socio-technologically rele- key technical details and to obtain a summary of open di- vant application areas where: rections, we direct interested readers to select publications as follows: a compact starting point may be obtained via (a) embodied (multimodal) human interac- the comprehensive summary in [1], or through the shorter/- tion is inherent; focussed components in [15, 5, 4, 13, 3]. Longer summaries in the form of (recent) doctoral dissertations are available (b) human-in-the-loop collaborative work is in [29] and [30, 31]. of the essence; and (c) normative ethico-legal compliance based on regulatory requirement and human- Acknowledgments factors driven inclusive or universal design criteria is to be ensured. We acknowledge funding by the Swedish Research Council Towards realising this vision, we adopt an interdisciplinary (VR - Vetenskapsrådet) - https://www.vr.se, and the Swedish approach –at the confluence of Cognition, AI, Interaction, Foundation for Strategic Research (SSF – Stiftelsens för and Design– which we deem necessary to better appreci- Strategisk Forskning) - https://strategiska.se. Previously, ate the complexity and spectrum of varied human-centred this research has been supported by the German Research challenges for the design and (usable) implementation of Foundation (DFG – Deutsche Forschungsgemeinschaft) - (explainable) artificial visual intelligence solutions in diverse https://www.dfg.de. human-system interaction contexts. One of the key technical driving forces in our work is that of “representation mediated multimodal sensemaking”. References In essence, we consider (neurosymbolic) representation me- [1] M. Bhatt, J. Suchan, Artificial visual intelligence: diated grounding as being significant in semiotic construc- Perceptual commonsense for human-centred cogni- tion, e.g., enabling high-level meaning-making. This view tive technologies, in: Human-Centered Artificial stems from the long-established value of “grounding” in Intelligence: Advanced Lectures, Springer-Verlag, Artificial Intelligence and related disciplines [25]. Our re- Berlin, Heidelberg, 2023, p. 216–242. URL: https:// search advances the theoretical, methodological, and ap- doi.org/10.1007/978-3-031-24349-3_12. doi:10.1007/ plied understanding of “grounded representation” mediated 978-3-031-24349-3_12. multimodal sensemaking of embodied human interaction at the interface of spatial language, spatial logic, and spatial [2] S. Jones, J. Laird, Anticipatory thinking in cogni- cognition. In our view, the significance of this form of (neu- tive architectures with event cognition mechanisms, rosymbolic) grounding must now be reiterated, re-asserted in: A. Amos-Binks, D. Dannenhauer, R. E. Cardona- even, in view of recent advances in neural machine learning Rivera, G. A. Brewer (Eds.), Short Paper Proc. of Work- and the well-recognised “explainability” and “interpretabil- shop on Cognitive Systems for Anticipatory Thinking ity” requirements from the viewpoint of human-centred AI (COGSAT 2019), AAAI Fall Symp., volume 2558, 2019. [26, 27, 28]. We believe that research in knowledge represen- URL: http://ceur-ws.org/Vol-2558/short1.pdf. tation and reasoning (KR) has, since its inception, concerned [3] J. Suchan, M. Bhatt, S. Varadarajan, Common- itself with the “hard” problem of semantics, emphasising sense visual sensemaking for autonomous driv- explainability, formal verification and diagnosis, elabora- ing - on generalised neurosymbolic online abduc- tion tolerance amongst other things. Research in KR, and tion integrating vision and semantics, Artif. In- more broadly in symbolic AI and semantics, and their role tell. 299 (2021) 103522. URL: https://doi.org/10.1016/ and contribution towards large-scale hybrid “human-in-the- j.artint.2021.103522. doi:10.1016/j.artint.2021. loop” intelligence is of even greater significance now than 103522. ever before given the tremendous synergistic opportunities afforded by the widely demonstrated power of deep learning [4] J. Suchan, M. Bhatt, Semantic question-answering driven techniques in computer vision (and beyond). The with video and eye-tracking data: AI foundations for onus now, we posit, is on KR research to drive itself towards human visual perception driven cognitive film stud- developing methods that can seamlessly integrate (and be ies, in: S. Kambhampati (Ed.), Proceedings of the “usable”) with other kinds of AI methods, be data-centric Twenty-Fifth International Joint Conference on Ar- tificial Intelligence, IJCAI 2016, New York, NY, USA, [15] J. Suchan, M. Bhatt, P. A. Walega, C. P. L. Schultz, Vi- 9-15 July 2016, IJCAI/AAAI Press, 2016, pp. 2633–2639. sual explanation by high-level abduction: On answer- URL: http://www.ijcai.org/Abstract/16/374. set programming driven reasoning about moving ob- jects, in: 32nd AAAI Conference on Artificial Intel- [5] J. Suchan, M. Bhatt, S. Vardarajan, S. A. Amirshahi, ligence (AAAI-18), USA, AAAI Press, 2018, pp. 1965– S. Yu, Semantic Analysis of (Reflectional) Visual Sym- 1972. metry: A Human-Centred Computational Model for Declarative Explainability, Advances in Cognitive [16] P. A. Walega, M. Bhatt, C. P. L. Schultz, ASPMT(QS): Systems 6 (2018) 65–84. URL: http://www.cogsys.org/ non-monotonic spatial reasoning with answer set pro- journal. gramming modulo theories, in: F. Calimeri, G. Ianni, M. Truszczynski (Eds.), Logic Programming and Non- [6] J. Suchan, M. Bhatt, The geometry of a scene: On deep monotonic Reasoning - 13th International Conference, semantics for visual perception driven cognitive film, LPNMR 2015, Lexington, KY, USA, September 27-30, studies, in: 2016 IEEE Winter Conference on Applica- 2015. Proceedings, volume 9345 of Lecture Notes in tions of Computer Vision, WACV 2016, Lake Placid, Computer Science, Springer, 2015, pp. 488–501. URL: NY, USA, March 7-10, 2016, IEEE Computer Soci- https://doi.org/10.1007/978-3-319-23264-5_41. doi:10. ety, 2016, pp. 1–9. URL: https://doi.org/10.1109/WACV. 1007/978-3-319-23264-5\_41. 2016.7477712. doi:10.1109/WACV.2016.7477712. [17] P. A. Walega, C. P. L. Schultz, M. Bhatt, Non- [7] J. Suchan, M. Bhatt, Deep Semantic Abstractions of monotonic spatial reasoning with answer Everyday Human Activities: On Commonsense Rep- set programming modulo theories, Theory resentations of Human Interactions, in: ROBOT 2017: Pract. Log. Program. 17 (2017) 205–225. URL: Third Iberian Robotics Conference, Advances in Intel- https://doi.org/10.1017/S1471068416000193. ligent Systems and Computing 693, 2017. doi:10.1017/S1471068416000193. [8] M. Spranger, J. Suchan, M. Bhatt, Robust Natural Lan- [18] J. Suchan, M. Bhatt, C. P. L. Schultz, Deeply semantic guage Processing - Combining Reasoning, Cognitive inductive spatio-temporal learning, in: J. Cussens, Semantics and Construction Grammar for Spatial Lan- A. Russo (Eds.), Proceedings of the 26th Interna- guage, in: IJCAI 2016: 25th International Joint Confer- tional Conference on Inductive Logic Programming ence on Artificial Intelligence, AAAI Press, 2016. (Short papers), London, UK, 2016, volume 1865, CEUR- [9] M. Bhatt, K. Kersting, Semantic interpretation of multi- WS.org, 2016, pp. 73–80. modal human-behaviour data - making sense of events, [19] K. S. R. Dubba, A. G. Cohn, D. C. Hogg, M. Bhatt, activities, processes, Künstliche Intell. 31 (2017) 317– F. Dylla, Learning Relational Event Models from 320. URL: https://doi.org/10.1007/s13218-017-0511-y. Video, J. Artif. Intell. Res. (JAIR) 53 (2015) 41– doi:10.1007/S13218-017-0511-Y. 90. URL: http://dx.doi.org/10.1613/jair.4395. doi:10. [10] M. Bhatt, S. W. Loke, Modelling dynamic spatial 1613/jair.4395. systems in the situation calculus, Spatial Cogni- [20] J. Jaffar, M. J. Maher, Constraint logic programming: tion & Computation 8 (2008) 86–130. URL: https: A survey, The journal of logic programming 19 (1994) //doi.org/10.1080/13875860801926884. doi:10.1080/ 503–581. 13875860801926884. [21] S. Muggleton, L. D. Raedt, Inductive logic program- [11] M. Bhatt, H. W. Guesgen, S. Wölfl, S. M. Hazarika, ming: Theory and methods, Journal of Logic Program- Qualitative spatial and temporal reasoning: Emerging ming 19 (1994) 629–679. applications, trends, and directions, Spatial Cognition & Computation 11 (2011) 1–14. URL: https://doi.org/10. [22] G. Brewka, T. Eiter, M. Truszczyński, Answer set 1080/13875868.2010.548568. doi:10.1080/13875868. programming at a glance, Commun. ACM 54 (2011) 2010.548568. 92–103. doi:10.1145/2043174.2043195. [12] M. Bhatt, Reasoning about space, actions and change: [23] M. Bhatt, J. H. Lee, C. P. L. Schultz, CLP(QS): A declar- A paradigm for applications of spatial reasoning, in: ative spatial reasoning framework, in: M. J. Egenhofer, Qualitative Spatial Representation and Reasoning: N. A. Giudice, R. Moratz, M. F. Worboys (Eds.), Spa- Trends and Future Directions, IGI Global, USA, 2012. tial Information Theory - 10th International Confer- ence, COSIT 2011, Belfast, ME, USA, September 12-16, [13] J. Suchan, M. Bhatt, S. Varadarajan, Out of sight but 2011. Proceedings, volume 6899 of Lecture Notes in not out of mind: An answer set programming based Computer Science, Springer, 2011, pp. 210–230. URL: online abduction framework for visual sensemaking in https://doi.org/10.1007/978-3-642-23196-4_12. doi:10. autonomous driving, in: S. Kraus (Ed.), Proc. of 25th 1007/978-3-642-23196-4_12. Intnl. Joint Conference on Artificial Intelligence, IJ- CAI 2019, 2019, pp. 1879–1885. doi:10.24963/ijcai. [24] C. P. L. Schultz, M. Bhatt, J. Suchan, P. A. Walega, 2019/260. Answer Set Programming Modulo Space-Time, in: C. Benzmüller, F. Ricca, X. Parent, D. Roman (Eds.), [14] R. Byrne, Counterfactual thought, Annual Re- Rules and Reasoning - Second International Joint Con- view of Psychology 67 (2016) 135–157. URL: https: ference, RuleML+RR 2018, Luxembourg, September 18- //doi.org/10.1146/annurev-psych-122414-033249. 21, 2018, Proceedings, volume 11092 of Lecture Notes doi:10.1146/annurev-psych-122414-033249. in Computer Science, Springer, 2018, pp. 318–326. URL: pMID: 26393873. https://doi.org/10.1007/978-3-319-99906-7_24. doi:10. 1007/978-3-319-99906-7\_24. (Eds.), 2013 Workshop on Computational Models of Narrative, CMN 2013, August 4-6, 2013, Hamburg, [25] S. Harnad, The symbol grounding problem, Physica Germany, volume 32 of OASIcs, Schloss Dagstuhl - D 42 (1990) 335–346. Leibniz-Zentrum für Informatik, 2013, pp. 24–29. URL: [26] AI HLEG, High-level expert group on artificial intel- https://doi.org/10.4230/OASIcs.CMN.2013.24. doi:10. ligence: Ethical guidelines for trustworthy ai, 2019. 4230/OASICS.CMN.2013.24. URL: https://www.aepd.es/sites/default/files/2019-12/ [36] C. Lewis, Representation, Inclusion, and Innova- ai-ethics-guidelines.pdf. tion: Multidisciplinary Explorations, Synthesis Lec- [27] EU Commission, Communication: Building trust in tures on Human-Centered Informatics, Morgan & human centric artificial intelligence, 2019. Claypool Publishers, 2017. URL: https://doi.org/10. 2200/S00812ED1V01Y201710HCI038. doi:10.2200/ [28] EU Commission, Proposal for a regulation of the euro- S00812ED1V01Y201710HCI038. pean parliament and of the council laying down har- monised rules on artificial intelligence (artificial intelli- gence act) and amending certain union legislative acts, 2021. URL: https://eur-lex.europa.eu/legal-content/ EN/TXT/?uri=CELEX:52021PC0206. [29] J. Suchan, Declarative Reasoning about Space and Mo- tion in Visual Imagery - Theoretical Foundations and Applications, Ph.D. thesis, Universität Bremen, 2022. URL: https://elib.dlr.de/188919/. [30] V. Kondyli, Behavioural Principles for the Design of Human-Centred Cognitive Technologies : The Case of Visuo-Locomotive Experience, Ph.D. thesis, Örebro University, School of Science and Technology, 2023. [31] V. Nair, The Observer Lens: Characterizing Visuospa- tial Features in Multimodal Interactions, Ph.D. thesis, , School of Informatics, Informatics Research Environ- ment, 2024. [32] M. Bhatt, J. Suchan, Cognitive vision and percep- tion, in: G. D. Giacomo, A. Catalá, B. Dilkina, M. Mi- lano, S. Barro, A. Bugarín, J. Lang (Eds.), ECAI 2020 - 24th European Conference on Artificial Intelligence, 29 August-8 September 2020, Santiago de Compostela, Spain, August 29 - September 8, 2020 - Including 10th Conference on Prestigious Applications of Artificial Intelligence (PAIS 2020), volume 325 of Frontiers in Ar- tificial Intelligence and Applications, IOS Press, 2020, pp. 2881–2882. URL: https://doi.org/10.3233/FAIA200434. doi:10.3233/FAIA200434. [33] V. Kondyli, M. Bhatt, D. Levin, J. Suchan, How do drivers mitigate the effects of naturalistic visual com- plexity? on attentional strategies and their impli- cations under a change blindness protocol, Cogni- tive Research: Principles and Implications 8 (2023). doi:10.1186/s41235-023-00501-1. [34] K. S. R. Dubba, M. Bhatt, F. Dylla, D. C. Hogg, A. G. Cohn, Interleaved inductive-abductive reasoning for learning complex event models, in: S. H. Muggle- ton, A. Tamaddoni-Nezhad, F. A. Lisi (Eds.), Inductive Logic Programming - 21st International Conference, ILP 2011, Windsor Great Park, UK, July 31 - August 3, 2011, Revised Selected Papers, volume 7207 of Lecture Notes in Computer Science, Springer, 2011, pp. 113– 129. URL: https://doi.org/10.1007/978-3-642-31951-8_ 14. doi:10.1007/978-3-642-31951-8\_14. [35] M. Bhatt, J. Suchan, C. Schultz, Cognitive interpre- tation of everyday activities - toward perceptual nar- rative based visuo-spatial scene interpretation, in: M. A. Finlayson, B. Fisseni, B. Löwe, J. C. Meister