<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Quantum-inspired Multimodal Representation</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Qiuchi</forename><surname>Li</surname></persName>
							<email>qiuchili@dei.unipd.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Information Engineering</orgName>
								<orgName type="institution">University of Padova</orgName>
								<address>
									<settlement>Padova</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Massimo</forename><surname>Melucci</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Department of Information Engineering</orgName>
								<orgName type="institution">University of Padova</orgName>
								<address>
									<settlement>Padova</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Katrien</forename><surname>Laenen</surname></persName>
							<email>katrien.laenen@kuleuven.be</email>
							<affiliation key="aff2">
								<orgName type="department">Human Computer Interaction</orgName>
								<orgName type="institution">KU Leuven</orgName>
								<address>
									<settlement>Heverlee</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Susana</forename><surname>Zoghbi</surname></persName>
							<email>susana.zoghbi@kuleuven.be</email>
							<affiliation key="aff3">
								<orgName type="department">Human Computer Interaction</orgName>
								<orgName type="institution">KU Leuven</orgName>
								<address>
									<settlement>Heverlee</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marie-Francine</forename><surname>Moens</surname></persName>
							<affiliation key="aff4">
								<orgName type="department">Human Computer Interaction</orgName>
								<orgName type="institution">KU Leuven</orgName>
								<address>
									<settlement>Heverlee</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Quantum-inspired Multimodal Representation</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">61C2D5040D28126499FF2E6AA3EBABE6</idno>
					<idno type="DOI">10.1145/3159652.3159716</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T19:40+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>multimodal data fusion</term>
					<term>quantum physics</term>
					<term>neural networks</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We introduce our work in progress on building multimodal representations under quantum inspiration. The challenge of multimodal representation lies in designing a fusion strategy that captures the interactions between different modalities of data. Neural networks, the most successful approaches, lack a mechanism for explicitly showing how the different modalities are related to each other. We address this issue by seeking inspiration from Quantum Theory (QT), which has been shown to be advantageous for explicitly capturing the correlations between textual features. In this paper, we give an overview of the related work and present the proposed methodology.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>In human communication, messages are often conveyed through a combination of different modalities, such as the visual, audio and linguistic modalities. To understand multimodal messages automatically, one needs to fuse the information from the different modalities into a joint multimodal representation. The challenge lies in how to characterise the interactions between modalities, which can be complicated in some scenarios. In the example shown in Fig. <ref type="figure" target="#fig_1">1</ref>, the visual-linguistic query cannot be understood from either the text or the image alone; the relation between the image and the text must be correctly recognized. This is where the significance and the challenge of multimodal representation learning lie.</p><p>Most existing research focuses on neural network-based fusion strategies for constructing multimodal representations, ranging from earlier Hidden Markov Model (HMM)-based models <ref type="bibr" target="#b6">[7]</ref> to RNN variants <ref type="bibr" target="#b0">[1]</ref> and, more recently, tensor-based approaches <ref type="bibr" target="#b11">[12]</ref> and sequence-to-sequence structures <ref type="bibr" target="#b7">[8]</ref>. Despite their strong accuracy, the interplay between data modalities is encoded implicitly by the neural network components, making it difficult for humans to understand the contribution of each modality to a particular task.</p><p>Quantum Theory (QT) provides a well-established theoretical and mathematical formalism for describing the physical world at the microscopic level. Beyond physics, QT frameworks have been applied to many research areas <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b14">15]</ref>, among which successful results have been observed in text-based language understanding <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b14">15]</ref>. In this context, quantum-inspired frameworks have the natural advantage of explicitly capturing the correlations between features through the concepts of quantum superposition and quantum entanglement. This inspires us to adopt quantum-inspired frameworks for constructing multimodal representations, and to propose effective and interpretable multimodal fusion methods. In particular, we propose to capture the interactions within a single modality with quantum superposition, and to model the cross-modal interactions by means of quantum entanglement. In this way, we establish a pipeline that extracts the interactions within multimodal data in a way that is understandable from a quantum perspective. We expect to obtain performances comparable to state-of-the-art systems on concrete multimodal tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Today's consumers have become very demanding. When shopping online, they have in mind a specific clothing item in a particular color and style, and they want to find it without too much effort. However, current e-commerce search mechanisms are often too limited to provide this kind of service. A common way of searching for products in a webshop is to navigate through a product category hierarchy. Users end up in a subcategory which they need to search completely, which is time-consuming. They need to go through many irrelevant products, without any guarantee of actually finding the product they are looking for. To narrow down the search, users can sometimes select certain filters. However, the desired product attributes might not be amongst the available filters. With a text-based search approach, users can describe the desired product by entering keywords into a search bar. The webshop then finds relevant products by matching these keywords to the words in the product descriptions. It is often difficult for users to write the right keywords that will induce the search engine to provide the products they are interested in. For example, some users might be interested in "jeans with holes", but the relevant products are described as "distressed jeans". Additionally, this approach hampers the search for product attributes which are not mentioned in the product descriptions. Alternatively, but rarely, webshops offer image-based search, where the user uploads an image of the desired product and receives visually similar products. Recently, there has been an increasing interest among users in this kind of image-based search. One of the main reasons is the growing usage of visual social media such as Pinterest and Instagram, where users see products they want to buy. Using an image as a query allows users to convey much more information about the desired product than a textual query. Additionally, another advantage of image-based search over text-based search is that the language of images is universal. However, the user might be interested in changing or adding attributes to the product in the query image to obtain very specific results. For example, for an image of a red dress, the user may want the sleeve length to be different.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">PROPOSED METHODOLOGY</head><p>Here we introduce our proposed approach for building multimodal data representations inspired by quantum theory. Essentially, we represent multimodal data as a many-body system whose subsystems are the individual data modalities. The interaction between modalities is inherently captured by the notion of entanglement between the subsystems. We propose to build a complex-valued learning network to implement the quantum theoretical framework, which facilitates learning the cross-modal interactions in a data-driven fashion.</p></div>
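The many-body view above can be sketched numerically. The following toy example (our illustrative sketch, not the authors' implementation; all names and dimensions are assumptions) represents each modality as a normalized complex state vector and forms the joint bimodal state as their tensor product:

```python
import numpy as np

def pure_state(dim, seed):
    """Random normalized complex state vector standing in for one modality."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim) + 1j * rng.standard_normal(dim)
    return v / np.linalg.norm(v)

text_state = pure_state(4, 0)    # hypothetical textual subsystem, dim 4
image_state = pure_state(3, 1)   # hypothetical visual subsystem, dim 3

# Joint state of the two-body system: a complex tensor of shape (4, 3).
joint = np.einsum('i,j->ij', text_state, image_state)

# A plain product state like this carries no entanglement: it is rank-1.
# Cross-modal interaction would appear as deviation from rank-1 structure.
print(np.linalg.matrix_rank(joint))  # 1 for a product (unentangled) state
```

A learned joint state that cannot be factored back into one text vector times one image vector is, in this picture, exactly where the cross-modal interaction lives.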
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Complex-valued Unimodal Representation</head><p>Complex values are essential to the mathematical formalism of quantum physics. However, most existing quantum-inspired models for text representation are based on a real vector space, ignoring the complex-valued nature of quantum notions. Recently, our prior works <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b10">11]</ref> leveraged quantum superposition to model correlations between textual features, and the complex-valued representation led to improved performance and enhanced interpretability. We aim to employ the concept of superposition for modeling intra-modal interactions, and to investigate the complex-valued embedding approach to capture the interactions within other modalities.</p></div>
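One way to picture a complex-valued unimodal representation is sketched below (an illustrative assumption, not the exact scheme of the cited works): each feature is an amplitude times a phase factor, and a unit of unimodal data is a normalized superposition of its feature states, whose density matrix makes the feature correlations explicit.

```python
import numpy as np

def complex_embed(r, theta):
    """Complex feature vector r_j * exp(i * theta_j); r, theta would be learned."""
    return r * np.exp(1j * theta)

# Two hypothetical word states in a 2-d feature space (fixed for the example).
w1 = complex_embed(np.array([0.6, 0.8]), np.array([0.0, np.pi / 4]))
w2 = complex_embed(np.array([1.0, 0.0]), np.array([np.pi / 2, 0.0]))

weights = np.array([0.7, 0.3])                    # mixture weights, sum to 1
sentence = weights[0] * w1 + weights[1] * w2      # superposition of word states
sentence = sentence / np.linalg.norm(sentence)    # renormalize to a unit state

# The density matrix rho = |s><s| carries the pairwise feature correlations
# (off-diagonal complex entries) explicitly, which aids interpretability.
rho = np.outer(sentence, sentence.conj())
print(np.trace(rho).real)  # ~1.0 for a normalized state
```

The phases are what a real-valued model discards; here they survive into the off-diagonal terms of `rho`.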
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Tensor-based Approaches for Capturing Inter-modal Interactions</head><p>We represent multimodal data as a many-body quantum system in entanglement. Its mathematical formulation is a complex-valued tensor constructed from the unimodal complex-valued vectors by tensor-based approaches. Tensor-based models have been applied to classification <ref type="bibr" target="#b4">[5]</ref> and matching <ref type="bibr" target="#b13">[14]</ref> tasks, but they avoid directly computing the tensor by decomposing it and learning the decomposed weights via neural networks. Explicitly combining the different data modalities by a tensor into a holistic multimodal representation remains unexplored. <ref type="bibr" target="#b5">[6]</ref> proposed a framework to investigate the entanglement between user and document for relevance feedback, but how to apply it to multimodal data remains a challenge.</p></div>
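The contrast drawn above can be made concrete. A rough sketch (dimensions, rank and weight names are our assumptions, loosely in the spirit of low-rank fusion; not the cited models' exact formulation): explicit fusion materializes the full outer-product tensor, whereas low-rank approaches compute the fused output directly from per-modality factors and never build that tensor.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy unimodal feature vectors: text (3), visual (4), acoustic (2).
t, v, a = rng.standard_normal(3), rng.standard_normal(4), rng.standard_normal(2)

# Explicit fusion: the full outer-product tensor, 3 * 4 * 2 = 24 entries.
full = np.einsum('i,j,k->ijk', t, v, a)

# Low-rank fusion: with rank-R factors W_t, W_v, W_a, the fused output h
# is computed factor by factor; the 3-way tensor is never materialized.
R, d_out = 5, 6
W_t = rng.standard_normal((R, 3, d_out))
W_v = rng.standard_normal((R, 4, d_out))
W_a = rng.standard_normal((R, 2, d_out))
h_lowrank = np.sum(
    np.einsum('rio,i->ro', W_t, t)
    * np.einsum('rjo,j->ro', W_v, v)
    * np.einsum('rko,k->ro', W_a, a),
    axis=0,
)
print(full.shape, h_lowrank.shape)  # (3, 4, 2) (6,)
```

The proposal in this section amounts to working with `full` (complex-valued, in the quantum picture) rather than skipping straight to `h_lowrank`.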
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Quantum-inspired Framework for Multimodal Sentiment Analysis</head><p>We focus on the multimodal sentiment analysis task, and work with the benchmark datasets CMU-MOSI <ref type="bibr" target="#b12">[13]</ref> and CMU-MOSEI <ref type="bibr" target="#b0">[1]</ref>.</p><p>The task is to classify the sentiment of a video into 2, 5 or 7 classes using textual, visual and acoustic features. As shown in Fig. <ref type="figure" target="#fig_3">2</ref>, our framework represents unimodal data as a set of pure states through complex-valued embedding, and constructs the many-body state of a video utterance through tensor-based approaches. Finally, quantum-like measurement operators are applied for sentiment classification. The whole process is implemented as a complex-valued neural network, and the parameters in the pipeline can be learned from labeled data in an end-to-end manner. The network has an inherent advantage in interpretability over classical neural networks, in that the role of each component is made explicit prior to the network training phase. We are working on deploying the network on CMU-MOSI and CMU-MOSEI, and we expect effectiveness comparable to state-of-the-art models.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>WSDM 2018, February 5-9, 2018, Marina Del Rey, CA, USA © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-5581-0/18/02. $15.00 https://doi.org/10.1145/3159652.3159716</figDesc></figure>
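The final measurement step can be illustrated with a minimal sketch (our assumption of how quantum-like measurement could look, not the authors' exact operators): each sentiment class k gets a positive semidefinite operator M_k, the operators sum to the identity, and the class probability is p_k = &lt;psi| M_k |psi&gt;, so the outputs form a valid distribution by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_classes = 4, 2  # toy state dimension and a binary sentiment task

# A normalized complex multimodal state, standing in for the fused utterance.
psi = rng.standard_normal(d) + 1j * rng.standard_normal(d)
psi /= np.linalg.norm(psi)

# Build a toy measurement from a random unitary basis: M_k projects onto a
# subset of basis vectors, so M_0 + M_1 = I (the operators form a POVM).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d)))
M = [Q[:, :2] @ Q[:, :2].conj().T, Q[:, 2:] @ Q[:, 2:].conj().T]

# Born-rule probabilities p_k = <psi| M_k |psi>; real and non-negative.
probs = np.array([np.real(psi.conj() @ Mk @ psi) for Mk in M])
print(probs.sum())  # ~1.0: the outcomes form a probability distribution
```

In a trained network the operators M_k would be learned parameters, which is what makes their role inspectable before and after training.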
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Example of a multimodal query (image material adapted from www.amazon.com). It consists of a query image and a query text that alters the query image. The query text mentions two fashion attributes: short and lace. The query image is knee length and does not have a lace type appearance.</figDesc><graphic coords="1,343.34,176.12,192.20,116.34" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Our proposed multimodal representation framework.</figDesc><graphic coords="2,53.59,83.69,255.98,148.05" type="bitmap" /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGMENTS</head><p>This PhD project is supported by the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 721321.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph</title>
		<author>
			<persName><forename type="first">Amirali</forename><surname>Bagher Zadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paul</forename><forename type="middle">Pu</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Soujanya</forename><surname>Poria</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Erik</forename><surname>Cambria</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Louis-Philippe</forename><surname>Morency</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 56th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Melbourne, Australia</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2018-07">July 2018</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="2236" to="2246" />
		</imprint>
	</monogr>
	<note>Volume 1: Long Papers</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Quantum Models of Cognition and Decision</title>
		<author>
			<persName><forename type="first">Jerome</forename><forename type="middle">R</forename><surname>Busemeyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Peter</forename><forename type="middle">D</forename><surname>Bruza</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2012">2012</date>
			<publisher>Cambridge University Press</publisher>
			<pubPlace>New York, NY, USA</pubPlace>
		</imprint>
	</monogr>
	<note>1st edition</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Quantum-Inspired Complex Word Embedding</title>
		<author>
			<persName><forename type="first">Qiuchi</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sagar</forename><surname>Uprety</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Benyou</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dawei</forename><surname>Song</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of The Third Workshop on Representation Learning for NLP</title>
				<meeting>The Third Workshop on Representation Learning for NLP</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="50" to="57" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">CNM: An Interpretable Complexvalued Network for Matching</title>
		<author>
			<persName><forename type="first">Qiuchi</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Benyou</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Massimo</forename><surname>Melucci</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>Minneapolis, Minnesota</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019-06">June 2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4139" to="4148" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Efficient Low-rank Multimodal Fusion With Modality-Specific Factors</title>
		<author>
			<persName><forename type="first">Zhun</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ying</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Varun</forename><surname>Bharadhwaj Lakshminarasimhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paul</forename><forename type="middle">Pu</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Amirali</forename><surname>Bagher Zadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Louis-Philippe</forename><surname>Morency</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 56th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Melbourne, Australia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-07">July 2018</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="2247" to="2256" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Towards modeling implicit feedback with quantum entanglement</title>
		<author>
			<persName><forename type="first">Massimo</forename><surname>Melucci</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Quantum Interaction</title>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Towards Multimodal Sentiment Analysis: Harvesting Opinions from the Web</title>
		<author>
			<persName><forename type="first">Louis-Philippe</forename><surname>Morency</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rada</forename><surname>Mihalcea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Payal</forename><surname>Doshi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 13th International Conference on Multimodal Interfaces, ICMI &apos;11</title>
				<meeting>the 13th International Conference on Multimodal Interfaces, ICMI &apos;11<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="169" to="176" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Seq2seq2sentiment: Multimodal Sequence to Sequence Models for Sentiment Analysis</title>
		<author>
			<persName><forename type="first">Hai</forename><surname>Pham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><surname>Manzini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paul</forename><forename type="middle">Pu</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Barnabas</forename><surname>Poczos</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1807.03915</idno>
		<imprint>
			<date type="published" when="2018-07">July 2018</date>
		</imprint>
	</monogr>
	<note>cs, stat</note>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">The Geometry of Information Retrieval</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">J</forename><surname>Van Rijsbergen</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2004">2004</date>
			<publisher>Cambridge University Press</publisher>
			<pubPlace>New York, NY, USA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Modeling Term Dependencies with Quantum Language Models for IR</title>
		<author>
			<persName><forename type="first">Alessandro</forename><surname>Sordoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jian-Yun</forename><surname>Nie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoshua</forename><surname>Bengio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR &apos;13</title>
				<meeting>the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR &apos;13<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="653" to="662" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Semantic Hilbert Space for Text Representation Learning</title>
		<author>
			<persName><forename type="first">Benyou</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Qiuchi</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Massimo</forename><surname>Melucci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dawei</forename><surname>Song</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The World Wide Web Conference</title>
				<meeting><address><addrLine>New York, NY, USA; San Francisco, CA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="3293" to="3299" />
		</imprint>
	</monogr>
	<note>WWW &apos;19</note>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Tensor Fusion Network for Multimodal Sentiment Analysis</title>
		<author>
			<persName><forename type="first">Amir</forename><surname>Zadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Minghai</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Soujanya</forename><surname>Poria</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Erik</forename><surname>Cambria</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Louis-Philippe</forename><surname>Morency</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1707.07250</idno>
		<imprint>
			<date type="published" when="2017-07">July 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos</title>
		<author>
			<persName><forename type="first">Amir</forename><surname>Zadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rown</forename><surname>Zellers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eli</forename><surname>Pincus</surname></persName>
		</author>
		<imprint>
			<biblScope unit="page">10</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">A Quantum Many-body Wave Function Inspired Language Modeling Approach</title>
		<author>
			<persName><forename type="first">Peng</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zhan</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lipeng</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Benyou</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dawei</forename><surname>Song</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM &apos;18</title>
				<meeting>the 27th ACM International Conference on Information and Knowledge Management, CIKM &apos;18<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1303" to="1312" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">A quantum-inspired multimodal sentiment analysis framework</title>
		<author>
			<persName><forename type="first">Yazhou</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dawei</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Peng</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Panpan</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jingfei</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiang</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Benyou</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Theoretical Computer Science</title>
		<imprint>
			<date type="published" when="2018-04">April 2018</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
