Technical Presentation
WSDM'18, February 5-9, 2018, Marina Del Rey, CA, USA

Web Search of Fashion Items with Multimodal Querying

Katrien Laenen, KU Leuven, Human Computer Interaction, Heverlee, Belgium, katrien.laenen@kuleuven.be
Susana Zoghbi, KU Leuven, Human Computer Interaction, Heverlee, Belgium, susana.zoghbi@kuleuven.be
Marie-Francine Moens, KU Leuven, Human Computer Interaction, Heverlee, Belgium, sien.moens@kuleuven.be

ABSTRACT
In this paper, we introduce a novel multimodal fashion search paradigm where e-commerce data is searched with a multimodal query composed of both an image and text. In this setting, the query image shows a fashion product that the user likes, and the query text lets the user change certain product attributes to fit the product to the user's desire. Multimodal search gives users the means to clearly express what they are looking for. This is in contrast to current e-commerce search mechanisms, which are cumbersome and often fail to grasp the customer's needs. Multimodal search requires intermodal representations of visual and textual fashion attributes which can be mixed and matched to form the user's desired product, and which have a mechanism to indicate when a visual and a textual fashion attribute represent the same concept. With a neural network, we induce a common, multimodal space for visual and textual fashion attributes where their inner product measures their semantic similarity. We build a multimodal retrieval model which operates on the obtained intermodal representations and which ranks images based on their relevance to a multimodal query. We demonstrate that our model is able to retrieve images that both exhibit the necessary query image attributes and satisfy the query texts. Moreover, we show that our model substantially outperforms two state-of-the-art retrieval models adapted to multimodal fashion search.

ACM Reference format:
Katrien Laenen, Susana Zoghbi, and Marie-Francine Moens. 2018. Web Search of Fashion Items with Multimodal Querying. In Proceedings of WSDM 2018: The Eleventh ACM International Conference on Web Search and Data Mining, Marina Del Rey, CA, USA, February 5–9, 2018 (WSDM 2018), 9 pages. https://doi.org/10.1145/3159652.3159716

Figure 1: Example of a multimodal query (1). It consists of a query image and a query text that alters the query image. The query text mentions two fashion attributes: short and lace. The query image is knee length and does not have a lace appearance.

(1) Image material adapted from www.amazon.com

1 INTRODUCTION
Today's consumers have become very exigent. When shopping online, they have in mind a specific clothing item in a particular color and style, and they want to find it without too much effort. However, the current e-commerce search mechanisms are often too limited to provide this kind of service. A common way of searching products in a webshop is to navigate through a product category hierarchy. Users end up in a subcategory which they need to search completely, which is time-consuming. They need to go through many irrelevant products, without the guarantee of actually finding the product they are looking for. To narrow down the search, users can sometimes select certain filters. However, desired product attributes might not be available amongst the filters. With a text-based search approach, users can describe the desired product by entering keywords into a search bar. The webshop then finds relevant products by matching these keywords to the words in the product descriptions. It is often difficult for users to write the right keywords that will induce the search engine to provide the products they are interested in. For example, some users might be interested in "jeans with holes", but the relevant products are described as "distressed jeans". Additionally, this approach hampers the search for product attributes which are not mentioned in the product descriptions. Alternatively but rarely, webshops offer image-based search where the user uploads an image of the desired product and receives visually similar products. Recently, there has been an increasing interest of users in this kind of image-based search. One of the main reasons is the growing usage of visual social media such as Pinterest and Instagram, where users see products they want to buy. Using an image as a query allows users to convey much more information about the desired product than with a textual query. Another advantage of image-based search over text-based search is that the language of images is universal. However, the user might be interested in changing or adding attributes to the product in the query image to obtain very specific results. For example, for an image of a red dress, the user would like the sleeve length to be different.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
WSDM 2018, February 5–9, 2018, Marina Del Rey, CA, USA
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5581-0/18/02. . . $15.00
https://doi.org/10.1145/3159652.3159716


IIR 2019, September 16–18, 2019, Padova, Italy

Quantum-inspired Multimodal Representation

Qiuchi Li, Department of Information Engineering, University of Padova, Padova, Italy, qiuchili@dei.unipd.it
Massimo Melucci, Department of Information Engineering, University of Padova, Padova, Italy, melo@dei.unipd.it

ABSTRACT
We introduce our work in progress, which targets building multimodal representations under quantum inspiration. The challenge for multimodal representation lies in finding a fusion strategy that captures the interaction between different modalities of data. Neural networks, the most successful approaches so far, lack a mechanism for explicitly showing how different modalities are related to each other. We address this issue by seeking inspiration from Quantum Theory (QT), which has been demonstrated to be advantageous in explicitly capturing the correlations between textual features. In this paper, we give an overview of the related work and present the proposed methodology.

KEYWORDS
multimodal data fusion, quantum physics, neural networks

1 INTRODUCTION
In human communication, messages are often conveyed through a combination of different modalities, such as visual, audio and linguistic modalities. For an automatic understanding of multimodal messages, one needs to fuse the information from different modalities to construct a joint multimodal representation. The challenge lies in how to characterise the interactions between different modalities, which can be complicated in some scenarios. In the example shown in Fig. 1, the query cannot be understood solely from either the text or the image individually; the relation between the image and the text must be correctly recognized. This is where the significance and the challenge of multimodal representation learning lie.

Figure 1: Example of a multimodal query.

Most existing research focuses on neural network-based fusion strategies for constructing multimodal representations, ranging from the earlier Hidden Markov Model (HMM)-based models [7] to different RNN variants [1] and, more recently, tensor-based approaches [12] and seq-to-seq structures [8]. Despite their strong accuracy, the interplay between different data modalities is encoded implicitly by the neural network components, making it difficult for humans to understand the contribution of each modality in a particular task.

Quantum Theory (QT) provides a well-established theoretical and mathematical formalism for describing the physical world on a microscopic level. Beyond physics, QT frameworks have been applied to many research areas [2, 9, 10, 15], among which successful results have been observed in text-based language understanding [10, 15]. In this context, quantum-inspired frameworks have the natural advantage of explicitly capturing the correlations between features through the concepts of quantum superposition and quantum entanglement.

This inspires us to adopt quantum-inspired frameworks for constructing multimodal representations, and to propose effective and interpretable multimodal fusion methods. In particular, we propose to capture the interactions within a single modality with quantum superposition, and to model the cross-modal interactions by means of quantum entanglement. In this way, we establish a pipeline that extracts the interactions within visual-linguistic multimodal data in a way that is understandable from a quantum perspective. We expect to obtain performance comparable to state-of-the-art systems in concrete multimodal tasks.

2 PROPOSED METHODOLOGY
Here we introduce our proposed approach for building multimodal data representations inspired by quantum theory. Essentially, we represent multimodal data as many-body systems composed of different data modalities as subsystems. The interaction of different modalities is inherently captured by the notion of entanglement between the subsystems. We propose to build a complex-valued network to implement the quantum theoretical framework, which facilitates learning the cross-modal interactions in a data-driven fashion.

2.1 Complex-valued Unimodal Representation
Complex values are essential for the mathematical formalism of quantum physics. However, most existing quantum-inspired models for text representation are based on the real vector space, ignoring the complex-valued nature of quantum notions. Recently, our prior works [3, 4, 11] leveraged quantum superposition to model correlations between textual features, and the complex-valued representation led to improved performance and enhanced interpretability. We attempt to employ the concept of superposition for modeling intra-modal interactions, and investigate the complex-valued embedding approach to capture the interactions within other modalities.

2.2 Tensor-based Approaches for Capturing Inter-modal Interactions
We represent multimodal data as a many-body quantum system in entanglement. The mathematical formulation will be a complex-valued tensor constructed from unimodal complex-valued vectors by tensor-based approaches. Tensor-based models have been applied to classification [5] and matching [14] tasks, but they avoid directly computing the tensor by decomposing it and learning the decomposed weights via neural networks. Explicit tensor combination of different data modalities into a holistic multimodal representation remains unexplored. Melucci [6] proposed a framework to investigate the entanglement between user and document for relevance feedback, but how to apply it to multimodal data remains a challenge.

2.3 Quantum-inspired Framework for Multimodal Sentiment Analysis
We focus on the multimodal sentiment analysis task, and work with the benchmarking datasets CMU-MOSI [13] and CMU-MOSEI [1]. The task is to classify the sentiment of a video into 2, 5 or 7 classes with textual, visual and acoustic features. As shown in Fig. 2, our framework represents unimodal data as a set of pure states through complex-valued embedding, and constructs the many-body state of a video utterance through tensor-based approaches. Finally, quantum-like measurement operators are implemented for sentiment classification. The whole process is implemented as a complex-valued neural network, and the parameters in the pipeline can be learned from labeled data in an end-to-end manner.

Figure 2: Our proposed multimodal representation framework.

The network has an inherent advantage in interpretability compared to classical neural networks, in that the role of each component is made explicit prior to the network training phase. We are working on deploying the network to CMU-MOSI and CMU-MOSEI, and we expect to see effectiveness comparable to state-of-the-art models.

ACKNOWLEDGMENTS
This PhD project is supported by the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 721321.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

REFERENCES
[1] AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246, Melbourne, Australia, July 2018. Association for Computational Linguistics.
[2] Jerome R. Busemeyer and Peter D. Bruza. Quantum Models of Cognition and Decision. Cambridge University Press, New York, NY, USA, 1st edition, 2012.
[3] Qiuchi Li, Sagar Uprety, Benyou Wang, and Dawei Song. Quantum-Inspired Complex Word Embedding. Proceedings of The Third Workshop on Representation Learning for NLP, pages 50–57, 2018.
[4] Qiuchi Li, Benyou Wang, and Massimo Melucci. CNM: An Interpretable Complex-valued Network for Matching. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4139–4148, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
[5] Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, AmirAli Bagher Zadeh, and Louis-Philippe Morency. Efficient Low-rank Multimodal Fusion With Modality-Specific Factors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2247–2256, Melbourne, Australia, July 2018. Association for Computational Linguistics.
[6] Massimo Melucci. Towards modeling implicit feedback with quantum entanglement. Quantum Interaction 2008, 2008.
[7] Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. Towards Multimodal Sentiment Analysis: Harvesting Opinions from the Web. In Proceedings of the 13th International Conference on Multimodal Interfaces, ICMI '11, pages 169–176, New York, NY, USA, 2011. ACM.
[8] Hai Pham, Thomas Manzini, Paul Pu Liang, and Barnabas Poczos. Seq2seq2sentiment: Multimodal Sequence to Sequence Models for Sentiment Analysis. arXiv:1807.03915 [cs, stat], July 2018.
[9] C. J. van Rijsbergen. The Geometry of Information Retrieval. Cambridge University Press, New York, NY, USA, 2004.
[10] Alessandro Sordoni, Jian-Yun Nie, and Yoshua Bengio. Modeling Term Dependencies with Quantum Language Models for IR. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, pages 653–662, New York, NY, USA, 2013. ACM.
[11] Benyou Wang, Qiuchi Li, Massimo Melucci, and Dawei Song. Semantic Hilbert Space for Text Representation Learning. In The World Wide Web Conference, WWW '19, San Francisco, CA, USA, pages 3293–3299, New York, NY, USA, 2019. ACM.
[12] Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Tensor Fusion Network for Multimodal Sentiment Analysis. arXiv:1707.07250 [cs], July 2017.
[13] Amir Zadeh, Rowan Zellers, and Eli Pincus. MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos. page 10.
[14] Peng Zhang, Zhan Su, Lipeng Zhang, Benyou Wang, and Dawei Song. A Quantum Many-body Wave Function Inspired Language Modeling Approach. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM '18, pages 1303–1312, New York, NY, USA, 2018. ACM.
[15] Yazhou Zhang, Dawei Song, Peng Zhang, Panpan Wang, Jingfei Li, Xiang Li, and Benyou Wang. A quantum-inspired multimodal sentiment analysis framework. Theoretical Computer Science, April 2018.
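
The quantum-inspired pipeline described in the methodology (complex-valued unimodal pure states, a tensor-based many-body state, and a quantum-like measurement) can be illustrated with a small, self-contained sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the amplitude-phase parameterisation of a pure state, the toy 2-dimensional values, the helper names `pure_state`, `kron` and `born_probability`, and the choice of a plain Kronecker product. Note that a single Kronecker product yields a separable (non-entangled) state; an entangled many-body state would be a superposition of several such product states, and the learned network parameters are not modelled.

```python
import cmath
import math


def pure_state(amplitudes, phases):
    """Build a unit-norm complex vector with entries r_j * exp(i * phi_j)."""
    raw = [r * cmath.exp(1j * p) for r, p in zip(amplitudes, phases)]
    norm = math.sqrt(sum(abs(z) ** 2 for z in raw))
    return [z / norm for z in raw]


def kron(u, v):
    """Tensor (Kronecker) product of two state vectors."""
    return [a * b for a in u for b in v]


def born_probability(measurement_state, state):
    """Born rule: probability |<m|psi>|^2 of outcome m for state psi."""
    inner = sum(m.conjugate() * s for m, s in zip(measurement_state, state))
    return abs(inner) ** 2


# Hypothetical 2-dimensional unimodal states (toy values, not learned).
text = pure_state([1.0, 1.0], [0.0, math.pi / 2])
visual = pure_state([3.0, 4.0], [0.0, 0.0])
acoustic = pure_state([1.0, 0.0], [0.0, 0.0])

# Product (separable) state of one utterance: 2 * 2 * 2 = 8 dimensions.
utterance = kron(kron(text, visual), acoustic)

# Measuring along the 8 computational basis states gives a distribution
# over outcomes; a classifier could map such probabilities to sentiment.
basis = [[1.0 + 0j if i == k else 0j for i in range(8)] for k in range(8)]
probs = [born_probability(b, utterance) for b in basis]
```

Because the basis is orthonormal and the utterance state has unit norm, the entries of `probs` sum to one. In a trainable version, the measurement states would be learned parameters rather than a fixed basis.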