<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>WSDM</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Quantum-inspired Multimodal Representation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Q. Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. Melucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Padua</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>9</volume>
      <abstract>
        <p>We introduce our work in progress that targets building multimodal
representations under quantum inspiration. The challenge for multimodal
representation falls on a fusion strategy to capture the interaction
between different modalities of data. As the most successful approaches,
neural networks lack a mechanism of explicitly showing how different
modalities are related to each other. We address this issue by seeking
inspiration from Quantum Theory (QT), which has been demonstrated
advantageous in explicitly capturing the correlations between textual
features. In this paper, we give an overview of the related works and
present the proposed methodology.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>In human communication, messages are often conveyed through
a combination of different modalities, such as visual, audio and
linguistic modalities. For an automatic understanding of
multimodal messages, one needs to fuse the information from
different modalities to construct a joint multimodal representation.</p>
      <p>The challenge falls on how to characterise the interactions between
different modalities, which can be complicated in some scenarios. In
the examples shown in Fig. 1, the visual-linguistic query cannot be
understood solely by either the text or image individually, but the
relation between the image and text must be correctly recognized.</p>
      <p>Most existing research focuses on neural network-based fusion
strategies for constructing multimodal representations, ranging from
the earlier Hidden Markov Model (HMM)-based models [7] to different
RNN variants [1], and more recently tensor-based approaches [12] and
sequence-to-sequence structures [8]. Despite their strong accuracy
performances, the interplay between different data modalities is
encoded in an inherent way by the neural network components, making
it difficult for humans to understand the contributions of each
modality in a particular task.</p>
      <p>Quantum Theory (QT) was originally developed to describe the
behavior of nature at the microscopic level. Beyond physics, QT
frameworks have been applied to many research areas [2, 9, 10, 15],
among which successful results have been observed in text-based
language understanding [10, 15]. In this context, quantum-inspired
frameworks have the natural advantage of explicitly capturing the
correlations between features by the concepts of quantum superposition
and quantum entanglement. This inspires us to adopt quantum-inspired
frameworks for constructing multimodal representations, and to propose
effective and interpretable multimodal fusion methods. In particular,
we propose to capture the interactions within a single modality with
quantum superposition, and to model the cross-modal interactions by
means of quantum entanglement. In this way, we establish a pipeline
that extracts the interactions within multimodal data in a way
understandable from a quantum perspective. We expect to obtain
performances comparable to state-of-the-art systems in concrete
multimodal tasks.</p>
    </sec>
    <sec id="sec-1a">
      <title>2 PROPOSED METHODOLOGY</title>
      <p>Here we introduce our proposed approach for building multimodal
data representations inspired by quantum theory. Essentially, we
represent multimodal data as many-body systems composed of different
data modalities as subsystems. The interaction of different modalities
is inherently captured by the notion of entanglement between the
subsystems. We propose to build a complex-valued learning network to
implement the quantum theoretical framework, which facilitates
learning the cross-modal interactions in a data-driven fashion.</p>
    </sec>
    <sec id="sec-2">
      <title>2.1 Complex-valued Unimodal Representation</title>
      <p>
        Complex values are essential for the mathematical formalism of
quantum physics. However, most existing quantum-inspired models
for text representation are based on the real vector space, ignoring
the complex-valued nature of quantum notions. Recently, our
prior works [
        <xref ref-type="bibr" rid="ref11 ref3 ref4">3, 4, 11</xref>
        ] leveraged quantum superposition to model
correlations between textual features, and the complex-valued
representation leads to improved performance and enhanced
interpretability. We attempt to employ the concept of superposition for
modeling intra-modal interactions, and investigate the complex-valued
embedding approach to capture the interactions within other
modalities.
      </p>
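      <p>As an illustration only (not code from this work), the superposition view of a complex-valued embedding can be sketched in a few lines of NumPy. The dimensionality, amplitudes, and phases below are hypothetical stand-ins for learned parameters.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # dimensionality of the toy semantic Hilbert space (hypothetical)

def complex_embedding(amplitudes, phases):
    # Unit-norm complex vector: sum over j of r_j * exp(i * phi_j) on basis j.
    state = amplitudes * np.exp(1j * phases)
    return state / np.linalg.norm(state)

# Two toy "word" states with stand-in amplitudes and phases.
w1 = complex_embedding(rng.random(dim), rng.uniform(0, 2 * np.pi, dim))
w2 = complex_embedding(rng.random(dim), rng.uniform(0, 2 * np.pi, dim))

# A longer unit of text as a (re-normalized) superposition of word states.
sentence = 0.6 * w1 + 0.8 * w2
sentence /= np.linalg.norm(sentence)

# Born rule: the probability of observing basis concept j is the squared
# magnitude of amplitude j, so the probabilities sum to one.
probs = np.abs(sentence) ** 2
```

      <p>The phases, absent in real-valued embeddings, are what allow superposed states to interfere, which is how this line of work models feature correlations.</p>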
    </sec>
    <sec id="sec-4">
      <title>2.2 Tensor-based Approaches for Capturing Inter-modal Interactions</title>
      <p>
        We represent multimodal data as a many-body quantum system in
entanglement. The mathematical formulation will be a
complex-valued tensor constructed from unimodal complex-valued vectors
by tensor-based approaches. Tensor-based models have been
applied for classification [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and matching [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] tasks, but they avoid
directly computing the tensor by decomposing it and
learning the decomposed weights via neural networks. The explicit tensor
combination of different data modalities into a holistic multimodal
representation remains unexplored. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] proposed a framework to
investigate the entanglement between user and document for
relevance feedback, but it remains a challenge how to apply it to
multimodal data.
      </p>
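      <p>To make the tensor-based construction concrete, the following NumPy sketch (ours, not from the cited works) builds the many-body state of three toy unimodal states via an explicit tensor product, and contrasts it with a state that does not factorize, i.e. an entangled one. All dimensions and values are hypothetical.</p>

```python
import numpy as np

def pure_state(v):
    # Normalize a complex vector to unit norm (a pure quantum state).
    v = np.asarray(v, dtype=complex)
    return v / np.linalg.norm(v)

# Toy unimodal states (dimensions are hypothetical).
text   = pure_state([1.0, 1.0j, 0.0])
visual = pure_state([1.0, -1.0])
audio  = pure_state([2.0, 1.0j])

# Many-body state of the three subsystems as an explicit tensor product.
joint = np.einsum('i,j,k->ijk', text, visual, audio)

# A product state like `joint` carries no entanglement; entanglement
# corresponds to a unit-norm tensor that does not factorize, e.g.:
entangled = np.zeros((3, 2, 2), dtype=complex)
entangled[0, 0, 0] = entangled[1, 1, 1] = 1.0 / np.sqrt(2.0)
```

      <p>The decomposition-based models cited above learn factor weights instead of ever materializing a tensor like <code>joint</code>; the explicit combination shown here is the unexplored direction this section refers to.</p>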
    </sec>
    <sec id="sec-6">
      <title>2.3 Quantum-inspired Framework for Multimodal Sentiment Analysis</title>
      <p>
        We focus on the multimodal sentiment analysis task, and work with
the benchmarking datasets CMU-MOSI [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and CMU-MOSEI [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
The task is to classify the sentiment of a video into 2, 5 or 7 classes
with textual, visual and acoustic features. As is shown in Fig. 2,
our framework represents unimodal data as a set of pure states
through complex-valued embedding, and constructs the many-body
state of a video utterance through tensor-based approaches.
Finally, quantum-like measurement operators are implemented for
sentiment classification. The whole process is implemented as a
complex-valued neural network, and the parameters in the pipeline
can be learned from labeled data in an end-to-end manner.
      </p>
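      <p>A minimal sketch of the measurement step, under the assumption of one rank-one projector per sentiment class (our illustrative choice, not necessarily the exact operators used in the framework); the fused state here is a random stand-in for the output of the earlier steps.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4  # dimensionality of the fused state (hypothetical)

def pure_state(v):
    v = np.asarray(v, dtype=complex)
    return v / np.linalg.norm(v)

# Stand-in for the fused multimodal utterance state produced upstream.
psi = pure_state(rng.standard_normal(dim) + 1j * rng.standard_normal(dim))

# One rank-one measurement vector per sentiment class; in a trained
# network these would be learned parameters.
classes = ['negative', 'neutral', 'positive']
measure = [pure_state(rng.standard_normal(dim) + 1j * rng.standard_normal(dim))
           for _ in classes]

# Born rule: the score for class c is the squared magnitude of the
# inner product of the class vector and the state; renormalizing the
# scores across classes yields a distribution over sentiments.
scores = np.array([np.abs(np.vdot(m, psi)) ** 2 for m in measure])
probs = scores / scores.sum()
predicted = classes[int(np.argmax(probs))]
```

      <p>Because every step is an explicit quantum operation (embedding, composition, measurement), the role of each parameter is fixed before training, which is the source of the interpretability advantage discussed next.</p>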
      <p>The network has a built-in advantage in interpretability
compared to classical neural networks, in that the role of each
component is made explicit prior to the network training phase. We
are working on deploying the network on CMU-MOSI and
CMU-MOSEI, and we expect to see values comparable to state-of-the-art
models in terms of effectiveness.</p>
    </sec>
    <sec id="sec-7">
      <title>ACKNOWLEDGMENTS</title>
      <p>This PhD project is supported by the European Union's Horizon
2020 research and innovation programme under the Marie
Skłodowska-Curie grant agreement No. 721321.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>AmirAli</given-names>
            <surname>Bagher Zadeh</surname>
          </string-name>
          , Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency.
          <article-title>Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph</article-title>
          .
          <source>In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>2236</fpage>
          -
          <lpage>2246</lpage>
          , Melbourne, Australia,
          <year>July 2018</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jerome R.</given-names>
            <surname>Busemeyer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Peter D.</given-names>
            <surname>Bruza</surname>
          </string-name>
          .
          <source>Quantum Models of Cognition and Decision</source>
          . Cambridge University Press, New York, NY, USA, 1st edition,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Qiuchi</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sagar</given-names>
            <surname>Uprety</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Benyou</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Dawei</given-names>
            <surname>Song</surname>
          </string-name>
          .
          <article-title>Quantum-Inspired Complex Word Embedding</article-title>
          .
          <source>Proceedings of The Third Workshop on Representation Learning for NLP</source>
          , pages
          <fpage>50</fpage>
          -
          <lpage>57</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Qiuchi</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Benyou</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Massimo</given-names>
            <surname>Melucci</surname>
          </string-name>
          .
          <article-title>CNM: An Interpretable Complexvalued Network for Matching</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>4139</fpage>
          -
          <lpage>4148</lpage>
          , Minneapolis, Minnesota,
          <year>June 2019</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Zhun</given-names>
            <surname>Liu</surname>
          </string-name>
          , Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, AmirAli Bagher Zadeh, and
          <string-name>
            <surname>Louis-Philippe Morency</surname>
          </string-name>
          .
          <article-title>Efficient Low-rank Multimodal Fusion With Modality-Specific Factors</article-title>
          .
          <source>In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>2247</fpage>
          -
          <lpage>2256</lpage>
          , Melbourne, Australia,
          <year>July 2018</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Massimo</given-names>
            <surname>Melucci</surname>
          </string-name>
          .
          <article-title>Towards modeling implicit feedback with quantum entanglement</article-title>
          .
          <source>Quantum Interaction</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Louis-Philippe</given-names>
            <surname>Morency</surname>
          </string-name>
          , Rada Mihalcea, and
          <string-name>
            <given-names>Payal</given-names>
            <surname>Doshi</surname>
          </string-name>
          .
          <article-title>Towards Multimodal Sentiment Analysis: Harvesting Opinions from the Web</article-title>
          .
          <source>In Proceedings of the 13th International Conference on Multimodal Interfaces</source>
          ,
          <source>ICMI '11</source>
          , pages
          <fpage>169</fpage>
          -
          <lpage>176</lpage>
          , New York, NY, USA,
          <year>2011</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Hai</given-names>
            <surname>Pham</surname>
          </string-name>
          , Thomas Manzini, Paul Pu Liang, and Barnabas Poczos.
          <article-title>Seq2seq2sentiment: Multimodal Sequence to Sequence Models for Sentiment Analysis</article-title>
          . arXiv:1807.03915 [cs, stat],
          <year>July 2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>C. J. van Rijsbergen</surname>
          </string-name>
          .
          <source>The Geometry of Information Retrieval</source>
          . Cambridge University Press, New York, NY, USA,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Sordoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jian-Yun</given-names>
            <surname>Nie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Modeling Term Dependencies with Quantum Language Models for IR</article-title>
          .
          <source>In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13</source>
          , pages
          <fpage>653</fpage>
          -
          <lpage>662</lpage>
          , New York, NY, USA,
          <year>2013</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Benyou</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Qiuchi</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Massimo</given-names>
            <surname>Melucci</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Dawei</given-names>
            <surname>Song</surname>
          </string-name>
          .
          <article-title>Semantic Hilbert Space for Text Representation Learning</article-title>
          .
          <source>In The World Wide Web Conference, WWW '19</source>
          , pages
          <fpage>3293</fpage>
          -
          <lpage>3299</lpage>
          , New York, NY, USA,
          <year>2019</year>
          . ACM. event-place: San Francisco, CA, USA.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Amir</given-names>
            <surname>Zadeh</surname>
          </string-name>
          , Minghai Chen, Soujanya Poria, Erik Cambria, and
          <string-name>
            <surname>Louis-Philippe Morency</surname>
          </string-name>
          .
          <article-title>Tensor Fusion Network for Multimodal Sentiment Analysis</article-title>
          .
          <source>arXiv:1707.07250 [cs]</source>
          ,
          <year>July 2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Amir</given-names>
            <surname>Zadeh</surname>
          </string-name>
          , Rowan Zellers, and
          <string-name>
            <given-names>Eli</given-names>
            <surname>Pincus</surname>
          </string-name>
          .
          <source>MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos. page 10.</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Peng</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Zhan Su, Lipeng Zhang, Benyou Wang, and
          <string-name>
            <given-names>Dawei</given-names>
            <surname>Song</surname>
          </string-name>
          .
          <article-title>A Quantum Many-body Wave Function Inspired Language Modeling Approach</article-title>
          .
          <source>In Proceedings of the 27th ACM International Conference on Information and Knowledge Management</source>
          ,
          <source>CIKM '18</source>
          , pages
          <fpage>1303</fpage>
          -
          <lpage>1312</lpage>
          , New York, NY, USA,
          <year>2018</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Yazhou</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Dawei Song, Peng Zhang, Panpan Wang,
          <string-name>
            <given-names>Jingfei</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiang</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Benyou</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>A quantum-inspired multimodal sentiment analysis framework</article-title>
          .
          <source>Theoretical Computer Science</source>
          , April
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>