<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>WSDM</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Quantum-inspired Multimodal Representation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Q. Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. Melucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Padua</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>9</volume>
      <abstract>
        <p>We introduce our work in progress that targets building multimodal
representations under quantum inspiration. The challenge for multimodal
representation falls on a fusion strategy to capture the interaction
between different modalities of data. As the most successful approaches,
neural networks lack a mechanism of explicitly showing how different
modalities are related to each other. We address this issue by seeking
inspiration from Quantum Theory (QT), which has been demonstrated
advantageous in explicitly capturing the correlations between textual
features. In this paper, we give an overview of the related works and
present the proposed methodology.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>In human communication, messages are often conveyed through
a combination of different modalities, such as visual, audio and
linguistic modalities. For an automatic understanding of
multimodal messages, one needs to fuse the information from
different modalities to construct a joint multimodal representation.</p>
      <p>The challenge falls on how to characterise the interactions between
different modalities, which can be complicated in some scenarios. In
the examples shown in Fig. 1, the visual-linguistic query cannot be
understood solely by either the text or image individually, but the
relation between the image and text must be correctly recognized.</p>
      <p>Most existing research focuses on neural network-based fusion
strategies for constructing multimodal representations, ranging from
the earlier Hidden Markov Model (HMM)-based models [7] to different
RNN variants [1], and more recently tensor-based approaches [12] and
sequence-to-sequence structures [8]. Despite their strong accuracy
performances, the interplay between different data modalities is
encoded in an inherent way by the neural network components, making
it difficult for humans to understand the contributions of each
modality in a particular task.</p>
      <p>Quantum Theory (QT) was originally developed to describe the
behavior of nature at the microscopic level. Beyond physics, QT
frameworks have been applied to many research areas [2, 9, 10, 15],
among which successful results have been observed in text-based
language understanding [10, 15]. In this context, quantum-inspired
frameworks have the natural advantage of explicitly capturing the
correlations between features by the concepts of quantum superposition
and quantum entanglement. This inspires us to adopt quantum-inspired
frameworks for constructing multimodal representations, and to propose
effective and interpretable multimodal fusion methods. In particular,
we propose to capture the interactions within a single modality with
quantum superposition, and to model the cross-modal interactions by
means of quantum entanglement. In this way, we establish a pipeline
that extracts the interactions within multimodal data in a way
understandable from a quantum perspective. We expect to obtain
performances comparable to state-of-the-art systems in concrete
multimodal tasks.</p>
    </sec>
    <sec id="sec-1a">
      <title>2 PROPOSED METHODOLOGY</title>
      <p>Here we introduce our proposed approach for building multimodal
data representations inspired by quantum theory. Essentially, we
represent multimodal data as many-body systems composed of different
data modalities as subsystems. The interaction of different modalities
is inherently captured by the notion of entanglement between the
subsystems. We propose to build a complex-valued learning network to
implement the quantum theoretical framework, which facilitates
learning the cross-modal interactions in a data-driven fashion.</p>
    </sec>
    <sec id="sec-2">
      <title>2.1 Complex-valued Unimodal Representation</title>
      <p>
        Complex values are essential for the mathematical formalism of
quantum physics. However, most existing quantum-inspired models
for text representation are based on the real vector space, ignoring
the complex-valued nature of quantum notions. Recently, our
prior works [
        <xref ref-type="bibr" rid="ref11 ref3 ref4">3, 4, 11</xref>
        ] leveraged quantum superposition to model
correlations between textual features, and the complex-valued
representation leads to improved performance and enhanced
interpretability. We attempt to employ the concept of superposition for
modeling intra-modal interactions, and investigate the complex-valued
embedding approach to capture the interactions within other
modalities.
      </p>
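      <p>As an illustration only (not code from this work), the superposition view of a complex-valued embedding can be sketched in a few lines of NumPy. The dimensionality, amplitudes, and phases below are hypothetical stand-ins for learned parameters.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # dimensionality of the toy semantic Hilbert space (hypothetical)

def complex_embedding(amplitudes, phases):
    # Unit-norm complex vector: sum over j of r_j * exp(i * phi_j) on basis j.
    state = amplitudes * np.exp(1j * phases)
    return state / np.linalg.norm(state)

# Two toy "word" states with stand-in amplitudes and phases.
w1 = complex_embedding(rng.random(dim), rng.uniform(0, 2 * np.pi, dim))
w2 = complex_embedding(rng.random(dim), rng.uniform(0, 2 * np.pi, dim))

# A longer unit of text as a (re-normalized) superposition of word states.
sentence = 0.6 * w1 + 0.8 * w2
sentence /= np.linalg.norm(sentence)

# Born rule: the probability of observing basis concept j is the squared
# magnitude of amplitude j, so the probabilities sum to one.
probs = np.abs(sentence) ** 2
```

      <p>The phases, absent in real-valued embeddings, are what allow superposed states to interfere, which is how this line of work models feature correlations.</p>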
    </sec>
    <sec id="sec-4">
      <title>2.2 Tensor-based Approaches for Capturing Inter-modal Interactions</title>
      <p>
        We represent multimodal data as a many-body quantum system in
entanglement. The mathematical formulation will be a
complex-valued tensor constructed from unimodal complex-valued vectors
by tensor-based approaches. Tensor-based models have been
applied for classification [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and matching [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] tasks, but they avoid
directly computing the tensor by decomposing it and
learning the decomposed weights via neural networks. The explicit tensor
combination of different data modalities into a holistic multimodal
representation remains unexplored. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] proposed a framework to
investigate the entanglement between user and document for
relevance feedback, but it remains a challenge how to apply it to
multimodal data.
      </p>
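      <p>To make the tensor-based construction concrete, the following NumPy sketch (ours, not from the cited works) builds the many-body state of three toy unimodal states via an explicit tensor product, and contrasts it with a state that does not factorize, i.e. an entangled one. All dimensions and values are hypothetical.</p>

```python
import numpy as np

def pure_state(v):
    # Normalize a complex vector to unit norm (a pure quantum state).
    v = np.asarray(v, dtype=complex)
    return v / np.linalg.norm(v)

# Toy unimodal states (dimensions are hypothetical).
text   = pure_state([1.0, 1.0j, 0.0])
visual = pure_state([1.0, -1.0])
audio  = pure_state([2.0, 1.0j])

# Many-body state of the three subsystems as an explicit tensor product.
joint = np.einsum('i,j,k->ijk', text, visual, audio)

# A product state like `joint` carries no entanglement; entanglement
# corresponds to a unit-norm tensor that does not factorize, e.g.:
entangled = np.zeros((3, 2, 2), dtype=complex)
entangled[0, 0, 0] = entangled[1, 1, 1] = 1.0 / np.sqrt(2.0)
```

      <p>The decomposition-based models cited above learn factor weights instead of ever materializing a tensor like <code>joint</code>; the explicit combination shown here is the unexplored direction this section refers to.</p>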
    </sec>
    <sec id="sec-6">
      <title>2.3 Quantum-inspired Framework for Multimodal Sentiment Analysis</title>
      <p>
        We focus on the multimodal sentiment analysis task, and work with
the benchmarking datasets CMU-MOSI [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and CMU-MOSEI [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
The task is to classify the sentiment of a video into 2, 5 or 7 classes
with textual, visual and acoustic features. As is shown in Fig. 2,
our framework represents unimodal data as a set of pure states
through complex-valued embedding, and constructs the many-body
state of a video utterance through tensor-based approaches.
Finally, quantum-like measurement operators are implemented for
sentiment classification. The whole process is implemented as a
complex-valued neural network, and the parameters in the pipeline
can be learned from labeled data in an end-to-end manner.
      </p>
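      <p>A minimal sketch of the measurement step, under the assumption of one rank-one projector per sentiment class (our illustrative choice, not necessarily the exact operators used in the framework); the fused state here is a random stand-in for the output of the earlier steps.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4  # dimensionality of the fused state (hypothetical)

def pure_state(v):
    v = np.asarray(v, dtype=complex)
    return v / np.linalg.norm(v)

# Stand-in for the fused multimodal utterance state produced upstream.
psi = pure_state(rng.standard_normal(dim) + 1j * rng.standard_normal(dim))

# One rank-one measurement vector per sentiment class; in a trained
# network these would be learned parameters.
classes = ['negative', 'neutral', 'positive']
measure = [pure_state(rng.standard_normal(dim) + 1j * rng.standard_normal(dim))
           for _ in classes]

# Born rule: the score for class c is the squared magnitude of the
# inner product of the class vector and the state; renormalizing the
# scores across classes yields a distribution over sentiments.
scores = np.array([np.abs(np.vdot(m, psi)) ** 2 for m in measure])
probs = scores / scores.sum()
predicted = classes[int(np.argmax(probs))]
```

      <p>Because every step is an explicit quantum operation (embedding, composition, measurement), the role of each parameter is fixed before training, which is the source of the interpretability advantage discussed next.</p>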
      <p>The network has a built-in advantage in interpretability
compared to classical neural networks, in that the role of each
component is made explicit prior to the network training phase. We
are working on deploying the network on CMU-MOSI and
CMU-MOSEI, and we expect to see values comparable to state-of-the-art
models in terms of effectiveness.</p>
    </sec>
    <sec id="sec-7">
      <title>ACKNOWLEDGMENTS</title>
      <p>This PhD project is supported by the European Union's Horizon
2020 research and innovation programme under the Marie
Skłodowska-Curie grant agreement No. 721321.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>AmirAli</given-names>
            <surname>Bagher Zadeh</surname>
          </string-name>
          , Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency.
          <article-title>Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph</article-title>
          .
          <source>In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>2236</fpage>
          -
          <lpage>2246</lpage>
          , Melbourne, Australia,
          <year>July 2018</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jerome R.</given-names>
            <surname>Busemeyer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Peter D.</given-names>
            <surname>Bruza</surname>
          </string-name>
          .
          <source>Quantum Models of Cognition and Decision</source>
          . Cambridge University Press, New York, NY, USA, 1st edition,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Qiuchi</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sagar</given-names>
            <surname>Uprety</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Benyou</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Dawei</given-names>
            <surname>Song</surname>
          </string-name>
          .
          <article-title>Quantum-Inspired Complex Word Embedding</article-title>
          .
          <source>Proceedings of The Third Workshop on Representation Learning for NLP</source>
          , pages
          <fpage>50</fpage>
          -
          <lpage>57</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Qiuchi</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Benyou</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Massimo</given-names>
            <surname>Melucci</surname>
          </string-name>
          .
          <article-title>CNM: An Interpretable Complexvalued Network for Matching</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>4139</fpage>
          -
          <lpage>4148</lpage>
          , Minneapolis, Minnesota,
          <year>June 2019</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Zhun</given-names>
            <surname>Liu</surname>
          </string-name>
          , Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, AmirAli Bagher Zadeh, and
          <string-name>
            <surname>Louis-Philippe Morency</surname>
          </string-name>
          .
          <article-title>Efficient Low-rank Multimodal Fusion With Modality-Specific Factors</article-title>
          .
          <source>In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>2247</fpage>
          -
          <lpage>2256</lpage>
          , Melbourne, Australia,
          <year>July 2018</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Massimo</given-names>
            <surname>Melucci</surname>
          </string-name>
          .
          <article-title>Towards modeling implicit feedback with quantum entanglement</article-title>
          .
          <source>Quantum Interaction</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Louis-Philippe</given-names>
            <surname>Morency</surname>
          </string-name>
          , Rada Mihalcea, and
          <string-name>
            <given-names>Payal</given-names>
            <surname>Doshi</surname>
          </string-name>
          .
          <article-title>Towards Multimodal Sentiment Analysis: Harvesting Opinions from the Web</article-title>
          .
          <source>In Proceedings of the 13th International Conference on Multimodal Interfaces</source>
          ,
          <source>ICMI '11</source>
          , pages
          <fpage>169</fpage>
          -
          <lpage>176</lpage>
          , New York, NY, USA,
          <year>2011</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Hai</given-names>
            <surname>Pham</surname>
          </string-name>
          , Thomas Manzini, Paul Pu Liang, and Barnabas Poczos.
          <article-title>Seq2seq2sentiment: Multimodal Sequence to Sequence Models for Sentiment Analysis</article-title>
          . arXiv:1807.03915 [cs, stat],
          <year>July 2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>C. J. van Rijsbergen</surname>
          </string-name>
          .
          <source>The Geometry of Information Retrieval</source>
          . Cambridge University Press, New York, NY, USA,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Sordoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jian-Yun</given-names>
            <surname>Nie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Modeling Term Dependencies with Quantum Language Models for IR</article-title>
          .
          <source>In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13</source>
          , pages
          <fpage>653</fpage>
          -
          <lpage>662</lpage>
          , New York, NY, USA,
          <year>2013</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Benyou</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Qiuchi</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Massimo</given-names>
            <surname>Melucci</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Dawei</given-names>
            <surname>Song</surname>
          </string-name>
          .
          <article-title>Semantic Hilbert Space for Text Representation Learning</article-title>
          .
          <source>In The World Wide Web Conference, WWW '19</source>
          , pages
          <fpage>3293</fpage>
          -
          <lpage>3299</lpage>
          , New York, NY, USA,
          <year>2019</year>
          . ACM. event-place: San Francisco, CA, USA.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Amir</given-names>
            <surname>Zadeh</surname>
          </string-name>
          , Minghai Chen, Soujanya Poria, Erik Cambria, and
          <string-name>
            <surname>Louis-Philippe Morency</surname>
          </string-name>
          .
          <article-title>Tensor Fusion Network for Multimodal Sentiment Analysis</article-title>
          .
          <source>arXiv:1707.07250 [cs]</source>
          ,
          <year>July 2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Amir</given-names>
            <surname>Zadeh</surname>
          </string-name>
          , Rowan Zellers, and
          <string-name>
            <given-names>Eli</given-names>
            <surname>Pincus</surname>
          </string-name>
          .
          <source>MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos. page 10.</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Peng</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Zhan Su, Lipeng Zhang, Benyou Wang, and
          <string-name>
            <given-names>Dawei</given-names>
            <surname>Song</surname>
          </string-name>
          .
          <article-title>A Quantum Many-body Wave Function Inspired Language Modeling Approach</article-title>
          .
          <source>In Proceedings of the 27th ACM International Conference on Information and Knowledge Management</source>
          ,
          <source>CIKM '18</source>
          , pages
          <fpage>1303</fpage>
          -
          <lpage>1312</lpage>
          , New York, NY, USA,
          <year>2018</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Yazhou</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Dawei Song, Peng Zhang, Panpan Wang,
          <string-name>
            <given-names>Jingfei</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiang</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Benyou</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>A quantum-inspired multimodal sentiment analysis framework</article-title>
          .
          <source>Theoretical Computer Science</source>
          , April
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>