Quantum-inspired Multimodal Representation

Qiuchi Li
Department of Information Engineering
University of Padova
Padova, Italy
qiuchili@dei.unipd.it

Massimo Melucci
Department of Information Engineering
University of Padova
Padova, Italy
melo@dei.unipd.it

ABSTRACT
We introduce our work in progress on building multimodal representations under quantum inspiration. The challenge for multimodal representation lies in finding a fusion strategy that captures the interaction between different modalities of data. Neural networks, the most successful approaches so far, lack a mechanism for explicitly showing how different modalities are related to each other. We address this issue by seeking inspiration from Quantum Theory (QT), which has been shown to be advantageous in explicitly capturing the correlations between textual features. In this paper, we give an overview of the related work and present the proposed methodology.

KEYWORDS
multimodal data fusion, quantum physics, neural networks

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
IIR 2019, September 16–18, 2019, Padova, Italy

1 INTRODUCTION
In human communication, messages are often conveyed through a combination of different modalities, such as visual, audio and linguistic modalities. For an automatic understanding of multimodal messages, one needs to fuse the information from different modalities to construct a joint multimodal representation. The challenge lies in how to characterise the interactions between different modalities, which can be complicated in some scenarios. In the example shown in Fig. 1, the visual-linguistic query cannot be understood solely from either the text or the image individually; the relation between the image and the text must be correctly recognized. This is where both the significance and the challenge of multimodal representation learning lie.

Figure 1: Example of a multimodal query.

Most existing research focuses on neural network-based fusion strategies for constructing multimodal representations, ranging from the earlier Hidden Markov Model (HMM)-based models [7] to different RNN variants [1] and, more recently, tensor-based approaches [12] and seq-to-seq structures [8]. Despite their strong accuracy, the interplay between different data modalities is encoded in an inherent way by the neural network components, making it difficult for humans to understand the contribution of each modality in a particular task.

Quantum Theory (QT) provides a well-established theoretical and mathematical formalism for describing the physical world on a microscopic level. Beyond physics, QT frameworks have been applied to many research areas [2, 9, 10, 15], among which successful results have been observed in text-based language understanding [10, 15]. In this context, quantum-inspired frameworks have the natural advantage of explicitly capturing the correlations between features through the concepts of quantum superposition and quantum entanglement.

This inspires us to adopt quantum-inspired frameworks for constructing multimodal representations and to propose effective and interpretable multimodal fusion methods. In particular, we propose to capture the interactions within a single modality with quantum superposition, and to model the cross-modal interactions by means of quantum entanglement. In this way, we establish a pipeline that extracts the interactions within multimodal data in a way that is understandable from a quantum perspective. We expect to obtain performance comparable to state-of-the-art systems in concrete multimodal tasks.

2 PROPOSED METHODOLOGY
Here we introduce our proposed approach for building multimodal data representations inspired by quantum theory. Essentially, we represent multimodal data as many-body systems composed of different data modalities as subsystems. The interaction of different modalities is inherently captured by the notion of entanglement between the subsystems. We propose to build a complex-valued learning network to implement the quantum theoretical framework, which facilitates learning the cross-modal interactions in a data-driven fashion.

2.1 Complex-valued Unimodal Representation
Complex values are essential for the mathematical formalism of quantum physics. However, most existing quantum-inspired models for text representation are based on the real vector space, ignoring the complex-valued nature of quantum notions. Recently, our prior works [3, 4, 11] leveraged quantum superposition to model
IIR 2019, September 16–18, 2019, Padova, Italy                                                                                             Q. Li, M. Melucci


                                                                            are working on deploying the network to CMU-MOSI and CMU-
                                                                            MOSEI, and we expect to see comparable values to state-of-the-art
                                                                            models in terms of effectiveness.

                                                                            ACKNOWLEDGMENTS
                                                                            This PhD project is supported by the European Union‘s Horizon
                                                                            2020 research and innovation programme under the Marie Sklodowska-
                                                                            Curie grant agreement No. 721321.

                                                                            REFERENCES
                                                                             [1] AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-
                                                                                 Philippe Morency. Multimodal Language Analysis in the Wild: CMU-MOSEI
                                                                                 Dataset and Interpretable Dynamic Fusion Graph. In Proceedings of the 56th
                                                                                 Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
                                                                                 Papers), pages 2236–2246, Melbourne, Australia, July 2018. Association for Com-
                                                                                 putational Linguistics.
[2] Jerome R. Busemeyer and Peter D. Bruza. Quantum Models of Cognition and Decision. Cambridge University Press, New York, NY, USA, 1st edition, 2012.
[3] Qiuchi Li, Sagar Uprety, Benyou Wang, and Dawei Song. Quantum-Inspired Complex Word Embedding. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 50–57, 2018.
[4] Qiuchi Li, Benyou Wang, and Massimo Melucci. CNM: An Interpretable Complex-valued Network for Matching. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4139–4148, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
[5] Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, AmirAli Bagher Zadeh, and Louis-Philippe Morency. Efficient Low-rank Multimodal Fusion With Modality-Specific Factors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2247–2256, Melbourne, Australia, July 2018. Association for Computational Linguistics.
[6] Massimo Melucci. Towards modeling implicit feedback with quantum entanglement. Quantum Interaction 2008, 2008.
[7] Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. Towards Multimodal Sentiment Analysis: Harvesting Opinions from the Web. In Proceedings of the 13th International Conference on Multimodal Interfaces, ICMI ’11, pages 169–176, New York, NY, USA, 2011. ACM.
[8] Hai Pham, Thomas Manzini, Paul Pu Liang, and Barnabas Poczos. Seq2seq2sentiment: Multimodal Sequence to Sequence Models for Sentiment Analysis. arXiv:1807.03915 [cs, stat], July 2018.
[9] C. J. van Rijsbergen. The Geometry of Information Retrieval. Cambridge University Press, New York, NY, USA, 2004.
[10] Alessandro Sordoni, Jian-Yun Nie, and Yoshua Bengio. Modeling Term Dependencies with Quantum Language Models for IR. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’13, pages 653–662, New York, NY, USA, 2013. ACM.
[11] Benyou Wang, Qiuchi Li, Massimo Melucci, and Dawei Song. Semantic Hilbert Space for Text Representation Learning. In The World Wide Web Conference, WWW ’19, pages 3293–3299, New York, NY, USA, 2019. ACM. event-place: San Francisco, CA, USA.
[12] Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Tensor Fusion Network for Multimodal Sentiment Analysis. arXiv:1707.07250 [cs], July 2017.
[13] Amir Zadeh, Rowan Zellers, and Eli Pincus. MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos. page 10.
[14] Peng Zhang, Zhan Su, Lipeng Zhang, Benyou Wang, and Dawei Song. A Quantum Many-body Wave Function Inspired Language Modeling Approach. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM ’18, pages 1303–1312, New York, NY, USA, 2018. ACM.
[15] Yazhou Zhang, Dawei Song, Peng Zhang, Panpan Wang, Jingfei Li, Xiang Li, and Benyou Wang. A quantum-inspired multimodal sentiment analysis framework. Theoretical Computer Science, April 2018.

Figure 2: Our proposed multimodal representation framework.

correlations between textual features, and the complex-valued representation leads to improved performance and enhanced interpretability. We attempt to employ the concept of superposition for modeling intra-modal interactions, and investigate the complex-valued embedding approach to capture the interactions within other modalities.

2.2    Tensor-based Approaches for Capturing Inter-modal Interactions

We represent multimodal data as a many-body quantum system in entanglement. The mathematical formulation will be a complex-valued tensor constructed from unimodal complex-valued vectors by tensor-based approaches. Tensor-based models have been applied to classification [5] and matching [14] tasks, but they avoid computing the tensor directly: they decompose it and learn the decomposed weights via neural networks. Explicit tensor combination of different data modalities into a holistic multimodal representation remains unexplored. Melucci [6] proposed a framework to investigate the entanglement between user and document for relevance feedback, but how to apply it to multimodal data remains a challenge.

2.3    Quantum-inspired Framework for Multimodal Sentiment Analysis

We focus on the multimodal sentiment analysis task, and work with the benchmark datasets CMU-MOSI [13] and CMU-MOSEI [1]. The task is to classify the sentiment of a video into 2, 5, or 7 classes using textual, visual, and acoustic features. As shown in Fig. 2, our framework represents unimodal data as a set of pure states through complex-valued embedding, and constructs the many-body state of a video utterance through tensor-based approaches. Finally, quantum-like measurement operators are implemented for sentiment classification. The whole process is implemented as a complex-valued neural network, and the parameters in the pipeline can be learned from labeled data in an end-to-end manner.
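
As a rough illustration of the three stages described above (unimodal pure states, a many-body state built by a tensor product, and a quantum-like measurement), the following NumPy sketch may help. It is not the authors' implementation. The function names, the vector dimensions, the random inputs, and the randomly drawn measurement basis are all assumptions made for illustration, and nothing is learned from data. The sketch also forms a plain tensor product, which yields a separable state. The entangled states targeted by the framework are not modeled here.

```python
import numpy as np


def pure_state(amplitudes, phases):
    """Return a unit-norm complex vector with entries r_j * exp(i * phi_j)."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    phases = np.asarray(phases, dtype=float)
    state = amplitudes * np.exp(1j * phases)
    return state / np.linalg.norm(state)


def many_body_state(text, visual, acoustic):
    """Explicit tensor product of three unimodal states (a separable state)."""
    return np.einsum("i,j,k->ijk", text, visual, acoustic)


def measurement_probabilities(state, basis):
    """Born-rule probabilities |<m_k|psi>|^2 for the orthonormal columns m_k of basis."""
    amplitudes = basis.conj().T @ state.ravel()
    return np.abs(amplitudes) ** 2


rng = np.random.default_rng(0)


def random_state(dim):
    # Hypothetical stand-in for a learned complex-valued embedding.
    return pure_state(rng.random(dim) + 0.1, rng.uniform(0.0, 2.0 * np.pi, dim))


# Toy unimodal states; the dimensions 3, 2, 2 are arbitrary assumptions.
text, visual, acoustic = random_state(3), random_state(2), random_state(2)
psi = many_body_state(text, visual, acoustic)  # complex tensor of shape (3, 2, 2)

# Hypothetical measurement basis: a random unitary obtained by QR decomposition.
dim = psi.size
random_matrix = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
basis, _ = np.linalg.qr(random_matrix)

probs = measurement_probabilities(psi, basis)

# Toy decision rule (an assumption): treat the first two outcomes as two classes.
predicted_class = int(np.argmax(probs[:2]))
```

Because each unimodal vector has unit norm, the product tensor also has unit norm, so the probabilities over a complete orthonormal measurement basis sum to one. In a trained model, the embeddings and the measurement operators would be learnable parameters rather than random draws.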
   The network has a built-in advantage in interpretability over classical neural networks, in that the role of each component is made explicit before the network is trained. We

