=Paper=
{{Paper
|id=Vol-2441/paper17
|storemode=property
|title=Quantum-inspired Multimodal Representation
|pdfUrl=https://ceur-ws.org/Vol-2441/paper18.pdf
|volume=Vol-2441
|dblpUrl=https://dblp.org/rec/conf/iir/LiM19
}}
==Quantum-inspired Multimodal Representation==

Qiuchi Li
Department of Information Engineering
University of Padova
Padova, Italy
qiuchili@dei.unipd.it

Massimo Melucci
Department of Information Engineering
University of Padova
Padova, Italy
melo@dei.unipd.it
ABSTRACT
We introduce our work in progress that targets building multimodal representations under quantum inspiration. The challenge for multimodal representation falls on a fusion strategy that captures the interaction between the different modalities of data. Neural networks, currently the most successful approaches, lack a mechanism for explicitly showing how the different modalities are related to each other. We address this issue by seeking inspiration from Quantum Theory (QT), which has been demonstrated to be advantageous in explicitly capturing the correlations between textual features. In this paper, we give an overview of the related works and present the proposed methodology.

KEYWORDS
multimodal fusion, quantum physics, neural networks

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IIR 2019, September 16–18, 2019, Padova, Italy.

1 INTRODUCTION
In human communication, messages are often conveyed through a combination of different modalities, such as the visual, audio and linguistic modalities. For an automatic understanding of multimodal messages, one needs to fuse the information from the different modalities to construct a joint multimodal representation. The challenge falls on how to characterise the interactions between the different modalities, which can be complicated in some scenarios. This is where the significance and the challenge of multimodal representation learning sit.

Most existing research focuses on neural network-based fusion strategies for constructing multimodal representations, ranging from the earlier Hidden Markov Model (HMM)-based models [7] to different RNN variants [1] and, more recently, tensor-based approaches [12] and seq-to-seq structures [8]. Despite their strong accuracy, the interplay between the different data modalities is encoded implicitly by the neural network components, making it difficult for humans to understand the contribution of each modality to a particular task.

Quantum Theory (QT) provides a well-established theoretical and mathematical formalism for describing the physical world at the microscopic level. Beyond physics, QT frameworks have been applied to many research areas [2, 9, 10, 15], among which successful results have been observed in text-based language understanding [10, 15]. In this context, quantum-inspired frameworks have the natural advantage of explicitly capturing the correlations between features through the concepts of quantum superposition and quantum entanglement. This inspires us to adopt quantum-inspired frameworks for constructing multimodal representations, and to propose effective and interpretable multimodal fusion methods. In particular, we propose to capture the interactions within a single modality with quantum superposition, and to model the cross-modal interactions by means of quantum entanglement. In this way, we establish a pipeline that extracts the interactions within multimodal data in a way that is understandable from a quantum perspective. We expect to obtain performance comparable to state-of-the-art systems on concrete multimodal tasks.

2 PROPOSED METHODOLOGY
Here we introduce our proposed approach for building multimodal data representations inspired by quantum theory. Essentially, we represent multimodal data as many-body systems composed of the different data modalities as subsystems. The interaction of the different modalities is inherently captured by the notion of entanglement between the subsystems. We propose to build a complex-valued learning network that implements the quantum theoretical framework and facilitates learning the cross-modal interactions in a data-driven fashion.
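The many-body view described above can be sketched in a few lines of linear algebra. The following toy example is our own illustration, not the implementation referred to in this paper; all function names, dimensions and values are hypothetical. It encodes each modality as a complex-valued pure state, forms the joint state as the tensor product of the unimodal states, and reads out a probability with a quantum-like measurement (the Born rule):

```python
import numpy as np

def pure_state(features: np.ndarray, phases: np.ndarray) -> np.ndarray:
    """Encode one modality's real features as a complex unit vector (a pure state):
    amplitudes from the normalized feature magnitudes, phases as free parameters."""
    amplitudes = np.abs(features) / np.linalg.norm(features)
    return amplitudes * np.exp(1j * phases)

def joint_state(states):
    """Many-body state of all modalities: the tensor (Kronecker) product
    of the unimodal pure states."""
    psi = states[0]
    for s in states[1:]:
        psi = np.kron(psi, s)
    return psi

def measure(psi: np.ndarray, basis_vector: np.ndarray) -> float:
    """Born rule: probability |<e|psi>|^2 of observing the joint state along
    a basis vector. A classifier would learn one such projector per class."""
    return float(np.abs(np.vdot(basis_vector, psi)) ** 2)

# Toy example: a 2-dim "textual" state and a 2-dim "visual" state.
text = pure_state(np.array([3.0, 4.0]), np.array([0.0, 0.5]))
vision = pure_state(np.array([1.0, 1.0]), np.array([0.2, 0.0]))
psi = joint_state([text, vision])            # 4-dim complex joint state
assert np.isclose(np.linalg.norm(psi), 1.0)  # tensor products of unit vectors stay unit

e0 = np.eye(4, dtype=complex)[0]             # measure along the first basis state
p = measure(psi, e0)
print(round(p, 4))                           # prints 0.18
```

In the actual framework, the phases and the measurement operators would be trainable parameters of the complex-valued network rather than fixed values, and they would be learned end-to-end from labeled data.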
2.1 Complex-valued Unimodal Representation
Complex values are essential for the mathematical formalism of quantum physics. However, most existing quantum-inspired models for text representation are based on the real vector space, ignoring the complex-valued nature of quantum notions. Recently, our prior works [3, 4, 11] leveraged quantum superposition to model the correlations between textual features, and the complex-valued representation leads to improved performance and enhanced interpretability. We attempt to employ the concept of superposition for modeling intra-modal interactions, and to investigate the complex-valued embedding approach to capture the interactions within the other modalities.

2.2 Tensor-based Approaches for Capturing Inter-modal Interactions
We represent multimodal data as a many-body quantum system in entanglement. The mathematical formulation is a complex-valued tensor constructed from the unimodal complex-valued vectors by tensor-based approaches. Tensor-based models have been applied to classification [5] and matching [14] tasks, but they avoid computing the tensor directly by decomposing it and learning the decomposed weights via neural networks. Explicit tensor combination of different data modalities into a holistic multimodal representation remains unexplored. [6] proposed a framework to investigate the entanglement between user and document for relevance feedback, but how to apply it to multimodal data remains a challenge.

2.3 Quantum-inspired Framework for Multimodal Sentiment Analysis
We focus on the multimodal sentiment analysis task, and work with the benchmark datasets CMU-MOSI [13] and CMU-MOSEI [1]. The task is to classify the sentiment of a video into 2, 5 or 7 classes using textual, visual and acoustic features. As shown in Fig. 2, our framework represents unimodal data as a set of pure states through complex-valued embedding, and constructs the many-body state of a video utterance through tensor-based approaches. Finally, quantum-like measurement operators are implemented for sentiment classification. The whole process is implemented as a complex-valued neural network, and the parameters in the pipeline can be learned from labeled data in an end-to-end manner.

Figure 2: Our proposed multimodal representation framework.

The network is born with an advantage in interpretability compared to classical neural networks, in that the role of each component is made explicit prior to the network training phase. We are working on deploying the network on CMU-MOSI and CMU-MOSEI, and we expect to see values comparable to state-of-the-art models in terms of effectiveness.

ACKNOWLEDGMENTS
This PhD project is supported by the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 721321.

REFERENCES
[1] AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246, Melbourne, Australia, July 2018. Association for Computational Linguistics.
[2] Jerome R. Busemeyer and Peter D. Bruza. Quantum Models of Cognition and Decision. Cambridge University Press, New York, NY, USA, 1st edition, 2012.
[3] Qiuchi Li, Sagar Uprety, Benyou Wang, and Dawei Song. Quantum-Inspired Complex Word Embedding. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 50–57, 2018.
[4] Qiuchi Li, Benyou Wang, and Massimo Melucci. CNM: An Interpretable Complex-valued Network for Matching. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4139–4148, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
[5] Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, AmirAli Bagher Zadeh, and Louis-Philippe Morency. Efficient Low-rank Multimodal Fusion With Modality-Specific Factors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2247–2256, Melbourne, Australia, July 2018. Association for Computational Linguistics.
[6] Massimo Melucci. Towards Modeling Implicit Feedback with Quantum Entanglement. In Quantum Interaction 2008, 2008.
[7] Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. Towards Multimodal Sentiment Analysis: Harvesting Opinions from the Web. In Proceedings of the 13th International Conference on Multimodal Interfaces, ICMI '11, pages 169–176, New York, NY, USA, 2011. ACM.
[8] Hai Pham, Thomas Manzini, Paul Pu Liang, and Barnabas Poczos. Seq2Seq2Sentiment: Multimodal Sequence to Sequence Models for Sentiment Analysis. arXiv:1807.03915 [cs, stat], July 2018.
[9] C. J. van Rijsbergen. The Geometry of Information Retrieval. Cambridge University Press, New York, NY, USA, 2004.
[10] Alessandro Sordoni, Jian-Yun Nie, and Yoshua Bengio. Modeling Term Dependencies with Quantum Language Models for IR. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, pages 653–662, New York, NY, USA, 2013. ACM.
[11] Benyou Wang, Qiuchi Li, Massimo Melucci, and Dawei Song. Semantic Hilbert Space for Text Representation Learning. In The World Wide Web Conference, WWW '19, pages 3293–3299, New York, NY, USA, 2019. ACM.
[12] Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Tensor Fusion Network for Multimodal Sentiment Analysis. arXiv:1707.07250 [cs], July 2017.
[13] Amir Zadeh, Rowan Zellers, and Eli Pincus. MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos.
[14] Peng Zhang, Zhan Su, Lipeng Zhang, Benyou Wang, and Dawei Song. A Quantum Many-body Wave Function Inspired Language Modeling Approach. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM '18, pages 1303–1312, New York, NY, USA, 2018. ACM.
[15] Yazhou Zhang, Dawei Song, Peng Zhang, Panpan Wang, Jingfei Li, Xiang Li, and Benyou Wang. A Quantum-inspired Multimodal Sentiment Analysis Framework. Theoretical Computer Science, April 2018.