=Paper= {{Paper |id=Vol-1393/paper-08 |storemode=property |title=Leveraging Metropolis-Hastings Algorithm on Graph-based Model for Multimodal IR |pdfUrl=https://ceur-ws.org/Vol-1393/paper-08.pdf |volume=Vol-1393 |dblpUrl=https://dblp.org/rec/conf/sigir/SabetghadamLR15 }} ==Leveraging Metropolis-Hastings Algorithm on Graph-based Model for Multimodal IR== https://ceur-ws.org/Vol-1393/paper-08.pdf
                 Leveraging Metropolis-Hastings Algorithm on
                    Graph-based Model for Multimodal IR

                                    Serwah Sabetghadam, Mihai Lupu, Andreas Rauber
                                        Institute of Software Technology and Interactive Systems
                                                      Vienna University of Technology
                                                              Vienna, Austria


ABSTRACT                                                                    relevant patterns for a query. 4) Interactive Reranking: in
The velocity of multimodal information shared on the web has                this case a user can edit a part of the search results (to delete
increased significantly. Many reranking approaches try to                   or to emphasize).
improve the performance of multimodal retrieval, however                    Graph-based methods for reranking are a subset of the Self-
not in the direction of the true relevancy of a multimodal ob-              reranking category, in which a graph-oriented search is per-
ject. Metropolis-Hastings (MH) is a method based on Monte                   formed based on relations between objects. Mostly, related
Carlo Markov Chain (MCMC) for sampling from a distribu-                     work in this area is performed on images/videos with sim-
tion when traditional sampling methods such as transfor-                    ilarity links between them [11, 5]. The use of results from
mation or inversion fail. If we assume this probability dis-                independent modality indexing neglects that data objects
tribution as true relevancy of documents for an information                 are interlinked through different relations. The problem
need, in this paper we explore how leveraging our model                     becomes more challenging when the graph is multimodal.
with the Metropolis-Hastings algorithm may help towards true                ferent modalities (text, audio, video or image). We propose
relevancy in multimodal IR.                                                 ferent modalities (text, audio, video or image). We propose
                                                                            a model to utilize the probabilistic model of IR in multimodal
                                                                            retrieval, with the goal of approaching true relevancy rather
Categories and Subject Descriptors                                          than just a reranking. This means that a query may have
H.3 [Information Storage and Retrieval]: General; H.3.3                     null result because of lack of any relevant data. According
[Information Search and Retrieval]: Metrics—Retrieval                       to the probability ranking principle in IR, the relevancy of a
models, Search process                                                      document to a query is defined as p(d|q) = p(q|d)p(d)/p(q). This
                                                                            requires the probabilities of p(q) and p(d) which are not
General Terms                                                               available. Different ranking models like TF.IDF, BM25 or
                                                                            LM aim to probe the true ranking through different models
Theory, Algorithm                                                           on p(q|d).
                                                                               In this paper, we explore the capability of our model to
Keywords                                                                    approach probabilistic IR for multimodal retrieval with the
IR, Multimodal, Graph, Metropolis-Hastings                                  help of the MH algorithm. MH is based on MCMC and is
                                                                            used in cases where it is hard to sample from a probability
                                                                            distribution. Assuming the true probability distribution of
1.    INTRODUCTION                                                          relevancy of documents to the query as stationary distribu-
   There are many challenges in multimodal information re-                  tion, utilizing MH we make a Markov-chain of documents
trieval. Mei et al. [8] have performed a survey on reranking                which results in the same stationary distribution of proba-
models of multimodal information retrieval. They divide the                 bilities. We conduct the experiments on ImageCLEF2011
related work into four categories: 1) Self-reranking: includes              Wikipedia collection as a multimodal collection.
reranking methods that include data from the original rank-
ing result such as Pseudo-Relevance Feedback or learning                    2.   RELATED WORK
a ranking model by giving top ranked documents as posi-
tive examples. 2) Example-based reranking: methods to understand               There are many efforts in multimodal retrieval in com-
the query using accompanying examples. 3) Crowd rerank-                     bining textual and visual modalities. Martinent et al. [7]
ing: leverages crowd-sourced knowledge on the web to mine                   propose to generate automatic document annotations from
                                                                            inter-modal analysis. They consider visual feature vectors
                                                                            and annotation keywords as binary random variables. Jing
                                                                            et al. [6] employ PageRank to rerank image search results. The
                                                                            hyperlinks between images are based on visual similarity of
                                                                            search results. Yao et al. [11] make a similarity graph of
                                                                            images and find authority nodes as result for image queries.
Copyright © 2015 for the individual papers by the papers’ authors. Copy-    Through this model, both visual content and textual infor-
ing permitted for private and academic purposes. This volume is published   mation of the images is explored. Hsu et al. [5] leverage
and copyrighted by its editors.
SIGIR Workshop on Graph Search and Beyond ’15 Santiago, Chile               context reranking as a random walk over a graph of video
Published on CEUR-WS: http://ceur-ws.org/Vol-1393/.                         stories. The links are based on similarities between different
video stories. The final scoring value is a combination of              • Semantic (α): any semantic relation between two ob-
initial text and stationary distribution scores.                          jects in the collection (e.g. the link between lyrics and
   The application of the MH method in information retrieval is           a music file). The edge weight wxy is made inversely
limited to search in peer-2-peer networks [3, 1]. Ferreira et             proportional to the α-out-degree of the source node x,
al. [3] have designed a protocol for locating a specific object           i.e. wxy = 1/Nx^(α).
regardless of the topology of the network through uniform
sampling from peer-to-peer networks. Zhong et al. [12] use              • Part-of (β): a specific type of semantic relation, in-
random walks and focus on convergence time for different                  dicating an object as part of another object, e.g. an
network sizes. They investigate the probability distribution              image in a document. The weight is 1 because of the con-
of visiting nodes. In order to go beyond peer-2-peer net-                 tainment relation (one object being part of another).
works and apply MH in IR, we need a jumping distribution,
i.e. weighted links between nodes. Such links may be sim-               • Similarity (γ): relation between the facets of two in-
ilarity/semantic or a mixture of the two. The difficulty, as              formation objects. The weight is the similarity value
we will see, is ensuring the stochastic and ergodic nature of             between the facets.
the chain.
                                                                        • Facet (δ): linking an object to its representation(s).
                                                                          It is a unidirectional relation from facet to the parent
3.     MH ALGORITHM                                                       object. Weights are given by perceived information
   MH is one of the algorithms based on MCMC to obtain                    content of features, with respect to the query type.
samples from a complex probability distribution π(x). The
goal is to draw samples from π(x), where π(x) = π̃(x)/K. The            Our scoring method consists of two steps: 1) In the first
normalizing constant K is unknown and hard to compute.             step, we perform an initial search with Lucene and/or Lire
Based on the jumping distribution matrix W, the MH algo-           results based on the facets. This provides us a set of activa-
rithm generates a sequence from this distribution as follows:      tion nodes. 2) In the second step, using the initial result set
                                                                   of data objects (with normalized scores) as seeds, we exploit
   1. Start with an initial value x such that π(x) > 0.
                                                                   the graph structure and traverse it.
     2. Using the current x value, sample a candidate point y         The model can perform both partial/whole facet retrieval.
        from W (x, y).                                             We may decide to search e.g. only based on query textual
     3. The transition probability then is made according to       or visual facets, or based on all query facets. In practice, we
                                                                   make a form of late facet fusion by combination of different
       P r(x, y) = W (x, y)λ(x, y)                          (1)    scores and giving one score to the parent information object.
                                                                   However, it is not in the traditional way of late fusion. Since
                                       
        λ(x, y) = min( π̃(y)·W(y, x) / (π̃(x)·W(x, y)), 1 )     (2)    we do not make the result rank list out of top ranked nodes.
                                                                   We initiate their scores in graph nodes and then start prop-
       Note that λ(x, y) does not require knowledge of the         agation. In our model, facet fusion is implicitly calculated
        normalizing constant, because K cancels in the ratio π̃(y)/π̃(x).          by matrix multiplication and final vector computation.
        If it increases the density (ratio > 1), accept y and set the
        next sample xt = y and repeat step 3. If it decreases the      5.    MH MAPPED TO IR
        density, sample u from Uniform(0, 1); accept if λ > u,            We want to achieve a query dependent stationary distri-
        else reject it.                                                bution such that the probability in node x is proportional to
                                                                   the probability that this node is relevant to the query, and at
   In order to reach a unique equilibrium state for a Markov-
                                                                   any other node (non-relevant) the probability is zero. This
chain, it should be ergodic, satisfying irreducibility (for any
                                                                   is the π(x) distribution from which we cannot directly sam-
state, the probability of getting there given any starting
                                                                    ple. Instead, we have π̃(x), which could be a relevance
point is more than zero) and aperiodicity (there is no rhythm
                                                                   scoring function (e.g. a BM25 score between the data ob-
in which states can be reached given a starting point). There
                                                                   ject xi and the query). MH would formally provide us with
may be different proposal distributions for MH. Two general
                                                                   a method to sample from the probability distribution, if the
approaches are [10]: 1) Random walks - the new state y is
                                                                    approximate probability π̃ is properly chosen.
dependent on the current state x. 2) Independent sample
                                                                      We have the graph of different relations in the adjacency
finding - the probability of jumping to point y is chosen
                                                                   matrix W . Assuming the true relevancy of nodes to the
from a distribution of interest, independent of the current
                                                                   query as π(x), we define the π̃(x) as relevance score value
value. This method is usually used in asymmetric MH. We
                                                                   function (RSV ). A node (M) in the graph may be of any
use the first approach in our work.
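The three steps above can be sketched for a weighted graph. A minimal sketch, assuming π̃ is any unnormalized target score (the constant K cancels in the ratio) and W holds row-stochastic proposal weights; node names and dictionary layout are illustrative, not the Astera API:

```python
import random

def metropolis_hastings(pi_tilde, W, start, n_steps, seed=0):
    """Generate a chain of nodes whose stationary distribution is
    proportional to the unnormalized score pi_tilde.

    pi_tilde: dict node -> unnormalized target score (pi = pi_tilde / K)
    W:        dict node -> dict neighbour -> proposal weight (rows sum to 1)
    """
    rng = random.Random(seed)
    x = start                      # must satisfy pi_tilde[start] > 0
    chain = [x]
    for _ in range(n_steps):
        # Step 2: propose a neighbour y from the jumping distribution W(x, .)
        nbrs, wts = zip(*W[x].items())
        y = rng.choices(nbrs, weights=wts)[0]
        # Step 3: acceptance ratio lambda(x, y); K cancels in the ratio
        lam = min((pi_tilde[y] * W[y].get(x, 0.0)) /
                  (pi_tilde[x] * W[x][y]), 1.0)
        if rng.random() < lam:     # accept with probability lambda
            x = y
        chain.append(x)            # on rejection the chain stays at x
    return chain
```

On a two-node graph with scores 3 and 1, the chain visits the first node roughly three quarters of the time, as the stationary distribution predicts.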
                                                                   modality: Text (T), Image (I), Audio (A) or Video (V), and
                                                                    the query (Q) may be a combination of different modalities.
4.     MODEL REPRESENTATION                                        We define the relevance score value function (RSV ), as fol-
   We define a graph-based model G = (V, E), in which V            lows:
is the set of information objects and their facets, and E
                                                                   M ∈ {T, I, V, A}
is the set of edges. By facet we mean inherent feature or
representation of an object (e.g., tf.idf facet of a document      M = ∪_{i=1}^{n} M_{fi}
or edge histogram of an image). Each object may have a             Q = ∪_{j=1}^{m} Q_{fj}
number of facets. We define four types of relations. Their         l = |{Qfi | Qfi = Mfi }|
characteristics are discussed in detail in [9]. We formally
define the relation types and their weights as follows:
                                                                       5.2     Role of MH in Adjusting the Weights
         RSV (Q, M ) = Σ_{i=1}^{l} norm(sim(Qfi , Mfi )) · wfi      (3)        In principle, MH either accepts a jumping density of W (x, y)
                                                                               (when λ > 1) and keeps the value and moves forward, or
                                                                               modifies the weight with the factor of λ. The new value of
where n is the number of facet types of the information                this link for next step is W (x, y) · λ. According to stochas-
object node, m is the number of facet types of the query, sim          tic property, the sum of the weights of links of an edge is
is the similarity function between two facets, norm is the             1. In each step, when weights are adjusted by MH, the
normalizing function and wfi is the weight of facet fi for this        sum may get lower than 1. In this case the link is ac-
query. We compute the similarity (sim) over the l common               cepted with probability λ < 1. The decreased value is
facets of this information object and the query, in which              given as a self-transitivity value to the node, indicating that
Qfi and Mfi are the values of the corresponding facets.                staying in this state is preferred over choosing that link.
Usually the value of a facet is in the form of a feature vector.       Performing this for many steps weakens the links to less
In case of no common facet, the sim function output is zero.           relevant neighbours and keeps the links to increasingly rel-
Relevancy of an information object to a query should be                evant neighbours. This way, MH may modify the weights
calculated relative to other information objects. For this             in the direction of making a Markov chain which reaches
purpose we compute the similarity of all objects for each              the true probability distribution.
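The weight adjustment of Section 5.2 can be read as a row-wise rescaling in which the rejected mass becomes a self-loop, keeping the row stochastic. A minimal sketch under that reading (dictionary layout and the `'self'` key are illustrative, not the Astera implementation):

```python
def mh_adjust_row(W_row, lam_row):
    """Scale each outgoing weight by its acceptance factor lambda and
    give the lost probability mass to a self-loop, so the row still
    sums to 1 (staying put is preferred over a down-weighted link)."""
    adjusted = {y: w * lam_row[y] for y, w in W_row.items()}
    deficit = 1.0 - sum(adjusted.values())
    adjusted['self'] = adjusted.get('self', 0.0) + deficit
    return adjusted
```

With weights (0.5, 0.5) and acceptance factors (1.0, 0.5), the second link drops to 0.25 and the missing 0.25 moves to the self-loop.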
query facet and normalize. As we have a multimodal graph                  To prevent poor mixing, we start from different starting
and in each step may visit a node with different modality,             points according to standard search result for each facet.
we require a normalized value to be able to compare the                These points satisfy the condition π̃(x) > 0, as they come
relevancy values.                                                      from the scored, ranked result.
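The RSV function of Eq. 3 can be illustrated with a small sketch. Cosine similarity stands in for the unspecified sim function, and all facet names, vectors, and weights below are hypothetical:

```python
import math

def cosine(u, v):
    """Stand-in similarity between two facet feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rsv(query_facets, object_facets, facet_weights, norm=lambda s: s):
    """RSV(Q, M) = sum over the l common facets of
    norm(sim(Q_f, M_f)) * w_f; facets present in only one of the
    two objects contribute nothing (sim is zero for them)."""
    common = set(query_facets) & set(object_facets)   # the facet set l
    return sum(norm(cosine(query_facets[f], object_facets[f])) * facet_weights[f]
               for f in common)
```

An audio node sharing only a tf.idf facet with an image query is scored on that single facet, exactly as described for cross-modal comparisons above.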
   Different modalities have different facets. Reaching nodes
with the same modality of query examples, we have all the              6.    EXPERIMENT DESIGN
facets in common (e.g. an image query and an image node).                We used the ImageCLEF 2011 Wikipedia collection for the
Visiting nodes with a different modality than the query examples,        image retrieval task. Each image has one metadata file that
we perform similarity for common facets. For instance, if we           provides information about name, location, one or more as-
have an audio object and an image query, we can compare                sociated parent documents in up to three languages (EN, DE
their textual facets (the tf.idf facet of image metadata and           and FR), and textual image annotations (i.e. caption, de-
tf.idf facet of the audio tags or lyrics).                             scription and comment). The collection consists of 125,828
                                                                       documents and 237,434 images. We parsed the image meta-
                                                                       data and created nodes for all parent documents, images and
5.1    MH Constraints in Astera                                        corresponding facets. We created different relation types:
   Irreducibility: To check irreducibility we should prove             the β relation between parent documents and images (as
that our graph is connected. By adding different relations             part of the document), and δ relation between information
of β, γ and α, we have a connected graph. For this purpose,            objects and their facets. We use the 50 English query topics.
starting from top ranked results for a sample query we tra-
verse the graph. In each step we visit new neighbours and              6.1     Document and Image Facets
continue until we see no more new nodes. The number of                   In the first phase of our hybrid search, we use standard
nodes seen in this traversal was the whole graph size. This            indexing results both for documents and images. The com-
observation, even for one query, indicates the connectivity            puted scores in both modalities are normalized per topic be-
of our graph.                                                          tween (0,1) based on min-max method. Different indexings
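The irreducibility check described above (traverse from a seed, visit new neighbours until no new nodes appear, then compare the count against the graph size) can be sketched as a breadth-first traversal; the adjacency dictionary is illustrative:

```python
from collections import deque

def is_connected(adj, start):
    """Traverse the graph from a start node, visiting new neighbours
    until no new nodes appear; the graph is connected (and the chain
    irreducible over it) iff every node is reached."""
    seen = {start}
    frontier = deque([start])
    while frontier:
        x = frontier.popleft()
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                frontier.append(y)
    return len(seen) == len(adj)
```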
   Aperiodicity: Return times to a node from a starting point           based on different facets are:
in our graph are not multiples of a fixed period. We satisfy this constraint
by construction.                                                            • Text tf.idf facet: We utilize default Lucene indexer,
   Stochastic property: According to the weight definition                    based on tf.idf, as text facet.
in Astera for β links, the sum of weights on a row may be
more than one. However, semantic (α) and/or similarity                      • Image textual annotation tf.idf facet (Metadata):
(γ) links can be used in a normalized form, complying with                    We use the metadata information of the images (caption,
stochastic property.                                                          comment and description) as image textual facets.
   Transition Function in Astera: According to the Metropolis-                • CEDD facet: For image facets, we selected the Color
Hastings algorithm and Eq. 2, we sample from W (x, y) and                       and Edge Directivity Descriptor (CEDD) feature since
accept the move with probability λ(x, y). This determines                       it is considered the best method to extract purely vi-
how we define high-order transition probabilities after t steps:                sual results [2].
Prq^{t+1}(x, y) = Σ_{i=1}^{k} Prq^t(x, zi) · Prq(zi, y), where q is the query, k
is the number of common nodes z between x and y, and P rt                In the second phase, starting from standard indexed re-
is the transition probability of starting from x and moving            sults, we conduct the graph search based on MH. In this
t steps further.                                                       instantiation of Astera, we use only β links between the
   Mixing: Walsh [10] divides mixing chains into two cate-                   documents and images. Investigating the addition of α and
gories, poorly mixing and well mixing. To prevent poor                       δ link types is left for future work.
mixing, one usual way is to use the Simulated Annealing
method with high jumps. The second option is to start with                 6.2     Transition Matrix in Astera
several chains to cover the space to find nodes. Our model               To compute the transition matrix P r, we need to com-
follows the second option, as we start from different starting         pute λ(x, y) for each pair of neighbouring nodes to update the
points according to standard search result for each facet.             weights. In this instantiation of Astera with ImageCLEF
2011 Wikipedia collection, we have images and documents              We compare the results with/without using MH algorithm
node types. The query topic in this collection is multimodal.     (Tables 1, 2). We did not get better results in our prelimi-
It is a combination of keywords and image examples with           nary experiment with MH. The reason is the dependency of
facet set of {tf.idf, CEDD}.                                      a jump on the value of RSV (y)/RSV (x). The implemented
   Based on any of these facets, we can start traversal in the    RSV function for images is based on the metadata facet. A
graph. For example, if we start from similarity with meta-        large number of images are not retrieved in the Lucene result
data tf.idf results, we will have a set of images as starting     for the Metadata facet - we retrieve on the order of 1,000
points to make the traversal. In this instantiation of Astera,    images per query, compared to the 237,434 images in the
an image object (I) has two facets of {tf.idf, CEDD}. The         collection. We set the minimum of the retrieved scores
common set of facets l between the query and image is             (0.0001) as the RSV value of visited images that are not in
l = {tf.idf, CEDD}. Each image is connected to at least           the Lucene results. We have observed that this approach
one parent document (D) through a β link. To compute              biases a large number of images towards very low scores,
P r(I, D) = W (I, D) · λ(I, D), we need the λ value, which is:    which we assume to be the cause of the low precision. Further
                                                                  experiments in this direction are needed.1
                                                
             λ(I, D) = min( (RSV (Q, D)/RSV (Q, I)) · (W (D, I)/W (I, D)), 1 )      (4)    7.    CONCLUSION AND DISCUSSION
                                                                     We presented a graph-based model for multimodal IR
  where                                                           leveraging MH algorithm. The graph is enriched by ex-
                                                                  tracted facets of information objects. Different modalities
    RSV (Q, I) = norm(sim(Qtf.idf , Itf.idf )) · wtf.idf +             are treated equally thanks to faceted search. We proposed
                   norm(sim(QCEDD , ICEDD )) · wCEDD      (5)          a generic relevancy function based on facet similarity of ob-
                                                                  jects to the query. Leveraging this model, we have a plat-
   and                                                             form, with the potential to investigate the effect of different facets on
                                                                  performance, and burning in the matrix. We have the op-
                                                                  portunity to examine query dependent traversal, as weights
       RSV (Q, D) = norm(sim(Qtf.idf , Dtf.idf )) · wtf.idf   (6)    in the graph are affected by relevancy of source and target
                                                                  nodes to the query. The preliminary results with MH did
   The RSV value is computed based on normalized Lucene           not improve the result. Many steps in the graph should be
and LIRE similarity score for tf.idf and CEDD facet respec-       taken until the matrix burns in to the stationary distribu-
tively. The wCEDD and wtf.idf are facet weights for this          tion, which is in our future work. However, this experiment
query. For each query, we perform this similarity computa-        brings some issues to discuss: 1) How much the final prob-
tion in all three languages, separately for image metadata        ability distribution is dependent on the chosen π     e(x)? 2)
and documents. We take this value as relevancy value of           Is MH algorithm on graph-based collections an opportunity
each image/document for a specific query.                         to compare the effect of different ranking models? 3) How
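Equation 4 reduces to a few arithmetic operations once RSV values and edge weights are known. A sketch with hypothetical scores and weights (the dictionaries are illustrative, not Astera data structures):

```python
def lam(rsv_q, W, x, y):
    """Acceptance factor for moving from node x to node y (Eq. 2/4):
    lambda(x, y) = min( (RSV(Q,y)/RSV(Q,x)) * (W(y,x)/W(x,y)), 1 ).
    rsv_q maps a node to its relevance score value for the query."""
    ratio = (rsv_q[y] * W[y][x]) / (rsv_q[x] * W[x][y])
    return min(ratio, 1.0)

# Hypothetical image node 'I' and parent document 'D':
rsv_q = {'I': 0.2, 'D': 0.1}
W = {'I': {'D': 1.0}, 'D': {'I': 0.5}}
print(lam(rsv_q, W, 'I', 'D'))  # ratio = (0.1*0.5)/(0.2*1.0) = 0.25
```

A move towards a node with a higher RSV (relative to the proposal weights) yields a ratio above 1 and is always accepted.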
                                                                  much expensive is this approach regarding the need of high
6.3      Experiment Result                                        number of transitions until the matrix burns in? 4) How
   We included text tf.idf and metadata tf.idf facets in this     ability distribution is dependent on the chosen π̃(x)? 2)
experiment. We start with top 20 similar documents and            heterogeneous relation types? In principle, this property is
images (as activated nodes) based on these facets for each        to compare the effect of different ranking models? 3) How
query, and traverse the graph from these 40 touch points,         expensive is this approach, given the high number of transi-
step by step in parallel. In each step, for node x and its        tions needed until the matrix burns in? 4) How
neighbour y, we compute λ(x, y), update the weight and            do we satisfy the stochastic property in a multimodal graph
continue to the next neighbour. This is performed in the          with heterogeneous relation types? In principle, this property is
form of matrix multiplications.                                   8.   REFERENCES
   In Markov chain random walks, without MH algorithm,             [1] A. Awan, R. A. Ferreira, S. Jagannathan, and
we utilize matrix multiplication to simulate the walk in the           A. Grama. Distributed uniform sampling in
graph. The probability distribution after t steps is calcu-            unstructured peer-to-peer networks. In HICSS, 2006.
lated as at = a0 · W t , where a0 is the starting scores and       [2] T. Berber, A. H. Vahid, O. Ozturkmenoglu, R. G.
at is the scores after t steps. However, leveraging MH, the            Hamed, and A. Alpkocak. Demir at imageclefwiki
edge weights are affected by λ (Eq. 1). This is a potential            2011: Evaluating different weighting schemes in
problem for computing the updated transition matrix. The               information retrieval. In CLEF, 2011.
reason is that, in each iteration, the matrix W is affected
                                                                   [3] R. A. Ferreira, M. Krishna Ramanathan, A. Awan,
by λ, which is a min function: W · λ in the first iteration and
                                                                       A. Grama, and S. Jagannathan. Search with
W · λ · λ in the second iteration. However, Hlynka et al. [4]
                                                                       probabilistic guarantees in unstructured peer-to-peer
observed that the transition matrix P r does not change in
                                                                       networks. In P2P, 2005.
further steps. Therefore, we need to compute only once the
matrix of P r(x, y) = W (x, y) · λ(x, y) for all nodes, and use    [4] M. Hlynka and M. Cylwa. Observations on the
this matrix in further multiplications. This makes the MH              Metropolis-Hastings Algorithm. University of Windsor,
steps simulation feasible in implementation.                           Department of Mathematics and Statistics, 2009.
   We compute the final score as at = a0 · P rt after t steps.     [5] W. H. Hsu, L. S. Kennedy, and S.-F. Chang. Video
This computation is needed for intermediate steps, since in the ideal        search reranking through random walk over
case the multiplication is performed many times until the              document-level context graph. MULTIMEDIA, 2007.
matrix converges and in the stationary distribution the nodes’          1
                                                                          The code of Astera is open-source and available at:
probabilities are independent of the starting scores in the graph.      http://www.ifs.tuwien.ac.at/∼sabetghadam/Astera.html
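The propagation a_t = a_0 · Pr^t described above can be sketched as repeated vector-matrix products, using the fixed MH-adjusted transition matrix computed once (the toy 2-node matrix is illustrative, not the actual Astera matrix):

```python
def propagate(a0, Pr, t):
    """Propagate initial scores a0 through t steps of the chain:
    a_t = a_0 · Pr^t, computed as t successive vector-matrix
    multiplications with the (fixed) transition matrix Pr."""
    n = len(a0)
    a = list(a0)
    for _ in range(t):
        a = [sum(a[i] * Pr[i][j] for i in range(n)) for j in range(n)]
    return a
```

As t grows, a_t approaches the stationary distribution and the influence of the starting scores vanishes, matching the ideal case described above.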
 steps   p@10      r@10     p@20    r@20
 1       0.297    0.135     0.229    0.158
 2       0.297    0.135     0.229    0.158
 3       0.252    0.123     0.188    0.138
 4       0.224    0.120     0.184    0.134
 5       0.206    0.1148    0.173    0.124
 6       0.182    0.1104    0.156    0.113
 7       0.142    0.106     0.13     0.115

Table 1: Result for documents without image facets, self-
transitivity: 0.9, links: δ, β

 steps   p@10      r@10     p@20     r@20
 1       0.27    0.125     0.151    0.135
 2       0.27    0.125     0.151    0.135
 3       0.23    0.113     0.148    0.1295
 4       0.22    0.1097    0.133    0.1163
 5       0.18    0.1091    0.113    0.1163
 6       0.17    0.107     0.111    0.109
 7       0.14    0.08      0.108    0.087

Table 2: Result for documents without image facets, self-
transitivity: 0.9, links: δ, β


 [6] Y. Jing and S. Baluja. Visualrank: Applying pagerank
     to large-scale image search. IEEE Trans. Pattern
     Anal. Mach. Intell., 2008.
 [7] J. Martinet and S. Satoh. An information theoretic
     approach for automatic document annotation from
     intermodal analysis. In Workshop on Multimodal
     Information Retrieval, 2007.
 [8] T. Mei, Y. Rui, S. Li, and Q. Tian. Multimedia search
     reranking: A literature survey. ACM Computing
     Surveys (CSUR), 2014.
 [9] S. Sabetghadam, M. Lupu, and A. Rauber. Astera - a
     generic model for multimodal information retrieval. In
     Proc. of Integrating IR Technologies for Professional
     Search Workshop, 2013.
[10] B. Walsh. Markov chain monte carlo and gibbs
     sampling. 2004.
[11] T. Yao, T. Mei, and C.-W. Ngo. Co-reranking by
     mutual reinforcement for image search. CIVR, 2010.
[12] M. Zhong, K. Shen, and J. Seiferas. The
     convergence-guaranteed random walk and its
     applications in peer-to-peer networks. Computers,
     IEEE Transactions on, 2008.