=Paper=
{{Paper
|id=Vol-1393/paper-08
|storemode=property
|title=Leveraging Metropolis-Hastings Algorithm on Graph-based Model for Multimodal IR
|pdfUrl=https://ceur-ws.org/Vol-1393/paper-08.pdf
|volume=Vol-1393
|dblpUrl=https://dblp.org/rec/conf/sigir/SabetghadamLR15
}}
==Leveraging Metropolis-Hastings Algorithm on Graph-based Model for Multimodal IR==
Leveraging Metropolis-Hastings Algorithm on Graph-based Model for Multimodal IR

Serwah Sabetghadam, Mihai Lupu, Andreas Rauber
Institute of Software Technology and Interactive Systems
Vienna University of Technology, Vienna, Austria

ABSTRACT

The velocity of multimodal information shared on the web has increased significantly. Many reranking approaches try to improve the performance of multimodal retrieval, however not in the direction of the true relevancy of a multimodal object. Metropolis-Hastings (MH) is a method based on Markov chain Monte Carlo (MCMC) for sampling from a distribution when traditional sampling methods such as transformation or inversion fail. Assuming this probability distribution to be the true relevancy of documents for an information need, in this paper we explore how leveraging our model with the Metropolis-Hastings algorithm may help towards true relevancy in multimodal IR.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: General; H.3.3 [Information Search and Retrieval]: Metrics—Retrieval models, Search process

General Terms

Theory, Algorithm

Keywords

IR, Multimodal, Graph, Metropolis-Hastings

1. INTRODUCTION

There are many challenges in multimodal information retrieval. Mei et al. [8] have performed a survey on reranking models for multimodal information retrieval. They divide the related work into four categories: 1) Self-reranking: reranking methods that use data from the original ranking result, such as Pseudo-Relevance Feedback or learning a ranking model by taking top-ranked documents as positive examples. 2) Example-based reranking: methods that understand the query using accompanying examples. 3) Crowd reranking: leverages crowd-sourced knowledge on the web to mine relevant patterns for a query. 4) Interactive reranking: a user can edit a part of the search results (to delete or to emphasize).

Graph-based methods for reranking are a subset of the self-reranking category, in which a graph-oriented search is performed based on relations between objects. Mostly, related work in this area is performed on images/videos with similarity links between them [11, 5]. The use of results from independent modality indexing neglects that data objects are interlinked through different relations. The problem becomes more challenging when the graph is multimodal: during traversal, we may see information objects from different modalities (text, audio, video or image).

We propose a model that utilizes the probabilistic model of IR in multimodal retrieval, with the goal of approaching true relevancy rather than just reranking. This means that a query may have a null result because of the lack of any relevant data. According to the probability ranking principle in IR, the relevancy of a document to a query is defined as

    p(d|q) = p(q|d) · p(d) / p(q)

This requires the probabilities p(q) and p(d), which are not available. Different ranking models like TF.IDF, BM25 or LM aim to probe the true ranking through different models of p(q|d).

In this paper, we explore the capability of our model to approach probabilistic IR for multimodal retrieval with the help of the MH algorithm. MH is based on MCMC and is used in cases where it is hard to sample from a probability distribution directly. Assuming the true probability distribution of relevancy of documents to the query as the stationary distribution, we use MH to build a Markov chain of documents whose stationary distribution matches these probabilities. We conduct our experiments on the ImageCLEF 2011 Wikipedia collection as a multimodal collection.
2. RELATED WORK

There are many efforts in multimodal retrieval on combining textual and visual modalities. Martinet et al. [7] propose to generate automatic document annotations from inter-modal analysis. They consider visual feature vectors and annotation keywords as binary random variables. Jing et al. [6] employ PageRank to rerank image search results; the hyperlinks between images are based on the visual similarity of search results. Yao et al. [11] build a similarity graph of images and find authority nodes as results for image queries. Through this model, both the visual content and the textual information of the images are explored. Hsu et al. [5] leverage context reranking as a random walk over a graph of video stories. The links are based on similarities between different video stories. The final scoring value is a combination of the initial text scores and the stationary distribution scores.

The application of the MH method in information retrieval is limited to search in peer-to-peer networks [3, 1]. Ferreira et al. [3] have designed a protocol for locating a specific object regardless of the topology of the network, through uniform sampling from peer-to-peer networks. Zhong et al. [12] use random walks and focus on convergence time for different network sizes. They investigate the probability distribution of visiting nodes. In order to go beyond peer-to-peer networks and apply MH in IR, we need a jumping distribution, i.e. weighted links between nodes. Such links may be similarity links, semantic links, or a mixture of the two. The difficulty, as we will see, is ensuring the stochastic and ergodic nature of the chain.

3. MH ALGORITHM

MH is one of the algorithms based on MCMC to obtain samples from a complex probability distribution π(x). The goal is to draw samples from π(x), where π(x) = π̃(x)/K. The normalizing constant K is unknown and hard to compute. Based on the jumping distribution matrix W, the MH algorithm generates a sequence from this distribution as follows:

1. Start with an initial value x such that π(x) > 0.

2. Using the current value x, sample a candidate point y from W(x, y).

3. The transition probability is then made according to

    Pr(x, y) = W(x, y) · λ(x, y)    (1)

    λ(x, y) = min( (π̃(y) · W(y, x)) / (π̃(x) · W(x, y)), 1 )    (2)

Note that λ(x, y) does not require knowledge of the normalizing constant, because it cancels in the ratio π̃(y)/π̃(x). If the move increases the density, accept y and set the next sample x_t = y. If it decreases the density, sample u from the uniform distribution on (0, 1) and accept y if λ > u, else reject it. Repeat from step 2.

In order to reach a unique equilibrium state, a Markov chain should be ergodic, satisfying irreducibility (for any state, the probability of getting there from any starting point is greater than zero) and aperiodicity (there is no fixed rhythm in which states can be reached from a starting point). There may be different proposal distributions for MH. Two general approaches are [10]: 1) Random walks: the new state y depends on the current state x. 2) Independent sampling: the probability of jumping to point y is chosen from a distribution of interest, independent of the current value; this method is usually used in asymmetric MH. We use the first approach in our work.
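As a concrete illustration of steps 1-3, the following is a minimal sketch of the random-walk MH variant described above, over a finite node set with a row-stochastic proposal matrix W and an unnormalized target π̃. The function and variable names are ours, not from the Astera implementation.

```python
import numpy as np

def metropolis_hastings(W, pi_tilde, x0, n_steps, rng=None):
    """Random-walk MH over a finite node set (steps 1-3 above).

    W        -- (n, n) row-stochastic jumping (proposal) matrix
    pi_tilde -- unnormalized target scores; the constant K cancels in Eq. (2)
    x0       -- start node with pi_tilde[x0] > 0
    """
    rng = rng or np.random.default_rng()
    x, chain = x0, [x0]
    for _ in range(n_steps):
        # Step 2: sample a candidate y from the current row of W
        y = rng.choice(len(pi_tilde), p=W[x])
        # Step 3 (Eq. 2): acceptance probability
        lam = min((pi_tilde[y] * W[y, x]) / (pi_tilde[x] * W[x, y]), 1.0)
        if rng.random() < lam:
            x = y                 # accept the move
        chain.append(x)           # on rejection the chain stays at x
    return chain
```

Over enough steps of an ergodic chain, the visit frequencies of the chain approximate π(x) = π̃(x)/K, which is what Section 5 exploits by taking π̃ to be a relevance score.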
4. MODEL REPRESENTATION

We define a graph-based model G = (V, E), in which V is the set of information objects and their facets, and E is the set of edges. By facet we mean an inherent feature or representation of an object (e.g., the tf.idf facet of a document or the edge histogram of an image). Each object may have a number of facets. We define four types of relations; their characteristics are discussed in detail in [9]. We formally define the relation types and their weights as follows:

• Semantic (α): any semantic relation between two objects in the collection (e.g. the link between lyrics and a music file). The edge weight w_xy is made inversely proportional to the α-out-degree of the source node x: w_xy = 1/N_x^(α).

• Part-of (β): a specific type of semantic relation, indicating that an object is part of another object, e.g. an image in a document. The weight is 1, because of the containment relation of one object being part of another.

• Similarity (γ): a relation between the facets of two information objects. The weight is the similarity value between the facets.

• Facet (δ): linking an object to its representation(s). It is a unidirectional relation from the facet to the parent object. Weights are given by the perceived information content of the features, with respect to the query type.

Our scoring method consists of two steps: 1) In the first step, we perform an initial search with Lucene and/or Lire based on the facets. This provides us with a set of activation nodes. 2) In the second step, using the initial result set of data objects (with normalized scores) as seeds, we exploit the graph structure and traverse it.

The model can perform both partial and whole facet retrieval. We may decide to search, e.g., only based on the query's textual or visual facets, or based on all query facets. In practice, we make a form of late facet fusion by combining different scores into one score for the parent information object. However, this is not late fusion in the traditional sense, since we do not build the result list out of the top-ranked nodes: we initialize their scores in the graph nodes and then start propagation. In our model, facet fusion is implicitly calculated by matrix multiplication and the final vector computation.
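To illustrate the four relation types, the sketch below assembles a toy jumping matrix W. The graph layout, similarity scores and facet weights are hypothetical placeholders, and the final row-normalization reflects the stochasticity requirement discussed in Section 5.1, not a step the paper prescribes here.

```python
import numpy as np

# Toy graph: 0 = document, 1-2 = images in it, 3-4 = tf.idf facets of the images
n = 5
W = np.zeros((n, n))

# Part-of (beta): images are part of the document; containment weight 1
for img in (1, 2):
    W[0, img] = W[img, 0] = 1.0

# Semantic (alpha): weight 1 / alpha-out-degree of the source node
alpha_out = {1: [2]}                 # hypothetical semantic link image 1 -> image 2
for x, targets in alpha_out.items():
    for y in targets:
        W[x, y] = 1.0 / len(targets)

# Facet (delta): unidirectional facet -> parent object, weighted by the
# perceived information content of the feature (hypothetical values)
W[3, 1] = 0.7
W[4, 2] = 0.7

# Similarity (gamma): weight is the similarity value between two facets
W[3, 4] = W[4, 3] = 0.35             # hypothetical tf.idf similarity

# Row-normalize so each row is a probability distribution (jumping matrix)
rows = W.sum(axis=1, keepdims=True)
W = np.divide(W, rows, out=np.zeros_like(W), where=rows > 0)
```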
5. MH MAPPED TO IR

We want to achieve a query-dependent stationary distribution such that the probability at node x is proportional to the probability that this node is relevant to the query, and at any other (non-relevant) node the probability is zero. This is the π(x) distribution from which we cannot directly sample. Instead, we have π̃(x), which could be a relevance scoring function (e.g. a BM25 score between the data object x_i and the query). MH formally provides us with a method to sample from the probability distribution, if the approximate probability π̃ is properly chosen.

We have the graph of different relations in the adjacency matrix W. Assuming the true relevancy of nodes to the query as π(x), we define π̃(x) as a relevance score value function (RSV). A node M in the graph may be of any modality (Text (T), Image (I), Audio (A) or Video (V)), and the query Q may be a combination of different modalities:

    M ∈ {T, I, V, A}
    M = ∪_{i=1..n} M_{f_i}
    Q = ∪_{j=1..m} Q_{f_j}
    l = |{Q_{f_i} : Q_{f_i} = M_{f_i}}|

that is, an object M is the union of its n facets, the query Q is the union of its m facets, and l is the number of facets they have in common. We define the relevance score value function (RSV) as follows:

    RSV(Q, M) = Σ_{i=1..l} norm(sim(Q_{f_i}, M_{f_i})) · w_{f_i}    (3)

where sim is the similarity function between two facets, norm is the normalizing function, and w_{f_i} is the weight of facet f_i for this query. We compute the similarity (sim) over the l facets that this information object and the query have in common, where Q_{f_i} and M_{f_i} are the values of the corresponding facets. Usually the value of a facet is in the form of a feature vector. In case of no common facet, the sim function output is zero.

The relevancy of an information object to a query should be calculated relative to the other information objects. For this purpose we compute the similarity of all objects for each query facet and normalize. As we have a multimodal graph and in each step may visit a node of a different modality, we require a normalized value to be able to compare the relevancy values.

Different modalities have different facets. Reaching nodes with the same modality as the query examples, we have all the facets in common (e.g. an image query and an image node). Visiting nodes of a different modality than the query examples, we compute similarity over the common facets. For instance, if we have an audio object and an image query, we can compare their textual facets (the tf.idf facet of the image metadata and the tf.idf facet of the audio tags or lyrics).
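A sketch of Eq. (3), assuming facets are stored as feature vectors keyed by facet name, with cosine similarity standing in for sim (the paper does not fix this choice) and per-facet min-max normalization across all candidate objects as norm. All names are ours.

```python
import numpy as np

def cosine(q, m):
    return float(q @ m / (np.linalg.norm(q) * np.linalg.norm(m)))

def rsv_all(query_facets, objects, facet_weights):
    """RSV of every object for the query, per Eq. (3).

    query_facets  -- dict: facet name -> feature vector
    objects       -- list of dicts: facet name -> feature vector
    facet_weights -- dict: facet name -> weight w_f for this query
    Only facets common to the query and an object contribute;
    an object with no common facet keeps a score of 0.
    """
    scores = np.zeros(len(objects))
    for f, qvec in query_facets.items():
        sims = np.array([cosine(np.asarray(qvec), np.asarray(o[f]))
                         if f in o else np.nan for o in objects])
        valid = ~np.isnan(sims)
        if not valid.any():
            continue
        # norm: min-max normalize per facet across all objects,
        # so scores of different modalities become comparable
        lo, hi = sims[valid].min(), sims[valid].max()
        norm = (sims - lo) / (hi - lo) if hi > lo else np.ones_like(sims)
        scores[valid] += norm[valid] * facet_weights.get(f, 0.0)
    return scores
```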
5.1 MH Constraints in Astera

Irreducibility: To check irreducibility, we should prove that our graph is connected. By adding the different relations β, γ and α, we obtain a connected graph. To verify this, starting from the top-ranked results for a sample query, we traverse the graph: in each step we visit new neighbours and continue until we see no more new nodes. The number of nodes seen in this traversal was the whole graph size. This observation, even for one query, indicates the connectivity of our graph.

Aperiodicity: The number of steps needed to reach a node from a starting point is not a multiple of a fixed number in our graph. We satisfy this constraint by construction.

Stochastic property: According to the weight definition in Astera for β links, the sum of the weights in a row may be more than one. However, semantic (α) and/or similarity (γ) links can be used in a normalized form, complying with the stochastic property.

Transition function in Astera: According to the Metropolis-Hastings algorithm and Eq. (2), we sample from W(x, y) and accept the move with probability λ(x, y). This determines how we define higher-order transition probabilities after t steps:

    Pr_q^{t+1}(x, y) = Σ_{i=1..k} Pr_q^t(x, z_i) · Pr(z_i, y)

where q is the query, k is the number of common nodes z between x and y, and Pr^t is the transition probability of starting from x and moving t steps further.

Mixing: Walsh divides mixing chains into two categories, poorly mixing and well mixing chains [10]. One usual way to prevent poor mixing is to use the Simulated Annealing method with high jumps. A second option is to start several chains that cover the space. Our model follows the second option, as we start from different starting points according to the standard search result for each facet.

5.2 Role of MH in Adjusting the Weights

In principle, MH either accepts a jumping density W(x, y) unchanged (when λ = 1) and moves forward, or scales the weight by the factor λ < 1, so that the new value of this link for the next step is W(x, y) · λ. According to the stochastic property, the weights of the links leaving a node sum to 1. In each step, when weights are adjusted by MH, this sum may drop below 1: a link is accepted with probability λ < 1, and the decreased mass is given to the node as a self-transition value, indicating that staying in this state is preferred over choosing that specific link. Performed over many steps, this loosens the links to less relevant neighbours and keeps the links to neighbours of increasing relevancy. This way, MH may modify the weights towards a Markov chain that reaches the true probability distribution.

To prevent poor mixing, we start from different starting points according to the standard search result for each facet. These points satisfy the condition π̃(x) > 0, as they come from the scored, ranked result.
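Under our reading of Section 5.2, one weight adjustment can be sketched as follows: every edge is scaled entrywise by λ from Eq. (2), and the mass removed from a row is added to the diagonal as self-transition weight. The helper name is ours.

```python
import numpy as np

def adjust_weights(W, pi_tilde):
    """MH-style adjustment of a row-stochastic weight matrix W (Eqs. 1-2).

    Each edge (x, y) is scaled by lambda(x, y); the probability mass
    removed from a row is added to W[x, x] as self-transition weight,
    so each row still sums to 1 (stochastic property, Section 5.2).
    Assumes pi_tilde > 0 everywhere (the paper floors scores at 0.0001).
    """
    n = len(pi_tilde)
    P = W.copy()
    for x in range(n):
        for y in range(n):
            if x == y or W[x, y] == 0.0:
                continue
            lam = min((pi_tilde[y] * W[y, x]) / (pi_tilde[x] * W[x, y]), 1.0)
            P[x, y] = W[x, y] * lam
        P[x, x] += W[x].sum() - P[x].sum()   # stay-probability for rejected mass
    return P
```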
6. EXPERIMENT DESIGN

We used the ImageCLEF 2011 Wikipedia collection for the image retrieval task. Each image has one metadata file that provides information about its name, location, one or more associated parent documents in up to three languages (EN, DE and FR), and textual image annotations (i.e. caption, description and comment). The collection consists of 125,828 documents and 237,434 images. We parsed the image metadata and created nodes for all parent documents, images and corresponding facets. We created different relation types: the β relation between parent documents and images (as part of the document), and the δ relation between information objects and their facets. We use the 50 English query topics.

6.1 Document and Image Facets

In the first phase of our hybrid search, we use standard indexing results both for documents and images. The computed scores in both modalities are normalized per topic to (0, 1) based on the min-max method. The indexings based on the different facets are:

• Text tf.idf facet: We utilize the default Lucene indexer, based on tf.idf, as the text facet.

• Image textual annotation tf.idf facet (Metadata): We use the metadata information of the images (caption, comment and description) as the image textual facets.

• CEDD facet: For the image visual facet, we selected the Color and Edge Directivity Descriptor (CEDD) feature, since it is considered the best method to extract purely visual results [2].

In the second phase, starting from the standard indexed results, we conduct the graph search based on MH. In this instantiation of Astera, we use only β links between the documents and images. Adding α and δ link types is left to future work.

6.2 Transition Matrix in Astera

To compute the transition matrix Pr, we need to compute λ(x, y) for each pair of neighbouring nodes to update the weights. In this instantiation of Astera with the ImageCLEF 2011 Wikipedia collection, we have image and document node types. The query topics in this collection are multimodal: a combination of keywords and image examples with the facet set {tf.idf, CEDD}. Based on any of these facets, we can start the traversal in the graph. For example, if we start from similarity with metadata tf.idf results, we will have a set of images as starting points for the traversal.

In this instantiation of Astera, an image object I has the two facets {tf.idf, CEDD}. The common set of facets between the query and an image is l = {tf.idf, CEDD}. Each image is connected to at least one parent document D through a β link. To compute Pr(I, D) = W(I, D) · λ(I, D), we need the λ value, which is:

    λ(I, D) = min( (RSV(Q, D) / RSV(Q, I)) · (W(D, I) / W(I, D)), 1 )    (4)

where

    RSV(Q, I) = norm(sim(Q_tf.idf, I_tf.idf)) · w_tf.idf + norm(sim(Q_CEDD, I_CEDD)) · w_CEDD    (5)

and

    RSV(Q, D) = norm(sim(Q_tf.idf, D_tf.idf)) · w_tf.idf    (6)

The RSV value is computed based on the normalized Lucene and LIRE similarity scores for the tf.idf and CEDD facets respectively. w_CEDD and w_tf.idf are the facet weights for this query. For each query, we perform this similarity computation in all three languages, separately for image metadata and documents. We take this value as the relevancy value of each image/document for the specific query. (The code of Astera is open-source and available at http://www.ifs.tuwien.ac.at/~sabetghadam/Astera.html.)

6.3 Experiment Result

We included the text tf.idf and metadata tf.idf facets in this experiment. We start with the top 20 similar documents and images (as activated nodes) based on these facets for each query, and traverse the graph from these 40 touch points, step by step in parallel. In each step, for a node x and its neighbour y, we compute λ(x, y), update the weight and continue to the next neighbour. This is performed in the form of matrix multiplications.

In Markov chain random walks without the MH algorithm, we utilize matrix multiplication to simulate the walk in the graph. The probability distribution after t steps is calculated as a_t = a_0 · W^t, where a_0 holds the starting scores and a_t the scores after t steps. However, leveraging MH, the edge weights are affected by λ (Eq. 1). This is a potential problem for computing the updated transition matrix: in each iteration, the matrix W is affected by λ, which is a min function (W · λ in the first iteration, W · λ · λ in the second). However, Hlynka et al. [4] observed that the transition matrix Pr does not change in further steps. Therefore, we need to compute the matrix Pr(x, y) = W(x, y) · λ(x, y) only once for all nodes, and use this matrix in further multiplications. This makes simulating the MH steps feasible in implementation.

We compute the final score as a_t = a_0 · Pr^t after t steps. This computation is needed for the intermediate steps, since in the ideal case the multiplication is performed many times until the matrix converges, and in the stationary distribution the nodes' probabilities are independent of the starting scores in the graph.
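Following the observation from [4], a sketch of the one-time construction of Pr and the propagation a_t = a_0 · Pr^t; adjust_weights is the hypothetical helper from the Section 5.2 sketch, and the fixed step count (no convergence test) is our simplification.

```python
import numpy as np

def propagate(W, pi_tilde, a0, t):
    """Score propagation a_t = a_0 . Pr^t with the MH-adjusted matrix.

    Per the observation in [4], Pr = W * lambda is computed only once
    and then reused for every multiplication step.
    """
    Pr = adjust_weights(W, pi_tilde)   # sketch from Section 5.2
    a = np.asarray(a0, dtype=float)
    for _ in range(t):                 # intermediate a_t give per-step rankings
        a = a @ Pr
    return a
```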
We compare the results with and without the MH algorithm (Tables 1, 2). We did not get a better result in our preliminary experiment with MH. The reason is the dependency of a jump on the value RSV(y)/RSV(x). The implemented RSV function for images is based on the metadata facet. A large number of images are not retrieved in the Lucene result for the metadata facet: we retrieve on the order of 1,000 images per query, compared to the 237,434 images in the collection. We set a minimum retrieved-score value (0.0001) as the RSV value of visited images that are not in the Lucene results. We have observed that this approach biases a large number of images to a very low score, which we assume to be the cause of the low precision. Further experiments in this direction are needed.

steps  p@10   r@10    p@20   r@20
1      0.297  0.135   0.229  0.158
2      0.297  0.135   0.229  0.158
3      0.252  0.123   0.188  0.138
4      0.224  0.120   0.184  0.134
5      0.206  0.1148  0.173  0.124
6      0.182  0.1104  0.156  0.113
7      0.142  0.106   0.13   0.115

Table 1: Result for documents without image facets, self-transitivity: 0.9, links: δ, β

steps  p@10   r@10    p@20   r@20
1      0.27   0.125   0.151  0.135
2      0.27   0.125   0.151  0.135
3      0.23   0.113   0.148  0.1295
4      0.22   0.1097  0.133  0.1163
5      0.18   0.1091  0.113  0.1163
6      0.17   0.107   0.111  0.109
7      0.14   0.08    0.108  0.087

Table 2: Result for documents without image facets, self-transitivity: 0.9, links: δ, β

7. CONCLUSION AND DISCUSSION

We presented a graph-based model for multimodal IR leveraging the MH algorithm. The graph is enriched by the extracted facets of the information objects. Different modalities are treated equally thanks to the faceted search. We proposed a generic relevancy function based on the facet similarity of objects to the query. Leveraging this model, we have a platform with the potential to investigate the effect of different facets on performance, and the burn-in of the matrix. We have the opportunity to examine query-dependent traversal, as the weights in the graph are affected by the relevancy of the source and target nodes to the query. The preliminary results with MH did not improve over the baseline. Many steps in the graph must be taken until the matrix burns in to the stationary distribution, which is part of our future work. However, this experiment raises some issues to discuss: 1) How much does the final probability distribution depend on the chosen π̃(x)? 2) Is the MH algorithm on graph-based collections an opportunity to compare the effect of different ranking models? 3) How expensive is this approach, given the need for a high number of transitions until the matrix burns in? 4) How do we satisfy the stochastic property in a multimodal graph with heterogeneous relation types? In principle, this property goes beyond mathematically summing the weights to 1; it goes back to the utility of the different modalities as neighbours to the user. The difficulty is whether these neighbours are equally useful to the user.

8. REFERENCES

[1] A. Awan, R. A. Ferreira, S. Jagannathan, and A. Grama. Distributed uniform sampling in unstructured peer-to-peer networks. In HICSS, 2006.
[2] T. Berber, A. H. Vahid, O. Ozturkmenoglu, R. G. Hamed, and A. Alpkocak. DEMIR at ImageCLEFwiki 2011: Evaluating different weighting schemes in information retrieval. In CLEF, 2011.
[3] R. A. Ferreira, M. Krishna Ramanathan, A. Awan, A. Grama, and S. Jagannathan. Search with probabilistic guarantees in unstructured peer-to-peer networks. In P2P, 2005.
[4] M. Hlynka and M. Cylwa. Observations on the Metropolis-Hastings Algorithm. University of Windsor, Department of Mathematics and Statistics, 2009.
[5] W. H. Hsu, L. S. Kennedy, and S.-F. Chang. Video search reranking through random walk over document-level context graph. In MULTIMEDIA, 2007.
[6] Y. Jing and S. Baluja. VisualRank: Applying PageRank to large-scale image search. IEEE Trans. Pattern Anal. Mach. Intell., 2008.
[7] J. Martinet and S. Satoh. An information theoretic approach for automatic document annotation from intermodal analysis. In Workshop on Multimodal Information Retrieval, 2007.
[8] T. Mei, Y. Rui, S. Li, and Q. Tian. Multimedia search reranking: A literature survey. ACM Computing Surveys (CSUR), 2014.
[9] S. Sabetghadam, M. Lupu, and A. Rauber. Astera - a generic model for multimodal information retrieval. In Proc. of the Integrating IR Technologies for Professional Search Workshop, 2013.
[10] B. Walsh. Markov chain Monte Carlo and Gibbs sampling. Lecture notes, 2004.
[11] T. Yao, T. Mei, and C.-W. Ngo. Co-reranking by mutual reinforcement for image search. In CIVR, 2010.
[12] M. Zhong, K. Shen, and J. Seiferas. The convergence-guaranteed random walk and its applications in peer-to-peer networks. IEEE Transactions on Computers, 2008.