=Paper=
{{Paper
|id=None
|storemode=property
|title=Astera - A Generic Model for Semantic Multimodal Information Retrieval
|pdfUrl=https://ceur-ws.org/Vol-968/irps_7.pdf
|volume=Vol-968
}}
==Astera - A Generic Model for Semantic Multimodal Information Retrieval==
Serwah Sabetghadam, Mihai Lupu, Andreas Rauber
Institute of Software Technology and Interactive Systems, Vienna University of Technology
sabetghadam, lupu, rauber@ifs.tuwien.ac.at

===Abstract===
Finding useful information in large multimodal document collections such as the WWW is one of the major challenges of Information Retrieval (IR). The many sources of information now available - text, images, audio, video and more - increase the need for multimodal search. Particularly important is also the recognition that each information item is inherently multimodal (i.e. has aspects of its information character that stem from different modalities) and forms part of a networked set of related information items. In this paper we propose a graph-based model for multimodal information retrieval based on a faceted view of information objects. For retrieval purposes, we consider both relatedness and similarity relations between objects.

===1 Introduction===
Searching for text, images and audio is now common in Web search and digital libraries. When a user searches for a topic using search engines like Bing and Yahoo, the default category is document; if the user aims to search other modalities, such as images or videos, she has to specify this explicitly. However, a user may prefer to see information in different modalities already in the first search, and may have to change modalities and search again to find what is most relevant to her query. Recently we observe a change in this direction at major search engines (i.e. showing a combination of text, image and video on the first page of results whenever it is considered relevant), further demonstrating the need for a true multimodal system. The limitation of current approaches, as observed in these search engines, is the use of essentially one modality to retrieve others (i.e. the use of text features only when retrieving images or videos).

Multimodal IR is generally understood as the combination of text, image, video and sound in information retrieval. In our case, we prefer to generalize this idea and see multimodal IR as based on the notion of facet. This allows considering a document under several points of view, each one associated with a possible feature space. For instance, text documents have primarily a textual facet, but also others such as stylistic/layout facets (covered partially by image features); they may contain images, or have time/versioning aspects (recency of information). Another example is music files, which primarily have audio facets (comprising several actual feature spaces/sub-facets such as melodic, rhythmic, chords, voice) but also other facets such as lyrics (as detected from the audio voice), time, genre, etc.
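To make the faceted view more concrete, the following minimal Python sketch (our own illustration, not part of the original paper) shows one possible way of representing an information object whose facets live in different feature spaces; the class name, facet names and feature values are purely hypothetical.

```python
# Illustrative sketch (not from the paper): an information object carrying
# several facets, each living in its own feature space.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InformationObject:
    obj_id: str
    modality: str                              # e.g. "text", "image", "audio", "video"
    # facet name -> feature vector in that facet's feature space
    facets: Dict[str, List[float]] = field(default_factory=dict)


# A music file seen through several facets (hypothetical feature values).
song = InformationObject(
    obj_id="music:alfama",
    modality="audio",
    facets={
        "rhythmic": [0.12, 0.80, 0.33],        # audio sub-facet
        "melodic": [0.55, 0.10, 0.91],         # audio sub-facet
        "lyrics_bow": [1.0, 0.0, 2.0],         # bag-of-words over detected lyrics
    },
)
```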
Furthermore, going beyond the document itself, in modern IR settings documents are usually not isolated objects: instead, they are frequently connected to other objects, via hyperlinks or metadata [MCYN06]. Information objects are connected to other information objects and provide mutual information on each other, forming a background information model that may be used explicitly. Sometimes this information link is explicit, as related information (e.g. a music file and a singer), resulting in a network of related objects; sometimes it is inherent in the information object, e.g. the similar pitch histograms of two music files.

There are numerous works in recent years addressing different challenges in multimodal IR. Most related work tries to improve result relevance by including different modalities, or focuses on ranking issues. Few have addressed different modalities from the very beginning of the search procedure. In this paper, we propose an integrated model for semantic multimodal information retrieval which considers both related and similar objects in the retrieval procedure. Moreover, we employ a faceted view of information objects that highlights different characteristics of an object, enabling comprehensive and in-depth search.

The rest of the paper is organised as follows: We present the related work in Section 2, followed by the description of our proposed data model in Section 3. We continue with the search procedure in Section 4, and a short summary of the proposed model is provided in Section 5.

===2 Related Work===
There are many efforts in combining textual and visual modalities. Srinivasan and Slaney [SS07] improve their performance by adding content-based information, in addition to image characteristics, as visual information. They use a model based on random walks on bipartite graphs that jointly model images and textual content. The combination of both textual and visual features for cross-language image retrieval is addressed by Cheng et al. [CYK+05], who suggest two interactive retrieval procedures: one incorporates a relevance feedback mechanism based on textual information, while the second combines textual and image information to help users find a target image. Hwang and Grauman have also explored ranking object importance in static images, learning what people mention first from human-annotated tags [HG10].

One idea for query formulation in multimodal IR is to integrate different modalities to initialize the query. Hubert and Mothe [HM09] suggest a combination of ontology browsing and keyword-based querying. Combining these two modes enables users to complement their queries with keywords for which they do not identify corresponding categories.

Considering the graph nature of our data model, we look principally at works in the Semantic Web area. We take advantage of the Semantic Web and introduce the additional feature of similarity checking. Semantic Web search is keyword based, and there are works on generating adequate interpretations of user queries [SAN+11]. In our model, in addition to including keywords, we consider similarity computation when searching for an information object. We generalize the query and provide the user with a list of highly related neighbours, rather than only giving the exact response.

The work most related to our own is the I-Search project, a multimodal search engine [LARD12]. The authors propose a multimodality relation between different modalities of an information object, e.g. a dog image, its sound (barking) and its 3D representation. They define a neighbourhood relation between two multimodal objects which are similar in at least one of their modalities.
However, they do not consider semantic relations between objects (e.g. a dog and a cat object), nor the importance of these relations in answering the user's query.

===3 Graph of Information Objects===
We define a model to represent information objects and their relationships, together with a general framework for computing similarity. As shown in Figure 1, we see the information objects as a graph G = (V, E). Each object in this graph has a number of facets. The object modalities can be text, image, audio or video. For each object, the information can be divided into four categories, corresponding to the different relation types an object holds with its neighbours. We formally define the relation type R(e) of an edge e as taking one of four values, R(e) ∈ {α, β, γ, δ}. These types are described below:

* α: Related; this is the relatedness relation type and is similar to the relations existing in the Semantic Web. For instance, a music file object is related to a singer object.
* β: IsPartOf/HasPart; it is used to express relations between objects which are part of another object, e.g. an image in a document.
* γ: Similar; used to express the similarity between objects of the same modality and the same type, e.g. two music files.
* δ: Inherent/facet relationship; this type covers the different views of an object, e.g. the statistical facet, visual facet, feature facet or genre facet of a piece of music.

Figure 1: A part of the related objects of the Lisbon Story movie as a concrete example. The different types of edges (α, β, γ, δ) are shown between nodes.

An example of mapping this model to a real scenario is shown in Figure 1, concerning the information related to the movie Lisbon Story. As shown, the object LisbonStory page at IMDB has α relations with the Music File (Anida), Music File (Alfama), Andia Lyrics, Alfama Lyrics, Lisbon Story Trailer and Full Movie objects. It has β relations with the Singer Image (Maderedus) and Movie cover image objects, which are the images in the page. Each of these objects has δ relations with its facets, such as the relation between Andia Lyrics and its BOW facet. Moreover, we see γ relations between facets of objects. For instance, the SIFT features of the Full Movie, Lisbon Story Trailer and Movie cover image have γ relations to each other.

===3.1 Weighting in the Graph===
The different types of links described in the previous section may carry different weights. We denote the weight of an edge e as W(e). The value of this weight is between 0 and 1, W(e) ∈ (0, 1]. For the different types of edges, this weight has different interpretations:

* W(e|R(e) = α) ∈ (0, 1]. Since this relation is between two objects of non-homogeneous types, we cannot define a general weight function. As Crestani [Cre97] mentions, there is no default value for edge weights in the spreading activation technique; it is application dependent. Therefore, we assume an initial value of 0.5 for α relations. This may change over time, for instance based on different relevance feedback techniques.
* W(e|R(e) = β) = 1. Since this relation is between two objects that are tightly related (one is a part of the other), the value 1 is assigned. The nodes with β relations are extracted from single-modal or multi-component objects which are inherently multimodal.
* W(e|R(e) = γ) ∈ (0, 1]. This weight is computed by a normalized similarity function between objects of the same type within shared feature spaces. We are aware that normalizing a similarity function is not always obvious, and this is part of the study that this paper starts.
* W(e|R(e) = δ) ∈ (0, 1]. Similarly to the β edges, the δ relationships have an initial value of 1 because they denote an intrinsic part of the node. The δ edges link an object and its facets in potentially different feature spaces.
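As an illustration of this data model, the sketch below (our own, not code from the paper) represents the typed, weighted graph: the relation types α, β, γ, δ as an enumeration, default weights following Section 3.1 (0.5 for α, 1 for β and δ), and γ edges weighted by a normalized cosine similarity. The class names, node identifiers and feature vectors are assumptions made for the example.

```python
# Illustrative sketch of the graph of information objects with typed, weighted
# edges; the weighting follows Section 3.1 (alpha = 0.5 initially, beta = 1,
# gamma = normalized similarity, delta = 1). All identifiers are hypothetical.
import math
from enum import Enum


class Rel(Enum):
    ALPHA = "related"        # semantic relatedness
    BETA = "is_part_of"      # part-whole relation
    GAMMA = "similar"        # similarity within a shared feature space
    DELTA = "facet"          # object-to-facet relation


def cosine_similarity(u, v):
    """Normalized similarity in [0, 1] for non-negative feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


class Graph:
    def __init__(self):
        self.edges = {}      # node_id -> list of (neighbour_id, Rel, weight)

    def add_edge(self, src, dst, rel, weight=None):
        if weight is None:
            # default weights per relation type (Section 3.1)
            weight = {Rel.ALPHA: 0.5, Rel.BETA: 1.0, Rel.DELTA: 1.0}.get(rel)
        self.edges.setdefault(src, []).append((dst, rel, weight))
        self.edges.setdefault(dst, []).append((src, rel, weight))


g = Graph()
g.add_edge("imdb:lisbon_story", "music:alfama", Rel.ALPHA)        # related object
g.add_edge("imdb:lisbon_story", "image:movie_cover", Rel.BETA)    # part of the page
g.add_edge("music:alfama", "facet:alfama_rhythm", Rel.DELTA)      # facet of the song
g.add_edge("music:alfama", "music:ainda", Rel.GAMMA,
           weight=cosine_similarity([0.1, 0.8], [0.2, 0.7]))      # similar song
```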
===3.2 Graph Construction===
In this section we explain how we construct the graph with the different relation types. The nodes with α relations are either generated using information extraction techniques from our dataset or extracted from Linked Data [WA11]. The nodes with β relations are created by extracting inherent objects from multimodal objects, e.g. images and text from a PowerPoint presentation. Nodes with γ relationships are generated by computing similarity measures between objects of the same type. Nodes with δ relationships are created in several ways, for instance by feature extraction or by machine learning, e.g. learning the genre of a music file.

===4 Search Procedure===
We use the spreading activation technique to manage the search procedure, and perform the search on object facets. The weights on the edges act as damping factors, defined as df = 1 − w in Astera; therefore, higher-weighted edges consume less activation energy. After receiving a query, the query facets are extracted. This faceted view of information objects and queries enables us to perform the search on different characteristics of the objects, resulting in faceted search. We hit the graph at N hit points according to the query facets and files. At each hit point, a parallel multimodal search is conducted based on the spreading activation method. Finally, result collections of the different modalities are provided. Our model gives the option of letting the different modalities of the query affect the spreading of the search. For instance, if the query consists of both text and music, then when searching for each of these modalities, links to neighbours of the other modality are prioritized.

Astera is capable of representing different retrieval models such as the vector space model, faceted search or multimodal search. Faceted search is directly covered by the δ relations; the vector space model can be modelled directly via metrics employed on the facets and the γ relations, with further propagation set to 0. Multimodal search can be handled both via facets (δ relations) and via β relations. Semantic search may be modelled by using the α relations. Furthermore, Astera has the potential to answer queries that may not be answerable by the vector space model or semantic search individually, but which require a combination of search techniques.
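As a rough illustration of the search procedure, the sketch below (our own, not from the paper) spreads activation from query hit points over a weighted graph, treating each edge weight w as inducing a damping factor df = 1 − w, so that higher-weighted edges lose less energy. The adjacency structure, firing threshold and number of iterations are assumptions made only for this example.

```python
# Rough sketch of spreading activation over a weighted graph (edge weight w,
# damping factor df = 1 - w). Threshold and iteration count are illustrative
# assumptions, not values from the paper.
def spreading_activation(edges, hit_points, iterations=3, threshold=0.05):
    """edges: node -> list of (neighbour, weight); hit_points: node -> initial energy."""
    activation = dict(hit_points)
    frontier = dict(hit_points)
    for _ in range(iterations):
        next_frontier = {}
        for node, energy in frontier.items():
            for neighbour, weight in edges.get(node, []):
                df = 1.0 - weight                  # damping factor: df = 1 - w
                passed = energy - energy * df      # higher-weighted edges lose less energy
                if passed >= threshold:
                    activation[neighbour] = activation.get(neighbour, 0.0) + passed
                    next_frontier[neighbour] = max(next_frontier.get(neighbour, 0.0), passed)
        frontier = next_frontier
    # rank all touched nodes by accumulated activation; results can then be
    # grouped into collections per modality
    return sorted(activation.items(), key=lambda kv: kv[1], reverse=True)


# Toy graph: a query hits the IMDB page; activation spreads to related objects.
toy_edges = {
    "imdb:lisbon_story": [("music:alfama", 0.5), ("image:movie_cover", 1.0)],
    "music:alfama": [("facet:alfama_rhythm", 1.0)],
}
print(spreading_activation(toy_edges, {"imdb:lisbon_story": 1.0}))
```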
===5 Conclusions===
In this paper we have introduced a model for multimodal IR with two distinguishing characteristics. One is the idea of a faceted view of the inherent information encapsulated in objects, which enables us to extract different characteristics of an object to be included in the search procedure. The second is the consideration of both relatedness and similarity relations between objects in the graph model of information objects. The proposed model is domain independent and can be mapped to different domains.

===References===
[Cre97] F. Crestani. Application of spreading activation techniques in information retrieval. Artificial Intelligence Review, 11(6):453–482, 1997.

[CYK+05] P.C. Cheng, J.Y. Yeh, H.R. Ke, B.C. Chien, and W.P. Yang. Comparison and combination of textual and visual features for interactive cross-language image retrieval. In Multilingual Information Access for Text, Speech and Images, pages 919–919. Springer, 2005.

[HG10] S.J. Hwang and K. Grauman. Accounting for the relative importance of objects in image retrieval. In Proceedings of the British Machine Vision Conference, pages 1–12, 2010.

[HM09] G. Hubert and J. Mothe. An adaptable search engine for multimodal information retrieval. Journal of the American Society for Information Science and Technology, 60:1625–1634, 2009.

[LARD12] M. Lazaridis, A. Axenopoulos, D. Rafailidis, and P. Daras. Multimedia search and retrieval using multimodal annotation propagation and indexing techniques. Signal Processing: Image Communication, 2012.

[MCYN06] E. Minkov, W. Cohen, and A. Y. Ng. Contextual search and name disambiguation in email using graphs. In SIGIR, pages 27–34, 2006.

[SAN+11] S. Shekarpour, S. Auer, A. N. Ngomo, D. Gerber, S. Hellmann, and C. Stadler. Keyword-driven SPARQL query generation leveraging background knowledge. In 2011 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), volume 1, pages 203–210. IEEE, 2011.

[SS07] S. Srinivasan and M. Slaney. A bipartite graph model for associating images and text. In IJCAI-2007 Workshop on Multimodal Information Retrieval, 2007.

[WA11] A. Westerski and C. A. Iglesias. Exploiting structured linked data in enterprise knowledge management systems: An idea management case study. In 15th IEEE International Enterprise Distributed Object Computing Conference Workshops (EDOCW), 2011.