<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Astera - A Generic Model for Semantic Multimodal Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Serwah Sabetghadam</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mihai Lupu</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Rauber</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Software Technology and Interactive Systems, Vienna University of Technology</institution>
        </aff>
      </contrib-group>
      <fpage>35</fpage>
      <lpage>38</lpage>
      <abstract>
        <p>Finding useful information in large multimodal document collections such as the WWW is one of the major challenges of Information Retrieval (IR). The many sources of information now available - text, images, audio, video and more - increase the need for multimodal search. Particularly important is the recognition that each information item is inherently multimodal (i.e. has aspects in its information character that stem from different modalities) and forms part of a networked set of related information items. In this paper we propose a graph-based model for multimodal information retrieval based on a faceted view of information objects. For retrieval purposes, we consider both relatedness and similarity relations between objects.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Furthermore, going beyond the document itself, in modern IR settings documents are usually not isolated objects: they are frequently connected to other objects via hyperlinks or meta-data [MCYN06]. Information objects are connected to other information objects and provide mutual information on each other, forming a background information model that may be used explicitly. Sometimes this link is explicit, as related information (e.g. a music file and a singer), resulting in a network of related objects; sometimes it is inherent in the information object, e.g. the similar pitch histograms of two music files.</p>
      <p>There are numerous works in recent years addressing different challenges in multimodal IR. Most related work tries to improve result relevance by including different modalities, or focuses on ranking issues. Few have addressed different modalities from the very beginning of the search procedure. In this paper, we propose an integrated model for semantic multimodal information retrieval that considers both related and similar objects in the retrieval procedure. Moreover, we employ a faceted view of information objects that exposes the different characteristics of an object, enabling comprehensive and in-depth search.</p>
      <p>The rest of the paper is organised as follows: we present related work in Section 2, followed by the description of our proposed data model in Section 3. We continue with the search procedure in Section 4, and a short summary of the proposed model is provided in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>There are many efforts in combining textual and visual modalities. Srinivasan and Slaney [SS07] improve performance by adding content-based information retrieval, in addition to image characteristics, as visual information. They use a model based on random walks on bipartite graphs that jointly models images and textual content.</p>
      <p>The combination of textual and visual features for cross-language image retrieval is addressed by Cheng et al. [CYK+05], who suggest two interactive retrieval procedures. One incorporates a relevance feedback mechanism based on textual information, while the second combines textual and image information to help users find a target image. Hwang and Grauman have also explored ranking object importance in static images, learning what people mention first from human-annotated tags [HG10].</p>
      <p>One idea for query formulation in multimodal IR is to integrate different modalities to initialize the query. Hubert and Mothe [HM09] suggest a combination of ontology browsing and keyword-based querying. Combining these two modes enables users to complement their queries with keywords for which they do not identify corresponding categories.</p>
      <p>Considering the graph nature of our data model, we look principally at work in the semantic web area. We take advantage of semantic web relations and introduce the additional feature of similarity checking. Semantic web search is keyword-based, and there are works on generating adequate interpretations of user queries [SAN+11]. In our model, in addition to including keywords, we consider similarity computation when searching for an information object. We generalize the query and provide a list of highly related neighbours to the user, rather than only giving the exact response.</p>
      <p>The work most related to our own is the I-Search project, a multimodal search engine [LARD12]. The authors propose a multimodality relation between different modalities of an information object, e.g. a dog image, its sound (barking) and its 3D representation. They define a neighbourhood relation between two multimodal objects that are similar in at least one of their modalities. However, they consider neither the semantic relations between objects (e.g. a dog and a cat object), nor the importance of these relations in answering the user’s query.</p>
    </sec>
    <sec id="sec-3">
      <title>Graph of Information Objects</title>
      <p>We define a model to represent information objects and their relationships, together with a general framework for computing similarity. As shown in Figure 1, we see the information objects as a graph G = (V, E). Each object in this graph has a number of facets. The object modalities can be text, image, audio or video.</p>
      <p>For each object, the information can be divided into four categories that apply to the different relation types an object holds with its neighbours. We formally define the relation type R(e) of an edge e, taking one of the four values R(e) ∈ {α, β, γ, δ}. These types are described below:
• α: Related; this is the relatedness relation type and is similar to the relations existing in the Semantic Web. For instance, a music file object is related to a singer object.
• β: IsPartOf/HasPart; used to show the relation between objects that are part of another object, e.g. an image in a document.
• γ: Similar; used to show the similarity between objects of the same modality and the same type, e.g. two music files.
• δ: Inherent/facet relationship; this type comprises the different views of an object, e.g. the statistical facet, visual facet, feature facet or genre facet of a piece of music.</p>
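      <p>As an illustration only (the class and identifiers below are our own and not part of the model's specification), the four relation types can be represented as typed, weighted edges in a small graph structure:</p>

```python
from collections import defaultdict

# Relation types from Section 3: alpha (related), beta (part-of),
# gamma (similar), delta (facet).
ALPHA, BETA, GAMMA, DELTA = "alpha", "beta", "gamma", "delta"

class InformationGraph:
    """A hypothetical sketch of the graph G = (V, E) of information objects."""

    def __init__(self):
        # node -> list of (neighbour, relation type, weight W(e))
        self.edges = defaultdict(list)

    def add_edge(self, u, v, rel_type, weight):
        # Store edges in both directions, so that a search can
        # spread from either endpoint.
        self.edges[u].append((v, rel_type, weight))
        self.edges[v].append((u, rel_type, weight))

    def neighbours(self, node, rel_type=None):
        """All neighbours of a node, optionally filtered by relation type."""
        return [(v, t, w) for (v, t, w) in self.edges[node]
                if rel_type is None or t == rel_type]

g = InformationGraph()
g.add_edge("LisbonStory@IMDB", "MusicFile(Alfama)", ALPHA, 0.5)
g.add_edge("MusicFile(Alfama)", "PitchHistogramFacet", DELTA, 1.0)
```

      <p>Filtering by relation type then yields, for example, only the δ (facet) neighbours of a music file object.</p>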
      <p>An example of mapping this model to a real case is shown in Figure 1. It covers the information related to the movie Lisbon Story. As shown, the LisbonStory page at IMDB object has α relations with the Music File (Ainda), Music File (Alfama), Ainda Lyrics, Alfama Lyrics, Lisbon Story Trailer and Full Movie objects. It has β relations with the Singer Image (Madredeus) and Movie cover image objects, which are the images in the page. Each of these objects has δ relations with its facets, like the relation of Ainda Lyrics and its BOW facet. Moreover, we see γ relations between facets of objects. For instance, the SIFT features of the Full Movie, Lisbon Story Trailer and Movie cover image have γ relations to each other.</p>
      <sec id="sec-3-1">
        <title>Weighting in the Graph</title>
        <p>The different types of links described in the previous section may carry different weights. We denote the weight of an edge e as W(e). The value of this weight lies between 0 and 1, W(e) ∈ (0, 1]. For different types of edges, this weight has different interpretations:
• W(e|R(e) = α) ∈ (0, 1]. Since this relation is between two objects of non-homogeneous types, we cannot define a weight function. As Crestani [Cre97] mentions, there is no default value for edge weights in the spreading activation technique and it is application dependent. Therefore, we assume an initial value of 0.5 for α relations. This may change over time, for instance based on different relevance feedback techniques.
• W(e|R(e) = β) = 1. Since this relation is between two objects that are tightly related, one being a part of the other, the value 1 is assigned. The nodes with β relations are extracted from single-modal or multi-component objects which are inherently multimodal.
• W(e|R(e) = γ) ∈ (0, 1]. This weight is computed by a normalized similarity function between objects of the same type within shared feature spaces. We are aware that normalizing a similarity function is not always obvious, and this is part of the study that this paper starts.
• W(e|R(e) = δ) ∈ (0, 1]. Similarly to the β edges, the δ relationships have an initial value of 1 because they denote an intrinsic part of the node. The δ edges link an object and its facets in potentially different feature spaces.</p>
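        <p>These weighting conventions can be sketched as a single helper (a hypothetical function of our own; the model itself prescribes only the values):</p>

```python
def edge_weight(rel_type, similarity=None):
    """Initial edge weight W(e) per relation type (Section 3.1 conventions)."""
    if rel_type == "alpha":
        # No principled default exists for relatedness edges, so we
        # start at 0.5; relevance feedback may adjust this over time.
        return 0.5
    if rel_type in ("beta", "delta"):
        # Part-of and facet links are intrinsic to the object: weight 1.
        return 1.0
    if rel_type == "gamma":
        # Similarity edges carry a normalized similarity score in (0, 1].
        if similarity is None or similarity <= 0.0 or similarity > 1.0:
            raise ValueError("gamma edges need a normalized similarity in (0, 1]")
        return similarity
    raise ValueError("unknown relation type: " + rel_type)
```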
      </sec>
      <sec id="sec-3-2">
        <title>Graph Construction</title>
        <p>In this section we explain how we construct the graph with the different relation types. The nodes with α relations are either generated using information extraction techniques from our dataset or extracted from Linked Data [WA11]. The nodes with β relations are created by extracting inherent objects from multimodal objects, e.g. images and text from a PowerPoint presentation. Nodes with γ relations are generated by computing similarity measures between objects of the same type. Nodes with δ relations are created in several ways, for instance by feature extraction or by machine learning, e.g. to learn the genre of a music file.</p>
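        <p>For instance, a γ edge between two same-type objects can be derived from a normalized similarity over a shared feature space. A toy illustration, with cosine similarity as our own choice of measure (the model does not fix a particular one):</p>

```python
import math

def cosine_similarity(a, b):
    """Normalized similarity of two equal-length feature vectors; lies in
    [0, 1] for non-negative features such as histogram bins or term counts."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Two pitch histograms of music files: a high score would justify
# adding a gamma (similarity) edge between the two objects.
sim = cosine_similarity([4, 2, 0, 1], [4, 2, 1, 1])
```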
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Search Procedure</title>
      <p>We use the spreading activation technique to manage the search procedure, and perform the search on object facets. The weights on edges act as damping factors, defined as df = 1 − w in Astera; therefore, higher-weighted edges consume less activation energy. After receiving a query, the query facets are extracted. This faceted view of the information objects and of the query enables us to perform search on different characteristics of the objects, resulting in faceted search. We hit the graph at N hit points according to the query facets and files. From each hit point, a parallel multimodal search is conducted based on the spreading activation method. Finally, result collections of different modalities are provided. Our model offers the option of letting the different modalities of the query affect the spreading of the search. For instance, if the query consists of both text and music, then in searching for each of these modalities, links to neighbours of the other modality are prioritized.</p>
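      <p>A minimal, self-contained sketch of this spreading step (our own simplification: one energy pass per hop, where an edge of weight w passes on a w-fraction of the incoming energy, i.e. the damping factor df = 1 − w is consumed):</p>

```python
def spread(edges, seeds, threshold=0.05, max_hops=3):
    """edges: node -> [(neighbour, weight)]; seeds: hit points -> initial energy."""
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(max_hops):
        passed_on = {}
        for node, energy in frontier.items():
            for neighbour, w in edges.get(node, []):
                # Damping factor df = 1 - w: higher-weighted edges
                # consume less of the incoming activation energy.
                passed = energy * w
                if passed >= threshold:
                    passed_on[neighbour] = passed_on.get(neighbour, 0.0) + passed
        for node, e in passed_on.items():
            activation[node] = activation.get(node, 0.0) + e
        frontier = passed_on
        if not frontier:
            break
    # Rank candidate result objects by accumulated activation.
    return sorted(activation.items(), key=lambda kv: kv[1], reverse=True)
```

      <p>On a toy graph, a strongly weighted edge propagates most of the query's energy to its neighbour, while weakly weighted edges quickly fall below the firing threshold.</p>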
      <p>Astera is capable of representing different retrieval models such as vector space, faceted search or multimodal search. Faceted search is directly covered by the δ relations; the vector space model can be modelled directly via metrics employed on the facets and the γ relations, with further propagation being set to 0. Multimodal search can be handled both via facets (δ relations) and via β relations. Semantic search may be modelled using the α relations. Furthermore, Astera has the potential to answer queries that may not be answerable by the VSM or semantic search individually, but which require a combination of search techniques.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>In this paper we have introduced a model for multimodal IR with two distinguishing characteristics. The first is the idea of a faceted view of the inherent information encapsulated in objects, which enables us to extract different characteristics of an object to be included in the search procedure. The second is the consideration of both relatedness and similarity relations between objects in the graph model of information objects. The proposed model is domain independent and can be mapped to different domains.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Cre97]
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          .
          <article-title>Application of spreading activation techniques in information retrieval</article-title>
          .
          <source>Artificial Intelligence Review</source>
          ,
          <volume>11</volume>
          (
          <issue>6</issue>
          ):
          <fpage>453</fpage>
          -
          <lpage>482</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [CYK+05] P.C. Cheng, J.Y. Yeh,
          <string-name>
            <given-names>H.R.</given-names>
            <surname>Ke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.C.</given-names>
            <surname>Chien</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.P.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>Comparison and combination of textual and visual features for interactive cross-language image retrieval</article-title>
          .
          <source>In Multilingual Information Access for Text, Speech and Images</source>
          , pages
          <fpage>919</fpage>
          -
          <lpage>919</lpage>
          . Springer,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [HG10]
          <string-name>
            <given-names>S.J.</given-names>
            <surname>Hwang</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Grauman</surname>
          </string-name>
          .
          <article-title>Accounting for the relative importance of objects in image retrieval</article-title>
          .
          <source>In Proceedings of the British Machine Vision Conference</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [HM09]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hubert</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Mothe</surname>
          </string-name>
          .
          <article-title>An adaptable search engine for multimodal information retrieval</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          , volume
          <volume>60</volume>
          , pages
          <fpage>1625</fpage>
          -
          <lpage>1634</lpage>
          . Wiley Online Library,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [LARD12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lazaridis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Axenopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rafailidis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Daras</surname>
          </string-name>
          .
          <article-title>Multimedia search and retrieval using multimodal annotation propagation and indexing techniques</article-title>
          .
          <source>Signal Processing: Image Communication</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [MCYN06]
          <string-name>
            <given-names>E.</given-names>
            <surname>Minkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Cohen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          .
          <article-title>Contextual search and name disambiguation in email using graphs</article-title>
          .
          <source>In SIGIR</source>
          , pages
          <fpage>27</fpage>
          -
          <lpage>34</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [SAN+11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shekarpour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>AN</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gerber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Stadler</surname>
          </string-name>
          .
          <article-title>Keyword-driven SPARQL query generation leveraging background knowledge</article-title>
          .
          <source>In Web Intelligence and Intelligent Agent Technology (WI-IAT)</source>
          ,
          <source>2011 IEEE/WIC/ACM International Conference on Web Intelligence</source>
          , volume
          <volume>1</volume>
          , pages
          <fpage>203</fpage>
          -
          <lpage>210</lpage>
          . IEEE,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [SS07]
          <string-name>
            <given-names>S.</given-names>
            <surname>Srinivasan</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Slaney</surname>
          </string-name>
          .
          <article-title>A bipartite graph model for associating images and text</article-title>
          .
          <source>In IJCAI-2007 Workshop on Multimodal Information Retrieval</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [WA11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Westerski</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Iglesias</surname>
          </string-name>
          .
          <article-title>Exploiting structured linked data in enterprise knowledge management systems: An idea management case study</article-title>
          .
          <source>In Enterprise Distributed Object Computing Conference Workshops (EDOCW)</source>
          ,
          <source>15th IEEE International</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>