<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the Automated Story Illustration Task at FIRE 2015</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Debasis Ganguly</string-name>
          <email>dganguly@computing.dcu.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iacer Calixto</string-name>
          <email>icalixto@computing.dcu.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gareth Jones</string-name>
          <email>gjones@computing.dcu.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ADAPT Centre, School of Computing, Dublin City University</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>63</fpage>
      <lpage>66</lpage>
      <abstract>
        <p>In this paper, we present an overview of the shared task (track) carried out as part of the Forum for Information Retrieval Evaluation (FIRE) 2015 workshop. The objective of this task is to illustrate a passage of text automatically by retrieving a set of images and inserting them at appropriate places in the text. In particular, for this track, the text to be illustrated is a set of short stories (fables) for children. The research challenges for participants developing an automated story illustration system include devising techniques to automatically extract the concepts to be illustrated from the full story text, exploring how to use these extracted concepts as query representations in order to retrieve a ranked list of images per query, and finally investigating how to merge the ranked lists obtained for the individual concepts into a single ranked list of candidate relevant images per story. In addition to reporting an overview of the approaches undertaken by the two participating groups who submitted runs for this task, we also report two of our own baseline approaches for tackling the problem of automated story illustration.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Document expansion, in addition to inserting text and
hyperlinks, can also involve adding non-textual content, such
as images that are topically related to the document text, in
order to enhance readability. For example, in
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Wikipedia articles are augmented with images retrieved
from the Kirklees image archive, where automatically
extracted key concepts from the Wikipedia text passages were used
to formulate the queries for retrieving the images. Such
automatic augmentation of documents can be useful for
various purposes, for example enhancing the readability of text for
children, enabling them to engage with and learn from the
content more, or making it easier for medical students to learn
about a disease or its symptoms by looking at related
images.
      </p>
      <p>The aim of the work reported in this paper is to build
a dataset for evaluating the effectiveness of automated
approaches to document expansion with images. In particular,
the problem that we address in this paper is that of
augmenting the text of children's short stories (e.g. fairy tales and
fables) with images, in order to help improve the
readability of the stories for small children, following the adage
that "a picture is worth a thousand words"
(http://en.wikipedia.org/wiki/A_picture_is_worth_a_thousand_words). The "document
expansion with images" methodologies, developed and
evaluated on this dataset, can also be applied to augment other
types of text documents, such as news articles, blogs, etc.</p>
      <p>
        The illustration of children's stories is a particular
instance of the general problem of automatic text illustration,
an inherently multimodal problem that involves image
processing and natural language processing. A related problem
to automatic text illustration is the automatic
generation of textual image descriptions. This problem is
under active research and has drawn significant
interest in recent years [
        <xref ref-type="bibr" rid="ref2 ref4 ref7 ref8">2, 7, 4, 8</xref>
        ].
      </p>
      <p>The rest of the paper is organized as follows. In Section
2, we present a brief overview of the task objectives. In
Section 3, we describe how the dataset (queries and relevance
judgments) is constructed. Section 4 describes our own
initial experiments, conducted to obtain baselines on the
constructed dataset. Section 5 provides a brief overview of
the approaches undertaken by the participating groups and
presents the official results. Finally, Section 6 concludes the
paper with directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. TASK DESCRIPTION</title>
      <p>In order to share with researchers a dataset for text
augmentation with images, and to encourage them to use
this dataset for research purposes, we organized a shared
task, named "Automated Story Illustration"
(http://srv-cngl.computing.dcu.ie/StoryIllustrationFireTask/), as a part of
the Forum for Information Retrieval Evaluation (FIRE)
2015 workshop (http://fire.irsi.res.in/fire/). The goal of this task is to automatically
illustrate children's short stories by retrieving a set of images
that can be considered relevant for illustrating the concepts
(agents, events and actions) of a given story.</p>
      <p>
        In contrast to the standard keyword-based ad-hoc search
for images [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], there are no explicitly user-formulated
keyword-based queries in this task. Instead, each text passage
acts as an implicit query for which images need to be retrieved
to augment it. To illustrate the task output with an
example, consider the story "The Ant and the Grasshopper"
shown in Figure 1. In the text we underline the key concepts
that are likely to be used to formulate queries for illustrating
the story.</p>
      <p>In a field one summer's day a Grasshopper was hopping about, chirping and
singing to its heart's content. An Ant passed by, bearing along with great toil
an ear of corn he was taking to the nest. "Why not come and chat with me," said
the Grasshopper, "instead of toiling and moiling in that way?" "I am helping
to lay up food for the winter," said the Ant, "and recommend you to do the
same." "Why bother about winter?" said the Grasshopper; "we have got plenty
of food at present." But the Ant went on its way and continued its toil. When
the winter came the Grasshopper had no food, and found itself dying of hunger,
while it saw the ants distributing every day corn and grain from the stores they
had collected in the summer. Then the Grasshopper knew: "IT IS BEST TO
PREPARE FOR THE DAYS OF NECESSITY."</p>
      <p>Additionally, in Figure 1 we show a set of manually collected
images obtained from the results of Google image search
(https://images.google.com/) executed with each of these
underlined phrases as queries. It can be seen that the story
with these sample images is likely to be more appealing to a
child than the plain raw text. This is because, with the
accompanying images, children can potentially relate to the
concepts described in the text; e.g. the top-left image shows
a child what a "summer day's field" looks like.</p>
    </sec>
    <sec id="sec-3">
      <title>3. DATASET DESCRIPTION</title>
      <p>It is worth mentioning that we use Google image search in
our example of Figure 1 for illustrative purposes only.
However, in order to achieve a fair comparison between
automated approaches to the story illustration task, it is
imperative to build a dataset comprising a static document
collection, a set of test queries (text from the stories), and
relevance assessments for each story.</p>
      <p>
        The static image collection that we use for this task is the
ImageCLEF 2010 Wikipedia image collection [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
For the queries, we use popular children's fairy tales, since
most of them are in the public domain and freely
distributable. In particular, we make use of 22 short stories
collected from "Aesop's Fables" (https://en.wikipedia.org/wiki/Aesop).
      </p>
      <p>The first research challenge for an automated story
illustration approach is to extract the key concepts from the text
passages in order to formulate suitable queries for retrieving
relevant images; e.g. an automated approach should extract
"summer day field" as a meaningful unit for illustration. The
second research challenge is to make use of these extracted
concepts or phrases to construct queries and perform
retrieval from the collection of images, which in this case is
the ImageCLEF collection.</p>
        <p>In order to allow participants to concentrate on
retrieval only, we manually annotated the short stories with
concepts that are likely to require illustration. The volunteers
undertaking the annotation task were instructed to
highlight parts of the stories that they felt would be better
understood by children with the help of illustrative images. In
total, five volunteers annotated the 22 stories: three
annotated 4 stories each and the remaining two annotated 5 each. Each
story was annotated by a single annotator only.</p>
        <p>Participants who wish to automatically extract
the concepts from a story for the purpose of illustration were
encouraged to develop automated approaches and then
compare their results against those obtained with the manual
annotations. A participating system may use shallow natural language
processing (NLP) techniques, such as named entity recognition
and chunking, to first identify individual query concepts and
then retrieve candidate images for each of these, as sketched
below. Another approach may be to use the entire text as the
query and then cluster the result list of documents to identify
the individual query components.</p>
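        <p>As a rough, minimal sketch of the first style of approach (ours, for
illustration only, and not any participant's actual system), candidate query
concepts can be extracted as noun phrases with an off-the-shelf NLP toolkit.
The use of spaCy, its en_core_web_sm model, and the cut-off of 10 concepts are
assumptions made purely for this example.</p>
        <preformat>
# Sketch: extract candidate illustration concepts from a story as noun phrases.
# Assumes spaCy and its small English model are installed:
#   pip install spacy; python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_concepts(story_text, max_concepts=10):
    """Return up to max_concepts noun-phrase strings to use as image queries."""
    doc = nlp(story_text)
    seen, concepts = set(), []
    for chunk in doc.noun_chunks:
        phrase = chunk.text.lower().strip()
        # skip duplicates and pronoun chunks such as "it" or "me"
        if phrase in seen or chunk.root.pos_ == "PRON":
            continue
        seen.add(phrase)
        concepts.append(phrase)
    return concepts[:max_concepts]

print(extract_concepts("In a field one summer's day a Grasshopper was "
                       "hopping about, chirping and singing."))
        </preformat>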
        <p>An important component of an information retrieval (IR)
dataset is the set of relevance assessments for each query. To
obtain the set of relevant images for each story, we
undertake the standard IR pooling procedure, in which a pool of
documents, i.e. the set of top-ranked documents from
retrieval systems with different settings, is assessed manually
for relevance. The relevance judgements for our dataset were
obtained as follows.</p>
        <p>Firstly, in order to be able to search for images with
ad-hoc keywords, we indexed the ImageCLEF collection. In
particular, the text extracted from the caption of each image
in the ImageCLEF collection was indexed as a retrievable
document. The collection was indexed with Lucene
(https://lucene.apache.org/), an open source IR system written in Java.</p>
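        <p>The indexing step can be pictured with the following minimal sketch.
The actual setup used Lucene, so this pure-Python inverted index over
caption text is only an illustration of the idea; the tokenizer and the
identifiers used are assumptions for the example.</p>
        <preformat>
# Sketch: a minimal inverted index over image caption text.
# The track itself used Lucene; this only illustrates the indexing step.
import re
from collections import defaultdict, Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

class CaptionIndex:
    def __init__(self):
        self.postings = defaultdict(dict)  # term to {image_id: term frequency}
        self.doc_len = {}                  # image_id to caption length in tokens

    def add(self, image_id, caption):
        tf = Counter(tokenize(caption))
        self.doc_len[image_id] = sum(tf.values())
        for term, freq in tf.items():
            self.postings[term][image_id] = freq

index = CaptionIndex()
index.add("img_001", "An ant carrying an ear of corn across a field")
index.add("img_002", "A grasshopper sitting on a blade of grass in summer")
        </preformat>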
        <p>Secondly, we use each manually annotated concept
as an individual query that is executed against the indexed
ImageCLEF collection. To construct the pool, we obtain
runs with different retrieval models, namely BM25, LM and
tf-idf, with default parameter settings in Lucene, and finally
fuse the ranked lists with the standard CombSUM merging
technique.</p>
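        <p>The CombSUM fusion step can be sketched as follows: scores from the
individual runs are summed per image, here after a per-run min-max
normalisation. The normalisation choice and the pool depth parameter are
assumptions made only for this example.</p>
        <preformat>
# Sketch: CombSUM fusion of ranked lists produced by different retrieval models.
# Each run is a dict mapping image_id to retrieval score for one query.
from collections import defaultdict

def combsum(runs, pool_depth=20):
    fused = defaultdict(float)
    for run in runs:
        lo, hi = min(run.values()), max(run.values())
        span = (hi - lo) or 1.0
        for image_id, score in run.items():
            # min-max normalise each run before summing (one common choice)
            fused[image_id] += (score - lo) / span
    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:pool_depth]  # the top documents go into the assessment pool

# pool = combsum([bm25_run, lm_run, tfidf_run], pool_depth=20)
        </preformat>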
        <p>Finally, the top 20 documents from this fused ranked list were
assessed for relevance. The relevance assessment for
each manually annotated concept of each story was
conducted by the same volunteer who created the annotation
in the first place. This ensured that the assessors had
a clear understanding of the relevance criteria. The
assessors were asked to assign relevance on a five-point scale,
ranging from absolutely non-relevant to highly relevant.</p>
    </sec>
    <sec id="sec-4">
      <title>4. OUR BASELINES</title>
      <p>
        In this section, we describe some initial experiments that
we conducted on our dataset, which are meant to act as baselines for
future work. As our first baseline, we simply
use all the words in a story to form a query. We then use
this query to retrieve a list of images based on the
similarity of the query with the caption texts of the images
in the index. The retrieval model that we use is the LM
with Jelinek-Mercer smoothing [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. As a second baseline, we
still use all the words in the story, but this time weight each
query term by its tf-idf score. It is worth mentioning here
that the two baselines are deliberately simple, because
our intention is to see how well simple methods perform
before attempting to apply more involved approaches to
this task.</p>
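      <p>A minimal sketch of the retrieval model behind the first baseline is
given below: query-likelihood scoring with Jelinek-Mercer smoothing over the
caption index. The smoothing parameter value (0.4) and the function names are
assumptions made only for illustration.</p>
      <preformat>
# Sketch: query-likelihood (LM) scoring with Jelinek-Mercer smoothing, i.e.
# score(q, d) = sum_w log( lam * P(w|d) + (1 - lam) * P(w|C) ).
import math

def lm_jm_score(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.4):
    score = 0.0
    for w in query_terms:
        p_doc = doc_tf.get(w, 0) / doc_len if doc_len else 0.0
        p_coll = coll_tf.get(w, 0) / coll_len
        p = lam * p_doc + (1.0 - lam) * p_coll
        if p > 0.0:
            score += math.log(p)
    return score
      </preformat>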
        <p>From Table 1, we observe that simply using all terms of a
story as a query to retrieve a ranked list of images does not
produce satisfactory results. In contrast, even the very simple
approach of weighting the terms of the story by
their tf-idf weights produces a significant improvement
in the results. We believe that shallow NLP techniques for
extracting useful concepts could further improve the results.</p>
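      <p>The term weighting used by the second baseline can be sketched as
follows: each story term receives a tf-idf weight computed against the caption
index, and the highest-weighted terms form the query. The cut-off of 30 query
terms is an assumption for the example, not the setting used in our runs.</p>
      <preformat>
# Sketch: weight the story's terms by tf-idf over the caption index and keep
# the highest-weighted ones as the query (roughly the second baseline).
import math
from collections import Counter

def tfidf_query(story_tokens, postings, num_docs, keep=30):
    tf = Counter(story_tokens)
    weights = {}
    for term, freq in tf.items():
        df = len(postings.get(term, {}))  # document frequency in the caption index
        if df == 0:
            continue
        weights[term] = freq * math.log(num_docs / df)
    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return top[:keep]
      </preformat>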
    </sec>
    <sec id="sec-5">
      <title>5. SUBMITTED RUNS</title>
      <p>Two participating groups submitted runs for this task.
Details of each group are shown in Table 2. The
first group (Group 1) employed a word-embedding-based
approach that expands the annotated concepts of each story
to formulate a query and retrieve a ranked list of images.
Only the text of the image captions was used for computing
similarities with the queries; the similarity function
employed was tf-idf. The second group (Group 2) used Terrier
for indexing the ImageCLEF 2010 collection. For retrieval,
they applied the Divergence from Randomness (DFR)
similarity function of Terrier.</p>
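      <p>As a rough sketch of the word-embedding expansion idea (and not Group
1's actual implementation), an annotated concept could be expanded with its
nearest neighbours in a pre-trained word2vec space before retrieval. The gensim
calls used are standard, but the model file name and the neighbourhood size are
assumptions for the example.</p>
      <preformat>
# Sketch: expand an annotated concept with word-embedding neighbours
# (illustrative only; not the participants' actual code).
from gensim.models import KeyedVectors

# any pre-trained word2vec-format model; the file name here is a placeholder
vectors = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

def expand_concept(concept, topn=3):
    expanded = list(concept.split())
    for word in concept.split():
        if word in vectors:
            expanded += [w for w, _ in vectors.most_similar(word, topn=topn)]
    return expanded

# The expanded terms depend entirely on the embedding model that is loaded.
# expanded = expand_concept("grasshopper singing")
      </preformat>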
      <p>Table 3 shows the official results computed on the runs
submitted by the two participating groups. Each
participating group was allowed to submit up to three runs. While Group 1
submitted only one run, Group 2 submitted three.
It can be seen that the run submitted by Group 1
comprises a higher number of retrieved documents (6405)
than the submitted runs of Group 2 (about 100). Due to the
higher average number of retrieved images per
story for Group 1 (6405/22, i.e. about 291) compared to Group 2
(100/22, i.e. about 4.5), Group 1 achieves higher recall and MAP
(compare the #relret and MAP values in Table 3). However,
the submitted runs from Group 2 score higher on
precision; compare, for example, the MRR and P@5 values between
the runs of the two groups.</p>
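      <p>The precision-versus-recall contrast discussed above can be made
concrete with the usual per-query definitions. The sketch below computes P@5,
the reciprocal rank and average precision for a single story from a ranked list
of image ids and the set of relevant ids; the MAP and MRR values in Table 3 are
such values averaged over the 22 stories.</p>
      <preformat>
# Sketch: standard per-query evaluation measures (as reported in Table 3).
def precision_at_k(ranked_ids, relevant, k=5):
    top = ranked_ids[:k]
    return sum(1 for d in top if d in relevant) / float(k)

def reciprocal_rank(ranked_ids, relevant):
    for rank, d in enumerate(ranked_ids, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked_ids, relevant):
    hits, ap = 0, 0.0
    for rank, d in enumerate(ranked_ids, start=1):
        if d in relevant:
            hits += 1
            ap += hits / float(rank)
    return ap / len(relevant) if relevant else 0.0
      </preformat>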
      <p>A comparison of the official results and our own baselines
(see Tables 3 and 1) shows that none of the submitted runs
was able to outperform the simple baseline approaches that
we experimented with. More investigation is required
before commenting on this observation, which we leave for future
work.</p>
    </sec>
    <sec id="sec-6">
      <title>6. CONCLUSIONS AND FUTURE WORK</title>
      <p>In this paper, we have described the construction of a dataset for
the purpose of evaluating automated approaches to
document augmentation with images. In particular, we address
the problem of automatically illustrating children's stories.
Our dataset comprises 22 children's stories as
the set of queries and uses the ImageCLEF document
collection as the set of retrievable images. The dataset also
comprises manually annotated concepts in each story, which
can potentially be used as queries to retrieve a collection
of relevant images for each story. In fact, the retrieval
results obtained with the manual annotations can act as strong
baselines against which to compare approaches that automatically
extract the concepts from a story. The dataset
contains the relevance assessments for each story, obtained by
pooling to a depth of 20.</p>
      <p>Our initial experiments suggest that the dataset can be
used to compare and evaluate various approaches to
automated augmentation of documents with images. We
demonstrate that tf-idf based weighting of query terms
can prove useful in improving retrieval effectiveness, thus
leaving open future directions of research on
effective query representation for this task.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          B. Caputo, H. Müller, J. Martínez-Gómez, M. Villegas, B. Acar,
          N. Patricia, N. B. Marvasti, S. Üsküdarlı, R. Paredes, M. Cazorla,
          I. García-Varea, and V. Morell. ImageCLEF 2014: Overview and analysis
          of the results. In Information Access Evaluation. Multilinguality,
          Multimodality, and Interaction - 5th International Conference of the
          CLEF Initiative, CLEF 2014, Sheffield, UK, September 15-18, 2014,
          Proceedings, pages 192-211, 2014.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          Y. Feng and M. Lapata. Topic models for image annotation and text
          illustration. In Human Language Technologies: The 2010 Annual
          Conference of the North American Chapter of the Association for
          Computational Linguistics, HLT '10, pages 831-839, Stroudsburg, PA,
          USA, 2010. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          M. M. Hall, P. D. Clough, O. L. de Lacalle, A. Soroa, and E. Agirre.
          Enabling the discovery of digital cultural heritage objects through
          Wikipedia. In Proceedings of the 6th Workshop on Language Technology
          for Cultural Heritage, Social Sciences, and Humanities, LaTeCH '12,
          2012.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for
          generating image descriptions. CoRR, abs/1412.2306, 2014.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          J. M. Ponte and W. B. Croft. A language modeling approach to
          information retrieval. In SIGIR, pages 275-281. ACM, 1998.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          A. Popescu, T. Tsikrika, and J. Kludas. Overview of the Wikipedia
          retrieval task at ImageCLEF 2010. In M. Braschler, D. Harman, and
          E. Pianta, editors, CLEF (Notebook Papers/LABs/Workshops), 2010.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A
          neural image caption generator. CoRR, abs/1411.4555, 2014.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov,
          R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image
          caption generation with visual attention. CoRR, abs/1502.03044, 2015.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>