<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anshita Talsania</string-name>
          <email>anshita.talsania@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sandip Modha</string-name>
          <email>sjmodha@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amit Ganatra</string-name>
          <email>amitganatra.ce@charusat.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Charotar University of Science and Technology (Charusat)</institution>
          ,
          <addr-line>Anand</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>L.D.R.P. College</institution>
          ,
          <addr-line>Gandhinagar</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <fpage>71</fpage>
      <lpage>73</lpage>
      <abstract>
        <p>Automated Story Illustration is a task under FIRE 2015, to be organized at DAIICT, Gandhinagar. Participants are required to illustrate stories automatically by retrieving a set of images from an image dataset and, where applicable, identifying which concepts and events in the text should be illustrated. This paper gives an overview of the task and of our approach, including the model and the tool used to carry it out.</p>
      </abstract>
      <kwd-group>
        <kwd>Automated Story Illustrator</kwd>
        <kwd>Illustrated story</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The “Automated Story Illustration” task requires illustrating
stories automatically by retrieving a set of images from an
image dataset. The key subtasks are:
a. Identifying the concepts and events in the text that should be
illustrated (annotations).
b. Selecting the best illustration from the image dataset for each
concept or event.</p>
      <p>
        In the FIRE task, participants are provided with multiple
children’s short stories, which need to be illustrated using the
ImageCLEF Wikipedia Image Retrieval dataset. The story text
is provided together with the important entities and events in it
that need illustration [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The objective is to provide
one ranked list of images for each important
entity and event in a story.
      </p>
      <p>
        The need for this research stems from the fact that we often
forget what we read a few sentences earlier. Reading
memory is affected by boredom, lack of attention or
distraction. Studies [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] suggest that creating visual illustrations can
improve reading memory, which would go a long way in
helping children and elderly people, who often face reading
memory problems.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. DATA</title>
      <p>The task uses two data components: an image
dataset containing all images available for illustration,
and the children's short stories that need to be
illustrated. In this task, ImageCLEF Wikipedia Image
Retrieval 2010 is used as the image dataset. This dataset
consists of 237,434 images along with their caption metadata.
Captions are available in English, French and/or German.
The stories that need to be illustrated are all
children’s short stories. Additionally, annotations are provided
for each story to indicate which portions of the story need to be
illustrated. Annotations take the form of important entities (nouns
or noun phrases) and events (combinations of entities and
verbs or verb phrases) that need illustration in a story.</p>
    </sec>
    <sec id="sec-3">
      <title>3. APPROACH</title>
      <p>
        Creating a story illustration is primarily an information
retrieval problem: important passages of the story are used as
queries, and the corresponding images are retrieved from the
image dataset provided. This involves two main tasks:
a. Indexing: mapping terms (the basic indexed units) to the
documents in a corpus.
b. Retrieval: generating results in response to a query (an
information need).
Several open source tools are available for these tasks.
They differ in the indexing and retrieval models they use.
We chose the Terrier tool [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] because it is
well suited to information retrieval experiments. Terrier
can index large document corpora with multiple indexing
strategies. It also supports state-of-the-art
retrieval approaches such as DFR and BM25.
      </p>
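      <p>The two tasks above can be illustrated with a minimal sketch of an inverted index over image captions. This is not Terrier's implementation: the function names, the caption texts and the image ids are hypothetical, and ranking by the number of matching query terms is only a stand-in for a real weighting model.</p>
      <preformat>
```python
from collections import defaultdict


def build_inverted_index(captions):
    """Map each term to the sorted list of document ids whose caption contains it."""
    postings = defaultdict(set)
    for doc_id, text in captions.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def retrieve(index, query):
    """Return ids of documents matching any query term, most matching terms first."""
    hits = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, []):
            hits[doc_id] += 1
    # Ties are broken by document id so the ordering is deterministic.
    return sorted(hits, key=lambda d: (-hits[d], d))


# Hypothetical captions; the real dataset has 237,434 entries.
captions = {
    "img1": "a red fox in the snow",
    "img2": "a crow sitting on a tree",
    "img3": "the fox and the crow",
}
index = build_inverted_index(captions)
ranked = retrieve(index, "fox crow")
```
      </preformat>
      <p>Here the index is built once over all captions, and each query only touches the posting lists of its own terms, which is what makes retrieval over a large corpus practical.</p>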
    </sec>
    <sec id="sec-4">
      <title>3.1 Indexing</title>
      <p>Indexing with the Terrier tool is performed on the
ImageCLEF dataset provided. The tool takes the entire image
caption metadata, represented as XML, and indexes it
according to the configuration set within the tool.
The corpus data (ImageCLEF) is parsed in TREC format and
forms the collection. A Collection object extracts the
raw content of each individual document and hands it to a
Document object. The Document object then removes any
unwanted content (e.g., from a particular document tag) and
gives the resulting text to a Tokeniser object. The
Tokeniser object converts the text into a stream of tokens that
represent the content of the document. The tokens then pass
through the TermPipeline, which removes stopwords (high-frequency
terms) and applies stemming (prefix and suffix removal) [4]. The
iteration over terms and the building of the index are performed by
the BasicIndexer (the default indexer).</p>
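      <p>A minimal sketch of the tokenise, stopword-removal and stemming steps is given below. It does not use Terrier's classes: the stopword list is a tiny hypothetical sample, and the suffix stripping is a deliberately crude stand-in for a real stemmer such as Porter's.</p>
      <preformat>
```python
import re

# Tiny illustrative stopword list; a real list is much longer.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "is"}

# Crude suffix stripping used only for illustration; this is not the Porter stemmer.
SUFFIXES = ("ing", "es", "s")


def tokenise(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_stem(term):
    """Strip the first matching suffix, keeping at least three characters of the stem."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def term_pipeline(text):
    """Tokenise, drop stopwords, then stem the remaining terms."""
    return [toy_stem(t) for t in tokenise(text) if t not in STOPWORDS]


terms = term_pipeline("Foxes jumping on the boxes")
```
      </preformat>
      <p>The same pipeline has to be applied to queries at retrieval time, otherwise query terms would not match the stemmed terms stored in the index.</p>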
      <p>Indexing the corpus data of 237,434 images produced 10,177,882
tokens. The result of indexing is depicted below:</p>
    </sec>
    <sec id="sec-5">
      <title>3.2 Retrieval</title>
      <p>
        Once indexing is complete, retrieval can be run with
different queries. The queries are in natural
language and denote the text from the story that needs to be
illustrated. We use the event descriptions of a story as search
queries. These event descriptions are provided in the task as XML
data, together with the respective manual annotations. Each event
from each story is used as a query. Figure-3 provides an outline of the
retrieval process in Terrier:
The input query is first parsed and then preprocessed by passing
it through the same configured TermPipeline.
The query is then handed to the matching component. The weighting
model is instantiated (a DFR model is used) and document
scores for the query are computed. To improve the scores,
the query moves to post-processing, e.g. query expansion [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
which takes the most informative terms from the top-ranked
documents for the query and adds these related terms
to the query. The expanded query is then scored again by the
matching component. Post-filtering is the final step in Terrier’s
retrieval process, where a series of filters can remove
retrieved documents that do not satisfy a given condition.
      </p>
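      <p>The scoring and query expansion steps can be sketched as follows. Two substitutions should be noted: the runs in this paper use a DFR weighting model, whereas the sketch uses BM25 (also supported by Terrier) because its formula is compact, and the expansion step simply picks the most frequent unseen terms from the top-ranked documents rather than using Terrier's term-weighting models. All function names, parameter defaults and the toy documents are assumptions made for illustration.</p>
      <preformat>
```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document (a list of terms) against the query terms with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(terms) for terms in docs.values()) / n_docs
    doc_freq = Counter(t for terms in docs.values() for t in set(terms))
    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        score = 0.0
        for q in query_terms:
            if tf[q] == 0:
                continue
            # Non-negative idf variant: ln(1 + (N - n + 0.5) / (n + 0.5)).
            idf = math.log(1 + (n_docs - doc_freq[q] + 0.5) / (doc_freq[q] + 0.5))
            norm = tf[q] + k1 * (1 - b + b * len(terms) / avg_len)
            score += idf * tf[q] * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return scores


def expand_query(query_terms, docs, scores, top_docs=2, new_terms=1):
    """Add the most frequent unseen terms from the top-ranked documents."""
    ranked = sorted(scores, key=lambda d: (-scores[d], d))[:top_docs]
    counts = Counter(t for d in ranked for t in docs[d] if t not in query_terms)
    extra = [t for t, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    return list(query_terms) + extra[:new_terms]


# Hypothetical, already preprocessed captions.
docs = {
    "img1": ["red", "fox", "snow"],
    "img2": ["crow", "tree"],
    "img3": ["fox", "crow", "cheese"],
    "img4": ["fox", "forest", "cheese"],
}
first_pass = bm25_scores(["fox", "crow"], docs)
expanded = expand_query(["fox", "crow"], docs, first_pass)
second_pass = bm25_scores(expanded, docs)
```
      </preformat>
      <p>In this toy example the expansion adds the term "cheese", which separates img4 from img1 in the second pass even though both scored the same on the original query.</p>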
      <p>Figure-4: Screenshot of the retrieval results for a query in
the Terrier tool.</p>
      <p>After every query has been searched, the run files (output) are
compiled from the retrieved image ids and their scores for each query.</p>
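      <p>As an illustration of compiling a run file, the sketch below writes results in the standard six-column TREC run layout (query id, the literal Q0, document id, rank, score, run tag) that trec_eval reads. The function name, query id, image ids, scores and run tag are hypothetical and not taken from our submissions.</p>
      <preformat>
```python
def format_run(results, run_tag):
    """Format {query_id: {image_id: score}} as lines in the TREC run layout."""
    lines = []
    for query_id in sorted(results):
        scored = results[query_id]
        # Highest score first; ties broken by image id for a stable order.
        ranked = sorted(scored, key=lambda d: (-scored[d], d))
        for rank, image_id in enumerate(ranked):
            lines.append(
                f"{query_id} Q0 {image_id} {rank} {scored[image_id]:.4f} {run_tag}"
            )
    return lines


# Hypothetical query id, image ids and scores.
results = {"story1-event1": {"12345": 7.25, "67890": 9.5}}
run_lines = format_run(results, "example-run")
```
      </preformat>
      <p>Joining run_lines with newlines and writing them to a text file gives one run file per system configuration.</p>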
    </sec>
    <sec id="sec-6">
      <title>4. EVALUATION AND RESULTS</title>
      <p>Evaluation is conducted on the run files using the standard
trec_eval tool. Precision-at-K (P@K) and mean average
precision (MAP) scores are computed. Each important entity or
event in a story has a relevance list associated with it, and
P@K and MAP for each annotation are computed against these
relevance judgements.</p>
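      <p>For reference, the two measures can be sketched as below. This is a simplified illustration that assumes binary relevance judgements and is not the trec_eval implementation; the function names and the toy rankings are hypothetical.</p>
      <preformat>
```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs, qrels):
    """Average the per-query AP over all queries that have relevance judgements."""
    per_query = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(per_query) / len(per_query)


# Hypothetical ranked lists and relevance sets for two queries.
runs = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y"]}
qrels = {"q1": {"a", "c"}, "q2": {"y"}}
p_at_2 = precision_at_k(runs["q1"], qrels["q1"], 2)
map_score = mean_average_precision(runs, qrels)
```
      </preformat>
      <p>For q1 the relevant items appear at ranks 1 and 3, giving an average precision of (1/1 + 2/3) / 2; for q2 the single relevant item appears at rank 2, giving 1/2.</p>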
      <p>In total, two groups participated with four system
submissions; the results are shown below in Table-1.
An effective information retrieval system is one with high recall
and precision, i.e. it retrieves as many relevant documents as
possible and as few non-relevant documents as possible. The
cguj-run-2 run file achieved a precision of 0.1545.</p>
    </sec>
    <sec id="sec-7">
      <title>5. CONCLUSION</title>
      <p>The Automated Story Illustration task was released for the first
time in FIRE. This research can go a long way in illustrating short
stories, especially for children, as well as helping to improve
reading memory. Considerable further work is needed to
improve the effectiveness of the results, such as modifying the scores
using different algorithms, improving the manual annotations,
modifying the queries, or improving the indexing.</p>
    </sec>
    <sec id="sec-8">
      <title>6. ACKNOWLEDGEMENTS</title>
      <p>We are grateful to Dr. Debasis Ganguly and Mr. Iacer Calixto
for their guidance throughout this task. We
would also like to thank FIRE 2015 for the opportunity to work
on this task and for facilitating the process.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Diogo</given-names>
            <surname>Delgado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Joao</given-names>
            <surname>Magalhaes</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nuno</given-names>
            <surname>Correia</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Assisted News Reading with Automated Illustrations</article-title>
          . ACM. DOI= http://dl.acm.org/citation.cfm?id=1874311.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Delgado</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Automated illustration of news stories</article-title>
          . IEEE.
        </mixed-citation>
      </ref>
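      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Michel</given-names>
            <surname>Beigbeder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wray</given-names>
            <surname>Buntine</surname>
          </string-name>
          and
          <string-name>
            <given-names>Wai Gen</given-names>
            <surname>Yee</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Terrier: A High Performance and Scalable Information Retrieval Platform</article-title>
          . OSIR.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>William B.</given-names>
            <surname>Frakes</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ricardo</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          .
          <year>1992</year>
          . Book:
          <article-title>Stemming Algorithms</article-title>
          . PEARSON Education.
        </mixed-citation>
      </ref>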
    </ref-list>
  </back>
</article>