<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Collecting Spatial Information for Locations in a Text-to-Scene Conversion System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Masoud Rouhizadeh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Bauer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bob Coyne</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Owen Rambow</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard Sproat</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Columbia University</institution>
          ,
          <addr-line>New York NY</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Oregon Health &amp; Science University</institution>
          ,
          <addr-line>Portland OR</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We investigate using Amazon Mechanical Turk (AMT) for building a low-level description corpus and populating VigNet, a comprehensive semantic resource that we will use in a text-to-scene generation system. To depict a location, VigNet should contain knowledge about the typical objects in that location and the arrangements of those objects. Such information is mostly common-sense knowledge that is taken for granted by human beings and is not stated in existing lexical resources or in text corpora. In this paper we focus on collecting the objects of locations using AMT. Our results show that this is a promising approach.</p>
      </abstract>
      <kwd-group>
        <kwd>Text-to-Scene Systems</kwd>
        <kwd>Amazon Mechanical Turk</kwd>
        <kwd>Lexical Resources</kwd>
        <kwd>VigNet</kwd>
        <kwd>Location Information</kwd>
        <kwd>Description Corpora</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Our aim is to populate VigNet, a comprehensive semantic resource that we will
use in a text-to-scene generation system. This system follows in the footsteps
of Coyne and Sproat's WordsEye [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], but while WordsEye only supported a
very limited number of actions in a static manner and mostly accepted low-level
language as input (John is in front of the kitchen table. A cup is on the table.
A plate is next to the cup. Toast is on the plate), the new system will support
higher-level language (John had toast for breakfast).
      </p>
      <p>
        VigNet is based on FrameNet [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and contains lexical, semantic, and
spatial/graphical information needed to translate text into plausible 3D scenes. In
VigNet, frames are decomposed into subframes and eventually into primitive
spatial relations between frame participants (frame elements), describing one way
a frame can be depicted graphically. We call a frame that is decomposable into
such primitives a vignette. Even though the technical details are not crucial for
understanding this paper, we refer the interested reader to [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>This paper deals with the collection of spatial information to populate
VigNet. Even though VigNet contains vignettes for actions and other events,
complex objects and situations, this paper focuses only on the induction of location
vignettes. Knowledge about locations is of great importance to create detailed
scenes because locations define the context in which an action takes place. For
instance, when someone takes a shower, he usually does so in the bathroom,
interacting with the `affordances' provided by this room (i.e., shower cabin, curtain,
shower head, shower tap, etc.) in a specific way. Note that location vignettes can,
but do not have to be evoked by lexical items. We can say John took a shower
in the bathroom, but this seems redundant because bathrooms are the preferred
location for taking a shower. VigNet records knowledge of this type that can be
accessed in the text-to-scene generation process.</p>
      <p>In this paper we propose a methodology for collecting semantic information
for location vignettes using Amazon Mechanical Turk (AMT). The next section
first discusses location vignettes in more detail. We then review related work in
section 3. We describe how we use AMT to build an image description corpus
and collect semantic information for locations in section 4 and compare different
methods in an evaluation. Section 5 concludes.
</p>
    </sec>
    <sec id="sec-2">
      <title>Location Vignettes</title>
      <p>As mentioned before, location vignettes are important because they provide the
context in which actions can take place. Locations involve the spatial
composition of several individual objects. For example, in `John sat in the living room',
we might expect the living room to contain objects such as a sofa, a coffee table,
and a fireplace. In addition, these objects would be spatially arranged in some
recognizable manner, perhaps with the fireplace embedded in a wall and the
coffee table in front of the sofa in the middle of the room. In order to represent such
locations graphically we are adding knowledge about the typical arrangements
of objects for a wide variety of locations into VigNet.</p>
      <p>Any given location term can potentially be realized in a variety of ways and
hence can have multiple associated vignettes. For example, we can have multiple
location vignettes for a living room, each with a somewhat different set of objects
and arrangement of those objects. This is analogous to how an individual object,
such as a couch, can be represented in any number of styles and realizations. Each
location vignette consists of a list of constituent objects (its frame elements) and
graphical relations between those objects (by means of frame decomposition).
For example, one type of living room (of many possible ones) might contain a
couch, a coffee table, and a fireplace in a certain arrangement.</p>
      <p>living-room-42(left wall, far wall, couch, coffee table, fireplace)
touching(figure: couch, ground: left wall)
facing(figure: couch, ground: right wall)
front-of(figure: coffee table, ground: sofa)
embedded(figure: fireplace, ground: far wall)</p>
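      <p>To make the structure of such a vignette concrete, the example above can be written down as a simple record pairing the constituent objects with the primitive relations that hold between them. The sketch below is purely illustrative (the Python field names are ours, not VigNet's internal schema); it only restates the living-room example.</p>
      <preformat>
# Illustrative sketch only: a location vignette as plain Python data.
# Object and relation names mirror the living-room example above; the
# dataclass itself is a hypothetical stand-in for VigNet's real representation.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LocationVignette:
    name: str                              # e.g. "living-room-42"
    objects: List[str]                     # constituent objects (frame elements)
    relations: List[Tuple[str, str, str]]  # (relation, figure, ground) triples

living_room_42 = LocationVignette(
    name="living-room-42",
    objects=["left wall", "far wall", "couch", "coffee table", "fireplace"],
    relations=[
        ("touching", "couch", "left wall"),
        ("facing", "couch", "right wall"),
        ("front-of", "coffee table", "sofa"),
        ("embedded", "fireplace", "far wall"),
    ],
)
      </preformat>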
      <p>The set of graphical primitives used by location vignettes controls surface
properties (color, texture, opacity, shininess) and spatial relations (position,
orientation, size). This set of primitive relations is sufficient to describe the basic
spatial layout of most locations (and scenes taking place in them). Generally we
do not record information about how the parts of a location can be used in an
action, but rather consider this knowledge to be part of the action.
</p>
    </sec>
    <sec id="sec-3">
      <title>Related work</title>
      <p>
        Existing lexical and common-sense knowledge resources do not contain the
spatial and semantic information required to construct location vignettes. In a few
cases, WordNet [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] glosses specify location-related information, but the number
of such entries with this kind of information is very small, and they cannot be
used in a systematic way. For example, the WordNet gloss for living room (a
room in a private house or establishment where people can sit and talk and
relax) defines it in terms of its function, not its constituent objects and spatial
layout. Similarly, the WordNet gloss for sofa (an upholstered seat for more than
one person) provides no location information. FrameNet [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is focused on verb
semantics and thematic roles and provides little to no information on the spatial
arrangement of objects.
      </p>
      <p>
        More relevant to our project is OpenMind [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] where online crowd-sourcing
is used to collect a large set of common-sense assertions. These assertions are
normalized into a couple dozen relations, including the typical locations for
objects. The list of resulting objects found for each location, however, is noisy and
contains many peripheral and spurious relations. In addition, even the valid
relations are often vague and represent different underlying relations. For example,
a book is declared to be located at a desk (the directly supporting object) as
well as at a bookstore (the overall location). In addition, like most existing
approaches, it suffers from objects and relations being generalized across all
locations of a given type, and hence it is unable to represent the dependencies that
would occur in any given specific location. As a result, there is no clear way to
reliably determine the main objects and disambiguated spatial relations needed
for location vignettes.
      </p>
      <p>
        LabelMe [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is a large collection of images with annotated 2D polygonal
regions for most elements and objects in a picture. It benefits from the coherence
of grounding the objects in specific locations. It suffers, though, from the lack
of differentiation between main objects and peripheral ones. Furthermore, it
contains no 3D spatial relations between objects.
      </p>
      <p>
        One well-known approach to building lexical resources is automatically
extracting lexical relations from large text corpora. For a comprehensive review
of these works see [
        <xref ref-type="bibr" rid="ref6">6</xref>
         ]. However, a few works focus specifically on extracting
semantic information for locations, including [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which use the
vector-space model and a nearest-neighbor classifier to extract locations of objects. Also
directly relevant to this paper is work by Sproat [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] which attempts to extract
associations between actions and locations from text corpora. This approach
provides some potentially useful information, but the extracted data is noisy
and requires hand editing. In addition, it extracts locations for actions rather
than the objects and spatial relations associated with those locations.
      </p>
      <p>Furthermore, much of the information that we are looking for is
commonsense knowledge that is taken for granted by human beings and is not explicitly
stated in corpora. Although structured corpora like Wikipedia do mention
associated objects, they are often incomplete. For example in the Wikipedia entry
for kitchen there is no mention of a counter or other surface on which to prepare
food, but the picture that goes with the definition paragraph (labeled "A modern
Western kitchen") clearly has one.</p>
      <p>In this paper we investigate using Amazon Mechanical Turk (AMT) for
building a low-level description corpus for locations and for directly collecting objects
of location vignettes. We will compare the accuracy of the collected data to several
gold standard vignettes generated by an expert. We show that we can tune our
information collection method to scale to a large number of locations.</p>
    </sec>
    <sec id="sec-4">
      <title>Using AMT to build location vignettes</title>
      <p>In this section we discuss how we use Amazon Mechanical Turk (AMT) to build
a location description corpus and to collect the typical objects of location
vignettes. AMT is an online marketplace to coordinate the use of human
intelligence to perform small tasks, such as image annotation, that are difficult for
computers but easy for humans. The inputs to our AMT experiments are
pictures of different rooms. By collecting objects and relations grounded to specific
rooms we capture coherent sets of dependencies between objects in context and
not just generalized frequencies that may not work together. In each task we
collected answers for each room from five workers who were located in the US
and had a prior approval rating of 99%. Restricting the location of the workers
increases the chance that they are native speakers of English, or at least have a
good command of the language. We carefully selected input pictures from the
results of image searches using the Google and Bing search engines. We selected
photos that show `typical' instances of the room type, e.g. room instances which
include typical large objects found in such rooms. Photos should show the entire
room. We then defined the following tasks:</p>
      <p>Task 1: Building a low-level location description corpus: In this task,
we asked AMT workers to provide simple and clear descriptions of 85 pictured
rooms. We explicitly told AMT workers that their descriptions had to be in the
form of naming the main elements or objects in the room and their positions
in relation to each other, using verbs such as is or are (i.e. linking verbs). Each
description had to be very precise and 5 to 10 sentences long. Our collected
description corpus contains around 11,000 words.</p>
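      <p>As an aside, the collection setup described above (five assignments per room, workers located in the US, a 99% prior approval rate) maps onto standard AMT qualification requirements. The sketch below uses the present-day boto3 MTurk client, which postdates this work, and is only a hedged illustration; the HIT title, reward, and question file are hypothetical placeholders.</p>
      <preformat>
# Hedged illustration: posting one room-description HIT with boto3 (not the
# tooling used in this work). Title, reward, and the question file are made up.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# The HIT question body (an ExternalQuestion or HTMLQuestion XML document
# pointing at our picture-and-form page) is assumed to live in this file.
question_xml = open("room_description_question.xml").read()

qualification_requirements = [
    {   # System qualification: worker locale must be US.
        "QualificationTypeId": "00000000000000000071",
        "Comparator": "EqualTo",
        "LocaleValues": [{"Country": "US"}],
    },
    {   # System qualification: prior approval rate of at least 99%.
        "QualificationTypeId": "000000000000000000L0",
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [99],
    },
]

response = mturk.create_hit(
    Title="Describe the room in this picture",
    Description="Write 5-10 simple sentences naming the main objects and their positions.",
    Reward="0.10",
    MaxAssignments=5,                    # five workers per room
    AssignmentDurationInSeconds=600,
    LifetimeInSeconds=7 * 24 * 3600,
    Question=question_xml,
    QualificationRequirements=qualification_requirements,
)
print(response["HIT"]["HITId"])
      </preformat>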
      <p>In order to extract location information from the low-level location
description corpus, the text is first processed using the NLP module of WordsEye. We
extracted the objects and other elements of locations, which are mainly in the
form of relation-ground-figure tuples, and kept the objects and elements that
are represented as figure or ground. We then further processed the extracted
objects as explained in subsection 4.1.</p>
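      <p>A minimal sketch of this extraction step is given below, assuming the parser output has already been reduced to (relation, ground, figure) tuples; the tuple layout and function name are our own illustration, not the WordsEye NLP module's actual interface.</p>
      <preformat>
# Illustrative sketch: collect candidate location objects from relation tuples.
# Assumes each parsed sentence yields (relation, ground, figure) tuples.
from collections import Counter
from typing import Iterable, Tuple

def collect_objects(relations: Iterable[Tuple[str, str, str]]) -> Counter:
    """Count every entity that appears as figure or ground in a description."""
    counts = Counter()
    for relation, ground, figure in relations:
        counts[ground] += 1
        counts[figure] += 1
    return counts

# Example: two relations extracted from a living-room description.
parsed = [("front-of", "sofa", "coffee table"),
          ("embedded", "far wall", "fireplace")]
print(collect_objects(parsed))
      </preformat>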
      <p>Task 2: Listing functionally important objects of locations:
According to this criterion, the important objects for a room are those that are required
in order for the room to be recognized or to function as such. One can imagine
a kitchen without a picture frame, but it is rarely possible to think of a kitchen
without a refrigerator. Other functional objects include a stove, an oven, and
a sink. We asked workers to provide a list of functional objects using an AMT
HIT such as the one shown in figure 1. We showed each AMT worker an example
room with a list of objects and their counts. We gave the following instructions:
"Based on the following picture of a kitchen, list the objects that you really
need in a kitchen and the counts of the objects.
1. In each picture, first tell us how many room doors and room windows
you see.
2. Again, don't list the objects that you don't really need in a kitchen (such
as magazine, vase, etc). Just name the objects that are absolutely required
for this kitchen."</p>
      <p>Task 3: Listing visually important objects of locations: For this task
we asked workers to list large objects (furniture, appliances, rugs, etc) and those
that are fixed in location (part of walls, ceilings, etc). The goal was to know
which objects help define the basic structural makeup of this particular room
instance. We used the AMT input form shown in figure 1 again, provided a single
example room with example objects, and gave the following instructions:
"What are the main objects/elements in the following kitchen? How many
of each?
1. In selecting the objects give priority to:
- Large objects (furniture, appliances, rugs, etc).
- Objects that are fixed in location (part of walls, ceilings, etc).
The goal is to know which objects help define the basic makeup and structure
of this particular kitchen.
2. In each picture, first tell us how many room doors and room windows
you see."</p>
      <sec id="sec-4-1">
        <title>Post-processing of the extracted object names from AMT</title>
        <p>We post-processed the objects extracted from the location description corpus
and the objects that were listed in tasks 2 and 3 in the following steps (a code
sketch of this pipeline is given after the list):
1. Manual checking of spelling and converting plural nouns to singular.
2. Removing conjunctions like "and", "or", and "/". For example, we converted
"desk and chair" to "desk" and "chair".
3. Converting the objects belonging to the same WordNet synset into the most
frequent word of the synset. For example, we converted tub, bath, and bathtub
into bathtub with a frequency of three.
4. Finding the intersection of the inputs of the five workers and selecting the
objects that are listed three times or more.
5. Finding major substrings in common: some input words only differ by a space
or a hyphen character, such as night stand, night-stand, and nightstand. We
converted such variants to the simplest form, i.e. nightstand.
6. Looking for head nouns in common: if the head of a compound-noun input
such as projector screen can be found in another single-word input, i.e. screen,
we assume that both refer to the same object, i.e. screen.
7. Recalculating the intersections and selecting the objects with a frequency of
three or more.</p>
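        <p>A code sketch of steps 2 through 7 is given below. It is only an approximation of the (partly manual) procedure described above, and it assumes NLTK's WordNet interface for the synset-normalization step; all helper names are ours.</p>
        <preformat>
# Illustrative sketch of the post-processing pipeline (steps 2-7 above).
# Assumes the NLTK WordNet corpus is installed; helper names are hypothetical.
import re
from collections import Counter
from nltk.corpus import wordnet as wn

def split_conjunctions(phrase: str) -> list:
    """Step 2: 'desk and chair' -> ['desk', 'chair']."""
    return [p.strip() for p in re.split(r"\band\b|\bor\b|/", phrase) if p.strip()]

def canonical_synonym(word: str) -> str:
    """Step 3: map a word to the most frequent lemma of its first noun synset,
    so that tub / bath / bathtub all end up as one canonical form."""
    synsets = wn.synsets(word.replace(" ", "_"), pos=wn.NOUN)
    if not synsets:
        return word
    best = max(synsets[0].lemmas(), key=lambda lemma: lemma.count())
    return best.name().replace("_", " ")

def merge_variants(counts: Counter) -> Counter:
    """Step 5: merge entries differing only by spaces or hyphens
    (night stand, night-stand, nightstand -> nightstand)."""
    canon = {}
    for word in counts:
        key = word.replace("-", "").replace(" ", "")
        if key not in canon or len(word) < len(canon[key]):
            canon[key] = word
    merged = Counter()
    for word, n in counts.items():
        merged[canon[word.replace("-", "").replace(" ", "")]] += n
    return merged

def merge_head_nouns(counts: Counter) -> Counter:
    """Step 6: fold a compound into an existing single-word head,
    e.g. 'projector screen' -> 'screen' if 'screen' is already listed."""
    merged = Counter()
    for word, n in counts.items():
        head = word.split()[-1]
        merged[head if head != word and head in counts else word] += n
    return merged

def filter_by_agreement(counts: Counter, threshold: int = 3) -> dict:
    """Steps 4 and 7: keep objects listed by at least `threshold` workers."""
    return {w: n for w, n in counts.items() if n >= threshold}
        </preformat>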
      </sec>
      <sec id="sec-4-2">
        <title>Evaluation</title>
        <p>For evaluating the results we manually built a set of gold standard vignettes
(GSVs) for 5 rooms, which include A) a list of objects in each room, and B) the
arrangements of those objects. The objects selected for the GSVs are the ones that help
define the basic makeup and structure of the particular room. We compare
the objects extracted from the AMT tasks against the list of objects in the GSVs.</p>
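        <p>The comparison reduces to set precision and recall over object lists. A minimal sketch, with made-up object sets rather than the paper's actual data, is:</p>
        <preformat>
# Illustrative sketch: precision/recall of extracted objects against one GSV.
# The example object sets are invented for illustration only.
def precision_recall(extracted: set, gold: set) -> tuple:
    correct = extracted.intersection(gold)
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

extracted = {"refrigerator", "stove", "sink", "vase"}
gold = {"refrigerator", "stove", "sink", "oven", "counter"}
print(precision_recall(extracted, gold))  # (0.75, 0.6)
        </preformat>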
        <p>Table 1 shows the comparison of the AMT tasks against the GSVs. The
"Extracted Objs" row shows the number of objects we extracted from each AMT
task for the 5 rooms. The "Correct Objs" row shows the number of extracted
objects from AMT that are present in our GSVs for each room, and the precision
score derived from that. The "Expected Objs" row shows the number of all
the objects in the GSVs that we expected the workers to list, and the recall score
based on that.</p>
        <p>Precision: 87%; Recall: 85%.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and future work</title>
      <p>In this paper we explored different approaches to populate VigNet, a resource
containing spatially grounded lexical semantics, with locational information
(location vignettes) using Amazon Mechanical Turk. In one approach we used AMT
to collect a low-level description corpus for locations. We then used the
WordsEye NLP module to extract the objects from each description. For comparison
we asked AMT workers to directly list objects of locations shown in photographs,
either based on visual or on functional criteria. We then post-processed the
extracted objects from each experiment and compared them against gold standard
location vignettes.</p>
      <p>We have shown that by processing the description corpus we can extract
reasonably accurate objects, as well as spatial relations and arrangements of
objects. The results achieved using the functional and visual object listing tasks
approximate the gold standard even better, with the visual elicitation criterion
outperforming the functional one.</p>
      <p>In current work, due to the good results on the small training set, we are using
the visual object listing paradigm to induce descriptions of 85 rooms. We are
planning to collect vignettes for a variety of other indoor and outdoor locations.</p>
      <p>Location vignettes also contain the spatial arrangement of objects. In
addition to the extracted relations from the description corpus, we also designed
a series of AMT tasks for determining the arrangements of objects in different
locations using the objects that we collected in the present work. For each room
we ask AMT workers to determine the arrangements of the previously collected
objects in that particular room. For each object in the room, workers have to
determine its spatial relation with A) one wall of the room and B) one other
object in the room. We did not include the results in this paper since we are still
exploring methods to evaluate the spatial arrangements task. The gold standard
location vignettes include arrangements of objects, but it is difficult to directly
compare the gold standard to the AMT workers' inputs, as there are different
possibilities for describing the same spatial layout.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This material is based upon work supported by the National Science Foundation
under Grant No. IIS-0904361.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Baker</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fillmore</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>The Berkeley FrameNet Project</article-title>
          .
          <source>COLING-ACL</source>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Coyne</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sproat</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>WordsEye: An automatic text-to-scene conversion system</article-title>
          .
          <source>In Proceedings of the 28th annual conference on Computer graphics and interactive techniques</source>
          , Los Angeles, CA, USA, pp.
          <fpage>487</fpage>
          -
          <lpage>496</lpage>
          , (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Coyne</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rambow</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirschberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sproat</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Frame semantics in text-to-scene generation</article-title>
          . In R. Setchi, I. Jordanov, R. Howlett, and L. Jain (Eds.),
          <source>Knowledge-Based and Intelligent Information and Engineering Systems</source>
          , Volume
          <volume>6279</volume>
          of Lecture Notes in Computer Science, pp.
          <fpage>375</fpage>
          -
          <lpage>384</lpage>
          . Springer Berlin / Heidelberg (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Coyne</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bauer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rambow</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>VigNet: Grounding Language in Graphics using Frame Semantics</article-title>
          .
          <source>In ACL Workshop on Relational Models of Semantics</source>
          , (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Fellbaum</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>WordNet: An Electronic Lexical Database</article-title>
          . Bradford Books
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Girju</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beamer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rozovskaya</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fister</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhat</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A knowledge-rich approach to identifying semantic relations between nominals</article-title>
          .
          <source>Information Processing and Management</source>
          , vol.
          <volume>46</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>589</fpage>
          -
          <lpage>610</lpage>
          , (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>B. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murphy</surname>
            ,
            <given-names>K. P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Freeman</surname>
            ,
            <given-names>W. T.</given-names>
          </string-name>
          :
          <article-title>LabelMe: a database and web-based tool for image annotation</article-title>
          .
          <source>International Journal of Computer Vision</source>
          , vol.
          <volume>77</volume>
          , no.
          <issue>1-3</issue>
          , pp.
          <fpage>157</fpage>
          -
          <lpage>173</lpage>
          , May
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Havasi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Speer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alonso</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>ConceptNet 3: a Flexible, Multilingual Semantic Network for Common Sense Knowledge</article-title>
          .
          <source>Proceedings of Recent Advances in Natural Language Processing</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Sproat</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Inferring the environment in a text-to-scene conversion system</article-title>
          .
          <source>First International Conference on Knowledge Capture</source>
          , Victoria, BC
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Turney</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Littman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Corpus-based Learning of Analogies and Semantic Relations</article-title>
          .
          <source>Machine Learning Journal</source>
          <volume>60</volume>
          (
          <issue>1-3</issue>
          ), pp.
          <fpage>251</fpage>
          -
          <lpage>278</lpage>
          . (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Turney</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Expressing implicit semantic relations without supervision</article-title>
          .
          <source>In: Proceedings of COLING-ACL, Australia</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>