<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Iconic Gestures with Spatial Semantics: A Case Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elizabeth Hinkelman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Galactic Village Games, Inc.</institution>
          ,
          <addr-line>110 Groton Rd., Westford MA 01886</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The spontaneous gestures that accompany spoken language are particularly suited to conveying spatial information, yet their briefness, individuality, and lack of conventional linguistic structure impede their integration into NLU systems. The current work characterizes spontaneous size gestures in a manual task corpus, clarifying their form, discourse role and representation as a first step toward incorporating them into NLU systems.</p>
      </abstract>
      <kwd-group>
        <kwd>gesture</kwd>
        <kwd>spatial language</kwd>
        <kwd>knowledge representation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        When gesture carries the primary load of communication, as in the major sign
languages, it develops linguistic properties such as verb subcategorization [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and
lexicalization [
        <xref ref-type="bibr" rid="ref2 ref3">2,3</xref>
        ]. The spontaneous hand gestures that accompany speech, in
contrast, do not show linguistic structure [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For this reason, computational research
on spontaneous gesture has focused primarily on discourse functions, such as using
long range video features to signal repair strategies [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or shifts in topic [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Discrete-valued features extracted from gaze and body orientation have also been used for
discourse functions such as signaling grounding. Much of this work emphasizes
gesture production rather than recognition [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ].
      </p>
      <p>
        Yet the spontaneous hand gestures that accompany speech are increasingly
recognized both as a cognitive aid to the gesturer and as an encoding of meaning [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10,
11, 12</xref>
        ]. Among the spontaneous gestures that accompany speech, iconic gestures are
those which present “images of concrete entities and actions” [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Iconic gestures have
in some cases (though not yet broadly) been shown to be effective in communicating
spatial information between discourse participants [
        <xref ref-type="bibr" rid="ref11 ref13 ref4">4, 11, 13</xref>
        ].
      </p>
      <p>The current work pursues the incorporation of spontaneous gesture into NLU
systems: much groundwork must be laid. Amid the fluidity and abstractness of
spontaneous gesture, we focus on concrete gestures with (relatively) straightforward
spatial interpretations. We seek to answer the questions:</p>
      <p>What is the discourse purpose of the gestures?
Do the gestures constitute intended communication?
To what extent are they lexicalized?
What are their semantics?
How can they be related to the semantics of the co-occurring speech?</p>
      <p>We collected a reference corpus for dialogue with intonation and gesture in a physical
task context. The subjects were twelve pairs of University of Chicago undergraduate
and graduate students, who were familiar with each other and had some cooking
experience. They were recorded while performing a 30-45 minute cooking task
(making chocolate truffles), using a single camera and lapel microphones. Some
elements of the task include locating ingredients and equipment, dividing the labor,
choosing flavorings, and activities such as measuring and washing up.</p>
      <p>
        The resulting eight hours of videotape were examined for spatial gestures. These
included pointing, displaying, miming of physical actions and manner [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and size
gestures. We selected the size gestures as a focus for possible NLU because they are
the simplest and most imagistic of these groupings, and because they were relatively
uniform in form.
      </p>
      <p>All of the size gestures in our corpus stemmed from the recipe step: “Take a hunk
of set ganache and roll into a walnut-sized ball between your palms.” An example can
be seen in Illustration 1, where subject Chris reads the recipe step aloud, envisions the
ball he will roll, and enlists Jason to confirm the ball size. In total he performs the
gesture for about three seconds; Jason eventually turns his head to view it for about
800ms. We will refer to this example and similar gestures as 'the ball size gesture'.</p>
      <sec id="sec-1-1">
        <title>2.1 Results: ball size gesture use and discourse purpose</title>
        <p>Of twelve pairs of subjects, two did not communicate about truffle size beyond
reading the recipe. Ten discussed truffle size verbally; of these, three did not use
gestures, and three used displays of ganache (dough). Four used size gestures: three
ball size gestures and one caliper size gesture<fn id="fn1"><label>1</label><p>A 'caliper gesture' shows the size of a small object using parallel thumb and forefinger.</p></fn>. Gestures were used in two main ways:
to inform the partner of a desired size, or to request confirmation that a size was
correct. In one case, multiple ball gestures were used to explain how an incorrect ball
size leads to difficulties in baking. All gestures were used with co-occurring speech.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2.2 Intended communication – ball size and display</title>
        <p>We classify five of the seven gestures as intended communication, on the basis that:
in three cases the gesturer used motion or location to attract visual attention; in two
cases the gesturer made a verbal reference to the gesture (e.g. “like this?”), and in one
case both were used. For the seventh gesture (the incorrect ball size explanation) we
have no evidence that the gesture per se was intended communicatively. A further
analysis of gaze and uptake in these cases is in progress. Although this is a very small
sample, most of these gestures showed evidence of communicative intent.</p>
      </sec>
      <sec id="sec-1-3">
        <title>2.3 Form constraints on the ball size gesture</title>
        <p>We initially suspected that the ball size gesture was strongly lexicalized in
comparison with spontaneous gesture generally. In all cases the thumb and forefinger
circle to touch each other and embrace a notional ball, and are displayed as the focal
side of the gesture. However, there is notable variation in other parameters. Either
hand could be used, as in ASL. The position of the other three fingers is not
conventionalized (whereas it might or might not be constrained in a sign language).
The location of the gesture relative to the gesturer is not as conventionalized as it
would be in ASL. In the table, we refer to the gesturer as G and the observer as O.</p>
        <p>The third column, the explanation of how two balls may melt into each other while
baking, is more typical of spontaneous gesture in showing dynamic configurational
elements with extended duration. The ball size gesture is not as conventionalized as
an ASL gesture – nor can we say what lexicon it would belong to. More work is
needed on this point. The ball size gesture contrasts with the caliper gesture in form.</p>
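        <p>As a minimal sketch of how the form parameters discussed in this section could be encoded for an NLU system, the three ball size gestures can be written as feature records and checked for parameters that stay constant across gesturers. The class name GestureForm, the function constant_parameters and the hyphenated pair labels are hypothetical names introduced for illustration. The two-handed gesture by Josh and Naomi is simplified to a single handform and finger value, and duration is omitted; both are assumptions of this sketch.</p>
        <preformat>
```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GestureForm:
    """Form parameters of one observed ball size gesture (illustrative encoding)."""
    pair: str
    hand: str
    handform: str
    fingers: str
    orientation: str
    location: str
    path: str


# Values transcribed from the corpus observations; the two-handed gesture
# is collapsed to one handform and one finger value (an assumption).
OBSERVED = [
    GestureForm("Chris-Jason", "left", "OK", "splayed",
                "O's visual plane", "at G's eye level", "static"),
    GestureForm("Chris-Trish", "right", "OK", "curled",
                "O's visual plane", "near O's focus", "static"),
    GestureForm("Josh-Naomi", "both", "OK", "splayed",
                "off G's visual plane", "near G's chest", "slowly together"),
]


def constant_parameters(gestures):
    """Return the form parameters that take a single value across all gestures."""
    names = [f.name for f in fields(GestureForm) if f.name != "pair"]
    return [n for n in names if len({getattr(g, n) for g in gestures}) == 1]


print(constant_parameters(OBSERVED))
```
        </preformat>
        <p>Under this encoding only the handform is shared by all three gestures, which is consistent with the observation that the thumb and forefinger circle is fixed while the other parameters vary.</p>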
        <table-wrap id="tbl-1">
          <table>
            <thead>
              <tr>
                <th>Lexicalized?</th>
                <th>Chris&amp;Jason</th>
                <th>Chris&amp;Trish</th>
                <th>Josh&amp;Naomi</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Hand</td>
                <td>left</td>
                <td>right</td>
                <td>both</td>
              </tr>
              <tr>
                <td>Handform</td>
                <td>'OK'</td>
                <td>'OK'</td>
                <td>'OK', 'OK'</td>
              </tr>
              <tr>
                <td>Fingers</td>
                <td>splayed</td>
                <td>curled</td>
                <td>splayed, splayed</td>
              </tr>
              <tr>
                <td>Orientation</td>
                <td>O's visual plane</td>
                <td>O's visual plane</td>
                <td>Off G's vis plane</td>
              </tr>
              <tr>
                <td>Location</td>
                <td>At G's eye level</td>
                <td>Near O's focus</td>
                <td>Near G's chest</td>
              </tr>
              <tr>
                <td>Path</td>
                <td>static</td>
                <td>static</td>
                <td>Slowly together</td>
              </tr>
              <tr>
                <td>Duration (ASL=250ms)</td>
                <td>&gt;3000ms (G), &gt;700ms (O)</td>
                <td>260ms</td>
                <td>1500ms</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3 Representing Size</title>
      <p>
        Finally we consider semantic representation. A size is a property of a physical object,
generally represented as a value on a scale, where a scale is a partial ordering on a set
of elements. The majority of verbal size descriptions followed the recipe text, “the
size of a” small object, or simply mentioned a small object: walnut, half a walnut,
meatball. The comparative “...smaller” and the (negated) intensifier “don't make it too
big!” also occurred. The scale in this case seems to be based on the generics (types)
of ball-shaped food items, and the asserted relation is purely qualitative. Qualitative
representations [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ] may prove extensible. Gesture's spatial medium, by contrast,
is continuous rather than discrete; the underlying scale is tied to the visual or perhaps
kinesic system. What representation could plausibly be generated by the visual
system? Our preliminary work investigates low level features in the spirit of [
        <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
        ].
      </p>
      <p>Acknowledgments. This work was supported in part by NSF grant no. IRI-9109914.
K-E. McCullough, C. Sidner and R. Jacobs provided valuable discussion.
      </p>
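      <p>The notion of a verbal size scale as a partial ordering over generic food items, as described in this section, can be made concrete with a small sketch. The class QualitativeScale and the function transitive_closure are hypothetical names, and the only ordering asserted is that half a walnut is smaller than a walnut. The relation between walnut and meatball is deliberately left unspecified, since the corpus gives no evidence for it.</p>
      <preformat>
```python
from itertools import product


def transitive_closure(pairs):
    """Extend a set of (smaller, larger) pairs until it is transitive."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


class QualitativeScale:
    """A size scale as a strict partial order over named generics."""

    def __init__(self, elements, smaller_than):
        self.elements = set(elements)
        self.order = transitive_closure(smaller_than)

    def compare(self, a, b):
        """Return 'equal', 'smaller', 'larger' or 'incomparable'."""
        if a == b:
            return "equal"
        if (a, b) in self.order:
            return "smaller"
        if (b, a) in self.order:
            return "larger"
        return "incomparable"


# Generics mentioned by the subjects; only one ordering is assumed.
scale = QualitativeScale(
    ["half a walnut", "walnut", "meatball"],
    {("half a walnut", "walnut")},
)
print(scale.compare("half a walnut", "walnut"))
print(scale.compare("walnut", "meatball"))
```
      </preformat>
      <p>A partial rather than total order lets the representation stay silent about pairs the speakers never compared. A gesture, by contrast, would contribute a continuous measurement that has no direct counterpart in such a discrete scale.</p>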
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Supalla</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Serial verbs of motion in American Sign Language</article-title>
          . In S. Fischer (Ed.),
          <source>Theoretical Issues in Sign Language Research</source>
          . University of Chicago Press (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Hoiting</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Slobin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>From Gestures to Signs in the Acquisition of Sign Language</article-title>
          . In Duncan, S. D.,
          <string-name>
            <surname>Cassell</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
          </string-name>
          , E. T. (Eds.),
          <source>Gesture and the Dynamic Dimension of Language</source>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>66</lpage>
          . John Benjamins Publishing Company, Philadelphia (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Goldin-Meadow</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Gesture with Speech and Without It</article-title>
          . In Duncan Cassell Levy, pp
          <fpage>31</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>McNeill</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (Ed.),
          <source>Language and Gesture</source>
          , pp.
          <fpage>2</fpage>
          -
          <lpage>7</lpage>
          . Cambridge Univ. Press, New York (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harper</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quek</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Gesture Patterns during Speech Repairs</article-title>
          .
          <source>In Proc. Fourth IEEE International Conference on Multimodal Interfaces (ICMI'02)</source>
          , pp.
          <fpage>155</fpage>
          - (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Eisenstein</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barzilay</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Discourse topic and gestural form</article-title>
          . In Cohn, A. (Ed.):
          <source>Proceedings of the 23rd NCAI</source>
          , pp.
          <fpage>836</fpage>
          -
          <lpage>841</lpage>
          . AAAI Press (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Cassell</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakano</surname>
            ,
            <given-names>Y.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bickmore</surname>
            ,
            <given-names>T.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidner</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rich</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Non-verbal cues for discourse structure</article-title>
          ,
          <source>Proceedings of the 39th Annual Meeting on Association for Computational Linguistics</source>
          , pp.
          <fpage>114</fpage>
          -
          <lpage>123</lpage>
          . Toulouse (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Traum</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morency</surname>
            ,
            <given-names>L-P.</given-names>
          </string-name>
          :
          <article-title>Integration of Visual Perception in Dialogue Understanding for Virtual Humans in Multi-Party Interaction</article-title>
          .
          <source>In Proc. AAMAS</source>
          (in press). Toronto (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Rich</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ponsler</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holroyd</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidner</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          :
          <article-title>Recognizing Engagement in Human-Robot Interaction</article-title>
          .
          <source>In: Proc. Human-robot Interaction. Osaka</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>McNeill</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          : Gesture and Thought. University of Chicago Press, Chicago (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Tversky</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lozano</surname>
            ,
            <given-names>S. C.</given-names>
          </string-name>
          :
          <article-title>Gestures aid both communicators and recipients</article-title>
          . In K. Coventry,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bateman</surname>
          </string-name>
          , T. Tenbrink (Eds.),
          <source>Spatial language and dialogue</source>
          . Oxford: Oxford University Press (forthcoming)
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Goldin-Meadow</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Hearing gesture: How our hands help us think</article-title>
          . Cambridge, MA: Harvard University Press (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Beattie</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shovelton</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>When Size Really Matters</article-title>
          .
          <source>Gesture</source>
          ,
          <volume>6</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>63</fpage>
          -
          <lpage>84</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Hinkelman</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Spatiomotor Routines as Spontaneous Gestures</article-title>
          .
          <source>Spatial Cognition</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Lovett</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Forbus</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Shape is like Space: Modeling Shape Representation as a Set of Qualitative Spatial Relations</article-title>
          . AAAI Spring Symposium Series, North America, Mar.
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Bateman</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hois</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ross</surname>
            ,
            <given-names>R. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tenbrink</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <article-title>A Linguistic Ontology of Space for Natural Language Processing</article-title>
          . In Artificial Intelligence, in press (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Regier</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carlson</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          :
          <article-title>Grounding Spatial Language in Perception: An Empirical and Computational Investigation</article-title>
          .
          <source>Journal of Experimental Psychology</source>
          , Vol.
          <volume>130</volume>
          , No.
          <issue>2</issue>
          , pp
          <fpage>273</fpage>
          -
          <lpage>298</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Franconeri</surname>
            ,
            <given-names>S.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scimeca</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Helseth</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          :
          <article-title>Visual Spatial Relationship Representation as a sequence of attentional shifts</article-title>
          .
          <source>Subm. J. Cognitive Science.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>