<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Context-aware Speech Recognition in a Robot Navigation Scenario</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martin Hacker</string-name>
          <email>martin.hacker@cs.fau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Embedded Systems Initiative (ESI), Univ. of Erlangen-Nuremberg, Department Computer Science 8 (Artificial Intelligence)</institution>
          <addr-line>Haberstr. 2, 91058 Erlangen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Automatic speech recognition lacks the ability to integrate contextual knowledge into its optimization process, a resource that human speech perception makes extensive use of. We discuss shortcomings of current approaches to this problem, formalize the problem of context-aware speech recognition and understanding, and introduce a robot navigation game that can be used to demonstrate and evaluate the impact of context on speech processing.</p>
      </abstract>
      <kwd-group>
        <kwd>speech recognition</kwd>
        <kwd>speech understanding</kwd>
        <kwd>context</kwd>
        <kwd>robot navigation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>For the Artificial Intelligence community, Automatic Speech Recognition (ASR)
has been a very challenging task for several decades. Although substantial
progress has been made in both acoustic modeling and modeling of language use,
systems are still clearly outperformed by humans when employed under real-life
conditions. The main reason for this performance gap is that speech itself is
highly ambiguous and susceptible to noise and external audio events. What is
actually perceived by a human is already an interpretation of the ambiguous
signal, which depends heavily on expectations of what the speaker could say in the
current situation and on associations of the listener emanating from current
thoughts and perception of the environment.</p>
      <p>The same holds for the interpretation of a correctly perceived utterance that
is often ambiguous in spontaneous speech, but is not perceived as ambiguous as
long as the listener can infer the correct meaning in the current situation.</p>
      <p>The circumstances of this type that, taken as a whole, influence speech perception
and understanding are commonly referred to as the (relevant) context of an
utterance. The concept of context, however, is often either used in a vague way or
simplified in an ad-hoc manner. The result is that the influence of context is not
adequately integrated into state-of-the-art ASR systems, which is presumably
one of the main reasons for the above-mentioned performance gap.</p>
      <p>In this article, we discuss the shortcomings of current approaches by working
out the role of context in a robot navigation scenario. The remainder of the
article is structured as follows: After reviewing approaches to model contextual
influence in ASR, we formalize the problem of context-aware speech recognition
and understanding. Then we introduce the application scenario and our data
collection and evaluate the benefits of context-aware speech recognition using
a simple context model. We conclude with a discussion of challenges of context
modeling using selected examples and an outlook on open issues and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <sec id="sec-2-1">
        <title>Definitions of Context</title>
        <p>
          A frequently used definition of context was given by Dey [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]:
        </p>
        <p>Context is any information that can be used to characterise the situation
of an entity. An entity is a person, place, or object that is considered
relevant to the interaction between a user and an application, including
the user and applications themselves.</p>
        <p>Although this definition evokes associations with physical objects, it can
be extended to cover also mental concepts such as dialogues. A situation is a
concrete assignment of contextual variables, i. e. a set of attribute-value pairs
that describe the relevant properties of an entity at a certain time.</p>
        <p>
          Commonly, researchers distinguish between three types of context that are
relevant for speech processing (see, e.g., [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]):
        </p>
        <list list-type="bullet">
          <list-item><p>Discourse context: topic of the conversation, linguistic information from
preceding utterances referred to by the current utterance (ellipses, anaphora,
answering a question);</p></list-item>
          <list-item><p>Physical context: device states or physical sensor data including time and
location characterizing the current environmental and in-domain objects;</p></list-item>
          <list-item><p>Logical context: interpreted sensor data accumulated over multiple sensors
and/or over time (e.g. activity recognition), conclusions drawn from user
utterances or expected impact of system actions.</p></list-item>
        </list>
      </sec>
      <sec id="sec-2-2">
        <title>Context in Dialogue</title>
        <p>A traditional method to account for discourse context is to use separate language
models for pre-defined discourse situations. This approach is straightforward
for simple applications which engage finite-state-based dialogue models. For
example, in a system-initiated timetable information dialogue where the system
fills the necessary slots for the database query step by step, the vocabulary can be
divided into question-relevant parts to decrease the perplexity of the language
model.</p>
        <p>
          The construction of such context-dependent language models can be
automated if sufficient annotated training data is available. Among others, Bod [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]
used discourse context to foster both speech recognition and understanding in
a public transport information system. The corpus-based parsing model called
data-oriented parsing (DOP) was made context-dependent by exploiting only
training utterances made in the same discourse context as the current utterance.
The method was shown to slightly increase system accuracy, but it requires
syntactically and semantically annotated training corpora for the respective
application domain. As discourse context, only the preceding system utterance was
used. Thus, applying the method to applications providing more fine-grained
sets of system utterances or integrating richer discourse context would result in
sparse data problems due to over-fragmentation of the training corpus.
        </p>
        <p>
          The general principle of using separate language models can easily be
transferred to account for different environment situations derived from physical
context. Everitt et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] augmented speech recognition in a fitness room with
physical sensor information about which gym object is currently being used by the
speaker. The recognition grammar was split up into object-related grammars
from which a single one is selected depending on the sensor information. The
results are encouraging although the method proved to be susceptible to sensor
errors. However, it remains unclear how the derivation of sub-grammars can be
performed when more than one context variable is available.
        </p>
        <p>
          Leong et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] present a context-sensitive approach for controlling devices
in an intelligent meeting room. As an intermediate step before deep parsing,
ASR n-best lists are re-ranked by shallow analysis of user goals and their
contextual coherence. Bayesian networks are used to estimate the probability of a
user goal given the words in the utterance and contextual attributes. By using a
Bayesian network structure, system designers can directly integrate knowledge
on dependencies between contextual variables and linguistic entities to decrease
the number of parameters to be learned. Their results show a significant
improvement in disambiguating ASR output while the system is also enabled to
process deictic references and elliptical utterances.
        </p>
        <p>A potential shortcoming of the model used is that it is based on a
bag-of-words language model. Although the approach is general enough for richer
linguistic information to be included, it is unclear how deep linguistic
structures might fit into the Bayesian network architecture, assuming that a deep
understanding of such structures is necessary to model complex linguistic
context phenomena. What is more, the approach requires a pre-defined set of all
possible system actions. This is feasible for a room with a limited set of on-off
devices as in the meeting room scenario. It would, however, not scale up to
multiple parameterizable commands or even problem-solving dialogues. It is neither
feasible to construct Bayesian networks that account for all possible user goals
in such applications, nor would it be possible to collect sufficient training data.</p>
        <p>Summing up, all state-of-the-art approaches require either training data to
learn statistical language models or hand-crafted grammar solutions. Thus,
incorporating rich context for complex applications is not feasible, which underlines
the need to integrate explicit world and domain knowledge about causal
relations between entities in the environment and conversation. Logic seems to be an
appropriate means to model this knowledge. Using logic, however, it is difficult
to cope with uncertainty, ambiguity and vagueness, which are unavoidable in real
applications dealing with sensor data and spontaneous speech.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Context</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Formalizing the problem</title>
      <p>As mentioned in section 2.1, contextual information is commonly partitioned
into discourse, physical and logical context. The distinction between physical
and logical context, however, seems to be artificial since real physical sensor
data also needs to be interpreted. The distinction between discourse-related and
other context is based on the fact that dialogue participants can address entities
on various levels:</p>
      <list list-type="bullet">
        <list-item><p>The discourse level, e.g. by directly or indirectly referring to a linguistic
concept or a discourse element such as an utterance (e.g. "What did you
say?").</p></list-item>
        <list-item><p>The in-domain level, i.e. talking about entities addressed by the application.</p></list-item>
        <list-item><p>The out-of-domain level. Entities outside the application and discourse
domain are usually not referred to by utterances. However, they can have an
influence on what is said and how it is said.</p></list-item>
      </list>
      <p>Instead, we propose a distinction according to the stage of speech production
that is influenced by the contextual information, as this is more appropriate to
modeling contextual influence:</p>
      <list list-type="bullet">
        <list-item><p>Motivational context: Information that influences what people want to
express (e.g. the fact that the heating is running and the temperature is high
would cause the user to instruct the system to switch off the heating).</p></list-item>
        <list-item><p>Linguistic context: Information that influences how people phrase what they
want to express (e.g. discourse constituents already introduced enable the
speaker to use anaphoric references).</p></list-item>
        <list-item><p>Articulatory context: Information that influences how people articulate the
utterance (e.g. noise level or emotional state).</p></list-item>
        <list-item><p>Acoustic context: Simultaneous acoustic events and conditions that influence
the speech signal as perceived by the system.</p></list-item>
      </list>
      <sec id="sec-3-1">
        <title>Speech Recognition</title>
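        <p>The problem of speech recognition is commonly defined as follows:</p>
        <list list-type="bullet">
          <list-item><p>Assume that the audio signal spans the time interval from the beginning to
the end of the utterance of interest (this is done by segmentation).</p></list-item>
          <list-item><p>From the audio signal, a sequence of feature vectors
<inline-formula><tex-math><![CDATA[X = x_0, \ldots, x_{n-1}]]></tex-math></inline-formula> has
been extracted.</p></list-item>
          <list-item><p>Find the word chain <inline-formula><tex-math><![CDATA[\hat{w} \in W]]></tex-math></inline-formula> that correctly transcribes the utterance:
<disp-formula><label>(1)</label><tex-math><![CDATA[\hat{w} = \operatorname*{arg\,max}_{w \in W} P(w \mid X) = \operatorname*{arg\,max}_{w \in W} \frac{P(X \mid w)\, P(w)}{P(X)}]]></tex-math></disp-formula></p></list-item>
          <list-item><p>The denominator P(X) is independent of w and thus can be ignored by the
maximization process. Hence the best solution is the transcription that is the
best trade-off between the acoustic model P(X|w) and the language model
P(w).</p></list-item>
        </list>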
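        <p>The decision rule of Equation 1 can be illustrated with a minimal Python sketch. It is not the decoder of any particular recognizer: it assumes that a small set of candidate word chains with their acoustic and language model probabilities is already given, and the function name <monospace>decode</monospace> and all numbers are invented for illustration. Scores are added in the log domain, and P(X) is left out because it is identical for every candidate.</p>
        <preformat><![CDATA[
```python
import math


def decode(hypotheses):
    """Return the word chain w that maximizes log P(X|w) + log P(w).

    `hypotheses` maps each candidate word chain to a pair
    (acoustic_prob, lm_prob). P(X) is omitted because it is the
    same for every candidate and does not affect the argmax.
    """
    def score(item):
        _, (acoustic_prob, lm_prob) = item
        return math.log(acoustic_prob) + math.log(lm_prob)

    return max(hypotheses.items(), key=score)[0]


# Toy numbers, invented for illustration only.
candidates = {
    "turn left": (0.020, 0.30),  # acoustically slightly worse, but likely wording
    "turn lift": (0.025, 0.01),  # acoustically better, but unlikely wording
}
best = decode(candidates)
```
]]></preformat>
        <p>With these toy values the language model outweighs the small acoustic advantage of the second candidate, so <monospace>best</monospace> is "turn left".</p>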
      </sec>
      <sec id="sec-3-2">
        <title>Context-dependent Speech Recognition</title>
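        <p>Equation 1 describes which is the most probable sequence of words when nothing
is known about the utterance but the acoustic observation X. However, if the
context C of the utterance is known, we must reformulate the problem as follows:</p>
        <disp-formula><label>(2)</label><tex-math><![CDATA[\hat{w} = \operatorname*{arg\,max}_{w \in W} P(w \mid X, C) = \operatorname*{arg\,max}_{w \in W} \frac{P(X \mid w, C)\, P(w \mid C)}{P(X \mid C)}]]></tex-math></disp-formula>
        <disp-formula><label>(3)</label><tex-math><![CDATA[\hat{w} = \operatorname*{arg\,max}_{w \in W} \frac{P(X \mid w, C)\, P(C \mid w)\, P(w)}{P(X, C)}]]></tex-math></disp-formula>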
        <p>The context-dependent acoustic model P(X|w, C) can be substituted by the
traditional acoustic model P(X|w) if we assume that the pronunciation of a given
word chain is independent of any contextual factor. Apparently, this is not the
case as, for instance, the emotional state of the speaker, the level of background
noise and the distance from the listener have an influence on the audio signal
received by the listener (articulatory and acoustic context). A viable solution to
this problem would be to approximate P(X|w, C) by a set of acoustic models
P<sub>d</sub>(X|w) with d being disjoint classes of situations. In practice, this would mean
that a set of different speech recognizers (e.g. for different emotional states) is
available, from which the one is chosen that best fits the current context. Instead
of choosing one speech recognizer, multiple speech recognizers could be combined
by linear combination to allow for arbitrary states between the prototypes.</p>
        <p>The context-dependent language model P(w|C) can be modeled in two ways.
The one given in equation 2 is used by most approaches and uses different
language models P<sub>d</sub>(w) for different classes d of situations. Again, it would be
possible to engage a linear combination, but this seems to be quite unintuitive
for language models, as they indicate what can be said and not how it can be
said.</p>
        <p>An alternative model is given in equation 3: We keep the general language
model P(w) and multiply it by the posterior probability P(C|w). But how can this
distribution be estimated? We can easily find the values for C where P(C|w)
has a very small value or is even zero. They correspond to situations that would
not be considered possible when a user utters w, particularly situations that
are inconsistent with the meaning of the utterance. But it is difficult to estimate
P(C|w) for contexts that are considered possible given w.</p>
        <p>A simplifying solution would be to assign zero probabilities for contexts that
are inconsistent and a uniform distribution for all other contexts. This would
be equivalent to the commonly used approach of calculating the n-best lists with
a context-independent speech recognizer in a first step and, in a second step,
removing n-best results that turn out to be inconsistent with the context after
parsing.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Speech Understanding</title>
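        <p>Up to now, the task was to find the most probable sequence of words. This is
sufficient for a dictation task. When the system, however, needs to understand
the user utterance, it is interested in the user's goal rather than in its wording.
Two or more word chains may result from the same user goal at the pragmatic
level. To solve the problem of speech understanding, hence, we need to consider
all possible user goals g and sum up the probabilities of all possible wordings<sup>1</sup>:</p>
        <disp-formula><tex-math><![CDATA[\hat{g} = \operatorname*{arg\,max}_{g \in G} P(g \mid X, C) = \operatorname*{arg\,max}_{g \in G} \sum_{w \in W} P(w \mid X, C)\, P(g \mid w, C)]]></tex-math></disp-formula>
        <disp-formula><label>(4)</label><tex-math><![CDATA[\hat{g} = \operatorname*{arg\,max}_{g \in G} \frac{1}{P(X \mid C)} \sum_{w \in W} P(X \mid w, C)\, P(w \mid C)\, P(g \mid w, C)]]></tex-math></disp-formula>
        <disp-formula><label>(5)</label><tex-math><![CDATA[\hat{g} = \operatorname*{arg\,max}_{g \in G} \frac{P(g \mid C_M)}{P(X \mid C)} \sum_{w \in W} P(X \mid w, C)\, P(w \mid g, C_L)]]></tex-math></disp-formula>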
        <p>Solving this formula for the combined speech recognition and understanding
problem is computationally expensive as the goal must be computed for every
word chain. To overcome this difficulty, the problem is usually divided into two
sequential problems:</p>
        <sec id="sec-3-3-1">
          <title>1. First, the n best word chains are computed. 2. Then the goals are evaluated for the n best word chains. This means that the sum for all word chains in equation 4 is approximated by the sum for the n best word chains.</title>
          <p>
            Now, as the problem becomes computationally manageable, the question is how
the probability P (gjw; C) can be modeled. Learning it from examples as done
in [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]<sup>2</sup> is even more data-intensive than learning P(w|C) (which was discussed
above) and therefore not feasible for rich context models and complex goals.
          </p>
          <p>The sparse data problem can be alleviated (Equation 5) by transforming the
context-dependent language model and reducing the corresponding contexts to
the relevant classes as defined in section 3.1: For the user goal, only the
motivational context C<sub>M</sub>, and for the wording, only the linguistic context C<sub>L</sub> is
considered relevant. The goal prior probability P(g|C<sub>M</sub>) can be modeled
using knowledge about preconditions and effects of system actions as well as user
motivations. The context- and goal-dependent language model can be
approximated by P(w) if g can be derived from w by parsing and by anaphora and
ellipsis resolution using C<sub>L</sub>, and is set to 0 otherwise.</p>
          <p><sup>1</sup> Hereafter we assume that X and g are independent for a given w, i.e. every speech
recognition hypothesis w comprises the word chain together with all annotations of features
such as prosody that might depend on the current user goal.</p>
          <p><sup>2</sup> The authors transform the formula and learn P(w, C|g), but omit the summation.</p>
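          <p>The two-step approximation of Equation 5 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function <monospace>decode_goal</monospace>, the toy parser and all probabilities are hypothetical, the acoustic model is taken as context-independent, and P(w|g, C<sub>L</sub>) is approximated by P(w) whenever parsing w yields g and by 0 otherwise. The common factor 1/P(X|C) is dropped because it does not change the maximization.</p>
          <preformat><![CDATA[
```python
from collections import defaultdict


def decode_goal(nbest, parse, goal_prior):
    """Pick the goal g maximizing P(g|C_M) * sum_w P(X|w) * P(w|g, C_L).

    nbest:      list of (word_chain, acoustic_prob, lm_prob) triples
    parse:      function mapping a word chain to a goal, or None
    goal_prior: dict goal -> P(g|C_M) for the current situation
    The sum over all word chains is approximated by the n-best list.
    """
    mass = defaultdict(float)
    for words, acoustic_prob, lm_prob in nbest:
        goal = parse(words)
        if goal is not None:
            # P(w|g, C_L) is approximated by P(w) for parsable hypotheses.
            mass[goal] += acoustic_prob * lm_prob
    scores = {g: goal_prior.get(g, 0.0) * m for g, m in mass.items()}
    if not scores or max(scores.values()) == 0.0:
        return None  # no hypothesis is both parsable and plausible
    return max(scores, key=scores.get)


def toy_parse(words):
    """Hypothetical parser: maps a few keywords to goal labels."""
    tokens = words.split()
    if "stop" in tokens or "halt" in tokens:
        return "STOP"
    if "shoot" in tokens:
        return "SHOOT"
    return None


# Toy n-best list and goal prior, invented for illustration only.
nbest = [("shoot", 0.30, 0.2), ("stop", 0.25, 0.2), ("halt now", 0.20, 0.1)]
prior = {"STOP": 0.6, "SHOOT": 0.1}
goal = decode_goal(nbest, toy_parse, prior)
```
]]></preformat>
          <p>In this toy setting the two wordings of the stop goal pool their probability mass, and the goal prior favors stopping, so <monospace>goal</monospace> is "STOP" although the best-ranked word chain is "shoot".</p>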
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Work</title>
      <sec id="sec-4-1">
        <title>Robertino-game: A Simulated Robot Navigation Task</title>
        <p>
          Simulated robots are commonly used for AI research [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], as they are inexpensive
and easy to maintain. To construct naturalistic conditions, sensor data must
be simulated by noise and filters, and information must be hidden from the user
to constrain her knowledge of the situation to that part she would be able to
perceive in real environments. The robot can derive the current context from
the simulated sensor data. In our first experiments, we assume that the user has
full access to the context and the robot can reliably sense its direct
environment. The data can nevertheless be used to perform future investigations with the
robot using simulated sensors or further context such as maps that it constructs
progressively.
        </p>
        <p>Another difference engineers are faced with when simulating a robot is that
continuous movements must be replaced by a series of steps. For instance, a
rotation caused by a motor running for n frames with a certain power P would
be replaced by n discrete rotation steps with an increment depending on P.</p>
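        <p>The discretization described above can be sketched in a few lines of Python. The function name <monospace>rotate</monospace>, the linear relation between motor power and step size, and the constant <monospace>degrees_per_power_unit</monospace> are assumptions of this sketch and not taken from the game implementation.</p>
        <preformat><![CDATA[
```python
def rotate(orientation_deg, power, frames, degrees_per_power_unit=2.0):
    """Approximate a continuous rotation by one discrete step per frame.

    The increment per frame grows linearly with the motor power; both
    the linear law and degrees_per_power_unit are assumptions made
    for this illustration.
    """
    step = power * degrees_per_power_unit
    for _ in range(frames):
        # Keep the orientation within [0, 360) after every step.
        orientation_deg = (orientation_deg + step) % 360.0
    return orientation_deg
```
]]></preformat>
        <p>For example, <monospace>rotate(0.0, 1.5, 10)</monospace> performs ten steps of 3 degrees each, and a negative power turns the robot in the opposite direction.</p>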
        <p>The scenario used for our experiments is a game where one human player
navigates the robot through a fictitious maze-like environment (see Fig. 1) to a
pre-defined goal by only using speech commands. The environment consists of
walls, bombs, substitute rockets and colored areas painted on the floor. The robot
will explode when it runs into a bomb or a wall. It can perform the following
behaviours:</p>
        <list list-type="bullet">
          <list-item><p>Turning: The user can roughly specify the degree of rotation with linguistic
hedges (e.g. <italic>slightly to the left</italic>). Turning while the robot is driving will cause
the robot to drive along a curve.</p></list-item>
          <list-item><p>Driving: The user can instruct the robot to drive in a certain direction
relative to its orientation, without changing the orientation.</p></list-item>
          <list-item><p>Shooting: A rocket can be fired in the direction of the current orientation.
This can be used to destroy bombs that are in the way. Three rockets are
available after the game has been started. When a substitute rocket is passed
by the robot, it is automatically loaded.</p></list-item>
        </list>
      </sec>
      <sec id="sec-4-2">
        <title>Data Collection</title>
        <p>The data has been recorded in two sessions. The participants that played the
game were mainly high school students and university staff members. In the
first session, a close-distance speech microphone was used, while in the second
session the integrated microphone of a MacBook Pro was used as a far-distance
microphone. The participants attended the sessions in groups of 7–15 people,
which caused background talking and laughing that makes speech recognition
more challenging. In the second session, there was also continuing background
noise from a construction site outside of the building.</p>
        <p>The data includes the audio signal of each on-talk player utterance and its
reference transcription, as well as all robot actions. The reference transcription
was conducted by the presenter who attended the sessions. The utterances were
also annotated with the presumed user goal. The robot actions are reactions to
user utterances as they were understood by the system. Half of the games were
conducted following the Wizard of Oz method. In this part of the data, the robot
actions are reactions to user utterances as they were understood by the wizard,
which is very close to the reference annotation.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Speech understanding</title>
        <p>Table 1 shows a list of basic system actions that can be executed in every frame.
These basic actions are highly implementation-dependent. For the user, however,
it is irrelevant which robot actions will be executed in which frames. She would
articulate a more abstract description of the intended robot behaviour that is
independent from implementation details. Following common naming conventions,
we will refer to these described behaviours as user goals.</p>
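        <table-wrap>
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>Command</th><th>Parameter</th></tr>
            </thead>
            <tbody>
              <tr><td>TURN STEP LEFT</td><td></td></tr>
              <tr><td>TURN STEP RIGHT</td><td></td></tr>
              <tr><td>START RUNNING</td><td></td></tr>
              <tr><td>STOP RUNNING</td><td></td></tr>
              <tr><td>CHANGE DIRECTION</td><td>degree</td></tr>
              <tr><td>CHANGE VELOCITY RELATIVE</td><td>velocity diff</td></tr>
              <tr><td>CHANGE VELOCITY ABSOLUTE</td><td>velocity</td></tr>
              <tr><td>SHOOT</td><td></td></tr>
            </tbody>
          </table>
        </table-wrap>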
        <p>A list of possible user goals is shown in Table 2, annotated by its meaning
which is constructed by a mapping from utterances to sets of basic user
actions. The mapping can be realized by a simple keyword spotting and slot filling
approach, which is sufficient for most simple applications and therefore
implemented in most commercial speech-based systems. However, when the
application is more complex and the meaning of words and phrases depends on their
syntactic function within the sentence, it is necessary to employ more advanced
parsing and speech understanding techniques.</p>
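        <p>A keyword spotting and slot filling approach of the kind mentioned above might look like the following Python sketch. The keyword lists, the hedge values and the frame layout are hypothetical and only serve to illustrate the technique; they do not reproduce the lexicon or the goal inventory of the game.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical keyword lists; the actual lexicon of the game is not given.
GOAL_KEYWORDS = {
    "stop": ["stop", "halt"],
    "shoot": ["shoot", "fire"],
    "turn": ["turn", "rotate"],
    "move": ["move", "drive", "go"],
}
DIRECTIONS = ["left", "right", "forward", "backward"]
HEDGES = {"slightly": 0.5, "sharply": 2.0}  # assumed scaling factors


def understand(utterance):
    """Map an utterance to a goal frame by keyword spotting and slot filling."""
    tokens = re.findall(r"[a-z]+", utterance.lower())
    frame = {"goal": None, "direction": None, "amount": 1.0}
    # Keyword spotting: the first goal with a matching keyword wins.
    for goal, keywords in GOAL_KEYWORDS.items():
        if any(keyword in tokens for keyword in keywords):
            frame["goal"] = goal
            break
    # Slot filling: direction and linguistic hedge, if present.
    for token in tokens:
        if token in DIRECTIONS and frame["direction"] is None:
            frame["direction"] = token
        if token in HEDGES:
            frame["amount"] = HEDGES[token]
    return frame


frame = understand("Turn slightly to the left")
```
]]></preformat>
        <p>Because the sketch ignores word order and syntax, an utterance such as "do not stop" would still be mapped to the stop goal, which illustrates why more complex applications need proper parsing.</p>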
      </sec>
      <sec id="sec-4-4">
        <title>Evaluation of context awareness</title>
        <p>A part of the recorded data containing 199 utterances was used to investigate
the potential benefit of context-aware speech recognition. Each utterance was
processed by a speech recognizer producing up to 5 best hypotheses<sup>5</sup>. We used
a simple context model (see Table 3) to describe the game situation at the time
when the utterance was processed by the system<sup>6</sup>.</p>
        <p>Following the terminology of section 3.1, the context model contains
motivational context. To derive the possible user goals in a particular situation, we
built a very simple motivational user model with the following rules:</p>
        <p>Expected user goals:</p>
        <list list-type="bullet">
          <list-item><p>When an obstacle is coming closer, the user probably wants to stop the robot.</p></list-item>
          <list-item><p>When a bomb is in the line of fire in front of the robot and there are still
rockets available, the user will probably tell the robot to shoot.</p></list-item>
          <list-item><p>When a bomb is in the line of fire in direction d and there are still rockets
available, the user will probably tell the robot to turn towards d.</p></list-item>
        </list>
        <p>Implausible user goals:</p>
        <list list-type="bullet">
          <list-item><p>When an obstacle is in direction d, the user would not want the machine to
move towards d.</p></list-item>
          <list-item><p>When the robot is neither driving nor turning and the last goal<sup>7</sup> was not
<italic>stop</italic>, the user would not tell the machine to stop.</p></list-item>
          <list-item><p>When the robot is driving and the last goal<sup>7</sup> was not <italic>start moving</italic>, the user
would not tell the machine to start moving.</p></list-item>
          <list-item><p>When the robot is driving towards direction d and the last goal<sup>7</sup> was not
<italic>move towards d</italic>, the user would not tell the machine to move towards d.</p></list-item>
        </list>
        <p><sup>5</sup> There might be fewer than 5 hypotheses for short utterances when the speech
recognizer cannot find more hypotheses with acoustic confidence above a certain threshold
relative to the best hypothesis.</p>
        <p><sup>6</sup> In fact, the context is dynamic and can change during a user utterance. We used a
fixed representation of the context immediately after the utterance was spoken. This
was due to the fact that the users adapted to the system's latency by anticipating
situations and starting to speak before the command was applicable.</p>
        <p>For almost 20% of the utterances, the underlying user goal is expected and
can therefore be augmented by the system purely based on the context, i.e.
without using a speech recognizer. For a few utterances (3%), the expressed user
goal is implausible given the situation. The latter can be explained by the fact
that user behavior is not always reasonable. In most of the present cases, the
user was confused by the relative orientation of the robot and accidentally said
<italic>right</italic> instead of <italic>left</italic> (and vice versa). The n-best lists were re-ranked
with the following strategy:</p>
        <preformat>if n-best-list contains expected utterances
    choose the best-ranked one
elseif n-best-list contains utterances that are not implausible
    choose the best-ranked one
else
    reject utterance</preformat>
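        <p>The selection strategy above can be written as a short Python function. This is a sketch rather than the implementation used in the evaluation: it assumes that each n-best hypothesis has already been mapped to a goal label and that the motivational user model supplies the sets of expected and implausible goals for the current situation; the function name and the goal labels in the tests are hypothetical.</p>
        <preformat><![CDATA[
```python
def select_hypothesis(nbest_goals, expected, implausible):
    """Apply the selection strategy to an n-best list of goal hypotheses.

    nbest_goals: goals of the n-best hypotheses, best-ranked first
    expected:    set of goals the user model expects in this situation
    implausible: set of goals the user model rules out in this situation
    Returns the chosen goal, or None if the utterance is rejected.
    """
    # First preference: the best-ranked hypothesis that is expected.
    for goal in nbest_goals:
        if goal in expected:
            return goal
    # Second preference: the best-ranked hypothesis that is not implausible.
    for goal in nbest_goals:
        if goal not in implausible:
            return goal
    # Every hypothesis is implausible (or the list is empty): reject.
    return None
```
]]></preformat>
        <p>A hypothesis that is expected is thus promoted over better-ranked alternatives, while an utterance is rejected only when every hypothesis in the list is implausible.</p>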
        <p>When this strategy is applied, 9.2% of the false first hypotheses are corrected,
and an additional 5.3% are rejected. However, 3.3% of the correct first hypotheses
are replaced by a false one, and an additional 4.1% are rejected. Most false rejections
and false corrections, though, correspond to utterances where the expressed user
goal differs from the intended user goal. This includes the above-mentioned user
mistakes.</p>
        <p><sup>7</sup> Some rules were extended by a condition regarding the last user goal. This was
necessary because of the latency of the system and the delayed reaction of the user:
Sometimes the user repeated the command immediately after or at the same time
as it was executed by the system.</p>
        <p>The figures seem to suggest that contextual information only slightly
increases ASR accuracy. We nevertheless think that the results are encouraging because
both the contextual model and the motivational user model used for the evaluation were
very simple. Only a few of the possible user goals were investigated with respect
to their contextual coherence. We are convinced that a richer context model and
an advanced user model that allows deep reasoning over multiple dialogue turns
would have the potential to noticeably improve ASR performance.</p>
        <p>
          Moreover, the classification and re-ranking algorithms we applied are very
basic. Allowing for gradual values of contextual coherence and integrating
features from other knowledge sources like acoustics and syntax would enable us to
use powerful machine learning techniques for error correction (cf. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]).
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Challenges of Context Modeling for Speech Understanding</title>
      <p>In the robot navigation scenario, the user is allowed to give instructions for
future behavior depending on conditions, e. g. by saying
Example 1. move slowly left to the beginning of the red area and then turn right
and follow the wall.</p>
      <p>When such combinations of user goals are communicated, the system is faced
with the following problems:</p>
      <p><italic>Persistence of user goals:</italic> For some goals, it is unclear how long they are
valid unless they are replaced by a new incompatible goal. In Example 1, it is
unclear whether the system should continue going slowly when following the
wall. The problem is also present when simple user goals are communicated
in a series of utterances:</p>
      <sec id="sec-6-1">
        <title>Example 2. Move forward until you reach the red area.</title>
        <p>Stop! (before the robot has reached the red area)</p>
        <p>Continue moving.</p>
        <p>→ Should the system still stop at the red area as said before
the stop command?</p>
        <p><italic>Future context:</italic> When calculating the contextual coherence of a future user
goal, the system must take the anticipated future context into account rather
than the current context. The future context depends on the effects of the
behavior performed up to the time the goal of interest becomes active. This
time, however, is unknown because it depends on some (sensory) conditions
that cannot be evaluated before a future state is reached. Assuming that the
user utterance can reliably be segmented into goal-specific parts, it would be
possible to simplify the problem by not evaluating parts before they become
active. However, when we consider why the user uttered a series of goals at
once rather than waiting for the preceding actions to be completed, we must
admit that this often happens because the user considers the future goals as
relevant for the interpretation of the preceding goals. In other words, the
announced future actions are part of the relevant context for the understanding
of the preceding actions.</p>
        <p><sup>8</sup> The frame immediately following the completion of the understanding process is
denoted by <italic>cur</italic>. Frames can also be declared by conditions; such a declaration denotes the first
future frame in which the condition is fulfilled.</p>
        <p>These considerations touch on the borderline between speech understanding
and planning, two issues that are commonly treated as sequential but in fact
are highly interactive.</p>
        <p><italic>Partial knowledge:</italic> Speech input can be viewed as a sensor itself, with either
the speaker giving information about the context directly (a) or the system
deriving such information from the utterance (b).</p>
        <p>Example 3. (a) There's a wall to the left.</p>
        <p>
          (b) Follow the wall. (! there must be a wall)
{ Subjectivity: User and system can have di erent beliefs about contextual
entities unless both have acknowledged the fact to be part of the common
ground [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. When the system uses context for speech understanding to nd
motivational evidence for an utterance, it must use the user's subjective
context instead of its own. This means that the system needs to keep track
of hypotheses about the user's beliefs.
        </p>
        <p>
- Uncertainty: The system's beliefs about contextual entities are partially
derived from sensor data and therefore might be wrong. When wrong context
is used for speech understanding, the correct interpretation of the user goal
might be rejected for lack of contextual coherence. A major challenge is
to design systems that are aware of potential misrecognitions and can
update their context model in such situations. When the system cannot find
enough acoustic and contextual evidence for any hypothesis, there are three
possible reasons:
1. It was not possible to understand the utterance (e.g. it was out of domain or
superimposed by other acoustic events).
2. The system's beliefs about the (user) context are wrong and need to be
updated using information from the acoustically evident but
contextually incoherent hypothesis.
3. The user's beliefs about the context are wrong and differ from the
system's beliefs about the user context. Hence it is up to the system to
clarify the facts in a dialogue with the user.
        </p>
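        <p>The three cases listed under Uncertainty can be read as a small decision procedure. The following Python fragment is only a sketch under assumptions that go beyond the text: the score ranges, the thresholds, and the names Hypothesis, Diagnosis, diagnose and belief_confidence are hypothetical. In particular, separating case 2 from case 3 by the system's confidence in its own sensor-derived beliefs is an assumed heuristic, not a method proposed here.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass
from enum import Enum


class Diagnosis(Enum):
    ACCEPT = "accept the hypothesis"
    NOT_UNDERSTOOD = "case 1: utterance could not be understood"
    UPDATE_CONTEXT = "case 2: revise the system's context model"
    CLARIFY_WITH_USER = "case 3: clarify the facts in a dialogue"


@dataclass
class Hypothesis:
    text: str
    acoustic: float           # acoustic evidence, assumed to lie in [0, 1]
    coherence: float          # coherence with the assumed user context, in [0, 1]
    belief_confidence: float  # assumed confidence in the beliefs this hypothesis contradicts


def diagnose(hypotheses, accept_at=0.5, acoustic_at=0.7, trust_at=0.8):
    """Return (diagnosis, hypothesis or None). All thresholds are illustrative."""
    # Enough joint acoustic and contextual evidence: accept the best hypothesis.
    best = max(hypotheses, key=lambda h: h.acoustic * h.coherence, default=None)
    if best is not None and best.acoustic * best.coherence >= accept_at:
        return Diagnosis.ACCEPT, best

    # Case 1: no hypothesis is acoustically evident.
    evident = [h for h in hypotheses if h.acoustic >= acoustic_at]
    if not evident:
        return Diagnosis.NOT_UNDERSTOOD, None

    # An acoustically evident hypothesis exists but is contextually incoherent.
    candidate = max(evident, key=lambda h: h.acoustic)
    if candidate.belief_confidence >= trust_at:
        # Assumed heuristic: the system trusts its beliefs, so the user may be mistaken.
        return Diagnosis.CLARIFY_WITH_USER, candidate
    # Otherwise the system's own context model is the more likely source of error.
    return Diagnosis.UPDATE_CONTEXT, candidate
```
]]></preformat>
        <p>For instance, a clearly recognized "Follow the wall" in a situation where the context model contains no wall would fall into case 2 or case 3 of this sketch, depending on how much the system trusts its sensor readings.</p>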
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Summary and Future Work</title>
      <p>In this paper, we formalized the problem of context-aware speech understanding
and introduced a speech-controlled robot navigation game. For this application,
we demonstrated how speech recognition can benefit from contextual information
and discussed challenges of context modeling using selected examples.</p>
      <p>
        Extending the application with a context model is ongoing work.
Future work will focus on building rich contextual and motivational models and
integrating the contextual knowledge into speech recognition. In a second step,
we will compare the context-aware system to human speech perception using the
method described in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bod</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Context-sensitive spoken dialogue processing with the DOP model</article-title>
          .
          <source>Nat. Lang. Eng</source>
          .
          <volume>5</volume>
          ,
          <fpage>309</fpage>–<lpage>323</lpage>
          (
          <year>December 1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>H.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brennan</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          :
          <article-title>Grounding in communication</article-title>
          . In:
          <string-name>
            <surname>Resnick</surname>
            ,
            <given-names>L.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levine</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teasley</surname>
            ,
            <given-names>S.D.</given-names>
          </string-name>
          (eds.)
          <source>Perspectives on socially shared cognition</source>
          , pp.
          <fpage>127</fpage>–<lpage>149</lpage>
          . American Psychological Association, Washington, DC (
          <year>1991</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dey</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          :
          <article-title>Understanding and using context</article-title>
          .
          <source>Personal Ubiquitous Comput.</source>
          <volume>5</volume>
          ,
          <fpage>4</fpage>–<lpage>7</lpage>
          (
          <year>January 2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Everitt</surname>
            ,
            <given-names>K.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harada</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bilmes</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Landay</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          :
          <article-title>Disambiguating speech commands using physical context</article-title>
          .
          In:
          <source>Proceedings of the 9th International Conference on Multimodal Interfaces</source>
          . pp.
          <fpage>247</fpage>–<lpage>254</lpage>
          . ICMI '07, ACM, New York, NY, USA (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hacker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elsweiler</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ludwig</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Investigating human speech processing as a model for spoken dialogue systems: An experimental framework</article-title>
          . In:
          <string-name>
            <surname>Coelho</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Studer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wooldridge</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (eds.)
          <source>Proceedings of the 19th European Conference on Artificial Intelligence (ECAI 2010)</source>
          .
          <source>Frontiers in Artificial Intelligence and Applications</source>
          , vol.
          <volume>215</volume>
          , pp.
          <fpage>1137</fpage>–<lpage>1138</lpage>
          . IOS Press, Amsterdam, The Netherlands
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hugues</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bredeche</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Simbad: An autonomous robot simulation package for education and research</article-title>
          .
          In:
          <source>Proceedings of the Ninth International Conference on the Simulation of Adaptive Behavior (SAB'06)</source>
          . Roma, Italy. Springer's Lecture Notes in Computer Sciences / Artificial Intelligence series (LNCS/LNAI), pp.
          <fpage>831</fpage>–<lpage>842</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Leong</surname>
            ,
            <given-names>L.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobayashi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koshizuka</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sakamura</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>CASIS: a context-aware speech interface system</article-title>
          .
          In:
          <source>Proceedings of the 10th International Conference on Intelligent User Interfaces</source>
          . pp.
          <fpage>231</fpage>–<lpage>238</lpage>
          . IUI '05, ACM, New York, NY, USA (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Skantze</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Edlund</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Early error detection on word level</article-title>
          .
          In:
          <source>Proceedings of ITRW on Robustness Issues in Conversational Interaction</source>
          . Norwich, UK (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Wai</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pieraccini</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>A dynamic semantic model for re-scoring recognition hypotheses</article-title>
          .
          <source>Acoustics, Speech, and Signal Processing, IEEE International Conference on</source>
          <volume>1</volume>
          ,
          <fpage>589</fpage>–<lpage>592</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>