<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Two New Modalities for ARIA</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martin Franke</string-name>
          <email>Martin.Franke4@mailbox.tu-dresden.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gerhard Weber</string-name>
          <email>gerhard.weber@tu-dresden.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jens Voegler</string-name>
          <email>jens.voegler@tu-dresden.de</email>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Accessibility of rich internet applications is ensured by ARIA and assistive technology supporting this additional mark-up. We propose MM-ARIA for the integration of multiple types of assistive technology while supporting multimodal interaction. The Dojo toolkit's calendar widget is an example of a complex, non-standard widget requiring ARIA mark-up. MM-ARIA generates grammars for speech recognition and integrates gestural input for navigation in the calendar or selecting an item. MM-ARIA relies on the multimodal mark-up language EMMA and an interpreter for parallel input. Our proof-of-concept application demonstrates the extensibility of widgets described by ARIA.</p>
      </abstract>
      <kwd-group>
        <kwd>Browser</kwd>
        <kwd>speech</kwd>
        <kwd>gesture</kwd>
        <kwd>assistive technology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Rich Internet Applications are becoming increasingly important for new websites. Their
advantage over plain forms is that dynamic content can be presented to the user
without loading a new web page, for example explaining missing information in
a partially completed contact form, or refreshing a news ticker automatically. It
is also possible to use novel types of interaction, for example gestures for
navigation, or moving an item directly within a map. Additional modalities beyond
mouse and keyboard, such as speech input, are on the horizon as new browsers
support access to the microphone.</p>
      <p>But often these advantages are not accessible to disabled users or even create
additional barriers. For example, assistive technologies like screen readers only
represent a snapshot of a website and recognize changes when the website is
refreshed or a new page is loaded. Gestures and interactions like drag and drop
require visual feedback and are time-dependent, which may be a barrier for
blind and motor-disabled users. Another disadvantage is that gestures are not
consistent among different websites and browsers.</p>
      <p>
        Accessible Rich Internet Applications (ARIA) is a markup language for
developing accessible websites for assistive technologies. The main goals of ARIA
are [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]:
– expanding the accessibility information that may be supplied by the author,
– requiring that supporting host languages provide full keyboard support that
may be implemented in a device-independent way, for example, by
telephones, handheld devices, e-book readers, and televisions,
– improving the accessibility of dynamic content generated by scripts, and
– providing interoperability with assistive technologies.
      </p>
      <p>With ARIA, websites such as Facebook become by and large accessible, but
navigation for blind users, to name one user group, is currently only possible by
keyboard. This paper will show how ARIA can be used to develop multi-modal
websites which can be controlled by speech, keyboard and mouse jointly. Solving
this problem requires interpreting ARIA attributes as a semantic enhancement
of HTML elements while ensuring accessibility for people who benefit from the
availability of one or more of these modalities.</p>
    </sec>
    <sec id="sec-2">
      <title>State Of The Art</title>
      <p>
        The main goal of ARIA is the improvement of accessibility for dynamic web content.
Dynamic changes can have inappropriate effects for users who cannot use the
mouse or have limited access to the keyboard. The worst case for these users
is an inoperable state in a web application. ARIA defines states and properties
of HTML elements and landmarks for navigation of the website [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. These new
attributes for HTML elements may be accessed by Assistive Technology (AT),
such as a screen reader. Each AT implements a separate but accessible user
interface based on the extra mark-up such as:
– Roles for types: button, dialog, gridcell, menuitem, menu, scrollbar, tree, tab, tabpanel
– Roles for structure: banner, navigation, main, search
– Properties for showing the state of one object, like pressed, or its value
      </p>
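      <p>As a minimal illustration of this extra mark-up (the selector and the logging below are ours, for illustration only; AT itself obtains the same information through the platform accessibility APIs), a script can read roles and states directly from the DOM:</p>
      <preformat>
// Minimal sketch: reading WAI-ARIA roles and states from the DOM.
// AT normally receives this information via accessibility APIs; the
// DOM query below only illustrates what the extra mark-up exposes.
var buttons = document.querySelectorAll('[role="button"]');
Array.prototype.forEach.call(buttons, function (btn) {
  var pressed = btn.getAttribute('aria-pressed');      // state, e.g. "true"
  var labelId = btn.getAttribute('aria-labelledby');   // reference to a label element
  var label = labelId ? document.getElementById(labelId).textContent : btn.textContent;
  console.log('button "' + label + '" pressed: ' + pressed);
});
</preformat>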
      <p>Additional markup requires developers to manually enhance their code.
Existing HTML and JavaScript will hardly be modified, but new JavaScript
frameworks and toolkits may be developed with ARIA support from the beginning.
They are deployed with a set of widgets, already tested against ARIA specifications
and often with AT or even by users relying on AT. One of these frameworks
is the Dojo toolkit.</p>
      <sec id="sec-2-1">
        <title>Dojo Toolkit</title>
        <p>
          The Dojo Toolkit is a JavaScript framework, developed by the Dojo Foundation
since 2004 and currently in version 1.6. It is class-oriented and modularized
into widgets. Each widget consists of HTML, CSS and JavaScript files. A wide
variety of widgets is available for professional internet applications. Most of
these widgets have been evaluated to be accessible; these include basic widgets like
buttons, sliders or validating input fields, and also complex widgets such as
Calendar, Dialog, Editor, InlineEditBox, Menu, MenuBar, ProgressBar, TabPanel,
Toolbar, Tooltip and Tree [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Within the Dojo toolkit, a life cycle manager
controls the widgets and a data connector simplifies the exchange of data between the
widgets and an application.
        </p>
        <p>Building a flexible, accessible rich internet application based on multiple
widgets requires establishing well controllable input facilities while supporting
dynamic changes of the status of widgets. As in any graphical user interface,
the flow of control in Dojo is based on events. The next step towards a
multimodal system is the collection of events to prepare the synchronization of new
modalities for interaction with ARIA-based widgets.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Collecting Events</title>
        <p>
          Most of the events caused by user input can easily be caught by using JavaScript.
jQuery (http://www.jquery.org) is one of the most used frameworks for this purpose. With this
framework one can easily catch keyboard input, such as key presses, or mouse input,
such as clicks or the position of the mouse cursor. Via an additional script [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], it
is also possible to recognize mouse gestures (see Section 3.2). A limitation of
jQuery is the lack of handling microphone input and speech recognition. None
of the current browsers is able to natively capture the stream of data arising from
microphone input. However, the HTML5 specifications include such a feature and
soon industry-standard browsers will support audio input [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. One approach for
adding speech processing to browsers is the Web-Accessible Multimodal
Applications framework (WAMI) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. WAMI applications are not built with accessibility
in mind, and there is also no use of WAI-ARIA. Still, a switch among modalities
is feasible. City Browser is a sample implementation using WAMI with actual
multi-modal interaction. It combines mouse and speech input within a browser
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
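        <p>A minimal sketch of this event collection with jQuery is given below; the forwardEvent function that passes the captured data on to a server connector is hypothetical and only indicates where the data would go:</p>
        <preformat>
// Sketch: collecting keyboard and mouse events with jQuery.
// forwardEvent is a hypothetical function handing the data to the server connector.
$(document).on('keydown', function (e) {
  forwardEvent({ mode: 'keys', key: e.which, time: Date.now() });
});
$(document).on('click', function (e) {
  forwardEvent({ mode: 'gui', x: e.pageX, y: e.pageY, time: Date.now() });
});
</preformat>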
        <p>WAMI provides an implementation of a client-server concept, catching
microphone streams in the client and recognizing speech input in the server.
The client is implemented as a Java applet and is therefore nearly platform- and
browser-independent. The applet records microphone data as an MP3 file and
sends it to a server. This server checks the recorded stream against a predefined
grammar and sends the recognized words as a JavaScript object back to the
client application. With this object, functions are triggered on the client side. A
WAMI server can be deployed on a Java Servlet Container, like Apache Tomcat
(http://tomcat.apache.org/) or jetty (http://jetty.codehaus.org/jetty/), and uses an underlying
speech recognizer, for example CMU Sphinx (http://cmusphinx.sourceforge.net/)
or MIT's own City Browser speech server.</p>
        <p>To summarize, we have presented approaches for additional mark-up based
on WAI-ARIA to get an accessible RIA, the recognition of standard user
input through events, and an approach for handling speech input in browsers.
Multimodality may be embedded into widgets at this point; however, we aim at a
modular approach using mark-up to model multimodality independently of
specific modalities while preserving the accessibility features of widgets. One well-known
approach to multimodality recommended by W3C is EMMA, the Extensible
MultiModal Annotation markup language.</p>
      </sec>
      <sec id="sec-2-3">
        <title>EMMA</title>
        <p>
          EMMA has been a W3C Recommendation since 2009, with the main goal of establishing
multimodal interactions in the World Wide Web. It is a descriptive XML-based
language for processing multimodal input. An EMMA message consists of the
underlying modality-specific mark-up embedded within EMMA-specific mark-up.
The main concepts are the interpretation and the group element. An
interpretation element describes the input modality and consists of [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]:
– id, unique identification
– tokens, the whole input string
– process, reference to the processor, like a speech recognizer
– no-input, true if the input string is empty
– uninterpreted, true if an error occurred in the processor
– signal, reference to the interpreted file, like the MP3 file
– media-type, media type of the signal
– confidence, confidence of the right interpretation from 0.0 to 1.0
– source, reference to the application which invoked the processor
– start, end, duration, points of the start/end of the event and the duration
– medium, acoustic, tactile or visual
– mode, modality of the signal, like voice, dtmf, gui, keys, video, ...
– function, function of the signal, like recording, transcription, dialog, verification, ...
– verbal, true or false
– cost, cost of the interpretation, such as CPU time, and
– dialog-turn, placeholder for application-dependent implementation
        </p>
        <p>EMMA thereby supports a wide range of modalities and enables flexible
interpretations without imposing restrictions on the temporal granularity.</p>
        <p>
          The group element is a container element and consists of interpretation
elements belonging together in a temporal sense; typically they are considered
as being parallel. With this element the application knows which modalities
should be interpreted as a whole and which remain independent. Our multimodal
time model for recognition originates from [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and can be summarized as
follows. Two types of synchronization of modalities are possible, overlapped
and non-overlapped events. Overlapped events are events that are raised at the
same time. Non-overlapped events are sequential events, but still belong to
one multimodal interaction. Overlapped events are synchronized automatically;
non-overlapped events are processed according to the following synchronization
rules. These rules are based on the assumption of some maximum duration of
user input. For the purpose of this work we consider only intra-gestures; extra-gestures
may nevertheless be added.
        </p>
        <p>Fig. 1. Temporal model for non-overlapped events: maximum intervals of 1-3s and 5s between the combinations SGI (speech command + intra-gesture), GIGI (two intra-gestures), SS (two speech commands) and GIS (intra-gesture + speech command).</p>
        <p>Intra-gestures are described as any non-verbal tactile event which belongs to
the previous or next event. Every event outside of these time intervals is an
extra-gesture and is therefore interpreted independently. The interactions sketched in
Fig. 1 are multimodal and integrate two or more modalities based on the
temporal model. An example following the rules in Fig. 1 may start with a speech
command. If the next event (such as an intra-gesture) starts within 3s after the
end of the spoken input, it will be grouped as two emma:interpretation elements in one
emma:group element.</p>
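        <p>The following sketch indicates how such a group could be assembled on the client with standard DOM APIs; the element and attribute names follow the EMMA vocabulary listed above, while the buildGroup function and the structure of its two inputs are assumptions made for illustration:</p>
        <preformat>
// Sketch: combining a speech interpretation and an intra-gesture interpretation
// into one emma:group element (buildGroup and its inputs are hypothetical).
var EMMA_NS = 'http://www.w3.org/2003/04/emma';

function buildGroup(speech, gesture) {
  var doc = document.implementation.createDocument(EMMA_NS, 'emma:emma', null);
  doc.documentElement.setAttribute('version', '1.0');
  var group = doc.createElementNS(EMMA_NS, 'emma:group');
  doc.documentElement.appendChild(group);

  [speech, gesture].forEach(function (input) {
    var interp = doc.createElementNS(EMMA_NS, 'emma:interpretation');
    interp.setAttribute('id', input.id);                            // unique identification
    interp.setAttributeNS(EMMA_NS, 'emma:tokens', input.tokens);    // e.g. "delete this"
    interp.setAttributeNS(EMMA_NS, 'emma:medium', input.medium);    // acoustic or tactile
    interp.setAttributeNS(EMMA_NS, 'emma:mode', input.mode);        // e.g. voice or gui
    interp.setAttributeNS(EMMA_NS, 'emma:start', input.start);      // timestamps used by
    interp.setAttributeNS(EMMA_NS, 'emma:end', input.end);          // the temporal model
    interp.setAttributeNS(EMMA_NS, 'emma:confidence', input.confidence);
    group.appendChild(interp);
  });
  return new XMLSerializer().serializeToString(doc);
}
</preformat>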
        <p>This approach to a temporal model independent of modalities and widgets
lays the foundations for Multimodal Accessible Rich Internet Applications
(MM-ARIA). We consider an application to be an MM-ARIA if multiple and
alternative modalities can be utilized when interacting with a web application, while
assistive technologies are supported.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Implementing MM-ARIA</title>
      <p>MM-ARIA is based on the modular extension of a client-server architecture. The
client application consists of JavaScript files binding events and connecting with
the server. The server is application-independent and consists of recognizers and
a web service to generate EMMA documents. Fig. 2 gives an overview of the
individual parts and modules as well as the flow of control.</p>
      <p>An event together with its event data, such as a keyboard event or a spoken
command, is sent to the server connector (1). The server connector routes the
event data to the intended recognizer. A recognizer converts the event data
to a JSON object and sends it back to the server connector (2). Only the server
connector needs to know the endpoint of the EMMA web service; it prepares the
JSON object and sends it to the EMMA web service. This web service waits for
upcoming events according to the temporal model (3), combines them into an
EMMA document and sends it back to the server connector (4). The connector
routes the EMMA document to the interpreter, which interprets the document
and triggers the appropriate application actions with the parameters determined
during processing of the event (5).</p>
      <p>Fig. 2. Overview of the individual parts and modules of MM-ARIA (event binding, client processors, server connector, interpreter and application actions) and the flow of control (1)-(5).</p>
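      <p>A minimal sketch of the client-side server connector follows; the endpoint URLs, the recognizer path and the Interpreter.handle entry point are assumptions used for illustration, not the actual interface of the implementation:</p>
      <preformat>
// Sketch of the server connector (endpoints and helper names are hypothetical).
var ServerConnector = {
  emmaService: '/emma/events',    // assumed endpoint of the EMMA web service
  send: function (recognizer, eventData) {
    // (1)+(2) route the event data to the intended recognizer, which returns a JSON object
    $.post('/recognizers/' + recognizer, eventData, function (result) {
      // (3)+(4) forward the JSON object to the EMMA web service, which applies the
      // temporal model and returns the combined EMMA document
      $.post(ServerConnector.emmaService, result, function (emmaDocument) {
        // (5) hand the EMMA document to the interpreter
        Interpreter.handle(emmaDocument);
      });
    });
  }
};
</preformat>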
      <p>We apply the following `algorithmic' procedure for developing the actual
mapping between modalities and widgets described by ARIA mark-up (a sketch of
the first steps is given below the list):
1. search for roles,
2. determine typical interactions,
3. make a mapping from roles to interactions, like enhancement of gestures and speech grammar,
4. determine contextual interactions and attributes, like buttons and aria-labelledby,
5. make a possible enhancement in gestures and grammar, and
6. develop multimodal approaches to simplify the interactions.</p>
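      <p>The sketch below illustrates the first three steps; the mapping entries and the registerInteractions helper are illustrative placeholders for the mappings developed in the next section:</p>
      <preformat>
// Sketch of steps 1-3: scan the DOM for WAI-ARIA roles and attach typical
// interactions (the table entries and registerInteractions are illustrative).
var roleMap = {
  grid: { gestures: ['swipe left', 'swipe right'], speech: ['edit this', 'delete this'] },
  tree: { gestures: [], speech: ['expand all', 'collapse all'] },
  button: { gestures: [], speech: [] }   // speech filled later from aria-labelledby
};

$('[role]').each(function () {
  var role = $(this).attr('role');   // step 1: search for roles
  var mapping = roleMap[role];       // step 2: typical interactions for this role
  if (mapping) {
    registerInteractions(this, mapping);   // step 3: bind gestures and grammar entries
  }
});
</preformat>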
      <p>A more elaborate example is described in the following discussion of
dijit-calendar. This example demonstrates the procedure and also the shortcomings
of using WAI-ARIA.</p>
      <sec id="sec-3-1">
        <title>Multimodal Interaction with Calendar Widget</title>
        <p>Dijit-calendar consists of one table with the role='grid'. Calendar days are tagged
with role='gridcell' and the weekly header with role='columnheader'. As there is
no WAI-ARIA role for a calendar, the calendar must be interpreted as a special
kind of table. Usual interactions on a table are the selection of one item in order to insert,
edit or delete it. Selection may be performed by navigating forward or backward
from one table item to another. We need to algorithmically develop a
mapping between WAI-ARIA roles and such an interaction. For dijit-calendar,
we developed a mapping between role='grid' and gestures such as swipe left
and right as well as spoken commands. This mapping matches well with the
effect of selecting next and previous items. Similarly, we map role='grid' with
the possibility to mark an item and insert a new one, edit the current one or
delete it. Appropriate spoken commands are edit this or delete this. In a
multimodal system users may want to refer to an element also by the mouse or
keyboard while saying `this'.</p>
        <p>Multi-modal operation on dijit-calendar links swipe gestures and spoken
commands. The spoken command year followed by a swipe gesture will increase or
decrease the year respectively. Similarly, month followed by the same swipe
gesture will change the month shown.</p>
        <p>More complex mappings are generated from WAI-ARIA properties of buttons
or drop-down lists by identifying the elements and their corresponding
aria-labelledby element. Fig. 3 shows the result of the analysis of other sample Dojo
widgets.</p>
        <p>Fig. 3. Analysis of sample Dojo widgets. Widgets: calendar, dialog, inlineeditbox, menu, tree and the landmark roles. Roles: grid, gridcell, columnheader, row (calendar); dialog, button (dialog); button (inlineeditbox); menu, menubar, menuitem (menu); tree, group, treeitem (tree); navigation, search (landmark roles). Gestures: swipe, treehitch. Speech commands: delete this, edit this, this row, this cell, this column, sort by, button, minimize, maximize, close, collapse all / this, expand all / this.</p>
        <p>These findings are examples covering the most practically used roles of WAI-ARIA.
The table in Fig. 3 can simply be extended with further roles, gestures and grammars
for a generic system.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Building the grammar</title>
        <p>
          One goal of our approach is to enhance ARIA with speech control, in
particular with WAMI. WAMI requires a grammar, modeled in JSGF syntax. Our analysis
shows that control elements are defined in two different ways. On the one hand,
WAI-ARIA roles are set (see Fig. 3). On the other hand, elements like
images can be used as buttons. These buttons have a predefined function on
click and also a descriptive name in the aria-labelledby attribute. Therefore the
grammar is generated in two ways, a static and a dynamic one. For the static
approach a script extracts the role attribute and adds the results to the
grammar. Possible commands are e.g. this cell or this row or this column for setting
the focus and delete this or edit this for the selection of an item. The dynamic
method searches for special words in the role attribute, like button or
navigation, in the DOM of the page. In this way menu items, navigation lists and many
more can be analyzed and added to the grammar. Combining both approaches
within a context grammar increases the accuracy of speech input recognition up
to nearly 100% [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Every time the user focuses on an element, the grammar
changes. Only some elements, such as items belonging to navigation commands,
are part of the grammar at any time.
        </p>
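        <p>A sketch of both generation steps is given below; extractCommands and the role-to-command table are illustrative, and the returned string is meant to be inserted as the alternatives of the public rule of the JSGF grammar sent to WAMI:</p>
        <preformat>
// Sketch: static and dynamic grammar generation from WAI-ARIA mark-up
// (the command table and extractCommands are illustrative).
var staticCommands = {
  grid: ['this cell', 'this row', 'this column', 'delete this', 'edit this']
};

function extractCommands() {
  var phrases = [];
  $('[role]').each(function () {
    var role = $(this).attr('role');
    // static part: commands derived from the role itself
    (staticCommands[role] || []).forEach(function (p) { phrases.push(p); });
    // dynamic part: descriptive names of buttons and navigation items
    if (role === 'button' || role === 'navigation') {
      var labelId = $(this).attr('aria-labelledby');
      if (labelId) { phrases.push($('#' + labelId).text().toLowerCase()); }
    }
  });
  return phrases.join(' | ');   // alternatives of the public JSGF rule
}
</preformat>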
        <p>From the user's point of view, it becomes possible to click the button in the
focused element without moving the mouse or using keyboard controls, and to
easily enhance the application with speech recognition.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Enabling gesture recognition</title>
        <p>Gesture recognition is included by only one JavaScript file. ARIA roles as listed
in Fig. 3 will be interpreted either internally by this gesture recognition engine or
externally on a recognition server. A gesture starts by holding the right mouse
button and is followed by drawing some shape. We aim at gestures which may
be mapped to keyboard input and avoid direct manipulation. Our gestures are
concatenations of up, right, left, down. Thereby it becomes possible to simulate
the gestures by pressing the control key and using the arrow keys.</p>
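        <p>A minimal sketch of this keyboard simulation is shown below; sendGestureToken stands in for the (hypothetically named) entry point of the gesture recognition script:</p>
        <preformat>
// Sketch: simulating the directional gestures with the control key and arrow keys.
// sendGestureToken is a hypothetical function feeding the gesture recognition script.
var ARROW_TO_DIRECTION = { 37: 'left', 38: 'up', 39: 'right', 40: 'down' };

$(document).on('keydown', function (e) {
  if (e.ctrlKey) {
    var direction = ARROW_TO_DIRECTION[e.which];
    if (direction) {
      sendGestureToken(direction);   // e.g. 'right', 'right' for the gesture "right right"
    }
  }
});
</preformat>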
        <p>Based on speech and gesture recognition, as well as mouse and keyboard
events, MM-ARIA converts all events to the EMMA format. Based on this
format it is possible to trigger and execute multimodal operations on the client
application.
</p>
      </sec>
      <sec id="sec-3-4">
        <title>Generating EMMA</title>
        <p>The EMMA generator is implemented as a RESTful web service, which takes
the events and combines them according to the EMMA XML format. The data
are stored in an in-memory database for the calculation of the response. This
is efficient since, after an event has been received and the operation has been
executed, the event becomes invalid and never needs to be reused. The structure
of the database is similar to the EMMA overview from Section 2.3, with the
following changes:
– process: URL of the client website invoking the web service
– signal: URL of the interpreter
– source: session id of the user
– dialog-turn: id of the DOM element involved in the interaction</p>
        <p>The receiving process of the web service is divided into two stages. This
is necessary to measure the time accurately. In the first stage, the initial data
describing the kind of event are sent from the client to the server. The
callback at this stage contains the id of the database table entry and the start time as
determined by the server. During the user input the recognizers interpret the data
and send the interpreted data to the server with the id from the callback for further
aggregation and interpretation. The integration of events according to the time
model (see Fig. 1) generates a trigger preparing control of the widget.</p>
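        <p>From the client side, the two stages could look as follows; the endpoint paths and field names are assumptions for illustration, not the published interface of the web service:</p>
        <preformat>
// Sketch of the two-stage submission to the EMMA web service
// (endpoint paths and field names are assumptions).
function announceEvent(kind, sessionId) {
  // stage 1: announce the event so the server records an accurate start time;
  // the callback delivers the id of the database entry and the start time
  return $.post('/emma/events', { mode: kind, session: sessionId });
}

function completeEvent(callback, interpretation) {
  // stage 2: send the recognizer output, referring to the id from stage 1
  $.post('/emma/events/' + callback.id, {
    start: callback.start,
    tokens: interpretation.tokens,
    confidence: interpretation.confidence
  });
}
</preformat>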
        <p>The server triggers the client fully asynchronously through the use of COMET.
In our implementation of COMET, client and server open a channel for
sharing asynchronous messages without being forced to adhere to the plain HTTP
request-response pattern. Through this channel the server can be precise, based
on the time model, when sending the synthesized EMMA document to the client.
Finally, the EMMA document must be interpreted by the client and triggers
operations.</p>
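        <p>One common way to realize such a COMET channel is long polling, sketched below; the channel endpoint, the message format and Interpreter.handle are assumptions, not the actual interface of the implementation:</p>
        <preformat>
// Sketch: COMET channel realized as long polling (endpoint and helpers are assumed).
function openChannel(sessionId) {
  $.ajax({
    url: '/emma/channel/' + sessionId,
    timeout: 30000,                        // the server holds the request open
    success: function (emmaDocument) {
      Interpreter.handle(emmaDocument);    // hypothetical interpreter entry point
      openChannel(sessionId);              // immediately reconnect
    },
    error: function () {
      setTimeout(function () { openChannel(sessionId); }, 1000);   // retry after errors
    }
  });
}
</preformat>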
      </sec>
      <sec id="sec-3-5">
        <title>MM-ARIA prototype</title>
        <p>Our MM-ARIA prototype demonstrator consists of a client and a server part.
The server part consists of recognizers and the EMMA web service. The client
part contains ARIA mark-up generated by Dojo. It is automatically inspected
and enhanced by JavaScript files, one for the connector and one for the interpreter.
These scripts implement the multi-modal design as described in the previous
sections, and consist of grammars for speech processing, gestures and WAI-ARIA
event binding.</p>
        <p>An interpreter has been developed which takes all user input into account and
identifies the proper feedback within dijit-calendar. In other words, it
triggers the appropriate operations for various combinations of events. Below
the calendars a button named Click to talk is placed. The user can start the
speech recognition by pressing ⟨shift⟩+⟨ctrl⟩; future implementations of speech
recognition engines may avoid such explicit triggering. The settings button opens
the configuration menu for the microphone and the volume of the audio input.
In the text field to the left of the button the recognized command is shown. The
visual functionality of our MM-ARIA demonstrator is presented in Fig. 4 by two
calendars, two trees and one navigation bar.</p>
        <p>Our work shows that ARIA may be expanded with additional modalities like
speech and gestures by using JavaScript, WAMI and EMMA. Existing
ARIA elements are interpreted and a grammar for speech recognition is generated
by JavaScript based on ARIA and HTML tags. The two new modalities are
described by EMMA for multimodal interpretation. We analyzed some Dojo
widgets and developed sample multimodal operations.</p>
        <p>A major advantage of this approach is that only JavaScript is required on the
user's computer. The result is a prototype of the calendar and tree which can
be controlled by standard interaction like keyboard and mouse, but also with
speech commands and combined commands based on speech and gesture, such
as saying delete this while pointing at an element.</p>
        <p>Future work should analyze the requirements of integrating multiple types
of AT. For example, it is unclear whether the mapping between keyboard input and
gesture-controlled interaction is appropriate for motor-impaired users. It is also
important to consider how temporal dependencies within multimodal
interaction, e.g. speech commands and gestures, can be represented to the users in order
to allow for sufficient time (and in accordance with WCAG). One possibility is
to further the standardization of spoken and gestural interaction in RIA.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. W3C:
          <article-title>Accessible Rich Internet Applications (WAI-ARIA) 1.0</article-title>
          , http://www.w3.org/TR/wai-aria/, (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. The Dojo Foundation: Dijit Widgets, http://dojotoolkit.org/reference-guide/dijit/index.html, (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Adrien Friggeri: jQuery Gesture Plugin, http://random.friggeri.net/jquery-gestures/, (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. W3C: HTML, http://www.w3.org/TR/html5/, (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Gruenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ian</given-names>
            <surname>McGraw</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ibrahim</given-names>
            <surname>Badr</surname>
          </string-name>
          :
          <article-title>The WAMI Toolkit for Developing, Deploying, and Evaluating Web-Accessible Multimodal Interfaces</article-title>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. W3C:
          <article-title>EMMA: Extensible MultiModal Annotation markup language</article-title>
          , http://www.w3.org/TR/emma/#s3.1, (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Rainer</given-names>
            <surname>Wasinger</surname>
          </string-name>
          :
          <article-title>Multimodal Interaction with Mobile Devices: Fusing a Broad Spectrum of Modality Combinations</article-title>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Gruenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chao</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Stephanie</given-names>
            <surname>Seneff</surname>
          </string-name>
          ,
          <article-title>Context-Sensitive Statistical Language Modeling</article-title>
          ,
          <source>Proc. Interspeech</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. WAMI Toolkit: City Browser, http://web.sls.csail.mit.edu/city/, (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>