=Paper=
{{Paper
|id=None
|storemode=property
|title=Two New Modalities for ARIA
|pdfUrl=https://ceur-ws.org/Vol-792/Franke.pdf
|volume=Vol-792
}}
==Two New Modalities for ARIA ==
Martin Franke, Jens Voegler, and Gerhard Weber
Technische Universität Dresden, Institut für Angewandte Informatik, D-01062 Dresden
Martin.Franke4@mailbox.tu-dresden.de, {jens.voegler,gerhard.weber}@tu-dresden.de
http://www.inf.tu-dresden.de

Abstract. Accessibility of rich internet applications is ensured by ARIA and assistive technology supporting this additional mark-up. We propose MM-ARIA for the integration of multiple types of assistive technology while supporting multimodal interaction. The Dojo toolkit's calendar widget is an example of a complex, non-standard widget requiring ARIA mark-up. MM-ARIA generates grammars for speech recognition and integrates gestural input for navigation in the calendar or selecting an item. MM-ARIA relies on the multimodal mark-up language EMMA and an interpreter for parallel input. Our proof-of-concept application demonstrates the extensibility of widgets described by ARIA.

Keywords: Browser, speech, gesture, assistive technology

1 Introduction

Rich Internet Applications are becoming more important for new websites. Their advantage over plain forms is that dynamic content can be presented to the user without loading a new web page, such as explaining the missing information in a partially completed contact form, or refreshing a news ticker automatically. It is also possible to use novel types of interaction, for example gestures for navigation, or moving an item directly within a map. Additional modalities beyond mouse and keyboard, such as speech input, are on the horizon as new browsers support access to the microphone.

But often these advantages are not accessible for disabled users or even create additional barriers. For example, assistive technologies like screen readers only represent a snapshot of a website and recognize changes only when the website is refreshed or a new site is loaded. Gestures and interactions like drag and drop require visual feedback and are time-dependent, which may be a barrier for blind and motor-disabled users. Another disadvantage is that gestures are not consistent among different websites and browsers.

Accessible Rich Internet Applications (ARIA) is a markup language for developing accessible websites for assistive technologies. The main goals of ARIA are [1]:

– expanding the accessibility information that may be supplied by the author,
– requiring that supporting host languages provide full keyboard support that may be implemented in a device-independent way, for example, by telephones, handheld devices, e-book readers, and televisions,
– improving the accessibility of dynamic content generated by scripts, and
– providing interoperability with assistive technologies.

With ARIA, websites such as Facebook become by and large accessible, but navigation for blind users, to name one user group, is currently only possible by keyboard. This paper shows how ARIA can be used to develop multimodal websites which can be controlled by speech, keyboard and mouse jointly. Solving this problem requires interpreting ARIA attributes as a semantic enhancement of HTML elements while ensuring accessibility for people who benefit from the availability of one or several of these modalities.

2 State Of The Art

The main goal of ARIA is the improvement of accessibility for dynamic web content. Dynamic changes can have inappropriate effects for users who cannot use the mouse or have only limited access to the keyboard. The worst case for these users is an inoperable state in a web application.
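As a simple illustration of how such dynamic updates can be made perceivable to assistive technology, the following sketch (our own example, not code from the prototype) marks a periodically refreshed region as a WAI-ARIA live region; the element id and the update function are assumed for illustration.

<syntaxhighlight lang="javascript">
// Minimal sketch: announce dynamic updates of a news ticker to assistive technology.
// The element id "ticker" and the updateTicker() function are assumptions.
var ticker = document.getElementById('ticker');

// aria-live="polite" asks a screen reader to announce changes when the user is idle;
// aria-atomic="true" makes it read the whole region instead of only the changed node.
ticker.setAttribute('aria-live', 'polite');
ticker.setAttribute('aria-atomic', 'true');

function updateTicker(message) {
  // Changing the text content of a live region triggers the announcement.
  ticker.textContent = message;
}
</syntaxhighlight>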
ARIA defines states and properties of HTML elements and landmarks for navigating the website [1]. These new attributes for HTML elements may be accessed by Assistive Technology (AT), such as a screen reader. Each AT implements a separate but accessible user interface based on the extra mark-up, such as:

– Roles for types: button, dialog, gridcell, menuitem, menu, scrollbar, tree, tab, tabpanel
– Roles for structure: banner, navigation, main, search
– Properties showing the state of an object, like pressed, or its value

The additional markup requires developers to enhance their code manually. Existing HTML and JavaScript will hardly be modified, but new JavaScript frameworks and toolkits may be developed with ARIA support from the beginning. They are deployed with a set of widgets that has already been tested against the ARIA specification, often with AT or even by users relying on AT. One of these frameworks is the Dojo toolkit.

2.1 Dojo Toolkit

The Dojo Toolkit is a JavaScript framework, developed by the Dojo Foundation since 2004 and currently in version 1.6. It is class-oriented and modularized into widgets. Each widget consists of HTML, CSS and JavaScript files. A wide variety of widgets is available for professional internet applications. Most of these widgets have been evaluated to be accessible; these include basic widgets like buttons, sliders or validating input fields, and also complex widgets such as Calendar, Dialog, Editor, InlineEditBox, Menu, MenuBar, ProgressBar, TabPanel, Toolbar, Tooltip and Tree [2]. Within the Dojo toolkit, a life cycle manager controls the widgets and a data connector simplifies the exchange of data between the widgets and an application.

Building a flexible, accessible rich internet application based on multiple widgets requires well controllable input facilities while supporting dynamic changes of the status of widgets. As in any graphical user interface, the flow of control in Dojo is based on events. The next step towards a multimodal system is the collection of events to prepare the synchronization of new modalities for interaction with ARIA-based widgets.

2.2 Collecting Events

Most of the events caused by user input can easily be caught using JavaScript. jQuery (http://www.jquery.org) is one of the most widely used frameworks for this purpose. With this framework one can easily catch keyboard input, such as key presses, or mouse input, such as clicks or the position of the mouse cursor. Via an additional script [3], it is also possible to recognize mouse gestures (see Section 3.3).
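A minimal sketch of such event collection is shown below; it assumes jQuery is loaded, and the selector and the collecting function are illustrative only.

<syntaxhighlight lang="javascript">
// Minimal sketch of collecting keyboard and mouse events with jQuery.
// The selector '#calendar' and the collect() sink are assumptions for illustration.
var events = [];

function collect(type, data) {
  // Keep a timestamp with every event so that it can later be synchronized
  // with other modalities (see the temporal model in Section 2.3).
  events.push({ type: type, data: data, time: Date.now() });
}

$(document).on('keydown', function (e) {
  collect('key', { key: e.which, ctrl: e.ctrlKey });
});

$('#calendar').on('click mousemove', function (e) {
  collect('mouse', { kind: e.type, x: e.pageX, y: e.pageY });
});
</syntaxhighlight>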
A limitation of jQuery is the lack of support for microphone input and speech recognition. None of the current browsers can natively capture the data stream arising from microphone input. However, the HTML5 specification includes such a feature and industry-standard browsers will soon support audio input [4]. One approach for adding speech processing to browsers is the Web-Accessible Multimodal Applications framework (WAMI) [5]. WAMI applications are not built with accessibility in mind, and there is no use of WAI-ARIA; still, a switch among modalities is feasible. City Browser is a sample implementation using WAMI with actual multimodal interaction; it combines mouse and speech input within a browser [9]. WAMI provides an implementation of a client-server concept, catching microphone streams in the client and recognizing speech input on the server.

The client is implemented as a Java applet and is therefore nearly platform- and browser-independent. The applet records microphone data as an MP3 file and sends it to a server. This server checks the recorded stream against a predefined grammar and sends the recognized words back to the client application as a JavaScript object. With this object, functions are triggered on the client side. A WAMI server can be deployed in a Java servlet container, like Apache Tomcat (http://tomcat.apache.org/) or Jetty (http://jetty.codehaus.org/jetty/), and uses an underlying speech recognizer, for example CMU Sphinx (http://cmusphinx.sourceforge.net/) or MIT's own City Browser speech server.

To summarize, we have presented additional mark-up based on WAI-ARIA to obtain an accessible RIA, the recognition of the user's standard input through events, and an approach for handling speech input in browsers. Multimodality could be embedded into the widgets at this point; however, we aim at a modular approach using mark-up to model multimodality independently of specific modalities while preserving the accessibility features of widgets. One well-known approach to multimodality recommended by the W3C is EMMA, the Extensible MultiModal Annotation markup language.

2.3 EMMA

EMMA has been a W3C Recommendation since 2009, with the main goal of establishing multimodal interactions in the world wide web. It is a descriptive XML-based language for processing multimodal input. An EMMA message consists of the underlying modality-specific mark-up embedded within EMMA-specific mark-up. The main concepts are the interpretation and the group element. An interpretation element describes the input modality and consists of [6]:

– id, unique identification
– tokens, the whole input string
– process, reference to the processor, like a speech recognizer
– no-input, true if the input string is empty
– uninterpreted, true if an error occurred in the processor
– signal, reference to the interpreted file, like the MP3 file
– media-type, media type of the signal
– confidence, confidence of the interpretation, from 0.0 to 1.0
– source, reference to the application which invoked the processor
– start, end, duration, points of the start/end of the event and its duration
– medium, acoustic, tactile or visual
– mode, modality of the signal, like voice, dtmf, gui, keys, video, ...
– function, function of the signal, like recording, transcription, dialog, verification, ...
– verbal, true or false
– cost, cost of the interpretation, such as CPU time, and
– dialog-turn, placeholder for application-dependent implementation

EMMA thereby supports a wide range of modalities and enables flexible interpretations without imposing restrictions on the temporal granularity. The group element is a container element and consists of interpretation elements that belong together in a temporal sense; typically they are considered as being parallel. With this element the application knows which modalities should be interpreted as a whole and which remain independent.
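To make this structure concrete, the following sketch (our own illustration, simplified with respect to the EMMA specification) assembles a group of two interpretations as a JavaScript string; identifiers, time stamps and the confidence value are invented.

<syntaxhighlight lang="javascript">
// Illustrative sketch of an emma:group holding a spoken command and a gesture
// that were recognized as one multimodal interaction. Only attributes from the
// list above are used; ids and timings are made up.
function buildEmmaGroup(speech, gesture) {
  return '<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">' +
         ' <emma:group id="group1">' +
         '  <emma:interpretation id="int1" emma:medium="acoustic" emma:mode="voice"' +
         '   emma:tokens="' + speech.tokens + '"' +
         '   emma:confidence="' + speech.confidence + '"' +
         '   emma:start="' + speech.start + '" emma:end="' + speech.end + '"/>' +
         '  <emma:interpretation id="int2" emma:medium="tactile" emma:mode="gui"' +
         '   emma:tokens="' + gesture.tokens + '"' +
         '   emma:start="' + gesture.start + '" emma:end="' + gesture.end + '"/>' +
         ' </emma:group>' +
         '</emma:emma>';
}

// Example: the command "delete this" followed shortly by a swipe gesture.
var doc = buildEmmaGroup(
  { tokens: 'delete this', confidence: 0.87, start: 1000, end: 1800 },
  { tokens: 'swipe_left', start: 2600, end: 3100 }
);
</syntaxhighlight>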
Our multimodal time model for recognition originates from [7] and can be summarized as follows. Two types of synchronization of modalities are possible, overlapped and non-overlapped events. Overlapped events are events that are raised at the same time. Non-overlapped events are sequential events, but still belong to one multimodal interaction. Overlapped events are synchronized automatically; non-overlapped events are processed according to the following synchronization rules. These rules are based on the assumption of some maximum duration of user input. For the purpose of this work we consider only intra-gestures; extra-gestures may be added nevertheless.

{| class="wikitable"
|+ Fig. 1. Temporal model based on [7]
! Maximum gap !! Combinations
|-
| 1-3 s || SGI: speech command + intra-gesture; GIGI: two intra-gestures
|-
| 5 s || SS: two speech commands; GIS: intra-gesture + speech command
|}

Intra-gestures are described as any non-verbal tactile event which belongs to the previous or next event. Every event outside of these time intervals is an extra-gesture and is therefore interpreted independently. The interactions sketched in Fig. 1 are multimodal and integrate two or more modalities based on the temporal model. An example following the rules in Fig. 1 may start with a speech command. If the next event (such as an intra-gesture) starts within 3 s after the end of the spoken input, it will be grouped as two emma:interpretation elements in one emma:group element.

This approach to a temporal model, independent of modalities and widgets, lays the foundations for Multimodal Accessible Rich Internet Applications (MM-ARIA). We consider an application to be an MM-ARIA application if multiple and alternative modalities can be utilized when interacting with a web application, while assistive technologies are supported.

3 Implementing MM-ARIA

MM-ARIA is based on the modular extension of a client-server architecture. The client application consists of JavaScript files binding events and connecting with the server. The server is application independent and consists of recognizers and a web service to generate EMMA documents. Fig. 2 gives an overview of the individual parts and modules as well as the flow of control. An event together with its event data, such as a keyboard event or a spoken command, is sent to the server connector (1). The server connector routes the event data to the intended recognizer. A recognizer converts the event data into a JSON object and sends it back to the server connector (2). Only the server connector needs to know the endpoint of the EMMA web service; it prepares the JSON object and sends it to the EMMA web service. This web service waits for upcoming events according to the temporal model (3), combines them into an EMMA document and sends it back to the server connector (4). The connector routes the EMMA document to the interpreter, which interprets the document and triggers the appropriate application actions with the parameters determined while processing the event (5).

Fig. 2. MM-ARIA application (overview of client and server modules; legend: (1) event data, (2) JSON object, (3) wait, (4) EMMA document, (5) operation name and parameters)

We apply the following 'algorithmic' procedure for developing the actual mapping between modalities and widgets described by ARIA mark-up:

1. search for roles
2. determine typical interactions
3. make a mapping from roles to interactions, like enhancement of gestures and speech grammar
4. determine contextual interactions and attributes, like buttons and aria-labelledby
5. make a possible enhancement in gestures and grammar
6. develop multimodal approaches to simplify the interactions

A more elaborate example is described in the following discussion of dijit-calendar. This example demonstrates the procedure and also the shortcomings of using WAI-ARIA.
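Before turning to the calendar example, the following sketch illustrates steps 1-3 of this procedure; the mapping table is an assumption derived from Fig. 3 and the helper function is hypothetical.

<syntaxhighlight lang="javascript">
// Hypothetical sketch of steps 1-3: find WAI-ARIA roles in the page and map
// them to candidate gestures and speech commands (values adapted from Fig. 3).
var roleMappings = {
  grid:       { gestures: ['swipe_left', 'swipe_right'],
                commands: ['delete this', 'edit this', 'this row', 'this cell'] },
  dialog:     { gestures: [], commands: ['minimize', 'maximize', 'close'] },
  tree:       { gestures: ['hitch'], commands: ['collapse all', 'expand all'] },
  navigation: { gestures: [], commands: [] }
};

// Step 1: search for roles; steps 2-3: attach the corresponding interactions.
function mapWidgets() {
  var result = [];
  $('[role]').each(function () {
    var role = $(this).attr('role');
    if (roleMappings[role]) {
      result.push({ element: this, role: role, interactions: roleMappings[role] });
    }
  });
  return result;
}
</syntaxhighlight>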
3.1 Multimodal Interaction with Calendar Widget

Dijit-calendar consists of one table with role='grid'. Calendar days are tagged with role='gridcell' and the weekly header with role='columnheader'. As there is no WAI-ARIA role for a calendar, the calendar must be interpreted as a special kind of table. Usual interactions on a table are the selection of one item in order to insert, edit or delete it. Selection may be performed by navigating forward or backward from one table item to another. We need to develop algorithmically a mapping between WAI-ARIA roles and such an interaction.

For dijit-calendar, we developed a mapping between role='grid' and gestures such as swipe left and right as well as spoken commands. This mapping matches well with the effect of selecting next and previous items. Similarly, we map role='grid' with the possibility to mark an item and insert a new one, edit the current one or delete it. Appropriate spoken commands are edit this or delete this. In a multimodal system users may want to refer to an element 'this' also by the mouse or keyboard while saying 'this'.

Multimodal operation on dijit-calendar links swipe gestures and spoken commands. The spoken command year followed by a swipe gesture will increase or decrease the year respectively. Similarly, month followed by the same swipe gesture will change the month shown. More complex mappings are generated from WAI-ARIA properties of buttons or drop-down lists by identifying the elements and their corresponding aria-labelledby element. Fig. 3 shows the result of the analysis of other sample Dojo widgets.

{| class="wikitable"
|+ Fig. 3. Outcomes of widget analysis
! Widget !! Roles !! Gestures !! Speech Commands
|-
| calendar || grid, gridcell, columnheader, row, button || swipe || delete this, edit this, this row, this cell, this column, sort by
|-
| dialog || dialog, button || - || minimize, maximize, close
|-
| inlineeditbox || button || - || -
|-
| menu || menu, menubar, menuitem || - || -
|-
| tree || tree, treegroup, treeitem || hitch || collapse all / this, expand all / this
|-
| landmark roles || navigation, search || - || -
|}

These findings are examples covering the most practically used roles of WAI-ARIA. The table in Fig. 3 can simply be extended with roles, gestures and grammars in the case of a generic system.

3.2 Building the grammar

One goal of our approach is to enhance ARIA with speech control, in particular using WAMI. WAMI requires a grammar, modeled in JSGF syntax. Our analysis shows that control elements are defined in two different ways. On the one hand, WAI-ARIA roles are set (see Fig. 3). On the other hand, elements like images can be used as buttons. These buttons have a predefined function on click and also a descriptive name in the aria-labelledby attribute. Therefore the grammar is generated in two ways, a static and a dynamic one. For the static approach, a script extracts the role attribute and adds the results to the grammar. Possible commands are, e.g., this cell, this row or this column for setting the focus, and delete this or edit this for the selection of an item. The dynamic method searches the DOM of the page for special words in the role attribute, like button or navigation. In this way menu items, navigation lists and many more elements can be analyzed and added to the grammar. Combining both approaches within a context grammar increases the accuracy of speech input recognition up to nearly 100% [8].
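A minimal sketch of this grammar generation is given below; it assumes jQuery and emits a strongly simplified JSGF rule, so the command set and rule names are illustrative only.

<syntaxhighlight lang="javascript">
// Hypothetical sketch of grammar generation: combine static role-based commands
// with the names of labelled buttons, then emit a simple JSGF rule.
function buildGrammar() {
  var commands = ['this cell', 'this row', 'this column', 'delete this', 'edit this'];

  // Dynamic part: elements acting as buttons contribute the text of their
  // aria-labelledby element as an additional spoken command.
  $('[role="button"]').each(function () {
    var labelId = $(this).attr('aria-labelledby');
    if (labelId) {
      var label = $('#' + labelId).text().trim().toLowerCase();
      if (label) { commands.push(label); }
    }
  });

  // Emit one JSGF alternative rule; a real grammar would add weights and scopes.
  return '#JSGF V1.0;\n' +
         'grammar mmaria;\n' +
         'public <command> = ' + commands.join(' | ') + ';';
}
</syntaxhighlight>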
Every time the user focuses on an element, the grammar changes. Only some elements, such as items belonging to navigation commands, are part of the grammar at any time. From the user's point of view, it becomes possible to click the button in the focused element without moving the mouse or using keyboard controls, and to easily enhance the application with speech recognition.

3.3 Enabling gesture recognition

Gesture recognition is included by only one JavaScript file. ARIA roles as listed in Fig. 3 are interpreted either internally by this gesture recognition engine or externally on a recognition server. A gesture starts by holding the right mouse key and is followed by drawing some shape. We aim at gestures which may be mapped to keyboard input and avoid direct manipulation. Our gestures are concatenations of up, right, left, down. Thereby it becomes possible to simulate the gestures by pressing the control key and using the arrow keys. Based on speech and gesture recognition, as well as mouse and keyboard events, MM-ARIA converts all events to the EMMA format. Based on this format it is possible to trigger and execute multimodal operations on the client application.

3.4 Generating EMMA

The EMMA generator is implemented as a RESTful web service, which takes the events and combines them according to the EMMA XML format. The data are stored in an in-memory database for the calculation of the response. This is efficient since, after an event has been received and the operation has been executed, the event becomes invalid and never needs to be reused. The structure of the database is similar to the EMMA overview from Section 2.3, with the following changes:

– process: URL of the client website invoking the web service
– signal: URL of the interpreter
– source: session id of the user
– dialog-turn: id of the DOM element involved in the interaction

The receiving process of the web service is divided into two stages. This is necessary to measure the time accurately. In the first stage, the initial data describing the kind of event is sent from the client to the server. The callback at this stage returns the id of the database table entry and the start time as determined by the server. During the user input, the recognizers interpret the data and send the interpreted data to the server, together with the id from the callback, for further aggregation and interpretation. The integration of events according to the time model (see Fig. 1) generates a trigger preparing the control of the widget. The server triggers the client fully asynchronously through the use of COMET. In our implementation of COMET, client and server open a channel for sharing asynchronous messages without being forced to adhere to the plain HTTP request-response pattern. Through this channel the server can be precise, based on the time model, when sending the synthesized EMMA document to the client. Finally, the EMMA document is interpreted by the client and triggers operations.

3.5 MM-ARIA prototype

Our MM-ARIA prototype demonstrator consists of a client and a server part. The server part consists of recognizers and the EMMA web service. The client part contains ARIA mark-up generated by Dojo. It is automatically inspected and enhanced by JavaScript files, one for the connector and one for the interpreter. These connectors implement the multimodal design described in the previous sections and consist of grammars for speech processing, gestures and WAI-ARIA event binding.
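The following sketch indicates how such a connector might bind keyboard-simulated gestures and forward event data to the server; the endpoint /emma/events and the payload fields are assumptions, not the prototype's actual interface.

<syntaxhighlight lang="javascript">
// Hypothetical sketch of the client-side connector: Ctrl + arrow keys simulate
// the four gesture strokes, and every recognized event is sent to the server.
var strokeForKey = { 37: 'left', 38: 'up', 39: 'right', 40: 'down' };

function sendEvent(payload) {
  // First stage of the two-stage protocol: announce the event and its start time.
  $.post('/emma/events', JSON.stringify(payload));
}

$(document).on('keydown', function (e) {
  if (e.ctrlKey && strokeForKey[e.which]) {
    // Treat Ctrl+arrow as a gesture stroke, so gestures stay keyboard-accessible.
    sendEvent({ mode: 'gui', tokens: 'swipe_' + strokeForKey[e.which],
                start: Date.now() });
  }
});
</syntaxhighlight>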
An interpreter has been developed that takes all user input into account and identifies the proper feedback within dijit-calendar. In other words, it triggers the appropriate operations for the various combinations of events. Below the calendars, a button named Click to talk is placed. The user can start the speech recognition by pressing ⟨shift⟩+⟨ctrl⟩; future implementations of speech recognition engines may avoid such explicit triggering. The Settings button opens the configuration menu for the microphone and the volume of the audio input. The text field to the left of the button shows the recognized command. The visual functionality of our MM-ARIA demonstrator is presented in Fig. 4 by two calendars, two trees and one navigation bar.

Fig. 4. MM-ARIA Prototype

4 Conclusion

Our work shows that ARIA may be expanded with additional modalities like speech and gestures by using JavaScript, WAMI and EMMA. Existing ARIA elements are interpreted, and a grammar for speech recognition is generated by JavaScript based on ARIA and HTML tags. The two new modalities are described by EMMA for multimodal interpretation. We analyzed several Dojo widgets and developed sample multimodal operations. A major advantage of this approach is that only JavaScript is required on the user's computer. The result is a prototype of the calendar and tree which can be controlled by standard interaction like keyboard and mouse, but also with speech commands and combined commands based on speech and gesture, such as saying delete this while pointing at an element.

Future work should analyze the requirements of integrating multiple types of AT. For example, it is unclear whether the mapping between keyboard input and gesture-controlled interaction is appropriate for motor-impaired users. It is also important to consider how temporal dependencies within multimodal interaction, e.g. speech commands and gestures, can be represented to the users in order to allow for sufficient time (and in accordance with WCAG). One possibility is to further the standardization of spoken and gestural interaction in RIA.

References

1. W3C: Accessible Rich Internet Applications (WAI-ARIA) 1.0, http://www.w3.org/TR/wai-aria/ (2011)
2. The Dojo Foundation: Dijit Widgets, http://dojotoolkit.org/reference-guide/dijit/index.html (2010)
3. Adrien Friggeri: jQuery Gesture Plugin, http://random.friggeri.net/jquery-gestures/ (2010)
4. W3C: HTML5, http://www.w3.org/TR/html5/ (2010)
5. Alexander Gruenstein, Ian McGraw and Ibrahim Badr: The WAMI Toolkit for Developing, Deploying, and Evaluating Web-Accessible Multimodal Interfaces (2008)
6. W3C: EMMA: Extensible MultiModal Annotation markup language, http://www.w3.org/TR/emma/#s3.1 (2010)
7. Rainer Wasinger: Multimodal Interaction with Mobile Devices: Fusing a Broad Spectrum of Modality Combinations (2006)
8. Alexander Gruenstein, Chao Wang, and Stephanie Seneff: Context-Sensitive Statistical Language Modeling, Proc. Interspeech (2005)
9. WAMI Toolkit: City Browser, http://web.sls.csail.mit.edu/city/ (2006)