Speech Control for HTML5 Hypervideo Players

Britta Meixner 1,2, Fabian Kallmeier 1
1 University of Passau, Innstrasse 43, 94032 Passau, Germany
2 FX Palo Alto Laboratory, 3174 Porter Drive, Palo Alto, CA 94304, USA
meixner@fxpal.com, kallmeie@fim.uni-passau.de

ABSTRACT
Hypervideo usage scenarios like physiotherapy trainings or instructions for manual tasks make it hard for users to use an input device like a mouse or touch screen on a hand-held device while they are performing an exercise or use both hands to perform a manual task. In this work, we try to overcome this issue by providing an alternative input method for hypervideo navigation using speech commands. In a user test, we evaluated two different speech recognition libraries, annyang (in combination with the Web Speech API) and PocketSphinx.js (in combination with the Web Audio API), for their usability to control hypervideo players. Test users spoke 18 words, either in German or English, which were recorded and then processed by both libraries. We found that annyang shows better recognition results. However, depending on other factors of influence, like the occurrence of background noise (reliability), the availability of an internet connection, or the browser used, PocketSphinx.js may be a better fit.

ACM Classification Keywords
H.5.2. Information Interfaces and Presentation (e.g. HCI): User Interfaces

Author Keywords
Hypervideo; Navigation; Language Processing; Speech Input; HTML5

INTRODUCTION
Using speech input, users are nowadays able to control smartphones, navigation systems, and Smart-TVs without touching them. Depending on the system, either certain commands are recognized (for example, in TomTom navigation systems [20]), or freely formulated questions can be asked (like Siri for iPhones [1], Utter for Android Phones [10], or the Google app [9]), which are then processed by the system trying to find additional information. Playing such videos requires special players that provide additional means of navigation in additional information, in scenes, and between scenes [14]. In usage scenarios like cooking instructions, physiotherapy and fitness trainings [15], or physical tasks that have to be done with two hands, speech controls may help the user to navigate in the hypervideo without having to interrupt the current task. Using voice commands, the hypervideo may be paused, next scenes may be selected, or annotations may be read without having to interrupt the task/exercise.

SPEECH RECOGNITION FRAMEWORKS
Several speech recognition APIs exist, with varying features and limitations. Available APIs are, for example, the Google Speech API [6], which accepts 10-15 seconds of audio; the IBM Speech to Text API [11], which uses IBM's speech recognition capabilities; wit.ai [26], which is an open and extensible natural language platform; Speechmatics [19] and the VoxSigma REST API [21], which transcribe uploaded files into text; or the open source APIs Kaldi [12] and OpenEars [18], the latter of which provides free speech recognition and speech synthesis for the iPhone. Hereafter, we briefly describe the combinations of frameworks that will be tested in the remainder of this work. We chose these frameworks based on the following criteria: the framework should be able to process longer phrases (in case the speech recognition gets extended in the player), it should be possible to integrate it into a Web application, and the library should not be limited to certain OSes.

Web Audio API and PocketSphinx.js
The Web Audio API is a "high-level JavaScript API for processing and synthesizing audio in web applications" [24]. It allows splitting and merging of channels in an audio stream. Audio sources from an HTML5
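As a minimal sketch of how voice commands of the kind described above could drive a hypervideo player, the snippet below maps recognized command words to player actions. This is our own illustration, not the paper's implementation: the command table and action names are assumptions, and annyang provides a comparable command-to-callback mapping on top of the same Web Speech API used in the browser-only part of the sketch.

```javascript
// Sketch (not the paper's code): map recognized command words,
// in English or German, to hypothetical hypervideo player actions.
const COMMANDS = new Map([
  ['pause', 'pause'],
  ['play', 'play'],
  ['weiter', 'play'],           // German: "continue"
  ['stopp', 'pause'],           // German: "stop"
  ['next', 'nextScene'],
  ['zurueck', 'previousScene'], // German: "back"
  ['annotation', 'showAnnotation'],
]);

// Pure matcher: return the action for the last known command word
// in a transcript, or null if no command word occurs.
function matchCommand(transcript) {
  const words = transcript.toLowerCase().trim().split(/\s+/);
  for (let i = words.length - 1; i >= 0; i--) {
    if (COMMANDS.has(words[i])) return COMMANDS.get(words[i]);
  }
  return null;
}

// Browser-only wiring via the Web Speech API (the API annyang builds on);
// skipped outside a browser environment.
if (typeof window !== 'undefined') {
  const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
  if (Recognition) {
    const recognizer = new Recognition();
    recognizer.lang = 'en-US';     // 'de-DE' for the German command set
    recognizer.continuous = true;  // keep listening during the exercise
    recognizer.interimResults = false;
    recognizer.onresult = (event) => {
      const result = event.results[event.results.length - 1][0];
      const action = matchCommand(result.transcript);
      if (action !== null) {
        console.log('dispatch player action:', action); // hand off to the player here
      }
    };
    recognizer.start();
  }
}
```

For example, `matchCommand('weiter')` yields `'play'`. Keeping the matcher pure and separate from the recognition backend is what would allow swapping annyang/Web Speech API for PocketSphinx.js without touching the player logic.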