Speech Control for HTML5 Hypervideo Players
Britta Meixner 1,2, Fabian Kallmeier 1
1 University of Passau, Innstrasse 43, 94032 Passau, Germany
2 FX Palo Alto Laboratory, 3174 Porter Drive, Palo Alto, CA 94304, USA
meixner@fxpal.com, kallmeie@fim.uni-passau.de
ABSTRACT
Hypervideo usage scenarios like physiotherapy training or instructions for manual tasks make it hard for users to operate an input device like a mouse or a touch screen on a hand-held device while they are performing an exercise or using both hands for a manual task. In this work, we try to overcome this issue by providing an alternative input method for hypervideo navigation using speech commands. In a user test, we evaluated two speech recognition libraries, annyang (in combination with the Web Speech API) and PocketSphinx.js (in combination with the Web Audio API), for their usability to control hypervideo players. Test users spoke 18 words, either in German or English, which were recorded and then processed by both libraries. We found that annyang shows better recognition results. However, depending on other factors of influence, like the occurrence of background noise (reliability), the availability of an internet connection, or the used browser, PocketSphinx.js may be a better fit.

ACM Classification Keywords
H.5.2. Information Interfaces and Presentation (e.g. HCI): User Interfaces

Author Keywords
Hypervideo; Navigation; Language Processing; Speech Input; HTML5

INTRODUCTION
Using speech input, users are nowadays able to control smartphones, navigation systems, and Smart TVs without touching them. Depending on the system, either certain commands are recognized (for example in TomTom navigation systems [20]), or freely formulated questions can be asked (like Siri for iPhones [1], utter! for Android phones [10], or the Google app [9]) which are then processed by the system trying to find an answer.

However, up to now, only few homepages and Web applications have built-in support for speech input. Especially hypervideo players could benefit from speech control. Hypervideos consist of interlinked video scenes which are enriched with additional information. Playing such videos requires special players that provide additional means of navigation into additional information, within scenes, and between scenes [14]. In usage scenarios like cooking instructions, physiotherapy and fitness training [15], or physical tasks that have to be done with two hands, speech controls may help the user to navigate in the hypervideo without having to interrupt the current task. Using voice commands, the hypervideo may be paused, next scenes may be selected, or annotations may be read without having to interrupt the task or exercise.

SPEECH RECOGNITION FRAMEWORKS
Several speech recognition APIs exist, with varying features and limitations. Available APIs are, for example, the Google Speech API [6], which accepts 10-15 seconds of audio; the IBM Speech to Text API [11], which uses IBM's speech recognition capabilities; wit.ai [26], which is an open and extensible natural language platform; Speechmatics [19] and the VoxSigma REST API [21], which transcribe uploaded files into text; or the open source APIs Kaldi [12] and OpenEars [18], the latter of which provides free speech recognition and speech synthesis for the iPhone. Hereafter, we briefly describe the combinations of frameworks that are tested in the remainder of this work. We chose these frameworks based on the following criteria: the framework should be able to process longer phrases (in case the speech recognition gets extended in the player), it should be possible to integrate it into a Web application, and the library should not be limited to certain operating systems.

Web Audio API and PocketSphinx.js
The Web Audio API is a "high-level JavaScript API for processing and synthesizing audio in web applications" [24]. It allows splitting and merging of channels in an audio stream. Audio sources from an HTML5 <audio> or <video> element can be processed. It is furthermore possible to process live audio input from a MediaStream via getUserMedia() [24]. The speech recognition library PocketSphinx.js is written entirely in JavaScript and runs entirely in the web browser [7], building on the Web Audio API. The speech recognizer is implemented in C (PocketSphinx) [3] and converted to JavaScript using Emscripten [5]. It is possible to add words, grammars, and key phrases to extend or improve the recognition [7]. Each language needs its own language model with a vocabulary.
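To illustrate how live microphone input reaches such a recognizer, the following minimal sketch captures audio via getUserMedia() and forwards raw sample buffers to a PocketSphinx.js worker. The worker message format follows the PocketSphinx.js documentation [7]; the worker file name is an assumption, and the sample-rate conversion the recognizer requires is omitted.

```javascript
// Sketch: live audio capture with the Web Audio API, feeding a
// PocketSphinx.js recognizer running in a Web Worker.
const recognizer = new Worker('pocketsphinx.js'); // assumed file name

const audioContext = new AudioContext();
navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
  const source = audioContext.createMediaStreamSource(stream);
  // A ScriptProcessorNode delivers raw PCM buffers to JavaScript.
  const processor = audioContext.createScriptProcessor(4096, 1, 1);
  processor.onaudioprocess = (event) => {
    const samples = event.inputBuffer.getChannelData(0);
    // PocketSphinx.js expects 16 kHz 16-bit samples; the required
    // conversion from Float32 is omitted in this sketch.
    recognizer.postMessage({ command: 'process', data: samples });
  };
  source.connect(processor);
  processor.connect(audioContext.destination);
});
```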
Web Speech API and annyang
The Web Speech API enables the incorporation of voice data into web apps [17][23]. The SpeechRecognition (Asynchronous Speech Recognition) interface "provides the ability to recognize voice context from an audio input (normally via the device's default speech recognition service) and respond appropriately" [17]. The SpeechGrammar interface represents a container for a particular set of grammar (defined in the JSpeech Grammar Format (JSGF)) that an app should recognize [17]. As most modern operating systems have a speech recognition system for issuing voice commands, this system is used for speech recognition on the device. Speech recognition systems are, for example, Dictation on Mac OS X [2], Siri on iOS [1], Cortana on Windows 10 [16], and Android Speech [8]. The tiny standalone JavaScript SpeechRecognition library annyang lets users control a homepage with voice commands [4]. The size of the library is less than 1 kB. The back-end is supported by the (Chrome) Web Speech API [25].
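As a brief illustration of how annyang is used (not the test setup described later), spoken phrases are mapped to JavaScript callbacks. The player object and its methods are hypothetical:

```javascript
// Sketch: mapping voice commands to hypothetical player functions.
if (annyang) {
  annyang.addCommands({
    'play':        () => player.play(),
    'volume up':   () => player.increaseVolume(),
    'full screen': () => player.enterFullscreen(),
  });
  annyang.start(); // prompts for microphone access and starts listening
}
```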
IMPLEMENTATION
In order to test the annyang and PocketSphinx.js projects, we installed a reference platform. It consists of a Web server (Apache 2.4.10) and a database (MySQL 5.5.38). We used Perl (version 5.14.2) for the implementation of the dynamic test Web page, which shows contents in the selected language (German or English) and provides a log-in system to avoid abuse and falsification of the test results. We only used Google Chrome for our tests, because annyang is built on the Web Speech API, which was only available for Google Chrome at the time of the tests.

annyang
For an implementation of speech detection and recognition with annyang, it was only necessary to include the JavaScript library in the Web application. In order to allow the usage of the microphone, it was mandatory to install an SSL certificate. We furthermore rewrote the onResult function of the annyang project to make the implementation conform with the PocketSphinx.js implementation described hereafter. To test German words, the language only had to be set to German using the setLanguage function. Further modifications and additions were not necessary.
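The two touch points just mentioned can be sketched through annyang's public API. Note that we modified the library's internal onResult function directly, so the callback shown here is only an approximation, and logAttempt is a hypothetical helper of the test page:

```javascript
// Sketch: German recognition with annyang and a hook on raw results.
annyang.setLanguage('de');

annyang.addCallback('result', (phrases) => {
  // "phrases" lists the candidate transcriptions, best match first.
  logAttempt(phrases[0]); // hypothetical test-page logging helper
});
```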
PocketSphinx.js
The implementation of speech detection and recognition with PocketSphinx.js required more effort than the implementation with annyang, because the source code of PocketSphinx.js only comes with an English acoustic model. To avoid having to generate our own acoustic model for German words, we used the one provided by VoxForge [22]. We furthermore used the possibility to add words and grammars at run-time to avoid overly large files, which could lead to crashes of the browser. For that reason, we compiled the acoustic models outside the main file, which led to smaller files and better performance.
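The run-time mechanism works roughly as in the following sketch, which follows the message format in the PocketSphinx.js documentation [7]. The worker file name and the phonetic transcriptions are illustrative, not taken from our test system:

```javascript
// Sketch: adding words and a small command grammar at run-time.
const recognizer = new Worker('pocketsphinx.js'); // assumed file name
recognizer.postMessage({ command: 'initialize' });

// Words are registered together with pronunciations drawn from the
// acoustic model's phone set (transcriptions here are illustrative).
recognizer.postMessage({
  command: 'addWords',
  data: [['PLAY', 'P L EY'], ['REPEAT', 'R IY P IY T']],
});

// A one-state grammar that accepts either command.
recognizer.postMessage({
  command: 'addGrammar',
  data: {
    numStates: 1,
    start: 0,
    end: 0,
    transitions: [
      { from: 0, to: 0, word: 'PLAY' },
      { from: 0, to: 0, word: 'REPEAT' },
    ],
  },
});
```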
Test System
The web page used for the tests consisted of an index.pl file in Perl (which generated the HTML code), some JavaScript files, and a MySQL database. The database stored user names and passwords, test words and their pronunciation, test results, as well as data that might be displayed on the dynamic web page. JavaScript was used to start and stop voice recording and recognition in the different technologies. In our tests, we used a fixed set of words which was shown in our test application and had to be spoken out loud by the participants. We furthermore used a timer which limited the recognition time per word and showed each word for 10 seconds. If a word was recognized correctly during the first attempt, the next word was loaded. If it was not recognized, a second attempt with a new timer was started. The user then had to repeat the word. The Web page informed the user whether the word was recognized correctly. The system used the versions of annyang and PocketSphinx.js available in November 2014.
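The per-word flow just described can be condensed into a short sketch. The helpers showWord, markResult, nextWord, and onRecognized are hypothetical stand-ins for the actual test-page code:

```javascript
// Sketch: one word, shown for 10 seconds per attempt, two attempts max.
function testWord(word, attempt = 1) {
  showWord(word, attempt);
  const timer = setTimeout(() => {
    if (attempt === 1) {
      testWord(word, 2);       // second attempt with a new timer
    } else {
      markResult(word, false); // not recognized in either attempt
      nextWord();
    }
  }, 10000);

  onRecognized(word, () => {   // called when a library reports a match
    clearTimeout(timer);
    markResult(word, true);
    nextWord();
  });
}
```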
STUDY/METHOD
To find out whether annyang or PocketSphinx.js performs better, we conducted a study with 58 participants.

Procedure/Data Collection
We used the 18 words shown in Table 1, which represent the key functions of our HTML5 hypervideo player. The words were presented to the participants in random order to avoid exercise effects towards higher word IDs. Each recorded word was tested with both technologies, annyang and PocketSphinx.js. Before starting the tests, the users had to select which language they wanted to do the test in. As a result, 33 participants used the German version of the test and 25 users participated in the English version.

Table 1. Words tested for recognition.
ID  German word         English word
1   abspielen           play
2   wiederholen         repeat
3   öffnen              open
4   schließen           close
5   lauter              volume up
6   leiser              volume down
7   einblenden          fade in
8   ausblenden          fade out
9   vorwärts            previous
10  zurück              next
11  Inhaltsverzeichnis  content
12  Suche               search
13  Tagebuch            journal
14  Vollbild            full screen
15  Fensteransicht      windows view
16  Bild                picture
17  Bildergalerie       picture gallery
18  Hauptvideo          main video

Participants
The participants in our study were mainly between 18 and 60 years old. 34 of the participants were male, 24 were female. The test was mainly distributed in Germany, so most of the participants were native German speakers. The tests were conducted on desktop computers or laptops, whereby 33 participants used internal and 25 participants used external microphones. See Table 2 for more precise demographic data.
Table 2. Test participant demographics.
                                    Participants
Age                 below 18                   1
                    18-29                     32
                    30-45                     11
                    46-60                     14
                    above 60                   0
Gender              male                      34
                    female                    24
First language      German                    55
                    English                    0
                    other                      3
Microphone          internal                  33
                    external                  25
Experience with     none                      20
speech input        some                      15
                    medium                    20
                    often                      3
                    daily                      0

Figure 1. Recognition grouped by attempts.

ANALYSIS AND RESULTS
We analyzed the frameworks in two different ways. On the one hand, we analyzed the number of recognized words per language and per framework. On the other hand, we compared the two frameworks in different categories relevant for practical usage in our hypervideo player.

Recognition of Words
We analyze the recognition of the words for the two languages first separately and then together. Taking a look at the recognition of the German words, it can be said that annyang has a better recognition rate than PocketSphinx.js (see Table 3 and Figure 1, blue and gray bars). Out of 594 words (18 words spoken by 33 test users), annyang recognized 527 words in the first and 27 in the second attempt, which results in 554 recognized words. PocketSphinx.js, in contrast, recognized 399 words in the first and 62 in the second attempt, which results in 461 recognized words. The annyang library failed to recognize 37 words, while the number of not recognized words for PocketSphinx.js was 56. The biggest difference was in the number of partially recognized words¹, where the number for annyang was quite low, but PocketSphinx.js recognized 77 words partially.

¹ Partially recognized words are words that either are only a part of the given word or contain the given word (but also other letters), meaning the recognized word contains more or fewer letters than the given word.

Table 3. Recognition of German words.
                      annyang  PocketSphinx.js
1st attempt               527              399
2nd attempt                27               62
Partially recognized        3               77
Not recognized             37               56

Taking a look at the results for the English words (see Table 4 and Figure 1, orange and yellow bars), the results are similar to those of the German words. Out of 450 words (18 words spoken by 25 test users), annyang recognized 367 in the first and 25 in the second attempt, resulting in 392 correctly recognized words. In contrast, PocketSphinx.js recognized 269 words in the first and 42 words in the second attempt, resulting in 311 correctly recognized words. Only 1 word was recognized partially using annyang, whereas PocketSphinx.js recognized 88 words partially. For the English words, annyang showed slightly worse results (57 not recognized words) than PocketSphinx.js (51 not recognized words). One reason for the higher number of not recognized words might be the fact that the words were not spoken by native speakers. The level of correct pronunciation is unfortunately not known in this case.

Table 4. Recognition of English words.
                      annyang  PocketSphinx.js
1st attempt               367              269
2nd attempt                25               42
Partially recognized        1               88
Not recognized             57               51
Summarizing the results over both languages, it can be noted that annyang showed better overall results than PocketSphinx.js (see Table 5). Annyang had a recognition rate of 90.61 % (946 of the 1044 spoken words), while PocketSphinx.js recognized only about three quarters (73.94 %) of the words. The rate of not recognized words was around 10 % for both libraries. One reason for the worse results for PocketSphinx.js may be background noise, which has a greater influence on PocketSphinx.js than on annyang.

Table 5. Recognition rate of all words in percent.
                      annyang  PocketSphinx.js
1st attempt             85.63            63.98
2nd attempt              4.98             9.96
Partially recognized     0.38            15.80
Not recognized           9.00            10.25
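The percentages in Table 5 follow directly from the raw counts in Tables 3 and 4 over all 594 + 450 = 1044 spoken words; for example, for annyang:

```javascript
// Recomputing two Table 5 entries for annyang from Tables 3 and 4.
const total = 594 + 450;                              // 1044 spoken words
console.log((100 * (527 + 367) / total).toFixed(2));  // "85.63" (1st attempt)
console.log((100 * (27 + 25) / total).toFixed(2));    // "4.98"  (2nd attempt)
```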
Taking a look at the recognition performance of individual words (see Figure 2), it can be stated that the recognition of the German words with annyang did not show huge differences between the words. The English words, in contrast, showed larger differences: with annyang, the words "fade in" and "journal" were recognized correctly fewer than 15 out of 25 times. The results for German words with PocketSphinx.js are worse for all words, especially for the word "abspielen". Taking a look at the results for the English words, especially the words "fade in", "search", "full screen", "picture", and "picture gallery" showed worse results, with 10 or fewer out of 25 recognized words.

Figure 2. Recognition grouped by words.

Practical Comparison
While the results in the user tests regarding word recognition performance were clearly in favor of annyang, the decision for using one of the libraries in real-world HTML5 hypervideo players requires further thought. We examined five factors further, namely: dependencies and integration, reliability, availability, browser support, and supported languages. Dependencies and integration as well as supported languages may be of less interest. Assuming that no large changes are made in the Web application that uses the speech recognition, the integration only has to be implemented once. Regarding language support, both libraries support a large number of languages or provide the possibility to extend or create language models in case they do not exist already.

Reliability, availability, and browser support play a more important role. Depending on the hypervideo application area, background noise may occur, an internet connection may not be available at all times, or company restrictions may not allow the use of certain browsers. Please refer to Table 6 for a comparison of what to use best in a given scenario.

Table 6. Practical comparison of annyang and PocketSphinx.js.
Reliability: annyang - Good: background noise is reliably distinguished from speech; recognition of spoken words is reliable in most cases. PocketSphinx.js - Satisfactory: recognition is reliable as long as the surroundings have no background noise.
Availability: annyang - Internet connection is necessary. PocketSphinx.js - Application on client side, no Internet connection necessary.
Browser support: annyang - Limited to Chrome. PocketSphinx.js - All current browsers except Internet Explorer.
CONCLUSION
In this work, we describe the implementation of a test framework for the speech recognition libraries annyang and PocketSphinx.js. We wanted to test the quality of the recognition of certain words that could be used to verbally control hypervideo players. As a result, it can be noted that annyang provides better recognition results both for English and German words. However, recognition may not be the only factor to consider when integrating one of the libraries into a hypervideo player. Depending on the application area, the occurrence of background noise (reliability), the availability of an internet connection, and the used browsers may influence the selection of the library.

In the tests described in this work, we only used Google Chrome, due to missing support for the libraries in other browsers. In future work, testing other browsers and other libraries may bring further results that may influence the selection of one of the libraries.

The voice control should be integrated into the hypervideo player and tested in a real-world scenario, measuring user frustration due to speech recognition performance in a real-world setting. Depending on the scenario the hypervideo player is used in, another hypervideo control approach may also be helpful. In the case of a physiotherapy or fitness training, for example, it is helpful to show the main video contents on a bigger screen. A solution to enable easier control of the hypervideo in this specific case may be a second-screen application that splits contents from control elements [13]. Both approaches should be compared for their suitability in these scenarios.
REFERENCES
1. Apple Inc. 2016a. Use Siri on your iPhone, iPad, or iPod touch. (2016). Website https://support.apple.com/en-us/HT204389 (accessed April 20, 2016).
2. Apple Inc. 2016b. Use your voice to enter text on your Mac. (2016). Website https://support.apple.com/en-us/HT202584 (accessed May 27, 2016).
3. Carnegie Mellon University. 2016. CMU Sphinx - Open Source Speech Recognition Toolkit. (2016). Website http://cmusphinx.sourceforge.net/ (accessed April 20, 2016).
4. GitHub, Inc. 2016a. annyang - Speech recognition for your site. (2016). Website https://github.com/TalAter/annyang (accessed April 20, 2016).
5. GitHub, Inc. 2016b. Emscripten: An LLVM-to-JavaScript Compiler. (2016). Website https://github.com/kripken/emscripten (accessed April 20, 2016).
6. GitHub, Inc. 2016c. Google Speech API v2. (2016). Website https://github.com/gillesdemey/google-speech-v2 (accessed May 27, 2016).
7. GitHub, Inc. 2016d. Pocketsphinx.js - Speech Recognition in JavaScript. (2016). Website https://github.com/syl22-00/pocketsphinx.js/blob/master/README.md (accessed April 20, 2016).
8. Google. 2016a. android.speech. (2016). Website https://developer.android.com/reference/android/speech/package-summary.html (accessed April 20, 2016).
9. Google. 2016b. Meet the Google app. (2016). Website http://www.google.com/search/about/ (accessed April 20, 2016).
10. Google. 2016c. utter! Voice Commands BETA! (2016). Website https://play.google.com/store/apps/details?id=com.brandall.nutter (accessed April 20, 2016).
11. IBM. 2016. Speech to Text. (2016). Website http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/speech-to-text.html#how-it-is-used-block (accessed May 27, 2016).
12. Kaldi. 2016. Kaldi. (2016). Website http://kaldi-asr.org/ (accessed May 27, 2016).
13. Britta Meixner, Christian Handschigl, Stefan John, and Michael Granitzer. 2016. From Single Screen to Dual Screen - a Design Study for a User-Controlled Hypervideo-Based Physiotherapy Training. In Proceedings of WSICC 2016.
14. Britta Meixner, Stefan John, and Christian Handschigl. 2015. SIVA Suite: Framework for Hypervideo Creation, Playback and Management. In Proceedings of the 23rd ACM International Conference on Multimedia (MM '15). ACM, New York, NY, USA, 713-716. DOI: http://dx.doi.org/10.1145/2733373.2807413
15. Britta Meixner, Katrin Tonndorf, Stefan John, Christian Handschigl, Kai Hofmann, and Michael Granitzer. 2014. A Multimedia Help System for a Medical Scenario in a Rehabilitation Clinic. In Proceedings of the 14th International Conference on Knowledge Technologies and Data-driven Business (i-KNOW '14). ACM, New York, NY, USA, Article 25, 8 pages. DOI: http://dx.doi.org/10.1145/2637748.2638429
16. Microsoft. 2016. Get Started with Windows 10 - What is Cortana? (2016). Website http://windows.microsoft.com/en-us/windows-10/getstarted-what-is-cortana (accessed April 20, 2016).
17. Mozilla Developer Network. 2016. Web Speech API. (2016). Website https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API (accessed April 20, 2016).
18. Politepix. 2016. OpenEars - iPhone Voice Recognition and Text-To-Speech. (2016). Website http://www.politepix.com/openears/ (accessed May 27, 2016).
19. Speechmatics. 2016. speech made simple. (2016). Website https://speechmatics.com/ (accessed May 27, 2016).
20. TomTom International BV. 2016. Why TomTom devices are the easiest. (2016). Website http://www.tomtom.com/whytomtom/subject.php?subject=4 (accessed April 20, 2016).
21. Vocapia Research SAS. 2016. Speech to Text API. (2016). Website http://www.vocapia.com/speech-to-text-api.html (accessed May 27, 2016).
22. VoxForge. 2016. VoxForge - Downloads - German. (2016). Website http://www.voxforge.org/de/Downloads (accessed April 20, 2016).
23. W3C. 2012. Web Speech API Specification (19 October 2012). (2012). Website https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html (accessed June 09, 2016).
24. W3C. 2016. Web Audio API - W3C Editor's Draft 15 April 2016. (2016). Website https://webaudio.github.io/web-audio-api/ (accessed April 19, 2016).
25. WebResourcesDepot. 2016. Speech Recognition With JavaScript - annyang. (2016). Website http://webresourcesdepot.com/speech-recognition-with-javascript-annyang/ (accessed April 20, 2016).
26. Wit.ai, Inc. 2016. wit.ai - Natural Language for Developers. (2016). Website https://wit.ai/ (accessed May 27, 2016).