Speech Control for HTML5 Hypervideo Players
Britta Meixner 1,2, Fabian Kallmeier 1
1 University of Passau, Innstrasse 43, 94032 Passau, Germany
2 FX Palo Alto Laboratory, 3174 Porter Drive, Palo Alto, CA 94304, USA
meixner@fxpal.com, kallmeie@fim.uni-passau.de
ABSTRACT
Hypervideo usage scenarios like physiotherapy training or instructions for manual tasks make it hard for users to operate an input device like a mouse or a touch screen on a hand-held device while they are performing an exercise or using both hands for a manual task. In this work, we try to overcome this issue by providing an alternative input method for hypervideo navigation using speech commands. In a user test, we evaluated two speech recognition libraries, annyang (in combination with the Web Speech API) and PocketSphinx.js (in combination with the Web Audio API), for their usability to control hypervideo players. Test users spoke 18 words, either in German or English, which were recorded and then processed by both libraries. We found that annyang shows better recognition results. However, depending on other factors of influence, like the occurrence of background noise (reliability), the availability of an internet connection, or the used browser, PocketSphinx.js may be a better fit.

ACM Classification Keywords
H.5.2. Information Interfaces and Presentation (e.g. HCI): User Interfaces

Author Keywords
Hypervideo; Navigation; Language Processing; Speech Input; HTML5

INTRODUCTION
Using speech input, users are nowadays able to control smartphones, navigation systems, and Smart TVs without touching them. Depending on the system, either certain commands are recognized (for example in TomTom navigation systems [20]), or freely formulated questions can be asked (like Siri for iPhones [1], utter! for Android phones [10], or the Google app [9]) which are then processed by the system trying to find an answer.

However, up to now, only few homepages and Web applications have built-in support for speech input. Especially hypervideo players could benefit from speech control. Hypervideos consist of interlinked video scenes which are enriched with additional information. Playing such videos requires special players that provide additional means of navigation into additional information, within scenes, and between scenes [14]. In usage scenarios like cooking instructions, physiotherapy and fitness training [15], or physical tasks that have to be done with two hands, speech controls may help the user to navigate in the hypervideo without having to interrupt the current task. Using voice commands, the hypervideo may be paused, next scenes may be selected, or annotations may be read without having to interrupt the task or exercise.

SPEECH RECOGNITION FRAMEWORKS
Several speech recognition APIs exist, with varying features and limitations. Available APIs are, for example, the Google Speech API [6], which accepts 10-15 seconds of audio; the IBM Speech to Text API [11], which uses IBM's speech recognition capabilities; wit.ai [26], which is an open and extensible natural language platform; Speechmatics [19] and the VoxSigma REST API [21], which transcribe uploaded files into text; or the open source APIs Kaldi [12] and OpenEars [18], the latter of which provides free speech recognition and speech synthesis for the iPhone. Hereafter, we briefly describe the combinations of frameworks that are tested in the remainder of this work. We chose these frameworks based on the following criteria: the framework should be able to process longer phrases (in case the speech recognition gets extended in the player), it should be possible to integrate it into a Web application, and the library should not be limited to certain operating systems.

Web Audio API and PocketSphinx.js
The Web Audio API is a "high-level JavaScript API for processing and synthesizing audio in web applications" [24]. It allows splitting and merging of channels in an audio stream. Audio sources from an HTML5 <audio> or <video> element can be processed. It is furthermore possible to process live audio input from a MediaStream via getUserMedia() [24]. The speech recognition library PocketSphinx.js is written entirely in JavaScript and runs entirely in the web browser [7], building on the Web Audio API. The speech recognizer is implemented in C (PocketSphinx) [3] and converted to JavaScript using Emscripten [5]. It is possible to add words, grammars, and key phrases to extend or improve the recognition [7]. Each language needs its own language model with a vocabulary.
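To illustrate how live microphone input reaches such a recognizer, the following minimal sketch captures audio via getUserMedia() and forwards raw sample buffers to a PocketSphinx.js worker. The worker message format follows the PocketSphinx.js documentation [7]; the worker file name is an assumption, and the sample-rate conversion the recognizer requires is omitted.

```javascript
// Sketch: live audio capture with the Web Audio API, feeding a
// PocketSphinx.js recognizer running in a Web Worker.
const recognizer = new Worker('pocketsphinx.js'); // assumed file name

const audioContext = new AudioContext();
navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
  const source = audioContext.createMediaStreamSource(stream);
  // A ScriptProcessorNode delivers raw PCM buffers to JavaScript.
  const processor = audioContext.createScriptProcessor(4096, 1, 1);
  processor.onaudioprocess = (event) => {
    const samples = event.inputBuffer.getChannelData(0);
    // PocketSphinx.js expects 16 kHz 16-bit samples; the required
    // conversion from Float32 is omitted in this sketch.
    recognizer.postMessage({ command: 'process', data: samples });
  };
  source.connect(processor);
  processor.connect(audioContext.destination);
});
```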
Web Speech API and annyang
The Web Speech API enables the incorporation of voice data into web apps [17][23]. The SpeechRecognition (Asynchronous Speech Recognition) interface "provides the ability to recognize voice context from an audio input (normally via the device's default speech recognition service) and respond appropriately" [17]. The SpeechGrammar interface represents a container for a particular set of grammar (defined in the JSpeech Grammar Format (JSGF)) that an app should recognize [17]. As most modern operating systems have a speech recognition system for issuing voice commands, this system is used for speech recognition on the device. Speech recognition systems are, for example, Dictation on Mac OS X [2], Siri on iOS [1], Cortana on Windows 10 [16], and Android Speech [8]. The tiny standalone JavaScript SpeechRecognition library annyang lets users control a homepage with voice commands [4]. The size of the library is less than 1 kB. The back-end is supported by the (Chrome) Web Speech API [25].
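As a brief illustration of how annyang is used (not the test setup described later), spoken phrases are mapped to JavaScript callbacks. The player object and its methods are hypothetical:

```javascript
// Sketch: mapping voice commands to hypothetical player functions.
if (annyang) {
  annyang.addCommands({
    'play':        () => player.play(),
    'volume up':   () => player.increaseVolume(),
    'full screen': () => player.enterFullscreen(),
  });
  annyang.start(); // prompts for microphone access and starts listening
}
```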
IMPLEMENTATION
In order to test the annyang and PocketSphinx.js projects, we installed a reference platform. It consists of a Web server (Apache 2.4.10) and a database (MySQL 5.5.38). We used Perl (version 5.14.2) for the implementation of the dynamic test Web page, which shows contents in the selected language (German or English) and provides a log-in system to avoid abuse and falsification of the test results. We only used Google Chrome for our tests, because annyang is built on the Web Speech API, which was only available for Google Chrome at the time of the tests.

annyang
For an implementation of speech detection and recognition with annyang, it was only necessary to include the JavaScript library in the Web application. In order to allow the usage of the microphone, it was mandatory to install an SSL certificate. We furthermore rewrote the onResult function of the annyang project to make the implementation conform with the PocketSphinx.js implementation described hereafter. To test German words, the language only had to be set to German using the setLanguage function. Further modifications and additions were not necessary.
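The two touch points just mentioned can be sketched through annyang's public API. Note that we modified the library's internal onResult function directly, so the callback shown here is only an approximation, and logAttempt is a hypothetical helper of the test page:

```javascript
// Sketch: German recognition with annyang and a hook on raw results.
annyang.setLanguage('de');

annyang.addCallback('result', (phrases) => {
  // "phrases" lists the candidate transcriptions, best match first.
  logAttempt(phrases[0]); // hypothetical test-page logging helper
});
```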
PocketSphinx.js
The implementation of speech detection and recognition with PocketSphinx.js required more effort than the implementation with annyang, because the source code of PocketSphinx.js only comes with an English acoustic model. To avoid having to generate our own acoustic model for German words, we used the one provided by VoxForge [22]. We furthermore used the possibility to add words and grammars at run-time to avoid overly large files, which could lead to crashes of the browser. For that reason, we compiled the acoustic models outside the main file, which led to smaller files and better performance.
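The run-time mechanism works roughly as in the following sketch, which follows the message format in the PocketSphinx.js documentation [7]. The worker file name and the phonetic transcriptions are illustrative, not taken from our test system:

```javascript
// Sketch: adding words and a small command grammar at run-time.
const recognizer = new Worker('pocketsphinx.js'); // assumed file name
recognizer.postMessage({ command: 'initialize' });

// Words are registered together with pronunciations drawn from the
// acoustic model's phone set (transcriptions here are illustrative).
recognizer.postMessage({
  command: 'addWords',
  data: [['PLAY', 'P L EY'], ['REPEAT', 'R IY P IY T']],
});

// A one-state grammar that accepts either command.
recognizer.postMessage({
  command: 'addGrammar',
  data: {
    numStates: 1,
    start: 0,
    end: 0,
    transitions: [
      { from: 0, to: 0, word: 'PLAY' },
      { from: 0, to: 0, word: 'REPEAT' },
    ],
  },
});
```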
Test System
The web page used for the tests consisted of an index.pl file in Perl (which generated the HTML code), some JavaScript files, and a MySQL database. The database stored user names and passwords, test words and their pronunciation, test results, as well as data that might be displayed on the dynamic web page. JavaScript was used to start and stop voice recording and recognition in the different technologies. In our tests, we used a fixed set of words which was shown in our test application and had to be spoken out loud by the participants. We furthermore used a timer which limited the recognition time per word and showed each word for 10 seconds. If a word was recognized correctly during the first attempt, the next word was loaded. If it was not recognized, a second attempt with a new timer was started. The user then had to repeat the word. The Web page informed the user whether the word was recognized correctly. The system used the versions of annyang and PocketSphinx.js available in November 2014.
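The per-word flow just described can be condensed into a short sketch. The helpers showWord, markResult, nextWord, and onRecognized are hypothetical stand-ins for the actual test-page code:

```javascript
// Sketch: one word, shown for 10 seconds per attempt, two attempts max.
function testWord(word, attempt = 1) {
  showWord(word, attempt);
  const timer = setTimeout(() => {
    if (attempt === 1) {
      testWord(word, 2);       // second attempt with a new timer
    } else {
      markResult(word, false); // not recognized in either attempt
      nextWord();
    }
  }, 10000);

  onRecognized(word, () => {   // called when a library reports a match
    clearTimeout(timer);
    markResult(word, true);
    nextWord();
  });
}
```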
STUDY/METHOD
To find out whether annyang or PocketSphinx.js performs better, we conducted a study with 58 participants.

Procedure/Data Collection
We used the 18 words shown in Table 1, which represent the key functions of our HTML5 hypervideo player. The words were presented to the participants in random order to avoid exercise effects towards higher word IDs. Each recorded word was tested with both technologies, annyang and PocketSphinx.js. Before starting the tests, the users had to select which language they wanted to do the test in. As a result, 33 participants used the German version of the test and 25 users participated in the English version.

Table 1. Words tested for recognition.
ID  German word         English word
1   abspielen           play
2   wiederholen         repeat
3   öffnen              open
4   schließen           close
5   lauter              volume up
6   leiser              volume down
7   einblenden          fade in
8   ausblenden          fade out
9   vorwärts            previous
10  zurück              next
11  Inhaltsverzeichnis  content
12  Suche               search
13  Tagebuch            journal
14  Vollbild            full screen
15  Fensteransicht      windows view
16  Bild                picture
17  Bildergalerie       picture gallery
18  Hauptvideo          main video

Participants
The participants in our study were mainly between 18 and 60 years old. 34 of the participants were male, 24 were female. The test was mainly distributed in Germany, so most of the participants were native German speakers. The tests were conducted on desktop computers or laptops, whereby 33 participants used internal and 25 participants used external microphones. See Table 2 for more precise demographic data.
Table 2. Test participant demographics.
                                    Participants
Age                 below 18                   1
                    18-29                     32
                    30-45                     11
                    46-60                     14
                    above 60                   0
Gender              male                      34
                    female                    24
First language      German                    55
                    English                    0
                    other                      3
Microphone          internal                  33
                    external                  25
Experience with     none                      20
speech input        some                      15
                    medium                    20
                    often                      3
                    daily                      0

Figure 1. Recognition grouped by attempts.

ANALYSIS AND RESULTS
We analyzed the frameworks in two different ways. On the one hand, we analyzed the number of recognized words per language and per framework. On the other hand, we compared the two frameworks in different categories relevant for practical usage in our hypervideo player.

Recognition of Words
We analyze the recognition of the words for the two languages first separately and then together. Taking a look at the recognition of the German words, it can be said that annyang has a better recognition rate than PocketSphinx.js (see Table 3 and Figure 1, blue and gray bars). Out of 594 words (18 words spoken by 33 test users), annyang recognized 527 words in the first and 27 in the second attempt, which results in 554 recognized words. PocketSphinx.js, in contrast, recognized 399 words in the first and 62 in the second attempt, which results in 461 recognized words. The annyang library failed to recognize 37 words, while the number of not recognized words for PocketSphinx.js was 56. The biggest difference was in the number of partially recognized words¹, where the number for annyang was quite low, but PocketSphinx.js recognized 77 words partially.

¹ Partially recognized words are words that either are only a part of the given word or contain the given word (but also other letters), meaning the recognized word contains more or fewer letters than the given word.

Table 3. Recognition of German words.
                      annyang  PocketSphinx.js
1st attempt               527              399
2nd attempt                27               62
Partially recognized        3               77
Not recognized             37               56

Taking a look at the results for the English words (see Table 4 and Figure 1, orange and yellow bars), the results are similar to those of the German words. Out of 450 words (18 words spoken by 25 test users), annyang recognized 367 in the first and 25 in the second attempt, resulting in 392 correctly recognized words. In contrast, PocketSphinx.js recognized 269 words in the first and 42 words in the second attempt, resulting in 311 correctly recognized words. Only 1 word was recognized partially using annyang, whereas PocketSphinx.js recognized 88 words partially. For the English words, annyang showed slightly worse results (57 not recognized words) than PocketSphinx.js (51 not recognized words). One reason for the higher number of not recognized words might be the fact that the words were not spoken by native speakers. The level of correct pronunciation is unfortunately not known in this case.

Table 4. Recognition of English words.
                      annyang  PocketSphinx.js
1st attempt               367              269
2nd attempt                25               42
Partially recognized        1               88
Not recognized             57               51
Summarizing the results over both languages, it can be noted that annyang showed better overall results than PocketSphinx.js (see Table 5). Annyang had a recognition rate of 90.61 % (946 of the 1044 spoken words), while PocketSphinx.js recognized only about three quarters (73.94 %) of the words. The rate of not recognized words was around 10 % for both libraries. One reason for the worse results for PocketSphinx.js may be background noise, which has a greater influence on PocketSphinx.js than on annyang.

Table 5. Recognition rate of all words in percent.
                      annyang  PocketSphinx.js
1st attempt             85.63            63.98
2nd attempt              4.98             9.96
Partially recognized     0.38            15.80
Not recognized           9.00            10.25
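The percentages in Table 5 follow directly from the raw counts in Tables 3 and 4 over all 594 + 450 = 1044 spoken words; for example, for annyang:

```javascript
// Recomputing two Table 5 entries for annyang from Tables 3 and 4.
const total = 594 + 450;                              // 1044 spoken words
console.log((100 * (527 + 367) / total).toFixed(2));  // "85.63" (1st attempt)
console.log((100 * (27 + 25) / total).toFixed(2));    // "4.98"  (2nd attempt)
```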
Taking a look at the recognition performance of individual words (see Figure 2), it can be stated that the recognition of the German words with annyang did not show huge differences between the words. The English words, in contrast, showed larger differences: with annyang, the words "fade in" and "journal" were recognized correctly fewer than 15 out of 25 times. The results for German words with PocketSphinx.js are worse for all words, especially for the word "abspielen". Taking a look at the results for the English words, especially the words "fade in", "search", "full screen", "picture", and "picture gallery" showed worse results, with 10 or fewer out of 25 recognized words.

Figure 2. Recognition grouped by words.

Practical Comparison
While the results in the user tests regarding word recognition performance were clearly in favor of annyang, the decision for using one of the libraries in real-world HTML5 hypervideo players requires further thought. We examined five factors further, namely: dependencies and integration, reliability, availability, browser support, and supported languages. Dependencies and integration as well as supported languages may be of less interest. Assuming that no large changes are made in the Web application that uses the speech recognition, the integration only has to be implemented once. Regarding language support, both libraries support a large number of languages or provide the possibility to extend or create language models in case they do not exist already.

Reliability, availability, and browser support play a more important role. Depending on the hypervideo application area, background noise may occur, an internet connection may not be available at all times, or company restrictions may not allow the use of certain browsers. Please refer to Table 6 for a comparison of what to use best in a given scenario.

Table 6. Practical comparison of annyang and PocketSphinx.js.
Reliability: annyang - Good: background noise is reliably distinguished from speech; recognition of spoken words is reliable in most cases. PocketSphinx.js - Satisfactory: recognition is reliable as long as the surroundings have no background noise.
Availability: annyang - Internet connection is necessary. PocketSphinx.js - Application on client side, no Internet connection necessary.
Browser support: annyang - Limited to Chrome. PocketSphinx.js - All current browsers except Internet Explorer.
CONCLUSION
In this work, we describe the implementation of a test framework for the speech recognition libraries annyang and PocketSphinx.js. We wanted to test the quality of the recognition of certain words that could be used to verbally control hypervideo players. As a result, it can be noted that annyang provides better recognition results both for English and German words. However, recognition may not be the only factor to consider when integrating one of the libraries into a hypervideo player. Depending on the application area, the occurrence of background noise (reliability), the availability of an internet connection, and the used browsers may influence the selection of the library.

In the tests described in this work, we only used Google Chrome, due to missing support for the libraries in other browsers. In future work, testing other browsers and other libraries may bring further results that may influence the selection of one of the libraries.

The voice control should be integrated into the hypervideo player and tested in a real-world scenario, measuring user frustration due to speech recognition performance in a real-world setting. Depending on the scenario the hypervideo player is used in, another hypervideo control approach may also be helpful. In the case of a physiotherapy or fitness training, for example, it is helpful to show the main video contents on a bigger screen. A solution to enable easier control of the hypervideo in this specific case may be a second-screen application that splits contents from control elements [13]. Both approaches should be compared for their suitability in these scenarios.
REFERENCES
1. Apple Inc. 2016a. Use Siri on your iPhone, iPad, or iPod touch. (2016). Website https://support.apple.com/en-us/HT204389 (accessed April 20, 2016).
2. Apple Inc. 2016b. Use your voice to enter text on your Mac. (2016). Website https://support.apple.com/en-us/HT202584 (accessed May 27, 2016).
3. Carnegie Mellon University. 2016. CMU Sphinx - Open Source Speech Recognition Toolkit. (2016). Website http://cmusphinx.sourceforge.net/ (accessed April 20, 2016).
4. GitHub, Inc. 2016a. annyang - Speech recognition for your site. (2016). Website https://github.com/TalAter/annyang (accessed April 20, 2016).
5. GitHub, Inc. 2016b. Emscripten: An LLVM-to-JavaScript Compiler. (2016). Website https://github.com/kripken/emscripten (accessed April 20, 2016).
6. GitHub, Inc. 2016c. Google Speech API v2. (2016). Website https://github.com/gillesdemey/google-speech-v2 (accessed May 27, 2016).
7. GitHub, Inc. 2016d. Pocketsphinx.js - Speech Recognition in JavaScript. (2016). Website https://github.com/syl22-00/pocketsphinx.js/blob/master/README.md (accessed April 20, 2016).
8. Google. 2016a. android.speech. (2016). Website https://developer.android.com/reference/android/speech/package-summary.html (accessed April 20, 2016).
9. Google. 2016b. Meet the Google app. (2016). Website http://www.google.com/search/about/ (accessed April 20, 2016).
10. Google. 2016c. utter! Voice Commands BETA! (2016). Website https://play.google.com/store/apps/details?id=com.brandall.nutter (accessed April 20, 2016).
11. IBM. 2016. Speech to Text. (2016). Website http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/speech-to-text.html#how-it-is-used-block (accessed May 27, 2016).
12. Kaldi. 2016. Kaldi. (2016). Website http://kaldi-asr.org/ (accessed May 27, 2016).
13. Britta Meixner, Christian Handschigl, Stefan John, and Michael Granitzer. 2016. From Single Screen to Dual Screen - a Design Study for a User-Controlled Hypervideo-Based Physiotherapy Training. In Proceedings of WSICC 2016.
14. Britta Meixner, Stefan John, and Christian Handschigl. 2015. SIVA Suite: Framework for Hypervideo Creation, Playback and Management. In Proceedings of the 23rd ACM International Conference on Multimedia (MM '15). ACM, New York, NY, USA, 713-716. DOI: http://dx.doi.org/10.1145/2733373.2807413
15. Britta Meixner, Katrin Tonndorf, Stefan John, Christian Handschigl, Kai Hofmann, and Michael Granitzer. 2014. A Multimedia Help System for a Medical Scenario in a Rehabilitation Clinic. In Proceedings of the 14th International Conference on Knowledge Technologies and Data-driven Business (i-KNOW '14). ACM, New York, NY, USA, Article 25, 8 pages. DOI: http://dx.doi.org/10.1145/2637748.2638429
16. Microsoft. 2016. Get Started with Windows 10 - What is Cortana? (2016). Website http://windows.microsoft.com/en-us/windows-10/getstarted-what-is-cortana (accessed April 20, 2016).
17. Mozilla Developer Network. 2016. Web Speech API. (2016). Website https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API (accessed April 20, 2016).
18. Politepix. 2016. OpenEars - iPhone Voice Recognition and Text-To-Speech. (2016). Website http://www.politepix.com/openears/ (accessed May 27, 2016).
19. Speechmatics. 2016. speech made simple. (2016). Website https://speechmatics.com/ (accessed May 27, 2016).
20. TomTom International BV. 2016. Why TomTom devices are the easiest. (2016). Website http://www.tomtom.com/whytomtom/subject.php?subject=4 (accessed April 20, 2016).
21. Vocapia Research SAS. 2016. Speech to Text API. (2016). Website http://www.vocapia.com/speech-to-text-api.html (accessed May 27, 2016).
22. VoxForge. 2016. VoxForge - Downloads - German. (2016). Website http://www.voxforge.org/de/Downloads (accessed April 20, 2016).
23. W3C. 2012. Web Speech API Specification (19 October 2012). (2012). Website https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html (accessed June 09, 2016).
24. W3C. 2016. Web Audio API - W3C Editor's Draft 15 April 2016. (2016). Website https://webaudio.github.io/web-audio-api/ (accessed April 19, 2016).
25. WebResourcesDepot. 2016. Speech Recognition With JavaScript - annyang. (2016). Website http://webresourcesdepot.com/speech-recognition-with-javascript-annyang/ (accessed April 20, 2016).
26. Wit.ai, Inc. 2016. wit.ai - Natural Language for Developers. (2016). Website https://wit.ai/ (accessed May 27, 2016).