<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Speech Control for HTML5 Hypervideo Players</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Britta Meixner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FX Palo Alto Laboratory</institution>
          ,
          <addr-line>3174 Porter Drive, Palo Alto, CA 94304</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Passau</institution>
          ,
          <addr-line>Innstrasse 43, 94032 Passau</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <conference>
        <conf-name>4th International Workshop on Interactive Content Consumption at TVX'16</conf-name>
      </conference>
      <kwd-group>
        <kwd>Hypervideo</kwd>
        <kwd>Navigation</kwd>
        <kwd>Language Processing</kwd>
        <kwd>Speech Input</kwd>
      </kwd-group>
      <abstract>
        <p>Hypervideo usage scenarios like physiotherapy training or instructions for manual tasks make it hard for users to operate an input device such as a mouse or a touch screen while they are performing an exercise or using both hands for a manual task. In this work, we try to overcome this issue by providing an alternative input method for hypervideo navigation using speech commands. In a user test, we evaluated two speech recognition libraries, annyang (in combination with the Web Speech API) and PocketSphinx.js (in combination with the Web Audio API), for their suitability to control hypervideo players. Test users spoke 18 words, either in German or English, which were recorded and then processed by both libraries. We found that annyang shows better recognition results. However, depending on other factors of influence, like the occurrence of background noise (reliability), the availability of an Internet connection, or the browser used, PocketSphinx.js may be a better fit.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Using speech input, users can nowadays control
smartphones, navigation systems, and Smart-TVs without touching
them. Depending on the system, either certain commands are
recognized (for example in TomTom navigation systems [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]),
or freely formulated questions can be asked (like Siri for
iPhones [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Utter for Android Phones [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], or the Google
app [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]), which the system then processes to find
an answer.
      </p>
      <p>
        However, up to now, only a few websites and Web
applications have built-in support for speech input. Hypervideo
players especially could benefit from speech control. Hypervideos
consist of interlinked video scenes which are enriched with
additional information. Playing such videos requires special
players that provide additional means of navigation in
additional information, in scenes, and between scenes [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. In
usage scenarios like cooking instructions, physiotherapy and
fitness trainings [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], or physical tasks that have to be done
with two hands, speech controls may help the user to navigate
in the hypervideo without interrupting the current task: using
voice commands, the hypervideo may be paused, the next scene
may be selected, or annotations may be read.
      </p>
    </sec>
    <sec id="sec-frameworks">
      <title>SPEECH RECOGNITION FRAMEWORKS</title>
      <p>
        Several speech recognition APIs exist, with varying features
and limitations. Available APIs are, for example, the Google
Speech API [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] which accepts 10-15 seconds of audio, the
IBM Speech to Text API [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] which uses IBM’s speech
recognition capabilities, wit.ai [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] which is an open and
extensible natural language platform, Speechmatics [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and the
VoxSigma REST API [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] which transcribe uploaded files into
text, or the open source APIs Kaldi [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and OpenEars [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ],
the latter of which provides free speech recognition and speech
synthesis for the iPhone. Hereafter we briefly describe the
combinations of frameworks that will be tested in the
remainder of this work. We chose these frameworks based on the
following criteria: the framework should be able to process
longer phrases (in case the speech recognition in the player is
extended later); it should be possible to integrate it into a Web
application; and the library should not be limited to certain
operating systems.
      </p>
    </sec>
    <sec id="sec-webaudio">
      <title>Web Audio API and PocketSphinx.js</title>
      <p>
        The Web Audio API is a “high-level JavaScript API for
processing and synthesizing audio in web applications” [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. It
allows splitting and merging of channels in an audio stream.
Audio sources from an HTML5 &lt;audio&gt; or &lt;video&gt; element
can be processed. It is furthermore possible to process live
audio input from a MediaStream via getUserMedia() [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
The speech recognition library PocketSphinx.js is written
entirely in JavaScript and runs entirely in the web browser,
building on the Web Audio API [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The speech
recognizer is implemented in C (PocketSphinx) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and converted
into JavaScript using Emscripten [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. It is possible to add
words, grammar, and key phrases to extend or improve the
recognition [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Each language needs its own language model
with a vocabulary.
      </p>
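      <p>As an illustration of the live-audio path described above, the following sketch captures microphone input via getUserMedia, taps it with a ScriptProcessorNode, and converts the Float32 samples to the 16-bit integers a recognizer such as PocketSphinx.js consumes. The buffer size and the recognizer hand-off are assumptions, and resampling to the recognizer's expected rate (e.g. 16 kHz) is omitted for brevity.</p>

```javascript
// Pure helper: convert Float32 samples in [-1, 1] to signed 16-bit PCM,
// the sample format a speech recognizer typically consumes.
function floatTo16BitPCM(floatSamples) {
  var out = new Int16Array(floatSamples.length);
  for (var i = 0; i !== floatSamples.length; i += 1) {
    var s = Math.max(-1, Math.min(1, floatSamples[i])); // clamp to [-1, 1]
    out[i] = Math.round(s * 32767);
  }
  return out;
}

// Browser-only wiring; AudioContext does not exist outside a browser,
// so the guard lets this file load in other environments as well.
if (typeof AudioContext !== 'undefined') {
  navigator.mediaDevices.getUserMedia({ audio: true }).then(function (stream) {
    var ctx = new AudioContext();
    var source = ctx.createMediaStreamSource(stream);
    var processor = ctx.createScriptProcessor(4096, 1, 1);
    processor.onaudioprocess = function (e) {
      var pcm = floatTo16BitPCM(e.inputBuffer.getChannelData(0));
      // hand "pcm" to the recognizer here (e.g. a PocketSphinx.js worker)
    };
    source.connect(processor);
    processor.connect(ctx.destination);
  });
}
```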
    </sec>
    <sec id="sec-webspeech">
      <title>Web Speech API and annyang</title>
      <p>
        The Web Speech API enables the incorporation of voice
data into web apps [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ][
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. The SpeechRecognition
(Asynchronous Speech Recognition) interface “provides the ability
to recognize voice context from an audio input (normally via
the device’s default speech recognition service) and respond
appropriately” [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The SpeechGrammar interface represents
a container for a particular set of grammar (defined in the
JSpeech Grammar Format (JSGF)) that an app should
recognize [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. As most modern operating systems have a speech recognition
system for issuing voice commands, this system is used for speech
recognition on the device. Speech recognition systems are, for
example, Dictation on Mac OS X [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Siri on iOS [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
Cortana on Windows 10 [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], and Android Speech [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The tiny
standalone JavaScript SpeechRecognition library annyang lets
users control a website with voice commands [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The size
of the library is less than 1 kb; it uses the (Chrome) Web
Speech API as its back end [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
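      <p>A minimal sketch of how annyang maps spoken phrases to player actions: the phrases and the player object are illustrative stand-ins, while addCommands, setLanguage, and start follow annyang's public API.</p>

```javascript
// The player object only records triggered actions for illustration;
// a real hypervideo player would act on its video element instead.
var player = {
  log: [],
  play: function () { this.log.push('play'); },
  pause: function () { this.log.push('pause'); },
  openScene: function (name) { this.log.push('scene:' + name); },
};

// annyang command map: phrase pattern to handler ("*name" captures
// the rest of the utterance).
var commands = {
  'play': function () { player.play(); },
  'pause': function () { player.pause(); },
  'open scene *name': function (name) { player.openScene(name); },
};

// annyang only exists in the browser (it builds on the Chrome Web
// Speech API), so the wiring is guarded:
if (typeof annyang !== 'undefined') {
  annyang.setLanguage('en-US'); // 'de-DE' for the German test words
  annyang.addCommands(commands);
  annyang.start();
}
```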
    </sec>
    <sec id="sec-implementation">
      <title>IMPLEMENTATION</title>
      <p>In order to test the annyang and the PocketSphinx.js projects,
we set up a reference platform. It consists of a Web server
(Apache 2.4.10) and a database (MySQL 5.5.38). We used
Perl (Version 5.14.2) for the implementation of the dynamic
test Web page, which shows contents in the selected language
(German or English) and provides a log-in system to avoid
abuse and falsification of the test results. We only used Google
Chrome for our tests, because annyang is built on the Web
Speech API which was only available for Google Chrome at
the time of the tests.</p>
    </sec>
    <sec id="sec-annyang-impl">
      <title>annyang</title>
      <p>For an implementation of speech detection and recognition
with annyang, it was only necessary to include the JavaScript
library in the Web application. To allow access to the
microphone, it was mandatory to install an SSL certificate.
We furthermore rewrote the onResult function of the
annyang project to make the implementation conform with the
PocketSphinx.js implementation described hereafter. To test
German words, the language only had to be set to German
using the setLanguage function. Further modifications and
additions were not necessary.</p>
    </sec>
    <sec id="sec-pocketsphinx-impl">
      <title>PocketSphinx.js</title>
      <p>
        The implementation of speech detection and recognition with
PocketSphinx.js required more effort compared to the
implementation with annyang, because the source code of
PocketSphinx.js only comes with an English acoustic model. To
avoid having to generate our own acoustic model for
German words, we used the one provided by VoxForge [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. We
furthermore used the possibility to add words and grammars
at run time to avoid overly large files, which could lead to
crashes of the browser. For that reason, we compiled the
acoustic models outside the main file, which led to smaller files
and better performance.
      </p>
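      <p>Adding words and a grammar at run time can be sketched as follows, following the Web Worker message protocol from the PocketSphinx.js README; the worker file name, the phonetic transcriptions, and the exact grammar fields are assumptions to be checked against the version in use.</p>

```javascript
// Command words with phonetic transcriptions (illustrative; real
// transcriptions come from a pronunciation dictionary such as CMUdict).
var words = [
  ['PLAY', 'P L EY'],
  ['PAUSE', 'P AO Z'],
  ['BACK', 'B AE K'],
];

// A flat finite-state grammar accepting exactly one command word.
var grammar = {
  numStates: 2,
  start: 0,
  end: 1,
  transitions: words.map(function (w) {
    return { from: 0, to: 1, word: w[0] };
  }),
};

// Browser-only wiring ("recognizer.js" is the pocketsphinx.js worker
// file); guarded so the file also loads outside a browser.
if (typeof window !== 'undefined') {
  var recognizer = new Worker('recognizer.js');
  recognizer.postMessage({ command: 'initialize' });
  recognizer.postMessage({ command: 'addWords', data: words });
  recognizer.postMessage({ command: 'addGrammar', data: grammar });
}
```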
    </sec>
    <sec id="sec-test-system">
      <title>Test System</title>
      <p>The web page used for the tests consisted of an index.pl file
in Perl (which generated the HTML code), some JavaScript
files, and a MySQL database. The database stored user names
and passwords, test words and their pronunciation, test results,
as well as data that might be displayed on the dynamic web
page. JavaScript was used to start and stop voice recording
and recognition in the different technologies. In our tests,
we used a fixed set of words which was shown in our test
application and had to be spoken out loud by the participants.
We furthermore used a timer which limited the recognition
time per word and showed each word for 10 seconds. If a
word was recognized correctly during the first attempt, the
next word was loaded. If it was not recognized, a second
attempt with a new timer was started. The user then had to
repeat the word. The Web page informed the user whether the
word was recognized correctly. The system used versions of
annyang and PocketSphinx.js available in November 2014.</p>
    </sec>
    <sec id="sec-study">
      <title>STUDY/METHOD</title>
      <p>To find out whether annyang or PocketSphinx.js performs better, we
conducted a study with 58 participants.</p>
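      <p>The two-attempt test procedure described above can be sketched as a small driver function; the function and field names are illustrative, not taken from the actual test system.</p>

```javascript
// Sketch of the two-attempt test procedure: "recognize" stands in for
// one timed (10-second) recognition attempt and reports success.
// A word not recognized in the first attempt is repeated exactly once.
function runWordTest(testWords, recognize) {
  var results = [];
  testWords.forEach(function (word) {
    if (recognize(word, 1)) {
      results.push({ word: word, attempts: 1, recognized: true });
    } else if (recognize(word, 2)) {
      results.push({ word: word, attempts: 2, recognized: true });
    } else {
      results.push({ word: word, attempts: 2, recognized: false });
    }
  });
  return results;
}
```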
    </sec>
    <sec id="sec-procedure">
      <title>Procedure/Data Collection</title>
      <p>We used the 18 words shown in Table 1, which represent the key
functions of our HTML5 hypervideo player. The words were
presented to the participants in random order to avoid practice
effects towards higher word IDs. Each recorded word was
tested with both technologies, annyang and PocketSphinx.js.
Before starting the tests, the users had to select which language
they wanted to do the test in. As a result, 33 participants used
the German version of the test and 25 users participated in the
English version.</p>
    </sec>
    <sec id="sec-participants">
      <title>Participants</title>
      <p>The participants in our study were mainly between 18 and
60 years old. 34 of the participants were male, 24 were
female. The test was mainly distributed in Germany, so most
of the participants were native German speakers. The tests
were conducted on desktop computers or laptops; 33
participants used internal and 25 participants used external
microphones. See Table 2 for more precise demographic data.</p>
    </sec>
    <sec id="sec-results">
      <title>ANALYSIS AND RESULTS</title>
      <p>We analyzed the frameworks in two different ways. On the
one hand, we analyzed the number of recognized words per
language and per framework. On the other hand, we
compared the two frameworks in different categories relevant for
practical usage in our hypervideo player.</p>
    </sec>
    <sec id="sec-recognition">
      <title>Recognition of Words</title>
      <p>We analyze the recognition of the words for the two languages
first separately and then together. Taking a look at the
recognition of the German words, it can be said that annyang has a
better recognition rate than PocketSphinx.js (see Table 3 and
Figure 1, blue and gray bars). Out of 594 words (18 words
spoken by 33 test users), annyang recognized 527 words in
the first and 27 in the second attempt which results in 554
recognized words. PocketSphinx.js in contrast recognized 399
words in the first and 62 in the second attempt which results
in 461 recognized words. The annyang library failed to
recognize 37 words, while the number of not recognized words
for PocketSphinx.js was 56. The biggest difference was in
the number of partially recognized words (words that either
form only a part of the given word or contain the given word
along with other letters, i.e., the recognized word has more or
fewer letters than the given word): the number for annyang
was quite low, but PocketSphinx.js recognized 77 words
partially.</p>
      <p>Taking a look at the results for the English words (see
Table 4 and Figure 1, orange and yellow bars), the results are
similar to those of the German words. Out of 450 words (18
words spoken by 25 test users), annyang recognized 367 in the
first and 25 in the second attempt, resulting in 392 correctly
recognized words. In contrast, PocketSphinx.js recognized
269 words in the first and 42 words in the second attempt,
resulting in 311 correctly recognized words. Only 1 word was
recognized partially using annyang, whereas PocketSphinx.js
recognized 88 words partially. For the English words, annyang
showed slightly worse results (57 not recognized words) than
PocketSphinx.js (51 not recognized words). One reason for
the higher number of not recognized words might be the fact
that the words were not spoken by native speakers. The level
of correct pronunciation is unfortunately not known in this
case.</p>
      <p>Summarizing the results over all languages, it can be noted that
annyang showed better overall results than PocketSphinx.js
(see Table 5). Annyang had a recognition rate of 90.61 %,
while PocketSphinx.js recognized only about three quarters
(73.94 %) of the words. The rate of not recognized words
was around 10 % for both libraries. One reason for the worse
results for PocketSphinx.js may be background noise, which
has a greater influence on PocketSphinx.js than on annyang.</p>
      <p>Taking a look at the recognition performance of individual
words (see Figure 2), it can be stated that the recognition of
the German words with annyang did not show huge differences
between the words. The English words, in contrast, showed
larger differences: the words “fade in” and “journal” were
recognized correctly fewer than 15 out of 25 times with annyang.
The results for the German words with PocketSphinx.js are
worse for all words, especially for the word “abspielen”. Taking
a look at the results for the English words, it can be said that
especially the words “fade in”, “search”, “full screen”,
“picture”, and “picture gallery” showed worse results, with 10
or fewer out of 25 recognized words.</p>
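      <p>The classification of recognized, partially recognized, and not recognized words used above can be sketched as a small comparison function; the name and the return labels are illustrative, not taken from the test system.</p>

```javascript
// "Partially recognized" as defined above: the recognized word forms
// only a part of the given word, or contains it along with other letters.
function classifyRecognition(givenWord, recognizedWord) {
  if (recognizedWord === givenWord) { return 'correct'; }
  if (recognizedWord.length === 0) { return 'not recognized'; }
  if (givenWord.indexOf(recognizedWord) !== -1) { return 'partial'; }
  if (recognizedWord.indexOf(givenWord) !== -1) { return 'partial'; }
  return 'not recognized';
}
```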
    </sec>
    <sec id="sec-practical">
      <title>Practical Comparison</title>
      <p>While the results of the user tests regarding word recognition
performance were clearly in favor of annyang, the decision
to use one of the libraries in real-world HTML5
hypervideo players requires further thought. We examined five
factors in more detail: dependencies and integration, reliability,
availability, browser support, and supported languages.
Dependencies and integration as well as supported languages
may be of less interest. Assuming that no large changes are
made in the Web application that uses the speech recognition,
the integration only has to be implemented once. Regarding
language support, both libraries show a large number of
supported languages or provide the possibility to extend or create
language models in case they do not exist already.
Reliability, availability, and browser support play a more
important role. Depending on the hypervideo application area,
background noise may occur, an internet connection may not
be available at all times, or company restrictions may not
allow the use of certain browsers. Please refer to Table 6 for a
comparison of what works best in a given scenario.</p>
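      <p>The scenario-dependent choice can be sketched as a small decision function; the function name and the env flags are illustrative, not part of either library.</p>

```javascript
// env: { online: boolean, webSpeechSupported: boolean }
function chooseLibrary(env) {
  if (env.online) {
    if (env.webSpeechSupported) {
      // annyang needs the (Chrome) Web Speech API plus a connection,
      // and it showed the better recognition results in our tests.
      return 'annyang';
    }
    // Web Speech API missing: fall back to the client-side library.
    return 'pocketsphinx.js';
  }
  // Offline: only the client-side library works.
  return 'pocketsphinx.js';
}

// In a browser, env could come from feature detection, e.g.
// { online: navigator.onLine,
//   webSpeechSupported: 'webkitSpeechRecognition' in window }
```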
    </sec>
    <sec id="sec-conclusion">
      <title>CONCLUSION</title>
      <p>In this work, we describe the implementation of a test
framework for the speech recognition libraries annyang and
PocketSphinx.js. We wanted to test the quality of the recognition of
certain words that could be used to verbally control hypervideo
players. As a result, it can be noted that annyang provides
better recognition results both for English and German words.</p>
    </sec>
    <sec id="sec-2">
      <title>FUTURE WORK</title>
      <table-wrap id="tab6">
        <label>Table 6</label>
        <caption>
          <p>Comparison of annyang and PocketSphinx.js for practical usage.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Criterion</th>
              <th>annyang</th>
              <th>PocketSphinx.js</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Reliability</td>
              <td>Good: background noise is reliably distinguished from language; recognition of spoken words is reliable in most cases</td>
              <td>Satisfactory: recognition is reliable as long as the surroundings have no background noises</td>
            </tr>
            <tr>
              <td>Availability</td>
              <td>Internet connection is necessary</td>
              <td>Application runs on the client side, no Internet connection necessary</td>
            </tr>
            <tr>
              <td>Browser support</td>
              <td>Limited to Chrome</td>
              <td>All current browsers except Internet Explorer</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>However, recognition may not be the only factor to consider
when integrating one of the libraries into a hypervideo player.
Depending on the application area, the occurrence of background
noise (reliability), the availability of an Internet connection,
and the browsers in use may influence the selection of the
library.</p>
      <p>In the tests described in this work, we only used Google
Chrome, because the libraries were not supported in other
browsers. In future work, tests with other browsers and tests of
other libraries may bring further results that may influence the
selection of one of the libraries.</p>
      <p>
        The voice control should be integrated into the hypervideo
player and tested in a real-world scenario, measuring user
frustration due to speech recognition performance. Depending
on the scenario the hypervideo player is used in, another
hypervideo control approach may also be helpful. In case of a
physiotherapy or fitness training, for example, it is helpful to
show the main video contents on a bigger screen. A solution to
enable easier control of the hypervideo in this specific case may
be a second screen application that splits contents from control
elements [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
Both approaches should be compared for their suitability in
these scenarios.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Apple Inc. 2016a. Use Siri on your iPhone, iPad, or iPod touch. (2016). Website https://support.apple.com/en-us/HT204389 (accessed April 20, 2016).</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Apple Inc. 2016b. Use your voice to enter text on your Mac. (2016). Website https://support.apple.com/en-us/HT202584 (accessed May 27, 2016).</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Carnegie Mellon University. 2016. CMU Sphinx - Open Source Speech Recognition Toolkit. (2016). Website http://cmusphinx.sourceforge.net/ (accessed April 20, 2016).</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. GitHub, Inc. 2016a. annyang - Speech recognition for your site. (2016). Website https://github.com/TalAter/annyang (accessed April 20, 2016).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. GitHub, Inc. 2016b. Emscripten: An LLVM-to-JavaScript Compiler. (2016). Website https://github.com/kripken/emscripten (accessed April 20, 2016).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. GitHub, Inc. 2016c. Google Speech API v2. (2016). Website https://github.com/gillesdemey/google-speech-v2 (accessed May 27, 2016).</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. GitHub, Inc. 2016d. Pocketsphinx.js - Speech Recognition in JavaScript. (2016). Website https://github.com/syl22-00/pocketsphinx.js/blob/master/README.md (accessed April 20, 2016).</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Google. 2016a. android.speech. (2016). Website https://developer.android.com/reference/android/speech/package-summary.html (accessed April 20, 2016).</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Google. 2016b. Meet the Google app. (2016). Website http://www.google.com/search/about/ (accessed April 20, 2016).</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Google. 2016c. utter! Voice Commands BETA! (2016). Website https://play.google.com/store/apps/details?id=com.brandall.nutter (accessed April 20, 2016).</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. IBM. 2016. Speech to Text. (2016). Website http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/speech-to-text.html#how-it-is-used-block (accessed May 27, 2016).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. Kaldi. 2016. Kaldi. (2016). Website http://kaldi-asr.org/ (accessed May 27, 2016).</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. Britta Meixner, Christian Handschigl, Stefan John, and Michael Granitzer. 2016. From Single Screen to Dual Screen - a Design Study for a User-Controlled Hypervideo-Based Physiotherapy Training. In Proceedings of WSICC 2016.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. Britta Meixner, Stefan John, and Christian Handschigl. 2015. SIVA Suite: Framework for Hypervideo Creation, Playback and Management. In Proceedings of the 23rd ACM International Conference on Multimedia (MM '15). ACM, New York, NY, USA, 713-716. DOI: http://dx.doi.org/10.1145/2733373.2807413</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. Britta Meixner, Katrin Tonndorf, Stefan John, Christian Handschigl, Kai Hofmann, and Michael Granitzer. 2014. A Multimedia Help System for a Medical Scenario in a Rehabilitation Clinic. In Proceedings of the 14th International Conference on Knowledge Technologies and Data-driven Business (i-KNOW '14). ACM, New York, NY, USA, Article 25, 8 pages. DOI: http://dx.doi.org/10.1145/2637748.2638429</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. Microsoft. 2016. Get Started with Windows 10 - What is Cortana? (2016). Website http://windows.microsoft.com/en-us/windows-10/getstarted-what-is-cortana (accessed April 20, 2016).</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>17. Mozilla Developer Network. 2016. Web Speech API. (2016). Website https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API (accessed April 20, 2016).</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>18. Politepix. 2016. OpenEars - iPhone Voice Recognition and Text-To-Speech. (2016). Website http://www.politepix.com/openears/ (accessed May 27, 2016).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>19. Speechmatics. 2016. speech made simple. (2016). Website https://speechmatics.com/ (accessed May 27, 2016).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>20. TomTom International BV. 2016. Why TomTom devices are the easiest. (2016). Website http://www.tomtom.com/whytomtom/subject.php?subject=4 (accessed April 20, 2016).</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>21. Vocapia Research SAS. 2016. Speech to Text API. (2016). Website http://www.vocapia.com/speech-to-text-api.html (accessed May 27, 2016).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>22. VoxForge. 2016. VoxForge - Downloads - German. (2016). Website http://www.voxforge.org/de/Downloads (accessed April 20, 2016).</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>23. W3C. 2012. Web Speech API Specification (19 October 2012). (2012). Website https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html (accessed June 09, 2016).</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>24. W3C. 2016. Web Audio API - W3C Editor's Draft 15 April 2016. (2016). Website https://webaudio.github.io/web-audio-api/ (accessed April 19, 2016).</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>25. WEBRESOURCESDEPOT. 2016. Speech Recognition With JavaScript - annyang. (2016). Website http://webresourcesdepot.com/speech-recognition-with-javascript-annyang/ (accessed April 20, 2016).</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>26. Wit.ai, Inc. 2016. wit.ai - Natural Language for Developers. (2016). Website https://wit.ai/ (accessed May 27, 2016).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>