<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On-device Chatbot System using SuperChat Method on Raspberry Pi and CNN Domain Specific Accelerator</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hao Sha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Baohua Sun</string-name>
          <email>baohua.sun@gyrfalcontech.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicholas Yi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenhan Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lin Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Gyrfalcon Technology Inc.</institution>
          <addr-line>Milpitas, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Gyrfalcon Technology Inc.</institution>
          <addr-line>Milpitas, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A chatbot is a popular interactive entertainment device that requires semantic understanding and natural language processing of input inquiries, together with appropriate individualized responses. Currently, most chatbot services rely on a cloud connection because of the limited computation power of edge devices, which raises privacy and latency concerns. However, recent research on the SuperChat method shows that chit-chat tasks can be solved using two-dimensional CNN models. In addition, low-power CNN Domain Specific Accelerators have become widely available over the past two or three years. In this paper, we implement the SuperChat method on a Raspberry Pi 3.0 connected through USB to a low-power CNN accelerator chip, which is loaded with the quantized weights of a two-dimensional CNN model. The resulting system reaches convincing accuracy with high memory efficiency and very low power consumption.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Chatbots such as Apple's Siri and Amazon's Echo are widely used today to carry out simple tasks
and answer questions interactively. These chatbots use cloud computing for both semantic understanding and
natural language processing of input inquiries and for generating appropriate individualized responses. However, this comes
at the cost of being unavailable offline and raises privacy concerns, because human voice data is transmitted and stored.
In addition, there is a growing desire for these chatbots to emulate human characteristics through personalized behavior and
responses. There is also a need for localized chatbot solutions for specific settings, such as senior
centers and children's toys.</p>
      <p>
        The SuperChat solution [
        <xref ref-type="bibr" rid="ref7">8</xref>
        ] is applied to solve the above problems. It uses the two-dimensional embedding of
the state-of-the-art Super Characters method [
        <xref ref-type="bibr" rid="ref6">7</xref>
        ] for text classification operations to achieve high-quality, engaging
responses. The Super Characters method has also been extended to machine learning on tabular data [1], image captioning [
        <xref ref-type="bibr" rid="ref3">4</xref>
        ],
and multi-modal sentiment analysis [
        <xref ref-type="bibr" rid="ref4">5</xref>
        ]. Low-power CNN accelerators are widely available to implement the CNN
models in these methods. Sun et al. (2017) designed a Convolutional Neural Networks Domain Specific
Architecture (CNN-DSA) accelerator for extracting features from an input image [
        <xref ref-type="bibr" rid="ref5 ref8">9, 6</xref>
        ]. It processes 224x224
RGB images at 140 fps with ultra-high power efficiency, a record of 9.3 TOPS/Watt, and a peak power of less than 300 mW.
Super Characters deployed on these low-power devices [
        <xref ref-type="bibr" rid="ref2">3</xref>
        ] shows promising applicability on edge devices.
      </p>
      <p>Copyright © by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>In this paper, we propose a low-cost chatbot solution in which the core SuperChat engine runs entirely on the local device.</p>
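      <p>The two-dimensional embedding behind Super Characters and SuperChat can be pictured as laying the characters of a sentence out on a square image that a CNN then classifies. The sketch below only computes where each character would go on a 224x224 canvas. The square-grid rule, the function name <monospace>super_characters_layout</monospace>, and the choice to skip whitespace are illustrative assumptions; the layout actually used in [7] and [8] may differ.</p>
      <preformat>
```python
import math


def super_characters_layout(text, image_size=224):
    """Return (cell_size, [(char, x, y), ...]) for a square character grid.

    Assumption: characters fill the grid row by row, and whitespace is skipped.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return image_size, []
    # Smallest square grid that can hold every character.
    cells_per_side = math.isqrt(len(chars) - 1) + 1
    cell_size = image_size // cells_per_side
    placements = []
    for index, char in enumerate(chars):
        row, col = divmod(index, cells_per_side)
        # Top-left pixel of the cell in which this character would be drawn.
        placements.append((char, col * cell_size, row * cell_size))
    return cell_size, placements
```
</preformat>
      <p>In a full pipeline, each character would then be rendered into its cell with an imaging library before the image is passed to the CNN; that rendering step is omitted here.</p>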
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <sec id="sec-2-1">
        <title>SuperChat</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Chatbot Implementation</title>
      <p>In order to implement the SuperChat method, the system uses a low-cost Raspberry Pi single-board computer to
perform voice recording, audio playback, and Internet/cloud access, and uses Gyrfalcon's edge-computing
device (GTI 2801 dongle) to perform sentiment analysis and to generate appropriate responses, as shown in
Figure 2. Cloud servers are temporarily used to perform Speech-to-Text and, optionally, Text-to-Speech. Although
the Speech-to-Text and Text-to-Speech modules can easily be replaced by licensed services, we
temporarily use free cloud-based services to prove the concept for the sake of convenience.</p>
      <p>FFmpeg, a free open-source project, is used as the recording software to capture the voice and compress it to
AAC format for better quality and compression ratio. The AMR audio format could also be used, but it is not supported
by FFmpeg on the Raspberry Pi. FFmpeg is also used for playback.</p>
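      <p>As a rough illustration of the recording step, the following sketch builds an FFmpeg command line that captures audio and encodes it to AAC in an M4A container. The ALSA input device name, the 16 kHz mono settings, the bitrate, and the helper name <monospace>build_record_command</monospace> are assumptions made for this example and are not taken from the system described here.</p>
      <preformat>
```python
def build_record_command(output_path, seconds=5, device="default", bitrate="32k"):
    """Build an FFmpeg argument list that records from ALSA and encodes to AAC.

    The device name, sample rate, channel count, and bitrate are assumed values.
    """
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it already exists
        "-f", "alsa",         # read from an ALSA capture device (assumption)
        "-i", device,
        "-t", str(seconds),   # stop recording after this many seconds
        "-ac", "1",           # mono
        "-ar", "16000",       # 16 kHz sample rate
        "-c:a", "aac",        # encode with FFmpeg's AAC encoder
        "-b:a", bitrate,
        output_path,
    ]
```
</preformat>
      <p>The returned list can be passed to <monospace>subprocess.run(cmd, check=True)</monospace> on a machine where FFmpeg and a microphone are available.</p>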
      <sec id="sec-3-1">
        <title>Speech Recognition (Speech-to-Text) and Natural Language Processing</title>
        <p>For the Baidu Speech Recognition API, audio is recorded in AAC (M4A) format, which has a high compression
rate. Baidu offers speech-to-text and text-to-speech via the cloud, and both operations take about 2 seconds on average for
sentences of 10 characters or fewer, with most of the time spent transmitting and receiving the audio file.
The Tencent Speech Recognition API also performs speech recognition and transcription successfully, but
it does not support the AAC audio format, so the default WAV file, which is much larger, takes much longer (about 10
seconds). Before settling on the AAC (M4A) format, we tried the WAV and AMR formats.
WAV worked in initial tests on a laptop, but its low compression rate led to a large
delay when reading and extracting speech in the cloud. AMR has a compression rate similar to that of AAC,
and its speech-to-text is around 5 times faster than with WAV, but because the AMR format is not supported on the
Raspberry Pi, it was ultimately replaced by AAC.</p>
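        <p>Because most of the round-trip time is spent moving the audio file, the size difference between uncompressed and compressed audio matters. The sketch below estimates payload sizes; the 16 kHz, 16-bit, mono recording settings and the 32 kbit/s compressed bitrate are illustrative assumptions, not measurements from this system.</p>
        <preformat>
```python
def pcm_wav_bytes(seconds, sample_rate=16000, bits_per_sample=16, channels=1):
    """Approximate size of uncompressed PCM audio (the small WAV header is ignored)."""
    return seconds * sample_rate * (bits_per_sample // 8) * channels


def compressed_bytes(seconds, bitrate_bps):
    """Approximate size of constant-bitrate compressed audio such as AAC."""
    return seconds * bitrate_bps // 8


# Example with assumed settings: a 5-second utterance.
wav_size = pcm_wav_bytes(5)            # 16 kHz, 16-bit, mono
aac_size = compressed_bytes(5, 32000)  # assumed 32 kbit/s
size_ratio = wav_size / aac_size
```
</preformat>
        <p>Under these assumed settings the WAV payload is several times larger than the compressed one, which is consistent in direction with the longer WAV turnaround reported above.</p>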
        <p>The current solution relies on cloud-based speech-to-text (s2t), but this module could easily be replaced with
licensed software. One way to make the on-device chatbot independent of the cloud service for the s2t
module is to purchase a license from a third party. For example, iFlyTech offers a local translation device that
first recognizes the voice as text and then translates the text into a target language. Its voice recognition module
runs on the device, and the entire on-device translator product costs less than $200, which suggests that
voice recognition can be implemented on-device at an affordable cost.</p>
        <p>A trained SuperChat model is stored locally on the Raspberry Pi and loaded onto the GTI 2801 dongle. The
model is a quantized Gnet18 model, shown in Figure 4, which is a modified ResNet-18
with all shortcut connections removed. The first four major layers use 3-bit precision, and the last major layer uses 1-bit
precision. All activations are represented with 5 bits in order to save on-chip data memory. The representation
mechanism inside the accelerator supports up to four times compression with 1-bit precision and two times
compression with 3-bit precision. To use the on-chip memory efficiently, the model coefficients of the
fifth major layer use only 1-bit precision. For the first four major layers, 3-bit model coefficients are used
as fine-grained filters on the original input image.</p>
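        <p>To give a feel for what 1-bit and 3-bit coefficients mean, the sketch below applies two textbook quantizers to a list of weights: a sign-based binary quantizer and a symmetric uniform quantizer. These are generic schemes chosen for illustration; the quantization actually used for the Gnet18 model on the accelerator is not described here, and the function names are hypothetical.</p>
        <preformat>
```python
def quantize_1bit(weights):
    """Binary quantization: keep each sign and share one scale (mean absolute value)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    return [scale if w >= 0 else -scale for w in weights]


def quantize_uniform(weights, bits):
    """Symmetric uniform quantization onto signed integer levels.

    With bits=3 the integer levels are -3..3, scaled by a shared step size.
    """
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    step = max_abs / levels
    # Snap every weight to the nearest multiple of the step size.
    return [round(w / step) * step for w in weights]
```
</preformat>
        <p>Both functions return de-quantized floating-point values so the rounding effect is easy to inspect; a hardware implementation would store only the integer codes and the shared scale.</p>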
        <p>After the CNN layers, the fully connected (FC) layers are computed on the CPU before the output prediction. The computation
required by the FC layers is negligible.</p>
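        <p>The CPU-side step can be sketched as a single dense layer followed by a softmax over the candidate outputs. This is a minimal pure-Python illustration: the feature vector is assumed to be whatever the accelerator returns after the CNN layers, and the weight layout (one row per output class) and function names are assumptions for this example.</p>
        <preformat>
```python
import math


def fully_connected(features, weights, biases):
    """One dense layer: out[j] = sum_i features[i] * weights[j][i] + biases[j]."""
    return [sum(f * w for f, w in zip(features, row)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Convert logits to probabilities, subtracting the maximum for stability."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(features, weights, biases):
    """Return (index of the most likely class, list of class probabilities)."""
    probs = softmax(fully_connected(features, weights, biases))
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs
```
</preformat>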
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
    </sec>
    <sec id="sec-5">
      <title>Future Work</title>
      <p>
        The CNN-DSA chip processing time is 15 ms, and the pre-processing time on the mobile device is about 6 ms. The
time for the FC layer is 1 ms, and post-processing is negligible, so the total text classification time is 22 ms. The system can
therefore process about 45 sentences per second, which more than satisfies the real-time requirement of NLP applications
such as chatbots [
        <xref ref-type="bibr" rid="ref9">10</xref>
        ].
      </p>
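      <p>As a quick check of the latency budget, the stage times quoted above can be summed and converted to a throughput. The stage labels in the dictionary are just names chosen for this sketch; the millisecond values are the ones reported in the text.</p>
      <preformat>
```python
# Per-sentence stage times in milliseconds, as reported in the text.
STAGE_MS = {
    "pre-processing": 6,
    "CNN-DSA inference": 15,
    "FC layer": 1,
}


def total_latency_ms(stages):
    """Sum the per-stage latencies (post-processing is treated as negligible)."""
    return sum(stages.values())


def sentences_per_second(latency_ms):
    """Throughput when sentences are classified one after another."""
    return 1000.0 / latency_ms
```
</preformat>
      <p>With a total of 22 ms per sentence, the throughput works out to roughly 45 sentences per second.</p>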
      <sec id="sec-5-1">
        <title>Speech Synthesis (Text-to-Speech)</title>
        <p>The response TXT file is sent to a separate Baidu Cloud Text-to-Speech server. The server sends back an
audio file, for which the MP3 format was selected in order to save communication bandwidth.</p>
        <p>
          For offline text-to-speech, eSpeak, an open-source software speech synthesizer that can be driven from Python,
is one option. It uses a formant synthesis method and supports many languages in a compact package.
Ekho is another option for offline Text-to-Speech, which can be found at [
          <xref ref-type="bibr" rid="ref10">11</xref>
          ].
        </p>
        <p>Initial tests on a Raspberry Pi (ARMv8) proved successful, with nearly perfect accuracy for audio transcription
across inputs of varying loudness and length. The output can easily be adjusted in volume, speed,
speaker voice, etc.</p>
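        <p>The adjustable volume, speed, and voice map naturally onto the command-line options of eSpeak. The sketch below builds such a command from Python; the helper name <monospace>build_espeak_command</monospace> and the default values are assumptions for illustration, and the command is only constructed here, not executed.</p>
        <preformat>
```python
def build_espeak_command(text, wav_path, voice="en", speed_wpm=150, amplitude=100):
    """Build an eSpeak command line that synthesizes `text` into a WAV file."""
    return [
        "espeak",
        "-v", voice,           # voice or language name
        "-s", str(speed_wpm),  # speaking speed in words per minute
        "-a", str(amplitude),  # amplitude (volume), 0 to 200
        "-w", wav_path,        # write the audio to a WAV file instead of playing it
        text,
    ]
```
</preformat>
        <p>On a device with eSpeak installed, the list can be run with <monospace>subprocess.run(cmd, check=True)</monospace> and the resulting file played back with FFmpeg.</p>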
        <p>Ideally, offline Speech-to-Text can be used to implement a fully localized chatbot solution. The existing solutions
are found only on Android devices and handle only simple commands.</p>
        <p>[1] Sun, Baohua, Lin Yang, Wenhan Zhang, Michael Lin, Patrick Dong, Charles Young, and Jason Dong.
“Supertml: Two-dimensional word embedding for the precognition on structured tabular data.” In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0-0. 2019.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>Squared english word: A method of generating glyph to use super characters for sentiment analysis</article-title>
          .
          <source>arXiv preprint arXiv:1902.02160</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
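        <mixed-citation>
          [3]
          <string-name><given-names>Baohua</given-names> <surname>Sun</surname></string-name>,
          <string-name><given-names>Lin</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>Michael</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>Wenhan</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>Patrick</given-names> <surname>Dong</surname></string-name>,
          <string-name><given-names>Charles</given-names> <surname>Young</surname></string-name>, and
          <string-name><given-names>Jason</given-names> <surname>Dong</surname></string-name>.
          <article-title>System Demo for Transfer Learning across Vision and Text using Domain Specific CNN Accelerator for On-Device NLP Applications</article-title>.
          <source>arXiv preprint arXiv:1906.01145</source>,
          <year>2019</year>.
        </mixed-citation>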
      </ref>
      <ref id="ref3">
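        <mixed-citation>
          [4]
          <string-name><given-names>Baohua</given-names> <surname>Sun</surname></string-name>,
          <string-name><given-names>Lin</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>Michael</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>Charles</given-names> <surname>Young</surname></string-name>,
          <string-name><given-names>Patrick</given-names> <surname>Dong</surname></string-name>,
          <string-name><given-names>Wenhan</given-names> <surname>Zhang</surname></string-name>, and
          <string-name><given-names>Jason</given-names> <surname>Dong</surname></string-name>.
          <article-title>Supercaptioning: Image captioning using two-dimensional word embedding</article-title>.
          <source>arXiv preprint arXiv:1905.10515</source>,
          <year>2019</year>.
        </mixed-citation>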
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [5]
          <string-name><given-names>Baohua</given-names> <surname>Sun</surname></string-name>, et al.
          <article-title>Multi-modal Sentiment Analysis using Super Characters Method on Low-power CNN Accelerator Device</article-title>.
          <source>arXiv preprint arXiv:2001.10179</source>,
          <year>2020</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [6]
          <string-name><given-names>Baohua</given-names> <surname>Sun</surname></string-name>,
          <string-name><given-names>Daniel</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>Leo</given-names> <surname>Yu</surname></string-name>,
          <string-name><given-names>Jay</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>Helen</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>Wenhan</given-names> <surname>Zhang</surname></string-name>, and
          <string-name><given-names>Terry</given-names> <surname>Torng</surname></string-name>.
          <article-title>MRAM Co-designed Processing-in-Memory CNN Accelerator for Mobile and IoT Applications</article-title>.
          <source>arXiv preprint arXiv:1811.12179</source>,
          <year>2018</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name><given-names>Baohua</given-names> <surname>Sun</surname></string-name>,
          <string-name><given-names>Lin</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>Patrick</given-names> <surname>Dong</surname></string-name>,
          <string-name><given-names>Wenhan</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>Jason</given-names> <surname>Dong</surname></string-name>, and
          <string-name><given-names>Charles</given-names> <surname>Young</surname></string-name>.
          <article-title>Super Characters: A Conversion from Sentiment Classification to Image Classification</article-title>.
          In
          <source>Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis</source>,
          pp.
          <fpage>309</fpage>-<lpage>315</lpage>,
          <year>2018</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name><given-names>Baohua</given-names> <surname>Sun</surname></string-name>, et al.
          <article-title>SuperChat: dialogue generation by transfer learning from vision to language using two-dimensional word embedding</article-title>.
          In
          <source>Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data</source>,
          <year>2019</year>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name><given-names>Baohua</given-names> <surname>Sun</surname></string-name>,
          <string-name><given-names>Lin</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>Patrick</given-names> <surname>Dong</surname></string-name>,
          <string-name><given-names>Wenhan</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>Jason</given-names> <surname>Dong</surname></string-name>, and
          <string-name><given-names>Charles</given-names> <surname>Young</surname></string-name>.
          <article-title>Ultra Power-Efficient CNN Domain Specific Accelerator with 9.3 TOPS/Watt for Mobile and Embedded Applications</article-title>.
          In
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops</source>,
          pp.
          <fpage>1677</fpage>-<lpage>1685</lpage>,
          <year>2018</year>.
        </mixed-citation>
      <ref id="ref9">
        <mixed-citation>
          [10]
          <string-name><given-names>B.</given-names> <surname>Sun</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Dong</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Young</surname></string-name>, and
          <string-name><given-names>J.</given-names> <surname>Dong</surname></string-name>.
          <article-title>System Demo for Transfer Learning across Vision and Text using Domain Specific CNN Accelerator for On-Device NLP Applications</article-title>.
          <source>arXiv preprint arXiv:1906.01145</source>,
          <year>2019</year>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[11] https://www.eguidedog.net/ekho.php</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>