On-device Chatbot System using SuperChat Method on Raspberry Pi and CNN Domain Specific Accelerator

Hao Sha, Baohua Sun, Nicholas Yi, Wenhan Zhang, Lin Yang
Gyrfalcon Technology Inc., Milpitas, CA
baohua.sun@gyrfalcontech.com

Abstract

A chatbot is a popular interactive entertainment device that requires semantic understanding and natural language processing of input inquiries, as well as appropriate individualized responses. Currently, most chatbot services are provided through a connection to the cloud because of the limited computation power of edge devices, which raises privacy and latency concerns. However, recent research on the SuperChat method shows that chit-chat tasks can be solved using two-dimensional CNN models. In addition, low-power CNN Domain Specific Accelerators have become widely available over the past two or three years. In this paper, we implement the SuperChat method on a Raspberry Pi 3.0 connected through USB to a low-power CNN accelerator chip, which is loaded with the quantized weights of a two-dimensional CNN model. The resulting system reaches convincing accuracy with high memory efficiency and very low power consumption.

1 Introduction

Chatbots such as Apple's Siri and Amazon's Echo are widely used today to interactively carry out simple tasks and answer questions. These chatbots use cloud computing technology both for semantic understanding and natural language processing of input inquiries and for generating appropriate individualized responses. However, this comes at the cost of offline unavailability and privacy concerns, since human voice data is transmitted and stored. In addition, there is a desire for these chatbots to emulate human characteristics through personalized behavior and responses. There is also a need for localized chatbot solutions in specific settings, such as senior centers, children's toys, etc.

The SuperChat solution [8] is applied to solve the above problems. It uses the two-dimensional embedding of the state-of-the-art Super Characters method [7] for text classification to achieve high-quality, engaging responses. The Super Characters method has also been extended to tabular data machine learning [1], image captioning [4], and multi-modal sentiment analysis [5]. Low-power CNN accelerators are widely available to implement the CNN models used in these methods. Sun et al. designed a Convolutional Neural Networks Domain Specific Architecture (CNN-DSA) accelerator for extracting features from an input image [9, 6]. It processes 224x224 RGB images at 140 fps with ultra power efficiency, a record 9.3 TOPS/Watt, at a peak power of less than 300 mW. Super Characters models deployed on these low-power devices [3] demonstrate promising availability on edge devices. In this paper, we propose a low-cost chatbot solution in which the core SuperChat engine is fully localized.

2 Related Work

2.1 SuperChat

Figure 1 illustrates the SuperChat method used in our system. The response sentence is predicted sequentially by predicting the next response word over multiple iterations. During each iteration, the input sentence and the current partial response sentence are embedded into an image through two-dimensional embedding. The resulting image is called a SuperChat image. This SuperChat image is then fed into a CNN model to predict the next response word. In each SuperChat image, the upper portion corresponds to the input sentence, and the lower portion corresponds to the partial response sentence. At the beginning of the iteration, the partial response sentence is initialized as null. The prediction of the first response word is based on the SuperChat image with only the input sentence embedded; the predicted word is then appended to the current partial response sentence. This iteration continues until the End Of Sentence (EOS) token is predicted. The final output is the concatenation of the sequentially predicted words.
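To make the iterative decoding procedure concrete, the following is a minimal sketch of the SuperChat generation loop in Python. It is illustrative only: embed_superchat_image and predict_next_word are hypothetical placeholder names for the two-dimensional embedding step and the CNN forward pass (e.g., executed on the CNN-DSA accelerator); they are not part of any published API.

```python
EOS = "<EOS>"    # End Of Sentence marker
MAX_WORDS = 32   # safety bound on response length

def generate_response(input_sentence, embed_superchat_image, predict_next_word):
    """Predict a response word by word, as described in Section 2.1."""
    partial_response = []  # the partial response is initialized as null
    for _ in range(MAX_WORDS):
        # Upper portion of the image: input sentence;
        # lower portion: current partial response sentence.
        image = embed_superchat_image(input_sentence, partial_response)
        next_word = predict_next_word(image)  # CNN classification step
        if next_word == EOS:
            break
        partial_response.append(next_word)
    # The final output is the concatenation of the sequential predictions
    # (no separator, as for square-shaped Chinese characters).
    return "".join(partial_response)
```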
Although the examples in Figure 1 are illustrated with Chinese sentences, the method can also be applied to other languages, for example, Asian languages such as Japanese and Korean, which use the same square-shaped characters as Chinese. For Latin languages, where words have variable lengths, the SEW method [2] can be used to convert the words into a squared shape before applying the SuperChat method to generate the dialogue response.

Figure 1: SuperChat.

Figure 2: Raspberry Pi and GTI 2801 dongle.

3 Chatbot Implementation

To implement the SuperChat method, the system uses a low-cost Raspberry Pi single-board computer to perform voice recording, audio playback, Internet/cloud access, etc., and uses Gyrfalcon's edge-computing device (the GTI 2801 dongle) to perform sentiment analysis and generate appropriate responses, as shown in Figure 2. Cloud servers are temporarily used to perform Speech-to-Text and, optionally, Text-to-Speech. Although these modules could easily be replaced by licensed services, we temporarily use free cloud-based services as a proof of concept for the sake of convenience.

Figure 3: System Diagram.

3.1 System Structure

Figure 3 illustrates the full structure of the proposed SuperChat implementation, which is discussed in detail below. Inquiry text input may be received in two ways: as audio transcribed to text (steps 1-3) or as direct text from the keyboard (step 3).

3.2 Voice Recording and Playback

FFmpeg, a free open-source project, is used to capture the voice and compress it to AAC format for better quality and a higher compression ratio. The AMR audio format could also be used, but it is not supported by FFmpeg on the Raspberry Pi. FFmpeg is also used for playback.
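As an illustration, the recording and playback steps can be driven from Python by invoking the FFmpeg command line, as in the minimal sketch below. The ALSA device name, sample rate, duration, and file names are assumptions for the example, not values specified in this paper.

```python
import subprocess

def record_aac(path="inquiry.m4a", seconds=5, device="hw:1,0"):
    """Capture microphone audio with FFmpeg and compress it to AAC."""
    subprocess.run([
        "ffmpeg", "-y",
        "-f", "alsa", "-i", device,  # capture from an ALSA microphone (assumed device)
        "-t", str(seconds),          # recording duration in seconds
        "-ac", "1", "-ar", "16000",  # mono, 16 kHz (typical for speech recognition)
        "-c:a", "aac",               # AAC codec for a high compression rate
        path,
    ], check=True)
    return path

def play(path):
    """Play an audio file back; ffplay ships with FFmpeg."""
    subprocess.run(["ffplay", "-nodisp", "-autoexit", path], check=True)
```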
3.3 Speech Recognition (Speech-to-Text) and Natural Language Processing

Using the Baidu Speech Recognition API, audio is recorded in AAC (m4a) format, which has a high compression rate. Baidu offers speech-to-text and text-to-speech via the cloud, and both operations average 2 seconds for sentences of 10 characters or shorter, with the majority of the time spent transmitting and receiving the audio file. The Tencent Speech Recognition API also performs speech recognition and transcription successfully, but it does not support the AAC compressed audio format, so the default WAV file takes much longer (about 10 seconds) because it is much larger. Before settling on the AAC and M4A audio formats, the WAV and AMR formats were tried. WAV worked in initial tests with a laptop, but because of its low compression rate, it caused a large delay when uploading and transcribing speech in the cloud. AMR has a compression rate similar to AAC and M4A, and its speech-to-text is around 5 times faster than WAV's, but because the AMR format is not supported on the Raspberry Pi, it was ultimately replaced by AAC.

Although the current solution uses cloud-based speech-to-text, this module could easily be replaced with licensed software, making the on-device chatbot independent of cloud services. For example, iFlyTech offers a local translation device that first recognizes voice as text and then translates the text into a target language. The entire on-device translator product costs less than $200, which suggests that the voice recognition module could be implemented on-device at an affordable cost.

3.4 On-Chip SuperChat Engine

Figure 4: Gnet18 Model Architecture and Quantized Weights.

A trained SuperChat model is stored locally on the Raspberry Pi and loaded onto the GTI 2801 dongle. The model used is a quantized Gnet18 model, shown in Figure 4, which is a modified ResNet-18 model with all shortcuts removed. The first four major layers use 3-bit precision, and the last major layer uses 1-bit precision. All activations are represented with 5 bits in order to save on-chip data memory. The representation mechanism inside the accelerator supports up to four times compression with 1-bit precision and two times compression with 3-bit precision. To use the on-chip memory efficiently, the model coefficients of the fifth major layer use only 1-bit precision, while the first four major layers use 3-bit coefficients as fine-grained filters on the original input image. After the CNN layers, the FC layers are implemented on the CPU before the output prediction; the computation required by the FC layers is negligible. The CNN-DSA chip processing time is 15 ms, the pre-processing time on the mobile device is about 6 ms, and the FC layer takes 1 ms, with negligible post-processing, so the total text classification time is 22 ms. The system can process nearly 50 sentences per second, which more than satisfies the real-time requirement of NLP applications such as chatbots [10].
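The bit-precision scheme described above can be illustrated with a generic uniform quantizer. The sketch below is conceptual only: the CNN-DSA's actual internal weight representation and compression mechanism are device-specific, and this code does not reproduce the chip's format.

```python
import numpy as np

def quantize(x, n_bits):
    """Uniform symmetric quantization of an array to signed n-bit levels."""
    scale = float(np.max(np.abs(x))) or 1.0  # shared scale; avoid divide-by-zero
    if n_bits == 1:
        return np.sign(x) * scale            # 1-bit: keep only the sign
    q_max = 2 ** (n_bits - 1) - 1            # e.g., 3 for 3-bit signed values
    return np.round(x / scale * q_max) / q_max * scale

# Illustration (arbitrary shapes): 3-bit weights for the first four major
# layers, 1-bit weights for the fifth major layer, 5-bit activations.
w_early = quantize(np.random.randn(64, 64, 3, 3), 3)
w_last = quantize(np.random.randn(512, 512, 3, 3), 1)
activations = quantize(np.random.rand(1, 64, 56, 56), 5)
```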
3.5 Speech Synthesis (Text-to-Speech)

The response text file is sent to a separate Baidu Cloud Text-to-Speech server, which sends back an audio file; the MP3 format was selected in order to save communication bandwidth. For offline text-to-speech, eSpeak is an open-source speech synthesizer, accessible from Python, that uses a formant synthesis method and supports many languages in a compact package. Ekho is another option for offline Text-to-Speech [11].

4 Results

Initial tests on a Raspberry Pi (ARM8) were successful, with nearly perfect accuracy for audio transcription across inputs of varying loudness and length. The output can easily be adjusted for volume, speed, speaker voice, etc.

5 Future Work

Ideally, offline Speech-to-Text could be used to implement a fully localized chatbot solution. The existing offline solutions are found only on Android devices and handle only simple commands.

References

[1] B. Sun, L. Yang, W. Zhang, M. Lin, P. Dong, C. Young, and J. Dong. "SuperTML: Two-dimensional word embedding for the precognition on structured tabular data." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.

[2] B. Sun, L. Yang, C. Chi, W. Zhang, and M. Lin. "Squared English word: A method of generating glyph to use Super Characters for sentiment analysis." arXiv preprint arXiv:1902.02160, 2019.

[3] B. Sun, L. Yang, M. Lin, W. Zhang, P. Dong, C. Young, and J. Dong. "System demo for transfer learning across vision and text using domain specific CNN accelerator for on-device NLP applications." arXiv preprint arXiv:1906.01145, 2019.

[4] B. Sun, L. Yang, M. Lin, C. Young, P. Dong, W. Zhang, and J. Dong. "SuperCaptioning: Image captioning using two-dimensional word embedding." arXiv preprint arXiv:1905.10515, 2019.

[5] B. Sun et al. "Multi-modal sentiment analysis using Super Characters method on low-power CNN accelerator device." arXiv preprint arXiv:2001.10179, 2020.

[6] B. Sun, D. Liu, L. Yu, J. Li, H. Liu, W. Zhang, and T. Torng. "MRAM co-designed processing-in-memory CNN accelerator for mobile and IoT applications." arXiv preprint arXiv:1811.12179, 2018.

[7] B. Sun, L. Yang, P. Dong, W. Zhang, J. Dong, and C. Young. "Super Characters: A conversion from sentiment classification to image classification." In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 309-315, 2018.

[8] B. Sun et al. "SuperChat: Dialogue generation by transfer learning from vision to language using two-dimensional word embedding." In Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data, 2019.

[9] B. Sun, L. Yang, P. Dong, W. Zhang, J. Dong, and C. Young. "Ultra power-efficient CNN domain specific accelerator with 9.3 TOPS/Watt for mobile and embedded applications." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1677-1685, 2018.

[10] B. Sun, L. Yang, M. Lin, W. Zhang, P. Dong, C. Young, and J. Dong. "System demo for transfer learning across vision and text using domain specific CNN accelerator for on-device NLP applications." arXiv preprint arXiv:1906.01145, 2019.

[11] Ekho text-to-speech: https://www.eguidedog.net/ekho.php