=Paper= {{Paper |id=Vol-2579/BIgMine-2019_paper_6 |storemode=property |title=System Demonstration of MRAM Co-designed Processing-in-Memory CNN Accelerator for Mobile and IoT Applications |pdfUrl=https://ceur-ws.org/Vol-2579/BIgMine-2019_paper_6.pdf |volume=Vol-2579 |authors=Baohua Sun,Daniel Liu,Leo Yu,Jay Li,Helen Liu,Wenhan Zhang,Terry Torng |dblpUrl=https://dblp.org/rec/conf/kdd/SunLYLLZT19 }} ==System Demonstration of MRAM Co-designed Processing-in-Memory CNN Accelerator for Mobile and IoT Applications== https://ceur-ws.org/Vol-2579/BIgMine-2019_paper_6.pdf
           System Demonstration of MRAM Co-designed
              Processing-in-Memory CNN Accelerator
                 for Mobile and IoT Applications

                         Baohua Sun                          Daniel Liu
                  Gyrfalcon Technology Inc.         Gyrfalcon Technology Inc.
                        Milpitas, CA                       Milpitas, CA
               baohua.sun@gyrfalcontech.com
             Leo Yu                         Jay Li                       Helen Liu
    Gyrfalcon Technology Inc.     Gyrfalcon Technology Inc.      Gyrfalcon Technology Inc.
          Milpitas, CA                  Milpitas, CA                    Milpitas, CA
                       Wenhan Zhang                       Terry Torng
                 Gyrfalcon Technology Inc.         Gyrfalcon Technology Inc.
                        Milpitas, CA                     Milpitas, CA




                                                      Abstract
                      We designed a device for Convolutional Neural Network (CNN)
                      applications with non-volatile MRAM memory and a computing-in-memory
                      co-designed architecture. The chip targets reduced power leakage for
                      CNN inference. It has been successfully fabricated using a 22nm
                      technology node CMOS Si process, providing more than 40MB of MRAM
                      density at 9.9 TOPS/W. This enables multiple models within one
                      single chip for mobile and IoT device applications.




1    Introduction
Artificial Intelligence (AI) is recognized as one of the key technologies in the Fourth Industrial Revolution. AI,
also known as machine learning, has been around for a long while [1], but only recently has it become more
advanced, popular, and mature. Deep learning has found applications in a wide variety of tasks such as computer
vision, image and speech recognition, machine translation, robotics, and medical image processing [2]. Hardware
acceleration of deep learning tasks comes at the perfect time, offering multi-fold speed improvements through
dedicated Application-Specific Integrated Circuit (ASIC) logic processors and their memory storage systems.
   Convolutional Neural Network (CNN) models are successful in computer vision [3, 6] and Natural Language
Processing tasks [8]. A CNN repeatedly executes convolution operations on the input image, so power-efficient
ASICs are highly desirable for IoT implementations. Moreover, CNN model sizes are large, which requires
large memories. For computing-in-memory architectures, the low memory density of SRAM becomes the
bottleneck. Sun et al. (2018) designed a Convolutional Neural Networks Domain Specific Architecture
(CNN-DSA) accelerator for extracting features out of an input image [9, 7]. It processes

Copyright © by the paper’s authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
In: A. Editor, B. Coeditor (eds.): Proceedings of the XYZ Workshop, Location, Country, DD-MMM-YYYY, published at
http://ceur-ws.org
          (a) The whole demo system composed of MRAM CNN accel-          (b) A closer look at the MRAM
          erator with a host processor.                                  CNN accelerator (in the middle of
                                                                         the board and under the socket).

Figure 1: Demo system of MRAM CNN accelerator with host processor for applications on mobile and IoT
devices.

224x224 RGB images at 140fps with ultra power efficiency, a record 9.3 TOPS/Watt, and peak power less than
300mW. This architecture mainly focuses on inference rather than training. The CNN-DSA hardware, based on
a 28nm node, was successfully produced with internal SRAM memory only. Its APiM (AI Processing in
Memory) architecture saves power and increases speed, but larger memory with better power efficiency is
still required.
    Magneto-resistive Random Access Memory (MRAM) is a high speed Non-Volatile Memory (NVM) that can
provide unique solutions to improve overall system performance in a variety of areas, including data storage,
industrial controls, networking, and others. The recent development of the magnetic tunnel junction (MTJ), a key
element of the MRAM device, enables MRAM circuits with fundamentally fast, reliable read and write operations.
Spin Torque Transfer (STT) MRAM is an emerging memory that possesses an excellent combination of density,
speed, and non-volatility [4]. STT MRAM is considered the most promising NVM compared to existing
SRAM, eFlash, and ReRAM in terms of energy efficiency, endurance, speed, extendibility, and scaling [10].
    Thus, APiM using MRAM memory provides a better solution, offering larger memory size and better power
efficiency for enabling large or multiple CNN models. We introduce our unique computational architecture,
which implements the CNN in a matrix form with each element consisting of an advanced memory device such
as a non-volatile memory cell. With the MRAM co-design, large models can be loaded into CNN chips, which
can then be used in real-world mobile, IoT, and embedded smart device systems.
    The second section introduces the system of the MRAM CNN accelerator with a host processor. The third
section describes the STT MRAM technology used in this chip. The fourth section gives the detailed architecture
of the MRAM based CNN accelerator. The fifth section compares the MRAM solution with the previous SRAM
solution. The last section concludes the paper.

2   Demo System: MRAM CNN Accelerator with Host Processor
The demo system is shown in Figure 1. The MRAM CNN chip works as a coprocessor with a host. Figure
1a shows the general view of the demo system, which includes a host processor, an MRAM CNN accelerator on
a PCB board connected to the host processor, a microphone for receiving voice input, and a monitor for
display purposes. The demo shows that the MRAM CNN chip can load four CNN models simultaneously:
a face recognition model, a voice recognition model, a voice command recognition model, and an image
classification model. For security application scenarios, this enables multiple models loaded into a single chip
for identity verification. The non-volatile feature of MRAM keeps the models in memory even when power
is off. In real-world applications, after a power cycle there is no need to reload the models into the memory.
The MRAM CNN chip is on the PCB board inside the host computer case. Figure 1b shows
                               Conditions          Mode       MRAM(mW)    SRAM(mW)
                            Room Temperature      Dynamic       38.3        39.2
                            Room Temperature      Standby        5.5        34.3
                            High Temperature      Dynamic       35.4        43.1
                            High Temperature      Standby        7.2        136


        Table 1: MRAM vs SRAM dynamic and standby power at room and high temperature (70°C)
a closer look at the PCB board. The MRAM CNN chip is located in the middle of the PCB board, under
the chip socket.
   In the following sections, we will first introduce the MRAM technology used in the chip, and then the design
of the CNN accelerator chip. After that we will explain how the MRAM CNN chip works during the inference
workflow.

3     STT MRAM technology
We successfully implemented the advanced 22nm technology node with CMOS, SRAM, and emerging STT MRAM.
More than 40MB of MRAM density was embedded into our CNN engines. This is a 4.5x increase in memory
compared to the SRAM based CNN-DSA of Sun et al. (2018), whose total memory size is about 9MB [9]. Figure 2
shows a schematic drawing of an STT MRAM memory cell. MRAM requires only a few additional masks,
processed in between the BEOL metal layers as magnetic storage structures.




                                      Figure 2: STT MRAM cell image.

   MRAM leakage was believed to be low [10]. To measure the power leakage of our real chip, the standby current
was tested by placing the chip on the socket of the test board with the power supply set to the working voltage.
We compared the leakage behavior of MRAM vs SRAM at room temperature and at 70°C, using a VDD of 0.9V
and a VDIO of 2V. Table 1 compares the dynamic and standby power of MRAM and SRAM. It demonstrates
that MRAM leakage power is low: 5.5mW at room temperature and 7.2mW at 70°C. STT MRAM indeed offers
4-5x higher embedded memory density with much lower leakage power consumption.

4     MRAM based convolutional neural network accelerator architecture
4.1   CNN Matrix Processing Engine (MPE)
The CNN algorithm is constructed by stacking multiple computation layers for feature extraction and classifi-
cation [5]. Modern CNNs achieve their superior accuracy by building a very deep hierarchy of layers [3], which
transform the input image data into highly abstract representations called feature maps (fmaps). The primary
computation in the CNN layers is the convolution operation. A layer applies filters to the input fmaps to
extract embedded characteristics, and generates the output fmaps by accumulating the weighted sums (Wsums)
and applying non-linear activations.
   We designed a coprocessor for CNN acceleration. The CNN processing block simultaneously performs 3x3
convolutions on a 2-D image at PxP pixel locations, using input from the input buffer and filter coefficients from
the co-designed on-chip MRAM memory. Padding is applied to the PxP pixel locations for convolution. Each
layer performs a 3x3 convolution, and a bias can also be added. This is followed by nonlinear activation
operations. A max pooling operation follows and shrinks the output size by 4x. Figure 3 shows a CNN Matrix
Processing Engine implementation with M=14.




                              Figure 3: CNN Matrix Processing Engine (MPE).
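   As an illustration, the MPE dataflow described above (a 3x3 convolution with zero padding at PxP pixel
locations, a bias add, a nonlinear activation, then 2x2 max pooling that shrinks the output by 4x) can be
sketched in NumPy. This is a behavioral model only, not the hardware implementation, and ReLU is assumed
as the nonlinear activation:

```python
import numpy as np

def mpe_layer(fmap, kernel, bias):
    """Behavioral model of one MPE layer: 3x3 convolution with zero
    padding, bias add, nonlinear activation, then 2x2 max pooling."""
    P = fmap.shape[0]                          # P x P input feature map
    padded = np.pad(fmap, 1)                   # zero padding keeps a PxP conv output
    out = np.empty((P, P))
    for i in range(P):                         # 3x3 convolution at each pixel location
        for j in range(P):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel) + bias
    out = np.maximum(out, 0.0)                 # nonlinear activation (ReLU assumed)
    # 2x2 max pooling shrinks the output size by 4x (2x per dimension)
    return out.reshape(P // 2, 2, P // 2, 2).max(axis=(1, 3))

fmap = np.random.rand(14, 14)                  # M = 14 as in Figure 3
pooled = mpe_layer(fmap, np.ones((3, 3)) / 9, 0.0)
print(pooled.shape)                            # (7, 7)
```

The hardware performs all PxP convolutions in parallel; the Python loops only model the arithmetic.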


   Each CNN processing engine includes a CNN processing block, a first set of memory buffers for storing imagery
data, and a second set of memory buffers for storing filter coefficients. When two or more CNN processing engines
are configured on the IC, the engines connect to one another via a clock-skew circuit for cyclic data access, which
reduces the number of I/O data buses. Activations use 9-bit Domain Specific Floating Point (DSFP),
and model coefficients use 15-bit DSFP.
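   The DSFP bit allocation is not specified here, so purely as an illustration of reduced-precision floating point,
the sketch below quantizes a value to a generic sign/exponent/mantissa format with configurable widths. The
function name and the 1+4+4 split for the 9-bit case are assumptions, not the actual DSFP definition:

```python
import math

def dsfp_quantize(x, exp_bits, man_bits):
    """Quantize x to a 1 + exp_bits + man_bits bit floating-point value.
    A hypothetical stand-in for DSFP; the real bit allocation is not public."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0.0 else 1.0
    e = math.floor(math.log2(abs(x)))
    e_max = 2 ** (exp_bits - 1) - 1
    e = max(-e_max, min(e_max, e))             # clamp exponent to representable range
    scale = 2.0 ** man_bits
    mantissa = round(abs(x) / 2.0 ** e * scale) / scale
    return sign * mantissa * 2.0 ** e

# 9-bit activation example: 1 sign + 4 exponent + 4 mantissa bits (assumed split)
print(dsfp_quantize(0.123, 4, 4))              # 0.12109375
```

The point of such a format is that a few mantissa bits suffice for CNN inference, which keeps the MAC array small.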

4.2   System overview of STT MRAM based CNN accelerator
The CNN is implemented in hardware in a form very similar to a memory array; this is a processing-in-memory
design architecture. Figure 4 shows the block diagram of our chip architecture loading multiple models. It uses
the CNN processing engine with a co-designed MRAM architecture block instead of regular SRAM. Our chip
architecture includes four major components: on-chip MRAM, SRAM, a MAC array, and control units. The
MRAM loads the coefficients of multiple models in different locations of the memory. The SRAM stores the
data inputs and intermediate activation results, which are reused and read/written multiple times. The MAC
array executes the convolution operations using coefficients in the MRAM and data activations in the SRAM.
The control unit coordinates the above three components.
   For a CNN based IC for artificial intelligence, data must be kept as close as possible to the CNN processing
logic to save power. In image processing, filter coefficients and imagery data have different data access
requirements: filter coefficients need to remain valid for a long time, while imagery data are written and read
more often. Since MRAM has high endurance, preloaded models can be kept in MRAM for some applications
without requiring external storage or a memory buffer to load the model into the chip. In view of these
requirements, STT MRAM is selected as the NVM for storing the filter coefficients, or weights. Such advanced memory has high
          Memory      Current on     Background       Average        Average      1% duty cycle
           Type         Vdd(A)       Current(A)      Current(A)      Power(W)       Power(W)
          SRAM          0.6630          0.1            0.146          0.132          0.095
          MRAM          0.5567          0.05           0.092          0.083          0.050


Table 2: The MRAM embedded CNN accelerator chip reduces power leakage, especially in IoT application scenarios
with a low duty cycle percentage.




                                    Figure 4: CNN block with memory array.


retention at 85°C for 10 years, which fulfills the memory needs of imbalanced read and write operations.
   MRAM allows bigger memory for loading models than SRAM. This enables multi-task performance with
multi-model capability in one single chip, such as running voice and facial recognition simultaneously. A potential
application is identity verification using different biological characteristics, e.g., recognizing voice, face,
fingerprints, and gestures with four independent models in one single chip. In addition, the MRAM AI chip
also enables ensembles of multiple models in a single chip.

4.3   Inference with MRAM based CNN accelerator
Figure 5 shows the workflow of using our MRAM AI chip. First, it loads the CNN coefficients into MRAM.
Second, it loads the image into SRAM, which is fast for multiple reads and writes. Third, the coefficients and
image data are sent to the CNN processing block for convolution operations. Finally, the convolution results
are sent to the host processor. The color coding corresponds to the activated components of the same color in
Figure 4.
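   The four-step workflow can be summarized in a toy model. The class and method names below are purely
illustrative and do not correspond to any real driver API; the point is that the MRAM contents persist across
power cycles while the SRAM contents are reloaded per inference:

```python
class MramCnnChip:
    """Toy model of the inference workflow in Figure 5 (illustrative names only)."""

    def __init__(self):
        self.mram = {}        # non-volatile: coefficients survive power cycles
        self.sram = None      # volatile: input data reloaded for each inference

    def load_model(self, name, coefficients):
        self.mram[name] = coefficients              # step 1: coefficients into MRAM

    def infer(self, name, image):
        self.sram = image                           # step 2: image into SRAM
        coeffs = self.mram[name]                    # step 3: MAC array consumes both
        result = sum(c * x for c, x in zip(coeffs, self.sram))
        return result                               # step 4: result back to the host

chip = MramCnnChip()
chip.load_model("face", [0.5, 0.25])                # done once; persists if power is cycled
print(chip.infer("face", [2.0, 4.0]))               # 2.0
```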
   Based on the measurement of the operating power, current, operation length, and background current while
running image classification, the power consumption of the coefficient memory is about 1/4 of the total power,
and the rest of the chip consumes 3/4. By fitting the Ivdd (Vdd operation current) vs. frequency curve as for the SRAM
                        Figure 5: Inference workflow of MRAM CNN accelerator chip.


chip, we found that Ivdd is around three quarters of that of SRAM. Based on the above measurements, the power
efficiency of the MRAM based CNN accelerator can be calculated as 9.3/[3/4 + (1/4) * (3/4)] = 9.9 TOPS/W. At
12.5MHz, we process 3x224x224 images at 35fps, which should be sufficient for applications in IoT and mobile
scenarios.
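The arithmetic behind this estimate can be checked directly from the two measured ratios:

```python
# Coefficient memory is ~1/4 of the total power and the rest of the chip ~3/4;
# the MRAM chip's Ivdd is ~3/4 of the SRAM chip's, so the memory term scales by 3/4.
sram_tops_per_watt = 9.3                     # SRAM based CNN-DSA record [9]
relative_power = 3 / 4 + (1 / 4) * (3 / 4)   # = 15/16 of the SRAM chip's power
mram_tops_per_watt = sram_tops_per_watt / relative_power
print(round(mram_tops_per_watt, 1))          # 9.9
```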

5   Power Saving for IoT Applications and Comparison with SRAM Based CNN
    Accelerators
The memory power leakage within the chip accounts for the majority of the total power consumption, as shown
in Table 2 from one measurement. Vdd is set at 0.9V and the frequency at 66MHz. The operation length is
2.72 ms, during which the convolution logic is executed. The current on Vdd is 0.663A for SRAM, decreasing
to 0.5567A for MRAM. For an inference speed of 30fps (frames per second), the corresponding cycle length is 33ms.
The average current is calculated from the operation and idle currents weighted by their time percentages during a
duty cycle. For IoT application scenarios, the device is not always doing inference; the inference jobs come in
time spikes. For a 1% duty cycle, the average power is 0.095W for SRAM and 0.050W for MRAM, which means
the MRAM embedded CNN accelerator chip reduces the average power by about 50% compared to the SRAM one.
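   The average-power columns of Table 2 follow from the measurements just stated, and a short sketch
reproduces them within rounding:

```python
# Measurements quoted above: Vdd = 0.9 V, operation length 2.72 ms,
# cycle length 33 ms at 30 fps (so the duty cycle is 2.72/33, about 8.2%).
VDD = 0.9

def avg_power(i_on, i_background, duty):
    """Average current weighted by the duty cycle, converted to watts."""
    return (i_on * duty + i_background * (1.0 - duty)) * VDD

duty_30fps = 2.72 / 33
for name, i_on, i_bg in [("SRAM", 0.6630, 0.1), ("MRAM", 0.5567, 0.05)]:
    print(name,
          round(avg_power(i_on, i_bg, duty_30fps), 3),  # "Average Power" column
          round(avg_power(i_on, i_bg, 0.01), 3))        # "1% duty cycle" column
```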

6   Conclusion
The 22nm device of STT MRAM memory co-designed with processing-in-memory CNN accelerator has been
successfully fabricated. Compared with SRAM, the lower leakage of MRAM push the record of power efficiency
to 9.9 TOPS/W with reliable read and write operations. Image classification and voice recognition tasks were
successfully executed simultaneously on one single chip. At the workshop, we will demo our MRAM CNN chip
with multiple models on one single chip, and show the CNN chip with non-volatile feature. MRAM co-designed
processing-in-memory CNN accelerator chip would be used for mobile, IoT, and smart device applications.

References
[1] Leon O Chua and Lin Yang. 1988. Cellular neural networks: Applications. IEEE Transactions on Circuits
   and Systems 35, 10 (1988), 1273–1290.

[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. The MIT Press, Cambridge,
   Massachusetts (2016).

[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition.
   In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.

[4] Andrew D Kent and Daniel C Worledge. 2015. A new spin on magnetic memories. Nature Nanotechnology
   10, 3 (2015), 187.

[5] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. 2010. Convolutional networks and applications in
   vision. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems. IEEE, 253–256.

[6] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image
   recognition. arXiv preprint arXiv:1409.1556 (2014).

[7] Baohua Sun, Daniel Liu, Leo Yu, Jay Li, Helen Liu, Wenhan Zhang, and Terry Torng. 2018. MRAM
   Co-designed Processing-in-Memory CNN Accelerator for Mobile and IoT Applications. arXiv preprint
   arXiv:1811.12179 (2018).

[8] Baohua Sun, Lin Yang, Patrick Dong, Wenhan Zhang, Jason Dong, and Charles Young. 2018. Super
   Characters: A Conversion from Sentiment Classification to Image Classification. In Proceedings of the 9th
   Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. 309–315.

[9] Baohua Sun, Lin Yang, Patrick Dong, Wenhan Zhang, Jason Dong, and Charles Young. 2018. Ultra Power-
   Efficient CNN Domain Specific Accelerator with 9.3 TOPS/Watt for Mobile and Embedded Applications. In
   Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 1677–1685.

[10] Luc Thomas et al. 2017. Basic Principles, Challenges and Opportunities of STT-MRAM for Embedded
   Memory Applications. MSST 2017 (2017).