
On the Energy Efficiency of Parallel Multi-core vs Hardware Accelerated HD Video Decoding

Djamel Benazzouz

dbenazzouz@yahoo.fr

Eric Senn
Univ. Bretagne Sud, UMR6285, Lab-STICC, F56100 Lorient, France

Jalil Boukhobza
Univ. Bretagne Occidentale, UMR6285, Lab-STICC, F29200 Brest, France

Univ. M'hamed Bougara, LMSS, Boumerdes, Algeria

Yahia Benmoussa

2014


Hardware video accelerators are used on mobile devices to support energy-efficient, real-time High Definition (HD) video decoding. Recently, the rise of multi-core architectures on those devices has increased their performance and made real-time HD video decoding possible using parallel processing on the General Purpose Processor (GPP) cores only. An interesting question is what level of energy efficiency this kind of multi-core GPP can achieve compared to hardware video accelerators. In this paper, we propose an experimental evaluation of the energy efficiency of the two video decoding approaches. Accurate energy measurements were carried out on a recent low-power 40 nm mobile SoC containing a quad-core ARM processor and a hardware video accelerator. The results show that parallel multi-core HD decoding enhances both the performance and the energy efficiency compared to the use of a single core. However, hardware-accelerated decoding is about three times more energy efficient. Based on the experimental observations, some challenges for enhancing the energy efficiency of parallel multi-core video decoding are pointed out.

Categories and Subject Descriptors

H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems; D.4.8 [Operating Systems]: Performance; C.3 [Special Purpose and Application Based Systems]: Real-time and embedded systems

Keywords

Parallel video decoding, Energy efficiency, multi-core SoC

1. INTRODUCTION

Video decoding is both a processing-intensive and a real-time application. To meet these constraints, the processors equipping mobile devices may need to run at ever higher frequencies, especially given the increasing demand for HD video.

However, due to the thermal and power issues faced in the design of modern microprocessors, it is no longer possible to keep increasing the clock frequency. In fact, high frequencies lead to a drastic increase in thermal dissipation and energy consumption, because the dynamic power grows linearly with the clock frequency and quadratically with the supply voltage, which must itself be raised to sustain higher frequencies. This is more critical for energy-constrained mobile devices such as smartphones and tablets. To overcome this issue, modern embedded processor architectures use parallelism to increase performance without increasing the frequency [9].

In the field of video decoding, parallelism can enhance the energy efficiency of energy-constrained devices. It can be implemented in specialized hardware video accelerators, whose energy efficiency is well established [18]. However, hardware accelerators are proprietary solutions and lack flexibility: they are not open, and their use depends on an API provided by the vendor. Moreover, implementing a new video standard in hardware may take a long time, unlike software-based solutions running on a GPP. For example, the latest mobile devices still do not provide hardware acceleration for the new HEVC standard.

Recently, the SoCs equipping mobile devices have included more and more GPP cores. For example, the latest ARM big.LITTLE architecture contains four Cortex A7 and four Cortex A15 processors [3]. An interesting question is what level of performance and energy efficiency this kind of multi-core GPP can achieve compared to hardware video accelerators. The objective is to provide a video decoding solution that reconciles energy efficiency and flexibility.

In this study, we investigate the performance and energy efficiency of parallel multi-core video decoding compared to the hardware accelerator based approach. For this purpose, we propose an experimental methodology based on power consumption measurements carried out on an embedded platform containing four GPP cores and a hardware video accelerator. The results show that the hardware accelerator is about three times more energy efficient than the optimal parallel multi-core video decoding configuration. Moreover, they point to some challenges in enhancing the energy efficiency of parallel multi-core video decoding.

The remainder of this paper is organized as follows. Related works on the energy aspects of parallel video decoding are discussed in section 2. In section 3, some background material on the power consumption and energy efficiency of parallel video decoding is presented. The experimental methodology and the results are described in sections 4 and 5 respectively. Finally, the conclusions and some perspectives for future work are given in section 6.

2. RELATED WORKS

The architectural design basics of hardware-accelerated H.264/AVC HD video decoding are introduced in [10]. The advantages, in terms of performance and energy consumption, of H.264/AVC video decoding using a hardware accelerator are highlighted in [18]. In the same way, a more general study [12] investigates the reasons for the energy inefficiency of GPPs and proposes guidelines to reduce the energy gap with respect to hardware video accelerators. The energy efficiency of hardware-accelerated video decoding is well established. However, video standards evolve quickly, and hardware video accelerators do not provide the flexibility to adapt to those changes [16].

DSP-based solutions aim to reconcile the flexibility of GPPs and the energy efficiency of hardware accelerators. In [9], the authors focus on the performance and energy efficiency that DSPs gain from pipelining and parallelism in CMOS circuits. In [6, 5], the authors compare the performance and the energy efficiency of GPPs and DSPs. However, HD video decoding was not considered in these studies. Moreover, DSP-based video decoding seems to have been abandoned by mobile device manufacturers in favor of hardware video accelerators, which are more energy efficient.

With the rise of modern SoCs integrating more and more processor cores, many studies have investigated the performance and the energy efficiency of parallel video decoding on these architectures. The authors of [13] compared different levels of video decoding parallelism (MB, slice and frame) on a multi-core architecture. In [4], the authors focus on the energy efficiency of parallel H.264/AVC decoding on a multi-core processor and evaluate the energy saving compared to mono-core decoding. However, they did not consider hardware acceleration in their study.

In this study, we propose a comprehensive experimental methodology to investigate the energy efficiency of parallel multi-core video decoding compared to decoding based on hardware video accelerators. As far as we know, no prior work has provided clear evaluation data for the two approaches.

3. BACKGROUND

We describe hereafter some elementary background on energy consumption in electronic circuits and on the role of parallelism in reducing it, especially in the case of video decoding.

3.1 Energy consumption

In CMOS digital circuits, the total power consumption is the sum of the static and dynamic power:

$$P_{tot} = P_{static} + P_{dyn} \qquad (1)$$

where $P_{static}$ and $P_{dyn}$ are defined as:

$$P_{static} = I_{leak} \cdot V \qquad (2)$$

$$P_{dyn} = C_{eff} \cdot V^2 \cdot f \qquad (3)$$

$I_{leak}$ is the leakage current, $V$ is the supply voltage associated with the clock frequency $f$, and $C_{eff}$ is the effective capacitance of the circuit [9].

The static power is related to the circuit fabrication technology and does not depend on circuit activity. Below a 65 nm feature size, it becomes significant and poses new low-power design challenges [14]. On the other hand, the dynamic power is related to the circuit activity. For example, in the case of a microprocessor, the dynamic power depends on the type of instructions executed and on the data accessed. In equation 3, this is represented by the parameter $C_{eff}$, defined as $C_{eff} = A \cdot C$, where $C$ is the circuit capacitance and $A$ is the activity factor.
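As a minimal sketch of equations (1) to (3), the power model can be written as a few Python functions. The function names and the numerical values in the usage line are illustrative assumptions, not measurements from the platform studied here.

```python
def static_power(i_leak, v):
    """Static (leakage) power in watts: P_static = I_leak * V."""
    return i_leak * v


def dynamic_power(activity, capacitance, v, f):
    """Dynamic power in watts: P_dyn = C_eff * V^2 * f, with C_eff = A * C."""
    c_eff = activity * capacitance
    return c_eff * v ** 2 * f


def total_power(i_leak, activity, capacitance, v, f):
    """Total power in watts: P_tot = P_static + P_dyn."""
    return static_power(i_leak, v) + dynamic_power(activity, capacitance, v, f)


if __name__ == "__main__":
    # Illustrative values only (assumed, not taken from the paper):
    # 50 mA leakage, activity factor 0.2, 1 nF capacitance, 1 V, 1 GHz.
    print(total_power(i_leak=0.05, activity=0.2, capacitance=1e-9, v=1.0, f=1e9))
```

The sketch makes the two dependencies explicit: the static term varies only with the supply voltage, while the dynamic term also scales with the activity factor and the clock frequency.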

Figure-1 illustrates a simplified representation of a CMOS circuit that processes a sequence of data D (encoded video frames) using a block B (a video decoder). The block B can operate at the frequencies $f/2$ and $f$, corresponding to the supply voltage levels $V_1 = 0.925$ V and $V_2 = 1.15$ V respectively¹. If $t$ is the processing time when B operates at frequency $f$ (Figure-1-a), then the energy consumption is $E_{V_2} = P_{V_2} \cdot t$, where $P_{V_2} = C_{eff} \cdot V_2^2 \cdot f$. If we suppose that the processing time doubles at frequency $f/2$ (Figure-1-b), then the ratio between the energy $E_{V_1}$ consumed by the circuit at frequency $f/2$ with $V_1 = 0.925$ V, and $E_{V_2}$ is:

$$\frac{E_{V_1}}{E_{V_2}} = \frac{C_{eff} \cdot V_1^2 \cdot \frac{f}{2} \cdot 2t}{C_{eff} \cdot V_2^2 \cdot f \cdot t} = \left(\frac{V_1}{V_2}\right)^2 \simeq 65\%$$

In this case, scaling down the voltage and the frequency decreases the power consumption to $P_{V_1} = C_{eff} \cdot V_1^2 \cdot \frac{f}{2} = \frac{1}{2}\left(\frac{V_1}{V_2}\right)^2 P_{V_2} \simeq 0.32 \cdot P_{V_2}$, which leads to a 35% energy saving at the cost of decreased performance. This may represent a scenario where the operating system dynamically scales down the processor frequency at run time when it detects a load decrease. This illustrates a system-driven voltage scaling.

In order to save energy without sacrificing performance, an architecture-driven voltage scaling [9] can be achieved by using two B blocks, both clocked at frequency $f/2$ and supplied with voltage $V_1$, as described in Figure-1-c. $P_{2 \cdot V_1}$ and $E_{2 \cdot V_1}$ refer to the power and the energy consumption associated with this configuration. Since the two blocks operate in parallel, the execution time does not increase, and the ratio between $E_{2 \cdot V_1}$ and $E_{V_2}$ is:

$$\frac{E_{2 \cdot V_1}}{E_{V_2}} = \frac{C_{eff} \cdot V_1^2 \cdot \frac{f}{2} \cdot t + C_{eff} \cdot V_1^2 \cdot \frac{f}{2} \cdot t}{C_{eff} \cdot V_2^2 \cdot f \cdot t} = \left(\frac{V_1}{V_2}\right)^2 \simeq 65\%$$

¹ $V_1$ and $V_2$ are the supply voltages associated with the 400 and 800 MHz frequencies of the Cortex A9 processor used in our experiments.

In this configuration, the total power consumption $P_{2 \cdot V_1}$ is the sum of the power consumptions of the two blocks, which is equal to $2 \cdot C_{eff} \cdot V_1^2 \cdot \frac{f}{2} = \left(\frac{V_1}{V_2}\right)^2 P_{V_2} \simeq 0.65 \cdot P_{V_2}$. The energy saving is equal to 35% without sacrificing performance, but at the cost of additional circuit area and static power.
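The two voltage scaling scenarios can be checked numerically. The following is a minimal sketch that assumes the dynamic power model only (static power ignored) and the voltage levels 0.925 V and 1.15 V quoted for the half and full frequency; the function names are hypothetical.

```python
V1, V2 = 0.925, 1.15  # supply voltages (volts) quoted in the text for f/2 and f


def dynamic_energy(c_eff, v, f, t, blocks=1):
    """Dynamic energy of `blocks` identical units, each running for time t."""
    return blocks * c_eff * v ** 2 * f * t


def scaling_ratios(v1=V1, v2=V2, c_eff=1.0, f=1.0, t=1.0):
    """Return (system-driven, architecture-driven) energy ratios vs. (V2, f)."""
    e_ref = dynamic_energy(c_eff, v2, f, t)                   # one block at (V2, f)
    e_system = dynamic_energy(c_eff, v1, f / 2, 2 * t)        # one block, half speed, twice as long
    e_arch = dynamic_energy(c_eff, v1, f / 2, t, blocks=2)    # two blocks, half speed, same time
    return e_system / e_ref, e_arch / e_ref


if __name__ == "__main__":
    r_system, r_arch = scaling_ratios()
    print(f"system-driven energy ratio:       {r_system:.2f}")
    print(f"architecture-driven energy ratio: {r_arch:.2f}")
    # Power ratios follow from energy / time: the system-driven case runs for 2t.
    print(f"system-driven power ratio:        {r_system / 2:.2f}")
    print(f"architecture-driven power ratio:  {r_arch:.2f}")
```

Both scenarios give the same energy ratio, $(V_1/V_2)^2$; they differ only in how the saving is paid for, either with a doubled execution time or with a second block.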

3.2 Parallel video decoding

As illustrated in Figure-2, an H.264/AVC video sequence is composed of a set of frames. Each frame may contain several slices, and each slice contains several macroblocks (MB = 16 x 16 pixels). The H.264 standard defines three main types of slices: I, P and B. An I-slice uses intra prediction and is independent of the slices in other frames. In intra prediction, an MB is predicted from adjacent blocks. A P-slice uses motion estimation and intra prediction, and depends on one or more slices in previous frames, whether I, P or B. Motion estimation is used to exploit the temporal correlation between slices. Finally, B-slices use bidirectional motion estimation and depend on slices from previous and future frames. Whatever its type, each slice can be decoded independently of the other slices within the same frame.

Video decoding can be parallelized at the frame, slice or MB level [15]. At the frame level, frames are decoded in parallel on different processing units. The drawback of this approach is that it does not scale well, because the number of independent frames available at a given time is limited. On the other hand, higher scalability is possible at the slice level. However, this depends on the encoder being configured to produce multi-slice frames. At the MB level, very good scalability can be achieved when the decoding is implemented in hardware codecs. On the other hand, parallel MB decoding on many-core processors is not efficient, due to a considerable inter-processor communication and synchronization overhead [2].

In this study, we compare slice-based parallelism on multicore ARM processors and hardware accelerated video decoding using MB level parallelism.
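The structure of slice-level parallelism can be sketched as follows. This is only an illustration of the scheduling pattern, not the libavcodec implementation: `decode_slice` is a hypothetical stand-in that merely reverses the bytes of a slice, and Python threads are used for brevity even though CPython's global interpreter lock would prevent a real speedup for pure-Python work.

```python
from concurrent.futures import ThreadPoolExecutor


def decode_slice(slice_data):
    """Hypothetical stand-in for a real slice decoder (here: reverse the bytes)."""
    return slice_data[::-1]


def decode_frame(slices, pool):
    """Decode all slices of one frame in parallel and reassemble them in order.

    pool.map yields results in submission order, and joining them consumes
    every result, so the frame is complete only when its slowest slice is done.
    """
    return b"".join(pool.map(decode_slice, slices))


def decode_sequence(frames, workers=4):
    """Decode frames one after another, each with slice-level parallelism."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [decode_frame(slices, pool) for slices in frames]


if __name__ == "__main__":
    video = [[b"ab", b"cd", b"ef", b"gh"], [b"ij", b"kl", b"mn", b"op"]]
    print(decode_sequence(video))
```

Because a frame is only finished once all of its slices are decoded, the per-frame join acts as a synchronization point between the worker threads.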

4. METHODOLOGY

The proposed experimental methodology aims to compare the energy efficiency of parallel multi-core and hardware-accelerated HD video decoding. It is based on power consumption measurements on a real embedded platform containing a multi-core processor and a hardware video codec. We describe hereafter the hardware and software used, then the performance and energy consumption measurement methodology.

On this hardware platform, the Linux operating system version 3.0.17 was used with cpufreq enabled to drive the frequency scaling of the ARM cores. The userspace governor was activated to allow the clock frequency to be controlled at the application level. H.264/AVC video decoding was performed using GStreamer [11], a multimedia development framework. The ARM decoding was performed using ffdec_h264, an open-source plug-in based on the widely used ffmpeg/libavcodec library, compiled with support for the NEON SIMD instruction set. For the hardware-accelerated decoding, we used vpudec, a proprietary GStreamer H.264/AVC plug-in provided by Freescale. As a test video, we used the well-known Big Buck Bunny sequence, encoded at 720p resolution (1280x720), a 2 Mb/s bit-rate and a 24 Hz frame rate using the x264 encoder. We configured the encoder to produce 4 slices per frame by means of the --slices option. The objective is to fully exploit the 4 available ARM cores on the

4.3 Performance measurement

We started by measuring the video decoding performance using single-core, dual-core and quad-core decoding at all the available clock frequencies (400, 800 and 1000 MHz), as well as VPU decoding. The number of cores used for decoding is selected by setting the max-threads parameter of the ffdec_h264 plug-in. VPU or multi-core decoding is selected by choosing the corresponding GStreamer plug-in (vpudec or ffdec_h264). For each configuration, we calculated the number of decoded frames per second (fps). The libavcodec library supports both slice-level and frame-level multi-threaded decoding. However, the ffdec_h264 plug-in does not allow the method to be selected explicitly, and the automatic selection mechanism systematically tends to select frame-level multi-threading. To fix this issue, the plug-in was forced to use the slice-level method by setting active_thread_type = FF_THREAD_SLICE in the pthread.c source file.
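The measurement loop described above can be sketched as follows. This is a minimal sketch under stated assumptions: the frequency is set through the standard Linux cpufreq sysfs files of the userspace governor, the nominal frequencies are written in kHz and may differ from the values a given kernel actually exposes, and `decode` is a hypothetical callable that decodes the whole test video with the requested number of threads.

```python
import itertools
import time
from pathlib import Path

# Nominal frequencies from the text, in kHz as cpufreq expects. The exact
# values exposed by a given kernel are an assumption and may differ.
FREQS_KHZ = (400_000, 800_000, 1_000_000)
CORE_COUNTS = (1, 2, 4)
SYSFS_CPU = Path("/sys/devices/system/cpu")


def set_frequency(khz, cpu=0, root=SYSFS_CPU):
    """Select the userspace governor and request a fixed frequency (needs root)."""
    cpufreq = Path(root) / f"cpu{cpu}" / "cpufreq"
    (cpufreq / "scaling_governor").write_text("userspace")
    (cpufreq / "scaling_setspeed").write_text(str(khz))


def frames_per_second(n_frames, elapsed_s):
    """Decoding speed in frames per second."""
    return n_frames / elapsed_s


def sweep(decode, n_frames, set_freq=set_frequency):
    """Measure the decoding speed for every (frequency, core count) pair."""
    results = {}
    for khz, cores in itertools.product(FREQS_KHZ, CORE_COUNTS):
        set_freq(khz)
        start = time.perf_counter()
        decode(cores)  # hypothetical: decodes the full video on `cores` threads
        elapsed = time.perf_counter() - start
        results[(khz, cores)] = frames_per_second(n_frames, elapsed)
    return results
```

Passing `set_freq` and `root` as parameters keeps the sketch testable on a machine without cpufreq support; on the target board the defaults would write to the real sysfs tree.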

4.4 Energy consumption measurement

The SABRE board used has two power domains that can be measured separately. The ARM power domain includes the 4 ARM cores plus the cache memory, and the SoC power domain includes the VPU, 2D GPU, 3D GPU and the OpenVG [1]. A 0.02 Ω shunt resistor, Rshunt, was inserted in each power domain (see Figure-3).

The power consumption is then measured using the Open-PEOPLE framework [7], a multi-user and multi-target power and energy optimization platform and estimator. It includes the NI-PXI-4472 digitizer, which allows a sampling rate of up to 100 kHz. At a given time, the power consumption is:

$$P = V_c \cdot \frac{V_{shunt}}{R_{shunt}}$$

The energy consumption is obtained by summing the elementary power values, sampled at 1 kHz, each multiplied by the sampling period.

In the case of multi-core ARM video decoding, only the ARM power domain consumption is measured. On the other hand, the sum of the ARM and SoC power domain consumptions is measured in the case of VPU decoding, since both the ARM cores and the VPU are involved in the decoding process.

5. EXPERIMENTAL RESULTS

5.1 Performances

[Figure 5: processor usage (%) for 1 core, 2 cores, 4 cores and VPU decoding]

Table-2 shows the performance results of the video decoding. In the case of multi-core decoding, the decoding speed is higher than the display rate when using 2 or 4 cores at clock frequencies of 800 MHz and above. In the case of VPU video decoding, the decoding speed is 3.75 times the display rate regardless of the ARM core frequency². This is illustrated in Figure-4, where the flat red surface represents the display rate (24 fps).

The values in parentheses in Table-2 represent the performance scaling factor compared to mono-core video decoding. Using four ARM cores yields only a x2.4 performance increase. This is mainly due to the unbalanced workload. In fact, the video encoder divides each frame into equal-size slices, but the decoding workload depends on the scene complexity of each slice. Thus, a decoding thread assigned to a given slice may terminate before the other ones. In the meantime, it goes into a blocked state, waiting for the other threads to terminate.

² The VPU frequency (264 MHz) remains constant when varying the frequency of the ARM cores.

On the other hand, the scaling factor is much higher (from x5 to x12) in the case of VPU decoding. This is due to the MB-level parallelism implemented in the VPU.

The measured processor usages³ illustrated in Figure-5 confirm these observations. In the case of single-core video decoding (one thread), the processor usage is 100%, which means that the decoding thread is active all the time. However, it is only around 160% and 260% in the case of dual-core and quad-core decoding respectively. On the other hand, when using the VPU, the processor usage is about 15%, because the ARM cores are idle most of the time, waiting for the frame to be decoded by the VPU.
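The link between unbalanced slices and processor usage can be illustrated with a small model. It is a sketch under simplifying assumptions that are not taken from the measurements: one thread per slice, a synchronization point at the end of every frame, and no threading overhead. The slice decoding times in the usage example are made-up values.

```python
def parallel_time(frames):
    """Decoding time with one thread per slice: each frame lasts as long as its slowest slice."""
    return sum(max(slice_times) for slice_times in frames)


def active_time(frames):
    """Total time during which the threads hold a core (sum of all slice decoding times)."""
    return sum(sum(slice_times) for slice_times in frames)


def processor_usage(frames):
    """Processor usage in percent: summed active time divided by the decoding time."""
    return 100.0 * active_time(frames) / parallel_time(frames)


if __name__ == "__main__":
    balanced = [[2, 2, 2, 2]]
    unbalanced = [[4, 2, 1, 1], [3, 3, 1, 1]]  # hypothetical slice times
    print(processor_usage(balanced))    # four fully busy cores
    print(processor_usage(unbalanced))  # cores partly blocked on the slowest slice
```

Under these assumptions the speedup over single-core decoding equals the usage divided by 100, so a usage well below 400% on four cores directly bounds the achievable scaling factor.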

5.2 Energy consumption

Table-3 shows the energy consumption of video decoding using the ARM cores and the VPU. The values in parentheses in Table-3 represent the energy reduction factor compared to single-core decoding. As expected, for a given clock frequency, increasing the number of cores reduces the energy consumption (see Figure-6). For example, compared to mono-core decoding, the optimal multi-core configuration (4 cores, 800 MHz) decreases the energy by a factor of x0.74 while increasing the performance by a factor of x2.43.
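The energy values compared here are obtained by integrating sampled power over the decoding run, as described in the measurement methodology. A minimal sketch of that computation follows; it assumes that the 0.02 shunt value is expressed in ohms, that the samples are taken at 1 kHz, and that `v_c` denotes the voltage multiplied by the shunt current. The function names are hypothetical and do not reflect the Open-PEOPLE API.

```python
R_SHUNT_OHM = 0.02     # shunt value from the text; the unit (ohm) is an assumption
SAMPLE_RATE_HZ = 1000  # sampling rate used for the energy computation


def instantaneous_power(v_c, v_shunt, r_shunt=R_SHUNT_OHM):
    """P = V_c * V_shunt / R_shunt, i.e. voltage times the current through the shunt."""
    return v_c * v_shunt / r_shunt


def energy_joules(v_c_samples, v_shunt_samples,
                  sample_rate=SAMPLE_RATE_HZ, r_shunt=R_SHUNT_OHM):
    """Sum the power samples, each multiplied by the sampling period."""
    dt = 1.0 / sample_rate
    return sum(instantaneous_power(vc, vs, r_shunt) * dt
               for vc, vs in zip(v_c_samples, v_shunt_samples))


def energy_factor(e_config, e_reference):
    """Energy of a configuration relative to a reference (below 1 means a saving)."""
    return e_config / e_reference


if __name__ == "__main__":
    # One second of made-up samples: 1 V supply, 10 mV across the shunt.
    print(energy_joules([1.0] * 1000, [0.01] * 1000))
```

For VPU decoding, the same computation would be applied to both power domains and the two energies added, since both domains are involved in that case.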

On the other hand, the energy saving is much larger in the case of VPU video decoding: a x0.23 factor compared to mono-core decoding at 800 MHz, and x0.36 compared to the optimal multi-core video decoding (4 cores, 800 MHz). This can be explained by both the high decoding performance and the very low power consumption of the VPU. As illustrated in Figure-7-a, the decoding of the 480 video frames completes in about 5 seconds. During this decoding phase, the power consumption of the SoC power domain increases by only 0.2 W, which corresponds to the VPU power consumption. This low value can be explained by the low frequency (264 MHz) of the VPU. During this time, the power consumption of the ARM cores is negligible. In fact, as illustrated in Figure-7-b, which shows the frame-by-frame power consumption variation, the ARM cores are idle most of the time, waiting for the VPU to decode a video frame. In the idle state, the ARM cores execute the WFI (Wait For Interrupt) instruction, where most of the processor clocks are gated to reduce the power consumption.

³ Processor usage = $(\sum_i T_i) / T_{exe}$, where $T_i$ is the time during which the $i$-th thread held a processor core (active time) and $T_{exe}$ is the decoding time.

Unlike VPU decoding, multi-core video decoding cannot reconcile performance and energy efficiency. As illustrated in Figure-8, at a 400 MHz frequency (see Figure-8-a), the power consumption is low (approximately 0.3 mW), but the decoding time is very long. On the other hand, at the higher frequencies, the decoding time is reduced but the power consumption increases considerably (see Figure-8-b and c).

The unbalanced workload over the processor cores may be a source of energy inefficiency. In fact, while a thread is waiting, its processor core continues to consume energy while doing nothing. One approach to fix this issue is to set the clock frequency of each core according to the slice decoding workload, or to switch a processor core to a low-power mode during its inactivity using Dynamic Power Management (DPM), as proposed in [17]. However, this is not possible on the i.MX6 SoC used here, since it does not support per-core DVFS/DPM.
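The potential benefit of per-core DPM can be estimated with a simple model of the energy spent while waiting. This is a sketch under assumptions that are not taken from the measurements: a waiting core draws a constant power `p_wait`, a core placed in a low-power mode draws `p_sleep`, and mode transitions are free. All numerical values are made up.

```python
def waiting_time(slice_times):
    """Total time the cores spend blocked until the slowest slice of a frame finishes."""
    finish = max(slice_times)
    return sum(finish - t for t in slice_times)


def waiting_energy(frames, p_wait):
    """Energy consumed by blocked cores over all frames, at a constant waiting power."""
    return p_wait * sum(waiting_time(slice_times) for slice_times in frames)


def dpm_saving(frames, p_wait, p_sleep):
    """Energy saved if blocked cores were switched to a low-power mode instead."""
    return waiting_energy(frames, p_wait) - waiting_energy(frames, p_sleep)


if __name__ == "__main__":
    frames = [[4, 2, 1, 1]]  # hypothetical slice decoding times (ms)
    # Hypothetical powers in watts; the result is in millijoules.
    print(dpm_saving(frames, p_wait=0.1, p_sleep=0.01))
```

The model ignores the latency and energy cost of entering and leaving the low-power mode, which a real per-core DPM policy would have to weigh against the waiting time of each slice.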

6. CONCLUSION

This paper is a use case study based on the i.MX6 SoC. The experimental results showed that multi-core video decoding enhances both the performance and the energy efficiency of HD video decoding compared to single-core decoding. However, the hardware video accelerator is about three times more energy efficient than the optimal multi-core video decoding configuration.

Although these results may differ on other architectures, the data obtained give a general idea of the energy consumption levels of HD video decoding on a recent heterogeneous SoC.

Given the rapid evolution of SoCs, which tend to integrate more and more cores [8], one may expect that the energy efficiency of multi-core video decoding can be enhanced if a larger number of cores is used [19]. Moreover, as pointed out in the discussion of the results, the energy efficiency of multi-core video decoding may also be enhanced if it is combined with per-core DVFS/DPM strategies. The objective is to avoid wasting energy while the cores are idle. We plan to investigate these issues in future work using the Exynos5 SoC, which contains 8 cores supporting per-core DVFS/DPM.

Acknowledgment

This work was supported by BPI France, Region Ile-de-France, Region Bretagne and Rennes Metropole through the French Project GreenVideo.

REFERENCES

[1] i.MX 6Dual/6Quad Power Consumption Measurement. Freescale Semiconductor, 2012.

[2] Alvarez Mesa, A. Ramírez, A. Azevedo, Meenderinck, Juurlink, and Valero. Scalability of macroblock-level parallelism for H.264 decoding. In Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on, pages 236-243. IEEE, 2009.

[3] ARM. big.LITTLE processing. http://www.arm.com/products/processors/technologies/biglittleprocessing.php, 2014.

[4] Baaklini, Rethinagiri, Sbeity, and Niar. Scalable row-based parallel H.264 decoder on embedded multicore processors. Signal, Image and Video Processing, pages 1-15, 2014.

[5] Benmoussa, Boukhobza, E. Senn, and Benazzouz. Energy consumption modeling of H.264/AVC video decoding for GPP and DSP. In Proceedings of the 16th Euromicro Conference on Digital System Design, 2013.

[6] Benmoussa, Boukhobza, E. Senn, and Benazzouz. GPP vs DSP: A performance/energy characterization and evaluation of video decoding. In Proceedings of the IEEE 21st International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 2013.

[7] Benmoussa, Senn, Boukhobza, Lanoe, and Benazzouz. Open-PEOPLE, a collaborative platform for remote & accurate measurement and evaluation of embedded systems power consumption. In Proceedings of the IEEE 22nd International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 2014.

[8] Borkar. Thousand core chips: A technology perspective. In Proceedings of the 44th Annual Design Automation Conference, DAC '07, pages 746-749. ACM, 2007.

[9] Chandrakasan, Sheng, and Brodersen. Low-power CMOS digital design. IEEE Journal of Solid-State Circuits, 27(4):473-484, 1992.

[10] T.-C. Chen, S.-Y. Chien, Y.-W. Huang, C.-H. Tsai, C.-Y. Chen, T.-W. Chen, and L.-G. Chen. Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder. Circuits and Systems for Video Technology, IEEE Trans. on, pages 673-688, 2006.

[11] C. M. Don Darling and B. Singh. GStreamer on Texas Instruments OMAP35x processors. Proceedings of the Ottawa Linux Symposium, pages 69-78, 2009.

[12] Hameed, Qadeer, Wachs, Azizi, Solomatnikov, B. C. Lee, Richardson, Kozyrakis, and Horowitz. Understanding sources of inefficiency in general-purpose chips. SIGARCH Comput. Archit. News, 38(3):37-47, 2010.

[13] Kilicarslan, C. G. Gurler, O. Ozkasap, and A. M. Tekalp. Energy efficient video decoding on multi-core devices. In Proceedings of the 2nd International Conference on Energy-Efficient Computing and Networking, pages 63-66. ACM, 2011.

[14] Kim, Austin, Blaauw, Mudge, Flautner, Hu, Irwin, Kandemir, and Narayanan. Leakage current: Moore's law meets static power. Computer, 36(12):68-75, 2003.

[15] Meenderinck, Azevedo, Alvarez, Juurlink, and Ramirez. Parallel scalability of H.264. In Proceedings of the First Workshop on Programmability Issues for Multi-Core Computers, 2008.

[16] G. J. Smit, A. B. Kokkeler, P. T. Wolkotte, and M. D. van de Burgwal. Multi-core architectures and streaming applications. In Proceedings of the 2008 International Workshop on System Level Interconnect Prediction, SLIP '08, pages 35-42. ACM, 2008.

[17] Y.-H. Wei, C.-Y. Yang, T.-W. Kuo, S.-H. Hung, and Y.-H. Chu. Energy-efficient real-time scheduling of multimedia tasks on multi-core processors. In Proceedings of the 2010 ACM Symposium on Applied Computing, SAC '10, pages 258-262. ACM, 2010.

[18] Xu, T.-M. Liu, J.-I. Guo, and C.-S. Choy. Methods for power/throughput/area optimization of H.264/AVC decoding. Journal of Signal Processing Systems, 60(1):131-145, 2010.

[19] Zhu, Yu, Cui, Yu, and Zeng. H.264 video parallel decoder on a 24-core processor. In ASIC (ASICON), 2013 IEEE 10th International Conference on, pages 1-4, 2013.