<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On the Energy Efficiency of Parallel Multi-core vs Hardware Accelerated HD Video Decoding</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Djamel Benazzouz</string-name>
          <email>dbenazzouz@yahoo.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Eric Senn Univ. Bretagne Sud, UMR6285, Lab-STICC</institution>
          ,
          <addr-line>F56100 Lorient</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Jalil Boukhobza Univ. Bretagne Occidentale, UMR6285, Lab-STICC</institution>
          ,
          <addr-line>F29200 Brest</addr-line>
          ,
          <country>France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Univ. M'hamed Bougara, LMSS</institution>
          ,
          <addr-line>Boumerdes</addr-line>
          ,
          <country country="DZ">Algeria</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Yahia Benmoussa</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>14</lpage>
      <abstract>
        <p>Hardware video accelerators are used on mobile devices to support energy-efficient, real-time high-definition (HD) video decoding. Recently, the rise of multi-core architectures on these devices has increased their performance and made real-time HD video decoding possible using parallel processing on the GPP cores alone. An open question is what level of energy efficiency this kind of multi-core General Purpose Processor (GPP) can achieve compared to hardware video accelerators. In this paper, we propose an experimental evaluation of the energy efficiency of the two video decoding approaches. Accurate energy measurements were made on a recent low-power 40 nm mobile SoC containing a quad-core ARM processor and a video hardware accelerator. The results show that parallel multi-core HD decoding improves both performance and energy efficiency compared to the use of a single core. However, hardware-accelerated decoding is about three times more energy efficient. Based on the experimental observations, some challenges for enhancing the energy efficiency of parallel multi-core video decoding are pointed out.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>H.5.1 [Information Interfaces and Presentation]:
Multimedia Information Systems; D.4.8 [Operating Systems]:
Performance; C.3 [Special Purpose and Application Based
Systems]: Real-time and embedded systems</p>
      <p>Keywords: Parallel video decoding, energy efficiency, multi-core SoC</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>Video decoding is both a processing-intensive and a real-time
application. To fulfill these constraints, the processors
equipping mobile devices may need to run at ever higher
frequencies, especially in the context of an increasing
demand for HD video.</p>
      <p>
        However, due to the thermal and power issues faced in the
design of modern microprocessors, it is no longer possible to
keep increasing the clock frequency. In fact, higher
frequencies lead to a drastic increase in thermal
dissipation and energy consumption, because the dynamic power
grows linearly with the clock frequency and quadratically with
the supply voltage that a higher frequency requires. This is even
more critical for energy-constrained mobile devices such as
smartphones and tablets. To
overcome this issue, modern embedded processor
architectures use parallelism to increase performance
without increasing the frequency [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        In the field of video decoding, parallelism can enhance
the energy efficiency of energy-constrained devices. It can
be implemented in specialized hardware video accelerators,
whose energy efficiency is well established [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. However,
hardware accelerators are proprietary solutions and lack
flexibility. In fact, they are not open and their use depends
on an API provided by the vendor. Moreover, it may take
a long time to implement a new video standard in hardware
circuits, unlike software-based solutions running on a GPP.
For example, the latest mobile devices still do not provide
a hardware accelerator for the new HEVC standard.
      </p>
      <p>
        Recently, the SoCs equipping mobile devices have come to include
more and more GPP cores. For example, the latest ARM
big.LITTLE architecture contains four Cortex-A7 and four
Cortex-A15 processors [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. An open question is the level of performance and energy
efficiency this kind of multi-core GPP can achieve compared
to hardware video accelerators. The objective is to provide
a video decoding solution that reconciles energy
efficiency with flexibility.
      </p>
      <p>In this study, we investigate the performance and energy
efficiency of parallel multi-core video decoding compared
to the hardware accelerator based approach. For this
purpose, we propose an experimental methodology based on
power consumption measurements performed on an embedded
platform containing four GPP cores and a video hardware
accelerator. The results show that the hardware
accelerator is three times more energy efficient than the
optimal parallel multi-core video decoding. Moreover, they
allowed us to identify some challenges for enhancing the energy
efficiency of parallel multi-core video decoding.</p>
      <p>The remainder of this paper is organized as follows.
Related works on the energy aspects of parallel video
decoding are discussed in section 2. In section 3, some background
material regarding the power consumption and the energy
efficiency of parallel video decoding is presented. The
experimental methodology and the obtained results are described
in sections 4 and 5 respectively. Finally, the conclusions and
some perspectives for future work are given in section 6.</p>
    </sec>
    <sec id="sec-3">
      <title>RELATED WORKS</title>
      <p>
        The architectural design basis of hardware-accelerated
H.264/AVC HD video decoding is introduced in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The
advantages, in terms of performance and energy
consumption, of H.264/AVC video decoding using a hardware
accelerator are highlighted in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. In the same way, a more general
study [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] investigates the sources of energy inefficiency in
GPPs and proposes guidelines to reduce their energy
overhead compared to video hardware accelerators. The
energy efficiency of hardware-accelerated video decoding is well
established. However, video standards evolve quickly and
hardware video accelerators do not provide the flexibility
to adapt to those changes [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        DSP-based solutions aim to reconcile the flexibility of
GPPs with the energy efficiency of hardware accelerators. In
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the authors focus on the performance and energy
efficiency of DSPs resulting from the use of pipelining and parallelism in
CMOS circuits. In [
        <xref ref-type="bibr" rid="ref5 ref6">6, 5</xref>
        ], the authors compare the
performance and the energy efficiency of GPPs and DSPs. However,
HD video decoding was not considered in these
studies. Moreover, DSP-based video decoding seems to have been
abandoned by mobile device manufacturers in favor of hardware
video accelerators, which are more energy efficient.
      </p>
      <p>
        With the rise of modern SoCs integrating more and more
processor cores, many studies have investigated the performance
and the energy efficiency of parallel video decoding on these
architectures. In [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the authors compare different video
decoding parallelism levels (MB, slice and frame) on a multi-core
architecture. In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the authors focus on the energy
efficiency of parallel H.264/AVC decoding on a multi-core
processor. They evaluate the energy saving compared
to single-core decoding. However, they do not consider
hardware acceleration in their study.
      </p>
      <p>In this study, we propose a comprehensive experimental
methodology to investigate the energy efficiency of
parallel multi-core video decoding compared to that of
hardware video accelerators. As far as we know, no prior
work has provided a clear comparative evaluation of the two approaches.</p>
    </sec>
    <sec id="sec-4">
      <title>BACKGROUND</title>
      <p>We describe hereafter some elementary background
related to energy consumption in electronic circuits and
the role of parallelism in reducing the energy consumption,
especially in the case of video decoding.</p>
    </sec>
    <sec id="sec-5">
      <title>Energy consumption</title>
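      <p>In CMOS digital circuits, the total power consumption is
the sum of the static and dynamic power:</p>
      <disp-formula><label>(1)</label><tex-math>P_{tot} = P_{static} + P_{dyn}</tex-math></disp-formula>
      <p>where P<sub>static</sub> and P<sub>dyn</sub> are defined as:</p>
      <disp-formula><label>(2)</label><tex-math>P_{static} = I_{leak} \cdot V</tex-math></disp-formula>
      <disp-formula><label>(3)</label><tex-math>P_{dyn} = C_{eff} \cdot V^{2} \cdot f</tex-math></disp-formula>
      <p>
        I<sub>leak</sub> is the leakage current, V is the supply voltage
associated with the clock frequency f, and C<sub>eff</sub> is the effective
capacitance of the circuit [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>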
      <p>
        The static power is related to the circuit fabrication
technology and does not depend on the circuit activity. Below a
65 nm feature size, it becomes significant and poses new
low-power design challenges [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. On the other hand, the
dynamic power is related to the circuit activity. For example,
in the case of a microprocessor, the dynamic power depends on
the type of instructions executed and on the data accessed.
In equation 3, this is represented by the C<sub>eff</sub> parameter,
defined as C<sub>eff</sub> = A · C, where C is the circuit capacitance and
A is the activity factor.
      </p>
      <p>Figure-1 illustrates a simplified representation of a CMOS
circuit which processes a set of sequential data D (encoded
video frames) using a block B (video decoder). The block B
operates at frequencies f/2 and f, corresponding to the supply
voltage levels V<sub>1</sub> = 0.925 V and V<sub>2</sub> = 1.15 V respectively<sup>1</sup>. If
t is the processing time when B operates at frequency f
(Figure-1-a), then the energy consumption is E<sub>V2</sub> = P<sub>V2</sub> · t,
where P<sub>V2</sub> = C<sub>eff</sub> · V<sub>2</sub><sup>2</sup> · f. If we suppose that the processing time
at frequency f/2 (Figure-1-b) is doubled, then the ratio
between the energy E<sub>V1</sub>, consumed by the circuit at
frequency f/2 with supply voltage V<sub>1</sub>, and E<sub>V2</sub> is:</p>
      <disp-formula><tex-math>\frac{E_{V_1}}{E_{V_2}} = \frac{C_{eff} \cdot V_1^{2} \cdot \frac{f}{2} \cdot 2t}{C_{eff} \cdot V_2^{2} \cdot f \cdot t} = \left(\frac{V_1}{V_2}\right)^{2} \simeq 65\%</tex-math></disp-formula>
      <p>In this case, scaling down the voltage and the frequency
decreases the power consumption to P<sub>V1</sub> = C<sub>eff</sub> · V<sub>1</sub><sup>2</sup> · f/2 ≈
0.32 · P<sub>V2</sub>. Since the processing time doubles, this leads to a 35% energy saving at the cost of
decreased performance. This may represent a scenario where
the operating system dynamically scales down the processor
frequency at run time when it detects a load decrease. This
illustrates a system-driven voltage scaling.</p>
      <p>
        In order to save energy without sacrificing performance,
an architecture-driven voltage scaling [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] can be achieved by
using two B blocks which are both clocked at frequency f/2
and supplied with voltage V<sub>1</sub>, as described in Figure-1-c.
P<sub>2·V1</sub> and E<sub>2·V1</sub> refer to the power and the energy
consumption associated with this configuration. Since the two blocks
operate in parallel, the execution time does not
increase and the ratio between E<sub>2·V1</sub> and E<sub>V2</sub> is:
      </p>
      <disp-formula><tex-math>\frac{E_{2 \cdot V_1}}{E_{V_2}} = \frac{C_{eff} \cdot V_1^{2} \cdot \frac{f}{2} \cdot t + C_{eff} \cdot V_1^{2} \cdot \frac{f}{2} \cdot t}{C_{eff} \cdot V_2^{2} \cdot f \cdot t} = \left(\frac{V_1}{V_2}\right)^{2} \simeq 65\%</tex-math></disp-formula>
      <p><sup>1</sup> V<sub>1</sub> and V<sub>2</sub> are the supply voltages associated with the 400 and 800 MHz frequencies, respectively, of the Cortex-A9 processor used in our experiments.</p>
      <p>In this configuration, the total power consumption P<sub>2·V1</sub>
is the sum of the power consumptions of the two blocks,
which is equal to 2 · C<sub>eff</sub> · V<sub>1</sub><sup>2</sup> · f/2 ≈ 0.65 · P<sub>V2</sub>. The energy saving
is equal to 35% without sacrificing performance, but at
the cost of additional circuit area and static power.</p>
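      <p>As a hedged numerical illustration of the two scenarios above, the following Python sketch evaluates both energy ratios from equation 3, using the voltage levels quoted in the text (0.925 V at f/2 and 1.15 V at f). The effective capacitance value, the reference time and all function names are placeholders introduced for this example only; the capacitance cancels out of the ratios.</p>
      <preformat>
```python
def dynamic_power(c_eff, voltage, frequency):
    """Dynamic power of a CMOS block: P_dyn = C_eff * V^2 * f (equation 3)."""
    return c_eff * voltage ** 2 * frequency


C_EFF = 1.0e-9        # placeholder effective capacitance (cancels out in the ratios)
F = 800e6             # full clock frequency f, in Hz
T = 1.0               # processing time at frequency f, in seconds (placeholder)
V1, V2 = 0.925, 1.15  # supply voltages at f/2 and f, as quoted in the text

# Reference: one block at (V2, f) running for t.
e_ref = dynamic_power(C_EFF, V2, F) * T

# System-driven scaling: one block at (V1, f/2), processing time doubled.
e_scaled = dynamic_power(C_EFF, V1, F / 2) * (2 * T)

# Architecture-driven scaling: two blocks at (V1, f/2), processing time unchanged.
e_parallel = 2 * dynamic_power(C_EFF, V1, F / 2) * T

ratio_scaled = e_scaled / e_ref
ratio_parallel = e_parallel / e_ref
print(f"scaled: {ratio_scaled:.3f}, parallel: {ratio_parallel:.3f}")
```
</preformat>
      <p>Both ratios come out at about 0.65, which corresponds to the 35% saving discussed above. The difference is that the parallel configuration finishes in time t rather than 2t. Static power and the extra circuit area are deliberately left out of this sketch.</p>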
    </sec>
    <sec id="sec-6">
      <title>Parallel video decoding</title>
      <p>As illustrated in Figure-2, an H.264/AVC video sequence is
composed of a set of frames. Each frame may contain several
slices and each slice contains several macroblocks (MB = 16
x 16 pixels). The H.264 standard defines three main types
of slices: I, P, and B. An I-slice uses intra prediction and is
independent of the slices in other frames. In intra prediction,
an MB is predicted based on adjacent blocks. A P-slice uses
motion estimation and intra prediction and depends on one
or more slices in previous frames, either I, P or B. Motion
estimation is used to exploit the temporal correlation between
slices. Finally, B-slices use bidirectional motion estimation
and depend on slices from previous and future frames. Whatever
its type, each slice can be decoded independently of the other
slices within the same frame.</p>
      <p>
        Parallel video decoding can be achieved at the
frame, slice or MB level [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. At the frame level, the frames
may be decoded in parallel on different processing units.
The drawback of such an approach is that it does not scale
very well because the number of independent slices is limited
at a given time. On the other hand, a higher scalability
is possible at the slice level. However, this depends on the
encoder settings, which must enable multi-slice frames. At the MB level,
very good scalability can be achieved when the decoding
is implemented in hardware codecs. On the other hand,
parallel MB decoding on many-core processors is not efficient
due to a considerable inter-processor communication and synchronization
overhead [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>In this study, we compare slice-based parallelism on
multi-core ARM processors with hardware-accelerated video
decoding using MB-level parallelism.</p>
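      <p>The slice-level scheme can be sketched structurally as follows: since the slices of one frame are independent, each one is handed to its own worker and the frame is complete when all workers have returned. This is only an illustration of the control flow. The decode_slice function is a hypothetical stand-in for a real slice decoder, and Python threads do not reproduce the performance behavior of the native decoder threads used in this study.</p>
      <preformat>
```python
from concurrent.futures import ThreadPoolExecutor


def decode_slice(slice_data):
    """Hypothetical stand-in for a real slice decoder.

    It only transforms the bytes so that the result is observable.
    """
    return slice_data.upper()


def decode_frame(slices, max_threads=4):
    """Decode the independent slices of one frame in parallel.

    Results are returned in slice order; the call returns only when
    the slowest slice has finished, which is where waiting time appears.
    """
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(decode_slice, slices))


# One frame encoded as 4 slices (placeholder payloads).
frame = [b"slice-0", b"slice-1", b"slice-2", b"slice-3"]
decoded = decode_frame(frame)
print(decoded)
```
</preformat>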
    </sec>
    <sec id="sec-7">
      <title>METHODOLOGY</title>
      <p>The proposed experimental methodology aims to
compare the energy efficiency of parallel multi-core and
hardware-accelerated HD video decoding. It is based on
power consumption measurements on a real embedded
platform containing a multi-core processor and a hardware video
codec. We describe hereafter the hardware and
software used, then the performance and energy consumption
measurement methodology.</p>
      <p>
        On this hardware platform, the Linux operating system
version 3.0.17 was used with cpufreq enabled to drive the
frequency scaling of the ARM cores. The userspace governor was
activated to allow the clock frequency to be controlled at the
application level. H.264/AVC video decoding was performed
using GStreamer [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], a multimedia development framework.
ARM decoding was performed using ffdec_h264, an
open-source plug-in based on the widely used
ffmpeg/libavcodec library, compiled with support for the NEON SIMD
instruction set. For hardware-accelerated decoding, we
used vpudec, a proprietary GStreamer H.264/AVC plug-in
provided by Freescale. As a test video, we used the well-known
Big Buck Bunny sequence. We encoded it at 720p
resolution (1280x720), with a 2 Mb/s bit-rate and a 24 Hz frame rate, using
the x264 encoder. We configured the encoder to set the number
of slices per frame to 4 by means of the --slices option. The
objective is to fully exploit the 4 available ARM cores on the
      </p>
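      <p>For readers who want to reproduce the frequency control, the userspace governor is driven through the standard Linux cpufreq sysfs files scaling_governor and scaling_setspeed. The sketch below shows one possible way to do this from Python. It is an assumption about how such a step could be scripted, not the tooling used in this study; the function name and the root parameter are introduced for illustration, and writing to the real sysfs tree requires root privileges.</p>
      <preformat>
```python
from pathlib import Path

# Standard location of the per-CPU cpufreq directories on Linux.
CPUFREQ_ROOT = Path("/sys/devices/system/cpu")


def set_cpu_frequency(cpu, freq_khz, root=CPUFREQ_ROOT):
    """Select the userspace governor for one CPU and request a fixed frequency.

    freq_khz is expressed in kHz, as expected by scaling_setspeed.
    The root parameter exists only so the function can be tried on a
    scratch directory instead of the real sysfs tree.
    """
    cpufreq = Path(root) / f"cpu{cpu}" / "cpufreq"
    (cpufreq / "scaling_governor").write_text("userspace")
    (cpufreq / "scaling_setspeed").write_text(str(freq_khz))
```
</preformat>
      <p>A call such as set_cpu_frequency(0, 800000) would request 800 MHz for cpu0. Whether the request applies to one core or to all cores depends on the platform.</p>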
    </sec>
    <sec id="sec-8">
      <title>Performance measurement</title>
      <p>We started by measuring the performance of video
decoding using single-core, dual-core and quad-core decoding at all
the available clock frequencies (400, 800 and 1000 MHz), as well as
VPU decoding. The number of cores used for decoding
the video is selected by setting the value of the max-threads
parameter of the ffdec_h264 plug-in. VPU or multi-core
video decoding is selected by choosing the corresponding
GStreamer plug-in (ffdec_h264 or vpudec). For each
configuration, we calculated the number of decoded frames per
second (fps). The libavcodec library supports both slice-level and
frame-level multi-threaded decoding. However, the ffdec_h264
plug-in does not allow the method to be selected explicitly,
and the automatic selection mechanism tends to select
frame-level multi-threading systematically. To fix this
issue, the plug-in was forced to use the slice-level method
by setting active_thread_type = FF_THREAD_SLICE in the
pthread.c source file.</p>
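      <p>The frame-rate metric itself is simple to compute. The sketch below, with illustrative input values and hypothetical function names, derives the decoding speed in fps from a frame count and a decoding time, and checks it against the 24 fps display rate of the test sequence.</p>
      <preformat>
```python
DISPLAY_RATE_FPS = 24.0  # frame rate of the encoded test sequence


def decoding_speed(frames_decoded, decoding_time_s):
    """Average decoding speed in frames per second."""
    return frames_decoded / decoding_time_s


def is_real_time(fps, display_rate=DISPLAY_RATE_FPS):
    """A configuration is real-time capable if it decodes at least as fast as it displays."""
    return fps >= display_rate


# Illustrative values only: 480 frames decoded in 5 seconds.
fps = decoding_speed(480, 5.0)
margin = fps / DISPLAY_RATE_FPS
print(f"{fps:.1f} fps, {margin:.2f} times the display rate, real time: {is_real_time(fps)}")
```
</preformat>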
    </sec>
    <sec id="sec-9">
      <title>Energy consumption measurement</title>
      <p>
        The SABRE board used has two power domains which
can be measured separately. The ARM power domain
includes the 4 ARM cores plus the cache memory, and the SoC
power domain includes the VPU, 2D GPU, 3D GPU and the
OpenVG [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A 0.02 Ω shunt resistor, R<sub>shunt</sub>, was inserted
in each power domain (see Figure-3).
      </p>
      <p>[Figure 5: Processor usage (percentage, %) for 1 core, 2 cores, 4 cores and VPU decoding.]</p>
    </sec>
    <sec id="sec-10">
      <title>EXPERIMENTAL RESULTS</title>
    </sec>
    <sec id="sec-11">
      <title>Performances</title>
      <p>
        The power consumption is then measured using the
Open-PEOPLE framework [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a multi-user and multi-target power
and energy optimization platform and estimator. It includes
the NI-PXI-4472 digitizer, which allows a sampling rate of up to
100 kHz. At a given time, the power consumption is:
      </p>
      <disp-formula><tex-math>P = \frac{V_c \cdot V_{shunt}}{R_{shunt}}</tex-math></disp-formula>
      <p>The energy consumption is obtained by
summing the elementary power values, sampled at a
1 kHz rate, each multiplied by the sampling period.</p>
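      <p>The computation described above (instantaneous power from the shunt voltage drop, then a sum over samples) can be sketched as follows. The function and variable names are assumptions made for this example, and the sample values are synthetic; only the 0.02 Ω shunt value and the 1 kHz sampling rate come from the text.</p>
      <preformat>
```python
R_SHUNT_OHM = 0.02         # shunt resistor value given in the text
SAMPLING_RATE_HZ = 1000.0  # 1 kHz sampling rate given in the text


def instantaneous_power(v_supply, v_shunt, r_shunt=R_SHUNT_OHM):
    """P = Vc * Vshunt / Rshunt: supply voltage times the current through the shunt."""
    return v_supply * v_shunt / r_shunt


def energy_joules(v_supply_samples, v_shunt_samples,
                  rate_hz=SAMPLING_RATE_HZ, r_shunt=R_SHUNT_OHM):
    """Sum the elementary power samples, each multiplied by the sampling period."""
    dt = 1.0 / rate_hz
    return sum(instantaneous_power(vc, vs, r_shunt) * dt
               for vc, vs in zip(v_supply_samples, v_shunt_samples))


# Synthetic trace: 1 second at a constant 1.0 V supply and 10 mV shunt drop,
# that is 0.5 A and 0.5 W, hence about 0.5 J.
n = 1000
energy = energy_joules([1.0] * n, [0.01] * n)
print(f"{energy:.3f} J")
```
</preformat>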
      <p>In case of multi-core ARM video decoding, only the ARM
power domain consumption is measured. On the other hand,
the sum of the ARM power domain and the SoC power are
measured in case of VPU decoding since both the ARM cores
the the VPU are involved in the decoding process.</p>
      <p>Table-2 shows the performance results of the video
decoding. One can observe that, in the case of multi-core decoding,
the decoding speed is higher than the display rate when using
2 or 4 cores at a clock frequency of 800 MHz or above. In
the case of VPU video decoding, the decoding speed is 3.75 times
higher than the display rate regardless of the ARM core
frequency<sup>2</sup>. This is illustrated in Figure-4, where the flat red
surface represents the display rate (24 fps).</p>
      <p>The values in parentheses in Table-2 represent
the performance scaling factor compared to single-core
video decoding. One can observe that using four ARM cores
yields only a ×2.4 performance increase. This is mainly due to
the unbalanced workload. In fact, the video encoder divides
each frame into equal-size slices. However, the decoding
workload depends on the scene complexity of each slice. Thus, a
decoding thread assigned to a given slice may terminate
before the other ones. During this time, it goes into a blocked
state, waiting for the other threads to terminate.</p>
      <p><sup>2</sup> The frequency of the VPU (264 MHz) remains
constant when the frequency of the ARM cores varies.</p>
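      <p>The effect of an unbalanced workload can be illustrated with a simplified model in which each slice of a frame is decoded on its own core and the frame completes when the slowest slice completes. The per-slice times below are hypothetical and were chosen for illustration; synchronization and memory contention are ignored, so this is a sketch of the reasoning rather than a model of the measured platform.</p>
      <preformat>
```python
def parallel_speedup(slice_times):
    """Speedup over sequential decoding when each slice runs on its own core.

    Sequential time is the sum of the slice times; parallel time is the
    time of the slowest slice.
    """
    return sum(slice_times) / max(slice_times)


def idle_fraction(slice_times):
    """Fraction of the total core-time spent waiting for the slowest slice."""
    cores = len(slice_times)
    return 1.0 - sum(slice_times) / (cores * max(slice_times))


# Hypothetical per-slice decoding times for one frame, in milliseconds.
balanced = [10.0, 10.0, 10.0, 10.0]
unbalanced = [4.0, 6.0, 10.0, 4.0]

print(parallel_speedup(balanced), idle_fraction(balanced))
print(parallel_speedup(unbalanced), idle_fraction(unbalanced))
```
</preformat>
      <p>With balanced slices the model gives the ideal factor of 4 on four cores. With the unbalanced example it drops to 2.4, and 40% of the available core-time is spent waiting.</p>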
      <p>On the other hand, the scaling factor is much higher
(from ×5 to ×12) in the case of VPU decoding. This is due to
the MB-level parallelism implemented in the VPU.</p>
      <p>The measured processor usage<sup>3</sup>, illustrated in Figure-5,
confirms these observations. In fact, in the case of single-core
video decoding (one thread), the processor usage is 100%,
which means that the decoding thread is active
all the time. However, it is around 160% and 260% in the case
of dual-core and quad-core decoding respectively. On the
other hand, when using the VPU, the processor usage is
about 15% because the ARM cores are idle almost all the time,
waiting for the frame to be decoded by the VPU.</p>
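      <p>The processor usage metric used here is the sum of the active times of the decoding threads divided by the decoding time, expressed as a percentage. A minimal sketch, with hypothetical thread times and a function name introduced for this example:</p>
      <preformat>
```python
def processor_usage(active_times_s, execution_time_s):
    """Processor usage in percent: sum of per-thread active times over the decoding time."""
    return 100.0 * sum(active_times_s) / execution_time_s


# Hypothetical example: four threads, each active 6.5 s of a 10 s decoding run.
usage = processor_usage([6.5, 6.5, 6.5, 6.5], 10.0)
print(f"{usage:.0f}%")
```
</preformat>
      <p>A value of 400% would mean that all four cores were busy for the whole run; lower values indicate time spent blocked or idle.</p>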
    </sec>
    <sec id="sec-12">
      <title>Energy consumption</title>
      <p>Table-3 shows the energy consumption of video decoding
using the ARM cores and the VPU. The values in
parentheses in Table-3 represent the energy reduction
factor compared to single-core decoding. As expected, for
a given clock frequency, increasing the number of cores
reduces the energy consumption (see Figure-6). For
example, compared to single-core decoding, the optimal
multi-core configuration (4 cores, 800 MHz) reduces the
energy by a factor of ×0.74 while increasing the performance
by a factor of ×2.43.</p>
      <p>On the other hand, the energy saving is much larger
in the case of VPU video decoding: a ×0.23 scaling factor
compared to single-core decoding at 800 MHz, and ×0.36
compared to the optimal multi-core video decoding (4
cores, 800 MHz). This can be explained by both the high
decoding performance and the very low power consumption
of the VPU. As illustrated in Figure-7-a, one can observe
that the decoding of the 480 video frames completes
in about 5 seconds. During this decoding phase, the power
consumption of the SoC power domain increases by only
0.2 W, which corresponds to the VPU power consumption.
This low value can be explained by the low frequency (264
MHz) of the VPU. During this time, the power consumption of the ARM cores
is negligible. In fact, as illustrated in
Figure-7-b, which shows the frame-by-frame variation of the power
consumption, the ARM cores are idle almost all the time, waiting
for the VPU to decode a video frame. In the idle state, the
ARM cores execute the WFI (Wait For Interrupt)
instruction, where almost all the processor clocks are gated to reduce
the power consumption.</p>
      <p><sup>3</sup> Processor usage = (Σ<sub>i</sub> T<sub>i</sub>) / T<sub>exe</sub>, where T<sub>i</sub> is the time during which
the i-th thread held a processor core (active time) and T<sub>exe</sub> is the
decoding time.</p>
      <p>Unlike VPU decoding, multi-core video decoding
cannot reconcile performance and energy efficiency. As
illustrated in Figure-8, at a 400 MHz frequency (see
Figure-8-a), the power consumption is low (≈ 0.3 mW), but the
decoding time is very long. On the other hand, at the higher
frequencies, the decoding time is reduced but the power
consumption increases considerably (see Figure-8-b and c).</p>
      <p>
        One can highlight that the unbalanced workload across the
processor cores may be a source of energy inefficiency. In fact,
while a thread is waiting, its processor core continues
to consume energy while doing nothing. One approach to fix
this issue is to set the clock frequency of each core
according to the slice decoding workload, or to switch a processor
core to a low-power mode during its inactivity using Dynamic
Power Management (DPM), as proposed in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. However,
this is not possible with the i.MX6 SoC used here, since it
does not support per-core DVFS/DPM.
      </p>
    </sec>
    <sec id="sec-13">
      <title>CONCLUSION</title>
      <p>This paper is a use case study based on the i.MX6 SoC.
The experimental results showed that multi-core video
decoding enhances both the performance and the
energy efficiency of HD video decoding compared to
single-core decoding. However, the hardware video accelerator
is three times more energy efficient than the optimal
multi-core video decoding.</p>
      <p>Although these results may be different on other
architectures, the obtained data give a general idea of
the energy consumption levels of HD video decoding on a
recent heterogeneous SoC.</p>
      <p>
        Given the rapid evolution of SoCs, which tend
to integrate more and more cores [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], one may expect that
the energy efficiency of multi-core video decoding can be
enhanced if a larger number of cores is used [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Moreover,
as pointed out in the discussion of the results, the energy efficiency
of multi-core video decoding may also be enhanced if it is
combined with per-core DVFS/DPM strategies. The
objective is to avoid wasting energy on idling cores.
We plan to investigate these issues in future work using
the Exynos5 SoC, which contains 8 cores supporting per-core
DVFS/DPM.
      </p>
    </sec>
    <sec id="sec-14">
      <title>Acknowledgment</title>
      <p>This work was supported by BPI France, Region
Ile-de-France, Region Bretagne and Rennes Metropole through the
French Project GreenVideo.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>[1] i.MX 6Dual/6Quad Power Consumption Measurement, Freescale Semiconductor</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Alvarez Mesa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramírez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Azevedo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Meenderinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Juurlink</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Valero</surname>
          </string-name>
          .
          <article-title>Scalability of macroblock-level parallelism for h. 264 decoding</article-title>
          .
          <source>In Parallel and Distributed Systems (ICPADS)</source>
          ,
          <year>2009</year>
          15th International Conference on, pages
          <fpage>236</fpage>
          –
          <lpage>243</lpage>
          . IEEE,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>[3] ARM. big.little processing</article-title>
          . http://www.arm.com/products/ processors/technologies/biglittleprocessing.php,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Baaklini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rethinagiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sbeity</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Niar</surname>
          </string-name>
          .
          <article-title>Scalable row-based parallel h.264 decoder on embedded multicore processors</article-title>
          .
          <source>Signal, Image and Video Processing</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>15</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Benmoussa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Boukhobza</surname>
          </string-name>
          , E. Senn, and
          <string-name>
            <given-names>D.</given-names>
            <surname>Benazzouz</surname>
          </string-name>
          .
          <article-title>Energy consumption modeling of h.264/avc video decoding for gpp and dsp</article-title>
          .
          <source>in Proceedings of 16th Euromicro Conference on Digital System Design</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Benmoussa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Boukhobza</surname>
          </string-name>
          , E. Senn, and
          <string-name>
            <given-names>D.</given-names>
            <surname>Benazzouz</surname>
          </string-name>
          .
          <article-title>GPP vs DSP: A performance/energy characterization and evaluation of video decoding</article-title>
          .
          <source>in Proceedings of the IEEE 21st International Symposium On Modeling, Analysis And Simulation Of Computer And Telecommunication Systems</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Benmoussa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Senn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Boukhobza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lanoe</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Benazzouz</surname>
          </string-name>
          .
          <article-title>Open-PEOPLE, a collaborative platform for remote &amp; accurate measurement and evaluation of embedded systems power consumption</article-title>
          .
          <source>in Proceedings of the IEEE 22nd International Symposium On Modeling, Analysis And Simulation Of Computer And Telecommunication Systems</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Borkar</surname>
          </string-name>
          .
          <article-title>Thousand core chips: A technology perspective</article-title>
          .
          <source>In Proceedings of the 44th Annual Design Automation Conference</source>
          ,
          <source>DAC '07</source>
          , pages
          <fpage>746</fpage>
          –
          <lpage>749</lpage>
          . ACM,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chandrakasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sheng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Brodersen</surname>
          </string-name>
          .
          <article-title>Low-power CMOS digital design</article-title>
          .
          <source>IEEE Journal of Solid-State Circuits</source>
          ,
          <volume>27</volume>
          (
          <issue>4</issue>
          ):
          <fpage>473</fpage>
          –
          <lpage>484</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
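          <string-name>
            <given-names>T.-C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-Y.</given-names>
            <surname>Chien</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-W.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Tsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-W.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.-G.</given-names>
            <surname>Chen</surname>
          </string-name>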
          .
          <article-title>Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder</article-title>
          .
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          , pages
          <fpage>673</fpage>
          –
          <lpage>688</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>C. M. Don Darling</string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Singh</surname>
          </string-name>
          .
          <article-title>GStreamer on Texas Instruments OMAP35x processors</article-title>
          .
          <source>Proceedings of the Ottawa Linux Symposium</source>
          , pages
          <fpage>69</fpage>
          –
          <lpage>78</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Hameed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Qadeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wachs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Azizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Solomatnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Richardson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kozyrakis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Horowitz</surname>
          </string-name>
          .
          <article-title>Understanding sources of inefficiency in general-purpose chips</article-title>
          .
          <source>SIGARCH Comput. Archit. News</source>
          ,
          <volume>38</volume>
          (
          <issue>3</issue>
          ):
          <fpage>37</fpage>
          –
          <lpage>47</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kilicarslan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. G.</given-names>
            <surname>Gurler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ozkasap</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Tekalp</surname>
          </string-name>
          .
          <article-title>Energy efficient video decoding on multi-core devices</article-title>
          .
          <source>In Proceedings of the 2nd International Conference on Energy-Efficient Computing and Networking</source>
          , pages
          <fpage>63</fpage>
          –
          <lpage>66</lpage>
          . ACM,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Austin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Blaauw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mudge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Flautner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Irwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kandemir</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          .
          <article-title>Leakage current: Moore's law meets static power</article-title>
          .
          <source>Computer</source>
          ,
          <volume>36</volume>
          (
          <issue>12</issue>
          ):
          <fpage>68</fpage>
          –
          <lpage>75</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Meenderinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Azevedo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alvarez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Juurlink</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramirez</surname>
          </string-name>
          .
          <article-title>Parallel scalability of H.264</article-title>
          .
          <source>In Proceedings of the First Workshop on Programmability Issues for Multi-Core Computers</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Smit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Kokkeler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. T.</given-names>
            <surname>Wolkotte</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. D.</given-names>
            <surname>van de Burgwal</surname>
          </string-name>
          .
          <article-title>Multi-core architectures and streaming applications</article-title>
          .
          <source>In Proceedings of the 2008 international workshop on System level interconnect prediction</source>
          ,
          <source>SLIP '08</source>
          , pages
          <fpage>35</fpage>
          –
          <lpage>42</lpage>
          . ACM,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Wei</surname>
          </string-name>
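          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-W.</given-names>
            <surname>Kuo</surname>
          </string-name>
          ,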
          <string-name>
            <given-names>S.-H.</given-names>
            <surname>Hung</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Chu</surname>
          </string-name>
          .
          <article-title>Energy-efficient real-time scheduling of multimedia tasks on multi-core processors</article-title>
          .
          <source>In Proceedings of the 2010 ACM Symposium on Applied Computing, SAC '10</source>
          , pages
          <fpage>258</fpage>
          –
          <lpage>262</lpage>
          . ACM,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-M.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-I.</given-names>
            <surname>Guo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.-S.</given-names>
            <surname>Choy</surname>
          </string-name>
          .
          <article-title>Methods for power/throughput/area optimization of H.264/AVC decoding</article-title>
          .
          <source>Journal of Signal Processing Systems</source>
          ,
          <volume>60</volume>
          (
          <issue>1</issue>
          ):
          <fpage>131</fpage>
          –
          <lpage>145</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Zeng</surname>
          </string-name>
          .
          <article-title>H.264 video parallel decoder on a 24-core processor</article-title>
          .
          <source>In ASIC (ASICON), 2013 IEEE 10th International Conference on</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>4</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>