<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Configurable data structure layout for memory hierarchies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <email>kmate@elte.hu</email>
          <email>matej@elte.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ELTE Eötvös Loránd University, Faculty of Informatics</institution>,
          <addr-line>Budapest</addr-line>,
          <country country="HU">Hungary</country>
        </aff>
      </contrib-group>
      <fpage>119</fpage>
      <lpage>135</lpage>
      <abstract>
        <p>Developments of the last decades resulted in an increasing gap between processor and memory speeds. Therefore in high-performance computations the main bottleneck is memory access. A practical solution to this problem is to implement memory hierarchies utilizing multiple levels of memories with different capacities and access profiles. For embedded and low-power systems it is more beneficial to apply programmable scratchpad memories instead of hardware-controlled caches. The overall performance of scratchpad-aware software heavily depends on how the data is distributed and accessed in different memory layers. Choosing the data layout for optimal performance is a set of non-trivial design decisions. This paper introduces configurations over data structures to enable deferring these decisions, by making it easier to change the data layout of existing programs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Copyright © by the paper's authors. Copying permitted for private and academic purposes.</p>
      <p>Scratchpad-aware systems are usually programmed using low level languages. In these languages, the
distribution of data in different memory layers is reflected by the data structure definitions and the operations on those
structures [Nie97]. Changes in data layout will be reflected in the source code as well, often causing significant
non-trivial modifications. To make these modifications easier, we define configurations over data structures.
Configurations enable fine-grained description of the data layout of an application. They were first introduced as
built-in language constructs in Miller [Nem13], a domain-specific language for multi-core router programming.
By reusing the key ideas behind these constructs, it is also possible to synthesize data structure definitions and
accessing code through implementations consisting of C macros and C++ templates. Configurations could also
be used to control and parametrize source-to-source transformations.</p>
      <p>The organization of this paper is the following. The next subsection presents related work. Section 2 introduces
and describes configurations. Different implementations of configurations are detailed in Section 3. This is
followed by a comparison of the implementation techniques in Section 4. Finally, Section 5 concludes the paper.</p>
      <sec id="sec-1-1">
        <title>Related work</title>
        <p>Miller is a domain-specific language focused on the development of performance-critical applications for
scratchpad-aware architectures. As it was presented earlier by other members of our research group in [Nem13],
it supports stackless programming using execution units called bubbles instead of regular functions. Therefore
the control flow is based on continuation passing instead of call stacks. The other specialty of this language is
the built-in support for configurable data layout, which is presented in this paper. Miller was implemented in
Haskell, first as an embedded language. Later a custom C-like syntax was developed with a standalone compiler.
The compiler supports a general MIPS32 architecture, and one of its proprietary derivatives. The generated
output is optimized assembly text. The applied optimizations include peephole and prefetch optimizations. The
implementation of prefetch optimization is based on memory access profile descriptions.</p>
        <p>There are data structure transformations that increase performance both on hardware-cached and
scratchpad-aware architectures. In [Pra14], several such transformations are presented over the Low Level Virtual Machine
(LLVM) framework. Structure peeling and splitting is intended to improve performance by the separation of
"hot" and "cold" fields of structures, that is, the frequently and infrequently used parts of the data. These
transformations could also require structure field reordering. Other presented techniques for improving cache
locality are struct-array copy, instance interleaving and array remapping. The execution of these transformations is
planned to be automatic, based on profiling data and static code analysis. However, currently no implementation
is available for these techniques in the public LLVM code base.</p>
        <p>Data layout transformations are very common to support vectorization. [Sin16] shows a data layout inference
technique that identifies layout transformations that convert data into a format favorable for SIMD instructions.
The presented inference algorithm is based on a type system for layout transformations, and enables the
autovectorization of programs.</p>
        <p>Scout [Krz11] also supports automatic vectorization. It is a source-to-source transformation tool based on
the Clang compiler infrastructure. Scout supports several different loop transformations to achieve optimal
conditions for applying SIMD instructions. It supports several widely used SIMD instruction sets like SSE or
AVX, but it is also extensible.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Configurations</title>
      <p>The memory hierarchy of the target architecture must be taken into account to achieve optimal performance on
scratchpad-aware systems. The data layout of an application could change for several reasons, for example in
order to always store the most frequently accessed data in the fastest memory. When a new system is under
development, usually not all factors affecting performance are known in advance. Fine-tuning of the application
requires consecutive measurements and changes in the memory mapping of data. Requirement changes could
also have the same effect. A more drastic reason is a change in the underlying memory hierarchy, for instance
when an existing application is ported to a different target architecture. In this case, new capacity and speed
relationships must be considered for performance.</p>
      <p>These changes could produce two different kinds of transformations in the data layout. When the space
available for a given type of objects is decreased, either because the capacity of the memory is decreased, or
the number of objects is increased, it could be necessary to place the less frequently used data members into a
different memory layer. The original data fields will be transformed into pointers. The indirection introduced
here will increase the access time of these members, but the more frequently accessed data parts remain in the
faster memories. This transformation is called data outsourcing.</p>
      <p>In the opposite case, when the amount of free space available in faster memories increases, indirections
pointing to data in different layers could be eliminated. This enables the placement of less frequently used
data members into faster memories, thus increasing the average memory access speed. Complementary to the
previous technique, this transformation is called data inlining.</p>
      <p>The role of configurations is to describe the data layout of types and objects across the memory hierarchy.
Therefore changes in configurations enable the description of the previous transformations.</p>
      <sec id="sec-2-1">
        <title>Architecture-independent memory model</title>
        <p>As there are many scratchpad-aware architectures with very different memory hierarchies, we decided to describe
data layout configurations in an architecture-independent way over a simplified memory model. In practice, the
architecture-specific information could be added to the model easily when implementing configurations with a
given technique, as will be detailed in various subsections of Section 3.</p>
        <p>By forgetting the architecture-dependent memory information, the mapping of data layouts to concrete
memory layers is lost. The only significant property is whether the value of a data member is directly available as an
integral part of the structure, or only indirectly through a pointer. The latter case makes it possible to store a
part of an object in a different memory than the object itself.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Description of configurations</title>
        <p>As most of the scratchpad-aware systems are programmed in low level languages, especially in some dialect of C,
we decided to derive configurations from the type constructions of the C language. Enumeration types, function
pointer types and type aliases have nothing to do with data layout in general, thus the interesting ones are
pointer, array and record (struct and union) constructions. In the following, let T be the set of all types, and
TB ⊆ T the set of configuration base types, that is, all scalar and record types. TB does not contain any
pointer or array types.</p>
        <p>According to the transformations described in the beginning of the section, only data outsourcing could be
applied on a given scalar element of type τ ∈ TB. Using the C language notation for pointer types, this
transformation could be described as τ ↦ τ* between the types. Let the source of this transformation, that is,
the primitive configuration corresponding to a directly contained scalar data member, be denoted by scalar ∈ C,
where C is the set of all configurations. Furthermore, denote the previous transformation that introduces an
indirection level with a pointer type constructor as ptr : C → C. These two constructors generate the following
configurations:
scalar, ptr(scalar), ptr(ptr(scalar)), …, ptr(…(ptr(scalar))…)</p>
        <p>Let us introduce the type generator function G : C × TB → T, that generates a configured data type from a
configuration and a base type. Using a base type τ, the previous configurations generate the following
types:</p>
        <p>G(scalar, τ) = τ</p>
        <p>G(ptr(scalar), τ) = τ*
G(ptr(…(ptr(scalar))…), τ) = τ*…*</p>
        <p>Layout configuration of arrays requires the configuration of two properties. First, the skeleton of the array
needs to be accessed to index a specific element, and second, the access of the selected element itself needs
to be configured. These are both possible by introducing a third configuration constructor, array : C → C. The
parameter of this constructor specifies how the elements are configured at each index, relative to the skeleton.
The configuration of the array skeleton could be specified by applying more configuration operations around this
constructor. Figure 1 shows the application of this constructor in various cases, using the [ ] array type constructor
on application of array in G, with the notation of the C language.</p>
        <p>All fields of record types could be configured in a similar way as independent objects, so there is no need to
define a separate configuration constructor for this case. However, we can define a compound configuration for
a record. Let crec = ⟨f1 = c1, …, fn = cn⟩ denote a compound configuration for a record rec with fields
f1, …, fn, and c1, …, cn ∈ C. As configuration operations will not be defined over compound configurations,
crec ∉ C.</p>
        <p>A configuration is valid for an object when it contains as many array constructors applied as the
dimensionality of the object requires. For scalar values it is zero, for one-dimensional arrays it is one, and so on. The
innermost constructor of a valid configuration is always scalar. Denote the dimension or array rank of a valid
configuration with D : C → N, that gives the number of array constructors applied in it. It is important to note
that TB does not contain array types. It implies that all array dimensions must be visible in configurations.</p>
        <p>The minimal configuration of an object is a valid configuration that contains no ptr constructors. It is easy to
see that data inlining could not be applied on a minimal configuration, as it would involve the removal of a ptr
constructor. It means that when an object has a minimal configuration, it is only possible to store the whole object
in the same memory layer.</p>
        <p>Two configurations are compatible when they only differ in the application of ptr constructors. This
requirement is fulfilled for configurations c1 and c2 exactly when D(c1) = D(c2). Compatible configurations could be
transformed freely into each other by adding or removing ptr constructors – therefore by applying data
outsourcing or inlining. It is obvious that valid(c1, o) ∧ compatible(c1, c2) ⟹ valid(c2, o), where c1, c2 ∈ C and o is an
arbitrary configurable object.</p>
        <p>Define the extension function Ec : C → C corresponding to a configuration c ∈ C by replacing the innermost
scalar constructor of c with a variable. For example, let c = ptr(array(ptr(scalar))). Then replacing the
innermost scalar with a new variable x in c gives the extension function Ec(x) = ptr(array(ptr(x))).</p>
        <p>Consider configurations c1 and c2 as structurally equivalent, also denoted as c1 ≡ c2, when c1 and c2 were
created using the same constructors in the same order. The length of a given configuration is L : C → N, that
gives the number of constructor functions used to create it. Of course, L(c1) = L(c2) is implied by c1 ≡ c2.
L(c) could also be denoted as Lc.</p>
        <p>Note that all of the configuration constructors are extensible with arbitrary information. For example,
the scalar and ptr constructors could be extended with references to memory layers. This enables the injection
of architecture-specific information into a configuration.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Operations over configured objects</title>
        <p>As the previous subsection presented, function G can be used to generate a configured data type τc = G(c, τ) ∈ T,
where c ∈ C and τ ∈ TB. To access data in objects defined with type τc, we need operations that could follow
the changes of their data layout configuration.</p>
        <p>First, an operation is needed to declare an object with type τc. Let declare : C × TB × Id → Decl, where Id
is the set of valid identifiers, and Decl denotes the set of all valid declarations of a given programming language.
It is easy to see that declare can be built trivially when an implementation of G is available.</p>
        <p>Since it is allowed to modify a configuration by adding or removing ptr constructors at any time, the memory
management of configured objects could change drastically. As a consequence, explicit memory allocation
and deallocation operations are needed at the beginning and at the end of the lifetime of configurable objects.
For instance, take the following declaration in the C language: declare(int, scalar, "x") ⇒ int x. By default,
x will have automatic storage, which means its lifetime will be bound to the surrounding block. There is
no need for explicit allocation or deallocation of x in this case. However, if we apply data outsourcing to the
configuration, which is completely valid in this case, it gives declare(int, ptr(scalar), "x") ⇒ int* x. To store
any value at the memory location pointed to by x, we need to implement explicit memory management. It implies
that explicit allocate : C × Id → Stmts and deallocate : C × Id → Stmts operations must be used every time
to initialize and finalize storage for configured objects, where Stmts is a list of statements in a given
programming language. Following the previous example, allocate(scalar, "x") ⇒ { } and deallocate(scalar, "x") ⇒ { },
where { } is an empty list of statements, but allocate(ptr(scalar), "x") ⇒ { x = malloc(sizeof(*x)); } and
deallocate(ptr(scalar), "x") ⇒ { free(x); }. These are the most complex operators, as they must be able to derive
the allocation and deallocation statements for any objects with valid configurations, including multi-dimensional
arrays with several layers of indirection.</p>
        <p>Accessing the value of a configured object could also change with the configuration. It is evident that all
pointers must be dereferenced to access a value. Fortunately, configurations contain all the information needed
to achieve this. Let valueOf : C × Expr → Expr be an operation over a configuration and an object that
generates an expression in Expr, the set of all expressions in a given language, to access a configured object. To
continue our previous example, valueOf can be implemented for the C language to give valueOf(scalar, x) ⇒ x,
and valueOf(ptr(scalar), x) ⇒ *x.</p>
        <p>Similarly, when the base type of an object τ ∈ TB is a record type, and a field of this record object is accessed,
the object must be dereferenced first. Assume we have a projection function π : Expr × Id → Expr that selects
a given member from an object. Let memberOf : C × Expr × Id → Expr, and memberOf(c, o, m) ::=
π(valueOf(c, o), m) ∈ Expr, where c ∈ C, o ∈ Expr and m ∈ Id. As an example, when we have an
object x of a record type that has a field named y, then the following could be implemented for the C
language: memberOf(scalar, x, "y") ⇒ x.y, and memberOf(ptr(scalar), x, "y") ⇒ x-&gt;y, which is equivalent
with π(valueOf(ptr(scalar), x), "y") ⇒ (*x).y, given that π(o, m) ⇒ (o).m. Note that when field y is
also of a configured data type, then it must be accessed with the appropriate operation, for instance using
valueOf(cy, memberOf(cx, x, "y")), where cx, cy ∈ C and x ∈ Expr with a record base type having a field
named y.</p>
        <p>The last operation needed for configured objects is array indexing. An array configuration has a dual role:
it describes not only how to access the elements, but also the array skeleton itself. These could be
done in the following steps. First, the configuration needs to be split into two parts. Let split : C → C × C
be a function that returns two configurations, split(carr) = (cskel, citem), when D(carr) ≥ 1 is satisfied.
Require that D(cskel) = 0 and D(citem) = D(carr) − 1. The first requirement implies that cskel contains only a
scalar constructor and an arbitrary number of ptr constructors. Take the extension function Ecskel : C → C
of cskel by replacing the innermost scalar constructor with a variable. Then the following structural
equivalence must be satisfied between the previous configurations: carr ≡ Ecskel(array(citem)). The first part
of the configuration, cskel, is needed to dereference the array object itself before indexing. It could be done
easily with the previous valueOf operation. After selecting the required element using an index function
ι : Expr × N → Expr, configuration citem must be considered similarly to dereference the element data.
Therefore operation valueAt : C × Expr × N → Expr could be defined as follows: valueAt(carr, arr, idx) ::=
valueOf(citem, ι(valueOf(cskel, arr), idx)) ∈ Expr, where split(carr) = (cskel, citem). Configuration citem
could also be referred to as the item configuration of carr. For example, assume we have an array x with
configuration cx = ptr(array(ptr(scalar))). Then the value of ι(x, 3) ⇒ x[3] in the C language could be
accessed as valueAt(cx, x, 3) ≡ valueOf(cx elem, ι(valueOf(cx skel, x), 3)) ⇒ *((*x)[3]) in C. Splitting cx gives
cx skel ≡ ptr(scalar) with extension function Ecx skel(c) = ptr(c) and also cx elem ≡ ptr(scalar).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Implementation techniques</title>
      <p>This section details techniques that we used to implement configurations for data type layout. Four solutions
will be evaluated: the integration into a domain-specific language; implementation with C preprocessor macros
and C++ template metaprogramming; and finally source-to-source transformations. The following subsections
describe these implementations independently, while a comparison could be found in Section 4. As most of their
implementation details are out of the scope of this paper, the complete source code for the C and C++ libraries
could be found at https://github.com/kmate/configurable-data-layout.</p>
      <p>First we developed and implemented data layout configurations in a domain-specific language named Miller.
Miller was designed for efficient multi-core router programming over complex multi-layer, scratchpad-aware
memory hierarchies. The language has a C-like syntax and type system. It supports various fixed-width integer
scalars, enumeration types, type aliases, bit sets, records, and arrays. Pointers are treated specially in the
language: they are split into two separate types. Sharing of data is expressed with references. They are very
similar to ordinary C pointers with two exceptions: it is not possible to do pointer arithmetic on references, and
they cannot be used for expressing optionality. For the second use case, there is a different optional type in the
language, that can be used in a special case construction to deal with the absence of a value. Reference types
are introduced with a ref keyword, while option types are marked with a question mark. Arrays and records
have similar syntax as in C. For instance, in the following listing bucket is a reference to an array of 32 optional
values with type elem. It is able to store a single line out of the 256 possible such entries of table.
struct elem
{
  int32 key;
  int32 value;
}
elem in;
elem?[32][256] table;
ref elem?[32] bucket;</p>
      <p>Miller has a language construction to provide information about the memory hierarchy. The memory keyword
introduces a new layer with the given name and size. It is also needed to provide an access profile, that can be
used by the prefetch optimizer of the compiler. In the following example it is omitted, as this information is
irrelevant for configurations. The listing below defines two memory layers named dram and spad, with 256 and
8 megabytes of total size.</p>
      <sec id="sec-3-1">
        <title/>
        <p>memory dram (256 MB) { /* access profile omitted */ }
memory spad (8 MB) { /* access profile omitted */ }</p>
        <p>Once the memory types are given, it is possible to define locations where objects can be stored. There are
two different kinds of location definitions: registers and locations in one of the memory layers.
register r1 (4 B);
register r2 (4 B);
// ...
location table_loc (256 KB) on spad;
location in_loc (8 B) on dram;</p>
        <p>After memory types and locations are given, configurations can be provided for each object independently
from the algorithms working with them. In fact, there are two types of configurations in Miller. There are type
configurations to determine how objects of a given type should be stored in the hierarchy, and there are object
configurations for specifying the location and type configuration of individual objects.
typeconfig elem_conf {
  key;
  value -&gt; dram;
};</p>
        <p>elem_conf shows a possible configuration for record type elem. Field key has an empty configuration. It
means that all data referred to by key will be stored in the same memory as the containing object, so it represents
a scalar configuration constructor. This is the default behavior, so this entry is optional. The value field has
a type configuration built with the operator -&gt;. It introduces an indirection into the configuration: the containing
object will only have a pointer to the value stored in the dram memory. Hence the configuration of value is
ptr(scalar). Note that for algorithms working with the value field this indirection is invisible. This guarantees
that the configuration could easily be changed independently from other parts of the code.</p>
        <p>An object configuration consists of a location and a type configuration. In the following listing some of the
irrelevant syntactical parts are omitted – the ones that are only required to unambiguously map the names in
configurations to the object declarations.
in : in_loc;
table : table_loc elem_conf [][] ?&gt; dram;
bucket : r1 :&gt; spad elem_conf [] ?&gt; dram;</p>
        <p>The simplest object configuration only specifies a location for the given object. The configuration for in could
be described with a single scalar constructor, along with the meta-information about the location. As table
refers to a two-dimensional array, its configuration must be constructed using two array constructors. Because
the elements of the array are optional, and this is implemented with pointers under the hood, it is possible to
store the elements of this two-dimensional array in another layer. The ?&gt; operator configures an optional type
to store the values in the memory given on its right hand side. The configuration of table can be described as
array(array(ptr(scalar))). The outer array constructors have associated meta-information about the location
table_loc, and the ptr constructor refers to dram. Similarly to ?&gt;, operator :&gt; can be used to configure a
pointer, but under a reference type instead of an optional. It means that bucket will be stored as a pointer in
register r1 to an array of pointers on spad. The pointers in the array reference values stored on the dram.
Both of the last two configurations refer to elem_conf, to specify how the structure fields are stored.</p>
        <p>Operations defined for configurations in Section 2.3 are built into the Miller compiler. As configurations
can also be stored separately, it is possible to have multiple different configurations for the same program. The
configurations are read from source files, and connected implicitly and unambiguously with the types and objects
found in the program by their qualified names. The user does not have to explicitly use different operations
on objects with configured data types, as all objects are configurable in the language. Of course, there is a
validation phase in the compiler that checks the configurations of each type and object before compilation. Type
configurations are allowed to be under-specified: when no configuration is provided, a default configuration takes
place that stores everything in the same memory as its container object. The details of the target architecture
were adjustable through memory and register configuration to a given extent, but still some information was
built into the compiler, especially about how the data access code should be generated for particular layers.
Even though Miller implemented configurations perfectly – as it was explicitly designed for them – it is always
hard to introduce a new domain-specific language instead of a well-known industrial standard. This led us
to experiment with the embedding of configurations into C, which is already widely used on scratchpad-aware
architectures.</p>
        <p>The most convenient option was to implement configurations and their operations as a preprocessor macro
library. Despite the high expressiveness of preprocessor macros, it is very hard to build and debug really complex
libraries. To ease the development, we have chosen an implementation strategy based on modeling the macros
with simple purely functional programs in Haskell [Kar14]. Finally, these definitions were translated back to the
preprocessor macro language, using the Boost Preprocessor Library [Bpl].</p>
        <p>In this system, the configurations are represented internally with a parenthesized sequence of tokens instead
of a chain of constructor applications. It means that a configuration of the form C1(C2(C3)), where each Ci is a
constructor of configurations, is expressed as (T1)(T2)(T3), where each Ti is the token corresponding to constructor
Ci. The exact definition of these tokens is not important, only the fact that each different constructor has a
different, unique token that makes it possible to pattern match on the value during processing. Although the
internal representation is not a chain of constructor applications, it is possible to define builder macros that
support chained application.
#define SCALAR (SCALAR)
#define PTR(tokens) (PTR) tokens
#define ARRAY(tokens) (ARRAY) tokens</p>
        <p>As macro expansions are not recursive, tokens on the right hand side will not be re-expanded
after substitution. For example, the configuration PTR(ARRAY(PTR(SCALAR))) will build the token sequence
(PTR)(ARRAY)(PTR)(SCALAR). This sequence can be effectively processed with a fold-like construction, up to a
predefined maximum length (currently 64, using the Boost Preprocessor Library). The library defines the
following macros corresponding to properties of configurations. Macro ARRAY_RANK calculates D(c), the number
of array dimensions for a given c ∈ C. ITEM_CONFIG returns the item configuration of a given configuration. It
is used to implement split. Utility HAS_PTR_ITEM gives a macro representation of a boolean value. When the
configuration passed to HAS_PTR_ITEM has a (PTR) item it returns true, and false otherwise. This utility is useful
when implementing memory allocation and deallocation operations, as it determines whether it is needed to
generate dynamic memory allocation code or not.</p>
        <p>The macro library is implemented with two different interfaces. The first version requires the configurations
to be passed as an argument for every operation. The second version uses a lookup mechanism to retrieve
configurations for objects by their name. This works as follows. The user has to declare their configurations as
macros according to a naming scheme. Then the GET_OBJECT_CONFIG utility macro could retrieve the appropriate
value for a given object identifier:
#define CONFIGURE__input PTR(SCALAR)
#define CONFIGURE__table ARRAY(PTR(ARRAY(SCALAR)))</p>
        <p>The above snippet defines configurations for two objects. They can be retrieved by expanding
GET_OBJECT_CONFIG(input) and GET_OBJECT_CONFIG(table). The naming scheme is also configurable in the
library. In the second version of the interface, this utility macro is called automatically by the operations, so
they only take an object identifier instead of a configuration. In the following, this interface will be presented.</p>
        <p>Each operation on objects of configured data types is implemented with a specific macro. Thus for the
declare operation there is a corresponding macro named DECLARE, for valueOf there is VALUE_OF, and so on. An
interesting difference, which comes from the characteristics of the C language, is that the declaration of arrays
needs a different treatment, because it must contain the length of each array dimension. This is implemented
with a DECLARE_ARRAY variadic macro, mainly because it is problematic to handle the zero-parameter case of
variadic arguments.
struct input_t { uint32_t a, b; };
DECLARE(struct input_t, input);
DECLARE_ARRAY(char, table, 8, 32);
// struct input_t* input;
// char ((*(table[8]))[32]);</p>
        <p>The comments after each macro invocation show their expansion results. Under normal circumstances,
the result of the preprocessing phase is not directly visible to the user. It is not formatted by the compiler, and
the macro library does not try to eliminate unnecessary parentheses. The configurations defined earlier are used
to assemble the final declarations. Of course, it is not enough to declare these objects before using them. As
the configurations could contain ptr constructors, explicit allocation and deallocation is needed at the beginning
and at the end of their lifetime. This could be achieved as follows.</p>
        <p>ALLOCATE(input); // do { input = malloc(sizeof(*(input))); } while (0);
ALLOCATE(table);
// do { for (int i0 = 0; i0 &lt; sizeof(table) / sizeof((table)[0]); ++i0) {
// (table)[i0] = malloc(sizeof(*((table)[i0])));
// } } while (0);
// ...</p>
        <p>DEALLOCATE(table);
// do { for (int i0 = 0; i0 &lt; sizeof(table) / sizeof((table)[0]); ++i0) {
// free((table)[i0]);
// } } while (0);
DEALLOCATE(input); // do { free(input); } while (0);</p>
        <p>The expansions in the comments clearly show the complexity of the generated code for allocation and deallocation.
The bodies of the do-while blocks run only once, and the loop structure will be optimized out by the compiler.
They are only needed so that the macro invocation can be terminated with a semicolon even when multiple
statements are generated.</p>
        <p>Of course, the data stored in the configured objects should be accessed only with the appropriate operations,
instead of operating on them directly. This ensures correct access even when the configuration is changed
between compilations.
strcpy(VALUE_AT(table, 0), "Zero");
// strcpy(*((table))[0], "Zero");
// ...
scanf("%d %d", &amp;MEMBER_OF(input, a), &amp;MEMBER_OF(input, b));
// scanf("%d %d", &amp;(*(input)).a, &amp;(*(input)).b);
uint32_t key = (MEMBER_OF(input, a) + MEMBER_OF(input, b)) % 8;
// uint32_t key = ((*(input)).a + (*(input)).b) % 8;
printf("%s\n", VALUE_AT(table, key));
// printf("%s\n", *((table))[key]);</p>
        <p>Note that the address-of operator &amp; can safely be applied to the result of MEMBER_OF, as it will always result in a
value selected from a structure. It is clearly visible that the macro library does not try to simplify the syntax
tree, as it generates (*(input)).a instead of the equivalent input-&gt;a. It is also worth noting that the fields of
struct input_t are not configurable. If they were, value access of the fields would have to apply an additional
VALUE_OF operator around MEMBER_OF.</p>
        <p>Unfortunately, the case of configurable record fields is even more complicated. As there is no compile-time
reflection information behind the macro system, it is impossible to automatically determine the fields of
a structure. However, when the structure declaration itself is built up with macros as a preprocessor data
structure, there is hope to share its field information. The current version of this library follows a
simpler implementation. The following example shows how to use named configurations with structure fields.
#define CONFIGURE__t0 PTR(SCALAR)
#define CONFIGURE_MEMBER__test_t__x PTR(SCALAR)
#define CONFIGURE_MEMBER__test_t__y PTR(ARRAY(SCALAR))
typedef struct
{
int id;
DECLARE_MEMBER(test_t, bool, x);
DECLARE_MEMBER_ARRAY(test_t, char, y, 8);
} test_t;
DECLARE(test_t, t0);
ALLOCATE(t0);
ALLOCATE_MEMBER(test_t, VALUE_OF(t0), x);
ALLOCATE_MEMBER(test_t, VALUE_OF(t0), y);
MEMBER_OF(t0, id) = 0;
VALUE_OF_MEMBER(test_t, VALUE_OF(t0), x) = true;
strcpy(&amp;VALUE_OF_MEMBER(test_t, VALUE_OF(t0), y), "test");
printf("t0: { %d, %d, \"%s\" }\n",
MEMBER_OF(t0, id),
VALUE_OF_MEMBER(test_t, VALUE_OF(t0), x),
VALUE_OF_MEMBER(test_t, VALUE_OF(t0), y));
DEALLOCATE_MEMBER(test_t, VALUE_OF(t0), y);
DEALLOCATE_MEMBER(test_t, VALUE_OF(t0), x);
DEALLOCATE(t0);</p>
        <p>This listing shows how the complexity of the macro system explodes when there are configured record fields.
In the case of named configurations, different macros need to be used for the operations on members, because
they need a different name-lookup algorithm. This is why every *_MEMBER macro receives the name of the
encapsulating struct and the name of the actual member. This information is used to assemble a macro name
where the required configuration can be looked up. When the user chooses the alternative interface, where
configurations are directly passed to the operation macros, there is no need to define separate *_MEMBER macros,
as the original ones can be used everywhere. It is also important to see that the encapsulating struct object
must be resolved to its value with its own configuration before any member operation can be applied to it.</p>
        <p>While it is possible to implement configurations in the C preprocessing system, it is not very practical. The
configuration changes are propagated easily once the code is appropriately prepared, but unfortunately that
preparation requires a huge effort and is highly error-prone.</p>
        <p>Architecture specialization can be done by extending the configuration builder macros with additional
parameters. For instance, PTR can be extended to hold an additional value that identifies the target memory of
a pointer. In the implementation of the operations, this information can be extracted and used, for example, to
generate different allocation or dereference code for different pointers.</p>
        <sec id="sec-3-1-1">
          <title>C++ Template Metaprogramming</title>
          <p>While the previous subsection showed that it is possible to implement configurations in the preprocessing phase,
there is still space for improvement. Some of the scratchpad-aware systems also have C++ compilers available.
It is a known result that template metaprograms are Turing-complete in the absence of instantiation limits [Vel03].
Their integration in the C++ language, along with other language features like operator overloading, provides a
good basis for implementing configurations. The library presented in this section is implemented using the Boost
MPL Library [Abr04].</p>
          <p>As template metaprogramming also has its roots in functional programming, like preprocessor
metaprogramming, the representation of configurations follows a similar strategy: configurations are stored as type-level lists
of configuration tokens, where the tokens themselves are types. Most of the operations process these type-level
lists with folds and pattern matching on the tokens.
typedef struct {} T_SCALAR;
typedef struct {} T_PTR;
typedef struct {} T_ARRAY;
struct SCALAR : mpl::vector&lt;impl::T_SCALAR&gt; {};
template &lt;typename Tokens&gt;
struct PTR : mpl::push_front&lt;typename Tokens::type, impl::T_PTR&gt; {};
template &lt;typename Tokens&gt;
struct ARRAY : mpl::push_front&lt;typename Tokens::type, impl::T_ARRAY&gt; {};</p>
          <p>Builder constructions for these lists are provided too, thus the user can easily define named configurations as
type aliases:
typedef PTR&lt;SCALAR&gt; input_config;
typedef ARRAY&lt;PTR&lt;ARRAY&lt;SCALAR&gt; &gt; &gt; table_config;</p>
          <p>These configurations are essentially the same as those used in the previous subsection. An interesting aspect
of this technique is that the configuration chain is built up from nested template applications. Configurable
objects are declared in the following way:
configured&lt;struct input_t, input_config, true&gt; input;
configured_array&lt;char, table_config, false, 8, 32&gt; table;</p>
          <p>Declarations for scalar and array objects are different, as they were with the preprocessor macros. However, the
reason is different here: it is to improve type safety. Array-like configured objects support indexing, while
others do not. Both types are parametrized with a base type, a configuration, and a boolean value that
indicates whether automatic allocation and deallocation should be done for these objects. Configured arrays also
need to provide a size for each dimension; this makes configured_array a variadic template. While input will
be automatically allocated and deallocated, this has to be done explicitly for table (hence the false parameter
value above):
table.allocate();
// ...
table.deallocate();</p>
          <p>As the above listing shows, allocation and deallocation operations become instance methods of configured
objects in this system. This allows the implicit sharing of the underlying configuration. Other configuration
operations are also implemented with instance methods, and some of them with operator overloading:
strcpy(table[0], "Zero");
// ...
scanf("%d %d", &amp;input-&gt;a, &amp;input-&gt;b);
uint32_t key = (input-&gt;a + input-&gt;b) % 8;
printf("%s\n", table[key]());</p>
          <p>The valueAt operation is simply implemented with operator[] of configured_array objects. Also, the
memberOf operation is mapped to operator-&gt; of configured objects. The valueOf operation is implemented
with overloaded cast operators and constructors. When it is not possible to infer the required type cast, as in the
case of a printf parameter, explicit execution of valueOf can be requested by using the object as a functor,
that is, by calling operator().</p>
          <p>Similarly to the preprocessing library, utilities for configurations are also provided with templates: array_rank,
item_config and has_ptr_item are defined to produce the same results as their macro counterparts, but now as
type-level values.</p>
          <p>To understand how the library works, take a look at the hierarchy and the internal structure of configurable
objects.</p>
          <p>The data wrapped by a configurable object is stored in a member named value. Its type is derived using a
template metaprogram implementation of the generator function G. For instance, the type of this member could be
defined as typedef struct input_t *real_type for input, and typedef char (*real_type[])[] for table.
The type casting operator and the functor implementation return an object of value_of_type. This type is
derived from the configuration after removing its outer ptr constructors, if any. This ensures that these operations
implement the valueOf operation by dereferencing the outer layer of pointers in the representation. The result of
array indexing is of value_at_type, a type derived from the item type of the corresponding configuration. For
multi-dimensional arrays, indexing an outer dimension will return another object of type configured_array,
with decreased rank and a shorter configuration. The utility methods rank and size return the number of
dimensions and the length of the outer dimension, respectively.
typedef PTR&lt;SCALAR&gt; CFG_t0;
typedef PTR&lt;PTR&lt;SCALAR&gt; &gt; CFG_tx;
typedef PTR&lt;ARRAY&lt;SCALAR&gt; &gt; CFG_ty;
typedef struct test_t
{
int id;
configured&lt;bool, CFG_tx&gt; x;
configured_array&lt;char, CFG_ty, true, 8&gt; y;
} test_t;
configured&lt;test_t, CFG_t0&gt; t0;
t0-&gt;id = 0;
t0-&gt;x = true;
strcpy(t0-&gt;y, "test");
printf("t0: { %d, %d, \"%s\" }\n", t0-&gt;id, t0-&gt;x(), t0-&gt;y());</p>
          <p>Configuration of record fields is also very easy with this library. The user can reuse the configured and
configured_array templates to declare arbitrary fields of records. When automatic allocation and
deallocation is turned on, the lifetime of the fields is bound to that of the containing objects. As this is the default setting,
the declaration of field x omits this parameter entirely.</p>
          <p>Transforming existing software to use configurations needs only a little effort: the declarations of configurable
objects should be replaced. Adaptation of the operations is simple, thanks to operator overloading. Type checking
will ensure that each access of the configured objects is correct.</p>
          <p>Architecture specialization can be done similarly to the previous technique. Every configuration builder
template can be extended with meta-information about the memory hierarchy, and this can be processed in the
implementation of the operations.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>Source-to-source Transformations</title>
          <p>The previous two implementations are intrusive: they require the modification of existing sources when applied.
Using the C macro library requires replacing all declarations and operations where configurable objects are involved.
Working with the C++ template solution needs much less effort, but still, configurable objects have to be treated
differently from ordinary data types.</p>
          <p>This subsection shows a technique where applying configurations requires no preliminary changes to the
source code. The idea is to use automated source-to-source transformations that are parametrized with
configurations. In this paper we focus on the description and validation of configurations, and not on the execution of
the transformations themselves. Although the presented method could be applied to any appropriate programming
language, to make the results comparable with the previous sections, we decided to work on C source code.</p>
          <p>The role of configurations is to describe the data layout of types and objects across the memory hierarchy.
This information is actually represented in the existing source code as types. The key to generating configurations
from C types is to implement the inverse of the generator function G. Denote this function by G_C^-1 : T → C × T_B,
where T is the set of all possible C types, C is the set of configurations over a representation, and T_B is the set of
C base types. The latter set contains all scalar, enumeration and record types (introduced either with a struct
or union keyword). Type aliases should be fully resolved before applying G_C^-1. This is a partial function, as C
function pointer types are not supported by configurations.</p>
          <p>G_C^-1(τ) = scalar, when τ ∈ T_B; G_C^-1(τ) = ptr(G_C^-1(τ′)), when τ = τ′* (a pointer to τ′); G_C^-1(τ) = array(G_C^-1(τ′)), when τ = τ′[] (an array of τ′).</p>
          <p>Once this function is available, the question is how these configurations should be stored and presented to
the C programmer, and how the changes will be given as instructions to the transformation tool. The solution
is to create a configuration skeleton for the source files. The skeleton file will have the same structure as the
input file, and is also a valid C source file, but contains no statements at all. To create a skeleton, process the
input file as follows. Keep all type declarations, global objects and functions. For functions, only keep
the local variable declarations in their outermost blocks. Drop all function parameters, and change return
types to void. The preservation of the structure is needed to uniquely identify each type and object declaration
in the source file later. As inner blocks of function bodies are dropped, the identification will not be available
for local variables declared there, but practice shows that configurability is not a concern for short-lived,
deeply nested locals. What remains in the skeleton file is essentially a set of configurations, if we apply G_C^-1
to the types in the remaining declarations. The user can rewrite the types in the skeleton file, and use it as a
parameter to the transformation tool.</p>
          <p>The transformation starts with an analyzer phase. Both the input file and the configuration file (possibly a
modified skeleton) are parsed using a C parser, and a semantic analysis is executed to create a type-annotated
abstract syntax tree of the sources. The function G_C^-1 is applied to each type found in the configuration file. At the
same time, the structurally corresponding declaration is looked up in the input file for each declaration in the
configuration file. This makes it possible to apply G_C^-1 also to the types found in the input declarations, and
compare the old and the new configurations.</p>
          <p>Each pair of compared configurations must be compatible, that is, they must differ only in the number of ptr
constructors applied. This restricts the changes to data outsourcing and data inlining transformations.
The base types corresponding to the two configurations must also be the same. As changes are restricted to compatible
configurations, there is no way to enforce an invalid configuration for an object. When no corresponding
declaration is found in the input file for an element of the configuration file, or the compatibility check fails, the
transformation can signal an error and terminate.</p>
          <p>For each of the compatible configurations, the differences can be collected. These differences will control the
transformations executed on the input syntax tree. According to the old configuration, which is read from the input
file, it can be checked that the objects are used only according to the operations of configured objects. When
any non-matching access pattern is found, it cannot be guaranteed that the transformation will be successful.
This restriction is needed because C enables pointer arithmetic and other special ways to work with data. The
modified code for the operations can only be derived correctly when it is ensured that in the input code the given
object is used like an object previously configured with its old configuration.</p>
          <p>Unfortunately, correct source-to-source transformation of C programs is highly non-trivial. Preservation of
the preprocessing structure, comments and formatting is very difficult. Type aliases in configured types need to
be handled with care. Consider the following:
// input file
typedef int speed;
typedef int *place;
speed vertical_speed;
// modified configuration file
speed *vertical_speed;</p>
          <p>The old and new configurations of vertical_speed are scalar and ptr(scalar), which are compatible.
The application of G_C^-1 resolves all type aliases to produce the configurations and examine the base types, thus
giving int *vertical_speed when re-synthesized with G. This is incorrect, as picking another alias, like place
vertical_speed, would also be. To transform the declaration correctly, the new type should be written exactly
as it was in the configuration file. Detection of incorrect access patterns and synthesizing the
transformed code with the minimal amount of changes are also challenging problems. For these reasons, we only
implemented a few special cases in a tool based on the Clang compiler infrastructure.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Comparison of implementation techniques</title>
      <p>This section contains a comparison of the previously presented configuration implementation techniques. The
comparison is made based on the following 6 + 1 aspects.</p>
      <sec id="sec-4-1">
        <title>Recon guration</title>
        <p>Shows how much code needs to be modified manually by the user when a configuration changes. For the next
example, take a data outsourcing on the elements of an array, so the configuration changes from array(scalar)
to array(ptr(scalar)).</p>
        <p>Miller: elem_config [] ⇒ elem_config [] -&gt; dram
C macros: ARRAY(SCALAR) ⇒ ARRAY(PTR(SCALAR))
C++ templates: ARRAY&lt;SCALAR&gt; ⇒ ARRAY&lt;PTR&lt;SCALAR&gt;&gt;
Source transformations: elem_type arr[]; ⇒ elem_type *arr[];</p>
      </sec>
      <sec id="sec-4-2">
        <title>Storage of con gurations</title>
        <p>Describes where and how the con gurations are stored and what is their relationships to other parts of the code.</p>
        <p>Miller Configurations are an integral part of the language. They can be stored in the same files as
the corresponding program code, or in separate files. In the latter case, configurations can be included into the
program with a standard C preprocessor command. Configurations can be named and referenced later.
Objects and configurations are loosely coupled.</p>
        <p>C macros The user can write her own configuration definitions as macros. This enables storing them
either together with other parts of the code or separately. However, the configuration of record fields is much
less advanced, as they cannot belong together structurally as in the previous case.</p>
        <p>C++ templates This solution has almost the same properties as C macros from this perspective. The
only difference is that user-defined configurations can be defined as type aliases instead of macros. This
helps to group related configurations together.</p>
        <p>Source transformations Configurations are stored in separate files. They are valid C files, to ease their
handling and processing both for the users and the tools. The structure of the configurations follows the structure
of the original code. Objects and configurations are tightly coupled, as the configuration file must follow
the structure of the original code. However, this is usually not a problem, as configurations can be deleted
after the desired transformations are executed.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Syntax of con gurations</title>
        <p>Miller introduces a custom syntax for describing memory layers, locations and configurations, which the user
must learn in the beginning.</p>
        <p>C macros and C++ templates use almost the same syntax. The macro library uses a chain of macro
invocations, while the other uses nested template applications. The low number of configuration constructors
makes it easy to master in minutes.</p>
        <p>Source transformations use configuration files where each configuration is described as a standard C
declaration. This is the most natural and straightforward way to express the changes in types, as it uses
the same syntax to describe the transformation as the transformed language itself.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Amount of syntactic noise</title>
        <p>What the price of configurability is in terms of readability and maintainability of code.</p>
        <p>Miller has built-in constructions to support configurations. Basically, every object in the language is
configured, and the default operations work accordingly. Configurability "comes for free" in this
manner.</p>
        <p>C macros add huge syntactic noise. Every declaration gets replaced by a syntactically more complicated
one. This is also true for all operations done with these objects.</p>
        <p>C++ templates add less noise than macros, thanks to their better integration into the language and to
operator overloading. However, non-trivial changes have to be made when turning a non-configurable object
into a configured one.</p>
        <p>Source transformations have no syntactic noise at all, if the transformation engine is implemented well
and can preserve all the appropriate comments and formatting of the original source.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Re-usability of con gurations</title>
        <sec id="sec-4-5-1">
          <title>How easy it is to re-use an existing configuration for a new piece of code.</title>
          <p>Miller, C macros and C++ templates provide the ability to give names to configurations. These can be
referenced later in any other configuration or get assigned to newly defined objects.</p>
          <p>Source transformations use separate configurations for each declaration explicitly. However, as
configuration files are C source files, it would be possible to use preprocessor macros to produce the configurations.</p>
          <p>Using a single configuration file to transform multiple input files is also feasible.</p>
        </sec>
      </sec>
      <sec id="sec-4-6">
        <title>Applicability to existing code base</title>
        <p>Difficulties that arise when turning non-configured objects into configured ones.</p>
        <p>Miller is a domain-specific language that was designed with memory hierarchies and configurability in mind
from the very beginning. Every object and type expressed in Miller is configurable.</p>
        <p>C macros may require a huge amount of highly error-prone, non-trivial modifications of the original code
to turn a single object into a configured one.</p>
        <p>C++ templates reduce the drawbacks of macros by using operator overloading and instance methods.
The type checker also ensures that the programmer rewrites them correctly.</p>
        <p>Source transformations were designed especially to be applicable to existing code bases, so their cost is
zero from this perspective.</p>
      </sec>
      <sec id="sec-4-7">
        <title>Other bene ts and drawbacks</title>
        <p>Miller provides built-in configurations, and gives outstanding support for programming complex memory
hierarchies. However, it is always hard to introduce a new language when there is an industrially accepted
standard or a huge existing code base.</p>
        <p>C macros are simple enough for demonstration and prototyping purposes, but are very impractical to use
in real software.</p>
        <p>C++ templates are completely acceptable when the target platform provides a C++ compiler, which
is not always the case. When the number of configurable variables is small, and the library is introduced
at the beginning of the development, it can be a practical solution. The type safety of templates helps a lot
compared to macros, but template instantiation errors can be very long and confusing.</p>
        <p>Source transformations try to combine the benefits of the built-in constructions with an already
known language. However, implementing a transformation tool that handles all cases automatically and
correctly is a great challenge for difficult-to-refactor languages like C.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We introduced configurations to ease the change of data layout over different memory layers. The programmer
only needs to change the configurations of the data structures, and the required, highly non-trivial modifications
will be propagated automatically over the algorithms. This makes data layout changes less error-prone and more
efficient. This is especially important for performance tuning of applications on scratchpad-aware architectures.
Four different implementation techniques were shown: one with built-in constructions of a domain-specific
language, one with a C preprocessor library, a solution based on a C++ template library, and a source-to-source
transformation tool. The comparison shows that in practice, as expected, the C preprocessor metaprograms
are not really usable. However, under specific circumstances C++ template metaprograms could implement
configurable objects adequately. The real solution is a domain-specific language like Miller, or a sophisticated
source-to-source transformation tool. The former is unfortunately difficult to introduce in industrial scenarios,
while the latter is a great challenge to implement correctly for languages with a rich syntactic structure and
complex semantics like C.</p>
      <sec id="sec-5-1">
        <title>Acknowledgements</title>
        <p>We wish to thank the current and former members of the research group at the ELTE Software Technology Lab,
including Gergely Dévai, Boldizsár Németh, Zoltán Csörnyei, Zoltán Kelemen and Dániel Leskó. Their useful
and constructive work on this project made our publication possible.</p>
        <p>[Bpl] The Boost Library Preprocessor Subset for C/C++. http://www.boost.org/doc/libs/1_60_0/libs/preprocessor/doc/index.html</p>
        <p>[Car02] Carlos Carvalho. The gap between processor and memory speeds. In Proceedings of IEEE International Conference on Control and Automation, 2002.</p>
        <p>[Kar14] Mate Karacsony. Modeling C preprocessor metaprograms using purely functional languages. Proceedings of the 9th International Conference on Applied Informatics, 2014. Vol. 2. 85–92.</p>
        <p>[Krz11] Olaf Krzikalla, Kim Feldhoff, Ralph Müller-Pfefferkorn and Wolfgang E. Nagel. Scout: a source-to-source transformator for SIMD-optimizations. Euro-Par 2011: Parallel Processing Workshops, 2011. 137–145.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[Abr04] David Abrahams and Aleksey Gurtovoy. C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond. Addison Wesley Professional, 2004.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[Ban02] Rajeshwari Banakar, Stefan Steinke, Bo-Sik Lee, Mahesh Balakrishnan and Peter Marwedel. Scratchpad memory: design alternative for cache on-chip memory in embedded systems. Proceedings of the tenth international symposium on Hardware/software codesign. ACM, 2002. 73–78.</mixed-citation>
      </ref>
      <ref id="ref2a">
        <mixed-citation>[Nem13] Boldizsar Nemeth and Zoltan Csornyei. Stackless programming in Miller. Acta Universitatis Sapientiae, Informatica 5.2, 2013. 167–183.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[Nie97] Jarek Nieplocha, Robert Harrison, and Ian Foster. Explicit management of memory hierarchy. Advances in High Performance Computing. Springer Netherlands, 1997. 185–199.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Pan97]
          <string-name>
            <given-names>Preeti Ranjan</given-names>
            <surname>Panda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nikil D.</given-names>
            <surname>Dutt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Alexandru</given-names>
            <surname>Nicolau</surname>
          </string-name>
          .
          <article-title>Efficient utilization of scratch-pad memory in embedded processor applications</article-title>
          .
          <source>Proceedings of the 1997 European conference on Design and Test</source>
          . IEEE Computer Society
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Pra14]
          <string-name>
            <given-names>Prashantha</given-names>
            <surname>NR.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Vikram</given-names>
            <surname>TV.</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vaivaswatha</given-names>
            <surname>N.</surname>
          </string-name>
          .
          <article-title>Implementing Data Layout Optimizations in LLVM Framework</article-title>
          .
          <source>LLVM Developers' Meeting, October 28-29</source>
          ,
          <year>2014</year>
          . http://llvm.org/devmtg/2014-10/#talk14
        </mixed-citation>
      </ref>
      <ref id="ref5a">
        <mixed-citation>
          [Sin16]
          <string-name>
            <given-names>Artjoms</given-names>
            <surname>Sinkarovs</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sven-Bodo</given-names>
            <surname>Scholz</surname>
          </string-name>
          .
          <article-title>Type-driven Data Layouts for Improved Vectorisation</article-title>
          .
          <source>Concurrency and Computation: Practice &amp; Experience</source>
          ,
          <year>2016</year>
          .
          <fpage>2092</fpage>
          -
          <lpage>2119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Vel03]
          <string-name>
            <given-names>Todd L.</given-names>
            <surname>Veldhuizen</surname>
          </string-name>
          .
          <article-title>C++ templates are Turing complete</article-title>
          . http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.3670
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Ver02]
          <string-name>
            <given-names>Manish</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Steinke</surname>
          </string-name>
          and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Marwedel</surname>
          </string-name>
          .
          <article-title>Data partitioning for maximal scratchpad usage</article-title>
          .
          <source>Proceedings of the 2003 Asia and South Pacific Design Automation Conference</source>
          . ACM,
          <year>2003</year>
          .
          <fpage>77</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Ver04]
          <string-name>
            <given-names>Manish</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lars</given-names>
            <surname>Wehmeyer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Marwedel</surname>
          </string-name>
          .
          <article-title>Dynamic overlay of scratchpad memory for energy minimization</article-title>
          .
          <source>Proceedings of the 2nd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis. ACM</source>
          ,
          <year>2004</year>
          .
          <fpage>104</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>