<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Configurable data structure layout for memory hierarchies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <email>kmate@elte.hu</email>
          <email>matej@elte.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ELTE Eötvös Loránd University, Faculty of Informatics</institution>,
          <addr-line>Budapest</addr-line>,
          <country country="HU">Hungary</country>
        </aff>
      </contrib-group>
      <fpage>119</fpage>
      <lpage>135</lpage>
      <abstract>
        <p>Developments of the last decades resulted in an increasing gap between processor and memory speeds. Therefore in high-performance computations the main bottleneck is memory access. A practical solution to this problem is to implement memory hierarchies utilizing multiple levels of memories with different capacities and access profiles. For embedded and low-power systems it is more beneficial to apply programmable scratchpad memories instead of hardware-controlled caches. The overall performance of scratchpad-aware software heavily depends on how the data is distributed and accessed in different memory layers. Choosing the data layout for optimal performance is a set of non-trivial design decisions. This paper introduces configurations over data structures to enable deferring these decisions, by making it easier to change the data layout of existing programs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Copyright © by the paper's authors. Copying permitted for private and academic purposes.</p>
      <p>Scratchpad-aware systems are usually programmed using low level languages. In these languages, the
distribution of data in different memory layers is reflected by the data structure definitions and the operations on those
structures [Nie97]. Changes in data layout will be reflected in the source code as well, often causing significant
non-trivial modifications. To make these modifications easier, we define configurations over data structures.
Configurations enable fine-grained description of the data layout of an application. They were first introduced as
built-in language constructs in Miller [Nem13], a domain-specific language for multi-core router programming.
By reusing the key ideas behind these constructs, it is also possible to synthesize data structure definitions and
accessing code through implementations consisting of C macros and C++ templates. Configurations could also
be used to control and parametrize source-to-source transformations.</p>
      <p>The organization of this paper is the following. The next subsection presents related work. Section 2 introduces
and describes configurations. Different implementations of configurations are detailed in Section 3. This is
followed by a comparison of the implementation techniques in Section 4. Finally, Section 5 concludes the paper.</p>
      <sec id="sec-1-1">
        <title>Related work</title>
        <p>Miller is a domain-specific language focused on the development of performance-critical applications for
scratchpad-aware architectures. As it was presented earlier by other members of our research group in [Nem13],
it supports stackless programming using execution units called bubbles instead of regular functions. Therefore
the control flow is based on continuation passing instead of call stacks. The other specialty of this language is
the built-in support for configurable data layout, which is presented in this paper. Miller was implemented in
Haskell, first as an embedded language. Later a custom C-like syntax was developed with a standalone compiler.
The compiler supports a general MIPS32 architecture, and one of its proprietary derivatives. The generated
output is optimized assembly text. The applied optimizations include peephole and prefetch optimizations. The
implementation of prefetch optimization is based on memory access profile descriptions.</p>
        <p>There are data structure transformations that increase performance both on hardware-cached and
scratchpad-aware architectures. In [Pra14], several such transformations are presented over the Low Level Virtual Machine
(LLVM) framework. Structure peeling and splitting is intended to improve performance by the separation of
"hot" and "cold" fields of structures, that is, the frequently and infrequently used parts of the data. These
transformations could also require structure field reordering. Other presented techniques for improving cache
locality are struct-array copy, instance interleaving and array remapping. The execution of these transformations is
planned to be automatic, based on profiling data and static code analysis. However, currently no implementation
is available for these techniques in the public LLVM code base.</p>
        <p>Data layout transformations are very common to support vectorization. [Sin16] shows a data layout inference
technique that identifies layout transformations that convert data into a format favorable for SIMD instructions.
The presented inference algorithm is based on a type system for layout transformations, and enables the
autovectorization of programs.</p>
        <p>Scout [Krz11] also supports automatic vectorization. It is a source-to-source transformation tool based on
the Clang compiler infrastructure. Scout supports several different loop transformations to achieve optimal
conditions for applying SIMD instructions. It supports several widely used SIMD instruction sets like SSE or
AVX, but it is also extensible.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Configurations</title>
      <p>The memory hierarchy of the target architecture must be taken into account to achieve optimal performance on
scratchpad-aware systems. The data layout of an application could change for several reasons, for example in
order to always store the most frequently accessed data in the fastest memory. When a new system is under
development, usually not all factors affecting performance are known in advance. Fine-tuning of the application
requires consecutive measurements and changes in the memory mapping of data. Requirement changes could
also have the same effect. A more drastic reason is a change in the underlying memory hierarchy, for instance
when an existing application is ported to a different target architecture. In this case, new capacity and speed
relationships must be considered for performance.</p>
      <p>These changes could produce two different kinds of transformations in the data layout. When the space
available for a given type of objects is decreased, either because the capacity of the memory is decreased, or
the number of objects is increased, it could be necessary to place the less frequently used data members into a
different memory layer. The original data fields will be transformed into pointers. The indirection introduced
here will increase the access time of these members, but the more frequently accessed data parts remain in the
faster memories. This transformation is called data outsourcing.</p>
      <p>In the opposite case, when the amount of free space available in faster memories increases, indirections
pointing to data in different layers could be eliminated. This enables the placement of less frequently used
data members into faster memories, thus increasing the average memory access speed. Complementary to the
previous technique, this transformation is called data inlining.</p>
      <p>The role of configurations is to describe the data layout of types and objects across the memory hierarchy.
Therefore changes in configurations enable the description of the previous transformations.</p>
      <sec id="sec-2-1">
        <title>Architecture-independent memory model</title>
        <p>As there are many scratchpad-aware architectures with very different memory hierarchies, we decided to describe
data layout configurations in an architecture-independent way over a simplified memory model. In practice, the
architecture-specific information could be added to the model easily when implementing configurations with a
given technique, as will be detailed in various subsections of Section 3.</p>
        <p>By forgetting the architecture-dependent memory information, the mapping of data layouts to concrete
memory layers is lost. The only significant property is whether the value of a data member is directly available as an
integral part of the structure, or only indirectly through a pointer. The latter case makes it possible to store a
part of an object in a different memory than the object itself.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Description of configurations</title>
        <p>As most of the scratchpad-aware systems are programmed in low level languages, especially in some dialect of C,
we decided to derive configurations from the type constructions of the C language. Enumeration types, function
pointer types and type aliases have nothing to do with data layout in general, thus the interesting ones are
pointer, array and record (struct and union) constructions. In the following, let T be the set of all types, and
TB ⊆ T the set of configuration base types, that is, all scalar and record types. TB does not contain any
pointer or array types.</p>
        <p>According to the transformations described in the beginning of the section, only data outsourcing could be
applied on a given scalar element of type τ ∈ TB. Using the C language notation for pointer types, this
transformation could be described as τ ↦ τ* between the types. Let the source of this transformation, that is,
the primitive configuration corresponding to a directly contained scalar data member, be denoted by scalar ∈ C,
where C is the set of all configurations. Furthermore, denote the previous transformation that introduces an
indirection level with a pointer type constructor as ptr : C → C. These two constructors generate the following
configurations:
scalar, ptr(scalar), ptr(ptr(scalar)), …, ptr(…(ptr(scalar))…)</p>
        <p>Let us introduce the type generator function G : C × TB → T, that generates a configured data type from a
configuration and a base type. Using a base type τ, the previous configurations generate the following
types:</p>
        <p>G(scalar, τ) = τ</p>
        <p>G(ptr(scalar), τ) = τ*
G(ptr(…(ptr(scalar))…), τ) = τ*…*</p>
        <p>Layout configuration of arrays requires the configuration of two properties. First, the skeleton of the array
needs to be accessed to index a specific element, and second, the access of the selected element itself needs
to be configured. These are both possible by introducing a third configuration constructor, array : C → C. The
parameter of this constructor specifies how the elements are configured at each index, relative to the skeleton.
The configuration of the array skeleton could be specified by applying more configuration operations around this
constructor. Figure 1 shows the application of this constructor in various cases, using the [ ] array type constructor
on application of array in G, with the notation of the C language.</p>
        <p>All fields of record types could be configured in a similar way as independent objects, so there is no need to
define a separate configuration constructor for this case. However, we can define a compound configuration for
a record. Let crec = ⟨f1 = c1, …, fn = cn⟩ denote a compound configuration for a record rec with fields
f1, …, fn, and c1, …, cn ∈ C. As configuration operations will not be defined over compound configurations,
crec ∉ C.</p>
        <p>A configuration is valid for an object when it contains as many array constructors applied as the
dimensionality of the object requires. For scalar values it is zero, for one-dimensional arrays it is one, and so on. The
innermost constructor of a valid configuration is always scalar. Denote the dimension or array rank of a valid
configuration with D : C → N, that gives the number of array constructors applied in it. It is important to note
that TB does not contain array types. It implies that all array dimensions must be visible in configurations.</p>
        <p>The minimal configuration of an object is a valid configuration that contains no ptr constructors. It is easy to
see that data inlining could not be applied on a minimal configuration, as it would involve the removal of a ptr
constructor. It means that when an object has a minimal configuration, it is only possible to store the whole object
in the same memory layer.</p>
        <p>Two configurations are compatible when they only differ in the application of ptr constructors. This
requirement is fulfilled for configurations c1 and c2 exactly when D(c1) = D(c2). Compatible configurations could be
transformed freely into each other by adding or removing ptr constructors – therefore by applying data
outsourcing or inlining. It is obvious that valid(c1, o) ∧ compatible(c1, c2) ⟹ valid(c2, o), where c1, c2 ∈ C and o is an
arbitrary configurable object.</p>
        <p>Define the extension function Ec : C → C corresponding to a configuration c ∈ C by replacing the innermost
scalar constructor of c with a variable. For example, let c = ptr(array(ptr(scalar))). Then replacing the
innermost scalar with a new variable x in c gives the extension function Ec(x) = ptr(array(ptr(x))).</p>
        <p>Consider configurations c1 and c2 as structurally equivalent, also denoted as c1 ≡ c2, when c1 and c2 were
created using the same constructors in the same order. The length of a given configuration is L : C → N, that
gives the number of constructor functions used to create it. Of course, L(c1) = L(c2) is implied by c1 ≡ c2.
L(c) could also be denoted as Lc.</p>
        <p>Note that all of the configuration constructors are extensible with arbitrary information. For example,
the scalar and ptr constructors could be extended with references to memory layers. This enables the injection
of architecture-specific information into a configuration.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Operations over configured objects</title>
        <p>As the previous subsection presented, function G can be used to generate a configured data type τc = G(c, τ) ∈ T,
where c ∈ C and τ ∈ TB. To access data in objects defined with type τc, we need operations that could follow
the changes of their data layout configuration.</p>
        <p>First, an operation is needed to declare an object with type τc. Let declare : C × TB × Id → Decl, where Id
is the set of valid identifiers, and Decl denotes the set of all valid declarations of a given programming language.
It is easy to see that declare can be built trivially when an implementation of G is available.</p>
        <p>Since it is allowed to modify a configuration by adding or removing ptr constructors at any time, the memory
management of configured objects could change drastically. As a consequence, explicit memory allocation
and deallocation operations are needed at the beginning and at the end of the lifetime of configurable objects.
For instance, take the following declaration in the C language: declare(int, scalar, "x") ⇒ int x. By default,
x will have automatic storage, which means its lifetime will be bound to the surrounding block. There is
no need for explicit allocation or deallocation of x in this case. However, if we apply data outsourcing to the
configuration, which is completely valid in this case, it gives declare(int, ptr(scalar), "x") ⇒ int* x. To store
any value at the memory location pointed to by x, we need to implement explicit memory management. It implies
that explicit allocate : C × Id → Stmts and deallocate : C × Id → Stmts operations must be used every time
to initialize and finalize storage for configured objects, where Stmts is a list of statements in a given
programming language. Following the previous example, allocate(scalar, "x") ⇒ { } and deallocate(scalar, "x") ⇒ { },
where { } is an empty list of statements, but allocate(ptr(scalar), "x") ⇒ { x = malloc(sizeof(*x)); } and
deallocate(ptr(scalar), "x") ⇒ { free(x); }. These are the most complex operators, as they must be able to derive
the allocation and deallocation statements for any objects with valid configurations, including multi-dimensional
arrays with several layers of indirection.</p>
        <p>Accessing the value of a configured object could also change with the configuration. It is evident that all
pointers must be dereferenced to access a value. Fortunately, configurations contain all the information needed
to achieve this. Let valueOf : C × Expr → Expr be an operation over a configuration and an object that
generates an expression in Expr, the set of all expressions in a given language, to access a configured object. To
continue our previous example, valueOf can be implemented for the C language to give valueOf(scalar, x) ⇒ x,
and valueOf(ptr(scalar), x) ⇒ *x.</p>
        <p>Similarly, when the base type of an object τ ∈ TB is a record type, and a field of this record object is accessed,
the object must be dereferenced first. Assume we have a projection function π : Expr × Id → Expr that selects
a given member from an object. Let memberOf : C × Expr × Id → Expr, and memberOf(c, o, m) ::=
π(valueOf(c, o), m) ∈ Expr, where c ∈ C, o ∈ Expr and m ∈ Id. As an example, when we have an
object x of a record type that has a field named y, then the following could be implemented for the C
language: memberOf(scalar, x, "y") ⇒ x.y, and memberOf(ptr(scalar), x, "y") ⇒ x-&gt;y, which is equivalent
with π(valueOf(ptr(scalar), x), "y") ⇒ (*x).y, given that π(o, m) ⇒ (o).m. Note that when field y is
also of a configured data type, then it must be accessed with the appropriate operation, for instance using
valueOf(cy, memberOf(cx, x, "y")), where cx, cy ∈ C and x ∈ Expr with a record base type having a field
named y.</p>
        <p>The last operation needed for configured objects is array indexing. An array configuration has a dual role:
it describes not only how to access the elements, but also the array skeleton itself. These could be
done in the following steps. First, the configuration needs to be split into two parts. Let split : C → C × C
be a function that returns two configurations, split(carr) = (cskel, citem), when D(carr) ≥ 1 is satisfied.
Require that D(cskel) = 0 and D(citem) = D(carr) − 1. The first requirement implies that cskel contains only a
scalar constructor and an arbitrary number of ptr constructors. Take the extension function Ecskel : C → C
of cskel by replacing the innermost scalar constructor with a variable. Then the following structural
equivalence must be satisfied between the previous configurations: carr ≡ Ecskel(array(citem)). The first part
of the configuration, cskel, is needed to dereference the array object itself before indexing. It could be done
easily with the previous valueOf operation. After selecting the required element using an index function
ι : Expr × N → Expr, configuration citem must be considered similarly to dereference the element data.
Therefore operation valueAt : C × Expr × N → Expr could be defined as follows: valueAt(carr, arr, idx) ::=
valueOf(citem, ι(valueOf(cskel, arr), idx)) ∈ Expr, where split(carr) = (cskel, citem). Configuration citem
could also be referred to as the item configuration of carr. For example, assume we have an array x with
configuration cx = ptr(array(ptr(scalar))). Then the value of ι(x, 3) ⇒ x[3] in the C language could be
accessed as valueAt(cx, x, 3) ≡ valueOf(cx elem, ι(valueOf(cx skel, x), 3)) ⇒ *((*x)[3]) in C. Splitting cx gives
cx skel ≡ ptr(scalar) with extension function Ecx skel(c) = ptr(c) and also cx elem ≡ ptr(scalar).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Implementation techniques</title>
      <p>This section details techniques that we used to implement configurations for data type layout. Four solutions
will be evaluated: the integration into a domain-specific language; implementation with C preprocessor macros
and C++ template metaprogramming; and finally source-to-source transformations. The following subsections
describe these implementations independently, while a comparison could be found in Section 4. As most of their
implementation details are out of the scope of this paper, the complete source code for the C and C++ libraries
could be found at https://github.com/kmate/configurable-data-layout.</p>
      <p>First we developed and implemented data layout configurations in a domain-specific language named Miller.
Miller was designed for efficient multi-core router programming over complex multi-layer, scratchpad-aware
memory hierarchies. The language has a C-like syntax and type system. It supports various fixed-width integer
scalars, enumeration types, type aliases, bit sets, records, and arrays. Pointers are treated specially in the
language: they are split into two separate types. Sharing of data is expressed with references. They are very
similar to ordinary C pointers with two exceptions: it is not possible to do pointer arithmetic on references, and
they cannot be used for expressing optionality. For the second use case, there is a different optional type in the
language, that can be used in a special case construction to deal with the absence of a value. Reference types
are introduced with a ref keyword, while option types are marked with a question mark. Arrays and records
have similar syntax as in C. For instance, in the following listing bucket is a reference to an array of 32 optional
values with type elem. It is able to store a single line out of the 256 possible such entries of table.
struct elem
{
  int32 key;
  int32 value;
}
elem in;
elem?[32][256] table;
ref elem?[32] bucket;</p>
      <p>Miller has a language construction to provide information about the memory hierarchy. The memory keyword
introduces a new layer with the given name and size. It is also needed to provide an access profile, that can be
used by the prefetch optimizer of the compiler. In the following example it is omitted, as this information is
irrelevant for configurations. The listing below defines two memory layers named dram and spad, with 256 and
8 megabytes of total size.</p>
      <sec id="sec-3-1">
        <title/>
        <p>memory dram (256 MB) { /* access profile omitted */ }
memory spad (8 MB) { /* access profile omitted */ }</p>
        <p>Once the memory types are given, it is possible to define locations where objects can be stored. There are
two different kinds of location definitions: registers and locations in one of the memory layers.
register r1 (4 B);
register r2 (4 B);
// ...
location table_loc (256 KB) on spad;
location in_loc (8 B) on dram;</p>
        <p>After memory types and locations are given, configurations can be provided for each object independently
from the algorithms working with them. In fact, there are two types of configurations in Miller. There are type
configurations to determine how objects of a given type should be stored in the hierarchy, and there are object
configurations for specifying the location and type configuration of individual objects.
typeconfig elem_conf {
  key;
  value -&gt; dram;
};</p>
        <p>elem_conf shows a possible configuration for record type elem. Field key has an empty configuration. It
means that all data referred to by key will be stored in the same memory as the containing object, so it represents
a scalar configuration constructor. This is the default behavior, so this entry is optional. The value field has
a type configuration built with the operator -&gt;. It introduces an indirection into the configuration: the containing
object will only have a pointer to the value stored in the dram memory. Hence the configuration of value is
ptr(scalar). Note that for algorithms working with the value field this indirection is invisible. This guarantees
that the configuration could easily be changed independently from other parts of the code.</p>
        <p>An object configuration consists of a location and a type configuration. In the following listing some of the
irrelevant syntactical parts are omitted – the ones that are only required to unambiguously map the names in
configurations to the object declarations.
in : in_loc;
table : table_loc elem_conf [][] ?&gt; dram;
bucket : r1 :&gt; spad elem_conf [] ?&gt; dram;</p>
        <p>The simplest object configuration only specifies a location for the given object. The configuration for in could
be described with a single scalar constructor, along with the meta-information about the location. As table
refers to a two-dimensional array, its configuration must be constructed using two array constructors. Because
the elements of the array are optional, and this is implemented with pointers under the hood, it is possible to
store the elements of this two-dimensional array in another layer. The ?&gt; operator configures an optional type
to store the values in the memory given on its right hand side. The configuration of table can be described as
array(array(ptr(scalar))). The outer array constructors have associated meta-information about the location
table_loc, and the ptr constructor refers to dram. Similarly to ?&gt;, operator :&gt; can be used to configure a
pointer, but under a reference type instead of an optional. It means that bucket will be stored as a pointer in
register r1 to an array of pointers on spad. The pointers in the array reference values stored on the dram.
Both of the last two configurations refer to elem_conf, to specify how the structure fields are stored.</p>
        <p>Operations defined for configurations in Section 2.3 are built into the Miller compiler. As configurations
can also be stored separately, it is possible to have multiple different configurations for the same program. The
configurations are read from source files, and connected implicitly and unambiguously with the types and objects
found in the program by their qualified names. The user does not have to explicitly use different operations
on objects with configured data types, as all objects are configurable in the language. Of course, there is a
validation phase in the compiler that checks the configurations of each type and object before compilation. Type
configurations are allowed to be under-specified: when no configuration is provided, a default configuration takes
place that stores everything in the same memory as its container object. The details of the target architecture
were adjustable through memory and register configuration to a given extent, but still some information was
built into the compiler, especially about how the data access code should be generated for particular layers.
Even though Miller implemented configurations perfectly – as it was explicitly designed for them – it is always
hard to introduce a new domain-specific language instead of a well-known industrial standard. This led us
to experiment with the embedding of configurations into C, which is already widely used on scratchpad-aware
architectures.</p>
        <p>The most convenient option was to implement configurations and their operations as a preprocessor macro
library. Despite the high expressiveness of preprocessor macros, it is very hard to build and debug really complex
libraries. To ease the development, we have chosen an implementation strategy based on modeling the macros
with simple purely functional programs in Haskell [Kar14]. Finally, these definitions were translated back to the
preprocessor macro language, using the Boost Preprocessor Library [Bpl].</p>
        <p>In this system, the configurations are represented internally with a parenthesized sequence of tokens instead
of a chain of constructor applications. It means that a configuration of the form C1(C2(C3)), where each Ci is a
constructor of configurations, is expressed as (T1)(T2)(T3), where each Ti is the token corresponding to constructor
Ci. The exact definition of these tokens is not important, only the fact that each different constructor has a
different, unique token that makes it possible to pattern match on the value during processing. Although the
internal representation is not a chain of constructor applications, it is possible to define builder macros that
support chained application.
#define SCALAR (SCALAR)
#define PTR(tokens) (PTR) tokens
#define ARRAY(tokens) (ARRAY) tokens</p>
        <p>As macro expansions are not recursive, tokens on the right hand side will not be re-expanded
after substitution. For example, the configuration PTR(ARRAY(PTR(SCALAR))) will build the token sequence
(PTR)(ARRAY)(PTR)(SCALAR). This sequence can be effectively processed with a fold-like construction, up to a
predefined maximum length (currently 64, using the Boost Preprocessor Library). The library defines the
following macros corresponding to properties of configurations. Macro ARRAY_RANK calculates D(c), the number
of array dimensions for a given c ∈ C. ITEM_CONFIG returns the item configuration of a given configuration. It
is used to implement split. Utility HAS_PTR_ITEM gives a macro representation of a boolean value. When the
configuration passed to HAS_PTR_ITEM has a (PTR) item it returns true, and false otherwise. This utility is useful
when implementing memory allocation and deallocation operations, as it determines whether it is needed to
generate dynamic memory allocation code or not.</p>
        <p>The macro library is implemented with two different interfaces. The first version requires the configurations
to be passed as an argument for every operation. The second version uses a lookup mechanism to retrieve
configurations for objects by their name. This works as follows. The user has to declare their configurations as
macros according to a naming scheme. Then the GET_OBJECT_CONFIG utility macro could retrieve the appropriate
value for a given object identifier:
#define CONFIGURE__input PTR(SCALAR)
#define CONFIGURE__table ARRAY(PTR(ARRAY(SCALAR)))</p>
        <p>The above snippet defines configurations for two objects. They can be retrieved by expanding
GET_OBJECT_CONFIG(input) and GET_OBJECT_CONFIG(table). The naming scheme is also configurable in the
library. In the second version of the interface, this utility macro is called automatically by the operations, so
they only take an object identifier instead of a configuration. In the following, this interface will be presented.</p>
        <p>Each operation on objects of configured data types is implemented with a specific macro. Thus for the
declare operation there is a corresponding macro named DECLARE, for valueOf there is VALUE_OF, and so on. An
interesting difference, which comes from the characteristics of the C language, is that the declaration of arrays
needs a different treatment, because it must contain the length of each array dimension. This is implemented
with a DECLARE_ARRAY variadic macro, mainly because it is problematic to handle the zero-parameter case of
variadic arguments.
struct input_t { uint32_t a, b; };
DECLARE(struct input_t, input);
DECLARE_ARRAY(char, table, 8, 32);
// struct input_t* input;
// char ((*(table[8]))[32]);</p>
        <p>The comments after each macro invocation show their expansion results. Under normal circumstances,
the result of the preprocessing phase is not directly visible to the user. It is not formatted by the compiler, and
the macro library does not try to eliminate unnecessary parentheses. The configurations defined earlier are used
to assemble the final declarations. Of course, it is not enough to declare these objects before using them. As
the configurations could contain ptr constructors, explicit allocation and deallocation is needed at the beginning
and at the end of their lifetime. This could be achieved as follows.</p>
        <p>ALLOCATE(input); // do { input = malloc(sizeof(*(input))); } while (0);
ALLOCATE(table);
// do { for (int i0 = 0; i0 &lt; sizeof(table) / sizeof((table)[0]); ++i0) {
// (table)[i0] = malloc(sizeof(*((table)[i0])));
// } } while (0);
// ...</p>
        <p>DEALLOCATE(table);
// do { for (int i0 = 0; i0 &lt; sizeof(table) / sizeof((table)[0]); ++i0) {
// free((table)[i0]);
// } } while (0);
DEALLOCATE(input); // do { free(input); } while (0);</p>
        <p>The expansions in the comments clearly show the complexity of the generated code for allocation and deallocation.
The bodies of the do-while blocks run only once, and the loop structure will be optimized out by the compiler.
They are only needed so that the macro invocation can be terminated with a semicolon even when multiple
statements are generated.</p>
        <p>Of course, the data stored in the configured objects should be accessed only with the appropriate operations,
instead of operating on them directly. This ensures correct access even when the configuration is changed
between compilations.
strcpy(VALUE_AT(table, 0), "Zero");
// strcpy(*((table))[0], "Zero");
// ...
scanf("%d %d", &amp;MEMBER_OF(input, a), &amp;MEMBER_OF(input, b));
// scanf("%d %d", &amp;(*(input)).a, &amp;(*(input)).b);
uint32_t key = (MEMBER_OF(input, a) + MEMBER_OF(input, b)) % 8;
// uint32_t key = ((*(input)).a + (*(input)).b) % 8;
printf("%s\n", VALUE_AT(table, key));
// printf("%s\n", *((table))[key]);</p>
        <p>Note that the address-of operator &amp; can safely be applied to the result of MEMBER_OF, as it will always result in a
value selected from a structure. It is clearly visible that the macro library does not try to simplify the syntax
tree, as it generates (*(input)).a instead of the equivalent input-&gt;a. It is also worth noting that the fields of
struct input_t are not configurable. If they were, value access of the fields would have to apply an additional
VALUE_OF operator around MEMBER_OF.</p>
        <p>Unfortunately, the case of configurable record fields is even more complicated. As there is no compile-time
reflection information behind the macro system, it is impossible to automatically determine the fields of
a structure. However, when the structure declaration itself is built up with macros as a preprocessor data
structure, there is hope to share its field information. The current version of this library follows a
simpler implementation. The following example shows how to use named configurations with structure fields.
#define CONFIGURE__t0 PTR(SCALAR)
#define CONFIGURE_MEMBER__test_t__x PTR(SCALAR)
#define CONFIGURE_MEMBER__test_t__y PTR(ARRAY(SCALAR))
typedef struct
{
int id;
DECLARE_MEMBER(test_t, bool, x);
DECLARE_MEMBER_ARRAY(test_t, char, y, 8);
} test_t;
DECLARE(test_t, t0);
ALLOCATE(t0);
ALLOCATE_MEMBER(test_t, VALUE_OF(t0), x);
ALLOCATE_MEMBER(test_t, VALUE_OF(t0), y);
MEMBER_OF(t0, id) = 0;
VALUE_OF_MEMBER(test_t, VALUE_OF(t0), x) = true;
strcpy(&amp;VALUE_OF_MEMBER(test_t, VALUE_OF(t0), y), "test");
printf("t0: { %d, %d, \"%s\" }\n",
MEMBER_OF(t0, id),
VALUE_OF_MEMBER(test_t, VALUE_OF(t0), x),
VALUE_OF_MEMBER(test_t, VALUE_OF(t0), y));
DEALLOCATE_MEMBER(test_t, VALUE_OF(t0), y);
DEALLOCATE_MEMBER(test_t, VALUE_OF(t0), x);
DEALLOCATE(t0);</p>
        <p>This listing shows how the complexity of the macro system explodes when there are configured record fields.
In the case of named configurations, different macros need to be used for the operations on members, because
they need a different name-lookup algorithm. This is why every *_MEMBER macro receives the name of the
encapsulating struct and the name of the actual member. This information is used to assemble a macro name
where the required configuration can be looked up. When the user chooses the alternative interface, where
configurations are directly passed to the operation macros, there is no need to define separate *_MEMBER macros,
as the original ones can be used everywhere. It is also important to see that the encapsulating struct object
must be resolved to its value with its own configuration before any member operation can be applied to it.</p>
        <p>While it is possible to implement configurations in the C preprocessing system, it is not very practical. The
configuration changes are propagated easily once the code is appropriately prepared, but unfortunately that
preparation requires a huge effort and is highly error-prone.</p>
        <p>Architecture specialization can be done by extending the configuration builder macros with additional
parameters. For instance, PTR can be extended to hold an additional value that identifies the target memory of
a pointer. In the implementation of the operations, this information can be extracted and used, for example, to
generate different allocation or dereference code for different pointers.</p>
        <sec id="sec-3-1-1">
          <title>C++ Template Metaprogramming</title>
          <p>While the previous subsection showed that it is possible to implement configurations in the preprocessing phase,
there is still space for improvement. Some of the scratchpad-aware systems also have C++ compilers available.
It is a known result that template metaprograms are Turing-complete in the absence of instantiation limits [Vel03].
Their integration in the C++ language, along with other language features like operator overloading, provides a
good basis for implementing configurations. The library presented in this section is implemented using the Boost
MPL Library [Abr04].</p>
          <p>As template metaprogramming also has its roots in functional programming, like preprocessor
metaprogramming, the representation of configurations follows a similar strategy: configurations are stored as type-level lists
of configuration tokens, where the tokens themselves are types. Most of the operations process these type-level
lists with folds and pattern matching on the tokens.
typedef struct {} T_SCALAR;
typedef struct {} T_PTR;
typedef struct {} T_ARRAY;
struct SCALAR : mpl::vector&lt;impl::T_SCALAR&gt; {};
template &lt;typename Tokens&gt;
struct PTR : mpl::push_front&lt;typename Tokens::type, impl::T_PTR&gt; {};
template &lt;typename Tokens&gt;
struct ARRAY : mpl::push_front&lt;typename Tokens::type, impl::T_ARRAY&gt; {};</p>
          <p>Builder constructions for these lists are provided too, thus the user can easily define named configurations as
type aliases:
typedef PTR&lt;SCALAR&gt; input_config;
typedef ARRAY&lt;PTR&lt;ARRAY&lt;SCALAR&gt; &gt; &gt; table_config;</p>
          <p>These configurations are essentially the same as those used in the previous subsection. An interesting aspect
of this technique is that the configuration chain is built up from nested template applications. Configurable
objects are declared in the following way:
configured&lt;struct input_t, input_config, true&gt; input;
configured_array&lt;char, table_config, false, 8, 32&gt; table;</p>
          <p>Declarations for scalar and array objects are different, as they were with the preprocessor macros. However, the
reason is different here: it is to improve type safety. Array-like configured objects support indexing, while
others do not. Both types are parametrized with a base type, a configuration, and a boolean value that
indicates whether automatic allocation and deallocation should be done for these objects. Configured arrays also
need to provide a size for each dimension; this makes configured_array a variadic template. While input will
be automatically allocated and deallocated, this has to be done explicitly for table (hence the false parameter
value above):
table.allocate();
// ...
table.deallocate();</p>
          <p>As the above listing shows, allocation and deallocation operations become instance methods of configured
objects in this system. This allows the implicit sharing of the underlying configuration. Other configuration
operations are also implemented with instance methods, and some of them with operator overloading:
strcpy(table[0], "Zero");
// ...
scanf("%d %d", &amp;input-&gt;a, &amp;input-&gt;b);
uint32_t key = (input-&gt;a + input-&gt;b) % 8;
printf("%s\n", table[key]());</p>
          <p>The valueAt operation is simply implemented with operator[] of configured_array objects. Also, the
memberOf operation is mapped to operator-&gt; of configured objects. The valueOf operation is implemented
with overloaded cast operators and constructors. When it is not possible to infer the required type cast, as in the
case of a printf parameter, explicit execution of valueOf can be requested by using the object as a functor,
that is, by calling operator().</p>
          <p>Similarly to the preprocessing library, utilities for configurations are also provided with templates: array_rank,
item_config and has_ptr_item are defined to produce the same results as their macro counterparts, but now as
type-level values.</p>
          <p>To understand how the library works, take a look at the hierarchy and the internal structure of configurable
objects.</p>
          <p>The data wrapped by a configurable object is stored in a member named value. Its type is derived using a
template metaprogram implementation of the generator function G. For instance, the type of this member could be
defined as typedef struct input_t *real_type for input, and typedef char (*real_type[])[] for table.
The type casting operator and the functor implementation return an object of value_of_type. This type is
derived from the configuration after removing its outer ptr constructors, if any. This ensures that these operations
implement the valueOf operation by dereferencing the outer layer of pointers in the representation. The result of
array indexing is of value_at_type, a type derived from the item type of the corresponding configuration. For
multi-dimensional arrays, indexing an outer dimension will return another object of type configured_array,
with decreased rank and a shorter configuration. The utility methods rank and size return the number of
dimensions and the length of the outer dimension, respectively.
typedef PTR&lt;SCALAR&gt; CFG_t0;
typedef PTR&lt;PTR&lt;SCALAR&gt; &gt; CFG_tx;
typedef PTR&lt;ARRAY&lt;SCALAR&gt; &gt; CFG_ty;
typedef struct test_t
{
int id;
configured&lt;bool, CFG_tx&gt; x;
configured_array&lt;char, CFG_ty, true, 8&gt; y;
} test_t;
configured&lt;test_t, CFG_t0&gt; t0;
t0-&gt;id = 0;
t0-&gt;x = true;
strcpy(t0-&gt;y, "test");
printf("t0: { %d, %d, \"%s\" }\n", t0-&gt;id, t0-&gt;x(), t0-&gt;y());</p>
          <p>Configuration of record fields is also very easy with this library. The user can reuse the configured and
configured_array templates to declare arbitrary fields of records. When automatic allocation and
deallocation is turned on, the lifetime of the fields is bound to that of the containing objects. As this is the default setting,
the declaration of field x omits this parameter entirely.</p>
          <p>Transforming existing software to use configurations needs only a little effort: the declarations of configurable
objects should be replaced. Adaptation of the operations is simple, thanks to operator overloading. Type checking
will ensure that each access of the configured objects is correct.</p>
          <p>Architecture specialization can be done similarly to the previous technique. Every configuration builder
template can be extended with meta-information about the memory hierarchy, and this can be processed in the
implementation of the operations.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>Source-to-source Transformations</title>
          <p>The previous two implementations are intrusive: they require the modification of existing sources when applied.
Using the C macro library requires replacing all declarations and operations where configurable objects are involved.
Working with the C++ template solution needs much less effort, but still, configurable objects have to be treated
differently from ordinary data types.</p>
          <p>This subsection shows a technique where applying configurations requires no preliminary changes to the
source code. The idea is to use automated source-to-source transformations that are parametrized with
configurations. In this paper we focus on the description and validation of configurations, and not on the execution of
the transformations themselves. Although the presented method could be applied to any appropriate programming
language, to make the results comparable with the previous sections, we decided to work on C source code.</p>
          <p>The role of configurations is to describe the data layout of types and objects across the memory hierarchy.
This information is actually represented in the existing source code as types. The key to generating configurations
from C types is to implement the inverse of the generator function G. Denote this function by G_C^-1 : T → C × T_B,
where T is the set of all possible C types, C is the set of configurations over a representation, and T_B is the set of
C base types. The latter set contains all scalar, enumeration and record types (introduced either with a struct
or union keyword). Type aliases should be fully resolved before applying G_C^-1. This is a partial function, as C
function pointer types are not supported by configurations.</p>
          <p>G_C^-1(τ) = scalar, when τ ∈ T_B; G_C^-1(τ) = ptr(G_C^-1(τ′)), when τ = τ′* (a pointer to τ′); G_C^-1(τ) = array(G_C^-1(τ′)), when τ = τ′[] (an array of τ′).</p>
          <p>Once this function is available, the question is how these configurations should be stored and presented to
the C programmer, and how the changes will be given as instructions to the transformation tool. The solution
is to create a configuration skeleton for the source files. The skeleton file will have the same structure as the
input file, and is also a valid C source file, but contains no statements at all. To create a skeleton, process the
input file as follows. Keep all type declarations, global objects and functions. For functions, only keep
the local variable declarations in their outermost blocks. Drop all function parameters, and change return
types to void. The preservation of the structure is needed to uniquely identify each type and object declaration
in the source file later. As inner blocks of function bodies are dropped, the identification will not be available
for local variables declared there, but practice shows that configurability is not a concern for short-lived,
deeply nested locals. What remains in the skeleton file is essentially a set of configurations, if we apply G_C^-1
to the types in the remaining declarations. The user can rewrite the types in the skeleton file, and use it as a
parameter to the transformation tool.</p>
          <p>The transformation starts with an analyzer phase. Both the input file and the configuration file (possibly a
modified skeleton) are parsed using a C parser, and a semantic analysis is executed to create a type-annotated
abstract syntax tree of the sources. The function G_C^-1 is applied to each type found in the configuration file. At the
same time, the structurally corresponding declaration is looked up in the input file for each declaration in the
configuration file. This makes it possible to apply G_C^-1 also to the types found in the input declarations, and
compare the old and the new configurations.</p>
          <p>Each pair of compared configurations must be compatible, that is, they must differ only in the number of ptr
constructors applied. This restricts the changes to data outsourcing and data inlining transformations.
The base types corresponding to the two configurations must also be the same. As changes are restricted to compatible
configurations, there is no way to enforce an invalid configuration for an object. When no corresponding
declaration is found in the input file for an element of the configuration file, or the compatibility check fails, the
transformation can signal an error and terminate.</p>
          <p>For each of the compatible configurations, the differences can be collected. These differences will control the
transformations executed on the input syntax tree. According to the old configuration, which is read from the input
file, it can be checked that the objects are used only according to the operations of configured objects. When
any non-matching access pattern is found, it cannot be guaranteed that the transformation will be successful.
This restriction is needed because C enables pointer arithmetic and other special ways to work with data. The
modified code for the operations can only be derived correctly when it is ensured that in the input code the given
object is used like an object previously configured with its old configuration.</p>
          <p>Unfortunately, correct source-to-source transformation of C programs is highly non-trivial. Preservation of
the preprocessing structure, comments and formatting is very difficult. Type aliases in configured types need to
be handled with care. Consider the following:
// input file
typedef int speed;
typedef int *place;
speed vertical_speed;
// modified configuration file
speed *vertical_speed;</p>
          <p>The old and new configurations of vertical_speed are scalar and ptr(scalar), which are compatible.
The application of G_C^-1 resolves all type aliases to produce the configurations and examine the base types, thus
giving int *vertical_speed when re-synthesized with G. This is incorrect, as picking another alias, like place
vertical_speed, would also be. To transform the declaration correctly, the new type should be written exactly
as it was in the configuration file. Detection of incorrect access patterns and synthesizing the
transformed code with the minimal amount of changes are also challenging problems. For these reasons, we only
implemented a few special cases in a tool based on the Clang compiler infrastructure.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Comparison of implementation techniques</title>
      <p>This section contains a comparison of the previously presented configuration implementation techniques. The
comparison is made based on the following 6 + 1 aspects.</p>
      <sec id="sec-4-1">
        <title>Recon guration</title>
        <p>Shows how much code needs to be modified manually by the user when a configuration changes. For the next
example, take a data outsourcing on the elements of an array, so the configuration changes from array(scalar)
to array(ptr(scalar)).</p>
        <p>Miller: elem_config [] ⇒ elem_config [] -&gt; dram
C macros: ARRAY(SCALAR) ⇒ ARRAY(PTR(SCALAR))
C++ templates: ARRAY&lt;SCALAR&gt; ⇒ ARRAY&lt;PTR&lt;SCALAR&gt;&gt;
Source transformations: elem_type arr[]; ⇒ elem_type *arr[];</p>
      </sec>
      <sec id="sec-4-2">
        <title>Storage of con gurations</title>
        <p>Describes where and how the con gurations are stored and what is their relationships to other parts of the code.</p>
        <p>Miller Configurations are an integral part of the language. They can be stored in the same files as
the corresponding program code, or in separate files. In the latter case, configurations can be included into the
program with a standard C preprocessor command. Configurations can be named and referenced later.
Objects and configurations are loosely coupled.</p>
        <p>C macros The user can write her own configuration definitions as macros. This enables storing them
either together with other parts of the code or separately. However, the configuration of record fields is much
less advanced, as they cannot belong together structurally as in the previous case.</p>
        <p>C++ templates This solution has almost the same properties as C macros from this perspective. The
only difference is that user-defined configurations can be defined as type aliases instead of macros. This
helps to group related configurations together.</p>
        <p>Source transformations Configurations are stored in separate files. They are valid C files, to ease their
handling and processing both for the users and the tools. The structure of the configurations follows the structure
of the original code. Objects and configurations are tightly coupled, as the configuration file must follow
the structure of the original code. However, this is usually not a problem, as configurations can be deleted
after the desired transformations are executed.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Syntax of con gurations</title>
        <p>Miller introduces a custom syntax for describing memory layers, locations and configurations, which the user
must learn in the beginning.</p>
        <p>C macros and C++ templates use almost the same syntax. The macro library uses a chain of macro
invocations, while the other uses nested template applications. The low number of configuration constructors
makes it easy to master in minutes.</p>
        <p>Source transformations use configuration files where each configuration is described as a standard C
declaration. This is the most natural and straightforward way to express the changes in types, as it uses
the same syntax to describe the transformation as the transformed language itself.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Amount of syntactic noise</title>
        <p>What the price of configurability is in terms of readability and maintainability of code.</p>
        <p>Miller has built-in constructions to support configurations. Basically, every object in the language is
configured, and the default operations work accordingly. Configurability "comes for free" in this
manner.</p>
        <p>C macros add huge syntactic noise. Every declaration gets replaced by a syntactically more complicated
one. This is also true for all operations done with these objects.</p>
        <p>C++ templates add less noise than macros, thanks to their better integration into the language and to
operator overloading. However, non-trivial changes have to be made when turning a non-configurable object
into a configured one.</p>
        <p>Source transformations have no syntactic noise at all, if the transformation engine is implemented well
and can preserve all the appropriate comments and formatting of the original source.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Re-usability of con gurations</title>
        <sec id="sec-4-5-1">
          <title>How easy it is to re-use an existing configuration for a new piece of code.</title>
          <p>Miller, C macros and C++ templates provide the ability to give names to configurations. These can be
referenced later in any other configuration or get assigned to newly defined objects.</p>
          <p>Source transformations use separate configurations for each declaration explicitly. However, as
configuration files are C source files, it would be possible to use preprocessor macros to produce the configurations.</p>
          <p>Using a single configuration file to transform multiple input files is also feasible.</p>
        </sec>
      </sec>
      <sec id="sec-4-6">
        <title>Applicability to existing code base</title>
        <p>Difficulties that arise when turning non-configured objects into configured ones.</p>
        <p>Miller is a domain-specific language that was designed with memory hierarchies and configurability in mind
from the very beginning. Every object and type expressed in Miller is configurable.</p>
        <p>C macros may require a huge amount of highly error-prone, non-trivial modifications of the original code
to turn a single object into a configured one.</p>
        <p>C++ templates reduce the drawbacks of macros by using operator overloading and instance methods.
The type checker also ensures that the programmer rewrites them correctly.</p>
        <p>Source transformations were designed especially to be applicable to existing code bases, so their cost is
zero from this perspective.</p>
      </sec>
      <sec id="sec-4-7">
        <title>Other bene ts and drawbacks</title>
        <p>Miller provides built-in configurations, and gives outstanding support for programming complex memory
hierarchies. However, it is always hard to introduce a new language when there is an industrially accepted
standard or a huge existing code base.</p>
        <p>C macros are simple enough for demonstration and prototyping purposes, but are very impractical to use
in real software.</p>
        <p>C++ templates are completely acceptable when the target platform provides a C++ compiler, which
is not always the case. When the number of configurable variables is small, and the library is introduced
at the beginning of the development, it can be a practical solution. The type safety of templates helps a lot
compared to macros, but template instantiation errors can be very long and confusing.</p>
        <p>Source transformations try to combine the benefits of the built-in constructions with an already
known language. However, implementing a transformation tool that handles all cases automatically and
correctly is a great challenge for difficult-to-refactor languages like C.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We introduced configurations to ease the change of data layout over different memory layers. The programmer
only needs to change the configurations of the data structures, and the required, highly non-trivial modifications
will be propagated automatically over the algorithms. This makes data layout changes less error-prone and more
efficient. This is especially important for performance tuning of applications on scratchpad-aware architectures.
Four different implementation techniques were shown: one with built-in constructions of a domain-specific
language, one with a C preprocessor library, a solution based on a C++ template library, and a source-to-source
transformation tool. The comparison shows that in practice, as expected, the C preprocessor metaprograms
are not really usable. However, under specific circumstances C++ template metaprograms could implement
configurable objects adequately. The real solution is a domain-specific language like Miller, or a sophisticated
source-to-source transformation tool. The former is unfortunately difficult to introduce in industrial scenarios,
while the latter is a great challenge to implement correctly for languages with a rich syntactic structure and
complex semantics like C.</p>
      <sec id="sec-5-1">
        <title>Acknowledgements</title>
        <p>We wish to thank the current and former members of the research group at the ELTE Software Technology Lab,
including Gergely Dévai, Boldizsár Németh, Zoltán Csörnyei, Zoltán Kelemen and Dániel Leskó. Their useful
and constructive work on this project made our publication possible.</p>
        <p>[Bpl] The Boost Library Preprocessor Subset for C/C++. http://www.boost.org/doc/libs/1_60_0/libs/preprocessor/doc/index.html</p>
        <p>[Car02] Carlos Carvalho. The gap between processor and memory speeds. In Proceedings of IEEE International Conference on Control and Automation, 2002.</p>
        <p>[Kar14] Mate Karacsony. Modeling C preprocessor metaprograms using purely functional languages. Proceedings of the 9th International Conference on Applied Informatics, 2014. Vol. 2. 85–92.</p>
        <p>[Krz11] Olaf Krzikalla, Kim Feldhoff, Ralph Müller-Pfefferkorn and Wolfgang E. Nagel. Scout: a source-to-source transformator for SIMD-optimizations. Euro-Par 2011: Parallel Processing Workshops, 2011. 137–145.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[Abr04] David Abrahams and Aleksey Gurtovoy. C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond. Addison Wesley Professional, 2004.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[Ban02] Rajeshwari Banakar, Stefan Steinke, Bo-Sik Lee, Mahesh Balakrishnan and Peter Marwedel. Scratchpad memory: design alternative for cache on-chip memory in embedded systems. Proceedings of the tenth international symposium on Hardware/software codesign. ACM, 2002. 73–78.</mixed-citation>
      </ref>
      <ref id="ref2a">
        <mixed-citation>[Nem13] Boldizsar Nemeth and Zoltan Csornyei. Stackless programming in Miller. Acta Universitatis Sapientiae, Informatica 5.2, 2013. 167–183.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[Nie97] Jarek Nieplocha, Robert Harrison, and Ian Foster. Explicit management of memory hierarchy. Advances in High Performance Computing. Springer Netherlands, 1997. 185–199.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Pan97]
          <string-name>
            <given-names>Preeti Ranjan</given-names>
            <surname>Panda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nikil D.</given-names>
            <surname>Dutt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Alexandru</given-names>
            <surname>Nicolau</surname>
          </string-name>
          .
          <article-title>Efficient utilization of scratch-pad memory in embedded processor applications</article-title>
          .
          <source>Proceedings of the 1997 European conference on Design and Test</source>
          . IEEE Computer Society
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Pra14]
          <string-name>
            <given-names>Prashantha</given-names>
            <surname>NR.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Vikram</given-names>
            <surname>TV.</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vaivaswatha</given-names>
            <surname>N.</surname>
          </string-name>
          .
          <article-title>Implementing Data Layout Optimizations in LLVM Framework</article-title>
          .
          <source>LLVM Developers' Meeting, October 28-29</source>
          ,
          <year>2014</year>
          . http://llvm.org/devmtg/2014-10/#talk14
        </mixed-citation>
      </ref>
      <ref id="ref5a">
        <mixed-citation>
          [Sin16]
          <string-name>
            <given-names>Artjoms</given-names>
            <surname>Sinkarovs</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sven-Bodo</given-names>
            <surname>Scholz</surname>
          </string-name>
          .
          <article-title>Type-driven Data Layouts for Improved Vectorisation</article-title>
          .
          <source>Concurrency and Computation: Practice &amp; Experience</source>
          ,
          <year>2016</year>
          .
          <fpage>2092</fpage>
          -
          <lpage>2119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Vel03]
          <string-name>
            <given-names>Todd L.</given-names>
            <surname>Veldhuizen</surname>
          </string-name>
          .
          <article-title>C++ templates are Turing complete</article-title>
          . http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.3670
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Ver02]
          <string-name>
            <given-names>Manish</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Steinke</surname>
          </string-name>
          and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Marwedel</surname>
          </string-name>
          .
          <article-title>Data partitioning for maximal scratchpad usage</article-title>
          .
          <source>Proceedings of the 2003 Asia and South Pacific Design Automation Conference</source>
          . ACM,
          <year>2003</year>
          .
          <fpage>77</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Ver04]
          <string-name>
            <given-names>Manish</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lars</given-names>
            <surname>Wehmeyer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Marwedel</surname>
          </string-name>
          .
          <article-title>Dynamic overlay of scratchpad memory for energy minimization</article-title>
          .
          <source>Proceedings of the 2nd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis. ACM</source>
          ,
          <year>2004</year>
          .
          <fpage>104</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>