Metamodeling Lightweight Data Compression Algorithms and its Application Scenarios

Juliana Hildebrandt1, Dirk Habich1, Thomas Kühn2, Patrick Damme1, and Wolfgang Lehner1

1 Technische Universität Dresden, Database Systems Group, Dresden, Germany
2 Technische Universität Dresden, Software Technology Group, Dresden, Germany
{firstname.lastname}@tu-dresden.de, thomas.kuehn3@tu-dresden.de

Abstract. Lossless lightweight data compression is a very important optimization technique in various application domains like database systems, information retrieval, or machine learning. Despite this importance, there currently exists no comprehensive and non-technical abstraction. To overcome this issue, we have developed a systematic approach using metamodeling that focuses on the non-technical concepts of these algorithms. In this paper, we describe COLLATE, the metamodel we developed, and show that each algorithm can be described as a model conforming to COLLATE. Furthermore, we use COLLATE to specify a compression algorithm language COALA, so that lightweight data compression algorithms can be specified and modified in a descriptive and abstract way. Additionally, we present an approach to transform such descriptive algorithms into executable code. As we are going to show, our abstract and non-technical approach offers several advantages.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes. In: C. Cabanillas, S. España, S. Farshidi (eds.): Proceedings of the ER Forum 2017 and the ER 2017 Demo track, Valencia, Spain, November 6th-9th, 2017, published at http://ceur-ws.org

1 Introduction

The continuous growth of data volumes is still a major challenge for efficient data processing. This applies not only to database systems [1, 2], but also to other areas such as information retrieval [3, 4] or machine learning [5]. With growing capacities of main memory, efficient analytical in-memory data processing becomes viable [6, 7]. However, the gap between the computing power of CPUs and the main memory bandwidth continues to widen and is now the main bottleneck [1]. To overcome this issue, the mentioned application domains share a common approach: (i) encode the values of each data attribute as a sequence of integers using some kind of dictionary encoding [1, 8] and (ii) apply lightweight data compression algorithms to each sequence of integer values. Besides reducing the amount of data, operations can be performed directly on compressed data [1, 9].

For the lossless compression of a sequence of integer values, a large corpus of lightweight algorithms has been developed [1, 3, 4, 9–13, ?] (without claim of completeness). In contrast to heavyweight algorithms, like arithmetic coding [14], Huffman [15], or Lempel-Ziv [16], lightweight algorithms achieve comparable or even better compression rates [1, 3, 4, 9–13, ?]. Moreover, the computational effort for (de)compression is lower than for heavyweight algorithms. To achieve these unique properties, each lightweight compression algorithm employs one or more basic compression techniques such as frame-of-reference [9, 11] or null suppression [1, 13], which allow the appropriate utilization of contextual knowledge like value distribution, sorting, or data locality. Every single algorithm is important and has its field of application [17]. This is true not only for current, but also for future algorithms.

Our Contributions and Outline.
Despite this importance, there currently exists no comprehensive and non-technical abstraction for this domain. Therefore, the algorithms are only presented technically. However, this makes both the selection and the adaptation of algorithms to the respective context difficult. To overcome that, we have developed an approach using metamodeling that focuses on the non-technical concepts of lightweight data compression algorithms. Thus, our main contributions in this paper are:

1. We present our metamodeling approach with regard to separation of concerns by summarizing (i) the core results of our algorithm analysis and (ii) the derived metamodel called COLLATE in Section 2.
2. We apply COLLATE to develop a compression algorithm language COALA in Section 3. This allows us to specify and modify lightweight data compression algorithms in a descriptive and non-technical way. Additionally, we present an approach to transform such descriptive algorithms into executable C/C++ code.
3. We evaluate our language and execution approach using case studies in Section 4. In particular, we show (i) that we are able to describe existing as well as new algorithms with COALA and (ii) that our transformation approach produces correct executable code.

We close this paper with related work and a conclusion in Sections 5 and 6.

2 Metamodeling Lightweight Compression Algorithms

In this paper, we limit our discussion to lossless lightweight integer compression, but the described approach applies to decompression as well. Before we introduce COLLATE, we summarize the results of our algorithm analysis with regard to common non-functional aspects.

2.1 Properties of Data Compression Algorithms

To compress a finite sequence of integer values losslessly, all available compression algorithms use the following five basic techniques: frame-of-reference (FOR) [9, 11], delta coding (DELTA) [12, 13], dictionary compression (DICT) [1, 9], run-length encoding (RLE) [1, 13], and null suppression (NS) [1, 13]. FOR and DELTA represent each value as the difference to a certain given reference value (FOR) or to its predecessor value (DELTA), respectively. DICT replaces each value by its unique key given by a dictionary. The objective of these three well-known techniques is to represent the original data as a sequence of small integers, which is then well suited for actual compression using the NS technique. NS is the most well-studied of all lightweight compression techniques. Its basic idea is the omission of leading zeros in the bit representation of small integers. Finally, RLE tackles uninterrupted sequences of occurrences of the same value, so-called runs. In its compressed format, each run is represented by its value and its length. Thus, the compressed data is a sequence of such pairs.

Generally, these five techniques address different data levels. While FOR, DELTA, DICT, and RLE consider the logical data level, NS addresses the physical level of bits or bytes. This explains why lightweight data compression algorithms are always composed of one or more of these techniques. In the following, we denote the techniques from the logical level as preprocessing techniques for the physical compression with NS. These techniques can be further divided into two groups depending on how the input values are mapped to output values (often called codewords). FOR, DELTA, and DICT map each input value to exactly one integer as output value (1:1 mapping).
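To make these logical-level 1:1 mappings concrete, the following minimal C++ sketch (our own illustration; the function names and the plain array interface are assumptions, not taken from the paper or any particular library) applies FOR against a given reference value and DELTA against the predecessor; the actual size reduction is left to a subsequent NS stage:

  #include <cstdint>
  #include <cstddef>

  // FOR: represent each value as the difference to a common reference value.
  void for_encode(const uint32_t* in, size_t n, uint32_t ref, uint32_t* out) {
      for (size_t i = 0; i < n; ++i)
          out[i] = in[i] - ref;              // small numbers if values cluster around ref
  }

  // DELTA: represent each value as the difference to its predecessor.
  void delta_encode(const uint32_t* in, size_t n, uint32_t* out) {
      uint32_t prev = 0;                     // convention: the first value is kept as-is
      for (size_t i = 0; i < n; ++i) {
          out[i] = in[i] - prev;             // small numbers if the sequence is sorted or local
          prev = in[i];
      }
  }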
The objective is to achieve smaller numbers that can be better compressed at the bit level. In RLE, not every input value is necessarily mapped to an encoded output value, because a successive subsequence of equal values is encoded in the output as a pair of run value and run length (N:1 mapping). The NS technique is either a 1:1 or an N:1 mapping, depending on the implementation [18].

The genericity of these basic techniques is the foundation for tailoring the algorithms to different data characteristics. The NS technique has been studied most extensively, and the very large number of specific algorithms shows the diversity of implementations for a single technique. The pure NS algorithms can be divided into the following classes [18]: (i) bit-aligned, (ii) byte-aligned, and (iii) word-aligned. While bit-aligned NS algorithms try to compress an integer using a minimal number of bits, byte-aligned NS algorithms compress an integer with a minimal number of bytes (1:1 mapping). The word-aligned NS algorithms encode as many integer values as possible into 32-bit or 64-bit words (N:1 mapping). The logical-level techniques have usually been investigated in connection with the NS technique [17].

Concluding, a lightweight data compression algorithm is always a combination of one or more of these basic techniques. The algorithms differ greatly in how they apply the individual techniques. This great variability results from the multitude of ways in which the input sequence of integers can be subdivided. For each subsequence, the algorithms analyze the contained values to optimize the application of the basic techniques.

2.2 COLLATE Metamodel

Thus, subdivision and the specific application of the five basic techniques are important aspects of each lightweight data compression algorithm. For the specific application, the algorithms usually determine parameters for each subsequence, and with the help of the calculated parameter values, the techniques are specialized for the subsequences.

Fig. 1: COLLATE Metamodel (class diagram relating a Compression Scheme to the concepts Tokenizer, Parameter Calculator with Parameter Definitions and their logical and physical calculations, Encoder, Combiner, and Recursion).

COLLATE Concepts. To describe every single one of these lightweight compression algorithms, we identified five core concepts, highlighted as gray boxes in Fig. 1. Because compression algorithms are often characterized by a hierarchical data subdivision, we used the composite design pattern as foundation. Thus, each algorithm is a Compression, which is associated with one Tokenizer, zero or one Parameter Calculator, one Encoder or one further Compression, and one Combiner. The different concepts fulfill the following tasks:

Recursion: This concept is responsible for the hierarchical data subdivision and for applying the concepts included in the Recursion to each data subsequence.

Tokenizer: This concept is responsible for dividing an input sequence into finite subsequences or single values.

Parameter Calculator: The optional concept Parameter Calculator determines parameter values for finite subsequences or single values. The specification of the parameter values is done using parameter definitions. Parameters can be calculated on the logical as well as on the physical level.

Encoder: The fourth concept determines the encoded form (codewords) for integer values to be compressed at the bit level. The concrete encoding is specified using functions representing the basic techniques on the logical and the physical data level.

Combiner: The Combiner arranges the encoded values (codewords) and the calculated parameters for the output representation.

Concept Properties. The concepts Recursion and Combiner should be self-explanatory. For the Tokenizer concept, we identified three classifying characteristics. The first one is data dependency. A data-independent Tokenizer outputs a specific number of values without considering the values themselves, while a data-dependent Tokenizer is used when the decision about how many values to output is driven by knowledge of the concrete values. For example, a data-dependent Tokenizer is necessary for the technique RLE, where the Tokenizer output depends on the values themselves. A second characteristic is adaptivity. A Tokenizer is adaptive if its calculation rule changes depending on previously read data. The third property is the necessary input for decisions. Most Tokenizers need only a finite prefix of a data sequence to decide how many values to output; the rest of the sequence is used as further input for the Tokenizer and processed in the same manner. Only those Tokenizers are able to process data streams with potentially infinite data sequences. Moreover, there are Tokenizers requiring the whole (finite) input sequence to decide how to subdivide the sequence.

Fig. 2: Interaction and Data Flow of the COLLATE Metamodel (the Tokenizer splits the input sequence into finite subsequences and a tail according to its calculation rule, the Parameter Calculator derives calculated parameters per subsequence with optional adaptation, the Encoder or a nested Recursion produces compressed values, and the Combiner assembles the compressed sequence).

Parameters are often required for the compression. Therefore, we introduce the Parameter Calculator concept, which follows special rules (parameter definitions) for the calculation of several parameters. There are different kinds of parameter definitions. We often need single numbers like a common bit width for all values or mapping information for dictionary-based encodings. We call a parameter definition adaptive if the knowledge of a calculated parameter for one token (output of the Tokenizer) is needed for the calculation of parameters for further tokens at the same hierarchical level. For example, an adaptive parameter definition is necessary for DELTA. Calculated parameters have a logical representation for further calculations, including the encoding of values, and optionally a representation at the bit level: on the one hand they are needed to calculate the encoding of values, on the other hand they mostly have to be stored additionally to allow decoding.

The Encoder processes an atomic input, where the output of the Parameter Calculator and other parameters are additional inputs. The input is a token that cannot or should not be subdivided anymore. In practice, the Encoder mostly obtains a single integer value to be mapped into a binary code (1:1 mapping techniques). An exception is RLE as an N:1 mapping technique, where the Parameter Calculator maps a sequence of equal values to its run length and the Encoder maps the sequence to that single value. Like parameter definitions, the Encoder calculates a logical representation of its input value and a bit encoding.
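As a concrete illustration of this N:1 case (a minimal sketch of our own, not code generated from a COLLATE model), the run length plays the role of a calculated parameter while the run value is the encoded output:

  #include <cstdint>
  #include <cstddef>

  // RLE: each maximal run of equal values becomes one (value, length) pair.
  // Returns the number of pairs written; out must have room for 2*n values.
  size_t rle_encode(const uint32_t* in, size_t n, uint32_t* out) {
      size_t pairs = 0;
      size_t i = 0;
      while (i < n) {
          uint32_t value = in[i];            // "Encoder": the run value
          size_t run = 1;                    // "Parameter Calculator": the run length
          while (i + run < n && in[i + run] == value)
              ++run;
          out[2 * pairs]     = value;
          out[2 * pairs + 1] = (uint32_t)run;
          ++pairs;
          i += run;
      }
      return pairs;
  }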
2.3 Interaction of COLLATE Concepts

Besides the single concepts, we are also able to specify (i) the interactions of the concepts and (ii) the data flow through the concepts for lightweight data compression algorithms, as illustrated in Fig. 2. The dashed lines highlight optional data flows. In general, this arrangement can be used as a metaobject protocol for lightweight compression algorithms, and each algorithm conforms to this interaction of the specific concepts.

Fig. 3 (left) shows the graphical model of our running example FORBP32: a Tokenizer (in ↦ 32) splits the input sequence into subsequences of 32 values and a tail; a Parameter Calculator derives ref: in ↦ in.min and width: in ↦ in.max − ref, each encoded with in ↦ in.bin(32); an inner Recursion with a Tokenizer (in ↦ 1) and an Encoder val: in ↦ in − ref, in ↦ in.bin(width) processes the single values; the Combiners assemble the compressed sequence as (in32 ∘ width ∘ ref)∗. Fig. 3 (right) shows the corresponding COALA code:

  coala {
    Algorithm (
      "FORBP32",
      Recursion (
        Tokenizer ( in => 32 ),
        ParameterCalculator (
          ParameterDefinition [ Int ] ( 'ref, in => in.min, in => in.bin(32) ),
          ParameterDefinition [ Int ] ( 'width, in => ( in.max - 'ref + 1 ).bw, in => in.bin(32) )
        ),
        Recursion (
          Tokenizer ( in => 1 ),
          ParameterCalculator ( ),
          Encoder ( in => in - 'ref, in => in.bin('width) ),
          Combiner ( in => in )
        ),
        Combiner ( in => in o 'width o 'ref )
      )
    )
  }

  Fig. 3: Graphical and COALA Code for FORBP32.

3 Application Scenarios

The goal of this section is to show different application scenarios of COLLATE. In detail, we apply COLLATE (i) to design and describe algorithms in an abstract way as models, (ii) to define a compression algorithm language COALA for an unambiguous, declarative, non-technical specification of algorithms, and (iii) to transform descriptive algorithms into executable code.

3.1 Algorithms as Models

Generally, each lightweight data compression algorithm can be modeled as an instance of the COLLATE metamodel. To show that, we use a self-constructed algorithm called FORBP32 as a running example. This algorithm does not occur in the recent literature, but it works well when (i) the values have good locality (consecutive values lie in the same range), (ii) there are no single huge outliers, and (iii) the absolute size of the values to compress does not matter. FORBP32 combines the frame-of-reference (FOR) technique at the logical level, which yields smaller numbers, with binary packing as the subsequent compression. Binary packing (BP) is a kind of null suppression, where a common bit width is used for the encoding of all values within a finite subsequence [12]. Thus, in FORBP32, a common reference value and bit width are calculated for each subsequence of 32 integer values and stored with 32 bits each, before the 32 encoded values follow.

We can model FORBP32 by specifying all COLLATE concepts as depicted in Fig. 3. Like any algorithm model, FORBP32 is a Recursion. It consists of a Tokenizer that outputs 32 integer values per subsequence independently of the data, formalized by the function (in ↦ 32), and a Parameter Calculator with two parameter definitions: the first determines (i) the calculation rule for the reference value as the minimum of each subsequence, expressed by the mapping (in ↦ in.min), and (ii) its binary encoding with 32 bits (in ↦ in.bin(32)); the second specifies (i) the calculation rule for the bit width on the basis of the range of the 32 values, expressed by (in ↦ in.max − ref), and (ii) its encoding with 32 bits.
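For instance, with illustrative numbers of our own (a four-value stand-in for a 32-value block) and the width rule bw(max − ref + 1) used in the COALA code of Fig. 3 and the generated code of Fig. 4, the two parameters can be computed as in this small C++ snippet:

  #include <cstdint>
  #include <cstdio>
  #include <algorithm>

  // bw(x): number of bits needed for the binary representation of x (x >= 1).
  static int bw(uint32_t x) { int b = 0; while (x) { ++b; x >>= 1; } return b; }

  int main() {
      uint32_t block[4] = {107, 110, 98, 103};                          // stand-in for a 32-value block
      uint32_t ref   = *std::min_element(block, block + 4);             // 'ref = 98
      int      width = bw(*std::max_element(block, block + 4) - ref + 1); // bw(13) = 4
      std::printf("ref=%u width=%d\n", (unsigned)ref, width);           // every value fits into 4 bits after subtracting ref
      return 0;
  }

It prints ref=98 width=4, i.e., after subtracting the reference value, every value of this block fits into 4 bits.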
At the end, the Combiner arranges the bit width, the reference value, and the corresponding sequence of 32 encoded integer values. This is expressed by (in32 ∘ width ∘ ref)∗, which means that the binary representations of the reference value and the bit width are followed by the encoding of 32 data values for every block of 32 values. Instead of an Encoder, we use a further Recursion, where the processing of a subsequence containing 32 values is done. This processing consists of a Tokenizer outputting single values and an Encoder, which (i) calculates the difference between a data value and the reference value and (ii) encodes the result with the previously determined number of bits, before the Combiner constructs the sequence of 32 encoded compressed integer values.

At this point, we can draw three conclusions. First, the FORBP32 algorithm can be expressed by the colluding concepts according to COLLATE. The same can be shown for a wide range of lightweight compression algorithms. Second, the models/algorithms can be easily adjusted. An example is the adjustment of the subdivision into 32, 64, or 128 values, which results in new algorithms. Third, new algorithms can be designed by applying user-specified calculation rules for the logical as well as the physical data level, as shown by FORBP32.

3.2 Compression Algorithm Language COALA

To simplify the coding of lightweight data compression algorithms, we designed a domain-specific language called COALA allowing a declarative and non-technical specification. COALA is based on COLLATE and offers the opportunity to express instances of concepts as higher-order functions, i.e., concepts are functions taking functions as parameters. We decided to embed COALA into the host language Scala (Scala Programming Language, https://www.scala-lang.org), because Scala is a high-level programming language, which is object-oriented as well as functional and which treats higher-order functions as first-class citizens. The advantage of this embedding is that COALA inherits many of the features and benefits of Scala [19]. Thus, we are able to use the higher-order function behavior of Scala to instantiate our five COLLATE concepts with an adjustment by functions. This leads to a declarative description of algorithms as shown below.

The COALA code of the FORBP32 example algorithm is depicted on the right side of Fig. 3 and is more or less a 1:1 mapping of the corresponding model. Each COLLATE concept is implemented using a domain-specific higher-order function of COALA. For example, the higher-order function tokenizer represents the COLLATE concept Tokenizer. In FORBP32, this tokenizer contains a mapping function from the input sequence in to the number 32 as function parameter. Here, we declare that the input sequence has to be subdivided into subsequences of 32 values each. We do not specify how to do it, which is the difference between a declarative and an imperative programming approach. Another example is the definition of the parameter named 'ref, which determines the minimum of a sequence and encodes it with 32 bits. It can be described by ParameterDefinition[Int]('ref, in => in.min, in => in.bin(32)), because a parameter definition is characterized by a name, a mapping from the input sequence to a logical (integer) value, which can be used for further calculations, and a mapping from the logical value to the binary encoding with 32 bits as indicated by the predefined function in.bin(32).
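To complement the compression-side walkthrough, a decompression counterpart can be sketched as well (our own illustration, not generated code: it assumes the ref/width/data block order of Fig. 4, LSB-first bit packing that may straddle 32-bit word boundaries, and a value count that is known and a multiple of 32):

  #include <cstdint>
  #include <cstddef>

  // Reads 'bits' bits starting at *bitpos within the word stream 'in' (LSB first).
  static uint32_t read_bits(const uint32_t* in, size_t* bitpos, int bits) {
      uint32_t result = 0;
      for (int i = 0; i < bits; ++i, ++*bitpos) {
          uint32_t bit = (in[*bitpos / 32] >> (*bitpos % 32)) & 1u;
          result |= bit << i;
      }
      return result;
  }

  // Decompresses 'count' values (a multiple of 32) produced by a FORBP32-style compressor.
  void forbp32_decode(const uint32_t* in, size_t count, uint32_t* out) {
      size_t bitpos = 0;
      for (size_t block = 0; block < count; block += 32) {
          uint32_t ref   = read_bits(in, &bitpos, 32);     // reference value, 32 bits
          uint32_t width = read_bits(in, &bitpos, 32);     // common bit width, 32 bits
          for (size_t i = 0; i < 32; ++i)
              out[block + i] = read_bits(in, &bitpos, (int)width) + ref;  // undo FOR
      }
  }

The sketch only needs the parameters stored by the Combiner, which illustrates why the calculated parameters must be encoded in addition to the data values.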
To further support the declarative approach of COALA, the data flow and the interaction of the concepts are implicitly realized using the COLLATE metaobject protocol as illustrated in Fig. 2. Therefore, we omit an explicit data flow coding in COALA. All major and minor concepts shown in Fig. 3 can be fully expressed by higher-order functions. In summary, FORBP32 is expressed by 10 concepts containing 10 simple lambda functions in a declarative way. The code is expressive and easy to understand, since it complies with the model instance. Furthermore, the code is without any technical detail about the how.

Concluding, with COLLATE we have a systematic approach to understand, to model, and to adapt compression algorithms. The compression algorithm language COALA allows us to express these algorithms in a descriptive and non-technical way. Adaptations for reasons of optimization, like the adjustment of the data subdivision, the calculation of parameters, or the order of the compressed values, can be implemented by changing single values without having to consider side effects, which could easily occur when changing native executable code.

3.3 Transformation of COALA Code to Executable Code

Our COALA approach allows designers of lightweight data compression algorithms to specify their algorithms in a novel declarative and domain-specific way. To execute such declarative algorithms, we transform COALA code into executable C/C++ code using generative programming [20]. Our generative programming approach for this transformation is based on the COLLATE concepts and is realized using Scala macros [21].

Generally, we defined a fixed transformation for each COLLATE concept depending on the properties described in Section 2. We illustrate our approach using the running example algorithm FORBP32. The resulting C/C++ code is depicted in Fig. 4.

   1  inline uint32_t min(uint32_t* in, int cntr, int step) {...}
   2  inline uint32_t max(uint32_t* in, int cntr, int step) {...}
   3  // algorithm
   4  size_t forbp32(uint32_t* in, int length, uint32_t* out) {
   5    // Initialization
   6    uint32_t* initout = out;
   7    int bitcounter = 0;
   8    // Parameter Declaration Level 0
   9    int ref;
  10    int width;
  11    // Recursion and Tokenizer Level 0
  12    int step0 = 32;
  13    for (int c0 = 0; c0 < length; c0 += step0) {
  14      // Parameter Calculation Level 0
  15      ref = min(in, c0, step0);
  16      width = bw(max(in, c0, step0) - ref + 1);
  17      // Encoding Parameters
  18      *out |= ref << bitcounter;
  19      out = ...
  20      bitcounter = ...
  21      ...
  22      // Recursion and Tokenizer Level 1
  23      int step1 = 1;
  24      for (int c1 = c0; c1 < c0 + step0; c1 += step1) {
  25        // Encoding Data
  26        *out |= (in[c1] - ref) << bitcounter;
  27        out = ...
  28        bitcounter = ...
  29      }
  30    }
  31    return out + (bitcounter != 0) - initout;
  32  }

  Fig. 4: Generated C/C++ Code for FORBP32.

Fundamentally, a COALA Algorithm is transformed into a C/C++ generic function, which expects (i) an array with 32-bit integer values to be compressed, (ii) its length, and (iii) an output array as function parameters. Thus, we specify that we represent the input sequence of integers as an array on the execution level and that the further processing has to be done on this array structure.
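The write steps elided in Fig. 4 (lines 19–21 and 27–28) advance the output word and the bit counter. One possible completion is sketched below; it is our assumption, not the actual generated code, and presumes LSB-first packing into zero-initialized 32-bit output words in which a codeword may straddle a word boundary:

  #include <cstdint>

  // Appends the 'bits' low-order bits of 'value' to the output stream (LSB-first packing).
  // 'out' points to the current output word, 'bitcounter' counts the bits already used in it;
  // the output buffer is assumed to be zero-initialized.
  inline void write_bits(uint32_t*& out, int& bitcounter, uint32_t value, int bits) {
      *out |= value << bitcounter;              // as in lines 18 and 26 of Fig. 4
      if (bitcounter + bits >= 32) {            // current word is full
          ++out;
          *out = 0;
          int written = 32 - bitcounter;        // bits that fit into the old word
          if (bits > written)
              *out |= value >> written;         // spill the remaining bits into the new word
          bitcounter = bits - written;
      } else {
          bitcounter += bits;
      }
  }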
Furthermore, each generated function returns the size of the compressed output in words. For our FORBP32, the generated C/C++ function starts with lines 4-7 and ends with lines 31-32 in Fig. 4.

An algorithm is always a Recursion iterating over the input. For this reason, the Recursion concept is always mapped to a loop, in our case iterating over an array. For the exact loop specification, the following Tokenizer has to be included. In FORBP32, we have two such cases, which are nested. Therefore, the nested Recursions and Tokenizers are also mapped to nested loops with fixed step widths of 32 and 1. This is done in lines 12-13 and 23-24 in Fig. 4. The Tokenizers in FORBP32 are very simple, but it is also possible to calculate the steps in Tokenizers as a function of parameters or in a data-dependent way. This is necessary, for example, for the RLE technique. The corresponding C/C++ transformation is illustrated in Fig. 5.

  1  inline int calc1(uint32_t* in, size_t length) {...}
  2  ...
  3  int step1 = calc1(in + 0, length - 0);
  4  for (int c0 = 0; c0 < length; c0 += step1) {
  5    ...
  6    step1 = calc1(in + c0, length - c0);
  7  }

  Fig. 5: Realization of a Data Dependent (Adaptive) Tokenizer.

Parameter Calculator: For the algorithm FORBP32, the reference value and the bit width have to be calculated for each subsequence of 32 values. This is done for the logical value within the outer loop. The generated code is depicted in lines 15-16 of Fig. 4, where each aggregation function or other operation used in COALA is mapped to a predefined C/C++ function or a C/C++ operator, as shown in lines 1-2.

The code for the Encoder is transformed in the same way. In addition to calculations at the logical level, calculations on the physical level are still needed. The physical calculations are always transformed into bit shift operations. The binary encoding of the 32 values with the same bit width is done as depicted in lines 26-28, the binary encoding of the parameters in lines 18-21. The transformation of the Combiner determines the code position of the physical parameter and data encoding. In our case, first the parameters are written to the output in lines 18-21 and afterwards the data values.

To summarize, the transformation of COALA code to executable C/C++ is mainly driven by the transformation of the single COLLATE concepts. On the one hand, lightweight data compression algorithms iterate over the input integer sequence and subdivide it into subsequences or single values, which is specified by Recursion and Tokenizer. Thus, both concepts have to be considered jointly for the translation. On the other hand, the remaining concepts are more or less independent; only their order must be maintained, which is necessary to establish the data flow depicted in Fig. 2.

4 Case Studies

With the help of two use cases, we want to highlight the applicability and the advantages of our language as well as of our execution approach. While the first use case considers an existing algorithm, the second use case evaluates our self-constructed algorithm FORBP32.

Existing Null Suppression Algorithm. There is a large variety of lightweight data compression algorithms implementing only the null suppression (NS) technique. The pure NS algorithms can be divided into the following classes [18]: (i) bit-aligned, (ii) byte-aligned, and (iii) word-aligned.
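To give a flavour of the byte-aligned class, the following sketch (our own generic varint-style encoder; it is not the varint-SU layout discussed next) stores each integer in as few 7-bit units as needed and uses the high bit of each byte as a continuation flag:

  #include <cstdint>
  #include <cstddef>

  // Generic varint-style byte-aligned NS: 7 data bits per byte,
  // high bit set on every byte except the last one of a value.
  size_t varint_encode(const uint32_t* in, size_t n, uint8_t* out) {
      size_t pos = 0;
      for (size_t i = 0; i < n; ++i) {
          uint32_t v = in[i];
          while (v >= 0x80) {
              out[pos++] = (uint8_t)(v | 0x80);   // 7 data bits + continuation bit
              v >>= 7;
          }
          out[pos++] = (uint8_t)v;                // last unit, continuation bit clear
      }
      return pos;                                  // number of compressed bytes
  }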
To show the applicability of our approach for these kinds of algorithms, we decided to use the existing byte-oriented algorithm varint-SU [4]. The algorithm suppresses leading zeros for each single integer by determining the smallest number of 7-bit units that are needed for the binary representation without losing any information. To support decoding, the number of 7-bit units is stored as an additional descriptor. This is done in a unary way by placing a single descriptor bit at the high end of each 7-bit unit, as depicted in Fig. 6.

  Fig. 6: Compressed Data Format for varint-SU (an uncompressed 32-bit integer value and its compressed codeword consisting of 7-bit data units, each extended by one descriptor bit).

  coala {
    Algorithm (
      "VarintSU",
      Recursion (
        Tokenizer ( in => 1 ),
        ParameterCalculator (
          ParameterDefinition [ Int ] ( 'units, in => in.bw(7), in => in.unary011 ) ),
        Encoder ( in => in, in => in.bin('units * 7) ),
        Combiner ( in => zip ( List ( (1, 'units), (7, in) ) ) ) )
    )
  }

  (a) COALA Code

  (b) Runtime Comparison: compression speed in mis over the number of values [in million] for the generated code, the optimized generated code, and the native implementation by Lemire.

  Fig. 7: Evaluation of varint-SU [12].

The algorithm specification in COALA is shown in Fig. 7(a). As we can see, the descriptive code consists of eleven code lines. The Tokenizer subdivides the input sequence into single integer values (indicated by in => 1). For each value, the Parameter Calculator determines the number of necessary 7-bit units using an appropriate function. The determined number is used in the subsequent Encoder to compress the integer value. The Combiner subdivides the bit-level representation of the input value into several 7-bit units and arranges the bits of the encoded data and the descriptor.

  Table 1: Comparison of the Code Sizes for varint-SU.

                          Absolute measures              Relative measures
                        COALA  Generated  Lemire    COALA/Generated  COALA/Lemire
  Lines of Code            11         36      47                31%           23%
  Number of Characters    220        953     908                23%           24%

For this algorithm, a native C/C++ implementation by Lemire exists [12]. Table 1 compares the code size of our descriptive COALA specification with the generated C/C++ code and the native implementation. While our generated code has 36 lines of code, the native implementation has 47, because the native implementation mainly uses case distinctions. In contrast, our COALA description has eleven lines of code, which is only 23% to 31% of the size of the executable code. The same applies to the number of characters, as depicted in Table 1. Furthermore, we compared the compression result of our generated C/C++ code with the result of the Lemire implementation for different input sequences using our compression benchmark framework [22]. In all cases, the results are identical, which is a clear indicator for the correctness of our generated executable code.

In addition to the comparison of the code sizes, we also compared the runtime behavior of our generated and the native variant. The runtime comparison was done on a standard server with an Intel Core i5 2.9GHz and 8GB main memory. Both compression algorithms are executed single-threaded and compiled with g++ 4.2.1 using the optimization flag -O3.
In the experiment, we varied the number of integer values to be compressed; the compressed representations randomly vary between one and four 7-bit units per integer value. We report compression speeds in million integers per second (mis), as is usually done in that domain [17]. As we can see in Fig. 7(b), our generated code performs poorly compared to the native implementation: it is approximately 7 times slower. The reason is that we are doing the bit encoding with a loop, which is avoided in the native implementation by explicitly unrolling the loop. We included that unrolling concept in our transformation approach, leading to a comparable compression speed as illustrated in Fig. 7(b). Some other optimization steps are case distinctions as well as the avoidance of bit shift calculations, of copying, and of branching. However, these effects must be investigated more closely, which we will do in future work.

This analysis has been carried out for various NS algorithms, all of which have led to similar results. Therefore, we can draw the following conclusions:

1. Our COALA language can be used to specify NS algorithms in a descriptive and non-technical way, whereby those descriptions are much shorter than any imperative and technical implementation.
2. Our transformation approach produces correct executable code, which is comparable to native implementations in terms of code size. The currently generated code is slower than native code, but we are convinced that comparable performance can be achieved with automatic optimizations.

Novel and Self-Constructed Algorithms: As shown with FORBP32, we are able to specify and modify novel lightweight data compression algorithms with COALA. Furthermore, COALA algorithms can be executed with the help of our transformation approach. The COALA code of FORBP32 is much smaller than any executable code, because the COALA code includes only non-technical details. This leads to more clarity and understandability.

5 Related Work

To the best of our knowledge, there exists only a high-level modularization concept for data compression in the domain of data transmission [23]. That scheme subdivides compression methods merely into (i) a data model adapting to the data already read and (ii) a coder encoding the incoming data by means of the calculated data model. In that work, the area of conventional data compression was considered. Here, a multitude of approaches like arithmetic coding [14], Huffman [15], or Lempel-Ziv [16] exists. These classical algorithms achieve high compression rates, but the computational effort is high. Therefore, these techniques are usually denoted as heavyweight. The modularization concept of [23] is well-suited for this kind of heavyweight algorithm, but it does not adequately reflect the different properties of the lightweight algorithms. For example, it supports neither a sophisticated segmentation nor a multi-level hierarchical data partitioning.

Additionally, a wide range of metamodels and domain-specific languages for various application domains has been proposed. Hitherto, none of them has considered the field of lightweight data compression. Considering the fact that lightweight data compression algorithms are an important optimization technique in different application domains, COLLATE and the domain-specific language COALA are all the more important.

6 Conclusion

The continuous growth of data volumes is still a major challenge for efficient data processing.
To tackle this challenge with regard to in-memory data processing, application domains like databases, information retrieval, or machine learning follow a common approach: (i) encode the values of each data attribute as a sequence of integers using some kind of dictionary encoding and (ii) apply lossless lightweight data compression algorithms to each sequence of integer values. Aside from reducing the amount of data, a large variety of operations can be performed directly on compressed data.

In this paper, we described COLLATE, the metamodel we developed for the large and evolving corpus of lightweight data compression algorithms, and showed that each algorithm can be described as a model conforming to COLLATE. Furthermore, we used our metamodel to specify a novel compression algorithm language COALA, so that lightweight data compression algorithms can be expressed in a descriptive and non-technical way. Additionally, we presented an approach to transform such descriptive algorithms into executable code.

From our perspective, there are two areas of future application where our presented approach can play an important role: a test platform and system integration. For the test platform, we target the design of a specific development and test environment for lightweight data compression algorithms. In addition to the development of specific algorithms, the behavior of the algorithms must also be checked. To systematically test the algorithm behavior, we want to combine our COALA approach with our lightweight data compression benchmark system [22]. This enables us to establish a comprehensive test platform for this emerging optimization technology.

In addition, we want to extend our approach with the aim of integrating the large and evolving corpus of lightweight data compression algorithms into corresponding systems. For example, to the best of our knowledge, there is currently no in-memory database system available providing this corpus to a full extent. Therefore, the most challenging task is now to define an appropriate integration approach. With COALA, we have a language approach, but the transformation has to be database-specific, which will be feasible.

References

1. D. J. Abadi, S. Madden, and M. Ferreira, "Integrating compression and execution in column-oriented database systems," in SIGMOD, 2006.
2. P. A. Boncz, M. Zukowski, and N. Nes, "MonetDB/X100: Hyper-pipelining query execution," in CIDR, 2005.
3. D. Arroyuelo, S. González, M. Oyarzún, and V. Sepulveda, "Document identifier reassignment and run-length-compressed inverted indexes for improved search performance," in SIGIR, 2013.
4. A. A. Stepanov, A. R. Gangolli, D. E. Rose, R. J. Ernst, and P. S. Oberoi, "SIMD-based decoding of posting lists," in CIKM, 2011.
5. A. Elgohary, M. Boehm, P. J. Haas, F. R. Reiss, and B. Reinwald, "Compressed linear algebra for large-scale machine learning," PVLDB, vol. 9, no. 12, 2016.
6. P. A. Boncz, M. L. Kersten, and S. Manegold, "Breaking the memory wall in MonetDB," Commun. ACM, vol. 51, no. 12, 2008.
7. T. Kissinger, T. Kiefer, B. Schlegel, D. Habich, D. Molka, and W. Lehner, "ERIS: A NUMA-aware in-memory storage engine for analytical workloads," in ADMS, 2014.
8. C. Binnig, S. Hildenbrand, and F. Färber, "Dictionary-based order-preserving string compression for main memory column stores," in SIGMOD, 2009, pp. 283–296.
9. M. Zukowski, S. Héman, N. Nes, and P. A. Boncz, "Super-scalar RAM-CPU cache compression," in ICDE, 2006.
10. V. N. Anh and A. Moffat, "Index compression using 64-bit words," Softw., Pract. Exper., vol. 40, no. 2, 2010.
11. J. Goldstein, R. Ramakrishnan, and U. Shaft, "Compressing relations and indexes," in ICDE, 1998.
12. D. Lemire and L. Boytsov, "Decoding billions of integers per second through vectorization," Softw., Pract. Exper., vol. 45, no. 1, pp. 1–29, 2015.
13. M. A. Roth and S. J. Van Horn, "Database compression," SIGMOD Rec., vol. 22, no. 3, 1993.
14. F. Silvestri and R. Venturini, "VSEncoding: Efficient coding and fast decoding of integer lists via dynamic programming," in CIKM, 2010.
15. I. H. Witten, R. M. Neal, and J. G. Cleary, "Arithmetic coding for data compression," Commun. ACM, vol. 30, no. 6, 1987.
16. D. A. Huffman, "A method for the construction of minimum-redundancy codes," Proceedings of the Institute of Radio Engineers, vol. 40, no. 9, 1952.
17. J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Trans. Inf. Theor., vol. 23, no. 3, 1977.
18. P. Damme, D. Habich, J. Hildebrandt, and W. Lehner, "Lightweight data compression algorithms: An experimental survey (experiments and analyses)," in EDBT, 2017, pp. 72–83.
19. W. X. Zhao, X. Zhang, D. Lemire, D. Shan, J. Nie, H. Yan, and J. Wen, "A general SIMD-based approach to accelerating compression algorithms," ACM Trans. Inf. Syst., vol. 33, no. 3, 2015.
20. L. Tratt, "Domain specific language implementation via compile-time meta-programming," ACM Trans. Program. Lang. Syst., vol. 30, no. 6, pp. 31:1–31:40, 2008.
21. K. Czarnecki and U. W. Eisenecker, Generative Programming – Methods, Tools and Applications. Addison-Wesley, 2000.
22. E. Burmako, M. Odersky, C. Vogt, S. Zeiger, and A. Moors, "Scala macros," URL http://scalamacros.org, 2012.
23. P. Damme, D. Habich, and W. Lehner, "A benchmark framework for data compression techniques," in TPCTC, 2015.
24. R. Williams, Adaptive Data Compression, ser. Kluwer international series in engineering and computer science: Communications and information theory. Springer US, 1991.