Model-driven Video Decoding: An Application Domain for Model Transformations

Christian Schenk, Sonja Schimmler, and Uwe M. Borghoff
Computer Science Department, Universität der Bundeswehr München, 85577 Neubiberg, Germany
{c.schenk,sonja.schimmler,uwe.borghoff}@unibw.de

Abstract. Modern serialization formats for digital video data are designed in a way that allows for the combination of high compression rates, high quality, and efficient encoding and decoding. As real-time decoding is often a requirement, concrete decoders are usually implemented in a way that best fits a specific architecture and, thus, are not portable. In scenarios where the capability to decode existing video data must be retained even if the underlying architecture changes, portability is more important than performance. For such scenarios, we propose decoder specifications that serve as templates for concrete decoder implementations on different architectures. As a high level of abstraction guarantees system-independence, we use (meta)models and model transformations for that purpose. Consequently, every decoder specification is (partially) executable, which simplifies the development process. In this paper, we describe our concepts for model-driven video decoding and give an overview of a prototypical implementation.

Keywords: model driven engineering; models; model transformations; video decoding

1 Introduction

In our current work, we are focusing on the problem of preserving digital video content. One challenge is to ensure that today's data can be restored on future architectures. There are several approaches that address this problem for general content [1]. One widely used approach is to use standard file formats that are likely to be readable on future architectures (such as PDF/A for documents and JPEG 2000 for images). But, to our knowledge, there is no such standard format for digital video content that is suitable for long-term preservation. The existing formats that are widely used (e.g., H.264 [4]) are designed for another purpose: they combine high compression and high quality, and they allow for the development of efficient encoders and decoders. As these formats involve the use of lossy compression techniques, it is impossible to transform an already encoded digital video into another format without risking an additional loss of information. As this implies that we should not introduce a new standard format, we now focus on the question of how decoding capabilities can be retained. As common decoder implementations are normally system-specific and thus not portable, we propose the definition of human-readable and machine-processable decoder specifications that serve as a template for the development of suitable decoders on different systems. Our goal is to provide specifications that are partially executable and that allow a potential developer to create code (possibly supported by tools) that serves as a basis for the implementation of a functional (but possibly inefficient) decoder prototype for any existing video compression standard. As the capability to decode a digital video can be retained this way, existing videos can be preserved without any (additional) loss of information. In our previous work [10], we introduced the general idea of our approach. In this paper, we focus on details of a prototypical implementation that serves as a proof of concept.
We will show how we combined (meta)models, model transformations, R scripts [9], and a control program in order to specify essential parts of the decoding process for H.264 encoded videos, and we explain how these parts constitute the basis for the implementation of an H.264 decoder. As digital video data tend to require large amounts of memory, we also introduce our approach as a potential application domain for testing MDE tools and techniques in combination with large models.

The remainder of this paper is structured as follows: in Section 2, we give a brief overview of the video coding basics that are needed for the following sections. In Section 3, we give an overview of our approach and describe some aspects in more detail. In Section 4, we give an overview of related work before we conclude and give an outlook in Section 5.

2 Serialization Formats for Digital Video Data

Usually, every (digital) video consists of frames, which, when presented consecutively for a fixed period of time, create the impression of movement. Each frame mainly contains the image data and some time information (called the composition time) enabling a video player to play the video correctly. Due to the memory that would be required, storing a complete video using well-known image file serialization formats (such as PNG, JPEG or BMP) is not an option. Instead, modern video compression formats, such as the H.264 standard [4], combine different compression techniques in order to allow for suitable file sizes. Lossy compression algorithms play an essential role. Simply put, these algorithms do not distinguish between "similar" values; when stored, they are just handled equally. Consequently, these algorithms are irreversible, i.e., video encoding generally implies a loss of information. The H.264 standard, which is widely used and which will serve as a running example in this paper, is briefly introduced in the following.

2.1 H.264 basics

The H.264 standard is used for video streaming scenarios, in combination with Blu-ray discs and, of course, also for locally stored video files. As it can be regarded as a de facto standard, we focus on H.264 encoded videos for our considerations. As other standards are specified similarly if not identically (e.g., MPEG-4 AVC), we assume, however, that it is straightforward to use our approach for these standards.

An H.264 video track is structured in samples, which are (usually) grouped into chunks. Each sample can be seen as a (still encoded) representation of a single frame in the video, i.e., it contains all the information needed to reconstruct a two-dimensional field of pixels. Pixel data are usually stored using the YCbCr color model, which distinguishes between one luma (Y) and two chroma channels (Cb and Cr), where each channel is encoded independently.

Compression basics: In principle, each sample is encoded using lossy image compression. In addition to that, however, delta compression is used in order to remove redundancies caused by pixel similarities between successive frames and (spatially) nearby regions within one frame. We briefly explain how it works: within a single frame, it is common that pixels are similar or even equal to neighboring pixels. Furthermore, two successive frames often only differ in details (in order to allow for smooth frame changes). The delta compression's main principle is to store only "quantized" difference values between two similar pixels. The quantization, which is a simple integer division, further increases the desired effect: the resulting values tend to be smaller and occur more often. (Incidentally, the fact that the integer division is not completely reversible is one of the reasons why the complete compression is lossy.) A minimal sketch of this principle is given below.
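To make the delta-plus-quantization principle concrete, the following Java sketch shows the idea on a single pixel value. It is only an illustration of the principle, not the actual H.264 transform and quantization pipeline (which operates on transform coefficients); the step size qStep and all names are our own.

    // Minimal sketch of delta compression followed by quantization.
    // Not the real H.264 pipeline; qStep and all names are illustrative.
    public final class DeltaQuantSketch {

        // Encoder side: store only the quantized difference to the prediction.
        static int encode(int pixel, int prediction, int qStep) {
            int residual = pixel - prediction;  // delta compression
            return residual / qStep;            // quantization: lossy integer division
        }

        // Decoder side: rescale the stored value and add the prediction back.
        static int decode(int stored, int prediction, int qStep) {
            return prediction + stored * qStep; // inverse quantization is only approximate
        }

        public static void main(String[] args) {
            int pixel = 131, prediction = 128, qStep = 4;
            int stored = encode(pixel, prediction, qStep);    // 0: a small, frequent value
            int restored = decode(stored, prediction, qStep); // 128, not 131: information lost
            System.out.printf("stored=%d restored=%d original=%d%n", stored, restored, pixel);
        }
    }

The small and frequently occurring values produced this way are exactly what the variable length encoding described below exploits.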
In an H.264 encoded video, every sample is divided into a grid of 16 × 16 pixel blocks (called macroblocks), which are traversed row by row and column by column during the decoding process. Each macroblock may depend on one or more other macroblocks whose data have already been decoded. When a macroblock is decoded, the data of its dependencies are used to first predict [4] an intermediate data representation. Afterwards, the explicitly stored difference values are added in order to restore the macroblock's actual data. Macroblocks that only depend on data of neighboring macroblocks (within the same frame) are called intra-predicted, whereas most of the macroblocks of an H.264 encoded video are usually inter-predicted, i.e., they refer to arbitrary regions of other frames using motion vectors (in combination with frame ids).

Inter-predicted macroblocks automatically cause inter-dependencies between different frames. In order to constrain possible inter-dependencies (which is essential when a video is not decoded from the beginning), the sequence of all frames is partitioned into groups of pictures (called GOPs), whereby two frames can only depend on each other if they belong to the same GOP. Consequently, every GOP contains one or more (intra-predicted) frames that do not depend on any other frame and thus only consist of intra-predicted macroblocks. All the other (inter-predicted) frames contain inter-predicted macroblocks and depend on one or more other frames.

Even though delta compression plays an essential role in the overall encoding, it is more of a preparation for the actual compression: the results of the delta compression are compressed using variable length encoding (e.g., Huffman coding), which ensures that more frequent values are assigned shorter codes than less frequent ones.

Decoding order and composition order: An H.264 encoded video permits sequential decoding, i.e., forward jumps are generally not necessary during the decoding process. A frame can only be restored if all the frames it depends on have already been decoded. Hence, if a frame A depends on a frame B, B must be stored before A within the serialization. Nevertheless, the H.264 standard allows frames to depend on other frames that have a greater composition time, i.e., which have to be displayed later during playback. That is why we have to distinguish between the decoding order and the composition order: frames are serialized in decoding order but have to be presented in composition order. Thus, after being restored in decoding order, all frames have to be put into composition order, as the following sketch illustrates.
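The following Java sketch illustrates the reordering step under the simplifying assumption that decoded frames are represented by an id and a composition time; the Frame record and all names are placeholders, not types from the H.264 standard or from our implementation.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Minimal sketch: frames are restored in decoding order and then
    // re-sorted by composition time for playback. All names are illustrative.
    public final class CompositionOrderSketch {

        record Frame(int id, long compositionTime) { }

        public static void main(String[] args) {
            // Decoding order: frame 3 is stored before frame 2 because frame 2
            // depends on it, although frame 2 is displayed first.
            List<Frame> decodingOrder = List.of(
                    new Frame(1, 0), new Frame(3, 80), new Frame(2, 40));

            List<Frame> compositionOrder = new ArrayList<>(decodingOrder);
            compositionOrder.sort(Comparator.comparingLong(Frame::compositionTime));

            compositionOrder.forEach(f ->
                    System.out.println("display frame " + f.id() + " at t=" + f.compositionTime()));
        }
    }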
3 Model-driven Video Decoding

In this section, we give an overview of our approach to model-driven video decoding and describe some details of its model-based parts.

3.1 Overview of the Approach

Regardless of the actual scenario and the further processing, the main task of any decoder is to transform an encoded binary representation (e.g., an H.264 serialization) into a sequence of frames. Therefore, we have decided to use a suitable abstraction of that principle as a basis for our approach. As illustrated in Fig. 1, the model-driven decoding process is divided into five phases: The modeling phase converts the original video into a model representation that corresponds to the H.264 metamodel, which formalizes H.264 content (see Fig. 2). The preparation phase's purpose is to partition the complete work into independent "parts", which we call task chains in the following. Each task chain contains all the information that is needed to decode one GOP. In the execution phase, every task chain is actually decoded, i.e., the dependencies are resolved, and the pixel data are restored. In the finalization phase, the result of the execution phase is sorted in accordance with the composition order and transformed into a model that corresponds to the frame sequence metamodel, which formalizes general video content (see Fig. 3). The unmodeling phase's purpose is simple: it transforms such a model into the final result, i.e., into a sequence of concrete frames.

[Fig. 1. Model-driven video decoding – phases: Modeling (Java), Preparation (ATL, KM3/EMF), Execution (R), Finalization (ATL, KM3/EMF), Unmodeling; intermediate artifacts: H.264 video, H.264 data model, decoder task model, sample sequence model, frame sequence model]

[Fig. 2. H.264 input metamodel (classes VideoTrack, Chunk, Sample, Macroblock) – details of the abstract class Macroblock and all its subclasses have been omitted for clarity]

[Fig. 3. Frame sequence metamodel (classes Video, Frame, Picture)]

In order to describe the decoding process in an architecture-independent way, our decoder specifications unify different abstraction mechanisms. As the modeling and the unmodeling phases essentially perform pre- and post-processing steps, which depend on the underlying architecture and on the concrete scenario, they are not in the scope of such a specification. In the next section, we give some details of how the three remaining phases have actually been implemented. The sketch below outlines how the five phases chain together.
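To summarize the data flow of Fig. 1 in code, the following Java sketch chains the five phases as plain method calls. The types and method names are hypothetical placeholders for the models and phases named above, not our actual interfaces.

    // Minimal sketch of the five-phase pipeline of Fig. 1; all types and
    // method names are illustrative placeholders.
    public final class DecodingPipelineSketch {

        interface H264DataModel { }        // output of the modeling phase
        interface DecoderTaskModel { }     // task chains, one per GOP
        interface SampleSequenceModel { }  // restored, still unordered samples
        interface FrameSequenceModel { }   // frames in composition order
        interface FrameSequence { }        // final, concrete frames

        FrameSequence decode(byte[] h264Serialization) {
            H264DataModel h264 = modeling(h264Serialization);    // Java
            DecoderTaskModel tasks = preparation(h264);          // ATL on KM3/EMF models
            SampleSequenceModel samples = execution(tasks);      // R scripts
            FrameSequenceModel frames = finalization(samples);   // ATL on KM3/EMF models
            return unmodeling(frames);                           // Java
        }

        // Phase bodies are omitted; they stand for the steps described above.
        H264DataModel modeling(byte[] video) { throw new UnsupportedOperationException(); }
        DecoderTaskModel preparation(H264DataModel m) { throw new UnsupportedOperationException(); }
        SampleSequenceModel execution(DecoderTaskModel t) { throw new UnsupportedOperationException(); }
        FrameSequenceModel finalization(SampleSequenceModel s) { throw new UnsupportedOperationException(); }
        FrameSequence unmodeling(FrameSequenceModel f) { throw new UnsupportedOperationException(); }
    }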
3.2 Details of the Approach

For the formalization of all the intermediate data representations, which serve as input and output for the different phases, we use EMF-based [11] metamodels. Consequently, the complete decoding process may be regarded as a series of model transformations. Indeed, we use model transformations as the means to formalize the preparation and the finalization phase. Within the execution phase, however, the inter- and intra-predicted values must be determined in order to restore the pixel values. As this process involves different mathematical calculations, we have decided to describe this phase using a mathematical abstraction. As we want our decoder specifications to be human-comprehensible as well as machine-processable, we have decided to use languages that are text-based. In summary, we use ATL [5] model transformations, KM3 [6] defined metamodels, and R scripts [9]. We give some details in the following.

KM3-based metamodels: As the KM3 language makes it convenient to define EMF metamodels as simple text, we have used it to define all the necessary metamodels. For the preparation phase, for example, we have defined 5 metamodels, which are used for the formalization of 7 (internal) models. The following excerpt shows the definition of the class Frame of the frame sequence metamodel (see Fig. 3).

    class Frame {
        attribute frameId : Int32;
        attribute compositionTime : Int64;
        reference picture container : Picture;
    }

ATL model transformations: The transformations of the preparation and the finalization phase are written in ATL, using mostly its declarative language features. We have decided to use ATL because it provides declarative as well as imperative features, is based on EMF, is well integrated into the Eclipse IDE, and can also easily be executed programmatically. Generally, every language that fulfills these requirements is an appropriate alternative to ATL (such as the Epsilon Transformation Language (ETL) [8]). For the preparation phase, for instance, we have defined 5 ATL transformations, containing 16 rule definitions (with a total of about 190 LOC). One example is a transformation that creates a model that explicitly stores the GOPs. The following excerpt shows the responsible ATL rule:

    rule VideoTrack2Video {
        from
            vt : H264!VideoTrack
        to
            v : GOP!Video (
                gops <- vt.gopSmpls->collect(smpls | thisModule.createGOP(smpls))
            )
    }

The ATL helper gopSmpls does the actual partitioning and determines a sequence of sample sequences representing the GOPs.

R scripts: For the abstraction of the mathematical operations we use R scripts. The R language was originally designed for statistical calculations. We have chosen it for our approach as it allows us to express and execute all the mathematical operations that we need for the decoding (including matrix operations). In principle, every language that provides basic as well as matrix operations, e.g., Octave (www.gnu.org/software/octave), would be a suitable alternative. For the determination of the luma and chroma values, for instance, we have defined 25 R scripts (with a total of about 600 LOC). The following excerpt, for example, shows how the luma values of a macroblock are determined. The variable prediction contains the intra-predicted values, whereas the parameter residual contains the (pre-processed) delta values. (The function Clip1Y simply ensures that the result is in the correct range.)

    for (x in 1:16) {
        for (y in 1:16) {
            values[x, y] <- Clip1Y(residual[x, y] + prediction[x, y])
        }
    }

DSL-based control program: For coordination, we use control programs written in a DSL that we have defined with the framework Xtext [2]. Xtext is a framework that allows for the definition of DSLs (and the generation of corresponding tools) based on a grammar specification. Such a control program constitutes the frame of the specification as it defines how (meta)models, transformations, and R scripts are connected. The following example shows how metamodels and models are declared within the control program:

    metamodel MM_H264 origin h264.km3.ecore
    model H264 conforms to MM_H264 origin in.h264

The next excerpt shows how a transformation H264ToGOP is introduced, which is stored in H264ToGOP.asm and transforms the model H264 into a model GOP.

    transformation H264ToGOP origin H264ToGOP.asm {
        input model H264
        output model GOP
    }

3.3 Implementation Status

As explained before, essential steps of the decoding work have already been specified using different abstraction mechanisms. Each step we have abstracted so far was originally implemented in the form of a Java tool set; consequently, we have a reference implementation permitting us to generate suitable test data for every abstracted decoding step. We have developed a Java library that allows us to access and extract all information of an H.264 encoded video. This library is used for the implementation of the modeling phase. The sketch below indicates how the resulting model instances can be created programmatically with EMF.
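As a rough indication of what the modeling phase boils down to on the EMF side, the following Java sketch creates a single VideoTrack instance (cf. Fig. 2) via EMF's dynamic API and serializes it. This is a minimal sketch under the assumption that the KM3-derived metamodel is available as h264.km3.ecore; the file names, the chosen attribute values, and the use of XMI serialization are our own illustration, not the actual library code.

    import java.io.IOException;
    import org.eclipse.emf.common.util.URI;
    import org.eclipse.emf.ecore.EClass;
    import org.eclipse.emf.ecore.EObject;
    import org.eclipse.emf.ecore.EPackage;
    import org.eclipse.emf.ecore.resource.Resource;
    import org.eclipse.emf.ecore.resource.ResourceSet;
    import org.eclipse.emf.ecore.resource.impl.ResourceSetImpl;
    import org.eclipse.emf.ecore.xmi.impl.EcoreResourceFactoryImpl;
    import org.eclipse.emf.ecore.xmi.impl.XMIResourceFactoryImpl;

    // Minimal sketch: create one VideoTrack instance dynamically and
    // serialize it as the output of the modeling phase. File names and
    // attribute values are illustrative.
    public final class ModelingSketch {

        public static void main(String[] args) throws IOException {
            ResourceSet rs = new ResourceSetImpl();
            rs.getResourceFactoryRegistry().getExtensionToFactoryMap()
              .put("ecore", new EcoreResourceFactoryImpl());
            rs.getResourceFactoryRegistry().getExtensionToFactoryMap()
              .put(Resource.Factory.Registry.DEFAULT_EXTENSION, new XMIResourceFactoryImpl());

            // Load the H.264 metamodel (derived from its KM3 definition).
            EPackage h264Pkg = (EPackage) rs
                .getResource(URI.createFileURI("h264.km3.ecore"), true).getContents().get(0);
            rs.getPackageRegistry().put(h264Pkg.getNsURI(), h264Pkg);

            // Instantiate the VideoTrack class and set two of its attributes.
            EClass trackClass = (EClass) h264Pkg.getEClassifier("VideoTrack");
            EObject track = h264Pkg.getEFactoryInstance().create(trackClass);
            track.eSet(trackClass.getEStructuralFeature("width"), 512);
            track.eSet(trackClass.getEStructuralFeature("height"), 288);

            // Serialize the model so that it can be fed into the preparation phase.
            Resource out = rs.createResource(URI.createFileURI("in.h264"));
            out.getContents().add(track);
            out.save(null);
        }
    }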
Furthermore, we have developed a Java application that is able to decode every intra-predicted frame of an H.264 encoded video file (stored in the mp4 file format). We have also developed a Java tool that can process control program files.

Up to now, the specification of the preparation phase is complete. In combination with the control program definition, an H.264 model can automatically be transformed into a decoder task model (containing the task chains). We have also already defined all the necessary R scripts that are needed to decode intra-predicted macroblocks, and we have tested them with our Java application, which uses the open source library renjin (www.renjin.org) for interpreting the R code. In a next step, we want to integrate the execution phase into the decoder specification. The missing link is a conversion of the model-based representation into an R-compatible input format. Here, we want to check if this conversion can be implemented using languages that allow for the specification of model-to-text transformations (such as the Epsilon Generation Language (EGL), which is related to ETL [8]). The finalization phase has also already been specified, but will certainly need an update when the execution phase is integrated.

3.4 Scalability

We have chosen abstraction mechanisms that are executable in order to simplify the development process of functional (but not necessarily efficient) decoders for arbitrary architectures. Supposing that the most effective optimization measures are system-specific and therefore decrease the level of abstraction, we have generally neglected performance considerations within the design of the specifications. For practical reasons, we have made one exception: the models within the preparation phase do not contain any macroblock data, as these data are not needed before the execution phase. Consequently, an H.264 model (that conforms to the metamodel illustrated in Fig. 2) does not contain instances of the class Macroblock (and its subclasses). As these instances (including the objects they reference) contain the information needed to restore the actual pixel data, the size of H.264 models can be drastically reduced.

For an evaluation, we generated the H.264 models for three different videos and determined their sizes before and after removing the macroblock data. As models can be regarded as graphs, we simply counted the number of nodes and edges to specify the models' size. Furthermore, we measured the time it took to execute the preparation phase and determined the percentage of this time spent on the model transformations (MTs). Finally, we measured the time it took to decode all the intra-predicted frames (I-frames) when using the renjin-based Java application. The results are listed in Table 1.

Currently, the frames are decoded sequentially. But as each GOP can be decoded completely independently of the others, decoding them simultaneously is expected to significantly decrease the execution time of the decoding process, as sketched below.
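The following Java sketch indicates how such a parallelization might look: the independent task chains are submitted to a thread pool instead of being processed in a loop. TaskChain, DecodedGop, and decode() are placeholders for our decoder task model and the (currently R-based) execution of one chain; this is a sketch of the idea, not part of our implementation.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Minimal sketch: decode independent GOP task chains in parallel.
    // TaskChain, DecodedGop and decode() are illustrative placeholders.
    public final class ParallelExecutionSketch {

        interface DecodedGop { }
        interface TaskChain { DecodedGop decode(); }

        static List<DecodedGop> decodeAll(List<TaskChain> chains) throws Exception {
            ExecutorService pool =
                    Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
            try {
                // Each task chain carries everything needed to decode one GOP,
                // so the chains can be decoded independently of each other.
                List<Callable<DecodedGop>> jobs = new ArrayList<>();
                for (TaskChain chain : chains) {
                    jobs.add(chain::decode);
                }
                List<DecodedGop> result = new ArrayList<>();
                for (Future<DecodedGop> future : pool.invokeAll(jobs)) {
                    result.add(future.get()); // keeps the original GOP order
                }
                return result;
            } finally {
                pool.shutdown();
            }
        }
    }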
Table 1. Evaluation results for test videos (System: Debian 8, CPU: i7-4790 3.6 GHz, RAM: 32 GB)

                                     video 1        video 2        video 3
    Frames (I-frames)                2540 (52)      8189 (273)     163327 (2373)
    Resolution (w × h)               512 × 288      1920 × 1080    1280 × 720
    Model graph's nodes/edges
      - original                     21.6M/24.6M    1760M/1900M    14500M/15700M
      - reduced                      5.08k/13.0k    9.04k/24.6k    327k/646k
    Preparation phase (MTs)          ~2 s (75%)     ~5 s (83%)     ~7 min (96%)
    R-based decoding of I-frames     ~24 min        ~1 day         ~4 days

    Abbreviations: k for kilo (×10³), M for mega (×10⁶).

4 Related Work

Our approach utilizes MDE techniques in the domain of long-term preservation in order to address the issue of preserving digital video content in a way that it can be restored on future architectures. As far as we know, there are no similar approaches. In the following, we present two approaches based on image encoding standards. Motion JPEG 2000 [3] is a part of the JPEG 2000 standard that simply applies the corresponding image compression to each frame. As JPEG 2000 is already used for the preservation of digital images [7], such a combination would in principle allow for the preservation of digital video content. In another approach [12], digital videos are preserved using an XML representation. Two variants are distinguished: first, the video data are completely transformed into primitive XML. Second, the video frames are converted into image files that are suitable for long-term preservation (such as JPEG 2000) and referenced within the XML representation. Both variants involve the decompression of lossy compressed parts, which naturally leads to larger file sizes.

An important difference between our approach and the approaches named before is that we generally allow for delta compression, i.e., frames can have dependencies on each other. Avoiding delta compression simplifies the decoding, but results in larger file sizes. Contrary to our approach, the two approaches also imply recoding, which results in a potential loss of information.

5 Conclusion and Outlook

In this paper, we have continued our presentation of a model-based approach for architecture-independent video decoding [10] with a focus on the model-based parts and their realization. At this point, our approach does not provide full support for the decoding of inter-predicted frames, and the conversion between models and R-compatible representations is still hard-coded in Java. Nevertheless, so far we have succeeded in finding appropriate abstractions for all essential parts of the decoding process. Our approach involves the processing of large models. Even though the efficiency aspect plays a subordinate role, optimizations that do not impede the applicability might improve the development process. Besides, the utilized models may also be used as potential test data for common MDE tools.

After finishing the specification of the H.264 decoding process, we plan to focus on the design of a serialization format that simplifies the implementation of the modeling phase, can be used to preserve H.264 encoded data without additional loss of authenticity, and allows for "suitable" file sizes. In a further step, we want to evaluate our approach by asking developers who are unfamiliar with the H.264 standard to implement a decoder based on our specification. In this context, we also want to find out whether and in which way our decoder specifications can also be used to simplify the development process by serving as a basis for automatic code generation.
References

1. Borghoff, U.M., Rödig, P., Scheffczyk, J., Schmitz, L.: Long-Term Preservation of Digital Documents: Principles and Practices. Springer-Verlag, Berlin Heidelberg (2006)
2. Efftinge, S., Völter, M.: oAW xText: A Framework for Textual DSLs. In: Eclipsecon Summit Europe 2006 (2006)
3. ISO/IEC: International Standard ISO/IEC 15444-3: JPEG 2000 Image Coding System – Part 3: Motion JPEG 2000. International Organization for Standardization (2002)
4. ITU-T: Recommendation ITU-T H.264: Advanced Video Coding for Generic Audiovisual Services. International Telecommunication Union (2013)
5. Jouault, F., Allilaire, F., Bézivin, J., Kurtev, I.: ATL: A model transformation tool. Science of Computer Programming 72(1–2), 31–39 (2008)
6. Jouault, F., Bézivin, J.: KM3: A DSL for Metamodel Specification. In: Formal Methods for Open Object-Based Distributed Systems. pp. 171–185. Springer (2006)
7. van der Knijff, J.: JPEG 2000 for Long-term Preservation: JP2 as a Preservation Format. D-Lib Magazine 17(5/6) (2011)
8. Kolovos, D.S., Paige, R.F., Polack, F.A.: The Epsilon Transformation Language. In: International Conference on Theory and Practice of Model Transformations. pp. 46–60. Springer-Verlag, Berlin, Heidelberg (2008)
9. Ihaka, R., Gentleman, R.: R: A Language for Data Analysis and Graphics. Journal of Computational and Graphical Statistics 5(3), 299–314 (1996)
10. Schenk, C., Maier, S., Borghoff, U.M.: A Model-based Approach for Architecture-independent Video Decoding. In: 2015 International Conference on Collaboration Technologies and Systems. pp. 407–414 (2015)
11. Steinberg, D., Budinsky, F., Paternostro, M., Merks, E.: EMF: Eclipse Modeling Framework 2.0. Addison-Wesley Professional, 2nd edn. (2009)
12. Uherek, A., Maier, S., Borghoff, U.M.: An Approach for Long-term Preservation of Digital Videos based on the Extensible MPEG-4 Textual Format. In: 2014 International Conference on Collaboration Technologies and Systems. pp. 324–329 (2014)