Model-Based Decoder Specifications for the
      Long-Term Preservation of Video Content

                                   Christian Schenk

                             Computer Science Department
                          Universität der Bundeswehr München
                              85577 Neubiberg, Germany


         Abstract. The long-term preservation of digital data implies two as-
         pects that are substantial: The data’s binary representation (i.e., the bit
         sequence) must be kept restorable in its original form, and the capability
         to decode (or interpret) the binary representation has to be retained. The
         former requires failure-resistant hardware and, in case it has to be
         replaced, the existence of procedures that ensure copying without loss of
         information. The latter is a question of how the availability of compatible
         decoding software can be guaranteed. A general-purpose approach is to
         use standard formats that have been designed for or have proven to be
         well suited for long-term preservation. In case of digital video content,
         however, such a format does not exist. Common video formats are quite
         complex, and the conversion of digital video content without risking any
         loss of authenticity or information is usually not possible. As we assume
         that this circumstance will not change in the near future, we propose
         decoder specifications as a means to describe the decoding process using
         different abstraction mechanisms. Being human-comprehensible and
         machine-processable, such a specification serves as a template that
         supports a potential developer in implementing a decoder prototype for
         an arbitrary architecture. This helps ensure that most commonly used
         formats can also be decoded on future architectures.

         Keywords: long-term preservation; video decoding; models; model trans-
         formations; H.264


1      Introduction
Documents, in general, are used to provide, publish and preserve any kind of
information. While, in most cases, such documents are relevant for a limited
time only, there are also documents that are important for future use (e.g.,
historical texts, photographs or videos). The issue of archiving such documents
in a way that future generations can access them is addressed within the domain
of long-term preservation.
[Author note: Doctoral supervisor: Uwe M. Borghoff. The author is in his third year.]
    Because of the digital revolution, we now mostly work with digital documents,
which are either born digital (i.e., they have been created digitally) or
the result of a digitization step (e.g., using a scanner or a digital camera). Thus,
it is worth thinking about the long-term preservation of digital documents [1]:
As digital content is always stored as a sequence of bits (i.e., ones and zeros),
which is likely to be transferable from one medium onto another one, it will
always be possible to copy a digital document without any loss of information
and to store it redundantly (even if, one day, a future architecture might use an-
other representation for digital content). Consequently, every digital document
can theoretically be preserved for an indefinite period of time.
    Preserving, however, only makes sense if it is guaranteed that the digital
content can also be restored in the future. As digital documents always need
appropriate hard- and software when being restored, we have to focus on the
question of how the availability of compatible software (called decoders in the
following) can be ensured even for future architectures. As applications generally
are not portable between different architectures, it can be assumed that the
re-development of decoders will mostly be necessary when an architecture is
replaced by another one. Consequently, whenever digital content is preserved, it
must be ensured that the capability to develop such a decoder can be retained.
    Standard formats, in general, are used to unify the serialization of digital
content of a specific type (e.g., text, audio or video). Specifications that define
such formats enable potential developers to implement a corresponding decoder.
Therefore, standard formats are principally an appropriate means to ensure that
preserved content can be restored on future architectures. However, while there
are standard formats that have particularly been designed or (at least) have
proven to be well suited for long-term preservation (such as PDF/A for textual
documents or JPEG 2000 for images [5]), many widely used standard formats
are appropriate for everyday usage but not in long-term preservation scenarios.
Proprietary formats, for instance, are often not publicly available; other formats
provide too many features that are useful in some scenarios but impede the
development of compatible decoders.
    Migration, i.e., the transformation of digital content from one format into
another one, is an approach that makes it possible to convert data into a format
that is suitable for long-term preservation. This way, a Word document, for
instance, could be migrated into the PDF/A format. Furthermore, migration
can also be used whenever a new format is designed that is better suited for
long-term preservation than the original one or whenever one format becomes
outdated (and thus will no longer be supported). In both scenarios, migration
is a (supporting) measure for the preservation of digital data. However, as two
different formats are usually not completely compatible with each other, every
migration involves a loss of information or at least of authenticity. Hence, in a
preservation scenario, it is important to avoid migration steps wherever possible.

1.1   Preserving Digital Video Content
Digital video content is usually stored using a modern video serialization
standard (such as H.264 [3]), which allows for high compression, high quality and
efficient encoding and decoding. Efficiency usually comes at the cost of system
dependence, which is why modern serialization formats are commonly tailored to
current architectures. Furthermore, they provide additional features to allow for
effective usage in different scenarios or for different purposes. Both aspects impede
the development of compatible decoders, especially in cases where the target
architecture is likely to be different. Hence, current serialization formats are ill-
suited for the preservation of digital video content, but because there is no other
suitable approach (to our knowledge), currently there is no alternative but to use
(one of) these formats. In addition to that, there is another issue: modern video
serialization formats use lossy compression techniques, i.e., the encoding of dig-
ital video content usually involves an irreversible loss of information. Thus, the
migration of digital video content, which enforces recoding, should be avoided.
    The problem of preserving existing digital video content can be summarized
as follows: Currently, users who want to preserve a video need to store it in its
original format (instead of a format that is at least well-suited for long-term
preservation) and they have to rely on the corresponding format specification
being explicit enough so that future developers can implement compatible de-
coders. Therefore, we are considering the following questions in our research:

 1. How can existing digital video content be preserved so that its decoding on
    future architectures can be ensured without risking loss of information?
 2. What can be done today to simplify the implementation of suitable video
    decoders on future architectures?

    Our general idea is to define decoder specifications that enable potential
developers to implement a decoder on an arbitrary architecture [8]. Different
means of abstraction are combined so that such a specification is both
human-comprehensible and machine-processable.
    In our current work, we propose a model-driven approach and want to eval-
uate if common MDE technologies (i.e., metamodels, model transformations,
etc.) may serve as a basis to abstract the video decoding process for common
video standard formats so that decoding capabilities can be retained in case
an architecture has to be replaced. In this paper, we give an overview of this
approach.
    The remainder of this paper is structured as follows: In Section 2, we give an
overview of related work and, in Section 3, a short introduction to video
decoding. In Section 4, we explain the overall idea of our approach before we give an
overview of the current implementation status in Section 5. Finally, in Section
6, we summarize the paper and give an outlook.


2   Related Work
We have already named two approaches that can be used for preserving digital
content, namely standard formats and migration, and we explained why they are
not suitable for digital video content. Migration usually implies recoding and thus
results in loss of information. As a consequence, it would also be impossible to
use one of the existing standard formats (such as H.264) to store arbitrary videos
as it would always imply a migration step. Another approach, which focuses on
the decoder’s portability, is emulation [1]. Put simply, the architecture the
decoder has been designed for is emulated on a (future) architecture in the form
of a virtual machine, which permits the use of the original decoder implementation
on the new architecture. This approach, however, only makes sense if implementing
an emulator for a complete system is worth the effort, as developing a decoder is
normally expected to be easier. In the following, we briefly introduce further
approaches that address these issues:

Image standard formats for the preservation of video data: Technically, a video
can be regarded as a sequence of images (or rather frames). JPEG 2000 has
already proven to be well-suited for the preservation of image files [5], and in
addition to that, Motion JPEG 2000 [2], an extension of the JPEG 2000 stan-
dard, proposes to use the image coding procedure for every video frame. Hence,
a combination could serve as a basis for archiving digital video content.
    In [10], an XML-based language is proposed to store digital video content.
Two variants are distinguished: first, video data are completely transformed into
primitive XML; second, the video frames are converted into image files that are
suitable for long-term preservation (such as JPEG 2000) and then referred to
within the XML representation.
    In contrast to our approach, both of these approaches involve recoding, which
generally results in loss of information. Furthermore, they only use still-image
compression and, thus, compression results tend to be worse because similarities
among different frames are not exploited.

Universal virtual computer: The decoder specifications we propose serve as
templates for the implementation of concrete decoders for different architectures.
In principle, they are a means to overcome the fact that ordinary software is not
portable. The universal virtual computer (UVC [6]) addresses the same issue.
It is a hypothetical computer that is supposed to be implementable in the form of
a virtual machine on arbitrary systems (i.e., it can be emulated). Originally, it
was designed as a general-purpose approach for the preservation of digital
content. The basic idea is to write decoders in the form of a UVC program. As
such a program is portable between different UVC implementations, restoring
preserved content (on a future architecture) can be limited to the development
of a suitable UVC implementation.
    The main difference between both approaches is the level of abstraction:
While our decoder specifications are human-comprehensible and machine-pro-
cessable, an executable UVC program consists of UVC machine code, which only
supports basic instructions, also excluding floating point calculations. Therefore,
a UVC program will usually be rather complex, difficult to maintain or to extend.
Having this in mind, the development of a UVC program for a modern and
complex video serialization standard seems to be unrealistic. However, a decoder
specification principally could also serve as a template for developing a UVC
program.


3   Digital Video Decoding
Before we describe our approach, we briefly provide an introduction to the de-
coding process of digital video content.
    Every video can be regarded as a sequence of frames, which, when presented
consecutively for a fixed period of time, create the feeling of movement. Regard-
less of how the video has actually been serialized, in every scenario, it is the
decoder’s task to restore the sequence of frames for further processing. There-
fore, serialization standards are used that specify how the binary representation
has to be interpreted during the decoding process. The H.264 video standard
[3], which is used on Blu-ray Discs, in video streaming scenarios and for
locally stored video files, is one of these standards. As it is widely
used, we use it for the following explanations, but we assume that most of the
principles also hold for other standards that use similar or the same techniques.
    An H.264 encoded video consists of samples, which are grouped into chunks.
Principally, every sample contains all the information needed to reconstruct one
frame. Every sample is partitioned into macroblocks, which provide the actual
pixel data, and which each represent a region of 16 × 16 pixels within the frame.
This structure is encoded within an H.264 serialization. The corresponding stan-
dard uses several methods that convert video data into representations that make
further compression techniques effective. One of these methods (called delta com-
pression) follows the principle of only storing differences between frames, mac-
roblocks or even arbitrary regions of different frames. Most of these methods
have in common that they are based on mathematical operations. These oper-
ations result in data representations usually containing values that tend to be
small and occur quite frequently.
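
The principle of delta compression can be illustrated with a minimal sketch. It is a deliberate simplification and not the H.264 procedure itself: frames are assumed to be flat lists of integer pixel values, and only whole-frame differences are stored, whereas the standard predicts macroblocks and regions. The function names are hypothetical.

```python
def delta_encode(frames):
    """Keep the first frame as-is and store only per-pixel differences afterwards."""
    if not frames:
        return []
    encoded = [list(frames[0])]
    for previous, current in zip(frames, frames[1:]):
        encoded.append([c - p for p, c in zip(previous, current)])
    return encoded


def delta_decode(encoded):
    """Rebuild every frame by adding each difference to the previously decoded frame."""
    frames = []
    for index, values in enumerate(encoded):
        if index == 0:
            frames.append(list(values))
        else:
            frames.append([p + d for p, d in zip(frames[-1], values)])
    return frames


# Three toy "frames" of four pixels each (assumed example data).
frames = [[100, 102, 101, 99], [101, 102, 100, 99], [101, 103, 100, 98]]
encoded = delta_encode(frames)
decoded = delta_decode(encoded)
```

In this toy example, every frame after the first is represented by values close to zero, which is the kind of representation that subsequent compression steps can exploit.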
    The H.264 standard combines different compression techniques, which are
either lossless or lossy. Lossy compression algorithms play an essential role as
they promise higher compression rates. Principally, these algorithms are used
to detect “similar” values and transform them into one single code. Hence, the
resulting representation contains more identical values than the original one and
thus serves as a perfect input for lossless compression algorithms, which can
handle repetitions effectively.
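
The interplay of lossy and lossless compression can be sketched as follows. The sketch assumes a simple uniform quantizer as the lossy step and run-length encoding as the lossless step; both are stand-ins chosen for brevity, as H.264 itself quantizes transform coefficients and uses dedicated entropy coders. All names and values are illustrative assumptions.

```python
from itertools import groupby


def quantize(values, step):
    """Lossy step: map all values within one interval of width `step` to one code."""
    return [v // step for v in values]


def dequantize(codes, step):
    """Approximate inverse of quantize; the discarded remainder cannot be recovered."""
    return [c * step for c in codes]


def run_length_encode(codes):
    """Lossless step: store each run of identical codes as (code, run length)."""
    return [(value, len(list(group))) for value, group in groupby(codes)]


def run_length_decode(runs):
    """Exact inverse of run_length_encode."""
    return [value for value, count in runs for _ in range(count)]


samples = [16, 17, 18, 19, 40, 41, 42, 43]   # assumed non-negative example values
codes = quantize(samples, 8)                 # similar values collapse to one code
runs = run_length_encode(codes)              # repetitions are stored compactly
restored = dequantize(run_length_decode(runs), 8)
```

Decoding reverses the lossless step exactly, whereas the values obtained after reversing the lossy step only approximate the original samples.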
    In summary, a common H.264 decoder first has to reverse the lossless com-
pression in order to create an intermediate representation consisting of lossy
compressed (as well as uncompressed) data. Then, it has to resolve dependen-
cies and to reverse the lossy compression before actually being able to restore
the pixel data. As a final step, the decoder has to ensure that the decompressed
frames are put into the correct order that is specified by given time stamps.
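
Two of the steps named above, resolving dependencies and restoring the presentation order, can be sketched independently of any pixel processing. The sketch assumes that each sample lists the samples it depends on and carries a composition time; the sample identifiers and the dictionary layout are hypothetical. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Assumed toy stream: one independent sample and three dependent ones.
samples = {
    "I0": {"composition_time": 0, "depends_on": []},
    "P1": {"composition_time": 3, "depends_on": ["I0"]},
    "B2": {"composition_time": 1, "depends_on": ["I0", "P1"]},
    "B3": {"composition_time": 2, "depends_on": ["I0", "P1"]},
}


def decode_order(samples):
    """Return an order in which every sample comes after the samples it depends on."""
    graph = {sid: set(info["depends_on"]) for sid, info in samples.items()}
    return list(TopologicalSorter(graph).static_order())


def presentation_order(samples):
    """Return the sample identifiers sorted by their composition time."""
    return sorted(samples, key=lambda sid: samples[sid]["composition_time"])


order = decode_order(samples)
shown = presentation_order(samples)
```

The example shows that the order in which samples must be decoded generally differs from the order in which the resulting frames are presented.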

4   Proposed Solution
Assuming that there will not be any appropriate standard format in the near
future, we are considering the question of how a video, serialized in a common
standard format, should be preserved if it is to be restored on future architec-
tures. Instead of concentrating on the actual serialization, we pursue the goal of
simplifying the development process for decoders by combining different means
of abstraction to define human-comprehensible and machine-processable decoder
specifications. Ideally, such a specification enables a developer, whether or not
they are familiar with the video standard, to implement a functional decoder
prototype for an arbitrary architecture. Such a prototype, in contrast to a
common decoder, only has to be able to restore the digital content; it need not
be efficient. Neglecting the efficiency aspect is expected to simplify the
development process, which is the essential goal.
Nevertheless, such a decoder prototype (regardless of how efficient it is) is still
applicable for different purposes:

 1. It can be used to completely decode video content for further processing.
    In particular, the decoded content can also be recoded using an up-to-date
    format that is well supported on the target architecture.
 2. It can be used for cases where efficiency is less important, such as for the
    extraction of single frames.

     Note that the first scenario is not a normal migration step, as its result is
only used as an intermediate representation; it is not used for long-term
preservation.
     In conclusion, we expect the decoder specification to be an appropriate
means to retain decoding capabilities for existing video content. A decoder
prototype derived from it can even serve as a basis for the development of an
optimized and more efficient version or as a reference implementation for an
architecture-specific decoder.
     For the following explanations, we assume that the video content has been
encoded using the H.264 serialization format, but as said before, other formats
usually follow similar or rather the same encoding principles and thus are sup-
posed to be processable in a similar manner.


4.1   Model-based Decoder Specifications

In Section 3, we have described that general video content can simply be regarded
as a sequence of frames. In the same section, we have also explained that an H.264
encoded video follows a hierarchical structure. Therefore, these two abstractions
constitute the basis for specifying the in- and output of a suitable decoder. A
decoder specification, thus, simply has to describe how one abstraction has to be
transformed into the other. Assuming that a high level of abstraction guarantees
the specification to be architecture-independent and human-comprehensible, we
propose a model-based approach: in- and output is described using metamodels
(as illustrated in Fig. 1 and Fig. 2).
    We further have divided the complete decoding process into several steps,
each transforming one or more data representations into another one, whereby all
these intermediate representations are formalized by metamodels, which suggests
that each step can be regarded as one model transformation. Indeed, we propose
to use model transformations as a means of abstraction for several processing
steps, but, as our main principle is to use abstractions that best fit a specific
purpose, we also use other abstraction mechanisms if they are supposed to be
more suitable. As video decoding usually involves mathematical calculations,
we use an abstraction mechanism being well-suited for specifying mathematical
expressions and operations.

[Figure 1: class diagram. VideoTrack (width, height, duration) has many chunks;
Chunk (id) has many samples; Sample (id, compositionTime, refSample, idrFrame,
predictionType, picOrderCnt, frameNum) has many macroblocks (abstract class
Macroblock) and many dependencies.]
Fig. 1. H.264 data metamodel - due to clarity reasons, details of the abstract class
Macroblock and all its subclasses have been omitted

[Figure 2: class diagram. Video (width, height, duration) has many frames;
Frame (id, compositionTime) has exactly one picture; Picture (width, height,
pixels).]
Fig. 2. Frame sequence metamodel

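
To make the two metamodels more tangible, the following sketch mirrors the classes and features named in Fig. 1 and Fig. 2 as plain Python data classes. It is not the EMF representation used in our implementation. All attribute types, the target class of `dependencies` and the flat pixel list are assumptions, since the figures do not state them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Macroblock:
    """Abstract in Fig. 1; its details are omitted there and here."""


@dataclass
class Sample:
    id: int
    composition_time: int
    frame_num: int
    pic_order_cnt: int
    prediction_type: str            # assumed type
    idr_frame: bool = False         # assumed type
    ref_sample: bool = False        # assumed type
    macroblocks: List[Macroblock] = field(default_factory=list)
    dependencies: List["Sample"] = field(default_factory=list)  # assumed target


@dataclass
class Chunk:
    id: int
    samples: List[Sample] = field(default_factory=list)


@dataclass
class VideoTrack:
    width: int
    height: int
    duration: int
    chunks: List[Chunk] = field(default_factory=list)


@dataclass
class Picture:
    width: int
    height: int
    pixels: List[int]               # assumed: flat list of pixel values


@dataclass
class Frame:
    id: int
    composition_time: int
    picture: Picture


@dataclass
class Video:
    width: int
    height: int
    duration: int
    frames: List[Frame] = field(default_factory=list)


# Input side (toy data): an independent sample and one that depends on it.
key = Sample(id=0, composition_time=0, frame_num=0, pic_order_cnt=0,
             prediction_type="I", idr_frame=True)
dependent = Sample(id=1, composition_time=40, frame_num=1, pic_order_cnt=2,
                   prediction_type="P", dependencies=[key])
track = VideoTrack(width=2, height=1, duration=80,
                   chunks=[Chunk(id=0, samples=[key, dependent])])

# Output side (toy data): a plain frame sequence without any dependencies.
video = Video(width=2, height=1, duration=80,
              frames=[Frame(id=0, composition_time=0,
                            picture=Picture(width=2, height=1, pixels=[0, 255]))])
```

A decoder specification then describes how an instance such as `track` is turned into an instance such as `video`.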
    We have grouped all the decoding steps into five phases (illustrated in Fig. 3),
whereby the modeling and the unmodeling phase serve as pre- and postprocessing
steps that convert a video into an H.264 model and, vice versa, a frame sequence
model into a sequence of frames. These phases highly depend on the actual
serialization format, the target architecture and the concrete scenario; therefore,
they are not in the scope of the decoder specification.


[Figure 3: processing chain. H.264 video -> Modeling -> H.264 data model ->
Preparation -> decoder task model -> Execution -> sample sequence model ->
Finalization -> frame sequence model -> Unmodeling -> frame sequence.
Technology labels shown in the figure: Java, EMF, ATL, EMF, R.]

                  Fig. 3. Model-driven decoding process for an H.264 video
    The preparation phase transforms the original H.264 model into an interme-
diate representation that constitutes the basis for performing the mathematical
operations within the execution phase, which are necessary to restore the pixel
data. Thereafter, the finalization phase creates the output model containing the
uncompressed frame data. These three phases are the essential steps of a decoder
specification. In contrast to the execution phase, which is defined using mathe-
matical abstraction mechanisms, the preparation and the finalization phase are
specified as a series of model transformations.
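
The division of labour between the three phases can be sketched as a chain of functions over simple dictionaries. This is a toy under strong assumptions: the "mathematical operation" is reduced to adding residuals to a base value, and the dictionary keys (`base`, `residuals`, and so on) are hypothetical and do not reflect the actual models.

```python
def preparation(h264_model):
    """Derive one self-contained decoding task per sample (model-to-model step)."""
    return [
        {"time": s["composition_time"], "base": s["base"], "residuals": s["residuals"]}
        for s in h264_model["samples"]
    ]


def execution(tasks):
    """Numeric step: rebuild pixel values from a base value plus residuals."""
    return [
        {"time": t["time"], "pixels": [t["base"] + r for r in t["residuals"]]}
        for t in tasks
    ]


def finalization(sample_sequence, width, height):
    """Order the decoded samples by time and wrap them as frames."""
    ordered = sorted(sample_sequence, key=lambda s: s["time"])
    frames = [
        {"id": i, "composition_time": s["time"], "pixels": s["pixels"]}
        for i, s in enumerate(ordered)
    ]
    return {"width": width, "height": height, "frames": frames}


def decode(h264_model):
    """Chain the three phases that belong to the decoder specification."""
    tasks = preparation(h264_model)
    sample_sequence = execution(tasks)
    return finalization(sample_sequence, h264_model["width"], h264_model["height"])


# Assumed toy input: two samples stored out of presentation order.
h264_model = {
    "width": 2,
    "height": 1,
    "samples": [
        {"id": 7, "composition_time": 40, "base": 100, "residuals": [1, -1]},
        {"id": 3, "composition_time": 0, "base": 50, "residuals": [0, 2]},
    ],
}
frame_sequence_model = decode(h264_model)
```

Each function consumes and produces a complete intermediate representation, which is what allows every phase to be specified and checked separately.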
    The main advantage of our specifications over common informal ones is
the fact that they are designed to be human-comprehensible and machine-
processable. This implies that they constitute the basis for automatic code gen-
eration and/or direct execution. Besides the obvious effect that a developer will
not have to implement the complete decoder from scratch, the proposed speci-
fications are supposed to be more unambiguous than common informal format
specifications. This also implies that all the intermediate model representations,
resulting from the particular decoding steps, should always be the same regard-
less of the concrete implementation. As a consequence, it is possible to generate
test models that, if provided in combination with the decoder specification, per-
mit developers to test the implementation of every step by comparing the result-
ing model with the corresponding test model. This way, developers can identify
and correct implementation errors; consequently, the development process can
further be simplified.
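
The use of test models can be sketched as a small checking routine. It assumes that every decoding step is available as a function and that the reference result of each step is stored under the step's name; the routine, the step names and the toy steps are hypothetical.

```python
def first_failing_step(steps, input_model, reference_models):
    """Run the steps in order and return the name of the first step whose
    result differs from its reference model, or None if all results match."""
    model = input_model
    for name, step in steps:
        model = step(model)
        if model != reference_models[name]:
            return name
    return None


# Toy reference results for two steps applied to the input [1, 2, 3].
reference_models = {"double": [2, 4, 6], "sum": 12}

correct_steps = [("double", lambda xs: [2 * x for x in xs]), ("sum", sum)]
buggy_steps = [("double", lambda xs: [2 * x for x in xs]), ("sum", max)]

ok = first_failing_step(correct_steps, [1, 2, 3], reference_models)
failing = first_failing_step(buggy_steps, [1, 2, 3], reference_models)
```

Because the comparison is made after every step, an implementation error is attributed to the step that introduced it rather than to the decoder as a whole.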


4.2   Preservation Process

As stated, a decoder specification only describes the steps that are needed to
transform an H.264 model representation of an encoded video into a frame se-
quence model. Therefore, we propose to store the H.264 model in a serialization
format that is well-suited for that purpose and that is likely to be restorable in
the future (such as XML). By using a general-purpose compression algorithm,
the file size can further be reduced without risking loss of information. This way,
restoring the model representation (i.e., implementing the modeling phase) is
likely to be easier than it would be if the original file were used.
    In summary, if a video is to be preserved, its model representation has to
be converted into the serialization format and stored in combination with the
decoder specification and additional test data (as described before). When it
is restored one day, developers first have to restore the model representation;
afterwards, they can use the decoder specification to develop a suitable decoder.
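
Storing and restoring a model representation can be sketched with the Python standard library. The sketch assumes a strongly reduced model (a track with a few samples) and a hypothetical XML layout; gzip stands in for an arbitrary general-purpose compression algorithm.

```python
import gzip
import xml.etree.ElementTree as ET


def model_to_xml(track):
    """Serialize the (reduced) model as UTF-8 encoded XML bytes."""
    root = ET.Element("VideoTrack", width=str(track["width"]),
                      height=str(track["height"]))
    for sample in track["samples"]:
        ET.SubElement(root, "Sample", id=str(sample["id"]),
                      compositionTime=str(sample["compositionTime"]))
    return ET.tostring(root, encoding="utf-8")


def xml_to_model(data):
    """Restore the model from the XML bytes produced by model_to_xml."""
    root = ET.fromstring(data)
    return {
        "width": int(root.get("width")),
        "height": int(root.get("height")),
        "samples": [
            {"id": int(e.get("id")), "compositionTime": int(e.get("compositionTime"))}
            for e in root.findall("Sample")
        ],
    }


# Assumed toy model.
track = {
    "width": 1280,
    "height": 720,
    "samples": [{"id": 0, "compositionTime": 0}, {"id": 1, "compositionTime": 40}],
}

archived = gzip.compress(model_to_xml(track))           # lossless compression
restored = xml_to_model(gzip.decompress(archived))      # restoring step
```

Both steps are lossless, so the restored model equals the preserved one.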


5     Current Status and Future Plans

The general applicability of the frame sequence model is a requirement for the
definition of decoder specifications that allow for the implementation of con-
crete decoders for different scenarios. Therefore, we have implemented a library
demonstrating that the frame sequence model is applicable to reference and
access video content in an interoperable way (as originally discussed in [8]).
     Above all, based on the official specification [3], we have implemented essen-
tial parts of the decoding process for H.264 encoded videos using Java. While
the decoding of dependent frames is not completely supported yet, independent
frames and other information can be accessed and extracted at random. As we
implemented it from scratch (without using external libraries) with a focus on
abstraction (rather than efficiency), it serves as a basis for our current work,
i.e., the definition of suitable decoder specifications.
     Currently, we are working on the decoder specification’s design for H.264
encoded videos. We have already implemented the preparation and the finaliza-
tion phase using EMF metamodels [9] as well as ATL transformations [4], and
we have defined a set of R scripts [7] that perform all the mathematical opera-
tions needed within the execution phase to restore the pixel data of independent
frames. Furthermore, we have defined a DSL to write a control program that
allows us to specify which abstraction mechanism (i.e., ATL transformation or
R script) has to be used to transform one data representation into another one.
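
The role of such a control program can be sketched as a list of steps that is interpreted by a small dispatcher. The sketch does not reproduce our DSL: the tuple layout, the artifact names and the stub handlers are hypothetical, and the handlers merely record what a real ATL or R back end would be asked to execute.

```python
# Each step: (source representation, target representation, mechanism, artifact).
CONTROL_PROGRAM = [
    ("h264_data_model", "decoder_task_model", "ATL", "preparation"),
    ("decoder_task_model", "sample_sequence_model", "R", "execution"),
    ("sample_sequence_model", "frame_sequence_model", "ATL", "finalization"),
]


def run_control_program(program, handlers, models):
    """Apply every step with the handler registered for its mechanism."""
    for source, target, mechanism, artifact in program:
        models[target] = handlers[mechanism](artifact, models[source])
    return models


def make_stub(mechanism, trace):
    """Build a stand-in handler that logs the call instead of running ATL or R."""
    def handler(artifact, model):
        trace.append(f"{mechanism}:{artifact}")
        return model + [artifact]
    return handler


trace = []
handlers = {"ATL": make_stub("ATL", trace), "R": make_stub("R", trace)}
models = run_control_program(CONTROL_PROGRAM, handlers, {"h264_data_model": []})
```

Separating the step list from the handlers keeps the choice of abstraction mechanism explicit for every step while leaving the execution back ends exchangeable.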
     While the essential elements of the decoder specification, namely those of
the preparation, the execution as well as the finalization phase, already exist
and can also be executed, up to now, the transition between the three phases,
i.e., its specification, is still hard-coded. In a next step, we want to replace the
hard-coded creation of R-compatible data and the execution of R scripts (using
the open-source library renjin; see footnote 1) with suitable abstractions such as model-to-text
transformations. This will be the last step before we address the issue of decoding
dependent frames.
     The size of the intermediate model representations was a challenge we had to
tackle before we were able to actually execute parts of the decoder specification
(using MDE technologies): Because video data tend to require a lot of memory,
formalizing them with metamodels results in large models. A test video
of about 2 hours with a resolution of 1280 × 720 pixels, for example, resulted
in an H.264 model that, if represented by a graph, consists of about 14.5 billion
nodes and 15.7 billion edges. As the complete video decoding process allows
for effective partitioning, in our current solution, we use model representations
only containing those contents that are actually needed for every phase. For
performing the preparation phase of the aforementioned example model, we were
able to use a reduced graph representation, which contains “only” about 327,000
nodes and 646,000 edges.
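
The order of magnitude of these numbers can be made plausible with a short calculation. The macroblock size and the node counts are taken from the text; the frame rate of 25 frames per second is purely an assumption, as the frame rate of the test video is not stated here.

```python
MACROBLOCK_SIZE = 16                      # 16 x 16 pixels per macroblock
WIDTH, HEIGHT = 1280, 720                 # resolution of the test video

macroblocks_per_frame = (WIDTH // MACROBLOCK_SIZE) * (HEIGHT // MACROBLOCK_SIZE)

ASSUMED_FPS = 25                          # assumption, not stated in the text
frames = 2 * 60 * 60 * ASSUMED_FPS        # about 2 hours of video
total_macroblocks = macroblocks_per_frame * frames

FULL_NODES = 14_500_000_000               # nodes of the complete H.264 model
REDUCED_NODES = 327_000                   # nodes of the reduced representation
reduction_factor = FULL_NODES / REDUCED_NODES
```

Under this assumption, the video comprises 3,600 macroblocks per frame and 648 million macroblocks in total, and the reduced representation is smaller than the complete model by a factor of roughly 44,000.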
     As a next step, we plan to test our approach by asking developers to im-
plement a program that can convert H.264 models into frame sequence models.
For that purpose, the developers will be given an H.264 decoder specification,
an H.264 model of a video and additionally all the intermediate model
representations resulting from the specified decoding steps as test data. We will
consider our approach a full success if the resulting decoder prototypes can
actually be used to transform H.264 models into frame sequence models. However,
we also want to find out whether the chosen abstraction mechanisms are well
suited for specifying the video decoding process or whether other mechanisms
and technologies should be preferred.
[Footnote 1: Project URL: www.renjin.org]
    Thereafter, we plan to test our concept for other formats in order to evaluate
its general applicability. In this context, we also want to examine if parts of the
H.264 decoder specification can be reused for other specifications.

6    Conclusion and Outlook
As stated, we propose decoder specifications that help potential developers im-
plement functional decoder prototypes, even if they are unfamiliar with a video
standard. Our work has been motivated by the lack of existing approaches that
are supposed to be suitable for the long-term preservation of digital video con-
tent. Our approach directly addresses this issue and thus may also serve as a
template for related problems. As decoders are ordinary software, the approach
might be usable for other complex data, e.g., games.
    A decoder specification unifies different abstraction mechanisms, which make
it both human-comprehensible and machine-processable. The means we have chosen
so far allow for automatic code generation on the one hand, and they permit
potential developers to execute such specifications at an early stage of the
development process on the other hand.

References
 1. Borghoff, U.M., Rödig, P., Scheffczyk, J., Schmitz, L.: Long-Term Preservation
    of Digital Documents, Principles and Practices. Springer-Verlag Berlin Heidelberg
    (2006)
 2. ISO/IEC: International Standard ISO/IEC 15444-3: JPEG 2000 Image Coding
    System - Part 3: Motion JPEG 2000. International Organization for
    Standardization (2002)
 3. ITU-T: Recommendation ITU-T H.264: Advanced Video Coding for Generic
    Audiovisual Services. International Telecommunication Union (2013)
 4. Jouault, F., Allilaire, F., Bézivin, J., Kurtev, I.: ATL: A model transformation tool.
    Science of Computer Programming 72(1–2), 31–39 (2008)
 5. van der Knijff, J.: JPEG 2000 for Long-term Preservation: JP2 as a Preservation
    Format. D-Lib Magazine 17(5/6) (2011)
 6. Lorie, R.A., van Diessen, R.J.: UVC: A Universal Computer for Long-Term Preser-
    vation of Digital Information. IBM Research Division (2005)
 7. Ihaka, R., Gentleman, R.: R: A Language for Data Analysis and Graphics. Journal of
    Computational and Graphical Statistics 5(3), 299–314 (1996)
 8. Schenk, C., Maier, S., Borghoff, U.M.: A Model-based Approach for Architecture-
    independent Video Decoding. In: 2015 International Conference on Collaboration
    Technologies and Systems. pp. 407–414 (2015)
 9. Steinberg, D., Budinsky, F., Paternostro, M., Merks, E.: EMF: Eclipse Modeling
    Framework 2.0. Addison-Wesley Professional, 2nd edn. (2009)
10. Uherek, A., Maier, S., Borghoff, U.M.: An Approach for Long-term Preservation
    of Digital Videos based on the Extensible MPEG-4 Textual Format. In: 2014 In-
    ternational Conference on Collaboration Technologies and Systems. pp. 324–329
    (2014)