Interactive Authoring Tool for Extensible MPEG-4 Textual Format (XMT)

Kyungae Cha(1) and Sangwook Kim(2)

(1) Department of Computer Science, Kyungpook National University, Daegu, Korea, email: chaka@woorisol.knu.ac.kr
(2) Department of Computer Science, Kyungpook National University, Daegu, Korea, email: swkim@cs.knu.ac.kr

Abstract. MPEG-4 is an ISO/IEC standard which defines a multimedia system for communicating interactive scenes containing various types of media objects. The Extensible MPEG-4 Textual format (XMT) framework provides interoperability between existing practices such as the Extensible 3D (X3D) and MPEG-4. This paper introduces an XMT authoring tool that supports a visual environment for building a spatio-temporal scenario of the media objects comprising a multimedia scene. The authoring tool provides a comprehensive set of editing tools for composing a multimedia scene, as well as tools for the automatic generation of XMT documents and MPEG-4 content. This paper also describes the functionality of the developed system and shows an example of its use.

1    INTRODUCTION

MPEG-4, one of the leading streaming media formats, is an ISO/IEC standard which defines a multimedia system for communicating interactive scenes with various types of media objects. In MPEG-4, a scene is accompanied by a description specifying how the objects should be combined in time and space in order to form the scene intended by the author.

The scene description is coded in a binary format called Binary Format for Scenes, or BIFS[1,4,7,8,10,11], which is built on several concepts from the Virtual Reality Modeling Language (VRML)[5]. This binary form is suitable for low-overhead transmission, so BIFS is efficient for both the sender and the receiver[1,7].

On the other hand, the Extensible MPEG-4 Textual format (XMT) is a framework for representing an MPEG-4 scene description using a textual syntax.

This paper presents an XMT document authoring tool that enables visual composition of an MPEG-4 scene and generates the corresponding XMT document and MPEG-4 content. XMT is designed to provide a high-level abstraction for MPEG-4 functionalities and easy interoperability with the existing practices of content authors, such as the Extensible 3D (X3D) being developed by the Web3D Consortium and the Synchronized Multimedia Integration Language (SMIL) from the W3C[7,11]. Thus, using the XMT authoring tool, authors can produce multimedia content that is exchangeable and interoperable with X3D and SMIL.

In the authoring system, authors can visually arrange media objects in space and compose the temporal behavior of objects with a timeline approach. Authors can also modify the material characteristics of each object using interactive, visual tools. Moreover, the visual scene is automatically transformed into XMT-α and XMT-Ω format documents.

Section 2 briefly discusses the XMT formats. Section 3 describes the various functions of the XMT authoring tool. The implementation of the proposed system is then presented in Section 4. Finally, Section 5 gives conclusions and presents our future plans.

2    XMT-α AND XMT-Ω FORMATS

The XMT framework consists of two levels of textual syntax and semantics: the XMT-α and XMT-Ω formats[7,10].

XMT-α is an XML-based version of MPEG-4 content which provides a straightforward, one-to-one mapping between the textual and the binary formats of an MPEG-4 scene description. XMT-α also provides interoperability with X3D[5], which improves upon VRML with new features such as flexible XML encoding and a modularization approach[6]. It contains a subset of X3D as well as X3D-like representations of MPEG-4 features such as Object Descriptors (OD), BIFS update commands and 2D composition[7].

XMT-Ω is a high-level abstraction of MPEG-4 features based on SMIL[9]. It specifies objects and their relationships in terms of the author's intention rather than the coded nodes and route mechanism of BIFS. To reuse SMIL, XMT-Ω defines a subset of the modules used in SMIL whose semantics are compatible. Moreover, the XMT-Ω format can be parsed and played directly by a W3C SMIL player, or preprocessed to the corresponding X3D nodes and played by a VRML player. It may also be compiled to an MPEG-4 representation such as mp4, which can then be played by an MPEG-4 player. Figure 1 shows the interoperability of XMT between a SMIL player, a VRML player and an MPEG-4 player.
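To make the one-to-one mapping of XMT-α concrete, the following is a minimal Python sketch, not the authors' implementation. It renders one attribute-only XMT-α-style element as BIFS-style text, using the TimeSensor node that appears later in Figures 5 and 6. The function name node_to_bifs is hypothetical, and the sketch assumes that every attribute other than DEF maps directly to a BIFS field of the same name; nested nodes, ROUTEs and the binary encoding are out of scope.

```python
import xml.etree.ElementTree as ET


def node_to_bifs(xmt_alpha_element):
    """Render one attribute-only XMT-alpha-style element as BIFS-style text.

    Assumption for illustration: each attribute except DEF becomes a
    field line with the same name and value.
    """
    element = ET.fromstring(xmt_alpha_element)
    fields = dict(element.attrib)
    def_name = fields.pop("DEF", None)
    header = f"DEF {def_name} {element.tag}" if def_name else element.tag
    body = "".join(f"  {name} {value}\n" for name, value in fields.items())
    return f"{header} {{\n{body}}}"


# TimeSensor element taken from the XMT-alpha excerpt in Figure 5.
time_sensor = (
    '<TimeSensor DEF="TimeSI3000I0" cycleInterval="3.00" enabled="FALSE" '
    'loop="TRUE" startTime="0.00" stopTime="-1.00"></TimeSensor>'
)
bifs_text = node_to_bifs(time_sensor)
print(bifs_text)
```

The printed fields correspond line by line to the TimeSensor node of the BIFS text in Figure 6, which is the sense in which the textual and binary-oriented descriptions mirror each other.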
[Figure 1 diagram: XMT is connected, through Parse and Compile steps and an MPEG-4 representation (e.g. mp4 file), to a SMIL Player, a VRML Browser and an MPEG-4 Player]

Figure 1. The interoperability of XMT

3    XMT AUTHORING SYSTEM

This section presents the XMT document authoring environment of our system and the authoring process for creating an MPEG-4 scene and an XMT document. The main functionalities of the system are also described.

3.1 System Structure

The following figure shows the system structure and every component of the XMT authoring tool.

[Figure 2 diagram: Media data, Media Decoders, Graphical User Interface, Parser, Scene composition tree Manager, Scene composition tree, XMT documents, XMT_Ω Generator, XMT_Ω document, XMT_α Generator, XMT_α document]

Figure 2. System Structure

Authors compose an MPEG-4 scene with the various editing tools provided in the graphical user interface. As authoring proceeds, the scene composition tree, which represents the visual scene as an internal data structure, is built and modified.

Using the scene composition tree, the XMT-α or XMT-Ω generator produces a corresponding XMT format document. At this point the author can choose the desired output format. XMT format files can also be parsed and then displayed in the user interface as a visual scene. The author can then modify the visual scene and recreate the XMT file.

3.2 Graphical User Interface

The graphical user interface provides a set of drawing tools and editing tools for various media types such as JPEG images, MPEG-1 video, G.723 audio and graphical objects (Rectangle, Circle, and others). These tools enable authors to compose audio-visual scenes by direct manipulation and see the result immediately. Figure 3 presents an overview of the graphical user interface and a simple example of a scene.

Authors first select from the toolbar the tool for the object they want to add to the scene and then draw the selected object. For image objects, the object is drawn in the interface window. For video objects, the first frame of the video is drawn in the interface window.

Whenever a new media object is added to the scene, the system automatically assigns default values to the object ID, start time and end time of the object. The bottom portion of Figure 3 shows the timeline window, where the timelines of the objects are arranged. The layer of a timeline represents the drawing order of the corresponding object, which is determined by the order in which objects were added. Here the timeline window shows the initial state, i.e. no modification has occurred.

Figure 3. Graphical user interface

3.2.1 Spatial composition

In the user interface, each object participating in a scene is contained in a rectangular tracker so that it is treated as an individual object. Thus the author can move, resize or remove objects directly to compose the spatial arrangement of the scene.

The spatial attributes of an object are specified in terms of the position of the object's bounding rectangle, which is represented in the user interface as the rectangular tracker containing the object. The spatial position of the bounding rectangle (i.e. the spatial attribute of the object) is specified in the form (x,y,h,w), where w denotes the width of the bounding rectangle, h denotes its height, and x and y denote the coordinates of the center of the rectangle, taking the center of the whole presentation rectangle as the origin of the coordinate system.

The author can also apply material characteristics such as color, transparency, and border type using editing tools. These material properties of an object are specified as an object property
node in the internal form of our authoring system. The spatial and material attributes of each object are automatically specified by the system from the visual scene.

3.2.2 Interactive scenario composition

In the presentation of an MPEG-4 scene, user interaction is possible within the bounds set by the scene description. Assume that the author designs the following scenario for the scene in Figure 3.

Example 1. If an end user clicks the circle object, the fill color of the rectangle object changes through a gradient from red to green.

Here, the circle object and the rectangle object are the source object and the destination object, respectively. To make an interactive scenario, the event type (e.g. the user's click), the source and destination objects, the responding action type (e.g. change fill color), etc., should be specified. We call this interactive information an event object, which is represented as a quadruple (destination object ID, event type, action type, key values). The key values are an array of values used to change the parameters of the action type field. The event object for the above example is specified as (3000, click, fill color, ((1.00 0.00 0.00),(0.00 0.50 0.00))), if the rectangle object as the destination has the number 3000 for its object ID.

We provide a dialog-based interface to facilitate the interactive scenario authoring process. The event object is specified by selecting an event type and the attributes of the destination object that the author wants the event to change, without the need for an extra description.

3.2.3 Temporal scenario composition

To compose the temporal scenario of objects, the author can modify the timeline of each object, i.e., the author directly modifies the length and position of timelines in the timeline window. Moreover, the author can declare temporal relationships among objects, which are maintained throughout the authoring process. Consider the following scenario for the scene in Figure 3.

Example 2. The text object is rendered at end of the image object.

The scenario can be specified if the author modifies the timelines of the two objects as in Figure 4 and declares the two objects as a sequence group, which keeps the objects playing sequentially.

[Figure 4 diagram: a sequence group containing the image and text timelines along a time axis]

Figure 4. An example of timeline modification and temporal relationship declaration

The timeline of the image object is automatically updated to maintain the relationship each time the duration of the text object is modified.

3.3 Scene Composition Tree

The resulting graphical user interface is represented as a scene composition tree designed to organize the composed scene into a hierarchical structural form. Whenever a new object is created in the user interface, the corresponding object node is also created.

The object node has its corresponding object type, object ID and values specifying spatio-temporal attributes. The scene composition tree is modified through the attachment of the new object node. At the same time, the property node of the object is attached as a child node of the new object node. The property node as well as the tree structure can be changed throughout the authoring process. The tree structure changes as objects are added, replaced, or removed. If the author creates event information, an event object containing the destination object ID, the event type and the values of the transition status is created and attached to the source object node as its child node. Thus an event object does not need to specify its source object ID.

3.4 Generation of XMT Document

The resulting graphical user interface is represented as the scene composition tree. From the scene composition tree, both the XMT-α and XMT-Ω documents corresponding to the visual scene are directly generated.

3.4.1 XMT-α generation

In the XMT-α format, each object is represented as an element similar to the object node described in BIFS. Thus, the XMT-α format document can be generated following the BIFS generation rules.

The XMT-α generator searches the scene composition tree until it reaches an audio or visual object node. It then creates the corresponding object element of the XMT-α document using the spatio-temporal attributes of the object node. With the values specified in the object's property node in the scene composition tree, the XMT-α generator can describe geometric attributes such as the position, size and shape of the object, or material attributes such as fill color and border style.

Figure 5 and Figure 6 show a portion of the XMT-α and BIFS text, respectively, for the scene of Example 1. In this case, when the XMT-α generator finds the circle object node in the scene composition tree, it also finds the circle object's property node as well as its event node among the object node's children. Using the information written in the event node, the route and sensor nodes can be described.

3.4.2 XMT-Ω generation

XMT-Ω syntax and semantics have been designed using extensible media (xMedia) objects as basic building blocks[7].

The elements within XMT-Ω abstract the geometry and the behavior of the corresponding object in the visual scene. Thus, if an object is associated with an event object node, its behavior should be defined by a set of animation and timing elements.

Figure 7 shows the XMT-Ω format document corresponding to the XMT-α format in Figure 5. The rectangle object is defined with elements describing the object's spatial and material attributes, as well as animate elements describing a change of fill color in response to a click on the circle object.

Likewise, Figure 8 shows a portion of an XMT-Ω document
specifying the scenario of Example 2. It expresses the temporal relationship and synchronization using SMIL timing constraints. A 'seq' container defines a sequence of elements in which the elements play one after the other. The text object starts one second after the presentation begins and disappears 19 seconds later. When the text object disappears, the image object, whose temporal duration is 23 seconds, starts. Figure 9 shows the BIFS text corresponding to the XMT-Ω in Figure 8.

 <Transform2D DEF="Transform2D3000"
       translation="-163.00 17.00" scale="1.00 1.00"
       rotationAngle="0.00" >
     <children>
       <Shape>
       <Appearance>
          <Material2D DEF="Material2D3000"
                   emissiveColor="0.75 0.75 0.75" …>
          <LineProperties DEF="LineProperties3000"
                   …
          </Material2D>
       </Appearance>
       <Rectangle DEF="Rectangle3000" USE="Group0"
                  size="162.00 110.00">
       </Rectangle>
       </Shape>
     <TimeSensor DEF="TimeSI3000I0" cycleInterval="3.00"
                 enabled="FALSE" loop="TRUE"
                 startTime="0.00" stopTime="-1.00" >
     </TimeSensor>
     <ColorInterpolator DEF="ColorInter3000I0"
                 key="0.00 1.00"
                 keyValue="0.00 0.50 0.00 1.00 0.00 0.00 " >
     </ColorInterpolator>
     …
     <Circle DEF="Circle3001" USE="Group0"
             radius="57.00">
     …
     <TouchSensor DEF="TouchS3001" enabled="TRUE" >
     </TouchSensor>
     …
 <Route fromNode="TouchS3001" fromField="isActive"
        toNode="TimeSI3000I0" toField="enabled" />
 <Route fromNode="TimeSI3000I0"
        fromField="fraction_changed"
        toNode="ColorInter3000I0" toField="set_fraction" />
 <Route fromNode="ColorInter3000I0"
        fromField="value_changed"
        toNode="Material2D3000"
        toField="emissiveColor" />

Figure 5. A portion of XMT-α for Example 1

 DEF Transform2D3000 Transform2D {
       translation -163.00 17.00
       scale 1.00 1.00
       rotationAngle 0.00
       children [
          Shape {
           appearance Appearance {
             material DEF Material2D3000 Material2D {
                     emissiveColor 0.75 0.75 0.75
                     filled TRUE
                     transparency -1.00
                     ...
           geometry Rectangle {
                     size 162.00 110.00 }
 DEF TimeSI3000I0 TimeSensor {
                     cycleInterval 3.00
                     enabled FALSE
                     loop TRUE
                     startTime 0.00
                     stopTime -1.00 }
 DEF ColorInter3000I0 ColorInterpolator {
                     key [
                          0.00
                          1.00 ]
                     keyValue [
                          0.00 0.50 0.00
                          1.00 0.00 0.00 ]
                     ...
 geometry Circle { radius 57.00 }
 DEF TouchS3001 TouchSensor {
                     enabled TRUE }
                     ...
 ROUTE TouchS3001.isActive TO TimeSI3000I0.enabled
 ROUTE TimeSI3000I0.fraction_changed TO ColorInter3000I0.set_fraction
 ROUTE ColorInter3000I0.value_changed TO Material2D3000.emissiveColor

Figure 6. A portion of BIFS text corresponding to the XMT-α in Figure 5

 <rectangle ID="rectangle_3000" parent="group_0"
            size="162.00 110.00">
 <transformation ID="transformation_3000" translation="-163 17"
            scale="1.00 1.00"></transformation>
 <material ID="material_3000" color="#c0c0c0"
            transparency="-1.00" filled="true" >
 <animateColor attributeName="color"
            dur="1s" begin="circle_3001.click"
            values="#000000; #010000" keyTimes="0.00; 1.0" />
 </material>
 </rectangle>
 <circle ID="circle_3001" parent="group_0" radius="57.00">
 <transformation ID="transformation_3001"
            …
 <material ID="material_3001" color="#ffd700"

Figure 7. A portion of XMT-Ω corresponding to the XMT-α in Figure 5

 <seq begin="1s" >
   <string ID="string_3002" parent="group_0"
           textLines="MPEG-4 ......" dur="19s">
     <fontStyle ID="fontStyle_3002" family="Arial"
                horizontal="true" justify="BEGIN"
                language="(null)" leftToRight="true"
                size="-21.00" spacing="34.00" style="PLAIN"
                topToBottom="true">
     </fontStyle>
     <transformation ID="transformation_3002"
                translation="100 91" scale="1.00 1.00">
     </transformation>
     <material ID="material_3002" color="#ffff00"
                transparency="-1.00" filled="true" >
     </material>
   </string>
   <image ID="image_1000" parent="group_0"
          src="D:\sample_image.gif" dur="23s">
     <transformation ID="transformation_1000"
                translation="113 -25" scale="1.00 1.00">…
 </seq>

Figure 8. A portion of XMT-Ω for Example 2

 DEF Switch3002 Switch {
     whichChoice 1
     choice [
          DEF Transform2D3002 Transform2D {
               ...
               Shape {
                appearance Appearance {
                  material DEF Material2D3002 Material2D {
                               emissiveColor 1.00 1.00 0.00
                               filled TRUE
                               transparency -1.00
               ...
               geometry Text { string [ "MPEG-4 ......"]
                  fontStyle DEF FontStyle3002
                            FontStyle {
                               family "Arial "
                               horizontal TRUE
                               justify "BEGIN"
                               language "(null)"
                               leftToRight TRUE
                               size -21.00
                               spacing 34.00
                               style "PLAIN"
                               topToBottom TRUE ...
 DEF Switch1000 Switch {
     whichChoice 1
     choice [
          DEF Transform2D1000 Transform2D {
               ...
               appearance Appearance {
                  texture ImageTexture {
                               url 1
                               repeatS TRUE
                               repeatT TRUE
                  }
               geometry Bitmap {
               ...
 AT 1000 { REPLACE Switch3002.whichChoice BY 0 }
 AT 20000 { REPLACE Switch3002.whichChoice BY 1 }
 AT 20000 { REPLACE Switch1000.whichChoice BY 0 }
 AT 43000 { REPLACE Switch1000.whichChoice BY 1 }

Figure 9. A portion of BIFS text corresponding to the XMT-Ω in Figure 8

All the XMT and BIFS text shown above is generated automatically from the visual scene.

3.5 XMT Parsing

The XMT framework is based on XML, so valid XMT element nesting can be defined in a Document Type Definition (DTD) and parsed using an XML parser. XML4C[3], a validating XML parser written in a portable subset of C++, is used for parsing XMT documents.

A valid XMT document can be transformed into a scene composition tree using the DOM API[2]. The DOM API provides a tree-based interface for compiling an XML document into an internal tree structure and navigating the tree.

Media elements described within the parsed XMT document are represented as object nodes with their corresponding property nodes. Thus the scene described in the XMT document can be visualized by rendering the corresponding media object nodes using the scene composition tree. The visualized scene can also be modified and rewritten as an XMT document.

4    IMPLEMENTATION

The proposed XMT authoring tool is developed in C++ on the Windows 95/98/NT platform. The system supports the Complete2D profile for MPEG-4 content.

5    CONCLUSION

The XMT document authoring tool provides a visual, direct-manipulation authoring technique. With the system, ordinary users can create an MPEG-4 scene and its XMT format document even if they are not familiar with XMT syntax and semantics.

Moreover, the visual scene is automatically transformed into an XMT-α or XMT-Ω document without syntax errors. Likewise, a sophisticated scene, which may be very difficult to create using a text description, can be generated. In the future, it will be necessary to support more types of media data and scene nodes, such as 3D objects, and to provide a more facilitative authoring interface.

REFERENCES

[1] A. Puri and A. Eleftheriadis, “MPEG-4: An Object-Based Multimedia Coding Standard Supporting Mobile Applications,” Mobile Networks and Applications, vol. 3, pp. 5–32, 1998.
[2] Document Object Model (DOM) Level 1 Specification, W3C Recommendation, October 1998. http://www.w3.org/TR/REC-DOM-Level-1/
[3] http://www.alphaworks.ibm.com/tech/xml4c/
[4] ISO/IEC 14496-1:1999, Information technology -- Coding of audio-visual objects -- Part 1: Systems, ISO/IEC JTC1/SC29/WG11 N2501, 1999.
[5] ISO/IEC FDIS 14772:200x, Information technology -- Computer graphics and image processing -- The Virtual Reality Modeling Language (VRML).
[6] ISO/IEC xxxxx:200x, X3D, Information technology -- Computer graphics and image processing -- X3D.
[7] M. Kim, S. Wood, L.T. Cheok, “Extensible MPEG-4 textual format (XMT),” in Proc. on ACM multimedia 2000 workshops, Los Angeles, California, United States, 2000, pp. 71–74.
[8] S. Battista, F. Casalino and C. Lande, “MPEG-4: A Multimedia Standard for the Third Millennium, Part 1,” IEEE Multimedia, vol. 6, no. 4, pp. 74–83, 1999.
[9] Synchronized Multimedia Integration Language (SMIL) 1.0 Specification, W3C Recommendation, June 1998. http://www.w3.org/TR/1998/REC-smil-19980615
[10] WG11 (MPEG), MPEG-4 Overview (V.16 La Baule Version) document, ISO/IEC JTC1/SC29/WG11 N3747, October 2000.
[11] WG11 (MPEG), MPEG-4 Overview (V.18 Singapore Version) document, ISO/IEC JTC1/SC29/WG11 N4030, March 2001.
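As a supplementary illustration of the XMT parsing step of Section 3.5 and of the 'seq' timing in Example 2, here is a minimal Python sketch, not the authors' C++ implementation. It assumes Python's xml.dom.minidom as a stand-in for the DOM API of XML4C (minidom does not validate against a DTD), and it uses a simplified, well-formed version of the Figure 8 fragment with the fontStyle, transformation and material children omitted. The helper names to_ms and schedule are hypothetical.

```python
from xml.dom import minidom

# Simplified, well-formed stand-in for the XMT-Omega fragment of Figure 8.
XMT_OMEGA = """
<seq begin="1s">
  <string ID="string_3002" textLines="MPEG-4" dur="19s"/>
  <image ID="image_1000" src="sample_image.gif" dur="23s"/>
</seq>
""".strip()


def to_ms(clock_value):
    """Convert a clock value such as '19s' to integer milliseconds."""
    return int(float(clock_value.rstrip("s")) * 1000)


def schedule(seq_element):
    """Return (object ID, start ms, end ms) for each child of a <seq>."""
    cursor = to_ms(seq_element.getAttribute("begin") or "0s")
    timeline = []
    for child in seq_element.childNodes:
        if child.nodeType != child.ELEMENT_NODE:
            continue  # skip whitespace text nodes between elements
        end = cursor + to_ms(child.getAttribute("dur"))
        timeline.append((child.getAttribute("ID"), cursor, end))
        cursor = end  # in a sequence, the next child starts when this one ends
    return timeline


document = minidom.parseString(XMT_OMEGA)
timeline = schedule(document.documentElement)
for object_id, start, end in timeline:
    print(object_id, start, end)
```

Under these assumptions the text object is scheduled from 1000 ms to 20000 ms and the image object from 20000 ms to 43000 ms, which agrees with the times of the AT commands in Figure 9.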