<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Interactive Authoring Tool for Extensible MPEG-4 Textual Format (XMT)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kyungae Cha</string-name>
          <email>chaka@woorisol.knu.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sangwook Kim</string-name>
          <email>swkim@cs.knu.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Kyungpook National University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>MPEG-4 is an ISO/IEC standard which defines a multimedia system for communicating interactive scenes containing various types of media objects. The Extensible MPEG-4 Textual format (XMT) framework provides interoperability between existing practices, such as Extensible 3D (X3D), and MPEG-4. This paper introduces an XMT authoring tool that supports a visual environment for building a spatio-temporal scenario of media objects comprising a multimedia scene. The authoring tool provides a comprehensive set of editing tools for composing a multimedia scene, as well as tools for the automatic generation of XMT documents and MPEG-4 contents. This paper also describes the functionality of the developed system and shows an example of its use.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>MPEG-4, one of the leading streaming media formats, is an
ISO/IEC standard which defines a multimedia system for
communicating interactive scenes with various types of media
objects. In MPEG-4, a scene is accompanied by a
description specifying how the objects should be combined in
time and space to form the scene intended by the author.</p>
      <p>
        The scene description is coded in a binary format called
Binary Format for Scenes or BIFS[
        <xref ref-type="bibr" rid="ref1 ref10 ref4 ref7 ref8">1,4,7,8,10,11</xref>
        ], which is built
on several concepts from the Virtual Reality Modeling
Language (VRML)[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This binary form is suitable for
low-overhead transmission, so that BIFS provides an
efficient encoding for both the sender and the receiver[
        <xref ref-type="bibr" rid="ref1 ref7">1,7</xref>
        ].
      </p>
      <p>On the other hand, the Extensible MPEG-4 Textual format
(XMT) is a framework for representing MPEG-4 scene
description using a textual syntax.</p>
      <p>This paper presents an XMT document authoring tool that
enables visual composition of an MPEG-4 scene and generates
the corresponding XMT document and MPEG-4 contents. XMT
is designed to provide a high-level abstraction of MPEG-4
functionalities and easy interoperability with existing practices
such as X3D and SMIL.</p>
    </sec>
    <sec id="sec-2">
      <title>XMT-α AND XMT-Ω FORMATS</title>
      <p>
        The XMT framework consists of two levels of textual syntax
and semantics: XMT-α and XMT-Ω formats[
        <xref ref-type="bibr" rid="ref7">7,10</xref>
        ].
      </p>
      <p>
        XMT-α is an XML-based version of MPEG-4 content which
provides a straightforward, one-to-one mapping between the
textual and the binary formats of an MPEG-4 scene description.
XMT-α also provides interoperability with X3D[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which
improves upon VRML with new features such as flexible XML
encoding and a modularization approach[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It contains a subset
of X3D as well as X3D-like representations of MPEG-4
features such as Object Descriptors (OD), BIFS update
commands, and 2D composition[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        XMT-Ω is a high-level abstraction of MPEG-4 features based
on SMIL[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. It specifies objects and their relationships in
terms of the author’s intention rather than the coded nodes and
route mechanism of BIFS. With respect to reusing SMIL, XMT-Ω
defines a subset of the modules used in SMIL whose semantics are
compatible. Moreover, an XMT-Ω document can be parsed and played
directly by a W3C SMIL player, or preprocessed into the
corresponding X3D nodes and played by a VRML player. It
may also be compiled to an MPEG-4 representation such as an
mp4 file, which can then be played by an MPEG-4 player. Figure 1
shows the interoperability of XMT between the SMIL player,
the VRML player, and the MPEG-4 player.
      </p>
      <p>[Figure 1: XMT is compiled to an MPEG-4 representation (e.g. an mp4 file) and can be consumed by a SMIL player, a VRML browser, or an MPEG-4 player.]</p>
    </sec>
    <sec id="sec-3">
      <title>XMT AUTHORING SYSTEM</title>
      <p>This section describes the XMT document authoring environment
of our system and the authoring process for creating an MPEG-4
scene and an XMT document. The main functionalities of the
system are also described.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 System Structure</title>
      <p>The following figure shows the system structure and the
components of the XMT authoring tool.</p>
      <p>[Figure 2: System structure of the XMT authoring tool: media data and media decoders, XMT-α and XMT-Ω documents, the parser, the XMT-α and XMT-Ω generators, the graphical user interface, and the scene composition tree manager.]</p>
      <sec id="sec-4-5">
        <title>Scene composition tree</title>
        <p>Authors compose an MPEG-4 scene with the various editing
tools provided in the graphical user interface. Following the
authoring process, the scene composition tree, which represents
the visual scene as an internal data structure, is built and modified.</p>
        <p>Using the scene composition tree, the XMT-α or XMT-Ω
generator makes a corresponding XMT format document. At
this time the author can choose the output format that he/she
wants. The XMT format files can be parsed and then displayed
in the user interface as a visual scene. The author can also
modify the visual scene and recreate the XMT file.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3.2 Graphical User Interface</title>
      <p>The graphical user interface provides a set of drawing tools and
editing tools for various media types such as JPEG image,
MPEG-1 video, G.723 audio and graphical objects (Rectangle,
Circle, and others). These tools enable authors to compose
audio-visual scenes with direct manipulation techniques and see
the results immediately. Figure 3 presents an overview of the
graphical user interface and a simple example of a scene.</p>
      <p>Authors first select from the toolbar one of the tools they
want to add to the scene and then draw the selected object. For
image objects, the object is drawn in the interface window. For
video objects, the first frame of the video is drawn in the
interface window.</p>
      <p>Whenever a new media object is added to the scene, the
system automatically assigns the object an ID and default start
and end times. The bottom portion of figure 3 shows the timeline
window, where the timelines of the objects are arranged. The
layer of each timeline represents the drawing order of the
corresponding object, which is determined by the order in which
objects were added. Here the timeline window shows the initial
state, i.e. no modifications have been made.</p>
      <sec id="sec-5-1">
        <title>3.2.1 Spatial composition</title>
        <p>In the user interface, each object participating in a scene is
contained in a rectangular tracker so that it can be treated as an
individual object. Thus the author can move, resize, or remove
objects directly to compose the spatial arrangement of the
scene.</p>
        <p>The spatial attributes of an object are specified in terms of
the position of the object’s bounding rectangle, which is
represented as the rectangular tracker containing the object in the
user interface. The position of the bounding rectangle of an
object (i.e. the spatial attribute of the object) is specified in the
form (x, y, h, w), where w denotes the width of the bounding
rectangle, h denotes the height, and x and y denote the
coordinates of the center of the rectangle, taking the center of
the presentation’s overall rectangle as the origin of the
coordinate system.</p>
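        <p>As a concrete illustration of this coordinate convention, the following sketch converts a screen-space bounding rectangle (origin at the top left, y growing downward) into the (x, y, h, w) form just described. The function name and the y-up assumption are ours, not part of the tool.</p>

```python
def to_scene_coords(left, top, width, height, scene_w, scene_h):
    """Convert a screen-space bounding rectangle (top-left origin)
    into (x, y, h, w), where (x, y) is the rectangle's center
    relative to the center of the whole presentation rectangle."""
    x = (left + width / 2.0) - scene_w / 2.0
    # Screen y grows downward; we assume scene y grows upward
    # from the presentation center.
    y = scene_h / 2.0 - (top + height / 2.0)
    return (x, y, height, width)

# A 100x100 rectangle at the top-left corner of a 200x200 scene
# is centered at (-50, 50) in scene coordinates.
print(to_scene_coords(0, 0, 100, 100, 200, 200))
```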
        <p>The author can also apply material characteristics such as
color, transparency, and border type using editing tools. These
material properties of an object are specified as an object property
node in the internal form of our authoring system. The spatial
and material attributes of each object are automatically derived
by the system from the visual scene.</p>
      </sec>
      <sec id="sec-5-2">
        <title>3.2.2 Interactive scenario composition</title>
        <p>In the presentation of an MPEG-4 scene, user interaction is
possible to the extent specified in the scene description. Assume that
the author designs the following scenario for the scene in figure 3.</p>
        <p>Example 1. If an end user clicks the circle object, the fill color
of the rectangle object will be changed through the gradient
from red to green.</p>
        <p>Here, the circle object and the rectangle object serve as the
source object and the destination object, respectively. To make
an interactive scenario, the event type (e.g. a user’s click), the
source and destination objects, and the responding action type
(e.g. change fill color) should be specified. We denote this
interactive information as an event object, which is represented as a
quadruple (destination object ID, event type, action type, key
values). The key values are an array of values used to
change the parameters of the action type field. The event object
for the above example is specified as (3000, click, fill color,
((1.00 0.00 0.00), (0.00 0.50 0.00))), given that the rectangle object as the
destination has the object ID 3000.</p>
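        <p>The event-object quadruple above can be sketched as a small data structure; the field names are illustrative, not the system’s internal ones:</p>

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Color = Tuple[float, float, float]

@dataclass
class EventObject:
    """Quadruple (destination object ID, event type, action type,
    key values). The source object is implicit because the event
    object is attached to the source node of the scene tree."""
    dest_object_id: int
    event_type: str                       # e.g. "click"
    action_type: str                      # e.g. "fill color"
    key_values: List[Color] = field(default_factory=list)

# Example 1: clicking the circle changes the rectangle's fill
# color from red to green.
example1 = EventObject(3000, "click", "fill color",
                       [(1.00, 0.00, 0.00), (0.00, 0.50, 0.00)])
```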
        <p>We provide a dialog-based interface in order to facilitate the
interactive scenario authoring process. The event object is
specified by selecting an event type and the attributes of
the destination object that the author wants the event to
change, without the need for an extra description.</p>
      </sec>
      <sec id="sec-5-3">
        <title>3.2.3 Temporal scenario composition</title>
        <p>For composing the temporal scenario of objects, the author can
modify the timeline of each object, i.e., the author directly
modifies the length and position of the timelines in the timeline
window. Moreover, the author can declare temporal
relationships among objects, which are maintained throughout the
authoring process. Consider the following scenario for the scene
in figure 3.</p>
        <p>Example 2. The text object is rendered at the end of the image
object.</p>
        <p>The scenario can be specified if the author modifies the
timelines of the two objects as in figure 4 and declares the
two objects as a sequence group, which ensures that the objects
play sequentially.</p>
        <sec id="sec-5-3-1">
          <title>Sequence group</title>
          <p>[Figure 4: The image and text timelines arranged sequentially along the time axis.]</p>
          <p>The timeline of the image object is automatically updated to
maintain the relationship each time the duration of the text
object is modified.</p>
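          <p>The sequence-group maintenance rule can be sketched as follows: each member’s start time is pinned to the end time of its predecessor, so editing one duration re-lays-out the rest. The function is a hypothetical simplification of the tool’s behavior.</p>

```python
def layout_sequence(items, begin=0.0):
    """Lay out a sequence group: each object starts when the
    previous one ends. items is a list of (object_id, duration)."""
    timeline, t = [], begin
    for obj_id, dur in items:
        timeline.append((obj_id, t, t + dur))  # (id, start, end)
        t += dur
    return timeline

# With the timings of example 2 (text: 1s-20s, image: 23s long),
# the image automatically starts at 20s and ends at 43s.
print(layout_sequence([("text", 19.0), ("image", 23.0)], begin=1.0))
```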
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>3.3 Scene Composition Tree</title>
      <p>The scene composed in the graphical user interface is represented
as a scene composition tree, designed to organize the composed
scene into a hierarchical structure. Whenever a new object is created
in the user interface, a corresponding object node is also
created.</p>
      <p>The object node carries its object type, object ID,
and values specifying its spatio-temporal attributes. The scene
composition tree is modified through the attachment of the new
object node. At the same time, the property node of the object is
attached as a child of the new object node. Both the property
nodes and the tree structure can change throughout the
authoring process, as objects are added, replaced, or removed.
If the author creates event information, an event object containing
the destination object ID, the event type, and the values of the
transition is created and attached to the source object node as its
child. Thus an event object does not need to specify its source
object ID.</p>
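      <p>A minimal sketch of such a tree node, assuming simplified field names (the paper does not give the exact internal structure):</p>

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SceneNode:
    """One node of the scene composition tree: an object node,
    a property node, or an event node attached as a child."""
    kind: str                            # "object", "property", "event"
    object_id: Optional[int] = None
    object_type: Optional[str] = None    # e.g. "circle", "rectangle"
    attrs: dict = field(default_factory=dict)
    children: List["SceneNode"] = field(default_factory=list)

# Example 1: the event hangs off the *source* (circle) node and
# stores only the destination ID. The circle's own ID is made up.
circle = SceneNode("object", object_id=2000, object_type="circle")
circle.children.append(SceneNode("property",
                                 attrs={"fill": (1.0, 1.0, 0.0)}))
circle.children.append(SceneNode("event",
    attrs={"dest": 3000, "event_type": "click",
           "action": "fill color"}))
```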
    </sec>
    <sec id="sec-7">
      <title>3.4 Generation of XMT Document</title>
      <p>The scene composed in the graphical user interface is represented
as the scene composition tree. From the scene composition tree, both
the XMT-α and the XMT-Ω documents corresponding to the visual
scene are generated directly.</p>
      <sec id="sec-7-1">
        <title>3.4.1 XMT-α generation</title>
        <p>In XMT-α format, each object is represented as an element
similar to the object node described in BIFS. Thus, the XMT-α
format document can be generated following the BIFS
generation rules.</p>
        <p>The XMT-α generator traverses the scene composition tree
until it meets an audio or visual object node. It then creates
the corresponding object element of the XMT-α document using the
spatio-temporal attributes of the object node. With the values
specified in the object’s property node in the scene composition
tree, the XMT-α generator can describe geometric attributes
such as position, size and shape of the object or material
attributes such as fill color and border style.</p>
        <p>Figure 5 and Figure 6 show a portion of the XMT-α and the BIFS
text for the scene of example 1, respectively. In this case, when
the XMT-α generator finds the circle object node in the scene
composition tree, it also meets the circle object’s property node
as well as its event node among the object node’s children. Using the
information written in the event node, the route and sensor
nodes can be described.</p>
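        <p>The generator’s tree walk can be sketched with the standard library’s ElementTree; the element and attribute names here are illustrative, not the exact XMT-α schema:</p>

```python
import xml.etree.ElementTree as ET

def emit_alpha(node, parent):
    """Depth-first walk over a scene composition tree (plain dicts
    for brevity), emitting one element per object node and folding
    property nodes into attributes of their parent element."""
    if node["kind"] == "object":
        el = ET.SubElement(parent, node["type"], {"DEF": str(node["id"])})
        for child in node.get("children", []):
            emit_alpha(child, el)
    elif node["kind"] == "property":
        for key, value in node["attrs"].items():
            parent.set(key, str(value))

scene = ET.Element("Scene")
tree = {"kind": "object", "type": "Rectangle", "id": 3000,
        "children": [{"kind": "property", "attrs": {"size": "120 80"}}]}
emit_alpha(tree, scene)
print(ET.tostring(scene).decode())
```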
      </sec>
      <sec id="sec-7-2">
        <title>3.4.2 XMT-Ω generation</title>
        <p>
          XMT-Ω syntax and semantics have been designed using
extensible media (xMedia) objects as basic building blocks[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>The elements within XMT-Ω abstract the geometry and the
behavior of the corresponding object in the visual scene. Thus,
if an object is associated with an event object node, its behavior
should be defined by a set of animation and timing elements.</p>
        <p>Figure 7 shows the XMT-Ω format document corresponding to
the XMT-α format in figure 5. The rectangle object is defined
with the elements describing the object’s spatial and material
attributes as well as the animate elements describing a change of
fill color in response to a click on the circle object.</p>
        <p>Likewise, figure 8 shows a portion of an XMT-Ω document
specifying the scenario of example 2. It expresses a temporal
relationship and synchronization using SMIL
timing constraints. A ‘seq’ container defines a sequence of
elements in which the elements play one after the other. The text
object starts one second after the presentation begins and
disappears 19 seconds later. When the text object disappears, the
image object, whose temporal duration is 23 seconds, starts.</p>
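        <p>The mapping from this seq timing to BIFS update commands, where each object’s Switch node is toggled on at its start and off at its end, can be sketched as follows. The Switch names follow the generated text in this section, but the function itself is our simplification.</p>

```python
def seq_to_bifs_updates(items, begin_ms=0):
    """Translate a seq container's timing into BIFS-style AT/REPLACE
    commands toggling each object's Switch node (whichChoice 0 shows
    the object, 1 hides it). items: (switch_name, duration_ms)."""
    cmds, t = [], begin_ms
    for switch_name, dur_ms in items:
        cmds.append(f"AT {t} {{ REPLACE {switch_name}.whichChoice BY 0 }}")
        t += dur_ms
        cmds.append(f"AT {t} {{ REPLACE {switch_name}.whichChoice BY 1 }}")
    return cmds

# Text (Switch3002) plays from 1s for 19s, then the image
# (Switch1000) plays for 23s.
cmds = seq_to_bifs_updates([("Switch3002", 19000), ("Switch1000", 23000)],
                           begin_ms=1000)
print("\n".join(cmds))
```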
        <sec id="sec-7-2-1">
          <title>Generated BIFS text for the text object (excerpt)</title>
          <p>DEF Switch3002 Switch {
  whichChoice 1
  choice [
    DEF Transform2D3002 Transform2D {
      . . .
      Shape {
        appearance Appearance {
          material DEF Material2D3002 Material2D {
            emissiveColor 1.00 1.00 0.00
            filled TRUE
            transparency -1.00
            . . .
        geometry Text {
          string [ "MPEG-4 ......" ]
          fontStyle DEF FontStyle3002 FontStyle {
            family "Arial"
            horizontal TRUE
            justify "BEGIN"
            language "(null)"
            leftToRight TRUE
            size -21.00
            spacing 34.00
            style "PLAIN"
            topToBottom TRUE
            . . .</p>
        </sec>
        <sec id="sec-7-2-2">
          <title>Generated BIFS text for the image object and the BIFS update commands (excerpt)</title>
          <p>DEF Switch1000 Switch {
  whichChoice 1
  choice [
    DEF Transform2D1000 Transform2D {
      . . .
      appearance Appearance {
        texture ImageTexture {
          url 1
          repeatS TRUE
          repeatT TRUE
        }
      }
      geometry Bitmap {
        . . .</p>
          <p>AT 1000 { REPLACE Switch3002.whichChoice BY 0 }
AT 20000 { REPLACE Switch3002.whichChoice BY 1 }
AT 20000 { REPLACE Switch1000.whichChoice BY 0 }
AT 43000 { REPLACE Switch1000.whichChoice BY 1 }</p>
          <p>All the XMT and BIFS text shown above is generated automatically from the visual scene.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>3.5 XMT Parsing</title>
      <p>
        The XMT framework is based on XML; thus valid XMT
element nesting can be defined in a Document Type
Definition (DTD) and parsed with an XML parser. XML4C[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a validating XML
parser written in a portable subset of C++,
is used for parsing XMT documents.
      </p>
      <p>
        A valid XMT document can be transformed into a
scene composition tree using the DOM API [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The DOM provides
a tree-based API for compiling an XML document into an internal
tree structure and navigating the tree.
      </p>
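      <p>The same tree-based transformation can be sketched with Python’s stdlib DOM (the system itself uses XML4C in C++; the element names in the fragment below are illustrative):</p>

```python
from xml.dom.minidom import parseString

XMT = """<par>
  <img id="1000" begin="20s" dur="23s"/>
  <text id="3002" begin="1s" dur="19s"/>
</par>"""

def to_object_nodes(element):
    """Map each media element of a parsed XMT fragment to a
    scene-composition-tree object node (a dict here for brevity)."""
    nodes = []
    for child in element.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            nodes.append({"object_type": child.tagName,
                          "object_id": child.getAttribute("id"),
                          "begin": child.getAttribute("begin"),
                          "dur": child.getAttribute("dur")})
    return nodes

doc = parseString(XMT)
print(to_object_nodes(doc.documentElement))
```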
      <p>Media elements described within the parsed XMT document
are represented as object nodes with their corresponding
property nodes. Thus the scene described in the XMT document
can be visualized by rendering the corresponding media object
nodes using the scene composition tree. The visualized scene
can also be modified and rewritten as an XMT document.</p>
    </sec>
    <sec id="sec-9">
      <title>IMPLEMENTATION</title>
      <p>The proposed XMT authoring tool was developed in C++
on the Windows 95/98/NT platform. The system supports the
Complete2D profile for MPEG-4 contents.</p>
    </sec>
    <sec id="sec-10">
      <title>CONCLUSION</title>
      <p>The XMT document authoring tool provides a visual,
direct-manipulation authoring technique. With the system, common users
can create an MPEG-4 scene and its XMT format document
even if they are not familiar with XMT syntax and semantics.</p>
      <p>Moreover, the visual scene is automatically transformed into an
XMT-α or XMT-Ω document without syntax errors. Likewise, a
sophisticated scene, which may be very difficult to create using a
textual description, can be generated. In the future, we plan
to support more types of media data, scene nodes such as 3D
objects, and a more convenient authoring interface.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Puri</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Eleftheriadis</surname>
          </string-name>
          , “
          <article-title>MPEG-4: An Object-Based Multimedia Coding Standard Supporting Mobile Applications,” Mobile Networks and Applications</article-title>
          , vol.
          <volume>3</volume>
          , pp.
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Document Object Model (DOM) Level 1 Specification, W3C Recommendation, October 1998. http://www.w3.org/TR/REC-DOM-Level-1/</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] http://www.alphaworks.ibm.com/tech/xml4c/</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] ISO/IEC 14496-1:1999, Information technology - Coding of audio-visual objects - Part 1: Systems, ISO/IEC JTC1/SC29/WG11 N2501, 1999.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] ISO/IEC FDIS 14772:200x, Information technology - Computer graphics and image processing - The Virtual Reality Modeling Language (VRML).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] ISO/IEC xxxxx:200x, X3D, Information technology - Computer graphics and image processing - X3D.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wood</surname>
          </string-name>
          , L.T. Cheok, “
          <article-title>Extensible MPEG-4 textual format (XMT),”</article-title>
          <source>in Proc. on ACM multimedia 2000 workshops</source>
          , Los Angeles, California, United States,
          <year>2000</year>
          , pp.
          <fpage>71</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Battista</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Casalino</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Lande</surname>
          </string-name>
          , “
          <article-title>MPEG-4: A Multimedia Standard for the Third Millennium</article-title>
          ,
          <source>Part</source>
          <volume>1</volume>
          ,” IEEE Multimedia, vol.
          <volume>6</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>83</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Synchronized Multimedia Integration Language (SMIL) 1.0 Specification, W3C Recommendation, June 1998. http://www.w3.org/TR/1998/REC-smil-19980615</mixed-citation>
        <mixed-citation>[10] WG11 (MPEG), MPEG-4 Overview (V.16 La Baule Version) document, ISO/IEC JTC1/SC29/WG11 N3747, October 2000.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[11] WG11 (MPEG), MPEG-4 Overview (V.18 Singapore Version) document, ISO/IEC JTC1/SC29/WG11 N4030, March 2001.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>