Interactive Authoring Tool for Extensible MPEG-4 Textual Format (XMT)

Kyungae Cha¹ and Sangwook Kim²

¹ Department of Computer Science, Kyungpook National University, Daegu, Korea, email: chaka@woorisol.knu.ac.kr
² Department of Computer Science, Kyungpook National University, Daegu, Korea, email: swkim@cs.knu.ac.kr

Abstract. MPEG-4 is an ISO/IEC standard which defines a multimedia system for communicating interactive scenes containing various types of media objects. The Extensible MPEG-4 Textual format (XMT) framework provides interoperability between existing practices such as the Extensible 3D (X3D) and MPEG-4. This paper introduces an XMT authoring tool that supports a visual environment for building a spatio-temporal scenario of the media objects comprising a multimedia scene. The authoring tool provides a comprehensive set of facilitative editing tools for composing a multimedia scene, as well as tools for automatic generation of XMT documents and MPEG-4 content. This paper also describes the functionality of the developed system and shows an example of its use.

1 INTRODUCTION

MPEG-4, one of the leading streaming media formats, is an ISO/IEC standard which defines a multimedia system for communicating interactive scenes with various types of media objects. In MPEG-4, a scene is accompanied by a description specifying how the objects should be combined in time and space in order to form the scene intended by the author. The scene description is coded in a binary format called Binary Format for Scenes or BIFS[1,4,7,8,10,11], which is built on several concepts from the Virtual Reality Modeling Language (VRML)[5]. This binary form is suitable for low-overhead transmission, so BIFS is efficient for both the sender and the receiver[1,7].

On the other hand, the Extensible MPEG-4 Textual format (XMT) is a framework for representing an MPEG-4 scene description using a textual syntax.

This paper presents an XMT document authoring tool that enables visual composition of an MPEG-4 scene and generates the corresponding XMT document and MPEG-4 content. XMT is designed to provide a high-level abstraction for MPEG-4 functionalities and easy interoperability between existing practices of content authors, such as the Extensible 3D (X3D) being developed by the Web3D Consortium and the Synchronized Multimedia Integration Language (SMIL) from the W3C consortium[7,11]. Thus authors can use the XMT authoring tool to obtain multimedia content that is exchangeable and interoperable with X3D and SMIL.

In the authoring system, authors can visually make a spatial arrangement of media objects and compose the temporal behavior of objects with a timeline approach. Authors can also modify the material characteristics of each object using interactive and visual tools. Moreover, the visual scene is automatically transformed into an XMT-α or XMT-Ω format document.

In section 2, the XMT formats are briefly discussed. In section 3, the various functions of the XMT authoring tool are described. The implementation of the proposed system is then presented in section 4. Finally, section 5 gives the conclusion and presents our future plans.

2 XMT-α AND XMT-Ω FORMATS

The XMT framework consists of two levels of textual syntax and semantics: the XMT-α and XMT-Ω formats[7,10].

XMT-α is an XML-based version of MPEG-4 content which provides a straightforward, one-to-one mapping between the textual and the binary formats of an MPEG-4 scene description. XMT-α also provides interoperability with X3D[5], which improves upon VRML with new features such as flexible XML encoding and a modularization approach[6]. It contains a subset of X3D as well as X3D-like representations of MPEG-4 features such as Object Descriptors (OD), BIFS update commands and 2D composition[7].

XMT-Ω is a high-level abstraction of MPEG-4 features based on SMIL[9]. It specifies objects and their relationships in terms of the author's intention rather than the coded nodes and route mechanism of BIFS. To reuse SMIL, XMT-Ω defines a subset of the SMIL modules whose semantics are compatible. Moreover, the XMT-Ω format can be parsed and played directly by a W3C SMIL player, or preprocessed to the corresponding X3D nodes and played by a VRML player. It may also be compiled to an MPEG-4 representation such as mp4, which can then be played by an MPEG-4 player. Figure 1 shows the interoperability of XMT between a SMIL player, a VRML player and an MPEG-4 player.

[Diagram labels: Parse, SMIL Player; VRML Browser; Compile, MPEG-4 Representation (e.g. mp4 file), MPEG-4 Player]
Figure 1. The interoperability of XMT

3 XMT AUTHORING SYSTEM

This section shows the XMT document authoring environment of our system and the authoring process in creating an MPEG-4 scene and an XMT document. The main functionalities of the system are also described.

3.1 System Structure

The following figure shows the system structure and every component of the XMT authoring tool.

[Diagram labels: Media data; Media Decoders; Graphical User Interface; Scene composition tree Manager; Scene composition tree; Parser; XMT documents; XMT_α Generator; XMT_α document; XMT_Ω Generator; XMT_Ω document]
Figure 2. System Structure

Authors compose an MPEG-4 scene with the various editing tools provided in the graphical user interface. As authoring proceeds, the scene composition tree, which represents the visual scene as an internal data structure, is built and modified. Using the scene composition tree, the XMT-α or XMT-Ω generator produces a corresponding XMT format document. At this point the author can choose the desired output format. XMT format files can also be parsed and then displayed in the user interface as a visual scene. The author can then modify the visual scene and recreate the XMT file.

3.2 Graphical User Interface

The graphical user interface provides a set of drawing tools and editing tools for various media types such as JPEG image, MPEG-1 video, G.723 audio and graphical objects (Rectangle, Circle, and others). These tools enable authors to compose audio-visual scenes by direct manipulation and see them immediately. Figure 3 presents an overview of the XMT graphical user interface and a simple example of a scene.

Figure 3. Graphical user interface

Authors first select from the toolbar the tool for the object they want to add to the scene and then draw the selected object. For image objects, the object is drawn in the interface window. For video objects, the first frame of the video is drawn in the interface window.

Whenever a new media object is added to the scene, the system automatically assigns the object ID, start time and end time of the object with default values. The bottom portion of figure 3 shows the timeline window where the timelines of objects are arranged. The layer of a timeline represents the drawing order of the corresponding object, which is determined by the order in which objects were added. Here the timeline window shows the initial state, i.e. no modification has occurred.

3.2.1 Spatial composition

In the user interface, each object participating in a scene is contained in a rectangular tracker so that it is treated as an individual object. Thus the author can move, resize or remove objects directly to compose the spatial arrangement of the scene.

The spatial attributes of an object are specified in terms of the spatial position of the object's bounding rectangle, which is represented as the rectangular tracker containing the object in the user interface. The spatial position of the bounding rectangle of an object (i.e. the spatial attribute of the object) is specified in the form (x,y,h,w), where w denotes the width of the bounding rectangle, h denotes its height, and x and y denote the coordinates of the center of the rectangle, taking the center of the whole presentation rectangle as the origin of the coordinate system.

The author can also apply material characteristics such as color, transparency, and border type using editing tools. These material properties of an object are specified as an object property node in the internal form of our authoring system. The spatial and material attributes of each object are automatically specified by the system from the visual scene.

3.2.2 Interactive scenario composition

In the presentation of an MPEG-4 scene, user interaction is possible within the bounds set by the scene description. Assume that the author designs the following scenario for the scene in figure 3.

Example 1. If an end user clicks the circle object, the fill color of the rectangle object will be changed through a gradient from red to green.

Here, the circle object and the rectangle object are the source object and the destination object respectively. To make an interactive scenario, the event type (e.g. user's click), the source and destination objects, the responding action type (e.g. change fill color), etc., should be specified. We denote the interactive information as an event object, which is represented as a quadruple (destination object ID, event type, action type, key values). The key values are an array of values used to change the parameters of the action type field. The event object for the above example is specified as (3000, click, fill color, ((1.00 0.00 0.00), (0.00 0.50 0.00))), if the rectangle object as the destination has the number 3000 for its object ID.

We provide a dialog-based interface in order to facilitate the interactive scenario authoring process. The event object is specified by selecting an event type and the attributes of the destination object that the author wants the event to change, without the need for an extra description.

3.2.3 Temporal scenario composition

To compose the temporal scenario of objects, the author can modify the timeline of each object, i.e., the author directly modifies the length and position of timelines in the timeline window. Moreover, the author can declare temporal relationships among objects, which are maintained throughout the authoring process. Consider the following scenario for the scene in figure 3.

Example 2. The image object is rendered when the text object ends.

The scenario can be specified if the author modifies the timelines of the two objects as in figure 4 and declares the two objects as a sequence group, which keeps the objects playing sequentially.

[Timeline diagram labels: Sequence group; image; text; time]
Figure 4. An example of timeline modification and temporal relationship declaration

The timeline of the image object is automatically updated to maintain the relationship each time the duration of the text object is modified.

3.3 Scene Composition Tree

The resulting graphical user interface is represented as a scene composition tree designed to organize the composed scene into a hierarchical structural form. Whenever a new object is created in the user interface, the corresponding object node is also created.

The object node has its corresponding object type, object ID and values specifying spatio-temporal attributes. The scene composition tree is modified through the attachment of the new object node. At the same time, the property node of the object is attached as a child node of the new object node. The property node as well as the tree structure can be changed throughout the authoring process. The tree structure changes as objects are added, replaced, or removed. If the author creates event information, an event object which contains the destination object ID, event type and values of transition status is created and attached to the source object node as its child node. Thus an event object does not specify its source object ID.

3.4 Generation of XMT Document

The resulting graphical user interface is represented as the scene composition tree. From the scene composition tree, both the XMT-α and the XMT-Ω documents corresponding to the visual scene are directly generated.

3.4.1 XMT-α generation

In the XMT-α format, each object is represented as an element similar to the object node described in BIFS. Thus, the XMT-α format document can be generated following the BIFS generation rules.

The XMT-α generator searches the scene composition tree until it reaches an audio or visual object node. It then creates the corresponding object element of the XMT-α document using the spatio-temporal attributes of the object node. With the values specified in the object's property node in the scene composition tree, the XMT-α generator can describe geometric attributes such as the position, size and shape of the object, or material attributes such as fill color and border style.

Figure 5 and Figure 6 show a portion of the XMT-α and BIFS text for the scene of example 1 respectively. In this case, when the XMT-α generator finds the circle object node in the scene composition tree, it also finds the circle object's property node as well as its event node among the object node's children. Using the information written in the event node, the route and sensor nodes can be described.

[Figure 5, XMT-α for example 1: surviving fragments]

    … 17" scale="1.00 1.00">
    … key="0.00 1.00" …
    … toNode="Material2D3000" …

[Figure 6, BIFS text for example 1: surviving fragments]

    ...
    translation -163.00 17.00
    scale 1.00 1.00
    children [
      Shape {
        ...
          material DEF Material2D3000 Material2D {
            emissiveColor 0.75 0.75 0.75
            filled TRUE
            transparency -1.00
            ...
        geometry Rectangle {
          size 162.00 110.00
        }
    ...
    DEF TimeSI3000I0 TimeSensor {
      cycleInterval 3.00
      enabled FALSE
      loop TRUE
      startTime 0.00
      stopTime -1.00
    }
    DEF ColorInter3000I0 ColorInterpolator {
      ... 1.00 ]
      keyValue [ 0.00 0.50 0.00 1.00 0.00 0.00 ]
    ...
    geometry Circle {
      radius 57.00
    }
    ...
    ROUTE TouchS3001.isActive TO TimeSI3000I0.enabled
    ... ColorInter3000I0.set_fraction
    ROUTE ColorInter3000I0.value_changed TO Material2D3000.emissiveColor

3.4.2 XMT-Ω generation

XMT-Ω syntax and semantics have been designed using extensible media (xMedia) objects as basic building blocks[7]. The elements within XMT-Ω abstract the geometry and the behavior of the corresponding object in the visual scene. Thus, if an object is associated with an event object node, its behavior should be defined by a set of animation and timing elements.

Figure 7 shows the XMT-Ω format document corresponding to the XMT-α format in figure 5. The rectangle object is defined with the elements describing the object's spatial and material attributes as well as the animate elements describing a change of fill color which responds to a click on the circle object.

    … dur="1s" begin="circle_3001.click" … values="#000000; #010000" keyTimes="0.00; 1.0" />

Figure 7. A portion of XMT-Ω corresponding XMT-α in figure 5

Likewise, figure 8 shows a portion of the XMT-Ω document specifying the scenario of example 2. It represents a temporal relationship and synchronization module expression using SMIL timing constraints. A 'seq' container defines a sequence of elements in which elements play one after the other. The text object starts one second after the presentation begins and disappears 19 seconds later. When the text object disappears, the image object, whose temporal duration is 23 seconds, starts. Figure 9 shows the BIFS text corresponding to the XMT-Ω in figure 8.

    … textLines="MPEG-4 ......" dur="19s"> …

Figure 8. A portion of XMT-Ω for the example 2

    DEF Switch3002 Switch {
      whichChoice 1
      choice [
        DEF Transform2D3002 Transform2D {
          ...
          Shape {
            appearance Appearance {
              material DEF Material2D3002 Material2D {
                emissiveColor 1.00 1.00 0.00
                filled TRUE
                transparency -1.00
                ...
            geometry Text {
              string [ "MPEG-4 ......" ]
              fontStyle DEF FontStyle3002 FontStyle {
                family "Arial "
                horizontal TRUE
                justify "BEGIN"
                language "(null)"
                leftToRight TRUE
                size -21.00
                spacing 34.00
                style "PLAIN"
                topToBottom TRUE
    . . .
    DEF Switch1000 Switch {
      whichChoice 1
      choice [
        DEF Transform2D1000 Transform2D {
          ...
          appearance Appearance {
            texture ImageTexture {
              url 1
              repeatS TRUE
              repeatT TRUE
            }
          geometry Bitmap {
    ...
    AT 1000 { REPLACE Switch3002.whichChoice BY 0 }
    AT 20000 { REPLACE Switch3002.whichChoice BY 1 }
    AT 20000 { REPLACE Switch1000.whichChoice BY 0 }
    AT 43000 { REPLACE Switch1000.whichChoice BY 1 }

Figure 9. A portion of BIFS text corresponding XMT-Ω in figure 8

All the XMT and BIFS text shown above is generated automatically from the visual scene.

3.5 XMT Parsing

The XMT framework is based on XML, so valid XMT element nesting can be defined in a Document Type Declaration (DTD) and parsed using an XML parser. XML4C[3], a validating XML parser written in a portable subset of C++, is used for parsing XMT documents. … scene composition tree using the DOM API [2]. The DOM API provides … tree structure and navigate the tree.

Media elements described within the parsed XMT document are represented as object nodes with their corresponding property nodes. Thus the scene described in the XMT document can be visualized by rendering the corresponding media object nodes using the scene composition tree. The visualized scene can also be modified and rewritten as an XMT document.

4 IMPLEMENTATION

The proposed XMT authoring tool is developed using C++ under the Windows 95/98/NT platform. The system supports the Complete2D profile for MPEG-4 content.

5 CONCLUSION

The XMT document authoring tool provides a visual, direct-manipulation authoring technique. With the system, ordinary users can create an MPEG-4 scene and its XMT format document even if they are not familiar with XMT syntax and semantics. Moreover, the visual scene is automatically transformed into an XMT-α or XMT-Ω document without syntax errors. Likewise, a sophisticated scene, which may be very difficult to create using a text description, can be generated. In the future, it is necessary to support more types of media data and scene nodes, such as 3D objects, and a more facilitative authoring interface.

REFERENCES

[1] A. Puri and A. Eleftheriadis, “MPEG-4: An Object-Based Multimedia Coding Standard Supporting Mobile Applications,” Mobile Networks and Applications, vol. 3, pp. 5–32, 1998.
[2] Document Object Model (DOM) Level 1 Specification, W3C Recommendation, October 1998. http://www.w3.org/TR/REC-DOM-Level-1/
[3] http://www.alphaworks.ibm.com/tech/xml4c/
[4] ISO/IEC 14496-1:1999, Information technology - Coding of audio-visual objects - Part 1: Systems, ISO/IEC JTC1/SC29/WG11 N2501, 1999.
[5] ISO/IEC FDIS 14772:200x, Information technology - Computer graphics and image processing - The Virtual Reality Modeling Language (VRML).
[6] ISO/IEC xxxxx:200x, X3D, Information technology - Computer graphics and image processing - X3D.
[7] M. Kim, S. Wood, L.T. Cheok, “Extensible MPEG-4 textual format (XMT),” in Proc. ACM Multimedia 2000 Workshops, Los Angeles, California, United States, 2000, pp. 71–74.
[8] S. Battista, F. Casalino and C. Lande, “MPEG-4: A Multimedia Standard for the Third Millennium, Part 1,” IEEE Multimedia, vol. 6, no. 4, pp. 74–83, 1999.
[9] Synchronized Multimedia Integration Language (SMIL) 1.0 Specification, W3C Recommendation, June 1998. http://www.w3.org/TR/1998/REC-smil-19980615
[10] WG11 (MPEG), MPEG-4 Overview (V.16 La Baule Version) document, ISO/IEC JTC1/SC29/WG11 N3747, October 2000.
[11] WG11 (MPEG), MPEG-4 Overview (V.18 Singapore Version) document, ISO/IEC JTC1/SC29/WG11 N4030, March 2001.
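
As a small worked illustration of the spatial attribute (x,y,h,w) described in Section 3.2.1, the following Python sketch converts a rectangle given in window pixels (top-left origin) into the center-based form. The function name and the window-coordinate convention are hypothetical, and the upward-pointing y axis is an assumption; the paper only states that the origin is the center of the presentation.

```python
from typing import Tuple


def to_scene_rect(left: float, top: float, width: float, height: float,
                  scene_width: float, scene_height: float
                  ) -> Tuple[float, float, float, float]:
    """Return (x, y, h, w) with the scene center as origin.

    Assumes window coordinates grow rightwards and downwards from the
    top-left corner, and that the scene's y axis points upwards.
    """
    # Center of the rectangle in window coordinates, shifted to the scene center.
    x = left + width / 2 - scene_width / 2
    # Flip the vertical axis so that positions above the center are positive.
    y = scene_height / 2 - (top + height / 2)
    return (x, y, height, width)


# A 100x50 rectangle in the top-left corner of a 400x300 presentation.
print(to_scene_rect(0, 0, 100, 50, 400, 300))
```

A rectangle centered in the window maps to x = 0 and y = 0 under these assumptions.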
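
The scene composition tree of Section 3.3 and the event object of Section 3.2.2 can be pictured with a minimal Python sketch. This is not the authors' C++ implementation: the class and field names are hypothetical, and only the quadruple (destination object ID, event type, action type, key values) and the parent-child arrangement come from the paper.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

Color = Tuple[float, float, float]


@dataclass
class EventObject:
    """Quadruple of Section 3.2.2; the source is implied by the parent node."""
    destination_id: int
    event_type: str
    action_type: str
    key_values: List[Color]


@dataclass
class PropertyNode:
    """Material attributes of one object (field names are illustrative)."""
    fill_color: Color = (0.75, 0.75, 0.75)
    border: str = "none"


@dataclass
class ObjectNode:
    """A media object with its property node and event objects as children."""
    object_type: str
    object_id: int
    properties: PropertyNode = field(default_factory=PropertyNode)
    events: List[EventObject] = field(default_factory=list)


@dataclass
class SceneTree:
    objects: List[ObjectNode] = field(default_factory=list)

    def add(self, node: ObjectNode) -> ObjectNode:
        self.objects.append(node)
        return node

    def find(self, object_id: int) -> Optional[ObjectNode]:
        return next((n for n in self.objects if n.object_id == object_id), None)

    def interactions(self) -> Iterator[Tuple[int, EventObject]]:
        """Yield (source object ID, event object); the source is the parent node."""
        for node in self.objects:
            for event in node.events:
                yield node.object_id, event


# Example 1: clicking the circle (3001) changes the rectangle's (3000) fill color.
tree = SceneTree()
tree.add(ObjectNode("Rectangle", 3000))
circle = tree.add(ObjectNode("Circle", 3001))
circle.events.append(
    EventObject(3000, "click", "fill color", [(1.0, 0.0, 0.0), (0.0, 0.5, 0.0)])
)

for source_id, event in tree.interactions():
    target = tree.find(event.destination_id)
    print(f"{event.event_type} on {source_id} -> {event.action_type} of "
          f"{target.object_type} {target.object_id}")
```

Because the event object hangs off the circle's node, a generator walking the tree can recover the source ID from the parent, which is why the quadruple itself does not need to store it.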
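
The timing of the sequence group in example 2 can be checked with a short Python sketch. It assumes, as an interpretation of Figure 9 rather than something the paper states, that setting whichChoice to 0 shows an object and setting it to 1 hides it; the function names are hypothetical.

```python
from typing import List, Tuple


def sequence_schedule(items: List[Tuple[str, float]],
                      first_begin: float = 0.0) -> List[Tuple[str, float, float]]:
    """Lay out (name, duration) pairs back to back, as a 'seq' container would.

    Returns (name, begin, end) triples in seconds.
    """
    schedule = []
    t = first_begin
    for name, duration in items:
        schedule.append((name, t, t + duration))
        t += duration  # the next element starts when this one ends
    return schedule


def switch_commands(schedule: List[Tuple[str, float, float]]) -> List[str]:
    """Render show/hide update commands in the style of Figure 9 (times in ms)."""
    lines = []
    for name, begin, end in schedule:
        lines.append(f"AT {round(begin * 1000)} {{ REPLACE {name}.whichChoice BY 0 }}")
        lines.append(f"AT {round(end * 1000)} {{ REPLACE {name}.whichChoice BY 1 }}")
    return lines


# Text (Switch3002) starts at 1 s and lasts 19 s; the image (Switch1000) lasts 23 s.
schedule = sequence_schedule([("Switch3002", 19.0), ("Switch1000", 23.0)],
                             first_begin=1.0)
for line in switch_commands(schedule):
    print(line)
```

With these inputs the text object occupies 1 s to 20 s and the image object 20 s to 43 s, which reproduces the four AT commands at 1000, 20000, 20000 and 43000 listed in Figure 9.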