=Paper=
{{Paper
|id=Vol-209/paper-1
|storemode=property
|title=Translating Documents into Semantic Documents using Semantic Web and Web2.0
|pdfUrl=https://ceur-ws.org/Vol-209/saaw06-full05-kim.pdf
|volume=Vol-209
|dblpUrl=https://dblp.org/rec/conf/semweb/KimKCD06
}}
==Translating Documents into Semantic Documents using Semantic Web and Web2.0==
Hak Lae Kim, Digital Enterprise Research Institute, National University of Ireland, Galway, IDA Business Park, Galway, Ireland, +353-91-495016, haklae.kim@deri.org
Hong Gee Kim, Seoul National University, 28-22 Yeonkun-dong, Jongro-gu, Seoul, Korea, +82-2-7707452, hgkim@snu.ac.kr
Jae Hwa Choi, Dankook University, 29, Anseo-Dong, Chonan, Chungnam, Korea, +82-41-550-3368, jchoi@dankook.ac.kr
Stefan Decker, Digital Enterprise Research Institute, National University of Ireland, Galway, IDA Business Park, Galway, Ireland, +353-91-495016, stefan.decker@deri.org
ABSTRACT
Managing document metadata is a difficult and error-prone task for desktop users. A wide variety of technologies have been applied to support the requirements of metadata management, ranging over the acquisition, creation, maintenance, retrieval, reuse, and publishing of metadata.
We introduce the essential concepts of a semantic document and implement the functionality needed for the metadata managing process. We also propose three tasks that facilitate the unambiguous representation of metadata in documents: using XMP to store metadata in the file itself, using ontologies to represent semantic concepts, and using Social Web services to interact with web-based resources. Our approach thus allows a user to interact with and share resources between a Desktop and the Web more easily.

Categories and Subject Descriptors
I.7.1 [Document and Text Editing]

General Terms
Management, Documentation, Design, Reliability, Human Factors, Languages.

Keywords
Semantic Document, Semantic Desktop, Web2.0, Folksonomy, Semantic Web etc.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAAW’06, November 6, 2006, Athens, GA, USA
Copyright 2006 ACM 1-58113-000-0/00/0004…$5.00.

1. INTRODUCTION
Managing electronic documents on a Desktop is a challenging task for end users [5]. There are many kinds of applications and software components for managing electronic documents on a Desktop, but it is very difficult to organize documents consistently and to find the expected ones precisely.

There have been many efforts [2, 3, 5, 6, 13, 19, 23] to reduce the complexity of metadata operations by implementing automatic tools for acquisition, extraction, storage, and annotation. The Social Semantic Desktop [1] and Web2.0 are also promising technologies for metadata management.

The Social Semantic Desktop is a new computing paradigm that provides an advanced way to create, automate and structure information, together with “the technology convergences including the social network and community services, P2P services” [1, 3]. It could transform a typical desktop system into a collaborative environment that supports both personal computing and information sharing via social and organizational channels [17]. Several approaches go in this direction, such as Haystack (http://haystack.lcs.mit.edu/), Gnowsis (http://nepomuk.semanticdesktop.org/xwiki/bin/Main1/), and IRIS (http://www.openiris.org/).

Web2.0 comprises technologies and services that enable users to collaborate and share social content. From the technical point of view, it includes social software, content syndication, and messaging protocols, for example weblogs, wikis, podcasts, and RSS feeds. Social software focuses not only on connecting people but also on sharing data. It therefore plays an important role in building social networks on the web. Well-known Web2.0 sites include Flickr (http://www.flickr.com), del.icio.us (http://del.icio.us), and Technorati (http://www.technorati.com), and the majority of such sites connect people into communities, creating networks of shared experience using folksonomy and RSS [10]. In general terms, a folksonomy is a set of tags, each containing one or more keywords. Users create tags from their own knowledge; when other people use the same terms, the content is
linked. Hence, Social Web services combine the features of web services and social software through a folksonomy.

1.1 Problems
As the discussion of semantic documentation in Section 2 illustrates, desktop environments face several critical management problems [6]:

Heavyweight cognitive activity. The hierarchical file structure of desktop systems allows users to find documents easily and also reminds them of their respective tasks. There are, however, critical limitations in the file structure for managing information resources within a Desktop application. Regardless of their working habits, users need to remember a document’s name, the directory it was saved in, the time it was saved, and other details. Because users perform most of these activities themselves, this requires heavyweight cognitive activity.

Multiple semantics. The hierarchical file system does not provide multiple semantics for a single directory. How could a user save a paper that is about both a conference and a location? The user could create a folder named “Conference_Location” or “ConferenceLocation”. This approach is ambiguous and does not reflect multiple semantics correctly. In other words, a computer cannot process the relationships between file names and directory names if they are named differently.

Poor updatability and interoperability. Compared with web content, Desktop content is difficult to modify without the owner’s intervention. If users spent a significant amount of time adding and modifying documents, the updatability of desktop content might be high. However, most people do not spend time adding information to their documents. It is also hard to share documents with other users, despite P2P and instant messaging tools, both of which are supposed to provide file sharing services.

Editing problem. Metadata-oriented approaches provide enriched functionality such as managing, searching, and even sharing information in information systems. A variety of metadata schemes exist as de facto standards, such as RDF, Dublin Core, and vCard. But these approaches are not a panacea. Operations over metadata are complex and time-consuming. Moreover, metadata is stored separately from the document and connected to it by external references or links such as XPointers. When a document is edited, deleted, or copied, maintaining those links becomes a problem. The Open Hypermedia community has termed this the editing problem. A straightforward solution to the “editing problem” [4] is to embed the metadata in the document itself.

1.2 Contributions
We present three contributions. (i) We propose an architecture and implement a tool for interaction between a Desktop and the Web. It bootstraps the management of metadata and encourages users to participate in information management. (ii) We show how desktop documents can be enriched using existing technologies such as the Semantic Web and Web2.0. Ontology-based and folksonomy-based metadata are important parts of our system. Metadata generated by a user can be saved in the document itself as XMP, so other users can easily reuse and share it. (iii) We provide a user-friendly interface for extracting or creating metadata and for efficient navigation through ontologies and tags.

1.3 Outline of the paper
The main part of this paper is about how desktop systems can use resources to enrich the metadata in documents. We use the Social Semantic Desktop and Web2.0 technologies to create semantic documents on a Desktop. In particular, we focus on PDF (Portable Document Format), one of the most widely used document formats, and on XMP, which represents embedded metadata in PDF.

The remainder of this paper is structured as follows: Section 2 defines semantic documentation and proposes the Semantic Document Model for our research. Section 3 explains the design principles. Section 4 describes the system architecture and the metadata managing process for a semantic document. Finally, Section 5 concludes the paper.

2. Semantic Documentation
2.1 Semantic Document
Reeve and Han [31] define semantic annotation as “the process of mapping instance data” to a semantic structure such as an ontology. A semantic document includes any information regarding the document and its relationship with other documents [27]. A semantic annotation of documents formally identifies concepts and relations between concepts in documents, and is intended primarily for use by machines [28]. Semantic annotation is therefore a key notion and a basic technology for realizing a semantic document. It augments data to facilitate automatic recognition of the underlying semantic structure, such as document structure (title, section, paragraph, etc.) and linguistic structure (dependency, coordination, thematic role, coreference, etc.). It is based on semantic links between the information stored within a document and an ontology. Ontologies are conceptualizations of a domain that are typically represented using domain vocabulary.

2.2 PDF and XMP
PDF is an open document format developed by Adobe. Many authors and publishers use it to store and view documents. Using PDF as the basis for semantic documents has several advantages. PDF supports on-line viewing and printing while containing semantic information linked to the document itself [26], and it provides extensible ways to add new information inside the document using XMP.
In a nutshell, XMP (Extensible Metadata Platform) is a format for embedding metadata in documents. It is a labeling technology that allows users to embed data about a file, known as metadata, into the file itself [10, 11, 15]. It consists of a data model, a storage model, and schemas. The data model is a useful and flexible way of describing metadata in documents. It defines the kinds of
metadata values and concepts that can be represented. The storage model, as the implementation of the data model, includes the serialization of the metadata as a stream of XML and XMP Packets, a means of packaging the data in files [10]. Schemas are predefined sets of metadata property definitions that are relevant for a wide range of applications, including all of Adobe’s editing and publishing products as well as applications from a wide variety of vendors.
The specific serialization syntax is less important than the data model: as long as the mapping to the data model is well defined, it is reasonably easy to convert between different ways of writing the metadata [11]. XMP makes use of the Resource Description Framework (RDF), which is based on XML. By adopting the RDF standard, XMP benefits from the documentation, tools, and shared implementation experience that come with an open W3C standard [7-10].
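The relationship between the XMP packet and its RDF/XML payload can be illustrated with a minimal Python sketch. It is not the tool described in this paper, which uses Jena; the sample bytes, the helper names `find_xmp_packet` and `read_title`, and the simplified packet header are assumptions made for illustration only. The sketch scans a byte string for the `<?xpacket ...?>` wrapper and then reads one Dublin Core property from the RDF inside it.

```python
import re
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical stand-in for the bytes of a file that carries an XMP packet.
SAMPLE_FILE = (
    b"%PDF-1.4 (other file content)\n"
    b'<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    b'<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b'<dc:title><rdf:Alt><rdf:li xml:lang="x-default">Semantic Documents</rdf:li>'
    b"</rdf:Alt></dc:title>"
    b"</rdf:Description></rdf:RDF></x:xmpmeta>\n"
    b'<?xpacket end="w"?>\n'
    b"(more file content)"
)


def find_xmp_packet(data: bytes):
    """Return the XML between the xpacket wrapper instructions, or None."""
    match = re.search(
        rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", data, re.DOTALL
    )
    return match.group(1).strip() if match else None


def read_title(packet: bytes):
    """Read the default dc:title value from the RDF/XML payload of a packet."""
    root = ET.fromstring(packet)
    node = root.find(".//dc:title/rdf:Alt/rdf:li", NS)
    return node.text if node is not None else None


packet = find_xmp_packet(SAMPLE_FILE)
title = read_title(packet)
```

Scanning for the wrapper works only when the packet is stored uncompressed, which is an assumption of this sketch; a production reader would go through a PDF or XMP library instead.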
2.3 Semantic Document Model
In this section, we describe the Semantic Document Model with which users manage the metadata of their documents. Most users carry out their information management with both desktop and web applications; we therefore describe a conceptual model for managing metadata using desktop resources and resources from social web sites. First, the Semantic Document Model consists of a number of ontologies that define the metadata structure. We propose the document schema ontology (http://www.blogweb.co.kr/research/ontology) for describing the metadata of a document. It provides locally maintained, interlinked, and highly structured semantic information about each document. We propose the document type ontology to describe the publication types of research communities and related concepts: proceedings, thesis, article, technical report, etc. The domain ontology describes a subject that is closely related to the content of a document. Users can extend it as needed. Furthermore, users can obtain valuable tags from various sources such as social web sites and users’ blogs.
Figure 1 shows the Semantic Document Model, which defines the types of information. It contains the physical information and basic content metadata of a document, which conventional file systems support. A semantic document also contains social information and ontological information.

Figure 1 Semantic Document Model

3. Design Principles
In this section, we describe basic design principles, which are founded on the general problems sketched in the introduction. The key functions or processes are extraction, creation, storage, indexing, and search. Table 1 gives an overview as a matrix of these processes against the problems, showing which functions mainly answer the challenges set forth in the introduction.

Table 1 Design Principles
Processes (columns): Extraction | Creation | Storage & Index | Search
Problems (rows):
Heavyweight cognitive activity | X X X
Poor updatability | X X X
Multiple semantics | X X
Editing problem | X X

Extraction. To reduce the heavyweight cognitive activity of a user, the extraction process uses semi-automatic or automatic methods. Its results include the metadata of documents and physical information such as file name, size, and date. In addition, this process should extract metadata from weblogs or social web services.
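The physical-information part of the extraction process can be sketched in a few lines of Python. This is only an illustration of the idea, not the paper's implementation; the function name `physical_metadata` and the dictionary keys are hypothetical choices.

```python
import os
from datetime import datetime, timezone


def physical_metadata(path):
    """Collect file name, size, and modification date without user input."""
    info = os.stat(path)
    return {
        "file_name": os.path.basename(path),
        "size_bytes": info.st_size,
        # ISO 8601 timestamp in UTC, derived from the file system.
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
    }
```

Because these values come from the file system, the user does not have to remember or enter them, which is the point of automating this step.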
Creation. To generate or modify metadata, users can draw on various sources such as ontologies, tags, and even physical information. Users can define their own knowledge structures, which are called domain ontologies. Tagging is another, newer approach to creating metadata. Tools must support the process of creating this metadata.
Storage & Index. Document metadata must reside in the document itself to avoid the editing problem. The metadata should also contain URIs of web resources, which become a starting point for connecting to the Web.
Search. This process must cover ontology-based and tag-based search. The search results must be connected to other resources through URIs. For example, a user assigns tags at a particular time, together with URIs of web resources. When searching later, the user may get unintended results for those tags because tags and folksonomies evolve by themselves. Connecting search results to web resources in this way can address the problems of poor updatability and interoperability on a Desktop.

4. Implementation
Figure 2 illustrates our architecture, designed in response to the opportunities for functionality identified in the previous section. In this architecture, document metadata is created from two different sources: ontologies and folksonomies. The idea behind these methods is based on the following observations. Ontologies are “intentional models” of information contents with a well-defined logical basis which can be used for reasoning [13]. A folksonomy provides a shared meaning through collaborative work on the Web. Although ontologies and folksonomies take different approaches to creating meaning, they can supplement each other in creating and searching metadata.

Figure 2 Architecture

The basic metadata of a document is extracted from the document itself. The Metadata Extractor parses the metadata inside the document and delivers it to the Metadata Explorer. Users can also obtain valuable information from various sources such as folksonomies, users’ blogs, or even ontologies when they create metadata. All kinds of metadata are then saved in the PDF file itself as XMP.
An index is built and stored automatically for each document and its metadata. It allows users to search using the domain ontology or tags. Search results contain relevant data such as raw file information, ontology concepts, and tags from the embedded metadata. If users want to see web resources related to the results, they can retrieve lists of the terms from specific blogs or social web service sites.
To solve the general problems and support the processes mentioned above, we provide core UIs such as the Metadata Explorer, Ontology Editor, and Tag Generator. Tool support is an essential component of the semantic document approach.
- Metadata Extractor: extract metadata from a document
- Metadata Explorer: view, create, and modify metadata
- Ontology Editor: view and edit an ontology
- Tag Generator: create and view tags
- Search: keyword-based and ontology-based search
In the following subsections we explain the concrete realization and processes.

4.1 Metadata Extraction
Metadata extraction is an internal process. Users do not need to know how it works, since XMP is machine-readable metadata. The XMP handler extracts the XMP metadata using the Jena RDF API and displays each item in the Metadata Explorer (see Figure 3).

Figure 3 Metadata Explorer

The Metadata Extractor automatically extracts embedded metadata if a document contains any, and the Metadata Explorer shows the metadata items. Users can add or modify metadata directly in its editable fields. Some items (subject, tags, etc.) must, however, be added manually. The following section describes two ways to add metadata to a document. Because the tool provides a user-friendly interface, users save time and effort when creating metadata.

4.2 Metadata Creation
Insert ontology concepts. Users can define their own ontology using the Ontology Editor. It provides functionality for editing and browsing an ontology and lets users define and update the ontology as a tree structure. The Subject item, which corresponds to [dc:subject] in Dublin Core, is related to a specific domain ontology in our system. The Type item, which corresponds to [dc:type], is related to the document type ontology and describes the document type. Users select a node in the Ontology Editor to insert it into the Subject or Type item in the Metadata Explorer.

Figure 4 Tag Generator

Insert tags. We provide several functions for adding tags. Users can add tags from social web services using the TagCloud (http://www.tagcloud.com) interface, which shows folksonomies from Flickr, del.icio.us, and similar sites. In addition, users who want to create tags automatically can use the Tag Generator (see Figure 4). It is based on Yahoo’s Content Analysis web service (http://developer.yahoo.net/search/content/V1/), a context extraction web service that retrieves terms extracted from a given text [13]. The tags the user selects are added to the Keyword item in the Metadata Explorer.
After the relevant items have been inserted, the metadata is saved in the file as well-defined data in RDF format. One of the main advantages of serializing XMP as RDF is its potential to become a ubiquitous cross-platform container for machine-readable and machine-processable metadata [20].
Because ontological concepts and tags can be assigned to a document, a desktop document no longer has to live in a single folder. This removes the restriction on multiple semantics on the desktop. In addition, the tags contain relevant URIs or feeds on the Web, which evolve without human intervention. Desktop documents can therefore evolve through their connection to Social Web services.
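How selected ontology concepts and tags might end up as RDF can be sketched as follows. The sketch uses Python's standard library rather than the Jena-based tool described here, and several details are assumptions: the function name `build_description`, the choice of `pdf:Keywords` as the property behind the Keyword item, and the comma-separated encoding of tags.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
PDF = "http://ns.adobe.com/pdf/1.3/"  # assumed namespace for the Keyword item

for prefix, uri in (("rdf", RDF), ("dc", DC), ("pdf", PDF)):
    ET.register_namespace(prefix, uri)


def add_bag(parent, tag, values):
    """Append an unordered rdf:Bag of values under the given property."""
    prop = ET.SubElement(parent, tag)
    bag = ET.SubElement(prop, f"{{{RDF}}}Bag")
    for value in values:
        ET.SubElement(bag, f"{{{RDF}}}li").text = value


def build_description(subjects, doc_types, tags):
    """Serialize ontology concepts and tags as one RDF description."""
    rdf = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF}}}Description", {f"{{{RDF}}}about": ""})
    add_bag(desc, f"{{{DC}}}subject", subjects)  # domain ontology concepts
    add_bag(desc, f"{{{DC}}}type", doc_types)    # document type ontology concepts
    ET.SubElement(desc, f"{{{PDF}}}Keywords").text = ", ".join(tags)
    return ET.tostring(rdf, encoding="unicode")


xml_text = build_description(["Semantic Web"], ["article"], ["folksonomy", "xmp"])
```

A document described this way carries both a subject concept and free tags at once, which is how embedded metadata sidesteps the single-folder restriction.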
4.3 Indexing and Search
We build an index from the XMP already embedded in the PDF file. We use Jena (http://jena.sourceforge.net) to parse the XMP data and Jakarta Lucene (http://lucene.apache.org/java/docs/) to index the metadata. Lucene is one of the most popular document indexing and search libraries available for Java and .NET. Since Lucene by itself accepts and processes only plain text, an adapter is needed to extract plain text from PDF files before their content can be added to a Lucene index. This is done using the XMP Parser class module in Jena. With Jena and Jakarta Lucene, users can select the folder for which they want to build an index. This is quite simple: the user clicks the Browser button and then chooses the folder. At the moment we do not support multiple indexes on one computer, so users who want to index several folders should select their common upper-level folder.
Users may search for more specific information about topics or keywords without being sure how to narrow their search. Even when they type in several terms, they cannot be sure of the results. Our tools help users narrow their search range using the Ontology Editor and search for related items using the results.

Figure 5 Unified Search View

Ontology-Based Search. The search component executes a search across the ‘subject’, ‘title’, ‘keyword’ and ‘description’ metadata fields as well as the text of PDF files. If a user cannot find a start term, he or she can use the Ontology Editor. The search results display the ‘file name’, ‘title’, ‘description’, ‘date’, ‘format’ and ‘weighted score’ metadata fields. The score is weighted primarily according to the subject field in the metadata. The Ontology Viewer is used for refined searching. If the user chooses several terms in the Ontology Editor,
then the results change automatically. The user can combine any fields, such as subject, title, and description.

Tag-Based Search. This function gathers RSS feeds for a set of selected remote tags. When a user chooses a keyword in the results, it collects the feeds related to that keyword from remote weblogs. The data is collected at the moment the search executes. Currently we use a selected list of RSS feeds from several weblog sites. The tag-based search interacts with the information published in the user’s blog and tries to enrich the user’s metadata with associated information from the web.

Figure 5 shows search results that include file information, ontology concepts, and folksonomy tags. That is, our tool provides a unified search view. First, a user can see the physical information of files. Although Windows Explorer already provides this function, the Result View is useful because it includes not only the file name and folder but also the content’s title, keywords, and concepts. Second, a user who wants to see more detailed metadata clicks an item in the result list, which opens the Metadata Explorer. Finally, a user can reuse keywords, which are attached to raw files as metadata, from the tag clouds in blogs. If a user wants to see blog entries related to the results, she clicks a keyword in the results and gets the full list of entries for the clicked term.

5. Conclusions and Future Work
This paper describes a means of managing a semantic document by leveraging two kinds of metadata: ontology-based and tag-based. To enable documents to be used unambiguously by humans and machines, metadata should be represented as an explicit part of documents. The document schema ontology contains ontological concepts as well as social collective tags. Furthermore, metadata can exist as an embedded object in the document rather than being separated from it. Embedded metadata stays with the file content regardless of moving or modifying the file. The documents can then be indexed and searched by semantic tools. Hence, making semantic documentation an explicit and embedded part of the document makes the metadata managing process easier to support.
We have focused mainly on the PDF format, but we plan to process other formats such as JPEG, GIF, and Microsoft Office formats. Our future work includes a more detailed focus on the mechanisms for interaction and feedback between the Desktop and the Web. The approach, model, and techniques of this research will be explored further in our future work.

6. ACKNOWLEDGMENTS
We also thank our colleague Dr. Handschuh for his continued guidance and his assistance with information for this paper.

7. REFERENCES
[1] Decker, S., Frank, M., The networked semantic desktop, In: Workshop on application design, development and implementation issues in the semantic web, 2004.
[2] D. Quan, D. Huynh, and D.R. Karger, Haystack: A Platform for Authoring End User Semantic Web Applications, In: International Semantic Web Conference 2003, 2003.
[3] Sauermann, L., The gnowsis semantic desktop for information integration, In: 1st Workshop on Intelligent office appliances, 2005.
[4] Leslie. C, Timothy. M.B, and Arouna. W, The Case for Explicit Knowledge in Documents, DocEng’04, 2004.
[5] H.L. Kim, H.G. Kim, and K.M. Park, Ontalk: Ontology-Based Personal Document Management System, WWW2004.
[6] H.L. Kim, H.G. Kim, and Decker, S., Semantic Documentation using Semantic Web Technologies and Social Web Services, In: Proc. International Conference on Next Generation Web Services Practices (NWeSP'06), 2006.
[7] Jenneke. F, Johan. P, Wray. B, Tag-Based Navigation for Peer-to-Peer Wikipedia, WWW2006, 2006.
[8] Adobe, XMP SDK Overview, 2001.
[9] Gray. K, A Manager’s Introduction to Adobe eXtensible Metadata Platform, the Adobe XML Metadata Framework, Adobe Whitepaper, 2001.
[10] Adobe, XMP Specification, 2005, available at: http://partners.adobe.com/public/developer/en/xmp/sdk/xmpspecification.pdf
[11] Alan. L, Duane. N, OpenDocument metadata and XMP, 2005, available at: http://www.oasis-open.org/archives/office/200512/msg00009.html
[12] Hopkins, I., Vassileva, J., Beyond keywords and hierarchies, Journal of Digital Information Management 3 (2005) 139–145.
[13] Stuckenschmidt, H., Harmelen F. V, Ontology-Based Metadata Generation from Semi-Structured Information, In: Proc. 1st international conference on knowledge capture (K-CAP’01), 2001, pp 440-444.
[14] Kraft, R., Maghoul, F., Chang, C. C, Y!Q: Contextual Search at the Point of Inspiration, In: Proc. CIKM’05, 2005.
[15] Johnson, A, XMP Blaster: Embedding Metadata into Digital Photographs, http://www.mines.edu/Academic/courses/math_cs/macs370/FS2004/FinalReports/FinalWhite.pdf
[16] Kevin Broccoli, Improving Information Retrieval with Human Indexing, http://www.intranetjournal.com/features/humanindex-1.shtml
[17] Mander. R, Salomon. G, and Wong. Y.Y, A ‘pile’ metaphor for supporting casual organization of information, In: Proc. Conf. on Hum. Factors in comp. sys., 1992, pp 627-634.
[18] James. H, Abby. G, Why can’t I manage academic papers like MP3s? The evolution and intent of Metadata standards, 2004.
[19] Handschuh, S., Staab, S.: Authoring and Annotation of Web Pages in CREAM. In: Proceedings of the Eleventh International World Wide Web Conference, Honolulu, Hawaii, USA, 2002.
[20] Tallis, M.: Semantic Word Processing for Content Authors. In: Proceedings of the Knowledge Markup & Semantic Annotation Workshop, Florida, USA. (2003) Part of the Second International Conference on Knowledge Capture, K-CAP 2003.
[21] Fillies, C., Wood-Albrecht, G., Weichardt, F.: A Pragmatic Application of the Semantic Web using SemTalk. In: Proceedings of the Eleventh International World Wide Web Conference, Honolulu, Hawaii, USA. (2002) 686-692.
[22] Ontoprise GmbH: OntoOffice Tutorial. http://www.ontoprise.de/documents/tutorial ontooffice.pdf (2003)
[23] Carr, L., Miles-Board, T., Wills, G., Woukeu, A. and Hall, W. (2004) Towards a Knowledge-Aware Office Environment. In Proceedings of 5th International Conference on Practical Aspects of Knowledge Management (PAKM 2004) LNAI 3336, pp. 129-140, Vienna, Austria. Karagiannis, D. and Reimer, U., Eds.
[24] Martin, P & Eklund, P: Embedding Knowledge in Web Documents, In: Proceedings of the 8th Int. World Wide Web Conf. (WWW’8), Toronto, May 1999, 1403-1419.
[25] Anita, D., W., Gerard, T.: The ABCDE Format: Enabling Semantic Conference Proceedings,
[26] Henrik Eriksson: A PDF Storage Backend for Protégé, http://protege.stanford.edu/conference/2006/submissions/abstracts/9.4_Protege-2006-Eriksson.pdf
[27] S. Staab, A. Maedche, and S. Handschuh.: An annotation framework for the semantic web. In Proceedings of the First Workshop on Multimedia Annotation, Tokyo, Japan, January 30-31, 2001.
[28] J. Heflin, J. Hendler and S. Luke: SHOE: A Knowledge Representation Language for Internet Applications, Technical Report CS-TR-4078 (UMIACS TR-99-71), 1999.
[29] Guoren, W., Bin, W., Donghong, H., and Baiyou, Q.: Design and Implementation of a Semantic Document Management System, Information Technology Journal 4 (1): 21-31, 2005.
[30] Uren, Victoria; Cimiano, Philipp; Iria, Jose; Handschuh, Siegfried; Vargas-Vera, Maria; Motta, Enrico; Ciravegna, Fabio.; Semantic Annotation for Knowledge Management: Requirements and a Survey of the State of the Art, Journal of Web Semantics 4 (1):14-28, 2006.
[31] Lawrence Reeve, Hyoil Han: Technical Report: Semantic Annotation Platforms, http://www.pages.drexel.edu/~lhr24/pubs/2004SemanticAnnotationTechnicalPaper.pdf, 2004.