CISWeb 2008
ESWC 2008 Workshop on Collective Semantics:
Collective Intelligence & the
Semantic Web
http://mklab.iti.gr/CISWeb
June 2, 2008
Tenerife, Spain
Preface
This volume includes the papers presented at the 1st International Workshop on “Collective
Semantics: Collective Intelligence & the Semantic Web (CISWeb 2008)”, which was hosted by
the 5th European Semantic Web Conference (ESWC-08), in Tenerife, Spain, June 2nd, 2008.
Web 2.0 technologies have introduced new information sharing practices which favor mass user
participation and aim at improving the quality of information content and information organization. It
is challenging to dynamically capture the knowledge that emerges from the interactions
of masses of users in social networks, given the heterogeneity of data sources, the large
information scale and the huge volume of postings. The Semantic Web
may contribute by providing a language basis, structuring support from distributed ad-hoc ontologies,
and new ways of exploring the information space.
In this context, the CISWeb 2008 Workshop attracted very interesting work covering crucial
and emerging research topics such as using and enriching ontologies, semantically enhancing
folksonomies and webspaces, social data management, and interrelating Web 2.0 with the Semantic
Web. More specifically, ideas were presented at the Workshop for ontology matching
via knowledge extracted from multiple ontologies, enriching ontological user profiles with tagging
history, merging Web 2.0 and the Semantic Web by (semi-)automated content tagging, and
semantically enriching folksonomies and tagging. Most of these efforts were experimentally
validated on popular datasets and testbeds (such as Wikipedia, Flickr, LycosIQ).
There were 11 submissions from 9 countries, and three reviewers were assigned to each paper.
The program committee finally selected 5 regular papers and 3 poster papers for presentation
at the workshop. We would like to thank all the program committee members for their dedicated
effort to review papers in their area of expertise in a timely manner. Their effort was
invaluable in assembling the high-quality CISWeb 2008 program.
The research work presented at CISWeb 2008 was very interesting and exciting, and the
Workshop featured lively discussions and fruitful comments. Moreover, the program included a
very interesting invited talk by Prof. Bettina Hoser, from the Universität Karlsruhe, who
presented “Information Retrieval versus Knowledge Retrieval: A Social Network Perspective”,
an emerging topic of wide interest. We are grateful to Prof. Bettina Hoser for her
insightful presentation.
Special thanks are owed to Eirini Giannakidou, PhD student at the CERTH Research
Institute, for her technical support in organizing CISWeb 2008. The workshop was held in
cooperation with the European Commission and the WeKnowIt Integrated Project, and we are
indebted to them for their contributions and financial support.
CISWeb 2008 Co-Chairs
Dr. Yannis Avrithis, National Technical University of Athens, Greece
Dr. Yiannis Kompatsiaris, CERTH-ITI, Greece
Prof. Steffen Staab, University of Koblenz-Landau, Germany
Prof. Athena Vakali, Aristotle University of Thessaloniki, Greece
Conference Organization
Programme Chairs
Yannis Avrithis
Ioannis Kompatsiaris
Steffen Staab
Athena Vakali
Programme Committee
Harith Alani
Andrea Baldassarri
Nick Bassiliades
Susanne Boll
Ciro Cattuto
Thierry Declerck
Ying Ding
William Grosky
Harry Halpin
Andreas Hotho
Paul Lewis
Jose Martinez
Phivos Mylonas
Lyndon Nixon
Noel O’Connor
Raphael Troncy
External Reviewers
Eirini Giannakidou
Gianluca Correndo
Ioannis Katakis
Georgios Meditskos
Author Index
Alani, Harith . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Aleksovski, Zharko . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .35
Angeletou, Sofia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Bubak, Marian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Cantador, Ivan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5
Castells, Pablo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Ciravegna, Fabio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Dierick, Francis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Fernandez, Miriam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Gomes, Paulo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Grzonkowski, Slawomir . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Harezlak, Daniel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Hess, Andreas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Hoser, Bettina . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Maass, Christian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20
Motta, Enrico . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Nasirifard, Peyman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Nowakowski, Piotr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Peristeras, Vassilios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Rowe, Matthew . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Sabou, Marta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Sousa, Jorge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Szomszor, Martin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
ten Kate, Warner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Tojo, João . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
van Harmelen, Frank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Table of Contents
Information Retrieval vs Knowledge Retrieval: A Social Network
Perspective (invited talk) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Bettina Hoser
Enriching Ontological User Profiles with Tagging History for
Multi-Domain Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Ivan Cantador, Martin Szomszor, Harith Alani, Miriam Fernandez,
Pablo Castells
From Web 2.0 to Semantic Web: A Semi-Automated Approach . . . . . . . . . 20
Andreas Hess, Christian Maass, Francis Dierick
Using multiple ontologies as background knowledge in ontology matching 35
Zharko Aleksovski, Warner ten Kate, Frank van Harmelen
Flickring Our World: An Approach for a Graph Based Exploration of
the Flickr Community . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
João Tojo, Jorge Sousa, Paulo Gomes
Semantically Enriching Folksonomies with FLOR . . . . . . . . . . . . . . . . . . . . . 65
Sofia Angeletou, Marta Sabou, Enrico Motta
Disambiguating Identity through Social Circles and Social Data . . . . . . . . . 80
Matthew Rowe, Fabio Ciravegna
OntoPair: Towards a Collaborative Game for Building OWL-Based
Ontologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Peyman Nasirifard, Slawomir Grzonkowski, Vassilios Peristeras
Semantically enhanced webspace for scientific collaboration . . . . . . . . . . . . . 109
Daniel Harezlak, Piotr Nowakowski, Marian Bubak
Information Retrieval vs Knowledge Retrieval:
A Social Network Perspective
Bettina Hoser
Information Services and Electronic Markets,
Institute of Information Engineering and Management,
Department of Economics and Business Engineering,
Universität Karlsruhe (TH)
Germany
hoser@iism.uni-karlsruhe.de
Bettina Hoser
Institute for Information services and management
Department for Information Services and Electronic Markets
Universität Karlsruhe (TH)
Kaiserstrasse 12
D-76128 Karlsruhe
Germany
Abstract. Information Retrieval, especially in connection with the internet,
is a well-known research field. But as internet technologies become
more and more elaborate, so grows the need not only to
find (retrieve) already available information, but to generate knowledge.
In this paper the emergence of collective intelligence from information
retrieval and social network analysis is presented. The main point
is how information gained from, e.g., websites can be enhanced to
become knowledge using methods and models from the research field of
Social Network Analysis. Such an approach is very context sensitive, so
two examples will be presented.
1 Introduction
When is a trend a trend? When ’the right people’ initiate it. This is well
known from the world of fashion. In the world of news, research and technology,
this may translate to the fact that a trend is a trend when ’relevant’ people
or websites take up the topic. But how can the relevant people or websites be
distinguished from the less relevant? How can ’relevant’ be defined? How can
one detect really ’relevant’ trends? ’Relevant’ is always a relative notion: it
depends on the circumstances. Thus it is, e.g. in the case of fashion or news,
a social context.
As an example of the question discussed here, take a high-tech company (e.g.
mobile phones) or a reinsurance company. For both it is essential that they see
trends before their competitors or potential clients do. For the high-tech
company, for example, it is crucial to know what potential customers
are interested in, or which features of the current product are not accepted and
why. For the reinsurance company, it is necessary to know which hazards, e.g.
in health care, are being discussed, so that the company may adapt its policies
accordingly. As an illustration of that point, take the discussion on obesity in
children and subsequent health problems in adults.
2 Information retrieval
Companies looking for such trends traditionally turned, for example, to newspapers.
Nowadays the internet, with its chat rooms, newsgroups, social networking sites
and blogs, offers a wide range of information that had not been accessible before.
Various methods have been devised to gather this information.
Text analysis is one of the methods often used to extract information from
a text source. There is a large body of research literature, see e.g. [FNR03],
in the fields of linguistics, information science and classification on diverse ways
to extract keywords, key phrases, etc. from websites and other text sources. In
these research fields, models have been built to explain how the context-sensitive
relevance of words, phrases, etc. can be defined. Just think of classifications
like, e.g., the ACM Classification System. Some of these methods produce lists of
possible topics ranked by relative relevance according to their usage of phrases in
the text.
Another approach is to use additional information such as keywords or tags to
enhance the retrieved information by classifying it. This has grown into the
research fields of folksonomies, tagging, the semantic web, etc.
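As a toy illustration of such frequency-ranked topic lists, the following sketch ranks words by raw term frequency; the stopword list and sample text are invented for illustration, and real systems use the more elaborate linguistic models cited above.

```python
import re
from collections import Counter

# Minimal stopword list for illustration only; production systems use curated lists.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "but"}

def rank_keywords(text, top_n=5):
    """Rank candidate keywords by raw term frequency, ignoring stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(top_n)]

doc = ("the phone camera is praised, but the phone battery "
       "drains fast; battery complaints dominate the forum")
print(rank_keywords(doc, 3))
```

At this stage the ranking reflects only how often phrases occur, which is exactly the limitation discussed next: frequency says nothing about who used the words.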
At this point, though, what is known is that these phrases or words are often
used. What is not known is who used them, or, to put it precisely, whether the
user is a ’relevant’ user in the context. This question has been at the
heart of the research field of Social Network Analysis.
3 Social Network Analysis
Social Network Analysis (SNA) is a research area that tries to analyze and model
actor behavior based on his or her connections or relations to other members of
a group. For further reference see [WF99]. An actor is thus seen as restricted
or empowered by his or her connections to others. The basis of this structural
approach is given by models of group interaction. The first research questions
were posed to assign roles to actors in a given social context; leadership
of a group is such a role, for example. There are also models about the power to manipulate.
Thus a person in such a context may be called relevant, or central, if he or she is
positioned in the group’s network in such a way that all information exchanged
between any two actors has to pass through this ’central’ actor. He or she can
thus manipulate the group.
Thus the question of who is relevant within a group is one of the research
questions of SNA. Based on graph theory, this can be analyzed using different
so-called centrality indices. Some of them are intuitive, like e.g. degree
centrality; others are more elaborate, like e.g. betweenness centrality or eigenvector
centrality. But the question is always: given a clearly defined context, who
within a group is relevant, who is not, how are the actors in the group connected,
and what predictions, if any, can be made for the future development of the
group structure?
Thus, this analysis approach can be used to find the ’relevant’ people or
websites needed to enhance the information found by text retrieval.
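The simplest of the indices mentioned above, degree centrality, can be sketched in a few lines; the actor names and interaction graph are hypothetical.

```python
# Hypothetical discussion network: an edge means two actors exchanged replies.
edges = [("ann", "bob"), ("ann", "carol"), ("ann", "dave"), ("bob", "carol")]

def degree_centrality(edges):
    """Fraction of the other actors each actor is directly connected to."""
    nodes = {n for e in edges for n in e}
    degree = {n: 0 for n in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return {n: degree[n] / (len(nodes) - 1) for n in nodes}

centrality = degree_centrality(edges)
print(centrality["ann"])  # ann is connected to all three others
```

Betweenness and eigenvector centrality refine this idea by weighting shortest-path brokerage and the centrality of one's neighbours, respectively, but the basic question answered is the same: who in the group is structurally ’relevant’.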
4 Knowledge Retrieval
The idea of knowledge retrieval means not only to gather the available information
but to enrich it with other information to gain knowledge about a topic.
In the case proposed here, this means using results from SNA to enrich the
information gathered by text analysis, to find out whether the topics found by
information retrieval are ’really hot topics’ because ’relevant people’ talk about
them, or whether it is just ’small talk’ by ’bystanders’. In a conceptual study [HSGS+07]
we used such an approach to look for socially enriched information about mobile
phones within a newsgroup.
The idea proposed here is based on the following information fusion approach:
first, a text corpus and a group are defined. Then the text corpus is analyzed and
the group structure is evaluated. As a last step, these two results are combined to
gain knowledge. This is just a very crude and short description of the procedure.
One major challenge here is to define the group. Depending on the area of interest,
this can be a very large group or a collection of websites corresponding to
a group. Sometimes this may not even be a well-defined group; thus biases can
be introduced by the choice of actors (or websites). But once the group is defined,
there is also the question of the appropriate text analysis method. Questions of
scalability and validity have to be answered here. As a last step, the interpretation
of the combined results has to be validated before any measures are taken.
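The fusion step described above can be sketched as follows, assuming per-post topic lists (from text analysis) and actor centrality scores (from SNA) are already available; all names and values below are invented for illustration.

```python
# Assumed inputs: topic lists per post (text analysis output) and
# actor centrality scores (SNA output); both invented for illustration.
posts = [
    ("ann",  ["battery", "camera"]),
    ("dave", ["battery"]),
    ("dave", ["ringtone"]),
    ("dave", ["ringtone"]),
    ("dave", ["ringtone"]),
]
centrality = {"ann": 1.0, "dave": 0.2}

def socially_weighted_topics(posts, centrality):
    """Weight each topic mention by the centrality of its author."""
    scores = {}
    for author, topics in posts:
        for topic in topics:
            scores[topic] = scores.get(topic, 0.0) + centrality[author]
    return scores

scores = socially_weighted_topics(posts, centrality)
# 'ringtone' has more raw mentions, but 'battery' is raised by a central actor
```

This is the crux of the approach: a topic mentioned by a ’relevant’ actor outranks a more frequent topic mentioned only by ’bystanders’.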
Even with the aforementioned challenges, this approach seems to
yield deeper insights into topics and trends, since it includes the social component
of trends.
5 Outlook
The potential of such an approach is very high. Not only companies are interested
in this kind of knowledge, gained from different ’news’ sources and weighted by
social impact, but also the average internet user. If one takes a look at
communities of diverse interests, such as travel, or at such necessities as emergencies,
it is not only valuable to have information at hand gathered from collective sites,
but also to know who gave the information and whether the source can be viewed
as ’relevant’ in the given context. In the context of emergencies, this may save
lives.
References
[FNR03] J. Franke, G. Nakhaeizadeh, and I. Renz, editors. Text Mining, Theoretical
Aspects and Applications. Advances in Soft Computing. Physica-Verlag,
Heidelberg, 2003.
[HSGS+07] B. Hoser, J. Schröder, A. Geyer-Schulz, M. Viermetz, and M. Skubacz.
Topic trend detection in newsgroups. Künstliche Intelligenz, (3), 2007.
[WF99] S. Wasserman and K. Faust. Social Network Analysis, volume 8 of Structural
Analysis in the Social Sciences. Cambridge University Press, Cambridge,
1st edition, 1999.
Enriching Ontological User Profiles with Tagging History
for Multi-Domain Recommendations
Iván Cantador1, Martin Szomszor2, Harith Alani2,
Miriam Fernández1, Pablo Castells1
1
Escuela Politécnica Superior
Universidad Autónoma de Madrid
28049 Madrid, Spain
{ivan.cantador, miriam.fernandez, pablo.castells}@uam.es
2
School of Electronics and Computer Science
University of Southampton
SO17 1BJ Southampton, United Kingdom
{mns2, ha}@ecs.soton.ac.uk
Abstract. Many advanced recommendation frameworks employ ontologies of
various complexities to model individuals and items, providing a mechanism
for the expression of user interests and the representation of item attributes. As
a result, complex matching techniques can be applied to support individuals in
the discovery of items according to explicit and implicit user preferences.
Recently, the rapid adoption of Web2.0, and the proliferation of social
networking sites, has resulted in more and more users providing an increasing
amount of information about themselves that could be exploited for
recommendation purposes. However, the unification of personal information
with ontologies using the contemporary knowledge representation methods
often associated with Web2.0 applications, such as community tagging, is a
non-trivial task. In this paper, we propose a method for the unification of tags
with ontologies by grounding tags to a shared representation in the form of
Wordnet and Wikipedia. We incorporate individuals’ tagging history into their
ontological profiles by matching tags with ontology concepts. This approach is
preliminarily evaluated by extending an existing news recommendation system
with user tagging histories harvested from popular social networking sites.
Keywords: social tagging, web 2.0, ontology, semantic web, user modelling,
recommender systems.
1 Introduction
The increasing proliferation of Web2.0 style sharing platforms, coupled with the rapid
development of novel ways to exploit them, is paving the way for new paradigms in
Web usage. Virtual communities and on-line services such as social networking,
folksonomies, blogs, and wikis, are fostering an increase in user participation,
engaging users and encouraging them to share more and more information, resources,
and opinions. The huge amount of information resulting from this emerging
phenomenon gives rise to excellent opportunities to investigate, understand, and
exploit the knowledge about users’ interests, preferences and needs. However, the
current infrastructure of the Web does not provide the mechanisms necessary to
consolidate this wealth of personal data, since it is spread over many unconnected,
heterogeneous sources.
Community tagging sites, and their respective folksonomies, are a clear example of
this situation: users have access to a plethora of web sites that allow them to annotate
and share many types of resources. For example, they can organise and make photos
available on Flickr1, classify and share bookmarks using del.icio.us2, and communicate
and share resources with friends using Facebook3. Through personal tags, users implicitly
declare different facets of their personalities, such as their favourite book subjects on
LibraryThing4, movie preferences on IMDb5, music tastes on Last.fm6, and so forth.
Therefore, the domains covered by social tagging applications are both disparate and
divergent, creating considerably complex and extensive descriptions of user profiles.
In the current Web2.0 landscape, there is a distinct lack of tools to support users with
meaningful ways to query and retrieve resources spread over disparate end-points: users
should be able to search consistently across a broad range of sites for diverse media
types such as articles, reviews, videos, and photos. Furthermore, such sites could be
used to support the recommendation of new resources belonging to multiple domains
based on tags from different sites. As a step towards making this vision a reality, we
explore the use of syntactic and semantic based technologies for the combination,
communication and exploitation of information from different social systems.
In this paper, we present an approach for the consolidation of social tagging
information from multiple sources into ontologies that describe the domains of
interest covered by the tags. Ontology-based user profiles enable rich comparisons of
user interests against semantic annotations of resources, in order to make personal
recommendations. This principle has already been tested by the authors in different
personalised information retrieval frameworks, such as semantic query-based
searching [4], personalised context-aware content retrieval [13], group-oriented
profiling [3], and multi-facet hybrid recommendations [2].
We propose to feed the previous strategies with user profiles built from personal
tag clouds obtained from the Flickr and del.icio.us web sites. The mapping of those social
tags to our ontological structures involves three steps: the filtering of tags, the
acquisition of semantic information from the Web to map the remaining tags into a
common vocabulary, and the categorisation of the obtained concepts according to the
existing ontology classes.
An application of the above techniques has been tested in News@hand, a news
recommender system which integrates our different ontology-based recommendation
approaches. In this system, ontological knowledge bases and user profiles are
generated from public social tagging information, using the aforementioned
techniques. The News@hand system, along with the automatic acquisition of news
articles from the Web and the automatic semantic annotation of these items using
Natural Language Processing tools [1] and the Lucene7 indexer, is also described.
1 Flickr, Photo Sharing, http://www.flickr.com/
2 del.icio.us, Social Bookmark Manager, http://del.icio.us/
3 Facebook, Social Networking, http://www.facebook.com/
4 LibraryThing, Personal Online Book Catalogues, http://www.librarything.com/
5 IMDb, Internet Movie Database, http://imdb.com/
6 Last.fm, The Social Music Revolution, http://www.last.fm/
7 Lucene, An Open Source Information Retrieval Library, http://lucene.apache.org/
The structure of the paper is the following. Section 2 briefly describes our
approach for representing user preferences and item features using ontology-based
knowledge structures, and how they are exploited by several recommendation models.
Section 3 explains mechanisms to automatically relate and transform social tagging
and external semantic information into our ontological knowledge structures. A real
implementation and evaluation of the previous tag transformation and
recommendation processes within a news recommender system are presented in
section 4. Finally, section 5 presents conclusions and future research lines.
2 Hybrid recommendations
In this section, we summarise the ontology-based knowledge representation and
recommendation models in which filtered social tags are proposed to be integrated
and exploited.
2.1 Ontology-based representation of item features and user preferences
In the knowledge representation we propose [4, 13], user preferences are described as
vectors u_m = (u_{m,1}, u_{m,2}, ..., u_{m,K}), where u_{m,k} ∈ [0,1] measures the intensity of the
interest of user u_m ∈ U for concept c_k ∈ O (a class or an instance) in a domain
ontology O, K being the total number of concepts in the ontology. Similarly, items
d_n ∈ D are assumed to be annotated by vectors d_n = (d_{n,1}, d_{n,2}, ..., d_{n,K}) of concept
weights, in the same vector space as user preferences.
The main advantages of this knowledge representation are its portability, thanks to
the XML-based Semantic Web standards, the domain independence of the subsequent
content retrieval and recommendation algorithms, and the multi-source nature of the
proposal (different types of media can be annotated: texts, images, videos).
2.2 Personalised content retrieval
Our notion of content retrieval is based on a matching algorithm that provides a
personal relevance measure pref(d_n, u_m) of an item d_n for a user u_m. This measure
is set according to the semantic preferences of the user and the semantic annotations of the
item, and is based on a cosine vector similarity cos(d_n, u_m). The obtained similarity
values (Personalised Ranking module of Figure 1) can be combined with non-personalised
query-based scores sim(d_n, q) and semantic context information (Item
Retrieving module of Figure 1) to produce combined rankings [13].
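The cosine-based relevance measure can be sketched as follows; the concept weights below are invented, while the actual system works over ontology-derived vectors.

```python
import math

def pref(d, u):
    """Cosine similarity between item annotation vector d and user profile u."""
    dot = sum(di * ui for di, ui in zip(d, u))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    return dot / (norm_d * norm_u) if norm_d and norm_u else 0.0

u_m = [0.9, 0.0, 0.4]   # interest of user u_m in concepts c_1..c_3
d_n = [0.8, 0.1, 0.0]   # annotation of item d_n over the same concepts
score = pref(d_n, u_m)  # high: the dominant concept c_1 is shared
```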
To overcome sparsity in user profiles, we propose a preference spreading mechanism,
which expands the initial set of preferences stored in user profiles
through explicit semantic relations with other concepts in the ontology. Our approach is
based on Constrained Spreading Activation (CSA), and is self-controlled by applying a
decay factor to the intensity of preference each time a relation is traversed. We have
empirically demonstrated [3, 13] that preference extension improves retrieval precision
and recall. It also helps to mitigate other well-known limitations of recommender
systems, such as the cold-start, overspecialisation and portfolio effects.
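A minimal sketch of such decay-damped preference spreading, assuming a hypothetical ontology relation table and a decay factor of 0.5 (the paper does not fix a value):

```python
# Hypothetical ontology relations; the decay factor 0.5 is an assumption.
relations = {"jazz": ["music"], "music": ["art"]}

def spread_preferences(profile, relations, decay=0.5):
    """Propagate each preference weight along ontology relations,
    damping it by `decay` every time a relation is traversed."""
    expanded = dict(profile)
    frontier = list(profile.items())
    while frontier:
        concept, weight = frontier.pop()
        for neighbour in relations.get(concept, []):
            damped = weight * decay
            if damped > expanded.get(neighbour, 0.0):
                expanded[neighbour] = damped
                frontier.append((neighbour, damped))
    return expanded

expanded = spread_preferences({"jazz": 0.8}, relations)
# jazz (0.8) spreads to music, and one hop further to art, with damped weights
```

The damping makes the expansion self-limiting: weights shrink geometrically with distance from the original preference, so remote concepts contribute little.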
Figure 1. Ontology-based personalised content retrieval
2.3 Context-aware recommendations
The context is represented in our approach [13] as a set of weighted ontology concepts.
This set is obtained by collecting the concepts that have been involved in the interaction
of the user (e.g. accessed items) during a session. It is built in such a way that the
importance of concepts fades away with time by a decay factor. Once the context is
built, a contextual activation of user preferences is achieved by finding semantic paths
linking preferences to context. These paths are made of existing relations between
concepts in the ontologies, following the spreading technique mentioned in section 2.2.
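One possible reading of this decayed context update, with invented concepts and a decay factor of 0.8; the paper does not give the exact update rule, so this formulation is an assumption.

```python
def update_context(context, accessed_concepts, decay=0.8):
    """Fade existing context weights by `decay`, then add full weight
    for concepts involved in the latest interaction (assumed rule)."""
    faded = {c: w * decay for c, w in context.items()}
    for c in accessed_concepts:
        faded[c] = faded.get(c, 0.0) + 1.0
    return faded

ctx = update_context({}, ["politics"])   # user accesses a politics item
ctx = update_context(ctx, ["economy"])   # then an economy item
# 'politics' has faded while 'economy' carries full, fresh weight
```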
2.4 Group-oriented recommendations
The presented user profile representation allows us to easily model groups of users. We
have explored the combination of the ontology-based profiles to meet this purpose [3],
on a per concept basis, following different strategies from social choice theory. In our
approach, user profiles are merged to form a shared group profile, so that common
content recommendations are generated according to this new profile.
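As an example of one such strategy, a per-concept average merge (the "additive utilitarian" option from social choice theory, one of several the framework could use) can be sketched with invented profiles:

```python
def merge_profiles(profiles):
    """Per-concept average of individual interest weights
    (the 'additive utilitarian' strategy; other strategies exist)."""
    concepts = {c for p in profiles for c in p}
    return {c: sum(p.get(c, 0.0) for p in profiles) / len(profiles)
            for c in concepts}

group = merge_profiles([{"sports": 1.0, "music": 0.2},
                        {"music": 0.8}])
# sports: (1.0 + 0.0) / 2 = 0.5;  music: (0.2 + 0.8) / 2 = 0.5
```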
2.5 Multi-facet hybrid recommendations
In order to make hybrid recommendations we cluster the semantic space based on the
correlation of concepts appearing in the profiles of individual users. The obtained
clusters C_q represent groups of preferences (topics of interest) shared by a significant
number of users. Using these clusters, profiles are partitioned into semantic segments.
Each of these segments corresponds to a cluster and represents a subset of the user
interests that is shared by the users who contributed to the clustering process. By thus
introducing further structure in user profiles, we define relations among users at
different levels, obtaining multilayered communities of interest.
Exploiting the relations of the communities which emerge from the users’ interests,
and combining them with item semantic information, we have presented in [2] several
recommendation models that compare the current user's interests with those of the other
users in a double way: first, according to item characteristics, and second, according to
connections among user interests, in both cases at different semantic layers. The resulting
relevance measure takes the form:

pref(d_n, u_m) = Σ_q nsim(d_n, C_q) · Σ_i nsim_q(u_m, u_i) · sim_q(d_n, u_i)
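The scoring above can be implemented directly; the similarity values and dictionary names below (item-cluster, user-user and item-user similarities per cluster) are invented for a single-cluster illustration, not taken from the paper.

```python
def hybrid_pref(item_cluster_sim, user_user_sim, item_user_sim):
    """pref(d_n, u_m) = sum_q nsim(d_n, C_q) * sum_i nsim_q(u_m, u_i) * sim_q(d_n, u_i)."""
    total = 0.0
    for q, nsim_dq in item_cluster_sim.items():
        # Inner sum: per-cluster user-user similarity times item-user similarity.
        layer = sum(user_user_sim[q][i] * item_user_sim[q][i]
                    for i in user_user_sim[q])
        total += nsim_dq * layer
    return total

score = hybrid_pref(
    {"C1": 0.6},                       # nsim(d_n, C_1)
    {"C1": {"u1": 0.5, "u2": 0.25}},   # nsim_1(u_m, u_i)
    {"C1": {"u1": 0.8, "u2": 0.4}},    # sim_1(d_n, u_i)
)
# 0.6 * (0.5 * 0.8 + 0.25 * 0.4) = 0.3
```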
3 Relating social tags to ontological information
Parallel to the proliferation and growth of social tagging systems, the research
community is increasing its efforts to analyse the complex dynamics underlying
folksonomies, and investigate the exploitation of this phenomenon in multiple
domains. Results reported in [5] suggest that users of social systems share behaviours
which appear to follow simple tagging activity patterns. Understanding, predicting
and controlling the semiotic dynamics of online social systems are the base pillars for
a wide variety of applications.
For these purposes, the establishment of a common vocabulary (set of tags) shared
by users in different social systems is a desirable situation. Indeed, recent works have
focused on the improvement of tagging functionalities to generate tag datasets in a
controlled, coordinated way. P-TAG [6] is a method that automatically generates
personalised tags for web pages, producing keywords relevant both to their textual
content and to data collected from the user’s browsing. In [8], an adaptation of
user-based collaborative filtering and a graph-based recommender are presented as tag
recommendation mechanisms that ease the process of finding good tags for a
resource and help consolidate a consistent tag vocabulary across users.
The integration of folksonomies and the Semantic Web has been envisioned as an
alternative approach to the collaborative organisation of shared tagging information.
The proposal presented in [11] uses a combination of pre-processing strategies and
statistical techniques together with knowledge provided by ontologies for making
explicit the semantics behind the tag space in social tagging systems.
In the work presented herein, we propose the use of knowledge structures defined
by multiple domain ontologies as a common semantic layer to unify and classify
social tags from several Web 2.0 sites. More specifically, we propose a mechanism
for the creation of ontology instances for the gathered tags, according to semantic
information collected from the Web. Tagging information is linked to ontological
structures by our method through a sequence comprising three processing steps:
• Filtering social tags: To facilitate the integration of information from different
social sources as well as the subsequent translation of that information into
ontological knowledge, a pre-processing of the tags is needed, associating them
to a common vocabulary, shared by the different involved applications.
Morphological and semantic transformations of tags are performed at this stage
based on the WordNet English dictionary [9], the Wikipedia8 encyclopaedia and
the Google9 web search engine.
• Obtaining semantic information about social tags: The shared vocabulary is
created with the use of Wikipedia, which provides semantic information about
millions of concepts.
• Categorisation of social tags into ontology classes: Once the tags have been
filtered and mapped to a shared vocabulary, they are automatically converted
into instances of classes of domain ontologies. Again, semantic categorisation
information available in Wikipedia is exploited in this process.
These steps are explained in more detail in the next subsections.
8 Wikipedia, The Free Encyclopaedia, http://en.wikipedia.org/
9 Google, Web Search Engine, http://www.google.com/
3.1 Filtering social tags
Raw tagging information can be noisy and inconsistent. When manual tags are
introduced with a non-controlled tagging mechanism, people often make grammatical
mistakes (e.g. barclona instead of barcelona), tag concepts indistinctly in singular,
plural or derived forms (blog, blogs, blogging), sometimes add adjectives, adverbs,
prepositions or pronouns to the main concept of the tag (beautiful car, to read), or use
synonyms and acronyms that could be converted into a single tag (biscuit and cookie,
ny and new york). Moreover, the tag encoding and storage mechanisms used by social
systems often alter the tags introduced by the users: they may transform white spaces
(san francisco, san-francisco, san_francisco, sanfrancisco) and special characters in
the tags (los angeles for los ángeles, zurich instead of zürich), etc.
Thus, while it is possible to gather information from multiple folksonomy sites, such
as Flickr or del.icio.us, inconsistency will lead to confusion and loss of information
when tagging data is compared. For example, if a user has tagged photos from a recent
holiday in New York with nyc, but also bookmarked relevant pages in del.icio.us with
new_york, the correlation will be lost. In order to facilitate the folksonomy data analysis
and integration, tags have to be filtered and mapped to a shared vocabulary. Here, we
present a tag filtering architecture that makes use of external knowledge resources such
as the WordNet dictionary, Wikipedia encyclopaedia and Google web search engine.
The filtering process is a sequential execution where the output from one filtering
step is used as input to the next. The output of the entire filtering process is a set of new
tags that correspond to an agreed representation. As will be explained below, this is
achieved by correlating tags to entries in two large knowledge resources: Wordnet and
Wikipedia. Wordnet is a lexical database and thesaurus that groups English words into
sets of cognitive synonyms called synsets, providing definitions of terms and modelling
various semantic relations between concepts: synonymy, hypernymy and hyponymy, among
others. Wikipedia is a multilingual, open-access, free-content encyclopaedia on the
Internet. Using a wiki style of collaborative content writing, it has grown to become one
of the largest reference Web sites with over 75,000 active contributors, maintaining
approximately 9,000,000 articles in over 250 languages (as of February 2008).
Wikipedia contains collaboratively generated categories that classify and relate entries,
and also supports term disambiguation and dereferencing of acronyms.
Figure 2 provides a visual representation of the filtering process where a set of raw
tags are transformed into a set of filtered tags and a set of discarded tags. Each of the
numbers in the diagram corresponds to a step outlined below.
Figure 2. The tag filtering process
For this work, tags from publicly available user accounts on the Flickr and del.icio.us
sites have been collected and filtered. A total of 1004 user profiles have been gathered
from these two systems, providing 149,529 and 84,851 distinct tags respectively.
Initially, the intersection between both datasets was 28,550 common tags.
Step 1: Lexical filtering
After raw tags have been harvested from different folksonomy sites, they are passed
to the Lexical Filter, which applies several filtering operations. Tags that are too small
(with length = 1) or too large (length > 25) are removed, resulting in a discarding rate of
approximately 3% of the initial dataset. In addition, considering the discrepancies in the
use of special characters (such as accents, dieresis and caret symbol), we convert such
special characters to a base form (e.g., the characters à, á, â, ã, ä, å are converted to a).
Tags containing numbers are also filtered based on a set of custom heuristics. For
example, to maintain salient numbers, such as dates (2006, 2007, etc), common
references (911, 360, 666, etc), or combinations of alphanumeric characters (7 up,
4 x 4, 35 mm), we discard unpopular tags below a certain global tag frequency
threshold. Finally, common stop-words, such as pronouns, articles, prepositions and
conjunctions are removed. After lexical filtering, tags are passed on to the Wordnet
Manager. If a tag has an exact match in Wordnet, we pass it on directly to the set of
filtered tags, to save further unnecessary processing.
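The length, stop-word and numeric heuristics above can be sketched as follows. This is an illustrative sketch, not the actual implementation: the stop-word list, the frequency threshold and the `tag_frequency` input are assumptions.

```python
import unicodedata

STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "or"}  # illustrative subset

def fold_accents(tag):
    # Convert special characters to a base form (e.g. a-grave, a-umlaut -> a)
    nfkd = unicodedata.normalize("NFKD", tag)
    return "".join(c for c in nfkd if not unicodedata.combining(c))

def lexical_filter(tags, tag_frequency, numeric_threshold=5):
    """Return the tags that survive the lexical filtering step."""
    kept = []
    for tag in tags:
        tag = fold_accents(tag.strip().lower())
        if len(tag) <= 1 or len(tag) > 25:      # too small or too large
            continue
        if tag in STOP_WORDS:                    # common stop-words
            continue
        # Numeric heuristic: keep only sufficiently popular tags with digits
        if any(ch.isdigit() for ch in tag) and \
                tag_frequency.get(tag, 0) < numeric_threshold:
            continue
        kept.append(tag)
    return kept
```

Surviving tags would then be checked against WordNet, as described above.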
Step 2: Compound nouns and misspellings
If a tag is not found in Wordnet, we consider possible misspellings and compound
nouns. Motivated by [11], to solve these problems, we make use of the Google “did you
mean” mechanism. When a search term is entered, the Google engine checks whether
more relevant search results are found with an alternative spelling. Because Google’s
spell check is based on occurrences of all words on the Internet, it is able to suggest
common spellings for proper nouns that would not appear in a standard dictionary.
The Google “did you mean” mechanism also provides an excellent way to resolve
compound nouns. Since most tagging systems prevent users from entering white spaces
into the tag value, users create compound nouns by concatenating nouns together or
delimiting them with a non-alphanumeric character such as _ or -, which introduces an
obvious source of complication when aligning folksonomies. By sending compound
nouns to Google, we easily resolve the tag into its constituent parts. This mechanism
works well for compound nouns with two terms, but is likely to fail if more than two
terms are used. For example, the tag sanfrancisco is corrected to san francisco, but the
tag unitedkingdomsouthampton is not resolved by Google.
We have thus developed a complementary algorithm that quickly and accurately
splits compound nouns of three or more terms. The main idea is to firstly sort the tags
in alphabetical order, and secondly process the generated tag list sequentially. By
caching previous lookups, and matching the first shared characters of the current tag
string, we are able to split it into a prefix (previously resolved by Google) and a
postfix. A second lookup is then made using the postfix to seek further possible
matches. The process is iteratively repeated until no splits are obtained from our
Google Connector. Compared to a bespoke string-splitting heuristic, this process has a
very low computational cost. This mechanism successfully recognizes long compound
nouns such as war of the worlds, lord of the rings, and martin luther king jr.
Similarly to Step 1, after using Google to check for misspellings and compound
nouns, the results are validated against the Wordnet Manager. Unprocessed tags are
added to the pending tag stack, and unmatched tags are discarded.
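A minimal sketch of this splitting algorithm is given below. The `resolve` callable is a hypothetical stand-in for the Google "did you mean" connector, and the cache of previous lookups is simplified to a set of known words.

```python
def split_compound(tag, resolve, known):
    # Try the longest previously resolved word that prefixes the tag (the cache),
    # then fall back to the external lookup for the remainder.
    for word in sorted(known, key=len, reverse=True):
        if tag.startswith(word) and word != tag:
            return [word] + split_compound(tag[len(word):], resolve, known)
    split = resolve(tag)          # stand-in for the Google "did you mean" lookup
    if split and len(split) > 1:
        known.update(split)
        return list(split)
    known.add(tag)                # unresolvable: keep the tag as a single word
    return [tag]

def split_all(tags, resolve):
    """Process tags in alphabetical order so shared prefixes are resolved first."""
    known = set()
    return {tag: split_compound(tag, resolve, known) for tag in sorted(tags)}
```

With a resolver that only handles two-term compounds, a three-term tag such as unitedkingdomsouthampton is still split, because united and kingdom are already cached when it is processed.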
Step 3: Wikipedia correlation
Many of the popular tags occurring in community tagging systems do not appear in
grammar dictionaries, such as Wordnet, because they correspond to proper names
(such as famous people, places, or companies), contemporary terminology (such as
web2.0 and podcast), or are widely used acronyms (such as asap and diy).
In order to provide an agreed representation for such tags, we correlate tags to their
appropriate Wikipedia entries. For example, when searching the tag nyc in Wikipedia,
the entry for New York City is returned. The advantage of using Wikipedia to agree on
tags from folksonomies is that Wikipedia is a community-driven knowledge base, much
like folksonomies are, so that it rapidly adapts to accommodate new terminology.
Apart from consolidating agreed terms for the filtered tags, our Wikipedia
Connector retrieves semantic information about each obtained entry. Specifically, it
extracts ambiguous concepts (e.g., “java programming language” and “java island”
for the entry “java”), and collaboratively generated categories (e.g., “living people”,
“film actors” and “american male models” for the entry “brad pitt”). This information
is exploited by the ontology population and annotation processes described below.
Step 4: Morphologically similar terms
An additional issue to be considered during the filtering process is that users often use
morphologically similar terms to refer to the same concept. One very common example
of this is the discrepancy between singular and plural terms, such as blog and blogs,
and other morphological deviations (e.g. blogging). In this step, using a custom
singularisation algorithm, and the stemming functions provided by the Snowball
library10, we reduce morphologically similar tags to a single tag. For each group of
similar tags, the shortest term found in Wordnet is used as the representative tag.
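The grouping step can be sketched as follows. The suffix-stripping `crude_stem` is a deliberately crude stand-in for a Snowball stemmer, and `in_wordnet` is a hypothetical membership predicate.

```python
def crude_stem(term):
    # Very crude suffix stripping; the real system uses the Snowball library
    for suffix in ("ging", "ing", "ses", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def merge_morphological_variants(tags, in_wordnet):
    """Group tags by stem; the shortest form found in WordNet represents each group."""
    groups = {}
    for tag in tags:
        groups.setdefault(crude_stem(tag), []).append(tag)
    representative = {}
    for members in groups.values():
        # Prefer terms found in WordNet; fall back to all members otherwise
        candidates = [t for t in members if in_wordnet(t)] or members
        rep = min(candidates, key=len)
        for t in members:
            representative[t] = rep
    return representative
```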
Step 5: WordNet synonyms
When people communicate a certain concept, they often use synonyms, i.e., terms that
have the same meaning, but with different morphological forms. A natural filtering step
is the simplification of the tag sets by merging pairs of synonyms into single terms.
WordNet provides synonym relations between synsets of the terms. However, due
to ambiguous meanings of the tags, not all of them can be taken into consideration,
and the filtering process must be very carefully executed. Our merging process
comprises three stages. In the first stage, a matrix of synonym relations is created by
using Wordnet. In the second stage, according to the number of synonym relations
found for each tag, we identify the non-ambiguous synonym pairs, and finally, stage
three replaces each of the synonym pairs by the term that is most popular. Examples
of synonym pairs processed in this way are android and humanoid, thesis and dissertation,
funicular and cable railway, stein and beer mug, or poinsettia and christmas flower.
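The three stages can be sketched as follows, with `are_synonyms` a hypothetical stand-in for the WordNet synset test and `popularity` the global tag frequencies.

```python
from collections import Counter

def merge_synonyms(tags, are_synonyms, popularity):
    """Three-stage synonym merging (sketch)."""
    # Stage 1: synonym relations between tags (a sparse "matrix" of pairs)
    pairs = [(a, b) for i, a in enumerate(tags)
                    for b in tags[i + 1:] if are_synonyms(a, b)]
    # Stage 2: keep non-ambiguous pairs, i.e. tags with exactly one relation
    counts = Counter(t for pair in pairs for t in pair)
    unambiguous = [(a, b) for a, b in pairs if counts[a] == 1 and counts[b] == 1]
    # Stage 3: replace each pair by its most popular member
    mapping = {}
    for a, b in unambiguous:
        keep, drop = (a, b) if popularity[a] >= popularity[b] else (b, a)
        mapping[drop] = keep
    return mapping
```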
10 Snowball, String-handling Language, http://snowball.tartarus.org/
3.2 Obtaining semantic information about social tags
In order to populate ontologies with concepts associated to the filtered social tags,
general multi-domain semantic knowledge is needed. In this work, as mentioned
before, we propose to extract that information from Wikipedia. The Wikipedia articles
describe a number of different types of entities: people, places, companies, etc.,
providing descriptions, references, and even images about the described entities.
Many of these entities are ambiguous, having several meanings for different
contexts. For instance, the same tag “java” could be assigned to a Flickr picture of the
Pacific island, or a del.icio.us page about the programming language. One approach to
address tag disambiguation is by using the information available in Wikipedia. A
Wikipedia article is fairly structured: the title of the page is the entity name itself (as
found in Wikipedia), the content is divided into well delimited sections, and a first
paragraph is dedicated to possible disambiguation options for the corresponding term.
For example, the page of the entry “apple” starts as follows:
• “This article is about the fruit…”
• “For the Beatles multimedia corporation, see…”
• “For the technology company, see…”
Apart from these elements, every article contains a set of collaboratively generated
categories. Hence, for example, the categories created for the concept “teide” are:
world heritage sites in spain, tenerife, mountains of spain, volcanoes of spain, national
parks of spain, stratovolcanoes, hotspot volcanoes, and decade volcanoes. By
processing this information, we might infer that “teide” is a volcano in Spain.
Disambiguation and categorisation information have been therefore extracted from
Wikipedia for every concept appearing in our social tag datasets. Once the most
suitable category for a term is determined, we match its relevant categories to classes
defined in the domain ontologies, as explained next.
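For illustration, the categories used in this step can be extracted from an article's raw wikitext, which marks them with standard `[[Category:...]]` links. This is a minimal sketch; the actual system may access Wikipedia through other means.

```python
import re

def extract_categories(wikitext):
    # Wikipedia marks categories with [[Category:Name]] links near the end of
    # an article; an optional "|sort key" after the name is ignored here.
    return [m.strip().lower()
            for m in re.findall(r"\[\[Category:([^\]|]+)", wikitext)]
```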
3.3 Categorisation of social tags into ontology classes
The assignment of an ontology class to a Wikipedia entry is based on a morphologic
matching between the name and the categories of the entry, and the names of the
ontology classes. The ontology classes whose names are most similar to the name and
categories of the entry are chosen as the classes for which the corresponding individual
(instance) is to be created. The created instances are assigned a URI containing the
entry name, and are given RDFS labels with the Wikipedia categories.
To better explain the proposed matching method, let us consider the following
example. Let “brad pitt” be the concept we wish to instantiate. If we look up this
concept in Wikipedia, a page with information about the actor is returned. At the end of
the page, several categories are shown: “action film actors”, “american film actors”,
“american television actors”, “best supporting actor golden globe (film)”, “living
people”, “missouri actors”, “oklahoma (state) actors”, “american male models”, etc.
After retrieving that information, all the terms (tokens) that appear in the name and
categories of the entry (which we will henceforth refer to as entry terms) are
morphologically compared with the names of the ontology classes (assuming that a
class-label mapping is available, as is usually the case). Computing the Levenshtein
distance, and applying singularisation and stemming mechanisms, only the entry terms
that match some class name above a certain distance threshold are kept, and the rest
are discarded.
For instance, suppose that “action”, “actor”, “film”, “people”, and “television” are the
ones sufficiently close to some ontology class. To select the most appropriate ontology
class among the matching ones, we firstly create a vector whose coordinates correspond
to the filtered entry terms, taking as value the number of times the term appears in the
entry name and categories together. In the example, the vector might be as follows:
{(action, 1), (actor, 6), (film, 3), (people, 1), (television, 1)}, assuming that “actor”
appears in six categories of the Wikipedia entry “brad pitt”, and so forth.
Once this vector has been created, one or more ontology classes are selected by the
following heuristic:
1. If a single coordinate holds the maximum value in the vector, we select the
ontology class that matches the corresponding term.
2. In case of a tie between several coordinates having the maximum value, a new
vector is created, containing the matched classes plus their taxonomic ancestor
classes in the ontologies. Then the weight of each component is computed as the
number of times the corresponding class is found in this step. Finally, the original
classes that have the highest-valued ancestor in the new vector are selected.
Here “ontology class” and “ancestor” denote a loose notion admitting a broad
range of taxonomic constructs, ranging from informally built subject hierarchies (such
as the ones defined in the Open Directory tree or, in our experiments, the IPTC
Subjects), to pure ontology classes in a strict Description Logic sense.
In our example, the weight for the term “actor” is the highest, so we select its
matching class as the category of the entry. Thus, assuming that the class matching
this term was “Actor”, we finally define “Brad Pitt” as an instance of “Actor”.
Now suppose that, instead, the vector for Brad Pitt was {(actor, 1), (film, 1), (people,
1)}. In that case, there would be a tie in the matching classes, and we would apply the
second case of the heuristic. We take the ancestor classes, which could be e.g. “cinema
industry” for “actor”, “cinema industry” for “film”, and “mammal” for “person”, and
create a weighted list with the original and ancestor classes. Then we count the number
of times each class appears in the previous list, and create the new vector: {(actor, 1),
(film, 1), (person, 1), (cinema industry, 2), (mammal, 1)}. Since the class “cinema
industry” has the highest weight, we finally select its sub-classes “actor” and “film” as
the classes of the instance “brad pitt”.
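The selection heuristic can be sketched as follows. For simplicity the sketch assumes a single one-level ancestor per class (`ancestor_of`); all names are illustrative.

```python
from collections import Counter

def select_classes(term_vector, class_of, ancestor_of):
    """term_vector: entry term -> frequency in the entry name and categories.
    class_of:    entry term -> matching ontology class.
    ancestor_of: ontology class -> taxonomic ancestor (one level, a simplification)."""
    top = max(term_vector.values())
    winners = [t for t, v in term_vector.items() if v == top]
    if len(winners) == 1:
        # Case 1: a single coordinate holds the maximum value
        return [class_of[winners[0]]]
    # Case 2: tie - weight the matched classes plus their taxonomic ancestors
    weights = Counter()
    for term in winners:
        cls = class_of[term]
        weights[cls] += 1
        weights[ancestor_of.get(cls, cls)] += 1
    best = max(weights, key=weights.get)
    # Keep the original classes whose ancestor is the highest-valued component
    return [class_of[t] for t in winners
            if ancestor_of.get(class_of[t], class_of[t]) == best]
```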
We must note that our ontology population mechanism does not necessarily
generate individuals following a strict semantic “is-a” schema, but a more relaxed
semantic “is-related-to” association principle. This is not a problem for our final
purposes in personalised content retrieval, since the annotation and recommendation
methods in that area are themselves rooted on models of inherently approximated
nature, e.g. regarding the relationships between concepts and item contents.
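For illustration, the resulting individuals could be serialised as N-Triples: one rdf:type statement per selected class and one rdfs:label per Wikipedia category. The base namespace below is a made-up placeholder, not the system's actual URI scheme.

```python
def make_instance_triples(entry_name, classes, categories,
                          base="http://news-at-hand.example/kb#"):
    """Emit N-Triples for a new individual: a URI containing the entry name,
    rdf:type links to the selected classes, and RDFS labels from the
    Wikipedia categories. (The base namespace is a placeholder.)"""
    uri = "<%s%s>" % (base, entry_name.replace(" ", "_"))
    triples = ["%s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <%s%s> ."
               % (uri, base, c) for c in classes]
    triples += ['%s <http://www.w3.org/2000/01/rdf-schema#label> "%s" .'
                % (uri, cat) for cat in categories]
    return triples
```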
4 Preliminary evaluations
Recent works show an increasing interest in using social tagging information to enhance
personalised content retrieval and recommendation. FolkRank [7] is a search algorithm
that exploits the structure of folksonomies to find communities and organise search
results. The recommender system presented in [10] suggests web pages available on the
Internet, by using folksonomy and social bookmarking information. The movie
recommender proposed in [12] is built on keywords assigned to movies via collaborative
tagging, and demonstrates the feasibility of making accurate recommendations based on
the similarity of item keywords to those of the user’s rating tag-clouds.
In the following, we present and preliminarily evaluate how our ontological
knowledge representation, recommendation models, and tag filtering and matching
strategies are integrated in News@hand, a news recommender system.
4.1 News@hand
News@hand is a news recommender system that describes news contents and user
preferences with a controlled and structured vocabulary, using semantic-based
technologies, and integrating the recommendation models described in section 2.
Figure 3 depicts how ontology-based item descriptions and user profiles are created and
exploited by the system.
Figure 3. Architecture of News@hand
News items are automatically and periodically retrieved from several on-line news
services via RSS feeds. The title and summary of the retrieved news are annotated with
concepts of the domain ontologies. A dynamic graphical interface allows the system to
automatically retrieve all the users’ inputs in order to analyse their behaviour with the
system, update their preferences, and adjust the recommendations in real time.
Figure 4 shows a screenshot of a typical news recommendation page in News@hand.
The news items are classified into eight different sections: headlines, world, business,
technology, science, health, sports and entertainment. When the user is not logged into
the system, s/he can browse any of the previous sections, but the items are listed without
any personalisation criterion. On the other hand, when the user is logged into the system,
recommendation and profile editing functionalities are activated, and the user can
browse the news according to his and others’ preferences in different ways. Click
history is used to detect the short term user interests, which represent the dynamic
semantic context exploited by our personalised content retrieval mechanism.
The terms occurring in the title and summary that are associated with semantic
annotations of the contents, the user profile, and the current context are highlighted in
different colours. A collaborative rating is shown on a 0 to 5 star scale, and two
coloured bars indicate the relevance of the item for the profile and the context. The user
can add comments, tags and ratings to the article. S/he can also set
parameters for single or group-oriented recommendations, such as the activation or
deactivation of his/her individual preferences, those of his/her contacts and/or all other
users, the weight that the dynamic context should have over the profile, and the weight
of multiple rating criteria.
Figure 4. Item recommendation page of News@hand
4.2 Knowledge base
A total of 17 ontologies have been used for the current version of the system. They
are adaptations of the IPTC ontology11, which contains concepts of multiple domains
such as education, culture, politics, religion, science, technology, business, health,
entertainment, sports, weather, etc. They have been populated with semantic
information about the tags we extracted from Flickr and del.icio.us web sites,
applying the population mechanism explained in Section 3. A total of 137,254
Wikipedia entries were used to populate 744 ontology classes with 121,135 instances.
Table 1 describes the characteristics of the obtained knowledge base.
In order to evaluate the ontology population process, we asked 20 users to
randomly select, and manually assess 25 instances of each ontology. They were
undergraduate and PhD students of our department, half of them with experience on
ontological engineering. They were requested to declare whether each instance was
assigned to its correct class, to a less correct class but belonging to a suitable
ontology, or to an incorrect class/ontology. The table shows the average accuracy
values for all the users considering correct class and correct ontology assignments.
These preliminary results demonstrate the feasibility of our ontology population
mechanism. The average accuracy for class assignment is 69.9%, and the average
accuracy for ontology assignment rises to 84.4%. Improvements in our mapping
heuristics can be investigated. Nevertheless, we presume they are good enough for
our recommendation goals. In general, the main common concepts are correctly
instantiated, and the effect of an isolated incorrect annotation in a news item is
mitigated by the domain/s of the rest of the correct annotations.
11 IPTC ontology, http://nets.ii.uam.es/mesh/news-at-hand/news-at-hand_iptc-kb_v01.zip
Table 1. Number of classes and instances of News@hand KB, and average population accuracy
Ontology #classes #instances Avg. #instances/class Avg. accuracy
arts, culture, entertainment 87 33,278 383 78.7 / 93.3
crime, law, justice 22 971 44 62.7 / 73.3
disasters, accidents 16 287 18 74.7 / 84.0
economy, business, finance 161 25,345 157 69.3 / 80.0
education 20 3,542 177 57.5 / 76.7
environmental issues 41 20,581 502 72.0 / 85.3
health 26 1,078 41 65.3 / 89.3
human interests 6 576 96 64.0 / 84.0
labour 6 133 22 70.7 / 78.7
lifestyle, leisure 29 4,895 169 72.0 / 90.7
politics 54 3,206 59 60.0 / 81.3
religion, belief 31 3,248 105 84.0 / 90.7
science, technology 50 7,869 157 68.0 / 86.7
social issues 39 8,673 222 70.7 / 85.3
sports 124 5,567 45 72.0 / 86.7
unrests, conflicts, wars 23 1,820 79 61.3 / 80.0
weather 9 66 7 69.7 / 89.5
744 121,135 163 (avg.) 69.9 / 84.4
4.3 Semantic annotation of news
News@hand periodically retrieves news items from the websites of well-known news
media, such as BBC, CNN, NBC, The New York Times, and The Washington Post.
These items are obtained via RSS feeds, and contain information of published news
articles: their title, summary of content, publication date, hyperlinks to the full texts
and related on-line images. The system analyses and automatically annotates the
textual information (title and summary) of the RSS feeds (Figure 5).
Figure 5. Automatic RSS feeds extraction and semantic annotation processes in News@hand
Using a set of Natural Language Processing tools [1], an annotation module removes
stop words and extracts relevant (simple and compound) terms, categorised according to
their Part of Speech (PoS): nouns, verbs, adjectives or adverbs. Then, nouns are
morphologically compared with the names of the classes and instances of the domain
ontologies. The comparisons are done using an ontology index created with Lucene, and
according to fuzzy metrics based on the Levenshtein distance. For each term, if
similarities above a certain threshold are found, the most similar semantic concept (class
or instance) is chosen and added as an annotation of the news item. After all the
annotations are created, a TF-IDF based technique computes and assigns them weights.
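The fuzzy matching step can be sketched as follows. The normalised similarity formula and the threshold value are illustrative choices, and the sketch omits the Lucene ontology index and the final TF-IDF weighting.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution / match
        prev = curr
    return prev[-1]

def annotate(nouns, concepts, threshold=0.8):
    """Attach to each noun the most similar concept name, if similar enough."""
    annotations = {}
    for noun in nouns:
        best, best_sim = None, 0.0
        for concept in concepts:
            # Normalised similarity in [0, 1] derived from the edit distance
            sim = 1 - levenshtein(noun, concept) / max(len(noun), len(concept))
            if sim > best_sim:
                best, best_sim = concept, sim
        if best_sim >= threshold:
            annotations[noun] = best
    return annotations
```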
For two months, starting on 1st January 2008, we gathered RSS feeds daily. A
total of 9,698 news items were stored. For this dataset, we ran our semantic annotation
mechanism, and a total of 66,378 annotations were obtained. Table 2 shows a summary
of the average number of annotations per news item generated with our system.
Similarly to the experiments conducted for our ontology population strategy, we asked
the 20 students to evaluate 5 news items from each of the 8 topic sections of
News@hand, giving ratings with values from 0 to 10. The annotation accuracies for
each topic are also presented in the table. An average accuracy of 74.8% was obtained.
Table 2. Average number of annotations per news item, and average annotation accuracies
headlines world business technology science health sports entertainment
#news items 2,660 2,200 1,739 303 346 803 603 1,044
#annotations 18,210 17,767 13,090 2,154 2,487 4,874 2,453 5,343
#annotations/item 7 8 8 7 7 6 4 5
Avg. accuracy 71.4 72.7 79.2 76.3 74.1 73.1 75.8 76.0
4.4 Personalised news recommendations
Our 20 experimenters were requested to evaluate news recommendations according to
10 user profiles obtained from Flickr and del.icio.us datasets. Using News@hand and
its recommendation algorithms, they had to evaluate the 5 top ranked news items for
each user/topic, specifying whether a recommended item would be relevant or not for
the users taking into account their profiles. Table 3 shows the average results. Each
value represents the percentage of evaluated news items that were marked as relevant.
The results are compared with those obtained with a classic keyword-based algorithm
[4] applied to the initial folksonomy-based user profiles.
Table 3. Average relevance values for the 5 top ranked news items recommended by News@hand
headlines world business technology science health sports entertainment
keyword-based 46.3 34.3 39.0 43.5 35.9 21.1 58.0 33.5
News@hand 57.0 53.2 72.8 94.0 60.9 40.6 98.2 60.4
5 Conclusions and future work
The combination of folksonomy information with knowledge available in the Semantic
Web is in our opinion a powerful and promising approach to provide flexible, multi-
domain collaborative recommendations. It benefits from two major issues: the easy
adaptation to new vocabularies, and the supervised representation of semantic
knowledge. Folksonomies and Wikipedia repositories continuously and collaboratively
grow, providing consensual up-to-date semantic information about user preferences
and items. On the other hand, ontologies allow us to describe and organise the above
information, so that relations between concepts can be defined and used by fine-
grained content retrieval and recommendation strategies.
We have presented techniques that filter personal tags, and integrate them into
multi-domain ontological structures considering semantic information extracted from
Wikipedia. Annotating item contents with concepts of the same knowledge bases, we
relate user profiles and item descriptions under a common semantic concept space, a
fact that is exploited by several ontology-based recommendation algorithms. We have
conducted preliminary evaluations of the above techniques, obtaining favourable
results. However, more detailed experimentation should be done in order to draw
well-founded conclusions about the benefits of our proposals.
Acknowledgements
This research has been supported by the European Commission (FP6-027685 –
MESH, IST-34721 – TAGora). The expressed content is the view of the authors but
not necessarily the view of the MESH and TAGora projects as a whole.
References
1. Alfonseca, E., Moreno-Sandoval, A., Guirao, J. M., Ruiz-Casado, M. (2006). The Wraetlic
NLP Suite. Proc. of the 5th Int. Conf. on Language Resources and Evaluation (LREC 2006).
2. Cantador, I., Bellogín, A., Castells, P. (2008). A Multilayer Ontology-based Hybrid
Recommendation Model. AI Communications, special issue on Recommender Systems. In press.
3. Cantador, I., Castells, P. (2008). Extracting Multilayered Semantic Communities of Interest
from Ontology-based User Profiles: Application to Group Modelling and Hybrid
Recommendations. Computers in Human Behavior, special issue on Advances of
Knowledge Management and the Semantic Web for Social Networks. Elsevier. In press.
4. Castells, P., Fernández, M., Vallet, D. (2007). An Adaptation of the Vector-Space Model for
Ontology-based Information Retrieval. IEEE Transactions on Knowledge and Data
Engineering, 19 (2), pp. 261-272.
5. Cattuto, C., Loreto, V., Pietronero, L. (2007). Collaborative Tagging and Semiotic
Dynamics. Proceedings of the National Academy of Sciences 104(1461).
6. Chirita, P. A., Costache, S., Handschuh, S., Nejdl, W. (2007). PTAG - Large Scale
Automatic Generation of Personalized Annotation Tags for the Web. Proceedings of the
16th international conference on World Wide Web (WWW 2007). Banff, Alberta, Canada.
7. Hotho, A., Jäschke, R., Schmitz, C., Stumme, G. (2006). Information Retrieval in
Folksonomies: Search and Ranking. Proceedings of the 3rd European Semantic Web
Conference (ESWC 2006). Budva, Montenegro.
8. Jäschke, R., Marinho, L., Hotho, A., Schmidt-Thieme, L., Stumme, G. (2007). Tag
Recommendations in Folksonomies. Knowledge Discovery in Databases 2007, pp. 506-514.
9. Miller, G. A. (1995). WordNet: A Lexical Database for English. Communications of the
Association for Computing Machinery, 38(11), pp. 39-41.
10. Niwa, S., Doi, T., Honiden, S. (2006). Web Page Recommender System based on
Folksonomy Mining for ITNG’06 Submissions. Proceedings of the 3rd International
Conference on Information Technology (ITNG 2006). Las Vegas, Nevada, USA.
11. Specia, L., Motta, E. (2007). Integrating Folksonomies with the Semantic Web. Proc. of the
4th European Web Semantic Conference (ESWC 2007). Innsbruck, Austria.
12. Szomszor, M., Cattuto, C., Alani, H., O’Hara, K., Baldassarri, A., Loreto, V., Servedio, V. D.
P. (2007). Folksonomies, the Semantic Web, and Movie Recommendation. Proc. of the 1st Int.
Workshop “Bridging the Gap between Semantic Web and Web 2.0”. Innsbruck, Austria.
13. Vallet, D., Castells, P., Fernández, M., Mylonas, P., Avrithis, Y. (2007). Personalised
Content Retrieval in Context Using Ontological Knowledge. IEEE Transactions on Circuits
and Systems for Video Technology, 17 (3), pp. 336-346.
From Web 2.0 to Semantic Web:
A Semi-Automated Approach
Andreas Heß, Christian Maaß and Francis Dierick
Lycos Europe GmbH, Gütersloh, Germany
{andreas.hess, christian.maass, francis.dierick}@lycos-europe.com
Abstract. Web 2.0 and Semantic Web are regarded as two complementary
paradigms that will probably converge in the future. However, whereas the
Semantic Web is an established field of research, there has been little analysis
devoted to Web 2.0 applications. For this reason it remains unclear how the
advantages of both paradigms could be merged. In this paper we make three
contributions in this direction. First, we discuss why merging Web 2.0 and the
Semantic Web is beneficial and propose five approaches. Second, we show that
(semi-) automated tagging of content improves the quality of annotations. Third,
we present an automatic approach for improving the tag quality by using
duplicate detection techniques. We verify our approach on a large-scale data set
from the social search service Lycos IQ.
1 Introduction
The Semantic Web promises easy access to information sources using a
machine-understandable (not only machine-readable) representation of knowledge.
This requires that web resources be annotated with machine-understandable
meta-data. Presently, the primary approach to achieve this is to first define an
ontology and then use the ontology to add semantic markup for web resources.
However, at present only a fraction of web users can take part in the process of
building ontologies. Ontology tools and ontology languages impose high entrance
barriers for potential users [11]. This is likely to contribute to the fact that the
most popular approach of creating ontologies is engineering-oriented, i.e. a small
number of individuals carefully construct the representation of the domain of
discourse and release the results at some point in time to a wider community
of users. In other words, the ontology evolution is not under full control of the
users. For example, missing entries cannot be added by any user who finds the
need for a new concept, but have to be added by the small group of creators. In
natural language, in comparison, the evolution of the vocabulary is under the
control of the user community. Anybody can invent and define a new word or
concept in the course of communication.
Against this background there has been a broad debate in the academic literature
on how the process of meta-data generation can be automated to address the
problem of cost-intensive ontology construction and meta-data generation,
e.g. [3, 11]. This is an important research question, as several researchers agree
that without the proliferation of formal semantic annotations the Semantic Web
is certainly doomed to failure [3].
One way to lower the threshold for a user to enter the Semantic Web is to
move away from strict ‘heavy’ ontologies towards light-weight ontologies, folk-
sonomies or tagging. With the proliferation and growth of so-called Web 2.0
sites, tag-based folksonomies promise to be a useful tool for search and nav-
igation. Unlike an ontology, which is usually defined as a “specification of a
conceptualisation” [9] and due to its formal nature created by trained experts, a
folksonomy is “a type of distributed classification system” and “usually created
by a group of individuals, typically the resource users” [10].
However, as the user and content base of a site grows, folksonomies tend to
become more diffuse and imprecise. A certain degree of automation would help
to make the maintenance of folksonomies as well as the annotation of content
easier. In spite of the obvious need for this, up to now only a few automated
approaches have been presented.
Even though tagging is not comparable to annotations using a full ontology, it
can be regarded as a first step towards the Semantic Web. More importantly,
tagging is already used every day by millions of web users posting blog entries.
It is accepted that automated tagging algorithms, unlike manual tagging,
can produce significant levels of misclassification [5]. In a semi-automatic setting
predicted tags do not have to be perfectly accurate in order to be useful. It is
still easier for a user to browse through a list of only a few possible tags than
to enter their own free-text tags. Furthermore, entering new tags is error-prone,
since synonyms or spelling mistakes are not always detected. If the tags are
drawn from a full ontology or at least a controlled vocabulary, looking only at
a few suggestions is easier than looking at a complete ontology with possibly
thousands of concepts.
In the remainder of this paper we will make three contributions to address the
question of how the top-down approach of ontology engineering can be merged
with a bottom-up approach that is typical for so-called Web 2.0 applications.
First, we provide an overview of the advantages and disadvantages of ontologies
and folksonomies and why a merging of these concepts is beneficial. Second,
we will focus on the semi-automated classification or tagging of content. We
believe that when using algorithmic assistance for annotating text, the quality
of annotations will increase. We present a machine-learning-based classification
algorithm that is tailored for use with short texts and folksonomies. We show
that part-of-speech-tagging can be used in text classification and retrieval to
dramatically reduce the dimensionality of the corpus without affecting perfor-
mance. The algorithm is fast enough for interactive use. Third, we present an
automatic approach for tag merging and correction. This is an important issue
as folksonomies usually do not consist of a limited number of well-written tags.
Rather, almost every Web 2.0 application faces the problem that tags are mis-
spelled or are redundant. To address these problems, we present a method for
detecting different (mis-)spellings of a tag that is based on a spell-checker in
conjunction with string edit distance metrics. We use a rule learner for fine-tuning
the parameters of the algorithm.
1.1 Folksonomies: Usage Scenario
Despite their complementary nature, currently folksonomies (or tag clouds) and
ontologies are used in quite distinct usage scenarios. Folksonomies are mainly
used for tagging in Web 2.0 applications, the main use cases being search, navi-
gation and recommendation. For social bookmarking sites such as del.icio.us
tagging is an essential part of these processes: links are annotated and thereby
sorted into categories, a user can search by category, etc. For applications like
photo sharing, tagging is a prerequisite for effective searching, since a picture
cannot easily be searched by its actual content. By arranging related tags (e.g. by
co-occurrence) folksonomies also allow for browsing through different categories.
While folksonomies have the clear advantage of being cheap and reflecting the
language of the user, there are certain problems related to their use: if tags are
not drawn from a controlled vocabulary but are just plain text keywords, several
issues that affect the usefulness of tagging will arise. Especially inexperienced
users tend to assign tags that are not meaningful to other users or to assign no
tags at all. For example, as a result of an analysis of the leading social book-
marking system del.icio.us, Lee points out that about 20% of its users do
not annotate or tag any of their bookmarks [15]. Moreover, different spellings
and subjective combinations of tags lead to more or less diffuse folksonomies.
Therefore, errors occur frequently while searching for related issues and subjects.
Some users are well aware that this is a problem and that there should be some
tagging guidelines. This problem and whether tools should be used is discussed
both in the blogosphere1 and by scientists [21]. We argue that by moving from
folksonomies towards ontologies, the usefulness for Web 2.0 scenarios will im-
prove as well. Therefore, we investigate the use of semi-automatic techniques
for assisting the user in annotating content and cleaning existing tag clouds and
thus moving to a more structured representation.
1.2 Proposed Approaches
Making the transition from Web 2.0 to Semantic Web smooth and user-friendly
is a difficult task. We propose a combination of five approaches to address the
different aspects of the problem:
Semi-Automated Tagging As a first step, we believe it is very important to
reduce the uncontrolled growth of tag clouds due to usage of synonyms, mis-
spelled words and inconsistent tagging. We propose using semi-automated tag
suggestions based on text classification to guide users towards consistent tagging.
1 e.g. http://paolo.evectors.it/2005/05/24.html#a2532 or http://ross.typepad.com/blog/2005/05/tags and simple.html
We believe that users will more likely choose from a list of suggested tags than
entering new tags. As a consequence the quality of annotations will increase. In
section 2 we present our algorithm for semi-automated tagging in detail.
Tag Merging We believe that merging of synonyms and misspelled tags will
increase the quality of annotations in the same way as automated tagging. The
result of merging similar tags denoting the same concept will be a more consistent
tagging. We discuss our algorithm for tag merging in section 3.
Identification of Related Tags Based on co-tagging (i.e. tags used together
to annotate the same content) it is possible to identify a network of relationships
between tags. Approaches that involve further analysis of such a network of tags
have been discussed, e.g. [13]. However, this topic is outside the scope of our
own research.
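The co-tagging idea can be sketched as follows; this is a minimal illustration
with invented posts and tags, not an implementation prescribed by the paper:

```python
from collections import Counter
from itertools import combinations

def co_tag_graph(posts):
    """Count how often two tags are used to annotate the same post."""
    edges = Counter()
    for tags in posts:
        for a, b in combinations(sorted(set(tags)), 2):
            edges[(a, b)] += 1
    return edges

def related(edges, tag):
    """Tags related to `tag`, strongest co-occurrence first."""
    hits = [(b if a == tag else a, n)
            for (a, b), n in edges.items() if tag in (a, b)]
    return sorted(hits, key=lambda x: -x[1])

# Invented example posts, each a list of tags
posts = [["music", "sweden", "singer"],
         ["music", "singer"],
         ["sweden", "travel"]]
edges = co_tag_graph(posts)
# ("music", "singer") co-occurs twice, all other pairs once
```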
Tag Rating For the combination of tags drawn from a folksonomy and concepts
in an ontology we follow a layer concept: User-entered tags are located in an outer
layer, whereas concepts in an ontology that is maintained by experts are located
in an inner core. When tags in the outer layer are identified as being consistent
and precise and there is no equivalent concept in the ontology already, these
tags should be included as concepts. We propose that the user community can
rate annotations. When a tag gets a high number of good ratings it should be
recommended to the experts for inclusion in the core ontology. We will investigate
using such a rating mechanism in future work.
Information Extraction We propose to make extensive use of information
extraction techniques to fill the core ontology with facts. DBpedia [1] is an
approach for extracting facts from Wikipedia articles. Furthermore, a large amount
of previous work on information extraction from websites (e.g. [14]) and free form
text (e.g. [7]) exists. Some approaches are targeted directly towards use in the
Semantic Web area, e.g. [4]. In [16] an approach for finding relationships via a
web search is presented. We are currently researching in the same area. In our
approach, we are combining results from information extraction sources with a
web search in order to identify the type of relation between two persons and to
either confirm or disprove whether such a connection exists. We will present this
approach in greater detail in a future publication.
1.3 Case Study: Lycos iQ
Lycos iQ is a question-and-answer community web site. Q&A communities try
to deliver answers where algorithmic search engines fail to generate high-quality
results, by activating users from the Web 2.0 community. For example, current
search technologies have problems answering queries such as “Who is
the Swedish singer that sounds like Heather Nova?”. These kinds of questions
could only be answered by humans or by a group of users.

Fig. 1. Screenshot of the Lycos iQ website, illustrating our dataset

Q&A services provide
the necessary infrastructure to discuss such questions and enable a broad com-
munity to share their knowledge. Figure 1 shows a screenshot from the Lycos iQ
website2. Q&A communities such as Lycos iQ or Yahoo Answers must not be
confused with Q&A systems as known in Information Retrieval.
Tags are a form of meta-data that serves as a description of a particular
piece of content. For example, a question like the above could be published with the
tags ‘music’, ‘Sweden’, ‘singer’ or ‘songwriter’. Through these tags the posting is
associated with the topics ‘music’ and ‘songwriter’, although the terms are not
explicitly mentioned in the text.
From the technical point of view the ‘tagging’ of questions is the key to suc-
cess for these kinds of services: Based on tags, expert users that can answer
questions from a specific topic can be identified and brought together with users
seeking information. The quality of tags is a crucial element in the function-
ality of such a service. However, as social bookmark systems have attracted a
great attention [15, 8], there has been little analysis devoted to Q&A- and other
communities.
2 Semi-Automated Tagging
2.1 Related Work
Although text classification is an old and well-researched topic in information
retrieval and machine learning (e.g. [17]), it has not been widely used for
automatic tagging in Web 2.0 applications yet. An exception is AutoTag [19], a
system that uses a k-nearest-neighbour classifier for automated tagging of blog
posts. This work is closely related to ours. We will highlight the differences in
the next section. A more complex case-based system for semi-automated tagging
is TagAssist [22]. As far as we are aware, these systems have not been deployed
in an environment outside of the research community. No well-known Web 2.0
sites actually use semi-automated tagging based on learning or natural language
techniques.
2 http://iq.lycos.co.uk/
2.2 Problem Formulation
Formally, semi-automated tagging is a multi-value text classification problem.3
Most machine learning algorithms can only handle single-value classification.
Therefore it is common practice that single-value classification algorithms are
adapted by means of some combination method; see [23] for a recent survey.
However, these strategies are out of the scope of this paper: given that in our
scenario many annotations are plausible and the user is involved in the clas-
sification process, it is not necessary that the algorithm predicts the exact set
of true annotations and presenting a ranked list is acceptable. Considering the
high number of classes and the need for an incremental learning algorithm, using
vector space classification algorithms such as kNN or Rocchio is a logical choice.
AutoTag [19] uses a search engine to locate similar blog posts. The search query
is derived from the text that is to be classified using statistical query rewriting
techniques. In the next step, tags from the search results are aggregated and
re-ranked using information about the user. Yet, this method of predicting tags
for posts has a disadvantage. Re-writing the query at classification time is com-
putationally costly. In a semi-automatic setting, where users can annotate their
content online immediately after they type, response time is critical. Therefore,
to avoid the need for query rewriting, we decided to perform a feature selection
at training time. First we tokenised the short headline of the questions. This
headline usually consists of just one sentence. We applied part-of-speech tagging
and kept only nouns and proper nouns. Although the headlines in Lycos iQ are
much shorter than blog posts (usually just one sentence) and we did not use
re-ranking, we achieved similar performance results in preliminary experiments
conducted with the current live version of Lycos iQ.
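The feature selection step might be sketched as follows. The toy POS lexicon
below merely stands in for a real trained tagger (the paper does not name the
one it uses); Penn Treebank noun tags are assumed:

```python
import re

# Toy POS lexicon standing in for a real part-of-speech tagger;
# only the keep-nouns filter itself is the point of this sketch.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
TOY_LEXICON = {
    "who": "WP", "is": "VBZ", "the": "DT", "swedish": "JJ",
    "singer": "NN", "that": "WDT", "sounds": "VBZ", "like": "IN",
    "heather": "NNP", "nova": "NNP",
}

def noun_features(headline):
    """Tokenise a question headline and keep only nouns and proper nouns."""
    tokens = re.findall(r"[a-z]+", headline.lower())
    return [t for t in tokens if TOY_LEXICON.get(t, "UNK") in NOUN_TAGS]

# noun_features("Who is the Swedish singer that sounds like Heather Nova?")
# keeps only ["singer", "heather", "nova"]
```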
2.3 kNN classification vs. Rocchio
For comparison, we implemented two classifiers, both based on an index cre-
ated using the dimensionality reduction method described above (POS tagging).
First, we implemented a kNN classifier with k = 10 that queries the index for the
ten nearest questions (postings) in the database and aggregates the tags from
3 For the remainder of this paper, we use the terms ‘class’ (as in text classification) and ‘tag’ as synonyms.
the results. Preliminary tests showed that IDF weighting does not improve clas-
sification results in this setting, so we decided not to use it in our experiments.
Second, we implemented a Rocchio-style classifier. Rocchio classification is
based on Rocchio relevance feedback [20]. The centroids for each class were sim-
ply computed as a big bag of words containing all tokens from all posts labelled
with this specific class (tag). Note that if a posting is tagged with more than
one tag the tokens in the post are indexed for all its classes. As opposed to the
kNN classifier, we found that IDF weighting slightly improves performance,
so we used it in our experiments. Although the difference in accuracy between
using IDF weighting and not using it was rather small, we decided to use the best
setting for each of the two approaches to achieve a fair comparison.
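A minimal sketch of the two classifiers over bag-of-words vectors follows.
Cosine similarity is assumed, IDF weighting is omitted, and the class and
function names are ours, not the authors':

```python
import math
from collections import Counter, defaultdict

def cosine(u, v):
    """Cosine similarity of two sparse token-count vectors."""
    dot = sum(n * v.get(t, 0) for t, n in u.items())
    nu = math.sqrt(sum(n * n for n in u.values()))
    nv = math.sqrt(sum(n * n for n in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

class RocchioTagger:
    """One bag-of-words centroid per tag (IDF weighting omitted here)."""
    def __init__(self):
        self.centroids = defaultdict(Counter)

    def train(self, tokens, tags):
        for tag in tags:          # a multi-tagged post is indexed for every class
            self.centroids[tag].update(tokens)

    def rank(self, tokens, n=5):
        q = Counter(tokens)
        scored = sorted(((cosine(q, c), tag) for tag, c in self.centroids.items()),
                        reverse=True)
        return [tag for score, tag in scored[:n] if score > 0]

class KnnTagger:
    """k-nearest-neighbour tagger: aggregate the tags of the k most similar posts."""
    def __init__(self, k=10):
        self.k, self.posts = k, []

    def train(self, tokens, tags):
        self.posts.append((Counter(tokens), tags))

    def rank(self, tokens, n=5):
        q = Counter(tokens)
        nearest = sorted(self.posts, key=lambda p: cosine(q, p[0]), reverse=True)[:self.k]
        votes = Counter(t for _, tags in nearest for t in tags)
        return [t for t, _ in votes.most_common(n)]

tagger = RocchioTagger()
tagger.train(["singer", "sweden", "heather", "nova"], ["music", "sweden"])
tagger.train(["football", "goal"], ["sport"])
# tagger.rank(["singer"]) ranks the music-related tags above "sport"
```

The Rocchio variant classifies against one centroid per tag rather than against
every stored post, which is why it is the faster of the two at query time.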
Given the nature of the two algorithms, we expect that Rocchio classifica-
tion will be faster at classification time, a factor that is very important in an
interactive setting, when users are not willing to accept long response times.
When comparing the classification performance of Rocchio vs. kNN, Rocchio
has some known disadvantages: it becomes inaccurate when classes are not
spheres of similar size in vector space, and it does not handle non-spherical
classes (e.g. multi-modal classes that consist of more than one cluster) very
well. However, we argue that these properties of Rocchio classifiers will not
affect performance in our setup, because of three main reasons: First, while
the dataset is indeed skewed and classes have very different size, we expect
that most of the classes will be mono-thematic. Second, while large classes will
be preferred by the classifier, this is not a disadvantage in our scenario. When
considering semi-automated tagging of postings, it is actually an advantage when
we direct users towards choosing more popular tags, as explained in section 1.
Third, since our scenario is a multi-value classification problem where not only a
single prediction is correct, overlap between classes does not necessarily influence
accuracy negatively.
2.4 Evaluation
To evaluate our approach, we used a corpus of 116417 questions from the
knowledge community web site Lycos iQ. Figure 1, showing a screenshot from the
Lycos iQ website, gives a good impression of what our dataset looks like. Note
the different levels of tags the users assigned and the style of the texts. We use
only the headlines as shown on this screenshot for classification. After tokenisa-
tion and POS tagging, there were 89275 distinct tokens. After deleting frequent
tokens according to Zipf’s law, this number was further reduced to 27309, lead-
ing to an average of only 2.63 tokens per question. Questions were tagged with
49836 distinct tags that follow the usual distribution with some frequently used
tags followed by a long tail. It should be noted that the number of classes ex-
ceeds the number of tokens. We expect that tags drawn from the long tail will
be suggested very rarely if at all. However, this does not pose a problem, since
guiding the user towards common tags accepted by many users is one of the
goals of our approach. Suggesting tags that were initially only used very few
times or even only once does not make sense. It is not useful to include text
from answers in the classification. When the user is tagging a question, there are
no answers available yet. The text from the answers is not helpful for classifying
the questions, since the words are too different. We evaluated both the kNN ap-
proach and the Rocchio approach automatically using the 111629 questions that
had user-assigned tags, comparing the predicted tags with the manually assigned
tags. We used the well-known leave-one-out cross-validation scheme for
evaluation. After preprocessing, some questions ended up with no tokens at
all. These questions were, however, left in the corpus and thus affect the
overall performance of our algorithm negatively.
Methodology It is important to note that the tags assigned by users should not
be regarded as a gold standard. Tags are not drawn from an ontology, taxonomy
or controlled vocabulary, but are free text entered by users and thus prone to
spelling mistakes. Also, inexperienced users tend to assign either no tags at all,
only very few tags or they tag inconsistently. Given the large number of users,
we also expect that users use different synonyms to denote the same concept.
Due to these ambiguities and inconsistencies we expect that the accuracy of any
automated approach is considerably lower than its true usefulness.
To overcome this problem in our empirical evaluation, we distributed ques-
tionnaires and had test persons check the plausibility of tags suggested by our
semi-automated approach. To reduce the workload for the test persons and be-
cause it outperformed the kNN classifier in the automated tests, we decided
to test only the Rocchio-style approach. For comparison, we also had the test
persons check the precision of the user-assigned tags, since we assumed many
nonsensical or inconsistent tags among them. Every test person was given one
or two chunks of 100 out of a random sample of 200 questions that were either
machine-tagged or hand-tagged. Every question was checked by four persons to
average out disagreement about the sensibility of tags.
Automatic Evaluation For the classification results evaluated against the
user-assigned tags, we report precision, recall and accuracy. Precision is defined
as the number of predicted tags that also occur in the set of manually assigned
tags divided by the number of predicted tags that were considered. We report
precision only for the top three predicted tags. Since most questions are only
tagged with up to three tags, it does not make sense to report precision when
allowing for more predicted tags. Recall is defined as the number of tags in the
set of user-assigned tags that also occur in the set of predicted tags divided
by the number of user-assigned tags. We report recall for the top ten predicted
tags. With this definition of precision and recall we follow the generally accepted
definition in information retrieval and machine learning. Furthermore, we report
the fraction of questions where there is at least one overlap between the assigned
and predicted tags as accuracy. We regard it as already useful if only some or
even one sensible suggestion is among the top suggested tags, since the user can
quickly browse through a short list of suggestions and select or deselect tags. We
expect that accuracy and recall will go up if we include more suggested tags.
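These definitions can be sketched directly; the tag lists below are invented
for illustration:

```python
def precision_at(predicted, gold, n=3):
    """Fraction of the top-n predicted tags the user also assigned."""
    top = predicted[:n]
    return sum(t in gold for t in top) / len(top) if top else 0.0

def recall_at(predicted, gold, n=10):
    """Fraction of user-assigned tags recovered in the top-n predictions."""
    return sum(t in predicted[:n] for t in gold) / len(gold) if gold else 0.0

def accuracy(predicted, gold, n=10):
    """1 if at least one predicted tag overlaps the assigned tags."""
    return int(any(t in gold for t in predicted[:n]))

# Invented example: three assigned tags, four predictions
gold = {"music", "sweden", "singer"}
pred = ["music", "pop", "singer", "band"]
# precision@3 = 2/3, recall@10 = 2/3, accuracy = 1
```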
Top n tags 1 2 3 4 5 6 7 8 9 10
kNN Precision 0.26 0.24 0.20 — — — — — — —
kNN Recall 0.26 0.23 0.21 0.23 0.24 0.25 0.26 0.27 0.27 0.28
kNN Accuracy 0.23 0.29 0.32 0.34 0.35 0.36 0.37 0.38 0.39 0.39
Rocchio Precision 0.32 0.31 0.27 — — — — — — —
Rocchio Recall 0.32 0.30 0.28 0.31 0.33 0.35 0.37 0.38 0.39 0.40
Rocchio Accuracy 0.28 0.36 0.41 0.43 0.45 0.47 0.48 0.49 0.49 0.50
Table 1. Results of the automatic evaluation
Table 1 shows the empirical results for precision, recall and accuracy for
the two different proposed methods. For recall and accuracy, we highlight the
value at five suggested tags, because we believe that a user will not accept a
longer list. As Miller pointed out in [18], most people can process five items at
once. We measured the classification time per instance for both approaches on
an Intel Core 2 machine with 1.86 GHz and 1 GB RAM. As expected, Rocchio
classification was much faster than kNN. The classification time for each instance
was 155 ms for kNN and 57 ms for Rocchio.
Manual Evaluation As expected, there was considerable disagreement between
the test persons and the users who originally tagged the questions, as well as
among the test persons themselves. For the manual evaluation,
we checked only the Rocchio classifier because it performed better in the auto-
matic test. As explained above, the total 200 questions that were evaluated were
split in two sets of 100 questions, yielding four different questionnaires (two for
the original user-assigned tags and two for machine-annotated tags) and each
chunk of 100 questions was checked by four persons. Each test person checked
at most two sets of questions. To highlight the large differences between the
test persons, we report the individual results in the table below. For the
human-annotated tags, we evaluated precision, defined as the number of useful tags
divided by the total number of assigned tags. For the machine-assigned tags, we
report accuracy as well, with the same definition of accuracy as in the automatic
test. Questions that had no assigned tags were ignored for evaluating accuracy.
Among the 200 randomly selected questions were 6 that had no assigned tags,
leading to 194 questions evaluated for accuracy.
For all manual tests, we evaluated the algorithms with five suggested tags
only. We believe that in a real-world semi-automated setting, we cannot assume
that an inexperienced user is willing to look at more than five tags. The questions
that were manually tagged had mostly three tags each, some of them only two
and very few questions had more than three tags.
As expected, there was a large disagreement between different persons both
on the human-annotated tags as well as the machine-annotated tags (see
Table 2). It is interesting to note, when looking at the second set of questions,
that, although the human annotations on this set of 100 questions were rated
worse than those from the first set, the tags suggested by our algorithm were
Test TP TP+FP avg. Prec.
assigned tags 1535 1856 0.83
suggested tags 1866 3360 0.56
Test Person 1 Person 2 Person 3 Person 4
Set 1, assigned tags, prec. 0.89 0.89 0.93 0.96
Set 2, assigned tags, prec. 0.52 0.73 0.73 0.87
Set 1, suggested tags, prec. 0.41 0.52 0.53 0.71
Set 2, suggested tags, prec. 0.51 0.54 0.59 0.65
Set 1, accuracy 0.84 0.84 0.86 0.87
Set 2, accuracy 0.87 0.87 0.91 0.91
Table 2. Results of the manual evaluation
on average rated even slightly better. Keeping in mind that we envision a semi-
automated scenario with human intervention, we see this as a confirmation that
automatically suggested tags can help to improve the quality of tagging.
When looking at macro-averaged precision, it is obvious that a classifica-
tion system is still not good enough for fully automated tagging. However, it
is important to note that even the human-annotated questions were rated far
below 100% correct by the test persons. More than half of the suggested tags
were rated as useful by the test persons. We believe that this is certainly good
enough for a semi-automated scenario, where users are presented with a small number
of tags to choose from. In absolute numbers, interestingly, the automatic clas-
sifier produced more helpful tags than were assigned by users, even compared
to the number of all user-assigned tags, not just the ones perceived as helpful
by the test persons. We believe that this confirms our hypothesis that users will
assign more tags when they are supported by a suggestion system. However, this
can only be finally answered with a user study done with a live system.
Finally, the high accuracy of the classifier (with accuracy being defined as
in section 2.4) underlines our conclusion that semi-automated tagging is good
enough to be implemented in a production environment. In almost nine out of
ten cases there was at least one helpful tag among the suggestions.
3 Tag Merging
3.1 Problem Formulation
When considering a social tagging system with a potentially high number of
users, one of the most common problems is inconsistent tagging. One common
facet is that users tend to misspell tags. While we can guide the user towards
consistent tagging by suggesting tags, a complementary approach is to clean
existing tag clouds after the annotation phase by identifying tags that should
be merged. This merging can be based both on string similarity to detect mis-
spellings or based on a thesaurus to detect synonyms. The first approach is well
known as duplicate detection or record linkage. Yet, little attention has been
devoted to using duplicate detection techniques for improving the quality of tag
clouds. Unlike semi-automated annotation, tag duplicate detection must have
a precision that is high enough to be run unsupervised, with little manual
correction. An interactive review is infeasible when merging possibly several
thousand misspelled tags. Therefore, the duplicate detection algorithm should
be biased towards high precision.
3.2 Parameter Tuning
The obvious approach to address the tag duplicate detection problem is to use
string similarity metrics. Preliminary experiments, however, showed that the
performance of using a single string distance metric is not sufficient to be em-
ployed in an automatic setting in terms of precision. Therefore, we decided to
combine multiple similarity measures and use a machine learning algorithm to
fine-tune the exact setting. A similar approach has been used by Bilenko and
Mooney [2].
We use a two-step approach. First, we check whether a tag should be con-
sidered for merging. Second, if the tag is suitable, we use an off-the-shelf spell-
checking tool to identify possible tags that are candidates for merging. To be
able to use a standard spell checker for our purpose, we do not use a natural
language dictionary but the list of all tags in the system instead. Therefore, sug-
gestions by the spell checker are tags that are textually similar. While an efficient
implementation of a spell checker will return a result very quickly, these results
are not accurate enough for a fully automated merging process. To increase
precision, we perform further checks. Pairwise calculation of all of these
features for all tags would be computationally prohibitive.
For the first step, checking whether a tag should be considered for merging,
we used a number of features such as frequency, number of related tags, string
length and the number of tokens (a tag can consist of multiple words). Tags are
called related when they are used together to annotate the same content.
Related tags of second order are tags that are related to the same other tag (e.g.
if there are relationships between “house” and “door” and between “building”
and “door”, then “house” and “building” have a second-order relationship).
Frequency is defined as the number of times the tag has been used. For the
second step, checking whether two tags should be merged, we considered addi-
tional features such as Levenshtein edit distance, Jaro-Winkler string similarity,
Monge-Elkan string similarity and Smith-Waterman string similarity. The rela-
tionship strength is defined as the number of co-occurrences. The relationship
strength of second order is defined as the sum of the strength of the connecting
first order relations.
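A toy version of the second check might look as follows. The Levenshtein
implementation is standard; the frequency rule and distance threshold are
illustrative only, not the tuned values from the paper:

```python
def levenshtein(a, b):
    """Classic dynamic-programming string edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def merge_candidate(tag, other, freq, max_dist=1):
    """Merge a tag into a textually similar, more frequent one.
    Illustrative rule only; the paper tunes its thresholds with a rule learner."""
    return freq[other] > freq[tag] and levenshtein(tag, other) <= max_dist

# Invented frequencies: the rare misspelling folds into the common tag
freq = {"music": 5120, "musik": 14}
# merge_candidate("musik", "music", freq) holds; the reverse does not
```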
To tune the thresholds for the merging system we created training sets with
the features explained above and exported them into ARFF format for process-
ing with WEKA [24]. Because of the desired area of use and the simplicity of
implementation of the classifier, learners that output rules seemed an obvious
choice. We experimented with different rule learning and decision tree algorithms
including RIPPER and C4.5 as implemented in WEKA.4 Since our goal is to
implement an automatic, unsupervised tag merger, we consider precision on the
class of merged tags as more important than recall and accuracy. A classifier
with high precision leads to a merger that will make few mistakes at the price of
missing some tags that should be merged. To bias the learning algorithm towards
precision on one class, we experimented with biased classifiers from the imple-
mentation of the Triskel algorithm [12]. Biasing methods include under-sampling
(randomly dropping training instances from one class) and over-weighting (as-
signing a higher weight to instances from one class). It has been shown that
even dropping as much as 90% of instances leads to a good classifier with high
precision. After looking at the resulting models, it turned out that, as expected,
some of the features were redundant.
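Under-sampling can be sketched as below. Which class is thinned out, and by
how much, depends on the desired direction of the bias; the 90% figure here
simply mirrors the number quoted above, and the data is invented:

```python
import random

def under_sample(instances, labels, drop_class, drop_frac=0.9, seed=0):
    """Randomly drop a fraction of the training instances of one class.
    Thinning out the positive ('merge') class makes a learner more
    reluctant to predict a merge, trading recall for precision."""
    rng = random.Random(seed)
    return [(x, y) for x, y in zip(instances, labels)
            if y != drop_class or rng.random() >= drop_frac]

# Invented training set: 100 "merge" instances among 900 "keep" ones
data = list(range(1000))
labels = ["merge" if i % 10 == 0 else "keep" for i in data]
sampled = under_sample(data, labels, drop_class="merge")
# all "keep" instances survive; only about a tenth of "merge" remains
```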
3.3 Evaluation
To evaluate the tag merger, we used the same dataset as described in section 2,
with 49836 distinct tags in the database. Running the merging algorithm
generated 4320 sets of merged tags covering 10245 tags in total, i.e. an average
of 2.4 tags per set. This means that after merging, about one fifth of all tags
are part of a set of merged tags, and when only one tag per set is kept, the
number of available tags is reduced by around 11.9%.
To measure the precision of the merger, we examined a sample of 100 tag
sets containing 248 distinct tags. Of these 100 sets, 3 contained one tag that was
not correctly merged, 1 set contained two tags that did not fit and in 2 cases
it could be considered questionable whether or not the tags should be merged.
This leads to an average precision of 94% or 96%, depending on whether the
two tags in doubt are considered correctly merged. When putting the errors in
relation to individual tags instead of sets (keeping in mind that a set can contain
more than two tags), precision is approximately 97.98% or 97.18%.
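The reduction and precision figures reported in this section can be checked
arithmetically; the sketch assumes each of the two questionable sets contains
exactly one doubtful tag:

```python
# Figures from the merging experiment on the Lycos iQ tag set
total_tags, merged_tags, merged_sets = 49836, 10245, 4320

removed = merged_tags - merged_sets                    # one surviving tag per set
assert removed == 5925
assert round(100 * removed / total_tags, 1) == 11.9    # quoted reduction
assert round(merged_tags / merged_sets, 1) == 2.4      # avg. tags per set

# Sampled 100 sets / 248 tags: 3 sets with one wrong tag,
# 1 set with two wrong tags, 2 questionable sets
wrong, questionable = 3 * 1 + 1 * 2, 2
assert 100 - 3 - 1 - 2 == 94 and 100 - 3 - 1 == 96     # set-level precision
assert round((248 - wrong) / 248, 4) == 0.9798         # doubts counted correct
assert round((248 - wrong - questionable) / 248, 4) == 0.9718
```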
We believe that the precision of this approach is high enough to allow the
elimination of misspelled tags, or the merging of singular and plural forms of
a tag, in a fully automated setting. Even with very conservative settings that
yield a high precision, a significant number of tags can already be merged or
discarded.
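The aggregate figures reported above follow directly from the raw counts. A small sketch recomputing them (the function name is ours):

```python
def merge_statistics(total_tags, n_sets, tags_in_sets):
    """Derive the statistics reported above from the raw counts:
    average set size, the share of tags belonging to a merged set,
    and the overall reduction when keeping one tag per set."""
    avg_set_size = tags_in_sets / n_sets
    merged_share = tags_in_sets / total_tags
    removed = tags_in_sets - n_sets      # one representative survives per set
    reduction = removed / total_tags
    return avg_set_size, merged_share, reduction

avg, share, red = merge_statistics(49836, 4320, 10245)
# avg ≈ 2.37, share ≈ 0.21 (about one fifth), red ≈ 0.119 (11.9%)
```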
4 Conclusion
4.1 Summary
In this paper, we have made three contributions. First, we have proposed five
approaches to address some of the problems that arise when moving from a
folksonomy towards an ontology. We argue that this move brings benefits,
although heavyweight ontologies tend to become too complicated for the broad
mass of inexperienced users. Therefore, we advocate (semi-)automated measures
that guide users towards more structured tagging. We regard this as a first
step on the move from Web 2.0 to the Semantic Web.
4 The WEKA implementations of RIPPER and C4.5 are called JRip and J4.8
Second, we have presented a classification algorithm that has been shown
to perform well in terms of accuracy even when using very short text snippets.
We have shown that a drastically reduced dimensionality of the text corpus can
be achieved when using part-of-speech tagging. We conclude from our empirical
study based on a dataset with questions from the Lycos iQ web site that the
performance of our classifier is good enough to be employed in a semi-automatic
setting and that it scales well enough to be used in a production system with a
large number of instances.
Third, as a complementary measure to the proposed semi-automated tagging,
we have briefly described an algorithm for automatically merging misspelled or
similar tags, and have shown that its precision is high enough for unsupervised
operation while still eliminating a large number of tags. We believe that both
ad-hoc interactive tagging suggestions and post-hoc offline merging are
effective approaches to improving the quality of a folksonomy, as a
prerequisite for moving towards more structured semantic networks and
ontologies.
4.2 Generalisation
In recent work that is beyond the scope of this paper, we have tested our text
classification system on the well-known Reuters dataset. We adapted our
algorithm to output a true multi-value classification instead of a ranked
list, introducing a new method for multi-value classification that is related
to stacking. The performance of our algorithm is comparable to the older
results presented in [6], although we use only the headlines of the articles
as opposed to the full text. In [6], a macro-averaged accuracy of 63.7% on the
top 10 classes was reported for the Rocchio classification algorithm in a
one-classifier-per-class scheme; we achieved 67.1% accuracy using our stacking
approach. Due to the different focus of these experiments and space
restrictions, we will present these results in a separate publication. Even
though we focused on Rocchio and kNN classification due to performance
requirements related to interactive use, using SVMs should be further
evaluated.
Second, we applied our tag merging algorithm to the task of aligning headlines
of movie reviews, news articles and movie titles from the cinema programme.
The dataset was taken from the Lycos movie portal site. The setup of these
experiments was slightly different: a pre-selection using a fast spell-checking
algorithm was not needed due to the smaller dataset size, we removed
domain-specific stop words, and, due to the different nature of the dataset,
co-occurrence information was not available. As an additional distance metric,
a soft-token TF-IDF metric was used. As in the experiments described here, we
used a rule learner to identify combination rules for the different similarity
metrics and to learn a threshold. Additionally, we experimented with an SVM
classifier. In these experiments we achieved a precision of almost 100%.
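The general shape of combining several similarity metrics under learned thresholds can be sketched as follows. The metrics used here (difflib's character ratio and a Jaccard token overlap) and the threshold values are illustrative stand-ins, not the metrics or learned rules from our experiments; the soft-token TF-IDF metric is approximated crudely by plain token overlap.

```python
from difflib import SequenceMatcher

def edit_similarity(a, b):
    """Character-level similarity (a stand-in for an edit-distance metric)."""
    return SequenceMatcher(None, a, b).ratio()

def token_overlap(a, b):
    """Jaccard overlap on lower-cased word sets (a crude stand-in for a
    soft-token TF-IDF metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def should_align(a, b, t_edit=0.85, t_token=0.5):
    """A hand-written rule of the shape a rule learner might produce:
    align when either metric exceeds its (learned) threshold.
    The thresholds here are illustrative, not those from the paper."""
    return edit_similarity(a, b) >= t_edit or token_overlap(a, b) >= t_token
```

A rule learner would choose the combination and the thresholds from labelled pairs rather than having them fixed by hand as above.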
4.3 Future Work
One of the next big challenges will be to align folksonomies with existing ontolo-
gies. As mentioned in section 1.2, we are currently investigating using web search
to identify relationships between persons. In future work, we want to investigate
how ratings by users can be used to identify high-quality tags. These tags could
then be presented to a group of moderators or administrators to decide whether
or not they should be added as concepts into a more stable ontology.
One of the clear hurdles towards widespread adoption of ontologies has been
the usability problems related to visualising and interacting with such
complex datasets. Specifically, we hope to guide users towards real semantic
tagging without alienating them. This process of gradually increasing ontology
complexity has been described in the literature as ontology evolution.
4.4 Acknowledgements
The research presented in this paper was partially funded by the German Federal
Ministry of Economy and Technology (BMWi) under grant number 01MQ07008.
The authors are solely responsible for the contents of this work.
We thank our colleagues at Lycos Europe who helped us with the manual
evaluation of our tagging system.
References
1. Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak,
and Zachary Ives. DBpedia: A nucleus for a web of open data. In The
Semantic Web, volume 4825 of Lecture Notes in Computer Science, pages 722–
735, Berlin/Heidelberg, 2008. Springer.
2. Mikhail Bilenko and Raymond J. Mooney. Learning to combine trained distance
metrics for duplicate detection in databases. Technical report, Dept. of Computer
Science, University of Texas, Austin, Texas, USA, February 2002.
3. Philipp Cimiano, Günter Ladwig, and Steffen Staab. Gimme’ the context: Context-
driven automatic semantic annotation with C-PANKOW. In Proc. of the 14th Int.
World Wide Web Conference, 2005.
4. Fabio Ciravegna, Sam Chapman, Alexiei Dingli, and Yorick Wilks. Learning to
harvest information for the semantic web. In Proceedings of the 1st European
Semantic Web Symposium, Heraklion, Greece, May 2004.
5. Stephen Dill, Nadav Eiron, David Gibson, Daniel Gruhl, R. Guha, Anant Jhingran,
Tapas Kanungo, Sridhar Rajagopalan, Andrew Tomkins, John Tomlin, and Jason
Zien. SemTag and Seeker: Bootstrapping the semantic web via automated semantic
annotation. In Proc. of the 12th Int. World Wide Web Conference, 2003.
6. Susan Dumais, John Platt, David Heckerman, and Mehran Sahami. Inductive
learning algorithms and representations for text categorization. In CIKM ’98:
Proceedings of the seventh international conference on Information and knowledge
management, New York, NY, USA, 1998. ACM.
7. Aidan Finn and Nicholas Kushmerick. Multi-level boundary classification for in-
formation extraction. In Proc. European Conf. on Machine Learning. Springer,
2004.
8. Gernot Gräfe, Christian Maaß, and Andreas Heß. Alternative searching services:
Seven theses on the importance of social bookmarking. In Proc. of the 1st Conf.
on Social Semantic Web, 2007.
9. Tom R. Gruber. A translation approach to portable ontologies. Knowledge Acqui-
sition, 5(2):199–220, 1993.
10. Marieke Guy and Emma Tonkin. Folksonomies – tidying up tags? D-Lib Magazine,
12(1), January 2006.
11. M. Hepp, D. Bachlechner, and K. Siorpaes. OntoWiki: Community-driven ontol-
ogy engineering and ontology usage based on wikis. In Proceedings of the 2006
international symposium on Wikis, 2006.
12. Andreas Heß, Rinat Khoussainov, and Nicholas Kushmerick. Ensemble learning
with biased classifiers: The triskel algorithm. In Proceedings of the 6th International
Workshop on Multiple Classifier Systems, Monterey Bay, California, USA, 2005.
13. Andreas Hotho, Robert Jäschke, Christoph Schmitz, and Gerd Stumme. Emergent
semantics in BibSonomy. In Christian Hochberger and Rüdiger Liskowsky, editors,
GI Jahrestagung, volume 94 of LNI, pages 305–312, 2006.
14. Nicholas Kushmerick. Wrapper induction for information extraction. PhD thesis,
University of Washington, 1997.
15. K. Lee. What goes around comes around: An analysis of del.icio.us as social space.
In Proc. of the 20th conf. on Computer supported cooperative work, 2006.
16. Gang Luo, Chunqiang Tang, and Ying li Tian. Answering relationship queries on
the web. In Proc. of the 16th Int. World Wide Web Conference, New York, NY,
USA, 2007. ACM.
17. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction
to Information Retrieval. Cambridge University Press, 2008.
18. G. A. Miller. The magical number seven, plus or minus two: Some limits on our
capacity for processing information. The Psychological Review, 1(63):81–97, 1956.
19. Gilad Mishne. Autotag: a collaborative approach to automated tag assignment for
weblog posts. In Proc. of the 15th Int. World Wide Web Conference, New York,
NY, USA, 2006. ACM Press.
20. J. J. Rocchio. The SMART Retrieval System: Experiments in Automatic Document
Processing, chapter Relevance feedback in information retrieval, pages 313–323.
Prentice-Hall, Englewood Cliffs, NJ, USA, 1971.
21. Clay Shirky. Ontology is overrated: Categories, media & community.
http://shirky.com/writings/ontology_overrated.html, 2006.
22. Sanjay C. Sood, Sara H. Owsley, Kristian J. Hammond, and Larry Birnbaum. Tag-
Assist: Automatic tag suggestion for blog posts. In Proceedings of the International
Conference on Weblogs and Social Media, Boulder, Colorado, USA, March 2007.
23. Grigorios Tsoumakas and Ioannis Katakis. Multi-label classification: An overview.
International Journal of Data Warehousing and Mining, 3(3):1–13, 2007.
24. I. H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations. Morgan Kaufmann, San Francisco, 1999.
Using multiple ontologies as background
knowledge in ontology matching
Zharko Aleksovski1, Warner ten Kate1, and Frank van Harmelen2
1 Philips Research, Eindhoven
2 Vrije Universiteit, Amsterdam
Abstract. Using ontologies as background knowledge in ontology matching
is being actively investigated. Recently the idea has attracted attention
because of the growing number of available ontologies, which opens up new
opportunities and reduces the problem of finding candidate background
knowledge. Particularly interesting is the approach of using multiple
ontologies as background knowledge, which we explore in this paper. We
report on an experimental study conducted using real-life ontologies
published online.
The first contribution of this paper is an exploration of how the matching
performance behaves when multiple background ontologies are used
cumulatively. As a second contribution, we analyze the impact that
different types of background ontologies have on the matching performance.
With respect to precision and recall, more background knowledge
monotonically increases the recall, while the precision depends on the
quality of the added background ontology: high quality tends to increase
it, and low quality tends to decrease it.
1 Introduction
Ontology matching is regarded as one of the most urgent and most important
problems in the Semantic Web. It is a scientifically challenging and
inherently very difficult problem [1–6]. It has generated a lot of research in
the past years, resulting in many different proposed solution methods. Good
surveys of the existing ontology matching methods can be found in [2, 3, 7].
According to [7] they can be divided into four categories: terminological
methods that use lexical similarities between names, comments etc.; structural
methods that use similarities in the structure of the matched ontologies;
instance-based methods that use the classified instance data in the
ontologies; and methods using background knowledge, which rely on external
structured resources to find matching entities across different ontologies. In
this paper, we focus on the last category: using ontologies as background
knowledge in the matching.
Background knowledge has been used in matching in different ways [8–10].
In this study we use a very simple approach. We try to match each pair of
concepts from the two ontologies in two steps: anchoring and deriving
relations. In the anchoring step, we check whether the concepts to be matched
can themselves be matched to the background knowledge; in the deriving
relations step, we check whether they match background concepts that are
related to one another. If they do, we report that the pair of concepts is
matched. We call this type of match an indirect match, because it is
discovered indirectly through the background knowledge ontology.
With respect to matching success, no background knowledge ontology,
regardless of the choice, is likely to provide all the matches we would like
to find. Instead, it is reasonable to expect that matches missed by one
background ontology can be found using another. Hence, hoping to find more of
the desired matches, we can use multiple ontologies as background knowledge.
The question we face then is how the characteristics of the background
ontologies impact the matching performance. As discussed in the study of [11],
the landscape of the ontologies published online is very diverse.
We set out to investigate the feasibility of matching when multiple
background ontologies are used. To stress the paradigm, we present the results
of several experiments in which we set our objectives as follows: (i) the
anchoring to the background knowledge is a simple lexical matching technique,
i.e. we only use simple matching as needed to obtain relatively successful
anchoring (see Section 3 for further explanation); (ii) the background
knowledge candidates are relatively large ontologies3; and (iii) there is
lexical overlap between the matched ontologies and the background knowledge,
that is, a lexical match is possible between them. We use multiple background
knowledge ontologies by using each ontology separately and then combining the
sets of obtained matches. We are interested in how the background ontologies
perform together as compared to how each of them performs alone.
Multiple ontologies as background knowledge have already been used [9]. The
contribution of this paper is that we study the contribution of each
background ontology individually as compared to their cumulative
contribution, and we also study the effects of different types of ontologies
when used as background knowledge. All the test data was selected from
ontologies published online; it consisted of two ontologies which we matched
to one another and six others that we used as background knowledge. Having
selected our data from ontologies published online, our results reflect the
current state of the published ontologies.
The experiments revealed that using background ontologies published on the
Semantic Web provides a small but consistent increase in recall as compared to
simple direct matching. Multiple background ontologies find almost disjoint
sets of matches, and hence result in a cumulative increase in recall. The
precision of background-based matching mostly, but not entirely, depends on
the quality of the background ontologies. Low-quality ontologies increase the
recall, but they reduce the precision. High-quality background knowledge
ontologies also find wrong matches, but these are mainly caused by the
different context of the knowledge, not by mistakes in the ontologies.
3 Ontologies of around 30 concepts are common as demonstration examples;
however, they are trivial to analyze and do not provide well-grounded
empirical insight. Hence, we focused our attention on larger ontologies with
at least a couple of hundred concepts as more interesting candidates.
The rest of the paper is organized as follows: in Section 2 we describe our
approach to using background knowledge in detail, in Section 3 we describe our
case study with the experimental data and the results, in Section 4 we discuss
the findings of the experiments, and finally in Section 5 we conclude the
paper.
2 Using background knowledge in ontology matching
In our approach we match two ontologies while using a third one as background
knowledge. We call the ontologies being matched the source and the target;
however, this naming is not discriminative - the matching algorithm treats
them equally, and swapping their places will only invert the result set4. As
we mentioned in the introduction, the algorithm proceeds in two steps,
anchoring and deriving relations. Its scheme is depicted in Figure 1.
[Figure omitted: the source and target ontologies are each anchored to the
background knowledge ontology, through which the indirect match between
source and target is derived.]
Fig. 1. Scheme of ontology matching using background knowledge.
Anchoring is matching the source and target concepts to the background
knowledge. In general, this step can be performed using any existing ontology
matching technique. In our case we use only simple lexical matching. Other
methods may produce wrong matches, which would make the experimental results
difficult to explain; the simple technique, while rigid, is very precise and
allows us to concentrate on the use of the background knowledge itself.
4 The source and the target concept in each match in the result set will have
their places swapped as well.
Deriving relations is the process of discovering relations between source and
target concepts by looking for relations between their anchored concepts in the
background knowledge. Both the source and target concepts anchors are part
of the background knowledge, and checking if they are related means using the
reasoning service in the background knowledge ontology. Combining the anchor
matches with the relations between the background knowledge concepts derives
the match between source and target concepts, which is what we are looking for.
To illustrate this process in the context of life-science ontologies, consider
the realistic example in Figure 2: the source concept SRC: Brain is anchored
to the background knowledge concept BK: Brain, and the target concept TAR:
Head is anchored to the background knowledge concept BK: Head. The background
knowledge reveals a relation BK: Brain part-of BK: Head, and we derive the
relation that the source concept SRC: Brain has a narrower meaning than the
target concept TAR: Head. Using background knowledge was crucial in this case;
the match was not found by directly matching the source to the target
ontology, since SRC: Brain is classified under SRC: Central nervous system,
which is in no way related to the concept TAR: Head.
[Figure omitted: in the background knowledge, Brain is part of Head; the
source concept Brain and the target concept Head are anchored to these
concepts, and the relation Brain narrower-than Head is derived.]
Fig. 2. Example using background knowledge in the matching process.
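The two-step matching of the example above can be replayed in a few lines of code. This is a minimal sketch under our own assumptions: the dictionary-based encoding of the background hierarchy, the exact-label anchoring, and all function names are illustrative, not the implementation used in the experiments.

```python
# Background knowledge as a child -> parent map (here a part-of hierarchy).
background = {
    "Brain": "Head",
}

def anchor(concept, bk_concepts):
    """Anchoring: simple lexical (exact-label) match into the background."""
    return concept if concept in bk_concepts else None

def indirect_match(src_concept, tar_concept, bk):
    """Deriving relations: report 'narrower-than' when the source anchor
    lies below the target anchor in the background hierarchy."""
    bk_concepts = set(bk) | set(bk.values())
    a = anchor(src_concept, bk_concepts)
    b = anchor(tar_concept, bk_concepts)
    if a is None or b is None:
        return None
    node = a
    while node in bk:                 # walk up the hierarchy
        node = bk[node]
        if node == b:
            return (src_concept, "narrower-than", tar_concept)
    return None

print(indirect_match("Brain", "Head", background))
# -> ('Brain', 'narrower-than', 'Head')
```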
As suggested by the example above, of particular interest in our approach
is exploiting the structure of the background knowledge ontology. It is done in
the deriving relations step, when checking for relatedness between the anchored
concepts in the background knowledge ontology. Before moving to the experi-
mental part of the work, we will first introduce the formal definitions of all the
components in this framework, which we will later use in the experimental part.
2.1 Formal framework
A concept is a class of things grouped together due to some shared property.
It is named with a label, and sometimes with additional alternative names
(synonyms). Besides the name(s), the meaning of a concept is determined by its
semantic neighborhood, that is, how it is related to the other concepts in the
ontology. We will refer to concepts with capital italic letters X, Y, ..., with
X_ONT for a concept from a specific ontology, and we will also use the
concept's label (in Sans Serif font), like Temporal lobe, or ONT: Temporal
lobe, for the concept from a particular ontology.
A relation instance (also called simply a relation) is a triple (X relation Y),
where X and Y are concepts and relation ∈ T is a relation type, T being the
universal set of all relation types. The relation instance (X relation Y) is
interpreted as: the concept X is related through the relation type relation to
the concept Y. When clear from the context, we will call relation instances
simply relations.
An ontology is a pair of sets ONT(C, R), where C is a set of concepts and R is
a set of relations among these concepts. We will refer to ontologies by their
full name, like Foundational Model of Anatomy (or the name in italic,
Foundational Model of Anatomy), or by a short form of the name in Sans Serif
font, like ONT for an arbitrary ontology, or FMA for the particular ontology
Foundational Model of Anatomy.
An ontology match between two ontologies S and T is a set of relation
instances:

    M ⊆ C_S × T × C_T    (1)

Each element (X r Y) of this set, with X ∈ C_S, r ∈ T, Y ∈ C_T, we call a
match between X and Y; or, X is matched to Y through the relation type r. We
will write it as X −r→ Y, or X → Y when the relation type of the match is
known from the context.
An ontology match is the result of any ontology matching technique. In
practice, it plays the role of a bridge between different ontologies. Two specific
ontology matches are of particular interest to our approach. They correspond to
the two phases of the matching - anchoring and deriving relations.
2.2 Evaluation
To characterize the degree of success for matching we adopt two notions from
the information retrieval field: precision and recall. In Information Retrieval (IR)
the precision and recall are measures on performance of document retrieval [12].
They rely on a collection of documents and a query for which the relevancy of the
document is known, assuming binary relevancy: a document is either relevant
or non-relevant. In the ontology matching we define these measures through two
sets - desired matches, and matches found by a matching method.
Precision is the proportion of matches that are both desired and found, to
all the found matches:

    Precision = |Desired ∩ Found| / |Found|    (2)
Recall is the proportion of matches that are both desired and found, to all
the desired matches:

    Recall = |Desired ∩ Found| / |Desired|    (3)
The precision represents the quality or the preciseness of the matches - what por-
tion of the found matches are correct, and the recall represents the completeness
of the matches - how many of the matches we want to find were actually found.
The precision and recall have values between 0 and 1 inclusive. In practice they
are often expressed in terms of percentage, ranging from 0% to 100%.
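Definitions (2) and (3) translate directly into code when matches are represented as sets of relation triples; the example match sets below are hypothetical.

```python
def precision(desired, found):
    """Precision as defined in (2): the correct share of the found matches."""
    return len(desired & found) / len(found) if found else 0.0

def recall(desired, found):
    """Recall as defined in (3): the found share of the desired matches."""
    return len(desired & found) / len(desired) if desired else 0.0

# Hypothetical match sets as triples (X, relation type, Y):
desired = {("Brain", "narrower-than", "Head"), ("Cattle", "equiv", "Cows")}
found = {("Brain", "narrower-than", "Head"), ("Wheat", "equiv", "Barley")}
# precision = 1/2, recall = 1/2
```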
3 Case study
In our case study we matched two ontologies from the agricultural domain us-
ing six other ontologies as background knowledge. Motivated by the variety of
ontologies that exist online, we decided to use background knowledge ontologies
with varying origin. We investigated three different types of ontologies: different
but related domain ontologies, general knowledge ontologies, and ontologies of
an unknown origin. We set simple direct matching as a baseline to evaluate the
matching performance, and we analyzed the matching performance by observing
the precision and recall. All the test data was extracted in March 2007.
Matching ontologies The source ontology was NALT5 and the target Agrovoc6.
They both describe the domain of life sciences and agriculture. Agrovoc, as
stated in the description provided on its homepage7, "is a multilingual,
structured and controlled vocabulary designed to cover the terminology of all
subject fields in agriculture, forestry, fisheries, food and related domains
(e.g. environment)." NALT, as described on its homepage8: "The NALT is
primarily used for indexing and for improving retrieval of agricultural
information. Currently, the NALT is the indexing vocabulary for NAL's
bibliographic database of citations to agricultural resources, AGRICOLA9. The
Food Safety Research Information Office10 (FSRIO) and Agricultural Network
Information Center11 (AgNIC) also use the NALT as the indexing vocabulary for
their information systems. In addition, the NALT is used as an aid for
locating information on the ARS12 and AgNIC web sites." In the experiments we
used the OAEI 2006 versions13, which are publicly available. NALT contains
41,577 concepts and Agrovoc 28,174. Besides their labels, many of the
concepts are additionally described with synonyms.
5 http://agclass.nal.usda.gov/agt
6 http://www.fao.org/agrovoc
7 http://www.fao.org/aims/ag_intro.htm
8 http://agclass.nal.usda.gov/about.shtml
9 http://agricola.nal.usda.gov/
10 http://www.nal.usda.gov/foodsafety/
11 http://www.agnic.org/
12 http://www.ars.usda.gov/
13 Published on http://www.few.vu.nl/~wrvhage/oaei2006/
Background knowledge We selected the background knowledge ontologies to
faithfully represent the types of background knowledge we set out to
investigate. We used the Watson14 ontology search engine to find them. We
queried Watson for concept labels from the matched ontologies that are common
English terms, like meat, animal, food, etc., and selected six ontologies that
frequently occurred in the retrieved results and also seemed like reasonable
choices for the goal we set out to analyze, that is, exploring the different
background knowledge types. Note that there is nothing special about this
choice of search engine; in other studies different search engines have been
successfully used for the same purpose, e.g. [9] used Swoogle to dynamically
select background ontologies for an ontology matching task.
The selected six ontologies were the following: Economy, which models a
domain different from but related to that of the matched ontologies;
Mid-level, Sumo and Tap, which are general knowledge ontologies; and A.com and
Surrey, which are ontologies of unknown origin.
– Background knowledge 1: The Economy ontology, as described at www.daml.org,
"is based on CIA World Fact Book (2002). Some industry concepts are based
on the North American Classification System ('NAICS') - online at
http://www.census.gov/rpcd/www/naics.html." As its name indicates, it
intends to formally describe the domain of economy. It was engineered by
Teknowledge Corporation15 and submitted to the collection of ontologies
gathered at www.daml.org. Its size is 323 concepts.
– Background knowledge 2: Mid-level is constructed to play the role of a
bridge between the abstract-level Sumo ontology and the different varieties
of Sumo domain-specific ontologies16. It is not domain-specific, and
contains 1773 concepts.
– Background knowledge 3 : Sumo (Suggested Upper Merged Ontology) is being
created as part of the IEEE Standard Upper Ontology Working Group. It
contains 576 concepts.
– Background knowledge 4 : Tap as described in [13] is a shallow but broad
knowledge base containing basic lexical and taxonomic information about a
wide range of popular objects. It is claimed to be independent of a domain,
however, a manual inspection indicated that it mainly covers the chemical,
machine and electronic industry domains. It contains 5488 concepts.
– Background knowledge 5: A.com is an ontology of unknown origin. Browsing it
gave us the impression that it was produced by merging several ontologies.
In addition, it contains surprising relations, such as one between the
concepts Volume and Pollution, which can be seen as an indication that some
form of directory structure was the origin of the data. It seems to cover
various domains, and its size is 5624 concepts.
14 http://watson.kmi.open.ac.uk/WatsonWUI/
15 http://www.teknowledge.com/
16 http://ontology.teknowledge.com/
– Background knowledge 6: The Surrey ontology, according to the Watson search
engine, originates from the web site www.surrey.co.uk. We did not manage to
trace back its source: the download link does not work, and the ontology is
not available on the web site. As in the previous case, parts of its content
gave the impression that it was created by transforming a directory
structure into an ontology in a straightforward way. Having no documentation
about how it was created, we treated it as an ontology of unknown origin.
Its size is 672 concepts.
Background knowledge ontology   Type of ontology    Size in number of concepts
BK1: Economy                    Different domain    323
BK2: Mid-level                  General knowledge   1773
BK3: Sumo                       General knowledge   576
BK4: Tap                        General knowledge   5488
BK5: A.com                      Unknown origin      5624
BK6: Surrey                     Unknown origin      672
Fig. 3. Properties of the background knowledge ontologies
The six background knowledge ontologies and their properties are summarized
in the table in Figure 3. With respect to the common ontology sizes found
online [11], they are large ontologies.
Evaluation We manually evaluated the results of the matching experiments. As
a reference use-case we set the task of document reclassification, which is realistic
in this context because the matching ontologies are used for classifying books
and articles.
3.1 Experiments
We performed seven experiments in which we matched NALT to Agrovoc. In
the first experiment, which served as baseline, we matched the ontologies di-
rectly, and in the other six we matched them indirectly using the six previously
described background ontologies, one per experiment.
Direct matching (Experiment 1) In the direct matching we combined lexical and
structural matching. In the lexical phase, the labels were normalized by
discarding stop words (the, and, an, a) and punctuation, and then matched to
one another, accounting for different word order and plural/singular forms of
the words. As a result, the lexical phase produced a list of pairs of
equivalent concepts. In the structural phase, the hierarchical structure of
the ontologies was used to induce further matches. The direct matching
algorithm is shown in Figure 4.
The set of direct matches is empty in the beginning
1 dmatches := ∅
Lexical phase: find equivalent lexical matches
2 for every concept pair X ∈ C_SRC, Y ∈ C_TAR do
3   if FullLexMatch(X, Y) then
4     dmatches ⇐ (X ≡ Y)
5 end for
Structural phase: use the structure to find more matches
6 for every two relations (X1 → X2) ∈ R_SRC, (Y1 → Y2) ∈ R_TAR
7   if (X2 ≡ Y1) ∈ dmatches then
8     dmatches ⇐ (X1 → Y2)
9 for every two relations (X1 → X2) ∈ R_SRC, (Y1 → Y2) ∈ R_TAR
10  if (X1 ≡ Y2) ∈ dmatches then
11    dmatches ⇐ (Y1 → X2)
Fig. 4. Algorithm for matching ontologies directly.
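The algorithm of Figure 4 can be sketched in Python as follows. The normalisation shown is deliberately minimal (stop-word removal, naive plural stripping, word-order insensitivity via sorting), and the relation labels and function names are our own; the actual system may normalise and derive relations differently.

```python
STOP = {"the", "and", "an", "a"}

def normalise(label):
    """Minimal label normalisation: drop stop words, strip a trailing 's'
    as a naive plural heuristic, and sort words so order is irrelevant."""
    words = [w.rstrip("s") for w in label.lower().replace(",", " ").split()
             if w not in STOP]
    return " ".join(sorted(words))

def match_directly(src_concepts, tar_concepts, src_rels, tar_rels):
    """Lexical phase, then one structural pass over the hierarchies.
    Relations are (child, parent) pairs; matches are labelled triples."""
    # Lexical phase: equivalences between identically normalised labels.
    dmatches = {(x, "equiv", y)
                for x in src_concepts for y in tar_concepts
                if normalise(x) == normalise(y)}
    # Structural phase: X1 below X2, X2 ≡ Y1, Y1 below Y2  =>  X1 below Y2.
    for (x1, x2) in src_rels:
        for (y1, y2) in tar_rels:
            if (x2, "equiv", y1) in dmatches:
                dmatches.add((x1, "narrower-than", y2))
    return dmatches
```

For example, with the hypothetical concepts "Cows"/"Cow" and the hierarchy edges ("Dairy cows", "Cows") and ("Cow", "Farm animals"), the lexical phase equates Cows and Cow, and the structural phase then derives that Dairy cows is narrower than Farm animals.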
Even though the direct matching was done using such a simple and rigid
technique (no edit distance or other form of approximation), it produced 6,437
matches between NALT and Agrovoc. This number is comparable to the numbers
obtained in the OAEI 2006 [5] on the same test data, where most of the
participating matching systems produced between 5,000 and 10,000 matches.
Hence, our direct matching can be considered relatively successful, and a good
baseline against which to measure the added value of the background knowledge.
Indirect matching (Experiments 2-7) In the indirect matching we lexically
anchored the matched ontologies to the background knowledge, and then used the
hierarchies of the background knowledge to induce the indirect matches. In
other words, for each pair of concepts to be matched we first find their
equivalent concepts in the background knowledge (if possible), then check
whether these background concepts are hierarchically related, and if they are,
we report an indirect match between the two concepts. The indirect matching
algorithm is shown in Figure 5.
The table in Figure 6 summarizes the results of the anchoring phase, showing
the number of source and target anchors (NALT and Agrovoc, respectively)
established to each background knowledge ontology. The Economy ontology has
the highest number of anchors relative to its size: roughly one third of its
concepts have anchors established from the matched ontologies. By contrast,
A.com has far fewer anchors relative to its size: roughly one in 90 of its
concepts has an anchor established to it. We can also observe from the table
that this ratio varies between background ontologies of the same origin.
Generally, the number of anchors is much smaller than the sizes of the matched
ontologies NALT and Agrovoc, which count in the tens of thousands. However,
The set of indirect matches is empty in the beginning
1 imatches := ∅
Anchoring phase: anchor SRC and TAR to BK using direct matching
2 anchS→B := MatchDirectly(SRC, BK)
3 anchT→B := MatchDirectly(TAR, BK)
Deriving relations phase: find indirect matches using the anchors and BK
4 for every two anchors (X ↦ Z1) ∈ anchS→B, (Y ↦ Z2) ∈ anchT→B
5   if (Z1 ⊑ Z2) then
6     imatches ⇐ (X → Y)
7 for every two anchors (X ↦ Z1) ∈ anchS→B, (Y ↦ Z2) ∈ anchT→B
8   if (Z1 ⊒ Z2) then
9     imatches ⇐ (X → Y)
Fig. 5. Algorithm for matching SRC to TAR indirectly through BK as background knowledge.
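The indirect matching procedure described above can be sketched in Python; the hierarchical test used here (transitive subsumption over a toy child-to-parent map) and all concept names are illustrative assumptions:

```python
def match_indirectly(anchors_src, anchors_tar, broader_than):
    """Derive indirect matches: (X, Y) whenever X and Y are anchored to
    background concepts Z1, Z2 that are hierarchically related.
    broader_than(z1, z2) tests the hierarchy relation in BK."""
    imatches = set()
    for x, z1 in anchors_src:
        for y, z2 in anchors_tar:
            # Report a match for either direction of the hierarchy relation.
            if broader_than(z1, z2) or broader_than(z2, z1):
                imatches.add((x, y))
    return imatches

# Toy background hierarchy: child -> parent (hypothetical labels).
parent = {"Meat": "Food", "Fish": "Food"}

def broader_than(z1, z2):
    # True if z1 is an ancestor of z2 in the toy hierarchy.
    while z2 in parent:
        z2 = parent[z2]
        if z1 == z2:
            return True
    return False

anchors_src = {("nalt:Meat", "Meat")}
anchors_tar = {("agrovoc:Food", "Food")}
print(match_indirectly(anchors_src, anchors_tar, broader_than))
```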
Background knowledge BKi size Source anchors Target anchors
BK1 : Economy 323 121 106
BK2 : MidLevel 1773 330 271
BK3 : Sumo 576 79 72
BK4 : Tap 5488 367 227
BK5 : ACom 5624 66 69
BK6 : Surrey 672 102 95
Fig. 6. Overview of the anchoring results.
given the sizes of the background ontologies and the fact that they are not
agriculture-specific, this anchoring result is not surprising.
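The anchors-to-size ratios discussed in the text can be recomputed from the Figure 6 table; averaging the source and target anchor counts is one plausible reading of the ratio the text describes:

```python
# Figures taken from the table in Figure 6: (size, source anchors, target anchors).
anchoring = {
    "Economy": (323, 121, 106),
    "MidLevel": (1773, 330, 271),
    "Sumo": (576, 79, 72),
    "Tap": (5488, 367, 227),
    "ACom": (5624, 66, 69),
    "Surrey": (672, 102, 95),
}
for name, (size, src, tar) in anchoring.items():
    # Average of source and target anchors relative to the ontology size.
    ratio = (src + tar) / 2 / size
    print(f"{name}: one anchor per {1 / ratio:.0f} concepts")
```

Economy comes out at roughly one anchor per three concepts and ACom at roughly one per 85, in line with the "one third" and "one out of each 90" figures in the text.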
The table in Figure 7 gives an overview of the indirect matching results. The third and fourth columns show the number of indirect matches and the number of additional indirect matches which were not found in the baseline direct matching.
Each row in the table corresponds to one background knowledge ontology, except
for the last one, which shows the cumulative number of matches (their union). Note that these cumulative numbers are not simple sums of the numbers above them: for example, for the indirect matches the sum is 2287 while the cumulative number of matches is 2183. They differ because some of the matches are found
by more than one background knowledge ontology. Similarly, the sum of the
additional indirect matches is 1462 whereas the cumulative number is 1428. We
see that the sum and the cumulative number are close to one another, which reveals a very important and attractive property of using multiple background knowledge ontologies: different ontologies produce nearly disjoint sets of indirect matches. This means that the more ontologies we use, the more matches we will find. If we look at the cumulative matches, the additional indirect
Background BKi size Indirect Additional matches
ontology matches on top of direct matches
BK1 : Economy 323 259 85
BK2 : MidLevel 1773 200 81
BK3 : Sumo 576 115 57
BK4 : Tap 5488 1003 625
BK5 : ACom 5624 87 71
BK6 : Surrey 672 623 543
Cumulatively all BKi 2183 1428
Fig. 7. Overview of the indirect matching results: the number of matches established using each background ontology.
matches represent 66% of all the indirect matches, which in turn means that an arbitrary indirect match is more likely than not to be an addition to the baseline matches. However, these numbers say nothing about the quality of the matches; as a next step we evaluate their correctness.
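The near-disjointness argument amounts to comparing the plain sum of per-ontology match counts with the size of their union; a toy illustration with hypothetical match sets:

```python
# Hypothetical match sets produced by three background ontologies.
bk1 = {("x1", "y1"), ("x2", "y2")}
bk2 = {("x3", "y3"), ("x2", "y2")}  # one match overlaps with bk1
bk3 = {("x4", "y4")}

total = sum(len(s) for s in (bk1, bk2, bk3))  # plain sum of counts
cumulative = len(bk1 | bk2 | bk3)             # size of the union
print(total, cumulative)                      # 5 4
# In the paper the sum is 2287 and the union 2183: the match sets from
# different background ontologies are nearly disjoint.
```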
3.2 Evaluation
In order to get better insight into the matching process we decided to undertake the effort of manually assessing the matches. As a natural reference we chose the task of document reclassification: the obtained matches are expected to faithfully reclassify the documents from the source to the target ontology, ideally in the same way as a human would.
For the precision we evaluated as follows: each match was checked for validity; if its correctness was not obvious, Google was used as a reference by querying for define: label to find the definition of the term label. The precision evaluation proceeded in two phases: first the direct and then the indirect matches. For the direct matching, which produced more than 6,000 matches, we chose random sampling. After drawing a random sample of 10% (640 matches), we manually assessed these matches as described above. For the indirect matches, which numbered a little more than 2,000 in total, we took the effort to manually assess all of them.
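Drawing the 10% evaluation sample can be sketched as follows; the paper does not specify the sampling implementation, so the seeding and helper are assumptions:

```python
import random

def draw_sample(matches, fraction=0.10, seed=42):
    """Draw a reproducible random sample of the given fraction of matches."""
    rng = random.Random(seed)
    k = round(len(matches) * fraction)
    # Sort first so the sample is deterministic for a given seed.
    return rng.sample(sorted(matches), k)

# Hypothetical pool standing in for the ~6,400 direct matches.
matches = [("src%d" % i, "tar%d" % i) for i in range(6400)]
sample = draw_sample(matches)
print(len(sample))  # 640, the sample size used in the paper
```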
The recall was hard to estimate because it requires the complete set of correct matches between the matching ontologies, which we do not have. Therefore we set out to observe the change in recall between different experiments instead of estimating the recall achieved.
The evaluation revealed that the direct matching achieved 100% precision,
i.e. all the matches in the evaluation sample were correct. The precision of the
indirect matching and the change in recall are shown in the table on Figure 8.
Matching experiment Precision Precision ∆Recall
indir. matches addit. matches
Exp.2: BK1 : Economy 84.17% 51.76% 0.68%
Exp.3: BK2 : Mid-level 97.00% 92.59% 1.17%
Exp.4: BK3 : Sumo 76.52% 52.63% 0.47%
Exp.5: BK4 : Tap 57.23% 31.36% 3.04%
Exp.6: BK5 : A.Com 36.78% 22.54% 0.25%
Exp.7: BK6 : Surrey 35.63% 26.15% 2.21%
Cumulatively BK1 -BK6 57.63% 35.22% 7.81%
Fig. 8. Performance of the indirect matching experiments
4 Analysis
A first general observation on the matches (all the matches from all seven experiments) is that they were established between only a small subset of the matching concepts: 2,241 concepts in NALT and 1,757 in Agrovoc participated in the matches, compared to the full size of NALT (41,577 concepts) and of Agrovoc (28,174 concepts). The number of participating concepts was thus on the order of 5% of the size of the matching ontologies. This effect is not a peculiarity of our experimental data: in other studies [14, 15] a similar effect was noticed when matching the FMA and GALEN ontologies, which model human anatomy. These ontologies have 59,000 and 24,000 concepts respectively, and the number of matched concepts reported in those studies is on the order of 10% of the ontology sizes. It seems that this effect occurs when matching large ontologies even when they model the same domain. The most likely explanation is that for general concepts there is much better naming agreement, while for the more specific ones, which represent the majority, there is almost no agreement. In such cases the labeling problem is solved by using many words to name a single concept; as an example, NALT contains a concept named Salmonella choleraesuis subsp. choleraesuis serovar Paratyphi A.
Precision and recall The table in Figure 8 shows the precision of the indirect matches and the increase in recall with respect to the baseline direct matching. Each row corresponds to one background ontology, except for the last, which shows the results for the cumulative use of all the background ontologies together.
All the indirect matches which were also found in the baseline were correct; incorrect matches only appeared among those not found in the baseline matching. Hence, the precision of the additional indirect matches is lower than the precision of the indirect matches.
The Tap ontology resulted in 57.23% precision; however, a special situation reduced the precision of this ontology. Many of the matches were wrongly established to the target concept called Node: the root concept in Tap is called Node, and the target concept anchored to it was found to be related to every source concept anchored in Tap. When these wrong matches are not taken into account,
the precision of Tap rises to 92.13%. This example gave a very important insight: indirect matching can be very sensitive to mistakes high in the background knowledge hierarchy. The fact that the root concept of Tap was named Node drastically changed the results when we used it as background knowledge.
The first four background ontologies, which are expert-created, exhibit high precision in the indirect matches (more than 75%) and relatively high precision in the additional indirect matches (more than 50%). On the other hand, the unknown-origin ontologies show lower precision, which is not surprising given the low quality of their content.
Observing the recall (third column), we see that Tap provides the highest increase, with the Surrey ontology second. So while ontologies of unknown origin may show low precision, that does not prevent them from increasing the recall considerably. We also see that Surrey is much smaller than MidLevel, Tap and ACom, empirical evidence that small size does not automatically imply a low recall contribution.
Causes of wrong matches For the first four background knowledge ontologies
there were two main causes for wrong matches: contextual problems and rela-
tively small mistakes. Examples of matches caused by contextual problems are
the following:
NALT: Meat −→ Agrovoc: Product
NALT: Vehicle −→ Agrovoc: Product
NALT: Organism −→ Agrovoc: Agent
Meat can be seen as a kind of product in the domain of economy; however, for our matching task this was not a desirable match. These matches can be seen as relations establishing roles: meat and vehicles can have the role of a product, and an organism can have the role of an agent. Such modeling is apparently appropriate in the contexts of these background ontologies. For discussions of context issues in knowledge representation the reader is referred to the Cyc Knowledge Base [16] and the study of [17]17. In addition to the context problems, a few of the wrong matches were caused by relatively small mistakes, such as the matches:
NALT: Marine invertebrae −→ Agrovoc: Fish
NALT: Herbivore −→ Agrovoc: Mammals
Jellyfish are a kind of Marine invertebrae but they are not fish, and some kinds of birds are herbivores but not mammals. These relations come close to generally accepted claims like "birds fly" to which exceptions exist: "penguins are birds, and
17 The study argues that knowledge representation issues and the functionality of the system are intrinsically tied to one another; to quote: "Representation and reasoning are inextricably intertwined: we cannot talk about one without also, unavoidably, discussing the other. We argue as well that the attempt to deal with representation as knowledge content alone leads to an incomplete conception of the task of building an intelligent reasoner."
yet they do not fly". We stress that the causes of wrong matches did not differ between Economy and the three general-knowledge ontologies: for the high-quality ontologies, whether they model a different domain or general knowledge, the same reasons produced wrong matches when they were applied as background knowledge.
For the last two ontologies, which are of unknown origin, mistakes were the cause of the wrong matches. For example:
NALT: Gas −→ Agrovoc: Turbines
NALT: Waste −→ Agrovoc: Water
The concepts in these wrong matches are semantically related; however, no strict relation can be established between them, so the matches are clearly wrong. This suggests that ACom and Surrey were obtained by a straightforward transformation of a directory structure into an ontology.
5 Conclusions
Based on the work presented in this paper, we conclude that using multiple ontologies as background knowledge in ontology matching is useful and practically feasible. Our experiments indicated the key factors that influence the matching performance. The recall increases monotonically as more background ontologies are added. This is an important property because increasing recall is seen as the bigger challenge for current matching systems. For the precision, the success primarily depends on the quality of the background ontologies.
Observing the precision, the expert-created ontologies such as Economy, Mid-
level, Sumo and Tap resulted in relatively high precision (more than 75%), and
the main causes of wrong matches were contextual differences with the matching ontologies and small mistakes. The ontologies of unknown origin, like ACom and Surrey, resulted in lower precision (less than 40%), and the main cause of wrong matches was mistakes. This makes the expert-created ontologies more trustworthy and clearly preferable background knowledge candidates over the unknown-origin ontologies with respect to precision.
All the background ontologies together provided a relatively small increase in recall of about 8% on top of the direct matching. However, they resulted in nearly disjoint sets of matches, which means that new ontologies are likely to provide new additional matches and further increase the recall.
Furthermore, the expert-created ontologies resulted in similar matching quality regardless of whether they modeled a domain different from the matching ontologies (Economy) or general knowledge (Mid-level, Sumo and Tap). On our experimental data we could not discriminate between them by precision or recall increase, and all of them had the same causes of wrong matches.
Finally, the Tap ontology showed that the matching process can be very sen-
sitive to mistakes high in the background knowledge hierarchy. Other mistakes
also resulted in wrong matches, but the mistake in Tap with the root concept be-
ing labeled Node seriously affected the precision when applying this background
ontology.
References
1. Rahm, E., Bernstein, P.A.: A Survey of Approaches to Automatic Schema Match-
ing. VLDB Journal 10(4) (2001)
2. Shvaiko, P., Euzenat, J.: A survey of schema-based matching approaches. Journal
on Data Semantics 4 (2005) 146–171
3. Kalfoglou, Y., Schorlemmer, M.: Ontology mapping: the state of the art. Knowl.
Eng. Rev. 18(1) (2003) 1–31
4. Euzenat, J., Isaac, A., Meilicke, C., Shvaiko, P., Stuckenschmidt, H., Svab, O.,
Svatek, V., van Hage, W.R., Yatskevich, M.: First results of the ontology align-
ment evaluation initiative 2007. In: Ontology Matching Workshop at International
Semantic Web Conference (ISWC). (2007)
5. Euzenat, J., Mochol, M., Shvaiko, P., Stuckenschmidt, H., Svab, O., Svatek, V.,
van Hage, W.R., Yatskevich, M.: Results of the ontology alignment evaluation
initiative 2006. In: Ontology Matching Workshop at International Semantic Web
Conference (ISWC). (2006)
6. Euzenat, J., Stuckenschmidt, H., Yatskevich, M.: Introduction to the ontology
alignment evaluation 2005. In: Integrating Ontologies. (2005)
7. Euzenat, J., Le Bach, T., Bouquet, P., De Bo, J., Dieng-Kuntz, R., Ehrig, M., Hauswirth, M., Jarrar, M., Lara, R., Maynard, D., Napoli, A., Stamou, G., Stuckenschmidt, H., Shvaiko, P., Tessaris, S., Van Acker, S., Zaihrayeu, I.: State of the art on ontology alignment. Deliverable D2.2.3 (2004)
8. Aleksovski, Z., Klein, M., ten Kate, W., van Harmelen, F.: Matching unstruc-
tured vocabularies using a background ontology. In: Proceedings of Knowledge
Engineering and Knowledge Management (EKAW). (2006) 182–197
9. Sabou, M., d'Aquin, M., Motta, E.: Using the semantic web as background knowledge for ontology mapping. In: Proceedings of the Ontology Matching Workshop. (2006)
10. Bouquet, P., Serafini, L., Zanobini, S.: Semantic coordination: A new approach
and an application. In: International Semantic Web Conference (ISWC). (2003)
130–145
11. Wang, T.D., Parsia, B., Hendler, J.A.: A survey of the web ontology landscape. In
Cruz, I.F., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M.,
Aroyo, L., eds.: International Semantic Web Conference. Volume 4273 of Lecture
Notes in Computer Science., Springer (2006) 682–694
12. Baeza-Yates, R., Ribeiro-Neto, B., et al.: Modern information retrieval. Addison-
Wesley Harlow, England (1999)
13. Guha, R., McCool, R.: Tap: A semantic web toolkit. Semantic Web Journal 1(1)
(2003) 81–87
14. Zhang, S., Bodenreider, O.: Aligning representations of anatomy using lexical and
structural methods. AMIA Symposium (2003) 753–757
15. Mork, P., Pottinger, R.A., Bernstein, P.A.: Challenges in precisely aligning models
of human anatomy using generic schema matching. Technical report, Microsoft
Research (2004)
16. Lenat, D.: Cyc: a large-scale investment in knowledge infrastructure. Communi-
cations of the ACM 38(11) (1995) 33–38
17. Davis, R., Shrobe, H., Szolovits, P.: What is a knowledge representation? AI
Magazine 14(1) (1993) 17–33
Flickring Our World: An Approach for a Graph Based
Exploration of the Flickr Community
João Tojo, Jorge Sousa and Paulo Gomes
{tojo, pelaio}@student.dei.uc.pt
pgomes@dei.uc.pt
CISUC – DEI/FCTUC University of Coimbra
Polo II – Coimbra – Portugal
Abstract. When using the Web, one begins to understand and acknowledge the
existence of the "other side of the mirror". The Internet has become a medium of inter-relation between its users, usually forming complex and emergent relationships and providing a wealth of knowledge waiting to be extracted. In this
paper, we present a tool that analyzes the Flickr community using semantic web
technologies. It uses the FOAF ontology to present the result of the analysis as
graphs based on the relations between the users of this community.
Keywords: Social Networks, Flickr, FOAF, Semantic Connections, Web 2.0,
Semantic Web.
1 Introduction
The Web, as we know it today, is made of, and by, everything. By everything we mean everything: plain text, pictures, movies, and so on. However, one thing that is missing is meaning. Each of these things has the meaning that we provide to it, but none has an underlying meaning recognizable by machines, which is necessary in order to compute and reason about the single most powerful media mechanism known to man. The Semantic Web [1] has this role: to provide a means to compute web resources, giving them the semantics and the structures needed to use them in an intelligent fashion.
One of the main focuses of this fairly new field is its contribution to the study of the social sciences [2]. The communities that originate from the main web concentration points are very rich in semantic information, as their members organize themselves and their interests by using a tagging mechanism [24]. These virtual societies, by using tags, provide an adaptive taxonomy, the folksonomy [3], that allows the emergence of social trends to be observed by analyzing the movement and creation of clouds of information. These clouds, composed of terms created exclusively by users, provide a way to identify trends, relations and future directions of people, businesses, products, and so on, and thus a way of analyzing the hidden features and relations of the web. Our work focuses on developing a tool that allows the analysis of such phenomena in the Flickr environment.
Using Flickr [4], the famous photo blogging site, as our base of information, we will make use of the API [5] provided by the site to process users, the pictures they upload and, most of all, the tags they associate with each picture. This information will be structured using a specific ontology [6] named FOAF [7, 8, 9] (friend-of-a-friend), used mostly in the web analysis of societies, with a focus on the connections between users. We will treat this information as RDF graphs [25] and make a simplified approach to the so-called second-generation clouds [10, 11], presenting visual, interactive graphs representing the semantic landscape created by the users of the community.
In the present paper, we begin by introducing Flickr and its community, discussing the concepts behind the creation and usage of the site. We then give a brief introduction to social network analysis, reviewing the history of the field and its area of work, and presenting some key concepts behind the social analysis of a network. The following section introduces our approach to the development of the tool, discussing its main structure, followed by an explanation of the algorithms used during graph computation and creation. We then describe some of the experiments performed, including their results and observations, and address some future work related to this project. In the end, we draw some conclusions from the work performed and the results obtained.
2 The Flickr World
Developed by Ludicorp [12, 13], the Flickr project was an evolution of a project
named Game Neverending [12, 13], a massive multiplayer online game developed in
order to provide a platform for an enhanced real time online interaction, in the form of
a role playing game based on social interaction. As the project was cancelled, its ideas and tools were used to build the Flickr website. Nowadays owned by Yahoo, Ludicorp developed what is now called one of the first Web 2.0 applications [12] and, certainly, one of the most famous.
The Flickr website provides a way for people not only to share their photos with the other users of the site, but also to collaborate in the definition of the meaning of these photos, making use of the tagging phenomenon. As such, one can use Flickr to search for photos depicting a certain color, place or any other characteristic by searching these tags. This tagging is one of the best examples of the use of folksonomies (although there are opinions opposing this view in the Flickr context [14, 15]), where the users define, evolve and adapt the taxonomy used for a given context, instead of using a predetermined, fixed and static set of terms.
Although the Flickr website supports the identification of the users' (direct) relatedness, either through their presence in each other's contact lists or through belonging to the same group, the tagging mechanism may allow for a higher-level connectivity assessment. Given the growing relevance of the Flickr website, and the equally growing diversity and number of people participating both in sharing photos and in labeling them, we see the Flickr community as a good source for social network analysis.
3 Social Network Theory and the Web
Social science [18] is a thoroughly studied field concerned with what humans do as a species. Using scientific methods grounded in psychology, sociology, and so on, the social sciences aim at studying and understanding human behavior [18]. However, while the traditional view of the social sciences focuses on the rational choices made by individuals, it does not regard the interrelations between those actors [16].
As such, a new field emerged to address this issue, named social network theory (or analysis). This field focuses mainly on the social context of the actors and the behavior of their relationships [16], identifying the underlying patterns in these relationships. There is actually a debate [16] over whether social network analysis should be an independent, self-contained field or, on the other hand, a subset of social studies: a collection of methods to be used in the study of social, human relations (centered on the individual). However, the complexity of the structures of these social networks suggests that it is, indeed, a whole new science [19] which, although it has its roots and some of its foundations in the traditional social sciences, tries to form its own theories about a complex concept.
Social network analysis (SNA) is based on the concept that there are determinable
structures behind the formation of the network of agents and their relationships [17].
This concept has originated many theories about people and how they organize themselves in networks, the most famous being the "six degrees of separation" theory [20], with roots in the small-world models [21].
The social networks field had three main influences [16]: sociometric analysis (graph-theoretic models), mathematical analysis and the anthropological view of the structure of communities. During the 1960s and 1970s, the convergence of these influences at Harvard created the general field known as SNA.
SNA considers three main points of investigation [16]:
1) The total structure;
2) The subsets of a given group;
3) The individuals (as "points", "vertices" or "nodes") that form the network.
In order to address these points, SNA also defines a set of concepts for the study of networks [16]:
• Dyad: Two actors who have a connection, meaning they have a relationship;
• Clique: A subset of actors within a network who have ties with all other actors in the subset;
• Density: The proportion of total available ties connecting actors;
• Centralization: The fraction of main actors within a network;
• Reachability: The number of ties connecting actors;
• Connectedness: The ability of the actors to reach one another reciprocally, that is, the ability to choose a relationship between both parties;
• Asymmetry: The ratio of reciprocal relationships (the mutual relationships) to total relationships within the network;
• Balance: The extent to which ties in the network are direct and reciprocated;
Regarding the study of individuals in the network, there are a few more concepts to
consider [16]:
• Centrality: The degree to which an actor is in a central role in the network;
• Homophily: The degree to which similar actors in similar roles share
information;
• Isolate: An actor with no ties to other actors;
• Gatekeeper: An actor who connects the network to outside influences;
• Cutpoint: An actor whose removal results in unconnected paths in the
network;
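Some of these measures can be made concrete on a toy network; a minimal sketch with adjacency sets standing for the ties (all actor names are illustrative):

```python
# Undirected toy network as adjacency sets.
ties = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),  # an isolate: no ties to other actors
}

n = len(ties)
edges = sum(len(nb) for nb in ties.values()) // 2   # each tie counted twice
density = edges / (n * (n - 1) / 2)                 # proportion of possible ties present
isolates = [a for a, nb in ties.items() if not nb]
# Degree centrality: the share of other actors an actor is tied to.
centrality = {a: len(nb) / (n - 1) for a, nb in ties.items()}
print(density, isolates, centrality["c"])  # 0.4 ['e'] 0.75
```

Actor "c" is also a cutpoint in this toy network: removing it disconnects "d" from the rest.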
3.1 The Web and its Dual Relation with SNA
The Internet boom has provided a very rich and diverse arena for the study of social
networks. In fact, it was the development of the World Wide Web (WWW) that
provided for a wider application of the field, accompanied by further developments in theories, models and concepts. The presence of logs, blogs, forums and records of all kinds provides very useful data for SNA [17].
SNA has produced some interesting studies (and results) using traditional methods and targets of study. The Internet has taken this study and its scope one step further: the globalization of the population has provided a much wider sense of connection, covering everything from preferences in music, food and films to ideologies and ways of living. In a sense, we can say we are in the presence of the meta-level of social networks, where the location of the agents no longer matters, due to the world virtualization of the WWW [23].
If, at first, there was an underlying fear that Internet-based connections would prove ephemeral, the appearance of Internet communities showed that this relationship arena is strong, cohesive and robust, and, as such, ready and perfectly suited for the SNA field [22, 23]. The fact that online communities often change their paradigms (social trends shifting) provides yet another dimension for this social analysis, as the networks evolve in time (and, eventually, space). As such, SNA gains another characteristic to study (the evolution of online networks) and a playground in which to test and infer theories.
On the other hand, social network concepts have become so famous that, nowadays, we have companies and businesses based solely on them, namely networking sites such as Orkut, Flickr, MySpace, etc. [16].
4 Developing a Tool for Social Network Analysis in Flickr
In this section, we describe our work, presenting several key points of the project, such as the ontology used, the architecture built, and the algorithms and structures included for the generation of graphs. The goal is the development of a tool that allows for the modeling, parameterization and navigation of graph representations of the Flickr social networks.
4.1 Architecture
In order to build a tool for both constructing and analyzing the graphs representing the connections between the Flickr users, we first need to extract the data from the Flickr database. After this preliminary step, we need some way of constructing the models to which the algorithms are applied, and then a dynamic way of representing the graphs. Thus, the application consists mainly of three different tiers, as represented in figure 1.
Figure 1: Overall architecture of the tool
The first tier (Flickr Data Processing) is responsible for querying the Flickr API
service and storing this data in a local MySQL database. The algorithms implementing these queries are built using the Python language [31], and the focus is mostly on processing the XML data returned by the Flickr API upon a data request.
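As a sketch of the kind of processing this first tier performs, the following builds a Flickr REST request URL and extracts tags from an XML response of the shape the API returns; the method name follows the public Flickr API, the key is a placeholder, and no network call is made here:

```python
from urllib.parse import urlencode
from xml.etree import ElementTree

def build_request(method, **params):
    """Build a Flickr REST request URL (api_key is a placeholder)."""
    query = urlencode({"method": method, "api_key": "YOUR_KEY", **params})
    return "https://api.flickr.com/services/rest/?" + query

def extract_tags(xml_text):
    """Pull the tag texts out of a flickr.photos.getInfo-style response."""
    root = ElementTree.fromstring(xml_text)
    return [t.text for t in root.iter("tag")]

# A trimmed-down sample response of the shape the API returns.
sample = """<rsp stat="ok"><photo id="1"><tags>
  <tag id="t1">sunset</tag><tag id="t2">coimbra</tag>
</tags></photo></rsp>"""
print(build_request("flickr.photos.getInfo", photo_id="1"))
print(extract_tags(sample))  # ['sunset', 'coimbra']
```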
The second tier (Algorithms) is in charge of all data manipulation and algorithmic
functions. This tier uses the data retrieved from the first tier for the construction and
storing of RDF [25] graphs, which are then used for SPARQL [26] querying. These queries are driven by the data needs of the main algorithms, which are responsible for assessing the relations between the Flickr users. Related to this tier is the creation of a virtual matrix representing all the connections between the Flickr users (stored in the local database). This structure is used in the third tier.
This second tier is implemented using a Java application, including the JENA [27]
package in order to perform the RDF construction and querying.
The third tier (Visualization and Navigation) uses the previously mentioned virtual
matrix to perform the mapping of the matrix to a visual graph. Using the YWorks
Java package [28], we present the results derived from the implemented (and selected) algorithms as a graph of nodes representing the users. The layout is built to visually depict the connectivities between the Flickr users in the graph. This visualization is embedded in an interface allowing the user to define certain parameters of the presented graph. An example of a graph is shown in figure 2. This third tier comprises another visual element, allowing the algorithms to be selected, as previously mentioned. As the goal of this work is to build a tool for social graph analysis, we developed a Graphical User Interface (GUI), allowing the user both to choose from a set of algorithms and to insert the desired parameter values for the formulas used in assessing the relatedness of the Flickr users. This GUI is built using a Python graphics package, wxPython [29].
Figure 2: An example of a graph created with yWorks and a subset of Flickr data.
4.2 Ontology
In order to analyze the Flickr data, we need a base structure that supports asserting relationships between actors, which can be either persons or objects. Not only for that reason, but also on principles of good engineering, the use of an ontology is an obvious choice. For that we used the FOAF [7] (friend-of-
a-friend) ontology, because it provides not only defined structures for persons,
activities and properties, but also the means of performing simple and direct
connections between them and their related objects. This ontology enables the
identification of relationships within groups of people and relations between people
and resources (locations, films, photos), and people and their activities (blogging,
tagging). The FOAF ontology not only provides these direct assertions, but also lays the foundation for reasoning about higher-level relationships that are not so clear at first glance, such as the relation between actors based on the properties of their activities, or the objects they create.
It is with one of these higher level relationship identifications in mind (namely, the
tagging relationship) that we use the FOAF ontology. However, in order to use the
ontology within our project, we had to perform some modifications, for example,
including specific terms related to our work and concerning the Flickr activities and
properties.
As stated above, the use of FOAF ontology is considered to be important from a
conceptual point of view.
Despite that, the technological perspective should not be forgotten, as it provides the possibility of reasoning directly over the Flickr data and prepares it for future applications in terms of scalability. The ontology used in the present work is depicted in figure 3.
Figure 3: Part of the ontology used in the current work.
4.3 Algorithms for Asserting Connections on the Network Creation
We have used a formula to compute the proximity between users, which takes three
factors into consideration:
1) whether the users are contacts of each other;
2) whether the users are in the same group;
3) the similarity asserted from the tags related to each of the users.
The main formula is presented as formula 1.
con(u1, u2) = w1 * ContactS(u1, u2) + w2 * GroupS(u1, u2) + w3 * TagS(u1, u2) (1)
With: ContactS = similarity as contacts, GroupS = similarity as members of the
same groups, TagS = similarity by tags, and, preferentially, w1 + w2 + w3 = 1, in
order to normalize values.
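Formula 1 can be sketched in Python (the language the tool is implemented in). The function below is an illustrative sketch rather than the authors' code; it takes the three component similarities as precomputed values in [0, 1]:

```python
def connection_strength(contact_s, group_s, tag_s,
                        w1=0.2, w2=0.2, w3=0.6):
    """Formula 1: weighted combination of the three similarity factors.

    The default weights are the base parameters used in the experiments
    (0.2, 0.2, 0.6).  The weights should preferentially sum to 1 so that
    the result stays in [0, 1] when each factor is in [0, 1].
    """
    assert abs(w1 + w2 + w3 - 1.0) < 1e-9, "weights should sum to 1"
    return w1 * contact_s + w2 * group_s + w3 * tag_s
```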
4.3.1 Contact Similarity
The contact similarity is computed based on the users' contacts. However, this
relation may not be bidirectional, as a user can have a contact that does not
have that user as its own contact. We therefore expose this choice as a parameter
(p1), so that it can be chosen how to deal with this relation. If p1 is false,
only u1's side of the relation is considered; if p1 is true, both directions are
averaged. The following formulas, 2 and 3, represent, respectively, both these
cases.
p1 = false → contactS = u1has(u2) (2)
p1 = true → contactS = (u1has(u2) + u2has(u1)) / 2 (3)
With: u1has(u2) = 1 if u1 has u2 on its contact list and 0 otherwise, i.e.,
u1has(u2) ∈ {0, 1}.
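Formulas 2 and 3 can be sketched as follows. This is a hypothetical helper, assuming contact lists are available as sets of user ids:

```python
def contact_similarity(u1_contacts, u2_contacts, u1, u2, p1=True):
    """Formulas 2 and 3: contact similarity between users u1 and u2.

    u1_contacts / u2_contacts are sets of user ids.  When p1 is false,
    only u1's side of the relation is used (formula 2); when true, both
    directions are averaged (formula 3).
    """
    u1_has_u2 = 1 if u2 in u1_contacts else 0
    u2_has_u1 = 1 if u1 in u2_contacts else 0
    if not p1:
        return u1_has_u2                    # formula 2
    return (u1_has_u2 + u2_has_u1) / 2      # formula 3
```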
4.3.2 Group Similarity
As a user can belong to several different groups, we consider the relation
between two users to be stronger when the number of groups they share is higher.
The formula therefore takes into consideration not only the number of groups
shared by the two users, but also the total number of groups of the first user,
as shown in formula 4.
groupS(u1, u2) = Σ similarGroups(u1, u2) / Σ groups(u1) (4)
With: similarGroups(u1, u2) = number of groups shared by u1 and u2, and
groups(u1) = total number of groups of u1.
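A minimal sketch of formula 4, assuming group memberships are represented as sets; the zero-group guard is our addition, not stated in the paper:

```python
def group_similarity(u1_groups, u2_groups):
    """Formula 4: groups shared with u2, relative to u1's total groups."""
    if not u1_groups:
        return 0.0  # guard: a user with no groups has no group similarity
    return len(u1_groups & u2_groups) / len(u1_groups)
```

Note that, like formula 4 itself, this measure is asymmetric: it is normalized by u1's group count only.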
4.3.3 Tag Similarity
Tag similarity is the most complex relation between users, as it is the only one
that is not directly available when consulting the data. Several approaches could
be used to tackle this issue; we have chosen one that simplifies the relation by
comparing the tags associated with the photos of each user.
The algorithm runs over each user's list of photo tags, comparing the
values and counting the matching cases. We have also implemented a deeper step,
using the Flickr API function that returns the tags related to a specific tag.
Whenever a tag of one user does not match any tag of the other, we retrieve its
related tags (as "stated" by Flickr) and look for a match among them. As this is
not a direct relation, and since Flickr already orders its response by
relatedness, we assign a linearly decreasing weight to each tag on the list of
related tags.
Two formulas were implemented for the direct relation between tags and one for
the indirect relation. The weights of these two relations can be user-defined.
For the direct tag similarity, we have formulas 5 (Record Semantic Proximity
(RSP) [30]) and 6 (our own Empiric Semantic Proximity (ESP)).
tagS(u1, u2) = Σ similarTags(u1, u2) / (Σ tags(u1) + Σ tags(u2) − Σ similarTags(u1, u2)) (5)

tagS(u1, u2) = (Σ similarTags(u1, u2) / Σ tags(u1)) · (1 − (Σ tags(u1) − Σ tags(u2)) / (Σ tags(u1) + Σ tags(u2))) (6)
With: similarTags(u1, u2) = Number of equal tags in tag lists of user 1 and user 2, and
tags(uX) = Number of total tags in tag list of user X.
The difference between RSP and ESP lies mostly in their concept. While RSP
accounts for the similarity of both users in a joint context, asserting it as a
whole, ESP looks at this similarity from the "point of view" of u1, giving the
tag relation a slightly different meaning.
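Both direct measures can be sketched over tag sets. Note that the layout of formula 6 is ambiguous in the source; the ESP sketch below assumes the reading (Σ similarTags / Σ tags(u1)) · (1 − (Σ tags(u1) − Σ tags(u2)) / (Σ tags(u1) + Σ tags(u2))):

```python
def rsp(tags1, tags2):
    """Formula 5 (Record Semantic Proximity): Jaccard overlap of tag sets."""
    similar = len(tags1 & tags2)
    total = len(tags1) + len(tags2) - similar
    return similar / total if total else 0.0

def esp(tags1, tags2):
    """Formula 6 (Empiric Semantic Proximity), under the reading stated
    in the lead-in: overlap from u1's point of view, corrected by the
    imbalance between the two users' tag counts."""
    if not tags1:
        return 0.0
    n1, n2 = len(tags1), len(tags2)
    balance = 1 - (n1 - n2) / (n1 + n2)
    return (len(tags1 & tags2) / n1) * balance
```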
Regarding the indirect tag relationship, the value is computed according to the
following steps:
1) For every tag tagU of user1 that is not in user2's tag list, and for
every tagY in user2's tag list, get tagY's list of related tags
(according to Flickr) and check whether this list contains tagU;
2) If so, add to the counter the weight corresponding to the position of
the tag in the list. For example (assuming a base value of 0.8 and a
threshold of 0.01):
a. ['cat': 0.4, 'dog': 0.2, 'bird': 0.05, 'meow': 0.025]
3) Divide the sum obtained by the product of the number of sub-lists
compared and the maximum weight of each sub-list. See formula 7 for the
given example.
indirectRelation = sum / (0.4 * nAnalyzed) (7)
With: sum = the total of the "contributions" from the different sub-lists, and
nAnalyzed = the number of sub-lists analyzed.
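The three steps above can be sketched as follows. Here `related_tags` is a hypothetical stand-in for the Flickr related-tags API response: a mapping from a tag to its ordered list of (related tag, position weight) pairs, with weights descending from `max_weight`. The helper and its exact weighting scheme are our assumption, not the authors' implementation:

```python
def indirect_tag_relation(u1_tags, u2_tags, related_tags, max_weight=0.4):
    """Sketch of the indirect tag relation (steps 1-3 and formula 7)."""
    total, n_analyzed = 0.0, 0
    for tag_u in u1_tags:
        if tag_u in u2_tags:
            continue                        # direct match, handled elsewhere
        for tag_y in u2_tags:
            n_analyzed += 1                 # one sub-list inspected
            for rel, weight in related_tags.get(tag_y, []):
                if rel == tag_u:
                    total += weight         # step 2: position-based weight
                    break
    if n_analyzed == 0:
        return 0.0
    return total / (max_weight * n_analyzed)  # formula 7
```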
As a final result, both the direct and indirect tag relations can be taken into
account (if the user so desires), according to formula 8.
tagS = w1 * DirectSim(u1, u2) + w2 * IndirectSim(u1, u2) (8)
With: DirectSim(u1, u2) = The value of the direct tag relations (formula 5 or 6),
IndirectSim(u1, u2) = The value of indirect tag relation and, preferentially, w1+w2 =
1, in order to normalize values.
4.4 Adjacency Matrix
The result of the previous formulas is a value for each pair (UserX, UserY). The
value of the pair (User1, User2) may not be the same as the value of the pair
(User2, User1). For instance, if it is decided not to use the complete approach
when asserting the users' contact similarity, and User1 has User2 as a contact
while the reverse is not true, then User1 will have a weight of 1 in that
relationship, while User2 will have a weight of 0 in the same calculation. As the
graphs to be constructed are spatially visualized, we had to normalize these
distances somehow. We opted for the following approach: for each pair of pairs
((UserX, UserY), (UserY, UserX)), we assign the mean value of the two sub-pairs.
For example, suppose that con(User1, User2) = 0.83 and con(User2, User1) = 0.25.
The assigned value for this (double) relation would then be
(0.83 + 0.25) / 2 = 0.54.
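This symmetrization step can be sketched as follows, assuming the directed values are kept in a dictionary keyed by (user, user) pairs (a representation we choose for illustration):

```python
def symmetrise(con):
    """Average each directed pair of proximity values into one
    undirected weight, keyed by the sorted user pair."""
    sym = {}
    for (a, b), v in con.items():
        other = con.get((b, a), v)  # fall back to v if the reverse is missing
        sym[(min(a, b), max(a, b))] = (v + other) / 2
    return sym
```

For example, `symmetrise({('User1', 'User2'): 0.83, ('User2', 'User1'): 0.25})` reproduces the 0.54 value computed above.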
In the end, the matrix will look like the one represented in table 1.
Table 1. The Adjacency Matrix
P1\P2 Person1 Person2 Person3 Person4 Person5 Person6 Person…
Person1 NA 0.54 0.67 0.04 0.08 0.34 …
Person2 0.54 NA 0.78 0.23 0.51 0.001 …
Person3 0.67 0.78 NA 0.02 0.045 0.43 …
Person4 0.04 0.23 0.02 NA 0.78 0.12 …
Person5 0.08 0.51 0.045 0.78 NA 0.28 …
Person6 0.34 0.001 0.43 0.12 0.28 NA …
Person… … … … … … … …
Note: “NA” means the value is “not assigned”, as it wouldn’t make sense to connect a person
to itself.
5 Experiments
In order to set up a manageable experiment, in which we could visually control
and inspect the data with ease, we extracted around 200 randomly chosen Flickr
users, together with their tags, contacts and photos. From that information, we
created a set of graphs, varying some of the parameters.
The base parameters, used in formula 1, are, respectively, 0.2, 0.2 and 0.6. This
means that the contact relation has a weight of 0.2, and the same holds for the
group relation. The tag relation, the one we most intend to analyze, has a weight
of 0.6. Figures 4 and 5 depict the graphs obtained with the RSP formula and with
the ESP formula, respectively. The indirect tag relation was not used in these
computations.
Figure 4: Graph generated with the RSP formula. Graph with the connections on
the right and just the nodes on the left.
Figure 5: Graph generated with the ESP formula. Graph with the connections on
the right and just the nodes on the left.
We varied the base parameters in order to use the values 1.0, 0.0 and 0.0, meaning
that only the contact factor is represented in the graph. The resulting graph is shown
in figure 6.
Figure 6: Graph generated by giving significance only to the contact relation.
6 Observations
From our experiments, we can observe some relevant aspects:
• Graph modulation: Figures 4, 5 and 6 show that we can modify the topology of a
graph representing the same data by varying the parameters used to construct it.
This allows the graph to be adapted to the aspects most relevant to the people
analyzing it. In the present case, one could prefer to study the dynamics of a
network based solely on Flickr contact connections (figure 6), while, on the
other hand, a study could be performed where all the variables enter into the
graph construction (figures 4 and 5).
• Visual properties: Observing the graphs in the previous section, we can easily
identify a few characteristic network features. In all the examples (figures 4, 5
and 6) we can visually identify sets of people grouped together, forming subsets
of the network (clusters); we can also spot some subsets lying further away,
separated from the others, forming a sort of islands; and some persons lie far
from the main group, forming a set of individuals that we can call outliers.
• Graph navigation: Another goal of the project concerned the second-generation
view of this sort of graph. Figure 7 shows an example of navigability on the
graph. By selecting a node, one can navigate by "travelling" through the
neighbors of each node. When a node is selected, its direct connections are
highlighted, giving a visual rendering of the "road". This sort of navigation
supports the study of networks at the individual level [16], allowing the
analysis of a given individual's connections and characteristics, either as a
"real" person or as a role in the network.
Figure 7: Example of navigability on the graph, while selecting the leftmost node.
7 Future Work
The work presented in this article is a first step towards an interactive tool
for SNA. The principal basis of the work is already implemented: a project that
allows for the creation and visualization of graphs representing social networks.
However, precisely because this work is a basis for future work, there are a few
features of the project that can be enhanced.
One of the main extensions of the work will be the inclusion of methods for
explicit social network analysis, allowing us to derive statistics for inference
on characteristics such as the density or reachability of the graph. In addition,
the inclusion of different layers in the second-generation aspect will allow for
a concrete visualization of proximity based on specific aspects. One example in
the current work is the inclusion of tag clouds, where the location of persons in
the network is accompanied by a set of tags, representing the semantic landscape
on another level. These two additions will allow for a deeper theoretical
validation of the graphs constructed with the tool.
Regarding the construction of the graph, other elements can be included into the
calculations of proximities. One can include another level of tagging, by considering
the tags related by proximity (Flickr API provides this list), or by including metadata
from the pictures (when the API provides methods for that goal). Other formulas can
also be included, in order to experiment with different topologies.
As we validate the constructed graphs, trying larger sets of data may provide for
richer network environments. We believe that the built tool will handle these larger
sets of data without much effort, thus achieving a good level of scalability.
8 Conclusion
The work described in this article provides a step forward in SNA. Building on
Semantic Web foundations, we have developed a project that serves as the core
engine of an auxiliary tool for SNA. We have shown that graphs can be created
according to a desired set of parameters. At the same time, graph navigation is
already possible; with the main work in place, additional methods allowing for
specific analyses can be included in the tool.
Several questions also emerge from this work: what other elements of Web 2.0 can
be used along with Semantic Web technologies in order to produce valid results
and conclusions? What metadata is valid, and what is not? Conversely, how can SNA
and its networks enhance the Web 2.0 experience? Although the answers to such
questions are not evident, the way to reach them is to investigate, and to aid
the investigation of, SNA in the context of the Internet. Tools such as the one
described in this work are a precious aid towards this goal.
References
1. W3C Semantic Web Activity, 1994-2008, accessed 10 January 2008,
2. Mika, P.: Social Networks and the Semantic Web, Springer (2007)
3. Mathes, A.: Folksonomies - Cooperative Classification and Communication Through
Shared Metadata, University of Illinois Urbana – Champaign, December 2004
4. Flickr website, 2004-2008, accessed 12 February 2008,
5. Flickr services API, accessed 12 February 2008,
6. Gruber, “Ontology”, Encyclopedia of Database Systems, 2008, Springer, accessed 1
January 2008,
7. FOAF Project, 2000-2008, accessed 8 December 2007,
8. FOAF vocabulary specification, Vers. 0.91 on 2 November 2007, accessed 8
December 2007, http://xmlns.com/foaf/spec/
9. Li Ding, Lina Zhou, Finin, T., Joshi, A.: How the Semantic Web is Being Used: An
Analysis of FOAF Documents. In: System Sciences, 2005. HICSS '05. Proceedings
of the 38th Annual Hawaii International Conference on 03-06 Jan. 2005 pages: 113c
- 113c
10. Hassan-Montero, Y., Herrero-Solana, V.: Improving Tag Clouds as Visual
Information Retrieval Interfaces. In: I International Conference on
Multidisciplinary Information Sciences and Technologies, InSciT2006. Mérida,
Spain. October 25-28, 2006.
11. Lamantia, J., “Second Generation Tag Clouds”, Online posting 23 February 2006,
accessed 14 January 2008,
12. Flickr page on Wikipedia, accessed 13 February 2008,
http://en.wikipedia.org/wiki/Flickr
13. Jefferson Graham, “Flickr of idea on a gaming project led to photo website” Online
posting 2 February 2006, accessed 18 February, http://
www.usatoday.com/tech/products/2006-02-27-flickr_x.htm
14. Guy, M., Tonkin, E.: Folksonomies – Tidying Up Tags?. In: D-Lib Magazine,
January 2006, Vol. 12 N. 1.
15. Vander Wal, "Folksonomy Research Needs Cleaning Up", Online posting 17 January
2006, accessed 17 December 2007,
16. Fredericks, K.A., Durland, M.M.: The Historical Evolution and Basic Concepts of
Social Network Analysis. In: New Directions for Evaluation, Special Issue: Social
Network Analysis in Program Evaluation, February 2006.
17. Elizabeth F. C., Christine A. H., "Guest Editors' Introduction: Social Networks and
Social Networking," IEEE Internet Computing, vol. 9, no. 5, pp. 14-19, Sept/Oct,
2005
18. Efferson, C., Richerson, P.J.: A Prolegomenon to Nonlinear Empiricism in the
Human Behavioral Sciences
19. Scott, J. P.: Social Network Analysis: A Handbook. Sage Publications Ltd; Second
Edition ed., March 2000
20. Watts, D. J.: Six Degrees: The Science of a Connected Age, Norton, New York, 2003
21. Watts, D. J.: Networks, Dynamics and the Small-World Phenomenon, American
Journal of Sociology 1999 105:2, 493-527
22. Mynatt, E. D., et al: Design for Network Communities. In: Proceedings of the
SIGCHI conference on Human factors in computing systems, Atlanta, Georgia,
United States Pages: 210 – 217, 1997
23. Mynatt, E. D., et al: Network Communities: Something Old, Something New,
Something Borrowed... . In: Computer Supported Cooperative Work (CSCW)
Journal, Volume 7, Numbers 1-2, Pages 123-156, March, 1998
24. Shadbolt, N., Hall, W., Berners-Lee, T.: The Semantic Web Revisited. In: IEEE
Intelligent Systems, 21 (3). pp. 96-101. ISSN 1541-1672, 2006
25. RDF Semantics, 1999-2004, accessed on 2 December 2007,
http://www.w3.org/TR/rdf-mt/
26. SPARQL Query Language for RDF, 2006-2008, accessed on 3 January 2008,
27. JENA, Vers. 2.5.5, size 22M, latest version on 18 January 2007, accessed on 10
February 2008,
28. YFiles, Vers. 2.5.0.4, size 28M, accessed on 1 January 2008,
29. wxPython, Vers. 2.8.7.1, size 8M, accessed on 1 December 2007,
30. Rocha, L. M., Bollen, J.: An Adaptive System Approach to the Implementation
and Evaluation of Digital Library Recommendation Systems. In: Fourth
European Conference on Research and Advanced Technology for Digital
Libraries, December 2000
31. Python, Vers. 2.5.1, size 10M, accessed on 1 December 2007,
Semantically Enriching Folksonomies with FLOR
Sofia Angeletou, Marta Sabou, and Enrico Motta
Knowledge Media Institute (KMi)
The Open University, Milton Keynes, United Kingdom
{S.Angeletou, R.M.Sabou, E.Motta}@open.ac.uk
Abstract. While the increasing popularity of folksonomies has led to a
vast quantity of tagged data, resource retrieval in these systems is limited
by their being agnostic to the meaning (i.e., semantics) of tags. Our goal
is to automatically enrich folksonomy tags (and implicitly the related re-
sources) with formal semantics by associating them to relevant concepts
defined in online ontologies. We introduce FLOR, a mechanism for au-
tomatic folksonomy enrichment by combining knowledge from WordNet
and online ontologies. We experimentally tested FLOR on tag sets drawn
from 226 Flickr photos and obtained a precision value of 93% and an ap-
proximate recall of 49%.
1 Introduction
The popularity of many Web2.0 applications such as Del.icio.us1 , Flickr2 and
YouTube3 has led to a massive amount of freely accessible, user contributed
and tagged content. Despite the presence of tags, the lack of structure and ex-
plicit semantics hampers the creation of intelligent user interfaces for annotation,
navigation and querying and the integration of content from diverse and hetero-
geneous data sources. A popular hypothesis, expressed by many web experts ([4,
8, 9, 11, 17]), is that Web2.0 data sources can be used more efficiently by struc-
turing and semantically organising them and that the Semantic Web can provide
the needed semantics to achieve that.
This hypothesis motivated two different research approaches to enrich folk-
sonomies. First, some methods rely on the statistical analysis of tagspaces based
on tag co-occurrence to identify clusters of related tags. In these cases the
meaning of a tag is given by its cluster but it remains implicit, i.e., it is not explicitly
stated [3, 15, 16, 20]. Second, recent methods shift from this statistical view to
a knowledge-intensive approach where a semantic definition of tags is obtained
by aligning them to a knowledge source [13, 10]. The majority of works use
WordNet to define the semantics of tags for organizing resources or enhancing
their navigation.
Our work is part of the second type of approaches, with the difference that
we rely on all online available ontologies as a background knowledge source to
1 http://del.icio.us
2 http://www.Flickr.com
3 http://www.youtube.com
define the meaning of tags. In this paper, we present FLOR (FoLksonomy
Ontology enRichment), an algorithm which takes as input a set of tags
(either the tagsets of individual resources or the clusters derived by the statisti-
cal analysis of folksonomies) and automatically relates them to relevant semantic
entities (classes, relations, instances) defined in online ontologies. An immedi-
ate advantage of this correlation between tags and semantic entities is that the
tag is automatically associated with the semantic neighborhood provided by
the corresponding ontology. For example, for the tag canine apart from iden-
tifying that Canine SubClassOf Carnivore we also acquire the knowledge that
Canine DisjointWith Feline.
In the following we describe the related work (Section 2), our methodology
(Section 3) and discuss our experimental results (Section 4). We conclude and
elaborate on future work in Section 5.
2 Related Work
Since the term folksonomy was coined, research has focused on comprehending
the inherent characteristics of folksonomies and exploring their emergent
semantics. Two of the pioneering works exploring and analysing their structure, the types
of their tags and the user incentives in tagging are described in [7] and [14].
Additionally, there are two main lines of folksonomy related research.
Early works on folksonomies are based on the assumption that frequent co-
occurrence of tags translates to tag association ([3, 15, 16, 20], see [18] for a de-
tailed analysis of the specific methods). They use various statistical methods to
identify clusters of related tags without defining the exact relations among them.
An exception is the work detailed in [18], where, in addition to clustering the
tags, the semantic relations among them are identified.
The second research line focuses on the semantic definition of tags, primar-
ily by using WordNet. For example, [13] try to identify the meaning of tags in
order to enrich the relevant resources with RDF descriptions. The authors dis-
tinguish six conceptual categories of tags in Flickr. Using WordNet and other
knowledge resources for these conceptual categories they organise the tags ac-
cordingly. Then they enrich the Flickr photos with RDF triples created for each
of the tag categories. These triples are generated either by predefined predicates
or from WordNet signatures depending on the categories they belong to.
The authors of [10] describe a method that expands the related tags clusters
of Del.icio.us with more related tags based on co-occurrence. The expanded clus-
ters are presented as navigable hierarchical structures or semantic trees. These
semantic trees are derived from WordNet. Using a combination of WordNet
based metrics they identify the possible WordNet sense for each tag. Then they
extract the path of this tag from the WordNet hierarchy and they integrate it
into the semantic tree of the tag’s cluster.
The TagPlus system described in [12] uses WordNet to disambiguate the
senses of Flickr tags by performing a two step query. First a user looks for a tag,
then the system returns all the possible WordNet senses that define the tag and
the user selects (disambiguates) which sense he meant. Finally the system looks
for all the Flickr photos tagged with this tag and its synonyms.
T-ORG ([1]) performs ontology based organisation of Flickr photos into a
set of predefined categories according to the tags describing them. At first the
user selects an ontology of interest. Then, the system extracts the concepts and
tries to identify semantic relatedness between these concepts and the tags by
querying the web with various linguistic patterns between them. Then each tag
is categorised under a superclass of the concept to which it was most related
according to the web search.
All the aforementioned works present methods for tag disambiguation, re-
source organisation and tag cluster enrichment. Our work aims to address the
following additional issues. First, the existing works require some
initialisation on the user's side (e.g., a priori selection of an ontology or of
knowledge resources for the relevant categories of tags), or they require user
contribution to perform the disambiguation of the tags. FLOR is designed to run
entirely automatically (i.e., without user contribution). Second, FLOR exploits
more than one resource (all
the online ontologies and WordNet) aiming to achieve higher coverage of tags
compared to the coverage from single resources. Finally, the proposed enrich-
ment links each tag with a relevant semantic entity but also with its semantic
neighbourhood as demonstrated in the canine example in Section 1.
3 FLOR components and methodology
The goal of FLOR is to transform a flat folksonomy tagspace into a rich semantic
representation by assigning relevant Semantic Web Entities (SWEs) to each tag.
A SWE is an ontological entity (class, relation, instance) defined in an online
available ontology. While in this paper we describe the process of enriching a
set of tags with SWEs, the ultimate goal of our system is not just to connect to
SWEs but also to bring in other knowledge related to these SWEs. An example
of the inputs and expected outcomes to FLOR is demonstrated in Fig. 1. The
input consists of a set of tags and the output is a set of semantically enriched
FlorTags. Note that FLOR is agnostic to the way in which this tagset was
obtained. It can either be the set of all tags associated to a resource, or a cluster
of related tags obtained through co-occurrence based clustering. The experiments
reported in this paper used sets of tags associated with a given resource.
Intuitively, FLOR performs three basic steps (see Fig. 1). First, during the
Lexical Processing the input tagset is cleaned and all potentially meaningless
tags are excluded. We rely on a set of heuristics to decide which tags are likely
to be meaningless. Second, during the Sense Definition and Semantic Ex-
pansion we attempt to assign a WordNet sense to each tag based on its context
(i.e., the other tags in its cluster) and to extract all relevant synonyms and hy-
pernyms so that we migrate to a richer representation of the tag. Finally, during
the Semantic Enrichment step each tag is associated to the appropriate SWE.
Note that there is a strong correlation between the steps of FLOR and the
components of the final FlorTag structure. The first step results in the Lexical
Representations, which is a list of lexical forms for the tag, such as plural and
singular forms for nouns, or variously delimited forms of compound tags
(sanFrancisco, san.Francisco, etc.). The second step identifies Synonyms and
Hypernyms for each tag. The last step generates the list of Entities containing
the associated SWEs. Note that a tag can be associated to several relevant SWEs.
Fig. 1. FLOR Methodology
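The FlorTag structure that accumulates the results of the three phases can be sketched as a simple container; the field names follow the paper's terminology, but the concrete representation is our assumption:

```python
from dataclasses import dataclass, field

@dataclass
class FlorTag:
    """Sketch of the FlorTag structure produced by the three FLOR phases."""
    tag: str
    lexical_representations: list = field(default_factory=list)  # phase 1
    synonyms: list = field(default_factory=list)                 # phase 2
    hypernyms: list = field(default_factory=list)                # phase 2
    entities: list = field(default_factory=list)                 # phase 3: SWEs
```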
3.1 PHASE1: Lexical Processing
Because freedom of tagging is a basic rule of folksonomies, a wide variety
of different tag types are in use. Understanding the types of tags used is the
first step in deciding which of them are meaningful and should be taken into
account as a basis of a semantic enrichment process. Previous work ([2, 7, 13])
has identified different conceptual categories of tags (event, location, person),
as well as tag categories that can be described by syntactic characteristics. For
example, there are many tags containing special characters (e.g., :P), numbers
(e.g., aug07), plurals as well as singular forms of the same word (e.g., building,
buildings), concatenated tags (e.g., littlegirl) or tags with spaces (e.g.,
little girl) and a large number of non-English tags (e.g., sillon). The role of
the lexical processing step is to identify these different categories of tags and
exclude those that are meaningless and should not be further included in the
semantic enrichment process. This is done in two steps.
The Lexical Isolation phase identifies sets of tags that should be excluded as
well as those that can be further processed. Currently we isolate and exclude
all tags with numbers, special characters and non-English tags. The reason for
excluding non-English tags is that our method explores various external knowl-
edge sources (WordNet, Semantic Web ontologies) that are primarily in English.
As future work, we will extend FLOR to isolate additional types of tags as well
and deal with non-English tags.
The Lexical Normalisation phase aims to solve the incompatibility between
different naming conventions used in folksonomies, ontologies and thesauri such
as WordNet. This phase produces a list of possible Lexical Representations
for each tag aiming to maximise the coverage of this tag by different resources.
For example, the compound tag santabarbara in folksonomies appears as Santa-
Barbara or Santa+Barbara in various ontologies and as Santa Barbara in
WordNet. However, as lexical anchoring to these resources is quite a complex
problem, we try to address it by producing all the possible lexical
representations for each tag, such as: {santaBarbara, santa.barbara,
santa barbara, santa-barbara, santa+barbara, ...}.
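The generation of these variants can be sketched as follows, taking the tag's constituent words as input (splitting a concatenated tag like santabarbara into words is a separate problem). The particular variant set is illustrative, not FLOR's exhaustive list:

```python
def lexical_representations(words):
    """Generate delimiter/casing variants of a compound tag, e.g. for
    words = ['santa', 'barbara']."""
    camel = words[0] + "".join(w.capitalize() for w in words[1:])
    return {
        camel,            # santaBarbara
        ".".join(words),  # santa.barbara
        " ".join(words),  # santa barbara
        "-".join(words),  # santa-barbara
        "+".join(words),  # santa+barbara
        "".join(words),   # santabarbara
    }
```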
3.2 PHASE2: Sense Definition and Semantic Expansion
Due to polysemy, the same tag can have different meanings in different contexts.
For example, the tag jaguar can describe either a car or an animal depending on
the context in which it appears. Before connecting a tag with a relevant SWE,
it is important to determine its intended sense in the given context. This task is
performed in the first step of this phase.
Another issue to take into account is that, despite its significant growth, the
Semantic Web is still sparse. A direct implication is that while online ontologies
might not contain concepts that are syntactically equivalent to a given tag, they
might contain concepts that are labeled with one of its synonyms. To overcome
this limitation, we perform a semantic expansion for each tag, based on its
previously identified sense, in the final step of this phase.
The Sense Definition and Disambiguation phase deals with discovering
the intended sense of a tag in the context it appears. As context we consider the
set of tags with which the given tag co-occurs when describing a resource. For
example, in the tagset: {panther, jaguar, jungle, wild} the context of jaguar
is {panther, jungle, wild}. We use WordNet as a sense repository and rely on
its hierarchy of senses to compute the similarities between the senses of all tags
in the tagset and thus achieve their disambiguation. WordNet also provides rich
sense definitions which facilitate the semantic expansion in the next step.
To define the senses of the tags in a tagset, we identify all the lexical
representations of each tag in WordNet. In the cases where a tag has more than
one sense in WordNet (synsets), we exploit the contextual information of the tagset
to identify the most relevant sense. For this, we calculate the similarity between
all the combinations of tags in the tagset using the Wu and Palmer similarity
formula ([21]) on the WordNet graph. The similarity degree between two senses
is calculated based on the number of common ancestors between them in the
WordNet hierarchy and the length of their connecting path. The result for each
calculation is a couple of senses and a similarity degree for these senses. We se-
lect the two senses of the tags that return the highest similarity degree provided
that this is higher than a specified threshold. If a tag has low similarities when
compared to all the other tags in its cluster, then it is assigned to the most
popular WordNet sense.
We currently use a threshold value of 0.8, which we observed to correctly
indicate relatedness in most cases. Indeed, values as high as 0.7 are often
assigned to unrelated tags. For example, in the tagset: {girl, eating, red,
apple} the similarity between red and girl is 0.7 for the senses:
Bolshevik, Marxist, Pinko, Red, Bolshie (emotionally charged terms used
to refer to extreme radicals or revolutionaries)
Girlfriend, Girl, Lady friend (a girl or young woman with whom a man is
romantically involved)
These two senses are connected through the concept Person in the WordNet
hierarchy; however, the two tags are unrelated in the context of this tag
cluster. While this empirically established value of 0.8 led to reasonable
results and was sufficient for this proof-of-concept prototype, we plan to
establish an optimal value through systematic experiments.
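The sense-selection procedure can be sketched for the two-tag case; the similarity function is pluggable (e.g., Wu & Palmer over the WordNet graph), and the sense representation here is a hypothetical simplification of WordNet synsets:

```python
def disambiguate(tag_senses, similarity, threshold=0.8):
    """Pick a sense for each of two tags: keep the sense pair with the
    highest similarity if it meets the threshold (0.8, as in the text);
    otherwise fall back to each tag's most popular sense, assumed here
    to be the first one listed.

    tag_senses: {tag: [sense, ...]} with exactly two tags.
    similarity: function mapping a pair of senses to a degree in [0, 1].
    """
    (t1, senses1), (t2, senses2) = tag_senses.items()
    best = max(((similarity(s1, s2), s1, s2)
                for s1 in senses1 for s2 in senses2),
               key=lambda x: x[0])
    if best[0] >= threshold:
        return {t1: best[1], t2: best[2]}
    return {t1: senses1[0], t2: senses2[0]}  # fall back to popular senses
```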
Thanks to the modular architecture of FLOR, the disambiguation and sense
selection method can be replaced by other methods (e.g., those used in
[19] and [22]), or our current method could be modified to exploit a different
similarity measure between two concepts such as the Google Similarity Distance
[5]. Another possible improvement could be achieved by further expanding the
resource tagset with more related tags. These can be discovered with statisti-
cal measures based on tag co-occurrence as described in [18]. For example, the
expanded tagset of {apple, mac} could be {apple, mac, computer, macOs}. So
instead of trying to disambiguate with only two tags, we increase the chances of
finding the correct sense by disambiguating within a more specific context.
The Semantic Expansion includes the synonyms and hypernyms of a tag
in the FlorTag (see Fig. 1). For the purpose of this work we used WordNet
to extract the synonyms of the correct sense and the synonyms of this sense’s
hypernym in WordNet. For example, if in the specific context the tag jaguar
refers to an animal then the semantic expansion would include a list of synonyms:
{Panther, Panthera onca, Felis onca} and a list of hypernyms: {Big cat,
Feline, Carnivore}.
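Following the jaguar example, which expands with synonyms of several hypernym levels, the Semantic Expansion step can be sketched as below. The `synonyms` and `hypernym` functions are stand-ins for WordNet lookups; walking the full hypernym chain is one reading of the description, taken from the example.

```python
def expand(sense, synonyms, hypernym):
    """Semantic Expansion: collect the synonyms of the selected sense and
    the synonyms of each sense on its hypernym chain."""
    hypernyms = []
    h = hypernym(sense)
    while h is not None:  # walk up the hypernym chain
        hypernyms.extend(synonyms(h))
        h = hypernym(h)
    return {"synonyms": synonyms(sense), "hypernyms": hypernyms}
```

With a toy chain jaguar → big cat → feline → carnivore, this reproduces the expansion lists given in the text.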
3.3 PHASE3: Semantic Enrichment
This phase of FLOR identifies the SWEs that are relevant for each tag by lever-
aging the results of lexical cleaning and semantic expansion performed in the
previous two phases. The final output of FLOR is produced by this phase (see
Fig. 1) and it is a set of FlorTags enriched with relevant SWEs and their semantic
neighbourhood (e.g., parents, children, relations).
The relevant SWEs are selected by querying the WATSON semantic web
gateway [6], which gives access to all online ontologies. We search for all ontolog-
ical entities (Classes, Properties, Individuals) that contain in their local name
or in their label(s) one of the lexical representations or the synonyms of a tag.
Such queries often result in several SWEs, some of which are very similar (or
the same when they appear in ontologies that are versions of each other). To
reduce the number of SWEs, we perform an entity integration process similar to
the one described in [19]. The goal of this process is to “collapse” entities that
have a high similarity into a single semantic object, thus reducing redundancy.
To compute similarity between two entities we compare their semantic neigh-
bourhoods (superclasses, subclasses, disjoint classes for classes; domain, range,
superproperties, subproperties for properties) and their localnames and labels.
The similarity simDgr for two SWEs e1 and e2 is calculated as:
simDgr = Wl ∗ simLexical(e1, e2) + Wg ∗ simGraph(e1, e2)
simLexical(e1, e2) is the similarity between the lexical information of two entities,
i.e., their labels and localnames, computed with the Levenshtein distance
metric. simGraph(e1, e2) is the similarity of the entities' neighbourhoods, where
the similarity of each neighbourhood element is computed based on string simi-
larity. Because we consider the similarity of the semantic neighbourhoods more
important than the similarity of the labels, we set the weights as Wl = 0.3 and
Wg = 0.7. Note that these weights will be fine-tuned through systematic experi-
ments. If the similarity between two entities is higher than a threshold we merge
them in one entity by integrating their neighbourhoods into one. Then we repeat
the process until all entities are sufficiently different from each other, i.e., their
similarity falls under a chosen threshold.
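The weighted combination above can be sketched as follows. A normalised Levenshtein similarity stands in for simLexical; simGraph is left as a caller-supplied function, since the paper computes it from string similarities over the semantic neighbourhoods. The function names and the normalisation are assumptions of this sketch.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sim_lexical(l1, l2):
    """Normalised Levenshtein similarity in [0, 1]."""
    if not l1 and not l2:
        return 1.0
    return 1.0 - levenshtein(l1, l2) / max(len(l1), len(l2))


W_L, W_G = 0.3, 0.7  # weights from the text


def sim_dgr(e1, e2, sim_graph):
    """simDgr = Wl * simLexical + Wg * simGraph (simGraph supplied by caller)."""
    return W_L * sim_lexical(e1["label"], e2["label"]) + W_G * sim_graph(e1, e2)
```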
Consider for example Fig. 2 where five SWEs e1, ..., e5 are compared against a
threshold value of 0.5. We start by performing their pair-wise comparison and
observe that the pairs (e1 , e4 ), (e1 , e5 ), (e2 , e3 ) and (e2 , e5 ) have a similarity equal
or above the set threshold. We proceed by merging the pair with the
highest similarity, e1 and e5, into one entity e1+e5, and compute the similarities
between the new entity and the remaining ones. This process continues until all
similarities are lower than the set threshold, which implies that the obtained
entities are sufficiently different.
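This merging strategy amounts to a greedy agglomerative loop, which can be sketched as below; the representation of entities as sets of original entity ids and the pairwise similarity function `sim` are assumptions of this sketch.

```python
from itertools import combinations


def merge_entities(entities, sim, threshold=0.5):
    """Repeatedly merge the most similar pair of entities until all
    pairwise similarities fall below `threshold`. Each entity is a
    frozenset of original entity ids; `sim` scores two such sets."""
    entities = [frozenset([e]) for e in entities]
    while len(entities) > 1:
        (i, j), best = max(
            (((i, j), sim(entities[i], entities[j]))
             for i, j in combinations(range(len(entities)), 2)),
            key=lambda p: p[1])
        if best < threshold:
            break
        merged = entities[i] | entities[j]
        entities = [e for k, e in enumerate(entities) if k not in (i, j)]
        entities.append(merged)
    return entities
```

With five entities and similarities analogous to Fig. 2, the loop first merges the most similar pair and then re-scores the merged entity against the rest, stopping once every remaining pair scores below the threshold.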
Once the merged entities are created we enrich the tag with the relevant
entities. This is done by comparing the ontological parents of the merged en-
tity with the hypernyms retrieved from WordNet. The ontological parents are
the superclasses of classes, the superproperties of properties and the classes of
individuals. For example, as shown in Fig. 3, the tag moon is enriched with two
entities. The superclasses of both the entities have as localname one of the hyper-
nyms extracted from the WordNet sense of moon. Also, apart from the semantic
definition of the tag with the respective entity, we further enrich the tag with
the information carried by the entity, EarthsMoon TypeOf Moon.
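One simple reading of this relevance check, in which the localname of an ontological parent is matched against the WordNet hypernyms "as a whole or as a compound" (as described later for the lake example), is substring matching over normalised localnames. The function below is a sketch under that assumption.

```python
def is_relevant(entity, hypernyms):
    """Keep an SWE if the localname of one of its ontological parents
    matches a WordNet hypernym, as a whole or as part of a compound
    (e.g. parent 'BodyOfWater' matches hypernym 'water')."""
    def norm(s):
        return s.replace("_", "").replace(" ", "").lower()

    for parent in entity["parents"]:
        p = norm(parent)
        for h in hypernyms:
            if norm(h) in p:
                return True
    return False
```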
Fig. 2. Merging Strategy with threshold 0.5
3.4 An Enrichment Example
In this section we present a full cycle of the FLOR semantic enrichment method
for the tag lake, which was found in the following five tagsets: {rush, lake,
pakistan, rakaposhi, mountain, asia, kashmir, snow, glacier, green, white,
sky, blue, clouds, water}, {moraine, alberta, banff, canada, lake, lac,
rockies, scan}, {rising, sunlight, lake, quality, bravo}, {lake, nature,
landscape, sunset, water, organisms} and {lake, finland, suomi, beach,
bubbles, blue, sunlight, kids, natural}. Note that these tagsets contain the
tags that remain after the lexical processing performed in the first phase. Fig. 4
shows the information contained in the automatically obtained FlorTag.
Fig. 3. Enriched FlorTag moon
For the second phase of FLOR, Sense Definition and Semantic Expansion
using WordNet, the available WordNet senses for Lake are considered. These
are the following:
WordNet 1: Lake→Body of water, Water →Thing →Entity
(a body of (usually fresh) water surrounded by land)
WordNet 2: Lake→Pigment→Coloring material →Material
→ Substance→Entity
(a purplish red pigment prepared from lac or cochineal)
WordNet 3: Lake→Pigment→Coloring material →Material
→Substance→Entity
(any of numerous bright translucent organic pigments)
Fig. 4. Enriched FlorTag lake
Applying the Wu and Palmer formula for the senses of lake and the senses
of the rest of the tags in these tagsets we obtained variable similarities from
0 to 0.86. The zero similarities were obtained for location names such as banff,
pakistan, suomi and for generally unrelated tags such as quality, scan, sunlight,
sunset. Interestingly, lake returned zero similarity for the tags glacier and
mountain while they should be related. This is due to the fact that, in Word-
Net, Glacier and Mountain are hyponyms of Geological formation which
is a hyponym of Natural object while Lake is a hyponym of Body of water
which is a direct hyponym of Thing. Furthermore Glacier is a hyponym of
Ice mass but there is no subsumption relation between Ice mass and Ice or
Water that would allow for a connecting path between Lake and Glacier.
This fact motivates further research on how to identify similarities between tags
of a tagset beyond the subsumption relations provided by WordNet.
The highest similarity, 0.86, for lake was obtained with the tag water, be-
cause Sense 1 of Lake is related to Body of water (Sense 2 of Water ) with a
direct hyponymy relation. Note that, in most tagsets, the first sense of Water,
Liquid, is selected, as this is the most common sense in which the tag is used.
Therefore, this is a nice example of phase 2 identifying a non-trivial correlation.
Sense 1. Water, H2O: (binary compound that occurs at room temperature
as a clear colorless odorless tasteless liquid) → Binary Compound AND
→ Liquid
Sense 2. Body of water, Water : (the part of the earth’s surface covered with
water) → Thing
Once the correct sense is selected and the tag is semantically expanded with
hypernyms (there are no synonyms for this sense of Lake) then the third phase
of FLOR queries the online ontologies through WATSON and selects the SWEs
that correspond to this sense. As shown in Fig. 4 both selected entities have
the term Lake in their localname and their superclass in the ontology contains
one or more of the hypernyms returned by WordNet, Water and Thing, as a
whole or as a compound. This example shows that our anchoring to ontologies
is strict when matching the tags to be defined (their lexical representations and
synonyms) against the localnames and labels of the entities, and flexible when
matching the ontological parents against the hypernyms. Note also that the
selected SWEs carry additional information
about two superclasses of Lake (Waterway, Waterfeature) and an instance of
Lake (Lake Baikal ) thus further enriching the tag.
4 Experiments and Results
To assess the correctness of FLOR enrichment (i.e., whether tags were linked
to relevant SWEs) we applied FLOR on a Flickr data set comprised of 250
randomly selected photos with a total of 2819 individual tags. During the Lexical
Isolation we removed 59% of the initial tags, resulting in 1146 tags in total.
We isolated 45 tags with two characters (e.g., pb, ak), 333 tags with numbers
(e.g., 356days, tag1), 86 tags with special characters (e.g., :P, (raw → jpg)),
and 818 non-English tags (e.g., turdus, arbol). Then we filtered out the photos
that exclusively contained the isolated tags (24 photos) and obtained a dataset
of 226 photos with a total of 1146 tags. After running the FLOR enrichment
algorithm for these 226 photos, one of the authors manually checked all the
assignments between tags and SWE’s.
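The Lexical Isolation filters described above can be sketched as a cascade of simple checks; the category names and the dictionary-lookup stand-in `is_english` are assumptions of this sketch (in practice a language dictionary or word list would back the last check).

```python
import re


def isolate(tag, is_english=lambda t: True):
    """Return the isolation category for a tag, or None if the tag
    survives lexical processing."""
    if len(tag) <= 2:
        return "too short"          # e.g. pb, ak
    if re.search(r"\d", tag):
        return "contains numbers"   # e.g. 356days, tag1
    if re.search(r"[^A-Za-z]", tag):
        return "special characters"  # e.g. :P
    if not is_english(tag):
        return "non-English"        # e.g. turdus, arbol
    return None
```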
The assignment of a SWE to a tag is considered correct if the concept de-
scribed by the SWE is the same as the concept of the tag in the context of its
tagset. To decide this, the evaluator was given a tagset and the SWEs linked
to its tags. She evaluated each tag enrichment as CORRECT if the tag was
linked to the appropriate SWE and INCORRECT otherwise. In cases when she
was not sure about the intended meaning of the tag, she rated the enrichment
as UNDETERMINED. Finally, a NON ENRICHED value was assigned to tags
that were not associated to any SWE. The results are displayed in Table 1.
Out of the individual 1146 lexically processed tags, FLOR correctly enriched
281 tags and incorrectly enriched 20 tags, thus leading to a precision of 93%.
Enrichment Result # of Tags Percentage
CORRECT 281 24.5%
INCORRECT 20 1.7%
UNDETERMINED 4 0.3%
NON ENRICHED 841 73.4%
Total 1146 100%
Table 1. Evaluation of semantic enrichment for individual tags.
An example of incorrect enrichment is that of square in the context {street,
square, film, color, documentary}. While its intended meaning is Geograph-
ical area, because during the disambiguation phase square did not return high
similarity with any of the rest of the tags, the WordNet sense assigned to it was
the most popular one, Geometrical shape. This led to the assignment of non-
relevant SWEs, namely Square SubClassOf Rectangle and Square SubClassOf
RegularPolygonShaped. Despite this error, the rest of the tags in this tagset were
correctly enriched.
FLOR failed to enrich 841 tags, i.e., 73.4% of the tags (see Table 1). Because
this is a significant number of tags, we wished to understand whether the
enrichment failed because of FLOR’s recall or because most of the tags have
no equivalent coverage in online ontologies. To that end we selected a random
10% of the 841 tags (85 tags) and manually identified appropriate SWE(s) using
WATSON and taking into account the context(s) of the tags in the tagset(s)
they appear. Out of the 85 tags we manually enriched 29. We therefore estimate
that the number of tags that could have been enriched by FLOR (i.e., those for
which an appropriate SWE exists) is approximately 287. Thus, taking into ac-
count that the overall number of tags that should be correctly enriched was 568
(281+287) but only 281 were enriched by FLOR this leads to an approximate
recall rate of 49%. While this is quite a low recall, these results are clearly
superior to the ones we obtained in previous experiments where phase 2 was not
part of FLOR, i.e., we directly searched for SWEs for the tags without relying
on WordNet as an intermediary step. Indeed, the WordNet sense definition and
expansion of the tags with synonyms and hypernyms (FLOR phase 2) increased
the tag discovery in the Semantic Web thus having a positive effect on recall.
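The precision and recall figures above follow directly from the counts in Table 1 and the 10% sample; the short calculation below merely restates that arithmetic.

```python
# Precision over the enriched tags (Table 1).
correct, incorrect = 281, 20
precision = correct / (correct + incorrect)          # 281 / 301, about 0.93

# Recall estimated from the 10% sample of non-enriched tags.
non_enriched, sample, enrichable_in_sample = 841, 85, 29
estimated_enrichable = round(non_enriched * enrichable_in_sample / sample)  # about 287
recall = correct / (correct + estimated_enrichable)  # 281 / 568, about 0.49
```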
FLOR failed to enrich the above 29 tags due to the following reasons. The
majority of the failures (55%) was due to different definitions in terms of
superclasses in WordNet and in online ontologies. For example, the definition of
love in WordNet and the relevant entity found in the Semantic Web are:
WordNet: Love→Emotion→Feeling →Psychological feature
(a strong positive emotion of regard and affection)
Semantic Web: Love SubClassOf Affection
Although both these definitions refer to the same sense, and additionally the
superclass Affection belongs to the gloss of Love in WordNet, they were not
matched because Affection does not appear as a hypernym of Love. Current
work investigates alternative ways of Semantic Expansion.
A further 24% of the tags not connected to any SWE were assigned to the
wrong sense during phase 2. For example, bulb referring to light bulb in
its tagset is assigned the incorrect sense Bulb → Stalk → Stem → Plant
organ. The rest of the unenriched tags are due to failures in anchoring them
into appropriate SWEs. For example, the sense of butterfly was correctly
identified, but none of its lexical forms matched the label of the appropriate
SWE (Butterfly Insect):
WordNet: Butterfly →Lepidopterous insect → Lepidopteron → Lepi-
dopteran → Insect
Semantic Web: Identified entity with localname Butterfly Insect
In the case of 4 tags the evaluator could not determine whether the enrich-
ment was correct or incorrect (Table 1). This is because the meaning of the
tag was unclear even when considering its context and the actual photo. For
example, in the photo of Fig. 5 the meaning of the tag volume is unclear. In the
second phase of FLOR the tag was expanded with the hypernyms Measure and
Abstraction. Then, it was related to the SWE Volume SubClassOf Measure.
As the meaning of the tag was not clear for the evaluator, she evaluated it as
{UNDETERMINED}. More generally, there are several cases when tags only
make sense to their author (and maybe to his social group) and thus will be
difficult to enrich by FLOR.
Fig. 5. UNDETERMINED Enrichment
After evaluating the individual tag enrichments the evaluator was able to
draw conclusions on the overall enrichment per tagset, i.e., by photo. Each
photo was assigned one of {CORRECT, INCORRECT, MIXED, UNDETERMINED,
NON ENRICHED}; the evaluation output is displayed in Table 2. According to
this table, 179 enrichments (about 80%) were {CORRECT}, i.e., all the enriched
tags of the photo are enriched correctly. Note that the {CORRECT} enrichment
results are much higher from a photo-centric perspective, as a tag may
appear in many photos. For the total of 20 {INCORRECT} and {MIXED}
enrichments, 3 of the photos had all enriched tags incorrect and 17 had at least
one tag incorrectly enriched. The above 4 {UNDETERMINED} tags
resulted in 4 {UNDETERMINED} enrichments, one of which is displayed in
Fig. 5. Finally, if no enriched tag appears in the photo, the result for the
photo is {NON ENRICHED}.
Enrichment Result # of Photos Percentage
CORRECT 179 79.2%
INCORRECT 3 1.3%
MIXED 17 7.5%
UNDETERMINED 4 1.8%
NON ENRICHED 23 10.2%
Total 226 100%
Table 2. Evaluation of SWE assignment to photos.
5 Conclusions and Future Work
We presented the methodology and the experiments we performed to test the
hypothesis that enrichment of folksonomy tagsets with ontological enti-
ties can be performed automatically. We selected a subset of Flickr photos
and after performing lexical processing and semantic expansion we correctly en-
riched 72% (179 of 250) of them with at least one Semantic Web Entity. We
enriched approximately 49% of the tags with a precision of 93%. Compared
to our previous efforts to define the tags with Semantic Web Entities without
previously expanding them with synonyms and hypernyms, this is a significant
improvement. Analysing the results we identified a number of issues to be re-
solved to enhance the performance of FLOR.
The Lexical Processing phase requires supplementary methods to identify
and isolate additional special cases of tags (e.g., photography jargon, dates).
Furthermore, understanding the impact of excluding these tags from the
overall process, implementing strategies to deal with them and integrating them
in FLOR will be addressed by our future work.
As indicated by the results in Section 4, the cases of incorrect enrichment
and lack of enrichment were mainly caused by the failure of the Sense Def-
inition and Semantic Expansion phase. The following issues are currently
investigated in order to correct the errors and enhance the performance of this
phase. First, it is essential to extend the tag similarity measure to also identify
generic relations rather than only subsumption relations. This flaw was exempli-
fied in the case of lake and glacier, which were considered unrelated based on
the hierarchical structure of WordNet (Section 3.4). Also, in the example of square
co-occurring with street, the incorrect sense definition for square caused fur-
ther incorrect enrichment (Section 4). One of the possible solutions to this is
the context expansion based on tag co-occurrence. For example, expanding the
{square, street} tagset with their frequently co-occurring tags e.g., {building,
park} can increase the semantic relatedness between the tags and potentially
lead to mapping the tags to the correct sense. Finally, to solve cases where the
WordNet sense and the SWE are the same but with different hypernyms (see
the example of love) the goal is to identify more relevant words as hypernyms
or synonyms in order to achieve higher coverage in the Semantic Web.
The quality of the results returned from the Semantic Enrichment phase
depends on (1) the input provided to this phase by the Semantic Expansion step
and (2) on the anchoring of the tags’ lexical representations and synonyms into
online ontologies (see the case of butterfly). Alternative strategies for flexible
anchoring, to increase the number of successful enrichments and at the same time
keep the number of irrelevant matches low, are investigated in our current work.
Also, we aim to experimentally identify optimal values for the thresholds and
weights used in the second and third phases.
Finally, we aim to evaluate FLOR in large scale experiments and to assess
the usefulness of the semantic enrichment in a real content retrieval application.
The aim is to identify possible implications of the overall process that are not
apparent in a small scale study like the current one.
To conclude, we demonstrated that the automatic enrichment of folk-
sonomy tagsets using a combination of WordNet and online ontologies
is possible without user intervention in any step of the methodology and by
using straightforward methods for lexical isolation, disambiguation, semantic ex-
pansion and semantic enrichment. The goal is to create a semantic layer on top
of the flat folksonomy tagspaces that allows intelligent annotation, search and
navigation as well as the integration of resources from distinct, heterogeneous
systems.
Acknowledgements
This work was funded by the IST-FF6-027595 NeOn project.
References
1. R. Abbasi, S. Staab, and P. Cimiano. Organizing resources on tagging systems
using t-org. In 4th European Semantic Web Conference, pages 97–110, Innsbruck,
Austria, 2007.
2. S. Angeletou, M. Sabou, L. Specia, and E. Motta. Bridging the gap between folk-
sonomies and the semantic web: An experience report. In 4th European Semantic
Web Conference, pages 30–43, Innsbruck, Austria, 2007.
3. G. Begelman, P. Keller, and F. Smadja. Automated tag clustering: Improving
search and exploration in the tag space. In 15th International World Wide Web
Conference, Edinburgh, Scotland, 2006.
4. R. Benjamins, J. Davies, R. Baeza-Yates, P. Mika, H. Zaragoza, M. Greaves,
J. Gomez-Perez, J. Contreras, J. Domingue, and D. Fensel. Near-term prospects
for semantic technologies. Intelligent Systems, IEEE, 23:76–88, 2008.
5. R. Cilibrasi and P. Vitanyi. The google similarity distance. Transactions on Knowl-
edge and Data Engineering, IEEE, 19(3):370–383, 2007.
6. M. d'Aquin, M. Sabou, M. Dzbor, C. Baldassarre, L. Gridinoc, S. Angeletou, and
E. Motta. Watson: A gateway for the semantic web. In 4th European Semantic
Web Conference, Innsbruck, Austria, 2007.
7. S. Golder and B. Huberman. Usage patterns of collaborative tagging systems.
Journal of Information Science, 32(2):198–208, 2006.
8. M. Greaves. Semantic web 2.0. Intelligent Systems, IEEE, 22(2):94–96, 2007.
9. J. Hendler. The dark side of the semantic web. Intelligent Systems, IEEE, 22(1):2–
4, 2007.
10. D. Laniado, D. Eynard, and M. Colombetti. Using wordnet to turn a folksonomy
into a hierarchy of concepts. In Semantic Web Application and Perspectives -
Fourth Italian Semantic Web Workshop, pages 192–201, Bari, Italy, Dec 2007.
11. O. Lassila and J. Hendler. Embracing “Web 3.0”. Internet Computing, IEEE,
11(3):90–93, 2007.
12. S. Lee and H. Yong. Tagplus: A retrieval system using synonym tag in folksonomy.
In International Conference on Multimedia and Ubiquitous Engineering, pages 294–
298, Seoul, Korea, 2007.
13. M. Zied Maala, A. Delteil, and A. Azough. A conversion process from flickr tags
to rdf descriptions. In 10th International Conference on Business Information
Systems, Poznan, Poland, 2007.
14. C. Marlow, M. Naaman, D. Boyd, and M. Davis. Position paper, tagging, tax-
onomy, flickr, article, toread. In 15th International World Wide Web Conference,
Edinburgh, Scotland, 2006.
15. P. Mika. Ontologies are us: A unified model of social networks and semantics.
In 4th International Semantic Web Conference, pages 522–536, Galway, Ireland,
2005.
16. P. Schmitz. Inducing ontology from flickr tags. In 15th International World Wide
Web Conference, Edinburgh, Scotland, 2006.
17. N. Shadbolt, T. Berners-Lee, and W. Hall. The semantic web revisited. Intelligent
Systems, IEEE, 21(3):96–101, 2006.
18. L. Specia and E. Motta. Integrating folksonomies with the semantic web. In 4th
European Semantic Web Conference, pages 624–639, Innsbruck, Austria, 2007.
19. R. Trillo, J. Gracia, M. Espinoza, and E. Mena. Discovering the semantics of user
keywords. Journal of Universal Computer Science, 13(12):1908–1935, 2007.
20. X. Wu, L. Zhang, and Y. Yu. Exploring social annotations for the semantic web.
In 15th International World Wide Web Conference, pages 417–426, Edinburgh,
Scotland, 2006. ACM.
21. Z. Wu and M. Palmer. Verb semantics and lexical selection. In 32nd Annual
Meeting of the Association for Computational Linguistics, pages 133–138, New
Mexico, USA, 1994.
22. C. Yeung, N. Gibbins, and N. Shadbolt. Understanding the semantics of ambiguous
tags in folksonomies. In International Semantic Web Conference, Busan, South
Korea, 2007.
Disambiguating Identity through Social Circles and
Social Data
Matthew Rowe, Fabio Ciravegna
Web Intelligence Technologies Lab
Department of Computer Science
University of Sheffield, UK
{m.rowe, f.ciravegna}@dcs.shef.ac.uk
Abstract: This paper presents an approach to disambiguate extracted identity
information relating to different individuals through the use of social circles.
Social circles are generated through the extraction and pruning of social
networks using the analysis of existing social data. Social data encompasses
information such as images, videos and blogs shared within a social network.
Identity information is extracted by involving the user in both selecting their
key identity features for disambiguation, and validating the retrieved
information. Our approach provides a methodology to monitor existing identity
information, applicable to addressing such issues as identity theft, online fraud
and lateral surveillance.
Keywords: communities, disambiguation, identity, semantic web, social
networks, social web
1 Introduction
The social web has seen enormous growth over the past 2 years. For example, in the
UK alone, there are now more than 9 million unique Facebook users, 5 million unique
MySpace users and 4.1 million unique Bebo users [16]. Commonly, users of such
services use these sites to create an on-line social environment very similar to the one
they experience in their everyday life, but extended and complemented by the on-line
features of these services. Common tasks include organising events, and interacting
with friends through messaging, blogging and sharing photos.
In parallel with the growth of the Social Web, the number of cases of malicious use
of personal information has also grown [18]. The problems of identity theft and
online fraud are of great significance in many countries, where personal details of
individuals are stolen daily and used for malicious purposes. Users of online social
networking sites commonly share information intended for social interaction.
However, such personal information (e.g. the date of birth provided to help people
remember someone's birthday) can be misused with malicious intent (e.g. the date of
birth is partly used to check a person’s identity when accessing phone services in the
UK). The number of reported cases of cyber stalking and online harassment has
increased, and the practice of lateral surveillance has also risen, including reports of
cases where potential employees are vetted based on their online presence.
Such issues present several challenges and opportunities for research. The
development of technologies able to monitor personal information of a given
individual provides a stepping stone to assessing the risk of an individual becoming a
victim of identity fraud. This requires first and foremost the ability to identify and
integrate personal identity information from heterogeneous web resources. This
paper focuses on the first part of this task, i.e. the challenge of discovering personal
identity information from semi-structured Web resources, and the required
disambiguation of individuals contained within this information. Semantic
technologies provide a useful means for carrying out disambiguation, by formalising
identity and using semantic information to assert facts about an individual.
The approach presented in this paper utilises social circles derived from social data
to disambiguate individuals. This is split into several stages:
Firstly a user’s social network is extracted from social networking sites and
integrated. The resulting network is then pruned into a social circle containing
identifiable relationships with other individuals by analysing existing social
information. A social circle is denoted as a group of people linked to a central
individual by some identifiable common relation. Social information describes
content related to a given person that contains social characteristics, and provides a
useful source of socially annotated data due to the rise in sharing facilities in social
Web sites. This can be of any type including blogs, images tagged with person names,
and instant messaging conversations.
The user decides which identity features are best suited to minimally distinguish
their identity from others. The process of discovering identity data begins by
extracting information from the set of resources using these identity features. We use
a social approach to allow users to rate and critique resources based on the accuracy
and volume of information available. Each user has the ability to contribute their
opinion about an identity resource based on the information present relating to them.
Identity features that have been repeatedly chosen by other users are also used
to extract data. Other resources are discovered and analysed automatically via the Web,
e.g. FOAF-web.
Finally all resources found to contain the user’s details are analysed to
disambiguate between potential individuals sharing similar identity properties. The
disambiguation procedure uses the user’s generated social circle by analysing
resources found to contain user details. Comparisons are made between the social
circle and the resource to derive the probability of the information present belonging
to the user. Upon completion of the disambiguation process, all known identity
information attributed to a person is displayed for validation. A user in the loop can
then validate it, correct it and re-start the process of extraction.
In order to implement the approach we have created a Facebook application
capable of extracting a user's social network from the social networking site. Our
application is able to access all the images and conversation data from Facebook
relating to each member of a user’s social network and prune the social network to
produce a circle consisting of the user’s closest friends. Images are chosen due to the
social tagging application commonly found in social networking sites. Our
implementation then uses this social circle for the disambiguation process. The
Facebook application is available for downloading and testing1.
This paper is structured as follows: Section 2 presents related work to
disambiguating identity through social circles. Section 3 outlines the semantics of
personal identity, and the selection of prevalent features. Section 4 details the
presented approach, explaining the methods used and technical details. Section 5
explains the proposed methodologies of evaluation and explains the reasons for this.
And finally, section 6 discusses the primary conclusions from the investigation so far,
and outlines the proposed future work in this area.
2 Related Work
The related work to our approach covers several fields of research: Name
disambiguation literature is included describing similar approaches to disambiguating
persons using related contextual information. Social network analysis literature covers
formal definitions of social circles and groupings. Social network mining literature
discusses differing methods for extracting social network information from various
sources. Object identification literature presents approaches that could be adapted to
identifying individuals, and commercial systems are included detailing work towards
identity disambiguation through the provision of identity theft risk assessments.
The problem of disambiguating individuals, also known as instance unification, is
addressed in [1]. Citations are used to discover additional information about a given
individual author; this information is then used to mine the web. Social networks are
constructed surrounding the author based on the co-authorship of papers. Similarly
work by [7] investigates the challenge of identifying misspelt and abbreviated names
by using clustering together with Naïve Bayes to compute the probability of a given
name belonging to a name cluster. Our methodology is similar to [1] by using existing
information to derive the initial social network, however we utilise further social
information such as image content and conversation data to prune the network.
Work to identify social circles is presented in social network analysis literature
such as [14] where cliques are initially discovered from an individual’s social
network, categorised as a sub-graph where a relation connects all pairs of points
within the graph. A social circle is the aggregation of overlapping cliques within the
1 http://apps.facebook.com/socialcircular
3 http://www.garlik.com
larger social network graph, and the key group of friends related to a given central
individual. Social grouping enables socially linked individuals to be clustered based
on a common relation where the relation can simply be a binary classifier used to
prune an individual's social network to only contain those individuals positively
classified. Our approach uses this definition to generate the social circle from the
initial social network, we use classifiers over image and conversation data to derive
social links.
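The clique-aggregation definition above can be sketched with a plain Bron–Kerbosch enumeration of maximal cliques, followed by the union of the cliques containing the central individual. This is an illustrative reduction of the cited definition, not the implementation of [14]; in practice a library such as NetworkX provides `find_cliques`.

```python
def maximal_cliques(adj):
    """Bron-Kerbosch enumeration of maximal cliques; `adj` maps each
    node to the set of its neighbours."""
    cliques = []

    def bk(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            bk(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    bk(set(), set(adj), set())
    return cliques


def social_circle(adj, ego):
    """Aggregate the overlapping cliques containing the central
    individual into a single social circle."""
    circle = set()
    for c in maximal_cliques(adj):
        if ego in c:
            circle |= c
    return circle
```

For a small graph in which the central user u belongs to the cliques {u, a, b} and {u, c}, the social circle is {u, a, b, c}, while unconnected individuals are excluded.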
A technique for mining social networks is presented in [9] and [6] utilising a three-
step approach: Firstly, mining the web for social network information identifying
links between two individuals, secondly monitoring real world interactions between
individuals to confirm relationships between them, and thirdly, monitoring
interactions between users on the web by capturing online communications between
individuals. Further work will mine social network information from the wider web.
At present our approach only generates social networks from social networking sites.
Work described in [10] uses a two-part methodology to gather social network
information by mining information from the web and crawling for semantic
documents containing information described using the FOAF [2] ontology. Social
network information is mined from the web by querying a search engine with pairs of
names of individuals considered to be friends. The number of returned pages in
which both names co-occur is the count for that pair; this gives the strength
of the relationship between the two individuals. Work by [4] presents an approach to
social network extraction using FOAF files by crawling FOAF-web, extracting
information from each FOAF file and aggregating with information from other FOAF
files. Assertions are made about discovered individuals using the supplied semantic
information. Our approach also uses FOAF-web to extract information; however, we
are only concerned with identity information during the mining phase. The later
disambiguation phase checks the FOAF content for relationships corresponding to the
social circle.
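The page-count measure described above for [10] can be sketched as follows. Here `search_hit_count` is a hypothetical stand-in for a real search-engine API, backed by a toy corpus purely for illustration; only the co-occurrence logic reflects the technique described in the text.

```python
from itertools import combinations

def search_hit_count(terms):
    """Hypothetical stand-in for a search-engine call that returns the
    number of pages matching all terms; a toy corpus replaces the web."""
    corpus = [
        "John Smith and Bobby Moore at the match",
        "Bobby Moore biography",
        "John Smith meets Bobby Moore again",
    ]
    return sum(all(t in page for t in terms) for page in corpus)

def cooccurrence_strengths(names):
    """For every pair of names, the count of pages on which both names
    co-occur is taken as the strength of their relationship."""
    return {
        (a, b): search_hit_count([a, b])
        for a, b in combinations(names, 2)
    }

strengths = cooccurrence_strengths(["John Smith", "Bobby Moore"])
print(strengths)  # {('John Smith', 'Bobby Moore'): 2}
```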
Work by [8] presents a methodology to identify labels for relations between two
socially linked entities; the label word along with the two entities then form the query
to be entered into a search engine to retrieve additional information. This
methodology allows ontologies to be generated using the entities and relations that
link them together. Threshold tuning is used to refine the importance between two
entities using objective and subjective criteria. An approach described in [11]
identifies relations between two socially linked entities and uses labels derived from
the collective context of the entities to define the relation in a similar manner to work
in [8].
Literature relating to object identification presents a similar approach to identity
extraction by using features to recognise objects. An interesting methodology
presented in [17] details a general method for learning rules to map data for object
identification using domain-independent attributes. Work by [12]
describes a heavily cited framework for object identification, presenting a
straightforward modular approach using clustering for the pre-selection of similar
object pairs. Literature such as [15] utilises the features of objects to predict the
likelihood of a match occurring for object pairs. Work by [3] describes an
automatic approach to identifying and disambiguating relevant information using a
library of string metrics to deduce term-record similarity. Our approach similarly uses
string metrics to compare the properties of identity at a low-level, mining the web for
identity information, and disambiguating extracted identity information.
Several systems utilising social networks and social information offer users the
ability to monitor their personal information. Garlik3 offers services to enable the
monitoring of personal information, but fails to correctly disambiguate between
individuals in several cases. Garlik only uses the presence of social networking
accounts when detecting personal information; our approach differs by using the
information within the accounts to extract relationship information. Maltego4 tracks
social networks relating to a given individual by mining the web for information
and identifying real-world links between people and groups. Spock5 is a similar
application that crawls the social web and the wider web for occurrences of the
names searched for; information is then aggregated using the similarities in the
derived content. Our work is similar to Maltego in mining the web for person
names; however, we place a greater emphasis on an individual's social circle
rather than their wider social network. The presented approach also differs by
allowing the user to select their most prevalent identity feature, and by prioritising
identity features with the greatest cumulative prevalence.
4 http://www.maltego.com
5 http://www.spock.com
3 The Semantics of Identity
The semantics of identity are extremely important when extracting identity
information and disambiguating between individuals. The notion of personal identity
described in [19] splits identity into 3 tiers: My identity containing persistent identity
features; shared identity containing attributes assigned to an individual by others; and
abstracted identity containing identity denoted by grouping. The first tier, ‘My
identity’, contains identity features of the greatest significance when disambiguating
one individual from the next: name, date of birth, etc. However, the prevalence of
these identity features differs between individuals. Consider a scenario involving a
man named John Smith. John must decide which features of his identity are the
most prevalent. He knows that his name is very common, but he also knows that he
is the only ‘John Smith’ on his street, so he selects his postcode. As certain identity
features build up a cumulative level of prevalence across users, those features
become inherently used by the approach.
Figure 1. Distribution of features of my identity throughout the Web (the diagram
maps identity features, such as home address, home telephone number, date of birth,
email, details of cohabiters, current location, professional contacts, work telephone
number, friends lists, and web pages, across the web locations where they appear)
As figure 1 shows, the distributed identity features contain enough information
about a given individual to compile a fairly complete identity profile. The social
network feature of identity is related to the ‘Shared identity’ tier presented in [19];
this tier contains temporary relationships prone to either strengthening or
breaking down. The creation of an ontology based on the FOAF [2] specification is
required to encapsulate the properties of an individual’s identity and their social
network. FOAF provides sufficient properties to define certain identity information;
however, it lacks other features that differ between domains, e.g. social security
number and national insurance number. Such a formalisation would aid the
integration of identity information and reasoning over it.
4 An Approach to Disambiguating Identity using a Social Circle
Figure 2. Approach Overview (a pipeline of five stages: generating social networks
from social networking sites; generating social circles; mining the web and identity
resources for identity data; disambiguating the identity data; and validating the
identity data)
This section details our approach to disambiguating individuals using social
circles and presenting the disambiguated information for validation. It is
composed of sub-sections describing the processes performed in each stage of the
approach, as displayed in figure 2. To illustrate these processes we use the example
of John Smith, a man who is concerned about his identity being stolen or misused
and wishes to know what information exists about him. We explain our approach in
terms of a completed system.
4.1 Generating Social Networks
In order to generate social networks, existing accounts belonging to the user on
social networking sites are needed to seed the disambiguation process. At present
only Facebook data is used to seed the approach; however, data from other accounts
can be used. John logs into a specially created Facebook application. The application
works by accessing Facebook’s Developer API and extracting the social network
information into FOAF [2]. This compiles a comprehensive list of all of John’s
friends. In the future, other social networking sites and alternative services can be
used to seed the process, providing John has an account with them. Existing
approaches could also be used to generate the social network, such as entity
co-occurrence and mining FOAF-web for social networks [7,8].
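A minimal sketch of the seeding step: serialising an extracted friend list as a FOAF RDF/XML fragment. This uses only `foaf:name` and `foaf:knows`; a real export from the Facebook API would carry further properties (mboxes, account URIs), and the names here are illustrative.

```python
def friends_to_foaf(me, friends):
    """Serialise a user's friend list as a minimal FOAF RDF/XML document
    (only foaf:name and foaf:knows are emitted)."""
    knows = "\n".join(
        "    <foaf:knows><foaf:Person><foaf:name>{0}</foaf:name>"
        "</foaf:Person></foaf:knows>".format(name)
        for name in friends
    )
    return (
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n'
        '         xmlns:foaf="http://xmlns.com/foaf/0.1/">\n'
        '  <foaf:Person>\n'
        '    <foaf:name>{0}</foaf:name>\n'
        '{1}\n'
        '  </foaf:Person>\n'
        '</rdf:RDF>'.format(me, knows)
    )

doc = friends_to_foaf("John Smith", ["Bobby Moore", "Geoff Hurst"])
print(doc)
```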
4.2 Generating Social Circles from Multimedia Data
Using the algorithms described in figures 3 and 4, strengths are derived
describing the relationships the user has with each member of their social circle.
Many users of social networking sites are prone to adding friends who are more
likely to be acquaintances and not necessarily people within their social circle or
clique. Figure 3 details the derivation of relationship strengths from image data:
images are extracted from a web resource annotated with unique identifiers of
individuals; should an identifier match that of the user, the social bond is
strengthened. The same principle is used in the algorithm deriving the social bond
from conversation data, by analysing all received and submitted messages to
identify the user.
Extract all friends from friend list
For each friend from friend list
    Extract all photos the friend appears in
    For each photo from photo list
        If you appear in the photo with your friend
            Increment friend strength by one
    Divide the friend strength by the total number of photos
    Store the social bond strength
Figure 3 – Algorithm deriving relationship strengths from images
Extract all friends from friend list
Extract all messages you have received
For each friend from friend list
    Extract all the messages sent to their profile
    For each message they have received
        If you received the message
            Increment message count by 1
    Divide message count by the total number of messages
    For each message you have received
        If the message was sent by the friend
            Increment received message count by 1
    Divide received message count by the total number of received messages
    Add the message count to the received message count
    Divide combined message count by 2
    Store the combined message count
Figure 4 – Algorithm deriving relationship strengths from conversational
information
Using these algorithms, John’s social network is pruned to fewer components: for
each of John’s friends, the Facebook application extracts all the photos they appear
in. Each photo is then verified to see if John also appears in it. A score is kept to
count how many times John appears in his friend’s photos; this score is then used to
derive the weighting of their relationship. The same process is carried out using
conversational information. Both of the derived weightings are used to compute an
average social bond, and the resulting property is added to the existing FOAF
specification to contain the strength of the social bond for each friend.
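The two algorithms of figures 3 and 4 can be sketched in Python as below. The data structures (photo tag sets, inboxes as lists of sender identifiers) are simplified stand-ins for what the Facebook API would return, and the sample values are invented for illustration.

```python
def photo_bond(user, friend_photos):
    """Figure 3: fraction of the friend's photos in which the user also
    appears; each photo is a set of tagged identifiers."""
    if not friend_photos:
        return 0.0
    shared = sum(1 for tags in friend_photos if user in tags)
    return shared / len(friend_photos)

def message_bond(user, friend, friend_inbox, user_inbox):
    """Figure 4: average of (a) the share of the friend's messages sent by
    the user and (b) the share of the user's messages sent by the friend;
    each inbox is a list of sender identifiers."""
    sent = friend_inbox.count(user) / len(friend_inbox) if friend_inbox else 0.0
    received = user_inbox.count(friend) / len(user_inbox) if user_inbox else 0.0
    return (sent + received) / 2

def social_bond(user, friend, friend_photos, friend_inbox, user_inbox):
    """Average the two weightings into a single social-bond strength."""
    return (photo_bond(user, friend_photos)
            + message_bond(user, friend, friend_inbox, user_inbox)) / 2

photos = [{"john", "bobby"}, {"bobby"}, {"john", "bobby"}]
bond = social_bond("john", "bobby", photos,
                   friend_inbox=["john", "geoff"], user_inbox=["bobby"])
print(round(bond, 3))
```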
4.3 Mining the Web for Identity Data
As discussed in section 3, the prevalence of identity features is purely based on the
user. This approach incorporates both identity features specifically selected by the
current user and the social dynamic of cumulatively prevalent identity features
following their frequent selection. Mining for identity data begins by using the
features of identity chosen by the user. John Smith knows that his name is very
common; he must therefore select accompanying features of his identity. He knows
that he is the only John Smith on his street, so he chooses his postcode, and he also
knows that his email address is unique, so he selects that.
Matching identity information compares instances of identity to derive a similarity
measure corresponding to the properties of each instance. At a meta-level this allows
the comparison of objects regardless of the differences in the format of their
properties. In our approach we only compare textual components, so one application
of instance matching would use matching names and any other identity features
through simple string matching using the SimMetrics package [3]. As resources are
parsed, the information is analysed for any possible occurrences of the identity
features being searched for. If any matches take place, then the URI of the resource is
stored.
The mining process begins by first mining FOAF-web to extract semantic
information about the user using the prevalent identity features selected by the user.
Due to their semantically rich format, parsing FOAF files is a simple process,
allowing basic comparison between the identity features specified by the user and
the found information. Using the property, linked FOAF files are also mined to
gather more semantic information.
Secondly, existing identity resources are accessed to extract identity information
using the prevalent identity features selected by the user. Identity resources
containing identity information can be shared by users of the implementation. The
community of users is able to provide feedback regarding the accuracy and volume
of the information present in each resource. Resources marked by the community
as useful sources of identity information become prioritised when the process of
mining identity information begins. When mining both FOAF-web and identity
resources for information, any found social content is flagged for later use. This
includes any names observed from the individual’s social circle.
Finally, the wider web is mined for identity information by submitting structured
queries to a search engine. Queries are structured to detail the most important and
prevalent features of the user’s identity selected at the start of the process, together
with the cumulatively most prevalent identity features. In John’s case three queries
would be used: his name and his postcode; his name and his street name (derived
from the postcode); and his name and his email address.
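The query construction step above can be sketched as follows; the postcode, street, and email values are invented placeholders, and the quoted-phrase format is one plausible way to structure queries for a search engine, not necessarily the one the system uses.

```python
def build_queries(name, features):
    """Combine the user's name with each selected identity feature to
    form quoted search-engine queries, one query per feature."""
    return ['"{0}" "{1}"'.format(name, feature) for feature in features]

# John's three queries: name + postcode, name + street, name + email.
queries = build_queries("John Smith",
                        ["S1 1AA", "Fictional Street", "john@example.org"])
print(queries)
```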
4.4 Disambiguating Identity Data
Following the collection of all possible identity instances attributed to the user it is
important to disambiguate between information relating to different individuals.
Social circles are used to perform the disambiguation process. Each resource is
parsed to derive information about any of the user’s friends from their social circle.
In our approach, entity extraction is used to find any names within the identity
resources and compile them into a list using the rule-based document annotator
Saxon [5]. Each name in the list is then compared with the names from the user’s
social circle by computing the Smith-Waterman-Gotoh distance using the
SimMetrics [3] package and comparing this distance to a predefined threshold. If a
match occurs, the resource is marked for re-extraction, and the social bond between
the matched friend and the user is strengthened. The members of the user’s social
circle with the strongest bonds are prioritised so that they are compared to the name
list first. Should no match be found, the confidence that the resource contains
information relating to the individual is minimal, and as a consequence no further
extraction is performed.
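A minimal sketch of this matching step, with `difflib.SequenceMatcher` standing in for the Smith-Waterman-Gotoh distance of the SimMetrics package; the threshold value, names, and bond strengths are illustrative assumptions.

```python
from difflib import SequenceMatcher

def match_circle(resource_names, social_circle, threshold=0.85):
    """Return the circle members matched against the names extracted from
    a resource. social_circle maps friend name -> bond strength; stronger
    bonds are compared first, as in the described prioritisation."""
    matched = []
    ranked = sorted(social_circle, key=social_circle.get, reverse=True)
    for friend in ranked:
        for name in resource_names:
            score = SequenceMatcher(None, friend.lower(), name.lower()).ratio()
            if score >= threshold:
                matched.append(friend)
                break  # one match per friend is enough
    return matched

circle = {"Bobby Moore": 0.7, "Geoff Hurst": 0.4}
names = ["John Smith", "Bobby More", "Boris Becker"]  # note the misspelling
print(match_circle(names, circle))  # ['Bobby Moore']
```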
Suppose the mining stage finds identity information for John Smith in two
different resources. After parsing the first resource, three names are found and
compiled into a list: John Smith, Bobby Moore, and Geoff Hurst.
Bobby Moore is one of John’s friends from his social circle. Therefore the
information contained within the resource is valid and belongs to John. The second
resource is parsed, and the names are added to a list: John Smith, and Boris Becker.
Boris Becker is not in John’s social circle; therefore we cannot tell if this resource
contains information belonging to John.
Upon completion of the disambiguation process, the resources found to contain
information correlating with the user’s social circle are passed on for validation.
Resources found to contain more people from the user’s social circle are given a
higher confidence rating and are therefore presented as being the strongest candidates
for containing information relating to the user.
4.5 Validating Identity Data
Once the information has been aggregated and linked together, it is displayed to the
user. The information is presented as a mapping showing the resource where the
information occurs and within what context the recognition took place, describing the
friends from the social circle that were matched around the user. The user is able to
validate the results by confirming or rejecting the extracted information. If the
returned information does not belong to the user then the extraction process should be
performed again, but with less prevalence towards the friends who were responsible
for the misidentification.
John is presented with an interactive diagram containing occurrences of his identity
on the web. Each occurrence is labelled with the location, and his friends and identity
properties that were also found there. Upon inspection of the information, he doesn’t
agree that one piece of information is about him. He clicks on the resource link to the
web page containing the information and realises that he was correct; the information
is about another John Smith. He returns to the diagram and informs the system of an
incorrect find.
5 Evaluating Identity Disambiguation
The evaluation of the described approach uses a user-based study consisting of 60
users, each with a variable level of presence on the web, both on the wider web,
including personal web sites, and on the social web, including online accounts with
social networking sites. This many users yields enough results to perform a
statistical evaluation. Each user tests the approach in two stages.
The first stage evaluates the pruning of each user’s social network into a more
compact and relational social circle. The evaluating user is required to analyse
whether the derived social circle contains links that they deem to be appropriate with
their peers, and mark any errors that exist.
The second stage of the evaluation process involves the evaluation of the
disambiguated identity information. The evaluating user verifies all the extracted
occurrences of their identity information prior to the disambiguation process, marking
whether each resource does contain information describing their identity or not. Once
the disambiguation process has completed, the user is then presented with
disambiguated information for validation. This second evaluation step is already
included in the previously described approach to provide a feedback mechanism for
the mining of extracted information. Both evaluation steps use information metrics
to derive the precision and error rate produced by the system, evaluating its
efficiency and effectiveness. User satisfaction is also evaluated using
questionnaires completed by users of the implementation.
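The precision and error-rate computation can be sketched as below; the exact metric definitions are an assumption (the text does not give formulas), here taking precision as the fraction of presented resources the user confirms, and the error rate as its complement. The URIs are invented.

```python
def precision_and_error(presented, confirmed):
    """presented: resource URIs the system returned for the user;
    confirmed: URIs the user validated as genuinely theirs.
    Precision is the confirmed fraction of presented resources."""
    if not presented:
        return 0.0, 0.0
    hits = len(set(presented) & set(confirmed))
    precision = hits / len(presented)
    return precision, 1 - precision

p, e = precision_and_error(
    presented=["uri:a", "uri:b", "uri:c", "uri:d"],
    confirmed=["uri:a", "uri:c", "uri:e"],
)
print(p, e)  # 0.5 0.5
```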
6 Conclusions and Future Work
The presented approach is currently being implemented and will be ready for
evaluation by prospective users. Through implementation, several interesting issues
have arisen that will be investigated further. One of the main issues concerns the
construction of search queries to the wider web. At present this last stage of the
mining process can yield low levels of precision, even when the user has
declaratively specified their most prevalent identity features. This can be attributed
to a lack of online presence in the search engine domain should a given individual
not have any personal web sites; therefore, a technique to tune the approach must be
considered.
We believe that the described approach presents a novel technique both for the
pruning of social networks to form social circles and for the disambiguation of
individuals using social circles. The former challenge has been addressed in an
abstract manner to allow the inclusion of alternative data sources such as blogs,
shared bookmarks, or emails. The algorithm for deriving social bonds can be
applied to analyse any similar social interactions. Images were chosen due to the
rise of the social tagging phenomenon evident in several social networking sites.
Such techniques could easily be adapted to provide risk assessments for users
concerned with identity theft and online fraud. By providing a model that specifies
which identity features must become accessible for identity theft to be possible, a
user could be informed simply by submitting the identity information provided by
our approach. This model must be adaptive to permit transfer between domains
where the criteria for assessment may differ (e.g. different countries require
different identity features).
At present our approach uses two binary classifiers to derive the existence of a
social bond between two individuals: we analyse conversations and images to
classify individuals as being friends. These classifiers are assumed to generate a
satisfactory social circle; however, a further advancement would allow the
comparison of classifiers. This would allow further evaluation of the proposed
methodology for social circle generation, as at present there is little indication of
the quality of the classifiers being used.
The methodology that we present for performing the disambiguation process is
fairly limited: it is composed of a straightforward comparison technique deriving
the string distance between two words. An alternative approach could utilise
decision models compiled from the string distances of each name extracted from a
single resource; a decision is then reached for that resource from the analysis of all
decision models [13]. The current approach is limited by requiring only a single
name match in a resource to denote a relation to the individual in question.
Upon completion of the implementation, evaluation will take place to derive the
effectiveness of generating social circles and disambiguating between items of
identity information. The methodology presented in section 5 sufficiently covers
the two main objectives of this approach, although both evaluation steps require an
exhaustive process of auditing the generated information to produce a gold
standard, particularly in the second stage.
Other future work is to develop a visualisation technique for user information,
similar to figure 1, detailing the distribution of personal information and allowing
the individual to analyse it.
References
1. Aswani, N., Bontcheva, K., Cunningham, H.: Mining Information for Instance Unification.
In: Proceedings of the 5th International Semantic Web Conference, Athens, GA, USA (2006)
2. Brickley, D., Miller, L.: FOAF Vocabulary Specification (2004)
3. Chapman, S., Norton, B., Ciravegna, F.: Armadillo: Integrating Knowledge for the Semantic
Web. In: Proceedings of the Dagstuhl Seminar in Machine Learning for the Semantic Web,
13-18 February (2005)
4. Finin, T., Ding, L., Zhou, L., Joshi, A.: Social Networking on the Semantic Web. The
Learning Organisation, vol. 1, no. 5, pp. 418-435 (2005)
5. Greenwood, M., Iria, J.: Saxon: An Extensible Multimedia Annotator. To appear in:
Proceedings of the International Conference on Language Resources and Evaluation,
Marrakech, Morocco (2008)
6. Hamasaki, M., Matsuo, Y., Ishida, K., Nakamura, Y., Nishimura, Y., Takeda, H.:
Community Focused Social Network Extraction. In: Proceedings of the 2006 Asian
Semantic Web Conference (2006)
7. Han, H., Zha, H., Giles, C. L.: A Model-based K-means Algorithm for Name
Disambiguation. In: Semantic Web Technologies for Searching and Retrieving Scientific
Data Workshop, International Semantic Web Conference (2003)
8. Jin, Y., Matsuo, Y., Ishizuka, M.: Extracting Social Networks among Various Entities on
the Web. In: The Semantic Web, International Semantic Web Conference 2006, pp. 487-500
(2006)
9. Matsuo, Y., Hamasaki, M., Nakamura, Y.: Spinning Multiple Social Networks for the
Semantic Web. In: Proceedings of the 2006 Asian Artificial Intelligence Conference (2006)
10. Mika, P.: Bootstrapping the FOAF-Web: An Experiment in Social Network Mining. In: 1st
Workshop on Friend of a Friend, Social Networking and the Semantic Web, Galway,
Ireland (2004)
11. Mori, J., Tsujishita, T., Matsuo, Y., Ishizuka, M.: Extracting Relations in Social Networks
from the Web using Similarity between Collective Contexts. In: International Semantic Web
Conference (2006)
12. Neiling, M., Jurk, S.: The Object Identification Framework. In: KDD03 Workshop on Data
Cleaning, Record Linkage, and Object Consolidation, Washington, DC (2003)
13. Rendle, S., Schmidt-Thieme, L.: Object Identification with Constraints. In: Proceedings of
the Sixth International Conference on Data Mining (ICDM), December 18-22, 2006, IEEE
Computer Society, Washington, DC, pp. 1026-1031 (2006)
14. Scott, J.: Social Network Analysis: A Handbook. Sage, London (2000)
15. Singla, P., Domingos, P.: Object Identification with Attribute-Mediated Dependences. In:
Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD),
pp. 297-308, Porto, Portugal (2005)
16. Sweeney, M.: Facebook sees first dip in UK users.
http://www.guardian.co.uk/media/2008/feb/21/facebook.digitalmedia (2008)
17. Tejada, S., Knoblock, C. A., Minton, S.: Learning Object Identification Rules for
Information Integration. Information Systems Journal, Special Issue on Data Extraction,
Cleaning, and Reconciliation, vol. 26 (2001)
18. Wallop, H.: Fear over Facebook identity fraud.
http://www.telegraph.co.uk/news/main.jhtml?xml=/news/2007/07/03/nface103.xml (2007)
19. Windley, P. J.: Digital Identity. O’Reilly Media (2005)
OntoPair: Towards a Collaborative Game for
Building OWL-Based Ontologies
Peyman Nasirifard, Slawomir Grzonkowski and Vassilios Peristeras
Digital Enterprise Research Institute
National University of Ireland, Galway
IDA Business Park, Lower Dangan, Galway, Ireland
firstname.lastname@deri.org
Abstract. Collective Intelligence takes advantage of collaboration, competition
and integration. It often uses mixed groups of humans and computers to do
research in new, unexplored ways. Ontologies, which are the main building
blocks of the Semantic Web, are usually prepared by domain experts. We
introduce a novel approach, which employs Collective Intelligence, towards
building simple domain ontologies through OntoPair, an entertaining web-based
game that is able to build simple OWL-based ontologies from information
collected from players. The game collects properties and common-sense facts
regarding an object by means of some fixed templates, translates them into an
OWL representation with the aid of a mediator/mapper, and builds simple
domain ontologies after refinement over several iterations. We define the game
and perform a small experiment that supports our idea.
1 Introduction
Ontologies are the main building blocks of Semantic Web technologies. They
attempt to define a specific domain in a systematic way. They can be expressed
using different standards and languages, such as RDFS [3] and OWL [10]. One of
the main concerns of Semantic Web researchers is building domain ontologies and
collecting sufficient instances for them. Because building domain ontologies is not
an entertaining task, they are usually built by domain experts.
In computer science, human-based computation is a technique in which a
computational process performs its function by outsourcing certain steps to humans
[16]. In other words, there are some tasks that most humans can do easily but that
current computers cannot perform in a reasonable time (e.g. CAPTCHA [13]).
Collective Intelligence is a form of intelligence that emerges from the collaboration
and competition of many individuals [17]. This phenomenon has been observed by
many researchers for years. Among them, Pierre Levy [6] described its potential for
internet technologies. He pointed out that rapid and open data exchange would
coordinate intelligence in new, unexpected ways.
In this paper we present a game called OntoPair, which aims at harnessing the
benefits of the Collective Intelligence phenomenon to create ontologies. We show
how to create them through a number of human-human competitions. We describe
how computers should process and integrate the obtained results in a way that leads
to well-defined ontologies.
2 OntoPair Structure
OntoPair is a two- or one-player game which provides an interactive environment
for anonymous players to play and build simple ontologies. The game is based on
traditional word-guessing games [15] with some fixed templates which have been
carefully chosen so that they can be translated into OWL by means of a
mediator/mapper. The game is composed of two separate phases, where the result
of the first phase is the input to the next. Roughly speaking, these two main phases
can be called collecting properties and collecting common-sense facts about an
object by means of some fixed templates.
The game is mainly for two players, but it can also be played in single-player mode.
The players do not know each other, they cannot communicate, and they are
randomly paired. The game can be played in two main modes: graphic-based and
text-based. In the graphic mode, the players look at the same image and play,
whereas in the text mode, the players look at the same text-based word or keyword;
e.g. in the graphic mode, the players may look at an image of a car, a house or a
bicycle, but in text mode, they will see the explicit words car, house, or bicycle. In
the next sections, we describe each phase in more detail.
2.1 Collecting Properties
In the first phase of OntoPair, collecting properties, the main goal is to collect the
properties, components, and characteristics of a specified object. This phase is very
similar to the ESP Game [7] and Google Image Labeler [5], but there are several
crucial differences. The main difference is that in the ESP game or Google Image
Labeler, the players try to annotate an image and identify the objects located in it,
whereas in this phase of OntoPair, players play to capture the properties and
characteristics of a specific object in text-based or graphic-based mode. The other
main difference is that the ESP game and Google Image Labeler work only in
graphic mode, and a text-based ESP game does not make sense, whereas OntoPair,
as mentioned earlier, can be played in both graphic and text mode.
The graphic mode of this phase should be based on the ESP game or Google Image
Labeler, as we need the explicit names of the objects located in the image. In other
words, the results of the ESP game or Google Image Labeler can be used in the
graphic mode of this phase of OntoPair. In the graphic mode, the players look at the
same image and are given a hint, which is the name of one of the objects located in
the image; the players should mention properties of that object and agree upon a
property. In the text-based mode, both players again look at the same word, which
is fetched from a database of objects.
The result of this phase is a collection of properties of different objects. We store
these properties in a data store and link them to the object. To clarify what we are
looking for, we give some hints to the players. These hints are two general
questions: What does an object X contain/have?, and Which
parts/components/characteristics does object X have? In these templates, X is
replaced with the name of the object; e.g. What does a car contain? and Which
parts/components/characteristics does a car have?
One of the key concepts in games is points: games without points do not make
sense, and players will lose their motivation to play after a while. In this phase of
the game, we also award points to the players. After the players agree upon a
property, both get points and the game continues, showing another image or text
depending on the game’s mode. The game continues until one player quits or time
is up, as the game is played in time intervals. If one player quits during the game,
we try to find another player randomly. If this takes too long, as the number of
players is not always even, the game can fall back to the single-player version. The
single-player version is actually playing against a log file from previous games with
the same object; we store all the properties of an object during each game. This also
helps to evaluate the data source of the properties. Obviously, at startup of the game
with an empty knowledge base, the game cannot be played in the single-player
version and there must always be an even number of players. To avoid boredom,
the players can skip the current image/text and continue to the next random
image/text if they find an object boring and cannot agree upon a property.
Obviously, for every object there are some properties that most players often
mention, e.g. most players will say that a book has a title or an author, but
probably only a few will mention its ISBN, price, or publishing date. To
address this issue, we detect these frequently-used words and prevent players
from mentioning them again and again by marking them as prohibited words. The
prohibited words assigned to an object are calculated from how often each
property has been mentioned in previous plays. An object remains in play only
as long as the players can still detect new properties in it; otherwise they
will always skip it. The prohibited words make the game more difficult, but
also more fun. Another advantage is that the prohibited words can serve as
hints, helping players guess what kind of information we are looking for as a
property, part, characteristic, or component.
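The frequency-based selection of prohibited words described above can be sketched as follows. This is only an illustrative reading of the mechanism: the `min_share` cut-off, the function name, and the data layout are our assumptions, as the paper does not specify them.

```python
from collections import Counter

def prohibited_words(mention_log, min_share=0.5):
    """Return, per object, the properties agreed upon in more than
    `min_share` of that object's plays; these become prohibited words.
    `mention_log` maps an object name to one list of agreed properties
    per play. (The 0.5 cut-off is an illustrative assumption.)"""
    result = {}
    for obj, plays in mention_log.items():
        counts = Counter(prop for play in plays for prop in play)
        result[obj] = {p for p, n in counts.items() if n / len(plays) > min_share}
    return result

# Three plays of "book": "title" is agreed every time, so it gets prohibited.
log = {"book": [["title", "author"], ["title", "isbn"], ["title", "chapter"]]}
print(prohibited_words(log))   # {'book': {'title'}}
```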
As a concrete example, suppose that both players are looking at an image of a
book. They should mention the properties and different parts of a book, one
after the other. The first player says title, the second player says author;
the game continues: the first player says chapter, the second player says
ISBN; the game continues: the first player says publisher, the second player
says title. At this point, both players have agreed upon title, so both get
points and the game continues by showing another random image or word.
Obviously, after several plays, with common words moved to the prohibited
list, we obtain, say, a fairly complete collection of a book's properties
such as author, chapter, ISBN, publisher, etc.
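The turn-based matching in this example can be sketched as a small simulation. Names here are hypothetical; the paper only describes the mechanic (players alternate until any word has been named by both, case-insensitively as noted earlier).

```python
def find_agreement(words_a, words_b):
    """Simulate the matching round: players name properties in turns;
    the round ends as soon as some word has been named by both players
    (comparison is case-insensitive, following the paper)."""
    seen_a, seen_b = set(), set()
    for a, b in zip(words_a, words_b):
        seen_a.add(a.lower())
        seen_b.add(b.lower())
        common = seen_a & seen_b
        if common:
            return common.pop()
    return None  # no agreement: the players may skip this object

# The book example from the text:
print(find_agreement(["title", "chapter", "publisher"],
                     ["author", "ISBN", "title"]))   # title
```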
2.2 Collecting Triples and Common Sense Facts
The second phase of OntoPair, collecting triples and common sense facts, is
also completely separated from the previous phase. Note that this phase is
played by different players, who are not necessarily the same as the players
in the first phase; the result of the first stage is used here. As in the
previous phase, there are two randomly-paired players, who do not know each
other and cannot communicate. In this phase, we collect pieces of information
called common sense facts about an object by means of some fixed templates.
Informally, a common-sense fact is a true statement about the world that is
known to most humans [8]: "a book has one title", "a human has two legs",
etc. As we mentioned earlier, collecting these common sense facts is done
through fixed templates and is based partially on the properties collected in
the previous phase. This phase of OntoPair is a word guessing game in which
one player (the narrator) should guide the other player (the guesser) to
guess a word, which is the object we are trying to collect common sense facts
about. In other words, in this stage an image or a word is assigned to one
player, who completes pre-defined templates to guide the other player to
guess the word. As soon as one template is completed, it is sent to the other
player, and as soon as that player comes up with the right word, both players
get points, the players' roles switch, and the game continues by showing
another randomly-chosen image or word. These pre-defined templates have been
chosen for a purpose: to translate simply into OWL using a mediator/mapper.
The explicit templates that we present to the players are listed below:
– It has at least Y : Y is replaced by a property chosen from a combo box,
drawn from the knowledge base of properties collected in the previous phase.
Here we capture the minimum cardinality of the property, if and only if it
makes sense.
– It has at most Y : as above, Y is replaced by a property from the knowledge
base, chosen from a combo box. Here we capture the maximum cardinality of the
property, if and only if it makes sense.
– It is kind of : with this template, we capture hierarchical information
about the item.
– It could be either or (or more): from one perspective (see also the next
template), this template provides different types of the item. The player can
extend this template by adding more items.
– It could be union of and (and more): from the other perspective (see also
the previous template), this template provides different types of the item.
The player can extend this template by adding more items.
– It is complement of the : this template provides complement
objects/concepts of the item.
– It is disjoint with (opposite of) : this template provides objects/concepts
that are disjoint with the item.
– It is equivalent to the : this template provides objects/concepts
equivalent to the item.
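The intent behind each template can be summarized as a mapping onto a standard OWL or RDFS construct. The construct names below follow directly from the descriptions above; the authors' exact serialization may differ, so this table is a hedged sketch, not their implementation.

```python
# Illustrative mapping from the fixed templates to the OWL/RDFS
# constructs their descriptions suggest (standard vocabulary names;
# the paper's own serialization may differ in detail).
TEMPLATE_TO_OWL = {
    "It has at least Y":                  "owl:minCardinality",
    "It has at most Y":                   "owl:maxCardinality",
    "It is kind of":                      "rdfs:subClassOf",
    "It could be either or (or more)":    "owl:oneOf",
    "It could be union of and (and more)": "owl:unionOf",
    "It is complement of the":            "owl:complementOf",
    "It is disjoint with (opposite of)":  "owl:disjointWith",
    "It is equivalent to the":            "owl:equivalentClass",
}

print(TEMPLATE_TO_OWL["It is kind of"])   # rdfs:subClassOf
```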
Note that this phase also has the notion of prohibited statements. Prohibited
statements are those statements that most players choose first; we are not
interested in collecting them every time, so we do not give the narrator the
opportunity to use them.
The players should normally use these templates, as we build OWL ontologies
with their aid, but we also give the narrator the option to build arbitrary
sentences when the templates are not useful. These arbitrary sentences become
comments in the generated ontology.
3 Generating OWL-based Ontology
In this section, we introduce the translation mechanism that we use to
generate OWL-based ontologies. An ontology is created for the object that the
players are playing, e.g. book, computer, car. After every play involving an
object (item), we collect some common sense facts about that item and can
build an ontology for it. The first iteration of ontology generation is only
a draft and cannot be considered a complete ontology; in other words, the
ontology is created over several iterations, not all at once.
3.1 Concepts
For the approved properties, i.e. the properties whose frequencies exceed a
threshold, an owl:Class is generated. These classes are the transformation of
properties into an OWL representation, using a mediator/mapper that simply
generates classes and their properties. Consider a domain like book: for
every approved property or concept, a class and a link are generated to
associate this class with the main concept, which in our example is book.
Figure 1 illustrates the mapping between some selected properties and their
OWL representations. As we mentioned earlier, the properties are stored in a
knowledge base (KB), and as soon as they are mature enough to be linked, the
mapper translates them into OWL and links them to the main concept. In the
following sections, we provide a more detailed description.
Fig. 1. Generating OWL for Properties Using a Mapper/Mediator
Pre-Refinement of Concepts (Refining Before Mediation). As we mentioned
earlier, the concepts need to be refined. The refinement process is as
follows: because a specific object can be played more than once, we assign a
counter to every object, which increases whenever the players play that
object. We call this counter objectCounter, where the word object is replaced
with the explicit name of the object. A counter is also assigned to every
property that the players agree upon during the game, and it increases upon
further agreement by other players. We call this counter
objectPropertyCounter, where object is replaced with the explicit name of the
object and property with the explicit name of the object's property. A
variance is defined for each property, calculated as objectCounter minus
objectPropertyCounter. If the result is greater than threshold1, the property
is moved to the prohibited list, as many pairs agreed upon that property; if
it is less than threshold2, the property is deleted, as only very few pairs
agreed upon it. Note that we are case-insensitive with respect to alphabetic
characters. Listing 1.1 shows the pseudocode of this refinement.
Listing 1.1. Pseudocode of Refining Concepts
if (object is selected) then
    objectCounter++;

if (objectProperty is selected) then
    objectPropertyCounter++;

variance(objectProperty) = objectCounter - objectPropertyCounter;

if (normalize(variance(objectProperty)) > threshold1) then
    move objectProperty to prohibited list;

if (normalize(variance(objectProperty)) < threshold2) then
    delete objectProperty;
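A runnable rendering of Listing 1.1 might look as follows. The paper leaves `normalize` and the threshold values unspecified, so the division by the play count and the sample thresholds below are assumptions; the branch logic follows the listing as given.

```python
def refine(object_counter, property_counters, threshold1, threshold2):
    """Apply the Listing 1.1 rule to every property of one object.
    `object_counter` is the number of plays of the object;
    `property_counters` maps each property to its agreement count.
    normalize() is taken to be division by the play count (an assumption)."""
    prohibited, deleted = [], []
    for prop, count in property_counters.items():
        variance = object_counter - count
        normalized = variance / object_counter   # assumed normalization
        if normalized > threshold1:
            prohibited.append(prop)   # per the listing: move to prohibited list
        if normalized < threshold2:
            deleted.append(prop)      # per the listing: delete the property
    return prohibited, deleted

# 10 plays of "book": "title" agreed 9 times, "cover" only once.
print(refine(10, {"title": 9, "cover": 1}, 0.5, 0.05))   # (['cover'], [])
```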
Concept Mediator/Mapper. The concept mediator/mapper is simply a mapper that
takes a property or concept as input and generates OWL statements as output.
The OWL statement also contains the link that associates the property with
the main object. Figure 1 shows some sample inputs and outputs of the
mediator/mapper. At this step, however, we do not yet have our ontology; we
have only gathered properties and built their links. The ontology is created
after gathering sufficient facts about the object.
Post-Refinement (Refinement After Mediation). After generating the OWL
representations of properties, they also need to be purified. Refining
statements is an iterative task that tries to build a summarized version of
the statements based on resource URIs. Figure 2 shows a sample of this
post-refinement.
Fig. 2. Properties Refinement Sample
3.2 Statements
In the previous section, we presented the fixed templates that we use to
gather common sense facts about objects. As we mentioned, those templates
were carefully chosen for two main purposes: first, to be translatable into
OWL using a mediator/mapper, and second, to keep the game from becoming
boring, as we want to entertain players rather than assign tasks to them.
Table 1 shows the general translation of the templates. Note that &xsd;
refers to the XSD namespace, i.e. xmlns:xsd =
"http://www.w3.org/2001/XMLSchema#". To keep the table readable, we use
acronyms.
Table 1: Templates and Their OWL Representations

Template                               Generated OWL
X has at least Y                       owl:minCardinality restriction (some value)
X has at most Y                        owl:maxCardinality restriction (some value)
X is kind of                           rdfs:subClassOf
X could be either or (or more)         owl:oneOf
X could be union of and (and more)     owl:unionOf
X is complement of                     owl:complementOf
X is disjoint with (opposite of)       owl:disjointWith
X is equivalent to                     owl:equivalentClass
Pre-Refinement of Statements (Refinement Before Mediation). The main goal of
pre-refinement is to select the statements that can be translated into
correct OWL. The process is as follows: as in the previous refinement, we
assign a counter to each object; we call this counter objectCounter2. We also
assign a counter to every instance of a template related to the object; we
call this counter objectTInstanceCounter. We log all instances that are sent
to the guesser. If an instance was helpful and the guesser guessed the word
correctly, we increase objectTInstanceCounter; if it was not useful and the
guesser could not guess the word, we decrease it. We then compare
objectTInstanceCounter against some thresholds and decide whether to keep the
instance, delete it, or move it to the prohibited list. Note that this
refinement is also case-insensitive with respect to alphabetic characters.
Listing 1.2 shows the pseudocode of this refinement.
Listing 1.2 demonstrates the pseudocode of this refinement.
Listing 1.2. Pseudocode of Refining Instances
if (object is selected) then
    objectCounter2++;

if (objectTInstance was helpful) then
    objectTInstanceCounter++;
else
    objectTInstanceCounter--;

variance(objectTInstance) = objectCounter2 - objectTInstanceCounter;

if (normalize(variance(objectTInstance)) > threshold3) then
    move objectTInstance to prohibited list;

if (normalize(variance(objectTInstance)) < threshold4) then
    delete objectTInstance;
Statement Mediator/Mapper. The statement mediator/mapper is simply a mapper
that takes a template instance as input and generates OWL statements as
output. The OWL statements also contain all necessary links to the main
object. Table 1 shows the OWL translation of some fixed templates; the
italicized words are the variable words supplied by the narrator.
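For the cardinality templates, the mapper might emit a restriction along the following lines. This is only a sketch: the element layout and the `hasTitle` property name are our assumptions, though the `&xsd;` entity matches the namespace note given earlier.

```python
def min_cardinality_restriction(prop, n):
    """Serialize an owl:minCardinality restriction for property `prop`
    (illustrative of what the statement mapper could emit; the exact
    serialization used by the authors is not shown in this text)."""
    return (
        '<owl:Restriction>\n'
        f'  <owl:onProperty rdf:resource="#{prop}"/>\n'
        f'  <owl:minCardinality rdf:datatype="&xsd;nonNegativeInteger">{n}'
        '</owl:minCardinality>\n'
        '</owl:Restriction>'
    )

# "It has at least 1 title" for a book, with a hypothetical property name:
print(min_cardinality_restriction("hasTitle", 1))
```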
Post-Refinement (Refinement After Mediation) and Ontology Assembler. After
generating the OWL representations, they need to be purified. Refining
statements is an iterative task that tries to build a summarized version of
the statements based on resource URIs. Figure 3 shows a sample of statement
refinement.
As we mentioned earlier, the fixed templates are only highly-recommended
proposals. If they do not help the narrator guide the guesser, he/she may
simply use English sentences. As these sentences have no structure, we keep
them as comments in the ontology if they were helpful to the guesser.
Fig. 3. Statement Refinement Sample
After all these processes, the general assembler is able to merge these
statements and build the first version of the ontology. This is an iterative
task, and the ontology is completed over several plays. Every ontology is
versioned using owl:versionInfo, which enables us to keep the history of
generated ontologies. Figure 4 illustrates the iterative life cycle of
generating ontologies.
4 Experimental Results
To evaluate the quality of the generated ontologies, we checked how they
change as the number of rounds increases. To keep the presentation
manageable, we limited the number of rounds to ten and the number of concepts
to two (tree and book).
In the first round (see Section 2.1), the properties color, height, and age
were collected for the word tree. After ten rounds, we had additionally
collected leaves and species. The same test performed for the word book
resulted in five properties: author, language, publisher, title, and year of
publishing. Five more rounds gave us three more properties: number of pages,
language, and index. Tables 2 and 3 present the results we collected; we show
both the words that affected the created ontology and the words that were
rejected. Note, however, that rejected words can still become properties of
the ontology if we perform more rounds.

Fig. 4. Iterative Life Cycle of Generating Ontologies
By analyzing more examples, we noticed that the number of properties does not
grow linearly with the number of rounds. Additionally, some players used
plural forms of words; this problem can, however, be solved by using
dictionaries. Moreover, the results provided by native speakers were much
more accurate, and they responded faster. We suggest using lists of forbidden
words; such lists force users, especially non-native English speakers, to use
an increasingly sophisticated vocabulary, otherwise they stop earning points
at some stage and thus have to learn new vocabulary.
The next part of our experiment evaluated the second phase (see Section 2.2),
in which each person was asked a set of questions related to common sense
facts. Again we used the same words: tree and book. For the word tree, just
three questions let the players successfully complete a round: it is a kind
of a plant; it has at least 1 height; it could be either oak or larch. Five
more rounds introduced two more facts into our knowledge base: it is disjoint
with animals; and it has at least 1 root. The same experiment for the word
book resulted in three common sense facts in five rounds: it has at least 1
edition; it has at least 1 language; it could be either hard-copy or
electronic. Five more rounds resulted in two new statements: it has at least
1 author; and it has at least 1 title. Again, more rounds are necessary to
improve the quality of the ontologies.
Table 2: Results of Phase 1: Tree

Rounds  Accepted Words            Rejected Words
5       Color, Height, Age        Bark, Animals, Location, Kind, Fruit,
                                  Root, Branches, Green, Flower, Species,
                                  Width, Status, Leaves Falling, Seeds
10      Color, Height, Age,       Bark, Animals, Location, Kind, Fruit,
        Leaves, Species           Root, Branches, Green, Flower, Width,
                                  Status, Type, Name, Leaves Falling, Seeds

Table 3: Results of Phase 1: Book

Rounds  Accepted Words            Rejected Words
5       Author, Language,         Pages, Chapters, Words, Paragraph, Index,
        Publisher, Title,         Foreword, Thickness, Audience Age, ISBN,
        Year of Publishing        Text, Abstract, Color
10      Author, Language,         Pages, Chapters, Words, Paragraph, Index,
        Publisher, Title,         Foreword, Thickness, Audience Age, ISBN,
        Year of Publishing,       Text, Abstract, Color, Cover Type, Domain
        Number of Pages
5 Discussion
The aim of the OntoPair game is to quickly build simple ontologies for
different objects located in images, or even for text-based objects. Our main
concern is that the game should be entertaining, to encourage people to play
it. For this reason, we should avoid playing complex domains. Complicated
concepts like business categorizations may be out of the scope of this game,
as such domains can make the game boring, and players would not come back.
Another point is that the generated ontologies may not contain all
information about a domain, as the players are ordinary people, not from the
Semantic Web field. This is also the main advantage of the game: it cleverly
uses people from different backgrounds to help Semantic Web experts and
scientists. We believe, however, that the ontologies become more complete
after each play.
Even though we propose that the players be randomly paired, there is some
potential for cheating: players could agree to log in at the same time in
order to be paired together and maliciously annotate the objects. To detect
this, based on previous plays, we propose presenting, at random times,
specific images or texts whose object properties we know exactly; if we
notice that the players are not playing honestly, we let them play as long as
they want. The same solution is foreseen for the second phase of the game. As
we mentioned, to increase certainty, we assign properties and statements to
objects if and only if a certain number of players agreed upon them. For
example, if among all players only two agreed that a car has a wing, we give
a low ranking to wing, and after filtering the properties using a threshold,
we omit it.
Statistics and our experience show that word guessing games are played by
many people, as they are entertaining. Many people from non-English-speaking
countries play these games to improve their English.
To evaluate the generated ontologies, the game can be played in single mode,
with the single player playing against already-generated ontologies. If a
generated ontology contains sufficient knowledge, the guesser should be able
to guess the correct words; otherwise, a low ranking is assigned to the
ontology. Another approach to evaluating OntoPair is comparing the generated
ontologies with ontologies created by domain experts; e.g. we can compare two
ontologies for a domain like book, one from the OntoPair repository and one
built by hand.
6 Related Work
In [11], the authors present an approach for building ontologies using a game
called OntoGame. They use Wikipedia articles as conceptual entities, present
them to the players, and have the users judge their ontological nature and
find a common abstraction for a given entry [11]. Our approach is different,
as we do not build a tree structure for objects; in two phases, we gather
properties and cardinalities plus different instances of an object.
There are also some efforts towards building knowledge bases by means of
computer-based games, designed mostly for two players. The ESP game [7] tries
to annotate images by getting players to come up with the exact objects
located in images. Peekaboom [9] is another game, which tries to determine
the approximate location of objects in an image. Verbosity [8] is a word
guessing game involving two players, a narrator and a guesser; the former
should guide the latter to come up with the word being sought, using some
fixed templates. Common Consensus [4] is very similar to Verbosity [8], but
it has its own templates, which mostly begin with Wh* questions. Phetch [12]
is another two-player game with a narrator and a guesser; the narrator gives
the guesser keywords to help him/her select the right image from a list of
images. In other words, Phetch's main goal is finding a specific image among
a set of similar images.
There are also other efforts in this general direction, mostly single-player
games. LabelMe [2] is one example, which assigns the player an image for
annotation. Cyc¹ is an artificial intelligence project that attempts to
assemble a comprehensive ontology and database of everyday common sense
knowledge, with the goal of enabling AI applications to perform human-like
reasoning [14]. Cyc offers a web-based game called FACTory², which gives the
single player several sophisticated common sense facts from different
domains; the player should mark them as true or false statements within a
short time period.

¹ http://www.cyc.com/
At the beginning of the 1980s, Wille [18] initiated his work on a theory
known as Formal Concept Analysis. The aim of the theory is to analyze data
and identify conceptual structures within data sets. This work expanded
rapidly several years later and has been successfully applied in some
specific domains, e.g. bio-medicine [1]. However, such an approach often
requires domain experts to approve the results.
7 Conclusion and Future Work
We have presented our work on OntoPair, a game that uses collective
intelligence to build OWL-based ontologies. OntoPair collects properties and
common sense facts about an object in an entertaining environment and builds
simple domain ontologies. We described how players compete and how computers
process and integrate the results. We also performed a simple experiment
showing how our knowledge base grows. Our prototype³ is still under
development and needs further work in the data and user management areas.
Future work will include a reputation model that gives more weight to
highly-esteemed users. Linking different ontologies together can also be
considered as a next phase; for example, if we build an ontology for wheel
and have a common sense fact indicating that a car has a wheel, we may link
the car and wheel ontologies. Furthermore, we would like to perform more
experiments to study how long it would take a domain expert and ontology
engineer to build an equivalent ontology, and to test OntoPair in more
specific domains.
Acknowledgments. The authors would like to thank Dr. Axel Polleres for his
valuable comments. This work is partially supported by the Ecospace project
(Integrated Project on eProfessional Collaboration Space, FP6-IST-5-35208),
the Lion project supported by Science Foundation Ireland under Grant No.
SFI/02/CE1/I131, and Enterprise Ireland under Grant No. ILP/05/203.
References
1. Franz Baader, Bernhard Ganter, Baris Sertkaya, and Ulrike Sattler. Completing
description logic knowledge bases using formal concept analysis. In Manuela M.
Veloso, editor, IJCAI, pages 230–235, 2007.
2. Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman.
LabelMe: a database and web-based tool for image annotation. In MIT AI Lab
Memo AIM-2005-025, 2005.
² http://207.207.9.186/
³ http://sourceforge.net/projects/OntoPair
3. Dan Brickley and R.V. Guha. Resource Description Framework (RDF) Schema
Specification. http://www.w3.org/TR/rdf-schema/, February 2004.
4. Henry Lieberman, Dustin Smith, and Alea Teeters. Common Consensus: A Web-
based Game for Collecting Commonsense Goals. In Workshop on Common Sense
for Intelligent Interfaces, ACM International Conference on Intelligent User In-
terfaces (IUI-07), Honolulu, Hawaii, USA, 2007. ACM Press.
5. Google Inc. Google Image Labeler. http://images.google.com/imagelabeler/, 2007.
Online; accessed 3-May-2007.
6. Pierre Levy. Collective Intelligence. Plenum Publishing Corporation, January 1997.
7. Luis von Ahn and Laura Dabbish. Labeling images with a computer game. In CHI
'04: Proceedings of the 2004 conference on Human factors in computing systems,
pages 319–326. ACM Press, 2004.
8. Luis von Ahn, Mihir Kedia, and Manuel Blum. Verbosity: a game for collecting
common-sense facts. In CHI ’06: Proceedings of the SIGCHI conference on Human
Factors in computing systems, pages 75–78, New York, NY, USA, 2006. ACM
Press.
9. Luis von Ahn, Ruoran Liu, and Manuel Blum. Peekaboom: a game for locating
objects in images. In CHI ’06: Proceedings of the SIGCHI conference on Human
Factors in computing systems, pages 55–64, New York, NY, USA, 2006. ACM
Press.
10. Sean Bechhofer, Frank van Harmelen, Jim Hendler, Ian Horrocks, Deborah L.
McGuinness, Peter F. Patel-Schneider, and Lynn Andrea Stein. OWL Web Ontology
Language Reference. http://www.w3.org/TR/owl-ref/, February 2004. Online;
accessed 2-May-2007.
11. Katharina Siorpaes and Martin Hepp. OntoGame: Towards Overcoming the In-
centive Bottleneck in Ontology Building. In 3rd International IFIP Workshop On
Semantic Web and Web Semantics (SWWS '07), co-located with OTM Federated
Conferences, Vilamoura, Portugal, pages 1222–1232, 2007.
12. Luis von Ahn, Shiry Ginosar, Mihir Kedia, Ruoran Liu, and Manuel Blum. Im-
proving accessibility of the web with a computer game. In CHI ’06: Proceedings
of the SIGCHI conference on Human Factors in computing systems, pages 79–82,
New York, NY, USA, 2006. ACM Press.
13. Wikipedia. Captcha — wikipedia, the free encyclopedia, 2007. [Online; accessed
14-December-2007].
14. Wikipedia. Cyc — wikipedia, the free encyclopedia.
http://en.wikipedia.org/w/index.php?title=Cyc&oldid=125786119, 2007. [On-
line; accessed 7-May-2007].
15. Wikipedia. Guessing game — wikipedia, the free encyclopedia.
http://en.wikipedia.org/w/index.php?title=Guessing game&oldid=116214370,
2007. Online; accessed 6-May-2007.
16. Wikipedia. Human-based computation — wikipedia, the free
encyclopedia. http://en.wikipedia.org/w/index.php?title=Human-
based computation&oldid=122965665, 2007. [Online; accessed 7-May-2007].
17. Wikipedia. Collective intelligence — wikipedia, the free encyclopedia, 2008. [On-
line; accessed 17-March-2008].
18. R. Wille. Restructuring lattice theory: An approach based on hierarchies of
concepts. In I. Rival, editor, Ordered Sets, volume 23, 1982.
Semantically Enhanced Webspace for Scientific
Collaboration
Daniel Harezlak1 , Piotr Nowakowski1 , and Marian Bubak1,2
1
Academic Computer Center CYFRONET AGH
ul. Nawojki 11, 30-950 Krakow, Poland
2
Institute of Computer Science AGH
al. Mickiewicza 30, 30-059 Krakow, Poland
d.harezlak@cyf-kr.edu.pl
Abstract. This paper presents an approach to constructing a collective
Web-based system for knowledge management. The work draws on concepts and
ideas widely promoted by modern web communities, such as user-created and
user-annotated content and reliable search mechanisms. Formal mechanisms,
such as ontology-to-model dependencies within collective knowledge, are also
used to build the proposed system. The main focus of this effort is on
scientific communities, in which large amounts of experimental data need to
be classified and verified. For this purpose, an enhanced set of available
Web tools needs to be assembled and made available as a unified system.
Key words: semantic models, web management, application plan, col-
laborative research
1 Introduction
The need to represent knowledge in a language that both people and computers
can comprehend is obvious and was recognized almost a decade ago [1]. Since
then, significant effort has been invested in combining the formalisms of
descriptions that can be parsed by computers with the free-text content
published by people all over the world, creating the new notion of the
Semantic Web. According to the survey in [2], the Semantic Web is gaining
momentum by expanding into areas of Internet computing such as trade,
business, and travel, not to mention the science domain. Currently, we
observe that the technologies and tools used for knowledge representation and
management are becoming more stable, and thus models and services are being
proposed [3, 4] to realize the vision of large-scale knowledge integration.
This paper focuses on the scientific aspects of the Semantic Web, especially
on knowledge- and data-intensive applications, which need to benefit better
from the possibilities that become available through the Semantic Web and its
extensions. The basic challenge is to combine the collaborative, global
methods of using Web resources with individual, geographically-scattered
research activities. Many modern approaches try to exploit the techniques of
social Web management, such as tagging, ranking, or the editing of Web
content by all users. For scientific purposes, however, more formal
mechanisms are required. This goal can be supported by applying a strict
semantic framework to the way Web research is conducted. We therefore propose
a solution that incorporates a semantic layer into the available Web
management routines to facilitate scientific research.
The need for such an environment was observed in the ViroLab project [5],
which develops a virtual laboratory [6] to facilitate medical knowledge
discovery and provide decision support for HIV drug resistance [7]. Three
groups of users have been identified: clinicians using decision support
systems for drug ranking, experiment developers who plan complex biomedical
simulations, and experiment users who run prepared experiments (scripts) [8].
An experiment is a kind of processing which may involve acquiring input data
from distributed resources, running remote operations on this data, and
storing results in a dedicated space; its functionality should not be limited
to the medical disciplines but should extend into other areas of science.
The following section surveys current achievements in the Semantic Web area.
Subsequently, a list of requirements for the proposed solution is presented.
The next two sections contain the architecture and the proposed semantic
enhancements, followed by the current implementation status and a summary
with a future workplan.
This work tries to go beyond the present state of building scientific web
communities by proposing a system which complements traditional computation
infrastructures with lightweight yet reliable, research-oriented web
interfaces that support knowledge management. In principle, it builds upon
existing achievements of the Semantic Web; however, it introduces a novel
approach to the management of semantic descriptions by web community members.
This requires new combinations of tools for managing semantic metadata with
social techniques for editing web content.
2 Related Work
Modern systems in which semantic descriptions are used to represent knowl-
edge generally apply tested and reliable languages, such as OWL [9], which is
based on an older RDF specification [10]. Another standard used by a significant
group of people is WSMO [11], which provides methods to semantically describe
Web services. A problem, however, arises when different groups of researchers
try to create descriptions of the same phenomena or elements of reality, result-
ing in inconsistencies when such descriptions are merged. This requires manual
alignment, which can be very time-consuming and inefficient. In order to effi-
ciently build ontologies, a semiautomatic tool is required to provide feedback on
preexisting descriptions and enable scientists to further build upon them, thus
ensuring coherency.
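For illustration, the feedback step of such a semiautomatic tool can be sketched as a simple label-similarity check that warns an author before a near-duplicate concept is introduced. The concept names and the matching heuristic below are our own illustrative assumptions, not a description of an existing tool.

```python
from difflib import get_close_matches

# Hypothetical label index of concepts already present in the shared ontology.
existing_concepts = {"DrugResistance", "ViralLoad", "Mutation", "Genotype"}

def suggest_alignment(new_label, known=existing_concepts, cutoff=0.75):
    """Return existing concept labels that a new description likely duplicates,
    so the author can reuse them instead of introducing an inconsistent copy."""
    return get_close_matches(new_label, sorted(known), n=3, cutoff=cutoff)
```

A researcher adding the misspelled concept "DrugResistence" would be pointed at the existing "DrugResistance" node for manual confirmation, keeping the merged descriptions coherent.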
It is easy to observe that the social Web has evolved into a global collab-
oration space where people from all over the world exchange experience using
systems such as Facebook [12] or Flickr [13]. This mode of collaboration has made
the Web an interesting tool for scientific communities to exchange
research results and knowledge. Several attempts have been undertaken to benefit
from these ideas, resulting in applications such as [14] and new trends in seman-
tic computing [15]. These attempts, however, still lack general acceptance and
stability. Nevertheless, several environments are already available and are being
used by smaller groups. For example, myExperiment [16], currently in its beta
testing phase, is a successor to well-accepted workflow management systems
such as Taverna [17] or BIOSteer [18]. The project delivers a Web-based system
for sharing workflows among community members; however, the infrastructure
does not provide features that allow workflow execution and result management.
3 Requirements
In order to satisfy potential researchers, any new system should ease their work.
Therefore, basic requirements should be identified first. Below we present a list
which attempts to formalize the process in which research is conducted. In par-
ticular, it is assumed that each type of supported scientific research can be aided
either by applying a computer system to conduct a virtual experiment (such as
a simulation) or by presenting the results in a digital format. The following is a
list of basic requirements for a Web-based knowledge management system.
– application plan storage - The notion of an application plan exists in various
domains of science and can be described as a list of steps necessary to achieve
a certain result. There are many ways to represent such a list. It can be ac-
complished either by building a workflow (e.g. with the BPEL [19] notation)
or by using a script (with any available scripting language). The requirement
is to provide a facility for application storage that can be accessed by au-
thorized users. In this way published applications can be discovered, reused,
assessed and improved by other scientists.
– managing application execution - For the application plan execution to be
possible, an underlying infrastructure has to be deployed and a proper ap-
plication plan execution engine needs to be set up. The whole process of
application execution has to be visualized to the user and, if necessary, in-
termediate results should be delivered.
– managing scientific results - The outcome of a research activity should be
represented by a result stored in a dedicated database. The results should
be properly annotated and classified, and made available to other scientists
for verification purposes.
– collaborating with other scientists - The system should provide collabora-
tion tools enabling scientists to discover each other's work, properly restricted
by security and copyright agreements. It should also be convenient to exchange
experience and validate others' work within one system.
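To make the first and last requirements concrete, the following minimal sketch models authorized plan storage and tag-based discovery in memory. All class and field names are illustrative assumptions, not part of the proposed system's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class ApplicationPlan:
    # Field names are illustrative, not taken from the proposed system.
    name: str
    script: str                      # the plan body, e.g. a script or a BPEL document
    owner: str
    tags: set = field(default_factory=set)

class PlanRepository:
    """Toy in-memory repository covering authorized storage, discovery and
    reuse of published application plans."""
    def __init__(self):
        self._plans = {}             # plan name -> ApplicationPlan
        self._acl = {}               # plan name -> users allowed to read it

    def publish(self, plan, readers=()):
        self._plans[plan.name] = plan
        self._acl[plan.name] = set(readers) | {plan.owner}

    def discover(self, user, tag):
        """Return names of plans tagged `tag` that `user` may access."""
        return sorted(p.name for p in self._plans.values()
                      if tag in p.tags and user in self._acl[p.name])
```

A plan published by one researcher with a reader list becomes discoverable by those collaborators only, reflecting the security restriction named above.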
The presented list of requirements should be supported by a semantic model
that facilitates all the functionalities which are to be provided by the proposed
system.
Another non-functional requirement is to separate the processes of applica-
tion development and conducting research. On the one hand, developers do not
want to be burdened with the semantics of a given research area, but only to be
concerned with, e.g., data formats, computation optimization, etc. On the other
hand, scientists want to focus solely on the research without knowing the specifics
of the actual implementation. This requires a separation layer provided by the
experiment plan. The only notions shared between the two groups are the
experiment plan, the input data and the experiment result. Developers write
experiments together with the underlying services, components, etc., which
require input data and produce results (the format of the data is, of course,
agreed between the two groups). The researchers execute the experiments and
validate and classify the data, being able to manage the semantic layer.
One last requirement that was identified is cross-disciplinary cooperation
of researchers. Creating a single global, ultimate ontology seems an impossible
challenge. However, it might be possible to find intersections between domain
ontologies and benefit from what others work on. The approach taken in the
proposed system is to make all the semantic metadata available to all
participants. In order to do that, an advanced editor is required to assist the
researchers in the process of managing the metadata.
4 System Architecture
4.1 General Overview
In Fig. 1 the basic architecture is presented. The system is divided into four
layers. At the bottom, the resource layer consists of services and data sources
which are used to build application plans using workflow or script notations
that provide some level of abstraction. In the same layer the Metadata Store
and the Application Repository are deployed and used to archive semantic data
and application plans respectively. The last two components are accessed by
the Web application layer (shown in green) directly. The next, yellow layer is
the middleware, which provides an abstraction over the low-level resources and
ensures unified access to the variety of technologies that implement data sources
and computational services. In this way access to data and services is seamlessly
woven into the application plan notation. The Application Execution Engine
also maintains the state of the applications during execution.
The third layer, representing Web applications, contains two modules: the
Metadata Engine module and the Execution Client module. The first is
responsible for managing the semantic descriptions available in the system.
It also constitutes a filter and a tool that helps users manage the semantic
content they provide or browse. Based on the semantic model presented in the
next section, users are able to:
– import their own semantic descriptions by semi-automatically aligning and
mapping them against existing ones,
– browse the existing knowledge by conveniently searching through existing
ontology triples,
– quickly obtain application plans, results or publications of interest by pro-
viding keywords (the whole knowledge space is tagged and annotated),
– tag and annotate the existing objects in the knowledge space.

Fig. 1. Basic components of the proposed system.
The second module - Execution Client - is responsible for communication
with the application execution engine and keeping the users updated with the
current execution status using AJAX-oriented techniques (e.g. implemented with
the GWT toolkit [20]).
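A minimal sketch of this polling behaviour could look as follows, written in Python rather than GWT's Java, and against a hypothetical engine interface.

```python
import time

class ExecutionClient:
    """Sketch of the Execution Client: it polls a (hypothetical) execution
    engine and forwards every status change to a UI callback, mimicking the
    asynchronous page updates a GWT front-end would render."""
    def __init__(self, poll_engine, on_update):
        self.poll_engine = poll_engine   # callable returning the current status
        self.on_update = on_update       # called only when the status changes

    def track(self, interval=0.0, max_polls=100):
        last = None
        for _ in range(max_polls):
            status = self.poll_engine()
            if status != last:
                self.on_update(status)   # push the change to the interface
                last = status
            if status in ("finished", "failed"):
                return status
            time.sleep(interval)
        return last
```

Reporting only status *changes* keeps the interface quiet while a long-running application plan stays in the same state between polls.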
4.2 Metadata Engine
The Metadata Engine is the main component which provides the reasoning func-
tionality over the ontologies built within the system. It covers the low-level Meta-
data Store and exposes convenient methods to manage the knowledge structure.
In Fig. 2 a detailed architecture of the Metadata Engine is presented. It
contains a client that enables access to the underlying metadata store and
facilitates the use of the query language supported by the store. The deduction
module is divided into two parts. For simple queries, for which response times
should be short, the client-side part is used. It communicates with the client
through an asynchronous channel, following the techniques used in Web
client-server communication models (built over the standard request-response
model). The calls are made directly by the visual components, which conclude
by updating their visual state. If the queries are more complex, the deduction
module on the server side is used. To the visual components this is transparent,
apart from longer response times.

Fig. 2. Internal architecture of Metadata Engine.
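The routing decision between the two deduction parts can be sketched as follows. The pattern-count heuristic is an illustrative assumption of ours; the paper does not specify how query complexity is measured.

```python
def route_query(query, client_deduction, server_deduction, max_patterns=2):
    """Route a query to the client-side deduction module when it is short
    (low latency matters) and to the server-side module when it is complex.
    Counting dot-separated triple patterns is an illustrative heuristic."""
    patterns = [p for p in query.split(".") if p.strip()]
    if len(patterns) <= max_patterns:
        return client_deduction(query)
    return server_deduction(query)
```

Because both branches return the same kind of answer, the visual components calling this function never need to know which tier served them.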
5 Semantic Enhancements
5.1 Basic Approach
In Fig. 3 a sample of the ontology model is presented. This model is used as the
basis for the Metadata Engine module to manage the collaboration space.

Fig. 3. Samples of semantic descriptions used in the proposed system.

The model consists of three parts:
– Science Domain - (blue) - This part of the semantic description is extendable
by users. This ensures that the model remains dynamic and, when required,
users may add custom ontological descriptions to existing ones. The process
is semi-supervised by the system in order to maintain coherency.
– Basic Model - (orange) - This model is the core of the application. It assumes
(in accordance with social Web content management) that every item within
the collaboration space may be tagged or annotated. This enables the space
to be enhanced with a quick search mechanism or with a tag cloud (used for
browsing the space).
– Application model - (green) - This ontology model allows the Metadata En-
gine to keep track of the content managed by users. In particular, users
are able to submit specific queries that lead to the exact pieces of data
stored in the collaboration space (e.g. list all publications that describe the
outcomes of a particular application plan, etc.)
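The example query above can be illustrated with a toy triple join; the predicate names below are hypothetical stand-ins for the application model's actual relations.

```python
# A tiny in-memory triple set; "producedBy" and "describes" are illustrative
# predicate names, not the paper's actual vocabulary.
triples = {
    ("result1", "producedBy", "planA"),
    ("result2", "producedBy", "planB"),
    ("pub1", "describes", "result1"),
    ("pub2", "describes", "result2"),
    ("pub3", "describes", "result1"),
}

def publications_for_plan(plan):
    """List all publications describing outcomes of a given application plan,
    by joining the two relations much like an RDQL/SPARQL query would."""
    outcomes = {s for (s, p, o) in triples if p == "producedBy" and o == plan}
    return sorted(s for (s, p, o) in triples if p == "describes" and o in outcomes)
```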
The presented model is only a proposal, showing how the final implemen-
tation could look; it remains a subject of ongoing research. It is also possible
to test several different models in different research contexts.
5.2 Role and Ontology Management
In order to ensure a hierarchy in the process of managing and building the
ontologies, proper user groups need to be modelled with appropriate permissions.
Also, a way of assessing the quality of the ontologies is required in order to
introduce formal models of the management process.
Figure 4 depicts a sample structure of such an ontology. The main Object node
is assigned the is editable by relation, which specifies which roles are permitted
to edit a given node. All Role nodes are referenced by User nodes, which creates
the authorization network in the proposed model.

Fig. 4. Role management dependency semantics.
To enable users to extend the current ontology graph, a Community Node is
introduced. This node is inherited by all the nodes created by community
members; in the process of collaborative cooperation of scientific communities
it is assessed, and the quality information is stored in the individuals of the
Node Quality concept. Quality will be measured by analyzing usage statistics
of a given knowledge node (e.g. the more users use and cite a given ontology
node, the higher its rank). Further improvements of this approach will
categorize the semantic descriptions into those approved and validated and
those still unassessed. Hopefully, this will lay the ground for building community
ontologies across different science domains. The model itself may be changed
while the system is working.
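The proposed usage-based assessment can be sketched as follows; the threshold value and node names are illustrative assumptions, not taken from the paper.

```python
def assess_nodes(usage_counts, approval_threshold=10):
    """Sketch of the proposed quality measure: community ontology nodes whose
    use/citation count reaches the threshold are categorized as approved,
    the rest remain unassessed. The threshold is an illustrative assumption."""
    approved = {n for n, c in usage_counts.items() if c >= approval_threshold}
    return approved, set(usage_counts) - approved
```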
6 Implementation Status
Currently the presented model is being implemented within the virtual labora-
tory supporting the scripting approach to representing application plans [6]. The
application execution engine [21] is already operational and capable of running
test application plans. Simple ontology models have been built; however, they
still require user assessment in order to be improved.
With respect to the Web application layer, a prototype of the user interface
was built; a screenshot is depicted in Fig. 5.

Fig. 5. Screenshot of a semantic collaboration space interface prototype.

The interface is divided into three parts:
– application management - In this widget the user is able to browse the col-
laboration space in search of application plans of interest. The search is
supported by the Metadata Engine, so application plans can be found
according to the history of previous executions, produced results, owners or
publications.
– result management - Results are managed by this view. Annotations and
tags can be provided to assess particular results.
– application execution status - With this tab users may follow the execution
status of their application plans and provide intermediate input data. The
input is also supported by the Metadata Engine, and previous results may be
used as inputs. When a result type model is provided, the engine suggests
suitable inputs.
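The type-based input suggestion can be sketched as a simple filter over earlier results; the result names and type labels below are hypothetical.

```python
def suggest_inputs(required_type, previous_results):
    """Sketch of the input suggestion step: when a result type model is
    available, earlier results whose declared type matches the input slot
    are offered as candidate inputs."""
    return [name for name, rtype in previous_results if rtype == required_type]
```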
The overlapping window in the middle is displayed as a popup and is used
here to show the application plan script. Each application plan may be
supplied with a license specifying its usage restrictions.
7 Conclusions and Future Work
This paper presents a semantic Web-based approach to constructing a scien-
tific collaboration space. The solution combines social Web routines with the
formalisms of semantic content descriptions to facilitate the process of on-line
research. The main improvements of the approach include the integration of the
application runtime system with result management and the adoption of
widely-used Web content management techniques in the area of scientific
research.
At present, the ViroLab virtual laboratory already integrates biomedical in-
formation related to viruses (proteins and mutations), patients (viral load) and
literature (drug resistance); it enables users to plan and run experiments trans-
parently on distributed resources. Different experiments from the virology domain
are executable, such as: deriving a drug resistance interpretation from a virus
genotype, querying historical and provenance information about experiments,
assisting a virologist with the Drug Resistance System, and simple data mining
with classification. Further work will extend this list and explore re-usability
in different science disciplines.
Future plans include the extension of the semantic model used for building
the prototype and extending the user community to test and assess the approach.
The aim is to benefit from the ideas brought by the Semantic Web trends and
extend the present solutions in the area of community-driven research to make
the process more reliable and efficient.
Acknowledgments. This work is partly funded by the European Commission
under the ViroLab IST-027446 and the IST-2002-004265 Network of Excellence
CoreGRID projects.
References
1. Chandrasekaran, B., Josephson, J.R., Benjamins, V.R.: What are ontologies, and
why do we need them? IEEE Intelligent Systems 14(1) (January/February 1999)
20–26
2. Cardoso, J.: The semantic web vision: Where are we? IEEE Intelligent Systems
22(5) (September/October 2007) 84–88
3. Missier, P., Alper, P., Corcho, O., Dunlop, I., Goble, C.: Requirements and services
for metadata management. IEEE Internet Computing 11(5) (September/October
2007) 17–25
4. Carroll, J.J., Dickinson, I., Dollin, C., Reynolds, D., Seaborne, A., Wilkinson, K.:
Jena: Implementing the semantic web recommendations. Technical report, HP
Labs (2003)
5. Virolab Consortium: ViroLab - EU IST STREP Project 027446 (2008),
http://www.virolab.org
6. ACC CYFRONET AGH: Virolab virtual laboratory (2008),
http://virolab.cyfronet.pl
7. Sloot, P.M., Tirado-Ramos, A., Altintas, I., Bubak, M., Boucher, C.: From molecule
to man: Decision support in individualized e-health (2006)
8. Gubala, T., Bubak, M.: Gridspace - semantic programming environment for the
grid. LNCS 3911 (2006) 172–179
9. W3C: Owl web ontology language (2004), http://www.w3.org/TR/owl-features
10. W3C: Rdf: Resource description framework (2001), http://www.w3.org/RDF
11. Roman, D., Keller, U., Lausen, H., de Bruijn, J., Lara, R., Stollberg, M., Polleres,
A., Feier, C., Bussler, C., Fensel, D.: Web service modeling ontology. Applied
Ontology 1(1) (January 2005) 77–106
12. Facebook Team: A social utility that connects people with friends and others who
work, study and live around them (2008), http://www.facebook.com
13. Yahoo! Inc: Photo sharing web space (2008), http://www.flickr.com
14. Fox, G.C., Guha, R., McMullen, D.F., Mustacoglu, A.F., Pierce, M.E., Topcu,
A.E., Wild, D.J.: Web 2.0 for grids and e-science. In: INGRID 2007 - Instrumenting
the Grid, 2nd International Workshop on Distributed Cooperative Laboratories -
S.Margherita Ligure Portofino. (2007)
15. Goble, C., Roure, D.D.: Grid 3.0: Services, semantics and society. In: Proceedings
of Cracow Grid Workshop 2007, ACC CYFRONET AGH (2008) 10–11
16. The University of Manchester and University of Southampton: myexperiment home
page (2008), http://www.myexperiment.org
17. Oinn, T., Addis, M., Ferris, J., Marvin, D., Senger, M., Greenwood, M., Carver, T.,
Glover, K., Pocock, M.R., Wipat, A., Li, P.: Taverna: a tool for the composition
and enactment of bioinformatics workflows (2004)
18. Lee, S., Wang, T.D., Hashmi, N., Cummings, M.P.: Bio-steer: A semantic web
workflow tool for grid computing in the life sciences (2007)
19. OASIS: Web services business process execution language (2007),
http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=wsbpel
20. Google: Google web toolkit (2008) http://code.google.com/webtoolkit
21. Ciepiela, E., Kocot, J., Gubala, T., Malawski, M., Kasztelnik, M., Bubak, M.:
Gridspace engine of the virolab virtual laboratory. In: Proceedings of Cracow Grid
Workshop 2007, ACC CYFRONET AGH (2008) 53–58