<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Collective Intelligence &amp; the Semantic Web</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Preface</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dr. Yannis Avrithis, National Technical University of Athens, Greece; Dr. Yiannis Kompatsiaris, CERTH-ITI, Greece; Prof. Steffen Staab, University of Koblenz-Landau, Germany; Prof. Athena Vakali, Aristotle University of Thessaloniki, Greece</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2008</year>
      </pub-date>
      <fpage>78</fpage>
      <lpage>119</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Preface</title>
      <p>http://mklab.iti.gr/CISWeb
This volume includes the papers presented at the 1st International Workshop on “Collective
Semantics: Collective Intelligence &amp; the Semantic Web (CISWeb 2008)”, which was hosted by
the 5th European Semantic Web Conference (ESWC-08) in Tenerife, Spain, on June 2nd, 2008.
Web 2.0 technologies have introduced new information-sharing practices that favor mass user
participation and aim to improve the quality of information content and information organization.
It is challenging to dynamically capture the knowledge that emerges from the interactions of
masses of users in social networks, given the heterogeneity of data sources, the large
information scale, and the huge volume of postings. The Semantic Web may contribute by providing
a language basis, structuring help from distributed ad hoc ontologies, and new ways of exploring
the information space.</p>
      <p>In this context, the CISWeb 2008 Workshop attracted very interesting work covering crucial
and emerging research topics such as using and enriching ontologies, semantically enhancing
folksonomies and webspaces, social data management, and interrelating Web 2.0 and the Semantic
Web. More specifically, ideas were presented at the Workshop for ontology matching via
knowledge extracted from multiple ontologies, enriching ontological user profiles with tagging
history, merging Web 2.0 and the Semantic Web by (semi-)automated content tagging, and
semantically enriching folksonomies and tagging. Most of these efforts were evaluated on
popular datasets and testbeds (such as Wikipedia, Flickr, and LycosIQ).
There were 11 submissions from 9 countries, and three reviewers were assigned to each paper.
The program committee selected 5 regular papers and 3 poster papers for presentation at the
workshop. We would like to thank all the program committee members for their dedicated effort
to review papers in their areas of expertise in a timely manner. Their effort was valuable in
assembling a high-quality CISWeb 2008 program.</p>
      <p>The research work presented at CISWeb 2008 was very interesting and exciting, and the
Workshop involved lively discussions and fruitful comments. Moreover, the program included a
very interesting invited talk by Prof. Bettina Hoser, from the Universität Karlsruhe, who
presented “Information Retrieval versus Knowledge Retrieval: A Social Network Perspective”, an
emerging topic of wide interest. We are grateful to Prof. Bettina Hoser for her insightful
presentation.</p>
      <p>Special thanks are owed to Eirini Giannakidou, PhD student at the CERTH Research
Institute, for her technical support in organizing CISWeb 2008. The workshop was held in
cooperation with the European Commission and the WeKnowIt Integrated Project, and we are
indebted to them for their contributions and financial support.</p>
      <p>CISWeb 2008 Co-Chairs</p>
      <p>Conference Organization</p>
    </sec>
    <sec id="sec-2">
      <title>Programme Chairs</title>
      <sec id="sec-2-1">
        <title>Yannis Avrithis, Ioannis Kompatsiaris, Steffen Staab, Athena Vakali</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Programme Committee</title>
      <sec id="sec-3-1">
        <title>Harith Alani</title>
        <p>Andrea Baldassarri
Nick Bassiliades
Susanne Boll
Ciro Cattuto
Thierry Declerck
Ying Ding
William Grosky
Harry Halpin
Andreas Hotho
Paul Lewis
Jose Martinez
Phivos Mylonas
Lyndon Nixon
Noel O'Connor
Raphael Troncy</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>External Reviewers</title>
      <sec id="sec-4-1">
        <title>Eirini Giannakidou, Gianluca Correndo, Ioannis Katakis, Georgios Meditskos</title>
        <p>Author Index
Alani, Harith . . . 5
Aleksovski, Zharko . . . 35
Angeletou, Sofia . . . 65
ten Kate, Warner . . . 35
Tojo, João . . . 50
van Harmelen, Frank . . . 35
Semantically enhanced webspace for scientific collaboration . . . 109</p>
        <p>Daniel Harezlak, Piotr Nowakowski, Marian Bubak</p>
        <p>Bettina Hoser</p>
        <p>Information Services and Electronic Markets,
Institute of Information Engineering and Management,
Department of Economics and Business Engineering,</p>
        <p>Universität Karlsruhe (TH)</p>
        <p>Germany
hoser@iism.uni-karlsruhe.de
</p>
        <sec id="sec-4-1-1">
          <title>Introduction</title>
          <p>When is a trend a trend? When 'the right people' initialize it. This is very well
known from the world of fashion. In the world of news, research and technology,
this may translate to the fact that a trend is a trend when 'relevant' people
or websites take up the topic. But how can the relevant people or websites be
distinguished from the less relevant? How can 'relevant' be defined? How can
one detect really 'relevant' trends? 'Relevant' is always a reflective notion: it
depends on the circumstances and thus, e.g. in the case of fashion or news, on
the social context.</p>
          <p>As an example of the question discussed here, take a high-tech company (e.g.
mobile phones) or a reinsurance company. For both it is essential that they see
trends before their competitors or potential clients see them. In the case of the
high-tech company, for example, it is crucial to know what the potential customers
are interested in, or which features of the current product are not accepted and
why. For the reinsurance company, it is necessary to know which hazards, e.g.
in health care, are being discussed, so that the company may prepare its policy
accordingly. As an illustration of that point, take the discussion on obesity in
children and subsequent health problems in adults.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>Information retrieval</title>
          <p>As companies look for ways to find trends like those sketched above, they
traditionally turned, for example, to newspapers. Nowadays the internet, with its chat rooms,
newsgroups, social networking sites and blogs, offers a wide range of information that was not
accessible before. Various methods have been devised to gather this information.</p>
          <p>Text analysis is one of the methods often used to extract information from
a text source. There is a large body of research literature, see e.g. [FNR03],
in the fields of linguistics, information science and classification on diverse ways
to extract keywords, key phrases, etc. from websites and other text sources. In
these research fields, models have been built to explain how the context-sensitive
relevance of words, phrases, etc. can be defined; think of classification schemes
such as the ACM Computing Classification System. Some of these methods yield lists of
candidate topics ranked by relative relevance according to the usage of phrases in the
text.</p>
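As a toy illustration of such frequency-based ranking, the following Python sketch ranks terms by their share of non-stopword tokens. The sample text, stopword list, and scoring scheme are invented for illustration, not taken from the cited literature.

```python
from collections import Counter
import re

# A deliberately small stopword list; real systems use much larger ones.
STOPWORDS = {"the", "a", "of", "in", "is", "and", "to"}

def keyword_ranking(text):
    """Rank non-stopword terms by their share of content tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    total = sum(counts.values())
    # Relative relevance: share of non-stopword tokens.
    return [(term, n / total) for term, n in counts.most_common()]

ranking = keyword_ranking("The battery of the phone is weak; battery life matters.")
# "battery" appears twice, so it tops the ranking.
```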
          <p>Another approach is to use additional information such as keywords or tags to
enhance the retrieved information by classifying it. This has grown into the
research fields of folksonomies, tagging, the semantic web, etc.</p>
          <p>At this point, though, what is known is that these phrases or words are often
used. What is not known is who used them, or, to put it precisely, whether the
user is a 'relevant' user in the context. This question has been at the
heart of the research field of Social Network Analysis.</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>Social Network Analysis</title>
          <p>Social Network Analysis (SNA) is a research area that tries to analyze and model
actor behavior based on his or her connections or relations to other members of
a group; for further reference see [WF99]. An actor is thus seen as restricted
or empowered by his or her connections to others. The basis of this structural
approach is given by models of group interaction. The first research questions
aimed to assign roles to actors in a given social context; leadership of a group
is one such role. There are also models about the power to manipulate. A person
in such a context may be called relevant, or central, if he or she is positioned
in the group's network in such a way that all information exchanged between any
two actors has to pass through this 'central' actor. He or she can thus
manipulate the group.</p>
          <p>Thus the question of who is relevant within a group is one of the research
questions within SNA. Based on graph theory, it can be analyzed using
different so-called centrality indices. Some of them are intuitive, like degree
centrality; others are more elaborate, like betweenness centrality or
eigenvector centrality. But the question is always: given a clearly defined context, who
within a group is relevant and who is not, how are the actors in the group connected,
and what predictions, if any, can be made about the future development of the
group structure?</p>
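A minimal sketch of the simplest such index, degree centrality, computed on an invented toy network; the graph and actor names are hypothetical, for illustration only.

```python
def degree_centrality(graph):
    """Normalized degree centrality: number of neighbours / (n - 1)."""
    n = len(graph)
    return {node: len(neigh) / (n - 1) for node, neigh in graph.items()}

# A star-shaped group: every exchange passes through "hub".
graph = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
    "d": {"hub"},
}

centrality = degree_centrality(graph)
# "hub" is connected to all 4 other actors, so it is the most central.
```

Betweenness and eigenvector centrality require shortest-path or spectral computations and are correspondingly more elaborate, as the text notes.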
          <p>Thus, this analysis approach can be used to find the 'relevant' people or
websites needed to enhance the information found by text retrieval.</p>
        </sec>
        <sec id="sec-4-1-4">
          <title>Knowledge Retrieval</title>
          <p>The idea of knowledge retrieval means not only gathering the available
information but enriching it with other information to gain knowledge about a topic.
In the case proposed here, this means using results from SNA to enrich the
information gathered by text analysis, to find out whether the topics found by
information retrieval are 'really hot topics' because 'relevant people' talk about them, or
just 'small talk' by 'bystanders'. In a conceptual study [HSGS+07]
we used such an approach to look for socially enriched information about mobile
phones within a newsgroup.</p>
          <p>The idea proposed here is based on the following information fusion approach:
first, a text corpus and a group are defined; then the text corpus is analyzed and
the group structure is evaluated; as a last step, these two results are combined to
gain knowledge. This is just a very crude and short description of the procedure.
One major challenge is to define the group. Depending on the area of
interest, this can be a very large group or a collection of websites corresponding to
a group. Sometimes it may not even be a well-defined group, so biases can
be introduced by the choice of actors (or websites). Once the group is defined,
there is also the question of the appropriate text analysis method; questions of
scalability and validity have to be answered here. As a last step, the
interpretation of the combined results has to be validated before any measures are
taken.</p>
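The fusion step might be sketched as follows: topic counts from text analysis are weighted by each author's SNA relevance score, so that a topic pushed by central actors outranks mere 'small talk'. All names, scores, and counts below are hypothetical placeholders, not a method from the talk.

```python
def socially_weighted_topics(mentions, relevance):
    """mentions: {author: {topic: count}}; relevance: {author: score in [0, 1]}.
    Returns each topic's mention count weighted by author relevance."""
    scores = {}
    for author, topics in mentions.items():
        w = relevance.get(author, 0.0)  # unknown authors count for nothing
        for topic, count in topics.items():
            scores[topic] = scores.get(topic, 0.0) + w * count
    return scores

mentions = {
    "alice": {"battery life": 3},          # a central actor
    "bob": {"battery life": 1, "ads": 5},  # a peripheral 'bystander'
}
relevance = {"alice": 0.9, "bob": 0.1}

scores = socially_weighted_topics(mentions, relevance)
# "battery life": 0.9*3 + 0.1*1 = 2.8 ; "ads": 0.1*5 = 0.5
```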
          <p>Even in view of the aforementioned challenges, this approach seems to
yield deeper insights into topics and trends, since it includes the social component
of trends.</p>
        </sec>
        <sec id="sec-4-1-5">
          <title>Outlook</title>
          <p>The potential of such an approach is very high. Not only companies are interested
in this kind of knowledge, gained from different 'news' sources and weighted by
social impact, but also the average internet user. If one takes a look at
communities of diverse interests, such as travel, or at such necessities as emergencies,
it is not only valuable to have information at hand gathered from collective sites,
but also to know who gave the information and whether the source can be viewed
as 'relevant' in the given context. In the context of emergencies, this may save
lives.</p>
          <p>Enriching Ontological User Profiles with Tagging History
for Multi-Domain Recommendations</p>
          <p>Iván Cantador1, Martin Szomszor2, Harith Alani2,</p>
          <p>Miriam Fernández1, Pablo Castells1</p>
          <p>1 Escuela Politécnica Superior
Universidad Autónoma de Madrid</p>
          <p>28049 Madrid, Spain
{ivan.cantador, miriam.fernandez, pablo.castells}@uam.es
2 School of Electronics and Computer Science</p>
          <p>University of Southampton
SO17 1BJ Southampton, United Kingdom</p>
          <p>{mns2, ha}@ecs.soton.ac.uk
Abstract. Many advanced recommendation frameworks employ ontologies of
various complexities to model individuals and items, providing a mechanism
for the expression of user interests and the representation of item attributes. As
a result, complex matching techniques can be applied to support individuals in
the discovery of items according to explicit and implicit user preferences.
Recently, the rapid adoption of Web 2.0, and the proliferation of social
networking sites, have resulted in more and more users providing an increasing
amount of information about themselves that could be exploited for
recommendation purposes. However, the unification of personal information
with ontologies using the contemporary knowledge representation methods
often associated with Web 2.0 applications, such as community tagging, is a
non-trivial task. In this paper, we propose a method for the unification of tags
with ontologies by grounding tags in a shared representation in the form of
WordNet and Wikipedia. We incorporate individuals’ tagging history into their
ontological profiles by matching tags with ontology concepts. This approach is
preliminarily evaluated by extending an existing news recommendation system
with user tagging histories harvested from popular social networking sites.</p>
          <p>1 Introduction
The increasing proliferation of Web 2.0-style sharing platforms, coupled with the rapid
development of novel ways to exploit them, is paving the way for new paradigms in
Web usage. Virtual communities and on-line services such as social networking,
folksonomies, blogs, and wikis are fostering an increase in user participation,
engaging users and encouraging them to share more and more information, resources,
and opinions. The huge amount of information resulting from this emerging
phenomenon gives rise to excellent opportunities to investigate, understand, and
exploit knowledge about users’ interests, preferences and needs. However, the
current infrastructure of the Web does not provide the mechanisms necessary to
consolidate this wealth of personal data, since it is spread over many unconnected,
heterogeneous sources.</p>
          <p>Community tagging sites, and their respective folksonomies, are a clear example of
this situation: users have access to a plethora of web sites that allow them to annotate
and share many types of resources. For example, they can organise and make photos
available on Flickr1, classify and share bookmarks using del.icio.us2, and communicate
and share resources with friends using Facebook3. Through personal tags, users implicitly
declare different facets of their personalities, such as their favourite book subjects on
LibraryThing4, movie preferences on IMDb5, music tastes on Last.fm6, and so forth.
The domains covered by social tagging applications are therefore both disparate and
divergent, creating considerably complex and extensive descriptions of user profiles.</p>
          <p>In the current Web 2.0 landscape, there is a distinct lack of tools to support users with
meaningful ways to query and retrieve resources spread over disparate end-points: users
should be able to search consistently across a broad range of sites for diverse media
types such as articles, reviews, videos, and photos. Furthermore, such sites could be
used to support the recommendation of new resources belonging to multiple domains
based on tags from different sites. As a step towards making this vision a reality, we
explore the use of syntactic and semantic based technologies for the combination,
communication and exploitation of information from different social systems.</p>
          <p>
            In this paper, we present an approach for the consolidation of social tagging
information from multiple sources into ontologies that describe the domains of
interest covered by the tags. Ontology-based user profiles enable rich comparisons of
user interests against semantic annotations of resources, in order to make personal
recommendations. This principle has already been tested by the authors in different
personalised information retrieval frameworks, such as semantic query-based
searching [
            <xref ref-type="bibr" rid="ref22 ref4">4</xref>
            ], personalised context-aware content retrieval [
            <xref ref-type="bibr" rid="ref13 ref31">13</xref>
            ], group-oriented
profiling [
            <xref ref-type="bibr" rid="ref21 ref3">3</xref>
            ], and multi-facet hybrid recommendations [
            <xref ref-type="bibr" rid="ref2 ref20">2</xref>
            ].
          </p>
          <p>We propose to feed the previous strategies with user profiles built from personal
tag clouds obtained from the Flickr and del.icio.us web sites. The mapping of those social
tags to our ontological structures involves three steps: the filtering of tags, the
acquisition of semantic information from the Web to map the remaining tags into a
common vocabulary, and the categorisation of the obtained concepts according to the
existing ontology classes.</p>
          <p>
            An application of the above techniques has been tested in News@hand, a news
recommender system which integrates our different ontology-based recommendation
approaches. In this system, ontological knowledge bases and user profiles are
generated from public social tagging information, using the aforementioned
techniques. The News@hand system, along with the automatic acquisition of news
articles from the Web, and the automatic semantic annotation of these items using
Natural Language Processing tools [
            <xref ref-type="bibr" rid="ref1 ref19">1</xref>
            ] and the Lucene7 indexer shall also be described.
1 Flickr, Photo Sharing, http://www.flickr.com/
2 del.icio.us, Social Bookmark manager, http://del.icio.us/
3 Facebook, Social Networking, http://www.facebook.com/
4 LibraryThing, Personal Online Book Catalogues, http://www.librarything.com/
5 IMDb, Internet Movie Database, http://imdb.com/
6 Last.fm, The Social Music Revolution, http://www.last.fm/
7 Lucene, An Open Source Information Retrieval Library, http://lucene.apache.org/
          </p>
          <p>
            The structure of the paper is the following. Section 2 briefly describes our
approach for representing user preferences and item features using ontology-based
knowledge structures, and how they are exploited by several recommendation models.
Section 3 explains mechanisms to automatically relate and transform social tagging
and external semantic information into our ontological knowledge structures. A real
implementation and evaluation of the previous tag transformation and
recommendation processes within a news recommender system are presented in
section 4. Finally, section 5 presents some conclusions and future research lines.
2 Hybrid recommendations
In this section, we summarise the ontology-based knowledge representation and
recommendation models into which filtered social tags are proposed to be integrated
and exploited.
2.1 Ontology-based representation of item features and user preferences
In the knowledge representation we propose [
            <xref ref-type="bibr" rid="ref13 ref22 ref31 ref4">4, 13</xref>
            ], user preferences are described as
vectors um = (um,1, um,2, ..., um,K), where um,k ∈ [0,1] measures the intensity of the
interest of user um ∈ U in concept ck ∈ O (a class or an instance) in a domain
ontology O, K being the total number of concepts in the ontology. Similarly, items
dn ∈ D are assumed to be annotated by vectors dn = (dn,1, dn,2, ..., dn,K) of concept
weights, in the same vector space as user preferences.
          </p>
          <p>
            The main advantages of this knowledge representation are its portability, thanks to
the XML-based Semantic Web standards, the domain independence of the subsequent
content retrieval and recommendation algorithms, and the multi-source nature of the
proposal (different types of media could be annotated: texts, images, videos).
2.2 Personalised content retrieval
Our notion of content retrieval is based on a matching algorithm that provides a
personal relevance measure pref(dn, um) of an item dn for a user um. This measure
is set according to the semantic preferences of the user and the semantic annotations of
the item, and is based on a cosine vector similarity cos(dn, um). The obtained similarity
values (Personalised Ranking module of Figure 1) can be combined with
non-personalised query-based scores sim(dn, q) and semantic context information (Item
Retrieving module of Figure 1) to produce combined rankings [
            <xref ref-type="bibr" rid="ref13 ref31">13</xref>
            ].
          </p>
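The personal relevance measure pref(dn, um) described above reduces to a cosine similarity over concept-weight vectors, which can be sketched as follows; the vector values here are invented for illustration.

```python
import math

def cosine(u, d):
    """Cosine similarity between two concept-weight vectors in the same space."""
    dot = sum(a * b for a, b in zip(u, d))
    nu = math.sqrt(sum(a * a for a in u))
    nd = math.sqrt(sum(b * b for b in d))
    return dot / (nu * nd) if nu and nd else 0.0

user = [0.8, 0.0, 0.5]   # interest weights for concepts c1..c3
item = [0.9, 0.1, 0.0]   # annotation weights over the same concepts

pref = cosine(user, item)  # personal relevance pref(d, u), in [0, 1] here
```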
          <p>
            To overcome sparsity in user profiles, we propose a preference spreading
mechanism, which expands the initial set of preferences stored in user profiles
through explicit semantic relations with other concepts in the ontology. Our approach is
based on Constrained Spreading Activation (CSA), and is self-controlled by applying a
decay factor to the intensity of preference each time a relation is traversed. We have
empirically demonstrated [
            <xref ref-type="bibr" rid="ref13 ref21 ref3 ref31">3, 13</xref>
            ] that preference extension improves retrieval precision
and recall. It also helps to mitigate other well-known limitations of recommender
systems such as the cold-start, overspecialisation and portfolio effects.
2.3 Context-aware recommendations
The context is represented in our approach [
            <xref ref-type="bibr" rid="ref13 ref31">13</xref>
            ] as a set of weighted ontology concepts.
This set is obtained by collecting the concepts that have been involved in the interaction
of the user (e.g. accessed items) during a session. It is built in such a way that the
importance of concepts fades away with time by a decay factor. Once the context is
built, a contextual activation of user preferences is achieved by finding semantic paths
linking preferences to context. These paths are made of existing relations between
concepts in the ontologies, following the spreading technique mentioned in section 2.2.
2.4 Group-oriented recommendations
The presented user profile representation allows us to easily model groups of users. We
have explored the combination of ontology-based profiles for this purpose [
            <xref ref-type="bibr" rid="ref21 ref3">3</xref>
            ],
on a per-concept basis, following different strategies from social choice theory. In our
approach, user profiles are merged to form a shared group profile, so that common
content recommendations are generated according to this new profile.
2.5 Multi-facet hybrid recommendations
In order to make hybrid recommendations, we cluster the semantic space based on the
correlation of concepts appearing in the profiles of individual users. The obtained
clusters Cq represent groups of preferences (topics of interest) shared by a significant
number of users. Using these clusters, profiles are partitioned into semantic segments.
Each of these segments corresponds to a cluster and represents a subset of the user
interests that is shared by the users who contributed to the clustering process. By thus
introducing further structure in user profiles, we define relations among users at
different levels, obtaining multilayered communities of interest.
          </p>
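The preference-spreading idea of section 2.2 might be sketched roughly as follows, assuming an invented ontology graph and a simple depth-limited propagation; the paper's actual CSA constraints are richer than this sketch.

```python
def spread_preferences(prefs, relations, decay=0.5, depth=2):
    """prefs: {concept: weight}; relations: {concept: [related concepts]}.
    Propagates each weight along relations, damped by `decay` per hop."""
    expanded = dict(prefs)
    frontier = dict(prefs)
    for _ in range(depth):
        nxt = {}
        for concept, w in frontier.items():
            for rel in relations.get(concept, []):
                propagated = w * decay
                # Keep only propagations that raise a concept's weight.
                if propagated > expanded.get(rel, 0.0):
                    nxt[rel] = max(nxt.get(rel, 0.0), propagated)
        expanded.update(nxt)
        frontier = nxt
    return expanded

prefs = {"jazz": 1.0}                                   # explicit preference
relations = {"jazz": ["music"], "music": ["concerts"]}  # toy ontology relations

out = spread_preferences(prefs, relations)
# jazz 1.0 -> music 0.5 -> concerts 0.25
```

The decay factor makes the expansion self-controlled, as the text describes: interest fades with each traversed relation, so distant concepts receive only weak inferred preference.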
          <p>
            Exploiting the relations of the communities which emerge from the users’ interests,
and combining them with item semantic information, we have presented in [
            <xref ref-type="bibr" rid="ref2 ref20">2</xref>
            ] several
recommendation models that compare the current user's interests with those of other
users in two ways: first, according to item characteristics, and second, according to
connections among user interests, in both cases at different semantic layers:
pref(dn, um) = Σq nsim(dn, Cq) · Σi nsimq(um, ui) · simq(dn, ui)
3 Relating social tags to ontological information
Parallel to the proliferation and growth of social tagging systems, the research
community is increasing its efforts to analyse the complex dynamics underlying
folksonomies, and to investigate the exploitation of this phenomenon in multiple
domains. Results reported in [
            <xref ref-type="bibr" rid="ref23 ref5">5</xref>
            ] suggest that users of social systems share behaviours
which appear to follow simple tagging activity patterns. Understanding, predicting
and controlling the semiotic dynamics of online social systems are the base pillars for
a wide variety of applications.
          </p>
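The multi-facet hybrid score pref(dn, um) = Σq nsim(dn, Cq) · Σi nsimq(um, ui) · simq(dn, ui) of section 2.5 can be sketched directly; all similarity values below are invented placeholders, not data from the paper.

```python
def hybrid_pref(item_cluster_sim, user_user_sim, item_user_sim):
    """item_cluster_sim[q]  = nsim(d_n, C_q)
    user_user_sim[q][i]     = nsim_q(u_m, u_i)
    item_user_sim[q][i]     = sim_q(d_n, u_i)
    Returns pref(d_n, u_m) per the formula above."""
    total = 0.0
    for q, nsim_dq in enumerate(item_cluster_sim):
        inner = sum(uu * iu for uu, iu in zip(user_user_sim[q], item_user_sim[q]))
        total += nsim_dq * inner
    return total

# Two semantic clusters, two neighbour users per cluster.
item_cluster_sim = [0.6, 0.2]                # nsim(d_n, C_q)
user_user_sim = [[0.9, 0.5], [0.1, 0.3]]     # nsim_q(u_m, u_i)
item_user_sim = [[0.8, 0.4], [0.2, 0.6]]     # sim_q(d_n, u_i)

score = hybrid_pref(item_cluster_sim, user_user_sim, item_user_sim)
# q=0 contributes 0.6*(0.9*0.8 + 0.5*0.4); q=1 contributes 0.2*(0.1*0.2 + 0.3*0.6)
```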
          <p>
            For these purposes, the establishment of a common vocabulary (set of tags) shared
by users in different social systems is a desirable situation. Indeed, recent works have
focused on the improvement of tagging functionalities to generate tag datasets in a
controlled, coordinated way. P-TAG [
            <xref ref-type="bibr" rid="ref24 ref6">6</xref>
            ] is a method that automatically generates
personalised tags for web pages, producing keywords relevant both to their textual
content and to data collected from the user’s browsing. In [
            <xref ref-type="bibr" rid="ref26 ref8">8</xref>
            ], an adaptation of
user-based collaborative filtering and a graph-based recommender is presented as a tag
recommendation mechanism that eases the process of finding good tags for a
resource and consolidates the creation of a consistent tag vocabulary across users.
          </p>
          <p>
            The integration of folksonomies and the Semantic Web has been envisioned as an
alternative approach to the collaborative organisation of shared tagging information.
The proposal presented in [
            <xref ref-type="bibr" rid="ref11 ref29">11</xref>
            ] uses a combination of pre-processing strategies and
statistical techniques together with knowledge provided by ontologies for making
explicit the semantics behind the tag space in social tagging systems.
          </p>
          <p>In the work presented herein, we propose the use of knowledge structures defined
by multiple domain ontologies as a common semantic layer to unify and classify
social tags from several Web 2.0 sites. More specifically, we propose a mechanism
for the creation of ontology instances for the gathered tags, according to semantic
information collected from the Web. Our method links tagging information to ontological
structures through a sequence of three processing steps:</p>
          <p>
            • Filtering social tags: To facilitate the integration of information from different
social sources, as well as the subsequent translation of that information into
ontological knowledge, a pre-processing of the tags is needed, associating them
with a common vocabulary shared by the different applications involved.
Morphologic and semantic transformations of tags are performed at this stage
based on the WordNet English dictionary [
            <xref ref-type="bibr" rid="ref27 ref9">9</xref>
            ], the Wikipedia8 encyclopaedia and
the Google9 web search engine.
          </p>
          <p>• Obtaining semantic information about social tags: The shared vocabulary is
created with the use of Wikipedia, which provides semantic information about
millions of concepts.</p>
          <p>• Categorisation of social tags into ontology classes: Once the tags have been
filtered and mapped to a shared vocabulary, they are automatically converted
into instances of classes of domain ontologies. Again, semantic categorisation
information available in Wikipedia is exploited in this process.</p>
          <p>These steps are explained in more detail in the next subsections.
8 Wikipedia, The Free Encyclopaedia, http://en.wikipedia.org/
9 Google, Web Search Engine, http://www.google.com/
3.1 Filtering social tags
Raw tagging information can be noisy and inconsistent. When manual tags are
introduced with a non-controlled tagging mechanism, people often make grammatical
mistakes (e.g. barclona instead of barcelona), tag concepts indistinctly in singular,
plural or derived forms (blog, blogs, blogging), sometimes add adjectives, adverbs,
prepositions or pronouns to the main concept of the tag (beautiful car, to read), or use
synonyms and acronyms that could be converted into a single tag (biscuit and cookie,
ny and new york). Moreover, the tag encoding and storage mechanisms used by social
systems often alter the tags introduced by the users: they may transform white spaces
(san francisco, san-francisco, san_francisco, sanfrancisco) and special characters in
the tags (los angeles for los ángeles, zurich instead of zürich), etc.</p>
          <p>Thus, while it is possible to gather information from multiple folksonomy sites, such
as Flickr or del.icio.us, inconsistency will lead to confusion and loss of information
when tagging data is compared. For example, if a user has tagged photos from a recent
holiday in New York with nyc, but also bookmarked relevant pages in del.icio.us with
new_york, the correlation will be lost. In order to facilitate the folksonomy data analysis
and integration, tags have to be filtered and mapped to a shared vocabulary. Here, we
present a tag filtering architecture that makes use of external knowledge resources such
as the WordNet dictionary, Wikipedia encyclopaedia and Google web search engine.</p>
          <p>The filtering process is a sequential execution where the output from one filtering
step is used as input to the next. The output of the entire filtering process is a set of new
tags that correspond to an agreed representation. As will be explained below, this is
achieved by correlating tags to entries in two large knowledge resources: Wordnet and
Wikipedia. Wordnet is a lexical database and thesaurus that groups English words into
sets of cognitive synonyms called synsets, providing definitions of terms, and modelling
various semantic relations between concepts: synonym, hypernym, hyponym, among
others. Wikipedia is a multilingual, open-access, free-content encyclopaedia on the
Internet. Using a wiki style of collaborative content writing, it has grown to become one
of the largest reference Web sites with over 75,000 active contributors, maintaining
approximately 9,000,000 articles in over 250 languages (as of February 2008).
Wikipedia contains collaboratively generated categories that classify and relate entries,
and also supports term disambiguation and dereferencing of acronyms.</p>
          <p>Figure 2 provides a visual representation of the filtering process where a set of raw
tags are transformed into a set of filtered tags and a set of discarded tags. Each of the
numbers in the diagram corresponds to a step outlined below.</p>
          <p>For this work, tags from public available user accounts from Flickr and del.icio.us
sites have been collected and filtered. A total of 1004 user profiles have been gathered
from these two systems, providing 149,529 and 84,851 distinct tags respectively.
Initially, the intersection between both datasets was 28,550 common tags.
Step 1: Lexical filtering
After raw tags have been harvested from different folksonomy sites, they are passed
to the Lexical Filter, which applies several filtering operations. Tags that are too small
(with length = 1) or too large (length &gt; 25) are removed, resulting in a discarding rate of
approximately 3% of the initial dataset. In addition, considering the discrepancies in the
use of special characters (such as accents, dieresis and caret symbol), we convert such
special characters to a base form (e.g., the characters à, á, â, ã, ä, å are converted to a).</p>
          <p>Tags containing numbers are also filtered based on a set of custom heuristics. For
example, to maintain salient numbers, such as dates (2006, 2007, etc), common
references (911, 360, 666, etc), or combinations of alphanumeric characters (7 up,
4 x 4, 35 mm), we discard unpopular tags below a certain global tag frequency
threshold. Finally, common stop-words, such as pronouns, articles, prepositions and
conjunctions are removed. After lexical filtering, tags are passed on to the Wordnet
Manager. If a tag has an exact match in Wordnet, we pass it on directly to the set of
filtered tags, to save further unnecessary processing.</p>
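          <p>Step 1 can be sketched as a simple predicate. The stop-word list and the popularity threshold below are illustrative stand-ins; the actual lists and thresholds used by the system are not reproduced here:

```python
# Illustrative stop-word list; the real system uses a fuller list.
STOP_WORDS = {"the", "a", "an", "to", "of", "in", "on", "and", "or"}

def passes_lexical_filter(tag, global_frequency, min_freq=5):
    """Apply the Step 1 rules: discard too-short/too-long tags and
    stop words; keep numeric tags only when globally popular."""
    if len(tag) == 1 or len(tag) > 25:
        return False
    if tag in STOP_WORDS:
        return False
    if any(ch.isdigit() for ch in tag):
        # salient numbers (2007, 911, 35 mm, ...) survive via popularity
        return global_frequency >= min_freq
    return True

print(passes_lexical_filter("barcelona", 1))  # True
print(passes_lexical_filter("2007", 120))     # True
print(passes_lexical_filter("the", 9999))     # False
```

Tags that survive this predicate would then be checked against Wordnet, as described above.</p>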
          <p>
            Step 2: Compound nouns and misspellings
If a tag is not found in Wordnet, we consider possible misspellings and compound
nouns. Motivated by [
            <xref ref-type="bibr" rid="ref11 ref29">11</xref>
            ], to solve these problems, we make use of the Google “did you
mean” mechanism. When a search term is entered, the Google engine checks whether
more relevant search results are found with an alternative spelling. Because Google’s
spell check is based on occurrences of all words on the Internet, it is able to suggest
common spellings for proper nouns that would not appear in a standard dictionary.
          </p>
          <p>The Google “did you mean” mechanism also provides an excellent way to resolve
compound nouns. Since most tagging systems prevent users from entering white spaces
into the tag value, users create compound nouns by concatenating nouns together or
delimiting them with a non-alphanumeric character such as _ or -, which introduces an
obvious source of complication when aligning folksonomies. By sending compound
nouns to Google, we easily resolve the tag into its constituent parts. This mechanism
works well for compound nouns with two terms, but is likely to fail if more than two
terms are used. For example, the tag sanfrancisco is corrected to san francisco, but the
tag unitedkingdomsouthampton is not resolved by Google.</p>
          <p>We have thus developed a complementary algorithm that quickly and accurately
splits compound nouns of three or more terms. The main idea is to firstly sort the tags
in alphabetical order, and secondly process the generated tag list sequentially. By
caching previous lookups, and matching the first shared characters of the current tag
string, we are able to split it into a prefix (previously resolved by Google) and a
postfix. A second lookup is then made using the postfix to seek further possible
matches. The process is iteratively repeated until no splits are obtained from our
Google Connector. Compared to a bespoke string-splitting heuristic, this process has a
very low computational cost. This mechanism successfully recognizes long compound
nouns such as war of the worlds, lord of the rings, and martin luther king jr.</p>
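          <p>The iterative splitting can be approximated as follows. This is a simplified recursive rendering of the sorted-list-with-caching procedure, not the exact implementation; the set known stands in for terms already resolved through the Google Connector:

```python
def split_compound(tag, known):
    """Split a concatenated compound noun into terms from 'known',
    trying the longest matching prefix first and recursing on the
    postfix; returns None when no complete split exists."""
    if tag == "":
        return []
    for cut in range(len(tag), 0, -1):
        prefix = tag[:cut]
        if prefix in known:
            rest = split_compound(tag[cut:], known)
            if rest is not None:
                return [prefix] + rest
    return None

known = {"war", "of", "the", "worlds", "san", "francisco"}
print(split_compound("waroftheworlds", known))  # ['war', 'of', 'the', 'worlds']
print(split_compound("sanfrancisco", known))    # ['san', 'francisco']
```

Tags with no complete split, such as the unresolved unitedkingdomsouthampton example with an incomplete vocabulary, yield None and remain pending.</p>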
          <p>Similarly to Step 1, after using Google to check for misspellings and compound
nouns, the results are validated against the Wordnet Manager. Unprocessed tags are
added to the pending tag stack, and unmatched tags are discarded.</p>
          <p>Step 3: Wikipedia correlation
Many of the popular tags occurring in community tagging systems do not appear in
standard dictionaries, such as Wordnet, because they correspond to proper names
(such as famous people, places, or companies), contemporary terminology (such as
web2.0 and podcast), or are widely used acronyms (such as asap and diy).</p>
          <p>In order to provide an agreed representation for such tags, we correlate tags to their
appropriate Wikipedia entries. For example, when searching the tag nyc in Wikipedia,
the entry for New York City is returned. The advantage of using Wikipedia to agree on
tags from folksonomies is that Wikipedia is a community-driven knowledge base, much
like folksonomies are, so that it rapidly adapts to accommodate new terminology.</p>
          <p>Apart from consolidating agreed terms for the filtered tags, our Wikipedia
Connector retrieves semantic information about each obtained entry. Specifically, it
extracts ambiguous concepts (e.g., “java programming language” and “java island”
for the entry “java”), and collaboratively generated categories (e.g., “living people”,
“film actors” and “american male models” for the entry “brad pitt”). This information
is exploited by the ontology population and annotation processes described below.
Step 4: Morphologically similar terms
An additional issue to be considered during the filtering process is that users often use
morphologically similar terms to refer to the same concept. One very common example
of this is the discrepancy between singular and plural terms, such as blog and blogs,
and other morphological deviations (e.g. blogging). In this step, using a custom
singularisation algorithm, and the stemming functions provided by the Snowball
library10, we reduce morphologically similar tags to a single tag. For each group of
similar tags, the shortest term found in Wordnet is used as the representative tag.
Step 5: WordNet synonyms
When people communicate a certain concept, they often use synonyms, i.e., terms that
have the same meaning, but with different morphological forms. A natural filtering step
is the simplification of the tag sets by merging pairs of synonyms into single terms.</p>
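          <p>Steps 4 and 5 can be sketched together. The toy stemmer below stands in for the Snowball library, and the in_wordnet predicate for the Wordnet Manager; the real system additionally merges only non-ambiguous Wordnet synonym pairs:

```python
from collections import defaultdict

def crude_stem(tag):
    """Tiny suffix stripper standing in for the Snowball stemmer."""
    for suffix in ("ing", "es", "s"):
        if tag.endswith(suffix) and len(tag) - len(suffix) >= 3:
            tag = tag[: len(tag) - len(suffix)]
            break
    if len(tag) >= 4 and tag[-1] == tag[-2]:  # blogg -> blog
        tag = tag[:-1]
    return tag

def group_similar(tags, in_wordnet):
    """Map each tag to one representative per stem group: the
    shortest member accepted by the in_wordnet predicate."""
    groups = defaultdict(list)
    for t in tags:
        groups[crude_stem(t)].append(t)
    representative = {}
    for members in groups.values():
        candidates = [m for m in members if in_wordnet(m)] or members
        rep = min(candidates, key=len)
        for m in members:
            representative[m] = rep
    return representative

rep = group_similar(["blog", "blogs", "blogging"], lambda t: t == "blog")
print(rep)  # all three tags map to 'blog'
```

The same grouping map can then be reused when merging the non-ambiguous synonym pairs of Step 5, replacing each pair by its most popular member.</p>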
          <p>WordNet provides synonym relations between synsets of the terms. However, due
to ambiguous meanings of the tags, not all of them can be taken into consideration,
and the filtering process must be very carefully executed. Our merging process
comprises three stages. In the first stage, a matrix of synonym relations is created by
using Wordnet. In the second stage, according to the number of synonym relations
found for each tag, we identify the non-ambiguous synonym pairs, and finally, stage
three replaces each of the synonym pairs by the term that is most popular. Examples
of thus processed synonym pairs are android and humanoid, thesis and dissertation,
funicular and cable railway, stein and beer mug, or poinsettia and christmas flower.
10 Snowball, String-handling Language, http://snowball.tartarus.org/
3.2 Obtaining semantic information about social tags</p>
          <p>
In order to populate ontologies with concepts associated to the filtered social tags,
general multi-domain semantic knowledge is needed. In this work, as mentioned
before, we propose to extract that information from Wikipedia. The Wikipedia articles
describe a number of different types of entities: people, places, companies, etc.,
providing descriptions, references, and even images about the described entities.</p>
          <p>Many of these entities are ambiguous, having several meanings for different
contexts. For instance, the same tag “java” could be assigned to a Flickr picture of the
Pacific island, or a del.icio.us page about the programming language. One approach to
address tag disambiguation is by using the information available in Wikipedia. A
Wikipedia article is fairly structured: the title of the page is the entity name itself (as
found in Wikipedia), the content is divided into well delimited sections, and a first
paragraph is dedicated to possible disambiguation options for the corresponding term.
For example, the page of the entry “apple” starts as follows:
• “This article is about the fruit…”
• “For the Beatles multimedia corporation, see…”
• “For the technology company, see…”</p>
          <p>Apart from these elements, every article contains a set of collaboratively generated
categories. Hence, for example, the categories created for the concept “teide” are:
world heritage sites in spain, tenerife, mountains of spain, volcanoes of spain, national
parks of spain, stratovolcanoes, hotspot volcanoes, and decade volcanoes. By
processing this information, we might infer that “teide” is a volcano in Spain.</p>
          <p>Disambiguation and categorisation information have been therefore extracted from
Wikipedia for every concept appearing in our social tag datasets. Once the most
suitable category for a term is determined, we match its relevant categories to classes
defined in the domain ontologies, as explained next.
3.3 Categorisation of social tags into ontology classes</p>
          <p>The assignment of an ontology class to a Wikipedia entry is based on a morphological
matching between the name and the categories of the entry, and the names of the
ontology classes. The ontology classes with names most similar to the name and
categories of the entry are chosen as the classes of which the corresponding individual
(instance) is to be created. The created instances are assigned a URI containing the
entry name, and are given RDFS labels with the Wikipedia categories.</p>
          <p>To better explain the proposed matching method, let us consider the following
example. Let “brad pitt” be the concept we wish to instantiate. If we look up this
concept in Wikipedia, a page with information about the actor is returned. At the end of
the page, several categories are shown: “action film actors”, “american film actors”,
“american television actors”, “best supporting actor golden globe (film)”, “living
people”, “missouri actors”, “oklahoma (state) actors”, “american male models”, etc.</p>
          <p>After retrieving that information, all the terms (tokens) that appear in the name and
categories of the entry (which we will henceforth refer to as entry terms) are
morphologically compared with the names of the ontology classes (assuming that a
class-label mapping is available, as is usually the case). Computing the Levenshtein
distance, and applying singularisation and stemming mechanisms, only the entry terms
that match some class name above a certain distance threshold are kept; the rest are
discarded.
For instance, suppose that “action”, “actor”, “film”, “people”, and “television” are the
ones sufficiently close to some ontology class. To select the most appropriate ontology
class among the matching ones, we firstly create a vector whose coordinates correspond
to the filtered entry terms, taking as value the number of times the term appears in the
entry name and categories together. In the example, the vector might be as follows:
{(action, 1), (actor, 6), (film, 3), (people, 1), (television, 1)}, assuming that “actor”
appears in six categories of the Wikipedia entry “brad pitt”, and so forth.</p>
          <p>Once this vector has been created, one or more ontology classes are selected by the
following heuristic:
1. If a single coordinate holds the maximum value in the vector, we select the
ontology class that matches the corresponding term.
2. In case of a tie between several coordinates having the maximum value, a new
vector is created, containing the matched classes plus their taxonomic ancestor
classes in the ontologies. Then the weight of each component is computed as the
number of times the corresponding class is found in this step. Finally, the original
classes that have the highest-valued ancestor in the new vector are selected.</p>
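          <p>The two-case heuristic can be rendered in a few lines of Python. Here matched_class and ancestors are toy stand-ins for the morphological class matching and for the ontology taxonomy, with illustrative ancestor links:

```python
from collections import Counter

def select_classes(term_vector, matched_class, ancestors):
    """Case 1: a unique maximum selects its matching class.
    Case 2: on a tie, weight the tied classes plus their taxonomic
    ancestors, then keep the tied classes under the heaviest ancestor."""
    top = max(term_vector.values())
    winners = [t for t, v in term_vector.items() if v == top]
    if len(winners) == 1:
        return [matched_class[winners[0]]]
    weights = Counter()
    for t in winners:
        c = matched_class[t]
        weights[c] += 1
        for a in ancestors.get(c, []):
            weights[a] += 1
    heaviest = max(weights, key=weights.get)
    return [matched_class[t] for t in winners
            if heaviest in ancestors.get(matched_class[t], [])]

cls = {"actor": "Actor", "film": "Film", "people": "Person"}
anc = {"Actor": ["CinemaIndustry"], "Film": ["CinemaIndustry"],
       "Person": ["Mammal"]}
print(select_classes({"actor": 6, "film": 3, "people": 1}, cls, anc))
# ['Actor']
print(select_classes({"actor": 1, "film": 1, "people": 1}, cls, anc))
# ['Actor', 'Film']
```

The two calls reproduce the worked examples of the text: a unique maximum selects “Actor”, while the tie is broken through the shared “cinema industry” ancestor.</p>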
          <p>Here “ontology class” and “ancestor” denote a loose notion admitting a broad
range of taxonomic constructs, ranging from informally built subject hierarchies (such
as the ones defined in the Open Directory tree or, in our experiments, the IPTC
Subjects), to pure ontology classes in a strict Description Logic sense.</p>
          <p>In our example, the weight for the term “actor” is the highest, so we select its
matching class as the category of the entry. Thus, assuming that the class matching
this term was “Actor”, we finally define “Brad Pitt” as an instance of “Actor”.</p>
          <p>Now suppose that, instead, the vector for Brad Pitt was {(actor, 1), (film, 1), (people,
1)}. In that case, there would be a tie in the matching classes, and we would apply the
second case of the heuristic. We take the ancestor classes, which could be e.g. “cinema
industry” for “actor”, “cinema industry” for “film”, and “mammal” for “person”, and
create a weighted list with the original and ancestor classes. Then we count the number
of times each class appears in the previous list, and create the new vector: {(actor, 1),
(film, 1), (person, 1), (cinema industry, 2), (mammal, 1)}. Since the class “cinema
industry” has the highest weight, we finally select its sub-classes “actor” and “film” as
the classes of the instance “brad pitt”.</p>
          <p>
            We must note that our ontology population mechanism does not necessarily
generate individuals following a strict semantic “is-a” schema, but a more relaxed
semantic “is-related-to” association principle. This is not a problem for our final
purposes in personalised content retrieval, since the annotation and recommendation
methods in that area are themselves rooted in models of an inherently approximate
nature, e.g. regarding the relationships between concepts and item contents.
4 Preliminary evaluations
Recent works show an increasing interest in using social tagging information to enhance
personalised content retrieval and recommendation. FolkRank [
            <xref ref-type="bibr" rid="ref25 ref7">7</xref>
            ] is a search algorithm
that exploits the structure of folksonomies to find communities and organise search
results. The recommender system presented in [
            <xref ref-type="bibr" rid="ref10 ref28">10</xref>
            ] suggests web pages available on the
Internet, by using folksonomy and social bookmarking information. The movie
recommender proposed in [
            <xref ref-type="bibr" rid="ref12 ref30">12</xref>
            ] is built on keywords assigned to movies via collaborative
tagging, and demonstrates the feasibility of making accurate recommendations based on
the similarity of item keywords to those of the user’s rating tag-clouds.
          </p>
          <p>In the following, we present and preliminary evaluate how our ontological
knowledge representation, recommendation models, and tag filtering and matching
strategies are integrated in News@hand, a news recommender system.
4.1 News@hand
News@hand is a news recommender system that describes news contents and user
preferences with a controlled and structured vocabulary, using semantic-based
technologies, and integrating the recommendation models described in section 2.
Figure 3 depicts how ontology-based item descriptions and user profiles are created and
exploited by the system.</p>
          <p>News items are automatically and periodically retrieved from several on-line news
services via RSS feeds. The title and summary of the retrieved news are annotated with
concepts of the domain ontologies. A dynamic graphic interface allows the system to
automatically retrieve all the users’ inputs in order to analyse their behaviour with the
system, update their preferences, and adjust the recommendations in real time.</p>
          <p>Figure 4 shows a screenshot of a typical news recommendation page in News@hand.
The news items are classified into eight different sections: headlines, world, business,
technology, science, health, sports and entertainment. When the user is not logged into
the system, s/he can browse any of the previous sections, but the items are listed without
any personalised criterion. On the other hand, when the user is logged into the system,
recommendation and profile edition functionalities are activated, and the user can
browse the news according to his and others’ preferences in different ways. Click
history is used to detect the short term user interests, which represent the dynamic
semantic context exploited by our personalised content retrieval mechanism.
– It could be union of and (and more): From the other perspective (see
also previous template), this template provides different types of the item.</p>
          <p>The player can extend this template by adding more items.
– It is complement of the : This template provides complement
objects/concepts of the item.
– It is disjoint with (opposite of) : This template provides the
objects/concepts that are disjoint with the item.
– It is equivalent to the : This template provides equivalent objects/concepts
to the item.</p>
          <p>Note that in this phase we have also the notion of prohibited statements.
Prohibited statements are actually those statements that most players decide to
choose first. we are not interested to collect these statements all time, so we do
not give the opportunity to the player (narrator) to use them.</p>
          <p>However, the players should use these templates, as we build OWL ontologies
by aid of these templates, but we give also the option to the narrator to build
arbitrary sentences as well, if the templates can not be useful. These arbitrary
sentences will build comments for the generated ontology.
3 Generating OWL-based Ontology</p>
          <p>In this section, we introduce the translation mechanism that we use to generate
OWL-based ontologies. The ontology is created for the object that the
players are playing with, e.g. book, computer, car. After every play involving an
object (item), we collect some common sense facts about that item and can build
an ontology from them. The first iteration of ontology generation is a draft and
cannot be considered a complete ontology. In other words, the ontology is created
over several iterations and not in a single pass.
For the approved properties, i.e. the properties whose frequencies exceed
a threshold, an owl:Class is generated. These classes are the
transformation of the properties into an OWL representation using a mediator/mapper, which
is simply able to generate classes and their properties. Suppose a domain like a
book: for every approved property or concept, a class and a link are generated
to associate this class with the main concept, which in our example is a book. Figure 1
illustrates the mapping between some selected properties and their OWL
representations. As we mentioned earlier, the properties are stored in a knowledge
base (KB) and, as soon as they are mature enough to be linked, the mapper
translates them into OWL and links them to the main concept. In the following
sections, we provide a more detailed description.</p>
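          <p>A minimal mapper of this kind can be sketched as follows. This is a hypothetical rendering; the exact OWL shape produced by the mapper of Figure 1 is not reproduced here, and the has_ naming of the linking property is our own assumption:

```python
import xml.etree.ElementTree as ET

OWL = "{http://www.w3.org/2002/07/owl#}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
RDFS = "{http://www.w3.org/2000/01/rdf-schema#}"

def property_to_owl(main_concept, prop):
    """Generate an owl:Class for an approved property and an object
    property linking it to the main concept."""
    cls = ET.Element(OWL + "Class", {RDF + "ID": prop})
    link = ET.Element(OWL + "ObjectProperty", {RDF + "ID": "has_" + prop})
    ET.SubElement(link, RDFS + "domain", {RDF + "resource": "#" + main_concept})
    ET.SubElement(link, RDFS + "range", {RDF + "resource": "#" + prop})
    return ET.tostring(cls).decode() + "\n" + ET.tostring(link).decode()

print(property_to_owl("book", "author"))
```

For the book example, this emits a class for author together with a has_author property whose domain is the main concept book.</p>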
          <p>Fig. 1. Generating OWL for Properties Using a Mapper/Mediator
Pre-Refinement of Concepts (Refining Before Mediation). As we
mentioned earlier, the concepts need to be refined. The refinement process is as
follows: because a specific object can be played more than once, we assign a
counter to every object, and the counter increases whenever players play that
object. We call this counter objectCounter, where the word object is
replaced with the explicit name of the object. A counter is also assigned to every
property that the players agree upon during the game; upon further
agreement by other players, the counter increases. We call this counter
objectPropertyCounter, where object is replaced with the explicit name of the
object and property with the explicit name of the property of
the object. The variance defined for each property is calculated as
objectCounter minus objectPropertyCounter. If the result is greater than threshold1,
the property is moved to the prohibited list, as many pairs agreed upon that
property, and if it is less than threshold2, the property is deleted, as only very
few pairs agreed upon it. Note that we do not distinguish between uppercase
and lowercase letters. Listing 1.1 demonstrates the pseudocode of
this refinement.</p>
          <p>Listing 1.1. Pseudocode of Refining Concepts
if (object is selected) then
    objectCounter++;
    if (objectProperty is selected) then
        objectPropertyCounter++;
variance(objectProperty) = objectCounter - objectPropertyCounter;
if (normalize(variance(objectProperty)) &gt; threshold1) then
    move objectProperty to prohibited list;
if (normalize(variance(objectProperty)) &lt; threshold2) then
    delete objectProperty;
Concept Mediator/Mapper. Concept mediator/mapper is simply a mapper
that gets the property or concept as input and generates OWL statements as
output. The OWL statement contains also the link that associates the property
to the main object. Figure 1 demonstrates some sample inputs and outputs of
the mediator/mapper. However, in this step we do not yet have our ontology;
we have only gathered the properties and built their links. The ontology will be
created after gathering sufficient facts about the object.</p>
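          <p>The decision rule of Listing 1.1 can be expressed compactly in Python. Since the paper does not define normalize, we assume here, purely for illustration, that it divides the variance by objectCounter; the threshold values are likewise illustrative:

```python
def refine_property(object_counter, property_counter,
                    threshold1=0.8, threshold2=0.2):
    """Keep, delete, or prohibit a property, following Listing 1.1."""
    variance = object_counter - property_counter
    # Assumed normalisation: variance as a fraction of total plays.
    normalized = variance / object_counter if object_counter else 0.0
    if normalized > threshold1:
        return "prohibit"
    if threshold2 > normalized:
        return "delete"
    return "keep"

print(refine_property(10, 1))  # prohibit
print(refine_property(10, 9))  # delete
print(refine_property(10, 5))  # keep
```

The same rule, with objectCounter2 and objectTInstanceCounter, applies to the instance refinement described later.</p>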
          <p>Post-Refining (Refinement After Mediation). After generating the OWL
representations of properties, they also need to be purified. Refining statements is
an iterative task and tries to build a summarized version of statements based on
resource URIs. Figure 2 demonstrates a sample of this post-refinement.
In the previous section, we presented the fixed templates that we use to gather
common sense facts about objects. As we mentioned, those templates were
carefully chosen for two main purposes: First, to be able to be translated into OWL
using a mediator/mapper and second, to avoid the game being boring, as we need
to entertain players, instead of assigning tasks to them. Table 1 demonstrates the
general translation of templates. Note that &amp;xsd; refers to XSD namespace which
is actually xmlns:xsd = ”http://www.w3.org/2001/XMLSchema#”. To avoid a
huge messy table, we decided to use acronyms.</p>
          <p>Table 1. General translation of the templates into OWL; the OWL column is shown where recoverable:
Template: X has at least Y
Template: X has at most Y
Template: X is kind of …
Template: X could be either … or … (or more)
Template: X could be union of … and … (and more)
Template: X is complement of …
Template: X is disjoint with (opposite of) …, translated as
&lt;owl:Class rdf:ID="some concept"/&gt;
&lt;owl:Class rdf:ID="X"&gt;
  &lt;owl:disjointWith&gt;
    &lt;owl:Class rdf:about="#some concept"/&gt;
  &lt;/owl:disjointWith&gt;
&lt;/owl:Class&gt;
Template: X is equivalent to …, translated as
&lt;owl:Class rdf:ID="some concept"/&gt;
&lt;owl:Class rdf:ID="X"&gt;
  &lt;owl:equivalentClass&gt;
    &lt;owl:Class rdf:about="#some concept"/&gt;
  &lt;/owl:equivalentClass&gt;
&lt;/owl:Class&gt;</p>
          <p>
Pre-Refinement of Statements (Refinement Before Mediation). The
main goal of Pre-Refinement is to select the statements that can be translated
into correct OWL statements. The process is as follows: like the previous
refinement, we assign a counter to the object; we call this counter objectCounter2.
We also assign a counter to every instance of a template related to the object;
we call this counter objectTInstanceCounter. We log all instances that are sent
to the guesser. If an instance was helpful and the guesser could guess the word
correctly, we increase objectTInstanceCounter, but if the instance was not useful
and the guesser was not able to guess the word, we decrease
objectTInstanceCounter. We compare objectTInstanceCounter with some thresholds and then
decide whether to keep the instance, delete it, or move it into the prohibited list.
Note that in this refinement, we do not distinguish between uppercase and
lowercase letters. Listing 1.2 demonstrates the pseudocode of this refinement.</p>
          <p>Listing 1.2. Pseudocode of Refining Instances
if (object is selected) then
    objectCounter2++;
    if (objectTInstance was helpful) then
        objectTInstanceCounter++;
    else
        objectTInstanceCounter--;
variance(objectTInstance) = objectCounter2 - objectTInstanceCounter;
if (normalize(variance(objectTInstance)) &gt; threshold3) then
    move objectTInstance to prohibited list;
if (normalize(variance(objectTInstance)) &lt; threshold4) then
    delete objectTInstance;</p>
          <p>Statement Mediator/Mapper. Statement mediator/mapper is simply a
mapper that gets the template instance as input and generates OWL statements as
output. The OWL statements also contain all necessary links to the main object.
Table 1 demonstrates the OWL translation of some fixed templates. Note that
the italicised words are the variable words filled in by the narrator.
Post-Refinement (Refinement After Mediation) and Ontology
Assembler. After generating OWL representations, they need to be purified. Refining
statements is an iterative task that tries to build a summarized version of
statements based on resource URIs. Figure 3 demonstrates a sample of statement
refinement.</p>
          <p>As we mentioned earlier, the fixed templates are just highly recommended
proposals. If they do not help the narrator guide the guesser, he/she may simply
use English sentences. As these sentences have no structure, we keep them as
comments in the ontology if they were helpful for the guesser.</p>
          <p>Fig. 3. Statement Refinement Sample</p>
          <p>After all these processes, the general assembler is able to merge these
statements and build the first version of the ontology. This is an iterative task and
the ontology will be completed after several plays. Every Ontology has a
version track using owl:versionInfo that enables us to keep the history of generated
ontologies. Figure 4 demonstrates the iterative life cycle of generating ontologies.
4
To evaluate the quality of the generated ontologies, we have checked how they
change with an increasing number of rounds. To keep the presentation manageable,
we reduced the number of rounds to ten and the number of concepts to
two (tree and book).</p>
          <p>In the first round (see Section 2.1), the properties color, height, and age were
collected for the word tree. After ten rounds, we additionally collected leaves and
species. The same test performed for the word book resulted in five properties:</p>
          <p>Fig. 4. Iterative Life Cycle of Generating Ontologies
author, language, publisher, title, and year of publishing. Five more rounds gave us
three additional properties: number of pages, language and index. Tables
2 and 3 present the results that we have collected; we show both the words
that affected the created ontology and the words that were rejected. However,
the rejected words can become properties of the ontology, if we perform more
rounds.</p>
          <p>By analyzing more and more examples, we noticed that the number of
properties does not grow linearly with the number of rounds. Additionally, some of
the players were using plural versions of the words. This problem can be solved,
however, by using dictionaries. Moreover, the results provided by native
speakers were much more accurate, and they responded faster. We suggest using
lists of forbidden words; such lists force users, especially non-native English
speakers, to use ever more sophisticated vocabulary, since otherwise they stop
earning points at some stage. Hence, they have to learn new vocabulary.</p>
          <p>The next part of our experiment was to evaluate the second phase (see Section
2.2), in which each person was asked a set of questions related to the common
sense facts. Again we used the same words: tree and book. For the word tree,
there were just three questions that let the players successfully complete a
round: it is a kind of a plant; it has at least 1 height; it could be either oak or
larch. Five more rounds introduced additionally two more facts to our knowledge
base: it is disjoint with animals; and it has at least 1 root. The same example for
the word book resulted in three common sense facts in five rounds: it has at least
1 edition; it has at least 1 language; it could be either hard-copy or electronic.
Five more rounds resulted in two new statements: it has at least 1 author ; and
it has at least 1 title. Again we note that more and more rounds are necessary
to improve the quality of the ontologies.
</p>
          <p>Table 2. Accepted and rejected words for the word tree.
Rounds: 5. Accepted: Color, Height, Age. Rejected: Bark, Animals, Location, Kind, Fruit, Root, Branches, Green, Flower, Species, Width, Status, Leaves Falling, Seeds, Kind.
Rounds: 10. Accepted: Color, Height, Age, Leaves, Species. Rejected: Bark, Animals, Location, Kind, Fruit, Root, Branches, Green, Flower, Width, Status, Type, Name, Leaves Falling, Seeds, Kind.</p>
          <p>Table 3. Rejected words for the word book (the accepted words are those listed in the text above).
Rounds: 5. Rejected: Pages, Chapters, Words, Paragraph, Index, Foreword, Thickness, Audience Age, ISBN, Text, Abstract, Color.
Rounds: 10. Rejected: Pages, Chapters, Words, Paragraph, Index, Foreword, Thickness, Audience Age, ISBN, Text, Abstract, Color, Cover Type, Domain.</p>
          <p>5 Discussions</p>
          <p>
The aim of the OntoPair game is to build simple ontologies for different objects
that are located in images or even text-based objects in a short time. Our main
concern is that the game should be entertaining to encourage people to play it.
For this reason, we should avoid complex domains to be played. Some
complicated concepts like business categorizations can be out of scope of this game, as
these complicated domains may make the game boring and players will not come
back again. The other point is that the generated ontologies may not contain
all information regarding a domain, as the players are very ordinary people and
not from Semantic Web domain. This is the main advantage of the game, as it
cleverly uses people from different domains to help the Semantic Web domain
experts and scientists. However, we believe that ontologies will be complicated
after each play.</p>
          <p>Even though we proposed that the players should be paired randomly, there
is some potential for cheating: players could agree to log in at the same time
in order to be paired together and maliciously annotate the objects. To detect this,
we propose presenting, at random times and based on previous plays, specific
images or texts for which we know the exact properties of the objects; if
we notice that the players are not playing honestly, we let them play as long
as they want. The same solution is foreseen for the second phase of the game. As
we mentioned, to increase certainty, we only assign properties and statements
to objects if a certain number of players agreed upon them. As an
example, if only two players among many agreed that a car has a wing, we
give a low ranking to wing and, after filtering the properties using a threshold,
we omit it.</p>
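The agreement-threshold filtering described above can be sketched as follows; the 50% threshold, the data layout and the function name are illustrative assumptions, not details taken from the paper:

```python
from collections import Counter

# Sketch of the filtering step: a property is attached to an object only
# if enough players agreed on it. Threshold and layout are assumptions.

def filter_properties(votes, total_players, threshold=0.5):
    """Keep properties proposed by at least `threshold` of all players."""
    counts = Counter(votes)  # one entry per agreeing player
    return {prop for prop, n in counts.items() if n / total_players >= threshold}

# Example: 10 players annotated "car", but only 2 agreed on "wing".
votes = ["wheel"] * 9 + ["engine"] * 7 + ["wing"] * 2
accepted = filter_properties(votes, total_players=10)  # "wing" is omitted
```

With the hypothetical threshold of 50%, "wing" (2 votes out of 10) falls below the cut-off and is dropped, exactly as in the car-has-wing example above.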
          <p>Statistics and our experience show that word-guessing games are played by
many people, as these games are entertaining. Many people from non-English-speaking
countries play these games to improve their English.</p>
          <p>
For evaluating the generated ontologies, the game can be played in single-player
mode, with the single player playing against already-generated ontologies. If the
generated ontologies contain sufficient knowledge, the guesser should be able to
guess the correct words; otherwise a low ranking will be assigned to the generated
ontology. The other approach to evaluating OntoPair is to compare the
generated ontologies with ontologies that have been created by domain experts;
e.g. we can compare two ontologies for a domain like book, one from the OntoPair
repository and the other generated by hand.
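The second evaluation approach, comparing a game-generated ontology with an expert-built one, could for instance score the overlap of their property sets. The use of Jaccard similarity and the sample property sets below are our own assumptions, not a metric prescribed by the paper:

```python
# Sketch of the comparison-based evaluation: score the overlap between a
# game-generated property set and an expert-built one. Jaccard similarity
# is our choice of metric, and the sample sets are invented for illustration.

def jaccard(generated, expert):
    """Set overlap in [0, 1]; 1.0 means identical property sets."""
    if not generated and not expert:
        return 1.0
    return len(generated & expert) / len(generated | expert)

game_book = {"pages", "chapters", "author", "title", "isbn"}
expert_book = {"pages", "author", "title", "isbn", "publisher", "year"}
score = jaccard(game_book, expert_book)  # 4 shared out of 7 distinct
```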
6
In [
            <xref ref-type="bibr" rid="ref11 ref29">11</xref>
            ], the authors present an approach for building ontologies using a game
called OntoGame. They use Wikipedia articles as conceptual entities, present
them to the players, and have the users judge the ontological nature and find a
common abstraction for a given entry [
            <xref ref-type="bibr" rid="ref11 ref29">11</xref>
            ]. Our approach is different, as we do
not build a tree structure for objects. Instead, in two phases, we gather properties and
cardinalities as well as different instances of an object.
          </p>
          <p>
There are also some efforts towards building a knowledge base by means of
computer-based games. These games have been designed mostly for two players.
The ESP game [
            <xref ref-type="bibr" rid="ref25 ref7">7</xref>
] annotates images by requiring players to agree on
the exact objects located in them. Peekaboom [
            <xref ref-type="bibr" rid="ref27 ref9">9</xref>
] is another game, which tries
to determine the approximate location of objects in an image. Verbosity [
            <xref ref-type="bibr" rid="ref26 ref8">8</xref>
] is a
word-guessing game involving two players, a narrator and a guesser; the
former should guide the latter to the word he is looking for,
using some fixed templates for this purpose. Common Consensus [
            <xref ref-type="bibr" rid="ref22 ref4">4</xref>
            ] is very
similar to Verbosity [
            <xref ref-type="bibr" rid="ref26 ref8">8</xref>
            ], but it has its own templates which begin mostly with
Wh* questions. Phetch [
            <xref ref-type="bibr" rid="ref12 ref30">12</xref>
] is another game composed of two players, a
narrator and a guesser; the narrator should give the guesser some keywords to help
him or her select the right image from a list of images. In other words, Phetch’s
main goal is finding a specific image among a set of similar images.
          </p>
          <p>
There are also some other efforts in this general direction, mostly for
designing single-player games. LabelMe [
            <xref ref-type="bibr" rid="ref2 ref20">2</xref>
] is one example, which assigns the player an
image for annotation. Cyc1 is an artificial intelligence project that attempts
to assemble a comprehensive ontology and database of everyday common-sense
knowledge, with the goal of enabling AI applications to perform human-like
reasoning [
            <xref ref-type="bibr" rid="ref14 ref32">14</xref>
            ]. Cyc offers a web-based game called FACTory2, which gives the single
player several sophisticated common-sense facts from different domains; the
player should mark them as true or false statements within a short time period.
1 http://www.cyc.com/
          </p>
          <p>
At the beginning of the 1980s, Wille [
            <xref ref-type="bibr" rid="ref18 ref36">18</xref>
            ] initiated his work on a theory known
as Formal Concept Analysis. The aim of the theory is to analyze data and
identify conceptual structures within data sets. This work expanded rapidly
several years later and has been successfully applied in some specific domains,
e.g. bio-medicine [
            <xref ref-type="bibr" rid="ref1 ref19">1</xref>
            ]. However, such an approach often requires domain experts
to approve the results.
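As a toy illustration of Formal Concept Analysis (the context below is our own example, not from the cited work): given a binary object-attribute context, a formal concept is a pair (A, B) where A is exactly the set of objects sharing all attributes in B, and B is exactly the set of attributes common to all objects in A:

```python
from itertools import combinations

# Toy object-attribute context; the animals and attributes are invented.
context = {
    "duck":  {"flies", "swims"},
    "eagle": {"flies", "hunts"},
    "shark": {"swims", "hunts"},
}
objects = set(context)
attributes = set().union(*context.values())

def intent(objs):
    """Attributes shared by every object in objs (all attributes for the empty set)."""
    return set.intersection(*(context[o] for o in objs)) if objs else set(attributes)

def extent(attrs):
    """Objects possessing every attribute in attrs."""
    return {o for o in objects if attrs <= context[o]}

# Enumerate every formal concept by closing each subset of objects
# (exponential, but fine for a context this small).
concepts = set()
for r in range(len(objects) + 1):
    for objs in combinations(sorted(objects), r):
        b = intent(set(objs))
        a = extent(b)
        concepts.add((frozenset(a), frozenset(b)))
```

For this context the closure yields eight concepts, e.g. ({duck, eagle}, {flies}): exactly the fliers, and exactly what the fliers share.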
7
          </p>
          <p>Conclusion and Future Work
We have presented our work towards OntoPair, a game that uses Collective
Intelligence for building OWL-based ontologies. OntoPair collects properties and
common-sense facts about an object in an entertaining environment and builds
simple domain ontologies. We described how players compete and how
computers process and integrate the results. We also performed a simple
experiment showing how our knowledge base grows. Our prototype
implementation is still in progress3 and needs further work in the data and user
management areas. Moreover, future work will include a reputation model
that gives more weight to users held in high esteem. Linking different
ontologies together can also be considered as a next phase. As an example, if we
build an ontology for a wheel, and we have a common-sense fact indicating that
a car has a wheel, we may link the car and wheel ontologies. Furthermore, we
would like to perform more experiments to investigate how long it would take
a domain expert and ontology engineer to build an equivalent ontology. We also
would like to test OntoPair in more specific domains.</p>
          <p>Acknowledgments. The authors would like to thank Dr. Axel Polleres for his
valuable comments. This work is partially supported by Ecospace (Integrated
Project on eProfessional Collaboration Space) project: FP6-IST-5-35208, Lion
project supported by Science Foundation Ireland under Grant No.
SFI/02/CE1/I131, and Enterprise Ireland under Grant No. ILP/05/203.
2 http://207.207.9.186/
3 http://sourceforge.net/projects/OntoPair
Semantically Enhanced Webspace for Scientific</p>
          <p>Collaboration
Daniel Harezlak1, Piotr Nowakowski1, and Marian Bubak1,2
1 Academic Computer Center CYFRONET AGH
ul. Nawojki 11, 30-950 Krakow, Poland
2 Institute of Computer Science AGH
al. Mickiewicza 30, 30-059 Krakow, Poland</p>
          <p>d.harezlak@cyf-kr.edu.pl
Abstract. The paper presents an approach to constructing a collective
Web-based system for knowledge management. The work refers to the
concepts and ideas widely promoted by modern web communities, such as
user-created and user-annotated content or reliable search mechanisms.
Formal means, such as ontology-to-model dependencies within collective
knowledge, are also used to build the proposed system. The main focus
of this effort is directed towards scientific communities in which large
amounts of experimental data need to be classified and verified. For this
purpose an enhanced set of available Web tools needs to be assembled
and made available as a unified system.</p>
          <p>
            Key words: semantic models, web management, application plan,
collaborative research
1
The need to represent knowledge in a language that both people and computers
can comprehend is obvious and was demonstrated almost a decade ago [
            <xref ref-type="bibr" rid="ref1 ref19">1</xref>
            ]. Since
then, significant effort has been invested in combining the formalisms of descriptions
that can be parsed by computers with free-text content published by people all
over the world, creating the new notion of the Semantic Web. According to the
survey [
            <xref ref-type="bibr" rid="ref2 ref20">2</xref>
            ] the Semantic Web is increasing its momentum by expanding in the
areas of Internet computing such as trade, business and travel, not to mention
the science domain. Currently we observe that the technologies and tools used for
knowledge representation and management are becoming more stable and thus
models and services are being proposed [
            <xref ref-type="bibr" rid="ref21 ref22 ref3 ref4">3, 4</xref>
            ] to realize the vision of large-scale
knowledge integration.
          </p>
          <p>This paper focuses on scientific aspects of the Semantic Web, especially on
knowledge- and data-intensive applications, which need to benefit more fully from
the possibilities that become available through the manifestation of the Semantic
Web and its extensions. The basic challenge is to combine the collaborative and
global methods of using Web resources with individual and geographically
scattered research activities. Many modern approaches try to exploit the techniques
available in social Web management, such as tagging, ranking or editing Web
content by all users. However, more formal mechanisms are required for
scientific purposes. This goal can be supported by applying a strict semantic
framework to the way in which Web research is conducted. That is why we propose a
solution that incorporates a semantic layer into the available Web management
routines to facilitate scientific research.</p>
          <p>
A need for such an environment was observed in the ViroLab project [
            <xref ref-type="bibr" rid="ref23 ref5">5</xref>
            ] which
develops a virtual laboratory [
            <xref ref-type="bibr" rid="ref24 ref6">6</xref>
            ] to facilitate medical knowledge discovery and
provide decision support for HIV drug resistance [
            <xref ref-type="bibr" rid="ref25 ref7">7</xref>
            ]. Three groups of users have
been identified: clinicians using decision support systems for drug ranking,
experiment developers who plan complex biomedical simulations, and experiment
users who apply prepared experiments (scripts) [
            <xref ref-type="bibr" rid="ref26 ref8">8</xref>
            ]. An experiment is a kind of
processing which may involve acquiring input data from distributed resources,
running remote operations on this data, and storing results in a dedicated space,
which should not limit its functionality to the medical disciplines but should extend
into other areas of science.
          </p>
          <p>The following section reviews current achievements in the Semantic Web
area. Subsequently, a list of requirements for the proposed solution is presented.
The following two sections contain the architecture and the proposed semantic
enhancements, followed by the current implementation status and a summary with
a future workplan.</p>
          <p>
This work tries to go beyond the present state of building scientific web
communities by proposing a system which covers traditional computation
infrastructures with lightweight yet reliable, research-oriented web
interfaces supporting knowledge management. In principle, it builds upon existing
achievements of the Semantic Web; however, a novel approach to managing
semantic descriptions by web community members is introduced. This requires new
combinations of tools for managing semantic metadata and social techniques for
editing web content.
2
Modern systems in which semantic descriptions are used to represent
knowledge generally apply tested and reliable languages, such as OWL [
            <xref ref-type="bibr" rid="ref27 ref9">9</xref>
            ], which is
based on an older RDF specification [
            <xref ref-type="bibr" rid="ref10 ref28">10</xref>
            ]. Another standard used by a significant
group of people is WSMO [
            <xref ref-type="bibr" rid="ref11 ref29">11</xref>
            ], which provides methods to semantically describe
Web services. A problem, however, arises when different groups of researchers
try to create descriptions of the same phenomena or elements of reality,
resulting in inconsistencies when such descriptions are merged. This requires manual
alignment, which can be very time-consuming and inefficient. In order to
efficiently build ontologies, a semiautomatic tool is required to provide feedback on
preexisting descriptions and enable scientists to further build upon them, thus
ensuring coherency.
          </p>
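The semi-automatic feedback on preexisting descriptions might, as a minimal sketch, start from label similarity between concept names, leaving the confirmation to the scientist. The heuristic, the cutoff and the sample labels below are our assumptions, not part of the described tool:

```python
from difflib import SequenceMatcher

# Naive alignment hint: propose likely-equivalent concept pairs from two
# ontologies by string similarity of their labels, for a human to confirm.

def suggest_alignments(labels_a, labels_b, cutoff=0.8):
    """Return (label_a, label_b, score) pairs above the similarity cutoff."""
    suggestions = []
    for a in labels_a:
        for b in labels_b:
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= cutoff:
                suggestions.append((a, b, score))
    return sorted(suggestions, key=lambda t: -t[2])

pairs = suggest_alignments({"Virus", "DrugResistance"},
                           {"virus", "drug_resistance", "Patient"})
```

A real tool would combine such lexical hints with structural evidence, but even this sketch shows how preexisting descriptions can be surfaced for reuse rather than re-created.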
          <p>
            It is easy to observe that the social Web has evolved into a global
collaboration space where people from all over the world exchange experience using
systems such as Facebook [
            <xref ref-type="bibr" rid="ref12 ref30">12</xref>
            ] or Flickr [
            <xref ref-type="bibr" rid="ref13 ref31">13</xref>
            ]. This way of collaboration has made
the Web an interesting tool for scientific communities to exchange
research results and knowledge. Several attempts have been undertaken to benefit
from those ideas, resulting in applications like [
            <xref ref-type="bibr" rid="ref14 ref32">14</xref>
            ] and new trends in
semantic computing [
            <xref ref-type="bibr" rid="ref15 ref33">15</xref>
            ]. These attempts, however, still lack general acceptance and
stability. Nevertheless, several environments are already available and are being
used by smaller groups. For example, myExperiment [
            <xref ref-type="bibr" rid="ref16 ref34">16</xref>
            ], currently in its beta
testing phase, is a successor to well-accepted workflow management systems
such as Taverna [
            <xref ref-type="bibr" rid="ref17 ref35">17</xref>
            ] or BIOSteer [
            <xref ref-type="bibr" rid="ref18 ref36">18</xref>
            ]. The project delivers a Web-based system
for sharing workflows among community members; however, the infrastructure
does not provide features that allow workflow execution and result management.
3
          </p>
          <p>
Requirements
In order to satisfy potential researchers, any new system should ease their work.
Therefore, basic requirements should be identified first. Below we present a list
which attempts to formalize the process by which research is conducted. In
particular, it is assumed that each type of supported scientific research can be aided
either by applying a computer system to conduct a virtual experiment (such as
a simulation) or by presenting the results in a digital format. The following is a list
of basic requirements for a knowledge Web management system.
– application plan storage - The notion of an application plan exists in various
domains of science and can be described as a list of steps necessary to achieve
a certain result. There are many ways to represent such a list. It can be
accomplished either by building a workflow (e.g. with the BPEL [
            <xref ref-type="bibr" rid="ref37">19</xref>
            ] notation)
or by using a script (with any available scripting language). The requirement
is to provide a facility for application storage that can be accessed by
authorized users. In this way published applications can be discovered, reused,
assessed and improved by other scientists.
– managing application execution - For application plan execution to be
possible, an underlying infrastructure has to be deployed and a proper
application plan execution engine needs to be set up. The whole process of
application execution has to be visualized for the user and, if necessary,
intermediate results should be delivered.
– managing scientific results - The outcome of a research activity should be
represented by a result stored in a dedicated database. The results should
be properly annotated and classified, and available to other scientists for
verification purposes.
– collaborating with other scientists - The system should provide
collaboration tools enabling scientists to discover each other's work, properly restricted by
security and copyright agreements. It should also be convenient to exchange
experience and validate others' work within one system.
          </p>
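A minimal sketch of the data model implied by the first requirement (discoverable, reusable application plans) might look as follows; every class, field and tag name here is our own invention, not the system's actual API:

```python
from dataclasses import dataclass, field

# Hypothetical plan-storage model: a plan is a named list of steps (a
# workflow or a script, per the paper) owned by a user and tagged for search.

@dataclass
class ApplicationPlan:
    name: str
    steps: list                    # workflow steps or a script
    owner: str
    tags: set = field(default_factory=set)

class PlanRepository:
    """Stores published plans so other scientists can discover and reuse them."""
    def __init__(self):
        self._plans = {}

    def publish(self, plan):
        self._plans[plan.name] = plan

    def find_by_tag(self, tag):
        return [p for p in self._plans.values() if tag in p.tags]

repo = PlanRepository()
repo.publish(ApplicationPlan("drug-ranking", ["fetch data", "rank drugs"],
                             owner="alice", tags={"HIV", "simulation"}))
```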
          <p>The presented list of requirements should be supported by a semantic model
that facilitates all the functionalities which are to be provided by the proposed
system.</p>
          <p>Another non-functional requirement is to separate the process of
application development from the conduct of research. On the one hand, developers do not
want to be burdened with the semantics of a certain research area but only with,
e.g., data formats, computation optimization, etc. On the other hand, scientists
want to focus only on the research without knowing the specifics of the actual
implementation. This requires a certain separation layer, provided by the
experiment plan. The only notions shared between the two groups are the
experiment plan, the input data and the experiment result. Developers write
experiments together with the underlying services, components, etc., which require input
data and produce results (the format of the data is of course to be agreed
between those two groups). The researchers execute the experiments, and validate and
classify the data, being able to manage the semantic layer.</p>
          <p>One last requirement that was identified is cross-disciplinary cooperation
among researchers. Creating a global and ultimate ontology seems an impossible
challenge. However, it might be possible to find intersections between ontologies and
benefit from what others work on. The approach in the proposed system is to
make all the semantic metadata available to all participants. In order to do
that, an advanced editor is required to assist the researchers in the process of
managing the metadata.
4.1
In Fig. 1 the basic architecture is presented. The system is divided into four
layers. At the bottom, the resource layer consists of services and data sources
which are used to build application plans using workflow or script notations
that provide some level of abstraction. In the same layer, the Metadata Store
and the Application Repository are deployed and used to archive semantic data
and application plans respectively. These two components are accessed
directly by the Web application layer (shown in green). The next, yellow layer is
the middleware, which provides an abstraction over the low-level resources and
ensures unified access to the variety of technologies that implement data sources
and computational services. In this way access to data and services is seamlessly
woven into the notation. The Application Execution Engine also maintains the
state of the applications during execution.</p>
          <p>The third layer, representing Web applications, contains two modules, namely
the Metadata Engine module and the Execution Client module. The first module
is responsible for managing the semantic descriptions available in the system.
It also constitutes a filter and a tool that helps users manage the semantic
content they provide or browse. Based on the semantic model presented in the
next section, users are able to:
– import their own semantic descriptions by semi-automatically aligning and
mapping them against existing ones,</p>
          <p>Fig. 1. Basic components of the proposed system.</p>
          <p>– browse the existing knowledge by conveniently searching through existing
ontology triples,
– quickly obtain application plans, results or publications of interest by
providing keywords (the whole knowledge space is tagged and annotated),
– tag and annotate the existing objects in the knowledge space.</p>
          <p>
The second module - the Execution Client - is responsible for communication
with the application execution engine and for keeping users updated on the
current execution status using AJAX-oriented techniques (e.g. implemented with
the GWT toolkit [
            <xref ref-type="bibr" rid="ref38">20</xref>
            ]).
4.2
The Metadata Engine is the main component which provides the reasoning
functionality over the ontologies built within the system. It covers the low-level
Metadata Store and exposes convenient methods to manage the knowledge structure.
          </p>
          <p>In Fig. 2 a detailed architecture of the Metadata Engine is presented. It
contains a client that enables it to access the underlying metadata store and
facilitates the use of the query language used by the store.</p>
          <p>Fig. 2. Internal architecture of the Metadata Engine.</p>
          <p>The deduction module
is divided into two parts. For simple queries, for which response times should be
short, the client-side part is used. It communicates with the client through
an asynchronous channel, following the techniques used in web client-server
communication models (built over the standard request-response model). The calls
are made directly by the visual components and conclude with an update of their
visual state. If the queries are more complex, the deduction module on the
server side is used. To the visual components this is transparent, however, apart
from longer response times.
5
In Fig. 3 a sample of the ontology model is presented. This model is used as the
basis for the Metadata Engine module to manage the collaboration space.</p>
          <p>The model consists of three parts:
– Science Domain (blue) - This part of the semantic description is extendable
by users. This ensures that the model remains dynamic and, when required,
users may add custom ontological descriptions to existing ones. The process
is semi-supervised by the system in order to maintain coherency.
– Basic Model (orange) - This model is the core of the application and its
basic models. It assumes (in accordance with social Web content management)
that every item within the collaboration space may be tagged or annotated.
This enables the space to be enhanced by a quick search mechanism or by
building a tag cloud (used for space browsing).
– Application Model (green) - This ontology model allows the Metadata
Engine to keep track of the content managed by users. In particular, users
are able to submit specific queries that navigate to accurate pieces of data
stored in the collaboration space (e.g. list all publications that describe the
outcomes of a particular application plan, etc.)</p>
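The example query from the Application Model, listing all publications that describe the outcomes of a particular application plan, can be sketched as a filter over triples; the predicate names and identifiers below are made-up stand-ins for the real model:

```python
# Hypothetical triples from the Application Model; predicates are invented.
triples = [
    ("plan:genotype2resistance", "hasOutcome", "result:r1"),
    ("plan:genotype2resistance", "hasOutcome", "result:r2"),
    ("pub:paper42", "describes", "result:r1"),
    ("pub:paper99", "describes", "result:r7"),
]

def publications_for(plan):
    """Join a plan's outcomes with the publications that describe them."""
    outcomes = {o for s, p, o in triples if s == plan and p == "hasOutcome"}
    return {s for s, p, o in triples if p == "describes" and o in outcomes}

pubs = publications_for("plan:genotype2resistance")
```

In a real deployment such a query would be expressed against the Metadata Store's query language rather than hand-rolled over a Python list; the two-step join, outcomes first, then describing publications, is the part the sketch illustrates.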
          <p>The presented model is just a proposal, showing how the final
implementation could look, and it remains a subject of ongoing research. It is also possible
to test several different models in different research contexts.
5.2</p>
          <p>Role and Ontology Management
In order to ensure hierarchy in the process of managing and building the
ontologies, proper groups need to be modelled with certain permissions. Also, a way of
assessing the quality of the ontologies is required to introduce formal models of
the management process.</p>
          <p>Figure 4 depicts a sample structure of such an ontology. The main Object node
is assigned the is editable by relation, which specifies which roles are permitted to
edit a given node.</p>
          <p>Fig. 4. Role management dependency semantics.</p>
          <p>All Role nodes are referenced by User nodes, which creates the
authorization net in the proposed model.</p>
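The is editable by check from Fig. 4 reduces to a set intersection between a user's roles and the roles permitted to edit a node; the concrete role and node names below are illustrative assumptions:

```python
# Hypothetical authorization net: Object nodes list permitted roles, and
# each user references a set of Role nodes.

object_editable_by = {
    "node:virus-ontology": {"experiment-developer", "curator"},
}
user_roles = {
    "alice": {"curator"},
    "bob": {"clinician"},
}

def can_edit(user, node):
    """A user may edit a node if any of their roles is permitted for it."""
    return bool(user_roles.get(user, set()) & object_editable_by.get(node, set()))
```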
          <p>
To enable users to extend the current ontology graph,
a Community Node is introduced. This node is inherited by all the nodes
created by community members, and in the process of collaborative cooperation
among scientific communities it is assessed, with the quality information stored in
the individuals of the Node Quality. The quality will be measured by analyzing
usage statistics of each knowledge node (e.g. the more users use and cite a given
ontology node, the higher rank it has). Further improvements of this approach
will categorize the semantic descriptions into those approved and validated and those
still unassessed. Hopefully, this will lay the groundwork for building community
ontologies across different science domains. The model itself may be changed
while the system is working.
6
Currently the presented model is being implemented within the virtual
laboratory supporting the scripting approach to representing application plans [
            <xref ref-type="bibr" rid="ref24 ref6">6</xref>
            ]. The
application execution engine is already [
            <xref ref-type="bibr" rid="ref39">21</xref>
            ] operational and capable of running
test application plans. Simple ontology models have been built; however, they
still require user assessment in order to be improved.
          </p>
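The Node Quality assessment described above, ranking community-created nodes by how often they are used and cited and splitting approved from unassessed ones, might be sketched as a weighted score; the weights and the approval threshold are illustrative assumptions:

```python
# Hypothetical node-quality score: usage and citation counts, weighted.

def quality(uses, citations, w_use=1.0, w_cite=2.0):
    """Higher use and citation counts yield a higher rank."""
    return w_use * uses + w_cite * citations

stats = {  # node -> (uses, citations); sample values are invented
    "node:mutation": (40, 12),
    "node:viral-load": (5, 0),
}
ranked = sorted(stats, key=lambda n: quality(*stats[n]), reverse=True)
approved = [n for n in ranked if quality(*stats[n]) >= 20]
```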
          <p>With respect to the web application layer a prototype of the user interface
was built and a screenshot is depicted in Fig. 5.</p>
          <p>The interface is divided into three parts:
– application management - In this widget the user is able to browse the
collaboration space in search of application plans of interest. The search is</p>
          <p>The overlapping window in the middle is displayed as a popup and in this
case is used to show the application plan script. Each application plan may be
supplied with a license regarding its usage restrictions.
7</p>
          <p>Conclusions and Future Work
This paper presents a semantic Web-based approach to constructing a
scientific collaboration space. The solution combines social Web routines with the
formalisms of semantic content descriptions to facilitate the process of on-line
research. The main improvements of the approach include integration of the
application runtime system with result management and adoption of widely-used Web
content management techniques in the area of scientific research.</p>
          <p>At present the ViroLab virtual laboratory already integrates biomedical
information related to viruses (proteins and mutations), patients (viral load) and
literature (drug resistance); it enables experiments to be planned and run
transparently on distributed resources. Different experiments from the virology domain
are executable, such as: from virus genotype to drug resistance interpretation,
querying historical and provenance information about experiments, assisting a
virologist with the Drug Resistance System, and simple data mining with
classification. Further work will extend this list and explore re-usability in different
science disciplines.</p>
          <p>Future plans include the extension of the semantic model used for building
the prototype and extending the user community to test and assess the approach.
The aim is to benefit from the ideas brought by Semantic Web trends and
extend the present solutions in the area of community-driven research to make
the process more reliable and efficient.</p>
          <p>Acknowledgments. This work is partly funded by the European Commission
under the ViroLab IST-027446 and the IST-2002-004265 Network of Excellence
CoreGRID projects.</p>
          <p>References</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Franz</given-names>
            <surname>Baader</surname>
          </string-name>
          , Bernhard Ganter, Baris Sertkaya, and
          <string-name>
            <given-names>Ulrike</given-names>
            <surname>Sattler</surname>
          </string-name>
          .
          <article-title>Completing description logic knowledge bases using formal concept analysis</article-title>
          .
          <source>In Manuela M. Veloso, editor, IJCAI</source>
          , pages
          <fpage>230</fpage>
          -
          <lpage>235</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Bryan C.</given-names>
            <surname>Russell</surname>
          </string-name>
          , Antonio Torralba, Kevin P. Murphy, and William T. Freeman.
          <article-title>LabelMe: a database and web-based tool for image annotation</article-title>
          .
          <source>In MIT AI Lab Memo AIM-2005-025</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Dan</given-names>
            <surname>Brickley</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.V.</given-names>
            <surname>Guha</surname>
          </string-name>
          .
          <article-title>Resource Description Framework (RDF) Schema Specification</article-title>
          . http://www.w3.org/TR/rdf-schema/,
          <year>February 2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Henry</given-names>
            <surname>Lieberman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dustin</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>Alea</given-names>
            <surname>Teeters</surname>
          </string-name>
          .
          <article-title>Common Consensus: A Webbased Game for Collecting Commonsense Goals</article-title>
          .
          <source>In Workshop on Common Sense for Intelligent Interfaces</source>
          ,
          <source>ACM International Conference on Intelligent User Interfaces (IUI-07)</source>
          , Honolulu, Hawaii, USA,
          <year>2007</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Google</given-names>
            <surname>Inc</surname>
          </string-name>
          . Google Image Labeler. http://images.google.com/imagelabeler/,
          <year>2007</year>
          . Online; accessed 3-May-
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Levy</surname>
          </string-name>
          .
          <source>Collective Intelligence</source>
          . Plenum Publishing Corporation,
          <year>January 1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Luis von Ahn and
          <string-name>
            <given-names>Laura</given-names>
            <surname>Dabbish</surname>
          </string-name>
          .
          <article-title>Labeling images with a computer game</article-title>
          .
          <source>In CHI '04: Proceedings of the 2004 conference on Human factors in computing systems</source>
          , pages
          <fpage>319</fpage>
          -
          <lpage>326</lpage>
          . ACM Press,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Luis</given-names>
            <surname>von Ahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mihir</given-names>
            <surname>Kedia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Manuel</given-names>
            <surname>Blum</surname>
          </string-name>
          .
          <article-title>Verbosity: a game for collecting common-sense facts</article-title>
          .
          <source>In CHI '06: Proceedings of the SIGCHI conference on Human Factors in computing systems</source>
          , pages
          <fpage>75</fpage>
          -
          <lpage>78</lpage>
          , New York, NY, USA,
          <year>2006</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Luis</given-names>
            <surname>von Ahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ruoran</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Manuel</given-names>
            <surname>Blum</surname>
          </string-name>
          .
          <article-title>Peekaboom: a game for locating objects in images</article-title>
          .
          <source>In CHI '06: Proceedings of the SIGCHI conference on Human Factors in computing systems</source>
          , pages
          <fpage>55</fpage>
          -
          <lpage>64</lpage>
          , New York, NY, USA,
          <year>2006</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Sean</given-names>
            <surname>Bechhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Frank</given-names>
            <surname>van Harmelen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jim</given-names>
            <surname>Hendler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ian</given-names>
            <surname>Horrocks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Deborah L.</given-names>
            <surname>McGuinness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter F.</given-names>
            <surname>Patel-Schneider</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Lynn Andrea</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <article-title>OWL Web Ontology Language Reference</article-title>
          . http://www.w3.org/TR/owl-ref/,
          <year>February 2004</year>
          . [Online; accessed 2-May-2007].
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Katharina</given-names>
            <surname>Siorpaes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Hepp</surname>
          </string-name>
          .
          <article-title>OntoGame: Towards Overcoming the Incentive Bottleneck in Ontology Building</article-title>
          .
          <source>In 3rd International IFIP Workshop On Semantic Web and Web Semantics (SWWS '07), co-located with OTM Federated Conferences</source>
          , Vilamoura, Portugal, pages
          <fpage>1222</fpage>
          -
          <lpage>1232</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Luis</given-names>
            <surname>von Ahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Shiry</given-names>
            <surname>Ginosar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mihir</given-names>
            <surname>Kedia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ruoran</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Manuel</given-names>
            <surname>Blum</surname>
          </string-name>
          .
          <article-title>Improving accessibility of the web with a computer game</article-title>
          .
          <source>In CHI '06: Proceedings of the SIGCHI conference on Human Factors in computing systems</source>
          , pages
          <fpage>79</fpage>
          -
          <lpage>82</lpage>
          , New York, NY, USA,
          <year>2006</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Wikipedia</surname>
          </string-name>
          .
          <article-title>CAPTCHA - Wikipedia, the free encyclopedia</article-title>
          ,
          <year>2007</year>
          . [Online; accessed 14-December-2007].
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Wikipedia</surname>
          </string-name>
          .
          <article-title>Cyc - wikipedia, the free encyclopedia</article-title>
          . http://en.wikipedia.org/w/index.php?title=Cyc&amp;oldid=125786119,
          <year>2007</year>
          . [Online; accessed 7-May-2007].
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Wikipedia</surname>
          </string-name>
          .
          <article-title>Guessing game - wikipedia, the free encyclopedia</article-title>
          . http://en.wikipedia.org/w/index.php?title=Guessing_game&amp;oldid=116214370,
          <year>2007</year>
          . [Online; accessed 6-May-2007].
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Wikipedia</surname>
          </string-name>
          .
          <article-title>Human-based computation - wikipedia, the free encyclopedia</article-title>
          . http://en.wikipedia.org/w/index.php?title=Human-based_computation&amp;oldid=122965665,
          ,
          <year>2007</year>
          . [Online; accessed 7-May-2007].
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Wikipedia</surname>
          </string-name>
          .
          <article-title>Collective intelligence - Wikipedia, the free encyclopedia</article-title>
          ,
          <year>2008</year>
          . [Online; accessed 17-March-2008].
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>R.</given-names>
            <surname>Wille</surname>
          </string-name>
          .
          <article-title>Restructuring lattice theory: An approach based on hierarchies of concepts</article-title>
          .
          <source>In I. Rival (Ed.), Ordered Sets</source>
          , volume
          <volume>23</volume>
          ,
          <year>1982</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          1.
          <string-name>
            <surname>Chandrasekaran</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Josephson</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benjamins</surname>
            ,
            <given-names>V.R.</given-names>
          </string-name>
          :
          <article-title>What are ontologies, and why do we need them</article-title>
          ?
          <source>IEEE Intelligent Systems</source>
          <volume>14</volume>
          (
          <issue>1</issue>
          ) (January/
          <year>February 1999</year>
          )
          <fpage>20</fpage>
          -
          <lpage>26</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cardoso</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>The semantic web vision: Where are we</article-title>
          ?
          <source>IEEE Intelligent Systems</source>
          <volume>22</volume>
          (
          <issue>5</issue>
          ) (September/
          <year>October 2007</year>
          )
          <fpage>84</fpage>
          -
          <lpage>88</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          3.
          <string-name>
            <surname>Missier</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alper</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corcho</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dunlop</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goble</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Requirements and services for metadata management</article-title>
          .
          <source>IEEE Internet Computing</source>
          <volume>11</volume>
          (
          <issue>5</issue>
          ) (September/
          <year>October 2007</year>
          )
          <fpage>17</fpage>
          -
          <lpage>25</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          4.
          <string-name>
            <surname>Carroll</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dickinson</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dollin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reynolds</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seaborne</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilkinson</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Jena: Implementing the semantic web recommendations</article-title>
          .
          <source>Technical report, HP Labs</source>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          5. ViroLab Consortium:
          <article-title>ViroLab - EU IST STREP Project 027446</article-title>
          (
          <year>2008</year>
          ), http://www.virolab.org
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          6. ACC CYFRONET AGH:
          <article-title>ViroLab virtual laboratory</article-title>
          (
          <year>2008</year>
          ), http://virolab.cyfronet.pl
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          7.
          <string-name>
            <surname>Sloot</surname>
            ,
            <given-names>P.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tirado-Ramos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Altintas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bubak</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boucher</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>From molecule to man: Decision support in individualized e-health</article-title>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          8.
          <string-name>
            <surname>Gubala</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bubak</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Gridspace - semantic programming environment for the grid</article-title>
          .
          <source>LNCS</source>
          <volume>3911</volume>
          (
          <year>2006</year>
          )
          <fpage>172</fpage>
          -
          <lpage>179</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          9. W3C:
          <article-title>OWL Web Ontology Language</article-title>
          (
          <year>2004</year>
          ), http://www.w3.org/TR/owl-features
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          10. W3C:
          <article-title>RDF: Resource Description Framework</article-title>
          (
          <year>2001</year>
          ), http://www.w3.org/RDF
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          11.
          <string-name>
            <surname>Roman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keller</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lausen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Bruijn</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lara</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stollberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feier</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bussler</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fensel</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Web service modeling ontology</article-title>
          .
          <source>Applied Ontology</source>
          <volume>1</volume>
          (
          <issue>1</issue>
          ) (
          <year>January 2005</year>
          )
          <fpage>77</fpage>
          -
          <lpage>106</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          12.
          Facebook Team:
          <article-title>A social utility that connects people with friends and others who work, study and live around them</article-title>
          (
          <year>2008</year>
          ), http://www.facebook.com
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          13.
          Yahoo! Inc.:
          <article-title>Photo sharing web space</article-title>
          (
          <year>2008</year>
          ), http://www.flickr.com
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          14.
          <string-name>
            <surname>Fox</surname>
            ,
            <given-names>G.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guha</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McMullen</surname>
            ,
            <given-names>D.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mustacoglu</surname>
            ,
            <given-names>A.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pierce</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Topcu</surname>
            ,
            <given-names>A.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wild</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          :
          <article-title>Web 2.0 for grids and e-science</article-title>
          .
          <source>In: INGRID 2007 - Instrumenting the Grid</source>
          , 2nd International Workshop on Distributed Cooperative Laboratories, S. Margherita Ligure - Portofino, Italy (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          15.
          <string-name>
            <surname>Goble</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Roure</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Grid 3.0: Services, semantics and society</article-title>
          .
          <source>In: Proceedings of Cracow Grid Workshop</source>
          <year>2007</year>
          , ACC CYFRONET AGH (
          <year>2008</year>
          )
          <fpage>10</fpage>
          -
          <lpage>11</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          16. The University of Manchester and University of Southampton: myExperiment home page (
          <year>2008</year>
          ), http://www.myexperiment.org
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          17.
          <string-name>
            <surname>Oinn</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Addis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferris</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marvin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Greenwood</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carver</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glover</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pocock</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wipat</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Taverna: a tool for the composition and enactment of bioinformatics workflows</article-title>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          18.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>T.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hashmi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cummings</surname>
            ,
            <given-names>M.P.</given-names>
          </string-name>
          :
          <article-title>Bio-STEER: A Semantic Web workflow tool for grid computing in the life sciences</article-title>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          19. OASIS:
          <article-title>Web services business process execution language (</article-title>
          <year>2007</year>
          ), http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=wsbpel
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          20. Google:
          <article-title>Google Web Toolkit</article-title>
          (
          <year>2008</year>
          ), http://code.google.com/webtoolkit
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          21.
          <string-name>
            <surname>Ciepiela</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kocot</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gubala</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malawski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kasztelnik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bubak</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Gridspace engine of the virolab virtual laboratory</article-title>
          .
          <source>In: Proceedings of Cracow Grid Workshop</source>
          <year>2007</year>
          , ACC CYFRONET AGH (
          <year>2008</year>
          )
          <fpage>53</fpage>
          -
          <lpage>58</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>