=Paper=
{{Paper
|id=None
|storemode=property
|title=How much semantics on the "wild" Web is enough for machines to help us?
|pdfUrl=https://ceur-ws.org/Vol-683/paper1.pdf
|volume=Vol-683
|dblpUrl=https://dblp.org/rec/conf/itat/Bielikova10
}}
==How much semantics on the "wild" Web is enough for machines to help us?==
Mária Bieliková
Institute of Informatics and Software Engineering, Slovak University of Technology
Ilkovicova 3, 842 16 Bratislava, Slovakia
maria.bielikova@fiit.stuba.sk
WWW home page: http://fiit.stuba.sk/~bielik
Abstract. The current Web is not only a place for content available at any time and location. It is also a place where we actually spend time performing our working tasks, a place where we look not only for interesting information but also for entertainment and friends, a place where we spend part of our rest. The Web is also an infrastructure for applications which offer various services. There are so many aspects of the Web that this diverse organism is a subject of study for researchers from various disciplines. In this paper we concentrate on the information retrieval aspect of the Web, which is still the prevailing one. How can we improve information retrieval, be it goal-driven or exploratory? To what extent are we able to give our machines the means to help us in information retrieval tasks? Is there any level of semantics which we can supply for the Web in general and which will help? We present some aspects of information acquisition by search on the “wild” Web, together with examples of approaches to particular tasks towards the improvement of information search, which were proposed in the last two years within the Institute of Informatics and Software Engineering at the Slovak University of Technology, especially within the PeWe (Personalized Web) research group.

1 Introduction

The Web is amazing in the amount and diversity of its content, in the mass of thoughts, discussions and opinions that in many cases show the wisdom and creativity of people. This is also the bottleneck of the current Web: its very nature involves “web objects” of various types (text, multimedia, programs) that represent conceptually different entities (content, people, things, services) and change constantly. Particular objects are not formally defined (e.g. the content is semistructured), which makes machine processing complex.

Obvious sentences are expected here – how important the Web is for our lives (both work and private), how the Web grows, how dynamic it is and how it constantly changes, how it absorbs people with their opinions, ratings and tags¹. Especially its dynamic nature prevents us from directly employing most of the methods developed for closed information worlds (even though big, or actually present on the Web). And its size requires automatic (or semiautomatic) approaches to information acquisition from this large heterogeneous information space.

The Web is undergoing constant development with

– the Semantic Web initiative, which aims for a machine readable representation of the Web [3],
– the Adaptive Web initiative, which stresses the need for personalization and broader context adaptation on the Web [6],
– the Web 2.0 initiative, also called the Social Web, which focuses on social and collaborative aspects of the Web [14].

Development in this area matures to the point where the Web is becoming such an important and in fact still unknown phenomenon that it is identified as a separate, original object of investigation, and there are even initiatives which want to establish Web Science as a new scientific discipline [7].

Considering information retrieval based on search (be it goal-driven or exploratory) also includes effective means for expressing users’ information needs – how should a user specify his query or a broader aim of the search (be it a concrete requirement for the explanation of a particular term or an abstract need to find out what is interesting or new in some domain)? “Effective” here means that the user gets what he expects, even if his expectations are not completely known – this is quite similar to software requirements specification, but within the “wild” Web we have so many and so diverse users with various needs that we are not able to do this manually, as software engineers do with a software specification.

* This work was partially supported by the projects VEGA 1/0508/09 and KEGA 028-025STU-4/2010, and it is a partial result of the Research & Development Operational Programme for the project SMART II, ITMS 26240120029, co-funded by the ERDF.
¹ We do not mention and elaborate further another important view of the Web: as an infrastructure for services and software applications.
In general, a user’s information needs usually come into existence while the user solves a task. Information needs can be classified into three categories [5]:

– Informational. The user’s intent is to get specific information assumed to be present on the Web. The only assumed interaction is reading.
– Navigational. The user’s intent is to reach a particular web page. It is assumed that the user will “travel” through the Web space, taking advantage of getting a starting point.
– Transactional. The user’s intent is to perform some activity enabled by the Web, i.e. to use a service offered by a particular web page.

These categories cannot be directly inferred from the user’s query. However, a good search engine should consider the various information needs, as this implies a move from static information retrieval (the first two categories) to the third category, which integrates not just data stored on the Web but also services that can provide the right information (e.g. planning a flight).

2 Web and semantics

I do not feel a need to repeat here the well-known arguments about the importance of semantics for automatic reasoning. Yes, it is important! This fact has been stated many times since its first publication in [3], even though what we give a machine is actually not the semantics; for the machine it is only syntax – a formal description of a resource.

The question is not what we can do with the semantics when it is perfect, but how to acquire it. How much semantics can we acquire for the constantly changing world of the Web, or what amount is already useful to such an extent that we can report an improvement in fulfilling our information needs?

With the Web’s development, several sources of semantics have come into existence. Besides the web content as the fundamental source of semantics, there are other sources that can be mined:

– the web structure, with the focus on link analysis, and
– usage logs, with the focus on user activity on the Web, mainly through an analysis of clickstreams.

As a special case of the content source we consider web annotations, viewing them as a layer above the content created either automatically [11] or manually (in particular by user interactions and social tagging). The web annotations can also be viewed as a result of the users’ activity and as such considered a source for web usage mining.

2.1 Considering the web content

The content, or resources in general, are basically described by metadata. Metadata were used by librarians already before the Web era. They typically recognize three categories of metadata: administrative, structural, and descriptive [21]. Considering the Web and its content, we focus on descriptive metadata related to the content. Moreover, metadata for the Web, compared to library resources, should conform to the fact that we cannot predict all kinds of web objects and their evolution.

The semantics of the content can be expressed in many ways, ranging from

– a set of keywords (or tags), through
– the Resource Description Framework or topic maps as a general model for the conceptual description of resources, to
– ontologies with all the power resulting from formal logic, where the ontology consists of concepts, relations, attributes, data types, a concept hierarchy, and a relation hierarchy.

Having ontologies that cover (almost) the “complete” semantics which we are presently able to specify seems to be a solution for the Semantic Web. But it is not, at least not now. Considering the complexity of defining such semantics recalls the situation somewhat more than 40 years back, when people tried to devise general problem solving machines. Even though they later moved to capturing expert knowledge, the results were still limited, mainly due to the limited ability of people to specify knowledge explicitly. So the situation repeats itself in some sense.

Right after the establishment of the Semantic Web we witnessed a boom of various approaches to representing semantics for specific domains and of methods for reasoning, including ontology mapping. However, ontology-based semantics is spreading slowly because we obviously have solutions just for very specific and rather static domains. It is a perfect fit for an application architecture, as knowledge bases were in the 1970s. But it does not fit well with the “wild” Web.

Even if we had formally represented knowledge sufficient for the better part of our needs (the knowledge representation problem in Artificial Intelligence), and had strong reasoning mechanisms, it would not be enough for the changing Web – we still miss a component for matching this knowledge to particular web objects. Moreover, the Web evolves, as we people evolve, in an unpredictable way. New information and knowledge is constantly added to the Web, either as semistructured content or as services and applications running on the Web.

Web 2.0 brought, or vitalized, the role of people in the whole process. We witness the power of the crowd and its limitations.
Folksonomy is simply a return to the most elemental way of enriching a resource with semantics: employing a set of keywords. The fundamental difference lies in the process of keyword acquisition. A folksonomy is created by users through the process of social tagging [12]. The advantage is the real power of users, so keywords attached to a resource by social tagging represent a rather objective notation of the web page content. The problem is that folksonomies are coarse-grained, informal and flat.

Following this trend we proposed a model of lightweight semantics of the web content, referred to as the resource metadata [17]. It is promising in the sense of its automatic acquisition for an open corpus, or for vast and dynamic domains. It provides a meaningful abstraction of the web content, i.e. it provides a metadata model and a mapping between web pages and this metadata model.

The model consists of interlinked concepts and relationships connecting concepts to resources (the subjects of the search) or to other concepts (see Figure 1). Concepts feature domain knowledge elements (e.g., keywords or tags) related to the resource content (e.g., web pages or documents). Both the resource-to-concept and the concept-to-concept relationships are weighted. Weights determine the degree of relatedness of a concept to the resource or to the other concept, respectively. The interlinked concepts result in a structure resembling a lightweight ontology and form a layer above the resources, allowing an improvement of the search.

Fig. 1. Content model based on lightweight semantics: a metadata layer (keywords, tags, concepts) above the resources (web content).

The advantage of modeling domain knowledge as described above lies in its simplicity. Hence, it is possible to generate metadata enabling lightweight semantic search for a vast majority of resources on the Web. We have already performed several experiments on automatic metadata extraction with promising results [15]. The model also conforms with existing and evolving folksonomies, which can supplement extracted metadata and can be fully captured within the model.

We believe that the proposed model can improve information search. Our confidence is supported by the partial results achieved (some of them are briefly mentioned in Section 3). There are still some issues related to the proposed model. As the most serious we consider:

– extracting the right terms (concepts);
– creating and typing relationships between concepts;
– multilingual and multicultural aspects, as some terms can have a completely different meaning depending on the culture.

Term extraction especially is a well developed field, with term-indexing approaches and named entity resolution. Considering the model alone, the semantics is still rather low, as we cannot properly recognize the terms important for a particular user in a particular context. That is why there is a need to combine all sources of semantics [13]. Besides the content, we mention here web users’ activity (web structure and web annotation are out of the scope of this paper).

2.2 Considering web users’ activity

Monitoring a user’s activity can serve as an important source of semantics. Utilizing implicit user feedback, we can recognize which web pages (or even which of their parts) are interesting in a particular context, and thus adjust or enrich the metadata related to that content. User related metadata (i.e., a user model) allow personalization. Considering the “wild” Web with its lightweight semantics, spreading personalization to the whole Web becomes possible (to some extent).

The resource metadata model introduced above also serves as a bottom layer for an overlay user model. As we operate in an open corpus, it is not possible to have either of the models in advance. We propose to represent the user’s interests (discovered via web usage mining) by the same means as the resource metadata, and to provide a constant mapping between these two models.

If we want to employ such models for the purpose of information retrieval on the “wild” Web, we need to acquire terms (keywords, tags, concepts) from the web pages visited by the users. Because the Web is an open information space, we need to track down and process every page the user has visited in order to update his model appropriately.
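To make the two models tangible, the following is a minimal Python sketch of weighted resource-to-concept and concept-to-concept relations, with an overlay user model expressed over the same concepts. The class names, the additive update rule and the example data are illustrative assumptions, not taken from the cited implementations.

```python
# Minimal sketch of the lightweight resource metadata model: weighted
# resource-to-concept and concept-to-concept relations, plus an overlay
# user model expressed with the same concept weights. All names and the
# update rule are illustrative, not taken from the cited papers.
from collections import defaultdict


class LightweightModel:
    def __init__(self):
        # resource -> {concept: weight}, e.g. "page42" -> {"ontology": 0.8}
        self.resource_concepts = defaultdict(dict)
        # frozenset({concept, concept}) -> weight of their relatedness
        self.concept_relations = defaultdict(float)

    def add_resource_concept(self, resource, concept, weight):
        self.resource_concepts[resource][concept] = weight

    def relate_concepts(self, c1, c2, weight):
        self.concept_relations[frozenset((c1, c2))] = weight


class OverlayUserModel:
    """User interests kept as weights over the same concepts."""

    def __init__(self, domain: LightweightModel):
        self.domain = domain
        self.interests = defaultdict(float)  # concept -> interest weight

    def record_visit(self, resource):
        # Evidence layer: every visited page adds its concept weights
        # to the user's interest profile (a simple additive rule here).
        for concept, weight in self.domain.resource_concepts.get(resource, {}).items():
            self.interests[concept] += weight

    def top_interests(self, k=10):
        return sorted(self.interests.items(), key=lambda x: -x[1])[:k]


if __name__ == "__main__":
    model = LightweightModel()
    model.add_resource_concept("page-about-rdf", "rdf", 0.9)
    model.add_resource_concept("page-about-rdf", "semantic web", 0.6)
    model.relate_concepts("rdf", "semantic web", 0.7)

    user = OverlayUserModel(model)
    user.record_visit("page-about-rdf")
    print(user.top_interests(5))
```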
Fig. 2. Monitoring a user based on an enhanced proxy platform (user requests and server responses pass through the proxy, which injects a JavaScript logger and runs main textual content extraction (readability), translation and metadata extraction feeding the user model).
To achieve this, we developed an enhanced proxy server, which allows the realization of advanced operations on top of the requests flowing from a user and the responses coming back from web servers all over the Internet [2]. Figure 2 depicts the schema of how the proxy server operates. When the web server sends the response to the requested resource back to the user, the proxy server enriches the resource with a script able to capture the user’s activities (for the evaluation of user feedback). In parallel we run a process of extracting the metadata and concepts from the web page. Together with the user feedback, these are stored in the user profile. Before the extraction phase, which is based on various algorithms for semantic annotation and for keyword and category extraction, we perform main content detection (the relevant textual part of the HTML document) and machine translation into English, which is required by the extraction algorithms.

The aforementioned process gathers metadata for every requested web page and creates a basic (evidence) layer of a user model. Naturally, as time flows, the keywords which represent long-term user interests occur more often than the others. Therefore, by considering only the top K most frequently occurring keywords, we get a user model which can be further analyzed and which serves as a basis for personalization.

We deployed our enhanced proxy platform to determine the efficiency of the solution in real-world usage. Apart from user activity logging, the proxy solution can be used to improve the user experience with ordinary web pages by adapting them according to actual user needs. Moreover, we provide users with a wordle-based visualization (Wordle tag cloud generator, http://www.wordle.net/) of their user profiles, and we collected precious feedback, which helped us to determine “web stop-words”, i.e., words which occur often on web pages but do not make any sense from the point of view of the user’s interests. An example of such a user profile of one of the proxy authors is displayed in Figure 3.

3 Examples

We present several examples of approaches to particular tasks towards the improvement of information search, which were proposed and evaluated in the last two years within the Institute of Informatics and Software Engineering at the Slovak University of Technology in Bratislava, especially within the PeWe (Personalized Web) research group.

3.1 Gaming as a source of semantics

Computer games are potential sources of metadata that are hard to extract by machines. With game rules properly set and sufficient motivation, players can indirectly solve otherwise costly problems.

Little Google Game. We proposed a method for term relationship network extraction via an analysis of the logs of a unique web search game [19]. Our game, called Little Google Game, focuses on web search query guessing. Players have to formulate queries in a special format (using negative keywords) and minimize the number of results returned by the search engine (we use Google at the moment). Afterwards we mine the game logs and extract relationships between terms based on their frequent common occurrence on the Web.
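A minimal sketch of this kind of log mining: count how often pairs of terms appear together in players’ queries and keep the strongest pairs as candidate relationships. The log format, the handling of negative keywords and the support threshold are assumptions for illustration; the published method [19] is more involved.

```python
# Sketch: candidate term relationships from game query logs via pair
# co-occurrence counts. Log format and threshold are illustrative.
from collections import Counter
from itertools import combinations

query_log = [                      # hypothetical logged player queries
    ["python", "-snake", "programming"],
    ["python", "programming", "-monty"],
    ["semantic", "web", "ontology"],
]

pair_counts = Counter()
for query in query_log:
    terms = {t.lstrip("-") for t in query}     # drop the negative prefix
    pair_counts.update(frozenset(p) for p in combinations(sorted(terms), 2))

MIN_SUPPORT = 2                                # keep frequently co-occurring pairs
relations = [tuple(sorted(pair)) for pair, n in pair_counts.items() if n >= MIN_SUPPORT]
print(relations)                               # [('programming', 'python')]
```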
Fig. 3. Michal’s tag cloud.
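The tag cloud in Figure 3 is rendered from a keyword profile of the kind described in Section 2.2. The sketch below illustrates building such a profile from per-page keywords, filtering “web stop-words” and keeping the top K entries; all data and the stop-word set are made up for illustration.

```python
# Sketch: evidence layer of a user model -- count keywords extracted from
# visited pages, drop "web stop-words", keep the top K. Data is made up.
from collections import Counter

WEB_STOP_WORDS = {"home", "login", "click", "contact"}   # assumed examples

visited_page_keywords = [                 # keywords extracted per visited page
    ["proxy", "user", "model", "login"],
    ["semantic", "web", "proxy", "click"],
    ["proxy", "personalization", "semantic"],
]

counts = Counter(
    kw for page in visited_page_keywords for kw in page
    if kw not in WEB_STOP_WORDS
)

K = 5
profile = counts.most_common(K)           # input for a tag cloud like Fig. 3
print(profile)                            # [('proxy', 3), ('semantic', 2), ...]
```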
3.2 Domain dependent approaches

In spite of the domain independence of the proposed models, knowing the domain allows for more accurate models. This is a common approach also used by the most popular web search engines, which blend data from multiple sources in order to fulfil the user’s need behind his query, taking advantage of a known domain (e.g. flight planning or cooking a meal).

ALEF, Adaptive Learning Framework. We proposed a schema for adaptive web-based learning and based on it we developed ALEF (Adaptive LEarning Framework), a framework for creating adaptive and highly interactive web-based learning systems [16].

The ALEF domain model follows the resource metadata model described above. The content includes learning objects of three types: explanation, question and exercise. For every learning object the domain model covers the actual content (text and media) and additional metadata containing information relevant for personalization services (concepts, tags, comments). Compared to other existing approaches, the notion of metadata in ALEF is quite simplified, which allows for automatic construction of the domain model; on the other hand, it still provides a solid basis for reasoning, resulting in advanced operations such as metadata-based personalized navigation.

News recommendation. We proposed content based news recommendation based on article similarity. Considering the high dynamics and the large daily volume of news, we devised and evaluated in real settings two representations for effective news recommendation:

– an efficient vector comprising the title, the term frequency of title words in the article content, names and places, keywords, the category and a readability index [9],
– a balanced tree built incrementally, into which articles are inserted based on content similarity [23].

A different approach to news recommendation provided on the same e-news portal (www.sme.sk) is presented in [20]. It employs a k-nearest neighbor collaborative filtering algorithm based on a generic full text engine exploiting power-law distributions. An important property of the proposed algorithm is that it maintains linear scalability with respect to the dataset size.

Adaptive faceted browser. We devised a faceted semantic exploratory browser taking advantage of adaptive and social web approaches to provide personalized visual query construction support and to address guidance and information overload [22]. It works on semantically enriched information spaces (both the data and the metadata describing the information space structure are represented by ontologies). Our browser facilitates user interface generation using metadata describing the presented information spaces (e.g., photos).

3.3 User centric approaches

Monitoring users and implicit feedback is a promising approach for the “wild” Web. Even though explicit user feedback (a user filling in forms) is easy to implement, it has serious problems with credibility, with disturbing the user, and with its dependence on the user’s will.

Query expansion by social context. We proposed a method which implicitly infers the context of a search by leveraging a social network, and modifies the user’s search query to include it [10]. The social network is built from the stream of the user’s activity on the Web, which is acquired by means of our enhanced proxy server.
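A minimal sketch of the underlying idea: append to the query a few terms that are prominent in the activity of the user’s social context. The data structures and the selection rule here are illustrative assumptions, not the method of [10].

```python
# Sketch: expand a search query with terms prominent in the user's social
# context. The context data and the selection rule are illustrative.
from collections import Counter

# contact -> terms recently associated with that contact's web activity
social_context = {
    "alice": ["python", "asyncio", "testing"],
    "bob": ["python", "flask"],
}

def expand_query(query: str, context: dict, max_terms: int = 2) -> str:
    counts = Counter(t for terms in context.values() for t in terms)
    extra = [t for t, _ in counts.most_common(max_terms) if t not in query.split()]
    return " ".join([query] + extra)

print(expand_query("decorators", social_context))   # e.g. "decorators python asyncio"
```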
User interest estimation. We proposed a method for adaptive link recommendation [8]. It is based on an analysis of the user’s navigational patterns and his behavior on web pages while browsing through a web portal. We extract interesting information from the web portal and recommend it in the form of a personalized calendar and additional personalized links.

Search history tree. We proposed an approach intended to reduce the user effort required to retrieve and/or revisit previously discovered information by exploiting the web search and navigation history [18]. It is based on collecting streams of user actions during search sessions. We provide the user with a history map – a scrutable graph of semantic terms and web resources with full-text search capability over individual history entries. It is constructed by merging individual session history trees and the associated web resources.

Discovering keyword relations from the Crowd. We proposed an approach to determining keyword relations (mainly parent-child relationships) by leveraging the collective wisdom of the masses, which is present in the data of collaborative (social) tagging systems on the Web [1]. We demonstrated the feasibility of our approach on data coming from the social bookmarking systems delicious and CiteULike.

4 Conclusions

In this paper we described just particular aspects of the whole picture. It is not in any sense complete. It should be viewed as a discussion of certain aspects and possible partial solutions.

At the moment we have more questions than answers. How should the Web be described? What properties are important? How do we discover interesting information for a particular individual? Are there any emergent phenomena? What could we do? How can we really connect people in such a way that it will be convenient and useful? Can we trust the Web? Is its infrastructure right?

One day we people may discover a silver bullet for the Web. Meanwhile we should be open to various small enhancements, try to understand the Web as much as possible, and try to integrate all particular successes.

Acknowledgements. Figures and parts of the descriptions in Section 3 are taken from the published papers which present the particular examples, all mentioned in the References.

The author wishes to thank colleagues from the Institute of Informatics and Software Engineering and all students – members of the PeWe group, pewe.fiit.stuba.sk – for their invaluable contribution to the work presented in this invited lecture. The most current state of ongoing projects within the group is reported in [4].

References

1. M. Barla, M. Bielikova: On deriving tagsonomies: keyword relations coming from the Crowd. LNCS 5796, ICCCI 2009, Springer, 2009, 309–320.
2. M. Barla, M. Bielikova: Ordinary web pages as a source for metadata acquisition for open corpus user modeling. Proc. of IADIS WWW/Internet 2010, 2010.
3. T. Berners-Lee, J. Hendler, O. Lassila: The semantic web. Scientific American Magazine, May 2001.
4. M. Bielikova, P. Navrat (Eds.): Workshop on the Web – Science, Technologies and Engineering, 2010. ISBN 978-80-227-3274-1. Available online at pewe.fiit.stuba.sk/ontoparty-2009-2010-spring/.
5. A. Broder: A taxonomy of web search. ACM SIGIR Forum, 36 (2), 2002, 3–10.
6. P. Brusilovsky et al.: The adaptive web. LNCS 4321, Springer, 2007, ISBN 978-3-540-72078-2, 763 p.
7. J. Hendler, N. Shadbolt, W. Hall, T. Berners-Lee, D. Weitzner: Web science: an interdisciplinary approach to understanding the web. Commun. ACM, 51 (7), July 2008, 60–69.
8. M. Holub, M. Bielikova: Estimation of user interest in visited web page. Proc. of Int. Conf. on World Wide Web, WWW 2010, ACM, 2010, 1111–1112.
9. M. Kompan, M. Bielikova: Content-based news recommendation. LNBIP Series, E-Commerce and Web Technologies, Springer, 2010.
10. T. Kramar, M. Barla, M. Bielikova: Disambiguating search by leveraging a social context based on the stream of user's activity. LNCS 6075, UMAP 2010, Springer, 2010, 387–392.
11. M. Laclavik et al.: Ontea: platform for pattern based automated semantic annotation. Computing and Informatics, 28 (4), 2009, 555–579.
12. P. Mika: Ontologies are us: a unified model of social networks and semantics. LNCS 3729, ISWC 2005, Springer, 2005, 522–536.
13. P. Navrat, T. Taraba, A. Bou Ezzeddine, D. Chuda: Context search enhanced by readability index. IFIP WCC Series 276, Springer, 2008, 373–382.
14. T. O'Reilly: What is Web 2.0. O'Reilly Network, 2005. [Accessed 2010-07-30] Available at http://oreilly.com/web2/archive/what-is-web-20.html.
15. M. Simko, M. Bielikova: Automated educational course metadata generation based on semantics discovery. LNCS 5794, EC-TEL 2009, Springer, 2009, 99–105.
16. M. Simko, M. Barla, M. Bielikova: ALEF: A framework for adaptive web-based learning 2.0. KCKS 2010, IFIP AICT 324, Springer, 2010, 367–378.
17. M. Simko, M. Bielikova: Improving search results with lightweight semantic search. CEUR Vol. 491, SemSearch 2009 at WWW 2009, 53–54.
18. J. Simko, M. Tvarozek, M. Bielikova: Semantic history map: graphs aiding web revisitation support. Proc. of 9th Int. Workshop on Web Semantics, IEEE Computer Society, 2010.
19. J. Simko, M. Tvarozek, M. Bielikova: Little Google Game: relationships term extraction by means of search game. Proc. of Datakon 2010.
20. J. Suchal, P. Navrat: Full text search engine as scalable k-nearest neighbor recommendation system. AI 2010, IFIP AICT 331, Springer, 2010, 165–173.
21. A.G. Taylor: The organization of information. Libraries Unlimited, Englewood, USA, 1999, 300 p.
22. M. Tvarozek, M. Bielikova: Generating exploratory search interfaces for the semantic web. HCIS 2010, IFIP AICT 332, Springer, 2010, 175–186.
23. D. Zelenik, M. Bielikova: Dynamics in hierarchical classification of news. Proc. of WIKT 2009, 83–87.