How much semantics on the "wild" Web is enough for machines to help us? CEUR Workshop Proceedings, Vol-683, paper 1. PDF: https://ceur-ws.org/Vol-683/paper1.pdf. DBLP: https://dblp.org/rec/conf/itat/Bielikova10
    How much semantics on the “wild” Web is enough for machines
                           to help us?

                                                      Mária Bieliková

                  Institute of Informatics and Software Engineering, Slovak University of Technology
                                        Ilkovicova 3, 842 16 Bratislava, Slovakia
                                            maria.bielikova@fiit.stuba.sk
                                  WWW home page: http://fiit.stuba.sk/~bielik

Abstract. The current Web is not only a place for content available at any time and in any location. It is also a place where we actually spend time performing our working tasks, a place where we look not only for interesting information, but also for entertainment and friends, a place where we spend part of our rest. The Web is also an infrastructure for applications which offer various services. There are so many aspects of the Web that this diverse organism is a subject of study for researchers from various disciplines. In this paper we concentrate on the information retrieval aspect of the Web, which is still prevailing. How can we improve information retrieval, be it goal-driven or exploratory? To what extent are we able to give our machines the means to help us in information retrieval tasks? Is there any level of semantics which we can supply for the Web in general that will actually help? We present some aspects of information acquisition by search on the "wild" Web, together with examples of approaches to particular tasks towards the improvement of information search, which were proposed in the last two years within the Institute of Informatics and Software Engineering at the Slovak University of Technology, especially within the PeWe (Personalized Web) research group.

This work was partially supported by the projects VEGA 1/0508/09, KEGA 028-025STU-4/2010, and it is a partial result of the Research & Development Operational Programme for the project SMART II, ITMS 26240120029, co-funded by the ERDF.


1    Introduction

The Web is amazing in the diversity of its content, in the wealth of thoughts, discussions and opinions that in many cases show the wisdom and creativity of people. This is also the bottleneck of the current Web: its very nature involves "web objects" of various types (text, multimedia, programs) representing conceptually different entities (content, people, things, services) and constantly changing. Particular objects are not formally defined, e.g. the content is semistructured, which complicates machine processing.

Obvious sentences are expected here – how important the Web is for our lives (both work and private), how the Web grows, how dynamic it is and how constantly it changes, how it absorbs people with their opinions, ratings and tags (we do not elaborate further another important view of the Web, as an infrastructure for services and software applications). Especially its dynamic nature prevents us from directly employing most methods developed for closed information worlds (even though big, or actually present on the Web). And its size requires automatic (or semiautomatic) approaches for information acquisition from this large heterogeneous information space.

The Web is undergoing constant development with
 – the Semantic Web initiative, which aims for a machine readable representation of the Web [3],
 – the Adaptive Web initiative, which stresses the need for personalization and broader context adaptation on the Web [6],
 – the Web 2.0 initiative, also called the Social Web, which focuses on social and collaborative aspects of the Web [14].

Development in this area matures to the point where the Web is becoming such an important and in fact still unknown phenomenon that it is identified as a separate, original object of investigation, and there are even initiatives which want to establish Web Science as a new scientific discipline [7].

Information retrieval based on search (be it goal-driven or exploratory) also requires effective means for expressing users' information needs – how should a user specify his query or a broader aim of the search (be it a concrete requirement for the explanation of a particular term or an abstract need to find out what is interesting or new in some domain)? "Effective" here means that the user gets what he expects, even if his expectations are not completely known. This is quite similar to software requirements specification, but on the "wild" Web we have so many and so diverse users with various needs that we are not able to do this manually, as software engineers do with a software specification.

In general, a user's information needs usually come into existence while the user solves a task. Information needs can be classified into three categories [5]:

 – Informational. The user's intent is to get specific information assumed to be present on the Web. The only assumed interaction is reading.
 – Navigational. The user's intent is to reach a particular web page. It is assumed that the user will "travel" through the Web space taking advantage of a given starting point.
 – Transactional. The user's intent is to perform some activity enabled by the Web, i.e. to use a service offered by a particular web page.

These categories cannot be directly inferred from the user's query. However, a good search engine should consider various information needs, as this implies a move from static information retrieval (the first two categories) to the third category, which integrates not just data stored on the Web, but also services that can provide the right information (e.g. planning a flight).


2    Web and semantics

I do not feel a need to repeat here the well-known arguments about the importance of semantics for automatic reasoning. Yes, it is important! This fact has been stated many times since its first publication in [3], even though what we actually give a machine is not the semantics; for the machine it is only syntax – a formal description of a resource.

The question is not what we can do with the semantics once it is perfect, but how to acquire it. How much semantics can we acquire for the constantly changing world of the Web, and what amount is already useful to such an extent that we can report an improvement in fulfilling our information needs?

With the Web's development, several sources of semantics have come into existence. Besides the
 – web content as a fundamental source for the semantics,
there are other sources of semantics that can be mined:
 – the web structure, with the focus on link analysis, and
 – usage logs, with the focus on user activity on the Web, mainly through an analysis of clickstreams.
As a special case of the content source we consider
 – web annotations,
when viewing the annotations as a layer above the content, created either automatically [11] or manually (in particular by user interactions and social tagging). The web annotations can also be viewed as a result of the users' activity and as such considered a source for web usage mining.


2.1    Considering the web content

The content, or resources in general, are basically described by metadata. Metadata were used by librarians already before the Web era. They typically recognize three categories of metadata: administrative, structural, and descriptive [21]. Considering the Web and its content, we focus on descriptive metadata related to the content. Moreover, metadata for the Web, compared to library resources, should conform to the fact that we cannot predict all kinds of web objects and their evolution.

The semantics of the content can be expressed in many ways, ranging from
 – a set of keywords (or tags), through
 – the Resource Description Framework or topic maps as a general model for the conceptual description of resources, to
 – ontologies with all the power resulting from formal logic, where the ontology consists of concepts, relations, attributes, data types, a concept hierarchy, and a relation hierarchy.

Having ontologies that cover (almost) "complete" semantics, as far as we are presently able to specify it, seems to be a solution for the Semantic Web. But it is not, at least not now. The complexity of defining such semantics recalls the situation somewhat more than 40 years back, when people tried to devise general problem-solving machines. Even though they later moved to capturing expert knowledge, the results were still limited, mainly due to the limited ability of people to specify knowledge explicitly. So the situation repeats itself in some sense.

Right after the establishment of the Semantic Web we witnessed a boom of various approaches to representing semantics for specific domains and of methods for reasoning, including mapping ontologies. However, ontology-based semantics is spreading slowly, because we obviously have solutions just for very specific and rather static domains. It is a perfect way to build an application architecture, much as knowledge bases were in the 1970s. But it does not fit well with the "wild" Web.

Even if we had formally represented knowledge that would be sufficient for the best part of our needs (the knowledge representation problem in Artificial Intelligence), and had strong reasoning mechanisms, it would not be enough for the changing Web – we still miss a component for matching this knowledge to particular web objects. Moreover, the Web is evolving, as we people evolve, in an unpredictable way. New information and knowledge is constantly added to the Web, either as semistructured content or as services or applications running on the Web.

Web 2.0 brought, or vitalized, the role of people in the whole process. We witness the power of the crowd and

its limitations. Folksonomy is simply a return to the most elemental way to enrich a resource with semantics: employing a set of keywords. The fundamental difference lies in the process of keyword acquisition. A folksonomy is created by users through the process of social tagging [12]. The advantage is the real power of users, so keywords attached to a resource by social tagging represent a rather objective annotation of a web page's content. The problem is that folksonomies are coarse-grained, informal and flat.

Following this trend, we proposed a model of lightweight semantics of the web content, referred to as the resource metadata [17]. It is promising in the sense of its automatic acquisition for open corpus, or vast and dynamic, domains. It provides a meaningful abstraction of the Web content, i.e. it provides a metadata model and a mapping between web pages and this metadata model.

The model consists of interlinked concepts and relationships connecting concepts to resources (subjects of the search) or to other concepts (see Figure 1). Concepts feature domain knowledge elements (e.g., keywords or tags) related to the resource content (e.g., web pages or documents). Both the resource-to-concept and the concept-to-concept relationships are weighted. Weights determine the degree of concept relatedness to the resource or to the other concept, respectively. The interlinked concepts result in a structure resembling a lightweight ontology, and form a layer above the resources allowing an improvement of the search.

Fig. 1. Content model based on lightweight semantics. (A metadata layer – keywords, tags, concepts – interlinked and mapped by weighted relationships onto the resource layer, the web content r1 … rn.)

The advantage of modeling domain knowledge as described above lies in its simplicity. Hence, it is possible to generate metadata enabling lightweight semantic search for a vast majority of resources on the Web. We have already performed several experiments of automatic metadata extraction with promising results [15]. This model also conforms with existing and evolving folksonomies, which can supplement extracted metadata and can be fully captured within the model.

We believe that the proposed model can improve information search. Our confidence is supported by the partial results achieved (some of them are briefly mentioned in Section 3). There are still some issues related to the proposed model. As the most serious we consider:

 – extracting the right terms (concepts);
 – creating and typing relationships between concepts;
 – multilingual and multicultural aspects, as for example some terms can have a completely different meaning depending on culture.

Term extraction in particular is a well developed field, with term-indexing approaches and named entity resolution. Considering the model alone, the semantics is still rather low, as we cannot properly recognize the terms important for a particular user in a particular context. That is why there is a need to combine all sources of semantics [13]. Besides the content, we mention here the web users' activity (web structure and web annotation are out of the scope of this paper).


2.2    Considering web users' activity

Monitoring a user's activity can serve as an important source of semantics. Utilizing implicit user feedback, we can recognize which web pages (or even their parts) are interesting in a particular context, and thus adjust or enrich the metadata related to that content. User related metadata (i.e., a user model) allow personalization. Considering the "wild" Web with its lightweight semantics, spreading personalization to the whole Web becomes possible (to some extent).

The resource metadata model introduced above also serves as a bottom layer for an overlay user model. As we operate in an open corpus, it is not possible to have either of the models in advance. We propose to represent the user's interests (discovered via web usage mining) by the same means as the resource metadata, and to provide a constant mapping between these two models.

If we want to employ such models for the purpose of information retrieval on the "wild" Web, we need to acquire terms (keywords, tags, concepts) from the web pages visited by the users. Because the Web is an open information space, we need to track down and process every page the user has visited in order to update his model appropriately.

To achieve this, we developed an enhanced proxy server, which allows for the realization of advanced operations on top of the requests flowing from a user and the responses coming back from web servers, all over the Internet [2]. Figure 2 depicts the schema of how the


Fig. 2. Monitoring a user based on an enhanced proxy platform. (The user's requests pass through the proxy to the web servers; responses are enriched with an activity-logging script (.js), while the main textual content is extracted (readability), translated, and fed into metadata extraction, which updates the user model.)

proxy server operates. When the web server sends the response for the requested resource back to the user, the proxy server enriches the resource with a script able to capture the user's activities (for evaluation of the user feedback). In parallel, we run a process of extracting the metadata and concepts from the web page. Together with the user feedback, these are stored in the user profile. Before the extraction phase, which is based on various algorithms for semantic annotation and for keyword and category extraction, we perform main content detection (finding the relevant textual part of the HTML document) and machine based translation into English, which is required by the extraction algorithms.

The aforementioned process gathers metadata for every requested web page and creates a basic (evidence) layer of a user model. Naturally, as time flows, the keywords which represent long-term user interests occur more often than the others. Therefore, by considering only the top K most frequently occurring keywords, we get a user model which can be further analyzed and serves as a basis for personalization.

We deployed our enhanced proxy platform to determine the efficiency of the solution in real-world usage. The proxy solution can, apart from user activity logging, be used to improve the user experience with ordinary web pages by adapting them according to actual user needs. Moreover, we provide users with a wordle-based visualization (the Wordle tag cloud generator, http://www.wordle.net/) of their user profiles, and collected precious feedback, which helped us to determine "web stop-words", i.e., words which occur often on web pages but carry no meaning from the point of view of the user's interests. An example of such a user profile of one of the proxy authors is displayed in Figure 3.


3    Examples

We present several examples of approaches to particular tasks towards the improvement of information search, which were proposed and evaluated in the last two years within the Institute of Informatics and Software Engineering at the Slovak University of Technology in Bratislava, especially within the PeWe (Personalized Web) research group.


3.1    Gaming as a source of semantics

Computer games are potential sources of metadata that are hard to extract by machines. With game rules properly set and sufficient motivation, players can indirectly solve otherwise costly problems.

Little Google Game. We proposed a method for term relationship network extraction via analysis of the logs of a unique web search game [19]. Our game, called Little Google Game, focuses on web search query guessing. Players have to formulate queries in a special format (using negative keywords) and minimize the number of results returned by the search engine (we use Google at the moment). Afterwards we mine the game logs and extract relationships between terms based on their frequent common occurrence on the Web.


3.2    Domain dependent approaches

In spite of the domain independence of the proposed models, knowing the domain allows for more accurate models. This is a common approach also used by the most popular web search engines, which blend data from multiple sources in order to fulfil the user's need behind his query, taking advantage of situations when the domain is known (e.g. flight planning or cooking a meal).




                                             Fig. 3. Michal’s tag cloud.
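A tag cloud like the one in Figure 3 follows directly from the evidence layer described in Section 2.2: count the terms extracted from visited pages, drop the "web stop-words", and keep the top K. A minimal sketch, where the term lists and the stop-word set are illustrative only, not the actual data:

```python
from collections import Counter

# Words frequent on web pages but useless as interests ("web stop-words").
WEB_STOP_WORDS = {"click", "home", "login", "search", "page"}

def user_profile(visited_page_terms: list[list[str]],
                 k: int = 3) -> list[tuple[str, int]]:
    """Top-k most frequent terms across visited pages, stop-words removed."""
    counts = Counter(
        term
        for page in visited_page_terms
        for term in page
        if term not in WEB_STOP_WORDS
    )
    return counts.most_common(k)

pages = [
    ["semantic", "web", "login"],
    ["semantic", "search", "ontology"],
    ["web", "ontology", "semantic"],
]
print(user_profile(pages))  # → [('semantic', 3), ('web', 2), ('ontology', 2)]
```

The resulting frequency-ranked keywords are exactly what a tag cloud visualizes, with font size proportional to the counts.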


ALEF, Adaptive Learning Framework. We proposed a schema for adaptive web-based learning and, based on it, we developed ALEF (Adaptive LEarning Framework), a framework for creating adaptive and highly interactive web-based learning systems [16].

The ALEF domain model follows the resource metadata model described above. The content consists of learning objects that can be of three types: explanation, question and exercise. The domain model covers, for every learning object, the actual content (text and media) and additional metadata containing information relevant for personalization services (concepts, tags, comments). Compared to other existing approaches, the notion of metadata in ALEF is quite simplified, which allows for automatic construction of the domain model; on the other hand, it still provides a solid basis for reasoning, resulting in advanced operations such as metadata-based personalized navigation.

News recommendation. We proposed content based news recommendation based on article similarity. Considering the high dynamics and the large daily volume of news, we devised and evaluated in real settings two representations for effective news recommendation:

 – an efficient vector comprising the title, the term frequency of title words in the article content, names and places, keywords, category and a readability index [9],
 – a balanced tree built incrementally; it inserts articles based on content similarity [23].

A different approach to news recommendation provided on the same e-news portal (www.sme.sk) is presented in [20]. It employs a k-nearest neighbor collaborative filtering algorithm based on a generic full text engine exploiting power-law distributions. An important property of the proposed algorithm is that it maintains linear scalability with respect to the dataset size.

Adaptive faceted browser. We devised a faceted semantic exploratory browser taking advantage of adaptive and social web approaches to provide personalized visual query construction support and to address guidance and information overload [22]. It works on semantically enriched information spaces (both the data and the metadata describing the information space structure are represented by ontologies). Our browser facilitates user interface generation using metadata describing the presented information spaces (e.g., photos).


3.3    User centric approaches

Monitoring users and implicit feedback is a promising approach for the "wild" Web. Even though explicit user feedback (a user filling in forms) is easy to implement, it has serious problems with credibility, with disturbing the user, and with its dependence on the user's will.

Query expansion by social context. We proposed a method which implicitly infers the context of a search by leveraging a social network, and modifies the user's search query to include it [10]. The social network is built from the stream of the user's activity on the Web, which is acquired by means of our enhanced proxy server.

User interest estimation. We proposed a method for adaptive link recommendation [8]. It is based on an analysis of the user's navigational patterns and his behavior on web pages while browsing through a web portal. We extract interesting information from the web portal and recommend it in the form of a personalized calendar and additional personalized links.

Search history tree. We proposed an approach intended to reduce the user effort required to retrieve and/or revisit previously discovered information by exploiting web search and navigation history [18]. It is based on collecting streams of user actions during search

sessions. We provide the user with a history map – a scrutable graph of semantic terms and web resources with full-text search capability over individual history entries. It is constructed by merging individual session history trees and the associated web resources.

Discovering keyword relations from the Crowd. We proposed an approach for determining keyword relations (mainly the parent-child relationship) by leveraging the collective wisdom of the masses, which is present in the data of collaborative (social) tagging systems on the Web [1]. We demonstrated the feasibility of our approach on data coming from the social bookmarking systems delicious and CiteULike.


4    Conclusions

In this paper we described just particular aspects of the whole picture. It is not in any sense complete. It should be viewed as a discussion of certain aspects and possible partial solutions.

At the moment we have more questions than answers. How should the Web be described? What properties are important? How can we discover interesting information for a particular individual? Are there any emergent phenomena? What could we do? How can we really connect people in such a way that it will be convenient and useful? Can we trust the Web? Is its infrastructure right?

One day maybe we people will discover a silver bullet for the Web. Meanwhile we should be open to various small enhancements, try to understand the Web as much as possible, and try to integrate all the particular successes.

Acknowledgements. Figures and parts of the descriptions in Section 3 are taken from the published papers which present the particular examples, all mentioned in the References.

The author wishes to thank colleagues from the Institute of Informatics and Software Engineering and all students – members of the PeWe group, pewe.fiit.stuba.sk – for their invaluable contribution to the work presented in this invited lecture. The most current state of ongoing projects within the group is reported in [4].


References

 1. M. Barla, M. Bielikova: On deriving tagsonomies: keyword relations coming from the Crowd. LNCS 5796, ICCCI 2009, Springer, 2009, 309–320.
 2. M. Barla, M. Bielikova: Ordinary web pages as a source for metadata acquisition for open corpus user modeling. Proc. of IADIS WWW/Internet 2010, 2010.
 3. T. Berners-Lee, J. Hendler, O. Lassila: The semantic web. Scientific American Magazine, May 2001.
 4. M. Bielikova, P. Navrat (Eds.): Workshop on the Web – Science, Technologies and Engineering, 2010. ISBN 978-80-227-3274-1. Available online at pewe.fiit.stuba.sk/ontoparty-2009-2010-spring/.
 5. A. Broder: A taxonomy of web search. ACM SIGIR Forum, 36 (2), 2002, 3–10.
 6. P. Brusilovsky et al.: The adaptive web. LNCS 4321, Springer, 2007, ISBN 978-3-540-72078-2, 763p.
 7. J. Hendler, N. Shadbolt, W. Hall, T. Berners-Lee, D. Weitzner: Web science: an interdisciplinary approach to understanding the web. Commun. ACM, 51 (7), July 2008, 60–69.
 8. M. Holub, M. Bielikova: Estimation of user interest in visited web page. Proc. of Int. Conf. on World Wide Web, WWW 2010, ACM, 2010, 1111–1112.
 9. M. Kompan, M. Bielikova: Content-based news recommendation. LNBIP Series, E-Commerce and Web Technologies, Springer, 2010.
10. T. Kramar, M. Barla, M. Bielikova: Disambiguating search by leveraging a social context based on the stream of user's activity. LNCS 6075, UMAP 2010, Springer, 2010, 387–392.
11. M. Laclavik et al.: Ontea: platform for pattern based automated semantic annotation. Computing and Informatics, 28 (4), 2009, 555–579.
12. P. Mika: Ontologies are us: a unified model of social networks and semantics. LNCS 3729, ISWC 2005, Springer, 2005, 522–536.
13. P. Navrat, T. Taraba, A. Bou Ezzeddine, D. Chuda: Context search enhanced by readability index. IFIP WCC Series 276, Springer, 2008, 373–382.
14. T. O'Reilly: What is Web 2.0. O'Reilly Network, 2005. [Accessed 2010-07-30] Available at http://oreilly.com/web2/archive/what-is-web-20.html.
15. M. Simko, M. Bielikova: Automated educational course metadata generation based on semantics discovery. LNCS 5794, EC-TEL 2009, Springer, 2009, 99–105.
16. M. Simko, M. Barla, M. Bielikova: ALEF: A framework for adaptive web-based learning 2.0. KCKS 2010, IFIP AICT 324, Springer, 2010, 367–378.
17. M. Simko, M. Bielikova: Improving search results with lightweight semantic search. CEUR 491, SemSearch 2009 at WWW 2009, 53–54.
18. J. Simko, M. Tvarozek, M. Bielikova: Semantic history map: graphs aiding web revisitation support. Proc. of 9th Int. Workshop on Web Semantics, IEEE Computer Society, 2010.
19. J. Simko, M. Tvarozek, M. Bielikova: Little Google Game: relationships term extraction by means of search game. Proc. of Datakon 2010.
20. J. Suchal, P. Navrat: Full text search engine as scalable k-nearest neighbor recommendation system. AI 2010, IFIP AICT 331, Springer, 2010, 165–173.
21. A.G. Taylor: The organization of information. Libraries Unlimited: Englewood, USA, 1999, 300p.
22. M. Tvarozek, M. Bielikova: Generating exploratory search interfaces for the semantic web. HCIS 2010, IFIP AICT 332, Springer, 2010, 175–186.
23. D. Zelenik, M. Bielikova: Dynamics in hierarchical classification of news. Proc. of WIKT 2009, 83–87.