=Paper=
{{Paper
|id=Vol-1653/paper_26
|storemode=property
|title=Web Information Foraging
|pdfUrl=https://ceur-ws.org/Vol-1653/paper_26.pdf
|volume=Vol-1653
|authors=Yassine Drias,Gabriella Pasi
|dblpUrl=https://dblp.org/rec/conf/iir/DriasP16
}}
==Web Information Foraging==
<pdf width="1500px">https://ceur-ws.org/Vol-1653/paper_26.pdf</pdf>
<pre>
                   Web Information Foraging

                        Yassine Drias and Gabriella Pasi

                 Università degli Studi di Milano-Bicocca, DiSCo
                          Viale Sarca 336, 20126 Milano
               y.drias@campus.unimib.it , pasi@disco.unimib.it


      Abstract. We present in this paper an approach to Web information
      foraging. We implemented a technique that helps Web users undertaking
      information foraging by simulating their behavior using a colony of ar-
      tificial ants. Experiments were conducted on a website dedicated to the
      domain of Health. The results are promising and show the ability of our
      Web information foraging approach to find relevant Web pages.

      Keywords: Information Foraging, Information Seeking, Ant Colony Op-
      timization


1   Introduction
A recent paradigm related to accessing relevant information on the Web is In-
formation Foraging. The task of information foraging consists in browsing the
Web to collect information related to specific user needs under a time constraint.
Recent work on information foraging are focalized to define algorithms that are
able to discover in an automatic way the surfing paths that web users would
follow while seeking information on the Web. The development of such systems
may allow the users to spend less time locating the needed information; more-
over this will also help them to easily identify the best sources containing that
information.
    The task of foraging is grounded on the Optimal Foraging Theory [10] devel-
oped by anthropologists to model the animal behavior while foraging food.The
Information Foraging Theory was first developed in 1999 [7]. The authors estab-
lished their study on the similarity between the animal food foraging behavior
described in the Optimal Foraging Theory and the behavior of humans while
seeking information online. The theory is based on the assumption that, when
searching for information, users rely on their senses to perceive the information
scent that helps them to reach their goal just like animals do when they follow
the scent of their preys.
    Information foraging consists in simulating the behavior of real users while
seeking information on the Web. At the beginning, the information foraging
process starts from a Web page and then according to the information need of the
user, it decides which Web pages to visit in order to reach Web pages containing
relevant information to the user. At the end of the process, a collection of surfing
paths containing relevant Web pages is produced.
2

    In this paper an approach to Web information foraging based on Ant Colony
Optimization is proposed and evaluated. The artificial ants simulate the behavior
of Web users while searching for information on the Web taking into consider-
ation the complex structure of the Web and its volume. For this purpose, we
propose a Web surfing model and implement a Web surfing strategy.
    The rest of the paper is organized as follows. In section 2, we report the stud-
ies that are the most related to our concern. Web information foraging described
is then described in section 3. In section 4, we present the contribution we de-
veloped for Web information foraging using Ant Colony Optimization. Section 5
summarizes the experiments we conducted on a medical website and the results
we achieved.


2   Related Works

In the study presented in [8], the authors consider the Web as a semantic space
and try to predict the navigational choices of Web users. The notion of infor-
mation scent is introduced, which is measured as the mutual relevance between
the user’s goal and the Web pages’ content. The authors tested their model on
a database of selected tasks collected by a survey of more than 2000 web users.
The results show that the measure of information scent is able to generate good
estimations of web user interactions by predicting the links the users will click
on and also when they decide to leave the website.
    Strong regularities in Web surfing behavior were studied in [4] from a theoret-
ical point of view. The authors proposed a model for studying surfing behaviors
and the experiments they held showed common surfing behaviors. The study
conducted in [3] shows that the Web pages are distributed over the sites accord-
ing to a universal power law, which is an example of their strong regularities.
    In [5] and [6] the authors consider Web topology, information distribution and
interest profile in building a Wisdom information foraging agent. They found out
that the unique distribution of agent interest leads for regularities in Web surfing.
They also undertook an interesting study on three categories of users according
to their interest and familiarity with the Web: A random user, a rational user
and a recurrent user. The result is that independently from the type of users,
the regularities of surfing are the same, which means that the user ability for
predicting the surfing chain is predominant.
    In [9] the author presents different interactive information retrieval models
and shows the importance of developing such systems. Interactive information
retrieval is based on human behavior and the fact that the feedback of Web users
when performing a search on the Web can enhance the user experience and the
performance of IR systems.
    Link analysis has considerably improved the effectiveness of Web Search en-
gines, thanks to the analysis of the hyperlink structure of the Web. According
to [2], hyperlinks provide a valuable source of information for web information
retrieval and a large number of links is created by independent individuals ev-
eryday.
                                                                                  3

    In real life, web users generally do not get all the information they’re looking
for from the first Web page they visit. Our approach not only offers them the
opportunity to have a direct access to the most relevant Web pages but also
they can explore the whole surfing path which contains Web pages with com-
plementary information that may interest them as well. This new way to access
information is ensured thanks to the fact that information foraging is inspired
from human Web navigation behavior. Our approach takes advantage from some
related research areas such as interactive information retrieval and link analysis.
It particularly takes into consideration the Web pages’ information content, their
distribution on the Web and the relation between them. The evaluation of our
proposal was held on a real website, unlike previous works that were tested on
log files or other forms of data.


3   Web Information Foraging

The Web is usually represented as a directed graph G(P, L). The vertices P
correspond to the Web pages, where two Web pages pi and pj are connected via
a directed edge if pi contains a link referring to pj . Considering a user with a
specific interest, we may assume that he is interested in visiting a branch of the
webgraph, one node at a time starting from an initial Web page and ending at
a page containing some relevant information concerning his interest.
    The transition choices that the user makes during the navigation define his
surfing strategy. The set of visited Web pages while navigating is called a surfing
path. The goal of information foraging is to determine the optimal paths to reach
relevant Web pages for a given user interest. To this purpose, at each click on
a page pi , the question is to find the best move to another page pj in order to
build a surfing path.
    In the literature it has been outlined that the surfing behaviors differ from
one user to another, depending on the familiarity of the user with the Web
environment, as well as his goal behind browsing the Web. In particular, three
types of surfing behaviors have been identified in [5], but not formalized:
    A pseudo random strategy which concerns users that are not familiar with
the Web and are browsing it without having a strong interests in any specific
topics; A rational strategy which is the closest to real-life Web navigation as
most of Web users behave rationally. They usually have a specific goal behind
Web surfing and they try to reach it by selecting the Web pages that seem the
closest to their information need; A recurrent strategy concerning users who are
familiar with the Web and have a well-defined goal behind browsing it. This
kind of users always makes the best decision when they have to move from the
current Web page they are into a new one by selecting the most relevant Web
page among to the possible pages.
    When surfing, Web users are guided by the information scent they get from
Web pages. Starting with a weak amount, the information scent increases as
the user gets closer to the Web pages that interest him. In order to model the
4

user’s information need (interest), we consider a vector containing keywords that
represent topics in which the user is interested.


4   Web Information Foraging using Ant Colony
    Optimization

As mentioned in the previous section, the majority of Web users behave ratio-
nally when seeking information on the Web [5]. The aim of our work is to develop
algorithms to efficiently implement the rational surfing strategy. To this purpose
we have applied Ant Colony Optimization (ACO). The main reason to choose
ACO resides in the fact that it simulates the ants’ food foraging behavior which
is similar to the human information foraging behavior on the Web.
    The ant system algorithm [1] includes several ant generations, each genera-
tion is composed of NbAnts ants. Where NbAnts is the population size of the
ant colony. Each artificial ant starts building a solution from an initial state i
generated randomly. Recall that a solution is a connection of states and it is
constructed using a stochastic process. The ant chooses a new state j from the
current state’s neighborhood Ni , with a probability computed by Formula (1):

                                         phero[j]
                            P (i, j) = P                                      (1)
                                        l∈Ni phero[l]

    To each state i is assigned an amount of pheromone denoted by phero[i]. The
pheromone information is initialized with a very small value in order to simu-
late the fact that real ants deposit a very small amount of pheromone on the
ground when starting their food foraging. Two structures are needed to compute
the ant algorithm, a table named Phero to store the pheromone amount yielded
by the ants each time they build a solution and a table called sol to save the
best solution found by each ant. phero[k] corresponds to the pheromone amount
associated with the solution found by ant k and sol[k] is the best solution deter-
mined by ant k. The tables are updated at each generation of ants. Besides, two
variables namely best and bestsol are used to save respectively the best solution
found during the current generation and the best solution computed since the
beginning of the process. During the search, the pheromone amount, which rep-
resents the effectiveness of the solution, will be computed and associated with
each solution found by the ants.
    The strategies of updating pheromone simulate the evaporation of natural
pheromone followed by a production of this chemical substance. The evaporation
phenomenon gives rise to rule (2) where the empirical parameter ρ belongs to
the interval [0, 1] and simulates the evaporation rate. Pheromone evaporation
prevents from premature convergence. An online delayed update is performed
at each generation of ants, the pheromone added is calculated for each state of
the solution according to rule (3). It is a delayed update because the pheromone
assigned to a state is updated once the ant determines a solution. For the offline
update, rule (4) is applied. Recall that bestsol is the best solution found during
                                                                                          5

the previous iterations and best is the best solution of the current iteration. The
added pheromone amount is proportional to the ratio of these values.

                            phero[i] = (1 − ρ) ∗ phero[i]                               (2)
                           phero[i] = phero[i] + ρ ∗ f (s)                              (3)
                    phero[k] = phero[k] + ρ ∗ f (bestsol)/f (best)                      (4)

4.1   Simulating the Rational Strategy using ACO
We present here the adaptation of the Ant Colony System algorithm to Web
information foraging in order to simulate the rational surfing behavior.
    As we previously mentioned in section 3, rational users are interested in
specific topics and they forage in order to locate Web pages that contain in-
formation on those topics. When they reach a new Web page, they will try to
decide whether or not the content sufficiently matches their interest and, if not,
predict which Web page at the next level will become a more interesting one. In
predicting the next-level contents, they will rely on the information scent they
get from the titles or descriptions of various hyperlinks inside the current Web
page. We notice that the information scent is analogous to the pheromone that
guides the ants when they’re seeking food. In our adaptation, the artificial ants
simulate the behavior of Web users that have a rational surfing strategy.


Algorithm 1 ACO-WIF
Input: N (A part of the Web); user interest;
Output: bestsol, a surfing path ending with a relevant Web page;
 1: procedure ACO-WIF
 2:    for i=1 to NbAnts do phero[i]= 0.1;                     . pheromone initialization
 3:    end for
 4:    select at random a solution s from N ;                   . a surfing path namely s
 5:    best := bestsol := s;
 6:    for i=1 to MaxIter do
 7:        for k=1 to NbAnts do
 8:            sol[k] := build Sol();
 9:            update the online pheromone using Formulas (2) and (3);
10:            if f (sol[k]) > f (best) then then best := sol[k];     . f : fitness function
11:            end if
12:        end for
13:        if f (best) > f (bestsol) then bestsol := best;
14:        end if
15:        apply offline-update of pheromone using Formula (4);
16:    end for
17:    return (bestsol);
18: end procedure


Solutions encoding. The search space for the ant colony is the set of all
possible surfing paths. In ACO ants build solutions which are surfing paths for
6

our case. A surfing path is composed by the set of all visited Web pages during the
foraging process and contains at least one Web page. It starts with an initial page
and ends with a target page which should incorporate relevant information. The
number of Web pages contained in the surfing path is called the surfing depth.
So ants will seek surfing paths that contain relevant Web pages. The adapted
Ant Colony System for Web Information Foraging is outlined in Algorithm 1.


Building a solution. Each ant performs the task of exploring the best surfing
path in a specific part of the Web. The ant builds a solution by selecting Web
pages according to a probability P defined in Formula (5). In addition to the
pheromone, we introduce a heuristic that brings a knowledge on our problem in
order to help the ants to make better foraging moves. The heuristic we propose is
measured as the similarity between potential next page pj and the user’s interest
represented by the vector V. The neighborhood of a page pi is a set of Web pages
that are connected to pi via a Web link. We introduce a noise parameter q0 in
order to simulate the fact that even Web users that behave rationally don’t
always make the best decision when surfing. This may be the result of multiple
factors such as their unfamiliarity with the Web for example.
    if q ≤ q0              (
                            1 if pj = argmax(phero[j]α (heurjv )β ) for pj ∈ Ni
       then P (pi , pj ) =                                                      (5)
                            0 else
else                                      phero[i]α (heurjv )β
                       P (pi , pj ) =
                                        Σl∈Ni phero[l]α (heurlv )β

    In other words, the ant decides stochastically to consider the most relevant
Web page among the outgoing Web pages when q ≤ q0 and a Web page drawn
at random otherwise.
    The evaluation of a surfing path quality is performed by applying the fit-
ness function f to the last Web page of the path. This function represents the
similarity between the user interest and the description of the Web page which
contains the title of the page and eventually some tags and keywords. It can be
computed using one of the similarity functions defined in literature.


5      Experiments

5.1     Description of the real-world Benchmark

We performed our experiments on MedlinePlus, an online medical Website pro-
duced by the National Library of Medicine of the U.S. It provides information
on over 900 diseases, health conditions and wellness issues. Our experiments
deal with health topics described in an XML file that includes pages describing
medical topics. The data for health topics is available at http://www.nlm.nih.
gov/medlineplus/xml.html. We worked on the version of the 25th February
2016 where the number of Web pages was equal to 1946. Each topic is specified
                                                                                    7

by a title and contains the following elements: a URL, a unique identifier, the
language of the topic (English or Spanish), the date of its creation, topic syn-
onyms, translation to other languages, a full summary, a list of related topics,
which are internal links to similar topics and external links.


5.2    Results

Different user interests were experimented for evaluating the rational surfing
strategy. The results we focused on are: the last Web page on the surfing path
(the most relevant one), its URL, its score, the surfing depth of the path and
the surfing time in milliseconds. Table 1 exhibits the obtained results for the
rational surfing strategy.
                                        Relevant Page                         Surfing Surfing
      User interest                                                     Score
                              Title                     URL                    depth time
Childhood, Leukemia Childhood Leukemia */childhoodleukemia.html 1.0              2     115
  Food, Poisoning     Foodborne Illness       */Foodborneillness.html    1.0     3     167
    Hypotension      Low Blood Pressure */lowbloodpressure.html 1.0              2     125
   Skin, Allergies      Skin Conditions        */skinconditions.html     0.5     2     992
    Illness, Heat         Heat Illness           */heatillness.html      1.0     4     175
    Chest, Injury        Chest Injuries     */chestinjuriesanddisorders 0.66     5     338
                         and Disorders                 .html
         Zika              Zika Virus             */zikavirus.html       1.0     1     204
                     * : http://www.nlm.nih.gov/medlineplus
  Table 1. Results for different user interests using the Rational Surfing Strategy


                           Fig. 1. Detailed surfing paths

    From table 1, we can observe that our approach is able to find relevant Web
pages to the user based on the vector describing his interest. The developed
program also makes use of the health topics’ synonyms provided by MedlinePlus
in order to locate the most relevant Web pages to the user. For example, in the
third instance of the table the user interest was ”Hypotension” and the result
returned by the program was a Web page entitled ”Low Blood Pressure” which
is a synonym of Hypotension.
8

    Figure 1 shows more details about the results for the user interests ”illness,
heat” and ”food, poisoning”. The surfing path for ”illness, heat” starts from a
Web page named ”Melanoma” and ends with the Web page ”Heat Illness”. The
surfing depth in this case is equal to 4. We notice that the last Web page in the
surfing path is the most relevant to the user. However, the other Web pages are
also related to the user’s interest and may be interesting for him.


6    Conclusion
In this work, we presented an approach to Web information foraging based on
Ant Colony Optimization. We proposed a model for Web surfing in order to
simulate the Web surfing behavior of real Web users. This idea was inspired from
the information foraging theory which states that Web users and animals have
similar behaviors when looking for information/food. Furthermore, a real website
was used for the experiments instead of an artificial one or log files as it was
performed in the literature. The results show the ability of our program to find
relevant Web pages in a short time based on a user interest. The outcomes consist
in a set of surfing paths ranked by relevance. This offers the user the possibility
to go more in depth with getting information on a certain topic without spending
too much time on visiting a lot of Web pages.
    As a perspective, we intend to investigate Web information foraging in social
and information sharing networks such as Twitter.


References
 1. Dorigo, M., Caro, G.D.: Ant algorithms for discrete optimization. Artificial Life
    5-3, 137–172 (1999)
 2. Henzinger, M.R.: Link analysis in web information retrieval. IEEE Data Eng. Bull.
    23(3), 3–8 (2000)
 3. Huberman, B.A., Adamic, L.A.: Growth dynamics of the world-wide web. Nature
    40, 7478–7491 (1999)
 4. Huberman, B.A., Pirolli, P., Pitkow, J.E., Lukose, R.M.: Strong regularities in
    world wide web surfing. Science (1997)
 5. Liu, J., Zhang, S.W.: Characterizing web usage regularities with information forag-
    ing agents. IEEE Transactions on Knowledge and Data Engineering 40, 7478–7491
    (2004)
 6. Liu, J., Zhong, N., Yao, Y., , Ras, Z.: The wisdom web: New challenges for web
    intelligence (wi). Expert System With Applications 40, 7478–7491 (2013)
 7. Pirolli, P., Card, S.K.: Information foraging. Psychological Review 106(4), 643–675
    (1999)
 8. Pirolli, P., Fu, W.T.: Snif-act: A model of information foraging on the world wide
    web. User Modeling 2003, 9th International Conference 22-26, 45–54 (2003)
 9. Robins, D.: Interactive information retrieval: Context and basic notions. Inform-
    ingSciJ 3, 57–62 (2000)
10. Werner, E.E., Hall, D.J.: Optimal foraging and the size selection of prey by the
    bluegill sunfish (lepomis macrochirus). Ecology 55(5), 1042 (1974)

</pre>