<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>May</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Mining Evolving Web Clickstreams with Explicit Retrieval Similarity Measures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Olfa Nasraoui</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cesar Cardona</string-name>
          <email>ccardona@memphis.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlos Rojas</string-name>
          <email>crojas@memphis.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electrical and Computer Engineering, The University of Memphis</institution>
          ,
          <addr-line>206 Engineering Science Bldg., Memphis, TN 38152</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2004</year>
      </pub-date>
      <volume>1</volume>
      <fpage>7</fpage>
      <lpage>22</lpage>
      <abstract>
        <p>Data on the Web is noisy, huge, and dynamic. This poses enormous challenges to most data mining techniques that try to extract patterns from this data. While scalable data mining methods are expected to cope with the size challenge, coping with evolving trends in noisy data in a continuous fashion, and without any unnecessary stoppages and reconfigurations, is still an open challenge. This dynamic and single pass setting can be cast within the framework of mining evolving data streams. The harsh restrictions imposed by the “you only get to see it once” constraint on stream data call for different computational models that may bring some interesting surprises when it comes to the behavior of some well known similarity measures during clustering. In this paper, we explore the task of mining evolving clusters in a single pass with a new scalable immune based clustering approach (TECNO-STREAMS), and study the effect of the choice of different similarity measures on the mining process and on the interpretation of the mined patterns. We propose a simple similarity measure that has the advantage of explicitly coupling the precision and coverage criteria to the early learning stages, and furthermore requiring that the affinity of the data to the learned profiles or summaries be defined by the minimum of their coverage or precision, hence requiring that the learned profiles are simultaneously precise and complete, with no compromises. In our simulations, we study the task of mining evolving user profiles from Web clickstream data (web usage mining) in a single pass, and under different trend sequencing scenarios.</p>
      </abstract>
      <kwd-group>
        <kwd>artificial immune systems</kwd>
        <kwd>unsupervised learning</kwd>
        <kwd>clustering</kwd>
        <kwd>stream data mining</kwd>
        <kwd>web usage mining</kwd>
        <kwd>text mining</kwd>
        <kwd>mining evolving data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Natural organisms exhibit powerful learning and processing abilities
that allow them to survive and proliferate generation after generation
in ever changing and challenging environments. The natural immune
system is a powerful defense system that exhibits many signs of
cognitive learning and intelligence [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In particular, the acquired or adaptive
immune system is comprised mainly of lymphocytes, which are special
types of white blood cells (B-cells) that detect and destroy pathogens,
such as viruses and bacteria. Identification of a particular pathogen is
enabled by soluble proteins on the cell surface, called antigens. Special
protein receptors on the B-cell surface, called antibodies, are specialized
to react to a particular antigen by binding to this antigen. Lymphocytes
are only activated when the bond exceeds a minimum strength that may
be different for different lymphocytes. A stronger binding with an
antigen induces a lymphocyte to clone more copies of itself, hence providing
reinforcement. Mature lymphocytes form the long term memory of the
immune system, and help recognize and fight similar antigens that may
be encountered in the future. Therefore, the immune system can
perform pattern recognition and associative memory in a continuous and
decentralized manner.
      </p>
      <p>
        Recently, data mining has put even higher demands on clustering
algorithms. They now must handle very large data sets, leading to some
scalable clustering techniques. However, most scalable clustering
techniques such as BIRCH [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] and the scalable K-Means (SKM) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
assume that clusters are clean of noise, hyper-spherical, similar in size,
and span the whole data space. Robust clustering techniques have
recently been proposed to handle noisy data. Another limitation of most
clustering algorithms is that they assume that the number of clusters is
known. However, in practice, the number of clusters may not be known.
This problem is called unsupervised clustering. A recent explosion of
applications generating and analyzing data streams has added
unprecedented challenges for clustering algorithms, which must
track changing clusters in noisy data streams using only the new data
points, because storing past data is not even an option [
        <xref ref-type="bibr" rid="ref1 ref2 ref5 ref9">2, 1, 5, 9</xref>
        ].
      </p>
      <p>
        Web usage mining [
        <xref ref-type="bibr" rid="ref17 ref18 ref19 ref20 ref21 ref22 ref24 ref25 ref26 ref3 ref7">24, 26, 20, 7, 21, 3, 19, 22, 18, 17, 25</xref>
        ] has recently
attracted attention as a viable framework for extracting useful access
pattern information, such as user profiles, from massive amounts of Web log
data for the purpose of Web site personalization and organization. Most
efforts have relied mainly on clustering or association rule discovery as
the enabling data mining technologies. Typically, data mining has to be
completely re-applied periodically and offline on newly generated Web
server logs in order to keep the discovered knowledge up to date.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], we proposed a new immune system inspired approach for
clustering noisy multi-dimensional stream data, called TECNO-STREAMS
(Tracking Evolving Clusters in NOisy Streams), that has the advantages
of scalability, robustness, and automatic scale estimation.
TECNO-STREAMS is a scalable clustering methodology that gleans
inspiration from the natural immune system to be able to continuously learn
and adapt to new incoming patterns by detecting an unknown number of
clusters in evolving noisy data in a single pass.
      </p>
      <p>In this paper, we study the possibility of mining evolving user profiles
from Web clickstream data (web usage mining) in a single pass, and
under different usage trend sequencing scenarios. We also study the effect
of the choice of different similarity measures on the mining process and
on the interpretation of the mined patterns. We propose a simple
similarity measure that has the advantage of explicitly coupling the precision
and coverage criteria to the early learning stages, and furthermore
requiring that the affinity of the data to the learned profiles or summaries be
defined by the minimum of their coverage or precision, hence requiring
that the learned profiles are simultaneously precise and complete, with
no compromises.</p>
      <p>The rest of the paper is organized as follows. In Section 2, we
describe the TECNO-STREAMS algorithm and compare it to some
existing scalable clustering algorithms. In Section 3, we describe how we can
use TECNO-STREAMS to track evolving clusters in Web usage data,
and illustrate using it for mining real Web clickstream data, while
studying the effect of the choice of different similarity measures on mining
and interpreting the evolving profiles. Finally, in Section 4, we present
our conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>TECNO-STREAMS (TRACKING EVOLVING CLUSTERS IN NOISY STREAMS)</title>
      <p>
        The immune system (lymphocyte elements) can behave as an
alternative biological model of intelligent machines, in contrast to the
conventional model of the neural system (neurons). In particular, the Artificial
Immune Network (AIN) model is based on Jerne’s Immune Network
theory [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The system consists of a network of B cell lymphocytes
that summarize the learned model. The immune network consists of
a set of artificial B-cells, as well as stimulating and suppressing
links between them. Learning takes as input a set of antigen training
data, and tries to learn an optimal immune network consisting of
linked B-Cells based on cloning operations as in nature. Each B-Cell
represents a learned pattern that could be matched to or validated by an
antigen/data item or another B-Cell in the network. A link between two
B-Cells gets stronger if they are more similar. Data from the antigen
training set is matched against a B-Cell based on a properly chosen
similarity measure. This affects the B-Cell’s stimulation level, which in turn
affects both its outlook for survival, as well as the number of clones that
it produces. Because clones are similar to their spawning parent, they
together form a network of co-stimulated cells that can sustain themselves
even long after the disappearance of antigen data that has initiated the
cloning. However, this network of B-cells will slowly wither and die if it
is no longer stimulated by the antigen data for which it has specialized,
hence gradually forgetting old encounters. This forgetting is the
reason why the immune system needs periodical reminders in the form of
re-vaccination. The combined recall and forgetting behavior in the face
of external antigenic agents forms the fundamental principle behind the
concept of emerging or dynamic memory in the immune system. This is
specifically the reason why the immune system metaphor offers a very
competitive model within the evolving data stream framework. In the
following description, we present a more formal treatment of the
intuitive concepts explained above.
      </p>
    </sec>
    <sec id="sec-3">
      <p>
        Here, we summarize the TECNO-STREAMS approach omitting some
of the details and proofs that can be found in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. In a dynamic
environment, the objects from a data stream are presented to the
immune network one at a time, with the stimulation and scale measures
re-updated with each presentation. It is more convenient to think of the
antigen index, <italic>j</italic>, as monotonically increasing with time. That is, the antigens
are presented in chronological order.
      </p>
      <p>
        The Dynamic Weighted B-Cell (D-W-B-cell) represents an influence zone
over the domain of discourse consisting of the training data set.
However, since data is dynamic in nature, and has a temporal aspect, data that
is more current will have higher influence compared to data that is less
current. Quantitatively, the influence zone is defined in terms of a weight
function that decreases not only with distance from the antigen/data
location to the D-W-B-cell prototype, but also with the time since the antigen
has been presented to the immune network. It is convenient to think of
time as an additional dimension that is added to the D-W-B-Cell
compared to the classical B-Cell, traditionally statically defined in antigen
space only [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <sec id="sec-4-3">
        <title>Definition 1: (Robust Weight/Activation Function)</title>
        <p>For the <italic>i</italic>-th D-W-B-cell, DWB<sub><italic>i</italic></sub>, we define the activation caused by the
<italic>j</italic>-th antigen data point, after <italic>J</italic> antigens have been presented, as a weight
that depends on three quantities. The first is a parameter that controls the
time decay rate of the contribution from old antigens, and hence how much
emphasis is placed on the currency of the immune network compared to the
sequence of antigens encountered so far. The second is the distance from the
<italic>j</italic>-th antigen (the <italic>j</italic>-th antigen encountered by the immune network) to
D-W-B-cell DWB<sub><italic>i</italic></sub>. The third is a scale parameter that controls the decay rate
of the weights along the spatial dimensions, and hence defines the size of an
influence zone around a cluster prototype. Data samples falling far from this
zone are considered outliers. The weight functions decrease exponentially with
the order of presentation of an antigen, <italic>j</italic>, and therefore will favor more
current data in the learning process.</p>
      </sec>
      <sec id="sec-3-1">
        <title>Definition 2: (Influence Zone)</title>
        <p>The <italic>i</italic>-th D-W-B-cell represents a soft influence zone that can be interpreted
as a robust zone of influence, consisting of all the data points that succeed
in activating this cell.</p>
        <p>Each D-W-B-cell is allowed to have its own zone of influence. Outliers are
easily detected as data points falling outside the influence zone.</p>
        <p>The pure stimulation level, after <italic>J</italic> antigens have been presented to
DWB<sub><italic>i</italic></sub>, is defined as the density of the antigen population around DWB<sub><italic>i</italic></sub>.</p>
      </sec>
      <sec id="sec-4-1">
        <title>Lemma 1: (Optimal Scale Update) [14]</title>
        <p>The equations for optimal scale updates are derived in [14].
For the purpose of computational efficiency, however, we convert these
equations to incremental counterparts as follows.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Lemma 2: (Incremental Update of Pure Stimulation and Optimal Scale)</title>
        <p>After <italic>J</italic> antigens have been presented to a D-W-B-cell, pure stimulation
and optimal scale can be updated, respectively, using approximate incremental
equations.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Dynamic Stimulation and Suppression</title>
      <p>
        We incorporate a dynamic stimulation factor in the
computation of the D-W-B-cell stimulation level by adding a
compensation term that depends on other D-W-B-cells in the network [
        <xref ref-type="bibr" rid="ref11 ref23">11, 23</xref>
        ].
In other words, a group of co-stimulated D-W-B-cells can self-sustain
themselves in the immune network, even after the antigen that caused
their creation disappears from the environment. However, we need to
put a limit on the time span of this memory to forget truly outdated
patterns. This is done by allowing D-W-B-cells to have their own
stimulation coefficient, and to have this stimulation coefficient decrease with
their age. We also incorporate a dynamic suppression factor.
      </p>
      <p>Cloning in the Dynamic Immune System</p>
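      <p>To make the robust weight of Definition 1 concrete, the following is a minimal sketch of one functional form that is consistent with the description above: it decays exponentially both with squared distance (relative to the cell's scale) and with the number of antigens presented since. The exact form, the factor of 2, and the names <monospace>activation</monospace>, <monospace>is_outlier</monospace>, <monospace>tau</monospace>, and <monospace>w_min</monospace> are assumptions made for illustration, not the definition given in [14].</p>
      <preformat>
```python
import math

def activation(distance_sq, scale_sq, antigens_since, tau):
    """Assumed robust weight: decays exponentially with squared distance
    (relative to the cell's scale) and with the number of antigens seen since."""
    return math.exp(-(distance_sq / (2.0 * scale_sq) + antigens_since / tau))

def is_outlier(weights, w_min):
    """Treat an antigen as an outlier if it activates no cell up to w_min."""
    return all(w_min > w for w in weights)

# Same distance, but the second antigen was presented 10 antigens ago.
fresh = activation(distance_sq=1.0, scale_sq=1.0, antigens_since=0, tau=10.0)
stale = activation(distance_sq=1.0, scale_sq=1.0, antigens_since=10, tau=10.0)
far = activation(distance_sq=25.0, scale_sq=1.0, antigens_since=0, tau=10.0)
print(fresh > stale)                  # True: older antigens weigh less
print(is_outlier([far], w_min=0.01))  # True: exp(-12.5) is far below 0.01
```
      </preformat>
      <p>A larger <monospace>tau</monospace> in this sketch makes the weights forget more slowly, while a smaller scale shrinks the zone in which an antigen can activate the cell.</p>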
    </sec>
    <sec id="sec-5">
      <title>Comparison to Other Clustering Techniques</title>
      <p>Because of paucity of space, we review only some related methods,
as summarized in Table 1. We note that all immune based techniques,
as well as most evolutionary type clustering techniques, are expected to
benefit from insensitivity to initial conditions (reliability) by virtue of
being population based. Moreover, most techniques achieve their
scalability by using a special indexing structure which requires an additional
preliminary scan of the data, which may not be acceptable in the context
of data streams.</p>
      <preformat>TECNO-STREAMS Algorithm:
(optional steps are enclosed in [])
Fix the maximal population size;
Initialize the D-W-B-cell population and scale values using the first input antigens;
Compress immune network into subnets using 2 iterations of K Means;
Repeat for each incoming antigen
    Present antigen to each subnet centroid in the network:
        compute distance and activation weight, and update the scale incrementally using (6);
    Determine the most activated subnet (the one with maximum activation);
    IF all B-cells in the most activated subnet have activation below the activation threshold
            (antigen does not sufficiently activate subnet) THEN
        Create by duplication a new D-W-B-cell from the antigen and initialize its scale;
    Clone and mutate D-W-B-cells;
    IF population size exceeds the maximal population size THEN
        IF (age condition on B-cell) THEN
            Temporarily scale D-W-B-cell’s stimulation level to the network average stimulation;
        Sort D-W-B-cells in ascending order of their stimulation level;
        Kill worst excess D-W-B-cells (top of the list according to previous sorting)
            [or move oldest/mature D-W-B-Cells to secondary (long term) storage];
    Compress immune network periodically (after a fixed number of antigens) into subnets
        using 2 iterations of K Means with the previous centroids as initial centroids;</preformat>
      <p>
        The number of possible internal interactions (between different cells
in the network) can be a serious bottleneck in the face of all existing
immune network based learning techniques [
        <xref ref-type="bibr" rid="ref11 ref23">11, 23</xref>
        ]. Suppose that the
immune network is compressed by clustering the D-W-B-cells using a
linear complexity approach such as K Means. Then the immune network
can be divided into subnetworks that form a parsimonious view of the
entire network. For global low resolution interactions, such as the ones
between D-W-B-cells that are very different, only the inter-subnetwork
interactions are germane. For higher resolution interactions such as the
ones between similar D-W-B-cells, we can drill down inside the
corresponding subnetwork and afford to consider all the intra-subnetwork
interactions.
      </p>
      <p>Instead of taking into account all possible interactions between
all cells in the immune network, only the intra-subnetwork
interactions with the D-W-B-cells inside the parent subnetwork (the closest
subnetwork to which this B cell is assigned) are taken into account. In
case K-Means is used, this representative as well as the organization of
the network into subnetworks is a by-product. For more complex data
structures, a reasonable best representative/prototype (such as a medoid)
can be chosen. Taking these modifications into account, the stimulation
and scale values are computed so as to take advantage of the compressed
network.</p>
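      <p>The saving from restricting interactions to subnetworks can be illustrated with a rough count. The sketch below is only an illustration under simple assumptions: it counts ordered cell pairs (including self-pairs) for the uncompressed network, and intra-subnetwork pairs plus centroid-to-centroid pairs for the compressed one. The function name <monospace>interaction_counts</monospace> and the choice of 5 equal subnetworks are hypothetical; only the population size of 50 comes from the experiments.</p>
      <preformat>
```python
def interaction_counts(subnet_sizes):
    """Return (full, compressed) counts of pairwise interactions.

    full:       every cell interacts with every cell.
    compressed: cells interact only inside their own subnetwork, and
                subnetworks interact with each other through one
                representative each.
    """
    total_cells = sum(subnet_sizes)
    full = total_cells * total_cells
    intra = sum(n * n for n in subnet_sizes)
    inter = len(subnet_sizes) ** 2
    return full, intra + inter

# 50 cells split evenly into 5 subnetworks (the split is an assumption).
print(interaction_counts([10, 10, 10, 10, 10]))  # (2500, 525)
```
      </preformat>
      <p>Under these assumptions the compressed network needs roughly a fifth of the interaction terms, and the gap widens as the population grows while subnetworks stay small.</p>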
      <sec id="sec-5-1">
        <title>Lemma 3: (Effect of Network Compression on Scalability)</title>
        <p>The proposed AIS based clustering model can achieve scalability at a finite
compression rate.</p>
        <p>
          Recently, data mining techniques have been applied to extract usage
patterns from Web log data [
          <xref ref-type="bibr" rid="ref17 ref18 ref19 ref20 ref21 ref22 ref24 ref25 ref26 ref3 ref7">24, 26, 20, 7, 21, 19, 22, 3, 18, 17, 25</xref>
          ]. In
[
          <xref ref-type="bibr" rid="ref18 ref19">19, 18</xref>
          ], we have proposed new robust and fuzzy relational clustering
techniques that allow Web usage clusters to overlap, and that can detect
and handle outliers in the data set. A new subjective similarity measure
between two Web sessions, that captures the organization of a Web site,
was also presented as well as a new mathematical model for “robust”
Web user profiles [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] and quantitative evaluation means for their
validation. Unfortunately, the computation of a huge relation matrix added
a heavy computational and storage burden to the clustering process.
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], we presented a quasi-linear complexity technique, called
Hierarchical Unsupervised Niche Clustering (H-UNC), for mining both
user profile clusters and URL associations in a single step. More
recently, we have presented a new approach to mining user profiles that
is inspired by concepts from the natural immune system [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. This
approach proved to be successful in mining clusters and frequent
itemsets from large web session data. This kind of data, which is extremely
sparse, presents a real challenge to conventional clustering and frequent
itemset mining techniques. Many data sets share this sparsity with
clickstream data: these include text data as well as a large number of
transactional databases. Unfortunately, all the above methods assume that
the entire preprocessed Web session data could reside in main memory.
        </p>
        <p>This can be a disadvantage for systems with limited main memory in
case of huge web session data, since the I/O operations would have to
be extensive to shuffle chunks of data in and out, and thus compromise
scalability. Today’s web sites are a source of an exploding amount of
clickstream data that can put the scalability of any data mining technique
into question.</p>
        <p>Moreover, the Web access patterns on a web site are very dynamic in
nature, due not only to the dynamics of Web site content and structure,
but also to changes in the user’s interests, and thus their navigation
patterns. The access patterns can be observed to change depending on the
time of day, day of week, and according to seasonal patterns or other
external events in the world. As an alternative to locking the state of
the Web access patterns in a frozen state depending on when the Web
log data was collected and preprocessed, we propose an approach that
considers the Web usage data as a reflection of a dynamic environment
which therefore requires dynamic learning of the access patterns. An
intelligent Web usage mining system should be able to continuously learn
in the presence of such conditions without ungraceful stoppages,
reconfigurations, or restarting from scratch. In this section, we illustrate
using TECNO-STREAMS to continuously and dynamically learn evolving
Web access patterns from non-stationary Web usage environments.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Similarity Measures Used in the Learning Phase of Single-Pass Mining of Clusters in Web Data</title>
      <p>
        For many data mining applications such as clustering text documents
and other high dimensional data sets, the Euclidean distance measure is
not appropriate. This is due mainly to the high dimensionality of the
problem, and the fact that two documents should not be considered similar
merely because the same keywords are missing in both documents. More
appropriate for this application is the cosine similarity measure between a
data item <inline-formula><tex-math>s_j</tex-math></inline-formula> and a learned B-Cell profile <inline-formula><tex-math>p_i</tex-math></inline-formula>, which in the simplest case can both be
defined as binary vectors of length <inline-formula><tex-math>N</tex-math></inline-formula>, the total number of items/URLs or
keywords [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]:
        <disp-formula><label>(9)</label><tex-math>S_{\cos}(i,j) = \frac{\sum_{k=1}^{N} p_{ik}\, s_{jk}}{\sqrt{\sum_{k=1}^{N} p_{ik}\; \sum_{k=1}^{N} s_{jk}}}</tex-math></disp-formula>
        We note that it is easy to show that the cosine similarity is related to
the well known information retrieval measures of precision and coverage
as follows:
        <disp-formula><label>(10)</label><tex-math>S_{\cos}(i,j) = \sqrt{\mathrm{Prec}^{L}_{ij}\; \mathrm{Covg}^{L}_{ij}}</tex-math></disp-formula>
        where the precision in the learning phase, <inline-formula><tex-math>\mathrm{Prec}^{L}_{ij}</tex-math></inline-formula>, describes the
accuracy of the learned B-cell profiles in representing the data, or
the ratio of the number of matching items (URLs or terms) between the
learned profile and the data (session or document) to the number of items
in the learned profile:
        <disp-formula><label>(11)</label><tex-math>\mathrm{Prec}^{L}_{ij} = \frac{\sum_{k=1}^{N} p_{ik}\, s_{jk}}{\sum_{k=1}^{N} p_{ik}}</tex-math></disp-formula>
        while the coverage in the learning phase, <inline-formula><tex-math>\mathrm{Covg}^{L}_{ij}</tex-math></inline-formula>, describes the
completeness of the learned B-cell profiles in representing the data,
or the ratio of the number of matching items (URLs or terms) between
the learned profile and the data (session or document) to the number of
items in the data:
        <disp-formula><label>(12)</label><tex-math>\mathrm{Covg}^{L}_{ij} = \frac{\sum_{k=1}^{N} p_{ik}\, s_{jk}}{\sum_{k=1}^{N} s_{jk}}</tex-math></disp-formula>
        The MinPC similarity instead defines the affinity of a data item to a learned
profile as the minimum of the precision and the coverage, hence requiring that
the learned profiles be simultaneously precise and complete:
        <disp-formula><label>(13)</label><tex-math>S_{\min}(i,j) = \min\left\{\mathrm{Prec}^{L}_{ij},\; \mathrm{Covg}^{L}_{ij}\right\}</tex-math></disp-formula>
      </p>
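      <p>The cosine, precision, coverage, and MinPC measures used in the learning phase can be compared on a small example. The following is a minimal sketch, assuming sessions and profiles are represented as Python sets of URLs (equivalent to binary vectors); the function names and the example session and profile are illustrative and are not taken from the original implementation.</p>
      <preformat>
```python
import math

def precision(session, profile):
    """Matching items divided by the number of items in the learned profile."""
    return len(session.intersection(profile)) / len(profile) if profile else 0.0

def coverage(session, profile):
    """Matching items divided by the number of items in the data (session)."""
    return len(session.intersection(profile)) / len(session) if session else 0.0

def cosine(session, profile):
    """Cosine similarity between two binary vectors given as sets."""
    if not session or not profile:
        return 0.0
    matches = len(session.intersection(profile))
    return matches / math.sqrt(len(session) * len(profile))

def min_pc(session, profile):
    """MinPC: the minimum of precision and coverage."""
    return min(precision(session, profile), coverage(session, profile))

session = {"/", "/courses.html", "/courses100.html", "/degrees.html"}
profile = {"/", "/courses.html"}  # precise, but covers only half the session

print(precision(session, profile), coverage(session, profile))  # 1.0 0.5
print(round(cosine(session, profile), 3))                       # 0.707
print(min_pc(session, profile))                                 # 0.5
```
      </preformat>
      <p>In this example the cosine similarity equals the geometric mean of precision and coverage, so a perfectly precise but incomplete profile still scores about 0.71, whereas MinPC is capped by the weaker of the two criteria.</p>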
    </sec>
    <sec id="sec-9">
      <title>Similarity Measures Used in the Validation Phase of Single-Pass Mining of Clusters in Web Data</title>
      <p>In evaluating the goodness of the learned B-Cell profiles that make
up the immune network model, we recall that the B-cell profiles should
represent the ground-truth trends as accurately as possible, and as
completely as possible, and that the distribution of the learned repertoire of
B-cell profiles should mirror the incoming stream of evolving data as
represented by the ground truth profiles/topic representatives. Accuracy
can be measured based on the precision of the learned B-cell profiles
relative to the ground truth profiles, while completeness can
be measured based on the coverage of the learned B-cell profiles
relative to the ground truth profiles. Let the learned profile <inline-formula><tex-math>p_i</tex-math></inline-formula> and the
ground truth profile <inline-formula><tex-math>g_t</tex-math></inline-formula> be binary vectors of length <inline-formula><tex-math>N</tex-math></inline-formula>. Here, precision in the
validation phase describes the accuracy of the B-cell profiles in
representing the ground truth profiles, or the ratio of the number
of matching items (URLs or terms) between the learned profile and the
ground truth profile to the number of items in the learned profile:
        <disp-formula><label>(14)</label><tex-math>\mathrm{Prec}^{V}_{it} = \frac{\sum_{k=1}^{N} p_{ik}\, g_{tk}}{\sum_{k=1}^{N} p_{ik}}</tex-math></disp-formula>
        while the coverage in the validation phase describes the
completeness of the B-cell profiles in representing the ground truth profiles, or
the ratio of the number of matching items (URLs or terms) between the
learned profile and the ground truth profile to the number of items
in the ground truth profile:
        <disp-formula><label>(15)</label><tex-math>\mathrm{Covg}^{V}_{it} = \frac{\sum_{k=1}^{N} p_{ik}\, g_{tk}}{\sum_{k=1}^{N} g_{tk}}</tex-math></disp-formula>
      </p>
      <p>
        Profiles were mined from the 12-day clickstream data (from 1998)
with 1704 sessions and 343 URLs from the website of the department of
Computer Engineering and Computer Science at the University of
Missouri. This is a benchmark data set used in [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ]. The profiles that
were discovered using TECNO-STREAMS in a single pass are
comparable to the ones previously obtained using a variety of less scalable
techniques [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ]. The maximum population size was 50, and the network was
compressed periodically, every fixed number of sessions, using a fixed control
parameter for compression and a fixed activation threshold. We illustrate the
continuous learning ability of the proposed technique using the following
simulations:
      </p>
      <p>
        Scenario 1: We partition the Web sessions into 20 distinct sets of
sessions, each one assigned to the closest of 20 profiles previously
discovered and validated using Hierarchical Unsupervised Niche Clustering
(H-UNC) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], and listed in Table 2. Then we presented these sessions to
TECNO-STREAMS one profile at a time: sessions assigned to trend 0,
then sessions assigned to trend 1, etc.
      </p>
      <p>Scenario 2: We used the same session partition as scenario 1, but
presented the profiles in reverse order: sessions assigned to trend 19, then
sessions assigned to trend 18, etc., ending with trend 0.</p>
      <p>Scenario 3: The Web sessions are presented in their natural
chronological order exactly as received in real time by the web server.</p>
      <p>For each of the above scenarios, we repeated the experiment using
cosine similarity in learning as given by (9), and then again using
the MinPC similarity as given by (13).</p>
      <p>We track the number of B-cells that succeed in learning each one of
the 20 ground truth profiles after each session is presented, by
counting the number of B-cells registering a sufficient match (i.e., above a
certain threshold) with each ground truth profile based on one of the
following criteria: (i) precision, measuring the accuracy of the
learned profiles compared to the ground truth profiles as given by (14),
and (ii) coverage, measuring the completeness of the learned
profiles compared to the ground truth profiles as given by (15). These two
measures provide an evolving number of hits per profile relative to each
of the above criteria, as shown in Figures 2 - 7, for the two different
learning similarity options, and the three above scenarios respectively.
The y-axis is split into 20 intervals, with each interval devoted to the
trend/profile number indicated by the lower value (from 0 to 19). A hit
for a given profile at a given session number is shown in these figures at
the corresponding location, and indicates the presence of at least one B-cell
profile that achieved the desired threshold in the validation measures of
precision or coverage.</p>
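      <p>The hit counting described above can be sketched as follows. This is a minimal illustration, assuming that learned B-cell profiles and ground truth profiles are sets of URLs and that validation coverage is measured against the ground truth profile; the function names, the toy profiles, and the threshold of 0.6 are hypothetical and do not reproduce the experimental settings.</p>
      <preformat>
```python
def precision(learned, truth):
    """Matching items divided by the number of items in the learned profile."""
    return len(learned.intersection(truth)) / len(learned) if learned else 0.0

def coverage(learned, truth):
    """Matching items divided by the number of items in the ground truth profile."""
    return len(learned.intersection(truth)) / len(truth) if truth else 0.0

def count_hits(bcell_profiles, ground_truth, measure, threshold):
    """For each ground truth profile, count the B-cell profiles whose
    validation measure reaches the threshold."""
    return [
        sum(1 for b in bcell_profiles if measure(b, truth) >= threshold)
        for truth in ground_truth
    ]

ground_truth = [{"/", "/courses.html"}, {"/degrees.html", "/bsce.html"}]
bcells = [{"/", "/courses.html", "/faculty.html"}, {"/degrees.html"}]

print(count_hits(bcells, ground_truth, precision, threshold=0.6))  # [1, 1]
print(count_hits(bcells, ground_truth, coverage, threshold=0.6))   # [1, 0]
```
      </preformat>
      <p>Calling such a function after every presented session would yield one column of the hit plots: a nonzero entry for a trend marks a hit for that trend at that session.</p>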
      <p>The proposed immune clustering algorithm can learn the user profiles
in a single pass. A single pass over all 1704 Web user sessions (with
non-optimized Java code) took less than 7 seconds on a 2 GHz Pentium 4 PC
running Linux. With an average of 4 milliseconds per user session,
the proposed profile mining system is suitable for use in a real time
personalization system to constantly and continuously provide a fresh and
current list of an unknown number of evolving user profiles. Old
profiles can be handled in a variety of ways. They may either be discarded,
moved to secondary storage, or cached for possible re-emergence. Even
if discarded, older profiles that re-emerge later, would be re-learned from
scratch just like new profiles. Hence the logistics of maintaining old
profiles are less crucial compared to existing techniques.</p>
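      <p>As a quick consistency check on the timing reported above, dividing the 7-second upper bound by the 1704 sessions reproduces the stated per-session average:
        <disp-formula><tex-math>\frac{7\ \mathrm{s}}{1704\ \mathrm{sessions}} \approx 4.1 \times 10^{-3}\ \mathrm{s} \approx 4\ \mathrm{ms\ per\ session}</tex-math></disp-formula>
      </p>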
      <p>Figures 2 and 3 show the evolving hits per usage trend for the cosine
similarity and the MinPC similarity, respectively, when scenario 1 is
deployed for sequencing the usage trends. They both exhibit an expected
staircase pattern, demonstrating the gradual learning of emergent usage trends
as these are experienced by the immune network in the order from trend
0 to 19. The plot shows some peculiarities, for example at trend 15, which
records hits at the same time as trends 0, 2, 3, and 5. Table 2 and
the examination of the user sessions in each of these trends show that
these trends do indeed share many similarities with trend 15, especially
in terms of overlap. Typical cross reactions between similar patterns are
actually desired and illustrate a certain tolerance for inexact matching.</p>
      <p>Figures 2(a) and 3(a) show that the number of learned profiles
satisfying the precision threshold evolves in synchrony with the usage trends
being presented. Furthermore, Figure 3(a) shows that the MinPC
similarity allows learning and maintaining high-precision profiles longer
than cosine similarity in Figure 2(a). For instance, compare the top 3
profiles in each figure corresponding to trends 17, 18, and 19 that are
presented last in that sequence. Similarly, Figure 3(b) shows that the
MinPC similarity allows learning more high-coverage profiles and can
keep them longer than the plain cosine similarity in Figure 2(b). This
can be seen in the top 5 profiles corresponding to trends 15, 16, 17, 18,
and 19 that are the last to be encountered in that sequence.</p>
      <table-wrap>
        <label>Table 2</label>
        <table>
          <tbody>
            <tr><td>0.99 - /people_index.html, 0.98 - /people.html, 0.97 - /faculty.html</td></tr>
            <tr><td>0.99 - /, 1.00 - /cecs_computer.class</td></tr>
            <tr><td>0.90 - /courses_index.html, 0.88 - /courses100.html, 0.87 - /courses.html, 0.81 - /</td></tr>
            <tr><td>0.80 - /, 0.48 - /degrees.html, 0.23 - /degrees_grad.html</td></tr>
            <tr><td>0.97 - /degrees_undergrad.html, 0.97 - /bsce.html, 0.95 - /degrees_index.html</td></tr>
            <tr><td>0.56 - /faculty/springer.html, 0.38 - /faculty/palani.html</td></tr>
            <tr><td>0.91 - /~saab/cecs333/private, 0.78 - /~saab/cecs333</td></tr>
            <tr><td>0.57 - /~shi/cecs345, 0.45 - /~shi/cecs345/java_examples, 0.46 - /~shi/cecs345/Lectures/07.html</td></tr>
            <tr><td>0.82 - /~shi/cecs345, 0.47 - /~shi, 0.34 - /~shi/cecs345/references.html</td></tr>
            <tr><td>0.55 - /~shi/cecs345, 0.55 - /~shi/cecs345/java_examples, 0.33 - /~shi/cecs345/Projects/1.html</td></tr>
            <tr><td>0.92 - /courses_index.html, 0.90 - /courses100.html, 0.86 - /courses.html, 0.78 - /courses200.html</td></tr>
            <tr><td>0.78 - /~yshang/CECS341.html, 0.56 - /~yshang/W98CECS341, 0.29 - /~yshang</td></tr>
            <tr><td>0.27 - /access, 0.23 - /access/details.html</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>by the parameter which affects the rate of forgetting in the immune</p>
      <p>Figures 4 and 5 show the evolving hits per usage trend for the
cosine similarity and the MinPC similarity, respectively, when scenario 2 is
deployed for sequencing the usage trends. They show an interesting
inverted staircase pattern due to the reverse presentation order. Again,
comparing Figure 4(a) with Figure 5(a) shows that the MinPC
similarity allows learning more high-precision profiles and can maintain
them longer than cosine. Similarly, by contrasting Figure 5(b) and
Figure 4(b), we can infer that the MinPC similarity allows learning more
high-coverage profiles and can keep them longer.</p>
      <p>Finally, Figures 6 and 7 show the evolving hits per usage trend for
the cosine similarity and the MinPC similarity, respectively, when the
sessions are presented in their original chronological order,
corresponding to scenario 3. In this case, the trends are no longer
presented in straight or reverse order of the trend number.
Instead, the user sessions are presented in completely natural
(chronological) order, exactly as in real time, so we cannot expect a staircase
pattern. In order to visualize the expected pattern, we simply plot the
distribution of the original input sessions, but with all the noise sessions
excluded, in Figure 1 to further test the robustness to noise. This figure
shows that the session data is quite noisy, and that the arrival sequence
and pattern of sessions belonging to the same usage trend may vary in a
way that makes incremental tracking and discovery of the profiles even
more challenging than in a batch style approach, where the sessions can
be stored in memory, and a standard iterative approach is used to mine
the profiles. It also shows how some of the usage trends (e.g., Nos. 13, 14,
and 15) are not synchronized with others, and how some of the trends (Nos. 5,
9, 13, and 14) are weak and noisy. Such weak profiles can be even more
elusive to discover in a real-time web mining system. While Figures 6 and 7
show the high-precision and high-coverage B-cell distributions over time,
Figure 1 shows the distribution of the input data over time. The striking
similarity in the emergence patterns of the trends across all these figures
attests that the immune network is able to form a
reasonable dynamic synopsis of the usage data, even after a single pass
over the data, for both types of similarity measures (cosine or MinPC).</p>
      <p>Again, even here, we notice that MinPC succeeds slightly better than
cosine similarity in learning high-precision and high-coverage profiles.
This can be seen, for example, in the fact that profiles 10 and 19 end up
lost with the cosine similarity in Figure 6, because their corresponding
learned profiles fall below the precision and coverage thresholds.</p>
      <p>We notice furthermore that the gap between the MinPC and cosine
similarities, in the number and fidelity of learned high-precision and
high-coverage profiles compared to the incoming stream of evolving
trends, gets wider when the trends are presented one at a time (scenarios
1 and 2) as opposed to when they are presented in a more random,
alternating order (scenario 3). Note that scenarios 1 and 2 are much more
challenging than scenario 3, and they were simulated intentionally to test
the ability of TECNO-STREAMS to learn completely new and unseen
patterns (usage trends, topics, etc.), even after settling on a stable set
of learned patterns before. In other words, these scenarios represent an
extreme test of the adaptability of the single-pass web mining system.</p>
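      <p>The three presentation orders can be sketched as follows. This is only an illustration, assuming that each session carries a time stamp and a trend label obtained from a prior assignment; the tuple layout and the function name <monospace>sequence_sessions</monospace> are hypothetical and not part of TECNO-STREAMS.</p>
      <preformat><![CDATA[
```python
def sequence_sessions(sessions, scenario):
    """Order (timestamp, trend, session_id) tuples for one presentation scenario.

    Scenario 1: one trend at a time, from the lowest trend number to the highest.
    Scenario 2: one trend at a time, in reverse order of the trend number.
    Scenario 3: natural chronological order, with trends interleaved.
    """
    if scenario == 1:
        return sorted(sessions, key=lambda s: (s[1], s[0]))
    if scenario == 2:
        return sorted(sessions, key=lambda s: (-s[1], s[0]))
    if scenario == 3:
        return sorted(sessions, key=lambda s: s[0])
    raise ValueError("scenario must be 1, 2, or 3")


# Toy stream: four sessions drawn from two trends (illustrative data only).
sessions = [(3, 1, "c"), (1, 0, "a"), (2, 1, "b"), (4, 0, "d")]

for scenario in (1, 2, 3):
    ordered = sequence_sessions(sessions, scenario)
    print(scenario, [session_id for _, _, session_id in ordered])
```
]]></preformat>
      <p>In the first two orders every session of a trend arrives before the next trend starts, so each new trend is completely unseen when it first appears; in the third order the trends alternate as they do in the access log.</p>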
      <p>It is interesting to note that the memory span of the network is affected
by the parameter K, which controls the rate of forgetting in the immune
network. A low value will favor faster forgetting, and therefore a more
current set of profiles that reflect the most recent activity on a website,
while a higher value will tend to keep older profiles in the network for
longer periods.</p>
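      <p>The effect of a forgetting parameter on the memory span can be illustrated with an exponential decay. The exponential form is assumed here purely for illustration, and the parameter name <monospace>tau</monospace> and its unit (number of sessions) are hypothetical.</p>
      <preformat><![CDATA[
```python
import math


def decay_weight(age, tau):
    """Weight given to a session observed `age` sessions ago (assumed exponential decay)."""
    return math.exp(-age / tau)


# Two hypothetical settings of the forgetting parameter, in units of sessions.
for tau in (50.0, 500.0):
    weights = [round(decay_weight(age, tau), 3) for age in (0, 100, 1000)]
    print(f"tau={tau}: weights at ages 0, 100, 1000 = {weights}")
```
]]></preformat>
      <p>Under this assumption, the smaller setting lets the influence of a session vanish after a few hundred newer sessions, while the larger setting keeps the same session influential for much longer.</p>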
    </sec>
    <sec id="sec-10">
      <title>CONCLUSION</title>
      <p>
        We investigated a new robust and scalable algorithm
(TECNO-STREAMS), together with the effect of the similarity measure, for detecting an unknown
number of evolving clusters or trends in a noisy Web data stream. The main
factor behind the ability of the proposed method to learn in a single pass
lies in the richness of the immune network structure that forms a dynamic
synopsis of the data. TECNO-STREAMS adheres to all the requirements
of clustering data streams [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]: compactness of representation, fast
incremental processing of new data points, and clear and fast identification
of outliers. This is mainly due to the compression mechanism and the
dynamic B-cell model that make the immune network manageable, and
continuous learning possible.
      </p>
      <p>Even though the cosine similarity has been prevalent in the majority
of web clustering approaches, it may fail to explicitly seek profiles that
achieve high coverage and high precision simultaneously. The
Min-Of-Precision-Coverage or MinPC similarity, proposed and investigated
in this paper, overcomes these drawbacks. Our simulations confirmed
that the MinPC similarity does a better job than cosine in learning from
a stream of evolving data in a single pass setting, regardless of the order
of presentation. This is because the MinPC similarity has the advantage
of explicitly coupling the precision and coverage criteria to the early
learning stages, and furthermore requiring that the affinity of the data to
the learned profiles or summaries be defined by the minimum of their
coverage or precision, hence requiring that the learned profiles are
simultaneously precise and complete, with no compromises.</p>
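      <p>As a minimal sketch of this contrast, assume that sessions and profiles are binary sets of URLs, with precision measured relative to the profile and coverage relative to the session. The function names and the example URLs below are illustrative and are not taken from the experiments.</p>
      <preformat><![CDATA[
```python
import math


def precision(session, profile):
    """Fraction of the profile's URLs that also occur in the session."""
    if not profile:
        return 0.0
    return len(session.intersection(profile)) / len(profile)


def coverage(session, profile):
    """Fraction of the session's URLs that are captured by the profile."""
    if not session:
        return 0.0
    return len(session.intersection(profile)) / len(session)


def cosine(session, profile):
    """Cosine similarity between two binary URL sets."""
    if not session or not profile:
        return 0.0
    overlap = len(session.intersection(profile))
    return overlap / math.sqrt(len(session) * len(profile))


def min_pc(session, profile):
    """Minimum of precision and coverage."""
    return min(precision(session, profile), coverage(session, profile))


session = {"/", "/courses.html", "/courses100.html", "/courses200.html"}
narrow_profile = {"/courses.html"}  # precise, but covers only part of the session

print("precision:", precision(session, narrow_profile))  # 1.0
print("coverage: ", coverage(session, narrow_profile))   # 0.25
print("cosine:   ", cosine(session, narrow_profile))     # 0.5
print("MinPC:    ", min_pc(session, narrow_profile))     # 0.25
```
]]></preformat>
      <p>For binary sets the cosine equals the geometric mean of precision and coverage, so a high value on one criterion can partly mask a low value on the other, whereas the minimum is bounded by the weaker of the two.</p>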
      <p>With an average of 4 milliseconds per user session, the proposed
profile mining system is suitable for use in a real-time personalization
system to continuously provide the recommendation engine
with a current set of user profiles. The same can be said about the
ability to mine evolving topic profiles/summaries from a stream of text data,
even in the presence of outliers. In fact, detecting potential outliers with
TECNO-STREAMS is a trivial process, limited to identifying input data
that fail to activate all the B-cells in the immune network, as described
in Section ??.</p>
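      <p>One way to read this outlier test is sketched below, interpreting an outlier as a session that activates none of the B-cells. The activation threshold of 0.5, the set-based MinPC helper, and the example URLs are assumptions made for illustration only.</p>
      <preformat><![CDATA[
```python
def min_pc(session, profile):
    """Minimum of precision and coverage for two sets of URLs."""
    if not session or not profile:
        return 0.0
    overlap = len(session & profile)
    return min(overlap / len(profile), overlap / len(session))


def is_potential_outlier(session, b_cells, similarity, activation_threshold):
    """True when the session's similarity to every B-cell is below the threshold.

    With an empty network every session is flagged, since nothing can be activated.
    """
    return all(similarity(session, cell) < activation_threshold for cell in b_cells)


# Hypothetical network of two B-cell profiles.
b_cells = [
    {"/courses.html", "/courses100.html"},
    {"/people.html", "/faculty.html"},
]
typical = {"/", "/courses.html", "/courses100.html"}
unusual = {"/access", "/access/details.html"}

print(is_potential_outlier(typical, b_cells, min_pc, 0.5))  # False
print(is_potential_outlier(unusual, b_cells, min_pc, 0.5))  # True
```
]]></preformat>
      <p>The similarity function is passed as an argument, so the same test applies unchanged to cosine or to any other measure plugged into the network.</p>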
      <p>The logistics of maintaining, caching, or discarding old profiles are
much less crucial with our approach than with most existing techniques.
Even if discarded, older profiles that re-emerge later would be re-learned
from scratch just like completely new profiles. Like the natural immune
system, the strongest advantage of our approach is expected to be its ease
of adaptation in dynamic environments such as the World Wide Web.
Our approach is modular and generic enough that it can be extended
to handle richer Web object models, such as more sophisticated web
user profiles and web user sessions, or more elaborate text document
representations. The only module to be extended would be the similarity
measure that is used to compute the stimulation levels controlling the
survival, interaction, and proliferation of the learned B-cell profiles.</p>
    </sec>
    <sec id="sec-11">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work is supported by National Science Foundation CAREER
Award IIS-0133948 to O. Nasraoui.</p>
      <p>Figure 1: Distribution of input sessions over usage trend versus session number when only non-noisy sessions are presented in natural chronological order. The horizontal axis depicts the session number or a time stamp. The vertical axis is split into several horizontal bands, each one depicting one of the 20 usage trends. Trends 5, 9, 13, 14, 15,
and 19 appear to be weaker and noisier. Also trends 6 and 7 emerge late in the 12-day access log, while trend 0 weakens in
the last days.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Babu</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Widom</surname>
          </string-name>
          .
          <article-title>Continuous queries over data streams</article-title>
          .
          <source>In SIGMOD Record'01</source>
          , pages
          <fpage>109</fpage>
          -
          <lpage>120</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Barbara</surname>
          </string-name>
          .
          <article-title>Requirements for clustering data streams</article-title>
          .
          <source>ACM SIGKDD Explorations Newsletter</source>
          ,
          <volume>3</volume>
          (
          <issue>2</issue>
          ):
          <fpage>23</fpage>
          -
          <lpage>27</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Borges</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Levene</surname>
          </string-name>
          .
          <article-title>Data mining of user navigation patterns</article-title>
          . In H. A. Abbass, R. A. Sarker, and C. Newton, editors,
          <source>Web Usage Analysis and User Profiling, Lecture Notes in Computer Science</source>
          , pages
          <fpage>92</fpage>
          -
          <lpage>111</lpage>
          . Springer-Verlag,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bradley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Fayyad</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Reina</surname>
          </string-name>
          .
          <article-title>Scaling clustering algorithms to large databases</article-title>
          .
          <source>In Proceedings of the 4th international conf. on Knowledge Discovery and Data Mining (KDD98)</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Dong, J. Han,
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Wah</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Multi-dimensional regression analysis of time-series data streams</article-title>
          .
          <source>In 2002 Int. Conf. on Very Large Data Bases (VLDB'02)</source>
          , Hong Kong, China,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Cohen</surname>
          </string-name>
          .
          <article-title>Tending Adam's Garden</article-title>
          . Academic Press,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cooley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mobasher</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>Data preparation for mining world wide web browsing patterns</article-title>
          .
          <source>Journal of knowledge and information systems</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kriegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sander</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <article-title>A density-based algorithm for discovering clusters in large spatial databases with noise</article-title>
          .
          <source>In 2nd International Conference on Knowledge Discovery and Data Mining</source>
          , pages
          <fpage>226</fpage>
          -
          <lpage>231</lpage>
          ,
          Portland, Oregon
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Motwani</surname>
          </string-name>
          , and L.
          <string-name>
            <surname>O'Callaghan</surname>
          </string-name>
          .
          <article-title>Clustering data streams</article-title>
          .
          <source>In IEEE Symposium on Foundations of Computer Science (FOCS'00)</source>
          , Redondo Beach, CA,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hinneburg</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Keim</surname>
          </string-name>
          .
          <article-title>An efficient approach to clustering in large multimedia databases with noise</article-title>
          .
          <source>In Knowledge Discovery and Data Mining</source>
          , pages
          <fpage>58</fpage>
          -
          <lpage>65</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hunt</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Cooke</surname>
          </string-name>
          .
          <article-title>An adaptative, distributed learning system, based on immune system</article-title>
          .
          <source>In IEEE International Conference on Systems, Man and Cybernetics</source>
          , pages
          <fpage>2494</fpage>
          -
          <lpage>2499</lpage>
          , Los Alamitos, CA,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Jerne</surname>
          </string-name>
          .
          <article-title>The immune system</article-title>
          .
          <source>Scientific American</source>
          ,
          <volume>229</volume>
          (
          <issue>1</issue>
          ):
          <fpage>52</fpage>
          -
          <lpage>60</lpage>
          ,
          <year>1973</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Korfhage</surname>
          </string-name>
          .
          <source>Information Storage and Retrieval</source>
          . Wiley,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>O.</given-names>
            <surname>Nasraoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cardona-Uribe</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Rojas-Coronel</surname>
          </string-name>
          .
          <article-title>Tecno-streams: Tracking evolving clusters in noisy data streams with a scalable immune system learning model</article-title>
          .
          <source>In IEEE International Conference on Data Mining</source>
          , Melbourne, Florida, Nov.
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>O.</given-names>
            <surname>Nasraoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dasgupta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          .
          <article-title>An artificial immune system approach to robust data mining</article-title>
          .
          <source>In Genetic and Evolutionary Computation Conference (GECCO) Late breaking papers</source>
          , pages
          <fpage>356</fpage>
          -
          <lpage>363</lpage>
          , New York, NY,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>O.</given-names>
            <surname>Nasraoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Frigui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishnapuram</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          .
          <article-title>Mining web access logs using relational competitive fuzzy clustering</article-title>
          . In Eighth International Fuzzy Systems Association Congress, Hsinchu, Taiwan, Aug.
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>O.</given-names>
            <surname>Nasraoui</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishnapuram</surname>
          </string-name>
          .
          <article-title>One step evolutionary mining of context sensitive associations and web navigation patterns</article-title>
          .
          <source>In SIAM conference on Data Mining</source>
          , pages
          <fpage>531</fpage>
          -
          <lpage>547</lpage>
          , Arlington,
          VA
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>O.</given-names>
            <surname>Nasraoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishnapuram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Frigui</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          .
          <article-title>Extracting web user profiles using relational competitive fuzzy clustering</article-title>
          .
          <source>International Journal of Artificial Intelligence Tools</source>
          ,
          <volume>9</volume>
          (
          <issue>4</issue>
          ):
          <fpage>509</fpage>
          -
          <lpage>526</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>O.</given-names>
            <surname>Nasraoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishnapuram</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          .
          <article-title>Mining web access logs using a relational clustering algorithm based on a robust estimator</article-title>
          .
          <source>In 8th International World Wide Web Conference</source>
          , pages
          <fpage>40</fpage>
          -
          <lpage>41</lpage>
          , Toronto, Canada,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Perkowitz</surname>
          </string-name>
          and
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          .
          <article-title>Adaptive web sites: Automatically synthesizing web pages</article-title>
          .
          <source>In AAAI 98</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C.</given-names>
            <surname>Shahabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Zarkesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Abidi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Shah</surname>
          </string-name>
          .
          <article-title>Knowledge discovery from users web-page navigation</article-title>
          .
          <source>In Proceedings of workshop on research issues in Data engineering</source>
          , Birmingham, England,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cooley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.-N.</given-names>
            <surname>Tan</surname>
          </string-name>
          .
          <article-title>Web usage mining: Discovery and applications of usage patterns from web data</article-title>
          .
          <source>SIGKDD Explorations</source>
          ,
          <volume>1</volume>
          (
          <issue>2</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          ,
          Jan.
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Timmis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Neal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Hunt</surname>
          </string-name>
          .
          <article-title>An artificial immune system for data analysis</article-title>
          .
          <source>Biosystems</source>
          ,
          <volume>55</volume>
          (
          <issue>1</issue>
          /3):
          <fpage>143</fpage>
          -
          <lpage>150</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>T.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jacobsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          , and
          <string-name>
            <given-names>U.</given-names>
            <surname>Dayal</surname>
          </string-name>
          .
          <article-title>From user access patterns to dynamic hypertext linking</article-title>
          .
          <source>In Proceedings of the 5th International World Wide Web conference</source>
          , Paris, France,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Parthasarathy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddy</surname>
          </string-name>
          .
          <article-title>On the use of constrained association rules for web mining</article-title>
          .
          <source>In WebKDD workshop on Knowledge Discovery in the Web</source>
          , pages
          <fpage>77</fpage>
          -
          <lpage>90</lpage>
          , Edmonton, Alberta, Canada,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>O.</given-names>
            <surname>Zaiane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xin</surname>
          </string-name>
          , and J. Han.
          <article-title>Discovering web access patterns and trends by applying olap and data mining technology on web logs</article-title>
          .
          <source>In Advances in Digital Libraries</source>
          , pages
          <fpage>19</fpage>
          -
          <lpage>29</lpage>
          , Santa Barbara, CA,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Livny</surname>
          </string-name>
          .
          <article-title>BIRCH: An efficient data clustering method for large databases</article-title>
          .
          <source>In ACM SIGMOD International Conference on Management of Data</source>
          , pages
          <fpage>103</fpage>
          -
          <lpage>114</lpage>
          , New York, NY,
          <year>1996</year>
          . ACM Press.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>