<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Recommendation Model Based on Latent Principal Factors in Web Navigation Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yanzan Zhou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xin Jin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bamshad Mobasher</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>yzhou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>mobasher}@cs.depaul.edu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Web Intelligence School of Computer Science, Telecommunication, and Information Systems DePaul University</institution>
          ,
          <addr-line>Chicago, Illinois</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>52</fpage>
      <lpage>61</lpage>
      <abstract>
        <p>Discovery of factors that lead to common navigational patterns can help in improving online information presentation as well as in providing personalized content to users. It is, therefore, necessary to develop techniques that can automatically characterize the users' underlying navigational objectives and to discover the hidden semantic relationships among users as well as between users and Web objects. Typical approaches to Web usage mining, such as clustering of user sessions, can discover usage patterns directly, but cannot identify the latent factors, intrinsic in users' navigational behavior, that lead to such patterns. In this paper, we propose an approach based on a latent variable model, called Iterative Principal Factor Analysis, to discover such hidden factors in Web usage data. The hidden factors are then used to create aggregate models of common user profiles which are, in turn, used to provide dynamic recommendations to users. Our experimental results, performed on real Web usage data, verify that the proposed principal factor approach results in better predictive user models, when compared to more traditional approaches such as clustering and principal component analysis.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>pageview level. They, however, do not capture the intrinsic characteristics of Web users’ activities, nor can
they quantify the underlying and unobservable factors that lead to specific navigational patterns.</p>
      <p>
        Latent variable models, such as Principal Component Analysis (PCA), have been widely used in a
variety of areas, most prominently for latent semantic indexing in information retrieval [
        <xref ref-type="bibr" rid="ref1 ref5">5, 1</xref>
        ]. More recently,
probabilistic Latent Semantic Analysis [
        <xref ref-type="bibr" rid="ref3 ref8">8, 3</xref>
        ] has gained attention because of its flexibility and its ability to
leverage probabilistic inference. Factor analysis models, which are established statistical models [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], are less
discussed in data mining research than PCA, possibly due to the model complexity and solution
indeterminacy. Although the latent variable models, such those mentioned above, have been studied and applied
extensively, few of these approaches have been employed in the context of Web mining, in general, or for the
discovery of Web usage patterns, in particular.
      </p>
      <p>In this paper we propose a Web usage mining approach to personalization based on Principal Factor
Analysis (PFA). Specifically, we use an algorithm called Iterative Principal Factor (IPF) Analysis to discover
the latent factors which characterize relationships among pages based on their usage. We, then, use the
discovered factors to create aggregate representations of common navigational patterns among user. We call
such patterns aggregate usage profiles (or simply usage profiles). Finally, the usage profiles are used, together
with a current user’s active session, to generate dynamic recommendations for the active user.</p>
      <p>The primary assumption behind all latent variable models is that there is a set of common hidden factors
which “explain” a set of observations in co-occurrence data. In our case, we are interested in automatically
identifying the common factors that lead groups of users to visit certain pageviews together. Thus, in the
Web usage mining context, user sessions represent the set of observations, while pageviews contained in the
sessions represent the variables.</p>
      <p>It should be noted that the more common Principal Component Analysis (PCA) can also be used to
identify latent variables (principal components). However, while PCA finds the principal components that
maximize the total variance of variables, the principal factor model discovers the structural factors underlying
the common variance of variables and excludes unique variances not related to other variables. Because of this
difference, PCA is commonly used for tasks such as dimensionality reduction in which preserving the overall
relationships between the variables is important. On the other hand, the Principal Factor Analysis is better
suited for identifying the factors that characterize unique relationships among sets of variables. Indeed, our
experimental results verify that the IPF approach results in better predictive models for generating dynamic
recommendations.</p>
      <p>The paper is organized as follows. In Section 2 we discuss the modeling aspects of the principal factor
model for Web usage data, and we present the iterative principal factor extraction algorithm. In Section 3
we introduce our approach for constructing aggregate usage profiles on the basis of extracted latent factors.
Our recommendation algorithm based on the discovered usage profiles is introduced in Section 4. Section 5
presents our experimental results based on real usage data.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Web Usage Patterns as Latent Factors</title>
      <p>
        We begin with the raw server logs of the site under analysis and perform the appropriate preprocessing steps
such as data cleaning, pageview identification, sessionization, and Web robot detection. A detailed discussion
of the various steps in usage data preprocessing is beyond the scope of the current paper. Interested reader
can find these details in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The preprocessing tasks result in a set of n pageviews P = {p1, p2, · · · , pn}
and a set of m user sessions U = {u1, u2, · · · , um}. A pageview is an abstraction representing a set of
objects associated with a single user request. So, a pageview may consist of an individual Web page,
multiple pages (such as in the case of HTML frames), or a combination of pages and objects (e.g., database
records associated with dynamic applications). Following multivariate data analysis conventions, if we treat
pageviews as variables and user sessions as observations, we can represent the usage data as an m × n
userpageview matrix S = [w(ui, pj )], where each element w(ui, pj ) is the significance weight (a function of time
duration) of pageview j in user session i.
      </p>
      <p>Our goal is to discover a set of latent factors C = {c1, c2, · · · , ck} from this data that “explain” the
underlying relationships among pageviews. These latent factors will be closely related to Web site’s functional
structure and Web users’ actual navigational tasks. The identification of such factors allows us to transform
the usual representation of users sessions as a set or sequence of pageviews (pageview-level representation)
to a higher-level representation over the space of latent factors. Specifically, given the matrix S, of
userpageview observations, each user session, ui = [w(ui, pj)]1×n, can be transformed into a higher-level factor
space representation uic = [ω(ui, cj)]1×k, where ωij is a significance weight of user ui with respect to factor
cj.</p>
      <p>During a given session, a user may perform a single task or multiple tasks simultaneously. These tasks a
captured by the discovered latent factors. Thus, the higher-order factor-level representation of user sessions
can be used to identify the primary tasks performed by users according to the associated weights in each
factor dimension. Users with relatively high weights on a given factor dimension could be regarded as
prototypical users performing the associated task. By aggregating the profiles of such prototypical users we
can create user models that can be used for generating recommendations to other users exhibiting similar
navigational behavior. We divide this learning task into two major steps. First, we extract latent factors
representing pageview association patterns from usage data. Secondly, aggregate usage profiles representing
prototypical user navigational patterns are derived to be used in dynamic Web personalization.
2.1</p>
      <sec id="sec-2-1">
        <title>The Latent Factor Model for Web Navigational Patterns</title>
        <p>Let E(pi) denote the expectation of random variable pi and var(pi) denote the variance of pi (pageviews).
To simplify our discussion we assume E(pi) = 0 and var(pi) = 1. This assumption can be made without loss
of generality given that we can always obtain a z-score normalized observed value by subtracting the sample
mean p¯i and dividing by the standard deviation). Let PT = [p1, p2, · · · , pn]. Then the covariance matrix
mean and unit variance assumptions.n×n, and we have correlation matrix corr(P) = cov(P) given our zero
of P is defined as cov(P) = [E(pipj)]</p>
        <p>We assume that Web users’ accesses to a pageview pi are influenced by a set of common latent factors
CT = [c1, c2, · · · , ck] as described in the linear function: pi = li1c1 + li2c2 + ... + likck + ∆ i, where coefficient
lij is the loading weight of pageview pi on unobservable common factor cj. Unique factor ∆ i represents
unobservable unique portion of pi that are not accounted for by k common factors (such as individual
variable-specific variation and measurement error). This can be written in matrix format as</p>
        <p>P = LC + ∆ ,
where L is the loading coefficient matrix. Since both C and ∆ are unobservable, there are too many unknown
components in equation 1, we cannot solve the equation directly. However, we can approach the solution by
making the following assumptions on C and :∆
1. Both common and unique factors have zero mean: E(C) = 0, and E()∆ =
0;
2. Common factors have unit variances and are uncorrelated with each other (orthonormal), i.e., cov(C) =</p>
        <p>E(CCT ) = Ik×k (identity matrix)
3. Unique factors have their own variances while they are uncorrelated with each other. Let the diagonal
matrix Ψ = [ψi]n×n, where ψi = var(∆ i), then cov()∆ = E(∆ T ) = Ψ,
4. Common factors are uncorrelated with unique factors, i.e., cov(C, )∆ =
E(C∆ T ) = 0.</p>
        <p>
          Note that these assumptions are necessary to obtain an orthogonal factor model and the second
assumption could be relaxed to be cov(C) = I resulting in a more complicated oblique factor model [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. That is,
some of these latent factors will be correlated. As a rule of thumb, we can always solve the orthogonal
factor model first, followed by some oblique rotation criteria on factor axes as needed. Based on the above
assumptions, we can now derive important covariance relationships between pageviews, factors and loading
coefficients. For example, let cov(P, C) = [E(pi, cj)]n×k. Then the covariance between p and C can be
described as:
cov(P, C) = E pCT = E (LC + )∆ CT
= LE(CCT ) + E(∆ CT ) = L.
The covariance matrix of p can be described as:
cov(p) = E ppT
        </p>
        <p>= E (LC + )∆( LC + )∆ T
= LE(CCT )LT + LE(C∆ T ) + E(∆ CT )LT + E(∆</p>
        <p>T ) = LLT + Ψ.</p>
        <p>Equation (3) tells us that the variance of pi is just the sum of its squared loading coefficients plus a
unique variance not related to other variables:
(3)
(4)
var(pi) =
k
j=1
li2j + ψi.</p>
        <p>Thus, the total variance of pageview pi is comprised of two parts: the communality part, denoted as h(pi) =
k
j=1 li2j , which is the common variance that is accounted for by k common factors; and the unique variance
part ψi that is not explained by common factors. The communality is essentially equivalent to the squared
multiple correlation coefficient R2 in the multiple regression sense, i.e., h(pi) is the proportion of total
variance in pi that is predictable by the k predictors (in this case, factors).</p>
        <p>Equation (2) tells us that loading coefficients have clear statistical meaning: lij is just the covariance
between pageview pi and factor cj . That is cov(pi, cj ) = lij . lij is also the correlation coefficient of pageview
pi and factor cj given our variable standardization assumption.
2.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>The IPF Algorithm for Factor Extraction in Web Navigation Data</title>
        <p>
          There are several algorithms available to estimate the loadings and factors, including maximum likelihood
(ML) method, the centroid method, and image factor analysis. For a discussion of these and other
approaches see [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The ML method usually assumes that the co-occurrence data has a Gaussian distribution.
This, however, does not generally hold in the context of Web usage data. Thus, we adapt a more robust
and computationally efficient method called iterative principal factor analysis (IPF) to perform the factor
extraction.
        </p>
        <p>
          In IPF, the estimation of factor loading matrix can be solely based on the sample correlation matrix.
The sample correlation matrix of the usage data is simply computed as R(S) = ST S given that the pi are
z-score standardized. There are three major steps in the IPF algorithm:
1. Initial communality estimates h(pi). The most popular method is squared multiple correlation
coefficient (SMC) [
          <xref ref-type="bibr" rid="ref7 ref9">9, 7</xref>
          ], which is defined as h(pi) = 1 − γ1ii , where γii is the i-th diagonal element of
inverse correlation matrix [R(S)]−1. Thus, estimated unique variance ψi = 1 − h(pi) and the reduced
correlation matrix becomes R∗(S) = R(S) − Ψ. Note that the diagonal elements of R∗(S) represent
the common variances of pageviews (i.e., communalities).
2. Decomposition of the reduced correlation matrix R∗(S). Consider an eigen decomposition of a
symmetric matrix, we can approximate R∗(S) by keeping k leading eigen vectors and corresponding k
largest eigen values:
where H(k) is a matrix consisting of k columns of eigen-vectors corresponding to the descendingly
1
ranked k largest eigen-values of R∗(S), Λ(2k) denotes a diagonal matrix whose diagonal elements are
square roots of corresponding k largest eigen values of R∗(S). Thus loading matrix could be estimated
1
as L = H(k)Λ(2k).
3. Updates of communality estimates. From L , we get re-estimated communality as h(pi) =
k 2
j=1 lij
        </p>
        <p>After the above steps, the diagonal elements of R∗(S) are replaced with the updated communalities.
Then, we iterates steps 2 and 3 until the difference between updated and previous communalities are
minimized within a certain threshold.</p>
        <p>
          Given that initial solutions of loading matrices are usually not easily interpretable, appropriate
transformations such as orthogonal or oblique factor rotation procedures are usually performed on the loading
matrix to arrive at a simpler structure for better interpretability [
          <xref ref-type="bibr" rid="ref7 ref9">9, 7</xref>
          ]. Based on our experience, the rotated
factor patterns generally lead to better quality usage profiles while showing more interpretable patterns as
well.
3
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Discovery of Aggregate Usage Profiles Based on Latent Factors</title>
      <p>In this section we present the algorithm for deriving aggregate usage profiles based on the factor loading
matrix obtained through IPF.
3.1</p>
      <sec id="sec-3-1">
        <title>The Factor Loading Matrix</title>
        <p>The loading matrix L tells us how closely each pageview is related to each factor. We can further interpret
the meaning of each factor by selecting the most related pageviews. Each user’s activity pattern could be
represented as a mixture of these factors. These factors summarize the leading k prominent types of user
navigational tasks. The weight of a pageview in a certain factor reflects the relative importance of describing
the factor.</p>
        <p>Furthermore, we can represent each factor as a pageview space vector for later comparison with user
sessions (also in pageview space). We can derive an aggregate usage profile for other similar users who have
the same dominant factor present in their sessions. This could be done by first comparing the similarity
between a user activity session with these factors. Specifically, for each column of the loading matrix L in
the factor model, we can generate a corresponding vector cj = [l1j , l2j , · · · , lnj ], where lij is the loading of
pageview pi on factor cj .
3.2</p>
      </sec>
      <sec id="sec-3-2">
        <title>Deriving Aggregate Usage Profiles from Factor Loadings</title>
        <p>To obtain aggregate profiles from the discovered factors, we begin by projecting the user sessions (represented
as vectors of pagviews) onto the reduced-dimension space of latent factors. The component (factor) scores
can be regarded as an inner-product similarity computation of a user with the reduced dimensions. In
order to normalize the effect of diverse user browsing styles such as slow-surfers versus fast surfers we use
cosine coefficient to compute the user interest weight with respect to different factors. Specifically, we can
conceptualize user sessions by transforming a user-pageview matrix S to a user-factor matrix Sc as follows:</p>
        <p>Sc = SL = [ω(ui, cj )]m×k ,
where ω(ui, cj ) = ui · cj , ui and cj are both normalized to have unit length.</p>
        <p>Thus, we have a conceptualized view of user activities beyond the pageview level, and we can further
discover the dominant and the secondary factors by examining the associated weights with the corresponding
factors. For example, some users may have only one significantly weighted factor, which might indicate that
the user is engaged in a single specific task; while other users may exhibit multiple interests during a session,
as reflected by high significant weights on more than one factor.</p>
        <p>Given this representation, for each of the k latent factors (columns in the user-factor matrix SC ), we
choose those users whose factor loading weight is greater than a certain threshold as a representative user
associated with that factor. Such a set of users U c = {u1, uc2, · · · , ucm} forms a user segment whose members
c
have similar dominant interests while having probably diverse minor interests. In other words, users’
crossinterests would be captured in such segments and each member will have different contributing weights.
Differentiating these contributing weights is important since they are directly used for generating the
aggregate usage profiles.</p>
        <p>
          To create an aggregate representation of the user segment U c, we compute the centroid of all the vectors
in U c, resulting in a representation of the segment as a set of factor-weight pairs. This is a similar approach
as the profile aggregation method, PACT, discussed in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. In PACT, user segments were generated by
performing clustering on the set of user sessions, and the cluster centroid was regarded as the aggregate
usage profile. In contrast, the user segments is the present approach are derived based on the factor loadings
obtained from the IPF algorithm. The algorithm of generating aggregate usage profiles is as follows:
1. For each factor c, choose all the user sessions with ω(ui, c) ≥ µ to get a candidate session set U c, where
µ is predefined threshold.
2. Represent each user session ui ∈ U c as a pageview vector and compute their centroid pageview vector
v = (1/|U c|) ui · ω(ui, c), where |U c| denotes the total number of sessions in set U c.
3. For each factor c, output page vector v. This pageview vector consists of a set of weights for pageviews
in P , which represent the relative significance of that pageview for the user segment associate with c.
        </p>
        <p>The pageview-weight pairs obtained using the above aggregation process can be further ordered according
to the associated weights. In contrast to individual factors, these aggregate usage profiles contain
information about changing contexts during Web user’s online real navigation. In particular, different (possibly
related) tasks performed by a user during a session is reflected by pageviews in the aggregate profile that are
contributed by different latent factors. We regard a dominantly weighted pageviews in a usage profile as
reflecting the main“theme” (interest) of the associated user segment. Furthermore, non-dominant pageviews,
reflect the diversity of minor “themes” in other factors and could be regarded as additional contextual
information. We provide some real examples of discovered aggregate usage profiles in Section 5.
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Using the Latent Factors for Personalization</title>
      <p>Web personalization usually refers to dynamically tailoring the presentation of a Web site according to the
interests or preferences of individual or groups of users. This can be accomplished by recommending Web
resources, such as Web pages or products, to a user by considering the current user’s active behavior with
his own historic patterns or best matched learned models of other users. The usage profiles generated based
on the algorithm of Section 3.2 provide an aggregate representation of all individual users’ navigational
activities in a particular group. They also provide the basis for automatically generating relevant pageview
recommendations.</p>
      <p>Here, we use the discovered usage profiles from the IPF algorithm to generate top-N recommendations
based on a current user’s active session:
1. Each aggregate usage profile described in Section 3 can be conveniently represented as an n-dimensional
pageview vector v = [w1c, w2c, . . . , wnc], where wic is the weight associated with pageview pi in this profile.
Similarly, the active user session a is represented as a = [a1, a2, · · · , an], where ai = 1, if pageview pi
is visited, and otherwise ai = 0.
2. Choose the usage profile that best matches the active user session. The similarity match score is defined
using the cosine coefficient:
sim(a, v) =</p>
      <p>i (ai × wic)
i (ai)2 ×
i (wic)2
3. For the top matched usage profile vmcax = argmaxvsim(a, v) together with the active session a, we
c
compute a recommendation score: rec(a, pi), for each pageview pi ∈ vmax:
rec(a, pi) =
wc</p>
      <p>i × sim(a, vmcax).</p>
      <p>Thus, each pageview will receive a normalized recommendation score between 0 and 1. If the pageview
pi is already in the current active session a, its recommendation score is set to zero.
4. Select the N pageviews with the highest recommendation scores as the top-N recommendation set.</p>
      <p>We will present our evaluation results on real usage data for the top-N recommendations in Section 5.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Experiments and Evaluation</title>
      <p>In this section we evaluate the effectiveness of the PFA model for recommendations on two different data
sets. We also provide some examples of the generated aggregate usage profiles.
5.1</p>
      <sec id="sec-5-1">
        <title>Data Sets and Experimental Methodology</title>
        <p>The data sets used in our experiments are:
1. CTI data. This data set is based on the server logs of the host Computer Science department
spanning a one-month period. The initial preprocessed data set contained more 100,000 sessions and over
4,000 pageviews. Further aggregation was performed to “roll up” low-support (infrequently accessed)
pageviews to their common root node in the site hierarchy. Furthermore, small sessions (containing
less than 6 pageviews) were filtered out. This resulted in a final data set containing 21,299 user sessions
and 692 Web pageviews. The site is highly dynamic, involving numerous online applications, including
online admissions application, online advising, online registration, and faculty-specific Intranet
applications. Thus, we expect the discovered usage patterns to reflect various functional tasks performed
by diverse groups of users.
2. NC data. This data set is from Network Chicago which combines the programs and activities of the
Chicago Public Television and Radio (www.networkchicago.com). The data was collected over a period
of one month. Similar preprocessing and aggregation steps as in the case of the CTI data were also
applied here, resulting in a total of 4,987 user sessions and 295 pageviews were included for analysis.
In contrast to the CTI data, this site is comprised primarily of static pages grouped together based
on their association with specific content areas. In this case, we expect the generated usage profiles to
reflect common interests of users in one or more programs represented by the content areas.</p>
        <p>Each data set was randomly divided into multiple training and test sets to use with 10-fold
crossvalidation. The training sets were used to build the models while the test sets were used to evaluate the
user segments and the recommendations generated by the models. In all experiments, the results represent
averages over the 10 folds.</p>
        <p>
          We extracted 24 factors for both CTI and NC data sets based on the initial scree plots [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The scree plot
is a plot of eigen-values (communalities of the reduced correlation matrix) against their descending order.
In our case, the initial 24 factors, corresponding to the largest eigen values, explain ≥ 50% of the common
variance based on the initial communality estimations.
5.2
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>Examples of Usage Patterns Based on the Latent Factors</title>
        <p>Profile #1
Profile #2
Profile #3</p>
        <p>Factor # Aggregate usage profiles
1 1.00 Intranet main page
1 0.71 Online advising – student history
1 0.59 Intranet main page – login
6 0.50 Intranet – news
1 0.38 Online advising – student search
6 0.27 Intranet – calendar
6 0.25 Intranet – faculty page
1 0.24 Intranet – online advising page
6 0.18 Intranet – faculty rosters
2 0.70 Online application-step1
2 0.70 Online application-step2
2 0.58 Online application-finish
2 0.42 Online application-restart
2 0.38 Online application-login
9 0.34 Admission- main page
9 0.29 Admissions-requirements
3 0.89 People - faculty evaluation
3 0.89 People - faculty search
3 0.75 People - faculty info main page
3 0.74 People - full-time faculty list
3 0.56 People - faculty news
15 0.46 Programs – course information
9 0.30 Student Intranet login
15 0.29 Course news main page
3 0.25 People - part-time faculty list</p>
        <p>Interpretation
Dominant Factor:
- Online advising system</p>
        <p>(factor #1)
Secondary Factors:
-Faculty Intranet system
(factor #6)
Dominant Factor:
- Online applications
(factor #2)
Secondary Factors:
- Admission info search
(factor #9)
Dominant Factor:
- faculty info search
(factor #3)
Secondary Factors:
- Course info search,
- Making appointments,
- Course news
(factor #9 and factor #15)</p>
        <p>As another example, Profile 2 also captures two different but related tasks, namely that of a prospective
student going through the online admissions application process (factor 2), and that of searching for
general admissions information and requirements (factor 9). Note that there may be groups of students who
perform each of these tasks independently and separately (reflected in other discovered profiles). However,
this particular profile captures the behavior of those students who, after checking on various admissions
requirements, have decided to go through the application process during the same session.</p>
        <p>As a final example, Profile 3 shows, in part, the activity of current students who are in the process
of registering for courses. As reflected by the dominant factor, the profile indicates that one important
determining criteria for students’ course selection is the faculty teaching the courses, and particularly the
student evaluations of the faculty in past courses.
5.3</p>
      </sec>
      <sec id="sec-5-3">
        <title>Evaluation of Recommendations</title>
        <p>For evaluating the recommendation effectiveness we use a measure called hit ratio in the context of top-N
recommendation. For each user session in the evaluation set T , we took the first j pages as a representative
of an active user to generate a top-N recommendation set as described in section 4. We then compare the
c
recommendation set in the top matched profile vmax with the pageview (j + 1) in the evaluation session. If
the pageview (j + 1) appears in this recommendation set, we consider it as a hit. We define the hit ratio as
the total number of hits divided by the total number of sessions in the evaluation set.</p>
        <p>Note that the hit ratio increases as the value of N (number of recommendations) increases. Thus, we pay
special attention to smaller number of recommendations that result in good hit ratios in our experiments
(generally between 5 and 10 recommendations).</p>
        <p>For each of the data sets, we compare the hit ratio based on the aggregate profiles generated using the
IPF algorithm, against those derived based on PCA and PACT approaches. In the case of PCA, we used the
standard SVD (Singular Valued Decomposition) approach to identify the principal components. We then
used an approach similar to our proposed method, based on IPF, to derive the aggregate usage profiles. In
the case of PACT (described earlier), first clusters of user sessions were obtained using k-means clustering,
and then the aggregate usage profiles were derived as the cluster centroids.</p>
        <p>Figure 2 and 3 depicts the hit ratios vs. recommendation set sizes for both CTI and NC data sets, across
the three approaches. The results show that the profiles based on the latent factor model have consistently
higher hit ratio values than both PACT and PCA. Such advantage continues in all ranges of recommendation
set sizes. Interestingly, the results show that PACT performs in par with, or better than, PCA for the purpose
of generating dynamic recommendations. This may suggest that the standard approach of clustering user
sessions is better suited for distinguishing among Web user segments. This is likely due to the fact that
principal components are designed to maximize the total variance among variables, instead of isolating their
communalities (as is the case for the IPF model). However, clustering approahces, in contrast to PCA,
cannot directly identify latent factors which describe inter-relationships among pageviews.
6</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>Every Web site contains functional units that represent a variety of user tasks performed within those units.
The navigational patterns of users are, therefore, semantically related to the functional structure of the site.
Web site owners may have pre-defined functions and page categories. However, we are particularly interested
in how exactly users are utilizing these functions during real visits and the latent factors that underly the
task-oriented behavior of users.</p>
      <p>We have introduced an approach based on iterative principal factor analysis that can automatically learn
hidden factors and to generate aggregate usage profiles, based on these factors, that represent common
navigational patterns. The discovered usage profiles can be used to dynamically predict a new user’s navigational
interests and recommend relevant pages or products. Our results show that this approach can successfully
uncover the patterns that characterize a Web site’s functional structure, and distinguish between different
types of user interests and tasks. Our experimental results show that the proposed factor model is better
able to capture the hidden relationships among users and Web objects, than both the standard PCA-based
approach and the traditional clustering approach. Furthermore, the standard clustering approaches, such as
PACT, that are often used for user segmentation, are generally unable to automatically identify the hidden
factors that “explain” the underlying reasons behind the discovered usage patterns.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.W.</given-names>
            <surname>Berry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>W. O'Brien</surname>
          </string-name>
          .
          <article-title>Using linear algebra for intelligent information retrieval</article-title>
          .
          <source>SIAM Review</source>
          ,
          <volume>37</volume>
          :
          <fpage>573</fpage>
          -
          <lpage>595</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.B.</given-names>
            <surname>Cattell</surname>
          </string-name>
          .
          <article-title>The scree test for the number of factors</article-title>
          .
          <source>Multivariate Behavioral Research</source>
          ,
          <volume>1</volume>
          :
          <fpage>245</fpage>
          -
          <lpage>276</lpage>
          ,
          <year>1966</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cohn</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>Learning to probabilistically identify authoritative documents</article-title>
          .
          <source>In Proceedings of the 17th International Conference on Machine Learning</source>
          , Stanford,
          <year>June 2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cooley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mobasher</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>Data preparation for mining world wide web browsing patterns</article-title>
          .
          <source>Journal of Knowledge and Information Systems</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Deerwester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.W.</given-names>
            <surname>Furnas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Hashman</surname>
          </string-name>
          .
          <article-title>Indexing by latent semantic indexing</article-title>
          .
          <source>Journal of the American Society for Information Science</source>
          ,
          <volume>41</volume>
          (
          <issue>6</issue>
          ),
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Karypis</surname>
          </string-name>
          .
          <article-title>Selective markov models for predicting web-page accesses</article-title>
          .
          <source>In Proceedings of the First International SIAM Conference on Data Mining</source>
          , Chicago,
          <year>April 2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Harman</surname>
          </string-name>
          .
          <article-title>Modern Factor Analysis</article-title>
          . University of Chicago Press, Chicago, 3rd edition,
          <year>1976</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          .
          <article-title>Probabilistic latent semantic analysis</article-title>
          .
          <source>In Proceedings of the 15tg Conference on Uncertainty in Artificial Intelligence</source>
          , Stockholm,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Johnson</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Wichern</surname>
          </string-name>
          .
          <article-title>Applied Multivariate Statistical Analysis</article-title>
          .
          <source>Prentice Hall, New Jersey, 5th edition</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kohavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mason</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Parekh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zheng</surname>
          </string-name>
          .
          <article-title>Lessons and challenges from mining retail e-commerce data</article-title>
          . To appear
          <source>in Machine Learning</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>W.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Alvarez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          .
          <article-title>Efficient adaptive-support association rule mining for recommender systems</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          ,
          <volume>6</volume>
          :
          <fpage>83</fpage>
          -
          <lpage>105</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mobasher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cooley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>Creating adaptive web sites through usage-based clustering of urls</article-title>
          .
          <source>In Proceedings of the 1999 IEEE Knowledge and Data Engineering Exchange Workshop (KDEX'99)</source>
          , Chicago, Illinois,
          <year>November 1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mobasher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cooley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>Automatic personalization based on web usage mining</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>43</volume>
          (
          <issue>8</issue>
          ):
          <fpage>142</fpage>
          -
          <lpage>151</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mobasher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Luo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Nakagawa</surname>
          </string-name>
          .
          <article-title>Effective personalization based on association rule discovery from web usage data</article-title>
          .
          <source>In Proceedings of the 3rd ACM Workshop on Web Information and Data Management (WIDM01)</source>
          , Atlanta, Georgia,
          <year>November 2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mobasher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Luo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Nakagawa</surname>
          </string-name>
          .
          <article-title>Discovery and evaluation of aggregate usage profiles for web personalization</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          ,
          <volume>6</volume>
          :
          <fpage>61</fpage>
          -
          <lpage>82</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>O.</given-names>
            <surname>Nasraoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishnapuram</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          .
          <article-title>Mining web access logs using a fuzzy relational clustering algorithm based on a robust estimator</article-title>
          .
          <source>In Proceedings of the 8th International World Wide Web Conference</source>
          , Toronto, Canada, May
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>O.</given-names>
            <surname>Nasraoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishnapuram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Kamdar</surname>
          </string-name>
          .
          <article-title>Automatic web user profiling and personalization using robust fuzzy relational clustering</article-title>
          . In J. Segovia,
          <string-name>
            <given-names>P.</given-names>
            <surname>Szczepaniak</surname>
          </string-name>
          , and M. Niedzwiedzinski, editors,
          <source>Studies in Fuzziness and Soft Computing</source>
          . Springer-Verlag,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.R.</given-names>
            <surname>Sarukkai</surname>
          </string-name>
          .
          <article-title>Link prediction and path analysis using markov chains</article-title>
          .
          <source>In Proceedings of the 9th International World Wide Web Conference</source>
          , Amsterdam, May
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Spiliopoulou</surname>
          </string-name>
          .
          <article-title>Web usage mining for web site evaluation</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>43</volume>
          (
          <issue>8</issue>
          ):
          <fpage>127</fpage>
          -
          <lpage>134</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>Mining web logs to improve website organization</article-title>
          .
          <source>In Proceedings of the 10th International World Wide Web Conference</source>
          , Hong Kong, May
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cooley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Tan</surname>
          </string-name>
          .
          <article-title>Web usage mining: Discovery and applications of usage patterns from web data</article-title>
          .
          <source>SIGKDD Explorations</source>
          ,
          <volume>1</volume>
          (
          <issue>2</issue>
          ):
          <fpage>12</fpage>
          -
          <lpage>23</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>