Integrating and Ranking Interests From User
                     Profiles!

                    Fabien Duchateau1 and Lynda Hardman1

                               CWI, Science Park 123
                       1098 XG Amsterdam, The Netherlands
                            firstname.lastname@cwi.nl


      Abstract. Many websites allow their users to personalize their profiles.
      As users subscribe to many personalization websites, such as social net-
      works or search systems, each user owns different profiles, which are
      seldom compatible. Yet, there is a strong need for comparing the profiles
      of different users to discover shared interests, e.g., by integrating all user
      profiles into a global one. In this paper, we propose a novel method for in-
      tegrating and ranking user interests from various profiles. Our approach
      relies on the identification of high-level concepts around which similar
      user interests are clustered. We compute the weight of each cluster with
      respect to the other ones, thus enabling the ranking of the most shared
      user interests between user profiles.


1    Introduction

Nowadays, a majority of Web users interacts daily with systems that store (some
of) their preferences and interests. Indeed, these personalization websites have
reached sufficient maturity to cover a large spectrum of applications such as e-
commerce, search engines, social networks. Most of them already propose to their
users recommendations or advertisements based on their profiles. In a similar
fashion, they help users to find people for sharing common activities, dating or
finding a roommate. Interoperability between these systems would benefit both
users and information providers, by overcoming the issue of integrating user
profiles [1]. We assume that user profiles are in the same language. Multilingual
issues have been studied in [2] but they are out of scope of this paper.
    In this context, a number of frameworks have been proposed to create and
manage user profiles [3–5]. Other initiatives such as OpenID1 provide users with
a means of storage for their passwords and basic profile information (e.g., name,
address). Companies are also interested in user profiles, for instance to build a
pool of experts on a specific topic. In [6], authors integrate profiles from vari-
ous users, in particular for human resources purposes. These approaches require
specific user inputs. In [7], authors aim at making interoperable user models for
!
  The first author carried out this work during the tenure of an ERCIM “Alain Ben-
  soussan” Fellowship Programme.
1
  http://openid.net
education. More specifically, they convert these user models from one system
to another, mainly relying on manual integration. In addition, the work is re-
stricted to two specific systems. Similarly to our approach, Li et al. also cluster
labels/tags to discover user interests [8]. However, their approach requires user-
annotated tags. Other initiatives like Mypes 2 already aggregate user profiles from
different personalization websites, but they do not gather similar interests under
the same concept, thus implying many redundancies in the generated tag cloud
(e.g. “mountain” and “mountains”).. In a similar fashion, Google’s Social Graph
API 3 enables to discover the relationship between people. However, it implies
that there exists a public link between user profiles. For representing user pro-
files, several user models such as Friend Of A Friend [9] (FOAF), General User
Modelling Ontology (GUMO) [10] or UserRDF [11] currently exist. While such
modelling frameworks have been developed, many Web applications still include
their own user models.
     Many issues need to be addressed to ensure interoperability between Web
applications, specifically for user profiles. For instance, one may require to inte-
grate a user profile to a higher level of abstraction such as models like GUMO.
In a similar fashion, users often own different profiles. Yet, if they need to create
a new one, this process is manually performed from scratch. Thus, we believe
there is a lack of integration approaches for user profiles to solve these prob-
lems. But more challenges need to be addressed at the application level too.
Social networks and commercial websites intensively analyse their user profiles
for recommendations. Due to the possible growth of these profiles, for which
users can subscribe to thousands of groups for instance, how can we extract the
most important interests for a given user? And when several users are involved,
applications emphasize the comparison of their profiles to deduce their shared
interests.
     In this paper, we explore work which aims at tackling all these issues. Indeed,
our work is intended to be useful both for end-users and for service providers.
Owners of scattered user profiles would be able to merge them into a unified
profile or could use information already stored in their profiles to automatically
build a new one. On the other hand, service providers would benefit from our
work for recommendation purposes by detecting and representing common user
interests, even when these users do not share online connections. Thus, we first
propose a method for integrating two profiles (either from the same user or from
different ones). The idea is to cluster similar interests around a high-level con-
cept. The discovering of these concepts and the matching of an interest towards a
concept are performed using state-of-the-art matching tools. Then, we present a
measure for assessing the importance of a cluster with respect to the other ones.
It is based on the cluster relationships between interests and their concept to
compute a weight. This weight enables the ranking of the most shared interests
between the two profiles.

2
    http://mypes.groupme.org/mypes/
3
    http://code.google.com/apis/socialgraph/
    The rest of this paper is divided as follows: Section 2 illustrates the problem
in terms of a scenario. Section 3 describes our approach in detail. Finally, we
conclude and outline several perspectives in Section 5.


2     Scenario and Approach

In this section, we first describe an example. Then, we present an overview of
our approach.


2.1   A Running Example

Let us imagine two people looking for a medical job. Jane has a Facebook account
and stores her bookmarks, including those about job searches, using del.icio.us.
John also has a Facebook account for leisure and personal activities, but he stays
tuned to professional networks with LinkedIn and his bookmarks are locally
stored in his web browser. Figure 1 depicts this scenario. In parentheses, I, G
and P respectively stand for interest, group or page that the user has subscribed
to. We notice that both users share a common sport (fishing/angling), and that
both enjoy other sports (tennis, rock climbing).


              (a) Jane’s profile                     (b) John’s profile

Fig. 1. Two Examples of User Profiles from several Personalization Websites with (G):
Group, (I): Interests and (P): page.
    Note that we could have chosen to integrate profiles from the same user. For
instance, by integrating Facebook and Delicious profiles from Jane, we would
have noticed that she is strongly interested in finding a job. On the other hand,
integrating John’s profiles could lead to some suggestions for adding Facebook
groups about job searches. In the rest of this paper, we focus on comparing
profiles from both users and ranking their most common interests.

2.2    Overview of our Approach
Our goal is to find common interests between two user profiles. Directly matching
individual interests using only terminological measures is not sufficient because if
we match ”angling” with ”tennis”, we could not discover the ”sport” concept. We
thus choose to include matching using concepts, since this allows identified terms
to be related to (multiple) higher-level concepts. This also allows us to include
pairs of interests in multiple clusters (in our example, ”fishing” and ”angling” are
connected to cluster ”fishing”, but they are also connected to cluster ”sport”). In
addition, we believe that our approach should be generic. Thus, we assume that
there are no semantics between interests included in the personalization websites.
We also consider that user interests are not structured (e.g., with categories). For
these reasons, we only apply terminological and linguistic similarity measures.
While using both methods helps us satisfy our goal of integration and comparison
of user profiles, we need some way of combining the information from both
methods.
    We propose a two-step approach, as shown by Figure 2. Users own profiles on
one or more personalization websites. The first component is the integrating
component, which extracts data from the user profiles. Then, these data are
matched to Wordnet concepts, thus forming clusters. For comparing common
interests from several user profiles, a next step is required to evaluate the weight
of each cluster according to the number of interests it gathers. This is performed
by the ranking component which outputs ranked user interests. The next
two sections contain the details of each component.


3     Integrating Interests from User Profiles
This sections first describes the two steps for integrating user profiles: (i) ex-
tracting user interests and (ii) clustering them. The last part of this section is
dedicated to discussion about these steps.

3.1    Extraction of User Profiles
During the first step, the integrating component extracts user interests from
various personalization websites, e.g., Facebook or LinkedIn. These interests not
only gather the list of activities and hobbies that the user filled in during pro-
file creation, but also include the groups to which the user subscribes, the pages
marked as “watch this page” and the bookmarks (s)he has added. All of these are
                        Fig. 2. Overview of our Approach


represented by a label (e.g., the name of a group, or the description of a book-
mark). Currently, we are able to extract such information from profiles stored in
Facebook, LinkedIn and Del.icio.us. The extractor uses common processes (tok-
enization, lemmatization and stemming [12]) to normalize each (label) interest
for the matcher component. In our running example, the interest medical pro-
fessionals is normalized into two tokens, medical and profession. Similarly, the
interest LastFM: listen to Internet music catalog is normalized into three tokens,
namely listen, music, Internet and catalog. The token LastFM is discarded during
this process because it does not have a Wordnet entry. Note that we could also
create a cluster for these discarded tokens. In this case, the number of occurences
for each token would eventually increase the importance of the interest.

3.2   Clustering Interests around High-level Concepts
The second step performed by the integrating component is to discover concepts
that gather several interests. In other words, we want to create clusters, each
composed of a high-level concept (from Wordnet dictionary) and a list of interests
that are related to this concept. To do so, we apply matching techniques, both
linguistics and terminological [12], between all interests from any two profiles.
    In our running example, this means that each of the 16 interests from Jane’s
profile (including the two category interests tennis and work) would be matched
to each of the 14 interests from John’s profile.
    For linguistics (or semantics), we only use the Wordnet-based similarity mea-
sure from YAM [13], a matching tool. This measure enables the discovery of re-
lationships such as hypernym, synonym or hyponym between two user interests.
In other words, we check the set of ancestors in the Wordnet hierarchy for each
interest. If we find a common ancestor between both interests, then this ances-
tor becomes a concept around which the two interests are clustered. Note that
in Wordnet, all words have a common ancestor (entity). To avoid discovering
such abstract concepts when matching a pair of labels, we have constrained the
hierarchy to 7 lower levels at most [14]. The Wordnet dictionary is also used in
[15], in which the authors map the Flickr tags to Wordnet categories. However,
these categories are high-level concepts (e.g., location, event, time) while our
approach aims at finding the closest ancestor between two interests.
    In our example, matching fishing and angling reveals that the former label is
a hypernym of the latter. Consequently, the fishing concept is created. Similarly,
matching work and profession enables the discovery of the job concept. Figure
3 depicts the four concepts that are created with the running example. Edges
represented by a full line denote a linguistic relationship (discovered using Word-
net), and the number indicates how far, in terms of Wordnet relationships, the
interest is from the concept (e.g., the fishing interest is related to the concept
sport by two concepts, namely outdoor sport and sport).


          Fig. 3. Interests Linked to the Concepts using Linguistic Measures


    Next, we apply terminological similarity measures (e.g., Needle-Wunsche,
Trigrams [12]) to compare character strings of the interests. Namely, we aim at
linking all interests that have not been linked either to a concept or to an interest
linked to a concept. We have chosen to use COMA++ [16], a matching tool
reputed to provide acceptable matching quality [17]. Indeed, this tool strongly
promotes precision, thus avoiding the discovery of irrelevant correspondences.
COMA++’s library contains 17 terminological measures that are aggregated
into a global similarity value4 . When the global similarity value computed for a
pair of two labels is too low, COMA++ automatically discards this pair from
the results list. This tool outputs a list of similar interests (according to their
labels comparison) associated with a similarity value between 0 and 1 (1 denoting
perfect similarity).
4
    Note that the list of normalized interests has been converted into a simple schema
    so that it can be processed by COMA++ and YAM.
   In our example, COMA++ matches for instance USA jobs or Job search to
the concept job with similarity values of 0.48 and 0.42 respectively. As COMA++
mainly aggregates terminological measures, the interest USA jobs, whose char-
acter string is smaller than the one of Job search, has a higher similarity value
with the concept job than Job search has. The tool also discards many candidate
pairs, e.g., between USA jobs and angling whose similarity value is close to 0.


      Fig. 4. Interests Linked to the Concepts using Terminological Measures


    At the end of this step, similar interests have been clustered (or integrated)
around the same concept, as shown by Figure 4. The similarity value computed
by terminological measures (e.g., interest Sport on Tennis.com is terminologi-
cally similar to the sport concept with 0.32 confidence).


3.3   Discussion

Both interests and concepts are represented by URIs. This avoids confusion with
similar labels. For example, the interest job search (group) is different from job
search (bookmark) but fishing interest linked to sport concept is the same entity
as fishing interest linked to the fishing concept.
    In the Wordnet hierarchy, a concept may have different meanings. For in-
stance, the concept rock can refer to a music genre, a stone, etc. If we try to
discover a common concept between a fanatic of progressive rock and a geology
fanatic, specifically for chondrites (a rock of meteoric origin containing chon-
drules), then we discover the common concept rock in the Wordnet hierarchy
between the two user interests. To avoid this problem, we analyse the direct
hyponyms of all meanings for the discovered concept (rock in our example) in
the Wordnet hierarchy. For the rock fanatic, its interest progressive rock ap-
pears under the following meaning “rock ’n’ roll, rock’n’roll, rock-and-roll, rock
and roll, rock, rock music”. On the contrary, the geology fanatic has its interest
chondrites under the meaning “rock, stone”. Thus, we are able to disambiguate
two interests that could have been clustered around the same concept due to its
different meanings.
    In our example, we only integrate two user profiles. However, we are able
to integrate more than two profiles by means of two techniques: (i) incremental
or (ii) holistic. The former is the fastest technique. Once we have integrated
two profiles, we incrementally integrate another one by computing only termi-
nological values, i.e., we directly match interests from the new profile with the
concepts which have already been extracted between the initial user profiles. On
the contrary, the holistic technique integrates each profile with all the others by
applying both linguistic and terminological similarity measures. Thus, this tech-
nique allows to discover all concepts between all profiles, but to the detriment
of execution time.
    We notice a gap between the value distributions of the scores obtained either
by terminological or by linguistic edges. Indeed, scores obtained for linguistic
edges are higher than the terminologic ones. We believe that this clearly reflects
the quality of the similarity measures which are used. On the other hand, termi-
nological measures mainly return more similarities (although with a lower score)
than the linguistic one, which compensates for the strictness of the latter.
    Many user profiles may contain hundreds of interests, in particular groups
and web pages. Thus, the number of extracted concepts can be very large, and
this can lead to confusion or unusability. Therefore, we propose to rank the
concepts (as well as user interests) using computed similarity values.


4   Ranking Interests from User Profiles

Ranking interests is required in some contexts, specifically to retrieve a particu-
lar one from a large collection. For instance, if a company is looking for a Prolog
programmer, it is necessary to dissociate Prolog experts from users who have
some Prolog tutorials in their bookmarks. Thus, we choose to rank concepts
(and consequently interests underlying them) according to their importance in
both profiles.

   Each cluster that has been computed during a previous step can be seen as
a connected component of a disconnected graph [18]. More formally, we have in
our context:

 – a set of vertices, composed of C, the set of concepts, and I, the set of interests
 – E, a set of edges between a concept and an interest. Given c ∈ C and i ∈ I,
   the edge e between c and i is noted eci . An edge either belongs to T , the
   set of terminological edges, or to L, the set of linguistic edges. Each edge e
    has a weight noted val(e). It corresponds to the similarity value computed
    in the previous step.

    Our intuition is to compute, for each concept, its weight in the graph. This
weight can be seen as the number and the quality of all interests that are linked
to a concept. By quality of an interest, we mean the type of similarity measure
used to link it to the concept (terminological or linguistic, the former being less
trustable than the latter) and the associated similarity value (which assesses the
confidence we have in the link). To do so, we first need to calculate the score
of each interest related to a concept. Given a concept c ∈ C and an interest
i ∈ I connected by an edge eci , we propose formula 1 to compute the score of
an interest with respect to a concept.
                             
                              val(eci ) if eci ∈ T
                score(c, i) = 1           if eci ∈ L and val(eci ) = 0          (1)
                              1
                                val(eci ) if eci ∈ L and val(eci ) "= 0

    Intuitively, the score of an interest is equal to its similarity value with the
concept in case both are related via a terminological edge. On the other hand,
with a linguistic edge, the score is inversely proportional to the number of in-
termediary (Wordnet) concepts between the interest and the concept. A specific
case appears when the concept is identical to the interest and when they are
linked by a linguistic edge. Although the number of intermediate concepts be-
tween them is null, the similarity between the concept and the interest is total.
All computed scores are in the range [0, 1].
    Let us compute several scores in our running example. The interest “USA
jobs” is terminologically connected to the concept “job”. Consequently, the score
between them is equal to their similarity value (0.48). The concepts “tennis” and
“sport” are linked by a linguistic edge, which indicates that they are separated
by 3 concepts in the “Wordnet” hierarchy. Thus, the score between “tennis” and
“sport” is equal to 31 . Finally, the interest “fishing” has no intermediate (Word-
net) concept with the concept “fishing”, and its score is equal to 1.

    Now that we have a score between an interest and a concept, it is possible
to compute the weight of each concept. The idea is to sum all scores between
a concept and its interests because the more links a concept has, the more
important it should be. In the following, |I| denotes the number of distinct
interests that have been linked to any concept. By distinct interest, we mean
represented by different URIs, as explained in Section 3.3. Given a concept c ∈ C,
we designate its related interests by a set Ic = < i1 , i2 , ..., in > such that ∀
ik ∈ Ic , ∃ ecik ∈ E. Using these definitions, formula 2 computes the score of the
concept c:                              $n
                                               score(c, ik )
                          weight(c) = k=1                                      (2)
                                                |I|
This formula returns values in the range [0, 1]. Indeed, the upper bound is reached
when there is only one cluster whose edges all have the maximum value (1 for
a score). As for the lower bound, it tends to 0, although this value cannot be
reached. As concepts are discovered by linguistic measures, there are at least two
edges linking any concept. As the number of intermediate (Wordnet) concepts is
a finite set, the scores of these linguistic edges tend towards 0 in the worst case.
Thus, even if we imagine an infinite number of distinct interests linked to any
concepts, the weight of any concept only tends towards 0.
    What happens with the concepts from Figure 4 ? The number of distinct
interests linked to one or more concepts is 19. Let us compute the weight for
the concept fishing. It has two linguistic relationships, and its score is therefore
             1+ 1
equal to 191 = 0.11. For the sport concept, we have both linguistic and termi-
nological edges. We compute its weight using formula 2 to obtain a value equal
    1
      + 1 + 1 + 1 +0.32+0.34
to 1 2 3 319                 = 0.15.

    At the end of this process, we have computed a score for each concept. Con-
sequently, it is possible to rank them with the shared interests that are most
represented at the top. These discovered concepts also form a “summary” be-
tween user profiles, since they are mainly higher level abstractions in the Wordnet
dictionary (e.g., rock climbing and tennis w.r.t. the sport concept).
    Following is the ranked list of common interests (with their scores in paren-
theses) shared by June and John’s profiles in our running example:

                                 1. job (0.27)
                                 2. sport (0.15)
                                 3. fishing (0.11)
                                 4. music (0.08)


5   Conclusion

In this paper, we have focused on integrating different user profiles for ranking
shared interests. As we desire a generic approach, we have only used linguis-
tic and terminological similarity measures applied to interests. We have noticed
that matching techniques are not perfect and can discover several irrelevant sim-
ilarities between totally different interests. Specifically, this weak point mostly
appears with terminological measures.
    We see many perspectives to our work. At first, we aim at integrating profiles
from other personalization websites. Namely, we need to identify these websites
and build appropriate wrappers to extract user interests. Some web services or
APIs are sometimes available to fulfill this goal. Concepts are currently browsed
through Wordnet dictionary. However, other resources, such as DBpedia, could
be useful to enrich or confirm the relationship between a user interest and a con-
cept. For instance, the interest salsa only appears in Wordnet as a spicy sauce,
thus no similar concept with music could be discovered whereas the salsa dis-
ambiguation page on Wikipedia lists more than 10 different meanings, including
the music style. We will also explore the possibility to match user interests to
concepts from models such as GUMO.
     Another challenge would be the extraction of preferences from free texts (or
unstructured texts) such as blogs or user reviews. A user who wrote a positive
comment or gave a good rating to the Lord of the Rings books is likely to have
interests in reading other Tolkien and/or fantasy books.
     An improvement could be reached for integrating several user profiles. The
incremental technique suffers from a possible missing of several concepts while
the holistic one is time consuming. An idea to tackle these issues is a hybrid ap-
proach, which would apply the costly linguistic similarity measure only between
interests that cannot be matched to an existing concept.
     A last perspective deals with user behaviours. Returning to our running
example, it is likely that our unemployed users regularly visit job search websites.
Once they have found a job, they will probably not consult these websites for a
while. Thus, frequency of visited websites is one of the measures that could help
us to update most shared interests over time. Another example would be the
discovery of another shared interest about medical. Although John joined a few
groups related to medical and health on LinkedIn, it seems that Jane typically
fills in forms with similar keywords to find jobs in her domain. Recording these
keywords on job search websites would enable to detect that both users work
in the same field. Further, we could also take into account the fact that differ-
ent personalization websites are exploited in different contexts, e.g. Facebook is
mainly used for contacting people, while LinkedIn is more used for professional
purposes.

References
 1. Aroyo, L., Dolog, P., Houben, G.J., Kravcik, M., Naeve, A., Nilsson, M., Wild,
    F.: Interoperability in personalized adaptive learning. Educational Technology &
    Society 9(2) (2006)
 2. Schultz, T., Kirchhoff, K.: Multilingual Speech Processing. Academic Press (2006)
 3. Palmisano, I., Redavid, D., Iannone, L., Semeraro, G., Degemmis, M., Lops, P.,
    Licchelli, O.: A rdf-based framework for user profile creation and management. In:
    SWAP. (2005)
 4. Abel, F., Heckmann, D., Herder, E., Hidders, J., Houben, G.J., Krause, D.,
    Leonardi, E., van der Slujis, K.: A framework for flexible user profile mashups.
    In: Proceedings of International Workshop on Adaptation and Personalization for
    Web 2.0 (AP-WEB 2.0 2009). CEUR Workshop Proceedings, Trento, Italy, CEUR,
    Tilburg, Aachen (June 22 2009) 1–10
 5. Korth, A., Plumbaum, T.: A framework for ubiquitous user modeling. In: IEEE
    International Conference on Information Reuse and Integration (IRI2007), IEEE
    Computer Society Press (2007)
 6. Vandermeulen, B., Duflou, J.R., Moor, B.D.: The role of user profiles in vector-
    based information retrieval. In: IKE. (2003) 668–669
 7. Stewart, C., Cristea, A., Celik, I., Ashman, H.: Interoperability between AEH user
    models. In: APS ’06: Proceedings of the joint international workshop on Adaptivity,
    personalization & the semantic web, New York, NY, USA, ACM (2006) 21–30
 8. Li, X., Guo, L., Zhao, Y.E.: Tag-based social interest discovery. In: WWW ’08:
    Proceeding of the 17th international conference on World Wide Web, New York,
    NY, USA, ACM (2008) 675–684
 9. Brickley, D., Miller, L.: Foaf vocabulary specification 0.97 (January 2010)
10. Heckmann, D., Schwartz, T., Brandherm, B., Schmitz, M., von Wilamowitz-
    Moellendorff, M.: Gumo - the general user model ontology. In: User Modeling.
    (2005) 428–432
11. Abel, F., Henze, N., Krause, D., Plappert, D.: User modeling and user profile
    exchange for semantic web applications. In: LWA. (2008) 4–9
12. Euzenat, J., Shvaiko, P.: Ontology matching. Springer-Verlag, Heidelberg (DE)
    (2007)
13. Duchateau, F., Coletta, R., Bellahsene, Z., Miller, R.J.: (Not) yet another matcher.
    In: CIKM. (2009) 1537–1540
14. Lin, D.: An information-theoretic definition of similarity. In: ICML. (1998) 296–304
15. Sigurbjörnsson, B., van Zwol, R.: Flickr tag recommendation based on collective
    knowledge. In: WWW ’08: Proceeding of the 17th international conference on
    World Wide Web, New York, NY, USA, ACM (2008) 327–336
16. Aumueller, D., Do, H.H., Massmann, S., Rahm, E.: Schema and ontology matching
    with COMA++. In: ACM SIGMOD. (2005) 906–908
17. Yatskevich, M.: Preliminary evaluation of schema matching systems. Technical
    Report DIT-03-028, Informatica e Telecomunicazioni, University of Trento (2003)
18. Diestel, R.: Graph Theory (Graduate Texts in Mathematics). Springer (August
    2005)