<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Definition of User Profiles Based on the YAGO Ontology</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Silvia Calegari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriella Pasi</string-name>
          <email>pasig@disco.unimib.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DISCo, Università degli Studi di Milano-Bicocca</institution>
          ,
          <addr-line>vle. Sarca 336/14, 20126 Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this work, we consider the problem of personalizing a user's Web searches in order to improve the quality of the results. To this end, we propose a preliminary methodology for defining a conceptual user profile based on the YAGO ontology.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        To overcome the limitations of the "one size fits all" approach of search engines,
personalized approaches to Information Retrieval have been proposed.
Personalized search relies both on modeling the user's context through a user profile that
represents the user's preferences, and on processes that exploit
the knowledge represented in the profile to tailor the search outcome to
the user's needs. An accurate definition of the user profile therefore plays a central role
in defining effective approaches to personalization. Up to now, bags of words,
vectors, and graph-based representations have mainly been used to define user
profiles. To improve the quality of the knowledge represented in user profiles,
some recent works have considered external knowledge sources (e.g., WordNet [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], or Web
directories such as the ODP [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and the Yahoo! Web directory [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) to represent the user context
in a more structured way. The use of an ontology
allows a more structured and expressive knowledge representation than
the above-mentioned approaches [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        A user profile is defined based on the analysis of the information characterizing
the user's interests and preferences. The elicitation of these interests and preferences
is not the focus of the research reported in this paper; numerous approaches have
been proposed in the literature for this purpose [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Our objective is the formal
definition of an ontological user profile that uses YAGO as external
reference knowledge. YAGO [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is a general-purpose ontology containing
several million entities and facts. Only the entities and facts that match the
user's interests are used to derive the user profile. To this end, a
preliminary methodology for extracting the appropriate fragment of
the YAGO ontology has been defined. The main objective of the research
reported in this paper is therefore (assuming that the user's interests are specified as a bag
of words) both to extract the portion of YAGO useful for defining a user
profile, and to organize it into a coherent ontological representation expressed
in a language such as RDFS.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Building the YAGO-based profile</title>
      <p>
        The novelty of the research reported in this paper lies in employing the YAGO [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
ontology as external reference knowledge for building a conceptual user profile.
YAGO is a general-purpose ontology consisting of more than 1.7 million
entities (such as books, movies, ...) and over 14 million facts about them. A triple
&lt;entity, relation, entity&gt; is called a fact. All facts are grouped into 99 relations
such as FamilyNameOf, subClassOf, actedIn, etc. To build the YAGO-based user
profile, our methodology is articulated in four phases, as sketched in Fig. 1. Our
investigation addresses the methodology defined for extracting the sub-part of
YAGO related to the user's interests. To produce a bag of words that represents
the user's interests, we consider a set of documents residing on the
user's PC and related to his/her topical preferences. We then analyze them
with standard IR techniques in order to extract meaningful terms, i.e., terms
representative of the user's preferences (interest-terms). We have thus developed
a strategy that semantically extracts the sub-YAGO ontology starting
from the interest-terms. A similar approach has been reported in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], where a set
of documents is indexed and the obtained index terms are semantically linked
to a network of concepts, but with the different aim of automatically constructing
hypertexts. Moreover, in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] the external knowledge resource is a taxonomy,
namely the ACM classification, which defines a hierarchy of topics where each topic
is a concept. YAGO, instead, is an ontology with millions of entities (concepts
plus individuals) and several relations with different semantics; for this reason,
several rules related to the possible relations have to be defined for associating
the index terms with the right entities.
      </p>
      <p>[Fig. 1: the four phases of the methodology, starting from the USER. Phase 1: Documents. Phase 2: Analysis and Terms Extraction. Phase 3: Personalized Knowledge Extraction from YAGO. Phase 4: Conversion to RDFS and Editing with Protégé.]</p>
      <p>Phase 1. This first phase consists in identifying the user's knowledge that has
to be considered to extract the user's interests. In this specific case, we analyzed
a set of documents collected by the user and stored on his/her personal computer.</p>
      <p>Phase 2. Each document is analyzed in two steps: (1) document preprocessing
and (2) term frequency analysis. In the first step, standard text
processing techniques such as stop-word removal and stemming are applied.
In the second step, the open source software Lucene is used to index the
documents; a standard normalized Tf-Idf formula is adopted to compute the
index term weights, although other weighting schemes will be considered in further
investigations.</p>
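      <p>The term selection of Phase 2 can be illustrated with a minimal sketch. It assumes that the documents are already tokenized, stop-word filtered and stemmed, and it uses a generic max-normalized Tf-Idf weight rather than Lucene's own scoring formula, which is not reproduced here. The function name interest_terms and the parameter alpha are hypothetical names introduced only for this illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def interest_terms(docs, alpha):
    """Return {term: weight} for terms whose normalized tf-idf exceeds alpha.

    docs  -- list of token lists (assumed already preprocessed)
    alpha -- threshold in [0, 1] applied to the normalized weights
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))

    best = {}  # best tf-idf weight of each term over all documents
    for doc in docs:
        counts = Counter(doc)
        if not counts:
            continue
        max_count = max(counts.values())
        for term, count in counts.items():
            tf = count / max_count             # tf normalized per document
            idf = math.log(n_docs / df[term])  # 0 if the term is in every doc
            best[term] = max(best.get(term, 0.0), tf * idf)

    top = max(best.values(), default=0.0)
    if top == 0.0:
        return {}
    # Normalize weights to [0, 1] and keep only terms above the threshold.
    return {t: w / top for t, w in best.items() if w / top > alpha}
```
]]></preformat>
      <p>Raising alpha shrinks the list of interest-terms; this is one simple way to trade coverage of the user's interests against the amount of noise passed on to the next phase.</p>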
      <p>Phase 3. The outcome of the previous phase is a list of interest-terms, i.e.,
index terms whose weight exceeds a given threshold α. To enrich the knowledge of the
user's interests, a process of knowledge extraction from the YAGO ontology is
performed. This process is articulated in three sub-phases: (1) extraction of individuals and
facts, (2) extraction of direct concepts and expansion to their child nodes,
and (3) addition of new synonyms.</p>
      <p>The fact extraction process is logically divided into the extraction of non-taxonomic and
taxonomic relations. Non-taxonomic relations are defined in the YAGO
ontology over entities which are referred to as individuals, while taxonomic
relations can hold between an individual and its parent concept (class), or
between two concepts. As previously stated, a fact is a triple defined as
&lt;entity, relation, entity&gt;, so the first step of the algorithm consists in
locating the facts in which an interest-term (obtained in phase 2) matches
an entity. The outcome of this step is a set of facts and entities
extracted from YAGO. Taxonomic relations require a separate treatment,
because some facts based on the
taxonomic relation SubClassOf do not carry useful information with respect
to the considered term. For example, the fact &lt;relational database systems,
SubClassOf, database systems&gt; contains the knowledge that "relational database
systems" are a sub-class of "database systems", which is not very informative. For
this reason, in case of a direct concept match, the algorithm takes all the first-level
children (individuals) of the matched concept. Referring to the previous
example, for the term "relational database systems" the instances
MySQL, Oracle, PostgreSQL, etc. will be added to the user's profile.
It is also possible that a term is not found in YAGO. When this happens, our
algorithm searches WordNet for synonyms. In case
multiple synsets exist, we adopt the methodology used by the authors of YAGO,
where the most probable synset (i.e., the synset with the highest probability of
occurrence) is selected.</p>
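      <p>The matching rules described above can be sketched as follows. This is an illustrative reading of the text under several assumptions, not the authors' implementation: facts are plain (subject, relation, object) tuples, the relation name "type" for linking an individual to its class is assumed, WordNet is replaced by a small hypothetical synonym dictionary, and the function name extract_facts is invented for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Relations treated as taxonomic. "type" (individual -> class) is an
# assumption of this sketch; "SubClassOf" is the relation named in the text.
TAXONOMIC = {"SubClassOf", "type"}


def extract_facts(term, facts, synonyms):
    """Select the facts relevant to one interest-term.

    facts    -- iterable of (subject, relation, object) triples
    synonyms -- dict mapping a term to its preferred synonym; a stand-in
                for the WordNet lookup (most probable synset)
    """
    facts = list(facts)
    entities = {e for s, _, o in facts for e in (s, o)}

    if term not in entities:
        # Fallback: retry with a synonym when the term is not an entity.
        term = synonyms.get(term)
        if term not in entities:
            return []

    selected = []
    for s, rel, o in facts:
        if rel in TAXONOMIC:
            # Keep only first-level children of the matched concept; the
            # fact linking the concept to its own parent is skipped as
            # uninformative.
            if o == term:
                selected.append((s, rel, o))
        elif term in (s, o):
            # Non-taxonomic fact that mentions the matched entity.
            selected.append((s, rel, o))
    return selected
```
]]></preformat>
      <p>With toy facts such as (relational database systems, SubClassOf, database systems) and (MySQL, type, relational database systems), the term "relational database systems" selects the MySQL fact and drops the SubClassOf fact, mirroring the example in the text.</p>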
      <p>
        Phase 4. At this step, the resulting personal ontology is converted into the
ontological language RDFS<sup>1</sup>, and its graph portions are visualized with the ontology
editor Protégé [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In the conversion process, every relation is exported into its own
RDFS file, and afterwards all files are gathered into a single schema
representing the personal ontology. A problem may arise concerning the quality of the
obtained profile: through index analysis and fact extraction
from YAGO, an unavoidable amount of noise is gathered into the final ontology.
A first, preliminary solution is to improve its quality manually using
the Protégé editor.
      </p>
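      <p>As a rough illustration of the conversion step, the sketch below serializes facts as N-Triples lines, grouping them by relation before merging them into one document. The choice of N-Triples, the namespace http://example.org/profile/, the mapping of "SubClassOf" and "type" to the standard RDFS and RDF properties, and the function names are all assumptions made for this example; the paper does not specify the serialization it uses.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict
from urllib.parse import quote

BASE = "http://example.org/profile/"  # hypothetical namespace
STANDARD = {
    # Assumed mapping of taxonomic relations to standard vocabulary terms.
    "SubClassOf": "http://www.w3.org/2000/01/rdf-schema#subClassOf",
    "type": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
}


def iri(name):
    """Build an IRI reference in the example namespace from a plain name."""
    return "<" + BASE + quote(name.replace(" ", "_")) + ">"


def to_ntriples(facts):
    """Serialize (subject, relation, object) facts as one N-Triples string."""
    by_relation = defaultdict(list)  # one group of lines per relation
    for s, rel, o in facts:
        pred = "<" + STANDARD[rel] + ">" if rel in STANDARD else iri(rel)
        by_relation[rel].append(f"{iri(s)} {pred} {iri(o)} .")
    # Merge the per-relation groups into a single document.
    return "\n".join(
        line for rel in sorted(by_relation) for line in by_relation[rel]
    )
```
]]></preformat>
      <p>Keeping one group per relation mirrors the per-relation export described in the text, and the merged output can then be loaded into an ontology editor for manual clean-up.</p>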
      <p>Preliminary Experiment. A preliminary analysis was carried out to define
a conceptual user profile based on the YAGO ontology from 35
documents. This set of documents is related to several user interests such as art,
literature, music, cinema and work. The second phase of the proposed
methodology was conducted using Lucene, with the scoring threshold
set to 0.5, thus obtaining 306 terms. At the end of phase 3, 578950 entities
(i.e., individuals plus concepts) were counted in the user profile, including 11
new terms added from WordNet. The last phase consisted in converting
the obtained profile into an ontological language (i.e., RDFS) in order to improve
it by, for example, reducing noise or adding relations between terms. For
example, if a term is related to the actor "Brad Pitt", all the corresponding
information defined in YAGO is extracted, such as the categories he belongs to (e.g.,
Action film actors, American male model, ...), as well as his non-taxonomic
relations (e.g., hasWonPrize, produced, actedIn, ...). By editing this ontological
profile in Protégé, the user can delete non-relevant information, for
example the facts related to Brad Pitt as a model.</p>
      <fn-group>
        <fn id="fn1">
          <label>1</label>
          <p>http://www.w3.org/TR/PR-rdf-schema</p>
        </fn>
      </fn-group>
    </sec>
    <sec id="sec-3">
      <title>Conclusions and Future Work</title>
      <p>The aim of this work is to create user profiles based on the YAGO general-purpose
ontology for the purpose of Web search personalization. We believe that ontologies
are worth investigating as a support for structuring knowledge
in user profiles. To this end, in a first preliminary application, the documents
collected by a user are considered as evidence of his/her interests. We plan
to improve the methodology presented in this paper in three main
directions: the first is to automatically remove some noise from the profile (e.g.,
by deleting non-relevant entities and the relations involving them), the second
is to add new relations and facts between terms that are not defined in YAGO, and the
last is to consider other sources of information (e.g., the user's past queries) to
extract the user's interests. Furthermore, we will test the obtained YAGO-based user
profile for expanding the user's queries in order to contextualize his/her Web searches.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Agosti</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Melucci</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crestani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Tachir: A tool for automatic construction of hypertexts for information retrieval</article-title>
          . In:
          <string-name>
            <surname>Funck-Brentano</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seitz</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (eds.) RIAO. pp.
          <fpage>338</fpage>
          –
          <lpage>358</lpage>
          . CID (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Degemmis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lops</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Semeraro</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>A content-collaborative recommender that exploits WordNet-based user profiles for neighborhood formation</article-title>
          .
          <source>User Model. User-Adapt. Interact.</source>
          <volume>17</volume>
          (
          <issue>3</issue>
          ),
          <fpage>217</fpage>
          –
          <lpage>255</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Gauch</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaffee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pretschner</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Ontology-based personalized search and browsing</article-title>
          .
          <source>Web Intelligence and Agent Systems</source>
          <volume>1</volume>
          (
          <issue>3-4</issue>
          ),
          <fpage>219</fpage>
          –
          <lpage>234</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Labrou</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finin</surname>
            ,
            <given-names>T.W.</given-names>
          </string-name>
          :
          <article-title>Yahoo! as an ontology: Using yahoo! categories to describe documents</article-title>
          .
          In:
          <source>CIKM</source>
          . pp.
          <fpage>180</fpage>
          –
          <lpage>187</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Noy</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fergerson</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Musen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>The knowledge model of Protégé-2000: Combining interoperability and flexibility</article-title>
          .
          In:
          <source>EKAW 2000</source>
          . pp.
          <fpage>17</fpage>
          –
          <lpage>32</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Pasi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Issues in personalizing information retrieval</article-title>
          .
          <source>IEEE Intelligent Informatics Bulletin</source>
          <volume>11</volume>
          (
          <issue>1</issue>
          ),
          <fpage>3</fpage>
          –
          <lpage>6</lpage>
          (December
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Sieg</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mobasher</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burke</surname>
            ,
            <given-names>R.D.</given-names>
          </string-name>
          :
          <article-title>Ontological user profiles for representing context in web search</article-title>
          . In: Web Intelligence/IAT Workshops. pp.
          <fpage>91</fpage>
          –
          <lpage>94</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Suchanek</surname>
            ,
            <given-names>F.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kasneci</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>YAGO: A large ontology from Wikipedia and WordNet</article-title>
          .
          <source>Journal of Web Semantics</source>
          <volume>6</volume>
          (
          <issue>3</issue>
          ),
          <fpage>203</fpage>
          –
          <lpage>217</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>