<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging Wikipedia Knowledge for Entity Recommendations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nitish Aggarwal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Mika</string-name>
          <email>pmika@yahoo-inc.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roi Blanco</string-name>
          <email>roi@yahoo-inc.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul Buitelaar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Insight Centre for Data Analytics, National University of Ireland, Galway, Ireland; Yahoo Labs</institution>
          ,
          <addr-line>125 Shaftesbury Ave, WC2H 8HR London</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>User engagement is a fundamental goal of commercial search engines. To increase it, they give users an opportunity to explore entities related to their queries. As most queries can be linked to entities in knowledge bases, search engines recommend entities that are related to the user's search query. In this paper, we present Wikipedia-based Features for Entity Recommendation (WiFER), which combines different features extracted from Wikipedia to provide related entity recommendations. We evaluate WiFER on a dataset of 4.5K search queries, where each query has around 10 related entities tagged by human experts on a 5-level label scale.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        With the advent of large knowledge bases like DBpedia (http://wiki.dbpedia.org/),
YAGO (http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/) and
Freebase (https://www.freebase.com/), search engines have started recommending entities related to web
search queries. Pound et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] reported that more than 50% of web search queries
pivot around a single entity and can be linked to an entity in a knowledge
base. Consequently, the task of entity recommendation in the context of web
search can be defined as finding the entities related to the entity appearing in a
web search query. An intuitive way to obtain related entities is to collect all
the entities explicitly linked to a given entity in a knowledge base. However, most
popular entities can easily have more than 1,000 directly connected
entities, and knowledge bases tend to cover only some specific types of relations.
For instance, "Tom Cruise" and "Brad Pitt" are not directly connected by any
relation in the DBpedia graph, yet they can be considered related to
each other, as both are popular Hollywood actors and have co-starred in movies.
Therefore, to build a system for entity recommendation, there is a need to
discover related entities beyond the relations explicitly defined in knowledge bases.
Furthermore, these related entities require a ranking method to select the most
related ones.
      </p>
      <p>
        Blanco et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] described the Spark system for related entity
recommendation and suggested that such recommendations are successful at extending
users' search sessions in Yahoo search. Microsoft also published a similar
system [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] that performs personalized entity recommendation by analyzing the click
logs. In this paper, we present Wikipedia-based Features for Entity
Recommendation (WiFER), which combines different features extracted from Wikipedia. It
makes use of Distributional Semantics for Entity Relatedness (DiSER) [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] and
Explicit Semantic Analysis (ESA) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] as features, in combination with others.
The features are combined using learning-to-rank methods [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. WiFER is
inspired by Spark. However, Spark utilizes proprietary data such as query logs and
query sessions, which are not publicly available. Therefore, we focus on extracting
different features from Wikipedia to build the entity recommendation system.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <p>Wikipedia-based Features for Entity Recommendation (WiFER) combines
different features using a learning-to-rank method. These features are extracted
from Wikipedia by considering two different types of data source: the collection of
textual content and the collection of Wikipedia hyperlinks. The features are derived
from the hypothesis that entities that often occur in the same context
(Wikipedia article) are more likely to be related to each other. We use the following
features:</p>
      <list list-type="order">
        <list-item>
          <p>Probability (P1, P2) is calculated by taking the ratio of the number of
articles that contain the given entity to the total number of articles. P<sub>1</sub> is
the probability of an entity E<sub>1</sub>:
            <disp-formula><tex-math>P_1 = \frac{\sum_{i=1}^{N} o_i}{N}</tex-math></disp-formula>
where o<sub>i</sub> = 1 if an article s<sub>i</sub> contains the entity E<sub>1</sub>, and o<sub>i</sub> = 0 otherwise. N is the total number of articles.
The value of P for an entity is independent of the other entities; therefore, it
gives two values, P<sub>1</sub> and P<sub>2</sub>, for an entity pair consisting of E<sub>1</sub> and E<sub>2</sub>.</p>
        </list-item>
        <list-item>
          <p>Joint probability (JPSYM) is obtained by taking the ratio of
the number of articles that contain both of the given entities to the total number
of articles:
            <disp-formula><tex-math>\mathrm{JPSYM} = \frac{\sum_{i=1}^{N} co_i}{N}</tex-math></disp-formula>
where co<sub>i</sub> = 1 if an article s<sub>i</sub> contains both
entities E<sub>1</sub> and E<sub>2</sub>, and co<sub>i</sub> = 0 otherwise.</p>
        </list-item>
        <list-item>
          <p>PMI (SISYM) computes the point-wise mutual information (PMI):
            <disp-formula><tex-math>\mathrm{PMI}(E_1, E_2) = \log \frac{P(E_1, E_2)}{P(E_1)\, P(E_2)}</tex-math></disp-formula>
where P(E<sub>1</sub>) and P(E<sub>2</sub>) are the prior
probabilities as described above. P(E<sub>1</sub>, E<sub>2</sub>) is computed by taking the ratio of the
number of articles that contain both entities E<sub>1</sub> and E<sub>2</sub> to the total
number of articles.</p>
        </list-item>
        <list-item>
          <p>Cosine similarity (CSSYM) is calculated as
            <disp-formula><tex-math>\mathrm{Cosine}(E_1, E_2) = \frac{P(E_1, E_2)}{P(E_1)\, P(E_2)}</tex-math></disp-formula>
          </p>
        </list-item>
      </list>
      <p>Because both Wikipedia collections are used, WiFER generates 16 different
feature values. To generate the feature values from the text collection, we
count an occurrence whenever the surface form of an entity appears in an article. In the
collection of hyperlinks, however, we count an occurrence only if the entity
appears as a hyperlink in an article. The Probability feature generates two values
for an entity pair; therefore, each collection provides 8 different feature values,
and we obtain 16 values in total.</p>
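      <p>The co-occurrence features listed above can be illustrated with a minimal sketch. It assumes each article is reduced to the set of entities it mentions; the function name, the toy articles, and the choice to return 0.0 when two entities never co-occur are assumptions made for illustration and are not taken from the WiFER implementation. The cosine feature follows the formula as given in the text, with the product of the two priors in the denominator. Running the function once on the text collection and once on the hyperlink collection would yield one set of values per collection.</p>
      <preformat>
```python
import math


def cooccurrence_features(articles, e1, e2):
    """Compute the listed co-occurrence features for one entity pair.

    `articles` is a list of sets; each set holds the entities found in one
    article (a simplified, assumed stand-in for a Wikipedia collection).
    """
    n = len(articles)
    if n == 0:
        raise ValueError("the article collection must not be empty")

    n1 = sum(1 for a in articles if e1 in a)                # articles containing E1
    n2 = sum(1 for a in articles if e2 in a)                # articles containing E2
    n12 = sum(1 for a in articles if e1 in a and e2 in a)   # articles containing both

    p1, p2, jp = n1 / n, n2 / n, n12 / n

    if n12 == 0:
        # PMI is undefined (log of zero) without any co-occurrence;
        # falling back to 0.0 is an assumption of this sketch.
        pmi, cosine = 0.0, 0.0
    else:
        ratio = jp / (p1 * p2)
        pmi = math.log(ratio)
        cosine = ratio  # denominator is the product of priors, as in the text

    return {"P1": p1, "P2": p2, "JPSYM": jp, "SISYM": pmi, "CSSYM": cosine}


# Hypothetical toy collection: four articles and the entities they mention.
toy_articles = [
    {"Tom Cruise", "Brad Pitt"},
    {"Tom Cruise"},
    {"Brad Pitt", "Angelina Jolie"},
    {"Tom Cruise", "Brad Pitt", "Angelina Jolie"},
]
features = cooccurrence_features(toy_articles, "Tom Cruise", "Brad Pitt")
print(features)
```
      </preformat>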
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>
        In order to evaluate our approach, we compare WiFER with the Spark entity
recommendation system [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] that uses more than 100 features extracted from
different data sources such as query logs and user search sessions. We evaluate
the performance on the same dataset that was used by Spark. It consists of 4,797
search queries. Every query refers to an entity in DBpedia and contains a list of
entity candidates. The entity candidates are tagged by professional editors on a
5-label scale: Perfect, Excellent, Good, Fair, and Bad. In total, the dataset contains 47,623
query-entity pairs. We use the Gradient Boosting Decision Tree (GBDT) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] ranking
method. Because the number of retrieved related entities varies across queries,
we use Normalized Discounted Cumulative Gain (nDCG) as the performance
metric. We calculate nDCG@10, nDCG@5, and nDCG@1.
All nDCG scores are obtained by performing 10-fold cross-validation. In
addition to performing experiments on the dataset with all entity types, we
also evaluated the systems on datasets that include only person-type entities
or only location-type entities. Table 1 shows the retrieval performance of Spark and
compares it with WiFER. WiFER achieved comparable results on the
full dataset and on person-type entities. However, it did not cope well with
location-type entities. A possible reason is that most of the locations
are too specific and do not have enough information on Wikipedia. Moreover,
to investigate whether WiFER can complement Spark's performance, we combine all the
features in Spark with the WiFER features. WiFER could not outperform Spark;
however, the combination of both, i.e. Spark+WiFER, achieved higher scores in
all test cases. Although WiFER obtained relatively lower scores for
location-type entities, it is able to complement Spark's performance. Further,
we performed an extensive evaluation to investigate the importance of different
features in entity recommendation (see [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for more details).
      </p>
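      <p>As an illustration of the evaluation metric, the following minimal sketch computes nDCG@k for one query from the editorial labels of its ranked candidates. The numeric gains assigned to the five labels and the exponential-gain, log2-discount form of DCG are assumptions; the text does not state which variant was used in the experiments.</p>
      <preformat>
```python
import math

# Assumed numeric gains for the five editorial labels (not given in the text).
GAIN = {"Perfect": 4, "Excellent": 3, "Good": 2, "Fair": 1, "Bad": 0}


def dcg_at_k(labels, k):
    """DCG over the first k labels: sum of (2^gain - 1) / log2(rank + 1)."""
    return sum(
        (2 ** GAIN[label] - 1) / math.log2(rank + 1)
        for rank, label in enumerate(labels[:k], start=1)
    )


def ndcg_at_k(ranked_labels, k):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal = sorted(ranked_labels, key=GAIN.get, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        # Every candidate is labeled "Bad"; returning 0.0 is a convention chosen here.
        return 0.0
    return dcg_at_k(ranked_labels, k) / ideal_dcg


# Hypothetical ranking produced for one query, best-scored candidate first.
ranking = ["Good", "Perfect", "Bad", "Fair"]
for k in (1, 5, 10):
    print(k, ndcg_at_k(ranking, k))
```
      </preformat>
      <p>Per-query scores of this kind would then be averaged over the queries of each test fold.</p>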
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this paper, we presented WiFER, which combines different features extracted
from Wikipedia using a learning-to-rank method. We showed that WiFER
achieved accuracy comparable to Spark, which uses more than 100 features
obtained from proprietary data sources such as query logs and user search sessions.
Moreover, since Spark does not utilize Wikipedia to build its features, we
combined WiFER with the Spark features and showed that WiFER complements the
overall performance of Spark.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>N.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Asooja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ziad</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Buitelaar</surname>
          </string-name>
          .
          <article-title>Who are the american vegans related to brad pitt?: Exploring related entities</article-title>
          .
          <source>In Proceedings of the 24th International Conference on World Wide Web Companion</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>N.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Buitelaar</surname>
          </string-name>
          .
          <article-title>Wikipedia-based distributional semantics for entity relatedness</article-title>
          .
          <source>In 2014 AAAI Fall Symposium Series</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>N.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Blanco</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Buitelaar</surname>
          </string-name>
          .
          <article-title>Insights into entity recommendation in web search</article-title>
          .
          <source>In Proceedings of the Intelligent Exploration of Semantic Data, ISWC</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>R.</given-names>
            <surname>Blanco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. B.</given-names>
            <surname>Cambazoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mika</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Torzec</surname>
          </string-name>
          .
          <article-title>Entity recommendations in web search</article-title>
          .
          <source>In International Semantic Web Conference (ISWC)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <article-title>Greedy function approximation: a gradient boosting machine</article-title>
          .
          <source>Annals of Statistics</source>
          , pages
          <fpage>1189</fpage>
          –
          <lpage>1232</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>E.</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Markovitch</surname>
          </string-name>
          .
          <article-title>Computing semantic relatedness using Wikipedia-based explicit semantic analysis</article-title>
          .
          <source>In Proceedings of the 20th International Joint Conference on Artificial Intelligence</source>
          ,
          <source>IJCAI'07</source>
          , pages
          <fpage>1606</fpage>
          –
          <lpage>1611</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>J.</given-names>
            <surname>Pound</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mika</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          .
          <article-title>Ad-hoc object retrieval in the web of data</article-title>
          .
          <source>In Proceedings of the 19th international conference on World wide web</source>
          , pages
          <fpage>771</fpage>
          –
          <lpage>780</lpage>
          . ACM,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>X.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.-J. P.</given-names>
            <surname>Hsu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          .
          <article-title>On building entity recommender systems using user click log and freebase knowledge</article-title>
          .
          <source>In Proceedings of the 7th ACM international conference on Web search and data mining</source>
          , pages
          <fpage>263</fpage>
          –
          <lpage>272</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>