<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Insights into Entity Recommendation in Web Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nitish Aggarwal</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Mika</string-name>
          <email>pmika@yahoo-inc.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roi Blanco</string-name>
          <email>roi@yahoo-inc.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul Buitelaar</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>User engagement is a fundamental goal for search engines. Recommendations of entities that are related to the user's original search query can increase engagement by raising interest in these entities and thereby extending the user's search session. Related entity recommendations have thus become a standard feature of the interfaces of modern search engines. These systems typically combine a large number of individual signals (features) extracted from the content and interaction logs of a variety of sources. Such studies, however, do not reveal the contribution of individual features, their importance and interaction, or the quality of the sources. In this work, we measure the performance of entity recommendation features individually and by combining them based on a novel dataset of 4.5K search queries and their related entities, which have been evaluated by human assessors.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        With the advent of large knowledge bases like DBpedia [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], YAGO [13] and
Freebase [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], search engines have started recommending entities related to web
search queries. Pound et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] reported that around 50% of web search queries
pivot around a single entity and can be linked to an entity in a knowledge
base. Consequently, the task of entity recommendation in the context of web
search can be defined as finding the entities related to the entity appearing in a
web search query. An intuitive approach is to retrieve all entities explicitly
linked to a given entity in the knowledge base. However, most popular entities
have more than 1,000 directly connected entities, and knowledge bases mainly
cover specific types of relations. For instance, "Tom Cruise" and "Brad Pitt"
are not directly connected by any relation in the DBpedia graph, yet they can be
considered related to each other. Therefore, to build an entity recommendation
system, there is a need to find related entities beyond the explicit relations
defined in knowledge bases. Further, these related entities require a ranking
method to select the most related ones. (This work was done while the first
author was visiting Yahoo! Research Labs, Barcelona.)
      </p>
      <p>
        Blanco et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] described the Spark system for related entity
recommendation and suggested that such recommendations are successful at extending users'
search sessions. Microsoft also published a similar system [14] that performs
personalized entity recommendation by analyzing user click-through logs. In this
paper, we focus on exploring the different features in an entity recommendation
system and investigate their effectiveness. Yahoo's entity recommendation
system "Spark" utilizes more than 100 different features providing evidence of
the relevance of an entity. The final relevance scores are calculated by combining
the different features using a state-of-the-art learning-to-rank approach. Although
Blanco et al. presented some experimentation with the Spark system, in
particular by reporting on the importance of the top 10 features and the evaluation
metrics on different types of entities, further experimentation is required to
investigate the impact of individual features and their different combinations. The
features used in Spark can be divided into five types: co-occurrence-based features,
linear combinations of co-occurrence-based features, graph-based, popularity-based,
and type-based features. Co-occurrence-based features make use of four different
data sources: query terms, user-specific query sessions, Flickr tags, and tweets.
In this paper, we explore the impact of the features used in the Spark system by
combining them based on their types and data sources. In order to investigate
the quality of the different data sources, we focus extensively on co-occurrence-based
features. Not all of the data sources used to calculate co-occurrence-based features
are publicly accessible; for instance, only major search engines have
datasets like query terms and query sessions. Therefore, we measure the
performance of a system that has only co-occurrence-based features extracted from
Wikipedia. Data sources like query terms, Flickr tags, and tweets can only
capture the presence of an entity. However, Wikipedia articles are long enough
to obtain the associative weight of an entity with a Wikipedia article, which
provides an opportunity to build a distributional semantic model (DSM) [
        <xref ref-type="bibr" rid="ref1 ref10 ref4">1, 4, 10</xref>
        ]
over Wikipedia concepts. Therefore, in addition to co-occurrence-based features
that consider only presence, we also explore a DSM-based feature built
over Wikipedia. We evaluate the performance of adding the Wikipedia-based
features to the current Spark system, which will be referred to as Spark+Wiki in
the rest of the paper.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Entity recommendation system</title>
      <p>This section provides a detailed overview of the Spark system. Section 2.1
describes the construction of Yahoo's knowledge graph, part of which is used to
obtain the potential entity candidates. Section 2.2 explains the different types of
features and how they are extracted from different data sources. Spark and
Spark+Wiki combine the values obtained from the different features using a
learning-to-rank approach, which is explained in Section 2.3.</p>
      <p>
        2.1 Yahoo knowledge graph
In order to retrieve a ranked list of entities, the system requires a list of
potential entity candidates that can be considered related to the given entity.
These candidates can be obtained from existing knowledge bases like DBpedia or
YAGO. However, such knowledge bases may not cover all the relations
that can be defined between related entities. For instance, "Tom Cruise" can
be considered highly related to "Brad Pitt", but they are not connected by any
relation in the DBpedia graph. Therefore, Spark uses an entity graph extracted from
different structured and unstructured data sources, including public data sources
such as DBpedia and Freebase. It also uses a manually constructed ontology
that defines the types of an entity extracted from different resources. In order to
extend the coverage of the relations defined in the entity graph, it performs
information extraction over various unstructured data sources in different domains
like movies, music, TV shows, and sports. The subset of the entity graph used in
Spark covers entity types in media, sports, and geography, and consisted of over
3.5M entities and 1.4B relations at the time of our experiments (see [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] for more detail).
      </p>
      <p>2.2 Feature extraction
Spark uses more than 100 different features. These features are divided into
five categories: co-occurrence-based features, linear combinations of
co-occurrence-based features, graph-based features, popularity-based features, and
type-based features.</p>
      <p>Co-occurrence features are derived from the hypothesis that entities
which often occur in the same event or context are more likely to be related to
each other. The Spark system uses 11 different types of features obtained with
different co-occurrence measures. Let E1 and E2 be two entities and
S = {s1, s2, ..., sN} be the set of events, where si is the i-th event and N is the
total number of events. An event is defined as one observation under consideration
for measuring co-occurrence; for instance, every query in the query logs is an event.
The occurrence of an entity E is defined by sum_{i=1}^{N} o_i, where o_i = 1 if
event s_i contains E, and o_i = 0 otherwise. Similarly, co_i = 1 if event s_i
contains both E1 and E2 (co_i = 0 otherwise), oe1_i and oe2_i indicate whether
s_i contains E1 and E2 respectively, and, at the user level over U users,
cou_i, oue1_i, and oue2_i are the analogous indicators for user u_i.</p>
      <p>1. Probability (P1, P2): the ratio of the number of events that contain
the given entity to the total number of events, P = (sum_{i=1}^{N} o_i) / N.
The value of P for an entity is independent of the other entity, so it gives two
values, P1 and P2, for an entity pair consisting of E1 and E2.
2. Entropy (Ent1, Ent2): the standard entropy of an entity, defined by
Ent1 = -P1 log(P1), where P is the probability defined in feature 1. Like the
probability feature, it gives two values, Ent1 and Ent2, for an entity pair.
3. KL divergence (KL1, KL2): the KL divergence of an entity E. Like the
features above, it gives two values, KL1 and KL2, for an entity pair.
4. Joint probability (JPSYM): the ratio of the number of events that contain
both entities to the total number of events, JPSYM = (sum_{i=1}^{N} co_i) / N.
5. Joint user probability (PUSYM): similar to feature 4, but computed over
users rather than events, PUSYM = (sum_{i=1}^{U} cou_i) / U.
6. PMI (SISYM): the pointwise mutual information,
PMI(E1, E2) = log( P(E1, E2) / (P(E1) P(E2)) ).
7. Cosine similarity (CSSYM): the cosine similarity, calculated as
Cosine(E1, E2) = (sum_{i=1}^{N} co_i) / sqrt( (sum_{i=1}^{N} oe1_i) (sum_{i=1}^{N} oe2_i) ).
8. Conditional probability (CPASYM): the ratio of the number of events that
contain both E1 and E2 to the number of events that contain E1,
CPASYM(E1, E2) = (sum_{i=1}^{N} co_i) / (sum_{i=1}^{N} oe1_i) = P(E1, E2) / P(E1).
9. Conditional user probability (CUPASYM): similar to CPASYM, except that
it computes the score over users,
CUPASYM(E1, E2) = (sum_{i=1}^{U} cou_i) / (sum_{i=1}^{U} oue1_i).
10. Reverse conditional probability (RCPASYM): the reverse of CPASYM,
RCPASYM(E1, E2) = (sum_{i=1}^{N} co_i) / (sum_{i=1}^{N} oe2_i).
11. Reverse conditional user probability (RCUPASYM): the reverse of CUPASYM,
RCUPASYM(E1, E2) = (sum_{i=1}^{U} cou_i) / (sum_{i=1}^{U} oue2_i).</p>
      <p>Combined features are combinations of the co-occurrence features. The Spark
system uses 8 different types of combined features for every data source,
generating a total of 32 features. These are the following 8 features:
1. CF1 combines the conditional user probability and the prior probability of
the target entity: CF1 = CUPASYM * P2.
2. CF2 combines the conditional user probability and the prior probability of
the target entity: CF2 = CUPASYM / P2.
3. CF3 combines the reverse conditional probability and the prior probability
of the target entity: CF3 = RCPASYM * P2.
4. CF4 combines the reverse conditional probability and the entropy of the
target entity: CF4 = RCPASYM * Ent2.
5. CF5 combines the joint user probability and the prior probability of the
target entity: CF5 = PUSYM * P2.
6. CF6 combines the joint user probability and the prior probability of the
target entity: CF6 = PUSYM / P2.
7. CF7 combines the joint user probability and the entropy of the target
entity: CF7 = PUSYM * Ent2.
8. CF8 combines the joint user probability and the entropy of the target
entity: CF8 = PUSYM / Ent2.</p>
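      <p>As an illustration, the event-level co-occurrence measures above can be sketched over a toy event collection. The events and entity names below are invented for the example; in the actual system the events are queries, query sessions, Flickr tag sets, or tweets:</p>
      <p>
```python
import math

# Toy event collection: each event is the set of entities it mentions
# (invented data for illustration only).
events = [
    {"Tom Cruise", "Brad Pitt"},
    {"Tom Cruise"},
    {"Brad Pitt", "Angelina Jolie"},
    {"Tom Cruise", "Brad Pitt"},
]
N = len(events)

def count(entity):
    # sum of o_i: number of events containing the entity
    return sum(1 for s in events if entity in s)

def co_count(e1, e2):
    # sum of co_i: number of events containing both entities
    return sum(1 for s in events if e1 in s and e2 in s)

def probability(e):                      # feature 1: P
    return count(e) / N

def joint_probability(e1, e2):           # feature 4: JPSYM
    return co_count(e1, e2) / N

def pmi(e1, e2):                         # feature 6: SISYM
    return math.log(joint_probability(e1, e2) /
                    (probability(e1) * probability(e2)))

def cosine(e1, e2):                      # feature 7: CSSYM
    return co_count(e1, e2) / math.sqrt(count(e1) * count(e2))

def conditional(e1, e2):                 # feature 8: CPASYM
    return co_count(e1, e2) / count(e1)
```
      </p>
      <p>For instance, with these toy events, probability("Tom Cruise") is 0.75 and conditional("Tom Cruise", "Brad Pitt") is 2/3. The user-level variants are identical in shape, with users' aggregated event sets replacing single events.</p>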
      <p>Graph-based features use knowledge graphs like DBpedia and Freebase.
Spark computes 5 different features using knowledge graphs:
1. Graph similarity (GSCEG): the total number of shared connections
between the two given entities in the Yahoo! knowledge graph.
2. Entity popularity in movies (EPOPUMOVIE): the total number of directly
connected nodes in the movie-specific knowledge graph, used to compute the
entity popularity rank.
3. Facet popularity in movies (FPOPUMOVIE): the facet popularity rank in
the movie-specific knowledge graph.
4. Entity popularity in all (EPOPUALL): similar to EPOPUMOVIE, but
counting the total number of directly connected nodes in the complete Yahoo!
knowledge graph.
5. Facet popularity in all (FPOPUALL): the facet popularity rank in the
complete knowledge graph.</p>
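      <p>The graph similarity and entity popularity features reduce to simple neighbourhood operations on the entity graph; a minimal sketch, assuming the graph is given as an adjacency mapping (the toy graph below is invented for the example):</p>
      <p>
```python
# Invented toy entity graph: entity name mapped to the set of directly
# connected entities (a real graph has millions of nodes).
graph = {
    "Tom Cruise": {"Top Gun", "Nicole Kidman", "Mission: Impossible"},
    "Brad Pitt": {"Fight Club", "Nicole Kidman", "Top Gun"},
}

def graph_similarity(e1, e2):
    # GSCEG-style score: total number of connections shared by both entities.
    return len(graph.get(e1, set()).intersection(graph.get(e2, set())))

def entity_popularity(e):
    # EPOPUALL-style count: number of directly connected nodes.
    return len(graph.get(e, set()))
```
      </p>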
      <p>Popularity-based features:
1. Web search citation (WCTHWEB): the total number of hits in Yahoo! web
search results.
2. Web deep citation (WCDHWEB): the total number of user clicks in Yahoo!
web search results.
3. Entity volume in query (COVQ): the total number of occurrences of the
given entity in the query logs.
4. Entity volume in facet (COVF): the facet volume in the query logs.
5. Entity view volume in query (WPOP1, WPOP2): the total number of user
clicks for the given entity when the entity occurs in a query.</p>
      <p>Entity type features reflect the entity types and relation types present in the
knowledge bases. Spark uses two different entity type features:
1. Entity class type (ET1, ET2): the type of an entity defined in the
knowledge base. It provides two different feature values, ET1 and ET2, for an
entity pair consisting of the entities E1 and E2.
2. Relation type (RT): the relation type between the two given entities. For
instance, "Brad Pitt" and "Angelina Jolie" are connected by the relation type
"Partner" in DBpedia.</p>
      <p>
        Wikipedia-based features The Spark system does not use Wikipedia to
extract its features. However, in addition to the features reported by Blanco et
al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], we experiment with additional Wikipedia-based features; we refer to the
resulting system as Spark+Wiki. Aggarwal et al. [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] presented an entity recommendation system,
"EnRG", which shows the effectiveness of using only Wikipedia-based features.
In this section, we explain these additional features.
      </p>
      <p>
        In order to obtain the Wikipedia-based features, we use Wikipedia as two types
of data sources: a collection of textual content and a collection of Wikipedia
hyperlinks. We use 7 types of co-occurrence features from Wikipedia, 6 of which
are already defined above: Probability (P1, P2), Joint probability (JPSYM),
Conditional probability (CPASYM), Cosine similarity (CSSYM), PMI (SISYM),
and Reverse conditional probability (RCPASYM). The co-occurrence features
described above only consider the presence of an entity, as the events (search
queries or tweets) used in Spark are very short. However, Wikipedia articles have
enough content to measure the importance of an entity to a given article (an
event in this case). Therefore, Wikipedia can provide the occurrence information
of entities together with importance weights, which can be used to build a
distributional vector for each entity. Spark+Wiki uses a Wikipedia-based
distributional semantic model (DSM) [
        <xref ref-type="bibr" rid="ref4 ref9">4, 9</xref>
        ] as an additional
co-occurrence feature. The DSM score is calculated by computing the cosine
similarity between two distributional vectors. The DSM vector of an entity e is
defined by v = sum_{i=1}^{Nw} a_i c_i, where c_i is the i-th concept in the
Wikipedia concept space, a_i is the tf-idf weight of the entity e with the
concept c_i, and Nw is the total number of Wikipedia concepts. Since we use
Wikipedia both as a collection of textual content and as a collection of
hyperlinks, there are 16 features in total that compute their values using
Wikipedia.
      </p>
      <p>
        2.3 Ranking
In order to predict the ranking by combining all the features, Spark uses a
learning-to-rank approach [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], considering all the scores obtained from the different features.
As any learning algorithm requires training data, Blanco et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] built a
dataset that contains more than four thousand web search queries. Every query
refers to an entity defined in the knowledge graph and contains a list of entity
candidates. In total, the dataset consists of 47,623 entity pairs, which were
labeled by professional experts. The ranking is defined by learning a ranking function
f(.) that generates a score for an input query entity qi and an entity candidate
ej. Spark uses Stochastic Gradient Boosted Decision Trees (GBDT) to obtain the
ranking score and decide the appropriate label for a given pair.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>
        This section describes the evaluation of Spark and Spark+Wiki. As explained
above, Spark+Wiki is Spark with additional Wikipedia-based features. We
evaluate the performance on a dataset that consists of 47,623 query-entity pairs.
As Spark uses the GBDT ranking method, we tune the GBDT parameters by
splitting the dataset into 10 folds; the final parameters are obtained by cross
validation. Due to variations in the number of retrieved related entities per
query, we use Normalized Discounted Cumulative Gain (nDCG) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
as the performance metric. nDCGp is defined as the ratio of DCGp to the
maximum (ideal) DCGp.
      </p>
      <p>nDCGp = DCGp / IDCGp, where DCGp is defined by
DCGp = sum_{i=1}^{p} (2^{g(l_i)} - 1) / log2(i + 1)
and g(l_i) is the gain for the label l_i. nDCG gives different scores for different
values of p; therefore, we report the nDCG scores for p = 1, 5, and 10.</p>
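      <p>Under this definition, nDCG@p can be sketched directly. The gain values in the example are illustrative integers, not the editorial grades used in the actual dataset:</p>
      <p>
```python
import math

def dcg(gains, p):
    # DCG_p = sum over ranks i = 1..p of (2^g_i - 1) / log2(i + 1)
    return sum((2 ** g - 1) / math.log2(i + 1)
               for i, g in enumerate(gains[:p], start=1))

def ndcg(gains, p):
    # Normalize by the ideal DCG: gains sorted in decreasing order.
    idcg = dcg(sorted(gains, reverse=True), p)
    return dcg(gains, p) / idcg if idcg > 0 else 0.0
```
      </p>
      <p>A perfectly ordered result list gets nDCG 1.0; any inversion of a higher-gain item below a lower-gain one reduces the score.</p>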
      <p>
        3.1 Datasets
Blanco et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] reported the Spark performance on a dataset that consists
of 4,797 search queries obtained from commercial search engines. Every query
refers to an entity in DBpedia and contains a list of entity candidates. The entity
candidates are tagged by professional editors on a 5-label scale: Excellent, Prefer,
Good, Fair, and Bad. The dataset contains different types of entity candidates
such as person, location, movie, and TV show. Table 1 provides details about the
different types of instances in the dataset; it shows that most entities are of type
"location" or "person". Section 3.3 reports the performance for these specific
types in addition to the overall dataset.
We evaluate the performance of the Spark system and compare it with a model
built only over Wikipedia. In order to inspect whether the additional features
generated from Wikipedia can complement Spark's performance, we also perform
experiments with Spark+Wiki. We calculate nDCG@10, nDCG@5, and nDCG@1
as the evaluation metrics. In addition to performing experiments on the dataset
with all entity types, we also evaluate the systems on subsets containing only
person-type or only location-type entities. Spark combines the scores obtained
from the different types of features using GBDT. It contains 112 features in total:
56 features are co-occurrence-based, 32 are linear combinations of co-occurrence-based
features, 5 are graph-based, 6 are popularity-based, 3 are type-based, and the
remaining 10 are of types such as string length and Wikipedia clicks. The
56 co-occurrence-based features are built over 4 different data sources: query
terms (QT), query sessions (QS), Flickr tags (FL), and tweets (TW); that is,
14 co-occurrence-based features are generated from each data source.
Spark+Wiki has additional co-occurrence-based features built over Wikipedia.
Spark+Wiki uses Wikipedia as two types of data sources: a collection of
documents with textual content and a collection of documents with hyperlinks only.
However, it does not generate all 14 co-occurrence-based features for these data
sources. Spark+Wiki uses 8 co-occurrence-based features: Probability (P1, P2),
Joint probability (JPSYM), PMI (SISYM), Cosine similarity (CSSYM),
Conditional probability (CPASYM), Reverse conditional probability (RCPASYM),
and the Distributional semantic model (DSM) vector. The DSM feature was not
available in Spark, as the data sources used in Spark have only small documents
(queries or tweets). However, Wikipedia's characteristics allow us to build the
DSM vector over Wikipedia concepts [
        <xref ref-type="bibr" rid="ref4 ref9">4, 9</xref>
        ]. As a result, Spark+Wiki consists of 128
features, 16 of which are additional to the Spark system presented by Blanco
et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In order to investigate the importance of the features, we build ranking
models by taking the features from one category at a time. We thus examine
the performance of all five models: co-occurrence-based, linear combinations
of co-occurrence-based features, graph-based, popularity-based, and type-based.
Further, we perform experiments with only co-occurrence-based features, as
they turn out to be the most significant features of the system; we calculate the
scores using co-occurrence-based features and compare the importance of
each data source separately.
This section presents the results obtained from the experiments described
above. Table 2 shows the retrieval performance of Spark and compares it with
Spark+Wiki and the Wikipedia-only model. The Wikipedia-based model
achieved comparable results on the full dataset and on person-type entities.
However, it could not cope as well with location-type entities. A possible reason
is that many of the locations are too specific and do not have enough
information on Wikipedia. Although the Wikipedia-based model could not
outperform Spark, the combination of both, i.e., Spark+Wiki, achieved higher scores for
all test cases. The Wikipedia-based model obtained relatively lower scores for
location-type entities, but it is still able to complement Spark's performance.
In order to inspect the effectiveness of the different features, we compute the
feature importance in our learning algorithm: we calculate the reduction in the
loss function for every split on a feature variable and then compute the total
reduction in the loss function. This indicates how often a given feature was
used in making the final decision by the learning algorithm. Table 3 shows the
importance of the top 20 features used in Spark+Wiki. The names of the features
listed in the table correspond to the acronyms explained in Section 3.2. The
co-occurrence features have the additional suffixes QT, QS, FL, TW, WT, and WL for
query terms, query sessions, Flickr tags, tweets, Wikipedia text, and Wikipedia
links, respectively. For instance, the feature CSSYMFL refers to the cosine similarity
computed over Flickr tags. Table 3 shows that relation type (RT) is the most
important feature in Spark+Wiki, which matches what was reported by Blanco et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Further, this table reports the effectiveness of the Wikipedia-based features:
      </p>
      <p>
there are 5 Wikipedia-based features among the top 10 most effective ones for the full
dataset. It also shows the advantage of using the additional DSM features. In
particular, for person-type entities, the Wikipedia-based DSM feature shows remarkable
importance. Moreover, Wikipedia turned out to be a useful data source for obtaining
background information about location-type entities. The Wikipedia document
collection created by keeping only hyperlinks is more effective for building the
DSM model than taking all the textual content. This is consistent with the
results reported by Aggarwal and Buitelaar [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] that
the hyperlink-based DSM outperforms the text-based DSM model for entity
relatedness and ranking. As we performed experiments by categorizing the features
based on their types, we also evaluate models built over subsets of features from
the same category. Table 4 shows the scores obtained from five different models
based on the feature categories: co-occurrence features, linear combinations of
co-occurrence features, graph-based features, popularity-based features, and
type-based features. It shows that the co-occurrence-based features are very
effective. Although the relation-type feature turned out to be the most important
individual feature (see Table 3), the type-based features are not very effective
without the other features. The co-occurrence-based features are built using
5 data sources: query terms, query sessions, Flickr tags, tweets, and Wikipedia.
Therefore, we report the scores generated by co-occurrence-based features over
the different data sources in Table 5. It shows that Wikipedia is the most effective
resource for all types of entities, except that for location-type entities Flickr tags
perform better than Wikipedia; this shows the usefulness of the Flickr data for
capturing specific and non-popular place names. Table 5 shows that Wikipedia-based
features are the most effective ones for building the co-occurrence-based
model. Consequently, we further investigate the importance of the Wikipedia-based
features. Table 6 shows that the probability obtained from textual content is the
most significant feature, while the DSM vectors over textual content (WT) and
hyperlinks (WL) also show good relevance for the model. In all the experiments,
the DSM over hyperlinks shows more importance than the DSM built over
textual content. A possible reason is that the DSM vector over textual content may
not capture the appropriate semantics of an ambiguous entity, whereas the
hyperlink-based DSM vector can differentiate between ambiguous surface forms.
For instance, Aggarwal and Buitelaar [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] showed
that the text-based DSM vector of the entity "NeXT"1 may not obtain the
relevant dimensions, while the hyperlink-based DSM vector obtained all the relevant
Wikipedia articles.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this paper, we presented an extensive evaluation of the entity recommendation
system "Spark". Spark uses more than 100 features and produces its final scores
by combining these features using a learning-to-rank algorithm. These features
are built over varying data sources: query terms, query sessions, Flickr tags,
and tweets. Therefore, we investigated the performance of these features
individually and by combining them based on their data source. Most of the data
sources used in Spark, such as users' query logs, are not publicly available.
Wikipedia, however, is a continuously growing encyclopedia that is publicly available.
We showed that a model built only over Wikipedia achieved accuracy comparable
to Spark. Moreover, since Spark does not utilize Wikipedia to build its features,
we also analyzed the effect of using Wikipedia as an additional resource, and
showed that Wikipedia-based features complement the overall performance of Spark.
1 http://en.wikipedia.org/wiki/NeXT</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgement</title>
      <p>This work is supported by a research grant from Science Foundation Ireland
(SFI) under Grant Number SFI/12/RC/2289 (INSIGHT) and by Yahoo! Labs.</p>
      <p>13. F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge.
In Proceedings of the 16th International Conference on World Wide Web, pages
697-706. ACM, 2007.
14. X. Yu, H. Ma, B.-J. P. Hsu, and J. Han. On building entity recommender
systems using user click log and Freebase knowledge. In Proceedings of the 7th ACM
International Conference on Web Search and Data Mining, pages 263-272. ACM,
2014.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>N.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Asooja</surname>
          </string-name>
          , G. Bordea, and
          <string-name>
            <given-names>P.</given-names>
            <surname>Buitelaar</surname>
          </string-name>
          .
          <article-title>Non-orthogonal explicit semantic analysis</article-title>
          .
          <source>Lexical and Computational Semantics (*SEM</source>
          <year>2015</year>
          ), pages
          <fpage>92</fpage>
          -
          <fpage>100</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>N.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Asooja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Buitelaar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Vulcu</surname>
          </string-name>
          .
          <article-title>Is Brad Pitt related to Backstreet Boys? Exploring related entities</article-title>
          .
          <source>In Semantic Web Challenge ISWC</source>
          (
          <year>2014</year>
          ),
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>N.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Asooja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ziad</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Buitelaar</surname>
          </string-name>
          .
          <article-title>Who are the American vegans related to Brad Pitt? Exploring related entities</article-title>
          .
          <source>In Proceedings of the 24th International Conference on World Wide Web Companion</source>
          , pages
          <fpage>151</fpage>
          –
          <lpage>154</lpage>
          . International World Wide Web Conferences Steering Committee,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>N.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Buitelaar</surname>
          </string-name>
          .
          <article-title>Wikipedia-based distributional semantics for entity relatedness</article-title>
          .
          <source>In 2014 AAAI Fall Symposium Series</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kobilarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ives</surname>
          </string-name>
          .
          <article-title>DBpedia: A nucleus for a web of open data</article-title>
          .
          <source>In The semantic web</source>
          , pages
          <fpage>722</fpage>
          –
          <lpage>735</lpage>
          . Springer,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>R.</given-names>
            <surname>Blanco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. B.</given-names>
            <surname>Cambazoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mika</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Torzec</surname>
          </string-name>
          .
          <article-title>Entity recommendations in web search</article-title>
          .
          <source>In International Semantic Web Conference (ISWC)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>K.</given-names>
            <surname>Bollacker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Paritosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sturge</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Taylor</surname>
          </string-name>
          .
          <article-title>Freebase: a collaboratively created graph database for structuring human knowledge</article-title>
          .
          <source>In Proceedings of the 2008 ACM SIGMOD international conference on Management of data</source>
          , pages
          <fpage>1247</fpage>
          –
          <lpage>1250</lpage>
          . ACM,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <article-title>Greedy function approximation: a gradient boosting machine</article-title>
          .
          <source>Annals of Statistics</source>
          , pages
          <fpage>1189</fpage>
          –
          <lpage>1232</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>E.</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Markovitch</surname>
          </string-name>
          .
          <article-title>Computing semantic relatedness using wikipedia-based explicit semantic analysis</article-title>
          .
          <source>In Proceedings of the 20th international joint conference on Artificial Intelligence</source>
          ,
          <source>IJCAI'07</source>
          , pages
          <fpage>1606</fpage>
          –
          <lpage>1611</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Harris</surname>
          </string-name>
          .
          <article-title>Distributional structure</article-title>
          .
          <source>Word</source>
          <volume>10</volume>
          (
          <issue>2–3</issue>
          ), pages
          <fpage>146</fpage>
          –
          <lpage>162</lpage>
          ,
          <year>1954</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>K.</given-names>
            <surname>Järvelin</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kekäläinen</surname>
          </string-name>
          .
          <article-title>IR evaluation methods for retrieving highly relevant documents</article-title>
          .
          <source>In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>41</fpage>
          –
          <lpage>48</lpage>
          . ACM,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>J.</given-names>
            <surname>Pound</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mika</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          .
          <article-title>Ad-hoc object retrieval in the web of data</article-title>
          .
          <source>In Proceedings of the 19th international conference on World wide web</source>
          , pages
          <fpage>771</fpage>
          –
          <lpage>780</lpage>
          . ACM,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>