<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Graph Analytical Re-ranking for Entity Search</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Takahiro Komamizu Nagoya University Nagoya</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Entity search is a fundamental task in Linked Data (LD). The task is, given a keyword search query, to retrieve a set of entities in LD which are relevant to the query. The state-ofthe-art approaches for entity search are based on information retrieval technologies such as TF-IDF vectorization and ranking models. This paper examines the approaches by applying a traditional evaluation metrics, recall@k, and shows ranking qualities still room left for improvements. In order to improve the ranking qualities, this paper explores possibilities of graph analytical methods. LD is regarded as a large graph, graph analytical approaches are therefore appropriate for this purpose. Since query-based graph analytical approaches fit to entity search tasks, this paper proposes a personalized PageRankbased re-ranking method, PPRSD (Personalized PageRank based Score Distribution), for retrieved results by the state-of-the-art. The experimental evaluation recognizes improvements but its results are not satisfactory, yet. For further improvements, this paper reports investigations about relationship between queries and entities in terms of path lengths on the graph, and discusses future directions for graph analytical approaches.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>1http://tiny.cc/dbpedia-entity
vant answers are absent in top-k results. The
analysis results shown the Total column in Table 1,
recall@10, recall@100, and recall@1000 are maximally
0.2872, 0.6912, and 0.8708, respectively.</p>
      <p>The investigation results indicate that there are still
room left for improving rankings. The low recall for
top-10 results and the high recall for top-1000 results
imply that large amount of relevant results are within
1000 results but most of them are below top-10.
Therefore, this paper attempts to improve the ranking by
reranking, which arrange the ranking by applying
different ranking criteria. It is reasonable to take graph
topological features into account due to the nature
of LD. Therefore, this paper applies graph analytical
methods for re-ranking. The result for the re-ranking
method is expected to be an answer to the question.</p>
      <p>In consequence, this paper arranges the
aforementioned question to the following question. Do graph
analysis-based re-ranking methods improve the ranking
quality? This paper attempts to take graph
analytical methods into account and proposes a re-ranking
method PPRSD (Personalized PageRank based Score
Distribution) which distributes calculated relevance
scores by the state-of-the-art in a personalized
PageRank manner. Test of PPRSD gives the following
answer, the graph analysis-based re-ranking method can
improve the ranking quality but the improvement is
not very significant.</p>
      <p>In order to find future directions based on graph
analytical methods for improving entity search, this
paper performs investigations and provides insights.
This paper poses a question for results of the
preliminary evaluation by recall@k, that is, why recall@1000
is not perfect yet? To answer the question, this
paper investigates relationship between query terms and
relevant entities for the query, and the investigation
reveals that some terms only exist on distant literals
from relevant entities. Additionally, this paper obtains
a clue for selection of predicates connecting to literals
w.r.t. different distances from the entities. Based on
these investigations, this paper puts discussion on
future directions based on graph analytical approaches
for entity search.</p>
      <p>The following sections discuss the detail for
getting the answers to the questions. Section 2
introduces briefly the state-of-the-art shown in [HNX+17]
and showcases the preliminary evaluation in terms of
recall@k metrics, and Section 3 explains the idea and
detail of PPRSD, and Section 4 evaluates the
state-ofthe-art and PPRSD using the test collection and shows
the answers to the aforementioned questions. Section 5
displays additional investigations and insights for the
future directions, and Section 6 concludes this paper.</p>
    </sec>
    <sec id="sec-2">
      <title>State of Current Entity Search</title>
      <p>This work explores the future directions of entity
search, to this end, this paper investigates the current
state of entity search, especially this paper sticks to a
leading benchmark, DBpedia-Entity v2 [HNX+17].</p>
      <p>As shown in the benchmark, there are various
approaches which are mainly based on
information retrieval and natural language processing
techniques. The list of approaches include fundamental
approaches: BM25 [RZ09], BM25-CA [RZ09], LM
(Language Modeling) [PC98], SDM (Sequential
Dependency Model) [MC05], PRMS (Probabilistic Model for
Semistructured Data) [KXC09], and MLM-all
(Mixture of Language Models) [OC03]; fielded extension
approaches: MLM-CA [OC03], FSDM (Fielded
Sequential Dependence Model) [ZKN15], and
BM25FCA [RZ09]; extended approaches by entity linking
technique [HBB16] for query: LM-ELR [HBB16],
SDM-ELR [HBB16], and FSDM-ELR [HBB16].</p>
      <p>These works are based on a fielded document
construction method in [Has18]. As an overall structure,
each entity has 1000 fields together with three
additional fields. The 1000 fields are corresponding with
top 1000 frequent predicates in DBpedia, and the
additional fields are heuristically constructed such that
one is “name” field which is constitution of predicates
rdfs:label and foaf:name; another is “types” field
which contains rdf:type predicate and predicates
ending in “subject”, and the other is “contents” field which
holds the contents of all fields of connected entities
except those connected by owl:sameAs to remove same
entities in different languages. Aforementioned
approaches use parts of the fielded documents as follows:
BM25, LM and SDM use the contents field; and
MLMall, PRMS and FSDM use top-10 fields. The field
extension approaches are differentiated by settings of
field weights (e.g., MLM-all uses equal weights for all
fields, while PRMS learns weights for fields).</p>
      <p>To investigate the qualities of these approaches, this
paper tests more intuitive metrics recall@k in
addition to NDCG which is shown in [HNX+17]. The
NDCG results are copied to Table 4 (rows of not
*ed method names correspond to the original results
shown in [HNX+17]). The NDCG result shows
comparative ranking qualities among these approaches.
While, NDCG is not a clear indicator for distances
from goals. Therefore, this paper investigates more
clear indicator, recall@k (Eqn. 1) which reveals ratio
of relevant results in top-k.
(1)</p>
      <p>Table 1 displays recall@k (k 2 f10; 100; 1000g) and
it indicates that more than 80% of relevant results are
the number of relevant items in top-k
the total number of relevant items
@10
.1344
.1397</p>
      <p>Total
@100
.4198
.4355
included in top-1000 but only 20% to 45% of them
are included in top-10, which indicates there are room
left for improving rankings. The recalls are calculated
on the top-1000 results presented in the benchmark
data2. The boldface and underlined cells in the
table show maximum recall scores for tasks and k. All
methods have low recall@10 as well as recall@100, but
still high recall@1000, meaning that ranking
performance should be improved. The gap row in the table
emphasizes that top-10 results have large room left for
improvements.
3</p>
    </sec>
    <sec id="sec-3">
      <title>PageRank-based Re-ranking</title>
      <p>This work attempts to improve the ranking qualities
by graph analytical re-ranking methods. LD is
modeled as a labeled graph, it is therefore reasonable to
apply graph analytical approaches to evaluate values
of entities. In particular, this paper explores
feasibility of PageRank [PBMW99], which is popular graph
analytical methods to originally evaluate Web pages
and has been applied for many other domains.</p>
      <p>This paper models LD data as data graph (Def. 1)
Definition 1 (Data Graph) Given LD data, data
graph G = (V; E) is a graph, where set V = R [ L [ B
of vertices are union of set R of entities, set L of
literals and set B of blank nodes, and set E V P V
of edges between vertices with predicates P as labels.</p>
      <p>2https://github.com/iai-group/DBpedia-Entity/tree/
master/runs/v2</p>
      <p>The subsequent sections introduce naïve baseline
approaches and the proposed re-ranking method,
PPRSD. Section 3.1 introduces re-ranking methods
via PageRank [PBMW99] and personalized
PageRank [Hav02], and introduces a preliminary evaluation
of these methods. Then, Section 3.2 explains PPRSD
which utilizes both results of the state-of-the-art and
advantages of personalized PageRank.
3.1</p>
      <sec id="sec-3-1">
        <title>Naïve Graph Analytical Re-ranking</title>
        <p>As discussed above, graph analytical approaches are
reasonable for re-ranking criteria, however, with a
little consideration, global evaluation methods like
PageRank do not make sense for ranking entities with
respect to input keyword queries. Roughly speaking,
PageRank evaluates vertices having lots of incoming
links as important. Therefore, when PageRank is
applied to the data graph G, PageRank gives an order
of vertices which is independent from input queries.
Examinations for the global rankings show bad results
(this paper does not include this because it is obvious).</p>
        <p>In order to test PageRank and personalized
PageRank in a re-ranking manner, this work utilizes an
insight from the recall@k results in Table 1. The insight
is that the top-1000 results by existing methods
include more than 80% of relevant results. Thus, the
idea of re-ranking with PageRank and personalized
PageRank is to filter top-1000 result entities by the
existing methods and to apply the graph analytical
approaches. To do so, an induced subgraph (Definition 2)
for the top-1000 result entities are extracted.</p>
        <p>On the induced subgraph G0 extracted from
top1000 results, PageRank and personalized PageRank
values are calculated as Eqn. 2 and Eqn. 3. In Eqn. 2,
pr is a PageRank vector with 1000 length, A is a
1000 1000 adjacency matrix of G0, e is 1000-length
vector which elements are all 1, and d is a damping
factor which is the probability of random jumps.
Similarly, in Eqn 3, pprq is a 1000-length PageRank vector
for query q, A is an adjacency matrix as PageRank, s is
1000-length personalized vector for q, which elements
corresponding with matching entities for q are 1 and
other elements are 0, and d is a damping factor.
pr = (1</p>
        <p>
          d) prA + d e
pprq = (1
d) pprqA + d s
(2)
(
          <xref ref-type="bibr" rid="ref14">3</xref>
          )
        </p>
        <p>A preliminary experiment over these naïve
reranking methods shows worse results than the
stateof-the-art. The preliminary experiment tests the
feasibility of aforementioned methods (PageRank and
personalized PageRank-based re-ranking methods) on the
DBpedia-Entity v2 benchmark [HNX+17]. The
reranking approaches are applied for all the
state-ofthe-art methods listed in Table 1. Table 2 displays
maximum recall@k values among the applied methods
of PageRank and personalized PageRank, separately.
Amongst PageRank and personalized PageRank,
personalized PageRank has achieved better performance
than PageRank, therefore, taking relevance to queries
into account results better ranking qualities.
Comparing recall@k of the state-of-the-art shown in Table 1,
the re-ranking methods are mostly worse then them.
Consequently, re-ranking methods should more rely on
the state-of-the-art.
3.2</p>
      </sec>
      <sec id="sec-3-2">
        <title>Re-ranking by Score Distribution</title>
        <p>The preliminary evaluation on the naïve re-ranking
methods reveal two facts: one is personalized
PageRank-based re-ranking is superior to
PageRankbased re-ranking, and the other is the state-of-the-art
are still more powerful than simple graph analytical
approaches. Therefore, the facts suggest that
personalized PageRank-based method with utilizing results
of the state-of-the-art can be a better choice. The rest
of this section introduces how to realize it.</p>
        <p>The main idea of the proposed approach is that
utilizing relevance scores for re-ranking algorithm via
personalized PageRank. The state-of-the-art rank
entities by their own relevance scores, the scores
indicate relative relevance degrees among the resulting
entities. That is, there are more or less gaps on relevance
scores than those on ranks. Additionally, the
relevance scores are more sophisticated than just
counting matching entities as naïve personalized
PageRankbased re-ranking approach (s in Eqn. 3).</p>
        <p>To realize this idea, this work arranges the
personalized PageRank formulation shown in Eqn. 3 to include
the relevance scores calculated by the state-of-the-art
as Eqn. 4 called PPRSD (stands for Personalized
PageRank based Score Distribution).</p>
        <p>pprsdq = (1
d) pprsdqA + d t
(4)
where pprsdq is a 1000-length relevance score vector
of PPRSD. The personalized vector s is redefined as t,
where each element ti of entity vi 2 V 0 stores a
relevance score of q to vi calculated by one of the
state-ofthe-art method. Log likelihood-based relevance scores
(i.e., LM, MLM, SDM, FSDM, PRMS, and their
variations) are negative values in nature, therefore, these
scores are converted to positive numbers by applying
exponential function. In addition, the converted scores
are quite small (e.g., 10 34) because the values in the
log function are products of probabilities, therefore,
the converted scores are multiplied by positive
number so as to make the scores comparable with those
of the other methods. As PageRank-based methods
compute the relevance score vectors (i.e., pr in Eqn. 2
and ppr in Eqn. 3), PPRSD also computes the
relevance score vector, pprsd, by the power method.
Result entities ranked by PPRSD are of ordering in the
relevance scores.
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Evaluation</title>
      <p>The experiment in this paper attempts to confirm
the re-ranking method, PPRSD, improves the
ranking qualities in terms of both recall@k and NDCG@k.
Since PPRSD relies on the results of the
state-ofthe-art, this experiment uses the standard benchmark
dataset [HNX+17]3 of entity search on DBpedia.
Relevance scores for entities in the state-of-the-arts are
obtained from the website of the benchmark4. This
experimental evaluation attempts to answer the
following question: Does the re-ranking method improve the
state-of-the-are? And, how large or small the
improvements are? In order to answer the question, PPRSD
and the state-of-the-art are compared by recall@k
(Eqn. 1) and NDCG@k (Eqn. 5) which is a ratio of
DCG@k (Eqn. 6) over the ideal value of DCG@k
referred as to IDCG@k.
3http://tiny.cc/dbpedia-entity
4https://github.com/iai-group/DBpedia-Entity/tree/
master/runs/v2
0.30
0.25
010.20
lla@0.15
c
re0.10
0.05</p>
      <p>The subsequent sections discuss the comparison of
the ranking qualities between the original methods and
the re-ranked methods by PPRSD. More specifically,
Section 4.1 introduces an empirical study for
determining damping factor d in Eqn. 4, and, based on the
choice of the damping factor, Section 4.2 discusses the
comparison between the PPRSD-based methods over
the original methods.
4.1</p>
      <sec id="sec-4-1">
        <title>Effect of Damping Factor</title>
        <p>Figure 1 showcases effects of damping factor in
various state-of-the-art which PPRSD is applied, and it
reveals that smaller damping factor (i.e., around 0.1)
achieves the best performances. In the figure,
horizontal axis expresses damping factor which ranges 0 to 1.0
in Figure 1(a) and 0 to 0.2 in Figure 1(b), vertical axis
represents recall@10, and lines in the figure represent
the state-of-the-art. FSDM-ELR and BM25F-CA
perform the best among the state-of-the-art in the figure
around damping factor 0.1. In consequence, keeping
10% of relevance scores is the best choice for PPRSD,
so d = 0:1 is used for the later experiments.
4.2</p>
      </sec>
      <sec id="sec-4-2">
        <title>Overall Evaluation</title>
        <p>Table 3 and Table 4 display the comparisons of values
of recall@k for the former and NDCG@k for the latter
among PPRSD and the state-of-the-art. The Model
row of the table represents tasks of entity search and
values of k, and the left-most column shows lists of
the state-of-the-art and re-ranked versions (which are
represented by *) of them by PPRSD. In addition, each
group of rows corresponding with a method includes
imp. row which emphasizes the improvement ratio by
PPRSD. Cells contain recall@k values, and the best
value in a row is emphasized by boldface and underline.</p>
        <p>Table 3 shows PPRSD successfully improves
ranking qualities of the state-of-the-art. The Total
column represents the ranking qualities among all tasks.
The column indicates that 7 over 12 methods have
been improved by PPRSD in recall@10 and 8
methods have also been improved by PPRSD but 2
methods have been degraded the ranking qualities in
recall@100. This indicates that PPRSD successfully
improves the ranking qualities. Note that PPRSD
improves not only elementary approaches (e.g., BM25)
but also more sophisticated approaches (e.g.,
BM25FCA and FSDM-ELR).</p>
        <p>Table 4 also shows PPRSD successfully improves
ranking qualities of the state-of-the-art. Recall@k
and NDCG@k obviously have correlation, therefore,
the improvements should be confirmed as the success
shown in recall@k. As expected, the best results are
all of PPRSD-based methods. It is worthy to note that
the improvement ratios shown in imp. are larger than
those of recall@k, indicating that PPRSD improves
the rankings not only by just more relevant entities in
the rankings but also better positions of relevant
entities in the rankings. Since NDCG@k is good at
relative comparison between rankings, the results confirm
the improvements of rankings. For example,
BM25CA* is the best ranking method in terms of recall@10
for QALD-2 task, however, BM25F-CA* is superior to
BM25-CA* and the best in terms of NDCG@10 for the
same task. This means that BM25F-CA* have less
relevant entities in top-10 rankings but more relevant
entities are in the earlier positions in the top-10 rankings.
Consequently, evaluation based on NDCG@k confirms
the ranking improvements by PPRSD.
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Investigation for Improvements</title>
      <p>This section explores possibilities of further
improvements for the state-of-the-art and PPRSD-based
reranking methods in the following aspect: Why
recall@1000 is not 100% yet? Since PPRSD is based
on the state-of-the-art, the upper bound of ranking
qualities is limited by them and improving the
stateof-the-art is also important for further improvement of
PPRSD. Therefore, this work investigates the reason
why top-1000 results have not been perfect yet. To
answer the question, this paper investigates path lengths
from relevant entities to entities which literals contain
an input keyword query term (detail in Section 5.1).
The investigation reveals that there are still space left
for including literals within larger distances (i.e., 3, 4,
and 5 hops). Obviously, taking longer paths (or
sequences of predicates) into account entails explosion
of the number literals included into documents of
entities. As a result, each entity gets noisy documents, and
it is easy to imagine that the noisy documents degrade
ranking qualities. To obtain hints for preferable paths
for literals, this paper investigates the commonalities
of tail predicates in the paths (detail in Section 5.2).
The investigation reveals that tail predicates should be
different for different lengths of the paths.
5.1</p>
      <sec id="sec-5-1">
        <title>Distance from Query Term</title>
        <p>The state-of-the-art rely on terms occurring within two
hops at most, modeled as fielded documents.
Section 2 introduces the fields of entities taken into
account for the state-of-the-art, and the contents field
includes contents of one-hop away entities. This
implies that no method considers terms occurring within
longer hops away.</p>
        <p>The fielded document construction limits the
possibilities to reach to the relevant results due to the
absence of query terms in the documents. This fact is
estimated from the preliminary evaluation on recall@k
in Table 1, that is, recall@1000 values are less than
86% (except SemSearch ES task which is designed for
direct matching with terms). In other words, 14% are
below top-1000 results.</p>
        <p>In order to answer question why recall@1000 is not
perfect?, this paper attempts to realize the relation
between relevant answers and the numbers of hops
from query terms. To this end, this work
investigates the minimum distances from relevant entities to
query terms by performing SPARQL queries in terms
of the distances. SPARQL queries are generated with
a graph pattern of a sequential path from given entity
r 2 R to literal ` 2 L which contains query term t,
and predicates and resources between r and ` are
fulfilled by free variables. Figure 3 illustrates a n-length
graph pattern for entity r and query term t. Based on
the pattern, ASK query (which is an indicator
function query in SPARQL) is generated to examine such
pattern exists. Following SPARQL query displays
examples of generated ASK queries for distance 2.</p>
        <p>ASK{ hri ?p0 ?v0. ?v0 ?p1 ?v1.</p>
        <p>?v1 ?p2 ?l. ?l bif:contains ’t’.</p>
        <p>FILTER isLiteral(?l).}</p>
        <p>
          This investigation measures the minimum distance
which satisfies the ASK query corresponding with the
distance. The procedure of this investigation is that:
(1) given query q, relevant entity list Aq for q is
obtained from the benchmark dataset; (2) parse q into
set Tq of terms; (
          <xref ref-type="bibr" rid="ref14">3</xref>
          ) examine ASK queries from length 0
to maximum length (5 for this investigation) for each
pair of relevant entity r 2 Aq and term t 2 Tq; (4)
as soon as the ASK query is satisfied, the distance is
recorded; and (5) the obtained distances for each
relevant entities are analyzed. Obtained distances for a
relevant entity of a query may be different term by
term. Therefore, this investigation analyses minimum
distance, average distance and maximum distance for
each relevant entity of a query. Consequently, these
distances are individually gathered and calculate their
averages to observe how long distances required to
touch query terms from relevant entities.
        </p>
        <p>Figure 2 showcases the analyzed distances with
respect to tasks as well as with regardless of tasks (i.e.,
Total ). In the figure, bars represent ratios of relevant
entities having the number of hops (distances) to reach
from query terms, and dashed lines express cumulative
ratios of relevance entities. Three kinds of bars
(lightgray and oblique stripe bars, gray and horizontal stripe
bars, and black and crossing stripe bars) correspond
1.0
lts0.8
u
s
re0.6
f
ioo0.4
t
aR0.2
0.0 0 The1numb2er of3hops 4
(a) Total
with minimum, average (rounded), and maximum
distances. Similarly, three kinds of lines (lines with
triangles, circles, and squares) correspond with minimum,
average (rounded), and maximum.</p>
        <p>Figure 2 indicates that at least one term is included
in literals directly connected with relevant entities, and
Figure 2(a) indicates that most of the relevant
entities are reachable from query terms within two hops
on average, however, in terms of maximum distances,
still more than 10% relevant entities are not
reachable within three hops. This fact answers the question
why recall@1000 is not perfect? as some relevant
entities are still not found by the query terms due to
the smaller distances to construct entity documents.
This phenomenon is also marked on individual tasks
except SemSeach ES task, which is a simple tasks so
that queries in the task are more directly explaining
requiring entities than others.
5.2</p>
      </sec>
      <sec id="sec-5-2">
        <title>Commonality of Tail Predicates of Paths</title>
        <p>A simple solution for improving ranking qualities in
terms of the previous investigation is top include
literals within more hops (i.e., 3 or more), however, it
is obvious that the solution incurs noisy entity
documents by including unnecessary literals within larger
hops. The number of reachable entities in G increases
very quickly as the distance increases. Therefore,
irrelevant entities contribute to entity documents.</p>
        <p>An intuition to avoid this situation is to select
“good” paths from an entity which include meaningful
literals for the entity. A naïve extension is to find paths
from an entity to “good” entities and to include their
documents (suppose the same approach to the
state-ofthe-art) into the document of the entity. This paper
wants to clarify there is any difference between
selfdescriptive literals and supportive literals for other
entities. Self-descriptive literals explain well about target
entities, while supportive literals explain supplemental
facts about the target entities. Self-descriptive literals
tend to be close to the targets, while supportive
literals tend to relatively distant from the targets.
Therefore, this investigation attempts to understand the
differences of predicates with ending literals (called tail
predicates) between shorter and longer paths. The
investigation is done in the following procedure: gathers
surveyed paths for each relevant entities using
intermediate results of the previous investigation (Section 5.1),
and analyzes the paths in terms of commonalities of
the tail predicates. The commonalities are measured
for different lengths (i.e., 1 to 4) of the tail predicate
sequences by Jaccard index, Jaccard (Yir; Yjr) = jjYiir[\YYjjrrjj ,
Y r
where Yir is a set of i-length tail predicates of entity r
and j j is cardinality.</p>
        <p>Table 5 display commonalities of tail predicates
among different lengths of paths for different tail
lengths. The results reveal that commonalities of
tail predicates decrease as differences of path lengths
increase. This fact indicates that literals reachable
in different path lengths should select different tail
predicates (e.g., rdfs:label is not always a good
choice.). Due to the tremendous number of tail
predicates, the detailed analysis on what kind tail
predicates are preferable in particular path lengths is left
for future work. Examples from rough analysis include
rdfs:label and rdfs:comment for 1-length paths and
dbo:wikiPageWikiLinkText for 5-length paths.
of-the-art by formulating as a re-ranking problem. For
further improvements, this paper reports two
investigations about relationship between paths to literals
containing query terms and relevant entities to queries.
Results of the investigations support the improvement
possibilities of graph analytical approaches, and
developing them is still left for future works. Consequently,
this paper answers to the first question as Yes, they
are appropriate, but there is still an issue on matching
entities in a graph.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Acknowledgments.</title>
        <p>This work was partly supported by JSPS KAKENHI
Grant Number JP18K18056.</p>
        <p>This section investigates relationship between paths
and result entities in order to answer the question Why
recall@1000 is not 100% yet? The answers of this
investigation are 2-fold: (1) literals in distant paths are
absent from documents; and (2) setting of tail predicates
is universal for all lengths of paths. Although, the
first problem is obvious, still the number of reachable
entities in more than two hops is extraordinary large,
therefore, constructing documents from longer paths is
computationally expensive. Additionally, in the naïve
approach, the generated documents may include large
amount of not quite relevant facts to entities.</p>
      </sec>
      <sec id="sec-5-4">
        <title>Future Directions.</title>
        <p>To overcome the aforementioned problems, solutions
lies on graph analytical approaches as Section 3
showcases their possibilities. A basic idea is to select
appropriate reachable predicates and entities within
two or more hops. To this end, graph
analytical approaches (e.g., PageRank, Random Walks) can
be good choices. As discussed in this paper,
nonpersonalized PageRank and its families are not
appropriate, meaning that global centralities do not help.
Therefore, customizable graph analytical approaches
such as ObjectRank [BHP04] and random walk with
restart (RWR) [TFP08] are preferable. There are
some preliminary works based on this idea, namely
FORK [KOAK17], and RWRDoc [Kom18]. FORK has
applies ObjectRank for entity search and it achieves
better precision@k. While, RWRDoc has applies RWR
for determining importances of reachable entities in
terms of RWR scores and it slight improves in terms
of NDCG@k. These indicates that graph analytical
approaches still leave space for improvements.
6</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>This paper deals with entity search over Linked Data,
analyzes the state-of-the-art in terms of recall@k, and
reveals the possibilities of improvements of the
stateof-the-art. Also, this paper indicates the feasibility of
graph analytical approaches for improving the
state[BHB09]
[BHP04]
[Has18]
[Hav02]
[HBB16]</p>
      <sec id="sec-6-1">
        <title>Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked Data - The Story</title>
        <p>
          So Far. Int. J. Semantic Web Inf. Syst.,
5(
          <xref ref-type="bibr" rid="ref14">3</xref>
          ):1–22, 2009.
        </p>
      </sec>
      <sec id="sec-6-2">
        <title>Andrey Balmin, Vagelis Hristidis, and</title>
        <p>Yannis Papakonstantinou. ObjectRank:
Authority-Based Keyword Search in
Databases. In VLDB 2004, pages
564–575, 2004.</p>
        <p>Faegheh Hasibi. Semantic Search with
Knowledge Bases. PhD thesis,
Norwegian University of Science and
Technology, Trondheim, Norway, 2018.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Taher H. Haveliwala. Topic-sensitive PageRank. In WWW 2002, pages 517– 526, 2002.</title>
      </sec>
      <sec id="sec-6-4">
        <title>Faegheh Hasibi, Krisztian Balog, and</title>
        <p>Svein Erik Bratsberg. Exploiting Entity
Linking in Queries for Entity Retrieval.</p>
        <p>In ICTIR 2016, pages 209–218, 2016.
[HNX+17] Faegheh Hasibi, Fedor Nikolaev, Chenyan
Xiong, Krisztian Balog, Svein Erik
Bratsberg, Alexander Kotov, and Jamie Callan.</p>
        <p>DBpedia-Entity v2: A Test Collection for</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Entity</given-names>
            <surname>Search</surname>
          </string-name>
          .
          <source>In SIGIR 2017</source>
          , pages
          <fpage>1265</fpage>
          -
          <lpage>1268</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [KOAK17]
          <string-name>
            <given-names>Takahiro</given-names>
            <surname>Komamizu</surname>
          </string-name>
          , Sayami Okumura, Toshiyuki Amagasa, and
          <string-name>
            <given-names>Hiroyuki</given-names>
            <surname>Kitagawa</surname>
          </string-name>
          . FORK:
          <article-title>Feedback-Aware ObjectRank-Based Keyword Search over Linked Data</article-title>
          .
          <source>In AIRS 2017</source>
          , pages
          <fpage>58</fpage>
          -
          <lpage>70</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [PBMW99]
          <string-name>
            <given-names>Lawrence</given-names>
            <surname>Page</surname>
          </string-name>
          , Sergey Brin, Rajeev Motwani, and
          <string-name>
            <given-names>Terry</given-names>
            <surname>Winograd</surname>
          </string-name>
          .
          <article-title>The PageRank Citation Ranking: Bringing Order to the Web</article-title>
          .
          <source>Technical Report 1999-66</source>
          ,
          <year>November 1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Napolitano.</surname>
          </string-name>
          <article-title>7th Open Challenge on Question Answering over Linked Data (QALD7)</article-title>
          .
          <source>In ESWC 2017</source>
          , pages
          <fpage>59</fpage>
          -
          <lpage>69</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [WKC+12]
          <string-name>
            <surname>Qiuyue</surname>
            <given-names>Wang</given-names>
          </string-name>
          , Jaap Kamps, Georgina Ramírez Camps, Maarten Marx, Anne Schuth, Martin Theobald, Sairam Gurajada, and
          <string-name>
            <given-names>Arunav</given-names>
            <surname>Mishra</surname>
          </string-name>
          .
          <article-title>Overview of the INEX 2012 Linked Data Track</article-title>
          .
          <source>In CLEF 2012 Evaluation Labs and Workshop</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [ZKN15]
          <string-name>
            <given-names>Nikita</given-names>
            <surname>Zhiltsov</surname>
          </string-name>
          , Alexander Kotov, and
          <string-name>
            <given-names>Fedor</given-names>
            <surname>Nikolaev</surname>
          </string-name>
          .
          <article-title>Fielded Sequential Dependence Model for Ad-Hoc Entity Retrieval in the Web of Data</article-title>
          .
          <source>In SIGIR 2015</source>
          , pages
          <fpage>253</fpage>
          -
          <lpage>262</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>[Kom18] [KXC09] [MC05] [OC03] [PC98] [PMZ10] [RZ09] [TFP08] [UNH</source>
          +17]
          <string-name>
            <surname>Ricardo</surname>
            <given-names>Usbeck</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Axel-Cyrille Ngonga</surname>
            <given-names>Ngomo</given-names>
          </string-name>
          , Bastian Haarmann, Anastasia Krithara,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Röder</surname>
          </string-name>
          , and
          <article-title>Giulio Takahiro Komamizu. Learning Interpretable Entity Representation in Linked Data</article-title>
          .
          <source>In DEXA</source>
          <year>2018</year>
          ,
          <year>2018</year>
          . (to appear).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Jinyoung</given-names>
            <surname>Kim</surname>
          </string-name>
          , Xiaobing Xue, and
          <string-name>
            <given-names>W. Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>A Probabilistic Retrieval Model for Semistructured Data</article-title>
          .
          <source>In ECIR 2009</source>
          , pages
          <fpage>228</fpage>
          -
          <lpage>239</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Donald</given-names>
            <surname>Metzler</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>A Markov Random Field Model for Term Dependencies</article-title>
          .
          <source>In SIGIR 2005</source>
          , pages
          <fpage>472</fpage>
          -
          <lpage>479</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Paul</given-names>
            <surname>Ogilvie</surname>
          </string-name>
          and
          <string-name>
            <given-names>James P.</given-names>
            <surname>Callan</surname>
          </string-name>
          .
          <article-title>Combining Document Representations for Known-item Search</article-title>
          .
          <source>In SIGIR 2003</source>
          , pages
          <fpage>143</fpage>
          -
          <lpage>150</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Jay M. Ponte</surname>
            and
            <given-names>W. Bruce</given-names>
          </string-name>
          <string-name>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>A Language Modeling Approach to Information Retrieval</article-title>
          .
          <source>In SIGIR 1998</source>
          , pages
          <fpage>275</fpage>
          -
          <lpage>281</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Pound</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Mika</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hugo</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          .
          <article-title>Ad-hoc Object Retrieval in the Web of Data</article-title>
          .
          <source>In WWW 2010</source>
          , pages
          <fpage>771</fpage>
          -
          <lpage>780</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Stephen E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hugo</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          .
          <article-title>The Probabilistic Relevance Framework: BM25 and Beyond</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>FTIR</surname>
          </string-name>
          ,
          <volume>3</volume>
          (
          <issue>4</issue>
          ):
          <fpage>333</fpage>
          -
          <lpage>389</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Inf. Syst.</surname>
          </string-name>
          ,
          <volume>14</volume>
          (
          <issue>3</issue>
          ):
          <fpage>327</fpage>
          -
          <lpage>346</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>