=Paper=
{{Paper
|id=None
|storemode=property
|title=Using Wikipedia’s Category Structure for Entity Search
|pdfUrl=https://ceur-ws.org/Vol-986/paper_7.pdf
|volume=Vol-986
|dblpUrl=https://dblp.org/rec/conf/dir/KapteinK13
}}
==Using Wikipedia’s Category Structure for Entity Search==
Rianne Kaptein¹∗  Jaap Kamps²
¹ TNO, Delft, The Netherlands
² University of Amsterdam, Amsterdam, The Netherlands
ABSTRACT

In this paper we investigate how the category structure of Wikipedia can be exploited for Entity Ranking. In the last decade, the Web has not only grown in size, but also changed its character, due to collaborative content creation and an increasing amount of structure. Current search engines find Web pages rather than information or knowledge, and leave it to the searchers to locate the sought information within the Web page. A considerable fraction of Web searches contains named entities. We focus on how the Wikipedia structure can help rank relevant entities directly in response to a search request, rather than retrieve an unorganized list of Web pages with relevant but also potentially redundant information about these entities. Our results demonstrate the benefits of using topical and link structure over the use of shallow statistics. This paper is a compressed version of [1].

1. INTRODUCTION

Searchers looking for entities are better served by a ranked list of entities presented directly, rather than an unorganized list of Web pages with relevant but also potentially redundant information about these entities. The goal of the entity ranking task is to return entities instead of the documents or text returned for most common search tasks. Entities can be, for example, persons, organizations, books, or movies.

A resource that is large enough to generate meaningful statistics, and that contains interpretable semantic structure, is Wikipedia. The nature and structure of Wikipedia present new opportunities to solve problems that were thought to require deep understanding capabilities, and where bottlenecks such as high cost and scalability applied in the past. Combining the benefits of the structured information and the large scale of Wikipedia, which creates the opportunity to use probabilistic methods, we can now efficiently process all of the information contained in Wikipedia.

This paper is motivated by the following main research question: How can we exploit the structure of Wikipedia to retrieve entities? We start by looking at how we can retrieve entities inside Wikipedia, which is also the task in the INEX entity ranking track. INEX¹ (Initiative for the Evaluation of XML retrieval) is an information retrieval evaluation forum that provides an IR test collection to evaluate the task of entity ranking using Wikipedia as its document collection. Our first research question is: How can we exploit category and link information for entity ranking in Wikipedia?

Since a requirement for a relevant result in entity ranking is to retrieve the correct entity type, category information is of great importance for entity ranking. Category information can also be regarded in a more general fashion, as extra context for the query, which could be exploited for ad hoc retrieval. Our second research question is therefore: How can we use entity ranking techniques that use category information for ad hoc retrieval?

Since ad hoc queries usually do not have target categories assigned to them, and providing target categories for entity ranking is an extra burden for users, we also examine ways to assign target categories to queries. Our third research question is: How can we automatically assign target categories to ad hoc and entity ranking queries?

∗ Work done while at the University of Amsterdam.
¹ https://inex.mmci.uni-saarland.de/

Copyright remains with the authors and/or original copyright holders. DIR 2013, April 26, 2013, Delft, The Netherlands.

2. RETRIEVAL MODEL

In this section we describe our retrieval model: how we use category information for entity ranking, how we combine these sources of information, and how we assign categories to query topics automatically.

Exploiting Category Information Although for each entity ranking topic one or a few target categories are provided, relevant entities are not necessarily associated with these provided target categories. Relevant entities can also be associated with descendants of the target category or other similar categories. Therefore, simply filtering on the target categories is not sufficient. Moreover, since an answer entity can be assigned multiple categories, not all categories of an answer entity will be similar to the target category. We therefore calculate for each target category the distances to the categories assigned to the answer entity. To calculate the distance between two categories, we tried three options. The first option (binary distance) is a very simple method: the distance is 0 if two categories are the same, and 1 otherwise. The second option (contents distance) calculates distances according to the contents of each category, and the third option (title distance) calculates a distance according to the category titles. We use KL-divergence to calculate distances between categories, and calculate a category score that is high when the distance is small.

Combining Information Finally, we have to combine our different sources of information. Our first source of information is a standard language model for retrieval, which calculates the probabilities of occurrence of the query terms
Table 1: 2007 ER Topics using Category Information

  Category representation   Weight        MAP       P10
  Baseline                  -             0.1840    0.1920
  Binary                    0.1           0.2145    0.1880
  Contents                  0.1           0.2481◦•  0.2320◦
  Title                     0.1           0.2509◦   0.2360◦
  Contents + Title          0.05 + 0.05   0.2618◦•  0.2480◦•

Table 2: Ad Hoc vs. Entity Ranking results in MAP

  Query Set  Category (M/A)  µ = 0.0   µ = 1.0   µ = 0.1    Best µ  Best Score
  ER07a      M               0.2804    0.2547    0.3848•    0.2     0.4039•
  ER07a      A               0.2804    0.2671    0.3607◦•   0.1     0.3607◦•
  ER07b      M               0.1840    0.1231    0.2481◦•   0.1     0.2481◦•
  ER07b      A               0.1840    0.1779    0.2308◦•   0.2     0.2221◦
  AH07a      M               0.3653    0.2067◦   0.4308◦•   0.1     0.4308◦•
  AH07b      M               0.3031    0.1761•   0.3297◦•   0.05    0.3327•
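The category distances and the resulting category score described in Section 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the smoothing constant, the unigram bag-of-words representation of category titles, and the mapping from distance to score are our assumptions; the paper only specifies that KL-divergence is used and that the score is high when the distance is small.

```python
import math
from collections import Counter

def kl_divergence(p: Counter, q: Counter, vocab) -> float:
    """KL divergence D(P||Q) between two smoothed unigram distributions."""
    eps = 1e-9  # assumed smoothing so that q(w) = 0 does not blow up
    p_total = sum(p.values()) + eps * len(vocab)
    q_total = sum(q.values()) + eps * len(vocab)
    d = 0.0
    for w in vocab:
        pw = (p[w] + eps) / p_total
        qw = (q[w] + eps) / q_total
        d += pw * math.log(pw / qw)
    return d

def binary_distance(cat_a: str, cat_b: str) -> float:
    """Binary distance: 0 if the categories are the same, 1 otherwise."""
    return 0.0 if cat_a == cat_b else 1.0

def title_distance(cat_a: str, cat_b: str) -> float:
    """Title distance: KL divergence between word counts of category titles."""
    ta, tb = Counter(cat_a.lower().split()), Counter(cat_b.lower().split())
    return kl_divergence(ta, tb, set(ta) | set(tb))

def category_score(target: str, entity_cats: list[str]) -> float:
    """High score when the entity's closest category is near the target.
    The 1/(1+d) mapping is our choice; any monotone decreasing map fits."""
    d = min(title_distance(target, c) for c in entity_cats)
    return 1.0 / (1.0 + d)
```

The contents distance would follow the same shape, with the Counters built from the text of the pages in each category rather than from the titles.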
in a document. This standard language model also serves as our baseline retrieval model. We explore two possibilities to combine information. First, we make a linear combination of the document, link and category scores. All scores and probabilities are calculated in log space, and then a weighted addition is made.

Alternatively, we can use a two-step model. Relevance propagation takes as input the initial probabilities as calculated by the baseline document model. Instead of the baseline probability, we can use the scores of the run that combines the baseline score with the category information.

Target Category Assignment Besides using the target categories provided with the entity ranking query topics, we also look at the possibility of automatically assigning target categories to entity ranking and ad hoc topics. From our baseline run we take the top N results, and look at the T most frequently occurring categories belonging to these documents, while requiring categories to occur at least twice. These categories are assigned as target categories to the query topic.

3. EXPERIMENTS

In this section we describe our experiments with entity ranking and ad hoc retrieval in Wikipedia.

Experimental Set-up We experiment with two different tasks. First of all, we experiment with the entity ranking task as defined by INEX, making runs on the topic sets from 2007 to 2009. Secondly, we experiment with ad hoc retrieval using category information on the ad hoc topic sets from 2007, and compare automatic and manual category assignment for ad hoc and entity ranking topics.

Entity Ranking Results The results on the 2007 entity ranking topic set (ER07b, 19 topics) are summarized in Table 1. The weight of the baseline score is 1.0 minus the weight of the category information. For all three distances, a weight of 0.1 gives the best results. In addition to these combinations, we also made a run that combines the original score, the contents distance and the title distance. When a single distance is used, the title distance gives the best results. The combination of contents and title distance gives the best results overall. For the 2008 and 2009 entity ranking topic sets (not shown here), significant improvements are also achieved when category information is used. Additional improvements to the approach are to rerank the top 2500 documents from the baseline retrieval run, instead of the top 500 that were reranked for the 2007 runs. Normalizing the scores before combining shows improvements for the 2009 topics.

Ad Hoc Retrieval Results A selection of 19 topics in the ad hoc topic set (AH07a) was transformed into additional entity ranking topics (set ER07a). There are 80 more judged ad hoc topics (set AH07b). Results for the 2007 entity ranking and ad hoc topics, expressed in MAP, are summarized in Table 2, where "M" stands for manually assigned categories, and "A" for automatically assigned categories.

Of the four topic sets, the baseline scores of the ad hoc topic sets are higher. There is quite a big difference between the two entity ranking topic sets: the topics derived from the ad hoc topics are easier than the genuine entity ranking topics. The entity ranking topics benefit greatly from using the category information, with significant MAP increases of 44% and 35% for topic sets ER07a and ER07b respectively. When we use the category information for the ad hoc topics with manually assigned categories, the improvements are smaller than those on the entity ranking topics, but still significant. Comparing manual and automatic assignment of target categories, manually assigned target categories perform somewhat better than automatically assigned ones. However, for both topic sets, using the automatically assigned categories leads to significant improvements over the baseline.

4. CONCLUSION

In this paper we have experimented with retrieving entities from Wikipedia by exploiting its category structure. First, we examined whether Wikipedia category and link structure can be used to retrieve entities inside Wikipedia, as is the goal of the INEX Entity Ranking task. Category information proves to be a highly effective source of information, leading to large and significant improvements in retrieval performance on all data sets. Secondly, we studied how we can use category information to retrieve documents for ad hoc retrieval topics in Wikipedia. Considering retrieval performance, we also achieved significantly better results on ad hoc retrieval topics by exploiting the category information. Finally, we examined whether we can automatically assign target categories to ad hoc and entity ranking queries. Guessed categories lead to performance improvements that are not as large as when the categories are assigned manually, but they are still significant. Our main conclusion is that the category structure of Wikipedia can be effectively exploited, not only for entity ranking but also for ad hoc retrieval, and with manually as well as automatically assigned target categories.

Thanks This research was funded by NWO (# 612.066.513).

REFERENCES

[1] R. Kaptein and J. Kamps. Exploiting the category structure of Wikipedia for entity ranking. Artificial Intelligence, 194:111-129, January 2013.
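For concreteness, the score combination and the automatic target category assignment described in Section 2 can be sketched as below. This is our own illustration under stated assumptions: the function names, the defaults n=20 and t=2, and the dictionary-based category lookup are hypothetical; the paper only specifies a weighted addition in log space (baseline weight 1 - µ), the top-N results, the T most frequent categories, and a minimum frequency of two.

```python
from collections import Counter

def combine_scores(doc_score: float, cat_score: float, mu: float = 0.1) -> float:
    """Linear combination in log space; the baseline score gets weight 1 - mu."""
    return (1.0 - mu) * doc_score + mu * cat_score

def assign_target_categories(ranked_docs, doc_categories, n=20, t=2):
    """Assign target categories to a query from its baseline run: count the
    categories of the top-n results and return the t most frequent ones
    that occur at least twice."""
    counts = Counter()
    for doc in ranked_docs[:n]:
        counts.update(doc_categories.get(doc, []))
    frequent = [cat for cat, freq in counts.most_common() if freq >= 2]
    return frequent[:t]
```

In this reading, µ = 0.0 in Table 2 corresponds to the baseline alone and µ = 1.0 to the category score alone, with the best runs using a small µ such as 0.1.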