=Paper= {{Paper |id=None |storemode=property |title=Using Wikipedia’s Category Structure for Entity Search |pdfUrl=https://ceur-ws.org/Vol-986/paper_7.pdf |volume=Vol-986 |dblpUrl=https://dblp.org/rec/conf/dir/KapteinK13 }} ==Using Wikipedia’s Category Structure for Entity Search== https://ceur-ws.org/Vol-986/paper_7.pdf
      Using Wikipedia’s Category Structure for Entity Search

                          Rianne Kaptein1        Jaap Kamps2∗

                  1 TNO, Delft, The Netherlands
                  2 University of Amsterdam, Amsterdam, The Netherlands

ABSTRACT
   In this paper we investigate how the category structure of Wikipedia can be exploited for entity ranking. In the last decade, the Web has not only grown in size, but also changed its character, due to collaborative content creation and an increasing amount of structure. Current search engines find Web pages rather than information or knowledge, and leave it to the searcher to locate the sought information within the Web page. A considerable fraction of Web searches contains named entities. We focus on how the Wikipedia structure can help rank relevant entities directly in response to a search request, rather than retrieve an unorganized list of Web pages with relevant but also potentially redundant information about these entities. Our results demonstrate the benefits of using topical and link structure over the use of shallow statistics. This paper is a compressed version of [1].

1.   INTRODUCTION
   Searchers looking for entities are better served by presenting a ranked list of entities directly, rather than an unorganized list of Web pages with relevant but also potentially redundant information about these entities. The goal of the entity ranking task is to return entities instead of the documents or text that are returned for most common search tasks. Entities can be, for example, persons, organizations, books, or movies.
   A resource that is large enough to generate meaningful statistics, and that contains interpretable semantic structure, is Wikipedia. The nature and structure of Wikipedia present new opportunities to solve problems that were thought to require deep understanding capabilities, and where bottlenecks such as high cost and poor scalability applied in the past. Combining the benefits of the structured information and the large scale of Wikipedia, which creates the opportunity to use probabilistic methods, we can now efficiently process all of the information contained in Wikipedia.
   This paper is motivated by the following main research question: How can we exploit the structure of Wikipedia to retrieve entities? We start by looking at how we can retrieve entities inside Wikipedia, which is also the task in the INEX entity ranking track. INEX1 (Initiative for the Evaluation of XML retrieval) is an information retrieval evaluation forum that provides an IR test collection to evaluate the task of entity ranking using Wikipedia as its document collection. Our first research question is: How can we exploit category and link information for entity ranking in Wikipedia?
   Since a requirement for a relevant result in entity ranking is to retrieve the correct entity type, category information is of great importance for entity ranking. Category information can also be regarded in a more general fashion, as extra context for the query, which could be exploited for ad hoc retrieval. Our second research question is therefore: How can we use entity ranking techniques that use category information for ad hoc retrieval?
   Since ad hoc queries usually do not have target categories assigned to them, and providing target categories for entity ranking is an extra burden for users, we also examine ways to assign target categories to queries. Our third research question is: How can we automatically assign target categories to ad hoc and entity ranking queries?

2.   RETRIEVAL MODEL
   In this section we describe our retrieval model, how we use category information for entity ranking, how we combine these sources of information, and how we assign categories to query topics automatically.

Exploiting Category Information Although for each entity ranking topic one or a few target categories are provided, relevant entities are not necessarily associated with these provided target categories. Relevant entities can also be associated with descendants of the target category or with other similar categories. Therefore, simply filtering on the target categories is not sufficient. Moreover, since an answer entity is usually assigned multiple categories, not all categories of an answer entity will be similar to the target category. We calculate for each target category the distances to the categories assigned to the answer entity. To calculate the distance between two categories, we tried three options. The first option (binary distance) is a very simple method: the distance is 0 if two categories are the same, and 1 otherwise. The second option (contents distance) calculates distances according to the contents of each category, and the third option (title distance) calculates a distance according to the category titles. We use KL-divergence to calculate distances between categories, and calculate a category score that is high when the distance is small.

∗Work done while at the University of Amsterdam.
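The three distance options above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the smoothing method, the text used to represent a category, and the aggregation over an entity's multiple categories (here, the minimum distance) are assumptions, and the function names are ours.

```python
import math
from collections import Counter

def unigram_lm(text, vocab, alpha=0.5):
    # Smoothed unigram language model over a shared vocabulary
    # (additive smoothing; the paper does not specify a smoothing method).
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    # D(P || Q) = sum_w P(w) * log(P(w) / Q(w)); both models share one vocabulary.
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

def binary_distance(cat_a, cat_b):
    # Option 1: distance 0 if the two categories are the same, 1 otherwise.
    return 0.0 if cat_a == cat_b else 1.0

def category_score(target_text, entity_category_texts):
    # Options 2 and 3: KL-divergence between the target category and each
    # category of the answer entity, computed from category contents or titles.
    # The category score is high when the distance is small (negated minimum).
    vocab = set(target_text.lower().split())
    for text in entity_category_texts:
        vocab |= set(text.lower().split())
    target_model = unigram_lm(target_text, vocab)
    distances = [kl_divergence(target_model, unigram_lm(text, vocab))
                 for text in entity_category_texts]
    return -min(distances)
```

With the title distance, `target_text` would simply be the category title; with the contents distance, it would be the text associated with the category.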
1 https://inex.mmci.uni-saarland.de/

Copyright remains with the authors and/or original copyright holders.
DIR 2013, April 26, 2013, Delft, The Netherlands.

Combining information Finally, we have to combine our different sources of information. Our first source of information is a standard language model for retrieval, which calculates the probabilities of occurrence of the query terms
Table 1: 2007 ER Topics using Category Information

  Category representation   Weight   MAP        P10
  Baseline                           0.1840     0.1920
  Binary                    0.1      0.2145 -   0.1880 -
  Contents                  0.1      0.2481◦•   0.2320◦
  Title                     0.1      0.2509◦    0.2360◦
  Contents + Title          0.05 +   0.2618◦•   0.2480◦•
                            0.05

Table 2: Ad Hoc vs. Entity Ranking results in MAP

  Set (M/A)   Query      Category   Combi.     Best µ   Score
              µ = 0.0    µ = 1.0    µ = 0.1
  ER07a M     0.2804     0.2547 -   0.3848•    0.2      0.4039•
  ER07a A     0.2804     0.2671 -   0.3607◦•   0.1      0.3607◦•
  ER07b M     0.1840     0.1231 -   0.2481◦•   0.1      0.2481◦•
  ER07b A     0.1840     0.1779 -   0.2308◦•   0.2      0.2221◦
  AH07a M     0.3653     0.2067◦    0.4308◦•   0.1      0.4308◦•
  AH07b M     0.3031     0.1761•    0.3297◦•   0.05     0.3327•

in a document. This standard language model also serves as our baseline retrieval model. We explore two possibilities to combine information. First, we make a linear combination of the document, link and category scores. All scores and probabilities are calculated in log space, and then a weighted addition is made.
   Alternatively, we can use a two-step model. Relevance propagation takes as input initial probabilities as calculated by the baseline document model score. Instead of the baseline probability, we can use the scores of the run that combines the baseline score with the category information.

Target Category Assignment Besides using the target categories provided with the entity ranking query topics, we also look at the possibility of automatically assigning target categories to entity ranking and ad hoc topics. From our baseline run we take the top N results, and look at the T most frequently occurring categories belonging to these documents, while requiring categories to occur at least twice. These categories are assigned as target categories to the query topic.

3.   EXPERIMENTS
   In this section we describe our experiments with entity ranking and ad hoc retrieval in Wikipedia.

Experimental Set-up We experiment with two different tasks. First, we experiment with the entity ranking task as defined by INEX, making runs on the topic sets from 2007 to 2009. Secondly, we experiment with ad hoc retrieval using category information on the ad hoc topic sets from 2007, and compare automatic and manual category assignment for ad hoc and entity ranking topics.

Entity Ranking Results The results on the 2007 entity ranking topic set (ER07b, 19 topics) are summarized in Table 1. The weight of the baseline score is 1.0 minus the weight of the category information. For all three distances, a weight of 0.1 gives the best results. In addition to these combinations, we also made a run that combines the original score, the contents distance and the title distance. When a single distance is used, the title distance gives the best results; the combination of contents and title distance gives the best results overall. For the 2008 and 2009 entity ranking topic sets (not shown here), significant improvements are also achieved when category information is used. Additional improvements to the approach are to rerank the top 2,500 documents from the baseline retrieval run, instead of the top 500 that were reranked for the 2007 runs. Normalizing the scores before combining shows improvements for the 2009 topics.

Ad Hoc Retrieval Results A selection of 19 topics in the ad hoc topic set (AH07a) was transformed into an additional set of entity ranking topics (set ER07a). There are 80 more judged ad hoc topics (set AH07b). Results for the 2007 entity ranking and ad hoc topics, expressed in MAP, are summarized in Table 2, where “M” stands for manually assigned categories, and “A” for automatically assigned categories.
   Of the four topic sets, the baseline scores of the ad hoc topic sets are higher. There is quite a big difference between the two entity ranking topic sets: the topics derived from the ad hoc topics are easier than the genuine entity ranking topics. The entity ranking topics benefit greatly from using the category information, with significant MAP increases of 44% and 35% for topic sets ER07a and ER07b respectively. When we use the category information for the ad hoc topics with manually assigned categories, improvements are smaller than the improvements on the entity ranking topics, but still significant. Comparing manual and automatic assignment of target categories, manually assigned target categories perform somewhat better than the automatically assigned categories. However, for both topic sets, using the automatically assigned categories leads to significant improvements over the baseline.

4.   CONCLUSION
   In this paper we have experimented with retrieving entities from Wikipedia, exploiting its category structure. First, we examined whether Wikipedia category and link structure can be used to retrieve entities inside Wikipedia, as is the goal of the INEX Entity Ranking task. Category information proves to be a highly effective source of information, leading to large and significant improvements in retrieval performance on all data sets. Secondly, we studied how we can use category information to retrieve documents for ad hoc retrieval topics in Wikipedia. Considering retrieval performance, we also achieved significantly better results on ad hoc retrieval topics by exploiting the category information. Finally, we examined whether we can automatically assign target categories to ad hoc and entity ranking queries. Guessed categories lead to performance improvements that are not as large as when the categories are assigned manually, but they are still significant. Our main conclusion is that the category structure of Wikipedia can be effectively exploited, not only for entity ranking but also for ad hoc retrieval, and with manually assigned as well as automatically assigned target categories.

Thanks This research was funded by NWO (# 612.066.513).

REFERENCES
[1] R. Kaptein and J. Kamps. Exploiting the category structure of Wikipedia for entity ranking. Artificial Intelligence, 194:111-129, January 2013.
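To make the combination and assignment steps of Section 2 concrete, here is a minimal sketch under stated assumptions: the baseline weight is fixed at 1.0 minus the category weight µ, as in Tables 1 and 2, but the handling of the link score and all function and parameter names are ours, not the authors'.

```python
from collections import Counter

def combine_scores(log_doc, log_cat, mu=0.1, log_link=0.0, link_weight=0.0):
    # Linear combination in log space: the baseline document score has weight
    # (1.0 - mu) and the category score has weight mu. The link-score weight
    # is a free parameter not reported in this compressed version.
    return (1.0 - mu) * log_doc + mu * log_cat + link_weight * log_link

def assign_target_categories(baseline_ranking, doc_categories, n=20, t=2):
    # Automatic target category assignment: count the categories of the top-n
    # baseline results and keep the t most frequent ones, requiring each
    # category to occur at least twice (n and t stand for the paper's N and T).
    counts = Counter(cat for doc in baseline_ranking[:n]
                     for cat in doc_categories.get(doc, []))
    frequent = [cat for cat, k in counts.most_common() if k >= 2]
    return frequent[:t]
```

The assigned categories would then feed the category-distance scoring, and `combine_scores` merges that evidence with the baseline language-model score.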