=Paper= {{Paper |id=Vol-1294/paper5 |storemode=property |title=An Approach to Diversify Entity Search Results |pdfUrl=https://ceur-ws.org/Vol-1294/paper5.pdf |volume=Vol-1294 |dblpUrl=https://dblp.org/rec/conf/icaase/SaidiAB14 }} ==An Approach to Diversify Entity Search Results== https://ceur-ws.org/Vol-1294/paper5.pdf
ICAASE'2014                                                                   An Approach to Diversify Entity Search Results




         An Approach to Diversify Entity Search
                       Results


                  Imène Saidi                       Sihem Amer-Yahia                  Safia Nait Bahloul
      University of Oran, LITIO Laboratory             CNRS, LIG           University of Oran, LITIO Laboratory
        BP 1524, El-M’Naouer, 31000                 Grenoble, France             BP 1524, El-M’Naouer, 31000
                  Oran, Algeria                                                        Oran, Algeria
          saidi.imene@univ-oran.dz                Sihem.Amer-Yahia@imag.fr     Nait-bahloul.safia@univ-oran.dz


Abstract – Having named entities (person, country, company...) in response to users’ queries is becoming
more and more important in search engines. Indeed, in some cases users are not searching for a ranked list
of documents, but for specific information (i.e. entities). In this work, we assume that users are interested in
finding entities (e.g., name of a politician) and related entities (e.g., country of the politician ’x’, name of the
wife of the politician ’x’...) and the documents related to each entity. The user can then search entities by
keywords or entities. Our goal is to return to the user diverse and relevant entities and documents. We then
use the different types of an entity (Washington: city, person) and different categories of documents (Sport,
Politics...) to diversify the results. In this paper, we develop a search semantics based on the types and
categories of ranked results of entities. This new approach provides a variety of interpretations of relevant
results. We conduct user studies to show the effectiveness of our approach and the quality of the results.

Keywords – Entity search, Diversification of search results, Indexing corpora, Information retrieval.

1. INTRODUCTION

Information retrieval as it is widely used today is not always suited to every need. In many
domains, such as medicine and news, data is organized around topics, which take the form of
categories (health in the medical field, politics in news, etc.) or entities (name of a hospital,
name of a politician, etc.). What users seek in this case is not a ranked list of documents but
the information they contain (categories, entities) [1]. For example, in the medical context,
users may be interested in discovering documents that contain entities related to a query, e.g.,
entities related to a specific disease. We call such entities contextual entities (i.e. entities
appearing in the context of the disease, such as the symptoms of a disease). In news, users may
want entities related to the revolutionary wave that swept different countries (names of heads
of state, countries, dates, places, etc.).
In this paper, we consider the problem of finding documents and entities related to user queries.
Entities may or may not be known to users, so users should be able to express their queries in
various ways: Search With Entities (SWE) when the user knows entities and wants more related
information (contextual entities and documents), and KeyWord Search (KWS) when entities are
unknown. Our work is made possible by the emergence of automatic annotation systems such as
Open Calais [2], Alchemy API [3] and Zemanta [4]. These systems annotate semi-structured or
unstructured documents and attach rich semantic metadata to them by categorizing them and
extracting the named entities they contain. Our proposal is to build a system for searching
entities that uses the annotator, supports different types of queries formed by entities or
keywords, and returns related entities with the documents of each entity (exploration by
entity). Entities may have several types (Washington: city, person) and appear in documents
belonging to different categories (Politics, Business ...). This presents two major challenges.
The first challenge is to associate the keywords or entities that constitute the query with
different types of entities. The second is to use the identified types of entities and the
categories of documents that contain them to



International Conference on Advanced Aspects of Software Engineering
ICAASE, November, 2-4, 2014, Constantine, Algeria.




present the results in the best way to users. These challenges reflect an issue that has been
addressed in several works: the diversity of query results. Indeed, the different interpretations
of the same entity (e.g., city, person, organization ...) and the categories of documents (e.g.,
Politics, Medicine ...) can be used to diversify the documents to be returned.
In the next section, we present related work on diversity. Section 3 describes the context of our
work. In Section 4 we present our approach, and in Section 5 some experiments. A conclusion and
future work are given at the end of this paper.

2.   RELATED WORK
In our work, we define a new diversification problem, which differs from the interpretation that
has been given to it previously [5, 6, 7, 8]. Until now, diversity of query results was formulated
as the problem of finding a set of documents relevant to the query that differ as much as possible
from each other in their content. It has been proved that the problem of diversity of query
results is NP-complete, since it amounts to finding a diverse subset of size N in a larger set.
Thus, the problem is to define a threshold of relevance and calculate the sum of the content
distances between the pairs of documents in the returned set (the distance may correspond, for
example, to the cosine distance between the tf*idf vectors of documents). Several greedy
algorithms for diversity [9, 10] have been developed. The most common one first finds the N most
relevant documents. A second step then tests whether replacing one of the N documents with a new
document, which is less relevant but whose relevance still exceeds a threshold, increases the
diversity of the set of N documents. This step is repeated until the relevance of the documents
still to be considered no longer satisfies the threshold.
Diversity can be based on the meaning of a query (intent of the user), on the content of the
returned results, or on both.
- Diversification based on meaning ([9, 11]) considers the various possible interpretations of
  the user query (ambiguity) using probabilities over all disambiguations of the query ("the
  coverage of the query"). The goal is to return the most relevant results (close to the sense
  intended by the user).
- Diversification based on content ([6, 7, 8]) aims to reduce redundancy of information in the
  results. This is accomplished by avoiding documents that offer less information to users.
In our work, diversity is based on the types and categories used to annotate documents.

2.1. Contribution of our work

Previous works rely on the coverage of interpretations or on the content of documents to
diversify results, and propose complex solutions to achieve diversity. The goal of these
solutions is to return a ranked and diversified list of documents. In our work, we use
annotations to form diverse groups of documents using types or categories. We then use indexing
and create various indexes to prepare data that allows us to address the complexity of the
problem.
In this work, diversity is applied in two situations. In the first situation, contextual
entities (related to our defined types of queries) are found using an index of entities. This is
a form of diversity of query interpretations: the user does not know the entities or does not
make the query precise (e.g., the query is George Washington: is it George Washington the
president, George Washington University, George Washington Hospital ...?).
In the second situation, the documents related to each entity are diversified and selected
according to two conditions: either their relevance to the entity is above a threshold, or the
type of the entity or the category of the document is unique (compared to the other documents
related to the same entity). In this case, an index of the categories and an index of the
entities that the documents contain are used. We argue that it is our definition of diversity
that simplifies query processing.

3.   CONTEXT AND PROBLEM DEFINITION
In this section, we give illustrative examples for the different types of queries considered in
our work. We end this section with our problem definition.

3.1 Illustration examples
We suppose that a user wants information about politics in America. The user can submit different
queries (Figure 1).

                                    Figure 1: Illustration example
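The query kinds considered in this work (SWE, its single-entity special case, and KWS) can be
told apart mechanically. The following Python sketch is illustrative only and is not the
authors' prototype. It assumes a hypothetical set of known entity names (for instance the keys
of an entity index), a query that is already split into terms, and case-insensitive exact
matching. It labels a query SWE when every term is a known entity and KWS otherwise, following
the rule stated in Section 4.1.1 that a mixture of entities and keywords is handled as a keyword
query.

```python
from enum import Enum


class QueryKind(Enum):
    SWE = "search with entities"
    SWE_SINGLE = "search with one entity"
    KWS = "keyword search"


def classify_query(terms, known_entities):
    """Classify a tokenized query against a (hypothetical) entity vocabulary."""
    if not terms:
        raise ValueError("a query needs at least one term")
    vocabulary = {name.lower() for name in known_entities}
    # Any term that is not a known entity makes the whole query a keyword query.
    if not all(term.lower() in vocabulary for term in terms):
        return QueryKind.KWS
    return QueryKind.SWE_SINGLE if len(terms) == 1 else QueryKind.SWE


# Hypothetical vocabulary; the names are borrowed from examples used in the paper.
KNOWN_ENTITIES = {"Sarkozy", "Merkel", "Washington"}

print(classify_query(["Sarkozy", "Merkel"], KNOWN_ENTITIES).name)    # SWE
print(classify_query(["Washington"], KNOWN_ENTITIES).name)           # SWE_SINGLE
print(classify_query(["president", "USA"], KNOWN_ENTITIES).name)     # KWS
print(classify_query(["Merkel", "election"], KNOWN_ENTITIES).name)   # KWS
```

Keeping the single-entity case as its own label makes it easy to branch later into the
composed-entity lookup that applies only to that case.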






    Scenario 1. For the first query Q1, the user wants information about two politicians (he is
looking for a link between the two personalities or for entities that are related to both): this
case represents Search With Entities (SWE). The results are the related and diverse entities
(contextual entities) that appear in the context of the query. The user can explore the documents
of each entity.
    Scenario 2. The second scenario is a special case of SWE: searching with one entity. The user
wants information on a particular entity, in this case "Washington". The user may want the city
of Washington DC or may be searching for the president George Washington, so we diversify the
types of entities (diversification by type). Users may also be interested in composed entities
(e.g., George Washington University, George Washington Hospital ...). We assume that it is more
interesting to consider all these entities and return them to the user to increase the
diversification of interpretations. We also assume that it is interesting to consider the
different categories of documents related to an entity when it has only one type (e.g., if the
query is George Washington University).
    Scenario 3. For the third query Q3, the user wants information on a keyword query (KWS),
e.g., president of USA. The results are the related entities and their documents. The results of
each entity are treated as in the case of search with one entity (diversification by type if the
entity has several types, or diversification by category if the entity has one type).

3.2 Problem definition
We consider a query $Q = \{t_1, \ldots, t_n\}$ with $t_i \in E \cup K$, where E is a set of
entities and K a set of keywords. The goal is to return entities and relevant documents organized
by entity. We first explain how entities are found, then we detail the score of documents, which
incorporates relevance and diversity.

3.2.1 Entity Search
Given the above query Q, the purpose is to find a set of entities $R \subseteq E$ related to the
query Q and, for each entity $e \in R$, classify the associated documents. To cover the different
queries we consider, we define R as:

$$R = \begin{cases} entities\_of(TopK(Q)) & \text{(case KWS)} \\ related\_entities(Q) & \text{(case SWE)} \end{cases}$$

where entities_of(TopK(Q)) are the entities that appear in the Top K documents that answer the
query Q (case KWS) and related_entities(Q) are the entities that appear in the context of the
query Q, i.e. entities that exist in the best documents that match the query Q when it consists
of several entities (case SWE). Indexes help find the documents related to entities. They are
described in Section 4.2.2.
R is a set of entities that answer the query, extended by the contextual entities that appear in
the context of the query, i.e. the entities of the best documents that are related to the
entities of the query, or the entities of the top k documents that answer the query. In the
specific case where the query is formed by one single entity, R is the set of entities composed
of the entity of the query (entities that start with, end with or contain the entity of the
query).

3.2.2 Finding Related Documents
For each entity $e \in R$, we identify a set $S = \{d_1, \ldots, d_m\}$ of documents to be
returned.
A document d is returned with the entity e if a function we named Diverse_type_cat is true or if
a function we named Relevance is true.
Diverse_type_cat is true in several cases:

- If the entity e has several types (diversification by type):
Diverse_type_cat is true if a type of the entity e in the document d does not exist in
groupType, where $groupType \subseteq e.types$ and e.types are the types of an entity e.
groupType is updated with the types found for the entity e in the documents of the results.
Formally:

$$e \in d.entities \;\wedge\; type(e, d) \notin groupType$$

Here type(e, d) denotes the type of e found in d, and d.entities are the entities of a document d.
Example: the entity e is "Washington", which has several types: 'person' and 'city'. If the type
'city' does not exist in groupType, the document d that contains the entity with type 'city' is
selected in the result. This increases the diversity of types, so Diverse_type_cat is true.
- If the entity has only one type (diversification by category):
Diverse_type_cat is true when the category c of a document d does not exist in a group named
groupCat, where $groupCat \subseteq C$ and C is the set of the categories of the documents of the
corpus. groupCat contains the







categories of results found for the entity e. Formally:

$$c \notin groupCat, \text{ where } c \text{ is the category of } d$$

Example: the entity e is "Barack Obama", which has one type: 'person', so the different
categories of documents are considered. If the category 'business' does not exist in groupCat,
the document d of category 'business' is selected in the result. This increases the diversity of
categories, so Diverse_type_cat is true.
When the document d is taken into S, groupType is updated with the type of the entity e found in
d, or groupCat is updated with the category of the document d.

Relevance is true if:
    1. The document d answers the query Q and contains an entity e. Formally:
$\exists\, t_i \in Q \mid t_i \in \{d.keywords \cup d.entities\} \;\wedge\; \exists\, e \in d.entities \mid e \in R$.
    2. The score of the document d for a query Q, score(d,Q), exceeds a threshold. score(d,Q) is
the sum of the scores score(d,e) of the same entity e in the document d when the entity has
several types (diversification by type), or it is the score of the category of the document,
score(d,c), when the entity has only one type (diversification by category). This definition
ensures a maximum of diversity together with relevance of the selected documents, when the
documents $d \in D$ are pre-ranked according to their relevance. Formally:

   $score(d, Q) > \theta$, where $\theta$ is the minimal threshold of relevance (selected by
   experiments).

$$score(d, Q) = \begin{cases} \sum score(d, e) & \text{if } e \text{ has several types} \\ score(d, c) & \text{if } e \text{ has only one type} \end{cases}$$

A document is taken into S according to Diverse_type_cat in order to cover all the types of an
entity or all the categories of documents, even if the score of the corresponding document does
not exceed the relevance threshold (because the document is unique). A document is taken
according to Relevance in order to keep the most relevant documents (score greater than a
threshold) for the same type of entity e or for the same category c. This means that if a
document d is selected by checking Diverse_type_cat, its type or its category is new in the
collection to be returned to the user (to increase diversity); otherwise the document is
relevant to the type or the category.

4.   DIVERSIFICATION OF ENTITY SEARCH RESULTS
For some data sources such as forums and news articles, we assume that it is more interesting to
interpret user queries through the entities that the sources contain. Entities may have several
types and documents have different categories; we exploit this to diversify results. In this
section, we summarize our approach in a conceptual architecture and we describe the offline
processing.

4.1 System Architecture
Figure 2 shows the conceptual architecture of our system and summarizes our approach.

                                    Figure 2: System architecture

We consider a corpus D of semi-structured documents (forums, newsgroups, etc.). Our approach is
to perform offline processing that prepares ad-hoc indexes for online processing (Figure 2):
- We start by annotating the corpus using an automatic annotation system (Open Calais API) to
  extract entities, types and categories of documents with their scores.
- We create different indexes to store information, i.e. keywords, entities, types, categories
  and scores. The scores are: score(tf*idf) of a term (case KWS query), score(d, e) of an entity
  and score(d,c) of a category. Three indexes are created: an inverted index for keywords (KI:
  Keyword Index), an inverted index for types of entities (EI: Entity Index) and an index for
  entities and categories of







  documents (DI: Document Index), i.e. what the entities of a document are and what its category
  is.

Query processing is done in 3 steps: Entity search, Document search and Document diversification
per entity.

4.1.1     Entity search
In online processing, we use our indexes to interpret queries using the entities.
Two types of queries are considered: Search With Entities (SWE: queries Q1 and Q2) and KeyWord
Search (KWS: query Q3).
SWE queries: the idea is to diversify interpretation by searching entities (Entity Search in
Figure 2) to return a set of entities R = {e1, e2... en} such that:
- R contains the entities that appear in the best documents that contain the entities of the
  query, if the query is of type SWE.
- A special case of SWE is search by one entity: R is then equal to the entity itself extended by
  the composed entities, i.e., entities that start with, end with or contain the entity of the
  query.

KWS queries: if the query is of type KWS, we use the Top K query processing algorithm of Fagin
et al. [12] to obtain a ranked list of documents. From this list, entities are extracted using
index DI and returned as the set R. We suppose that entities that appear in the best documents
that answer the query are relevant or contextual (they appear in the context and so could
interest the user). If the query is a mixture of entities and keywords, it is considered a
keyword query (KWS).
We assume that when the user searches with entities (SWE), he is looking for a relationship or
wants to make a comparison between the entities of the query (e.g., Sarkozy and Merkel, Renault
or Peugeot, infection and tumor...).

4.1.2. Document search
After finding the set R of entities, the documents related to each entity {e1.docs, e2.docs...
en.docs} are found (searching documents in Figure 2). Indexes EI and DI are used.

4.1.3. Document diversification per entity
After finding the documents of each entity of R, they are diversified:
- If the entity has several types: documents are diversified according to the type of the entity.
  "At least one document" of each type must be returned to the user to ensure maximum diversity;
  other documents are selected according to relevance, i.e., their score(d, e) must exceed a
  threshold.
- If the entity has only one type: documents are diversified by the categories of the documents
  related to the entity. "At least one document" of each category must be returned to the user to
  ensure maximum diversity; other documents are selected according to relevance, i.e., their
  score(d, c) must exceed a given threshold.
The condition "at least one document" ensures that the unique documents (unique type of entity
or unique category of document) are considered even if their score is not high (does not reach
the threshold of relevance). This maximizes diversity.
After organizing the results, we return to the user a set of entities {e1, e2... en} and a set of
documents for each entity {e1.S, e2.S... en.S}.

4.2 Offline processing
The corpus of documents D is preprocessed in an offline phase in order to create the necessary
indexes.

4.2.1 Document annotation
With the advent of automatic annotation systems, it is possible to attach semantically rich
metadata to documents. This makes it possible to categorize documents and find the entities they
contain (people, places, organizations, etc.). In this work we used Open Calais API [4] to
annotate a corpus of files (semi-structured or unstructured), to find existing named entities,
categories of documents and scores. Annotation of the corpus is an important step in our
approach; it prepares the information needed for online processing.

4.2.2 Indexing
The necessary indexes are:
- KI (Keyword Index): An inverted index that matches each keyword k with the documents that
  contain it and a score (score of k for a document d). This index is necessary for the
  application of Top K processing and is used for KeyWord Search (KWS queries).
k : {(d, tf.idf(k, d))}, where tf.idf(k, d) = score(d, k) and tf.idf stands for term frequency -
inverse document frequency.
Example: president: {(14, 9), (57, 3), (84, 18)}.

- EI (Entity Index): To build this index, the Calais API is used for annotation (Named






    entities extraction with types and scores).                    documents, partitioned across 20 different
    The inverted index EI stores the types of the                  newsgroups (about 1000 messages per group).
    entities e with documents containing them
                                                                   Collection of Reuters: this corpus is a collection
    and the score (score of the type of the entity
                                                                   of texts downloaded from the website of the
    e in the document d, i.e. score(d,e) which is
                                                                   Institute of Computer and Electronic Gaspard-
    calculated by the automatic annotation
                                                                   Monge3. This collection of texts was extracted
    system). This index is used to retrieve
                                                                   automatically from Reuters-21578 collection4.
    documents related to entities in the case
                                                                   This categorization is given in a file named
    SWE and documents of each entity e of the
                                                                   categorisation.txt where each line corresponds
    set R. e: { (d, type, score(d,e))}.
                                                                   to the categorization of text. This dataset
Example: Barack Obama: {(575, person, 0.332),                      contains about 20,000 files. This corpus was also
(810, person, 0.341), (881, person, 0.331)}.                       chosen for its richness and variety of categories.
     DI (Document Index): This index contains                     5.1 Relevance
      the entities of a document and its category.
      Open Calais is also used for the extraction                  To verify the quality of results produced by our
      of the categories of documents with their                    approach, we conduct another user study on the
      scores. This index maps each document                        relevance of the found entities (the set R of
      with its category and a score (score(d,c)). It               entities) and the returned documents (diversified
      will be used to find the categories of                       documents). For this test, we used the corpus
      documents in the case of diversification by                  Reuters (also chosen for its richness in entities
      categories and entities of a document.                       and categories).
      d: { ({e}), (c, score(d,c)) }.                               5.1.1 Relevance of entities
Example: news.xml: {{Washington, white house
                                                                   We asked users (10 students) to identify among
...}, (politics, 0.91)}.
                                                                   a set of returned entities (R), the number of
5.   EXPERIMENTS                                                   relevant entities and the number of contextual or
                                                                   composed entities (when the query is composed
We conduct a set of experiments to demonstrate                     of one entity). Different queries have been
the usefulness of our proposed approach and                        submitted to cover the different types of queries
the quality of results. First, using a set of                      (5 queries for each type: search with entities,
Reuters’ articles, we demonstrate that the                         search with one entity and KeyWord Search).
generated results are relevant to users. Second,                   Table 1 presents the results of search with
we show that users prefer our proposed                             entities (SWE).
approach (over 60% of cases) to a classical
approach that returns a ranked list of documents                                     Table 1: Queries of SWE
                                                                                      Number of
with snippets.
                                                                        Themes           found      Relevant Contextual
Implementation setup: we have implemented a                                             entities     entities entities
Java prototype reflecting the processing of our                                           “R”
approach. The prototype can handle user                                Ford and
                                                                                            2          83.33%         16.66%
queries. It offers to the user a choice between                        Chevrolet
searching with entities and keyword search.
                                                                       Lincoln and
Different results are returned, i.e. entities and                                           3          66.66%           11%
                                                                          Bush
relevant documents to the found entities.
Documents are diversified to increase both the                          Infection
                                                                                            8            50%            33%
relevance    and     diversity of results.     All                     and tumors
experiments were conducted on an Intel Core i5
workstation with a speed of 2.5 GHz and 4GB                            Volvo USA            26         37.15%         25.61%
Memory running Linux Ubuntu 13.04.
                                                                       Washi ngton
Datasets: we conducted our experiments on                                 and               47         26.23%         27.65%
two   different    datasets: a collection of                            Baghdad
newsgroups (20NewsGroups) and a collection of
Reuters articles (Reuters-21578).                                  In Table 1, we calculate the average of relevant
                                                                   entities and the average of contextual entities
20NewsGroups: the 20 Newsgroups data set is                        using the votes of users, we then calculate the
a collection of approximately 20,000 newsgroup                     percentage of relevant entities and the







percentage of contextual entities; the rest are insignificant entities, i.e. entities that appear in the context of the query but are not related to it. For the first query, "Ford and Chevrolet", the number of found entities is 2, which means that R = {Ford, Chevrolet}. No other entities were found, so most users (83.33%) consider these 2 entities relevant and 16.66% consider them contextual, because they were looking for other results related to the entities of the query. In this case, there are no insignificant entities¹. For all queries, at least 26% of returned entities are relevant and at least 11% are contextual. In some cases, the percentage of non-significant entities exceeds 40% (query "Washington and Baghdad"); this is due to the nature of the query, which is general rather than specific (its meaning differs from one user to another). Relevance in this case is relative. Table 2 presents the results of search with one entity.

¹ Entities that are neither relevant nor contextual are insignificant.

               Table 2: Queries of search with one entity

    Themes       Number of found    Relevant    Composed
                 entities "R"       entities    entities
    Chevrolet     6                 72.16%      16.66%
    Lincoln      10                 36.36%      48.48%
    America      15                 36.4%       42.10%
    Ford         25                 57%         35%
    Cancer       74                 31.03%      56.31%

In Table 2, we calculate the average of relevant entities and the average of composed entities using the votes of users, and then calculate the percentage of relevant entities and the percentage of composed entities. For all queries, at least 31% of returned entities are relevant and more than 16% are composed entities. The percentage of non-significant entities is small (about 15% or less) except for the query "America", which is a general query. Entities related to this query (composed entities) appear in several categories and have different types (e.g., America Online, America West Airlines, South America, etc.). So entities related to "America" are very different from each other; some entities are then non-significant (21.06% of entities, see Table 2) because they do not match the different needs of users.

Table 3 presents the results of KeyWord Search.

                      Table 3: Queries of KWS

    Themes             Number of found    Relevant    Contextual
                       entities "R"       entities    entities
    Ford car           14                 45.21%      28.57%
    Patient disease    35                 39.02%      22%
    Car dealer         36                 31.47%      20%
    Buy Ford           50                 29%         29.32%
    Prime minister     80                 27.08%      35.82%

In Table 3, we calculate the average of relevant entities and the average of contextual entities using the votes of users, and then calculate the percentage of relevant entities and the percentage of contextual entities. For all queries, at least 27% of returned entities are relevant and at least 20% are contextual entities. The quality of results is lower than for the two previous types of queries because the interpretation of keywords is more complex than that of entities. Improving this part will be considered in future work, as will filtering the insignificant entities of the two previous types of queries.

5.2 Usefulness

We evaluate the usefulness of our approach of diversification of entities and documents. We aim to analyze whether users prefer our proposition, which interprets users' queries using the entities of the sources and organizes the documents of each entity, over the simple approach that returns a ranked list of documents using Top K processing. In this simple approach, the score of a document is its TF*IDF score, and the k best documents (those with the highest scores) are returned to users. We want to know whether, for some domains, our approach is more useful to the user than the classical approach, so we used the 20 Newsgroups corpus (chosen for its richness in categories). In this corpus, data is organized into 20 different newsgroups, each corresponding to a different category. We submitted different queries. Table 4 summarizes the results of this test.
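The classical baseline used in this comparison (TF*IDF scoring followed by Top K selection) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation used in the prototype: the text says only that a document's score is its TF*IDF and that the k best documents are returned. The whitespace tokenization, the length-normalized term frequency, the logarithmic IDF, the function names and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization (the paper does not specify one).
    return text.lower().split()


def tfidf_top_k(query, docs, k):
    """Return up to k (doc_id, score) pairs with the highest TF*IDF scores.

    docs maps a document id to its raw text. The score of a document is the
    sum, over the distinct query terms, of tf(term, doc) * idf(term).
    """
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    query_terms = set(tokenize(query))
    scores = {}
    for doc_id, counts in term_counts.items():
        length = sum(counts.values())
        score = 0.0
        for term in query_terms:
            if counts[term] == 0:
                continue  # term absent from this document
            tf = counts[term] / length                # assumed: length-normalized TF
            idf = math.log(n_docs / doc_freq[term])   # assumed: plain logarithmic IDF
            score += tf * idf
        if score > 0:
            scores[doc_id] = score

    # Highest score first; ties are broken by document id for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical toy corpus, for illustration only.
toy_docs = {
    "d1": "ford sells a new car",
    "d2": "the prime minister visits the ford car plant",
    "d3": "patients recover from the disease",
}
top = tfidf_top_k("ford car", toy_docs, k=2)
print([doc_id for doc_id, _ in top])
```

With this toy corpus, the shorter document d1 ranks above d2 because both contain each query term once and the term frequency is normalized by document length. Such a baseline returns a flat ranked list, whereas the proposed approach groups documents under the entities found for the query.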






       Table 4: Users' preference: our approach vs. classical approach

    Themes        Query                    Our         Classical
                                           Approach    Approach
    Computer      Disk drive               70%         30%
    Business      PC speaker               80%         20%
    Recreation    Coach of the year        60%         40%
    Politics      Prime minister           70%         30%
    Science       Infections and tumors    60%         40%
    Religion      Catholic university      50%         50%

We conduct a survey by presenting to users the results of our approach and the results of the classical approach, using queries from different domains (Table 4). We ask users to indicate their preference by choosing the results of the first or the second approach. From the results, it is clear that users prefer our approach, which is more informative, to the classical approach, thus confirming our motivation. Our approach is more interesting to users specifically for domains where documents are focused on a specific category and contain many entities (computer, politics). For literary documents (religion), entities are less useful. We also observe that when the user's knowledge of a subject is limited, our approach brings more novelty by presenting entities that may appear in the context of the search.

6.    CONCLUSION

In this work, we presented an approach for diversifying search results that leverages differences between entities and documents, i.e. types of entities and categories of documents. We exploit the different categories and types of annotations extracted by automatic systems and previously stored in indexes. This allows us to bypass the complexity of the diversity problem, which is known to be NP-complete. The definition of diversity in our work allows us to rely on index processing and to reduce the complexity of queries. To facilitate query processing, pre-processing is done offline on the corpus of documents to index the necessary information.

The idea is to interpret keyword queries (KWS) or queries composed of entities (SWE) by relevant entities (entities that exist in the documents that match the query). We also propose to organize and diversify the documents of each returned entity by the various types of the entity if the entity has several types, or by the categories of documents if the entity has only one type. Users then have a list of relevant entities and the possibility to explore the documents of each entity. Experiments show the effectiveness of our approach and the quality of results. Our approach makes results easier to explore, is clearer, and is in general more helpful than simple relevance ranking of documents.

As a continuation of this work, we will improve and extend the annotation part by using other annotation systems such as Alchemy API ([4]). We also plan to improve our algorithm to filter insignificant entities and to consider the large-scale problem.

7.   REFERENCES

[1] T. Cheng, X. Yan, and K. C.-C. Chang. Supporting entity search: a large-scale prototype search engine. In SIGMOD Conference, pages 1144–1146, 2007.
[2] Open Calais, http://www.opencalais.com/documentation/calais-web-service-api.
[3] Zemanta, http://developer.zemanta.com/.
[4] Alchemy API, http://www.alchemyapi.com/products/features/entity-extraction/.
[5] L. Qin, J. X. Yu, and L. Chang. Diversifying top-k results. PVLDB, 5(11):1124–1135, 2012.
[6] A. Angel and N. Koudas. Efficient diversity-aware search. In SIGMOD Conference, pages 781–792, 2011.
[7] C. Yu, L. V. S. Lakshmanan, and S. Amer-Yahia. It takes variety to make a world: diversification in recommender systems. In EDBT, pages 368–378, 2009.
[8] Y. Zhang, J. Callan, and T. Minka. Novelty and redundancy detection in adaptive filtering. In SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 81–88, 2002.
[9] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong. Diversifying search results. In WSDM, pages 5–14, 2009.
[10] S. Gollapudi and A. Sharma. An axiomatic approach for result diversification. In WWW, pages 381–390, 2009.
[11] K. Liu, E. Terzi, and T. Grandison. Highlighting diverse concepts in documents. In SDM, pages 545–556, 2009.
[12] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci., 66(4):614–656, 2003.


International Conference on Advanced Aspects of Software Engineering
ICAASE, November, 2-4, 2014, Constantine, Algeria.                                                                       51