=Paper=
{{Paper
|id=Vol-1294/paper5
|storemode=property
|title=An Approach to Diversify Entity Search Results
|pdfUrl=https://ceur-ws.org/Vol-1294/paper5.pdf
|volume=Vol-1294
|dblpUrl=https://dblp.org/rec/conf/icaase/SaidiAB14
}}
==An Approach to Diversify Entity Search Results==
Imène Saidi, University of Oran, LITIO Laboratory, BP 1524, El-M’Naouer, 31000 Oran, Algeria, saidi.imene@univ-oran.dz

Sihem Amer-Yahia, CNRS, LIG, Grenoble, France, Sihem.Amer-Yahia@imag.fr

Safia Nait Bahloul, University of Oran, LITIO Laboratory, BP 1524, El-M’Naouer, 31000 Oran, Algeria, Nait-bahloul.safia@univ-oran.dz

International Conference on Advanced Aspects of Software Engineering (ICAASE), November 2-4, 2014, Constantine, Algeria.

Abstract – Returning named entities (person, country, company...) in response to user queries is becoming more and more important in search engines. In some cases users are not searching for a ranked list of documents, but for specific information (i.e. entities). In this work, we assume that users are interested in finding entities (e.g., name of a politician), related entities (e.g., country of the politician ’x’, name of the wife of the politician ’x’...) and the documents related to each entity. The user can search for entities by keywords or by entities. Our goal is to return diverse and relevant entities and documents to the user. To do so, we use the different types of an entity (Washington: city, person) and the different categories of documents (Sport, Politics...) to diversify the results. In this paper, we develop a search semantics based on the types and categories of ranked entity results. This new approach provides a variety of interpretations of relevant results. We conduct user studies to show the effectiveness of our approach and the quality of the results.

Keywords – Entity search, Diversification of search results, Indexing corpora, Information retrieval.

===1. INTRODUCTION===

Information retrieval as it is widely used today is not always suited to some needs. In many domains such as medicine and news, data is organized around topics, which take the form of categories (health in the medical field, politics in news, etc.) or entities (name of a hospital, name of a politician, etc.). What users seek in this case is not a ranked list of documents, but the information they contain (categories, entities) [1]. For example, in the medical context, users may be interested in discovering documents that contain entities related to a query, e.g., entities related to a specific disease. We call such entities contextual entities (i.e. entities appearing in the context of the disease, such as its symptoms). In news, users may want entities related to the revolutionary wave that swept different countries (names of heads of state, countries, dates, places, etc.).

In this paper, we consider the problem of finding documents and entities related to user queries. Entities may or may not be known to users, so users should be able to express their queries in various ways: Search With Entities (SWE) when the user knows entities and wants more related information (contextual entities and documents), and KeyWord Search (KWS) when entities are unknown. Our work is made possible by the emergence of automatic annotation systems such as Open Calais [2], Alchemy API [4] and Zemanta [3]. These systems annotate semi-structured or unstructured documents and attach rich semantic metadata to them by categorizing them and extracting the named entities they contain. Our proposal is to build a system for searching entities that uses the annotator, supports different types of queries formed by entities or keywords, and returns related entities with the documents of each entity (exploration by entity). Entities may have several types (Washington: city, person) and appear in documents belonging to different categories (Politics, Business ...). This presents two major challenges. The first is to associate the keywords or entities that constitute the query with different types of entities. The second is to use the identified types of entities and the categories of the documents that contain them to present the results to users in the best way. These challenges reflect an issue that has been studied in different works: diversity of query results. Indeed, the different interpretations of the same entity (e.g., city, person, organization ...) and the categories of documents (e.g., Politics, Medicine ...) can be used to diversify the documents to be returned.

In the next section, we present related work on diversity. Section 3 describes the context of our work. In Section 4 we present our approach, and in Section 5 some experiments. A conclusion and future work are given at the end of the paper.

===2. RELATED WORK===

In our work, we define a new diversification problem that differs from the interpretation given to it previously [5, 6, 7, 8]. Until now, diversity of query results has been formulated as the problem of finding a set of documents relevant to the query that differ as much as possible from each other in their content. It has been proved that the problem of diversity of query results is NP-complete, since it amounts to finding a diverse subset of size N in a larger set. Thus, the problem is to define a relevance threshold and to calculate the sum of the content distances between the pairs of documents in the returned set (the distance may be, for example, the cosine distance between the tf*idf vectors of the documents). Several greedy algorithms for diversity [9, 10] have been developed. The most common one first finds the N most relevant documents. The second step tests whether replacing one of the N documents by a new document, less relevant but with a relevance above a threshold, increases the diversity of the set of N documents. This phase is repeated until the relevance of the documents to be considered no longer satisfies the threshold. Diversity can be based on the meaning of a query (intent of the user), on the content of the returned results, or on both.

- Diversification based on meaning ([9, 11]) considers the various possible interpretations of the user query (ambiguity) using probabilities over all disambiguations of a query ("the coverage of the query"). The goal is to return the most relevant results (close to the sense intended by the user).

- Diversification based on content ([6, 7, 8]) aims to reduce redundancy of information in the results. This is accomplished by avoiding documents that offer less information to users.

In our work, diversity is based on the types and categories used to annotate documents.

====2.1 Contribution of our work====

Previous works rely on the coverage of query interpretations or on the content of documents to diversify results, and propose complex solutions to achieve diversity. The goal of these solutions is to return a ranked and diversified list of documents. In our work, we use annotations to form diverse groups of documents using types or categories. We then create various indexes to prepare data that allows us to address the complexity of the problem.

In this work, diversity is applied in two situations. In the first situation, contextual entities (related to our defined types of queries) are found using an index of entities. This is a kind of diversity of query interpretations: the user does not know the entities or does not make the query precise (e.g., the query is George Washington: is it George Washington the president, George Washington University, George Washington Hospital ...?). In the second situation, the documents related to each entity are diversified and selected according to two conditions: either their relevance to the entity is above a threshold, or the type of the entity or the category of the document is unique (compared to the other documents related to the same entity). In this case, an index of the categories and an index of the entities that the documents contain are used. It is our definition of diversity that simplifies query processing.

===3. CONTEXT AND PROBLEM DEFINITION===

In this section, we give illustrative examples for the different types of queries we consider. We end the section with our problem definition.

====3.1 Illustration examples====

Suppose that a user wants information about politics in America. The user can submit different queries (Figure 1).

Figure 1: Illustration example

Scenario 1. For the first query Q1, the user wants information about two politicians (he is looking for a link between the two personalities or for entities that are related to both): this case represents Search With Entities (SWE). The results are the related and diverse entities (contextual entities) that appear in the context of the query. The user can explore the documents of each entity.

Scenario 2. The second scenario is a special case of SWE; it consists of searching with one entity. The user wants information on a particular entity, in this case "Washington". The user may want the city of Washington DC or may be searching for the president George Washington, so we diversify the types of entities (diversification by type). Users may also be interested in composed entities (e.g., George Washington University, George Washington Hospital ...). We assume that it is more interesting to consider all these entities and return them to the user to increase the diversification of interpretations. We also assume that it is interesting to consider the different categories of documents related to an entity when it has only one type (e.g., if the query is George Washington University).

Scenario 3. For the third query Q3, the user submits a keyword query (KWS), e.g., president of USA. The results are the related entities and their documents. The results of each entity are treated as in the case of search with one entity (diversification by type if the entity has several types, or diversification by category if the entity has one type).

====3.2 Problem definition====

We consider a query <math>Q = \{t_1, \ldots, t_n\}</math> with <math>t_i \in E \cup K</math>, where E is a set of entities and K a set of keywords. The goal is to return entities and relevant documents organized by entity. We first explain how entities are found, then we detail the score of documents, which incorporates relevance and diversity.

=====3.2.1 Entity Search=====

Given the above query Q, the purpose is to find a set of entities <math>R \subseteq E</math> related to the query Q and, for each entity <math>e \in R</math>, to classify the associated documents. To cover the different queries we consider, we define R as:

<math>R = \begin{cases} \mathit{entities\_of}(\mathit{TopK}(Q)) & \text{if } Q \text{ is a KWS query} \\ \mathit{related\_entities}(Q) & \text{if } Q \text{ is an SWE query} \end{cases}</math>

where entities_of(TopK(Q)) are the entities that appear in the Top K documents that answer the query Q (case KWS), and related_entities(Q) are the entities that appear in the context of the query Q, i.e. entities that exist in the best documents that match the query Q when it consists of several entities (case SWE). Indexes help find the documents related to entities; they are described in Section 4.2.2.

R is thus the set of entities that answer the query, extended by the contextual entities that appear in the context of the query, i.e. entities of the best documents related to the entities of the query, or entities of the top k documents that answer the query. In the specific case where the query is formed by one single entity, R is the set of entities composed of the entity of the query (entities that start with, end with or contain the entity of the query).

=====3.2.2 Finding Related Documents=====

For each entity <math>e \in R</math>, we identify a set <math>S = \{d_1, \ldots, d_m\}</math> of documents to be returned.

A document d is returned with the entity e if a function we named Diverse_type_cat is true or if a function we named Relevance is true.

Diverse_type_cat is true in the following cases:

- If the entity e has several types (diversification by type): Diverse_type_cat is true if a type of the entity e in the document d does not exist in groupType, where <math>groupType \subseteq e.types</math> and e.types are the types of the entity e. groupType is updated with the types found for the entity e in the documents of the results. Formally, Diverse_type_cat is true if <math>e \in d.entities</math> and the type of e in d is not in groupType, where d.entities are the entities of a document d. Example: the entity e is "Washington", which has several types: 'person' and 'city'. If the type 'city' does not exist in groupType, the document d that contains the entity with type 'city' is selected in the result. This increases the diversity of types, so Diverse_type_cat is true.

- If the entity has only one type (diversification by category): Diverse_type_cat is true when the category c of a document d does not exist in a group named groupCat, where <math>groupCat \subseteq C</math> and C is the set of the categories of the documents of the corpus. groupCat contains the categories of the results found for the entity e. Formally, Diverse_type_cat is true if <math>e \in d.entities</math> and <math>c \notin groupCat</math>. Example: the entity e is "Barack Obama", which has one type, 'person', so the different categories of documents are considered. If the category 'business' does not exist in groupCat, the document d of category 'business' is selected in the result. This increases the diversity of categories, so Diverse_type_cat is true.

When the document d is added to S, groupType is updated with the type of the entity e found in d, or groupCat is updated with the category of the document d.

Relevance is true if:

1. The document d answers the query Q and contains an entity e. Formally: <math>\exists t_i \in Q \mid t_i \in \{d.keywords \cup d.entities\} \;\wedge\; \exists e \in d.entities \mid e \in R</math>.

2. The score of the document d for a query Q, score(d,Q), exceeds a threshold. score(d,Q) is the sum of the scores score(d,e) of the same entity e in the document d when the entity has several types (diversification by type), or it is the score of the category of the document, score(d,c), when the entity has only one type (diversification by category). This definition ensures maximum diversity together with relevance of the selected documents, when the documents <math>d \in D</math> are pre-ranked according to their relevance. Formally: <math>score(d, Q) > \mathit{threshold}</math>, where threshold is the minimal threshold of relevance (selected by experiments), and

<math>score(d, Q) = \begin{cases} \sum score(d, e) & \text{if } e \text{ has several types} \\ score(d, c) & \text{if } e \text{ has only one type} \end{cases}</math>

A document is added to S according to Diverse_type_cat in order to cover all the types of an entity or all the categories of documents, even if the score of the document does not exceed the relevance threshold (because the document is unique). A document is added according to Relevance in order to keep the most relevant documents (score greater than the threshold) for the same type of entity e or the same category c. This means that a document d is selected either because its type or category is new in the collection to be returned to the user (to increase diversity), or because the document is relevant to the type or the category.

===4. DIVERSIFICATION OF ENTITY SEARCH RESULTS===

For some data sources such as forums and news articles, we assume that it is more interesting to interpret user queries by the entities that the sources contain. Entities may have several types and documents have different categories; we exploit this to diversify results. In this section, we summarize our approach in a conceptual architecture and describe the offline processing.

====4.1 System Architecture====

Figure 2 shows the conceptual architecture of our system and summarizes our approach.

Figure 2: System architecture

We consider a corpus D of semi-structured documents (forums, newsgroups, etc.). Our approach is to perform offline processing that prepares ad-hoc indexes for online processing (Figure 2):

- We start by annotating the corpus using an automatic annotation system (Open Calais API) to extract entities, types and categories of documents with their scores.

- We create different indexes to store information, i.e. keywords, entities, types, categories and scores. The scores are: score(tf*idf) of a term (case of a KWS query), score(d,e) of an entity and score(d,c) of a category. Three indexes are created: an inverted index for keywords (KI: Keyword Index), an inverted index for types of entities (EI: Entity Index) and an index for the entities and categories of documents (DI: Document Index), i.e. what the entities of a document are and what its category is.

Query processing is done in 3 steps: entity search, document search and document diversification per entity.

=====4.1.1 Entity search=====

In online processing, we use our indexes to interpret queries using the entities.

Two types of queries are considered: Search With Entities (SWE: Query Q1) and KeyWord Search (KWS: Query Q2).

SWE queries: the idea is to diversify the interpretation by searching entities (Entity Search in Figure 2) and to return a set of entities R = {e1, e2... en} such that:

- R contains the entities that appear in the best documents that contain the entities of the query, if the query is of type SWE.

- A special case of SWE is search by one entity: R is then equal to the entity itself extended by the composed entities, i.e., entities that start with, end with or contain the entity of the query.

KWS queries: if the query is of type KWS, we use the Top K query processing algorithm of Fagin et al. [12] to obtain a ranked list of documents. From this list, entities are extracted using the index DI and returned as the set R. We suppose that entities that appear in the best documents that answer the query are relevant or contextual (they appear in the context and so could interest the user). If the query is a mixture of entities and keywords, it is treated as a keyword query (KWS).

We assume that when the user searches with entities (SWE), he is looking for a relationship or wants to make a comparison between the entities of the query (e.g., Sarkozy and Merkel, Renault or Peugeot, infection and tumor...).

=====4.1.2 Document search=====

After finding the set R of entities, the documents related to each entity {e1.docs, e2.docs... en.docs} are found (searching documents in Figure 2). Indexes EI and DI are used.

=====4.1.3 Document diversification per entity=====

After finding the documents of each entity of R, they are diversified:

- If the entity has several types: documents are diversified according to the type of the entity. "At least one document" of each type must be returned to the user to ensure maximum diversity; other documents are selected according to relevance, i.e., their score(d,e) must exceed a threshold.

- If the entity has only one type: documents are diversified by the categories of the documents related to the entity. "At least one document" of each category must be returned to the user to ensure maximum diversity; other documents are selected according to relevance, i.e., their score(d,c) must exceed a given threshold.

The condition "at least one document" ensures that the unique documents (unique type of entity or unique category of document) are considered even if their score is not high (does not reach the relevance threshold). This maximizes diversity.

After organizing the results, we return to the user a set of entities {e1, e2... en} and a set of documents for each entity {e1.S, e2.S... en.S}.

====4.2 Offline processing====

The corpus of documents D is preprocessed in an offline phase in order to create the necessary indexes.

=====4.2.1 Document annotation=====

With the advent of automatic annotation systems, it is possible to attach semantically rich metadata to documents. This makes it possible to categorize documents and to find the entities they contain (people, places, organizations, etc.). In this work we used the Open Calais API [2] to annotate a corpus of files (semi-structured or unstructured) in order to find existing named entities, categories of documents and scores. Annotation of the corpus is an important step in our approach; it prepares the information needed for online processing.

=====4.2.2 Indexing=====

The necessary indexes are:

KI (Keyword Index): an inverted index that maps each keyword k to the documents that contain it, with a score (score of k for a document d). This index is necessary for Top K processing and is used for KeyWord Search (KWS queries).

k: {(d, tf.idf(k,d))}, where tf.idf(k,d) = score(d,k) (tf.idf: term frequency - inverse document frequency).

Example: president: {(14, 9), (57, 3), (84, 18)}.

EI (Entity Index): to build this index, the Calais API is used for annotation (named entity extraction with types and scores). The inverted index EI stores the types of the entities e with the documents containing them and the score (score of the type of the entity e in the document d, i.e. score(d,e), which is calculated by the automatic annotation system). This index is used to retrieve the documents related to entities in the SWE case and the documents of each entity e of the set R.

e: {(d, type, score(d,e))}.

Example: Barack Obama: {(575, person, 0.332), (810, person, 0.341), (881, person, 0.331)}.

DI (Document Index): this index contains the entities of a document and its category. Open Calais is also used to extract the categories of documents with their scores. This index maps each document to its category and a score (score(d,c)). It is used to find the categories of documents in the case of diversification by category, and the entities of a document.

d: { ({e}), (c, score(d,c)) }.

Example: news.xml: {{Washington, white house ...}, (politics, 0.91)}.

===5. EXPERIMENTS===

We conduct a set of experiments to demonstrate the usefulness of our proposed approach and the quality of its results. First, using a set of Reuters articles, we demonstrate that the generated results are relevant to users. Second, we show that users prefer our proposed approach (in over 60% of cases) to a classical approach that returns a ranked list of documents with snippets.

Implementation setup: we have implemented a Java prototype reflecting the processing of our approach. The prototype can handle user queries. It offers the user a choice between searching with entities and keyword search. Different results are returned, i.e. entities and the documents relevant to the found entities. Documents are diversified to increase both the relevance and the diversity of results. All experiments were conducted on an Intel Core i5 workstation running at 2.5 GHz with 4 GB of memory, under Linux Ubuntu 13.04.

Datasets: we conducted our experiments on two different datasets: a collection of newsgroups (20NewsGroups) and a collection of Reuters articles (Reuters-21578).

20NewsGroups: the 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups (about 1000 messages per group).

Collection of Reuters: this corpus is a collection of texts downloaded from the website of the Institute of Computer and Electronic Gaspard-Monge. This collection of texts was extracted automatically from the Reuters-21578 collection. The categorization is given in a file named categorisation.txt, where each line corresponds to the categorization of a text. This dataset contains about 20,000 files. This corpus was also chosen for its richness and variety of categories.

====5.1 Relevance====

To verify the quality of the results produced by our approach, we conduct a user study on the relevance of the found entities (the set R of entities) and of the returned documents (diversified documents). For this test, we used the Reuters corpus (also chosen for its richness in entities and categories).

=====5.1.1 Relevance of entities=====

We asked users (10 students) to identify, among a set of returned entities (R), the number of relevant entities and the number of contextual or composed entities (when the query is composed of one entity). Different queries were submitted to cover the different types of queries (5 queries for each type: search with entities, search with one entity and KeyWord Search).

Table 1 presents the results of search with entities (SWE).

Table 1: Queries of SWE
{| class="wikitable"
! Themes !! Number of found entities "R" !! Relevant entities !! Contextual entities
|-
| Ford and Chevrolet || 2 || 83.33% || 16.66%
|-
| Lincoln and Bush || 3 || 66.66% || 11%
|-
| Infection and tumors || 8 || 50% || 33%
|-
| Volvo USA || 26 || 37.15% || 25.61%
|-
| Washington and Baghdad || 47 || 26.23% || 27.65%
|}

In Table 1, we calculate the average of relevant entities and the average of contextual entities using the votes of users; we then calculate the percentage of relevant entities and the percentage of contextual entities. The rest are insignificant entities, i.e. entities that appear in the context of the query but are not related to it (entities that are neither relevant nor contextual are insignificant). For the first query, "Ford and Chevrolet", the number of found entities is 2, which means that R = {Ford, Chevrolet}. No other entities were found, so most users (83.33%) consider these 2 entities as relevant and 16.66% as contextual, because they were looking for other results related to the entities of the query. In this case, there are no insignificant entities. For all queries, at least 26% of the returned entities are relevant and at least 11% are contextual. In some cases, the percentage of non-significant entities exceeds 40% (query "Washington and Baghdad"); this is due to the nature of the query, which is general and not specific (its meaning differs from one user to another). Relevance in this case is relative.

Table 2 presents the results of search with one entity.

Table 2: Queries of search with one entity
{| class="wikitable"
! Themes !! Number of found entities "R" !! Relevant entities !! Composed entities
|-
| Chevrolet || 6 || 72.16% || 16.66%
|-
| Lincoln || 10 || 36.36% || 48.48%
|-
| America || 15 || 36.4% || 42.10%
|-
| Ford || 25 || 57% || 35%
|-
| Cancer || 74 || 31.03% || 56.31%
|}

In Table 2, we calculate the average of relevant entities and the average of composed entities using the votes of users, and then calculate the percentage of relevant entities and the percentage of composed entities. For all queries, at least 31% of the returned entities are relevant and more than 16% are composed entities. The percentage of non-significant entities is negligible (less than 15%) except for the query "America", which is a general query. Entities related to this query (composed entities) appear in several categories and with different types (e.g., America Online, America West Airlines, South America, etc.). Entities related to "America" are therefore very different from each other, and some of them are non-significant (21.06% of entities, see Table 2) because they do not match the different needs of users.

Table 3 presents the results of KeyWord Search.

Table 3: Queries of KWS
{| class="wikitable"
! Themes !! Number of found entities "R" !! Relevant entities !! Contextual entities
|-
| Ford car || 14 || 45.21% || 28.57%
|-
| Patient disease || 35 || 39.02% || 22%
|-
| Car dealer || 36 || 31.47% || 20%
|-
| Buy Ford || 50 || 29% || 29.32%
|-
| Prime minister || 80 || 27.08% || 35.82%
|}

In Table 3, we calculate the average of relevant entities and the average of contextual entities using the votes of users, and then calculate the percentage of relevant entities and the percentage of contextual entities. For all queries, at least 27% of the returned entities are relevant and at least 20% are contextual entities. The quality of the results is lower than for the two previous types of queries because the interpretation of keywords is more complex than that of entities. Improving this part will be considered in future work, as well as filtering the insignificant entities of the two previous types of queries.

====5.2 Usefulness====

We evaluate the usefulness of our approach to the diversification of entities and documents. We aim to analyze whether users prefer our proposal of interpreting user queries using the entities of the sources and organizing the documents of each entity, over the simple approach that returns a ranked list of documents using Top K processing. In this simple approach, the score of a document is its TF*IDF, and the k best documents (with the highest scores) are returned to users. We want to know whether, for some domains, our approach is more useful to the user than the classical approach, so we used the 20 Newsgroups corpus (chosen for its richness in categories). In this corpus, data is organized into 20 different newsgroups, each corresponding to a different category. We submitted different queries. Table 4 summarizes the results of this test.

Table 4: User's preference: our approach vs. classical approach
{| class="wikitable"
! Themes !! Query !! Our approach !! Classical approach
|-
| Computer || Disk drive || 70% || 30%
|-
| Business || PC speaker || 80% || 20%
|-
| Recreation || Coach of the year || 60% || 40%
|-
| Politics || Prime minister || 70% || 30%
|-
| Science || Infections and tumors || 60% || 40%
|-
| Religion || Catholic university || 50% || 50%
|}

We conducted a survey by presenting to users the results of our approach and the results of the classical approach using queries from different domains (Table 4). We asked users to indicate their preference by choosing the results of the first or the second approach. The results show that users prefer our approach, which is more informative, to the classical approach, thus confirming our motivation. Our approach is more interesting to users specifically for domains where documents are focused on a specific category and contain many entities (computer, politics). For literary documents (religion), entities are less useful. We also observe that when the user's knowledge of a subject is limited, our approach brings more novelty by presenting entities that may appear in the context of the search.

===6. CONCLUSION===

In this work, we presented an approach for diversifying search results that leverages differences between entities and documents, i.e. types of entities and categories of documents. We exploit the different categories and types of annotations extracted by automatic systems and previously stored in indexes. This makes it possible to bypass the complexity of the diversity problem, which is known to be NP-complete. Our definition of diversity allows index-based processing and reduces the complexity of queries. To facilitate query processing, the corpus of documents is preprocessed offline to index the necessary information.

The idea is to interpret keyword queries (KWS) or queries composed of entities (SWE) by relevant entities (entities that exist in the documents that match the query). We also propose to organize and diversify the documents of each returned entity by the various types of the entity if it has several types, or by the categories of documents if it has only one type. Users then have a list of relevant entities and the possibility to explore the documents of each entity. Experiments show the effectiveness of our approach and the quality of its results. Our approach makes results easier to explore, is clearer, and is in general more helpful than simple relevance ranking of documents.

As a continuation of this work, we will improve and extend the annotation part by using other annotation systems such as Alchemy API ([4]). We also plan to improve our algorithm to filter insignificant entities and to consider the large-scale problem.

===7. REFERENCES===

[1] T. Cheng, X. Yan, and K. C.-C. Chang. Supporting entity search: a large-scale prototype search engine. In SIGMOD Conference, pages 1144–1146, 2007.

[2] Open Calais, http://www.opencalais.com/documentation/calais-web-service-api.

[3] Zemanta, http://developer.zemanta.com/.

[4] Alchemy API, http://www.alchemyapi.com/products/features/entity-extraction/.

[5] L. Qin, J. X. Yu, and L. Chang. Diversifying top-k results. PVLDB, 5(11):1124–1135, 2012.

[6] A. Angel and N. Koudas. Efficient diversity-aware search. In SIGMOD Conference, pages 781–792, 2011.

[7] C. Yu, L. V. S. Lakshmanan, and S. Amer-Yahia. It takes variety to make a world: diversification in recommender systems. In EDBT, pages 368–378, 2009.

[8] Y. Zhang, J. Callan, and T. Minka. Novelty and redundancy detection in adaptive filtering. In SIGIR 02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 81–88, 2002.

[9] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong. Diversifying search results. In WSDM, pages 5–14, 2009.

[10] S. Gollapudi and A. Sharma. An axiomatic approach for result diversification. In WWW, pages 381–390, 2009.

[11] K. Liu, E. Terzi, and T. Grandison. Highlighting diverse concepts in documents. In SDM, pages 545–556, 2009.

[12] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci., 66(4):614–656, 2003.
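As an illustration of the index shapes of Section 4.2.2 and of the entity search of Section 4.1.1, the following is a minimal Python sketch, not the authors' Java prototype. The index contents, the function names (build_di, composed_entities, related_entities) and the choice to treat "the best documents" as simply every document that contains all query entities are assumptions made for the example.

```python
from collections import defaultdict

# Toy Entity Index (EI): entity -> list of (doc_id, type, score(d, e)).
# All documents, types and scores below are invented for illustration.
EI = {
    "Washington": [(1, "city", 0.40), (2, "person", 0.35)],
    "George Washington University": [(3, "organization", 0.30)],
    "Barack Obama": [(1, "person", 0.33), (4, "person", 0.34)],
    "White House": [(1, "facility", 0.25), (4, "facility", 0.20)],
}

# Category of each document with score(d, c), as an annotator might return it.
CATEGORIES = {
    1: ("politics", 0.91),
    2: ("history", 0.80),
    3: ("education", 0.70),
    4: ("politics", 0.85),
}


def build_di(ei, categories):
    """Derive the Document Index (DI): doc_id -> (entities, (category, score))."""
    entities_of = defaultdict(set)
    for entity, postings in ei.items():
        for doc_id, _type, _score in postings:
            entities_of[doc_id].add(entity)
    return {d: (entities_of[d], categories[d]) for d in entities_of}


def composed_entities(query_entity, ei):
    """Search with one entity: the entity itself plus every indexed entity
    that starts with, ends with or contains it (substring match)."""
    q = query_entity.lower()
    return sorted(e for e in ei if q in e.lower())


def related_entities(query_entities, ei, di):
    """SWE: query entities extended by the entities of the documents that
    contain every query entity (assumption: no ranking of those documents)."""
    doc_sets = [{d for d, _t, _s in ei.get(e, [])} for e in query_entities]
    common = set.intersection(*doc_sets) if doc_sets else set()
    result = set(query_entities)
    for d in common:
        result |= di[d][0]
    return sorted(result)


DI = build_di(EI, CATEGORIES)
print(composed_entities("Washington", EI))
# ['George Washington University', 'Washington']
print(related_entities(["Barack Obama", "White House"], EI, DI))
# ['Barack Obama', 'Washington', 'White House']
```

Here "Washington" is returned for the SWE query as a contextual entity because it co-occurs with both query entities in document 1. A KWS query would instead run a top-k algorithm over KI and read the entities of the resulting documents from DI.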
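The selection rule of Sections 3.2.2 and 4.1.3 (a document is kept if Diverse_type_cat or Relevance holds) can be sketched in Python as follows. This is a hedged illustration under stated assumptions: the function name diversify, the data, and the threshold values are hypothetical, and the first Relevance condition (the document contains the entity) is taken as given because the candidates come from the entity's own EI postings.

```python
def diversify(postings, categories, threshold):
    """Select the documents S returned for one entity.

    postings:   EI entry of the entity, a list of (doc_id, type, score(d, e))
    categories: DI-like map doc_id -> (category, score(d, c))
    threshold:  minimal relevance threshold (an assumed tuning value)
    """
    several_types = len({t for _d, t, _s in postings}) > 1
    candidates = {}  # doc_id -> (groups seen in the doc, score(d, Q))
    for doc_id, etype, s_de in postings:
        if several_types:
            # score(d, Q) = sum of score(d, e) over occurrences of e in d
            types, score = candidates.get(doc_id, ([], 0.0))
            candidates[doc_id] = (types + [etype], score + s_de)
        else:
            # single type: diversify on the document category, score(d, c)
            category, s_dc = categories[doc_id]
            candidates[doc_id] = ([category], s_dc)

    seen = set()   # plays the role of groupType or groupCat
    selected = []
    # documents are processed in decreasing relevance (pre-ranked)
    ranked = sorted(candidates.items(), key=lambda kv: -kv[1][1])
    for doc_id, (groups, score) in ranked:
        diverse = any(g not in seen for g in groups)  # Diverse_type_cat
        relevant = score > threshold                  # Relevance
        if diverse or relevant:
            selected.append(doc_id)
            seen.update(groups)
    return selected


# Entity with several types: diversification by type.
washington = [(1, "city", 0.40), (2, "person", 0.35), (5, "city", 0.10),
              (6, "city", 0.45), (7, "organization", 0.05)]
print(diversify(washington, {}, threshold=0.3))   # [6, 1, 2, 7]

# Entity with one type: diversification by document category.
obama = [(1, "person", 0.33), (4, "person", 0.34), (8, "person", 0.31)]
cats = {1: ("politics", 0.91), 4: ("politics", 0.85), 8: ("business", 0.20)}
print(diversify(obama, cats, threshold=0.9))      # [1, 8]
```

In the first call, document 7 is kept despite its low score because it is the only one with type 'organization', while document 5 is dropped: its type is already covered and its score is below the threshold. This is the "at least one document" condition of Section 4.1.3.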