    On Suggesting Entities as Web Search Queries
                                Extended Abstract


           Diego Ceccarelli1,2,3, Sergiu Gordea4, Claudio Lucchese1,
                 Franco Maria Nardini1, and Raffaele Perego1
            1 ISTI-CNR, Pisa, Italy – {firstname.lastname}@isti.cnr.it
            2 IMT Institute for Advanced Studies Lucca, Lucca, Italy
            3 Dipartimento di Informatica, Università di Pisa, Pisa, Italy
            4 AIT GmbH, Wien, Austria – sergiu.gordea@ait.ac.at



      Abstract. The Web of Data is growing in popularity and dimension,
      and named entity exploitation is gaining importance in many research
      fields. In this paper, we explore the use of entities that can be extracted
      from a query log to enhance query recommendation. In particular, we
      extend a state-of-the-art recommendation algorithm to take into account
      the semantic information associated with submitted queries. Our novel
      method generates highly related and diversified suggestions that we as-
      sess by means of a new evaluation technique. The manually annotated
      dataset used for performance comparisons has been made available to
      the research community to favor the repeatability of experiments.


1    Semantic Query Recommendation
Mining the past interactions of users with the search system, as recorded in query
logs, is an effective approach to producing relevant query suggestions. It is based
on the assumption that information searched for by past users can be of interest
to others. The typical interaction of a user with a Web search engine consists
in translating her information need into a textual query made of a few terms. We
believe that the “Web of Data” can be profitably exploited to make this process
more user-friendly and to alleviate possible vocabulary mismatch problems.
We adopt the Search Shortcuts (SS) model proposed in [1, 2]. The SS algorithm
generates suggestions containing only queries that appear as the final ones of
successful sessions. The goal is to suggest queries that are highly likely to help
users reach their initial goal. The SS algorithm works by efficiently computing
similarities between the partial session currently performed by the user and the
historical successful sessions recorded in a query log. The final queries of the most
similar successful sessions are suggested to users as search shortcuts.
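As a rough illustration of this process, the following Python sketch compares the
current partial session with the historical successful sessions through a simple
bag-of-terms cosine similarity and returns the final queries of the most similar
ones. It is only a sketch: the real SS implementation relies on an inverted index
for efficiency, and the session representation and the search_shortcuts helper are
assumptions made for illustration.

    from collections import Counter
    from math import sqrt

    def cosine(a: Counter, b: Counter) -> float:
        # Cosine similarity between two bag-of-terms vectors.
        dot = sum(a[t] * b[t] for t in a)
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def search_shortcuts(partial_session, successful_sessions, k=10):
        # partial_session: list of queries issued so far by the current user.
        # successful_sessions: list of (queries, final_query) pairs mined from the log.
        # Returns the final queries of the k most similar successful sessions.
        current = Counter(t for q in partial_session for t in q.lower().split())
        scored = []
        for queries, final_query in successful_sessions:
            historical = Counter(t for q in queries for t in q.lower().split())
            scored.append((cosine(current, historical), final_query))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [final for score, final in scored[:k] if score > 0]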
    A virtual document is constructed by merging the queries of each successful
session, i.e., a session ending with a click. We then annotate virtual documents to
extract relevant named entities. Common annotation approaches on query logs
consider a single query and try to map it to an entity (if any); when a query is
ambiguous, the risk is to always map it to the most popular entity. By annotating
whole virtual documents instead, in case of ambiguity we can select the entity with
the highest likelihood of representing the semantic context of the query.
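A minimal sketch of this step follows, assuming a hypothetical annotate function
that, given a text, returns (entity, confidence) pairs; any off-the-shelf entity
linker could play that role. The build_virtual_documents helper and its dictionary
layout are illustrative assumptions, not the actual implementation.

    def build_virtual_documents(successful_sessions, annotate):
        # successful_sessions: list of query lists, each from one successful session.
        # annotate: hypothetical entity linker returning (entity, confidence) pairs
        #           for a given text; any real annotator could be plugged in here.
        virtual_documents = []
        for queries in successful_sessions:
            text = " ".join(queries)  # merge the whole session into one virtual document
            entities = {entity: confidence for entity, confidence in annotate(text)}
            virtual_documents.append({
                "text": text,               # indexed by the SS inverted index
                "final_query": queries[-1],
                "entities": entities,       # entity -> annotator confidence conf(e)
            })
        return virtual_documents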
    We call Semantic Search Shortcuts (S^3) the query recommender system
exploiting this additional knowledge. Please note that S^3 provides a list of related
entities, differently from traditional query recommenders such as SS, which for a
given query produce a flat list of query recommendations. We argue that entities
can potentially deliver much more information to users than raw queries.
    In order to compute the entities to be suggested, given an input query q,
we first retrieve the top-k most relevant virtual documents by processing the
query over the SS inverted index built as described above. The result set Rq
contains the top-k relevant virtual documents along with the entities associated
with them. Given an entity e in the result set, we define two measures:
\[
score(e, VD) =
  \begin{cases}
    conf(e) \times score(VD) & \text{if } e \in VD.entities \\
    0 & \text{otherwise}
  \end{cases}
\]
\[
score(e, q) = \sum_{VD \in R_q} score(e, VD)
\]

where conf(e) is the confidence of the annotator in associating the entity e with the
virtual document VD, while score(VD) is the similarity score returned by the
information retrieval system. We rank the entities appearing in Rq by their score
w.r.t. the query.
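This ranking can be computed directly from the retrieved documents. The sketch
below assumes that the result set Rq is available as a list of (virtual document,
similarity score) pairs, with each virtual document carrying its entity-to-confidence
map as in the previous sketch; the helper name is ours.

    def score_entities(result_set):
        # result_set: list of (vd, retrieval_score) pairs for the top-k virtual
        # documents retrieved for the query; vd["entities"] maps each annotated
        # entity to the annotator confidence conf(e).
        # Computes score(e, q) = sum over VD in Rq of conf(e) * score(VD).
        scores = {}
        for vd, retrieval_score in result_set:
            for entity, confidence in vd["entities"].items():
                scores[entity] = scores.get(entity, 0.0) + confidence * retrieval_score
        # Entities ranked by decreasing score w.r.t. the query.
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)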


2   Experimental Evaluation

We used a large query log coming from the Europeana portal1, containing a sample
of users’ interactions covering two years (from August 27, 2010 to January 17,
2012). We preprocessed the entire query log to remove noise (e.g., queries
submitted by software robots, misspellings, different encodings, etc.). Finally, we
obtained 139,562 successful sessions. An extensive characterization of the query
log can be found in [3]. To assess our methodology we built a dataset consisting
of 130 queries split into three disjoint sets: 50 short queries (1 term), 50 medium
queries (4 terms on average), and 30 long queries (9 terms on average). For each
query in the three sets, we computed the top-10 recommendations produced by
the SS query recommender system and manually mapped them to entities by using
a simple interface providing a user-friendly way to associate entities with queries2.
1
  We acknowledge the Europeana Foundation for providing us with the query logs used
  in our experimentation. http://www.europeana.eu/portal/
2
  Interested readers can download the dataset from: http://hpc.isti.cnr.it/
  ~ceccarelli/doku.php/sss

    We are interested in evaluating two aspects of the provided set of suggestions.
These are our main research questions:
Relatedness : How much information related to the original query is a set of
   suggestions able to provide?
Diversity : How many different aspects of the original query is a set of suggestions
   able to cover?
   To evaluate these aspects, we borrow the measure of semantic relatedness
between two entities proposed by Milne and Witten [4] and widely used by entity
annotators:

\[
rel(e_1, e_2) = 1 - \frac{\log(\max(|I_L(e_1)|, |I_L(e_2)|)) - \log(|I_L(e_1) \cap I_L(e_2)|)}{\log(|KB|) - \log(\min(|I_L(e_1)|, |I_L(e_2)|))}
\]

where e1 and e2 are the two entities of interest, the function IL(e) returns the set
of all entities linking to the entity e in Wikipedia, and KB is the whole set of
entities in the knowledge base. We extend this measure to compute the similarity
between two sets of entities: IL applied to a set of entities returns all the entities
linking to at least one entity in the given set. Analogously, given two sets of
entities E1, E2, we define their diversity as div(E1, E2) = 1 − rel(E1, E2). Given a
query q, let Eq be the set of entities that have been manually associated with the
query. We define the relatedness and the diversity of a list of suggestions Sq as:

Definition 1 The average relatedness of a list of suggestions is computed as:

\[
rel(S_q) = \frac{\sum_{s \in S_q} rel(E_s \setminus E_q, E_q)}{|S_q|}
\]

where Es represents the set of entities mapped to a suggestion s (it may contain
more than one entity in the manually annotated dataset). Please note that we
remove the entities of the original query from each set of suggested entities
(Es \ Eq), as we are not interested in suggesting something that does not add
useful content w.r.t. the starting query.

Definition 2 The average diversity of a list of suggestions is defined as:

\[
div(S_q) = \frac{\sum_{s \in S_q} div(E_s, E_{S_q \setminus s})}{|S_q|}
\]

    For each suggestion, we evaluate how much information it adds w.r.t. the
other suggestions: ESq\s denotes the union of the entities belonging to all the
suggestions except the current suggestion s.
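The sketch below makes the two definitions concrete, assuming that in_links maps
every entity to the set of Wikipedia entities linking to it, that kb_size is |KB|,
and that each suggestion is represented by the set of entities mapped to it; the
function names and the handling of degenerate cases (empty sets, no common
in-links) are our own simplifying assumptions.

    from math import log

    def relatedness(links_1, links_2, kb_size):
        # Milne-Witten relatedness between two sets of incoming links.
        if not links_1 or not links_2:
            return 0.0
        common = len(links_1 & links_2)
        if common == 0:
            return 0.0
        distance = (log(max(len(links_1), len(links_2))) - log(common)) / \
                   (log(kb_size) - log(min(len(links_1), len(links_2))))
        return 1.0 - distance

    def set_relatedness(entities_1, entities_2, in_links, kb_size):
        # Extension to entity sets: I_L of a set is the union of its members' in-links.
        links_1 = set().union(*(in_links[e] for e in entities_1)) if entities_1 else set()
        links_2 = set().union(*(in_links[e] for e in entities_2)) if entities_2 else set()
        return relatedness(links_1, links_2, kb_size)

    def avg_relatedness(suggestions, query_entities, in_links, kb_size):
        # Definition 1: rel(Sq), with the query entities removed from each Es.
        return sum(set_relatedness(e_s - query_entities, query_entities, in_links, kb_size)
                   for e_s in suggestions) / len(suggestions)

    def avg_diversity(suggestions, in_links, kb_size):
        # Definition 2: div(Sq), each Es against the union of all other suggestions.
        total = 0.0
        for i, e_s in enumerate(suggestions):
            others = suggestions[:i] + suggestions[i + 1:]
            rest = set().union(*others) if others else set()
            total += 1.0 - set_relatedness(e_s, rest, in_links, kb_size)
        return total / len(suggestions)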
Experimental Results: For each set of queries in the dataset described above
(short, medium, and long), we compared the average relatedness and the average
diversity of the recommendations generated by SS and by S^3.
    Figure 1 shows the average relatedness computed for each query q belonging
to a particular set of queries. The results confirm the validity of our intuition:
for all three sets, the results obtained by S^3 are always better than those
obtained with the SS suggestions. It is worth observing that the longer the
queries, the more difficult the suggestion of related queries.
[Fig. 1: Per-set average relatedness computed between the list of suggestions and the given query (S^3 vs. SS).]

[Fig. 2: Per-set average diversity computed between the list of suggestions and the given query (S^3 vs. SS).]


    This happens because long queries occur less frequently in the log, and thus
we have less information to generate the suggestions. Considering the single sets,
the highest gain of S^3 in terms of average relatedness is obtained for medium
and long queries: this means that relying on entities allows us to mitigate the
sparsity of user data.
    Figure 2 reports the average diversity of the suggestions over the queries of
each set. Here, we observe an opposite trend, due to the fact that the longer
the queries, the more terms/entities they contain, and the more different the
suggestions are. Furthermore, we observe that, for the most frequent queries,
SS performs much worse than S^3. This happens because for frequent queries
SS tends to retrieve popular reformulations of the original query, thus not
diversifying the returned suggestions. S^3 does not suffer from this problem
since it works with entities, thus naturally diversifying the list of suggestions.
We leave as future work the study of a strategy for suggesting entities that aims
at maximizing the diversity of a list of suggestions.


References
1. Baraglia, R., Cacheda, F., Carneiro, V., Fernandez, D., Formoso, V., Perego, R.,
   Silvestri, F.: Search shortcuts: a new approach to the recommendation of queries.
   In: Proc. RecSys’09. ACM, New York, NY, USA (2009)
2. Broccolo, D., Marcon, L., Nardini, F.M., Perego, R., Silvestri, F.: Generating sug-
   gestions for queries in the long tail with an inverted index. IP&M
3. Ceccarelli, D., Gordea, S., Lucchese, C., Nardini, F.M., Tolomei, G.: Improving
   Europeana search experience using query logs. In: Proc. TPDL’11. pp. 384–395
4. Milne, D., Witten, I.: Learning to link with Wikipedia. In: Proc. CIKM’08. pp. 509–
   518. ACM (2008)