Query Type Recognition and Result Filtering in INEX
              2014 Social Book Search Track

Shih-Hung Wu1*, Pei-Kai Liao1, Hua-Wei Lin1, Li-Jen Hsu1, Wei-Lun Xiao1,
Liang-Pu Chen2, Tsun Ku3, and Gwo-Dong Chen3
1 Chaoyang University of Technology, Taiwan, R.O.C
{shwu (*Contact author), s10027024, s10027072, s10027042}@cyut.edu.tw
2 Institute for Information Industry, Taiwan, R.O.C
eit@iii.org.tw
3 National Central University, Taiwan, R.O.C
cujing@gmail, chen@csie.ncu.edu.tw


       Abstract. This paper describes our system for the INEX 2014 Social Book
       Search (SBS) track. This is our second participation in the SBS track.
       Building on our social feature re-ranking system [1], we improve the system
       by adding knowledge about the queries. Our baseline system is built on
       Lucene [6], an open source information retrieval system. The new
       modification is a set of rules that filter unwanted books out of the
       recommendation list. The official run results show that the system
       performs much better than our 2013 system.

       Keywords: Query type recognition, social features, social book search


1      Introduction
This paper describes the system we built for the INEX 2014 Social Book Search (SBS)
track [10]. This is our second participation in the SBS track [7]. Building on our
social feature re-ranking system [1], we improve the system by adding knowledge
about the queries.
    In book search, we believe that the results of traditional information retrieval
technology are not enough for users who need more personal recommendations.
Recommendations from experienced users are more appealing; they may convey more
personal feelings and cover subtler reasons that a traditional information retrieval
system cannot capture. Our system integrates social features into traditional
information retrieval technology to give better book recommendations. In this task,
user-generated metadata is used as the social feature.
    According to our observation of the topics in the INEX 2012 SBS track, some
queries differ from the others. Simply treating the keywords in the topic as search
terms does not give good results. Some queries require a higher level of knowledge:
the system needs to understand the information need behind the keywords, i.e. it
needs knowledge about the types of literature. We analyzed the topics and found
several types among them. Due to time limitations, we implemented only a module
that recognizes one special type of topic and a filtering module that modifies the
recommendation result.
   The structure of this paper is as follows. Section 2 describes the data set,
Section 3 presents our architecture and the details of our method, Section 4 reports
the experimental results, and the final section gives conclusions.


2        Dataset
2.1      Collection
The document collection in this task is provided by the INEX 2014 Social Book Search
track. It contains about 2.8 million book records in XML format, 25.9 GB in total,
collected from Amazon.com and LibraryThing [2].


                              Table 1. All the XML tags [2]

                                       tag name
      book                similarproducts  title                   imagecategory
      dimensions          tags             edition                 name
      reviews             isbn             dewey                   role
      editorialreviews    ean              creator                 blurber
      images              binding          review                  dedication
      creators            label            rating                  epigraph
      blurbers            listprice        authorid                firstwordsitem
      dedications         manufacturer     totalvotes              lastwordsitem
      epigraphs           numberofpages    helpfulvotes            quotation
      firstwords          publisher        date                    seriesitem
      lastwords           height           summary                 award
      quotations          width            editorialreview         browseNode
      series              length           content                 character
      awards              weight           source                  place
      browseNodes         readinglevel     image                   subject
      characters          releasedate      imageCategories         similarproduct
      places              publicationdate  url                     tag
      subjects            studio           data

2.2      Test Topic
The topics provided by the INEX 2014 Social Book Search track are collected from
LibraryThing. A topic describes the information need of a user. Figure 1 and
Figure 2 give a partial view of an example. The XML tags used are: <topic id>,
<title>, <mediated_query>, <group>, <narrative>, <catalog>, <book>, <LT_id>,
<entry_date>, and <rating>.


Fig. 1. A topic example in INEX 2014 social book search track


Fig. 2. A topic example in INEX 2014 social book search track


3      CYUT System Methodology
3.1    System Architecture
Figure 3 shows our basic system architecture. Pre-processing consists of stop word
filtering and stemming; both modules are provided by Lucene. After pre-processing,
our system builds an index for retrieval. The results of content-based retrieval are
then re-ranked according to the social features to produce the final results.


   Document Collection -> Stop words filtering -> Stemming -> Indexing ->
   Content-based Retrieval -> Re-Ranking -> Results


                          Fig. 3. Basic system architecture [1]


3.2    Indexing and Query
The index and search engine in use is Lucene [6], an open source full-text search
engine provided by the Apache Software Foundation. Lucene is written in Java and can
easily be called from Java programs to build various applications.
   According to Bogers and Larsen [3], 19 tags are particularly useful for social
book search: <isbn>, <title>, <publisher>, <editorial>, <creator>, <series>,
<award>, <character>, <place>, <blurber>, <epigraph>, <firstwords>, <lastwords>,
<quotation>, <dewey>, <subject>, <browseNode>, <review>, and <tag>. Our system
also focuses on these 19 tags.
   The content of the <dewey> tag is restored to strings according to the 2003 list
of Dewey category descriptions [9] to make string matching easier. For example,
<dewey>004</dewey> is restored to <dewey>Data processing Computer science</dewey>.
The content of <tag> is also expanded according to its count to emphasize its
importance. For example, <tag count="3">fantasy</tag> is expanded to
<tag>fantasy fantasy fantasy</tag>. In addition to the 19 tags, our system also
indexes the content of <review> as a separate index, named reviews.
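
   The two rewriting steps above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not our production code (which is
built around Lucene in Java): the function names are hypothetical, and
DEWEY_DESCRIPTIONS holds just the single entry used in the example rather than the
full list [9].

```python
import xml.etree.ElementTree as ET

# Hypothetical one-entry excerpt; the full Dewey description list [9] is assumed
# to be loaded from elsewhere.
DEWEY_DESCRIPTIONS = {"004": "Data processing Computer science"}


def restore_dewey(code, table=DEWEY_DESCRIPTIONS):
    """Return the description for a Dewey code, or the code itself if unknown."""
    return table.get(code.strip(), code)


def expand_tag(term, count):
    """Repeat a tag term `count` times so frequent tags weigh more in the index."""
    return " ".join([term] * max(count, 1))


def preprocess_book(xml_text):
    """Rewrite the <dewey> and <tag> elements of one book record."""
    root = ET.fromstring(xml_text)
    for dewey in root.iter("dewey"):
        if dewey.text:
            dewey.text = restore_dewey(dewey.text)
    for tag in root.iter("tag"):
        if tag.text:
            # The count attribute is consumed: its effect is now in the text.
            count = int(tag.attrib.pop("count", "1"))
            tag.text = expand_tag(tag.text.strip(), count)
    return ET.tostring(root, encoding="unicode")
```

   Unknown Dewey codes are left unchanged in this sketch, so a record is never lost
because of a missing table entry.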
  According to Koolen et al. [4], an Indri [5] based system that uses all the
contents of <Title>, <Query>, <Group>, and <Narrative> as query terms gives better
results. In our official runs we likewise use four fields of the topic as query
terms (Section 4).
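
  A minimal sketch of building one query string from four topic fields is given
below. The lower-case element names, and the choice of <mediated_query> as the
"Query" field, are assumptions made for this illustration; build_query is a
hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed field names: lower-case tags as in Section 2.2, with <mediated_query>
# standing in for the "Query" field.
QUERY_FIELDS = ("title", "mediated_query", "group", "narrative")


def build_query(topic_xml, fields=QUERY_FIELDS):
    """Concatenate the text of the selected topic fields into one query string."""
    topic = ET.fromstring(topic_xml)
    parts = []
    for name in fields:
        node = topic.find(name)
        if node is not None and node.text:
            # Collapse line breaks and repeated spaces inside the field.
            parts.append(" ".join(node.text.split()))
    return " ".join(parts)
```

  Missing or empty fields are skipped. Characters that have a special meaning for
the search engine's query parser would still need to be escaped before the string
is submitted.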


  Fig. 4. A type2 query example that we defined in INEX 2013 social book search track


3.3    Type2 Query Recognition and Result Filtering
According to our observation of the topics in the INEX 2012 SBS track, some queries
differ from the others; we call them Type2 queries. A Type2 query contains the names
of books for which the user wants to find similar ones. Therefore, the books named
in the topic should not be part of the recommendation. Since the book names are
given explicitly, our original system returns exactly those books as the top
recommendations. To filter them out, we define a list of phrases that identify such
queries, and we remove the books named in the queries from the recommendation lists.
The phrases are listed in the appendix at the end of the paper. Figure 4 gives an
example of a Type2 query taken from the INEX 2013 SBS topics; it contains the key
phrase “I’m reading”. We find that 174 queries in the INEX 2013 SBS track can be
classified as Type2 queries. Therefore, this year we add a module to our system that
identifies Type2 queries and filters out the books mentioned in the topics. The
modified system flow is shown in Figure 5.
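
   The recognition and filtering steps can be illustrated as follows. This is a
sketch under assumptions: KEY_PHRASES contains only a few entries from the appendix,
and matching a book by looking for its title verbatim in the topic text is an
assumed rule chosen for the example, since the exact matching procedure is not
specified here. All function names are hypothetical.

```python
# A few phrases taken from the appendix; the full list is longer.
KEY_PHRASES = ["I've just finished", "I'm reading", "I've read",
               "books like", "something like"]


def normalize(text):
    """Lower-case, straighten curly apostrophes, and collapse whitespace."""
    text = text.replace("\u2019", "'").lower()
    return " ".join(text.split())


def is_type2(topic_text, phrases=KEY_PHRASES):
    """A topic is treated as Type2 if it contains any key phrase."""
    text = normalize(topic_text)
    return any(normalize(p) in text for p in phrases)


def mentioned_books(topic_text, ranked):
    """Ids of ranked books whose title occurs verbatim in the topic (assumed rule)."""
    text = normalize(topic_text)
    return {book_id for book_id, title in ranked
            if title and normalize(title) in text}


def filter_type2(topic_text, ranked):
    """Drop books named in a Type2 topic; leave other topics untouched."""
    if not is_type2(topic_text):
        return list(ranked)
    drop = mentioned_books(topic_text, ranked)
    return [(book_id, title) for book_id, title in ranked if book_id not in drop]
```

   Here `ranked` is a list of (book id, title) pairs in retrieval order. Plain
substring matching is deliberately simple; short phrases such as "I read" can also
match inside longer words, so a word-boundary check may be preferable in practice.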

3.4    Re-ranking
The re-ranking part is similar to that in our previous work [1]. We integrate
user-generated metadata into the traditional content-based search result by
re-ranking the results. The social features are used to give more weight to certain
books, for example:

 • User rating: users may rate a book from 1 to 5; the higher the better.
 • Helpful vote: other users may endorse a comment by voting it as helpful.
 • Total vote: the total number of votes on a comment, helpful or not.


   Query -> Type2?
     No:  Search Engine -> Re-Ranking -> Result
     Yes: Search Engine -> Filtering -> Re-Ranking -> Result


                     Fig. 5. The modified system flow of our system

     We designed three different ways to use these social features in re-ranking.
1) User Rating method
     Increase the weight of the content-based retrieval result by adding the sum of
the user ratings, as shown in formula (1):

                                                                                    (1)

2) Average User Rating method
     Increase the weight of the content-based retrieval result by adding the average
of the user ratings, as shown in formula (2):

                                                                                    (2)

3) Weights User Rating method
     Increase the weight of the content-based retrieval result for books that get
more helpful votes, as shown in formulas (3) and (4):

                                                                                    (3)

                                                                                    (4)
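
     As one possible reading of the three methods, the sketch below combines the
content-based score with a social score by linear interpolation,
alpha * content + (1 - alpha) * social. Both the interpolation form and the
helpful-vote weighting are assumptions made for illustration and are not taken from
formulas (1) to (4); the names Review, Candidate, and rerank are hypothetical.

```python
from collections import namedtuple

# rating: 1 to 5; helpful_votes and total_votes: vote counts on that review.
Review = namedtuple("Review", "rating helpful_votes total_votes")
# content_score: score returned by content-based retrieval.
Candidate = namedtuple("Candidate", "book_id content_score reviews")


def rating_sum(reviews):
    """User Rating: sum of all ratings of a book."""
    return sum(r.rating for r in reviews)


def rating_average(reviews):
    """Average User Rating: mean rating, 0.0 when there are no reviews."""
    return rating_sum(reviews) / len(reviews) if reviews else 0.0


def rating_weighted(reviews):
    """Ratings weighted by the share of helpful votes (assumed weighting)."""
    return sum(r.rating * r.helpful_votes / r.total_votes
               for r in reviews if r.total_votes > 0)


def rerank(candidates, social, alpha=0.95):
    """Order book ids by alpha * content + (1 - alpha) * social (assumed form)."""
    scored = [(alpha * c.content_score + (1 - alpha) * social(c.reviews), c.book_id)
              for c in candidates]
    return [book_id for _, book_id in sorted(scored, key=lambda s: -s[0])]
```

     For example, rerank(candidates, rating_average) applies the second method. The
sketch does not normalize the two scores to a common range, which a real system
would need to consider because a rating sum is unbounded.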


3.5    Find the Best α Value by Experiment
Since there is no theoretical guidance on how to set the value of α, the value used
in our official runs was selected through a series of experiments on the 2013 data
set. Table 2 shows the results; the system gets the best result (P@10 = 0.0245,
MAP = 0.0220) when α is 0.95, the largest value tested.

Table 2. Experimental results for different α on the 2013 data set
             α                            P@10                          MAP
           0.50                          0.0221                         0.0193
           0.60                          0.0221                         0.0193
           0.70                          0.0224                         0.0195
           0.80                          0.0226                         0.0196
           0.90                          0.0237                         0.0204
           0.95                          0.0245                         0.0220
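
The selection procedure amounts to a small grid search. The sketch below is an
illustration under assumptions: `make_run` is a hypothetical callback that returns
the ranked lists for a given α, `qrels` holds the relevant book ids per topic, and
P@10 is used as the selection criterion (in Table 2, P@10 and MAP agree on the best
value).

```python
ALPHAS = (0.50, 0.60, 0.70, 0.80, 0.90, 0.95)  # values listed in Table 2


def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k ranked ids that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def mean_precision_at_k(run, qrels, k=10):
    """Average P@k over topics; run: topic -> ranking, qrels: topic -> relevant ids."""
    return sum(precision_at_k(run[t], qrels.get(t, set()), k) for t in run) / len(run)


def select_alpha(make_run, qrels, alphas=ALPHAS, k=10):
    """Return (best_alpha, scores); make_run(alpha) is a hypothetical callback."""
    scores = {a: mean_precision_at_k(make_run(a), qrels, k) for a in alphas}
    # On ties, max() keeps the first (smallest) alpha that reaches the top score.
    best = max(alphas, key=lambda a: scores[a])
    return best, scores
```

Because the best value in Table 2 lies at the edge of the tested range, extending
the grid beyond 0.95 would be a natural follow-up experiment.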


4      Experimental results
In the official evaluation, we submitted four runs. This year, we use four fields of
the topics as query terms, and we filter out some book candidates for all Type2
queries. The configuration of each run is as follows. Run 1, CYUT - Type2QTGN:
without re-ranking. Run 2, CYUT - 0.95AverageType2QTGN: re-ranking with Average User
Rating. Run 3, CYUT - 0.95RatingType2QTGN: re-ranking with User Rating. Run 4,
CYUT - 0.95WRType2QTGN: re-ranking with Weights User Rating.
   Table 3 shows the official evaluation results of our four runs. Among them, the
CYUT - Type2QTGN run gives the best nDCG@10 [8] result (0.119); the re-ranking run
CYUT - 0.95AverageType2QTGN reaches the same nDCG@10 with slightly lower MRR, MAP,
and R@1000. The other two runs give poor results because of a technical error: the
system searched the 2013 index file. We expect that these two runs would have
performed better had the system searched the 2014 index file. Compared with our
2013 INEX SBS results in Table 4, system performance improved considerably: our
best nDCG@10 rose from 0.0392 to 0.119.


                  Table 3. Official evaluation results in 2014 INEX SBS
             Run                     nDCG@10           MRR           MAP         R@1000
CYUT - Type2QTGN                       0.119           0.246         0.086        0.340
CYUT - 0.95AverageType2QTGN            0.119           0.243         0.085        0.332
CYUT - 0.95RatingType2QTGN             0.034           0.101         0.021        0.200
CYUT - 0.95WRType2QTGN                 0.028           0.084         0.018        0.213


                  Table 4. Official evaluation results in 2013 INEX SBS
           Run                      nDCG@10           P@10            MRR            MAP
Run1.query.content-base              0.0265           0.0147          0.0418         0.0153
Run2.query.Rating                    0.0376           0.0284          0.0792         0.0178
Run3.query.RA                        0.0170           0.0087          0.0352         0.0107
Run4.query.RW                        0.0392           0.0287          0.0796         0.0201
Run5.query.reviwes.content-base      0.0254           0.0153          0.0359         0.0137
Run6.query.reviews.RW                0.0378           0.0284          0.0772         0.0165


5      Conclusions
This paper reports our system and results in the INEX 2014 Social Book Search track.
We submitted four runs, and the results are listed in Table 3. Among the four runs,
CYUT - Type2QTGN gives the best nDCG@10; it uses content-based search and applies a
set of filtering rules based on a list of key phrases. In the future, we will
implement more modules with literature knowledge about writers, book genres,
geographic categories of publishers, and temporal categories of authors, in order to
deal with the special cases in the topics.


Acknowledgement
This study was conducted under the "Online and Offline Integrated Smart Commerce
Platform (1/4)" of the Institute for Information Industry, which is subsidized by the
Ministry of Economic Affairs of the Republic of China.


References
 1. Wei-Lun Xiao, Shih-Hung Wu, Liang-Pu Chen, Hung-Sheng Chiu, and Ren-Dar Yang,
    “Social Feature Re-ranking in INEX 2013 Social Book Search Track”, CLEF 2013
    Evaluation Labs and Workshop Online Working Notes, 23 - 26 September, Valencia, Spain.
 2. Marijn Koolen, Gabriella Kazai, Jaap Kamps, Michael Preminger, Antoine Doucet, and
    Monica Landoni, “Overview of the INEX 2012 Social Book Search Track”, INEX'12
    Workshop Pre-proceedings, pp. 77-96, 2012.
 3. Toine Bogers and Birger Larsen, “RSLIS at INEX 2012: Social Book Search Track”,
    INEX'12 Workshop Pre-proceedings, pp. 97-108, 2012.
 4. Marijn Koolen, Hugo Huurdeman, and Jaap Kamps, “Comparing Topic Representations for
    Social Book Search”, CLEF 2013 Evaluation Labs and Workshop Online Working Notes,
    23 - 26 September, Valencia, Spain.
 5. T. Strohman, D. Metzler, H. Turtle, and W. B. Croft, “Indri: a language-model based search
    engine for complex queries”, In Proceedings of the International Conference on Intelligent
    Analysis, 2005.
 6. Lucene, https://lucene.apache.org
 7. Marijn Koolen, Gabriella Kazai, Michael Preminger, and Antoine Doucet, “Overview of
    the INEX 2013 Social Book Search Track”, CLEF 2013 Evaluation Labs and Workshop
    Online Working Notes, 23 - 26 September, Valencia, Spain.
 8. K. Järvelin and J. Kekäläinen, “Cumulated Gain-based Evaluation of IR Techniques”, ACM
    Transactions on Information Systems 20(4) (2002) 422–446.
 9. 2003 list of Dewey category descriptions,
    https://www.library.illininois.edu/ugl/about/dewey.html
10. INEX 2013 Social Book Search Track, https://inex.mmci.uni-saarland.de/tracks/books


Appendix: The key phrases for recognizing Type2 queries.
<TotalKeyWord>
       <keyWord>I've just finished</keyWord>
       <keyWord>I'm now reading</keyWord>
       <keyWord>I'm reading</keyWord>
       <keyWord>I've read</keyWord>
       <keyWord>I read</keyWord>
       <keyWord>I've ever read</keyWord>
       <keyWord>Any book as good as</keyWord>
       <keyWord>I'm not interested</keyWord>
       <keyWord>I already own</keyWord>
       <keyWord>I own</keyWord>
       <keyWord>picked up</keyWord>
       <keyWord>I can find</keyWord>
       <keyWord>I read</keyWord>
       <keyWord>I've looked through</keyWord>
       <keyWord>I've just found</keyWord>
       <keyWord>I have already read</keyWord>
       <keyWord>I was reading</keyWord>
       <keyWord>I had read</keyWord>
       <keyWord>to read</keyWord>
       <keyWord>what other</keyWord>
       <keyWord>I'm already completely</keyWord>
       <keyWord>I have already read</keyWord>
       <keyWord>I've started on</keyWord>
       <keyWord>I just finished</keyWord>
       <keyWord>I did enjoy</keyWord>
       <keyWord>something like</keyWord>
       <keyWord>without</keyWord>
       <keyWord>I am reading</keyWord>
       <keyWord>starting with</keyWord>
       <keyWord>I already have</keyWord>
       <keyWord>I'm thinking of</keyWord>
       <keyWord>I just finished reading</keyWord>
       <keyWord>similar</keyWord>
       <keyWord>I adore</keyWord>
       <keyWord>I tried reading</keyWord>
       <keyWord>I also have</keyWord>
       <keyWord>I've seen</keyWord>
       <keyWord>I recently read</keyWord>
       <keyWord>I discovered</keyWord>
       <keyWord>I have recently read</keyWord>
       <keyWord>have been suggested</keyWord>
       <keyWord>has been suggested</keyWord>
       <keyWord>I've enjoyed</keyWord>
       <keyWord>I've just completed</keyWord>
       <keyWord>I haven't yet read</keyWord>
       <keyWord>I have only found</keyWord>
       <keyWord>I have found</keyWord>
       <keyWord>I have read</keyWord>
       <keyWord>I am re-reading</keyWord>
       <keyWord>I also recently started</keyWord>
       <keyWord>I recently started</keyWord>
       <keyWord>I just re-read</keyWord>
       <keyWord>I've compiled</keyWord>
       <keyWord>I'd really like to read</keyWord>
       <keyWord>I've already enjoyed</keyWord>
       <keyWord>I can think of</keyWord>
       <keyWord>I was considering</keyWord>
       <keyWord>Currently reading</keyWord>
       <keyWord>Apart from</keyWord>
       <keyWord>I'm nearly finished</keyWord>
       <keyWord>have been recommended</keyWord>
       <keyWord>other recommendations</keyWord>
       <keyWord>having read</keyWord>
       <keyWord>on my list</keyWord>
       <keyWord>I've been reading</keyWord>
       <keyWord>I have just received</keyWord>
       <keyWord>finishing</keyWord>
       <keyWord>also read</keyWord>
       <keyWord>recent readings</keyWord>
       <keyWord>I have been reading</keyWord>
       <keyWord>I've recently finished</keyWord>
       <keyWord>other books</keyWord>
       <keyWord>additional resources</keyWord>
       <keyWord>The most recent book I haved</keyWord>
       <keyWord>I saw a book</keyWord>
       <keyWord>Thus far I</keyWord>
       <keyWord>what else should I</keyWord>
       <keyWord>Can anyone think of any more</keyWord>
       <keyWord>books like</keyWord>
       <keyWord>something elsee</keyWord>
       <keyWord>I thoroughly enjoyed</keyWord>
       <keyWord>My reading suggestions</keyWord>
</TotalKeyWord>

