=Paper=
{{Paper
|id=Vol-2002/gutclscisumm2017
|storemode=property
|title=Graz University of Technology at CL-SciSumm 2017: Query Generation Strategies
|pdfUrl=https://ceur-ws.org/Vol-2002/gutclscisumm2017.pdf
|volume=Vol-2002
|authors=Thomas Felber,Roman Kern
|dblpUrl=https://dblp.org/rec/conf/sigir/FelberK17
}}
==Graz University of Technology at CL-SciSumm 2017: Query Generation Strategies==
Graz University of Technology at CL-SciSumm 2017: Query Generation Strategies

Thomas Felber and Roman Kern
Institute of Interactive Systems and Data Science, Graz University of Technology
Know-Center GmbH, Inffeldgasse 13, 8010 Graz, Austria
felber@student.tugraz.at, rkern@know-center.at

Abstract. In this report we present our contribution to the 3rd Computational Linguistics Scientific Document Summarization Shared Task (CL-SciSumm 2017), which poses the challenge of identifying the spans of text in a reference paper (RP) that most accurately reflect a citation (i.e. citance) from another document to the RP. In our approach, we address this challenge by applying techniques from the field of information retrieval. To this end, we create a separate index for every RP and then transform each citance to an RP into a query. This query is subsequently used to retrieve the most relevant spans of text from the RP. Different ranking models and query generation strategies were employed to alter which spans of text are retrieved from the index. Furthermore, we implemented a k-NN classification on top of our search infrastructure for assigning the cited text spans to pre-defined classes.

Keywords: Information Retrieval, Query Generation, Ranking Models

1 Introduction

The focus of the CL-SciSumm 2017 Shared Task is on automatic paper summarization in the Computational Linguistics (CL) domain. It is organized as part of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017, http://wing.comp.nus.edu.sg/birndl-sigir2017/) [2], held at the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (http://sigir.org/sigir2017/) in Tokyo, Japan. It is a follow-up to the CL-SciSumm 2016 Shared Task at BIRNDL 2016 [1], which was conducted in the course of the Joint Conference on Digital Libraries (JCDL '16) in Newark, New Jersey.

This Shared Task is divided into multiple smaller tasks which pose the following problems:

– Task 1A: Identify the spans of text (cited text spans) in a reference paper (RP) that most accurately reflect a citation (i.e. citance) to the RP made from another document.
– Task 1B: Classify each identified cited text span according to a predefined set of facets. The elements of that set are: Implication, Method, Aim, Results, and Hypothesis.
– Task 2 (optional bonus task): Generate a structured summary of the RP from the identified cited text spans of the RP, where the length of the summary must not exceed 250 words.

The data set provided for these tasks comprises a training set and a test set consisting of 30 and 10 RPs respectively. Each RP is associated with a set of citing papers (CP) which all contain citations to the RP. In each CP, the text spans (citances) that pertain to a particular citation to the RP have been identified.

To tackle the problem in Task 1A, we followed an information retrieval (IR) approach. For every RP, we created an index holding all the spans of text of that RP. A citance to an RP is transformed into a query and run on the index associated with the RP to retrieve the most relevant spans of text. For Task 1B, we followed a k-NN classification approach: each identified cited text span is compared against all cited text spans in the training set, and a majority vote among the top five most similar cited text spans determines the facet.
2 Task 1A: Identification of Text Spans in the RP

In this section we provide a closer look at our approach to Task 1A. We describe how the indices for the RPs are created, as well as how the citances are turned into queries and subsequently used to identify relevant spans of text in the RP.

2.1 Index Creation

In order to create an index for an RP, which holds all the different spans of text of the RP, we used the Apache Lucene text search engine library (http://lucene.apache.org/), which features Java-based indexing and searching technology. Taking advantage of the library's indexing technology, we created an index for every RP and added all spans of text of the RP to the index. In this scenario a single span of text can be imagined as a separate text document that is added to a conventional index.

Before adding anything to the index, however, we performed two additional preprocessing steps on every span of text. First, all stop words contained in a span of text were removed. The idea behind this is to tune the performance of the index (fewer terms in the index) and to obtain more relevant search results, since stop words carry little distinguishing potential [6]. To decide which words qualify as stop words, we used Apache Lucene's integrated list of stop words for the English language. As a second preprocessing step, we stripped the suffixes of all words contained in a span of text in order to normalize them. This was achieved by applying Porter's stemming algorithm [3]. After the preprocessing of a span of text was completed, we added it to the index. A minimal indexing sketch is given at the end of this section.

2.2 Query Generation

After all the indices for the RPs were in place, we transformed each citance to an RP into a query and ran it on the index associated with the RP in order to retrieve the most relevant spans of text of the RP for that particular citance. Since all indices were constructed with Apache Lucene, we also resorted to Apache Lucene functionality to generate the queries. There exists a broad range of different query types in Apache Lucene; however, after conducting various experiments on the training set, we found that using Apache Lucene's TermQuery generated the best results.

To turn a citance into a query, we first applied two preprocessing steps to the citance. These steps are analogous to the ones described in section 2.1, that is, stop words were removed from the citance and Porter stemming was performed. As a next step we extracted all words from the citance and created a TermQuery for every word, so that each TermQuery corresponds to a single word in the citance. After that, we created an Apache Lucene BooleanQuery by OR-conjuncting all TermQueries. This resulting BooleanQuery was then used to query the index associated with the RP that the citance refers to.

As a result of the query, we obtained a set of top ranked spans of text of the RP. The elements of that set are ordered according to a score; however, which elements end up in the set and what score they are given depends on the ranking and retrieval model used by the index. From all spans of text retrieved this way, we considered the top two as most accurately reflecting the corresponding citance, because considering the top two yielded the best results during experiments on the training set.
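The indexing step of section 2.1 can be sketched with standard Apache Lucene classes. This is a minimal sketch, not the submitted system; the field names and the use of EnglishAnalyzer (which combines Lucene's English stop word list with Porter stemming) are our assumptions, since the paper only fixes the two preprocessing steps.

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class ReferencePaperIndexer {

    // Builds one in-memory index per reference paper; every span of text of
    // the RP becomes a separate Lucene document. The EnglishAnalyzer removes
    // English stop words and applies Porter stemming, which corresponds to
    // the two preprocessing steps described in section 2.1.
    public static Directory buildIndex(List<String> textSpans) throws IOException {
        Directory dir = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(new EnglishAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            for (int i = 0; i < textSpans.size(); i++) {
                Document doc = new Document();
                doc.add(new TextField("text", textSpans.get(i), Field.Store.YES));
                doc.add(new StringField("spanId", String.valueOf(i), Field.Store.YES));
                writer.addDocument(doc);
            }
        }
        return dir;
    }
}
```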
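Correspondingly, the query generation of section 2.2 can be sketched as follows, again only as an illustration under the same assumptions (field names, analyzer); the paper itself only specifies one TermQuery per word, OR-conjuncted into a BooleanQuery, and the retrieval of the top two spans.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;

public class CitanceQuery {

    // Turns a citance into a disjunction (OR) of TermQuery objects and
    // returns the identifiers of the top two spans of the RP index.
    public static List<String> topTwoSpans(Directory index, String citance) throws IOException {
        Analyzer analyzer = new EnglishAnalyzer();  // same preprocessing as at indexing time
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        try (TokenStream ts = analyzer.tokenStream("text", citance)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // one TermQuery per (preprocessed) word, OR-conjuncted via SHOULD
                builder.add(new TermQuery(new Term("text", term.toString())),
                            BooleanClause.Occur.SHOULD);
            }
            ts.end();
        }
        List<String> spanIds = new ArrayList<>();
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (ScoreDoc hit : searcher.search(builder.build(), 2).scoreDocs) {
                spanIds.add(searcher.doc(hit.doc).get("spanId"));  // keep the top two spans
            }
        }
        return spanIds;
    }
}
```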
2.3 Ranking

In section 2.2 we mentioned that it depends on the ranking and retrieval model used by the index which elements are retrieved by a query and how they are ranked. For the CL-SciSumm 2017 Shared Task, we submitted one system run using a simple vector space model (VSM) based method [5] and one using the popular BM25 model [4].

Vector Space Model. The term frequency and inverse document frequency (TF-IDF) weighting scheme used by Apache Lucene within the scope of the VSM is as follows. For the term frequency of a term $t$ in a document $d$, the formula

$\mathrm{tf}_{t,d} = \sqrt{f_{t,d}}$   (1)

is used, where $f_{t,d}$ denotes the number of times the term $t$ occurs in document $d$. For the inverse document frequency, which depends on the number of documents in which the term $t$ appears, the formula

$\mathrm{idf}_t = 1 + \log \frac{N}{n_t + 1}$   (2)

is used, where $N$ is the total number of documents in the index and $n_t$ is the number of documents containing the term $t$. The score of a document $d$ for a query $q$ is calculated based on the cosine similarity and is defined as

$\mathrm{sim}(d, q) = \frac{V(d) \cdot V(q)}{|V(d)|\,|V(q)|}$   (3)

where $V(d) \cdot V(q)$ is the dot product of the weighted vectors, and $|V(d)|$ and $|V(q)|$ are their Euclidean norms.

BM25. The term frequency factors used in Apache Lucene within the scope of BM25 ranking are defined as

$B_{t,d} = \dfrac{(k_1 + 1)\, f_{t,d}}{k_1 \left[ (1 - b) + b \, \frac{|d|}{i_{\mathrm{boost}}^2 \, |d|_{\mathrm{avg}}} \right] + f_{t,d}}$   (4)

where $f_{t,d}$ denotes the number of times the term $t$ occurs in document $d$, $|d|$ is the length of the document $d$ in words, $|d|_{\mathrm{avg}}$ is the average document length, $i_{\mathrm{boost}}$ is an index-time boosting factor, and $k_1$ and $b$ are parameters. The ranking equation used in the BM25 model can then be written as

$\mathrm{sim}(d, q) = \sum_{t \in q \cap d} B_{t,d} \times \log \dfrac{N - n_t + 0.5}{n_t + 0.5}$.   (5)

The values we used for the parameters $k_1$ and $b$ are 1.2 and 0.75 respectively. A worked example of the BM25 scoring and a configuration sketch for both ranking models are given after section 3.

3 Task 1B: Identification of the Discourse Facet

The discourse facet takes one of the following values: Implication, Method, Aim, Results, and Hypothesis. To classify the spans of text identified in Task 1A, we took the following approach. First, we created an index which we filled with all available cited text spans of the training set together with their corresponding discourse facets. To decide which discourse facet a span of text belongs to, we then transformed the span of text into a query, analogous to the procedure described in section 2.2, and ran the query on this index. Finally, we conducted a majority vote over the top five retrieved results to determine the discourse facet. A minimal sketch of this classification step is shown below.
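To make equations (4) and (5) concrete, here is a small worked example with assumed values that are not taken from the data set: $N = 100$, $n_t = 5$, $f_{t,d} = 3$, $|d| = 20$, $|d|_{\mathrm{avg}} = 25$, $i_{\mathrm{boost}} = 1$, and the parameters $k_1 = 1.2$, $b = 0.75$; $\log$ is taken as the natural logarithm.

```latex
% Term frequency factor of equation (4) for the assumed values:
B_{t,d} = \frac{(1.2 + 1)\cdot 3}{1.2\left[(1 - 0.75) + 0.75 \cdot \tfrac{20}{25}\right] + 3}
        = \frac{6.6}{1.2 \cdot 0.85 + 3} = \frac{6.6}{4.02} \approx 1.64
% Contribution of term t to the document score in equation (5):
B_{t,d} \cdot \log\frac{100 - 5 + 0.5}{5 + 0.5}
        \approx 1.64 \cdot \log(17.36) \approx 1.64 \cdot 2.85 \approx 4.68
```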
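The two ranking models of section 2.3 map onto Lucene's built-in similarities: ClassicSimilarity implements the TF-IDF/VSM scoring of equations (1)-(3) and BM25Similarity implements equations (4)-(5). The following is a minimal configuration sketch, not the submitted system; the class names are standard Lucene, everything else is assumed.

```java
import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.ClassicSimilarity;
import org.apache.lucene.store.Directory;

public class RankingModels {

    // Returns a searcher configured for one of the two submitted runs:
    // ClassicSimilarity for the VSM run, BM25Similarity(k1 = 1.2, b = 0.75)
    // for the BM25 run.
    public static IndexSearcher searcherFor(Directory index, boolean useBM25) throws IOException {
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(index));
        if (useBM25) {
            searcher.setSimilarity(new BM25Similarity(1.2f, 0.75f));
        } else {
            searcher.setSimilarity(new ClassicSimilarity());
        }
        return searcher;
    }
}
```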
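The facet classification of section 3 can then be sketched as a majority vote over the top five hits of a training index in which every cited text span is stored together with its facet. The field name "facet" and the fallback label are assumptions made for illustration only.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;

public class FacetClassifier {

    // k-NN style classification: run the span query against the training
    // index, take the top five hits and return the facet with most votes.
    public static String classify(IndexSearcher trainingSearcher, Query spanQuery) throws IOException {
        Map<String, Integer> votes = new HashMap<>();
        for (ScoreDoc hit : trainingSearcher.search(spanQuery, 5).scoreDocs) {
            Document doc = trainingSearcher.doc(hit.doc);
            votes.merge(doc.get("facet"), 1, Integer::sum);  // "facet" is an assumed field name
        }
        // Majority vote; ties are broken arbitrarily in this sketch.
        return votes.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .orElse("Method");  // assumed fallback when nothing is retrieved
    }
}
```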
4 Evaluation

Overall we submitted two system runs for Task 1: one using the BM25 ranking model and one using the simple vector space model (VSM). See section 2.3 for the parameters used by these models.

The system performance for Task 1A was determined by measuring the sentence ID overlap between the sentences identified by the system and the gold standard sentences created by human annotators. Based on that, precision, recall and F1 score were calculated for each system run. The performance for Task 1B was measured by the proportion of correctly classified discourse facets, contingent on the expected response of Task 1A. The metrics used here are also precision, recall and F1 score.

The official evaluation results of our submitted system runs for Task 1A and Task 1B are shown in Table 1 and Table 2 respectively.

Table 1: Task 1A evaluation results for our system runs using the vector space model (VSM) and BM25.

  Ranking Model   Precision   Recall   F1 score
  VSM             0.085       0.138    0.105
  BM25            0.107       0.181    0.135

Table 2: Task 1B evaluation results for our system runs using the vector space model (VSM) and BM25.

  Ranking Model   Precision   Recall   F1 score
  VSM             0.917       0.158    0.269
  BM25            0.938       0.205    0.337

Judging by the official evaluation results, our proposed approaches yield competitive results. Especially the BM25 approach works well for both Task 1A and Task 1B: an F1 score of 0.135 on Task 1A is the third highest result among all system runs, not far behind the winning system with an F1 score of 0.146. An F1 score of 0.337 on Task 1B is the eighth highest result among all system runs, with the winning system having an F1 score of 0.408. Overall, 47 system runs were submitted to Task 1. The mean F1 score among all system runs is 0.088 for Task 1A and 0.208 for Task 1B.

5 Conclusion

In this report we described the approaches we followed to tackle the problems posed in Task 1A and Task 1B of the CL-SciSumm 2017 Shared Task. In preliminary tests we found that a combination of stop word removal and stemming, together with a disjunctive query strategy, works best. The official evaluation results confirmed the viability of our approach. We were also able to reuse the indexing infrastructure for a classification task, namely assigning categories to the cited text spans. In future work we plan to build on this infrastructure and investigate methods to enhance the process by integrating more sources of evidence. In particular, additional context information such as author or venue specific information might prove beneficial.

Acknowledgements

The Know-Center is funded within the Austrian COMET Program (Competence Centers for Excellent Technologies) under the auspices of the Austrian Federal Ministry of Transport, Innovation and Technology, the Austrian Federal Ministry of Economy, Family and Youth and by the State of Styria. COMET is managed by the Austrian Research Promotion Agency FFG.

References

1. Cabanac, G., Chandrasekaran, M.K., Frommholz, I., Jaidka, K., Kan, M.Y., Mayr, P., Wolfram, D.: Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2016). In: Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 299–300. ACM (2016)
2. Jaidka, K., Chandrasekaran, M.K., Rustagi, S., Kan, M.Y.: Overview of the CL-SciSumm 2017 Shared Task. In: Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017), Tokyo, Japan, CEUR (2017)
3. Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
4. Robertson, S.E., Walker, S., Jones, S., Hancock-Beaulieu, M.M., Gatford, M.: Okapi at TREC-3. NIST Special Publication 500-225, 109 (1995)
5. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Communications of the ACM 18(11), 613–620 (1975), http://doi.acm.org/10.1145/361219.361220
6. Silva, C., Ribeiro, B.: The importance of stop word removal on recall values in text categorization. In: Proceedings of the International Joint Conference on Neural Networks, vol. 3, pp. 1661–1666. IEEE (2003)