RALI System Description for CL-SciSumm 2016 Shared Task

Bruno Malenfant and Guy Lapalme
Université de Montréal, CP 6128, Succ. Centre-Ville, Montréal, Québec, Canada, H3C 3J3
malenfab@iro.umontreal.ca, lapalme@iro.umontreal.ca

Abstract. We present our approach to the CL-SciSumm 2016 shared task. We propose a technique to determine the discourse role of a sentence, differentiating between words linked to the topic of the paper and words linked to the facet of the scientific discourse. Using that information, histograms are built over the training data to infer a facet for each sentence of the paper (result, method, aim, implication or hypothesis). This helps us identify the sentences that best represent a citation of the same facet. We use this information to build a structured summary of the paper as an HTML page.

1 Introduction

A researcher's task is to read scientific papers in order to compare them, to identify new problems, to position a work within the current literature and to elaborate new research propositions [8]. This implies reading many papers before finding the ones we are looking for. With the growing number of publications, this task is getting harder, and it is becoming important to have a fast way of determining the utility of a paper for our needs. A first solution is to use web sites such as CiteSeer [7], arXiv, Google Scholar and Microsoft Academic Search that provide cross-referenced citations between papers. Another approach is the automatic summarization of a group of scientific papers dealing with a given subject.

This year's CL-SciSumm competition for the summarization of computational linguistics papers proposes a community approach to summarization; it is based on the assumption that citances, the set of citation sentences pointing to a reference paper, can be used as a measure of its impact. The task implies identifying the text a citance refers to in the reference paper as well as a facet (aim, result, method, implication or hypothesis) for the referred text.

We are building a system that, given a topic, generates a survey of that topic from a set of papers. That system uses citations as the primary source of information for building an annotated summary, so it must be able to identify the purpose, polarity and facet of a citation in order to direct the reader towards the most relevant information. The summary is built by selecting sentences from the cited paper and from the citations; this process uses a similarity function between sentences. The resulting summaries are presented in HTML format with their annotations and links to the original paper. The only task that is not performed by our system is finding the text referred to by a citation; we intend to use the information already found by our system (the facets of citations and sentences) to complete that task.

We already had some experience in dealing with scientific papers and their references, having participated in Task 2 of the Semantic Publishing Challenge at ESWC 2014 (Extended Semantic Web Conference) on the extraction and characterization of citations.

A short review of previous work follows in Sect. 2. We summarize the task in Sect. 3 and our techniques for extracting information in Sect. 4. Finally, Sect. 5 presents our results.
2 Previous Work

There has been growing attention to the information carried by citations and their surrounding sentences (citances). They contain information useful for rhetorical classification [18] and for technical surveys [14], and they emphasize the impact of papers [12]. Qazvinian et al. [16] and Elkiss et al. [6] showed that citations provide information not present in the abstract.

Since the first works of Luhn [11] and Edmundson [5], many researchers have developed methods for finding the most relevant sentences of papers in order to produce abstracts and summaries. Many metrics have been introduced to measure the relevance of parts of text, either using special-purpose formulas [21] or learned weights [10]. The hypothesis behind the CL-SciSumm task is that important sentences can be pointed out by other papers: a citation indicates a paper considered important by the author of the citing paper.

Another line of study on scientific papers is the classification of their sentences. Teufel and Moens [19] identified the rhetorical status of sentences using a naive Bayes classifier.

To find citations inside a paper, we need to analyse the references section. Besagni et al. [1] developed a method using pattern recognition to extract fields from the references, while Powley and Dale [15] processed citations and references simultaneously, using information from one task to help complete the other.

3 Task Description

For this year's competition we were given 30 topics: 10 for training, 10 for tuning and 10 for testing [9]. Each topic is composed of a Reference Paper (RP) and some Citing Papers (CPs). The citing papers contain citations pointing to the RP. An annotation file is given for each topic; it contains information about each citation, the citation marker and the citance. There are two mandatory tasks (Task 1A and Task 1B) and an optional task (Task 2, see http://wing.comp.nus.edu.sg/cl-scisumm2016/).

Task 1A: Find the part of the RP that each citance points to. This will be called the referenced text.

Task 1B: Once the referenced text is identified, attribute a facet to it. A facet is one of: result, method, aim, implication or hypothesis.

Task 2: Build a summary of the RP using the referenced text identified in Task 1A.

Both the training and the development sets of topics contain the expected results for these tasks. The next section describes how our system addresses them.

4 Our Approach

For the first task, we have to find the referenced text and its facet. We hypothesized that the referenced text should consist of sentences sharing the same facet as the citance, and we use this hypothesis to reduce the set of sentences to choose from for the reference. This is why we execute Task 1B on all the sentences of the RP and on all the citances prior to Task 1A. We now present how we determine the facet of a citance, then the facet of the sentences of the RP, and finally the referenced text.

4.1 Task 1B: Facet Identification

Our goal is to be able to use our system on papers from different domains without having to retrain it. Toward that objective, our system only uses words that are not domain specific. Patrick Drouin [3, 4] compiled such a list of words in his Transdisciplinary Scientific Lexicon (TSL). This lexicon comprises 1627 words such as acceptance, gather, newly, severe... We denote the set of words from the lexicon by L and write w ∈ L for a word of the lexicon.
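To illustrate this first step, here is a minimal Python sketch of reducing a sentence to its TSL words with NLTK tokenization. It is only an illustration: the lexicon file name and its one-word-per-line format are assumptions, and the function names are ours, not part of the released resources or of our actual code.

    # Minimal sketch (not the actual RALI code): reduce a sentence to its TSL words.
    # Assumes the lexicon is stored one word per line and that the NLTK
    # tokenizer data (punkt) is installed.
    import nltk

    def load_tsl(path="tsl_words.txt"):
        """Load the Transdisciplinary Scientific Lexicon as a set of lowercased words."""
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f if line.strip()}

    def tsl_words(sentence, lexicon):
        """Tokenize a sentence with NLTK and keep only the tokens found in the lexicon."""
        tokens = nltk.word_tokenize(sentence.lower())
        return [t for t in tokens if t in lexicon]

    # Example usage:
    # lexicon = load_tsl()
    # tsl_words("We gather newly annotated data to test this hypothesis.", lexicon)
    # -> e.g. ['gather', 'newly', 'hypothesis']   (depending on the lexicon content)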
We trained two classifiers, one to attribute a facet to the sentences of the RP and one to attribute a facet to citances. For each facet, we determine a word distribution in the form of a histogram, using only words that appear in the TSL. This computation yields, for each facet, a count of every lexicon word over all the referenced texts annotated with that facet; the facet with the highest score is then chosen for a sentence.

To train our system, we extract the reference sentences from each annotation together with their assigned facet. Each sentence is tokenized using the NLTK library in Python and only words from the TSL are kept. Our dataset thus consists of pairs of word lists and facets: D = [(ws_i, f_i)].

We build a profile h_f for each facet using a histogram. For each word of the lexicon, we compute the number of times it appears in the sentences paired with a given facet:

    h_f(w, D) = Σ [ cnt(w, ws_i) | (ws_i, f) ∈ D ]

    cnt(w, ws_i) = Σ [ 1 | w ∈ ws_i ]

When a word appears more than once in a sentence, all its occurrences are counted.

Once the histograms are built, we use them to find the facet of a new sentence. First, we extract from the sentence the words that are part of the lexicon, yielding a list of words p. Then a score s_f is computed for each facet by adding the profile values of each word for that facet, and the facet with the highest score is assigned to the new sentence (a small code sketch of this procedure is given at the end of this subsection):

    s_f(p, D) = Σ_{w ∈ p} h_f(w, D)

Looking closely at the resulting profiles, we saw that some words have a negative effect on finding the facet. To find a better sublist of words within the TSL, we used a genetic algorithm over a population of lists of words. A genetic algorithm starts with an initial population (a set of possible solutions) and tries to find better solutions by applying small changes to existing ones. In our case, a solution is a subset of words L_i of L, and the initial population is built from random subsets. To build the next generation, we use three different operations:

1. Adding a random word to an existing solution: L'_i = L_i ∪ {w} where w ∈ (L − L_i).
2. Removing a random word from an existing solution: L'_i = L_i − {w} where w ∈ L_i.
3. Combining two existing solutions: L'_i = L_j ∪ L_k.

Once enough solutions are built for the new population, each solution is evaluated by cross-validation with the histogram classifier, and the best-performing lists are kept for the next generation. We use the same technique over the dataset consisting of the citance texts and their facets.
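To make the histogram-based facet scoring of this subsection concrete, here is a minimal Python sketch; the facet labels follow the task description, but the function and variable names are ours and the actual implementation may differ.

    # Minimal sketch of the histogram-based facet classifier described above.
    # Function and variable names are ours; they are not taken from the RALI code.
    from collections import Counter, defaultdict

    FACETS = ["aim", "method", "result", "implication", "hypothesis"]

    def build_profiles(dataset):
        """dataset: list of (tsl_words, facet) pairs, e.g. [(["gather", "newly"], "method"), ...].
        Returns one word-count histogram h_f per facet; repeated words count every occurrence."""
        profiles = defaultdict(Counter)
        for words, facet in dataset:
            profiles[facet].update(words)
        return profiles

    def classify(tsl_words, profiles):
        """Assign the facet whose profile gives the highest summed count s_f over the sentence words."""
        scores = {facet: sum(profiles[facet][w] for w in tsl_words) for facet in FACETS}
        return max(scores, key=scores.get)

    # Example usage (toy data):
    # profiles = build_profiles([(["propose", "method"], "method"), (["show", "result"], "result")])
    # classify(["method", "propose"], profiles)   # -> "method"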
4.2 Task 1A: Finding the Sentences Referred to by Citances

Having determined the facet of the sentences of the RP and of the citances, we are now ready to assign a referenced text to each citance from the CPs. Our hypothesis is that a citance should have the same facet as the text it refers to. We therefore extract Q_f, the subset of sentences of the RP that have the same facet f as a citance c_i. To choose the sentence of the RP referred to by the citance, we look for the sentence of Q_f that is the most similar to the citance c_i, using the similarity function sim_mcs defined by Mihalcea, Corley and Strapparava [13]:

    sim_mcs(P_1, P_2) = 1/2 ( hs(P_1, P_2) + hs(P_2, P_1) )                      (1)

    hs(P_i, P_j) = ( Σ_{w ∈ P_i} ms(w, P_j) × idf_w ) / ( Σ_{w ∈ P_i} idf_w )    (2)

    ms(w, P_j) = max_{v ∈ P_j} sim_wup(w, v)                                     (3)

This similarity function between sentences P_1 and P_2 (Equation 1) averages two values: the similarity from P_1 to P_2 and the similarity from P_2 to P_1. The similarity from one sentence P_i to the other P_j is computed by first pairing each word w ∈ P_i of the first sentence with a word v ∈ P_j of the second one; a word is paired with the word that is the most similar to it (Equation 3). For each pair (w, v), the similarity value is weighted by the inverse document frequency idf_w of the first word (Equation 2). The average of these weighted similarity values yields the similarity between P_i and P_j.

We use only nouns, verbs, adjectives and adverbs for this comparison; the NLTK POS tagger was used to determine the tag of each word. Since we believe that the domain of the paper is important for computing this similarity, we use all such words, not only the ones that are part of the TSL.

Mihalcea et al. [13] reported that, among the possible metrics for comparing words, the one proposed by Wu and Palmer [20] yielded good results (denoted sim_wup). This metric is also available in the NLTK package. To use it, we map each word to its synonym group (synset) in WordNet. The IDF was computed for each synset over the set of all documents contained in the ACL Anthology Network (http://clair.eecs.umich.edu/aan/index.php).
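As a rough illustration of Equations (1)-(3), here is a minimal Python sketch using NLTK's WordNet interface. It simplifies our actual setup: it takes the first synset of each word, defaults unknown IDF values to 1.0, and receives the IDF weights as a plain dictionary instead of computing them over the ACL Anthology Network, so it should be read as an approximation rather than the exact implementation.

    # Minimal sketch of the sim_mcs similarity (Equations 1-3).
    # Requires the WordNet corpus (nltk.download('wordnet')).
    from nltk.corpus import wordnet as wn

    def sim_wup(w, v):
        """Wu-Palmer similarity between the first WordNet synsets of two words (0 if unknown)."""
        sw, sv = wn.synsets(w), wn.synsets(v)
        if not sw or not sv:
            return 0.0
        return sw[0].wup_similarity(sv[0]) or 0.0

    def hs(p_i, p_j, idf):
        """Directed similarity from word list p_i to word list p_j (Equation 2)."""
        num = sum(max((sim_wup(w, v) for v in p_j), default=0.0) * idf.get(w, 1.0) for w in p_i)
        den = sum(idf.get(w, 1.0) for w in p_i)
        return num / den if den else 0.0

    def sim_mcs(p1, p2, idf):
        """Symmetric sentence similarity (Equation 1) between two lists of content words."""
        return 0.5 * (hs(p1, p2, idf) + hs(p2, p1, idf))

    # Example usage (toy IDF values):
    # idf = {"summarization": 3.2, "citation": 2.5}
    # sim_mcs(["automatic", "summarization"], ["citation", "summary"], idf)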
4.3 Task 2: Summarization

Summarizing from multiple sources adds three problems [17]:

1. Redundancy: a paper is often cited over and over for the same reason, resulting in many citances having the same subject.
2. Identifying important differences between sources: our goal is to find the citances and references that bring new and important information to the summary.
3. Coherence: since sentences come from many sources, we want to ensure that the summary forms a unified whole.

For Task 2, we chose to use the Maximal Marginal Relevance (MMR) criterion proposed by Carbonell and Goldstein [2]. Their technique is presented in Equation 4, in which R is the list of candidate sentences and V is the summary; they propose to use the title of the research paper as the starting query Q.

    arg max_{s_i ∈ R \ V} [ λ · sim_mcs(s_i, Q) − (1 − λ) · max_{s_j ∈ V} sim_mcs(s_i, s_j) ]    (4)

At each iteration, the algorithm adds a sentence s_i to V. Sentences are chosen so that they bring new information to the summary (Points 1 and 2) while keeping a certain amount of similarity with the query (Point 2). λ must be adjusted to balance between adding a sentence very similar to the query and a sentence very different from the ones already in the summary V. We use the same metric (sim_mcs) as for Task 1A to compare sentences; a sketch of this selection loop is given at the end of this section.

We divided the summarization process in two steps: adding sentences from the citances (R = CT) and adding sentences from the paper (R = RP). In the first step, the algorithm chooses sentences from the set of citances until it reaches 150 words; for that part, we use λ = 0.3 to give priority to similarity with the query, trying to remove meaningless citances. Since citances have been identified as bringing new information not present in the original paper, we believe it is important to keep them in the summary. The summary is then completed, up to 250 words, using sentences chosen from the RP, with λ = 0.7: since these sentences all come from the RP, most of them are about the same subject, and we want to give priority to sentences that are more different from those already selected.

The summary is built in an XML format. Each sentence is identified by its position: the id of the paper it was extracted from and the sid and ssid attributes found in the XML source files. The citances also contain the id of the referred paper; this information makes it possible to point a reader towards the corresponding paper. To help analyse the summaries, our software also builds an HTML page presenting the extracted information (see Fig. 1).
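The following minimal Python sketch shows the greedy selection loop of Equation 4 under the two-step setting described above. The similarity function is passed in as a parameter (sim_mcs in our case), and the word budgets and λ values are those given in the text; the function name, the `sim` parameter and the `sim_mcs_fn` wrapper in the usage comment are ours, so this is an illustration of the criterion rather than our exact implementation.

    # Minimal sketch of the greedy MMR selection (Equation 4).
    # `sim` stands for any sentence similarity function (sim_mcs in our case).

    def mmr_select(candidates, query, sim, lam, max_words, summary=None):
        """Greedily add sentences from `candidates` to `summary` until `max_words` is reached."""
        summary = list(summary) if summary else []
        remaining = [s for s in candidates if s not in summary]
        while remaining and sum(len(s.split()) for s in summary) < max_words:
            def mmr_score(s):
                redundancy = max((sim(s, t) for t in summary), default=0.0)
                return lam * sim(s, query) - (1.0 - lam) * redundancy
            best = max(remaining, key=mmr_score)
            summary.append(best)
            remaining.remove(best)
        return summary

    # Two-step usage as described in the text (citances first, then RP sentences):
    # summary = mmr_select(citance_sentences, title, sim_mcs_fn, lam=0.3, max_words=150)
    # summary = mmr_select(rp_sentences, title, sim_mcs_fn, lam=0.7, max_words=250, summary=summary)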
5 Evaluation

5.1 Task 1

We present our results for facet attribution to citances and to reference texts. The data we received is divided in two: the training set contains 197 citance sentences and 247 reference-text sentences; the development set contains 273 citance sentences and 330 reference-text sentences. We first train our system on the training data (T) and then retrain it on the training and development sets together (TD). In each case, we test the result on both sets. We show the results for the simple training of the histogram and for the training using the genetic algorithm (gen_T) to select the list of words to consider. For comparison purposes, we also trained our histograms without limiting the words to the TSL. The genetic algorithm was run for 25 generations; each generation started with 1 000 lists of words, to which 9 000 lists were added using the proposed mutations, bringing the number of lists to 10 000.

Table 1. Success rate for attributing a facet to citances.

                          Tested on
    Trained on     Train    Dev    Train + Dev
    T (no TSL)      47%     61%       59%
    T               65%     52%       57%
    TD (no TSL)     56%     61%       59%
    TD              61%     57%       58%
    gen_T           74%     43%       55%

Table 2. Success rate for attributing a facet to reference texts.

                          Tested on
    Trained on     Train    Dev    Train + Dev
    T (no TSL)      60%     60%       60%
    T               74%     46%       57%
    TD (no TSL)     60%     61%       61%
    TD              70%     59%       64%
    gen_T           76%     35%       51%

The results of these experiments are presented in Table 1 and Table 2. Using the training set T gives good results on itself but lower results when applied to the development set. After training on both sets TD (Training + Development), the results on the development set rise at the expense of the results on the training set. For citances, the genetic algorithm yields better results on the training set only; it does not help to obtain better histograms. Considering that fact, we wonder whether it is possible to obtain better results using histograms or whether we have reached the limit of that technique. Limiting our choice of words to the TSL did not give lower results; it remains to be tested whether histograms built with the TSL will perform better in a domain other than computational linguistics.

Once we had identified the facets, we ran our script for finding the reference text. It reached an F1 score of 0.095 on the training set and 0.052 on the development set (Table 3). We reduced the search space for the referred text using the facet of the citance; since the identification of the facet is not perfect, this reduction might remove a sentence we are looking for. In the future, we have to test our approach with all sentences, instead of the reduced set, to see whether this reduction of the search space hurts more than it helps.

Table 3. F1 scores for finding the reference text.

          Train    Dev
    F1    0.095    0.052

5.2 Task 2

Figure 1 shows the HTML interface we generated for presenting the results of our system. The top of the page lets the user choose among the different topics that were summarised. For each topic, the left side presents the text of the CPs and of the RP, with the sentences segmented and the citances identified; the right side contains the different summaries built by our software (using different values of λ) and the gold-standard summary. Each paper links to its PDF version in the ACL Anthology (http://aclanthology.info/).

In the top part of the figure, the left side shows the RP divided into sentences and the right side shows a summary built by choosing five sentences from the set of citances with λ = 0.3; these sentences were selected by the MMR algorithm to be as different as possible from one another. The bottom screenshot of Fig. 1 presents one of the CPs on the left; the citance and citation are colored so that they are easy to identify, and the third sentence from the top was selected by the algorithm for the summaries.

6 Conclusion

We presented the use of a distinction between topic and non-topic (TSL) words for determining the facet of sentences in a paper. This technique is useful because it lets our system work in a domain-independent way. We obtained good results with a simple histogram; we still have to test our histograms on other domains to see whether they also yield good results there. Our experiments with a genetic algorithm for refining the list of words did not show any improvement.

We also presented our interface for browsing the results of our system. It presents the RP, the CPs and the summaries, with links to the original papers, and helps the reader browse through a topic.

References

1. Dominique Besagni, Abdel Belaïd, and Nelly Benet: A Segmentation Method for Bibliographic References by Contextual Tagging of Fields. ICDAR '03, Proceedings of the Seventh International Conference on Document Analysis and Recognition, 1:384–388 (2003)
2. Jaime G. Carbonell and Jade Goldstein: The Use of MMR, Diversity-based Reranking for Reordering Documents and Producing Summaries. Research and Development in Information Retrieval - SIGIR, 335–336 (1998)
3. Patrick Drouin: Extracting a Bilingual Transdisciplinary Scientific Lexicon. Proceedings of eLexicography in the 21st Century: New Challenges, New Applications. Presses universitaires de Louvain, Louvain-la-Neuve, 7:43–54 (2010)
4. Patrick Drouin: From a Bilingual Transdisciplinary Scientific Lexicon to Bilingual Transdisciplinary Scientific Collocations. Proceedings of the 14th EURALEX International Congress. Fryske Akademy, Leeuwarden/Ljouwert, The Netherlands, 296–305 (2010)
5. Harold P. Edmundson: New Methods in Automatic Extracting. Journal of the ACM (JACM), 16(2):264–285 (1969)
6. Aaron Elkiss, Siwei Shen, Anthony Fader, Günes Erkan, David J. States, and Dragomir R. Radev: Blind Men and Elephants: What Do Citation Summaries Tell Us About a Research Article? Journal of the American Society for Information Science and Technology - JASIS, 59(1):51–62 (2008)
7. C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence: CiteSeer: An Automatic Citation Indexing System. Proceedings of the Third ACM Conference on Digital Libraries, 89–98 (1998)
8. Kokil Jaidka, Christopher S.G. Khoo, Jin-Cheon Na, and Wee Kim Wee: Deconstructing Human Literature Reviews – A Framework for Multi-Document Summarization. Proceedings of the 14th European Workshop on Natural Language Generation, 125–135 (2013)
9. Kokil Jaidka, Muthu Kumar Chandrasekaran, Sajal Rustagi, and Min-Yen Kan: Overview of the 2nd Computational Linguistics Scientific Document Summarization Shared Task (CL-SciSumm 2016). To appear in the Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2016), Newark, New Jersey, USA (2016)
10. Julian Kupiec, Jan O. Pedersen, and Francine Chen: A Trainable Document Summarizer. Research and Development in Information Retrieval - SIGIR, 68–73 (1995)
11. Hans P. Luhn: The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development - IBMRD, 2(2):159–165 (1958)
12. Qiaozhu Mei and ChengXiang Zhai: Generating Impact-Based Summaries for Scientific Literature. Meeting of the Association for Computational Linguistics - ACL, 816–824 (2008)
13. Rada Mihalcea, Courtney Corley, and Carlo Strapparava: Corpus-based and Knowledge-based Measures of Text Semantic Similarity. AAAI, 6:775–780 (2006)
14. Saif Mohammad, Bonnie J. Dorr, Melissa Egan, Ahmed Hassan, Pradeep Muthukrishnan, Vahed Qazvinian, Dragomir R. Radev, and David M. Zajic: Using Citations to Generate Surveys of Scientific Paradigms. North American Chapter of the Association for Computational Linguistics - NAACL, 584–592 (2009)
15. Brett Powley and Robert Dale: Evidence-based Information Extraction for High Accuracy Citation and Author Name Identification. RIAO '07, Large Scale Semantic Access to Content, 618–632 (2007)
16. Vahed Qazvinian, Dragomir R. Radev, Saif Mohammad, Bonnie J. Dorr, David M. Zajic, M. Whidby, and T. Moon: Generating Extractive Summaries of Scientific Paradigms. Journal of Artificial Intelligence Research, 46:165–201 (2013)
17. Dragomir R. Radev, Eduard Hovy, and Kathleen McKeown: Introduction to the Special Issue on Summarization. Computational Linguistics, 28(4):399–408 (2002)
18. Advaith Siddharthan and Simone Teufel: Whose Idea Was This, and Why Does it Matter? Attributing Scientific Work to Citations. North American Chapter of the Association for Computational Linguistics - NAACL, 316–323 (2007)
19. Simone Teufel and Marc Moens: Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status. Computational Linguistics, 28(4):409–445 (2002)
20. Zhibiao Wu and Martha Palmer: Verbs Semantics and Lexical Selection. ACL '94, Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 133–138 (1994)
21. Peter N. Yianilos and Kirk G. Kanzelberger: The LikeIt Intelligent String Comparison Facility. NEC Research Institute (1997)

Fig. 1. Screen shots of the HTML interface. The top part shows the RP and the corresponding summary. The bottom part shows a CP, in which we see which sentences from the CP were chosen for the summary.