<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Lexical and Syntactic cues to identify Reference Scope of Citance</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Peeyush Aggarwal</string-name>
          <email>peeyushaggarwal94@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richa Sharma</string-name>
          <email>richa.sharma@bml.edu.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>BML Munjal University</institution>
          ,
          <addr-line>Gurgaon</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Bharti Vidyapeeth College of Engineering</institution>
          ,
          <addr-line>Delhi</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present our system addressing Task 1 of the CL-SciSumm Shared Task at BIRNDL 2016. Our system uses lexical and syntactic dependency cues and applies a rule-based approach to extract the text spans in the Reference Paper that accurately reflect the citances. Further, we use lexical cues to identify the discourse facet of the paper to which the cited text belongs. The lexical and syntactic cues are obtained from the pre-processed text of the citances and the reference paper. We report the results that our system obtains on the development set for identifying the reference scope of citances.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Syntactic Analysis</kwd>
        <kwd>Scientific Document Summarisation</kwd>
        <kwd>Bag of Words</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The scientific research community needs different viewpoints of research
contributions in summarized form. The abstract of a research contribution presents a summary
from the perspective of its author(s). Citations of a reference paper reflect the viewpoint of
the citing authors on that paper, possibly only in a certain context.
A summary drawn from the citations of a reference paper can therefore put forward a different
and interesting context for that paper. In recent years there have been several efforts
towards extracting the reference scope of citances and building such citation-based summaries,
for example [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Jaidka et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] have shown through their Computational
Linguistics Summarization (CL-Summ) pilot task that citation-based summaries of
scientific documents are important for understanding the different
perspectives on a reference paper. Following that pilot task, the Computational Linguistics
Scientific Document Summarization (CL-SciSumm-2016) shared task has been designed
with the goal of exploring automated summarization of scientific contributions in the
computational linguistics research domain.
      </p>
      <p>
        The organizers of the CL-SciSumm shared task have divided the task into two parts: (1)
for each citance, identify the spans of text (cited text spans) in the Reference Paper
(RP) that most accurately reflect the citance, and identify the facet of the paper they
belong to; (2) generate a structured summary of the RP from the cited text spans of
the RP. Task 2 is optional. Task 1, however, is a prerequisite for creating a citation-based
summary of the RP, which makes it a crucial step in creating
citation-based summaries of any scientific document [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The corpus of the CL-SciSumm
shared task has been created by sampling documents from the ACL Anthology corpus
and selecting their citing papers [9].
      </p>
      <p>We have worked on task 1 (‘a’ and ‘b’) to develop our system for identifying the
reference scope of the citance in the RP. We present the details of our system in
Section 2. This is followed by the evaluation of our system in Section 3
and our observations in Section 4. We present concluding remarks in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Our System</title>
      <p>In order to develop our system for the CL-SciSumm shared task, we first reviewed
one sample topic (one RP and its citing papers) from the training set, and one from the
development set. Manual review of these two samples revealed that, although the shared
task calls for semantic analysis of the statements in the corpus, such
analysis is challenging owing to the nature of the corpus. First, the corpus is a collection of
scientific, technical articles written in precise technical language, so
authors rarely vary their wording with similar-meaning words. This leaves little scope for
text semantic similarity measures. Second, the citance in the citing
paper refers to the text spans of the RP in different contexts. Citing texts often
carry no meaningful content or information from the RP except for a word or two.
For example, the citing statement below conveys little about the
RP except for two hinting words, RFTagger and German:
<disp-quote><p>For German, we show results for RFTagger (Schmid and Laws, 2008).</p></disp-quote>
Having found mostly such examples during manual review, we decided against relying on
overlapping sub-sequences of words between the statement in the Citing Paper (CP)
and its corresponding, reflective statements in the RP: the words shared by
the statements in the CP and in the RP usually do not form a subsequence. We therefore
resolved to work with lexical cues (n-grams in a bag-of-words approach rather than
subsequences of words) and syntactic cues. The following sub-sections
summarize our approach and the heuristics used in our system. We have implemented
our solution in Python, which offered us several advantages during development; we discuss those
with our observations in Section 4. Figure 1 below gives an overview of the approach
implemented in our Python-based system for task 1 of CL-SciSumm:</p>
      <p>Figure 1. Overview of our approach (flowchart steps):</p>
      <list list-type="simple">
        <list-item><p>Generate a bag-of-words for the statements in the RP and for the cited text from the annotation file, after removing stop-words.</p></list-item>
        <list-item><p>Apply the Porter algorithm to get the stemmed form of each word in the generated bags-of-words.</p></list-item>
        <list-item><p>Parse the cited text from the annotation file and the statements in the RP using the Stanford Parser.</p></list-item>
        <list-item><p>(i) Identify the most frequent bi-grams (unigrams, if no bi-grams match) between the generated bags-of-words.</p></list-item>
        <list-item><p>(ii) Compute the distance between the words of the bi-gram in their source statement in the RP. Not applicable for unigrams.</p></list-item>
        <list-item><p>(iii) Identify the dependency overlap between the cited text in the CP and the RP by matching dependency tags and the position of words (picked from the bi-grams).</p></list-item>
        <list-item><p>Use the three heuristics (i) – (iii) and apply a rule-based algorithm (rules and matching confidence levels learnt on the training and development sets) to identify the reference scope of citances.</p></list-item>
      </list>
      <p>
Lexical cues, in our system, are gathered in terms of bi-grams (two lexical tokens
from the bag-of-words) and unigrams (where bi-grams are not available). In the context
of this study, we extend the notion of a bi-gram to any group of two matching
words between the cited text in the CP and its reference scope in the RP; the two words
need not be adjacent. As discussed
above, most citances refer to two (or more) lexical units in the reference scope
of the RP. We have therefore limited the scope of our solution to bi-grams. We first
parse the XML version of the reference paper to get the individual statements of the RP
for further processing. Then, we generate bags-of-words after removing stop-words
from the citing text and from the statements of the RP, using the commonly used
Glasgow list of stop-words<sup>2</sup>. We identify the (matching) bi-grams
after converting the lexical units in the bags-of-words to their stemmed form using
Porter’s stemmer<sup>3</sup>. Porter’s stemmer is often criticized for not returning the correct root
form of a word. This limitation does not affect
our results, however, since we apply the stemmer to both bags-of-words that are
matched against each other: a word is reduced to the same stem on both sides, so the
regular-expression comparison of the two bags-of-words does not introduce any discrepancy.</p>
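      <p>The pre-processing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation of the system: STOP_WORDS is a tiny placeholder for the Glasgow list, toy_stem is a crude suffix stripper standing in for Porter’s stemmer, and all function names are hypothetical.</p>
      <preformat preformat-type="code">
```python
import re

# Tiny placeholder stop-word list (assumption); the Glasgow list is much longer.
STOP_WORDS = {"a", "an", "and", "as", "be", "can", "for", "in", "of",
              "such", "the", "we"}


def toy_stem(word):
    """Crude suffix stripper standing in for Porter's stemmer (assumption)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            word = word[: -len(suffix)]
            break
    if word.endswith("e") and len(word) >= 3:
        word = word[:-1]
    return word


def bag_of_words(statement, stem=toy_stem):
    """Lower-case and tokenize, drop stop-words, then stem what remains."""
    tokens = re.findall(r"[a-z0-9]+", statement.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]


def matching_words(citance, rp_statement):
    """Stemmed words shared by a citing statement and an RP statement."""
    return set(bag_of_words(citance)).intersection(bag_of_words(rp_statement))


citance = "For German, we show results for RFTagger (Schmid and Laws, 2008)."
print(bag_of_words(citance))
```
</preformat>
      <p>Because the same stemmer is applied to both sides, it does not matter that a stem such as "featur" is not a dictionary word; it only has to be identical for "feature" and "features".</p>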
      <sec id="sec-2-1">
        <title>Syntactic Dependency Analysis</title>
        <p>
          Syntactic dependency analysis has been used extensively for analysing statements at a
granular level in recognizing textual entailment [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and in question-answering systems
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Similarity in syntactic roles and dependency overlap between the statements under
analysis have proved to be effective heuristics for semantic similarity. Finding the
reference scope of citations can also benefit from dependency overlap, although
similarity in syntactic roles is difficult to find between citations and their corresponding
reference scope in the RP. We have used the Stanford dependency parser [7] to find dependency
overlaps for the identified bi-grams between the citing statement in the CP and its reflective
statement in the RP. After obtaining the parsed output, the words in each dependency relation
are again converted to their stemmed form (using Porter’s stemmer) to facilitate
matching between different forms of the same word, such as use and using.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Heuristics to identify Reference Scope of Citance</title>
        <p>We have worked with the following heuristics for task 1a of the CL-SciSumm task:</p>
        <p>Most frequent bi-grams. We search for matching words (in stemmed form)
between the citing statement and the statements in the RP. Having obtained a list
of matching statements, we search for the most frequent words in this
list of statements from the RP. The count of matching words in a
statement of the RP varies from zero to five or six. We observe that considering
bi-grams for ranking the statements in the RP is of help. If none of the statements in the
RP has matching words with the bag-of-words of the citing
statement, then we output ‘NaN’ for that citance relationship. There are
instances where the citing statement was truncated inadvertently because the
end of the statement was marked incorrectly while preparing the corpus. If only a
unigram matches between the citing statement and the source
statement in the RP, then its weight in the matching statement is computed as the
ratio of its occurrence count to the size of the bag-of-words of that matching
statement. The statement with the highest weight is assigned rank 1. If more
than two statements have exactly the same weight with the same words, then all
these statements are assigned rank 1. If the words differ among such
equally weighted statements, then the statement containing the word that is most frequent
across the various statements is assigned the highest rank.</p>
        <p><sup>2</sup> http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words</p>
        <p><sup>3</sup> https://pypi.python.org/pypi/stemming/1.0</p>
        <p>If two or more words match, then we rank the
statements using bi-grams for weight assignment. We identify the most
frequent bi-gram across the statements that have matching words with
the citing statement. For this most frequent bi-gram, we assign each source statement
from the RP a weight equal to the ratio of the occurrence count of the bi-gram’s words to
the size of that statement’s bag-of-words. The
matching statement with the highest weight is ranked highest and is reported
as the reflective statement for the cited statement under consideration. If
more than one statement has a similar weight, then the ranking algorithm
considers the remaining two heuristics discussed below.</p>
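        <p>The weight assignment for this heuristic can be sketched as follows. This is only an illustration under assumptions: bags-of-words are taken to be lists of stemmed tokens, a bi-gram is any pair of matching words, and the function names are hypothetical rather than taken from the system.</p>
        <preformat preformat-type="code">
```python
from collections import Counter
from itertools import combinations


def most_frequent_bigram(citance_bag, rp_bags):
    """Pair of citance words that co-occurs in the most RP statements."""
    citance_words = set(citance_bag)
    counts = Counter()
    for bag in rp_bags:
        shared = sorted(citance_words.intersection(bag))
        counts.update(combinations(shared, 2))
    if not counts:
        return None  # fewer than two matching words in every statement
    return counts.most_common(1)[0][0]


def statement_weight(words, bag):
    """Occurrences of the given words divided by the bag-of-words size.

    Pass a one-element tuple for the unigram case and the bi-gram otherwise.
    """
    if not bag:
        return 0.0
    return sum(bag.count(word) for word in words) / len(bag)


citance_bag = ["global", "featur", "context", "help"]
rp_bags = [
    ["classif", "summary", "global", "featur"],
    ["global", "featur", "extract", "whole", "document"],
    ["context", "word", "help"],
]
bigram = most_frequent_bigram(citance_bag, rp_bags)
weights = [statement_weight(bigram, bag) for bag in rp_bags]
print(bigram, weights)
```
</preformat>
        <p>In this toy input the pair (featur, global) occurs in two statements, so it is selected, and the shorter of the two statements receives the higher weight (0.5 against 0.4).</p>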
      </sec>
      <sec id="sec-2-3">
        <title>Distance between tokens in frequent bi-grams</title>
        <p>This heuristic considers
the distance between the words or tokens of the most frequent bi-gram. It
helps in resolving the ranks of the matching statements from the RP
when the most-frequent-bi-gram heuristic assigns similar weights to more than
one statement. However, this heuristic is not applicable where only unigrams have
been found to match.</p>
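        <p>The text does not fix how the distance is measured. A minimal sketch, assuming that the distance is the smallest difference between the token positions of the two bi-gram words in the (stemmed) source statement, could look like this; bigram_distance is a hypothetical name.</p>
        <preformat preformat-type="code">
```python
def bigram_distance(tokens, bigram):
    """Smallest positional gap between the two bi-gram words in a statement.

    Returns None when either word is absent, which also covers the unigram
    case where the heuristic does not apply.
    """
    first, second = bigram
    first_positions = [i for i, tok in enumerate(tokens) if tok == first]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second]
    if not first_positions or not second_positions:
        return None
    return min(abs(i - j) for i in first_positions for j in second_positions)


s2_tokens = ("such a classif can be seen as a not-always-correct "
             "summary of global featur").split()
print(bigram_distance(s2_tokens, ("global", "featur")))
```
</preformat>
        <p>Adjacent words give a distance of 1; a lower distance is preferred when statements tie on weight.</p>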
        <p>Dependency Overlap Count. We determine the dependency overlap for the
frequent bi-grams between the cited text in the CP and its corresponding
reflective statement in the RP. We extract those dependencies that contain either of
the words or tokens of the bi-gram. A match is found if the
dependency tags match, the stemmed forms of the tokens match, and the token is
in the identical position (either the governing position or the dependent position).
The following example illustrates how the dependency overlap is found.
Consider the following citing statement from the development set for topic
C02-1025:
<disp-quote><p>S1: In such cases, neither global features (Chieu and Ng, 2002) nor
aggregated contexts (Chieu and Ng, 2003) can help.</p></disp-quote>
and one of the statements from the RP:
<disp-quote><p>S2: Such a classification can be seen as a not-always-correct summary of
global features.</p></disp-quote></p>
        <p>The parsed output for S1 (in the CP) is:</p>
        <preformat>amod(cases-3, such-2)
preconj(features-7, neither-5)
nsubj(help-12, features-7)
conj_nor(features-7, contexts-10)
aux(help-12, can-11)
prep_in(help-12, cases-3)
amod(features-7, global-6)
amod(contexts-10, aggregated-9)
nsubj(help-12, contexts-10)
root(ROOT-0, help-12)</preformat>
        <p>The parsed output for S2 (in the RP) is:</p>
        <preformat>predet(classification-3, Such-1)
det(classification-3, a-2)
nsubjpass(seen-6, classification-3)
aux(seen-6, can-4)
auxpass(seen-6, be-5)
root(ROOT-0, seen-6)
amod(summary-10, not-always-correct-9)
det(summary-10, a-8)
prep_as(seen-6, summary-10)
amod(features-13, global-12)
prep_of(summary-10, features-13)</preformat>
        <p>The most frequent bi-gram for this topic is (global, features). Searching for
these words in the parsed output of S1 and S2, we get the following dependencies for S1:</p>
        <preformat>preconj(features-7, neither-5)
<italic>amod(features-7, global-6)</italic>
nsubj(help-12, features-7)
conj_nor(features-7, contexts-10)</preformat>
        <p>and for S2:</p>
        <preformat><italic>amod(features-13, global-12)</italic>
prep_of(summary-10, features-13)</preformat>
        <p>Matching the dependency tags and the governing/dependent positions in the
dependencies extracted above for S1 and S2, we get a dependency overlap
count of one (the matching dependency is presented in italics above).</p>
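        <p>The worked example can be reproduced with a short sketch. It rests on assumptions: the parser output is available as plain text in the tag(governor-i, dependent-j) form shown above, overlap is counted per CP dependency, and lower-casing stands in for Porter’s stemmer through the stem parameter. The function names are hypothetical.</p>
        <preformat preformat-type="code">
```python
import re

# Matches one dependency such as "amod(features-7, global-6)".
DEP_PATTERN = re.compile(r"(\w+)\((\S+)-\d+, (\S+)-\d+\)")

S1_PARSE = """amod(cases-3, such-2) preconj(features-7, neither-5)
nsubj(help-12, features-7) conj_nor(features-7, contexts-10)
aux(help-12, can-11) prep_in(help-12, cases-3) amod(features-7, global-6)
amod(contexts-10, aggregated-9) nsubj(help-12, contexts-10)
root(ROOT-0, help-12)"""

S2_PARSE = """predet(classification-3, Such-1) det(classification-3, a-2)
nsubjpass(seen-6, classification-3) aux(seen-6, can-4)
auxpass(seen-6, be-5) root(ROOT-0, seen-6)
amod(summary-10, not-always-correct-9) det(summary-10, a-8)
prep_as(seen-6, summary-10) amod(features-13, global-12)
prep_of(summary-10, features-13)"""


def parse_dependencies(text):
    """Return (tag, governor, dependent) triples found in parser output."""
    return [match.groups() for match in DEP_PATTERN.finditer(text)]


def bigram_keys(dependency, targets, stem):
    """(tag, position, stem) keys for the bi-gram tokens in one dependency."""
    tag, governor, dependent = dependency
    return {
        (tag, position, stem(token))
        for position, token in (("gov", governor), ("dep", dependent))
        if stem(token) in targets
    }


def dependency_overlap(cp_deps, rp_deps, bigram, stem=str.lower):
    """Count CP dependencies whose tag, token and position recur in the RP."""
    targets = {stem(word) for word in bigram}
    rp_keys = set()
    for dependency in rp_deps:
        rp_keys.update(bigram_keys(dependency, targets, stem))
    return sum(
        1
        for dependency in cp_deps
        if bigram_keys(dependency, targets, stem).intersection(rp_keys)
    )


s1_deps = parse_dependencies(S1_PARSE)
s2_deps = parse_dependencies(S2_PARSE)
print(dependency_overlap(s1_deps, s2_deps, ("global", "features")))
```
</preformat>
        <p>Only the amod dependency between features and global satisfies all three conditions, so the sketch yields an overlap count of one for S1 and S2, in line with the example.</p>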
      </sec>
      <sec id="sec-2-4">
        <title>Rule-based System to identify Reference Scope in RP</title>
        <p>We have developed a rule-based system based on these three heuristics to report the
reference scope in the RP for the cited text in the CP. We have learnt the threshold values for ranking
and further processing of the statements through several rounds of experimentation with
the training and development sets. Statements with a weight greater than or equal to 0.2 are considered
for further ranking and processing. We report the reflective source statement in the RP for
a citing statement in the CP after considering the highest weight, the maximum dependency
overlap count, and the lowest distance between the bi-gram tokens. Where more than one
statement has a similar value for any of these three heuristics, we give the
weight the highest priority, followed by the dependency overlap count, and then the
distance. If, after these checks, more than one statement still has the same values for all
heuristics, then our system reports all of these statements as the reference scope for the
CP statement.</p>
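        <p>The priority order of the rules can be sketched as a lexicographic comparison. This is a simplified illustration, not the rule set of the system: the Candidate record and the function name are hypothetical, a missing distance (unigram case) is assumed to rank last, and an empty result is assumed when no statement reaches the 0.2 threshold.</p>
        <preformat preformat-type="code">
```python
from collections import namedtuple

# One RP statement with the values of the three heuristics (hypothetical record).
Candidate = namedtuple("Candidate", "sid weight overlap distance")

WEIGHT_THRESHOLD = 0.2  # threshold reported in the text


def rank_key(candidate):
    """Higher weight first, then higher overlap, then smaller distance."""
    distance = candidate.distance
    if distance is None:  # unigram match: distance heuristic not applicable
        distance = float("inf")
    return (-candidate.weight, -candidate.overlap, distance)


def reference_scope(candidates, threshold=WEIGHT_THRESHOLD):
    """Ids of the best-ranked statements; ties are all reported."""
    kept = [c for c in candidates if c.weight >= threshold]
    if not kept:
        return []
    best = min(rank_key(c) for c in kept)
    return [c.sid for c in kept if rank_key(c) == best]


candidates = [
    Candidate("s12", 0.5, 1, 1),
    Candidate("s47", 0.5, 1, 3),
    Candidate("s60", 0.5, 0, 1),
    Candidate("s75", 0.1, 5, 1),
]
print(reference_scope(candidates))
```
</preformat>
        <p>Here s75 is dropped by the threshold, s60 loses on overlap count, and s47 loses on distance, leaving s12 as the reported reference scope.</p>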
      </sec>
      <sec id="sec-2-5">
        <title>Heuristics to identify Facets</title>
        <p>The next sub-task of task 1 is to identify the discourse facet of the cited text span with
reference to the RP. The discourse facet is helpful in identifying the different contexts in which
a reference paper is cited. The organizers of CL-SciSumm have predefined five
facets: aim_citation, hypothesis_citation, method_citation, implication_citation, and
results_citation. Identifying the correct discourse facet again calls for understanding the
semantics of the cited text span, which involves the challenges
discussed above. Our approach to identifying the discourse facet is therefore based on the
section headers in the paper. This approach has the drawback of not being able
to identify hypothesis_citation and implication_citation. For the remaining three facets, the
following rules are applied:</p>
        <list list-type="order">
          <list-item><p>If the cited text span lies in the introduction section or at the beginning of the abstract,
then it is indicative of aim_citation.</p></list-item>
          <list-item><p>The discourse facet is marked as results_citation if the cited text span belongs to
a section titled Results, Observations, Discussion, or Conclusion,
or if the cited text span is one of the last two statements of the abstract.</p></list-item>
          <list-item><p>If the cited text span does not belong to the sections mentioned in the above two
points, then the discourse facet is marked as method_citation.</p></list-item>
        </list>
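        <p>The three rules above can be sketched as a small function. Several details are assumptions made for illustration: section titles are matched by substring, the "beginning of the abstract" is taken to be its first statement, and the first rule is checked before the second. The function and parameter names are hypothetical.</p>
        <preformat preformat-type="code">
```python
RESULT_TITLES = ("result", "observation", "discussion", "conclusion")


def discourse_facet(section_title, abstract_index=None, abstract_length=None):
    """Map the location of a cited text span to one of three facets.

    abstract_index is the 0-based position of the span inside the abstract
    and abstract_length the number of statements in the abstract; both are
    only used when the span lies in the abstract.
    """
    title = section_title.strip().lower()
    if "abstract" in title and abstract_index is not None and abstract_length:
        if abstract_index == 0:
            return "aim_citation"
        if abstract_index >= abstract_length - 2:
            return "results_citation"
        return "method_citation"
    if "introduction" in title:
        return "aim_citation"
    if any(keyword in title for keyword in RESULT_TITLES):
        return "results_citation"
    return "method_citation"


print(discourse_facet("1 Introduction"))
print(discourse_facet("Abstract", abstract_index=4, abstract_length=5))
```
</preformat>
        <p>As in the rules themselves, hypothesis_citation and implication_citation are never produced by this sketch.</p>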
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>We have computed the ROUGE-N [8] metric with N = 2 (bi-grams) to evaluate our
system. Table 1 presents the average results for identifying the reference scope
(task 1a) for each topic in the development set, together with the average overall performance
of the system on the development set for task 1a.</p>
      <p>For task 1b, we have computed the accuracy of reporting the discourse facet as
the ratio of the number of correctly identified facets in the annotation file of a topic to the total
number of citances for that topic. Table 2 presents the discourse-facet accuracy
for the development set.</p>
      <p>The experiments with the different datasets (training, development, and test sets) indicate
that lexical and syntactic cues are indeed of help. However, lexical and syntactic analysis
has its own limitations: it relies only on regular-expression matching and performs no semantic or
contextual matching. We observe that the same approach does not perform uniformly
across the datasets, and performance differs even within one dataset. For
example, our system worked better with topics E09-2008 and N04-1038 than with
the other topics in the development set, as is evident from Table 1. The evaluation results
presented in Table 1 correspond to the ROUGE-N metric (N = 2). We have used this
metric because our system is bi-gram based. Nevertheless, we are implementing the
ROUGE-S metric as well in order to cross-validate our evaluation and system
performance.</p>
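      <p>For reference, the two measures used in this section can be sketched as follows. The ROUGE-2 function implements the recall-oriented n-gram overlap against a single reference; whether the reported scores use recall, precision, or an F-measure is not stated here, so this choice is an assumption, as are the function names.</p>
      <preformat preformat-type="code">
```python
from collections import Counter


def ngrams(tokens, n=2):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(system_tokens, reference_tokens, n=2):
    """Clipped n-gram matches divided by the n-grams in the reference."""
    reference = ngrams(reference_tokens, n)
    total = sum(reference.values())
    if total == 0:
        return 0.0
    system = ngrams(system_tokens, n)
    overlap = sum(min(count, system[gram]) for gram, count in reference.items())
    return overlap / total


def facet_accuracy(predicted, gold):
    """Correctly identified facets divided by the number of citances."""
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


reference = "such a classification can be seen".split()
system = "a classification can be used".split()
print(rouge_n_recall(system, reference))
```
</preformat>
      <p>In this toy pair, three of the five reference bi-grams also occur in the system output, giving a ROUGE-2 recall of 0.6.</p>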
      <p>It can be inferred from the discussion above that semantic-level analysis is necessary
to obtain good results. The task of identifying the reference scope of citances appears
similar to the task of recognizing textual entailment (RTE), but is actually quite
different, primarily because of the different nature of the corpus. Nevertheless, the
CL-SciSumm task can benefit from the RTE challenges and the solution approaches to
recognizing textual entailment. While working with the CL-SciSumm corpus, we
encountered several problems in the corpus in terms of its formatting, character encoding, and
annotations. However, these problems are not major and could be fixed.
Resolving them may provide useful pointers to the semantic-level analysis
needed for tasks like CL-SciSumm.</p>
      <p>We have worked with three heuristics of a lexical and syntactic nature to identify the
reference scope of the cited text in the RP. The computation of these
heuristics has been described in detail in Section 2. Our experiments
indicate that the way we compute the heuristics can be refined further. At
present, our system considers unigrams and bi-grams only. We have mitigated the
challenges of lexical analysis by working with the stemmed forms of words.
We are currently experimenting with different priorities for our heuristics and tuning
our algorithms.</p>
      <p>We have developed our system for the CL-SciSumm task in Python, which
turned out to be a useful choice. Python is an interpreted
language that supports both object-oriented and functional programming styles. It
allowed us to write the code in fewer lines while dividing the problem into
sub-problems. We were thus able to code and test small snippets separately and merge
them later into the complete system.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this paper, we have presented our system for CL-SciSumm task 1, which identifies
reflective statements from the RP for a given citance in the CP. The task is challenging because
semantic-level analysis has limited applicability in this case. We have addressed the task
using lexical and syntactic cues to extract the text spans from the RP that correspond to the
cited text in the CP. We believe that further refinements to the corpus and to our system
can yield better results. We intend to refine our heuristics further and to check the
applicability of machine learning as well.</p>
      <p>7. de Marneffe, M.C., Silveira, N., Dozat, T., Haverinen, K., Ginter, F., Nivre, J. and
Manning, C.D.: Universal Stanford Dependencies: A cross-linguistic typology. In: LREC
(2014).</p>
      <p>8. Lin, C. and Hovy, E.H.: Automatic Evaluation of Summaries using N-gram Co-occurrence
Statistics. In: Proceedings of Language Technology Conference (HLT-NAACL), Canada
(2003).</p>
      <p>9. Jaidka, K., Chandrasekaran, M.K., Rustagi, S. and Kan, M.: Overview of the 2nd
Computational Linguistics Scientific Document Summarization Shared Task (CL-SciSumm-2016).
To appear in: Proceedings of the Joint Workshop on Bibliometric-enhanced Information
Retrieval and Natural Language Processing for Digital Libraries (BIRNDL), Newark, New
Jersey, USA (2016).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Nakov</surname>
            ,
            <given-names>P.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwartz</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Hearst</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>Citances: Citation sentences for semantic analysis of bioscience text</article-title>
          .
          <source>In: SIGIR</source>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Qazvinian</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Radev</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          :
          <article-title>Identifying Non-explicit Citing Sentences for Citation-based Summarization</article-title>
          .
          <source>In Proceedings of Association for Computational Linguistics</source>
          , (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Jaidka</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chandrasekaran</surname>
            ,
            <given-names>M.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elizalde</surname>
            ,
            <given-names>B.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jha</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khanna</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Molla-Aliod</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radev</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ronzano</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , et al.:
          <article-title>The computational linguistics summarization pilot task</article-title>
          .
          <source>In: Proceedings of Text Analysis Conference</source>
          , Gaithersburg, USA, (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Abu-Jbara</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Radev</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Reference Scope Identification in Citing Sentences</article-title>
          .
          <source>In: Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , pp
          <fpage>80</fpage>
          -
          <lpage>90</lpage>
          , (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Biswas</surname>
            ,
            <given-names>K.K.</given-names>
          </string-name>
          :
          <article-title>Recognizing Textual Entailment using Dependency Analysis and Machine Learning</article-title>
          .
          <source>In: Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Student Research Workshop (SRW)</source>
          , Colorado, USA (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Molla</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Towards semantic-overlap based measures for question answering</article-title>
          .
          <source>In: Proceedings of the Australasian Language Technology Workshop</source>
          , Australia (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>