<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>NLP-NITMZ @ CLScisumm-</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aizawl</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science &amp; Engineering, National Institute of Technology Mizoram</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper reports the participation of the NLP-NITMZ @ CL-SciSumm 2018 system in Task 1A, Task 1B and Task 2 of the shared task at the BIRNDL 2018 Workshop. We developed our system using the previous years' data provided by the organizers. For Task 1A and Task 1B, we applied various rule-based approaches and trained a K-Nearest Neighbors (KNN) classifier using different features identified from the input citation text and the reference text. We achieved overall accuracy scores of 42.75 % and 78.75 % for Task 1A and Task 1B, respectively. For Task 2, we built our summary generation system using the OpenNMT tool. We developed the model using the training and development datasets released in previous years' tracks and validated the results of our system on previous years' test data. We evaluated our system using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score on some of the CL-SciSumm 2017 test data. For Task 2, we achieved an overall accuracy score of 37.75 %.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Automatically summarizing scientific papers is one of the most important and
challenging tasks in the field of summarization. Research articles need to be
summarized to give the reader a brief overview of the paper. Citation sentences
provide useful information about the reference paper. Several
shared tasks address this problem, such as the TAC 2014 Biomedical Summarization Track, CL-SciSumm 2016,
CL-SciSumm 2017 and CL-SciSumm 2018. Each of these shared tasks poses new
challenges every year, which motivates researchers to share their own theories and
methodologies in the field of scientific paper summarization.</p>
      <p>
        This short paper highlights the participation of our NLP-NITMZ @
CL-SciSumm 2018 system in the shared task at the BIRNDL 2018 Workshop. We
explain the methodologies, techniques and results of
our system in this shared task on scientific summarization. We first identify
the citation markers in the citing text, for example (Ceylan et
al 2010), (Clarke et al 2010) or (Navigli, 2009, WSD). All of these
citations contain a year, so we extract the phrase or sentence
containing the citation using NLTK regular expression pattern matching.
The second step of our strategy is the identification of the reference paper.
From the extracted citances, we identify the topic names of the reference paper.
This is done with the LDA model from the Gensim package. For
example, consider the cited text span "In addition, discriminative weighting
methods were proposed to assign appropriate weights to the sentences from
training corpus (Matsoukas et al, 2009) or the phrase pairs of phrase table (Foster
et al, 2010)". The reference paper was identified to be (Matsoukas et al, 2009), so we
extract the phrase "In addition, discriminative weighting methods were proposed
to assign appropriate weights to the sentences from training corpus". We then
extract features from the input citation texts and the reference paper, which are
used to train a K-Nearest Neighbor (KNN) classifier that groups into one cluster the similar
sentences that cite the reference paper. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the
authors applied automatic identification of cited text spans using a
multi-classifier approach. Accordingly, we assign the facet of each citation text to
one of several classes: aim, method, implementation,
aim and method, result, and implication. Finally,
we trained the OpenNMT system to
generate a brief summary of the reference paper using the citances. Each year,
the shared task on scientific paper summarization adds important features. The
tasks assigned for CL-SciSumm 2018 at the BIRNDL 2018 Workshop are as follows.
      </p>
      <p>Task 1A: For each citance in the citing papers (i.e. a text span containing
a citation), identify the cited spans of text in the reference paper that most
accurately reflect the citance.</p>
    </sec>
    <sec id="sec-2">
      <title>Task 1B:</title>
      <p>For each cited text span, identify which discourse facet it belongs to, among
the following facets: Aim Citation, Result Citation, Method Citation and
Implementation Citation.</p>
      <p>Task 2: Finally, an optional task consists of generating a structured
summary of the reference paper of up to 250 words from the cited text spans.</p>
      <p>In this work we present the systems developed at NLP-NITMZ
for participation in CL-SciSumm 2018. We further explain the methodology
and architecture developed for this shared task. We evaluate our system and
compare its different runs on the previous year's dataset
of the shared task.</p>
      <p>The paper is organized as follows: Section 1 presents the literature survey on
this field of work. Section 2 reports the system description and the architecture
used for developing the system. Section 3 gives a detailed explanation of the
approaches used for developing the system for scientific paper summarization.
Section 4 discusses the evaluation results, and Sections 5 and 6 conclude
our work with directions for future work.</p>
      <sec id="sec-2-1">
        <title>Related Works</title>
        <p>
          Determining the similarity between sentences is a crucial task that
has a wide impact on many natural language processing applications. In the
scientific document summarization task, summary generation largely depends
on the accuracy of sentence similarity. The more accurately we identify the correct
sentences of the reference papers (RPs) for a query or keyword taken from the cited
text span in the citing papers (CPs), the more accurately we can generate the summary. The
NJUST@CLSciSumm-17 system used different similarity measures as features, for
example LDA, Jaccard, TF-IDF and Doc2Vec similarity, and used SVM (RBF),
SVM (linear), decision tree and logistic regression classifiers to extract the most accurate
sentences from the RPs for Task 1A. For Task 1B they used dictionary-based facet
selection, and for Task 2 they used bisecting K-means and maximal marginal
relevance (MMR) for summarization. PKC@CLSciSumm-17 used search-based
methods with features such as threshold-based TF-IDF, Word2Vec from Gensim and Word
Mover's Distance to find the sentence similarity, and used conditional probability
to identify the RP text spans. Metzler et al. evaluated the performance of
statistical translation models in identifying topically related sentences compared to
several simpler approaches such as word overlap, document fingerprinting and
TF-IDF measures. Li et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] implemented Task 1A and Task 1B using
rule-based methods with various lexicon and similarity features, and trained
the system using an SVM classifier. For Task 2, an hLDA topic model was adopted
for content modeling, which provides knowledge about sentence
clustering (subtopics) and word distributions (attractivenesses) for summarization.
We further studied the unsupervised summarization
technique TextSentenceRank in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which helps in finding the similarity of sentences
to the citation on a textual level. That paper employed a classification
method to select the original text sentences from the candidate texts using the
TextSentenceRank algorithm. The work in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] also demonstrated
unsupervised summarization of the relevant sub-part of the document that
was previously selected in a supervised manner. In [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], the authors applied a
Learning to Rank algorithm with multiple features, including lexical features,
topic features, knowledge-based features and sentence importance, to Task 1A,
treating reference span identification as an information
retrieval problem. They viewed Task 1B, discourse facet identification,
as a text classification problem, considering features of both
citation contexts and cited spans. In [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], CIST@CLSciSumm-17 used
multiple features based on citation linkage for the classification and summarization
approaches.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>System Architecture</title>
        <p>This section describes the system used in our work. The
Extractor module extracts the cited text spans. In this module, citation markers
such as (Ceylan et al 2010), Clarke et al (2010) or (Navigli, 2009, WSD) are
identified. We look for the element these markers have in common; in these examples it is the year,
so the year serves as a matching pattern to identify and extract the corresponding
texts from the reference text. We therefore extract all the sentences containing the
year as a key term from the reference text. This is done with
RegexpParser(pattern), imported from NLTK. The next step is the
identification of features from the input cited text span and the reference text.
We used lexical and syntactic features for this. We used n-gram matching,
namely unigram matching, bigram matching and higher-order n-gram matching.
We also used the Word2Vec model of Gensim, cosine similarity and LDA.
The identified features are then fed into the K-Nearest
Neighbor (KNN) classifier for training. After training the
classifier, and using some rule-based methods, we classified the citation text into
the possible facets: Aim Citation; Aim and Method Citation; Implication
Citation and Method Citation; Method Citation; Implication Citation; and Result
Citation. We employ a rule-based approach here. If the text is extracted from the
beginning section of the RP, it is assigned the Aim Citation facet.
Similarly, if the sentence from the RP is extracted from both the aim and method
sections, we classify it as Aim and Method Citation.
Correspondingly, if the identified sentence from the RP comes from the
implication section, it is assigned the implication facet. If the extracted sentence
from the reference text describes the aim or the method, we assign it
the method facet. If the extracted sentence comes
from any section other than the abstract and result sections, we classify it
into the method and implication facet. Lastly, the extracted sentence
is classified as Result and Method Citation if it describes both the
method and the result.</p>
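        <p>The n-gram matching features mentioned above can be illustrated with a minimal sketch. It assumes whitespace tokenization and defines overlap as the fraction of the citance's n-grams found in a candidate reference sentence; the function names and this exact definition are illustrative assumptions, not necessarily the features used in our runs.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def ngrams(tokens, n):
    """Return the set of contiguous n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(citance, candidate, n):
    """Fraction of the citance's n-grams that also occur in the candidate."""
    cit = ngrams(citance.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    if not cit:
        return 0.0
    return len(cit & cand) / len(cit)


def overlap_features(citance, candidate, max_n=3):
    """Feature vector: [unigram, bigram, ..., max_n-gram] overlap."""
    return [ngram_overlap(citance, candidate, n) for n in range(1, max_n + 1)]


# Hypothetical example sentences, used only to exercise the functions.
citance = "discriminative weighting methods were proposed"
candidate = "we propose discriminative weighting methods for training"
features = overlap_features(citance, candidate)
```
]]></preformat>
        <p>A vector of this kind could be passed, together with the other similarity scores, as one row of the feature matrix on which the KNN classifier is trained.</p>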
      </sec>
      <sec id="sec-2-3">
        <title>Methodology</title>
        <p>The first step of any such work is cleaning or preprocessing the datasets. We
preprocessed the datasets released for CL-SciSumm 17 and
CL-SciSumm 18. For this we used the NLTK preprocessing module. We cleaned and
tokenized the text and extracted patterns from the reference paper to extract those sentences
that contain the cited text. The steps undertaken to complete Task 1A,
Task 1B and Task 2 are described below.</p>
        <p>For Task 1A: In Task 1A of CL-SciSumm 18, we clean the cited text
span in order to identify the spans of text in the reference paper that most accurately
reflect the citation. Here we choose only those sentences or portions of sentences
that contain a citation to the actual reference paper. We applied different
preprocessing steps to identify the key contributing terms so as to extract the cited text
span. The steps are as follows:</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Extraction of cited text span</title>
      <p>Identification of cited text spans: Here the cited text spans containing
citations such as (Ceylan et al 2010), (Clarke et al 2010) or (Navigli, 2009, WSD)
are identified. All of these citations contain a year, so we
first extracted only those lines containing a citation using the regular expression
regexp(pattern) from NLTK.</p>
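      <p>The year-based matching idea can be sketched as follows. This sketch uses Python's standard re module rather than NLTK, and the pattern itself (a parenthesised span containing a four-digit year starting with 19 or 20) is an assumption made for illustration; it is not the exact expression used in our system.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Assumed pattern: an opening parenthesis, any text without parentheses,
# and a four-digit year (19xx or 20xx) before the closing parenthesis.
CITATION = re.compile(r"\([^()]*\b(?:19|20)\d{2}\b[^()]*\)")


def find_citations(sentence):
    """Return every parenthesised span in the sentence that contains a year."""
    return CITATION.findall(sentence)


def citing_sentences(sentences):
    """Keep only the sentences that contain at least one citation marker."""
    return [s for s in sentences if CITATION.search(s)]


# Hypothetical sentences built around the citation markers quoted in the text.
examples = [
    "This follows earlier work (Ceylan et al 2010).",
    "Semantic parsing was studied before (Clarke et al 2010).",
    "See the survey (Navigli, 2009, WSD) for details.",
    "This sentence has no citation (see above).",
]
kept = citing_sentences(examples)
```
]]></preformat>
      <p>Requiring a year inside the parentheses is what separates citation markers from ordinary parenthetical remarks such as the one in the last example sentence.</p>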
    </sec>
    <sec id="sec-4">
      <title>Identification of the actual reference</title>
      <p>After extracting and analysing the citation text, we identified the reference
paper's topic names and then extracted only those sentences or portions of sentences
that belong to the actual reference paper. For this we used the LDA model in
Gensim. For example, consider the cited text span "In addition, discriminative
weighting methods were proposed to assign appropriate weights to the sentences
from training corpus (Matsoukas et al, 2009) or the phrase pairs of phrase
table (Foster et al, 2010)". We find that the reference paper is (Matsoukas et al,
2009), so we extract the phrase "In addition, discriminative weighting methods
were proposed to assign appropriate weights to the sentences from training
corpus (Matsoukas et al, 2009)". We choose only one referenced sentence at a time,
and for each folder we made a text file containing all the sentences of the citing
papers that contain a citation to the reference paper. We compute the TF-IDF
scores of the query citation text and the reference paper. The top-scoring key
terms are used as index query keywords to identify and extract sentences
from the reference paper. From the extracted sentences and the input citation
sentences, we engineered different features, such as n-gram matching, and applied
a rule-based method to them. Similarity and distance functions such as cosine similarity,
LDA score and Word Mover's Distance are used as further features.
We then trained the K-Nearest Neighbor (KNN) classifier using the extracted
features. The classifier groups similar sentences cluster by cluster.
We then further classified the sentences and identified their facets using the rule-based
approaches.</p>
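      <p>As an illustration of the TF-IDF and cosine similarity scoring described above, the following self-contained sketch ranks reference sentences against a citance. It is written in plain Python with a smoothed IDF; the weighting scheme, the whitespace tokenizer and all function names are assumptions for this example and may differ from the configuration used in our experiments.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            # Smoothed IDF, so terms present in every document keep a small weight.
            term: (count / len(toks))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_reference_sentences(citance, reference_sentences):
    """Return (score, sentence) pairs sorted by similarity to the citance."""
    vectors = tfidf_vectors([citance] + reference_sentences)
    query, candidates = vectors[0], vectors[1:]
    scored = [(cosine(query, vec), sent)
              for vec, sent in zip(candidates, reference_sentences)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical reference sentences, used only to exercise the ranking.
reference = [
    "we assign discriminative weights to training sentences",
    "the decoder uses a phrase table",
    "results are reported in bleu",
]
citance = ("discriminative weighting methods assign weights "
           "to the sentences from training corpus")
ranked = rank_reference_sentences(citance, reference)
```
]]></preformat>
      <p>The highest-ranked sentences would then serve as candidate cited text spans, and their scores could be used as features for the KNN classifier.</p>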
      <p>Top-scoring sentences are selected from the reference paper based on their
Jaccard similarity score and TF-IDF score. For Task 1B, we divide facet
identification into the following subclasses.
1. Aim Citation: the text is located at the beginning of the
paper.
2. Aim Citation, Method Citation: the sentence is similar to sentences in the aim and method
sections and uses both method-related terms and the present or future tense.
3. Implication Citation: the sentence is in the introduction or method section and uses
terms related to the implementation of a method, technique or dataset.
4. Method Citation: the sentence is in the aim or method section and uses terms related
to a method or technique.
5. Method Citation, Implication Citation: the sentence is in any section other than the abstract and
result sections and gives details about the method and its implementation.
6. Result Citation, Method Citation: the text is present in either the method
or result section and contains terms describing both results and methods.
7. Result Citation: the sentence is in the result or any other section and contains numbers
with percentages or terms such as accuracy, performance or score.</p>
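      <p>The facet rules listed above could be approximated by a small rule cascade such as the one below. The cue-word lists, the section labels, the order in which the rules fire and the fallback to Aim Citation are all assumptions made for this sketch, and the tense check of rule 2 is omitted; it should be read as one possible reading of the rules rather than our exact implementation.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Hypothetical cue-word lists; the actual lexicons are not given in the paper.
METHOD_TERMS = {"method", "approach", "algorithm", "technique", "model"}
IMPLEMENTATION_TERMS = {"implement", "implementation", "dataset", "corpus", "tool"}
RESULT_TERMS = {"accuracy", "performance", "score", "result", "results"}
PERCENT = re.compile(r"\d+(?:\.\d+)?\s*%")


def assign_facet(sentence, section):
    """Map a reference sentence and its section name to a facet label.

    `section` is assumed to be one of "abstract", "introduction",
    "method" or "result".
    """
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    has_method = bool(words & METHOD_TERMS)
    has_impl = bool(words & IMPLEMENTATION_TERMS)
    has_result = bool(words & RESULT_TERMS) or bool(PERCENT.search(sentence))

    # Rule 6: result and method terms inside the method or result section.
    if section in ("method", "result") and has_result and has_method:
        return "Result Citation, Method Citation"
    # Rule 7: percentages or result terms in any section.
    if has_result:
        return "Result Citation"
    # Rule 5: method plus implementation details outside abstract and result.
    if section not in ("abstract", "result") and has_method and has_impl:
        return "Method Citation, Implication Citation"
    # Rule 3: implementation terms in the introduction or method section.
    if section in ("introduction", "method") and has_impl:
        return "Implication Citation"
    # Rule 4: method terms in the introduction (aim) or method section.
    if section in ("introduction", "method") and has_method:
        return "Method Citation"
    # Rule 1 (used here as the fallback): text from the beginning of the paper.
    return "Aim Citation"
```
]]></preformat>
      <p>Because several rules can match the same sentence, the order of the checks matters; placing the combined facets before the single facets is a design choice of this sketch.</p>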
    </sec>
    <sec id="sec-5">
      <title>Task 2 : Summary generation</title>
      <p>From the sentences identified in Task 1A, we combined all the
reference sentences that are cited in the citing papers to create an extractive
summary and an abstractive summary.</p>
    </sec>
    <sec id="sec-6">
      <title>Extractive text Summarization:</title>
      <p>In the extractive text summarization approach, we applied three simple rules to
generate the summary, as the text itself is short. First, we assigned each
sentence extracted from the reference paper a score based on the Jaccard similarity between
all the cited text and the reference text. Second, we considered sentence length and
location: the summary should contain at least one sentence from each of the
introduction, implementation, methods and results sections. Third, after Task 1A we ranked the
sentences according to how they are referred to in the citing papers. Based on these three criteria we
finally selected a sentence from every section, i.e. introduction, implementation,
methods and results. If the length does not exceed 250 words, we add more
sentences based on similarity score.</p>
      <p>We built our system model on the CL-SciSumm 16 and
CL-SciSumm 17 data and the DUC1 dataset. We made four files,
i.e. training text data, training summary data, validation text data and
validation summary data. We preprocessed all the data using OpenNMT. For summary
generation, we used the preprocessed output of Task 1A as test data, and the summary
is generated from it by translation with OpenNMT. Afterwards, we interpreted
the summaries generated by OpenNMT based on the CL-SciSumm 16 and
CL-SciSumm 17 data. The DUC1 dataset does not give accurate results because it contains too little
data. For NMT to work properly and give accurate results, we need to train the
OpenNMT system with more data.</p>
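      <p>The extractive selection rules described in this section can be sketched as a greedy procedure under a word budget. The two-pass strategy, the function names and the representation of candidates as (section, sentence) pairs are assumptions of this illustration, and the ranking by citation count is left out for brevity.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two texts."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def build_summary(candidates, citances, max_words=250):
    """candidates: list of (section, sentence) pairs from the reference paper."""
    cited_text = " ".join(citances)
    scored = sorted(candidates, key=lambda c: jaccard(c[1], cited_text),
                    reverse=True)

    chosen, used, seen_sections = [], 0, set()

    def try_add(item):
        """Add the sentence unless it is a repeat or would exceed the budget."""
        nonlocal used
        length = len(item[1].split())
        if item in chosen or used + length > max_words:
            return
        chosen.append(item)
        used += length

    # First pass: the best-scoring sentence of each section, for coverage.
    for item in scored:
        if item[0] not in seen_sections:
            seen_sections.add(item[0])
            try_add(item)
    # Second pass: fill the remaining budget by similarity score.
    for item in scored:
        try_add(item)

    # Restore the original document order for readability.
    chosen.sort(key=candidates.index)
    return " ".join(sentence for _, sentence in chosen)


# Hypothetical candidates and citance, with a small budget to show truncation.
candidates = [
    ("introduction", "we study discriminative weighting for translation"),
    ("methods", "weights are assigned to training sentences"),
    ("results", "the system improves translation quality"),
    ("methods", "unrelated filler text here"),
]
citances = ["discriminative weighting methods assign weights to training sentences"]
summary = build_summary(candidates, citances, max_words=12)
```
]]></preformat>
      <p>With the default budget of 250 words all four example sentences would fit; the small budget in the example only demonstrates how lower-scoring sentences are dropped once the limit is reached.</p>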
      <sec id="sec-6-1">
        <title>System Result and Observation</title>
        <p>We built our system using the datasets released for CL-SciSumm 18,
CL-SciSumm 17 and CL-SciSumm 16 and tested our system
on the CL-SciSumm 17 dataset. We evaluated our system runs using the
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric.</p>
        <p>From the table given above, we can conclude that for Task 1A we obtained an
accuracy of 42.105 % on part of the data from the CL-SciSumm 17 dataset.
For Task 1B we obtained an accuracy of 78.75 %. For Task 2, using OpenNMT, we
obtained an accuracy score of 37.75 %.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Conclusion and Future Works</title>
        <p>In future tracks, we would like to train and build our system
incorporating more lexical, syntactic and semantic features. We would extend our
work by training multiple classifiers and thereby improve the accuracy of
our model. Moreover, we would like to study the dataset imbalance
problem that we encountered while experimenting with the system and
find a solution to it.</p>
        <p>Acknowledgments We would like to express our deepest appreciation to the
Department of Computer Science and Engineering, National Institute of
Technology, Mizoram for providing the
financial assistance, laboratory and technical facilities required to conduct the
experimental research reported in this paper for our participation
in the CL-SciSumm Shared Task 2018 @SIGIR2018: The Scientific Summarization
Shared Task.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Klampfl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rexha</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kern</surname>
          </string-name>
          , R.:
          <article-title>Identifying referenced text in scientific publications by summarisation and classification techniques</article-title>
          .
          <source>In: Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL)</source>
          . pp.
          <volume>122</volume>
          –
          <issue>131</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cong</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peng</surname>
          </string-name>
          , H.:
          <article-title>CIST system for CL-SciSumm 2016 shared task</article-title>
          .
          <source>In: Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL)</source>
          . pp.
          <volume>156</volume>
          –
          <issue>167</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          : CIST@ CLSciSumm-17:
          <article-title>Multiple features based citation linkage, classification and summarization</article-title>
          .
          <source>In: Proc. of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL2017)</source>
          . Tokyo, Japan (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Recognizing reference spans and classifying their discourse facets</article-title>
          .
          <source>In: Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL)</source>
          . pp.
          <volume>139</volume>
          –
          <issue>145</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ma</surname>
          </string-name>
          , S.,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Zhang,
          <string-name>
            <surname>C.</surname>
          </string-name>
          :
          <article-title>Automatic identification of cited text spans: a multi-classifier approach over imbalanced dataset</article-title>
          . Scientometrics pp.
          <volume>1</volume>
          –
          <issue>28</issue>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>