SciSumm 2017: Employing Word Vectors for Identifying, Classifying and Summarizing Scientific Documents

Aniket Pramanick, Salma Mandi, Monalisa Dey, and Dipankar Das
Jadavpur University, Kolkata, West Bengal, India

Abstract: This paper describes our approach to "Recognizing Reference Spans, Classifying Their Discourse Facets and Summarizing from Reference Text", submitted to the shared task on relationship mining and scientific summarization of computational linguistics research papers at SIGIR 2017.

1 Introduction

The 3rd CL-SciSumm Shared Task provides resources for scientific paper summarization. An overview of the shared task, including specific details on the dataset and subsequent analysis for each task, is given in the task overview paper (Jaidka et al., 2016). In this report we provide a short description of the methods we have used for Task 1A, Task 1B and Task 2.

1.1 Summary

The system combines rule-based similarity measures with an Artificial Neural Network that is used as a filter.

2 Dataset and Preprocessing

The reference text is divided into sentences. The similarity between each of these sentences and the citing sentence in the citance is measured using three standard similarity measures:

a. Jaccard's Coefficient (trigram model), J(a, b)
b. Clough and Stevenson Coefficient (trigram model), C(a, b)
c. Model Probability Measure (bigram model), P(a)

Thus, for each sentence in the reference text we obtain a 3-tuple (J, C, P). This feature vector is fed to an Artificial Neural Network that decides whether the sentence in the reference text is cited by the citance. The outputs of this network give the Candidate Sentences, a subset of which is the set of "Citation Sentences". The neural network is therefore used as a filtering method.

3 System Framework

3.1 Task 1A: Identification

A very simple method has been used to identify the possible Citation Texts. For each Candidate Sentence in the Reference Text, the cosine similarity to the citing sentence is measured and the resulting score is incremented by one. Thus, each Candidate Sentence of a Reference Text is assigned a Cosine Similarity Score. If a Reference Text has no Candidate Sentence with a score greater than 1.2, we say that the Reference Text has not been cited at all. Otherwise, the sentence or sentence segment with the maximum score is declared to be the Citation Text. The code to measure the Cosine Similarity Score is as follows:

from keras.preprocessing.text import Tokenizer
import math

def cosine_similarity(t1, t2):
    # Build binary term vectors for the two texts with the Keras tokenizer.
    texts = [t1, t2]
    tknzr = Tokenizer(lower=True, split=" ")
    tknzr.fit_on_texts(texts)
    x = tknzr.texts_to_matrix(texts)
    v1, v2 = x[0], x[1]
    # Compute the cosine of the angle between the two term vectors.
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        a, b = v1[i], v2[i]
        sumxx += a * a
        sumyy += b * b
        sumxy += a * b
    # The score is incremented by one, as described above.
    return float(sumxy) / float(math.sqrt(sumxx * sumyy)) + 1

3.2 Task 1B: Classification

Task 1B is treated as a text classification problem. The five discourse facets of a sentence in a reference paper are Aim, Method, Implication, Result and Hypothesis. For this task we follow an unsupervised method:

• A bag of words is created for each class.
• For each bag, its bag vector is computed.
• A sentence vector is computed for each cited text span.
• The cosine similarity between the sentence vector and each bag vector is measured.

The class whose bag vector has the highest similarity to the sentence vector is assigned to the cited text span. A minimal sketch of this procedure is given below; its individual components (word bags, bag vectors, sentence vectors and the similarity measure) are described in the following paragraphs.
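The following sketch illustrates the classification step. It assumes the pre-trained GloVe vectors have already been loaded into a plain Python dictionary mapping words to numpy arrays; the helper names (average_vector, classify_facet) and the example word bags are purely illustrative and are not taken verbatim from our implementation.

import numpy as np

def average_vector(words, embeddings, dim=200):
    # Normalized summation of the in-vocabulary word vectors;
    # out-of-vocabulary words contribute the null vector, i.e. they are skipped.
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

def classify_facet(sentence, word_bags, embeddings, dim=200):
    # Assign the facet whose bag vector is most similar to the sentence vector.
    s_vec = average_vector(sentence.lower().split(), embeddings, dim)
    scores = {facet: cosine(s_vec, average_vector(bag, embeddings, dim))
              for facet, bag in word_bags.items()}
    return max(scores, key=scores.get)

# Illustrative word bags; in the actual system they are built from the
# highest tf-idf words of the training reference texts for each facet.
word_bags = {
    "Aim": ["aim", "goal", "objective", "propose"],
    "Method": ["method", "approach", "algorithm", "model"],
    "Result": ["result", "accuracy", "performance", "score"],
}
# Example call (embeddings must be a word -> vector dict):
# classify_facet("we propose a goal driven model", word_bags, embeddings)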
Bag of Words: Each bag is a list of words relevant to a particular class and is built from unigrams. For each class, we separated the corresponding reference text from the training data set, built a list of words from the reference sentences, and calculated their tf-idf scores. Each bag is then constructed by taking the words with the highest tf-idf scores.

Bag Vectors: We used the 200-dimensional GloVe vectors pre-trained on Twitter data (2 billion tweets, available at http://nlp.stanford.edu/projects/glove/) to create the vectors of the reference text and of the word bags. A word bag vector is created by taking the normalized summation of the vectors of the words in the bag that are present in the vocabulary of the pre-trained GloVe model. Out-of-vocabulary words are assigned the null vector:

\[ \vec{q}_i = \frac{1}{N_v(q_i)} \sum_{j=1}^{N_v(q_i)} \vec{W}_{ij}, \qquad \vec{W}_{ij} = \vec{0} \text{ for out-of-vocabulary words} \]

where \( \vec{q}_i \) is the topic vector of the i-th word bag, \( N_v(q_i) \) is the number of words of \( q_i \) present in the vocabulary, and \( \vec{W}_{ij} \) is the vector of the j-th word in the i-th word bag.

Sentence Vectors: The sentence vectors are created by taking the normalized summation of the vectors of the words in the sentence that are present in the vocabulary of the pre-trained GloVe model. Words that are not part of the model vocabulary are assigned the null vector:

\[ \vec{t}_i = \frac{1}{N_v(t_i)} \sum_{j=1}^{N_v(t_i)} \vec{u}_{ij}, \qquad \vec{u}_{ij} = \vec{0} \text{ for out-of-vocabulary words} \]

where \( \vec{t}_i \) is the sentence vector of the i-th sentence, \( N_v(t_i) \) is the number of words of \( t_i \) present in the vocabulary, and \( \vec{u}_{ij} \) is the vector of the j-th word in the i-th sentence.

Cosine Similarity: We use the cosine similarity S between the sentence vector and the topic vector:

\[ S = \text{cosine-sim}(\vec{t}_i, \vec{q}_j) = \frac{\vec{t}_i \cdot \vec{q}_j}{\lVert \vec{t}_i \rVert \, \lVert \vec{q}_j \rVert} \]

A high value of S denotes a higher similarity between the sentence vector \( \vec{t}_i \) and the topic vector \( \vec{q}_j \), and vice versa.

3.3 Task 2: Summarization

From the outputs obtained in Task 1A and Task 1B, a community summary had to be formed, which is a structured extractive summary of the Reference Paper (RP) generated from the cited text spans of the RP.

Community Summary: The output of Task 1B contained the cited text spans, along with their facets, for each RP. The five discourse facets of a sentence in a reference paper are Aim, Method, Implication, Result and Hypothesis. For each facet, duplicate entries and stop words were removed from the text spans. A similarity score was then calculated between the text spans of each facet using the cosine similarity measure. If the cosine similarity score was high, one of the two corresponding sentences was selected at random as a probable candidate for the summary. We also observed from the output that sentences with fewer than three words contributed no meaning to the generated summary, and they were therefore discarded. A sketch of this facet-wise redundancy removal is given below.
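The following sketch illustrates the candidate selection for one facet. The 0.8 similarity threshold and the bag-of-words cosine used here are assumptions for illustration; the actual system compares the GloVe-based sentence vectors described in Section 3.2.

import random
from collections import Counter
from math import sqrt

def bow_cosine(a, b):
    # Cosine similarity over simple bag-of-words counts, standing in for the
    # GloVe-based sentence vectors of Section 3.2.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_candidates(spans, threshold=0.8):
    # spans: cited text spans of one facet, stop words already removed.
    # Drop duplicates and spans shorter than three words, then for every pair
    # of highly similar spans keep one of the two at random.
    spans = [s for s in dict.fromkeys(spans) if len(s.split()) >= 3]
    kept = []
    for span in spans:
        similar = next((k for k in kept if bow_cosine(k, span) >= threshold), None)
        if similar is None:
            kept.append(span)
        elif random.random() < 0.5:
            kept[kept.index(similar)] = span
    return kept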
4 Evaluation

We submitted two different runs for Task 1A and Task 1B and a single run for Task 2. Task 1A and Task 1B are scored by the overlap of text spans, measured as the number of sentences in the system output versus the gold standard. Task 2 is scored using the ROUGE family of metrics between i) the system output and the gold standard summary from the reference spans, and ii) the system output and the abstract of the reference paper. The performance of our system in Task 1 and Task 2 is shown below in Table 1 and Table 2, respectively.

Table 1: Performance of our system in Task 1

                                 Run 1   Run 2
Task 1a (Micro Avg)   Precision  0.045   0.051
                      Recall     0.031   0.035
                      F1         0.037   0.042
Task 1a (Micro Avg)   Precision  0.057   0.066
                      Recall     0.037   0.046
                      F1         0.045   0.054
Task 1a (ROUGE-2)     Precision  0.058   0.058
                      Recall     0.132   0.132
                      F1         0.065   0.065
Task 1b (Micro Avg)   Precision  0.045   0.051
                      Recall     0.031   0.035
                      F1         0.037   0.042
Task 1b (Micro Avg)   Precision  0.000   0.400
                      Recall     0.000   0.057
                      F1         0.000   0.100

Table 2: Performance of our system in Task 2 (Run 1)

                               Precision   Recall   F1
Vs. Abstract  - ROUGE-2          0.149      0.278   0.191
Vs. Abstract  - ROUGE-SU4        0.091      0.289   0.133
Vs. Human     - ROUGE-2          0.243      0.152   0.181
Vs. Human     - ROUGE-SU4        0.249      0.099   0.129
Vs. Community - ROUGE-2          0.135      0.138   0.132
Vs. Community - ROUGE-SU4        0.133      0.138   0.119

5 Conclusion and Future Work

In this paper, we presented a brief overview of our system for automatic paper summarization in the Computational Linguistics domain. Recognizing the cited text spans and determining their discourse facets are very challenging tasks for the summarization of scientific papers. We have observed that Task 1A involves much more than a similarity problem; more features that reflect the citation intentions should be explored. For Task 1B, building word bags that contain all the topic words relevant to a facet gave better results than the other approaches we tried, and this approach could be improved further by using bigrams or trigrams. The Task 2 evaluation shows that more features have to be introduced in order to estimate the importance of the cited text spans.

6 References

• Kokil Jaidka, Muthu Kumar Chandrasekaran, Sajal Rustagi and Min-Yen Kan (2016). Overview of the 2nd Computational Linguistics Scientific Document Summarization Shared Task (CL-SciSumm 2016). In Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2016), Newark, New Jersey, USA.

• Surojeet Dasgupta, Abhash Kumar, Dipankar Das, Sudip Kumar Naskar and Shivaji Bandyopadhyay (2016). Word Embeddings for Information Extraction from Tweets. Microblog Track at the Forum for Information Retrieval Evaluation (FIRE) 2016.