=Paper=
{{Paper
|id=Vol-1610/paper16
|storemode=property
|title=Recognizing Reference Spans and Classifying their Discourse Facets
|pdfUrl=https://ceur-ws.org/Vol-1610/paper16.pdf
|volume=Vol-1610
|authors=Kun Lu,Jin Mao,Gang Li,Jian Xu
|dblpUrl=https://dblp.org/rec/conf/jcdl/LuMLX16
}}
==Recognizing Reference Spans and Classifying their Discourse Facets==
BIRNDL 2016 Joint Workshop on Bibliometric-enhanced Information Retrieval and NLP for Digital Libraries
Kun Lu, School of Library and Information Studies, University of Oklahoma, kunlu@ou.edu
Jin Mao, School of Information, University of Arizona, danveno@163.com
Gang Li, School of Information Management, Wuhan University, imiswhu@aliyun.com
Jian Xu, School of Information Management, Wuhan University, xukeywhu@163.com
Abstract: In this shared task, we applied "learning to rank" algorithms with multiple features, including lexical features, topic features, knowledge-based features and sentence importance, to Task 1A by treating reference span finding as an information retrieval problem. Task 1B, discourse facet identification, is treated as a text classification problem that considers features of both citation contexts and cited spans.
Keywords: Learning to rank; Topic model; Facet classification
1 Introduction
The 2nd CL-SciSumm Shared Task follows the TAC 2014 Biomedical Summarization Track on scientific paper summarization. An overview of the shared task, including specific details on the dataset, the competitive results and subsequent analyses for each task, can be found in the shared task overview paper [1]. In this report, we provide a detailed description of the methods we used for Task 1A and Task 1B. Our methods are introduced in Section 2, followed by results in Section 3. Some conclusions are presented in the last section.
2 Methods
2.1 Task 1A
We considered Task 1A as an information retrieval problem. A citance (citation context) is regarded as a query, and sentences from the reference paper (candidate cited spans) are treated as candidate documents. The problem then becomes how to rank the sentences of the reference paper (i.e., candidate cited spans in this report) for a given citance. The sentences of the reference paper most relevant to a citation context are selected as the gold sentences of the citation context. We apply "learning to rank" algorithms to address this problem and exploit multiple features. The explored features are as follows:
• Lexical Features. Bag of words is a widely used text representation method. By representing citation contexts and candidate cited spans with the bag of words model, the lexical similarities between citation contexts and candidate cited spans can be obtained. We chose four candidate lexical similarity features: Cosine similarity, Jaccard similarity, Dice similarity and LCS (longest common subsequence); a sketch of these computations appears after this feature list. When computing cosine similarity, TFIDF term weighting was applied. In the formula below, TF is the number of times a term occurs in a given sentence, while the IDF value of a term is computed from all the 437 papers in the training set (93 papers), development set (155 papers), and test set (229 papers). Thus,
$\mathit{Document\_All\_Number}$ is 437, which is the same for all terms, and $\mathit{Document\_number}(\mathit{Term})$ is the number of different papers in which a term occurs:

$\mathrm{TFIDF}(\mathit{Term}) = \mathrm{TF}(\mathit{Term}) \cdot \ln \frac{\mathit{Document\_All\_Number}}{\mathit{Document\_number}(\mathit{Term})}$ (1)
• Topic Features. The bag of words model is insufficient for handling polysemy and synonymy. Taking topics into account can alleviate this problem. Topic modeling [2] was used to identify latent topics from 10,921 articles of the ACL Anthology Reference Corpus. The topic distributions of both citation contexts and candidate cited spans were inferred with the LDA models. Cosine similarity was then used to measure their topic similarities.
• Knowledge-Based Features. WordNet [3] was used to compute the concept similarity between citation contexts and candidate cited spans. Lin similarity [4], which measures the similarity of two words, was applied. The similarity of two sentences can be computed by accumulating the similarities between their words. We used different combinations of nouns and verbs to calculate similarities: WordNet (N), using only nouns; WordNet (V), using only verbs; WordNet (N,V), using both nouns and verbs; and WordNet (N,Vsep), which is obtained as (WordNet(N) + WordNet(V))/2.
• Sentence Importance. The importance of a candidate cited span in the reference paper is considered a factor influencing whether it is cited or not. TextRank [5], a widely used unsupervised method for extracting keywords or ranking the sentences of a given document, was applied to measure the importance of a sentence in the reference paper. The assumption is that the more important a sentence is in the reference paper, the more likely the sentence belongs to a cited span.
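To make the lexical similarity features concrete, the following Python sketch (an illustration only, not the authors' exact implementation) computes Jaccard, Dice, TF-IDF weighted cosine and LCS-based similarities between a citance and a candidate sentence; the tokenizer, the document-frequency table and the LCS normalization are simplifying assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    # Placeholder tokenizer; the actual preprocessing (stemming,
    # stopword removal) used in the paper is not specified here.
    return text.lower().split()

def jaccard(a, b):
    A, B = set(a), set(b)
    return len(A & B) / len(A | B) if A | B else 0.0

def dice(a, b):
    A, B = set(a), set(b)
    return 2 * len(A & B) / (len(A) + len(B)) if A or B else 0.0

def tfidf_vector(tokens, doc_freq, n_docs):
    # TF = raw count in the sentence; IDF = ln(n_docs / doc_freq[term]),
    # mirroring Equation (1).
    tf = Counter(tokens)
    return {t: c * math.log(n_docs / doc_freq.get(t, 1)) for t, c in tf.items()}

def cosine(a, b, doc_freq, n_docs):
    va = tfidf_vector(a, doc_freq, n_docs)
    vb = tfidf_vector(b, doc_freq, n_docs)
    dot = sum(va[t] * vb.get(t, 0.0) for t in va)
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence over tokens.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

# Hypothetical citance/candidate pair and document-frequency counts.
citance = tokenize("We follow the smoothing approach of the reference paper")
candidate = tokenize("The smoothing approach interpolates bigram and unigram counts")
doc_freq = {"smoothing": 12, "approach": 150, "the": 437}
features = {
    "jaccard": jaccard(citance, candidate),
    "dice": dice(citance, candidate),
    "cosine_tfidf": cosine(citance, candidate, doc_freq, 437),
    # LCS normalized by the longer sentence length (one possible normalization).
    "lcs": lcs_length(citance, candidate) / max(len(citance), len(candidate)),
}
print(features)
```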
Each pair of a citation context and a candidate sentence from the cited article is an instance. Positive instances (i.e., where the candidate sentence is one of the gold standard cited sentences) were assigned higher scores than negative instances. The above features were calculated and fed into "learning to rank" methods. For topic similarity, the number of topics varied from 20 to 200 with a step of 20. The features showing high performance were selected as the final set of features. Then, four learning to rank algorithms from RankLib [6], namely RankBoost [7], RankNet [8], AdaRank [9], and Coordinate Ascent [10], were compared.
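As a sketch of how such (citance, candidate) instances might be serialized for RankLib, assuming its LETOR-style input format (a relevance label, a qid per citance, then indexed feature values); the feature values and ids below are purely illustrative.

```python
def to_ranklib_lines(instances):
    """Serialize (qid, label, features) triples into RankLib's LETOR-style format.

    instances: iterable of (citance_id, relevance, feature_list), where
    relevance is 1 for gold cited sentences and 0 otherwise.
    """
    lines = []
    for qid, label, feats in instances:
        feat_str = " ".join(f"{i}:{v:.4f}" for i, v in enumerate(feats, start=1))
        lines.append(f"{label} qid:{qid} {feat_str}")
    return "\n".join(lines)

# Hypothetical feature vectors [jaccard, topic_200, wordnet_nvsep, textrank] per pair.
instances = [
    (1, 1, [0.31, 0.42, 0.55, 0.018]),   # a gold cited sentence for citance 1
    (1, 0, [0.05, 0.21, 0.37, 0.006]),   # a non-cited candidate for citance 1
]
print(to_ranklib_lines(instances))
```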
2.2 Task 1B
Task 1B is considered a text classification problem. The five discourse facets of a sentence in a reference paper are Aim, Method, Result, Implication, and Hypothesis.
The features adopted by the classifiers are as follows.
• The text of the reference sentence
• The title of the section that the reference sentence belongs to
• The section type of the section that the reference sentence belongs to
• The text of the citation context
• The title of the section that the citation context belongs to
• The section type of the section that the citation context belongs to
Three classifiers from Weka [11], namely Naïve Bayes, Decision Tree and Support Vector Machine (SVM), were applied and compared for Task 1B. The algorithms behind these classifiers are Naive Bayes classification, C4.5, and Sequential Minimal Optimization, respectively. Default parameter settings were used to train the classifiers implemented in Weka.
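As a rough analogue of this setup in Python (a sketch substituting scikit-learn for Weka, not the authors' pipeline), the six features can be concatenated into a single text representation, with section types encoded as pseudo-tokens, and fed to a Naive Bayes classifier; all field names and example instances below are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def build_text(instance):
    # Concatenate the six textual features; section types become pseudo-tokens.
    return " ".join([
        instance["reference_sentence"],
        instance["reference_section_title"],
        "REFSEC_" + instance["reference_section_type"],
        instance["citation_context"],
        instance["citation_section_title"],
        "CITSEC_" + instance["citation_section_type"],
    ])

# Two hypothetical training instances with their facet labels.
train = [
    {"reference_sentence": "We propose a new smoothing method.",
     "reference_section_title": "Introduction", "reference_section_type": "intro",
     "citation_context": "Their method improves smoothing.",
     "citation_section_title": "Related Work", "citation_section_type": "related",
     "facet": "Method"},
    {"reference_sentence": "Results show a 2% gain in accuracy.",
     "reference_section_title": "Experiments", "reference_section_type": "results",
     "citation_context": "They report improved accuracy.",
     "citation_section_title": "Evaluation", "citation_section_type": "results",
     "facet": "Result"},
]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit([build_text(x) for x in train], [x["facet"] for x in train])
print(clf.predict([build_text(train[0])]))
```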
2.3 Evaluation
2.3.1 Task 1A metrics.
Two groups of precision (P), recall (R) and F_1 measures were used to evaluate the
performance in Task 1A. The first group counted the number of sentences returned by
our methods that match the gold standard sentences annotated by the task organizers.
$R = \frac{|G \cap S|}{|G|}, \quad P = \frac{|G \cap S|}{|S|}, \quad F_1 = \frac{2 \cdot R \cdot P}{R + P}$ (2)
where G denotes the set of gold standard sentences and S the set of sentences returned by our methods.
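For example, the sentence-overlap measures of Equation (2) can be computed directly from sets of sentence identifiers (a minimal sketch):

```python
def overlap_prf(gold_ids, system_ids):
    """Precision, recall and F1 over sets of sentence identifiers (Equation 2)."""
    G, S = set(gold_ids), set(system_ids)
    overlap = len(G & S)
    p = overlap / len(S) if S else 0.0
    r = overlap / len(G) if G else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# e.g. gold cited sentences {12, 13}, system returned {13, 20, 21}
print(overlap_prf({12, 13}, {13, 20, 21}))  # (0.333..., 0.5, 0.4)
```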
The second group of measures are the ROUGE_1 [12] (Recall-Oriented Understudy for Gisting Evaluation) measures, which were used as the official comparison measures.
2.3.2 Task 1B metrics.
Precision (P), recall (R) and F_1 measures were calculated to evaluate the performance for each facet. For a given facet, a denotes the number of correct predictions for this facet, b the number of wrong predictions for this facet, and c the number of gold standard cited sentences of this facet that are not predicted as belonging to it.
$R = \frac{a}{a + c}, \quad P = \frac{a}{a + b}, \quad F_1 = \frac{2 \cdot R \cdot P}{R + P}$ (3)
Then, Macro average and Micro average were used to evaluate the system on all
facets. Macro average measures are computed as:
$R_{macro} = \frac{\sum_{f}^{N} R_f}{N}, \quad P_{macro} = \frac{\sum_{f}^{N} P_f}{N}, \quad F_{1,macro} = \frac{2 \cdot R_{macro} \cdot P_{macro}}{R_{macro} + P_{macro}}$ (4)

Micro average measures are computed as:

$R_{micro} = \frac{\sum_{f}^{N} a_f}{\sum_{f}^{N} a_f + \sum_{f}^{N} c_f}, \quad P_{micro} = \frac{\sum_{f}^{N} a_f}{\sum_{f}^{N} a_f + \sum_{f}^{N} b_f}, \quad F_{1,micro} = \frac{2 \cdot R_{micro} \cdot P_{micro}}{R_{micro} + P_{micro}}$ (5)
where f is the facet index, N is the number of facets (i.e., 5). A good classifier should
have both high Macro and Micro average measures.
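A minimal sketch of the per-facet and averaged measures in Equations (3) to (5), given hypothetical per-facet counts a, b and c:

```python
def facet_prf(a, b, c):
    # Per-facet precision, recall and F1 (Equation 3).
    p = a / (a + b) if a + b else 0.0
    r = a / (a + c) if a + c else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_micro(counts):
    """counts: dict facet -> (a, b, c) with correct, wrong and missed predictions."""
    n = len(counts)
    per_facet = {f: facet_prf(*abc) for f, abc in counts.items()}
    # Macro averages (Equation 4): average the per-facet measures.
    p_macro = sum(prf[0] for prf in per_facet.values()) / n
    r_macro = sum(prf[1] for prf in per_facet.values()) / n
    f_macro = 2 * p_macro * r_macro / (p_macro + r_macro) if p_macro + r_macro else 0.0
    # Micro averages (Equation 5): pool the counts over all facets first.
    a_sum = sum(a for a, _, _ in counts.values())
    b_sum = sum(b for _, b, _ in counts.values())
    c_sum = sum(c for _, _, c in counts.values())
    p_micro = a_sum / (a_sum + b_sum) if a_sum + b_sum else 0.0
    r_micro = a_sum / (a_sum + c_sum) if a_sum + c_sum else 0.0
    f_micro = 2 * p_micro * r_micro / (p_micro + r_micro) if p_micro + r_micro else 0.0
    return (p_macro, r_macro, f_macro), (p_micro, r_micro, f_micro)

# Hypothetical counts for the five facets.
counts = {"Aim": (2, 5, 3), "Method": (80, 30, 40), "Result": (10, 8, 15),
          "Implication": (0, 2, 8), "Hypothesis": (0, 1, 7)}
print(macro_micro(counts))
```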
3 Results
In this section, we report the results where the training set is used for training and the development set is used for testing.
3.1 Task 1A results
According to Fig. 1, Jaccard similarity and Dice similarity achieved better F_1 measures than Cosine similarity and LCS. The topic similarity feature (indicated as Topic_number in Fig. 1) showed slightly different performance across different numbers of topics. WordNet based similarity achieved the best results when combining the standalone noun similarities and verb similarities. TextRank showed results similar to topic similarity and WordNet based similarity; however, this feature did not improve the performance of learning to rank. Finally, Jaccard similarity, topic similarity with 200 topics, WordNet (N,Vsep), and TextRank were kept. On the development set, RankNet and AdaRank achieved the best performance, with Overlap/ROUGE_1 F_1 of 0.057/0.211 (Table 1).
Figure 1. The F_1 values (Overlap F_1 and ROUGE_1 F_1) for each ranking feature on the development set
Table 1. The performance of different learning to rank algorithms on the development set

| Learning to rank algorithm | Overlap P | Overlap R | Overlap F_1 | ROUGE_1 P | ROUGE_1 R | ROUGE_1 F_1 |
| RankBoost | 0.008 | 0.015 | 0.010 | 0.129 | 0.171 | 0.147 |
| RankNet | 0.043 | 0.085 | 0.057 | 0.194 | 0.232 | 0.211 |
| AdaRank | 0.043 | 0.085 | 0.057 | 0.194 | 0.232 | 0.211 |
| Coordinate Ascent | 0.012 | 0.024 | 0.016 | 0.153 | 0.190 | 0.169 |
3.2 Task 1B results
Table 2 lists the results for Task 1B. SVM showed the best micro average performance (F_1 = 0.657), while Naïve Bayes achieved the best macro average performance (F_1 = 0.269). It is observed that Naïve Bayes showed balanced performance over the five facets. Both Decision Tree and SVM performed poorly on the Aim, Implication and Hypothesis facets.
Table 2. The performance of the three classifiers on the development set

| Facet | # of Cited Spans | Naïve Bayes (P / R / F_1) | Decision Tree (P / R / F_1) | SVM (P / R / F_1) |
| Aim | 29 | 0.197 / 0.482 / 0.280 | 0.023 / 0.034 / 0.028 | 0 / 0 / 0 |
| Method | 142 | 0.683 / 0.578 / 0.626 | 0.625 / 0.634 / 0.629 | 0.671 / 0.950 / 0.787 |
| Result | 33 | 0.407 / 0.333 / 0.367 | 0.615 / 0.484 / 0.542 | 0.750 / 0.272 / 0.400 |
| Implication | 8 | 0.006 / 0.015 / 0.008 | 0 / 0 / 0 | 0 / 0 / 0 |
| Hypothesis | 7 | 0 / 0 / 0 | 0 / 0 / 0 | 0 / 0 / 0 |
| Micro Avg | - | 0.488 / 0.488 / 0.488 | 0.488 / 0.488 / 0.488 | 0.657 / 0.657 / 0.657 |
| Macro Avg | - | 0.259 / 0.282 / 0.269 | 0.253 / 0.230 / 0.240 | 0.284 / 0.244 / 0.262 |
4 Final run methods and conclusions
For the final run, we used both the training set and the development set for training. We chose Jaccard, Topic_200, WordNet (N+V) and TextRank as the final ranking features and applied the AdaRank learning to rank algorithm to Task 1A. For Task 1B, Naïve Bayes was used as the classifier.

Recognizing cited spans and determining their discourse facets are very challenging for the summarization of scientific papers. Several issues emerged during our study of the tasks. First, both cited span discovery and discourse facet classification are unbalanced problems: negative examples are far more prevalent than positive examples. Addressing this imbalance will be part of our future work. Second, Task 1A involves much more than a similarity problem, because the underlying citation intentions are complex. More features that reflect citation intentions should be explored.
References
[1] Kokil Jaidka, Muthu Kumar Chandrasekaran, Sajal Rustagi, and Min-Yen Kan (2016). Overview of the 2nd Computational Linguistics Scientific Document Summarization Shared Task (CL-SciSumm 2016). To appear in the Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2016), Newark, New Jersey, USA.
[2] Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.
[3] George A. Miller (1995). WordNet: A Lexical Database for English. Communications of the ACM, 38(11), 39-41.
[4] Lin, D. (1998, July). An information-theoretic definition of similarity. In ICML (Vol. 98, pp. 296-304).
[5] Mihalcea, R., & Tarau, P. (2004, July). TextRank: Bringing order into texts. Association for Computational Linguistics.
[6] Dang, V. The Lemur Project - Wiki - RankLib. Lemur Project. [Online]. Available: https://sourceforge.net/p/lemur/wiki/RankLib.
[7] Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (2003). An efficient boosting algorithm for combining preferences. The Journal of Machine Learning Research, 4, 933-969.
[8] Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., & Hullender, G. (2005, August). Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning (pp. 89-96). ACM.
[9] Xu, J., & Li, H. (2007, July). AdaRank: a boosting algorithm for information retrieval. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 391-398). ACM.
[10] Metzler, D., & Croft, W. B. (2007). Linear feature-based models for information retrieval. Information Retrieval, 10(3), 257-274.
[11] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1), 10-18.
[12] Lin, C. Y. (2004, July). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop (Vol. 8).