CIST@CLSciSumm-17: Multiple Features Based Citation Linkage, Classification and Summarization

Lei Li, Yazhao Zhang, Liyuan Mao, Junqi Chi, Moye Chen, and Zuying Huang

Center for Intelligence Science and Technology (CIST), School of Computer, Beijing University of Posts and Telecommunications (BUPT), Beijing, P.R. China
{leili,yazhao,circleyuan,cjq,moyec,zoehuang}@bupt.edu.cn

Abstract. This paper describes the methods and experiments we applied to CLSciSumm-17. For citation linkage, we try a Convolutional Neural Network, word vectors and sentence similarities. For facet classification, we explore additional useful features, rules, an SVM classifier and a fusion method. For the structured summary, we use a linear combination of five classical features together with a DPP-based diversity sampling method. The test-data results show that we obtain the best performance for both facet classification and summarization.

Keywords: CNN, word vector, SVM, hLDA, DPP

1 Introduction

With the rapid development of Computational Linguistics (CL), a rich, complex and continually expanding body of resources has flooded into the scientific literature of this domain. Literature surveys and review articles in CL do help readers obtain a gist of the state of the art in research on a topic. However, literature survey writing is labor-intensive, and a specific literature survey is not always available for every topic of interest. CLSciSumm-17 has highlighted the challenges and relevance of the scientific summarization problem.

In this paper, we describe the strategies, methods and experiments we applied to CLSciSumm-17 [1]. The training dataset contains 30 topics, each consisting of one reference paper (RP) and a set of citing papers (CPs) that all contain citations to the RP. Two tasks are defined on this data.

Task 1A requires that for each citance (the set of citation sentences), we identify the spans of text (cited text spans, CTS) in the RP that it refers to. We try a Convolutional Neural Network (CNN) with word vectors to match a citance with candidate CTS. We also have existing methods that use sentence similarity based on various traditional features and rules, so we search for a good hybrid method that takes advantage of both the CNN and the existing methods. Task 1B then requires identifying, for each cited text span found in Task 1A, which facet it belongs to, out of a predefined set of facets. Here we plan to explore additional useful features and machine learning methods besides SVM.

Finally, Task 2 requires generating a structured summary, of at most 250 words, of the RP and of the community discussion of the paper represented in the citances. We aim to extract summaries with high quality, high diversity and low redundancy. The hLDA (hierarchical Latent Dirichlet Allocation) topic model, which is able to organize topics into a tree-like structure, is adopted for content modeling. We combine hLDA knowledge with several other classical features using different weights and proportions to evaluate quality, and we enhance summary diversity using Determinantal Point Processes (DPPs).

2 Related Work

Interest in information extraction and retrieval has increased recently, and there are shared tasks in this domain such as Online Forum Summarization (OnForumS) in MultiLing 2017 and the Computational Linguistics Scientific Document Summarization Shared Task.
Through these shared tasks, many methods have been proposed for content linking. Cao et al. [2] simplify it into a ranking problem that only needs to select the first item, and adopt SVM Rank to handle it. Malenfant and Lapalme [3] transform each word into its WordNet synonym group (synset) before the subsequent idf (inverse document frequency) calculation step. Lu et al. [4] use WordNet to compute the concept similarity between sentences. Nomoto [5] combines TFIDF and a single-layer neural network for content linking.

In order to identify the linkage between a paper citation and its cited text spans in the RP efficiently, we need to capture the latent information of natural language sentences, and more and more methods have been proposed for this purpose. CNNs [6] have shown good performance in sentence classification, word embeddings [7] are useful for digging out semantic information, and WordNet can be used to calculate the similarity between words. All of the above are used in our experiments.

For CL summary generation, we consider not only the original paper but also the information provided by the citances. Li et al. [8] provide a method based on a linear combination of multiple features. Abu-Jbara and Radev [9] focus on the coherence and readability of generated citation summaries. Their objects for summarization differ from ours, however, since in this paper we focus on summarizing the reference text while taking the citation texts into account. It is a common view that summaries should take redundancy as well as quality into account, but unfortunately most previous methods divide summary generation into two sequential procedures: first, machine learning methods [13] select a subset of sentences with higher quality; then, the redundancy of the summary is controlled [14]. In our method, we consider diversity, redundancy and quality at the same time, and we use a new method based on DPPs [12] to enhance the diversity of summaries.

3 Methods

3.1 Citation Linkage

For citation linkage, we need to identify the CTS in the RP that most accurately reflect the citation in the CP; we call this content linking. In essence, it comes down to finding links between sentences with similar meanings, so computing sentence similarity based on various features is our major work. We use traditional features such as Jaccard similarity and idf to capture syntactic information, while word vectors, WordNet and a CNN are used to dig out deeper semantic information. Finally, we fuse the above features to obtain the result.

Feature Extraction

1) Three lexicons: a) We manually picked high-frequency words from the reference texts in the training corpus and expanded them through WordNet and word vectors, giving Lexicon 1 (high-frequency lexicon). b) We trained an LDA (Latent Dirichlet Allocation) model on the reference paper and citing papers of each topic independently to obtain a lexicon of 30 latent topics, giving Lexicon 2 (LDA lexicon). c) We obtained the co-occurrence degree between words from word frequency statistics over citation texts and their reference texts in the training corpus, giving Lexicon 3 (co-occurrence lexicon).

2) Two sentence similarities: One is idf similarity, where we add up the idf values of the words shared by two sentences. The other is Jaccard similarity, the size of the intersection of the two sentences' word sets divided by the size of their union. Both are illustrated in the sketch below.
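As a concrete illustration of these two similarities, here is a minimal sketch (our own, not the competition code; tokenization is assumed to be done elsewhere):

```python
import math
from collections import Counter

def build_idf(tokenized_sentences):
    """Compute idf for every word over a list of tokenized sentences."""
    n = len(tokenized_sentences)
    df = Counter(w for s in tokenized_sentences for w in set(s))
    return {w: math.log(n / df[w]) for w in df}

def idf_similarity(s1, s2, idf):
    """Sum the idf values of the words the two sentences share."""
    return sum(idf.get(w, 0.0) for w in set(s1) & set(s2))

def jaccard_similarity(s1, s2):
    """|intersection| / |union| of the two sentences' word sets."""
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b) if a | b else 0.0
```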
3) Two context similarities: We calculate context similarity for both idf similarity and Jaccard similarity. The F1 performance of the above features on the training corpus is shown in Table 5 of Appendix B.

4) Word vector: We trained a fixed-dimension vector for every word using Word2Vec, and then add the word similarities together to represent sentence similarity [8]. The 400-dimension word vector with a window size of 15 performs best, as shown in Table 6 of Appendix B, so we choose it for the following experiments.

5) WordNet: WordNet is an English lexicon containing nouns, verbs, adjectives and adverbs, which can be used to calculate the similarity between two words with the same part of speech. We use six similarity measures: jcn, lin, lch, res, wup and path similarity. Since WordNet only supports word-level similarity, we compute sentence similarity with the same algorithm as in the word vector section, replacing cosine similarity with WordNet word similarity. The F1 performances can be seen in Table 7 of Appendix B.

6) CNN: A CNN can find deep semantic information in sentences and is widely used in natural language processing [11]. We feed word vectors into the CNN and take the probability of content linking from its output, using this output probability to represent the similarity of the input sentence pair. We investigated the lengths of the reference and citation sentences and found that almost all of them are shorter than 80 words, so we fix the sentence length at 80 (160 for a sentence pair) and pad sentences shorter than 80 with zero vectors at the corresponding positions. We use 200-dimension word vectors. For the CNN structure, we tried several configurations with different convolution kernels and other parameters, and based on experiments on the training corpus we selected one of them for the testing corpus.

Method for Linkage

1) Voting method: We tried different weights (of the individual features) and proportions (the number of sentences each feature selects) for combining the features in our experiments, and obtained four results (run 1, run 2, run 5, run 6) through a voting system: the text spans with the highest number of votes are chosen as the sentences in the reference paper corresponding to the citation text. The Jaccard Focused method (run 3) takes Jaccard similarity as the major feature and adds the other features as a supplement. The Jaccard Cascade method (run 4) takes the two sentences with the top Jaccard values as the basic answer, then combines the other features to find the two further sentences with the highest values as the improved answer. In the end, based on our experiments, we choose 4 sentences as the answer.
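How such a weighted vote can be assembled is sketched below (our own illustration: each feature casts a vote of weight W for each of its top-P sentences; the actual per-run weights and proportions are those in Table 1):

```python
def weighted_vote(feature_scores, weights, proportions, n_answers=4):
    """feature_scores: {feature name: list of scores, one per RP sentence}.
    Each feature votes, with its weight, for its top-P sentences; the
    n_answers sentences with the highest total vote are returned."""
    n_sents = len(next(iter(feature_scores.values())))
    votes = [0.0] * n_sents
    for name, scores in feature_scores.items():
        ranked = sorted(range(n_sents), key=lambda i: scores[i], reverse=True)
        for i in ranked[:proportions[name]]:
            votes[i] += weights[name]
    return sorted(range(n_sents), key=lambda i: votes[i], reverse=True)[:n_answers]
```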
Table 1 shows the parameters of each run, where W means weight and P means proportion.

Table 1. The parameters of runs 1 to 6, given as W / P (JS means 10-fold Jaccard similarity)

Feature                      run1      run2      run3      run4      run5      run6
Idf similarity               1 / 8     1 / 7     0.7 / 15  1.5 / 16  1.5 / 15  0.8 / 15
Idf context similarity       -         0.5 / 4   0.5 / 15  1 / 18    1.5 / 20  0.5 / 20
Jaccard similarity           1 / 12    1 / 3     JS / 7    -         2 / 10    1 / 8
Jaccard context similarity   1 / 12    0.5 / 8   0.7 / 15  1.5 / 15  2 / 15    0.7 / 15
Word vector                  1 / 10    0.5 / 8   0.5 / 25  -         0.8 / 10  0.5 / 15
Lexicon 2                    -         0.3 / 2   -         -         -         -
Lexicon 3                    -         0.4 / 2   0.2 / 25  0.5 / 15  0.5 / 15  0.5 / 15
Lch                          -         -         -         -         1 / 10    0.5 / 10
Path                         -         -         -         -         1 / 10    0.5 / 10

2) CNN-based method: We use the answers in the training data as positive samples and randomly choose sentences outside the answers as negative samples, keeping their counts balanced. For the testing set, we pair every sentence in the reference paper with a citation text as input, and take the output probability as a feature value. Using this feature alone performs badly, so we adopt the following procedure for the CNN: take the top 40 sentences according to the CNN output, then combine Jaccard similarity and idf similarity using Equation 1, whose parameters were obtained through experiments on the training corpus:

similarity = 10 × Jaccard_similarity + 0.1 × idf_similarity    (1)

Finally, we choose the 4 sentences with the highest similarity as answers.

3.2 Facet Classification

The Subtitle Rule assigns a facet according to the subtitles of the sentence pair. The High Frequency Word Rule assigns a facet based on the number of high-frequency words in the sentence pair. The Subtitle and High Frequency Word Combining Rule combines the two. The SVM classifier uses features of the sentence pair: Location of Paragraph, Document Position Ratio, Paragraph Position Ratio, and Number of Citations or References. The ideas of the Voting method and the Fusion method are similar. The details of the above methods mostly follow [8], except for some details of the High Frequency Word Rule, the SVM classifier and the Fusion method: we update the high-frequency words and retrain the SVM classifier, and in the Fusion method we combine all the results to obtain a fused result. The details of the features used in the SVM classifier are given in Appendix A.

3.3 Summarization

Figure 1 shows our framework for Task 2: pre-processing (document merging, sentence filtering, input file generation for hLDA); feature selection (Jaccard similarity and the SP, SL, TS, CTS and HTM sentence-quality features); feature combination; DPP-based sentence sampling; and structured summary generation (truncating the summary to the human summary length and removing multiple whitespaces).

Fig. 1. The framework for structured summary generation

Pre-processing: The source documents provided by CLSciSumm-17 have relatively high quality, but some XML coding errors remain. Besides, we need a specific data format to train our hLDA feature.

1. Document merging: merge the content of the RP and the citations into one document. It is worth mentioning that we do not include sentences from the abstract of the RP unless they are selected by Task 1A. All documents are converted to lowercase.

2. Sentence filtering: since the corpus is generated from CL papers, it contains equations, figures, tables and so on, which contribute little to summary generation. We first use WordNet to check whether a word is useful, and then filter out the sentences whose proportion of useless words is greater than 0.5 (see the sketch below).
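A minimal sketch of this filtering step (our own illustration; the 0.5 threshold is the one stated above, and NLTK's WordNet interface stands in for whatever lookup the actual system used):

```python
from nltk.corpus import wordnet  # requires: nltk.download('wordnet')

def is_useful(word):
    """Treat a token as useful if WordNet has at least one synset for it."""
    return len(wordnet.synsets(word)) > 0

def keep_sentence(tokens, max_useless_ratio=0.5):
    """Drop sentences dominated by equation/figure/table debris."""
    if not tokens:
        return False
    useless = sum(1 for w in tokens if not is_useful(w))
    return useless / len(tokens) <= max_useless_ratio
```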
3. Input file generation for hLDA: for the remaining words, we build a dictionary for each document containing the words and their frequencies, indexed from 1 to the word-list size. Finally, we generate an input file for hLDA modeling, in which each line represents one sentence as word index-frequency pairs, such as:

[number of words in sentence_i] [word index_A : frequency_A] ...

Feature Selection: According to our previous work [8], Sentence Length (SL), Sentence Position (SP) and CTS are useful features for summary generation from CL papers, and we reuse them here. In addition, we upgrade the hierarchical Topic Model (HTM) feature and apply Title Similarity (TS) in our experiments.

1. Hierarchical Topic Model (HTM): hLDA organizes a document into a tree-like structure. Each word is assigned to a node, and each sentence is assigned to a path from the root node to a leaf node. Each node is thus regarded as a topic, a probability distribution over words, and each path is regarded as a theme. Based on this, we believe that both topics and themes carry valuable information for summary generation. We adapt the method of [10] to exploit the hLDA information; different from [10], we also consider the theme distribution when measuring the contribution of a sentence to the summary, using Equation 2:

q_i = ∑_{j=1}^{m} (α_j T_j + Freq_j) + tp_i    (2)

where T_j is the distribution score of word_j in the i-th sentence calculated by hLDA, α_j is the pre-defined weight of T_j according to our former experiments, Freq_j is the frequency of word_j in the current hLDA node, and tp_i is the theme distribution score of the i-th sentence.

2. Title Similarity (TS): Title Similarity is the cosine similarity between each sentence and the document title, calculated with Equation 3:

q_i = (tf_{s_i} · tf_{s_title}) / (|tf_{s_i}| |tf_{s_title}|)    (3)

where s_title and s_i represent the title and the i-th sentence respectively, and tf denotes their term-frequency vectors.

Structured Summary Generation: To control the redundancy of the generated summary, we first use Jaccard similarity to measure the similarity between sentences, and then use DPPs to enhance diversity. Determinantal Point Processes (DPPs) are elegant probabilistic models of global negative correlations, mostly used in quantum physics to study reflected Brownian motions. In our method, we only consider discrete DPPs and follow the definition of Kulesza and Taskar [12]. DPP-based subset sampling considers diversity, quality and redundancy at the same time, and the subsets sampled using DPPs are maximally diverse. Thanks to the contribution of [15], we can construct a DPP from a positive semidefinite matrix L, whose entries can be written as

L_{ij} = q_i ϕ_i^⊤ ϕ_j q_j

where q_i ∈ R+ measures the quality of element i, and ϕ_i^⊤ ϕ_j measures the similarity between elements i and j. In our method, we use a combination of the previous five features (SP, SL, TS, CTS, HTM) to measure quality, and Jaccard similarity to measure redundancy:

q_i = ∑_{k=1}^{5} φ_k q_{ki}    (4)

where φ_k is the combination proportion of the corresponding feature and q_{ki} represents the k-th quality feature of s_i. For example, with indices starting from 0 (as for φ_0, ..., φ_4 in Table 4), q_{11} represents the second feature (SL) of the second sentence in the article.

Given the matrix L, we adapt the DPP-based sampling method to compute a sentence subset D′ of the corresponding paper, shown in Table 2.
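Before the pseudocode in Table 2, we give a minimal runnable companion sketch (our own, following the sampling algorithm of [12]; it assumes the quality vector from Equation 4 and the Jaccard similarity matrix are precomputed):

```python
import numpy as np

def dpp_sample(quality, sim, max_len, rng=None):
    """Sample a diverse sentence subset from the L-ensemble DPP.
    quality: (n,) vector of q_i from Equation 4; sim: (n, n) Jaccard
    similarity matrix (phi_i^T phi_j). Returns sentence indices."""
    rng = rng or np.random.default_rng()
    # Kernel L_ij = q_i * sim_ij * q_j
    L = quality[:, None] * sim * quality[None, :]
    lam, V = np.linalg.eigh(L)
    # Phase 1: keep each eigenvector with probability lambda / (lambda + 1)
    Vk = V[:, rng.random(len(lam)) < lam / (lam + 1.0)]
    Y = []
    # Phase 2: pick items one at a time, projecting the basis away from e_i
    while Vk.shape[1] > 0 and len(Y) < max_len:
        p = (Vk ** 2).sum(axis=1) / Vk.shape[1]  # Pr(i) = (1/|V|) sum_v (v^T e_i)^2
        i = rng.choice(len(p), p=p)
        Y.append(i)
        j = np.argmax(np.abs(Vk[i]))             # a column with nonzero i-th entry
        v = Vk[:, j].copy()
        Vk = np.delete(Vk, j, axis=1)
        Vk -= np.outer(v, Vk[i] / v[i])          # zero out the i-th coordinate
        if Vk.shape[1] > 0:
            Vk, _ = np.linalg.qr(Vk)             # re-orthonormalize the basis
    return Y
```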
Table 2. DPP sampling method for diverse summary extraction, where S is the similarity matrix, D is the document, and q_i is the quality of the i-th sentence calculated with Equation 4

Input: q_i, S, D, max_len
  quality_vec = [q_i for i in D]
  matrix_L = quality_vec * S * quality_vec^⊤
  (v_n, λ_n) = eigen_decompose(matrix_L)
  J = ∅
  for n = 1, 2, ..., N do
    J = J ∪ {n} with probability λ_n / (λ_n + 1)
  V = {v_n}, n ∈ J
  Y = ∅
  while |V| > 0 and len(Y) < max_len do
    select i from D with Pr(i) = (1/|V|) ∑_{v∈V} (v^⊤ e_i)^2
    Y = Y ∪ D[i]
    V = V⊥, an orthonormal basis for the subspace of V orthogonal to e_i
Output: summary Y

Finally, in order to extract a high-quality structured summary, we make full use of the prior knowledge that a structured summary contains four parts: Introduction, Methods, Results and Conclusion. We therefore use the facets obtained in Task 1B to help extract the summary sentences from D′, taking two or three sentences for each part of the summary if that part exists, and removing redundant candidate sentences. Besides, we also use q_i calculated with Equation 4, without DPPs, to select the top-N sentences as a summary. We call this the Feature Combination (FC) method; it is a comparison system for examining the effectiveness of the DPP-based sampling method.

4 Experiments

Table 3 shows the official results of the 7 runs we submitted.

Table 3. Results on test data

        Task 1A F-score   Task 1B F-score   ROUGE-2 of Task 2
Runs    macro   micro     macro   micro     Abstract  Human   Community
run1    0.113   0.107     0.373   0.107     0.341     0.173   0.187
run2    0.086   0.067     0.326   0.067     0.322     0.225   0.195
run3    0.097   0.086     0.365   0.086     0.327     0.275   0.204
run4    0.117   0.105     0.408   0.105     0.351     0.156   0.184
run5    0.109   0.100     0.392   0.100     0.318     0.153   0.192
run6    0.091   0.077     0.330   0.077     0.331     0.184   0.185
run7    0.100   0.084     0.351   0.084     0.240     0.170   0.163

4.1 Citation Linkage and Facet Classification

From run 1 to run 6, we use the above-mentioned methods respectively for Task 1A. Run 7 uses the CNN method; since it uses the whole corpus for training, we have no training result for it [8]. We evaluate our methods on the training dataset using the evaluation scripts offered on the CL-SciSumm Shared Task 2017 official website. We evaluated six methods for facet classification on four results of Task 1A and selected the best method for each result: the best facet classification is obtained by the voting method for every Task 1A result except Jaccard Focused, for which the High Frequency Word Rule is best. The official results in Table 3 show that our methods perform very well on the test data; run 4 performs best in macro-average F1.

4.2 Summarization

According to our experiments on the training dataset, we selected seven different parameter settings for the testing dataset, obtaining the seven runs submitted to the CLSciSumm-17 competition. Taking the human-written summary as the only gold summary, the manual ROUGE values of our experiments on the training dataset are shown in Table 4.

Table 4. The ROUGE-1 values for the training dataset

id     type   φ0  φ1  φ2  φ3  φ4   human     community
run1   FC     3   0   2   1   0    0.44349   0.46051
run2   FC     1   0   0   0   0    0.43771   0.42914
run3   FC     1   0   1   3   0    0.44304   0.47556
run4   DPPs   1   0   0   0   0    0.44696   0.39661
run5   DPPs   3   0   2   1   1    0.43848   0.40638
run6   DPPs   3   0   0   1   0    0.44598   0.40549
run7   DPPs   1   0   1   3   0    0.44276   0.40126

Table 4 shows that all FC methods perform better than DPPs on the community summary, whereas DPPs mostly perform better than FC on the human summary.
The gold-standard summary provided by CLSciSumm-17 includes material from the cited text spans, and the CTS feature chooses sentences from exactly this source, so run 3 obtains the best precision, as we anticipated. DPPs attempt to sample sentences from a global perspective, considering diversity, quality and redundancy at the same time; this inspiration happens to coincide with the way we humans address the same problem. During our experiments we also found some interesting results: for example, DPPs perform better than FC under the same parameter settings. Table 3 shows that run 4 performs best among the DPP-based methods. We therefore believe that the clusters modeled by unsupervised hLDA capture some latent topic messages, although they are not directly matched to the pre-defined facets, which leads to the poorer performance in this pre-defined structured summarization task. We expect the approach to work better in open domains without pre-defined structures, especially in the multilingual case, and to achieve better results in combination with DPPs. In fact, among all the systems participating in Task 2, our run 3 also performs best on ROUGE-2 vs. the human summary and the community summary, and our run 4 performs best on ROUGE-2 vs. the abstract. All these results verify the effectiveness of our methods once more.

5 Conclusion and Future Work

The official test results show that the proposed methods perform excellently. In Task 1A we sought a similarity calculation method that represents the true relationship between two sentences. For Task 1B we used a fusion method to obtain a fused facet classification. Finally, in Task 2 we considered quality features, a redundancy feature and the diversity of the RP and the cited text spans. In future work we will try to find better ways of using more semantic features for citance linkage, and will continue looking for better methods of choosing the fewest sentences that cover the most information.

References

1. Jaidka, K., Chandrasekaran, M.K., Jain, D., Kan, M.-Y.: Overview of the CL-SciSumm 2017 Shared Task. In: Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017), Tokyo, Japan, CEUR (2017)
2. Cao, Z., Li, W., Wu, D.: PolyU at CL-SciSumm 2016. In: Proc. of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2016), pp. 132–138. Newark, NJ, USA (June 2016)
3. Malenfant, B., Lapalme, G.: RALI System Description for CL-SciSumm 2016 Shared Task. In: Proc. of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2016), pp. 146–155. Newark, NJ, USA (June 2016)
4. Lu, K., Mao, J., Li, G., Xu, J.: Recognizing reference spans and classifying their discourse facets. In: Proc. of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2016), pp. 139–145. Newark, NJ, USA (June 2016)
5. Nomoto, T.: NEAL: A neurally enhanced approach to linking citation and reference. In: Proc. of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2016), pp. 168–174. Newark, NJ, USA (June 2016)
6. Kim, Y.: Convolutional Neural Networks for Sentence Classification. arXiv preprint (2014)
7. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint (2013)
8. Li, L., Mao, L., Zhang, Y., et al.: Computational linguistics literature and citations oriented citation linkage, classification and summarization. International Journal on Digital Libraries, 1–18
9. Abu-Jbara, A., Radev, D.: Coherent citation-based summarization of scientific papers. In: Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 500–509. Portland, Oregon (2011)
10. Huang, T., Li, L., Zhang, Y.: Multilingual multi-document summarization based on multiple feature combination (2016)
11. Collobert, R., Weston, J., Bottou, L., et al.: Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, 2493–2537 (2011)
12. Kulesza, A., Taskar, B.: Determinantal point processes for machine learning. arXiv preprint arXiv:1207.6083 (2012)
13. Gambhir, M., Gupta, V.: Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1), 1–66 (2017)
14. Wong, K.-F., Wu, M., Li, W.: Extractive summarization using supervised and semi-supervised learning. In: Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1, pp. 985–992 (2008)
15. Borodin, A.: Determinantal point processes (2009)

Appendix A

The meaning of the features in Task 1B:

1) Location of Paragraph: the order number of the paragraph in which the sentence is located.
2) Document Position Ratio: the ratio of the sentence id (Sid) to the total number of sentences in the corresponding document.
3) Paragraph Position Ratio: the ratio of the sentence id within the paragraph (Ssid) to the total number of sentences in the corresponding paragraph.
4) Number of Citations or References: the number of Citation Offsets or Reference Offsets.

The details of the methods:

1 Rule-based methods

1.1 Subtitle Rule: First, we examine whether the subtitles of the reference sentences and the citance contain the following facet words: Hypothesis, Implication, Aim, Results and Method. If a subtitle contains exactly one of these words, the sentence is directly classified into the corresponding facet. If it contains more than one, it is classified into all of those facets. If it contains none of them, we classify it into the Method facet (a minimal sketch of this rule follows the rule list below).

1.2 High Frequency Word Rule: We first count the high-frequency words of the five facets in the training set and the development set. To improve sentence coverage, we expand the high-frequency words with similar words for each facet. We set an appropriate threshold for each facet: if the number of high-frequency words of any facet in the sentence exceeds the corresponding threshold, we take the facet with the highest coverage as the final class; if several facets have the same coverage, we classify according to the order Hypothesis, Implication, Aim, Results, Method. If no facet reaches its threshold, we classify the sentence as Method.

1.3 Subtitle and High Frequency Word Combining Rule: We first apply the Subtitle Rule to the testing set. If the result is not one of the five facets of Hypothesis, Implication, Aim, Results and Method, we then use the High Frequency Word Rule to get the final facet.
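The Subtitle Rule of 1.1 reduces to a few lines; the sketch below is our own illustration (naive case-insensitive substring matching; the actual system may normalize subtitles differently):

```python
FACETS = ("Hypothesis", "Implication", "Aim", "Results", "Method")

def subtitle_rule(subtitle):
    """Facets whose keyword appears in the subtitle; Method as fallback."""
    hits = [f for f in FACETS if f.lower() in subtitle.lower()]
    return hits if hits else ["Method"]

# e.g. subtitle_rule("5 Results and Discussion") -> ["Results"]
```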
2 SVM Classifier: We extract the four features listed above for each sentence. The features from the citance sentence and the reference sentence then form an 8-dimension vector for each reference-citation sentence pair. We train SVMs to obtain five classifiers, one per facet class. To solve the problem of unbalanced training data, we set different weights for different classes. If a pair cannot be assigned any of the five facets, we classify it into the Method class.

3 Voting Method: We combine the results from the Subtitle Rule, the High Frequency Word Rule and the SVM classifier, and take the result with the most votes.

4 Fusion Method: We run the above methods on each run result obtained in Task 1A and choose the best one as the final result. We also tried a fusion method that combines all the run results of the above methods obtained in Task 1A: we count the numbers of Method, Results, Aim, Hypothesis and Implication labels and set an appropriate threshold for each facet class to obtain the final facet.

Appendix B

Table 5. The performance (F1) of single syntactic-information features

Feature                       N=3       N=4       N=5
high-frequency (Lexicon 1)    0.00592   0.00728   0.01028
LDA (Lexicon 2)               0.01999   0.02549   0.02674
co-occurrence (Lexicon 3)     0.22724   0.21663   0.20257
Idf similarity                0.08808   0.08920   0.09152
Idf-context similarity        0.06736   0.06735   0.06478
Jaccard similarity            0.09400   0.09102   0.08740
Jaccard-context similarity    0.05996   0.05947   0.06015

Table 6. The performance (F1) of different word vectors (window size, dimension)

Window, dimension   N=3        N=4       N=5
15, 400             0.07698    0.07828   0.07661
15, 300             0.07550    0.07585   0.07455
15, 200             0.074012   0.07100   0.06941
10, 400             0.076980   0.07828   0.07609
10, 300             0.076240   0.07403   0.07352
10, 200             0.069578   0.07160   0.07147
8, 400              0.074760   0.07646   0.07404
8, 300              0.074759   0.07524   0.07609
8, 200              0.071058   0.07039   0.07352

Table 7. The performance (F1) of different similarities in WordNet

Feature   N=3       N=4       N=5
Jcn       0.02147   0.03095   0.03290
Lin       0.02147   0.03095   0.03290
Lch       0.04885   0.05097   0.04884
Res       0.02147   0.03095   0.03290
Path      0.06292   0.06675   0.06684
Wup       0.04145   0.04430   0.04216