<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Categorization of Computer Science Research Papers using Knowledge Bases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shashank Gupta</string-name>
          <email>shashank.gupta@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Priya Radhakrishnan</string-name>
          <email>priya.r@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Umang Gupta</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manish Gupta</string-name>
          <email>manish.gupta@iiit.ac.in</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasudeva Varma</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>International Institute of Information Technology, Hyderabad</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Microsoft</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <fpage>38</fpage>
      <lpage>42</lpage>
      <abstract>
        <p>Automatic categorization of computer science research papers using just their abstracts is a hard problem to solve. This is due to the short text length of the abstracts. Moreover, abstracts are a general discussion of the topic with few domain-specific terms. These factors make it hard to generate good representations of abstracts, which in turn leads to poor categorization performance. To address this challenge, external Knowledge Bases (KB) like Wikipedia, Freebase etc. can be used to enrich the representations of abstracts, which can aid the categorization task. In this work, we propose a novel method for enhancing classification performance of research papers into ACM computer science categories using knowledge extracted from related Wikipedia articles and Freebase entities. We use state-of-the-art representation learning methods for feature representation of documents, followed by a learning to rank method for classification. Given the abstracts of research papers from the Citation Network Dataset containing 0.24M papers, our method of using KBs outperforms a baseline method and the state-of-the-art deep learning method in the classification task by 13.25% and 5.41% respectively, in terms of accuracy. We have also open-sourced the implementation of the project4.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>One of the difficulties faced in the categorization of papers in conference proceedings is automatically
identifying the categories of research papers from the standard ACM computing classification system1. It is
important to find the right category of the paper submitted by authors for several purposes, which include
sending the paper to a panel with relevant reviewers according to the category, publishing the paper under the
correct category, and so on. Given the limited amount of content in the abstract and its very high-level
discussion of the topic with few domain-specific terms, finding the category of a paper using just the abstract is a
challenging and hard problem.</p>
      <p>In this paper we address this problem and propose a novel method to leverage external Knowledge Bases (KB)
to improve the performance of short text categorization, using the learning to rank framework [Liu09].</p>
      <p>We evaluate our method on a large dataset of abstracts with the aim of classifying each paper into one of the
24 ACM computer science categories. Our method outperforms the baseline method, which does not use any
external information, by 13.25% and outperforms the existing state-of-the-art model in text categorization by
5.41%, in terms of accuracy.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Traditional methods for text classification work by representing a document using human-curated features like
TF-IDF features, followed by a linear classifier like SVM [Joa98]. Due to the bag-of-words assumption and the
sparsity induced by the high dimensionality of the TF-IDF feature vector, these methods do not perform very well.
Another approach to this problem is to apply dimensionality reduction methods to the TF-IDF feature vector to
overcome the sparsity problem. These methods include Latent Semantic Analysis (LSA) [DDF+90] and Latent
Dirichlet Allocation (LDA) [BNJ03].</p>
      <p>Recent advancements in distributional representations of text resulted in better representation schemes for
the document. Some examples of such techniques are word2vec [MSC+13], paragraph2vec [LM14] and GloVe
[PSM14]. Mikolov et al. [MSC+13] demonstrated the superiority of distributed representation methods over
classical representation methods in the sentiment analysis task.</p>
      <p>The success of deep learning methods in the fields of computer vision and speech processing inspired their
application in Natural Language Processing (NLP) [CWB+11]. Combined with superior representation
learning methods, these methods have proven to be state-of-the-art in a variety of NLP tasks like sentiment
analysis [dSG14], the document similarity task [LL13], etc. For the text categorization problem, the current
state-of-the-art models are based on Convolutional Neural Networks (CNN) [Kim14].</p>
      <p>Our main contribution lies in using the learning to rank framework to combine KB information with the text, using
state-of-the-art representation learning methods for text.</p>
      <sec id="sec-2-1">
        <title>Learning to Rank</title>
        <p>In a typical setting, the learning to rank method is defined as follows. We are given a query qi ∈ Q and a set of
N candidate documents (di1, di2, ..., diN). For each document dij, there is a binary relevance label yij such that
yij ∈ {0, 1}, where a label of 1 indicates that the candidate document is relevant and 0 otherwise. Given this
information, the goal of the learning to rank method is to learn a function h that assigns a higher score to relevant
documents than to non-relevant documents. Formally, learning to rank tries to learn the function h defined
as follows.</p>
        <p>h(w, (qi, dij)) → R (1)</p>
        <p>where w is the set of parameters for the function, and (qi, dij) denotes a feature vector representation of the query
and the document combined.</p>
        <p>The learning to rank framework has these two broad categories:</p>
        <p>Pointwise Approach, where the training instances are (qi, dij, yij) and a binary classifier is trained over
input pairs (qi, dij), defined formally as h(w, (qi, dij)) → yij, with the goal to predict whether the document
dij is relevant to the query qi or not.</p>
        <p>Pairwise Approach, where the model is trained to score correct pairs higher than incorrect pairs with a
fixed margin.</p>
        <p>While it can be argued that pairwise models can give better results than pointwise models, the primary focus
of this work is on generating a good combined representation for the abstract and the corresponding KB entity
which can be used for classification, rather than capturing different aspects of similarity for ranking. Hence, in
this paper we adopt the pointwise method of the learning to rank framework.</p>
      </sec>
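      <p>The pointwise approach above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: the concatenation feature map and the linear-plus-sigmoid scorer are simplifying assumptions, and all feature vectors are random stand-ins.
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def combine(q_vec, d_vec):
    # feature representation of query and document combined:
    # here, simply their concatenation
    return np.concatenate([q_vec, d_vec])

def score(w, b, features):
    # h(w, (q, d)) -> R, squashed to (0, 1) by the sigmoid
    return sigmoid(np.dot(w, features) + b)

rng = np.random.default_rng(0)
q = rng.normal(size=4)                 # toy query features
d_rel = q + 0.1 * rng.normal(size=4)   # candidate similar to the query
d_irr = rng.normal(size=4)             # unrelated candidate

w = np.ones(8)                         # toy parameters, one per feature
s_rel = score(w, 0.0, combine(q, d_rel))
s_irr = score(w, 0.0, combine(q, d_irr))
```
A pointwise trainer would fit w and b so that relevant pairs score near 1 and non-relevant pairs near 0.</p>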
    </sec>
    <sec id="sec-3">
      <title>Our Method</title>
      <p>Let (wc1, wc2, ..., wcK) denote the set of candidate KB entities corresponding to each of the K categories. Let
(y1, y2, ..., yK) denote the set of K labels, each corresponding to a category. For each paper (research article)
dp in the dataset, there is an associated value for each yi such that yi is 1 if dp belongs to category ci and 0
otherwise. For each paper dp, our goal is to rank the KB entity wci relevant to its category higher in score than the
other KB entities.</p>
      <p>Formally, the model in Eq. 1 can be specified as follows.</p>
      <p>h(W, (dp, wci)) = f(W [R(dp); R(wci)] + b) (2)</p>
      <p>where R is the feature representation function (word2vec or paragraph2vec), [R(dp); R(wci)] denotes the concatenation
of the document and entity representations, wci is the KB entity corresponding
to the category ci, W ∈ R^(1×2d) is the linear transformation matrix, d is the embedding dimension, and b ∈ R is
the bias parameter. We use the sigmoid function to realize f.</p>
      <p>Our input consists of triplets (dp, wci, yi), where yi is 1 if wci is the KB entity corresponding to the category of
the document dp and 0 otherwise. Since we are training a discriminative classifier, for each positive pair (dp, wci)
we sample a negative pair (dp, wcj) with label 0. The model is trained to optimize the Binary Cross-Entropy
(BCE) loss function defined as follows.</p>
      <p>BCE(yi, h(W, (dp, wci))) = −(yi log h(W, (dp, wci)) + (1 − yi) log(1 − h(W, (dp, wci)))) (3)</p>
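      <p>Eqs. 2 and 3 can be sketched as follows. This is a toy numpy version: the random vectors stand in for the word2vec/paragraph2vec representations R, and no actual optimization is shown.
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h(W, b, doc_vec, ent_vec):
    # Eq. 2: f(W [R(d_p); R(w_ci)] + b) with f = sigmoid
    combined = np.concatenate([doc_vec, ent_vec])  # shape (2d,)
    return sigmoid(W @ combined + b)

def bce(y, p, eps=1e-12):
    # Eq. 3: binary cross-entropy for a single (document, entity) pair
    return -(y * np.log(p + eps) + (1 - y) * np.log(1.0 - p + eps))

d = 300
rng = np.random.default_rng(1)
W = rng.normal(scale=0.01, size=2 * d)   # W in R^(1 x 2d), flattened
b = 0.0

doc = rng.normal(size=d)       # stands in for R(d_p)
ent_pos = rng.normal(size=d)   # KB entity of the correct category (label 1)
ent_neg = rng.normal(size=d)   # sampled entity of a wrong category (label 0)

# loss over one positive pair and one sampled negative pair
loss = bce(1, h(W, b, doc, ent_pos)) + bce(0, h(W, b, doc, ent_neg))
```
Training would minimize this loss over all (positive, sampled negative) pairs with any gradient-based optimizer.</p>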
      <p>For testing on a new document, we generate the feature representation of the document and combine it with
the feature representation of each category in accordance with Eq. 2, and select the category for which a label of
1 is predicted. If for multiple categories, we get a label of 1, we randomly select one category out of all predicted
categories.</p>
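      <p>The test-time procedure above might look like this, continuing the toy numpy sketch. The fallback for the case where no category is predicted 1 is our assumption, since the text does not specify it.
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_category(W, b, doc_vec, entity_vecs, rng):
    # score the document against every category's KB entity (Eq. 2)
    scores = np.array([sigmoid(W @ np.concatenate([doc_vec, e]) + b)
                       for e in entity_vecs])
    labelled_one = np.flatnonzero(scores > 0.5)   # categories predicted 1
    if labelled_one.size == 0:
        # fallback (our addition): no category predicted 1, take the top score
        return int(np.argmax(scores))
    # multiple categories predicted 1: pick one at random, as described above
    return int(rng.choice(labelled_one))

d = 8  # small dimension for illustration
rng = np.random.default_rng(2)
W = rng.normal(size=2 * d)
entities = [rng.normal(size=d) for _ in range(24)]  # one per ACM category
doc = rng.normal(size=d)
category = predict_category(W, 0.0, doc, entities, rng)
```
      </p>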
      <p>We use word2vec and paragraph2vec for the feature vector representation R. To generate the feature vector
representation using word2vec, we compute the average of the word vectors corresponding to each token in the
document. To account for out-of-vocabulary words, a strategy similar to that described in [SM15] is followed: they
are replaced with a randomly sampled vector of the same dimension, drawn from the uniform distribution
U[−0.25, 0.25]. For the feature vector representation using paragraph2vec, we follow the inference mechanism
described in [LM14].</p>
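      <p>The averaging step above can be sketched as follows, assuming a dict-like mapping from tokens to vectors (such as gensim's KeyedVectors); the toy vocabulary is hypothetical.
```python
import numpy as np

def doc_vector(tokens, vectors, dim=300, rng=None):
    # average the word vectors of all tokens; out-of-vocabulary tokens
    # are replaced by a vector drawn from U[-0.25, 0.25], as in [SM15]
    rng = rng or np.random.default_rng(0)
    vecs = []
    for tok in tokens:
        if tok in vectors:
            vecs.append(np.asarray(vectors[tok]))
        else:
            vecs.append(rng.uniform(-0.25, 0.25, size=dim))
    return np.mean(vecs, axis=0)

# toy vocabulary standing in for trained word2vec vectors
toy = {"neural": np.ones(300), "network": -np.ones(300)}
v = doc_vector(["neural", "network", "zzz_unseen"], toy)
```
      </p>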
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>Dataset. We use the Citation Network Dataset [CST+13], which contains 236,565 papers, with each article categorized into
one of the 24 ACM categories. We randomly split these articles into 80% training instances and 20% testing
instances, and use a one-vs-rest logistic regression classifier2 for classification. For the external knowledge base, we
select two popular knowledge bases: Wikipedia and Freebase (FB). We use pre-trained embeddings for Freebase
entities, trained on the Google News corpus3, to initialize the representation of the entity corresponding to each category.
We experiment with word2vec and paragraph2vec to generate embeddings of categories with Wikipedia.</p>
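      <p>The 80%/20% random split described above can be sketched as follows; the features here are random stand-ins for the document representations, and the classifier itself (one-vs-rest logistic regression) is omitted.
```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000                              # stand-in for the 236,565 papers
X = rng.normal(size=(n, 300))         # stand-in document embeddings
y = rng.integers(0, 24, size=n)       # one of the 24 ACM categories

# 80% / 20% random split
perm = rng.permutation(n)
cut = int(0.8 * n)
train_idx, test_idx = perm[:cut], perm[cut:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```
      </p>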
      <p>Experimental Setup. To generate document embeddings we use gensim [RS10], an open-source Python library, with embedding size
set to 300. For text pre-processing, we remove all stop-words, punctuation and non-ASCII characters from the
abstracts of all articles, and then lower-case all text. To generate the TF-IDF representation of articles, we
consider each article's abstract as a document and the collection of all abstracts as the document collection; term
frequency and inverse document frequency are computed accordingly. We use a dimension of 300 for the Freebase
entity embeddings. For more details about the experimental settings, interested readers can refer to the open-source
implementation code4.
2http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
3https://code.google.com/archive/p/word2vec/
4https://github.com/shashankg7/Knowledge-Base-integration-with-Text-Classification-pipeline
[Table 1: Performance comparison — TF-IDF + SVM baseline; CNN baseline [Kim14] with avg. padding; CNN baseline [Kim14] with max. padding; our model]
A Knowledge Base (KB) consists of entities and their relations. To generate semantic representations of the
categories, we need to map them to their corresponding entities in the KB. Each category is mapped to its
corresponding entity in the KB by matching their corresponding description text. For example, the category `Machine
Learning and Pattern Recognition' is mapped to the entity `/en/machine_learning' in Freebase and to the entity
`Machine learning' in Wikipedia. More details can be found in the open-source implementation of the project4.
For comparison, we select the state-of-the-art Convolutional Neural Network (CNN) based method for text
categorization [Kim14]. The CNN used in that paper operates at the sentence level, so we concatenate all sentences in
the abstract into a single sentence. The abstracts in our case are of variable length with different numbers of
tokens, so they need to be converted to fixed-length sequences. To achieve this, we employ two strategies:
average and maximum length padding. In the case of average length padding, we consider the size of each
abstract in the corpus and calculate the average length across the corpus. Each abstract in the dataset is then
converted to the average size, either by padding with a `null' token or by truncating the sequence, depending on
the length of the abstract relative to the average length. A similar strategy is employed with maximum length
padding. For padding using the maximum document length method, we removed all documents in the training
set with document length greater than 150. We use the code made available by the authors5 and run it on our
dataset with the best settings reported in the paper.</p>
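      <p>The average length padding strategy can be sketched as follows; the toy abstracts are hypothetical.
```python
def pad_to_length(tokens, length, pad_token="null"):
    # truncate long abstracts, pad short ones with a 'null' token
    if len(tokens) >= length:
        return tokens[:length]
    return tokens + [pad_token] * (length - len(tokens))

abstracts = [["deep", "learning", "for", "text"],
             ["svm"],
             ["knowledge", "base", "enhanced", "categorization", "of", "papers"]]
avg_len = round(sum(len(a) for a in abstracts) / len(abstracts))
padded = [pad_to_length(a, avg_len) for a in abstracts]
```
Maximum length padding works the same way, using the maximum (capped) length in place of the average.</p>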
      <p>We present the results of all the methods in Table 1. For evaluating performance, micro-averaged accuracy
is used as the measure. It is clear that our method outperforms the CNN baseline in accuracy by 5.41%. Since
abstracts are short texts, adding external information to the model clearly gives us an advantage over current
methods.</p>
      <p>Method | Acc. without KB | Acc. with Wiki. | Acc. with FB | Acc. with Wiki. + FB
para2vec | 12.86% | 68.04% | 72.55% | 72.54%
word2vec | 50.95% | 71.38% | 71.51% | 71.13%
Table 2: Performance Comparison with Different KBs</p>
      <sec id="sec-4-4">
        <title>Results and Conclusions</title>
        <p>We compare the results of our method with baseline methods that do not use a KB in Table 2. Paragraph2vec and
word2vec combined with Freebase give an increase of 59.69% and 20.56% respectively, in terms of accuracy, over
the baseline methods. It can also be observed that the performance of the method is consistent across different
KBs. It can be inferred that the common factor which gives the increase in performance is the knowledge
from the KB, and not the type of KB itself. Further, we selected two popular KBs with different semantics, and one
KB as a combination of these two. The performance across these KBs provides evidence that the method
can generalize across different KBs. The two KBs differ in the sense that Wikipedia contains mostly textual
information about entities and topics, while Freebase also has relation information between entities and is different
in nature from Wikipedia. We also present the training time comparison between our method and the deep learning
based state-of-the-art method in the rightmost column of Table 1. The results are shown for 25 epochs of
training for the CNN (best setting reported by the authors) and 100 epochs (determined using cross-validation) of
training for logistic regression. It is clear that our method is faster to train than the CNN, owing to the
mostly linear operations in our model, as compared to the more complex feed-forward and back-propagation
computations in CNNs.</p>
        <p>We propose a novel method to combine information from external knowledge bases, using the learning to rank
framework, to enhance classification performance on short texts like abstracts of research papers. We empirically
demonstrate that our method outperforms deeper networks in terms of accuracy. Performance with different
KBs was also studied, providing strong evidence that the method can generalize across different KBs.
The only requirement is the presence of a high-quality external KB. We also demonstrate that our method is
faster to train than the baseline method.</p>
        <p>In the future, we plan to extend this work to different NLP tasks like topic modeling, named entity recognition,
etc. It would be interesting to observe the effect of adding domain-specific KBs instead of general-purpose
KBs. It would also be interesting to observe the effect of adding information from KBs in settings where the
number of classes is very large.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [BNJ03]
          <string-name>
            <given-names>D.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>JMLR</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [CST+13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sikdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tammana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          .
          <article-title>Computer science elds as ground-truth communities: Their impact, rise and fall</article-title>
          .
          <source>In ASONAM</source>
          , pages
          <volume>426</volume>
          -
          <fpage>433</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [CWB+11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karlen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Kuksa</surname>
          </string-name>
          .
          <article-title>Natural language processing (almost) from scratch</article-title>
          .
          <source>JMLR</source>
          , pages
          <volume>2493</volume>
          -
          <fpage>2537</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [DDF+90]
          <string-name>
            <given-names>S.</given-names>
            <surname>Deerwester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.W.</given-names>
            <surname>Furnas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Harshman</surname>
          </string-name>
          .
          <article-title>Indexing by latent semantic analysis</article-title>
          .
          <source>Journal of the American Society for Information Science, page 391</source>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [dSG14]
          <string-name>
            <given-names>C.N. dos</given-names>
            <surname>Santos</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Gatti</surname>
          </string-name>
          .
          <article-title>Deep convolutional neural networks for sentiment analysis of short texts</article-title>
          .
          <source>In COLING</source>
          , pages
          <volume>69</volume>
          -
          <fpage>78</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Joa98]
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          .
          <article-title>Text categorization with support vector machines: Learning with many relevant features</article-title>
          .
          <source>In ECML</source>
          , pages
          <volume>137</volume>
          -
          <fpage>142</fpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Kim14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <article-title>Convolutional neural networks for sentence classi cation</article-title>
          .
          <source>In EMNLP</source>
          , pages
          <volume>1746</volume>
          -
          <fpage>1751</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Liu09]
          <string-name>
            <given-names>T.Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Learning to rank for information retrieval</article-title>
          .
          <source>Foundations and Trends in Information Retrieval</source>
          , pages
          <volume>225</volume>
          -
          <fpage>331</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [LL13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>A deep architecture for matching short texts</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>1367</fpage>
          -
          <fpage>1375</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [LM14]
          <string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <article-title>Distributed representations of sentences and documents</article-title>
          .
          <source>In ICML</source>
          , pages
          <volume>1188</volume>
          -
          <fpage>1196</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [MSC+13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In NIPS</source>
          , pages
          <volume>3111</volume>
          -
          <fpage>3119</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [PSM14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          <source>In EMNLP</source>
          , pages
          <volume>1532</volume>
          -
          <fpage>43</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [RS10]
          <string-name>
            <given-names>Radim</given-names>
            <surname>Rehurek</surname>
          </string-name>
          and
          <string-name>
            <given-names>Petr</given-names>
            <surname>Sojka</surname>
          </string-name>
          .
          <article-title>Software Framework for Topic Modelling with Large Corpora</article-title>
          .
          <source>In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks</source>
          , pages
          <volume>45</volume>
          -
          <fpage>50</fpage>
          , Valletta, Malta, May
          <year>2010</year>
          . ELRA. http://is.muni.cz/publication/884893/en.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [SM15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Severyn</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Moschitti</surname>
          </string-name>
          .
          <article-title>Learning to rank short text pairs with convolutional deep neural networks</article-title>
          .
          <source>In SIGIR</source>
          , pages
          <volume>373</volume>
          -
          <fpage>382</fpage>
          . ACM,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>