<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>"The Less Is More" for Text Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rima Turker</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lei Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Koutraki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harald Sack</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FIZ Karlsruhe</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Karlsruhe Institute of Technology, Institute AIFB</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Knowledge-Based Text Classification</institution>
          ,
          <addr-line>KBTC</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Leibniz Institute for Information Infrastructure</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Text Classification [2,5] is gaining more attention due to the availability of huge amounts of text data, such as blog articles and news data. Traditional text classification methods [1] use all the words present in a given text to represent a document. However, the high number of words mentioned in documents can tremendously increase the complexity of the classification task and subsequently make it very costly. Moreover, long (natural language text) documents usually include a variety of information related to the topic of a document. For example, encyclopedic articles, such as the life of a scientist, contain detailed biographical information besides topic-related content. Often, in such articles, words or entities appear after the first paragraph (or first few sentences) that are not related to the main topic (or category) of the article. We assume that the most informative part of such articles is limited to the first few sentences. In other words, instead of considering the complete document, only its beginning can be exploited to classify a document accurately. In this study, we design a Knowledge-Based Text Classification method, which is able to classify a document by using only the first few sentences of the article. Since the length of the considered text is rather limited, ambiguous words might lead to inaccurate classification results. Therefore, instead of words, we consider entities to represent a document. In addition, entities and categories are embedded into a common vector space, which allows capturing the semantic similarity between them. Moreover, the similarity-based approach does not require any labeled training data as a prerequisite. Instead, it relies on the semantic similarity between a set of predefined categories and a given document to determine which category the given document belongs to.
The study has been validated with preliminary experiments on text classification for encyclopedic articles, which show that our method achieves comparable and even better results using only the first few sentences of a document than using the entire document.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Given a Knowledge Base KB containing a set of entities E = {e1, e2, ..., en} and
a set of hierarchically related categories C = {c1, c2, ..., cm}, where each entity
ei ∈ E is associated with a set of categories C' ⊆ C. The input is a text t, which
contains a set of mentions Mt = {m1, ..., mk} that uniquely refer to a set of
entities. Then, the output is the most relevant category ci ∈ C' for the given
text t.
(Footnote 3: http://scihi.org/albert-einstein-revolutionized-physics/; Footnote 4: https://en.wikipedia.org/wiki/Category:Physics)</p>
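The problem formalization above can be sketched with toy data structures (a minimal sketch; the knowledge base, entity names, and the `candidate_categories` helper are illustrative assumptions, not the authors' implementation):

```python
# Toy knowledge base: each entity e_i is associated with a set of
# categories C' ⊆ C, as in the problem definition above.
KB = {
    "Albert_Einstein": {"Physics", "Scientists"},
    "Linux": {"Technology", "Software"},
    "Motorola": {"Technology", "Companies"},
}


def candidate_categories(mentions):
    """Union of the category sets C' over all entities the mentions refer to."""
    cats = set()
    for m in mentions:
        cats |= KB.get(m, set())
    return cats


print(candidate_categories(["Linux", "Motorola"]))
```

The classifier then has to pick the single most relevant category out of this candidate set for the given text t.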
      <p>
        KBTC Overview. The general workflow of Knowledge-Based Text Classification
is shown in Figure 1. The first step is "Mention Detection Based on
Anchor-Text Dictionary", where each entity mention present in t is detected based on
an "Anchor-Text Dictionary" prefabricated from Wikipedia. The Anchor-Text
Dictionary contains all mentions and their corresponding Wikipedia entities. In
order to construct an Anchor-Text Dictionary, all the anchor texts of
hyperlinks in Wikipedia articles referring to another Wikipedia article are extracted,
whereby the anchor texts serve as mentions and the Wikipedia article links refer
to the corresponding entities. In the second step, for each detected mention in
the given input text, candidate entities are generated based on the Anchor-Text
Dictionary. In our example these are "Motorola", "Hewlett-Packard" and "Linux".
Likewise, the predefined categories are mapped to Wikipedia categories. Finally,
with the help of entity and category embeddings [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] that have been
precomputed from Wikipedia, the output is the semantically most related category for
the given entities. Thereby, in the given example the category Technology will
be determined.
      </p>
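The dictionary construction and mention detection steps can be sketched as follows (a minimal sketch; the toy link list and the greedy longest-match lookup are assumptions, since the paper builds the dictionary from the hyperlinks of a full Wikipedia dump):

```python
from collections import defaultdict

# Toy (anchor text, target article) pairs; a real dictionary would be
# extracted from all hyperlinks in a Wikipedia dump.
LINKS = [
    ("Motorola", "Motorola"),
    ("HP", "Hewlett-Packard"),
    ("Hewlett-Packard", "Hewlett-Packard"),
    ("Linux", "Linux"),
    ("linux kernel", "Linux_kernel"),
]


def build_anchor_dictionary(links):
    """Map each anchor text to the set of candidate entities it may refer to."""
    d = defaultdict(set)
    for anchor, target in links:
        d[anchor.lower()].add(target)
    return d


def detect_mentions(text, dictionary, max_len=3):
    """Greedy longest-match lookup of word n-grams against the dictionary."""
    tokens = text.split()
    mentions, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            if span.lower() in dictionary:
                mentions.append((span, sorted(dictionary[span.lower()])))
                i += n
                break
        else:
            i += 1
    return mentions


d = build_anchor_dictionary(LINKS)
print(detect_mentions("HP ships Linux on Motorola hardware", d))
```

Because one anchor text can link to several articles ("HP" may refer to more than one entity in a real dump), each mention maps to a *set* of candidate entities, which the probabilistic model below then disambiguates.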
      <p>Probabilistic Model. The proposed classification task is formalized as
estimating the probability P(c|t) of each predefined category c given an input text t.
Based on Bayes' theorem, the probability P(c|t) can be rewritten as follows:</p>
      <p>P(c|t) = P(c, t) / P(t) ∝ P(c, t)   (1)
where the denominator P(t) has no impact on the ranking of the categories. For
an input text t, a mention is a term in t that can refer to an entity e, and the
context of e is the set of all other mentions in t except the one for e. For each
candidate entity e in t, the input text t can be decomposed into the mention and
context of e, denoted by me and Ce, respectively. Based on the above introduced
concepts, the joint probability P(c, t) is given as follows:</p>
      <p>P(c, t) = Σ_{e ∈ Et} P(e, c, t) = Σ_{e ∈ Et} P(e, c, me, Ce) = Σ_{e ∈ Et} P(e) P(c|e) P(me|e) P(Ce|e)   (2)</p>
      <p>
        where Et represents the set of all possible entities contained in the input text t.
Here, we simply apply a uniform distribution to calculate P(e) for each
entity e. The probability P(c|e) models the relatedness between an entity e and a
category c, which is estimated by using the prefabricated entity-category
embeddings. Moreover, the probability P(me|e) is calculated based on the anchor
text dictionary. Finally, the probability P(Ce|e) models the relatedness between
the entity e and its context Ce. Each mention in Ce refers to a context entity
ec from the given knowledge base. The probability P(Ce|e) can be calculated
with the help of entity-category embeddings. More details about the probability
estimation can be found in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
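The decomposition in Eq. (2) can be illustrated with toy embeddings (a minimal sketch: all vectors and helper names are made up for illustration; P(e) and P(me|e) are taken as uniform and dropped from the ranking, and P(c|e) and P(Ce|e) are approximated by cosine similarities in the common entity-category vector space):

```python
import math

# Toy entity/category embeddings in a common space (illustrative values;
# the paper precomputes real embeddings from Wikipedia [3]).
EMB = {
    "Linux": [0.9, 0.1], "Motorola": [0.8, 0.2], "Hewlett-Packard": [0.85, 0.15],
    "Technology": [0.9, 0.1], "Sports": [0.1, 0.9],
}


def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def score(category, entities):
    """Unnormalised analogue of P(c, t) in Eq. (2), with uniform P(e), P(me|e),
    and P(c|e), P(Ce|e) replaced by embedding similarities."""
    total = 0.0
    for e in entities:
        context = [x for x in entities if x != e]
        p_c_e = cos(EMB[e], EMB[category])          # stands in for P(c|e)
        p_ctx = 1.0                                  # stands in for P(Ce|e)
        for ec in context:
            p_ctx *= max(cos(EMB[e], EMB[ec]), 1e-9)
        total += p_c_e * p_ctx
    return total


ents = ["Linux", "Motorola", "Hewlett-Packard"]
best = max(["Technology", "Sports"], key=lambda c: score(c, ents))
print(best)  # -> Technology
```

Since only the ranking of categories matters (Eq. (1)), the unnormalised score suffices to pick the most related category.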
    </sec>
    <sec id="sec-2">
      <title>Results and Discussion</title>
      <p>Dataset. The proposed text classification approach is evaluated on articles of
SciHi, a web blog on the history of science. From that dataset, 1452 articles
associated with a single category have been considered. The dataset comprises
45 different categories, and the average number of sentences per article
is 32.96.</p>
      <p>[Fig. 2. Classification accuracy (y-axis, around 0.60-0.61) over the number of starting sentences (x-axis: 1, 2, 3, 5, 10, All).]</p>
      <p>Experimental Results. The proposed approach does not require any
training phase. Therefore, only test sets are generated for the classification task from
the SciHi data. To show the impact of the number of starting sentences of the
articles on the classification accuracy, the data set has been sampled in different
sizes. From each article, the first sentence, the first 2, 3, 5, and 10
sentences, and the complete document have been collected. For each sampled dataset
the proposed approach has been applied to the classification task. The results
are depicted in Fig. 2. The results show that a few starting sentences (in this case
3 sentences) are rather informative and have a huge impact on the classification
accuracy. During the experiments it has been observed that, most of the time,
entities irrelevant to the corresponding category tend to appear after the first 2
or 3 sentences. Hence, after the 3rd sentence the accuracy starts to drop (Fig. 2).
Note that usually in such documents the frequency of relevant entities is higher
in comparison to irrelevant entities. Therefore, complete documents help to
obtain reasonable classification accuracy. However, the classification of complete
documents is computationally very expensive (cp. Table 1). The classification
of the whole documents takes 215 minutes, while the classification of a single sentence
requires no more than 18 minutes for the entire dataset. The best results have
been obtained with the first 3 sentences (Fig. 2), where the execution time was 23
minutes, which is almost 90% faster. As expected, the complexity significantly
increases when the number of sentences is increased.</p>
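The first-k-sentence sampling described above can be sketched like this (a naive regex-based sentence splitter is assumed here; the paper does not specify which splitter it uses):

```python
import re


def first_k_sentences(text, k=None):
    """Split text into sentences on ., !, or ? followed by whitespace and
    return the first k sentences rejoined (all sentences if k is None)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return " ".join(sentences if k is None else sentences[:k])


doc = ("Einstein was a physicist. He developed relativity. "
       "He was born in Ulm. He liked sailing.")
for k in (1, 2, 3, 5, 10, None):
    print(k, "->", first_k_sentences(doc, k))
```

Applying this to each article yields the samples compared in Fig. 2 (first 1, 2, 3, 5, 10 sentences versus the full document).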
    </sec>
    <sec id="sec-3">
      <title>Conclusion and Future Work</title>
      <p>In this study, a probabilistic text classification approach has been used to
analyze the influence of the text length on a text classification task. Based on
the obtained results we can conclude that considering the complete document does
not always increase the classification accuracy. Instead, the accuracy depends
on the nature of the considered part of the documents. In this study, it has
been observed that the most informative part of encyclopedic documents is the
first 3 sentences for the classification based on entity and category embeddings.
Moreover, as anticipated, the complexity of the classification task decreases by
considering only a few starting sentences. As for future work, we plan to apply
the proposed approach to different domains, such as patent data, to be able
to classify patents.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Text categorization with support vector machines: Learning with many relevant features</article-title>
          .
          <source>Machine learning: ECML-98</source>
          pp.
          <fpage>137</fpage>
          -
          <lpage>142</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>On dataless hierarchical text classification</article-title>
          .
          <source>In: AAAI</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mei</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>LINE: Large-scale information network embedding</article-title>
          .
          <source>CoRR</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Turker, R., Zhang,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Koutraki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Sack</surname>
          </string-name>
          , H.:
          <article-title>Short text categorization using joint entity and category embeddings - (under review)</article-title>
          , https://github.com/ISEFIZKarlsruhe/Submission-under-review
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Character-level convolutional networks for text classification</article-title>
          .
          <source>In: NIPS</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>