<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>"The Less Is More" for Text Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rima Turker</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lei Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Koutraki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harald Sack</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FIZ Karlsruhe</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Karlsruhe Institute of Technology, Institute AIFB</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Knowledge-Based Text Classification</institution>
          ,
          <addr-line>KBTC</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Leibniz Institute for Information Infrastructure</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Text Classification [2,5] is gaining more attention due to the availability of huge amounts of text data, such as blog articles and news data. Traditional text classification methods [1] use all the words present in a given text to represent a document. However, the high number of words mentioned in documents can tremendously increase the complexity of the classification task and subsequently make it very costly. Moreover, long (natural language text) documents usually include a variety of information related to the topic of a document. For example, encyclopedic articles, such as the life of a scientist, contain detailed biographical information besides topic-related content. Often, in such articles, words or entities appear after the first paragraph (or first few sentences) that are not related to the main topic (or category) of the article. We assume that the most informative part of such articles is limited to the first few sentences. In other words, instead of considering the complete document, only its beginning can be exploited to classify a document accurately. In this study, we design a Knowledge-Based Text Classification method, which is able to classify a document by using only the first few sentences of the article. Since the length of the considered text is rather limited, ambiguous words might lead to inaccurate classification results. Therefore, instead of words, we consider entities to represent a document. In addition, entities and categories are embedded into a common vector space, which allows capturing the semantic similarity between them. Moreover, the similarity-based approach does not require any labeled training data as a prerequisite. Instead, it relies on the semantic similarity between a set of predefined categories and a given document to determine which category the given document belongs to.
The study has been validated with preliminary experiments on text classification for encyclopedic articles, which show that our method achieves comparable and even better results using only the first few sentences of a document than using the entire document.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Given a Knowledge Base KB containing a set of entities E = {e1, e2, ..., en} and
a set of hierarchically related categories C = {c1, c2, ..., cm}, where each entity
ei ∈ E is associated with a set of categories C' ⊆ C. The input is a text t, which
contains a set of mentions Mt = {m1, ..., mk} that uniquely refer to a set of
entities. Then, the output is the most relevant category ci ∈ C' for the given
text t.
(Footnote 3: http://scihi.org/albert-einstein-revolutionized-physics/; Footnote 4: https://en.wikipedia.org/wiki/Category:Physics)</p>
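The problem formalization above can be sketched with toy data structures (a minimal sketch; the knowledge base, entity names, and the `candidate_categories` helper are illustrative assumptions, not the authors' implementation):

```python
# Toy knowledge base: each entity e_i is associated with a set of
# categories C' ⊆ C, as in the problem definition above.
KB = {
    "Albert_Einstein": {"Physics", "Scientists"},
    "Linux": {"Technology", "Software"},
    "Motorola": {"Technology", "Companies"},
}


def candidate_categories(mentions):
    """Union of the category sets C' over all entities the mentions refer to."""
    cats = set()
    for m in mentions:
        cats |= KB.get(m, set())
    return cats


print(candidate_categories(["Linux", "Motorola"]))
```

The classifier then has to pick the single most relevant category out of this candidate set for the given text t.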
      <p>
        KBTC Overview. The general workflow of Knowledge-Based Text Classification
is shown in Figure 1. The first step is "Mention Detection Based on
Anchor-Text Dictionary", where each entity mention present in t is detected based on
an "Anchor-Text Dictionary" prefabricated from Wikipedia. The Anchor-Text
Dictionary contains all mentions and their corresponding Wikipedia entities. In
order to construct an Anchor-Text Dictionary, all the anchor texts of
hyperlinks in Wikipedia articles referring to another Wikipedia article are extracted,
whereby the anchor texts serve as mentions and the Wikipedia article links refer
to the corresponding entities. In the second step, for each detected mention in
the given input text, candidate entities are generated based on the Anchor-Text
Dictionary. In our example these are "Motorola", "Hewlett-Packard" and "Linux".
Likewise, the predefined categories are mapped to Wikipedia categories. Finally,
with the help of entity and category embeddings [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] that have been
precomputed from Wikipedia, the output is the semantically most related category for
the given entities. Thereby, in the given example the category Technology will
be determined.
      </p>
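The dictionary construction and mention detection steps can be sketched as follows (a minimal sketch; the toy link list and the greedy longest-match lookup are assumptions, since the paper builds the dictionary from the hyperlinks of a full Wikipedia dump):

```python
from collections import defaultdict

# Toy (anchor text, target article) pairs; a real dictionary would be
# extracted from all hyperlinks in a Wikipedia dump.
LINKS = [
    ("Motorola", "Motorola"),
    ("HP", "Hewlett-Packard"),
    ("Hewlett-Packard", "Hewlett-Packard"),
    ("Linux", "Linux"),
    ("linux kernel", "Linux_kernel"),
]


def build_anchor_dictionary(links):
    """Map each anchor text to the set of candidate entities it may refer to."""
    d = defaultdict(set)
    for anchor, target in links:
        d[anchor.lower()].add(target)
    return d


def detect_mentions(text, dictionary, max_len=3):
    """Greedy longest-match lookup of word n-grams against the dictionary."""
    tokens = text.split()
    mentions, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            if span.lower() in dictionary:
                mentions.append((span, sorted(dictionary[span.lower()])))
                i += n
                break
        else:
            i += 1
    return mentions


d = build_anchor_dictionary(LINKS)
print(detect_mentions("HP ships Linux on Motorola hardware", d))
```

Because one anchor text can link to several articles ("HP" may refer to more than one entity in a real dump), each mention maps to a *set* of candidate entities, which the probabilistic model below then disambiguates.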
      <p>Probabilistic Model. The proposed classification task is formalized as
estimating the probability P(c|t) of each predefined category c given an input text t.
Based on Bayes' theorem, the probability P(c|t) can be rewritten as follows:</p>
      <p>P(c|t) = P(c, t) / P(t) ∝ P(c, t)   (1)
where the denominator P(t) has no impact on the ranking of the categories. For
an input text t, a mention is a term in t that can refer to an entity e, and the
context of e is the set of all other mentions in t except the one for e. For each
candidate entity e in t, the input text t can be decomposed into the mention and
context of e, denoted by me and Ce, respectively. Based on the above introduced
concepts, the joint probability P(c, t) is given as follows:</p>
      <p>P(c, t) = Σ_{e ∈ Et} P(e, c, t) = Σ_{e ∈ Et} P(e, c, me, Ce) = Σ_{e ∈ Et} P(e) P(c|e) P(me|e) P(Ce|e)   (2)</p>
      <p>
        where Et represents the set of all possible entities contained in the input text t.
Here, we simply apply a uniform distribution to calculate P(e) for each
entity e. The probability P(c|e) models the relatedness between an entity e and a
category c, which is estimated by using the prefabricated entity-category
embeddings. Moreover, the probability P(me|e) is calculated based on the anchor
text dictionary. Finally, the probability P(Ce|e) models the relatedness between
the entity e and its context Ce. Each mention in Ce refers to a context entity
ec from the given knowledge base. The probability P(Ce|e) can be calculated
with the help of entity-category embeddings. More details about the probability
estimation can be found in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
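The decomposition in Eq. (2) can be illustrated with toy embeddings (a minimal sketch: all vectors and helper names are made up for illustration; P(e) and P(me|e) are taken as uniform and dropped from the ranking, and P(c|e) and P(Ce|e) are approximated by cosine similarities in the common entity-category vector space):

```python
import math

# Toy entity/category embeddings in a common space (illustrative values;
# the paper precomputes real embeddings from Wikipedia [3]).
EMB = {
    "Linux": [0.9, 0.1], "Motorola": [0.8, 0.2], "Hewlett-Packard": [0.85, 0.15],
    "Technology": [0.9, 0.1], "Sports": [0.1, 0.9],
}


def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def score(category, entities):
    """Unnormalised analogue of P(c, t) in Eq. (2), with uniform P(e), P(me|e),
    and P(c|e), P(Ce|e) replaced by embedding similarities."""
    total = 0.0
    for e in entities:
        context = [x for x in entities if x != e]
        p_c_e = cos(EMB[e], EMB[category])          # stands in for P(c|e)
        p_ctx = 1.0                                  # stands in for P(Ce|e)
        for ec in context:
            p_ctx *= max(cos(EMB[e], EMB[ec]), 1e-9)
        total += p_c_e * p_ctx
    return total


ents = ["Linux", "Motorola", "Hewlett-Packard"]
best = max(["Technology", "Sports"], key=lambda c: score(c, ents))
print(best)  # -> Technology
```

Since only the ranking of categories matters (Eq. (1)), the unnormalised score suffices to pick the most related category.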
    </sec>
    <sec id="sec-2">
      <title>Results and Discussion</title>
      <p>Dataset. The proposed text classification approach is evaluated on articles of
SciHi, a web blog on the history of science. From that dataset, 1452 articles
associated with a single category have been considered. The dataset comprises
45 different categories, and the average number of sentences per article
is 32.96.</p>
      <p>[Fig. 2. Classification accuracy (y-axis, around 0.60-0.61) over the number of starting sentences (x-axis: 1, 2, 3, 5, 10, All).]</p>
      <p>Experimental Results. The proposed approach does not require any
training phase. Therefore, only test sets are generated for the classification task from
the SciHi data. To show the impact of the number of starting sentences of the
articles on the classification accuracy, the data set has been sampled in different
sizes. From each article, the first sentence, the first 2, 3, 5, and 10
sentences, and the complete document have been collected. For each sampled dataset
the proposed approach has been applied to the classification task. The results
are depicted in Fig. 2. The results show that a few starting sentences (in this case
3 sentences) are rather informative and have a huge impact on the classification
accuracy. During the experiments it has been observed that, most of the time,
entities irrelevant to the corresponding category tend to appear after the first 2
or 3 sentences. Hence, after the 3rd sentence the accuracy starts to drop (Fig. 2).
Note that usually in such documents the frequency of relevant entities is higher
in comparison to irrelevant entities. Therefore, complete documents help to
obtain reasonable classification accuracy. However, the classification of complete
documents is computationally very expensive (cp. Table 1). The classification
of the whole documents takes 215 minutes, while the classification of a single sentence
requires no more than 18 minutes for the entire dataset. The best results have
been obtained with the first 3 sentences (Fig. 2), where the execution time was 23
minutes, which is almost 90% faster. As expected, the complexity significantly
increases when the number of sentences is increased.</p>
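The first-k-sentence sampling described above can be sketched like this (a naive regex-based sentence splitter is assumed here; the paper does not specify which splitter it uses):

```python
import re


def first_k_sentences(text, k=None):
    """Split text into sentences on ., !, or ? followed by whitespace and
    return the first k sentences rejoined (all sentences if k is None)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return " ".join(sentences if k is None else sentences[:k])


doc = ("Einstein was a physicist. He developed relativity. "
       "He was born in Ulm. He liked sailing.")
for k in (1, 2, 3, 5, 10, None):
    print(k, "->", first_k_sentences(doc, k))
```

Applying this to each article yields the samples compared in Fig. 2 (first 1, 2, 3, 5, 10 sentences versus the full document).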
    </sec>
    <sec id="sec-3">
      <title>Conclusion and Future Work</title>
      <p>In this study, a probabilistic text classification approach has been used to
analyze the influence of the text length on a text classification task. Based on
the obtained results we can conclude that considering the complete document does
not always increase the classification accuracy. Instead, the accuracy depends
on the nature of the considered part of the documents. In this study, it has
been observed that the most informative part of encyclopedic documents is the
first 3 sentences for the classification based on entity and category embeddings.
Moreover, as anticipated, the complexity of the classification task decreases by
considering only a few starting sentences. As for future work, we plan to apply
the proposed approach to different domains, such as patent data, to be able
to classify patents.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Text categorization with support vector machines: Learning with many relevant features</article-title>
          .
          <source>Machine learning: ECML-98</source>
          pp.
          <fpage>137</fpage>
          -
          <lpage>142</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>On dataless hierarchical text classification</article-title>
          .
          <source>In: AAAI</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mei</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>LINE: Large-scale information network embedding</article-title>
          .
          <source>CoRR</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Turker, R., Zhang,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Koutraki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Sack</surname>
          </string-name>
          , H.:
          <article-title>Short text categorization using joint entity and category embeddings - (under review)</article-title>
          , https://github.com/ISEFIZKarlsruhe/Submission-under-review
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Character-level convolutional networks for text classification</article-title>
          .
          <source>In: NIPS</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>