TECNE: Knowledge Based Text Classification Using Network Embeddings

Rima Türker 1,2, Maria Koutraki 1,2, Lei Zhang 1, and Harald Sack 1,2
1 FIZ Karlsruhe – Leibniz Institute for Information Infrastructure, Germany
2 Karlsruhe Institute of Technology, Institute AIFB, Germany
{firstname.lastname}@fiz-karlsruhe.de, {firstname.lastname}@kit.edu

Abstract. Text classification is an important and challenging task due to its application in various domains such as document organization and news filtering. Several supervised learning approaches have been proposed for text classification. However, most of them require a significant amount of training data, and manually labeling such data can be very time-consuming and costly. To overcome the need for labeled data, we demonstrate TECNE, a knowledge-based text classification method using network embeddings. The proposed system does not require any labeled training data to classify an arbitrary text. Instead, it relies on the semantic similarity between the entities appearing in a given text and a set of predefined categories to determine the category to which the given document belongs.

1 Introduction

Text classification is gaining more and more attention due to the availability of huge amounts of text data, including search snippets, news data, and text generated in social networks. Recently, several supervised approaches have been proposed for text classification [6,1]. However, they all require a significant amount of labeled training data, and manual labeling of such data can be a very time-consuming and costly task. Especially when the text to be labeled belongs to a specific scientific or technical domain, crowdsourcing-based labeling approaches do not work well and only expensive domain experts are able to fulfill the manual labeling task. Alternatively, semi-supervised text classification approaches [5] have been proposed to reduce the labeling effort. Yet, due to the diversity of the documents in many applications, generating even a small training set for semi-supervised approaches still remains an expensive process [2]. Moreover, to cope with the lack of labeled data, several dataless text classification methods have been proposed. Similar to our approach, these methods do not require any labeled data; rather, they rely on the semantic similarity between documents and the predefined categories. However, the most prominent and successful dataless classification approaches cannot utilize the rich entity and category information in large-scale knowledge bases.

In this paper we demonstrate TECNE, an approach which classifies an arbitrary input text according to a predefined set of categories without requiring any training data. The approach captures the semantic relation between the entities mentioned in a text and the predefined categories by embedding them into a common vector space using state-of-the-art network embedding techniques. The category of the given text is then derived from the semantic similarity between the entities present in the text and the predefined categories, where the similarity is computed on the vector representations of the entities and the categories.
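To make this idea concrete, the following minimal Java sketch illustrates how such a similarity-based score could be computed once entity and category vectors are available in a common space. The cosine measure, the averaging over entities, and all class and method names are illustrative assumptions of this sketch, not the exact scoring function of TECNE, which is defined in [4].

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Minimal sketch of similarity-based category scoring, assuming entities and
 * categories have already been embedded into a common vector space.
 * Averaging the cosine similarities is an illustrative aggregation choice.
 */
public class CategoryScorer {

    /** Cosine similarity between two embedding vectors of equal dimension. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Scores every predefined category against the entities found in the text. */
    static Map<String, Double> scoreCategories(List<double[]> entityVectors,
                                               Map<String, double[]> categoryVectors) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> category : categoryVectors.entrySet()) {
            double sum = 0;
            for (double[] entity : entityVectors) {
                sum += cosine(entity, category.getValue());
            }
            scores.put(category.getKey(),
                       entityVectors.isEmpty() ? 0.0 : sum / entityVectors.size());
        }
        return scores;
    }
}
```

Under this view, classification amounts to returning the category with the highest score, which corresponds to the descending ranking of categories shown in the demonstration.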
2 Description of TECNE

Fig. 1. The workflow of TECNE (best viewed in color)

TECNE assumes a knowledge base KB containing a set of entities E = {e1, e2, ..., en} and a set of hierarchically related categories C = {c1, c2, ..., cm}, where each entity ei ∈ E is associated with a set of categories C′ ⊆ C via a relation cat ⊆ E × C such that cat(ei) = C′. The input of the system is an arbitrary text t, which contains a set of mentions Mt = {m1, ..., mk} that uniquely refer to a set of entities, together with a set of predefined categories C′ ⊆ C from the underlying knowledge base KB. The output of TECNE is a score value for each category ci ∈ C′ based on the semantic similarity between the given text t and the predefined categories C′.

TECNE Overview
The general workflow of TECNE, presented in Figure 1, is similar to our previous study [4]. The input is a text t of arbitrary length. The classification task starts with the detection of each entity mention present in t, based on a prefabricated "Anchor-Text Dictionary" built from Wikipedia. The Anchor-Text Dictionary contains all mentions and their corresponding Wikipedia entities. In our example, the detected mentions are "IBM", "midrange computer" and "eServer". As a next step, for each detected entity mention in t, candidate entities are generated with the help of the Anchor-Text Dictionary; in our example these are "IBM", "Midrange computer" and "IBM eServer". Also, the predefined categories (Sports, Technology, Culture, World) are mapped to Wikipedia categories. Finally, based on the entity and category embeddings [3] that have been precomputed from Wikipedia, the output of TECNE is a score for each predefined category. Ideally, the category most semantically related to the entities present in the input text receives the highest score; in the given example, this is the category Technology. More technical details about the approach and the evaluation of the system can be found in [4].
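As an illustration of the first two steps of this workflow, the following Java sketch shows how an anchor-text dictionary could be used to detect mentions and generate candidate entities. The dictionary layout, the greedy longest-match over token windows, and all class and method names are assumptions made for this sketch, not the actual TECNE implementation described in [4].

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/**
 * Illustrative sketch of mention detection and candidate generation with an
 * anchor-text dictionary mapping surface forms (anchor texts harvested from
 * Wikipedia) to the Wikipedia entities they link to. Assumes the dictionary
 * keys and the input text are already normalized consistently.
 */
public class MentionDetector {

    private final Map<String, Set<String>> anchorDictionary;
    private final int maxMentionTokens; // length of the longest surface form, in tokens

    public MentionDetector(Map<String, Set<String>> anchorDictionary, int maxMentionTokens) {
        this.anchorDictionary = anchorDictionary;
        this.maxMentionTokens = maxMentionTokens;
    }

    /** Returns, for every detected mention, its candidate Wikipedia entities. */
    public Map<String, Set<String>> detect(String text) {
        String[] tokens = text.split("\\s+");
        Map<String, Set<String>> candidates = new LinkedHashMap<>();
        int i = 0;
        while (i < tokens.length) {
            int matched = 0;
            // Prefer the longest surface form starting at token position i.
            for (int len = Math.min(maxMentionTokens, tokens.length - i); len >= 1; len--) {
                String mention = String.join(" ", Arrays.copyOfRange(tokens, i, i + len));
                if (anchorDictionary.containsKey(mention)) {
                    candidates.put(mention, anchorDictionary.get(mention));
                    matched = len;
                    break;
                }
            }
            i += (matched > 0) ? matched : 1;
        }
        return candidates;
    }
}
```

For the example above, a dictionary containing the surface forms "IBM", "midrange computer" and "eServer" would yield exactly the mentions and candidate entities listed in the workflow description.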
3 Demonstration

A recorded video of our demonstration can be found here: https://goo.gl/pSxkcy

TECNE is implemented in Java using a client-server architecture with communication over HTTP. The server is a RESTful web service implemented using Spark (http://sparkjava.com), and the client user interface is realized as a web application using the Vaadin Framework (https://vaadin.com/). The system provides both service-oriented and user-oriented interfaces for classifying short and long text documents and accepts any arbitrary text as input. For the sake of convenience, the system integrates three different APIs that a user can use to provide an online text as input. The first API (https://en.wikipedia.org/w/api.php) is used to fetch the abstract of a Wikipedia article: the user simply enters the name of a Wikipedia article and its abstract is fetched automatically. Figure 2 presents a screenshot of this service, where the input is the abstract of Albert Einstein's Wikipedia page. The second API (https://webhose.io/) and the third API (https://newsapi.org) are used to retrieve long and short random news articles, respectively, from different web pages. Besides that, a user can select a predefined sample sentence as input or manually enter any text without using the provided data sources. For the sake of simplicity, the system covers four categories, i.e., it classifies a text into Sports, Business, World, or Science-Technology. However, it can easily be extended to support a larger number of categories.

Fig. 2. Example of an input to the system using the Wikipedia API

For classifying a text, TECNE proceeds in three main steps:

1. Mention Detection Based on the Anchor-Text Dictionary: Each entity mention present in the given text is detected based on the "Anchor-Text Dictionary". Figure 3 presents a screenshot of the detected mentions of the input article.
2. Candidate Generation: For each detected mention in the given input text, candidate entities are generated based on the Anchor-Text Dictionary. For example, for the first detected mention, "Albert Einstein", the generated candidate entity is the Wikipedia entity "Albert Einstein" (https://en.wikipedia.org/wiki/Albert_Einstein).
3. Classification: Finally, with the help of the entity and category embeddings, a score is calculated for each category based on the assigned entities. Figure 3 presents an example of a classification result for the given text (the abstract of Albert Einstein's Wikipedia page). Based on the scores, the categories are arranged in descending order.

Fig. 3. Example of the detected mentions and the classification result

4 Conclusion and Future Work

In this paper, we demonstrate TECNE, a system for knowledge-based text classification using network embeddings. Future work includes extending TECNE to enable users to define their own list of categories according to which the input text will then be classified.

References

1. Biswas, R., Türker, R., Moghaddam, F.B., Koutraki, M., Sack, H.: Wikipedia infobox type prediction using embeddings. In: DL4KGS@ESWC (2018)
2. Li, C., Xing, J., Sun, A., Ma, Z.: Effective document labeling with very few seed words: A topic model approach. In: CIKM. pp. 85–94. ACM (2016)
3. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: LINE: Large-scale information network embedding. CoRR (2015)
4. Türker, R., Zhang, L., Koutraki, M., Sack, H.: "The less is more" for text classification. In: SEMANTICS (2018)
5. Xuan, J., Jiang, H., Ren, Z., Yan, J., Luo, Z.: Automatic bug triage using semi-supervised text classification. CoRR (2017)
6. Zhang, X., Zhao, J.J., LeCun, Y.: Character-level convolutional networks for text classification. In: NIPS (2015)