Conceptual Scheme for Text Classification System Nicolay Lyfenko Russian State University for the Humanities, Moscow lyfenkoNick@yandex.ru Abstract. The paper describes an application of classification algorithms to the text categorization problem. Author proposes a conceptual scheme for an auto- matic text categorization system. This system must operate with various text representation models and data mining methods. The novelty of this system consists in advanced implementation of JSM method for automatic hypothesis generation — an original logical-combinatorial technology of data mining, which is developed in Russia by several research groups. Keywords: text classification system, machine learning, data mining, natural language processing 1 Introduction Due to an increasing number of text documents in digital form and the extension of a data stream in different fields of professional activities the interest in a text catego- rization task has essentially increased. The main goal of classifying a new text is to assign a predefined class or classes to it [1]. It is being solved with the help of the text classification system ADC (automatic document classifier). Our system includes: different text representation models, a number of text mining methods and some text similarity metrics. The main goal of the system is to compare various classical text classification methods to JSM method for automatic hypothesis generation and choose the best one for a particular task [2, 3]. This research is in progress so the main purpose of this work is to build a concep- tual scheme for the ADC system, develop a project scheme for ADC system and rep- resent its current state of work. There is a great variety of machine learning methods to make a text classification. The most popular of are: k-nearest neighbor, Rocchio classifier, neural network, deci- sion trees, naive Bayes classifier, and support vector machine [4–6]. There are not only algorithms but ready to use frameworks and IDE’s for text classification problem (e.g. Rapidminer1, Gate2). But none of them has the JSM method implemented. This method was proposed by V.K. Finn at the beginning of the 1980s. The abbre- viation JSM is given in honor to John Stuart Mill. The JSM method uses the Mill’s idea 1 http://rapidminer.com/products/rapidminer-studio/ 2 https://gate.ac.uk/ 17 that common effects are more likely to have common causes. The JSM method for au- tomatic hypothesis generation is known as an original set of logical combinatorial tech- nologies for data mining using rules of plausible reasoning [7]. The JSM method includes three cognitive procedures: induction, analogy, abduction [2] and two main stages: learning (to identify data patterns using Mill’s agreement) and prediction. By means of induction the JSM method generates casual hypotheses. With the help of analogy additional definition to unknown examples is formed (prediction). The abduction procedure evaluates the plausibility of the generated hypothesis. This logical-combinatorial method for intelligent data analysis has shown good re- sults on level with SVM method in the work [8] for the task of sentiment analysis. So we have a proposal to apply it in the task of automatic topic and authorship classifica- tion. 2 Conceptual Scheme for ADC System Data receiver Data processor Result interpreter Get data Detect language and document code page Choose term model Tokenize [DelStopWordsRequired] Remove stop words [WeightingRequired] Weight terms [NormalizationRequired] Normalize terms Make feature vector Apply text classification algorithm Update experiment DB Result compare Interpret results Fig. 1. Conceptual scheme for ADC system Fig. 1 shows the key steps for automatic document classification used in the ADC system: to get data, to process it and to analyze results. Reasoning from the fact that a document to analyze can be written in different code pages and various languages (Russian and English currently supported) a character set 18 and a text language should be identified. We are using statistical analysis as in [9]. In our research we normalize terms with the help of a made inverse dictionary based on Zaliznak’s for the Russian language3. English words are stemmed. We use some classical IR text models: frequent model, tf-idf model for text repre- sentation as an n-dimensional vector (vector space model) and not so popular but promising ones are investigated: LOWBOW (Locally Weighted Bag of Words Framework) [9], MFS (Maximal Frequent Sequences) [6], Document Occurrence Representation (DOR) & Term Co-occurrence Representation (TCOR) [9]. 2.1 Project Object Model In order to choose the best technic for a certain text classification approach we have to compare all the methods and have a log of our experiments. That is why it is proper to have well-structured and a user-friendly GUI for an experiment and logical- ly organized project scheme for ADC system and data base for experiments. Fig. 2. Project model for ADC system A project scheme for ADC system is represented in Fig. 2. It has a name, a date and a project configuration (for user’s visualization preferences) properties and ex- periment set as a collection of experiments. It is useful to know which piece of data is used for a learning phase and a test one and what results should be shown in a log file. The property experiment configuration (ExConfiguration) gives the information about the text representation model, term weighting and the classification method. 3 With the help of the COM object from www.aot.ru 19 3 Conclusions In the article we suggest a conceptual scheme for an automatic document classifi- cation system (ADC). The main goal of which is to choose the best text representation model and classification algorithm for a certain application. In more detail: to com- pare JSM method for automatic hypothesis generation to text classification methods. That is why a project object model and its conceptual scheme are developed. The current state of the system is the following: the task of converting a text to an n- dimensional vector is solved. Frequent and tf-idf models for text representation are implemented. Term normalization (using the dictionary for Russian and stemming for English languages) is done. Later the JSM method should be implemented and examined; data base scheme should be developed; experiments should be carried out and the results should be compared. References 1. Sebastiani, F.: Machine Learning in Automated Text Categorization. J. ACM Computing Surveys vol. 34(1), pp. 1–47 (2002) 2. Finn, V.K.: Plausible inference and plausible reasoning. J. Sov Math, vol. 56(1), pp. 2201– 2248 (1991) 3. Finn, V.K.: The synthesis of cognitive procedures and problem of induction. Autom Doc Math Lingust, vol. 43(3), pp.149–195 (1999) 4. Lyfenko, N.: Avtomaticheskaja Klassifikacija Tekstovyh Dokumentov na Russkom i An- glijskom Jazykah s Pomoshh'ju Metodov Mashinnogo Obuchenija. J. Molodezhnyj nauch- no-tehnicheskij vestnik, vol. 4, (2013) (in Russian) 5. Cabera, J.M., Escalante, H. J., Montes-y-Gómez, M.: Distributional Term Representations for Short-Text Categorization. 14th International Conference on Text Processing and Com- putational Linguistics. Samos, Greece, (2013) 6. Ahonen-Myka, H.: Finding All Maximal Frequent Sequences in Text. Proceedings of the 16th International Conference of Machine Learning ICML-99 Workshop on Machine Learning in Text Data Analisys, eds. D. Mladenic and G. Grobelnik, pp.11-17, J. Stefan Institute, Ljubljana, (1999) 7. Anshakov, O.M. The JSM method: A set-theoretical explanation. Automatic Documenta- tion and Mathematical Linguistics 46 (5),pp. 202-220,(2012) 8. Kotelnikov, E. V.: Using JSM Method for Sentiment Analysis. 3rd International Confer- ence on Science and Technology Held by SCIEURO in London, рp.56 (2013) 9. Lebanon, G., Mao, Y., Dillon, M.: The Locally Weighted Bag of Words Framework for Document Representation. J. Machine Learning Research. vol 8, pp.2405–2441, (2007) 20 Концептуальная схема системы классификации текста Николай Д. Лыфенко Российский государственный гуманитарный университет lyfenkoNick@yandex.ru Аннотация. Предлагается концептуальная схема для решения задачи автоматической классификации текста. Рассматриваются различные пред- ставления текстов на естественном языке, а также статистические и логи- ко-комбинаторные методы анализа текстов. Новизна система заключается в имплементации ДСМ метода автоматического порождения гипотез – оригинальной технологии интеллектуального анализа данных, разрабаты- ваемой в России различными группами исследователей. Ключевые слова. Классификация текста, машинное обучение, обра- ботка естественного языка, интеллектуальный анализ данных. 21