ConIText: An Improved Approach for Contextual Indexation of Text Applied to Classification of Large Unstructured Data

Mohamed Salim El Bazzi1, Abdelatif Ennaji2, and Driss Mammass1

1 IRF-SIC Laboratory, Ibn Zohr University, Agadir, Morocco
elbazzi.mohamedsalim@edu.uiz.ac.ma, mammass@uiz.ac.ma
2 LITIS Laboratory, University of Rouen, France
abdel.ennaji@univ-rouen.fr

Abstract. Managing text documents of a large size can be particularly challenging. For this reason, the indexation process is a crucial and decisive step for retrieving relevant textual features, and it is essential to design methods that effectively take complex data into account. In this work, we define a new contextual automatic indexation approach. We present the ConIText System, a context-based approach for document indexation, together with its application, and we study its tolerance to large amounts of documents. In other words, we propose a new large corpus of texts and assess performance gradually from 1,000 to 20,000 documents, in order to observe the behavior of the indexation system as the data grows. To compare classification results, we used KNN and SVM classifiers. The ConIText system outperforms conventional statistical indexation based on the TF-IDF method. Moreover, our contextualization system is generic and does not rely on external resources. Although we have tested the ConIText System on an Arabic dataset, it is not limited to a single language.

Keywords: ConIText · Text Mining · Indexation · Context · Data Analysis · Classification.

1 Introduction

The indexation of texts is a crucial step in text processing. It makes it possible to represent documents by their most relevant features. Several approaches are used for this purpose. However, extracting knowledge from textual data remains an important issue, especially for large amounts of data.

To this end, we propose a contextual approach for the automatic indexation of texts. Indeed, in order to explore big data and disclose the hidden semantic information in unstructured documents such as texts, an efficient indexation system is required. We therefore propose a new approach for text indexation based on semantic proximity, taking into account the contexts contained in each document.

This led to our second proposition: a new approach for document modeling. To test the performance of our approach, a large corpus was needed. Therefore, we built our dataset from Arabic online encyclopedias. It contains 20,000 documents, labeled and categorized into 7 classes. The tests are run gradually on 1,000, 5,000, 10,000 and 20,000 documents to study the robustness of the ConIText system. Once the input is integrated into the system, a certain level of syntactic processing is required. After the preprocessing step, the ConIText System identifies the sentences of each document and groups them according to their semantic proximity. Then, the system identifies contexts and models the document for classification purposes [6].

We propose, in this work, a new approach for context discovery based on sentence clustering techniques. Moreover, we introduce an efficient document modeling method. This model captures the most dominant context in a given document.
Finally, to assess the robustness of the proposed system, we conducted experimental tests comparing its results to those of conventional statistical methods such as TF-IDF.

The organization of this paper is as follows. In part 2, we introduce related works. Part 3 details our proposed ConIText System. In part 4, we highlight the experiments and results. Finally, we conclude by synthesizing the contributions of this work.

2 Related works

Most research in the field of unsupervised information extraction focuses on keyword extraction. Few works offer methods to extract the contextual relations present in a document.

Contextual approaches aim, on the one hand, to remove ambiguity from the meaning of texts and, on the other hand, to highlight the semantic relations between the words of a text. Semantic relationships can also be calculated using methods that evaluate the quantity of information shared between sets of words.

A study of Named Entity Recognition (NER) is presented in [18] for identifying different NER classes in social media. The authors use word similarity along with several text mining techniques for named entity class discovery.

A survey of document clustering using semantic approaches is introduced in [16], presenting a comparison between LSI, graph, ontology and lexical-chain methods.

The authors of [7] propose a text classifier called Supervised Meaning Classification. They introduce the Helmholtz principle to measure meaning, which consists in noticing unexpected events in a particular context. They compare their results to an SVM classifier, which their method outperforms.

Mohamed et al. [13] used the LSA method to evaluate each term in a document, then applied an Evidential Reasoning (ER) method to assign a new document to a category based on the corpus. Experiments showed that ER-LSA is more efficient than ER-TFIDF.

The authors of [2] present a graph-based modeling of text. The nodes of the graph correspond to the terms of a document, and the relationship between two nodes represents the semantic relationship between two words. The proposed approach outperforms the traditional bag-of-words (BOW) approach.

Similarly, Herskovic et al. [10] propose the MedRank algorithm to re-rank the concepts extracted from a medical base. First, the MetaMap program extracts these concepts; then, new scores are assigned to the concepts using the TextRank algorithm. The best results are obtained with the MedRank approach.

In [8], the authors address the classification of a large amount of texts, where each text is modeled by a vector containing a large number of words. They highlight the importance of selecting the most relevant features for classification, which was the goal of their feature extraction algorithm. Three different feature selection methods are used: Information Gain, Correlation and k-Best-Discriminative-Terms (k-BDT).

A Multivariate Relative Discriminative Criterion (MRDC) is proposed in [12] to perform text classification. First, stopword removal, stemming and term weighting are applied before the classification step. Second, a multivariate feature-ranking criterion is proposed to evaluate features for text classification. Then, a subset of features is evaluated using a supervised learning algorithm.

The objective of the authors in [11] is dimensionality reduction without compromising the performance of a classifier. After forming the document-term matrix, they apply data mining techniques to solve this problem.
Their research introduces a method for document classification by performing dimensionality reduction with PCA.

The authors of [15] propose a fuzzy-logic-based multi-document summarization system that extracts relevant sentences to generate a non-redundant summary. The approach builds on a generic summarization system.

The authors of [14] treat the problem of Arabic text classification. They use the SVM, NB and MLP-NN algorithms and apply tests on an in-house dataset. The study aims to apply these algorithms to an Arabic dataset and conduct a comparative study. The average measures show that the SVM algorithm outperformed NB and MLP-NN.

In [1], the authors propose a space-independent text classification algorithm based on Markov chain theory. Each document is represented by a sequence of character co-occurrences in the document, and each category of the corpus is used to create a single probability transition matrix that is then used in the classification process.

As proposed in [4], TF-IDF with dimensionality reduction can improve precision in lexical matching for identifying the domain categories of a document. A higher level of accuracy can be reached depending on the reduction approach adopted for document classification.

The study in [3] reports the results of an improved feature selection algorithm combined with decision trees and SVM for text classification. The study compares the impact of this approach with the results of text classification using Chi-square, Mutual Information and Gini Index, and shows that ImpCHI with SVM outperforms the use of Chi-square, MI and GI.

The authors of [5] show that applying the TF-IDF-ICF (Term Frequency - Inverse Document Frequency - Inverse Class Frequency) method with a dimensionality reduction technique can yield higher precision in document classification.

In [9], the authors introduce a method based on a continuous distributed representation of words. The proposed Arabic taxonomy, which is independent of the model used to classify Arabic questions, provides promising results in Arabic question classification.

The authors of [17] investigate the effectiveness of unsupervised learning, semi-supervised learning and semi-supervised learning with dimensionality reduction, using k-means, incremental k-means, threshold k-means and k-means with dimensionality reduction, to measure the accuracy of SVM.

Table 1. Synthetic view of related works

Reference  Used Method                            Aim
[2]        Word2Vec                               Named Entity Recognition
[3]        LSI, Graph                             Clustering
[4]        Helmholtz                              Classification
[5]        LSA, TFIDF, Evidential Reasoning       Classification (SVM)
[6]        TextRank                               Classification (KNN)
[7]        TextRank                               Concept Extraction
[8]        k-BDT                                  Classification (DT, BN)
[9]        Minimal-redundancy-maximal-relevance   Classification (MLP)
[10]       PCA                                    Classification (SVM)
[11]       TF-IDF                                 Summarization
[12]       TF-IDF                                 Classification (SVM, NB, MLP)
[13]       Markov Chain                           Classification
[14]       TF-IDF                                 Classification (LIBLINEAR)
[15]       MI, IG, Chi-square                     Classification (SVM)
[16]       TF-IDF-ICF                             Classification (MLP, NB, KNN)
[17]       TF-IDF                                 Classification (SVM)

Each of the presented methods highlights certain criteria. The approach we propose takes advantage of these existing advances and introduces a new concept of contextualization for a more refined indexation process.
3 ConIText: Contextual Indexation of Text

In this part, we introduce the architecture of the automatic system for contextual indexation. It is a set of complex text mining methods that forms an autonomous process of extracting contexts, and then document features, for a context-based indexation (Fig. 1).

Fig. 1. ConIText System Overview

In fact, a dataset may contain categories of documents that are either homogeneous or heterogeneous. Thus, the categories of politics and economics can be homogeneous, whereas the categories of new technologies and literature can be heterogeneous. Moreover, a document can express several contexts. For example, a document that describes a political decision and its impact on the economy will be difficult to classify into the appropriate category. This is the kind of ambiguity that the ConIText System overcomes, by selecting the most appropriate context for each document.

We define context as the linguistic environment of a textual element (word, sequence of words, etc.) within the utterance in which it appears, that is, the series of textual units that precede and follow it. The term context also refers to all the circumstances in which an act of enunciation takes place, such as cultural and psychological situations, experience and knowledge of the world, trade and promotion in economics, and so on.

Furthermore, we define the sentence as the minimal element that can express a context. A sentence is a set of words conveying a complete meaning. Therefore, sentences are the first units detected by the ConIText System. Then, semantically close sentences are gathered to form a context. From each context of a document, we extract relevant features to form context vectors. Finally, the vector corresponding to the most dominant context models the whole document. Three steps are essential to the ConIText System. The first step is the segmentation of texts. The second step is building the contexts from which the features will be extracted. Finally, we model documents with our proposed principle of dominance.

3.1 Segmentation

This proposition is inspired by works that study the semantic grouping of sentences to extract relevant information. As a matter of fact, a sentence tends to express an idea, a context in our case, in a self-contained way. To go further in our data processing, the sentence-splitting phase is essential. Indeed, words are organized into sequences, sentences or paragraphs, to define the meaning of the document. Therefore, exploring the relationships between the different components of the document is important to understand the document in depth.

Hence, this step consists of defining the units of a text that will form the contexts. Segmentation is the process of dividing textual documents into meaningful sentences. Humans naturally understand sentences when reading a text; intuitively, we instill this power of understanding into our algorithm.

Texts have explicit markers of sentence boundaries, so we use punctuation marks to delineate sentences. In this work, we test the ConIText system on an Arabic corpus. Since the Arabic language does not have capital letters, our sentence segmentation is based on periods, exclamation marks and question marks (".", "!", "?").
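As a minimal sketch of this segmentation step (in Python; the paper does not specify an implementation, so the regular-expression approach, the function name segment, and the addition of the Arabic question mark "؟", U+061F, alongside the three marks listed above are our assumptions):

```python
import re

# Explicit sentence boundary markers: period, exclamation mark and
# question mark, plus the Arabic question mark (U+061F) as an assumed
# extension for Arabic text.
BOUNDARIES = re.compile(r"[.!?\u061F]+")

def segment(document: str) -> list[str]:
    """Split a document into sentences on explicit punctuation marks."""
    fragments = BOUNDARIES.split(document)
    # Discard empty fragments produced by trailing punctuation.
    return [s.strip() for s in fragments if s.strip()]
```

Each returned fragment then becomes one unit for the clustering step described next.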
3.2 Building contexts

This phase consists of grouping the obtained sentences into clusters, where each cluster represents a context. Obviously, each document will have at least one context. To perform sentence clustering, we used iterative k-means with a mean square error metric. This method makes it possible to find, for each document, the optimal number k of clusters. Thus, each document is subdivided into k clusters, and we group the sentences together to obtain k contexts for each document in the corpus (Fig. 2).

Fig. 2. Contextualization process

To conceive an efficient clustering process, the weights of words must be standardized based on their occurrence in the document and their distribution in the entire corpus. A common representation used for text processing is TF-IDF. The TF-IDF weighting method is widely used by researchers: it is a frequency measure associated with the Vector Space Model (VSM), which associates a weight vector with each document. TF represents the number of occurrences of a word in the document and IDF is the inverse frequency of the word in the corpus.

This method reduces the importance of terms common in the collection while ensuring that the matching of documents is more influenced by the most discriminating words, which have a relatively high frequency in the document and low frequencies in the corpus.

In this work, since the form of the documents has changed with the generated contexts, we introduce a slight modification of the TF-IDF formula. Named TF-ICF (Term Frequency - Inverse Contexts Frequency), it is expressed as follows:

TF-ICF_context(i)(t) = TF_context(i)(t) × log( tf_context(i)(t) / ICF(t) )

where t is a term of the context i, TF_context(i)(t) is the frequency of t within context(i), and ICF(t) is the number of occurrences of t in all contexts of the corpus.

The obvious advantage of this method is to calculate the relevance of a term with respect to all contexts of the corpus. This makes it possible to express the value of terms judged irrelevant by the conventional TF-IDF scheme, even though they play a powerful discriminative role in the document.

3.3 Document Modeling with the Dominance Principle

The dilemma in text mining is to select a representation of the textual information that is able to capture the semantic content of the text. To model the document, we use the VSM representation. After building the contexts, we calculate the score of each word in the context using the TF-ICF method, in order to model each context by a corresponding vector of word weights. Thus, each document is represented by a set of vectors (Fig. 3).

Fig. 3. Document modeling with dominance principle

A constraint then arises: each document must be represented by a single weight vector. To perform this step, we define the principle of the dominant context. After contextualization, each document is divided into one or more contexts, and each context is modeled by one vector of weights. The dominant context is the vector with the strongest weights. Formally, the k vectors modeling the contexts of a given document D yield:

V(D) = Max{ V_i(D), i ∈ [1, k] }

where V(D) is the unique vector that models the document D, V_i is the vector modeling context i, and k is the number of contexts discovered in D.
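The chain described in Sections 3.2 and 3.3 can be sketched as follows. This is a minimal illustration under assumptions rather than the authors' implementation: scikit-learn's KMeans (whose inertia_ is the within-cluster sum of squared errors) stands in for the iterative k-means with mean-square-error metric; the 5% improvement threshold, the whitespace tokenization and the aggregation of a vector's weights by their sum (one plausible reading of "the vector of the strongest weights") are all ours.

```python
import math
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def build_contexts(sentences, k_max=5):
    """Group semantically close sentences into contexts with an
    iterative k-means: try k = 1..k_max and keep the clustering whose
    per-sentence mean square error (KMeans inertia) still improves
    noticeably (the 5% threshold is an assumption)."""
    X = TfidfVectorizer().fit_transform(sentences)
    best_k, best_labels, best_mse = 1, np.zeros(len(sentences), int), math.inf
    for k in range(1, min(k_max, len(sentences)) + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        mse = km.inertia_ / len(sentences)
        if mse < 0.95 * best_mse:
            best_k, best_labels, best_mse = k, km.labels_, mse
    contexts = [[] for _ in range(best_k)]
    for sentence, label in zip(sentences, best_labels):
        contexts[label].extend(sentence.split())  # naive tokenization
    return contexts


def tf_icf_vectors(doc_contexts, corpus_contexts):
    """Model each context by a TF-ICF weight vector, following the
    formula above. Assumes corpus_contexts includes doc_contexts,
    so that ICF(t) >= 1 for every term t seen in the document."""
    icf = Counter(t for ctx in corpus_contexts for t in ctx)
    vectors = []
    for ctx in doc_contexts:
        tf = Counter(ctx)
        vectors.append({t: tf[t] * math.log(tf[t] / icf[t]) for t in tf})
    return vectors


def dominant_context(vectors):
    """Dominance principle: keep the context vector carrying the
    strongest weights (aggregated here by total weight)."""
    return max(vectors, key=lambda v: sum(v.values()))
```

A document D would then be modeled as dominant_context(tf_icf_vectors(build_contexts(segment(D)), all_contexts)), where all_contexts gathers the contexts built over the whole corpus.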
4 Data

During our research, we often faced the problem of the lack of a significant corpus. To validate the efficiency of an indexation system, it is essential to test it on a large amount of data, yet the best structured corpora are often not open access (see Table 2).

In many works on text mining, the authors build their own dataset: they choose the number of categories and the themes to use, the documents of each category are collected manually, and those belonging to several categories are eliminated. Nonetheless, the size of such datasets is too small to assess the power of a system, and the areas covered are geared toward specific issues.

Table 2. Corpus sizes in related works

Reference  Corpus size
[6]        1084
[8]        1500
[12]       1400
[13]       1480
[14]       1960
[15]       5070
[16]       4000
[17]       1302
[18]       600

This problem led us to create a new labeled corpus in the Arabic language of 20,000 texts, containing 27,605,263 words after document pretreatment (stemming and stopword removal), with 7 labeled classes as presented in Table 3. This corpus is collected from Arabic encyclopedias and has the particularity of containing both homogeneous and heterogeneous themes, to better assess the precision of systems. We make this data freely available to researchers.

Table 3. Corpus details

Class label             Number of documents
Literature              2936
History and Geography   3830
Civilization            3306
Sciences                3452
Architecture            1406
Philosophy              2702
Medicine                2368
Total                   20,000

5 Experiments

In this work, we proposed the ConIText System, a context-based system for automatic text indexation. To test our approach, we opted for a large dataset to evaluate its robustness and reliability, and we tested the proposed system gradually on increasing amounts of data (Table 3). For comparative purposes, we conducted the tests using two classifiers, KNN and SVM. The different models of these classifiers show the tolerance of our indexation approach to classification systems.

The experimental evaluation of the classifiers is the final step in the classification process. It aims to evaluate the effectiveness of a classifier, namely its ability to make correct categorization decisions.

Fig. 4. F-measure of TFIDF and TFICF using KNN

Fig. 4 and Fig. 5 show the results of the KNN and SVM classification of documents using the TF-IDF and ConIText systems, expressed by the F-measure. The superior performance of our system is evident; this is due to the sophistication of the techniques used for context-based indexation. The classification parameters are kept fixed for both classifiers. The strong point of these experiments is the behavior of ConIText on a wide range of documents, including more than 10,000 documents. This advantage is more visible in the following figures.

Fig. 5. F-measure of TFIDF and TFICF using SVM
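Hypothetically, the comparison can be reproduced along the following lines; the paper does not report the classifier hyperparameters, the train/test split, or the F-measure averaging mode, so every concrete choice below (k = 5 for KNN, a linear SVM, a 70/30 split, macro averaging, DictVectorizer for the weight dictionaries) is an assumption:

```python
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC


def evaluate(X, y):
    """Train KNN and SVM on document vectors, built with either TF-IDF
    or the ConIText dominant-context TF-ICF weights (the latter turned
    into a matrix, e.g. via sklearn's DictVectorizer), and report
    macro-averaged precision, recall and F-measure."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    for name, clf in (("KNN", KNeighborsClassifier(n_neighbors=5)),
                      ("SVM", LinearSVC())):
        clf.fit(X_train, y_train)
        p, r, f, _ = precision_recall_fscore_support(
            y_test, clf.predict(X_test), average="macro", zero_division=0)
        print(f"{name}: precision={p:.2%} recall={r:.2%} F-measure={f:.2%}")
```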
6 Discussion

The first test was performed on 1,000 documents, a proportion widely used in the literature. Then, we ran the test on 5,000 documents; this number corresponds to the maximum corpus size used in similar works (Table 2). We pushed the tests to 10,000 documents to see how the system reacts to such a mass of documents. Finally, we performed a test on 20,000 documents to study the stability of the system.

Tables 4 and 5 present the results of the comparison between the TF-IDF system and the ConIText system, expressed by precision, recall and F-measure. Table 4 presents the results using the KNN classifier and Table 5 those of the SVM classifier. In particular, these results show the relevance of contextual indexation, which effectively improves classification performance.

Table 4. KNN classification results (%)

TF-IDF
Documents  Precision  Recall  F-measure
1000       67.77      56.67   61.72
5000       59.47      50.60   54.67
10000      51.09      47.07   48.99
20000      45.66      38.82   41.96

ConIText
Documents  Precision  Recall  F-measure
1000       80.09      66.21   72.49
5000       76.25      60.00   67.15
10000      69.01      56.69   62.24
20000      67.97      53.08   59.60

Table 5. SVM classification results (%)

TF-IDF
Documents  Precision  Recall  F-measure
1000       62.16      60.22   61.17
5000       60.36      58.15   59.23
10000      56.66      48.21   52.09
20000      47.39      39.18   42.89

ConIText
Documents  Precision  Recall  F-measure
1000       81.10      67.87   73.89
5000       78.57      63.08   69.97
10000      72.50      55.00   62.54
20000      69.62      54.07   60.86

The results are illustrated in Fig. 6 and Fig. 7 in terms of the precision and recall of the KNN and SVM classifiers. The curves indicate that the performance of the TF-IDF method drops dramatically as the database takes in more and more documents. However, ConIText's results are not only more satisfying but also scale well to large datasets.

Fig. 6. Precision and recall of TFIDF and TFICF using KNN

We can clearly see that the performances are almost constant between 10,000 and 20,000 documents. This enhances the effectiveness of the context-based indexation system and confirms our theory.

Fig. 7. Precision and recall of TFIDF and TFICF using SVM

We can draw several conclusions from our experimental results. First, the contextual model proved to be an appropriate representation for large datasets. Indeed, context offers several levers that can be used to refine keyword extraction. Second, the relations between words are expressed by preserving the information shared within a context, which certainly leads to better results.

7 Conclusion

In this paper, we have introduced ConIText, a contextual indexation system for texts. The integration of a semantic measure between sentences is necessary in this approach; for this reason, we introduced our sentence-grouping contribution to formalize the adaptation of the model to semantic proximity. The advantage of this approach is that it needs no preliminary specific knowledge to identify terms and assign them weights, since term identification is done through automatic document processing.

We also proposed a contextual modeling of documents to increase the accuracy of indexation. In fact, the semantic proximity between words must be emphasized when dealing with complex and unstructured documents such as texts. For this reason, it is essential to broaden our thinking toward representation models adapted to the nature of our resources. To this end, we introduced a contextual modeling of documents based on the principle of dominance. The advantage of this model is that it reduces the feature representation space and reduces the whole modeling of a document to its most significant context.

8 Acknowledgements

This work was funded by the LITIS laboratory and the University of Rouen Normandy, France.

References

1. Al-Anzi, F.S., AbuZeina, D.: Beyond vector space model for hierarchical arabic text classification: A markov chain approach. Information Processing & Management 54(1), 105–115 (2018)
2. Alami, N., Meknassi, M., Ouatik, S.A., Ennahnahi, N.: Impact of stemming on arabic text summarization. In: 2016 4th IEEE International Colloquium on Information Science and Technology (CiSt). pp. 338–343. IEEE (2016)
3. Bahassine, S., Madani, A., Al-Sarem, M., Kissi, M.: Feature selection using an improved chi-square for arabic text classification. Journal of King Saud University - Computer and Information Sciences (2018)
4. Dhar, A., Dash, N.S., Roy, K.: Application of tf-idf feature for categorizing documents of online bangla web text corpus. In: Intelligent Engineering Informatics, pp. 51–59. Springer (2018)
5. Dhar, A., Dash, N.S., Roy, K.: Categorization of bangla web text documents based on tf-idf-icf text analysis scheme. In: Annual Convention of the Computer Society of India. pp. 477–484. Springer (2018)
6. El Bazzi, M.S., Mammass, D., Ennaji, A., Zaki, T.: Toward a complex system for context discovery to index arabic documents. JCP 13(8), 955–962 (2018)
7. Ganiz, M.C., Tutkan, M., Akyokuş, S.: A novel classifier based on meaning for text classification. In: 2015 International Symposium on Innovations in Intelligent SysTems and Applications (INISTA). pp. 1–5. IEEE (2015)
8. Gonçalves, C.A., Iglesias, E.L., Borrajo, L., Camacho, R., Vieira, A.S., Gonçalves, C.T.: Comparative study of feature selection methods for medical full text classification. In: International Work-Conference on Bioinformatics and Biomedical Engineering. pp. 550–560. Springer (2019)
9. Hamza, A., En-Nahnahi, N., Zidani, K.A., Ouatik, S.E.A.: An arabic question classification method based on new taxonomy and continuous distributed representation of words. Journal of King Saud University - Computer and Information Sciences (2019)
10. Herskovic, J.R., Cohen, T., Subramanian, D., Iyengar, M.S., Smith, J.W., Bernstam, E.V.: Medrank: Using graph-based concept ranking to index biomedical texts. International Journal of Medical Informatics 80(6), 431–441 (2011)
11. Kumar, B.S., Ravi, V.: Text document classification with pca and one-class svm. In: Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications. pp. 107–115. Springer (2017)
12. Labani, M., Moradi, P., Ahmadizar, F., Jalili, M.: A novel multivariate filter method for feature selection in text classification problems. Engineering Applications of Artificial Intelligence 70, 25–37 (2018)
13. Mohamed, R., Watada, J.: An evidential reasoning based lsa approach to document classification for knowledge acquisition. In: 2010 IEEE International Conference on Industrial Engineering and Engineering Management. pp. 1092–1096. IEEE (2010)
14. Mohammad, A.H., Alwada'n, T., Al-Momani, O.: Arabic text categorization using support vector machine, naïve bayes and neural network. GSTF Journal on Computing (JoC) 5(1), 108 (2016)
15. Patel, D.B., Shah, S., Chhinkaniwala, H.R.: Fuzzy logic based multi document summarization with improved sentence scoring and redundancy removal technique. Expert Systems with Applications (2019)
16. Saiyad, N.Y., Prajapati, H.B., Dabhi, V.K.: A survey of document clustering using semantic approach. In: 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT). pp. 2555–2562. IEEE (2016)
17. Sangaiah, A.K., Fakhry, A.E., Abdel-Basset, M., El-henawy, I.: Arabic text clustering using improved clustering algorithms with dimensionality reduction. Cluster Computing pp. 1–15 (2018)
18. Taşpınar, M., Ganiz, M.C., Acarman, T.: A feature based simple machine learning approach with word embeddings to named entity recognition on tweets.
In: International Conference on Applications of Natural Language to Information Systems. pp. 254–259. Springer (2017)