Text Classification Using Association Rules, Dependency Pruning and Hyperonymization

Yannis Haralambous and Philippe Lenca

Institut Mines Telecom, Telecom Bretagne, UMR CNRS 6285 Lab-STICC
Technopôle Brest Iroise CS 83818, 29238 Brest Cedex 3, France
.@telecom-bretagne.eu

Abstract. We present new methods for pruning and enhancing itemsets for text classification via association rule mining. The pruning methods are based on dependency syntax, and the enhancing methods are based on replacing words by their hyperonyms of various orders. We discuss the impact of these methods compared to pruning based on the tfidf rank of words.

Introduction

Automatic text classification is an important text mining task, due to the huge number of text documents that we have to manage daily. Text classification has a wide variety of applications, such as Web document and email classification. Indeed, most Web news services provide a large number of articles every day, making it impossible to organize them manually [14]. Automatic subject classification [9] and SPAM filtering [18] are two further examples of the interest of automatic text classification.

Automatic text classification can be defined as follows: given a set of documents such that each document is labeled with a class value, learn a model that assigns a document with unknown class to one or more particular classes. This can also be done by assigning a probability value to each class or by ranking the classes.

A wide variety of classical machine learning techniques have been used for text classification. Indeed, texts may be represented by word frequency vectors, and thus most quantitative data methods can be applied directly to the notorious "bag-of-words" model (cf. [27,3]).

Choosing a classifier is a multicriteria problem. In particular, one often has to make a trade-off between accuracy and comprehensibility. In this paper we are interested in both criteria, with a deeper interest in comprehensibility. We are thus interested in rule-based approaches, and especially in class association rule algorithms. Several studies have already successfully considered association rule-based approaches in text mining (e.g., [4], [29], [7], [25]). This framework is suitable for handling some statistical characteristics (e.g., high dimensionality, sparsity, ...) of the bag-of-words model, where a document is represented as a set of words with their associated frequency in the document.

However, a text is more than a set of words and their frequencies. Enhancing the bag-of-words approach with linguistic features has also attracted several works (e.g., [12,11,13,22], [23,16,10], [21,6]).

We here propose a class association rule-based approach enriched by linguistic knowledge. The paper is organized as follows: after introducing the techniques we are going to use (class association rules § 1.1, dependencies § 1.2, hyperonymization § 1.3), we describe our main algorithms (for training § 2.1, classifying § 2.2 and evaluating § 2.3); then follows the experimental section, where we give results obtained by tfidf pruning (§ 3.2), dependency-based pruning (§ 3.3) and hyperonymization (§ 3.4); finally, we end with a conclusion and perspectives for future work (§ 4).
1 Proposed model for text classification

Let a corpus be a set 𝒞 = {D1, ..., Dn} of documents and let C be a set of classes. An annotated corpus is a pair (𝒞, class), where class : 𝒞 → C is a function that maps each document Di to a (predefined) class of C.

A document D ∈ 𝒞 is a set of sentences S. The corpus 𝒞 can be considered as a set of sentences 𝒮 = {S1, ..., Sm} if we go through the forgetful functor (which forgets the document to which each sentence belongs). Repeated sentences in the same document, or identical sentences in different documents, are considered distinct, i.e., there is a function ι : 𝒮 → 𝒞 which restores the forgotten information. We extend the class function to 𝒮 by class(S) := class(ι(S)).

A sentence S is a sequence of words w (sometimes we will consider S simply as a set, without changing the notation). Let W = ⋃_{S∈𝒮} ⋃_{w∈S} {w} be the set of all words of 𝒞.

1.1 Class association rules and text classification

Let I be a set of objects called items and C a set of classes. A transaction T is a pair ({i1, ..., in}, c), where {i1, ..., in} ⊆ I and c ∈ C. We denote by 𝒯 the set of transactions, by items(T) the set of items (or "itemset") of T, and by class(T) the class of T.

Let I be an itemset. The support of I is defined by

    supp(I) := #{T ∈ 𝒯 | I ⊆ items(T)} / #𝒯.

Let σ ∈ [0,1] be a value called minimum support. An itemset I is called frequent if its support exceeds σ. The confidence of a transaction t is defined as

    conf(t) := #{T ∈ 𝒯 | items(t) ⊆ items(T) ∧ class(t) = class(T)} / #{T ∈ 𝒯 | items(t) ⊆ items(T)}.

Let κ ∈ [0,1] be a value called minimum confidence. A class association rule (or "CAR") r = ({i1, ..., in}, c) [15] is a transaction with a frequent itemset and a confidence exceeding κ.

To classify text with CARs, we consider words as items, documents as itemsets, and pairs of documents and classes as transactions. The advantage of this technique is that CARs can be easily understood, and hence potentially improved, by the user, especially if the classifier is tuned so that it produces a humanly reasonable number of rules. Once the classifier is trained, to classify a new sentence we first find all CARs whose items are contained in the sentence, and then use an aggregation technique to choose a predominant class among those of the CARs we found.

An important issue with CARs is that the complexity is exponential with respect to itemset size, and hence we need to keep that size bounded in a specific range, independently of the size of the documents to classify. Using entire documents as transactions is computationally out of reach; therefore pruning techniques play an important rôle. Our approach consists in (a) restricting CARs to the sentence level, (b) pruning sentences by using morphosyntactic information (cf. § 1.2), and (c) modifying itemsets by using semantic information (cf. § 1.3).
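To make these definitions concrete, here is a minimal Python sketch of the support and confidence computations over a toy set of transactions. The function names and data are ours, for illustration only; this is not the authors' implementation.

    def support(itemset, transactions):
        """supp(I) = #{T : I subset of items(T)} / #T."""
        return sum(1 for items, _ in transactions if itemset <= items) / len(transactions)

    def confidence(itemset, cls, transactions):
        """conf((I, c)) = #{T : I covered and class(T) = c} / #{T : I covered}."""
        covering = [c for items, c in transactions if itemset <= items]
        return covering.count(cls) / len(covering) if covering else 0.0

    transactions = [
        ({"match", "goal", "team"}, "GSPO"),
        ({"strike", "union", "wage"}, "GJOB"),
        ({"goal", "team", "coach"}, "GSPO"),
    ]
    print(support({"goal", "team"}, transactions))             # 0.666...
    print(confidence({"goal", "team"}, "GSPO", transactions))  # 1.0

On this toy data, the itemset {goal, team} is covered by two of the three transactions (support 2/3), and both covering transactions have class GSPO, so the CAR ({goal, team}, GSPO) has confidence 1.0.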
1.2 Itemset pruning using dependencies

One can prune sentences either by using word frequencies (cf. § 3.2) or by using information obtained by morphosyntactic parsing (cf. § 3.3). In this paper we introduce the latter approach, in the framework of dependency grammar. Dependency grammar [28,19] is a syntactic theory, an alternative to the phrase-structure analysis [8] traditionally taught in primary and secondary education.

In phrase-structure syntax, trees are built by grouping words into "phrases" (with the use of intermediate nodes NP, VP, etc.), so that the root of the tree represents the entire sentence and its leaves are the actual words. In dependency grammar, trees are built using solely words as nodes (without introducing any additional "abstract" nodes). A single word in every sentence becomes the root (or head) of the tree. An oriented edge between two words is a dependency and is tagged by a representation of some (syntactic, morphological, semantic, prosodic, etc.) relation between the words. For example, in the sentence "John gives Mary an apple," the word "gives" is the head of the sentence and we have the following four dependencies: nsubj(gives, John), iobj(gives, Mary), dobj(gives, apple) and det(apple, an), where the tags nsubj, dobj, iobj and det denote "noun subject," "direct object," "indirect object" and "determinant."

Let S be a sentence and D be the set of dependency tags: {nsubj, ccomp, prep, dobj, ...}. A dependency is a triple (w1, w2, d) where w1, w2 ∈ S and d ∈ D. Let Dep(S) denote the set of dependencies of S and root(S) the head of S.

Pruning will consist in defining a morphosyntactic constraint, i.e., a condition on the dependencies (and POS tags) of words, the fulfillment of which is necessary for a word to be included in the itemset. But before describing pruning algorithms and strategies, let us first present a second technique used for optimizing itemsets; this time we use semantic information. We propose to replace words by their hyperonyms, expecting that the frequencies of the latter in the itemsets will be higher than those of the former, and hence will improve the classification process.

1.3 Hyperonymization

The WordNet lexical database [20] contains sets of words sharing a common meaning, called synsets, as well as semantic relations between synsets, which we will use to fulfill our goal. More specifically, we will use the relations of hyperonymy and hyperonymic instance. The graph having synsets as nodes and hyperonymic relations as edges is connected and rooted: starting from an arbitrary synset, one can iterate these two relations until reaching a sink. Note that in the case of nouns the sink will invariably be the synset 00001740 {entity}, while for verbs there are approximately 550 different verb sinks.

Let 𝒲 be the WordNet lexical database, s ∈ 𝒲 a synset and h : 𝒲 → 2^𝒲 the hyperonymic or hyperonymic-instance relation. We define a hyperonymic chain CH(s) as a sequence (s_i)_{i≥0} where s_0 = s and s_i ∈ h(s_{i−1}) for all i ≥ 1. Hyperonymic chains are not unique, since a given synset can have many hyperonyms. To replace a word by its most pertinent hyperonym, we have to identify its most significant hyperonymic chains.

The wn-similarity project [24] has released synset frequency calculations based on various corpora. Let lf(s) denote the logarithmic frequency of synset s in the BNC English language corpus [2], and let us arbitrarily add infinitesimally small values to the frequencies so that they become unique (s ≠ s′ ⇒ lf(s) ≠ lf(s′)). We use frequency as the criterion for selecting a single hyperonymic chain to represent a given synset, and hence define the most significant hyperonymic chain MSCH(s) as the hyperonymic chain (s_i)_{i≥0} of s such that

    s_i = argmax_{s ∈ h(s_{i−1})} lf(s), for all i ≥ 1.

The chain MSCH(s) is unique thanks to the uniqueness of synset frequencies.
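The chain construction can be sketched with NLTK's WordNet interface (an implementation choice of ours; the paper does not specify one). Here lf is a caller-supplied dict from synset names to BNC log frequencies, standing in for the wn-similarity data, and deterministic tie-breaking on synset names stands in for the paper's infinitesimal perturbation.

    from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet') once

    def msch(synset, lf):
        """Most significant hyperonymic chain: follow hypernym and
        instance-hypernym edges, always picking the most frequent parent."""
        chain = [synset]
        while True:
            parents = synset.hypernyms() + synset.instance_hypernyms()
            if not parents:                    # reached a sink, e.g. {entity}
                return chain
            synset = max(parents, key=lambda s: (lf.get(s.name(), 0.0), s.name()))
            chain.append(synset)

    for s in msch(wn.synset('dog.n.01'), lf={}):
        # proj_1: the first lemma is, by WordNet convention, the most
        # frequent word of the synset
        print(s.name(), '->', s.lemmas()[0].name())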
Our CARs are based on words, not synsets; hence we need to extend MSCHs to words. Let w be a lemmatized word. We denote by Synsets(w) ⊂ 𝒲 the set of synsets containing w. If the cardinal #Synsets(w) > 1, then we apply a standard disambiguation algorithm to find the most appropriate synset s_w for w in the given context. We then take (s_i)_i = MSCH(s_w), and for each synset s_i in this chain we define h_i(w) = proj_1(s_i) (i > 0), that is, the projection of s_i to its first element, which by WordNet convention is the most frequent word of the synset. The function vector h_* : W → W (with h_0 ≡ Id) is called hyperonymization, and h_i(w) is the i-th order hyperonym of w.

2 Operational implementations for document classification

Our text classifier operates by first training on sentences and then classifying documents by aggregating sentence classifications. These two procedures are described in Sections 2.1 and 2.2 respectively. The specific evaluation procedure is presented in Section 2.3.

2.1 Training

The Train algorithm (cf. Alg. 1) takes as input an annotated corpus 𝒞 and values of minimum support σ and minimum confidence κ. It returns a set of CARs together with their confidence values.

Algorithm 1: Training
  Data: An annotated corpus 𝒞, values of minimum support σ and minimum confidence κ
  Result: A set of CARs R = ({R1, ..., RN}, conf), where items(Ri) ⊂ W, class(Ri) ∈ C, and conf(Ri) is the confidence of rule Ri
  Train(𝒞, σ, κ):
    𝒮 := forgetful(𝒞);
    𝒮′ := ∅;
    for S ∈ 𝒮 do
      S′ := Hyperonymize(Prune(Lemmatize(S)));
      class(S′) := class(ι(S));
      𝒮′ := 𝒮′ ∪ {S′};
    end
    R := Apriori(𝒮′, σ, κ);
  end

The first part of the algorithm consists in processing the corpus to obtain efficient and reasonably sized transactions. Three functions are applied to every sentence:

1. Lemmatize is standard lemmatization: let P be the set of POS tags of the TreeTagger system [26] (for example, NP stands for "proper noun, singular", VVD stands for "verb, past tense", etc.), and let W′ be the set of lemmatized forms of W (for example, "say" is the lemmatized form of "said"); we then define a function W → (W ∪ W′) × P which sends a word w to the pair (w′, p), where w′ is the lemmatized form of w (or w itself, if the word is unknown to TreeTagger) and p is its POS tag.

2. Prune is a function which prunes the lemmatized sentence so that only a small number of (lemmatized) words (and POS tags) remains. Several sentence pruning strategies will be proposed and compared (cf. § 3.2 and 3.3).

3. Hyperonymize is a function which takes the words in the pruned itemset and replaces them by members of their most significant hyperonymic chains. Several strategies will also be proposed and compared (cf. § 3.4).

The second part of Alg. 1 uses the apriori algorithm [5] with the given values of minimum support and minimum confidence, and with output restrictions so as to generate only rules with an item c ∈ C in the consequent. It returns a set R of CARs and their confidences. It should be noted that this algorithm operates on individual sentences, thereby ignoring the document level.
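As a rough, self-contained stand-in for the mining step (the authors use Borgelt's apriori implementation [5]), the following toy miner enumerates candidate itemsets up to size 2 directly from the processed sentences and keeps those satisfying σ and κ. Assigning each frequent itemset only its majority class is a simplification of full CAR generation, made to keep the sketch short.

    from collections import Counter
    from itertools import combinations

    def mine_cars(sentences, min_supp, min_conf, max_len=2):
        """sentences: list of (itemset, class) pairs, after Lemmatize,
        Prune and Hyperonymize. Returns {(itemset, class): confidence}."""
        n = len(sentences)
        candidates = set()
        for items, _ in sentences:
            for k in range(1, max_len + 1):
                candidates.update(combinations(sorted(items), k))
        cars = {}
        for cand in candidates:
            cand_set = set(cand)
            classes = Counter(c for items, c in sentences if cand_set <= items)
            cover = sum(classes.values())
            if cover / n <= min_supp:        # itemset not frequent
                continue
            cls, hits = classes.most_common(1)[0]
            if hits / cover >= min_conf:     # rule {i1,...,ik} -> cls
                cars[(cand, cls)] = hits / cover
        return cars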
2.2 Classification

The Classify algorithm (cf. Alg. 2) uses the set of CARs produced by Train to predict the class of a new document D′, and furthermore provides two values measuring the quality of this prediction: variety ν and dispersion δ.

Algorithm 2: Classification
  Data: A set of CARs R, a document D′
  Result: The predicted class predclass(D′), variety ν, dispersion δ
  Classify(R, D′):
    for S ∈ D′ do
      if ∃r ∈ R such that items(r) ⊂ S then
        R_S := argmax_{r ∈ R, items(r) ⊂ S} conf(r);
      end
    end
    predclass(D′) := argmax_{c ∈ C} Σ_{S ∈ D′, class(R_S)=c} conf(R_S);
    ν := #{c ∈ C | ∃S ∈ D′ : class(R_S) = c ∧ conf(R_S) > 0};
    δ := max_{c ∈ C} Σ_{S ∈ D′, class(R_S)=c} conf(R_S) − min_{c ∈ C} Σ_{S ∈ D′, class(R_S)=c} conf(R_S);
  end

The first part of the algorithm takes each sentence S of the document D′ and finds the most confident CAR that can be applied to it (i.e., such that the itemset of the rule is entirely contained in the itemset of the sentence). At this stage we have, for every sentence: a rule, its predicted class and its confidence.

Our basic unit of text in Train is the sentence; therefore the CARs generated by Alg. 1 produce a class for each sentence of D′. An aggregation procedure is thus needed in order to classify the document. This is done by taking, class by class, the sum of the confidences of the rules, and selecting the class with the highest sum.

Although this simple class-weighted sum decision strategy is reasonable, it is not perfect and may lead to wrong classifications. The strategy will be optimally sure and robust if (a) the number of classes involved is minimal, and (b) the sums of rule confidences are sufficiently spread apart. The degree of fulfillment of these two conditions is given by the parameters variety ν (the number of classes for which we have rules) and dispersion δ (the gap between the most confident class and the least confident one). These parameters will contribute to the comparison of the different approaches we will investigate.

2.3 Evaluation

We evaluate the classifier (Alg. 3) by using 10-fold cross-validation to obtain average values of recall, precision, F-measure, variety and dispersion. This is done by algorithm SingleEvaluate, once we specify values of minimal support and minimal confidence.

Algorithm 3: Evaluation
  Data: An annotated corpus 𝒞, initial values of minimal support σ0 and confidence κ0, standard number of rules ρ0
  Result: Values of average precision P̄, recall R̄ and F-measure F̄; values of average number of rules ρ, variety ν and dispersion δ
  SingleEvaluate(𝒞, σ, κ):
    (𝒞1, ..., 𝒞10) := Partition(Shuffle(𝒞), 10);   /* tenfold cross-validation */
    for I ∈ {1, ..., 10} do
      (R_I, ν_I, δ_I) := Train(𝒞 \ 𝒞_I, σ, κ);
      for D ∈ 𝒞_I do predclass(D) := Classify(R_I, D); end
      for c ∈ C do
        R_I(c) := #{d ∈ 𝒞_I | (predclass(d) = c) ∧ (class(d) = c)} / #{d ∈ 𝒞_I | class(d) = c};
        P_I(c) := #{d ∈ 𝒞_I | (predclass(d) = c) ∧ (class(d) = c)} / #{d ∈ 𝒞_I | predclass(d) = c};
        F_I(c) := 2 R_I(c) P_I(c) / (R_I(c) + P_I(c));
      end
    end
    for c ∈ C do (R(c), P(c), F(c)) := (1/10) Σ_{I=1}^{10} (R_I(c), P_I(c), F_I(c)); end
    (ρ, ν, δ) := (1/10) Σ_{I=1}^{10} (#R_I, ν_I, δ_I);
    (R̄, P̄, F̄) := (1/#C) Σ_{c ∈ C} (R(c), P(c), F(c));
  end
  Evaluate(𝒞, σ0, κ0, ρ0):
    (σ, κ) := FindOptimal(𝒞, σ0, κ0, ρ0);
    (R̄, P̄, F̄, ρ, ν, δ) := SingleEvaluate(𝒞, σ, κ);
  end

Comparing rule-based classification methods is problematic, because one can always increase F-measure performance by increasing the number of rules, which results in overfitting. To avoid this phenomenon and compare methods in a fair way, we fix a number of rules ρ0 (we have chosen ρ0 = 1,000 in order to produce a humanly reasonably readable set of rules) and find the values of minimal support and confidence for which the F-measure is maximal under this constraint. The function FindOptimal launches SingleEvaluate as many times as necessary on a dynamic grid of values (σ, κ) (starting from the initial values (σ0, κ0)), so that, at the end, the number of rules produced by Train is as close as possible to ρ0 (we have used #R ∈ [ρ0 − 2, ρ0 + 2]) and F̄ is maximal.
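To close this section, here is a sketch of Alg. 2's aggregation step, using the same illustrative data structures as in the training sketch above (cars maps (itemset, class) pairs to confidences). Classes with no applicable rule are simply absent from the vote table; this matches the description of variety, but treating them instead as zero-confidence entries would change dispersion, so that choice is a stated assumption of ours.

    from collections import defaultdict

    def classify(cars, document):
        """document: list of sentence itemsets.
        Returns (predicted class, variety, dispersion)."""
        votes = defaultdict(float)             # class -> sum of best-rule confidences
        for sentence in document:
            best = None                        # most confident applicable rule
            for (items, cls), conf in cars.items():
                if set(items) <= sentence and (best is None or conf > best[1]):
                    best = (cls, conf)
            if best is not None:
                votes[best[0]] += best[1]
        if not votes:
            return None, 0, 0.0
        predicted = max(votes, key=votes.get)  # class-weighted confidence sum
        variety = len(votes)                   # classes for which rules fired
        dispersion = max(votes.values()) - min(votes.values())
        return predicted, variety, dispersion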
3 Experimental results on the Reuters corpus

In this section we investigate three methods: (a) pruning through a purely frequentist method, based on the tfidf measure (§ 3.2); (b) pruning using dependencies (§ 3.3); (c) pruning using dependencies followed by hyperonymic extension (§ 3.4).

3.1 Preliminaries

From the Reuters corpus [1] we have chosen the 7 most popular topics (GSPO = sports, E12 = monetary/economic, GPOL = domestic politics, GVIO = war, civil war, GDIP = international relations, GCRIM = crime, law enforcement, GJOB = labor issues) and extracted the 1,000 longest texts of each. The experimental document set is thus a corpus of 7,000 texts of lengths between 120 and 3,961 words (mean 398.84, standard deviation 169.05). The texts have been analyzed with the Stanford Dependency Parser [17] in collapsed mode with propagation of conjunct dependencies.

3.2 Tfidf-based corpus pruning

Tfidf-based corpus pruning consists in using a classical Prune function, as defined in Alg. 4. It will be our baseline for measuring the performance of the dependency- and hyperonymy-based methods.

Algorithm 4: Tfidf-based corpus pruning
  Data: An annotated corpus (considered as a set of sentences) 𝒮
  Result: The pruned corpus 𝒮′
  Prune(𝒮, N):
    𝒮′ := ∅;
    for S ∈ 𝒮 do
      for w ∈ S do
        Tfidf_S(w) := freq_S(w) · log( #{S ∈ 𝒮} / #{S ∈ 𝒮 | w ∈ S} );
      end
      S′ := ∅; S0 := S;
      for i ∈ {1, ..., N} do
        w0 := argmax_{w ∈ S0} Tfidf_S(w);
        S0 := S0 \ {w0};
        S′ := S′ ∪ {w0};
      end
      𝒮′ := 𝒮′ ∪ {S′};
    end
  end

Note that this definition of the tfidf measure diverges from the legacy one in that we consider sentences, not documents, as the basic text units. This is because we compare tfidf-generated CARs to those using syntactic information, and syntax is limited to the sentence level. Therefore, in order to obtain a fair comparison, we have limited term frequency to the sentence level, and our "document frequency" is in fact a sentence frequency.

Having calculated Tfidf_S(w) for every w ∈ S ∈ 𝒮, we take from each sentence the N words with the highest tfidf values and use them as transaction items. The performance of this method depends on the value of N. In Fig. 1 the reader can see the values of three quantities as functions of N:

1. F-measure: it increases steadily and reaches a maximum value of 83.99 for N = 10; building transactions of more than 10 words (in decreasing tfidf order) deteriorates performance in terms of F-measure;

2. variety: the number of predicted classes for sentences of the same document progressively increases but globally remains relatively low, around 3.1, except for N = 12 and N = 13, where it reaches 4.17;

3. dispersion: it increases steadily, with again a small outlier at N = 12, probably due to the higher variety obtained for that value of N.

Fig. 1. F-measure, variety and dispersion of tfidf-based pruning methods as a function of the number of words kept in the corpus (plots omitted; three panels, x-axis N = tfidf words per transaction).
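A sketch of the sentence-level tfidf pruning of Alg. 4, with sentences given as lists of lemmatized words (names are illustrative, not the authors' code):

    import math
    from collections import Counter

    def tfidf_prune(sentences, N):
        """Keep, in each sentence, the N words with the highest sentence-level
        tfidf (both tf and 'document' frequency are computed over sentences)."""
        m = len(sentences)
        sent_freq = Counter(w for S in sentences for w in set(S))
        pruned = []
        for S in sentences:
            tf = Counter(S)
            score = {w: tf[w] * math.log(m / sent_freq[w]) for w in tf}
            pruned.append(set(sorted(score, key=score.get, reverse=True)[:N]))
        return pruned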
Furthermore, each investigated method will generate transactions of various sizes. It is only fair to compare each of them with a tfidf-based method of similar transaction size. Therefore we will use the results displayed in Fig. 1 to compare the performance of each subsequent method with that of the tfidf-based method of similar transaction size.

Table 1 presents the results obtained by applying the tfidf-based pruning method with a single word per transaction (N = 1).

Table 1. Tfidf-based pruning, keeping a single word per transaction
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      69.30  44.48  55.44  45.75  52.54  82.90  67.98  59.77
Precision   70.09  77.81  71.25  79.76  71.62  80.78  73.35  74.95
F-measure   69.69  56.60  62.36  58.15  60.61  81.83  70.56  65.69
MinSupp=0.006, MinConf=67.6, Var.=1.36, Disp.=21.53, AvgTransSize=1.00

3.3 Methods based on dependencies

In this section we investigate several strategies using the dependency structure of sentences. Our general approach (cf. Alg. 5) keeps only the words of S that fulfill a given morphosyntactic constraint φ. The following strategies correspond to various definitions of φ.

Algorithm 5: Dependency-based corpus pruning
  Data: An annotated corpus 𝒮 and a morphosyntactic constraint φ : S → {true, false}
  Result: The pruned corpus 𝒮′
  Prune(𝒮, φ):
    𝒮′ := ∅;
    for S ∈ 𝒮 do
      S′ := ∅;
      for w ∈ S do
        if φ(w) = true then S′ := S′ ∪ {w}; end
      end
      𝒮′ := 𝒮′ ∪ {S′};
    end
  end

Strategy I0. Our first strategy is to keep only the head of each sentence (which, incidentally, is a verb in 85.37% of the sentences of our corpus). This corresponds to the constraint φ(w) ≡ (w = root(S)). Results are given in Table 2a. Although the recall of GSPO is quite high (a possible interpretation could be that sports use very specific verbs), the F-measure is quite low compared to that of the tfidf-based method of the same average itemset length, namely 65.69%.

Table 2a. Strategy I0: pruning by keeping only heads of sentences
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      38.54  57.46  26.88  17.43  31.06  88.49  51.90  44.54
Precision   49.13  66.18  49.61  65.73  39.07  42.18  60.31  53.17
F-measure   43.19  61.51  34.87  27.55  34.61  57.13  55.79  44.95
MinSupp=0.004, MinConf=36.6, Var.=2.14, Disp.=47.84, AvgTransSize=1.00
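Strategies of this kind reduce to supplying different constraints φ to Alg. 5. The following sketch assumes each parsed sentence comes with its head and its dependency triples (dependent, governor, tag), as produced by a dependency parser such as the one used in § 3.1; it implements the I0 constraint above and the nsubj constraint of strategy I1 presented next.

    def prune_head(words, head, deps):
        """Strategy I0: phi(w) iff w = root(S)."""
        return {head}

    def prune_nsubj(words, head, deps):
        """Strategy I1: phi(w) iff (w, root(S), nsubj) is in Dep(S)."""
        return {w1 for (w1, w2, tag) in deps if w2 == head and tag == 'nsubj'}

    words = {'John', 'give', 'Mary', 'an', 'apple'}
    deps = [('John', 'give', 'nsubj'), ('Mary', 'give', 'iobj'),
            ('apple', 'give', 'dobj'), ('an', 'apple', 'det')]
    print(prune_nsubj(words, 'give', deps))   # {'John'}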
Strategy I1. The second strategy consists in keeping the words connected to the head by a (single) dependency of type nsubj (= nominal subject). This occurs in 79.84% of the sentences of our corpus. The constraint is then φ(w) ≡ (∃(w, root(S), nsubj) ∈ Dep(S)). Results are given in Table 2b. Note that the average transaction size slightly above 1 is probably due to the rare cases where more than one nsubj dependency points to the head.

Table 2b. Strategy I1: pruning by keeping only nsubj → head dependencies
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      70.55  60.44  66.97  58.27  63.22  78.76  69.92  66.88
Precision   73.05  80.84  72.87  86.98  71.71  85.29  76.29  78.15
F-measure   71.78  69.17  69.80  69.79  67.19  81.89  72.97  71.80
MinSupp=0.007, MinConf=60.4, Var.=1.43, Disp.=41.71, AvgTransSize=1.04

The scores rise dramatically when compared to those of the strategy based only on the head of the sentence. The average F-measure (71.80%) is significantly higher than the tfidf-based performance for the same average transaction size (65.69%). This shows that using a dependency property to select a word is a better choice than the frequentist tfidf-based selection. Note that the case of nsubj is unique: if we take ccomp (= clausal complement) instead of nsubj, the performance falls even below the level of strategy I0 (Table 2c).

Table 2c. Strategy I′1: pruning by keeping only ccomp → head dependencies
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      57.33  33.82  25.31  16.96  21.74  47.62  59.60  37.48
Precision   37.83  47.55  38.50  42.06  34.62  57.05  54.17  44.54
F-measure   45.59  39.53  30.54  24.17  26.71  51.91  56.75  39.31
MinSupp=0.008, MinConf=34.4, Var.=1.97, Disp.=19.59, AvgTransSize=1.15

Strategy I2. The third strategy considers all nouns (POS tags starting with N) at distance 1 from the head in the dependency graph. Such dependencies occur in 59.24% of the sentences of our corpus. This corresponds to φ(x) ≡ (∃(x, root(S), d) ∈ Dep(S)) ∧ (POS(x) = N*). Results are given in Table 2d.

Table 2d. Strategy I2: pruning by keeping only nouns at distance 1 from the head
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      80.75  75.92  73.24  68.59  70.59  95.55  77.96  77.51
Precision   73.21  83.51  75.35  89.86  73.67  80.52  77.36  79.07
F-measure   76.80  79.53  74.28  77.80  72.09  87.39  77.66  77.94
MinSupp=0.016, MinConf=51.6, Var.=2.43, Disp.=244.82, AvgTransSize=2.70

The result seems better than that of strategy I1 (Table 2b). However, if we take transaction size into account, it is in fact merely equivalent to (and hence not better than, as was the case for I1) the tfidf-based method with the same transaction size. Once again we see a very high recall rate for the sports category.

One could be tempted to check the performance of taking verbs (instead of nouns) at distance 1 from the head. Indeed, verbs in that position are more frequent than nouns: they occur in 62.94% of the sentences of our corpus. Nevertheless, the results are not as good (Table 2e). This shows that, despite their high frequency, verbs carry less pertinent information than nouns at the same distance from the head.

Table 2e. Strategy I′2: pruning by keeping only verbs at distance 1 from the head
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      54.32  62.50  44.37  20.41  27.58  91.39  67.68  52.61
Precision   49.58  65.98  48.44  78.39  46.69  43.57  63.84  56.64
F-measure   51.84  64.19  46.32  32.39  34.68  59.01  65.70  50.59
MinSupp=0.019, MinConf=30, Var.=4.00, Disp.=175.41, AvgTransSize=2.01

3.4 Methods based on dependencies and hyperonyms

In this section we add semantic information by means of hyperonyms, using the hyperonymization function h (§ 1.3). The preprocessing is done by Alg. 6: h_N(w) is the N-th order hyperonym of w, if it exists in WordNet; in case there is no N-th order hyperonym, the word remains unchanged. We call N the hyperonymic factor of our itemset transformation.

Algorithm 6: Corpus hyperonymization
  Data: A dependency-pruned corpus 𝒮′, a hyperonymic function MSCH : W → W^ℕ, the hyperonymic order N
  Result: The hyperonymically extended corpus 𝒮″
  Hyperonymize(𝒮′, MSCH, N):
    𝒮″ := ∅;
    for S′ ∈ 𝒮′ do
      S″ := ∅;
      for w ∈ S′ do
        if proj_N(MSCH(w)) ≠ ∅ then
          S″ := S″ ∪ {proj_N(MSCH(w))};
        else
          S″ := S″ ∪ {w};
        end
      end
      𝒮″ := 𝒮″ ∪ {S″};
    end
  end
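A sketch of Alg. 6 on top of NLTK's WordNet, with the disambiguation step of § 1.3 elided (we simply take the first listed synset, an assumption made for brevity) and the chain function inlined from the § 1.3 sketch:

    from nltk.corpus import wordnet as wn

    def msch(s, lf):
        """As in § 1.3: follow the most frequent hyperonym at each step."""
        chain = [s]
        while s.hypernyms() + s.instance_hypernyms():
            s = max(s.hypernyms() + s.instance_hypernyms(),
                    key=lambda t: (lf.get(t.name(), 0.0), t.name()))
            chain.append(s)
        return chain

    def hyperonymize(itemset, N, lf):
        out = set()
        for w in itemset:
            synsets = wn.synsets(w)            # no disambiguation: first synset
            chain = msch(synsets[0], lf) if synsets else []
            if len(chain) > N:
                out.add(chain[N].lemmas()[0].name())  # N-th order hyperonym
            else:
                out.add(w)                     # none left: keep the word itself
        return out

    print(hyperonymize({'poodle', 'strike'}, 1, lf={}))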
Strategy II1. This strategy considers hyperonymic factor N = 1: we first apply strategy I1 and then hyperonymization h_1. Results are presented in Table 3a. The performance is globally inferior to that of strategy I1 (where the F-measure attained 71.80%).

Table 3a. Strategy II1: I1 followed by first-order hyperonymization
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      72.39  56.04  71.32  48.96  59.02  82.20  70.42  65.76
Precision   66.21  75.42  64.33  82.21  67.71  73.20  70.73  71.40
F-measure   69.16  64.30  67.64  61.37  63.07  77.44  70.57  67.65
MinSupp=0.010, MinConf=44.8, Var.=1.92, Disp.=78.47, AvgTransSize=1.04

It is interesting to note that the recall of class GJOB has decreased significantly (48.96% vs. 58.27%): in other words, using hyperonyms when dealing with labor issues results in a failure to recognize 9.31% of the documents as belonging to the domain; one could say that terms used in GJOB lose their "labor specificity" already at first-order hyperonymization. On the other hand, the (already high in I1) recall of GSPO has increased even further compared to I1 (from 78.76% to 82.20%): it seems that sports terminology remains in the domain even after hyperonymization, and replacing specific terms by more general ones has increased their frequency as items and hence improved recall. We observe the same phenomenon with the recall of GDIP (which increased from 66.97% to 71.32%), and also, slightly, with the recalls of E12 and GVIO.

Strategy II2. This strategy is similar to strategy II1 but uses hyperonymic factor N = 2. Results are presented in Table 3b.

Table 3b. Strategy II2: I1 followed by second-order hyperonymization
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      69.02  52.78  71.97  47.77  54.24  80.80  65.67  63.18
Precision   64.94  74.57  61.23  80.64  65.81  69.96  72.33  69.93
F-measure   66.92  61.81  66.16  60.00  59.47  74.99  68.84  65.46
MinSupp=0.008, MinConf=44.8, Var.=1.85, Disp.=70.51, AvgTransSize=1.04

The performance is globally inferior to that of II1 (where we used first-order hyperonyms), with two minor exceptions: the recall of GDIP, which increased by 0.65%, and the precision of GVIO, which increased by 1.6%. What is noteworthy, however, is the fact that the recalls of GDIP and GSPO are still higher than those of strategy I1 (no hyperonyms).

To better understand the behavior of the system when climbing the hyperonymic chain by replacing words with hyperonyms of increasingly higher order (and returning to the original word when there are no hyperonyms left), we calculated the performance for N-th order hyperonyms for 1 ≤ N ≤ 12. Note that when N > 12 the number of remaining hyperonyms is negligible and the strategy becomes similar to strategy I1 (no hyperonyms). In Fig. 2 the reader can see the evolution of recall (black), precision (red) and F-measure (blue), first averaged over all classes, and then specifically for GSPO and for GDIP. Dashed lines represent the recall, precision and F-measure of strategy I1.

In the average case, the effect of hyperonymization of orders 1–4 is to decrease performance. After N = 5, the global number of hyperonyms available in WordNet rapidly decreases, so that the situation gradually returns to that of I1 (no hyperonyms), and we see the curves asymptotically converging to the I1 lines from underneath. Not so for GSPO, whose recall curve lies above the I1 value for most N (N = 1, 2, 6–8 and 10–12). The phenomenon is even better illustrated by GDIP: as the reader can see in the figure, the complete GDIP recall curve is located above the I1 one.
Fig. 2. Recall, precision and F-measure for hyperonymization of orders 1 ≤ N ≤ 12: the average case, class GSPO and class GDIP (plots omitted; x-axis N = hyperonym degree).

It seems that in these two cases (GDIP and, to a lesser extent, GSPO), hyperonyms of all orders have a positive impact on the classifier. Unfortunately this impact only concerns recall and is compensated by bad precision, so that the F-measure remains inferior to the I1 case.

4 Conclusion and future work

In this paper we have investigated the use of association rules for text classification by applying two new techniques: (a) we reduce the number of word features through the use of morphosyntactic criteria in the framework of dependency syntax; for that, we keep words dependent on the head through specific dependencies and/or having specific POS tags; (b) we replace words by their hyperonyms of different orders, which we have calculated out of WordNet using frequencies and, in some cases, disambiguation.

We have obtained positive results for case (a), in particular when we compare dependency-based single-item rules with tfidf-based ones. In case (b) the results we share in this paper are less efficient but still interesting; in particular, we found classes for which hyperonymization significantly improves recall.

This work opens several perspectives, among which:

— examine why these particular classes are favorable to hyperonymization, and whether this is related to the structure of WordNet or to linguistic properties of the domain;

— explore partial hyperonymization, i.e., is it possible to hyperonymize only specific items according to the needs of the classifier?¹ How do we choose, at the word level, whether we should rather keep the original word (to increase precision) or switch to some hyperonym (to increase recall)?

— we have used only recall and precision as quality measures of our rules, and our evaluation is strongly dependent on these measures, since the selection of the 1,000 rules we keep is entirely based upon them. Other quality measures are available: how do they apply, and how can they be compared and combined? How robust are the results?

— and finally: how can we optimize the distinctive feature of association rules, namely the fact of being intelligible to the user? How can the user's experience (and linguistic knowledge) be incorporated in the enhancement of rules, so as to obtain the best possible result from his/her point of view?

¹ Indeed, by hyperonymizing all words one wins on one side and loses on the other: for example, "dalmatian" and "poodle" will both be replaced by "dog", but "dog" occurrences will be replaced by "canid". It would be preferable to keep the word "dog" in the second case, so that we have a real increase in frequency.

References

1. Reuters corpus, volume 1, English language, 1996-08-20 to 1997-08-19, http://about.reuters.com/researchandstandards/corpus/statistics/index.asp
2. British National Corpus (1994), http://www.natcorp.ox.ac.uk
3. Aggarwal, C.C., Zhai, C.: A survey of text classification algorithms. In: Mining Text Data, chap. 6, pp. 163–222. Springer
4. Ahonen, H., Heinonen, O., Klemettinen, M., Verkamo, A.I.: Applying data mining techniques in text analysis. TR C-1997-23, Department of Computer Science, University of Helsinki (1997)
5. Borgelt, C.: Efficient implementations of apriori and eclat. In: Workshop on Frequent Itemset Mining Implementations (FIMI 2003), Melbourne, FL, USA (2003)
6. Ferrer i Cancho, R., Solé, R.V., Köhler, R.: Patterns in syntactic dependency networks. Physical Review E 69, 1–8 (2004)
7. Cherfi, H., Napoli, A., Toussaint, Y.: A conformity measure using background knowledge for association rules: application to text mining, pp. 100–115. IGI Global (2009)
8. Chomsky, N.: Syntactic Structures. Mouton (1957)
9. Cohen, W.W.: Learning rules that classify e-mail. In: AAAI Spring Symposium on ML and IR (1996)
10. Curran, J.R., Moens, M.: Scaling context space. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 231–238 (2002)
11. Do, T.N.Q.: A graph model for text analysis and text mining. Master's thesis, Université de Lorraine (2012)
12. Jaillet, S., Laurent, A., Teisseire, M.: Sequential patterns for text categorization. Intell. Data Anal. 10(3), 199–214 (2006)
13. Kovacs, L., Baksa-Varga, E.: Dependency-based mapping between symbolic language and Extended Conceptual Graph. In: 6th International Symposium on Intelligent Systems and Informatics, pp. 1–6 (2008)
14. Lang, K.: NewsWeeder: learning to filter netnews. In: International Conference on Machine Learning, pp. 331–339. Morgan Kaufmann (1995)
15. Liu, B., Hsu, W., Ma, Y.: Integrating classification and association rule mining. In: Proc. of the Int. Conf. on Knowledge Discovery and Data Mining (New York), pp. 80–86 (1998)
16. Lowe, W.: Towards a theory of semantic space. In: Proceedings of the Twenty-Third Annual Conference of the Cognitive Science Society, pp. 576–581 (2001)
17. de Marneffe, M.C., MacCartney, B., Manning, C.D.: Generating typed dependency parses from phrase structure parses. In: LREC 2006, pp. 449–454 (2006)
18. Sahami, M., Dumais, S., Heckerman, D., Horvitz, E.: A Bayesian approach to filtering junk e-mail. In: AAAI Workshop on Learning for Text Categorization. AAAI Technical Report WS-98-05 (1998)
19. Mel'čuk, I.A.: Dependency Syntax: Theory and Practice. State University of New York Press, Albany (1987)
20. Miller, G.A.: WordNet: a lexical database for English. Communications of the ACM 38(11), 39–41 (1995)
21. Nivre, J.: Dependency grammar and dependency parsing. MSI report 05133, School of Mathematics and Systems Engineering, Växjö University (2005)
22. Ordoñez-Salinas, S., Gelbukh, A.: Information retrieval with a simplified conceptual graph-like representation. Lecture Notes in Computer Science, vol. 6437, chap. 9, pp. 92–104 (2010)
23. Padó, S., Lapata, M.: Dependency-based construction of semantic space models. Computational Linguistics 33(2), 161–199 (2007)
24. Pedersen, T.: Information content measures of semantic similarity perform better without sense-tagged text. In: Proceedings of the 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), pp. 329–332 (2010)
25. Roche, M., Azé, J., Matte-Tailliez, O., Kodratoff, Y.: Mining texts by association rules discovery in a technical corpus. In: Intelligent Information Processing and Web Mining, pp. 89–98. Advances in Soft Computing, Springer (2004)
26. Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK (1994)
27. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002)
28. Tesnière, L.: Éléments de syntaxe structurale. Klincksieck (1959)
29. Zaïane, O.R., Antonie, M.L.: Classifying text documents by associating terms with text categories. In: Australasian Database Conference. CRPIT, vol. 5. Australian Computer Society (2002)