Text Classification Using Association Rules, Dependency Pruning and Hyperonymization

Yannis Haralambous and Philippe Lenca

Institut Mines Telecom, Telecom Bretagne, UMR CNRS 6285 Lab-STICC
Technopôle Brest Iroise CS 83818, 29238 Brest Cedex 3, France
.@telecom-bretagne.eu

Abstract. We present new methods for pruning and enhancing itemsets for text classification via association rule mining. The pruning methods are based on dependency syntax, and the enhancing methods are based on replacing words by their hyperonyms of various orders. We discuss the impact of these methods compared to pruning based on the tfidf rank of words.

Introduction

Automatic text classification is an important text mining task, due to the huge number of text documents that we have to manage daily. Text classification has a wide variety of applications, such as Web document and email classification. Indeed, most Web news services provide a large number of articles every day, making it impossible to organize them manually [14]. Automatic subject classification [9] and SPAM filtering [18] are two further examples of the interest of automatic text classification.

Automatic text classification can be defined as follows: given a set of documents such that each document is labeled with a class value, learn a model that assigns a document with unknown class to one or more particular classes. This can also be done by assigning a probability value to each class or by ranking the classes.

A wide variety of classical machine learning techniques have been used for text classification. Indeed, texts may be represented by word frequency vectors, and thus most quantitative data methods can be applied directly to the notorious "bag-of-words" model (cf. [27,3]).

Choosing a classifier is a multicriteria problem. In particular, one often has to make a trade-off between accuracy and comprehensibility. In this paper we are interested in both criteria, with a deeper interest in comprehensibility. We are thus interested in rule-based approaches, and especially in class association rule algorithms. Several studies have already successfully considered association rule-based approaches in text mining (e.g., [4], [29], [7], [25]). This framework is suitable for handling some statistical characteristics (e.g., high dimensionality, sparsity, ...) of the bag-of-words model, where a document is represented as a set of words with their associated frequency in the document.

However, a text is more than a set of words and their frequencies. Enhancing the bag-of-words approach with linguistic features has also attracted several works (e.g., [12,11,13,22], [23,16,10], [21,6]).

We here propose a class association rule-based approach enriched by linguistic knowledge. The paper is organized as follows: after introducing the techniques we are going to use (class association rules § 1.1, dependencies § 1.2, hyperonymization § 1.3), we describe our main algorithms (for training § 2.1, classifying § 2.2 and evaluating § 2.3); then follows the experimental section, where we give results obtained by tfidf pruning (§ 3.2), dependency-based pruning (§ 3.3) and hyperonymization (§ 3.4); finally, we end with a conclusion and perspectives for future work (§ 4).
1 Proposed model for text classification

Let a corpus be a set 𝒞 = {D1, ..., Dn} of documents and let C be a set of classes. An annotated corpus is a pair (𝒞, class), where class : 𝒞 → C is a function that maps each document Di to a (predefined) class of C.

A document D ∈ 𝒞 is a set of sentences S. The corpus 𝒞 can be considered as a set of sentences 𝒮 = {S1, ..., Sm} if we go through the forgetful functor (which forgets the document to which each sentence belongs). Repeated sentences in the same document, or identical sentences in different documents, are considered distinct, i.e., there is a function ι : 𝒮 → 𝒞 which restores the forgotten information. We extend the class function to 𝒮 by class(S) := class(ι(S)).

A sentence S is a sequence of words w (sometimes we will consider S simply as a set, without changing the notation). Let W = ⋃_{S∈𝒮} ⋃_{w∈S} {w} be the set of all words of 𝒞.

1.1 Class association rules and text classification

Let I be a set of objects called items and C a set of classes. A transaction T is a pair ({i1, ..., in}, c), where {i1, ..., in} ⊆ I and c ∈ C. We denote by 𝒯 the set of transactions, by items(T) the set of items (or "itemset") of T, and by class(T) the class of T.

Let I be an itemset. The support of I is defined by

    supp(I) := #{T ∈ 𝒯 | I ⊆ items(T)} / #𝒯.

Let σ ∈ [0,1] be a value called minimum support. An itemset I is called frequent if its support exceeds σ. The confidence of a transaction t is defined as

    conf(t) := #{T ∈ 𝒯 | items(t) ⊆ items(T) ∧ class(t) = class(T)} / #{T ∈ 𝒯 | items(t) ⊆ items(T)}.

Let κ ∈ [0,1] be a value called minimum confidence. A class association rule (or "CAR") r = ({i1, ..., in}, c) [15] is a transaction with a frequent itemset and a confidence exceeding κ.

To classify text with CARs, we consider words as items, documents as itemsets, and pairs of documents and classes as transactions. The advantage of this technique is that CARs can be easily understood, and hence potentially improved, by the user, especially if the classifier is tuned so that it produces a humanly reasonable number of rules. Once the classifier is trained, to classify a new sentence we first find all CARs whose items are contained in the sentence, and then use an aggregation technique to choose a predominant class among those of the CARs we found.

An important issue with CARs is that the complexity is exponential with respect to itemset size, and hence we need to keep that size bounded in a specific range, independently of the size of the documents to classify. Using entire documents as transactions is computationally out of reach; therefore pruning techniques play an important rôle. Our approach consists in (a) restricting CARs to the sentence level, (b) pruning sentences by using morphosyntactic information (cf. § 1.2), and (c) modifying itemsets by using semantic information (cf. § 1.3).
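To make these definitions concrete, here is a minimal Python sketch of the support and confidence computations over a toy set of transactions. The function names and data are ours, for illustration only; this is not the authors' implementation.

    def support(itemset, transactions):
        """supp(I) = #{T : I subset of items(T)} / #T."""
        return sum(1 for items, _ in transactions if itemset <= items) / len(transactions)

    def confidence(itemset, cls, transactions):
        """conf((I, c)) = #{T : I covered and class(T) = c} / #{T : I covered}."""
        covering = [c for items, c in transactions if itemset <= items]
        return covering.count(cls) / len(covering) if covering else 0.0

    transactions = [
        ({"match", "goal", "team"}, "GSPO"),
        ({"strike", "union", "wage"}, "GJOB"),
        ({"goal", "team", "coach"}, "GSPO"),
    ]
    print(support({"goal", "team"}, transactions))             # 0.666...
    print(confidence({"goal", "team"}, "GSPO", transactions))  # 1.0

On this toy data, the itemset {goal, team} is covered by two of the three transactions (support 2/3), and both covering transactions have class GSPO, so the CAR ({goal, team}, GSPO) has confidence 1.0.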
1.2 Itemset pruning using dependencies

One can prune sentences either by using word frequencies (cf. § 3.2) or by using information obtained by morphosyntactic parsing (cf. § 3.3). In this paper we introduce the latter approach, in the framework of dependency grammar. Dependency grammar [28,19] is a syntactic theory, an alternative to the phrase-structure analysis [8] traditionally taught in primary and secondary education.

In phrase-structure syntax, trees are built by grouping words into "phrases" (with the use of intermediate nodes NP, VP, etc.), so that the root of the tree represents the entire sentence and its leaves are the actual words. In dependency grammar, trees are built using solely words as nodes (without introducing any additional "abstract" nodes). A single word in every sentence becomes the root (or head) of the tree. An oriented edge between two words is a dependency and is tagged by a representation of some (syntactic, morphological, semantic, prosodic, etc.) relation between the words. For example, in the sentence "John gives Mary an apple," the word "gives" is the head of the sentence and we have the following four dependencies: nsubj(gives, John), iobj(gives, Mary), dobj(gives, apple) and det(apple, an), where the tags nsubj, dobj, iobj and det denote "noun subject," "direct object," "indirect object" and "determinant."

Let S be a sentence and D be the set of dependency tags: {nsubj, ccomp, prep, dobj, ...}. A dependency is a triple (w1, w2, d) where w1, w2 ∈ S and d ∈ D. Let Dep(S) denote the set of dependencies of S and root(S) the head of S.

Pruning will consist in defining a morphosyntactic constraint, i.e., a condition on the dependencies (and POS tags) of words, the fulfillment of which is necessary for a word to be included in the itemset. But before describing pruning algorithms and strategies, let us first present a second technique used for optimizing itemsets; this time we use semantic information. We propose to replace words by their hyperonyms, expecting that the frequencies of the latter in the itemsets will be higher than those of the former, and hence will improve the classification process.

1.3 Hyperonymization

The WordNet lexical database [20] contains sets of words sharing a common meaning, called synsets, as well as semantic relations between synsets, which we will use to fulfill our goal. More specifically, we will use the relations of hyperonymy and hyperonymic instance. The graph having synsets as nodes and hyperonymic relations as edges is connected and rooted: starting from an arbitrary synset, one can iterate these two relations until reaching a sink. Note that in the case of nouns the sink will invariably be the synset 00001740 {entity}, while for verbs there are approximately 550 different verb sinks.

Let 𝒲 be the WordNet lexical database, s ∈ 𝒲 a synset and h : 𝒲 → 2^𝒲 the hyperonymic or hyperonymic-instance relation. We define a hyperonymic chain CH(s) as a sequence (s_i)_{i≥0} where s_0 = s and s_i ∈ h(s_{i−1}) for all i ≥ 1. Hyperonymic chains are not unique, since a given synset can have many hyperonyms. To replace a word by its most pertinent hyperonym, we have to identify its most significant hyperonymic chains.

The wn-similarity project [24] has released synset frequency calculations based on various corpora. Let lf(s) denote the logarithmic frequency of synset s in the BNC English language corpus [2], and let us arbitrarily add infinitesimally small values to the frequencies so that they become unique (s ≠ s′ ⇒ lf(s) ≠ lf(s′)). We use frequency as the criterion for selecting a single hyperonymic chain to represent a given synset, and hence define the most significant hyperonymic chain MSCH(s) as the hyperonymic chain (s_i)_{i≥0} of s such that

    s_i = argmax_{s ∈ h(s_{i−1})} lf(s), for all i ≥ 1.

The chain MSCH(s) is unique thanks to the uniqueness of synset frequencies.
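The chain construction can be sketched with NLTK's WordNet interface (an implementation choice of ours; the paper does not specify one). Here lf is a caller-supplied dict from synset names to BNC log frequencies, standing in for the wn-similarity data, and deterministic tie-breaking on synset names stands in for the paper's infinitesimal perturbation.

    from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet') once

    def msch(synset, lf):
        """Most significant hyperonymic chain: follow hypernym and
        instance-hypernym edges, always picking the most frequent parent."""
        chain = [synset]
        while True:
            parents = synset.hypernyms() + synset.instance_hypernyms()
            if not parents:                    # reached a sink, e.g. {entity}
                return chain
            synset = max(parents, key=lambda s: (lf.get(s.name(), 0.0), s.name()))
            chain.append(synset)

    for s in msch(wn.synset('dog.n.01'), lf={}):
        # proj_1: the first lemma is, by WordNet convention, the most
        # frequent word of the synset
        print(s.name(), '->', s.lemmas()[0].name())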
Our CARs are based on words, not synsets; hence we need to extend MSCHs to words. Let w be a lemmatized word. We denote by Synsets(w) ⊂ 𝒲 the set of synsets containing w. If the cardinal #Synsets(w) > 1, then we apply a standard disambiguation algorithm to find the most appropriate synset s_w for w in the given context. We then take (s_i)_i = MSCH(s_w), and for each synset s_i in this chain we define h_i(w) = proj_1(s_i) (i > 0), that is, the projection of s_i to its first element, which by WordNet convention is the most frequent word of the synset. The function vector h_* : W → W (with h_0 ≡ Id) is called hyperonymization, and h_i(w) is the i-th order hyperonym of w.

2 Operational implementations for document classification

Our text classifier operates by first training on sentences and then classifying documents by aggregating sentence classifications. These two procedures are described in Sections 2.1 and 2.2 respectively. The specific evaluation procedure is presented in Section 2.3.

2.1 Training

The Train algorithm (cf. Alg. 1) takes as input an annotated corpus 𝒞 and values of minimum support σ and minimum confidence κ. It returns a set of CARs together with their confidence values.

Algorithm 1: Training
  Data: An annotated corpus 𝒞, values of minimum support σ and minimum confidence κ
  Result: A set of CARs R = ({R1, ..., RN}, conf), where items(Ri) ⊂ W, class(Ri) ∈ C, and conf(Ri) is the confidence of rule Ri
  Train(𝒞, σ, κ):
    𝒮 := forgetful(𝒞);
    𝒮′ := ∅;
    for S ∈ 𝒮 do
      S′ := Hyperonymize(Prune(Lemmatize(S)));
      class(S′) := class(ι(S));
      𝒮′ := 𝒮′ ∪ {S′};
    end
    R := Apriori(𝒮′, σ, κ);
  end

The first part of the algorithm consists in processing the corpus to obtain efficient and reasonably sized transactions. Three functions are applied to every sentence:

1. Lemmatize is standard lemmatization: let P be the set of POS tags of the TreeTagger system [26] (for example, NP stands for "proper noun, singular", VVD stands for "verb, past tense", etc.), and let W′ be the set of lemmatized forms of W (for example, "say" is the lemmatized form of "said"); we then define a function W → (W ∪ W′) × P which sends a word w to the pair (w′, p), where w′ is the lemmatized form of w (or w itself, if the word is unknown to TreeTagger) and p is its POS tag.

2. Prune is a function which prunes the lemmatized sentence so that only a small number of (lemmatized) words (and POS tags) remains. Several sentence pruning strategies will be proposed and compared (cf. § 3.2 and 3.3).

3. Hyperonymize is a function which takes the words in the pruned itemset and replaces them by members of their most significant hyperonymic chains. Several strategies will also be proposed and compared (cf. § 3.4).

The second part of Alg. 1 uses the apriori algorithm [5] with the given values of minimum support and minimum confidence, and with output restrictions so as to generate only rules with an item c ∈ C in the consequent. It returns a set R of CARs and their confidences. It should be noted that this algorithm operates on individual sentences, thereby ignoring the document level.
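As a rough, self-contained stand-in for the mining step (the authors use Borgelt's apriori implementation [5]), the following toy miner enumerates candidate itemsets up to size 2 directly from the processed sentences and keeps those satisfying σ and κ. Assigning each frequent itemset only its majority class is a simplification of full CAR generation, made to keep the sketch short.

    from collections import Counter
    from itertools import combinations

    def mine_cars(sentences, min_supp, min_conf, max_len=2):
        """sentences: list of (itemset, class) pairs, after Lemmatize,
        Prune and Hyperonymize. Returns {(itemset, class): confidence}."""
        n = len(sentences)
        candidates = set()
        for items, _ in sentences:
            for k in range(1, max_len + 1):
                candidates.update(combinations(sorted(items), k))
        cars = {}
        for cand in candidates:
            cand_set = set(cand)
            classes = Counter(c for items, c in sentences if cand_set <= items)
            cover = sum(classes.values())
            if cover / n <= min_supp:        # itemset not frequent
                continue
            cls, hits = classes.most_common(1)[0]
            if hits / cover >= min_conf:     # rule {i1,...,ik} -> cls
                cars[(cand, cls)] = hits / cover
        return cars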
2.2 Classification

The Classify algorithm (cf. Alg. 2) uses the set of CARs produced by Train to predict the class of a new document D′, and furthermore provides two values measuring the quality of this prediction: variety ν and dispersion δ.

Algorithm 2: Classification
  Data: A set of CARs R, a document D′
  Result: The predicted class predclass(D′), variety ν, dispersion δ
  Classify(R, D′):
    for S ∈ D′ do
      if ∃r ∈ R such that items(r) ⊂ S then
        R_S := argmax_{r ∈ R, items(r) ⊂ S} conf(r);
      end
    end
    predclass(D′) := argmax_{c ∈ C} Σ_{S ∈ D′, class(R_S)=c} conf(R_S);
    ν := #{c ∈ C | ∃S ∈ D′ : class(R_S) = c ∧ conf(R_S) > 0};
    δ := max_{c ∈ C} Σ_{S ∈ D′, class(R_S)=c} conf(R_S) − min_{c ∈ C} Σ_{S ∈ D′, class(R_S)=c} conf(R_S);
  end

The first part of the algorithm takes each sentence S of the document D′ and finds the most confident CAR that can be applied to it (i.e., such that the itemset of the rule is entirely contained in the itemset of the sentence). At this stage we have, for every sentence: a rule, its predicted class and its confidence.

Our basic unit of text in Train is the sentence; therefore the CARs generated by Alg. 1 produce a class for each sentence of D′. An aggregation procedure is thus needed in order to classify the document. This is done by taking, class by class, the sum of the confidences of the rules, and selecting the class with the highest sum.

Although this simple class-weighted sum decision strategy is reasonable, it is not perfect and may lead to wrong classifications. The strategy will be optimally sure and robust if (a) the number of classes involved is minimal, and (b) the sums of rule confidences are sufficiently spread apart. The degree of fulfillment of these two conditions is given by the parameters variety ν (the number of classes for which we have rules) and dispersion δ (the gap between the most confident class and the least confident one). These parameters will contribute to the comparison of the different approaches we will investigate.

2.3 Evaluation

We evaluate the classifier (Alg. 3) by using 10-fold cross-validation to obtain average values of recall, precision, F-measure, variety and dispersion. This is done by algorithm SingleEvaluate, once we specify values of minimal support and minimal confidence.

Algorithm 3: Evaluation
  Data: An annotated corpus 𝒞, initial values of minimal support σ0 and confidence κ0, standard number of rules ρ0
  Result: Values of average precision P̄, recall R̄ and F-measure F̄; values of average number of rules ρ, variety ν and dispersion δ
  SingleEvaluate(𝒞, σ, κ):
    (𝒞1, ..., 𝒞10) := Partition(Shuffle(𝒞), 10);   /* tenfold cross-validation */
    for I ∈ {1, ..., 10} do
      (R_I, ν_I, δ_I) := Train(𝒞 \ 𝒞_I, σ, κ);
      for D ∈ 𝒞_I do predclass(D) := Classify(R_I, D); end
      for c ∈ C do
        R_I(c) := #{d ∈ 𝒞_I | (predclass(d) = c) ∧ (class(d) = c)} / #{d ∈ 𝒞_I | class(d) = c};
        P_I(c) := #{d ∈ 𝒞_I | (predclass(d) = c) ∧ (class(d) = c)} / #{d ∈ 𝒞_I | predclass(d) = c};
        F_I(c) := 2 R_I(c) P_I(c) / (R_I(c) + P_I(c));
      end
    end
    for c ∈ C do (R(c), P(c), F(c)) := (1/10) Σ_{I=1}^{10} (R_I(c), P_I(c), F_I(c)); end
    (ρ, ν, δ) := (1/10) Σ_{I=1}^{10} (#R_I, ν_I, δ_I);
    (R̄, P̄, F̄) := (1/#C) Σ_{c ∈ C} (R(c), P(c), F(c));
  end
  Evaluate(𝒞, σ0, κ0, ρ0):
    (σ, κ) := FindOptimal(𝒞, σ0, κ0, ρ0);
    (R̄, P̄, F̄, ρ, ν, δ) := SingleEvaluate(𝒞, σ, κ);
  end

Comparing rule-based classification methods is problematic, because one can always increase F-measure performance by increasing the number of rules, which results in overfitting. To avoid this phenomenon and compare methods in a fair way, we fix a number of rules ρ0 (we have chosen ρ0 = 1,000 in order to produce a humanly reasonably readable set of rules) and find the values of minimal support and confidence for which the F-measure is maximal under this constraint. The function FindOptimal launches SingleEvaluate as many times as necessary on a dynamic grid of values (σ, κ) (starting from the initial values (σ0, κ0)), so that, at the end, the number of rules produced by Train is as close as possible to ρ0 (we have used #R ∈ [ρ0 − 2, ρ0 + 2]) and F̄ is maximal.
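To close this section, here is a sketch of Alg. 2's aggregation step, using the same illustrative data structures as in the training sketch above (cars maps (itemset, class) pairs to confidences). Classes with no applicable rule are simply absent from the vote table; this matches the description of variety, but treating them instead as zero-confidence entries would change dispersion, so that choice is a stated assumption of ours.

    from collections import defaultdict

    def classify(cars, document):
        """document: list of sentence itemsets.
        Returns (predicted class, variety, dispersion)."""
        votes = defaultdict(float)             # class -> sum of best-rule confidences
        for sentence in document:
            best = None                        # most confident applicable rule
            for (items, cls), conf in cars.items():
                if set(items) <= sentence and (best is None or conf > best[1]):
                    best = (cls, conf)
            if best is not None:
                votes[best[0]] += best[1]
        if not votes:
            return None, 0, 0.0
        predicted = max(votes, key=votes.get)  # class-weighted confidence sum
        variety = len(votes)                   # classes for which rules fired
        dispersion = max(votes.values()) - min(votes.values())
        return predicted, variety, dispersion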
3 Experimental results on the Reuters corpus

In this section we investigate three methods: (a) pruning through a purely frequentist method, based on the tfidf measure (§ 3.2); (b) pruning using dependencies (§ 3.3); (c) pruning using dependencies followed by hyperonymic extension (§ 3.4).

3.1 Preliminaries

From the Reuters corpus [1] we have chosen the 7 most popular topics (GSPO = sports, E12 = monetary/economic, GPOL = domestic politics, GVIO = war, civil war, GDIP = international relations, GCRIM = crime, law enforcement, GJOB = labor issues) and extracted the 1,000 longest texts of each. The experimental document set is thus a corpus of 7,000 texts of lengths between 120 and 3,961 words (mean 398.84, standard deviation 169.05). The texts have been analyzed with the Stanford Dependency Parser [17] in collapsed mode with propagation of conjunct dependencies.

3.2 Tfidf-based corpus pruning

Tfidf-based corpus pruning consists in using a classical Prune function, as defined in Alg. 4. It will be our baseline for measuring the performance of the dependency- and hyperonymy-based methods.

Algorithm 4: Tfidf-based corpus pruning
  Data: An annotated corpus (considered as a set of sentences) 𝒮
  Result: The pruned corpus 𝒮′
  Prune(𝒮, N):
    𝒮′ := ∅;
    for S ∈ 𝒮 do
      for w ∈ S do
        Tfidf_S(w) := freq_S(w) · log( #{S ∈ 𝒮} / #{S ∈ 𝒮 | w ∈ S} );
      end
      S′ := ∅; S0 := S;
      for i ∈ {1, ..., N} do
        w0 := argmax_{w ∈ S0} Tfidf_S(w);
        S0 := S0 \ {w0};
        S′ := S′ ∪ {w0};
      end
      𝒮′ := 𝒮′ ∪ {S′};
    end
  end

Note that this definition of the tfidf measure diverges from the legacy one in that we consider sentences, not documents, as the basic text units. This is because we compare tfidf-generated CARs to those using syntactic information, and syntax is limited to the sentence level. Therefore, in order to obtain a fair comparison, we have limited term frequency to the sentence level, and our "document frequency" is in fact a sentence frequency.

Having calculated Tfidf_S(w) for every w ∈ S ∈ 𝒮, we take from each sentence the N words with the highest tfidf values and use them as transaction items. The performance of this method depends on the value of N. In Fig. 1 the reader can see the values of three quantities as functions of N:

1. F-measure: it increases steadily and reaches a maximum value of 83.99 for N = 10; building transactions of more than 10 words (in decreasing tfidf order) deteriorates performance in terms of F-measure;

2. variety: the number of predicted classes for sentences of the same document progressively increases but globally remains relatively low, around 3.1, except for N = 12 and N = 13, where it reaches 4.17;

3. dispersion: it increases steadily, with again a small outlier at N = 12, probably due to the higher variety obtained for that value of N.

Fig. 1. F-measure, variety and dispersion of tfidf-based pruning methods as a function of the number of words kept in the corpus (plots omitted; three panels, x-axis N = tfidf words per transaction).
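A sketch of the sentence-level tfidf pruning of Alg. 4, with sentences given as lists of lemmatized words (names are illustrative, not the authors' code):

    import math
    from collections import Counter

    def tfidf_prune(sentences, N):
        """Keep, in each sentence, the N words with the highest sentence-level
        tfidf (both tf and 'document' frequency are computed over sentences)."""
        m = len(sentences)
        sent_freq = Counter(w for S in sentences for w in set(S))
        pruned = []
        for S in sentences:
            tf = Counter(S)
            score = {w: tf[w] * math.log(m / sent_freq[w]) for w in tf}
            pruned.append(set(sorted(score, key=score.get, reverse=True)[:N]))
        return pruned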
Furthermore, each investigated method will generate transactions of various sizes. It is only fair to compare each of them with a tfidf-based method of similar transaction size. Therefore we will use the results displayed in Fig. 1 to compare the performance of each subsequent method with that of the tfidf-based method of similar transaction size.

Table 1 presents the results obtained by applying the tfidf-based pruning method with a single word per transaction (N = 1).

Table 1. Tfidf-based pruning, keeping a single word per transaction
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      69.30  44.48  55.44  45.75  52.54  82.90  67.98  59.77
Precision   70.09  77.81  71.25  79.76  71.62  80.78  73.35  74.95
F-measure   69.69  56.60  62.36  58.15  60.61  81.83  70.56  65.69
MinSupp=0.006, MinConf=67.6, Var.=1.36, Disp.=21.53, AvgTransSize=1.00

3.3 Methods based on dependencies

In this section we investigate several strategies using the dependency structure of sentences. Our general approach (cf. Alg. 5) keeps only the words of S that fulfill a given morphosyntactic constraint φ. The following strategies correspond to various definitions of φ.

Algorithm 5: Dependency-based corpus pruning
  Data: An annotated corpus 𝒮 and a morphosyntactic constraint φ : S → {true, false}
  Result: The pruned corpus 𝒮′
  Prune(𝒮, φ):
    𝒮′ := ∅;
    for S ∈ 𝒮 do
      S′ := ∅;
      for w ∈ S do
        if φ(w) = true then S′ := S′ ∪ {w}; end
      end
      𝒮′ := 𝒮′ ∪ {S′};
    end
  end

Strategy I0. Our first strategy is to keep only the head of each sentence (which, incidentally, is a verb in 85.37% of the sentences of our corpus). This corresponds to the constraint φ(w) ≡ (w = root(S)). Results are given in Table 2a. Although the recall of GSPO is quite high (a possible interpretation could be that sports use very specific verbs), the F-measure is quite low compared to that of the tfidf-based method of the same average itemset length, namely 65.69%.

Table 2a. Strategy I0: pruning by keeping only heads of sentences
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      38.54  57.46  26.88  17.43  31.06  88.49  51.90  44.54
Precision   49.13  66.18  49.61  65.73  39.07  42.18  60.31  53.17
F-measure   43.19  61.51  34.87  27.55  34.61  57.13  55.79  44.95
MinSupp=0.004, MinConf=36.6, Var.=2.14, Disp.=47.84, AvgTransSize=1.00
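Strategies of this kind reduce to supplying different constraints φ to Alg. 5. The following sketch assumes each parsed sentence comes with its head and its dependency triples (dependent, governor, tag), as produced by a dependency parser such as the one used in § 3.1; it implements the I0 constraint above and the nsubj constraint of strategy I1 presented next.

    def prune_head(words, head, deps):
        """Strategy I0: phi(w) iff w = root(S)."""
        return {head}

    def prune_nsubj(words, head, deps):
        """Strategy I1: phi(w) iff (w, root(S), nsubj) is in Dep(S)."""
        return {w1 for (w1, w2, tag) in deps if w2 == head and tag == 'nsubj'}

    words = {'John', 'give', 'Mary', 'an', 'apple'}
    deps = [('John', 'give', 'nsubj'), ('Mary', 'give', 'iobj'),
            ('apple', 'give', 'dobj'), ('an', 'apple', 'det')]
    print(prune_nsubj(words, 'give', deps))   # {'John'}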
Strategy I1. The second strategy consists in keeping the words connected to the head by a (single) dependency of type nsubj (= nominal subject). This occurs in 79.84% of the sentences of our corpus. The constraint is then φ(w) ≡ (∃(w, root(S), nsubj) ∈ Dep(S)). Results are given in Table 2b. Note that the average transaction size slightly above 1 is probably due to the rare cases where more than one nsubj dependency points to the head.

Table 2b. Strategy I1: pruning by keeping only nsubj → head dependencies
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      70.55  60.44  66.97  58.27  63.22  78.76  69.92  66.88
Precision   73.05  80.84  72.87  86.98  71.71  85.29  76.29  78.15
F-measure   71.78  69.17  69.80  69.79  67.19  81.89  72.97  71.80
MinSupp=0.007, MinConf=60.4, Var.=1.43, Disp.=41.71, AvgTransSize=1.04

The scores rise dramatically when compared to those of the strategy based only on the head of the sentence. The average F-measure (71.80%) is significantly higher than the tfidf-based performance for the same average transaction size (65.69%). This shows that using a dependency property to select a word is a better choice than the frequentist tfidf-based selection. Note that the case of nsubj is unique: if we take ccomp (= clausal complement) instead of nsubj, the performance falls even below the level of strategy I0 (Table 2c).

Table 2c. Strategy I′1: pruning by keeping only ccomp → head dependencies
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      57.33  33.82  25.31  16.96  21.74  47.62  59.60  37.48
Precision   37.83  47.55  38.50  42.06  34.62  57.05  54.17  44.54
F-measure   45.59  39.53  30.54  24.17  26.71  51.91  56.75  39.31
MinSupp=0.008, MinConf=34.4, Var.=1.97, Disp.=19.59, AvgTransSize=1.15

Strategy I2. The third strategy considers all nouns (POS tags starting with N) at distance 1 from the head in the dependency graph. Such dependencies occur in 59.24% of the sentences of our corpus. This corresponds to φ(x) ≡ (∃(x, root(S), d) ∈ Dep(S)) ∧ (POS(x) = N*). Results are given in Table 2d.

Table 2d. Strategy I2: pruning by keeping only nouns at distance 1 from the head
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      80.75  75.92  73.24  68.59  70.59  95.55  77.96  77.51
Precision   73.21  83.51  75.35  89.86  73.67  80.52  77.36  79.07
F-measure   76.80  79.53  74.28  77.80  72.09  87.39  77.66  77.94
MinSupp=0.016, MinConf=51.6, Var.=2.43, Disp.=244.82, AvgTransSize=2.70

The result seems better than that of strategy I1 (Table 2b). However, if we take transaction size into account, it is in fact merely equivalent to (and hence not better than, as was the case for I1) the tfidf-based method with the same transaction size. Once again we see a very high recall rate for the sports category.

One could be tempted to check the performance of taking verbs (instead of nouns) at distance 1 from the head. Indeed, verbs in that position are more frequent than nouns: they occur in 62.94% of the sentences of our corpus. Nevertheless, the results are not as good (Table 2e). This shows that, despite their high frequency, verbs carry less pertinent information than nouns at the same distance from the head.

Table 2e. Strategy I′2: pruning by keeping only verbs at distance 1 from the head
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      54.32  62.50  44.37  20.41  27.58  91.39  67.68  52.61
Precision   49.58  65.98  48.44  78.39  46.69  43.57  63.84  56.64
F-measure   51.84  64.19  46.32  32.39  34.68  59.01  65.70  50.59
MinSupp=0.019, MinConf=30, Var.=4.00, Disp.=175.41, AvgTransSize=2.01

3.4 Methods based on dependencies and hyperonyms

In this section we add semantic information by means of hyperonyms, using the hyperonymization function h (§ 1.3). The preprocessing is done by Alg. 6: h_N(w) is the N-th order hyperonym of w, if it exists in WordNet; in case there is no N-th order hyperonym, the word remains unchanged. We call N the hyperonymic factor of our itemset transformation.

Algorithm 6: Corpus hyperonymization
  Data: A dependency-pruned corpus 𝒮′, a hyperonymic function MSCH : W → W^ℕ, the hyperonymic order N
  Result: The hyperonymically extended corpus 𝒮″
  Hyperonymize(𝒮′, MSCH, N):
    𝒮″ := ∅;
    for S′ ∈ 𝒮′ do
      S″ := ∅;
      for w ∈ S′ do
        if proj_N(MSCH(w)) ≠ ∅ then
          S″ := S″ ∪ {proj_N(MSCH(w))};
        else
          S″ := S″ ∪ {w};
        end
      end
      𝒮″ := 𝒮″ ∪ {S″};
    end
  end
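A sketch of Alg. 6 on top of NLTK's WordNet, with the disambiguation step of § 1.3 elided (we simply take the first listed synset, an assumption made for brevity) and the chain function inlined from the § 1.3 sketch:

    from nltk.corpus import wordnet as wn

    def msch(s, lf):
        """As in § 1.3: follow the most frequent hyperonym at each step."""
        chain = [s]
        while s.hypernyms() + s.instance_hypernyms():
            s = max(s.hypernyms() + s.instance_hypernyms(),
                    key=lambda t: (lf.get(t.name(), 0.0), t.name()))
            chain.append(s)
        return chain

    def hyperonymize(itemset, N, lf):
        out = set()
        for w in itemset:
            synsets = wn.synsets(w)            # no disambiguation: first synset
            chain = msch(synsets[0], lf) if synsets else []
            if len(chain) > N:
                out.add(chain[N].lemmas()[0].name())  # N-th order hyperonym
            else:
                out.add(w)                     # none left: keep the word itself
        return out

    print(hyperonymize({'poodle', 'strike'}, 1, lf={}))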
Strategy II1. This strategy considers hyperonymic factor N = 1: we first apply strategy I1 and then hyperonymization h_1. Results are presented in Table 3a. The performance is globally inferior to that of strategy I1 (where the F-measure attained 71.80%).

Table 3a. Strategy II1: I1 followed by first-order hyperonymization
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      72.39  56.04  71.32  48.96  59.02  82.20  70.42  65.76
Precision   66.21  75.42  64.33  82.21  67.71  73.20  70.73  71.40
F-measure   69.16  64.30  67.64  61.37  63.07  77.44  70.57  67.65
MinSupp=0.010, MinConf=44.8, Var.=1.92, Disp.=78.47, AvgTransSize=1.04

It is interesting to note that the recall of class GJOB has decreased significantly (48.96% vs. 58.27%): in other words, using hyperonyms when dealing with labor issues results in a failure to recognize 9.31% of the documents as belonging to the domain; one could say that terms used in GJOB lose their "labor specificity" already at first-order hyperonymization. On the other hand, the (already high in I1) recall of GSPO has increased even further compared to I1 (from 78.76% to 82.20%): it seems that sports terminology remains in the domain even after hyperonymization, and replacing specific terms by more general ones has increased their frequency as items and hence improved recall. We observe the same phenomenon with the recall of GDIP (which increased from 66.97% to 71.32%), and also, slightly, with the recalls of E12 and GVIO.

Strategy II2. This strategy is similar to strategy II1 but uses hyperonymic factor N = 2. Results are presented in Table 3b.

Table 3b. Strategy II2: I1 followed by second-order hyperonymization
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      69.02  52.78  71.97  47.77  54.24  80.80  65.67  63.18
Precision   64.94  74.57  61.23  80.64  65.81  69.96  72.33  69.93
F-measure   66.92  61.81  66.16  60.00  59.47  74.99  68.84  65.46
MinSupp=0.008, MinConf=44.8, Var.=1.85, Disp.=70.51, AvgTransSize=1.04

The performance is globally inferior to that of II1 (where we used first-order hyperonyms), with two minor exceptions: the recall of GDIP, which increased by 0.65%, and the precision of GVIO, which increased by 1.6%. What is noteworthy, however, is the fact that the recalls of GDIP and GSPO are still higher than those of strategy I1 (no hyperonyms).

To better understand the behavior of the system when climbing the hyperonymic chain by replacing words with hyperonyms of increasingly higher order (and returning to the original word when there are no hyperonyms left), we calculated the performance for N-th order hyperonyms for 1 ≤ N ≤ 12. Note that when N > 12 the number of remaining hyperonyms is negligible and the strategy becomes similar to strategy I1 (no hyperonyms). In Fig. 2 the reader can see the evolution of recall (black), precision (red) and F-measure (blue), first averaged over all classes, and then specifically for GSPO and for GDIP. Dashed lines represent the recall, precision and F-measure of strategy I1.

In the average case, the effect of hyperonymization of orders 1–4 is to decrease performance. After N = 5, the global number of hyperonyms available in WordNet rapidly decreases, so that the situation gradually returns to that of I1 (no hyperonyms), and we see the curves asymptotically converging to the I1 lines from underneath. Not so for GSPO, whose recall curve lies above the I1 value for most N (N = 1, 2, 6–8 and 10–12). The phenomenon is even better illustrated by GDIP: as the reader can see in the figure, the complete GDIP recall curve is located above the I1 one.
Fig. 2. Recall, precision and F-measure for hyperonymization of orders 1 ≤ N ≤ 12: the average case, class GSPO and class GDIP (plots omitted; x-axis N = hyperonym degree).

It seems that in these two cases (GDIP and, to a lesser extent, GSPO), hyperonyms of all orders have a positive impact on the classifier. Unfortunately this impact only concerns recall and is compensated by bad precision, so that the F-measure remains inferior to the I1 case.

4 Conclusion and future work

In this paper we have investigated the use of association rules for text classification by applying two new techniques: (a) we reduce the number of word features through the use of morphosyntactic criteria in the framework of dependency syntax; for that, we keep words dependent on the head through specific dependencies and/or having specific POS tags; (b) we replace words by their hyperonyms of different orders, which we have calculated out of WordNet using frequencies and, in some cases, disambiguation.

We have obtained positive results for case (a), in particular when we compare dependency-based single-item rules with tfidf-based ones. In case (b) the results we share in this paper are less efficient but still interesting; in particular, we found classes for which hyperonymization significantly improves recall.

This work opens several perspectives, among which:

— examine why these particular classes are favorable to hyperonymization, and whether this is related to the structure of WordNet or to linguistic properties of the domain;

— explore partial hyperonymization, i.e., is it possible to hyperonymize only specific items according to the needs of the classifier?¹ How do we choose, at the word level, whether we should rather keep the original word (to increase precision) or switch to some hyperonym (to increase recall)?

— we have used only recall and precision as quality measures of our rules, and our evaluation is strongly dependent on these measures, since the selection of the 1,000 rules we keep is entirely based upon them. Other quality measures are available: how do they apply, and how can they be compared and combined? How robust are the results?

— and finally: how can we optimize the distinctive feature of association rules, namely the fact of being intelligible to the user? How can the user's experience (and linguistic knowledge) be incorporated in the enhancement of rules, so as to obtain the best possible result from his/her point of view?

¹ Indeed, by hyperonymizing all words one wins on one side and loses on the other: for example, "dalmatian" and "poodle" will both be replaced by "dog", but "dog" occurrences will be replaced by "canid". It would be preferable to keep the word "dog" in the second case, so that we have a real increase in frequency.

References

1. Reuters corpus, volume 1, English language, 1996-08-20 to 1997-08-19, http://about.reuters.com/researchandstandards/corpus/statistics/index.asp
2. British National Corpus (1994), http://www.natcorp.ox.ac.uk
3. Aggarwal, C.C., Zhai, C.: A survey of text classification algorithms. In: Mining Text Data, chap. 6, pp. 163–222. Springer
4. Ahonen, H., Heinonen, O., Klemettinen, M., Verkamo, A.I.: Applying data mining techniques in text analysis. TR C-1997-23, Department of Computer Science, University of Helsinki (1997)
5. Borgelt, C.: Efficient implementations of apriori and eclat. In: Workshop on Frequent Itemset Mining Implementations (FIMI 2003), Melbourne, FL, USA (2003)
6. Ferrer i Cancho, R., Solé, R.V., Köhler, R.: Patterns in syntactic dependency networks. Physical Review E 69, 1–8 (2004)
7. Cherfi, H., Napoli, A., Toussaint, Y.: A conformity measure using background knowledge for association rules: application to text mining, pp. 100–115. IGI Global (2009)
8. Chomsky, N.: Syntactic Structures. Mouton (1957)
9. Cohen, W.W.: Learning rules that classify e-mail. In: AAAI Spring Symposium on ML and IR (1996)
10. Curran, J.R., Moens, M.: Scaling context space. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 231–238 (2002)
11. Do, T.N.Q.: A graph model for text analysis and text mining. Master's thesis, Université de Lorraine (2012)
12. Jaillet, S., Laurent, A., Teisseire, M.: Sequential patterns for text categorization. Intell. Data Anal. 10(3), 199–214 (2006)
13. Kovacs, L., Baksa-Varga, E.: Dependency-based mapping between symbolic language and Extended Conceptual Graph. In: 6th International Symposium on Intelligent Systems and Informatics, pp. 1–6 (2008)
14. Lang, K.: NewsWeeder: learning to filter netnews. In: International Conference on Machine Learning, pp. 331–339. Morgan Kaufmann (1995)
15. Liu, B., Hsu, W., Ma, Y.: Integrating classification and association rule mining. In: Proc. of the Int. Conf. on Knowledge Discovery and Data Mining (New York), pp. 80–86 (1998)
16. Lowe, W.: Towards a theory of semantic space. In: Proceedings of the Twenty-Third Annual Conference of the Cognitive Science Society, pp. 576–581 (2001)
17. de Marneffe, M.C., MacCartney, B., Manning, C.D.: Generating typed dependency parses from phrase structure parses. In: LREC 2006, pp. 449–454 (2006)
18. Sahami, M., Dumais, S., Heckerman, D., Horvitz, E.: A Bayesian approach to filtering junk e-mail. In: AAAI Workshop on Learning for Text Categorization. AAAI Technical Report WS-98-05 (1998)
19. Mel'čuk, I.A.: Dependency Syntax: Theory and Practice. State University of New York Press, Albany (1987)
20. Miller, G.A.: WordNet: a lexical database for English. Communications of the ACM 38(11), 39–41 (1995)
21. Nivre, J.: Dependency grammar and dependency parsing. MSI report 05133, School of Mathematics and Systems Engineering, Växjö University (2005)
22. Ordoñez-Salinas, S., Gelbukh, A.: Information retrieval with a simplified conceptual graph-like representation. Lecture Notes in Computer Science, vol. 6437, chap. 9, pp. 92–104 (2010)
23. Padó, S., Lapata, M.: Dependency-based construction of semantic space models. Computational Linguistics 33(2), 161–199 (2007)
24. Pedersen, T.: Information content measures of semantic similarity perform better without sense-tagged text. In: Proceedings of the 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), pp. 329–332 (2010)
25. Roche, M., Azé, J., Matte-Tailliez, O., Kodratoff, Y.: Mining texts by association rules discovery in a technical corpus. In: Intelligent Information Processing and Web Mining, pp. 89–98. Advances in Soft Computing, Springer (2004)
26. Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK (1994)
27. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002)
28. Tesnière, L.: Éléments de syntaxe structurale. Klincksieck (1959)
29. Zaïane, O.R., Antonie, M.L.: Classifying text documents by associating terms with text categories. In: Australasian Database Conference. CRPIT, vol. 5. Australian Computer Society (2002)