=Paper=
{{Paper
|id=Vol-1202/proceedings
|storemode=property
|title=Proceedings of the 1st International Workshop on Interactions between Data Mining and Natural Language Processing (DMNLP 2014)
|pdfUrl=https://ceur-ws.org/Vol-1202/dmnlp2014.pdf
|volume=Vol-1202
}}
==Interactions between Data Mining and Natural Language Processing: 1st International Workshop, DMNLP 2014, Nancy, France, September 2014, Proceedings==
Peggy Cellier Thierry Charnois Andreas Hotho Stan Matwin Marie-Francine Moens Yannick Toussaint (Eds.) Interactions between Data Mining and Natural Language Processing 1st International Workshop, DMNLP 2014 Nancy, France, September 2014 Proceedings 2 II Volume Editors Peggy Cellier INSA Rennes, IRISA Campus Beaulieu, 35042 Rennes cedex, France E-mail: peggy.cellier@irisa.fr Thierry Charnois Université Paris 13 Sorbonne Paris Cité, LIPN CNRS Av. J.B. Clément, 93430 Villetaneuse, France E-mail: Thierry.Charnois@lipn.univ-paris13.fr Andreas Hotho University of Würzburg Am Hubland, 97074 Würzburg, Germany E-mail: hotho@informatik.uni-wuerzburg.de Stan Matwin Faculty of Computer Science, Dalhousie University 6050 University Ave., PO BOX 15000, Halifax, NS B3H 4R2, Canada E-mail: stan@cs.dal.ca Marie-Francine Moens Department of Computer Science, KU Leuven Celestijnenlaan 200A, B-3001 Heverlee, Belgium E-mail: sien.moens@cs.kuleuven.be Yannick Toussaint INRIA Nancy Grand-Est, LORIA 615 Rue du Jardin Botanique, 54600 Villers-lès-Nancy, France E-mail: Yannick.Toussaint@loria.fr Copyright c 2014 for the individual papers by the papers’ authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors. Preface Recently, a new field has emerged taking benefit of both domains: Data Mining (DM) and Natural Language Processing (NLP). Indeed, statistical and machine learning methods hold a predominant position in NLP research1 , advanced methods such as recurrent neural networks, Bayesian networks and kernel based methods are exten- sively researched, and ”may have been too successful (. . . ) as there is no longer much room for anything else”2 . They have proved their effectiveness for some tasks but one major drawback is that they do not provide human readable models. By contrast, sym- bolic machine learning methods are known to provide more human-readable model that could be an end in itself (e.g., for stylistics) or improve, by combination, further meth- ods including numerical ones. Research in Data Mining has progressed significantly in the last decades, through the development of advanced algorithms and techniques to extract knowledge from data in different forms. In particular, for two decades Pattern Mining has been one of the most active field in Knowledge Discovery. This volume contains the papers presented at the ECML/PKDD 2014 workshop: DMNLP’14, held on September 15, 2014 in Nancy. DMNLP’14 (Workshop on Inter- actions between Data Mining and Natural Language Processing) is the first work- shop dedicated to Data Mining and Natural Language Processing cross-fertilization, i.e a workshop where NLP brings new challenges to DM, and where DM gives future prospects to NLP. It is well-known that texts provide a very challenging context to both NLP and DM with a huge volume of low-structured, complex, domain-dependent and task-dependent data. The objective of DMNLP is thus to provide a forum to discuss how Data Mining can be interesting for NLP tasks, providing symbolic knowledge, but also how NLP can enhance data mining approaches by providing richer and/or more complex information to mine and by integrating linguistic knowledge directly in the mining process. Out of 23 submitted papers, 9 were accepted as regular papers amounting to an acceptance rate of 39%. 
In addition to regular contributions, two less mature works, which were still considered valuable for discussion, were accepted as posters and appear as extended abstract in this volume. The high quality of the program of the workshop was ensured by the much- appreciate work of the authors and the Program Committee members. Finally, we wish to thank the local organization team of ECML/PKDD 2014, and more specifi- cally Amedeo Napoli and Chedy Raı̈ssy, and the ECML/PKDD 2014 workshop chairs Bettina Berendt and Patrick Gallinari. September 2014 Peggy Cellier, Thierry Charnois Andreas Hotho, Stan Matwin Marie-Francine Moens, Yannick Toussaint 1 D. Hall, D. Jurafsky, and C. M. Manning. Studying the History of Ideas Using Topic Models. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pp. 363–371, 2008 2 K. Church. A Pendulum Swung Too Far. Linguistic Issues in Language Technology, Vol. 6, CSLI publications, 2011. Organization Program Chairs Peggy Cellier INSA Rennes, IRISA, France Thierry Charnois Université Paris 13, Sorbonne Paris cité, LIPN, France Andreas Hotho University of Kassel, Germany Stan Matwin Dalhousie University, Canada Marie-Francine Moens Katholieke Universiteit Leuven, Belgium Yannick Toussaint INRIA Nancy Grand-Est, LORIA, France Program Commitee Martin Atzmueller University of Kassel, Germany Delphine Battistelli MoDyCo-Université Paris Ouest, France Yves Bestgen F.R.S-FNRS, Université catholique de Louvain, Belgium Philipp Cimiano University of Bielefeld, Germany Bruno Cremilleux Universit de Caen, France Beatrice Daille Laboratoire d’Informatique de Nantes Atlantique, France Francois Jacquenet Laboratoire Hubert Curien, France Jiri Klema Czech Technical University, Prague, Czech Republic Yves Lepage Waseda University, Japan Amedeo Napoli LORIA Nancy, France Adeline Nazarenko Université de Paris 13, LIPN, France Claire Nédellec Institut National de Recherche Agronomique, France Maria Teresa Pazienza University of Roma ”Tor Vergata”, Italy Pascal Poncelet LIRMM Montpellier, France Stephen Poteet Boeing, USA Solen Quiniou Laboratoire d’Informatique de Nantes Atlantique, France Mathieu Roche Cirad, TETIS, Montpellier, France Arnaud Soulet Université François Rabelais Tours, France Steffen Staab University of Koblenz-Landau, Germany Koichi Takeuchi Okayama University, Japan Isabelle Tellier Lattice, Paris, France Johanna Völker University of Mannheim, Germany Xifeng Yan University of California at Santa Barbara, USA Pierre Zweigenbaum LIMSI-CNRS, Paris, France Additional Reviewers Éric Kergosien LIRMM, Montpellier, France V Table of Contents Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . III Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IV Regular papers Automatically Detecting and Rating Product Aspects from Textual Customer Reviews . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Wouter Bancken, Daniele Alfarone and Jesse Davis Mining Meaning from Text by Harvesting Frequent and Diverse Semantic Itemsets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 Luigi Di Caro and Guido Boella Text Mining for Open Domain Semi-Supervised Semantic Role Labeling . . . . . . . 
33 Quynh Ngoc Thi Do, Steven Bethard and Marie-Francine Moens Learning Semantically Coherent Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 Alexander Gabriel, Heiko Paulheim and Frederik Janssen Text Classification Using Association Rules, Dependency Pruning and Hyperonymization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Yannis Haralambous and Philippe Lenca Learning Subgraph Patterns from text for Extracting Disease–Symptom Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 Mohsen Hassan, Adrien Coulet and Yannick Toussaint Interpretable Low-rank Document Representations with Label-dependent Sparsity Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 Ivan Ivek TagMiner: A Semisupervised Associative POS Tagger Effective for Resource Poor Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 Pratibha Rani, Vikram Pudi and Dipti Sharma Sequential Patterns of POS Labels Help to Characterize Language Acquisition . . 129 Isabelle Tellier, Zineb Makhlouf and Yoann Dupont Extended abstracts RegExpMiner: Automatically discovering frequently matching regular expressions 143 Julien Rabatel, Jérôme Azé, Pascal Poncelet and Mathieu Roche NLP-based Feature Extraction for Automated Tweet Classification . . . . . . . . . . . . 145 Anna Stavrianou, Caroline Brun, Tomi Silander and Claude Roux Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VI VI Automatically Detecting and Rating Product Aspects from Textual Customer Reviews Wouter Bancken, Daniele Alfarone and Jesse Davis Department of Computer Science, KU Leuven Celestijnenlaan 200A - box 2402, 3001 Leuven, Belgium wouter.bancken@student.kuleuven.be daniele.alfarone@cs.kuleuven.be jesse.davis@cs.kuleuven.be Abstract. This paper proposes a new approach to aspect-based sen- timent analysis. The goal of our algorithm is to obtain a summary of the most positive and the most negative aspects of a specific product, given a collection of free-text customer reviews. Our approach starts by matching handcrafted dependency paths in individual sentences to find opinions expressed towards candidate aspects. Then, it clusters together different mentions of the same aspect by using a WordNet-based simi- larity measure. Finally, it computes a sentiment score for each aspect, which represents the overall emerging opinion of a group of customers towards a specific aspect of the product. Our approach does not require any seed word or domain-specific knowledge, as it only employs an off- the-shelf sentiment lexicon. We discuss encouraging preliminary results in detecting and rating aspects from on-line reviews of movies and MP3 players. Keywords: aspect-based sentiment analysis, opinion mining, syntactic dependency paths, text mining 1 Introduction Sentiment analysis is the task of detecting subjectivity in natural language. Ap- proaches to this task mainly draw from the areas of natural language processing, data mining, and machine learning. In the last decade, the exponential growth of opinionated data on the Web fostered a strong interest in the insights that sentiment analysis could reveal. 
For example, companies can analyze user re- views on the Web to obtain a good picture of the general public opinion on their products at very little cost. While the first efforts in sentiment analysis were directed towards determin- ing the general polarity (positive or negative) of a certain sentence or document, the interest has recently shifted towards a more qualitative analysis, that aims to detect the different aspects of a topic towards which an opinion is expressed. For example, we may be interested in analyzing a movie review to capture the opinions of the reviewer towards aspects such as the plot, the cinematography, In: P. Cellier, T. Charnois, A. Hotho, S. Matwin, M.-F. Moens, Y. Toussaint (Eds.): Proceedings of DMNLP, Workshop at ECML/PKDD, Nancy, France, 2014. Copyright c by the paper’s authors. Copying only for private and academic purposes. 2 W. Bancken, D. Alfarone and J. Davis or the performance of a specific actor. The most challenging part in aspect-based sentiment analysis is that a system needs to detect the relevant aspects before these can be associated with a polarity. In this paper we introduce Aspectator, a new algorithm for automatically detecting and rating product aspects from customer reviews. Aspectator can discover candidate aspects by simply matching few syntactic dependency paths, while other approaches [6, 14, 16, 21] require seed words in input and use syn- tactic dependencies or some bootstrapping technique to discover new words and the relations between them. Additionally, it does not require any domain-specific knowledge in input, but only few handcrafted syntactic dependency paths and an off-the-shelf sentiment lexicon. Consequently, the proposed system can detect and rate aspects of products in any domain, while many existing approaches [16, 21, 18] focus on domains for which machine-readable knowledge is available. Con- cretely, Aspectator combines a first high-recall step where candidate aspects are extracted from individual sentences through syntactic dependency paths, with a second and third high-precision steps, where aspect mentions are clustered and their sentiment scores are aggregated by leveraging an external sentiment lexicon. In our opinion, the considered setting represents an ideal testbed for in- vestigating interactions between natural language processing and data mining. Indeed, our focus is not on extracting the aspects discussed in a single sentence or document, which could be seen as a problem of deep text understanding, but on crunching hundreds of reviews of a specific product to capture the aspects that best summarize the opinions of a group of customers, which requires lin- guistic knowledge to extract information from single sentences, along with data mining expertise to make sense of large amounts of data. 2 Related Work Historically, sentiment analysis has been concerned with assigning a binary clas- sification to sentences or entire documents, that represents the polarity (i.e., the orientation) of the writer towards the discussed contents [13, 19]. Nevertheless, the overall polarity gives no indication about which aspects the opinions refer to. For this reason, in 2004 Hu and Liu [6] introduced the more interesting problem of aspect-based sentiment analysis, where polarity is not assigned to sentences or documents, but to single aspects discussed in them. 
In their approach, given a large number of reviews for a specific product, they first attempt to identify interesting product aspects by using association mining, and then attach a senti- ment score to each aspect by exploiting a small seed set of opinion words, along with their synonyms and antonyms present in WordNet. Next, they use newly detected opinion words to extract additional infrequent product aspects. Instead of using association mining, our work will detect aspects through dependency paths, and will use an external sentiment lexicon to rate them. However, their work remains the most similar to ours, as in both cases the goal is to summarize a Automatically Detecting Product Aspects from Customer Reviews 3 collection of reviews for a specific product by detecting the most interesting and discussed aspects, while most approaches focus on analyzing individual reviews. Qiu et al. [14] continued to pursue the idea that opinion words can be used to detect product aspects and vice versa, focusing on single reviews. In their approach, a seed set of opinion words is combined with syntactic dependencies to identify product aspects and new opinion words. To detect the polarity of the newly identified opinion words, they consider the given polarities of the seed words and make the assumption that opinion words expressing a sentiment towards the same aspect in the same review share the same polarity. While Qiu et al. use syntactic dependencies solely to capture word sequences that contain aspects or opinion words already observed, our approach uses dependency paths to detect new product aspects, with the potential advantage of achieving higher coverage. A different line of work on aspect-based sentiment analysis is based on topic models. Brody and Elhadad [3] have tried to use Latent Dirichlet Allocation (LDA) [2] to extract topics as product aspects. To determine the polarity to- wards each topic/aspect, they start from a set of seed opinion words and propa- gate their polarities to other adjectives by using a label propagation algorithm. Instead of treating aspect detection and sentiment classification as two separate problems, Lin and He [11] and Jo and Oh [8] directly integrate the sentiment clas- sification in the LDA model, so that it natively captures the sentiment towards the topic/aspect. While these LDA-based approaches provide an elegant model of the problem, they produce topics that are often not directly interpretable as aspects, and thus require manual labelling to achieve a readable output. The work discussed so far proposes domain-independent solutions for aspect- based sentiment analysis, where also our approach is positioned. However, several works make use of domain-specific knowledge to improve their results. For in- stance, Thet et al. [16] focus on aspect-based classification of movie reviews, and include as input for their algorithm movie-specific terms such as the name of the movie, the cast and the director. Additionally, they include some domain-specific opinion words as input for their algorithm. As expected, including domain- specific knowledge yields a more accurate sentiment classification. To make an example, the word “unpredictable” has a negative polarity in general English, but in the movie domain it is often used to praise the unpredictability of a storyline. 
Since all relevant aspects are given as input, they exclusively focus on detecting opinions towards the given aspects by (1) capturing new opinion words through syntactic dependencies, and (2) rating the product aspects based on an external sentiment lexicon and some given domain-specific opinion words. Similarly, Zhu et al. [21] use product aspects and some aspect-related terms as input for their algorithm, but then attempt to discover new aspect-related terms by applying a bootstrapping algorithm based on co-occurrence between seed terms and new candidate terms. A sentiment score is again obtained by accessing an external sentiment lexicon. While our approach retains from these works the usage of an external lexicon, it requires neither labelled examples nor domain-specific knowledge, thus it has wider applicability. 4 W. Bancken, D. Alfarone and J. Davis 3 Aspectator: a New Approach Aspectator takes as input a collection of textual customer reviews for one specific product, and automatically extracts the most positive and the most negative aspects of the product, together with all the sentences that contribute to the sentiment polarity of each aspect. More precisely: Given: a collection of textual reviews of one specific product Extract: – The n most positive product aspects, along with a list of all sentences containing positive and negative mentions of each aspect. – The n most negative product aspects, along with a list of all sentences containing positive and negative mentions of each aspect. Aspectator works in three steps, depicted in Fig. 1. First, it detects men- tions of aspects and their associated opinion by matching handcrafted paths in dependency trees. Second, it clusters the different mentions of an aspect ex- tracted in the first step by means of a WordNet-based similarity measure. Third, it attaches a sentiment score to each mention, and aggregates the scores from all mentions belonging to the same cluster in order to obtain a final sentiment score for each aspect. Aspectator does not require labelled examples and it is domain-independent, thus it can run on any collection of reviews for a specific product. The only re- quired external knowledge is in the form of ten handcrafted dependency paths and an English lexicon with a sentiment score for every word. 3.1 Detecting Product Aspects The objective of the first step is to extract from customer reviews mentions of a product aspect and the words that express the opinion of the writer towards that aspect. For instance, given the sentence: “The action music used in the movie wasn’t too good.” Aspectator extracts the following pair: < not too good ; |action{zmusic} > | {z } Sentiment Aspect modifier mention We call this an opinion pair, as the first part is the opinion of a reviewer towards the second part. The first part can optionally be negated, as in the above example, causing an inversion of the polarity expressed by the sentiment modifier. Automatically Detecting Product Aspects from Customer Reviews 5 Input: a set of customer reviews for one product, e.g. 
the movie Batman & Robin Step 1: detection of product aspects cheesy film extremely attractive uma thurman […] top-notch acting very cheesy acting […] almost bad uma thurman not bad movie […] Step 2: clustering of product aspects top-notch acting extremely attractive uma thurman cheesy film very cheesy acting almost bad uma thurman not bad movie … … … Step 3: rating of product aspects top-notch acting: +0.63 extremely attractive uma thurman: +0.74 cheesy film: -0.75 very cheesy acting: -0.77 almost bad uma thurman: -0.57 not bad movie: +0.57 … … … -9.76 +7.03 -25.6 Output Positive aspects: • Uma Thurman 23 positive mentions, e.g.: “Batman and Robin has plenty of big name actors, Uma Thurman is extremely attractive as Poison Ivy and …” 9 negative mentions, e.g.: “The great Uma Thurman (Pulp Fiction, The Avengers) who plays Poison Ivy, is almost as bad as Schwarzenegger.” • […] Negative aspects: • Acting 5 positive mentions, e.g.: “The acting, storyline and visual effects were top-notch.” 22 negative mentions, e.g.: “The acting was very cheesy and predictable, but there is some parts that boggles my mind...george clooney as batman?!” • […] Fig. 1. Aspectator’s full pipeline, with example extractions from reviews for the movie Batman & Robin. Scores greater or lower than zero represent positive or negative sentiment polarity, respectively. 6 W. Bancken, D. Alfarone and J. Davis Aspectator extracts opinion pairs by using ten simple handcrafted depen- dency paths, in three steps: 1. For each sentence, Aspectator extracts a syntactic dependency tree by using the Stanford dependency parser [4, 10]. Fig. 2 shows the dependencies for the example sentence above. 2. Given a dependency tree, it attempts to extract a basic opinion pair com- posed by a single-word sentiment modifier and a single-word aspect mention by matching one of the five dependency paths shown in Table 1. For the ex- ample sentence, this step extracts the opinion pair < good ; music > through nsubj cop the dependency path A ←−−−− M −−→ ∗. 3. Given a matched opinion pair, it attempts to extend the match to neighbour- ing words by applying the additional dependency paths shown in Table 2. This allows to (1) capture multi-word expressions, such as “action music” and “too good ” in the running example, and (2) capture negations, such as “wasn’t” in the example. The final opinion pair for the running example becomes < not too good ; action music >. Fig. 2. Example of syntactic dependencies detected by the Stanford dependency parser. Note that our approach leverages syntactic dependency paths for two pur- poses: (1) detecting aspect mentions and sentiment modifiers, and (2) discover- ing relations between them. This is a significant difference with other approaches that are based on syntactic dependencies. For example, Qiu et al. [14] only use syntactic dependencies to identify relations between word sequences that contain an aspect or an opinion word that has been detected before. While our technique for extracting aspect mentions and sentiment modifiers yields high recall, its precision is low, since several irrelevant word sequences are captured. Nevertheless, the following steps allow our system to assign lower confidence to incorrect extractions, thus ultimately yielding accurate top-ranked extractions. Automatically Detecting Product Aspects from Customer Reviews 7 Table 1. Main dependency paths used by Aspectator to detect an aspect (A) and a sentiment modifier (M ) that form an opinion pair < M ; A >. 
Asterisks (*) are wildcards that can match any word. Type Dependency path Example amod Adjectival modifier M ←−−− A nsubj dobj Direct object A ←−−−− ∗ −−−→ M Adjectival nsubj acomp complement A ←−−−− ∗ −−−−→ M Complement of a nsubj cop copular verb A ←−−−− M −−→ ∗ Adverbial modifier nsubjpass advmod to a passive verb A←−−−−−− − ∗ −−−−−→ M Table 2. Extensions to the dependency paths of Table 1 to deal with multi-word aspects (A) and multi-word sentiment modifiers (M ), and to capture negations. Note that the fourth extension is the only one that imposes a lexical constraint, as it only triggers if the determiner is the word “no”. Type of extension Dependency path Example nn Compound noun A0 ←−− A advmod Adverbial modifier M 0 ←−−−−− M neg Simple negation ∗ ←−− M Negation through det “no” ←−− A “no” determiner Negation through ∗ ←a− ux − cop M −−→ ∗ hypothetical phrase aux− ∗ ←− 8 W. Bancken, D. Alfarone and J. Davis 3.2 Clustering Product Aspects The goal of this step is to cluster the previously-extracted opinion pairs by searching for all semantically similar aspect mentions, independently from their sentiment modifier. For example, in the context of movie reviews, we would like to cluster together the opinion pairs < very bad ; music > and < awesome ; soundtrack >, as they both express opinions towards the same aspect of a movie. To identify semantically similar aspect mentions, Aspectator uses a WordNet- based similarity metric called Jcn [7]. Zhai et al. [20] experimented with several WordNet-based similarity metrics in the context of clustering for aspect-based sentiment analysis, concluding that Jcn delivers the best results. Jcn is based on the principle that two terms are similar if their least common subsumer (LCS) in the WordNet taxonomy has high information content (IC). For instance, the terms (car, bicycle), having LCS “vehicle”, are more similar than (car, fork), having LCS “artifact”, because “vehicle” is a more informative term than “artifact”. Formally, the Jcn similarity between two terms t1 and t2 is defined as: 1 Jcn(t1 , t2 ) = (1) IC(t1 ) + IC(t2 ) − 2 · IC(LCS(t1 , t2 )) where LCS(t1 , t2 ) is the least common subsumer of t1 and t2 in WordNet, and the information content of a term is equivalent to: IC(t) = −log P (t) (2) where P (t) is the probability of observing, in a large English corpus, the term t or any term subsumed by t in the WordNet hierarchy. The higher the probability of observing a term t or any of its subsumed terms, the lower the information content of t. Concretely, in order to cluster opinion pairs, Aspectator first computes the Jcn similarity for every possible pair of aspect mentions, by using an imple- mentation available in the WS4J library [15]. Next, it normalizes all mentions by stemming them, in order to increase data density. When two terms map to the same root, for instance “act” and “acting”, a comparison with another term is made by picking the stem that maximizes the Jcn similarity. Finally, Aspectator uses the pairwise similarity values as input for the K-Medoids clustering algorithm [9], which will return clusters of opinion pairs, with each cluster representing a collection of opinions towards a single aspect. K-Medoids is preferred over K-Means because it can compute the centroid of a cluster without the need of defining a mean. 
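To make the clustering step concrete, the following sketch (not the authors' WS4J/Java implementation) computes the Jcn similarity between two aspect mentions using NLTK's WordNet interface and the Brown corpus information-content file. Multi-word mentions are reduced to single words here, and word similarity is taken as the maximum over all noun-synset pairs; both simplifications are assumptions of this illustration rather than details given in the paper.

```python
# Minimal Jcn similarity sketch with NLTK as a stand-in for WS4J.
from itertools import product

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic   # requires nltk.download('wordnet_ic')

brown_ic = wordnet_ic.ic('ic-brown.dat')

def jcn_word_similarity(word1, word2):
    """Best Jcn similarity over the noun senses of two aspect mentions."""
    scores = []
    for s1, s2 in product(wn.synsets(word1, pos=wn.NOUN),
                          wn.synsets(word2, pos=wn.NOUN)):
        try:
            scores.append(s1.jcn_similarity(s2, brown_ic))
        except Exception:   # no shared subsumer or missing IC value
            continue
    return max(scores) if scores else 0.0

# The paper's own example: (car, bicycle) share the informative subsumer
# "vehicle", so they should score higher than (car, fork).
print(jcn_word_similarity("car", "bicycle"))
print(jcn_word_similarity("car", "fork"))
```

In a full pipeline, these pairwise scores would populate the similarity matrix handed to K-Medoids.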
3.3 Rating Product Aspects

In the final stage of our approach, each cluster receives a sentiment score, which represents the overall emerging opinion of a group of customers towards a specific aspect of a product. Concretely, Aspectator undertakes three sub-steps for each cluster:

1. For each opinion pair in the cluster, it assigns an individual sentiment score to each word that composes the sentiment modifier. For instance, given the opinion pair < just plain stupid ; action music >, it attaches three individual scores to "just", "plain" and "stupid".
2. It combines the scores for the individual words into a single score for the entire sentiment modifier, e.g., "just plain stupid".
3. It extracts a final sentiment score for the entire cluster by aggregating the scores of all sentiment modifiers.

Step 1. In order to obtain a sentiment score for individual words, Aspectator uses the external sentiment lexicon SentiWordNet [1]. SentiWordNet extends WordNet by attaching three scores to each synset (a synset is a group of synonymous words, corresponding to a node in the WordNet hierarchy): a positive sentiment score, a negative sentiment score and a neutrality score. These three scores always sum to 1. For example, the word "mediocre", in the sense of "lacking exceptional quality or ability", has the scores 0.25, 0.125 and 0.625 as its positive, neutral and negative score, respectively. For simplicity, our approach does not use three different sentiment scores, but combines them into one score in the range [-1,1] by subtracting the negative score from the positive score. The neutrality score is thus ignored, as "almost neutral" opinions will have a score close to zero, and consequently will have no significant impact in the following aggregation steps. Instead of performing word sense disambiguation, Aspectator simply aggregates the sentiment scores of all the synsets in which a word w appears, as follows:

score(w) = \frac{\sum_{i=1}^{n} score(synset_i)/i}{\sum_{i=1}^{n} 1/i}   (3)

where i ∈ N is the rank of a synset in WordNet based on the synset's frequency in general English, and synset_i is the i-th synset of w in the ranking. Intuitively, dividing a synset's score by i allows our approach to give higher weight to synsets that are more likely to represent the right sense of the word w in a certain context, given their overall higher popularity in English.

Step 2. The word-level scores obtained in the previous step are then combined into a single score for the entire sentiment modifier by adopting an approach based on the work of Thet et al. [16]. Specifically, Aspectator takes as initial score the sentiment score of the rightmost (i.e., most specific) word in the sentiment modifier. Then, it iteratively uses the score of each preceding word to either intensify or attenuate the current score depending on the polarity of the considered words, remaining in the range [-1,1]. Concretely, the score for a sentiment modifier composed by words w_n w_{n-1} ... w_1 w_0 is computed as:

score(w_i \dots w_0) = score(w_{i-1} \dots w_0) - score(w_{i-1} \dots w_0) \cdot |score(w_i)|,  if score(w_{i-1} \dots w_0) > 0 and score(w_i) < 0   (4a)

score(w_i \dots w_0) = \sigma \cdot \left( |score(w_{i-1} \dots w_0)| + (1 - |score(w_{i-1} \dots w_0)|) \cdot |score(w_i)| \right),  with \sigma = sign(score(w_{i-1} \dots w_0)),  otherwise   (4b)

In case the sentiment modifier is negated, the resulting sentiment score is multiplied by -1 to obtain the opposite polarity.
Equation (4b) models the general case, where the next word wi in the iterative procedure intensifies the current score functionally to |score(wi )|. This follows Thet et al.’s observation that (1) words with the same polarity tend to intensify each other (e.g., “super nice”, “terribly stupid ”), and (2) a negative current score becomes more negative when the next word has positive score (e.g., “super bad ”). Equation (4a) is introduced to handle the particular case in which the current score is positive and the next word to be processed is negative (e.g., “hardly interesting”). In this case, applying (4b) would make the final score more positive, while a negative modifier should make the score less positive. As a full example, we show how our iterative procedure computes a sentiment score for the opinion pair < just plain stupid ; action music >: Example opinion pair : < just plain stupid ; action music > w2 w1 w0 Individual scores: 0.07 0.12 −0.51 score(plain stupid ) = (−1) · 0.51 + (1 − 0.51) · 0.12 = −0.57 score(just plain stupid ) = (−1) · 0.57 + (1 − 0.57) · 0.07 = −0.60 Thus, the resulting sentiment score for the aspect mention “action music” in this example is −0.60. Step 3. Lastly, Aspectator computes a final sentiment score for each aspect, by summing the scores computed in the previous step for all sentiment modifiers belonging to the aspect’s cluster. A simple algebraic summation supports the intuition that few strongly positive/negative opinions should result in a senti- ment score comparable to the one of many weakly positive/negative opinions. We refer back to Fig. 1 for a complete example. In order to produce the final output, Aspectator ranks the aspects by their sentiment score, and returns only the n most positive and the n most negative aspects, where n is specified by the user. This ranking places at the top the most interesting aspects, i.e., the ones that (1) are frequently mentioned in the reviews, and (2) are subjected to strong positive or negative opinions of the reviewers. Automatically Detecting Product Aspects from Customer Reviews 11 This has also the advantage that many incorrect opinion pairs extracted in the first step of the pipeline (Sect. 3.1) will be excluded from the final output, as they typically have very few mentions and are not associated with strong opinions. 4 Experiments In this section, we present a preliminary evaluation of Aspectator. The objec- tive of our experiments is to address the following questions: 1. Can our approach detect interesting and relevant product aspects? 2. Can our approach provide meaningful evidence that supports the sentiment score assigned to each aspect? Additionally, we discuss the main sources of error of our approach. 4.1 Methodology Aspectator’s output was manually evaluated on a portion of two public datasets from different domains by two annotators, out of which only one was a co-author. The first dataset is a collection of movie reviews taken from Amazon,2 pub- lished by McAuley and Leskovec [12]. Since manual evaluation is required, we sampled ten movies to create a validation set and a test set, in the following way. First, we considered only the 50 movies with the highest number of reviews, as we want to the test the ability of our algorithm to summarize a large amount of data for a single movie. 
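A minimal sketch of the word- and modifier-level scoring (Equations 3, 4a and 4b) is given below. It uses NLTK's SentiWordNet corpus reader as a stand-in for the SentiWordNet 3.0 lexicon cited in the paper, and assumes synsets are enumerated in WordNet's frequency-based order so that the index can serve as the rank i in Equation (3). The bracketing of (4b) is reconstructed from the worked example above: the grouping sign(prev) * (|prev| + (1 - |prev|) * |next|) reproduces the published intermediate values -0.57 and -0.60, so the code implements that reading; treat it as an interpretation rather than a verbatim transcription of the authors' code.

```python
# Sketch of Aspectator's rating step, assuming NLTK's SentiWordNet reader.
from nltk.corpus import sentiwordnet as swn   # requires nltk.download('sentiwordnet')

def word_score(word, pos='a'):
    """Rank-weighted SentiWordNet score in [-1, 1] for a single word (Eq. 3)."""
    weighted, weights = 0.0, 0.0
    for i, ss in enumerate(swn.senti_synsets(word, pos), start=1):
        weighted += (ss.pos_score() - ss.neg_score()) / i
        weights += 1.0 / i
    return weighted / weights if weights else 0.0

def modifier_score(scores, negated=False):
    """Fold word scores (left to right, e.g. [just, plain, stupid]) into one score."""
    current = scores[-1]                      # start from the rightmost word
    for s in reversed(scores[:-1]):
        if current > 0 and s < 0:             # Eq. (4a): a negative word attenuates
            current = current - current * abs(s)
        else:                                 # Eq. (4b): otherwise, intensify
            sign = 1.0 if current >= 0 else -1.0
            current = sign * (abs(current) + (1.0 - abs(current)) * abs(s))
    return -current if negated else current

# Worked example from the paper, using the individual scores given above:
print(round(modifier_score([0.07, 0.12, -0.51]), 2))   # approximately -0.60
```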
Since most of these movies have a majority of positive reviews, in order to obtain a more balanced dataset we first took the five movies with the highest number of negative reviews, and then randomly sampled five other movies from the remaining set. This resulted in a collection of 700 to 850 reviews for each movie. The second dataset consists of reviews of MP3 players taken from Amazon,3 published by Wang, Lu and Zhai [17]. From this dataset we selected the five products with the highest number of reviews in the dataset, obtaining a collection of 500 to 770 reviews for each MP3 player. From these samples, we used eight movies and three MP3 players as our validation set, and the remaining two movies and two MP3 players as our test set. We used the validation set to determine the optimal k for the K-Medoids clustering applied in Sect. 3.2, which should ideally be equal to the total number of unique aspects appearing in a set of reviews. We found that the optimal k is 0.9 times the number of aspect mentions to be clustered. For instance, if 1700 aspect mentions have been identified for a certain product, we set k = 1530. We used the test set consisting of two movies and two MP3 players to manu- ally evaluate our algorithm. For each product, two annotators were given a form with the ten most positive and the ten most negative product aspects, along 2 http://snap.stanford.edu/data/web-Movies.html 3 http://sifaka.cs.uiuc.edu/~wang296/Data/LARA/Amazon/mp3/ 12 W. Bancken, D. Alfarone and J. Davis with the six sentences containing the three most positive and three most nega- tive mentions of each aspect. The annotators were asked to mark each product aspect and each sentence mentioning the aspect as either correct or incorrect. For simplicity, in the form given to the annotators each aspect was only represented by the aspect mention appearing most frequently in the reviews. An aspect is considered correct if it is an interesting and relevant aspect for the considered product, such as “battery life” and “display” for an MP3 player, or “storyline” and the name of an actor for a movie. A sentence listed by our algorithm for a certain aspect is considered correct only if (1) the sentence mentions the consid- ered aspect, and (2) the sentence expresses an opinion towards the considered aspect that matches the polarity extracted by our algorithm for that specific opinion pair. 4.2 Results The accuracy of the top-n aspects is shown in Fig. 3. On average, the 10 most positive and the 10 most negative aspects were considered to be correct in 72.5% of the cases. The sentences mentioning an aspect were only considered correct in 59.8% of the cases. However, this last result can be studied more closely. Table 3 shows the accuracy of these sentences in function of the polarity of both aspects and sentences. Clearly, the detected sentences are generally more accurate when the aspect and the corresponding sentence have the same polarity. This is due to the fact that for an aspect there are typically many more sentences with a matching polarity than sentences with the opposite polarity, so when the top-3 sentences are drawn from a larger number of sentences, these tend to have higher accuracy. 100% 90% 80% % correct aspects 70% 60% Annotator #1 50% 40% Annotator #2 30% 20% 10% 0% 1 2 3 4 5 6 7 8 9 10 top-n aspects Fig. 3. Percentage of top-1, top-3, top-5, top-10 aspects marked as correct by two annotators. Automatically Detecting Product Aspects from Customer Reviews 13 Table 3. 
Percentages of correct sentences split accordingly to the polarity of the sen- tence and the product aspect. Pos. sentence Neg. sentence Neg. sentence Pos. sentence pos. aspect pos. aspect neg. aspect neg. aspect Annotator #1 86.9% 47.8% 66.7% 39.1% Annotator #2 85.0% 53.0% 58.3% 41.3% Average 85.9% 50.4% 62.5% 40.2% Table 4. Positive and negative aspects for the movie Batman & Robin. Each aspect is represented as a cluster of its different mentions. Aspects shown in italics were marked as incorrect by the two annotators, while † indicates a disagreement between annotators. Batman & Robin Positive aspects Negative aspects 1 book movie, film, . . . 2 Uma Thurman dialogue, dialog 3 job, occupation, . . . † line 4 actor, thespian acting, act, . . . 5 Alicium Silverstone † review, reviewer 6 performance, perform, . . . plot 7 Bruce Wayne guy 8 effect, consequence, . . . script 9 costume thing 10 way character, depict, . . . When evaluating the product aspects, the annotators agreed in 88% of the cases, with an inter-annotator agreement of κ = 0.69 according to Cohen’s kappa score. When evaluating the sentences containing mentions, the annotators agreed in 89% of the cases, with κ = 0.785. Table 4 shows an example of the aspects manually evaluated for the movie “Batman & Robin”. A sentence marked as correct positive mention for the aspect “performance” in the same movie is: “Though not classic villains, Arnold Schwarzenegger as Mr. Freeze and Uma Thurman as Poison Ivy give engaging performances.” while for the same aspect a negative mention is: “The performances were horrendous, roll call: George Clooney, Chris O’Donnel, Alicia Silverstone and Arnold.” 14 W. Bancken, D. Alfarone and J. Davis 4.3 Error Analysis We conclude the experimental section by reporting the main sources of errors for our approach. Ambiguity of word polarity. The polarity of an opinion word can vary according to its context. In some cases, SentiWordNet does not cover all possible senses of a word. For instance, SentiWordNet only contains positive sentiment scores for the word “laughable”, while in a sentence such as “The acting is laughable.” the intended sentiment is clearly negative. In some other cases, even though SentiWordNet covers also the correct sense of a word, Aspectator picks the wrong polarity. This is due to the fact that, for simplicity, our algorithm does not perform word sense disambiguation, but instead computes a sentiment score for a term as a weighted sum of the scores of all possible senses of the term. For example, SentiWordNet contains several positive sentiment scores for different senses of the word “joke”, and only one negative score. By using a weighted sum, the overall sentiment score is positive, while in a sentence such as “The dialogue was just a joke.” the word is used with a negative connotation. Inclusion of objective statements. The presence of opinion words in a review does not necessarily imply that the reviewer is expressing an opinion. For instance, when describing the antagonist of a movie, reviewers often use words with a negative connotation without the intention of expressing any judgement. This is the case in sentences like “Javier Bardem is an extremely creepy bad guy.”, where Aspectator incorrectly concludes that a strongly negative opinion is expressed towards Javier Bardem. Limitations of dependency paths. The ten handcrafted dependency paths some- times fail to capture the full sentiment of a sentence. 
To make a concrete exam- ple, consider the sentence “Uma Thurman was really sexy as Poison Ivy. . . and that’s about it.”. If the first part of the sentence was considered in isolation, a human reader would interpret it as a positive opinion about Uma Thurman, and Aspectator does the same. Nevertheless, the second part of the sentence reveals a negative attitude of the reviewer, which our simple dependency paths cannot capture. Incorrect dependency parsing. A last major source of error is introduced by the Stanford dependency parser. Some of the errors are caused by the inability of the Stanford parser to deal with the imprecise, colloquial language typically adopted in on-line product reviews. To make an example, from the sentence “Man, this film is bad.” the parser indicates that the opinion word “bad ” refers to “man”, and not “film”. Additionally, the Stanford parser is not always able to detect compound nouns, as terms like “hard drive” are considered to be adjective-noun pairs. This makes Aspectator interpret the adjective as an opinion expressed towards the Automatically Detecting Product Aspects from Customer Reviews 15 noun, while the compound noun simply represents an aspect mention with no associated opinion. 5 Conclusions We presented Aspectator, a novel algorithm for aspect-based sentiment anal- ysis that takes in input a collection of customer reviews for a specific product, and automatically extracts the most positive and the most negative aspects, to- gether with evidence that supports the extractions. Aspectator first harvests candidate aspects and the associated opinions by matching ten simple hand- crafted dependency paths, then clusters together mentions of the same aspect, and finally computes a sentiment score that expresses the overall orientation towards each aspect. Our approach is domain-independent and does not require any labelled example, thus it can be adopted to analyze customer reviews for products in unseen domains. In a preliminary evaluation, we show that on average the 72.5% of the ex- tracted aspects are relevant, and that sentences that adhere to the overall po- larity of each aspect are correct in 74.2% of the cases. This percentage drops to 45.3% when the sentence polarity does not match the overall aspect polar- ity. Furthermore, we found that most errors in our pipeline are caused by the ambiguity and the complexity of the colloquial language adopted in the reviews. For future work, we are interested in verifying whether, starting from few example opinion pairs, we can learn the dependency paths that we now hand- craft, and discover additional ones that generalize well across multiple domains. Additionally, an extended (and comparative) evaluation is required. Acknowledgements Daniele Alfarone and Jesse Davis acknowledge the generous support of the Re- search Fund K.U. Leuven (CREA/11/015 and OT/11/051), EU FP7 Marie Curie Career Integration Grant (#294068), and FWO-Vlaanderen (G.0356.12). References 1. Baccianella S., Esuli A., Sebastiani F., Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of LREC, volume 10, pages 2200–2204, 2010. 2. Blei D. M., Ng A. Y., Jordan M. I., Latent dirichlet allocation. Journal of Machine Learning research, 3:993–1022, 2003. 3. Brody S., Elhadad N., An unsupervised aspect-sentiment model for online re- views. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 804–812. 
Association for Computational Linguistics, 2010. 4. De Marneffe M.-C., MacCartney B., Manning C. D., Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, volume 6, pages 449– 454, 2006. 16 W. Bancken, D. Alfarone and J. Davis 5. Ganu G., Elhadad N., Marian A., Beyond the stars: Improving rating predictions using review text content. In Proceedings of the 12th International Workshop on the Web and Databases, 2009. 6. Hu M., Liu B., Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177. ACM, 2004. 7. Jiang J. J., Conrath D. W., Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of ROCLING X, 1997. 8. Jo Y., Oh A. H., Aspect and sentiment unification model for online review analysis. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 815–824. ACM, 2011. 9. Kaufman L., Rousseeuw P. J., Finding groups in data: an introduction to cluster analysis. John Wiley & Sons, 2009. 10. Klein D., Manning C. D., Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pages 423–430, 2003. 11. Lin C., He Y., Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 375–384, 2009. 12. McAuley J. J., Leskovec J., From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In Proceedings of the 22nd International Conference on World Wide Web, pages 897–908, 2013. 13. Paltoglou G., Thelwall M., A study of information retrieval weighting schemes for sentiment analysis. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1386–1395, 2010. 14. Qiu G., Liu B., Bu J., Chen C., Expanding domain sentiment lexicon through dou- ble propagation. In Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence, volume 9, pages 1199–1204, 2009. 15. Shima H., WS4J WordNet Similarity for Java. https://code.google.com/p/ ws4j/, 2014. 16. Thet T. T., Na J.-C., Khoo C. S., Aspect-based sentiment analysis of movie reviews on discussion boards. Journal of Information Science, 36(6):823–848, 2010. 17. Wang H., Lu Y., Zhai C., Latent aspect rating analysis without aspect keyword supervision. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 618–626. ACM, 2011. 18. Wang H., Lu Y., Zhai C., Latent aspect rating analysis on review text data: a rating regression approach. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2010. 19. Yessenalina A., Yue Y., Cardie C., Multi-level structured models for document- level sentiment classification. In Proceedings of the 2010 conference on Empirical Methods in Natural Language Processing, pages 1046–1056, 2010. 20. Zhai Z., Liu B., Xu H., Jia P., Clustering product features for opinion mining. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 347–354. ACM, 2011. 21. Zhu J., Wang H., Zhu M., Tsou B. K., Ma M., Aspect-based opinion polling from customer reviews. IEEE Transactions on Affective Computing, 2(1):37–49, 2011. 22. Zhu X., Ghahramani Z., Learning from labeled and unlabeled data with label propagation. 
Technical report, Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002. Mining Meaning from Text by Harvesting Frequent and Diverse Semantic Itemsets Luigi Di Caro and Guido Boella Department of Computer Science University of Turin, Italy {dicaro,boella}@di.unito.it Abstract. In this paper, we present a novel and completely-unsupervised approach to unravel meanings (or senses) from linguistic constructions found in large corpora by introducing the concept of semantic vector. A semantic vector is a space-transformed vector where features repre- sent fine-grained semantic information units, instead of values of co- occurrences within a collection of texts. More in detail, instead of seeing words as vectors of frequency values, we propose to first explode words into a multitude of tiny semantic information retrieved from existing re- sources like WordNet and ConceptNet, and then clustering them into frequent and diverse patterns. This way, on the one hand, we are able to model linguistic data with a larger but much more dense and informa- tive semantic feature space. On the other hand, being the model based on basic and conceptual information, we are also able to generate new data by querying the above-mentioned semantic resources with the fea- tures contained in the extracted patterns. We experimented the idea on a dataset of 640 millions of triples subject-verb-object to automatically inducing senses for specific input verbs, demonstrating the validity and the potential of the presented approach in modeling and understanding natural language. Keywords: Natural Language Understanding, Distributional Seman- tics, Diverse Itemset Mining 1 Introduction Most Computational Linguistics applications may need semantic information to improve their effectiveness. Semantic resources are often constructed with automatic approaches, since manual building of ontologies is not feasible on large scale [24, 42, 19, 37, 31, 5, 7]. Distributional Semantics (DS) is nowadays one of the frontiers in this field [4, 22, 8, 2, 14]. DS derives from the Distributional Hypothesis introduced by Z. Har- ris [23], where Vector Space Models (VSMs) represent its main expression [39]. The current availability of huge corpora like ukWac [17] makes these approaches particularly efficient. Data Mining (DM) techniques leveraging on VSMs have been successfully applied on text since many decades on Topic Extraction-related In: P. Cellier, T. Charnois, A. Hotho, S. Matwin, M.-F. Moens, Y. Toussaint (Eds.): Proceedings of DMNLP, Workshop at ECML/PKDD, Nancy, France, 2014. Copyright c by the paper’s authors. Copying only for private and academic purposes. 18 L. Di Caro and G. Boella tasks [10, 1, 13]. Specifically, terms become interconnected by similarity scores forming concept-like entities, i.e., words clusters sharing similar contexts [9]. DS refines traditional DM on text, since it considers language as a grammar-based type of data. However, DS still sees linguistically-refined tokens as the basic bricks in VSMs, suffering of an intrinsic limitation: a wide range of words and grammar constructions is actually rarely used. Even in very large corpora, there is little chance of finding statistically-significant patterns that can be used to carve out meanings out of them. This is known as the long tail problem [45]. Moreover, DM starts from linguistic items to develop semantic information, with- out reusing it for further analysis. 
This work is based on an interdisciplinary approach that relies on Conceptual Spaces [20], a theory developed by P. Gardenfors in the Cognitive Science field, through which concepts are represented by vectors whose features are cogni- tive axes that humans naturally use to give meaning to their perceptions. In this sense, rather than considering VSMs of linguistic symbols we will consider VSMs of extensive semantic information associated with them, derived from different sources. Our methodology leverages on the wealth of resources available on the web concerning semantics, like Linked Open Data (e.g., DBPedia1 , Freebase2 , etc.), linguistic resources (e.g., WordNet [29], ConceptNet [44], BabelNet [32], etc.), Semantic Web technologies (e.g., FRED [15], TPALO [33], etc.), and au- tomatic parsing of large corpora like Wikipedia to map linguistic contexts into semantic features. An initial proof-of-concept of the proposal is given by recent research in which the transformation of terms into top-level hypernyms carried to improvement in several computational tasks, as in [28, 18]. While this is in line with this paper, this transformation only involves terminological abstractions by means of IS-A substitutions. In fact, this contribution represents a large generalization that takes into account a wider spectrum of conceptual relationships. The outcomes of this work are threefold: – a new methodology that introduces the concept of semantic vectors – a novel technique for mining frequent and diverse itemsets based on set cover problem [41], implemented with an heuristic approach. – a model that generalizes over existing linguistic constructions with low re- source requirements that is also able to generate new linguistic data The paper first presents the motivations and the goals of this work. Then, the approach is explained in terms of methodology and algorithms. An evaluation phase is then presented, showing the data and the pursued objectives. A final part of conclusions and future work ends the paper. 1 http://dbpedia.org/About 2 http://www.freebase.com/ Mining Meaning from Text 19 2 Background and Related Work In Computational Linguistics, recent advances and successful experiences of sta- tistical distributional approaches can be found in different tasks. The IBM Wat- son Question Answering system3 is maybe the most recent and well-known direct result. This also explains the fortunate and growing trend of available semantic resources often constructed with automatic approaches, since manual building of ontologies is not feasible on large scale. Currently, many systems actually try to automatically extract semantic knowledge from texts by means of three pos- sible generic approaches: distributional analysis, pattern-based approaches, and Machine Learning techniques. Nowadays, semantic information extraction is currently approached by dis- tributional analysis of linguistic items over specific contexts [34] or by starting from seeds and patterns to build ontologies from scratch [31]. In some cases, linguistic items are substituted by super-senses (i.e., top-level hypernyms) [28]. However, such generalization should be applied taking into account a wider no- tion of semantic context introduced by related cognitive theories [21], that has not been addressed by current computational approaches. Distributional Semantics (DS) is nowadays one of the frontiers4 within the Computational Linguistics field [3]. DS derives from the distributional hypothesis introduced by Z. Harris in 1954 [23]. 
Vector Space Models (VSMs) [39], proposed by Gerald Salton in the seventies, are the main expression of this idea. Data Mining (DM) techniques fully leveraging on VSMs and Latent Semantic Analysis (LSA) [16] have been successfully applied on text since many decades on topic extraction-related tasks, often producing concept-like entities, i.e., words clusters sharing similar contexts [9]. Current research in DS focuses on the exploration of the different impact of parameters such as context type (i.e., text regions vs. linguistic items), window (context extension), frequency weighting strategy (e.g., number of occurrences, Pointwise Mutual Information, etc.), dimensionality reduction (e.g., Latent Se- mantic Analysis, Random Indexing, etc.), and similarity measure (e.g., Cosine similarity, Jaccard’s coefficient, etc.). Then, it produces co-occurrences matri- ces (or tensors) that model the semantics of the tokens by means of weights distributions. DS refines traditional DM on text, since it considers language as a grammar- based type of data instead of simple unstructured paragraphs. However, DS still sees linguistically-refined tokens (words, lemmas, part-of-speech, etc.) as the basic bricks in VSMs, suffering of an intrinsic limitation: a wide range of words and grammar constructions is actually rarely used. On the contrary, this work concerns a radical departure from this direction, releasing the assumption made by all approaches to rely on linguistic items (ei- ther terms or more context-aware tokens). The current methodology still starts 3 http://www.ibm.com/smarterplanet/us/en/ibmwatson/ 4 See also the ERC project COMPOSES leaded by Marco Baroni. http://clic.cimec.unitn.it/composes/ 20 L. Di Caro and G. Boella from syntax and strings of text to extract semantics, while it would be more reasonable to have an automated approach which also leverages on the existing semantic resources it produces as further input. The idea at the basis of the proposed approach is to conceptually restructure the problem of DS under the light of research in Cognitive Science. The above-mentioned theory of Concep- tual Spaces introduced by Peter Gardenfors is about a concept representation motivated by the notions of conceptual similarity and prototype theory [38]. A conceptual space is a multi-dimensional feature space where points denote ob- jects, and regions define concepts. Its bases are composed by quality dimensions, which denote basic features in which concepts and objects can be compared, such as weight, color, taste and so on. Symbolic representations are particularly weak at modeling concept learning, which is paramount for understanding many cog- nitive phenomena. Concept learning is closely tied to the notion of similarity, which is also poorly served by the symbolic approach. Taking inspiration from this vision of language, as basic bricks of DS we substitute linguistic items with a representation of their meaning in terms of sets of quality dimensions. In detail, each word can be represented by a set of semantic relationships and properties that define a specific concept by means of physical and behavioral facts. For instance, a cat has legs, claws, paws, eyes, etc. (properties); then, it usually chases mouses and it sees in the dark (behaviour); it is an animal and a feline (taxonomical information), and it can have many other relations like the places where it can be habitually found. 
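As an illustration of how a word could be exploded into such fine-grained semantic information, the sketch below collects relation-typed features from WordNet only (hypernym ancestors, part meronyms, member holonyms) via NLTK. The ConceptNet-style behavioral relations mentioned above ("chases mice", "sees in the dark") are omitted, so this is a deliberately partial stand-in for the semantic vectors the paper has in mind, and the relation labels (IS-A, HAS-PART, MEMBER-OF) are illustrative names rather than the paper's.

```python
# Sketch of building a bag of semantic features for a word from WordNet alone.
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def semantic_features(word, max_senses=2):
    """Relation-typed features for the first few noun senses of a word."""
    features = set()
    for synset in wn.synsets(word, pos=wn.NOUN)[:max_senses]:
        for ancestor in synset.closure(lambda s: s.hypernyms()):
            features.add(('IS-A', ancestor.name()))        # taxonomical information
        for part in synset.part_meronyms():
            features.add(('HAS-PART', part.name()))        # physical properties
        for group in synset.member_holonyms():
            features.add(('MEMBER-OF', group.name()))      # group membership
    return features

# Words that rarely co-occur can still share many features; this overlap is what
# the semantic-vector representation exploits to sidestep the long-tail problem.
print(semantic_features('cat') & semantic_features('tiger'))
```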
The Conceptual Spaces (CS) framework developed in the Cognitive Sciences by [20] is based on a vectorial representation of concepts whose features are cognitive axes through which humans naturally give meaning to their perceptions. CS is directly connectable with VSMs, since it is a particular type of VSM whose features represent the conceptual level. Our approach is about injecting semantics into tokens to obtain a concept-level feature set. One of the most important building blocks in almost all Computational Linguistics tasks is the computation of similarity scores between texts at different levels: terms, sentences and discourses. As recently discussed in the literature [36], semantic similarity needs to be cross-level. In the current DS view, the subject-verb-object triple extracted from the sentence "the cat climbs a tree" is seen as equivalent to the triple extracted from "the monkey climbs a tree", since the two subjects share the same linguistic context. In this work, instead, the two situations are differentiated and therefore more deeply understood: in the first case, the model is able to correlate the fact of having claws with the ability to climb a tree; in the second case, this happens for the presence of prehensile feet. This is due to the introduction of semantics within the process of distributional analysis itself: the two animals share physical body parts with a similar kind of functionality. Since only specific semantic information is useful at a time, this new approach can also filter out non-relevant information (for instance, the fact that both are mammals with fur and teeth does not suggest the ability to climb a tree). Nowadays, the extraction of these features is possible thanks to the huge availability of semantic resources. Once linguistic items are replaced by semantic representations, it becomes possible to reuse the methodology itself with, as input, the larger basis of semantic information created by the system, thus creating a positive feedback cycle and enlarging the possibilities of the system. We call this concept the semantic loop and, to the best of our knowledge, it is the first attempt to go beyond single-pass systems that connect syntax with semantics, towards recursive processing of extracted semantics. For example, the action of "seeing" can show a correlation with the fact of having eyes. Currently, the links between actions and the properties of subjects and objects are not used, although they actually provide significant information for deeper language understanding. This paper presents an approach that also relies on the concept of diversity. Diversity has been taken into account mostly in Information Retrieval (IR) scenarios, where systems become aware of the need to obtain search results that cover different aspects of the data [12, 11]. However, this concept can also be useful in other contexts, such as clustering [30] and recommendation [40]. In spite of this, within the Pattern Mining (PM) and Association Rules (AR) areas, to the best of our knowledge, diversity has not yet been addressed. Since our system architecture needs to manage the output of these techniques with the additional goal of producing frequent patterns that cover different aspects of the input, we also revisited them in this sense.
This shift in the basic bricks opens new research questions and challenges concerning Data Mining methodologies: the problem of correlating atomic linguistic items becomes that of correlating sets of features, of which only some are actually significant. Thus, the new challenges are to understand: – which features need to be filtered out; – which features can be combined to approximate concepts (according to Conceptual Spaces). The advantages of the proposed research direction are the following: – the integration of semantic information within the internal steps of the current methodology can create a virtuous loop through which semantic resources can be automatically extended; – linguistic items are fragmented into minimal conceptual information, which enables statistical approaches to overcome the problem of low-frequency word occurrences. In fact, even in very large corpora, there is little chance of finding statistically significant patterns that can be used to carve meanings out of text. This is known as the long tail problem. Statistical approaches are usually not able to propagate information from frequent data to such a long tail. One of the aims of this proposal is to define a linguistic framework in which both rare and frequent words are fragmented into more basic facts on which to reason, avoiding low-frequency issues; – the use of multilingual resources will have an impact on the creation of more powerful semantic resources that are more independent of word-to-word translations. Within the DS field, a minimal step in this direction has already been taken by transforming words into general concepts or super-senses. However, this only involves terminological abstractions by means of IS-A substitutions; our proposal generalizes this by considering a wider spectrum of conceptual relationships. For example, a person can assume the role of living entity, doctor or student in the context of breathing, making surgical interventions, and studying mathematics, respectively. The point is that only specific properties are activated by a context at a time, so we avoid assigning fixed top-level hypernyms for all cases. In addition, the simple generalization of a linguistic item does not extend the current analysis of correlations between atomic tokens. The outcomes of such a novel approach can be many: – a new methodology that introduces the concept of semantic loop, i.e., the iterative use of extracted semantics as input for further extension of semantic resources; – new semantic resources, created by the use of the proposed methodology; – revisitations of Data Mining techniques for dealing with a new and more complex type of data with respect to standard VSMs applied to text; – an impact on how semantic knowledge can be re-used or inherited from data in different languages. For instance, in case there is no translation between two words in two different languages, it will be possible to leverage their semantic information to link them automatically. Only translation at the concept level will be needed (i.e., translation of the new feature space). Thus, the semantic loop can also work for the alignment of different languages. 3 Approach Our proposal concerns an automatic method to build a large-scale semantic framework based on a concept-level distributional analysis of the semantics contained in plain texts.
Our methodology avoids the manual construction of ontologies, which is known to be unfeasible on a large scale. On the contrary, the method goes towards a direct and extensive exploitation of the wealth of available resources regarding semantics. In particular, it leverages different types of knowledge that can be used to transform words (intended as lemmas or generic linguistic items, from now on) into sets of extended and fine-grained semantic information. The resulting explosion of such heterogeneous knowledge, coming from different sources and methods, creates a new challenge: how to align, filter, and merge it in order to feed Vector Space Models with semantics, as opposed to lexical entities. 3.1 Semantic Basis In this paper, we started by focusing on ConceptNet [43], a crowdsourced semantic knowledge base. In detail, the Open Mind Common Sense project developed by MIT collected unstructured common-sense knowledge by asking people to contribute over the Web. ConceptNet, a semantic graph created by parsing such knowledge, is its final outcome. In contrast with linguistic resources like WordNet [29], ConceptNet contains semantic information more related to common-sense facts. For this reason, it has a wider spectrum of semantic relationships, but a much sparser coverage due to the non-methodological approach that was used to build it. For instance, among the more unusual types of relationships (24 in total), it contains information like "ObstructedBy" (i.e., what would prevent something from happening), "CausesDesire" (i.e., what it makes you want to do), and "MotivatedByGoal" (i.e., why you would do it). In addition, it also has classic relationships like "is a" and "part of", as in most linguistic resources. An at-a-glance view of these semantic relations is shown in Table 1.

Table 1. The relations in ConceptNet, with example sentences in English.

Relation           Example sentence
IsA                NP is a kind of NP.
LocatedNear        You are likely to find NP near NP.
UsedFor            NP is used for VP.
DefinedAs          NP is defined as NP.
HasA               NP has NP.
SymbolOf           NP represents NP.
CapableOf          NP can VP.
ReceivesAction     NP can be VP.
Desires            NP wants to VP.
HasPrerequisite    NP|VP requires NP|VP.
CreatedBy          You make NP by VP.
MotivatedByGoal    You would VP because you want VP.
PartOf             NP is part of NP.
CausesDesire       NP would make you want to VP.
Causes             The effect of VP is NP|VP.
MadeOf             NP is made of NP.
HasFirstSubevent   The first thing you do when you VP is NP|VP.
HasSubevent        One of the things you do when you VP is NP|VP.
AtLocation         Somewhere NP can be is NP.
HasLastSubevent    The last thing you do when you VP is NP|VP.
HasProperty        NP is AP.

In spite of this, the approach can work with other resources. For example, another type of knowledge that can have a high impact on our semantic integration comes from Linked Open Data (LOD). One of the most used LOD resources in Computational Linguistics is DBPedia, a dataset containing data directly extracted from Wikipedia. It contains more than 3 million concepts described by 1 billion triples, including descriptions in several languages. Other knowledge bases are UMBEL (a 20k-subject ontology derived from OpenCyc), GeoNames (descriptions of geographical features), and several others. Then, WordNet [29] is a large lexical database of English nouns, verbs, adjectives and adverbs that can further extend the semantic basis. All the words are therein grouped into sets of synonyms (also called synsets), each expressing a distinct concept.
WordNet also contains a set of relationships that link the synsets. For example, synsets can be used to extract "same as" properties from synonyms, hypernyms can be mapped into "is a" taxonomical information, and meronyms can be seen as "part of" features. 3.2 Data for Distributional Analysis In order to test the validity of the approach, we needed to compute a distributional model starting from a large collection of texts. However, instead of parsing corpora from scratch, we used a dataset of subject-verb-object (SVO) triples generated as part of the NELL project5. This dataset contains 604 million triples extracted from the entire dependency-parsed corpus ClueWeb09 (about 230 billion tokens)6. The dataset also provides the frequency of each triple in the parsed corpus. We integrated a Named Entity Recognition module to transform proper names into generic semantic classes, like people and organizations7. 3.3 Algorithm In this section, we explain the details of the approach. The algorithm is composed of three phases: (1) the data pre-processing step, with the generation of two transactional databases (transactions of items, as in the fields of Frequent Itemset Mining and Association Rules [6]) that we also call semantic vectors; (2) the extraction of frequent, closed, and diverse itemsets (we briefly introduce the meaning of these terms in the next paragraphs); and finally (3) the creation of semantic verb models, which generalize and automatically induce senses from entire linguistic constructions at sentence level. Transactional Databases Generation The first step of the algorithm is the generation of the semantic vectors, i.e., vectors whose features represent conceptual and semantic facts rather than document- or context-occurrences. Since the aim of the system is to capture senses from data, we start from the root of the meaning, that is, the verb. So, for a specific input verb v, we parse all the SVO triples in the dataset that have a frequency higher than a set threshold8, and we keep only those whose verbs are morphological variations of v. Then, for each of these triples, we query ConceptNet with the subject term and the object term, retrieving all their semantic information, which will later build the new semantic space. Table 2 shows an example of the information collected in this phase. 5 http://rtw.ml.cmu.edu/resources/svo/ 6 http://lemurproject.org/clueweb09/ 7 We used the Stanford NLP toolkit available at http://www-nlp.stanford.edu/. 8 In our experiments we considered SVO triples that occur at least 100 times in the whole ClueWeb09 corpus, in order to remove noisy data.

Table 2. An example of subject- and object-term semantic transformation for one triple of the verb "to learn" (student-learns-math). This represents one row of the two transactional databases.

Subject-term: student
Subject semantic features: CapableOf-study, AtLocation-at school, IsA-person, Desires-learn, PartOf-class, CapableOf-read book, ...
Object-term: math
Object semantic features: IsA-subject, HasProperty-useful in business, UsedFor-model physical world, ReceivesAction-teach in class, ...

Then, we associate each piece of semantic information with a unique id and construct two transactional databases: one for the semantic information of the subjects, and one for that of the objects. An example of the result of this first phase is shown in Table 3.

Table 3. An example of the two transactional databases created for the verb "to learn" and the ID-label association table.

Transactional DB of the subjects      Transactional DB of the objects
1 34 67 90                            2 4 6 23 67 87 122 198
3 4 12 36 59 88 90 91                 42 54 67 87 122 124
34 67 45                              2 6 54 67 87
...                                   ...

ID    Associated semantic information
1     isa-young person
2     atlocation-classroom
3     atlocation-at school
4     capableof-learn
...   ...
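A minimal sketch of this first phase is given below, assuming the SVO triples have already been filtered for the input verb; the semantic_features function is a hard-coded stand-in for a real ConceptNet lookup, and the toy features are illustrative only.

```python
# Sketch of the transactional-database generation step for one verb.
# `semantic_features` is a hard-coded stand-in for a real ConceptNet query.
from collections import defaultdict

def semantic_features(term):
    # Illustrative features only; a real system would query ConceptNet here.
    toy = {
        "student": ["IsA-person", "AtLocation-at school", "CapableOf-study"],
        "teacher": ["IsA-person", "AtLocation-classroom", "CapableOf-teach"],
        "math":    ["IsA-subject", "ReceivesAction-teach in class"],
        "lesson":  ["IsA-subject", "AtLocation-classroom"],
    }
    return toy.get(term, [])

def build_transactional_dbs(svo_triples):
    """Map each subject/object term to the ids of its semantic features."""
    feature_ids = defaultdict(lambda: len(feature_ids))  # feature -> unique id
    subject_db, object_db = [], []
    for subj, _verb, obj in svo_triples:
        subject_db.append({feature_ids[f] for f in semantic_features(subj)})
        object_db.append({feature_ids[f] for f in semantic_features(obj)})
    return subject_db, object_db, dict(feature_ids)

triples = [("student", "learn", "math"), ("teacher", "learn", "lesson")]
subj_db, obj_db, ids = build_transactional_dbs(triples)
print(subj_db, obj_db, ids)
```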
Diverse Itemsets Mining Once the transactional databases are built for a specific verb v, we use techniques from the field of Frequent Itemset Mining to extract frequent patterns, i.e., semantic features that frequently co-occur in our transactional databases. The problem is described as follows: let I = {i1, i2, ..., in} be a set of items (i.e., our semantic features) and D be a multiset of transactions, where each transaction t is a set of items such that t ⊆ I. For any X ⊆ I, we say that a transaction t contains X if X ⊆ t. The set X is called an itemset. The set of all X ⊆ I (the powerset of I) naturally forms an itemset lattice. The count of an itemset X is the number of transactions in D that contain X. The support of an itemset X is the proportion of transactions in D that contain X. An itemset X is called frequent if its support is greater than or equal to some given percentage threshold s, where s is called the minimum support. When the database contains a significant number of large frequent itemsets, mining all of them can be very expensive, since the space of itemsets to generate can be huge. However, since every subset of a frequent itemset is itself frequent, it can be sufficient to discover only the maximal frequent itemsets (MFIs). A frequent itemset X is called maximal if there does not exist a frequent itemset Y such that X ⊂ Y. Mining frequent itemsets can thus be reduced to mining a "border" in the itemset lattice: all itemsets above the border are infrequent and all those below the border are frequent. Another type of frequent itemset, called closed frequent itemset (CFI), was proposed in [35]. A frequent itemset X is closed if none of its proper supersets has the same support. In our experimentation, we used the SPMF library for finding closed frequent itemsets9, applying the CHARM algorithm [46]. This is done for both transactional databases (the subject and object databases associated with the verb v). Since our aim is to capture all the linguistic senses, i.e., the different meanings connectable to the use of a specific verb, we also need to obtain itemsets that cover all the items found in frequent itemsets. In other words, we want to extract diverse itemsets, i.e., a minimal set of frequent and closed itemsets that cover all the frequent items. The concept of diversity has mostly been used in Information Retrieval tasks, and to the best of our knowledge there is no attempt at capturing this kind of diverse itemset in the current literature. In order to produce these novel types of frequent itemsets, we viewed the problem as a set cover problem [41], implementing a heuristic-based approach to address it. Given a set of elements U = {i1, i2, ..., im} (called the universe) and a collection S of n sets whose union equals the universe, the set cover problem is to identify the smallest subset of S whose union equals the universe. The only parameter of the algorithm is the percentage of diversity div that a candidate itemset must have with respect to the ones already selected. 9 http://www.philippe-fournier-viger.com/spmf/index.php
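To make the definitions above concrete, the following is a tiny brute-force illustration of frequent and closed itemsets over a toy database; the actual experiments used the CHARM implementation in SPMF, and this sketch is not that algorithm.

```python
# Brute-force frequent and closed itemsets over a toy transactional database.
# Only for illustrating the definitions; real mining should use CHARM/SPMF.
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    items = sorted(set().union(*transactions))
    n = len(transactions)
    freq = {}
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            cand = frozenset(cand)
            support = sum(cand <= t for t in transactions) / n
            if support >= min_support:
                freq[cand] = support
    return freq

def closed_itemsets(freq):
    # Closed: no proper superset with exactly the same support.
    return {x: s for x, s in freq.items()
            if not any(x < y and freq[y] == s for y in freq)}

db = [{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}]
print(closed_itemsets(frequent_itemsets(db, min_support=0.5)))
```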
The main cycle of the algorithm then runs over the closed itemsets, starting from the ones with the highest cardinality (i.e., the ones that cover most of the items). For each candidate itemset, if its current percentage of diversity exceeds div, it is added to the result set. In our experiments, we set the initial value of div to 0.5 (candidate itemsets must have at least half of their items not already present in the selected itemsets). In case the insertion phase ends without having covered all the items contained in the input closed itemsets, the value is decreased by a certain factor alpha (set to 0.1 in our experiments). This way, the algorithm is guaranteed to terminate.
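A minimal sketch of this greedy, set-cover-style selection is given below, assuming the closed frequent itemsets have already been mined; the parameter names div and alpha follow the description above.

```python
# Greedy selection of "diverse" itemsets: keep a candidate only if enough of
# its items are new with respect to what was already selected, and relax the
# threshold by alpha until all frequent items are covered.
def diverse_itemsets(closed_itemsets, div=0.5, alpha=0.1):
    universe = set().union(*closed_itemsets)           # all frequent items
    selected, covered = [], set()
    candidates = sorted(closed_itemsets, key=len, reverse=True)
    while covered != universe and div > 0:
        for itemset in candidates:
            new_items = itemset - covered
            if new_items and len(new_items) / len(itemset) >= div:
                selected.append(itemset)
                covered |= new_items
        div -= alpha                                    # relax and retry
    return selected

closed = [frozenset({1, 2, 3, 4}), frozenset({3, 4, 5}), frozenset({6, 7})]
print(diverse_itemsets(closed))
```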
Verb Model Construction In the final phase, once the frequent and diverse itemsets have been obtained for both transactional databases, we connect all the subject-itemsets with all the object-itemsets, weighting each connection according to their co-occurrences in the same triples of the original dataset. The semantic verb model constructed for a specific verb v is thus a set of weighted connections between frequent and diverse semantic features belonging to the subjects of v and frequent and diverse semantic features of the objects of v. On the one hand, this is a way to summarize the semantics suggested by the verb. On the other hand, it is also a result that can be used to generate new data by querying existing semantic resources with such semantic subject- and object-itemsets. Notably, this can be done without looking for words similar to frequent subjects and objects, but by finding new subjects and objects that, even if not similar in general, have certain semantic information that fills a specific context. The resulting models are automatically calculated and very concise, since in the whole large and sparse semantic space only a few features are relevant to certain meanings (headed by the verb). This is also in line with what is stated in [26], where the authors claim that semantics is actually structured by low-dimensionality spaces that are covered up in high-dimensional standard vector spaces. 4 Experiments and Results In this section we present the results of the approach on different cases. In particular, we extracted all the triples in the dataset containing different verbs like to play, to eat, to sing, and so forth. Then, for each of these verbs we executed the algorithm and extracted the models, i.e., sets of weighted pairs of diverse subject- and object-itemsets. Table 4 shows some examples of the automatically extracted semantic information. In the experiments, we wanted to evaluate the quality of the constructed models and their ability to generalize over the input data, also taking into account their size in comparison with classic word-based vector spaces. On the one hand, the approach is able to model the meanings expressed by complete verbal phrases with minimal resource requirements, as shown in Figure 1. In fact, starting from hundreds of verbal instances, the method produces itemsets with a feature space much smaller than common word spaces, in which words and chunks are represented by vectors of the order of thousands of features. For instance, in the presented example, with a minimum support of 0.05 (i.e., 5%), the resulting model is constituted by 4 diverse itemsets for the objects and 24 for the subjects, with an average itemset cardinality of 18.5 and 12.6 respectively, covering more than 50% of the semantic features of all the input triples. On the other hand, we calculated the coverage of the extracted models, that is, the percentage of subject-verb-object triples in the input data in which at least one item is included in the extracted diverse itemsets. These results are shown in Figure 2. Notice that the coverage of the diverse itemsets is always equal to the coverage of the closed itemsets, even if the former are fewer than (or equal to) the latter.

Table 4. Examples of the main semantic information that is automatically induced for subjects and objects, tested on various verbs.

to pay: subject features: isa-person; object features: relatedto-money.
to read: subject features: isa-person, notcapableof-fly, desires-clothe, capableof-think, capableof-love, capableof-talk to each other, desires-privacy, partof-society, capableof-voice opinion, hasa-name; object features: usedfor-read, atlocation-library, atlocation-newspaper.
to visit: subject features: isa-person; object features: atlocation-city, atlocation-museum, partof-web site, usedfor-entertainment.
to eat: subject features: isa-mammal, capableof-fear death, capableof-cook dinner, capableof-run, capableof-eat, capableof-pay bill, atlocation-earth, ...; object features: atlocation-oven, usedfor-eat, atlocation-store, hasproperty-delicious, atlocation-tree, isa-food, atlocation-restaurant, ...
to play: subject features: notcapableof-fly, isa-mammal, capableof-think, atlocation-earth, desires-laugh, capableof-hear noise, capableof-experience joy, partof-society, ...; object features: atlocation-movie theater, hasproperty-fun, atlocation-theatre, isa-story, usedfor-entertainment, hasproperty-entertain, capableof-tell story, usedfor-learn, ...
to sing: subject features: isa-person, capableof-think, capableof-love, atlocation-earth; object features: partof-album, usedfor-pleasure yourself, atlocation-record, usedfor-have fun, hasproperty-melodic, usedfor-express feel and emotion, isa-composition of music, createdby-composer, atlocation-on cd, usedfor-entertainment, ...

To the best of our knowledge, this is the first attempt to model entire subject-verb-object linguistic constructions in terms of senses at sentence level, automatically carved out from the data by deeply analyzing co-occurring fine-grained semantic information instead of lexical and syntactic chunks. We think that further efforts in this direction can significantly change the vision and the horizon of current Natural Language Understanding goals, as well as the management of large collections of textual data with concise, generative, and semantics-based models.

Fig. 1. Size of closed (blue line) and diverse (red line) itemsets w.r.t. minimum support, and average itemset cardinality (green line). The plot on the left (a) is for the subjects of the example verb "to sing", while the plot on the right (b) is for its direct objects.

Fig. 2. Coverage w.r.t. minimum support. The plot on the left (a) is for the subjects of the example verb "to sing", while the plot on the right (b) is for its direct objects.

5 Conclusions This contribution represents a first effort to pass from standard word vectors to semantic vectors. This raises new challenges, such as the alignment and the filtering of heterogeneous semantic information.
Still, such a shift in the basic bricks also concerns Data Mining techniques, since the problem of correlating linguistic items becomes that of correlating sets of semantic features, of which only some are actually significant. In this paper, we presented an approach that connects Natural Language Processing techniques (lexico-syntactic analysis, syntactic parsing10 and Named Entity Recognition) with Pattern Mining approaches like Frequent Itemset Mining and the set cover problem. To produce semantic vectors, we started by using ConceptNet, one of the largest semantic resources currently available. In future work we will also come back to lexico-syntactic parsing of large corpora like Wikipedia for the extraction of further semantic information directly from text. 10 We refer here to the subject-verb-object input structures used. The impact of this new research direction can be extremely high. The main question this proposal wants to raise is the following: what if computational systems could directly reason on semantics instead of syntax? Future NLP technologies could move beyond language towards more complex meaning understanding, also dealing with unseen and low-frequency words. By reducing the typically huge vector spaces based on linguistic items to synthetic conceptual matrices, we also attack the Big Data problem for textual databases. For example, if we consider the term "color", a linguistics-based vectorial representation would contain hundreds of terms that usually co-occur with it, such as "pastel", "dark", "light", "red", "brilliant", and so forth. In Wikipedia, for instance, we found more than 500 adjectival lemmas that co-occur with this term. On the other hand, the concept of "color" can potentially be represented by a few dimensions; for instance, the HSV scheme uses only three: brightness, hue, and saturation. We evaluated the approach by its ability to reduce the space and generalize over the input data. In future work, we will also measure the approach on tasks like Ontology Learning and Question Answering. This paper also introduces the concept of the semantic loop, i.e., the recursive use of extracted semantics as input for further extensions. The use of this methodology can create new and extended semantic resources. Finally, we will leverage data compression techniques like Multi-Dimensional Scaling (MDS) [27], Principal Component Analysis (PCA) [25] and tensor decompositions to transform combinations of properties into reduced spaces capturing the most significant part of the data (thanks to their ability to approximate information while preserving the maximum level of expressivity). Cognitive psychology has made extensive use of such techniques in a wide variety of applications where the explanation of cognitive processes can be derived directly from them. References 1. Alvanaki, F., Sebastian, M., Ramamritham, K., Weikum, G.: Enblogue: emergent topic detection in web 2.0 streams. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. pp. 1271–1274. ACM (2011) 2. Baroni, M.: Composition in distributional semantics. Language and Linguistics Compass 7(10), 511–522 (2013) 3. Baroni, M., Bernardi, R., Zamparelli, R.: Frege in space: A program for compositional distributional semantics. Submitted, draft at http://clic.cimec.unitn.it/composes (2013) 4. Baroni, M., Lenci, A.: Distributional memory: A general framework for corpus-based semantics.
Computational Linguistics 36(4), 673–721 (2010) 5. Biemann, C.: Ontology learning from text: A survey of methods. In: LDV forum. vol. 20, pp. 75–93 (2005) 6. Borgelt, C.: Frequent item set mining. Wiley Interdisciplinary Reviews: Data Min- ing and Knowledge Discovery 2(6), 437–456 (2012) 7. Buitelaar, P., Cimiano, P., Magnini, B.: Ontology learning from text: An overview. Ontology learning from text: Methods, evaluation and applications 123, 3–12 (2005) 8. Cabrera, J.M., Escalante, H.J., Montes-y Gómez, M.: Distributional term represen- tations for short-text categorization. In: Computational Linguistics and Intelligent Text Processing, pp. 335–346. Springer (2013) Mining Meaning from Text 31 9. Candan, K., Di Caro, L., Sapino, M.: Creating tag hierarchies for effective navi- gation in social media. In: Proceedings of the 2008 ACM workshop on Search in social media. pp. 75–82. ACM (2008) 10. Cataldi, M., Di Caro, L., Schifanella, C.: Emerging topic detection on twitter based on temporal and social terms evaluation. In: Proceedings of the Tenth International Workshop on Multimedia Data Mining. p. 4. ACM (2010) 11. Cataldi, M., Di Caro, L., Schifanella, C.: Immex: Immersive text documents ex- ploration system. In: Content-Based Multimedia Indexing (CBMI), 2011 9th In- ternational Workshop on. pp. 1–6. IEEE (2011) 12. Cataldi, M., Schifanella, C., Candan, K.S., Sapino, M.L., Di Caro, L.: Cosena: a context-based search and navigation system. In: Proceedings of the International Conference on Management of Emergent Digital EcoSystems. p. 33. ACM (2009) 13. Chen, Y., Amiri, H., Li, Z., Chua, T.S.: Emerging topic detection for organizations from microblogs. In: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. pp. 43–52. ACM (2013) 14. Croce, D., Storch, V., Annesi, P., Basili, R.: Distributional compositional seman- tics and text similarity. In: Semantic Computing (ICSC), 2012 IEEE Sixth Inter- national Conference on. pp. 242–249. IEEE (2012) 15. Draicchio, F., Gangemi, A., Presutti, V., Nuzzolese, A.G.: Fred: From natural language text to rdf and owl in one click. In: The Semantic Web: ESWC 2013 Satellite Events, pp. 263–267. Springer (2013) 16. Dumais, S.T.: Latent semantic analysis. Annual review of information science and technology 38(1), 188–230 (2004) 17. Ferraresi, A., Zanchetta, E., Baroni, M., Bernardini, S.: Introducing and evaluating ukwac, a very large web-derived corpus of english. In: Proceedings of the 4th Web as Corpus Workshop (WAC-4) Can we beat Google. pp. 47–54 (2008) 18. Flati, T., Navigli, R.: Spred: Large-scale harvesting of semantic predicates. In: Pro- ceedings of 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria (2013) 19. Fortuna, B., Mladenič, D., Grobelnik, M.: Semi-automatic construction of topic ontologies. Semantics, Web and Mining pp. 121–131 (2006) 20. Gärdenfors, P.: Conceptual spaces: The geometry of thought. MIT press (2004) 21. Gibson, J.: The concept of affordances. Perceiving, acting, and knowing pp. 67–82 (1977) 22. Grefenstette, E., Sadrzadeh, M.: Experimental support for a categorical compo- sitional distributional model of meaning. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. pp. 1394–1404. Association for Computational Linguistics (2011) 23. Harris, Z.: Distributional structure. Word 10(23), 146–162 (1954) 24. Hearst, M.: Automatic acquisition of hyponyms from large text corpora. 
In: Pro- ceedings of the 14th conference on Computational linguistics-Volume 2. pp. 539– 545. Association for Computational Linguistics (1992) 25. Jolliffe, I.: Principal component analysis. Wiley Online Library (2005) 26. Karlgren, J., Holst, A., Sahlgren, M.: Filaments of meaning in word space. In: Advances in Information Retrieval, pp. 531–538. Springer (2008) 27. Kruskal, J.B., Wish, M.: Multidimensional scaling, vol. 11. Sage (1978) 28. Lenci, A.: Carving verb classes from corpora. Word Classes. A cura di Raffaele Simone e Francesca Masini. Amsterdam-Philadelphia: John Benjamins p. 7 (2010) 29. Miller, G.A.: Wordnet: a lexical database for english. Communications of the ACM 38(11), 39–41 (1995) 32 L. Di Caro and G. Boella 30. Morik, K., Kaspari, A., Wurst, M., Skirzynski, M.: Multi-objective frequent termset clustering. Knowledge and information systems 30(3), 715–738 (2012) 31. Navigli, R., Velardi, P., Faralli, S.: A graph-based algorithm for inducing lexical taxonomies from scratch. In: Proceedings of the Twenty-Second international joint conference on Artificial Intelligence-Volume Volume Three. pp. 1872–1877. AAAI Press (2011) 32. Navigli, R., Ponzetto, S.P.: Babelnet: Building a very large multilingual semantic network. In: Proceedings of the 48th annual meeting of the association for compu- tational linguistics. pp. 216–225. Association for Computational Linguistics (2010) 33. Nuzzolese, A.G., Gangemi, A., Presutti, V., Draicchio, F., Musetti, A., Ciancarini, P.: Tı̀palo: A tool for automatic typing of dbpedia entities. In: The Semantic Web: ESWC 2013 Satellite Events, pp. 253–257. Springer (2013) 34. Padó, S., Lapata, M.: Dependency-based construction of semantic space models. Computational Linguistics 33(2), 161–199 (2007) 35. Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Discovering frequent closed item- sets for association rules. In: Database TheoryICDT99, pp. 398–416. Springer (1999) 36. Pilehvar, M.T., Jurgens, D., Navigli, R.: Align, disambiguate and walk: A unified approach for measuring semantic similarity. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013) (2013) 37. Ponzetto, S., Strube, M.: Deriving a large scale taxonomy from wikipedia. In: Proceedings of the national conference on artificial intelligence. vol. 22, p. 1440. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999 (2007) 38. Rosch, E.: Principles of categorization. Concepts: core readings pp. 189–206 (1999) 39. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic index- ing. Commun. ACM 18(11), 613–620 (Nov 1975), http://doi.acm.org/10.1145/ 361219.361220 40. Sigurbjörnsson, B., Van Zwol, R.: Flickr tag recommendation based on collective knowledge. In: Proceedings of the 17th international conference on World Wide Web. pp. 327–336. ACM (2008) 41. Slavı́k, P.: A tight analysis of the greedy algorithm for set cover. In: Proceedings of the twenty-eighth annual ACM symposium on Theory of computing. pp. 435–441. ACM (1996) 42. Snow, R., Jurafsky, D., Ng, A.: Learning syntactic patterns for automatic hypernym discovery. Advances in Neural Information Processing Systems 17 (2004) 43. Speer, R., Havasi, C.: Representing general relational knowledge in conceptnet 5. In: LREC. pp. 3679–3686 (2012) 44. Speer, R., Havasi, C.: Conceptnet 5: A large semantic network for relational knowl- edge. In: The Peoples Web Meets NLP, pp. 161–176. Springer (2013) 45. 
Wu, F., Hoffmann, R., Weld, D.S.: Information extraction from wikipedia: Moving down the long tail. In: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 731–739. ACM (2008) 46. Zaki, M.J., Hsiao, C.J.: Charm: An efficient algorithm for closed itemset mining. In: SDM. vol. 2, pp. 457–473. SIAM (2002) Text Mining for Open Domain Semi-Supervised Semantic Role Labeling Quynh Ngoc Thi Do1, Steven Bethard2, and Marie-Francine Moens1 1 Katholieke Universiteit Leuven, Belgium 2 University of Alabama at Birmingham, United States In: P. Cellier, T. Charnois, A. Hotho, S. Matwin, M.-F. Moens, Y. Toussaint (Eds.): Proceedings of DMNLP, Workshop at ECML/PKDD, Nancy, France, 2014. Copyright © by the paper's authors. Copying only for private and academic purposes. Abstract. The identification and classification of circumstance semantic roles like Location, Time, Manner and Direction, a task of Semantic Role Labeling (SRL), plays a very important role in building text understanding applications. However, the performance of current SRL systems on those roles is often very poor, especially when the systems are applied to domains other than the ones they are trained on. We present a method to build an open domain SRL system, in which the training data is expanded by replacing its predicates with words from the testing domain. A language model, which can be considered a text mining technique, and some linguistic resources are used to select from the vocabulary of the testing domain the best words for the replacement. We apply our method to the case study of transferring a semantic role labeler trained on the news domain to the children's story domain. It gives us valuable improvements on the four circumstance semantic roles Location, Time, Manner and Direction. 1 Introduction Playing an essential role in text understanding, Semantic Role Labeling is the natural language processing task that specifies "Who did What to Whom, and How, When and Where?" in text [12]. For example, the processing of the sentence "Mary gave Peter a book at school yesterday" should result in the identification of a "giving" event with "Mary" as the Agent of the event, "Peter" as the Recipient and "a book" as the Item being given. The Location of the "giving" event, or where it took place, is "at school" and the Time of the event is "yesterday". In this paper, we call an event (the "giving" event) in a sentence the semantic frame, the verb or noun that evokes the frame ("gave") the predicate, the words ("Mary", "Peter", "a book", "at school", "yesterday") that play a role in the event the arguments, and their roles ("Agent", "Recipient", "Thing being given", "Location", "Time") the semantic roles. The task of semantic role labeling is to detect the event, to identify its arguments and to assign the correct semantic roles to them. Thanks to the availability of semantically annotated resources (e.g. PropBank3, FrameNet4), supervised machine learning approaches have been very successful in constructing automatic semantic role labellers. Assuming the predicates are already given, those systems can reach an F15 score of 85% when the training and testing data are in the same domain. But, when testing on other domains, the scores often drop significantly6. Text mining is the task of automatically discovering new, previously unknown information from unstructured document collections.
Meanwhile, a language model tries to capture the properties of a language, and predicts the next word in a word sequence. It is trained on a collection of unlabeled texts, and can therefore be considered a text mining technique. Recently, there have been some attempts to use such language models in a semi-supervised setting for semantic recognition [5], [7], in which other words, or a statistical class of words provided by the language model, that could be exchanged at a certain position in a sentence or phrase, enrich the feature vectors used in training, or are used to create training examples artificially. However, there is no principled way to use such language information. In this paper, we develop a methodology to generate additional training data for SRL by replacing selected verbal predicate words in training examples using a language model. For each selected predicate in the training examples, a list of replacement words which we believe can occur at the same position as the selected word is generated from the vocabulary of the domain that the SRL is applied to. We introduce and explore a variety of features for identifying how words should be replaced, including predicate vs. argument status, POS, WordNet-related words, and a replacement score based on a language model. As an experiment, we present a case study of improving the performance of an SRL system trained on the news domain when it is applied to the children's story domain. The case study is based on our ongoing European project "Machine Understanding for interactive StorytElling" (MUSE)7. One of the fundamental goals of MUSE is to detect actors, actions and plots in children's stories, and render them as 3D worlds. SRL, with its function of identifying the events in texts, plays an essential role in solving our problem. Among the set of semantic roles, some circumstance semantic roles like Location, Time etc. are very important to understand the full meaning of an event, while the performance of current SRL systems on them is often very poor, especially when testing on a domain other than the one they are trained on. Thus, in our case study, we target the improvement of SRL on the four PropBank circumstance roles: AM-LOC (Location), AM-TMP (Time), AM-MNR (Manner) and AM-DIR (Direction). In the next sections, we present related work (Section 2), linguistic resources (Section 3), underlying assumptions, objectives and task definition (Section 4), methodology (Section 5), case study (Section 6), and conclusion (Section 7). 3 http://verbs.colorado.edu/~mpalmer/projects/ace.html 4 https://framenet.icsi.berkeley.edu 5 Harmonic mean of recall and precision. 6 http://ufal.mff.cuni.cz/conll2009-st/ 7 http://www.muse-project.eu/ 2 Related Work Semi-supervised approaches to semantic role labeling have recently received the attention of the computational linguistics community. Information from language models has been used as extra features to improve the performance of SRL. [17] use deep learning techniques based on semi-supervised embeddings to improve an SRL system. [3] pursue this track further and use a deep neural network architecture to obtain good word representations in the form of word embeddings. The word embeddings define the related words which are the result of the neural network training and are usually referred to as language models.
[16] use word embeddings obtained by recurrent neural networks to recover the syntactic structure of a sentence, but the method is not applied to semantic role labeling. Along these lines, a number of language models with hidden layers have been developed based on generative probabilistic approaches and applied to semantic role labeling. [5] define a latent words language model as a graphical model where at each word position in a text a distribution of exchangeable words is generated. The authors use a hidden Markov language model with dependencies defined on the two previous and two following words in the discourse, and in a subsequent paper [4] explain approximate methods to train such a model, among which is Gibbs sampling. In this model, each hidden variable or latent word generates a distribution over the entire vocabulary of the training data set. The model improves the performance of SRL on the CoNLL 2008 dataset, especially when few training data are given to the learner. [7] propose a hidden Markov model that learns the distribution of a hidden variable that can take K different values, where the hidden variable depends on the previous hidden variable in the sentence. In contrast to [5], each hidden variable can generate a span or sequence of words instead of a single word. The span contains the sequence of words covering the word under consideration and the predicate. Each latent variable represents a distribution over categories of words. The model is trained with a Baum-Welch algorithm. In both [5] and [7], the most probable hidden word or category of words, respectively, is used as an extra feature in the feature vector used for recognition. In [7], several "hidden" features are used, each being the result of a different initialization of the Baum-Welch algorithm. Although appealing, these latent words language models have disadvantages. The model of [5] yields a distribution over all vocabulary words, raising the need to select possibly only the most probable ones when using them in the feature representation. The model of [7] relies on a fixed number of categories (or latent topics) that form the hidden variables, but it is not clear how to choose such a number, especially when word spans of different sizes are used as observed variables. In this paper, we aim at using a more flexible approach where such free parameters are replaced by the use of linguistic knowledge. Besides the semi-supervised approaches that extend the feature set of SRL, there are other attempts to generate new training examples automatically by using unlabeled data. [6] automatically generate training examples by considering the lexical and syntactic similarity between labeled and unlabeled sentences as a graph alignment problem. The language model of [5] has also been used to generate new training examples by replacing the head word of temporal expression training examples in the task of temporal expression recognition. None of the above works considers both structural similarity and language models as a source of evidence for generating training examples, nor do they evaluate different approaches to similarity depending on the roles sought. In contrast to most of the above works, we evaluate the proposed methods when porting the learned model to texts from a domain that is different from the one the semantic role labeler was trained on. 3 Linguistic resources The Penn Proposition Bank (PropBank) [13] provides a corpus annotated with semantic roles.
In this resource, a semantic frame evoked by a verb is represented as a role set. Each role set is linked to a specific sense of the verb; therefore, each verb has several role sets corresponding to its possible senses. The list of role sets and their semantic roles for each verb is defined in a frame file. For example, in the sentence "Mary gave Peter a book at school yesterday", the role set give.01 with the meaning of "transfer", evoked by the verb "give", has three main arguments A0, A1 and A2, which are "Agent", "Theme", and "Recipient", respectively: "[Mary A0] gave (give.01) [Peter A2] [a book A1] [at school AM-LOC] [yesterday AM-TMP]". The main PropBank semantic roles are listed in Table 1.

Table 1. Main PropBank semantic roles

Role      Description
A0        Agent (external argument)
A1        Patient/Theme (internal argument)
A2        Indirect object / beneficiary / instrument / attribute / end state
AM-LOC    Location (where?)
AM-TMP    Temporal marker (when?)
AM-MNR    Manner
AM-DIR    Direction

In VerbNet [15], English verbs are grouped into different classes, adapting the previous verbal classification of [8]. Each verbal class takes different thematic roles and certain syntactic constraints that describe their superficial behavior. VerbNet's hierarchical verb classes establish a set of possible thematic roles [8]. However, the semantic roles in VerbNet are more thematic than the ones in PropBank. For example, in VerbNet, the Agent label is used instead of the A0 label of PropBank, and Patient and Theme can be mapped to the PropBank label A1. In Table 2, there is an example of the information that VerbNet contains for the class give-13.1.1. Members of this class share the same syntactic pattern (NP V NP PP) with corresponding thematic roles (Agent V Recipient Theme Asset). Thus, the two verbs "give" and "sell" in the two sentences "Mary gave Peter a book for 20 EUR" and "Mary sold Peter a book for 20 EUR", which have the same syntactic pattern, evoke two semantic frames with the same semantic role patterns: "[Mary Agent] gave (give.01) [Peter Recipient] [a book Theme] [for 20 EUR Asset]" and "[Mary Agent] sold (sell.01) [Peter Recipient] [a book Theme] [for 20 EUR Asset]".

Table 2. VerbNet class Give-13.1.1

Class: Give-13.1.1
Roles: Agent, Theme, Recipient, Asset
Members: give, hawk, hock, lease, pawn, rent, sell
Frame: NP V NP PP.asset (Agent V Recipient Theme {at, for, on} Asset) ...

SemLink8 is a project whose aim is to link together different lexical resources via a set of mappings. These mappings make it possible to combine the different information provided by these lexical resources for tasks such as inferencing. The mapping between VerbNet and PropBank is available in SemLink: each frame in PropBank is linked to a suitable VerbNet class and each role label in the PropBank frame is mapped to a VerbNet role label. Table 3 shows the mapping from the PropBank role set "give.01" to the VerbNet class "13.1.1".

Table 3. Mapping from PropBank role set "give.01" to VerbNet class "13.1.1"

PropBank role label    VerbNet role label
A0                     Agent
A1                     Theme
A2                     Recipient

WordNet [11] is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. Noun and verb synsets are arranged into hierarchies.
The main relation among words in WordNet is synonymy, as between the words "shut" and "close" or "car" and "automobile". Each of WordNet's 117,000 synsets is linked to other synsets by means of a small number of "conceptual" relations. The most frequently encoded relation among synsets is the super-subordinate relation (also called hyperonymy, hyponymy or IS-A relation). It links more general synsets like furniture, piece of furniture to increasingly specific ones like bed and bunkbed. 8 http://verbs.colorado.edu/semlink/ 4 Underlying assumptions, objectives and task definition Semi-supervised learning is very difficult to accomplish in natural language processing tasks. In general, it is successful when the labeled training data contain seed examples that are representative of the whole data set and when the cluster hypothesis holds, that is, when a suitable similarity metric can correctly cluster the unlabeled examples with the labeled seed examples [2]. With regard to semantic role labeling, for the cluster hypothesis to hold, it is required that "similar" or exchangeable syntactic structures and lexical words found in the labeled and unlabeled examples cluster the linguistic phrases that form a specific semantic role. We assume that a language model (e.g., [10], [4]), with valuable generic information on both frequent and infrequent legitimate linguistic expressions, gives us exchangeable words in context. The exchangeable words are considered as a cluster of words playing the same role in forming a specific semantic role. In this respect, the goals of this paper are to: – set up a methodology for choosing unlabeled examples, guessing their labels, and then using them as new training data to improve the performance of a semantic role labeler; – evaluate the methodology in our case study, in which the SRL model is trained on the news domain and applied to the children's story domain. The notation of the symbols used in this paper is given in Table 4.

Table 4. Notation of the symbols used in this paper.

Symbol    Meaning
Sl        Set of manually annotated sentences
St        Testing set
Sul       Set of unlabeled sentences used to train the language model
Su        Set of unlabeled sentences generated automatically
Snl       Set of automatically annotated semantic frames of Su
Stemp     Set of tuples of (sentence, word to be replaced, list of replacement words)
Ssl       Set of semantic frames selected for the replacement
V         Vocabulary of St
N         Maximum number of replacement words per replacement candidate
z         Context window used to calculate the replacement score

The task of the semi-supervised semantic role labeler discussed in this paper is to learn, from a set of manually annotated sentences, a set of unannotated sentences, a language model and some linguistic resources, a model that assigns semantic roles to the semantic frames of the sentences in a test set. A sentence may contain more than one frame. Each semantic frame consists of one predicate that evokes the frame and several arguments that play a role in the frame. Predicates and arguments may be composed of more than one word. In this paper, we work with verbal predicates, and use head word labeling, which means that if an argument consists of more than one word, the semantic role is assigned only to the head word.
For instance, if "in the park" is the argument playing the role AM-LOC, then only the head word of the phrase "in the park", "in", is labeled with AM-LOC. Given a sentence s composed of n words w1, w2, ..., wn, for each semantic frame f in s, each word wi (i ∈ {1, 2, ..., n}) has received a label ri ∈ R ∪ {NULL} during the manual annotation for training, or will receive a label ri ∈ R ∪ {NULL} during testing or evaluation, where R is a set of predefined semantic roles and NULL means the empty label. If ri ≠ NULL, then wi is the head of an argument of f with ri as the semantic role. In the approach that we describe in this paper, R is the set of PropBank semantic roles (see Table 1). Instead of training an SRL system only on the given manually annotated sentences Sl, we generate a set of new training examples by using a language model trained on a set of unannotated sentences, and then use them together with Sl as the training data for the SRL. 5 Methodology The steps of our methodology to generate new training examples and train an SRL system are shown in Figure 1. Given a manually annotated sentence set Sl which can be used as training data, a set of unannotated sentences Sul, a vocabulary V including words in the domain of the testing data St, a language model L, and some linguistic resources, we create new training examples and train the SRL system as follows. First, L is trained on Sul. Then, L, Sl, V and the linguistic resources are used to generate new training examples Snl. The algorithm to generate new training examples is presented below. Finally, the SRL system is trained on Sl ∪ Snl.

Fig. 1. Overview of the methodology to generate new training instances and train a SRL system.

5.1 General model for generating new training instances In what follows, we present our general model to generate new training instances for the semantic role labeling task. The details of our algorithm are given in Figure 1 (in the dashed rectangle) and Algorithm 1. It consists of five main steps: Step 1. Selecting data for replacement. We select a set of semantic frames from Sl to be used for the replacement. The selection can be performed in different ways depending on the specific case study. For each selected semantic frame, we choose its predicate as the word to be replaced. For example, given the sentence "Mary gave a book to Peter at school", the semantic frame "giving" has "gave" as predicate, and "gave" is the word selected for the replacement. Step 2. Generating replacement words for each word selected to be replaced. A statistical language model assigns a probability to a sequence of m words by means of a probability distribution. For each word selected to be replaced, we use the language model L trained on Sul, and the vocabulary V of the testing domain, to generate a list of replacement words.
Given a sentence composed of words w1, w2, ..., wn, with wi the word to be replaced, for each nwj ∈ V the score of replacing wi by nwj is calculated as the probability of the sequence of words obtained by putting nwj in the context of wi, where z is the size of the context window taken into account:

ReplacementScore(wi, nwj) = P(wi−z, wi−z+1, ..., wi−1, nwj, wi+1, wi+2, ..., wi+z)

This probability score is calculated by the language model. It is used to rank the replacement words in our algorithm. Since the size of V may be very large and the words at the end of the list may have a very low score, which often represents noise, only the N words with the highest scores are chosen. After this step, we obtain a ranked list of the top N replacement words for each replacement candidate. Step 3. Applying filters to reduce noise in the list of replacement words. There may be a great deal of noise in the replacement words returned by the language model, since it does not take into account enough information (syntactic, semantic, etc.) to generate a replacement word that can perfectly replace a word in a given sentence while preserving the same semantic role. Thus, some linguistic filters are needed to improve the correctness and meaningfulness of the replacement. Step 4. Replacing words in each selected sentence by their replacement words that passed the filters, thus forming a new unannotated set of sentences Su. Step 5. Guessing semantic frames and their semantic labels for each sentence in Su to obtain an annotated semantic frame set Snl. In the following sections, we present in more detail the language model used, the proposed filters, and how to perform the replacement and guess semantic role labels for the new sentences obtained by the replacement.

Algorithm 1 Generate novel training examples.
1: procedure GenerateNewExample(L, Sl, V, z, N)
2:   Su = ∅, Snl = ∅, Stemp = ∅;
3:   Select semantic frames that are used for the replacement from Sl: Ssl = selected semantic frames
4:   for each sentence s ∈ Sl do
5:     for each word wi in s do
6:       if wi is the predicate of a semantic frame ∈ Ssl then
7:         for each nwj ∈ V do
8:           ReplacementScore(wi, nwj) = P(wi−z, wi−z+1, ..., wi−1, nwj, wi+1, wi+2, ..., wi+z), obtained by using L;
9:         end for
10:        Sort the nwj according to ReplacementScore, then choose the top N words with the highest scores, forming the ranked list Listi;
11:        Stemp = Stemp ∪ (s, wi, Listi);
12:      end if
13:    end for
14:  end for
15:  for each (s, wi, Listi) in Stemp do
16:    for each replacement word nwj in Listi do
17:      if nwj passes the filters then
18:        for each semantic frame f of s that is in Ssl and has wi as the predicate do
19:          s′ = the sentence obtained by replacing wi by nwj in s;
20:          Su = Su ∪ s′;
21:          f′ = the semantic frame evoked by nwj in s′;
22:          Guess the semantic role labels of f′;
23:          Snl = Snl ∪ f′;
24:        end for
25:      end if
26:    end for
27:  end for
28:  Return Snl
29: end procedure
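A compact sketch of the scoring loop of Algorithm 1 (lines 7-10) is given below. The sequence_probability function is a placeholder standing in for the trained language model L, not the RNNLM interface used in the paper.

```python
# Sketch of generating the ranked list of replacement words for one position.
# `sequence_probability` is a placeholder for the trained language model L.
import math

def sequence_probability(words):
    # Stand-in scorer: a real system would query the RNNLM here.
    return math.exp(-len(" ".join(words)))

def replacement_candidates(sentence, i, vocabulary, z=2, top_n=5):
    """Score each vocabulary word placed in the context window around i."""
    left = sentence[max(0, i - z):i]
    right = sentence[i + 1:i + 1 + z]
    scores = {nw: sequence_probability(left + [nw] + right) for nw in vocabulary}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

sentence = "Rachel wore a hat in her room".split()
print(replacement_candidates(sentence, 1, ["donned", "bought", "ate", "ran"]))
```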
By using recurrent connections, information (e.g., words from previous sentences in a discourse) can cycle inside these networks for a long time and influence the final language model obtained. The architecture of the RNNLM is shown in Figure 2.

Fig. 2. Simple recurrent neural network.

The input layer consists of a vector w(t) that represents the current word wt encoded as a 1-of-V vector, where V is the vocabulary (thus the size of w(t) is equal to the size of the vocabulary), and of a vector s(t−1) that holds the output values of the hidden layer from the previous time step. After the network is trained, the output layer y(t) represents P(w_{t+1} | w_t, s(t−1)). The network is trained by stochastic gradient descent using either the usual backpropagation algorithm or backpropagation through time [14]. The network consists of input, hidden and output layers and the corresponding weight matrices: matrices U and W between the input and the hidden layer, and matrix V between the hidden and the output layer. The output values of the layers are computed as follows:

s_j(t) = f( Σ_i w_i(t) u_{ji} + Σ_l s_l(t−1) w_{jl} )   (1)

y_k(t) = g( Σ_j s_j(t) v_{kj} )   (2)

where f(z) and g(z) are the sigmoid and softmax activation functions:

f(z) = 1 / (1 + e^{−z}),   g(z_m) = e^{z_m} / Σ_k e^{z_k}   (3)

The output layer y represents a probability distribution over the next word w_{t+1} given the history. The size of the hidden layer is set to 300 in our experiments. The standard backpropagation algorithm with stochastic gradient descent is used to train the model. In this research, we use the language model to calculate the probability of a word sequence W = w_1 w_2 ... w_m = W_1^m. The language model probability of W is computed as follows:

P(W) = P(W_1^m) = ∏_{i=1}^{m} P(w_i | W_1^{i−1})   (4)

Over the last few decades, the n-gram language model, which assumes that the predicted word only depends on the previous n−1 words, has been the most popular technique since it is simple and effective. Instead of using Equation 4, P(W_1^m) is then calculated in a simpler way, as P(W_1^m) = ∏_{i=1}^{m} P(w_i | W_{i−n+1}^{i−1}). However, an n-gram language model estimates its parameters in a discrete space, resulting in weak generalization to unknown data. In addition, the standard n-gram language model suffers from exponential growth of its size, serious data fragmentation, and an increased miss rate for longer contexts [18]. To overcome these problems, the RNNLM, which has activation feedback with short-term memory and uses the full history instead of a limited context, allows us to calculate P(W_1^m) by Equation 4 more accurately and efficiently using the output layer y(t). A language model tries to capture the properties of a language and is trained on a collection of unlabeled texts, so it can be considered as a text mining technique.

9 http://www.fit.vutbr.cz/~imikolov/rnnlm/

5.3 Filters

Because the list of the top N replacement words returned by the language model may contain a great deal of noise, we propose specific filters to improve the performance of the system.

Part-Of-Speech filter (POS filter): We keep replacement word nwj for wi if nwj has the same POS tag as wi when replacing wi in sentence s.

WordNet filter: We keep replacement word nwj for wi if nwj and wi are synonyms or have the same hypernym in WordNet. Here, we ignore the problem of word sense disambiguation and only use the first word sense when finding the synonyms and the words that have the same hypernym.
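As an illustration, here is a minimal sketch of these two filters, assuming NLTK's POS tagger and WordNet interface; the helper names are ours, and the first-sense restriction follows the description above.

import nltk
from nltk.corpus import wordnet as wn

def pos_filter(tokens, i, candidate):
    # Keep the candidate only if it receives the same POS tag as the original
    # word when it is substituted at position i of the sentence.
    original_tag = nltk.pos_tag(tokens)[i][1]
    substituted = tokens[:i] + [candidate] + tokens[i + 1:]
    return nltk.pos_tag(substituted)[i][1] == original_tag

def wordnet_filter(original, candidate):
    # Keep the candidate if it is a synonym of the original word or shares a
    # direct hypernym with it; only the first sense of each word is used.
    syns_o, syns_c = wn.synsets(original), wn.synsets(candidate)
    if not syns_o or not syns_c:
        return False
    first_o, first_c = syns_o[0], syns_c[0]
    if first_o == first_c:
        return True
    return bool(set(first_o.hypernyms()) & set(first_c.hypernyms()))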
For example, for the WordNet filter, “January” has “Jan” as a synonym, while “February”, “March”, etc. are words that have the same hypernym “Gregorian calendar month”.

Predicate filter: A suitable replacement word for a predicate should also evoke a frame with correct roles when it is placed in the sentence of the target. Our idea is to assign role labels to the new frame based on the role labels of the current frame, but this raises the problem of how to find a mapping between the role sets of the two frames and how to detect the correct sense of the new frame. Based on this idea, one possibility is to select only replacement words for which the mappings between role sets are available. By using SemLink (see Section 3), we define a filter specifically for predicates: for each predicate wp evoking a frame f, we keep replacement predicate word nwj for wp if f and one frame evoked by nwj are mapped to the same VerbNet class and the mappings from those two frames to the VerbNet class are defined in SemLink.

5.4 Replacing words and guessing semantic labels

Replacement words that have passed the filters are used to generate new training examples. Given a semantic frame f of a sentence s composed of n words w1, w2, ..., wn, let wp (p ∈ {1, 2, ..., n}) be the predicate of f, and let wa1, wa2, ..., wam (a1, a2, ..., am ∈ {1, 2, ..., n}) be the heads of the m arguments of f with r1, r2, ..., rm as semantic role labels, respectively. After the filtering step, the list of replacement words of wp, Listp, includes j words {nw1, nw2, ..., nwj}. For each nwt ∈ Listp, we replace wp by nwt in sentence s and obtain sentence s' composed of n words w1, ..., wp−1, nwt, wp+1, ..., wn. If nwt has passed the Predicate filter - which we use as a default filter - it can be a semantic predicate, and the argument structure of the frame evoked by nwt is similar to the argument structure of the frame evoked by wp. Thus, we guess that nwt also evokes a semantic frame in s' with wa1, wa2, ..., wam as arguments (the new semantic frame and f - the frame evoked by wp - have the same argument words). In order to predict the sense and role labels of the new semantic frame, we use the mappings between PropBank semantic frames and VerbNet classes that can be found in SemLink. We first find a frame f' of nwt such that both f' and f are mapped to the same VerbNet class c. The mappings from f and f' to c are denoted by m1 and m2, respectively. If the Predicate filter has been applied before, f' exists and can be found in this step. Then, we can guess that f' is the new frame evoked by nwt in s'. As for the semantic role labels of f', if wai (i ∈ {1, 2, ..., m}) with semantic role label ri fills a circumstance role (an AM- label) in f, then its role does not change in f'; that means ri is also the role label of wai in f'. Otherwise, if the role label of wai in f is ri, then the role label of wai in f' is m2^{-1}(m1(ri)).
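The following sketch illustrates this guessing step. The semlink argument is a stand-in for the SemLink PropBank-to-VerbNet role mappings, assumed here to be loaded as a nested dictionary semlink[frame][verbnet_class] = {propbank_role: verbnet_role}; all names are ours and not part of SemLink itself.

def guess_roles(old_frame, new_frame, old_roles, semlink):
    # Map the argument labels of old_frame onto new_frame through a shared
    # VerbNet class; circumstance roles (AM-*) are kept unchanged.
    shared = set(semlink.get(old_frame, {})) & set(semlink.get(new_frame, {}))
    if not shared:
        return None                    # the Predicate filter would reject this pair
    vn_class = sorted(shared)[0]
    m1 = semlink[old_frame][vn_class]  # PropBank role of f  -> VerbNet role
    m2 = semlink[new_frame][vn_class]  # PropBank role of f' -> VerbNet role
    m2_inv = {v: k for k, v in m2.items()}
    return {head: (role if role.startswith("AM-") else m2_inv[m1[role]])
            for head, role in old_roles.items()}

With the example discussed next, guess_roles('wear.01', 'don.01', {'Rachel': 'A0', 'hat': 'A1', 'in': 'AM-LOC'}, semlink) would return the same labels for the new frame, since Arg0 and Arg1 are mapped to the same VerbNet roles in both frames.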
For example, the sentence “Rachel wore a hat in her room” has the frame “wear.01” (wore) with “Rachel” as A0, “hat” as A1, and “in” (the head of the preposition phrase “in her room”) as AM-LOC, and the predicate “wore” has “donned” as a replacement word. By replacing “wore” by “donned” in the sentence, we obtain the new sentence “Rachel donned a hat in her room”, and “donned” evokes a new frame. In SemLink, we can find the VerbNet class “simple dressing-41.3.1” linked to both the PropBank frame “wear.01” and one PropBank frame of the predicate “don”, namely “don.01”. The role mapping between the VerbNet class and the two frames can be found in Table 5. By applying our method, we obtain a new frame “don.01” with “Rachel” as A0 (mapped to the “Agent” VerbNet role), “hat” as A1 (mapped to the “Theme” VerbNet role), and “in” as AM-LOC (circumstance role) (see Figure 3).

Table 5. Role mapping of “simple dressing-41.3.1” linked to both “wear.01” and “don.01”
Role of simple dressing-41.3.1 | Role of wear.01 | Role of don.01
Agent                          | Arg0            | Arg0
Theme                          | Arg1            | Arg1

Fig. 3. An example of the replacement: the frame “wear.01” of the original sentence “Rachel wore a hat in her room” is mapped through the VerbNet class “simple_dressing-41.3.1” to the frame “don.01” of the new sentence “Rachel donned a hat in her room”, keeping the argument heads and their roles.

6 Case study

In the EU FP7 MUSE project10, in which KU Leuven is involved, we instantiate a virtual world with information extracted from children's stories. Our fundamental goal is to introduce a new way of exploring and understanding information by “bringing text to life” through 3D interactive storytelling. Children's stories will initially be taken as input, due to the relative simplicity of such stories and the relative ease of evaluating the results, and will then be translated into formal knowledge that represents the actions, actors, plots and surrounding world. In a next step this formal knowledge is rendered as virtual 3D worlds in which the user can explore the text through interaction, re-enactment and guided game play. Finally, we will evaluate to what degree this procedure enhances children's understanding of simple fantasy stories.

10 http://www.muse-project.eu/

In our project, NLP techniques are necessary to bring natural language closer to 3D immersive environments. Among the needed NLP techniques, SRL is one of the most important ones, since it labels the textual content of the story with semantic roles identifying actors and activities, which are then visualized in a 3D world. In this paper, we present our work on building a SRL system for the children's story domain. To create virtual 3D worlds, the locations where the activities happen, the order of the activities, the tools used by the actors in the activities, and the directions in which the actors move are very important pieces of information. Such information can be represented by the semantic roles AM-LOC, AM-TMP, AM-MNR and AM-DIR, respectively, in SRL. However, the performance of current SRL systems on those roles is often very poor, especially when testing on a domain other than the one they were trained on. Therefore, in our work, we aim to improve the SRL system on those four roles: AM-LOC, AM-TMP, AM-MNR and AM-DIR.

6.1 Data

We select the story “Tuk the Hunter” as the data for our project demonstration. The story (with some slight changes in the content) is the testing data in our case study. To evaluate the performance of SRL on our domain, we annotated semantic roles for the story following the PropBank annotation guidelines and dependency head word labeling. The total number of annotated semantic frames in our testing data is 154. The detailed number of instances per role is given in Table 6. Most of the annotated data available for semantic role labeling is in the news domain.
To be used as our training data, we select the CoNLL 200911 training dataset, which contains parts of the Wall Street Journal corpus12. The detailed number of instances per role in the training data is also given in Table 6.

Table 6. Number of instances per role in the training and testing data
Data     | AM-LOC | AM-TMP | AM-MNR | AM-DIR
Testing  |     19 |     28 |     32 |     10
Training |  10387 |  23347 |  11837 |   1146

6.2 Expanding training data to the children's story domain

In our case study, we collect 252 children's stories (mostly fairy tales) to create the children's story domain. They are used together with the first 80 million words of the Reuters corpus to train the Recurrent Neural Network Language Model13, and the vocabulary of those stories is used to generate new training examples. We observe that most of the instances of the four targeted roles are prepositions. This suggests choosing semantic frames that contain at least one preposition argument in the CoNLL 2009 training data as the base for our training data expansion. The SRL system used in our experiment is the Lund University semantic parser [1], which is freely available and one of the best systems in the CoNLL 2009 competition. In order to evaluate the effectiveness of our method, we compare the results obtained on the Tuk story by the Lund University semantic parser when training on our expanded training data and on the original training data. In our experimental setting, we use all three filters: the POS filter, the WordNet filter and the Predicate filter. The maximum number of replacement words for each replacement position is N = 500, and the context window size is z = 5. Table 7 presents the results and the gains obtained on the four circumstance roles, in precision, recall and F1 measures, when using our expanded training data. From the table, it is clear that we obtain valuable recall, precision and F1 improvements (at least 7% for recall and F1) over all of the tested roles.

Table 7. Recall, precision, F1 results and gains (in %) per role when training on our expanded training data.
Role   | Recall (gain)  | Precision (gain) | F1 (gain)
AM-LOC | 47.37 (+10.53) | 60.00 (+6.15)    | 52.94 (+9.19)
AM-TMP | 82.14 (+7.14)  | 69.70 (+7.93)    | 75.41 (+7.67)
AM-MNR | 56.25 (+18.75) | 85.71 (+10.71)   | 67.92 (+17.92)
AM-DIR | 60.00 (+10.00) | 75.00 (+3.57)    | 66.67 (+7.84)

11 http://ufal.mff.cuni.cz/conll2009-st/
12 http://catalog.ldc.upenn.edu/LDC2012T04
13 http://www.fit.vutbr.cz/~imikolov/rnnlm/

7 Conclusion

In this paper, we present a methodology for building an open-domain semantic role labeling system. In our case study, we transfer the SRL model trained on the news domain to the children's story domain by collecting children's stories to create the new domain and then replacing verbal predicates in the training data by words of the new domain given a language model. We keep the precision score from dropping by using the occurrence probabilities and some linguistic filters to verify the linguistic patterns obtained by the replacements. The valuable improvements over the four circumstance roles AM-LOC, AM-TMP, AM-MNR and AM-DIR clearly show the effectiveness of our methodology on this case study14.

Acknowledgement

This work is funded by the EU ICT FP7 FET project “Machine Understanding for interactive StorytElling” (MUSE), http://www.muse-project.eu/.

References

1. Björkelund, A., Hafdell, L., Nugues, P.: Multilingual semantic role labeling. In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task. pp. 43–48.
CoNLL ’09, Association for Computational Lin- guistics, Stroudsburg, PA, USA (2009) 2. Chapelle, O., Scholkopf, B., Zien, A. (eds.): Semi-Supervised Learning. MIT Press (2006) 3. Collobert, R.: Deep learning for efficient discriminative parsing. In: AISTATS (2011) 4. Deschacht, K., De Belder, J., Moens, M.F.: The latent words language model. Computer Speech and Language 26(5), 384–409 (Oct 2012), https://lirias. kuleuven.be/handle/123456789/344914 5. Deschacht, K., Moens, M.F.: Semi-supervised semantic role labeling using the la- tent words language model. In: EMNLP. pp. 21–29. ACL (2009) 6. Fürstenau, H., Lapata, M.: Semi-supervised semantic role labeling via structural alignment. Comput. Linguist. 38(1), 135–171 (Mar 2012) 7. Huang, F., Ahuja, A., Downey, D., Yang, Y., Guo, Y., Yates, A.: Learning Repre- sentations for Weakly Supervised Natural Language Processing Tasks. Computa- tional Linguistics 40, 85–120 (2013) 8. Levin, B.: English Verb Classes and Alternations A Preliminary Investigation. University of Chicago Press, Chicago and London (1993) 9. Mikolov, T.: Statistical Language Models Based on Neural Networks. Ph.D. thesis, Ph. D. thesis, Brno University of Technology (2012) 10. Mikolov, T., Karafit, M., Burget, L., Cernock, J., Khudanpur, S.: In: Kobayashi, T., Hirose, K., Nakamura, S. (eds.) INTERSPEECH. pp. 1045–1048. ISCA (2010) 11. Miller, G.A.: Wordnet: A lexical database for english. Commun. ACM 38(11), 39– 41 (Nov 1995) 14 An expanded version of this paper with comparable experimental results on the out-of-domain CoNLL 2009 testing data is submitted to the journal of IEEE/ACM Transactions on Audio, Speech and Language Processing. 48 Q. N. T. Do, S. Bethard and M.-F. Moens 12. Palmer, M., Gildea, D., Xue, N.: Semantic Role Labeling. Synthesis Lectures on Human Language Technologies, Morgan & Claypool Publishers (2010) 13. Palmer, M., Kingsbury, P., Gildea, D.: The proposition bank: An annotated corpus of semantic roles. Computational Linguistics 31(1), 71–106 (2005) 14. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1. chap. Learning Internal Representations by Error Propagation, pp. 318–362. MIT Press, Cambridge, MA, USA (1986), http://dl.acm.org/citation.cfm?id=104279.104293 15. Schuler, K.K.: VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. Ph.D. thesis, University of Pennsylvania (2006) 16. Socher, R., Lin, C.C.Y., Ng, A.Y., Manning, C.D.: Parsing natural scenes and natural language with recursive neural networks. In: Getoor, L., Scheffer, T. (eds.) ICML. pp. 129–136. Omnipress (2011) 17. Weston, J., Ratle, F., Collobert, R.: Deep learning via semi-supervised embedding. In: Proceedings of the 25th International Conference on Machine Learning. pp. 1168–1175. ICML ’08, ACM, New York, NY, USA (2008), http://doi.acm.org/ 10.1145/1390156.1390303 18. Yujing Si, Zhen Zhang, T.L.J.P., Yan, Y.: Enhanced word classing for recurrent neural network language model. In: JICS: Journal of Information and Computa- tional Science, Vol. 10. pp. 3595–3604 (2013) Learning Semantically Coherent Rules Alexander Gabriel1 , Heiko Paulheim2 , and Frederik Janssen3 1 agabriel@mayanna.org Technische Universität Darmstadt, Germany 2 heiko@informatik.uni-mannheim.de Research Group Data and Web Science University of Mannheim, Germany 3 janssen@ke.tu-darmstadt.de Knowledge Engineering Group Technische Universität Darmstadt, Germany Abstract. 
The capability of building a model that can be understood and interpreted by humans is one of the main selling points of sym- bolic machine learning algorithms, such as rule or decision tree learners. However, those algorithms are most often optimized w.r.t. classification accuracy, but not the understandability of the resulting model. In this paper, we focus on a particular aspect of understandability, i.e., semantic coherence. We introduce a variant of a separate-and-conquer rule learn- ing algorithm using a WordNet-based heuristic to learn rules that are semantically coherent. In an evaluation on different datasets, we show that the approach learns rules that are significantly more semantically coherent, without losing accuracy. Keywords: Rule Learning, Semantic Coherence, Interpretability, Rule Learn- ing Heuristics 1 Introduction Symbolic machine learning approaches, such as rule or decision tree induction, have the advantage of creating a model that can be understood and interpreted by human domain experts – unlike statistical models such as Support Vector Machines. In particular, rule learning is one of the oldest and most intensively researched fields of machine learning [14]. Despite this advantage, the actual understandability of a learned model has received only little attention so far. Most learning algorithms are optimized w.r.t. the classification accuracy, but not understandability. Most often the latter is measured rather naively by, e.g., the average number of rules and/or conditions without paying any attention to the relation among them. The understandability of a rule model comprises different dimensions. One of those dimensions is semantic coherence, i.e., the semantic proximity of the different conditions in a rule (or across the entire ruleset). Prior experiments have shown that this coherence has a major impact on the reception of a rule In: P. Cellier, T. Charnois, A. Hotho, S. Matwin, M.-F. Moens, Y. Toussaint (Eds.): Proceedings of DMNLP, Workshop at ECML/PKDD, Nancy, France, 2014. Copyright c by the paper’s authors. Copying only for private and academic purposes. 50 A. Gabriel, H. Paulheim and F. Janssen model. This notion is similar to the notion of semantic coherence of texts, which is a key factor to understanding those texts [20]. In a previous user study, we showed different rules describing the quality of living in cities to users. The experiments showed that semantically coherent rules – such as Cities with medium temperatures and low precipitation – are favored over incoherent rules, such as Cities with medium temperatures where many music albums have been recorded [27]. In this paper, we discuss how separate-and-conquer rule learning algorithms [12] can be extended to support the learning of more coherent rules. We introduce a new heuristic function that combines a standard heuristic (such as Accuracy or m-Estimate) with a semantic one, and allows for adjusting the weight of each component. With that weight, we are able to control the trade-off between classification accuracy and semantic coherence. The rest of this paper is structured as follows. We begin by briefly introducing separate-and-conquer rule learning. Next, our approach to learning semantically coherent rules is detailed. In the following evaluation, we introduce the datasets and show the results. Here, also some exemplary rules are given, indeed indicat- ing semantic coherence between the conditions of the rules. After that, related work is captured. 
Then, the paper is concluded and future work is shown. 2 Separate-and-Conquer Rule Learning Separate-and-conquer rule learning is still amongst the most popular strategies to induce a set of rules that can be used to classify unseen examples, i.e., correctly map them on their respective classes. How exactly this strategy is implemented varies among the different algorithms but most of them fit into the framework of separate-and-conquer. This led to the development of the so-called SeCo suite [18], a versatile framework that allows for most existing algorithms to be con- figured properly. Based on the flexibility and the convenient way to implement new functions or extensions, we chose this framework for our experiments. In essence, a separate-and-conquer rule learner proceeds in two major steps: First, a single rule, that fulfills certain quality criteria, is learned from the data (this is called the conquer step of the algorithm). Then, all (positive) examples that are covered by this rule are removed from the dataset (the separate step) and the algorithm proceeds by learning the next rule until all examples are covered. Certainly, this strategy is only usable for binary data as a notion of positive and negative example is mandatory but then, if desired, it can guarantee that every positive example is covered (completeness) and no negative one is covered (consistency). There are different strategies to convert multi-class datasets to binary ones. However, in this paper we used an ordered binarization as imple- mented in the SeCo framework. Therefore, the classes of the dataset are ordered by their class-frequency and the smallest class is defined to be the positive one whereas the other ones are treated as negative examples. After the necessary number of rules to cover the smallest class is learned, all examples from it are Learning Semantically Coherent Rules 51 removed and the next smallest one is defined to be positive while again the rest of the examples are negative. The algorithm proceeds in this manner until all classes expect the largest one are covered. The resulting ruleset is a so-called decision list where for each example that is to be classified the rules are tested from top to bottom and the first one that covers the example is used for predic- tion. If, however, no rule covers the example, a default rule at the end of the list assigns it to the largest class in the dataset. A single rule is learned in a top-down fashion meaning that it is initialized as an empty rule and conditions are greedily added one by one until no more negative examples are covered. Then, the best rule encountered during this pro- cess is heuristically determined and returned as best rule. Note that this has not to be the last rule covering no negative example, i.e., consistency due to reasons of overfitting is not assured. A heuristic, in one way or another, maxi- mizes the covered positive examples while trying to cover as few negative ones as possible. The literature shows a wide variety of different heuristics [13]. For the experiments conducted in this paper we had to make a selection and chose three well known heuristics namely Accuracy, Laplace Estimate, and the m-Estimate, as defined later. We are aware of the restrictions that come with our selection but we are confident that our findings regarding the semantic coherence are not subject to a certain type of heuristic but rather are universally valid. To keep it simple, we used the default algorithm implemented in the SeCo framework. 
Namely, the configuration uses a top-down hill-climbing search (a beam size of one) that refines a rule as long as negative examples are covered. The learning of rules stops when the best rule covers more negative than positive examples and the conditions of a rule test for equality (nominal conditions) or use < and ≥ for numerical conditions. No special pruning or post-processing of rules is employed. For the m-Estimate, the parameter was set to 22.466 as suggested in [17]. 3 Enabling Semantic Coherence The key idea of this paper is to enrich the heuristic used for finding the best condition with a semantic component that additionally to the goal of maximizing positive examples while minimizing negatives, will incorporate that the selected condition will also be as semantically coherent as possible. In essence, we have two components now: – The classic heuristic (selects conditions based on statistical properties of the data) and – the semantic heuristic (selects conditions based on their semantic coherence with previous conditions). Hence, the new heuristic WH offers the possibility to trade-off between sta- tistical validity (the classic heuristic CH ) and the semantic part (a semantic heuristic SH ). This is enabled by a parameter α that weights the two objectives: 52 A. Gabriel, H. Paulheim and F. Janssen WH (Rule) = α · SH (Rule) + (1 − α) · CH (Rule), α ∈ [0, 1] (1) A higher value α gives more weight to semantic coherence, while a value of α = 0 is equivalent to classic rule learning using only the standard heuristic. We expect that higher values of α lead to a decrease in predictive accuracy because the rule learning algorithm focuses less on the quality of the rule and more on choosing conditions that are semantically coherent (which are likely not to have a strong correlation with the rule’s accuracy). At the same time, higher values of α should lead to more coherent rules. When learning rules, the first condition is selected by using the classic heuris- tic CH only (since a rule with only one condition is always coherent in itself). Then, while growing the rule, the WH heuristic is used, which leads to condi- tions being added that result in both a coherent and an accurate rule according to the trade-off specified by α. 3.1 WordNet Similarity There are different possibilities to measure the semantic relatedness between two conditions. In this paper, we use an external source of linguistic information, i.e., WordNet [8]. WordNet organizes words in so-called synsets, i.e., sets of synonym words. Those synsets are linked by homonym and hyperonym relations, among others. Using those relations, the semantic distance between words in different synsets can be computed. In the first step, we map each feature that can be used in a rule to one or more synsets in WordNet4 . To do so, we search WordNet for the feature name. In the following, we will consider the case of measuring the semantic coherence of two features named smartphone vendor and desktop. The search for synsets returns a list of synsets, ordered by relevance. The search result for smartphone vendor is empty {}, the search result for desktop is {desktop#n#1 , desktop#n#2 } where desktop#n#1 describes a tabletop and desktop#n#2 describes a desktop computer.5 If the list is not empty, we add it to the attribute label’s list of synset lists. If otherwise the list is empty, we check whether the attribute label is a com- pound of multiple tokens and restart the search for each of the individual to- kens. 
We then add all non-empty synset lists that are returned to the list of synset lists of the attribute label. The result for smartphone vendor is then {{smartphone#n#1}, {vendor#n#1}}, while the result for desktop is {{desktop#n#1, desktop#n#2}}.

In the second step, we calculate the distance between two synsets using the LIN [21] metric. We chose this metric as it performs well in comparison with other metrics [3], and it outputs a score normalized to [0, 1].

4 Note that at the moment, we only use the names of the features to measure semantic coherence, but not the nominal or numeric feature values that are used to build a condition.
5 The 'n' indicates that the synsets describe nouns.

The LIN metric is based on the Information Content (IC) metric [29], a measure for the particularity of a concept. The IC of a concept c is calculated as the negative log likelihood, i.e., the negative of the logarithm of the probability of encountering concept c:

IC(c) = − log p(c)   (2)

Higher values denote less abstract, more specific concepts, while lower values denote more abstract, more general concepts. The body of text used for the calculation of the p(c) values in this work is the SemCor [23] corpus, a collection of 100 passages from the Brown corpus which were semantically tagged “based on the WordNet word sense definition” and thus provide the exact frequency distribution of each synset; this covers roughly 25% of the synsets in WordNet [19]. The LIN metric is calculated by dividing the Information Content of the least common subsumer (lcs) of the two synsets by the sum of their Information Content, and multiplying the result by two:6

lin(syn1, syn2) = 2 · IC(lcs) / (IC(syn1) + IC(syn2))   (3)

For each pair of synsets associated with two attributes, we calculate the LIN metric. In our example, the corresponding values are lin(smartphone#n#1, desktop#n#1) = 0.0, lin(smartphone#n#1, desktop#n#2) = 0.5, lin(vendor#n#1, desktop#n#1) = 0.0, and lin(vendor#n#1, desktop#n#2) = 0.0. In the third step, we choose the maximum value for each pair of synset lists, so that we end up with the maximum similarity value per pair of tokens. The overall semantic similarity of two attributes (att) is then computed as the mean of those similarities across the token pairs:

SH(att1, att2) = avg_{t1 ∈ att1, t2 ∈ att2}  max_{syn1 ∈ t1, syn2 ∈ t2}  lin(syn1, syn2)   (4)

This assigns each word pair the similarity value of the synset combination that is most similar among all the synset combinations that arise from the two lists of possible synsets for the two words. Thus, in our example, the SH value assigned to smartphone vendor and desktop would be 0.25. To compute the semantic coherence of a rule given the pairwise SH scores for the attributes used in the rule, we use the mean of those pairwise scores to assign a final score to the rule.7

6 This metric limits the similarity calculation to synsets of the same POS and works only with nouns and verbs. Our implementation returns a similarity value of 0 in all other cases.
7 All experiments were carried out with minimum and maximum as well, but using the mean turned out to give the best results.
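To make this computation concrete, here is a minimal sketch of the SH score between two attribute names, assuming NLTK's WordNet interface and its SemCor-based information-content file; the helper names are ours, and the exact scores depend on the WordNet version used.

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

semcor_ic = wordnet_ic.ic('ic-semcor.dat')

def synset_lists(attribute_label):
    # Look up the whole label first; if nothing is found, fall back to the
    # individual tokens of a compound label (cf. the smartphone vendor example).
    synsets = wn.synsets(attribute_label.replace(' ', '_'))
    if synsets:
        return [synsets]
    token_lists = [wn.synsets(tok) for tok in attribute_label.split()]
    return [s for s in token_lists if s]

def lin(s1, s2):
    # LIN similarity; 0 for cross-POS pairs and for POS without IC values.
    try:
        return s1.lin_similarity(s2, semcor_ic)
    except Exception:
        return 0.0

def sh(att1, att2):
    # Mean over token pairs of the maximal pairwise synset similarity (Eq. 4).
    lists1, lists2 = synset_lists(att1), synset_lists(att2)
    if not lists1 or not lists2:
        return 0.0
    pair_scores = [max(lin(x, y) for x in l1 for y in l2)
                   for l1 in lists1 for l2 in lists2]
    return sum(pair_scores) / len(pair_scores)

For the example above, sh('smartphone vendor', 'desktop') follows the described computation; the paper reports a value of 0.25 for this pair.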
Table 1. Datasets used in the experiments
Dataset       | #Attributes | Found in WordNet
hepatitis     | 19 |  68%
primary-tumor | 17 |  71%
bridges2      | 11 |  73%
zoo           | 17 |  94%
flag          | 27 | 100%
auto-mpg      |  7 | 100%
balloons      |  4 | 100%
glass         |  9 | 100%

4 Evaluation

We have conducted experiments with different classic heuristics on a number of datasets from the UCI machine learning repository8, shown in Table 1. The table depicts the overall number of attributes and the percentage of attributes for which at least one matching synset was found in WordNet. As classic heuristics CH, we chose Accuracy, m-Estimate, and Laplace Estimate, which are defined as follows:

Accuracy := p − n ≡ (p + (N − n)) / (P + N)   (5)

Laplace Estimate := (p + 1) / (p + n + 2)   (6)

m-Estimate := (p + m · P / (P + N)) / (p + n + m)   (7)

where p, n denote the positive/negative examples covered by the rule and P, N stand for the total numbers of positive/negative examples. Please see [17] for more details on these heuristics. In addition, we used the semantic heuristic SH based on WordNet as defined above. For each experiment, we report the accuracy (a single run of a ten-fold cross validation) and the average semantic coherence of all the rules in the ruleset (measured by SH), as well as the average rule length and the overall number of conditions and rules in the ruleset.

As datasets, we had to pick some that have attribute labels that carry semantics, i.e., the attributes have to have meaningful names instead of, e.g., names from att1 to att20 (which unfortunately is the case for the majority of datasets in the UCI repository). We searched for datasets where we could map at least two thirds of the attributes to at least one synset in WordNet. This led to the eight datasets used for the experiments in this paper, which are listed in Table 1.

Table 2. Macro average accuracy of the learned rulesets on the eight datasets. Statistically significant deviations (p > 0.05) from α = 0 are marked in bold.
Classic heuristic | α = 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0
Accuracy   | 0.649 | 0.667 | 0.668 | 0.668 | 0.669 | 0.669 | 0.668 | 0.668 | 0.668 | 0.668 | 0.465
m-Estimate | 0.670 | 0.673 | 0.672 | 0.671 | 0.671 | 0.670 | 0.670 | 0.673 | 0.673 | 0.674 | 0.474
Laplace    | 0.673 | 0.680 | 0.679 | 0.682 | 0.681 | 0.680 | 0.681 | 0.679 | 0.679 | 0.681 | 0.476

Table 3. Average semantic coherence of the learned rulesets on the eight datasets. Statistically significant deviations (p > 0.05) from α = 0 are marked in bold.
Classic heuristic | α = 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0
Accuracy   | 0.146 | 0.212 | 0.222 | 0.241 | 0.235 | 0.235 | 0.237 | 0.239 | 0.233 | 0.233 | –
m-Estimate | 0.165 | 0.195 | 0.199 | 0.204 | 0.207 | 0.209 | 0.211 | 0.217 | 0.222 | 0.232 | –
Laplace    | 0.156 | 0.206 | 0.223 | 0.228 | 0.231 | 0.228 | 0.227 | 0.227 | 0.227 | 0.226 | –
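For reference, a compact sketch of Eqs. (5)–(7) and of the weighted heuristic WH of Eq. (1); the m value follows [17], and all function names are ours rather than the SeCo framework's.

def accuracy_h(p, n, P, N):
    # Eq. (5): equivalent to maximizing p - n.
    return (p + (N - n)) / (P + N)

def laplace_h(p, n, P, N):
    # Eq. (6)
    return (p + 1) / (p + n + 2)

def m_estimate_h(p, n, P, N, m=22.466):
    # Eq. (7), with m = 22.466 as suggested in [17].
    return (p + m * P / (P + N)) / (p + n + m)

def weighted_heuristic(p, n, P, N, sh_value, alpha=0.5, classic=m_estimate_h):
    # Eq. (1): WH = alpha * SH + (1 - alpha) * CH.
    return alpha * sh_value + (1 - alpha) * classic(p, n, P, N)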
4.1 Results of the Experiments

Table 2 shows the macro average accuracy across the eight datasets for different values of α. It can be observed that, except for α = 1, the accuracy does not change significantly. This is an encouraging result, as it shows that a weight of up to 0.9 can be assigned to the semantic heuristic without the learned model losing much accuracy. To what extent the coherence can be enforced has to be examined by a more detailed inspection of parameter values between 0.9 and 1.0. Interestingly, the trade-off between coherence and accuracy seems to occur only at the edge, at high parameter values. Clearly, a study of these parameters would yield more insights; however, ensuring such high coherence without a noticeable effect on accuracy is already a remarkable result and seems to be sufficient for our purposes. Only when assigning all weight to the semantic heuristic (and none to the classic heuristic) does the accuracy drop significantly, which is the expected result. In most of these cases, no rules are learned at all, but only a default rule is created, assigning all examples to the majority class.

In Table 3, we report the macro average semantic coherence of the learned rulesets across the eight datasets. The results have to be seen in context with Table 2, as our primary goal was to increase semantic coherence while not losing too much accuracy. Clearly, the higher the value of α, the more semantic coherence will be achieved anyway, because the heuristic component uses the same measure for semantic coherence as is reported in the evaluation in Table 3. However, as confirmation, it can be observed that the semantic coherence is indeed increased in all cases, whereas, when using m-Estimate as a classic heuristic, the increase is not statistically significant. As stated above, in many cases no rules are learned at all for α = 1, so the semantic coherence cannot be computed there.

These results support our main claim, i.e., that it is possible to learn more coherent rules without losing classification accuracy. What is surprising is that even for α = 0.9, the accuracy does not drop. This may be explained by the selection of the first condition in a rule, which is picked according only to the classic heuristic and thus leads to growing a rule that has at least a moderate accuracy. Furthermore, in many cases, there may be a larger number of possible variants for growing a rule that the learning algorithm can choose from, each leading to a comparable value according to the classic heuristic, so adding weight to the semantic heuristic can still lead to a reasonable rule.

8 http://archive.ics.uci.edu/ml/

Table 4. Two rules learned for primary-tumor
α = 0.0: peritoneum = yes, skin = yes, histologic-type = adeno → class = ovary
α = 0.8: peritoneum = yes, skin = yes, pleura = no, brain = no → class = ovary

Fig. 1. Detailed results on the primary-tumor dataset, using Accuracy as a classic heuristic: accuracy, semantic coherence, average rule length, number of rules and total number of conditions, plotted against alpha.

4.2 Analysis of the Models

The two rules learned for the primary-tumor dataset shown in Table 4 illustrate the difference between rules with and without semantic coherence. Both rules cover two positive and no negative examples, i.e., according to any classic heuristic, they are equally good. However, the second one can be considered to be semantically more coherent, since three out of four attributes refer to body parts (skin, pleura, and brain), and are thus semantically related. In order to further investigate the influence of the semantic heuristic on general properties of the learned ruleset, we also looked at the average rule length, the total number of rules, and the total number of conditions in a ruleset. The results are depicted in Tables 5 and 6. In Table 5 we observe a mostly constant and sometimes increasing number of rules for all but the last three datasets.
Table 5. An overview of the number of rules and conditions in the learned rulesets for selected values of α for all datasets. Datasets where a drop occurred are shown at the end of the table.
Dataset       | α   | Accuracy: # rules / # cond. | m-Estimate: # rules / # cond. | Laplace: # rules / # cond.
auto-mpg      | 0.0 | 47 / 120 | 14 / 50  | 48 / 114
auto-mpg      | 0.5 | 48 / 127 | 14 / 46  | 47 / 110
auto-mpg      | 0.9 | 48 / 127 | 14 / 46  | 48 / 110
balloons      | 0.0 |  2 / 4   |  2 / 4   |  4 / 12
balloons      | 0.5 |  2 / 4   |  2 / 4   |  4 / 12
balloons      | 0.9 |  2 / 4   |  2 / 4   |  4 / 12
bridges2      | 0.0 | 27 / 59  | 10 / 25  | 30 / 65
bridges2      | 0.5 | 27 / 61  | 10 / 25  | 29 / 65
bridges2      | 0.9 | 27 / 61  | 10 / 25  | 29 / 65
flag          | 0.0 | 24 / 78  | 21 / 51  | 52 / 106
flag          | 0.5 | 38 / 90  | 24 / 63  | 54 / 113
flag          | 0.9 | 38 / 90  | 24 / 63  | 54 / 113
zoo           | 0.0 | 12 / 14  | 11 / 15  | 19 / 19
zoo           | 0.5 | 16 / 18  | 12 / 16  | 19 / 19
zoo           | 0.9 | 16 / 18  | 13 / 19  | 19 / 19
glass         | 0.0 | 40 / 90  | 16 / 52  | 50 / 102
glass         | 0.5 | 35 / 82  | 17 / 55  | 36 / 74
glass         | 0.9 | 35 / 82  | 17 / 55  | 36 / 74
hepatitis     | 0.0 | 10 / 22  |  6 / 22  | 13 / 26
hepatitis     | 0.5 |  3 / 6   |  6 / 18  |  5 / 8
hepatitis     | 0.9 |  3 / 6   |  6 / 18  |  5 / 8
primary-tumor | 0.0 | 70 / 246 | 26 / 119 | 81 / 324
primary-tumor | 0.5 | 10 / 35  | 12 / 56  | 75 / 331
primary-tumor | 0.9 | 11 / 39  | 11 / 46  | 16 / 68

This exception to the overall trend is analyzed more closely for the primary-tumor dataset. The values for this dataset are depicted in Fig. 1. When looking at the rulesets learned on the primary-tumor dataset, it can be observed that many very special rules for small classes, covering only a few examples, are missing when the value of α is increased. A possible explanation is that as long as there are many examples for a class, there are enough degrees of freedom for the rule learner to respect semantic coherence. If, on the other hand, the number of examples drops (e.g., for small classes), it becomes harder to learn meaningful semantic rules, which leads the rule learner to ignore those small example sets. Since only a small number of examples is concerned by this, the accuracy remains stable – or it even rises slightly, as ignoring those small sets may eventually reduce the risk of overfitting.

Note that a similar trend could be observed for the other two datasets (hepatitis and glass, depicted in the lower part of Table 5). While the changes are not so pronounced for the m-Estimate, those for the other two heuristics certainly are significant. Interestingly, most often the rules at the beginning of the decision list are similar, and at a certain point no rules are learned any more. Thus, similar to the effect noticeable for the primary-tumor dataset, the subsequent low-coverage rules are not induced any more. However, when looking at the average rule length (cf. Table 6), the only significant change occurs when all weight is given to the semantic component. The reason is that most often no rule is learned at all in this case.

Table 6. Average rule length of the learned rulesets on the eight datasets. Statistically significant deviations (p > 0.05) from α = 0 are marked in bold.
Classic heuristic | α = 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0
Accuracy   | 2.270 | 2.295 | 2.296 | 2.296 | 2.281 | 2.281 | 2.310 | 2.321 | 2.288 | 2.288 | 0.250
m-Estimate | 2.329 | 2.295 | 2.304 | 2.334 | 2.348 | 2.342 | 2.356 | 2.337 | 2.365 | 2.322 | 0.250
Laplace    | 2.920 | 2.851 | 2.862 | 2.811 | 2.811 | 2.833 | 2.828 | 2.821 | 2.796 | 2.788 | 0.375

4.3 Semantically Coherent Rules in Relation to Characteristic Rules

When we inspected the rule sets and the behavior of our separate-and-conquer learner in more detail, we found that semantically coherent rules interestingly have a connection to so-called characteristic rules [22, 4]. Where a discriminant rule tries to use as few conditions as possible with the goal of separating the example(s) of a certain class from all the other ones, a characteristic rule has as many conditions as possible that actually describe the example(s) at hand.
For instance, if the example to be described would be an elephant, a discriminant rule would concentrate on the single attribute(s) an elephant has and no other animal shows such as, e.g., its trunk, its gray color, or its huge ears. Instead, a characteristic rule would list all attributes that indicate an elephant such as four legs, a tail, thick skin etc. In essence, a discriminant rule has only conditions that discriminate elephants from all other animals whereas a characteristic rule rather describes the elephant without the need to be discriminant, i.e., to use only features no other animal has. Not surprisingly, a semantically coherent rule tends to show the same prop- erties. Often the induced rules consist of conditions that are not necessarily important to discriminate the examples, but rather are semantically coherent with the conditions located at earlier positions in these rules. This becomes ob- vious when we take a look at the above example of the two rules where the rule without semantic influence has a condition less albeit both of them have the same coverage. However, the number of rules is strongly dependent on the attribute’s se- mantics. For most of the datasets where actually less rules are induced with our approach, semantic coherence is hard to measure. The glass database contains of descriptions of chemicals, in the hepatitis dataset biochemical components are used as features and in primary-tumor we have simply considerably more classes. A detailed examination of this phenomenon remains subject to future work. Learning Semantically Coherent Rules 59 5 Related Work Most of the work concerned with the trade-off between interpretability and ac- curacy stems from the fuzzy rules community. Here, this trade-off is well-known and there are a number of papers that addressed this problem [30]. There are several ways to deal with it, either by using (evolutionary) multiobjective opti- mization [16], context adaptation, hierarchical fuzzy modeling as well as fuzzy partitioning, membership functions, rules, rule bases or similar. However, most often comprehensibility of fuzzy rules is measured by means such as the trans- parency of the fuzzy partitions, the number of fuzzy rules and conditions or the complexity of reasoning, i.e., defuzzification and inference mechanisms. As we use classification rules in this paper, most of these techniques are not applicable. There are also some papers about comprehensibility in general. For example, [33] deals with the means of dimensionality reduction and with presenting sta- tistical models in a way that the user can grasp them better, e.g., with the help of graphical representations or similar. The interpretability of different model classes is discussed in [10]. The advantages and disadvantages of decision trees, classification rules, decision tables, nearest neighbor, and Bayesian networks are shown. Arguments are given why using model size on its own for measuring comprehensibility is not the best choice and directives are demonstrated how user-given constraints such as monotonicity constraints can be incorporated into the classification model. For a general discussion of comprehensibility this is very interesting, however, as single conditions of a rule are not compared against each other, the scope is somewhat different than in our work. A lot of papers try to induce a ruleset that has high accuracy as well as good comprehensibility by employing genetic, evolutionary, or ant colony optimization algorithms. 
Given the right measure for relating single conditions of a rule or even whole rules in a complete ruleset, this seems to be a promising direction. Unfortunately, most of the fitness functions do not take this into account. For example, in [25] an extension of a ant colony algorithm was derived to induce unordered rulesets. They introduced a new measure for comprehensibility of rules, namely the prediction-explanation size. In essence this measure is oriented more strongly on the actual prediction hence the average number of conditions that have to be checked for predicting the class value. Therefore, not the total number of conditions or rules is measured as usual measures often do but for an unordered ruleset exactly those that are actually used for classifying the example at hand. For ordered rulesets also rules are counted that are before the classifying rule in the decision list as they have to be also checked at prediction time. Other algorithms are capable of multi-target learning [24] and define interestingness as those rules that cover example of infrequent classes in the dataset. Also, some papers deal with interpretability rather as a side effect [2], while here no optimization of this objective is done during learning time. In contrast, [7] uses a simple combination of accuracy maximization and size minimization in the fitness function of the genetic algorithm. Some research is focused on specific problems where consequently rather unique properties are taken into account [31]. In this bioinformatic domain, only 60 A. Gabriel, H. Paulheim and F. Janssen the presence of an attribute (value) is of interest whereas the absence is of no concern. The contribution are two new versions of CN2 [6] and Ant-Miner [26] which are able to incorporate this constraint. Another thread is concerned with the measures themselves. For example, [9] surveyed objective measures (data-driven) for interestingness and defined a new objective, namely attribute surprisingness AttSurp, i.e., by arguing that a user is mostly interested in a rule that has high prediction performance but many single attributes with a low information gain, the authors define AttSurp as one divided by the information gain of all attributes in the rule. In [11] it is argued that small disjuncts (i.e., rules that cover only a very small number of positive examples) are indeed surprising while most often not unfolding good generalization or predictive quality. Here, also AttSurp is used which is different to most other interestingness measures in the sense that not the whole rule body is taken into account but single attributes which one can also see as a property of our algorithm. Interestingly, surprisingness also is related to Simpson’s Paradox. 6 Conclusions and Future Work In this paper, we have examined an approach to increase the understandability of a rule model by learning rules that are in themselves semantically coherent. To do so, we have introduced a method for combining classic heuristics, tailored at learning correct rule models, with semantic heuristics, tailored at learning coherent rules. While we have only looked at the coherence of single rules, adding means to control the coherence across a set of rules would be an interesting extension for future work. An experiment with eight datasets from the UCI repository has shown that it is possible to learn rules that are significantly more coherent, while not being significantly less accurate. 
In fact, the accuracy of the learned model has stayed constant in all cases, even if adjusting the influence of the semantic heuristic to 90% of the overall heuristic. These results show that, even at a very preliminary stage, the proposed approach actually works. Furthermore, we have observed that in some cases, adding the semantic heuristic may lead to more compact rule sets, which are still as accurate as the original ones. Although we have a possible explanation, i.e., that it is diffi- cult for semantically enhanced heuristics to learn rules for small sets of examples, we do not have statistically significant results here. An evaluation with synthetic datasets may lead to more insights into the characteristics of datasets for which this property holds, and help us to confirm or reject that hypothesis. Although we have evidence from previous research that semantically coher- ent rules are perceived to be better understandable, e.g. in [27], we would like to strengthen that argument by additional user studies. These may also help reveal- ing other characteristics a ruleset should have beyond coherence, e.g., minimum or maximum length. For example, the experiments in [27] have indicated that less accurate rules (e.g., Countries with a high HDI are less corrupt) are pre- Learning Semantically Coherent Rules 61 ferred over more accurate ones (e.g., Countries with a HDI higher than 6.243 are less corrupt). In this paper, we have only looked into one method of measuring semantic coherence, i.e., a similarity metric based on WordNet. There are more possible WordNet-based metrics, e.g., the LESK [1] and the HSO [15] metrics, which both work with adjectives and adverbs in addition to nouns and verbs and they support arbitrary pairing of the POS classes. Furthermore, there is a number of alternatives beyond WordNet, e.g., the use of Wikipedia [32] or a web search engine [5]. Furthermore, in the realm of Linked Open Data, there are various means to determine the relatedness of two concepts [28]. The approach so far only uses the classical heuristic to select the first rule, which sometimes lead to rules that are not too coherent w.r.t. that attribute, e.g., if there are no other attributes that match the first one well semantically. Here, it may help to introduce a semantic part in the selection of the first condition as well, e.g., the average semantic distance of all other attributes to the one selected. However, the impact of that variation on accuracy has to be carefully investigated. Another possible point for improvement is the selection of the final rule from one refinement process. So far, we use the same combined heuristic for the re- finement and the selection, but it might make sense to use a different weight here, or even entirely remove the semantic heuristic from that step, since the coherence has already been assured by the selection of the conditions. In summary, we have introduced an approach that is able to explicitly trade off semantic coherence and accuracy in rule learning, and we have shown that it is possible to learn more coherent rules without losing accuracy. However, it remains an open question whether or not our results are generalizable to other types of rule learning algorithms that do not rely on a separate-and-conquer strategy. We will inspect the impact on other rule learners in the near future. References 1. Banerjee, S., Pedersen, T.: An Adapted Lesk Algorithm for Word Sense Disam- biguation Using WordNet. 
In: Computational linguistics and intelligent text pro- cessing, pp. 136–145. Springer Berlin Heidelberg (2002) 2. Bojarczuk, C.C., Lopes, H.S., Freitas, A.A.: Discovering comprehensible classifi- cation rules by using genetic programming: a case study in a medical domain. In: Banzhaf, W., Daida, J., Eiben, A.E., Garzon, M.H., Honavar, V., Jakiela, M., Smith, R.E. (eds.) Proceedings of the Genetic and Evolutionary Computation Con- ference. vol. 2, pp. 953–958. Morgan Kaufmann, Orlando, Florida, USA (1999) 3. Budanitsky, A., Hirst, G.: Evaluating WordNet-based Measures of Lexical Semantic Relatedness. Computational Linguistics 32(1), 13–47 (2006) 4. Cai, Y., Cercone, N., Han, J.: Attribute-oriented induction in relational databases. In: Knowledge Discovery in Databases, pp. 213–228. AAAI/MIT Press (1991) 5. Cilibrasi, R., Vitányi, P.M.B.: The google similarity distance. CoRR ab- s/cs/0412098 (2004) 6. Clark, P., Niblett, T.: The CN2 Induction Algorithm. Machine Learning 3(4), 261– 283 (1989) 62 A. Gabriel, H. Paulheim and F. Janssen 7. Falco, I.D., Cioppa, A.D., Tarantino, E.: Discovering interesting classification rules with genetic programming. Applied Soft Computing 1(4), 257 – 269 (2002) 8. Fellbaum, C.: WordNet. Wiley Online Library (1999) 9. Freitas, A.: On rule interestingness measures. Knowledge-Based Systems 12(56), 309 – 315 (1999) 10. Freitas, A.A.: Comprehensible classification models: A position paper. SIGKDD Explor. Newsl. 15(1), 1–10 (Mar 2014) 11. Freitas, A.A.: On objective measures of rule surprisingness. In: Proceedings of the Second European Symposium on Principles of Data Mining and Knowledge Discovery. pp. 1–9. PKDD ’98, Springer-Verlag, London, UK, UK (1998) 12. Fürnkranz, J.: Separate-and-Conquer Rule Learning. Artificial Intelligence Review 13(1), 3–54 (1999) 13. Fürnkranz, J., Flach, P.A.: ROC ’n’ Rule Learning - Towards a Better Understand- ing of Covering Algorithms. Machine Learning 58(1), 39–77 (January 2005) 14. Fürnkranz, J., Gamberger, D., Lavrač, N.: Foundations of Rule Learning. Springer Berlin Heidelberg (2012) 15. Hirst, G., St-Onge, D.: Lexical Chains as Representations of Context for the De- tection and Correction of Malapropisms. In: Fellbaum, C. (ed.) WordNet: An Elec- tronic Lexical Database, pp. 305–332. MIT Press (1995) 16. Ishibuchi, H., Nojima, Y.: Analysis of interpretability-accuracy tradeoff of fuzzy systems by multiobjective fuzzy genetics-based machine learning. International Journal of Approximate Reasoning 44(1), 4–31 (Jan 2007) 17. Janssen, F., Fürnkranz, J.: On the quest for optimal rule learning heuristics. Ma- chine Learning 78(3), 343–379 (Mar 2010) 18. Janssen, F., Zopf, M.: The SeCo-framework for rule learning. In: Proceedings of the German Workshop on Lernen, Wissen, Adaptivität - LWA2012 (2012) 19. Jiang, J.J., Conrath, D.W.: Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In: Proceedings of International Conference on Research in Computational Linguistics (ROCLING X). pp. 19–33. No. Rocling X (1997) 20. Kintsch, W., Van Dijk, T.A.: Toward a model of text comprehension and produc- tion. Psychological review 85(5), 363 (1978) 21. Lin, D.: An Information-Theoretic Definition of Similarity. In: ICML. pp. 296–304 (1989) 22. Michalski, R.S.: A theory and methodology of inductive learning. Artificial Intel- ligence 20(2), 111–162 (1983) 23. Miller, G.a., Leacock, C., Tengi, R., Bunker, R.T.: A Semantic Concordance. In: Proceedings of the workshop on Human Language Technology. pp. 303–308. 
Asso- ciation for Computational Linguistics, Morristown, NJ, USA (1993) 24. Noda, E., Freitas, A., Lopes, H.: Discovering interesting prediction rules with a genetic algorithm. In: Proceedings of the 1999 Congress on Evolutionary Compu- tation. pp. 1322–1329. IEEE (1999) 25. Otero, F.E., Freitas, A.A.: Improving the interpretability of classification rules dis- covered by an ant colony algorithm. In: Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation. pp. 73–80. GECCO ’13, ACM, New York, NY, USA (2013) 26. Parpinelli, R.S., Lopes, H.S., Freitas, A.A.: Data mining with an ant colony opti- mization algorithm. IEEE Transactions on Evolutionary Computation 6(4), 321– 332 (August 2002) 27. Paulheim, H.: Generating possible interpretations for statistics from linked open data. In: 9th Extended Semantic Web Conference (ESWC) (2012) Learning Semantically Coherent Rules 63 28. Paulheim, H.: Dbpedianyd – a silver standard benchmark dataset for semantic relatedness in dbpedia. In: Workshop on NLP & DBpedia (2013) 29. Resnik, P.: Using Information Content to Evaluate Semantic Similarity in a Tax- onomy. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence. vol. 1 (1995) 30. Shukla, P.K., Tripathi, S.P.: A survey on interpretability-accuracy (i-a) trade-off in evolutionary fuzzy systems. In: Watada, J., Chung, P.C., Lin, J.M., Shieh, C.S., Pan, J.S. (eds.) 5th International Conference on Genetic and Evolutionary Com- puting. pp. 97–101. IEEE (2011) 31. Smaldon, J., Freitas, A.A.: Improving the interpretability of classification rules in sparse bioinformatics datasets. In: Proceedings of AI-2006, the Twenty-sixth SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence. pp. 377–381. Research and Development in Intelligent Systems XXIII, Springer London (2007) 32. Strube, M., Ponzetto, S.P.: WikiRelate! Computing semantic relatedness using Wikipedia. In: In Proceedings of the 21st National Conference on Artificial Intel- ligence. pp. 1419–1424. No. February, AAAI Press (2006) 33. Vellido, A., Martn-Guerrero, J.D., Lisboa, P.J.G.: Making machine learning mod- els interpretable. In: 20th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (2012) Text Classification Using Association Rules, Dependency Pruning and Hyperonymization Yannis Haralambous and Philippe Lenca Institut Mines Telecom, Telecom Bretagne, UMR CNRS 6285 Lab-STICC Technopôle Brest Iroise CS 83818, 29238 Brest Cedex 3, France. @telecom-bretagne.eu Abstract. We present new methods for pruning and enhancing item- sets for text classification via association rule mining. Pruning methods are based on dependency syntax and enhancing methods are based on replacing words by their hyperonyms of various orders. We discuss the impact of these methods, compared to pruning based on tfidf rank of words. Introduction Automatic text classification is an important text mining task, due to the huge number of text documents that we have to manage daily. Text classification has a wide variety of applications such as Web document and email classification. Indeed, most of the Web news services daily provide a large number of articles making them impossible to be organized manually [14]. Automatic subject clas- sification [9] and SPAM filtering [18] are two additional examples of the interest of automatic text classification. Automatic text classification can be defined as below. 
Given a set of documents such that each document is labeled with a class value, learn a model that assigns a document with unknown class to one or more particular classes. This can also be done by assigning a probability value to each class or by ranking the classes. A wide variety of classical machine learning techniques have been used for text classification. Indeed, texts may be represented by word frequency vectors, and thus most quantitative data methods can be used directly on the notorious "bag-of-words" model (cf. [27,3]). Choosing a classifier is a multicriteria problem. In particular, one often has to make a trade-off between accuracy and comprehensibility. In this paper, we are interested in both criteria, with a deeper interest in comprehensibility. We are thus interested in rule-based approaches and especially in class association rule algorithms. Several studies have already successfully considered association rule-based approaches in text mining (e.g., [4], [29], [7], [25]). This framework is suitable for considering some statistical characteristics (e.g., high dimensionality, sparsity, . . . ) of the bag-of-words model, where a document is represented as a set of words with their associated frequency in the document.
In: P. Cellier, T. Charnois, A. Hotho, S. Matwin, M.-F. Moens, Y. Toussaint (Eds.): Proceedings of DMNLP, Workshop at ECML/PKDD, Nancy, France, 2014. Copyright © by the paper's authors. Copying only for private and academic purposes.
However, a text is more than a set of words and their frequencies. Enhancing the bag-of-words approach with linguistic features has also attracted several studies (e.g., [12,11,13,22], [23,16,10], [21,6]). We propose here a class association rule based approach enriched by linguistic knowledge. The paper is organized as follows: after introducing the techniques we are going to use (class association rules § 1.1, dependencies § 1.2, hyperonymization § 1.3), we describe our main algorithms (for training § 2.1, classifying § 2.2 and evaluating § 2.3); then follows the experimental section, where we give results obtained by tfidf pruning § 3.2, dependency-based pruning § 3.3 and hyperonymization § 3.4; finally, we end with a conclusion and perspectives for future work § 4.
1 Proposed model for text classification
Let a corpus be a set C = {D1, . . . , Dn} of documents. Let C be a set of classes. An annotated corpus is a pair (C, class) where class : C → C is a function that maps each document Di to a (predefined) class of C.
A document D ∈ C is a set of sentences S. The corpus C can be considered as a set of sentences S = {S1, . . . , Sm} if we go through the forgetful functor (which forgets the document to which the sentence belongs). Repeated sentences in the same document, or identical sentences in different documents, are considered as distinct, i.e., there is a function ι : S → C which restores the forgotten information. We extend the class function to S by class(S) := class(ι(S)). A sentence S is a sequence of words w (sometimes we will consider S simply as a set, without changing the notation). Let W = ∪_{S∈S} ∪_{w∈S} {w} be the set of all words of C.
1.1 Class association rules and text classification
Let I be a set of objects called items and C a set of classes. A transaction T is a pair ({i1, . . . , in}, c), where {i1, . . . , in} ⊆ I and c ∈ C. We denote by T the set of transactions, by items(T) the set of items (or "itemset") of T and by class(T) the class of T.
Let I be an itemset. The support of I is defined by supp(I) := #{T ∈ T | I ⊆ items(T)} / #T. Let σ ∈ [0, 1] be a value called minimum support. An itemset I is called frequent if its support exceeds σ. The confidence of a transaction t is defined as conf(t) := #{T ∈ T | items(t) ⊆ items(T) ∧ class(t) = class(T)} / #{T ∈ T | items(t) ⊆ items(T)}. Let κ ∈ [0, 1] be a value called minimum confidence. A class association rule (or "CAR") r = ({i1, . . . , in}, c) [15] is a transaction with a frequent itemset and a confidence exceeding κ.
To classify text with CARs, we consider words as being items, documents as being itemsets, and pairs of documents and classes as being transactions. The advantage of this technique is that CARs can be easily understood and hence potentially improved by the user, especially if the classifier is tuned so that it produces a humanly reasonable number of rules. Once the classifier is trained, to classify a new sentence we first find all CARs whose items are contained in the sentence, and then use an aggregation technique to choose a predominant class among those of the CARs we found.
An important issue with CARs is that the complexity is exponential with respect to the itemset size, and hence we need to keep it bounded in specific ranges, independently of the size of the documents to classify. Using entire documents as transactions is computationally out of reach, therefore pruning techniques play an important role. Our approach consists in (a) restricting CARs to the sentence level, (b) pruning sentences by using morphosyntactic information (cf. § 1.2), and (c) modifying itemsets using semantic information (cf. § 1.3).
1.2 Itemset pruning using dependencies
One can prune sentences either by using word frequencies (cf. § 3.2) or by using information obtained by morphosyntactic parsing (cf. § 3.3). In this paper we introduce the latter approach, in the framework of dependency grammar. Dependency grammar [28,19] is a syntactic theory, alternative to the phrase-structure analysis [8] which is traditionally taught in primary and secondary education. In phrase-structure syntax, trees are built by grouping words into "phrases" (with the use of intermediate nodes NP, VP, etc.), so that the root of the tree represents the entire sentence and its leaves are the actual words. In dependency grammar, trees are built using solely words as nodes (without introducing any additional "abstract" nodes). A single word in every sentence becomes the root (or head) of the tree. An oriented edge between two words is a dependency and is tagged by a representation of some (syntactic, morphological, semantic, prosodic, etc.) relation between the words. For example, in the sentence "John gives Mary an apple," the word "gives" is the head of the sentence and we have the following four dependencies: nsubj(gives, John), iobj(gives, Mary), dobj(gives, apple) and det(apple, an), where the tags nsubj, dobj, iobj, det denote "noun subject," "direct object," "indirect object" and "determinant."
Let S be a sentence and D be the set of dependency tags: {nsubj, ccomp, prep, dobj, . . . }. A dependency is a triple (w1, w2, d) where w1, w2 ∈ S and d ∈ D. Let Dep(S) denote the set of dependencies of S and root(S) the head of S. Pruning will consist in defining a morphosyntactic constraint φ, i.e. a condition on the dependencies (and POS tags) of words, the fulfillment of which is necessary for the word to be included in the itemset.
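Before moving on, the support and confidence definitions of § 1.1 can be illustrated by a small sketch (not the authors' code), in which a transaction is stored as a word set paired with a class; the toy data and thresholds below are purely illustrative.

transactions = [
    (frozenset({"match", "goal", "team"}), "GSPO"),
    (frozenset({"goal", "team", "coach"}), "GSPO"),
    (frozenset({"inflation", "bank"}), "E12"),
]

def support(itemset, transactions):
    # supp(I) = #{T | I is contained in items(T)} / #T
    return sum(1 for items, _ in transactions if itemset <= items) / len(transactions)

def confidence(itemset, c, transactions):
    # confidence of the candidate rule itemset -> c
    covering = [cls for items, cls in transactions if itemset <= items]
    return sum(1 for cls in covering if cls == c) / len(covering) if covering else 0.0

sigma, kappa = 0.5, 0.8            # minimum support and minimum confidence (illustrative)
itemset, c = frozenset({"goal", "team"}), "GSPO"
is_car = support(itemset, transactions) >= sigma and confidence(itemset, c, transactions) >= kappa
print(support(itemset, transactions), confidence(itemset, c, transactions), is_car)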
But before describing pruning algorithms and strategies, let us first present a second technique used for optimizing itemsets. This time we use semantic information. We propose to replace words by their hyperonyms, expecting that the frequencies of the latter in the itemsets will be higher than those of the former, and hence will improve the classification process. 1.3 Hyperonymization The WordNet lexical database [20] contains sets of words sharing a common meaning, called synsets, as well as semantic relations between synsets, which we will use to fulfill our goal. More specifically, we will use the relations of hyperonymy and of hyperonymic instance. The graph having synsets as nodes, and hyperonymic relations as edges, is connected and rooted: starting with an arbitrary synset, one can iterate these two relations until attaining a sink. Note that in the case of nouns it will invariably be the synset 00001740 {entity} while for verbs there are approx. 550 different verb sinks. Let W be the WordNet lexical database, s ∈ W a synset and h : W → 2W the hyperonymic or hyperonymic instance relation. We define an hyperonymic chain CH(s) as a sequence (si )i≥0 where s0 = s and si ∈ h(si−1 ), for all i ≥ 1. Hyperonymic chains are not unique since a given synset can have many hyper- onyms. To replace a word by the most pertinent hyperonym, we have to identify the most significant hyperonymic chains of it. The wn-similarity project [24] has released synset frequency calculations based on various corpora. Let lf(s) denote the logarithmic frequency of synset s in the BNC English language corpus [2] and let us arbitrarily add infinitesi- mally small values to the frequencies so that they become unique (s 6= s0 ⇒ lf(s) 6= lf(s0 )). We use frequency as the criterion for selecting a single hyper- onymic chain to represent a given synset, and hence define the most significant hyperonymic chain MSCH(s) as the hyperonymic chain (si )i≥0 of s such that si = arg maxs∈h(si−1 ) lf(s), for all i ≥ 1. The chain MSCH(s) is unique thanks to the uniqueness of synset frequencies. Our CARs are based on words, not synsets. Hence we need to extend MSCHs to words. Let w be a lemmatized word. We denote by Synsets(w) ⊂ W the set of synsets containing w. If the cardinal #(Synsets(w)) > 1 then we apply a standard disambiguation algorithm to find the most appropriate synset sw for w in the given context. Then we take (si )i = MSCH(sw ) and for each synset si in this chain we define hi (w) = proj1 (si ) (i > 0), that is the projection of si to its first element, which by WordNet convention is the most frequent word in the synset. The function vector h∗ : W → W (with h0 ≡ Id) is called hyperonymization, and hi (w) is the i-th order hyperonym of w. Text Classification Using Association Rules 69 Algorithm 1: Training Data: An annotated corpus C, values of minimum support σ and minimum confidence κ Result: A set of CARs R = ({R1 , . . . , RN }, conf) where items(Ri ) ⊂ W, class(Ri ) ∈ C, and conf(Ri ) is the confidence of rule Ri Train(C, σ, κ): S := forgetful(C); S0 := ∅; for S ∈ S do S 0 := Hyperonymize (Prune (Lemmatize (S))); class(S 0 ) := class(ι(S)); S0 := S0 ∪ {S 0 }; end R := Apriori (S0 , σ, κ); end 2 Operational implementations for document classification Our text classifier operates by first training the classifier on sentences and then classifying the documents by aggregating sentence classification. These two pro- cedures are described in Sections 2.1 and 2.2 respectively. 
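The chain construction can be sketched with NLTK's WordNet interface. This is only an illustration of the idea, not the authors' implementation: the paper's lf() relies on the wn-similarity BNC counts, which are replaced here by summed SemCor lemma counts, and word-sense disambiguation is reduced to taking the first sense.

from nltk.corpus import wordnet as wn

def lf(synset):
    # Stand-in for the paper's BNC-based logarithmic synset frequency.
    return sum(lemma.count() for lemma in synset.lemmas())

def msch(synset):
    # Follow, at each step, the most frequent hyperonym (or hyperonymic instance).
    chain = [synset]
    while True:
        ups = synset.hypernyms() + synset.instance_hypernyms()
        if not ups:
            return chain                     # reached a sink such as {entity}
        synset = max(ups, key=lf)
        chain.append(synset)

def hyperonym(word, order, pos=wn.NOUN):
    # i-th order hyperonym h_i(w): first lemma of the i-th synset on the chain.
    senses = wn.synsets(word, pos=pos)
    if not senses:
        return word                          # unknown to WordNet: keep the word
    chain = msch(senses[0])                  # a real system would disambiguate here
    if order >= len(chain):
        return word                          # no hyperonym of that order
    return chain[order].lemmas()[0].name()

print(hyperonym("poodle", 1), hyperonym("poodle", 2))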
Algorithm 1: Training
Data: An annotated corpus C, values of minimum support σ and minimum confidence κ
Result: A set of CARs R = ({R1, . . . , RN}, conf) where items(Ri) ⊂ W, class(Ri) ∈ C, and conf(Ri) is the confidence of rule Ri
Train(C, σ, κ):
    𝒮 := forgetful(C); 𝒮′ := ∅;
    for S ∈ 𝒮 do
        S′ := Hyperonymize(Prune(Lemmatize(S)));
        class(S′) := class(ι(S));
        𝒮′ := 𝒮′ ∪ {S′};
    end
    R := Apriori(𝒮′, σ, κ);
end
2 Operational implementations for document classification
Our text classifier operates by first training the classifier on sentences and then classifying the documents by aggregating sentence classifications. These two procedures are described in Sections 2.1 and 2.2 respectively. The evaluation procedure is presented in Section 2.3.
2.1 Training
The Train algorithm (cf. Alg. 1) takes as input an annotated corpus C and values of minimum support σ and minimum confidence κ. It returns a set of CARs together with their confidence values. The first part of the algorithm consists in processing the corpus, to obtain efficient and reasonably sized transactions. Three functions are applied to every sentence:
1. Lemmatize is standard lemmatization: let P be the set of POS tags of the TreeTagger system [26] (for example, NP stands for "proper noun, singular", VVD stands for "verb, past tense", etc.), and let W′ be the set of lemmatized forms of W (for example, "say" is the lemmatized form of "said"); then we define λ : W → (W ∪ W′) × P, which sends a word w to the pair (w′, p) where w′ is the lemmatized form of w (or w itself, if the word is unknown to TreeTagger) and p is its POS tag.
2. Prune is a function which prunes the lemmatized sentence so that only a small number of (lemmatized) words (and POS tags) remains. Several sentence pruning strategies will be proposed and compared (cf. § 3.2 and 3.3).
3. Hyperonymize is a function which takes the words in the pruned itemset and replaces them by the members of their most significant hyperonymic chains. Several strategies will also be proposed and compared (cf. § 3.4).
Algorithm 2: Classification
Data: A set of CARs R, a document D′
Result: The predicted class predclass(D′), variety β, dispersion ∆
Classify(R, D′):
    for S ∈ D′ do
        if ∃r ∈ R such that items(r) ⊂ S then
            R_S := arg max_{r ∈ R ∧ items(r) ⊂ S} conf(r);
        end
    end
    predclass(D′) := arg max_{c ∈ C} Σ_{S ∈ D′, class(R_S) = c} conf(R_S);
    β := #{c ∈ C | (class(R_S) = c) ∧ (conf(R_S) > 0)};
    ∆ := max_{c ∈ C} Σ_{Si ∈ D′, class(R_Si) = c} conf(R_Si) − min_{c ∈ C} Σ_{Si ∈ D′, class(R_Si) = c} conf(R_Si);
end
The second part of Alg. 1 uses the apriori algorithm [5] with the given values of minimum support and minimum confidence, and output restrictions so as to generate only rules with an item c ∈ C in the consequent. It returns a set R of CARs and their confidence. It should be noted that this algorithm operates on individual sentences, thereby ignoring the document level.
2.2 Classification
The Classify algorithm (cf. Alg. 2) uses the set of CARs produced by Train to predict the class of a new document D′ and furthermore provides two values measuring the quality of this prediction: variety β and dispersion ∆. The first part of the algorithm takes each sentence S of the document D′ and finds the most confident CAR that can be applied to it (i.e., such that the itemset of the rule is entirely contained in the itemset of the sentence). At this stage we have, for every sentence: a rule, its predicted class and its confidence.
Our basic unit of text in Train is the sentence, therefore the CARs generated by Alg. 1 produce a class for each sentence of D′. An aggregation procedure is thus needed in order to classify the document. This is done by taking, class by class, the sum of the confidences of the rules and selecting the class with the highest sum. Although this simple class-weighted sum decision strategy is reasonable, it is not perfect and may lead to wrong classifications. This strategy will be most reliable and robust if (a) the number of classes is minimal, and (b) the values obtained when summing up the confidences of the rules are sufficiently spread apart. The degree of fulfillment of these two conditions is given by the parameters variety β (the number of classes for which we have rules) and dispersion ∆ (the gap between the most confident class and the least confident one). These parameters will contribute to the comparison among the different approaches we will investigate.
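As a compact sketch of this aggregation step (again, not the authors' code), assume each rule is available as an (itemset, class, confidence) triple and a document is a list of sentence itemsets; the toy rules below are invented for illustration.

from collections import defaultdict

rules = [
    (frozenset({"goal", "team"}), "GSPO", 0.9),
    (frozenset({"election"}), "GPOL", 0.7),
]

def classify(rules, document):
    # Class-weighted confidence sum over the most confident rule of each sentence.
    score = defaultdict(float)
    for sentence in document:
        matching = [r for r in rules if r[0] <= sentence]
        if not matching:
            continue
        _, cls, conf = max(matching, key=lambda r: r[2])
        score[cls] += conf
    if not score:
        return None, 0, 0.0
    predicted = max(score, key=score.get)
    variety = len(score)                                  # classes for which we have rules
    dispersion = max(score.values()) - min(score.values())
    return predicted, variety, dispersion

doc = [frozenset({"the", "team", "scored", "goal"}), frozenset({"election", "campaign"})]
print(classify(rules, doc))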
Algorithm 3: Evaluation
Data: An annotated corpus C, initial values of minimal support σ0 and confidence κ0, standard number of rules ρ0
Result: Values of average precision P, recall R and F-measure F; values of average number of rules ρ, variety β and dispersion ∆
SingleEvaluate(C, σ, κ):
    (C1, . . . , C10) := Partition(Shuffle(C), 10);   /* tenfold cross validation */
    for I ∈ {1, . . . , 10} do
        (R_I, β_I, ∆_I) := Train(C \ C_I, σ, κ);
        for D ∈ C_I do
            predclass(D) := Classify(R_I, D);
        end
        for c ∈ C do
            R_I(c) := #{d ∈ C_I | (predclass(d) = c) ∧ (class(d) = c)} / #{d ∈ C_I | class(d) = c};
            P_I(c) := #{d ∈ C_I | (predclass(d) = c) ∧ (class(d) = c)} / #{d ∈ C_I | predclass(d) = c};
            F_I(c) := 2 R_I(c) P_I(c) / (R_I(c) + P_I(c));
        end
    end
    for c ∈ C do
        (R(c), P(c), F(c)) := (1/10) Σ_{I=1..10} (R_I(c), P_I(c), F_I(c));
    end
    (ρ, β, ∆) := (1/10) Σ_{I=1..10} (#R_I, β_I, ∆_I);
    (R, P, F) := (1/#C) Σ_{c ∈ C} (R(c), P(c), F(c));
end
Evaluate(C, σ0, κ0, ρ0):
    (σ, κ) := FindOptimal(C, σ0, κ0, ρ0);
    (R, P, F, ρ, β, ∆) := SingleEvaluate(C, σ, κ);
end
2.3 Evaluation
We evaluate the classifier (Alg. 3) by using 10-fold cross validation to obtain average values of recall, precision, F-measure, variety and dispersion. This is done by the algorithm SingleEvaluate, once we specify values of minimal support and minimal confidence. Comparing rule-based classification methods is problematic because one can always increase F-measure performance by increasing the number of rules, which results in overfitting. To avoid this phenomenon and compare methods in a fair way, we fix a number of rules ρ0 (we have chosen ρ0 = 1,000 in order to produce a humanly readable set of rules) and find values of minimal support and confidence so that the F-measure is maximal under this constraint. The function FindOptimal launches SingleEvaluate as many times as necessary on a dynamic grid of values (σ, κ) (starting with the initial values (σ0, κ0)), so that, at the end, the number of rules produced by Train is as close as possible to ρ0 (we have used #R ∈ [ρ0 − 2, ρ0 + 2]) and F is maximal.
Algorithm 4: Tfidf-based corpus pruning
Data: An annotated corpus (considered as a set of sentences) 𝒮
Result: The pruned corpus 𝒮′
Prune(𝒮, N):
    𝒮′ := ∅;
    for S ∈ 𝒮 do
        for w ∈ S do
            Tfidf_S(w) := freq_S(w) · log(#{S ∈ 𝒮} / #{S ∈ 𝒮 | w ∈ S});
        end
        S′ := ∅; S0 := S;
        for i ∈ {1, . . . , N} do
            w′ := arg max_{w ∈ S0} Tfidf_S(w);
            S0 := S0 \ {w′};
            S′ := S′ ∪ {w′};
        end
        𝒮′ := 𝒮′ ∪ {S′};
    end
end
3 Experimental results on Reuters corpus
In this section, we investigate three methods: (a) pruning through a purely frequentist method, based on the tfidf measure (§ 3.2); (b) pruning using dependencies (§ 3.3); (c) pruning using dependencies followed by hyperonymic extension (§ 3.4).
3.1 Preliminaries
In the Reuters [1] corpus we have chosen the 7 most popular topics (GSPO = sports, E12 = monetary/economic, GPOL = domestic politics, GVIO = war, civil war, GDIP = international relations, GCRIM = crime, law enforcement, GJOB = labor issues) and extracted the 1,000 longest texts of each. The experimental document set is thus a corpus of 7,000 texts of length between 120 and 3,961 words (mean 398.84, standard deviation 169.05).
The texts have been analyzed with the Stanford Dependency Parser [17] in collapsed mode with propagation of conjunct dependencies.
3.2 Tfidf-based corpus pruning
Tfidf-based corpus pruning consists in using a classical Prune function as defined in Alg. 4. It will be our baseline for measuring the performance of the dependency- and hyperonymy-based methods. Note that this definition of the tfidf measure diverges from the legacy one by the fact that we consider not documents but sentences as basic text units. This is because we compare tfidf-generated CARs to those using syntactic information, and syntax is limited to the sentence level. Therefore, in order to obtain a fair comparison, we have limited term frequency to the sentence level and our "document frequency" is in fact a sentence frequency.
Having calculated Tfidf_S(w) for every w ∈ S ∈ 𝒮, we take the N words of each sentence with the highest tfidf values, and use them as transaction items. The performance of this method depends on the value of N. On Fig. 1 the reader can see the values of three quantities as functions of N:
1. F-measure: we see that the F-measure increases steadily and reaches a maximum value of 83.99 for N = 10. Building transactions of more than 10 words (in decreasing tfidf order) deteriorates performance, in terms of F-measure;
2. variety: the number of predicted classes for sentences of the same document progressively increases but globally remains relatively low, around 3.1, except for N = 12 and N = 13 where it reaches 4.17;
3. dispersion: it increases steadily, with again a small outlier for N = 12, probably due to the higher variety obtained for that value of N.
Fig. 1. F-measure, variety and dispersion of tfidf-based pruning methods as a function of the number of words kept in the corpus
Furthermore, each investigated method will generate transactions of various sizes. It is fair to compare them with tfidf-based methods of similar transaction sizes. Therefore we will use the results displayed in Fig. 1 to compare the performance of subsequent methods with that of the tfidf-based method of similar transaction size. Table 1 presents the results obtained by applying the tfidf-based pruning method, with a single word per transaction (N = 1).
Table 1. Tfidf-based pruning, keeping a single word per transaction
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      69.30  44.48  55.44  45.75  52.54  82.90  67.98  59.77
Precision   70.09  77.81  71.25  79.76  71.62  80.78  73.35  74.95
F-measure   69.69  56.60  62.36  58.15  60.61  81.83  70.56  65.69
MinSupp=0.006, MinConf=67.6, Var.=1.36, Disp.=21.53, AvgTransSize=1.00
3.3 Methods based on dependencies
In this section we investigate several strategies using the dependency structure of sentences. Our general approach (cf. Alg. 5) keeps only the words of S that fulfill a given morphosyntactic constraint φ. The following strategies correspond to various definitions of φ.
Algorithm 5: Dependency-based corpus pruning
Data: An annotated corpus 𝒮 and a morphosyntactic constraint φ : S → {true, false}
Result: The pruned corpus 𝒮′
Prune(𝒮, φ):
    𝒮′ := ∅;
    for S ∈ 𝒮 do
        S′ := ∅;
        for w ∈ S do
            if φ(w) = true then S′ := S′ ∪ {w} end
        end
        𝒮′ := 𝒮′ ∪ {S′};
    end
end
Strategy I0. Our first strategy will be to keep only the head of each sentence (which, incidentally, is a verb in 85.37% of the sentences of our corpus). This corresponds to the constraint φ(w) ≡ (w = root(S)). Results are given in Table 2a.
Table 2a. Strategy I0: Pruning by keeping only heads of sentences
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      38.54  57.46  26.88  17.43  31.06  88.49  51.90  44.54
Precision   49.13  66.18  49.61  65.73  39.07  42.18  60.31  53.17
F-measure   43.19  61.51  34.87  27.55  34.61  57.13  55.79  44.95
MinSupp=0.004, MinConf=36.6, Var.=2.14, Disp.=47.84, AvgTransSize=1.00
Although the recall of GSPO is quite high (a possible interpretation could be that sports use very specific verbs), the F-measure is quite low when we compare it to that of the tfidf-based method of the same average itemset length, namely 65.69%.
Strategy I1. The second strategy consists in keeping words connected to the head by a (single) dependency of type nsubj (= nominal subject). This occurs in 79.84% of the sentences of our corpus. The constraint is then φ(w) ≡ (∃(w, root(S), nsubj) ∈ Dep(S)). Results are given in Table 2b.
Table 2b. Strategy I1: Pruning by keeping only nsubj → head dependencies
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      70.55  60.44  66.97  58.27  63.22  78.76  69.92  66.88
Precision   73.05  80.84  72.87  86.98  71.71  85.29  76.29  78.15
F-measure   71.78  69.17  69.80  69.79  67.19  81.89  72.97  71.80
MinSupp=0.007, MinConf=60.4, Var.=1.43, Disp.=41.71, AvgTransSize=1.04
Note that the slightly higher than 1 transaction size is probably due to the rare cases where more than one nsubj dependency points to the head. The scores rise dramatically when compared to those of the strategy based only on the head of the sentence. The average F-measure (71.80%) is significantly higher than the tfidf-based performance for the same average transaction size (65.69%). This shows that using a dependency property to select a word is a better choice than the one provided by the frequentist tfidf-based method. Note that the case of nsubj is unique: if we take ccomp (= clausal complement) instead of nsubj, the performance falls even below the level of strategy I0 (Table 2c).
Table 2c. Strategy I′1: Pruning by keeping only ccomp → head dependencies
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      57.33  33.82  25.31  16.96  21.74  47.62  59.60  37.48
Precision   37.83  47.55  38.50  42.06  34.62  57.05  54.17  44.54
F-measure   45.59  39.53  30.54  24.17  26.71  51.91  56.75  39.31
MinSupp=0.008, MinConf=34.4, Var.=1.97, Disp.=19.59, AvgTransSize=1.15
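To make the constraint notation concrete, here is a minimal sketch of an I1-style filter (not the authors' code), assuming a parsed sentence is available as its word set, its root and a set of (dependent, governor, tag) triples; the data structures are illustrative.

def prune(sentence_words, root, dependencies, phi):
    # Keep only the words that satisfy the morphosyntactic constraint phi.
    return {w for w in sentence_words if phi(w, root, dependencies)}

def phi_nsubj(word, root, dependencies):
    # Strategy I1: the word must be attached to the head by an nsubj dependency.
    return (word, root, "nsubj") in dependencies

words = {"John", "gives", "Mary", "an", "apple"}
deps = {("John", "gives", "nsubj"), ("Mary", "gives", "iobj"),
        ("apple", "gives", "dobj"), ("an", "apple", "det")}
print(prune(words, "gives", deps, phi_nsubj))   # {'John'}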
Strategy I2. The third strategy considers all nouns (POS tags starting with N) at distance 1 from the head in the dependency graph. Such dependencies occur in 59.24% of the sentences of our corpus. This corresponds to φ(x) ≡ ((∃(x, root(S), d) ∈ Dep(S)) ∧ (POS(x) = N∗)). Results are given in Table 2d.
Table 2d. Strategy I2: Pruning by keeping only nouns at distance 1 from the head
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      80.75  75.92  73.24  68.59  70.59  95.55  77.96  77.51
Precision   73.21  83.51  75.35  89.86  73.67  80.52  77.36  79.07
F-measure   76.80  79.53  74.28  77.80  72.09  87.39  77.66  77.94
MinSupp=0.016, MinConf=51.6, Var.=2.43, Disp.=244.82, AvgTransSize=2.70
The result seems better than that of strategy I1 (Table 2b). However, if we take transaction size into account, it is in fact merely equivalent to (and hence not better than, as was the case for I1) the tfidf-based method with the same transaction size. Once again we see a very high recall rate for the sports category.
One could be tempted to check the performance of taking verbs (instead of nouns) at distance 1 from the head. Indeed, verbs at that position are more frequent than nouns: they occur in 62.94% of the sentences of our corpus. Nevertheless, the results are not as good (Table 2e). This shows that despite their high frequency, verbs contain less pertinent information than nouns at the same distance from the head.
Table 2e. Strategy I′2: Pruning by keeping only verbs at distance 1 from the head
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      54.32  62.50  44.37  20.41  27.58  91.39  67.68  52.61
Precision   49.58  65.98  48.44  78.39  46.69  43.57  63.84  56.64
F-measure   51.84  64.19  46.32  32.39  34.68  59.01  65.70  50.59
MinSupp=0.019, MinConf=30, Var.=4.00, Disp.=175.41, AvgTransSize=2.01
3.4 Methods based on dependencies and hyperonyms
In this section we add semantic information by means of hyperonyms, using the hyperonymization function h (§ 1.3). The preprocessing is done by Alg. 6: hN(w) is the N-th order hyperonym of w, if it exists in WordNet. In case there is no N-th order hyperonym, the word remains unchanged. We call N the hyperonymic factor of our itemset transformation.
Algorithm 6: Corpus hyperonymization
Data: A dependency-pruned corpus 𝒮′, a hyperonymic function MSCH : W → W^N, the hyperonymic order N
Result: The hyperonymically extended corpus 𝒮″
Hyperonymize(𝒮′, MSCH, N):
    𝒮″ := ∅;
    for S′ ∈ 𝒮′ do
        S″ := ∅;
        for w ∈ S′ do
            if proj_N(MSCH(w)) ≠ ∅ then S″ := S″ ∪ {proj_N(MSCH(w))}
            else S″ := S″ ∪ {w}
            end
        end
        𝒮″ := 𝒮″ ∪ {S″};
    end
end
Strategy II1. This strategy considers hyperonymic factor N = 1. We thus first apply strategy I1 and then hyperonymization h1. Results are presented in Table 3a.
Table 3a. Strategy II1: I1 followed by first-order hyperonymization
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      72.39  56.04  71.32  48.96  59.02  82.20  70.42  65.76
Precision   66.21  75.42  64.33  82.21  67.71  73.20  70.73  71.40
F-measure   69.16  64.30  67.64  61.37  63.07  77.44  70.57  67.65
MinSupp=0.010, MinConf=44.8, Var.=1.92, Disp.=78.47, AvgTransSize=1.04
The performance is globally inferior to that of strategy I1 (in which the F-measure attained 71.80%). It is interesting to note that the recall of class GJOB has decreased significantly (48.96% vs. 58.27%): in other words, using hyperonyms when dealing with labor issues results in a failure to recognize 9.31% of the documents as belonging to the domain; one could say that terms used in GJOB lose their "labor specificity" already at first-order hyperonymization. On the other hand, the (already high in I1) recall of GSPO has increased even more, compared to I1 (from 78.76% to 82.20%): it seems that sports terminology remains in the domain even after hyperonymization, and replacing specific terms by more general ones has increased their frequency as items, and hence improved recall.
We have the same phenomenon with the recall of GDIP (which increased from 66.97% to 71.32%), and also, slightly, with the recalls of E12 and GVIO.
Strategy II2. This strategy is similar to strategy II1 but uses hyperonymic factor N = 2. Results are presented in Table 3b.
Table 3b. Strategy II2: I1 followed by second-order hyperonymization
            E12    GCRIM  GDIP   GJOB   GPOL   GSPO   GVIO   AVG
Recall      69.02  52.78  71.97  47.77  54.24  80.80  65.67  63.18
Precision   64.94  74.57  61.23  80.64  65.81  69.96  72.33  69.93
F-measure   66.92  61.81  66.16  60.00  59.47  74.99  68.84  65.46
MinSupp=0.008, MinConf=44.8, Var.=1.85, Disp.=70.51, AvgTransSize=1.04
The performance is globally inferior to that of II1 (where we used first-order hyperonyms), with two minor exceptions: the recall of GDIP, which increased by 0.65%, and the precision of GVIO, which increased by 1.6%. What is noteworthy, however, is that the recalls of GDIP and GSPO are still higher than those of strategy I1 (no hyperonyms).
To better understand the behavior of the system when climbing the hyperonymic chain by replacing words by hyperonyms of increasingly higher order (and returning to the original word when there are no hyperonyms left), we calculated the performance for N-th order hyperonyms for 1 ≤ N ≤ 12. Note that when N > 12 the number of remaining hyperonyms is negligible and the strategy is similar to strategy I1 (no hyperonyms). On Fig. 2, the reader can see the evolution of recall (black), precision (red) and F-measure (blue) for the average over all classes, and then specifically for GSPO and for GDIP. Dashed lines represent the recall, precision and F-measure of strategy I1.
Fig. 2. F-1 measure for hyperonymization of orders 1 ≤ N ≤ 12: the average case, class GSPO, class GDIP
In the average case, the effect of hyperonymization of orders 1–4 is to decrease performance. After N = 5, the global number of hyperonyms available in WordNet rapidly decreases, so that the situation gradually returns to the one of I1 (no hyperonyms) and we see the curves asymptotically converging to the I1 lines from underneath. Not so for GSPO, whose recall curve is above the I1 value for most N (N = 1, 2, 6–8 and 10–12). The phenomenon is even better illustrated in the case of GDIP: as the reader can see on the figure, the complete GDIP recall curve is located above the I1 one. It seems that in these two cases (GDIP and, to a lesser extent, GSPO), hyperonyms of all orders have a positive impact on the classifier. Unfortunately this impact only concerns recall and is compensated by bad precision, so that the F-measure is still inferior to the I1 case.
4 Conclusion and future work
In this paper we have investigated the use of association rules for text classification by applying two new techniques: (a) we reduce the number of word features through the use of morphosyntactic criteria in the framework of dependency syntax; for that, we keep words dependent from the head by specific dependencies and/or having specific POS tags; (b) we replace words by their hyperonyms of different orders, which we have calculated from WordNet using frequencies and, in some cases, disambiguation. We have obtained positive results for case (a), in particular when we compare dependency-based single-item rules with tfidf-based ones. In case (b) the results we report are weaker but still interesting; in particular, we found classes for which hyperonymization significantly improves recall.
This work opens several perspectives, among which: — examine why these particular classes are favorable to hyperonymization, whether this is related to the structure of WordNet or to linguistic properties of the domain; — explore partial hyperonymization i.e., is it possible to hyperonymize only specific items according to the needs of the classifier?1 How do we choose, on the word level, if we should rather keep the original word (to increase precision) or switch to some hyperonym (to increase recall)? — we have used only recall and precision as quality measures of our rules, and our evaluation is strongly dependent on these measures since the selection of the 1,000 rules we keep is entirely based upon them. There are other quality 1 Indeed, by hyperonymizing all words one wins on one side and loses on the other: for example “dalmatian” and “poodle” will both be replaced by “dog”, but “dog” occurrences will be replaced by “canid”. It would be more preferable to keep the word “dog” in the second case, so that we have a real increase in frequency. Text Classification Using Association Rules 79 measures available, how do they apply and how can they be compared and combined? How robust are the results? — and finally: how can we optimize the distinctive feature of association rules, namely the fact of being intelligible by the user? How can the user’s experience (and linguistic knowledge) be incorporated in the enhancement of rules to obtain the best possible result from his/her point of view? References 1. Reuters corpus, volume 1, english language, 1996-08-20 to 1997-08-19, http:// about.reuters.com/researchandstandards/corpus/statistics/index.asp 2. British National Corpus (1994), http://www.natcorp.ox.ac.uk 3. Aggarwal, C.C., Zhai, C.: A Survey of Text Classification Algorithms, chap. 6, pp. 163–222. Mining Text Data, Springer 4. Ahonen, H., Heinonen, O., Klemettinen, M., Verkamo, A.I.: Applying data min- ing techniques in text analysis. TR C-1997-23, Department of Computer Science, University of Helsinki (1997) 5. Borgelt, C.: Efficient implementations of apriori and eclat. In: Workshop of Fre- quent Item Set Mining Implementations (FIMI 2003), Melbourne, FL, USA (2003) 6. Ferrer i Cancho, R., Solé, R.V., Köhler, R.: Patterns in syntactic dependency net- works. Physical Review E 69, 1–8 (2004) 7. Cherfi, H., Napoli, A., Toussaint, Y.: A Conformity Measure Using Background Knowledge for Association Rules: Application to Text Mining, pp. 100–115. IGI Global (2009) 8. Chomsky, N.: Syntactic structures. Mouton (1957) 9. Cohen, W.W.: Learning rules that classify e-mail. In: AAAI Spring Symposium on ML and IR (1996) 10. Curran, J.R., Moens, M.: Scaling context space. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. pp. 231–238 (2002) 11. Do, T.N.Q.: A graph model for text analysis and text mining. Master Thesis, Université de Lorraine, (2012) 12. Jaillet, S., Laurent, A., Teisseire, M.: Sequential patterns for text categorization. Intell. Data Anal. 10(3), 199–214 (2006) 13. Kovacs, L., Baksa-Varga, E.: Dependency-based mapping between symbolic lan- guage and Extended Conceptual Graph. In: 6th International Symposium on In- telligent Systems and Informatics. pp. 1–6 (2008) 14. Lang, K.: Newsweeder: Learning to filter netnews. In: International Conference on Machine Learning. pp. 331–339. Morgan Kaufmann (1995) 15. Liu, B., Hsu, W., Ma, Y.: Integrating classification and association rule mining. In: Proc. of the Int. Conf. 
on Knowledge Discovery and Data Mining (New York). pp. 80–86 (1998) 16. Lowe, W.: Towards a theory of semantic space. In: Proceedings of the Twenty-Third Annual Conference of the Cognitive Science Society. pp. 576–581 (2001) 17. de Marneffe, M.C., MacCartney, B., Manning, C.D.: Generating typed dependency parses from phrase structure parses. In: LREC 2006. pp. 449–454 (2006) 18. Mehran Sahami, Susan Dumais, D.H., Horvitz, E.: A bayesian approach to filter- ing junk email. In: AAAI Workshop on Learning for Text Categorization. AAAI Technical Report WS-98-05 (1998) 80 Y. Haralambous and P. Lenca 19. Mel0 čuk, I.A.: Dependency syntax : theory and practice. Albany: State University Press of New York (1987) 20. Miller, G.A.: WordNet: A lexical database for English. Communications of the ACM 38(11), 39–41 (1995) 21. Nivre, J.: Dependency Grammar and Dependency Parsing. MSI report 05133. School of Mathematics and Systems Engineering, Växjö University, (2005) 22. Ordoñez-Salinas, S., Gelbukh, A.: Information Retrieval with a Simplified Concep- tual Graph-Like Representation 6437(Chapter 9), 92–104 (2010) 23. Padó, S., Lapata, M.: Dependency-based construction of semantic space models. Computational Linguistics 33(2), 161–199 (2007) 24. Pedersen, T.: Information content measures of semantic similarity perform bet- ter without sense-tagged text. In: Proceedings of the 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010). pp. 329–332 (2010) 25. Roche, M., Azé, J., Matte-Tailliez, O., Kodratoff, Y.: Mining texts by association rules discovery in a technical corpus. In: Intelligent Information Processing and Web Mining. pp. 89–98. Advances in Soft Computing, Springer Verlag (2004) 26. Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: Proceed- ings of International Conference on New Methods in Language Processing, Manch- ester, UK (1994) 27. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002) 28. Tesnière, L.: Éléments de syntaxe structurale. Klincksieck (1959) 29. Zaı̈ane, O.R., Antonie, M.L.: Classifying text documents by associating terms with text categories. In: Australasian Database Conference. CRPIT, vol. 5. Australian Computer Society (2002) Learning Subgraph Patterns from text for Extracting Disease–Symptom Relationships Mohsen Hassan, Adrien Coulet, and Yannick Toussaint LORIA (CNRS, Inria NGE, Université de Lorraine), Campus scientifique, Vandoeuvre-lès-Nancy, F-54506, France {mohsen.sayed,adrien.coulet,yannick.toussaint}@loria.fr Abstract. To some extent, texts can be represented in the form of graphs, such as dependency graphs in which nodes represent words and edges represent grammatical dependencies between words. Graph repre- sentation of texts is an interesting alternative to string representation be- cause it provides an additional level of abstraction over the syntax that is sometime easier to compute. In this paper, we study the use of graph min- ing methods on texts represented as dependency graphs, for extracting relationships between pairs of annotated entities. We propose a three step approach that includes (1) the transformation of texts in a collection of dependency graphs; (2) the selection of frequent subgraphs, named here- after patterns, on the basis of positive sentences; and (3) the extraction of relationships by searching for occurrences of patterns in novel sentences. 
Our method has been tested on the extraction of disease–symptom relationships from a corpus of 51,292 PubMed abstracts (428,491 sentences) related to 50 rare diseases. The extraction of correct disease–symptom relationships has been evaluated on 565 sentences, showing a precision of 0.91 and a recall of 0.49 (the F-measure is 0.63). These preliminary experiments show the feasibility of extracting good quality relationships using frequent subgraph mining.
1 Introduction
In many domains such as biomedical research, text is a major source of information; unfortunately text corpora are frequently too large to be fully considered manually [1]. We focus here on the task of Relation Extraction (RE), which consists in identifying and qualifying valid relationships between entities already recognized in the text. Figure 1 illustrates the process of RE with an example of a relation between a disease and a symptom. First, Named Entity Recognition (NER) identifies the interesting entities in the text and annotates them with the correct category. The second step identifies whether named entities are involved in a relationship or not (and may qualify the type of the relationship). Texts may be represented at different levels: words, bags of words, sequences of words, syntactic trees, graphs (dependency graphs); and they may be enriched by some linguistic features: part of speech, syntactic or semantic features. In this paper we study how text, represented in the form of graphs, can be processed with simple graph mining methods to perform RE.
In: P. Cellier, T. Charnois, A. Hotho, S. Matwin, M.-F. Moens, Y. Toussaint (Eds.): Proceedings of DMNLP, Workshop at ECML/PKDD, Nancy, France, 2014. Copyright © by the paper's authors. Copying only for private and academic purposes.
Fig. 1: The process of Relation Extraction (RE)
Frequent Subgraph Mining (FSM) is a graph mining method that extracts frequently occurring subgraphs either from a single graph or from a set of graphs [2]. We propose in this paper to extract relationships from text through a three step method, based on FSM. The first step concerns data preparation and consists in transforming texts into graphs and recognizing named entities. The second step relies on the identification of labeled and oriented subgraphs, named hereafter patterns, that frequently connect two entities of imposed types, e.g., subgraphs connecting one disease to one of its symptoms. The third step uses the generated patterns for extracting relationships between these entities.
The paper is organized as follows: Section 2 presents background elements regarding graph mining. Section 3 introduces our three step method. Section 4 reports experiments of our method on the extraction of disease–symptom relationships. Section 5 presents related work and Section 6 discusses the interest of using graph mining for RE.
2 Graph Mining
A graph is defined as a pair G = (V, E) where V is a set of vertices (or nodes) and E is a set of edges connecting vertices such that E ⊆ V × V. A graph is a directed graph when edges are oriented pairs of vertices. A graph is a labeled graph when vertices and edges are associated with labels.
2.1 Frequent Subgraph Mining
S = (SV, SE) is a subgraph of G if SV ⊆ V and SE ⊆ E. Given a graph collection G = {G1, G2, ..., Gk}, with Gi = (Vi, Ei), and a minimum support min sup, the Frequent Subgraph Mining task (denoted FSM) extracts the collection of subgraphs S = {S1, ..., Sn}, with Si = (SVi, SEi), that occur in G with a support greater than min sup. The support of a subgraph Si is the number of its occurrences in G (the relative support of Si is its absolute support divided by |G|).
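For intuition, the support of a candidate pattern over a collection of labeled directed graphs can be counted with networkx; this is only an illustration of the definition (gSpan enumerates frequent subgraphs far more efficiently than pairwise isomorphism tests), and the toy dependency-like graphs below are invented.

import networkx as nx
from networkx.algorithms import isomorphism

def make_graph(edges):
    g = nx.DiGraph()
    for head, dep, tag in edges:
        g.add_node(head, label=head)
        g.add_node(dep, label=dep)
        g.add_edge(head, dep, label=tag)
    return g

corpus = [
    make_graph([("characterize", "DISEASE", "nsubjpass"), ("characterize", "SYMPTOM", "agent")]),
    make_graph([("accompany", "DISEASE", "nsubjpass"), ("accompany", "SYMPTOM", "agent")]),
    make_graph([("cause", "DISEASE", "nsubj"), ("cause", "SYMPTOM", "dobj")]),
]
pattern = make_graph([("characterize", "DISEASE", "nsubjpass"), ("characterize", "SYMPTOM", "agent")])

def support(pattern, graphs):
    # A graph counts toward the support if the pattern occurs in it as a
    # (not necessarily induced) labeled subgraph.
    nm = isomorphism.categorical_node_match("label", None)
    em = isomorphism.categorical_edge_match("label", None)
    return sum(1 for g in graphs
               if isomorphism.DiGraphMatcher(g, pattern, node_match=nm, edge_match=em)
                  .subgraph_is_monomorphic())

print(support(pattern, corpus))   # 1: only the first graph contains the pattern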
FSM algorithms are mainly based on two distinct approaches: Apriori-based and pattern-growth-based approaches. Apriori-based graph mining algorithms share similarities with Apriori-based frequent itemset mining algorithms [3]. In their case, the search for frequent subgraphs starts with graphs with no edge. At each iteration, the size of the newly discovered frequent substructures is increased by one by joining two subgraphs from the previous iteration. AGM, FSG and FFSM are examples of Apriori-based algorithms [2,4,5]. Pattern-growth mining algorithms extend a frequent graph by successively trying to add a new edge at every possible position. If the new graph is frequent, it can be extended further; if it is not frequent, another edge is tried. gSpan [6], CloseGraph [7] and Gaston [8] are examples of pattern-growth algorithms.
2.2 gSpan
gSpan is an FSM algorithm that processes undirected labeled graphs. Given a collection of such graphs, gSpan returns the set of frequent subgraphs and their support. To generate this result, gSpan builds a Tree Search Space (TSS) that is composed of all trees and subtrees that lie in the collection of graphs. gSpan represents each tree of the TSS using a specific encoding, named the minimum Depth-First Search (DFS) Code. This code is unique for each tree because it is constructed following the unique DFS traversal that follows the lexicographic order of vertex labels. gSpan follows a pattern-growth mining approach, i.e., it extends, at each iteration, a frequent graph with a new edge, trying every potential position. An issue with this approach is that the same graph can be discovered several times from different frequent graphs. gSpan avoids this problem by introducing a right-most extension technique, where edge extensions only take place at a specific position determined by the DFS Codes.
3 Relationship Extraction using Frequent Subgraph Mining
We propose an original method based on FSM to extract relationships from text. Figure 2 depicts an overview of this three step method. Each step is detailed in the next subsections.
Fig. 2: Overview of our Relation Extraction (RE) method
3.1 Data Preparation
This step aims at transforming a collection of texts into a collection of Dependency Graphs (DG). To achieve this, texts are submitted to the following tasks: Sentence Splitting, NER and Named Entity (NE) Substitution, Sentence Filtering, Dependency Parsing and Lemmatization. First, texts are split into sentences. Then, NEs are recognized. We focused on relations between diseases and symptoms. Thus, we replaced each occurrence of these entities by the corresponding generic word "DISEASE" or "SYMPTOM". Sentences are filtered to keep those involving at least two entities of interest. Dependency parsing produces for each sentence one labeled directed graph, named DG. Such a DG is made of vertices that represent words and of edges that are grammatical dependencies between words.
Figure 3 shows the dependency graph of the sentence "DMD is a neuromuscular disease characterized by progressive weakness."
Fig. 3: Example of Dependency Graph (DG) processed by the Stanford Parser and drawn with Brat
Finally, words in the DG are replaced by their lemmas, i.e., by a more general form that is more likely to appear in other graphs. Figure 4 shows an example of a DG resulting from the data preparation step.
Fig. 4: Example of Dependency Graph after replacement of recognized entities (diseases and symptoms) by generic words (DISEASE and SYMPTOM) and lemmatization
The collection of formatted DGs is the input to FSM for mining the most frequent subgraph patterns that preserve the relations between two named entities.
3.2 Pattern Extraction
Frequent Subgraph Mining (FSM) aims at extracting useful patterns for the relation extraction process. Given a set of DGs and a support threshold, gSpan extracts undirected subgraph patterns. These patterns give the relationships between the interesting annotated entities. Figure 5 shows an example of such a pattern, extracted from the graph in Figure 4. This subgraph pattern gives the relation, based on grammatical dependencies, between the disease "DMD" and the symptom "weakness".
Fig. 5: Example of extracted pattern
Then, the patterns that contain the following are excluded: (1) a conj_and or conj_or dependency relation between any two nodes; (2) a dependency path for DISEASE equal to the dependency path for SYMPTOM, which means that DISEASE and SYMPTOM have the same semantic role in the sentence and might indicate an NER error; (3) no node for DISEASE or for SYMPTOM (at least one disease and one symptom must exist). Figure 6 shows an example of such excluded patterns. These patterns can be discovered from a sentence like "this disease is characterized by DISEASE and SYMPTOM" (the uppercase words are the generic words for NEs).
Fig. 6: Two examples of excluded patterns
Bunescu and Mooney proposed a kernel method that used the shortest path between the two entities in the undirected version of the dependency graph [9]. We similarly propose to compute the shortest path, but from the directed dependency graph, which is useful for expressing the direction of the relation between entities and consequently gives more precise relations. Two paths with the same sequence of syntactic dependency labels are similar only if the directions of the syntactic dependencies are the same. Hence, a shortest path method (SPM) has been used to extract a smaller set of patterns than gSpan produces [9]. It consists in extracting the shortest path between two entities (e.g., a disease and a symptom) in a dependency graph. Bunescu and Mooney used words and POS tags for expressing their patterns, but in SPM we consider the whole subgraph. Figure 7 shows the shortest path between the two entities DISEASE and SYMPTOM in the dependency graph.
Fig. 7: The shortest path between DISEASE and SYMPTOM
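A minimal sketch of this shortest-path step with networkx, loosely following the Figure 4 example: the graph below is hand-built for illustration, the path is searched on the undirected view of the directed DG, and each edge keeps its original direction and label, as in SPM patterns.

import networkx as nx

dg = nx.DiGraph()
edges = [("disease", "DISEASE", "nsubj"), ("disease", "be", "cop"),
         ("disease", "a", "det"), ("disease", "characterize", "partmod"),
         ("characterize", "SYMPTOM", "agent"), ("SYMPTOM", "progressive", "amod")]
for gov, dep, tag in edges:
    dg.add_edge(gov, dep, label=tag)

def shortest_path_pattern(dg, source="DISEASE", target="SYMPTOM"):
    nodes = nx.shortest_path(dg.to_undirected(as_view=True), source, target)
    pattern = []
    for u, v in zip(nodes, nodes[1:]):
        if dg.has_edge(u, v):
            pattern.append((u, dg[u][v]["label"], v))   # edge kept in its direction
        else:
            pattern.append((v, dg[v][u]["label"], u))
    return pattern

print(shortest_path_pattern(dg))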
Given the following two annotated sentences, "DISEASE is a disease characterized by SYMPTOM" and "DISEASE is an anomaly accompanied by SYMPTOM", SPM first gets the graph of each sentence, as shown in Figure 8. Then, SPM computes the common shortest path from the graphs of the two sentences. If the values of the nodes in the pattern are different, their values are replaced by "*", keeping a list of all possible values for each node. Hence, two graph patterns can be merged and represented in one generalized pattern (there is no redundancy, because all redundant patterns are merged into one pattern). The support of the new generalized pattern is less than or equal to the sum of the supports of the two patterns. The support of the generalized pattern is automatically computed when executing the generalization operation, which makes the process run faster.
Fig. 8: Example of our method pattern
Figure 9 shows two examples of patterns, one resulting from SPM and the other resulting from gSpan. The SPM method allows every node in the subgraph pattern to hold all its possible values. This makes the pattern more general than the gSpan pattern and increases the frequency value of the pattern. This has two advantages: first, it produces a smaller set of patterns than gSpan, which is easier for analysis and evaluation purposes; second, it leads to a higher coverage than gSpan, since a pattern may not be extracted by gSpan because of its low frequency. On the other hand, SPM does not use POS tags as an additional feature as proposed in [9], which makes the patterns more generic and increases their coverage but induces a lower precision. When SPM extends every node with all possible values, some values do not represent a correct relation between the annotated entities. Figure 10 shows rejected patterns that should be removed by the generalization operation to increase the quality of the patterns.
The extracted patterns are classified into two classes: positive and negative patterns. The classification is based on pattern support and quality. The quality Q of a pattern is computed by the following formula:
    Q = T / S    (1)
where T is the number of all correct sentences in the pattern extension and S is the support of the pattern. A sentence is correct if it contains the pattern and the relation identified by the pattern is correct. For example, the pattern in Figure 8 has support 23, which means that the number of sentences containing this pattern is 23. All disease–symptom relationships provided in these 23 sentences are correct. Then T = 23 and Q = 23/23; hence the quality of this pattern is 1.
The pattern is a positive pattern if its support is higher than a minimum support (min sup) threshold and its quality is higher than a minimum quality
Figure 11 presents the process of our experiments and its evaluation. Details are provided in the following subsections. Fig. 11: Overview of our experiment for the extraction of disease-symptom rela- tionships 4.1 Rare Disease Corpus Our rare disease corpus is composed of 51,292 PubMed abstracts related to 50 Orphanet4 rare diseases5 . Abstracts are obtained by querying manually PubMed, using its web user interface. The query submitted to PubMed has the fol- lowing form: “(disease1,pref name or disease1,syn1 or...or disease1,synn ) or... or (diseasek,pref name or diseasek,syn1 or...or diseasek,synm )” where diseasei,pref name and diseasei,synj are respectively referring to the preferred name and the j th synonym of disease i according to the Orphanet Rare Disease Ontology6 . 4 http://www.orpha.net 5 The 50 diseases are listed at: http://www.loria.fr/~msayed/50RareDiseases 6 http://www.bioportal.bioontology.org/ontologies/ORDO 90 M. Hassan, A. Coulet and Y. Toussaint 4.2 Building a Dependency Graph Dataset 51,292 abstracts are split in 428,941 sentences using LingPipe7 , and subsequently submitted to disease and symptom NER. We use MetaMap to annotate each sentence of the corpus using UMLS semantic types “Disease or Syndrome” and “Sign or Symptom” [10]. MetaMap annotations of some very general words like “disorder” or “symptoms” have been removed to avoid noise in the rest of the process. Annotated sentences are divided into a learning corpus, made of 90% of sentences randomly selected, and a test corpus, made of the 10% left. In the learning corpus, each recognized disease or symptom is replaced by the generic string DISEASE or SYMPTOM (which are indexed when several disease or symptom are recognized in one sentence). Sentences that does not contain at least one disease and one symptom are filtered out, what reduces their number to 5,653. The Stanford Parser is used to build the Dependency Graph (DG) of each sentence [11]. It is set to collapsed dependencies with propagation of conjunct dependencies option. As a result, conjunctions are propagating the dependen- cies that involves the conjuncts. For example, in the sentence “DISEASE is characterized by SYMPTOM1 and SYMPTOM2” this option guarantees that the same pattern will be observed between DISEASE and SYMPTOM1, and between DISEASE and SYMPTOM2. Finally, lemmatization is achieved using the Stanford CoreNLP suite. 4.3 Frequent Subgraph Mining From prepared dependency graphs, gSpan extracts all frequent subgraphs. Those are filtered to keep only subgraphs that contain one disease and one symptom. This guarantees that only patterns that describe the dependency relation be- tween disease and symptom are kept. We applied our program to gSpan sub- graphs to identify in each case the shortest path between the nodes DISEASE and SYMPTOM. When several diseases or symptoms are in a unique sentence, one shortest path is computed for each pair (DISEASEx–SYMTOMy). This resulted in 6,048 subgraph patterns, a smaller set compared to gSpan result, consequently easier to evaluate. Because we think that shortest paths represent the most significance part of subgraph, we focused on these reduced graphs. 4.4 Selection of Positive Patterns First, patterns with a f requency ≥ 2 are selected from the 6,048. Accordingly, 615 patterns are frequent, covering 2,535 sentences from all 5,653 (44.84%). Sec- ond, patterns with a pattern quality ≥ 0.5, our quality threshold, are selected (see formula 1). 
This results in 324 patterns (covering 1,329 sentences, or 23.51%), which are considered as positive patterns and are aimed at extracting new relationships from text.
7 http://alias-i.com/lingpipe
4.5 Evaluation
Finally, we evaluate the ability of positive patterns to identify disease–symptom relationships in the test corpus. The evaluation process can be divided into three tasks. (i) For each sentence in the test corpus, two lists of disease–symptom pairs are composed: the list i-a of all possible disease–symptom pairs found in the sentence, and the list i-b of pairs extracted by our method, i.e., when a positive pattern matches the DG of a test sentence. Obviously, list (i-b) is a subset of (i-a). (ii) Each pair in list i-a is marked manually as Correct if it corresponds to a relation actually mentioned in the sentence, or Incorrect if it is not. (iii) Pairs that are marked as Correct and are extracted by our method (i.e., in list i-b) are True Positives (TP); pairs that are marked as Incorrect and are not extracted by our method are True Negatives (TN); pairs that are marked as Incorrect and are extracted by our method (i.e., in list i-b) are False Positives (FP); pairs that are marked as Correct and are not extracted by our method are False Negatives (FN).
Table 1: Confusion matrix corresponding to the evaluation of the RE method on a corpus of 565 sentences. Because several relationships can be extracted from one sentence, TP+TN+FP+FN is higher than the number of sentences.
                               Pattern-based extraction
                               Extracted     Not extracted
Manual extraction  Correct     TP = 149      FN = 172
                   Incorrect   FP = 15       TN = 441
Table 1 shows the confusion matrix resulting from the evaluation process. It enables us to compute the precision P = TP/(TP+FP), the recall R = TP/(TP+FN), the F-measure = 2·P·R/(P+R), the accuracy Acc = (TP+TN)/(P+N) and the specificity S = TN/(FP+TN). Evaluation shows high precision (0.90) and specificity (0.96); reasonable F-measure (0.76) and accuracy (0.75); and a low recall (0.46).
5 Related Work
5.1 Mining Text as a Set of Words
Co-occurrence is the simplest method to identify relationships between two entities that co-occur in the same sentence. This approach is based on the hypothesis that if two entities are frequently mentioned together, it is likely that these two entities are related. Co-occurrence methods have been successfully applied to the automated construction of networks of biomolecules, such as protein–protein or gene–disease networks [12,13]. The co-occurrence approach tends to achieve good recall but low precision. This can be balanced when one is mining a very large corpus. Another issue with such approaches is that the type of the relationships and their direction are unknown.
Bags of words are artificial constructs where one textual document is represented as an unordered set of the words it contains, i.e., the bag. In this set, each word is usually associated with its frequency of occurrence in the document, which enables weighting words within the bag. This is used to classify documents with similar words and word frequency profiles. Indeed, when associated with a proper dictionary, a document represented as a bag can be encoded as a simple vector of integers. This is a compact representation that makes it possible to work with large corpora of documents. It suffers from low precision.
Sequences of words. A sequence consists of a partial order (i.e., the sequence) of words, POS tags, general POS tags, entity or chunk types, etc. These features are used to build patterns or rules that assert a relationship between entities. Blohm et al. presented a taxonomic sequential pattern method for RE, which extends a sequential mining algorithm to take into account a taxonomy of morpho-syntactic and lexico-semantic features [14]. It allows generalization or specialization among patterns, which affects the precision and the recall of the patterns. Quiniou et al. studied how to use sequence mining to identify more generic linguistic patterns and showed that sequence mining is more powerful than n-grams for expressing linguistic patterns [15]. Béchet et al. proposed a sequential pattern mining algorithm that discovers relations between genes and rare diseases in a biomedical corpus [16]; the algorithm extracts expressive linguistic patterns that are more effective than patterns extracted with itemsets. Sequence mining tends to generate a very high number of patterns, which makes the analysis and evaluation tasks difficult. Consequently, filters are usually applied to reduce the number of extracted patterns.

5.2 Mining Trees

A parse tree is an ordered, rooted tree that represents the syntactic structure of a sentence. Several works have proposed to use such syntactic structures for extracting relations between entities. Galitsky introduced the operation of syntactic generalization, which takes a pair of syntactic parse trees and finds their maximal common subtrees [17], using a nearest-neighbour learning method to find them. Zhang et al. proposed a kernel approach that uses the syntactic tree representation of sentences for RE; they studied how to capture the syntactic structure by using a convolution tree kernel and support vector machines [18]. Zelenko et al. also proposed a tree kernel method, but using shallow parse tree representations [19]. The same tree kernel approach was used by Culotta and Sorensen, who additionally allowed feature weighting and used additional features such as WordNet, POS and entity types [20]. In both approaches, a relation instance is defined by the smallest subtree of the parse or dependency tree that includes the entities of interest. Tree kernel approaches achieve good results but are hard to implement and computationally complex. Note that trees are a specific type of graph, and tree mining can easily be adapted to graphs.

5.3 Mining Graphs

Many RE methods based on DGs have been proposed [21,22]. Chowdhury and Lavelli proposed a hybrid kernel approach based on different features: dependency patterns, regex patterns, path-enclosed and shallow linguistic kernels [23]. In this case, dependency patterns are reduced graphs, i.e., subgraphs of dependency graphs. The reduced graph extends the shortest path (smallest common subgraph) of the dependency graph by adding (a) the dependent nodes (when they exist) of nodes in the shortest path and (b) the immediate governor(s) (when they exist) of the least common governor. For the sake of simplicity, we choose in this paper to consider only the shortest path with no dependents. Bunescu et al. proposed an RE approach similarly based on the shortest path between two entities in undirected dependency graphs [9]. Chowdhury and Lavelli also proposed to use a reduced graph pattern, i.e., the set of syntactic dependencies of the corresponding reduced graph.
For example, the reduced graph pattern of the graph represented in Figure 7 is ⟨nsubj, cop, det, jj, partmod, agent⟩. Note that, in this case, reduced graph patterns are undirected. Adolphs et al. developed a rule learning algorithm that learns graph rules identifying subgraphs in arbitrary graphs [24]. First, subgraphs are extracted; then, subgraph generalization is performed to form rules by underspecifying the nodes and introducing placeholders labeled with the role of the argument nodes.

6 Discussion and Conclusion

6.1 Discussion

The hybrid kernel approach proposed by Chowdhury and Lavelli [23] is evaluated on 5 different corpora for the extraction of protein-protein relationships, and the results vary from one corpus to another. Considering the pattern features and corpora used, our method shows good precision (0.91) and low recall (0.49). This illustrates that graph mining can produce precise patterns for RE, but additional work is required, such as adding features (e.g., similar to those proposed in Chowdhury's work). Béchet et al. [16] use sequential pattern mining for extracting gene-disease relationships. Their method gives its best precision of 0.68 (recall 0.36) with min_sup = 50, while its best recall is 0.65 (precision 0.66) with min_sup = 5. In comparison, our method achieves a best precision of 0.94 (recall 0.33) and a best recall of 0.67 (precision 0.41). In addition, the huge number of patterns produced by sequence mining makes the interpretation task hard.

In our experiments, we fixed min_sup = 2, which yields 615 extracted patterns. When the min_sup threshold increases, the number of extracted patterns and the recall decrease; for example, with min_sup = 3 the number of extracted patterns is 268. When the min_sup threshold decreases, the number of extracted patterns and the recall increase; for example, with min_sup = 1 the number of extracted patterns is 6,048. Such a number of patterns is too large for analysis, and patterns with support = 1 may not be important because they are rare patterns.

Figure 12 shows the relation between precision and the pattern quality threshold, and between recall and the pattern quality threshold. Precision increases and recall decreases as the pattern quality threshold increases. The best precision value is 0.94, when the quality threshold is 100, and the best recall value is 0.67, when the quality threshold is 0. The trade-off between precision and recall has to be chosen according to the purpose of the application.

Fig. 12: The relation of precision and recall with the quality threshold

The study of FN and FP relations is necessary for improving the recall and the precision, respectively. In the sentence "In areas in which transmission is occurring, WNV infection should be considered in patients with acute flaccid paralysis.", the relation between the disease "WNV" and the symptom "flaccid paralysis" is marked as an FN relation, because we did not generate a positive pattern that describes this relation. A possible solution to this problem is to consider patterns with low frequency (rare patterns); another is to enlarge the learning corpus. Both produce a larger pattern set, which reduces FN relations and increases the recall. Conversely, to increase the precision, the number of FP relations needs to be reduced.
The following sentence, "Muscular dystrophy is a nosology for a group of hereditary muscle disorders characterized by progressive wasting and weakness of skeletal muscle, where degeneration of muscle fibers is detected by pathological examination", generates an FP relation between "hereditary muscle" and "weakness". One solution is to consider only patterns of high quality, by increasing the quality threshold, to ensure that the extracted patterns are precise enough. Finally, unlike gSpan and Chowdhury's work, SPM is not able to keep other features such as negation relations; issues like this must be taken into consideration for further improvements and extensions of SPM.

6.2 Conclusion

This paper illustrates how graph mining can be used for RE. We propose a simple method based on FSM to extract relationships from a corpus of text represented as DGs. FSM makes it possible to identify subgraph patterns, which are filtered based on their frequency and quality. Selected patterns are in turn used to extract relationships from novel sentences. Our evaluation on a corpus related to rare diseases showed a very high precision of 0.91. In the future, the recall of the FSM-based method may be enhanced by improving its ability to identify FN relations. Also, thanks to the readability of the extracted patterns, studying and adding new features or constraints to improve the quality of these patterns is possible and may increase the recall and precision values. Combining features of sequences, syntax trees and dependency graphs may yield more precise patterns with higher recall.

References

1. Larsen, P.O., von Ins, M.: The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index. Scientometrics (2010) 575–603
2. Kuramochi, M., Karypis, G.: Frequent subgraph discovery. In: Proceedings of the 2001 IEEE International Conference on Data Mining. ICDM '01, Washington, DC, USA, IEEE Computer Society (2001) 313–320
3. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th International Conference on Very Large Data Bases. VLDB '94, San Francisco, CA, USA, Morgan Kaufmann Publishers Inc. (1994) 487–499
4. Inokuchi, A., Washio, T., Motoda, H.: An apriori-based algorithm for mining frequent substructures from graph data. In: Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery. PKDD '00, London, UK, Springer-Verlag (2000) 13–23
5. Huan, J., Wang, W., Prins, J.: Efficient mining of frequent subgraphs in the presence of isomorphism. In: Proceedings of the Third IEEE International Conference on Data Mining. ICDM '03, Washington, DC, USA, IEEE Computer Society (2003) 549–
6. Yan, X., Han, J.: gSpan: Graph-based substructure pattern mining. In: Proceedings of the 2002 IEEE International Conference on Data Mining. ICDM '02, Washington, DC, USA, IEEE Computer Society (2002) 721–
7. Yan, X., Han, J.: CloseGraph: Mining closed frequent graph patterns. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD '03, New York, NY, USA, ACM (2003) 286–295
8. Nijssen, S., Kok, J.N.: The Gaston tool for frequent subgraph mining. Electr. Notes Theor. Comput. Sci. 127(1) (2005) 77–87
9. Bunescu, R.C., Mooney, R.J.: A shortest path dependency kernel for relation extraction.
In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. HLT '05, Stroudsburg, PA, USA, Association for Computational Linguistics (2005) 724–731
10. Aronson, A.R.: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp (2001) 17–21
11. de Marneffe, M.C., Manning, C.D.: The Stanford typed dependencies representation. In: Coling 2008: Proceedings of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation. CrossParser '08, Stroudsburg, PA, USA, Association for Computational Linguistics (2008) 1–8
12. Šarić, J., Jensen, L.J., Ouzounova, R., Rojas, I., Bork, P.: Extraction of regulatory gene/protein networks from Medline. Bioinformatics 22(6) (March 2006) 645–650
13. Friedman, C., Kra, P., Yu, H., Krauthammer, M., Rzhetsky, A.: GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Comput. Appl. Biosci. 17(suppl 1) (June 2001) S74–82
14. Blohm, S., Buza, K., Cimiano, P., Schmidt-Thieme, L.: Relation extraction for the Semantic Web with taxonomic sequential patterns. In: Applied Semantic Web Technologies. Taylor and Francis Group (2011) 185–209
15. Quiniou, S., Cellier, P., Charnois, T., Legallois, D.: What about sequential data mining techniques to identify linguistic patterns for stylistics? In Gelbukh, A.F., ed.: CICLing (1). Volume 7181 of Lecture Notes in Computer Science, Springer (2012) 166–177
16. Béchet, N., Cellier, P., Charnois, T., Crémilleux, B., Jaulent, M.C.: Sequential pattern mining to discover relations between genes and rare diseases. In Soda, P., Tortorella, F., Antani, S., Pechenizkiy, M., Cannataro, M., Tsymbal, A., eds.: CBMS, IEEE (2012) 1–6
17. Galitsky, B.: Machine learning of syntactic parse trees for search and classification of text. Engineering Applications of Artificial Intelligence 26(3) (2013) 1072–1091
18. Zhang, M., Zhou, G., Aw, A.: Exploring syntactic structured features over parse trees for relation extraction using kernel methods. Inf. Process. Manage. 44(2) (March 2008) 687–701
19. Zelenko, D., Aone, C., Richardella, A.: Kernel methods for relation extraction. J. Mach. Learn. Res. 3 (March 2003) 1083–1106
20. Culotta, A., Sorensen, J.: Dependency tree kernels for relation extraction. In: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics. ACL '04, Stroudsburg, PA, USA, Association for Computational Linguistics (2004)
21. Fundel, K., Küffner, R., Zimmer, R.: RelEx: relation extraction using dependency parse trees. Bioinformatics 23(3) (January 2007) 365–371
22. Coulet, A., Shah, N.H., Garten, Y., Musen, M.A., Altman, R.B.: Using text to build semantic networks for pharmacogenomics. Journal of Biomedical Informatics 43(6) (2010) 1009–1019
23. Chowdhury, M.F.M., Lavelli, A.: Combining tree structures, flat features and patterns for biomedical relation extraction. In: EACL (2012) 420–429
24. Adolphs, P., Xu, F., Li, H., Uszkoreit, H.: Dependency graphs as a generic interface between parsers and relation extraction rule learning. In: Proceedings of the 34th Annual German Conference on Advances in Artificial Intelligence. KI'11, Berlin, Heidelberg, Springer-Verlag (2011) 50–62

Interpretable Low-rank Document Representations with Label-dependent Sparsity Patterns

Ivan Ivek
Rudjer Boskovic Institute, Bijenicka 54, 10000 Zagreb, Croatia

Abstract.
In the context of document classification, where the label tags of the documents in a corpus are readily known, an opportunity lies in utilizing label information to learn document representation spaces with better discriminative properties. To this end, this paper proposes the application of a Variational Bayesian Supervised Nonnegative Matrix Factorization (supervised vbNMF) with a label-driven sparsity structure of the coefficients for learning discriminative, nonsubtractive latent semantic components occurring in TF-IDF document representations. The constraints are such that the pursued components are made to occur frequently in a small set of labels only, making it possible to yield document representations with distinctive label-specific sparse activation patterns. A simple measure of the quality of this kind of sparsity structure, dubbed inter-label sparsity, is introduced and experimentally brought into tight connection with classification performance. Representing a great practical convenience, inter-label sparsity is shown to be easily controlled in supervised vbNMF by a single parameter.

Keywords: Document Categorization, Latent Semantic Analysis, Supervised Sparse Nonnegative Matrix Factorization, Variational Bayes

1 Introduction

As an essential step in machine learning applications, on which both the efficiency and the quality of learning depend, dimensionality reduction has become a well covered subject of research [MPH09]. It has produced archetypal linear methods with low-rank assumptions such as Principal Component Analysis (PCA) [Jol02] and Nonnegative Matrix Factorization (NMF) [LS99], as well as their kernelized and generally non-linear variants, to touch upon some. Originally these were formulated as entirely unsupervised methods. However, in supervised and semi-supervised learning applications, where labels of learning samples are readily available, it may be appealing to use this information to obtain lower-dimensional representations of data which not only attempt to preserve the original variance in the data, but also promise to deliver representation spaces with better discriminative properties. A well known representative which incorporates this desideratum is Fisher's Linear Discriminant Analysis (FLD) [MK01].

In the recent relevant literature there is a pronounced trend of using probabilistic generative models for this purpose. Probabilistic approaches to learning rest on a well developed mathematical apparatus which offers sufficiently flexible modeling of prior knowledge in the form of graphical models, supported by well-known meta-algorithms for estimating model parameters. Of this family of algorithms, along with Probabilistic Latent Semantic Analysis (pLSA) [Hof99], a common probabilistically formulated baseline algorithm in text mining is Latent Dirichlet Allocation (LDA) [BNJ12], with its more recent discriminative modifications [BM07][LJSJ08], as well as probabilistic formulations of sparse NMF [Cem09] and their supervised counterparts [Ive14]. Sparse coding is known to result in efficient and robust representations which have proven suitable for applications such as data compression, denoising and missing data imputation [Mal99].
On the other hand, representations obtained discriminatively are suitable for classification purposes. Combining those two properties, the basis of this work is a probabilistically formulated method for sparse additive representations of data using nonnegative latent components, whose sparsity structure is additionally driven by data labeling [Ive14]. In the context of document classification, the decomposition is suitable for finding interpretable patterns of semantically related terms with high discriminative potential.

1.1 Document Feature Spaces

Disregarding syntactic and semantic interrelations of words, the simplest and most often used intermediate form for document representation is bag-of-words; after tokenization, purification and stemming, the frequency of relevant terms is determined for each document, resulting in representations of documents as frequencies of particular terms. Models such as LDA have a natural interpretation when decomposing bag-of-words representations, while other approaches may benefit from TF-IDF weighting [RU11], which heuristically measures the importance of a term for a particular document in a specific corpus of documents. For a term with index τ in the ν-th document, the TF-IDF score is the product of two measures,

$\mathrm{tfidf}_{\nu\tau} = \mathrm{tf}_{\nu\tau} \cdot \mathrm{idf}_{\tau}$,  (1)

proportional to the (normalized) frequency of the term in the document,

$\mathrm{tf}_{\nu\tau} = \frac{\#_{\nu\tau}}{\max_{t}(\#_{\nu t})}$,  (2)

but stunted by a measure of how rarely the term occurs in the entire corpus,

$\mathrm{idf}_{\tau} = \ln\frac{N}{n_{\tau}}$,  (3)

where the number of occurrences of term τ in the ν-th document is denoted by $\#_{\nu\tau}$, the number of documents in the corpus by N, and the number of documents which contain term τ at least once by $n_{\tau}$.

1.2 NMF as a Tool for Latent Semantic Analysis

Bag-of-words-based approaches to text mining are known to suffer from problems of polysemy and synonymy of terms. These problems can be alleviated by representing documents in spaces of patterns of frequencies of semantically related terms rather than in the original space of term frequencies [DDF+90]. Algorithms for learning such representations exist, of which perhaps the best known are pLSA formulations. Also assuming inherent nonnegativity in the data, NMF decompositions can be interpreted in the same way as pLSA, revealing patterns of semantically related terms underlying the data. Furthermore, a specific connection worth mentioning is that an NMF formulation based on generalized KL-divergence minimizes exactly the same objective function as the original pLSA formulation does [GG05].

Nonnegativity is a reasonable assumption and a desirable bias when modeling either term frequencies or derived intermediate document representations such as TF-IDF. In general, NMF aims at decompositions of the form $X \approx TV$, where X, T and V are all nonnegative matrices. Although the decomposition is nonunique in general, nonuniqueness may to some extent be compensated for by adding additional bias to the model, of which the most prominent is sparsity of the solution [LS99]. Sparsity is enforced in divergence-based NMF by different sparsity-promoting regularizers, e.g. [Hoy04], and in probabilistic formulations by imposing sparse prior distributions on the coefficients [Cem09].
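As an illustration of the weighting in (1)-(3) and of a decomposition of the form X ≈ TV, the following Python sketch computes TF-IDF from a toy term-by-document count matrix and factorizes it with scikit-learn's standard NMF. The off-the-shelf Frobenius-loss NMF is only a stand-in for the probabilistic formulations discussed in this paper, and the toy count matrix is invented for the example.

```python
import numpy as np
from sklearn.decomposition import NMF

def tf_idf(counts):
    """counts: (terms x documents) matrix of raw term counts, documents as columns."""
    # tf (eq. 2): counts normalized by the most frequent term of each document
    tf = counts / np.maximum(counts.max(axis=0, keepdims=True), 1)
    # idf (eq. 3): log of (number of documents / number of documents containing the term)
    n_docs = counts.shape[1]
    doc_freq = np.maximum((counts > 0).sum(axis=1, keepdims=True), 1)
    idf = np.log(n_docs / doc_freq)
    return tf * idf                                   # eq. 1

counts = np.array([[3.0, 0.0, 1.0],                   # toy corpus: 3 terms x 3 documents
                   [1.0, 2.0, 0.0],
                   [0.0, 1.0, 4.0]])
X = tf_idf(counts)

# X ~ T V with nonnegative factors; documents stay columnwise, as in the text.
nmf = NMF(n_components=2, init="nndsvda", max_iter=500)
T = nmf.fit_transform(X)     # terms x components: latent semantic components
V = nmf.components_          # components x documents: coefficients
```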
Throughout this paper, in the context of document representation for categorization purposes, X will be regarded as a collection of documents organized columnwise and represented by TF-IDF features, T as a low-rank collection of latent semantic components organized columnwise, and V as the matrix of coefficients of X projected onto the space of latent semantic components T. In other words, each document is modeled as a strict superposition of the nonnegative latent semantic components.

2 Methodology

2.1 Supervised NMF Model

The generative model [Ive14] assumes that each column of data $x_{:\tau}$ is a result of latent components $t_{:i}$ consisting of independent gamma-distributed variables,

$p(t_{\nu i} \mid a^{t}_{\nu i}, b^{t}_{\nu i}) = \mathcal{G}(t_{\nu i};\, a^{t}_{\nu i}, b^{t}_{\nu i})$,  (4)

interacting through linear mixing with coefficients $v_{i\tau}$ under Poissonian noise:

$p(s_{\nu i \tau} \mid t_{\nu i}, v_{i\tau}) = \mathcal{P}(s_{\nu i\tau} \mid t_{\nu i}, v_{i\tau})$  (5)

$p(x_{\nu\tau} \mid s_{\nu : \tau}) = \delta\!\left(x_{\nu\tau} - \sum_{i} s_{\nu i \tau}\right)$.  (6)

Mixing coefficients $v_{i\tau}$ are assumed to be exponentially distributed with different scale parameters for different selections of label indicators $z_\tau \in L$, formulated as mixtures of variables $\lambda_{il}$ with $z_\tau$ as discrete numerical mixture selection variables,

$p(v_{i\tau} \mid z_\tau, \lambda_{i:}) = \mathcal{G}\!\left(v_{i\tau};\, 1, \sum_{l \in L}\delta(z_\tau - l)\,\lambda^{-1}_{il}\right)$.  (7)

Note that label indicators $z_\tau$ are elements of a discrete set of (integer) numbers L for convenience of notation. Variables $\lambda^{-1}_{il}$, representing the expected magnitudes of coefficient components i for all samples labeled as l, are constrained by inverse-gamma priors,

$p(\lambda_{il} \mid a^{\lambda}_{il}, b^{\lambda}_{il}) = \mathcal{G}(\lambda_{il};\, a^{\lambda}_{il}, b^{\lambda}_{il})$.  (8)

Because the inverse-gamma is a heavy-tailed distribution, by setting the probability mass to be concentrated around some small value, significantly larger values of $\lambda^{-1}_{il}$ will occur rarely. Thus, such a prior imposes an additional bias to produce models having on average only a minority of indicators $\lambda^{-1}_{il}$ with significantly large mean values, which, propagating hierarchically to the activation coefficients $v_{i\tau}$, constrains samples having the same label to have only a small shared subgroup of latent patterns significantly active. Using the compact notation

$p(X \mid S) = \prod_{\nu,\tau} p(x_{\nu\tau} \mid s_{\nu:\tau})$
$p(S \mid T, V) = \prod_{\nu,\tau} p(s_{\nu:\tau} \mid t_{\nu:}, v_{:\tau})$
$p(T \mid A^t, B^t) = \prod_{\nu,i} p(t_{\nu i} \mid a^{t}_{\nu i}, b^{t}_{\nu i})$
$p(V \mid \Lambda, z) = \prod_{i,\tau} p(v_{i\tau} \mid \lambda_{i:}, z_\tau)$
$p(\Lambda \mid A^{\lambda}, B^{\lambda}) = \prod_{i,l} p(\lambda_{il} \mid a^{\lambda}_{il}, b^{\lambda}_{il})$,

the joint distribution of the supervised NMF model can be written as

$p(X, S, T, V, \Lambda \mid A^t, B^t, A^{\lambda}, B^{\lambda}, z) = p(X \mid S)\, p(S \mid T, V)\, p(T \mid A^t, B^t)\, p(V \mid \Lambda, z)\, p(\Lambda \mid A^{\lambda}, B^{\lambda})$.  (9)

Linear mixing as described by (4), (5) and (6) is the same as in Poisson-gamma NMF [Cem09]. Equations (7) and (8) additionally formulate a sparsity structure abstracted from the level of data samples to the level of labels, making it possible to pursue decompositions with recognizable sparsity patterns characteristic of data samples which share the same label tag.

2.2 Variational Bayesian Learning Algorithm

To give a concise outline of the general treatment of learning by VB, let the observed variables be denoted by D, the hyperparameters of a model by H, and both the unobserved variables and the model parameters by Θ.
Minimization of the discrepancy between the posterior p(Θ|D, H) (which is in general difficult to optimize directly, especially in a fully Bayesian manner) and an introduced instrumental approximation q(Θ), measured by the Kullback-Leibler divergence, gives rise to a lower bound on the posterior,

$\mathcal{L} = \langle \ln p(D, \Theta \mid H) \rangle_{q(\Theta)} + H[q(\Theta)]$,  (10)

where the entropy of the probability density function in the argument is denoted by H[.]. Supposing that q(Θ) is of the factorized form $q(\Theta) = \prod_{\alpha \in C} q(\Theta_\alpha)$, it can be shown that iterative local updates at iteration (n+1), alternating over C, of the form

$q(\Theta_\alpha)^{(n+1)} \propto \exp\!\left(\langle \ln p(D, \Theta \mid H) \rangle_{q(\Theta)^{(n)} / q(\Theta_\alpha)^{(n)}}\right)$  (11)

improve the lower bound (10) monotonically. Moreover, should the model be conjugate-exponential, for a fully factorized approximation the expressions in (11) necessarily assume analytical forms [Win03].

To obtain convenient analytical forms of the update equations of the supervised vbNMF model defined by (9), the approximative distribution is chosen to be factorized as $q(S, T, V, \Lambda) = q(S)\, q(T)\, q(V)\, q(\Lambda)$, with the shorthand notation $q(S) = \prod_{\nu,\tau} q(s_{\nu:\tau})$, $q(T) = \prod_{\nu,i} q(t_{\nu i})$, $q(V) = \prod_{i,\tau} q(v_{i\tau})$ and $q(\Lambda) = \prod_{i,l} q(\lambda_{il})$. The variational Bayesian update expressions are obtained by specifying $p(D, \Theta \mid H) = p(X, S, T, V, \Lambda \mid A^t, B^t, A^{\lambda}, B^{\lambda}, z)$ together with $q(\Theta) = q(S, T, V, \Lambda)$ in (11). For computational convenience, optimization is done with respect to the lower bound (10), additionally relaxed using Jensen's inequality, to finally yield the update expressions summarized in Table A1, where the iterative learning algorithm is presented using efficient matrix operations; the accompanying matrix forms of the variables, hyperparameters and variational parameters of the model are listed in the first column of the table. A more detailed treatment of the learning algorithm can be found in [Ive14].

3 Experiments

All experiments have been performed on the 20Newsgroups1 dataset, bydate version, split into training and test sets; rather than estimating the generalization error of classification by crossvalidation techniques, the underlying ambition is merely to explore the peak potential of classification using different representation spaces, evaluated on a single train-test split under the same conditions.

1 Available from Jason Rennie's web page, http://qwone.com/~jason/20Newsgroups/

3.1 Dataset

Experiments have been performed on the 20Newsgroups dataset sorted by date, with duplicates and headers removed, with documents having multiple labels left out, and preprocessed to obtain a bag-of-words representation. The dataset is split into a training set and a test set. To alleviate the computational load, the set of features has been heuristically reduced to 10000 terms, based on the maximum TF-IDF score across all documents.

3.2 Experimental Setup

The representation spaces in which classification subsequently takes place are the ones obtained by PCA, Poisson-gamma unsupervised vbNMF [Cem09], and the supervised vbNMF, all decomposing the matrix of TF-IDF scores of the training set only. Having learned a specific space of reduced dimensionality, the representation of the test set in this space is found by projection onto the vector basis in the case of PCA, or by optimizing the matrix of coefficients only using the Poisson-gamma vbNMF formulation (i.e. the matrix of latent components is fixed to what has been learned in the training step) in the case of both the unsupervised and the supervised vbNMF methods.
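A minimal sketch of this evaluation protocol is given below, using scikit-learn's ordinary NMF and k-NN classifier as stand-ins for the variational models of this paper; the function name and the data layout (documents as rows of nonnegative TF-IDF arrays, labels as numpy vectors) are assumptions made for the example. The points it mirrors are that the components are learned on the training split only, that the test split is projected with the components held fixed, and that k is chosen as the square root of the training-set size.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.neighbors import KNeighborsClassifier

def evaluate_representation(X_train, y_train, X_test, y_test, dim):
    """X_*: nonnegative (documents x terms) TF-IDF arrays; y_*: numpy label vectors."""
    # Learn the latent components on the training split only.
    nmf = NMF(n_components=dim, init="nndsvda", max_iter=400)
    V_train = nmf.fit_transform(X_train)
    # Represent the test split with the learned components held fixed.
    V_test = nmf.transform(X_test)
    # k-NN with cosine similarity, k chosen as the square root of the training-set size.
    k = max(1, int(np.sqrt(X_train.shape[0])))
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine").fit(V_train, y_train)
    y_pred = knn.predict(V_test)
    micro = np.mean(y_pred == y_test)                                # micro-averaged accuracy
    macro = np.mean([np.mean(y_pred[y_test == l] == l) for l in np.unique(y_test)])
    return micro, macro
```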
For Poisson-gamma vbNMF sparse decompositions have been pursued by fix- ing shape hyperparameters of the gamma distributed coefficients to a value less than or equal to 1 throughout the entire run, while other hyperparameters (con- strained to be the same for all elements of matrices T and V , a single one for each of the matrices) have been optimized automatically by maximization of the lower bound directly in a non-Bayesian manner [Cem09]. For supervised vbNMF, hyperparameters aLambda have been fixed and varied, while other hyperparam- eters have been left to the algorithm to optimize, by direct optimization as in [Cem09]. Specifically, following initialization, λ parameters are chosen to be all equal and fixed for a burn-in period of 10 iterations, not until after which they start to get optimized according to the algorithm in Table A1. To accentuate the predictive potentials of the considered representation spaces by themselves, rather than in conjunction with a strong classifier, the classifier of choice is k-NN using cosine similarity metric, with k chosen heuristically as the square root of the cardinality of the training set. Dimension of space of latent components has been varied as a parameter for all decomposition methods. Because at each run the NMF algorithms converge to some local minimum, to explore these local minima, for each parameter set they have been run 10 times with random initializations. 3.3 Evaluation Metrics of classification performance used in the experiments are micro-averaged accuracy, defined as P correct micro l Nl a = P all , l Nl Interpretable Low-rank Document Representations with Patterns 103 and macro-averaged accuracy, 1 X Nlcorrect amacro = , L l Nlall where the number of correctly classified documents belonging to the l-th label is denoted by Nlcorrect , the number of documents belonging to l-th label in the test split by Nlall and the number of labels by L. By averaging the accuracies calculated separatedly for each of the labels, macro-averaged accuracy compen- sates for label-imbalance of test datasets. As a measure of sparsity, Hoyer’s measure [Hoy04], based on ratio of l1 and l2 norms and originally introduced in context of NMF penalization, will be used. * T For a vector x = [x1 , ..., xn ] it is defined as P ! * 1 √ |x | i i sparsity x = √ n− p P , (12) n−1 ( i x2i ) taking value of 1 in case only a single element is non-zero (maximum sparsity), and a value of 0 if all elements are equal (minimum sparsity). For the purpose of this paper, when referring to sparsity of matrices, matrix is assumed to be vectorized first by appending its columns, then treating it as a vector according to (12). If labels in a document corpus are meaningfully assigned based on topics of documents, then meaningful discovered latent semantic components are expected to have specific patterns of occurence for documents belonging to a specific label. Using supervised vbNMF, those patterns are modeled as patterns in sparsity of coefficients (i.e. in patterns of support of sparse coefficients) that documents labeled the same have in common. To measure the consistency of occurence of sparsity patterns in labels, let a representation by coefficients of N documents in I dimensional space be denoted by V ∈ RIxN , i.e. 
the n-th document is represented by the coefficient vector $[V]_{:n}$, and let the sums of coefficient sets which share the same label be accumulated in a matrix $L \in \mathbb{R}^{I \times L}$, where L is the number of labels, as

$[L]_{:l} = \sum_{n \in N_l} [V]_{:n}$,  (13)

where n iterates over the subset of document indices with the same label, $N_l$. Now, inter-label sparsity can be introduced, defined as the sparsity of the matrix L. The motivation behind (13) is that the l0 norm of a sum of vectors with the same sparsity pattern (same support) is the same as the l0 norm of each such vector by itself, and the more those vectors deviate from the pattern (i.e. when the vectors have differing supports), the larger the l0 norm of the sum will be. Note that the latter rationale holds exactly for the l0 definition of sparsity, while for more relaxed definitions of sparsity such as (12) the behavior will be only qualitatively similar. For the purpose of this paper, the sparsity of (13) will be measured as Hoyer's sparsity (12).

3.4 Results and Discussion

For comparison, as a baseline, the classification results of PCA are plotted against the dimension of the representation space in Fig. 1. For unsupervised vbNMF, micro-averaged accuracies averaged across random initializations for different shape parameters, with a varying number of latent semantic components, are shown in Fig. 2.

Fig. 1. Classification results using PCA (micro-averaged accuracy vs. dimension).

Compared to PCA, even with a larger dimension of the representation space, vbNMF with sparsity constraints did not bring improvements on average, regardless of the degree of sparsity penalization. The explanation is that, even though sparse representation spaces may be good for clustering, natural clusters may differ greatly from the labeling and consequently may even be detrimental to classification applications [BM07] when compared to dense representations such as PCA. Better representations for classification purposes are expected to be found by introducing label information into the model, which in the spaces obtained by supervised vbNMF (Fig. 3) indeed manifested as a boost in classification performance. Both unsupervised vbNMF and supervised vbNMF consistently resulted in sparse decompositions; however, the label-driven structure present in supervised vbNMF decompositions (engineered to be the sole difference in the experiments) is what accounts for the beneficial effect observed. Examples of sparsities across labels according to (13) are visualized in Fig. 4 for the sparse unsupervised vbNMF decomposition which produced the peak micro-averaged accuracy of 0.5796, and in Fig. 5 for an arbitrarily chosen supervised variant of matching dimension. The supervised variant produced distinctive sparsity patterns across labels, which is also reflected quantitatively in the inter-label sparsity of the decomposition.

Fig. 2. Classification results of unsupervised vbNMF (averaged across 10 random initializations) with varying level of sparsity penalization (micro-averaged accuracy vs. dimension, for shape parameter values 0.6 to 1.0).

The connection between sparsity on the level of labels and classification performance is further explored in Fig. 6, showing data for all supervised representations obtained in the experiments.
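For reference, the two sparsity measures used in this comparison can be computed directly from a coefficient matrix. The following numpy sketch implements Hoyer's measure (12) and the inter-label sparsity based on (13); the function names are illustrative, and since Hoyer's measure is invariant to the ordering of elements, the exact vectorization order of the matrix does not matter.

```python
import numpy as np

def hoyer_sparsity(x):
    """Hoyer's measure (12): 1 if a single element is non-zero, 0 if all elements are equal."""
    x = np.abs(np.ravel(x))            # vectorization order does not affect the measure
    n = x.size
    ratio = x.sum() / np.sqrt((x ** 2).sum())
    return (np.sqrt(n) - ratio) / (np.sqrt(n) - 1)

def inter_label_sparsity(V, labels):
    """V: (components x documents) coefficient matrix; labels: per-document label array.
    Accumulates coefficient columns per label as in (13) and measures the sparsity of the result."""
    L = np.stack([V[:, labels == l].sum(axis=1) for l in np.unique(labels)], axis=1)
    return hoyer_sparsity(L)
```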
The variance of the scatter plot becomes tighter with increasing dimension of the representation space, meaning that for a sufficiently large dimension of the decomposition, inter-label sparsity is indeed a good predictor of classification quality on this dataset. Equally importantly, the experiments show that in the case of supervised vbNMF, inter-label sparsity can elegantly be controlled by the single parameter $a^{\lambda}$ alone, regardless of dimension: as illustrated by Fig. 7, a logarithmic increase of $a^{\lambda}$ is accompanied by a trend of growth of inter-label sparsity, broken only by too extreme regularizations, when the tails of the prior have little mass. On the other hand, unsupervised vbNMF resulted in moderate levels of inter-label sparsity, because its sparsity structure is supported by the structure of the data features only, with somewhat higher values in cases of very strong sparsity regularizations and an impractically small number of latent patterns.

Classification results using the k-NN classifier with heuristically chosen k on the representation spaces obtained by the three methods are summarized in Table 1, reporting the peak values of micro- and macro-averaged accuracy; for the set of parameters which yielded the peak performance, the corresponding accuracies averaged across the 10 random initializations, together with the minimal achieved accuracies, are also reported. In Fig. 3b), showing the smooth dependence of micro-averaged accuracy (averaged across random initializations) on an interesting range of $a^{\lambda}$ for a selection of dimensions, the peak performance as entered in Table 1 can be seen marked. To conclude the remarks on the experiments, it is worth mentioning that, if sparse representations are pursued, care is advised when choosing and optimizing hyperparameters by non-Bayesian minimization of the bound.

Fig. 3. Supervised vbNMF classification results, averaged over 10 random initializations. a) Dependence on the level of sparsity penalization, for varying dimensions of representation spaces. b) Dependence on the level of sparsity penalization, for varying dimensions of representation spaces; error bars represent maximum and minimum values among the random initializations; the x-axis is shown on a logarithmic scale.
Fig. 4. Sparsity across labels according to (13) for the unsupervised vbNMF decomposition with the best classification performance; the more intense the tone of blue, the higher the sparsity. Each of the 40 latent semantic components is represented by its 5 most significant terms. Coefficient sparsity is 0.6862, inter-label sparsity is 0.5784.

Fig. 5.
Sparsity across labels according to (13) for an arbitrarily chosen supervised vbNMF decomposition with 40 latent semantic components, each represented by its 5 most significant terms. Coefficient sparsity is 0.8752, inter-label sparsity is 0.8578.

Fig. 6. Correlation between micro-averaged accuracy and inter-label sparsity. Each point in the scatter plot represents a single supervised NMF decomposition. Convex hulls contain points corresponding to choices of dimensions of 20, 60 and 100.

Because sparse constraints on both matrices T and V act as two competing penalizations, useful decompositions are obtained more easily by constraining only one of the matrices to be sparse: either the matrix of latent components, to obtain a parts-based representation, or the matrix of coefficients, to obtain a sparse representation of the data. So, when the shape parameter of one of the matrices is optimized in such a manner next to a fixed hyperparameter of the other matrix which is to be made sparse, due to the automated (and, equally importantly, non-Bayesian) nature of the optimization, the former may also come to describe a sparse distribution and in effect impede the desired bias toward the desired type of sparsity.

Table 1. Summary of experimental results

Algorithm                       PCA      Unsupervised vbNMF    Supervised vbNMF
Micro-averaged   [Min, Max]     -        [0.5190, 0.5796]      [0.6890, 0.7418]
Accuracy         Mean           0.6330   0.5532                0.7141
                 Dimension      200      40                    100
Macro-averaged   [Min, Max]     -        [0.5033, 0.5655]      [0.6758, 0.7277]
Accuracy         Mean           0.6179   0.5393                0.6997
                 Dimension      200      40                    100

Fig. 7. Dependence of inter-label sparsity (averaged over 10 random initializations) on the dimension of the representation space and the parameters which control sparsity. a) Unsupervised vbNMF with sparsity constraints. b) Supervised vbNMF; inter-label sparsity can be controlled by $a^{\lambda}$.

4 Conclusion

It has been well documented that using label information in low-rank representation learning is vital for obtaining representations with good discriminative properties. In this context, applied to the classification of a document corpus, a probabilistic learning algorithm which combines sparse coding and supervised learning has been presented. To characterize the advantages of using label information, two extreme cases have been juxtaposed: the presented supervised model and a fully unsupervised one, belonging to the same family, having the same noise model and using the same meta-algorithm for parameter learning.

A qualitative inspection motivated the introduction of the notion of inter-label sparsity, abstracting sparsity of coefficients on the level of documents to sparsity on the level of document labels. Experiments point to a strong connection between the inter-label sparsity of the representation and the classification performance metrics. Furthermore, the inter-label sparsity of decompositions obtained by supervised vbNMF can elegantly be controlled by a single parameter. However, even though sparsity and nonnegativity constraints intuitively seem appropriate and result in compact and interpretable document representations, a question
However, even though sparsity and nonnegativity constraints intuitively seem appropriate and result in compact and interpretable document representations, a question Interpretable Low-rank Document Representations with Patterns 111 remains whether there is any actual advantage in using sparse representations over dense ones as classification precursors. As quality of representation spaces has been primarily addressed in this work, little regard has been given to quality of the classifier per se. Because it is reason- able to expect that a stronger classifier would result in even better classification results, it would be interesting to compare a well-tuned classifier in the represen- tation spaces obtained by supervised vbNMF to state-of-the-art aproaches in the field, on benchmark datasets. Future work based on semi-supervised modifica- tions of the model is considered, to make the model more flexible and applicable in more commonly occuring, semi-supervised, scenarios. Acknowledgments This work was supported by the Croatian Ministry of Sci- ence, Education and Sports through the project ”Computational Intelligence Methods in Measurement Systems”, No. 098-0982560-2565. References [BM07] David M Blei and J McAuliffe. Supervised Topic Models. Neural Infor- mation Processing Systems, 21, 2007. [BNJ12] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2012. [Cem09] Ali Taylan Cemgil. Bayesian inference for nonnegative matrix factorisation models. Computational Intelligence and Neuroscience, 2009, January 2009. [DDF+ 90] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Lan- dauer, and Richard Harshman. Indexing by latent semantic analysis. Jour- nal of the American Society for Information Science, 41:391–407, 1990. [GG05] Cyril Goutte and Eric Gaussier. Relation between PLSA and NMF and implications. Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 601–602, 2005. [Hof99] Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Uncertainity in Artifitial Intelligence - UAI’99, page 8, 1999. [Hoy04] Patrik O Hoyer. Non-negative matrix factorization with sparseness con- straints. Journal of Machine Learning Research, 5:1457–1469, 2004. [Ive14] Ivan Ivek. Supervised Dictionary Learning by a Variational Bayesian Group Sparse Nonnegative Matrix Factorization. May 2014. [Jol02] I T Jolliffe. Principal Component Analysis, volume 98. 2002. [LJSJ08] S Lacoste-Julien, F Sha, and MI Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. NIPS, pages 897–904, 2008. [LS99] D D Lee and H S Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999. [Mal99] Stéphane Mallat. A Wavelet Tour of Signal Processing. 1999. [MK01] Aleix M. Martinez and Avinash C. Kak. PCA versus LDA. IEEE Trans- actions on Pattern Analysis and Machine Intelligence, 23:228–233, 2001. [MPH09] L J P van der Maaten, E O Postma, and H J van den Herik. Dimensionality Reduction: A Comparative Review. Journal of Machine Learning Research, 10:1–41, 2009. 112 I. Ivek [RU11] Anand Rajaraman and Jeffrey D Ullman. Mining of Massive Datasets. Lecture Notes for Stanford CS345A Web Mining, 67:328, 2011. [Win03] John M Winn. Variational message passing and its applications. Ph.D. thesis, Department of Physics, University of Cambridge, 2003. 
A Summary of the Learning Algorithm

Table A1. The learning algorithm in matrix form. The left column lists the observed variables, hyperparameters and variational parameters organized as matrices: the observed data $[X]_{\nu\tau} = x_{\nu\tau}$ and label indicators $[\Delta]_{\tau l} = \delta(z_\tau - l)$; the hyperparameters $[A^t]_{\nu i} = a^t_{\nu i}$, $[B^t]_{\nu i} = b^t_{\nu i}$, $[A^\lambda]_{il} = a^\lambda_{il}$, $[B^\lambda]_{il} = b^\lambda_{il}$; and the variational parameters $[E_t]_{\nu i} = \langle t_{\nu i}\rangle$, $[L_t]_{\nu i} = \langle \ln t_{\nu i}\rangle$, $[E_v]_{i\tau} = \langle v_{i\tau}\rangle$, $[L_v]_{i\tau} = \langle \ln v_{i\tau}\rangle$, $[\Sigma_t]_{\nu i} = \sum_\tau \langle s_{\nu i\tau}\rangle$, $[\Sigma_v]_{i\tau} = \sum_\nu \langle s_{\nu i\tau}\rangle$, $[E_\lambda]_{il} = \langle \lambda_{il}\rangle$, $[L_\lambda]_{il} = \langle \ln \lambda_{il}\rangle$. The right column gives the algorithm: after initialization of the variational parameters, each iteration of the loop updates the sufficient statistics $\Sigma_t$ and $\Sigma_v$, then the posterior parameters and expectations for T, V and Λ, and finally optimizes the remaining hyperparameters in a non-Bayesian manner, until the loop terminates. Elementwise matrix product and division are denoted by .* and ./, respectively, and 1 denotes a matrix of ones of appropriate dimensions.

TagMiner: A Semisupervised Associative POS Tagger Effective for Resource Poor Languages

Pratibha Rani, Vikram Pudi, and Dipti Misra Sharma
International Institute of Information Technology, Hyderabad, India
pratibha rani@research.iiit.ac.in, {vikram, dipti}@iiit.ac.in

Abstract. We present TagMiner, a data mining approach for part-of-speech (POS) tagging, an important Natural Language Processing (NLP) classification task. It is a semi-supervised associative classification method for POS tagging. Existing methods for building POS taggers require extensive domain and linguistic knowledge and resources. Our method uses a combination of a small POS tagged corpus and raw untagged text data as training data to build the classifier model using association rules. Our tagger also works well with very little training data. The use of semi-supervised learning provides the advantage of not requiring a large, high quality tagged corpus. These properties make it especially suitable for resource poor languages. Our experiments on various resource-rich, resource-moderate and resource-poor languages show good performance without using any language specific linguistic information. We note that the inclusion of such features in our method may further improve the performance. Results also show that for smaller training data sizes our tagger performs better than a state-of-the-art CRF tagger using the same features as our tagger.

Keywords: Part-of-Speech Tagging, Associative Classification, Association Rules, Semi-supervised Classification, NLP

1 Introduction

Part-of-speech (POS) tagging is an important NLP classification task that takes a word or a sentence as input, assigns a POS tag or other lexical class marker to the word or to each word in the sentence, and produces the tagged text as output. For this task several rule based [7, 8], stochastic supervised [6, 15, 30], and unsupervised [2, 5, 16] approaches are available for a number of languages. All of these approaches (including the state-of-the-art taggers) require training data and linguistic resources like dictionaries in large quantities. These taggers do not perform well for languages which do not have many resources or much training data, referred to as resource poor languages.
The creation of linguistic resources is a time consuming and expensive process which requires expert linguistic knowledge. So, there is a need to develop semi-supervised and generic POS tagging methods which take advantage of a raw untagged corpus and require few or no lexical resources. A few such available techniques are mentioned in Sect. 2. In order to perform well, these techniques require a large raw untagged corpus. Unfortunately, for many resource poor languages, even obtaining this is hard. This motivates us to explore data mining methods to build a generic POS tagger.

Data mining, being composed of data driven techniques, is a promising direction for developing language/domain independent POS tagging methods. However, direct application of data mining concepts to this task is not feasible and requires handling various challenges, such as 1) mapping the POS tagging task to an association rule mining problem, 2) developing semi-supervised methods to extract association rules from a training set of tagged and raw untagged data combined, and 3) handling the challenges of the POS tagging task (discussed in Sect. 4.2), like class imbalance, data sparsity and phrase boundary problems.

Associative classification [28] is a well known data mining based classification approach which uses association rules [1] to build the classifier model. In this work, we apply associative classification to POS tagging and present TagMiner, a generic semi-supervised method for POS tagging. Our method uses a combination of a small POS tagged corpus and raw untagged text data as training data to build a classifier model using a new concept of context based association rule mining. These association rules work as context based tagging rules. Our experiments demonstrate that it gives good performance even without using any linguistic resources (except for a small POS tagged corpus) for resource-rich English, resource-moderate Hindi and resource-poor Telugu, Tamil and Bengali. Our method is generic in two aspects: (1) it does not use any language specific linguistic information such as morphological features, and there is ample scope for further improvement by including such features; (2) it does not require a large, high quality tagged corpus and uses the POS tags of the tagged corpus only to calculate scores of "context based lists" which are used to form association rules. This can be easily adapted to various languages. Also, as an additional benefit, the model built by our tagger is human understandable since it is based on association rules. Our algorithm has the following advantages, especially suitable for resource poor languages, arising from the use of raw untagged data: (1) it tags unknown words without using smoothing techniques, (2) the coverage of words present in the classifier model is increased, which in turn increases tagging accuracy, and (3) it creates additional linguistic resources from the raw untagged data in the form of word clusters.

The remainder of this paper is organized as follows: Section 2 surveys related work. Section 3 formally presents the problem. Sections 4, 5 and 6 present the details of our proposed approach. Section 7 gives details of the datasets and the various experiments, and discusses the performance. Section 8 concludes our work.
2 Related Work

Associative classifiers use association rules to build a classifier model. They have been successfully applied to various classification tasks; for example, [34] presents an associative classifier for mammography image classification and [26] uses one for predictive analysis in health care data mining. Some of the associative classifiers worth mentioning are CBA [21], which integrates association rules and classification by finding class association rules; CMAR [20], which uses the concept of multiple class-association rules; CPAR [33], which is based on predictive association rules; and ACME [29], which exploits the maximum entropy principle. A good review of various associative classifiers and a detailed analysis of this method can be found in [28]. Among other association rule based approaches, [18] uses association rules in a hybrid system of Naive Bayes and a genetic classifier for text classification, and [23] presents a supervised, language specific hybrid algorithm combining a statistical method and association rule mining to increase the POS tagging accuracy of Chinese text. To the best of our knowledge, no semi-supervised method exists for association rule mining from a training set of tagged and raw untagged data combined.

For POS tagging, one of the first semi-supervised methods was proposed by [10], which uses a raw untagged corpus by incorporating features obtained from a small fraction of untagged data along with features obtained from a large tagged dataset. A good overview of the existing semi-supervised POS tagging methods and a discussion of their limitations is provided by [27], which uses a graph as a smoothness regularizer to train CRFs [19] in a semi-supervised manner from a large untagged dataset and a small tagged dataset. In another approach, [25] presents a condensed nearest neighbor method for semi-supervised POS tagging and reports 97.5% accuracy on the WSJ dataset of English. Most of the existing semi-supervised POS tagging methods use a combination of complex learning methods and existing supervised tagging methods to learn from large untagged data and moderately sized tagged data. All these methods have been developed for resource rich English and other European languages.

To the best of our knowledge, no semi-supervised tagging method has been employed for resource moderate Hindi or resource poor Telugu and Tamil. Also, to the best of our knowledge, no fully data mining based generic POS tagger exists for any language. Baseline POS taggers for various languages are discussed below. We note that all the reported accuracy values were obtained on very small test sets. All the mentioned POS taggers use linguistic (especially morphological) knowledge in some form or other, while our approach uses only the POS tags of the tagged set, in an indirect form, and learns from the raw untagged data.

For the Hindi language, [22] proposes a CRF model with Transformation Based Learning (TBL) and morphological features and reports 78.67% accuracy on the SPSAL corpus. [14] reports 92.36% accuracy on the ISPC corpus using special linguistic features in an HMM model. [24] proposes an HMM model with morphological features and reports 93.05% accuracy. For the Telugu language, [22] applies Transformation Based Learning (TBL) on top of a CRF model and reports 77.37% accuracy on the SPSAL corpus. [14] uses various special linguistic features in an HMM model and reports 91.23% accuracy on the ISPC corpus.
Sharma For Bengali language, [11] presents various supervised and semi-supervised Maximum Entropy and HMM models using morphological features and report 87.9% accuracy for semi-supervised HMM model on CIIL corpus. [13] reports 92.35% accuracy using a voted approach among various models. For Tamil lan- guage, [31] presents a linear programming based SVM model and reports 95.63% accuracy. 3 Problem Definition Automated POS tagging is a classification task which takes a word or a sentence as input, assigns a POS tag or other lexical class marker to a word or to each word in the sentence, and produces the tagged text as output. In semi-supervised paradigm the POS tagger is built from a corpus of untagged sentences and a set of tagged sentences. The POS tagging classification problem is formally defined as follows: Given a set of tags Γ = {T1 , T2 , . . . , Tn }, an annotated set of tagged sen- tences AS = {St1 , St2 , . . . StN }, where Sti = hW1 /Ti , W2 /Tj . . . Wn /Tk i (where Wi is a word and Ti is a tag from Γ ) and a raw untagged training corpus of sentences D = {S1 , S2 . . . SM }, where Si = hW1 W2 . . . Wm i, the goal is to build a classifier model Φ which outputs the best tag sequence hT1 T2 . . . Tl i for an input sequence of words hW1 W2 . . . Wl i. 4 TagMiner 4.1 Mapping POS tagging task to Association Rule Mining problem According to the one sense per collocation [32] hypothesis, the sense of a word in a document is effectively determined by its context. The notion of context has been used in various methods of POS tagging [2, 30]. A context can occur in multiple places in the text. We refer to the list of occurrences of a context as a context based list. We use this idea for building TagMiner. In our method, we mine context based association rules from training data containing both tagged and untagged text. Our method works as follows: – We collect all possible words occurring in the same context from the raw untagged data into a list called context based list (formally defined later). In this way we are able to find groups of words of similar categories from the raw untagged data. – Using the annotated set and the tag finding algorithm (in Fig. 1), we find association rules of the form: Context ⇒ T ag for the context based lists. Each rule maps a context based list to a suitable POS tag. These association rules work as the context based classification rules. – Lastly, we group these context based association rules according to their POS tags to form clusters. This set of clusters is used as the classifier model to tag words using the method described in Sect. 6 and Fig. 2. By experimenting with two varieties of bi-gram (one with preceding word and the other with succeeding word as context) and trigram as possible contexts TagMiner 117 we found that trigram works best for our method. For a word instance Wi , we fix its context as a trigram containing Wi in the middle and we use this context to find the context based list. Any other notion of context can be used as long as it fits into the formalism given below. Context Based List: If Ψ is a function mapping from a word instance Wi in the data to its context Ψ (Wi ), then Ψ −1 (Ψ (Wi )) is a list of words instances sharing the same context. We refer to this list as context based list of Ψ (Wi ). It denotes words of similar category or type as Wi in a specific context and can store multiple instances of a word. For a given trigram (Wi−1 Wi Wi+1 ) of words, Ψ (Wi ) = (Wi−1 , Wi+1 ). 
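The construction of context based lists can be illustrated with a short Python sketch. The function name, the data layout (one list of tokens per sentence) and the toy corpus are ours and only serve to illustrate the trigram context Ψ defined above; the authors' actual implementation may differ.

```python
from collections import defaultdict

def build_context_lists(untagged_sentences):
    """Collect the context based lists of a raw untagged corpus.

    For every word instance Wi having both neighbours inside its sentence,
    the context Psi(Wi) is the pair (W_{i-1}, W_{i+1}); the context based
    list of that pair gathers all word instances observed between those two
    context words (duplicates are kept, as in the definition above).
    """
    context_lists = defaultdict(list)
    for sentence in untagged_sentences:            # sentence = list of tokens
        for i in range(1, len(sentence) - 1):      # sliding window of size three
            context = (sentence[i - 1], sentence[i + 1])
            context_lists[context].append(sentence[i])
    return context_lists

# Toy corpus (ours, purely illustrative):
corpus = [["the", "cat", "sleeps", "here"],
          ["the", "dog", "sleeps", "here"],
          ["a", "cat", "runs", "fast"]]
print(build_context_lists(corpus)[("the", "sleeps")])   # ['cat', 'dog']
```

Word instances collected under the same context word pair tend to belong to the same grammatical category, which is precisely what the context based association rules introduced next exploit.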
The preceding word Wi−1 and succeeding word Wi+1 are called context words and Ψ (Wi ) is called the context word pair of Wi . Context Based Association Rule: For each context based list L, our ap- proach finds association rule of the form L ⇒ T . This rule maps the context based list L to a POS tag T with support and confidence parameters defined below. Since each list L is obtained from a unique context word pair, so each association rule uniquely associates a context to a POS tag and works as the context based tagging rule. In the following definitions and formulas we develop the intuition and the method to compute the interestingness measures of the significant association rules. The complexity in defining support is due to the presence of raw untagged training data required for semi-supervised learning. The support is the frequency (count) of occurrences of the context in the dataset. Context based lists are made from raw untagged data D and we are interested in the words of this list for which we know the tag in annotated set AS. Hence, we define Support of a context as follows: AllTagContextSupport: Number of unique words of a context based list L whose tags are available (in annotated set AS) is denoted as AllT agContextSupport(L). This measure gives the number of tagged words of L. ContextSupport: For a list of words L in which duplicates may be present, ContextSupport(L) is defined as the set of unique words present in L. Coverage: For a context based list L, AllT agContextSupport(L) Coverage(L) = (1) |ContextSupport(L)| This measure represents the confidence that enough number of tagged samples are present in L. ContextTagSupport: Number of unique words of a context based list L present in annotated set AS with a particular tag T is denoted as ContextT agSupport(L, T ). Confidence: For a context based list L and tag T , ContextT agSupport(L, T ) Conf idence(L, T ) = (2) |ContextSupport(L)| This measure represents the confidence that considerable number of words in list L have a particular tag T and leads to rules of the form Context ⇒ T ag. 118 P. Rani, V. Pudi and D. Sharma WordTagSupport: Frequency of tag T for a word W in the annotated set AS is denoted as W ordT agSupport(T, W ). WordTagScore: For a word W and tag T , W ordT agScore is defined as: W ordT agSupport(T, W ) W ordT agScore(W, T ) = (3) max W ordT agSupport(Ti , W ) Ti ∈Γ This represents how good the tag fits the word on a scale of 0 to 1. ListTagScore: For a tag T in context based list L, ListT agScore is defined as: P W ordT agScore(Wi , T ) Wi ∈ContextSupport(L) ListT agScore(L, T ) = (4) |{Wi ∈ ContextSupport(L) : Wi /T ∈ AS}| Where, AS is the annotated set. This formula represents the average frequency of tag T in context based list L. Intuitively, it represents how good the tag fits the list. Unfortunately, this is not always indicative of the correct tag for the list. For example, if a tag is overall very frequent, it can bias this score. Therefore, we compare this with the following score, inspired by the notion of Conviction [9]. BackgroundTagScore: For a tag T in annotated set AS, BackgroundT agScore is defined as: P W ordT agScore(Wi , T ) Wi ∈ContextSupport(AS) BackgroundT agScore(T ) = (5) |{Wi ∈ ContextSupport(AS) : Wi /T ∈ AS}| This represents the average frequency of tag T in annotated set AS. 4.2 POS Tagging Challenges POS tagging, especially for resource poor languages, involves three major chal- lenges listed below. In our approach we handle each of them explicitly. 1. 
Data sparsity problem: Some POS tag classes are present in the annotated set with very few representations. This is not enough to derive statistical information about them. In our approach, the use of raw untagged data reduces this problem (shown in Sect. 7.4). 2. Class imbalance problem: POS tag classes are highly imbalanced in their occurrence frequency. While selecting a tag this may lead to biasing towards the most frequent tags. Existing solutions of class imbalance problem typi- cally favor rare classes [12]. However, while tagging the context based lists, we need to find POS tags for them in such a way that we neither favor frequent tags nor rare tags. We tackle this problem using a novel Minmax approach to find the best preferred POS tag instead of the most frequent one (described in Sect. 5.2). 3. Phrase boundary problem: Some lists are formed at phrase boundaries where the context comes from two different phrases. We need to filter out those context based lists which do not contain words of similar categories. In this case, the context of a word instance need not represent strong context and so the context based list may contain unrelated words. We use suitable parameters to handle this problem (explained in Sect. 5.3). TagMiner 119 1. for each tag Ti ∈ Γ present in annotated set AS do: 2. Find BackgroundT agScore(Ti ) // Use Equation (5) 3. for context based list L do: 4. Find Coverage(L) // Use Equation (1) 5. if Coverage(L) ≥ M inCoverage: 6. ContextT agSupport(L, Tmax ) = max ContextT agSupport(L, Ti ) Ti ∈Γ 7. M axconf = Conf idence(L, Tmax ) // Use Equation (2) 8. if M axconf > M inConf idence: 9. M axT set = {Ti | ContextT agSupport(L, Ti ) == ContextT agSupport(L, Tmax )} 10. BestP ref T ag = F indBestP ref T ag(L, M axT set) 11. Return BestP ref T ag 12. else: Return NOTVALIST 13. else: Return NOTVALIST 14. F indBestP ref T ag(L, M axT set): 15. Initialize P ref T agset = {} 16. for each word W of ContextSupport(L) present in AS do: 17. T agset(W ) = {Ti | W has tag Ti in AS} 18. U nqT agset = T agset(W ) ∩ M axT set 19. Find M axW T ag | W ordT agSupport(M axW T ag, W ) == max W ordT agSupport(Tj , W ) Tj ∈U nqT agset 20. P ref T agset = P ref T agset ∪ M axW T ag 21. Find M inT ag ∈ P ref T agset | ∃Wmin ∈ ContextSupport(L) with W ordT agSupport(M inT ag, Wmin ) == min W ordT agSupport(M inT ag, Wi ) Wi ∈ContextSupport(L) 22. Find ListT agScore(L, M inT ag) // Use Equation (4) 23. if ListT agScore(L, M inT ag) ≥ BackgroundT agScore(M inT ag): Return M inT ag 24. else: Return NOTVALIST Figure 1: Algorithm to find POS tag for a context based list. 5 Building Classifier Model from Context Based Lists 5.1 Finding Association Rule for a Context Based List The first step in our classifier model building method is to compute context based lists from an untagged training corpus D. It may be noted that a context based list can store multiple instances of a word. We use a sliding window of size three to collect the context based lists from D, in a single iteration, taking care of sentence boundaries. In the next step we use the algorithm shown in Fig. 1 to find association rules for all the context based lists. In this algorithm, BackgroundT agScore of all the POS tags present in the annotated set AS (lines 1-2) are computed first. Then for a context based list satisfying the threshold values of Coverage and Conf idence (lines 3-9), function F indBestP ref T ag (described in Sect. 
5.2) finds the best preferred tag (lines 10-11, 14-24) from the set of tags with maxi- mum ContextT agSupport (lines 7-9). For a context based list L present as antecedent in association rule L ⇒ T , tag T returned by this algorithm becomes the consequent. This algorithm outputs best preferred tags for all the context based lists and hence finds association rules for all of them. 120 P. Rani, V. Pudi and D. Sharma 5.2 Handling Class Imbalance Problem We handle the class imbalance problem by using a novel Minmax approach in the function F indBestP ref T ag (lines 14-24 in Fig. 1) and parameters ListT agScore and BackgroundT agScore. In Minmax approach the preferred tag Ti for con- text based list L, is the one which has maximum ContextT agSupport(L, Ti ) but minimum W ordT agSupport(Ti , W ) among those words of list L which have tag Ti as the best tag in AS. This takes care that the selected tag is supported by majority of the words in the list and is not biased by the most frequent tag of the annotated set. To find the best preferred tag in function F indBestP ref T ag, from the set of all the tags with maximum ContextT agSupport value (line 9), at first we found those tags which were best tags (having maximum W ordT agSupport value) for the words of list L in AS (lines 15-20). Next, from this set of preferred tags we find the tag with minimum W ordT agSupport value (line 21). Then criteria ListT agScore(L, Ti ) ≥ BackgroundT agScore(Ti ) (lines 22-23) ensures that the selected tag has above average support in the annotated set and the context based list, both. If none of the tags satisfy this criteria, then we tag the list as “NOTVALIST” (line 24). 5.3 Handling Phrase Boundary Problem To filter out context based lists with the phrase boundary problem (see Sect. 4.2) we use two suitable threshold values for parameters Confidence and Coverage. Coverage takes care of the fact that a context based list has considerable number of words to map it to a tag and Confidence ensures that the tag found for the list is the one which is supported by majority of the words in the list. If context based list L has Coverage and Confidence values less than the corresponding threshold values M inCoverage and M inConf idence, we tag L as “NOTVALIST” (lines 3-8, 12, 13 in Fig. 1). If L satisfies both of the threshold values then only we find the set of all the tags which have maximum value of ContextT agSupport(L, Ti ) and use this set (lines 9-10) to find the best preferred tag for the list (lines 14-24). 5.4 POS tag wise grouping of Association Rules to form Clusters In the last step, we group context based lists according to their POS tags to get clusters of context based lists as classifier model. We exclude context based lists with tag “NOTVALIST” from the grouping process. Then we process these clusters to store word frequencies, corresponding context word pairs and their frequencies in each cluster. We represent the set of clusters as Clustset. Since we are highly confident about the tags of the words present in the annotated set AS so, to improve cluster quality we apply a pruning strategy on the words of the clusters present in AS and remove those words from each cluster which do not have a matching cluster tag in AS. Finally, we get a set of clusters in which each cluster has a set of words with their frequencies and a set of associated context word pairs with their frequencies. Each cluster has a TagMiner 121 unique POS tag. 
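The interestingness measures of Sect. 4.1 and the rule-finding algorithm of Fig. 1 can be summarized in the following Python sketch. It assumes the annotated set AS is stored as a mapping from words to tag frequencies (i.e., WordTagSupport), and it condenses the Minmax step of lines 14-24 of Fig. 1 into a single argmin; it is a simplified reading of the algorithm for illustration, not the authors' implementation.

```python
# AS: annotated set as {word: {tag: count}}, i.e. WordTagSupport(T, W)
# L:  a context based list, i.e. a list of word instances (duplicates allowed)

def context_support(L):
    return set(L)                                    # unique words of the list

def coverage(L, AS):                                 # Equation (1)
    cs = context_support(L)
    return sum(1 for w in cs if w in AS) / len(cs) if cs else 0.0

def context_tag_support(L, tag, AS):
    return sum(1 for w in context_support(L) if w in AS and tag in AS[w])

def confidence(L, tag, AS):                          # Equation (2)
    return context_tag_support(L, tag, AS) / len(context_support(L))

def word_tag_score(word, tag, AS):                   # Equation (3)
    return AS[word].get(tag, 0) / max(AS[word].values())

def list_tag_score(L, tag, AS):                      # Equation (4)
    words = [w for w in context_support(L) if w in AS and tag in AS[w]]
    return sum(word_tag_score(w, tag, AS) for w in words) / len(words) if words else 0.0

def background_tag_score(tag, AS):                   # Equation (5)
    words = [w for w in AS if tag in AS[w]]
    return sum(word_tag_score(w, tag, AS) for w in words) / len(words)

def find_tag(L, AS, min_coverage=0.6, min_confidence=0.6):
    """Simplified reading of Fig. 1: returns a tag or the sentinel NOTVALIST."""
    if coverage(L, AS) < min_coverage:
        return "NOTVALIST"
    supports = {t: context_tag_support(L, t, AS)
                for w in context_support(L) if w in AS for t in AS[w]}
    if not supports:
        return "NOTVALIST"
    best = max(supports.values())
    max_tset = [t for t, s in supports.items() if s == best]
    if confidence(L, max_tset[0], AS) <= min_confidence:
        return "NOTVALIST"
    # Minmax step: among the candidate tags, prefer the one whose supporting
    # words have the lowest frequency for it in AS (avoids frequent-tag bias).
    def min_support(tag):
        return min(AS[w][tag] for w in context_support(L) if w in AS and tag in AS[w])
    min_tag = min(max_tset, key=min_support)
    if list_tag_score(L, min_tag, AS) >= background_tag_score(min_tag, AS):
        return min_tag
    return "NOTVALIST"
```

Returning the sentinel value "NOTVALIST" mirrors the pruning of lists that either fail the Coverage or Confidence thresholds or fail the ListTagScore versus BackgroundTagScore test.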
These clusters are overlapping in nature and words can belong to multiple clusters. 6 POS tagging Method To tag the words of a test sentence we make use of the test word’s context word pair, preceding word and the word frequency in a cluster to decide the tag of the word (see Fig. 2). When a test word is found in only one cluster then we output the cluster tag as the tag of the test word. But when a test word is found in many clusters, then to select the suitable clusters following priority order is followed: 1. Criteria 1: Highest priority is given to the presence of matching context word pair of the test word in the clusters. 2. Criteria 2: Second highest priority is given to the presence of matching preceding word of the test word as first word of the context word pairs in clusters. 3. Criteria 3: Last priority is given to the frequency of the test word in the clusters. For test words not present in any cluster we use criterion 1 and 2 to select appropriate clusters. Based on the priority order, only one of the criterion is used to select the suitable clusters. If we are not able to find any suitable cluster then we return “NOTAG” as the tag of the test word. Even when we find suitable clusters, to increase precision, our method finds POS tags only for those cases where it is confident. It avoids to wrongly classify non confident cases and returns “NOTAG” for them. This is especially useful when the cost of misclassifying (false positive) is high. This also gives opportunity to integrate other language/domain specific POS taggers as they can be used for the non-confident cases. After selecting the suitable clusters we need to make sure that we have enough confidence in the highest probability tag obtained from the clusters. To ensure this we use the parameter TagProbDif, which gives the fractional difference be- tween the highest and the second highest cluster tag probabilities and is defined as follows: T agP rob(Cmax ) − T agP rob(Csecmax ) T agP robDif = (6) T agP rob(Cmax ) Where, Cmax is the cluster with highest T agP rob(Ci ) value and Csecmax is the cluster with second highest T agP rob(Ci ) value. T agP rob(Ci ) of a cluster is defined as follows: Frequency of X in Ci T agP rob(Ci ) = P (7) Frequency of X in Cj ∀Cj ∈Clustset Where, X is set as follows: If the test word is present in cluster Ci then X = test word. For test word not present in any cluster, if the clusters are selected based on the presence of the context word pair of the test word then X = context 122 P. Rani, V. Pudi and D. Sharma for each word W mid in sentence S with context word pair CWp and CWs do: 1. Initialize P redClustset = {} 2. if ∃ cluster Ci ∈ Clustset | W mid ∈ Ci : (a) Find P Clustset = {Ci | W mid ∈ Ci } (b) if ∃ cluster Cj ∈ P Clustset | CWp and CWs pair is present as context word pair in cluster Cj : Find all such clusters from P Clustset and append to P redClustset #Criteria 1 (c) else: if ∃ cluster Cj ∈ P Clustset | CWp is present as preceding word in a context word pair in cluster Cj : Find all such clusters from P Clustset and append to P redClustset #Crit. 2 else: Append P redClustset = P redClustset ∪ P Clustset #Criteria 3 3. else: (a) if ∃ cluster Ci ∈ Clustset | CWp and CWs pair is present as context word pair in cluster Ci : Find all such clusters from Clustset and append to P redClustset #Criteria 1 (b) else: if ∃ cluster Ci ∈ Clustset | CWp is present as preceding word in a context word pair in cluster Ci : Find all such clusters from Clustset and append to P redClustset #Crit. 
2 else: Return NOTAG 4. ∀Ci ∈ P redClustset Find T agP rob(Ci ) // Use Equation 7 5. Find Cmax = cluster with highest T agP rob(Ci ) value in P redClustset 6. Find Csecmax = cluster with second highest T agP rob(Cj ) value in P redClustset 7. Find T agP robDif // Use Equation 6 8. if T agP robDif ≥ M inprobdif : Return P redT ag = POS tag label of cluster Cmax 9. else: Return NOTAG Figure 2: Method to tag words of a sentence using set of clusters Clustset. word pair. If the clusters are selected based on the presence of the preceding word of the test word as first word of the context word pairs in clusters then X = preceding word of the test word. In this way we are able to tag some unseen/unknown words also which are not present in the training data. This, in a way, acts as an alternative of smoothing technique for them. After selecting the clusters (based on priority order) we compute their T agP rob values using (7) and then compute T agP robDif using (6). For T agP robDif value above a suitable threshold value M inprobdif we output the tag of cluster with highest T agP rob value as the tag of the test word, otherwise we return “NOTAG”(see Fig. 2). 7 Experiments, Results and Observations 7.1 Dataset Details We have done our experiments on resource-rich English1 (uses Biber tag set [17]), resource-moderate Hindi [3, 4] and resource-poor Telugu2 [3], Tamil3 and Ben- 1 New York Times dataset of American National Corpus available at http:// americannationalcorpus.org/FirstRelease/contents.html 2 Provided by IIIT Hyderabad, data is part of IL-ILMT project sponsored by MC&IT, Govt. of India Reference No: 11(10)/2006-HCC(TDIL) 3 Available at http://sanskrit.jnu.ac.in/ilci/index.jsp TagMiner 123 gali4 languages. Table 1 gives details of all the language datasets. All the five language datasets have flat tag sets present in annotated training and test sets without any hierarchy and considerable number of lexical ambiguities are also present. We note that except English all the other four languages are morpho- logically rich and have free word-order property. The POS tag data distribution in the resource-moderate and resource-poor language datasets are highly imbal- anced and sparse. Table 1. Details of all language datasets with accuracy values obtained by TagMiner. Hindi Telugu Tamil Bengali English No. of Words in Raw 393303 104281 169705 85796 1293388 Untagged Training set No. of Words in 282548 83442 20207 21561 629532 Annotated Training set No. of POS Tags in 35 28 28 27 109 Annotated Training set No. of Words 70811 20854 22352 20618 471977 in Test set No. of POS Tags 32 24 27 29 105 in Test set No. of Test Words tagged as NOTAG 1916 1634 2647 3448 9385 by TagMiner Average Accuracy 87.8 87.6 83.46 76.17 88.5 (%) (Equation 8) Resource Type Moderate Poor Poor Poor Rich 7.2 Performance Analysis and Observations We observed that following set of threshold values M inConf idence = 60%, M inCoverage = 60% and M inprobDif = 30% for the three parameters gives best AverageAcuracy (defined below) values for all the five languages. Tables 1 and 2 show the results for this set of parameter values. Number of correctly tagged test words AverageAccuracy = (8) |Test set| − No. of test words tagged as NOTAG Where, |Test set| = No. of words in the test set. For both known and unknown test words, for all the five languages, maxi- mum number of correct tagging was done by giving highest priority to presence of context word pair in the cluster. 
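The tagging step of Sect. 6 and Fig. 2 can be sketched as follows, under an assumed cluster representation (each cluster holds its POS tag, the frequencies of its words and the frequencies of its context word pairs). The priority order over the three criteria and the TagProbDif test follow the description above; since the normalizing denominator of Equation (7) is the same for every cluster once X is fixed, it cancels in Equation (6), so raw frequencies suffice for the decision. This is an illustrative simplification, not the authors' code.

```python
def tag_word(word, prev_word, next_word, clusters, min_prob_dif=0.30):
    """Sketch of the tagging procedure of Fig. 2 (names and layout are ours).

    Each cluster is assumed to be a dict such as
      {"tag": "NC",
       "words": {word: frequency, ...},
       "contexts": {(preceding, succeeding): frequency, ...}}.
    """
    containing = [c for c in clusters if word in c["words"]]
    pool = containing if containing else clusters

    # Criteria 1: clusters holding the exact context word pair of the test word.
    selected = [c for c in pool if (prev_word, next_word) in c["contexts"]]
    if selected:
        x_freq = ((lambda c: c["words"][word]) if containing
                  else (lambda c: c["contexts"][(prev_word, next_word)]))
    else:
        # Criteria 2: clusters where the preceding word opens some context pair.
        selected = [c for c in pool
                    if any(p == prev_word for p, _ in c["contexts"])]
        if selected:
            x_freq = ((lambda c: c["words"][word]) if containing
                      else (lambda c: sum(n for (p, _), n in c["contexts"].items()
                                          if p == prev_word)))
        elif containing:
            selected, x_freq = containing, (lambda c: c["words"][word])  # Criteria 3
        else:
            return "NOTAG"

    # TagProbDif of Equation (6); the common denominator of Equation (7) cancels.
    freqs = sorted(((x_freq(c), c["tag"]) for c in selected), reverse=True)
    best = freqs[0][0]
    second = freqs[1][0] if len(freqs) > 1 else 0
    if best > 0 and (best - second) / best >= min_prob_dif:
        return freqs[0][1]
    return "NOTAG"
```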
Here, known words means test set words which are present in untagged training set and unknown word means unseen test set words which are not present in the untagged training set. Note that words of annotated set are not included in the classifier model, only their tags 4 Available at http://sanskrit.jnu.ac.in/ilci/index.jsp 124 P. Rani, V. Pudi and D. Sharma are used indirectly while building the model. In the results shown in Table 1, around 46% unknown English words, 60% unknown Hindi words, 67% unknown Telugu words, 52% unknown Bengali words and 57% unknown Tamil words were correctly tagged using their context word pair. This shows the strength of our tagger to tag unknown words without using any smoothing technique used by other POS taggers. In Table 2, we compare our results with a supervised CRF5 tagger [19]. This tagger uses words, their POS tag and context word pair information from annotated data, while our tagger uses words and their context word pair in- formation from untagged data and POS tag information from annotated data. We observe that for annotated data size ≤ 25K words, our tagger gives bet- ter AverageAccuracy than CRF tagger. Our tagger also gives better POS tag precisions and better tagging accuracies than CRF tagger for unknown words and performance improves by increasing the untagged data size up to a certain size. This shows that our tagger can be a better choice for resource poor lan- guages. Also, as an additional benefit model made by our tagger is more human understandable than model made by CRF tagger. 7.3 Effect of Annotated (POS tagged) Data Size We varied the size of annotated set of Tamil (see Table 3) while keeping the raw untagged set constant and observed that the coverage of words by the clusters in the classifier model increases with the increase in the size of annotated data, the tagging accuracy increases while the number of words missed by the model (tagged as “NOTAG”) decreases. For all languages we observed that increasing the annotated training data size improves cluster quality which increases the AverageAcuracy values but only up to a certain size. We also observed that there is only a slight decrease in AverageAcuracy value with decrease in annotated set size, so performance does not decrease drastically when the annotated set is made smaller. Our tagger gives above 70% AverageAcuracy for annotated data size as low as 5K and raw untagged data size 10K on all the languages. This justifies the use of small annotated set to build a semi-supervised POS tagging model for resource poor languages. 7.4 Effect of Raw Untagged Data Size In Tables 1, 2 and 4 , we observe that increasing the raw untagged training data size initially increases word coverage of clusters which in turn increases the AverageAcuracy values but stabilizes after a certain size. For all languages we observed that the coverage of words by the clusters in the classifier model increases with the increase in the size of untagged data (while keeping the size of annotated set constant). This accounts for the increase in tagging accuracy and decrease in the number of words missed by the model (tagged as NOTAG). Other interesting observation is that AverageAccuracy does not vary much as 5 Available at http://crfpp.googlecode.com/svn/trunk/doc/index.html, CRF model outputs tag for all test words. So, for CRF tagger AverageAccuracy = (No. of correctly tagged test words)/(No. of test words). TagMiner 125 Table 2. 
Average Accuracy values for all languages obtained by CRF tagger and our tagger TagMiner for various annotated training set sizes ( ≤ 25000 words). Test Annotated CRF TagMiner Lang. set Training Average Average No. of Untagged size set size Accuracy Accuracy NOTAG Training (%) (%) Words set size 5730 74.6 79.1 3195 10025 10030 78.4 79.05 2740 10025 Hindi 20227 15771 81.3 82.1 2116 25020 25591 84.7 85.0 1903 50019 4994 59.4 81.4 5617 9994 9994 67.1 82.6 3897 23435 Telugu 20854 14995 71.2 84.4 3240 43434 23435 75.3 84.3 2419 104281 5006 48.9 75.0 6957 40988 9941 59.9 79.7 4357 80004 Tamil 22352 15007 65.9 82.1 3778 80004 20207 69.4 83.1 3495 80004 5010 47.3 73.5 7081 49997 10003 56.2 74.6 5332 85796 Bengali 20618 15009 59.3 75.2 4269 85796 21561 63.0 77.8 4170 85796 10671 70.4 79.7 3774 50444 English 24952 15298 72.9 82.3 3424 50444 24825 76.4 82.5 2574 93679 Table 3. Effect of annotated data size on classifier model on 22352 Tamil test set words for M inCoverage = 60%, M inConf idence = 60%, M inprobdif = 30% and 169705 raw untagged words. No. of No. of Clusters No. of Average Words in in model NOTAG test Accuracy Annotated set (No. of unique words) Words (%) 5006 22 (2021) 5317 72.2 9941 25 (3553) 3575 79.0 15007 26 (4842) 2940 82.09 20207 26 (5774) 2647 83.46 the untagged data size varies, so our algorithm is able to perform well even with a small sized untagged data. 7.5 Effect of Various Parameters We made the following observations about the effect of parameter values: (1) Increasing threshold values of M inConf idence for parameter Confidence, it in- 126 P. Rani, V. Pudi and D. Sharma Table 4. Effect of raw untagged data size on classifier model on 70811 Hindi test set words and 20854 Telugu test set words for M inCoverage = 60%, M inConf idence = 60% and M inprobdif = 30% with 282548 annotated Hindi words and 104281 annotated Telugu words. No. of No. of Clusters No. of Average Language Words in in model NOTAG test Accuracy Raw set (No. of unique words) Words (%) 50019 25 (4366) 4714 87.3 98331 28 (6081) 3664 87.7 128329 28 (6865) 3220 87.9 158337 29 (7546) 2890 87.9 Hindi 188326 29 (8112) 2793 88.0 196659 29 (8260) 2748 88.0 282554 30 (9517) 2484 88.0 294979 30 (9663) 2450 88.1 393303 30 (10817) 1916 87.8 23435 23 (3600) 3079 86.1 43434 23 (5091) 2276 87.4 Telugu 63436 23 (6221) 1828 87.4 83442 23 (7198) 1749 88.2 creases the quality of clusters but at the same time it also increases the number of context based lists tagged as “NOTVALIST” which decreases the word cover- age of clusters. (2) Decreasing threshold values of M inCoverage for parameters Coverage although decreases the quality of clusters but at the same time it in- creases the word coverage of clusters by decreasing the number of context based lists tagged as “NOTVALIST”. (3) By varying the threshold value of Minprobdif from 5% to 30% for parameter TagProbDif we found that increasing the thresh- old value increases the precision values of POS tags but slightly decreases their recall because the number of words tagged as “NOTAG” increases. Practical advantage of this parameter is that it ensures that tagging of ambiguous and non-confident cases is avoided. (4) The number of POS tag clusters obtained in the classifier model is almost independent of the selected threshold values of the parameters. 
For the datasets given in Table 1 and for the range of threshold val- ues M inConf idence = 60% to 90% and M inCoverage = 0% to 75%, number of POS tag clusters found for English was 100 to 101, for Hindi was 29 to 31, for Tamil was 22 to 26, for Bengali was 25 and for Telugu was 23. We noted that the POS tags missing from the set of clusters were the rare POS tags having very low frequencies. 8 Conclusions and Future Work In this work we developed TagMiner, a semi-supervised associative classification method for POS tagging. We used the concept of context based list and context based association rule mining. We developed a method to find interestingness TagMiner 127 measures required to find the association rules in a semi-supervised manner from a training set of tagged and raw untagged data combined. We showed that TagMiner gives good performance for resource rich as well as resource poor languages without using extensive linguistic knowledge. It works well even with less tagged training data and less untagged training data. It can also tag unknown words. To some extent, it handles class imbalance and data sparsity problems using the untagged data and a special method to find interestingness measures. It handles phrase boundary problem using a set of parameters. These advantages make it very suitable for resource poor languages and can be used as an initial POS tagger while developing linguistic resources for them. Future work includes (1) using other contexts instead of trigram, (2) finding methods to include linguistic features in the current approach, (3) mining tagging patterns from the clusters to find tag of a test word and (4) using this approach for other lexical item classification tasks. References 1. Agrawal, R., Imieliński, T., Swami, A.: Mining Association Rules Between Sets of Items in Large Databases. In: Proc. of SIGMOD. pp. 207–216 (1993) 2. Banko, M., Moore, R.C.: Part-of-Speech Tagging in Context. In: Proc. of COLING (2004) 3. Bharati, A., Misra Sharma, D., Bai, L., Sangal, R.: AnnCorra : Annotating Corpora Guidelines For POS And Chunk Annotation For Indian Languages. Tech. Rep. TR- LTRC-31, Language Technologies Research Centre, IIIT, Hyderabad (2006) 4. Bhatt, R., Narasimhan, B., Palmer, M., Rambow, O., Sharma, D.M., Xia, F.: A multi-representational and multi-layered treebank for Hindi/Urdu. In: Proc. of the Third Linguistic Annotation Workshop. pp. 186–189 (2009) 5. Biemann, C.: Unsupervised Part-of-Speech Tagging Employing Efficient Graph Clustering. In: Proc. of ACL (2006) 6. Brants, T.: TnT: a statistical part-of-speech tagger. In: Proc. of ANLP. pp. 224–231 (2000) 7. Brill, E.: A Simple Rule-Based Part of Speech Tagger. In: Proc. of ANLP. pp. 152–155 (1992) 8. Brill, E.: Transformation-Based Error-Driven Learning and Natural Language Pro- cessing: A Case Study in Part-of-Speech Tagging. Comput. Linguist. 21(4), 543–565 (1995) 9. Brin, S., Motwani, R., Ullman, J.D., Tsur, S.: Dynamic Itemset Counting and Implication Rules for Market Basket Data. In: Proc. of SIGMOD. pp. 255–264 (1997) 10. Cutting, D., Kupiec, J., Pedersen, J., Sibun, P.: A practical part-of-speech tagger. In: Proc. of the third conference on ANLP. pp. 133–140 (1992) 11. Dandapat, S., Sarkar, S., Basu, A.: Automatic Part-of-speech Tagging for Bengali: An Approach for Morphologically Rich Languages in a Poor Resource Scenario. In: Proc. of the 45th Annual Meeting of the ACL on Interactive Poster and Demon- stration Sessions. pp. 221–224 (2007) 12. 
Dubey, H., Pudi, V.: Class Based Weighted K-Nearest Neighbor over Imbalance Dataset. In: Proc. of PAKDD (2). pp. 305–316 (2013) 128 P. Rani, V. Pudi and D. Sharma 13. Ekbal, A., Hasanuzzaman, M., Bandyopadhyay, S.: Voted Approach for Part of Speech Tagging in Bengali. In: Proc. of PACLIC. pp. 120–129 (2009) 14. Gadde, P., Yeleti, M.V.: Improving statistical POS tagging using Linguistic feature for Hindi and Telugu. In: Proc. of ICON (2008) 15. Gimenez, J., Marquez, L.: Svmtool: A general pos tagger generator based on sup- port vector machines. In: Proc. of LREC. pp. 43–46 (2004) 16. Goldwater, S., Griffiths, T.: A fully Bayesian approach to unsupervised part-of- speech tagging. In: Proc. of the 45th Annual Meeting of the ACL. pp. 744–751 (June 2007) 17. Ide, N., Suderman, K.: The American National Corpus First Release. In: Proc. of LREC. pp. 1681–1684 (2004) 18. Kamruzzaman, S.M., Haider, F., Hasan, A.R.: Text Classification using Association Rule with a Hybrid Concept of Naive Bayes Classifier and Genetic Algorithm. CoRR abs/1009.4976 (2010) 19. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional Random Fields: Prob- abilistic Models for Segmenting and Labeling Sequence Data. In: Proc. of ICML. pp. 282–289 (2001) 20. Li, W., Han, J., Pei, J.: CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. In: Proc. of ICDM. pp. 369–376 (2001) 21. Liu, B., Hsu, W., Ma, Y.: Integrating Classification and Association Rule Mining. In: Proc. of KDD. pp. 80–86 (1998) 22. P.V.S., A., G., K.: Part Of Speech Tagging Using Conditional Random Fields and Transformation Based Learning. In: Proc. of IJCAI Workshop SPSAL. pp. 21–24 (2007) 23. Shaohong, Y., Guidan, F.: Research of POS Tagging Rules Mining Algorithm. Applied Mechanics and Materials 347-350, 2836–2840 (August 2013) 24. Shrivastava, M., Bhattacharyya, P.: Hindi POS Tagger Using Naive Stemming: Harnessing Morphological Information Without Extensive Linguistic Knowledge. In: Proc. of ICON (2008) 25. Søgaard, A.: Semisupervised condensed nearest neighbor for part-of-speech tag- ging. In: Proc. of ACL HLT: short papers - Volume 2. pp. 48–52 (2011) 26. Soni, S., Vyas, O.: Using Associative Classifiers For Predictive Analysis In Health Care Data Mining. Int. Journal Of Computer Application 4(5), 33–37 (July 2010) 27. Subramanya, A., Petrov, S., Pereira, F.: Efficient graph-based semi-supervised learning of structured tagging models. In: Proc. of EMNLP. pp. 167–176 (2010) 28. Thabtah, F.: A Review of Associative Classification Mining. The Knowledge En- gineering Review 22(1), 37–65 (Mar 2007) 29. Thonangi, R., Pudi, V.: ACME: An Associative Classifier Based on Maximum Entropy Principle. In: Proc. of ALT. pp. 122–134 (2005) 30. Toutanova, K., Klein, D., Manning, C.D., Singer, Y.: Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proc. of NAACL HLT’03 - Volume 1. pp. 173–180 (2003) 31. V., D., Kumar, A., G., S., P., S.K., S., R.: Tamil POS Tagging using Linear Pro- gramming. Int. Journal of Recent Trends in Engineering 1(2), 166–169 (May 2009) 32. Yarowsky, D.: One sense per collocation. In: Proc. of the workshop on Human Language Technology. pp. 266–271 (1993) 33. Yin, X., Han, J.: CPAR: Classification based on Predictive Association Rules. In: Proc. of SDM. pp. 331–335 (2003) 34. Zaı̈ane, O.R., Antonie, M.L., Coman, A.: Mammography Classification By an As- sociation Rule-based Classifier. In: Proc. of MDM/KDD. pp. 
62–69 (2002) Sequential Patterns of POS Labels Help to Characterize Language Acquisition Isabelle Tellier1,2 , Zineb Makhlouf1 , Yoann Dupont1 (1) Lattice, CNRS - UMR 8094, (2) University Paris 3 - Sorbonne Nouvelle Abstract. In this paper, we try to characterize various steps of the syntax acqui- sition of their native language by children with emerging sequential patterns of Part Of Speech (POS) labels. To achieve this goal, we first build a set of corpora from the French part of the CHILDES database. Then, we study the linguis- tic utterances of the children of various ages with tools coming from Natural Language Processing (morpho-syntactic labels obtained by supervised machine learning) and sequential Data Mining (emerging patterns among the sequences of morpho-syntactc labels). This work thus illustrates the interest of combining both approaches. We show that the distinct ages can be characterized by vari- ations of proportions of morpho-syntactic labels, which are also clearly visible inside the emerging patterns. Keywords. language acquisition, POS labeling, CRF, Sequential Data Mining, emerging patterns 1 Introduction The acquisition of their native language by children, especially how grammatical constructions are gradually mastered, is a process which largely remains myste- rious. Some researches address this issue within a Natural Language Processing framework, for example by implementing programs trying to mimic the learning process [CM06,Ali10]. Our approach in this paper is different: we do not target to reproduce, but to mine children productions, from a morphosyntactic point of view. More precisely, we study the linguistic utterances of children of vari- ous ages, seen as sequences of part-of-speech (POS) labels, with sequential data mining tools. Sequential data mining can be applied to any kind of data following an order relation. This relation is often related to time; for texts, it is only the linear order of words in sentences. Sequential data mining allows to extract sequential patterns, that is sequences or sub-sequences of itemsets that repeatedly occur in the data. This domain has given rise to many works [AS95,SA96,Zak01,NR07]. If the extracted sequences are contiguous portions of texts, patterns coincides with the older notion of repeated segments [Sal86]. When data are composed of natural language texts, the itemsets are not nec- essarily reduced to words: lemmas and POS labels can also be taken into account. The use of sequential data mining technics in such a linguistic context has re- cently been tested for the extraction of Named Entities [NAFS13], the discovery In: P. Cellier, T. Charnois, A. Hotho, S. Matwin, M.-F. Moens, Y. Toussaint (Eds.): Proceedings of DMNLP, Workshop at ECML/PKDD, Nancy, France, 2014. Copyright c by the paper’s authors. Copying only for private and academic purposes. 130 I. Tellier, Z. Makhlouf and Y. Dupont of relations between entities in the biological field [CPRC09,CCP10,BCCC12] or the study of stylistic differences between textual genres [QCCL12]. As we look at the emergence of grammatical constructions in children, we are mainly inter- ested here in patterns of morpho-syntactic labels. As a matter of fact, they are more general than words or lemmas and provide more abstract characterizations of a given age. We seek in particular to exhibit specific emerging patterns for different age groups. The remaining of the article is as follows. 
First, we present the way our corpora of children’s productions of different age groups have been collected. Then, we explain how we processed their morpho-syntactic analysis. Observing that usual POS taggers available for French made many mistakes on our data, we have built a new one, by training a machine learning device (a CRF model) on a reduced set of manually corrected data. We show that, despite this reduced set of manual corrections, the new tagger obtained behaves far better than the previous one on our data. Finally, the last part of the paper describes the technique used for the extraction of n-grams of morpho-syntactic labels of each specific age group and provides quantitative and qualitative analyses of the corresponding emerging patterns. 2 Corpora 2.1 The CHILDES Corpus Several resources collecting children’s productions exist online, as those avail- able in the CNRTL1 . But the best known and most widely used database is CHILDES2 [Elm01], a multilingual corpus of transcriptions of recorded inter- actions between adults and children. In this article, we are only interested in the French part of these data. The recordings of a child cover several months or years, the age of the children may therefore vary from one record to another. Relying on the transcription manual3 which explicits the meta-data associated with the corpus, we created six different sub-corpora corresponding to six age groups: from the "1-2 years" to the "6-7 years". 2.2 Pretreatments In this corpus, children and parents communicate by speech turns. Each speech turn is transcribed and delimited by a period. In the following, we consider that each line corresponds to a "sentence". The transcriptions are annotated and are often followed by additional information in a (semi-)standard format allowing to describe elements of the situation (e.g. objects which are in the scene). We performed a preprocessing step to focus only on linguistic productions. We have 1 Centre National des Ressources Textuelles et Linguistiques (http://www.cnrtl.fr for children’s production): see Traitement de Corpus Oraux en Français (TCOF) corpus 2 http://childes.psy.cmu.edu/ 3 http://childes.psy.cmu.edu/manuals/CHAT.pdf Sequential Patterns of POS Labels Help to Characterize Language Acquisition 131 removed all special characters related to standards of transcription, as well as all information of phonetic nature, which are not relevant for the analysis of syntactic constructions and prevent the use of a tagger. We have also eliminated from our data all adult utterances. The characteristics of each of our initial sub-corpora are presented in the table of Figure 1. There are differences between them: the corpus for the age of "6-7 years" is the smallest one. To balance the corpora of the different age groups, we have sampled them according to the number of words: this feature is more reliable than the number of sentences, because the length of the sentences is a key factor which significantly varies from one age to another (see the following). To have comparable sub-corpora, the number of words is thus more reliable than the number of sentences. corpus number of number of nb of distinct average length sentences words words of the sentences 1-2 years 41786 63810 3019 1.23 2-3 years 115114 324341 8414 2.15 3-4 years 60317 243244 8479 4.62 4-5 years 16747 74719 4465 4.71 5-6 years 4542 29422 938 6.96 6-7 years 3383 21477 841 6.88 Fig. 1. 
Characteristics of the initial sub-corpora 2.3 Sampling The smallest corpus in terms of words (the one of "6-7 years") is the reference sample for the other age groups. So, we chose to take 20,000 words per corpus, with a rate of 0.01% tolerance. To build our new corpora from the initial ones, we sampled sentences randomly until the sum of all words in all sentences reaches this size. After the sampling, we have six new corpora, whose properties are given in the table of Figure 2. The corpora now have comparable size in terms of words. The number of sentences in each corpus have of course decreased, but we note that the average lengths of the sentences follow the same evolution than in the initial corpora. This is crucial because, as long as the children grow up, they tend to produce longer sentences. This is a well-known key feature of language acquisition [Bro73,MC81]. To go further in our exploration, we will now label the productions of the children with morpho-syntactic labels. 132 I. Tellier, Z. Makhlouf and Y. Dupont corpus number of number of nb of distinct average length sentences words words of the sentences 1-2 years 14284 20348 1086 1.42 2-3 years 9075 20504 1427 2.26 3-4 years 5043 21051 1575 4.17 4-5 years 4433 20949 1806 4.73 5-6 years 3047 20514 805 6.73 6-7 years 3147 20525 819 6.52 Fig. 2. Characteristics of the sampled sub-corpora 3 POS labeling 3.1 Use of an existing tagger As we want to characterize the acquisition of syntactic constructions, we need more information than simple transcriptions of words. Our experiments in this article rely on a morpho-syntactic tagging of children’s productions: we must thus assign to each word in the sub-corpora a label corresponding to its gram- matical category. Several tools are available to annotate plain text in French with "Part of Speech" (POS) labels, such as TreeTagger [Sch94]. In our work, we have used SEM4 [TDE+ 12], which was obtained by training a linear CRF (Conditional Random Fields) model on the French Treebank [ACT03]. The set of labels adopted in SEM, similar to the one of [CC08], includes 30 different categories among which the main important ones for the following are: NC (for common nouns), V (for verbs), DET (for determiners), P (for prepositions), I (for interjections) and CLS (for subject clitic). SEM also integrates the external lexical resource Lefff [CSL04] to help achieve a better labeling. SEM has been learned with labeled sentences extracted from the French newspaper "Le Monde". Our texts of children productions have very different properties, and we therefore expect many annotation errors. Indeed, the cor- pus CHILDES is composed of oral transcription, whose conventions differ from those of writing (especially concerning punctuations). Furthermore, children ut- terances are often far from standard French. It has already been observed that, even if SEM is supposed to reach 97% accuracy on texts similar to those on which it has been learned, it reaches 95.6% accuracy on more casual written texts from blogs, and only 81.6% on oral productions of adults. To assess the quality of SEM on our data, we have randomly selected 200 sentences from each of our six corpora, tagged them with SEM and manually corrected the labeling errors, following the annotation conventions of the French Treebank. The accuracy of SEM on these samples (see table of Figure 4) ranges from 70% (2-3 years) to 87% (6-7 years). 
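The evaluation reported here and in Figures 3 and 4 amounts to token-level accuracy and per-label F-measure over the manually corrected samples; a generic sketch of such an evaluation (ours, not the authors' script) is given below.

```python
from collections import Counter

def evaluate(gold, pred):
    """Token-level accuracy and per-label F-measure of a POS tagger.

    `gold` and `pred` are equal-length lists of label sequences, one per
    sentence.  This is a standard computation, assumed rather than taken
    from the paper.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    correct = total = 0
    for g_sent, p_sent in zip(gold, pred):
        for g, p in zip(g_sent, p_sent):
            total += 1
            if g == p:
                correct += 1
                tp[g] += 1
            else:
                fp[p] += 1
                fn[g] += 1
    f_measures = {}
    for label in set(tp) | set(fp) | set(fn):
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f_measures[label] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return correct / total, f_measures
```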
The detailed F-measures of the main categories for each age group can also be seen in the table of Figure 3: the label interjection (I), very rare in the French Treebank but very frequent in our 4 http://www.lattice.cnrs.fr/sites/itellier/SEM.html Sequential Patterns of POS Labels Help to Characterize Language Acquisition 133 corpora, are particularly not well recognized by SEM (the F-measures goes from 33.33 for the "1-2 years" age group to 0 for the the "6-7 years" one). 3.2 Learning a New tagger As we want to perform statistical measures on the morpho-syntactic labels, la- beling errors must be reduced as much as possible. In [TDEW13], it has been shown that to learn a good tagger by supervised machine learning, it is more efficient to have a small annotated corpus similar to the target data than to have a large too different training set. So, we decided to use the labelled sentences which have been manually corrected for the evaluation of SEM as training data to learn a new tagger adapted to our corpora. For this, we have used the same tools as those used to learn SEM, that is CRFs (Conditional Random Fields), introduced by [LMP01] and implemented in the software Wapiti [LCY10]. CRFs are graphical models that have proven their effectiveness in the field of automatic annotation by supervised machine learning [TTA09,TDE+ 12]. They allow to assign the best sequence of annotations y to an observable sequence x. For us, the elements of x are words enriched with endogenous attributes (presence of caps, digits, etc.) or exogenous ones (e.g. associated properties in Lefff), while y is the corresponding sequence of morpho- syntactic labels. We trained our new tagger thanks to 200 ∗ 6 = 1200 annotated and manually corrected sentences (which is a very small number to learn a POS tagger), and we tested it on 50 ∗ 6 = 300 other independent sentences, equally sampled from the 6 distinct sub-corpora. The table of Figure 3 gives the F-measures of the main labels obtained by SEM and by the re-learned tagger for each age group, while the accuracy of both taggers are provided in the table of Figure 4. corpus CLS DET I NC P V 1-2 years 100/100 80/100 33.33/57.14 76.92/84.21 0/0 80/100 2-3 years 71.43/93.33 66.67/54.55 12.5/90.91 71.43/80 40/33.33 71.43/63.64 3-4 years 77.42/100 80/78.26 13.33/88.89 88.89/94.74 71.43/71.43 83.87/94.74 4-5 years 89.8/94.55 80.95/89.36 8.7/97.78 75.76/93.15 90.91/80 88.89/95.89 5-6 years 81.08/97.56 91.18/93.15 0/94.74 86.32/96.08 78.05/88.89 92.96/90.14 6-7 years 96.55/100 87.88/97.14 0/80 90/92.13 89.47/87.8 93.88/89.36 Fig. 3. F-measures of the main distinct labels before (with SEM) /after the re-learning We observe that the relearning leads to a significant improvement of the accuracy of about 10% in average. SEM is better for only 4 cells out of 36 in the table of Figure 3, probably thanks to its better vocabulary exposure: the French Treebank on which SEM was learned was about ten times larger than our training corpus. The improvement brought by relearning is larger for oral-specific labels such as I. It is therefore very beneficial, despite a very small training corpus. This 134 I. Tellier, Z. Makhlouf and Y. Dupont corpus SEM re-trained tagger 1-2 years 82% 85% 2-3 years 70% 80% 3-4 years 73% 88% 4-5 years 75% 90% 5-6 years 80% 92% 6-7 years 87% 90% average 77.83% 87.5% Fig. 4. 
Impact of the re-learning on the accuracy of the distinct age groups can be explained by the fact that the vocabulary used in our texts is relatively limited and redundant: few data are therefore sufficient to obtain a tagger which is effective on our corpus, even if it is not uniformly better than SEM on every label (it would obviously be much less effective on other types of data). In the following, we systematically use the new version of the tagger. 3.3 Analysis of POS labels Figure 5 shows the distribution of the main morpho-syntactic categories in the different age groups. For example, we see that the curve of the label I (interjec- tion) is decreasing (except for the 4-5 years age group): it seems that children use fewer and fewer interjections in their productions as long as grow up. In contrast, the label P (preposition) is strictly increasing, which is consistent with an acquisition of increasingly sophisticated syntactic constructions. Curves for the labels CLS (subject clitic) and V (verb) follow very similar variations, prob- ably because they are often used together: they increase till the age of 4, then decrease from 4 to 6, and finally stabilize at the age of 6. Observing labels DET (determiner) and NC (common nouns), we notice that until the age of 4 years, NC is the most common label, but not yet being systematically associated with a DET. It is only at the age of 4 that both curves become parallel (most probably when most NC is preceded by a DET). We finally note that from the age of 5 years, the proportions of different labels stabilize. The residual errors of the tagger (there is more than 10% remaining labeling errors) lead us to be prudent with these observations. But it is clear that some of the phenomena observed here would not have been possible without re-learning: interjections, for example, were the words most poorly recognized by the original SEM, because they are very rare in newspaper articles. However, their production appears to be an important indicator of the child’s age group. Example sentences like "ah maman" ("ah mom") or "heu voilà " ("uh there") were respectively labeled as "ADJ NC" and "ADV V" with the original SEM tagger. After the re-learning, the labels became "I NC " and "I V", which is at least more correct. Although we can already draw some interesting conclusions from these curves, we cannot characterize the syntactic acquisition of children from single isolated Sequential Patterns of POS Labels Help to Characterize Language Acquisition 135 Fig. 5. Proportions of each label for each age group categories. We thus decided to use sequential data mining techniques on our data to explore them further. 4 Sequential Patterns Extraction 4.1 General Definitions Many studies have focused on the analysis of texts seen as sequential data. For example, the notion of repeated segment is used in textometrics [Sal86] to characterize a contiguous sequence of items appearing several times in a text. Sequential data mining [AS95] generalizes such concept, with notions like se- quential patterns of itemsets. In our case, itemsets can be composed of words and POS labels. A sequence of itemsets is an ordered list of itemsets. An order relation can be defined on such sequences: a sequence S1 = hI1 , I2 , ..., In i is in- cluded into a sequence S2 = hI10 , I20 , ..., Im 0 i, which is noted S1 ⊆ S2 , if there exist integers 1 ≤ j1 ≤ j2 ≤ ... ≤ jn ≤ m such that I1 ⊆ Ij01 , I2 ⊆ Ij02 , ..., In ⊆ Ij0n (in the classical sense of itemset inclusion). 
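The inclusion relation just defined can be checked with a few lines of Python; a greedy left-to-right match is sufficient, because if any embedding of S1 into S2 exists, the earliest-match embedding exists as well. Names and the toy sequence are ours.

```python
def is_subsequence(s1, s2):
    """Test S1 ⊆ S2 for sequences of itemsets (itemsets as frozensets)."""
    j = 0
    for itemset in s2:
        if j < len(s1) and s1[j] <= itemset:   # itemset inclusion I_k ⊆ I'_jk
            j += 1
    return j == len(s1)

s2 = [frozenset({"le", "DET"}), frozenset({"petit", "ADJ"}), frozenset({"chat", "NC"})]
print(is_subsequence([frozenset({"ADJ"}), frozenset({"NC"})], s2))   # True
print(is_subsequence([frozenset({"NC"}), frozenset({"ADJ"})], s2))   # False: order matters
```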
The table of Figure 6 provides examples of sequences of itemsets found in our corpus labelled with the re-trained tagger. The support of a sequence S, denoted sup(S), is equal to the number of sentences of the corpus containing S. For example, in the table of Figure 6, sup(h(ADJ) (NC)i) = 2. The relative support of a sequence S is the proportion of sequences containing S in the base of initial sequences. It is worth 21 for the sequence in our example, because this sequence is present in 2 out of the 4 sequences of the database. Algorithms mining sequential patterns are based on a minimum threshold for extracting frequent patterns. A frequent pattern is thus a sequence for which the support is greater than or equal to this threshold. Other concepts are also useful to limit the number of extracted patterns. 136 I. Tellier, Z. Makhlouf and Y. Dupont seq. id sequence 1 h(le, DET) (petit, ADJ) (chat, NC)i ("the little cat") 2 h(le, DET) (grand, ADJ) (arbre, NC)i ("the big tree") 3 h(le, DET) (chat, NC)i ("the cat") 4 h(tombé, VPP) (et, CC) (cassé, VPP)i ("fallen and broken") Fig. 6. Examples of sequences of itemsets (word, POS label) 4.2 Extraction of Sequential Patterns under constraints In [YHA03], was introduced the notion of closed patterns that allows to eliminate redundancies without loss of information. A frequent pattern S is closed, if there is no other frequent pattern S 0 such S ⊆ S 0 and sup(S) = sup(S 0 ). In our example, if we fix minsup=2, the frequent pattern h(DET) (NC)i, extracted from Figure 6, is not closed because it is included in the pattern h(le, DET) (NC)i and they both have a support equals to 3. But the pattern h(DET) (small, ADJ) (NC)i is closed. A length constraint can also be used. It defines the minimum and maximum number of items contained in a pattern [BCCC12]. 4.3 Algorithm There are several available tools for extracting sequential patterns such as GSP [SA96] and SPADE [Zak01]. CloSpan [YHA03] and BIDE [WH04] are able to extract frequent closed sequential patterns. SDMC5 , used here, is a tool based on the method proposed in [PHMA+ 01]. It extracts several types of sequen- tial patterns, where items can correspond to simple words, lemma and/or their morpho-syntactic category (the tagger is parameterized, which allowed us to use our tagger). In this work, we wanted to characterize grammatical constructions, and we thus focused only on sequences of POS labels. The algorithm of SDMC implements the pattern growth technic; it is briefly discussed in [BCCC12]. It allows to extract sequential patterns under several constraints. 4.4 Emerging Patterns [DL99] introduced the concept of emerging pattern. A frequent sequential pat- tern is called emerging if its relative support in a set of data set is significantly higher than in another set of data. Formally, a sequential pattern P of a set of data D1 is emerging relatively to another set of data D2 if GrowthRate(P ) ≥ ρ, 5 https://sdmc.greyc.fr, login and password to be asked Sequential Patterns of POS Labels Help to Characterize Language Acquisition 137 with ρ > 1. The growth rate function is defined by: ( ∞ if supportD2 (P ) = 0 suppD1 (P ) suppD (P ) otherwise 2 where suppD1 (P ) (respectively suppD2 (P )) is the relative support of the pat- tern P in D1 (respectively D2 ). Any pattern P whose support is zero in a set is neglected. 5 Experiments 5.1 Parameters The corpora used in our experiments are those described in section 2.3. 
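Anticipating the experimental setting detailed just below (contiguous patterns of POS labels, lengths 1 to 10, minimum support 2, growth rate threshold ρ = 1.001), the following sketch mines frequent label n-grams in one age group and keeps those that are emerging with respect to the union of the other groups. It ignores closed-pattern pruning and the other facilities of SDMC, and the toy data are purely illustrative; it is a sketch of the idea, not the tool actually used.

```python
from collections import Counter

def ngram_counts(tagged_corpus, min_len=1, max_len=10):
    """Count contiguous patterns (n-grams) of POS labels.  Since these
    patterns have no gap, the support of a pattern is the number of
    sentences containing it, so each pattern is counted once per sentence."""
    counts = Counter()
    for labels in tagged_corpus:                 # one label sequence per sentence
        seen = set()
        for n in range(min_len, min(max_len, len(labels)) + 1):
            for i in range(len(labels) - n + 1):
                seen.add(tuple(labels[i:i + n]))
        counts.update(seen)
    return counts

def emerging_patterns(d1, d2, minsup=2, rho=1.001):
    """Patterns frequent in age group d1 whose relative support grows by at
    least rho with respect to d2 (the union of the other age groups)."""
    c1, c2 = ngram_counts(d1), ngram_counts(d2)
    emerging = {}
    for pattern, count in c1.items():
        if count < minsup:                       # keep only frequent patterns of d1
            continue
        supp1 = count / len(d1)
        supp2 = c2.get(pattern, 0) / len(d2)
        growth = float("inf") if supp2 == 0 else supp1 / supp2
        if growth >= rho:
            emerging[pattern] = growth
    return emerging

# Purely illustrative label sequences for two tiny "age groups":
young = [["I", "NC"], ["I", "NC"], ["DET", "NC"]]
older = [["DET", "NC", "CLS", "V"], ["P", "DET", "NC"], ["DET", "NC"]]
print(sorted(emerging_patterns(young, older)))   # [('I',), ('I', 'NC')]
```

In this toy run, the interjection patterns emerge in the younger group, which echoes the tendency discussed in Sect. 3.3.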
We are interested here in sequences of itemsets restricted to POS labels without any gap (thus corresponding to n-grams, or repeated segments of labels), under some constraints (such as having a support strictly greater than a given threshold or pruning non-closed patterns), to limit their number. To set the lengths of sequences, we took account of the average size of sentences. So, we have decided to select patterns of length between 1 and 10. The minsup threshold is set to 2 and ρ = 1.001. To find the emerging patterns of a certain age group, we do as [QCCL12] did for literary genres: each age group (D1 ) is compared to the set of every other age groups (D2 ). 5.2 Quantitative Results Figure 7 shows the number of frequent and emerging patterns obtained under our constraints for each age group. For example, for the age of 4-5 years, there are 1933 frequent patterns but only 842 emerging ones (42.6%). A serious reduction has occurred, which will make the observation easier. The number of emerging patterns is relatively stable across ages from 3-4 years and is important in each age group. As these emerging patterns are defined relatively to every other age group, this suggests the existence for each age group of characteristic phases of grammatical acquisitions. Figure 8 shows the average size of the frequent and emerging patterns for each age group. The curves are very similar, suggesting that emerging patterns have properties which are similar to frequent patterns. In both cases, the length is increasing and reaches its maximum at the age of 5-6 years old. This parameter seems very correlated to the one of sentence length (see Figure 1): not only utterances become longer as the children grow up, but also the grammatical patterns they instantiate. Figures 9 and 10 show the distributions of the main morpho-syntactic la- bels in frequent and emerging patterns respectively for each age group. These results are consistent with those obtained on the entire corpus (cf. Figure 5). 138 I. Tellier, Z. Makhlouf and Y. Dupont Fig. 7. Number of frequent versus emerging patterns for each age group Fig. 8. Average length of frequent versus emerging patterns for each age group The proportion of interjections still regularly decreases, while the one of prepo- sitions increases, which is consistent with syntactic constructions of increasing complexity. We also note that the CLS and V curves are parallel and that, before the age of 4 years, the NC label is very frequent without being associated with the label DET. These curves show that the proportions of labels in the frequent and emerging patterns of each age group are similar to those of the corpus. In this sense, these patterns seem to be representative of the different age groups. 5.3 Qualitative Results The table of Figure 11 provides examples of emerging patterns of each age group, and some corresponding sentences. These examples show that a single pattern can correspond to various sentences, and that they have increasing complexity. We note that even before the age of 2, children can produce sentences with a CN preceded by a DET. We also note, for example, that the patterns "(DET) (NC)" and "(DET) (NC) (CLS) (V) (VINF)" respectively extracted of the age Sequential Patterns of POS Labels Help to Characterize Language Acquisition 139 Fig. 9. Proportions of distinct labels in frequent patterns for each age group Fig. 10. 
5.2 Quantitative Results

Figure 7 shows the number of frequent and emerging patterns obtained under our constraints for each age group. For example, for the 4-5 years group, there are 1933 frequent patterns but only 842 emerging ones (42.6%). A substantial reduction thus occurs, which makes observation easier. The number of emerging patterns is relatively stable across age groups from 3-4 years onward and remains large in each group. As these emerging patterns are defined relatively to every other age group, this suggests the existence, for each age group, of characteristic phases of grammatical acquisition.

Fig. 7. Number of frequent versus emerging patterns for each age group

Figure 8 shows the average size of the frequent and emerging patterns for each age group. The curves are very similar, suggesting that emerging patterns have properties similar to those of frequent patterns. In both cases, the length increases and reaches its maximum at the age of 5-6 years. This parameter seems strongly correlated with sentence length (see Figure 1): not only do utterances become longer as the children grow up, but so do the grammatical patterns they instantiate.

Fig. 8. Average length of frequent versus emerging patterns for each age group

Figures 9 and 10 show the distributions of the main morpho-syntactic labels in frequent and emerging patterns respectively, for each age group. These results are consistent with those obtained on the entire corpus (cf. Figure 5). The proportion of interjections still decreases steadily, while that of prepositions increases, which is consistent with syntactic constructions of increasing complexity. We also note that the CLS and V curves are parallel and that, before the age of 4 years, the NC label is very frequent without being associated with the label DET. These curves show that the proportions of labels in the frequent and emerging patterns of each age group are similar to those of the corpus. In this sense, these patterns seem to be representative of the different age groups.

Fig. 9. Proportions of distinct labels in frequent patterns for each age group

Fig. 10. Proportions of distinct labels in emerging patterns for each age group

5.3 Qualitative Results

The table of Figure 11 provides examples of emerging patterns for each age group, together with some corresponding sentences. These examples show that a single pattern can correspond to various sentences, and that the patterns have increasing complexity. We note that even before the age of 2, children can produce sentences with an NC preceded by a DET. We also note, for example, that the patterns "(DET) (NC)" and "(DET) (NC) (CLS) (V) (VINF)", respectively extracted from the age groups "1-2 years" and "4-5 years", are included in "(P) (DET) (NC)" and "(DET) (NC) (CLS) (V) (VINF) (DET) (NC)" respectively, extracted from the following age group. This is consistent with a gradual acquisition of complex syntactic constructions.

1-2 years
  (P) (NC): à maman ("to mom"); sac à dos ("backpack")
  (DET) (NC): le ballon ("the ball"); des abeilles ("some bees")
2-3 years
  (P) (DET) (NC): de la tarte ("some pie"); poissons dans l'eau ("fishes in the water")
  (ADVWH) (CLS) (V): où il est ? ("where it is?"); comment il marche ? ("how it works?")
3-4 years
  (ADV) (CLS) (V): non il est par terre ("no it is on the floor"); ici il pourra passer ("here it will be able to pass")
4-5 years
  (ADV) (CLS) (CLO) (V): alors tu m'as vue ? ("so you saw me?"); oui j'en fais souvent ("yes I do some often")
  (DET) (NC) (CLS) (V) (VINF): les lapins ils vont rentrer ("the rabbits they will come in"); le chat il veut attraper l'oiseau ("the cat it wants to catch the bird")
5-6 years
  (DET) (NC) (CLS) (V) (VINF) (DET) (NC): l'enfant il va chercher le chat ("the child he goes and fetches the cat"); le monsieur il va chercher les cerises ("the man he goes and fetches the cherries")
  (CC) (DET) (NC) (CLS) (V) (DET) (NC): la maman et le papa ils regardaient le garçon ("the mommy and the daddy they watched the boy"); et le chat il mange les cerises ("and the cat it eats the cherries")
6-7 years
  (P) (VINF) (DET) (NC): les oiseaux les aident à ramasser les cerises ("the birds help them to pick up the cherries"); il y a un chat qui essaie de chasser des oiseaux ("there is a cat trying to catch birds")
  (DET) (NC) (PROPEL) (V) (DET) (NC) (P) (DET) (NC): il y a un chat qui suit la fille avec son panier ("there is a cat which follows the girl with a basket"); et aussi un monsieur qui ramasse des cerises dans un arbre ("and a man picking up cherries in a tree")

Fig. 11. Examples of emerging patterns in each age group
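Since the patterns considered here are gap-free sequences of POS labels, the inclusion relation just observed between patterns of consecutive age groups is simply contiguous containment; the small sketch below (illustrative only, not part of the extraction pipeline) checks it for the two pairs mentioned above.

    def ngram_included(shorter, longer):
        """True if `shorter` appears as a contiguous run of labels in `longer`."""
        n = len(shorter)
        return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

    print(ngram_included(["DET", "NC"], ["P", "DET", "NC"]))                    # True
    print(ngram_included(["DET", "NC", "CLS", "V", "VINF"],
                         ["DET", "NC", "CLS", "V", "VINF", "DET", "NC"]))       # True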
6 Conclusion

In this article, we have applied techniques from Natural Language Processing, machine learning and sequential Data Mining to study the evolution of children's utterances at different ages. The phase of morpho-syntactic labeling required the learning of a specific tagger, adapted to our data. This was a necessity, considering that currently available taggers do not properly handle oral transcriptions, and even less those of children: interjections, for example, which are very specific to oral productions, would have been poorly recognized without re-learning. This is crucial, as the curves of label proportions show that their frequency is an important means of characterizing a child's age group.

We have so far restricted our research to n-grams of POS labels, but further work could use richer itemsets of the type (word, lemma, POS tag). Our exploration seems to confirm that the extracted emerging patterns are representative of the age group in which they arise. The provided examples further confirm the intuition that (at least some of) the patterns of successive age groups are included in one another, pointing toward increasing grammatical sophistication. As far as we know, these kinds of analyses had never been performed before. Of course, a detailed analysis of the patterns obtained remains to be done by specialists of language acquisition. They could, for example, allow typical evolutions of grammatical knowledge to be characterized, or help to diagnose pathological evolution of a child's productions. We hope that they will provide valuable tools for the study of language acquisition phases.

7 Acknowledgment

This work is supported by a public grant overseen by the French National Research Agency (ANR) as part of the "Investissements d'Avenir" program (reference: ANR-10-LABX-0083). The authors thank Christophe Parisse for his advice.

References

[ACT03] A. Abeillé, L. Clément, and F. Toussenel. Building a treebank for French. In A. Abeillé, editor, Treebanks. Kluwer, Dordrecht, 2003.
[Ali10] A. Alishahi. Computational Modeling of Human Language Acquisition (Synthesis Lectures on Human Language Technologies). San Rafael: Morgan and Claypool Publishers, 2010.
[AS95] R. Agrawal and R. Srikant. Mining sequential patterns. In Proceedings of the International Conference on Data Engineering. IEEE, 1995.
[BCCC12] N. Béchet, P. Cellier, T. Charnois, and B. Crémilleux. Discovering linguistic patterns using sequence mining. In Proceedings of CICLing 2012, pages 154–165, 2012.
[Bro73] R. W. Brown. A First Language: The Early Stages. Harvard University Press, Cambridge, Massachusetts, 1973.
[CC08] B. Crabbé and M. H. Candito. Expériences d'analyse syntaxique statistique du français. In Actes de TALN'08, 2008.
[CCP10] P. Cellier, T. Charnois, and M. Plantevit. Sequential patterns to discover and characterise biological relations. In A. Gelbukh, editor, CICLing 2010, LNCS, vol. 6008, pages 537–548, 2010.
[CM06] N. Chater and C. D. Manning. Probabilistic models of language processing and acquisition. Trends in Cognitive Sciences, 10(7), pages 335–344, 2006.
[CPRC09] T. Charnois, M. Plantevit, C. Rigotti, and B. Crémilleux. Fouille de données séquentielles pour l'extraction d'information. Traitement Automatique des Langues, 50(3), 2009.
[CSL04] L. Clément, B. Sagot, and B. Lang. Morphology based automatic acquisition of large-coverage lexica. In LREC 2004, Lisbon, 2004.
[DL99] G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. In Proceedings of SIGKDD'99, 1999.
[Elm01] J. Elman. Connectionism and language acquisition. In Essential Readings in Language Acquisition. Oxford: Blackwell, 2001.
[LCY10] Thomas Lavergne, Olivier Cappé, and François Yvon. Practical very large scale CRFs. In Proceedings of ACL 2010, pages 504–513. Association for Computational Linguistics, July 2010.
[LMP01] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 282–289, 2001.
[MC81] J. F. Miller and R. S. Chapman. The relation between age and mean length of utterance in morphemes. Journal of Speech and Hearing Research, 24, pages 154–161, 1981.
[NAFS13] D. Nouvel, J.-Y. Antoine, N. Friburger, and A. Soulet. Fouille de règles d'annotation partielles pour la reconnaissance d'entités nommées. In TALN'13, pages 421–434, 2013.
[NR07] M. Nanni and C. Rigotti. Extracting trees of quantitative serial episodes. In Proceedings of KDID'07, pages 170–188, 2007.
[PHMA+01] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu. PrefixSpan: Mining sequential patterns by prefix-projected growth. In ICDE, IEEE Computer Society, pages 215–224, 2001.
[QCCL12] S. Quiniou, P. Cellier, T. Charnois, and D. Legallois. Fouille de données pour la stylistique : cas des motifs séquentiels émergents. In Proceedings of the 11th International Conference on the Statistical Analysis of Textual Data, Liège, pages 821–833, 2012.
[SA96] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In EDBT 1996, LNCS, vol. 1057, pages 3–17, 1996.
[Sal86] A. Salem. Segments répétés et analyse statistique des données textuelles. Histoire & Mesure, 1(2), pages 5–28, 1986.
[Sch94] Helmut Schmid. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44–49, 1994.
[TDE+12] I. Tellier, D. Duchier, I. Eshkol, A. Courmet, and M. Martinet. Apprentissage automatique d'un chunker pour le français. In Actes de TALN'12, short paper (poster), 2012.
[TDEW13] I. Tellier, Y. Dupont, I. Eshkol, and I. Wang. Adapt a text-oriented chunker for oral data: How much manual effort is necessary? In The 14th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2013), Special Session on Text Data Learning, LNAI, Hefei, China, 2013.
[TTA09] Y. Tsuruoka, J. Tsujii, and S. Ananiadou. Fast full parsing by linear-chain conditional random fields. In Proceedings of EACL 2009, pages 790–798, 2009.
[WH04] J. Wang and J. Han. BIDE: Efficient mining of frequent closed sequences. In ICDE, IEEE Computer Society, pages 79–90, 2004.
[YHA03] X. Yan, J. Han, and R. Afshar. Mining closed sequential patterns in large databases. In SDM, SIAM, 2003.
[Zak01] M. J. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 42(1/2), pages 31–60, 2001.

RegExpMiner: Automatically discovering frequently matching regular expressions

Julien Rabatel1, Jérôme Azé1, Pascal Poncelet1, and Mathieu Roche1,2
1 LIRMM, CNRS UMR 5506, Univ. Montpellier 2, 34095 Montpellier, France
2 UMR TETIS - Cirad, Irstea, AgroParisTech, 34093 Montpellier, France

Regular expressions (REs) are a very powerful and popular tool for manipulating string data in a variety of applications. They are used as search templates to look for the occurrences of a given piece of text in a document, to define how a given piece of text should be formatted in order to be valid (e.g., to check that the value entered in an email field of a Web form is correctly formatted), or even to help solve more complex NLP tasks [NS07]. Their popularity in these various application domains arises from several reasons. First, they are easy to understand and manipulate for common usages, despite their wide expressiveness and power of abstraction. Second, they are natively usable within a large variety of programming languages, which makes them easy to integrate into any project addressing text processing tasks. Their usage, however, often relies on a very limited number of hand-crafted REs: it is indeed difficult to automatically obtain the REs matching a given set of strings for which no a priori knowledge about their underlying formatting rules is available. Such an automatic discovery of REs would nonetheless offer some very interesting prospects. Regular expressions indeed have an interesting abstraction power, as they are able to provide information about how textual content is formatted rather than focusing on the actual sequences of characters. Having a more abstract description space for describing textual content then offers new insights.
For instance, one application scenario is data cleaning. Given a database containing some textual content about entities (e.g., addresses, names, phone numbers, etc.), one may be interested in finding values in the database that result from mistakes by the people who entered them. Such typos and formatting mistakes can easily be highlighted if they result in strings that do not match the same regular expressions as the majority of the other strings.

While regular expressions can be seen as interesting descriptors of textual data for various NLP and machine learning tasks, they are hard to obtain. The literature does not offer fully satisfactory solutions when one wishes to enumerate REs describing a given set of strings. Regular expression learning [Fer05], for instance, consists in building a single regular expression matching a given set of positive string examples. Such approaches typically do not allow exceptions w.r.t. the set of strings to be matched, and hence lose their interest as soon as the input data are noisy. Additionally, only one RE is learned, while one can expect to obtain several REs reflecting the different templates that co-exist in the data. For example, one cannot expect all the values of a list of international ZIP codes to conform to a single template, as each country may use a different one. Constructing one single RE matching all of them will often lead to an over-generalization of the underlying templates that makes the obtained RE irrelevant in practical applications. On the other hand, the sequence mining literature, when applied to string data, offers the possibility to discover a greater variety of templates via frequent patterns, i.e., data fragments occurring in a sufficient number of strings. While this general principle addresses the above-mentioned problems of RE learning approaches, the types of extracted patterns (e.g., sequential patterns [AS95], episodes [MTV97]) are typically much less expressive than REs. Some efforts have been made to allow the generalization of sequence elements [PLL+10], but the extracted sequential patterns have little in common with REs, as they only aim at discovering sequence elements that are frequently found in the same order.

We propose an approach for extracting regular expressions in the form of frequent patterns in textual data. To this end, we define a relevant pattern language that offers some interesting algorithmic properties. While we do not aim at exploiting all the characteristics and expressiveness of the RE language, we focus on providing a preliminary approach that keeps some of its main features. In particular, we fully consider the problem of allowing the generalization of characters via the use of predefined character classes, commonly used in REs3. Another aspect that this approach takes into account is the repetition of characters in strings. For instance, we assume that the strings "012" and "9876543" should both be generalizable to the RE /[0-9]+/, i.e., a list of consecutive digit characters, even though they do not contain the same digits nor the same number of digits.
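As an illustration of this generalization step and of the data-cleaning scenario above, the following sketch maps each character to a predefined character class and collapses consecutive repetitions, so that "012" and "9876543" both abstract to [0-9]+; values whose abstraction differs from the majority one are flagged. The class inventory and the function names are illustrative assumptions, not the actual RegExpMiner implementation.

    import re
    from collections import Counter
    from itertools import groupby

    CLASSES = [("[0-9]", re.compile(r"[0-9]")),
               ("[a-z]", re.compile(r"[a-z]")),
               ("[A-Z]", re.compile(r"[A-Z]"))]

    def char_class(ch):
        """Name of the first predefined class matching `ch`, else the literal."""
        for name, rx in CLASSES:
            if rx.fullmatch(ch):
                return name
        return re.escape(ch)

    def generalize(string):
        """Abstract a string into a simple RE over character classes,
        marking runs of the same class with '+'."""
        classes = [char_class(ch) for ch in string]
        return "".join(name + ("+" if len(list(run)) > 1 else "")
                       for name, run in groupby(classes))

    def outliers(values):
        """Values whose abstraction differs from the majority abstraction."""
        patterns = {v: generalize(v) for v in values}
        majority, _ = Counter(patterns.values()).most_common(1)[0]
        return [v for v, p in patterns.items() if p != majority]

    print(generalize("012"), generalize("9876543"))   # [0-9]+ [0-9]+
    print(outliers(["75001", "34095", "3409S"]))      # ['3409S']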
We define the frequent regular expression pattern mining problem by providing a theoretical framework linking together the RE and sequence mining worlds, and we highlight some properties that, while inspired by known properties in sequence mining, are specific to the problem we study; we then employ these properties to design the RegExpMiner algorithm to mine such patterns.

References

[AS95] Rakesh Agrawal and Ramakrishnan Srikant. Mining sequential patterns. In Proceedings of the Eleventh International Conference on Data Engineering, pages 3–14. IEEE, 1995.
[Fer05] Henning Fernau. Algorithms for learning regular expressions. In Algorithmic Learning Theory, pages 297–311. Springer, 2005.
[MTV97] Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3):259–289, 1997.
[NS07] David Nadeau and Satoshi Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26, 2007.
[PLL+10] Marc Plantevit, Anne Laurent, Dominique Laurent, Maguelonne Teisseire, and Yeow Wei Choong. Mining multidimensional and multilevel sequential patterns. ACM Transactions on Knowledge Discovery from Data, 4(1), 2010.

3 Character classes are sets of characters. When tested against a string, a character class matches any of the characters it contains. For instance, the character class [0-9] contains all the digit characters 0, 1, ..., 9, which allows it to match strings such as "3" or "8", but not "A".

NLP-based Feature Extraction for Automated Tweet Classification

Anna Stavrianou, Caroline Brun, Tomi Silander, Claude Roux
Xerox Research Centre Europe, Meylan, France
Name.surname@xrce.xerox.com

1 Introduction

Traditional NLP techniques cannot, on their own, deal with Twitter text, which often does not follow basic syntactic rules. We show that hybrid methods can result in a more effective analysis of Twitter posts. Tweets regarding politicians have been annotated with two categories: the opinion polarity and the topic (10 predefined topics). Our contribution is the automated classification of political tweets.

2 Combination of NLP and Machine Learning Techniques

Initially we used our syntactic parser [1], which has given high results on opinion mining when applied to product reviews [2] or to the SemEval 2014 Sentiment Analysis Task [3]. However, when applied to Twitter posts, the results were not satisfactory. Thus, we use a hybrid method and combine the knowledge given by our parser with machine learning. Linguistic information has been extracted from every annotated tweet. We have used features such as bags of words, bigrams, decomposed hashtags, negation, opinions, etc. The "liblinear" library (http://www.csie.ntu.edu.tw/~cjlin/liblinear/) was used to classify tweets. We used a logistic regression classifier (with L2-regularization), where each class c has a separate vector of weights over all the input features. More formally, P(c | x) ∝ exp(Σ_i w_{c,i} x_i), where x_i is the i-th feature and w_{c,i} is its weight in class c. When learning the model, we try to find the weight vectors that maximize the product of the class probabilities on the training data. Our objective has been to identify the optimal combination of features that yields good prediction results while avoiding overfitting.
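As a hedged illustration of this set-up, the sketch below uses scikit-learn's wrapper around liblinear rather than the library directly; the feature extraction is reduced to bag of words and bigrams, and the toy tweets and topic labels are placeholders, not the annotated corpus used in this work.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Placeholder data; the real experiments use the annotated political tweets
    # and select models with 10-fold cross-validation on the training split.
    tweets = ["great speech on employment", "taxes are too high again",
              "new education reform announced", "employment numbers look bad"]
    topics = ["ECONOMY", "ECONOMY", "EDUCATION", "ECONOMY"]

    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 2)),                   # bag of words + bigrams
        LogisticRegression(penalty="l2", solver="liblinear"),  # L2-regularized, liblinear backend
    )
    model.fit(tweets, topics)
    print(model.predict(["school reform debate"]))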
Some of the features used are:
– Snippets: during annotation, we kept track of the snippets that explained why the annotator tagged the post with a specific topic or polarity;
– Hashtags: decomposition techniques have been applied to hashtags, which are then analyzed by an opinion detection system that extracts the semantic information they carry [4].

We selected the models using 10-fold cross-validation on the training data and evaluated them by their accuracy on the test data. For the topic-category task (6,142 tweets, 80% used for training), the annotation had an inter-annotator agreement below 0.4, which shows the difficulty of the task. Table 1 shows the results when NLP features are used, as well as when some semantic merging of classes takes place.

Table 1. Cross-validation and prediction accuracies (%) for topic classification.

Features                   Cross-validation   Prediction
NLP features               44.38              29.37
NLP features + merging     48.91              34.17

Binary classification was applied to improve the results. We selected the class with the highest distribution and annotated the dataset with CLASS1 and NOT_CLASS1 tags. We created one model for the prediction of CLASS1, one for the prediction of CLASS2, and one model for the prediction of the remaining 8 classes. Merging these models gave an accuracy of 40.03%, higher than the best accuracy of Table 1.

Table 2. Binary classification results (cross-validation and prediction accuracies, %) for the topic task.

Model                                        Cross-validation   Prediction
CLASS1/NOT_CLASS1                            85.28              62.57
CLASS2/NOT_CLASS2 (CLASS1 removed)           92.10              68.42
The rest of the classes (CLASS2 removed)     49.58              38.24
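A small sketch of the cascading logic described above, assuming three already-fitted classifiers model1 (CLASS1 vs NOT_CLASS1), model2 (CLASS2 vs NOT_CLASS2) and model_rest (the remaining 8 classes); the names and label strings are illustrative assumptions.

    def cascade_predict(tweet, model1, model2, model_rest):
        """Route a tweet through the binary models in turn, falling back to the
        model trained on the remaining classes."""
        if model1.predict([tweet])[0] == "CLASS1":
            return "CLASS1"
        if model2.predict([tweet])[0] == "CLASS2":
            return "CLASS2"
        return model_rest.predict([tweet])[0]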
For the opinion polarity task (5,754 tweets, 80% used for training), the inter-annotator agreement was higher (~0.8). As Table 3 shows, we have used NLP features not only from the tweet but also from the "snippet". The "syntactic analysis" is the opinion tag given by our opinion analyser.

Table 3. Cross-validation and prediction accuracies (%) for the opinion polarities; values in parentheses use the syntactic analysis of opinion.

Features                                        Cross-validation   Prediction
NLP features (syntactic analysis of opinion)    61.28 (62.13)      56.77 (56.6)
NLP features of snippet (syntactic analysis)    66.41 (67.99)      61.2 (61.46)

In conclusion, in this paper we provide a model that predicts opinions and topics for tweets in the political context. More research on feature analysis will be carried out. We also plan to add more features produced by our syntactic analyzer, such as POS tags or tense. We should also consider multiple-class labelling.

3 Acknowledgements

This work was partially funded by the project ImagiWeb ANR-2012-CORD-002-01.

4 References

1. Ait-Mokthar, S., Chanod, J.P.: Robustness beyond Shallowness: Incremental Dependency Parsing. NLE Journal, 2002.
2. Brun, C.: Learning opinionated patterns for contextual opinion detection. COLING 2012.
3. Brun, C., Popa, D., Roux, C.: XRCE: Hybrid Classification for Aspect-based Sentiment Analysis. In International Workshop on Semantic Evaluation (SemEval), 2014 (to appear).
4. Brun, C., Roux, C.: Décomposition des « hash tags » pour l'amélioration de la classification en polarité des « tweets ». In TALN, July 2014.

Author Index

Alfarone, Daniele 1
Azé, Jérôme 143
Bancken, Wouter 1
Bethard, Steven 33
Boella, Guido 17
Brun, Caroline 145
Coulet, Adrien 81
Davis, Jesse 1
Di Caro, Luigi 17
Do, Quynh Ngoc Thi 33
Dupont, Yoann 129
Gabriel, Alexander 49
Haralambous, Yannis 65
Hassan, Mohsen 81
Ivek, Ivan 97
Janssen, Frederik 49
Lenca, Philippe 65
Makhlouf, Zineb 129
Moens, Marie-Francine 33
Paulheim, Heiko 49
Poncelet, Pascal 143
Pudi, Vikram 113
Rabatel, Julien 143
Rani, Pratibha 113
Roche, Mathieu 143
Roux, Claude 145
Sharma, Dipti 113
Silander, Tomi 145
Stavrianou, Anna 145
Tellier, Isabelle 129
Toussaint, Yannick 81