<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Discovering Knowledge through Multi-modal Association Rule Mining for Document Image Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michelangelo Ceci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Corrado Loglisci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lynn Rudd</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Donato Malerba</string-name>
          <email>donato.malerba@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Informatica, Università degli Studi di Bari "Aldo Moro", Via Orabona 4</institution>
          ,
          <addr-line>70125, Bari</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper introduces a descriptive data mining method to discover knowledge for the task of automatic categorization in document image analysis. We argue that a document image is a multi-modal unit of analysis whose semantics is deduced from a combination of textual content, layout structure and logical structure. The method therefore considers different modalities of document representation simultaneously, and hence different types of information: spatial information derived from a complex document image analysis process (layout analysis), information extracted from the logical structure of the document (by means of document image classification and understanding) and textual information extracted by means of OCR. The proposed method is based on a relational data mining approach to discover association rules, where the relational setting is justified by its appropriateness for analyzing data available in more than one modality. Experimental results on a real-world dataset are reported.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Document image analysis is the subfield of digital image processing that aims to
convert document images to symbolic form for modification, storage, retrieval,
reuse, and transmission. However, technologies and systems demand a large
amount of domain-specific knowledge in order to properly process document
images, as well as to automatically catalog and organize them [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Hand-coding
the necessary knowledge for document images with varied layout, such as those
processed in real applications, is prohibitive. For this reason, there has been a
growing interest in the application of data mining techniques in order to extract
the required knowledge from document images [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Text mining is a
data mining technology that is attracting interest for its ability to
analyze large collections of unstructured documents and extract interesting and
non-trivial patterns or knowledge. Knowledge discovered from textual documents
can take various forms, including classification rules, which partition document
collections into a given set of classes [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], clusters of similar documents or of objects
composing documents [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], patterns describing trends, such as emerging topics
in a corpus of time-stamped documents [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], and ranking models used in the
problems of reading order detection and document summarization [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>Text mining and document image analysis have always been considered two
complementary technologies: the former is appropriate for documents that are
generated according to some textual format, while the latter is applicable to
documents available on paper media. Document image mining aims to identify
high-level spatial objects and relationships, while text mining is more concerned
with patterns involving words, sentences and concepts. The possible
interactions between spatial information extracted from document images and textual
information, related to the content of some layout components, have never been
considered in the data mining literature.</p>
      <p>
        In this paper, we propose to extract the required knowledge in the form
of association rules, which have been successfully applied in both the business
and scientific realms. Some examples of applications to images have also been
reported in the literature. In particular, Ordonez and Omiecinski [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] propose
to mine association rules in order to find similarities in images on the basis
of their content. The content is expressed in terms of objects returned by a
segmentation process. They show that even without domain knowledge it is
possible to automatically extract some reliable knowledge. Mined association
rules refer to the presence/absence of an object in an image, since images are
viewed as transactions, while objects are seen as items. No spatial relationship
between objects in the same image is considered. Moreover, the proposed data
mining method is unimodal, since it considers only one source of information
- visual - conveyed by images. In order to deal properly with the multi-modal
nature of documents we resort to relational data mining approaches [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which
permit the processing of data stored according to the relational data model. This
model permits us to: i) naturally represent different types of data in different
tables of a relational database; ii) relate collected data by means of foreign
key constraints; iii) properly represent spatial relationships (e.g., topological or
distance relationships) implicitly defined in the layout structure of a document.
Relational data mining approaches permit us to directly analyze data stored
in multiple database relations, thus preventing the information loss caused by the data
transformation known as "propositionalization" [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In a generic library or archive, this kind of pattern can be used in a number
of ways. First, discovered association rules can be used as constraints defining
domain templates of documents, both for classification tasks, such as in
associative classification approaches [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and to support layout correction tasks.
Second, the rules could also be used in a generative way. For instance, if a part
of the document is hidden or missing, strong association rules can be used to
predict the location of the missing layout/logical components [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Moreover, a
desirable property of a system that automatically generates textual documents
is to take into account the layout specification during the generation process,
since layout and wording generally interact [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Association rules can be useful
to define the layout specifications of such a system. Finally, this problem is also
related to document reformatting [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>The paper is organized as follows. The next section presents the background and
motivations of this work. The proposed solution is presented in Section
3. Finally, experiments are presented in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>Background and Motivations</title>
      <p>
        In tasks where the goal is to uncover structure in the data and where there is no
target concept, the discovery of relatively simple but frequently occurring
patterns is promising. Association rules are a basic example of this kind of setting.
The problem of mining association rules was originally introduced in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. An association rule can be expressed by an implication:
      </p>
      <p>
        X → Y,
where X and Y are sets of items called antecedent and consequent respectively
(X ∩ Y = ∅). The meaning of such rules is quite intuitive: given a database D of
transactions, where each transaction T ∈ D is a set of items, X → Y expresses
that whenever a transaction T contains X, then T probably also contains Y. The
conjunction X ∧ Y is called a pattern. Two parameters are usually reported for
association rules, namely the support, which estimates the probability p(X ⊆ T ∧
Y ⊆ T), and the confidence, which estimates the probability p(Y ⊆ T | X ⊆ T).
The goal of association rule mining is to find all the rules with support and
confidence exceeding user-specified thresholds, henceforth called minsup and
minconf respectively. A pattern X ∧ Y is large (or frequent) if its support
is greater than or equal to minsup. An association rule X → Y is strong if it
has a large support (i.e., X ∧ Y is large) and high confidence (i.e., its
confidence is at least minconf). It has
become clear that these rules can be successfully applied to a wide range of
domains, such as web access pattern discovery [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and mining data streams [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
As mentioned previously, an interesting application is discussed in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]
where a method for mining knowledge from images is proposed.
      </p>
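      <p>As a minimal illustration of the two measures defined above, the following Python sketch computes support and confidence over a toy transaction database and checks the "strong rule" condition. The transactions, the item names and the helper names (support, confidence, is_strong) are assumptions introduced only for this example; they are not taken from the original work or from any mining system.</p>
      <preformat>
```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(db: List[Transaction], items: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `items`."""
    wanted = frozenset(items)
    if not db:
        return 0.0
    return sum(1 for t in db if wanted.issubset(t)) / len(db)


def confidence(db, antecedent, consequent) -> float:
    """Estimate p(Y in T | X in T) as support(X and Y) / support(X)."""
    sup_x = support(db, antecedent)
    if sup_x == 0.0:
        return 0.0
    return support(db, set(antecedent) | set(consequent)) / sup_x


def is_strong(db, antecedent, consequent, minsup, minconf) -> bool:
    """A rule is strong when its pattern is frequent and its confidence is high."""
    sup = support(db, set(antecedent) | set(consequent))
    return sup >= minsup and confidence(db, antecedent, consequent) >= minconf


# Hypothetical toy database: each transaction lists labels found on one page.
db = [
    frozenset({"title", "author", "abstract"}),
    frozenset({"title", "author"}),
    frozenset({"title", "abstract"}),
    frozenset({"paragraph", "figure"}),
]

print(support(db, {"title", "abstract"}))       # 2 of 4 transactions
print(confidence(db, {"title"}, {"abstract"}))  # 2 of the 3 transactions with "title"
```
</preformat>
      <p>In this toy setting the rule title → abstract is strong for minsup = 0.5 and minconf = 0.6, but not once minsup is raised above 0.5, which shows how the two thresholds prune candidate rules.</p>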
      <p>Nevertheless, mining patterns from document images raises many different
issues regarding document structure, storage, access and processing. Firstly,
documents are typically unstructured or, at most, semi-structured data. In the
case of structured data, the associated semantics or meaning is unambiguously
and implicitly defined and encapsulated in the structure of the data (e.g., relational
databases), whereas the meaning of unstructured information is only loosely implied
by its form and requires several interpretation steps in order to extract the
intended meaning. Endowing documents with a structure that properly encodes
their semantics adds a degree of complexity to the application of the mining
process. This makes the data pre-processing step crucial.</p>
      <p>
        Secondly, documents are message conveyors whose meaning is deduced from
the combination of, at least, the written text, the presentation style, the context, the
reported pictures and the logical structure. For instance, when the
logical structure and the presentation style are quite well-defined (typically when
some parts are pre-printed or documents are generated by following a
predefined formatting style), the reader may easily identify the document type and
locate information of interest even before reading the descriptive text (e.g., the
title of a paper in the case of scientific papers or newspapers, the sender in the
case of faxes, the supplier or the amount in the case of invoices, etc.). Moreover,
in many contexts, illustrative images fully complement the textual information,
such as diagrams in socio-economic or marketing reports. By considering
typeface information, it is also possible to immediately and clearly capture
the historical origin of documents (e.g., medieval), as well as the cultural origin
(e.g., Arabic). The presence of spurious objects may inherently define classes of
documents, such as revenue stamps in the case of legal documents. The idea of
considering the multi-modal nature of documents falls into the novel research
trend of the document understanding field, which encourages the development of
hybrid strategies for knowledge capture, in order to exploit the different sources
of knowledge (e.g., text, images, layout, type style, tabular and format
information) that simultaneously define the semantics of a document [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        However, data mining has evolved following a unimodal scheme instantiated
according to the type of the underlying data (text, images, etc). Applications of
data mining involving hybrid knowledge representation models are still to be
explored. Indeed, several works have been proposed to mine association rules from
the textual dimension [
        <xref ref-type="bibr" rid="ref4">4</xref>
          ] with the goal of finding rules that express regularities
concerning the presence of particular words or particular sentences in text
corpora. Conversely, mining the combination of structure and content dimensions
of documents has not been investigated yet in the literature, even though some
emerging real-world applications require mining processes able to exploit several
forms of information, such as images and captions in addition to full text [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
      <p>Thirdly, documents are a kind of data that do not match the classical
attribute-value format. In the tabular model, data are represented as fixed-length vectors
of variable values describing properties, where each variable can have only a
single, primitive value. Conversely, the entities (e.g., the objects composing a
document image) that are observed and about which information is collected
may have different properties, which can be properly modeled by as many data
tables (relational data model) as the number of object types. Moreover,
relationships (e.g., topological or distance relationships that are implicitly defined
by the location of objects spatially distributed in a document image or words
distributed in a text) among observed objects forming the same semantic unit can
also be explicitly modeled in a relational database by means of tables describing
the relationship. Hence, the classical attribute-value representation seems too
restrictive, and advanced approaches are necessary to both represent and reason
in the presence of multiple relations among data.</p>
      <p>Lastly, the data mining method should take into account external
information, also called expert or domain knowledge, that can add semantics to the
whole process and thereby provide high-level decision support and increase user confidence.
All these peculiarities make documents a kind of complex data that require
methodological evolutions of data mining technologies, as well as the
involvement of several document processing techniques. In our context, the extraction
of spatio-textual association rules requires the consideration of all these sources
of complexity deriving from the inherent nature of processed documents.</p>
    </sec>
    <sec id="sec-3">
      <title>Multi-modal Association Rule Mining</title>
      <p>
        In our proposal, the system used for processing documents is WISDOM++[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
WISDOM++<sup>1</sup> is a document analysis system that can transform textual paper
documents into XML format. This is performed in several steps. First, the image
is segmented into basic layout components (non-overlapping rectangular blocks
enclosing content portions). These layout components are classified according to
their type of content (e.g., text, graphics, and horizontal/vertical lines). Second,
a perceptual organization phase, called layout analysis, is performed to detect
structures among blocks. The result is a tree-like structure, named layout
structure, which represents the document layout at various levels of abstraction and
associates the content of a document with a hierarchy of layout components, such
as blocks, lines, and paragraphs. Third, the document image classification step
aims at identifying the membership class (or type) of a document (e.g.,
censorship decision, newspaper article, etc.), and it is performed using first-order
rules which can be automatically learned from a set of training examples.
      </p>
      <p>
        Document image understanding (or interpretation) creates a mapping of the
layout structure into the logical structure, which associates the content with a
hierarchy of logical components, such as the title and authors of a scientific article, or
the name of the censor in a censorship document. As previously pointed out, the
logical and the layout structures are strongly related. For instance, the title of
an article is usually located at the top of the first page of a document and it is
written with the largest character size used in the document. Document image
understanding also uses first-order rules [
        <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Once the logical and layout
structures have been mapped, OCR can be applied only to those textual components
of interest for the application domain, and the content can be stored for future
retrieval purposes. Thus, the system can not only determine the type of
document, but is also able to identify interesting parts of a document and extract
the information given in these parts together with its meaning. The result of the document
analysis is an XML document that makes the document image easily retrievable.
Once the layout/logical structure and the textual content of a document
have been extracted, association rules can be mined.
      </p>
      <p>
        The task of mining relational association rules is, in this work, solved by the
SPADA system [
        <xref ref-type="bibr" rid="ref16">16</xref>
          ]. It represents relational data à la Datalog, a logic
programming language with no function symbols, specifically designed to implement
deductive databases. Moreover, SPADA takes into account background knowledge
(BK) expressed in Prolog and handles background hierarchies over the objects
to be mined. In document image understanding, hierarchies can be naturally
defined, e.g., by considering the organization of the logical components (Figure 1).
Consequently, association rules involving more abstract objects are better supported
(although less precise), while association rules involving more specific objects have
higher confidence values (although lower support values). SPADA distinguishes
between the set S of reference (or target) objects, which are the main subject of
analysis, and the sets R<sub>k</sub>, 1 ≤ k ≤ m, of task-relevant (or non-target) objects,
which are related to the former and can contribute to accounting for the
variation. Each unit of analysis includes a distinct reference object and many related
task-relevant objects. Therefore, the description of a unit of analysis consists of
the properties of the included reference and task-relevant objects, as well as their
relationships.
<fn><label>1</label><p>www.di.uniba.it/~malerba/wisdom++/</p></fn>
      </p>
      <sec id="sec-3-1">
        <title>Document Description</title>
        <p>In the logic framework adopted by SPADA, a relational database is reduced
to a deductive database. Properties of both reference and task-relevant objects
are represented in the extensional part D<sub>E</sub>, while the domain knowledge is
expressed as a normal logic program which defines the intensional part D<sub>I</sub>. As an
example, we report a fragment of the extensional part of a deductive database
D, which describes multi-modal information that can be extracted from a
document image:
<preformat>block(b1). block(b2). ... height(b2,[11..54]). width(b1,[7..82]). ...
on_top(b2,b1). ... on_top(b2,b3). ... part_of(b1,p1). part_of(b2,p1). page_first(p1).
... abstract(b1). title(b2). ... text_in_abstract(b1,'base'). text_in_title(b2,'model'). ...</preformat>
In this example, b1 and b2 are two constants which denote distinct
layout components (reference objects), while p1 denotes a document page
(task-relevant object). The predicate block defines a layout component, part_of associates
a block with a document page, height and width describe geometrical properties of
layout components, on_top expresses a topological relationship between layout
components, page_first(p1) refers to the position of the page in the document,
abstract and title associate b1 and b2 with a logical label, and text_in_abstract and
text_in_title describe the textual content of the logical components.</p>
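        <p>To make the role of these ground facts more concrete, the sketch below stores the fragment above as plain Python tuples and evaluates one conjunctive query over it. This is only an illustration under stated assumptions: SPADA works on a Datalog representation, not on Python sets, the underscored predicate spellings (on_top, part_of, page_first, ...) are an assumed reading of the predicate names, and the helper names holds and title_above_abstract_on_first_page are hypothetical.</p>
        <preformat>
```python
# Ground facts as (predicate, arguments) pairs. The underscored predicate
# names are an assumed spelling of the predicates discussed in the text.
FACTS = {
    ("block", ("b1",)), ("block", ("b2",)),
    ("height", ("b2", "[11..54]")), ("width", ("b1", "[7..82]")),
    ("on_top", ("b2", "b1")), ("on_top", ("b2", "b3")),
    ("part_of", ("b1", "p1")), ("part_of", ("b2", "p1")),
    ("page_first", ("p1",)),
    ("abstract", ("b1",)), ("title", ("b2",)),
    ("text_in_abstract", ("b1", "base")), ("text_in_title", ("b2", "model")),
}


def holds(pred, *args):
    """True when the ground fact pred(args) is in the extensional database."""
    return (pred, args) in FACTS


def title_above_abstract_on_first_page():
    """Pairs (B, C): B is a title on a first page and lies on top of abstract C."""
    pairs = []
    for pred, args in sorted(FACTS):
        if pred != "on_top":
            continue
        upper, lower = args
        # Pages that contain the upper block, read from the part_of facts.
        pages = [a[1] for p, a in FACTS if p == "part_of" and a[0] == upper]
        on_first = any(holds("page_first", page) for page in pages)
        if holds("title", upper) and holds("abstract", lower) and on_first:
            pairs.append((upper, lower))
    return pairs


print(title_above_abstract_on_first_page())
```
</preformat>
        <p>The query combines a logical label (title, abstract), a spatial relation (on_top) and page information (part_of, page_first) in a single conjunction, which is the kind of multi-relational pattern that a single attribute-value table cannot express directly.</p>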
        <p>The complete list of predicates is reported in Table 1. The aspatial
feature type_of specifies the content type of a layout component (e.g., image, text,
horizontal line). Logical features are used to associate a logical label with a
layout object and depend on the specific domain. In the case of scientific papers
(considered in this work), possible logical labels are: affiliation, page number,
figure, caption, index term, running head, author, title, abstract, formulae,
subsection title, section title, biography, references, paragraph, table.</p>
        <p>Textual content is represented by means of another class of predicates, which
are true when the term reported as the second argument occurs in the layout
component denoted by the first argument. Terms are automatically extracted
by means of a text-processing module. All terms in the textual components
are tokenized and the set of obtained tokens is filtered in order to remove
punctuation marks, numbers and tokens of less than three characters. Standard
text preprocessing methods are used to: i) remove stop-words, such as articles,
adverbs, prepositions and other frequent words; ii) determine equivalent stems
(stemming), such as 'topolog' in the words 'topology' and 'topological', by means
of the well-known Porter's algorithm for English texts.</p>
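        <p>The preprocessing steps described above can be sketched as follows. This is a deliberately simplified illustration: the stop-word list is a tiny hypothetical sample, and naive_stem is a crude suffix stripper used as a stand-in, not Porter's algorithm. The function names preprocess and naive_stem are assumptions made for this example.</p>
        <preformat>
```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "and", "for", "with", "that", "this", "are", "from"}

# Crude suffix list used as a stand-in for stemming; NOT Porter's algorithm.
SUFFIXES = ("ical", "ies", "ing", "ed", "y", "s")


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, filter and stem a text, following the steps in the paragraph above."""
    # Keep only alphabetic runs: this drops punctuation marks and digits.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove tokens of less than three characters.
    tokens = [t for t in tokens if len(t) >= 3]
    # Remove stop-words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


print(preprocess("The topology and the topological maps, in 1996."))
```
</preformat>
        <p>With this simplification, 'topology' and 'topological' are both reduced to 'topolog', mirroring the example in the text; a production pipeline would replace naive_stem with a proper implementation of Porter's algorithm.</p>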
        <p>
          Only relevant tokens are used in textual predicates. They are selected by
maximizing the product maxTF × DF<sup>2</sup> × ICF [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], which assigns high scores to terms
appearing (possibly frequently) in a single logical component c and penalizes terms
common to other logical components. More formally, let c be a logical label
associated with a textual component. Let d be the bag of tokens in a component
labeled with c (after the tokenization, filtering and stemming steps), w a term
in d and TF<sub>d</sub>(w) the relative frequency of w in d. Then, the following statistics
can be computed:
<list list-type="order">
<list-item><p>the maximum value TF<sub>c</sub>(w) of TF<sub>d</sub>(w) over all logical components d labeled
with c;</p></list-item>
<list-item><p>the document frequency DF<sub>c</sub>(w), i.e., the percentage of logical components
labeled with c in which the term w occurs;</p></list-item>
<list-item><p>the category frequency CF<sub>c</sub>(w), i.e., the number of labels c′ ≠ c such that
w occurs in logical components labeled with c′.</p></list-item>
</list>
        </p>
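        <p>Then, the score v<sub>i</sub> associated with the term w<sub>i</sub> belonging to at least one of the
logical components labeled with c is:
<disp-formula><label>(1)</label><tex-math>v_i = TF_c(w_i) \times DF_c^2(w_i) \times \frac{1}{CF_c(w_i)}</tex-math></disp-formula></p>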
        <p>According to this function, it is possible to identify a ranked list of
"discriminative" terms for each of the possible labels. From this list, we select the
best n<sub>dict</sub> terms in Dict<sub>c</sub>, where n<sub>dict</sub> is a user-defined parameter. The textual
dimension of each logical component d labeled as c is represented in the
document description as a set of ground facts that express the presence of a term
w ∈ Dict<sub>c</sub> in the specified logical component.</p>
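        <p>The term selection described above can be sketched in Python as follows. The sketch computes the three statistics and the score of Equation (1) for every term of a label and keeps the best n_dict terms. The function name score_terms, the toy components and the tie-breaking rule (alphabetical order) are assumptions made for illustration. In addition, the text does not say how a term with category frequency zero is handled, so treating CF = 0 as 1 (no penalty) is an assumption of this sketch, and the document frequency is expressed as a fraction rather than a percentage, which does not change the ranking.</p>
        <preformat>
```python
from collections import Counter


def score_terms(components, label, n_dict):
    """Rank the terms of `label` by TF x DF^2 x 1/CF and keep the best n_dict.

    components: list of (label, tokens) pairs, one per logical component.
    """
    own = [Counter(toks) for lab, toks in components if lab == label and toks]
    if not own:
        return []
    vocabulary = set().union(*own)
    scores = {}
    for w in vocabulary:
        # Maximum relative frequency of w over the components labeled `label`.
        tf = max(c[w] / sum(c.values()) for c in own)
        # Fraction of components labeled `label` in which w occurs.
        df = sum(1 for c in own if w in c) / len(own)
        # Number of other labels whose components contain w.
        cf = len({lab for lab, toks in components if lab != label and w in toks})
        # Assumption: a term seen under no other label (cf == 0) gets no penalty.
        scores[w] = tf * df ** 2 / max(cf, 1)
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:n_dict]


# Hypothetical toy data: two title components and one abstract component.
components = [
    ("title", ["model", "image", "model"]),
    ("title", ["model", "rule"]),
    ("abstract", ["paper", "image", "rule"]),
]
print(score_terms(components, "title", 2))
```
</preformat>
        <p>In the toy data, 'model' occurs in every title and in no other label, so it ranks first, while 'image' is penalized both by its lower document frequency and by its occurrence in the abstract.</p>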
      </sec>
      <sec id="sec-3-2">
        <title>The Mining Step</title>
        <p>In the setting introduced so far, the problem of mining multi-modal association
rules can be formalized as follows:</p>
        <p>Given
<list list-type="bullet">
<list-item><p>a set S of reference objects (layout components);</p></list-item>
<list-item><p>some sets R<sub>k</sub>, 1 ≤ k ≤ m, of task-relevant objects (layout components related
to the reference objects);</p></list-item>
<list-item><p>a background knowledge BK which includes hierarchies H<sub>k</sub> on objects in
R<sub>k</sub>, M granularity levels in the descriptions (1 is the highest, while M is the
lowest) and a set of granularity assignments ψ<sub>k</sub> which associate each object
in H<sub>k</sub> with a granularity level (the hierarchical organization of the layout
components included in the task-relevant objects);</p></list-item>
<list-item><p>a pair of thresholds minsup[l] and minconf[l] for each granularity level l;</p></list-item>
<list-item><p>a language bias LB that constrains the search space.</p></list-item>
</list></p>
        <p>Find strong multi-level association rules, i.e., association rules involving
objects at different granularity levels.</p>
        <p>Hierarchies H<sub>k</sub> define is-a (i.e., taxonomic) relations on task-relevant objects.
Both the frequency of patterns and the strength of association rules depend on
the granularity level l at which patterns/rules describe data. Therefore, a pattern
P with support s at level l is frequent if s ≥ minsup[l] and all ancestors of P
with respect to H<sub>k</sub> are frequent at their corresponding levels. An association
rule Q → R with support s and confidence c at level l is strong if the pattern
Q ∪ R is frequent and c ≥ minconf[l].</p>
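        <p>The level-wise definitions above can be illustrated with a small Python sketch. It is a strong simplification of the actual setting: each pattern is identified by a single node of the hierarchy, the child-to-parent map covers only one branch, the level numbers are an assumed reading of Figure 1, and all support, confidence and threshold values are hypothetical. The names PARENT, LEVEL, is_frequent and is_strong are introduced only for this example.</p>
        <preformat>
```python
# One branch of the hierarchy of logical components (child -> parent).
# The level numbers are an assumed reading of the hierarchy in Figure 1.
PARENT = {"title": "identification", "identification": "heading", "heading": "article"}
LEVEL = {"article": 1, "heading": 2, "identification": 3, "title": 4}


def is_frequent(node, support, minsup):
    """A pattern is frequent if it and all its ancestors reach their level's minsup."""
    while node is not None:
        if minsup[LEVEL[node]] > support[node]:
            return False
        node = PARENT.get(node)  # None once the root has been checked
    return True


def is_strong(node, support, confidence, minsup, minconf):
    """A rule is strong if its pattern is frequent and c >= minconf at its level."""
    return is_frequent(node, support, minsup) and confidence[node] >= minconf[LEVEL[node]]


# Hypothetical supports and confidences of one pattern specialised per node.
support = {"article": 0.9, "heading": 0.6, "identification": 0.5, "title": 0.4}
confidence = {"article": 0.9, "heading": 0.7, "identification": 0.6, "title": 0.8}
minsup = {1: 0.8, 2: 0.5, 3: 0.4, 4: 0.3}
minconf = {1: 0.8, 2: 0.6, 3: 0.5, 4: 0.5}

print(is_frequent("title", support, minsup))
print(is_strong("title", support, confidence, minsup, minconf))
```
</preformat>
        <p>Raising minsup at level 2 above the support of the heading pattern makes the title pattern infrequent as well, even though its own support still exceeds the level-4 threshold; this is the ancestor condition stated in the definition.</p>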
        <p>The expressive power of first-order logic is exploited to specify both the
background knowledge BK, such as hierarchies and domain-specific knowledge,
and the language bias LB. The LB allows users to specify
their bias for interesting solutions, which is then exploited to improve
both the efficiency of the mining process and the quality of the discovered rules.
In SPADA, the language bias is expressed as a set of constraint specifications
for either patterns or association rules.</p>
        <preformat>article
+ heading
|  + identification (title, author, affiliation)
|  + synopsis (abstract, index term)
+ content
|  + final components (biography, references)
|  + body (section title, subsect title, paragraph, caption, figure, formulae, table)
+ page component
|  + running head
|  + page number
+ undefined</preformat>
        <p>Fig. 1. Hierarchy of logical components.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>We investigate the applicability of the proposed solution to real-world document
images. In particular, we have considered a dataset composed of 3,603 logical
components extracted from twenty-four multi-page documents, which are
scientific papers published as either regular or short papers in the IEEE Transactions on
Pattern Analysis and Machine Intelligence in the January and February 1996
issues. Each paper has a variable number of pages
and of layout components per page. A user labels some layout components of this
set of documents according to their logical meaning. Those layout components
with no clear logical meaning are labelled as undefined. All logical labels
belong to the lowest level of the hierarchy reported in the previous section. We
processed 217 document images in all. The number of features used to describe the
twenty-four documents presented to SPADA is 38,948, about 179 features for
each document page. The total number of logical components is partitioned as
follows: 23 for affiliation, 191 for page number, 357 for figure, 202 for caption, 26
for index term, 231 for running head, 28 for author, 26 for title, 25 for abstract,
333 for formulae, 65 for section title, 21 for biography, 45 for references, 968
for paragraph, 48 for table and 1014 for undefined. About 150 descriptors for
each document page have been extracted. To generate textual predicates we set
n<sub>dict</sub> = 50 and we considered the following logical components: title, abstract,
index term, references and running head. Hence, the following textual predicates
have been included in the document descriptions: text_in_title, text_in_abstract,
text_in_index_term, text_in_references, text_in_running_head. The total number of
extracted textual features is 1,681. The text was read with a commercial OCR.</p>
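      <p>The dataset figures reported above can be cross-checked with a few lines of Python. The counts are copied from the text; the variable names are hypothetical and the snippet is only a sanity check, not part of the experimental pipeline.</p>
      <preformat>
```python
# Number of logical components per label, as reported in the text.
label_counts = {
    "affiliation": 23, "page number": 191, "figure": 357, "caption": 202,
    "index term": 26, "running head": 231, "author": 28, "title": 26,
    "abstract": 25, "formulae": 333, "section title": 65, "biography": 21,
    "references": 45, "paragraph": 968, "table": 48, "undefined": 1014,
}

total_components = sum(label_counts.values())
features_per_page = 38948 / 217  # features divided by processed page images
largest = max(label_counts, key=label_counts.get)

print(total_components)
print(round(features_per_page, 1))
print(largest, label_counts[largest])
```
</preformat>
      <p>The per-label counts add up to the reported 3,603 components, and 38,948 features over 217 page images give roughly 179.5 features per page, consistent with the figure of about 179 quoted above.</p>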
      <p>In Table 2 we report a comprehensive view of the association rules mined for
each logical component at different granularity levels. SPADA finds associations
for all logical components. In particular, many spatial patterns involve logical
components on the first page of an article, such as affiliation, title, author,
abstract and index term. Indeed, the first page generally presents a more regular
layout structure and contains several distinct logical components. The situation
is different for references, where most of the rules involve textual predicates
because of the high frequency of discriminating terms (e.g., 'vol', 'ieee').</p>
      <p>An example of an association rule discovered by SPADA at the second
granularity level (L=2) is:
<preformat>is_a_block(A) → specialize(A, B), is_a(B, heading), on_top(B, C), C ≠ B, is_a(C, heading),
    text_in_component(C, paper) [supp: 38.46%, conf: 38.46%]</preformat></p>
      <p>This rule considers both spatial and textual properties. It is interpreted as
follows: 38.46% of the heading blocks are above another heading
block which contains the term 'paper'. Usually, this term occurs in the abstract
(a typical sentence is "In this paper ..."), which suggests that the heading block
C is an abstract, while B is another logical component that is usually above the
abstract (e.g., an author or title component). The percentage value of
38.46% indicates that 10 out of 26 layout components of type heading (Figure 1),
in the overall initial set of 3,603 layout components, match the association rule
reported above. Indeed, at a lower granularity level (L=4), SPADA discovers the
following rule:
<preformat>is_a_block(A) → specialize(A, B), is_a(B, title), on_top(B, C), C ≠ B, is_a(C, abstract),
    text_in_component(C, paper) [supp: 38.46%, conf: 38.46%]</preformat>
This rule has the same confidence and support as the rule inferred at
granularity level L=2. This means that all heading components represented
by B in the former rule (L=2) are titles. Another example of a discovered rule
is:
<preformat>is_a_block(A) → specialize(A, B), is_a(B, references), type_text(B), at_page_last(B)
    [supp: 46.66%, conf: 46.66%]</preformat>
which shows the use of the predicate at_page_last(B) introduced in the BK. This
is an example of a purely spatial association rule. The rules reported above have
only one predicate on the antecedent side. An example of a rule with several
predicates on the antecedent side is:</p>
      <p>is a block(A); specialize(A; B); is a(B; heading) ! specialize(A; C); is a(C; heading);
at page first(B): [supp: 100.0% conf: 100.0%],
which has been discovered at the granularity level L=2. It is characterized by
the highest value of support (since it matches 23 out 23 affiliation components)
and the highest value of con dence, which indicates a very strong implication
between the set of predicates on the antecedent and the set of predicates on
the consequent. This rule reports that whenever there is a heading block B
(antecedent), then there is another heading block C and block B is collocated on
the rst page. The information associated to this rule is probably too general,
since different types of components could satisfy it. To obtain more speci c
indications, we can explore the lower levels of granularity and use specialized
patterns. The rule reported above is re ned at the level L=4 with the following
is a block(A); specialize(A; B); is a(B; affiliation) ! specialize(A; C); is a(C; affiliation);
at page first(B): [supp: 100.0% conf: 100.0%],
which keeps the same statistical parameters and states that whenever there is
an affiliation block B, it is located on the first page and there is another
affiliation block C, which is typically the affiliation of a co-author.
Finally, an example of a pure textual association rule discovered by SPADA is:
is_a_block(A) → specialize(A, B), is_a(B, index_term), text_in_component(B, index):
[supp: 92.0%, conf: 92.0%],
which simply states that a logical component index_term contains the term
“index”.</p>
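      <p>The support and confidence figures attached to the rules above can be illustrated with a minimal Python sketch. It is a deliberately simplified, propositional stand-in for the relational setting and is not SPADA's implementation: the toy blocks, the property names and the helper functions covers and support_and_confidence are hypothetical and introduced here only for illustration. Assuming that support is the fraction of reference objects covered by the whole rule, and that confidence is the same count divided by the number of reference objects covered by the antecedent, the sketch suggests why a rule whose antecedent is only the key predicate is_a_block(A) has identical support and confidence, whereas a rule with a more selective antecedent may not.</p>
      <preformat>
```python
from fractions import Fraction

# Hypothetical toy data: each reference object (a block A) is a list of the
# components B it specializes to; each component is a set of properties.
blocks = [
    [{"references", "type_text", "at_page_last"}],
    [{"references", "type_text", "at_page_last"}, {"heading", "at_page_first"}],
    [{"heading", "at_page_first"}],
    [{"abstract", "type_text"}],
]


def covers(block, required):
    """Return True if one component of the block has all required properties.

    An empty requirement stands for the key predicate alone, which holds for
    every reference object.
    """
    if not required:
        return True
    return any(required.issubset(component) for component in block)


def support_and_confidence(blocks, antecedent, consequent):
    """Compute (support, confidence) of the rule: antecedent implies consequent."""
    n_total = len(blocks)
    n_antecedent = sum(1 for b in blocks if covers(b, antecedent))
    n_rule = sum(1 for b in blocks if covers(b, antecedent | consequent))
    support = Fraction(n_rule, n_total)
    confidence = Fraction(n_rule, n_antecedent) if n_antecedent else Fraction(0)
    return support, confidence


# Antecedent is only the key predicate: support and confidence coincide.
supp1, conf1 = support_and_confidence(
    blocks, set(), {"references", "type_text", "at_page_last"}
)

# A more selective antecedent: confidence can exceed support.
supp2, conf2 = support_and_confidence(blocks, {"heading"}, {"at_page_first"})

print(supp1, conf1)
print(supp2, conf2)
```
      </preformat>
      <p>In the toy data, the first rule is satisfied by two of the four blocks, so both measures take the same value. The second rule has the same support, but its confidence is higher because every block with a heading component also satisfies the consequent.</p>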
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>Automated extraction of knowledge patterns from document images can boost
the application of document analysis systems in various contexts. In this paper
we investigate a particular form of knowledge pattern, namely association rules,
which are useful for many knowledge-intensive tasks, such as document classification
and indexing, document reformatting and document reconstruction. The
proposed method is based on a multi-relational approach, in order to account for
the inherent multi-modality of documents, which convey layout, logical and
textual information. Moreover, knowledge patterns are extracted at various levels
of granularity. Experiments on a real-world dataset support the viability of the proposed approach.</p>
      <sec id="sec-5-1">
        <title>Acknowledgments</title>
        <p>This work partially fulfills the project "PON020056-33489339 PUGLIA@SERVICE
- Internet-based Service Engineering enabling Smart Territory structural
development" funded by the Italian Ministry of University and Research (MIUR) and
the project "MAESTRA - Learning from Massive, Incompletely annotated and
Structured Data" (Grant number ICT-2013-612944) funded by the European
Commission.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Imielinski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Swami</surname>
          </string-name>
          .
          <article-title>Mining association rules between sets of items in large databases</article-title>
          . In P. Buneman and S. Jajodia, editors,
          <source>International Conference on Management of Data</source>
          , pages
          <fpage>207</fpage>
          –
          <lpage>216</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M.</given-names>
            <surname>Aiello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Monz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Todoran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Worring</surname>
          </string-name>
          .
          <article-title>Document understanding for a broad class of documents</article-title>
          .
          <source>International Journal on Document Analysis and Recognition-IJDAR</source>
          ,
          <volume>5</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          –
          <lpage>16</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>O.</given-names>
            <surname>Altamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Esposito</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Malerba</surname>
          </string-name>
          .
          <article-title>Transforming paper documents into XML format with WISDOM++</article-title>
          .
          <source>IJDAR</source>
          ,
          <volume>4</volume>
          (
          <issue>1</issue>
          ):
          <fpage>2</fpage>
          –
          <lpage>17</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A.</given-names>
            <surname>Amir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Aumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Feldman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Fresko</surname>
          </string-name>
          .
          <article-title>Maximal association rules: A tool for mining associations in text</article-title>
          .
          <source>J. Intell. Inf. Syst.</source>
          ,
          <volume>25</volume>
          (
          <issue>3</issue>
          ):
          <fpage>333</fpage>
          –
          <lpage>345</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>M.</given-names>
            <surname>Berardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ceci</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Malerba</surname>
          </string-name>
          .
          <article-title>Mining spatial association rules from document layout structures</article-title>
          . In
          <source>Proceedings of the 3rd Workshop on Document Layout Interpretation and its Application, DLIA03</source>
          , pages
          <fpage>9</fpage>
          –
          <lpage>13</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>M.</given-names>
            <surname>Ceci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Appice</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Malerba</surname>
          </string-name>
          .
          <article-title>Mr-SBC: a multi-relational naive Bayes classifier</article-title>
          . In
          <source>Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases</source>
          , volume
          <volume>2838</volume>
          of
          <series>LNAI</series>
          , pages
          <fpage>95</fpage>
          –
          <lpage>106</lpage>
          . Springer-Verlag,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>M.</given-names>
            <surname>Ceci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Loglisci</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Macchia</surname>
          </string-name>
          .
          <article-title>Ranking sentences for keyphrase extraction: A relational data mining approach</article-title>
          . In
          <source>Pushing the Boundaries of the Digital Libraries Field - 10th Italian Research Conference on Digital Libraries, IRCDL 2014</source>
          , Padua, Italy, January 30–31, 2014, pages
          <fpage>52</fpage>
          –
          <lpage>59</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>M.</given-names>
            <surname>Ceci</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Malerba</surname>
          </string-name>
          .
          <article-title>Classifying web documents in a hierarchy of categories: a comprehensive study</article-title>
          .
          <source>J. Intell. Inf. Syst.</source>
          ,
          <volume>28</volume>
          (
          <issue>1</issue>
          ):
          <fpage>37</fpage>
          –
          <lpage>78</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Park</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>Data mining for path traversal patterns in a web environment</article-title>
          . In
          <source>ICDCS '96: Proceedings of the 16th International Conference on Distributed Computing Systems</source>
          , page 385, Washington, DC, USA,
          <year>1996</year>
          . IEEE Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>A.</given-names>
            <surname>Dengel</surname>
          </string-name>
          .
          <article-title>Making documents work: Challenges for document understanding</article-title>
          . In
          <source>ICDAR</source>
          , pages
          <fpage>1026</fpage>
          –. IEEE Computer Society,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>S.</given-names>
            <surname>Dzeroski</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Lavrac</surname>
          </string-name>
          .
          <source>Relational Data Mining</source>
          . Springer-Verlag,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>L.</given-names>
            <surname>Hardman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rutledge</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Bulterman</surname>
          </string-name>
          .
          <article-title>Automated generation of hypermedia presentation from pre-existing tagged media objects</article-title>
          . In
          <source>Proc. of the Second Workshop on Adaptive Hypertext and Hypermedia</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>K.</given-names>
            <surname>Hiraki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Gennari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yamamoto</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Anzai</surname>
          </string-name>
          .
          <article-title>Learning spatial relations from images</article-title>
          . In
          <source>ML</source>
          , pages
          <fpage>407</fpage>
          –
          <lpage>411</lpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>N.</given-names>
            <surname>Jiang</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Gruenwald</surname>
          </string-name>
          .
          <article-title>Research issues in data stream association rule mining</article-title>
          .
          <source>SIGMOD Rec.</source>
          ,
          <volume>35</volume>
          (
          <issue>1</issue>
          ):
          <fpage>14</fpage>
          –
          <lpage>19</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sugandh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. V.</given-names>
            <surname>Garcia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Ram</surname>
          </string-name>
          .
          <article-title>Adapting associative classification to text categorization</article-title>
          . In P. R. King and S. J. Simske, editors,
          <source>ACM Symposium on Document Engineering</source>
          , pages
          <fpage>205</fpage>
          –
          <lpage>208</lpage>
          . ACM,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Lisi</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Malerba</surname>
          </string-name>
          .
          <article-title>Inducing multi-level association rules from multiple relations</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>55</volume>
          (
          <issue>2</issue>
          ):
          <fpage>175</fpage>
          –
          <lpage>210</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hsu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          .
          <article-title>Integrating classification and association rule mining</article-title>
          . In
          <source>Knowledge Discovery and Data Mining</source>
          , pages
          <fpage>80</fpage>
          –
          <lpage>86</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>D.</given-names>
            <surname>Malerba</surname>
          </string-name>
          .
          <article-title>Learning recursive theories in the normal ILP setting</article-title>
          .
          <source>Fundamenta Informaticae</source>
          ,
          <volume>57</volume>
          (
          <issue>1</issue>
          ):
          <fpage>39</fpage>
          –
          <lpage>77</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>Q.</given-names>
            <surname>Mei</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          .
          <article-title>Discovering evolutionary theme patterns from text: an exploration of temporal text mining</article-title>
          . In
          <source>KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining</source>
          , pages
          <fpage>198</fpage>
          –
          <lpage>207</lpage>
          , New York, NY, USA,
          <year>2005</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>G.</given-names>
            <surname>Nagy</surname>
          </string-name>
          .
          <article-title>Twenty years of document image analysis in PAMI</article-title>
          .
          <source>IEEE Trans. Pattern Anal. Mach. Intell.</source>
          ,
          <volume>22</volume>
          (
          <issue>1</issue>
          ):
          <fpage>38</fpage>
          –
          <lpage>62</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>C.</given-names>
            <surname>Ordonez</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Omiecinski</surname>
          </string-name>
          .
          <article-title>Discovering association rules based on image content</article-title>
          . In
          <source>ADL '99: Proceedings of the IEEE Forum on Research and Technology Advances in Digital Libraries</source>
          , page
          <fpage>38</fpage>
          , Washington, DC, USA,
          <year>1999</year>
          . IEEE Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>K.</given-names>
            <surname>Reichenberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Rondhuis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kleinz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Bateman</surname>
          </string-name>
          .
          <article-title>Effective presentation of information through page layout: a linguistically-based approach</article-title>
          . In
          <source>Proc. of ACM Workshop on Effective Abstractions in Multimedia, Layout and Interaction</source>
          , Association for Computing Machinery,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          .
          <article-title>Introduction: Special issue on the 25th European Conference on Information Retrieval Research</article-title>
          .
          <source>Inf. Retr.</source>
          ,
          <volume>7</volume>
          (
          <issue>3-4</issue>
          ):
          <fpage>235</fpage>
          –
          <lpage>237</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
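          <string-name>
            <given-names>M.</given-names>
            <surname>Steinbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Karypis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Kumar</surname>
          </string-name>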
          .
          <article-title>A comparison of document clustering techniques</article-title>
          . In
          <source>Proceedings of KDD-2000 Workshop on Text Mining</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Yeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hirschman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Morgan</surname>
          </string-name>
          .
          <article-title>Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup</article-title>
          .
          <source>CoRR, cs.CL/0308032</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>