-

Discovering Knowledge through Multi-modal Association Rule Mining for Document Image Analysis

Michelangelo Ceci

Corrado Loglisci

Lynn Rudd

Donato Malerba

donato.malerbag@uniba.it 0 0 Dipartimento di Informatica, Universita degli Studi di Bari \Aldo Moro" Via Orabona 4 , 70125, Bari

The paper introduces a descriptive data mining method to discover knowledge for the task of automatic categorization in document image analysis. We argue that a document image is a multi-modal unit of analysis whose semantics is deduced from a combination of textual content, layout structure and logical structure. So, the method considers simultaneously different modalities of document representation, and, therefore different types of information: spatial information derived from a complex document image analysis process (layout analysis), information extracted from the logical structure of the document (by means of document image classi cation and understanding) and the textual information extracted by means of an OCR. The proposed method is based on a relational data mining approach to discover association rules, where the relational setting is justi ed, given its appropriateness to analyze data available in more than one modality. Experimental results on a real world dataset are reported.

Document image analysis is the sub eld of digital image processing that aims to convert document images to symbolic form for modi cation, storage, retrieval, reuse, and transmission. However, technologies and systems demand a large amount of domain-speci c knowledge, in order to properly process document images, as well as to automatically catalog and organize them [ 20 ]. Hand-coding the necessary knowledge for document images with varied layout, such as those processed in real applications, is prohibitive. For this reason, there has been a growing interest in the application of data mining techniques in order to extract the required knowledge from document images [ 2 ],[ 5 ]. Text-mining is a technology of data mining which is attracting interest for its particular ability to analyze large collections of unstructured documents and extract interesting and non-trivial patterns or knowledge. Knowledge discovered from textual documents can be in various forms, including classi cation rules, which partition document collections into a given set of classes [ 23 ], clusters of similar documents or object composing documents [ 24 ], patterns describing trends, such as emerging topics in a corpus of time-stamped documents [ 19 ], and ranking models used in the problems of order reading detection and document summarization [ 7 ].

Text mining and document image analysis have always been considered two complementary technologies: the former is appropriate for documents that are generated according to some textual format, while the latter is applicable to documents available on paper media. Document image mining aims to identify high-level spatial objects and relationships, while text mining is more concerned with patterns involving words, sentences and concepts. The possible interactions between spatial information extracted from document images and textual information, related to the content of some layout components, have never been considered in the data mining literature.

In this paper, we propose to extract the required knowledge in the form of association rules, which have been successfully applied both in the business and scienti c realms. Some examples of applications to images have also been reported in the literature. In particular, Ordonez and Omiecinski [ 21 ] propose to mine association rules, in order to nd similarities in images on the basis of their content. The content is expressed in terms of objects returned by a segmentation process. They show that even without domain knowledge it is possible to automatically extract some reliable knowledge. Mined association rules refer to the presence/absence of an object in an image, since images are viewed as transactions, while objects are seen as items. No spatial relationship between objects in the same image is considered. Moreover, the proposed data mining method is unimodal, since it considers only one source of information - visual - conveyed by images. In order to deal properly with the multi-modal nature of documents we resort to relational data mining approaches [ 11 ], which permit the processing of data stored according to the relational data model. This model permits us to: i) naturally represent different types of data in different tables of a relational database; ii) relate collected data by means of foreign key constraints; iii) represent properly spatial relationships (e.g., topological or distance relationships) implicitly de ned in the layout structure of a document. Relational data mining approaches permit us to directly analyze data stored in multiple database relations, thus preventing information loss, due to data transformation known as \propositionalization" [ 6 ].

In a generic library or archive, this kind of pattern can be used in a number of ways. First, discovered association rules can be used as constraints de ning domain templates of documents both for classi cation tasks, such as in associative classi cation approaches [ 17 ] [ 15 ], and to support layout correction tasks. Second, the rules could also be used in a generative way. For instance, if a part of the document is hidden or missing, strong association rules can be used to predict the location of the missing layout/logical components [ 13 ]. Moreover, a desirable property of a system that automatically generates textual documents is to take into account the layout speci cation during the generation process, since layout and wording generally interact [ 22 ]. Association rules can be useful to de ne the layout speci cations of such a system. Finally, this problem is also related to document reformatting [ 12 ].

The paper is organized as follows. In the next Section, background and motivations of this work are presented. The proposed solution is presented in Section 3. Finally, experiments are presented in Section 4. 2

Background and Motivations

In tasks where the goal is to uncover structure in the data and where there is no target concept, the discovery of relatively simple but frequently occurring patterns is promising. Association rules are a basic example of this kind of setting. The problem of mining association rules was originally introduced in the work [ 1 ] that can be expressed by an implication:

X ! Y; where X and Y are sets of items called antecedent and consequent respectively (X \ Y = ∅). The meaning of such rules is quite intuitive: Given a database D of transactions, where each transaction T 2 D is a set of items, X ! Y expresses that whenever a transaction T contains X, then T probably also contains Y . The conjunction X ^ Y is called a pattern. Two parameters are usually reported for association rules, namely the support, which estimates the probability p(X T ^ Y T ), and the con dence, which estimates the probability p(Y T jX T ). The goal of association rule mining is to nd all the rules with support and con dence exceeding user speci ed thresholds, henceforth called minsup and minconf respectively. A pattern X ! Y is large (or frequent) if its support is greater than or equal to minsup. An association rule X ! Y is strong if it has a large support (i.e., X ! Y is large) and high con dence. However, it is becoming clear that these rules can be successfully applied to a wide range of domains, such as web access pattern discovery [ 9 ] and mining data streams [ 14 ]. As previously before, an interesting application is discussed in the work by [ 21 ] where a method for mining knowledge from images is proposed.

Nevertheless, mining patterns from document images raises many different issues regarding document structure, storage, access and processing. Firstly, documents are typically unstructured or, at most, semi-structured data. In the case of structured data, the associated semantics or meaning is unambiguously and implicitly de ned and encapsulated in the structure of data (i.e., relational databases), whereas unstructured information meaning is only loosely implied by its form and requires several interpretation steps in order to extract the intended meaning. Endowing documents with a structure that properly encodes their semantics adds a degree of complexity in the application of the mining process. This makes the data pre-processing step really crucial.

Secondly, documents are message conveyors whose meaning is deduced from the combination of the written text, the presentation style, the context, the reported pictures and the logical structure, at least. For instance, when the logical structure and the presentation style are quite well-de ned (typically when some parts are pre-printed or documents are generated by following a predened formatting style), the reader may easily identify the document type and locate information of interest even before reading the descriptive text (e.g., the title of a paper in the case of scienti c papers or newspapers, the sender in the case of faxes, the supplier or the amount in the case of invoices, etc.). Moreover, in many contexts, illustrative images fully complement the textual information, such as diagrams in socio-economic or marketing reports. By considering typeface information, it is also possible to immediately and clearly capture the notion the historical origin of documents (e.g., medieval), as well as the cultural origin (e.g., Arabic). The presence of spurious objects may inherently de ne classes of documents, such as revenue stamps in the case of legal documents. The idea of considering the multi-modal nature of documents falls into the novel research trend of the document understanding eld, that encourages the development of hybrid strategies for knowledge capture, in order to exploit the different sources of knowledge (e.g., text, images, layout, type style, tabular and format information) that simultaneously de ne the semantics of a document [ 10 ].

However, data mining has evolved following a unimodal scheme instantiated according to the type of the underlying data (text, images, etc). Applications of data mining involving hybrid knowledge representation models are still to be explored. Indeed, several works have been proposed to mine association rules from the textual dimension [ 4 ] with the goal of nding rules that express regularities concerning the presence of particular words or particular sentences in text corpora. Conversely, mining the combination of structure and content dimensions of documents has not been investigated yet in the literature, even though some emerging real-world applications require mining processes able to exploit several forms of information, such as images and captions in addition to full text [ 25 ].

Thirdly, documents are a kind of data that do not match the classical attributevalue format. In the tabular model, data are represented as xed-length vectors of variable values describing properties, where each variable can have only a single, primitive value. Conversely, the entities (e.g., the objects composing a document image) that are observed and about which information is collected may have different properties, which can be properly modeled by as many data tables (relational data model) as the number of object types. Moreover, relationships (e.g., topological or distance relationships that are implicitly de ned by the location of objects spatially distributed in a document image or words distributed in a text) among observed objects forming the same semantic unit can also be explicitly modeled in a relational database by means of tables describing the relationship. Hence, the classical attribute-value representation seems too restrictive and advanced approaches are necessary to both represent and reason in the presence of multiple relations among data.

Lastly, the data mining method should take into account external information, also called expert or domain knowledge, that can add semantics to the whole process and then obtain high-level decision support and user con dence. All these peculiarities make documents a kind of complex data that require methodological evolutions of data mining technologies, as well as the involvement of several document processing techniques. In our context, the extraction of spatio-textual association rules requires the consideration of all these sources of complexity deriving from the inherent nature of processed documents. 3

Multi-modal Association Rule Mining

In our proposal, the system used for processing documents is WISDOM++[ 3 ]. WISDOM++ 1 is a document analysis system that can transform textual paper documents into XML format. This is performed in several steps. First, the image is segmented into basic layout components (non-overlapping rectangular blocks enclosing content portions). These layout components are classi ed according to their type of content (e.g., text, graphics, and horizontal/vertical lines). Second, a perceptual organization phase, called layout analysis, is performed to detect structures among blocks. The result is a tree-like structure, named layout structure, which represents the document layout at various levels of abstraction and associates the content of a document with a hierarchy of layout components, such as blocks, lines, and paragraphs. Third, the document image classi cation step aims at identifying the membership class (or type) of a document (e.g. censorship decision, newspaper article, etc.), and it is performed using some rst-order rules which can be automatically learned from a set of training examples.

Document image understanding (or interpretation) creates a mapping of the layout structure into the logical structure, which associates the content with a hierarchy of logical components, such as title/authors of a scienti c article, or the name of the censor in a censorship document. As previously pointed out, the logical and the layout structures are strongly related. For instance, the title of an article is usually located at the top of the rst page of a document and it is written with the largest character set used in the document. Document image understanding also uses rst-order rules [ 18 ]. Once the logical and layout structure have been mapped, OCR can be applied only to those textual components of interest for the application domain, and the content can be stored for future retrieval purposes. Thus, the system can not only determine the type of document, but is also able to identify interesting parts of a document and extract the information given in this part plus its meaning. The result of the document analysis is an XML document that makes the document image easily retrievable. Once the layout/logical structure as well as the textual content of a document have been extracted, association rules can be extracted.

The task of mining relational association rules is, in this work, solved by the SPADA system[ 16 ]. It represents relational data a la Datalog, a logic programming language with no function symbols, speci cally designed to implement deductive databases. Moreover, SPADA takes into account background knowledge (BK) expressed in Prolog and handles background hierarchies over the objects to be mined. In document image understanding, hierarchies can be naturally de ned, e.g., considering the organization of the logical components (Figure 1). 1 www.di.uniba.it/ malerba/wisdom++/ So, association rules involving more abstract objects are better supported (although less precise), while association rules involving more speci c objects have higher con dence values (although lower support values). SPADA distinguishes between the set S of reference (or target) objects, which are the main subject of analysis, and the sets Rk, 1 k m, of task-relevant (or non-target) objects, which are related to the former and can contribute to accounting for the variation. Each unit of analysis includes a distinct reference object and many related task-relevant objects. Therefore, the description of a unit of analysis consists of both properties of included reference and task-relevant objects as well as their relationships. 3.1

Document Description

In the logic framework adopted by SPADA, a relational database is boiled down into a deductive database. Properties of both reference and task-relevant objects are represented in the extensional part DE , while the domain knowledge is expressed as a normal logic program which de nes the intensional part DI . As an example, we report a fragment of the extensional part of a deductive database D, which describes multi-modal information which can be extracted from any document image: block(b1). block(b2). : : : height(b2,[11..54]). width(b1,[7..82]). : : : on top(b2,b1). : : : on top(b2,b3). : : : part of(b1,p1). part of(b2,p1). page rst(p1). : : : abstract(b1). title(b2). : : : text in abstract(b1,'base'). text in title(b2,'model').... In this example, b1 and b2 are two constants which denote many distinct layout components (reference objects) , while p1 denotes a document page (taskrelevant object). Predicate block de nes a layout component, part of associates a block to a document page, height and width describe geometrical properties of layout components, on top expresses a topological relationship between layout components, page f irst(p1) refers to the position of the page in the document, abstract and title associates b1 and b2 with a logical label, text in abstract and text in title describe the textual content of the logical components.

The complete list of predicates is reported in Table 1. The a-spatial feature type of speci es the content type of a layout component (e.g. image, text, horizontal line). Logical features are used to associate a logical label to a layout object and depend on the speci c domain. In the case of scienti c papers (considered in this work), possible logical labels are: affiliation, page number, gure, caption, index term, running head, author, title, abstract, formulae, subsection title, section title, biography, references, paragraph, table.

Textual content is represented by means of another class of predicates, which are true when the term reported as the second argument occurs in the layout component denoted by the rst argument. Terms are automatically extracted by means of a text-processing module. All terms in the textual components are tokenized and the set of obtained tokens is ltered out, in order to remove punctuation marks, numbers and tokens of less than three characters. Standard Layout structure Logical structure Text text preprocessing methods are used to: i) remove stop-words, such as articles, adverbs, prepositions and other frequent words; ii) determine equivalent stems (stemming), such as `topolog' in the words `topology' and `topological, by means of the well-known Porter's algorithm for English texts .

Only relevant tokens are used in textual predicates. They are selected by maximizing the product maxT F DF 2 ICF [ 8 ] that scores high terms appearing (possibly frequently) in a single logical component c and penalizes terms common to other logical components. More formally, let c be a logical label associated to a textual component. Let d be the bag of tokens in a component labeled with c (after the tokenization, ltering and stemming steps), w a term in d and T Fd(w) the relative frequency of w in d. Then, the following statistics can be computed: 1. the maximum value T Fc(w) of T Fd(w) on all logical components d labeled with c; 2. the document frequency DFc2(w), i.e., the percentage of logical components labeled with c in which the term w occurs; 3. the category frequency CFc(w), i.e., the number of labels c′ ̸= c, such that w occurs in logical components labeled with c′.

Then, the score vi associated to the term wi belonging to at least one of the logical components labeled with c is: vi = T Fc(wi)

DFc2(wi) 1=CFc(wi) (1)

According to this function, it is possible to identify a ranked list of \discriminative" terms for each of the possible labels. From this list, we select the best ndict terms in Dictc, where ndict is a user-de ned parameter. The textual dimension of each logical component d labeled as c is represented in the document description as a set of ground facts that express the presence of a term w 2 Dictc in the speci ed logical component. 3.2

The mining step

In the setting introduced so far, the problem of mining multi-modal association rules can be formalized as follows:

Given { a set S of reference objects (layout components); { some sets Rk, 1 k m of task-relevant objects (layout components related to those of the reference objects); { a background knowledge BK which includes hierarchies Hk on objects in Rk, granularity levels M in the descriptions (1 is the highest, while M is the lowest) and a set of granularity assignments k which associate each object in Hk with a granularity level (the hierarchical organization of the layout components included in the task-relevant objects); { a couple of thresholds minsup[l] and minconf [l] for each granularity level; { a language bias LB that constrains the search space.

Find strong multi-level association rules, i.e., association rules involving objects at different granularity levels.

Hierarchies Hk de ne is-a (i.e., taxonomic) relations on task-relevant objects. Both the frequency of patterns and the strength of association rules depend on the granularity level l at which patterns/rules describe data. Therefore, a pattern P with support s at level l is frequent if s minsup[l] and all ancestors of P with respect to Hk are frequent at their corresponding levels. An association rule Q ! R with support s and con dence c at level l is strong if the pattern Q [ R is frequent and c minconf [l].

The expressive power of rst-order logic is exploited to specify both the background knowledge BK, such as hierarchies and domain speci c knowledge, and the language bias LB. The LB is important to allow the user to specify his/her bias for interesting solutions and then to exploit this bias to improve both the efficiency of the mining process and the quality of the discovered rules. In SPADA, the language bias is expressed as a set of constraint speci cations for either patterns or association rules. article + j + j j + j + j + j + j j + j + j + j + j + + heading identi cation

(title, author, affiliation) synopsis + (abstract, index term) content nal components

(biography, references) body + (section title, subsect title, paragraph, caption, gure, formulae, table) page component running head page number unde ned

Fig. 1. Hierarchy of logical components.

Experiments

We investigate the applicability of the proposed solution to real-world document images. In particular, we have considered a dataset composed of 3,603 logical components extracted from twenty-four multi-page documents, which are scienti c papers published as either regular or short in the IEEE Transactions on Pattern Analysis and Machine Intelligence in the January and February 1996 issues. Each paper is a multi-page document and has a variable number of pages and layout components for page. A user labels some layout components of this set of documents according to their logical meaning. Those layout components with no clear logical meaning are labelled as unde ned. All logical labels belong to the lowest level of the hierarchy reported in the previous section. We processed 217 document images in all. The number of features to describe the twenty-four documents presented to SPADA is 38,948, about 179 features for each document page. The total number of logical components is partitioned as follows: 23 for affiliation, 191 for page number, 357 for gure, 202 for caption, 26 for index term, 231 for running head, 28 for author, 26 for title, 25 for abstract, 333 for formulae, 65 for section title, 21 for biography, 45 for references, 968 for paragraph, 48 for table and 1014 for unde ned. About 150 descriptors for each document page have been extracted. To generate textual predicates we set ndict = 50 and we considered the following logical components: title, abstract, index term, references and running head. Hence, the following textual predicates have been included in the document descriptions: text in title, text in abstract, text in index term, text in references,text in running head. The total number of extracted textual features is 1,681. The text was read with a commercial OCR.

In Table 2 we report a comprehensive view of the association rules mined for each logical component at different granularity levels. SPADA nds associations for all logical components. In particular, many spatial patterns involve logical components in the rst page of an article, such as affiliation, title, author, abstract and index term. Indeed, the rst page generally presents a more regular layout structure and contains several distinct logical components. The situation is different for references where most of the rules involve textual predicates, because of the high frequency of discriminating terms (e.g., 'vol', `ieee').

An example of an association rule discovered by SPADA at the second granularity level (L=2) is: is a block(A) ! specialize(A; B); is a(B; heading); on top(B; C); C ̸= B; is a(C; heading) text in component(C; paper): [supp: 38.46% conf: 38.46%],

This rule considers both spatial and textual properties. It is interpreted as follows: a portion equal to 38.46% of the heading blocks is above another heading block which contains the term `paper'. Usually, this term occurs in the abstract (a typical sentence is \In this paper ..."), which means that the heading block C is an abstract, while B is another logical component that usually is above the abstract (e.g., author component or title component). The percentage value of 38.46% indicates that 10 out of 26 layout components of type heading (Figure 1), in the overall initial set of 3,603 layout components, match the association rule reported above. Indeed, at a lower granularity level (L=4), SPADA discovers the following rule: is a block(A) ! specialize(A; B); is a(B; title); on top(B; C); C ̸= B; is a(C; abstract); text in component(C; paper): [supp: 38.46% conf: 38.46%], The rule has the same con dence and support reported for the rule inferred at the rst granularity level. This means that all heading components represented by B in the former rule (L=2) are titles. Another example of a discovered rule is: is a block(A) ! specialize(A; B); is a(B; references); type text(B); at page last(B): [supp: 46.66% conf: 46.66%], which shows the use of the predicate at page last(B) introduced in the BK. This is an example of a pure spatial association rule. The rules reported above have only one predicate on the antecedent side. An example of rules with several predicates on the antecedent side is:

is a block(A); specialize(A; B); is a(B; heading) ! specialize(A; C); is a(C; heading); at page first(B): [supp: 100.0% conf: 100.0%], which has been discovered at the granularity level L=2. It is characterized by the highest value of support (since it matches 23 out 23 affiliation components) and the highest value of con dence, which indicates a very strong implication between the set of predicates on the antecedent and the set of predicates on the consequent. This rule reports that whenever there is a heading block B (antecedent), then there is another heading block C and block B is collocated on the rst page. The information associated to this rule is probably too general, since different types of components could satisfy it. To obtain more speci c indications, we can explore the lower levels of granularity and use specialized patterns. The rule reported above is re ned at the level L=4 with the following is a block(A); specialize(A; B); is a(B; affiliation) ! specialize(A; C); is a(C; affiliation); at page first(B): [supp: 100.0% conf: 100.0%], which keeps the same statistical parameters and states that whenever there is an affiliation block B, then it is collocated on the rst page and there is another affiliation block C, which could be typically the affiliation details of a co-author. Finally, an example of pure a textual association rule discovered by SPADA is: is a block(A) ! specialize(A; B); is a(B; index term); text in component(B; index): [supp: 92.0% conf: 92.0%], which simply states that a logical component index term contains the term \index". 5

Conclusions

Automated extraction of knowledge patterns from document images can boost the application of document analysis systems in various contexts. In this paper we investigate a particular form of knowledge pattern, namely association rules, which are useful for many knowledge-intensive tasks, such as document classi cation and indexing, document reformatting and document reconstruction. The proposed method is based on a multi-relational approach, in order to consider the inherent multi-modality of documents, which convey layout, logical and textual information. Moreover, knowledge patterns are extracted at various levels of granularity. Experiments prove the viability of the proposed approach.

Acknowledgments

This work partially ful lls the project "PON020056-33489339 PUGLIA@SERVICE - Internet-based Service Engineering enabling Smart Territory structural development" funded by the Italian Ministry of University and Research (MIUR) and the project "MAESTRA - Learning from Massive, Incompletely annotated and Structured Data" (Grant number ICT-2013-612944) funded by the European Commission.

Agrawal ,

Imielinski , and

A. N.

Swami . Mining association rules between sets of items in large databases . In P. Buneman and S. Jajodia, editors, International Conference on Management of Data , pages 207 { 216 , 1993 .

Aiello ,

Monz ,

Todoran , and

Worring . Document understanding for a broad class of documents . International Journal on Document Analysis and Recognition-IJDAR , 5 ( 1 ):1{ 16 , 2002 .

Altamura ,

Esposito , and

Malerba . Transforming paper documents into XML format with WISDOM++ . IJDAR, 4 ( 1 ):2{ 17 , 2001 .

Amir ,

Aumann ,

Feldman , and

Fresko . Maximal association rules: A tool for mining associations in text . J. Intell. Inf. Syst. , 25 ( 3 ): 333 { 345 , 2005 .

Berardi ,

Ceci , and

Malerba . Mining spatial association rules from document layout structures . In Proceedings of the 3rd Workshop on Document Layout Interpretation and its Application, DLIA03 , pages 9 { 13 , 2003 .

Ceci ,

Appice , and

Malerba . Mr-SBC: a multi-relational naive bayes classi er . In Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases , volume 2838 of LNAI , pages 95 { 106 . Springer-Verlag, 2003 .

Ceci ,

Loglisci , and

Macchia . Ranking sentences for keyphrase extraction: A relational data mining approach . In Pushing the Boundaries of the Digital Libraries Field - 10th Italian Research Conference on Digital Libraries, IRCDL 2014 , Padua, Italy, January 30 - 31 , 2014 , pages 52 { 59 , 2014 .

Ceci and

Malerba . Classifying web documents in a hierarchy of categories: a comprehensive study . J. Intell. Inf. Syst. , 28 ( 1 ): 37 { 78 , 2007 .

M. S.

Chen ,

J. S.

Park , and

P. S.

Yu . Data mining for path traversal patterns in a web environment . In ICDCS '96: Proceedings of the 16th International Conference on Distributed Computing Systems (ICDCS '96) , page 385, Washington, DC, USA, 1996 . IEEE Computer Society.

10.

Dengel . Making documents work: Challenges for document understanding . In ICDAR , pages 1026 {. IEEE Computer Society, 2003 .

11. S. Dzeroski and

Lavrac . Relational Data Mining . Springer-Verlag, 2001 .

12. L. Hardman , L.

Rutledge , and D.

Bulterman . Automated generation of hypermedia presentation from pre-existing tagged media objects . In In Proc. of the Second Workshop on Adaptive Hypertext and Hypermedia , 1998 .

13.

Hiraki ,

J. H.

Gennari ,

Yamamoto , and

Anzai . Learning spatial relations from images . In ML , pages 407 { 411 , 1991 .

14.

Jiang and

Gruenwald . Research issues in data stream association rule mining . SIGMOD Rec ., 35 ( 1 ): 14 { 19 , 2006 .

15.

Li ,

Sugandh ,

E. V.

Garcia , and

Ram . Adapting associative classi cation to text categorization . In P. R. King and S. J . Simske, editors, ACM Symposium on Document Engineering , pages 205 { 208 . ACM, 2007 .

16.

F. A.

Lisi and

Malerba . Inducing multi-level association rules from multiple relations . Machine Learning , 55 ( 2 ): 175 { 210 , 2004 .

17. B. Liu , W.

Hsu , and Y.

Ma . Integrating classi cation and association rule mining . In Knowledge Discovery and Data Mining , pages 80 { 86 , 1998 .

18.

Malerba . Learning recursive theories in the normal ilp setting . Fundamenta Informaticae , 57 ( 1 ): 39 { 77 , 2003 .

19.

Mei and

Zhai . Discovering evolutionary theme patterns from text: an exploration of temporal text mining . In KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining , pages 198 { 207 , New York, NY, USA, 2005 . ACM.

20. G. Nagy. Twenty years of document image analysis in pami . IEEE Trans. Pattern Anal. Mach . Intell., 22 ( 1 ): 38 { 62 , 2000 .

21.

Ordonez and

Omiecinski . Discovering association rules based on image content . In ADL '99: Proceedings of the IEEE Forum on Research and Technology Advances in Digital Libraries, page 38 , Washington, DC, USA, 1999 . IEEE Computer Society.

22.

Reichenberger ,

K. J.

Rondhuis ,

Kleinz , and

J. A.

Bateman . Effective presentation of information through page layout: a linguistically-based approach . In In Proc. of ACM Workshop on Effective Abstractions in Multimedia, Layout and Interaction , Association for Computing Machinery, 1995 .

23.

Sebastiani . Introduction: Special issue on the 25th european conference on information retrieval research . Inf. Retr., 7 ( 3 -4): 235 { 237 , 2004 .

24. M. Steinbach , G. Karypis, and V. Kumar . A comparison of document clustering techniques . In Proceedings of KDD-2000 Workshop on Text Mining , 2000 .

25.

A. S.

Yeh ,

Hirschman , and

A. A.

Morgan . Evaluation of text data mining for database curation: lessons learned from the kdd challenge cup . CoRR, cs.CL/0308032 , 2003 .