<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Applications of Tolerance Rough Set Model in Semantic Text Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hung Son Nguyen</string-name>
          <email>son@mimuw.edu.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Institute of Computer Science, The University of Warsaw, Banacha 2</institution>
          ,
          <addr-line>02-097, Warsaw</addr-line>
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The Tolerance Rough Set Model (TRSM) is an extension of rough set theory that can be used as a tool for the approximation of hidden concepts in collections of documents. In recent years, numerous successful applications of TRSM in web intelligence, including text classification, clustering, thesaurus generation, semantic indexing, and semantic search, have been proposed. This paper reviews the basic concepts of TRSM, some of its possible extensions, and some typical applications of TRSM in text mining. We also discuss further research directions for TRSM.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        Rough set theory has been introduced by Pawlak [
        <xref ref-type="bibr" rid="ref1">1</xref>
] as a tool for concept approximation under uncertainty. The idea is to approximate a concept by two descriptive sets called the lower and upper approximations. The fundamental philosophy of the rough set approach to the concept approximation problem is to minimize the difference between the upper and lower approximations (the boundary region). This simple but brilliant idea has led to many efficient applications of rough sets in machine learning, data mining and also in granular computing. The connection between rough sets and other computational intelligence techniques was presented by many researchers, e.g. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Numerous computational
intelligence techniques based on rough sets, including support vector machines [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
genetic algorithm [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] [
        <xref ref-type="bibr" rid="ref10">10</xref>
], modified self-organizing maps [
        <xref ref-type="bibr" rid="ref11">11</xref>
] have been proposed.
Rough set based data mining methods have been applied to many real-life
problems, e.g., medicine [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], web user clustering [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and marketing
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Tolerance Rough Set Model was developed in [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
] as a basis for modeling
documents and terms in Information Retrieval, Text Mining, etc. With its ability
to deal with vagueness and fuzziness, the Tolerance Rough Set Model seems to be
a promising tool for modeling relations between terms and documents. In many
Information Retrieval problems, especially in document clustering, defining the
relation (i.e. similarity or distance) between documents, between terms, or between
terms and documents is essential. In the Vector Space Model, it has been noticed [
        <xref ref-type="bibr" rid="ref15">15</xref>
] that
a single document is usually represented by relatively few terms.1 This results in
zero-valued similarities, which decreases the quality of clustering. The application of
TRSM in document clustering was proposed as a way to enrich document and
cluster representations with the hope of increasing clustering performance.
      </p>
      <p>
        In fact Tolerance Rough Set Model is a special case of a generalized
approximation space, which has been investigated in [
        <xref ref-type="bibr" rid="ref16">16</xref>
] as a generalization of standard
rough set theory. A generalized approximation space utilizes an arbitrary tolerance
relation over objects to determine the main notions of rough set theory, i.e., the lower
and upper approximations.
      </p>
<p>The main idea of TRSM is to capture conceptually related index terms
into classes. For this purpose, the tolerance relation R is determined as the
co-occurrence of index terms in all documents from D. The choice of co-occurrence
of index terms to define the tolerance relation is motivated by its meaningful
interpretation of the semantic relation in the context of IR and by its relatively simple and
efficient computation.</p>
    </sec>
    <sec id="sec-2">
      <title>Standard TRSM</title>
<p>Let D = {d1, ..., dN} be a corpus of documents. Assume that after the initial
processing of the documents, M unique terms (e.g. words,
stems, N-grams) T = {t1, ..., tM} have been identified.</p>
      <p>The Tolerance Rough Set Model, or briefly TRSM, is an approximation space
R = (T, I_θ, ν, P) determined over the set of terms T, where:
- The parameterized uncertainty function I_θ : T → P(T) is defined by
I_θ(ti) = {tj | f_D(ti, tj) ≥ θ} ∪ {ti}
where f_D(ti, tj) denotes the number of documents in D that contain both
terms ti and tj, and θ is a parameter set by an expert. The set I_θ(ti) is called
the tolerance class of the term ti.
- The vague inclusion function ν(X, Y) measures the degree of inclusion of one
set in another. It is defined as ν(X, Y) = |X ∩ Y| / |X|. It
is clear that this function is monotone with respect to the second argument.
- The structural function: all tolerance classes of terms are considered
structural subsets, i.e., P(I_θ(ti)) = 1 for all ti ∈ T.</p>
      <p>
In the TRSM model R = (T, I_θ, ν, P), the membership function μ
is defined by
μ(ti, X) = ν(I_θ(ti), X) = |I_θ(ti) ∩ X| / |I_θ(ti)|
where ti ∈ T and X ⊆ T. The lower and upper approximations of any subset
X ⊆ T can be determined in the same manner as in the approximation space [
        <xref ref-type="bibr" rid="ref16">16</xref>
]:
L_R(X) = {ti ∈ T | ν(I_θ(ti), X) = 1}
      </p>
<p>U_R(X) = {ti ∈ T | ν(I_θ(ti), X) &gt; 0}</p>
      <p>1 In other words, the number of non-zero values in a document's vector is much smaller
than the vector's dimension, i.e., the number of all index terms.</p>
      <p>[Fig. 1: tolerance classes of the terms t1, ..., t6 and the concepts c1, c2; graphics not recoverable]</p>
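Under these definitions, the lower and upper approximations reduce to simple set tests on tolerance classes. A minimal sketch with hypothetical toy classes (all names are illustrative):

```python
def nu(X, Y):
    """Vague inclusion: nu(X, Y) = |X ∩ Y| / |X|."""
    return len(X & Y) / len(X)

def lower_approximation(classes, X):
    # nu(I(t), X) = 1 exactly when I(t) is a subset of X
    return {t for t, cls in classes.items() if nu(cls, X) == 1.0}

def upper_approximation(classes, X):
    # nu(I(t), X) > 0 exactly when I(t) intersects X
    return {t for t, cls in classes.items() if nu(cls, X) > 0}

# hypothetical toy tolerance classes over T = {t1, ..., t4}
classes = {"t1": {"t1", "t2"}, "t2": {"t2"},
           "t3": {"t3", "t4"}, "t4": {"t3", "t4"}}
d = {"t1", "t3"}  # a document's set of terms
enriched = upper_approximation(classes, d)  # adds t4 via its tolerance class
```

Here the upper approximation enriches the document with t4, whose tolerance class overlaps the document's terms, which is exactly the enrichment effect used in the clustering applications below.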
      <p>
The standard TRSM was applied to document clustering and snippet clustering
tasks (see [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], [
        <xref ref-type="bibr" rid="ref18">18</xref>
]). In those applications, each document is
represented by the upper approximation of its set of words/terms, i.e. the document
di ∈ D is represented by U_R(di). For the example in Figure 1, the enriched
representation of d1 is U_R(d1) = {t1, t3, t4, t2, t6}.
      </p>
      <p>
Let D = {d1, ..., dN} be a set of documents and T = {t1, ..., tM} the set of
index terms for D. Let C be a set of concepts from a given domain knowledge base
(e.g. the concepts from DBpedia or from a specific ontology).
      </p>
<p>The extended TRSM is an approximation space R_C = (T ∪ C, I_{θ,λ}, ν, P),
where C is the above-mentioned set of concepts. The uncertainty function I_{θ,λ} :
T ∪ C → P(T ∪ C) has two parameters and is defined as follows:
- for each concept ci ∈ C, the set I_{θ,λ}(ci) contains the top λ terms from the bag of
terms of ci, calculated from the textual descriptions of concepts;
- for each term ti ∈ T, the set I_{θ,λ}(ti) = I_θ(ti) ∪ C_λ(ti) consists of the
tolerance class of ti from the standard TRSM and the set of concepts whose
descriptions contain the term ti as one of the top terms.</p>
<p>In the extended TRSM, any document di ∈ D can be represented by
U_{R_C}(di) = U_R(di) ∪ {cj ∈ C | ν(I_{θ,λ}(cj), di) &gt; 0} = ∪_{tj ∈ di} I_{θ,λ}(tj)</p>
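The extended uncertainty function can be sketched as follows. This assumes, as reconstructed above, that λ selects the top terms of each concept's description; all function and variable names are illustrative.

```python
def extended_classes(term_classes, concept_top_terms):
    """Build I_{theta,lambda} over T ∪ C.

    term_classes: I_theta(t) from the standard TRSM, for each term t.
    concept_top_terms: for each concept c, the set of its top (lambda) terms.
    """
    ext = {}
    for c, tops in concept_top_terms.items():
        # I_{theta,lambda}(c) contains the top terms of the concept's description
        ext[c] = set(tops)
    for t, cls in term_classes.items():
        # I_{theta,lambda}(t) = I_theta(t) ∪ C_lambda(t): the standard tolerance
        # class plus every concept listing t among its top terms
        ext[t] = set(cls) | {c for c, tops in concept_top_terms.items()
                             if t in tops}
    return ext

term_classes = {"t1": {"t1", "t2"}, "t2": {"t2"}}
concept_top_terms = {"c1": {"t1"}}  # hypothetical concept with top term t1
ext = extended_classes(term_classes, concept_top_terms)
```

In this toy setting, the class of t1 gains the concept c1, so a document containing t1 is enriched with c1 through the extended upper approximation.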
    </sec>
    <sec id="sec-3">
      <title>Weighting Schema</title>
      <p>
Any text di in the corpus D can be represented by a vector [wi,1, ..., wi,M], where
each coordinate wi,j expresses the significance of the j-th term in this document. The
most common measure, called the tf-idf index (term frequency-inverse document
frequency) [
        <xref ref-type="bibr" rid="ref19">19</xref>
], is defined by:
wi,j = tfi,j × idfj = (ni,j / Σ_{k=1}^{M} ni,k) × log(N / |{i : ni,j ≠ 0}|)   (1)
where ni,j is the number of occurrences of the term tj in the document di.</p>
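Formula (1) can be implemented directly; a minimal sketch (names and the toy counts are ours):

```python
import math

def tf_idf(counts):
    """Implements formula (1): w_{i,j} = (n_{i,j} / sum_k n_{i,k}) * log(N / df_j),
    where counts[i][j] is n_{i,j} and df_j = |{i : n_{i,j} != 0}|."""
    N, M = len(counts), len(counts[0])
    df = [sum(1 for i in range(N) if counts[i][j] != 0) for j in range(M)]
    weights = []
    for row in counts:
        total = sum(row)  # sum_k n_{i,k}
        weights.append([(n / total) * math.log(N / df[j]) if n else 0.0
                        for j, n in enumerate(row)])
    return weights

counts = [[2, 1, 0],   # document d1: term counts n_{1,j}
          [1, 0, 1]]   # document d2
w = tf_idf(counts)
# the first term occurs in every document, so its idf = log(2/2) = 0
```

Note that a term occurring in all N documents gets weight zero, which is exactly the idf effect the formula is designed to produce.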
<p>Both the standard TRSM and the extended TRSM are conceptual models for
Information Retrieval. Depending on the application, different extended
weighting schemes can be proposed to achieve the highest possible performance.
Let us recall some existing weighting schemes for TRSM:</p>
      <p>
1. The extended weighting scheme is inherited from the standard TF-IDF:
wij = (1 + log f_di(tj)) log(N / f_D(tj))   if tj ∈ di
wij = 0   if tj ∉ U_R(di)
wij = min_{tk ∈ di} wik × log(N / f_D(tj)) / (1 + log(N / f_D(tj)))   otherwise
This extension ensures that each term occurring in the upper approximation
of di but not in di itself has a weight smaller than the weight of any term occurring in
di. Normalization by the vector's length is then applied to all document vectors:
wij^new = wij / sqrt(Σ_{tk ∈ di} (wik)²) (see [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]). An example of the standard TRSM weighting
is presented in Table 1.
2. Explicit Semantic Analysis (ESA), proposed in [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], is a method for automatic
tagging of textual data with predefined concepts. It utilizes natural language
definitions of concepts from an external knowledge base, such as an
encyclopedia or an ontology, which are matched against documents to find the best
associations. Such definitions are regarded as a regular collection of texts,
with each description treated as a separate document. The original purpose
of ESA was to provide a means for computing semantic relatedness between
texts. However, an intermediate result, the weighted assignment of concepts
to documents (induced by the term-concept weight matrix), may be interpreted
as a weighting scheme for the concepts that are assigned to documents in the
extended TRSM.
      </p>
      <sec id="sec-3-1">
        <title>Title: EconPapers: Rough sets bankruptcy prediction models versus auditor</title>
        <p>Description: Rough sets bankruptcy
prediction models versus auditor
signalling rates. Journal of Forecasting,
2003, vol. 22, issue 8, pages 569-586.
Thomas E. McKee. ...</p>
        <sec id="sec-3-1-1">
          <title>Original vector vs. enriched vector</title>
          <p>Term | Original weight | Enriched weight
auditor | 0.567 | 0.564
bankruptcy | 0.4218 | 0.4196
signalling | 0.2835 | 0.282
EconPapers | 0.2835 | 0.282
rates | 0.2835 | 0.282
versus | 0.223 | 0.2218
issue | 0.223 | 0.2218
Journal | 0.223 | 0.2218
MODEL | 0.223 | 0.2218
prediction | 0.1772 | 0.1762
Vol | 0.1709 | 0.1699
applications | - | 0.0809
Computing | - | 0.0643</p>
          <p>
Let Wi = [wi,1, ..., wi,N] be a bag-of-words representation of an input text di,
where wi,j is a numerical weight of the term tj expressing its association with
the text di. Let sj,k be the strength of association of the term tj with a
knowledge base concept ck, k ∈ {1, ..., K}, given by an inverted index entry for tj.
The new vector representation, called a bag-of-concepts representation of
di, is denoted by [ui,1, ..., ui,K], where ui,k = Σ_{j=1}^{N} wi,j · sj,k. For practical
reasons it is better to represent documents by the most relevant concepts
only. In such a case, the association weights can be used to create a ranking
of concept relatedness. With this ranking it is possible to select only the top
concepts from the list or to apply more sophisticated methods that
involve the utilization of internal relations in the knowledge base. An example
of the top 20 concepts for an article from PubMed is presented in Figure 3.
The weighting scheme described above is naturally utilized in Document
Retrieval as a semantic index [
            <xref ref-type="bibr" rid="ref21 ref22">21, 22</xref>
]. A user may query a document retrieval engine
for documents matching a given concept. If the concepts are already assigned
to documents, this problem is conceptually trivial. However, such a situation is
relatively rare, since the employment of experts who could manually label
documents from a huge repository is expensive. On the other hand, the utilization of an
automatic tagging method, such as ESA, allows one to infer the labeling of previously
untagged documents. More sophisticated weighting schemas have been proposed
in, e.g. [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ], [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ].
          </p>
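The bag-of-concepts construction ui,k = Σ_j wi,j · sj,k with top-concept selection can be sketched as a plain matrix product (all names and the toy association weights are illustrative):

```python
def bag_of_concepts(w, s, top):
    """Bag-of-concepts: u_{i,k} = sum_j w_{i,j} * s_{j,k}, keeping the `top`
    strongest concepts per document.  w: docs x terms, s: terms x concepts."""
    reps = []
    for row in w:
        u = [sum(wj * s[j][k] for j, wj in enumerate(row))
             for k in range(len(s[0]))]
        # rank concepts by association strength and keep the strongest ones
        ranked = sorted(range(len(u)), key=lambda k: u[k], reverse=True)
        reps.append({k: u[k] for k in ranked[:top] if u[k] > 0})
    return reps

w = [[0.5, 0.2, 0.0]]          # one document, three terms
s = [[1.0, 0.0],               # hypothetical inverted-index entries s_{j,k}
     [0.5, 0.5],
     [0.0, 1.0]]
reps = bag_of_concepts(w, s, top=1)
```

With `top=1` the document is represented only by its single strongest concept, mirroring the ranking-and-truncation step described above.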
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
<title>The applications of TRSM in the Semantic Web</title>
<p>Let us now briefly describe some applications of TRSM in semantic text analysis.</p>
      <sec id="sec-4-1">
        <title>The list of top 20 concepts:</title>
        <p>"Low Back Pain", "Pain Clinics",
"Pain Perception", "Treatment
Outcome", "Sick Leave", "Outcome
Assessment (Health Care)", "Controlled
Clinical Trials as Topic", "Controlled
Clinical Trial", "Lost to Follow-Up",
"Rehabilitation, Vocational", "Pain
Measurement", "Pain, Intractable",
"Cohort Studies", "Randomized
Controlled Trials as Topic", "Neck Pain",
"Sickness Impact Profile", "Chronic
Disease", "Comparative Effectiveness
Research", "Pain, Postoperative"</p>
        <p>
TRSM-based search: Let us recall that in TRSM, the upper approximations of
documents can be used as enriched bag-of-words document representations,
which can be applied in information retrieval systems. In [
          <xref ref-type="bibr" rid="ref25">25</xref>
], we supplement
TRSM with a weight learning method in an unsupervised setting and apply the
model to the problem of extending search results. We also introduce a method
for a supervised multi-label classification problem and briefly compare it to an
algorithm described in [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], which is based on Explicit Semantic Analysis [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
The same model structure (defined by tolerance relations) can also be used
for different search tasks, e.g. the inference of authors, by defining a different
structurality function.
        </p>
<p>Semantic indexing: document databases use external knowledge bases to
facilitate the searching process. For example, bio-medical documents in PubMed are
semi-manually tagged with concepts from MeSH. Queries sent to the database
are then automatically extended with the corresponding MeSH headings. Indeed,
the ontological part of our data model supports the storage of information from
different external knowledge bases, such as MeSH or DBpedia. Therefore, we may
implement universal methods for detecting associations between documents
and concepts. The obtained tags can then be utilized in various processes, such as
the grouping of search results or topical classification (e.g. the automatic classification
of documents into MeSH topics).</p>
        <p>
The key concept of the semantic indexing process is to assign to each document a
new representation called the bag-of-concepts. As a step in this direction, we
implemented the extended TRSM algorithm, where natural language definitions
of concepts from an encyclopedia or an ontology are matched against texts to find
the best associations. Thus, we can easily construct an inverted semantic index
that maps words occurring in such descriptions to related concepts. For each
new document, the concepts that correspond to its words, based on this inverted
index, are retrieved and aggregated to form an extended bag-of-concepts.
Online document grouping: Online grouping methods utilize the content of
up to several hundred snippets (contexts of the searched term occurrences)
returned by Web search engines. The output is a list of labeled groups
assigned with some objects (typically Web pages). The goal of grouping is then to
provide a navigational rather than a summary interface [
          <xref ref-type="bibr" rid="ref26">26</xref>
]. On the other hand,
a document retrieval system can usually access higher quality information about
documents, which sets expectations at a different level. In such a case, the
groups based merely on snippet content may not be informative enough to
provide a meaningful overview of the documents returned by the query. This suggests
that enriching snippets may lead to higher quality clustering.
        </p>
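The inverted semantic index described above can be sketched as follows. This is a toy illustration; the concept names come from the Figure 3 example, but their descriptions here are shortened, hypothetical stand-ins.

```python
def build_inverted_index(concept_descriptions):
    """Map each word occurring in a concept's natural-language description
    to the set of related concepts."""
    index = {}
    for concept, description in concept_descriptions.items():
        for word in set(description.lower().split()):
            index.setdefault(word, set()).add(concept)
    return index

def tag_document(index, text):
    """Aggregate the concepts matched by the document's words
    into a (non-weighted) bag-of-concepts."""
    concepts = set()
    for word in text.lower().split():
        concepts |= index.get(word, set())
    return concepts

index = build_inverted_index({
    "Low Back Pain": "chronic pain in the lower back",
    "Pain Measurement": "scales used to measure pain intensity",
})
tags = tag_document(index, "Measuring back pain")
```

A production system would additionally weight each matched concept, e.g. with the ESA scheme from the previous section, rather than collect a plain set.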
      </sec>
    </sec>
    <sec id="sec-5">
      <title>The accuracy and performance</title>
<p>The performance and quality tests undertaken so far on over 200K full-content
articles, resulting in 300M tuples, confirm SONCA's scalability, which should be
assessed not only in terms of data volume but also in terms of the ease of adding new types
of objects that may be of interest to specific groups of users.</p>
      <p>
We applied the semantic indexing methods in combination with MeSH and
DBpedia to index PubMed documents. We verified the effectiveness of our approach
in two ways. First, we clustered small subsets of documents represented by
bag-of-words and bag-of-concepts using a simple k-means algorithm and found
that the semantic representation frequently yields better results [
        <xref ref-type="bibr" rid="ref24">24</xref>
]. We also
compared the key MeSH concepts assigned to selected documents with the
corresponding tags assigned by the PubMed experts. Preliminary results of this
analysis reveal that the ESA method produces quite reasonable tags (see Table
2).
      </p>
      <sec id="sec-5-1">
<title>The TNF-α System: Functional Aspects in Depression, Narcolepsy and Psychopharmacology.</title>
        <p>
We conducted experiments which utilized document representations based on
inbound and outbound citations (i.e. the lists of documents that are referenced
by, and that reference, each given paper), the semantic indexes described earlier in
this section, as well as snippets extended with document abstracts. The MeSH terms
assigned to documents by the PubMed domain experts provided a natural means
of validation for each of the clustering methods, as ideally the system would group
documents in a similar way to the experts [
          <xref ref-type="bibr" rid="ref24 ref26">26, 24</xref>
]. Table 3 shows an
example of a cluster that was discovered after extending document representations
with information about citations. We expect that the extraction of more meaningful
snippets can further improve our results in the near future.
        </p>
<p>The relational data model employed within DocDB enables smooth
extension of the set of supported object types with no need to create new tables
or attributes. It is also prepared to deal on the same basis with objects acquired
at different stages of parsing (e.g. concepts derived from domain ontologies vs.
concepts detected as keywords in loaded texts) and with different degrees of
information completeness (e.g. fully available articles vs. articles identified as
bibliography items elsewhere). However, as already mentioned, the crucial aspect is
the freedom of choice between different data forms and processing strategies while
optimizing Analytic Algorithms, reducing the execution time of specific tasks from
(hundreds of) hours to (tens of) minutes.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Further Perspectives and Conclusions</title>
<p>The SONCA (Search based on ONtologies and Compound Analytics) platform is
developed at the Faculty of Mathematics, Informatics and Mechanics of the
University of Warsaw. SONCA is expected to provide interfaces for intelligent
algorithms identifying relations between various types of objects. It extends the
typical functionality of scientific search engines with more accurate identification of
relevant documents and more advanced synthesis of information. To achieve this,
concurrent processing of documents needs to be coupled with the ability to produce
collections of new objects using queries specific to analytic database
technologies.</p>
<p>Ultimately, SONCA should be capable of answering a user query by listing
and presenting the resources (documents, Web pages, etc.) that correspond to it
semantically. In other words, the system should have some understanding of the
intention of the query and of the contents of the documents stored in the repository,
as well as the ability to retrieve relevant information with high efficacy. The
system should be able to use various knowledge sources related to the
investigated areas of science. It should also allow for independent sources of information
about the analyzed objects, such as, e.g., information about scientists who may
be identified as the stored articles' authors.</p>
<p>Our primary motivation for developing SONCA is to extend the functionality of
currently available search engines towards document-based decision support and
problem solving, via enhanced search and information synthesis capabilities, as
well as richer user interfaces. For this purpose, we have been seeking
inspiration in many projects and approaches related to such fields as, e.g., the semantic
web, social networks and hybrid information networks. Surely, there are plenty of
aspects to be further investigated, in particular, in what form the results should
be transmitted between modules and eventually reported to users. In this
respect, we can refer to research on, e.g., enriching original contents and
linguistic summaries of query results.</p>
      <p>
Another challenge is how to manage a hierarchy of computational tasks in
order to assemble the answers to compound queries. Based on the initial observations
in Section 1.4, we can see that the framework for specifying intermediate
components of search and reasoning processes is crucial for both the performance and
extensibility of the system [
        <xref ref-type="bibr" rid="ref27 ref28">27, 28</xref>
]. The chain of computational specifications
may follow the way human beings interact with standard search engines in order
to summarize the knowledge they are truly interested in. Thus, it is crucial to know
how to represent and learn the behavioral patterns followed by domain experts while
solving problems [
        <xref ref-type="bibr" rid="ref29">29</xref>
]. Some hints in this area may come from our previous
research related to ontology-based approximations of compound concepts and
the identification of behavioral patterns in biomedical applications [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ].
      </p>
<p>We also need to work on completing the list of query types that should
be supported. Besides the examples mentioned in the previous sections, one may be
interested in questions such as: "Who specializes in the treatment of a given
condition (countries, states, hospitals)?"; "What are the current and past methods
of diagnosis and treatment (e.g. links to patient histories and medical images)?";
"Which pharmaceutical patents are relevant to the treatment of the condition?".</p>
      <p>
Furthermore, the user-system dialog may go beyond answering queries
(see e.g. [
        <xref ref-type="bibr" rid="ref31">31</xref>
]). The system may actually be more active, proposing
solutions, suggesting additional pieces of information that should be completed,
or even identifying existing pieces that might need to be reexamined. For
example, let us imagine a SONCA-based diagnostic support system built on a
repository of medical documents and clinical data sets, where a medical doctor
would be able to enter information about a patient's history and, within the
context of specific queries, expect some guidelines with regard to further medical
treatment and, if necessary, further data acquisition and verification.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pawlak</surname>
          </string-name>
          ,
          <article-title>Rough sets: Theoretical aspects of reasoning about data</article-title>
          . Kluwer Dordrecht,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
2. Z. Pawlak, "
          <article-title>Granularity of knowledge, indiscernibility, and rough sets,"</article-title>
          <source>in Proceedings: IEEE Transactions on Automatic Control 20</source>
          ,
          <year>1999</year>
          , pp.
          <volume>100</volume>
-
          <fpage>103</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>L. T.</given-names>
            <surname>Polkowski</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Skowron</surname>
          </string-name>
, "
          <article-title>Towards adaptive calculus of granules,"</article-title>
          <source>in Proceedings of the FUZZ-IEEE International Conference</source>
          ,
          <source>1998 IEEE World Congress on Computational Intelligence (WCCI'98)</source>
          ,
          <year>1998</year>
          , pp.
          <volume>111</volume>
-
          <fpage>116</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Skowron</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Stepaniuk</surname>
          </string-name>
, "
          <article-title>Granular computing: a rough set approach,"</article-title>
          <source>Computational Intelligence: An International Journal</source>
          , vol.
          <volume>17</volume>
          , no.
<issue>3</issue>
          , pp.
          <fpage>514</fpage>
          -
          <lpage>544</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Skowron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Suraj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Rzasa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Borkowski</surname>
          </string-name>
, "
          <article-title>Clustering: A rough set approach to constructing information granules," in Soft Computing and Distributed Processing</article-title>
          .
          <source>Proceedings of 6th International Conference, SCDP</source>
          ,
          <year>2002</year>
          , pp.
          <volume>57</volume>
-
          <fpage>61</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Nguyen</surname>
          </string-name>
, "
          <article-title>Approximate boolean reasoning: Foundations and applications in data mining," in Transactions on Rough Sets V</article-title>
          . Springer,
          <year>2006</year>
          , pp.
          <volume>334</volume>
-
          <fpage>506</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          and
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Ho</surname>
          </string-name>
, "
          <article-title>Rough document clustering and the internet," in Handbook of Granular Computing</article-title>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Pedrycz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Skowron</surname>
          </string-name>
          , and V. Kreinovich, Eds. Wiley &amp; Sons,
          <year>2008</year>
          , pp.
          <volume>987</volume>
-
          <fpage>1004</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>S.</given-names>
            <surname>Asharaf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Shevade</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Murty</surname>
          </string-name>
, "
          <article-title>Rough support vector clustering</article-title>
,"
          <source>Pattern Recognition</source>
          , vol.
          <volume>38</volume>
          , no.
          <issue>10</issue>
          , pp.
          <volume>1779</volume>
-
          <issue>1783</issue>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>P.</given-names>
            <surname>Lingras</surname>
          </string-name>
, "
          <article-title>Unsupervised rough set classi cation using gas,"</article-title>
          <source>Journal of Intelligent Information Systems</source>
          , vol.
          <volume>16</volume>
          , no.
          <issue>3</issue>
          , pp.
          <volume>215</volume>
-
          <issue>228</issue>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>K.</given-names>
            <surname>Voges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pope</surname>
          </string-name>
          , and M. Brown, "
          <article-title>Cluster analysis of marketing data: A comparison of k-means, rough set, and rough genetic approaches,"</article-title>
          in
          <source>Heuristics and Optimization for Knowledge Discovery</source>
          . Idea Group Publishing, pp.
          <fpage>208</fpage>
          -
          <lpage>216</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>P.</given-names>
            <surname>Lingras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hogo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Snorek</surname>
          </string-name>
          , "
          <article-title>Interval set clustering of web users using modified Kohonen self-organizing maps based on the properties of rough sets,"</article-title>
          <source>Web Intelligence and Agent Systems</source>
          , vol.
          <volume>2</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>217</fpage>
          -
          <lpage>225</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>S.</given-names>
            <surname>Hirano</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsumoto</surname>
          </string-name>
          , "
          <article-title>Rough clustering and its application to medicine,"</article-title>
          <source>Journal of Information Science</source>
          , vol.
          <volume>124</volume>
          , pp.
          <fpage>125</fpage>
          -
          <lpage>137</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>P.</given-names>
            <surname>Lingras</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>West</surname>
          </string-name>
          , "
          <article-title>Interval set clustering of web users with rough k-means,"</article-title>
          <source>Journal of Intelligent Information Systems</source>
          , vol.
          <volume>23</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>5</fpage>
          -
          <lpage>16</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>S.</given-names>
            <surname>Kawasaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. B.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Ho</surname>
          </string-name>
          , "
          <article-title>Hierarchical document clustering based on tolerance rough set model,"</article-title>
          in
          <source>Proceedings of PKDD 2000</source>
          , Lyon, France, ser. Lecture Notes in Computer Science,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Zighed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Komorowski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Zytkow</surname>
          </string-name>
          , Eds., vol.
          <volume>1910</volume>
          . Springer,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Ho</surname>
          </string-name>
          and
          <string-name>
            <given-names>N. B.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , "
          <article-title>Nonhierarchical document clustering based on a tolerance rough set model,"</article-title>
          <source>International Journal of Intelligent Systems</source>
          , vol.
          <volume>17</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>199</fpage>
          -
          <lpage>212</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>A.</given-names>
            <surname>Skowron</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Stepaniuk</surname>
          </string-name>
          , "
          <article-title>Tolerance approximation spaces,"</article-title>
          <source>Fundamenta Informaticae</source>
          , vol.
          <volume>27</volume>
          , no.
          <issue>2-3</issue>
          , pp.
          <fpage>245</fpage>
          -
          <lpage>253</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Jaskiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Swieboda</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , "
          <article-title>Enhancing search result clustering with semantic indexing,"</article-title>
          in
          <source>Proceedings of the Third Symposium on Information and Communication Technology</source>
          , ser.
          <source>SoICT '12</source>
          . New York, NY, USA: ACM,
          <year>2012</year>
          , pp.
          <fpage>71</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>G.</given-names>
            <surname>Virginia</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , "
          <article-title>Investigating the effectiveness of thesaurus generated using tolerance rough set model,"</article-title>
          in
          <source>ISMIS</source>
          , ser. Lecture Notes in Computer Science,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kryszkiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rybinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Skowron</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z. W.</given-names>
            <surname>Ras</surname>
          </string-name>
          , Eds., vol.
          <volume>6804</volume>
          . Springer,
          <year>2011</year>
          , pp.
          <fpage>705</fpage>
          -
          <lpage>714</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>R.</given-names>
            <surname>Feldman</surname>
          </string-name>
          and J. Sanger,
          <source>The Text Mining Handbook</source>
          . Cambridge University Press,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>E.</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Markovitch</surname>
          </string-name>
          , "
          <article-title>Computing semantic relatedness using Wikipedia-based explicit semantic analysis,"</article-title>
          in
          <source>Proceedings of the 20th International Joint Conference on Artificial Intelligence</source>
          , ser.
          <source>IJCAI'07</source>
          . San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.,
          <year>2007</year>
          , pp.
          <fpage>1606</fpage>
          -
          <lpage>1611</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>A.</given-names>
            <surname>Hliaoutakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varelas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Voutsakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. G. M.</given-names>
            <surname>Petrakis</surname>
          </string-name>
          , and E. Milios, "
          <article-title>Information retrieval by semantic similarity,"</article-title>
          <source>Int. Journal on Semantic Web and Information Systems (IJSWIS)</source>
          ,
          <source>Special Issue of Multimedia Semantics</source>
          , vol.
          <volume>3</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>73</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          , "
          <article-title>An ontology-driven approach for semantic information retrieval on the web,"</article-title>
          <source>ACM Trans. Internet Technol.</source>
          , vol.
          <volume>9</volume>
          , pp.
          <fpage>10:1</fpage>
          -
          <lpage>10:24</lpage>
          ,
          <year>July 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>A.</given-names>
            <surname>Janusz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Swieboda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krasuski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , "
          <article-title>Interactive document indexing method based on explicit semantic analysis,"</article-title>
          in
          <source>RSCTC</source>
          , ser. Lecture Notes in Computer Science,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Slowinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Greco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mitra</surname>
          </string-name>
          , and L. Polkowski, Eds., vol.
          <volume>7413</volume>
          . Springer,
          <year>2012</year>
          , pp.
          <fpage>156</fpage>
          -
          <lpage>165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <given-names>M.</given-names>
            <surname>Szczuka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Janusz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Herba</surname>
          </string-name>
          , "
          <article-title>Clustering of Rough Set Related Documents with use of Knowledge from DBpedia,"</article-title>
          in
          <source>Proc. of the 6th Int. Conf. on Rough Sets and Knowledge Technology (RSKT)</source>
          , ser.
          <source>LNAI</source>
          , vol.
          <volume>6954</volume>
          . Springer,
          <year>2011</year>
          , pp.
          <fpage>394</fpage>
          -
          <lpage>403</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25. W. Swieboda,
          <string-name>
            <given-names>M.</given-names>
            <surname>Meina</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , "
          <article-title>Weight learning for document tolerance rough set model,"</article-title>
          in
          <source>Rough Sets and Knowledge Technology 2013</source>
          , ser. LNAI 8171,
          <year>2013</year>
          , pp.
          <fpage>385</fpage>
          -
          <lpage>396</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          and
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Ho</surname>
          </string-name>
          , "
          <article-title>Rough Document Clustering and the Internet,"</article-title>
          in
          <source>Handbook of Granular Computing</source>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Pedrycz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Skowron</surname>
          </string-name>
          , and V. Kreinovich, Eds. New York, NY, USA: John Wiley &amp; Sons, Inc.,
          <year>2008</year>
          , pp.
          <fpage>987</fpage>
          -
          <lpage>1003</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <given-names>J.</given-names>
            <surname>Barwise</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Seligman</surname>
          </string-name>
          ,
          <source>Information Flow: The Logic of Distributed Systems</source>
          . Cambridge University Press,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28. L. G. Valiant, "
          <article-title>Robust Logics,"</article-title>
          <source>Artif. Intell.</source>
          , vol.
          <volume>117</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>231</fpage>
          -
          <lpage>253</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29. V. Vapnik, "
          <article-title>Learning Has Just Started (An interview with Vladimir Vapnik by Ran Gilad-Bachrach),"</article-title>
          <year>2008</year>
          . [Online]. Available: http://seed.ucsd.edu/joomla/index.php/articles/12-interviews/9-qlearninghas-just-startedq-an-interview-with-prof-vladimir-vapnik
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30. J. G. Bazan, "
          <article-title>Hierarchical Classifiers for Complex Spatio-temporal Concepts,"</article-title>
          <source>Transactions on Rough Sets</source>
          , vol.
          <volume>9</volume>
          , pp.
          <fpage>474</fpage>
          -
          <lpage>750</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Shrager</surname>
          </string-name>
          , "
          <article-title>Cancer: A Computational Disease that AI Can Cure,"</article-title>
          <source>AI Magazine</source>
          , vol.
          <volume>32</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>14</fpage>
          -
          <lpage>26</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>