<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Applications of Tolerance Rough Set Model in Semantic Text Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hung Son Nguyen</string-name>
          <email>son@mimuw.edu.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Institute of Computer Science, The University of Warsaw, Banacha 2</institution>
          ,
          <addr-line>02-097, Warsaw</addr-line>
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The Tolerance Rough Set Model (TRSM) is an extension of rough set theory that can be used as a tool for the approximation of hidden concepts in collections of documents. In recent years, numerous successful applications of TRSM in web intelligence, including text classification, clustering, thesaurus generation, semantic indexing, and semantic search, have been proposed. This paper reviews the basic concepts of TRSM, some of its possible extensions, and some typical applications of TRSM in text mining. We also discuss further research directions for TRSM.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        Rough set theory has been introduced by Pawlak [
        <xref ref-type="bibr" rid="ref1">1</xref>
] as a tool for concept approximation under uncertainty. The idea is to approximate a concept by two descriptive sets called the lower and upper approximations. The fundamental philosophy of the rough set approach to the concept approximation problem is to minimize the difference between the upper and lower approximations (the boundary region). This simple but brilliant idea has led to many efficient applications of rough sets in machine learning, data mining and also in granular computing. The connection between rough sets and other computational intelligence techniques was presented by many researchers, e.g. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Numerous computational
intelligence techniques based on rough sets, including support vector machines [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
genetic algorithm [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] [
        <xref ref-type="bibr" rid="ref10">10</xref>
], modified self-organizing maps [
        <xref ref-type="bibr" rid="ref11">11</xref>
] have been proposed.
Rough set based data mining methods have been applied to many real-life
problems, e.g., medicine [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], web user clustering [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and marketing
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Tolerance Rough Set Model was developed in [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
] as a basis for modeling
documents and terms in Information Retrieval, Text Mining, etc. With its ability
to deal with vagueness and fuzziness, the Tolerance Rough Set Model seems to be
a promising tool for modeling relations between terms and documents. In many
Information Retrieval problems, especially in document clustering, defining the
relation (i.e. similarity or distance) between documents, between terms, or between
terms and documents is essential. In the Vector Space Model, it has been noticed [
        <xref ref-type="bibr" rid="ref15">15</xref>
] that
a single document is usually represented by relatively few terms.1 This results in
zero-valued similarities, which decreases the quality of clustering. The application of
TRSM in document clustering was proposed as a way to enrich document and
cluster representations with the hope of increasing clustering performance.
      </p>
      <p>
        In fact Tolerance Rough Set Model is a special case of a generalized
approximation space, which has been investigated in [
        <xref ref-type="bibr" rid="ref16">16</xref>
] as a generalization of standard
rough set theory. A generalized approximation space utilizes an arbitrary tolerance
relation over objects to determine the main notions of rough set theory, i.e., the lower
and upper approximations.
      </p>
<p>The main idea of TRSM is to capture conceptually related index terms
into classes. For this purpose, the tolerance relation R is determined as the
co-occurrence of index terms in all documents from D. The choice of co-occurrence
of index terms to define the tolerance relation is motivated by its meaningful
interpretation of the semantic relation in the context of IR and by its relatively simple and
efficient computation.</p>
    </sec>
    <sec id="sec-2">
      <title>Standard TRSM</title>
<p>Let D = {d1, ..., dN} be a corpus of documents. Assume that after the initial
processing of the documents, M unique terms (e.g. words,
stems, N-grams) T = {t1, ..., tM} have been identified.</p>
      <p>The Tolerance Rough Set Model, or briefly TRSM, is an approximation space
R = (T, I_θ, ν, P) determined over the set of terms T, where:
- The parameterized uncertainty function I_θ : T → P(T) is defined by
I_θ(ti) = {tj | f_D(ti, tj) ≥ θ} ∪ {ti}
where f_D(ti, tj) denotes the number of documents in D that contain both
terms ti and tj, and θ is a parameter set by an expert. The set I_θ(ti) is called
the tolerance class of the term ti.
- The vague inclusion function ν(X, Y) measures the degree of inclusion of one
set in another. It is defined as ν(X, Y) = |X ∩ Y| / |X|. It
is clear that this function is monotone with respect to the second argument.
- The structural function: all tolerance classes of terms are considered
structural subsets, i.e., P(I_θ(ti)) = 1 for all ti ∈ T.</p>
      <p>
In the TRSM model R = (T, I_θ, ν, P), the membership function μ
is defined by
μ(ti, X) = ν(I_θ(ti), X) = |I_θ(ti) ∩ X| / |I_θ(ti)|
where ti ∈ T and X ⊆ T. The lower and upper approximations of any subset
X ⊆ T can be determined in the same manner as in the approximation space [
        <xref ref-type="bibr" rid="ref16">16</xref>
]:
L_R(X) = {ti ∈ T | ν(I_θ(ti), X) = 1}
      </p>
<p>U_R(X) = {ti ∈ T | ν(I_θ(ti), X) &gt; 0}</p>
      <p>1 In other words, the number of non-zero values in a document's vector is much smaller
than the vector's dimension, i.e., the number of all index terms.</p>
      <p>[Fig. 1: tolerance classes of the terms t1, ..., t6 and the concepts c1, c2; graphics not recoverable]</p>
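Under these definitions, the lower and upper approximations reduce to simple set tests on tolerance classes. A minimal sketch with hypothetical toy classes (all names are illustrative):

```python
def nu(X, Y):
    """Vague inclusion: nu(X, Y) = |X ∩ Y| / |X|."""
    return len(X & Y) / len(X)

def lower_approximation(classes, X):
    # nu(I(t), X) = 1 exactly when I(t) is a subset of X
    return {t for t, cls in classes.items() if nu(cls, X) == 1.0}

def upper_approximation(classes, X):
    # nu(I(t), X) > 0 exactly when I(t) intersects X
    return {t for t, cls in classes.items() if nu(cls, X) > 0}

# hypothetical toy tolerance classes over T = {t1, ..., t4}
classes = {"t1": {"t1", "t2"}, "t2": {"t2"},
           "t3": {"t3", "t4"}, "t4": {"t3", "t4"}}
d = {"t1", "t3"}  # a document's set of terms
enriched = upper_approximation(classes, d)  # adds t4 via its tolerance class
```

Here the upper approximation enriches the document with t4, whose tolerance class overlaps the document's terms, which is exactly the enrichment effect used in the clustering applications below.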
      <p>
The standard TRSM was applied to document clustering and snippet clustering
tasks (see [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], [
        <xref ref-type="bibr" rid="ref18">18</xref>
]). In those applications, each document is
represented by the upper approximation of its set of words/terms, i.e. the document
di ∈ D is represented by U_R(di). For the example in Figure 1, the enriched
representation of d1 is U_R(d1) = {t1, t3, t4, t2, t6}.
      </p>
      <p>
Let D = {d1, ..., dN} be a set of documents and T = {t1, ..., tM} the set of
index terms for D. Let C be a set of concepts from a given domain knowledge base
(e.g. the concepts from DBpedia or from a specific ontology).
      </p>
<p>The extended TRSM is an approximation space R_C = (T ∪ C, I_{θ,λ}, ν, P),
where C is the above-mentioned set of concepts. The uncertainty function I_{θ,λ} :
T ∪ C → P(T ∪ C) has two parameters and is defined as follows:
- for each concept ci ∈ C, the set I_{θ,λ}(ci) contains the top λ terms from the bag of
terms of ci, calculated from the textual descriptions of concepts;
- for each term ti ∈ T, the set I_{θ,λ}(ti) = I_θ(ti) ∪ C_λ(ti) consists of the
tolerance class of ti from the standard TRSM and the set of concepts whose
descriptions contain the term ti as one of the top terms.</p>
<p>In the extended TRSM, any document di ∈ D can be represented by
U_{R_C}(di) = U_R(di) ∪ {cj ∈ C | ν(I_{θ,λ}(cj), di) &gt; 0} = ∪_{tj ∈ di} I_{θ,λ}(tj)</p>
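The extended uncertainty function can be sketched as follows. This assumes, as reconstructed above, that λ selects the top terms of each concept's description; all function and variable names are illustrative.

```python
def extended_classes(term_classes, concept_top_terms):
    """Build I_{theta,lambda} over T ∪ C.

    term_classes: I_theta(t) from the standard TRSM, for each term t.
    concept_top_terms: for each concept c, the set of its top (lambda) terms.
    """
    ext = {}
    for c, tops in concept_top_terms.items():
        # I_{theta,lambda}(c) contains the top terms of the concept's description
        ext[c] = set(tops)
    for t, cls in term_classes.items():
        # I_{theta,lambda}(t) = I_theta(t) ∪ C_lambda(t): the standard tolerance
        # class plus every concept listing t among its top terms
        ext[t] = set(cls) | {c for c, tops in concept_top_terms.items()
                             if t in tops}
    return ext

term_classes = {"t1": {"t1", "t2"}, "t2": {"t2"}}
concept_top_terms = {"c1": {"t1"}}  # hypothetical concept with top term t1
ext = extended_classes(term_classes, concept_top_terms)
```

In this toy setting, the class of t1 gains the concept c1, so a document containing t1 is enriched with c1 through the extended upper approximation.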
    </sec>
    <sec id="sec-3">
      <title>Weighting Schema</title>
      <p>
Any text di in the corpus D can be represented by a vector [wi,1, ..., wi,M], where
each coordinate wi,j expresses the significance of the j-th term in this document. The
most common measure, called the tf-idf index (term frequency-inverse document
frequency) [
        <xref ref-type="bibr" rid="ref19">19</xref>
], is defined by:
wi,j = tfi,j × idfj = (ni,j / Σ_{k=1}^{M} ni,k) × log(N / |{i : ni,j ≠ 0}|)   (1)
where ni,j is the number of occurrences of the term tj in the document di.</p>
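Formula (1) can be implemented directly; a minimal sketch (names and the toy counts are ours):

```python
import math

def tf_idf(counts):
    """Implements formula (1): w_{i,j} = (n_{i,j} / sum_k n_{i,k}) * log(N / df_j),
    where counts[i][j] is n_{i,j} and df_j = |{i : n_{i,j} != 0}|."""
    N, M = len(counts), len(counts[0])
    df = [sum(1 for i in range(N) if counts[i][j] != 0) for j in range(M)]
    weights = []
    for row in counts:
        total = sum(row)  # sum_k n_{i,k}
        weights.append([(n / total) * math.log(N / df[j]) if n else 0.0
                        for j, n in enumerate(row)])
    return weights

counts = [[2, 1, 0],   # document d1: term counts n_{1,j}
          [1, 0, 1]]   # document d2
w = tf_idf(counts)
# the first term occurs in every document, so its idf = log(2/2) = 0
```

Note that a term occurring in all N documents gets weight zero, which is exactly the idf effect the formula is designed to produce.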
<p>Both the standard TRSM and the extended TRSM are conceptual models for
Information Retrieval. Depending on the application, different extended
weighting schemes can be proposed to achieve the highest possible performance.
Let us recall some existing weighting schemes for TRSM:</p>
      <p>
1. The extended weighting scheme is inherited from the standard TF-IDF:
wij = (1 + log f_di(tj)) log(N / f_D(tj))   if tj ∈ di
wij = 0   if tj ∉ U_R(di)
wij = min_{tk ∈ di} wik × log(N / f_D(tj)) / (1 + log(N / f_D(tj)))   otherwise
This extension ensures that each term occurring in the upper approximation
of di but not in di itself has a weight smaller than the weight of any term occurring in
di. Normalization by the vector's length is then applied to all document vectors:
wij^new = wij / sqrt(Σ_{tk ∈ di} (wik)²) (see [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]). An example of the standard TRSM weighting
is presented in Table 1.
2. Explicit Semantic Analysis (ESA), proposed in [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], is a method for automatic
tagging of textual data with predefined concepts. It utilizes natural language
definitions of concepts from an external knowledge base, such as an
encyclopedia or an ontology, which are matched against documents to find the best
associations. Such definitions are regarded as a regular collection of texts,
with each description treated as a separate document. The original purpose
of ESA was to provide a means for computing semantic relatedness between
texts. However, an intermediate result, the weighted assignment of concepts
to documents (induced by the term-concept weight matrix), may be interpreted
as a weighting scheme for the concepts that are assigned to documents in the
extended TRSM.
      </p>
      <sec id="sec-3-1">
        <title>Title: EconPapers: Rough sets bankruptcy prediction models versus auditor</title>
        <p>Description: Rough sets bankruptcy
prediction models versus auditor
signalling rates. Journal of Forecasting,
2003, vol. 22, issue 8, pages 569-586.
Thomas E. McKee. ...</p>
        <sec id="sec-3-1-1">
          <title>Original vector vs. enriched vector</title>
          <p>Term | Original weight | Enriched weight
auditor | 0.567 | 0.564
bankruptcy | 0.4218 | 0.4196
signalling | 0.2835 | 0.282
EconPapers | 0.2835 | 0.282
rates | 0.2835 | 0.282
versus | 0.223 | 0.2218
issue | 0.223 | 0.2218
Journal | 0.223 | 0.2218
MODEL | 0.223 | 0.2218
prediction | 0.1772 | 0.1762
Vol | 0.1709 | 0.1699
applications | - | 0.0809
Computing | - | 0.0643</p>
          <p>
Let Wi = [wi,1, ..., wi,N] be a bag-of-words representation of an input text di,
where wi,j is a numerical weight of the term tj expressing its association with
the text di. Let sj,k be the strength of association of the term tj with a
knowledge base concept ck, k ∈ {1, ..., K}, given by an inverted index entry for tj.
The new vector representation, called a bag-of-concepts representation of
di, is denoted by [ui,1, ..., ui,K], where ui,k = Σ_{j=1}^{N} wi,j · sj,k. For practical
reasons it is better to represent documents by the most relevant concepts
only. In such a case, the association weights can be used to create a ranking
of concept relatedness. With this ranking it is possible to select only the top
concepts from the list or to apply more sophisticated methods that
involve the utilization of internal relations in the knowledge base. An example
of the top 20 concepts for an article from PubMed is presented in Figure 3.
The weighting scheme described above is naturally utilized in Document
Retrieval as a semantic index [
            <xref ref-type="bibr" rid="ref21 ref22">21, 22</xref>
]. A user may query a document retrieval engine
for documents matching a given concept. If the concepts are already assigned
to documents, this problem is conceptually trivial. However, such a situation is
relatively rare, since the employment of experts who could manually label
documents from a huge repository is expensive. On the other hand, the utilization of an
automatic tagging method, such as ESA, allows one to infer the labeling of previously
untagged documents. More sophisticated weighting schemas have been proposed
in, e.g. [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ], [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ].
          </p>
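The bag-of-concepts construction ui,k = Σ_j wi,j · sj,k with top-concept selection can be sketched as a plain matrix product (all names and the toy association weights are illustrative):

```python
def bag_of_concepts(w, s, top):
    """Bag-of-concepts: u_{i,k} = sum_j w_{i,j} * s_{j,k}, keeping the `top`
    strongest concepts per document.  w: docs x terms, s: terms x concepts."""
    reps = []
    for row in w:
        u = [sum(wj * s[j][k] for j, wj in enumerate(row))
             for k in range(len(s[0]))]
        # rank concepts by association strength and keep the strongest ones
        ranked = sorted(range(len(u)), key=lambda k: u[k], reverse=True)
        reps.append({k: u[k] for k in ranked[:top] if u[k] > 0})
    return reps

w = [[0.5, 0.2, 0.0]]          # one document, three terms
s = [[1.0, 0.0],               # hypothetical inverted-index entries s_{j,k}
     [0.5, 0.5],
     [0.0, 1.0]]
reps = bag_of_concepts(w, s, top=1)
```

With `top=1` the document is represented only by its single strongest concept, mirroring the ranking-and-truncation step described above.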
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
<title>The applications of TRSM in the Semantic Web</title>
<p>Let us now briefly describe some applications of TRSM in semantic text analysis.</p>
      <sec id="sec-4-1">
        <title>The list of top 20 concepts:</title>
        <p>"Low Back Pain", "Pain Clinics",
"Pain Perception", "Treatment
Outcome", "Sick Leave", "Outcome
Assessment (Health Care)", "Controlled
Clinical Trials as Topic", "Controlled
Clinical Trial", "Lost to Follow-Up",
"Rehabilitation, Vocational", "Pain
Measurement", "Pain, Intractable",
"Cohort Studies", "Randomized
Controlled Trials as Topic", "Neck Pain",
"Sickness Impact Profile", "Chronic
Disease", "Comparative Effectiveness
Research", "Pain, Postoperative"</p>
        <p>
TRSM-based search: Let us recall that in TRSM, the upper approximations of
documents can be used as enriched bag-of-words document representations,
which can be applied in information retrieval systems. In [
          <xref ref-type="bibr" rid="ref25">25</xref>
], we supplement
TRSM with a weight learning method in an unsupervised setting and apply the
model to the problem of extending search results. We also introduce a method
for a supervised multi-label classification problem and briefly compare it to an
algorithm described in [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], which is based on Explicit Semantic Analysis [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
The same model structure (defined by tolerance relations) can also be used
for different search tasks, e.g. the inference of authors, by defining a different
structurality function.
        </p>
<p>Semantic indexing: document databases use external knowledge bases to
facilitate the searching process. For example, bio-medical documents in PubMed are
semi-manually tagged with concepts from MeSH. Queries sent to the database
are then automatically extended with the corresponding MeSH headings. Indeed,
the ontological part of our data model supports the storage of information from
different external knowledge bases, such as MeSH or DBpedia. Therefore, we may
implement universal methods for detecting associations between documents
and concepts. The obtained tags can then be utilized in various processes, such as
the grouping of search results or topical classification (e.g. the automatic classification
of documents into MeSH topics).</p>
        <p>
The key concept of the semantic indexing process is to assign to each document a
new representation called the bag-of-concepts. As a step in this direction, we
implemented the extended TRSM algorithm, where natural language definitions
of concepts from an encyclopedia or an ontology are matched against texts to find
the best associations. Thus, we can easily construct an inverted semantic index
that maps words occurring in such descriptions to related concepts. For each
new document, the concepts that correspond to its words, based on this inverted
index, are retrieved and aggregated to form an extended bag-of-concepts.
Online document grouping: Online grouping methods utilize the content of
up to several hundred snippets (contexts of the searched term occurrences)
returned by Web search engines. The output is a list of labeled groups
assigned with some objects (typically Web pages). The goal of grouping is then to
provide a navigational rather than a summary interface [
          <xref ref-type="bibr" rid="ref26">26</xref>
]. On the other hand,
a document retrieval system can usually access higher quality information about
documents, which sets expectations at a different level. In such a case, the
groups based merely on snippet content may not be informative enough to
provide a meaningful overview of the documents returned by the query. This suggests
that enriching snippets may lead to higher quality clustering.
        </p>
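The inverted semantic index described above can be sketched as follows. This is a toy illustration; the concept names come from the Figure 3 example, but their descriptions here are shortened, hypothetical stand-ins.

```python
def build_inverted_index(concept_descriptions):
    """Map each word occurring in a concept's natural-language description
    to the set of related concepts."""
    index = {}
    for concept, description in concept_descriptions.items():
        for word in set(description.lower().split()):
            index.setdefault(word, set()).add(concept)
    return index

def tag_document(index, text):
    """Aggregate the concepts matched by the document's words
    into a (non-weighted) bag-of-concepts."""
    concepts = set()
    for word in text.lower().split():
        concepts |= index.get(word, set())
    return concepts

index = build_inverted_index({
    "Low Back Pain": "chronic pain in the lower back",
    "Pain Measurement": "scales used to measure pain intensity",
})
tags = tag_document(index, "Measuring back pain")
```

A production system would additionally weight each matched concept, e.g. with the ESA scheme from the previous section, rather than collect a plain set.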
      </sec>
    </sec>
    <sec id="sec-5">
      <title>The accuracy and performance</title>
<p>The performance and quality tests undertaken so far on over 200K full-content
articles, resulting in 300M tuples, confirm SONCA's scalability, which should be
assessed not only in terms of data volume but also in terms of the ease of adding new types
of objects that may be of interest to specific groups of users.</p>
      <p>
We applied the semantic indexing methods in combination with MeSH and
DBpedia to index PubMed documents. We verified the effectiveness of our approach
in two ways. First, we clustered small subsets of documents represented by
bag-of-words and bag-of-concepts using a simple k-means algorithm and found
that the semantic representation frequently yields better results [
        <xref ref-type="bibr" rid="ref24">24</xref>
]. We also
compared the key MeSH concepts assigned to selected documents with the
corresponding tags assigned by the PubMed experts. Preliminary results of this
analysis reveal that the ESA method produces quite reasonable tags (see Table
2).
      </p>
      <sec id="sec-5-1">
<title>The TNF-α System: Functional Aspects in Depression, Narcolepsy and Psychopharmacology.</title>
        <p>
We conducted experiments which utilized document representations based on
inbound and outbound citations (i.e. the lists of documents that are referenced
by, and that reference, each given paper), the semantic indexes described earlier in
this section, as well as snippets extended with document abstracts. The MeSH terms
assigned to documents by the PubMed domain experts provided a natural means
of validation for each of the clustering methods, as ideally the system would group
documents in a similar way to the experts [
          <xref ref-type="bibr" rid="ref24 ref26">26, 24</xref>
]. Table 3 shows an
example of a cluster that was discovered after extending document representations
with information about citations. We expect that the extraction of more meaningful
snippets can further improve our results in the near future.
        </p>
<p>The relational data model employed within DocDB enables smooth
extension of the set of supported object types with no need to create new tables
or attributes. It is also prepared to deal on the same basis with objects acquired
at different stages of parsing (e.g. concepts derived from domain ontologies vs.
concepts detected as keywords in loaded texts) and with different degrees of
information completeness (e.g. fully available articles vs. articles identified as
bibliography items elsewhere). However, as already mentioned, the crucial aspect is
the freedom of choice between different data forms and processing strategies while
optimizing Analytic Algorithms, reducing the execution time of specific tasks from
(hundreds of) hours to (tens of) minutes.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Further Perspectives and Conclusions</title>
<p>The SONCA (Search based on ONtologies and Compound Analytics) platform is
developed at the Faculty of Mathematics, Informatics and Mechanics of the
University of Warsaw. SONCA is expected to provide interfaces for intelligent
algorithms identifying relations between various types of objects. It extends the
typical functionality of scientific search engines with more accurate identification of
relevant documents and more advanced synthesis of information. To achieve this,
concurrent processing of documents needs to be coupled with the ability to produce
collections of new objects using queries specific to analytic database
technologies.</p>
<p>Ultimately, SONCA should be capable of answering a user query by listing
and presenting the resources (documents, Web pages, etc.) that correspond to it
semantically. In other words, the system should have some understanding of the
intention of the query and of the contents of the documents stored in the repository,
as well as the ability to retrieve relevant information with high efficacy. The
system should be able to use various knowledge sources related to the
investigated areas of science. It should also allow for independent sources of information
about the analyzed objects, such as, e.g., information about scientists who may
be identified as the stored articles' authors.</p>
<p>Our primary motivation for developing SONCA is to extend the functionality of
currently available search engines towards document-based decision support and
problem solving, via enhanced search and information synthesis capabilities, as
well as richer user interfaces. For this purpose, we have been seeking
inspiration in many projects and approaches related to such fields as, e.g., the semantic
web, social networks and hybrid information networks. Surely, there are plenty of
aspects to be further investigated, in particular, in what form the results should
be transmitted between modules and eventually reported to users. In this
respect, we can refer to research on, e.g., enriching original contents and
linguistic summaries of query results.</p>
      <p>
Another challenge is how to manage a hierarchy of computational tasks in
order to assemble the answers to compound queries. Based on the initial observations
in Section 1.4, we can see that the framework for specifying intermediate
components of search and reasoning processes is crucial for both the performance and
extensibility of the system [
        <xref ref-type="bibr" rid="ref27 ref28">27, 28</xref>
]. The chain of computational specifications
may follow the way human beings interact with standard search engines in order
to summarize the knowledge they are truly interested in. Thus, it is crucial to know
how to represent and learn the behavioral patterns followed by domain experts while
solving problems [
        <xref ref-type="bibr" rid="ref29">29</xref>
]. Some hints in this area may come from our previous
research related to ontology-based approximations of compound concepts and
the identification of behavioral patterns in biomedical applications [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ].
      </p>
<p>We also need to work on completing the list of query types that should
be supported. Besides the examples mentioned in the previous sections, one may be
interested in questions such as: "Who specializes in the treatment of a given
condition (countries, states, hospitals)?"; "What are the current and past methods
of diagnosis and treatment (e.g. links to patient histories and medical images)?";
"Which pharmaceutical patents are relevant to the treatment of the condition?".</p>
      <p>
Furthermore, the user-system dialog may go beyond answering queries
(see e.g. [
        <xref ref-type="bibr" rid="ref31">31</xref>
]). The system may actually be more active, proposing
solutions, suggesting additional pieces of information that should be completed,
or even identifying existing pieces that might need to be reexamined. For
example, let us imagine a SONCA-based diagnostic support system built on a
repository of medical documents and clinical data sets, where a medical doctor
would be able to enter information about a patient's history and, within the
context of specific queries, expect some guidelines with regard to further medical
treatment and, if necessary, further data acquisition and verification.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pawlak</surname>
          </string-name>
          ,
          <article-title>Rough sets: Theoretical aspects of reasoning about data</article-title>
          . Kluwer Dordrecht,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
2. Z. Pawlak, "
          <article-title>Granularity of knowledge, indiscernibility, and rough sets,"</article-title>
          <source>in Proceedings: IEEE Transactions on Automatic Control 20</source>
          ,
          <year>1999</year>
          , pp.
          <volume>100</volume>
-
          <fpage>103</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>L. T.</given-names>
            <surname>Polkowski</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Skowron</surname>
          </string-name>
, "
          <article-title>Towards adaptive calculus of granules,"</article-title>
          <source>in Proceedings of the FUZZ-IEEE International Conference</source>
          ,
          <source>1998 IEEE World Congress on Computational Intelligence (WCCI'98)</source>
          ,
          <year>1998</year>
          , pp.
          <volume>111</volume>
-
          <fpage>116</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Skowron</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Stepaniuk</surname>
          </string-name>
, "
          <article-title>Granular computing: a rough set approach,"</article-title>
          <source>Computational Intelligence: An International Journal</source>
          , vol.
          <volume>17</volume>
          , no.
<issue>3</issue>
          , pp.
          <fpage>514</fpage>
          -
          <lpage>544</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Skowron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Suraj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Rzasa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Borkowski</surname>
          </string-name>
, "
          <article-title>Clustering: A rough set approach to constructing information granules," in Soft Computing and Distributed Processing</article-title>
          .
          <source>Proceedings of 6th International Conference, SCDP</source>
          ,
          <year>2002</year>
          , pp.
          <volume>57</volume>
-
          <fpage>61</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Nguyen</surname>
          </string-name>
, "
          <article-title>Approximate boolean reasoning: Foundations and applications in data mining," in Transactions on Rough Sets V</article-title>
          . Springer,
          <year>2006</year>
          , pp.
          <volume>334</volume>
-
          <fpage>506</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          and
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Ho</surname>
          </string-name>
, "
          <article-title>Rough document clustering and the internet," in Handbook of Granular Computing</article-title>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Pedrycz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Skowron</surname>
          </string-name>
          , and V. Kreinovich, Eds. Wiley &amp; Sons,
          <year>2008</year>
          , pp.
          <volume>987</volume>
-
          <fpage>1004</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>S.</given-names>
            <surname>Asharaf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Shevade</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Murty</surname>
          </string-name>
, "
          <article-title>Rough support vector clustering</article-title>
,"
          <source>Pattern Recognition</source>
          , vol.
          <volume>38</volume>
          , no.
          <issue>10</issue>
          , pp.
          <volume>1779</volume>
-
          <issue>1783</issue>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>P.</given-names>
            <surname>Lingras</surname>
          </string-name>
, "
          <article-title>Unsupervised rough set classi cation using gas,"</article-title>
          <source>Journal of Intelligent Information Systems</source>
          , vol.
          <volume>16</volume>
          , no.
          <issue>3</issue>
          , pp.
          <volume>215</volume>
-
          <issue>228</issue>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>K.</given-names>
            <surname>Voges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pope</surname>
          </string-name>
          , and M. Brown, "
          <article-title>Cluster analysis of marketing data: A comparison of k-means, rough set, and rough genetic approaches,"</article-title>
          in
          <source>Heuristics and Optimization for Knowledge Discovery</source>
          . Idea Group Publishing, pp.
          <fpage>208</fpage>
          -
          <lpage>216</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>P.</given-names>
            <surname>Lingras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hogo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Snorek</surname>
          </string-name>
          , "
          <article-title>Interval set clustering of web users using modified Kohonen self-organizing maps based on the properties of rough sets,"</article-title>
          <source>Web Intelligence and Agent Systems</source>
          , vol.
          <volume>2</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>217</fpage>
          -
          <lpage>225</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>S.</given-names>
            <surname>Hirano</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsumoto</surname>
          </string-name>
          , "
          <article-title>Rough clustering and its application to medicine,"</article-title>
          <source>Journal of Information Science</source>
          , vol.
          <volume>124</volume>
          , pp.
          <fpage>125</fpage>
          -
          <lpage>137</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>P.</given-names>
            <surname>Lingras</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>West</surname>
          </string-name>
          , "
          <article-title>Interval set clustering of web users with rough k-means,"</article-title>
          <source>Journal of Intelligent Information Systems</source>
          , vol.
          <volume>23</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>5</fpage>
          -
          <lpage>16</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>S.</given-names>
            <surname>Kawasaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. B.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Ho</surname>
          </string-name>
          , "
          <article-title>Hierarchical document clustering based on tolerance rough set model,"</article-title>
          in
          <source>Proceedings of PKDD 2000</source>
          , Lyon, France, ser. Lecture Notes in Computer Science,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Zighed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Komorowski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Zytkow</surname>
          </string-name>
          , Eds., vol.
          <volume>1910</volume>
          . Springer,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Ho</surname>
          </string-name>
          and
          <string-name>
            <given-names>N. B.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , "
          <article-title>Nonhierarchical document clustering based on a tolerance rough set model,"</article-title>
          <source>International Journal of Intelligent Systems</source>
          , vol.
          <volume>17</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>199</fpage>
          -
          <lpage>212</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>A.</given-names>
            <surname>Skowron</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Stepaniuk</surname>
          </string-name>
          , "
          <article-title>Tolerance approximation spaces,"</article-title>
          <source>Fundamenta Informaticae</source>
          , vol.
          <volume>27</volume>
          , no.
          <issue>2-3</issue>
          , pp.
          <fpage>245</fpage>
          -
          <lpage>253</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Jaskiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Swieboda</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , "
          <article-title>Enhancing search result clustering with semantic indexing,"</article-title>
          in
          <source>Proceedings of the Third Symposium on Information and Communication Technology</source>
          , ser.
          <source>SoICT '12</source>
          . New York, NY, USA: ACM,
          <year>2012</year>
          , pp.
          <fpage>71</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>G.</given-names>
            <surname>Virginia</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , "
          <article-title>Investigating the effectiveness of thesaurus generated using tolerance rough set model,"</article-title>
          in
          <source>ISMIS</source>
          , ser. Lecture Notes in Computer Science,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kryszkiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rybinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Skowron</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z. W.</given-names>
            <surname>Ras</surname>
          </string-name>
          , Eds., vol.
          <volume>6804</volume>
          . Springer,
          <year>2011</year>
          , pp.
          <fpage>705</fpage>
          -
          <lpage>714</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>R.</given-names>
            <surname>Feldman</surname>
          </string-name>
          and J. Sanger,
          <source>The Text Mining Handbook</source>
          . Cambridge University Press,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>E.</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Markovitch</surname>
          </string-name>
          , "
          <article-title>Computing semantic relatedness using Wikipedia-based explicit semantic analysis,"</article-title>
          in
          <source>Proceedings of the 20th International Joint Conference on Artificial Intelligence</source>
          , ser.
          <source>IJCAI'07</source>
          . San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.,
          <year>2007</year>
          , pp.
          <fpage>1606</fpage>
          -
          <lpage>1611</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>A.</given-names>
            <surname>Hliaoutakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varelas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Voutsakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. G. M.</given-names>
            <surname>Petrakis</surname>
          </string-name>
          , and E. Milios, "
          <article-title>Information retrieval by semantic similarity,"</article-title>
          <source>Int. Journal on Semantic Web and Information Systems (IJSWIS)</source>
          ,
          <source>Special Issue of Multimedia Semantics</source>
          , vol.
          <volume>3</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>73</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          , "
          <article-title>An ontology-driven approach for semantic information retrieval on the web,"</article-title>
          <source>ACM Trans. Internet Technol.</source>
          , vol.
          <volume>9</volume>
          , pp.
          <fpage>10:1</fpage>
          -
          <lpage>10:24</lpage>
          ,
          <year>July 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>A.</given-names>
            <surname>Janusz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Swieboda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krasuski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , "
          <article-title>Interactive document indexing method based on explicit semantic analysis,"</article-title>
          in
          <source>RSCTC</source>
          , ser. Lecture Notes in Computer Science,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Slowinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Greco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mitra</surname>
          </string-name>
          , and L. Polkowski, Eds., vol.
          <volume>7413</volume>
          . Springer,
          <year>2012</year>
          , pp.
          <fpage>156</fpage>
          -
          <lpage>165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <given-names>M.</given-names>
            <surname>Szczuka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Janusz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Herba</surname>
          </string-name>
          , "
          <article-title>Clustering of Rough Set Related Documents with use of Knowledge from DBpedia,"</article-title>
          in
          <source>Proc. of the 6th Int. Conf. on Rough Sets and Knowledge Technology (RSKT)</source>
          , ser.
          <source>LNAI</source>
          , vol.
          <volume>6954</volume>
          . Springer,
          <year>2011</year>
          , pp.
          <fpage>394</fpage>
          -
          <lpage>403</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25. W. Swieboda,
          <string-name>
            <given-names>M.</given-names>
            <surname>Meina</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , "
          <article-title>Weight learning for document tolerance rough set model,"</article-title>
          in
          <source>Rough Sets and Knowledge Technology 2013</source>
          , ser. LNAI 8171,
          <year>2013</year>
          , pp.
          <fpage>385</fpage>
          -
          <lpage>396</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          and
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Ho</surname>
          </string-name>
          , "
          <article-title>Rough Document Clustering and the Internet,"</article-title>
          in
          <source>Handbook of Granular Computing</source>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Pedrycz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Skowron</surname>
          </string-name>
          , and V. Kreinovich, Eds. New York, NY, USA: John Wiley &amp; Sons, Inc.,
          <year>2008</year>
          , pp.
          <fpage>987</fpage>
          -
          <lpage>1003</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <given-names>J.</given-names>
            <surname>Barwise</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Seligman</surname>
          </string-name>
          ,
          <source>Information Flow: The Logic of Distributed Systems</source>
          . Cambridge University Press,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28. L. G. Valiant, "
          <article-title>Robust Logics,"</article-title>
          <source>Artif. Intell.</source>
          , vol.
          <volume>117</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>231</fpage>
          -
          <lpage>253</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29. V. Vapnik, "
          <article-title>Learning Has Just Started (An interview with Vladimir Vapnik by Ran Gilad-Bachrach),"</article-title>
          <year>2008</year>
          . [Online]. Available: http://seed.ucsd.edu/joomla/index.php/articles/12-interviews/9-qlearninghas-just-startedq-an-interview-with-prof-vladimir-vapnik
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30. J. G. Bazan, "
          <article-title>Hierarchical Classifiers for Complex Spatio-temporal Concepts,"</article-title>
          <source>Transactions on Rough Sets</source>
          , vol.
          <volume>9</volume>
          , pp.
          <fpage>474</fpage>
          -
          <lpage>750</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Shrager</surname>
          </string-name>
          , "
          <article-title>Cancer: A Computational Disease that AI Can Cure,"</article-title>
          <source>AI Magazine</source>
          , vol.
          <volume>32</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>14</fpage>
          -
          <lpage>26</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>