=Paper=
{{Paper
|id=None
|storemode=property
|title=Faceted Approach To Diverse Query Processing
|pdfUrl=https://ceur-ws.org/Vol-762/paper4.pdf
|volume=Vol-762
}}
==Faceted Approach To Diverse Query Processing==
<pdf width="1500px">https://ceur-ws.org/Vol-762/paper4.pdf</pdf>
<pre>
             Faceted Approach To Diverse Query Processing

                        Alessandro Agostini                                 Devika P. Madalli, A.R.D. Prasad
                  Department of Computer Science                        Documentation Research and Training Centre
               Prince Mohammad Bin Fahd University                              Indian Statistical Institute
                      Al-Khobar, Saudi Arabia                                        Bangalore, India
                      aagostini@pmu.edu.sa                                   {devika,ard}@drtc.isibang.ac.in


ABSTRACT                                                                strings, according to faceted theory by Ranganathan [20] and DEPA
This paper presents a formal framework for implementing a query         facet analysis [7]. On the other hand, current keyword-based query-
refinement method. The method uses general principles of facet          ing methods does not use DEPA strings to represent web directo-
analysis. Two key notions are advanced and discussed: diversity         ries and annotating resources in digital libraries, so they seem in-
and focus. Diversity refers to the information needs of a querying      adequate to search over digital repositories organized according to
user; it is captured by the notion of ‘facet’. A focus refers to how    CCS and similar faced-based classification systems.
diversity is captured from the documents as organized by the user; it
provides a kind of context to the user query. The method is situated    For answers to be relevant, a user must ask the appropriate query in
within the formal framework of the smallest propositionally closed      order to retrieve the desired information and fulfill the information
description logic ALC, thereby betting that ALC provides us with        need (IN). For keyword-based search this means that a high num-
a suitable SAT solver to implement a facet engine, which is the         ber of keywords is necessary to the user to narrow down the search
main component of our method.                                           according to her information need. This is due the semantic ambi-
                                                                        guity of querying languages, often built upon natural language, as
Categories and Subject Descriptors                                      it is the case of keyword-based querying. Unfortunately, the query
H.3.7 [Information Systems]: Digital Libraries; I.2.4 [Computing        length of keyword-based search on average is reported to be short,
Methodologies]: Knowledge Representation Formalisms and Meth-           with 90% of the queries being less than four keywords [12]. As a
ods                                                                     consequence, the ambiguity of the query is somewhat mirrored in
                                                                        the relative relevance of search results [32, 3]; diversity in search
                                                                        results arises [15] and query refinement by the user is often the
General Terms                                                           only solution. To resolve such ambiguity some authors advanced
Design, Human Factors, Algorithms.                                      the notion of ‘context’ in web search, see for instance [14, 10] and
                                                                        references cited therein. However, in contect-based solutions the
Keywords                                                                user is often assumed to know how data and information are orga-
Query refinement, facet-based search, text-based search, context-       nized in the search domain. This is often hard to happen in real-
based search, user issues, description logic.                           world, distributed scenarios like the Web, due to large amounts of
                                                                        heterogeneous data organized in an unknow structure.
1. INTRODUCTION
Classical libraries had systems that processed subjects or domains      In this paper we present a formal framework wherein we define a
and built representations such as subject indices. Among these sys-     method for the extraction of DEPA facets from a user query. The
tem, the Colon Classification System (CCS) first proposed by S.R.       facets are then used to refine the original query for search and re-
Ranganathan [20] is currently widely used by almost all Indian li-      trieval purposes. The method is aimed to suggest the user a list of
braries. The CCS had enough contextual information in the method        facets that the user would hardly be aware of by simply typing a
of facetisation and synthesis so that it formed a semantic formali-     keyword-based query into a search engine, without any query con-
sation of the domain scope of the library collections.                  text. These automatically suggested new facets can be used by the
                                                                        user, for instance by clicking on one of the new facets, to narrow
In order to digitize CCS and similar facet-based systems, Prasad        down the search space by expanding the original user query with
and Guha [18] demonstrate the applicability of faceted schema in        the suggested facet.
describing resources in web directories and annotating resources in
digital libraries using SKOS/RDF representation to express DEPA         This paper is organized as follows. In Section 2 we define basic
                                                                        concepts related to facet analysis. In Section 3 we discuss the first
                                                                        step of our method. In Section 4 we build a formal faceted ontol-
                                                                        ogy to formalize the focused terms and contexts that we succes-
                                                                        sively process, in Section 5, to produce new facets to be shown to
                                                                        the user for query refinement. After building the faceted ontology
                                                                        and defining the facet engine, in Section 6 we present the three dif-
                                                                        ferent yet related querying methods we offer to the user; these are
                                                                        keyword-based, by focus, and on subject. In Section 7 we discuss
                                                                        related work. In Section 8 we conclude the paper.
2. FACETS ANALYSIS                                                        3.    FOCUSED TERMS FROM TEXT
Facet analysis is essentially a conceptual analysis of the subject        In the present work, we apply facetization as a technique to com-
matter, or the topical content of a concept into distinct divisions       bine extensional and intensional semantics of concepts viz. queries,
that together constitute a semantic description of the concept. In        or equivalently to disclose the subject of concepts and queries to the
order to build the facet repository available to a user to refine a       querying user, for the purpose of query refinement and search as-
query, in this section we present some elements of facet analysis.        sistance. We implement facetization in two related steps: 1. we
                                                                          produce certain “focused terms” from documents organized in a
Our facets repository is organized around two main notions of the         polyhierarchy, and 2. from focused terms we produc new facets
DEPA paradigm for facet analysis [6, 7]: subjects and facets. A           to be shown to the user for the purpose of query refinement. We
subject of a concept is the topical content of the concept, that is,      present step 1 in subsections 3.1 and 3.2 in this section, and step 2
the concept’s overall semantics, as defined by the combination of         in sections 4 and 5.
extensional and intensional semantics of the concept term. The def-
inition can be extended to a query, which in its simplest form can        3.1    Organization of documents
be thought of as a finite sequence of concept terms; see subsections      Although our method can be adopted as integral part of digital li-
6.1 and 6.3. A facet consists of a “group of terms derived by taking      braries systems, both for describing the documents collection and
each term and defining it, per genus et differentiam, with respect        for faceted querying over the collection or the web, in this paper we
for its parent class.” [31, p. 12]. According to Ranganathan [20],        assumed the method assists a querying user in query refinement. As
each domain is made of distinct divisions or facets that are groups       the method in this specific application uses a textual collection of
of mutually exclusive concepts and many such facets together con-         documents stored in the user’s querying machine, we stipulate the
stitute a domain. The notion of such facetization has been extended       following convention.
by Bhattacharyya [7] to subject indexing by representing content as
a string of fundamental categories DEPA (Discipline, Entity, Prop-
erty and Action) that are conceptually equivalent to ‘facets’. To            C ONVENTION 1. We denote the set of available documents to
illustrate, we rely on the following two examples.                        a querying user by D. All available documents are textual, that is,
                                                                          they can be processed by text information retrieval techniques as
   E XAMPLE 1. Consider a document titled ‘Improving EU labour            the variant of a standard technique discussed in Section 3.
market access for Rome’. DEPA facet analysis of the title leads
to facets such as: Labour Market (Entity), Access (Action), Rome
                                                                           Intuitively, the domain D of documents can be thought of as the
(Space - from commonly applicable facet schedules across domains).
                                                                          set of all documents the querying user has classified and stored in
The facet ‘Discipline’ is extrapolated from faceted document rep-
                                                                          the querying machine.
resentation, and it is ‘Economics’.

Note that in case a concept would be classified within more than            C ONVENTION 2. We assume that the querying user organizes
one discipline, as a homonymous or synonymous concept, then all           documents in D by using a ‘polyhierarchical classification’, or
such different combinations of facets are taken into account and          polyhierarchy.
presented to the user for further refinement.
                                                                           A polyhierarchical classification is a hierarchical classification
   E XAMPLE 2. Consider a document titled ‘Treating Apple trees           permitting some concept terms to be listed in multiple categories
for bacterial disease in Trentino’.1 DEPA facet analysis provides a       of a taxonomy, or branches of a hierarchy [16]. An example of
classification of the document into the following facets: Agriculture     polyhierarchy can be found in Figure 1. Note that what makes the
(D), Apple Trees (E), Treating (A), Disease (P), and Bacterial (as        hierarchical classification in Figure 1 be polyhierarchical is the con-
‘Modifier’ to P, cf. [6]).                                                cept term ‘Apple’.

We are now ready to define the facet repository for a given context.                             Cx:MyClassification
A facet repository for a context C is the set
   F R(C) = {⟨C : d, e, p, a⟩ | C has DEPA facets d, e, p, a},                               Computers            Fruit

where C is a concept description in description logic ALC (see                                  Apple       orange    Apple
subsection 4.2) of a concept or subject of interest in context C, and
d, e, p, a are, respectively, a Discipline, Entity, Property and Action                         doc1                   doc1
in DEPA classification system.                                                                                         doc3

   E XAMPLE 3. Consider the previous two examples. We can as-               Figure 1: A polyhierarchy, or polyhierarchical context Cx.
sume that ‘Improving EU labour market access for Rome’ is rep-
resented by a concept description C1 , and ‘Treating Apple trees
                                                                          A subset of documents is organized in ‘contexts’, each context be
for bacterial disease in Trentino’ is represented by a concept de-
                                                                          organized into related sets of documents. A context is a polyhier-
scription C2 in a context C. The facet repository F R(C) contains
                                                                          archical classification composed by sets of documents, i.e., ‘nodes’
⟨C1 : Economics, LabourM arket, p, Access⟩ for p is unspeci-
                                                                          of the polyhierarchy, called clusters, and a relation over the nodes
fied, and ⟨C2 : Agriculture, AppleT rees, Disease, T reating⟩.
                                                                          as defined by the polyhierarchy. Typical relations are the binary re-
1                                                                         lations of subsumption, part-of, is-a, among others relations. Each
  Trentino is a Province of the Italian North-east known for the
Dolomites and for its quality production of red and yellow apples.        cluster in a given context has a name composed by a finite sequence
of words from a representation language, often a natural langiage         C; Card (F C) is the number of focuses in C with leaf C, and
thereby betting that clusters are named by a human—the querying           doCKu [k] is the total number of clusters in the set
user, who naturally applies her native language for clusters nam-
                                                                             C \ {C ′ | C ′ ̸= C is a cluster in a focus in C with leaf C}     (2)
ing. A cluster’s name in such representation language is referred
to as concept term. A concept is a concept term provided with a           which contain k. Intuitively, (1) says that, in order to represent the
semantics. Two kinds of semantics are provided to a concept term:         extensional semantics of a focus, the importance of a retrieved term
an extensional semantics, defined over the documents in the cluster       for a cluster, i.e., the value of Wu [k, C], is inversely proportional
named by the concept term; and an intensional semantics, defined          to the number of different focuses with C as leaf which contain the
by the unique position of the concept term in a given ‘focus’.            term.

Contexts provide a way to define finite, ordered sequences of con-        The label of a cluster C is the most representative term or sequence
cept terms, each sequence called a focus. A focus consists of an          of terms for the cluster. Now we want compute the label of all clus-
ordered set of related concept terms, each concept term naming a          ters of a given context. For doing this, we process all documents
cluster built upon the collection of documents in D. Intuitively, a       stored in each cluster by considering the position of each cluster
focus is a path of concept terms corresponding to a path in a given       in the context. To define the process formally, we rely on the fol-
context. Figure 2 provides an example of both a context (left-hand        lowing technical definition. Let context C organize (a subset of)
side) and a focus (right-hand side). With reference to Figure 2, we       documents in D and cluster C in C be given. We define
write Cx:Fruit>Trentino>Apple to denote the focus named ‘Ap-
ple’ in the context Cx. In boldface are written two documents in                    IR (D, C, C) = {k ∈ Text (d) | d ∈ C, C in C}.             (3)
the cluster ‘Apple’: docRdoc and docGtxt.                                 We expect that the label of cluster C in (3) is the most representa-
                                                                          tive term or sequence of terms in IR (D, C, C). The most represen-
                       Cx:Fruit          CxF:Fruit                        tative term among terms in IR (D, C, C) is the term with the high-
                                                                          est weight among all terms in IR (D, C, C) according to weighting
                  orange    Trentino      Trentino                        measure 1. Formally, a term k in IR (D, C, C) is the most repre-
                                                                          sentative for the cluster C in C, and we say that k is a label of C,
                                            Apple
                              Apple                                       if Wu [k′ , C] ≤ Wu [k, C] for all terms k′ in IR (D, C, C). A se-
                                          docRdoc                         quence k 1 , k2 , ...k n of terms in IR (D, C, C) is a label of C if (a)
                            docRdoc
                                          docGtxt                         Wu [ki , C] = Wu [k, C] for i = 1, 2, ...n, and (b) k is a label of C.
                            docGtxt

                                                                             L EMMA 1. Every cluster C organized by a querying user u in
    Figure 2: An example of context (left) and focus (right).             a context C has a label if and only if C contains a document d such
                                                                          that Text (d) is nonempty.

3.2 Concept terms grounded in documents                                   To compute a label of every nonempty cluster C of a given context
In this section, our goal is to automatically assign a ‘label’ to every   C, we exhibit an algorithm that produces the label lC of C; see
cluster of a given context. Each cluster’s label produced by Algo-        Algorithm 1. Set IR = IR (D, C, C).
rithm 1 below is a finite, simple concatenation of terms with max-
imum ‘weight’, extracted by using Text (·). Formally, we proceed          Algorithm 1 Context-based cluster labeling.
as follows.
                                                                          Input: C, D ̸= ∅
Let Text (·) be a text extraction function. In this paper, we refer to    foreach C in C with C ̸= ∅ do
Text (·) as a standard keywords extraction function, for instance see       foreach k ∈ IR (D, C, C) do
[25, Sec. 4]. Given a document d, Text (d) listes all the keywords             compute Wu [k, C] according to formula (1) od;
in d, precisely, the most frequent ‘tokens’. Applied to a document          compute M = {k ∈ IR | ∀k′ ∈ IR, Wu [k′ , C] ≤ Wu [k, C]};
d, Text (·) produces a set Text (d) of terms (or ‘keywords’). Let d         Let n be the cardinality of M ;
be any document in D. As terms are defined from documents, from             Let {k1 , k 2 , . . . , kn } be the lexicographical ordering of M ;
now on we write k ∈ Text (d) to denote a generic term retrieved by          Set l0 = ∅;                                       /* empty sequence */
using Text (·) d. Given a document d, we rank a term k ∈ Text (d)           for i = 1 to i = n do
by adapting IR standard TF/IDF (“Term Frequency / Inverse Docu-                 Pick ki ∈ M ;
ment Frequency”) method [22, 23] to deal with contexts and unique               Set li = li−1 ki od od;                  /* simple concatenation */
concept terms’ position, i.e., focus, within a context. Observe that      Define lC = ln
in the following, for a given context C we write ‘C in C’ in place of     Return : set of labels {lC | C in C, C ̸= ∅}.
‘C in C’ set of clusters’ for every cluster C.

Let querying user u organizing a context C, cluster C in C, and term      Observe: 1. If C ̸= ∅ then IR ̸= ∅. 2. The label lC computed by
k ∈ Text (d) for a document d ∈ D be given. We define the weight          Algorithm 1 in not unique. In fact, M in Algorithm 1 is assumed to
of k in C as follows:                                                     be ordered according to lexicographical ordering. Other orderings
                         ∑                                                of the elements in M are possible and, as a consequence, a different
                                             Card (F C)                   label can be generated from each ordering.
           Wu [k, C] = (     TF[k, d]) · log             ,        (1)
                                             doCKu [k]
                         d∈C

where∑TF[k, d] is the total number of occurrences of k in d, so              E XAMPLE 4. To illustrate how Algorithm 1 works, consider
that  d∈C TF[k, d] is the total number of occurrences of k in             the context Cx in Figure 2. The result of applying Algorithm 1 to
Cx, limited to focus CxF in Figure 2 is depicted in Figure 3. Each         as explained. Due to the limitation of space, we do not provide a
label in the three, e.g., lApple , is a simple concatenation k1 ...kn of   detailed introduction of Description Logics (DLs), but rather point
terms extracted by Algorithm 1.                                            the reader to [5, 4] and offer the reader an example.
                                  lFruit
                                                                              E XAMPLE 5. Consider the labeled focus in Example 4. We can
                            ...     lTrentino
                                                                           represent it within ALC by a set of equality axioms, that we present
                                                                           as labels of the labeled focus in Figure 4. The concept descrip-
                                      lApple
                                                                                    Fruit ≡ ∃hasK.k 31 ⊓ · · · ⊓ ∃hasK.k 3n

          Figure 3: A focus as labeled by Algorithm 1.
                                                                                         ...    Trentino ≡ ∃hasK.k21 ⊓ · · · ⊓ ∃hasK.k2m
We are now ready to define “focused terms.” Let a focus F with
concept term C as leaf be given. A focused term for F is any term
that appears in a label lC of a cluster C in F . In symbols, the set                              Apple ≡ ∃hasK.k11 ⊓ · · · ⊓ ∃hasK.k1p
of focus terms for F is
             F T (F ) = {k | k appears in lC , C ∈ F}.                                         Figure 4: A labeled focus in ALC.
A focused term for C is any term that appears in lC . A focus term
for a concept term plays the role of a synonymous, or alias names,         tions kji that appear in the tree refer to the focused terms extracted
of the concept term. As we will see in Section 6, alias names are          by Algorithm 1 for each concept in the focus; hasK is a named
important to improve keyword-based querying.                               role, which is intuitively interpreted as ‘has keyword’. For exam-
                                                                           ple, ∃hasK.k31 intuitively means that concept term ‘Fruit’ in focus
                                                                           F :Fruit>Trentino>Apple is extended with focused term (keyword)
4. FACETED ONTOLOGY BUILDING                                               k31 . Each equality axiom that appears along the tree defines in ALC
The result of extracting terms from documents and “facetizing”             a concept term in F; the focus itself is formalized by the equality
the concepts of a polyhierarchical classification by using them pro-       axiom: FocusApple ≡ Apple ⊓ ∃R.(Trentino ⊓ ∃R.Fruit). An
duces a basic kind of faceted taxonomy, provided that (1) the ex-          ALC KB for this example is the set of the three equality axioms
tracted terms or, often, a proper subset of these [9], are matched         depicted along the tree plus the equality axiom that defines ‘Fo-
with a predefined set of facets, and (2) the clusters in a focus are       cusApple’ as the ‘focus Apple’, i.e., the focus F.
related to each other by a subsumption relation. For a faceted tax-
onomy consists of: (a) a set of facets, where each facet consists of a
predefined set of terms; and (b) a subsumption relation among the          4.2 Formal Faceted Classifications
terms. In this section we provide the formal framework we need to          Now we generalize the example. Algorithm 2 below provides a
formalize the focused terms and labeled contexts we have produced          way to build an ALC faceted knowledge base, or faceted ontology,
by Algorithm 1 by shallowly assuming (2)2 .                                for a given context. The algorithm works in two main steps.

                                                                           First, it builds a knowledge base by adding ALC equality axioms
4.1 Description Logics                                                     that formally define the concept terms of an input context by us-
Description Logics (DLs) [5] are a family of logic-based knowl-            ing focused terms computed by Algorithm 1 over the same context.
edge representation formalisms designed to represent and reason            For maching purposes that we will see in Section 5, if strictly more
about the knowledge of an application domain in a structured and           or strictly less (but at least one) focused terms were computed for
well-understood way. In this paper, we use a basic description             a concept term, then the algorithm adds to the knowledge base all
logic, called ALC, thereby betting that ALC provides us with an            the equality axioms defined over all possible combinations of four
efficient SAT solver to implement our facet engine (Section 5).            focused terms picked up, possibly with repetitions, from the com-
ALC is the smallest propositionally closed DL, and provides the            puted terms.
concept constructors
                 ¬ C, C ⊓ D, C ⊔ D, ∃R.C, ∀R.C,                            Second, the algorithm adds to the knowledge base so obtained all
                                                                           ALC equality axioms that formally define DEPA facets of every
as well as concept inclusion (or subsumption) C ⊑ D and concept            concept as stored in the facet repository (see Section 2). These
equality C ≡ D, where C, D are concept descriptions and R is a             axioms have the form
named role. A DL knowledge base (KB) consists of concept ax-
ioms (such as concept inclusion and concept equality axioms), role         C ≡ ∃F acetD.d ⊓ ∃F acetE.e ⊓ ∃F acetP.p ⊓ ∃F acetA.a,               (4)
axioms (such as functional role axioms) and assertions of the form
                                                                           where C represents a concept c available in the facet repository,
C(a), R(a, b) where a and b are named individuals. For the goal
                                                                           F acetD, F acetE, F acetP , and F acetA are named roles rapre-
of this paper, we use a limited part of ALC’s expressive power;
                                                                           senting the property of c in terms of DEPA facet analysis paradigm.3
in particular we do not use role axioms and assertions. Moreover,
                                                                           The intended interpretation of these named roles relates to the facet
we write concept descriptions in lower case, as concept description
                                                                           repository. For example, ∃F acetD.f means that there is a concept
from now on are terms extracted by Algorithm 1 from documents
                                                                           in the facet repository with facet ‘Discipline’ be f . By extension,
2                                                                          equality axiom (4) means that there is a concept in the facet reposi-
  That in our approach clusters in a focus are related to each other
by a subsumption relation follows from Convention 2 by observing           tory with facet ‘Discipline’ d, ‘Entity’ e, ‘Property’ p, and ‘Action’
that polyhierarchical classifications are often subsumption hierar-
                                                                           3
chies. However, we do not need to strictly assume (2) in this paper.           To shorten notation, in algorithms we use D, E, P , A instead.
a, and that concept has name C. Hence, as per second step, Algo-            Algorithm 3 Query expansion with facets from focused terms.
rithm 2 adds to the knowledge base all axioms of form as in (4) if          proc QueryExpansion
and only if there is a concept (or a subject) with DEPA facets d, e,          Input: C, O, F R(C) /* C is meant to represent user query */
p, a in the facet repository. We make the system insensitive to case          Define ΩK be the set of axioms in O of the form
and punctuation in the facets d, e, p, a by adding additional axioms          C ≡ ∃hasK.k1 ⊓ · · · ⊓ ∃hasK.kn ;                   /* k1 ...kn = lC */
where variants of d, e, p, a with the same meaning are used. We               Define ΩF be the set of axioms in O of the form
call the ontology produced by Algorithm 2 a formal faceted classi-            C ≡ ∃D.d ⊓ ∃E.e ⊓ ∃P.p ⊓ ∃A.a;
fication (FFC).                                                                                                   /* ⟨C : d, e, p, a⟩ is in F R(C) */
                                                                              if ΩK ∨ ΩF = ∅
Algorithm 2 Building a ALC faceted ontology O.                                   then exit                     /* no query exspansion provided */
Input: C, D = ̸ ∅, F R(C)                                                         else
Set O = ∅;                              /* ALC ontology to be built */                 s := Card (ΩK );               /* ΩK cardinality is s ≥ 1 */
foreach C in C with C ̸= ∅ do                                                          t := Card (ΩF );                /* ΩF cardinality is t ≥ 1 */
  lC := ⟨k1 k2 · · · k(n )⟩;        /* lC computed by Algorithm 1 */                   F acetSet(C) := ∅; /* set of facets retrieved for C */
  for i = 1 to i = n4 do                                                               for j = 1 to j = s do
       O := O ∪ {C ≡ ∃hasK.ki1 ⊓ · · · ⊓ ∃hasK.ki4 } od;                                   F00 := ∅;         /* different facets strings retrieved */
  if ⟨C : d, e, p, a⟩ ∈ F R(C)                                                                                /* by using a single axiom in ΩK */
                        /* facets d, e, p, a for C in facet repository */                  for l = 1 to l = t do( )
     then                                                                                      for i = 1 to i = n4 do
          O := O ∪ {C ≡ ∃D.d ⊓ ∃E.e ⊓ ∃P.p ⊓ ∃A.a};                                                if O |= ∃hasK.ki1 ⊓ · · · ⊓ ∃hasK.ki4 } ≡
                                      /* axiom of form in (4) added */                                ∃D.d ⊓ ∃E.e ⊓ ∃P.p ⊓ ∃A.a
  fi od;                                                                                              /* focused terms and DEPA facets match */
Return : O.                                                                                           then
                                                                                                            Fli := Fli−1 ∪ {⟨C : d, e, p, a⟩}
                                                                                                                      /* ⟨C : d, e, p, a⟩ retrieved */
                                                                                                                      /* depending on ki1 ,...,ki4 */
                                                                                                   fi od
5. FACET ENGINE                                                                            od;
Now we design within our framework a facet engine that computes                            F acetSet(C)j := F acetSet(C)j−1 ∪ Fli
the matching between the focused terms of a input context and the                               /* all DEPA strings for C in F R(C) retrieved */
predefined set of facets stored in the facet repository for a num-                     od fi;
ber of concepts. Intuitively, the facet engine looks at all keywords          Return : F acetSet(C)j .
generated for each concept name in a focus for all focuses of the
hierarchy, and browse through the focus from the root to the leaf to
identify what keywords are DEPA facets stored in facet repository.
The facet engine’s main component is Algorithm 3. The basic steps
of the algorithm are the following:
                                                                            6.    QUERY PROCESSING
                                                                            After building the faceted ontology and defining the facet engine
Step 1. Input a concept description C that represents a user’s query;       we are ready to use them to provide new facets to the user for
the different possible queries that can be represented this way are         query refinement. We allow the user to make three kind of query:
presented in Section 6.                                                     keyword-based, by focus, and on subject. We discuss each query-
                                                                            ing method in turn.
Step 2. Find and retrieve from the ontology built by Algorithm 2
all equality axioms that define C in the ontology either by focused
terms or DEPA facets. If no axioms do exist, that is, C is not de-          6.1     Keyword-based querying
fined according to the knowledge stored in the ontology, the algo-          The user types one or more keywords in the search box. This
rithm ends with no help to the user. This state means that the search       method is the simplest one and it is often the only method available
engine cannot provide the user with help for query refinement by            when the user does not know anything about the subject to search,
facets.                                                                     or the user’s knowledge on the query subject is not based on doc-
                                                                            uments locally stored in the user querying machine, so that we can
Step 3. For all retrieved axioms and for each axiom of the form             not use the ontology and facet engine we have advanced. This is
C ≡ ∃hasK.k1 ⊓ · · · ⊓ ∃hasK.kn , where lC = k1 ...kn is the la-            also a tyipical case of keyword-based querying by common search
bel computed by Algorithm 1, the algorithm runs the ALC SAT                 engines, where the keywords used in the query are listed without a
solver in order to match (focused) terms ki in the axiom to all             specific ordering on the only basis of the user’s information need.
DEPA facets for C possibly stored in the facet repository. Note
that the performance of our method mainly dependents on this step,          We deal with this method of querying as follows. Each keyword is
namely, the number and complexity of the matchings. Preliminary             mapped to zero or more concept terms in the context C. We do that
results suggested that the algorithm satisfies the requirements of          using an exact string match of the keyword to the concept term or
a query refinement system in terms of real time performance. A              one of its alias names, namely, its focused terms.
complete study of the complexity of this step is in progress.
                                                                            If no concept term and its alias names match any keyword, no con-
Step 4. For all successful matchings computed in Step 3, the re-            cept description is available to the facet engine, and as a conse-
trieved DEPA facets are output and shown to the user.                       quence no facets for query refinement are shown to the user.
If one concept term or its alias names match some keywords, then                             Computers             Fruit
the concept description C of the concept term is generated and pro-
cessed by Algorithm 3 for query expansion. The facets that occur in                             Apple       orange     Apple
the query expansion are shown to the user. When selecting one of
                                                                                                 doc1                   doc1
the new facets, the user will narrow down the search by expanding
                                                                                                                        doc3
the original query with the suggested facet.

If multiple concept terms match some keywords, then the concept                 Figure 6: Position of sample document doc1 matters.
description of each term is generated and processed by Algorithm
3 for query expansion. The facets that occur in the query expansion
of every concept description are shown to the user. Alternatively,        suppose the user selects the document named doc1 as the sample
the user is given the option to refine their query to indicate which      document. In classical querying by example, a relevant answer to
concept term, namely, keyword they meant the most.                        the user would be any document about ‘apple’, as meant as either
                                                                          a fruit or a computer. In contrast, using querying by focus the only
6.2 Querying-By-Focus                                                     relevant answers to the user would be documents from one of the
Now suppose that the user knows at least something about the sub-         two focus Fruit>Apple and Computers>Apple.
ject to search, and the user’s knowledge comes from documents
stored and polyhierarchically organized in the user’s document col-       We deal with querying by focus as follow. First, a concept descrip-
lection. In this case, it would always be desiderable for the user        tion C of the concept term that is the leaf of the focus is generated
to get better and better understanding of the hidden content of the       and processed by Algorithm 3 for query expansion. The facets that
query, as it is automatically generated by a suitable method, so as to    occur in the query expansion are shown to the user. When select-
discover new facets of the original query that the user was not aware     ing one of the new facets, the user will narrow down the search by
of before. For example, suppose the query is ‘apple’ as contextual-       expanding the original query with the suggested facet.
ized in Figure 5. The user clicks on a concept term in a context C.
In doing that, the user selects a focus in C. Alternatively, the user     Note that the case where query by focus applies in practical sit-
types some keywords as in keyword-based querying, but in a spe-           uations is not as uncommon as it may seem, because almost all
cific order to mean a focus in C. For example, the user may click         users start a search from a device storing text and text-annotated
on (an appropriate graphic-version of) ‘Apple’ in context or either       documents, and these are often organized by the user according to
                                                                          a polyhierarchical classification. More importantly, the fact that a
                              Cx:Fruit                                    user searches the Web does not mean that documents from the Web
                                                                          will be used for the purpose of querying by focus. The documents
                         orange     Trentino                              used for querying by focus are all and only the documents locally
                                                                          stored in the user’s querying device, whatever the search objective
                                     Apple                                is either to retrieve documents stored in the user’s device or in the
                                                                          Web. As a consequence, querying by focus clearly scales to the
                                                                          size of the web. To understand a bit further, recall that our method
               Figure 5: A focus for query ‘Apple’.                       is about query refinement, it is not a query search method. We use
                                                                          standard methods and search engines to search; the difference is
type keywords ‘fruit’, ‘trentino’, ‘apple’ in this order, as to mean      that the keywords we let the search engines to use are automati-
Cx:Fruit>Trentino>Apple. In the example, by selecting the facet           cally generated by our facetization technique.
‘Fruit’ the user would narrow down the search space by excluding
all subjects about Apple Computers and related subjects as search         6.3     Querying-On-Subject
results (see Figure 1). Similarly, by selecting facet ‘Trentino’ the      Subject-based querying is the most common approach by special-
user would be able to narrow down the search space by excluding           ized users, where ‘subject’ refers to the topical intent of a query (cf.
all subjects about fruits that are not related to Trentino’s production   Section 2). In our faceted approach to representation of documents
of apples. It follows that the keyword-based method and query-            in collection D, ‘subjects’ are broken down into distinct divisions,
ing by focus are not equivalent for at least one reason, that is, in      the facets of subject. A typical ‘query-on-subject’ is deemed to re-
keyword-based querying the order of keywords does not matter, in          late to a specific subject of a preexisting faceted classification. For
querying by focus does. The other main difference between these           example, a subject-based query is: ‘What are the documents on the
two querying methods arises looking at query processing. The dif-         effects of nitrogen fertilizers on rice plants?’ The subject of the
ference is that concept terms in a focus are not ‘pure’ keywords; a       concept subsumed by this query is one of possibly many focuses,
concept term is represented by a string of similar keywords as gen-       for example the following:
erated by Algorithm 1. Concept terms relate to documents in the
user’s repository, while keywords are usually unrelated to the user’s                 Cx:rice plants>nitrogen fertilizers>effects.            (5)
documents.                                                                This is a partial focus, in the sense that the discipline subsumed by
                                                                          the query as provided by the DEPA facet analysis is
A query-by-focus is similar to a query by example, yet it is more
specific. In querying by example, a sample document (the exam-               Cx:Agriculture>rice plants>nitrogen fertilizers>effects.         (6)
ple) is selected by the user to refine the query. On the other hand, in   Another possible focus for the subject of query’s concept is the
querying by focus the position of the sample document is also con-        following:
sidered, that is, the place the document is stored within the user’s
documentary repository. To illustrate, suppose that a user stores his       C ′ x:Agriculture> effects of nitrogen> fertilizers>rice plants.
documents according two different structures, see Figure 6. Now                                                                            (7)
A number of different but equivalent focuses could exists for a            Normalized Formal Classifications (NFC) used in [11] does this by
given subject-based query. Note the the existance of a focus for           taking into account both the label of the node and its position using
this query as well as the focus form depend only upon the querying         natural language processing techniques (see [11, sec 4]). On the
user’s classification of documents. The take-away point is that by         other hand, we have used an information retrieval technique to find
merging a subject to one or more focuses, by automatically trans-          out the keywords that will successively represented in concept de-
forming a query-on-subject to a query-by-focus, the method pro-            scriptions by using role names of the form hasK.k. This is an im-
vides the user with assistance in query refinement. In fact, we com-       portant difference with [11]. A focus is called “concept at a node”
pute the focuses generated from the query on subject, and for each         in [11, p. 70], although we believe that the two notions are not to-
focus we consider the concept description that represents the focus        tally equivalent (to be investigated). The notion of Formal Faceted
in ALC ontology computed by Algorithm 2. Then we proceed as                Classification (FFC) extends the notion of “lightweight ontology”
in the case of querying by focus and compute the query expansion           of [11] to facets. A main difference with lightweight ontologies by
of the focus according to knowledge stored in the ontology. Finally,       [11] is that FFC’s descriptive language is not propositional as the
the retrieved facets are shown to the user. If multiple focuses are        language used in [11]. Yet, it allows us to automate, through DL
computed from the query’s subject, the user is given the option to         reasoning services (SAT), query refinement, as we did in this paper.
refine the original query to indicate which focus they meant for the       Moreover, by our query language we allow a user to specify a query
searched subject.                                                          by selecting a sample document, to be interpreted of as the “infor-
                                                                           mation need” of documents similar to the sample selected. As a
                                                                           consequence, we provide a user with a mechanism of “querying by
7. RELATED WORK                                                            example” as a special case. On the other hand, in [11] it seems not
                                                                           easy to formalize querying by example, as the propositional lan-
There has been extensive work on automated facet construction
                                                                           guage used does not allow to represent instances.
motivated by query refinement, browsing and navigation over doc-
ument collections, see for instance [29], [8, 9], [10], .[24], [30, 13].
The notion of context in these related works differ from the notion        8.    CONCLUSION
of focus; in [10] context is a piece of text, from a document the user     This paper presented a formal framework for a querying refine-
is presented to, surrounding the query, which is marked by the user        ment method that enables the extraction of the diversity aspects,
on the document. The structural nature of a focus contrasts with the       or facets, of a user query. The method uses the general princi-
plain, linguistic nature of query context as meant in [10]. The nav-       ples of facet analysis in the DEPA paradigm of facetization and
igation trees discussed in [28] are similar to the focuses discussed       the notion of ‘focus’, which is used to infer new facets from the
in this paper. The formal approach of [28], moreover, as well as           user query. The method provides a user with additional and essen-
the use of faceted taxonomies is close in spirit, if not in the formal     tial contextual information, in form of new facets. When select-
development to our work presented here. As far as we know, none            ing one of the new facets, the user can narrow down the search by
of the foregoing approaches uses a DEPA facet schema.                      expanding the original query with the suggested facets. The pro-
                                                                           posed method of query refinement is based on diversity in query-
Our method is a focused retrieval method, in the sense that focused        ing and a multi-dimensionality of information. Three methods of
retrieval addresses ways to provide a querying user a more direct          querying weree discussed: keyword-based, by focus, and on sub-
access to relevant information [26]. Focused retrieval aims to iden-       ject. For each method, textual and structural dimensions were used
tify not only documents relevant to a user information need, but           to assist the user in query refining. The textual dimension allowed
also where within the document the relevant information is located.        us to generate the top-k most relevant terms for each concept of
Our approach of querying-by-focus is similar to querying by focus          a given polyhierarchy of text and text-annotated documents. The
on hierarchical classifications proposed by [1, 2].                        structural dimension of the polyhierarchy was used to match DEPA
                                                                           facets with the user query. We have situated our framework within
In the Indian Context, faceted library systems, especially the Colon       the smallest propositionally closed description logic ALC, and we
Classification System (CCS), has been adopted by majority of the           have used ALC’s solver to implement the facet engine as the main
academic libraries for organizing collections in semantic arrange-         component of the method.
ment. However, there is a wide scope for use of the faceted theory
behind systems such as CCS to other knowledge modeling efforts.
Prasad and Guha [18] intoduced a facet-based method to formu-              9.    REFERENCES
late the descriptive domain metadata that could be used to anno-            [1] A. Agostini and P. Avesani. On the discovery of the semantic
tate digital library resources. Prasad and Madalli [19] propose a               context of queries by game-playing. In H. Christiansen,
generic model for building semantic infrastructure for digital li-              M.-S. Hacid, T. Andreasen, and H. Larsen, editors,
braries based on facets as used in traditional library classification           Proceedings of the Sixth International Conference On
systems.                                                                        Flexible Query Answering Systems (FQAS-04), pages
                                                                                203–216, Berlin Heidelberg, 2004. Springer-Verlag LNAI
Faceted taxonomies are extensively studied, see for instance [21,               3055.
27, 28] and references therein. Facet techniques include that stud-         [2] A. Agostini and G. Moro. Identification of communities of
ied by Tvaroẑek and Bieliková [27], who have proposed faceted                  peers by trust and reputation. In D. F. C. Bussler, editor,
navigation and its personalization in digital libraries. They follow            Proceedings of the Eleventh International Conference on
a method of faceted browser adaptation based on an automatically                Artificial Intelligence: Methodology, Systems, Applications -
acquired user model with support for dynamic facet generation J.                Semantic Web Challenges (AIMSA-04), pages 85–95, Berlin
Polowinski [17] argues for use of Faceted Browsing as a visual se-              Heidelberg, 2004. Springer-Verlag LNAI 3192.
lection mechanism to browse data collections as it is deemed as             [3] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong.
being particularly suitable for structured, but heterogeneous data              Diversifying search results. In Proceedings of the Second
with explicit semantics.                                                        ACM International Conference on Web Search and Data
     Mining (WSDM-00), pages 5–14, New York, NY, 2009.                       19-24, 2009, pages 601–610, Berlin Heidelberg, 2009.
     ACM Press.                                                              Springer-Verlag LNCS 5617.
 [4] F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and              [18] A. Prasad and N. Guha. Expressing faceted subject indexing
     P. Patel-Schneider, editors. Handbook of Description Logics,            in SKOS/RDF. In Proceedings of the First International
     Cambridge, UK, 2002. Cambridge University Press.                        Conference of Semantic Web and Digital Libraries,
 [5] F. Baader and W. Nutt. Basic description logics. In F. Baader,          Bangalore 21-23 February (ICSWDL-07), 2007.
     D. Calvanese, D. M. Guinness, and P. P.-S. D. Nardi, editors,      [19] A. Prasad and D. Madalli. Semantic digital faceted
     Handbook of Description Logics, pages 47–100. Cambridge                 infrastructure for semantic digital libraries. Library Review,
     University Press, Cambridge, UK, 2002.                                  57(3):225–234, 2008.
 [6] G. Bhattacharyya. POPSI: its fundamentals and procedure            [20] S. R. Ranganathan. Prolegomena to Library Classification.
     based on a general theory of subject indexing languages.                Asia Publishing House, London, 1967.
     Library Science with a Slant to Documentation, 16(1):1–34,         [21] G. Sacco and Y. Tzitzikas, editors. Dynamic Taxonomies and
     1976.                                                                   Faceted Search, The Information Retrieval Series, v. 25,
 [7] G. Bhattacharyya. Subject indexing language: its theory and             Berlin Heidelberg, 2009. Springer-Verlag.
     practice. In Proceedings of the DRTC Refresher Seminar–13,         [22] G. Salton, editor. The SMART Retrieval
     New Developments in LIS in India, Bangalore, India, 1981.               System—Experiments in Automatic Document Retrieval,
     DRTC, ISI Bangalore Centre.                                             Englewood Cliffs, NJ, 1971. Prentice-Hall Inc.
 [8] W. Dakka, R. Dayal, and P. Ipeirotis. Automatic discovery of       [23] G. Salton and M. McGill. Introduction to Modern
     useful facet terms. In Proceedings of the ACM SIGIR 2006                Information Retrieval. McGraw-Hill, New York, NY, 1983.
     Workshop on Faceted Search, New York, NY, 2006. ACM                [24] E. Stoica, M. A. Hearst, and M. Richardson. Automating
     Press.                                                                  creation of hierarchical faceted metadata structures. In
 [9] W. Dakka and P. Ipeirotis. Automatic extraction of useful               Proceedings of the Human Language Technology Conference
     facet hierarchies from text databases. In Proceedings of the            (NAACL HLT), pages 244–251, Rochester, NY, USA, 2007.
     2008 IEEE 24th International Conference on Data                         Association for Computational Linguistics.
     Engineering (ICDE-08), pages 466–475, Washington, DC,              [25] P. Tonella, F. Ricca, E. Pianta, and C. Girardi. Using
     USA, 2008. IEEE Computer Society.                                       keyword extraction for web site clustering. In K. Wong,
[10] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan,         editor, Proceedings of the Fifth International Workshop on
     G. Wolfman, and E. Ruppin. Placing search in context: The               Web Site Evolution (WSE-03), pages 41–48, Amsterdam, The
     concept revised. In Proceedings of the Tenth International              Netherlands, 2003. IEEE Computer Society.
     World Wide Web Conference (WWW-2001), pages 406–414,               [26] A. Trotman, S. Geva, J. Kamps, M. Lalmas, and V. Murdock.
     New York, NY, 2001. ACM Press.                                          Current research in focused retrieval and result aggregation.
[11] F. Giunchiglia, M. Marchese, and I. Zaihrayeu. Encoding                 Special Issue in the Journal of Information Retrieval,
     classifications into lightweight ontologies. In S. Spaccapietra         13(5):407–411, 2010.
     and et. al., editors, Journal on Data Semantics VIII, pages        [27] M. Tvaroẑek and M. Bieliková. Personalized faceted
     57–81. Springer-Verlag LNCS 4380, Berlin Heidelberg,                    browsing for digital libraries. In L.ács, N. Fuhr, and
     2007.                                                                   C. Meghini, editors, Research and Advanced Technology for
[12] B. J. Jansen, A. Spink, and T. Saracevic. Real life, real users,        Digital Libraries. Proceedings of the 11th European
     and real needs: a study and analysis of user queries on the             Conference on Digital Libraries (ECDL-07), Budapest,
     web. Information Processing & Management,                               Hungary, September 16-21, 2007, pages 485–488, Berlin
     36(2):207–227, 2000.                                                    Heidelberg, 2007. Springer-Verlag LNCS 4675.
[13] K. Latha, K. R. Veni, and R. Rajaram. AFGF: An automatic           [28] Y. Tzitzikas, N. Spyratos, P. Constantopoulos, and
     facet generation framework for document retrieval. In                   A. Analyti. Extended faceted taxonomies for web catalogs.
     Proceedings of the 2010 International Conference on                     In Proceedings of the Third International Conference on Web
     Advances in Computer Engineering (ACE-2010), pages                      Information Systems Engineering (WISE-02), pages
     110–114, Washington, DC, USA, 2010. IEEE Computer                       192–204, 2002.
     Society.                                                           [29] R. van Zwol and B. Sigurbjörnsson. Faceted exploration of
[14] S. Lawrence. Context in Web Search. IEEE Data                           image search results. In Proceedings of the Nineteenth
     Engineering Bulletin, 23(3):25–32, 2000.                                International World Wide Web Conference (WWW-10), pages
[15] V. Maltese, F. Giunchiglia, K. Denecke, P. Lewis, C. Wallner,           961–970, New York, NY, 2010. ACM Press.
     A. Baldry, and D. Madalli. On the interdisciplinary                [30] E. Vee, U. Srivastava, J. Shanmugasundaram, P. Bhat, and
     foundations of diversity. In G. Boato and C. Niederee,                  S. A. Yahia. Efficient computation of diverse query results. In
     editors, Proceedings of the First International Workshop on             Proceedings of the 2008 IEEE 24th International Conference
     Living Web at ISWC-09, Washington D.C., USA, October 26,                on Data Engineering (ICDE-08), pages 228–236,
     2009. CEUR-WS, 2009.                                                    Washington, DC, USA, 2008. IEEE Computer Society.
[16] P. Morville and L. Rosenfeld. Information architecture for         [31] B. Vickery. Faceted classification: A guide to construction
     the World Wide Web, 3rd edition. O’Reilly Media, Inc.,                  and use of special schemes. Aslib - Asia Publishing House,
     Sebastopol, CAe, 2006.                                                  London, 1960.
[17] J. Polowinski. Human interface and the management of               [32] K. Weinberger, M. Slaney, and R. van Zwol. Resolving tag
     information. Designing information environments. In M. J.               ambiguity. In Proceedings of the 16th International ACM
     Smith and G. Salvendy, editors, Proceedings of the                      Conference on Multimedia (MM 2008), New York, NY,
     Symposium on Human Interface 2009, held as Part of HCI                  2010. ACM Press.
     International 2009 (HCII-09), San Diego, CA, USA, July

</pre>