=Paper=
{{Paper
|id=None
|storemode=property
|title=Faceted Approach To Diverse Query Processing
|pdfUrl=https://ceur-ws.org/Vol-762/paper4.pdf
|volume=Vol-762
}}
==Faceted Approach To Diverse Query Processing==
Faceted Approach To Diverse Query Processing
Alessandro Agostini Devika P. Madalli, A.R.D. Prasad
Department of Computer Science Documentation Research and Training Centre
Prince Mohammad Bin Fahd University Indian Statistical Institute
Al-Khobar, Saudi Arabia Bangalore, India
aagostini@pmu.edu.sa {devika,ard}@drtc.isibang.ac.in
ABSTRACT strings, according to faceted theory by Ranganathan [20] and DEPA
This paper presents a formal framework for implementing a query facet analysis [7]. On the other hand, current keyword-based query-
refinement method. The method uses general principles of facet ing methods does not use DEPA strings to represent web directo-
analysis. Two key notions are advanced and discussed: diversity ries and annotating resources in digital libraries, so they seem in-
and focus. Diversity refers to the information needs of a querying adequate to search over digital repositories organized according to
user; it is captured by the notion of ‘facet’. A focus refers to how CCS and similar faced-based classification systems.
diversity is captured from the documents as organized by the user; it
provides a kind of context to the user query. The method is situated For answers to be relevant, a user must ask the appropriate query in
within the formal framework of the smallest propositionally closed order to retrieve the desired information and fulfill the information
description logic ALC, thereby betting that ALC provides us with need (IN). For keyword-based search this means that a high num-
a suitable SAT solver to implement a facet engine, which is the ber of keywords is necessary to the user to narrow down the search
main component of our method. according to her information need. This is due the semantic ambi-
guity of querying languages, often built upon natural language, as
Categories and Subject Descriptors it is the case of keyword-based querying. Unfortunately, the query
H.3.7 [Information Systems]: Digital Libraries; I.2.4 [Computing length of keyword-based search on average is reported to be short,
Methodologies]: Knowledge Representation Formalisms and Meth- with 90% of the queries being less than four keywords [12]. As a
ods consequence, the ambiguity of the query is somewhat mirrored in
the relative relevance of search results [32, 3]; diversity in search
results arises [15] and query refinement by the user is often the
General Terms only solution. To resolve such ambiguity some authors advanced
Design, Human Factors, Algorithms. the notion of ‘context’ in web search, see for instance [14, 10] and
references cited therein. However, in contect-based solutions the
Keywords user is often assumed to know how data and information are orga-
Query refinement, facet-based search, text-based search, context- nized in the search domain. This is often hard to happen in real-
based search, user issues, description logic. world, distributed scenarios like the Web, due to large amounts of
heterogeneous data organized in an unknow structure.
1. INTRODUCTION
Classical libraries had systems that processed subjects or domains In this paper we present a formal framework wherein we define a
and built representations such as subject indices. Among these sys- method for the extraction of DEPA facets from a user query. The
tem, the Colon Classification System (CCS) first proposed by S.R. facets are then used to refine the original query for search and re-
Ranganathan [20] is currently widely used by almost all Indian li- trieval purposes. The method is aimed to suggest the user a list of
braries. The CCS had enough contextual information in the method facets that the user would hardly be aware of by simply typing a
of facetisation and synthesis so that it formed a semantic formali- keyword-based query into a search engine, without any query con-
sation of the domain scope of the library collections. text. These automatically suggested new facets can be used by the
user, for instance by clicking on one of the new facets, to narrow
In order to digitize CCS and similar facet-based systems, Prasad down the search space by expanding the original user query with
and Guha [18] demonstrate the applicability of faceted schema in the suggested facet.
describing resources in web directories and annotating resources in
digital libraries using SKOS/RDF representation to express DEPA This paper is organized as follows. In Section 2 we define basic
concepts related to facet analysis. In Section 3 we discuss the first
step of our method. In Section 4 we build a formal faceted ontol-
ogy to formalize the focused terms and contexts that we succes-
sively process, in Section 5, to produce new facets to be shown to
the user for query refinement. After building the faceted ontology
and defining the facet engine, in Section 6 we present the three dif-
ferent yet related querying methods we offer to the user; these are
keyword-based, by focus, and on subject. In Section 7 we discuss
related work. In Section 8 we conclude the paper.
2. FACETS ANALYSIS 3. FOCUSED TERMS FROM TEXT
Facet analysis is essentially a conceptual analysis of the subject In the present work, we apply facetization as a technique to com-
matter, or the topical content of a concept into distinct divisions bine extensional and intensional semantics of concepts viz. queries,
that together constitute a semantic description of the concept. In or equivalently to disclose the subject of concepts and queries to the
order to build the facet repository available to a user to refine a querying user, for the purpose of query refinement and search as-
query, in this section we present some elements of facet analysis. sistance. We implement facetization in two related steps: 1. we
produce certain “focused terms” from documents organized in a
Our facets repository is organized around two main notions of the polyhierarchy, and 2. from focused terms we produc new facets
DEPA paradigm for facet analysis [6, 7]: subjects and facets. A to be shown to the user for the purpose of query refinement. We
subject of a concept is the topical content of the concept, that is, present step 1 in subsections 3.1 and 3.2 in this section, and step 2
the concept’s overall semantics, as defined by the combination of in sections 4 and 5.
extensional and intensional semantics of the concept term. The def-
inition can be extended to a query, which in its simplest form can 3.1 Organization of documents
be thought of as a finite sequence of concept terms; see subsections Although our method can be adopted as integral part of digital li-
6.1 and 6.3. A facet consists of a “group of terms derived by taking braries systems, both for describing the documents collection and
each term and defining it, per genus et differentiam, with respect for faceted querying over the collection or the web, in this paper we
for its parent class.” [31, p. 12]. According to Ranganathan [20], assumed the method assists a querying user in query refinement. As
each domain is made of distinct divisions or facets that are groups the method in this specific application uses a textual collection of
of mutually exclusive concepts and many such facets together con- documents stored in the user’s querying machine, we stipulate the
stitute a domain. The notion of such facetization has been extended following convention.
by Bhattacharyya [7] to subject indexing by representing content as
a string of fundamental categories DEPA (Discipline, Entity, Prop-
erty and Action) that are conceptually equivalent to ‘facets’. To C ONVENTION 1. We denote the set of available documents to
illustrate, we rely on the following two examples. a querying user by D. All available documents are textual, that is,
they can be processed by text information retrieval techniques as
E XAMPLE 1. Consider a document titled ‘Improving EU labour the variant of a standard technique discussed in Section 3.
market access for Rome’. DEPA facet analysis of the title leads
to facets such as: Labour Market (Entity), Access (Action), Rome
Intuitively, the domain D of documents can be thought of as the
(Space - from commonly applicable facet schedules across domains).
set of all documents the querying user has classified and stored in
The facet ‘Discipline’ is extrapolated from faceted document rep-
the querying machine.
resentation, and it is ‘Economics’.
Note that in case a concept would be classified within more than C ONVENTION 2. We assume that the querying user organizes
one discipline, as a homonymous or synonymous concept, then all documents in D by using a ‘polyhierarchical classification’, or
such different combinations of facets are taken into account and polyhierarchy.
presented to the user for further refinement.
A polyhierarchical classification is a hierarchical classification
E XAMPLE 2. Consider a document titled ‘Treating Apple trees permitting some concept terms to be listed in multiple categories
for bacterial disease in Trentino’.1 DEPA facet analysis provides a of a taxonomy, or branches of a hierarchy [16]. An example of
classification of the document into the following facets: Agriculture polyhierarchy can be found in Figure 1. Note that what makes the
(D), Apple Trees (E), Treating (A), Disease (P), and Bacterial (as hierarchical classification in Figure 1 be polyhierarchical is the con-
‘Modifier’ to P, cf. [6]). cept term ‘Apple’.
We are now ready to define the facet repository for a given context. Cx:MyClassification
A facet repository for a context C is the set
F R(C) = {⟨C : d, e, p, a⟩ | C has DEPA facets d, e, p, a}, Computers Fruit
where C is a concept description in description logic ALC (see Apple orange Apple
subsection 4.2) of a concept or subject of interest in context C, and
d, e, p, a are, respectively, a Discipline, Entity, Property and Action doc1 doc1
in DEPA classification system. doc3
E XAMPLE 3. Consider the previous two examples. We can as- Figure 1: A polyhierarchy, or polyhierarchical context Cx.
sume that ‘Improving EU labour market access for Rome’ is rep-
resented by a concept description C1 , and ‘Treating Apple trees
A subset of documents is organized in ‘contexts’, each context be
for bacterial disease in Trentino’ is represented by a concept de-
organized into related sets of documents. A context is a polyhier-
scription C2 in a context C. The facet repository F R(C) contains
archical classification composed by sets of documents, i.e., ‘nodes’
⟨C1 : Economics, LabourM arket, p, Access⟩ for p is unspeci-
of the polyhierarchy, called clusters, and a relation over the nodes
fied, and ⟨C2 : Agriculture, AppleT rees, Disease, T reating⟩.
as defined by the polyhierarchy. Typical relations are the binary re-
1 lations of subsumption, part-of, is-a, among others relations. Each
Trentino is a Province of the Italian North-east known for the
Dolomites and for its quality production of red and yellow apples. cluster in a given context has a name composed by a finite sequence
of words from a representation language, often a natural langiage C; Card (F C) is the number of focuses in C with leaf C, and
thereby betting that clusters are named by a human—the querying doCKu [k] is the total number of clusters in the set
user, who naturally applies her native language for clusters nam-
C \ {C ′ | C ′ ̸= C is a cluster in a focus in C with leaf C} (2)
ing. A cluster’s name in such representation language is referred
to as concept term. A concept is a concept term provided with a which contain k. Intuitively, (1) says that, in order to represent the
semantics. Two kinds of semantics are provided to a concept term: extensional semantics of a focus, the importance of a retrieved term
an extensional semantics, defined over the documents in the cluster for a cluster, i.e., the value of Wu [k, C], is inversely proportional
named by the concept term; and an intensional semantics, defined to the number of different focuses with C as leaf which contain the
by the unique position of the concept term in a given ‘focus’. term.
Contexts provide a way to define finite, ordered sequences of con- The label of a cluster C is the most representative term or sequence
cept terms, each sequence called a focus. A focus consists of an of terms for the cluster. Now we want compute the label of all clus-
ordered set of related concept terms, each concept term naming a ters of a given context. For doing this, we process all documents
cluster built upon the collection of documents in D. Intuitively, a stored in each cluster by considering the position of each cluster
focus is a path of concept terms corresponding to a path in a given in the context. To define the process formally, we rely on the fol-
context. Figure 2 provides an example of both a context (left-hand lowing technical definition. Let context C organize (a subset of)
side) and a focus (right-hand side). With reference to Figure 2, we documents in D and cluster C in C be given. We define
write Cx:Fruit>Trentino>Apple to denote the focus named ‘Ap-
ple’ in the context Cx. In boldface are written two documents in IR (D, C, C) = {k ∈ Text (d) | d ∈ C, C in C}. (3)
the cluster ‘Apple’: docRdoc and docGtxt. We expect that the label of cluster C in (3) is the most representa-
tive term or sequence of terms in IR (D, C, C). The most represen-
Cx:Fruit CxF:Fruit tative term among terms in IR (D, C, C) is the term with the high-
est weight among all terms in IR (D, C, C) according to weighting
orange Trentino Trentino measure 1. Formally, a term k in IR (D, C, C) is the most repre-
sentative for the cluster C in C, and we say that k is a label of C,
Apple
Apple if Wu [k′ , C] ≤ Wu [k, C] for all terms k′ in IR (D, C, C). A se-
docRdoc quence k 1 , k2 , ...k n of terms in IR (D, C, C) is a label of C if (a)
docRdoc
docGtxt Wu [ki , C] = Wu [k, C] for i = 1, 2, ...n, and (b) k is a label of C.
docGtxt
L EMMA 1. Every cluster C organized by a querying user u in
Figure 2: An example of context (left) and focus (right). a context C has a label if and only if C contains a document d such
that Text (d) is nonempty.
3.2 Concept terms grounded in documents To compute a label of every nonempty cluster C of a given context
In this section, our goal is to automatically assign a ‘label’ to every C, we exhibit an algorithm that produces the label lC of C; see
cluster of a given context. Each cluster’s label produced by Algo- Algorithm 1. Set IR = IR (D, C, C).
rithm 1 below is a finite, simple concatenation of terms with max-
imum ‘weight’, extracted by using Text (·). Formally, we proceed Algorithm 1 Context-based cluster labeling.
as follows.
Input: C, D ̸= ∅
Let Text (·) be a text extraction function. In this paper, we refer to foreach C in C with C ̸= ∅ do
Text (·) as a standard keywords extraction function, for instance see foreach k ∈ IR (D, C, C) do
[25, Sec. 4]. Given a document d, Text (d) listes all the keywords compute Wu [k, C] according to formula (1) od;
in d, precisely, the most frequent ‘tokens’. Applied to a document compute M = {k ∈ IR | ∀k′ ∈ IR, Wu [k′ , C] ≤ Wu [k, C]};
d, Text (·) produces a set Text (d) of terms (or ‘keywords’). Let d Let n be the cardinality of M ;
be any document in D. As terms are defined from documents, from Let {k1 , k 2 , . . . , kn } be the lexicographical ordering of M ;
now on we write k ∈ Text (d) to denote a generic term retrieved by Set l0 = ∅; /* empty sequence */
using Text (·) d. Given a document d, we rank a term k ∈ Text (d) for i = 1 to i = n do
by adapting IR standard TF/IDF (“Term Frequency / Inverse Docu- Pick ki ∈ M ;
ment Frequency”) method [22, 23] to deal with contexts and unique Set li = li−1 ki od od; /* simple concatenation */
concept terms’ position, i.e., focus, within a context. Observe that Define lC = ln
in the following, for a given context C we write ‘C in C’ in place of Return : set of labels {lC | C in C, C ̸= ∅}.
‘C in C’ set of clusters’ for every cluster C.
Let querying user u organizing a context C, cluster C in C, and term Observe: 1. If C ̸= ∅ then IR ̸= ∅. 2. The label lC computed by
k ∈ Text (d) for a document d ∈ D be given. We define the weight Algorithm 1 in not unique. In fact, M in Algorithm 1 is assumed to
of k in C as follows: be ordered according to lexicographical ordering. Other orderings
∑ of the elements in M are possible and, as a consequence, a different
Card (F C) label can be generated from each ordering.
Wu [k, C] = ( TF[k, d]) · log , (1)
doCKu [k]
d∈C
where∑TF[k, d] is the total number of occurrences of k in d, so E XAMPLE 4. To illustrate how Algorithm 1 works, consider
that d∈C TF[k, d] is the total number of occurrences of k in the context Cx in Figure 2. The result of applying Algorithm 1 to
Cx, limited to focus CxF in Figure 2 is depicted in Figure 3. Each as explained. Due to the limitation of space, we do not provide a
label in the three, e.g., lApple , is a simple concatenation k1 ...kn of detailed introduction of Description Logics (DLs), but rather point
terms extracted by Algorithm 1. the reader to [5, 4] and offer the reader an example.
lFruit
E XAMPLE 5. Consider the labeled focus in Example 4. We can
... lTrentino
represent it within ALC by a set of equality axioms, that we present
as labels of the labeled focus in Figure 4. The concept descrip-
lApple
Fruit ≡ ∃hasK.k 31 ⊓ · · · ⊓ ∃hasK.k 3n
Figure 3: A focus as labeled by Algorithm 1.
... Trentino ≡ ∃hasK.k21 ⊓ · · · ⊓ ∃hasK.k2m
We are now ready to define “focused terms.” Let a focus F with
concept term C as leaf be given. A focused term for F is any term
that appears in a label lC of a cluster C in F . In symbols, the set Apple ≡ ∃hasK.k11 ⊓ · · · ⊓ ∃hasK.k1p
of focus terms for F is
F T (F ) = {k | k appears in lC , C ∈ F}. Figure 4: A labeled focus in ALC.
A focused term for C is any term that appears in lC . A focus term
for a concept term plays the role of a synonymous, or alias names, tions kji that appear in the tree refer to the focused terms extracted
of the concept term. As we will see in Section 6, alias names are by Algorithm 1 for each concept in the focus; hasK is a named
important to improve keyword-based querying. role, which is intuitively interpreted as ‘has keyword’. For exam-
ple, ∃hasK.k31 intuitively means that concept term ‘Fruit’ in focus
F :Fruit>Trentino>Apple is extended with focused term (keyword)
4. FACETED ONTOLOGY BUILDING k31 . Each equality axiom that appears along the tree defines in ALC
The result of extracting terms from documents and “facetizing” a concept term in F; the focus itself is formalized by the equality
the concepts of a polyhierarchical classification by using them pro- axiom: FocusApple ≡ Apple ⊓ ∃R.(Trentino ⊓ ∃R.Fruit). An
duces a basic kind of faceted taxonomy, provided that (1) the ex- ALC KB for this example is the set of the three equality axioms
tracted terms or, often, a proper subset of these [9], are matched depicted along the tree plus the equality axiom that defines ‘Fo-
with a predefined set of facets, and (2) the clusters in a focus are cusApple’ as the ‘focus Apple’, i.e., the focus F.
related to each other by a subsumption relation. For a faceted tax-
onomy consists of: (a) a set of facets, where each facet consists of a
predefined set of terms; and (b) a subsumption relation among the 4.2 Formal Faceted Classifications
terms. In this section we provide the formal framework we need to Now we generalize the example. Algorithm 2 below provides a
formalize the focused terms and labeled contexts we have produced way to build an ALC faceted knowledge base, or faceted ontology,
by Algorithm 1 by shallowly assuming (2)2 . for a given context. The algorithm works in two main steps.
First, it builds a knowledge base by adding ALC equality axioms
4.1 Description Logics that formally define the concept terms of an input context by us-
Description Logics (DLs) [5] are a family of logic-based knowl- ing focused terms computed by Algorithm 1 over the same context.
edge representation formalisms designed to represent and reason For maching purposes that we will see in Section 5, if strictly more
about the knowledge of an application domain in a structured and or strictly less (but at least one) focused terms were computed for
well-understood way. In this paper, we use a basic description a concept term, then the algorithm adds to the knowledge base all
logic, called ALC, thereby betting that ALC provides us with an the equality axioms defined over all possible combinations of four
efficient SAT solver to implement our facet engine (Section 5). focused terms picked up, possibly with repetitions, from the com-
ALC is the smallest propositionally closed DL, and provides the puted terms.
concept constructors
¬ C, C ⊓ D, C ⊔ D, ∃R.C, ∀R.C, Second, the algorithm adds to the knowledge base so obtained all
ALC equality axioms that formally define DEPA facets of every
as well as concept inclusion (or subsumption) C ⊑ D and concept concept as stored in the facet repository (see Section 2). These
equality C ≡ D, where C, D are concept descriptions and R is a axioms have the form
named role. A DL knowledge base (KB) consists of concept ax-
ioms (such as concept inclusion and concept equality axioms), role C ≡ ∃F acetD.d ⊓ ∃F acetE.e ⊓ ∃F acetP.p ⊓ ∃F acetA.a, (4)
axioms (such as functional role axioms) and assertions of the form
where C represents a concept c available in the facet repository,
C(a), R(a, b) where a and b are named individuals. For the goal
F acetD, F acetE, F acetP , and F acetA are named roles rapre-
of this paper, we use a limited part of ALC’s expressive power;
senting the property of c in terms of DEPA facet analysis paradigm.3
in particular we do not use role axioms and assertions. Moreover,
The intended interpretation of these named roles relates to the facet
we write concept descriptions in lower case, as concept description
repository. For example, ∃F acetD.f means that there is a concept
from now on are terms extracted by Algorithm 1 from documents
in the facet repository with facet ‘Discipline’ be f . By extension,
2 equality axiom (4) means that there is a concept in the facet reposi-
That in our approach clusters in a focus are related to each other
by a subsumption relation follows from Convention 2 by observing tory with facet ‘Discipline’ d, ‘Entity’ e, ‘Property’ p, and ‘Action’
that polyhierarchical classifications are often subsumption hierar-
3
chies. However, we do not need to strictly assume (2) in this paper. To shorten notation, in algorithms we use D, E, P , A instead.
a, and that concept has name C. Hence, as per second step, Algo- Algorithm 3 Query expansion with facets from focused terms.
rithm 2 adds to the knowledge base all axioms of form as in (4) if proc QueryExpansion
and only if there is a concept (or a subject) with DEPA facets d, e, Input: C, O, F R(C) /* C is meant to represent user query */
p, a in the facet repository. We make the system insensitive to case Define ΩK be the set of axioms in O of the form
and punctuation in the facets d, e, p, a by adding additional axioms C ≡ ∃hasK.k1 ⊓ · · · ⊓ ∃hasK.kn ; /* k1 ...kn = lC */
where variants of d, e, p, a with the same meaning are used. We Define ΩF be the set of axioms in O of the form
call the ontology produced by Algorithm 2 a formal faceted classi- C ≡ ∃D.d ⊓ ∃E.e ⊓ ∃P.p ⊓ ∃A.a;
fication (FFC). /* ⟨C : d, e, p, a⟩ is in F R(C) */
if ΩK ∨ ΩF = ∅
Algorithm 2 Building a ALC faceted ontology O. then exit /* no query exspansion provided */
Input: C, D = ̸ ∅, F R(C) else
Set O = ∅; /* ALC ontology to be built */ s := Card (ΩK ); /* ΩK cardinality is s ≥ 1 */
foreach C in C with C ̸= ∅ do t := Card (ΩF ); /* ΩF cardinality is t ≥ 1 */
lC := ⟨k1 k2 · · · k(n )⟩; /* lC computed by Algorithm 1 */ F acetSet(C) := ∅; /* set of facets retrieved for C */
for i = 1 to i = n4 do for j = 1 to j = s do
O := O ∪ {C ≡ ∃hasK.ki1 ⊓ · · · ⊓ ∃hasK.ki4 } od; F00 := ∅; /* different facets strings retrieved */
if ⟨C : d, e, p, a⟩ ∈ F R(C) /* by using a single axiom in ΩK */
/* facets d, e, p, a for C in facet repository */ for l = 1 to l = t do( )
then for i = 1 to i = n4 do
O := O ∪ {C ≡ ∃D.d ⊓ ∃E.e ⊓ ∃P.p ⊓ ∃A.a}; if O |= ∃hasK.ki1 ⊓ · · · ⊓ ∃hasK.ki4 } ≡
/* axiom of form in (4) added */ ∃D.d ⊓ ∃E.e ⊓ ∃P.p ⊓ ∃A.a
fi od; /* focused terms and DEPA facets match */
Return : O. then
Fli := Fli−1 ∪ {⟨C : d, e, p, a⟩}
/* ⟨C : d, e, p, a⟩ retrieved */
/* depending on ki1 ,...,ki4 */
fi od
5. FACET ENGINE od;
Now we design within our framework a facet engine that computes F acetSet(C)j := F acetSet(C)j−1 ∪ Fli
the matching between the focused terms of a input context and the /* all DEPA strings for C in F R(C) retrieved */
predefined set of facets stored in the facet repository for a num- od fi;
ber of concepts. Intuitively, the facet engine looks at all keywords Return : F acetSet(C)j .
generated for each concept name in a focus for all focuses of the
hierarchy, and browse through the focus from the root to the leaf to
identify what keywords are DEPA facets stored in facet repository.
The facet engine’s main component is Algorithm 3. The basic steps
of the algorithm are the following:
6. QUERY PROCESSING
After building the faceted ontology and defining the facet engine
Step 1. Input a concept description C that represents a user’s query; we are ready to use them to provide new facets to the user for
the different possible queries that can be represented this way are query refinement. We allow the user to make three kind of query:
presented in Section 6. keyword-based, by focus, and on subject. We discuss each query-
ing method in turn.
Step 2. Find and retrieve from the ontology built by Algorithm 2
all equality axioms that define C in the ontology either by focused
terms or DEPA facets. If no axioms do exist, that is, C is not de- 6.1 Keyword-based querying
fined according to the knowledge stored in the ontology, the algo- The user types one or more keywords in the search box. This
rithm ends with no help to the user. This state means that the search method is the simplest one and it is often the only method available
engine cannot provide the user with help for query refinement by when the user does not know anything about the subject to search,
facets. or the user’s knowledge on the query subject is not based on doc-
uments locally stored in the user querying machine, so that we can
Step 3. For all retrieved axioms and for each axiom of the form not use the ontology and facet engine we have advanced. This is
C ≡ ∃hasK.k1 ⊓ · · · ⊓ ∃hasK.kn , where lC = k1 ...kn is the la- also a tyipical case of keyword-based querying by common search
bel computed by Algorithm 1, the algorithm runs the ALC SAT engines, where the keywords used in the query are listed without a
solver in order to match (focused) terms ki in the axiom to all specific ordering on the only basis of the user’s information need.
DEPA facets for C possibly stored in the facet repository. Note
that the performance of our method mainly dependents on this step, We deal with this method of querying as follows. Each keyword is
namely, the number and complexity of the matchings. Preliminary mapped to zero or more concept terms in the context C. We do that
results suggested that the algorithm satisfies the requirements of using an exact string match of the keyword to the concept term or
a query refinement system in terms of real time performance. A one of its alias names, namely, its focused terms.
complete study of the complexity of this step is in progress.
If no concept term and its alias names match any keyword, no con-
Step 4. For all successful matchings computed in Step 3, the re- cept description is available to the facet engine, and as a conse-
trieved DEPA facets are output and shown to the user. quence no facets for query refinement are shown to the user.
If one concept term or its alias names match some keywords, then Computers Fruit
the concept description C of the concept term is generated and pro-
cessed by Algorithm 3 for query expansion. The facets that occur in Apple orange Apple
the query expansion are shown to the user. When selecting one of
doc1 doc1
the new facets, the user will narrow down the search by expanding
doc3
the original query with the suggested facet.
If multiple concept terms match some keywords, then the concept Figure 6: Position of sample document doc1 matters.
description of each term is generated and processed by Algorithm
3 for query expansion. The facets that occur in the query expansion
of every concept description are shown to the user. Alternatively, suppose the user selects the document named doc1 as the sample
the user is given the option to refine their query to indicate which document. In classical querying by example, a relevant answer to
concept term, namely, keyword they meant the most. the user would be any document about ‘apple’, as meant as either
a fruit or a computer. In contrast, using querying by focus the only
6.2 Querying-By-Focus relevant answers to the user would be documents from one of the
Now suppose that the user knows at least something about the sub- two focus Fruit>Apple and Computers>Apple.
ject to search, and the user’s knowledge comes from documents
stored and polyhierarchically organized in the user’s document col- We deal with querying by focus as follow. First, a concept descrip-
lection. In this case, it would always be desiderable for the user tion C of the concept term that is the leaf of the focus is generated
to get better and better understanding of the hidden content of the and processed by Algorithm 3 for query expansion. The facets that
query, as it is automatically generated by a suitable method, so as to occur in the query expansion are shown to the user. When select-
discover new facets of the original query that the user was not aware ing one of the new facets, the user will narrow down the search by
of before. For example, suppose the query is ‘apple’ as contextual- expanding the original query with the suggested facet.
ized in Figure 5. The user clicks on a concept term in a context C.
In doing that, the user selects a focus in C. Alternatively, the user Note that the case where query by focus applies in practical sit-
types some keywords as in keyword-based querying, but in a spe- uations is not as uncommon as it may seem, because almost all
cific order to mean a focus in C. For example, the user may click users start a search from a device storing text and text-annotated
on (an appropriate graphic-version of) ‘Apple’ in context or either documents, and these are often organized by the user according to
a polyhierarchical classification. More importantly, the fact that a
Cx:Fruit user searches the Web does not mean that documents from the Web
will be used for the purpose of querying by focus. The documents
orange Trentino used for querying by focus are all and only the documents locally
stored in the user’s querying device, whatever the search objective
Apple is either to retrieve documents stored in the user’s device or in the
Web. As a consequence, querying by focus clearly scales to the
size of the web. To understand a bit further, recall that our method
Figure 5: A focus for query ‘Apple’. is about query refinement, it is not a query search method. We use
standard methods and search engines to search; the difference is
type keywords ‘fruit’, ‘trentino’, ‘apple’ in this order, as to mean that the keywords we let the search engines to use are automati-
Cx:Fruit>Trentino>Apple. In the example, by selecting the facet cally generated by our facetization technique.
‘Fruit’ the user would narrow down the search space by excluding
all subjects about Apple Computers and related subjects as search 6.3 Querying-On-Subject
results (see Figure 1). Similarly, by selecting facet ‘Trentino’ the Subject-based querying is the most common approach by special-
user would be able to narrow down the search space by excluding ized users, where ‘subject’ refers to the topical intent of a query (cf.
all subjects about fruits that are not related to Trentino’s production Section 2). In our faceted approach to representation of documents
of apples. It follows that the keyword-based method and query- in collection D, ‘subjects’ are broken down into distinct divisions,
ing by focus are not equivalent for at least one reason, that is, in the facets of subject. A typical ‘query-on-subject’ is deemed to re-
keyword-based querying the order of keywords does not matter, in late to a specific subject of a preexisting faceted classification. For
querying by focus does. The other main difference between these example, a subject-based query is: ‘What are the documents on the
two querying methods arises looking at query processing. The dif- effects of nitrogen fertilizers on rice plants?’ The subject of the
ference is that concept terms in a focus are not ‘pure’ keywords; a concept subsumed by this query is one of possibly many focuses,
concept term is represented by a string of similar keywords as gen- for example the following:
erated by Algorithm 1. Concept terms relate to documents in the
user’s repository, while keywords are usually unrelated to the user’s Cx:rice plants>nitrogen fertilizers>effects. (5)
documents. This is a partial focus, in the sense that the discipline subsumed by
the query as provided by the DEPA facet analysis is
A query-by-focus is similar to a query by example, yet it is more
specific. In querying by example, a sample document (the exam- Cx:Agriculture>rice plants>nitrogen fertilizers>effects. (6)
ple) is selected by the user to refine the query. On the other hand, in Another possible focus for the subject of query’s concept is the
querying by focus the position of the sample document is also con- following:
sidered, that is, the place the document is stored within the user’s
documentary repository. To illustrate, suppose that a user stores his C ′ x:Agriculture> effects of nitrogen> fertilizers>rice plants.
documents according two different structures, see Figure 6. Now (7)
A number of different but equivalent focuses could exists for a Normalized Formal Classifications (NFC) used in [11] does this by
given subject-based query. Note the the existance of a focus for taking into account both the label of the node and its position using
this query as well as the focus form depend only upon the querying natural language processing techniques (see [11, sec 4]). On the
user’s classification of documents. The take-away point is that by other hand, we have used an information retrieval technique to find
merging a subject to one or more focuses, by automatically trans- out the keywords that will successively represented in concept de-
forming a query-on-subject to a query-by-focus, the method pro- scriptions by using role names of the form hasK.k. This is an im-
vides the user with assistance in query refinement. In fact, we com- portant difference with [11]. A focus is called “concept at a node”
pute the focuses generated from the query on subject, and for each in [11, p. 70], although we believe that the two notions are not to-
focus we consider the concept description that represents the focus tally equivalent (to be investigated). The notion of Formal Faceted
in ALC ontology computed by Algorithm 2. Then we proceed as Classification (FFC) extends the notion of “lightweight ontology”
in the case of querying by focus and compute the query expansion of [11] to facets. A main difference with lightweight ontologies by
of the focus according to knowledge stored in the ontology. Finally, [11] is that FFC’s descriptive language is not propositional as the
the retrieved facets are shown to the user. If multiple focuses are language used in [11]. Yet, it allows us to automate, through DL
computed from the query’s subject, the user is given the option to reasoning services (SAT), query refinement, as we did in this paper.
refine the original query to indicate which focus they meant for the Moreover, by our query language we allow a user to specify a query
searched subject. by selecting a sample document, to be interpreted of as the “infor-
mation need” of documents similar to the sample selected. As a
consequence, we provide a user with a mechanism of “querying by
7. RELATED WORK example” as a special case. On the other hand, in [11] it seems not
easy to formalize querying by example, as the propositional lan-
There has been extensive work on automated facet construction
guage used does not allow to represent instances.
motivated by query refinement, browsing and navigation over doc-
ument collections, see for instance [29], [8, 9], [10], .[24], [30, 13].
The notion of context in these related works differ from the notion 8. CONCLUSION
of focus; in [10] context is a piece of text, from a document the user This paper presented a formal framework for a querying refine-
is presented to, surrounding the query, which is marked by the user ment method that enables the extraction of the diversity aspects,
on the document. The structural nature of a focus contrasts with the or facets, of a user query. The method uses the general princi-
plain, linguistic nature of query context as meant in [10]. The nav- ples of facet analysis in the DEPA paradigm of facetization and
igation trees discussed in [28] are similar to the focuses discussed the notion of ‘focus’, which is used to infer new facets from the
in this paper. The formal approach of [28], moreover, as well as user query. The method provides a user with additional and essen-
the use of faceted taxonomies is close in spirit, if not in the formal tial contextual information, in form of new facets. When select-
development to our work presented here. As far as we know, none ing one of the new facets, the user can narrow down the search by
of the foregoing approaches uses a DEPA facet schema. expanding the original query with the suggested facets. The pro-
posed method of query refinement is based on diversity in query-
Our method is a focused retrieval method, in the sense that focused ing and a multi-dimensionality of information. Three methods of
retrieval addresses ways to provide a querying user a more direct querying weree discussed: keyword-based, by focus, and on sub-
access to relevant information [26]. Focused retrieval aims to iden- ject. For each method, textual and structural dimensions were used
tify not only documents relevant to a user information need, but to assist the user in query refining. The textual dimension allowed
also where within the document the relevant information is located. us to generate the top-k most relevant terms for each concept of
Our approach of querying-by-focus is similar to querying by focus a given polyhierarchy of text and text-annotated documents. The
on hierarchical classifications proposed by [1, 2]. structural dimension of the polyhierarchy was used to match DEPA
facets with the user query. We have situated our framework within
In the Indian Context, faceted library systems, especially the Colon the smallest propositionally closed description logic ALC, and we
Classification System (CCS), has been adopted by majority of the have used ALC’s solver to implement the facet engine as the main
academic libraries for organizing collections in semantic arrange- component of the method.
ment. However, there is a wide scope for use of the faceted theory
behind systems such as CCS to other knowledge modeling efforts.
Prasad and Guha [18] intoduced a facet-based method to formu- 9. REFERENCES
late the descriptive domain metadata that could be used to anno- [1] A. Agostini and P. Avesani. On the discovery of the semantic
tate digital library resources. Prasad and Madalli [19] propose a context of queries by game-playing. In H. Christiansen,
generic model for building semantic infrastructure for digital li- M.-S. Hacid, T. Andreasen, and H. Larsen, editors,
braries based on facets as used in traditional library classification Proceedings of the Sixth International Conference On
systems. Flexible Query Answering Systems (FQAS-04), pages
203–216, Berlin Heidelberg, 2004. Springer-Verlag LNAI
Faceted taxonomies are extensively studied, see for instance [21, 3055.
27, 28] and references therein. Facet techniques include that stud- [2] A. Agostini and G. Moro. Identification of communities of
ied by Tvaroẑek and Bieliková [27], who have proposed faceted peers by trust and reputation. In D. F. C. Bussler, editor,
navigation and its personalization in digital libraries. They follow Proceedings of the Eleventh International Conference on
a method of faceted browser adaptation based on an automatically Artificial Intelligence: Methodology, Systems, Applications -
acquired user model with support for dynamic facet generation J. Semantic Web Challenges (AIMSA-04), pages 85–95, Berlin
Polowinski [17] argues for use of Faceted Browsing as a visual se- Heidelberg, 2004. Springer-Verlag LNAI 3192.
lection mechanism to browse data collections as it is deemed as [3] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong.
being particularly suitable for structured, but heterogeneous data Diversifying search results. In Proceedings of the Second
with explicit semantics. ACM International Conference on Web Search and Data
Mining (WSDM-00), pages 5–14, New York, NY, 2009. 19-24, 2009, pages 601–610, Berlin Heidelberg, 2009.
ACM Press. Springer-Verlag LNCS 5617.
[4] F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and [18] A. Prasad and N. Guha. Expressing faceted subject indexing
P. Patel-Schneider, editors. Handbook of Description Logics, in SKOS/RDF. In Proceedings of the First International
Cambridge, UK, 2002. Cambridge University Press. Conference of Semantic Web and Digital Libraries,
[5] F. Baader and W. Nutt. Basic description logics. In F. Baader, Bangalore 21-23 February (ICSWDL-07), 2007.
D. Calvanese, D. M. Guinness, and P. P.-S. D. Nardi, editors, [19] A. Prasad and D. Madalli. Semantic digital faceted
Handbook of Description Logics, pages 47–100. Cambridge infrastructure for semantic digital libraries. Library Review,
University Press, Cambridge, UK, 2002. 57(3):225–234, 2008.
[6] G. Bhattacharyya. POPSI: its fundamentals and procedure [20] S. R. Ranganathan. Prolegomena to Library Classification.
based on a general theory of subject indexing languages. Asia Publishing House, London, 1967.
Library Science with a Slant to Documentation, 16(1):1–34, [21] G. Sacco and Y. Tzitzikas, editors. Dynamic Taxonomies and
1976. Faceted Search, The Information Retrieval Series, v. 25,
[7] G. Bhattacharyya. Subject indexing language: its theory and Berlin Heidelberg, 2009. Springer-Verlag.
practice. In Proceedings of the DRTC Refresher Seminar–13, [22] G. Salton, editor. The SMART Retrieval
New Developments in LIS in India, Bangalore, India, 1981. System—Experiments in Automatic Document Retrieval,
DRTC, ISI Bangalore Centre. Englewood Cliffs, NJ, 1971. Prentice-Hall Inc.
[8] W. Dakka, R. Dayal, and P. Ipeirotis. Automatic discovery of [23] G. Salton and M. McGill. Introduction to Modern
useful facet terms. In Proceedings of the ACM SIGIR 2006 Information Retrieval. McGraw-Hill, New York, NY, 1983.
Workshop on Faceted Search, New York, NY, 2006. ACM [24] E. Stoica, M. A. Hearst, and M. Richardson. Automating
Press. creation of hierarchical faceted metadata structures. In
[9] W. Dakka and P. Ipeirotis. Automatic extraction of useful Proceedings of the Human Language Technology Conference
facet hierarchies from text databases. In Proceedings of the (NAACL HLT), pages 244–251, Rochester, NY, USA, 2007.
2008 IEEE 24th International Conference on Data Association for Computational Linguistics.
Engineering (ICDE-08), pages 466–475, Washington, DC, [25] P. Tonella, F. Ricca, E. Pianta, and C. Girardi. Using
USA, 2008. IEEE Computer Society. keyword extraction for web site clustering. In K. Wong,
[10] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, editor, Proceedings of the Fifth International Workshop on
G. Wolfman, and E. Ruppin. Placing search in context: The Web Site Evolution (WSE-03), pages 41–48, Amsterdam, The
concept revised. In Proceedings of the Tenth International Netherlands, 2003. IEEE Computer Society.
World Wide Web Conference (WWW-2001), pages 406–414, [26] A. Trotman, S. Geva, J. Kamps, M. Lalmas, and V. Murdock.
New York, NY, 2001. ACM Press. Current research in focused retrieval and result aggregation.
[11] F. Giunchiglia, M. Marchese, and I. Zaihrayeu. Encoding Special Issue in the Journal of Information Retrieval,
classifications into lightweight ontologies. In S. Spaccapietra 13(5):407–411, 2010.
and et. al., editors, Journal on Data Semantics VIII, pages [27] M. Tvaroẑek and M. Bieliková. Personalized faceted
57–81. Springer-Verlag LNCS 4380, Berlin Heidelberg, browsing for digital libraries. In L.ács, N. Fuhr, and
2007. C. Meghini, editors, Research and Advanced Technology for
[12] B. J. Jansen, A. Spink, and T. Saracevic. Real life, real users, Digital Libraries. Proceedings of the 11th European
and real needs: a study and analysis of user queries on the Conference on Digital Libraries (ECDL-07), Budapest,
web. Information Processing & Management, Hungary, September 16-21, 2007, pages 485–488, Berlin
36(2):207–227, 2000. Heidelberg, 2007. Springer-Verlag LNCS 4675.
[13] K. Latha, K. R. Veni, and R. Rajaram. AFGF: An automatic [28] Y. Tzitzikas, N. Spyratos, P. Constantopoulos, and
facet generation framework for document retrieval. In A. Analyti. Extended faceted taxonomies for web catalogs.
Proceedings of the 2010 International Conference on In Proceedings of the Third International Conference on Web
Advances in Computer Engineering (ACE-2010), pages Information Systems Engineering (WISE-02), pages
110–114, Washington, DC, USA, 2010. IEEE Computer 192–204, 2002.
Society. [29] R. van Zwol and B. Sigurbjörnsson. Faceted exploration of
[14] S. Lawrence. Context in Web Search. IEEE Data image search results. In Proceedings of the Nineteenth
Engineering Bulletin, 23(3):25–32, 2000. International World Wide Web Conference (WWW-10), pages
[15] V. Maltese, F. Giunchiglia, K. Denecke, P. Lewis, C. Wallner, 961–970, New York, NY, 2010. ACM Press.
A. Baldry, and D. Madalli. On the interdisciplinary [30] E. Vee, U. Srivastava, J. Shanmugasundaram, P. Bhat, and
foundations of diversity. In G. Boato and C. Niederee, S. A. Yahia. Efficient computation of diverse query results. In
editors, Proceedings of the First International Workshop on Proceedings of the 2008 IEEE 24th International Conference
Living Web at ISWC-09, Washington D.C., USA, October 26, on Data Engineering (ICDE-08), pages 228–236,
2009. CEUR-WS, 2009. Washington, DC, USA, 2008. IEEE Computer Society.
[16] P. Morville and L. Rosenfeld. Information architecture for [31] B. Vickery. Faceted classification: A guide to construction
the World Wide Web, 3rd edition. O’Reilly Media, Inc., and use of special schemes. Aslib - Asia Publishing House,
Sebastopol, CAe, 2006. London, 1960.
[17] J. Polowinski. Human interface and the management of [32] K. Weinberger, M. Slaney, and R. van Zwol. Resolving tag
information. Designing information environments. In M. J. ambiguity. In Proceedings of the 16th International ACM
Smith and G. Salvendy, editors, Proceedings of the Conference on Multimedia (MM 2008), New York, NY,
Symposium on Human Interface 2009, held as Part of HCI 2010. ACM Press.
International 2009 (HCII-09), San Diego, CA, USA, July