Faceted Approach To Diverse Query Processing

Alessandro Agostini
aagostini@pmu.edu.sa
Department of Computer Science, Prince Mohammad Bin Fahd University, Al-Khobar, Saudi Arabia

Devika P. Madalli
devika@drtc.isibang.ac.in
Documentation Research and Training Centre, Indian Statistical Institute, Bangalore, India

A. R. D. Prasad
Documentation Research and Training Centre, Indian Statistical Institute, Bangalore, India

Categories and Subject Descriptors: H.3.7 [Information Systems]: Digital Libraries; I.2.4 [Computing Methodologies]: Knowledge Representation Formalisms and Methods

General Terms: Design, Human Factors, Algorithms

Keywords: Query refinement, facet-based search, text-based search, context-based search, user issues, description logic

This paper presents a formal framework for implementing a query refinement method. The method uses general principles of facet analysis. Two key notions are advanced and discussed: diversity and focus. Diversity refers to the information needs of a querying user; it is captured by the notion of 'facet'. A focus refers to how diversity is captured from the documents as organized by the user; it provides a kind of context to the user query. The method is situated within the formal framework of the smallest propositionally closed description logic ALC, thereby betting that ALC provides us with a suitable SAT solver to implement a facet engine, which is the main component of our method.

INTRODUCTION

Classical libraries had systems that processed subjects or domains and built representations such as subject indices. Among these systems, the Colon Classification System (CCS), first proposed by S.R. Ranganathan [20], is currently used by almost all Indian libraries. The CCS carried enough contextual information in its method of facetisation and synthesis to form a semantic formalisation of the domain scope of the library collections.

In order to digitize CCS and similar facet-based systems, Prasad and Guha [18] demonstrate the applicability of faceted schemas in describing resources in web directories and annotating resources in digital libraries, using a SKOS/RDF representation to express DEPA strings according to the faceted theory of Ranganathan [20] and DEPA facet analysis [7]. On the other hand, current keyword-based querying methods do not use DEPA strings to represent web directories or to annotate resources in digital libraries, so they seem inadequate for searching digital repositories organized according to CCS and similar facet-based classification systems.

For answers to be relevant, a user must ask the appropriate query in order to retrieve the desired information and fulfill the information need (IN). For keyword-based search this means that the user needs a high number of keywords to narrow down the search according to her information need. This is due to the semantic ambiguity of querying languages, which are often built upon natural language, as is the case for keyword-based querying. Unfortunately, the query length of keyword-based search is reported to be short on average, with 90% of the queries being less than four keywords [12]. As a consequence, the ambiguity of the query is somewhat mirrored in the relative relevance of search results [32,3]; diversity in search results arises [15], and query refinement by the user is often the only solution. To resolve such ambiguity some authors advanced the notion of 'context' in web search; see for instance [14,10] and references cited therein. However, in context-based solutions the user is often assumed to know how data and information are organized in the search domain. This rarely holds in real-world, distributed scenarios like the Web, due to large amounts of heterogeneous data organized in an unknown structure.

In this paper we present a formal framework wherein we define a method for the extraction of DEPA facets from a user query. The facets are then used to refine the original query for search and retrieval purposes. The method aims to suggest to the user a list of facets that the user would hardly be aware of when simply typing a keyword-based query into a search engine, without any query context. The user can use these automatically suggested facets, for instance by clicking on one of them, to narrow down the search space by expanding the original query with the suggested facet.

This paper is organized as follows. In Section 2 we define basic concepts related to facet analysis. In Section 3 we discuss the first step of our method. In Section 4 we build a formal faceted ontology to formalize the focused terms and contexts that we subsequently process, in Section 5, to produce new facets to be shown to the user for query refinement. In Section 6 we present the three different yet related querying methods we offer to the user: keyword-based, by focus, and on subject. In Section 7 we discuss related work. In Section 8 we conclude the paper.

FACETS ANALYSIS

Facet analysis is essentially a conceptual analysis of the subject matter, or the topical content of a concept into distinct divisions that together constitute a semantic description of the concept. In order to build the facet repository available to a user to refine a query, in this section we present some elements of facet analysis.

Our facet repository is organized around two main notions of the DEPA paradigm for facet analysis [6,7]: subjects and facets. A subject of a concept is the topical content of the concept, that is, the concept's overall semantics, as defined by the combination of the extensional and intensional semantics of the concept term. The definition can be extended to a query, which in its simplest form can be thought of as a finite sequence of concept terms; see subsections 6.1 and 6.3. A facet consists of a "group of terms derived by taking each term and defining it, per genus et differentiam, with respect for its parent class." [31, p. 12]. According to Ranganathan [20], each domain is made of distinct divisions or facets that are groups of mutually exclusive concepts, and many such facets together constitute a domain. The notion of such facetization has been extended by Bhattacharyya [7] to subject indexing by representing content as a string of fundamental categories DEPA (Discipline, Entity, Property and Action) that are conceptually equivalent to 'facets'. To illustrate, we rely on the following two examples.

EXAMPLE 1. Consider a document titled 'Improving EU labour market access for Rome'. DEPA facet analysis of the title leads to facets such as: Labour Market (Entity), Access (Action), Rome (Space, from commonly applicable facet schedules across domains). The facet 'Discipline' is extrapolated from the faceted document representation, and it is 'Economics'.

Note that if a concept is classified within more than one discipline, as a homonymous or synonymous concept, then all such different combinations of facets are taken into account and presented to the user for further refinement.

EXAMPLE 2. Consider a document titled 'Treating Apple trees for bacterial disease in Trentino'. DEPA facet analysis provides a classification of the document into the following facets: Agriculture (D), Apple Trees (E), Treating (A), Disease (P), and Bacterial (as 'Modifier' to P, cf. [6]).
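The two examples can be recorded in a small data structure. The following is a minimal sketch, assuming a hypothetical `DepaFacets` record and a hypothetical `summary` rendering; neither name nor the output format is part of the DEPA paradigm itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DepaFacets:
    """One DEPA analysis of a subject; None marks a facet that is not supplied."""
    discipline: Optional[str] = None
    entity: Optional[str] = None
    prop: Optional[str] = None       # the 'Property' facet
    action: Optional[str] = None
    modifier: Optional[str] = None   # modifier to Property (Example 2)
    space: Optional[str] = None      # commonly applicable facet (Example 1)


def summary(f: DepaFacets) -> str:
    """Render the fundamental categories that are present, in D, E, P, A order."""
    parts = [("D", f.discipline), ("E", f.entity), ("P", f.prop), ("A", f.action)]
    return ", ".join(f"{value} ({tag})" for tag, value in parts if value is not None)


# Facets as listed in Examples 1 and 2.
example_1 = DepaFacets(discipline="Economics", entity="Labour Market",
                       action="Access", space="Rome")
example_2 = DepaFacets(discipline="Agriculture", entity="Apple Trees",
                       prop="Disease", action="Treating", modifier="Bacterial")
```

A concept classified within more than one discipline would simply be stored as several such records, one per combination of facets.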

We are now ready to define the facet repository for a given context. A facet repository for a context C is the set

F R(C) = {⟨C : d,

FOCUSED TERMS FROM TEXT

In the present work, we apply facetization as a technique to combine extensional and intensional semantics of concepts viz. queries, or equivalently to disclose the subject of concepts and queries to the querying user, for the purpose of query refinement and search assistance. We implement facetization in two related steps: (1) we produce certain "focused terms" from documents organized in a polyhierarchy, and (2) from the focused terms we produce new facets to be shown to the user for the purpose of query refinement. We present step 1 in subsections 3.1 and 3.2 of this section, and step 2 in Sections 4 and 5.

Organization of documents

Although our method can be adopted as an integral part of digital library systems, both for describing the document collection and for faceted querying over the collection or the web, in this paper we assume that the method assists a querying user in query refinement. As the method in this specific application uses a textual collection of documents stored in the user's querying machine, we stipulate the following convention.

CONVENTION 1. We denote the set of documents available to a querying user by D. All available documents are textual, that is, they can be processed by text information retrieval techniques such as the variant of a standard technique discussed in Section 3.

Intuitively, the domain D of documents can be thought of as the set of all documents the querying user has classified and stored in the querying machine. CONVENTION 2. We assume that the querying user organizes documents in D by using a 'polyhierarchical classification', or polyhierarchy.

A polyhierarchical classification is a hierarchical classification permitting some concept terms to be listed in multiple categories of a taxonomy, or branches of a hierarchy [16]. An example of a polyhierarchy can be found in Figure 1. Note that what makes the hierarchical classification in Figure 1 polyhierarchical is the concept term 'Apple'. A subset of documents is organized in 'contexts', each context being organized into related sets of documents. A context is a polyhierarchical classification composed of sets of documents, i.e., 'nodes' of the polyhierarchy, called clusters, and a relation over the nodes as defined by the polyhierarchy. Typical relations are the binary relations of subsumption, part-of, and is-a, among others. Each cluster in a given context has a name composed of a finite sequence of words from a representation language, often a natural language; here we bet that clusters are named by a human, the querying user, who naturally applies her native language for naming clusters. A cluster's name in such a representation language is referred to as a concept term. A concept is a concept term provided with a semantics. Two kinds of semantics are provided to a concept term: an extensional semantics, defined over the documents in the cluster named by the concept term; and an intensional semantics, defined by the unique position of the concept term in a given 'focus'.

[Figure 1 (polyhierarchy example): context Cx:MyClassification; surviving node label: Computers.]

Contexts provide a way to define finite, ordered sequences of concept terms, each sequence called a focus. A focus consists of an ordered set of related concept terms, each concept term naming a cluster built upon the collection of documents in D. Intuitively, a focus is a path of concept terms corresponding to a path in a given context. Figure 2
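To make the notion of focus concrete, the following minimal sketch enumerates the focuses that end in a given concept term, reading a focus as a root-to-node path of concept terms. The `CHILDREN` mapping is a hypothetical polyhierarchy assembled from the concept terms mentioned in this paper (it is not a transcription of Figure 1), the function name is illustrative, and the hierarchy is assumed to be acyclic.

```python
# Hypothetical polyhierarchy: each concept term maps to its child concept terms.
# 'Apple' is listed under two branches, which makes the classification polyhierarchical.
CHILDREN = {
    "MyClassification": ["Fruit", "Computers"],
    "Fruit": ["Trentino"],
    "Trentino": ["Apple"],
    "Computers": ["Apple"],
    "Apple": [],
}


def focuses_with_leaf(children, root, leaf):
    """Return every path of concept terms from below `root` down to `leaf`.

    Each path is a tuple ordered from the top of the hierarchy to the leaf;
    the context root itself is not part of the focus.
    """
    found = []

    def walk(node, path):
        if node == leaf:
            found.append(tuple(path))
        for child in children.get(node, []):
            walk(child, path + [child])

    walk(root, [])
    return found


apple_focuses = focuses_with_leaf(CHILDREN, "MyClassification", "Apple")
```

In this toy context the concept term 'Apple' is the leaf of two focuses, Fruit>Trentino>Apple and Computers>Apple, which is exactly the ambiguity that a focus is meant to resolve.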

Concept terms grounded in documents

In this section, our goal is to automatically assign a 'label' to every cluster of a given context. Each cluster's label produced by Algorithm 1 below is a finite, simple concatenation of terms with maximum 'weight', extracted by using Text (•). Formally, we proceed as follows.

Let Text(•) be a text extraction function. In this paper, we take Text(•) to be a standard keyword extraction function; see for instance [25, Sec. 4]. Applied to a document d, Text(•) produces the set Text(d) of terms (or 'keywords') of d, namely, its most frequent 'tokens'. Let d be any document in D. As terms are defined from documents, from now on we write k ∈ Text(d) to denote a generic term retrieved by applying Text(•) to d. Given a document d, we rank a term k ∈ Text(d) by adapting the standard IR TF/IDF ("Term Frequency / Inverse Document Frequency") method [22,23] to deal with contexts and with the unique position of a concept term, i.e., its focus, within a context. Observe that in the following, for a given context C, we write 'C in C' as shorthand for 'C in the set of clusters of C', for every cluster C.

Let querying user u organizing a context C, cluster C in C, and term k ∈ Text (d) for a document d ∈ D be given. We define the weight of k in C as follows:

\[
W_u[k, C] = \Big( \sum_{d \in C} TF[k, d] \Big) \cdot \log \frac{\mathrm{Card}(F_C)}{\mathrm{doCK}_u[k]}, \tag{1}
\]

where TF[k, d] is the total number of occurrences of k in d, so that \sum_{d \in C} TF[k, d] is the total number of occurrences of k in C; Card(F_C) is the number of focuses in C with leaf C; and doCK_u[k] is the total number of clusters in the set

\[
C \setminus \{\, C' \mid C' \neq C \text{ is a cluster in a focus in } C \text{ with leaf } C \,\} \tag{2}
\]

which contain k. In (2), the C before the set difference and the C in 'a focus in C' denote the context, while the remaining occurrences of C denote the cluster. Intuitively, (1) says that, in order to represent the extensional semantics of a focus, the importance of a retrieved term for a cluster, i.e., the value of W u [k, C], is inversely proportional to the number of different focuses with C as leaf which contain the term.
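As an illustration, formula (1) can be evaluated on a toy context. The sketch below is one reading of the definitions under stated assumptions: clusters are identified by their concept term, documents are lists of tokens, a focus is a tuple of concept terms ending in its leaf, and the function name `weight` and all sample data are hypothetical.

```python
import math


def weight(k, cluster, docs_by_cluster, focuses):
    """W_u[k, C] as read from formula (1); documents are lists of tokens."""
    # Sum over d in C of TF[k, d]: total occurrences of k in the cluster.
    tf_sum = sum(doc.count(k) for doc in docs_by_cluster[cluster])
    if tf_sum == 0:
        return 0.0  # k does not occur in the cluster
    # Card(F_C): number of focuses whose leaf is this cluster.
    with_leaf = [f for f in focuses if f[-1] == cluster]
    if not with_leaf:
        raise ValueError(f"no focus has leaf {cluster!r}")
    # Set (2): all clusters of the context except the other clusters
    # lying on a focus with this leaf.
    excluded = {c for f in with_leaf for c in f if c != cluster}
    candidates = set(docs_by_cluster) - excluded
    # doCK_u[k]: how many of the remaining clusters contain k (at least 1 here,
    # since the cluster itself is never excluded and contains k).
    dock = sum(1 for c in candidates
               if any(k in doc for doc in docs_by_cluster[c]))
    return tf_sum * math.log(len(with_leaf) / dock)


# Hypothetical toy context: one tokenized document per cluster.
DOCS = {
    "Fruit": [["fruit", "orchard"]],
    "Trentino": [["orchard", "valley"]],
    "Computers": [["laptop", "apple"]],
    "Apple": [["apple", "apple", "orchard"]],
}
FOCUSES = [("Fruit",), ("Fruit", "Trentino"), ("Fruit", "Trentino", "Apple"),
           ("Computers",), ("Computers", "Apple")]

w_apple = weight("apple", "Apple", DOCS, FOCUSES)
```

Here 'apple' occurs twice in cluster Apple, two focuses have Apple as leaf, and after removing Fruit, Trentino and Computers only Apple itself remains in set (2), so the weight is 2 · log(2/1).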

The label of a cluster C is the most representative term or sequence of terms for the cluster. We now want to compute the labels of all clusters of a given context. To do this, we process all documents stored in each cluster, taking into account the position of each cluster in the context. To define the process formally, we rely on the following technical definition. Let context C organize (a subset of) documents in D and cluster C in C be given. We define

IR (D, C, C) = {k ∈ Text (d) | d ∈ C, C in C}. (3)

We expect that the label of cluster C in ( [...]

A term k is a label of C if W u [k ′ , C] ≤ W u [k, C] for all terms k ′ in IR (D, C, C). A sequence k1, k2, ..., kn of terms in IR (D, C, C) is a label of C if (a) W u [ki, C] = W u [k, C] for i = 1, [...]

To compute a label of every nonempty cluster C of a given context C, we exhibit an algorithm that produces the label lC of C; see Algorithm 1. Set IR = IR (D, C, C).

Algorithm 1 Context-based cluster labeling.

Input: C, D ≠ ∅
foreach C in C with C ≠ ∅ do
    foreach k ∈ IR (D, C, C) do
        compute W u [k, C] according to formula (1)
    od;
    compute M = {k ∈ IR | ∀k ′ ∈ IR, W u [k ′ , C] ≤ W u [k, C]};
    Let n be the cardinality of M;
    Let {k1, k2, . . . , kn} be the lexicographical ordering of M;
    Set l0 = ∅; /* empty sequence */
    for i = 1 to i = n do
        Pick ki ∈ M; Set li = li−1 ki /* simple concatenation */
    od;
    Define lC = ln
od;
Return: set of labels {lC | C in C, C ≠ ∅}.

Observe:
1. If C ≠ ∅ then IR ≠ ∅.
2. The label lC computed by Algorithm 1 is not unique. In fact, M in Algorithm 1 is assumed to be ordered according to lexicographical ordering. Other orderings of the elements in M are possible and, as a consequence, a different label can be generated from each ordering.

We are now ready to define "focused terms." Let a focus F with concept term C as leaf be given. A focused term for F is any term that appears in a label lC of a cluster C in F. In symbols, the set of focused terms for F is

F T (F ) = {k | k appears in lC , C ∈ F}.

A focused term for C is any term that appears in lC. A focused term for a concept term plays the role of a synonym, or alias name, of the concept term. As we will see in Section 6, alias names are important to improve keyword-based querying.
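The labeling step of Algorithm 1 and the set FT(F) can be sketched as follows. This is a minimal illustration, assuming that the weights W u [k, C] have already been computed and are given as a dictionary per cluster; the weight values and function names are hypothetical, and a label is kept as a tuple of terms rather than as one concatenated string.

```python
def label_from_weights(weights):
    """Label of one cluster: the maximum-weight terms in lexicographic order.

    `weights` maps each term k in IR to W_u[k, C]. Ties for the maximum are
    all kept, which mirrors the set M of Algorithm 1.
    """
    if not weights:
        return ()
    top = max(weights.values())
    return tuple(sorted(k for k, w in weights.items() if w == top))


def focused_terms(focus, labels):
    """FT(F): every term that appears in the label of some cluster in the focus."""
    return {k for cluster in focus for k in labels.get(cluster, ())}


# Hypothetical precomputed weights for two clusters.
WEIGHTS = {
    "Fruit": {"fruit": 0.7, "orchard": 0.7, "tree": 0.1},
    "Apple": {"apple": 1.4, "orchard": 0.3},
}
LABELS = {cluster: label_from_weights(w) for cluster, w in WEIGHTS.items()}
FT = focused_terms(("Fruit", "Apple"), LABELS)
```

Sorting the tied terms fixes one of the possible orderings of M; any other ordering would yield a different but equally valid label.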

FACETED ONTOLOGY BUILDING

The result of extracting terms from documents and using them to "facetize" the concepts of a polyhierarchical classification is a basic kind of faceted taxonomy, provided that (1) the extracted terms or, often, a proper subset of these [9], are matched with a predefined set of facets, and (2) the clusters in a focus are related to each other by a subsumption relation. Indeed, a faceted taxonomy consists of: (a) a set of facets, where each facet consists of a predefined set of terms; and (b) a subsumption relation among the terms. In this section we provide the formal framework we need to formalize the focused terms and labeled contexts produced by Algorithm 1, while only loosely assuming (2) (see footnote 2).

Description Logics

Description Logics (DLs) [5] are a family of logic-based knowledge representation formalisms designed to represent and reason about the knowledge of an application domain in a structured and well-understood way. In this paper, we use a basic description logic, called ALC, thereby betting that ALC provides us with an efficient SAT solver to implement our facet engine (Section 5). ALC is the smallest propositionally closed DL, and provides the concept constructors ¬C, C ⊓ D, C ⊔ D, ∃R.C, and ∀R.C. For the goal of this paper, we use a limited part of ALC's expressive power; in particular we do not use role axioms and assertions. Moreover, we write concept descriptions in lower case, as concept descriptions from now on are terms extracted by Algorithm 1 from documents as explained. Due to the limitation of space, we do not provide a detailed introduction to DLs, but rather point the reader to [5,4] and offer the reader an example.

Footnote 2. That in our approach clusters in a focus are related to each other by a subsumption relation follows from Convention 2 by observing that polyhierarchical classifications are often subsumption hierarchies. However, we do not need to strictly assume (2) in this paper.

EXAMPLE 5. Consider the labeled focus in Example 4. We can represent it within ALC by a set of equality axioms, which we present as labels of the labeled focus in Figure 4.

Fruit ≡ ∃hasK.k^3_1 ⊓ ··· ⊓ ∃hasK.k^3_n
...
Trentino ≡ ∃hasK.k^2_1 ⊓ ··· ⊓ ∃hasK.k^2_m
Apple ≡ ∃hasK.k^1_1 ⊓ ··· ⊓ ∃hasK.k^1_p

Figure 4: A labeled focus in ALC.

The concept descriptions k^j_i that appear in the tree refer to the focused terms extracted by Algorithm 1 for each concept in the focus; hasK is a named role, which is intuitively interpreted as 'has keyword'. For example, ∃hasK.k^3_1 intuitively means that concept term 'Fruit' in focus F :Fruit>Trentino>Apple is extended with focused term (keyword) k^3_1. Each equality axiom that appears along the tree defines in ALC a concept term in F; the focus itself is formalized by the equality axiom: FocusApple ≡ Apple ⊓ ∃R.(Trentino ⊓ ∃R.Fruit). An ALC KB for this example is the set of the three equality axioms depicted along the tree plus the equality axiom that defines 'FocusApple' as the 'focus Apple', i.e., the focus F.
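The axioms of Example 5 can be generated mechanically from a labeled focus. The sketch below only builds the axioms as text, in the notation used in this section; it does not call any DL reasoner, and the function names and sample focused terms are hypothetical.

```python
def concept_axiom(concept, focused_terms):
    """Equality axiom defining a concept term by its focused terms via the role hasK."""
    conjuncts = " ⊓ ".join(f"∃hasK.{k}" for k in focused_terms)
    return f"{concept} ≡ {conjuncts}"


def focus_axiom(name, focus, role="R"):
    """Equality axiom for a focus given from the top of the hierarchy to the leaf.

    ("Fruit", "Trentino", "Apple") yields Apple ⊓ ∃R.(Trentino ⊓ ∃R.Fruit).
    """
    expr = focus[0]
    for term in focus[1:]:
        # Parenthesize the nested description unless it is a single concept name.
        inner = f"({expr})" if "⊓" in expr else expr
        expr = f"{term} ⊓ ∃{role}.{inner}"
    return f"{name} ≡ {expr}"


# Hypothetical focused terms for the leaf concept.
apple_axiom = concept_axiom("Apple", ["apple", "orchard"])
focus_apple = focus_axiom("FocusApple", ("Fruit", "Trentino", "Apple"))
```

The second call reproduces the axiom for 'FocusApple' given in Example 5.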

Formal Faceted Classifications

Now we generalize the example. Algorithm 2 below provides a way to build an ALC faceted knowledge base, or faceted ontology, for a given context. The algorithm works in two main steps.

First, it builds a knowledge base by adding ALC equality axioms that formally define the concept terms of an input context by using the focused terms computed by Algorithm 1 over the same context. For matching purposes that we will see in Section 5, if strictly more or strictly fewer than four (but at least one) focused terms were computed for a concept term, then the algorithm adds to the knowledge base all the equality axioms defined over all possible combinations of four focused terms picked, possibly with repetitions, from the computed terms.
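The first step can be illustrated as follows. This is a sketch under an assumption: since the conjuncts of an axiom are unordered, "combinations of four focused terms, possibly with repetitions" is read here as unordered selections with replacement; an ordered reading would use a Cartesian product instead. The function name is hypothetical.

```python
from itertools import combinations_with_replacement


def four_term_axioms(concept, focused_terms):
    """Axioms over four focused terms for one concept term.

    With exactly four distinct terms there is a single axiom; otherwise every
    unordered selection of four terms, repetitions allowed, gives one axiom.
    """
    terms = sorted(set(focused_terms))
    if not terms:
        return []  # at least one focused term is required
    if len(terms) == 4:
        combos = [tuple(terms)]
    else:
        combos = combinations_with_replacement(terms, 4)
    return [f"{concept} ≡ " + " ⊓ ".join(f"∃hasK.{k}" for k in combo)
            for combo in combos]


two_term_axioms = four_term_axioms("Apple", ["apple", "orchard"])
```

With two focused terms this yields five axioms, and with five focused terms seventy, so the number of axioms grows quickly with the length of a label.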

Second, the algorithm adds to the knowledge base so obtained all ALC equality axioms that formally define the DEPA facets of every concept as stored in the facet repository (see Section 2). These axioms have the form

C ≡ ∃FacetD.d ⊓ ∃FacetE.e ⊓ ∃FacetP.p ⊓ ∃FacetA.a, (4)

where C represents a concept c available in the facet repository, and FacetD, FacetE, FacetP, and FacetA are named roles representing the properties of c in terms of the DEPA facet analysis paradigm. The intended interpretation of these named roles relates to the facet repository. For example, ∃FacetD.f means that there is a concept in the facet repository whose facet 'Discipline' is f. By extension, equality axiom (4) means that there is a concept in the facet repository with facet 'Discipline' d, 'Entity' e, 'Property' p, and 'Action' a, and that this concept has name C. Hence, in its second step, Algorithm 2 adds to the knowledge base all axioms of the form (4) if and only if there is a concept (or a subject) with DEPA facets d, e, p, a in the facet repository. We make the system insensitive to case and punctuation in the facets d, e, p, a by adding additional axioms where variants of d, e, p, a with the same meaning are used. We call the ontology produced by Algorithm 2 a formal faceted classification (FFC).
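The second step, including the additional axioms for spelling variants, might be sketched as below. Which variants count as having "the same meaning" is not fixed in the text; this sketch assumes only lower-casing and removal of ASCII punctuation, and both function names are hypothetical.

```python
from itertools import product
import string

_PUNCT = str.maketrans("", "", string.punctuation)


def variants(facet):
    """Original spelling plus lower-cased and punctuation-free spellings, deduplicated in order."""
    candidates = [facet, facet.lower(), facet.translate(_PUNCT),
                  facet.lower().translate(_PUNCT)]
    return list(dict.fromkeys(candidates))


def facet_axioms(concept, d, e, p, a):
    """All axioms of form (4) for one repository entry, one per combination of variants."""
    return [f"{concept} ≡ ∃FacetD.{vd} ⊓ ∃FacetE.{ve} ⊓ ∃FacetP.{vp} ⊓ ∃FacetA.{va}"
            for vd, ve, vp, va in product(variants(d), variants(e),
                                          variants(p), variants(a))]


# Facets of Example 2; the concept name is chosen for illustration only.
axioms = facet_axioms("AppleTreatment", "Agriculture", "Apple Trees",
                      "Disease", "Treating")
```

Each of the four facets here has two variants (capitalized and lower-case), so one repository entry produces sixteen axioms; normalizing the strings once before matching would be a cheaper alternative to adding variant axioms.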

Algorithm 2 Building an ALC faceted ontology O.

Input: C, D ̸ = ∅, F R(C) Set O = ∅; /*

FACET ENGINE

Now we design within our framework a facet engine that computes the matching between the focused terms of an input context and the predefined set of facets stored in the facet repository for a number of concepts. Intuitively, the facet engine looks at all keywords generated for each concept name in a focus, for all focuses of the hierarchy, and browses through the focus from the root to the leaf to identify which keywords are DEPA facets stored in the facet repository. The facet engine's main component is Algorithm 3. The basic steps of the algorithm are the following:

Step 1. Input a concept description C that represents a user's query; the different possible queries that can be represented this way are presented in Section 6.

Step 2. Find and retrieve from the ontology built by Algorithm 2 all equality axioms that define C in the ontology either by focused terms or by DEPA facets. If no such axiom exists, that is, C is not defined according to the knowledge stored in the ontology, the algorithm ends with no help to the user. In this state the search engine cannot provide the user with help for query refinement by facets.

Step 3. For all retrieved axioms and for each axiom of the form C ≡ ∃hasK.k1 ⊓ • • • ⊓ ∃hasK.kn, where lC = k1...kn is the label computed by Algorithm 1, the algorithm runs the ALC SAT solver in order to match the (focused) terms ki in the axiom to all DEPA facets for C possibly stored in the facet repository. Note that the performance of our method depends mainly on this step, namely, on the number and complexity of the matchings. Preliminary results suggest that the algorithm satisfies the requirements of a query refinement system in terms of real-time performance. A complete study of the complexity of this step is in progress.

Step 4. For all successful matchings computed in Step 3, the retrieved DEPA facets are output and shown to the user.

[...] := FacetSet(C)_{j−1} ∪ F_{li} /* all DEPA strings for C in FR(C) retrieved */ od fi; Return: FacetSet(C)_j.
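The four steps can be mimicked with plain data structures. The sketch below deliberately replaces the ALC SAT solver of Step 3 with a case-insensitive string comparison, so it illustrates the data flow only and not the reasoning; the shape of the repository, the decision to return whole DEPA entries on a successful match, and all names are assumptions.

```python
def facet_engine(concept, labels, repository):
    """Return the DEPA entries of `concept` that match one of its focused terms.

    labels:     concept term -> sequence of focused terms (its label)
    repository: concept term -> list of DEPA entries, each a dict with
                keys 'D', 'E', 'P', 'A'
    """
    terms = labels.get(concept)
    entries = repository.get(concept)
    # Step 2: nothing is known about the concept, so no help can be offered.
    if not terms or not entries:
        return []
    wanted = {t.lower() for t in terms}
    # Step 3 (simplified): an entry matches if any of its facets is a focused term.
    matches = [entry for entry in entries
               if any(value.lower() in wanted for value in entry.values())]
    # Step 4: the matching DEPA entries are what would be shown to the user.
    return matches


# Hypothetical label and repository content for one concept term.
LABELS = {"Apple": ["agriculture", "disease", "orchard"]}
REPOSITORY = {
    "Apple": [
        {"D": "Agriculture", "E": "Apple Trees", "P": "Disease", "A": "Treating"},
        {"D": "Computing", "E": "Laptops", "P": "Price", "A": "Selling"},
    ],
}
suggested = facet_engine("Apple", LABELS, REPOSITORY)
```

Only the agricultural entry is suggested here, because none of the focused terms of 'Apple' occurs among the facets of the computing entry.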

QUERY PROCESSING

After building the faceted ontology and defining the facet engine, we are ready to use them to provide new facets to the user for query refinement. We allow the user to make three kinds of query: keyword-based, by focus, and on subject. We discuss each querying method in turn.

Keyword-based querying

The user types one or more keywords in the search box. This method is the simplest one, and it is often the only method available when the user does not know anything about the subject to search, or when the user's knowledge of the query subject is not based on documents locally stored in the user's querying machine, so that we cannot use the ontology and facet engine we have advanced. This is also a typical case of keyword-based querying with common search engines, where the keywords used in the query are listed in no specific order, solely on the basis of the user's information need.

We deal with this method of querying as follows. Each keyword is mapped to zero or more concept terms in the context C. We do that using an exact string match of the keyword to the concept term or one of its alias names, namely, its focused terms.

If no concept term and its alias names match any keyword, no concept description is available to the facet engine, and as a consequence no facets for query refinement are shown to the user.

If one concept term or its alias names match some keywords, then the concept description C of the concept term is generated and processed by Algorithm 3 for query expansion. The facets that occur in the query expansion are shown to the user. When selecting one of the new facets, the user will narrow down the search by expanding the original query with the suggested facet.

If multiple concept terms match some keywords, then the concept description of each term is generated and processed by Algorithm 3 for query expansion. The facets that occur in the query expansion of every concept description are shown to the user. Alternatively, the user is given the option to refine their query to indicate which concept term, namely, keyword they meant the most.
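The three cases above reduce to counting how many concept terms are matched. A minimal sketch follows, assuming that alias names are available as a mapping from each concept term to its focused terms; the function name and sample aliases are hypothetical.

```python
def match_keywords(keywords, aliases):
    """Concept terms matched by at least one keyword.

    A keyword matches a concept term by exact string equality with the concept
    term itself or with one of its alias names (its focused terms).
    """
    matched = []
    for concept, names in aliases.items():
        candidates = {concept} | set(names)
        if any(kw in candidates for kw in keywords):
            matched.append(concept)
    return matched


# Hypothetical alias names per concept term.
ALIASES = {"Apple": {"orchard", "cultivar"}, "Computers": {"laptop"}}

none_matched = match_keywords(["banana"], ALIASES)
one_matched = match_keywords(["orchard"], ALIASES)
many_matched = match_keywords(["Apple", "laptop"], ALIASES)
```

An empty result means no facets can be suggested, a single concept term goes straight to the facet engine, and several concept terms either all go to the facet engine or are offered to the user for disambiguation.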

Querying-By-Focus

Now suppose that the user knows at least something about the subject to search, and that the user's knowledge comes from documents stored and polyhierarchically organized in the user's document collection. In this case, it would always be desirable for the user to get a better and better understanding of the hidden content of the query, as automatically generated by a suitable method, so as to discover new facets of the original query that the user was not aware of before. For example, suppose the query is 'apple' as contextualized in Figure 5. The user clicks on a concept term in a context C. In doing that, the user selects a focus in C. Alternatively, the user types some keywords as in keyword-based querying, but in a specific order, to mean a focus in C. For example, the user may click on (an appropriate graphic version of) 'Apple' in context, or type the keywords 'fruit', 'trentino', 'apple' in this order, to mean Cx:Fruit>Trentino>Apple. In the example, by selecting the facet 'Fruit' the user would narrow down the search space by excluding all subjects about Apple Computers and related subjects from the search results (see Figure 1). Similarly, by selecting the facet 'Trentino' the user would be able to narrow down the search space by excluding all subjects about fruits that are not related to Trentino's production of apples. It follows that the keyword-based method and querying by focus are not equivalent for at least one reason: in keyword-based querying the order of keywords does not matter, while in querying by focus it does. The other main difference between these two querying methods arises in query processing. Concept terms in a focus are not 'pure' keywords; a concept term is represented by a string of similar keywords as generated by Algorithm 1. Concept terms relate to documents in the user's repository, while keywords are usually unrelated to the user's documents.

A query-by-focus is similar to a query by example, yet it is more specific. In querying by example, a sample document (the example) is selected by the user to refine the query. In querying by focus, on the other hand, the position of the sample document is also considered, that is, the place where the document is stored within the user's documentary repository. To illustrate, suppose that a user stores his documents according to two different structures (see Figure 6), and suppose the user selects the document named doc1 as the sample document. In classical querying by example, a relevant answer to the user would be any document about 'apple', meant as either a fruit or a computer. In contrast, with querying by focus the only relevant answers to the user would be documents from one of the two focuses Fruit>Apple and Computers>Apple.

We deal with querying by focus as follows. A concept description C of the concept term that is the leaf of the focus is generated and processed by Algorithm 3 for query expansion. The facets that occur in the query expansion are shown to the user. When selecting one of the new facets, the user will narrow down the search by expanding the original query with the suggested facet.
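The order sensitivity that separates querying by focus from keyword-based querying can be shown in a few lines. In this sketch a typed keyword sequence selects a focus only if it spells out the whole focus in order; comparison ignores case because the example types 'fruit', 'trentino', 'apple' for Cx:Fruit>Trentino>Apple. The function name and the list of focuses are hypothetical.

```python
def select_focus(ordered_keywords, focuses):
    """Return the focus spelled out by the keywords in order, or None."""
    typed = tuple(k.lower() for k in ordered_keywords)
    for focus in focuses:
        if tuple(term.lower() for term in focus) == typed:
            return focus
    return None


FOCUSES = [("Fruit", "Trentino", "Apple"), ("Computers", "Apple")]

selected = select_focus(["fruit", "trentino", "apple"], FOCUSES)
reversed_order = select_focus(["apple", "trentino", "fruit"], FOCUSES)
# The leaf of the selected focus is the concept term handed to the facet engine.
leaf = selected[-1] if selected else None
```

The same three keywords in reverse order select nothing, whereas in keyword-based querying both orders would match the same concept terms.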

Note that the case where query by focus applies in practical situations is not as uncommon as it may seem, because almost all users start a search from a device storing text and text-annotated documents, and these are often organized by the user according to a polyhierarchical classification. More importantly, the fact that a user searches the Web does not mean that documents from the Web will be used for the purpose of querying by focus. The documents used for querying by focus are all and only the documents locally stored in the user's querying device, whether the search objective is to retrieve documents stored in the user's device or on the Web. As a consequence, querying by focus scales to the size of the Web. To understand this a bit further, recall that our method is about query refinement; it is not a search method. We use standard methods and search engines to search; the difference is that the keywords we let the search engines use are automatically generated by our facetization technique.

Querying-On-Subject

Subject-based querying is the most common approach by specialized users, where 'subject' refers to the topical intent of a query (cf. Section 2). In our faceted approach to representation of documents in collection D, 'subjects' are broken down into distinct divisions, the facets of subject. A typical 'query-on-subject' is deemed to relate to a specific subject of a preexisting faceted classification. For example, a subject-based query is: 'What are the documents on the effects of nitrogen fertilizers on rice plants?' The subject of the concept subsumed by this query is one of possibly many focuses, for example the following:

Cx:rice plants>nitrogen fertilizers>effects. (5)

This is a partial focus, in the sense that the discipline subsumed by the query as provided by the DEPA facet analysis is

Cx:Agriculture>rice plants>nitrogen fertilizers>effects. (6)

Another possible focus for the subject of query's concept is the following:

C ′ x:Agriculture> effects of nitrogen> fertilizers>rice plants.

A number of different but equivalent focuses could exist for a given subject-based query. Note that the existence of a focus for this query, as well as the form of the focus, depends only upon the querying user's classification of documents. The take-away point is that by merging a subject with one or more focuses, that is, by automatically transforming a query-on-subject into a query-by-focus, the method provides the user with assistance in query refinement. In fact, we compute the focuses generated from the query on subject, and for each focus we consider the concept description that represents the focus in the ALC ontology computed by Algorithm 2. Then we proceed as in the case of querying by focus and compute the query expansion of the focus according to the knowledge stored in the ontology. Finally, the retrieved facets are shown to the user. If multiple focuses are computed from the query's subject, the user is given the option to refine the original query to indicate which focus they meant for the searched subject.
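How focuses are computed from a query on subject is not spelled out in this section, so the following sketch uses a purely illustrative matching rule: a focus is a candidate when every concept term after the first occurs in the query text, the first term being treated as the discipline that the query may leave implicit. The rule, the function name, and the third focus in the sample data are assumptions.

```python
def focuses_for_subject(query, focuses):
    """Candidate focuses for a subject query under the illustrative rule above."""
    text = query.lower()
    return [focus for focus in focuses
            if len(focus) > 1
            and all(term.lower() in text for term in focus[1:])]


FOCUSES = [
    ("Agriculture", "rice plants", "nitrogen fertilizers", "effects"),
    ("Agriculture", "effects of nitrogen", "fertilizers", "rice plants"),
    ("Agriculture", "apple trees", "bacterial disease", "treating"),
]
QUERY = "What are the documents on the effects of nitrogen fertilizers on rice plants?"

candidates = focuses_for_subject(QUERY, FOCUSES)
```

Both focuses given in the text for the rice-plant query are returned, which is the situation in which the user is asked to indicate the focus they meant.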

RELATED WORK

There has been extensive work on automated facet construction motivated by query refinement, browsing and navigation over document collections; see for instance [29], [8,9], [10], [24], [30,13].

The notion of context in these related works differs from the notion of focus; in [10] context is a piece of text surrounding the query, taken from a document presented to the user and marked by the user on the document. The structural nature of a focus contrasts with the plain, linguistic nature of query context as meant in [10]. The navigation trees discussed in [28] are similar to the focuses discussed in this paper. Moreover, the formal approach of [28], as well as its use of faceted taxonomies, is close in spirit, if not in the formal development, to the work presented here. As far as we know, none of the foregoing approaches uses a DEPA facet schema.

Our method is a focused retrieval method, in the sense that focused retrieval addresses ways to provide a querying user a more direct access to relevant information [26]. Focused retrieval aims to identify not only documents relevant to a user information need, but also where within the document the relevant information is located.

Our querying-by-focus approach is similar to the querying by focus on hierarchical classifications proposed in [1,2].

In the Indian context, faceted library systems, especially the Colon Classification System (CCS), have been adopted by the majority of academic libraries for organizing collections in a semantic arrangement. However, there is wide scope for applying the faceted theory behind systems such as CCS to other knowledge modeling efforts. Prasad and Guha [18] introduced a facet-based method to formulate the descriptive domain metadata that could be used to annotate digital library resources. Prasad and Madalli [19] propose a generic model for building semantic infrastructure for digital libraries based on facets as used in traditional library classification systems.

Faceted taxonomies are extensively studied; see for instance [21,27,28] and references therein. Facet techniques include the one studied by Tvarožek and Bieliková [27], who have proposed faceted navigation and its personalization in digital libraries. They follow a method of faceted browser adaptation based on an automatically acquired user model with support for dynamic facet generation.

Polowinski [17] argues for the use of faceted browsing as a visual selection mechanism for browsing data collections, as it is deemed particularly suitable for structured but heterogeneous data with explicit semantics.

The Normalized Formal Classifications (NFC) used in [11] take into account both the label of a node and its position, using natural language processing techniques (see [11, sec 4]). We, on the other hand, have used an information retrieval technique to find the keywords that are subsequently represented in concept descriptions by using role names of the form hasK.k. This is an important difference from [11]. A focus is called "concept at a node" in [11, p. 70], although we believe that the two notions are not totally equivalent (to be investigated). The notion of Formal Faceted Classification (FFC) extends the notion of "lightweight ontology" of [11] to facets. A main difference from the lightweight ontologies of [11] is that FFC's descriptive language is not propositional, unlike the language used in [11]. Yet, it allows us to automate query refinement through DL reasoning services (SAT), as we did in this paper. Moreover, our query language allows a user to specify a query by selecting a sample document, to be interpreted as the "information need" for documents similar to the selected sample. As a consequence, we provide a user with a mechanism of "querying by example" as a special case. In [11], by contrast, it does not seem easy to formalize querying by example, as the propositional language used there does not allow instances to be represented.

CONCLUSION

This paper presented a formal framework for a query refinement method that enables the extraction of the diversity aspects, or facets, of a user query. The method uses the general principles of facet analysis in the DEPA paradigm of facetization and the notion of 'focus', which is used to infer new facets from the user query. The method provides a user with additional and essential contextual information, in the form of new facets. When selecting one of the new facets, the user can narrow down the search by expanding the original query with the suggested facets. The proposed method of query refinement is based on diversity in querying and on the multi-dimensionality of information. Three methods of querying were discussed: keyword-based, by focus, and on subject. For each method, textual and structural dimensions were used to assist the user in query refinement. The textual dimension allowed us to generate the top-k most relevant terms for each concept of a given polyhierarchy of text and text-annotated documents. The structural dimension of the polyhierarchy was used to match DEPA facets with the user query. We have situated our framework within the smallest propositionally closed description logic ALC, and we have used ALC's solver to implement the facet engine as the main component of the method.

Figure 1: A polyhierarchy, or polyhierarchical context Cx.

Figure 2: An example of context (left) and focus (right).

Figure 3: A focus as labeled by Algorithm 1.

∃R.C, ∀R.C, as well as concept inclusion (or subsumption) C ⊑ D and concept equality C ≡ D, where C, D are concept descriptions and R is a named role. A DL knowledge base (KB) consists of concept axioms (such as concept inclusion and concept equality axioms), role axioms (such as functional role axioms) and assertions of the form C(a), R(a, b) where a and b are named individuals.

Figure 5: A focus for query 'Apple'.

Figure 6: Position of sample document doc1 matters.

e, p, a⟩ | C has DEPA facets d, e, p, a}, where C is a concept description in description logic ALC (see subsection 4.2) of a concept or subject of interest in context C, and d, e, p, a are, respectively, a Discipline, Entity, Property and Action in the DEPA classification system.

EXAMPLE 3. Consider the previous two examples. We can assume that 'Improving EU labour market access for Rome' is represented by a concept description C1, and 'Treating Apple trees for bacterial disease in Trentino' is represented by a concept description C2 in a context C. The facet repository FR(C) contains ⟨C1 : Economics, LabourMarket, p, Access⟩, where p is unspecified, and ⟨C2 : Agriculture, AppleTrees, Disease, Treating⟩.
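As an illustration of the facet repository in Example 3, the following is a minimal sketch of one possible in-memory representation. The class `FacetEntry`, its field names, and the use of `None` for an unspecified facet are assumptions made for this example only; the paper defines the repository set-theoretically and does not prescribe a data structure.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class FacetEntry:
    """One repository entry <C : d, e, p, a>; None marks an unspecified facet."""
    concept: str
    discipline: Optional[str] = None
    entity: Optional[str] = None
    prop: Optional[str] = None      # the Property facet
    action: Optional[str] = None

    def unspecified(self) -> List[str]:
        """Names of the DEPA facets that are left unspecified."""
        names = ("discipline", "entity", "prop", "action")
        return [name for name in names if getattr(self, name) is None]


# The two entries of Example 3, keyed by concept description name.
facet_repository: Dict[str, FacetEntry] = {
    "C1": FacetEntry("C1", "Economics", "LabourMarket", None, "Access"),
    "C2": FacetEntry("C2", "Agriculture", "AppleTrees", "Disease", "Treating"),
}
```

Here `facet_repository["C1"].unspecified()` reports that only the Property facet of C1 is missing, matching the remark that p is unspecified.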

3) is the most representative term or sequence of terms in IR(D, C, C). The most representative term among terms in IR(D, C, C) is the term with the highest weight among all terms in IR(D, C, C) according to weighting measure 1. Formally, a term k in IR(D, C, C) is the most representative for the cluster C in C, and we say that

Algorithm 3 Query expansion with facets from focused terms.
proc QueryExpansion
Input: C, O, FR(C) /* C is meant to represent user query */
Define ΩK be the set of axioms in O of the form C ≡ ∃hasK.k1 ⊓ ... ⊓ ∃hasK.kn; /* k1...kn = lC */
Define ΩF be the set of axioms in O of the form C ≡ ∃D.d ⊓ ∃E.e ⊓ ∃P.p ⊓ ∃A.a; /* ⟨C : d, e, p, a⟩ is in FR(C) */
if ΩK ∨ ΩF = ∅
then exit /* no query expansion provided */
else
  s := Card(ΩK); /* ΩK cardinality is s ≥ 1 */
  t := Card(ΩF); /* ΩF cardinality is t ≥ 1 */
  FacetSet(C) := ∅; /* set of facets retrieved for C */
  for j = 1 to j = s do
    F^0_0 := ∅; /* different facets strings retrieved by using a single axiom in ΩK */
    for l = 1 to l = t do
      for i = 1 to i = (n choose 4) do
        if O |= ∃hasK.ki1 ⊓ ... ⊓ ∃hasK.ki4 ≡ ∃D.d ⊓ ∃E.e ⊓ ∃P.p ⊓ ∃A.a /* focused terms and DEPA facets match */
        then F^l_i := F^l_(i−1) ∪ {⟨C : d, e, p, a⟩} /* ⟨C : d, e, p, a⟩ retrieved depending on ki1,...,ki4 */
        fi
      od
    od;
    FacetSet(C) j

Trentino is a Province of the Italian North-east known for the Dolomites and for its quality production of red and yellow apples.

To shorten notation, in algorithms we use D, E, P, A instead.
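The loop structure of Algorithm 3 can be sketched in Python as follows. This is a rough sketch under stated assumptions rather than the paper's implementation: the entailment test O |= ... is delegated to a caller-supplied `entails` function (a real system would call a DL reasoner), axioms are simplified to tuples of terms and facet values, and the final accumulation into the facet set is assumed to be a plain union because that step is incomplete in the listing. The names `query_expansion` and `toy_entails` are hypothetical.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Set, Tuple

Facet = Tuple[str, str, str, str]  # (discipline, entity, property, action)


def query_expansion(
    concept: str,
    keyword_axioms: List[Sequence[str]],   # one tuple (k1..kn) per axiom in Omega_K
    facet_axioms: List[Facet],             # one (d, e, p, a) per axiom in Omega_F
    entails: Callable[[Sequence[str], Facet], bool],
) -> Set[Tuple[str, str, str, str, str]]:
    """Collect the DEPA facets matched by some 4-subset of focused terms."""
    if not keyword_axioms or not facet_axioms:
        return set()                       # no query expansion provided
    facet_set = set()
    for terms in keyword_axioms:           # j = 1..s
        for facet in facet_axioms:         # l = 1..t
            # Every choice of 4 focused terms out of n, i.e. (n choose 4) subsets.
            for subset in combinations(terms, 4):
                if entails(subset, facet):
                    facet_set.add((concept, *facet))
    return facet_set


def toy_entails(terms: Sequence[str], facet: Facet) -> bool:
    """Stand-in for the reasoner: case-insensitive set equality (assumption)."""
    return {t.lower() for t in terms} == {f.lower() for f in facet}


expanded = query_expansion(
    "C2",
    [("agriculture", "appletrees", "disease", "treating", "trentino")],
    [("Agriculture", "AppleTrees", "Disease", "Treating")],
    toy_entails,
)
```

With five focused terms there are five 4-subsets to test, and only the one without 'trentino' matches the DEPA facets of C2 under the toy entailment check.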

REFERENCES

[1] A. Agostini and P. Avesani. On the discovery of the semantic context of queries by game-playing. In H. Christiansen, M.-S. Hacid, T. Andreasen, and H. Larsen, editors, Proceedings of the Sixth International Conference On Flexible Query Answering Systems (FQAS-04), LNAI 3055. Springer-Verlag, Berlin Heidelberg, 2004.
[2] A. Agostini and G. Moro. Identification of communities of peers by trust and reputation. In C. Bussler and D. Fensel, editors, Proceedings of the Eleventh International Conference on Artificial Intelligence: Methodology, Systems, Applications - Semantic Web Challenges (AIMSA-04), LNAI 3192. Springer-Verlag, Berlin Heidelberg, 2004.
[3] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong. Diversifying search results. In Proceedings of the Second ACM International Conference on Web Search and Data Mining (WSDM-00). ACM Press, New York, NY, 2009.
[4] F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and P. Patel-Schneider, editors. Handbook of Description Logics. Cambridge University Press, Cambridge, UK, 2002.
[5] F. Baader and W. Nutt. Basic description logics. In F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and P. Patel-Schneider, editors, Handbook of Description Logics. Cambridge University Press, Cambridge, UK, 2002.
[6] G. Bhattacharyya. POPSI: its fundamentals and procedure based on a general theory of subject indexing languages. Library Science with a Slant to Documentation, 16(1), 1976.
[7] G. Bhattacharyya. Subject indexing language: its theory and practice. In Proceedings of the DRTC Refresher Seminar-13, New Developments in LIS in India. DRTC, ISI Bangalore Centre, Bangalore, India, 1981.
[8] W. Dakka, R. Dayal, and P. Ipeirotis. Automatic discovery of useful facet terms. In Proceedings of the ACM SIGIR 2006 Workshop on Faceted Search. ACM Press, New York, NY, 2006.
[9] W. Dakka and P. Ipeirotis. Automatic extraction of useful facet hierarchies from text databases. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering (ICDE-08). IEEE Computer Society, Washington, DC, USA, 2008.
[10] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. Placing search in context: The concept revised. In Proceedings of the Tenth International World Wide Web Conference (WWW-2001). ACM Press, New York, NY, 2001.
[11] F. Giunchiglia, M. Marchese, and I. Zaihrayeu. Encoding classifications into lightweight ontologies. In S. Spaccapietra, editor, Journal on Data Semantics VIII, LNCS 4380. Springer-Verlag, Berlin Heidelberg, 2007.
[12] B. J. Jansen, A. Spink, and T. Saracevic. Real life, real users, and real needs: a study and analysis of user queries on the web. Information Processing & Management, 36(2), 2000.
[13] K. Latha, K. R. Veni, and R. Rajaram. AFGF: An automatic facet generation framework for document retrieval. In Proceedings of the 2010 International Conference on Advances in Computer Engineering (ACE-2010). IEEE Computer Society, Washington, DC, USA, 2010.
[14] S. Lawrence. Context in Web Search. IEEE Data Engineering Bulletin, 23(3), 2000.
[15] V. Maltese, F. Giunchiglia, K. Denecke, P. Lewis, C. Wallner, A. Baldry, and D. Madalli. On the interdisciplinary foundations of diversity. In G. Boato and C. Niederee, editors, Proceedings of the First International Workshop on Living Web at ISWC-09, Washington D.C., USA, October 26, 2009. CEUR-WS, 2009.
[16] P. Morville and L. Rosenfeld. Information architecture for the World Wide Web. O'Reilly Media, Inc., Sebastopol, CA, 3rd edition, 2006.
[17] J. Polowinski. In M. J. Smith and G. Salvendy, editors, Human interface and the management of information. Designing information environments, Proceedings of the Symposium on Human Interface 2009, held as Part of HCI International 2009 (HCII-09), San Diego, CA, USA, July 19-24, 2009, LNCS 5617. Springer-Verlag, Berlin Heidelberg, 2009.
[18] A. Prasad and N. Guha. Expressing faceted subject indexing in SKOS/RDF. In Proceedings of the First International Conference of Semantic Web and Digital Libraries (ICSWDL-07), Bangalore, 21-23 February 2007.
[19] A. Prasad and D. Madalli. Semantic digital faceted infrastructure for semantic digital libraries. Library Review, 57(3), 2008.
[20] S. R. Ranganathan. Prolegomena to Library Classification. Asia Publishing House, London, 1967.
[21] G. Sacco and Y. Tzitzikas. Dynamic Taxonomies and Faceted Search. The Information Retrieval Series, vol. 25. Springer-Verlag, Berlin Heidelberg, 2009.
[22] G. Salton. The SMART Retrieval System-Experiments in Automatic Document Retrieval. Prentice-Hall Inc, Englewood Cliffs, NJ, 1971.
[23] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, NY, 1983.
[24] E. Stoica, M. A. Hearst, and M. Richardson. Automating creation of hierarchical faceted metadata structures. In Proceedings of the Human Language Technology Conference (NAACL HLT), Rochester, NY, USA. Association for Computational Linguistics, 2007.
[25] P. Tonella, F. Ricca, E. Pianta, and C. Girardi. Using keyword extraction for web site clustering. In K. Wong, editor, Proceedings of the Fifth International Workshop on Web Site Evolution (WSE-03), Amsterdam, The Netherlands. IEEE Computer Society, 2003.
[26] A. Trotman, S. Geva, J. Kamps, M. Lalmas, and V. Murdock. Current research in focused retrieval and result aggregation. Special Issue in the Journal of Information Retrieval, 13(5), 2010.
[27] M. Tvarožek and M. Bieliková. Personalized faceted browsing for digital libraries. In L. Kovács, N. Fuhr, and C. Meghini, editors, Research and Advanced Technology for Digital Libraries, Proceedings of the 11th European Conference on Digital Libraries (ECDL-07), Budapest, Hungary, September 16-21, 2007, LNCS 4675. Springer-Verlag, Berlin Heidelberg, 2007.
[28] Y. Tzitzikas, N. Spyratos, P. Constantopoulos, and A. Analyti. Extended faceted taxonomies for web catalogs. In Proceedings of the Third International Conference on Web Information Systems Engineering (WISE-02), 2002.
[29] R. van Zwol and B. Sigurbjörnsson. Faceted exploration of image search results. In Proceedings of the Nineteenth International World Wide Web Conference (WWW-10). ACM Press, New York, NY, 2010.
[30] E. Vee, U. Srivastava, J. Shanmugasundaram, P. Bhat, and S. A. Yahia. Efficient computation of diverse query results. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering (ICDE-08). IEEE Computer Society, Washington, DC, USA, 2008.
[31] B. Vickery. Faceted classification: A guide to construction and use of special schemes. Aslib - Asia Publishing House, London, 1960.
[32] K. Weinberger, M. Slaney, and R. van Zwol. Resolving tag ambiguity. In Proceedings of the 16th International ACM Conference on Multimedia (MM 2008). ACM Press, New York, NY, 2010.