When is the Structural Context Effective? Muhammad Ali Norozi Paavo Arvola Dept. of Computer and Information Science School of Information Sciences Norwegian University of Science and Technology University of Tampere Trondheim, Norway Tampere, Finland mnorozi@idi.ntnu.no paavo.arvola@uta.fi ABSTRACT formation is the context. The representation of the semi- Structural context surrounding the relevant information is structured documents aims to follow the established struc- intuitively and empirically considered important in informa- ture of documents, i.e., an academic book is typically com- tion retrieval. Utilizing this context in scoring has improved posed of hchaptersi, hsectionsi, hsubsectionsi etc., struc- the retrieval effectiveness. In this study we will objectively tures. hchapter1i is followed by hchapter2i and within look into the significance of the structural context in contex- hchapter1i, hsection1i is followed by hsection2i. Elements tualization process, and try to answer the core question of hsection1i and hsection2i are siblings, and hence most under which circumstances do we need to deal with the such likely, semantically related. The following element takes types of context? the concepts further from the preceding elements, and the preceding elements provide the basics or foundation for the Categories and Subject Descriptors following elements. Therefore, together in the document order, the preceding and following elements form a strong H.3.3 [Info. Search and Retrieval]: Search process and connected perspective (the kinship structural context), 1. INTRODUCTION surrounding the relevant information. Two general types of context can be distinguished based on the standard relation- Document parts, referred to as elements, have both a hier- ships. Hierarchical context, for one, refers to the ancestors, archical and a sequential relationship with each other. The whereas horizontal refers to the preceding and following el- hierarchical relationship is a partial order of the elements, ements [3]. In existing studies, context has been referred to which can be represented with a directed acyclic graph, or externally as the hyperlink structure of the elements as well. more precisely, a tree. In the hierarchy of a document, the The context is internal when it is considered from within the upper elements form the context of the lower ones. In ad- document, and it is external when it is considered outside dition to the hierarchical order, the sequential relationship the document(s). corresponds to the order of the running text. From this Contextualization [3] is a re-scoring model, where the ba- perspective, the context covers the surroundings of an ele- sic score, usually obtained from a full-text retrieval model, ment. An implicit chronological order of a document’s text of a contextualized document or element is re-enforced by is formed, when the document is read by a user. the weighted scores of the contextualizing documents or el- In focused retrieval, the use of context is a driving force ements (elements in the sub-tree of interest or structural to alleviate or “un-bias” the retrieval of items with varying context). In this section, we will formalize the context from length. Namely, information retrieval is based on evidence in and outside the document using contextualization model. of the retrievable units at hand, and longer text units have indeed more textual evidence. This has led to a play-safe 2.1 Structural Context strategy where the larger elements are favoured by retrieval Structural context is the sub-tree of interest from the hi- systems. How effective the context is to neutralize the side- erarchical tree structure of the semi-structured document. effects or bias because of size or length (smaller elements Internally, in hierarchical contextualization [3], the intrin- with less textual evidence gets same opportunity to satisfy sic tree structure within the XML document is employed. the users need), has been reported experimentally in many Structural context in hierarchical or vertical contextualiza- studies [1–3, 6, 9, 10, 8, 7]. The question asked here is: tion is the context based on parent-child relationship in doc- why the structural context is important in the retrieval of ument’s hierarchical structure. An element’s parent or an- focused items? In addition, we also ask if the use of context, cestors are accounted to be the structural context, while under certain circumstances (worst-case), would harm the contextualizing the element itself. The sub-tree of interest retrieval. This means if the context is poor or even mislead- is shown in Figure 1(a). Horizontal contextualization [3] ing. takes into account the sibling elements in the document’s 2. CONTEXT hierarchical structure as the structural context. If we visu- In semi-structured documents, context of an element cov- alize the document’s hierarchically tree structure, horizontal ers everything in the document excluding the element it- structural context is horizontal, as it is based on one level self. The surrounding items or elements of the relevant in- (the same level as the element to be contextualized) of the tree at a time (see Figure 1(b)). The most recent form of Copyright remains with the authors and/or original copyright holders. contextualization, the Kinship contextualization [7], is both DIR 2013, April 26, 2013, Delft, The Netherlands. horizontal (siblings) and vertical (ancestors & descendants article article article <1> <1> <1> header body header body header body <1.1> <1.2> <1.1> <1.2> <1.1> <1.2> title id revision categories sec sec sec title id revision categories sec sec sec title id revision categories sec sec sec <1.1.1> <1.1.2> <1.1.3> <1.1.4> <1.1.1> <1.1.2> <1.1.3> <1.1.4> <1.2.1> <1.2.2> <1.2.3> <1.1.1> <1.1.2> <1.1.3> <1.1.4> <1.2.1> <1.2.2> <1.2.3> <1.2.1> <1.2.2> <1.2.3> timestamp category st p st list st weblink timestamp category st p st list st weblink timestamp category st p st list st weblink <1.1.3.1> <1.1.4.1> <1.2.1.1> <1.2.1.2> <1.2.2.1> <1.2.2.2> <1.2.3.1> <1.2.3.2> <1.1.3.1> <1.1.4.1> <1.2.1.1> <1.2.1.2> <1.2.2.1> <1.2.2.2> <1.2.3.1> <1.2.3.2> <1.1.3.1> <1.1.4.1> <1.2.1.1> <1.2.1.2> <1.2.2.1> <1.2.2.2> <1.2.3.1> <1.2.3.2> b link entry entry entry b link entry entry entry <1.2.1.2.1> <1.2.1.2.2> <1.2.2.2.1> <1.2.2.2.2> <1.2.2.2.3> b link entry entry entry <1.2.1.2.1> <1.2.1.2.2> <1.2.2.2.1> <1.2.2.2.2> <1.2.2.2.3> <1.2.1.2.1> <1.2.1.2.2> <1.2.2.2.1> <1.2.2.2.2> <1.2.2.2.3> Out-links In-links (a) Hierarchical (b) Horizontal (c) Kinship (d) Citation Figure 1: Structural context, the sub-tree of interest, example taken from [7] elements) but intrinsically non-hierarchical perspective of didate to satisfy the user’s need and therefore, the elements the hierarchical information. Structural context is hence should be contextualized by the elements in the sub-tree both vertical and horizontal in the document’s hierarchical of interest. Hence, the good structural context contains a form, Figure 1(c). strong likelihood factor that should be used to deduce that And externally, in citation contextualization [8], the doc- the contextualized element is a good candidate for the posed ument’s hyperlink structure is taken in to account. The query. structural context here is based on the hyperlinks’ graph of The tree-structure of the XML document (Figure 1) is as- documents hyper-linking (connecting) one another in form sumed to be a graph. In order for the structural context to of inlinks (indegree) and outlinks (outdegree). In this case, take part in the contextualization process, each of the nodes the sub-graphs instead of tree of interest are the out-links in the sub-tree of interest should possess an impact factor. graph and the in-links graphs (see Figure 1(d)). Conceptually, the impact factor is produced in the follow- ing manner: Myriad of random surfers traverse the XML 2.2 Why Structural Context? graphs. In particular, at any time step a random surfer is Structural context is the essential component of the Con- found at an element and either (a) makes a next move to the textualization model [1]. With contextualization model, us- sub-element of the existing element by traversing the con- ing the structural context, the aim is to rank higher an el- tainment edge, or (b) makes a move to the parent-element ement in a good context (strong evidence in the structural of the existing element, or (c) jumps randomly to another context) than an identical element in a not so good context element in the XML graph. As the time goes on (the num- (less or no evidence in the structural context) within the ber iterations), the expected percentage of surfer at each document. And therefore, retrieve elements independent of node converges to a limit, the dominant eigenvector of the their sizes. A small element, in term of size, can be viewed XML graph. This limit provides the impact or strength of and hence scored in relation to its structural context, and each element in the structural context of the element to be its smaller size (which means having less evidence in total) contextualized, in the form of g function. All the elements, doesn’t stop it from being selected as one of the best results. in the structural context of the contextualized element, are In order to cope up with the “biasness” issue (described considered for contextualization; where the contextualiza- earlier), in contextualization model, the weight of a relevant tion vector g identifies the importance of each of the unit of element is adjusted by the basic weights of the elements in the structural context (Equation 1). the structural context (its contextualizing elements). In ad- dition to basic weights, each element in the structural con- 2.4 Generalized Combination Functions text of the contextualized element, should possess an impact The generalized re-ranking combination function based on factor. An higher impact factor shows the importance of the the random walk principle, which also captures the struc- contextualizing element and vice versa. The role and rela- tural context, can be formally defined as follows: tion of elements in the structural context are operationalized by giving the element a contextualizing weight. A contex- CR(x, f, Cx , g k ) = (1 − f ) · BS(x) + X tualization vector is defined to capture the impact factor BS(y) · g k (y) of each contextualizing element, and this contextualization y∈Cx vector is represented by a g function in Equation 1. f· X (1) g k (y) 2.3 Contextualization and Random Walks y∈Cx Random walk principle is employed, for contextualization, where to induce a similarity structure over the documents based • BS(x) is the basic score of contextualized element x on the containment and reverse-containment relationships (text-based score, e.g., tf · ief ) (element, sub-element and vice versa). Hence, these rela- • f is a parameter which determines the weight of the tionships affect the weight each element, in the structural context in the overall scoring. context, has in contextualization. • Cx is the kinship context surrounding the contextu- The premise is that good structural context (identified by alizing element x, i.e., Cx ⊆ structural context(x), random walk and the contextualization model [7]) provides ⊆, because only the structural context containing the evidence that an element in focused retrieval is a good can- query terms are considered. • g k (y) is the generalized contextualization vector based computing g k vector is feasible for a reasonably large XML on random walk, which gives the authority weight (the document collections. At the query time, the scores from impact) of y, the contextualizing elements (elements in g k vector and the basic scores are combined to produce an structural context) of x in the sub-tree of interest. overall ranking score, using Equation 1. In the generalized combination function given (Equation 1), 3. EFFECTS OF CONTEXTUALIZATION ON the contextualization force has to be parametrized. In our DIFFERENT TEST COLLECTIONS earlier work [7], the contextualization force was tuned and Structural context in the contextualization framework, is reported the values leading to best overall performance. In independent of the basic weighting scheme of the elements the parametrization process it was found that the optimal and it could be applied on the top of any query language, values of contextualization force f (from Equation 1) lies in retrieval systems and test collections. The effects of contex- the range, (f ∈ {.25,..., 2.50}). These optimal values for f tualization on different test collections have been observed are obtained by using cross-validation technique. A 68-fold2 in the existing studies. Contextualization model has been cross-validation (or complete cross-validation) technique has applied on the top of different and competitive baseline sys- been performed - by randomly partitioning the collection tems using a diverse set of test collections, e.g., semantically into 68 training and test samples based on the number of as- annotated Wikipedia collection from INEX 20091 , IEEE col- sessed topics. Of the 68 samples, a single sample is retained lection, and iSearch scientific collection [3, 7, 8]. In order to as the validation set for testing, and remaining 67 samples get the best possible baseline system, a data fusion was per- are used as training set. The cross-validation process is re- formed based on sum of normalized scores (CombSUM) [11] peated 68 times (for each fold), with each of 68 samples used and Reciprocal Ranking [4] of INEX 2009 submitted runs. exactly only once as validation set. These 68 independent In the experimental evaluation the retrieval effectiveness or unseen samples are then combined to produce a single or at different granularity levels were observed. Mainly, re- a set of estimations for parameter f . trieval effectiveness at paragraph, article and INEX’s fo- cused retrieval level selection has been observed. The ap- 3.2 Query Term Probabilities proaches were evaluated using the evaluation framework pro- If a relevant element does not contain any of the query vided by TREC and INEX evaluation initiatives. The re- term(s), it does not match to the query. Hence, in order ported results were shown to be promising using both TREC to retrieve such elements, some expansive methods, such as and INEX evaluation framework [3, 7]. contextualization, ought to be used. It seems obvious that, The focused task in INEX ad-hoc track is to retrieve in a relevant small element, the probability of occurrence most focused elements satisfying an information need with- of a query term is smaller than in a larger element. In or- out overlapping elements. An overlapping result list means der to demonstrate this lack of evidence on small elements, that the elements in the result list may have a descendant re- we calculated some posteriori probabilities for query term lationship with each other and share the same text content. occurrences in a relevant document (Rd ) and in a relevant For instance, in Figure 1 the hentryi element h1.2.2.2.1i paragraph (Rp , i.e., the relevant hpi elements from the XML and the hseci element h1.2.2i are overlapping. In the ex- graph), based on INEX 2009, 68 topics (title field) and their isting studies, in the focused retrieval task, the INEXs’ fo- relevance assessments. The probabilities are calculated as cused approach is followed, considering a result list where the fraction of relevant elements containing any query term, only one of the overlapping elements from each branch is or all query terms over all relevant elements of same kind. selected. This means that including the hseci element in The probability of occurrence of any query term (from the the results would mean excluding the entry element in the query Q) in a Rp and in a Rd respectively are: results or vice versa. ! ! Contextualization and the fusion approach as scoring meth- [ [ P q Rp = 0.847, P q Rd = 0.995 ods, however, do not take any stand on which elements q∈Q q∈Q should be selected from each branch. Thus a structural fu- sion has been performed, where the element level selection This means that the probability of occurrence of none of the is taken from the baseline run and subsequently re-rank the query terms in Rp and a Rd is 0.153 and 0.005 respectively3 . elements of the baseline run. Accordingly, the probabilities of occurrence of all the query terms in Rp and Rd , respectively are: 3.1 Test Settings Y P (q|Rp ) = 0.127, Y P (q|Rd ) = 0.469 The hierarchical structure of XML documents in the Wiki- q∈Q q∈Q pedia 2009 collection, are captured using the dewey encoding scheme (as shown in Figure 1). This way each element in the The difference in the amount of evidence at different gran- document possess a unique index within the document, and ularity levels become even more obvious, when we draw the together with document’s unique id, this becomes unique for frequencies of the query terms in this picture. A query term the entire collection. The tree structure of XML documents occurs on average 3.4 times in a Rp and 45.4 times in a Rd . are converted into a matrix, and random walk is performed 4. WORST CASE ANALYSIS on this matrix at indexing time, as it is described in detail, in Worst-case for a document d, in contextualization models, our earlier work [7]. The contextualization vector g k from means when structural context of element x is chosen such Equation 1 is computed off-line for each and every XML that: document in the Wikipedia collection. This suggests that structural context(x) ∈ / elementsy (d) (2) 1 (∀ elements y in document d where x and y ∈ d ) Wikipedia collection containing 2.66 million semantically 2 annotated XML documents (50, 7 Gb) and 68 related topics 68, because of the 68 topics from INEX 2009. 3 provided by the INEX 2009 ad-hoc track [5]. Test is performed without stemming or stop-word removal 1.0 worst-case-p CRhierarchical cally into the hypothesis that structural context gathered 0.9 Baselinep Fusion from within the document, “horizontally” and “vertically” 0.8 Baselinea Fusion using the hierarchical tree structure of document, and from 0.7 CRworst-case-a hierarchical outside, using the hyperlinks graph structure of documents 0.6 referencing each other, influences the retrieval effectiveness. Precision 0.5 Worst-case experiments also support the theoretical sound- 0.4 ness of contextualization, i.e., if we apply contextualization 0.3 blindly, in the worst case, we would have as good result 0.2 as the basic scoring method. The results obtained in this 0.1 study are in-line with the earlier work on contextualiza- 0.0 tion [1, 3, 6, 7, 9, 10]. In this study we have experimented 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 with semi-artificial data, in the sense that we muddled the Recall context for the worst-case analysis. However, in real data the Figure 2: Precision - recall, worst-case scenario at quality of context varies as well. For example in Wikipedia article (a) and paragraph (p) granulation and the there are different kinds of pages ranging from listings to top- fusion baseline systems. ically very coherent documents. In order to get the best re- sults in retrieval, analysing the quality and topical coherency of context would be of great benefit. The analysis of con- The non-structural context (Equation 2), should theoreti- text may be topic dependent, since some queries may have cally expose the worst-case effects of the contextualization contextual parts. For instance a query: “Losses Belgium in model. Non-structural context is structural by definition, WW2”, crave for answers about Belgium in the context of but physically not in the structural context of element x. WW2. How should we interpret the non-structural context, in or- der to experimentally visualize the worst-case scenario? In- 6. REFERENCES stead of taking the actual and true structural context, we randomly select the structural context from another non- [1] P. Arvola, M. Junkkari, and J. Kekäläinen. General- relevant but retrieved document. Such a document (re- ized Contextualization Method for XML Information trieved but not relevant) would have misleading evidence Retrieval. In Proc. of the 14th ACM CIKM, pages 20– (false positive) and hence best suited for the worst-case eval- 27. ACM, 2005. uation. Randomly selecting a document with zero basic [2] P. Arvola, J. Kekäläinen, and M. Junkkari. The Effect score would be trivial and not suitable for our purposes. of Contextualization at Different Granularity Levels in By applying this simplistic approach on every element to Content-oriented XML Retrieval. In Proc. of the 17th be contextualized, we can formulate the worst-case scenario. ACM CIKM, pages 1491–1492. ACM, 2008. We have used the reciprocal rank fusion approach (fusing [3] P. Arvola, J. Kekäläinen, and M. Junkkari. Contextu- 98 INEX 2009 runs) as the baseline system, for worst-case alization Models for XML Retrieval. Info. Processing analysis, which has been used before in our earlier work, find & Management, pages 1–15, 2011. further details from [8]: X 1 [4] G. Cormack, C. Clarke, and S. Buettcher. Reciprocal RRScore(e, q) = (3) rank fusion outperforms condorcet and individual rank k + rank(r, e, q) r∈R learning methods. In Proc. of the 32nd international where ACM SIGIR conference on Research and development • R is the set of runs (rankings) in information retrieval, pages 758–759. ACM, 2009. • and rank(r, e, q) returns the rank of element e as a [5] S. Geva, J. Kamps, M. Lethonen, R. Schenkel, J. Thom, result of query q in run r. and A. Trotman. Overview of the INEX 2009 ad-hoc • If e is not in the ranking, rank(r, e, q) is not defined track. Focused Ret. and Evaluation, pages 4–25, 2010. 1 and the outcome of k+rank(r,e,q) is 0. [6] Y. Mass and M. Mandelbrod. Component Ranking and • The parameter k is for tuning. Automatic Query Refinement for XML Retrieval. Ad- Figure 2 reveals the worst-case depiction of the contextual- vances in XML IR, pages 1–18, 2005. ization model. Not unexpectedly, the worst-case scenario is [7] M. A. Norozi, P. Arvola, and A. P. de Vries. Contex- as good as the baseline system, slightly better but not sig- tualization using hyperlinks and internal hierarchical nificant enough to be visible statistically. We can claim here structure of wikipedia documents. In Proc. of the 21st that, when the structural context is chosen randomly (hap- ACM CIKM, pages 734–743. ACM, 2012. hazardly), in the worst-case, the contextualization method [8] M. A. Norozi, A. P. de Vries, and P. Arvola. Contex- will not be worse than the basic scoring method. tualization from the Bibliographic Structure. In Proc. of the ECIR 2012 Workshop on Task-Based and Aggre- 5. CONCLUSIONS AND FUTURE WORK gated Search (TBAS2012), page 9, 2012. Structural context is the sub-tree of interest, utilized in [9] P. Ogilvie and J. Callan. Hierarchical Language Models conjunction with contextualization model, improves the re- for XML Component Retrieval. Advances in XML IR, trieval effectiveness. We have presented an exploratory and pages 269–285, 2005. theoretical study into the use of structural context from el- [10] G. Ramirez Camps. Structural Features in XML Re- ements in the hierarchical structure of information, to im- trieval. PhD thesis, SIKS, the Dutch Research School prove retrieval performance. We looked into the structural for Information and Knowledge Systems., 2007. context from document’s hierarchical structure internally, [11] J. A. Shaw and E. A. Fox. Combination of multiple and hyperlinks structure externally. We looked theoreti- searches. In The 2nd TREC. Citeseer, 1994.