<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>pest: Term-Propagation over Wiki Structures as Eigenvector Computation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Klara Weiand</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabian Kneißl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tim Furche</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>François Bry</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Informatics, University of Munich</institution>
          ,
          <addr-line>Oettingenstraße 67, D-80538 München</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present pest, a novel approach to approximate querying of structured wiki data that exploits the structure of that data to propagate term weights between related wiki pages and tags. Based on the pest matrix, eigenvectors representing the distribution of a term after propagation are computed. The result is an index that takes the document structure into account and can be used with standard document retrieval techniques. This article gives a detailed outline of the approach and presents first experimental results showing its viability.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Mary wants to get an overview of software projects in her company that are
written in Java and that make use of the Lucene library for full-text search.
According to the conventions of her company’s wiki, a brief introduction to each
software project is provided by a wiki page tagged with “introduction”.</p>
      <p>
        Thus, Mary enters the query for wiki pages containing “java” and “lucene”
that are also tagged with “introduction”. In the semantic wiki KiWi, this can be
achieved by the KWQL [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] query ci(java lucene tag(introduction)), where
ci indicates wiki pages, see Section 3.2.
      </p>
      <p>However, the results fall short of Mary’s expectations for two reasons that
are also illustrated in the sample wiki of Figure 1:</p>
      <p>(1) Some projects may not follow the wiki’s conventions (or the convention
may have changed over time) to use the tag “introduction” for identifying project
briefs. This may be the case for Document 5 in Figure 1. Mary could loosen her
query to retrieve all pages containing “introduction” (rather than being tagged
with it). However, in this case, documents that follow the convention are not
necessarily ranked higher than other matching documents.</p>
      <p>(2) Some projects use the rich annotation and structuring mechanisms of a
wiki to split a wiki page into sub-sections, as in the case of the description of
KiWi in Documents 1 and 2 from Figure 1, and to link to related projects or
technologies (rather than discuss them inline), as in the case of Document 4 and
5 in Figure 1. Such projects are not included in the results of the original query
at all. Again, Mary could try to change her query to allow keywords to occur in
sub-sections or in linked to documents, but such queries quickly become rather
complex (even in a flexible query language such as KWQL). Furthermore, this
solution suffers from the same problem as addressed above: documents following
the wiki’s conventions are not necessarily ranked higher than those only matched
due to the relaxation of the query.</p>
      <p>Fuzzy matching over words by means of, for example, stemming is an
established technique widely used in Information Retrieval applications such as web
search engines. Fuzzy matching over structure, however, has only recently been gaining
attention as the amount of (semi-)structured data on the web increases. When
a query explicitly imposes structural constraints on the selection, fuzzy matches
are also returned where the structural constraints hold only approximately (e.g.,
a direct link is approximated by a chain of links).</p>
      <p>
        In this article, we present pest, short for term-propagation using eigenvector
computation over wiki-structures, a novel approach to approximate or fuzzy
matching over structured data. pest is based on a unique technique for
propagating term weights (as obtained from a standard vector-space
representation of the documents) over the structure of a wiki using eigenvector
computation. The eigenvector computation is inspired by, but differs significantly from,
Google’s PageRank [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>In contrast to many other fuzzy matching approaches (see Section 2), pest
relies solely on modifying term weights in the document index and requires no
runtime query expansion, but can use existing document retrieval technologies
such as Lucene. Nevertheless, it is able to solve all of the problems described above in
the context of the semantic wiki KiWi.</p>
      <p>To illustrate how pest propagates term weights, consider again Figure 1. As
for PageRank, the “magic” of pest lies in its matrix, called the pest propagation
matrix, or pest matrix for short. The pest matrix is computed in two steps:
</p>
      <p>Doc.  5:  
“Lucene-­‐Java  Wiki”
“lucene” 0.5
“java” 0.4
“introduction” 0.1
0.8, 0.5
2.1:
search
(1) Weighted propagation graph: First, we extend and weight the graph
of the wiki pages and their tags: These insertions are used to enable direct
propagation between tags. Thus, we can configure how terms propagate between
tags of related pages independently from term propagation between the pages.</p>
      <p>The resulting graph for the sample wiki is shown in Figure 2. We have added
the tags 5:1 and 6:1 and containment edges from tags 1:1 and 1:2 to tag 2:1,
as well as link edges, e.g., from tag 6:1 to tags 1:1, 2:1, 3:1, and 3:2. In the
following, we assign edge weights based solely on the type of the edge (link,
containment, tagging).</p>
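      <p>As an illustration, type-based edge weighting amounts to a simple lookup table over edge and vertex types. The concrete numbers in the following sketch are hypothetical placeholders; the article derives its actual weights from the authors’ intuition (see Figure 2).</p>
      <preformat>
```python
# Sketch: type-based edge weights for the propagation graph.
# The numeric values below are hypothetical placeholders, not the
# weights used in the paper (those are set by intuition, cf. Figure 2).
EDGE_WEIGHTS = {
    # (edge type, source vertex type, target vertex type): (forward, backward)
    ("link",    "doc", "doc"): (0.5, 0.2),
    ("link",    "tag", "tag"): (0.3, 0.1),
    ("contain", "doc", "doc"): (0.8, 0.5),
    ("contain", "tag", "tag"): (0.4, 0.2),
    ("tagging", "doc", "tag"): (0.9, 0.6),
}

def edge_weight(edge_type, source_type, target_type):
    """Look up the (forward, backward) weight pair for an edge, depending
    only on the types involved, as in a type-weighted propagation graph."""
    return EDGE_WEIGHTS[(edge_type, source_type, target_type)]
```
      </preformat>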
      <p>(2) “Informed Leap”: The weighted propagation graph, however, does not
encode any information about the differences in term distribution in the
original nodes, but only information about the structure of the wiki graph. To
encode that information in the pest matrix, we use an “informed leap”: First, we
transpose the weighted adjacency matrix of the weighted propagation graph and
normalize it by the highest edge-weight sum over all documents (rather than for
each document individually) to preserve differences in overall edge weight
between documents. Second, the remaining probability together with a fixed leap
parameter (e.g., 30%) is used for an “informed leap” to an arbitrary node. The
probability to go to a specific node A in such an “informed leap” is not random
(as in the case of the original PageRank), but informed by the original term
weight distribution: A page with a high term weight for a term τ is more likely to
be the target of a leap than a page with a low term weight for τ.</p>
      <p>The resulting matrix is called the pest matrix Pτ for term τ. Note that it
must be computed for each term individually, but does not depend on the query.
A formal description of the pest matrix computation is given in Section 5.</p>
      <p>Finally, the eigenvectors of the pest matrix for each term are combined
to form the vector space representation (i.e., the document-term matrix) for the
wiki pages and their tags. Section 6 presents, as an example, the computation
and the resulting term weights for the wiki from Figure 1.</p>
      <p>Keyword queries can be evaluated on this representation with any of the
existing IR engines using a vector space model (e.g., Lucene). Queries mixing
keywords and structure require an engine capable of combining keyword matches
with structural constraints such as the KWQL engine.</p>
      <sec id="sec-1-1">
        <title>Contributions</title>
        <p>To summarize, pest improves on existing fuzzy matching approaches for
structured data (briefly summarized in Section 2) in the following aspects:
– It is based on a simple, but flexible model for structured content that
captures a wide range of knowledge management systems and applications. We
introduce the model in Section 4 and discuss how it can represent the core
concepts of the semantic wiki KiWi, briefly recalled in Section 3.1. We also
briefly recall KWQL (Section 3.2) to illustrate the need for a combination
of structure and keyword queries.
– The main contribution of pest is the pest matrix for propagating term
weights over structured data. The computation of that matrix for a given
graph of structured content is formalized in Section 5.</p>
        <p>The pest matrix allows the propagation of term weights at index time and
yields a modified vector space representation that can be used by any IR
engine based on the vector space model (e.g., Lucene).</p>
        <p>Section 6 gives an extended example of the pest matrix computation on the
sample wiki from Figure 1.
– We prove formally in Section 5.3 that any pest matrix has 1 as dominant
eigenvalue and that the power method converges with the corresponding
eigenvector if applied to a pest matrix.</p>
        <p>Though the results from Section 6 as well as further internal testing validate
the pest approach, there are a number of open issues summarized in Section 7.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Related Work: Fuzzy Matching on Structured Data</title>
        <p>pest differs from the majority of fuzzy matching approaches, including those
reviewed in the following, in two important ways:
– It is designed for graph-shaped data rather than purely hierarchical data as
most of the XML-based approaches discussed in the following.
– In essence, pest can be used with any information retrieval engine based on
the vector space model. The only modification to the evaluation process is the
computation of the actual vector space model. Otherwise existing technology
(such as Lucene or similar search engines) can be utilized. In particular, the
pest matrix is query independent and thus can be computed at index time.</p>
        <p>Before we consider specific approaches, it is worth recalling that fuzzy
matching (approaches that return not only strict matches, but also other results
which are relevant yet do not match the strict interpretation of the query) and
ranking are closely related. Though they do not have to be used in conjunction, this
is often the case, in particular to allow a fuzzy matching engine to differentiate
looser results from results that adhere more strictly to the query.</p>
        <p>While fuzzy matching is widely used in web search and other IR applications,
conventional query languages for (semi-)structured data such as XQuery, SQL
or SPARQL do not usually employ fuzzy matching or rank results. These
languages have been applied to probabilistic data, but this is a distinct area from
fuzzy querying: In probabilistic data management the data itself introduces
uncertainty, in fuzzy matching uncertainty is introduced under the assumption that
the user is also interested in matches that do not quite match her query.</p>
        <p>As the amount of structured web data increases and the semantic web
continues to emerge, the need for solutions that allow for layman querying of structured
data arises. Research has been dedicated to combining web querying and web
search and to introducing IR methods to querying, for example in the form of
extensions to conventional query languages, visual tools for exploratory search,
extension of web keyword search to include (some) structure and keyword search
over structured data. With the arrival of these techniques, the need for fuzzy
querying that does not apply solely to individual terms or phrases but takes the
data structure into account arises.</p>
        <p>
          Approximate matching on data structure has been researched mainly in the
context of XML data similarity [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. A wide body of work in this area can be
divided into three main classes of approaches:
        </p>
        <p>
          Tree edit distance: Tree edit distance approaches, e.g., [
          <xref ref-type="bibr" rid="ref1 ref10 ref2">10, 1, 2</xref>
          ] extend the
edit distance in such a way that not strings but trees are compared. A number
of types of edit operations may be applied repeatedly in order to transform one
XML document into another. The similarity between the documents can then
be quantified through a cost function taking into account the number of steps
and types of operations required.
        </p>
        <p>
          In contrast to pest, these approaches are hard to generalize to graph data,
require a relaxation loop at query time, and require the evaluation of a (often
quite considerable) number of relaxed queries whereas pest’s computation can
be performed entirely at index time. The last effect is slightly ameliorated by
novel top-k algorithms in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Also it is not obvious how different edge types, as
easily treated by pest, affect tree edit distance.
        </p>
        <p>
          Approximate tree matching: A small number of approaches modify
existing matching algorithms to introduce certain degrees of freedom. In [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], direct
relations in the query are allowed to be matched with indirect relations in the
document. In [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], a document is considered a good approximate match if it and
the query have few paths that are not common (a mismatch).
        </p>
        <p>
          Again, the contrast to pest lies (a) in the limitation to tree-shaped data
which would be hard to lift at least in the case of [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] due to the reliance on
paths and suffix trees and (b) in the need for a new query engine, where pest
can reuse existing information retrieval engines.
        </p>
        <p>Adapting the vector space model: Finally, the largest class of approaches
aims, like pest, to adapt the vector space model, a well-established IR technique,
to the application on XML data. In the vector space model, documents and
queries are represented as vectors of weights for each term; similarity is computed
as the cosine angle between two vectors.</p>
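        <p>For concreteness, the cosine similarity underlying the vector space model can be sketched in a few lines; the toy vocabulary and weights below are invented for illustration only.</p>
        <preformat>
```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)

# Hypothetical document and query over the vocabulary
# ("java", "lucene", "introduction"):
doc   = [0.8, 0.0, 0.2]
query = [1.0, 1.0, 0.0]
score = cosine(doc, query)   # ranking score of doc for this query
```
        </preformat>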
        <p>
          Pokorny et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] represent paths and terms in an XML tree in a matrix
instead of a vector, assigning weights to each combination of path and term. A
query, also expressed as an XML tree, is transformed into a matrix of the same
form. The score of a query with respect to a possible result is then calculated as
the correlation between the two matrices. In an extension, the matrix is adapted
to reflect also the relationship between paths.
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] (and similarly [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]) document vectors are modified such that their
elements are not weights for terms but rather weights for term and context pairs—
the context of a term is the path in which it occurs. The vector then consists of a
weight for each combination of term and context. Further, the cosine similarity
measure is relaxed by computing context similarities which are integrated in the
vector similarity measure.
        </p>
        <p>
          Similarly, [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and, later, [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] use tree embeddings combined with a vector
space representation of XML elements.
        </p>
        <p>
          Activation propagation is used in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] for fuzzy matching over structure. Here,
a modified version of the vector space model is used to calculate similarity scores
between query terms and textual nodes in the data. The calculation of term
weights takes into account the structural context of a term as well as its
frequency. In a second step, these scores are propagated up in the tree. Finally,
the highest activated nodes are selected, filtering out some results which are
considered to be unsuitable such as the descendants of results that have already
been selected. This approach resembles ours in that activation propagation is
used to realize approximate matching over structure. However, in this approach,
propagation happens upon query evaluation and is unidirectional. Like the other
approaches in this class, it is also limited to tree-shaped data.
        </p>
        <p>
          Outside of XML, one widely-used method where structural relationship is
used for fuzzy matching is the use of anchor-tags in web search [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The anchor
text of a link to a web page is treated as if it was part of the text of that web
page even if it does not appear there. However, the application of this approach
is limited to anchor tags and does not apply to general graphs or generalize to
different link types.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <sec id="sec-2-1">
        <title>KiWi</title>
        <p>KiWi (http://www.kiwi-project.eu; showcase at http://showcase.kiwi-project.eu/)
is a semantic wiki with extended functionality in the areas of information
extraction, personalization, reasoning, and querying. KiWi relies on a simple,
modular conceptual model consisting of the following building blocks:</p>
        <p>Content Items are composable wiki pages, the primary unit of information
in the KiWi wiki. A content item consists of text or multimedia and an optional
sequence of contained content items. Content item containment thus provides
a conventional structuring of documents; for example, a chapter may consist
of a sequence of sections. For reasons of simplicity, content item containment
precludes any form of overlapping or of cycles, and thus a content item can be
seen as a directed acyclic graph (of content items). Links are simple hypertext
links and can be used for relating content items to each other or to external web
sites.</p>
        <p>Annotations are meta-data that can be attached to content items and links,
describing their content or properties. They can be added by users, but can also
be created by the system through automatic reasoning. Though KiWi supports
various types of annotations ranging from informal, freely chosen tags, to
semiformal tags selected from a pre-defined vocabulary, to RDF triples and
relationships from an ontology, we consider only tags consisting of phrases (one or
several words) in this paper.</p>
        <p>To illustrate these concepts, consider again Figure 1: It shows a sample KiWi
wiki using the above structuring concepts (for sake of familiarity, we call content
items documents). For example, the content item (document) 1 “About KiWi”
contains the content item 2 representing a section on “Search in KiWi” and is
linked to by the content item 6 “Guided Tour”. It is tagged with 1:1 “introduction”
and 1:2 “architecture”.</p>
        <p>Structure, within as well as between resources, thus plays an important role
for expressing knowledge in the wiki, ranging from simple tags to complex graphs
of links or content item containment.</p>
      </sec>
      <sec id="sec-2-2">
        <title>KWQL</title>
        <p>
          KWQL [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], KiWi’s label-keyword query language [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], allows for combined queries
over full-text, annotations and content structure, fusing approaches from
conventional query languages with information retrieval techniques for search.
        </p>
        <p>KWQL aims to make data contained in a Semantic Wiki accessible to all
users—not only those who have experience with query languages. Queries have
little syntactic overhead and aim at being only as complex as necessary. The
query language is designed to be close to the user experience, allowing queries
over the elements of the conceptual model described in the previous section.</p>
        <p>Further, KWQL has a flat learning curve and the complexity of queries
increases with the complexity of the user’s information need. Simple KWQL queries
consist of a number of keywords and are no more complicated to write than search
requests in web search engines. On the other hand, advanced KWQL queries can
impose complex selection criteria and even reformat and aggregate the results
into new wiki pages, giving rise to a simple form of reasoning.</p>
        <p>Some examples of KWQL queries are given in the following table:
Java: Content items containing “java” directly or in any of its tags
or other meta data.
ci(author:Mary): Content items authored by Mary (using author meta-data).
ci(Java OR (tag(XML) AND author:Mary)): Content items that either contain “java” or have a tag
containing “XML” and are authored by Mary.
ci(tag(Java) link(target:ci(Lucene))): Content items with a tag containing “java” that contain a link
to a content item containing “lucene”.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>A Formal Model for Wiki Content: Content Graphs</title>
      <p>In this section we formally define a generic graph-based model of structured
content that is capable of capturing the rich knowledge representation features
of KiWi.</p>
      <p>2T
Definition 1 (Content graph). A content graph is a tuple G = (Vd; Vt; El;
En; T ; wt) where Vd and Vt are sets of vertices and El; En (Vd [ Vt) (Vd [ Vt).
Vd and Vt represent documents (content items) and tags. El and En describe the
directed linking and nesting among documents and tags.</p>
      <p>The textual content of documents and tags is represented by a set T of terms
and a function wt : (Vd [ Vt) T ! R that assigns a weight to each pair of
a vertex and a term. We assume that the term weights for each vertex v are a
stochastic vector (i.e., P wt(v; ) = 1).</p>
      <p>We denote the type of an edge e with type(e) 2 fl; ng and the type of a
vertex v with type(v) 2 fd; tg.</p>
      <p>The above is an instance of a generic model that allows for an arbitrary
number of vertex and edge sets for flexible typing. Tags can be used to
represent any property of a document other than its textual content. Here, we limit
ourselves to two vertex and edge types each for the sake of clarity. The model allows
different types of links and nestings to exist depending on the types of linked
and nested nodes. For example, an edge in El ∩ (Vd × Vt) represents a link from
a document to a tag, whereas an edge in El ∩ (Vd × Vd) represents a link between
documents.</p>
      <p>For the sample wiki from Figure 1, the six documents 1 to 6 form Vd, Vt =
{1:1, 1:2, 2:1, 3:1, 3:2, 4:1}, El = {(6, 1), (6, 3), (4, 3), (1, 1:1), (1, 1:2), …, (4, 4:1)},
En = {(1, 2)}, T is the set of all terms in the wiki, and wt = {(1, “java”, 0.8), …,
(2:1, “search”, 1), …}.</p>
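      <p>This sample instantiation can be sketched directly as Python data. Only wt(1, “java”) = 0.8 and wt(2:1, “search”) = 1 come from the text; the remaining term weight below is a hypothetical filler added so that each vertex’s term weights stay stochastic.</p>
      <preformat>
```python
# Sample content graph from Figure 1 as plain Python data (a sketch).
V_d = {"1", "2", "3", "4", "5", "6"}                 # documents
V_t = {"1:1", "1:2", "2:1", "3:1", "3:2", "4:1"}     # tags
E_l = {("6", "1"), ("6", "3"), ("4", "3"),
       ("1", "1:1"), ("1", "1:2"), ("4", "4:1")}     # link/tagging edges (excerpt)
E_n = {("1", "2")}                                   # nesting edges

# Only (1, "java") = 0.8 and (2:1, "search") = 1 are given in the text;
# the 0.2 for (1, "lucene") is a hypothetical filler.
w_t = {("1", "java"): 0.8, ("1", "lucene"): 0.2,
       ("2:1", "search"): 1.0}

def is_stochastic(vertex, weights):
    """Check that the term weights of a vertex sum to 1 (up to rounding)."""
    total = sum(w for (v, _), w in weights.items() if v == vertex)
    return round(total, 9) == 1.0
```
      </preformat>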
      <p>Nestings of tags in documents, En ∩ (Vd × Vt), do not occur in our model of
a semantic wiki, but may do so in other applications.</p>
    </sec>
    <sec id="sec-4">
      <title>Computing the pest Propagation Matrix</title>
      <p>Based on the above model for a knowledge management system, we now formally
define the propagation of term-weights over structural relations represented in a
content graph by means of an eigenvector computation.</p>
      <p>A document’s tag is descriptive of the content of the text of said content
item—they have a close association. Similarly, the tags of a sub-document to
some extent describe the parent document since the document to which the
tag applies is, after all, a constituent part of the parent document. More
generally, containment and linking in a wiki or another set of documents indicate
relationships between resources. We suggest to exploit these relationships for
approximate matching over data structure by using them to propagate resource
content. A resource thereby is extended by the terms contained in other resources
it is related to. Then, standard information retrieval engines based on the vector
space model can be applied to find and rank results oblivious to the underlying
structure or term-weight propagation.</p>
      <p>To propagate term weights along structural relations, we use a novel form
of transition matrix, the pest propagation matrix. In analogy to the random
surfer of PageRank, the term-weight propagation can be explained in terms
of a semi-random reader who is navigating through the content graph looking
for documents relevant to his information need expressed by a specific term τ
(or a bag of such terms). He has been given some (incomplete) information on
where τ occurs literally in the graph. He starts from one of the nodes and reads
on, following connections to find other documents that are also relevant for his
information need (even if they do not literally contain τ). When he becomes
bored or loses confidence in finding more matches by traversing the structure of
the wiki (or knowledge management system, in general), he jumps to another
node that seems promising and continues the process.</p>
      <p>To encode this intuition in the pest matrix, we first consider which
connections are likely to lead to further matches by weighting the edges occurring in
a content graph. Let H be the transposed, normalized adjacency matrix of the
resulting graph. Second, we discuss how to encode, in the leap matrix Lτ, the
jump to a promising node for the given term τ (rather than to a random node
as in PageRank).</p>
      <p>The overall pest matrix Pτ is therefore computed as (where α is the leap
factor)</p>
      <p>Pτ = (1 − α) · H + Lτ</p>
      <p>Each entry mi,j ∈ Pτ, that is, the probability of transitioning from vertex
j to vertex i, is thus determined primarily by two factors: the normalized edge
weight of any edge from j to i, and the term weight of τ in j.
To be able to control the choices the semi-random reader performs when following
edges in the content graph, we first extend the content graph with a number of
additional edges and vertices and, second, assign weights to all edges in that
graph.</p>
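      <p>The combination Pτ = (1 − α) · H + Lτ can be sketched numerically. The small H and Lτ below are hypothetical; Lτ is chosen so that each of its columns j sums to α + (1 − α)(1 − Σi Hi,j), which is what makes the resulting Pτ column-stochastic.</p>
      <preformat>
```python
import numpy as np

# Sketch of the pest matrix combination (Definition 3):
# P_tau = (1 - alpha) * H + L_tau, with hypothetical example matrices.
def pest_matrix(H, L_tau, alpha):
    return (1 - alpha) * np.asarray(H, dtype=float) + np.asarray(L_tau, dtype=float)

alpha = 0.3
H = np.array([[0.0, 0.5],
              [1.0, 0.0]])          # sub-stochastic transition matrix
# Hypothetical leap matrix: column j sums to alpha + (1 - alpha) * (1 - sum_i H[i, j])
L_tau = np.array([[0.15, 0.325],
                  [0.15, 0.325]])
P_tau = pest_matrix(H, L_tau, alpha)   # column-stochastic, strictly positive
```
      </preformat>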
      <sec id="sec-4-1">
        <title>Definition 2 (Weighted propagation graph)</title>
        <p>A weighted propagation graph is a content graph extended with a function we : (El ∪ En) → R² for
assigning weights to edges that fulfills the following conditions:
– For each document v ∈ Vd, there is a tag tv ∈ Vt with (v, tv) ∈ El.
– For each pair of documents v, w ∈ Vd with (v, w) ∈ El (En), if tv and tw are
tags of v and w respectively, then there is an edge (tv, tw) ∈ El (En).</p>
        <p>Edge weights are given as pairs of numbers, one for traversing the edge in its
direction, one for traversing it against its direction.</p>
        <p>The first condition requires that each document must be tagged by at least
one tag. The second condition ensures that tags of related documents are not
only related indirectly through the connection between the documents, but also
stand in a direct semantic relation. For example, a document which contains
another document about a certain topic trivially also is about that topic to
some extent, since one of its constituent parts is.</p>
        <p>Proposition 1. For every content graph, a weighted propagation graph can be
constructed by (1) adding an empty tag (“dummy tag”) to each document that is
not tagged at all and (2) copying any relation between two documents to its tags
(if not already present).</p>
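        <p>The construction of Proposition 1 is mechanical enough to sketch in a few lines. The function name and the dummy-tag naming scheme (document id plus “:1”) are our own illustration, not prescribed by the article.</p>
        <preformat>
```python
def to_propagation_graph(V_d, V_t, E_l, tags_of):
    """Sketch of Proposition 1: add an empty 'dummy tag' to every untagged
    document and copy relations between documents to their tags."""
    V_t = set(V_t)
    E_l = set(E_l)
    tags_of = {d: set(ts) for d, ts in tags_of.items()}
    # (1) Add a dummy tag to each document that is not tagged at all.
    for d in V_d:
        if not tags_of.get(d):
            dummy = d + ":1"
            V_t.add(dummy)
            E_l.add((d, dummy))
            tags_of[d] = {dummy}
    # (2) Copy any relation between two documents to their tags.
    for (u, v) in list(E_l):
        if u in V_d and v in V_d:
            for tu in tags_of[u]:
                for tv in tags_of[v]:
                    E_l.add((tu, tv))
    return V_t, E_l
```
        </preformat>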
        <p>Consider again the sample wiki from Figure 1, the resulting weighted
propagation graph is shown in Figure 2. It contains two “dummy tags” (5:1 and 6:1)
as well as a number of added edges between tags of related documents.</p>
        <p>We call a weighted propagation graph type-weighted if, for any two edges
e1 = (v1, w1), e2 = (v2, w2) ∈ El ∪ En, it holds that, if type(e1) = type(e2),
type(v1) = type(v2), and type(w1) = type(w2), then we(e1) = we(e2). In other
words, the weights of edges with the same type and with start and end vertices
of the same type respectively must be the same in a type-weighted propagation
graph. In the following, we only consider such graphs.</p>
        <p>Let Aw be the weighted adjacency matrix of a weighted propagation graph
G. Then we normalize and transpose Aw to obtain the transition matrix H for
G as follows:</p>
        <p>H = (1 / maxj Σi we((i, j))) · AwT</p>
        <p>Note that we normalize the columns for all vertices by the same maximum sum
of outgoing edge weights. This preserves differences in weights between nodes
with the same number of outgoing edges, but also yields only a sub-stochastic
matrix.</p>
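        <p>This normalization can be sketched with NumPy; the weighted adjacency matrix below is a hypothetical example with rows as source vertices and columns as targets.</p>
        <preformat>
```python
import numpy as np

def transition_matrix(A_w):
    """Sketch: transpose the weighted adjacency matrix and divide by the
    largest outgoing edge-weight sum over ALL vertices (not per vertex),
    so the result H is in general only sub-stochastic."""
    A_w = np.asarray(A_w, dtype=float)
    max_out = A_w.sum(axis=1).max()   # highest edge-weight sum of any vertex
    return A_w.T / max_out
```
        </preformat>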
      </sec>
      <sec id="sec-4-2">
        <title>Informed Leap</title>
        <p>Given a leap factor α ∈ (0, 1], a leap from vertex j occurs with probability</p>
        <p>P(leap|j) = α + (1 − α)(1 − Σi Hi,j)</p>
        <p>A leap may be random or informed. In a random leap, the probability
of jumping to some other vertex is uniformly distributed and calculated as
lrnd(i, j) = 1 / |Vd ∪ Vt| for each pair of vertices (i, j).</p>
        <p>An informed leap, by contrast, takes the term weights, that is, the prior
distribution of terms in the content graph, into account. It is therefore term-dependent
and given as linf(i, j) = wt(i, τ) / Σk wt(k, τ) for a τ ∈ T.</p>
        <p>In preliminary experiments, a combination of random and informed leap,
with a heavy bias towards the informed leap, proved to give the most desirable
propagation behavior. The overall leap probability is therefore distributed
between that of a random leap and that of an informed leap according
to the factor ρ ∈ (0, 1], which indicates which fraction of leaps are random leaps.</p>
        <p>Therefore, we obtain the leap matrix Lτ for term τ as</p>
        <p>Lτ = ( P(leap|j) · ((1 − ρ) · linf(i, j) + ρ · lrnd(i, j)) )i,j</p>
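        <p>Putting the two kinds of leap together, the leap matrix can be sketched as follows; the transition matrix H, the term-weight vector, and the factors α and ρ are assumed inputs, and the example values in the usage are hypothetical.</p>
        <preformat>
```python
import numpy as np

def leap_matrix(H, wt_tau, alpha, rho):
    """Sketch of the leap matrix L_tau: the leap probability per column,
    P(leap|j) = alpha + (1 - alpha) * (1 - sum_i H[i, j]), is distributed
    between an informed leap (proportional to the term weights wt_tau)
    and a uniform random leap, according to the random-leap factor rho."""
    H = np.asarray(H, dtype=float)
    n = H.shape[0]
    wt_tau = np.asarray(wt_tau, dtype=float)
    p_leap = alpha + (1 - alpha) * (1 - H.sum(axis=0))  # one value per column j
    l_inf = wt_tau / wt_tau.sum()                       # informed leap profile
    l_rnd = np.full(n, 1.0 / n)                         # uniform random leap
    leap_col = (1 - rho) * l_inf + rho * l_rnd          # sums to 1 over i
    return np.outer(leap_col, p_leap)                   # L[i, j]
```
        </preformat>
        <p>By construction, adding (1 − α) · H to this matrix yields a column-stochastic pest matrix.</p>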
      </sec>
      <sec id="sec-4-3">
        <title>Properties of the pest Matrix</title>
        <p>Definition 3 (pest matrix). Let α ∈ (0, 1] be a leap factor, H be the
normalized transition matrix of a given content graph (as defined in Section 5.1), and
Lτ the leap matrix (as defined in Section 5.2) to H and term τ with random leap
factor ρ ∈ (0, 1]. Then the pest matrix Pτ is the matrix Pτ = (1 − α) · H + Lτ.</p>
        <p>Theorem 1. The pest matrix Pτ for any content graph and term τ is
column-stochastic and strictly positive (all entries &gt; 0).</p>
        <p>Proof. It is easy to see that Pτ is strictly positive as both α and ρ are &gt; 0 and
thus there is a non-zero random leap probability from each vertex to each other
vertex.</p>
        <p>P_τ is column stochastic, as for each column j:</p>
        <p>Σ_i (P_τ)_i,j = Σ_i ((1 − α) · H_i,j + (L_τ)_i,j)
= (1 − α) · Σ_i H_i,j + P(leap | j) · ((1 − ρ) · Σ_i l_inf(i, j) + ρ · Σ_i l_rnd(i, j))
= (1 − α) · Σ_i H_i,j + (α + (1 − α) · (1 − Σ_l H_l,j)) · ((1 − ρ) · 1 + ρ · 1)
= (1 − α) · Σ_i H_i,j + α + (1 − α) − (1 − α) · Σ_l H_l,j
= 1,
where we use that Σ_i l_inf(i, j) = 1 and Σ_i l_rnd(i, j) = 1.</p>
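        <p>To make Definition 3 and Theorem 1 concrete, the following sketch (in Python with NumPy; the function name and the toy edge and term weights are ours for illustration, not part of the pest prototype) assembles a leap matrix L_τ and the pest matrix P_τ = (1 − α) · H + L_τ for one term, assuming the term occurs with non-zero weight somewhere in the graph:</p>

```python
import numpy as np

def pest_matrix(H, tw, alpha=0.3, rho=0.25):
    """Sketch of P_tau = (1 - alpha) * H + L_tau.

    H     : (n, n) column-normalized transition matrix; columns may sum to less than 1
    tw    : length-n prior term-weight vector for the term (assumed not all zero)
    alpha : leap factor in (0, 1]
    rho   : random leap factor in (0, 1]
    """
    n = H.shape[0]
    # Leap probability per source vertex j: absorbs the column mass missing from H.
    p_leap = alpha + (1 - alpha) * (1 - H.sum(axis=0))        # shape (n,)
    l_rnd = np.full((n, n), 1.0 / n)                          # uniform random leap
    l_inf = np.tile((tw / tw.sum())[:, None], (1, n))         # informed leap, same for every column
    L = p_leap * ((1 - rho) * l_inf + rho * l_rnd)            # leap matrix L_tau
    return (1 - alpha) * H + L

# Tiny 3-vertex content graph; the weights are made up for illustration.
H = np.array([[0.0, 0.4, 0.3],
              [0.5, 0.0, 0.3],
              [0.2, 0.3, 0.0]])
tw = np.array([0.8, 0.0, 0.2])
P = pest_matrix(H, tw)
assert np.allclose(P.sum(axis=0), 1.0)   # column stochastic (Theorem 1)
assert (P > 0).all()                     # strictly positive (Theorem 1)
```

        <p>The assertions mirror Theorem 1: the leap contribution exactly compensates the column mass that (1 − α) · H leaves unused.</p>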
        <p>Corollary 1. The pest matrix P_τ has eigenvalue 1 with a unique eigenvector p_τ for each term τ.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Computing the pest Vector</title>
        <p>The resulting eigenvector p_τ gives the new term weights for τ in the vertices
of the content graph after term-weight propagation. It can be computed, e.g.,
using the power method (which is guaranteed to converge due to Theorem 1).</p>
        <p>The vector space representation of the content graph after term-weight
propagation is the document-term matrix using the propagation vectors p_τ for each
term τ as columns.</p>
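        <p>As an illustration of this computation, the following sketch (in Python with NumPy; generic code, not the pest implementation) runs the power method on a small matrix that stands in for the pest matrix of one term:</p>

```python
import numpy as np

def power_method(P, tol=1e-12, max_iter=1000):
    """Dominant eigenvector of a column-stochastic, strictly positive matrix P.

    For such a matrix, eigenvalue 1 dominates, so the iteration p = P @ p
    converges to the unique (up to scale) positive eigenvector.
    """
    n = P.shape[0]
    p = np.full(n, 1.0 / n)              # start from the uniform distribution
    for _ in range(max_iter):
        p_next = P @ p                   # one propagation step; the sum stays 1
        if np.allclose(p_next, p, atol=tol):
            return p_next
        p = p_next
    return p

# Toy column-stochastic, strictly positive stand-in for a pest matrix.
P = np.array([[0.6, 0.2, 0.3],
              [0.3, 0.5, 0.3],
              [0.1, 0.3, 0.4]])
p_tau = power_method(P)
assert np.allclose(P @ p_tau, p_tau)     # fixed point: eigenvector for eigenvalue 1
assert np.isclose(p_tau.sum(), 1.0)      # still a probability distribution
```

        <p>Because the matrix is column stochastic, each iteration preserves the vector's sum, so no re-normalization step is needed.</p>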
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Structure Propagation with pest Matrix: An Example</title>
      <p>In order to confirm that the described propagation approach performs as
expected, a prototype of the pest matrix construction has been implemented,
and experiments computing the resulting vector space representation after
term-weight propagation have been conducted. The implementation is
available from http://www.pms.ifi.lmu.de/pest.</p>
      <p>Here, we present the results for the sample wiki from Figure 1. We use a leap
factor of α = 0.3 and a random leap factor of ρ = 0.25. Using these factors,
the pest matrix is computed for each term τ ∈ {“java”, “lucene”, …}. The edge
weights are chosen by the authors' intuition, as shown in Figure 2.</p>
      <p>The resulting matrix for the term “java” is shown in Table 1, omitting
Documents 5 and 6 and their tags for space reasons.</p>
      <p>Note that the matrix contains high probabilities for propagation to Documents 1 and 4
throughout, thanks to the informed leap. This preserves their higher term weight
for “java” compared to other vertices that do not contain “java”.</p>
      <p>Using the pest matrix, we compute for each term τ the resulting pest vector
p_τ. Together, these vectors form a new document-term matrix representing the
documents and tags in our wiki, but now with propagated term weights, as
shown in Table 2.</p>
      <p>To assess the viability of our approach, let us consider a number of desirable
properties that an approach to fuzzy matching on a structured knowledge
management system such as KiWi should exhibit:
1. Documents containing a term τ directly (e.g., “java”) with a significant term
weight should still be ranked highly after propagation. This should hold to
guarantee that direct search results (that would have been returned without
fuzzy matching) are retained.</p>
      <p>Indeed, Documents 1, 4, and 5, all containing “java”, are ranked highest for
that term, though the tags of Document 1 come fairly close. This is desired,
as Document 1 contains “java” with high term weight and tag-document
associations are among the closest relations.
2. A search for a term τ should also yield documents not containing τ but
directly connected to ones containing it. Their rank should depend on the
weight of τ in the connected document and the type (and thus propagation
strength) of the connection.</p>
      <p>Again, looking just at the results for “java”, the two tags of Document 1
as well as the contained Document 2 receive considerable weight for the term
“java”.
3. Searching for a KWQL query such as ci(architecture introduction) should
also rank highly documents that do not include these terms, but that are
tagged with “architecture” and “introduction”.</p>
      <p>Document 1 is such a case and is indeed the next highest ranked document
for such a query after the three documents directly containing “architecture”
or “introduction” (using either boolean or cosine similarity).</p>
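      <p>The rankings discussed in these checks can be reproduced with any standard vector-space machinery. As a sketch (in Python with NumPy; the propagated document-term weights below are invented for illustration, not the values from Table 2), cosine similarity against a two-term query looks as follows:</p>

```python
import numpy as np

# Rows: documents, columns: the terms "architecture" and "introduction".
# The post-propagation weights are invented for illustration.
docs = ["doc-tagged-with-both", "doc-with-introduction", "doc-with-architecture"]
D = np.array([[0.30, 0.25],
              [0.05, 0.40],
              [0.50, 0.02]])

def cosine_rank(doc_names, D, q):
    """Rank documents by cosine similarity between their rows in D and query q."""
    sims = (D @ q) / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))
    return sorted(zip(doc_names, sims), key=lambda pair: -pair[1])

q = np.array([1.0, 1.0])                 # query asking for both terms
ranking = cosine_rank(docs, D, q)
assert ranking[0][0] == "doc-tagged-with-both"
```

      <p>As in the discussion above, the document with balanced weight in both query terms outranks documents strong in only one of them.</p>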
      <p>Though this evaluation can, by design, only illustrate the effectiveness of the
proposed term-weight propagation approach for fuzzy matching, we believe that
it is a strong indication that it will prove efficient and effective also for larger
and more diverse document collections.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Open Questions</title>
      <p>pest is a unique approach to fuzzy matching that combines the principles of
structural relevance from approaches such as PageRank with the standard vector
space model. Its particular strength is that it runs entirely at index time and
results in a modified vector space representation.</p>
      <p>However, the present paper is just the first step in exploring the potential
and research issues on term-weight propagation as eigenvector computation over
structured data.</p>
      <p>First, and most obviously, an extensive experimental evaluation of the approach,
including a comparison with existing methods, is called for. In particular, we are
currently estimating the values for α and ρ, as well as the edge weights, “by the
seat of our pants” rather than from empirical observation. A guide to choosing these
values might be derived from studying the behavior of pest on various
kinds of data. The edge weights, in particular, could also be amenable to various
machine learning approaches, using, for example, average semantic relatedness
as a criterion, or to semi-automatic approaches based on user feedback.</p>
      <p>
        We have also considered a number of different algorithmic approaches to
term-weight propagation, e.g., where propagation is not based on convergence
but on a fixed number of propagation steps. Techniques for spreading activation
[
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] might be applicable and a comparison study is called for. Furthermore, the
computation of the pest matrix is just one of several alternatives for finding a
stochastic propagation matrix.
      </p>
      <p>There are also a number of specific areas for improving pest:
1. In pest, propagation between documents and propagation between tags and
documents influence each other: e.g., a document with many tags will propagate
a relatively smaller amount to its children than a document with few.
For extreme cases, a model in which each of these kinds of propagation is
guaranteed at least a minimal amount might prove superior to the basic version of
pest described here.</p>
      <p>2. The model in this paper does not address the representation of tagged
links. One simple way to do this would be to represent a tagged link between
two documents as a direct link and in addition a tag that is connected via links
to both documents. Alternatively, typed links could be introduced. They create
the possibility of dynamically determining the weight of a connection based
on the link type and term being propagated, for example depending on their
semantic similarity as determined through their Google distance or distance in
an ontology.</p>
      <p>3. Links to external resources such as Linked Open Data or ontologies are
currently not considered in pest. Including them would enrich the
content graph and thereby enhance the results of term propagation. This extension
seems particularly promising in combination with the aforementioned typed links.</p>
      <p>4. Another, wiki-specific, extension is observing how the term scores of a
document change over several revisions and taking this into account as a factor
when ranking query answers.</p>
      <p>5. Any fuzzy matching approach suffers from non-obvious explanations for
returned answers: under a boolean query semantics, the answer is
obvious, but when term propagation is used, a document might be a highly ranked
query result without so much as containing any query terms directly. In this
case, providing an explanation, for example that the document in question is
closely connected to many documents containing query terms, makes the
matching process more transparent to users. However, automatically computing good,
minimal explanations is far from a solved problem.</p>
      <sec id="sec-6-1">
        <title>Acknowledgments</title>
        <p>The research leading to these results is part of the project “KiWi—Knowledge
in a Wiki” and has received funding from the European Community’s Seventh
Framework Programme (FP7/2007-2013) under grant agreement No. 211932.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>S.</given-names>
            <surname>Amer-Yahia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cho</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>Tree pattern relaxation</article-title>
          .
          <source>In EDBT</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>S.</given-names>
            <surname>Amer-Yahia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. V. S.</given-names>
            <surname>Lakshmanan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Pandit</surname>
          </string-name>
          .
          <article-title>FleXPath: flexible structure and full-text querying for XML</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>V. N.</given-names>
            <surname>Anh</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          .
          <article-title>Compression and an IR approach to XML retrieval</article-title>
          .
          <source>In INEX Workshop</source>
          , pages
          <fpage>99</fpage>
          -
          <lpage>104</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>S.</given-names>
            <surname>Brin</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Page</surname>
          </string-name>
          .
          <article-title>The anatomy of a large-scale hypertextual web search engine</article-title>
          .
          <source>In WWW</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>F.</given-names>
            <surname>Bry</surname>
          </string-name>
          and
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Weiand</surname>
          </string-name>
          .
          <article-title>Flavors of KWQL, a keyword query language for a semantic wiki</article-title>
          .
          <source>In SOFSEM</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>D.</given-names>
            <surname>Carmel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Maarek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mass</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Efraty</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Landau</surname>
          </string-name>
          .
          <article-title>An extension of the vector space model for querying XML documents via XML fragments</article-title>
          .
          <source>In SIGIR Workshop on XML and Information Retrieval</source>
          , pages
          <fpage>14</fpage>
          -
          <lpage>25</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>A.</given-names>
            <surname>Collins</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Loftus</surname>
          </string-name>
          .
          <article-title>A spreading-activation theory of semantic processing</article-title>
          .
          <source>Psychological review</source>
          ,
          <volume>82</volume>
          (
          <issue>6</issue>
          ):
          <fpage>407</fpage>
          -
          <lpage>428</lpage>
          ,
          <year>1975</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          .
          <article-title>Application of spreading activation techniques in information retrieval</article-title>
          .
          <source>Artificial Intelligence Review</source>
          ,
          <volume>11</volume>
          (
          <issue>6</issue>
          ):
          <fpage>453</fpage>
          -
          <lpage>482</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>T.</given-names>
            <surname>Grabs</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.-J.</given-names>
            <surname>Schek</surname>
          </string-name>
          .
          <article-title>Flexible information retrieval on XML documents</article-title>
          .
          <source>In Intelligent Search on XML Data</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>S.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Koudas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>Approximate XML joins</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>V.</given-names>
            <surname>Kakade</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          .
          <article-title>Encoding XML in vector spaces</article-title>
          .
          <source>In ECIR</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>J.</given-names>
            <surname>Pokorný</surname>
          </string-name>
          .
          <article-title>Vector-oriented retrieval in XML data collections</article-title>
          .
          <source>In DATESO</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>T.</given-names>
            <surname>Schlieder</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Meuss</surname>
          </string-name>
          .
          <article-title>Querying and ranking XML documents</article-title>
          .
          <source>J. Am. Soc. Inf. Sci. Technol.</source>
          ,
          <volume>53</volume>
          (
          <issue>6</issue>
          ):
          <fpage>489</fpage>
          -
          <lpage>503</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>D.</given-names>
            <surname>Shasha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.-L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>ATreeGrep: approximate searching in unordered trees</article-title>
          .
          <source>In SSDBM</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>J.</given-names>
            <surname>Tekli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chbeir</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Yetongnon</surname>
          </string-name>
          .
          <article-title>An overview on XML similarity: Background, current trends and future directions</article-title>
          .
          <source>Computer Science Review</source>
          ,
          <volume>3</volume>
          (
          <issue>3</issue>
          ):
          <fpage>151</fpage>
          -
          <lpage>173</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>K.</given-names>
            <surname>Weiand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Furche</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Bry</surname>
          </string-name>
          .
          <article-title>Quo vadis, web queries?</article-title>
          <source>In Int'l. Workshop on Semantic Web Technologies (Web4Web)</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>