=Paper=
{{Paper
|id=Vol-205/paper-4
|storemode=property
|title=Graph Retrieval with the Suffix Tree Model
|pdfUrl=https://ceur-ws.org/Vol-205/paper4.pdf
|volume=Vol-205
}}
==Graph Retrieval with the Suffix Tree Model==
Graph Retrieval with the Suffix Tree Model
Mathias Lux1 and Sven Meyer zu Eissen2 and Michael Granitzer3
Abstract. The paper in hand presents an adoption of the 2. The operationalization of the retrieval functionality.
suffix tree model for the retrieval of labeled graphs. The suffix
tree model encodes path information of graphs in an efficient The second challenge restricts the flexibility in formulating
way and so reduces the size of the data structures compared a similarity function: ϕ must not be expensive to evaluate in
to path index based approaches, while offering a better run- terms of runtime complexity since in our case a user waits
time performance than subgraph isomorphism based meth- actively for retrieval results.
ods. Within a specific use case we evaluate the correlation of
the developed method to human judgement and compare the 2 RELATED WORK
correlation values to other methods. We show that in our use
case, which is the retrieval of digital photos annotated with Although maximum common subgraph isomorphism is a nat-
MPEG-7 using the MPEG-7 Semantic Description Scheme, ural starting point for graph similarity computation (see [3]),
the presented algorithm performs better than other methods. it cannot be applied to our scenario: First, the question if
two graphs G and H contain an isomorphic subgraph whose
edge set has more than k ∈ N elements is NP-complete (see
1 INTRODUCTION [6]). Second, quantifying similarity using ratios of subgraph
Let G = hV, Ei be a graph, where V denotes the node set and edge set sizes solely may not reflect our problem, since edge
E ⊆ V × V denotes the edge set. Given a query graph Gq and label matches can be of different importance, depending on
a graph set G, graph retrieval deals with the task to identify the value of an edge label.
a subset R ⊆ G with the property For this and similar reasons, graph retrieval algorithms are
tailored to the requirements of the underlying use case. For
∀G ∈ R : ϕ(Gq , G) ≥ t example, Fonseca et al. used graph invariants of trees—in
this specific case the eigenvalues of the tree’s and subtree’s
where ϕ : G × G → R denotes a similarity function and t ∈ R
adjacency matrix—to identify relevant cliparts represented
is a minimum similarity threshold.
as trees, representing adjacency and inclusion of color areas
The research question how to search similar graphs in a
within the cliparts, in a database (see [5],[12]).
database was already prescribed in a work by Simmons in
Zong et al. (see [18]) retrieved labeled graphs using an index
1966 (see [13]), in which he matched conceptual graphs. Since
in which the labels of paths up to a certain length were stored.
then, different applications areas emerged; they include query-
The relevance between a query graph and a graph from the
ing chemical graph databases that store molecular structures,
database was computed from a TF*IDF-like similarity mea-
retrieving vector and raster images using characteristics en-
sure that was applied to the edge labels.
coded in a graph, and recently, searching in semantically en-
Berreti et al. (see [2]) extracted information on neighbour-
riched data in the context of semantic Web applications.
ing colour regions from raster images, which was encoded in
Our application scenario relates to multimedia retrieval
directed labeled graphs. To retrieve similar images a graph
with the MPEG-7 standard, where metadata are represented
database was queried employing a tailored metric, which
as graphs: A user formulates his or her information need in the
proved as slow but highly configurable.
form of a graph, which is then matched against an MPEG-7
graph database G.
A property of MPEG-7 graphs is that their nodes and edges 2.1 Contribution
are labeled with text, say, for each G ∈ G there exists a func-
Text retrieval methods based on the vector space model, es-
tion lE : E → TE as well as lV : V → TV , where TE , TV are
pecially those using inverted lists as described in [1], have
term sets. The goal is to retrieve graphs that match both, the
been applied to graph retrieval before: A graph’s labels form
query graph’s structure as well as the labels. The challenges
a virtual document; likewise, the query graph’s labels are used
in this connection are twofold:
to construct a query document. The similarity between these
1. The statement of a similarity function ϕ that reflects the documents is computed using the vector space model along
application scenario, and with standard similarity measures like TF*IDF or BM-25.
Unlike traditional vector space approaches our proposed
1 University of Technology Graz, Knowledge Management Insti- method employs the suffix tree model, described in [8]. Its
tute, Austria, email: mathias.lux@tugraz.at
2 Bauhaus University Weimar, Germany, email: sven.meyer-zu- advantage is that similarity computations incorporate word
eissen@medien.uni-weimar.de order within sentences and text fragments. Applied to the out-
3 Know-Center Graz, Austria, email: mgrani@know-center.at
lined MPEG-7 retrieval scenario, this property is especially
useful when matching labels in a graph’s paths, yielding to similarity values while keeping the computational complexity
better similarity values like the respective experiments show. bounded by a linear function. In this connection, knowledge
about suffix trees is necessary prerequisite; some details are
summarized in the next section.
3 APPLICATION SCENARIOS
The specification of semantics often follows a graph modeling 4.1 Suffix Trees
approach; the pioneering work of Sowa (see [14]) is one of
many examples. Similarity search in this and related contexts The ith suffix of a document d = w1 . . . wm is the substring
reduces to graph retrieval. of d that starts with word wi . A suffix tree of d is a labeled
Currently a trend towards a semantically enriched Web can tree that contains each suffix of d along a path whose edges
be noted. This movement started with the vision of a semantic are labeled with the respective words. The construction of a
Web by Berners-Lee (see e.g. foreword in [4]) and resulted in suffix tree is straightforward: The ith suffix of d is inserted by
the definition of a syntax for semantics, formally defined in an checking whether some edge emanating from the root node is
ontology language based on the Resource Description Frame- labeled with wi . If so, this edge is traversed and it is checked
work (RDF), which uses a model based on directed labeled whether some edge of the successor node is labeled with wi+1 ,
graphs. and so on. If, in some depth k, a node n without a matching
Another initiative, aimed at an interoperable standards for edge is reached, a new node is created and linked to node n
multimedia data, is the Moving Picture Expert Group, in with an edge labeled with wi+k .
short MPEG. Within their Multimedia Content Description Figure 1 illustrates a the suffix tree in which the documents
Interface, short name MPEG-7, they defined a way to seman- d1 =“Boy plays chess” and d2 =“Boy plays bridge too” have
tically describe the contents of multimedia files by intercon- been inserted.
necting semantic objects (e.g. agents, places, and so on) by
typed semantic relations (see [7] for more details), which again ϕ
results in directed labeled graphs that encode semantics.
bri
All of the above mentioned scenarios model semantics with s
dg
lay
chess
yp to
ys
et
directed labeled graphs. While the same edge label can be bo o
pla
oo
used more than once within a graph, we assume that node
labels are unique within a graph as defined in MPEG-7, RDF chess bridge chess bridge
too too
and conceptual graphs.
"boy plays chess"
4 APPLYING THE SUFFIX TREE "boy plays bridge too"
MODEL TO GRAPH RETRIEVAL
Figure 1. A suffix tree in which the documents d1 =“Boy plays
Information retrieval methods that have been used in the chess” and d2 =“Boy plays bridge too” have been inserted.
past for graph retrieval have in common that they transform
database graphs Gi ∈ G as well as query graphs Gq to docu-
ments di and dq , respectively, which are then compared using
their vector space model representations in combination with
a related similarity measure like the cosine similarity. Here, 4.2 Path-based Graph Suffix Trees
the documents consist of sentences, which are made up of Let di denote the document that is associated with Gi , and
node and edge label concatenations from paths in the corre- likewise, let dq denote the document that is associated with
sponding graphs. This methodology raises two questions: Gq . Both, di and dq consist of “sentences”, which are concate-
nations of path labels from selected paths from Gi and Gq ,
1. Which paths of a graph should be used for the construction
following a heuristic mentioned above.
of di and dq ?
A natural similarity measure between di and dq arises when
2. Which retrieval methodology should be chosen for query
inserting each suffix from each sentence of di and dq into an
matching?
initially empty suffix tree GS = hVS , ES i. Let Ei ⊆ ES denote
the set of the edges that have been traversed when all suffixes
With respect to point (1), some heuristics have been pro-
of di ’s sentences have been inserted into GS , and analogously,
posed. One prominent method is discussed in connection with
let Eq ⊆ ES denote the traversed edge set for all sentences’
GraphGrep (see [11]). The paths of a graph are extracted ei-
suffixes from dq . The similarity between di and dq can be
ther by identifying all paths in a graph up to a certain length,
measured by how many edges Ei and Eq have in common,
e.g. with a depth first or breadth first search starting from
e.g. quantified by the Jaccard coefficient:
each vertex (see e.g. [15]), or by identifying frequent substruc-
tures within the graphs (see e.g. [17] or [16]). | Ei ∩ E q |
The focus of our research refers to point (2). Known graph ϕS (Gi , Gq ) =
| E i ∪ Eq |
retrieval methods that rely on the vector space model disre-
gard term order or include only partial term order informa- Furthermore in [8] two more weighting schemes using term
tion when using n-grams for indexing. In the following, a sim- frequency and inverse document frequency of edges, are de-
ilarity measure is presented that tackles the aforementioned scribed to enhance relevance and precision. For similarity cal-
problem; it compiles full path label order information into the culation of graphs such a weighting can be applied.
In addition to the two original weighting schemes a third
1,000
scheme relying solely on IDF can be introduced. Stripping
0,900
the term frequency from the original weighting formula, a
1
0, 6
0, 2
79
78
78
0, 9
3
0, 3
0,
0,800
73
73
similarity measure can be defined as follows:
73
0, 9
0,
0, 5
4
68
68
68
0, 5
0, 4
4
0,
65
65
65
0,700
0,
0,600
1 X no relations
ϕidf (Gi , Gq ) = traversed(e) · IDF (e) 0,500 undirected r.
| ES | full r.
e∈ES 0,400
0,300
0 e∈/ E i ∩ Eq
with traversed (e) = 0,200
1 e ∈ E i ∩ Eq
0,100
Here, IDF : E → R is defined to be the inverse docu- 0,000
n no weighting TF TF*IDF IDF
ment frequency function, IDF (e) = log( S(e) ), with n being
the total number of documents and S : E → N denoting the
function that delivers the number of distinct documents that Figure 3. Evaluation of the Suffix Tree Metric in correlation to
traversed a given edge on insertion into the suffix tree. human judgement
5 EVALUATION of the evaluation of the suffix tree model based metrics is
shown in figure 3. With each weighting scheme three different
Although the presented suffix tree model for graphs can be strategies for building the tree are evaluated: A first approach
applied to arbitrary graphs with node and edge labels, the is to build the tree without taking the edge labels into account
evaluation was done within a multimedia retrieval scenario: (shown as option no relations in figure 3), so only the sequence
Using MPEG-7, the Multimedia Content Description Inter- of node labels is inserted into the tree. A second approach is
face, multimedia documents can be annotated using graphs to normalize all relation labels without taking their directions
expressing the semantics of the multimedia document. This into account (shown as option undirected r. in figure 3). This
particular functionality of MPEG-7 is defined in the Semantic can be done by ignoring all direction information on edges.
Description Scheme (see [7] for details on MPEG-7). The third option is to use the full paths including node and
edge labels (shown as option full r. in figure 3).
Mathias Graz
As can be seen easily the suffix tree model cannot provide
agentOf locationOf an optimal approximation of human judgement with any of
the presented weighting schemes. With no weighting schema a
Talking rounded maximum correlation value of 0.689 can be achieved.
patientOf patientOf With the term frequency weighting, which was proposed in
the original publications the correlation value even gets worse.
Sven Michael The inverse document frequency (IDF) weighting proposed in
this publication offers the best correlation with a maximum
Figure 2. Illustration of an MPEG-7 based annotation value of 0.791 taking all node and edge information (labels
expressing that Mathias is talking to Sven and Michael in Graz. and direction) into account.
Besides the above introduced suffix tree similarity measure
for graphs following similarity measures from text and graph
Within this scenario two graphs, like the one shown in fig- retrieval were compared to human judgement:
ure 2, can be compared and a similarity value can be obtained.
Based on the used mechanism for similarity calculation dif- 1. Vector space based on node and edge labels, cosine co-
ferent results are achieved. Our evaluation aims to identify efficient as similarity measure with following weighting
the most semantic method (in terms of human judgement) schemes. This metric does not take the structure of the
for similarity calculation of MPEG-7 based annotations. graph into account, the set of labels is treated as text doc-
To evaluate the semantics of candidate similarity measure a ument:
test set of 96 manually annotated digital photos was used. In
essence for all photos a labeled directed graph exists, which (a) without weighting scheme (Text VS in fig. 4)
describes the semantics of the image by specifying persons, (b) TF*IDF (Text VS TF*IDF in fig. 4)
time points, locations and events as nodes and interconnect-
(c) BM25 (Text VS BM25 in fig. 4, see [9] and [10] for details
ing these nodes by labeled edges, like shown in figure 2. The
on BM25)
graphs have a median number of nodes of 5.81, with a medium
number of 5.99 edges. From this test data set 20 photo pairs 2. Vector space with graph paths as terms, cosine coefficient
were identified, which were used to create a questionnaire. The as similarity measure with following weighting schemes:
participants of the evaluation were asked to rate the pair- (a) TF*IDF on paths with one arc (VS IDF Triple in fig. 4)
wise similarity of the photos. The averaged similarity from and full length paths (VS IDF Paths in fig. 4)
the participants answers was correlated to the results of the
(b) BM25 on paths with one arc (VS BM25 Triple in fig. 4)
candidate similarity measures.
and full length paths (VS BM25 Paths in fig. 4)
After initial evaluations of 18 and 15 participants a final
evaluation with 112 participants was carried out. The results 3. Maximum common subgraph metric from [3] (MCS in fig.
4) ACKNOWLEDGEMENTS
4. Error correcting subgraph isomorphism metric from [2]
with boolean edge label distance functions and two options The Know-Center is funded by the Austrian Competence
for used node label distance functions: Center program K plus under the auspices of the Aus-
(a) Boolean distance function (Berretti (Bool) in fig. 4) trian Ministry of Transport, Innovation and Technology
(http://www.ffg.at/index.php?cid=95) and by the State of
(b) Term vector distance function (Berretti (VS) in fig. 4) Styria.
1,000
0,900
REFERENCES
0,788 0,774 0,782 0,786 0,791
0,800 0,754 0,744 0,740
0,675
[1] Ricardo A. Baeza-Yates and Berthier Ribeiro-Neto, Modern
0,700
0,620 Information Retrieval, Addison-Wesley Longman Publishing
0,600
0,518 Co., Inc., 1999.
0,489
0,500
[2] S. Berretti, A. Del Bimbo, and P. Pala, ‘A graph edit distance
0,400 based on node merging’, in Image and Video Retrieval: Third
0,300 International Conference, CIVR 2004, volume 3115 of LNCS,
0,200 pp. 464–472, Dublin, Ireland, (July 21-23 2004). Springer.
0,100 [3] Horst Bunke and Kim Shearer, ‘A graph distance metric
0,000
based on the maximal common subgraph’, Pattern Recogni-
tion Letters, 19(3-4), 255–259, (1998).
S
25
le
l)
s
s
)
.
F
le
r.
s
r
th
S
th
D
C
n
oo
ip
ip
(V
ll
BM
ed
M
tio
*I
Pa
Pa
Tr
fu
Tr
(B
[4] Dieter Fensel, James A. Hendler, and Henry Lieberman, Spin-
TF
ct
la
ti
S
F
25
25
F
F
et
re
re
ti
ID
ID
V
ID
S
et
r
BM
BM
di
V
xt
er
no
r
T
S
S
un
ning the Semantic Web Bringing the World Wide Web to Its
er
xt
Te
B
S
V
V
S
S
F
B
Te
F
V
V
ID
ID
T
Full Potential, MIT Press, 2005.
T
S
S
[5] Manuel J. Fonseca, B. Barroso, and Joaquim A. Jorge, ‘Re-
Figure 4. Evaluation of different distance functions and metrics trieving clipart images by content’, in Image and Video Re-
using the correlation to human judgement trieval: Third International Conference, CIVR 2004, volume
3115 of LNCS, pp. 500–507, Dublin, Ireland, (July 21-23
2004). Springer.
[6] Michael R. Garey and David S. Johnson, Computers and In-
The evaluation results in figure 4 show that the suffix tree
tractability, W.H. Freeman and Company, New York, 1979.
model with proposed inverse document frequency weighting [7] Harald Kosch, Distributed Multimedia Database Technolo-
offers the best correlation to human judgement in the pre- gies, CRC Press, Nov. 2003.
sented domain. However the VS BM25 Triple metric offers a [8] Sven Meyer zu Eissen, Benno Stein, and Martin Potthast,
nearly as high correlation value. The two variants of the error ‘The suffix tree document model revisited’, in Proceedings
of the I-Know ’05 5th International Conference on Knowl-
correcting subgraph isomorphism metric of [2] do not perform edge Management, pp. 596–603, Graz, Austria, (July 2005).
as good as the other candidates. All evaluated text based sim- J.UCS.
ilarity and distance measures, which do not take the structure [9] S. E. Robertson and S. Walker, ‘Some simple effective approx-
in to account, do not correlate well with human judgement. imations to the 2-poisson model for probabilistic weighted re-
trieval’, in SIGIR ’94: Proceedings of the 17th annual interna-
tional ACM SIGIR conference on Research and development
6 CONCLUSION in information retrieval, pp. 232–241, New York, NY, USA,
(1994). Springer-Verlag New York, Inc.
As can be seen easily from the evaluation similarity measures, [10] Stephen Robertson, Hugo Zaragoza, and Michael Taylor,
which take the structure information of the graphs into ac- ‘Simple bm25 extension to multiple weighted fields’, in CIKM
count, are superior to the tested text retrieval mechanisms, ’04: Proceedings of the thirteenth ACM international confer-
ence on Information and knowledge management, pp. 42–49,
which use node and edge labels for retrieval. The suffix tree New York, NY, USA, (2004). ACM Press.
method has a slightly better correlation coefficient and there- [11] Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno, ‘Al-
fore reflects human judgement better than the other meth- gorithmics and applications of tree and graph searching’, in
ods. However the difference to the vector space method is PODS ’02: Proceedings of the twenty-first ACM SIGMOD-
marginal, which justifies for example the usage of an path SIGACT-SIGART symposium on Principles of database sys-
tems, pp. 39–52. ACM Press, (2002).
index for graph retrieval. One possible explanation why the [12] Ali Shokoufandeh, Sven J. Dickinson, K. Siddiqi, and S.W.
triple based VS approach performs that good is that in the Zucker, ‘Indexing using a spectral encoding of topological
inspected domain all node labels are unique within a single structure’, in Conference on Computer Vision and Pattern
graph. Recognition, IEEE Computer Society, volume 2, pp. 491–497,
USA, (June 1999).
The most interesting point is, that methods adapted from [13] R. F. Simmons, ‘Storage and retrieval of aspects of meaning
text retrieval perform better than the evaluated methods de- in directed graph structures’, Commun. ACM, 9(3), 211–215,
veloped for graphs, like MCS and the algorithm of Berretti et (1966).
al. described in [2] on the used test data set. However the num- [14] John F. Sowa, ‘Semantics of conceptual graphs’, in Proceed-
ber of photos in the set is too small for general conclusions, ings of the 17th annual meeting on Association for Computa-
tional Linguistics, pp. 39–44, Morristown, NJ, USA, (1979).
but as no test data sets for semantic annotations currently ex- Association for Computational Linguistics.
ist, the creation of semantic annotations for multimedia docu- [15] Gabriel Valiente, Algorithms on Trees and Graphs, Springer,
ments is a laborous task and the usefulness of random graphs Berlin, Germany, September 2002.
for evaluation is limited in this domain, an evaluation with a [16] Takashi Washio and Hiroshi Motoda, ‘State of the art of
graph-based data mining’, SIGKDD Explor. Newsl., 5(1), 59–
bigger data set was out of scope of the project. Nevertheless 68, (2003).
the presented evaluation provides a starting point for further [17] Xifeng Yan, Philip S. Yu, and Jiawei Han, ‘Graph indexing:
investigations. a frequent structure-based approach’, in SIGMOD ’04: Pro-
ceedings of the 2004 ACM SIGMOD international conference
on Management of data, pp. 335–346. ACM Press, (2004).
[18] Jiwei Zhong, Haiping Zhu, Jianming Li, and Yong Yu, ‘Con-
ceptual graph matching for semantic search’, in ICCS ’02:
Proceedings of the 10th International Conference on Concep-
tual Structures, pp. 92–196, London, UK, (2002). Springer-
Verlag.