=Paper= {{Paper |id=None |storemode=property |title=Combined Structure-Weight Graph Similarity and its Application in E-Health |pdfUrl=https://ceur-ws.org/Vol-1054/paper-04.pdf |volume=Vol-1054 |dblpUrl=https://dblp.org/rec/conf/csws/KianiBB13 }} ==Combined Structure-Weight Graph Similarity and its Application in E-Health== https://ceur-ws.org/Vol-1054/paper-04.pdf
     Combined Structure-Weight Graph Similarity and its Application in E-Health


                                   Mahsa Kiani, Virendrakumar C. Bhavsar, and Harold Boley
                                                 Faculty of Computer Science
                                                 University of New Brunswick
                                                   Fredericton, NB, Canada
                                       {mahsa.kiani, bhavsar, harold.boley}[AT]unb.ca


   Abstract—A combined structure-weight similarity approach                 In this system, different aspects of a disease are weighted
for comparing directed (vertex- and edge-)labeled (edge-)                   using regression estimation. Then, these calculated weights
weighted graphs is presented. Vertex labels (as types) and                  are used as coefficients in a weighted distance measure. Note
edge labels (as attributes) embody semantic information. Edge
weights express assessments regarding the (percentage-)relative             that each particular user group (e.g., profiles of all patients
importance of the attributes, a kind of pragmatic information.              having lung cancer) has the same values in the weight
These graphs are uniformly represented and interchanged                     vector. This approach differs from the structural similarity
using a weighted extension of Object Oriented RuleML. We                    algorithms [1], [2] which consider different set of weights
propose semantic-pragmatic information retrieval and cluster-               for each profile (even if they belong to the same group). The
ing where a combination of structure and weight similarities
between a query and stored graphs is calculated. The struc-                 similarity algorithms in [1], [2] compute the arithmetic mean
ture and weight similarity values are used as primary and                   of the two weights on corresponding edges of compared
secondary criteria, respectively, to rank the retrieved graphs.             trees/wDAG in order to determine the weighted similarity. In
The proposed weight similarity algorithm refines the ranking of             this way, edge weights are used as scaling factors to ensure
retrieved graphs that have identical or nearly identical query-             that the overall similarity value is in the real interval [0, 1].
graph structure similarity but have different edge weights.
It is shown that our approach leads to higher precision                     We have found that this approach cannot differentiate trees
compared to earlier approaches that did not incorporate the                 nor wDAGs with different edge weights having identical
similarity of edge weights. The proposed approach of semantic-              or nearly identical structure similarity to the given query.
pragmatic information retrieval and clustering can be applied,              Therefore, we propose modifications to the original weighted
for example, in e-Learning, e-Business, social networks, and                similarity algorithm to address this issue.
Health 3.0. In this paper, the application focus is in e-Health,
specifically the retrieval of mental health records.                        In this paper, a combined structure-weight similarity algo-
                                                                            rithm is proposed based on two component algorithms: a
   Keywords-graph similarity; structure similarity; weight sim-             version of the structure similarity algorithm in [2] and a new
ilarity; weighted Object Oriented RuleML; e-Health.
                                                                            weight similarity algorithm. In our approach, we perform
                                                                            ranked retrieval over a set of (meta)data represented as
                      I. I NTRODUCTION
                                                                            directed (vertex- and edge-)labeled (edge-)weighted graphs,
   Semantic information can be represented using hierarchi-                 each optionally associated with a data record. A special case
cal structures, which express knowledge in multiple levels                  is that the ‘metadata’ already are the ‘data’ to be retrieved,
of detail. In the e-Business domain, vertex-labeled, edge-                  with no need for a separate data record. Similar to [2],
labeled and edge-weighted trees [1] are used in order to rep-               graphs must be transformed to an internal representation
resent attributes of products. In [2], these weighted trees are             before computing their similarity. Such graphs are expressed
generalized to weighted Directed Acyclic Graphs (wDAGs)                     using a weighted extension of Object Oriented RuleML
in which substructures can be shared. Efficient similarity                  [5]. The XML parent-child structure reflects the hierarchical
algorithms are required in many applications, such as for                   structure of the graphs, while the role element  ex-
schema matching in databases, buyer-seller matching in e-                   presses edge labels and the attribute weight expresses edge
Business, and health record retrieval in e-Health. They can                 weights. Also, the sharing of a rooted subgraph by multiple
also be used in social networks, e.g. to form similarity-                   parents can be represented using a RuleML element with an
clustered wellness or patient groups [3]. Calculating similar-              XML key referred to from multiple keyrefs. The graphs
ities between patient profiles (i.e., health records) is difficult,         could be expressed using other representation approaches
as the various aspects of a disease should be weighted                      (e.g., Turtle [6] and RDF/XML [7]) as well. We assume that,
differently, which entails that simple matching of attributes               given a query graph, a ranked list of matching (meta)data
is not adequate in e-Health [4]. Weights are already used                   graphs (and consequently corresponding records), which are
in similarity algorithms [1], [2], [4]. In [4], similar patients            stored in a dataset, is constructed. The structure similarity
are identified based on similarity of symptoms and diseases.                and the weight similarity algorithms match the query graph


                                                        Copyright is held by the author(s)
to each (meta)data graph and calculate their structure and        a ranked list of the matching graphs is required to be
edge weight similarity values, respectively. These pair values    constructed. Subsequently, these ranked (meta)data graphs
of structure and weight similarities (resulting from matching     are used to look up corresponding records. The computed
the query graph to each (meta)data graph) are considered          weight similarity values should be comparable, therefore
as ranking criteria to generate the ranking list of (meta)data    (similar to [1]) our graphs have to conform to the same
graphs. We demonstrate that this approach is able to differen-    standard schema.
tiate the graphs having identical or nearly identical structure      Architecture: The proposed similarity approach has three
similarity but different edge-weight similarity to the given      modules: the structure similarity evaluation module, the
query.                                                            weight similarity evaluation module, and the integration
   The proposed combined structure-weight similarity ap-          and ranking module (see Figure 1). We have a set of
proach is applied in e-Health domain. We represent                graphs G = {G1 , G2 , G3 , · · · , Gn }, which represents the
(meta)data of Electronic Medical Records (EMRs) using             (meta)data for a set of records. Both number of vertices
graphs which express disorders and treatment priorities           and edges are assumed to be finite. Given a graph G0 ,
of patients. Then, our similarity approach is used to find        the structure similarity of G0 with each member of G is
mental health EMRs having similar (meta)data graphs to            calculated using the recursive graph similarity algorithm
a given query. To provide patient privacy and security for        proposed in [2]; here G0 may represent a query. The struc-
health records as well as (meta)data, different technological     ture similarity algorithm is iterative. The given graphs are
safeguards as well as policies could be used [8]. In addition,    traversed from their roots to their leaves (top-down) and
using (meta)data could act as an extra level of privacy,          then their similarity is computed bottom-up. The structure
as for extracting some statistics or trend, information in        similarity values and weight similarity values are in the
(meta)data itself is enough. Also, in retrieval applications,     real interval [0, 1]. The weight similarity evaluation module
only records related to the ranked results would be retrieved     matches each member of G with G0 ; then it calculates the
not all records.                                                  edge-weight similarity value. Figure 1 shows the architecture
   The rest of the paper is organized as follows. Section II      of the similarity approach where G and G0 represent a
explains our similarity approach. Section III focusses on an      set of graphs and a given query graph being matched,
application of the proposed approach in the e-Health domain.      sSim(G, G0 ) denotes their structure similarity values, and
Section IV concludes the paper.                                   wSim(G, G0 ) expresses their weight similarity values.

      II. C OMBINED S TRUCTURE -W EIGHT G RAPH                    G
                                                                          input1          Structure Similarity
                                                                                                                               sSim(G,G0 )
                                                                                          Evaluation Module
                     S IMILARITY                                                 input2                                                      Integration
                                                                                                                                                         rankedList[G]
                                                                        input1                                                                   and
                                                                                                                                              Ranking
  In this section, graph representation and the architecture                                                     wSim(G,G0 )
                                                                                                                                               Module
                                                                          input2          Weight Similarity
of the combined structure-weight similarity approach are          G0
                                                                                          Evaluation Module

presented. The theoretical basis of the proposed weight
                                                                  Figure 1: Proposed Combined Structure-Weight Similarity
similarity is explained and the characteristics of the weight
                                                                  Architecture
similarity are mentioned. A recursive weight similarity al-
gorithm and the computational experiments on a synthetic
dataset are presented.                                               The structure similarity values and weight similar-
                                                                  ity values of G and G0 are inputs to the integra-
A. Approach                                                       tion and ranking module. After receiving the similarity
   Graph Representation: As stated earlier, we assume that        pairs [sSim(Gi , G0 ), wSim(Gi , G0 )] for all graphs i =
we are given a set of records, with each record having            {1, 2, 3, · · · , n} in set G, the integration and ranking module
an associated (meta)data represented as a graph. Note that        ranks the graphs in G based on the structure similarity
all graphs throughout this paper are single-rooted wDAGs.         and weight similarity. Structure and weight similarity values
All graphs are hierarchical as concepts can be represented        could be combined with different approaches. Here, we
using sub-concepts having different importance. The root          consider weight similarity as the secondary criterion in
vertex carries a class label, which types the main object.        ranking of graphs. As a result, G1 could appear before
This object is further described by the labeled weighted          G2 (G1  G2 ) in the ranked list if and only if structure
edges leading to other labeled vertices of the graph, etc.        similarity value of G1 to G0 (the query) is greater than the
Labels on outgoing edges from each given vertex are unique        structure similarity value of G2 to G0 ; or the difference
and appear in lexicographic (alphabetical) left-to-right order.   between their structure similarity is less than or equal to
Also, edge weights are values in the real interval [0, 1] and     a threshold while the weight similarity value G1 to G0 is
for each graph its edge weights normalized; therefore, the        greater than the weight similarity value of G2 to G0 . Thus,
sum of weights for all outgoing edges from each vertex            (a) G1  G2 if and only if [sSim(G1 , G0 ) >
equals 1. Further, we assume that given a query graph,                sSim(G2 , G0 )], or [|sSim(G1 , G0 ) − sSim(G2 , G0 )| ≤
     T hreshold and wSim(G1 , G0 ) > wSim(G2 , G0 )].                where dmax is the maximum possible depth of the source
(b) G1  G2 or G2  G1 if [|sSim(G1 , G0 ) −                         vertex of corresponding edges in two graphs.
     sSim(G2 , G0 )| ≤ T hreshold and wSim(G1 , G0 ) =
                                                                                           dX   md
     wSim(G2 , G0 )]                                                                        max X

                                                                                  Sim =         (         weSimp · Dd+1 )          (3)
   In this paper, we consider the threshold equal to 0. For
                                                                                            d=1 p=1
each graph, we keep a count of the number of edges,
assigning a unique integer j to each edge, starting from                As the similarity of weights, numbers in the real interval
1 in top-down (root to leaf) and left-to-right order. As a           [0, 1], related to two corresponding edges is calculated
result, each edge is represented by ej , j  {1, 2, 3, · · · , z},   using the Manhattan distance (Equation 1) or the Min/Max
considering z as the total number of edges in a graph. As            similarity measure (Equation 2), the similarity value of a
all edges are directed, the source vertex u and the destination      pair of weights weSimp , p  {1, 2, 3, · · · , md } is in interval
vertex v of each edge ej can be represented as an ordered            [0, 1]. Also d, which is the depth of the source vertex related
pair (u, v). Also, the weight of edge ej is represented              to an edge, could be a value larger than or equal to 0. As a
as w(ej ). The edge ej in graph G and the edge ej 0 in               result, Dd+1 is a positive number. Thus, the summation of
graph G0 are called corresponding edges if and only if they          (weSimp · Dd+1 ) for all corresponding edges could result
have identical edge labels as well as identical source vertex        in a value larger than 1 and therefore Sim could be greater
labels and destination vertex labels. The relation between           than 1. In order to express the graph similarity as a value
                                                          .          in real interval [0, 1], the combined edge weight similarity
corresponding edges ej and ej 0 is denoted as ej = ej 0 .
Consider du as the depth of vertex u. In our graphs, du              values (viz. Sim) is normalized by the sum of the Dd+1
and du0 are equal for two corresponding edges ej and ej 0 .          used in various iterations of the recursive weight similarity
                                                                     algorithm. Starting from the first level in graphs, each time
B. Weight Similarity                                                 a pair of weights is compared, the related depth factor is
   In the proposed weight similarity approach, the similar-          added and this process is repeated for all levels of graphs.
ity of weights related to two corresponding edges can be             The normalization factor denoted by F is expressed as,
calculated based on two similarity measures [9], viz. Man-
                                                                                                dX   md
                                                                                                 max X
hattan distance, Equation 1, or Min/Max similarity measure,
Equation 2, as given below:                                                               F =         (      Dd+1 )                (4)
                                                                                                d=1 p=1
              weSim1 = 1 − |w(ej ) − w(ej 0 )|                (1)
                                                                       Thus, the normalized weight similarity of two graphs
                                                                     (wSim) is given as,
                            min(w(ej ), w(ej 0 ))
              weSim2 =                                        (2)                                       Sim
                            max(w(ej ), w(ej 0 ))                                            wSim =          ,                   (5)
                                                                                                         F
   The importance of each edge can be considered to be a
                                                                     which lies in real interval [0, 1]. The global depth degrada-
function of the depth of its source vertex. As stated earlier,
                                                                     tion factor (D) could be equal to 1. In this case, the proposed
the root vertex carries a class label, which types the main
                                                                     similarity approach gives the same importance to the weight
object; therefore, the outgoing edges from the root have
                                                                     similarities of various levels of the graphs and the arithmetic
the highest importance. This importance decreases as the
                                                                     mean of the weight similarity values is calculated. Therefore,
depth of the source vertex of the edge increases. Similarly,
                                                                     the result of such a calculation is identical to considering
contribution of the weight similarity of two corresponding
                                                                     the weight similarity of all attributes having the same effect
edges in weight similarity of two graphs depends on the
                                                                     on the weight similarity of two graphs. This approach
depth of the source vertex related to corresponding edges.
                                                                     results in a linear trend of similarity values. In Equation
The coefficient for adjusting the contribution of edge weight
                                                                     6, mtotal denotes the number of corresponding edges in
similarity needs to decreases as the depth of the source
                                                                     total. weSim1(w(ej ), w(ej 0 )) is the similarity of weights
vertex of corresponding edges increases. One approach for
                                                                     related to two corresponding edges based on the Manhattan
defining this coefficient is using an exponential function with
                                                                     distance, while wSim is the global weight similarity of two
D as the fixed base and d + 1 as the variable exponent.
                                                                     graphs based on the Manhattan distance. The same relation
Therefore, in this paper, the adjustment coefficient is ex-
                                                                     holds when the weight similarity is calculated based on the
pressed as Dd+1 . If p enumerates the pairs of corresponding
                                                                     Min/Max similarity measure as well.
edges in depth d and md (md ≥ 0) denotes the number of
                                                                                               mX
corresponding edges in depth d, the weight similarity value                                     total

of graphs is expressed using Equation 3. In this equation,            wSim = (1/mtotal ) ·              (weSim1(w(ej ), w(ej 0 ))) (6)
each edge similarity value is multiplied by Dd+1 , in which                                     k=1

D is the global depth degradation factor (D ≤ 0.5) and d is          The weight similarity also has the following characteristics:
the depth of the source vertex of the edge. 0 ≤ d ≤ dmax ,           (a) The similarity value generated by the weight similarity
approach is a non-negative number. The minimum similar-          learning component could be used to adjust the parameter.
ity value equals 0. (b) The weight similarity of a graph         Considering graphs G and G0 as the inputs of the algorithm,
to itself is 1.0. The similarity of each pair of weights         G.subgraph(ej ) denotes the sub-graph rooted at destination
weSim(w(ej ), w(ej 0 )) is 1.0. Therefore, Sim has the same      vertex of ej in graph G. G.root.label, G.root.inDegree,
value as F and as a result the weight similarity of two          and G.root.outDegree represent vertex label, in-degree,
graphs (i.e., wSim) is equal to 1.0. (c) The weight similarity   and out-degree of the root of graph G, respectively. Also,
measure is a symmetric function, as the order of pair of         ej  ej 0 represents that ej could appear before ej 0 in
graphs does not affect the result of the computation of          a lexicographic ordered list. weSim is the similarity of
weight similarity. (d) The weight similarity like many other     weights related to two corresponding edges. root(G).depth
similarities does not obey triangular inequality. The weight     is a function which gives the depth for root of graph G
similarity measure is a partial matching approach as only the    relative to the root of the original graph. The output, wSim,
weights related to the corresponding edges are compared.         is the weight similarity value of G and G0 .
                                                                 The proposed weight similarity algorithm traverses two
C. Algorithm                                                     given graphs in a top-down (root-leaf) order to compute the
   Algorithm 1, which calculates the weight similarity of two    edge-weight similarity of the graphs. If two edges being
graphs based on Manhattan distance, is represented in Figure     traversed are corresponding edges, their weight similarity is
2.                                                               calculated using Equation 1 or 2. Two pointer variables, k
                                                                 and k 0 , indicate the positions of two outgoing edges being
 1: procedure W S IMILARITY (G, G0 )                             matched. If ej  ej 0 , k is set to point to the next outgoing
 2:    if G or G0 only contains a single vertex then             edge in G, while if ej 0  ej , k 0 would be increased to
 3:        return 0                                                                                                .
 4:    end if                                                    point to the next outgoing edge in G0 . If ej = ej 0 , k and
 5:    if G.root.label 6= G0 .root.label then                    k are set to point to the next outgoing edges in G and G0 ,
                                                                   0

 6:        return 0                                              respectively. The loop is terminated as soon as any one of
 7:    else                                                      the following conditions is met: k > G.root.outDegree or
 8:        d ← root(G).depth                                     k 0 > G0 .root.outDegree.
 9:        k←1
10:        k0 ← 1                                                The algorithm is recursive, so the base case and recursive
11:        while k ≤ G.root.outDegree                            case should be defined. The base case is where the problem
                           ∧ k 0 ≤ G0 .root.outDegree do         can be solved directly, while in the recursive case the prob-
12:           ej ← G[k].root.edge                                lem is expressed as subproblems that are closer to the base
13:           e0j ← G0 [k 0 ].root.edge                          case [10, pp. 228]. In this algorithm the base of the recursion
                     .
14:           if ej = ej 0 then                                  is where G or G0 only contains a single vertex (Algorithm
15:               F ← F + Dd+1                                   1, lines 2-4) or if G.root.label 6= G0 .root.label (Algorithm
16:               weSim ← (1 − |w(ej ) − w(ej 0 )|)
17:               Sim ← Sim + weSim · Dd+1                       1, lines 5-6). In both cases, their weight similarity is 0. The
                       + wSimilarity(G.subgraph(ej ),            algorithm is tail recursive, i.e., the recursive invocation is
                                       G0 .subgraph(ej 0 ))      the very last thing which is performed [10, pp. 245]. In
18:               k ←k+1                                         the recursive case, the algorithm recursively invokes itself
19:               k0 ← k0 + 1                                    using the roots of two sub-graphs of G and G0 as arguments
20:           else if ej  ej 0 then
21:               k ←k+1                                         (Algorithm 1, line 17).
22:           else                                               As stated earlier, the labels of outgoing edges from each
23:               k0 ← k0 + 1                                    vertex are arranged in the lexicographic order. Also, two
24:           end if                                             pointers indicate the positions of two edges being matched.
25:       end while                                              Using these features, the time complexity of the algo-
26:       wSim ← Sim/F
27:       return wSim                                            rithm is improved. If G or G0 only contains a single
28:    end if                                                    vertex or G.root.label 6= G0 .root.label for the roots
29: end procedure                                                of two graphs, then the algorithm sets the weight sim-
                                                                 ilarity directly to 0 without any further computation; If
    Figure 2: Algorithm 1. Weight Similarity of two Graphs
                                                                 G.root.label = G0 .root.label, the algorithm uses one loop
    based on Manhattan Distance
                                                                 (Algorithm 1, line 11) to find the corresponding edges.
                                                                 For two graphs, consider t, t  {1, 2, 3, · · · , r}, in which
   Algorithm 1 (see Figure 2) gives the weight similarity        r equals to the total number of pairs of matched non-leaf
algorithm, which traverses two input graphs G and G0 in          vertices. When matching all outgoing edges of a pair of
a left-right depth-first strategy. The parameter of the algo-    vertices, three cases should be considered: (i) If ej  ej 0
                                                                           .
rithm is D, which represents the global depth degradation        or ej = ej 0 , for all values of k and k 0 , the number
factor. Here we assume that D is equal to 0.5; however, a        of iterations equals to IG t = G.root.outDegree, (ii) If
ej 0  ej , for all values of k and k 0 , the number of                                  Table I: Edge Weights of a Subset (G1 to G8 ) of 29
iterations is equal to IG0 t = G0 .root.outDegree, and                                 Graphs (G1 to G29 ) with the Structure given in Figure 3
(iii) If only for some values of k and k 0 , ej  ej 0 or                                                                   ((a)) Weights of Edges e1 to e7
     .
ej = ej 0 , the number of iterations to find the corresponding                                             Graph w(e1 ) w(e2 ) w(e3 ) w(e4 ) w(e5 ) w(e6 ) w(e7 )
edges is in the interval [min(IG t , IG0 t ), max(IG t , IG0 t )].                                          G1 0.01 0.99 0.01 0.99 0.01 0.99 0.01
The number of iterations for finding all corresponding                                                      G2 0.01 0.99 0.01 0.99 0.01 0.99 0.01
                                                                                                            G3 0.01 0.99 0.01 0.99 0.01 0.99 0.01
edges in graphs, I, equals to the summation of itera-                                                       G4 0.01 0.99 0.01 0.99 0.01 0.99 0.01
tions
 Pr performedt for teach     Prpair of vertices;    I is in interval                                        G5 0.01 0.99 0.01 0.99 0.01 0.99 0.25
                                            t       t                                                       G6 0.01 0.99 0.01 0.99 0.25 0.75 0.25
[ t = 1 min(I PG   , IG 0 ),  t=1  max(I  G   , IG )]. In the worst
                                                  0
                r              t    t                                                                       G7 0.01 0.99 0.25 0.75 0.25 0.75 0.25
case, I =       t = 1 max(IG , IG
                                   0 )], and therefore the com-
                                                                                                            G8 0.25 0.75 0.25 0.75 0.25 0.75 0.25
                                      r
plexity of the algorithm is Θ( t = 1 max(IG t , IG0 t )).
                                  P
                                                                                                                           ((b)) Weights of Edges e8 to e14
D. Computational Experiments
                                                                                                       Graph w(e8 ) w(e9 ) w(e10 ) w(e11 ) w(e12 ) w(e13 ) w(e14 )
   Now, we test the proposed weight similarity algorithm on                                             G1 0.99 0.01 0.99           0.01    0.99    0.01    0.99
a synthetic dataset, in which weights are changed systemati-                                            G2 0.99 0.01 0.99           0.01    0.99    0.25    0.75
cally to understand the effects of structure and weights on the                                         G3 0.99 0.01 0.99           0.25    0.75    0.25    0.75
                                                                                                        G4 0.99 0.25 0.75           0.25    0.75    0.25    0.75
similarity. The dataset contains graphs structurally identical                                          G5 0.75 0.25 0.75           0.25    0.75    0.25    0.75
to the graphs given in Figure 3, but with different weights.                                            G6 0.01 0.25 0.75           0.25    0.75    0.25    0.75
The graphs are balanced with maximum breadth assuming                                                   G7 0.01 0.25 0.75           0.25    0.75    0.25    0.75
                                                                                                        G8 0.25 0.75 0.25           0.75    0.25    0.75    0.25
branching factor of 2. The dataset contains 29 graphs.
                                                 A

                                        e1            e2                                Figure 4 depicts the similarity values of G1 for the
                                   B                           C                      synthetic dataset with the remaining graphs using the graph
                                                                                      similarity algorithm in [2] as well as our combined structure-
                          e3           e4                 e5           e6
                      D                     E         F                     I         weight similarity algorithm (using the similarity measure
                 e7        e8 e9                e10 e11        e12 e13          e14
                                                                                      based on the Manhattan distance).
                                                                                                     1.0
                 J             K                 L                 O             P                   0.9
                                                                                                     0.8
                                                                                                     0.7
     Figure 3: Graph Structure of Metadata in Dataset                                                0.6
                                                                                        Similarity




                                                                                                     0.5
                                                                                                     0.4
                                                                                                     0.3
                                                                                                            Graph Similarity in [2]
                                                                                                     0.2
   In this dataset, we have five possible values for a                                               0.1
                                                                                                            Manhattan Approach
                                                                                                            Min/Max Approach
                                                                                                       0
pair of edge weights: [0.01, 0.99], [0.25, 0.25], [0.5, 0.5],                                              1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29

[0.75, 0.25], or [0.99, 0.01]. In G1 of dataset, weights of all                                                                          Graph(Gi )

edges having the same source vertex are [0.01, 0.99]. Now,
we change the edge weights from right to left in a level                                Figure 4: Similarity of G1 to 29 Graphs in the Dataset
and then bottom-up for various levels, exhausting the five
possible sets of edge weight pairs. This results in 29 graphs                            While similarity values based on the previous graph
in the dataset, of which eight graphs, G1 to G8 , are shown                           similarity algorithm [2] are always equal to 1, our com-
in Table I, where each row represents the weights related to                          bined structure-weight similarity approach differentiates the
a graph 1 . Enabling a compact specification and description                          structurally identical graphs with different weights. Also,
of the weights, this notation is used to illustrate different                         Figure 4 gives a comparison of the two similarity measures,
weight values for one graph structure.                                                the similarity measure based on the Manhattan distance and
   Considering this systematic changes in weights, the                                the Min/Max similarity measure. Here again we compute
weight similarities of G1 in the dataset with respect to                              the similarity w.r.t. G1 . For the depth degradation factor
the remaining graphs are expected to decrease gradually.                              equal to 0.5 and for the same set of weights for a dataset,
Therefore, the synthetic dataset provides a starting point for                        both similarity measures generate similarity values with a
an evaluation of our weight similarity algorithm.                                     decreasing trend. It is important to note that for Figure 4, the
  1 The complete dataset is available from authors.
                                                                                      similarity decreases as a result of the systematic change of
                                                                                      weights of edges (having the same source): gradual increase
                                                                                      of the edge weight for the left vertex and gradual decrease
                                                                                      of the edge weight for the right vertex. The bumps in the
                                                                                      similarity plots (e.g. at G8 , G15 , and G22 ) are observed as
                                                                                      the result of level transitions, i.e., the systematic changes of
                                                                                      weights in each level of the graph.
   We have given above the computational results for the          always related to the last psychological evaluation of patients
similarities of members of a dataset. We can generalize           (available in the mental health records). Figure 5 illustrates
the behavior of the similarity computation to other possible      the generic structure of (meta)data of mental health EMRs
graph structures such as trees [1], and generalized trees [11],   in the database as well as a query having the same structure.
and conclude that the proposed weight-similarity algorithm,
                                                                                                    Mental Health EMR
with any one of the similarity measures, is effective in
differentiating graphs having identical or nearly identical                                           e1
                                                                                                             e2
                                                                                                                       e3

                                                                                    Affective Set                              Cognitive Set
structure similarity values (but different weights). Weight                                           Behavioral Set
                                                                                            e4                                       e7
similarity considers only weight of common subgraphs of
                                                                                                                      e6     e8
two graphs being compared, while structure similarity takes                                           e5          Aggression Set
                                                                                 Panic Disorder                                      Cocaine Intoxication
into account common as well as uncommon subgraphs.
                                                                                          Anorexia Nervosa
Therefore, two graphs could be similar from weight sim-                                                              e9               e10
ilarity perspective, while their uncommon sub-graphs are
large (i.e. small structure similarity). Note that although the                                                           Physical
numerical similarity values of the two similarity measures
are different, they result in the same relative ranking of        Figure 5: Graph Structure of a Query and Metadata of
the graphs for the given query. Since there is no universal       Mental Health EMRs
benchmark for evaluating similarity [12], it is not possible
to select or recommend one of the similarity measures over           Table II represents the edge labels of the generic structure
the other and both similarity measures could be used for the      (in Figure 5), in which l(ej ) denotes the label of edge ej ,
purposes of relative ranking.                                     j  {1, 2, 3, · · · , 10}. The patients have panic disorder and
     III. E-H EALTH A PPLICATION : M ENTAL H EALTH                also delirium due to cocaine intoxication. Other disorders
              E LECTRONIC M EDICAL R ECORD                        of the patients are anorexia nervosa and physical aggression
                                                                  including fantasies and real acts [15, pp. 421].
   Group therapy is used as a treatment option for drug
abusers [13, pp. 577-620]. Newcomers should be placed in            Table II: Edge Labels of a Query and Metadata Graphs
groups with at least one or two similar members. Open group       (having the Structure in Figure 5) for Mental Health EMRs
membership in which new members are allowed to enter
as others leave is the norm [14, pp. 262-273]. Therefore,                l(e1 ) Affective disorders l(e6 ) Aggression
                                                                         l(e2 ) Behavioral disorders l(e7 ) Delirium
retrieving similar mental health EMRs to select patients for             l(e3 ) Cognitive disorders l(e8 ) Substance induced panic
group therapy is a challenging task. This selection should be            l(e4 ) Anxiety              l(e9 ) Real act
based on the gathered dynamic, behavioral, and diagnostic                l(e5 ) Appetite disorder    l(e10 ) Fantasies
information in a screening interview [15, pp. 934]. Consider
the scenario where the user (e.g., a psychologist) wants to          Edge weights of four EMR (meta)data, representing the
find an appropriate group for a new patient in order to           diagnosis segment of a mental health EMR, and a query are
schedule group therapy sessions. In this case, mental health      illustrated in Table III. Note the different last subscripts for
EMRs that describe similar disorders as well as treatment         the two edges emanating from the Aggression vertex and
priorities should be found. Each (meta)data expresses the         terminating at the same Physical destination vertex. Further,
individualized treatment plan about patient’s disorders and       there are three edges from the root vertex.
the treatment priorities based on the last psychological
evaluation. Similar to [2], we represent the attributes of each      Table III: Edge Weights of a Query (G01 ) and four
(meta)data using a graph based on a standard schema. The            Metadata Graphs (having the Structure in Figure 5) for
attributes of this schema are extracted from [15], [16], and                       Mental Health EMRs
the terms representing the (meta)data are based on DSM IV          Graph w(e1 ) w(e2 ) w(e3 ) w(e4 ) w(e5 ) w(e6 ) w(e7 ) w(e8 ) w(e9 ) w(e10 )
[16]. The attributes express possible affective, behavioral,        G01 0.01 0.01 0.98 1.0 0.01 0.99 1.0                   1.0 0.01 0.99
and cognitive problems of a patient. The edge weights in            G1 0.01 0.01 0.98 1.0 0.01 0.99 1.0                    1.0 0.01 0.99
                                                                    G2    0.5 0.25 0.25 1.0 0.25 0.75 1.0                  1.0 0.25 0.75
graphs represent the relative priority regarding treatment          G3    0.4    0.3    0.3    1.0    0.5    0.5    1.0    1.0    0.5    0.5
of each disorder in the group therapy session. Therefore,           G4    0.3 0.35 0.35 1.0 0.75 0.25 1.0                  1.0 0.75 0.25
severe, influential, and dangerous disorders as well as the
items for which treatments have the greatest benefit have            Now we compare the similarity of query with the four
higher priority (i.e., higher weight) in our treatment-oriented   (meta)data graphs G1 , G2 , G3 , and G4 of the EMRs given
(meta)data. As treatment priorities change over time, edge        in Tables II and III using the combined structure-weight
weights could be different in each evaluation phase by the        similarity algorithm. The computed similarity values are
psychologist. In order to select patients for group therapy,      given in Table IV. The structure similarity values between
in the proposed system the edge weights of (meta)data are
query G0 and any of four (meta)data graphs are identical;              [4] Fritz, P. and Klenk, S. and Dippon, J. and Heidemann, G.,
therefore, we cannot distinguish between them using the                    “Determining patient similarity in medical social networks,”
structure similarity alone. The edge weight similarity results             in Proc. MedEx Workshop, 2010, pp. 6–13.
using the proposed algorithm are also shown in Table IV.               [5] H. Boley, “Object-Oriented RuleML: User-Level Roles, URI-
                                                                           Grounded Clauses, and Order-Sorted Terms,” in Proc. Rules
   Table IV: Computational Results for the Metadata of                     and Rule Markup Languages for the Semantic Web (RuleML-
     Mental Health EMRs and the Query in Table III                         2003). LNCS 2876, Springer, Oct. 2003, pp. 1–16.
    Graph   Graph    Structure    Manhattan   Min/Max    Rank
                     Similarity   Approach    Approach                 [6] D. Beckett and T. Berners-Lee. (2011) Turtle A
                                                                           Readable RDF Syntax. [Online]. Available: http:
     G0      G1         1.0          1.0         1.0       1               //www.w3.org/TeamSubmission/turtle/
     G0      G2         1.0        0.6834      0.3762      2
     G0      G3         1.0        0.6356      0.3492      3           [7] Cyganiak, R. and Wood, D., Ed., Resource Description
     G0      G4         1.0        0.5878      0.3249      4               Framework (RDF): Concepts and Abstract Syntax. World
                                                                           Wide Web Consortium, Jan. 2013. [Online]. Available:
                                                                           http://www.w3.org/TR/2013/WD-rdf11-concepts-20130115/
   We can clearly see that the similarities are different and
they can be used to rank four (meta)data graphs. Further,              [8] Jacques, L.B., “Electronic Health Records and Respect for
both similarity measures (see columns 4 and 5 in Table IV)                 Patient Privacy: A Prescription for Compatibility,” Vand. J.
are equally acceptable as they result in the same relative                 Ent. & Tech. L., vol. 13, pp. 441–462, 2011.
ranks. Instead of ranked graphs based on their similarity to a
                                                                       [9] Boriah, S. and Chandola, V. and Kumar, V., “Similarity
given query, the proposed approach could cluster the mental
                                                                           Measures for Categorical Data: A Comparative Evaluation,”
health EMRs based on a threshold to facilitate creation of                 in Proc. the 8th SIAM International Conference on Data
supportive virtual communities, which is one of the main                   Mining, 2008, pp. 243–254.
goals of Health 3.0 [17].
                                                                      [10] Drake, P., Data Structures and Algorithms in Java.     Upper
                      IV. C ONCLUSION                                      Saddle River, NJ, USA: Prentice-Hall, Inc., 2005.
   Our combined structure-weight similarity approach is able
to distinguish graphs having identical or nearly identical            [11] Dehmer, M. and Emmert-Streib, F. and Kilian, J., “A Similar-
                                                                           ity Measure for Graphs with Low Computational Complex-
structure but different weights. By considering the weight                 ity,” Applied Mathematics and Computation, vol. 182, no. 1,
similarity in addition to the structure similarity, preferences            pp. 447 – 459, 2006.
of user are compared with the preferences expressed as
edge weights of graphs stored in dataset. The similarity              [12] Janowicz, K. and Raubal, M. and Schwering, A. and Kuhn,
of edge weights is calculated in a recursive way, giving                   W., “Semantic Similarity Measurement and Geospatial Ap-
                                                                           plications,” T. GIS, vol. 12, no. 6, pp. 651–659, 2008.
more importance to weights of edges in higher levels of a
graph. The combined structure-weight similarity algorithm             [13] Carr, A., The Handbook of Child and Adolescent Clinical
has been implemented in Java and it has been applied to                    Psychology: A Contextual Approach. Rouledge, New York:
retrieve mental health electronic medical records (EMRs).                  Taylor and Francis Group, 1999.
                  ACKNOWLEDGMENTS                                     [14] Ruiz, P. and Strain, E.C. and Langrod, J., The Substance
   The authors would like to thank Mehrdad Kiani, M.D.,                    Abuse Handbook, ser. Doody’s all reviewed collection.
who provided help in the health domain. This research                      Philadelphia: Wolters Kluwer Health/Lippincott Williams and
                                                                           Wilkins, 2007.
is partially funded by a Discovery Grant from the Nat-
ural Sciences and Engineering Council of Canada and a                 [15] Sadock, B.J. and Sadock, V.A., Kaplan and Sadock’s Syn-
Ph.D. Fellowship Grant from the Atlantic Computational                     opsis of Psychiatry: Behavioral Sciences/Clinical Psychiatry,
Excellence Network (ACEnet) awarded to the second author.                  10th ed.    Philadelphia: Lippincott Williams and Wilkins,
                                                                           2007.
                          R EFERENCES
 [1] Bhavsar, V.C. and Boley, H. and Yang, L., “A Weighted-Tree       [16] A. P. Association and A. P. A. T. F. on DSM-IV., Diagnostic
     Similarity Algorithm for Multi-Agent Systems in E-Business            and Statistical Manual of Mental Disorders: DSM-IV-TR.,
     Environments,” Computational Intelligence, vol. 20, no. 4, pp.        4th ed., ser. Diagnostic and Statistical Manual of Mental Dis-
     584–602, 2004.                                                        orders. Washington, DC: American Psychiatric Association,
                                                                           2000.
 [2] Jin, J., “Similarity of Weighted Directed Acyclic Graphs,”
     MSc Thesis, Faculty of Computer Science, University of New       [17] Kiani, M. and Bhavsar, V.C. and Boley, H., “Clustering Using
     Brunswick, Canada, Sep. 2006.                                         Combined Structure-Weight Graph Similarity,” University of
                                                                           New Brunswick, Canada, Internal Report (In Preparation).
 [3] Boley, H. and Shafiq, O., and Smith, D. and Osmun, T.,
     “The Social Semantic Subweb of Virtual Patient Support
     Groups,” in Proc. the 3rd Canadian Semantic Web Symposium
     (CSWS2011), Vancouver, British Columbia, Canada. CEUR,
     Aug. 2011, pp. 1–18.