<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text Based Similarity Metrics and Deltas for Semantic Web Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Krishnamurthy Koduvayur Viswanathan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tim Finin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Maryland</institution>
          ,
          <addr-line>Baltimore County Baltimore, MD 21250</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recognizing that two Semantic Web documents or graphs are similar, and characterizing their differences, is useful in many tasks, including retrieval, updating, version control and knowledge base editing. We describe several text-based similarity metrics that characterize the relation between Semantic Web graphs and evaluate these metrics for three specific cases of similarity: similarity in classes and properties, similarity disregarding differences in base-URIs, and versioning relationship. We apply these techniques to a specific use case: generating a delta between versions of a Semantic Web graph. We have evaluated our system on several tasks using a collection of graphs from the archive of the Swoogle Semantic Web search engine.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic Web graphs</kwd>
        <kwd>similarity metrics</kwd>
        <kwd>delta</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Semantic Web search engines can benefit from recognizing nearly duplicate
documents [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for many of the same reasons that text search engines do. Comparing
Semantic Web documents or graphs, however, is more complicated. In natural
language text, local word order is important to meaning, while the order of triples
in a Semantic Web document (SWD) is not. As a result, equivalent
Semantic Web documents may have completely different statement orderings. It
is also possible for two different SWDs to become identical after
performing inference. The presence of "blank nodes" in SWDs further complicates
their comparison.
      </p>
      <p>
        We explore three different ways in which a pair of Semantic Web graphs can
be similar to each other: similarity in the classes and properties used while differing
only in literal content, difference only in base-URI, and implicit versioning
relationship [
        <xref ref-type="bibr" rid="ref3 ref4">4, 3</xref>
        ]. We define text-based similarity metrics to characterize the
relation between them. As part of this, we identify whether they may be
different versions of the same graph. Furthermore, if we determine a versioning
relationship between a candidate pair, we generate a delta, i.e., a detailed
description of their differences at the triple level.
      </p>
      <p>These methods enable a Semantic Web search engine to organize its query
results into groups of documents that are similar with respect to the different
metrics, and also to generate deltas between documents judged to have a versioning
relationship.</p>
    </sec>
    <sec id="sec-2">
      <title>Objective and Approach</title>
      <p>Our objective is to identify pairs of documents that are similar to each other in a
collection of SWDs. We characterize SWD similarity along three dimensions:
similarity in the classes and properties used while differing only in literal content,
difference only in base-URI, and versioning relationship. For pairs of SWDs
exhibiting a versioning relationship we compute a triple-level delta.</p>
      <p>Our input corpus is in the form of a set of RDF documents. Our approach
involves the following steps:</p>
      <p>
        Convert to canonical representation: We convert all the documents into
a uniform n-triples serialization. Since equivalent Semantic Web graphs may have
different n-triples serializations, we apply the algorithm described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which
assigns consistent IDs to blank nodes and lexicographically orders the triples.
      </p>
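As a rough illustration only (the algorithm in [4] assigns blank-node IDs more carefully), the canonicalization step can be sketched as follows; the naive first-appearance blank-node relabeling and the function name are assumptions of this sketch:

```python
import re

def canonicalize(ntriples: str) -> list[str]:
    """Relabel blank nodes in order of first appearance, then sort triples.

    A simplistic stand-in for the canonicalization algorithm of [4]:
    it gives equivalent serializations the same text only when blank
    nodes happen to appear in the same relative order.
    """
    mapping: dict[str, str] = {}

    def relabel(m: re.Match) -> str:
        label = m.group(0)
        if label not in mapping:
            mapping[label] = f"_:b{len(mapping)}"
        return mapping[label]

    lines = [
        re.sub(r"_:[A-Za-z0-9]+", relabel, line.strip())
        for line in ntriples.strip().splitlines()
    ]
    return sorted(lines)  # lexicographic triple ordering

doc = """_:x <http://xmlns.com/foaf/0.1/name> "Alice" .
<http://example.org/a> <http://xmlns.com/foaf/0.1/knows> _:x ."""
print(canonicalize(doc))
```

The same graph serialized with blank node `_:y` instead of `_:x`, or with the two triples swapped, yields the identical canonical text.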
      <p>Generate reduced forms: In order to compute the similarity measures,
the canonical representations are decomposed into the following four reduced
forms. Each is a document with:
1. only the literals from the canonicalized n-triples file
2. literals replaced by the empty string1
3. the base-URI of every node replaced by the empty string
4. literals and the base-URI of every node replaced by the empty string
Thus, each Semantic Web graph has a canonical representation and four reduced
forms, i.e., five forms in all.</p>
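A minimal sketch of generating the four reduced forms, assuming plain double-quoted literals and a simple string match for the base-URI (the paper's exact rules may differ); the function name and dictionary keys are illustrative:

```python
import re

# Simplistic literal pattern: assumes no escaped quotes inside literals.
LITERAL = re.compile(r'"[^"]*"')

def reduced_forms(canonical: str, base_uri: str) -> dict[str, str]:
    """Produce the four reduced forms of a canonicalized n-triples document."""
    literals = " ".join(m.group(0) for m in LITERAL.finditer(canonical))
    no_base = canonical.replace(base_uri, "")
    return {
        "literals_only": literals,                           # form 1
        "no_literals": LITERAL.sub('""', canonical),         # form 2
        "no_base_uri": no_base,                              # form 3
        "no_literals_no_base": LITERAL.sub('""', no_base),   # form 4
    }

canon = '<http://example.org/a> <http://example.org/p> "hello" .'
forms = reduced_forms(canon, "http://example.org/")
print(forms["no_base_uri"])
```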
      <p>
        Compute similarity measures: Given the text-based reduced forms, we
can use the following similarity measures from the information retrieval field.
1. Jaccard similarity and containment: We construct sets of character
4-grams from the input documents, which are used to compute the measures.
A high value for both the Jaccard and containment metrics indicates a strong
possibility of a versioning or equivalence relation between the two documents.
2. Cosine similarity between semantic features: Each SWD is
represented as a vector of terms. The non-blank, non-literal nodes from each
SWD are extracted, and their term frequency in the SWD is used as the
feature weight.
3. Charikar's simhash: We compute the Hamming distance between the
simhashes of the documents being compared. Simhash [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a locality-sensitive
hashing technique in which the fingerprints of similar documents differ in
a small number of bit positions.
      </p>
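These measures can be sketched as follows; the MD5-based token hashing and the 64-bit fingerprint width inside the simhash are illustrative assumptions, not details taken from the paper:

```python
import hashlib

def char_ngrams(text: str, n: int = 4) -> set[str]:
    """Set of character n-grams (4-grams by default) from a document."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """|A intersect B| / |A union B|."""
    return len(a & b) / len(a | b) if a or b else 1.0

def containment(a: set, b: set) -> float:
    """Fraction of A's n-grams that also occur in B."""
    return len(a & b) / len(a) if a else 1.0

def simhash(tokens, bits: int = 64) -> int:
    """Charikar's simhash: per-bit signed votes over token hashes."""
    votes = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(x: int, y: int) -> int:
    """Number of differing bit positions between two fingerprints."""
    return bin(x ^ y).count("1")

a = char_ngrams("the quick brown fox jumps")
b = char_ngrams("the quick brown fox jumped")
print(jaccard(a, b), containment(a, b), hamming(simhash(a), simhash(b)))
```

Near-duplicate documents score close to 1.0 on Jaccard and containment, and their simhash fingerprints differ in only a few bit positions.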
      <p>Pairwise computation of metrics: Given an input collection of Semantic Web
documents, we need to find all pairs that are similar to each other. The total number
of metrics computed for each pair of SWDs is 17: two kinds of cosine similarity,
and three other metrics for each reduced-form pair. The process is as follows:
1. Compute the two cosine similarity values between the canonical
representations (already generated) of both SWDs.
2. If the cosine similarity values are below a pre-determined threshold,
eliminate this pair from further consideration; otherwise, add this pair to a list of
candidate pairs. The threshold for this step was determined empirically by
performing pilot experiments, and was set at 0.7.
3. For all candidate pairs, compute the remaining three pairwise similarity
metrics for each reduced form.
1 Another viable approach is to replace each literal string by its XSD data type.</p>
      <p>TP Rate | FP Rate | Precision | Recall | F-Measure | ROC Area | Class
1.000 | 0.040 | 0.962 | 1.000 | 0.980 | 0.979 | yes
0.960 | 0.000 | 1.000 | 0.960 | 0.980 | 0.990 | no
Weighted Avg.: 0.980 | 0.020 | 0.981 | 0.980 | 0.980 | 0.985
Table 2: Accuracy by class (pairs different only in base-URI), using Naive Bayes</p>
      <p>TP Rate | FP Rate | Precision | Recall | F-Measure | ROC Area | Class
0.864 | 0.045 | 0.950 | 0.864 | 0.905 | 0.909 | yes
0.955 | 0.136 | 0.875 | 0.955 | 0.913 | 0.909 | no
Weighted Avg.: 0.909 | 0.091 | 0.913 | 0.909 | 0.909 | 0.909
Table 3: Accuracy by class (versioning relationship), using SVM with linear kernel</p>
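The cosine pre-filter at threshold 0.7 can be sketched as follows; the term-vector cosine over node frequencies and the toy documents are assumptions of this sketch, with the expensive per-reduced-form metrics computed only for pairs that survive the filter:

```python
from collections import Counter
from itertools import combinations
from math import sqrt

THRESHOLD = 0.7  # empirically chosen filter threshold from pilot experiments

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def candidate_pairs(docs: dict[str, Counter]):
    """Yield pairs whose cosine similarity passes the filter.

    Note the quadratic number of comparisons over the corpus.
    """
    for x, y in combinations(docs, 2):
        if cosine(docs[x], docs[y]) >= THRESHOLD:
            yield x, y

# Toy term vectors over non-blank, non-literal nodes (illustrative data):
docs = {
    "d1": Counter({"foaf:Person": 3, "foaf:name": 3}),
    "d2": Counter({"foaf:Person": 2, "foaf:name": 2}),
    "d3": Counter({"ex:Sensor": 5}),
}
print(list(candidate_pairs(docs)))  # [('d1', 'd2')]
```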
      <p>Thus the cosine similarity metric is used as an initial filter to reduce the
remaining computations. Note that this pairwise comparison approach
entails a quadratic number of comparisons.</p>
    </sec>
    <sec id="sec-3">
      <title>Classification and Delta</title>
      <p>
        We trained three classifiers (one for each kind of similarity defined) with a dataset
collected from Swoogle, annotated with human judgements of the three kinds
of similarity. Pairwise similarity measures are computed for each candidate pair
in the labeled dataset, and three different feature vectors (one for each
classifier) are constructed for each candidate pair, using appropriate attributes. The
attributes used are the similarity measures that have been computed. These
classifiers are then used to detect the three forms of similarity that we have defined.
For a list of attributes used for each classifier, see [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Once it is determined that two SWDs have a versioning relationship between
them, we compute the set of statements that describe the change between
successive versions of the SWD. For details on the specific deltas that we compute
between two versions, see [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
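With both versions in canonical n-triples form, a basic triple-level delta amounts to two set differences; the handling of blank-node identity in [4] is more involved than this sketch suggests, and the example triples are illustrative:

```python
def delta(old: set[str], new: set[str]) -> dict[str, set[str]]:
    """Triples to delete from and add to `old` to obtain `new`."""
    return {"deleted": old - new, "added": new - old}

# Two toy versions of a canonicalized graph:
v1 = {'<a> <p> "1" .', '<a> <q> "2" .'}
v2 = {'<a> <p> "1" .', '<a> <q> "3" .'}
d = delta(v1, v2)

# Applying the delta reconstructs the new version:
assert (v1 - d["deleted"]) | d["added"] == v2
print(d)
```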
    </sec>
    <sec id="sec-4">
      <title>Evaluation and Conclusion</title>
      <p>Our system is based on several informal types of similarity that we have observed
among documents in Swoogle's repository. In addition, there exists no standard
labeled dataset of similar Semantic Web documents that we could use for the
purpose of evaluating our system. Hence we constructed a collection of
Semantic Web documents from Swoogle's Semantic Web archive and cache services.
Swoogle periodically crawls the Semantic Web and maintains several snapshots
for each indexed SWD. We added such versions to our dataset and labeled
them as having a versioning relationship. We constructed a labeled dataset of
402 SWDs (over 161,000 candidate pairs) from Swoogle's Semantic Web archive
and cache services2. The results of the classification process are shown in
Tables 1, 2, and 3.</p>
      <p>
        The results shown in Tables 1 and 2 were obtained by performing ten-fold
stratified cross-validation on the labeled dataset. Table 3 shows the results of
evaluation on a test set that was constructed in the same way as the training
dataset. The high true positive rate for determining versioning relationships
between SW graph pairs allows us to develop applications such as delta generation.
For details on the attributes used for each classifier, refer to [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>We developed techniques to recognize when two Semantic Web graphs are
similar and to characterize their differences in three ways. When the system detects
a versioning relationship between a pair, it also generates a delta in terms
of triples to be added or deleted. One future direction is to increase
scalability: we intend to implement a principled filtering mechanism to reduce
the number of pairwise comparisons, and we may use the Billion Triples Challenge
dataset for experiments3. We also plan to develop a global similarity and ranking
scheme where, given a sample, we would be able to find and rank the most similar
matches. This would require representing the similarity measures with more
complex schemes, e.g., lattices.
2 http://swoogle.umbc.edu/index.php?option=com_swoogle_service&amp;service=archive
3 http://km.aifb.kit.edu/projects/btc-2010/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Charikar</surname>
          </string-name>
          .
          <article-title>Similarity estimation techniques from rounding algorithms</article-title>
          .
          <source>In STOC '02: Proc. of the thiry-fourth annual ACM symposium on Theory of computing</source>
          , pages
          <volume>380</volume>
          –
          <fpage>388</fpage>
          , New York, NY, USA,
          <year>2002</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Manku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Sarma</surname>
          </string-name>
          .
          <article-title>Detecting near-duplicates for web crawling</article-title>
          . In C. L.
          <string-name>
            <surname>Williamson</surname>
            ,
            <given-names>M. E.</given-names>
          </string-name>
          <string-name>
            <surname>Zurko</surname>
            ,
            <given-names>P. F.</given-names>
          </string-name>
          <string-name>
            <surname>Patel-Schneider</surname>
          </string-name>
          , and P. J. Shenoy, editors,
          <source>WWW</source>
          , pages
          <volume>141</volume>
          –
          <fpage>150</fpage>
          . ACM,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>K.</given-names>
            <surname>Viswanathan</surname>
          </string-name>
          .
          <article-title>Text Based Similarity Metrics and Delta for Semantic Web Graphs</article-title>
          .
          <source>Master's thesis, UMBC</source>
          , 1000 Hilltop Circle,
          <year>June 2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>K.</given-names>
            <surname>Viswanathan</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Finin</surname>
          </string-name>
          .
          <article-title>Text Based Similarity and Delta for Semantic Web Graphs</article-title>
          .
          <source>Technical report, UMBC</source>
          , 1000 Hilltop Circle,
          <year>August 2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>