<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Types of Property Pairs and Alignment on Linked Datasets - A Preliminary Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kalpa Gunaratna</string-name>
          <email>kalpa@knoesis.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Krishnaprasad Thirunarayan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amit Sheth</string-name>
          <email>amit@knoesis.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kno.e.sis Center, Wright State University</institution>
          ,
          <addr-line>Dayton OH</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>35</fpage>
      <lpage>39</lpage>
      <abstract>
        <p>Dataset publication on the Web has been greatly influenced by the Linked Open Data (LOD) project. Many interlinked datasets have become freely available on the Web, creating a structured and distributed knowledge representation. Analysis and alignment of concepts and instances in these interconnected datasets have received more attention in the recent past than the analysis and alignment of properties. We identify three different categories of property pairs found in the alignment process and study their relative distribution among well-known LOD datasets. We also provide a comparative analysis of state-of-the-art techniques with regard to the different categories, highlighting their capabilities. This could lead to more realistic and useful alignment of properties in LOD and similar datasets.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Data</kwd>
        <kwd>Property Alignment</kwd>
        <kwd>Property Pair Analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        LOD [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has popularized publishing individual datasets on the Web
with interconnections between them. This has resulted in the creation of a huge structured
knowledge graph on the Web. Since dataset publishers are autonomous and design
their datasets to serve their own original purposes, data
interoperability and data integration tasks on these datasets are challenging. Property
alignment is one such research problem: handling the complex data representations in
these interconnected datasets requires innovative solutions that go well beyond
simple string manipulations.
      </p>
      <p>
        We introduced a novel way of computing property alignment (similarity) between
interconnected datasets by exploring the available links between the datasets and using
statistical measures [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Our solution can successfully handle complex data
representations found at the property level in the matching process. We start with a breakdown
of types of property pairs found on the LOD and discuss the performance of matching
algorithms on the non-trivial task of property alignment between datasets. The
analysis is based on manually identified and categorized property pairs of a sample of well
known linked datasets in the LOD cloud. Moreover, the analysis presents how many of
the manually identified property pairs in each category are identified by the different
matching techniques (recall for each property type) highlighting their applicability.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Analysis</title>
      <p>
        We analyze the different types of property pairs encountered in the experiments
performed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Such analysis can provide a deeper understanding of types of property
pairs that exist in linked datasets and how matching of such property pairs can be
improved between two linked datasets using property extensions.
      </p>
      <p>We can categorize the types of property pairs between linked datasets in two
orthogonal ways: (1) on the basis of their semantics, and (2) on the basis of the techniques and
tools required to determine the inter-relationships or alignment among property pairs.
On the basis of semantics, related property pairs can be classified as (1) equivalent
properties or (2) those possessing a property-subproperty relationship. On the basis of
the techniques used to align properties, we can classify property pairs as follows:</p>
      <list list-type="order">
        <list-item>
          <p>Simple property pairs: These have high syntactic similarity in the property names
and may have a common prefix, common suffix, adjectives, or different ordering of
words, e.g., birthPlace vs placeOfBirth. Here the words "place" and "birth" appear in
a different order in the two properties.</p>
        </list-item>
        <list-item>
          <p>Opaque property pairs: These have the same meaning but use different words. They
can be further divided into two subtypes.</p>
          <list list-type="alpha-lower">
            <list-item>
              <p>Synonymous property pairs: Similarity of the two properties is intentional and can be decided
by analyzing the meaning of the property names. This can
be achieved by using an external dictionary or a lexical database like
WordNet. If a property name is a multi-word phrase, similarity can be checked after removing
common words from the property names, e.g., occupation vs profession, city
of birth vs place of birth. In the second property pair, the common suffix ("of birth") can
be eliminated from the comparison.</p>
            </list-item>
            <list-item>
              <p>Complex property pairs: Similarity cannot be determined by considering
property names alone, but requires additional information such as extension
analysis, and domain and range. The names are ambiguous or have multiple meanings
in general but have a specific meaning within a dataset, e.g., battle vs participated in conflict,
resting place vs place of burial. The two terms "conflict" and "resting place"
have multiple meanings and are used in many contexts. Hence, the similarity is
harder to identify.</p>
            </list-item>
          </list>
        </list-item>
      </list>
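      <p>The name based distinction described above can be illustrated with a minimal Python sketch. It is not the categorization procedure used in this analysis (the pairs here were categorized manually); the stop-word list, the Jaccard overlap measure, the 0.5 threshold, and the function names are all assumptions chosen for illustration.</p>
      <preformat>
```python
import re

# Hypothetical stop-word list; the paper does not specify one.
STOP_WORDS = {"of", "the", "in", "a", "an"}


def tokenize(property_name):
    """Split a camelCase or space-separated property name into lowercase tokens."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", property_name)
    return [p.lower() for p in parts if p.lower() not in STOP_WORDS]


def name_overlap(name_a, name_b):
    """Jaccard overlap between the token sets of two property names."""
    a, b = set(tokenize(name_a)), set(tokenize(name_b))
    if not a or not b:
        return 0.0
    return len(a.intersection(b)) / len(a.union(b))


def looks_simple(name_a, name_b, threshold=0.5):
    """Heuristic (assumed threshold): a pair looks 'simple' if the names share enough tokens."""
    return name_overlap(name_a, name_b) >= threshold
```
      </preformat>
      <p>Under these assumptions, birthPlace and placeOfBirth share all of their tokens and would be flagged as a simple pair, while occupation and profession share none, so a dictionary based or extension based method would be needed to relate them.</p>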
      <p>
        In this analysis, we highlight the advantages of using property extensions compared
to string based and external dictionary based methods that focus on analyzing property
names in the matching process. We consider only object-type properties for this analysis
in the DBpedia, Freebase, LinkedMDB, and DBLP datasets<sup>1</sup>, taking 5000 instances in each
sample set [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We did not consider property chains or composite property alignment in
this preliminary analysis, which belong to the complex property pair type. Composite
property alignment is the process of aligning a property in one dataset with several
properties (or property chains) in another dataset. There exist other efforts (within datasets),
different from ours, that analyze sets of properties in RDF [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], combination of
properties and classes in LOD [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and time dynamics of LOD [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>Footnote 1: the person, film, and software domains between DBpedia and Freebase, films between DBpedia
and LinkedMDB, and articles in DBLP (L3S and RKB Explorer).</p>
      <p>[Figure 1. (a) Property pair type breakdown (legend: Sim, Comp, Syn); (b) Exact matching of property pairs (legend: Exact, Non-Exact). Both panels show percentages from 0% to 100% for DB-FB Person, DB-FB Film, DB-FB Software, DB-LMDB Film, and DBLP Articles.]</p>
      <p>The correct matches in this analysis were manually identified and categorized by the
authors and verified by an external reviewer. Figure 1 shows the breakdown of property
pairs into the three types that we are interested in. According to Figure 1(a), the majority
of the property pairs are simple property pairs, followed by complex and
synonymous property pairs. Moreover, some property pairs can be matched using exact
property name matching, as shown in Figure 1(b), but they account for a smaller share than
the pairs that cannot. Based on Figure 1, on average, the majority of the matching property pairs
are simple, but cannot be matched using exact matching of property names.</p>
      <p>
        There are different approaches for aligning property pairs between datasets
including [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which is based on property extension matching. In the extension based
approach, alignment of two properties is decided by aggregating the number of matched
subject-object pairs in the property extension over the number of co-appearances of
the property pair in two linked datasets. We utilized Entity Co-Reference (ECR) links
that exist between linked datasets in matching extensions. That is, two instances (in
the property extension) are considered the same if they are connected by an ECR link.
There can be incorrect matches for each property because extensions of properties overlap.
For example, the "birthPlace" property may match "deathPlace" with some overlap in
the extension, but when the whole result set is aggregated and analyzed, these
coincidental matches can be eliminated. For the WordNet based approach, we calculated the
normalized WordNet similarity using eight similarity measures<sup>2</sup> found in the literature
over terms appearing in the property names after removing stop words. For string
similarity measurements, we added stemming in the preprocessing step before computing
the similarity over property names. More details including threshold values and
formulas used for matching are in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
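      <p>The extension based idea described above can be sketched in Python as follows. This is a simplified reading of the description, not the algorithm of [<xref ref-type="bibr" rid="ref4">4</xref>]: the triple format, the dictionary of ECR links, the choice to count one co-appearance per linked subject pair, and all function names are assumptions made for illustration. The threshold values and exact formulas are given in [<xref ref-type="bibr" rid="ref4">4</xref>].</p>
      <preformat>
```python
from collections import defaultdict


def index_by_subject(triples):
    """Group (subject, property, object) triples as subject -> property -> set of objects."""
    index = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        index[s][p].add(o)
    return index


def extension_scores(triples_a, triples_b, ecr_links):
    """Ratio of matched subject-object pairs to co-appearances for each property pair.

    ecr_links (assumed format) maps an entity id in dataset A to its
    co-referent entity id in dataset B.
    """
    index_a = index_by_subject(triples_a)
    index_b = index_by_subject(triples_b)
    matched = defaultdict(int)
    co_appearances = defaultdict(int)

    for subject_a, props_a in index_a.items():
        subject_b = ecr_links.get(subject_a)
        if subject_b not in index_b:
            continue  # no ECR link, or the linked subject has no triples
        for prop_a, objects_a in props_a.items():
            # Translate A-side objects into B-side ids through the ECR links.
            linked_objects = {ecr_links[o] for o in objects_a if o in ecr_links}
            for prop_b, objects_b in index_b[subject_b].items():
                co_appearances[(prop_a, prop_b)] += 1
                if not linked_objects.isdisjoint(objects_b):
                    matched[(prop_a, prop_b)] += 1

    return {pair: matched[pair] / count for pair, count in co_appearances.items()}
```
      </preformat>
      <p>In this sketch, a pair such as birthPlace and deathPlace that agrees only for some subjects ends up with a lower ratio than a pair that agrees for every linked subject, which is how aggregation over the whole result set can separate coincidental matches from genuine ones.</p>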
      <p>Considering these matchers, Figure 2 shows the percentages of correctly
identified property pairs for the three types of property pairs. It also shows the superiority
of the extension based approach over the string based and dictionary (WordNet) based
approaches.</p>
      <p>[Figure 2. Percentage of correctly identified property pairs per matcher: (a) Extension based algorithm matching (legend: Sim, Comp, Syn); (b) WordNet similarity based matching (legend: Sim, Sim &amp; Exact, Comp, Comp &amp; Exact, Syn, Syn &amp; Exact); (c) String similarity based - Jaro Winkler (legend: Sim, Comp, Syn); (d) String similarity based - Dice (legend: Sim, Comp, Syn). All panels show percentages from 0% to 100% for DB-FB Person, DB-FB Film, DB-FB Software, DB-LMDB Film, and DBLP Articles; panels (c) and (d) carry a "No Results" annotation.]</p>
      <p>Footnote 2: namely, LCH, RES, HSO, JCN, LESK, PATH, WUP, and LIN.</p>
      <p>
        It is clear from Figures 2(a), 2(b), 2(c), and 2(d) that the extension based
approach performed better and achieved the highest results in matching all three types
of property pairs. We added exact matching of property names to the WordNet
based algorithm, which improved its performance as shown in Figure 2(b). This is
because some word phrases cannot be matched (searched) using WordNet even though they have
the same or common word phrases in their names. It is also interesting to note that
the WordNet based approach failed to identify any of the synonymous property pairs
in most of the experiments, as shown in Figure 2(b). This kind of behavior is expected
for string similarity or syntax based approaches, but not for a lexical database based
approach like WordNet, which specializes in categorizing synonymous words. Figures
2(c) and 2(d) present matching performance when the similarity of property names is
computed using string matching algorithms. String similarity based
matching missed all synonymous and complex property pairs, making it unsuitable
for matching property pairs in general. Based on the recall values presented
in Figure 2, extension based property alignment can identify more
property pairs, including complex and hidden property pairs, than the other approaches.
Furthermore, Table 1 outlines both precision and recall for each matcher over all property
pair types, which also sheds light on false positives (see [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for more details). Note that
it is not possible to provide a precision breakdown for each property pair type,
since the alignment process identifies matching pairs of all types together rather than each type separately.
      </p>
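      <p>The difference between per-type recall and pooled precision can be made concrete with a small Python sketch. It assumes a gold standard given as a mapping from each manually identified property pair to its type, and a matcher output given as a plain set of pairs; both data layouts and the function names are illustrative assumptions, not part of the evaluation code behind this analysis.</p>
      <preformat>
```python
from collections import Counter


def recall_by_type(gold, predicted):
    """Recall per property pair type.

    gold maps each correct pair to its type ("simple", "complex", "synonymous");
    predicted is the set of pairs returned by a matcher.
    """
    totals = Counter(gold.values())
    found = Counter(kind for pair, kind in gold.items() if pair in predicted)
    return {kind: found[kind] / totals[kind] for kind in totals}


def overall_precision(gold, predicted):
    """Share of returned pairs that are in the gold standard, all types pooled."""
    if not predicted:
        return 0.0
    correct = sum(1 for pair in predicted if pair in gold)
    return correct / len(predicted)
```
      </preformat>
      <p>A false positive is by definition absent from the gold standard and therefore carries no type label, which is why this sketch can split recall by type but only report precision over all returned pairs.</p>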
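      <p>[Table 1. Precision and recall of each matcher over all property pair types. Surviving row labels: Extension Based Algorithm, Dice Similarity.]</p>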
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>We provided a breakdown of types of property pairs that can be found on linked datasets
in the alignment process. Even though the majority of the property pairs are simple,
many cannot be identified using string manipulation techniques. In our sample datasets,
63%, 29%, and 8% of all property pairs are simple, complex, and synonymous,
respectively. We have shown that in every category, extension based property pair alignment
showed better results. For example, the extension based approach showed an average
improvement in the range of 5% - 32% compared to simple syntactic and WordNet
based approaches. Hence, we conclude that the extension (or instance) based approach
can discover many property pairs that are semantically the same, which cannot be
uncovered by purely syntactic means.</p>
      <p>Acknowledgement - This work was supported by the NSF award 1143717 “III:
EAGER Expressive Scalable Querying over Linked Open Data”. Any opinions,
findings, and conclusions or recommendations expressed in this material are those of the
author(s) and do not necessarily reflect the views of the National Science Foundation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. More details at http://wiki.knoesis.org/index.php/Property_Alignment</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Linked data - the story so far</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems (IJSWIS)</source>
          <volume>5</volume>
          (
          <issue>3</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gottron</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knauf</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scheglmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scherp</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A systematic investigation of explicit and implicit schema information on the linked open data cloud</article-title>
          .
          <source>In: The Semantic Web: Semantics and Big Data</source>
          , pp.
          <fpage>228</fpage>
          -
          <lpage>242</lpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Gunaratna</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirunarayan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sheth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wijeratne</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A statistical and schema independent approach to identify equivalent properties on linked data</article-title>
          .
          <source>In: 9th International Conference on Semantic Systems. ACM</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Käfer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abdelrahman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Byrne</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Observing linked data dynamics</article-title>
          .
          <source>In: The Semantic Web: Semantics and Big Data</source>
          , pp.
          <fpage>213</fpage>
          -
          <lpage>227</lpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moerkotte</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Characteristic sets: Accurate cardinality estimation for rdf queries with multiple joins</article-title>
          .
          <source>In: Data Engineering (ICDE), 2011 IEEE 27th International Conference on</source>
          , pp.
          <fpage>984</fpage>
          -
          <lpage>994</lpage>
          . IEEE (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>