<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Common Property and Special Property Entity Summarization Approach Based on Statistical Distribution</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yang Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liang Zhao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, East China University of Science and Technology</institution>
          ,
          <addr-line>Shanghai, 200237</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Building on previous research on entity summarization, we define the concepts of common property and special property based on the statistical distributions of properties, and we use these two kinds of property to summarize entities. A common property is a property that all entities of the same type have. It is a basic property of the entities and helps to recognize a kind of entity. A special property is a property that only a few entities have, and it helps to identify an individual entity among entities of the same kind. In addition, we calculate the importance of property values based on their statistical distributions, so that when an entity has more than one value for the same property, we choose the triple with the more important property value to summarize the entity.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Thalhammer et al. [1] proposed a method that uses the k-nearest neighbors of an
entity and their corresponding properties to summarize the entity, where the k-nearest
neighbors are determined by the similarity between entities. The method
filters out properties that all entities have, which are called common properties in
our paper, and it focuses on the characteristics of entities. Our method takes both
the similarities and the characteristics of the properties of entities into consideration, and
we define common property and special property from a statistical point of view.
A common property can be used to recognize a kind of entity, and a special property
can be used to identify an individual entity among entities of the same kind. These two
kinds of property are used to summarize entities. However, the facts of entities
(triples) consist of properties and property values, and an entity may have one
property with multiple property values. We use the statistical distributions of
property values to determine which property value is fit for summarization. The
occurrence frequency of a property value reflects its importance, and the more
important property value is chosen to summarize the entity. Furthermore, there
are redundant and useless triples in the facts of an entity. We first filter out those
triples and then apply our method to summarize the entity.</p>
      <p>The main ideas of our method:</p>
    </sec>
    <sec id="sec-2">
      <title>Director</title>
    </sec>
    <sec id="sec-3">
      <title>Film</title>
      <p>[Figure 1: distributions of the occurrence number of each property under the classes Director, Film, and Actor.]</p>
      <p>
- There are three kinds of entities in the LinkedMDB-30 Track, namely
Director, Film, and Actor. The distributions of the occurrence number of each
property under the three classes are shown in Figure 1; each class
contains ten entities. We found that some properties appear in almost all the
entities, and we call these properties common properties. Other properties
exist only in a handful of entities, and we call these properties special
properties. In Figure 1, all properties of the class Director appear in all ten
entities, which indicates that entities of the class Director have only common
properties.
- Since an entity may have one property with multiple property values (for
example, a director may have made more than one film, which yields triples
with the same subject and predicate but different objects), the property
value should be considered too. Which property value is taken to
summarize the entity depends on its importance, and the importance is judged
from the dump file (the full data set).</p>
      <p>Common Property and Special Property Entity Summarization Approach</p>
      <p>Our approach consists of four parts: Preprocess, Statistic and
Analysis on Property, Statistic and Analysis on Property Value, and Select Special
Triples and Common Triples. The workflow of our method is shown in Figure 2.</p>
      <sec id="sec-3-1">
        <title>Preprocess</title>
        <p>Through observation and analysis of the entity facts, we find many
redundant and useless triples that should be removed; otherwise, these triples may</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Drop redundant triples</title>
    </sec>
    <sec id="sec-5">
      <title>Drop useless triples</title>
    </sec>
    <sec id="sec-6">
      <title>Distinguish common and special property</title>
      <sec id="sec-6-1">
        <title>Statistic and Analysis on Property</title>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Calculate degree of commonness and specialness</title>
      <sec id="sec-7-1">
        <title>Statistic and Analysis on Property Value</title>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Process the property value of “made”, “actor”, and “rdf:type”</title>
      <sec id="sec-8-1">
        <title>Select Special Triples and Common Triples</title>
        <p>be selected and the quality of the entity summarization would be very low. Here are
some examples of triples that need to be dropped, together with the removal
method.
1. Drop redundant triples. Some triples are expressed differently
but have the same meaning. For example, the facts of Director entities
contain triples like &lt;director/A&gt; &lt;made&gt; &lt;film/B&gt; and &lt;film/B&gt; &lt;director&gt;
&lt;director/A&gt;. These two triples mean the same thing, so we
remove one of them. Our removal method is: if there exist triples A and B
where the subject of A is the object of B and the object of A is the
subject of B, then we remove the triple whose object is the entity.
2. Drop useless triples. a) Drop triples whose object value is null. b) Drop
triples that contain entity id information. c) Drop triples whose predicate
is “page”, “rdf-schema#label”, “owl#sameAs”, etc. These triples have
nothing to do with entity summarization.</p>
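        <p>The two preprocessing rules can be sketched as follows. This is a minimal illustration, not the authors' implementation: triples are assumed to be (subject, predicate, object) tuples, the function names are hypothetical, and the id_marker convention for recognizing entity id predicates is an assumption, since the text does not say how such triples are detected.</p>
        <preformat>
```python
# Hypothetical helper names; triples are (subject, predicate, object) tuples.
USELESS_PREDICATE_SUFFIXES = ("page", "rdf-schema#label", "owl#sameAs")


def drop_redundant(triples, entity):
    """Drop the mirrored copy of a fact, keeping the triple whose subject is the entity."""
    pairs = {(s, o) for s, _, o in triples}
    kept = []
    for s, p, o in triples:
        # A mirrored pair exists when some other triple swaps subject and object.
        mirrored = s != o and (o, s) in pairs
        if mirrored and o == entity:
            continue  # remove the triple whose object is the entity
        kept.append((s, p, o))
    return kept


def is_useless(triple, id_marker="_id"):
    """Rules a) to c); id_marker is an assumed naming convention for id predicates."""
    _, p, o = triple
    if o is None or o == "":
        return True  # a) null object value
    if id_marker in p:
        return True  # b) entity id information (assumed detection rule)
    return p.endswith(USELESS_PREDICATE_SUFFIXES)  # c) page, label, sameAs


def preprocess(triples, entity):
    """Drop redundant triples first, then drop useless ones."""
    return [t for t in drop_redundant(triples, entity) if not is_useless(t)]


facts = [
    ("director/A", "made", "film/B"),
    ("film/B", "director", "director/A"),
    ("director/A", "page", "http://example.org/A"),
    ("director/A", "director_name", None),
    ("director/A", "director_id", "7"),
]
print(preprocess(facts, "director/A"))
```
        </preformat>
        <p>In this toy input only the “made” triple survives: its inverse has the entity as object, and the remaining three triples match one of the useless-triple rules.</p>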
        <p>Statistic and Analysis on Property</p>
        <p>1. Because the special and common properties mentioned above are defined
statistically, we need to count how many entity files each property occurs in. Next,
we regard a property whose occurrence count is more than x as a common
property and add it to the common property candidate list. Likewise, we regard
a property whose occurrence count is less than y as a special property and
add it to the special property candidate list. Here x and y are configurable
threshold parameters.
2. After generating the two candidate lists for each entity, we
calculate the degree of commonness and the degree of specialness of
the properties in the candidate lists and rank each list from high to low.
We record each property's occurrence count in an entity file as n, its occurrence
count in the 10 entity files as N, and its occurrence count in the dump file as Ndump.
We use the following equations to calculate the degrees:</p>
        <p>
          <disp-formula><label>(1)</label><tex-math>\mathrm{SpecialDegree} = \frac{n}{N} + \frac{n}{N_{\mathrm{dump}}} \times 1000</tex-math></disp-formula>
          <disp-formula><label>(2)</label><tex-math>\mathrm{CommonDegree} = \frac{N_{\mathrm{dump}}}{n}</tex-math></disp-formula>
        </p>
        <p>Equation (1) is used to calculate the special degree. The term n/N is the special degree
of a special property in the 10 entity files, and n/Ndump is the special degree of
a special property in the dump file. As the dump file is huge, the value of
n/Ndump is quite small. After analyzing n/Ndump for each special property, we find
that it varies from 0.0001 to 0.001, so we multiply it by a balance parameter of 1000 to
scale its influence. Equation (2) is used to calculate the common degree. We
do not take N/n into account, where N/n is the common degree of a property
in the 10 entity files, because the range of N/n is about 10 while the value of
Ndump/n varies from 3000 to 20000. We calculate a property's commonness in the
dump file only, because the information in the 30 entities is too small to
make their commonness obvious. According to these degrees, we rank each of the
two candidate lists from high to low.</p>
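        <p>The candidate lists and the two degrees can be sketched as below. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names and the entity_files layout are hypothetical, and the common degree is taken to be the ratio Ndump/n, which is one reading of Equation (2).</p>
        <preformat>
```python
from collections import Counter


def candidate_lists(entity_files, x, y):
    """entity_files maps an entity to its list of (subject, predicate, object) triples."""
    file_count = Counter()
    for triples in entity_files.values():
        # Count each property once per entity file.
        file_count.update({p for _, p, _ in triples})
    common = sorted(p for p, c in file_count.items() if c > x)
    special = sorted(p for p, c in file_count.items() if y > c)
    return common, special


def special_degree(n, N, n_dump, balance=1000):
    """Equation (1): n/N + (n/Ndump) * 1000."""
    return n / N + n / n_dump * balance


def common_degree(n, n_dump):
    """Equation (2), assumed here to be Ndump / n."""
    return n_dump / n


def rank_by_degree(degrees):
    """Return property names ordered from highest to lowest degree."""
    return sorted(degrees, key=degrees.get, reverse=True)


files = {
    "director/1": [("director/1", "made", "film/1"), ("director/1", "award", "award/1")],
    "director/2": [("director/2", "made", "film/2")],
    "director/3": [("director/3", "made", "film/3"), ("director/3", "made", "film/4")],
}
common, special = candidate_lists(files, x=2, y=2)
print(common, special)
```
        </preformat>
        <p>With x = 2 and y = 2, “made” occurs in three files and becomes a common candidate, while “award” occurs in one file and becomes a special candidate. The thresholds are parameters, as in the text; the values used here are arbitrary.</p>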
        <sec id="sec-8-1-1">
          <title>Statistic and Analysis on Property Value</title>
          <p>After obtaining the special properties and common properties, we also need to
analyze the statistics of property values. One property may have multiple
property values in one entity. For example, a director
has a property “made”, which states that the director made a film,
and one director can make many films. When we use the property “made”
to summarize a director entity, we need a strategy to determine which film to choose.
We propose the following method:</p>
          <p>We extract each film's information from the dump file. In the dump file, most films have
a property “performance”, but the values are different. We assume that the value
of the property “performance” is the film entity's score, and that a high score implies high
importance. We choose the most important film to summarize a director entity
for the property “made”. Similarly, when summarizing film entities, we may use the
“actor” property, so we also extract actor information from the dump file. We assume
that the more films an actor participates in, the more important the actor is.
The actor's score is the number of films he or she participates in.</p>
          <p>Actor entities and director entities both have two types.
Taking an actor entity as an example, one of its types is “person” and the other is “actor”.
Here “actor” is a subclass of “person”, and the subclass is more accurate
when summarizing the entity. In the dump file, the number of
triples that describe a superclass is larger than the number that describe a subclass.
Thus, when an entity has two types, we select the one that appears fewer
times in the dump file.</p>
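          <p>The value-selection rules above can be sketched as follows. The function names are hypothetical, triples are assumed to be (subject, predicate, object) tuples with films as the subjects of “actor” triples, and the score mapping passed to rank_values stands in for whatever score is read from the dump file (for films, the “performance” value).</p>
          <preformat>
```python
def actor_scores(dump):
    """Score each actor by the number of distinct films that list it under 'actor'."""
    films_by_actor = {}
    for s, p, o in dump:
        if p == "actor":
            films_by_actor.setdefault(o, set()).add(s)
    return {actor: len(films) for actor, films in films_by_actor.items()}


def rank_values(values, score):
    """Order candidate property values from highest to lowest score."""
    return sorted(values, key=lambda v: score.get(v, 0), reverse=True)


def most_specific_type(types, dump_type_count):
    """Pick the type that appears fewer times in the dump (assumed to be the subclass)."""
    return min(types, key=lambda t: dump_type_count[t])


dump = [
    ("film/1", "actor", "actor/X"),
    ("film/2", "actor", "actor/X"),
    ("film/1", "actor", "actor/Y"),
]
scores = actor_scores(dump)
print(rank_values(["actor/Y", "actor/X"], scores))
print(most_specific_type(["person", "actor"], {"person": 100, "actor": 10}))
```
          </preformat>
          <p>The same rank_values helper applies to films once a film score mapping is available; the type counts in the example are made up for illustration.</p>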
        </sec>
        <sec id="sec-8-1-2">
          <title>Select Special Triples and Common Triples</title>
          <p>We select z triples that contain special properties and T-z triples that contain
common properties to summarize an entity, where z is a configurable
parameter and T is the total number of triples in the summarization result. Because there are
many more common properties than special properties, and not all entity files contain
z special properties (e.g. director entities), z is only a maximum
value. If there are only t (t&lt;z) special triples, then T-t common triples are
selected. The special properties and common properties are selected from
the two ranked lists generated in the Statistic and Analysis on Property step. When processing a property
with multiple values (i.e. actors or films), we select candidates from high to low
according to the scores from the Statistic and Analysis on Property Value step.</p>
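          <p>The selection rule can be sketched in a few lines. The function name is hypothetical, and both inputs are assumed to be lists of triples that are already ranked from high to low.</p>
          <preformat>
```python
def select_triples(special_ranked, common_ranked, T, z):
    """Take up to z special triples, then fill the summary up to T with common triples."""
    special = special_ranked[:min(z, T)]
    remaining = max(T - len(special), 0)
    return special + common_ranked[:remaining]


# Only one special triple is available, so four common triples fill the summary.
summary = select_triples(["s1"], ["c1", "c2", "c3", "c4", "c5", "c6"], T=5, z=2)
print(summary)
```
          </preformat>
          <p>Because z is only an upper bound, an entity with no special triples (such as a director entity) receives T common triples.</p>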
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Thalhammer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toma</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roa-Valverde</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fensel</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Leveraging usage data for linked data movie entity summarization</article-title>
          .
          <source>Computer Science - Artificial Intelligence</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>