A Common Property and Special Property
      Entity Summarization Approach Based on
               Statistical Distribution*

                                Yang Li and Liang Zhao

                 Department of Computer Science and Engineering
      East China University of Science and Technology, Shanghai, 200237, China
                      marine1ly@163.com, 252007913@qq.com
        Abstract. Building on previous research on entity summarization, we
        define the concepts of common property and special property based on
        the statistical distributions of properties, and we use these two kinds
        of property to summarize entities. A common property is a property that
        all entities of the same type have. It is a basic property of those
        entities and helps to recognize the kind of entity. A special property
        is a property that only a few entities have, and it helps to identify
        an individual entity among entities of the same kind. In addition, we
        calculate the importance of property values from their statistical
        distributions, so that when an entity has more than one value for the
        same property, we choose the triple with the most important property
        value to summarize the entity.

1     Introduction
Thalhammer et al. [1] proposed a method that summarizes an entity using its
k-nearest neighbors and their corresponding properties, where the k-nearest
neighbors are determined by the similarity between entities. That method
filters out properties that all entities have, which we call common properties
in our paper, and focuses on the characteristics of entities. Our method takes
both the similarities and the characteristics of entity properties into
consideration, and we define common property and special property from a
statistical point of view. A common property can be used to recognize a kind of
entity, and a special property can be used to identify an individual entity
among entities of that kind. Both kinds of property are used to summarize
entities. However, the facts of an entity (triples) consist of properties and
property values, and an entity may have one property with multiple property
values. We use the statistical distributions of property values to determine
which value is suitable for summarization: the occurrence frequency of a
property value reflects its importance, and the more important value is chosen
to summarize the entity. Furthermore, the facts of an entity contain redundant
and useless triples. We first filter out those triples and then apply our
method to summarize the entity.
     The main ideas of our method:
*
    This work was partially supported by the 863 plan of China Ministry of Science
    and Technology (project No: 2015AA020107), and Software and Integrated Circuit
    Industry Development Special Funds of Shanghai Economic and Information
    Commission (project No: 140304).
           Fig. 1. The distributions of the occurrence number of each property
           (legend: Director, Film, Actor)
 – There are three kinds of entities in the LinkedMDB-30 Track, namely
   Director, Film, and Actor. The distributions of the occurrence number of
   each property under the three classes are shown in Figure 1; each class
   contains ten entities. We found that some properties appear in almost all
   the entities, and we call them common properties. Other properties exist in
   only a handful of entities, and we call them special properties. In Figure 1,
   all properties of the class Director appear in all ten entities, which
   indicates that entities of the class Director have only common properties.
 – Since an entity may have one property with multiple property values (for
   example, a director may have made more than one film, which gives triples
   with the same subject and predicate but different objects), the property
   value should be considered too. Which property value is used to summarize
   the entity depends on its importance, and the importance is judged from
   the dump file (the full data set).


2     Common Property and Special Property Entity
      Summarization Approach
The process of our approach consists of four parts: Preprocess, Statistic and
Analysis on Property, Statistic and Analysis on Property Value, and Select
Special Triples and Common Triples. The workflow of our method is shown in
Figure 2.

2.1       Preprocess
Through observation and analysis of entity facts, we find many redundant and
useless triples that should be removed. Otherwise, these triples may be
selected and the quality of the entity summarization would be very low. Below
are some examples of triples that need to be dropped, together with the method
for removing them.

   Fig. 2. Workflow: (1) Preprocess: drop redundant triples; drop useless
   triples. (2) Statistic and Analysis on Property: distinguish common and
   special property; calculate degree of commonness and specialness.
   (3) Statistic and Analysis on Property Value: process the property values
   of “made”, “actor”, and “rdf:type”. (4) Select Special Triples and Common
   Triples.



1. Drop redundant triples. Some triples are expressed differently but have the
   same meaning. For example, the facts of Director entities contain triples
   such as <director/A> <made> <film/B> and <film/B> <director> <director/A>.
   These two triples state the same fact, so we remove one of them. Our removal
   rule is: if there exist triples A and B such that the subject of A is the
   object of B and the object of A is the subject of B, then we remove the
   triple whose object is the entity being summarized.
2. Drop useless triples. a) Drop triples whose object value is null. b) Drop
   triples that contain entity id information. c) Drop triples whose predicate
   is “page”, “rdf-schema#label”, “owl#sameAs”, etc. These triples contribute
   nothing to entity summarization.
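
The two preprocessing rules can be sketched as follows. This is only a minimal illustration, not the authors' implementation: it assumes triples are plain (subject, predicate, object) tuples, and the names `preprocess`, `drop_redundant`, `is_useless`, and `USELESS_PREDICATES`, as well as the heuristic that id-carrying predicates end in "id", are hypothetical choices made for the example.

```python
# Predicates treated as useless. The text lists these three followed by
# "etc.", so the exact contents of this set are an assumption.
USELESS_PREDICATES = {"page", "rdf-schema#label", "owl#sameAs"}


def local_name(predicate):
    """Return the part of a predicate IRI after the last '/'."""
    return predicate.rstrip("/").rsplit("/", 1)[-1]


def is_useless(triple):
    """Rule 2: null object, entity-id information, or an uninformative predicate."""
    _, predicate, obj = triple
    if obj is None or obj == "":
        return True
    name = local_name(predicate)
    # Assumption: predicates that carry an entity id end in "id".
    if name.lower().endswith("id"):
        return True
    return name in USELESS_PREDICATES


def drop_redundant(triples, entity):
    """Rule 1: when (s, p, o) and (o, q, s) both exist, drop the triple
    whose object is the entity being summarized."""
    pairs = {(s, o) for s, _, o in triples}
    kept = []
    for s, p, o in triples:
        if o == entity and s != o and (o, s) in pairs:
            continue  # the inverse triple (entity, ..., s) states the same fact
        kept.append((s, p, o))
    return kept


def preprocess(triples, entity):
    """Drop useless triples first, then redundant inverse triples."""
    return drop_redundant([t for t in triples if not is_useless(t)], entity)
```

For the Director example above, calling `preprocess` with `"director/A"` as the entity keeps the `made` triple and drops its inverse `director` triple.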
2.2     Statistic and Analysis on Property
1. As the special property and common property defined above are based on
   statistics, we need to count how many entity files each property occurs in.
   We regard a property whose occurrence count is more than x as a common
   property and add it to the common property candidate list. Likewise, we
   regard a property whose occurrence count is less than y as a special
   property and add it to the special property candidate list. Here x and y
   are configurable threshold parameters.
2. After generating the two candidate lists for each entity, we calculate the
   degree of commonness and the degree of specialness of the properties in the
   candidate lists and rank each list from high to low. We denote a property's
   number of occurrences in an entity file as $n$, its number of occurrences in
   the 10 entity files as $N$, and its number of occurrences in the dump file
   as $N_{dump}$. We use the following equations to calculate the degrees:
      $$\mathrm{SpecialDegree} = \frac{n}{N} + \frac{n}{N_{dump}} \times 1000 \qquad (1)$$

      $$\mathrm{CommonDegree} = \frac{N_{dump}}{N} \qquad (2)$$
      Equation (1) is used to calculate the special degree. $n/N$ is the
      special degree of a special property in the 10 entity files, and
      $n/N_{dump}$ is the special degree of a special property in the dump
      file. As the dump file is huge, the value of $n/N_{dump}$ is quite small.
      After analyzing $n/N_{dump}$ for each special property, we find that it
      varies from 0.0001 to 0.001, so we multiply it by a balance parameter of
      1000 to scale its influence. Equation (2) is used to calculate the common
      degree. We do not take $N/n$ into account, where $N/n$ is the common
      degree of a property in the 10 entity files, because the range of $N/n$
      is about 10 whereas the value of $N_{dump}/n$ varies from 3000 to 20000.
      We calculate a property's commonness in the dump file only, because the
      30 entities contain too little information to make commonness evident.
      According to these degrees, we rank each of the two candidate lists from
      high to low.
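
The two steps above can be sketched in a few lines. This is a hedged illustration rather than the authors' code: it assumes each entity file is a list of (subject, predicate, object) tuples keyed by a file name, and every function name here (`files_containing`, `candidate_lists`, `special_degree`, `common_degree`, `rank`) is hypothetical. `common_degree` follows Equation (2) as printed.

```python
from collections import Counter


def files_containing(entity_files):
    """Count, per property, how many entity files mention it at least once."""
    counts = Counter()
    for triples in entity_files.values():
        counts.update({p for _, p, _ in triples})
    return counts


def candidate_lists(entity_files, x, y):
    """Step 1: properties in more than x files are common candidates,
    properties in fewer than y files are special candidates."""
    counts = files_containing(entity_files)
    common = {p for p, c in counts.items() if c > x}
    special = {p for p, c in counts.items() if c < y}
    return common, special


def special_degree(n, N, n_dump, balance=1000):
    """Equation (1): n/N + (n/N_dump) * balance."""
    return n / N + n / n_dump * balance


def common_degree(N, n_dump):
    """Equation (2) as printed: N_dump / N."""
    return n_dump / N


def rank(properties, degree):
    """Step 2: order candidate properties from high to low degree."""
    return sorted(properties, key=degree, reverse=True)
```

With n = 1, N = 1, and N_dump = 2000, the second term of Equation (1) is 0.0005 × 1000 = 0.5, which puts it on the same scale as the first term, as the balance parameter is meant to do.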

2.3     Statistic and Analysis on Property Value
After obtaining special properties and common properties, we also need
statistics and analysis of property values. One property may have multiple
property values in one entity. For example, a director has a property “made”,
which states that the director made a film, and one director may make many
films. When we use the property “made” to summarize a director entity, we need
a strategy to determine which film to choose. We propose the following method.
    We extract each film's information from the dump file. In the dump file,
most films have a property “performance”, but the values differ. We assume that
the value of the property “performance” is the film entity's score, and that a
high score implies high importance. For the property “made”, we choose the most
important film to summarize the director entity. Similarly, when summarizing
film entities, we may use the “actor” property, so we also extract actor
information from the dump file. We assume that the more films an actor
participates in, the more important the actor is. The actor's score is the
number of films he or she participates in.
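
A minimal sketch of these two scoring rules is given below. It assumes the dump is available as an iterable of (subject, predicate, object) tuples and, following the assumption stated in the text, that the “performance” value can be read as a number. The function names `film_scores`, `actor_scores`, and `best_value` are hypothetical.

```python
from collections import Counter


def film_scores(dump):
    """Score each film by the value of its "performance" property.
    Assumption: the value parses as a number."""
    scores = {}
    for s, p, o in dump:
        if p == "performance":
            scores[s] = float(o)
    return scores


def actor_scores(dump):
    """Score each actor by the number of distinct films that list them."""
    film_actor_pairs = {(s, o) for s, p, o in dump if p == "actor"}
    return Counter(actor for _, actor in film_actor_pairs)


def best_value(values, scores):
    """Pick the property value with the highest score (unknown values score 0)."""
    return max(values, key=lambda v: scores.get(v, 0))
```

For a director with several “made” triples, `best_value` applied to the film objects and the result of `film_scores` returns the film that would be used in the summary.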
    Actor entities and director entities each have two types. Take an actor
entity as an example: one of its types is “person” and the other is “actor”.
“actor” is a subclass of “person”, and the subclass is more precise when
summarizing an entity. In the dump file, the number of triples that describe a
superclass is larger than the number that describe its subclass. Thus, when an
entity has two types, we select the one that appears fewer times in the dump
file.
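
The type rule amounts to picking the least frequent type value. The sketch below illustrates it under the assumption that the dump is an iterable of (subject, predicate, object) tuples and that types are stated with an `rdf:type` predicate. The names `type_counts` and `most_specific_type` are hypothetical.

```python
from collections import Counter


def type_counts(dump, type_predicate="rdf:type"):
    """Count how many triples in the dump assign each type."""
    return Counter(o for _, p, o in dump if p == type_predicate)


def most_specific_type(types, counts):
    """Select the type that appears least often in the dump.
    Note: a type absent from the dump counts as 0 and would be selected."""
    return min(types, key=lambda t: counts[t])
```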


2.4   Select Special Triples and Common Triples
We select z triples that contain special properties and T-z triples that
contain common properties to summarize an entity, where z is a configurable
parameter and T is the total number of triples in the summary. As there are
many more common properties than special properties, and not all entity files
have z special properties (e.g. director entities), z is only a maximum value.
If there are only t (t<z) special triples, then T-t common triples are
selected. The special properties and common properties are selected from the
two ranked lists generated in Section 2.2, respectively. When processing a
property with multiple values (i.e. actors or films), we select candidates from
high to low according to the scores described in Section 2.3.
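
The selection rule can be sketched as follows, assuming the two inputs are lists of triples already ranked from high to low as described in Sections 2.2 and 2.3. The function name `select_triples` is hypothetical.

```python
def select_triples(special_ranked, common_ranked, T, z):
    """Take at most z special triples, then fill the summary up to T
    triples with common triples."""
    special = special_ranked[:min(z, T)]
    common = common_ranked[:T - len(special)]
    return special + common
```

When an entity has only t < z special triples, the slice returns those t triples and the remaining T-t slots are filled from the common list.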

References
1. Thalhammer, A., Toma, I., Roa-Valverde, A., Fensel, D.: Leveraging usage data for
   linked data movie entity summarization. Computer Science - Artificial Intelligence
   (2012)