=Paper=
{{Paper
|id=Vol-1605/ensec2
|storemode=property
|title=A Common Property and Special Property Entity Summarization Approach Based on Statistical Distribution
|pdfUrl=https://ceur-ws.org/Vol-1605/ensec2.pdf
|volume=Vol-1605
|authors=Yang Li,Liang Zhao
|dblpUrl=https://dblp.org/rec/conf/esws/LiZ16
}}
==A Common Property and Special Property Entity Summarization Approach Based on Statistical Distribution==
A Common Property and Special Property
Entity Summarization Approach Based on
Statistical Distribution
Yang Li and Liang Zhao
Department of Computer Science and Engineering
East China University of Science and Technology, Shanghai, 200237, China
marine1ly@163.com, 252007913@qq.com
Abstract. Building on previous research on entity summarization, we define
the concepts of common property and special property based on the
statistical distributions of properties, and we use these two kinds of
property to summarize entities. A common property is a property that all
entities of the same type have. It is a basic property of those entities,
and it helps to recognize a kind of entity. A special property is a
property that only a few entities have, and it helps to identify an
individual entity among entities of the same kind. In addition, we
calculate the importance of property values based on their statistical
distributions, so that when an entity has more than one value for the same
property, we choose the triple with the more important property value to
summarize the entity.
1 Introduction
Thalhammer et al. [1] proposed a method that summarizes an entity using its
k-nearest neighbors and their corresponding properties, where the k-nearest
neighbors are determined by the similarity between entities. The method
filters out properties that all entities have, which are called common
properties in our paper, and it focuses on the characteristics of entities.
Our method takes both the similarities and the characteristics of entity
properties into consideration, and we define common property and special
property from a statistical point of view. A common property can be used to
recognize a kind of entity, and a special property can be used to identify an
individual entity among entities of the same kind. These two kinds of property
are used to summarize entities. However, the facts of an entity (triples)
consist of properties and property values, and an entity may have one property
with multiple property values. We use the statistical distributions of
property values to determine which property value is suitable for
summarization. The occurrence frequency of a property value reflects its
importance, and the more important property value is chosen to summarize the
entity. Furthermore, the facts of an entity contain redundant and useless
triples. We first filter out those triples and then apply our method to
summarize the entity.
The main ideas of our method:
This work was partially supported by the 863 plan of China Ministry of Science
and Technology (project No: 2015AA020107), and Software and Integrated Circuit
Industry Development Special Funds of Shanghai Economic and Information
Commission (project No: 140304).
[Figure 1: plot of the occurrence number of each property for the classes Director, Film, and Actor; axis ticks 0 to 10 and 1 to 10]
Fig. 1. The distributions of the occurrence number of each property
– There are three kinds of entities in the LinkedMDB-30 Track, namely
Director, Film, and Actor. The distributions of the occurrence number of
each property under the three classes are shown in Figure 1, and each class
contains ten entities. We found that some properties appear in almost all
the entities, and we call those properties common properties. Other
properties exist only in a handful of entities, and we call those properties
special properties. In Figure 1, all properties of the class Director appear
in all ten entities, which indicates that entities of the class Director
have only common properties.
– Since an entity may have one property with multiple property values (for
example, a director may have made more than one film, which yields triples
with the same subject and predicate but different objects), the property
value should be considered too. Which property value is taken to summarize
the entity depends on its importance, and the importance is judged from the
dump file (the full data set).
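The per-class distribution described in the first bullet can be obtained by counting, for each property, the number of entities in which it occurs. The following is a minimal sketch under stated assumptions: the function name `property_entity_counts`, the dictionary layout, and the toy triples are hypothetical and are not taken from the paper or from LinkedMDB.

```python
from collections import Counter


def property_entity_counts(entity_facts):
    """Count, for each property, how many entities have it at least once.

    entity_facts maps an entity name to a list of
    (subject, predicate, object) triples (assumed layout).
    """
    counts = Counter()
    for triples in entity_facts.values():
        # A set makes a multi-valued property count once per entity.
        counts.update({predicate for _, predicate, _ in triples})
    return counts


# Toy data for illustration only.
facts = {
    "film1": [
        ("film1", "actor", "a1"),
        ("film1", "actor", "a2"),
        ("film1", "date", "1999"),
    ],
    "film2": [
        ("film2", "actor", "a3"),
        ("film2", "sequel", "film3"),
    ],
}
counts = property_entity_counts(facts)
```

In this toy data, "actor" occurs in both entities and would lean towards a common property, while "date" and "sequel" each occur in only one.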
2 Common Property and Special Property Entity
Summarization Approach
Our approach consists of four parts: Preprocess, Statistic and Analysis on
Property, Statistic and Analysis on Property Value, and Select Special
Triples and Common Triples. The workflow of our method is shown in Figure 2.
2.1 Preprocess
[Figure 2: workflow with four stages. Preprocess (drop redundant triples, drop useless triples); Statistic and Analysis on Property (distinguish common and special property, calculate degree of commonness and specialness); Statistic and Analysis on Property Value (process the property values of “made”, “actor”, and “rdf:type”); Select Special Triples and Common Triples]
Fig. 2. Workflow
Through observation and analysis of entity facts, we find many redundant and
useless triples that should be removed. If they are not removed, these
triples may be selected, and the quality of the entity summarization would be
very low. Here are some examples of triples that need to be dropped, together
with our removal method.
1. Drop redundant triples. Some triples are expressed differently but have
the same meaning. For example, the facts of Director entities contain
triples like … and …. Obviously, these two triples mean the same thing, so
we remove one of them. Our removal method is as follows: if there exist
triples A and B such that the subject of A is the object of B and the
object of A is the subject of B, then we remove the triple whose object is
the entity.
2. Drop useless triples. a) Drop triples whose object value is null. b) Drop
triples that contain entity id information. c) Drop triples whose predicate
is “page”, “rdf-schema#label”, “owl#sameAs”, etc. These triples have
nothing to do with entity summarization.
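The two dropping rules can be sketched as a single filter. This is an illustration only, not the authors' implementation: the function name `preprocess`, the tuple representation of triples, and matching predicates by the last path segment of their URI are assumptions. Rule 2 b) (entity id information) is left out because the text does not say how such triples are recognized.

```python
# Predicate local names listed as useless in the text.
USELESS_PREDICATES = {"page", "rdf-schema#label", "owl#sameAs"}


def preprocess(entity, triples):
    """Drop redundant and useless triples from the facts of one entity.

    entity  -- identifier of the entity being summarized
    triples -- list of (subject, predicate, object) tuples (assumed layout)
    """
    # All (subject, object) pairs, used to detect inverse triples.
    pairs = {(s, o) for s, _, o in triples}
    kept = []
    for s, p, o in triples:
        # Rule 2 a): the object value is null (or empty).
        if o is None or o == "":
            continue
        # Rule 2 c): compare the last path segment of the predicate.
        if p.rsplit("/", 1)[-1] in USELESS_PREDICATES:
            continue
        # Rule 1: an inverse triple exists; drop the one whose
        # object is the entity itself.
        if o == entity and (o, s) in pairs:
            continue
        kept.append((s, p, o))
    return kept


# Toy facts for a hypothetical director "d1".
facts = [
    ("d1", "made", "f1"),
    ("f1", "director", "d1"),
    ("d1", "http://xmlns.com/foaf/0.1/page", "http://example.org/d1"),
    ("d1", "name", None),
]
cleaned = preprocess("d1", facts)
```

Only the triple ("d1", "made", "f1") survives here: its inverse is dropped by rule 1, and the other two are dropped by rules 2 c) and 2 a).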
2.2 Statistic and Analysis on Property
1. As the special property and common property mentioned above are defined
statistically, we need to count how many entity files each property occurs
in. Next, we regard a property whose occurrence count is greater than x as
a common property and add it to the common property candidate list.
Likewise, we regard a property whose occurrence count is less than y as a
special property and add it to the special property candidate list. Here x
and y are configurable threshold parameters.
2. After generating the two kinds of property candidate lists for each
entity, we need to calculate the degree of commonness and the degree of
specialness of the properties in the candidate lists and rank them from
high to low separately. We denote each property's occurrence count in one
entity file as $n$, its occurrence count in the 10 entity files as $N$, and
its occurrence count in the dump file as $N_{dump}$. We use the following
equations to calculate the degrees:

$$SpecialDegree = \frac{n}{N} + \frac{n}{N_{dump}} \times 1000 \quad (1)$$

$$CommonDegree = \frac{N_{dump}}{N} \quad (2)$$

Equation (1) is used to calculate the special degree. $n/N$ is the special
degree of a special property in the 10 entity files, and $n/N_{dump}$ is the
special degree of a special property in the dump file. As the dump file is
huge, the value of $n/N_{dump}$ is quite small. After analyzing $n/N_{dump}$
for each special property, we find that it varies from 0.0001 to 0.001, so
we multiply it by a balance parameter of 1000 to scale its influence.
Equation (2) is used to calculate the common degree. We do not take $N/n$
into account, where $N/n$ is the common degree of a property in the 10
entity files, because the range of $N/n$ is about 10 whereas the value of
$N_{dump}/n$ varies from 3000 to 20000. We calculate a property's
commonness in the dump file only, because the 30 entities contain too
little information for their commonness to be evident. According to these
degrees, we rank each of the two candidate lists from high to low.
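The thresholding and ranking steps above can be sketched as follows. This is a hedged illustration rather than the authors' code: all function names are hypothetical, and the common degree is read from Equation (2) as printed, with $N$ in the denominator. The surrounding prose also discusses the ratio $N_{dump}/n$, so the choice of denominator here is an assumption.

```python
def candidate_lists(file_counts, x, y):
    """Split properties into common and special candidates.

    file_counts maps a property to the number of entity files it occurs in;
    x and y are the thresholds from step 1.
    """
    common = [p for p, c in file_counts.items() if c > x]
    special = [p for p, c in file_counts.items() if c < y]
    return common, special


def special_degree(n, N, N_dump, balance=1000):
    """Equation (1): n/N plus the rescaled dump-file ratio n/N_dump."""
    return n / N + n / N_dump * balance


def common_degree(N, N_dump):
    """Equation (2) as printed: N_dump / N (denominator is an assumption)."""
    return N_dump / N


def rank_by_degree(degrees):
    """Return property names ordered from highest to lowest degree."""
    return sorted(degrees, key=degrees.get, reverse=True)


# Toy counts for illustration only.
common, special = candidate_lists({"actor": 10, "date": 5, "sequel": 1}, x=8, y=3)
degree = special_degree(n=1, N=2, N_dump=5000)
ranking = rank_by_degree({"sequel": 0.7, "prequel": 1.2})
```

With these toy values, `degree` is 1/2 + (1/5000) × 1000, that is 0.5 + 0.2. The balance factor brings the dump-file term into the same order of magnitude as the entity-file term.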
2.3 Statistic and Analysis on Property Value
After obtaining the special properties and common properties, we also need to
analyze the statistics of property values. A property may have multiple
values within one entity. For example, a director has a property “made”,
which describes that the director made a film. Obviously, one director can
make many films. When we use the property “made” to summarize a director
entity, we need a strategy to determine which film to choose. We propose the
following method:
We extract each film's information from the dump file. In the dump file, most
films have a property “performance”, but its values differ. We assume that
the value of the property “performance” is the film entity's score, and that
a high score implies high importance. We choose the most important film to
summarize the director entity for the property “made”. Similarly, when
summarizing film entities, we may use the “actor” property. We also extract
actor information from the dump file. We assume that the more films an actor
participates in, the more important the actor is. The actor's score is the
number of films he or she participates in.
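The actor scoring can be sketched as a frequency count over the dump file, followed by picking the highest-scoring value. This is a minimal sketch under assumptions: the dump is taken to be an iterable of (film, predicate, actor) tuples, and the names `actor_scores` and `best_value` are hypothetical.

```python
from collections import Counter


def actor_scores(dump_triples):
    """Score each actor by the number of distinct films he or she appears in."""
    # Deduplicate (film, actor) pairs so repeated triples count once.
    film_actor = {(s, o) for s, p, o in dump_triples if p == "actor"}
    return Counter(actor for _, actor in film_actor)


def best_value(values, scores):
    """Pick the property value with the highest score (unknown values score 0)."""
    return max(values, key=lambda v: scores.get(v, 0))


# Toy dump for illustration only.
dump = [
    ("f1", "actor", "a1"),
    ("f2", "actor", "a1"),
    ("f1", "actor", "a2"),
]
scores = actor_scores(dump)
chosen = best_value(["a2", "a1"], scores)
```

The same `best_value` helper would apply to the “made” property if `scores` held the film scores derived from “performance”.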
Actor entities and director entities each have two types. Take an actor
entity as an example: one of its types is “person” and the other is “actor”.
Obviously, “actor” is a subclass of “person”, and the subclass is more
precise when summarizing an entity. In the dump file, more triples describe
a superclass than describe its subclass. Thus, when an entity has two types,
we select the one that occurs fewer times in the dump file.
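The type selection rule amounts to taking the type with the lowest frequency in the dump file. The sketch below assumes that the dump is an iterable of (subject, predicate, object) tuples and that type triples use the predicate string "rdf:type"; the function name `most_specific_type` is hypothetical.

```python
from collections import Counter


def most_specific_type(types, dump_triples, type_predicate="rdf:type"):
    """Return the candidate type that occurs least often in the dump."""
    # Count how many triples assign each type.
    freq = Counter(o for _, p, o in dump_triples if p == type_predicate)
    # A Counter returns 0 for types that never occur in the dump.
    return min(types, key=lambda t: freq[t])


# Toy dump for illustration only.
dump = [
    ("a1", "rdf:type", "person"),
    ("a1", "rdf:type", "actor"),
    ("d1", "rdf:type", "person"),
]
selected = most_specific_type(["person", "actor"], dump)
```

Note that a type absent from the dump has frequency 0 and would be selected, so the candidate types should come from the same data set as the dump.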
2.4 Select Special Triples and Common Triples
We select z triples that contain special properties and T-z triples that
contain common properties to summarize an entity, where z is a configurable
parameter and T is the total number of triples in the summarization result.
As common properties are far more numerous than special properties, and not
every entity file has z special properties (e.g. director entities), z is just a maximum
value. If there are only t (t