
A Common Property and Special Property Entity Summarization Approach Based on Statistical Distribution

Yang Li

Liang Zhao

Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237, China

Building on previous research on entity summarization, we define the concepts of common property and special property based on the statistical distributions of properties, and we use these two kinds of property to summarize entities. A common property is a property that the entities of the same type all have. It is a basic property of those entities, and it helps to recognize a kind of entity. A special property is a property that only a few entities have, and it helps to identify an individual entity among entities of the same kind. In addition, we calculate the importance of property values based on their statistical distributions, so that when an entity has more than one value for the same property, we choose the triple with the more important property value to summarize the entity.

1 Introduction

Thalhammer et al. [1] proposed a method that uses the k-nearest neighbors of an entity and their corresponding properties to summarize the entity, where the k-nearest neighbors are calculated from the similarity between entities. That method filters out the properties that all entities have, which we call common properties in this paper, and focuses on the characteristics of entities. Our method takes both the similarities and the characteristics of the properties of entities into consideration, and we define common property and special property from a statistical point of view. A common property can be used to recognize a kind of entity, and a special property can be used to identify an individual entity among entities of the same kind. These two kinds of property are used to summarize entities. However, the facts of entities (triples) consist of properties and property values, and an entity may have one property with multiple property values. We use the statistical distributions of property values to determine which property value is fit for summarization. The occurrence frequency of a property value reflects its importance, and the more important property value is chosen to summarize the entity. Furthermore, there are redundant and useless triples among the facts of an entity. We first filter out those triples and then apply our method to summarize the entity.

The main ideas of our method:

- There are three kinds of entities in the LinkedMDB-30 track, namely Director, Film, and Actor. The distributions of the number of occurrences of each property under the three classes are shown in Figure 1, and each class contains ten entities. We found that some properties appear in almost all the entities, and we call those properties common properties. Some properties exist in only a handful of entities, and we call those properties special properties. In Figure 1, every property of the class Director appears in all ten entities, which indicates that entities of the class Director have only common properties.
- An entity may have one property with multiple property values. For example, a director may have made more than one film, which yields triples with the same subject and predicate but different objects. The property value should therefore be considered too. Which property value is used to summarize the entity depends on its importance, and the importance is judged using the dump file (the full data set).

[Figure 1: Distributions of the number of occurrences of each property under the classes Director, Film, and Actor.]

2 Common Property and Special Property Entity Summarization Approach

Our approach consists of four parts: Preprocess, Statistic and Analysis on Property, Statistic and Analysis on Property Value, and Select Special Triples and Common Triples. The workflow of our method is shown in Figure 2.

2.1 Preprocess

Through observation and analysis of the entity facts, we find that there are many redundant and useless triples that should be removed; otherwise, these triples may be selected and the quality of the entity summarization would be very low. Below are some examples of triples that need to be dropped, together with our removal method.

1. Drop redundant triples. Some triples are expressed differently but have the same meaning. For example, the facts of Director entities contain triples like <director/A> <made> <film/B> and <film/B> <director> <director/A>. Obviously, these two triples mean the same thing, so we remove one of them. Our removal rule is: if there exist triples A and B where the subject of A is the object of B and the object of A is the subject of B, then we remove the triple whose object is the entity.
2. Drop useless triples.
   a) Drop triples whose object value is null.
   b) Drop triples that contain information about the entity id.
   c) Drop triples whose predicate is "page", "rdf-schema#label", "owl#sameAs", etc. These triples have nothing to do with entity summarization.

[Figure 2: Workflow of the method. Preprocess: drop redundant triples, drop useless triples. Statistic and Analysis on Property: distinguish common and special property, calculate degree of commonness and specialness. Statistic and Analysis on Property Value: process the property values of "made", "actor", and "rdf:type". Select Special Triples and Common Triples.]

2.2 Statistic and Analysis on Property

1. Since the special property and common property mentioned above are defined statistically, we need to count in how many entity files each property occurs. Next, we regard a property whose number of occurrences is more than x as a common property and add it to the common property candidate list. Likewise, we regard a property whose number of occurrences is less than y as a special property and add it to the special property candidate list. Here x and y are configurable threshold parameters.
2. After generating the two property candidate lists for each entity, we calculate the degree of commonness and the degree of specialness of the properties in the candidate lists and rank each list from high to low. We denote the number of occurrences of a property in one entity file as $n$, in the 10 entity files as $N$, and in the dump file as $N_{dump}$. We use the following equations to calculate the degrees:

$$\mathrm{SpecialDegree} = \frac{n}{N} + \frac{n}{N_{dump}} \times 1000 \qquad (1)$$

$$\mathrm{CommonDegree} = N_{dump} \qquad (2)$$

Equation (1) is used to calculate the special degree. $n/N$ is the special degree of a special property in the 10 entity files, and $n/N_{dump}$ is its special degree in the dump file. Because the dump file is huge, the value of $n/N_{dump}$ is quite small: after analyzing $n/N_{dump}$ for each special property, we find that it varies from 0.0001 to 0.001, so we multiply it by a balance parameter of 1000 to scale its influence. Equation (2) is used to calculate the common degree. We do not take the commonness of a property in the 10 entity files into account, because its range is only about 10, whereas the value of $N_{dump}$ varies from 3000 to 20000. We calculate a property's commonness in the dump file only, because the information in the 30 entities is too limited to make commonness evident. According to these degrees, we rank each of the two candidate lists from high to low.

2.3 Statistic and Analysis on Property Value

After obtaining the special properties and common properties, we also need to analyze the property values. One property may have multiple property values in a single entity. For example, a director has the property "made", which states that the director made a film, and one director can obviously make many films. When we use the property "made" to summarize a director entity, we need a strategy to determine which film to choose. We propose the following method:

We extract the information of each film from the dump file. In the dump file, most films have a property "performance", but the values differ. We assume that the value of the property "performance" is the film entity's score, and that a high score implies high importance. For the property "made", we choose the most important film to summarize the director entity. Similarly, when summarizing film entities, we may use the "actor" property, so we also extract actor information from the dump file. We assume that the more films an actor participates in, the more important the actor is. The actor's score is the number of films he or she participates in.
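The two scoring rules above can be illustrated with a minimal sketch. It assumes that the dump file has already been parsed into (subject, predicate, object) tuples with shortened predicate names and that "performance" values can be read as numbers. The function names and the toy data are hypothetical and are not taken from our implementation.

```python
def film_scores(dump_triples):
    """Map each film to its "performance" value, read as a numeric score (assumption)."""
    scores = {}
    for subj, pred, obj in dump_triples:
        if pred == "performance":
            scores[subj] = float(obj)
    return scores


def actor_scores(dump_triples):
    """Score each actor by the number of distinct films that list them under "actor"."""
    films_per_actor = {}
    for film, pred, actor in dump_triples:
        if pred == "actor":
            films_per_actor.setdefault(actor, set()).add(film)
    return {actor: len(films) for actor, films in films_per_actor.items()}


def best_value(values, scores):
    """Return the candidate value with the highest score; unknown values score 0."""
    return max(values, key=lambda v: scores.get(v, 0))


# Toy dump data (hypothetical identifiers and values).
dump = [
    ("film/A", "performance", "7.5"),
    ("film/B", "performance", "9.0"),
    ("film/A", "actor", "actor/X"),
    ("film/B", "actor", "actor/X"),
    ("film/B", "actor", "actor/Y"),
]

chosen_film = best_value(["film/A", "film/B"], film_scores(dump))
chosen_actor = best_value(["actor/X", "actor/Y"], actor_scores(dump))
```

In this toy data, film/B has the higher "performance" value and actor/X appears in more films, so these are the values that would be kept for "made" and "actor" respectively.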

Actor entities and director entities both have two types. Taking an actor entity as an example, one of its types is "person" and the other is "actor". Obviously, "actor" is a subclass of "person", and the subclass is more precise when summarizing the entity. In the dump file, the number of triples that describe the superclass is larger than the number that describe the subclass. Thus, when an entity has two types, we select the one that appears less often in the dump file.

2.4 Select Special Triples and Common Triples

We select z triples that contain special properties and T-z triples that contain common properties to summarize an entity, where z is a configurable parameter and T is the total number of triples in the summary. Because common properties are far more numerous than special properties and not every entity file has z special properties (e.g., director entities), z is only a maximum. If there are only t (t < z) special triples, then T-t common triples are selected. The special properties and common properties are taken from the two ranked lists generated in Section 2.2. When processing a property with multiple values (i.e., actors or films), we select candidates from high to low according to the scores described in Section 2.3.
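The selection step can be sketched as follows. This is only an illustration under stated assumptions: each candidate is a (triple, degree) pair, triples are represented by placeholder strings, the special degree follows our reading of Equation (1), and the common degrees are taken as given dump-file counts. All function names and numbers are hypothetical.

```python
def special_degree(n, N, n_dump, balance=1000):
    """Special degree as read from Equation (1): n/N + (n/N_dump) * balance."""
    return n / N + (n / n_dump) * balance


def rank(candidates):
    """Sort (triple, degree) pairs from high to low degree and keep only the triples."""
    return [triple for triple, _ in sorted(candidates, key=lambda c: c[1], reverse=True)]


def select_summary(special, common, z, T):
    """Take at most z special triples, then fill the summary up to T with common ones."""
    special_part = rank(special)[:z]
    # If only t < z special triples exist, T - t common triples are taken.
    common_part = rank(common)[:T - len(special_part)]
    return special_part + common_part


# Toy candidates (hypothetical triples and counts).
special = [
    ("s1", special_degree(n=1, N=2, n_dump=5000)),
    ("s2", special_degree(n=1, N=1, n_dump=2000)),
]
common = [("c1", 3000), ("c2", 20000), ("c3", 8000)]

summary_few_special = select_summary(special, common, z=3, T=4)
summary_capped = select_summary(special, common, z=1, T=3)
```

With z=3 and T=4 only two special triples are available, so two common triples fill the remaining slots. With z=1 and T=3 the special part is capped at one triple.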

1. Thalhammer, A., Toma, I., Roa-Valverde, A., Fensel, D.: Leveraging usage data for linked data movie entity summarization. Computer Science - Artificial Intelligence (2012)