A Common Property and Special Property Entity Summarization Approach Based on Statistical Distribution*

Yang Li and Liang Zhao

Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
marine1ly@163.com, 252007913@qq.com

Abstract. Building on previous research on entity summarization, we define the concepts of common property and special property based on the statistical distributions of properties, and we use these two kinds of property to summarize entities. A common property is a property that all entities of the same type have. It is a basic property of those entities, and it helps to recognize a kind of entity. A special property is a property that only a few entities have, and it helps to identify an individual entity among entities of the same kind. In addition, we calculate the importance of property values based on their statistical distributions, so that when an entity has more than one value for the same property, we choose the triple with the more important property value to summarize the entity.

1 Introduction

Thalhammer et al. [1] proposed a method that uses the k-nearest neighbors of an entity and their corresponding properties to summarize the entity, where the k-nearest neighbors are determined by the similarity between entities. That method filters out the properties that all entities have, which we call common properties in this paper, and focuses on the characteristics of entities. Our method takes both the similarities and the characteristics of entity properties into consideration, and we define common property and special property from a statistical point of view. A common property can be used to recognize a kind of entity, and a special property can be used to identify an individual entity among entities of the same kind. These two kinds of property are used to summarize entities.
However, the facts of entities (triples) consist of properties and property values, and an entity may have one property with multiple property values. We use the statistical distributions of property values to determine which property value is best suited for summarization. The occurrence frequency of a property value reflects its importance, and the more important property value is chosen to summarize the entity. Furthermore, there are redundant and useless triples among the facts of an entity. We first filter out those triples and then apply our method to summarize the entity.

* This work was partially supported by the 863 plan of China Ministry of Science and Technology (project No: 2015AA020107), and Software and Integrated Circuit Industry Development Special Funds of Shanghai Economic and Information Commission (project No: 140304).

Fig. 1. The distributions of the occurrence number of each property (series: Director, Film, Actor)

The main ideas of our method:

– There are three kinds of entities in the LinkedMDB-30 Track, namely Director, Film, and Actor. The distributions of the occurrence number of each property under the three classes are shown in Figure 1, and each class contains ten entities. We found that some properties appear in almost all the entities, and we call those properties common properties. Some properties exist in only a handful of entities, and we call those properties special properties. In Figure 1, all properties of the class Director appear in all ten entities, which indicates that entities of the class Director have only common properties.
– Since an entity may have one property with multiple property values, for example, a director may have made more than one film (namely, triples with the same subject and predicate but different objects), the property value should be considered too.
Which property value should be taken to summarize the entity depends on its importance, and the importance is judged from the dump file (the full data set).

2 Common Property and Special Property Entity Summarization Approach

The process of our approach consists of four parts: Preprocess, Statistic and Analysis on Property, Statistic and Analysis on Property Value, and Select Special Triples and Common Triples. The workflow of our method is shown in Figure 2.

Fig. 2. Workflow: Preprocess (drop redundant triples, drop useless triples); Statistic and Analysis on Property (calculate degree of commonness and specialness, distinguish common and special property); Statistic and Analysis on Property Value (process the property value of “made”, “actor”, and “rdf:type”); Select Special Triples and Common Triples

2.1 Preprocess

Through observation and analysis of entity facts, we find many redundant and useless triples that should be removed; otherwise, these triples may be selected and the quality of the entity summarization would be very low. Here are some examples of triples that need to be dropped, together with our removal method.

1. Drop redundant triples. Some triples are expressed differently but have the same meaning. For example, in the facts of Director entities there exist triples like and . Obviously, these two triples mean the same thing, so we remove one of them. Our removal method is: if there exist triples A and B, where the subject of A is the object of B and the object of A is the subject of B, then we remove the triple whose object is the entity.
2. Drop useless triples.
   a) Drop triples whose object value is null.
   b) Drop triples that contain information about the entity id.
   c) Drop triples whose predicate is “page”, “rdf-schema#label”, “owl#sameAs”, etc. These triples have nothing to do with entity summarization.

2.2 Statistic and Analysis on Property

1.
As the special property and common property mentioned above are defined statistically, we need to count how many entity files each property appears in. Next, we regard a property whose occurrence count is more than x as a common property and add it to the common property candidate list. Likewise, we regard a property whose occurrence count is less than y as a special property and add it to the special property candidate list. Here x and y are configurable threshold parameters.

2. After generating the two candidate lists for each entity, we calculate the degree of commonness and the degree of specialness of the properties in the candidate lists and rank each list from high to low. We record each property's number of occurrences in an entity file as $n$, in the 10 entity files as $N$, and in the dump file as $N_{dump}$. We use the following equations to calculate the degrees:

$$\mathrm{SpecialDegree} = \frac{n}{N} + \frac{n}{N_{dump}} \times 1000 \qquad (1)$$

$$\mathrm{CommonDegree} = \frac{N_{dump}}{n} \qquad (2)$$

Equation (1) is used to calculate the special degree. $\frac{n}{N}$ is the special degree of a special property in the 10 entity files, and $\frac{n}{N_{dump}}$ is the special degree of a special property in the dump file. As the dump file is huge, the value of $\frac{n}{N_{dump}}$ is quite small; after analyzing $\frac{n}{N_{dump}}$ for each special property, we find that it varies from 0.0001 to 0.001, so we multiply it by a balance parameter of 1000 to scale its influence.

Equation (2) is used to calculate the common degree. We do not take $\frac{N}{n}$ into account, where $\frac{N}{n}$ is the common degree of a property in the 10 entity files, because $\frac{N}{n}$ is only about 10 whereas $\frac{N_{dump}}{n}$ varies from 3000 to 20000. We calculate a property's commonness in the dump file only, because the 30 entities provide too little information to make commonness apparent. According to these degrees, we rank each of the two candidate lists from high to low.
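The two steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the in-memory layout of the entity files (a list of triple lists), and the reading of Equation (2) as N_dump / n are assumptions made for illustration.

```python
from collections import Counter


def candidate_lists(entity_files, x, y):
    """Split properties into common and special candidates.

    entity_files: one list of (subject, predicate, object) triples per
    entity file (a hypothetical layout, not taken from the paper).
    A property is a common candidate if it occurs in more than x files
    and a special candidate if it occurs in fewer than y files.
    """
    files_with = Counter()
    for triples in entity_files:
        # Count each property once per file, however often it repeats.
        for prop in {p for _, p, _ in triples}:
            files_with[prop] += 1
    common = sorted(p for p, c in files_with.items() if c > x)
    special = sorted(p for p, c in files_with.items() if c < y)
    return common, special


def special_degree(n, N, N_dump, balance=1000):
    """Equation (1): n/N + (n/N_dump) * balance."""
    return n / N + n / N_dump * balance


def common_degree(n, N_dump):
    """Equation (2), read here as N_dump / n (an assumed reading)."""
    return N_dump / n


# Toy data: "name" occurs in all three files, "award" in only one.
files = [
    [("d1", "name", "A"), ("d1", "award", "X")],
    [("d2", "name", "B")],
    [("d3", "name", "C")],
]
common, special = candidate_lists(files, x=2, y=2)
```

With the toy data, `name` becomes a common candidate and `award` a special candidate; the candidates in each list would then be ranked by `common_degree` and `special_degree` respectively.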
2.3 Statistic and Analysis on Property Value

After obtaining the special properties and common properties, the property values also need to be counted and analyzed. One property may have multiple property values in one entity. For example, a director has a property “made”, which describes that the director made a film. Obviously, one director could make many films. When we use the property “made” to summarize a director entity, a strategy is required to determine which film to choose. We propose the following method. Extract each film's information from the dump file. In the dump file, most films have a property “performance”, but the values differ. We assume that the value of the property “performance” is the film entity's score, and that a high score implies high importance. We choose the most important film to summarize the director entity for the property “made”.

Similarly, when summarizing film entities, we may use the “actor” property. We also extract actor information from the dump file. We assume that the more films an actor participates in, the more important the actor is. The actor's score is the number of films he or she participates in.

Actor entities and director entities both have two types. Take an actor entity as an example: one of its types is “person” and the other is “actor”. Obviously, “actor” is a subclass of “person”, and the subclass is more precise when summarizing the entity. In the dump file, the number of triples that describe a superclass is larger than the number that describe a subclass. Thus, when an entity has two types, we select the one that appears fewer times in the dump file.

2.4 Select Special Triples and Common Triples

We select z triples that contain special properties and T - z triples that contain common properties to summarize an entity, where z is a configurable parameter and T is the total number of triples in the summarization result.
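The selection rule just stated can be sketched in a few lines of Python. This is an illustrative sketch only: the function name and the pre-ranked input lists are hypothetical, and filling the remaining slots with common triples when fewer than z special triples are available is an assumption of the sketch rather than something stated up to this point.

```python
def select_triples(special_ranked, common_ranked, z, T):
    """Pick up to z special triples, then fill up to T with common ones.

    special_ranked / common_ranked: triples already sorted from the
    highest to the lowest degree (hypothetical inputs).
    """
    chosen_special = special_ranked[:z]
    # Assumption: any shortfall in special triples is made up with
    # additional common triples so that the summary still has T entries.
    remaining = max(T - len(chosen_special), 0)
    return chosen_special + common_ranked[:remaining]


summary = select_triples(["s1", "s2", "s3"], ["c1", "c2", "c3", "c4"], z=2, T=5)
no_special = select_triples([], ["c1", "c2", "c3"], z=2, T=2)
```

In the first call the summary holds two special and three common triples; in the second, where no special triple exists, it consists of common triples only.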
As common properties are far more numerous than special properties, and not all entity files have z special properties (e.g. director entities), z is only a maximum value. If there are only t (t