CD at ENSEC 2016: Generating Characteristic and Diverse Entity Summaries

Danyun Xu, Liang Zheng, and Yuzhong Qu

National Key Laboratory for Novel Software Technology, Nanjing University, China
{dyxu,zhengliang}@smail.nju.edu.cn, yzqu@nju.edu.cn

Abstract. We introduce our entity summarization approach called CD, which aims to select characteristic and diverse features for an entity summary. The characterizing ability of a feature is measured according to information theory. The information overlap between features considers ontological semantics of classes and properties, as well as string and numerical similarity. Finally, selecting characteristic and diverse features is formulated as a binary quadratic knapsack problem.

Keywords: Entity summarization, self-information, reasoning, string similarity, numerical similarity.

1 Introduction

Our entity summarization approach, called CD, is adapted from [3]. The basic idea is, given an entity description composed of a set of property-value pairs called features, to select a size-limited subset of characteristic and diverse features as an entity summary. We formulate this selection as a binary quadratic knapsack problem (QKP). Specifically, the characterizing ability of a feature is measured according to information theory, and the information overlap between features considers ontological semantics of classes and properties, as well as string and numerical similarity.

2 Preliminaries

Let E, C, P, and L be the sets of all entities, classes, properties, and literals in a dataset, respectively. The description of an entity e is a set of property-value pairs called features, denoted by d(e) ⊆ P × (E ∪ C ∪ L). In RDF data, d(e) is obtained from the RDF triples in which e is the subject or the object. When e is the subject of a triple t, the predicate (a property) and the object (an entity, a class, or a literal) of t comprise a feature. When e is the object of a triple t, the inverse of the predicate and the subject of t comprise a feature. The inverse of a property p is a property automatically created by our approach and is distinguished from p, though they share a common name; if a property p_i is a subproperty of a property p_j, we also define the inverse of p_i as a subproperty of the inverse of p_j. Given an integer k, an entity summary S of e is a subset of d(e) subject to |S| ≤ k.

3 Approach

3.1 Characterizing Ability of a Feature

The characterizing ability of a feature f, denoted by ch(f), is measured according to information theory. Specifically, we compute the normalized amount of self-information contained in the probabilistic event of observing f in an entity description in the dataset. A feature has high characterizing ability if it belongs to a small number of entity descriptions:

    ch(f) = \frac{-\log \frac{|\{e \in E : f \in d(e)\}|}{|E|}}{\log |E|} ,    (1)

which is in the range of [0, 1].

3.2 Information Overlap between Features

The information overlap between two features f_i and f_j, denoted by ovlp(f_i, f_j), considers ontological semantics of classes and properties, as well as string and numerical similarity. For a feature f, let prop(f) and val(f) return the property and the value of f, respectively.

Firstly, we exploit ontological semantics of classes and properties. If both prop(f_i) and prop(f_j) are rdf:type and val(f_i) is a subclass of val(f_j) (or vice versa), we define ovlp(f_i, f_j) = 1, because one of the two features can be inferred from the other and thus they share maximal overlapping information. Similarly, we also define ovlp(f_i, f_j) = 1 if val(f_i) = val(f_j) and prop(f_i) is a subproperty of prop(f_j) (or vice versa). In all other cases, we calculate the string similarity between the property names (isub) and the similarity between the property values (sim):

    ovlp(f_i, f_j) = \max\{isub(prop(f_i), prop(f_j)),\ sim(val(f_i), val(f_j)),\ 0\} ,    (2)

which is in the range of [0, 1]. Here, isub ∈ [−1, 1] returns the ISub string similarity [2] between two property names, and sim ∈ [−1, 1] returns the similarity between two property values.

To measure sim(val(f_i), val(f_j)), if both val(f_i) and val(f_j) are numerical data values, we calculate their similarity as follows.

1. If val(f_i) = val(f_j), then sim(val(f_i), val(f_j)) = 1;
2. otherwise, if val(f_i) · val(f_j) ≤ 0, then sim(val(f_i), val(f_j)) = −1;
3. otherwise, sim(val(f_i), val(f_j)) = \frac{\min\{|val(f_i)|, |val(f_j)|\}}{\max\{|val(f_i)|, |val(f_j)|\}} .

In all other cases, we treat val(f_i) and val(f_j) as strings; that is, for entities and classes we take their names, and for literals we take their string forms. We then calculate their ISub string similarity as sim.

3.3 Selecting Characteristic and Diverse Features

We aim to select up to k features from d(e) that maximize their total characterizing ability and minimize the total information overlap between them. To this end, we define the quality of an entity summary S as

    q(S) = \gamma \cdot \sum_{f \in S} ch(f) + \delta \cdot \sum_{f_i, f_j \in S,\, i < j} \bigl(-ovlp(f_i, f_j)\bigr) ,    (3)

in which γ, δ > 0 are tunable weights of the two objectives, used to achieve different trade-offs.

Maximizing q can be reformulated as an instance of QKP [1] as follows. We number the features in d(e) from f_1 to f_{|d(e)|}. By introducing binary variables x_i for i = 1, ..., |d(e)| that indicate whether f_i is selected into the optimal summary, the problem is formulated as

    maximize    \sum_{i=1}^{|d(e)|} \sum_{j=i}^{|d(e)|} p_{ij} x_i x_j
    subject to  \sum_{i=1}^{|d(e)|} x_i \le k ,                                  (4)
                x_i \in \{0, 1\}  for i = 1, ..., |d(e)| ,

in which p_{ij} is the "profit" achieved if both f_i and f_j are selected:

    p_{ij} = \begin{cases} \gamma \cdot ch(f_i) & \text{if } i = j ,\\ \delta \cdot (-ovlp(f_i, f_j)) & \text{otherwise} . \end{cases}    (5)

We solve the resulting QKP using a state-of-the-art heuristic algorithm [4].
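To make Eq. (1) concrete, the following is a minimal Python sketch (not part of the paper) of the characterizing ability measure in Sect. 3.1. The representation of features as (property, value) tuples and the descriptions dictionary mapping entities to their feature sets are assumptions made only for illustration.

import math

def characterizing_ability(feature, descriptions):
    """Eq. (1): normalized self-information of observing `feature`.
    `descriptions` maps each entity to its set of (property, value) features;
    the feature is assumed to occur in at least one description."""
    num_entities = len(descriptions)
    # number of entity descriptions that contain the feature
    freq = sum(1 for feats in descriptions.values() if feature in feats)
    # -log(freq / |E|), normalized by log |E| so the result lies in [0, 1]
    return -math.log(freq / num_entities) / math.log(num_entities)

# toy example: a feature shared by fewer entities scores higher
descriptions = {
    "e1": {("rdf:type", "City"), ("mayor", "Alice")},
    "e2": {("rdf:type", "City"), ("country", "X")},
    "e3": {("rdf:type", "City"), ("mayor", "Alice")},
    "e4": {("rdf:type", "Person")},
}
print(characterizing_ability(("rdf:type", "City"), descriptions))    # common -> ~0.21
print(characterizing_ability(("rdf:type", "Person"), descriptions))  # rare   -> 1.0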
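Similarly, the overlap measure of Sect. 3.2, Eq. (2), might be sketched as follows. The subclass and subproperty checks are abstracted behind caller-supplied predicate functions, and difflib's sequence ratio is used only as a stand-in for the ISub string similarity [2] (ISub ranges over [−1, 1], whereas this stand-in ranges over [0, 1]).

from difflib import SequenceMatcher

def string_sim(a, b):
    # stand-in for the ISub string similarity [2]
    return SequenceMatcher(None, str(a), str(b)).ratio()

def numeric_sim(x, y):
    # numerical value similarity from Sect. 3.2
    if x == y:
        return 1.0
    if x * y <= 0:                        # different signs, or one value is zero
        return -1.0
    return min(abs(x), abs(y)) / max(abs(x), abs(y))

def value_sim(v1, v2):
    if isinstance(v1, (int, float)) and isinstance(v2, (int, float)):
        return numeric_sim(v1, v2)
    return string_sim(v1, v2)             # names / string forms otherwise

def overlap(f1, f2, is_subclass, is_subproperty):
    """Eq. (2); f1, f2 are (property, value) pairs, and is_subclass /
    is_subproperty are caller-supplied reasoning predicates (assumptions)."""
    p1, v1 = f1
    p2, v2 = f2
    # maximal overlap when one feature can be inferred from the other
    if p1 == p2 == "rdf:type" and (is_subclass(v1, v2) or is_subclass(v2, v1)):
        return 1.0
    if v1 == v2 and (is_subproperty(p1, p2) or is_subproperty(p2, p1)):
        return 1.0
    return max(string_sim(p1, p2), value_sim(v1, v2), 0.0)

# toy usage with trivial reasoning predicates
no = lambda a, b: False
print(overlap(("population", 8000000), ("populationTotal", 8500000), no, no))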
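Finally, the QKP of Eqs. (3)-(5) can be assembled as sketched below. Building the profit matrix follows Eq. (5); the greedy selection loop is only a naive stand-in added for illustration and is not the GRASP and tabu search heuristic of [4] that the approach actually uses.

def build_profit_matrix(features, ch, ovlp, gamma=1.0, delta=1.0):
    """Eq. (5): p[i][i] = gamma*ch(f_i), p[i][j] = delta*(-ovlp(f_i, f_j)) for i < j.
    `ch` scores one feature and `ovlp` scores a pair (e.g., the functions
    sketched above with their extra arguments bound)."""
    n = len(features)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        p[i][i] = gamma * ch(features[i])
        for j in range(i + 1, n):
            p[i][j] = delta * (-ovlp(features[i], features[j]))
    return p

def summary_quality(selected, p):
    # objective of Eq. (4): sum of p[i][j] over selected indices with i <= j
    s = sorted(selected)
    return sum(p[i][j] for ii, i in enumerate(s) for j in s[ii:])

def greedy_select(features, p, k):
    # naive greedy stand-in (NOT the GRASP/tabu heuristic of [4]):
    # repeatedly add the feature that most improves the objective
    selected = []
    while len(selected) < k:
        best, best_gain = None, 0.0
        for i in range(len(features)):
            if i in selected:
                continue
            gain = summary_quality(selected + [i], p) - summary_quality(selected, p)
            if best is None or gain > best_gain:
                best, best_gain = i, gain
        if best is None or best_gain <= 0:
            break
        selected.append(best)
    return [features[i] for i in selected]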
Acknowledgments. This work was supported in part by the NSFC under Grant 61572247 and in part by the Fundamental Research Funds for the Central Universities.

References

1. Pisinger, D.: The Quadratic Knapsack Problem - A Survey. Discrete Appl. Math. 155(5), 623-648 (2007)
2. Stoilos, G., Stamou, G., Kollias, S.: A String Metric for Ontology Alignment. In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) ISWC 2005. LNCS, vol. 3729, pp. 624-637. Springer, Berlin Heidelberg (2005)
3. Xu, D., Cheng, G., Qu, Y.: Facilitating Human Intervention in Coreference Resolution with Comparative Entity Summaries. In: Presutti, V., d'Amato, C., Gandon, F., d'Aquin, M., Staab, S., Tordai, A. (eds.) ESWC 2014. LNCS, vol. 8465, pp. 535-549. Springer International Publishing, Switzerland (2014)
4. Yang, Z., Wang, G., Chu, F.: An Effective GRASP and Tabu Search for the 0-1 Quadratic Knapsack Problem. Comput. Oper. Res. 40(5), 1176-1185 (2013)