CD at ENSEC 2016: Generating Characteristic and Diverse Entity Summaries

Danyun Xu, Liang Zheng, and Yuzhong Qu

National Key Laboratory for Novel Software Technology, Nanjing University, China
{dyxu,zhengliang}@smail.nju.edu.cn, yzqu@nju.edu.cn

Abstract. We introduce our entity summarization approach called CD, which aims to select characteristic and diverse features for an entity summary. The characterizing ability of a feature is measured according to information theory. The information overlap between features considers ontological semantics of classes and properties, as well as string and numerical similarity. Finally, selecting characteristic and diverse features is formulated as a binary quadratic knapsack problem.

Keywords: Entity summarization, self-information, reasoning, string similarity, numerical similarity.

1 Introduction

Our entity summarization approach, called CD, is adapted from [3]. The basic idea is, given an entity description composed of a set of property-value pairs called features, to select a size-limited subset of characteristic and diverse features as an entity summary. We formulate this selection as a binary quadratic knapsack problem (QKP). Specifically, the characterizing ability of a feature is measured according to information theory, and the information overlap between features considers ontological semantics of classes and properties, as well as string and numerical similarity.

2 Preliminaries

Let E, C, P, and L be the sets of all entities, classes, properties, and literals in a dataset, respectively. The description of an entity e is a set of property-value pairs called features, denoted by d(e) ⊆ P × (E ∪ C ∪ L). In RDF data, d(e) is obtained from the RDF triples in which e is the subject or the object. When e is the subject of a triple t, the predicate (a property) and the object (an entity, a class, or a literal) of t comprise a feature. When e is the object of a triple t, the inverse of the predicate and the subject of t comprise a feature. The inverse of a property p is a property automatically created by our approach and is distinguished from p, though they share a common name; if a property p_i is a subproperty of a property p_j, we also define the inverse of p_i as a subproperty of the inverse of p_j. Given an integer k, an entity summary S of e is a subset of d(e) subject to |S| ≤ k.

3 Approach

3.1 Characterizing Ability of a Feature

The characterizing ability of a feature f, denoted by ch(f), is measured according to information theory. Specifically, we compute the normalized amount of self-information contained in the probabilistic event of observing f in an entity description in the dataset. A feature has high characterizing ability if it belongs to a small number of entity descriptions:

    ch(f) = \frac{-\log \frac{|\{e \in E : f \in d(e)\}|}{|E|}}{\log |E|} ,    (1)

which is in the range of [0, 1].

3.2 Information Overlap between Features

The information overlap between two features f_i and f_j, denoted by ovlp(f_i, f_j), considers ontological semantics of classes and properties, as well as string and numerical similarity. For a feature f, let prop(f) and val(f) return the property and the value of f, respectively.

Firstly, we exploit ontological semantics of classes and properties. If both prop(f_i) and prop(f_j) are rdf:type and val(f_i) is a subclass of val(f_j) (or vice versa), we define ovlp(f_i, f_j) = 1, because one of the two features can be inferred from the other and thus they share maximal overlapping information. Similarly, we also define ovlp(f_i, f_j) = 1 if val(f_i) = val(f_j) and prop(f_i) is a subproperty of prop(f_j) (or vice versa). In all other cases, we calculate the string similarity between the property names (isub) and the similarity between the property values (sim):

    ovlp(f_i, f_j) = \max\{isub(prop(f_i), prop(f_j)),\ sim(val(f_i), val(f_j)),\ 0\} ,    (2)

which is in the range of [0, 1]. Here, isub ∈ [−1, 1] returns the ISub string similarity [2] between two property names, and sim ∈ [−1, 1] returns the similarity between two property values.

To measure sim(val(f_i), val(f_j)), if both val(f_i) and val(f_j) are numerical data values, we calculate their similarity as follows.

1. If val(f_i) = val(f_j), then sim(val(f_i), val(f_j)) = 1;
2. otherwise, if val(f_i) · val(f_j) ≤ 0, then sim(val(f_i), val(f_j)) = −1;
3. otherwise, sim(val(f_i), val(f_j)) = \frac{\min\{|val(f_i)|, |val(f_j)|\}}{\max\{|val(f_i)|, |val(f_j)|\}} .

In all other cases, we treat val(f_i) and val(f_j) as strings; that is, for entities and classes we take their names, and for literals we take their string forms. We then calculate their ISub string similarity as sim.

3.3 Selecting Characteristic and Diverse Features

We aim to select up to k features from d(e) that maximize their total characterizing ability and minimize the total information overlap between them. To this end, we define the quality of an entity summary S as

    q(S) = \gamma \cdot \sum_{f \in S} ch(f) + \delta \cdot \sum_{f_i, f_j \in S,\, i < j} \bigl(-ovlp(f_i, f_j)\bigr) ,    (3)

in which γ, δ > 0 are tunable weights of the two objectives, used to achieve different trade-offs.

Maximizing q can be reformulated as an instance of QKP [1] as follows. We number the features in d(e) from f_1 to f_{|d(e)|}. By introducing binary variables x_i for i = 1, ..., |d(e)| that indicate whether f_i is selected into the optimal summary, the problem is formulated as

    maximize    \sum_{i=1}^{|d(e)|} \sum_{j=i}^{|d(e)|} p_{ij} x_i x_j
    subject to  \sum_{i=1}^{|d(e)|} x_i \le k ,                                  (4)
                x_i \in \{0, 1\}  for i = 1, ..., |d(e)| ,

in which p_{ij} is the "profit" achieved if both f_i and f_j are selected:

    p_{ij} = \begin{cases} \gamma \cdot ch(f_i) & \text{if } i = j ,\\ \delta \cdot (-ovlp(f_i, f_j)) & \text{otherwise} . \end{cases}    (5)

We solve the resulting QKP using a state-of-the-art heuristic algorithm [4].
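To make Eq. (1) concrete, the following is a minimal Python sketch (not part of the paper) of the characterizing ability measure in Sect. 3.1. The representation of features as (property, value) tuples and the descriptions dictionary mapping entities to their feature sets are assumptions made only for illustration.

import math

def characterizing_ability(feature, descriptions):
    """Eq. (1): normalized self-information of observing `feature`.
    `descriptions` maps each entity to its set of (property, value) features;
    the feature is assumed to occur in at least one description."""
    num_entities = len(descriptions)
    # number of entity descriptions that contain the feature
    freq = sum(1 for feats in descriptions.values() if feature in feats)
    # -log(freq / |E|), normalized by log |E| so the result lies in [0, 1]
    return -math.log(freq / num_entities) / math.log(num_entities)

# toy example: a feature shared by fewer entities scores higher
descriptions = {
    "e1": {("rdf:type", "City"), ("mayor", "Alice")},
    "e2": {("rdf:type", "City"), ("country", "X")},
    "e3": {("rdf:type", "City"), ("mayor", "Alice")},
    "e4": {("rdf:type", "Person")},
}
print(characterizing_ability(("rdf:type", "City"), descriptions))    # common -> ~0.21
print(characterizing_ability(("rdf:type", "Person"), descriptions))  # rare   -> 1.0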
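Similarly, the overlap measure of Sect. 3.2, Eq. (2), might be sketched as follows. The subclass and subproperty checks are abstracted behind caller-supplied predicate functions, and difflib's sequence ratio is used only as a stand-in for the ISub string similarity [2] (ISub ranges over [−1, 1], whereas this stand-in ranges over [0, 1]).

from difflib import SequenceMatcher

def string_sim(a, b):
    # stand-in for the ISub string similarity [2]
    return SequenceMatcher(None, str(a), str(b)).ratio()

def numeric_sim(x, y):
    # numerical value similarity from Sect. 3.2
    if x == y:
        return 1.0
    if x * y <= 0:                        # different signs, or one value is zero
        return -1.0
    return min(abs(x), abs(y)) / max(abs(x), abs(y))

def value_sim(v1, v2):
    if isinstance(v1, (int, float)) and isinstance(v2, (int, float)):
        return numeric_sim(v1, v2)
    return string_sim(v1, v2)             # names / string forms otherwise

def overlap(f1, f2, is_subclass, is_subproperty):
    """Eq. (2); f1, f2 are (property, value) pairs, and is_subclass /
    is_subproperty are caller-supplied reasoning predicates (assumptions)."""
    p1, v1 = f1
    p2, v2 = f2
    # maximal overlap when one feature can be inferred from the other
    if p1 == p2 == "rdf:type" and (is_subclass(v1, v2) or is_subclass(v2, v1)):
        return 1.0
    if v1 == v2 and (is_subproperty(p1, p2) or is_subproperty(p2, p1)):
        return 1.0
    return max(string_sim(p1, p2), value_sim(v1, v2), 0.0)

# toy usage with trivial reasoning predicates
no = lambda a, b: False
print(overlap(("population", 8000000), ("populationTotal", 8500000), no, no))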
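Finally, the QKP of Eqs. (3)-(5) can be assembled as sketched below. Building the profit matrix follows Eq. (5); the greedy selection loop is only a naive stand-in added for illustration and is not the GRASP and tabu search heuristic of [4] that the approach actually uses.

def build_profit_matrix(features, ch, ovlp, gamma=1.0, delta=1.0):
    """Eq. (5): p[i][i] = gamma*ch(f_i), p[i][j] = delta*(-ovlp(f_i, f_j)) for i < j.
    `ch` scores one feature and `ovlp` scores a pair (e.g., the functions
    sketched above with their extra arguments bound)."""
    n = len(features)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        p[i][i] = gamma * ch(features[i])
        for j in range(i + 1, n):
            p[i][j] = delta * (-ovlp(features[i], features[j]))
    return p

def summary_quality(selected, p):
    # objective of Eq. (4): sum of p[i][j] over selected indices with i <= j
    s = sorted(selected)
    return sum(p[i][j] for ii, i in enumerate(s) for j in s[ii:])

def greedy_select(features, p, k):
    # naive greedy stand-in (NOT the GRASP/tabu heuristic of [4]):
    # repeatedly add the feature that most improves the objective
    selected = []
    while len(selected) < k:
        best, best_gain = None, 0.0
        for i in range(len(features)):
            if i in selected:
                continue
            gain = summary_quality(selected + [i], p) - summary_quality(selected, p)
            if best is None or gain > best_gain:
                best, best_gain = i, gain
        if best is None or best_gain <= 0:
            break
        selected.append(best)
    return [features[i] for i in selected]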
Acknowledgments. This work was supported in part by the NSFC under Grant 61572247 and in part by the Fundamental Research Funds for the Central Universities.

References

1. Pisinger, D.: The Quadratic Knapsack Problem - A Survey. Discrete Appl. Math. 155(5), 623-648 (2007)
2. Stoilos, G., Stamou, G., Kollias, S.: A String Metric for Ontology Alignment. In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) ISWC 2005. LNCS, vol. 3729, pp. 624-637. Springer, Berlin Heidelberg (2005)
3. Xu, D., Cheng, G., Qu, Y.: Facilitating Human Intervention in Coreference Resolution with Comparative Entity Summaries. In: Presutti, V., d'Amato, C., Gandon, F., d'Aquin, M., Staab, S., Tordai, A. (eds.) ESWC 2014. LNCS, vol. 8465, pp. 535-549. Springer International Publishing, Switzerland (2014)
4. Yang, Z., Wang, G., Chu, F.: An Effective GRASP and Tabu Search for the 0-1 Quadratic Knapsack Problem. Comput. Oper. Res. 40(5), 1176-1185 (2013)