ESSTER at the EYRE 2020 Entity Summarization Task

Qingxia Liu

qxliu2013@smail.nju.edu.cn

Gong Cheng

gcheng@nju.edu.cn

Yuzhong Qu

Galway, Ireland

State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China

Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Entity summaries provide human users with the key information about an entity. In this system paper, we present the implementation of our entity summarizer ESSTER. It aims at generating entity summaries that contain structurally important triples and exhibit high readability and low redundancy. For structural importance, we exploit the global and local characteristics of properties and values in RDF data. For readability, we learn the familiarity of properties from a text corpus. To reduce redundancy, we perform logical reasoning and compute textual and numerical similarity between triples. ESSTER solves a combinatorial optimization problem to integrate these features. It achieves state-of-the-art results on the ESBM v1.2 dataset.

Keywords: entity summarization, readability, redundancy

1. Introduction

In RDF data, an entity is described by a possibly large set (e.g., hundreds) of RDF triples. The entity summarization task is to automatically generate a compact summary to provide human users with the key information about an entity. Specifically, an entity summary is a size-constrained subset of triples selected from an entity description. Current methods [1, 2, 3, 4, 5, 6] are mainly focused on selecting important triples, but ignore the reading experience of human users. In this system paper, we present the implementation of our entity summarizer named ESSTER [7].¹ It aims at generating entity summaries of structural importance, high readability, and low redundancy. Improving textual readability and reducing information redundancy help to enhance the reading experience of users. Experiments on the ESBM v1.2 dataset [8] show that ESSTER achieves state-of-the-art results.

¹ https://github.com/nju-websoft/ESSTER
ORCID: 0000-0001-6706-3776 (Q. Liu); 0000-0003-3539-7776 (G. Cheng); 0000-0003-2777-8149 (Y. Qu)
© 2020 Copyright for this paper by its authors.
CEUR Workshop Proceedings

2. Definition

RDF data is a set of subject-predicate-object triples $T$. For an entity $e$, its description $\mathrm{desc}(e)$ is the subset of triples in $T$ such that $e$ is the subject or the object. Each triple $t \in \mathrm{desc}(e)$ provides a property-value pair $\langle p, v \rangle$ for $e$. When $e$ is the subject of $t$, the property $p$ is $t$'s predicate and the value $v$ is $t$'s object. When $e$ is the object of $t$, the property $p$ is the inverse of $t$'s predicate and the value $v$ is $t$'s subject. For convenience, we define $\mathrm{prop}(t) = p$ and $\mathrm{val}(t) = v$. Given an integer size constraint $k$, an entity summary $S$ for $e$ is a subset of $\mathrm{desc}(e)$ satisfying $|S| \le k$.
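The definition above can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions: the `Triple` type, the function names `describe`, `prop_val`, and `is_summary`, the `^` prefix used to mark an inverse property, and the `ex:` identifiers are all hypothetical and do not come from the ESSTER code base.

```python
from typing import List, NamedTuple, Tuple


class Triple(NamedTuple):
    """A subject-predicate-object triple (hypothetical helper type)."""
    subject: str
    predicate: str
    obj: str


def describe(entity: str, triples: List[Triple]) -> List[Triple]:
    """Return desc(e): the triples in which the entity is subject or object."""
    return [t for t in triples if entity in (t.subject, t.obj)]


def prop_val(entity: str, t: Triple) -> Tuple[str, str]:
    """Return the property-value pair that triple t provides for the entity."""
    if t.subject == entity:
        # Entity is the subject: property is the predicate, value is the object.
        return t.predicate, t.obj
    # Entity is the object: property is the inverse predicate, value is the subject.
    # The "^" prefix is a notation chosen for this sketch only.
    return "^" + t.predicate, t.subject


def is_summary(summary: List[Triple], description: List[Triple], k: int) -> bool:
    """Check that a candidate is a subset of desc(e) with at most k triples."""
    return len(summary) <= k and all(t in description for t in summary)


# Toy data, invented for illustration.
data = [
    Triple("ex:Galway", "ex:country", "ex:Ireland"),
    Triple("ex:NUIG", "ex:locatedIn", "ex:Galway"),
    Triple("ex:Dublin", "ex:country", "ex:Ireland"),
]
desc = describe("ex:Galway", data)
pairs = [prop_val("ex:Galway", t) for t in desc]
```

Representing the inverse direction explicitly keeps every triple in a description in the same property-value form, which is what the later per-triple scores operate on.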

3. Implementation of ESSTER

ESSTER considers structural importance, readability, and redundancy. Below we present their computation and finally integrate them by solving a combinatorial optimization problem.
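As a rough picture of the integration step, the sketch below selects at most k triples so that each chosen triple contributes its own score and each chosen pair is penalized by its similarity, with a trade-off weight in [0, 1]. It assumes the per-triple scores and the pairwise similarities are already computed. The greedy strategy is a simple stand-in chosen for brevity, not the heuristic algorithm ESSTER uses, and the names `profit`, `greedy_summary`, `score`, `sim`, and `delta` are hypothetical.

```python
from typing import List, Sequence


def profit(i: int, j: int, score: Sequence[float],
           sim: Sequence[Sequence[float]], delta: float) -> float:
    """Profit of a pair: own score on the diagonal, negative similarity elsewhere."""
    if i == j:
        return (1 - delta) * score[i]
    return delta * (-sim[i][j])


def greedy_summary(score: Sequence[float], sim: Sequence[Sequence[float]],
                   k: int, delta: float = 0.5) -> List[int]:
    """Greedily pick up to k triple indices by largest marginal profit gain."""
    chosen: List[int] = []
    remaining = set(range(len(score)))

    def gain(i: int) -> float:
        # Marginal gain of adding i: its own profit plus pair profits with chosen ones.
        return profit(i, i, score, sim, delta) + sum(
            profit(i, j, score, sim, delta) for j in chosen)

    while remaining and len(chosen) < k:
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=gain)
        if gain(best) < 0:
            break  # adding any further triple would lower the total profit
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy input: triples 0 and 1 are near-duplicates, triple 2 is distinct but weaker.
toy_score = [0.9, 0.8, 0.3]
toy_sim = [[1.0, 0.95, 0.0],
           [0.95, 1.0, 0.1],
           [0.0, 0.1, 1.0]]
with_penalty = greedy_summary(toy_score, toy_sim, k=2, delta=0.5)
without_penalty = greedy_summary(toy_score, toy_sim, k=2, delta=0.0)
```

With the redundancy penalty switched on, the near-duplicate second triple is skipped in favor of the distinct third one; with the weight set to 0, selection falls back to ranking by score alone.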

3.1. Structural Importance

We measure the structural importance of a triple $t$ from two perspectives.

First, globally popular properties often reflect important aspects of entities, while globally unpopular values are informative. Therefore, we compute the global importance of a triple $t$ as follows:

$$
\begin{aligned}
\mathrm{glb}(t) &= \mathrm{ppop}_{\mathrm{global}}(t) \cdot (1 - \mathrm{vpop}(t)) \,, \\
\mathrm{ppop}_{\mathrm{global}}(t) &= \frac{\log(\mathrm{pfreq}_{\mathrm{global}}(t) + 1)}{\log(|E| + 1)} \,, \\
\mathrm{vpop}(t) &= \frac{\log(\mathrm{vfreq}(t) + 1)}{\log(|T| + 1)} \,,
\end{aligned}
\tag{1}
$$

where $E$ is the set of all entities described in RDF data $T$, $\mathrm{pfreq}_{\mathrm{global}}(t)$ is the number of entity descriptions in $T$ where $\mathrm{prop}(t)$ appears, and $\mathrm{vfreq}(t)$ is the number of triples in $T$ where $\mathrm{val}(t)$ is the value.

Second, multi-valued properties are intrinsically popular compared with single-valued properties. To compensate for this, we penalize multi-valued properties by using local popularity. We compute the local importance of a triple $t$ as follows:

$$
\begin{aligned}
\mathrm{loc}(t) &= (1 - \mathrm{ppop}_{\mathrm{local}}(t)) \cdot \mathrm{vpop}(t) \,, \\
\mathrm{ppop}_{\mathrm{local}}(t) &= \frac{\log(\mathrm{pfreq}_{\mathrm{local}}(t) + 1)}{\log(|\mathrm{desc}(e)| + 1)} \,,
\end{aligned}
\tag{2}
$$

where $\mathrm{pfreq}_{\mathrm{local}}(t)$ is the number of triples in $\mathrm{desc}(e)$ where $\mathrm{prop}(t)$ is the property.

Finally, we compute structural importance:

$$
\mathrm{struct}(t) = \alpha \cdot \mathrm{glb}(t) + (1 - \alpha) \cdot \mathrm{loc}(t) \,,
\tag{3}
$$

where $\alpha \in [0, 1]$ is a parameter to tune.

3.2. Textual Readability

To generate readable summaries, we measure the familiarity of a triple $t$ based on its property $\mathrm{prop}(t)$. A property is familiar to users if it is often used in an open-domain corpus. Specifically, given a text corpus of $B$ documents where $N$ documents have been read by the user, let $M(t)$ be the number of documents where the name of $\mathrm{prop}(t)$ appears. We compute

$$
\begin{aligned}
Q(t) &= \sum_{m=0}^{\min(M(t),\, N)} \frac{\binom{M(t)}{m} \cdot \binom{B - M(t)}{N - m}}{\binom{B}{N}} \cdot \mathrm{familiarity}(m) \,, \\
\mathrm{familiarity}(m) &= \frac{\log(m + 1)}{\log(B + 1)} \,.
\end{aligned}
\tag{4}
$$

Here, $m$ represents the number of documents the user has read where the name of $\mathrm{prop}(t)$ appears, based on which $\mathrm{familiarity}(m)$ gives the degree of familiarity of $\mathrm{prop}(t)$ to the user. However, it is difficult to know $m$ in practice, so $Q(t)$ computes the expected value of $\mathrm{familiarity}(m)$. For simplicity, we assume $N$ is a constant. In the experiments we set $N = 40$ and we use the Google Books Ngram as our corpus.

Finally, we compute textual readability:

$$
\mathrm{text}(t) = \log(Q(t) + 1) \,.
\tag{5}
$$

3.3. Information Redundancy

To reduce redundancy in summaries, we measure the similarity between two triples $t_i, t_j$ in various ways.

First, we perform logical reasoning to measure ontological similarity. We define $\mathrm{sim}(t_i, t_j) = 1$ if $\mathrm{prop}(t_i)$ and $\mathrm{prop}(t_j)$ are rdf:type, and rdfs:subClassOf is a relation between $\mathrm{val}(t_i)$ and $\mathrm{val}(t_j)$; or if $\mathrm{val}(t_i)$ and $\mathrm{val}(t_j)$ are equal, and rdfs:subPropertyOf is a relation between $\mathrm{prop}(t_i)$ and $\mathrm{prop}(t_j)$.

Otherwise, we rely on the similarity between properties and the similarity between values:

$$
\mathrm{sim}(t_i, t_j) = \max\{\mathrm{sim}_p(t_i, t_j),\ \mathrm{sim}_v(t_i, t_j),\ 0\} \,,
\tag{6}
$$

where for $\mathrm{sim}_p$ we use the ISub string similarity [9]. For $\mathrm{sim}_v$, we differentiate between two cases. In the first case, $\mathrm{val}(t_i)$ and $\mathrm{val}(t_j)$ are both numerical values. We compute

$$
\mathrm{sim}_v(t_i, t_j) =
\begin{cases}
-1 & \mathrm{val}(t_i) \cdot \mathrm{val}(t_j) \le 0 \,, \\
\dfrac{\min\{\mathrm{val}(t_i), \mathrm{val}(t_j)\}}{\max\{\mathrm{val}(t_i), \mathrm{val}(t_j)\}} & \text{otherwise} \,.
\end{cases}
\tag{7}
$$

In all other cases, we simply use ISub for $\mathrm{sim}_v$.

3.4. Combinatorial Optimization

We formulate entity summarization as a 0-1 quadratic knapsack problem (QKP), and we solve it using a heuristic algorithm [10].

Specifically, we define the profit of choosing two triples $t_i, t_j$ for a summary:

$$
\mathrm{profit}_{i,j} =
\begin{cases}
(1 - \delta) \cdot (\mathrm{struct}(t_i) + \mathrm{text}(t_i)) & i = j \,, \\
\delta \cdot (-\mathrm{sim}(t_i, t_j)) & i \ne j \,,
\end{cases}
\tag{8}
$$

where $\delta \in [0, 1]$ is a parameter to tune. Finally, our goal is to

$$
\begin{aligned}
\text{maximize} \quad & \sum_{i=1}^{|\mathrm{desc}(e)|} \sum_{j=i}^{|\mathrm{desc}(e)|} \mathrm{profit}_{i,j} \cdot x_i \cdot x_j \\
\text{subject to} \quad & \sum_{i=1}^{|\mathrm{desc}(e)|} x_i \le k \,, \\
& x_i \in \{0, 1\} \ \text{for all } i = 1 \ldots |\mathrm{desc}(e)| \,.
\end{aligned}
\tag{9}
$$

4. Experiments

4.1. Settings

We use the ESBM v1.2 dataset [8]. It provides ground-truth summaries under $k = 5$ and $k = 10$ for entities in DBpedia and LinkedMDB. We follow the provided training-development-test splits for 5-fold cross-validation, and we use the training and development sets for tuning our parameters $\alpha$ and $\delta$ by grid search in the range of 0–1 with 0.01 increments. We use F1 score as the evaluation metric.

4.2. Results

Table 1 presents the evaluation results. We compare with known results of existing unsupervised entity summarizers [8]. On DBpedia under $k = 5$, BAFREC [6] achieves the highest F1 score, and is closely followed by ESSTER. In all the other three settings, ESSTER outperforms all the baselines. Overall, ESSTER achieves state-of-the-art results on ESBM v1.2.

Table 1
F1 Scores

            DBpedia            LinkedMDB
            k = 5    k = 10    k = 5    k = 10
RELIN       0.242    0.455     0.203    0.258
DIVERSUM    0.249    0.507     0.207    0.358
FACES       0.270    0.428     0.169    0.263
FACES-E     0.280    0.488     0.313    0.393
CD          0.283    0.513     0.217    0.331
LinkSUM     0.287    0.486     0.140    0.279
BAFREC      0.335    0.503     0.360    0.402
KAFCA       0.314    0.509     0.244    0.397
MPSUM       0.314    0.512     0.272    0.423
ESSTER      0.324    0.521     0.365    0.452

5. Conclusion

In this system paper, we presented the implementation of our entity summarizer ESSTER. By integrating structural importance, textual readability, and information redundancy via combinatorial optimization, ESSTER achieves state-of-the-art results among unsupervised entity summarizers on the ESBM v1.2 dataset. However, the results are not comparable with supervised neural entity summarizers [11, 12].

For the future work, we will consider more powerful measures of readability and redundancy, and will incorporate these features into a neural network model.

Acknowledgments

This work was supported by the National Key R&D Program of China (2018YFB1004300) and by the NSFC (61772264).

References

[1] Q. Liu, G. Cheng, K. Gunaratna, Y. Qu, Entity summarization: State of the art and future challenges, CoRR abs/1910.08252 (2019).
[2] G. Cheng, T. Tran, Y. Qu, RELIN: relatedness and informativeness-based centrality for entity summarization, in: ISWC'11, Part I, 2011, pp. 114–129. doi:10.1007/978-3-642-25073-6_8.
[3] K. Gunaratna, K. Thirunarayan, A. P. Sheth, FACES: diversity-aware entity summarization using incremental hierarchical conceptual clustering, in: AAAI'15, 2015, pp. 116–122.
[4] K. Gunaratna, K. Thirunarayan, A. P. Sheth, G. Cheng, Gleaning types for literals in RDF triples with application to entity summarization, in: ESWC'16, 2016, pp. 85–100. doi:10.1007/978-3-319-34129-3_6.
[5] A. Thalhammer, N. Lasierra, A. Rettinger, LinkSUM: Using link analysis to summarize entity data, in: ICWE'16, 2016, pp. 244–261. doi:10.1007/978-3-319-38791-8_14.
[6] H. Kroll, D. Nagel, W.-T. Balke, BAFREC: Balancing frequency and rarity for entity characterization in linked open data, in: EYRE'18, 2018.
[7] Q. Liu, G. Cheng, Y. Qu, Entity summarization with high readability and low redundancy, Sci. Sin. Inform. 50 (2020) 845–861. doi:10.1360/SSI-2019-0291.
[8] Q. Liu, G. Cheng, K. Gunaratna, Y. Qu, ESBM: an entity summarization benchmark, in: ESWC'20, 2020, pp. 548–564. doi:10.1007/978-3-030-49461-2_32.
[9] G. Stoilos, G. B. Stamou, S. D. Kollias, A string metric for ontology alignment, in: ISWC'05, 2005, pp. 624–637. doi:10.1007/11574620_45.
[10] Z. Yang, G. Wang, F. Chu, An effective GRASP and tabu search for the 0-1 quadratic knapsack problem, Comput. Oper. Res. 40 (2013) 1176–1185. doi:10.1016/j.cor.2012.11.023.
[11] Q. Liu, G. Cheng, Y. Qu, Deeplens: Deep learning for entity summarization, in: DL4KG'20, 2020.
[12] J. Li, G. Cheng, Q. Liu, W. Zhang, E. Kharlamov, K. Gunaratna, H. Chen, Neural entity summarization with joint encoding and weak supervision, in: IJCAI'20, 2020, pp. 1644–1650. doi:10.24963/ijcai.2020/228.