=Paper= {{Paper |id=Vol-2949/paper4 |storemode=property |title=TTProfiler: Computing Types and Terms Profiles of Assertional Knowledge Graphs |pdfUrl=https://ceur-ws.org/Vol-2949/paper4.pdf |volume=Vol-2949 |authors=Lamine Diop,Arnaud Giacometti,Béatrice Markhoff,Arnaud Soulet |dblpUrl=https://dblp.org/rec/conf/swodch/DiopGMS21 }} ==TTProfiler: Computing Types and Terms Profiles of Assertional Knowledge Graphs== https://ceur-ws.org/Vol-2949/paper4.pdf
TTProfiler: Computing Types and Terms Profiles
        of Assertional Knowledge Graphs

    Lamine Diop, Arnaud Giacometti, Béatrice Markhoff, and Arnaud Soulet

                      Université de Tours, LIFAT, Blois, France
                       firstname.lastname@univ-tours.fr



       Abstract. As more and more knowledge graphs (KG) are published on
       the Web, there is a need for tools that abstract their content, so that
       their producers can verify their results and their consumers can use them.
       This implies showing the schema-level patterns instantiated in the graph,
       along with the frequency with which they are instantiated. A profile
       represents this information. In this paper, we propose a new type of
       profile that we call TT profile, for Types and Terms profile. It shows the
       Types and predicates used, but also the Terms used, because of their
       paramount importance in most KGs, especially in the Cultural Heritage
       (CH) domain. We present an algorithm for building a TT profile from an
       online KG’s assertional part, and we report on experiments performed
       over a set of CH KGs.


1     Introduction
It has become widespread in the Cultural Heritage (CH) field to generate Knowl-
edge Graphs from legacy datasets, using one or more ontologies [2]. A knowledge
graph (KG) is a dataset in RDF, i.e. a set of (subject, predicate, object) triples.
CH KGs contribute to the Linked Open Data (LOD) construction, publicly
offering inter-linked and semantically defined datasets, which is supposed to boost
knowledge discovery and efficient data-driven analytics at a world-wide scale.
However, using LOD datasets for analysis requires a clear idea of their content
and this is a long-standing difficulty. It is not enough to know which ontologies
are used, it is necessary to know how they are used, i.e. which of their components
serve in that particular dataset, and in what way.
    In the last ten years, several proposals have emerged to help users know
what a given KG contains, by extracting its predicates, the types of entities they
link, and some basic statistics, as ABSTAT [6] does. In the same way, our aim is to
generate such an abstract image of the KG as it is queryable online, by default
without reasoning. From a given KG, ABSTAT builds a set of (C, P, D) triples
with statistics, where C and D are types and P is a predicate. Such a triple is
called an Abstract Pattern (AP). Figure 1 (left) shows the first four APs returned by
ABSTAT when asking for the predicate dbo:country on a 2016 dump of DBpedia
in English, using its online tool1 . The first AP indicates that there are 560,532
RDF triples (last column) in this KG for which the predicate dbo:country relates
a subject of type dbo:Location to an object of type dbo:Country, which informs
us that we can query locations and their associated countries. Figure 1 (right)
presents the Basic Graph Pattern (BGP) able to compute an ABSTAT AP
representing its instances, with n its frequency (number of its instances in the
KG). Edges labeled with “a” represent the predicate rdf:type.

1
    http://abstat.disco.unimib.it/


Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).

Fig. 1: ABSTAT (left), Basic Graph Pattern and its corresponding AP (right).
    ABSTAT returns thousands of APs just for the predicate dbo:country from
this dataset, several of them representing the same facts in the KG. For instance,
dbo:Location and schema:Place in Figure 1 are probably both types of the sub-
jects of predicate dbo:country that have objects of type dbo:Country, since the
two APs have exactly the same frequency (i.e., 560,532). In other words, if the
BGP in Figure 2 (a) was instantiated in the KG, then ABSTAT would generate
four APs (the Cartesian product of the subject’s and object’s types), all with the same
frequency n. For representing each fact in the KG with only one AP, we propose
to deal with APs where predicates relate not just types but sets of types, as
in Figure 2 (a). Moreover, to the best of our knowledge there is no tool that
highlights not only the types of the subject and object of a predicate, but also
the terms used for objects in the KG. It is one thing to indicate that there are
instances of crm:E22_Man-made_Object in the graph, but the fact that it contains
information about coins, or burials, or garments, is much more interesting and
precise. In KGs that use the CIDOC Conceptual Reference Model 2 (hereafter
CIDOC), this information is carried by terms, more precisely by URIs described
in some thesauri, such as http://nomisma.org/id/coin for instance. This is the
case because, quoting [3], “CIDOC defines and is restricted to the underlying se-
mantics of database schemata and document structures used in cultural heritage
and museum documentation in terms of a formal ontology. It does not define
any of the terminology appearing typically as data in the respective data struc-
tures; however it foresees the characteristic relationships for its use.” The type
crm:E55_Type is a gateway to these controlled vocabularies, but it is not the only
one, for example crm:E57_Material can also be detailed in an ad hoc nomencla-
ture. This choice is in line with the use of databases in CH communities insofar
as it organises in ontology the entities of the domain and their relationships, but
not the descriptive values, i.e. most of the values in databases. In general, these
values are listed and described elsewhere in authority lists, for interoperability
purposes.

2
    cidoc-crm.org/

               Fig. 2: Types (a) and Terms (b) Abstract Patterns.

This means that CIDOC-based KGs generally employ various sets of
terms, which provide at least as much meaning as the used CIDOC types and
predicates. For taking this into account, we propose abstract patterns showing
terms used in the graph, as illustrated in Figure 2 (b), where ti denotes the
instances of the variable ?t and the edge labeled with “prefLabel” represents the
situation where the variable ?o is instantiated by a term and ?t by a label of that
term. Detecting this situation depends on how the vocabularies are implemented.
In this paper we consider those implemented with SKOS3 .
    To sum up our contribution, we deal with KGs that are sets of assertional
knowledge, whose intensional part is formally defined by existing RDFS or OWL
ontologies, and which contain instances of SKOS concepts defined in existing
SKOS thesauri. Given that the ontologies and thesauri used are not neces-
sarily accessible online for programs, we present and discuss a program called
TTProfiler that builds a set of Types and Terms (TT) APs, which we call a
profile, by querying the KG’s online SPARQL endpoint4 . The rest of this paper is organised
as follows: in Section 2, we provide definitions for TT APs and profiles. In Sec-
tion 3 we present the TTProfiler algorithm. We report in Section 4 our uses of
its implementation on various online CH KGs. We conclude in Section 5.


2     Definitions and Problem Formulation
We use Description Logics (DL) [1] formal notations for defining our prob-
lem: we consider a Web KG as a knowledge base (KB) K, composed of
the TBox T (names and assertions about concepts and roles, respectively
called types and predicates in this paper) and the ABox A (assertions
about individuals, called entities and facts). For instance DBpedia is a KB
K = (T , A), one example of assertion in T is dbo:Artist v dbo:Person,
meaning that the type dbo:Artist is subsumed by the type dbo:Person,
i.e. all artists are persons. T also includes assertions like ∃dbo:birthYear
v dbo:Person, meaning that the predicate dbo:birthYear is defined for per-
sons. On the ABox side, dbo:Person(dbr:Michelle_Obama) declares that en-
tity dbr:Michelle_Obama is a person and dbo:birthYear(dbr:Michelle_Obama, 1964)
states the fact that Michelle Obama was born in 1964. Also, some persons
are related via the predicate dct:subject to a SKOS concept, for instance
3
    https://www.w3.org/TR/skos-reference/
4
    TTProfiler’s code is available at https://github.com/DTTProfiler/DTTProfiler
we find in DBpedia dbo:Person(dbr:Ringo_Madlingozi), skos:Concept(Category:1964)
and dct:subject(dbr:Ringo_Madlingozi, Category:1964).
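These assertions can be sketched as plain triple sets (a minimal toy representation for illustration, not TTProfiler’s data model):

```python
# A knowledge base K = (T, A) sketched as two sets of assertions, using
# the DBpedia examples from the text (abridged, illustrative encoding).
tbox = {
    ("dbo:Artist", "rdfs:subClassOf", "dbo:Person"),
    ("dbo:birthYear", "rdfs:domain", "dbo:Person"),
}
abox = {
    ("dbr:Michelle_Obama", "rdf:type", "dbo:Person"),
    ("dbr:Michelle_Obama", "dbo:birthYear", "1964"),
    ("dbr:Ringo_Madlingozi", "rdf:type", "dbo:Person"),
    ("Category:1964", "rdf:type", "skos:Concept"),
    ("dbr:Ringo_Madlingozi", "dct:subject", "Category:1964"),
}

# The profiler (Section 3) reads only the ABox; e.g. its SKOS concepts:
skos_concepts = {s for s, p, o in abox
                 if p == "rdf:type" and o == "skos:Concept"}
print(skos_concepts)  # {'Category:1964'}
```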
    Our aim is to give an abstract of the ABox content, which in the Web of
data is in general far bigger than the TBox. We work with the asserted KG
(or ABox), not with the version one gets by applying a reasoner. By default,
SPARQL endpoints do not perform entailments. We expose all the
types and predicates appearing in the ABox, whatever ontologies they belong
to, so we do not limit ourselves to only one given ontology. We also want
to highlight the SKOS concept instances appearing in the ABox. To do so we
look for skos:Concept instances or subjects of skos:prefLabel, and we also look
for declared prefixes that correspond to some known thesauri. We do not use
the ontologies and thesauri in the algorithm presented in this paper. When
publicly available, they can be used later on, together with profiles, to complete
them. There may be cases in which the TBox is limited to a few ontologies that
are consistent by themselves and semantically compatible with each other. In
those rare cases, a reasoning step combining the TBox and ABox could also be
performed before or during the profile generation. This is out of the scope of this
paper, because we want our proposal to work on the online Web, which presents
too much uncertainty about the quality of the ontologies used (they are not even
guaranteed to be accessible online for programs). This is why we limit our scope
to the ABox content without considering the TBox, leaving it to other services
to process the ontologies and thesauri for enriching the profile information af-
terwards, when needed. The algorithm presented in this paper builds a profile
of the given ABox A, which is composed of TT APs (Types and Terms abstract
patterns), which are triples whose subjects and objects are sets, as defined in
Definition 1. In [6], APs are triples (C, P, D), where C and D are types and P
a predicate: we call them basic APs. TT APs generalise basic APs in two ways:
first, objects can be either types or terms (labels of instances of skos:Concept).
Second, both subjects and objects are sets (either set of types or set of terms),
as illustrated in Figure 2.

Definition 1 (TT Abstract Pattern). Given an ABox A, a TT abstract
pattern of A is a triple (C, P, D) such that C is a set of types in A, P is a
predicate in A, and D is either a set of types in A or a set of terms appearing
in A. Here a term is a string literal, the label of an instance of skos:Concept. A TT
abstract pattern (C, P, D) represents a fact P (a, b) of A if:
 – the entity a is an instance of each type in C (i.e., C(a) ∈ A for C ∈ C), and
 – the entity b is an instance of each type in D (i.e., D(b) ∈ A for D ∈ D) or
   the entity b is an instance of skos:Concept and its prefLabel is in D (i.e.,
   skos:Concept(b) and skos:prefLabel(b, t) and t ∈ D).
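Definition 1 can be sketched as a small predicate over a toy ABox (the crm:P2_has_type fact, the entity names, and the encoding are illustrative assumptions, not the paper’s implementation):

```python
# Toy sketch of Definition 1: does the TT abstract pattern (C, P, D)
# represent the fact P(a, b) in the ABox? All names are illustrative.
def represents(ap, fact, abox):
    (C, P, D) = ap          # C, D: sets; D holds types or term labels
    (p, a, b) = fact
    if p != P:
        return False

    def types(x):
        return {o for s, pr, o in abox if s == x and pr == "rdf:type"}

    def labels(x):
        return {o for s, pr, o in abox if s == x and pr == "skos:prefLabel"}

    subject_ok = all(c in types(a) for c in C)
    object_typed = all(d in types(b) for d in D)
    object_term = ("skos:Concept" in types(b)
                   and any(t in D for t in labels(b)))
    return subject_ok and (object_typed or object_term)

abox = [
    ("x", "rdf:type", "crm:E22_Man-made_Object"),
    ("t", "rdf:type", "skos:Concept"),
    ("t", "skos:prefLabel", "coin"),
    ("x", "crm:P2_has_type", "t"),
]
print(represents(
    ({"crm:E22_Man-made_Object"}, "crm:P2_has_type", {"coin"}),
    ("crm:P2_has_type", "x", "t"), abox))  # True
```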

    This definition admits several variants. For instance, the subject
and object of an AP could be generalised to types not actually appearing in A
but defined using T , as owl:Thing, rdfs:Literal and so-called minimal types
used in [6]. Also for instances of skos:Concept, one could use some definitions
in their respective thesaurus. As already noticed, contrary to [6], if a or b have
[Figure 3 depicts three maximal nodes, {C4 }, {C1 , C2 , C3 } and {t1 , t2 , t3 },
linked by edges annotated (({C1 , C2 }, P1 , {C3 }), 20), (({C1 , C3 }, P2 , {C4 }), 18),
(({C1 }, P3 , {t1 , t2 }), 100) and (({C3 }, P3 , {t1 , t3 }), 50).]

                              Fig. 3: Graph with maximal sets


several types asserted in A (whether or not linked in T by a subsumption), then
by Definition 1 the fact P(a,b) is represented by only one AP. Also contrary
to [6], a fact P (a, b) with no type asserted for a, or with neither a type
asserted for b nor any clue that b belongs to a thesaurus, does
not yield any AP. Given the set of APs generated from an ABox A according
to Definition 1, we can associate statistics with those patterns, leading to the
following definition of a TT profile:

Definition 2 (TT Profile). Given an ABox A, a TT profile P of A is a set
of pairs ((C, P, D), S) such that (C, P, D) is a TT AP generated from A, and S
is a statistic value describing (C, P, D).

   There are many ways to define interesting statistics of a KG’s assertional
part. We may consider the global number of assertions C(a) for each type C, the
global number of assertions P (a, b) for each predicate P , the global number of
assertions P (a, b) for each SKOS concept b appearing in A, and so on. In this paper, we
deal with the frequency of a TT AP, that is how many facts of A it represents.
We call weight the function that associates with (C, P, D) its frequency in A.

Definition 3 (Weight of a TT abstract pattern). The weight of the TT
abstract pattern (C, P, D), denoted ω((C, P, D)), is its frequency in A:
ω((C, P, D)) = |{P (a, b) ∈ A : P (a, b) is represented by (C, P, D) according to
Definition 1}|.

     Moreover, for the sake of clearly drawing the TT profile as a graph, we
aim at grouping the sets of types in such a way that each type appears in
only one set (or node). For instance, if we have in a TT profile P the TT APs
A1 = ({C1 , C2 }, P1 , {C3 }), A2 = ({C1 , C3 }, P2 , {C4 }), A3 = ({C1 }, P3 , {t1 , t2 })
and A4 = ({C3 }, P3 , {t1 , t3 }), with ω(A1 ) = 20, ω(A2 ) = 18, ω(A3 ) = 100
and ω(A4 ) = 50, then we merge the sets {C1 , C2 }, {C3 }, {C1 , C3 } and {C1 } into
a maximal set {C1 , C2 , C3 }, and the sets {t1 , t2 } and {t1 , t3 } into a maximal set
{t1 , t2 , t3 }, which gives the representation shown in Figure 3.
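This merging can be sketched as follows (a minimal illustration of the incremental insertion used in Algorithm 1; the function name mirrors the algorithm’s add helper):

```python
# Incremental maximal-set computation: adding a node merges it with
# every existing maximal set it overlaps (toy sketch of Algorithm 1's
# add(C, phi) helper).
def add(node, phi):
    merged = set(node)
    rest = []
    for m in phi:
        if merged & m:
            merged |= m       # overlapping sets are absorbed
        else:
            rest.append(m)
    rest.append(merged)
    return rest

phi = []
for node in [{"C1", "C2"}, {"C3"}, {"C1", "C3"}, {"C1"},
             {"t1", "t2"}, {"t1", "t3"}]:
    phi = add(node, phi)
print(phi)  # two maximal sets: {C1, C2, C3} and {t1, t2, t3}
```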
     Searching for maximal sets amounts to searching for the connected compo-
nents of the graph formed by the profile’s nodes (subjects and objects of TT
APs), with an edge connecting two nodes if and only if these nodes have a
non-empty intersection. The union of a component’s nodes is a maximal set.
Computing the components of a graph is generally done by a linear depth-first
search, but in Algorithm 1 we incrementally compute the maximal sets ϕ during
the TT profile building. For the profile visualisation, maximal nodes can be rep-
resented by one of their types or terms, and the others can be shown on demand.
As shown in Figure 3, edges are annotated with the corresponding AP and its
weight.
    We can now state our problem as follows: Given the assertional part of
a knowledge base K = (T , A), how to efficiently generate and visualise
a TT profile of A?


3   TTProfiler Algorithm

TTProfiler computes a TT profile of an ABox A following a three-step procedure:
1) basic abstract patterns and statistics recovery, 2) TT profile computing,
and 3) TT profile visualisation structure building.
Step 1: Basic abstract patterns and statistics recovery. We recover all basic ab-
stract patterns (C, P, D) with w, their frequency, i.e. the number of instances of
(C, P, D) in A (line 1). An assertion P (a, b) in A is said to be an instance of the
abstract basic pattern (C, P, D) if and only if a is of type C in A (i.e., C(a) ∈ A)
and b is either of type D or a term of a thesaurus.
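As a sketch, the counting queries of Step 1 could look like the following (hypothetical helper names; the exact queries issued by TTProfiler may differ, notably to respect the fair-use policies of public endpoints):

```python
# Illustrative Step 1 counting queries, built as strings. The first
# recovers basic abstract patterns (C, P, D) where both subject and
# object are typed; the second recovers the patterns where the object
# is a SKOS concept carrying a prefLabel (a term).
def typed_ap_query(limit=10000):
    return f"""
SELECT ?C ?P ?D (COUNT(*) AS ?w) WHERE {{
  ?a ?P ?b .
  ?a a ?C .
  ?b a ?D .
}} GROUP BY ?C ?P ?D LIMIT {limit}
""".strip()

def term_ap_query(limit=10000):
    return f"""
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?C ?P ?t (COUNT(*) AS ?w) WHERE {{
  ?a ?P ?b .
  ?a a ?C .
  ?b a skos:Concept .
  ?b skos:prefLabel ?t .
}} GROUP BY ?C ?P ?t LIMIT {limit}
""".strip()

print(typed_ap_query(100).splitlines()[0])
```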
Step 2: Profile computing. To fit Definitions 1 and 2, for each predicate appearing
in a basic abstract pattern we group all types that have common instances (lines
5-11), and we also group terms for subjects having the same type (lines 12-14).
In this last case, we associate with the resulting pattern a weight equal to the sum of
the weights computed in Step 1. With the resulting weighted TT APs, each fact
P(a,b) is represented by only one pattern. Each computed TT AP is added into
the TT profile P (line 15). We also incrementally compute the set ϕ of maximal
nodes, incorporating in it the nodes C and D (that are sets of types or terms)
(line 16). The incorporation of a node in ϕ consists in grouping its elements
with other nodes containing them, as explained in Section 2 (cf. the example
illustrated in Figure 3).
Step 3: Profile Visualisation structure computing. In this last step, for each
weighted TT abstract pattern we replace its subject and object by their cor-
responding maximal node in ϕ (lines 19-20) and we add the resulting triple to
the Profile Visualisation structure P V .
    Regarding the complexity, Step 1 consists in querying the KG, so it depends
on the SPARQL endpoint and the network capacities; Step 2 is linear in the
number of predicates and quadratic in the number of basic abstract patterns
computed in Step 1; Step 3 is linear in the number of TT abstract patterns.
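A toy end-to-end run of Step 2 on precomputed Step 1 output can be sketched as follows (a deliberate simplification: Algorithm 1 verifies type co-occurrence over the ABox, whereas here basic patterns sharing predicate, object and frequency are assumed to describe the same facts):

```python
from collections import defaultdict

# Step 1 output: ((C, P, D), w) basic abstract patterns, echoing the
# DBpedia example from the introduction (numbers from that example).
basic = [
    (("dbo:Location", "dbo:country", "dbo:Country"), 560532),
    (("schema:Place", "dbo:country", "dbo:Country"), 560532),
]

# Step 2 (simplified): group subject types per (predicate, object, w).
subjects = defaultdict(set)
for (c, p, d), w in basic:
    subjects[(p, d, w)].add(c)

profile = [((frozenset(cs), p, frozenset([d])), w)
           for (p, d, w), cs in subjects.items()]

for (cs, p, ds), w in profile:
    print(sorted(cs), p, sorted(ds), w)
    # ['dbo:Location', 'schema:Place'] dbo:country ['dbo:Country'] 560532
```

With this grouping, the two redundant ABSTAT APs collapse into a single TT AP, so each fact is represented only once.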


4   Experiments with Cultural Heritage KGs

TTProfiler, whose code is published in Github (see Section 1), is devised to
apply to KGs that can be queried online via a SPARQL endpoint. This requires
Algorithm 1 TTProfiler: Types and Terms Profiler
    Input: The ABox A of a knowledge base K = (T , A)
    Output: The TT profile P of A and its visualisation structure P V
    //Step 1: basic abstract patterns extraction from A and statistics computation
 1: Let R = {((C, P, D), w) / (∃P (a, b) ∈ A) ∧ C(a) ∈ A ∧ (D(b) ∈ A ∨
    (skos:Concept(b) ∧ skos:prefLabel(b, D)))} where w is the number of instances
    P (a, b) in A for (C, P, D)
    //Step 2: TT profile computing: grouping types and terms in sets
 2: Let P = {P / (∃((C, P, D), w) ∈ R)}                ▷ P is the set of predicates in R
 3: P ← ∅, ϕ ← ∅                            ▷ ϕ is a set of maximal sets of types or terms
 4: for (P ∈ P) do                               ▷ grouping types and terms by predicates
 5:    for ((C1 , P, D1 ), w1 ) ∈ R do
 6:        C ← {C1 }, D ← {D1 }, w ← w1
 7:        for ((C2 , P, D2 ), w2 ) ∈ R ∧ ((C1 ≠ C2 ) ∨ (D1 ≠ D2 )) do
 8:            if (C1 ≠ C2 ) ∧ (D1 = D2 ) ∧ (∀P (a, b) ∈ A : C1 (a) ∈ A ∧ C2 (a) ∈ A)
    then
 9:                C ← C ∪ {C2 }                              ▷ group the types of subjects
10:            if (D1 ≠ D2 ) ∧ (C1 = C2 ) ∧ (∀P (a, b) ∈ A : D1 (b) ∈ A ∧ D2 (b) ∈ A)
    then
11:                D ← D ∪ {D2 }                               ▷ group the types of objects
12:            if (isLabel(D1 )) ∧ (isLabel(D2 )) ∧ (C1 = C2 ) then
13:                D ← D ∪ {D2 }                                          ▷ group the terms
14:                w ← w + w2
15:     P ← P ∪ {((C, P, D), w)}
16:     ϕ ← add(C, ϕ), ϕ ← add(D, ϕ)
    //Step 3: Profile visualisation structure
17: P V ← ∅
18: for ((C, P, D), w) ∈ P do
19:     A ← maxNode(C, ϕ)
20:     B ← maxNode(D, ϕ)
21:     P V ← P V ∪ {(A, ((C, P, D), w), B)}
22: return (P, P V )
    // add(C, ϕ) returns the set of maximal nodes ϕ having incorporated C
    // isLabel(D) returns true if D is a term, label of a skos:Concept in A
    // maxNode(C, ϕ) returns the maximal node that contains C




carefully writing the SPARQL queries of Step 1 because of the fair-use policies
applied by public SPARQL endpoints. Moreover, as already said about the time
complexity, Step 1 of computing a TT profile depends on the configurations of the
SPARQL endpoint and the network capacities. Considering only the client side
computation (Step 2 and Step 3), on small graphs of less than 1,000,000 triples,
the TT profile generation takes about 0.06 seconds. For 91,000,000 triples it
takes 1.15 seconds. TTProfiler is implemented in Java using the Jena library to
query the public SPARQL endpoints. It was run on Windows 10 with an Intel
core i7 processor and 32 GB of RAM.
    We designed this program as part of a French CH project called SESAMES5
and we tested it with the archaeological KGs grouped in OpenArchaeo6 . Those
graphs are generated from legacy databases, based on a common model which
is a small excerpt of the CIDOC and its extensions. Even with such a restricted
ontology, not all types and predicates have instances in every KG, so the visual
query tool that OpenArchaeo provides could be complemented by the display of
TT profiles to show what can be asked. In addition, the producers of these graphs
use the TT profiles to inspect the results of the KG automatic generation, which
is based on mappings expressed with tools like Ontop and X3ML. KG producers
know exactly which predicates, types and terms should appear in the TT profiles,
and can therefore easily detect anomalies in their mappings.
    Besides the KGs in OpenArchaeo, we looked for other graphs using the
CIDOC CRM and offering a SPARQL API usable by an application. Of those
found, many are not always online and many do not answer the counting SPARQL
queries of Step 1. We present in Table 1 nine graphs that are currently7 capa-
ble of answering the required queries. Seven of them are from OpenArchaeo
(Kition, Iceramm, Arsol, Epicherchell, Outagr, Rita, and Aerba) and are
rather small, while the Smithsonian’s8 and Doremus’s9 graphs are of different
designs, use English terminologies, and are much larger. Doremus contains
multilingual labels. Table 1 shows the number of edges and nodes in KGs, the number
of distinct types/terms appearing in P, the number of TT APs (i.e., |P|) and
maximal nodes (i.e., |P V |). Although the set of basic abstract patterns is al-
ready a condensed representation of the original graph, it can be too large to
be easily visualised, hence the grouping of types and terms and the use of the
notion of maximal node, which allows us to display graphs with fewer nodes, as
shown in the last two columns of Table 1. Figure 4 gives an example of the graph
visualisation offered to end-users.
    Table 2 shows that the nine KGs use the CIDOC and possibly one or
more of its extensions (CRMsci, CRMarch, CRMba). Doremus uses the so-called
Erlangen implementation of the CIDOC (denoted ecrm). In addition, these KGs
use terms from thesauri: the PACTOLS 10 for OpenArchaeo KGs, and an internal
vocabulary for Doremus. When thesauri are used, the number of terms is far
larger than that of types. Concerning the predicates instantiated in each KG,
here again, all KGs use the CIDOC or its extensions. As can be noticed, CRMba
is used by ArSol in OpenArchaeo for one type and not for any predicate: this
is because the extensions are attached to CIDOC by subsumption links. So, one
can use extension classes with predicates that are defined in CIDOC. In general
both types and predicates of extensions are used in the tested KGs.

5
   http://anr-sesames.map.cnrs.fr/
6
   http://openarchaeo.huma-num.fr/explorateur/home
 7
   June, 2021
 8
   SPARQL API: http://edan.si.edu/saam/sparql
 9
   SPARQL human interface: http://data.doremus.org/sparql
10
   https://pactols.frantiq.fr
                  Table 1: Knowledge graphs and TT profiles
                          statistics for A             statistics for TT profile
 A            nb triples   nb nodes  language  nb types & terms  nb AP  nb nodes
 Aerba             3,318      1,695     fr             5            3       5
 Epicherchell      3,488      1,372      -            31           15      13
 Kition           26,773      9,165     fr            72           31      19
 Iceramm          32,687      9,325      -            13           21      13
 Rita             40,479     10,769      -           184            6       7
 Outagr           79,420     39,573     fr             8            8       8
 Arsol           670,757    212,143      -            94           34      17
 Smithsonian   2,542,142    969,172     en            18           35      18
 Doremus      91,093,326 24,141,972     en           599          678     146



    We present in Figure 4 a visualisation of a small profile, that of
Epicherchell. In this graph, nodes suffixed by et_al are sets of terms and
colours denote namespaces (e.g. blue for CRMsci). A click on an edge, here the
predicate P4_has_time-span, displays its subject and object. A node’s content is
also displayed on demand; for instance, at the bottom of the figure the node
autel_et_al has been selected. These terms are used to describe the usage of ob-
jects. The profile shows not only that objects are characterised by the type
E57_Material through the predicate P45_consists_of, but also the set of terms
used in the KG for each material, with the node albatre_et_al appearing as
object of the same property.


5   Related Works and Conclusion

Compared to works aiming at discovering the schema of a graph, or extracting
modules or constraints from KGs, our problem is much simpler. TT profiles
can be seen as a special kind of KG summary, as described in the recent and
comprehensive survey on KG summarisation [4]. Its authors classify the existing
summarisation techniques in four classes: (i) structural methods that consider
the paths (quotient graphs) or the subgraphs (with high centrality) in the KG,
(ii) pattern mining methods that discover patterns in the KG and use them for
showing a synthesis of the graph, (iii) statistical methods, that extract from the
graph quantitative measures or statistics, and (iv) hybrid methods that combine
some of the previous ones. As explained in [4], the proposals can be distinguished
by their inputs and outputs: some works consider only ontologies, some others
exploit only instances, and hybrid approaches process both. Outputs also differ,
ranging from a graph (not necessarily an RDF graph) to a set of items (e.g.,
rules or queries). Finally, very few of these proposals can be used online or make
the source code available. We found only one work [7] that has been applied
in the CH field, a structural method focused on the centrality of concepts in the
ontology. At the time of writing, it is not usable online and we did not succeed in
obtaining the sources. The closest to our proposal is ABSTAT [6] dealing with both
           Table 2: Types, Terms and Predicates in the TT profiles
                       Number of Types                     Number of Terms
     K      crm crmsci crmarch crmba ecrm doremus        PACTOLS     vocabulary
   Aerba     4    0       0      0     0     0               0            0
Epicherchell 9    1       0      0     0     0              21            0
   Kition    12   1       1      0     0     0              57            0
  Iceramm    11   1       0      0     0     0               1            0
    Rita     5    0       0      0     0     0              178           0
   Outagr    6    1       0      0     0     0               1            0
   Arsol     11   1       1      1     0     0              79            0
Smithsonian 17    0       0      0     0     0               0            0
  Doremus    0    0       0      0    15     40              0           459

                                  Number of Predicates
                   K      crm crmsci crmarch crmba ecrm doremus
                 Aerba      3   0       0      0     0     0
              Epicherchell 13   2       0      0     0     0
                 Kition    26   3       1      0     0     0
                Iceramm    19   2       0      0     0     0
                  Rita      5   0       0      0     0     0
                 Outagr     7   1       0      0     0     0
                 Arsol     29   3       1      0     0     0
              Smithsonian 27    0       0      0     0     0
                Doremus     0   0       0      0    112   101




data and schemas, with a set of abstract patterns as output, usable online. To
the best of our knowledge, none of the existing summarisation proposals can be
tested on SPARQL endpoints (without loading locally a complete KG dump).
Also, none of them outputs information about the KG terms.
    With respect to ABSTAT, our proposal can be analysed with regard to
Thomas Kuhn’s six criteria for characterising successful improvements in sci-
entific theories [5]: Generality (the scope of the theory is increased), Simplicity
(the theory is less complicated), Explanatory power (the theory gives increased
meaning), Fruitfulness (the theory can potentially meet more currently unspeci-
fied requirements), Objectivity (the theory provides a more objective shared un-
derstanding of the world), and Precision (the theory gives a more precise picture
of the world). Computing a TT profile does not require access to the intensional
part of the KG (the ontologies), which is mandatory for ABSTAT and other
proposals. Moreover, our proposal works by querying online SPARQL endpoints,
relieving users of the burden of downloading the KG dump. So, TTProfiler
provides support similar to ABSTAT’s, but in less constrained settings, which
can be considered to improve the Generality of the tool. Clustering types and
terms according to their use in the KG increases the faithfulness of the repre-
sentation, or Objectivity. The Explanatory power criterion is hardly applicable
for these two tools, but as a TT profile exhibits more information about the
KG’s content by showing not
                         Fig. 4: Epicherchell’s profile


only the types and predicates, but also the terms of shared controlled vocabular-
ies that are used, it provides more information about the KG content’s meaning.
This important point could also be considered to improve the Precision criterion.
Regarding Fruitfulness, TT abstract pattern definition allows us to propose a
new form of profile, and other ones could still be defined based on TT APs.
    As already mentioned, TTProfiler is used by OpenArchaeo’s KG producers
who know which predicates, types and terms should appear in the TT profiles
and can detect anomalies in their KG production process. TT profiles are there-
fore a support to the automatic generation of KGs. This interesting aspect is
also cited in [4] among summaries’ applications. During our experiments, the
use of terms from thesauri in particular revealed several errors in the graphs
generated for OpenArchaeo, which could then be corrected. For the
consumers of KGs, TT profiles offer an abstract of the graph content in the same
way as ABSTAT’s outputs, but with the terms added. It is a well-established
practice in humanities and digital libraries to create and use authority lists of
terms, i.e. shared controlled vocabularies, and our experiments demonstrated
how interesting it is for humanist users to explore terms in TT profile visuali-
sations. But the usefulness of terms goes beyond the humanities: categories are
first-class citizens in Wikipedia, of paramount importance for crowdsourcing stake-
holders; folksonomies are also a well known and studied phenomenon. User com-
munities tend to organise themselves to create lists of terms for their needs of
descriptions. Be it in a scholarly and structured way, as in the natural sciences,
humanities and libraries, or simply spontaneously like in the social web, the
phenomenon must be taken into account when trying to give an idea about a
knowledge graph’s content.
    A TT profile can be used to support humans in discovering KG content
to retrieve the information they want. We are currently designing a Web visu-
alisation tool for interactively showing the KG profile, and we also plan to
provide an API for applications to query the profile. We are also studying ways of
completing the TT profile with information from the ontologies, by extracting
the minimal parts of ontologies useful for the profile’s types and predicates. An-
other need demonstrated with our experiments is to build summaries from the
profile for huge KGs. We would like to design algorithms for building summaries
as sets of k connected nodes, for a given number k.

Acknowledgements: This work is supported by the ANR-18-CE38-0009 (“SESAME”).
The authors thank Zilu Yang for her work on the Web visualisation tool.


References
1. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F.
   (eds.): The Description Logic Handbook: Theory, Implementation, and Applica-
   tions. Cambridge University Press, New York, NY, USA (2003)
2. Bikakis, A., Hyvönen, E., Jean, S., Markhoff, B., Mosca, A.: Editorial: Special Issue
   on Semantic Web for Cultural Heritage. Semantic Web 12(2), 163–167 (2021)
3. Boeuf, P.L., Doerr, M., Ore, C., Stead, S., et al.: Definition of the CIDOC Con-
   ceptual Reference Model. version 6.2.1. ICOM/CIDOC Documentation Standards
   Group. CIDOC CRM SIG (2015)
4. Cebiric, S., Goasdoue, F., Kondylakis, H., Kotzinos, D., Manolescu, I., Troullinou,
   G., Zneika, M.: Summarizing Semantic Graphs: A survey. The VLDB Journal (2018)
5. Kuhn, T.: Objectivity, Value Judgment, and Theory Choice. University of Chicago
   Press (1977)
6. Spahiu, B., Porrini, R., Palmonari, M., Rula, A., Maurino, A.: ABSTAT: Ontology-
   Driven Linked Data Summaries with Pattern Minimalization. In: The Semantic
   Web - ESWC 2016 Satellite Events, Revised Selected Papers. pp. 381–395. Springer
   (2016)
7. Troullinou, G., Kondylakis, H., Daskalaki, E., Plexousakis, D.: Ontology under-
   standing without tears: The summarization approach. Semantic Web 8(6), 797–815
   (2017)