=Paper= {{Paper |id=Vol-3262/paper2 |storemode=property |title=Towards improving Wikidata reuse with emerging patterns |pdfUrl=https://ceur-ws.org/Vol-3262/paper2.pdf |volume=Vol-3262 |authors=Valentina Anita Carriero,Paul Groth,Valentina Presutti |dblpUrl=https://dblp.org/rec/conf/semweb/CarrieroGP22 }} ==Towards improving Wikidata reuse with emerging patterns== https://ceur-ws.org/Vol-3262/paper2.pdf
Towards improving Wikidata reuse with emerging patterns
Valentina Anita Carriero¹, Paul Groth² and Valentina Presutti¹
¹ University of Bologna
² University of Amsterdam


Abstract
The ontology underlying Wikidata has not been formalized. Instead, its semantics emerges from the use
of its classes and properties. The Wikidata project has defined flexible rules and suggestions for
the use of its ontology; however, it is still often difficult to reuse the ontology’s constructs. In this paper,
we describe a method for extracting emerging patterns from (a domain-specific portion of) Wikidata, in
the form of statistically frequent domain-property-range triplets. We show the results of our experiments
on a Wikidata subset addressing the music domain, and compare them with the support currently available
in Wikidata. These patterns can provide guidance for the use of the Wikidata ontology and its potential
improvement.




1. Introduction
Wikidata¹ is a collaboratively built knowledge graph (KG) that stores structured data for its
Wikimedia sister projects, including Wikipedia and Wiktionary [15]. Wikidata is edited collaboratively
on a daily basis, and thus contains a rich set of factual statements about entities and events
in the real world. Its underlying ontology is constantly subject to change, owing to frequent
updates by its contributors and to the way they model data.
Due to this bottom-up definition and constant evolution, it can be challenging to
reuse the ontology effectively [10, 4]. While Wikidata does provide some flexible guidelines
on usage (see Section 2), there is still room for additional, more detailed guidance
on how to use the ontology, based on its actual usage.
Hence, in this paper, we develop a method² for extracting what we term emerging
patterns from Wikidata. These patterns are domain specific and consist of frequent
domain-property-range triplets and their usage statistics. We show how these emerging patterns provide
additional information not available from existing guidelines.
   The rest of this paper is organized as follows: In Section 2, we discuss existing constructs and
projects that support the reuse of Wikidata. Section 3 presents related work focusing on
the generation of shapes or data-driven patterns. Section 4 describes our method, while Section
5 shows the results of our experiments on a Wikidata subset on the music domain. Finally,
Section 6 compares our results with the support currently available in Wikidata, and Section 7
discusses future development.

Wikidata’22: Wikidata workshop at ISWC 2022
valentina.carriero3@unibo.it (V. A. Carriero); p.t.groth@uva.nl (P. Groth); valentina.presutti@unibo.it (V. Presutti)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

¹ https://www.wikidata.org/
² Both code and results are available on GitHub: https://github.com/valecarriero/wikidata-emerging-patterns


2. Motivation
We now detail the existing mechanisms for recommending how to use the Wikidata ontology.
Property constraints. The Wikidata community has defined several types of property
constraints³: rules on properties that specify how they should be used,
with possible exceptions. These rules are flexible, aiming to guide the editor and provide
useful suggestions while adding or editing statements; they are informally defined, with
no explicit logical specification, and thus can still be violated or ignored. Popular property constraint
types include type constraint and value-type constraint, which specify that the domain or range
of a property, respectively, should be one in a list of classes. However, unlike OWL property
restrictions on classes, these cannot specify which range classes apply to a given domain class. For example, a triple with
an instance of recurrent event edition as subject, part of the series (wdt:P179) as predicate,
and an instance of collection of articles as object would conform to the property constraints of
wdt:P179, even if a more appropriate range in this case would be the class recurring event.
Properties for this type. The property properties for this type (wdt:P1963) specifies the
properties suggested for instances of a certain type. For example, part of the series is one of
the recommended properties for instances of the type recurrent event edition; however, the
appropriate range(s) to be paired with that specific type are not specified.
Type of Wikidata property. The class Type of Wikidata property (wd:Q107649491) is a
Wikidata metaclass, i.e. a class whose instances are classes that are related to a specific set of
items, domain or topic; the relation with the topic is also expressed through the property facet
of (wdt:P1269). These classes are organised in a hierarchy, and are populated by properties
that can be declared as instances of (more than) one of these classes. For example, the property
Chessgames.com player ID is an instance of the class Wikidata property related to chess that is
a subclass of Wikidata property related to sport. However, (i) this classification is an ongoing
activity, thus it is far from being complete for some domains; (ii) properties relevant to a certain
type may be excluded from the metaclass specific to that set of items because they are relevant
also in more general domains.
Wikidata schemas. The Schemas Wikidata project⁴ aims to define schemas, expressed in
the Shape Expressions language (ShEx), for validating subsets of items in Wikidata, i.e. for checking
whether they conform to a standardised structure. At present, the Wikidata community has
manually defined more than 300 schemas, which vary considerably in size and granularity.
For example, the shape E25 for actors includes 4 constraints, and the only domain-specific
constraint is related to their occupation (actor). By contrast, the shape E42 for authors is much more
detailed, including both constraints that are valid for all humans (shape E10) and author-specific
constraints, such as copyright status. In any case, constraints usually do not express the suggested
range (e.g. notable work in the author shape has a generic IRI as recommended range).

    ³ https://wikidata.org/wiki/Help:Property_constraints_portal
    ⁴ https://wikidata.org/wiki/Wikidata:WikiProject_Schemas

Properties list in a WikiProject. In the context of domain-specific projects, the community
of experts in that domain defines a set of properties that can be used for describing relevant entities.
Each recommended property, listed in a table, is usually accompanied by the data type of its
range (e.g. item, string) and a description of the (usage of the) property, which in some cases also
includes, in plain text, possible types for the range (e.g. artistic inspiration as range of the property
inspired by with written work as domain, in the WikiProject Books⁵). This process is performed
manually, and possible ranges are not always specified.


3. Related work
To help address missing guidance, many approaches exist for generating constraints/definitions
(i.e. shapes) for concepts.
    Some of them (like Astrea [5]) are based only on ontologies, and do not take into account
the data level. However, most methods focus on generating shapes from a set of data. Shape
Designer [3] is a graphical tool for automatically building valid SHACL or ShEx constraints that
are satisfied by an RDF dataset. The cardinality of the triple constraints (exactly one, optional,
at least one, any number) is inferred from the data. However, when working with large KGs such
as Wikidata, the number of query results needs to be limited. Indeed, [11] shows
that existing methods are not able to handle the scale of large KGs like Wikidata, crashing on
KGs with a few million triples⁶. sheXer [6] is an automatic shape extractor able to extract
shapes (serialised in both ShEx and SHACL) by mining the graph structure and exploring the
neighborhood of predefined target nodes. A trustworthiness score makes it possible to filter out infrequent
constraints and to sort/merge the inferred constraints when constructing the resulting shapes. Finally,
some methods exploit knowledge graph profiling, which focuses on producing concise and
meaningful summaries of RDF knowledge graphs, for building shapes. [9] presents a data-driven
approach that, based on machine learning techniques, aims at automatically generating RDF
shapes, as collections of validation rules. Profiled RDF data are used as features, exploiting the
Loupe tool⁷ [8], which provides information about the frequency of triple patterns (in the form
⟨subjectType, predicate, objectType⟩) that appear in a dataset. In [12], the profiles generated
by ABSTAT are converted into SHACL shapes related to the instances of a specified target
class, which can be updated and corrected by a human user. ABSTAT [13, 1] is a profiling tool
that generates a semantic profile, starting from a knowledge graph and optionally an ontology
used in the KG: this profile is composed of Abstract Knowledge Patterns (AKPs), associated with
their occurrences, where subjectType is the most specific type of the subject and objectType is
the most specific type of the object; more generic, redundant patterns are excluded by using the
ontology, if any.
    As highlighted in [11], all approaches supporting automatic generation of shapes produce so
many shape constraints that it is non-trivial to verify their validity. Moreover,
in most cases no constraint is generated for non-literal objects (e.g. to indicate that objects of a
property should be of a specific type). Some tools currently used by the Wikidata community for
inserting new data suffer from a similar problem: for instance, Recoin⁸ recommends properties
for a class based on their frequency in the data, and reports frequent properties that are missing
for instances of a specific type, but lacks information on the appropriate ranges.

    ⁵ https://wikidata.org/wiki/Wikidata:WikiProject_Books
    ⁶ We applied our method to a subKG of Wikidata with more than 5 million triples.
    ⁷ http://loupe.linkeddata.es/

   The closest work to ours, based on statistical measures similar to the ones used for generating
data-driven shapes, is described in [16, 2]. The authors develop a method for extracting Statistical
Knowledge Patterns (SKPs) from KGs. An SKP is expressed in OWL and constructed around
one main class from an ontology: it enriches the properties and axioms involving the class from
the ontology with properties and axioms that can be induced from statistical measures. The
most frequent (based on a threshold) properties in the data are selected, and the appropriate
range(s) is/are provided if they are not explicitly asserted in the ontology. A catalogue with 34
SKPs extracted from a version of DBpedia is online⁹, but the method described in the paper
has not been published, so it is not possible to reproduce their results. Moreover, no metadata
about the actual usage of the selected properties is present in the SKPs.


4. Method
We now describe our method for extracting emerging patterns from Wikidata. An overview of
the method is shown in Figure 1.




Figure 1: Method for extracting emerging patterns from a domain Wikidata KG.


Select relevant entities from the Wikidata subgraph. The first step of the method takes
as input the domain subgraph and counts the number of instances for each instantiated class
of the graph, i.e. it counts all the wdt:P31 triples for each class. Then, a threshold given as
input is used to filter out all the classes whose number of instances falls below it. The
selected classes are used to generate the emerging patterns.
   The threshold is based on the absolute distance between the number of instances of a given
class and the number of instances of the most instantiated class (i.e. the maximum number
in the distribution of counts). This distance is then normalized by dividing it by the
maximum value, so that the threshold T𝑐 falls within the range [0, 1]: the closer the
threshold is to 1, the more classes will be selected. If T𝑐 is equal to 0, only
the most instantiated class will be selected (the distance between the count of a given class and
the maximum count must be smaller than or equal to 0); if T𝑐 is equal to 1, all classes
will be selected (the distance between the count of a given class and the maximum count must
be smaller than or equal to the maximum count).

   ⁸ https://www.wikidata.org/wiki/Wikidata:Recoin
   ⁹ http://www.ontologydesignpatterns.org/skp/
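
The class-selection rule described above can be read as: keep a class when (max − count) / max ≤ T𝑐. The following is a minimal Python sketch under that reading, not the authors' implementation. The function name `select_classes` and the class `Q_rare` with its count are hypothetical; the other three counts are the instance counts reported in Table 1.

```python
def select_classes(instance_counts, t_c):
    """Keep classes whose normalized distance from the maximum count is <= t_c."""
    if not 0.0 <= t_c <= 1.0:
        raise ValueError("t_c must lie in [0, 1]")
    max_count = max(instance_counts.values())
    selected = {}
    for cls, count in instance_counts.items():
        # Absolute distance from the most instantiated class, normalized to [0, 1].
        distance = (max_count - count) / max_count
        if distance <= t_c:
            selected[cls] = count
    return selected


counts = {
    "Q5 human": 63594,
    "Q482994 album": 63213,
    "Q18127 record label": 3640,
    "Q_rare (hypothetical)": 150,  # made-up class, for illustration only
}

for t_c in (0.0, 0.95, 1.0):
    print(t_c, sorted(select_classes(counts, t_c)))
```

With T𝑐 = 0.95 the rule is equivalent to requiring at least 5% of the maximum count, so the hypothetical rare class is dropped while record label is kept.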
Extract a subgraph for each of those entities. Once the list of classes is obtained, we build a
subgraph for each class, by selecting from the domain subgraph only the triples that have an instance
of the given class (or of one of its subclasses) as subject. For example, for the class album, a
subgraph containing all the triples about instances of album or of one of its subclasses is built.
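
The subgraph extraction step can be sketched as follows, assuming triples are plain `(subject, property, object)` tuples held in memory and that "one of its subclasses" means the transitive closure over wdt:P279. Both are assumptions made for illustration; the function names and the toy identifiers (`a1`, `live_album`, and so on) are hypothetical.

```python
from collections import defaultdict

P31, P279 = "P31", "P279"  # instance of, subclass of


def subclass_closure(triples, root):
    """Return root plus every class that reaches it through P279 edges."""
    children = defaultdict(set)
    for s, p, o in triples:
        if p == P279:
            children[o].add(s)
    seen, stack = {root}, [root]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def class_subgraph(triples, root):
    """Keep triples whose subject is an instance of root or of one of its subclasses."""
    classes = subclass_closure(triples, root)
    instances = {s for s, p, o in triples if p == P31 and o in classes}
    return [t for t in triples if t[0] in instances]


# Toy triples; every identifier except P31/P279 is made up for illustration.
triples = [
    ("live_album", P279, "album"),
    ("a1", P31, "album"),
    ("a1", "P175", "artist1"),
    ("a2", P31, "live_album"),
    ("a2", "P264", "label1"),
    ("artist1", P31, "human"),
]
album_graph = class_subgraph(triples, "album")
```

Here `album_graph` keeps the triples about `a1` and `a2` (the latter through the subclass) and drops the triple about `artist1`.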
Most frequent properties for each class. At this stage, the occurrences of all the properties
instantiated in each subgraph are computed, i.e. for each property we count the number of distinct instances that have at
least one triple involving it. Then, for each subgraph, we select only the
most common properties based on a threshold T𝑝 given as input. Note that in this step we
discard the properties instance of (wdt:P31) and subclass of (wdt:P279) from the statistics.
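
A minimal sketch of this step is given below. Counting distinct subjects per property follows the text; the selection rule, however, is an assumption: the text does not spell out how T𝑝 is applied, so the sketch reuses the normalized-distance rule given for T𝑐, this time relative to the most frequent property. All names and the toy triples are hypothetical.

```python
from collections import defaultdict

EXCLUDED = {"P31", "P279"}  # instance of, subclass of


def property_occurrences(triples):
    """Count, per property, the distinct subjects with at least one triple using it."""
    subjects = defaultdict(set)
    for s, p, _ in triples:
        if p not in EXCLUDED:
            subjects[p].add(s)
    return {p: len(subs) for p, subs in subjects.items()}


def frequent_properties(triples, t_p):
    """Assumed rule: keep property p when (max - occ[p]) / max <= t_p."""
    occ = property_occurrences(triples)
    if not occ:
        return {}
    top = max(occ.values())
    return {p: n for p, n in occ.items() if (top - n) / top <= t_p}


# Toy album subgraph; subjects and objects are made up for illustration.
triples = [
    ("a1", "P31", "album"), ("a1", "P136", "rock"), ("a1", "P136", "pop"),
    ("a2", "P31", "album"), ("a2", "P136", "jazz"), ("a2", "P264", "label1"),
    ("a3", "P31", "album"), ("a3", "P136", "folk"),
    ("a4", "P31", "album"), ("a4", "P136", "rock"), ("a4", "P1476", "title"),
]
print(frequent_properties(triples, 0.5))
```

Note that `a1` has two genre statements but contributes only once to the count for P136, since occurrences are counted per instance rather than per triple.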
Most frequent ranges for each frequent property. We compute all the domain-property-range
triplets in each subgraph, where domain is the type (wdt:P31) of the subject and range
is either the type of the object (when the object is a wikibase-item) or the Wikidata data type
(e.g. time, monolingual text).¹⁰ The occurrences of each triplet are then counted to find the most
common domain-range pairs for each property. Again, a threshold T𝑑𝑟 selects the most common
domain-range pairs for each of the most common properties selected in the previous step.
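
The triplet computation can be sketched as below. Several points are assumptions rather than details confirmed by the text: occurrences are counted as distinct subject instances, T𝑑𝑟 is applied per property with the same normalized-distance rule used for T𝑐, and the `types` and `datatypes` lookups are hypothetical in-memory dictionaries.

```python
from collections import defaultdict

EXCLUDED = {"P31", "P279"}  # instance of, subclass of


def count_triplets(triples, types, datatypes):
    """Count distinct subjects per (domain, property, range) triplet.

    types: item -> set of its P31 classes.
    datatypes: property -> Wikidata data type, used when the object is not an item.
    """
    subjects = defaultdict(set)
    for s, p, o in triples:
        if p in EXCLUDED:
            continue
        # Range is the object's type(s) if it is a typed item, else the data type.
        ranges = types.get(o) or {datatypes.get(p, "unknown")}
        for d in types.get(s, ()):
            for r in ranges:
                subjects[(d, p, r)].add(s)
    return {key: len(subs) for key, subs in subjects.items()}


def frequent_ranges(counts, t_dr):
    """Assumed rule: per property, keep pairs with (max - n) / max <= t_dr."""
    by_prop = defaultdict(dict)
    for (d, p, r), n in counts.items():
        by_prop[p][(d, r)] = n
    selected = {}
    for p, pairs in by_prop.items():
        top = max(pairs.values())
        selected[p] = {dr: n for dr, n in pairs.items() if (top - n) / top <= t_dr}
    return selected


# Toy data, made up for illustration.
types = {"a1": {"album"}, "a2": {"album"}, "a3": {"album"},
         "h1": {"human"}, "g1": {"musical group"}, "x1": {"fictional character"}}
datatypes = {"P577": "time"}
triples = [
    ("a1", "P175", "h1"), ("a2", "P175", "h1"), ("a3", "P175", "h1"),
    ("a1", "P175", "g1"), ("a2", "P175", "g1"), ("a3", "P175", "x1"),
    ("a1", "P577", "2001-01-01"), ("a2", "P577", "1999-05-05"),
]
counts = count_triplets(triples, types, datatypes)
selected = frequent_ranges(counts, 0.5)
```

In this toy example the performer property (P175) keeps both human and musical group as ranges, while the rarer fictional character range is filtered out.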


5. Results
In this section, we discuss the results of applying our method to the Wikidata subgraph on the
music domain (see below)².

5.1. Input
In order to deal with the size of Wikidata, we use the recently developed Knowledge Graph
Toolkit (KGTK)¹¹ [7]. KGTK is a Python library for manipulating KGs, a comprehensive
framework designed for ease of use, scalability, and speed. This tool allows us to avoid reaching
the query timeout limit on the SPARQL endpoint for some of our queries. We work with a JSON
dump of Wikidata¹², downloaded on 04-04-2022. We focus on a specific domain represented
in Wikidata in order to extract domain-dependent patterns, and to handle a more manageable
subgraph of Wikidata. While we choose to work on the music domain, the method can be
applied to any domain. The extraction of instances related to the music domain is based on a
list of WordNet and BabelNet synsets identified as belonging to the music domain, according
to BabelDomains [14]. Then, the Wikidata subgraph on music is extracted by selecting each
triple that has a Wikidata music instance in the subject position. The different thresholds
have been chosen after running some experiments, in order to extract reasonably representative
patterns from the music domain.

   ¹⁰ See https://www.wikidata.org/wiki/Special:ListDatatypes
   ¹¹ https://kgtk.readthedocs.io/en/latest/
   ¹² https://dumps.wikimedia.org/wikidatawiki/entities/

5.2. Wikidata emerging patterns on music
Most populated classes: music patterns. The threshold T𝑐 we use for the Wikidata music
subgraph is 0.95, thus we filter out all classes that have fewer instances than 5% of the
number of instances of the most instantiated class (from a total of 6,043 classes, ∼6,000
of which have fewer than 200 instances). Note that the same entity can be an instance of more than
one class.

Table 1
Most populated classes in the Wikidata music subKG.
                               Class                      Instances      Triples
                Q5 human                                    63,594      2,348,331
                Q482994 album                               63,213       723,722
                Q215380 musical group                       25,016       527,537
                Q134556 single                              20,977       253,201
                Q105543609 musical work/composition         14,600       198,841
                Q169930 extended play                       3,816         33,725
                Q18127 record label                         3,640         35,118

   Table 1 lists the 7 classes around which we build our patterns, along with their number of
instances and the number of triples with an instance of the class (or one of its subclasses) as
subject. The most relevant entities in the Wikidata music domain include both agents (human,
musical group) and objects (single, album, musical work, extended play, record label). Note
that wd:Q134556 single and wd:Q169930 extended play are not subclasses of wd:Q105543609
musical work/composition in the Wikidata hierarchy (wdt:P279*). By looking at the ratio
between the number of triples and the number of instances, we can observe at first sight that
e.g. humans are described with more facts than albums, considering that the number of
respective instances is roughly equal.
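
The ratio mentioned above can be worked out directly from the figures in Table 1. The short sketch below only does that arithmetic; since the triples column also covers instances of subclasses, the ratio is a rough indicator rather than an exact per-instance average.

```python
# (instances, triples) as reported in Table 1.
table1 = {
    "Q5 human": (63594, 2348331),
    "Q482994 album": (63213, 723722),
    "Q215380 musical group": (25016, 527537),
    "Q134556 single": (20977, 253201),
    "Q105543609 musical work/composition": (14600, 198841),
    "Q169930 extended play": (3816, 33725),
    "Q18127 record label": (3640, 35118),
}

# Approximate number of triples per instance for each class.
ratios = {cls: triples / instances for cls, (instances, triples) in table1.items()}

for cls, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{cls}: {ratio:.1f} triples per instance")
```

Humans come out at roughly 37 triples per instance against roughly 11 for albums, which is the gap the paragraph above points to.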
Recommended properties for each pattern. The threshold T𝑝 we use for selecting the most
frequent properties for each pattern is 0.85. The average number of selected properties for
each pattern is ∼21. Table 2 reports the actual number of selected properties for each
pattern, and the maximum and minimum number of occurrences within this set of properties,
where occurrences are defined as the number of instances that are subject of at least one triple involving a specific
property. Note that the number of recommended properties is not directly proportional to the
number of triples in the subKG: for instance, musical groups have more properties that are
frequently used (selected out of a total of 891 properties) than albums (369 properties in total).
The most common properties across all patterns (except for IDs) are: wdt:P136 genre, which is
recommended for all patterns, and wdt:P264 record label, present in all patterns except record
label.
Recommended ranges for each property. For selecting the most frequent ranges for each
recommended property, we set the threshold T𝑑𝑟 to 0.5. Datatype properties have only one
range in any case. Table 2 reports the number of triplets ⟨d, p, r⟩ (that is, the domain d and
range r pairs for each recommended property p) selected for each pattern. The average number
of triplets across all patterns is ∼29. Since the same property can be involved in more than
one pattern, a property can have different recommended ranges depending on the specific pattern,
except for datatype properties. That is, range recommendations are local to the pattern. For
instance, both the album and single patterns include the property wdt:P155 follows, with album
and single as range, respectively.

Table 2
Statistics of selected properties and triplets for each pattern.
            Class                                 Properties   Occurrences (max)   Occurrences (min)   Triplets
            Q5 human                                  48            63,583               9,543            63
            Q482994 album                             14            61,772              11,735            18
            Q215380 musical group                     33            22,423               3,474            38
            Q134556 single                            15            20,860               5,076            22
            Q105543609 musical work/composition       17            13,916               2,204            29
            Q169930 extended play                     10             3,793                 650            12
            Q18127 record label                       11             3,577                 625            20

Figure 2: The album pattern.

Example: the album pattern. In Figure 2 we provide a graphical representation of the pattern
for albums. Each domain-property-range triplet is associated with the number of instances in
the Wikidata subKG that comply with that triplet. Based on the 0.5 threshold, most properties
have only one recommended range. However, the performer can be both a human and a musical
group, and the 3 selected ranges of language of work or name are related to each other by subclass-of. Note
that 4 recommended properties have other frequent patterns as recommended
ranges (record label, human, musical group).
6. Discussion
Patterns coverage. In order to understand how the extracted patterns are populated in
the Wikidata subKG, we report in Table 3¹³ the percentage of the total instances covering
different (increasingly large) subsets of the recommended properties. No pattern has 100% coverage, even
considering only the most frequent property; in some cases, the coverage of the two most common
properties is very close to that of the first property alone (e.g. human), while
in others (e.g. musical work) it decreases significantly. In 4 out of 7 patterns, the instances
populating the first half (1/2) of the recommended properties are between 35% and ∼58% of
the total number of instances; by contrast, humans, musical works and musical groups already have a very
low coverage. The most populated pattern (considering all properties), relative to the total number
of instances, is extended play (112/3,816), followed by album (845/63,213) and musical group
(327/25,016). The pattern with the lowest percentage of coverage is musical work (1/14,600).
The coverage percentages might appear very low; however, this is not surprising: by using the
0.85 threshold, we include all properties that are used by at least 15% of the total number of
instances. If the least common property is used for e.g. 625/3,577 instances (see record label), it
is not surprising that the intersection of instances with all 11 properties is equal to 28 instances.
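
The coverage measure discussed above can be sketched as the share of instances that use every one of the k most frequent recommended properties. The sketch assumes that this is how the Table 3 columns are computed; the function name and the toy record-label data are hypothetical.

```python
def coverage(instance_props, ranked_props, k):
    """Percentage of instances that use every one of the k most frequent properties."""
    required = set(ranked_props[:k])
    hits = sum(1 for props in instance_props.values() if required <= props)
    return 100.0 * hits / len(instance_props)


# Toy data: four made-up record labels and the properties each one uses.
instance_props = {
    "l1": {"P17", "P571", "P856"},
    "l2": {"P17", "P571"},
    "l3": {"P17"},
    "l4": {"P571"},
}
ranked = ["P17", "P571", "P856"]  # recommended properties, most frequent first

for k in range(1, len(ranked) + 1):
    print(k, coverage(instance_props, ranked, k))
```

Because each added property can only shrink the set of instances that have all of them, coverage never increases with k, which is why requiring a whole pattern leaves so few instances.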
Comparison with property constraints. Let us consider the most common properties across
all patterns. The domains and ranges we suggest for the properties genre (7/7 patterns) and
record label (6/7) are all included in the type and value-type constraints of the two properties;
still, in some cases, the constraints suggest a superclass as range, e.g. work in place of the more
specific musical work. However, as we explained in Sec. 2, the correct pairs of domain and
range cannot be specified, thus our method can complement these constraints by suggesting that
e.g. music genre is more appropriate as range of genre with record label as domain than e.g. criticism
(included in the value-type constraint of genre), which never occurs in the data. Moreover,
not all properties define these constraints: e.g. follows (4/7 patterns) has no type/value-type
constraints.

Table 3
Percentages of coverage of the patterns properties in the KG.
           Class              1 prop     2 props       1/8          1/4         1/2                  all
  Q5 human                     99.98      98.99     [8] 50.34   [12] 32.97   [24] 3.65    [48] 0.007 (5 instances)
  Q482994 album                97.72      94.19     [2] 94.19    [4] 78.30   [7] 40.48   [14] 1.33 (845 instances)
  Q215380 musical group        89.63      78.36     [4] 60.99    [8] 34.22   [16] 9.82   [33] 1.31 (327 instances)
  Q134556 single               99.44      98.80     [2] 98.80    [4] 87.87   [7] 57.67   [15] 0.71 (151 instances)
  Q105543609 musical work      95.31      76.36     [2] 76.36    [4] 39.69    [8] 6.34     [17] 0.006 (1 instance)
  Q169930 extended play        99.39      97.95     [1] 99.39    [3] 92.29   [5] 56.70   [10] 2.93 (112 instances)
  Q18127 record label          98.26      84.25     [1] 98.26    [3] 69.06    [5] 35.0    [11] 0.76 (28 instances)


    ¹³ Columns: the number/fraction of properties considered. The actual number of properties corresponding to
the fraction is in square brackets. The actual number of instances covering the whole pattern is in round brackets.
Example instances populating the whole patterns: https://github.com/valecarriero/wikidata-emerging-patterns/tree/main/results/supplementary_materials/example_instances

Comparison with properties for this type. Taking into account the 7 most populated classes
in the music Wikidata subgraph, we compared the properties included
in our patterns with the properties listed as values of the property properties for this type
(wdt:P1963) for those classes. By manual inspection, we observed that some properties highly instantiated
in the data are not listed as properties for this type, while all the properties suggested as
properties for this type but excluded from our patterns are significantly less frequent, and
sometimes have a very low number of occurrences. Take musical group (wd:Q215380) as an
example¹⁴. Identifier properties such as Freebase ID, MusicBrainz artist ID and Discogs artist ID
are widely used (about 81%, 75% and 74% respectively), but are not included as properties for this type.
Instead, IDs less frequently associated with musical groups in the data (e.g. Apple Music artist
ID (U.S. version), ∼6.5%), hence filtered out from our pattern, are recommended. As another
example, properties such as influenced by and award received are recommended, while they are
discarded from our pattern because of their very low frequency (less than 0.5% and 2% respectively).
10 out of the 18 properties recommended as properties for this type are also included in our
pattern.
Comparison with Type of Wikidata property. A subclass of Type of Wikidata property is
specifically dedicated to properties related to music (wd:Q27525351) and includes as instances
properties such as music-related IDs (e.g. YouTube playlist ID) and other specific relations (e.g.
composer, performed at). However, the possible domains of such properties are not specified,
so it is difficult for a user who needs to model a specific musical entity to understand
which properties to use. 24 subclasses of Wikidata property related to music are specific to
some musical entities (e.g. music genres, songs, instruments). However, 14 of them group
only identifiers, e.g. for songs and bands. Even considering just the IDs, our patterns are more
complete and representative. For instance, the class Wikidata property to identify bands, facet
of musical group, includes only the property Encyclopaedia Metallum band ID (wdt:P1952).
The pattern we extracted for the class musical group contains 33 properties, including the
most common IDs, while excluding wdt:P1952, which is used with only 8% of musical groups.
Moreover, some relevant properties that we are able to include in the patterns are difficult to
identify for reuse based on the Wikidata property classes: for instance, genre, which is widely
used for musicians (about 50%) and musical works (about 60%), is included in both the human
and musical work/composition patterns, while it can only be found under the more general
classes Wikidata property for items about people and for items about works.
Comparison with properties listed in the WikiProject Music. The WikiProject Music¹⁵
(WPM hereinafter) defines a set of properties for 6 relevant entities in the domain: human,
musical ensemble, musical work, track, release, record label. Apart from human and record label,
these entities do not perfectly overlap with our patterns: musical ensemble vs musical group (the latter being the
most populated subclass of the former); musical work vs musical work/composition (musical
work has very few direct instances, while being a class with many instantiated subclasses, e.g.
song); release, which groups together its subclasses album, single and extended play. However,
it is still useful to try to compare them. Let us take record label as an example: the WPM
recommends 4 properties in addition to 13 identifiers. Our pattern contains 11 properties, 6
of which are identifiers. Table 4 compares the properties recommended by WPM and by our
method (EP), leaving out IDs and instance of (wdt:P31), which is included by WPM but which we
always exclude from our patterns. It can be observed that our pattern
is more inclusive (6 vs 3 properties). Our pattern contains all the properties recommended by WPM, while
WPM does not include inception, even though it is the second most frequent property. In our pattern,
country (recommended as range of the property country by WPM) is the most frequent range,
but we also suggest 6 more specific classes such as sovereign state.

   ¹⁴ https://github.com/valecarriero/wikidata-music-odp/blob/main/results/supplementary_materials/properties_forthis_type/Q215380_properties_comparison.tsv
   ¹⁵ https://wikidata.org/wiki/Wikidata:WikiProject_Music

Table 4
Comparison between properties recommended by WikiProject Music and properties included in our
pattern for record labels.
                          Property                         Occurrences        WPM      EP
                          P17 country                         3,123            Y        Y
                          P571 inception                      2,905            N        Y
                          P856 official website               1,833            Y        Y
                          P159 headquarters location          1,023            Y        Y
                          P136 genre                           972             N        Y
                          P112 founded by                      714             N        Y



Table 5
Comparison between properties recommended by WikiProject Music and properties included in our
patterns for releases.
    WPM Property                 EP       WPM Property               EP       WPM Property           EP
    P577 publication date       A, S, P   P136 genre                A, S, P   P156 followed by      A, S, P
    P155 follows                A, S, P   P264 record label         A, S, P   P175 performer        A, S, P
    P162 producer                A, S     P407 language of work        P      P361 part of             S
    P1303 instrument            none      P483 recorded at studio   none      P676 lyrics by        none
    P86 composer                none      P658 tracklist            none      P736 cover art by     none
    P2291 charted in            none      P9237 reissue of          none      P1638 working title   none


We now compare the Release properties recommended by WPM with our patterns for album (A), single (S)
and extended play (P) (Table 5). Six properties recommended by WPM for releases are included in
all our patterns, three are included in a subset of our patterns, and nine are not included in
any of them. However, some of the missing properties are rarely used: e.g. composer is used only
6, 186 and 1,602 times for extended plays, albums and singles, respectively; instrument is never
used for these entities (instead, it is included in the human pattern); working title is used only
twice for albums, while the property title (wdt:P1476) is used far more often (9,007 occurrences).
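The tally reported above can be reproduced directly from Table 5. The following is a minimal sketch, not part of the original experiments: the dictionary is a manual transcription of Table 5, and the names `TABLE_5`, `ALL_PATTERNS` and `tally` are hypothetical, introduced only for illustration.

```python
# Table 5 transcribed by hand: each WPM-recommended release property is mapped
# to the set of emerging patterns that include it
# (A = album, S = single, P = extended play; an empty set means "none").
TABLE_5 = {
    "P577 publication date": {"A", "S", "P"},
    "P136 genre": {"A", "S", "P"},
    "P156 followed by": {"A", "S", "P"},
    "P155 follows": {"A", "S", "P"},
    "P264 record label": {"A", "S", "P"},
    "P175 performer": {"A", "S", "P"},
    "P162 producer": {"A", "S"},
    "P407 language of work": {"P"},
    "P361 part of": {"S"},
    "P1303 instrument": set(),
    "P483 recorded at studio": set(),
    "P676 lyrics by": set(),
    "P86 composer": set(),
    "P658 tracklist": set(),
    "P736 cover art by": set(),
    "P2291 charted in": set(),
    "P9237 reissue of": set(),
    "P1638 working title": set(),
}

ALL_PATTERNS = {"A", "S", "P"}


def tally(table):
    """Count properties covered by all patterns, by some of them, or by none."""
    counts = {"all": 0, "subset": 0, "none": 0}
    for patterns in table.values():
        if patterns == ALL_PATTERNS:
            counts["all"] += 1
        elif patterns:
            counts["subset"] += 1
        else:
            counts["none"] += 1
    return counts


print(tally(TABLE_5))  # {'all': 6, 'subset': 3, 'none': 9}
```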
Comparison with music-related shapes. The only music-related shapes we were able to
manually identify from the list of Wikidata entity schemas16 are E66 music composition by
W.A. Mozart and E248 album, so there is room for improvement with respect to coverage of the
music domain. The album shape recommends 18 properties as mandatory (exactly one/at least one):
7 of the 18 are also included in our pattern. While some recommended properties may be statistically
relevant (e.g. title: 9,007/63,213 occurrences) and would have been included in our pattern with
a little higher threshold, other properties have very few occurrences that do not justify their
obligatory use (e.g. review score 722 and distributed by 299). In contrast, the producer property
(with any number as cardinality constraint) is used far more often (18,362 occurrences) and is
included in our pattern.
   16
        https://wikidata.org/wiki/User:HakanIST/EntitySchemaList
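To make the occurrence counts above easier to compare, they can be expressed as relative frequencies. This is only an illustrative sketch: it assumes that 63,213 is the total against which album property occurrences are counted (our reading of "9,007/63,213") and that all four counts refer to albums. The names `ALBUM_TOTAL`, `OCCURRENCES` and `relative_frequency` are hypothetical, and no inclusion threshold is implied.

```python
# Assumption: 63,213 is the denominator for album property occurrences,
# as suggested by "title: 9,007/63,213 occurrences" in the text.
ALBUM_TOTAL = 63_213

# Occurrence counts quoted in the text for the album shape comparison.
OCCURRENCES = {
    "producer": 18_362,
    "title": 9_007,
    "review score": 722,
    "distributed by": 299,
}


def relative_frequency(count, total):
    """Share of the total accounted for by a property's occurrences."""
    return count / total


# Print the properties from most to least used, with their share of the total.
for name, count in sorted(OCCURRENCES.items(), key=lambda kv: kv[1], reverse=True):
    share = relative_frequency(count, ALBUM_TOTAL)
    print(f"{name:15s} {count:>6,d} {share:6.1%}")
```

Under these assumptions, producer covers roughly 29% of the total and title roughly 14%, whereas review score and distributed by stay at about 1% or below.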


7. Conclusion and future work
In this paper, we presented a method for extracting emerging patterns from Wikidata, in the
form of statistically frequent domain-property-range triplets. Experiments on the music domain
demonstrated how these patterns can support the reuse of the Wikidata ontology. These
patterns can also support current WikiProjects aiming at defining properties that can be used
by domain-specific infoboxes (as shown for the WikiProject Music), and could work as an input
for new WikiProjects on under-documented subject areas.
As future work, we would like to transform these patterns into OWL ontology design patterns,
by defining the appropriate axioms for each relevant triplet; in this way, our Wikidata patterns
could be mapped to state-of-the-art ODPs (e.g. from http://www.ontologydesignpatterns.org/).
Moreover, we would like to test the method with domains other than music. Such analysis could
suggest ways to identify domain-specific properties and keep them separate from properties
associated with entities relevant to multiple domains. Extending this method to KGs other than
Wikidata is also an important direction forward.
   Acknowledgements. This work has been enabled by the H2020 Project Polifonia: a digital
harmoniser for musical heritage knowledge, funded by the European Commission (grant number
101004746).


References
 [1] Renzo Arturo Alva Principe et al. “ABSTAT-HD: a scalable tool for profiling very large
     knowledge graphs”. In: VLDB Journal (2021), pp. 1–26.
 [2] Eva Blomqvist et al. “Statistical Knowledge Patterns for Characterising Linked Data”. In:
     WOP co-located with ISWC. Vol. 1188. CEUR-WS.org.
 [3] Iovka Boneva et al. “Shape Designer for ShEx and SHACL constraints”. In: ISWC (Posters
     & Demonstrations, Industry, and Outrageous Ideas). Vol. 2456. CEUR-WS.org, 2019, pp. 269–
     272.
 [4] Freddy Brasileiro et al. “Applying a Multi-Level Modeling Theory to Assess Taxonomic
     Hierarchies in Wikidata”. In: WWW ’16 Companion. 2016, pp. 975–980.
 [5] Andrea Cimmino, Alba Fernández-Izquierdo, and Raúl García-Castro. “Astrea: Automatic
     Generation of SHACL Shapes from Ontologies”. In: ESWC. Vol. 12123. 2020, pp. 497–513.
 [6] Daniel Fernandez-Álvarez, Jose Emilio Labra-Gayo, and Daniel Gayo-Avello. “Automatic
     extraction of shapes using sheXer”. In: Knowledge-Based Systems (2021), p. 107975.
 [7] Filip Ilievski et al. “KGTK: A Toolkit for Large Knowledge Graph Manipulation and
     Analysis”. In: ISWC. Springer. 2020, pp. 278–293.
 [8] Nandana Mihindukulasooriya et al. “Loupe-An Online Tool for Inspecting Datasets in
     the Linked Data Cloud.” In: ISWC (Posters & Demos). Vol. 1. 1. 2015, p. 2.
 [9] Nandana Mihindukulasooriya et al. “RDF shape induction using knowledge base profiling”.
     In: 33rd Annual ACM Symposium on Applied Computing. 2018, pp. 1952–1959.
[10]   Alessandro Piscopo and Elena Simperl. “Who Models the World? Collaborative Ontology
       Creation and User Roles in Wikidata”. In: Proc. ACM Hum.-Comput. Interact. 2.CSCW (Nov.
       2018).
[11]   Kashif Rabbani, Matteo Lissandrini, and Katja Hose. “SHACL and ShEx in the Wild:
       A Community Survey on Validating Shapes Generation and Adoption”. In: The Web
       Conference. 2022.
[12]   Blerina Spahiu, Andrea Maurino, and Matteo Palmonari. “Towards Improving the Quality
       of Knowledge Graphs with Data-driven Ontology Patterns and SHACL”. In: WOP co-
       located with ISWC. Vol. 2195. CEUR-WS.org, 2018, pp. 52–66.
[13]   Blerina Spahiu et al. “ABSTAT: Ontology-driven Linked Data Summaries with Pattern
       Minimalization”. In: SumPre co-located with ESWC. Ed. by Andreas Thalhammer, Gong
       Cheng, and Kalpa Gunaratna. Vol. 1605. CEUR Workshop Proceedings. 2016.
[14]   Rocco Tripodi et al. Plurilingual corpora containing source texts in English, French, Spanish
       and German (v1.0). Deliverable 4.1. Polifonia Grant 101004746, 2021.
[15]   Denny Vrandečić and Markus Krötzsch. “Wikidata: A Free Collaborative Knowledgebase”.
       In: Commun. ACM 57.10 (Sept. 2014), pp. 78–85.
[16]   Ziqi Zhang et al. “Statistical knowledge patterns: Identifying synonymous relations in
       large linked datasets”. In: ISWC. Springer. 2013, pp. 703–719.