<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Large-scale Reasoning with a Complex Cultural Heritage Ontology (CIDOC CRM)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vladimir Alexiev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimitar Manov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jana Parvanova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Svetoslav Petrov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ontotext Corp</institution>
          ,
          <addr-line>Sofia</addr-line>
          ,
          <country country="BG">Bulgaria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The CIDOC Conceptual Reference Model (CRM) is an important ontology in the Cultural Heritage (CH) domain. CRM is intended mostly as a data integration mechanism, allowing reasoning and discoverability across diverse CH sources represented in CRM. CRM data comprises complex graphs of nodes and properties. An important question is how to search through such complex graphs, since the number of possible combinations is staggering. One answer is the "Fundamental Relations" (FR) approach that maps whole networks of CRM properties to fewer FRs, serving as a "search index" over the CRM semantic web. We present performance results for an FR Search implementation based on OWLIM. This search works over a significant CH dataset: almost 1B statements resulting from 2M objects of the British Museum. This is an exciting demonstration of large-scale reasoning with real-world data over a complex ontology (CIDOC CRM). We present volumetrics and hardware specs, compare the numbers to other repositories hosted by Ontotext, report performance results, and compare them to the performance of a SPARQL implementation.</p>
      </abstract>
      <kwd-group>
        <kwd>CIDOC CRM</kwd>
        <kwd>cultural heritage</kwd>
        <kwd>semantic search</kwd>
        <kwd>Fundamental Relations</kwd>
        <kwd>OWLIM</kwd>
        <kwd>semantic repository</kwd>
        <kwd>inference</kwd>
        <kwd>performance</kwd>
        <kwd>benchmark</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The CIDOC Conceptual Reference Model (CRM)<sup>1</sup> is an important ontology in the
Cultural Heritage (CH) domain. CRM is intended mostly as a data integration
mechanism, allowing reasoning and discoverability across diverse CH sources represented
in CRM. CRM data comprises complex graphs of nodes and properties. An important
question is how to search through such complex graphs, since the number of possible
combinations is staggering. The "Fundamental Relations" (FR) approach
        [<xref ref-type="bibr" rid="ref2 ref3">2,3</xref>]
"compresses" the semantic network by mapping whole networks of CRM properties to
fewer FRs that serve as a "search index" over the CRM semantic web and let the
user work with a simpler query vocabulary.
      </p>
    </sec>
    <sec id="sec-2">
      <p><sup>1</sup> http://www.cidoc-crm.org</p>
      <p>
        In [<xref ref-type="bibr" rid="ref1">1</xref>]
        we published an implementation of CRM FRs created within the
ResearchSpace (RS) project<sup>2</sup> using OWLIM<sup>3</sup>
        [<xref ref-type="bibr" rid="ref4">4</xref>],
        and presented preliminary performance
results. Here we present a revised implementation, provide volumetrics and hardware
specs, report performance results over the full CRM repository, which comprises over 2M CH
objects of the British Museum (BM), compare the numbers to other repositories
hosted by Ontotext, and compare the performance to that of an implementation based on SPARQL.
      </p>
      <sec id="sec-2-1">
        <title>2 Specifics</title>
        <sec id="sec-2-1-1">
          <title>2.1 Implemented FRs</title>
          <p>
            We implemented the following FRs. Compared to
            [<xref ref-type="bibr" rid="ref1">1</xref>],
            these have been adjusted following initial
experimentation and user experience gained in RS. Each FR has domain Thing and
the range indicated in parentheses. rso:E55_Technique is a subclass of crm:E55_Type
that we use for focused searching of Techniques. The last 7 FRs (17-23) are special
extensions:
1. rso:FR92i_created_by (crm:E39_Actor): Thing (or part/inscription thereof) was created or modified/repaired by Actor (or group it is member of, e.g. Nationality)
2. rso:FR15_influenced_by (crm:E39_Actor): Thing's production was influenced/motivated by Actor (or group it is member of). E.g.: Manner/School/Style of; or Issuer, Ruler, Magistrate who authorised, patronised, ordered the production.
3. rso:FR52_current_owner_keeper (crm:E39_Actor): Thing has current owner or keeper Actor
4. rso:FR51_former_or_current_owner_keeper (crm:E39_Actor): Thing has former or current owner or keeper Actor, or ownership/custody was transferred from/to actor in Acquisition/Transfer of Custody event
5. rso:FR67_about_actor (crm:E39_Actor): Thing depicts or refers to Actor, or carries an information object that is about Actor, or bears similarity with a thing that is about Actor
6. rso:FR12_has_met (crm:E39_Actor): Thing (or another thing it is part of) has met actor in the same event (or event that is part of it)
7. rso:FR67_about_period (crm:E4_Period): Thing depicts or refers to Event/Period, or carries an information object that is about Event, or bears similarity with a thing that is about Event
8. rso:FR12_was_present_at (crm:E4_Period): Thing was present at Event (e.g. exhibition) or is from Period
9. rso:FR92i_created_in (crm:E53_Place): Thing (or part/inscription thereof) was created or modified/repaired at/in place (or a broader containing place)
10. rso:FR55_located_in (crm:E53_Place): Thing has current or permanent location in Place (or a broader containing place)
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <p><sup>2</sup> http://www.researchspace.org
<sup>3</sup> http://www.ontotext.com/owlim</p>
      <p>11. rso:FR12_found_at (crm:E53_Place): Thing was found (discovered, excavated) at Place (or a broader containing place)
12. rso:FR7_from_place (crm:E53_Place): Thing has former, current or permanent location at place, or was created/found at place, or moved to/from place, or changed ownership/custody at place (or a broader containing place)
13. rso:FR67_about_place (crm:E53_Place): Thing depicts or refers to a place or feature located in place, or is similar in features or composed of or carries an information object that depicts or refers to a place
14. rso:FR2_has_type (crm:E55_Type): Thing is of Type, or has Shape, or is of Kind, or is about or depicts a type (e.g. IconClass or subject heading)
15. rso:FR45_is_made_of (crm:E57_Material): Thing (or part thereof) consists of material
16. rso:FR32_used_technique (rso:E55_Technique): The production of Thing (or part thereof) used general technique
17. luc:myIndex (rdfs:Literal): The full text of the thing's description (including thesaurus terms and textual descriptions) matches the given keyword. FTS using Lucene built into OWLIM.
18. rso:FR108i_82_produced_within (rdfs:Literal): Thing was created within an interval that intersects the given interval or year.
19. rso:FR1_identified_by (rdfs:Literal): Thing (or part thereof) has Identifier. Exact match string
20. rso:FR138i_has_representation (xsd:boolean): Thing has at least one image representation. Used to select objects that have images
21. rso:FR138i_representation (crm:E38_Image): Thing has image representation. Used to fetch all images of an object
22. rso:FR_main_representation (crm:E38_Image): Thing has main image representation. Used to display object thumbnail in search results
23. rso:FR_dataset (rdfs:Literal): Thing belongs to indicated dataset. Used for faceting by dataset</p>
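      <p>The interval test behind FR108i_82_produced_within (item 18) can be illustrated with a minimal sketch. It assumes, purely for illustration, that production dates are closed intervals of whole years and that a single year is treated as a degenerate interval; the function names and sample years are hypothetical and are not taken from the RS implementation.

```python
def intersects(produced, query):
    """Two closed intervals (start_year, end_year) intersect
    if neither one ends before the other starts."""
    p_start, p_end = produced
    q_start, q_end = query
    return p_end >= q_start and q_end >= p_start


def as_interval(value):
    """Treat a single year as the degenerate interval (year, year)."""
    if isinstance(value, int):
        return (value, value)
    return value


# Hypothetical production interval of one object
produced_within = (1630, 1640)

print(intersects(produced_within, as_interval((1635, 1650))))  # overlapping interval
print(intersects(produced_within, as_interval(1641)))          # year just outside
```

Treating a year as an interval keeps a single comparison for both forms of user input.</p>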
      <sec id="sec-3-1">
        <title>2.2 OWLIM Rules</title>
        <p>
          We used OWLIM Rules to implement the FRs, a total of 120 rules:
• 14 rules implement RDFS reasoning, a small subset of OWL (owl:TransitiveProperty, owl:inverseOf) and ptop:transitiveOver from the PROTON ontology<sup>4</sup>. These are copied from standard rulesets, as described in
          [<xref ref-type="bibr" rid="ref5">5</xref>]
• 106 rules implement FRs. We use a method of decomposing the network of an FR into pieces
          [<xref ref-type="bibr" rid="ref1">1</xref>]:
          conjunctive (e.g. checking the type of a node), disjunctive (parallel),
serial (property path), transitive. We implement each piece as a sub-FR and use it
to build up bigger pieces.
        </p>
        <p>To deal with the complexity of implementation, we used several approaches:
• A rule shortcut syntax that renders each rule on one line, instead of a line for each premise and conclusion
• A literate programming style, where rule definitions are interspersed with diagrams, discussion and justification in a wiki
• Checking that only known properties and classes are used in the rules (the dependency graph in sec. 2.4 helped with this)
• Checking that rule variables are used in a linear way (premise variables make a chain, and the conclusion uses the ends of the chain), or in type checks. E.g.
x &lt;rdf:type&gt; &lt;rso:FC70_Thing&gt;; x &lt;crm:P46_is_composed_of&gt; y =&gt; x &lt;rso:FRT_46_106_148&gt; y
x &lt;rso:FRT_46_106_148&gt; y; y &lt;crm:P46_is_composed_of&gt; z =&gt; x &lt;rso:FRT_46_106_148&gt; z
p &lt;ptop:transitiveOver&gt; q; x p y; y q z =&gt; x p z
(a) First rule: x is used in a type check, and x-y=&gt;x-y is a linear chain.
(b) Second rule: x-y;y-z=&gt;x-z is a linear chain.
(c) Third rule: p and q are not used in a linear way. These variables are in "property" position, and our check skips such variables
<sup>4</sup> http://www.ontotext.com/proton-ontology</p>
        <p>A lot more implementation details can be found at the ResearchSpace wiki<sup>5</sup>. The
following OWLIM reasoning features<sup>6</sup> were important for the implementation:
• Custom rule-sets. The standard semantics that OWLIM supports (RDFS, RDFS Horst, OWL RL, QL and DL) are also implemented as rulesets.
• Fully-materializing forward-chaining reasoning. Rule consequences are stored in the repository and query answering is very fast.
• sameAs optimization that allows fast cross-collection search using coreferenced values (e.g. Agent URIs)
• Incremental retraction: when a triple is deleted, OWLIM removes all inferred consequences that are left without support (recursively). In order to facilitate this, OWLIM rules have a simple syntax, so they can be checked in "reverse".
• Incremental insert: when a triple is inserted (even an ontology triple), all rules are checked. If a rule fires, the new conclusion is also checked against the rules, etc.
• Efficient rule execution: rules are compiled to Java and executed quickly. For example, we decided late in the project that we want FR45 "Thing is made of Material" to be transitive over the "broader" hierarchy. We added the 2 triples below, and 1M new triples were inferred within 10 minutes (see the implementation of ptop:transitiveOver in (c) above).
rso:FR45_is_made_of ptop:transitiveOver skos:broader, crm:P127_has_broader_term.</p>
        <p>
          OWLIM rules also have their disadvantages, as described in
          [<xref ref-type="bibr" rid="ref1">1</xref>]
          section 5.3. Chief
among them is inflexibility: if the ruleset is changed, the OWLIM server needs to be
restarted. Furthermore, if the ruleset should infer different conclusions from the
existing triples, the repository needs to be reloaded. But newly added triples are checked
against the rules, as shown in the previous example.
<sup>5</sup> https://confluence.ontotext.com/display/ResearchSpace/FR+Implementation
<sup>6</sup> http://owlim.ontotext.com/display/OWLIMv53/OWLIM-SE+Reasoner
        </p>
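        <p>To make the behaviour of rule (c) concrete, here is a minimal sketch of fully-materializing forward chaining for the ptop:transitiveOver rule. It is a naive fixpoint loop in Python, not OWLIM's compiled rule engine, and the object and thesaurus identifiers (obj:ring, thes:gold, thes:metal, thes:inorganic_material) are hypothetical sample data.

```python
TRANSITIVE_OVER = "ptop:transitiveOver"


def forward_chain(triples):
    """Apply  p transitiveOver q; x p y; y q z => x p z  until no new triple appears."""
    facts = set(triples)
    while True:
        # All (p, q) pairs declared as "p ptop:transitiveOver q"
        pairs = {(s, o) for (s, pred, o) in facts if pred == TRANSITIVE_OVER}
        new = set()
        for (x, p, y) in facts:
            for (p2, q) in pairs:
                if p2 != p:
                    continue
                for (y2, q2, z) in facts:
                    if y2 == y and q2 == q:
                        new.add((x, p, z))
        new -= facts
        if not new:
            return facts  # fixpoint reached: everything is materialized
        facts |= new


# Hypothetical sample data: one object and a small "broader" chain of materials
explicit = {
    ("rso:FR45_is_made_of", TRANSITIVE_OVER, "skos:broader"),
    ("obj:ring", "rso:FR45_is_made_of", "thes:gold"),
    ("thes:gold", "skos:broader", "thes:metal"),
    ("thes:metal", "skos:broader", "thes:inorganic_material"),
}
closure = forward_chain(explicit)
inferred = closure - explicit
for triple in sorted(inferred):
    print(triple)
```

In this toy run a single ontology-level triple makes the object findable under every broader material, which is the effect described for FR45 above. A production engine would use indexes and incremental evaluation rather than rescanning all facts.</p>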
      </sec>
      <sec id="sec-3-2">
        <title>2.3 Example: FR92i_created_by</title>
        <p>As an example, let's consider FR92i_created_by "Thing created by Actor", which we
define as "Thing (or part/inscription thereof) was created or modified/repaired by
Actor (or a group it is a member of)".
This FR includes the following source properties:
• P46_is_composed_of, P106_is_composed_of, P148_has_component: navigate the object part hierarchy
• P128_carries: transitions from an object to an Inscription carried by it
• P31i_was_modified_by (includes P108i_was_produced_by), P94i_was_created_by: Modification/Production of a physical thing, Creation of a conceptual thing (Inscription)
• P9_consists_of: navigates the event part hierarchy (BM models uncorrelated production facts as sub-events)
• P14_carried_out_by, P107i_is_current_or_former_member_of: the agent and the groups it is a member of
This FR uses a previously defined sub-FR FRT_46_106_148_128 (the first loop) and
defines another sub-FR:
• FRX92i_created := (FC70_Thing) FRT_46_106_148_128* / (P31i | P94i) / P9*
The sub-FR extends to the Modification/Creation node including the P9 loop and is
implemented with 5 rules:
x &lt;rdf:type&gt; &lt;rso:FC70_Thing&gt;; x &lt;crm:P31i_was_modified_by&gt; y =&gt; x &lt;rso:FRX92i_created&gt; y
x &lt;rdf:type&gt; &lt;rso:FC70_Thing&gt;; x &lt;crm:P94i_was_created_by&gt; y =&gt; x &lt;rso:FRX92i_created&gt; y
x &lt;rso:FRT_46_106_148_128&gt; y; y &lt;crm:P31i_was_modified_by&gt; z =&gt; x &lt;rso:FRX92i_created&gt; z
x &lt;rso:FRT_46_106_148_128&gt; y; y &lt;crm:P94i_was_created_by&gt; z =&gt; x &lt;rso:FRX92i_created&gt; z
x &lt;rso:FRX92i_created&gt; y; y &lt;crm:P9_consists_of&gt; z =&gt; x &lt;rso:FRX92i_created&gt; z
Finally, the FR uses the sub-FR (which is also reused in another FR), and is
implemented with 2 rules:
• FR92i_created_by := FRX92i_created / P14 / P107i*
x &lt;rso:FRX92i_created&gt; y; y &lt;crm:P14_carried_out_by&gt; z =&gt; x &lt;rso:FR92i_created_by&gt; z
x &lt;rso:FRX92i_created&gt; y; y &lt;crm:P14_carried_out_by&gt; z; z &lt;rso:FRT107i_member_of&gt; t =&gt; x &lt;rso:FR92i_created_by&gt; t</p>
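        <p>The seven rules above can be exercised on a toy graph with the following sketch. It encodes each rule as an optional type check plus a linear chain of predicates and applies them to a fixpoint; this encoding and the naive evaluation loop are assumptions made for illustration, not the OWLIM rule compiler. The nodes obj:print1, ev:production1 and ev:production1a are hypothetical, and the rso:FRT107i_member_of triple is supplied directly instead of being derived by its own rules.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"
THING = "rso:FC70_Thing"
FRX = "rso:FRX92i_created"
FR = "rso:FR92i_created_by"
FRT = "rso:FRT_46_106_148_128"

# Each rule: (required type of x or None, chain of predicates from x, concluded predicate)
RULES = [
    (THING, ["crm:P31i_was_modified_by"], FRX),
    (THING, ["crm:P94i_was_created_by"], FRX),
    (None, [FRT, "crm:P31i_was_modified_by"], FRX),
    (None, [FRT, "crm:P94i_was_created_by"], FRX),
    (None, [FRX, "crm:P9_consists_of"], FRX),
    (None, [FRX, "crm:P14_carried_out_by"], FR),
    (None, [FRX, "crm:P14_carried_out_by", "rso:FRT107i_member_of"], FR),
]


def follow(index, start, chain):
    """Return all nodes reached from start by following the predicate chain."""
    frontier = {start}
    for pred in chain:
        frontier = {o for s in frontier for o in index[(s, pred)]}
    return frontier


def materialize(triples):
    """Apply RULES repeatedly until no new triple is inferred."""
    facts = set(triples)
    while True:
        index = defaultdict(set)
        for s, p, o in facts:
            index[(s, p)].add(o)
        subjects = {s for s, _, _ in facts}
        new = set()
        for required_type, chain, head in RULES:
            for x in subjects:
                if required_type is not None and required_type not in index[(x, RDF_TYPE)]:
                    continue
                for end in follow(index, x, chain):
                    new.add((x, head, end))
        new -= facts
        if not new:
            return facts
        facts |= new


# Hypothetical toy data: a print produced in a sub-event carried out by Rembrandt
explicit = {
    ("obj:print1", RDF_TYPE, THING),
    ("obj:print1", "crm:P31i_was_modified_by", "ev:production1"),
    ("ev:production1", "crm:P9_consists_of", "ev:production1a"),
    ("ev:production1a", "crm:P14_carried_out_by", "rkd-artist:Rembrandt"),
    ("rkd-artist:Rembrandt", "rso:FRT107i_member_of", "nationality:Dutch"),
}
closure = materialize(explicit)
creators = {o for s, p, o in closure if s == "obj:print1" and p == FR}
print(sorted(creators))
```

The intermediate rso:FRX92i_created triples are what allow the last two rules to stay short, which is the point of decomposing an FR into sub-FRs.</p>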
      </sec>
      <sec id="sec-3-3">
        <title>2.4 Sub-FRs and Dependency Graph</title>
        <p>The dependency graph of our implementation is shown below; a zoomable version is
also available<sup>7</sup>. It has:
• 51 source classes/properties, shown as plain text
• 13 intermediate sub-FRs, shown as filled rectangles. These sub-FRs are used by
several FRs to simplify the implementation
• 19 target FRs, shown as rectangles</p>
        <p>The diagram illustrates the complexity of the implementation. We used it to verify
the implementation as OWLIM rules (e.g. that there are no disconnected properties
and that each FR uses all source properties as expected).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <p><sup>7</sup> https://www.dropbox.com/s/6lz48qfdbitkvlk/FR-graph.png</p>
      <sec id="sec-4-1">
        <title>2.5 Hardware Specification</title>
        <p>• CPU: Intel Xeon E5-2650 2.00GHz, 20M Cache, 32 cores
• RAM: 128 GB RDIMM, 1600 MHz
• Solid-State Disks: 4*200GB SSD, SATA.
• Hard Disks: 3*300GB, SAS 6Gbps, 2.5-in, 15K RPM.
• Server cost: under $10k.</p>
        <p>For large-scale OWLIM deployments we recommend SSDs for faster disk access,
and the ZFS compressing file system for better SSD utilization and even faster speed.
ZFS is native to Solaris, but we now have successful deployments on Linux as well.
This system has a lot of spare capacity: the hard disks and ZFS are currently not used.</p>
      </sec>
      <sec id="sec-4-2">
        <title>2.6 Volumetrics</title>
        <p>Some numeric data of our implementation, with discussion:
• Museum objects: 2,051,797 (entities with type rso:FC70_Thing). Most of these are from the British Museum. We are currently completing the ingest of Yale Center for British Art objects into ResearchSpace.
• Thesaurus entries: 415,509 (type skos:Concept). All kinds of "fixed" values that are used for search: object types, materials, techniques, people, places, … (a total of 90 ConceptSchemes)
• Explicit statements: 195,208,156. We estimate that of these, 185M are for objects (90 statements/object) and 9M are for thesaurus entries (22 statements/term).
• Total statements: 916,735,486. The expansion ratio is 4.7x (i.e. for each explicit statement, 3.7 more are inferred). This is considerably higher than the typical expansion for general datasets (e.g. DBpedia, GeoNames, FactForge), which is 1.2 - 2x, and is due to the complexity described below.
• Nodes (unique URLs and literals): 53,803,189. (We don't use blank nodes)
• Repository size: 42 GB, object full-text index: 2.5 GB, thesaurus full-text index (used for search auto-complete): 22 MB.
• Loading time (including all inferencing): 22.2h on RAM drive; 32.9h on hard disks.</p>
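        <p>The derived ratios quoted in this list follow directly from the raw counts. The short sketch below recomputes them; the variable names are ours, and the 185M/9M split of explicit statements is the authors' estimate, used here as given.

```python
# Raw counts as reported in the list above
explicit_statements = 195_208_156
total_statements = 916_735_486
nodes = 53_803_189
objects = 2_051_797
thesaurus_terms = 415_509

# Estimated split of explicit statements (authors' estimate)
object_statements_estimate = 185_000_000
term_statements_estimate = 9_000_000

# Expansion = total statements / explicit statements
expansion = total_statements / explicit_statements
# Inferred statements per explicit statement
inferred_per_explicit = expansion - 1
# Density = total statements / nodes
density = total_statements / nodes
# Explicit statements per object and per thesaurus term
per_object = object_statements_estimate / objects
per_term = term_statements_estimate / thesaurus_terms

print(f"expansion ratio:        {expansion:.1f}x")
print(f"inferred per explicit:  {inferred_per_explicit:.1f}")
print(f"density (stmts/node):   {density:.1f}")
print(f"statements per object:  {per_object:.0f}")
print(f"statements per term:    {per_term:.0f}")
```

Expansion and Density here use the same definitions as the repository comparison in sec. 2.9.</p>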
      </sec>
      <sec id="sec-4-3">
        <title>2.7 Complexity: Classes</title>
        <p>CIDOC CRM is a complex ontology. The deepest branch of the class hierarchy<sup>8</sup> is 10
levels: E1&gt;E77&gt;E70&gt;E71&gt;E28&gt;E90&gt;E73&gt;E36&gt;E37&gt;E34_Inscription. Furthermore,
multiple inheritance is used extensively, e.g. E33 is also a super-class of
E34_Inscription. For each inscription, 12 type statements are inferred. We use the
Erlangen CRM mapping to OWL<sup>9</sup> because it provides inverse and transitive
properties. But it includes a lot of owl:Restriction anonymous classes, e.g. (in Manchester
notation)
E30_Right SubClassOf: P104i_applies_to some E72_Legal_Object
These anonymous classes are useless to us, so we wrote a tool that derives simpler
profiles of Erlangen CRM. Even with this simplification, type statements alone are
302,149,587 or 37% of the total. The number of types is 238. We counted statements
per type with this query and present some of the top types:
select ?t (count(*) as ?c) {?o a ?t} group by ?t</p>
        <p><sup>8</sup> http://www.cidoc-crm.org/cidoc_graphical_representation_v_5_1/class_hierarchy.html
<sup>9</sup> http://erlangen-crm.org</p>
      </sec>
      <sec id="sec-4-4">
        <title>Class Statements</title>
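        <table-wrap>
          <table>
            <thead>
              <tr><th>Class</th><th>Statements</th></tr>
            </thead>
            <tbody>
              <tr><td>owl:Thing</td><td>36485904</td></tr>
              <tr><td>E1_CRM_Entity</td><td>36485903</td></tr>
              <tr><td>E77_Persistent_Item</td><td>17408450</td></tr>
              <tr><td>E70_Thing</td><td>17339714</td></tr>
              <tr><td>E71_Man-Made_Thing</td><td>17216212</td></tr>
              <tr><td>E72_Legal_Object</td><td>17192518</td></tr>
              <tr><td>E28_Conceptual_Object</td><td>14776488</td></tr>
              <tr><td>E90_Symbolic_Object</td><td>14629292</td></tr>
              <tr><td>E2_Temporal_Entity</td><td>11924877</td></tr>
              <tr><td>E4_Period</td><td>11924877</td></tr>
              <tr><td>E5_Event</td><td>11922986</td></tr>
              <tr><td>E7_Activity</td><td>11796470</td></tr>
              <tr><td>E63_Beginning_of_Existence</td><td>6377421</td></tr>
              <tr><td>E11_Modification</td><td>6296015</td></tr>
              <tr><td>E12_Production</td><td>6295825</td></tr>
              <tr><td>rso:FC70_Thing</td><td>2051797</td></tr>
              <tr><td>skos:Concept</td><td>415509</td></tr>
            </tbody>
          </table>
        </table-wrap>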
        <p>Comments (look at the class hierarchy as well): we have 415k terms
(skos:Concept) and 2M museum objects (FC70_Thing). These objects have 6.3M
E12_Production records, which are repeated as the super-class E11_Modification;
there are a few hundred Repairs mapped to E11, over and above the E12 number. E12
is also repeated as E63_Beginning_of_Existence, which has an additional 100k records
of Birth and Formation for the Person-Institution thesaurus. Another 5.4M
E7_Activity records stand for Acquisition, Discovery, Exhibition, etc. Each E7 is
repeated as E5_Event, which is repeated as E4_Period (with an extra 19k historic
Periods) and E2_Temporal_Entity; etc.</p>
        <p>A lot of the higher-level classes are too abstract to be useful for querying (e.g.
E1_CRM_Entity, E70_Thing, E77_Persistent_Item, E72_Legal_Object). But OWLIM
materializes all inferences and unfortunately doesn't offer options for controlling
which ones to materialize.</p>
      </sec>
      <sec id="sec-4-5">
        <title>2.8 Complexity: Properties</title>
        <p>(Note: the analysis below is based on a slightly older version of the repository with
806M statements instead of 917M statements. But the percentages and conclusions
are approximately the same.)</p>
        <p>We have a total of 339 properties. We analyzed the statement distribution per
property with this query:
select ?p (count(*) as ?c) {?s ?p ?o} group by ?p
• Type statements take a significant proportion, analyzed in the previous section.
• Statements related to Objects are the largest group (365M or 45% of total). Chief amongst them are P3_has_note (10.10% of the Object statements), P2_has_type (6.49%), P12_occurred_in_the_presence_of (3.29%)
─ rdfs:label is also significant (13.9M or 3.81% of the Object statements). We estimate that 5M of rdfs:label statements are due to Thesauri and should be moved from row Objects to row Thesauri.
─ A lot of the CRM properties have inverses (79 properties in our system). They are useful when writing rules and queries, but create a significant number of duplicate statements (18.6% of total: included in row Objects, and shown separately on the last row)
• Extensions are sub-properties of CRM, following the CRM extensibility guidelines. CRM itself uses sub-properties extensively. The maximum depth of the property hierarchy is 4, e.g.: P12_occurred_in_the_presence_of&gt; P11_had_participant&gt; P14_carried_out_by&gt; P22_transferred_title_to.</p>
      </sec>
      <sec id="sec-4-6">
        <title>2.9 Comparison to Other Repositories</title>
        <p>Below is a comparison of the RS CRM repository to some repositories hosted by
Ontotext and PSNC and provided as SPARQL public services. In each cell we show
the absolute number (in Millions, except for Expansion and Density) and the ratio
compared to RS CRM. Expansion=Total statements/Explicit statements shows the
intensity of inference. Density=Statements/Nodes shows the relative density of the
graph. Objects is not defined for the last two repositories, since they cover broad
domains and the objects are too heterogeneous.</p>
        <p>Observations: The RS CRM repository is of moderate size compared to others (but is
expected to grow as more partners join RS). CRM expresses objects in considerably
more detail than all other repositories, even EDM. This can be seen in both ratios
Ex.st/obj (explicit statements per object) and Density (total statements per node).</p>
        <sec id="sec-4-6-1">
          <title>3 Performance</title>
        </sec>
      </sec>
      <sec id="sec-4-7">
        <title>3.1 Performance of SPARQL Implementation</title>
        <p>
          FRs can be implemented by composing straight SPARQL queries. For example, the
query for FR92i_created_by (sec. 2.3) can be defined like this using SPARQL 1.1
Property Paths
          [<xref ref-type="bibr" rid="ref8">8</xref>],
          and you can try it at the RS CRM endpoint<sup>10</sup>:
select ?obj $act {
?obj a rso:FC70_Thing;
(crm:P46_is_composed_of|crm:P106_is_composed_of|crm:P148_has_component|crm:P128_carries)*/
(crm:P31i_was_modified_by|crm:P94i_was_created_by) / crm:P9_consists_of* /
crm:P14_carried_out_by / crm:P107i_is_current_or_former_member_of*
$act
} limit 20
The first few objects returned are Rembrandt paintings from the RKD dataset. $act is
bound to rkd-artist:Rembrandt, and also to groups that he belongs to:
profession/draughtsman, profession/printmaker, nationality:Dutch (conversely, the user can
search by such groups).
        </p>
        <p>In the RS system, $act is bound to an input value and ?obj is the output variable:
select distinct ?obj {
?obj a rso:FC70_Thing;
(crm:P46_is_composed_of|crm:P106_is_composed_of|crm:P148_has_component|crm:P128_carries)*/
(crm:P31i_was_modified_by|crm:P94i_was_created_by) / crm:P9_consists_of* /
crm:P14_carried_out_by / crm:P107i_is_current_or_former_member_of*
rkd-artist:Rembrandt
} limit 20
The endpoint takes over 15 minutes to answer the query. If you add more clauses, the
performance is even worse. The query can be optimized a bit by using intermediate
variables instead of property paths, but the performance is still untenable.
<sup>10</sup> http://test.researchspace.org:8081/sparql (ask the authors for login)</p>
      </sec>
      <sec id="sec-4-8">
        <title>3.2 Performance of FR Implementation</title>
        <p>The same query, using FR92i_created_by as defined in sec. 2.3, is trivial and has
subsecond response time:
select distinct ?obj {?obj rso:FR92i_created_by rkd-artist:Rembrandt} limit 500
Currently RS imposes a limit of 500 results due to browser memory limitations of the
faceting system used (Exhibit 2), but even the full set of 1418 objects is returned
within a second.</p>
        <p>Now let's add some complexity: let's find drawings by Rembrandt that are about
mammals. We first need to find the corresponding thesaurus terms, e.g.
select * {?s rdfs:label "drawing"}
select * {?s rdfs:label "mammal"}
The query uses another FR from the list in sec. 2.1: FR2_has_type (which is used to
relate to any E55_Type term, no matter whether it relates to the isness or aboutness
of the object):
select distinct ?obj {
?obj rso:FR92i_created_by rkd-artist:Rembrandt;
rso:FR2_has_type thes:x6544, thes:x12965
} limit 500
The query takes less than a second and returns 13 objects. None of them has subject
"mammals" per se: they are about horses, pigs, lions, camels and an elephant (see next
screen-shot). But the corresponding FR is defined as transitiveOver skos:broader, so it
navigates the term hierarchy.</p>
        <p>Materializing the FR triples adds 12% to the repository size (see sec. 2.8), which
causes negligible slow-down of basic querying. As shown in sec. 2.9, OWLIM has
been used successfully on much bigger repositories, so this extra size is not a concern.
RS uses the above to implement Semantic Search with controlled vocabularies and
faceting. The user enters terms using auto-completion, RS restricts to FRs applicable
to the specific term (e.g. created/modified is applicable to Agents, whereas
is/has/about is applicable to concepts such as Object type, Subject, etc.) and constructs
a "search sentence". Here is a screen shot; you can view a video<sup>11</sup> of RS search in
action, or ask the authors for a demo.</p>
        <p>Note: the Creator facet is populated from FR92i_created_by, which includes not
only individual creators but also groups they belong to. In this case "Dutch" is the
Nationality of Rembrandt.</p>
        <p>This search uses the query defined in the previous section. The search takes
significantly longer than the query alone (4.5 seconds) because after obtaining up to 500
objects, it executes several more queries to fetch their display fields, facets, and
images. Subsequent restrictions using the facets are much faster (sub-second response).</p>
        <sec id="sec-4-8-1">
          <title>4 Conclusion</title>
        </sec>
      </sec>
      <sec id="sec-4-9">
        <title>4.1 Summary</title>
        <p>
          We presented performance results for the RS implementation
          [<xref ref-type="bibr" rid="ref1">1</xref>]
          of FR Search as defined in
          [<xref ref-type="bibr" rid="ref2 ref3">2,3</xref>].
          This search works over a significant CH dataset (almost 1B
statements), using a complex ontology (CIDOC CRM). Using a semantic repository is
appropriate for this dataset because of its complexity, graph-oriented nature, diversity
of relations, and complexity of queries that users are interested in. This is an exciting
demonstration of large-scale reasoning with real-world data:
• The well-structured nature of the data allows for expressive reasoning. The inferred knowledge makes good sense when reviewed by domain experts, unlike other combinations of RDF data gathered "from the wild" that often generate strange/faulty results.
• This is one of the first examples of such expressive reasoning with large datasets. Previous examples work with 5-10M statements, and often use synthetic data.
• Reasoning adds real value: it would be very hard to service the same complex queries without inference.
<sup>11</sup> http://www.youtube.com/watch?v=HCnwgq6ebAs
        </p>
      </sec>
      <sec id="sec-4-10">
        <title>4.2 Related Work</title>
        <p>
          The RS repository is one of the largest CH datasets loaded in an RDF repository and
provides valuable implementation experience. Some other large CH repositories
include:
• The Europeana EDM repository (hosted by Ontotext) is bigger (see sec. 2.9), but is much less structured. Since most objects were converted from ESE, they include mostly literals: no controlled URIs and no links.
• CLAROS<sup>12</sup> (Classical Arts Research Online Services):
          [<xref ref-type="bibr" rid="ref9">9</xref>]
          reported in 2009 that 10M
triples were loaded in a Jena TDB triple-store. It has since expanded and implemented offline
indexing extensions (MILARQ) for better performance.
• The Poznan Supercomputing and Networking Center (PSNC) has implemented several national aggregations of museum and bibliographic data based on CIDOC CRM, also using OWLIM.
          [<xref ref-type="bibr" rid="ref10">10</xref>]
          reports 600k publications converted to CRM/FRBRoo
as part of the SYNAT project. Krzysztof Sielski reported the numbers in
section 2.9 at the CRMEX 2013 workshop; see the PSNC paper in this volume.
        </p>
      </sec>
      <sec id="sec-4-11">
        <title>4.3 Future Work</title>
        <p>
          We would like to re-implement the FRs by using a lot of standard constructions and
only a few OWLIM rules:
• Standard RDFS and OWL constructs: rdfs:subPropertyOf, owl:propertyChainAxiom, owl:inverseOf
• Additional properties: ptop:transitiveOver as generalization of owl:TransitiveProperty; conjunctive property definitions that are needed for FRs, as explained in
          [<xref ref-type="bibr" rid="ref1">1</xref>]
          sec. 3.3
• Define the FR networks in RDF data
The benefits of such reimplementation are better flexibility (OWLIM rules are not
flexible, see end of sec. 2.2) and better portability to other repositories.
<sup>12</sup> http://www.clarosnet.org
        </p>
        <p>We are exploring the opportunity to create a CH Benchmark using the above data
and FRs under the auspices of the Linked Data Benchmarking Council (LDBC)<sup>13</sup>
project, so that other vendors can implement the same reasoning and compare the
performance of their implementations. LDBC seeks to empower users of semantic
technologies by establishing significant and objective benchmarks that address
real-world data and user needs.</p>
      </sec>
      <sec id="sec-4-12">
        <title>Acknowledgements</title>
        <p>This work is supported by the ResearchSpace project, executed by the British
Museum and funded by the Andrew W. Mellon Foundation. We are grateful to Dominic
Oldman, ResearchSpace principal investigator, for his excellent continuing
collaboration. Dominic created the RS Semantic Search movie cited above.</p>
        <p>We thank the anonymous reviewers for their constructive and specific comments.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Vladimir</given-names>
            <surname>Alexiev</surname>
          </string-name>
          :
          <article-title>Implementing CIDOC CRM Search Based on Fundamental Relations and OWLIM Rules</article-title>
          .
          <source>Semantic Digital Archives workshop (SDA 2012), part of Theory and Practice of Digital Libraries conference (TPDL 2012)</source>
          . September
          <year>2012</year>
          , Paphos, Cyprus. http://ceur-ws.org/Vol-912
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Katerina</given-names>
            <surname>Tzompanaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Doerr</surname>
          </string-name>
          :
          <article-title>A New Framework for Querying Semantic Networks</article-title>
          .
          <source>ICS-FORTH Technical Report TR-419</source>
          ,
          <year>May 2011</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Katerina</given-names>
            <surname>Tzompanaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Doerr</surname>
          </string-name>
          :
          <article-title>Fundamental Categories and Relationships for intuitive querying CIDOC-CRM based repositories</article-title>
          ,
          <source>ICS-FORTH Technical Report TR-429</source>
          ,
          <year>April 2012</year>
          , http://www.cidoc-crm.org/docs/TechnicalReport429_April2012.pdf
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Barry</given-names>
            <surname>Bishop</surname>
          </string-name>
          , Atanas Kiryakov, Damyan Ognyanoff, Ivan Peikov, Zdravko Tashev, Ruslan Velkov,
          <article-title>OWLIM: A family of scalable semantic repositories</article-title>
          ,
          <source>Semantic Web Journal</source>
          , Volume
          <volume>2</volume>
          , Number
          <issue>1</issue>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Barry</given-names>
            <surname>Bishop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Spas</given-names>
            <surname>Bojanov</surname>
          </string-name>
          .
          <article-title>Implementing OWL 2 RL and OWL 2 QL rule-sets for OWLIM</article-title>
          .
          <source>OWL Experiences and Directions workshop (OWLED 2011)</source>
          , San Francisco, USA, June 5-6,
          <year>2011</year>
          , CEUR-WS.org, ISSN 1613-0073
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Barry</given-names>
            <surname>Bishop</surname>
          </string-name>
          , Atanas Kiryakov, Damyan Ognyanoff, Ivan Peikov, Zdravko Tashev, Ruslan Velkov.
          <article-title>FactForge: A fast track to the web of data</article-title>
          .
          <source>Semantic Web Journal</source>
          , Volume
          <volume>2</volume>
          , Number
          <issue>2</issue>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Barry</given-names>
            <surname>Bishop</surname>
          </string-name>
          , Atanas Kiryakov, Zdravko Tashev, Mariana Damova, Kiril Simov.
          <article-title>OWLIM Reasoning over FactForge</article-title>
          .
          <source>Proceedings of OWL Reasoner Evaluation Workshop (ORE'2012), collocated with IJCAR 2012</source>
          , Manchester, UK
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. SPARQL Property Paths, http://www.w3.org/TR/sparql11-property-paths</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>David</given-names>
            <surname>Shotton</surname>
          </string-name>
          ,
          <article-title>The Future of the Past: Using CIDOC CRM for CLAROS</article-title>
          .
          <source>Semantic Web and CIDOC CRM Workshop, co-located with ISWC 2009</source>
          , Washington DC, http://www.semuse.org/index.php?title=Semantic_Web_and_CIDOC_CRM_Workshop
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Cezary</given-names>
            <surname>Mazurek</surname>
          </string-name>
          , Krzysztof Sielski, Maciej Stroiński, Justyna Walkowska, Marcin Werla,
          <string-name>
            <given-names>Jan</given-names>
            <surname>Węglarz</surname>
          </string-name>
          .
          <article-title>Transforming a Flat Metadata Schema to a Semantic Web Ontology: The Polish Digital Libraries Federation and CIDOC CRM Case Study</article-title>
          .
          <source>Studies in Computational Intelligence</source>
          Volume
          <volume>390</volume>
          ,
          <year>2012</year>
          , pp
          <fpage>153</fpage>
          -
          <lpage>177</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>