Establishing a Linked Data Infrastructure for the OGC Body of Knowledge

Gobe Hobona1[0000-0002-8733-4702], Rob Atkinson1[0000-0002-7878-2693], Greg Buehler1[0000-0003-1386-69], Scott Simmons1[0000-0002-9085-010X], and Ingo Simonis1[0000-0001-5304-5868]

1 Open Geospatial Consortium, Wayland, MA, USA
ghobona@ogc.org

Abstract. The OGC Body of Knowledge is a structured collection of concepts and related
resources that can be found in the set of documents published by the Open Geospatial Consortium
(OGC). An explicit view of this knowledge is available from the OGC Virtual Knowledge Store
and related components such as the OGC Definitions Server and the OGC Glossary of Terms.
The OGC Body of Knowledge is intended to provide a reference for users and developers of
geospatial software and services. This paper describes the approach taken to develop the OGC
Body of Knowledge and presents the results achieved so far. It is intended to encourage and
facilitate discussion within the OGC membership and the wider geospatial community.


         Keywords: linked data, body of knowledge, knowledge management


1        Introduction
The Open Geospatial Consortium (OGC) is an international consortium of more than
520 businesses, government agencies, research organizations, and universities driven
to make geospatial (location) information and services findable, accessible,
interoperable, and reusable (FAIR). The OGC Body of Knowledge is a structured
collection of concepts and related resources that can be found in the OGC Library [1].
It is, in effect, a view of explicit knowledge available from the OGC Virtual Knowledge
Store and related components such as the OGC Definitions Server
(http://www.opengis.net/def/) and the OGC Glossary of Terms
(https://www.ogc.org/ogc/glossary). A Body of Knowledge extends beyond these explicit
knowledge assets because it also incorporates the tacit practices, skills, experiences,
products, processes, and interdisciplinary knowledge that define the field [2]. This paper
describes the approach taken to develop the OGC Body of Knowledge.




Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).


2      Methodology

2.1    Development Process
To develop the OGC Body of Knowledge, the OGC designed an approach that involved
extraction of knowledge from a pre-existing OGC knowledge base, transformation of that
knowledge into formal statements, and loading of the statements into persistent storage.
The formal statements are represented as triples and persisted in a triple store named the
Virtual Knowledge Store (VKS).

Applications can query the content stored in the VKS at any time through its SPARQL
interface, which is implemented through a deployment of the RDF4J Workbench
(https://rdf4j.org). For human users, a document is generated from the triples stored in the
VKS; the document is serialized as asciidoc and then compiled using the asciidoctor
software (https://asciidoctor.org). The main presentation approach for the VKS is a Linked
Data approach supporting HTTP content negotiation, with HTML and RDF-encoded SKOS as the
key supported formats. Flexibility in providing multiple alternative views using different
data models is supported through “Content Negotiation by Profile” [3].
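
As an illustrative sketch of how an application might run a search on the SPARQL interface
described above (the endpoint URL is a placeholder rather than the actual address of the OGC
deployment), the SPARQLWrapper library for Python could be used to list concepts and their
labels:

    # Sketch only: query the VKS SPARQL interface for SKOS concepts.
    # The repository URL below is hypothetical.
    from SPARQLWrapper import SPARQLWrapper, JSON

    ENDPOINT = "https://example.org/rdf4j-server/repositories/bok"

    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery("""
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        SELECT ?concept ?label WHERE {
            ?concept a skos:Concept ;
                     skos:prefLabel ?label .
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["concept"]["value"], "-", row["label"]["value"])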

2.2    The Information Model
The triples representing the OGC Body of Knowledge in the VKS implement the
Simple Knowledge Organization System (SKOS) standard of the World Wide Web
Consortium (W3C) (http://www.w3.org/TR/skos-reference). SKOS is designed to enable
concepts to be composed and published on the World Wide Web, linked with data on
the Web and integrated into concept schemes. The data model defined by SKOS
specifies Concept, ConceptScheme and Collection types that together make it possible
to represent knowledge organization systems such as thesauri, classification schemes,
subject heading lists, taxonomies, folksonomies, and other types of vocabularies.




Fig. 1. UML Class model of the SKOS representation of element types in the OGC Body of
Knowledge.


SKOS is encoded as an application profile of the Resource Description Framework
(RDF). As illustrated in Figure 1 above, the information model makes use of the
skos:Concept class, which can represent a notion of a thing; the skos:ConceptScheme
class, which represents an aggregation of concepts; and SKOS semantic relations, which
are links between SKOS concepts where the link is inherent in the meaning of the linked
concepts. For example, skos:related was used to indicate the relationships between a
Standard and its Learning Objectives, Executable Test Suites, Example Usage, and
Standards Working Group.
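
A minimal sketch of this modelling pattern, written with the rdflib library and using
hypothetical URIs and labels in place of actual VKS entries, is shown below:

    # Sketch only: the SKOS pattern described above, with placeholder URIs.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, SKOS

    BOK = Namespace("https://example.org/bok/")  # hypothetical namespace

    g = Graph()
    scheme = BOK["standards"]
    standard = BOK["standard/wfs"]
    test_suite = BOK["ets/wfs"]

    # A concept scheme aggregating Standard concepts
    g.add((scheme, RDF.type, SKOS.ConceptScheme))

    # A Standard represented as a skos:Concept within the scheme
    g.add((standard, RDF.type, SKOS.Concept))
    g.add((standard, SKOS.prefLabel, Literal("Web Feature Service", lang="en")))
    g.add((standard, SKOS.inScheme, scheme))

    # skos:related links the Standard to, for example, an Executable Test Suite
    g.add((test_suite, RDF.type, SKOS.Concept))
    g.add((standard, SKOS.related, test_suite))

    print(g.serialize(format="turtle"))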

2.3     Context within OGC Infrastructure
The OGC Body of Knowledge sits within the Member Support part of OGC
infrastructure, as shown in Figure 2. The blue-filled boxes are those that have been
implemented to date, and the grey-filled boxes are those that are under development.
The OGC Body of Knowledge is represented in the diagram with a black-filled box.
Some of the boxes are labelled as TBA (To be added) to indicate that more content is
yet to be added to the VKS. The figure also highlights that the body of knowledge is
one of a series of tools intended to support OGC Members.




Fig. 2. A high-level overview of the infrastructure within which the OGC Body of Knowledge
(BoK) sits.


The Inference and Processing Layer shown in Figure 2 represents a series of tools that
apply a semi-automated approach to processing of content (i.e., involving a
combination of automated and manual steps) in order to build the body of knowledge.
The steps can be summarized as follows:

      Step 1. Configure extraction rules for the types of documents that are to be
      processed (manual)
      Step 2. Configure rules for cross-referencing extracted content to third party
      content (manual)
      Step 3. Extract the content from the documents (automated)
      Step 4. Transform the extracted content into triples (automated)
      Step 5. Review the triples to verify that they are correct (manual)
      Step 6. Load the triples into the triple store (automated)
      Step 7. Optionally generate an asciidoc document from the triples (automated; a
      sketch follows the list below)
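
As a sketch of Step 7 (the actual document generator is not described in detail in this
paper, so the graph layout and output structure below are assumptions), an asciidoc
fragment could be produced by walking the SKOS triples:

    # Sketch only: generate a simple asciidoc document from SKOS triples.
    # File names and the expected graph content are placeholders.
    from rdflib import Graph
    from rdflib.namespace import RDF, SKOS

    g = Graph()
    g.parse("bok.ttl", format="turtle")

    lines = ["= OGC Body of Knowledge", ""]
    for concept in g.subjects(RDF.type, SKOS.Concept):
        label = g.value(concept, SKOS.prefLabel)
        definition = g.value(concept, SKOS.definition)
        lines.append(f"== {label}")
        if definition:
            lines.append(str(definition))
        lines.append("")

    with open("bok.adoc", "w", encoding="utf-8") as f:
        f.write("\n".join(lines))

The resulting bok.adoc file can then be compiled to HTML or PDF with asciidoctor.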

One of the areas where manual processing became unavoidable was the cross-
referencing of content from the OGC Library with content from third parties. For
example, several OGC standards identify media types (formerly known as MIME types)
that are to be used with specific encodings. As the global register of media types is
maintained by the Internet Assigned Numbers Authority (IANA), it was necessary to
manually review the OGC standards and cross-reference them to the appropriate IANA
resources. Once codified as triples, that information could then be queried by an
application.
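
The paper does not state which predicate records these cross-references, so the sketch
below assumes rdfs:seeAlso and a placeholder concept URI for an OGC standard; the IANA URI
follows the public registry's URL pattern:

    # Sketch only: cross-reference a standard (placeholder URI) to an IANA media type entry.
    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import RDFS

    BOK = Namespace("https://example.org/bok/")  # hypothetical namespace
    gml_media_type = URIRef(
        "https://www.iana.org/assignments/media-types/application/gml+xml")

    g = Graph()
    g.add((BOK["standard/gml"], RDFS.seeAlso, gml_media_type))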


3         Results

3.1       Extraction
Extraction of content from recent documents (post-2013) was found to be generally
easier to automate than extraction from historical documents. Documents published by
the consortium after 2013 are available in both HTML and PDF format, whereas earlier
OGC documents were mainly published in PDF format, and HTML documents offer more
structure than text extracted from PDF documents. A semi-automated approach was
therefore seen as appropriate for processing the content. Some of the tools that helped
with the extraction of content were (a minimal extraction sketch follows the list):

      ●    Apache POI (https://poi.apache.org): A toolkit for creating and maintaining
           content in Office Open XML and other Microsoft Office file formats.
      ●    Apache Tika (https://tika.apache.org): A toolkit for detecting and extracting
           metadata and text from over a thousand different file types (such as PPT, XLS,
           and PDF).
      ●    Dstl Baleen (https://github.com/dstl/baleen): An extensible text processing
           capability that allows entity-related information to be extracted from
           unstructured and semi-structured data sources.
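
As a minimal extraction sketch (assuming the tika-python binding, which starts or connects
to a Tika server, and a placeholder document path), the raw text and metadata of a PDF
could be pulled out before the transformation step:

    # Sketch only: extract text and metadata from a PDF with the tika-python binding.
    # The document path is a placeholder; metadata keys vary by file type.
    from tika import parser

    parsed = parser.from_file("ogc-standard.pdf")
    metadata = parsed.get("metadata", {})
    text = parsed.get("content") or ""

    print(metadata.get("title"), "-", len(text.split()), "words extracted")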

3.2       Loading
The extracted triples are imported into the VKS through the Definitions Server’s
Administration tool, which is built on the Django framework. Django is a free and
open-source Python web framework that enables developers to implement solutions based
on the model-template-view architectural pattern, which separates the concerns of an
application’s model from its views. The Administration tool stores a snapshot of the
SKOS file and then transmits the triples to an RDF4J Workbench instance, which exposes
the triples through a SPARQL interface. A screenshot of the Administration tool is shown
in Figure 3.
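
As a sketch of this final transmission step (the repository URL and file name are
placeholders; the RDF4J REST API accepts Turtle posted to a repository’s statements
endpoint), triples could be loaded as follows:

    # Sketch only: append Turtle-encoded triples to an RDF4J repository via its REST API.
    # The repository URL and file name are placeholders.
    import requests

    STATEMENTS_URL = "https://example.org/rdf4j-server/repositories/bok/statements"

    with open("bok.ttl", "rb") as f:
        response = requests.post(
            STATEMENTS_URL,
            data=f,
            headers={"Content-Type": "text/turtle"},
        )
    response.raise_for_status()  # RDF4J responds with 204 No Content on success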




Fig. 3. A screenshot of the Definitions Server’s Administration tool.

The Administration tool provides a range of functions to support consistency of the VKS:

      ●    entailment to standardise views of SKOS data
      ●    specialised importers for various canonical forms of content, such as XML based forms
           and dictionaries encoded in Geography Markup Language (GML)
      ●    option for manual entry
      ●    association of metadata with content sources
      ●    assignment of governance rights
      ●    review mode (publishing to staging repositories)
      ●    batch loading facilities


3.3       Governance
A key part of ensuring that the body of knowledge can be maintained long term is the
establishment of a process for governance of information content. OGC already had a
sub-committee, called the OGC Naming Authority, that was responsible for managing
content in the Definitions Server and the Glossary of Terms. Therefore, the scope of
the Naming Authority was expanded to include management of the OGC Body of
Knowledge. This meant that the item registration process adopted by the Naming
Authority could then also be applied to the body of knowledge. An illustration of the
item registration process, which is based on the one specified in the ISO 19135 standard
(https://www.iso.org/standard/54721.html), is shown in Figure 4.




Fig. 4. Item registration process adopted by the OGC Naming Authority (adapted from ISO
19135).


4      Discussion

4.1    Separation of Content from Presentation
All of the explicit knowledge that is presented in the OGC Body of Knowledge is
formally captured in the VKS as knowledge graphs consisting of concepts and their
relationships. This approach makes it possible to separate content from presentation, or
more specifically, it makes the explicit knowledge independent of the medium through
which it is presented. The rationale for this separation of concerns is that the explicit
knowledge available in the VKS can be reused by other systems to address needs that may
not have been foreseen.

Separation of the VKS into separate knowledge graphs preserves the data provenance
and supports batch replacement of modules with updates. Updates may add additional
metadata, container (grouping) views, links to related concepts, etc. URIs and content
remain stable, with status flags used to deprecate superseded content.

Using Linked Data and Content Negotiation by Profile approaches, graphs can be
accessed in various forms by setting appropriate HTTP headers. Humans using a web
browser can access details of individual terms, or navigate to the containing graph
objects via the skos:inScheme predicate. Applications can access the RDF-encoded SKOS
content from the same URIs using different HTTP Accept headers.
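
As a sketch (the concept URI below is a placeholder under the Definitions Server base
URI), the two kinds of access differ only in the request headers:

    # Sketch only: retrieve HTML and RDF (Turtle) views of the same concept URI.
    # The concept URI is a placeholder; actual URIs are assigned by the OGC Naming Authority.
    import requests

    concept_uri = "http://www.opengis.net/def/example-concept"

    html_view = requests.get(concept_uri, headers={"Accept": "text/html"})
    rdf_view = requests.get(concept_uri, headers={"Accept": "text/turtle"})

    print(html_view.headers.get("Content-Type"))
    print(rdf_view.headers.get("Content-Type"))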

4.2    Coverage
The current scope of the OGC Body of Knowledge is limited to introductions to
concepts in OGC standards, learning objectives related to those standards, references
to compliance tests for those standards, and examples of innovation initiatives that
reference the standards. As such, the OGC Body of Knowledge serves as an informative
reference and is not intended to replace any OGC standard. In fact, it is intended to
help direct a reader towards relevant OGC standards where normative content can be
found. Moreover, the OGC Body of Knowledge is intended to be complementary to
other educational resources such as the OGC e-Learning resources and the OGC
website.

4.3      Future
The architecture and resource model for the OGC Body of Knowledge permits other
web-accessible knowledge resources to be linked to or enhanced by this content. This
was successfully demonstrated through the integration of the Cadastre and Land
Administration Thesaurus (CaLAThe) into the OGC Definitions Server – a part of the
VKS [4]. OGC and its partner organizations are working to establish such links to
improve the consistency of geospatial knowledge and ensure that the most authoritative
source for any particular type of knowledge can be easily discovered and reused.

Another area identified for future work is that of automation. Natural Language
Processing (NLP) tools, such as OpenNLP, enable the parsing of sentences. Therefore,
an ability to identify elements of sentences in natural language text could potentially
support the automatic formulation of statements as triples. Another area for future work
is the harmonization of vocabularies for registers. Such harmonization would have to
be based on Linked Data technologies to ensure that the vocabularies can be used by
the wider Semantic Web.



5        Conclusions
The combination of SKOS and asciidoctor proved to be effective at enabling the
representation of the OGC Body of Knowledge in different forms to facilitate both
machine and human interpretation. While the development of the OGC Body of
Knowledge will continue for some time to come, the approach taken thus far has shown
that Linked Data technologies such as SKOS and SPARQL services can aid the
development of such bodies of knowledge.



References
    1.   Hobona, G.: OGC Body of Knowledge - Version 0.1 - Discussion Paper. Open
         Geospatial Consortium, (2020) https://docs.opengeospatial.org/dp/19-077.html
    2.   Hart, H., Baehr, C.: Sustainable Practices for Developing a Body of Knowledge.
         Technical Communication 60(4), 259-266 (2013)
    3.   Svensson, L.G., Atkinson, R., Car, N.: Content Negotiation by Profile. W3C Working
         Draft, Dataset Exchange Working Group (2019) https://www.w3.org/TR/dx-prof-conneg/
    4.   Stubkjær, E., Çağdaş, V.: Alignment of standards through semantic tools – The case of
         land administration. Land Use Policy. 104, 105381 (2021).