=Paper=
{{Paper
|id=Vol-1862/paper-05
|storemode=property
|title=LDWPO A Lightweight Ontology for Linked Data Management
|pdfUrl=https://ceur-ws.org/Vol-1862/paper-05.pdf
|volume=Vol-1862
|authors=Sandro Rautenberg,Ivan Ermilov,Edgard Marx,Sören Auer
|dblpUrl=https://dblp.org/rec/conf/ontobras/RautenbergEMA16
}}
==LDWPO A Lightweight Ontology for Linked Data Management==
LDWPO – A Lightweight Ontology for Linked Data Management
Sandro Rautenberg1, Ivan Ermilov2, Edgard Marx2, Sören Auer3
1 Computer Science Department – Midwestern State University (UNICENTRO), PO Box 730 – Postal Code 85.015-430 – Guarapuava – PR – Brazil
2 AKSW, Institute of Computer Science, University of Leipzig, Leipzig, Germany
3 University of Bonn and Fraunhofer IAIS, Bonn, Germany
srautenberg@unicentro.br
Abstract. Managing the lifecycle of RDF datasets is a cumbersome activity. Substantial
efforts are spent on reproducing datasets over time. These efforts, however, can be reduced
by a data management workflow framework. We present the Linked Data Workflow Project
ontology as the knowledge model for such a workflow framework. The ontology is centered
on the Plan, Method, and Execution classes, facilitating the description of: i) the
methodological process that guides the lifecycle of RDF datasets, ii) the complete plan of
the RDF dataset production workflow, and iii) the executions of that workflow. As a result,
our approach enables the reproducibility and repeatability of Linked Data processing
steps over time.
1. Introduction
In the context of the Web of Data, the management of data collections encoded according
to the Resource Description Framework (RDF datasets1) has mainly focused on developing
tools that support individual aspects of Linked Data Management (extraction,
mapping/transformation, quality assessment/repairing, linking, and publishing/visualization).
Consequently, managing the complete lifecycle of RDF datasets over time can
become a problem, due to the myriad of tools, environments, and data sources. That
lifecycle thus requires substantial management effort for detailing provenance, ensuring
reproducibility, and dealing with repeatability issues.
To facilitate data management, workflow and provenance ontologies (or
vocabularies) can be used to describe and automate the linked data lifecycle. Scufl2
[Hull et al. 2006] and Kepler [Ludäscher et al. 2006] are examples of such ontologies
used as knowledge models in Workflow Management Systems. With regard to
ontology engineering best practices, those ontologies reveal important limitations. Scufl2
is not available2 and the Kepler ontologies do not detail their elements with human-readable
descriptions. These limitations hinder the adoption of those ontologies, mainly for: i)
1 Formally, it is a dataset "used to organize collections of RDF graphs, and comprise a default graph and zero or more named graphs" [W3C 2014].
2 The ontology is not published at http://taverna.incubator.apache.org/documentation/scufl2/ontology (accessed 27-10-2015, 17:00).
reusing them as knowledge sources in other ontology developments; ii) extending them for
sharing information among systems. Taking the provenance perspective into account, the
PROV ontology (PROV-O) [Lebo et al. 2015] and the Open Provenance Model Vocabulary
(OPMV) [Moreau et al. 2011] can be adopted. However, they lack crucial concepts to
describe the plan and execution perspectives of a workflow. In a nutshell, PROV-O and
OPMV are insufficient for describing, at the same time, the strategy (plan) and operation
(execution) aspects of (re)producing RDF datasets.
Tackling the limitations of existing approaches, we model a lightweight
ontology for orchestrating linked data processing workflows, dubbed the Linked
Data Workflow Project ontology (LDWPO). To develop LDWPO, we ap-
plied artifacts and best practices from On-to-Knowledge [Sure and Studer 2002],
METHONTOLOGY [Gomez-Perez et al. 2004], and the Ontology Development 101
Guide [Noy and McGuinness 2001]. Inspired by other knowledge sources, LDWPO
standardizes the Method, Plan, and Execution concepts for guiding the production
and maintenance of RDF datasets. It is noteworthy that LDWPO is already used
as the knowledge model in LODFlow [Rautenberg et al. 2015], an environment for
planning, executing, and documenting workflows for linked data. LDWPO was verified in
large-scale real-world use cases, expressing: i) the creation of RDF datasets according to
a methodological process; ii) the planning of RDF dataset maintenance at a high level of
abstraction, thus enabling provenance extraction and reproducibility over time; and iii) the
execution of workflows for RDF dataset (re)production in a (semi-)automated way,
using Linked Data Stack technologies [Auer et al. 2012].
The article is structured as follows: The LDWPO scope and purposes are presented
in Section 2. Section 3 discusses the main concepts of LDWPO. Section 4 describes
the LDWPO evaluation with two large-scale real-world use cases for promoting the
knowledge management in a Brazilian university. Section 5 presents related work that is
complementary to LDWPO. Finally, Section 6 outlines conclusions and some directions
for future work.
2. Preliminaries
LDWPO’s scope is limited to the linked data domain, extending concepts for methodologi-
cally planning and executing the (re)production of RDF datasets. The main requirements
addressed by the LDWPO are:
1. describing the methods which establish the process, activities, and tasks for pro-
ducing plans of RDF datasets;
2. representing the plans as workflows for (re)producing RDF datasets over time. It
is achieved by specifying a list of steps, where each step corresponds to a tool
invocation using a specific tool configuration, as well as input and output datasets;
3. supporting the reuse of workflows for guiding the (re)production of RDF datasets
over time;
4. mediating the automation of workflows, which involves a controlled environment
for a plan execution. It should be achieved by running tools with tool configurations
over input datasets as previously planned;
5. preserving workflow execution logs for checking the correctness or repeatability of
the results; and
6. reporting projects of RDF datasets, their workflow plans and executions in human-readable formats.
Considering the scope, purposes, and competency questions3, we listed adherent
ontologies and vocabularies, aiming at reusing existing concepts and properties. We
identified the Publishing Workflow Ontology (PWO) [Gangemi et al. 2014], the Open
Provenance Model Vocabulary (OPMV) [Moreau et al. 2011], and the PROV Ontology
(PROV-O) [Lebo et al. 2015]. These approaches are suitable for describing the execution
side of RDF dataset maintenance and, therefore, can answer questions about what was done during the
RDF dataset maintenance. However, these works do not include important concepts such
as method and plan. In particular, the method concept can answer questions about how or
why to proceed. Instances of this concept support a knowledge engineering perspective
of linked data, where an established process is related to the knowledge level of lifecycle,
standards, and best practices [Bourque and Fairley 2004]. The plan concept answers
questions about the actions related to a workflow or simply what to do with something over
time. Instances of plan are related to the knowledge level for scheduling the tools, steps,
and resources [WFMC 1995], supporting the lifecycle of RDF datasets in a systematic
way.
In this way, we propose LDWPO4 as a new and complementary ontology that
supports the method, plan, and execution concepts for better representing and
implementing RDF dataset maintenance.
3. LDWPO in a Nutshell
In LDWPO (Figure 1), the main concepts are dubbed with the prefix “LDW”, specializing
some general concepts to the context of workflows for RDF dataset (re)production.
The starting point in LDWPO is the LDWProject concept, a description of a
project for creating/maintaining RDF datasets. Among its properties, LDWProject is
associated with a set of LDWorkflows. An LDWorkflow embodies the plan necessary
to (re)produce RDFDatasets, encapsulating a linked list of LDWSteps. LDWStep
is a concept that represents an atomic and reusable unit of an LDWorkflow. It de-
scribes a set of procedures over a set of input Datasets, using a Tool with a Tool
Configuration, in order to produce a set of output Datasets. An LDWStep can
be reused, which means that the same LDWStep can be associated with one or more
LDWorkflows within existing LDWProjects. In addition, an LDWStep can be automatically
executed in a computational environment upon a user request. We exemplify the
automatization in more detail in Section 4, with real-world use cases.
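As an illustration, a minimal LDWProject instance could be sketched in Turtle as follows. This is only a sketch: the ldwpo: namespace IRI and all property names below are illustrative assumptions, not the ontology's verbatim vocabulary.

```turtle
@prefix ldwpo: <http://purl.org/ldwpo/core#> .   # hypothetical namespace
@prefix ex:    <http://example.org/> .

# A project that maintains an RDF dataset through one workflow
ex:MyProject a ldwpo:LDWProject ;
    ldwpo:hasLDWorkflow ex:MyWorkflow .          # property name is illustrative

# The workflow encapsulates a linked list of steps
ex:MyWorkflow a ldwpo:LDWorkflow ;
    ldwpo:hasFirstStep ex:Step1 .

# An atomic, reusable step: a tool invocation with a configuration,
# consuming input Datasets and producing output Datasets
ex:Step1 a ldwpo:LDWStep ;
    ldwpo:usesTool ex:SomeTool ;
    ldwpo:usesToolConfiguration ex:SomeToolConfig ;
    ldwpo:hasInputDataset ex:InputDataset ;
    ldwpo:hasOutputDataset ex:OutputDataset ;
    ldwpo:nextStep ex:Step2 .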
An LDWorkflow can be reused as a Plan in Executions at any particular
point in time. In LDWPO, the concept for describing an LDWorkflow execution instantiation
is LDWorkflowExecution. Each LDWorkflowExecution aggregates
the sequence of LDWStepExecutions closely related to the sequence of LDWSteps of a
given LDWorkflow. In other words, it provides a representation for automating the execution
of workflows, by running tools with tool configurations over datasets as previously
3 A detailed technical report is available at: https://github.com/AKSW/ldwpo/blob/master/misc/technicalReport/LDWPO_technicalReport.pdf
4 The ontology is available at: https://github.com/AKSW/ldwpo/blob/master/1.0/ldwpo.owl
Figure 1. The LDWPO model.
planned. During the execution, the LDWStepExecutions can generate Messages,
such as logging reports, and Statuses, such as successfully finished, unsuccessfully finished,
aborted, etc. In this way, a whole LDWorkflowExecution can register the reproducibility
information of an LDWorkflow for checking the correctness or repeatability of
RDFDatasets. Furthermore, this kind of information can be used for reproducing the same
result over time.
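To make the execution side concrete, the following Turtle sketch pairs a hypothetical planned workflow with one of its recorded runs. Again, the namespace and every property name are illustrative assumptions rather than the ontology's actual terms.

```turtle
@prefix ldwpo: <http://purl.org/ldwpo/core#> .   # hypothetical namespace
@prefix ex:    <http://example.org/> .

# One dated run of a planned workflow
ex:MyWorkflowExecution a ldwpo:LDWorkflowExecution ;
    ldwpo:executesPlan ex:MyWorkflow ;           # links back to the LDWorkflow plan
    ldwpo:hasLDWStepExecution ex:Step1Execution .

# A step execution mirrors its planned LDWStep and records the outcome
ex:Step1Execution a ldwpo:LDWStepExecution ;
    ldwpo:executesStep ex:Step1 ;
    ldwpo:hasStatus ex:SuccessfullyFinished ;    # Status instance
    ldwpo:hasMessage "Resources converted without errors." .  # Message
```

Recording the status and message per step is what later allows a whole run to be checked for correctness or repeated.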
Another important concept in LDWPO is Task. This class is an atomic unit
of the abstract Method concept and represents a best practice covered by the LDWSteps.
When an LDWStep is planned, it can be related to a set of Tasks, which must be
accomplished during LDWStepExecutions. Examples of Tasks are: a) reusing established
vocabularies, b) describing an RDF dataset with the Vocabulary of Interlinked Datasets
(VoID), c) establishing an open license, d) keeping RDF dataset versioning information, or e)
providing human-readable data descriptions. Relating Tasks to LDWSteps can be useful
for data engineers to describe LDWorkflows at a higher level of abstraction, akin to a
software development process in the Software Engineering context. This methodological
perspective of LDWPO is depicted in Figure 2. As illustrated, the Linked Data
Lifecycle [Auer et al. 2012] is instantiated as the Process. Additionally, the Extraction
Activity of that Process is related to the Reusing vocabularies Task. In a given
LDWProject, an instance of Task is associated with an LDWStep instance, making
explicit a relationship between an LDWorkflow unit and a best practice. As a consequence,
considering that the lifecycle of resources can be followed in a particular LDWorkflow,
when describing LDWSteps with LDWPO, we can understand how an RDF dataset is
(re)produced at the level of methodological approaches.
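In Turtle, this methodological layer might be sketched as follows (all names are illustrative assumptions):

```turtle
@prefix ldwpo: <http://purl.org/ldwpo/core#> .   # hypothetical namespace
@prefix ex:    <http://example.org/> .

# The Method side: a Process with an Activity and a Task
ex:LinkedDataLifecycle a ldwpo:Process ;
    ldwpo:hasActivity ex:Extraction .
ex:Extraction a ldwpo:Activity ;
    ldwpo:hasTask ex:ReusingVocabularies .
ex:ReusingVocabularies a ldwpo:Task .

# Relating a planned workflow unit to the best practice it covers
ex:Step1 a ldwpo:LDWStep ;
    ldwpo:accomplishesTask ex:ReusingVocabularies .
```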
[Figure content: example instances of the LDWPO concepts. The LDWProject QualisBrasil is planned through the LDWorkflow Maintain QualisBrasil; one of its LDWSteps applies Sparqlify to convert CSV data into RDF resources and is related to the Task "Reusing vocabularies instead of (re)inventing", which belongs to the Extraction Activity of the Linked Data Lifecycle Process. A recorded LDWorkflowExecution and its LDWStepExecutions document the produced RDF resources.]
Figure 2. Exemplifying the LDWPO expressiveness.
4. LDWPO in Use
In this section, we describe how LDWPO supports the maintenance and publication of
5 star RDF datasets5 . In particular, these datasets support a Knowledge Management
project in a Brazilian university.
4.1. Data Sources
4.1.1. Qualis dataset
One of the data sources originates from Qualis, a dataset created and used by the Brazilian
Research Community that provides a complete view of research in and related to Brazil.
The Qualis dataset encompasses indirect scores6 for research papers in journals, according to
5 For more information, please see the data classification system proposed by Tim Berners-Lee at http://5stardata.info/
6 A typical entry of Qualis consists of ISSN, journal name, related knowledge field, and qualified journal score.
the relevance of the journal to the knowledge fields (computer science, chemistry, medicine,
among others). Qualis is used in bibliometric/scientometric assessments and for ranking
post-graduate programs, research proposals, or individual research scholarships.
Although a web interface7 is publicly available for querying Qualis data, it has
several limitations: i) historical data is not available, making it difficult to perform time
series studies; ii) in the early years, the data was available only as 1 Star Data (i.e.
Portable Document Format - PDF) in an outdated web interface; iii) only the latest versions
of the dataset are available for download as spreadsheets (MS Excel file extension -
XLS); and iv) the data is not linked to other datasets, which makes its use challenging.
4.1.2. Lattes Curriculum dataset
Another data source is the Lattes Platform8, an integrated information system maintained by
the Ministry of Science, Technology and Innovation of Brazil. It is used to manage public
information of individual researchers, groups, and institutions based in Brazil. The Lattes
Curriculum9 (CVLattes) is the core component of the Lattes Platform. A CVLattes contains
information about personal activities and achievements such as teaching, research projects,
patents, technological products, publications, and awards. The maintenance of such
information requires manual input via a web interface by individual researchers. CVLattes is
used to evaluate the competence of researchers/institutions for funding research proposals.
CVLattes is publicly available via a graphical web interface, which implements
security measures (e.g. CAPTCHA10) preventing crawlers from extracting the data from the
platform. Therefore, automatic data extraction from CVLattes requires sophisticated
crawling mechanisms. From a knowledge management perspective, we consider the scenario in
which a university can access CVLattes via a formal request. Upon such a request, Brazilian
universities can extract a view of their researchers for loading data into internal databases.
4.2. The Use Cases
In our vision, scientific knowledge management in universities will benefit from a
knowledge management instrument called Yellow Pages. Yellow Pages facilitates the
identification of responsible parties ("who knows what", location and description) and creates
opportunities for sharing organizational knowledge. The value of such a system directly
depends on the data (descriptions of skills, experiences of the groups/individuals,
etc.) being up-to-date [Keyes 2006].
To enable Yellow Pages for Brazilian universities, we consider: a) an integration
of the Qualis and CVLattes datasets; and b) maintenance of the Yellow Pages knowledge base
in a systematic way. To achieve these goals, we use LDWPO to support the orchestration
of knowledge bases. For the integration of the Qualis and CVLattes datasets, we instantiated
two LDWProjects: QualisBrasil and PeriodicalPapers (depicted in Figure 3).
7 https://sucupira.capes.gov.br/sucupira/public/consultas/coleta/veiculoPublicacaoQualis/listaConsultaGeralPeriodicos.jsf
8 A web interface is available at: lattes.cnpq.br
9 An example of CVLattes can be accessed at http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4787027P5&idiomaExibicao=2
10 Acronym for Completely Automated Public Turing test to tell Computers and Humans Apart. In computing, it is used to check whether or not the user is human.
[Figure content: the LDWProjects QualisBrasil and PeriodicalPapers, each pairing an LDWorkflow with an LDWorkflowExecution for 2014; the LDWSteps (Retrieve raw data, Convert CSV to RDF, Load into a triplestore, Interlink with DBpedia) are mirrored by corresponding LDWStepExecutions.]
Figure 3. LDWProjects provide a pipeline for upgrading Qualis and CVLattes data sources up to 5 Stars Linked Data.
The QualisBrasil LDWProject is based on the Maintain QualisBrasil LDWorkflow,
which is composed of five LDWSteps, as follows:
1. LDWStep a retrieves data from a legacy database and saves it in a Comma
Separated Values (CSV) format;
2. LDWStep b converts the CSV data to the Qualis RDFDataset, using the transformation
tool Sparqlify11;
3. LDWStep c updates a graph12 in a triple store with the generated resources;
4. LDWStep d interlinks the resulting Qualis RDFDataset with DBpedia13 data,
using the link discovery tool LIMES14. The linking considers the International
Standard Serial Number (ISSN) and the rdfs:seeAlso property; and
5. LDWStep e loads the acquired links into the triple store.
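For instance, the step that invokes Sparqlify over the CSV export could be described along these lines. The namespace and property names are illustrative assumptions; only the tool and dataset roles come from the workflow above.

```turtle
@prefix ldwpo: <http://purl.org/ldwpo/core#> .            # hypothetical namespace
@prefix ex:    <http://example.org/qualisbrasil/> .       # hypothetical project IRIs

ex:ConvertCSVtoRDF a ldwpo:LDWStep ;
    ldwpo:usesTool ex:Sparqlify ;
    ldwpo:usesToolConfiguration ex:QualisSparqlifyMapping ;  # the Sparqlify mapping
    ldwpo:hasInputDataset ex:QualisCSV ;
    ldwpo:hasOutputDataset ex:QualisRDFDataset ;
    ldwpo:nextStep ex:LoadIntoTriplestore .
```

Because the step only references the tool, its configuration, and the datasets, the same description can be reused in other LDWorkflows, e.g. for a later year of Qualis data.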
PeriodicalPapers is an LDWProject which converts the paper references from
scientific journals to linked data. The Maintain Periodical Papers References LDWorkflow
consists of three LDWSteps:
1. LDWStep a retrieves the data from a legacy database and saves it in a CSV format;
2. LDWStep b converts the CSV data to the PeriodicalReferences
RDFDataset using Sparqlify; and
3. LDWStep c updates a graph15 in a triple store with the RDFDataset.
11 http://aksw.org/Projects/Sparqlify.html
12 Published on datahub at http://datahub.io/dataset/qualisbrasil and publicly available at http://lodkem.led.ufsc.br:8890/sparql, graph name: "http://lod.unicentro.br/QualisBrasil/".
13 DBpedia is a community effort to extract structured information from Wikipedia and to make this information accessible on the Web [Lehmann et al. 2009].
14 http://aksw.org/Projects/LIMES.html
15 Published on datahub at https://datahub.io/dataset/lattes-production and publicly available at http://lodkem.led.ufsc.br:8890/sparql, graph name: "http://lod.unicentro.br/LattesProduction/".
[Figure content: the integrated knowledge base schema. On the QualisBrasil LDWProject side, a qualis:Evaluation has a journal (bibo:issn, foaf:name), a qualis:YearEvaluation, a qualis:KnowledgeField (dc:title, dc:identifier), and a qualis:Score (rdf:value). On the PeriodicalPapers LDWProject side, a bibtex:Article has a bibo:Journal, a year (bibtex:hasYear), a title (dc:title), and authors (bibtex:hasAuthor, dc:contributor) connected to foaf:Person and foaf:Group via foaf:member and foaf:name.]
Figure 4. Representing the knowledge base for the Yellow Pages System.
For the execution of LDWorkflows, we developed the Linked Data Workflow
Execution Engine for LDWPO (LODFlow Engine16). This tool retrieves the LDWProjects
and LDWorkflows from the LDWPO knowledge base and manages the pipeline for
producing the RDF datasets in an automated fashion. Using the LODFlow Engine and the
LDWorkflow definitions, we generated 698 668 interlinked entities for Qualis. For the
PeriodicalPapers LDWProject, the LODFlow Engine generated 5 557 entities, representing
the periodical paper references of 630 researchers related to a Brazilian university.
The resulting RDF datasets of Qualis and PeriodicalPapers provide a foundation
for the Yellow Pages system. In other words, the resulting knowledge base (depicted
in Figure 4) integrates the data from heterogeneous sources, enabling new knowledge
management perspectives. For example, there is a limitation on classifying periodical
papers according to the journal scores: commonly, this requires manual effort and, generally,
covers only one knowledge field. Using the resulting knowledge base and appropriate
SPARQL17 queries, the periodical papers can be classified more efficiently, considering the
research group units and/or knowledge fields. In this case, the SPARQL query in the listing
below can be customized for exploring new scientometric scenarios. These scenarios could
include questions such as:
• What are the main competences of my university in the specific knowledge fields?
• Which researchers in my university work together in a particular knowledge field?
• Which researchers in my university could possibly work together in a research
project of a particular knowledge field? (finding possibilities of a collaboration)
• Which researchers should collaborate to improve the university key performance
indicators?
Such questions are easily formulated by research supervisors inside universities,
but are hardly answered by external researchers, who have university and institution web
sites as their main information sources. We argue that the use of Yellow Pages, supported by a
knowledge base that evolves semantically, can be a cornerstone for sharing knowledge
inside and outside a university.
16 https://github.com/AKSW/LODFlow/tree/master/tools/LODFlowEngine
17 A recursive acronym for SPARQL Protocol and RDF Query Language.
PREFIX rdfs:
PREFIX rdf:
PREFIX dc:
PREFIX foaf:
PREFIX bibo:
PREFIX bibtex:
PREFIX prod:
PREFIX qualis:

SELECT ?qualisYearEvaluationValue ?qualisKnowledgeFieldTitle ?qualisScoreValue (COUNT(*) AS ?qtde)
WHERE {
  ?evaluation rdf:type qualis:Evaluation .
  ?evaluation qualis:hasJournal ?qualisJournal .
  ?evaluation qualis:hasYearEvaluation ?qualisYearEvaluation .
  ?evaluation qualis:hasKnowledgeField ?qualisKnowledgeField .
  ?evaluation qualis:hasScore ?qualisScore .
  ?qualisJournal bibo:issn ?qualisJournalId .
  ?qualisYearEvaluation rdf:value ?qualisYearEvaluationValue .
  ?qualisScore rdf:value ?qualisScoreValue .
  ?qualisKnowledgeField dc:title ?qualisKnowledgeFieldTitle .
  ?paper rdf:type prod:PeriodicalPaper .
  ?paper bibtex:hasJournal ?paperJournal .
  ?paper bibtex:hasTitle ?paperTitle .
  ?paper bibtex:hasYear ?qualisYearEvaluationValue .
  ?paperJournal bibo:issn ?qualisJournalId .
  ?paperJournal foaf:name ?journalName .
}
GROUP BY ?qualisYearEvaluationValue ?qualisKnowledgeFieldTitle ?qualisScoreValue
5. Related Work
To the best of our knowledge, this work presents the first ontology focused on the concepts
of Method (process), Plan (provenance), and Execution (reproducibility) for publishing
linked data. There are, however, works targeting provenance and reproducibility.
For example, PWO [Gangemi et al. 2014] is an ontology for describing the work-
flows associated with the publication of a document. Using the core concepts of PWO, it
is possible to: i) define the initial Step for a given Workflow, ii) relate next/previous
Steps (therewith creating the Workflow) and iii) define the inputs and outputs for each
Step. OPMV [Moreau et al. 2011] is recommended as a model for data provenance,
which enables data publishing as well as data exchange between various systems. In
OPMV: i) a Process is controlled by an Agent; ii) a Process uses Artifacts
at certain time; iii) an Artifact is generated by a Process; iv) an Artifact can
be derived from another Artifact; and v) to execute the workflow, a Process triggers
a subsequent Process. However, OPMV does not define the concept of Plan
explicitly. PROV-O [Lebo et al. 2015] is the W3C recommendation for representing and
interchanging provenance and reproducibility information generated by different systems
and contexts. With the core concepts, in PROV-O: i) an Activity is associated with an
Agent; ii) also, an Entity is attributed to an Agent; iii) an Activity uses Entities
in an interval of time; iv) an Entity can be derived from another Entity; and v) to keep
the workflow, an Activity is associated (wasInformedBy) with another Activity. As
with OPMV, the concept of Plan cannot be entirely formulated. To overcome this limitation,
the Ontology for Provenance and Plans (P-Plan ontology) extends PROV-O, enabling the
publishing of a workflow plan and its execution(s) as linked data [Garijo and Gil 2012].
Considering a different domain of Linked Data, the scientific community coined
the term Scientific Workflow as “the automated process that combines data and processes
in a structured set of steps to implement computational solutions to a scientific
problem” [Altintas et al. 2006]. To facilitate workflows for data and control sequences,
Scientific Workflow Management Systems such as Apache Taverna [Hull et al. 2006] and
Kepler [Ludäscher et al. 2006] were developed. These management systems employ on-
tologies for modeling the workflows, such as Scufl2 and Kepler ontologies, respectively.
At the time of writing, the Scufl2 ontology is not available at the Taverna’s homepage.
Kepler ontologies are part of the Kepler framework and can be found in the source code.
Kepler ontologies do not include human-readable descriptions for concepts, as we show in
the following listing. Concept descriptions are required to facilitate the reuse of ontology
resources. In our vision, the absence of such descriptions limits the adoption of the Kepler
ontologies. To overcome the limitations of the Scufl2 and Kepler ontologies, we designed
LDWPO to support LODFlow, a customized Workflow Management System for Linked
Data Processing.
[...]
<owl:Class rdf:ID="Workflow">
  <rdfs:label>Workflow</rdfs:label>
</owl:Class>
[...]
<owl:Class rdf:ID="WorkflowOutput">
  <rdfs:label>Workflow Output</rdfs:label>
</owl:Class>
[...]
6. Conclusion, Limitations, and Future Work
In this paper, we presented Linked Data Workflow Project Ontology (LDWPO), an ontology
for supporting the RDF dataset maintenance. In our vision, an established process should
rule a workflow, which controls all computational procedures for maintaining an RDF
dataset over time. Focusing on provenance, reusability, and reproducibility issues, LDWPO
is aligned with existing vocabularies and ontologies, such as OPMV, PROV-O, and PWO.
The benefits of explicitness, reusability, and repeatability are observed when
LDWPO is applied. In particular, with the ontology, it is possible to create compre-
hensive workflow descriptions, preserving provenance information for reproducing the
LDWorkflows of an LDWProject. Moreover, technologically, it is possible to mediate
the use of tools, enabling the automatized execution of LDWorkflows in the context of
the Linked Data Stack and Linked Data Lifecycle.
With LDWPO we aimed to tackle one of the most pressing and challenging prob-
lems of Linked Data management – managing the lifecycle of RDF datasets over time,
considering the myriad of tools, environments, and resources. Given that supporting the
lifecycle of RDF datasets is currently a cumbersome activity, LDWPO, when applied more
widely, can provide a boost to the advancement and maturation of Linked Data
technologies.
Thus, we see this work as an important step in a large research agenda, which
aims at providing comprehensive workflow support for RDF dataset (re)production and
maintenance processes. As a first contribution, LDWPO is already used in a real-world
application for publishing scientometric resources in an automated fashion. More precisely,
a scientific journal index and journal paper entries are maintained as linked open data,
using LDWPO for promoting knowledge management in a Brazilian university. The
datasets are publicly available at http://lodkem.led.ufsc.br:8890/sparql.
In particular, the Qualis RDF dataset can be reused by the community in other studies in the
Information Science field.
As future work, we aim to maintain the developed ontology, as well as to adopt it in
further use cases. In the context of the Yellow Pages system, LDWPO can assist the knowledge
base expansion, considering the following scenarios:
1. Integration of new data sources, improving the knowledge base expressiveness (e.g.
research project descriptions, technological products, patents, and courses coming
from CVLattes, or other bibliometric scores such as the Journal Citation Reports
(JCR), SCImago Journal Rank (SJR), and Source Normalized Impact per Paper
(SNIP)).
2. Maintenance of the existing RDF datasets (e.g. CVLattes and Qualis) via continu-
ous execution of the LDWorkflows over time.
3. Data validation and debugging via repeating LDWorkflowExecutions, when
necessary.
4. Generation of the documentation for LDWProjects to support data engineers in
assessing quality issues.
In addition, we are working on incorporating LDWPO into a Linked Data
Stack tool, providing a fully integrated Workflow Management System for linked dataset
(re)production.
Acknowledgment
This work was supported by the Brazilian Federal Agency for the Support and Evaluation
of Graduate Education (CAPES/Brazil), under the Science without Borders program
(Process number 18228/12-7), and by the Araucária Foundation (Project number 601/14).
References
[Altintas et al. 2006] Altintas, I., Barney, O., and Jaeger-Frank, E. (2006). Provenance
collection support in the kepler scientific workflow system. In Moreau, L. and Foster,
I. T., editors, IPAW, volume 4145 of Lecture Notes in Computer Science, pages 118–132.
Springer.
[Auer et al. 2012] Auer, S., Bühmann, L., Dirschl, C., Erling, O., Hausenblas, M., Isele, R.,
Lehmann, J., Martin, M., Mendes, P. N., van Nuffelen, B., Stadler, C., Tramp, S., and
Williams, H. (2012). Managing the life-cycle of linked data with the LOD2 stack. In
Proceedings of International Semantic Web Conference (ISWC 2012).
[Bourque and Fairley 2004] Bourque, P. and Fairley, R. E. (2004). Guide to
software engineering body of knowledge. Retrieved October, 2014, from
http://www.computer.org/portal/web/swebok.
[Gangemi et al. 2014] Gangemi, A., Peroni, S., Shotton, D., and Vitali, F. (2014). A pattern-
based ontology for describing publishing workflows. In Proceedings of the 5th Workshop
on Ontology and Semantic Web Patterns (WOP2014) co-located with the 13th Inter-
national Semantic Web Conference (ISWC 2014), Riva del Garda, Italy, October 19,
2014., pages 2–13.
[Garijo and Gil 2012] Garijo, D. and Gil, Y. (2012). Augmenting prov with plans in p-plan:
Scientific processes as linked data. In Linked Science.
[Gomez-Perez et al. 2004] Gomez-Perez, A., Fernandez-Lopez, M., and Corcho, O. (2004).
Ontological Engineering: With Examples from the Areas of Knowledge Management,
E-Commerce and the Semantic Web, 1st Edition. Springer-Verlag, Heidelberg.
[Hull et al. 2006] Hull, D., Wolstencroft, K., Stevens, R., Goble, C., Pocock, M. R., Li, P.,
and Oinn, T. (2006). Taverna: a tool for building and running workflows of services.
Nucleic Acids Res, 34(Web Server issue):729–732.
[Keyes 2006] Keyes, J. (2006). Knowledge Management, Business Intelligence, and Content
Management: The IT Practitioner’s Guide. Auerbach Publications, 1 edition.
[Lebo et al. 2015] Lebo, T., Sahoo, S., McGuinness, D., Belhajjame, K., Cheney, J., Corsar,
D., Garijo, D., Soiland-Reyes, S., Zednik, S., and Zhao, J. (2015). PROV-O: The prov
ontology. Retrieved from http://www.w3.org/TR/prov-o/ on 13.01.2015.
[Lehmann et al. 2009] Lehmann, J., Bizer, C., Kobilarov, G., Auer, S., Becker, C., Cyganiak,
R., and Hellmann, S. (2009). DBpedia - a crystallization point for the web of data.
Journal of Web Semantics, 7(3):154–165.
[Ludäscher et al. 2006] Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones,
M., Lee, E. A., Tao, J., and Zhao, Y. (2006). Scientific workflow management and the
kepler system. Concurrency and Computation: Practice and Experience, 18(10):1039–
1065.
[Moreau et al. 2011] Moreau, L., Clifford, B., Freire, J., Futrelle, J., Gil, Y., Groth, P.,
Kwasnikowska, N., Miles, S., Missier, P., Myers, J., Plale, B., Simmhan, Y., Stephan,
E., and den Bussche, J. V. (2011). The open provenance model core specification (v1.1).
Future Generation Computer Systems (FGCS), 27(6):743–756.
[Noy and McGuinness 2001] Noy, N. F. and McGuinness, D. L. (2001). Ontology develop-
ment 101: A guide to creating your first ontology. Development, 32(1):1–25.
[Rautenberg et al. 2015] Rautenberg, S., Ermilov, I., Marx, E., Auer, S., and Ngonga Ngomo,
A.-C. (2015). LODFlow – a workflow management system for linked data processing. In
SEMANTiCS 2015.
[Sure and Studer 2002] Sure, Y. and Studer, R. (2002). On-To-Knowledge methodology. In
Davies, J., Fensel, D., and van Harmelen, F., editors, On-To-Knowledge: Semantic Web
enabled Knowledge Management, chapter 3, pages 33–46. J. Wiley and Sons.
[W3C 2014] W3C (2014). RDF 1.1 Concepts and Abstract Syntax.
http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/.
[WFMC 1995] WFMC (1995). The workflow reference model. Technical report, The
Workflow Management Coalition.