=Paper=
{{Paper
|id=None
|storemode=property
|title=Towards the evaluation of the LarKC Reasoner Plug-ins
|pdfUrl=https://ceur-ws.org/Vol-666/paper1.pdf
|volume=Vol-666
|dblpUrl=https://dblp.org/rec/conf/semweb/Huang10
}}
==Towards the Evaluation of the LarKC Reasoner Plug-ins==
Zhisheng Huang
huang@cs.vu.nl
Computer Science Department, Vrije Universiteit Amsterdam, The Netherlands
Abstract. In this paper, we present an initial framework for the evaluation and benchmarking of reasoners deployed within the LarKC platform, a platform for massive distributed incomplete reasoning that aims to remove the scalability barriers of currently existing reasoning systems for the Semantic Web. We discuss the evaluation methods, measures, benchmarks, and performance targets for the plug-ins to be developed for approximate reasoning with interleaved reasoning and selection. We also propose a specification language of gold standards for the evaluation and benchmarking, and discuss how it can be used for the evaluation of reasoner plug-ins within the LarKC platform.
Keywords: Reasoning, Evaluation, Benchmarking, Web Scale Reasoning
1 Introduction
The essence of the LarKC project [2] (http://www.larkc.eu) is to go beyond notions of absolute correctness and completeness in reasoning. We are looking for retrieval methods that provide useful responses at a feasible cost of information acquisition and processing. Therefore, generic inference methods need to be extended to non-standard approaches. In consequence, traditional metrics such as completeness and correctness for reasoning need to be replaced by metrics that relate the utility of solutions to their associated costs as a means to evaluate the chosen problem solver. Therefore, we will develop a framework for evaluating and measuring the relative utility of the various reasoning approaches that will be implemented in the LarKC project.
Approximate reasoning is non-standard reasoning based on the idea of sacrificing soundness or completeness for a significant speed-up in reasoning. This is to be done in such a way that the loss of correctness is at least outweighed by the obtained speed-up [8]. Anytime reasoning, in which more answers can be obtained over time, is an expected behavior of approximate reasoning on the LarKC platform. Interleaving reasoning and selection is considered an approach to improving the performance of the LarKC platform [7, 4]. The main idea of the interleaving framework is to use selectors to select only a limited, relevant part of the data for reasoning, so that the efficiency and scalability of reasoning can be improved. These non-standard reasoning approaches require new metrics and frameworks for the evaluation and benchmarking of reasoners developed within the LarKC platform.
Proceedings of the International Workshop on Evaluation of Semantic Technologies (IWEST 2010). Shanghai, China. November 8, 2010.
These new reasoning paradigms, which fuse approaches from many different fields, will also require new approaches to evaluation. Traditional measures like soundness and completeness will have to be enriched with measures such as recall and precision, and worst-case complexity will have to be enriched by approaches such as anytime performance profiles. In this paper, we develop such new evaluation measures and apply them to the implemented reasoner plug-ins, both on synthetic datasets and on datasets from the use cases.
We present an initial framework for the evaluation and benchmarking of reasoners deployed within the LarKC platform. We discuss the evaluation methods, measures, benchmarks, and performance targets for the plug-ins to be developed for the task of approximate reasoning with interleaved reasoning and selection. Furthermore, we develop a specification language of gold standards for the evaluation and benchmarking.
The rest of the paper is organized as follows. In Section 2, we give an overview of the LarKC Platform and the general picture of the reasoner plug-ins that will be developed or deployed within the LarKC platform. In Section 3, we develop a framework for the evaluation and benchmarking of reasoners. In Section 4, we propose a specification language of gold standards for the evaluation and benchmarking within the LarKC platform. Section 5 discusses related work before we conclude the paper in Section 6.
2 LarKC Platform
2.1 LarKC Architecture
In [9], the first version of the LarKC architecture was proposed. This design is based on a thorough analysis of the requirements and on the lessons learned during the first year of the project. Figure 1 shows a detailed view of the LarKC Platform architecture.
The LarKC platform has been designed to be as lightweight as possible while providing all necessary features to support both users and plug-ins. For this purpose, the following components are distinguished as part of the LarKC platform:
– Plug-in API: defines interfaces for plug-ins and therefore provides support for interoperability between the platform and plug-ins and between plug-ins.
– Data Layer API: provides support for data access and management.
– Plug-in Registry: contains all necessary features for plug-in registration and discovery.
– Workflow Support System: provides support for plug-in instantiation, through the deployment of plug-in managers, and for monitoring and controlling plug-in execution at the workflow level.
Fig. 1. The LarKC Platform Architecture
– Plug-in Managers: provide support for monitoring and controlling plug-in execution at the plug-in level. An independent instance of a Plug-in Manager is deployed for each plug-in to be executed. This component includes support for both local and remote execution and management of plug-ins.
– Queues: provide support for the deployment and management of the communication between the platform and plug-ins and between plug-ins.
2.2 LarKC Reasoner Plug-ins
Reasoning APIs. All LarKC plug-ins share a common superclass, namely the Plug-in class. This class provides an interface for functions common to all plug-in types. All plug-ins are identified by a name, which is a string. Plug-ins provide metadata that describes the functionality they offer, as well as Quality of Service (QoS) information regarding how they perform that functionality.
The reasoner plug-in executes a given SPARQL query against a triple set provided by a Select plug-in. The interface of the reasoner plug-in is shown in Table 1.
Table 1. Reasoner Plug-in Interface
sparqlSelect(SPARQLQuery q, SetOfStatements s, Contract c, Context ctx)
sparqlConstruct(SPARQLQuery q, SetOfStatements s, Contract c, Context ctx)
sparqlDescribe(SPARQLQuery q, SetOfStatements s, Contract c, Context ctx)
sparqlAsk(SPARQLQuery q, SetOfStatements s, Contract c, Context ctx)
The reasoner plug-in supports the four standard methods of a SPARQL endpoint, namely select, describe, construct, and ask. The inputs to each of the reasoning methods are the same: the query to be executed; the statement set to reason over; the contract, which defines the behavior of the reasoner; and the context, which carries information specific to the reasoning task. The output of these reasoning methods depends on the reasoning task being performed. The select method returns a Variable Binding as output, where the variables correspond to those specified in the query. The construct and describe methods return RDF graphs: in the first case the graph is constructed according to the query, and in the second the graph contains triples that describe the variable specified in the query. Finally, ask returns a Boolean Information Set as output, which is true if the pattern in the query can be found in the triple set, and false otherwise.
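The interface in Table 1 can be mirrored in a short sketch. The method names below follow Table 1, but the argument types and the toy implementation are simplified placeholders of our own, not the actual LarKC API.

```python
from abc import ABC, abstractmethod

# Simplified stand-ins for the LarKC argument types in Table 1
# (illustrative placeholders, not the real platform classes).
SPARQLQuery = str
SetOfStatements = list   # list of (subject, predicate, object) triples
Contract = dict          # e.g. {"anytime": True, "max_time_ms": 1000}
Context = dict

class Reasoner(ABC):
    """Abstract reasoner plug-in mirroring the four SPARQL endpoint methods."""

    @abstractmethod
    def sparqlSelect(self, q, s, c, ctx):
        """Return variable bindings for the query variables."""

    @abstractmethod
    def sparqlConstruct(self, q, s, c, ctx):
        """Return an RDF graph built according to the query."""

    @abstractmethod
    def sparqlDescribe(self, q, s, c, ctx):
        """Return an RDF graph describing the variable in the query."""

    @abstractmethod
    def sparqlAsk(self, q, s, c, ctx):
        """Return True iff the query pattern occurs in the triple set."""

class NaiveAskReasoner(Reasoner):
    """Toy implementation: ask() looks up a literal triple pattern."""
    def sparqlSelect(self, q, s, c, ctx): return []
    def sparqlConstruct(self, q, s, c, ctx): return []
    def sparqlDescribe(self, q, s, c, ctx): return []
    def sparqlAsk(self, q, s, c, ctx):
        # Here q is simply a triple to look up -- a gross simplification
        # of SPARQL ASK, for illustration only.
        return q in s
```

A selector plug-in would hand such a reasoner the statement set `s`, so that only the selected part of the data participates in answering the query.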
Reasoner Plug-ins. The LarKC reasoner plug-ins range from reasoners which provide standard reasoning support over RDF/RDFS/OWL data to reasoners which realize non-standard reasoning tasks such as reasoning with inconsistent ontologies, rule-based reasoning, and stream reasoning. Here is an (incomplete) list of the LarKC reasoner plug-ins which have been developed for the LarKC platform.
– SPARQL Query Evaluation Reasoner: This reasoner plug-in wraps OWLIM and enables SPARQL Select, Construct, Describe, and Ask queries to be executed against it.
– Pellet Reasoner: This reasoner plug-in is a wrapper of the Pellet SPARQL-DL reasoner (http://clarkparsia.com/pellet), which provides reasoning support for Description Logics.
– DIG Interface: This reasoner plug-in provides support for the DIG interface (http://dig.sourceforge.net/), which allows an external DIG reasoner to be called, such as RACER, FaCT++, KAON2, or PION.
– OWLAPI Reasoner: This reasoner plug-in provides support for the OWL API, a standard interface for reasoners over OWL data.
– PION Reasoner: This is a reasoner which can be used for reasoning with inconsistent ontologies. Namely, given an inconsistent ontology and a query, the PION reasoner can return a meaningful answer.
3 A Framework of Evaluation and Benchmarking for
Ontology Reasoners
3.1 General Considerations
In this section, we present a framework of evaluation and benchmarking for ontology reasoners. The main idea is to reuse the framework which was developed in the KnowledgeWeb project for benchmarking inconsistency reasoners in the Semantic Web [6]. In LarKC, we can use this framework to evaluate and benchmark both standard reasoner plug-ins and non-standard reasoner plug-ins, e.g. the PION reasoner [5, 3] for reasoning with inconsistent ontologies.
In ontology engineering, evaluation and benchmarking target software products, tools, services, and processes; the targeted objects are called tested systems. Evaluation and benchmarking are the systematic determination of the merit, worth, and significance of tested systems. That merit, worth, and significance is characterized by a value relation, which is considered a preference relation, i.e., a partially ordered set ⟨A, ≼⟩. We consider a tested system as one which is targeted by the objectives of evaluation or benchmarking. A tested system can be characterized as an input-output function, alternatively called the characteristic function of the tested system: it maps a tuple of input parameters to an output value. A value relation is defined as a preference relation on a set of values; that is, a value relation is characterized as a partially ordered set.
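A value relation ⟨A, ≼⟩ can be represented concretely as a set of values plus a strict preference relation closed under transitivity. The sketch below is our own illustration (not from the LarKC deliverables), using as example values the answer categories introduced later in Section 3.3.

```python
from itertools import product

def transitive_closure(pairs):
    """Close a strict preference relation {(better, worse)} under transitivity."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        # Repeatedly add (a, d) whenever (a, b) and (b, d) are present.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure

# Example value relation on the evaluation outcomes of Section 3.3.
better_than = transitive_closure({
    ("intended", "cautious"),
    ("cautious", "reckless"),
    ("reckless", "counter-intuitive"),
})

def preferred(a, b, relation=better_than):
    """True iff value a is strictly preferred to value b."""
    return (a, b) in relation
```

With the relation closed transitively, dominance queries such as "is an intended answer preferred to a counter-intuitive one?" become simple set lookups.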
We define evaluation as the systematic determination of the values of tested systems with respect to a partially ordered value relation, and benchmarking as a continuous process of improvement by systematically evaluating tested systems and comparing them to those considered to be the best. In short, benchmarking is the continuous process of evaluation.
In the LarKC project, Task 4.7.1 is charged with developing an initial framework for the evaluation of reasoner plug-ins. For the purpose of continuous improvement of the LarKC platform, the issue of benchmarking reasoner plug-ins should also be covered.
3.2 Goals and Criteria for Evaluation and Benchmarking of
Reasoner Plug-ins
We consider the following initial goals for evaluating LarKC reasoner plug-ins:
– Bug Detection: A good evaluation of reasoner plug-ins should be able to detect hidden bugs in the implementation. These bugs may be hard to detect through manual examination by developers. This requires that the test data sets cover many functionalities/use cases of reasoner plug-ins.
– Robustness: A robust reasoner plug-in should not fail on noisy or erroneous test data. Thus, special test data sets should be designed to test the robustness of a reasoner plug-in.
– Performance Analysis: One of the main concerns about the quality of a reasoner plug-in is its performance. Thus, a necessary part of the evaluation and benchmarking of reasoner plug-ins is an analysis of their performance. The usual criteria for examining the performance of reasoner plug-ins are: (i) the time costs, including the time cost for obtaining the first query answer under anytime behavior and the average time cost per query answer; (ii) the resource consumption, including the maximal working memory required; and (iii) the quality of the query answers, which is discussed in the next section.
– Scalability Potential: The LarKC platform is expected to support Web-scale reasoning. Thus, the scalability of a reasoner plug-in becomes a crucial issue for the performance of the overall platform. The scalability potential of a reasoner is how well it can deal with large amounts of data.
– Platform Improvement: A useful evaluation and benchmarking of reasoner plug-ins should be able to find bottlenecks within the platform. It should provide an analysis of how the design of the platform can be improved.
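The two time-cost criteria above (time to first answer under anytime behavior, and average time per answer) can be measured with a simple harness. The generator-based reasoner below is a hypothetical stand-in for an anytime plug-in; only the measurement logic is the point.

```python
import time

def slow_reasoner():
    """Hypothetical anytime reasoner: yields query answers incrementally."""
    for i in range(3):
        time.sleep(0.01)      # simulate inference work per answer
        yield f"answer-{i}"

def measure_anytime(answer_stream):
    """Return (time_to_first_answer, average_time_per_answer, answers)
    for any iterable that yields answers over time."""
    start = time.perf_counter()
    answers, first = [], None
    for a in answer_stream:
        if first is None:
            first = time.perf_counter() - start   # anytime: first-answer latency
        answers.append(a)
    total = time.perf_counter() - start
    avg = total / len(answers) if answers else float("inf")
    return first, avg, answers
```

For a sound-and-complete batch reasoner the first-answer time coincides with the total time; for an anytime reasoner the gap between the two is exactly what this criterion is meant to expose.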
3.3 Measuring the Quality of Query Answers
As discussed in the last section, the quality of query answers is one of the main
criteria for evaluating and benchmarking of reasoners.
The answer value set for standard ontology reasoning is usually considered
as a Boolean value set, namely, it consists of ‘true’ and ‘false’. The answer value
set for reasoning with inconsistent ontologies usually consists of three values
accepted, rejected, and undetermined, as introduced in the PION system. We
will develop gold standards, which represents intuitive answers from a human
for queries on reasoning with consistent or inconsistent ontologies. Thus, we can
compare the answers from the tested system/approach with the gold standard,
which is supposed to be intuitively true by a human to see to what quality of
query answers provided by tested systems.
For a query over an inconsistent ontology, the following differences may exist between an answer from the tested system/approach and the intuitive answer in a gold standard:
– Intended Answer: the system's answer is the same as the intuitive answer;
– Counter-intuitive Answer: the system's answer is opposite to the intuitive answer, i.e., the intuitive answer is 'accepted' whereas the system's answer is 'rejected', or vice versa;
– Cautious Answer: the intuitive answer is 'accepted' or 'rejected', but the system's answer is 'undetermined';
– Reckless Answer: the system's answer is 'accepted' or 'rejected' whereas the intuitive answer is 'undetermined'. We call it a reckless answer because in this situation the system returns just one of the possible answers without seeking other, possibly opposite, answers, which would lead to 'undetermined'.
Therefore, a value set
{intended answer, cautious answer, reckless answer, counter-intuitive answer}
can be introduced for the evaluation of answers against gold standards. An intended answer is considered the best one, whereas a counter-intuitive answer is considered the worst one. Cautious answers are usually not considered wrong answers, whereas reckless answers may give wrong answers. Thus, a preference relation ≻ on the value set can be given as:
{intended answer ≻ cautious answer,
cautious answer ≻ reckless answer,
reckless answer ≻ counter-intuitive answer}
Fig. 2. Workflow of evaluation.
Based on this preference order, we can measure the quality of query answers by the following answer rates:
– IA Rate, which counts only intended answers: the Intended Answer Rate is defined as the ratio of the number of Intended Answers to the total number of answers.
– IC Rate, which counts non-error answers: IC Rate = (Intended Answers + Cautious Answers) / Total Answers.
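Both the classification of an answer against the gold standard and the IA/IC rates can be computed mechanically. The function names below are our own; the logic follows the category definitions above for the value set {accepted, rejected, undetermined}.

```python
def classify(system, intuitive):
    """Map a (system answer, gold-standard answer) pair to a category.
    Both answers range over {'accepted', 'rejected', 'undetermined'}."""
    if system == intuitive:
        return "intended"
    if system == "undetermined":
        return "cautious"        # gold commits, system abstains
    if intuitive == "undetermined":
        return "reckless"        # system commits where gold abstains
    return "counter-intuitive"   # 'accepted' vs 'rejected' (or vice versa)

def answer_rates(pairs):
    """Compute (IA rate, IC rate) over a list of (system, intuitive) pairs."""
    cats = [classify(s, g) for s, g in pairs]
    n = len(cats)
    ia = cats.count("intended") / n
    ic = (cats.count("intended") + cats.count("cautious")) / n
    return ia, ic
```

By construction IC Rate ≥ IA Rate, since every intended answer also counts as a non-error answer.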
3.4 Workflows of Evaluation and Benchmarking
Common data sets and common gold standards are usually used for an evaluation of different tested systems. Those systems may be heterogeneous with respect to their input data: for example, one reasoner may support only OWL data, whereas another supports only DIG data. Therefore, a data translator is needed to convert data sets represented in a standard format into data sets represented in a format supported by a tested system. Based on a comparison between test results and gold standards, result evaluation can be done manually, semi-automatically, or automatically. The output of the result evaluation and its implications are further analyzed in an evaluation analysis. Methods from statistics and visualization are usually introduced in this phase for better illustration. The evaluation results are then ranked with respect to the value relation. Finally, this leads to an evaluation report which concludes the values of the tested systems and explains why the systems behave differently. An investigation is usually made to detect the problems of tested systems based on the analysis of the evaluation. The workflow of evaluation is shown in Figure 2.
As discussed above, benchmarking is a continuous process of evaluation. Therefore, for benchmarking, evaluation results are further used for the improvement of tested systems, which usually leads to new versions of the tested systems. Based on a benchmarking analysis, new test data sets may be designed, or previous data sets adjusted, for further evaluation with respect to some targeted problems. The workflow of benchmarking is shown in Figure 3.
Fig. 3. Workflow of benchmarking
4 A Specification Language for Gold Standards
Manual evaluation and analysis of test results are usually time-consuming, labor-intensive, and error-prone. A formalism for gold standards will pave the way for automatic or semi-automatic evaluation and analysis of test results.
A gold standard is an evaluation function which maps queries to answers with confidence values. For reasoner benchmarking, a gold standard is a (partial) function which maps queries to (intuitive) answers with a confidence value. For example, for benchmarking inconsistency processing, we consider the answer set {accepted, rejected, undetermined}. For the query "are birds animals?", the expected answer is intuitively considered "accepted" with confidence value 1.0. For the query "are men animals?", however, the expected answers may be better specified with lower confidence values, say, "accepted" with confidence value 0.4, "rejected" with confidence value 0.4, and "undetermined" with confidence value 0.2. That is, we use the confidence values to represent the uncertainty of expected answers. The confidence values can be obtained by various approaches, e.g., from questionnaires, statistics, or machine learning.
We design gold standards to be independent of any specific ontology. Namely, it is up to evaluators/users to decide which ontologies are applied with respect to a gold standard.
In the following, we develop a gold standard specification language which is suitable for SPARQL queries as reasoning queries. It is an XML format, which is easy to use and read. Table 2 shows an example of a gold standard encoded as an XML document.
Table 2. Example of Gold Standard
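Since the XML listing of Table 2 is not reproduced here, the sketch below illustrates what such a gold-standard document might look like and how it could be parsed. All element and attribute names are our own assumptions, not the actual specification; the example queries and confidence values follow the "birds/men" discussion above.

```python
import xml.etree.ElementTree as ET

# Hypothetical gold-standard document (element names are illustrative only).
GOLD_STANDARD_XML = """
<goldStandard name="LarKC gold standard example 1" version="1.0">
  <query>
    <sparql>ASK { :Bird rdfs:subClassOf :Animal }</sparql>
    <expectedAnswer value="accepted" confidence="1.0"/>
  </query>
  <query>
    <sparql>ASK { :Man rdfs:subClassOf :Animal }</sparql>
    <expectedAnswer value="accepted" confidence="0.4"/>
    <expectedAnswer value="rejected" confidence="0.4"/>
    <expectedAnswer value="undetermined" confidence="0.2"/>
  </query>
</goldStandard>
"""

def load_gold_standard(xml_text):
    """Parse a gold standard into {sparql query: {answer: confidence}}."""
    root = ET.fromstring(xml_text)
    standard = {}
    for q in root.findall("query"):
        sparql = q.findtext("sparql")
        standard[sparql] = {
            e.get("value"): float(e.get("confidence"))
            for e in q.findall("expectedAnswer")
        }
    return standard
```

A parsed gold standard of this shape is exactly the (partial) evaluation function described above: it maps each query to its expected answers with confidence values, ready for automatic comparison against a tested system's output.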
This XML document specifies the name and the version of the gold standard. Each query consists of a detailed query statement (in the SPARQL query language) and its expected answer specification. Each expected answer is annotated with a confidence value. For non-Boolean answers, like those for sparqlSelect and sparqlConstruct queries, which return a variable binding or an RDF graph as an answer, the values of expected answers are specified by a detailed XML-encoded subtree, as specified in Table 3.
Table 3. XML Format for Expected Answers
Alternatively, the gold standard specification language can be defined using standard metadata or ontology languages such as RDF/RDFS/OWL. Namely, a gold standard can be specified as metadata or as an ontology, which opens the possibility of reasoning with gold standards. Table 4 shows an example of the RDF representation of a gold standard.
Table 4. RDF Representation of the Gold Standards (fragments: "LarKC gold standard example 1"; query "Are birds animals?"; task "subsumes"; expected answer "accepted" with confidence 1)
5 Related Work
In [1], Castro develops an evaluation and benchmarking methodology for Semantic Web technologies and presents various methods for RDF(S) and OWL interoperability benchmarking.
SEALS (http://www.seals-project.eu/) is a project on Semantic Evaluation at Large Scale. The goal of the SEALS project is to provide an independent, open, scalable, extensible, and sustainable evaluation infrastructure for semantic technologies. The SEALS Platform allows the remote evaluation of semantic technologies, thereby providing an objective comparison of the different existing semantic technologies. The SEALS Platform will provide an integrated set of semantic technology evaluation services and test suites. Therefore, one item of future work is to integrate the evaluation methods and benchmarks developed in the context of LarKC with the SEALS Platform.
The work on the evaluation design for advanced reasoning systems in the SEALS project [10] is still under development. We have observed that the SEALS project has presented the definition of the evaluations and test data that will be used in the first SEALS Evaluation Campaign. The tests have been designed to cover the interoperability and performance of advanced reasoning systems.
6 Conclusions
In this paper, we have presented an initial framework for the evaluation and benchmarking of reasoner plug-ins within the LarKC platform. We have proposed evaluation methods, measures, benchmarks, and performance targets for the plug-ins to be developed for approximate reasoning with interleaved reasoning and selection. Based on the initial framework, we have discussed the workflows of evaluation and benchmarking. Furthermore, we have proposed a specification language of gold standards for evaluation and benchmarking, and we have discussed how the proposed language can be used for the evaluation of reasoner plug-ins within the LarKC platform.
References
1. R. G. Castro. Benchmarking Semantic Web Technology. IOS Press, 2010.
2. D. Fensel, F. van Harmelen, B. Andersson, P. Brennan, H. Cunningham, E. D. Valle, F. Fischer, Z. Huang, A. Kiryakov, T. K. Lee, L. School, V. Tresp, S. Wesner, M. Witbrock, and N. Zhong. Towards LarKC: a platform for Web-scale reasoning. In Proceedings of the International Conference on Semantic Computing, pages 524–529, 2008.
3. Z. Huang and F. van Harmelen. Using semantic distances for reasoning with inconsistent ontologies. In Proceedings of the 7th International Semantic Web Conference (ISWC2008), 2008.
4. Z. Huang, F. van Harmelen, S. Schlobach, G. Tagni, A. ten Teije, Y. Wang, Y. Zeng,
and N. Zhong. D4.3.2 - implementation of plug-ins for interleaving reasoning and
selection, March 2010. Available from: http://www.larkc.eu/deliverables/.
5. Z. Huang, F. van Harmelen, and A. ten Teije. Reasoning with inconsistent ontolo-
gies. In Proceedings of the International Joint Conference on Artificial Intelligence
- IJCAI’05, 2005.
6. Z. Huang, J. Völker, Q. Ji, H. Stuckenschmidt, C. Meilicke, S. Schlobach, F. van Harmelen, and J. Lam. Benchmarking the processing of inconsistent ontologies. KnowledgeWeb Deliverable D1.2.2.1.4/D2.1.6.3, 2007. Available from: http://wasp.cs.vu.nl/knowledgeweb/D2163/kweb2163.pdf.
7. Z. Huang, Y. Zeng, S. Schlobach, A. ten Teije, F. van Harmelen, Y. Wang, and N. Zhong. D4.3.1 - Strategies and design for interleaving reasoning and selection of axioms, September 2009. Available from: http://www.larkc.eu/deliverables/.
8. S. Rudolph, T. Tserendorj, and P. Hitzler. What is approximate reasoning? In
Proceedings of RR2008, LNCS 5341, pages 150–164, 2008.
9. M. Witbrock, B. Fortuna, L. Bradesko, M. Kerrigan, B. Bishop, F. van Harmelen, A. ten Teije, E. Oren, V. Momtchev, A. Tenschert, A. Cheptsov, S. Roller, and G. Gallizo. LarKC Deliverable D5.3.1 - Requirements analysis and report on lessons learned during prototyping, June 2009. Available from: http://www.larkc.eu/deliverables/.
10. M. Yatskevich, G. Stoilos, I. Horrocks, and F. Martin-Recuerda. SEALS Deliverable D11.2 - Evaluation design and collection of test data for advanced reasoning systems, 2009.