<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Semantic Search Tools using the SEALS platform</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stuart N. Wrigley</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Khadija Elbedweihy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dorothee Reinhard</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abraham Bernstein</string-name>
          <email>bernstein@ifi.uzh.ch</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Ciravegna</string-name>
          <email>f.ciravegna@dcs.shef.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Sheffield</institution>
          ,
          <addr-line>Regent Court, 211 Portobello, Sheffield</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Zurich</institution>
          ,
          <addr-line>Binzmuhlestrasse 14, CH-8050 Zurich</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In common with many state-of-the-art semantic technologies, there is a lack of comprehensive, established evaluation mechanisms for semantic search tools. In this paper, we describe a new evaluation and benchmarking approach for semantic search tools using the infrastructure under development within the SEALS initiative. To our knowledge, it is the first effort to present a comprehensive evaluation methodology for semantic search tools. The paper describes the evaluation methodology, including our two-phase approach in which tools are evaluated both in a fully automated fashion and within a user-based study. We also present and discuss preliminary results from the first SEALS evaluation campaign together with a discussion of some of the key findings.</p>
      </abstract>
      <kwd-group>
        <kwd>semantic search</kwd>
        <kwd>usability</kwd>
        <kwd>evaluation</kwd>
        <kwd>benchmarking</kwd>
        <kwd>performance</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Searching the Semantic Web lies at the core of many activities that are envisioned
for the Semantic Web; many researchers have investigated means for indexing
and searching the Semantic Web. Semantic search tools are systems that take a
query as their input, reason over some kind of knowledge base and return the
compatible answers. The input query can take the form of a natural language
question, a triple representation of a question, a graphical representation,
keywords, etc. and the knowledge base can be one or more ontologies, annotated
text corpora or plain text documents, etc. Similarly, the answers which are
returned by a tool can take a multitude of forms from pure triples to a natural
language representation.</p>
      <p>In the area of semantic search there are a large number of different tool types
focussing on the diverse aspects of this domain. In this evaluation work, we focus
on user-centered tools for retrieving information and knowledge, including those
which support some kind of natural language user interface. The core
functionality of a semantic search tool is to allow a user to discover one or more
facts or documents by inputting some form of query. The manner in which this
input occurs (natural language, keywords, visual representation) is not of concern;
however, the user experience of using the interface is of interest. Indeed, we feel it
is appropriate to directly compare tools with potentially differing interfaces, since
tool adopters (who may not have technical expertise in the semantic search field)
will place significant emphasis on this aspect in their decision process. Therefore,
it is essential that the evaluation procedures emphasise the user experience of
each tool. (This work was supported by the European Union 7th FWP ICT based
e-Infrastructures Project SEALS: Semantic Evaluation at Large Scale, FP7-238975.)</p>
      <p>
        We believe semantic search is an area where evaluation is critical and one
for which formalised and consistent evaluation has, until now, been unavailable.
The evaluation of semantic search technologies is a core element of the Semantic
Evaluation At Large Scale (SEALS) initiative which is aimed at developing a
new research infrastructure dedicated to the evaluation of Semantic Web
technologies. The SEALS Platform [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] provides facilities for storing all the materials
required for an evaluation to take place: the tool(s), the test data, a results
storage repository and a description of the evaluation workflow.
      </p>
      <p>Two aspects, however, make the evaluation of search tools more complicated
than the benchmarking employed for other types of Semantic Web tools (such
as reasoners or matchers): first, different search tools use highly varying querying
metaphors, as exhibited by the range of searching approaches alluded to above
(e.g., keyword-based, language-based or graphical). Indeed, it has been decided
that no restriction will be placed on the type of interfaces to be assessed; in
fact, we hope that as wide a range of interface styles as possible will be evaluated.
Second, the search task usually involves a human seeker, which adds additional
complexity to any benchmarking approach.</p>
      <p>This paper describes an evaluation which comprises both an automated
evaluation phase, to determine retrieval performance measures such as precision and
recall, and an interactive phase, to elicit usability measures. Specifically, the
evaluation comprises a series of reference benchmark tests that focus
on the performance of fundamental aspects of each tool in a strictly controlled
environment or scenario, rather than on its ability to solve open-ended, real-life
problems.</p>
      <p>It is intended that the presentation of the methodology and the execution of
the evaluation campaigns will spur the adoption of this methodology, serving
as the basis for comparing different search tools and fostering innovation.</p>
      <p>We will briefly describe previous evaluation initiatives before introducing
our methodology in detail. We will also describe the two core datasets and the
mechanisms for integrating tools with the evaluation software. Finally, we will
present some preliminary evaluation results and conclude with a short discussion
of these.</p>
    </sec>
    <sec id="sec-2">
      <title>Previous Related Evaluations</title>
      <p>
        Few efforts exist to evaluate semantic search tools using a comprehensive,
standardised benchmarking approach. One of the first attempts at a comprehensive
evaluation was conducted by Kaufmann [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], in which four different question
answering systems with natural language interfaces to ontologies were compared:
NLP-Reduce, Querix, Ginseng and Semantic Crystal. The interfaces were tested
according to their performance and usability. These ontology-based tools were
chosen by virtue of their differing forms of input. NLP-Reduce and Querix allow
the user to pose questions in full or slightly restricted English. Ginseng offers
a controlled query language similar to English. Semantic Crystal provides the
end-user with a rather formal, graphical query language.
      </p>
      <p>
        Kaufmann [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] employed a large usability study conducted for each of the
four systems with the same group of non-expert subjects, using the Mooney
dataset (see Sec. 4.2) as the ontological knowledge base. The goal of this
controlled experiment was to detect differences related to the usability and
acceptance of the four varying query languages. The experiment revealed that the
subjects preferred query languages expecting full sentences over
separate keywords, menu-driven and graphical query languages, in this order.
It can therefore be concluded that casual end-users favour query languages
that support the formulation process of their queries and which structure their
input, but do not over-restrict them or make them learn a rather unusual new
way of phrasing questions.
      </p>
      <p>
        Another previous evaluation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] extensively benchmarked the K-Search
system, both in vitro (in principle) and in vivo (by real users). For instance, the
in vivo evaluation used 32 Rolls-Royce plc employees, who were asked about
their individual opinions on the system's efficiency, effectiveness and
satisfaction. However, as is common with small-scale evaluations, they refrained from
comparing their tool with other similar ones in this domain.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Evaluation Design</title>
      <p>This section describes the design of the evaluation methodology in detail. It
introduces the core assumptions which we have made and the two-phase approach
which we have deemed essential for evaluating the different aspects of a semantic
search tool. We also describe the criteria and metrics by which the tools will be
benchmarked and the analyses which will be made.</p>
      <sec id="sec-3-1">
        <title>Two-Phase Approach</title>
        <p>The evaluation of each tool is split into two complementary phases: the
Automated Phase and the User-in-the-loop Phase. The user-in-the-loop phase
comprises a series of experiments involving human subjects who are given a number
of tasks (questions) to solve and a particular tool and ontology with which to do
so. The subjects in the user-in-the-loop experiments are guided throughout the
process by bespoke software (the controller) which is responsible for presenting
the questions and gathering the results and metrics from the tool under
evaluation. Two general forms of metrics are gathered during such an experiment.
The first type is directly concerned with the operation of the tool
itself, such as the time required to input a query and the time to display the results. The
second type is more concerned with the 'user experience' and is collected at the
end of the experiment using a number of questionnaires.</p>
        <p>The outcome of these two phases will allow us to benchmark each tool both in
terms of its raw performance and in terms of the ease with which it can be used.
Indeed, for semantic search tools, it could be argued that this latter aspect is
the most important. In addition to usability questionnaires, demographic data
will be collected from the subjects, enabling tool adopters to assess whether a
particular tool is suited to their target user group(s).</p>
      </sec>
      <sec id="sec-3-2">
        <title>Criteria</title>
        <p>Query expressiveness While some tools (especially form-based ones) do not allow
complex queries, others (e.g., NLP-based approaches) allow, in principle, a much
more expressive set of queries to be performed. We have designed the queries
to test the expressiveness of each tool both formally (by asking participants in
the evaluation to state the formal expressiveness) and practically (by running
queries to test the actual coverage and robustness).</p>
        <p>Usability Usability will be assessed both in terms of the ability to express
meaningful queries and in combination with large scale, for example, when a large
set of results is returned or a very large ontology is used. Indeed, the background
of the user may also influence their impression of usability.</p>
        <p>Scalability Tools and approaches will be compared on the basis of their ability to
scale over large data sets. This includes a tool's ability to query a large
repository in a reasonable time; its ability to cope with a large ontology; and its
ability to cope with a large number of returned results in terms of the
readability/accessibility of those results.</p>
        <p>Performance This measures the resource consumption of a particular search
tool. Performance measures (speed of execution) depend on the benchmark
processing environment and the underlying ontology.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Metrics and Analyses</title>
        <p>
          Automated Phase The metrics and interpretations used for tool evaluation in
the automated phase draw on the work of Kaufmann [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. A number of different
forms of data will be collected, each addressing a different aspect of the evaluation
criteria.
        </p>
        <p>A number of 'standard' measures are collected, including the set of answers
returned by the tool, the amount of memory used, etc. These metrics cover the
query expressiveness and interoperability criteria described in Sec. 3.2:
- Execution success (OK / FAIL / PLATFORM ERROR). The value is OK
if the test is carried out with no execution problem; FAIL if the test is
carried out with some execution problem; and PLATFORM ERROR if the
evaluation infrastructure throws an exception when executing the test.
- Results. This is the set of results generated by the tool in response to the
query. This set may be in the form of a ranked list. The size of this set is
determined (at design time) by the tool developer.
- Time to execute query. The speed with which the tool returns a result set. In
order to obtain a reliable measure, it is averaged over several runs.</p>
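<p>The repeated-measurement approach for query timing can be sketched as follows. This is an illustrative Python sketch, not the SEALS implementation; the <monospace>execute_query</monospace> callable is a hypothetical stand-in for the tool wrapper's query method.</p>

```python
import statistics
import time


def average_query_time(execute_query, query, runs=5):
    """Time a tool's query call over several runs and return the mean,
    mirroring the averaging described for the time-to-execute metric.

    `execute_query` is a hypothetical stand-in for the wrapper method
    that submits a query to the tool under evaluation.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        execute_query(query)  # the tool executes the query here
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)
```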
        <p>For each tool, a large amount of raw metric data will be produced. From this,
a number of interpretations can be derived which can both be presented to
the community and be used to inform the semantic technology roadmaps
which will be produced after each evaluation campaign. The automated phase
is concerned with interpretations of the 'low-level' performance of
the search tool, such as the ability to load the ontology and query (interoperability)
and the precision, recall and f-measure of the returned results (search accuracy
and query expressiveness). The scalability criterion is assessed by examining the
average query execution time with respect to ontology size. Tool robustness is
represented by the ratio between the number of tests executed and the number
of failed executions.</p>
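<p>The search-accuracy and robustness interpretations above can be sketched in Python. This is an illustrative set-based formulation, not the SEALS implementation, and the robustness function shows one plausible reading of the executed-to-failed ratio.</p>

```python
def precision_recall_f(returned, gold):
    """Set-based precision, recall and balanced f-measure for one query's
    result set against the gold-standard answers."""
    returned, gold = set(returned), set(gold)
    tp = len(returned & gold)  # true positives: correct answers returned
    precision = tp / len(returned) if returned else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def robustness(tests_executed, failed_executions):
    """One plausible formulation of the robustness ratio described above:
    the proportion of executed tests that did not fail."""
    return 1.0 - failed_executions / tests_executed
```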
        <p>User-in-the-loop Phase In order to address the usability of a tool, we also
collect a range of user-centric metrics, such as the time required to obtain the
final answer and the number of attempts before the user is happy with the result. In
addition, data regarding the user's impression of the tool is gathered using
questionnaires (see Sec. 3.4).</p>
        <p>
          For each topic / question presented to the user, the following metrics are
collected:
- Execution success (OK / FAIL / PLATFORM ERROR).
- Underlying query (in the tool's internal format, e.g., in SPARQL).
- Results.
- Is the answer in the result set? It is possible that the experiment subject
may have been unable to find the appropriate answer (even after a number
of query input attempts). In this case, the subject would have indicated this
via the controller software.
- User-specific statistics: time required to obtain the answer; number of queries
required to answer the question; demographics; System Usability Scale (SUS)
questionnaire [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]; and an in-depth satisfaction questionnaire.
        </p>
        <p>A small number of traditional interpretations will be generated which
relate to the 'low-level' performance of the search tool (e.g., precision, recall and
f-measure). However, the emphasis is on usability and the user's satisfaction
when using the tool. This will be identified using the SUS score, the number of
attempts made by the user and the time required to obtain a satisfactory answer, as
well as a number of correlations between usability metrics and other measures
and / or demographics.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Questionnaires</title>
        <p>
          For the user-in-the-loop phase we employ three kinds of questionnaires, namely
the System Usability Scale (SUS) questionnaire [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], the Extended questionnaire
and the Demographics questionnaire. Such questionnaires represent a well-known
and often-applied procedure in the domain of Human-Computer Interaction
to assess user satisfaction and to measure possible biases and correlations
between the test subjects' characteristics and the outcomes of the evaluation.
        </p>
        <p>
          SUS is a unified usability test comprising ten normalised questions (e.g., 'I
think that the interface was easy to use,' 'I think that I would need the support
of a technical person to be able to use this system,' etc.). The subjects answer
all questions on a 5-point Likert scale identifying their view and opinion of the
system. The test incorporates a diversity of usability aspects, such as the need
for support, training and complexity. The final score of this questionnaire is a
value between 0 and 100, where 0 implies that the user regards the user interface
as unusable and 100 implies that the user considers the user interface to
be perfect. Bangor et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] described the results of 2,324 SUS surveys from 206
usability tests collected over a ten-year period and found that the SUS was a
highly reliable indicator of usability (alpha = 0.91) for many different interface
types (mobile phones and televisions as well as GUIs).
        </p>
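<p>For reference, the standard SUS scoring procedure that maps the ten Likert responses onto the 0-100 range can be sketched as follows (a generic sketch of the published SUS scoring rule, not code from the SEALS platform):</p>

```python
def sus_score(responses):
    """Compute the System Usability Scale score from ten 1-5 Likert responses.

    Odd-numbered items are positively worded (contribution = response - 1),
    even-numbered items negatively worded (contribution = 5 - response);
    the summed contributions are scaled to 0-100 by multiplying by 2.5.
    """
    assert len(responses) == 10, "SUS requires exactly ten responses"
    total = sum((r - 1) if i % 2 == 0 else (5 - r)
                for i, r in enumerate(responses))
    return total * 2.5
```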
        <p>The Extended questionnaire includes further questions regarding the
satisfaction of the users. These questions cover domains such as the design of the
tool, the tool's query language, the tool's feedback, the performance and
functionality of the tool, and the user's emotional state while
working with the tool.</p>
        <p>The Demographics questionnaire collects detailed demographic information
regarding the participants, which allows us to identify tools or types of tools that
are better suited to particular types of users.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Datasets</title>
      <p>For the first evaluation campaign we have taken the decision to focus on purely
ontology-based tools. More complex test data (document-based, chaotic data,
data with partially known schemas) will be considered for later evaluation
campaigns. Indeed, the SEALS consortium actively encourages community
participation in the specification of subsequent campaigns.</p>
      <sec id="sec-4-1">
        <title>Automated Phase</title>
        <p>EvoOnt (http://www.ifi.uzh.ch/ddis/evo/) is a set of software ontologies and a data
exchange format based on OWL. It provides the means to store all elements necessary
for software analyses, including the software design itself as well as its release and
bug-tracking information. For scalability testing it is necessary to use a data set
which is available in several different sizes. In the current campaign, it was decided
to use sets of sizes 1k, 10k, 100k, 1M and 10M triples. The EvoOnt data set lends
itself well to this, since tools are readily available which enable the creation of
different ABox sizes for a given ontology while keeping the same TBox. Therefore,
all the different sizes are variations of the same coherent knowledge base.</p>
      </sec>
      <sec id="sec-4-2">
        <title>User-in-the-loop Phase</title>
        <p>
          The main requirement for the user-in-the-loop dataset is that it be from a
simple and understandable domain: it should be sufficiently simple and well-known
that casual end-users are able to reformulate the questions into the respective
query language without having trouble understanding them. Additionally, a set
of questions is required which subjects will use as the basis of their input to
the tool's query language or interface. The Mooney Natural Language Learning
Data (http://www.cs.utexas.edu/users/ml/nldata.html) fulfils these requirements
and comprises three data sets, each supplying a knowledge base, English questions
and corresponding logical queries. They cover three different domains: geographical
data, job data and restaurant data. We chose to use only the geography data set,
because it defines data from a domain immediately familiar to casual users. The
geography OWL knowledge base contains 9 classes, 11 datatype properties, 17
object properties and 697 instances. An advantage of using the Mooney data
for the user-in-the-loop evaluation is the fact that it is a well-known and
frequently used data set (e.g., [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]).
Furthermore, its use allowed the findings to be made comparable
with other evaluations of tools in this area, such as Cocktail [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], PANTO [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and
PRECISE [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>Test Questions</title>
        <p>User-in-the-loop Phase The Mooney geography question set has been
augmented using the existing questions as templates. In the question 'How many
cities are in Alabama?', for example, the class concept city can be exchanged
for other class concepts, such as lake, mountain, river, etc.
Furthermore, the instances can be exchanged to obtain more questions. For
example, Alabama could be replaced by any instance of the class state (e.g.,
California, Oregon, Florida, etc.). We also added more complicated questions
that ask for more than one instance and produce more complex queries, such as
'What rivers run through the state with the lowest point in the USA?' and
'What state bordering Nevada has the largest population?'.</p>
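<p>The template-based augmentation described above can be sketched as a simple substitution over concept and instance lists. The lists below are illustrative fragments only, not the actual question set used in the campaign.</p>

```python
from itertools import product

# Hypothetical fragments of the Mooney geography vocabulary,
# used here only to illustrate the template expansion.
CLASS_CONCEPTS = ["cities", "lakes", "mountains", "rivers"]
STATE_INSTANCES = ["Alabama", "California", "Oregon", "Florida"]


def expand_template(template, concepts, instances):
    """Expand a question template by substituting each class concept and
    each instance, yielding one benchmark question per combination."""
    return [template.format(concept=c, instance=i)
            for c, i in product(concepts, instances)]


questions = expand_template("How many {concept} are in {instance}?",
                            CLASS_CONCEPTS, STATE_INSTANCES)
```

Each template thus multiplies out into concepts x instances questions, which is how a small seed set yields a larger benchmark.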
        <p>Automated Phase The EvoOnt data set comprises knowledge of the software
engineering domain; hence, the questions have a different character from the
Mooney questions and make use of concepts such as programming classes, methods,
bugs (issues), projects, versions, releases and bug reports. Simpler questions have
the form 'Does the class x have a method called y?' or 'Give me all the
issues that were reported by the user x and have the state fixed', where x and
y are specific instances of the respective ontological concept. Examples of more
complex questions that involve more than three concepts are 'Give me all the
issues that were reported in the project x by the user y and that are fixed by
the version z' and 'Give me all the issues that were reported in the project
w by the user x after the date y and were fixed by the version z'.</p>
        <sec id="sec-4-3-1">
          <title>API</title>
          <p>In order for a tool to be evaluated, the tool provider had to produce a tool
'wrapper' which implemented a number of methods (see http://www.seals-project.eu/seals-evaluation-campaigns/semantic-search-tools/connect-your-tool).
This allowed the evaluation platform to automatically issue query requests and
gather the result sets, for instance. Furthermore, exposing this functionality also
allowed the user-in-the-loop experiment software to gather various forms of data
during the user experiment that will be used for analysis.</p>
          <p>The core functionality can be split into three different areas: methods
required in both phases, methods required only for the user-in-the-loop phase and
methods required only for the automated phase.</p>
          <p>Functionality which is common to both evaluation phases includes the method
to load an ontology into the tool. The other methods are related to the results
returned by the tool. The first determines whether the tool manages (and hence returns
via the API) its results as a ranked list. The second determines whether the tool
has finished executing the query and, consequently, whether the results are ready. The
final method retrieves the results associated with the current query; this method
returns URIs in the SPARQL Query Results XML Format (http://www.w3.org/TR/rdf-sparql-XMLres/).</p>
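<p>A result set in this format can be consumed with standard XML tooling. The sketch below extracts the bound URIs from a SPARQL Query Results XML document using Python's standard library; it is an illustration of the format, not part of the SEALS platform.</p>

```python
import xml.etree.ElementTree as ET

# Namespace of the W3C SPARQL Query Results XML Format.
SRX_NS = "{http://www.w3.org/2005/sparql-results#}"


def parse_result_uris(xml_text):
    """Extract the URIs bound in a SPARQL Query Results XML document,
    i.e. the form in which a wrapper returns its result set."""
    root = ET.fromstring(xml_text)
    uris = []
    for binding in root.iter(SRX_NS + "binding"):
        uri = binding.find(SRX_NS + "uri")
        if uri is not None:  # bindings may also carry literals
            uris.append(uri.text)
    return uris
```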
          <p>Only one method is required specifically for the automated phase: execute
query. This executes a query which has been formatted to suit an individual
search tool's internal query representation. Three methods are required for the
user-in-the-loop phase. The first determines whether the user has finished inputting
their query into the tool. The second retrieves the String representation of the
query entered by the user. For example, if the tool uses a natural language
interface, this method simply returns the text entered by the user. The
final method retrieves the tool's internal representation of the user's query. This
should be in a form such that it could be passed to the automated phase's execute
query method to obtain the same results.</p>
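<p>The wrapper functionality described above can be summarised as an abstract interface. The following Python sketch is illustrative only: the method names and signatures are assumptions for this paper's description, not the actual SEALS API, which is defined at the URL given earlier.</p>

```python
from abc import ABC, abstractmethod


class SearchToolWrapper(ABC):
    """Illustrative sketch of a tool wrapper grouped by evaluation phase;
    method names are hypothetical, not the real SEALS signatures."""

    # --- Methods required in both phases ---
    @abstractmethod
    def load_ontology(self, ontology_uri: str) -> bool: ...

    @abstractmethod
    def returns_ranked_list(self) -> bool: ...

    @abstractmethod
    def is_execution_finished(self) -> bool: ...

    @abstractmethod
    def get_results(self) -> str:
        """Return URIs in the SPARQL Query Results XML Format."""

    # --- Automated phase only ---
    @abstractmethod
    def execute_query(self, tool_specific_query: str) -> None: ...

    # --- User-in-the-loop phase only ---
    @abstractmethod
    def is_input_finished(self) -> bool: ...

    @abstractmethod
    def get_user_query_string(self) -> str: ...

    @abstractmethod
    def get_internal_query(self) -> str:
        """Internal representation, reusable by execute_query."""
```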
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Evaluation results</title>
      <p>This section presents the preliminary results and analyses from the first SEALS
Evaluation Campaign, which was conducted during Summer 2010. The list of
participants and the phases in which they participated is shown in Table 1. Formal
analysis of the results is still ongoing and is the subject of current and future
work. However, these preliminary results contain a number of interesting points
which merit discussion. Furthermore, it should be noted that for some tools the
formal evaluation is still ongoing; indeed, this is the case for PowerAqua, hence
no detailed results will be presented for this tool. Due to space constraints, we
concentrate on the user-in-the-loop experiment results, since these are the most
interesting for benchmarking semantic search tools and for obtaining an insight into
what functionality users want from such a tool and whether or not the tools</p>
      <sec id="sec-5-2">
        <p>included in this campaign meet those requirements. As described in Sec. 3.4, the
subjects in each experiment provided feedback via questionnaires, which will also
be discussed.</p>
        <sec id="sec-5-2-1">
          <title>Results and discussion</title>
          <p>The user-in-the-loop evaluation results are shown in Table 2. In order to facilitate
the discussion, the responses to each of the twenty questions by all users, along
with the average experiment time and feedback scores, have been averaged.</p>
          <p>The mean experimental time indicates how long, on average, the entire
experiment (answering twenty pre-defined questions) took for each user. The mean
SUS indicates the mean system usability score for each tool as reported by the
users themselves. The mean extended questionnaire score shows the average response
to the questionnaire in which more detailed questions were used to establish the
user's satisfaction, and is scored out of 5. The mean number of attempts shows
how many times the user had to reformulate their query using the tool's
interface in order to obtain answers with which they were satisfied (or indicated that
they were confident a suitable answer could not be found). This latter
distinction between finding the appropriate answer after a number of attempts and the
user 'giving up' after a number of attempts is shown by the mean answer-found
rate. Input time refers to the amount of time the subject spent formulating their
query using the tool interface before submitting the query.</p>
          <p>
            The results show that the difference in perceived usability between K-Search
and Ginseng is not significant (their SUS scores are almost identical), whereas
the SUS score for NLP-Reduce is much lower. It is also evident that none of
the tools received a score which indicates a satisfactory user experience.
(Extended results and analysis for both the user-in-the-loop and automated phases
will be available from the SEALS website from December 2010. For details of the
questions used in the extended questionnaire, download the experiment pack from
http://www.seals-project.eu/seals-evaluation-campaigns/semantic-search-tools/experiment-pack.)
Bangor et al. [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] associated 'adjective ratings' with the SUS score. According to these
adjective ratings, both K-Search and Ginseng fall into the Poor to OK range,
with NLP-Reduce classified as Awful (see Table 3 in [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]). This is confirmed
by the details of the recorded user behaviour. For instance, for K-Search and
Ginseng, subjects required more than two attempts to formulate their query
before they were satisfied with the answer or moved on. Subjects using
NLP-Reduce, however, required more than five attempts, twice that of the other
tools. Users of K-Search found satisfactory answers twice as often as those who
used Ginseng and NLP-Reduce, which is supported by the higher f-measure score
for K-Search compared with the other tools.
          </p>
          <p>This usability performance is supported both by the low extended
questionnaire results and by the feedback which was collected from each of the
experiment subjects. This is interesting since, despite the tools using different interface
approaches (form-based versus natural language), neither provided the flexibility
desired by the subjects. When using K-Search, many subjects reported that they
liked the interface and particularly the ability to 'see the ontological concepts
and relations between concepts easily', thus allowing 'the user to know just what
sort of information is available to be retrieved from the system'. However, the
rigid framework of a form-based interface was also the cause of many of the
subjects' dislikes. K-Search provided no mechanism for negation: it was not possible
to formulate queries to answer questions such as 'Tell me which rivers do not
traverse the state with the capital Nashville'. Furthermore, while the form-based
approach allows the creation of queries containing multiple concepts, it was not
clear how these related to each other. For instance, one subject reported that
'if I had 3 states on my form and I added a hasCity relation it was not obvious
which state should have the city'.</p>
          <p>Natural language interfaces are often promoted as a more flexible way of
entering a query than keyword- or form-based approaches. However, this presents
a significant challenge to such tools: how to cope with the vast range of possible
ways of formulating a query. For instance, Ginseng employs a commonly used
solution: restrict the vocabulary and/or grammar which can be used for query
entry. The use of a very restricted language model can resemble
`autocompletion' when creating simple queries. Subjects liked the speed with which (simple)
queries could be entered; however, difficulties arose with more complex questions.
Subjects reported that the language model could `railroad' them in a
particular direction. In this situation, it was commonly acknowledged that the only
alternative was to start again. Furthermore, it was sometimes unclear to subjects
which suggested search terms related to which ontological concepts, leaving
subjects confused. The language model (or underlying query engine) of Ginseng
did not allow comparative queries using terms such as `biggest' or `smaller than'.
Although not employing a restrictive language model, NLP-Reduce suffered from
similar criticisms to Ginseng regarding its NL input - largely due to the naive
underlying NLP engine. Indeed, as the SUS score indicates, the subjects found
it much harder to use; for instance, the tool did not allow the use of superlatives,
and subjects commonly reported that the tool did not understand what they had
entered, thus forcing the subject to start again (hence NLP-Reduce having twice
the number of attempts).</p>
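          <p>Ginseng's restricted entry can be pictured as grammar-guided completion: after each token, only the continuations licensed by the language model are offered, and a prefix outside the model leaves the user with no option but to restart. A minimal sketch with an invented successor table (not Ginseng's actual grammar):</p>

```python
# Toy successor table standing in for a restricted query language model:
# it maps the tokens typed so far to the completions allowed next.
# States and vocabulary are invented for illustration.
GRAMMAR = {
    (): ["which", "what"],
    ("which",): ["river", "state"],
    ("which", "river"): ["traverses"],
    ("which", "river", "traverses"): ["arizona", "tennessee"],
    ("which", "state"): ["has"],
    ("which", "state", "has"): ["capital"],
}

def suggest(tokens):
    """Completions allowed after the given prefix; an empty list is a
    dead end, from which the only recourse is to start the query again."""
    return GRAMMAR.get(tuple(tokens), [])

print(suggest([]))                    # ['which', 'what']
print(suggest(["which", "river"]))    # ['traverses']
print(suggest(["which", "biggest"]))  # [] - comparatives are not in the model
```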
          <p>Finally, a commonly reported deficiency of all the tools was the manner in
which a query's results could be managed or stored. Since a number of the
questions used in the experiment had a high complexity level and needed to be
split into two or more sub-queries, subjects reported that they would have liked
either to use previous results as the basis of the next query or simply to
store the results temporarily to allow some form of intersection or union
operation with the current result set.
</p>
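          <p>The facility the subjects asked for amounts to treating each sub-query's answers as a named, reusable set. A hypothetical sketch of such a result workspace (the class, the query names and the data are all invented for illustration):</p>

```python
# Hypothetical result workspace: each sub-query's answers are saved
# under a name and can later be combined by intersection or union.
class ResultWorkspace:
    def __init__(self):
        self.saved = {}

    def save(self, name, results):
        self.saved[name] = set(results)

    def intersect(self, a, b):
        return self.saved[a] & self.saved[b]

    def union(self, a, b):
        return self.saved[a] | self.saved[b]

ws = ResultWorkspace()
ws.save("rivers_in_tennessee", ["cumberland", "mississippi_river"])
ws.save("rivers_over_500_miles", ["mississippi_river", "colorado_river"])

print(sorted(ws.intersect("rivers_in_tennessee", "rivers_over_500_miles")))
# ['mississippi_river']
```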
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>This paper has presented a methodology for the evaluation of any semantic
search tool regardless of its user interface. A critical aspect of semantic search
tool benchmarking is the user's experience of using the tool. Search is a
user-centric activity and, without a formalised evaluation of the tool's interface, only
a limited insight into a tool's applicability to a particular task can be gained.
Therefore, we adopted a two-phase approach: an automated phase and a
user-in-the-loop phase. This approach has impacted all aspects of the evaluation
methodology: the criteria and metrics, the datasets and the analyses have all
had to be carefully chosen to accommodate the two phases. Indeed, in
many cases, each phase is distinct (for example, each phase has its own, distinct,
dataset).</p>
      <p>As can be seen in the results section, the evaluation has provided a rich
source of data - only a small amount of which we have been able to present in
this paper. It is clear that users of search tools have very high expectations of
their performance and usability. The pervasive use of web search engines, such
as Google, conditions the way in which non-expert users view search; indeed,
a number of subjects in the user-in-the-loop experiment compared the tools
(unfavourably) to Google. However, with respect to the results, many subjects
reported that they wanted a much more sophisticated management (and subsequent
additional querying) of the result set rather than the traditional list of answers
and simple query refinement.</p>
      <p>The identi cation of such de ciencies in current search technologies is the
purpose of the SEALS benchmarking initiative and will help drive the
technology to meet the needs of the user. Furthermore, the regular SEALS evaluation
campaigns will help monitor this progress and, as the benchmarking approaches
become increasingly sophisticated, provide increasingly detailed insights into the
technology and user interfaces employed.</p>
      <p>The results and analyses presented in this paper are preliminary and a more
detailed study of the results is currently underway. Indeed, the first campaign has
acted as an evaluation not only of the participating tools but of the methodology
itself. This is the first evaluation of its kind and the experiences of organising
and executing the campaign, as well as feedback from the participants, will help
improve the methodology and organisation of future campaigns.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A.</given-names>
            <surname>Bangor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. T.</given-names>
            <surname>Kortum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Miller</surname>
          </string-name>
          .
          <article-title>An empirical evaluation of the system usability scale</article-title>
          .
          <source>International Journal of Human-Computer Interaction</source>
          ,
          <volume>24</volume>
          (
          <issue>6</issue>
          ):
          <volume>574</volume>
          -
          <fpage>594</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Bangor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. T.</given-names>
            <surname>Kortum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Miller</surname>
          </string-name>
          .
          <article-title>Determining what individual SUS scores mean: Adding an adjective rating scale</article-title>
          .
          <source>Journal of Usability Studies</source>
          ,
          <volume>4</volume>
          (
          <issue>3</issue>
          ):
          <volume>114</volume>
          -
          <fpage>123</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>R.</given-names>
            <surname>Bhagdev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chapman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ciravegna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lanfranchi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Petrelli</surname>
          </string-name>
          .
          <article-title>Hybrid search: Effectively combining keywords and semantic searches</article-title>
          .
          <source>In The Semantic Web: Research and Applications</source>
          , pages
          <volume>554</volume>
          -
          <fpage>568</fpage>
          . Springer Berlin / Heidelberg,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>J.</given-names>
            <surname>Brooke</surname>
          </string-name>
          .
          <article-title>SUS: a quick and dirty usability scale</article-title>
          . In
          <string-name>
            <given-names>P. W.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Weerdmeester</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I. L.</given-names>
            <surname>McClelland</surname>
          </string-name>
          , editors,
          <source>Usability Evaluation in Industry</source>
          , pages
          <volume>189</volume>
          -
          <fpage>194</fpage>
          . Taylor and Francis,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>M.</given-names>
            <surname>Esteban-Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>García-Castro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Gómez-Pérez</surname>
          </string-name>
          .
          <article-title>Executing evaluations over semantic technologies using the SEALS platform</article-title>
          .
          <source>In International Workshop on Evaluation of Semantic Technologies (IWEST</source>
          <year>2010</year>
          ),
          <source>ISWC</source>
          <year>2010</year>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. E. Kaufmann.
          <article-title>Talking to the Semantic Web - Natural Language Query Interfaces for Casual End-Users</article-title>
          .
          <source>PhD thesis</source>
          , Faculty of Economics, Business Administration and Information Technology of the University of Zurich,
          <year>September 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. E. Kaufmann and
          <string-name>
            <given-names>A.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          .
          <article-title>How useful are natural language interfaces to the semantic web for casual end-users?</article-title>
          <source>In ISWC/ASWC</source>
          , pages
          <volume>281</volume>
          -
          <fpage>294</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>A.-M.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Kautz</surname>
          </string-name>
          .
          <article-title>Towards a theory of natural language interfaces to databases</article-title>
          .
          <source>In IUI '03: Proceedings of the 8th international conference on Intelligent user interfaces</source>
          , pages
          <volume>149</volume>
          -
          <fpage>157</fpage>
          , New York, NY, USA,
          <year>2003</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>L. R.</given-names>
            <surname>Tang</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Mooney</surname>
          </string-name>
          .
          <article-title>Using multiple clause constructors in inductive logic programming for semantic parsing</article-title>
          .
          <source>In Proceedings of the 12th European Conference on Machine Learning</source>
          , pages
          <volume>466</volume>
          -
          <fpage>477</fpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>