<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluation of Semantic Service Discovery - A Survey and Directions for Future Research</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ulrich Küster</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Holger Lausen</string-name>
          <email>holger.lausen@deri.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Birgitta König-Ries</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Digital Enterprise Research Institute, University of Innsbruck</institution>
          ,
          <addr-line>6020 Innsbruck</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Computer Science, Friedrich-Schiller-University Jena</institution>
          ,
          <addr-line>07743 Jena</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recent years a huge amount of effort and money has been invested in the area of semantic service discovery, and the presented approaches have become more sophisticated and mature. Nevertheless, surprisingly little effort is being put into the evaluation of these approaches. We argue that the lack of established and theoretically well-founded methodologies and test beds for the comparative evaluation of semantic service discovery is a major blocker of the advancement of the field. To lay the ground for a comprehensive treatment of this problem, we discuss the applicability of well-known evaluation methodologies from information retrieval and provide an exhaustive survey of the current evaluation approaches.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years semantic services research has emerged as an application of the
ideas of the semantic web to the service oriented computing paradigm.
Semantic web services (SWS in the following) have received a significant amount of
attention and research spending since their beginnings roughly six years ago [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Within the sixth EU framework program (http://cordis.europa.eu/fp6/projects.htm), which ran from 2002 to 2006, alone
at least 20 projects with a combined funding of more than 70 million Euro dealt
directly with semantic services, which gives a good impression of the importance
currently placed on this field of research. In the following we will focus on
efforts in the field of SWS discovery and matchmaking. We refer to discovery as
the whole process of retrieving services that are able to fulfill a need of a client,
and to matchmaking as the problem of automatically matching semantically
annotated service offers with a semantically described service request. However, we
think that our findings also apply to other related areas, such as automated semantic
service composition.
      </p>
      <p>
        In this paper we argue that despite the huge amount of effort (and money)
spent on SWS discovery research, and despite the fact that the presented
approaches have become more sophisticated and mature, much too little effort is put
into the evaluation of the various approaches. Even though a variety of different
service matchmakers have been proposed, we did not succeed in finding any
publications with a thorough, systematic, objective and well designed evaluation of
those matchmakers. This corresponds to a trend that seems to exist in
computer science in general. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] Tichy et al. find that computer scientists publish
relatively few papers with experimentally validated results compared to other
sciences. In a follow-up work [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] Tichy claims that this trend is harmful for the
progress of the science.
      </p>
      <p>There are positive examples that back his claim:</p>
      <p>
        "[in the experiments] . . . there have been two missing elements. First [. . . ]
there has been no concerted effort by groups to work with the same data, use
the same evaluation techniques, and generally compare results across systems.
The importance of this is not to show any system to be superior, but to allow
comparison across a very wide variety of techniques, much wider than only
one research group would tackle. [. . . ] The second missing element, which has
become critical [. . . ] is the lack of a realistically-sized test collection. Evaluation
using the small collections currently available may not reflect performance of
systems in large [. . . ] and certainly does not demonstrate any proven abilities
of these systems to operate in real-world [. . . ] environments. This is a major
barrier to the transfer of these laboratory systems into the commercial world."
This quote by Donna Harman [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] addressed the situation in text retrieval
research prior to the establishment of the series of TREC conferences (http://trec.nist.gov/) in 1992,
but seems to perfectly describe the current situation in SWS discovery research.
Harman continued:
      </p>
      <p>"The overall goal of the Text REtrieval Conference (TREC) was to address
these two missing elements. It is hoped that by providing a very large test
collection, and encouraging interaction with other groups in a friendly evaluation
forum, a new thrust in information retrieval will occur."
From the perspective of today, it is clear that her hope regarding the positive
in°uence of the availability of mature evaluation methods to the progress of
information retrieval research was well justi¯ed. In this paper we argue that a
similar e®ort for SWS related research is necessary today for the advancement
of this ¯eld.</p>
      <p>The rest of this paper is organized as follows. In Section 2 we review the
philosophy of information retrieval evaluation and argue that traditional
evaluation methods cannot be applied easily to the domain of SWS matchmaking.
In Section 3 we provide an extensive survey of current evaluation efforts in the
area of SWS discovery. We cover the related work in Section 4, draw conclusions
about what is missing so far and provide directions for future work in Section 5,
and finally summarize in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>The Philosophy of Information Retrieval Evaluation</title>
      <p>
        In a broader context, discovery of semantic web services can be seen as a special
information retrieval (IR) problem. According to Voorhees [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], IR evaluation has
      </p>
      <sec id="sec-2-1">
        <p>been dominated for four decades by the Cranfield paradigm, which is
characterized by the following properties:
- An IR system is mainly evaluated by means of recall and precision.
- Recall is defined as the proportion of relevant documents that are retrieved.
- Precision is defined as the proportion of retrieved documents that are relevant.
- Relevance is based on topical similarity as obtained from the judgements of
domain experts.
- Test collections therefore have three components: a set of documents (the
test data), a set of information needs (topics or queries) and a set of relevance
judgements (the list of documents which should be retrieved for each query).
Voorhees identifies several assumptions on which the Cranfield paradigm is based
that are unrealistic in most cases. She concludes that experiments based on those
assumptions are a noisy process, but is able to provide evidence that, despite
the noise, such experiments yield useful results as long as they are only used
to assess the relative performance of different systems evaluated by the same
experiment.</p>
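The Cranfield-style measures above are straightforward to compute. The following minimal Python sketch (with invented document IDs, not drawn from any real test collection) spells out the definitions for a single query:

```python
# Sketch of the Cranfield metrics discussed above, computed for one query
# against a set of relevance judgements. All document IDs are invented.

def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for one query.

    precision: fraction of retrieved documents that are relevant
    recall:    fraction of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical run: the system returned 4 documents, 3 of which the
# judgements mark as relevant, out of 6 relevant documents overall.
p, r, f1 = precision_recall_f1(
    retrieved=["s1", "s2", "s3", "s9"],
    relevant=["s1", "s2", "s3", "s4", "s5", "s6"])
# p = 0.75, r = 0.5, f1 = 0.6
```

Scores for a whole test collection are then typically averaged over all queries.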
        <p>Since the experiments based on the Cranfield paradigm are extremely well
established, since the methodology of these experiments is well understood, and
since SWS matchmaking is a special information retrieval problem, it seems
obvious to try to apply the same methods and measures to the SWS matchmaking
domain. However, in the following we argue that this is not a promising approach,
for various reasons.</p>
        <p>A model of the general process of information retrieval is depicted in Figure 1.
The user has a real world need (like information about a certain topic) that needs
to be satisfied with the existing real world supply (like a collection of documents).
Both the need and the supply are abstracted to a model. In the case of web search
engines, for instance, such a model will consist of descriptors extracted from the
query string and data structures like indexes built upon descriptors extracted
from the web pages etc. The information retrieval system then operates on that
model to match the need with the supply and returns the (real world) results.
As a matter of fact, the power of this model (how well it captures the real world
and how well it supports the retrieval, i.e. the matchmaking and ranking process) is
of critical importance for the retrieval system and thus a central component of
its overall performance.</p>
        <p>Traditional information retrieval systems typically create the model they
operate on in an autonomous fashion. Thus, from the viewpoint of an evaluation,
they operate on the original data. Consequently, completely different IR systems
can be evaluated on a common test data set (like a collection of documents).</p>
        <p>SWS matchmaking follows a different paradigm. Here the semantic
annotation is the model that is exploited during the matchmaking, and it is not created
automatically, but written by human experts. Currently there is no agreed upon
formalism used for the semantic annotations; instead, competing and mostly
incompatible formalisms are in use, like WSMO (http://www.wsmo.org/), OWL-S (http://www.daml.org/services/owl-s/), WSDL-S (http://lsdis.cs.uga.edu/projects/meteor-s/wsdl-s/), DSD (http://hnsp.inf-bb.uni-jena.de/DIANE/), etc.</p>
        <p>To apply the Cranfield paradigm to the evaluation of SWS discovery, one
could provide a test collection of services in a particular formalism (e.g.
OWL-S) and limit participation in the experiment to systems based on that formalism.
This is the approach of the current S3 Matchmaker Contest (Section 3.1).
Unfortunately, this excludes the majority of systems from participation. But there
is an even more severe issue. Virtually all semantic matchmaking systems that
are based on some form of logical reasoning operate deterministically. Therefore
the question whether a semantic offer description matches a semantic request
description in the same formalism can usually be decided unambiguously,
yielding perfect recall and precision (depending only on the definition of a match).
The task to evaluate, however, is the retrieval of real-world services that match
a real-world request. Thus, the major source of differentiation between various
approaches is the expressivity of the employed formalism and reasoning. The
critical questions here are:
- How precisely can a description based on a particular formalism reflect the
real-world semantics of a given service (offer or request)?
- How much of the information contained in the descriptions of a service can
a matchmaker use efficiently and effectively to decide whether two services
match?
Note that the first question usually calls for more expressive formalisms, whereas
the second one requires less expressive formalisms to keep the reasoning tractable.</p>
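To make the determinism argument concrete, here is a toy Python sketch of subsumption-based matching over an invented concept hierarchy. The degree names (exact / plug-in / subsumes / fail) follow common matchmaker terminology, but the hierarchy and the degree definitions are simplified assumptions, not any specific system:

```python
# Toy illustration of deterministic, subsumption-based matchmaking: given a
# fixed class hierarchy, the degree of match between an offered and a
# requested output concept is decided unambiguously. The hierarchy and the
# degree definitions are invented simplifications for illustration.

SUPER = {            # child -> parent in a toy concept hierarchy
    "Auto": "Car",
    "Car": "Vehicle",
    "Bicycle": "Vehicle",
    "Vehicle": "Thing",
}

def ancestors(concept):
    """Collect all superclasses of a concept by walking up the hierarchy."""
    out = []
    while concept in SUPER:
        concept = SUPER[concept]
        out.append(concept)
    return out

def degree_of_match(offer, request):
    if offer == request:
        return "exact"
    if request in ancestors(offer):   # offer is more specific than requested
        return "plug-in"
    if offer in ancestors(request):   # offer is more general than requested
        return "subsumes"
    return "fail"

# The same pair of descriptions always yields the same verdict:
degree_of_match("Auto", "Car")      # -> "plug-in"
degree_of_match("Car", "Auto")      # -> "subsumes"
degree_of_match("Bicycle", "Car")   # -> "fail"
```

Because the verdict is a pure function of the two descriptions, an experiment over such a matcher measures the descriptions and the formalism at least as much as the matcher itself.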
      </sec>
      <sec id="sec-2-2">
        <title>3 http://www.wsmo.org/</title>
      </sec>
      <sec id="sec-2-3">
        <title>4 http://www.daml.org/services/owl-s/</title>
      </sec>
      <sec id="sec-2-4">
        <title>5 http://lsdis.cs.uga.edu/projects/meteor-s/wsdl-s/</title>
      </sec>
      <sec id="sec-2-5">
        <p>As argued above, the formalism employed for the semantic annotation of the
services, i.e. the model used for the matchmaking, is of crucial importance for the
overall performance of the discovery system. Consequently, a good evaluation of
SWS discovery should measure not only the performance of a system for a given
formalism, but also evaluate the pros and cons of that formalism itself. In fact, as
long as there is no common understanding about the pros and cons of different
formalisms and no agreement about which formalism to employ for a given task,
evaluation of SWS discovery should first and foremost help to establish this
missing common understanding and agreement.</p>
        <p>The approach outlined above, however, neglects the influence of the model
used in the matchmaking process and therefore does not measure the
performance of that part of the retrieval process which has the largest influence on
the overall performance of the system.</p>
        <p>
          A different approach that overcomes this limitation would be to provide a
test collection of services in any format (e.g. human language) and let the
participants annotate the services manually with their particular formalism. This is
the approach taken by the SWS-Challenge (Section 3.2) and the DIANE
evaluation (Section 3.3). Unfortunately there are problems with this procedure, too.
First, such an experiment can hardly be performed on a large test collection,
since the effort for the participants to manually translate the services into their
particular formalisms is enormous. Yet, the unavoidable noise of experiments
based on the Cranfield paradigm precisely requires large test collections to yield
stable results [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Second, due to the human involvement, such an experiment cannot
be conducted in an automated way. Even worse, such an experiment does
not only measure the performance of the matchmaking formalism and system,
but also the abilities of the experts that create the semantic annotations. This
introduces a whole new dimension of noise to the evaluation.
        </p>
        <p>For the reasons given above we conclude that the experimental setup and the
evaluation measures and methods developed for traditional information retrieval
do not transfer directly to the SWS discovery domain. The influence of the
described problems needs to be explored, and new methods and measures have
to be developed where necessary. To lay the foundation for this task, we provide
an extensive survey of the current efforts in SWS discovery evaluation in the
following section.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>A Survey Of Current Approaches in Semantic Web</title>
    </sec>
    <sec id="sec-4">
      <title>Service Discovery Evaluation</title>
      <sec id="sec-4-1">
        <title>S3 Matchmaker Contest</title>
        <p>Klusch et al. have recently announced an annual international contest S3 on
Semantic Service Selection (http://www-ags.dfki.uni-sb.de/~klusch/s3/), whose first edition will be held in conjunction with
the upcoming International Semantic Web Conference in Busan, Korea
(November 2007). We would like to express our acknowledgement and appreciation of
this new effort, which we welcome very much. Despite this, we identify some
problems in the current setup of the contest.</p>
        <p>The contest is based on a test collection of OWL-S services, and "evaluation
of semantic web service matchmakers will base on classic performance metrics
recall/precision, F1, average query response time". As argued in the previous
section, this style of application of the Cranfield paradigm to the SWS
matchmaking domain has a limited scope and significance, since it does not allow a
comparative evaluation of different semantic formalisms.</p>
        <p>We furthermore think there is currently a problematic flaw in the practical
setup of the contest, too. The most severe Achilles' heel of any such contest is
the dependency on a good SWS test collection. This year the S3 contest will rely
solely upon the OWL-S Test Collection 2, which we believe to be unsuitable for
a meaningful comparative and objective SWS matchmaking evaluation. We will
explain our skepticism through a critical review of the collection.</p>
        <p>
          The OWL-S Test Collection 2 (http://projects.semwebcentral.org/projects/owls-tc) is the only publicly available test collection of
semantically annotated services of mentionable size. It has been developed within
the SCALLOPS project (http://www-ags.dfki.uni-sb.de/~klusch/scallops/) at the German Research Centre for Artificial
Intelligence (DFKI). The most recent version 2.1 of the collection (OWLS-TC2, released
in October 2006) contains 582 semantic web services written in OWL-S 1.1. (As an aside, it is
planned to extend the scope of the S3 contest beyond OWL-S based matchmakers in the
future; however, public test collections based on other formalisms have unfortunately
not been developed so far, and the S3 contest organizers have set up a public wiki,
http://www-ags.dfki.uni-sb.de/swstc-wiki, to initiate efforts in this direction.) To
put our following criticism into the correct light, and in acknowledgement that
currently no better public standard test collection exists, we would like to
mention that the OWLS-TC2 does not claim to be more than "one possible starting
point for any activity towards achieving such a standard collection by the
community as a whole" [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Our criticism of OWLS-TC2 covers three aspects.
Use of realistic real-world examples. One common criticism of many use cases
and evaluations in the service matchmaking domain is the use of artificial toy
examples which are far from realistic applications. Even though examples do
not necessarily have to be realistic to test features of a matchmaking system,
the use of real-world examples clearly minimizes the danger of failing to detect
lacking features or awkward modeling. Furthermore, toy examples far from
real-world applications critically hinder the acceptance of new technology by
industry. OWLS-TC2 claims that "the majority of [. . . ] services were retrieved from
public IBM UDDI registries, and semi-automatically transformed from WSDL to
OWL-S" [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Thus, one would expect somewhat realistic services, but a substantial
share of the 582 services of OWLS-TC2 seems quite artificial and idiosyncratic.
Oftentimes the semantics of a service is incomprehensible even for a human
expert, and unfortunately only six of the original WSDL files are included in
the test set download. A comprehensive coverage is impossible due to the size
of OWLS-TC2, but the following examples illustrate the issues (in the following,
service names always refer to the name of the corresponding service description
file, not the service name from the service's profile; quotes are from the service's
description file or the OWLS-TC2 manual):
- Some services are simply erroneous; quite a few services, for instance, are
pairwise identical except for the informal textual description (e.g.
price CannonCameraservice.owls and price Fishservice.owls).
- The service destination MyOfficeservice.owls is supposed to "return
destination of my office", but takes concepts of type organization and surfing
(which is a subclass of sports) as input.
- The service surfing farmland service.owls is described as "This is the
recommended service to know about the farmland for surfing" and has an
input of type surfing and an output of type farmland. What is the semantics
of this service?
- The service qualitymaxprice cola service.owls "provides a cola for the
maximum price and quality. The quality is an optional input." It is described
by its inputs of type maxprice and quality and an output of type cola. There
are a whole lot of similar services that return cola (six more services), beer
+ cola, coffee + whiskey (eleven services), cola-beer, cola + bread or biscuit
(two services), drinks (three services), liquid, whiskey + cola-beer as well as
irish coffee + cola. It remains unclear what the semantics of these services is.
Besides these issues, we believe that examples from domains like funding of
ballistic missiles, which the typical user of an evaluation system does not have any
experience with, make a realistic evaluation unnecessarily difficult.
Semantic richness of descriptions. Services should not only be realistic and
realistically complex, they should also be described in sufficient detail to allow for
meaningful semantic discovery. After all, there should be an advantage to using
semantic annotations compared to simply using traditional information retrieval
techniques. Unfortunately, the services of OWLS-TC2 are described extremely
superficially. First of all, it seems that all services are solely described by their
inputs and outputs. What is the semantics of a service (car price service.owls)
that takes a concept of type Car as input and has a concept of type Price as
output? It might sell you a car and tell you the price afterwards; it might just as
well only inform you about the price of a new car or the price of a used car. It
might rent a car for the returned price. It might tell you the price of the yearly
inspection for the given car. There are many different possible interpretations.
What is the semantics of a service like car priceauto service.owls that takes
as input a concept of type Car and has outputs of type Price and Auto (which
is a subclass of car)?</p>
        <p>In our view, the services of OWLS-TC2 are not described in sufficient detail
to allow meaningful semantic discovery to be performed on them. The problem is
greatly aggravated by the fact that the services in OWLS-TC2 make use of
classes in a class hierarchy but do not make use of attributes or relations. Thus,
in most cases the semantics of the services is greatly underspecified and, if at all,
understandable only from the informal textual documentation (this may have been
on purpose, since the OWLS-MX matchmaker, the matchmaker OWLS-TC was designed
for, is a hybrid matchmaker that combines semantic matchmaking with traditional
information retrieval techniques). Overall, it seems
the textual descriptions of the service offers and queries are not captured well by
the semantic descriptions. Query 23, for instance, is informally described as "the
client wants to travel from Frankfurt to Berlin, that's why it puts a request to
find a map to locate a route from Frankfurt to Berlin." This request is described
(geographicalregiongeographical-region map service.owls) as a request
for a service with two unordered inputs of type geographical region and a single
output of type map. Clearly, routing services will also be found (among many
others) by such a request, but we are afraid that offers and requests described at
this level of detail will neither allow to demonstrate the added value of semantic
service discovery nor to evaluate the power of matchmakers which should create
this added value.</p>
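The underspecification problem can be made concrete with a small Python sketch: if services are characterized only by input/output types, semantically different services become indistinguishable. The service names and type names below are invented for illustration, loosely following the Query 23 example:

```python
# Sketch: matching on input/output signatures alone. Several semantically
# different services share the signature (region, region) -> map, so a
# signature-only matcher cannot tell them apart. All names are invented.

SERVICES = {
    "route_map":    {"in": ("GeographicalRegion", "GeographicalRegion"), "out": "Map"},
    "border_map":   {"in": ("GeographicalRegion", "GeographicalRegion"), "out": "Map"},
    "climate_map":  {"in": ("GeographicalRegion", "GeographicalRegion"), "out": "Map"},
    "hotel_finder": {"in": ("GeographicalRegion",), "out": "HotelList"},
}

def match_by_signature(inputs, output):
    """Return every service whose IO signature equals the request's
    (inputs are unordered, so they are compared as sorted lists)."""
    return sorted(name for name, sig in SERVICES.items()
                  if sorted(sig["in"]) == sorted(inputs) and sig["out"] == output)

# A Query-23-style request retrieves all three map services, although only
# one of them actually computes routes:
match_by_signature(["GeographicalRegion", "GeographicalRegion"], "Map")
# -> ['border_map', 'climate_map', 'route_map']
```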
        <p>
          Independence of offer and request descriptions. Ideally, service offer and request
descriptions should be designed independently, since this is the envisioned
situation in reality. Service providers describe their offers, clients query for a service
with a semantic request description, and a matchmaker is supposed to find the
offers that match the request. We acknowledge that in laboratory settings it is
sometimes desirable to artificially design the offers to match a request at various
degrees. This way it can be assured that all potentially existing degrees of match
occur during a test run. However, a test where the offers have been designed to
match a request at hand with specific degrees runs the risk of doing nothing
more than supporting the belief that a particular matchmaker implementation
operates as expected. It does not demonstrate the power of a certain semantic
description formalism or a certain matchmaking approach. Despite the fact that
OWLS-TC2 claims that most services were retrieved from public IBM UDDI
registries, we got the impression that for most of the queries in OWLS-TC2
the matching services have been artificially designed for that particular query.
Query 4, for instance, asks for the combined price of a car and a bicycle. It seems
quite idiosyncratic to buy a car and a bicycle as a package, yet there are at least
eleven service offers in OWLS-TC2 that precisely offer to provide the price of a
package of one car and one bicycle. Our impression is further backed up by the
fact that the number of relevant services is quite stable across all the queries.
Conclusions. OWLS-TC2 has been developed by the effort of a single group to
evaluate a particular (hybrid) matchmaker [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], and the OWLS-TC2 manual states
that it has been designed to be balanced with respect to the matching filters of
that matchmaker, i.e. besides performing semantic discovery it explicitly also
uses classical information retrieval techniques. Thus OWLS-TC2 is suited to
test and evaluate the features of this particular hybrid matchmaker, but for the
reasons given above we do not think this test collection is suited for a broader
comparative evaluation of different semantic matchmakers. Based on this finding
and the discussion in Section 2, we doubt that the current setup of the S3 contest
will yield meaningful results.
        </p>
        <p>To put our criticism above into the correct context, we would like to
acknowledge once more that sadly there is currently no better public test collection than
OWLS-TC2, and that the creation of a balanced, realistic, rich, high-quality
semantic service test collection involves an immense amount of effort that clearly
exceeds the capabilities of any single group. The organizers of the S3 contest
have therefore stressed that such a collection can only be built by the community
as a whole, and that the contest and its current employment of OWLS-TC2 is
only a first step in that direction. They have set up a wiki (http://www-ags.dfki.uni-sb.de/swstc-wiki) to initiate a
corresponding community effort. We hope that our critical analysis of OWLS-TC2
will help to motivate such a community effort and will therefore ultimately help
to improve the quality of the emerging collections.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Semantic Web Service Challenge</title>
        <p>The Semantic Web Service Challenge is an initiative aiming to create a test
bed for frameworks that facilitate the automation of web service mediation and
discovery. It is organized as a series of workshops in which participants try to
model and solve problems described in the publicly available test bed. The test
bed is organized in scenarios (e.g. discovery or mediation), each one containing
detailed problem descriptions. Compared to the S3 contest, the number of
available services (at the time of writing, around a dozen) is relatively small; however,
the organizers put strong emphasis on providing realistic and detailed scenarios.</p>
        <p>The Challenge organizers have realized that the lack of comprehensive
evaluation and test beds for semantic web service systems is one of the major blockers
for industrial adoption of the employed techniques. They have designed the challenge
with the following ideas in mind:
- Solution Independence. Existing test cases often suffer from the problem that
they have been reverse engineered from the solution, i.e. that the use case
has been created according to the strengths of a particular solution. This
hinders comparison across multiple systems. By not letting the organizers
directly participate, and by defining rules on how new scenarios can be added,
the SWS Challenge tries to overcome this problem.
- Language Neutrality. Closely connected to the above issue is the question of how to
describe the problem set. Using a particular formalism for describing services
already implies the solution to a huge degree. In our opinion, the choice of
the right level of detail to include in the service and goal descriptions in fact
still constitutes one of the core research problems and should not be
dictated by the test bed for an evaluation. The SWS Challenge organizers have
consequently decided not to provide formal descriptions but only natural
language ones.
- No Participation Without Invocation. Each scenario provided comes with a
set of publicly available web services. On the one hand this should result in
some industrial relevance; on the other hand it provides the organizers with
an unambiguous evaluation method. If a system claims to be able to solve a
particular problem (e.g. discovery of the right shipment provider), this can
be automatically verified by monitoring the SOAP messages exchanged.
Scenario Design. Within the research community only little consensus exists
about what information should be included in a static service description and
how it should be semantically encoded. The scenarios are thus described
using existing technologies (WSDL, XSD, and natural language text descriptions).
In the following we will explain the philosophy of the scenarios by means of
the first discovery scenario provided (http://sws-challenge.org/wiki/index.php/Scenario:_Shipment_Discovery). This scenario includes five shipment
services that are modeled according to the role models of order forms of existing
shipment companies on the Internet. The services are backed by corresponding
implementations that are part of the test bed.</p>
        <p>The task to solve is to discover and invoke a suitable shipper for a given
shipping request. The scenario contains a set of such requests, which are categorized
into levels of increasing difficulty. It starts with Discovery Based on Destination
and adds weight and price criteria as well as simple composition and temporal
constraints for the more difficult problems. For each request, the problem
description already contains the expected correct solution (i.e. the list of matching
services).</p>
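As a rough illustration of what the early levels of this scenario ask for, the following Python sketch filters shipper records by destination region, weight and price. The shipper data and field names are invented placeholders, not the actual scenario services:

```python
# Sketch of the kind of check the first discovery levels call for: select
# shippers that can serve a request's destination, weight and price limit.
# All shipper records below are invented for illustration.

SHIPPERS = [
    {"name": "Muller", "regions": {"Europe"},           "max_kg": 50, "price": 10},
    {"name": "Racer",  "regions": {"Europe", "Africa"}, "max_kg": 20, "price": 25},
    {"name": "Walker", "regions": {"NorthAmerica"},     "max_kg": 70, "price": 15},
]

def discover(destination, weight_kg, max_price=None):
    """Return shippers matching the destination region, weight and
    (optionally) price constraints, cheapest first."""
    hits = [s for s in SHIPPERS
            if destination in s["regions"]
            and weight_kg <= s["max_kg"]
            and (max_price is None or s["price"] <= max_price)]
    return sorted(hits, key=lambda s: s["price"])

# discover("Europe", 30) -> only "Muller" ("Racer" is rejected on weight,
# "Walker" on destination)
```

The harder levels then layer composition (splitting an order across shippers) and temporal constraints on top of this basic filtering.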
        <p>Evaluation Methodology. Solutions to a scenario are presented at the
SWS-Challenge workshops. The evaluation is performed by teams composed of the
workshop organizers and the peer participants. The organizers are aware,
however, that this causes scalability problems if the number of participants increases,
and that it is also not strictly objective.</p>
        <p>The evaluation approach focuses on evaluating functional coverage, i.e.
on whether a particular level of the problem could be solved correctly by a particular
approach or formalism. The intention is to focus on the how, that
is, the concrete techniques and descriptions an approach uses to solve a problem,
and not on the time it requires for execution; thus no runtime performance
measurements are taken.</p>
        <p>The organizers argue that in practice automatic and dynamic discovery is
not widely used; thus part of the challenge is to refine the challenge and to
illustrate the benefit of using semantic descriptions. The basic assumption to
test is whether approaches which rely more heavily on semantic annotations will
be more easily adaptable to changes in the problem scenarios. Therefore the challenge
does not only certify functional coverage; initially it was also planned to
assess how elegantly a solution can address the problems posed and how much
effort was needed to proceed from a simpler to a more complex problem level.</p>
        <p>
          However, it turned out that it is extremely difficult to assess this in an
objective manner [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Measurements based on counting the number of lines (or
statements) of a semantic description do not adequately represent the usability of
an approach. The measurement of the changes required to solve new
problems also turned out to be problematic. It worked in the beginning, when all
participants started with the same known set of problem levels, which was then
extended at consecutive workshops. However, participants entering the challenge
right now have access to all problem levels right away, which makes an objective
assessment of the changes necessary to solve the more advanced levels on top of
the simpler ones impossible. It is planned to test the usability of surprise scenarios
for the envisioned assessment at the next workshop.
        </p>
        <p>Lessons Learned. By describing the problems without any particular
formalism in mind, the SWS Challenge organizers were able to attract various
different teams from different communities. The challenge thus successfully enables
evaluation across very heterogeneous solutions. By requiring a grounding in real web
services, a significant amount of effort was consumed, both on the side of the
organizers and of the participants, by problems related to standard web
service technology, which are not strictly relevant when looking at discovery in
isolation. This may also have discouraged potential teams more familiar with
knowledge representation than with web service technology. On the other side,
the implementation has proven useful to (1) disambiguate the natural
language text descriptions and (2) show beyond doubt whether a participant has or
has not solved a particular problem. Because of the implementation, no one could
change the scenario to fit their solution without failing at the automated tests
based on exchanged SOAP messages.</p>
        <p>With respect to the scenarios being described in informal natural language
only, it turned out that the original scenarios were indeed ambiguous in several
cases. However, during the course of usage of a particular scenario these
ambiguities were discovered by the participants and could subsequently be resolved by
the authors of that scenario. Our experience shows that in this way even
scenarios described in natural language only become sufficiently well-defined over
time. Usually the implementation also disambiguates a scenario, but it
is not the most efficient way to find out about the intention of a particular
aspect.</p>
      </sec>
      <sec id="sec-4-3">
        <title>DIANE Service Description Evaluation</title>
        <p>
          Within the DIANE project14, a service description language, the DIANE Service
Description (DSD), and an accompanying middleware supporting service
discovery, composition, and invocation have been developed. DIANE is one of the
projects taking part in the SWS Challenge. Besides the evaluation provided by
the challenge, considerable effort has been put into devising an evaluation suite
for semantic service description languages [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. While this work is certainly not
completed yet, it complements the SWS Challenge in some important aspects.
The DIANE evaluation focuses on four criteria an evaluation should measure:
1. Degree of Automation: Are the language and the tools powerful enough to
allow for automatic and correct service usage? That means: given a
service request and service offers, will the discovery mechanism find the
best-matching service offer, and will it be possible to automatically invoke the
service based on these results?
2. Efficiency of Matchmaking: Is it possible to efficiently and correctly compute
the match value of arbitrary offers and requests?
3. Expressiveness: Is it possible to describe real services and real service
requests in sufficient detail to meet Criterion 1? Can this be done with
reasonable effort?
4. Decoupling: Will a discovery mechanism be able to determine similarity
between service offers and requests that are developed independently of each
other? In other words: if a service requester writes his request without
knowledge of the existing service descriptions, does the language offer enough
guidance to ensure that suitable, and only suitable, services will be found by the
discovery mechanism?
14 http://hnsp.inf-bb.uni-jena.de/DIANE/
        </p>
        <p>It is quite obvious that these criteria require contradictory properties of
the description language: while, e.g., Criterion 3 requires a highly expressive
language, Criterion 2 becomes the easier to meet the less powerful the language
is. Service description languages thus need to strike a balance between these
competing requirements.</p>
        <p>Criteria 1 and 2 can basically be evaluated by providing a proof-of-concept
implementation. We will not look at them in more detail here but instead focus
on the more interesting Criteria 3 and 4. To evaluate these, a benchmark has
been designed. So far, this benchmark has been used for the DSD evaluation. It
is, however, not language specific and can be used for other approaches, too.</p>
        <p>To evaluate how well a service description language meets Criterion 3, a set
of real world services is needed. As mentioned earlier in the paper, the example
of the OWL-S TC shows that meaningful real world services are apparently not
easy to come by. In particular, meaningful real world services that are described
in sufficient detail are scarcely available. For our benchmark, we therefore chose
a different approach: a group of test subjects not familiar with semantic web
technology were asked to formulate service requests for two different application
domains. We chose a book-buying and train ticket scenario with typical
end user requests as one domain and a travel agency looking for external services
that can be included in applications as the second domain. The queries the test
subjects devised were formulated in natural language. This resulted in about 200
requests. In preparation of the benchmark, domain experts developed the ontologies
they deemed necessary to handle the two domains (books, money, trains, etc.).
Subsequently, the experts attempted to translate the requests into DSD and
computed how many requests could be directly translated, how many could be
translated but required extensions of the ontologies, and how many could not be
appropriately expressed using the language constructs provided by DSD. These
three values measure how well the language is able to describe realistic services
of different types.</p>
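<p>The three coverage values are simple fractions over the per-request translation outcomes; the following minimal sketch illustrates the tally with hypothetical labels and data (the actual benchmark outcomes are not reproduced here):</p>

```python
from collections import Counter

# Hypothetical per-request outcomes of the expert translation step.
# Each natural-language request is labelled with one of three results:
#   "direct"   - translatable with the existing domain ontologies
#   "extended" - translatable, but only after extending the ontologies
#   "failed"   - not expressible with the constructs the language provides
outcomes = ["direct", "direct", "extended", "failed", "direct", "extended"]

counts = Counter(outcomes)
total = len(outcomes)

# The three fractions measure how well the language covers realistic requests.
for label in ("direct", "extended", "failed"):
    print(f"{label}: {counts[label] / total:.0%}")
```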
        <p>To evaluate whether decoupled description of offers and requests is possible,
a number of the test subjects were given an introduction to DSD. They were
subsequently divided into two groups that were not allowed to communicate
with each other. The groups were then asked to formulate service offers and
requests, respectively, from a given natural language description. The resulting
DSD descriptions were then evaluated by the matcher, and precision and recall
of the matchmaking were determined. High values for both parameters indicate
that it is indeed possible to decouple offer and request description. A summary
of the benchmark queries and results can be found online15.</p>
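<p>Since each request in such an experiment comes with the set of services judged relevant, binary precision and recall reduce to set arithmetic over the matcher's results. The following generic sketch uses hypothetical service identifiers and is not the actual DIANE matcher:</p>

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Binary precision/recall for one request.

    retrieved: services the matchmaker returned for the request
    relevant:  services judged relevant for that request
    """
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: the matcher returns three offers, two of which
# appear among the three relevant ones.
p, r = precision_recall({"s1", "s2", "s4"}, {"s1", "s2", "s3"})
print(p, r)  # both 2/3 here
```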
      </sec>
      <sec id="sec-4-4">
        <title>Other Approaches</title>
        <p>
          The annual IEEE Web Service Challenge16 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] is similar in spirit to the S3
Matchmaker Contest, but focuses on syntactic or low-level semantic
matchmaking and composition based on matching WSDL part names, whereas
we focus on explicit higher-level semantics.
        </p>
        <p>
          Toma et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] presented a framework for the evaluation of semantic
matchmaking frameworks by identifying different aspects of such frameworks that
should be evaluated: query and advertising language, scalability, reasoning
support, matchmaking versus brokering, and mediation support. They evaluate a
number of frameworks in the service as well as the grid community with regard
to these criteria. The focus of the work is rather on the excellent survey than
on the comparison framework itself. While the framework does provide guidance
for a structured comparison, it does not offer concrete test suites, measures,
benchmarks or procedures for an objective comparative evaluation.
        </p>
        <p>
          In her PhD thesis [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], Åberg proposes a platform to evaluate service
discovery in the semantic web. However, her platform is rather a software architecture
providing some guidance for the development of SWS frameworks than a real
evaluation platform, and it does not become clear how this platform can help to
comparatively evaluate different web service frameworks. Despite some
interesting starting points, she does not provide a comprehensive framework (neither
in theory nor in practice) that can be used for the evaluation of different discovery
approaches, and she furthermore ignores related approaches (like the SWS Challenge
or the S3 Matchmaker Contest) completely.
        </p>
        <p>Moreover, we have looked into the evaluation results of various SWS research
projects (see for instance [13-15]). Many have spent a surprisingly small share
of resources on evaluation. For example RW2, an Austrian funded research
project17, has implemented different discovery engines for request and service
descriptions in different logical languages and at different levels of granularity.
However, the only evaluation is a relatively small set of a couple of dozen handcrafted
services. The EU projects DIP and ASG have also developed similar
discovery engines. With respect to evaluation they quote industrial case studies;
however, in essence those are also just a small set of service descriptions.
Moreover, due to intellectual property rights restrictions the situation is even slightly
worse, since not all descriptions are publicly available and a comparative
evaluation is thus impossible.
15 http://hnsp.inf-bb.uni-jena.de/DIANE/benchmark/
16 http://www.ws-challenge.org/
17 http://rw2.deri.at/
                                             S3 Contest  SWS Challenge  DIANE
Scope of evaluation
  Runtime performance                            +             -           -
  Framework tool support and usability           -             -           o
  Expressivity of formalism and matchmaking      -             +           +
  Supported level of decoupling                  -             -           +
Quality of evaluation
  Neutral to formalism                           o             +           +
  Independent from solution                      -             +           o
  Realistic and complex use cases                -             +           o
  Large test set                                 +             -           o
Table 1. Preliminary comparison of complementary strengths of the existing efforts.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Conclusions</title>
        <p>Our survey has shown that - despite the amount of attention that SWS
discovery and matchmaking research receives - surprisingly little effort is devoted to
experimental and comparative evaluation of the various approaches. We found
only three approaches that intensively deal with SWS discovery evaluation in
particular. Table 1 shows a schematic and simplified comparison of the
different strengths of these approaches, which is only meant to give a very high level
summary of the extensive treatment above. All approaches can only be seen as
first initiatives in the right direction. The SWS Challenge is currently the
best established initiative in the field, but the S3 Matchmaker Contest and the
DIANE evaluation complement it in important aspects. In particular, the
notions of decoupling the creation of offer and request descriptions, of involving
inexperienced users in devising descriptions, and all aspects related to runtime
performance comparisons are not covered by the challenge so far.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Related Work</title>
      <p>
        The existing approaches to evaluate SWS discovery have been covered
extensively above. However, these approaches have not provided a critical review of
the evaluation process itself so far. In contrast, the evaluation of evaluation in
traditional information retrieval has been the subject of a number of studies, e.g. by
Saracevic [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] or Voorhees [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], but as we argue in Section 2, the results cannot be
directly applied to the domain of SWS discovery. Similar meta-evaluations have
not been done in the domain of SWS discovery so far, except for the work by
Tsetsos et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], which is the one most closely related to ours. We share the authors'
opinion that there is a lack of established evaluation metrics, methodologies and
service test collections, and agree with them that further analysis is needed to
understand whether and how well-known metrics like precision and recall can
be applied to service discovery. Tsetsos et al., however, focus on the weaknesses
of coarse-grained binary relevance judgements and suggest using multi-valued
relevance judgements instead, to exploit the fact that most service
matchmakers support different degrees of match instead of a binary match/fail decision.
In contrast, we provided an in-depth discussion of why the Cranfield paradigm
does not apply well to SWS discovery evaluation and presented a comprehensive
survey and discussion of current service discovery evaluation efforts.
      </p>
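<p>One way multi-valued judgements could feed into generalized precision and recall is to map each degree of match to a numeric gain. The degree names and gain values in the following sketch are illustrative assumptions, not taken from Tsetsos et al.:</p>

```python
# Illustrative gains for graded relevance judgements (assumed values).
GAIN = {"exact": 1.0, "plugin": 0.75, "subsumes": 0.5, "fail": 0.0}

def generalized_pr(retrieved, judgements):
    """Generalized precision/recall over graded relevance judgements.

    retrieved:  list of service ids returned by the matchmaker
    judgements: dict mapping every judged service id to a degree of match
    """
    gained = sum(GAIN[judgements.get(s, "fail")] for s in retrieved)
    total = sum(GAIN[g] for g in judgements.values())
    precision = gained / len(retrieved) if retrieved else 0.0
    recall = gained / total if total else 0.0
    return precision, recall

# Hypothetical example: two services retrieved, one an exact match.
judgements = {"a": "exact", "b": "plugin", "c": "fail", "d": "subsumes"}
p, r = generalized_pr(["a", "c"], judgements)
print(p, r)  # precision 0.5, recall 1.0 / 2.25
```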
    </sec>
    <sec id="sec-6">
      <title>Directions for Future Work</title>
      <sec id="sec-6-1">
        <title>Making Existing Evaluations More Transparent</title>
        <p>We found that the existing evaluations generally lack a formal underpinning and
fail to clearly discuss the intention behind their design. This makes an objective
comparison difficult. As a result of our analysis in Section 3 we derive a
preliminary set of questions which should be answered by any evaluation approach.
Assumptions of the test bed or evaluation. Every evaluation should explicitly
state these in order to make its results comparable:
- What are the assumptions on the formalisms / logical language used?
- What is the scope for discovery? E.g. is discovery only concerned with static
descriptions or does it also involve dynamic communication?
- What is the expected outcome of the discovery? Are ranked results expected
or a boolean match/non-match decision? If measures similar to recall or
precision are used, are they defined in a meaningful way?
Dimensions measured. Evaluations should clearly indicate and motivate their
scope, e.g.:
- Runtime performance, such as response time, throughput, etc.
- Scalability in terms of the number of services.
- Complexity of the descriptions required.
- Level of guarantees provided by a match. Does it assume manual post
processing or support complete automation such that services can be directly
executed as a result of the discovery process?
- Standard compliance and reusability. E.g. can existing ontologies be reused?</p>
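<p>The questions above could, for instance, be captured as a structured descriptor that test-bed designers fill in explicitly; all field names below are hypothetical, merely mirroring the checklist:</p>

```python
from dataclasses import dataclass, field

@dataclass
class TestBedProfile:
    """Explicit declaration of a discovery test bed's assumptions and scope.

    Field names are illustrative; they mirror the questions in the text.
    """
    formalism_assumptions: str            # logical language(s) presupposed
    discovery_scope: str                  # static descriptions vs. dynamic communication
    expected_outcome: str                 # "ranked" or "boolean"
    measured_dimensions: list[str] = field(default_factory=list)

# Example filled in for a challenge-style, formalism-neutral test bed.
profile = TestBedProfile(
    formalism_assumptions="none (formalism-neutral, natural-language scenarios)",
    discovery_scope="static descriptions plus message-level invocation",
    expected_outcome="boolean",
    measured_dimensions=["functional coverage"],
)
print(profile.expected_outcome)
```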
        <p>Probably because of the complexity of the matter, the existing approaches
only address some of the points above. We believe that by providing this initial list
of criteria we work towards making evaluations more transparent. This catalog
might help to classify test beds and make it easier to find a suitable candidate for
a planned evaluation. In addition, by answering the questions above explicitly,
the designer of a test set will increase its actual value. This is particularly
true since it helps to obtain a more objective result, given that current test sets
are mainly created after a particular solution has been developed and might be
biased towards that particular solution.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Towards a Standard Evaluation Methodology and Test Bed</title>
        <p>None of the current evaluation approaches provides a comprehensive discussion
of the theoretical foundation of SWS discovery evaluation even though such a
discussion is necessary to justify the design decisions made by any evaluation
approach and ultimately to agree upon a standard way of evaluation.</p>
        <p>This paper provides the first comprehensive summary of the current state
of the art in this field. As such it hopefully serves as an important first step
towards a standard evaluation methodology and test bed for semantic service
discovery. Only such an agreed-upon standard will allow approaches and results
to be compared effectively and objectively, thereby promoting the advancement
of the whole field as such. On the way to this standard, we identify the following
rough roadmap for future work.
1. The set of possible dimensions of evaluation has to be clearly identified and
motivated (what to evaluate).
2. For each of these dimensions suitable means of measurement have to be
designed and evaluated (which criteria to use and how to measure them).
3. The general requirements on the evaluation process itself have to be identified
(how to achieve validity, reliability and efficiency).
4. According to these requirements a common semantic service discovery test
bed needs to be established, which ultimately allows existing solutions to be
effectively evaluated and compared with regard to all the dimensions in a unified
way. This will clearly be a continuous effort.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Summary</title>
      <p>We examined the state of the art in the evaluation of SWS discovery. We discussed
the general applicability of the Cranfield paradigm predominantly used for
evaluation in IR and argued that this well-understood paradigm does not map directly
to the domain at hand. We continued by presenting an exhaustive survey of the
current evaluation approaches in SWS discovery and found that the few existing
approaches use very different settings and methodologies, highlighting different
aspects of SWS discovery evaluation. A thorough discussion of the effects of the
decisions in the design of an evaluation on the results of that evaluation is
missing so far. We hope that this paper serves as a starting point towards a more
systematic approach to SWS discovery evaluation and have provided suggestions for
future work in this direction.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>McIlraith</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Son</surname>
            ,
            <given-names>T.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zeng</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Semantic web services</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          <volume>16</volume>
          (
          <year>2001</year>
          )
          <fpage>46</fpage>
          -
          <lpage>53</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Tichy</surname>
            ,
            <given-names>W.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lukowicz</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prechelt</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heinz</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          :
          <article-title>Experimental evaluation in computer science: a quantitative study</article-title>
          .
          <source>Journal of Systems and Software</source>
          <volume>28</volume>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Tichy</surname>
            ,
            <given-names>W.F.</given-names>
          </string-name>
          :
          <article-title>Should computer scientists experiment more?</article-title>
          <source>IEEE Computer</source>
          <volume>31</volume>
          (
          <year>1998</year>
          )
          <fpage>32</fpage>
          -
          <lpage>40</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Harman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Overview of the first Text REtrieval Conference (TREC-1)</article-title>
          .
          <source>In: Proceedings of TREC-1</source>
          , Gaithersburg, Maryland, USA (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E.M.</given-names>
          </string-name>
          :
          <article-title>The philosophy of information retrieval evaluation</article-title>
          .
          <source>In: Evaluation of Cross-Language Information Retrieval Systems</source>
          , Second Workshop of the Cross-Language Evaluation Forum (CLEF
          <year>2001</year>
          ), Darmstadt, Germany (
          <year>2001</year>
          )
          <fpage>355</fpage>
          -
          <lpage>370</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Khalid</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fries</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kapahnke</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>OWLS-TC - OWL-S service retrieval test collection version 2.1 user manual</article-title>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Klusch</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fries</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sycara</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Automated semantic web service discovery with OWLS-MX</article-title>
          .
          <source>In: Proceedings of the 5th Intern. Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS</source>
          <year>2006</year>
          ), Hakodate, Japan (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Petrie</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Margaria</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Küster</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lausen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaremba</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>SWS Challenge: status, perspectives and lessons learned so far</article-title>
          .
          <source>In: Proceedings of the 9th International Conference on Enterprise Information Systems (ICEIS2007)</source>
          ,
          <article-title>Special Session on Comparative Evaluation of Semantic Web Service Frameworks</article-title>
          , Funchal, Madeira-Portugal (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Entwicklung einer Evaluationsmethodik für Semantic Web Services und Anwendung auf die DIANE Service Descriptions (in German)</article-title>
          .
          <source>Master's thesis</source>
          , IPD, University Karlsruhe (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Blake</surname>
            ,
            <given-names>M.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheung</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaeger</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wombacher</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>WSC-06: the web service challenge</article-title>
          .
          <source>In: Proceedings of the Eighth IEEE International Conference on E-Commerce Technology (CEC 2006) and Third IEEE International Conference on Enterprise Computing, E-Commerce and E-Services (EEE 2006)</source>
          , Palo Alto, California, USA (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Toma</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iqbal</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fensel</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sapkota</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moran</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          :
          <article-title>Discovery in grid and web services environments: A survey and evaluation</article-title>
          .
          <source>International Journal on Multiagent and Grid Systems</source>
          <volume>3</volume>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Åberg</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>An Evaluation Platform for Semantic Web Technology</article-title>
          .
          <source>PhD thesis</source>
          , Department of Computer and Information Science, Linköpings Universitet, Sweden (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Sîrbu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toma</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>A logic based approach for service discovery with composition support</article-title>
          .
          <source>In: Proceedings of the ECOWS06 Workshop on Emerging Web Services Technology</source>
          , Zürich, Switzerland (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Sîrbu, A.:
          <article-title>DIP deliverable D4.14: discovery module prototype</article-title>
          (
          <year>2006</year>
          )
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Anonymous</surname>
          </string-name>
          :
          <article-title>RW2 project deliverable D2.3: prototype implementation of the discovery component</article-title>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Saracevic</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Evaluation of evaluation in information retrieval</article-title>
          .
          <source>In: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR95)</source>
          , Seattle, Washington, USA (
          <year>1995</year>
          )
          <fpage>138</fpage>
          -
          <lpage>146</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Tsetsos</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anagnostopoulos</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hadjiefthymiades</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>On the evaluation of semantic web service matchmaking systems</article-title>
          .
          <source>In: Proceedings of the 4th IEEE European Conference on Web Services (ECOWS2006)</source>
          , Zürich, Switzerland (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>