Evaluating Ad-Hoc Object Retrieval

Harry Halpin1, Daniel M. Herzig2, Peter Mika3, Roi Blanco3, Jeffrey Pound4, Henry S. Thompson1, and Thanh Tran Duc2

1 University of Edinburgh, UK
2 Karlsruhe Institute of Technology, Germany
3 Yahoo! Research, Spain
4 University of Waterloo, Canada

H.Halpin@ed.ac.uk, herzig@kit.edu, pmika@yahoo-inc.com, roi@yahoo-inc.com, jpound@cs.uwaterloo.ca, ht@inf.ed.ac.uk, ducthanh.tran@kit.edu

Proceedings of the International Workshop on Evaluation of Semantic Technologies (IWEST 2010). Shanghai, China. November 8, 2010.

Abstract. In contrast to traditional search, semantic search aims at the retrieval of information from factual assertions about real-world objects rather than searching over web pages with textual descriptions. One of the key tasks to address in this context is ad-hoc object retrieval, i.e. the retrieval of objects in response to user-formulated keyword queries. Despite the significant commercial interest, this kind of semantic search has not been evaluated in a thorough and systematic manner. In this work, we discuss the first evaluation campaign that specifically targets the task of ad-hoc object retrieval. We also discuss the submitted systems, the factors that contributed to positive results, and the potential for future improvements in semantic search.

1 Introduction

Advances in information retrieval have long been driven by evaluation campaigns, and the use of official TREC evaluations and the associated queries and data-sets is ubiquitous in measuring improvements in the effectiveness of IR systems. We believe the rigor of evaluation in semantic search should be no different. Yet no such evaluation campaign exists for semantic search, and so semantic search is usually evaluated on very small and artificial data-sets, using a diverse set of evaluation methods. As a first step towards a common evaluation methodology, Pound et al. [10] defined the task of ad-hoc object retrieval, where 'semantic search' is considered to be the retrieval of objects represented as Semantic Web data, using keyword queries for retrieval. While 'semantic search' is a broader notion that includes many other tasks, object retrieval has always been considered an integral part of it. In this paper, we describe the way we created a standard data-set and queries for this task and the results of a public evaluation campaign we organized. Since the evaluated systems also run the gamut of approaches used in semantic search, the evaluation results presented in this paper give an overview of the state of the art in this growing field of interest.

In the following, we first give an overview of the often ambiguous term 'semantic search' (Section 2) before delving into the particular evaluation methodology used to evaluate search over Semantic Web data, including the creation of a standard data-set and queries (Section 3) and a brief introduction to our use of crowd-sourcing (Section 3.3). We then discuss the submitted systems along with the results of the evaluation (Section 4).
2 An Overview of Semantic Search

The term 'semantic search' is highly contested, primarily because of the perpetual and endemic ambiguity around the term 'semantics.' While 'search' is understood to be some form of information retrieval, 'semantics' typically refers to the interpretation of some syntactic structure into another structure, the 'semantic' structure, that defines in more detail the meaning that is implicit in the surface syntax (or even the 'real world' that the syntax describes). Semantics can be given to various parts of the information retrieval model, including the representations of the queries and the documents. This semantics can then be used to process queries against documents, as well as to support users during query construction and the presentation of results.

One main problem encountered by semantic search has been the general lack of a standard for capturing the semantics. Knowledge representation formalisms vary widely, and up until the advent of the Semantic Web there were no common standards for capturing the semantics of semantic search. The primary standard underlying the Semantic Web is RDF (Resource Description Framework), a flexible model that can capture graph-based semantics such as semantic networks, but also semi-structured data as used in databases. Semantic Web data represented in RDF is composed of subject-predicate-object triples, where the subject is an identifier for a resource (e.g. a real-world object), the predicate is an identifier for a relationship, and the object is either an identifier of another resource or some information given as a concrete value (e.g. a string or data-typed value).

While more complex kinds of semantic search attempt to induce some sort of Semantic Web-compatible semantics from documents and queries directly, we focus instead on search methods that apply information retrieval techniques directly to Semantic Web data. We are motivated by the growing amount of data that is available directly in RDF format thanks to the worldwide Linked Data movement (http://linkeddata.org), which has created a rapidly expanding data space of interlinked public data sets. There are already a number of semantic search systems that crawl and index Semantic Web data, such as [2][5][9], and there is active research into algorithms for ranking in this setting. Despite the growing interest, it was concluded in plenary discussions at the Semantic Search 2009 workshop (http://km.aifb.kit.edu/ws/semsearch09/) that the lack of standardized evaluation has become a serious bottleneck to further progress in this field. In response to this conclusion, we organized the public evaluation campaign that we describe in the following.

3 Evaluation Methodology

Arriving at a common evaluation methodology requires the definition of a shared task that is accepted by the community as the one that is most relevant to potential applications of the field. The definition of the task is also a precondition for establishing a set of procedures and metrics for assessing performance on the task, with the eventual purpose of ranking systems [1]. For the field of text retrieval, this task is the retrieval of a ranked list of (text) documents from a fixed corpus in response to free-form keyword queries, or what is known as the ad-hoc document retrieval task. In ad-hoc object retrieval the goal is instead to retrieve a ranked list of objects from a collection of RDF documents in response to free-form keyword queries.
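To make the task concrete, the following sketch shows an invented slice of RDF data together with one of the example queries discussed in Section 3.2; the URIs, the triples, and the expected ranking are purely illustrative and do not come from the evaluation data.

```python
# Illustrative only: a tiny, made-up slice of RDF data and a keyword query,
# showing what ad-hoc object retrieval asks a system to do.
# (subject, predicate, object) triples; all URIs and values are invented.
triples = [
    ("ex:Parcel104",  "rdf:type",     "ex:Restaurant"),
    ("ex:Parcel104",  "rdfs:label",   "Parcel 104"),
    ("ex:Parcel104",  "ex:locatedIn", "ex:SantaClara"),
    ("ex:SantaClara", "rdfs:label",   "Santa Clara"),
    ("ex:SantaClara", "rdf:type",     "ex:City"),
]

query = "parcel 104 santa clara"

# The unit of retrieval is the object (subject URI), not a document:
# a good system returns ex:Parcel104 at the top, with ex:SantaClara
# at best a lower-ranked, partially relevant result.
expected_ranking = ["ex:Parcel104", "ex:SantaClara"]
```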
The unit of retrieval is thus individual objects (resources in RDF parlance) and not RDF documents (an RDF graph connects a number of resources through typed relations). Although some search engines do retrieve RDF documents (and thus provide a coarser granularity), object retrieval is the task with the highest potential impact for applications. Pound et al. [10] also proposed an evaluation protocol and tested a number of metrics for their stability and discriminating power. In our current work, we instantiate this methodology by creating a standard set of queries and data on which we execute it.

3.1 Data Collection

Current semantic search engines have vastly different indices, with some specializing on only single data-sources with thousands of triples and others ranging over billions of triples crawled from the Web. Therefore, in order to have a generalized evaluation of the ranking of results, it is essential to normalize the index. We required a data-set that would not bias the results towards any particular semantic search engine.

The data-set that we wanted to use in the evaluation campaign needed to contain real data, sizable enough to contain relevant information for the queries, yet not so large that its indexing would require computational resources outside the scope of most research groups. We have chosen the 'Billion Triples Challenge' 2009 data set, a data-set created for the Semantic Web Challenge (http://challenge.semanticweb.org) in 2009 and well known in the community. The raw size of the data is 247GB uncompressed and it contains 1.4B triples describing 114 million objects. This data-set was composed by combining crawls of multiple semantic search engines. Therefore, it does not necessarily match the coverage of any particular search engine. Also, it is only a fragment of the data that can be found on the Semantic Web today, but one that is representative and still manageable by individual research groups. We refer readers to http://vmlion25.deri.ie/ for more information on the dataset.

The only modification we made was to replace local, document-bound resource identifiers ('blank nodes' in RDF parlance) with auto-generated URIs, i.e., globally unique resource identifiers. This operation does not change the semantics of the data, but it is necessary because resources are the unit of retrieval. With the announcement of the evaluation campaign, this modified 'Billion Triples Data-set' was released for download and indexing by participants (http://km.aifb.kit.edu/ws/semsearch10/#eva).

3.2 Real-world Web Queries

As the kinds of queries used by semantic search engines vary dramatically (ranging from structured SPARQL queries to searching directly for URI-based identifiers), it was decided to focus first on keyword-based search. Keyword-based search is the most commonly used query paradigm, and is supported by most semantic search engines. Clearly, the type of result expected, and thus the way to assess relevance, depends on the type of the query. For example, a query such as plumbers in mason ohio is looking for instances of a class of objects, while a query like parcel 104 santa clara is looking for information about one particular object, in this case a certain restaurant. Pound et al. [10] proposed a classification of queries by expected result type, and for our first evaluation we have decided to focus on object-queries, i.e. queries like the latter example, where the user is seeking information on a particular object.
Note that for this type of query there might be other objects mentioned in the query besides the main object, such as santa clara in the above case. However, it is clear that the focus of the query is the restaurant named parcel 104, and not the city of Santa Clara as a whole.

We were looking for a set of object-queries that would be unbiased towards any existing semantic search engine. First, although the search engine logs of various semantic search engines were gathered, it was determined that the kinds of queries varied quite a lot, with many of the query logs of semantic search engines revealing idiosyncratic research tests by robots rather than real-world queries by actual users. Since one of the claims of semantic search is that it can help general-purpose ad-hoc information retrieval on the Semantic Web, we decided to use queries from actual users of hypertext Web search engines. As these queries would come from hypertext Web search engines, they would not be biased towards any semantic search engine. We had some initial concerns about whether, within the scope of the data-set, it would be possible to provide relevant results for each of the queries. However, this possible weakness also doubled as a strength, as testing a real query sample from actual users would determine whether or not a billion triples from the Semantic Web could realistically help answer the information needs of actual users, as opposed to only researchers [4].

In order to support our evaluation, Yahoo! released a new query set as part of their WebScope program (http://webscope.sandbox.yahoo.com/), called the Yahoo! Search Query Log Tiny Sample v1.0, which contains 4,500 queries sampled from the company's United States query log from January, 2009. One limitation of this data-set is that it contains only queries that have been posed by at least three different (not necessarily authenticated) users, which removes some of the heterogeneity of the log, for example in terms of spelling mistakes. While realistic, this is a query set we considered hard to solve. Given the well-known differences between the top of the power-law distribution of queries and the long tail, we used an additional log of queries from Microsoft Live Search containing queries that were repeated by at least 10 different users (this query log was used with permission from Microsoft Research and as the result of a Microsoft 'Beyond Search' award). We expected these queries to be easier to answer.

We selected a sample of 42 entity-queries from the Yahoo! query log by classifying queries manually as described in Pound et al. [10]. We selected a sample of 50 queries from the Microsoft log; in this case we pre-filtered queries automatically with the Edinburgh MUC named entity recognizer [8], a gazetteer and rule-based named-entity recognizer that has been shown to have very high precision in competitions. Both sets were combined into a single, alphabetically ordered list, so that participants were not aware which queries belonged to which set, or in fact that there were two sets of queries. We distributed the final set of 92 queries to the participants two weeks before the submission deadline.
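The query-selection pipeline can be summarized in a few lines. The sketch below is illustrative only: spaCy stands in for the Edinburgh MUC recognizer actually used, the toy query lists are invented, and the manual classification and sampling steps are reduced to a crude automatic filter.

```python
# Sketch of the query-selection step described above; not the campaign's
# actual procedure (Yahoo! queries were classified manually, Microsoft
# queries were pre-filtered automatically and then sampled).
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def looks_like_object_query(query: str) -> bool:
    """Crude automatic pre-filter: keep queries that contain a named entity."""
    return len(nlp(query).ents) > 0

yahoo_object_queries = ["parcel 104 santa clara", "ben franklin"]        # toy, manually chosen
microsoft_candidates = ["university of york", "plumbers in mason ohio"]  # toy raw log sample
microsoft_object_queries = [q for q in microsoft_candidates
                            if looks_like_object_query(q)]

# Merge and order alphabetically so participants cannot tell the two sets apart.
final_query_set = sorted(set(yahoo_object_queries + microsoft_object_queries))
for qid, query in enumerate(final_query_set, start=1):
    print(f"{qid}\t{query}")
```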
3.3 Crowd-sourcing Relevance Judgments

We crowd-sourced the relevance judgments using Amazon's Mechanical Turk. A deep analysis of the reliability and repeatability of the evaluation campaign is left for future work. For the purpose of evaluation, we created a simple rendering algorithm to present the results in a concise, yet human-readable manner without ontology-dependent customizations. In order to achieve good throughput from the judges, each HIT consisted of 12 query-result pairs for relevance judgment. Of the 12 results, 10 were real results drawn from the participants' submissions, and 2 were gold-standard results randomly placed among the results. These gold-standard results came from queries distinct from those used by the participants and were manually judged to be either definitely 'relevant' or definitely 'irrelevant'. Each HIT included both a gold-standard relevant and a gold-standard irrelevant result.

In total, 65 Turkers participated in judging 579 HITs on a three-point scale (Excellent: describes the query target specifically and exclusively; Not bad: mostly about the target; Poor: not about the target, or mentions it only in passing), covering 5786 submitted results and 1158 gold-standard checks. Two minutes were allotted for completing each HIT. The average agreement and its standard deviation, computed with Fleiss's κ, are 0.44±0.22 for the two-point scale and 0.36±0.18 for the three-point scale. There is thus no marked difference between a three-point scale and a binary scale, meaning that it was feasible to judge this task on a three-point scale. Comparing the number of times a score appeared with the number of times it was agreed on, 1s (irrelevant results) were not only the most numerous, but also the easiest to agree on (69%), followed by 3s (perfect results, 52%) and trailed by 2s (10%). This was expected given the inherent fuzziness of the middle score.

To see how this agreement compares to the more traditional setting of using expert judges, we re-judged 30 HITs ourselves. We again used three judges per HIT, but this time with all judges assessing all HITs. In this case, the average and standard deviation of Fleiss's κ for the two- and three-point scales are 0.57±0.18 and 0.56±0.16, respectively. The level of agreement is thus somewhat higher for expert judges, with comparable deviation. For expert judges, there is practically no difference between the two- and three-point scales, meaning that expert judges had much less trouble using the middle judgment. The entire competition was judged within 2 days, for a total cost of $347.16. We consider this both fast and cost-effective.
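For reference, a minimal sketch of the Fleiss' κ computation behind the agreement figures above is shown below; the judgment counts in the example are invented, and the actual campaign scripts are not reproduced here.

```python
# Minimal Fleiss' kappa: one row per judged query-result pair, one column per
# relevance grade, cell = number of judges who chose that grade.
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts.sum(axis=1)[0]                   # assumes equal raters per item
    p_j = counts.sum(axis=0) / (n_items * n_raters)    # proportion of each grade overall
    P_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# Three judges, three-point scale (poor / not bad / excellent); toy counts only.
toy = np.array([[3, 0, 0],
                [1, 2, 0],
                [0, 1, 2],
                [2, 1, 0]])
print(round(fleiss_kappa(toy), 3))
```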
4 Evaluation Results

4.1 Overview of Evaluated Systems

For the evaluation campaign, each semantic search engine was allowed to produce up to three different submissions ('runs'), so that participants could try different parameters or features. A submission consisted of an ordered list of URIs for each query. In total, we received 14 different runs from six different semantic search engines. The six participants were DERI (Digital Enterprise Research Institute), University of Delaware (Delaware), Karlsruhe Institute of Technology (KIT), University of Massachusetts (UMass), L3S, and Yahoo! Research Barcelona (Yahoo! BCN). All systems used inverted indexes for managing the data. The differences between the systems can be characterized by two major aspects: (1) the internal model used for representing objects and (2) the kind of retrieval model applied for matching and ranking. We will now first discuss these two aspects and then discuss the specific characteristics of the participating systems and their differences.

For object representation, RDF triples having the same URI as subject are grouped together, and that URI is used as the object identifier. Only the DERI and L3S systems deviate from this representation, as described below. More specifically, the object description comprises attribute and relation triples as well as provenance information. While attributes are associated with literal values, relation triples establish a connection between one object and another. Both the attributes and the literal values associated with them are incorporated and stored in the index. The objects of relation triples are in fact identifiers; unlike literal values, they are not directly used for matching, but this additional information has been considered valuable for ranking. Provenance is a general notion that can include different kinds of information. For the problem of object retrieval, the participating systems used two different types of provenance. On the one hand, RDF triples in the provided data-set are associated with an additional context value. This value is an identifier that captures the origin of the triple, e.g. where it was crawled from. This provenance information is called here the 'context'. On the other hand, the URI of every RDF resource is a long string, from which the domain can be extracted. This kind of provenance information is called the 'domain'. Clearly, the domain is different from the context, because URIs with the same domain can be used in different contexts. Systems can be distinguished along this dimension, i.e., which specific aspects of the object they took into account.

The retrieval model, i.e. matching and ranking, is clearly related to the aspect of object representation. From the descriptions of the systems, we can derive three main types of approaches. (1) The purely 'text-based' approach relies on a 'bag-of-words' representation of objects and applies ranking based on TF/IDF, Okapi, or language models [6]. This type of approach is centered around the use of terms and, in particular, weights of terms derived from statistics computed over the text corpus. (2) Approaches that weight properties separately use models like BM25F to capture the structure of documents (and, in this case, objects) as a list of fields, or alternatively use mixture language models, which weight certain aspects of an object differently. Since these approaches do not consider objects as flat, as the text-based ones do, but actually decompose them according to their structure, we call them 'structure-based'. (3) While in the previous type the structure information is used for ranking results for a specific query, there are also approaches that leverage the structure to derive query-independent scores, e.g. using PageRank. We refer to them as 'query-independent structure-based' (Q-I-structure-based) approaches. To be more precise, the three types discussed here capture different aspects of a retrieval model, and a concrete approach in fact uses a combination of these aspects.

Based on the distinction introduced above, Table 1 gives an overview of the systems and their characteristics using the identifiers provided in the original run submissions. A brief description of each system is given below, and detailed descriptions are available at http://km.aifb.kit.edu/ws/semsearch10/#eva.
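To ground this taxonomy, the sketch below illustrates the common 'virtual document' construction (grouping triples by subject) combined with a simple field-weighted, text-based score. The field names, weights, and scoring formula are invented for illustration and do not reproduce any participant's system.

```python
# Illustrative sketch: build a fielded virtual document per subject URI and
# score a keyword query with a crude weighted term-frequency model.
from collections import defaultdict

def build_virtual_documents(triples):
    """triples: iterable of (subject, predicate, obj, is_literal) tuples."""
    docs = defaultdict(lambda: defaultdict(list))
    for s, p, o, is_literal in triples:
        if is_literal:                      # literal values become field text
            docs[s][p].append(o.lower())
        # relation targets / context could be kept separately for ranking
    return docs

FIELD_WEIGHTS = {"rdfs:label": 3.0, "dc:title": 2.0}   # invented; default weight 1.0

def score(doc_fields, query):
    terms = query.lower().split()
    total = 0.0
    for field, values in doc_fields.items():
        tokens = " ".join(values).split()
        weight = FIELD_WEIGHTS.get(field, 1.0)
        total += weight * sum(tokens.count(t) for t in terms)
    return total

def rank(docs, query, k=10):
    scored = [(score(fields, query), uri) for uri, fields in docs.items()]
    return [uri for s, uri in sorted(scored, reverse=True)[:k] if s > 0]

docs = build_virtual_documents([
    ("ex:Parcel104", "rdfs:label", "Parcel 104", True),
    ("ex:Parcel104", "ex:locatedIn", "ex:SantaClara", False),
])
print(rank(docs, "parcel 104 santa clara"))   # -> ['ex:Parcel104']
```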
Table 1. Feature overview regarding system-internal object representation and retrieval model ('+' marks use of context and '◦' use of domain in the provenance column).

Run                    | Attribute values | Relations | Context/Domain | Text-based | Structure-based | Q-I-structure-based
Delaware sub28-AX      |        +         |     -     |       -        |     +      |        -        |          -
Delaware sub28-Dir     |        +         |     -     |       -        |     +      |        -        |          -
Delaware sub28-Okapi   |        +         |     -     |       -        |     +      |        -        |          -
DERI sub27-dpr         |        +         |     +     |      + ◦       |     +      |        -        |          +
DERI sub27-gpr         |        +         |     +     |      + ◦       |     +      |        -        |          +
DERI sub27-dlc         |        +         |     +     |      + ◦       |     +      |        -        |          +
KIT sub32              |        +         |     -     |       -        |     -      |        +        |          -
L3S sub29              |        -         |     -     |       ◦        |     +      |        -        |          -
UMass sub31-run1       |        +         |     -     |       -        |     +      |        -        |          -
UMass sub31-run2       |        +         |     -     |       -        |     +      |        -        |          -
UMass sub31-run3       |        +         |     -     |       -        |     -      |        +        |          -
Yahoo! BCN sub30-RES.1 |        +         |     -     |       ◦        |     -      |        +        |          +
Yahoo! BCN sub30-RES.2 |        +         |     -     |       ◦        |     -      |        +        |          +
Yahoo! BCN sub30-RES.3 |        +         |     -     |       ◦        |     -      |        +        |          +

Delaware. Object representation: The system from Delaware took all triples having the same subject URI as the description of an object. However, the resulting structure of the object as well as the triple structure were then neglected: terms extracted from the triples are simply put into one 'bag-of-words' and indexed as one document. Retrieval model: Three existing retrieval models were applied for the different runs, namely Okapi for sub28-Okapi, language models with Dirichlet prior smoothing for sub28-Dir, and an axiomatic approach for sub28-AX.

DERI. Object representation: The Sindice system from DERI applied a different notion of object. All triples having the same subject and also the same context constitute one object description. Thus, the same subject appearing in two different contexts might be represented internally as two distinct objects. Further, the system considered relations to other objects, context information, and URI tokens for the representation of objects. Retrieval model: The context information, as well as the relations between objects, are used to compute query-independent PageRank-style scores. Different parameter configurations were tested for each run, resulting in different scores. For processing specific queries, these scores were combined with query-dependent TF/IDF-style scores for matches on predicates, objects and values.

KIT. Object representation: The system by KIT considered the literal values of attributes, and separately those of the rdfs:label attribute, as the entity description. All other triples that can be found in the RDF data for an object were ignored. Retrieval model: The results were ranked based on a score inspired by mixture language models, which combines the ratio of all query terms to the number of term matches on one literal and discounts each term according to its global frequency.

L3S. Object representation: The system by L3S takes a different approach to object representation. Each unique URI, appearing as subject or object in the data set, is seen as an object, and only information captured by this URI is used for representing the object. Namely, based on the observation that some URIs contain useful strings, a URI was split into parts. These parts were taken as a 'bag-of-words' description of the object and indexed as one document. In this way, some provenance information is taken into account, i.e., the domain extracted from the URI. Retrieval model: A TF/IDF-based ranking was used, combined with cosine similarity to compute the degree of matching between the terms of the query and the terms extracted from the object URI.

UMass. Object representation: All triples having the same subject URI were taken as the description of an object. For the first two runs, sub31-run1 and sub31-run2, the values of these triples are just seen as a 'bag-of-words' and no structure information was taken into account.
For the third run, sub31-run3, the object representation was divided into four fields: one field containing all values of the attribute title, one for values of the attribute name, a more specific one for values of the attribute dbpedia:title, and one field containing the values of all attributes. Retrieval model: Existing retrieval models were applied, namely the query likelihood model for sub31-run1 and the Markov random field model for sub31-run2. For sub31-run3, the fields were weighted separately, with specific boosts applied to dbpedia:title, name, and title.

Yahoo! BCN. Object representation: Every URI appearing in the subject position of a triple is regarded as one object and is represented as one virtual document that might have up to 300 fields, one field per attribute. A subset of the attributes was manually classified into one of the three classes important, neutral, and unimportant, and boosts were applied accordingly. The Yahoo! system took the provenance of the URIs into account; however, not the context but the domain of the URI was considered, and similarly to the attributes, domains were classified into three classes. Relations, and structure information that can be derived from them, were not taken into account. Retrieval model: The system created by Yahoo! uses an approach for field-based scoring that is similar to BM25F. Matching terms were weighted using a local, per-property term frequency as well as a global term frequency. A boost was applied based on the number of query terms matched. In addition, a prior was calculated for each domain and multiplied into the final score. The three submitted runs represent different configurations of these parameters.

Only the top 10 results per query were evaluated, and after pooling the results of all the submissions, there was a total of 6,158 unique query-result pairs. Note that this was out of a total of 12,880 potential query-result pairs, showing that pooling was definitely required. Some systems submitted duplicate results for one query; we considered the first occurrence for the evaluation and treated all following occurrences as not relevant. Further, some submissions contained ties, i.e. several results for one query had the same score. Although there exist tie-aware versions of our metrics [7], the trec_eval software (http://trec.nist.gov/trec_eval/) we used to compute the scores cannot handle ties correctly. Therefore we broke the ties by assigning scores to the involved results according to their order of occurrence in the submitted file.
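The de-duplication and tie-breaking step can be sketched as follows; the standard TREC run format is assumed, and the helper below is a minimal illustration rather than the actual campaign script.

```python
# Drop duplicate results (keeping the first occurrence) and replace tied scores
# with strictly decreasing ones that preserve the submitted order, so that
# trec_eval cannot reorder the results.
def clean_run(rows, run_tag="run1"):
    """rows: (qid, docid, score) triples in submitted order for one run."""
    seen, cleaned, next_rank = set(), [], {}
    for qid, docid, _score in rows:
        if (qid, docid) in seen:            # duplicate result for this query
            continue
        seen.add((qid, docid))
        rank = next_rank.get(qid, 0) + 1
        next_rank[qid] = rank
        cleaned.append((qid, "Q0", docid, rank, 1.0 / rank, run_tag))
    return cleaned

toy_run = [("1", "ex:A", 2.0), ("1", "ex:B", 2.0), ("1", "ex:A", 2.0)]
for line in clean_run(toy_run):
    print(" ".join(map(str, line)))
```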
4.2 Evaluation Results

Table 2. Results of the submitted semantic search engines.

Participant    Run           P@10    MAP     NDCG
Yahoo! BCN     sub30-RES.3   0.4924  0.1919  0.3137
UMass          sub31-run3    0.4826  0.1769  0.3073
Yahoo! BCN     sub30-RES.2   0.4185  0.1524  0.2697
UMass          sub31-run2    0.4239  0.1507  0.2695
Yahoo! BCN     sub30-RES.1   0.4163  0.1529  0.2689
Delaware       sub28-Okapi   0.4228  0.1412  0.2591
Delaware       sub28-AX      0.4359  0.1458  0.2549
UMass          sub31-run1    0.3717  0.1228  0.2272
DERI           sub27-dpr     0.3891  0.1088  0.2172
DERI           sub27-dlc     0.3891  0.1088  0.2171
Delaware       sub28-Dir     0.3652  0.1109  0.2140
DERI           sub27-gpr     0.3793  0.1040  0.2106
L3S            sub29         0.2848  0.0854  0.1861
KIT            sub32         0.2641  0.0631  0.1305

The systems were ranked using three standard information retrieval evaluation measures, namely mean average precision (MAP), precision at 10 (P@10), and normalized discounted cumulative gain (NDCG); refer to [6] for a detailed explanation of these metrics. Table 2 shows the evaluation results for the submitted runs. The third run submitted by Yahoo!, together with the third run of the UMass system, gave the best results. The ordering of the systems changes only slightly if we consider MAP instead of NDCG. Precision at 10 is much less stable, as has been observed in previous evaluations.

It was interesting to observe that the top two runs achieve similar levels of performance while retrieving very different sets of results. The overlap between these two runs as measured by Kendall's τ is only 0.11. Looking at the results in detail, we see that sub31-run3 has a strong prior on returning results from a single domain, dbpedia.org, with 93.8% of all results coming from this domain. DBpedia, which is an extraction of the structured data contained in Wikipedia, is a broad-coverage dataset with high-quality results, and thus the authors decided to bias the ranking toward results from this domain. The competing run sub30-RES.3 returns only 40.6% of results from this domain, which explains the low overlap. Considering all pairs of systems, the values of τ range from 0.018 to 0.995. Figure 1 visualizes the resulting matrix of dissimilarities (1−τ) using non-metric multi-dimensional scaling (MDS). The visualization manages to arrange similar runs within close proximity. Comparing Figure 1 with Table 2, we can observe that, at least in this low-dimensional projection, there is no obvious correlation between any of the dimensions and the systems' performance, except for the clear outlier sub32, which is distant from all other systems in the second (y) dimension and performs poorly. In general, this image also suggests that similar performance can be achieved by quite dissimilar runs.

Fig. 1. Visualizing the distances between systems using MDS.

Table 3 shows the per-query performance (average NDCG) for queries from the Microsoft and Yahoo! data-sets, respectively, with error bars marking the first and third quartiles. It is noticeable that the Yahoo! set is indeed more difficult for the search engines to process, with larger variations of NDCG both across queries and across systems. The performance on queries from the Microsoft log, which are more frequent queries, shows less variation among queries and between systems processing the same queries. This confirms that popular queries are not only easier, but also more alike in difficulty.

Table 3. Average NDCG for queries from the Microsoft data-set (left) and Yahoo! data-set (right).
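For reference, a minimal NDCG@k computation over three-point judgments is sketched below; the gain mapping (0/1/2 for poor/not bad/excellent) and the toy ranking are illustrative and not tied to trec_eval's exact configuration.

```python
# Minimal NDCG@k with linear gains over graded relevance judgments.
import math

def dcg(grades):
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg_at_k(ranked_grades, k=10):
    ideal = sorted(ranked_grades, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked_grades[:k]) / denom if denom > 0 else 0.0

# Grades of the results one run returned for a single query, in rank order.
print(round(ndcg_at_k([2, 0, 1, 0, 2]), 4))
```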
4.3 Discussion

The systems submitted to the evaluation represent an array of approaches to semantic search, as shown by the diversity of results. Most participants started with well-known baselines from Information Retrieval. When applied to object retrieval on RDF graphs, these techniques yield workable results almost out-of-the-box, although a differential weighting of properties has been key to achieving top results (see the runs from Yahoo! BCN and UMass). Besides assigning different weights to properties, the use of 'semantics', or the meaning of the data, has been limited. All the participating systems focused on indexing only the subjects of the triples by creating virtual documents for each subject, which is understandable given the task. However, we would consider relations between objects to be one of the strong characteristics of the RDF data model, and the usefulness of graph-based approaches to ranking will still need to be validated in the future. Note that in the context of RDF, graph-based ranking can be applied both to the graph of objects and to the graph of information sources. Similarly, we found that keyword queries were taken as such; despite our expectations, they were not interpreted or enhanced with any kind of annotations or structures. The possibility of query interpretation using background knowledge (such as ontologies and large knowledge bases) or the data itself is another characteristic of semantic search that will need to be explored in the future.

The lack of some of these advanced features is explained partly by the short time that was available, and partly by the fact that this was the first evaluation of its kind, and therefore no training data was available for the participants. For next year's evaluation, the participants will have access to the assessments from this year's evaluation. This will make it significantly easier to test and tune new features by comparing to previous results. We will also make the evaluation software available, so that anyone can generate new pools of results, and thus evaluate systems that are very dissimilar to the current set of systems.

5 Conclusions

We have described the methodology and results of the first public evaluation campaign for ad-hoc object retrieval, one of the most basic tasks in semantic search. We have designed our evaluation with the goal of efficiency in mind, and have chosen a crowd-sourcing based approach. A natural next step will be to perform a detailed analysis of the Mechanical Turk-produced ground truth.

Our work could also be extended to new data-sets and new tasks. For example, structured RDF and RDF-compatible data can now be embedded into HTML pages in the form of RDFa and microformats, making such data a natural next step for our evaluation campaign; this kind of data is already used by Facebook, Google, and Yahoo! for improving the user experience. The selection of our queries could also be biased toward queries where current search engines fail to satisfy the information need of the user due to its complexity, toward queries with particular intents, or toward queries with an increased pay-off. We plan to extend the number of public query sets and data-sets in the future, and, given funding, we might open our system for continuous submission while we continue to host yearly evaluation campaigns.

References

1. C. Cleverdon and M. Kean. Factors Determining the Performance of Indexing Systems, 1968.
2. L. Ding, T. Finin, A. Joshi, R. Pan, S. R. Cost, Y. Peng, P. Reddivari, V. Doshi, and J. Sachs. Swoogle: a search and metadata engine for the semantic web. In CIKM, pages 652–659, New York, NY, USA, 2004. ACM Press.
3. S. Elbassuoni, M. Ramanath, R. Schenkel, M. Sydow, and G. Weikum. Language-model-based ranking for queries on RDF-graphs. In CIKM, pages 977–986, New York, NY, USA, 2009. ACM.
4. H. Halpin. A query-driven characterization of linked data. In WWW Workshop on Linked Data on the Web, Madrid, Spain, 2009.
5. A. Harth, J. Umbrich, A. Hogan, and S. Decker. YARS2: A Federated Repository for Querying Graph Structured Data from the Web. The Semantic Web, pages 211–224, 2008.
6. C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
7. F. McSherry and M. Najork. Computing information retrieval performance measures efficiently in the presence of tied scores. In ECIR, Berlin, Heidelberg, April 2008. Springer-Verlag.
8. A. Mikheev, C. Grover, and M. Moens. Description of the LTG System Used for MUC-7. In MUC-7, 1998.
9. E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. IJMSO, 3(1):37–52, 2008.
10. J. Pound, P. Mika, and H. Zaragoza. Ad-hoc Object Ranking in the Web of Data. In WWW, pages 771–780, Raleigh, USA, 2010.